Experimental Analysis and Feedforward Design of Neural Networks

by

Ruiyuan Lin

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Electrical Engineering)

August 2021

Copyright 2021 Ruiyuan Lin

Table of Contents

List of Tables
List of Figures
Abstract

Chapter 1: Introduction
1.1 Significance of the Research
1.1.1 Experimental Analysis on Convolutional Neural Networks
1.1.2 New Interpretation of Multi-Layer Perceptron
1.1.3 Feedforward Multi-Layer Perceptron Design
1.2 Contributions of the Research
1.2.1 CNN Behavior Under Stress
1.2.2 SSL-Guided Convolutional Layer Design
1.2.3 Co-Training on CNNs
1.2.4 Demystifying the SqueezeNet
1.2.5 Interpretation of MLPs for Classification with the Network Design
1.2.6 Constructing MLPs as Piecewise Polynomial Approximators
1.3 Organization of the Thesis
Chapter 2: Experimental Analysis on Convolutional Neural Networks
2.1 Introduction
2.2 Experimental Design
2.2.1 CNN Architectures
2.2.2 Datasets
2.2.2.1 MNIST
2.2.2.2 Fashion-MNIST
2.2.2.3 CIFAR-10
2.3 Methods
2.4 Results
2.4.1 Convolutional Layer 1
2.4.2 Convolutional Layer 2
2.5 Summary
Chapter 3: SSL-Guided Convolutional Layers Design
3.1 Introduction
3.1.1 Objective
3.1.2 Contribution
3.2 Related Work
3.2.1 Information Bottleneck
3.2.2 Model Compression
3.2.3 Successive Subspace Learning
3.3 Saab Guided Convolutional Layer Filter Number Choice
3.4 Experimental Results
3.4.1 MNIST
3.4.2 Fashion-MNIST
3.4.3 CIFAR-10
3.4.4 Summary
3.5 Conclusion
Chapter 4: Insight on Co-Training Based Deep Semi-Supervised Learning
4.1 Introduction
4.2 Related Work
4.2.1 Consistency-Based Methods
4.2.1.1 Π-Model
4.2.1.2 Temporal Ensembling
4.2.1.3 Mean Teacher
4.2.2 Co-Training
4.3 Discussion on Enforcing Network Difference
4.4 Conclusion and Future Work
Chapter 5: Experimental Analysis on Squeeze Networks
5.1 Introduction
5.2 SqueezeNet and Fire Module
5.3 PixelHop++ and Its Relation with SqueezeNet
5.4 Discriminability Study via Cross-Entropy Analysis
5.5 Analysis via Visualization
5.6 Conclusion and Future Work
Chapter 6: Feedforward Design of Multi-Layer Perceptron (MLP)
6.1 Introduction
6.2 From Two-Class LDA to MLP
6.2.1 Two-Class LDA
6.2.2 One-Layer Two-Neuron Perceptron
6.2.3 Need of Multilayer Perceptron
6.2.3.1 Stage 1 (from l_in to l_1)
6.2.3.2 Stage 2 (from l_1 to l_2)
6.2.3.3 Stage 3 (from l_2 to l_out)
6.3 Design of Feedforward MLP (FF-MLP)
6.3.1 Feedforward Design Procedure
6.3.1.1 Stage 1 (from l_in to l_1) - Half-Space Partitioning
6.3.1.2 Stage 2 (from l_1 to l_2) - Subspace Isolation
6.3.1.3 Stage 3 (from l_2 to l_out) - Class-wise Subspace Mergence
6.3.2 Pruning of Partitioning Hyperplanes
6.3.3 Summary of FF-MLP Architecture and Link Weights
6.4 Illustrative Examples
6.5 Observations on BP-MLP Behavior
6.5.1 Effect of Backpropagation (BP)
6.5.2 Effect of Initializations
6.6 Experiments
6.6.1 Classification Accuracy for 2D Samples
6.6.2 Classification Accuracy for Higher-Dimensional Samples
6.6.3 Computational Complexity
6.7 Comments on Related Work
6.7.1 BP-MLP Network Design
6.7.1.1 Architecture Design as an Optimization Problem
6.7.1.2 Constructive Neural Network Learning
6.7.2 Relationship with Work on Interpretation of Neurons as Partitioning Hyperplanes
6.7.3 Relationship with LDA and SVM
6.7.4 Relationship with Interpretable Feedforward CNN
6.8 Conclusion and Future Work
Chapter 7: Bridging Multi-Layer Perceptron (MLP) and Piecewise Low-Order Polynomial Approximators
7.1 Introduction
7.2 Proposed MLP Constructions
7.2.1 Piecewise Constant Approximation
7.2.2 Piecewise Linear Approximation
7.2.3 Piecewise Cubic Approximation
7.2.4 Discussion
7.3 Generalization to Multivariate Vector Functions
7.4 Conclusion and Future Work
Chapter 8: Conclusion and Future Work
8.1 Conclusion
8.2 Future Work
Bibliography

List of Tables

5.1 The average ratio of the activation sum of windows out of the localization map (13x13) in the Fire9 module.
5.2 Dimensions of the 3D tensors used for cross entropy computation in fire modules.
5.3 Cross entropy values at the input and the output of squeeze layers and the output of 1x1 filters, 3x3 filters and their concatenation of the expand layers.
6.1 Comparison of training and testing classification performance between FF-MLP, BP-MLP with FF-MLP initialization and BP-MLP with random initialization. The best (mean) training and testing accuracy are highlighted in bold.
6.2 Training and testing accuracy results of FF-MLP and BP-MLP with random initialization for four higher-dimensional datasets. The best (mean) training and testing accuracy are highlighted in bold.
6.3 Comparison of computation time in seconds of FF-MLP (left) and BP-MLP (right) with 15 and 50 epochs. The mean and standard deviation of computation time in 5 runs are reported for BP-MLP. The shortest (mean) running time is highlighted in bold.
List of Figures

2.1 Accuracy vs. number of filters in Convolutional Layer 1 on MNIST. Results from 3 runs are reported.
2.2 Accuracy vs. number of filters in Convolutional Layer 1 on Fashion-MNIST. Results from 3 runs are reported.
2.3 Accuracy vs. number of filters in Convolutional Layer 1 on CIFAR-10. Results from 3 runs are reported.
2.4 Accuracy vs. number of filters in Convolutional Layer 2 on the MNIST dataset. Results from 3 runs are reported.
2.5 Accuracy vs. number of filters in Convolutional Layer 2 on the Fashion-MNIST dataset. Results from 3 runs are reported.
2.6 Accuracy vs. number of filters in Convolutional Layer 2 on the CIFAR-10 dataset. Results from 3 runs are reported.
3.1 An example to illustrate the necessity of rectification [57].
3.2 The difference of eigenvalue ratios between adjacent components for MNIST/Conv1.
3.3 Eigenvalue ratio for MNIST/Conv1.
3.4 The difference of eigenvalue ratios between adjacent components for MNIST/Conv2.
3.5 Eigenvalue ratio for MNIST/Conv2.
3.6 The difference of eigenvalue ratios between adjacent components for Fashion-MNIST/Conv1.
3.7 Eigenvalue ratio for Fashion-MNIST/Conv1.
3.8 The difference of eigenvalue ratios between adjacent components for Fashion-MNIST/Conv2.
3.9 Eigenvalue ratio for Fashion-MNIST/Conv2.
3.10 The difference of eigenvalue ratios between adjacent components for CIFAR-10/Conv1.
3.11 Eigenvalue ratio for CIFAR-10/Conv1.
3.12 The difference of eigenvalue ratios between adjacent components for CIFAR-10/Conv2.
3.13 Eigenvalue ratio for CIFAR-10/Conv2.
5.1 The system diagram of the SqueezeNet [47].
5.2 Illustration of the squeeze unit and the expand unit of a fire module [47].
5.3 Illustration of the three-level c/w Saab transform in PixelHop++, which provides a sequence of successive subspace approximations to the input image.
5.4 Corresponding bounding boxes (from left to right) in the input, Fire 2, Fire 5 and Fire 9.
5.5 Input images (top) and their visualization results using guided backpropagation (bottom) of expand/1x1 filters (left) and expand/3x3 filters (right) in Fire9.
5.6 Two dog images from ImageNet: Afghan hound (left) and Dandie Dinmont (right).
5.7 Visualization across different layers using the Grad-CAM method. Each row corresponds to one fire module, from Fire2 (top) to Fire9 (bottom). Columns (from left to right) correspond to the squeeze layer, expand/1x1, expand/3x3 and expand/joint layers, respectively.
6.1 Illustration of (a) two Gaussian blobs separated by a line in a 2D plane, and (b) a one-layer two-neuron perceptron system.
6.2 MLP design for Example 1 (XOR): (a) weights between input layer l_in and layer l_1, (b) weights between layer l_1 and layer l_2, where red links represent a weight of 1 and black links represent weights of -P, where P is assumed to be a sufficiently large positive number, (c) weights between l_2 and l_out, where red links represent a weight of 1 and black links represent a weight of 0.
6.3 Proposed MLP design: the MLP network with two intermediate layers l_1 and l_2, where neurons in layer l_1 are drawn in pairs (e.g., blue and light blue nodes) representing two sides of a decision boundary and where each neuron in layer l_2 represents an isolated region of interest.
6.4 Training sample distributions of 2D examples: (a) Example 1: XOR, (b) Example 2: 3-Gaussian-blobs, (c) Example 3: 9-Gaussian-blobs, (d) Example 4: circle-and-ring, (e) Example 5: 2-new-moons and (f) Example 6: 4-new-moons, where different classes are in different colors.
6.5 Response heatmaps for 3 Gaussian blobs in Example 2 in (a) layer l_1 and (b) layer l_2.
6.6 Neuron responses of the 9-Gaussian-blobs in Example 3 in (a) layer l_1 and (b) layer l_2.
6.7 Comparison of classification results of FF-MLP for Example 3 with two different error thresholds: (a) Th=0.3 and (b) Th=0.1.
6.8 Partitioning lines for (a) Example 2 and (b) Example 3.
6.9 Visualization of the 2-new-moons example: (a) the generated random samples from the fitted GMMs with 2 components per class and (b) the classification result of FF-MLP with error threshold equal to 0.1.
6.10 Classification results of FF-MLP for Example 6, where each new moon is approximated by 3 Gaussian components.
6.11 Training and testing accuracy curves of BP-MLP as functions of the epoch number for (a) 3-Gaussian-blobs, (b) 9-Gaussian-blobs (Th=0.3), (c) 4-new-moons, (d) 9-Gaussian-blobs (Th=0.1) and (e) circle-and-ring, where the network is initialized by the proposed FF-MLP.
6.12 Classification results for the circle-and-ring example with (a) FF-MLP, 4 components, (b) FF-MLP, 16 components, (c) BP-MLP with FF-MLP initialization, 4 components, and (d) BP-MLP with FF-MLP initialization, 16 components.
6.13 Comparison of BP-MLP classification results for 9-Gaussian-blobs with (a) FF-MLP initialization and (b) random initialization.
6.14 Comparison of BP-MLP classification results for 4-new-moons with (a) FF-MLP initialization and (b) random initialization.
7.1 Responses of (a) neurons R_{j,1} and R_{j,2} individually (dotted lines) and jointly (solid lines), (b) neurons R_{j,3} and R_{j,4}, (c) all four neurons combined, and (d) the approximation of f(x) with a sequence of weighted triangle functions.
7.2 (a) The unit cubic activation function, (b) responses of R_{j,1} and R_{j,2} individually (dotted lines) and jointly (solid lines), (c) and (d) two designs for the approximation to f(x) with a sequence of weighted third-order bumps.
7.3 Two kernels of different supports centered at x_j, which can be achieved by adding more neurons to the intermediate layer of an MLP.

Abstract

Neural networks have been shown to be effective in many applications. To better explain the behaviors of neural networks, we examine their properties experimentally and analytically in this research.

In the first part, we conduct experiments on convolutional neural networks (CNNs) and observe the behaviors of the networks. We also propose some insights and conjectures for CNNs in this part. First, we demonstrate how the accuracy changes with the size of the convolutional layers. Second, we develop a design to determine the size of the convolutional layers based on SSL. Third, as a case study, we analyze the SqueezeNet, which is able to achieve the same accuracy as AlexNet with 50x fewer parameters, by studying the evolution of cross-entropy values across layers and by visualization. Fourth, we also propose some insights on co-training based deep semi-supervised learning.

In the second part, we propose new angles from which to understand and interpret neural networks. To understand the behavior of multilayer perceptrons (MLPs) as classifiers, we interpret an MLP as a generalization of a two-class LDA system so that it can handle an input composed of multiple Gaussian modalities belonging to multiple classes. An MLP design with two hidden layers that also specifies the filter weights is proposed. To understand the behavior of MLPs as regressors, we construct MLPs as piecewise low-order polynomial approximators using a signal processing approach. The constructed MLP contains one input layer, one intermediate layer and one output layer. Its construction includes the specification of the neuron numbers and all filter weights. Through the construction, a one-to-one correspondence between the approximation of an MLP and that of a piecewise low-order polynomial is established.

Chapter 1: Introduction

1.1 Significance of the Research

Neural networks (NNs) have proven effective in many applications. In particular, CNNs have achieved huge success in many applications in the field of computer vision, including object classification, detection, segmentation and so on. While back-propagation-based optimization can achieve superior performance in terms of accuracy, it also raises some troubling questions for researchers, such as the lack of interpretability of the network structure and the heuristic fine-tuning of network parameters.
1.1.1 Experimental Analysis on Convolutional Neural Networks

A great majority of the current CNN literature is application-oriented, yet efforts have been made to build a theoretical foundation for CNNs. Cybenko [20] and Hornik et al. [44] proved that the multi-layer perceptron (MLP) is a universal approximator in the late 1980s. Recent studies on CNNs include: visualization of filter responses at various layers [105, 129, 135, 73, 121], scattering networks [74, 9, 122], tensor analysis [18], generative modeling [21], relevance propagation [4], Taylor decomposition [82], information bottleneck [100, 72], multi-layer convolutional sparse modeling [113] and over-parameterized shallow neural network optimization [108]. More recently, CNN interpretability has been examined by a few researchers from various angles. Examples include interpretable knowledge representations [133], identification of critical nodes and data routing paths [120], the role of nonlinear activation [57], convolutional filters as signal transforms [58, 124], etc.

Despite the above-mentioned efforts, it is still challenging to provide an end-to-end analysis of the working principles of deep CNNs. There are mysterious yet important properties associated with CNNs which have never been clearly understood. We list some of them below.

1. Insensitivity to parameter initialization. The training of CNNs is formulated as a non-convex optimization problem. There exist local minima in non-convex optimization, which are dependent on parameter initialization. However, in practice, the performance of trained CNNs is not sensitive to parameter initialization. This phenomenon has been studied by Choromanska [16] and Kawaguchi [50] with a highly mathematical treatment. It is desired to provide further insights into this phenomenon.

2. Overfitting. CNNs are clearly an over-parameterized system that has more parameters than the number of training samples (i.e., the input-output constraints). We would expect to see overfitting for an over-parameterized machine learning system. However, we do not see severe overfitting in many trained CNNs. Will this always be true?

In this research, we aim to address some unanswered questions related to CNNs. It is challenging to conduct rigorous mathematical analysis on convolutional neural networks (CNNs) with more than two hidden layers. To shed light on several poorly understood behaviors of CNNs, we mainly adopt an experimental methodology in this research.

Many complicated deep learning techniques have been proposed in order to address specific issues of deep learning, such as semi-supervised learning techniques and small networks. However, for some of these works, there is little study of the reasons for their excellent performance. Our research aims to demystify some of these methods. We focus on the SqueezeNet [47] as a case study in this work. We also look at co-training based deep semi-supervised learning and propose some insights and conjectures.

1.1.2 New Interpretation of Multi-Layer Perceptron

The MLP, proposed by Rosenblatt in 1958 [95], has a history of more than 60 years. While there are instances of theoretical investigations into why the MLP works (e.g., the classic articles of Cybenko [20] and Hornik, Stinchcombe and White [44]), the majority of the efforts have focused on applications such as speech recognition [1], economic time series [26] and image processing [106]. MLPs are known to be universal approximators, and various forms of the universal approximation theorems have been proved [20, 44, 17, 11, 101].
The proofs were based on tools from functional analysis (e.g., the Hahn-Banach theorem [97]) and real analysis (e.g., the Sprecher-Kolmogorov theorem [109, 53]), which are less well known in the signal processing community. Furthermore, although they show the existence of a solution, most of them do not offer a specific MLP design, and the lack of MLP design examples hinders our understanding of the behavior of MLPs. In our research, we provide a new interpretation of the MLP as a classifier and as a universal approximator.

1.1.3 Feedforward Multi-Layer Perceptron Design

A problem with MLPs is that the network design can be difficult to determine. While there are rules of thumb for MLP design, it remains an open problem. In practice, finding a working architecture involves trial and error in determining the number of layers and the number of neurons in each layer, which can take a lot of time. Many related works adopt search algorithms to speed up the design process. In our research, we propose a network design method that determines the architecture analytically, and we provide specific network designs for classification problems and for MLPs as universal approximators.

1.2 Contributions of the Research

1.2.1 CNN Behavior Under Stress

To facilitate analysis, we adopt a very simple LeNet-5-like network. We adjust the number of filters in each layer and analyze the effect. In particular, we aim to show how differently networks with limited resources (i.e., when the number of filters is very small) and networks with rich resources behave. We found in our experiments that the performance improves as the filter number of a convolutional layer increases, until the number of filters reaches a certain critical point.

An important contribution of our work is the investigation into resource-sparse networks. Most works on CNNs adopt networks with very rich resources. In our work, we also look into how networks behave under very limited resources. We hope our observations on resource-scarce networks can provide inspiration for research on the size reduction of convolutional neural networks, which reduces computational cost. The explanations of some of the observations are worth further investigation. We believe more future research can be inspired by the conjectures made in this study.

1.2.2 SSL-Guided Convolutional Layer Design

Designing CNNs can be complicated and tedious, since there are so many hyper-parameters to adjust. For the design of convolutional layers, we have to determine the number of layers and the number of filters in each layer, which may take a lot of trial and error and a lot of computational time and resources. Some may set the filter number in each layer empirically, but often this leads to excessively large convolutional layers, which wastes computational resources and may require special training techniques.

Inspired by our observations in the experiments, we propose a simple way to determine the number of filters to use in each convolutional layer, which simplifies the network design by eliminating the need for trial and error to find a suitable convolutional layer size. Moreover, unlike model compression methods, our method does not rely on pre-trained over-parameterized models. In fact, our method does not require training any extra networks but derives the suitable layer size using statistical techniques.
Another contribution of this work is to bridge the standard convolutional layers in CNNs with the Saab (Subspace Approximation with Adjusted Bias) transform [60] in successive subspace learning (SSL). The Saab transform was inspired by the standard convolutional layer and the non-linear activation. SSL has been applied to different applications [15, 131] as an alternative to CNNs. In our method, we establish a quantitative relationship between the Saab transform and the filter numbers in a standard convolutional layer.

1.2.3 Co-Training on CNNs

Co-training is a traditional semi-supervised learning method which can be applied to almost all kinds of classifiers. The original co-training method [7] requires conditionally independent feature splits, which are difficult and even impossible to obtain in practice. Therefore, the necessity of such a constraint on the features needs investigation. Nigam and Ghani [85] examine the importance of the conditional independence assumption. The authors show that explicitly utilizing independent feature splits performs better than ignoring such splits. They also show empirically that using feature splits without guaranteed conditional independence may perform better than methods with no feature split. Wang and Zhou [119] show that two conditionally independent views are not necessary: co-training can work well provided that the two classifiers are sufficiently different.

Recently, co-training has been applied to CNNs [92, 39, 51]. Using conditionally independent feature splits for CNNs is almost impossible. Many previous works with CNNs based on co-training or multi-view training introduced network difference constraints. However, the effectiveness of such constraints was not verified in many of these works. In fact, the network difference might have come from the random initialization or dropout instead of from the constraints. Even if the networks collapse at the end of training, it is still likely that they have been making different predictions at an early stage of training; by exchanging knowledge at each step of training, they are expected to have learned new knowledge from the other network. Therefore, is it necessary and effective to enforce network difference? In this work, we discuss some network difference constraints. We hope our insights can inspire related research.

1.2.4 Demystifying the SqueezeNet

The SqueezeNet [47] achieves about the same accuracy level as the AlexNet [56] with 50x fewer parameters. In this work, we attempt to provide a scientific explanation for the superior performance of the SqueezeNet. The fire module is a main component of the SqueezeNet. It consists of a squeeze unit followed by an expand unit. Each squeeze unit consists of 1x1 filters that have fewer output channels than input channels. Therefore, the squeeze unit can be viewed as performing dimension reduction on the input features. This is one of the main reasons for the small size of SqueezeNet. Each expand unit contains 1x1 and 3x3 filters in parallel. As the 1x1 filters demand fewer parameters than 3x3 filters, the network size is reduced. Since the design of the squeeze and expand units helps network size reduction, we aim to answer the following questions:

1. Will the dimension reduction in the squeeze units downgrade the discriminant power of the features by removing too much useful information from the input?

2. Are the expand units effectively using the 1x1 and 3x3 filters? Do the combined features from the 1x1 and 3x3 filters actually have better discriminant power than each individual set of features?
To measure the discriminability, we use the cross entropy, which has been adopted in [60]. We also perform visualization. Here is a brief summary of our findings:

1. For most fire modules, the dimension reduction in the squeeze units does not lower the features' discriminability.

2. The expand/1x1 and expand/3x3 filters work jointly to form more discriminative features.

3. The 1x1 and 3x3 filters can capture different features that complement each other.

1.2.5 Interpretation of MLPs for Classification with the Network Design

In this work, we interpret MLPs as a combination of multiple LDA systems that divide the input space into multiple linear regions with the help of ReLU. We design a network with one input layer, two intermediate layers l_1 and l_2, and one output layer. The design specifies the number of neurons in each layer as well as the weights and biases, and thus we do not need backpropagation. Our design consists of three main parts:

1. From the input layer l_in to l_1. LDA divides the input space into two parts, and each of the neurons in l_1 corresponds to one side of an LDA hyperplane. The weights and biases of the links between the input layer and layer l_1 are obtained from LDA directly.

2. From l_1 to l_2. With multiple LDAs, we can divide the space into multiple regions, and each neuron in l_2 represents one region. We can view the region as the intersection of the half-spaces represented by neurons in l_1.

3. From l_2 to the output layer l_out. In this step, we connect the regions with the corresponding class labels.

1.2.6 Constructing MLPs as Piecewise Polynomial Approximators

We bridge MLPs with piecewise polynomial approximators:

1. We design an MLP with the unit step activation function that is equivalent to a piecewise constant approximator;

2. We design an MLP with the ReLU activation function that is equivalent to a piecewise linear approximator;

3. We design an MLP with the cubic activation function that is equivalent to a piecewise cubic approximator.

Our design includes three layers: an input layer, a hidden layer and an output layer. The design specifies the number of neurons in the hidden layer, and the parameters are also specified. Our goal is to shed light on the properties of MLPs using the properties of piecewise polynomial approximators. Without the use of functional and real analysis, we provide an intuitive way to explain the universal approximation properties of MLPs.

1.3 Organization of the Thesis

The rest of the thesis is organized as follows. The effect of different convolutional layer sizes is presented in Chapter 2. Our SSL-guided convolutional layer design method is presented in Chapter 3. Discussion on co-training based deep semi-supervised learning algorithms is presented in Chapter 4. The design of SqueezeNet is analyzed in Chapter 5. Our interpretation of MLPs as a classifier is described in Chapter 6. Our new interpretation of MLPs for regression is introduced in Chapter 7. Lastly, concluding remarks are given in Chapter 8.
Chapter 2: Experimental Analysis on Convolutional Neural Networks

2.1 Introduction

CNNs have been adopted in many tasks and have achieved state-of-the-art performance. Most works on CNNs adopt networks with very rich resources in order to obtain good performance. However, using a very large number of redundant convolutional filters and unnecessarily wide FC layers is computationally expensive. Building a smaller network with competitive performance would be desirable.

In our work, we first look into how networks behave under very limited resources. We also investigate experimentally how many convolutional filters are needed in order to achieve reasonable performance. We hope our observations on resource-scarce networks can provide inspiration for research on the size reduction of convolutional neural networks.

2.2 Experimental Design

2.2.1 CNN Architectures

In this study, we adopt a LeNet-5-like architecture [63]. The general network architecture is as follows:

• Image resolutions: 32 × 32 or 28 × 28
• CNN architecture:
  - Convolutional Layer 1: filter size 5 × 5; after pooling, 14 × 14 × K1
  - Convolutional Layer 2: filter size 5 × 5; after pooling, 5 × 5 × K2
  - FC Layer 1: 1 × 1 × K3
  - FC Layer 2: 1 × 1 × K4

More information on the input size and the padding used in the first convolutional layer can be found in Section 2.2.2. For the convolutional layers, we consider two different sets of K1 and K2 values: one set of sufficiently large values and the other of very small values. These mimic rich and poor convolutional layer representations. We investigate how the richness of the convolutional layer representations affects the roles of the fully connected layers.

Readers can refer to our code on Google Colab (https://colab.research.google.com/drive/1MebLEgfaW7mIjfW6zBXyaUqe4cD98U8x?usp=sharing) for the detailed setup. The code and results in the link may be updated and/or corrected if any mistakes are found or improvement is needed; therefore, the link may provide more up-to-date implementations/results than this thesis. Readers may run the code in the link to generate more updated results. Comments from readers are also welcome.

2.2.2 Datasets

In this study, we use small datasets of tiny images, including MNIST, Fashion-MNIST and CIFAR-10.

2.2.2.1 MNIST

The MNIST dataset [63] is composed of 28 × 28 grayscale images of handwritten digits 0-9. There are 60,000 samples in the training set and 10,000 samples in the testing set. Note that the MNIST images are 28 × 28 in resolution, so we use "same" padding in the first convolutional layer; the resulting convolutional layer output is still 28 × 28 × K1, and 14 × 14 × K1 after pooling, which is consistent with Section 2.2.1. The MNIST dataset consists of digit images, which contain mainly strokes. It is a relatively simple dataset, mainly for recognizing patterns.

2.2.2.2 Fashion-MNIST

Fashion-MNIST [123] is a 10-class fashion product image dataset. Fashion-MNIST was proposed as a more challenging alternative to MNIST. Similar to MNIST, Fashion-MNIST comprises 28 × 28 grayscale images. Fashion-MNIST also keeps the same data split of 60,000 training samples and 10,000 testing samples as MNIST. "Same" padding is used in the first convolutional layer for Fashion-MNIST so that the output size of each layer is consistent with those listed in Section 2.2.1. Fashion-MNIST is a more challenging dataset compared to MNIST: it contains some shape information for the networks to capture. However, the images are grayscale, so there is no color information.

2.2.2.3 CIFAR-10

CIFAR-10 [55] is a 10-class tiny image dataset. Each sample is a 32 × 32 color image. The data is split into a training set of 50,000 images and a testing set of 10,000 images. Since CIFAR-10 images are 32 × 32, we do not use padding for the first convolutional layer, so that the output size is 28 × 28 × K1, and 14 × 14 × K1 after pooling, consistent with the output sizes in Section 2.2.1. CIFAR-10 contains color images, and thus there is extra color information to capture. It is more difficult than Fashion-MNIST and MNIST.
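The padding choices above are made precisely so that both input resolutions lead to the feature-map sizes listed in Section 2.2.1. A quick check of the arithmetic, as a minimal sketch (the helper below is ours, not part of the thesis code):

    def conv_output_size(n, kernel=5, padding="valid"):
        # "same" padding keeps the spatial size; "valid" shrinks it by (kernel - 1)
        return n if padding == "same" else n - kernel + 1

    # MNIST / Fashion-MNIST: 28x28 input, "same" padding -> 28x28, then 14x14 after 2x2 pooling
    assert conv_output_size(28, padding="same") // 2 == 14
    # CIFAR-10: 32x32 input, no padding -> 28x28, then 14x14 after 2x2 pooling
    assert conv_output_size(32, padding="valid") // 2 == 14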
2.3 Methods

For our experiments, we used a LeNet-5-like [63] network. We maintained the number of convolutional and FC layers and the number of neurons in each FC layer but altered the number of filters in each convolutional layer. We conducted the experiments on MNIST [63], Fashion-MNIST [123] and CIFAR-10 [55]. We mimicked the case where all layers except the layer of concern are rich. We kept the FC layers at sizes 120 and 84 for MNIST and Fashion-MNIST, and at sizes 200 and 100 for CIFAR-10. A convolutional layer is also kept rich while the other is being studied. We follow the settings of LeNet-5 as the "rich" settings for MNIST and Fashion-MNIST: K1 = 6 and K2 = 16 for the first and second convolutional layers, respectively. For CIFAR-10, since the dataset is more complicated, we set the "rich" settings to K1 = 32 and K2 = 64.
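To make this setup concrete, the sketch below builds the LeNet-5-like network of Section 2.2.1 and sweeps the number of filters in Convolutional Layer 1 on MNIST while the other layers stay "rich" (K2 = 16, FC layers of 120 and 84). The thesis does not state the training framework, pooling type, optimizer, epoch count or exact sweep values, so those choices here (tf.keras, max pooling, Adam, 5 epochs) are illustrative assumptions; the authoritative setup is the linked Colab notebook.

    import tensorflow as tf

    def build_model(k1, k2, fc_sizes=(120, 84), input_shape=(28, 28, 1), padding="same"):
        # LeNet-5-like network: two 5x5 convolutional layers with 2x2 pooling,
        # followed by two fully connected layers and a 10-way softmax output.
        layers = [
            tf.keras.layers.Conv2D(k1, 5, padding=padding, activation="relu",
                                   input_shape=input_shape),
            tf.keras.layers.MaxPooling2D(2),
            tf.keras.layers.Conv2D(k2, 5, activation="relu"),
            tf.keras.layers.MaxPooling2D(2),
            tf.keras.layers.Flatten(),
        ]
        layers += [tf.keras.layers.Dense(n, activation="relu") for n in fc_sizes]
        layers.append(tf.keras.layers.Dense(10, activation="softmax"))
        model = tf.keras.Sequential(layers)
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    # Sweep K1 while Convolutional Layer 2 and the FC layers stay rich.
    (x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()
    x_tr, x_te = x_tr[..., None] / 255.0, x_te[..., None] / 255.0
    for k1 in (1, 2, 3, 4, 6, 8, 12, 16):
        model = build_model(k1, k2=16)
        model.fit(x_tr, y_tr, epochs=5, batch_size=128, verbose=0)
        _, acc = model.evaluate(x_te, y_te, verbose=0)
        print(f"K1 = {k1:2d}  test accuracy = {acc:.4f}")

Sweeping Convolutional Layer 2 uses the same loop with k1 fixed at its rich value and k2 varied.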
Inspired by this, we proposed a method to determine the number of lters in convolutional layers, which is presented in Chapter 3. 16 Chapter 3 SSL-Guided Convolutional Layers Design 3.1 Introduction In Chapter 2, we have shown that there is some critical point for the number of lters in con- volutional layers, which provides a good trade-o between performance and computational cost. Such critical point would be helpful in CNN design to determine the number of lters to use in each convolutional layer. In information bottleneck method for deep learning [116], we try to nd a representation that maximizes the mutual information between a layer with the output while minimizing the mutual information between that layer with the input. Therefore, we can consider the problem as nding the most compact representation that retain sucient information about the output from the input. While previous studies considered xed number of lters in convolutional layers, one way to dene the compactness is the number of feature maps (and thus the number of lters). In principal component analysis (PCA), we maximize the information retained about the input while trimming the dimensions containing mostly noise. Saab (Subspace Approximation with Adjusted Bias) transform determines the parameters in each convolutional layer via PCA. In this chapter, we will investigate the relationship between eigenvalue distribution in Saab transform and the number of feature maps to achieve reasonable performance by back-propagation. Note 17 that PCA is unsupervised and thus it cannot measure the mutual information between a layer and the output. It only tries to estimate the minimal number of dimensions to keep in order to retain the "meaningful" information about the input, which may not be useful to determine the output. 3.1.1 Objective Intuitively, with a reasonable method to determine the number of dimension to retain after Saab transform, we may assume the dimensions trimmed after PCA analysis to be mostly noise and thus are likely to be useful to determine the output. And thus, it is likely that the number of dimensions to retain is an upper bound to the "critical point" in a convolutional layer to achieve good results given sucient capability of the subsequent layers. As a result, we can safely use the the number of dimensions to retain as the number of lters in the standard convolutional layers without degrading the performance much. In the following sections, we will verify such claim experimentally. 3.1.2 Contribution An important contribution of this chapter is proposing a simple method to determine feasible and relatively small number of lters to use in convolutional layers. Determining the number of lters in convolutional layer has required a lot of trials and errors previously. Some may start with "classical" settings in other works or from experience, which is likely to be over-parameterized and cause a lot of redundancy [24]. A systematic manner to determine such hyperparameters would simplify the process of designing network architecture. This work also bridges Subspace Approximation with Adjusted Bias (Saab) in successive space learning (SSL) [12] with the standard convolutional layers trained by back-propagation by estab- lishing the quantitative relationship. We also compared the Saab lters and features with the standard convolutional layer ones obtained by back-propagation. 18 The chapter is organized as following. Section 3.2 presents selected related works. 
The procedure to determine the number of dimensions to retain after the Saab transform is described in Section 3.3. We present our experiments and discuss the relationship between the number of dimensions retained after the PCA analysis and the "minimum" number of filters to use in a convolutional layer to achieve reasonable results in Section 3.4.

3.2 Related Work

3.2.1 Information Bottleneck

The information bottleneck method was proposed by Tishby et al. to find short "bottleneck" representations of the input that contain the maximum information about the output. In particular, Tishby et al. further refined the information bottleneck method for deep learning in [116]. They argued that a network should be trained to optimize the trade-off between compression and prediction. According to Tishby et al., all layers of a network should have the mutual information between the layer representation and the output increase during training. On the other hand, a well-trained network should have smaller mutual information between the layer representation and the input in deeper layers than in shallower layers. Intuitively, such representations discard the useless information from the input while maintaining sufficient information about the output. In [103], Shwartz-Ziv et al. analyzed the mutual information between the representation and the input/output in order to demonstrate the changes in deep neural networks during the training process. Ma et al. [72] proposed the HSIC bottleneck, a network training method based on the information bottleneck, to replace back-propagation in deep learning. The HSIC bottleneck demonstrates performance similar to back-propagation on MNIST, Fashion-MNIST and CIFAR-10.

3.2.2 Model Compression

CNNs used in many applications are usually over-parameterized with a lot of redundancy [24], wasting computational resources. These over-parameterized networks also require extra effort in training, such as dropout [111] or regularization [33]; reducing the network size while maintaining the accuracy would therefore be desirable. The model compression problem has been studied in many works [40, 33, 127, 81, 64, 41, 117, 65, 27, 71, 25, 47, 45, 134, 49]. Most model compression works require multiple rounds of re-training, which takes a lot of computational resources [33]. Garg et al. [33] proposed a simple PCA-based method to reduce the number of filters and the number of layers by analyzing a trained network. However, this method still requires an over-parameterized pre-trained network.

3.2.3 Successive Subspace Learning

Successive subspace learning (SSL) [12] is a machine learning method inspired by CNNs. The Saab transform is adopted in this method to perform unsupervised dimension reduction. In the following paragraphs, a brief review of SSL is given. To help the reader better understand the idea flow of SSL, the review is presented in terms of four milestones in the development of SSL.

1. Why non-linear activations?

Why do we need non-linear activations? While the famous universal approximator theory provides an abstract functional analysis explanation, a new engineering perspective using signal processing theories is exploited in [57], [59] and [60]. [57] points out that non-linear activations are important for both one-layer systems and multi-layer systems.

For a one-layer system, the non-linear activation serves as a correlation rectification which rectifies all negative correlations to zero. Convolution basically finds the filters with which the input signal has strong correlations.
A monotonic relationship is preferred, so that stronger correlations mean closer distances between the input and the filters. However, if there is no ReLU, the signal pair x and -x, which has the farthest geodesic distance in the signal space, would be highly correlated, though the correlation is negative.

For a multi-layer system, a more severe problem named sign confusion will occur if there is no non-linear activation. In the sign confusion problem, the system cannot distinguish between the original signal and its counterpart with the opposite sign. Figure 3.1 demonstrates the problem. On the unit circle, the input vector x and anchor vectors a_1, a_2, a_3 form angles θ_1, θ_2, θ_3, respectively. The correlations a_1^T x and a_2^T x are positive, and a_3^T x is negative. While positive correlation is a good geodesic distance indicator, negative correlation is not: x and a_3 are far apart in terms of geodesic distance while their correlation is strong. We cannot use a negative sign to indicate farther distance either, since the existence of negative responses could cause confusion. Without ReLU, multiplying a positive response by a negative weight at the subsequent layer yields the same sign as multiplying a negative response by a positive weight; a similar phenomenon occurs when multiplying a negative response by a negative weight and a positive response by a positive weight. The network will therefore not be able to distinguish those cases. Thus, allowing negative signs in responses is not desired, and a non-linear activation is needed to remove the negative responses.

Figure 3.1: An example to illustrate the necessity of rectification [57].

However, the popular activation function ReLU in CNNs chops off the negative part directly, which could lead to severe information loss. Therefore, the sign-to-position (S/P) format conversion is proposed in [59] to save more information. First, each kernel x_i is augmented with -x_i. Then the input signal is projected onto the doubled kernel set to get a long vector representation, and ReLU is applied to that vector to get the final representation. Here not only is the non-linearity preserved, but also less information is lost, since we know whether the coefficients in the final vector come from the original kernels or the augmented kernels. But the S/P format conversion leads to exponential growth of the size of the representation. Hence, in the follow-up work [60], a bias vector calculated from the training data is used to substitute for the S/P conversion. The corresponding method is called Subspace Approximation with Adjusted Bias (Saab).

2. Role of convolutional layers - dimension reduction

After understanding non-linear activations, [59] and [60] went further to interpret the role of convolutional layers and the necessity of multi-layer structures, and a more interpretable data-driven method to calculate filters was proposed.

[59] is the earliest work on interpreting CNNs systematically from a signal processing perspective. In this paper, the convolution operation is interpreted as a linear transformation of the input signal, and a convolutional layer is interpreted as the combination of a linear transformation and a non-linear activation. The feature extraction process using convolutions can be seen as dimension reduction to get a more compact representation of the input data.
Therefore, instead of label-assisted backpropagation, which is computationally expensive, PCA can be applied to the input data to learn the filters for dimension reduction, which is unsupervised and easy to calculate. More importantly, while backpropagation works like a black box, PCA is quite interpretable because it projects the input data onto the most informative directions. In addition, the number of filters in each layer can be determined by the energy percentage or the eigenvalue distribution of PCA, rather than by fine-tuning over different numbers as in CNNs. This data-driven method of calculating filters, combined with the S/P conversion, is called Subspace Approximation with Augmented Kernels (Saak). Replacing the S/P conversion with the bias adjustment [60] mentioned above, the new theory is called Subspace Approximation with Adjusted Bias (Saab).

Saab samples windows of the input in the same manner as a standard convolutional layer. Suppose the kernel size of a convolutional layer is M. We take M × M × C blocks from the input feature with stride S, where C denotes the number of channels. As a result, we obtain multiple samples of (M × M × C)-dimensional features. We decompose these features into DC and AC components. The DC kernel is basically (1/√N)(1, ..., 1)^T, where N = M × M × C. For the AC components, we perform PCA on the sample-mean-removed input. The DC kernel and the selected AC kernels act as the weights of the convolutional layer. Then, a sufficiently large bias is added to each output neuron before ReLU such that the input to ReLU is always positive. Therefore, the output remains the same after ReLU, and the information is not trimmed by ReLU. Finding the bias value is simple: suppose the input to a convolutional layer is x; since the kernels have unit norm, the bias simply has to be greater than or equal to the largest ||x||_2. The detailed derivation can be found in [60].
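The Saab filter construction just described can be sketched in a few lines of NumPy. The sketch assumes the M × M × C windows have already been extracted and flattened into the rows of a matrix; the function and variable names are our own, and details of the published implementation in [60] (such as energy-based kernel truncation) are omitted.

    import numpy as np

    def saab_filters(patches, num_ac_kernels):
        # patches: (num_samples, D) matrix, each row a flattened M x M x C window
        num_samples, dim = patches.shape
        dc_kernel = np.ones(dim) / np.sqrt(dim)               # constant unit-norm DC kernel
        dc_response = patches @ dc_kernel
        ac_part = patches - np.outer(dc_response, dc_kernel)  # remove the DC component
        # PCA on the sample-mean-removed AC part
        eigvals, eigvecs = np.linalg.eigh(np.cov(ac_part - ac_part.mean(axis=0), rowvar=False))
        order = np.argsort(eigvals)[::-1]                     # eigenvalues from large to small
        ac_kernels = eigvecs[:, order[:num_ac_kernels]].T     # selected AC kernels (unit norm)
        # A bias no smaller than the largest input norm keeps every pre-ReLU response
        # non-negative, because all kernels have unit norm.
        bias = np.linalg.norm(patches, axis=1).max()
        return dc_kernel, ac_kernels, bias, eigvals[order]

The sorted eigenvalues returned here are what the filter-number selection rule of Section 3.3 operates on.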
In [60] each FC layer is interpreted as a Label-Assisted reGression (LAG) which aims to regress the input to a smaller size and also to improve the classiability, then multiple LAGs stacked layer by layer will output a short vector as the classication score. Because of the fully-connected property, it is natural to interpret it as the regression. However, the key point here is how the labels could be exploited. We know that the normal regression problemWx =b needs the input vectorx and the ground truth target vectorb to solve the regression weights W . For an FC layer, we have the output of the previous layer as the input vector and the weights to be learned, but the target vector is missing except in the last FC layer where the ground truth one-hot vectors could serve as b. Therefore, in the LAG for intermediate FC layers, b is the one-hot vector of pseudo-labels which are 24 generated by doing k-means clustering inside each class. For example, for FC1 in LeNet-5, the length of the output is 120 so b is of length 120, which means the corresponding LAG needs to regress the training samples into 120 classes. If the true number of class is 10, then now we need to further divide each class into 12 sub-classes. Therefore, inside each class, k-means is performed to generated 12 clusters which correspond to the 12 sub-classes. After k-means for all classes, each sample has a corresponding b. 4. Gain from the cost function - cross-entropy-guided subspace partitioning In the newest work of SSL, a new framework is proposed which absorbs the merits of pre- vious works and also introduces new ideas to boost the overall performance. The whole SSL framework consists of three modules: (module 1) PCA-based unsupervised dimension reduction, (module 2) cross-entropy-guided subspace partitioning, and (module 3) classi- ers. A simplied version applied in the point cloud classication task achieves the SOTA performance [131]. Module 1 consists of multi-layer Saab and max pooling layers. Its goal is to get a compact representation of the raw input data, which resembles the role of convolutional layers in CNNs. Though the network structure here resembles that in CNNs, it is worth noting that in SSL features or representations from module 1 do not serve as the nal representations to be feed into classiers and they are not ne-tuned like their counterparts in CNNs to boost the classication performance. The unsupervised dimension reduction here is basically to form a decent and tractable representation, and later in module 2 this representation would be further processed and rened using labels to get a nal good representation. 25 Module 2 is a tree-structured subspace partitioning process where the splitting rules is based on the cross entropy. For each cluster, the samples are classied using a classier such as random forest and SVM, the predictions will be used to computed the cross en- tropy. The tree-structured subspace partitioning process serves as the rst fully-connected layer in CNNs which bridges features from convolutional layers with ground truth labels to get a more ecient representation. Since the cluster splitting is based on minimizing the maximum cross entropy, optimization is cooperated into the regression process, which al- lows the stronger ability to deal with inter-class variability compared with the previous LAG. Module 3 plays the role of the last fully-connected layer which aims at decision making so that predicted labels could be generated. 
Module 3 plays the role of the last fully-connected layer, which aims at decision making so that predicted labels can be generated. Here, classical machine learning classifiers are used instead of another linear regression for two reasons. The first is that classifiers like random forests are more robust to popular adversarial attacks because the final decision making is not directly connected to the input data. The second is that the wisdom from the machine learning field can be borrowed, so that we do not need to reinvent the wheel by proposing new powerful classifiers.

3.3 Saab Guided Convolutional Layer Filter Number Choice

Saab vs PCA The Saab transform decomposes the features into DC and AC components first and performs PCA on the AC components. The reason why Saab is used instead of standard PCA is that the outputs at each layer of the Saab transform can be reproduced by standard convolutional layers with ReLU activation by assigning the corresponding weights and a (sufficiently large) bias, without losing important information in the features. If we used standard PCA directly, the output of each layer could be negative, and applying ReLU directly on such output could lose some information. Saab avoids this problem by shifting the output by a sufficiently large bias so that all outputs are positive and thus ReLU does not change the output.

In our PCA analysis of the AC components in Saab, although we do not take the class labels into consideration, we analyze the eigenvalue distribution to see the "signal" and "noise" distribution of the input itself, where "noise" is generally expected to be useless. We plot the eigenvalues from the PCA analysis sorted from large to small. We look for the "saturation point", or a point in the saturation region: after this point, the slope is expected to be extremely small, so that we may expect relatively small and probably negligible information gain from using one more principal component beyond that point. To find such points, we compute the difference of eigenvalue ratios between adjacent components and find the first point with an absolute difference smaller than a threshold. This threshold is a hyper-parameter in our design; it is $10^{-5}$ in our experiments. This method may not be the optimal way to find the point.

3.4 Experimental Results

We mimicked the convolutional layers in a LeNet-5-like architecture [63] using Saab. That is, there were two convolutional layers and the kernel size for both was 5x5. Windows were taken with stride 1. To analyze the effect of the size of one convolutional layer, we keep the other convolutional layer rich so as to retain sufficient information. We follow the settings of LeNet-5 as the "rich" settings. For MNIST and Fashion-MNIST, K_1 = 6 and K_2 = 16 for the 1st and 2nd convolutional layers respectively, and thus for the Saab transform we keep 5 and 15 AC components for the 1st and 2nd layers respectively. For CIFAR-10, since the dataset is more complicated, we set the "rich" settings to be K_1 = 32 and K_2 = 64, and thus for the Saab transform we keep 31 and 63 AC components for the 1st and 2nd layers respectively.

To save memory, we trained the kernels with only 10000 images for the MNIST and Fashion-MNIST datasets and 4500 images for CIFAR-10. Readers can refer to our implementation on Google Colab (https://colab.research.google.com/drive/1P1OZu-n1ibhBtgxwbqhOqScr-ihLzBMO?usp=sharing and https://colab.research.google.com/drive/1k28AdJtW3U7skTD-3eosTKYIhCn6WN23?usp=sharing) for the detailed setup. The code and results in the links may be updated and/or corrected if any mistakes are found or improvement is needed. Therefore, the two links may provide more up-to-date implementations and results than this thesis, and readers may run the code in the links to generate updated results. Comments from the readers are also welcome.
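A minimal sketch of the eigenvalue-gap thresholding rule described in Section 3.3 is given below. It is our own simplified code (the exact indexing convention may differ slightly from the implementation in the links above) and assumes the sorted PCA eigenvalues of the AC part are already available.

import numpy as np

def select_num_ac_components(eigenvalues, threshold=1e-5):
    """eigenvalues: PCA eigenvalues of the AC part, sorted from large to small.
    Returns the number of AC components to keep; the DC kernel adds one more filter."""
    ratios = eigenvalues / eigenvalues.sum()   # eigenvalue (energy) ratios
    diffs = np.abs(np.diff(ratios))            # gaps between adjacent eigenvalue ratios
    small = np.where(diffs < threshold)[0]
    if small.size == 0:
        return len(eigenvalues)                # no saturation point found: keep all AC components
    return int(small[0]) + 1                   # first point in the saturation region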
3.4.1 MNIST

Convolutional Layer 1 For the first convolutional layer, the eigenvalue ratio of each principal component is shown in Figure 3.3. The absolute values of the differences are also plotted in Figure 3.2. We notice that there is no saturation region in the plot in Figure 3.3. For the absolute differences, we also find no point with an absolute difference smaller than the threshold. So we can keep all 25 AC components to be safe. With the addition of the DC component, this gives 26 components in total. In Chapter 2, we found that for the MNIST dataset the number of filters in Convolutional Layer 1 has very little impact on the accuracy, and therefore using 26 filters for conv1 can be a safe choice.

Convolutional Layer 2 The eigenvalue of each principal component for the second convolutional layer is shown in Figure 3.5. In this plot, we can see the curve becomes nearly flat after around the 40th component. The difference thresholding approach finds the 57th point, as marked in green in the figure. Together with the DC component, this gives 58 components in total. So we can use 58 filters for the conv2 layer. In the conv2 results in Chapter 2, we can see that 58 lies in the saturation region for the accuracy.

Figure 3.2: The difference of eigenvalue ratios between adjacent components for MNIST/Conv1.
Figure 3.3: Eigenvalue ratio for MNIST/Conv1.
Figure 3.4: The difference of eigenvalue ratios between adjacent components for MNIST/Conv2.
Figure 3.5: Eigenvalue ratio for MNIST/Conv2.

3.4.2 Fashion-MNIST

Convolutional Layer 1 For the first convolutional layer, the eigenvalue ratio for each component is shown in Figure 3.7. Similar to MNIST, we notice that there is no saturation region in the plot. For the absolute differences, we also find no point with an absolute difference smaller than the threshold. So we can keep all 25 AC components to be safe. With the addition of the DC component, this gives 26 components in total. In the accuracy plot in Chapter 2, we found that for the Fashion-MNIST dataset, using 26 filters lies in the saturation region.

Convolutional Layer 2 The eigenvalue of each principal component for the second convolutional layer is shown in Figure 3.9. We can see the curve becomes nearly flat after around the 20th component. The difference thresholding approach finds the 46th point, as marked in green in the figure. Together with the DC component, this gives 47 components in total. So we can use 47 filters for the conv2 layer. In the conv2 results in Chapter 2, we can see that 47 lies in the saturation region in terms of the accuracy.

Figure 3.6: The difference of eigenvalue ratios between adjacent components for Fashion-MNIST/Conv1.
Figure 3.7: Eigenvalue ratio for Fashion-MNIST/Conv1.
Figure 3.8: The difference of eigenvalue ratios between adjacent components for Fashion-MNIST/Conv2.
Figure 3.9: Eigenvalue ratio for Fashion-MNIST/Conv2.

3.4.3 CIFAR-10

Convolutional Layer 1 The eigenvalue distribution of the first convolutional layer is shown in Figure 3.11. In the figure, we can see the curve becomes nearly flat after around the 20th component. The difference thresholding approach finds the 34th point, as marked in green in the figure. Together with the DC component, this gives 35 components in total. So we can use 35 filters for conv1.
In the conv1 results in Chapter 2, we can see that 35 lies in the saturation region in terms of the accuracy.

Figure 3.10: The difference of eigenvalue ratios between adjacent components for CIFAR-10/Conv1.
Figure 3.11: Eigenvalue ratio for CIFAR-10/Conv1.

Convolutional Layer 2 The eigenvalue distribution of the second convolutional layer is shown in Figure 3.13. In the figure, we can see the curve becomes nearly flat after the 43rd component, which is the point found by our proposed absolute difference thresholding method. Together with the DC component, this gives 44 components in total. So we can use 44 filters for conv2. In Chapter 2, for CIFAR-10, we can see that 44 lies in the saturation region in terms of the testing accuracy.

Figure 3.12: The difference of eigenvalue ratios between adjacent components for CIFAR-10/Conv2.
Figure 3.13: Eigenvalue ratio for CIFAR-10/Conv2.

We found that the testing performance may not improve much by using over 30 filters in the conv2 layer. Therefore, the 44 filters suggested by our method may slightly exceed the optimal number, which is consistent with the observation in Section 3.5 that our estimate is not necessarily tight.

3.4.4 Summary

From the experiments, by comparing the turning points in the Saab results with the backpropagation results, we found that we may safely use the saturation point, or a point in the saturation region, of the Saab results as the number of filters to use in the convolutional layers.

3.5 Conclusion

We proposed a simple method to estimate the number of filters to use in each convolutional layer. Compared to model compression techniques, our method does not rely on a pre-trained over-parameterized model but finds a good set of filter numbers from the training data directly. The filter number found by our method may not be tight; it may exceed the optimal filter number by a small margin. If desired, the filter number found by our method can be used to form a preliminary network, which can then be further reduced by model compression methods.

Chapter 4
Insight on Co-Training Based Deep Semi-Supervised Learning

4.1 Introduction

Deep learning models typically require a huge amount of labeled data to avoid overfitting. However, labeled data is expensive and unavailable in many cases. On the other hand, unlabeled data is easier to obtain. Therefore, we wish to use the knowledge from the unlabeled data to improve the generalizability of a model trained with a limited amount of labeled data.

One approach to utilizing the unlabeled data is to assign pseudo-labels. A simple way to generate pseudo-labels is to use the confident predictions on unlabeled data. In a widely used method named self-training [137], the trained model makes predictions on the unlabeled data, and predictions with high confidence are added to the training set. The model is then retrained with the new training set. The process terminates when no more predictions are added to the training set. A drawback of self-training is that the model reinforces its own mistakes. Another way is to use two classifiers and let them provide pseudo-labels for each other, namely co-training [7]. More generally, we can utilize more than two classifiers and train them to be consistent with each other, which is termed multiview learning [137].

Instead of using different models with different views, recent methods [80, 22, 104, 5] train models to make consistent predictions under different configurations. The Γ-model [94] encourages consistent predictions for the clean and noisy versions of the input data. The Π-model [62] encourages the model to be consistent under random configurations such as random input transformations and dropout [111] for labeled and unlabeled data.
Temporal Ensembling [62] computes a running average of the past predictions of a model and matches the model's current predictions with the averaged predictions. Different from Temporal Ensembling, Mean Teacher [115] averages model weights instead of predictions. In Mean Teacher, the exponential moving average of a (student) network's weights is assigned to its teacher model. The student model is trained to make predictions that match those of the teacher model on both labeled and unlabeled data.

In this work, we look at consistency-based semi-supervised learning methods and at methods that adopt the idea of co-training. We also discuss the effectiveness and the necessity of enforcing difference when applying co-training to deep neural networks. In theory, co-training has strict requirements on the features in order to work well. However, such requirements are hard to fulfill in practice. Many previous works with CNNs based on co-training or multi-view training introduced constraints that encourage network difference. However, the effectiveness of such constraints was not verified in many of these works. In fact, the network difference might have come from the random initialization or dropout instead of from the constraints. So if we do not enforce network difference but simply rely on the randomness from initialization, dropout and input augmentation for natural difference between the networks at each training step, then even if the networks collapse in the end, since they exchange knowledge throughout the training process while they are still different, can we expect such a process to improve network performance?

The rest of the paper is organized as follows. We introduce related semi-supervised learning methods in Section 4.2. Section 4.3 summarizes and discusses some network difference constraints. Section 4.4 concludes our paper.

4.2 Related Work

In this section, we introduce the consistency-based semi-supervised learning methods, including the Π-model, Temporal Ensembling [62] and Mean Teacher [115]. We also introduce co-training [7] and fast-SWA [3].

4.2.1 Consistency-Based Methods

In consistency-based methods, we generally have a student and a teacher.

4.2.1.1 Π-Model

In the Π-model [62], a network teaches itself. The network generates predictions under different input transformations and dropout, and these predictions are trained to be consistent.

4.2.1.2 Temporal Ensembling

Temporal Ensembling [62] computes the exponential moving average of predictions from models obtained in previous epochs, which acts as the teacher. The authors claimed that the averaged predictions tend to approximate the real labels better than the model obtained at the current epoch, and thus the pseudo-labels formed by the averaged predictions may be of better quality.

4.2.1.3 Mean Teacher

The Mean Teacher model [115] uses two networks of the same architecture, namely the student model and the teacher model. The weights of the teacher model are the exponential moving average of those of the student model. Let us denote the weights of the teacher model by $w'$ and the weights of the student model by $w$. At step $t$, we update $w'$ as

$$w'_t = \alpha w'_{t-1} + (1 - \alpha) w_t, \qquad (4.1)$$

where $\alpha$ is the decay rate. During training, the student model is updated to agree with the teacher model on the predictions for both labeled and unlabeled data while reducing the classification loss on labeled data.
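As an illustration, the weight update of Eq. (4.1) and a simple consistency term can be sketched in PyTorch as follows. This is our own minimal sketch; the actual Mean Teacher implementation [115] differs in details such as loss weighting and ramp-up schedules.

import torch

@torch.no_grad()
def update_teacher(teacher, student, alpha=0.99):
    """w'_t = alpha * w'_{t-1} + (1 - alpha) * w_t, applied parameter-wise (Eq. 4.1)."""
    for w_teacher, w_student in zip(teacher.parameters(), student.parameters()):
        w_teacher.mul_(alpha).add_(w_student, alpha=1.0 - alpha)

def consistency_loss(student_logits, teacher_logits):
    """Mean-squared error between the two models' softmax predictions."""
    return torch.mean((torch.softmax(student_logits, dim=1)
                       - torch.softmax(teacher_logits, dim=1)) ** 2)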
4.2.2 Co-Training

In the co-training algorithm proposed by Blum and Mitchell [7], the features are split into two subsets that are conditionally independent given the class label. Each subset is also expected to provide adequate information to train a good classifier on its own. Two classifiers are trained on the labeled data using the two feature subsets respectively. Then both classifiers label a randomly selected subset of unlabeled samples and teach each other with the most confidently labeled samples.

However, conditionally independent feature splits may be hard to obtain in practice. Therefore, many works examine the necessity of the co-training assumptions in practice [118, 85, 119]. Nigam and Ghani [85] examined the importance of the conditional independence assumption. The authors showed that explicitly utilizing independent feature splits performs better than ignoring such splits. They also showed empirically that using feature splits without guaranteed conditional independence may still perform better than methods with no feature split. Wang and Zhou [119] showed that two conditionally independent views are not necessary: co-training can work well provided that the two classifiers are sufficiently different.

Deep Co-Training Qiao et al. [92] applied the idea of co-training to deep learning. Two networks were trained to agree with each other on the dataset and disagree with each other on adversarial examples. The authors also extended the idea to more than two networks. The proposed Deep Co-Training method outperformed state-of-the-art methods on SVHN [84] with 1000 labeled samples, CIFAR-10 [54] with 4000 labeled samples, CIFAR-100 [54] with 10000 labeled samples, and ImageNet [98] with 10% labeled samples. While Deep Co-Training achieved good performance, the training process is very time-consuming since adversarial examples are computed at every iteration for each network.

Dual Student Ke et al. also adopted a co-training-like method for deep learning [51]. Different from Deep Co-Training, their proposed Dual Student method does not explicitly enforce network difference but places constraints on the knowledge exchanged. In Dual Student, a sample is considered stable if a small perturbation does not change the predicted label and the predicted label has high probability. The networks only teach each other with such stable samples. Apart from the between-network constraint, for each network, Dual Student also enforces the consistency constraint as in consistency-based semi-supervised learning methods.

Co-Teaching Besides the applications in semi-supervised learning, Han et al. adapted co-training for learning with noisy labels [39]. In their proposed Co-teaching method, two networks teach each other with the small-loss samples. Co-teaching is similar to Dual Student in the sense that both methods try to filter out unreliable knowledge and let the networks exchange only reliable knowledge. In Dual Student, reliable knowledge comes from the stable samples, while in Co-teaching the small-loss samples are considered relatively reliable.

4.3 Discussion on Enforcing Network Difference

To prevent ending up with two identical student networks after training, we may need to add constraints that discourage the networks from becoming too similar. In [92], the authors found that when no dissimilarity constraint was enforced, the networks collapsed on the SVHN dataset and caused an accuracy drop.
However, even if the networks collapse as training goes on, the networks have been updated with each other's knowledge at each step throughout the training process, and it is likely that they have been making different predictions at least in the early stage of training. Therefore, if we do not enforce difference between the networks, will we still benefit from the co-training process?

In [92], the network difference was introduced by encouraging the networks to make different predictions on adversarial examples. The principle is to find some dataset $D'$ such that the objective of making the networks agree on the original dataset is compatible with the objective of making the networks disagree on the dataset $D'$. However, using adversarial examples to construct the dataset $D'$ requires generating adversarial examples of the networks for all labeled and unlabeled images in the mini-batch at every iteration. In Deep Co-Training, the use of adversarial examples encourages a network to be robust against the adversarial examples of the other network, which can potentially improve the robustness of the network and thus the overall performance. In this case, the improvement in the final accuracy may not be attributable to the use of multiple networks but to the robustness against adversarial examples.

There are other possible ways to define the similarity loss, some of which have been exploited in previous literature. We may define dissimilarity constraints based on the network weights. One alternative, proposed by Saito et al. [99], uses the weight matrices of the first fully connected layer after the shared generator network; the similarity loss is defined as $|W_{F_1}^T W_{F_2}|$, where $W_{F_1}$ and $W_{F_2}$ are the weight matrices. Another choice is to use the cosine distance, which computes the similarity loss $L_{sim}$ as follows:

$$L_{sim} = \frac{1}{N_o} \sum_{r} \frac{\left| \sum_{c} W_{r,c} W'_{r,c} \right|}{\sqrt{\sum_{c} W_{r,c}^2}\,\sqrt{\sum_{c} W'^{2}_{r,c}}}, \qquad (4.2)$$

where $W, W' \in \mathbb{R}^{N_o \times N_i}$ are the weight matrices of the fully connected layer in the student network and the teacher network respectively, $N_o$ and $N_i$ denote the numbers of output and input channels respectively (in our network, $N_o$ would be the number of classes), and $W_{r,c}$ is the element at the $r$-th row and $c$-th column. We can view $L_{sim}$ as the average absolute cosine similarity of the projection directions over the output dimensions.

Adopting local dissimilarity measures that enforce difference only on part of the network may not ensure that the networks have different views or make different predictions on a given input. For example, two networks with very different weights in some layers may still produce the same predictions. Therefore, these constraints may not be as effective for enforcing diverse views as a difference constraint on the predictions.

4.4 Conclusion and Future Work

In this paper, we offer some insights on co-training based deep semi-supervised learning. We also raise questions regarding the effectiveness and necessity of imposing difference constraints between the networks when applying co-training to neural networks. We did not verify this effectiveness or necessity in this work; we hope these questions can inspire future research.

Chapter 5
Experimental Analysis on Squeeze Networks

5.1 Introduction

Deep learning has achieved great success in many applications in the last decade. Most work focuses on improving performance at the cost of higher network complexity. Yet, such models are not suitable for applications in mobile and edge computing, where there exist severe computation and storage constraints. This motivates research on small neural networks in recent years.
Small networks are significantly smaller in size than regular networks while achieving comparable performance. One famous example is the SqueezeNet [47]. It achieves about the same accuracy level as AlexNet [56] with 50x fewer parameters. While the SqueezeNet greatly reduces the network size and achieves impressive performance, there is little study on the reason for its superior efficiency. Our research objective is to demystify the design of the SqueezeNet quantitatively.

One main architectural component of the SqueezeNet is the "fire module". It is the cascade of a "squeeze unit" and an "expand unit", and it is the main source of computational and storage savings. In this work, we aim to provide a scientific explanation for the superior performance of the SqueezeNet by investigating the roles of the fire module. There is a relationship between successive subspace learning (SSL) and our current analysis. Kuo et al. [57, 58, 59, 60] designed several interpretable machine learning models based on the SSL principle. The cross-entropy tool was computed and applied to each individual feature to measure its discriminant power in [60]. In this work, we use the same tool to demystify the behavior of the SqueezeNet. We will show that the cross-entropy is higher in earlier fire modules and keeps decreasing as we proceed to later modules. This indicates better and better discriminant power of the whole system. As to the function of each fire module, we have the following two main observations. First, the squeeze unit does not lower feature discriminability for most fire modules; rather, it is used to lower the channel number to reduce network complexity for a smaller model size. Second, the expand unit adopts 1x1 and 3x3 filters jointly to form more discriminative features for cross-entropy reduction; the 1x1 and 3x3 filters capture different features that complement each other. We also use visualization tools to shed light on the SqueezeNet. Furthermore, there is a close relationship between the design of the SqueezeNet and the PixelHop++ system proposed in [14]. It will be elaborated in Section 5.3.

The rest of the paper is organized as follows. The SqueezeNet and its fire module are briefly reviewed in Section 5.2. The relationship between the SqueezeNet and the PixelHop++ system is discussed in Section 5.3. Then, to demystify the SqueezeNet, we analyze the evolution of cross-entropy values across multiple fire modules in Section 5.4 and use visualization tools to provide several illustrative examples in Section 5.5. Finally, concluding remarks and future extensions are given in Section 5.6.

5.2 SqueezeNet and Fire Module

The SqueezeNet was proposed by Iandola et al. in [47] with the objective of reducing the model size of a large neural network while preserving its recognition performance. The system diagram of the SqueezeNet is shown in Fig. 5.1. The left one is the baseline; the middle and the right ones are variants, where the middle one has simple bypass and the right one has complex bypass. We focus on the baseline without bypass. It has 10 layers in total. The first and the last layers are convolutional layers, with 8 fire modules as middle layers in between. To strike a good balance between the performance and the model size, the SqueezeNet conducts the max-pooling operation at three layers. This pooling operation is needed since it offers larger receptive fields. The SqueezeNet places the pooling operations at relatively late positions in the network to preserve larger activation maps.
It is apparent from the figure that the fire module plays a key role in the SqueezeNet.

Figure 5.1: The system diagram of the SqueezeNet [47].

Each fire module consists of a squeeze unit followed by an expand unit, as shown in Fig. 5.2. The squeeze unit consists of 1x1 filters, with fewer channels in the output than in the input. The expand unit has 1x1 and 3x3 filters in parallel. The fire module reduces the network size based on the following two ideas.

1. Saving via Channel Redundancy Removal The number of output channels is smaller than that of input channels at the squeeze unit. By inserting the squeeze unit before the expand unit, the number of input channels for the 3x3 filters in the expand unit can be greatly reduced. This dimension reduction idea works if there exists redundancy between input channels.

2. Saving via Spatial Correlation Removal Some of the 3x3 filters are replaced by 1x1 filters since the 1x1 filters demand fewer parameters. This is reflected in the design of the expand unit, which consists of 1x1 filters in parallel with the 3x3 filters. This dimension reduction idea works for those filters if there exists little correlation between spatially adjacent pixels.

Figure 5.2: Illustration of the squeeze unit and the expand unit of a fire module [47].

5.3 PixelHop++ and Its Relation with SqueezeNet

We would like to point out an interesting relationship between the PixelHop++ system proposed in [14] and the SqueezeNet. A three-level PixelHop++ system is shown in Fig. 5.3. It can have more than three levels if the input image size is larger. The input is an image of resolution 32x32xK_0, where K_0 = 1 and 3 for gray-scale and color images, respectively. Each level has one channel-wise (c/w) Saab transform unit followed by (2x2)-to-(1x1) max-pooling. The c/w Saab transform operates on blocks of NxN pixels with stride equal to one in all three levels. In the training phase, we collect sample blocks to derive the transform kernels. In the testing phase, we apply the filters to each block and generate a set of responses at its center. The responses can be positive or negative. A constant bias is added to all responses to ensure that they are all non-negative, which explains the name of the "subspace approximation with adjusted bias (Saab) transform" [60].

Figure 5.3: Illustration of the three-level c/w Saab transform in PixelHop++, which provides a sequence of successive subspace approximations to the input image.

For PixelHop++, the number of filters at each level can be controlled by a threshold. If the threshold is larger, more channels will be discarded. This is in analogy to the squeeze unit. Among the channels reserved at each layer, some low-frequency channels will go through another block Saab transform formed by concatenating spatially adjacent pixels. This is in analogy to the expand unit. There are two main differences between PixelHop++ and the SqueezeNet. First, filter weights in PixelHop++ are determined statistically from pixel correlations, while they are determined by backpropagation in the SqueezeNet. Second, since responses at different channels are decorrelated by the c/w Saab transform in the PixelHop++ system, the input to each block transform is a 2D (spatial) tensor rather than a 3D (joint spatial-spectral) tensor as in the SqueezeNet.
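For reference, the fire module reviewed in Section 5.2 can be written compactly as below. This is a sketch following the structure described in [47] (the hyper-parameter names are ours), not the exact pre-trained model analyzed later.

import torch
import torch.nn as nn

class Fire(nn.Module):
    """Squeeze (1x1) unit followed by parallel expand (1x1 and 3x3) units."""
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))                       # channel (spectral) reduction
        return torch.cat([self.relu(self.expand1x1(x)),      # fine-grained 1x1 responses
                          self.relu(self.expand3x3(x))],     # larger-receptive-field 3x3 responses
                         dim=1)                              # concatenate along the channel axis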
5.4 Discriminability Study via Cross-Entropy Analysis

The cross entropy is commonly used as the cost function of a convolutional neural network (CNN). It indicates the discriminability of the network: the lower its output cross-entropy, the higher its discriminant power. Here, we use the cross-entropy value to measure the discriminant power at a network layer. Typically, the layer-wise discriminability of a CNN becomes better as the layer goes deeper. In other words, we expect the cross-entropy to go down in deeper layers. We will study the evolution of cross-entropy values through fire modules #2-#9.

Given a classifier with prediction probability $p_{n,c}$ for each sample $n$ and class $c$, the mean cross entropy per sample is defined as

$$H = -\frac{1}{N} \sum_{n}^{N} \sum_{c}^{C} y_{n,c} \log(p_{n,c}), \qquad (5.1)$$

where $N$ is the total number of samples, $C$ is the total number of classes, and $y_{n,c}$ is the binary encoding of the true class labels. That is, $y_{n,c} = 1$ if sample $n$ belongs to class $c$; otherwise, $y_{n,c} = 0$. To compute the cross entropy at a particular network layer, we need a classifier to yield $p_{n,c}$ for samples at that layer. Inspired by the work in [93, 60], we can design a classifier based on the k-means clustering algorithm. That is, we use k-means to split the data at the layer into $K$ clusters, where each cluster may consist of samples of different classes. The prediction probability $p_{n,c}$ of each sample depends on the cluster to which the sample belongs. If sample $n$ belongs to cluster $k$, we can approximate the prediction probability via

$$\tilde{p}_{n,c} = \frac{M_{k,c}}{\sum_{c'=1}^{C} M_{k,c'}}, \qquad (5.2)$$

where $M_{k,c}$ denotes the number of samples of class $c$ in cluster $k$. The approximation in Eq. (5.2) holds when there are enough training samples and we have an ideal predictor. In other words, it gives a lower bound of the cross-entropy value since it implies 100% prediction accuracy.

Table 5.1: The average ratio of the activation sum of windows out of the localization map (13x13) in the Fire9 module.
window size   ratio      window size   ratio      window size   ratio
1x1           0.204      6x6           0.795      10x10         0.936
2x2           0.514      7x7           0.838      11x11         0.957
3x3           0.640      8x8           0.876      12x12         0.971
4x4           0.702      9x9           0.909      13x13         1.000
5x5           0.750

Figure 5.4: Corresponding bounding boxes (from left to right) in the input, Fire 2, Fire 5 and Fire 9.

The cross entropy measures the discriminant power of a network or a network layer. The layer-wise discriminability of a network should increase as the layer goes deeper. We demonstrate this property of the SqueezeNet through an experimental design. We randomly select 9,855 images from 10 categories of the ImageNet [23] as the dataset for illustration. The ten selected classes are: siamese cat, valley, convertible, liner, damselfly, cab, breakwater, electric locomotive, Dandie Dinmont and soup bowl. Unless specified otherwise, the cluster number for k-means is set to 32 in our experiments. We performed the analysis on the PyTorch Hub pre-trained SqueezeNet 1.0 model (https://pytorch.org/hub/pytorch_vision_squeezenet/) in all experiments in this paper.

Cross-entropy computation with all features at each layer is memory and time consuming. Furthermore, this is not what a trained CNN does. A CNN can zoom in on the spatially and spectrally discriminant regions through iterative cross-entropy minimization via backpropagation in the training process.
Table 5.2: Dimensions of the 3D tensors used for cross-entropy computation in fire modules.
         squeeze     expand/1x1   expand/3x3   expand/concatenate
Fire2    29x29x15    29x29x53     29x29x52     29x29x102
Fire3    29x29x15    29x29x57     29x29x55     29x29x111
Fire4    29x29x30    29x29x108    29x29x112    29x29x219
Fire5    15x15x29    15x15x113    15x15x108    15x15x220
Fire6    15x15x44    15x15x167    15x15x169    15x15x335
Fire7    15x15x44    15x15x172    15x15x170    15x15x341
Fire8    15x15x59    15x15x225    15x15x222    15x15x446
Fire9    7x7x59      7x7x227      7x7x217      7x7x443

Table 5.3: Cross-entropy values at the input and the output of squeeze layers and at the output of 1x1 filters, 3x3 filters and their concatenation in the expand layers.
         in/squeeze   out/squeeze   1x1-out/expand   3x3-out/expand   concatenated out/expand
Fire 2   1.646        1.639         1.53             1.726            1.502
Fire 3   1.502        1.44          1.411            1.686            1.403
Fire 4   1.403        1.435         1.456            1.571            1.404
Fire 5   1.322        1.227         1.121            1.143            1.063
Fire 6   1.063        1.029         0.953            1.132            0.951
Fire 7   0.951        0.911         0.941            1.082            0.918
Fire 8   0.918        0.850         0.881            1.060            0.904
Fire 9   0.834        0.723         0.620            0.913            0.714

To find the discriminant regions, we adopt a visualization tool, Grad-CAM [102], to generate localization maps from the expand layer of size 13x13 in Fire9. We show the average ratio of the activation sum as a function of the window size in Fire9 in Table 5.1. We choose a salient region of size 7x7 from each image, which is about 29% of the original size yet keeps 83.8% of the activation sum on average. The position of the chosen region can vary for different images. Then, we project this localization map back to all intermediate layers and the original input of size 224x224 to determine the corresponding locations. Next, we compute the activation sum of the windows over our dataset for each channel. We select important channels in each group of intermediate outputs (including expand/1x1, expand/3x3 and expand concatenated outputs) by sorting channels by their activation sums. We only keep the channels with larger activation sums that together preserve 95% or more of the total activation sum over all channels. Then, we use the selected windows and selected channels to form 3D tensors. Only features inside the bounding box and important channels are used for the cross-entropy computation. An example of the corresponding bounding boxes at different layers is shown in Fig. 5.4. The sizes of the resulting 3D tensors in all 8 fire modules are listed in Table 5.2, where the first two numbers refer to the spatial dimensions and the last one is the spectral dimension.

The main results of our analysis are summarized in Table 5.3, which shows cross-entropy values at the input and output of the squeeze layers and at the outputs of the 1x1 filters, 3x3 filters and their concatenation in the expand layers of each fire module. Squeeze layers in fire modules are 1x1 filters, which act as spectral dimensionality reduction. They pool the important input information across channels. For most fire modules, dimension reduction in the squeeze layers does not lower feature discriminability since the cross-entropy values still decrease. The input of the expand layer feeds into the 1x1 and 3x3 convolution filters, and their outputs are concatenated along the channel dimension. We see that the concatenated outputs have lower cross entropy than the outputs of the 1x1 and 3x3 filters for most fire modules. This implies that the 1x1 and 3x3 filters work together to form more discriminant features.
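The cluster-based cross-entropy estimate of Eqs. (5.1)-(5.2) can be sketched as follows. This is our own simplified version (function names are ours, and we use scikit-learn's k-means); the pipeline above additionally restricts the features to the salient windows and important channels before this step.

import numpy as np
from sklearn.cluster import KMeans

def cluster_cross_entropy(features, labels, num_classes, num_clusters=32, eps=1e-12):
    """Estimate the layer-wise cross entropy via k-means clustering (Eqs. 5.1-5.2).

    features: (num_samples, dim) flattened layer responses; labels: integer class ids.
    """
    clusters = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(features)
    # M[k, c]: number of samples of class c falling into cluster k.
    M = np.zeros((num_clusters, num_classes))
    for k, c in zip(clusters, labels):
        M[k, c] += 1
    prob = M / np.maximum(M.sum(axis=1, keepdims=True), 1)   # Eq. (5.2)
    # Eq. (5.1): average over samples of -log p for each sample's true class.
    p_true = prob[clusters, labels]
    return float(-np.mean(np.log(p_true + eps)))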
5.5 Analysis via Visualization

In this section, we use two visualization techniques to understand saliency and activation in the fire modules of the SqueezeNet: Grad-CAM (Gradient-weighted Class Activation Mapping) [102] and Guided Backpropagation [110]. Grad-CAM is a generalization of CAM (Class Activation Mapping) [136]. Grad-CAM shows a heatmap of the salient regions of an image on which the model bases its classification decision. Grad-CAM is architecture-agnostic for CNNs; it generates class-discriminative localization maps for an arbitrary CNN architecture and can be viewed as a weakly-supervised localization approach. For a target class $c$, it first computes the gradient of the score $y^c$ with respect to the feature maps. Then, it uses the results to obtain weights $\alpha^c_k$, which capture the importance of feature map $k$. The feature maps are weighted by these importance weights to form the localization map. Grad-CAM works well for localizing regions but not for fine-grained visualization.

For fine-grained visualization of saliency, we use Guided Backpropagation, which is a modification of the deconvnet [129] that incorporates backpropagation into the deconvnet. It allows guidance from deeper layers, and its reconstruction is more accurate than that of the deconvnet. Guided backpropagation allows us to visualize patterns from intermediate layers as well as the last layer of the network.

First, we use the guided backpropagation visualization tool, with the implementation in [86]. Images of the Dandie Dinmont class from the ImageNet and their visualization results at the outputs of the expand/1x1 and expand/3x3 filters in Fire9 are shown in Fig. 5.5. We choose the top 10 filters with the largest activation sum. Each row in this figure corresponds to a filter. We show the top 9 activations of each filter and arrange them from left to right in rows. We see that the expand/3x3 filters and expand/1x1 filters capture different regions. The expand/3x3 filters focus on the face, especially the eyes and nose; these structured regions need larger receptive fields. In contrast, the expand/1x1 filters focus on the fur region due to its fine-grained property. The concatenation of the 3x3 and 1x1 filters can capture both.

Next, we use the Grad-CAM visualization tool (we used the implementation at https://pypi.org/project/pytorch-gradcam/). The two input images are an Afghan hound and a Dandie Dinmont, as shown in Fig. 5.6. We visualize the localization maps of each fire module in Fig. 5.7. Each row shows the results of one fire module, ordered from Fire2 (top) to Fire9 (bottom). In each row, we show the visualization results for the squeeze layer, expand/1x1, expand/3x3 and expand concatenated filters from left to right. For the Afghan hound, the heatmap spreads out over the image in the first three fire modules; the heatmap does not find the important region. At the expand/1x1 filters of Fire5, the SqueezeNet starts to focus on the dog body. At the expand/3x3 filters of Fire8, the SqueezeNet starts to focus on the dog face. For the Dandie Dinmont, the dog face takes up most of the area in the image. Yet, we observe a similar transition. From Fire6, the SqueezeNet starts to focus on the dog body. At the expand/3x3 filters of Fire7, the SqueezeNet starts to focus on the dog face, especially its eyes and nose, which are discriminant features for this category.

5.6 Conclusion and Future Work

In this work, we attempted to demystify the fire module design in the SqueezeNet. We proposed a method to compute the cross entropy at different layers based on saliency. This tool should be helpful in future studies of other neural network architectures. Through the cross-entropy analysis, we reached two main conclusions.
First, the spectral dimension reduction in the squeeze layers does not lower feature discriminability for most fire modules. Second, the 1x1 and 3x3 filters in the expand layers work jointly to form discriminant features for object classification. Furthermore, we demonstrated with visualization tools how expand/1x1 and expand/3x3 capture different features that complement each other.

A CNN solution is trained using labeled data. Through optimization, the trained network can automatically find the most discriminant features that separate one class from another. Yet, the solution is sensitive to perturbations such as adversarial attacks. The SSL system provides an alternative path based on pixel statistics at different scales, and the model is usually smaller (see [14]). The whole field is still in its infancy. We believe that more research on SSL theory and applications will be a promising future direction.

Figure 5.5: Input images (top) and their visualization results using guided backpropagation (bottom) of expand/1x1 filters (left) and expand/3x3 filters (right) in Fire9.
Figure 5.6: Two dog images from the ImageNet: Afghan hound (left) and Dandie Dinmont (right).
Figure 5.7: Visualization across different layers using the Grad-CAM method. Each row corresponds to one fire module, from Fire2 (top) to Fire9 (bottom). Columns (from left to right) correspond to the squeeze layer, expand/1x1, expand/3x3 and expand/joint layers, respectively.

Chapter 6
Feedforward Design of Multi-Layer Perceptron (MLP)

6.1 Introduction

The multilayer perceptron (MLP), proposed by Rosenblatt in 1958 [95], has a history of more than 60 years. However, while there are instances of theoretical investigations into why the MLP works (e.g., the classic articles of Cybenko [20] and Hornik, Stinchcombe and White [44]), the majority of the efforts have focused on applications such as speech recognition [1], economic time series [26], image processing [106], and many others.

One-hidden-layer MLPs with suitable activation functions have been shown to be universal approximators [20, 44, 69]. Yet, this only shows existence and does not provide guidelines for network design [19]. Sometimes, deeper networks can be more efficient than shallow, wider networks. The MLP design remains an open problem. In practice, trial and error is used to determine the layer number and the neuron number of each layer, and this process of hyper-parameter fine-tuning is time consuming. We attempt to address these two problems simultaneously in this work: MLP theory and automatic MLP network design.

For MLP theory, we examine the MLP from a brand new angle. That is, we view an MLP as a generalization of the classical two-class linear discriminant analysis (LDA). The input to a two-class LDA system is two Gaussian-distributed sources, and the output is the predicted class. The two-class LDA is valuable since it has a closed-form solution. Yet, its applicability is limited due to the severe constraints in the problem setup. It is desirable to generalize the LDA so that an arbitrary combination of multimodal Gaussian sources representing multiple object classes can be handled. If there exists such a link between the MLP and the two-class LDA, analytical results of the two-class LDA can be leveraged for the understanding and design of the MLP. The generalization is possible due to the following observations.

• The first MLP layer splits the input space with multiple partitioning hyperplanes.
We can also generate multiple partitioning hyperplanes with multiple two-class LDA systems running in parallel.

• With specially designed weights, each neuron in the second MLP layer can activate one of the regions formed by the first-layer hyperplanes.

• A sign confusion problem arises when two MLP layers are in cascade. This problem is solved by applying the rectified linear unit (ReLU) operation to the output of each layer.

In this paper, we first make an explicit connection between the two-class LDA and the MLP design. Then, we propose a general MLP architecture that contains an input layer $l_{in}$, an output layer $l_{out}$, and two intermediate layers, $l_1$ and $l_2$. Our MLP design consists of three stages:

• Stage 1 (from input layer $l_{in}$ to $l_1$): Partition the whole input space flexibly into a few half-subspace pairs, where the intersection of half-subspaces yields many regions of interest.

• Stage 2 (from intermediate layer $l_1$ to $l_2$): Isolate each region of interest from the others in the input space.

• Stage 3 (from intermediate layer $l_2$ to output layer $l_{out}$): Connect each region of interest to its associated class.

The proposed design can determine the MLP architecture and the weights of all links in a feedforward, one-pass manner without trial and error. No backpropagation is needed in network training. In contrast with traditional MLPs that are trained based on end-to-end optimization through backpropagation (BP), it is proper to call our new design the feedforward MLP (FF-MLP) and the traditional one the backpropagation MLP (BP-MLP). Intermediate layers are not hidden but explicit in FF-MLP. Experiments are conducted to compare the performance of FF-MLPs and BP-MLPs. The advantages of FF-MLPs over BP-MLPs are obvious in many respects, including faster design time and training time.

The rest of the paper is organized as follows. The relationship between the two-class LDA and the MLP is described in Sec. 6.2. A systematic design of an interpretable MLP in a one-pass feedforward manner is presented in Sec. 6.3. Several MLP design examples are given in Sec. 6.4. Observations on the BP-MLP behavior are stated in Sec. 6.5. We compare the performance of BP-MLP and FF-MLP by experiments in Sec. 6.6. Comments on related previous work are made in Sec. 6.7. Finally, concluding remarks and future research directions are given in Sec. 6.8.

6.2 From Two-Class LDA to MLP

6.2.1 Two-Class LDA

Without loss of generality, we consider two-dimensional (2D) random vectors, denoted by $x \in \mathbb{R}^2$, as the input for ease of visualization. They are samples from two classes, denoted by $C_1$ and $C_2$, each of which is a Gaussian-distributed function. The two distributions can be expressed as $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$, where $\mu_1$ and $\mu_2$ are their mean vectors and $\Sigma_1$ and $\Sigma_2$ their covariance matrices, respectively. Fig. 6.1(a) shows an example of two Gaussian distributions, where the blue and orange dots are samples from classes $C_1$ and $C_2$, respectively. Each Gaussian-distributed modality is called a Gaussian blob.

A linear classifier can be used to separate the two blobs in closed form, known as two-class linear discriminant analysis (LDA). LDA assumes homoscedasticity, that is, the covariances of different classes are identical: $\Sigma_1 = \Sigma_2 = \Sigma$. In this case, the decision boundary can be formulated in the form of [91][42]

$$w^T x + b = 0, \qquad (6.1)$$

where

$$w = (w_1, w_2)^T = \Sigma^{-1}(\mu_1 - \mu_2), \qquad (6.2)$$

$$b = \frac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 - \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \log\frac{p}{1-p}, \qquad (6.3)$$

and $p = P(y = 1)$. Then, sample $x$ is classified to class $C_1$ if $w^T x + b$ evaluates positive. Otherwise, sample $x$ is classified to class $C_2$.
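Equations (6.2) and (6.3) translate directly into code. The following is our own minimal sketch (assuming the equal-covariance setup described above), not the implementation used in our experiments.

import numpy as np

def two_class_lda(x1, x2):
    """Return (w, b) of the decision boundary w^T x + b = 0 from Eqs. (6.2)-(6.3).

    x1, x2: sample matrices of shape (n_i, dim) for classes C1 and C2.
    """
    mu1, mu2 = x1.mean(axis=0), x2.mean(axis=0)
    # Pooled covariance under the homoscedasticity assumption.
    cov = (np.cov(x1, rowvar=False) * (len(x1) - 1)
           + np.cov(x2, rowvar=False) * (len(x2) - 1)) / (len(x1) + len(x2) - 2)
    cov_inv = np.linalg.inv(cov)
    p = len(x1) / (len(x1) + len(x2))                  # prior P(y = 1)
    w = cov_inv @ (mu1 - mu2)                          # Eq. (6.2)
    b = (0.5 * mu2 @ cov_inv @ mu2 - 0.5 * mu1 @ cov_inv @ mu1
         + np.log(p / (1 - p)))                        # Eq. (6.3)
    return w, b   # predict C1 when w @ x + b > 0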
6.2.2 One-Layer Two-Neuron Perceptron

We can convert the LDA in Sec. 6.2.1 to a one-layer two-neuron perceptron system as shown in Fig. 6.1(b). The input consists of two nodes, denoting the first and the second dimensions of the random vector $x$. The output consists of two neurons, in orange and green respectively. The two orange links have weights $w_1$ and $w_2$, which can be determined from Eq. (6.2), and the bias $b$ for the orange node can be obtained from Eq. (6.3). Similarly, the two green links have weights $-w_1$ and $-w_2$, and the green node has bias $-b$. The rectified linear unit (ReLU), defined as

$$\text{ReLU}(y) = \max(0, y), \qquad (6.4)$$

is chosen to be the activation function of the neurons. The activated responses of the two neurons are shown in the right part of Fig. 6.1(b). The left (or right) dark purple region in the top (or bottom) subfigure indicates zero responses. Responses are non-zero in the other half, and we see more positive values as we move further to the right (or left).

One may argue that there is no need to have two neurons in Fig. 6.1(b): one neuron is sufficient for making the correct decision. Although this could be true, some information of the two-class LDA system is lost in a one-neuron system. The response magnitude for samples in the left region is all equal to zero if only the orange output node exists. This degrades the classification performance. In contrast, by keeping both the orange and green nodes, we preserve the same amount of information as in the two-class LDA.

One may also argue that there is no need to use the ReLU activation function in this example. Yet, ReLU activation is essential when the one-layer perceptron is generalized to a multi-layer perceptron, as explained below. We use $\tilde{x}$ to denote the mirror (or the reflection) of $x$ against the decision line in Fig. 6.1(a), as illustrated by the pair of green dots. Clearly, their responses are of opposite signs. The neuron response in the next stage is the sum of multiple response-weight products. Then, one cannot distinguish the following two cases:

• a positive response multiplied by a positive weight,
• a negative response multiplied by a negative weight;

since both contribute to the output positively. Similarly, one cannot distinguish the following two cases, either:

• a positive response multiplied by a negative weight,
• a negative response multiplied by a positive weight;

since both contribute to the output negatively. As a result, the roles of $x$ and its mirror $\tilde{x}$ are mixed together. The sign confusion problem was first pointed out in [57]. This problem can be resolved by the ReLU nonlinear activation.

6.2.3 Need of Multilayer Perceptron

Samples from multiple classes cannot, in general, be separated by one linear decision boundary. One simple example is given below.

Example 1 (XOR). The sample distribution of the XOR pattern is given in Fig. 6.2(a). It has four Gaussian blobs belonging to two classes. Each class corresponds to the "exclusive-OR" output of the coordinate signs of a blob. That is, Gaussian blobs located in the quadrants where the x-axis and y-axis values have the same sign belong to class 0; otherwise, they belong to class 1. The MLP has four layers, $l_{in}$, $l_1$, $l_2$ and $l_{out}$, and three stages of links. We will examine the design stage by stage in a feedforward manner.

6.2.3.1 Stage 1 (from $l_{in}$ to $l_1$)

Two partitioning lines are needed to separate the four Gaussian blobs: one vertical and one horizontal, as shown in the figure.
Based on the discussion in Sec. 6.2.1, we can determine two weight vectors, $w'$ and $w''$, which are orthogonal to the vertical and horizontal lines, respectively. In other words, we have two LDA systems that run in parallel between the input layer and the first intermediate layer of the MLP, as shown in Fig. 6.2(a). (We do not use the term "hidden" but "intermediate" since all middle layers in our feedforward design are explicit rather than implicit.) Layer $l_1$ has four neurons. They are partitioned into two pairs of similar color: blue and light blue as one pair, and orange and light orange as another pair. The responses of these four neurons are shown in the figure. By appearance, the dimension goes from two to four from $l_{in}$ to $l_1$. Actually, each pair of nodes offers a complementary representation in one dimension. For example, the blue and light blue nodes cover the negative and the positive regions of the horizontal axis, respectively.

6.2.3.2 Stage 2 (from $l_1$ to $l_2$)

To separate the four Gaussian blobs in Example 1 completely, we need layer $l_2$ as shown in Fig. 6.2(b). The objective is to have one single Gaussian blob represented by one neuron in layer $l_2$. We use the blob in the first quadrant as an example; it is the top node in layer $l_2$. The light blue and the orange nodes have nonzero responses in this region, and we can set their weights to 1 (in red). There is, however, a side effect: undesired responses in the second and the fourth quadrants are brought in as well. This side effect can be removed by subtracting the responses from the blue and the light orange nodes. In Fig. 6.2(b), we use red and black links to represent weights of 1 and $-P$, respectively, where $P$ is assumed to be a sufficiently large positive number (we set $P$ to 1000 in our experiments). With this assignment, we can preserve responses in the first quadrant and make responses in the other three quadrants negative. Since negative responses are clipped to zero by ReLU, we obtain the desired response.

6.2.3.3 Stage 3 (from $l_2$ to $l_{out}$)

The top two nodes belong to one class and the bottom two nodes belong to the other class in this example. The weight of a link is set to one if it connects a Gaussian blob to its class. Otherwise, it is set to zero. All bias terms are zero.

6.3 Design of Feedforward MLP (FF-MLP)

In this section, we generalize the MLP design of Example 1 so that it can handle an input composed of multiple Gaussian modalities belonging to multiple classes. The feedforward design procedure is stated in Sec. 6.3.1. Pruning of partitioning hyperplanes is discussed in Sec. 6.3.2. Finally, the designed FF-MLP architecture is summarized in Sec. 6.3.3.

6.3.1 Feedforward Design Procedure

We examine an $N$-dimensional sample space formed by $G$ Gaussian blobs belonging to $C$ classes, where $G \geq C$. The MLP architecture is shown in Fig. 6.3. Again, it has layers $l_{in}$, $l_1$, $l_2$ and $l_{out}$, whose neuron numbers are denoted by $D_{in}$, $D_1$, $D_2$ and $D_{out}$, respectively. Clearly, we have

$$D_{in} = N, \quad D_{out} = C. \qquad (6.5)$$

We will show that

$$D_1 \leq 2\binom{G}{2}, \quad G \leq D_2 \leq 2^{G(G-1)/2}. \qquad (6.6)$$

We examine the following three stages once more, but in a more general setting.

6.3.1.1 Stage 1 (from $l_{in}$ to $l_1$) - Half-Space Partitioning

When the input contains $G$ Gaussian blobs of identical covariances, we can select any two to form a pair and use a two-class LDA to separate them. Since there are $L = \binom{G}{2} = G(G-1)/2$ pairs, we can run $L$ LDA systems in parallel and, as a result, the first intermediate layer has $2L$ neurons.
This is an upper bound since some partitioning hyperplanes can sometimes be pruned. Each LDA system corresponds to an $(N-1)$-dimensional hyperplane that partitions the $N$-dimensional input space into two half-spaces represented by a pair of neurons. The weights of the incident links are the normal vectors of the hyperplane with opposite signs. The bias term can also be determined analytically. Interpreting MLPs as separating hyperplanes is not new (see the discussion of previous work in Sec. 6.7.2).

6.3.1.2 Stage 2 (from $l_1$ to $l_2$) - Subspace Isolation

The objective of Stage 2 is to isolate each Gaussian blob in one subspace represented by one or more neurons in the second intermediate layer. As a result, we need $G$ or more neurons to represent $G$ Gaussian blobs. By isolation, we mean that the responses of the Gaussian blob of interest are preserved while those of other Gaussian blobs are either zero or negative. Then, after ReLU, only one Gaussian blob of positive responses is preserved (or activated) while the others are clipped to zero (or deactivated). We showed how to isolate a Gaussian blob in Example 1. This process can be stated precisely as follows.

We denote the set of $L$ partitioning hyperplanes by $H_1, H_2, \cdots, H_L$. Hyperplane $H_l$ divides the whole input space into two half-spaces, $S_{l,0}$ and $S_{l,1}$. Since a neuron of layer $l_1$ represents a half-space, we can use $S_{l,0}$ and $S_{l,1}$ to label the pair of neurons that supports hyperplane $H_l$. The intersection of the $L$ half-spaces yields a region denoted by $R$, which is represented by the binary sequence $s(R) = "c_1, c_2, \cdots, c_L"$ of length $L$ if it lies in half-space $S_{l,c_l}$, $l = 1, 2, \cdots, L$. There are at most $2^L = 2^{G(G-1)/2}$ partitioned regions, and each of them is represented by a binary sequence $s(R)$ in layer $l_2$. We assign weight one to the link from neuron $S_{l,c_l}$ in layer $l_1$ to neuron $s(R) = "c_1, c_2, \cdots, c_L"$ in layer $l_2$, and weight $-P$ to the link from neuron $S_{l,\bar{c}_l}$, where $\bar{c}_l$ is the logical complement of $c_l$, to the same neuron in layer $l_2$.

The upper bound on $D_2$ given in Eq. (6.6) is a direct consequence of the above derivation. However, we should point out that this bound is actually not tight. There is a tighter upper bound for $D_2$ (by the Steiner-Schläfli Theorem (1850), as cited in https://www.math.miami.edu/~armstrong/309sum19/309sum19notes.pdf, p. 21), which is

$$D_2 \leq \sum_{i=0}^{N} \binom{L}{i}. \qquad (6.7)$$

6.3.1.3 Stage 3 (from $l_2$ to $l_{out}$) - Class-wise Subspace Mergence

Each neuron in layer $l_2$ represents one Gaussian blob, and it has $C$ outgoing links. Only one of them has weight equal to one while the others have weight equal to zero, since the blob belongs to only one class. Thus, we connect a Gaussian blob to its target class with weight "one" and delete it from the other classes with weight "zero". (In our experiments, we determine the target class of each region using the majority class in that region.)

Since our MLP design does not require end-to-end optimization, no backpropagation is needed in the training. It is interpretable, and its weight assignment is done in a feedforward manner. To differentiate it from the traditional MLP based on backpropagation optimization, we name the traditional one "BP-MLP" and ours "FF-MLP". FF-MLP demands knowledge of the sample distributions, which is provided by the training samples. For general sample distributions, we can approximate the distribution of each class by a Gaussian mixture model (GMM). Then, we can apply the same technique as developed before. (In our implementation, we first estimate the GMM parameters using the training data and then use the GMM to generate samples for the LDAs.)
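A minimal sketch of this per-class GMM approximation step is given below (our own simplification using scikit-learn; the component number is the hyper-parameter discussed next).

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmms(features, labels, components_per_class=2):
    """Fit one GMM per class; the resulting Gaussian components act as the blobs."""
    gmms = {}
    for c in np.unique(labels):
        gmms[c] = GaussianMixture(n_components=components_per_class,
                                  covariance_type='full').fit(features[labels == c])
    # gmms[c].means_ and gmms[c].covariances_ give the blob parameters;
    # samples for the pairwise LDAs can be drawn with gmms[c].sample(n).
    return gmms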
The number of mixture components is a hyper-parameter in our design. It affects the quality of the approximation. When the number is too small, it may not represent the underlying distribution well; when the number is too large, it may increase the computation time and network complexity. More discussion on this topic is given in Example 4. It is worth noting that the Gaussian blobs obtained by this method are not guaranteed to have the same covariances. Since we fit a GMM on the samples of each class separately (in our experiments, we allow different covariance matrices for different components since we do not compute LDA among blobs of the same class), it is hard to control the covariances of blobs of different classes. This does not meet the homoscedasticity assumption of LDA. In the current design, we apply LDA [91] to separate the blobs even if they do not share the same covariances. Improvement is possible by adopting heteroscedastic variants of LDA [38].

6.3.2 Pruning of Partitioning Hyperplanes

Stage 1 in our design gives an upper bound on the neuron and link numbers at the first intermediate layer. To give an example, we have 4 Gaussian blobs in Example 1 while the presented MLP design has only 2 LDA systems, which is significantly lower than the upper bound of $\binom{4}{2} = 6$. The number of LDA systems can be reduced because some partitioning hyperplanes are shared in splitting multiple pairs. In Example 1, one horizontal line partitions two Gaussian pairs and one vertical line also partitions two Gaussian pairs, so four hyperplanes degenerate to two. Furthermore, the 45- and 135-degree lines can be deleted since the union of the horizontal and the vertical lines achieves the same effect.

We would like to emphasize that the redundant design may not affect the classification performance of the MLP. In other words, pruning may not be essential. Yet, it may be desirable to reduce the MLP model size with little training accuracy degradation in some applications. For this reason, we present a systematic way to reduce the link number between $l_{in}$ and $l_1$ and the neuron number $D_1$ in $l_1$. We begin with a full design that has $M = G(G-1)/2$ LDA systems in Stage 1. Thus, $D_1 = 2M$ and the number of links between $l_{in}$ and $l_1$ is equal to $2NM$. We develop a hyperplane pruning process based on the importance of each LDA system, using the following steps.

1. Delete one LDA and keep the remaining LDAs the same in Stage 1. The input space can still be split by them.

2. For each partitioned subspace enclosed by the remaining partitioning hyperplanes, use the majority class as the prediction for all samples in the subspace. Compute the total number of misclassified samples in the training data.

3. Repeat Steps 1 and 2 for each LDA. Compare the number of misclassified samples resulting from each LDA deletion, and rank the importance of each LDA based on the impact of its deletion. An LDA is more important if its deletion yields a higher error rate.

We can delete the "least important" LDA if the resulting error rate is lower than a pre-defined threshold. Since there might exist correlations between multiple deleted LDA systems, it is risky to delete multiple LDA systems simultaneously. Thus, we delete one partitioning hyperplane (equivalently, one LDA) at a time, run the pruning algorithm again, and evaluate the next LDA for pruning.
It is worth noting that one neuron in l_2 may cover multiple Gaussian blobs of the same class after pruning, since the separating hyperplanes between Gaussian blobs of the same class may have little impact on the classification accuracy. (In our implementation, we do not perform LDA between Gaussian blobs of the same class in Stage 1 in order to save computation.)

6.3.3 Summary of FF-MLP Architecture and Link Weights

We can summarize the discussion in this section as follows. The FF-MLP architecture and its link weights are fully specified by the parameter L, which is the number of partitioning hyperplanes:

1. the number of first-layer neurons is D_1 = 2L,
2. the number of second-layer neurons is D_2 <= 2^L,
3. link weights in Stage 1 are determined by each individual two-class LDA,
4. link weights in Stage 2 are either 1 or -P, and
5. link weights in Stage 3 are either 1 or 0.

6.4 Illustrative Examples

In this section, we provide more examples to illustrate the FF-MLP design procedure. The training sample distributions of several 2D examples are shown in Fig. 6.4.

Example 2 (3-Gaussian-blobs). There are three Gaussian-distributed classes in blue, orange and green as shown in Fig. 6.4(b). The three Gaussian blobs have identical covariance matrices. We use three lines and D_1 = 6 neurons in layer l_1 to separate them in Stage 1. Fig. 6.5(a) shows the neuron responses in layer l_1. We see three neuron pairs: 0 and 3, 1 and 4, and 2 and 5. In Stage 2, we would like to isolate each Gaussian blob in a certain subspace. However, due to the shape of the activated regions in Fig. 6.5(b), we need two neurons to preserve one Gaussian blob in layer l_2. For example, neurons 0 and 1 can preserve the Gaussian blob at the top as shown in Fig. 6.5(b). If the three partitioning lines h_1, h_2, h_3 intersect at nearly the same point, as illustrated in Fig. 6.8(a), we have 6 nonempty regions instead of 7. Our FF-MLP design has D_in = 2, D_1 = 6, D_2 = 6 and D_out = 3. The training accuracy and the testing accuracy of the designed FF-MLP are 99.67% and 99.33%, respectively. This shows that the MLP splits the training data almost perfectly and fits the underlying data distribution very well.

Example 3 (9-Gaussian-blobs). It contains 9 Gaussian blobs with the same covariance matrices aligned in 3 rows and 3 columns as shown in Fig. 6.4(c). In Stage 1, we need at most C(9,2) = 36 separating lines (in implementation, we only generate the 27 lines that separate blobs of different classes). We run the pruning algorithm with the error threshold equal to 0.3 and reduce the separating lines to 4, each of which partitions adjacent rows or columns as expected. Then, there are D_1 = 8 neurons in layer l_1. The neuron responses in layer l_1 are shown in Fig. 6.6(a). They are grouped in four pairs: 0 and 4, 1 and 5, 2 and 6, and 3 and 7. In Stage 2, we form 9 regions, each of which contains one Gaussian blob as shown in Fig. 6.6(b). As shown in Fig. 6.8, we have two pairs of nearly parallel partitioning lines, so only 9 nonempty regions are formed in the finite range. The training and testing accuracy are 88.11% and 88.83%, respectively. The error threshold affects the number of partitioning lines. When we set the error threshold to 0.1, we have 27 partitioning lines and 54 neurons in layer l_1.
By doing so, we preserve all needed boundaries between blobs of different classes. The training and testing accuracy are 89.11% and 88.58%, respectively. The performance difference is very small in this case.

Example 4 (Circle-and-Ring). It contains an inner circle as one class and an outer ring as the other class, as shown in Fig. 6.4(d) [91]. To apply our MLP design, we use one Gaussian blob to model the inner circle and approximate the outer ring with 4 and 16 Gaussian components, respectively. For the case of 4 Gaussian components, blobs of different classes can be separated by 4 partitioning lines. By using a larger number of blobs, we may obtain a better approximation to the original data. The corresponding classification results are shown in Figs. 6.12(a) and (b). We see that the decision boundary with 16 Gaussian components is smoother than that with 4 Gaussian components.

Example 5 (2-New-Moons). It contains two interleaving new moons as shown in Fig. 6.4(e) [91]. Each new moon corresponds to one class. We use 2 Gaussian components for each class and show the samples generated from the fitted GMMs in Fig. 6.9(a), which appear to be a good approximation to the original data visually. By applying our design to the Gaussian blobs, we obtain the classification result shown in Fig. 6.9(b), which is very similar to the ground truth (see Table 6.1).

Example 6 (4-New-Moons). It contains four interleaving new moons as shown in Fig. 6.4(f) [91], where each moon is a class. We set the number of blobs to 3 for each moon and the error threshold to 0.05. There are 9 partitioning lines and 18 neurons in layer l_1, which in turn yield 28 region neurons in layer l_2. The classification results are shown in Fig. 6.10. We can see that the predictions are similar to the ground truth and fit the underlying distribution quite well. The training accuracy is 95.75% and the testing accuracy is 95.38%.

6.5 Observations on BP-MLP Behavior

Even when FF-MLP and BP-MLP adopt the same MLP architecture designed by our proposed method, BP-MLP differs from FF-MLP in two ways: 1) backpropagation (BP) optimization of the cross-entropy loss function and 2) network initialization. We report observations on the effects of BP and of different initializations in Secs. 6.5.1 and 6.5.2, respectively.

6.5.1 Effect of Backpropagation (BP)

We initialize a BP-MLP with the weights and biases of the FF-MLP design and conduct BP using the stochastic gradient descent (SGD) optimizer with a 0.01 learning rate and zero momentum. We observe four representative cases and show the training and testing accuracy curves as functions of the epoch number in Figs. 6.11(a)-(e).

• BP has very little effect. One such example is the 3-Gaussian-blobs case. Both training and testing curves remain at the same level, as shown in Fig. 6.11(a).

• BP has little effect on training but a negative effect on testing. The training and testing accuracy curves for the 9-Gaussian-blobs case are plotted as functions of the epoch number in Fig. 6.11(b). The network has 8 neurons in l_1 and 9 neurons in l_2, which is the same network architecture as in Fig. 6.7(a). Although the training accuracy remains at the same level, the testing accuracy fluctuates with several drastic drops. This behavior is difficult to explain, indicating the interpretability challenge of BP-MLP.

• BP has negative effects on both training and testing.
The training and testing accuracy curves for the 4-new-moons case are plotted as functions of the epoch number in Fig. 6.11(c). Both training and testing accuracy fluctuate, and several drastic drops are observed for the testing accuracy. As shown in Table 6.1, the final training and testing accuracy are lower than the FF-MLP results. The predictions for the training samples are shown in Fig. 6.14(a), which are worse than those in Fig. 6.10. Another example is the 9-Gaussian-blobs case with the error threshold equal to 0.1. The training and testing accuracy decrease during training, and the final training and testing accuracy are lower than those of FF-MLP, as shown in Table 6.1.

• BP has positive effects on both training and testing. For the circle-and-ring example with the outer ring approximated by 16 components, the predictions for the training samples after BP are shown in Fig. 6.12(d). We can see the improvement in the training and testing accuracy in Table 6.1. However, similar to the 9-Gaussian-blobs and 4-new-moons cases, we also observe drastic drops of the testing accuracy during training.

6.5.2 Effect of Initializations

We compare different initialization schemes for BP-MLP. One is to use the FF-MLP weights and the other is to use random initialization. We have the following observations.

• Either initialization works. For some datasets, such as the 3-Gaussian-blobs data, the final classification performance is similar using either random initialization or FF-MLP initialization.

• In favor of the FF-MLP initialization. We compare the BP-MLP classification results for 9-Gaussian-blobs with FF-MLP and random initializations in Fig. 6.13. The network has 8 neurons in l_1 and 9 neurons in l_2, which is the same network architecture as in Fig. 6.7(a). The advantage of the FF-MLP initialization is well preserved by BP-MLP. In contrast, with random initialization, BP tends to find smooth boundaries to split the data, which do not fit the underlying source data distribution in this case.

• Both initializations fail. We compare the BP-MLP classification results for 4-new-moons with FF-MLP and random initializations in Fig. 6.14. The result with random initialization fails to capture the concave moon shape. Generally speaking, BP-MLP with random initialization tends to over-simplify the boundaries and the data distribution, as observed in both the 9-Gaussian-blobs and 4-new-moons cases.

6.6 Experiments

6.6.1 Classification Accuracy for 2D Samples

We compare the training and testing classification performance among FF-MLP, BP-MLP with FF-MLP initialization, and BP-MLP with random initialization for Examples 1-6 in the last section in Table 6.1. The integers in parentheses in the first row for BP-MLP are the epoch numbers. In the first column, the numbers in parentheses for the 9-Gaussian-blobs are error thresholds, and the numbers in parentheses for the circle-and-ring are the numbers of Gaussian components used to approximate the outer ring. For BP-MLP with random initialization, we report means and standard deviations of classification accuracy over 5 runs. We used the Xavier uniform initializer [34] for random initialization. We trained the network for two different epoch numbers, namely 15 epochs and 50 epochs, in separate runs.
Dataset | FF-MLP (train / test) | BP-MLP, FF-MLP init. (50) (train / test) | BP-MLP, random init. (50) (train / test) | BP-MLP, random init. (15) (train / test)
XOR | 100.00 / 99.83 | 100.00 / 99.83 | 99.83±0.16 / 99.42±0.24 | 93.20±11.05 / 92.90±11.06
3-Gaussian-blobs | 99.67 / 99.33 | 99.67 / 99.33 | 99.68±0.06 / 99.38±0.05 | 99.48±0.30 / 99.17±0.48
9-Gaussian-blobs (0.1) | 89.11 / 88.58 | 70.89 / 71.08 | 84.68±0.19 / 85.75±0.24 | 78.71±2.46 / 78.33±3.14
9-Gaussian-blobs (0.3) | 88.11 / 88.83 | 88.06 / 88.58 | 81.62±6.14 / 81.35±7.29 | 61.71±9.40 / 61.12±8.87
circle-and-ring (4) | 88.83 / 87.25 | 89.00 / 86.50 | 81.93±7.22 / 82.80±5.27 | 70.57±13.42 / 71.25±11.27
circle-and-ring (16) | 83.17 / 80.50 | 85.67 / 88.00 | 86.20±1.41 / 85.05±1.85 | 66.20±9.33 / 65.30±11.05
2-new-moons | 88.17 / 91.25 | 88.17 / 91.25 | 83.97±1.24 / 87.60±0.52 | 82.10±1.15 / 86.60±0.58
4-new-moons | 95.75 / 95.38 | 87.50 / 87.00 | 86.90±0.25 / 84.00±0.33 | 85.00±0.98 / 82.37±0.76
Table 6.1: Comparison of training and testing classification performance between FF-MLP, BP-MLP with FF-MLP initialization, and BP-MLP with random initialization. The best (mean) training and testing accuracy are highlighted in bold.

Let us focus on test accuracy. FF-MLP has better test performance with the following settings:
• XOR (99.83%);
• 9-Gaussian-blobs with error threshold 0.1 or 0.3 (88.58% and 88.83%, respectively);
• circle-and-ring with 4 Gaussian components for the outer ring (87.25%);
• 2-new-moons (91.25%);
• 4-new-moons (95.38%).
Under these settings, FF-MLP outperforms BP-MLP with either FF-MLP initialization or random initialization.

6.6.2 Classification Accuracy for Higher-Dimensional Samples

Besides 2D samples, we test FF-MLP on four higher-dimensional datasets. The datasets are described below.

• Iris Dataset. The Iris plants dataset [91, 31] is a classification dataset with 3 different classes and 150 samples in total. The input dimension is 4.

• Wine Dataset. The Wine recognition dataset [91, 30] has 3 classes with 59, 71, and 48 samples, respectively. The input dimension is 13.

• B.C.W. Dataset. The breast cancer Wisconsin (B.C.W.) dataset [91, 30] is a binary classification dataset. It has 569 samples in total. The input dimension is 30.

• Pima Indians Diabetes Dataset. The Pima Indians diabetes dataset [107] is for diabetes prediction. It is a binary classification dataset with 768 8-dimensional entries. (We used the data from https://www.kaggle.com/uciml/pima-indians-diabetes-database for our experiments.) We removed the samples with the physically impossible value zero for glucose, diastolic blood pressure, triceps skin fold thickness, insulin, or BMI and used only the remaining 392 samples.

We report the training and testing accuracy results of FF-MLP and of BP-MLP with random initialization trained with 15 and 50 epochs in Table 6.2. For BP-MLP, the means and standard deviations of classification accuracy over 5 runs are reported.

Dataset | D_in | D_out | D_1 | D_2 | FF-MLP (train / test) | BP-MLP, random init. (50) (train / test) | BP-MLP, random init. (15) (train / test)
Iris | 4 | 3 | 4 | 3 | 96.67 / 98.33 | 65.33±23.82 / 64.67±27.09 | 47.11±27.08 / 48.33±29.98
Wine | 13 | 3 | 6 | 6 | 97.17 / 94.44 | 85.66±4.08 / 79.72±9.45 | 64.34±7.29 / 61.39±8.53
B.C.W. | 30 | 2 | 2 | 2 | 96.77 / 94.30 | 95.89±0.85 / 97.02±0.57 | 89.79±2.41 / 91.49±1.19
Pima | 8 | 2 | 18 | 88 | 91.06 / 73.89 | 80.34±1.74 / 75.54±0.73 | 77.02±2.89 / 73.76±1.45
Table 6.2: Training and testing accuracy results of FF-MLP and BP-MLP with random initialization for four higher-dimensional datasets. The best (mean) training and testing accuracy are highlighted in bold.

For the Iris dataset, we set the number of Gaussian components to 2 in each class. The error threshold is set to 0.05. There are 2 partitioning hyperplanes and, thus, 4 neurons in layer l_1.
The testing accuracy for FF-MLP is 98.33%. For BP-MLP with random initialization trained for 50 epochs, the mean test accuracy is 64.67%. In this case, it seems that the network may have been too small for BP-MLP to arrive at a good solution.

For the Wine dataset, we set the number of Gaussian components to 2 in each class. The error threshold is set to 0.05. There are 3 partitioning hyperplanes and thus 6 neurons in layer l_1. The testing accuracy for FF-MLP is 94.44%. Under the same architecture, BP-MLP with random initialization and 50 epochs gives a mean test accuracy of 79.72%. FF-MLP outperforms BP-MLP.

For the B.C.W. dataset, we set the number of Gaussian components to 2 per class. The error threshold is set to 0.05. There is 1 partitioning hyperplane and thus 2 neurons in layer l_1. The testing accuracy for FF-MLP is 94.30%. The mean testing accuracy for BP-MLP is 97.02%. BP-MLP outperforms FF-MLP on this dataset.

Finally, for the Pima Indians dataset, we set the number of Gaussian components to 4 per class. The error threshold is set to 0.1. There are 9 partitioning hyperplanes and thus 18 neurons in layer l_1. The testing accuracy for FF-MLP is 73.89% while that for BP-MLP is 75.54% on average. BP-MLP outperforms FF-MLP in terms of testing accuracy.

6.6.3 Computational Complexity

We compare the computational complexity of FF-MLP and BP-MLP in Table 6.3. A Tesla V100-SXM2-16GB GPU was used in the experiments. For FF-MLP, we show the time of each step of our design and its sum in the total column. The steps include: 1) preprocessing with GMM approximation (Stage 0), 2) fitting partitioning hyperplanes and redundant hyperplane deletion (Stage 1), 3) Gaussian blob isolation (Stage 2), and 4) assigning each isolated region to its corresponding output class (Stage 3). For comparison, we show the training time of BP-MLP for 15 epochs and 50 epochs (in two separate runs) under the same network with random initialization.

Dataset | GMM | Boundary construction | Region representation | Class assignment | Total | BP (15) | BP (50)
XOR | 0.00000 | 0.01756 | 0.00093 | 0.00007 | 0.01855 | 2.88595±0.06279 | 8.50156±0.14128
3-Gaussian-blobs | 0.00000 | 0.01119 | 0.00126 | 0.00008 | 0.01253 | 2.78903±0.07796 | 8.26536±0.17778
9-Gaussian-blobs (0.1) | 0.00000 | 0.22982 | 0.00698 | 0.00066 | 0.23746 | 2.77764±0.14215 | 8.34885±0.28903
9-Gaussian-blobs (0.3) | 0.00000 | 2.11159 | 0.00156 | 0.00010 | 2.11325 | 2.79140±0.06179 | 8.51242±0.24676
circle-and-ring (4) | 0.02012 | 0.01202 | 0.00056 | 0.00006 | 0.03277 | 1.50861±0.14825 | 3.79068±0.28088
circle-and-ring (16) | 0.04232 | 0.05182 | 0.00205 | 0.00020 | 0.09640 | 1.43951±0.15573 | 3.80061±0.13775
2-new-moons | 0.01835 | 0.01111 | 0.00053 | 0.00006 | 0.03006 | 1.44454±0.06723 | 3.64791±0.08565
4-new-moons | 0.03712 | 11.17161 | 0.00206 | 0.00021 | 11.21100 | 1.98338±0.04357 | 5.71387±0.14150
Iris | 0.02112 | 0.02632 | 0.00011 | 0.00002 | 0.04757 | 0.73724±0.01419 | 1.60543±0.14658
Wine | 0.01238 | 0.03551 | 0.00015 | 0.00003 | 0.04807 | 0.81173±0.01280 | 1.72276±0.07268
B.C.W. | 0.01701 | 0.03375 | 0.00026 | 0.00003 | 0.05106 | 1.08800±0.05579 | 2.73232±0.12023
Pima | 0.03365 | 0.16127 | 0.00074 | 0.00039 | 0.19604 | 0.96707±0.03306 | 2.32731±0.10882
Table 6.3: Comparison of computation time in seconds of FF-MLP (left) and BP-MLP (right) with 15 and 50 epochs. The mean and standard deviation of computation time over 5 runs are reported for BP-MLP. The shortest (mean) running time is highlighted in bold.

As shown in Table 6.3, FF-MLP takes 1 second or less for the network architecture and weight design. The only exceptions are 4-new-moons and 9-Gaussian-blobs with an error threshold of 0.3.
Most computation time is spent on boundary construction, which includes hyperplane pruning. To determine the hyperplane to prune, we need to repeat the temporary hyperplane deletion and error rate computation for each hyperplane. This is a very time-consuming process. If we do not prune any hyperplane, the total computation time can be significantly shorter at the cost of a larger network size. Generally speaking, for non-Gaussian datasets, GMM approximation and the fitting of partitioning hyperplanes with redundant hyperplane deletion take most of the execution time among all steps. For datasets consisting of Gaussian blobs, no GMM approximation is needed. For BP-MLP, the time reported in Table 6.3 does not include network design time but only the training time with BP. It is apparent from the table that the running time of the FF-MLP design is generally shorter than that of the BP-MLP design.

6.7 Comments on Related Work

We would like to comment on prior work that has some connection with our current research.

6.7.1 BP-MLP Network Design

The design of BP-MLP network architectures has been studied by quite a few researchers in the last three decades. It is worthwhile to review efforts in this area as a contrast to our FF-MLP design. Two approaches to the design of BP-MLPs are examined in this and the next subsection: 1) viewing architecture design as an optimization problem and 2) adding neurons and/or layers as needed in the training process.

6.7.1.1 Architecture Design as an Optimization Problem

One straightforward idea for designing the BP-MLP architecture is to try different sizes and select the one that gives the best performance. A good search algorithm is needed to reduce the number of trials. For example, Doukim et al. [29] used a coarse-to-fine two-step search algorithm to determine the number of neurons in a hidden layer. First, a coarse search was performed to refine the search space. Then, a fine-scale sequential search was conducted near the optimum value obtained in the first step. Some methods were also proposed for joint structure and weight optimization. Ludermir et al. [70] used a method based on tabu search [35] and simulated annealing [52] to find the optimal architecture and parameters at the same time. A modified bat algorithm [126] was adopted in [48] to optimize the network parameters and architecture. The MLP network structure and its weights were optimized in [128] based on backpropagation, tabu search, heuristic simulated annealing, and genetic algorithms [36].

6.7.1.2 Constructive Neural Network Learning

In constructive design, neurons can be added to the network as needed to achieve better performance. Examples include [89, 79, 32]. To construct networks, an iterative weight update approach [88] can be adopted [79, 32, 112, 77]. Neurons were sequentially added to separate multiple groups in one class from the other class in [76]. Some samples of one class were separated from all samples of another class with a newly added neuron at each iteration. Newly separated samples are excluded from consideration in the search for the next neuron. The input space is split into pure regions that contain only samples of the same class, while the output of a hidden layer is proved to be linearly separable. Another idea for constructing networks is to leverage geometric properties [88], e.g., [125, 2, 75, 8, 6]. Spherical threshold neurons were used in [125].
Each neuron is activated if

\theta_{low} \le d(W, X) \le \theta_{high},   (6.8)

where W, X, \theta_{low} and \theta_{high} are the weight vector, the input vector, and the lower and higher thresholds, respectively, and d(\cdot) represents the distance function. That is, only samples between two spheres of radii \theta_{low} and \theta_{high} centered at W are activated. At each iteration, a newly added neuron excludes the largest number of samples of the same class in the activated region from the working set. The hidden representations are linearly separable by assigning weights to the last layer properly. This method can determine the thresholds and weights without backpropagation.

6.7.2 Relationship with Work on Interpretation of Neurons as Partitioning Hyperplanes

The interpretation of neurons as partitioning hyperplanes has been adopted by some researchers before. As described in Sec. 6.7.1.2, Marchand et al. [76] added neurons to split the input space into pure regions containing only samples of the same class. Liou et al. [67] proposed a three-layer network. Neurons in the first layer represent cutting hyperplanes. Their outputs are represented as binary codes because of the use of threshold activation. Each region corresponds to a vertex of a multidimensional cube. Neurons in the second layer represent another set of hyperplanes cutting through the cube. They split the cube into multiple parts again. Each part may contain multiple vertices, corresponding to a set of regions in the input space. The output of the second layer consists of vertices in a multidimensional cube. The output layer offers another set of cuts. Vertices on the same side are assigned the same class, which implies the merging of multiple regions. The weights and biases of the links can be determined by BP. Neurons are added if the samples cannot be well differentiated based on the binary code at the layer output, which is done layer by layer. Cabrelli et al. [10] proposed a two-layer network for convex recursive deletion (CoRD) regions based on threshold activation. They also interpreted hidden layers as separating hyperplanes that split the input space into multiple regions. For each neuron, a region evaluates to 0 or 1 depending on which side of the hyperplane it lies in. Geometrically, the data points in each region lie at a vertex of a multidimensional cube. Their method does not require BP to update weights.

The results in [76, 67, 10] rely on threshold neurons. One clear drawback of threshold neurons is that they only have binary 0/1 outputs. A lot of information is lost if the input has a response range (rather than a binary value). FF-MLP uses the ReLU activation, so continuous responses can be well preserved. Clearly, FF-MLP is much more general. Second, if a sample point is farther away from the boundary, its response is larger. We should pay attention to both the side of the partitioning hyperplane on which a sample lies and the distance of the sample from the hyperplane. Generally, the response value carries information about the point position. This affects prediction confidence and accuracy. Preservation of continuous responses helps boost the classification accuracy.

We should point out one critical reason for the simplicity of the FF-MLP design. That is, in the first layer we are not concerned with separating classes but with separating Gaussian blobs. Let us take the 9-Gaussian-blobs belonging to 3 classes as an example. There are actually 3^9 ways to assign 9 Gaussian blobs to three classes. If we were concerned with class separation, it would be very complicated.
Instead, we isolate each Gaussian blob through layers l_1 and l_2 and leave the blob-class association to the output layer. This strategy greatly simplifies our design.

6.7.3 Relationship with LDA and SVM

An LDA system can be viewed as a perceptron with specific weights and bias, as explained in Sec. 6.2. The main difference between an LDA system and a perceptron is the method used to obtain their model parameters. While the model parameters of an LDA system are determined analytically from the training data distribution, the parameters of an MLP are usually learned via BP. There is a common ground between FF-MLP and the support-vector machine (SVM) classifier. That is, in its basic form, an SVM constructs the maximum-margin hyperplane to separate two classes, where the two-class LDA plays a role as well. Here, we would like to emphasize their differences.

• SVM contains only one stage while FF-MLP contains multiple stages in cascade. Since there is no sign confusion problem with SVM, no ReLU activation is needed. For nonlinear SVM, nonlinearity comes from nonlinear kernels. It is essential for FF-MLP to have ReLU to filter out negative responses and avoid the sign confusion problem caused by the multi-stage cascade.

• The multi-class SVM is built upon the integration of multiple two-class SVMs. Thus, its complexity grows quickly with the number of underlying classes. In contrast, FF-MLP partitions Gaussian blobs using LDA in the first stage and connects each Gaussian blob to its own class type in Stage 3.

We can use a simple example to illustrate the second item. If there are G Gaussian blobs belonging to C classes, we have C^G ways of defining the blob-class association. All of them can be easily solved by a single FF-MLP with a slight modification of the link weights in Stage 3. In contrast, each association demands one SVM solution; we would need C^G SVM solutions in total.

6.7.4 Relationship with Interpretable Feedforward CNN

Kuo et al. gave an interpretation to convolutional neural networks (CNNs) and proposed a feedforward design in [60], where all CNN parameters can be obtained in a one-pass feedforward fashion without back-propagation. The convolutional layers of CNNs were interpreted as a sequence of spatial-spectral signal transforms. The Saak and the Saab transforms were introduced in [59] and [60], respectively, for this purpose. The fully-connected (FC) layers of CNNs were viewed as the cascade of multiple linear least-squares regressors. The work on interpretable CNNs was built upon the mathematical foundation in [57, 58]. Recently, Kuo et al. developed a new machine learning theory called "successive subspace learning (SSL)" and applied it to a few applications such as image classification [13] [14], 3D point cloud classification [132], [130], face gender classification [96], etc.

Although the FC layers of CNNs can be viewed as an MLP, we should point out one difference between CNN/FC layers and classical MLPs. The input to CNNs is raw data (e.g., images) or near-raw data (e.g., spectrograms in audio processing). The convolutional layers of CNNs play the role of feature extraction. The output from the last convolutional layer is a high-dimensional feature vector, and it serves as the input to the FC layers for decision. Typically, the neuron numbers in the FC layers are monotonically decreasing. In contrast, the input to classical MLPs is a feature vector of lower dimension. The neuron number of intermediate layers may go up and down. It is not as regular as the CNN/FC layers.
6.8 Conclusion and Future Work

We made an explicit connection between the two-class LDA and the MLP design and proposed a general MLP architecture that contains two intermediate layers, denoted by l_1 and l_2, in this work. The design consists of three stages: 1) Stage 1: from input layer l_in to l_1; 2) Stage 2: from intermediate layer l_1 to l_2; 3) Stage 3: from intermediate layer l_2 to l_out. The purpose of Stage 1 is to partition the whole input space flexibly into a few half-subspace pairs. The intersection of these half-subspaces yields many regions of interest. The objective of Stage 2 is to isolate each region of interest from the other regions of the input space. Finally, we connect each region of interest to its associated class based on training data. We use Gaussian blobs to illustrate regions of interest with simple examples. In practice, we can leverage GMM to approximate datasets of general shapes. The proposed MLP design can determine the MLP architecture (i.e., the number of neurons in layers l_1 and l_2) and the weights of all links in a feedforward one-pass manner without any trial and error. Experiments were conducted extensively to compare the performance of FF-MLP and the traditional BP-MLP.

There are several possible research directions for future extension. First, it is worthwhile to develop a systematic pruning algorithm to reduce the number of partitioning hyperplanes in Stage 1. Our current pruning method is a greedy search algorithm, where the hyperplane that has the least impact on the classification performance is deleted first. It does not guarantee global optimality. Second, it is well known that BP-MLP is vulnerable to adversarial attacks. It is important to check whether FF-MLP encounters the same problem. Third, we did not observe any advantage of using backpropagation optimization on the designed FF-MLP in Sec. 6.6. One conjecture is that the size of the proposed FF-MLP is too compact; BP may help if the network size is larger. This conjecture awaits further investigation. Fourth, for a general sample distribution, we approximate the distribution with Gaussian blobs that may have different covariances. This does not meet the homoscedasticity assumption of the two-class LDA. It is possible to improve the system by replacing LDA with heteroscedastic variants.

Acknowledgment

This material is based on research sponsored by US Army Research Laboratory (ARL) under contract number W911NF2020157. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of US Army Research Laboratory (ARL) or the U.S. Government. Computation for the work was supported by the University of Southern California's Center for High Performance Computing (hpc.usc.edu).

Figure 6.1: Illustration of (a) two Gaussian blobs separated by a line in a 2D plane, and (b) a one-layer two-neuron perceptron system.
Figure 6.2: MLP design for Example 1 (XOR): (a) weights between input layer l_in and layer l_1, (b) weights between layer l_1 and layer l_2, where red links represent a weight of 1 and black links represent a weight of -P, where P is assumed to be a sufficiently large positive number, (c) weights between l_2 and l_out, where red links represent a weight of 1 and black links represent a weight of 0.
Figure 6.3: Proposed MLP design: the MLP network with two intermediate layers l_1 and l_2, where neurons in layer l_1 are drawn in pairs (e.g., blue and light blue nodes) representing two sides of a decision boundary and where each neuron in layer l_2 represents an isolated region of interest.
Figure 6.4: Training sample distributions of 2D examples: (a) Example 1: XOR, (b) Example 2: 3-Gaussian-blobs, (c) Example 3: 9-Gaussian-blobs, (d) Example 4: circle-and-ring, (e) Example 5: 2-new-moons and (f) Example 6: 4-new-moons, where different classes are in different colors.
Figure 6.5: Response heatmaps for the 3 Gaussian blobs in Example 2 in (a) layer l_1 and (b) layer l_2.
Figure 6.6: Neuron responses of the 9-Gaussian-blobs in Example 3 in (a) layer l_1 and (b) layer l_2.
Figure 6.7: Comparison of classification results of FF-MLP for Example 3 with two different error thresholds: (a) Th=0.3 and (b) Th=0.1.
Figure 6.8: Partitioning lines for (a) Example 2 and (b) Example 3.
Figure 6.9: Visualization of the 2-new-moons example: (a) the random samples generated from the fitted GMMs with 2 components per class and (b) the classification result of FF-MLP with error threshold equal to 0.1.
Figure 6.10: Classification results of FF-MLP for Example 6, where each new moon is approximated by 3 Gaussian components.
Figure 6.11: Training and testing accuracy curves of BP-MLP as functions of the epoch number for (a) 3-Gaussian-blobs, (b) 9-Gaussian-blobs (Th=0.3), (c) 4-new-moons, (d) 9-Gaussian-blobs (Th=0.1) and (e) circle-and-ring, where the network is initialized by the proposed FF-MLP.
Figure 6.12: Classification results for the circle-and-ring example with (a) FF-MLP, 4 components, (b) FF-MLP, 16 components, (c) BP-MLP with FF-MLP initialization, 4 components, and (d) BP-MLP with FF-MLP initialization, 16 components.
Figure 6.13: Comparison of BP-MLP classification results for 9-Gaussian-blobs with (a) FF-MLP initialization and (b) random initialization.
Figure 6.14: Comparison of BP-MLP classification results for 4-new-moons with (a) FF-MLP initialization and (b) random initialization.

Chapter 7
Bridging Multi-Layer Perceptron (MLP) and Piecewise Low-Order Polynomial Approximators

7.1 Introduction

Piecewise low-order polynomial approximation (or regression) is a classic topic in numerical analysis. It fits multiple low-order polynomials to a function in partitioned intervals. The approximation capability of piecewise low-order polynomials is well understood.
In contrast, the representation power of neural networks is not as obvious. The representation power of neural networks has been studied for three decades, and universal approximation theorems in various forms have been proved [20, 44, 17, 11, 101]. These proofs indicate that neural networks with appropriate activations can represent a wide range of functions of interest. Their goal was to make the theorems as general as possible. The proofs were based on tools from functional analysis (e.g., the Hahn-Banach theorem [97]) and real analysis (e.g., the Sprecher-Kolmogorov theorem [109, 53]), which are less well known in the signal processing community. Furthermore, although they show the existence of a solution, most of them do not offer a specific MLP design. The lack of MLP design examples hinders our understanding of the behavior of neural networks. A feedforward construction of an MLP with two intermediate layers was presented to solve a classification problem in [66]. As a sequel to [66], this work addresses the regression problem.

For ease of presentation, our treatment will focus on the approximation of a univariate function. We adopt a signal processing (or numerical analysis) approach to tackle it. Through the design, we can explain intuitively how MLPs offer universal approximations. The construction includes the choice of activation functions, neuron numbers, and filter weights between layers. A one-to-one correspondence between an MLP and a piecewise low-order polynomial will be established. Except for the trivial case of the unit step activation function, the results on MLP construction and its equivalence to piecewise low-order polynomial approximators are new. Another interesting observation is the critical role played by activation functions: activation functions actually define the kernels of the piecewise low-order polynomials. Since the approximation error of piecewise low-order polynomials is well understood, our findings shed light on the universal approximation capability of an MLP as well.

The rest of this paper is organized as follows. The construction of MLPs corresponding to piecewise constant, linear and cubic polynomial approximations is detailed in Sec. 7.2. The generalization from the 1-D case to multivariate vector functions is sketched in Sec. 7.3. Concluding remarks and future research directions are given in Sec. 7.4.

7.2 Proposed MLP Constructions

Without loss of generality, we consider a continuous function f(x) defined on the interval x \in [0, 1], which is uniformly divided into N sub-intervals. Let h = 1/N be the length of each subinterval and x_i = ih, where i = 0, 1, ..., N. We show how to construct an MLP so that it serves as a piecewise constant or a piecewise linear approximation to f(x) in this section.

7.2.1 Piecewise Constant Approximation

A piecewise constant approximation of f(x) can be written as

f(x) \approx f_c(x) = \sum_{i=0}^{N-1} f(x_i) b_i(x),   (7.1)

where

b_i(x) = 1 if x_i \le x < x_{i+1}, and 0 otherwise,   (7.2)

is the unit box function. We can rewrite the unit box function as the difference of two unit step functions,

b_i(x) = u(x - x_i) - u(x - x_{i+1}),   (7.3)

where

u(x) = 1 if x \ge 0, and 0 otherwise,   (7.4)

is the unit step function. Then, we can rewrite Eq. (7.1) as

f_c(x) = f(x_0) u(x - x_0) + \sum_{i=1}^{N-1} [f(x_i) - f(x_{i-1})] u(h^{-1}(x - x_i)).   (7.5)

Our designed MLP consists of one input node denoting x \in [0, 1], one output node denoting f(x), and N intermediate nodes (or neurons), denoted by R_j, where j = 0, 1, ..., N-1.
We use \alpha_j and \beta_j to denote the weight and bias between the input node R_in and R_j, and \tilde{\alpha}_j and \tilde{\beta}_j to denote the weight and bias between R_j and the output node R_out, respectively. The response at R_j before activation, denoted by y_j, and the response at R_out, denoted by z, can be written as

y_j = \alpha_j x + \beta_j,   (7.6)

z = \sum_{j=0}^{N-1} [\tilde{\alpha}_j Act(y_j) + \tilde{\beta}_j],   (7.7)

where Act is the activation function, \alpha_j and \tilde{\alpha}_j are weights, and \beta_j and \tilde{\beta}_j are biases. With the unit step function as activation and Eq. (7.5), it is easy to verify that we can choose the following weights and biases for the MLP:

R_in => R_j: \alpha_j = h^{-1}, \beta_j = -h^{-1} x_j;
R_j => R_out: \tilde{\alpha}_j = f(x_j) - f(x_{j-1}), \tilde{\beta}_j = 0,   (7.8)

where f(x_{-1}) = 0. The above derivation was given in [101], and it is repeated here for the sake of completeness.

7.2.2 Piecewise Linear Approximation

A commonly used nonlinear activation function is the rectified linear unit (ReLU), which can be written as

r(x) = max(0, x).   (7.9)

We will construct an MLP using r(x) as the nonlinear activation function and show its equivalence to a piecewise linear approximation to f(x). The latter can be expressed as

f(x) \approx f_l(x) = \sum_{i=0}^{N-1} f(x_i) t_i(x),   (7.10)

where

t_i(x) = t(h^{-1}(x - x_i))   (7.11)

and where t(x) is the unit triangle function of the form

t(x) = x + 1 if -1 \le x \le 0; -x + 1 if 0 \le x \le 1; and 0 otherwise.   (7.12)

With one input node denoting x \in [0, 1] and one output node denoting f(x), we construct an MLP that has 4 neurons in two pairs, denoted by R_{j,k}, k = 1, ..., 4, in the intermediate layer to yield the same effect as t_j. We use \alpha_{j,k} and \beta_{j,k} to denote the weight and bias from the input node R_in to R_{j,k}, respectively. In our design, they are specified as:

R_in => R_{j,1}: \alpha_{j,1} = h^{-1}, \beta_{j,1} = -h^{-1} x_{j-1};
R_in => R_{j,2}: \alpha_{j,2} = h^{-1}, \beta_{j,2} = -h^{-1} x_j;
R_in => R_{j,3}: \alpha_{j,3} = -h^{-1}, \beta_{j,3} = h^{-1} x_{j+1};
R_in => R_{j,4}: \alpha_{j,4} = -h^{-1}, \beta_{j,4} = h^{-1} x_j.   (7.13)

The responses of neurons R_{j,k}, k = 1, 2 and k = 3, 4, after ReLU nonlinear activation are shown in Figs. 7.1(a) and (b), respectively, where the dotted lines indicate the responses of individual neurons and the solid lines show the combined responses. Finally, the combination of the responses of all 4 neurons at the output node is shown in Fig. 7.1(c). It is a unit triangle plus a vertical shift. Similarly, we use \tilde{\alpha}_{j,k} and \tilde{\beta}_{j,k} to denote the weight and bias from R_{j,k} to the output node R_out, respectively. They are specified below:

R_{j,k} => R_out: \tilde{\alpha}_{j,k} = f(x_j), k = 1, 3;
R_{j,k} => R_out: \tilde{\alpha}_{j,k} = -f(x_j), k = 2, 4;
R_{j,k} => R_out: \tilde{\beta}_{j,k} = -0.25 f(x_j), 1 \le k \le 4.   (7.14)

It is easy to verify from Fig. 7.1(d) that these four neurons form nothing but a weighted triangle centered at x_j with its base between x_{j-1} and x_{j+1} and height f(x_j), and f(x) can be approximated by a sequence of such weighted triangles.
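The following is a minimal numerical sketch of the piecewise-linear construction above; it is our own illustration under Eqs. (7.13)-(7.14), and the function names and the test function are assumptions rather than the authors' code.

```python
# Piecewise-linear MLP construction per Eqs. (7.13)-(7.14) (assumed sketch).
import numpy as np

def build_piecewise_linear_mlp(f, N):
    h = 1.0 / N
    x = np.arange(N + 1) * h                      # grid points x_0, ..., x_N
    alpha, beta, alpha_t, beta_t = [], [], [], []
    for j in range(N):                            # kernels at x_0, ..., x_{N-1}
        xm, xj, xp = x[j] - h, x[j], x[j] + h     # x_{j-1}, x_j, x_{j+1}
        # Eq. (7.13): incoming weights/biases of R_{j,1..4}
        alpha += [ 1/h,   1/h, -1/h, -1/h]
        beta  += [-xm/h, -xj/h, xp/h, xj/h]
        # Eq. (7.14): outgoing weights and per-neuron biases
        fj = f(xj)
        alpha_t += [fj, -fj, fj, -fj]
        beta_t  += [-0.25 * fj] * 4
    return (np.array(alpha), np.array(beta),
            np.array(alpha_t), np.array(beta_t))

def mlp_forward(params, x_query):
    alpha, beta, alpha_t, beta_t = params
    y = np.maximum(0.0, np.outer(x_query, alpha) + beta)   # ReLU responses
    return y @ alpha_t + beta_t.sum()                       # output node

if __name__ == "__main__":
    f = lambda t: np.sin(np.pi * t)               # assumed test function
    params = build_piecewise_linear_mlp(f, N=32)
    xs = np.linspace(0, 1, 200)
    err = np.max(np.abs(mlp_forward(params, xs) - f(xs)))
    print(f"max abs error with N=32: {err:.4f}")  # shrinks roughly as O(h^2)
```

Each group of four neurons contributes f(x_j) times a unit triangle plus a constant; the per-neuron biases of -0.25 f(x_j) cancel the constant so that only the weighted triangles remain.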
7.2.3 Piecewise Cubic Approximation

Next, we design an MLP that offers a piecewise cubic approximation to f(x). Besides the incoming and outgoing weights/biases, we design the unit cubic activation (see Fig. 7.2(a)) of the form

q(x) = 0 for x \le -1; a_3 x^3 + a_2 x^2 + a_1 x + a_0 for -1 \le x \le 1; and 1 for x \ge 1,   (7.15)

where q(x), -1 \le x \le 1, satisfies two constraints: i) q(-1) = 0 and ii) q(x) is anti-symmetric with respect to (0, 0.5), i.e.,

q(x) + q(-x) = 1.   (7.16)

Note that these two constraints imply q(0) = 0.5 and q(1) = 1. By substituting q(-1) = 0, q(0) = 0.5 and q(1) = 1 into Eq. (7.15), we get a_2 = 0, a_1 + a_3 = 0.5, and a_0 = 0.5. There is one degree of freedom left. To complete the design, one choice is to specify the first-order derivative at the inflection point (x, q(x)) = (0, 0.5). As shown in Fig. 7.2(a), the inflection point has the maximum first-order derivative and its second-order derivative is zero.

For the interval [x_{j-2}, x_{j+2}] centered at x_j, we can use two neurons to build a third-order unit bump function plus a vertical shift, as shown in Fig. 7.2(b). The weight and bias from the input node R_in to R_{j,k} are specified as:

R_in => R_{j,1}: \alpha_{j,1} = h^{-1}, \beta_{j,1} = -h^{-1} x_{j-1};
R_in => R_{j,2}: \alpha_{j,2} = -h^{-1}, \beta_{j,2} = h^{-1} x_{j+1},   (7.17)

where R_{j,1} and R_{j,2} are activated by the unit cubic function q(x). The weights and biases from R_{j,k}, k = 1, 2, to the output node R_out are:

R_{j,k} => R_out: \tilde{\alpha}_{j,k} = g(x_j), \tilde{\beta}_{j,k} = -0.5 g(x_j),   (7.18)

where g(x_j), j = 0, ..., N, is the solution to the following linear system of (N+1) equations:

g(x_j) + 0.5 [g(x_{j-1}) + g(x_{j+1})] = f(x_j), 0 \le j \le N,   (7.19)

with boundary conditions g(x_{-1}) = g(x_{N+1}) = 0. Given the target regression values f(x_j) on the right-hand side of (7.19), one can solve the system for g(x_j), j = 0, ..., N, uniquely. If f(x) is smooth in a local region, g(x) \approx 0.5 f(x), as shown in Fig. 7.2(c).

There is an alternative design, shown in Fig. 7.2(d), where we increase the spacing between two adjacent bumps from h to 2h. The weight and bias from the input node R_in to R_{j,k} still follow Eq. (7.17) with j = 2j', where j' = 0, 1, ..., N/2, while Eq. (7.18) should be changed to

R_{j,k} => R_out: \tilde{\alpha}_{j,k} = f(x_j), \tilde{\beta}_{j,k} = -0.5 f(x_j).   (7.20)

This is fine since there is no interference between f(x_{j-2}), f(x_j) and f(x_{j+2}). Furthermore, using the Taylor series expansion (under the assumption that the low-order derivatives of f(x) are bounded) and based on the fact that x_{j+1} is the inflection point of the two associated cubic activation functions, we can derive

0.5 [f(x_{j+2}) + f(x_j)] - f(x_{j+1}) = O(h^4).   (7.21)

The design in Fig. 7.2(c) demands 2N neurons in total while that in Fig. 7.2(d) demands only N neurons. The latter design appears to be simpler.
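Below is a minimal sketch of the first cubic design (Fig. 7.2(c)). The remaining degree of freedom of the activation is fixed here by requiring q'(+-1) = 0, which gives q(x) = -0.25 x^3 + 0.75 x + 0.5; this particular choice, like the helper names and test function, is our own assumption rather than the text's prescription.

```python
# Piecewise-cubic MLP construction per Eqs. (7.17)-(7.19) (assumed sketch).
import numpy as np

def q(t):
    """Unit cubic activation of Eq. (7.15) with a3 = -0.25, a1 = 0.75 (assumed)."""
    t = np.clip(t, -1.0, 1.0)
    return -0.25 * t**3 + 0.75 * t + 0.5

def build_piecewise_cubic_mlp(f, N):
    h = 1.0 / N
    x = np.arange(N + 1) * h
    # Eq. (7.19): g_j + 0.5 (g_{j-1} + g_{j+1}) = f(x_j), a tridiagonal system.
    A = np.eye(N + 1) + 0.5 * (np.eye(N + 1, k=1) + np.eye(N + 1, k=-1))
    g = np.linalg.solve(A, f(x))
    return x, h, g

def cubic_mlp_forward(x, h, g, xq):
    z = np.zeros_like(xq)
    for j, xj in enumerate(x):
        # Two cubic-activated neurons per grid point, Eq. (7.17); the two
        # output biases of -0.5 g_j from Eq. (7.18) give the "-1.0" term.
        bump = q((xq - (xj - h)) / h) + q(((xj + h) - xq) / h) - 1.0
        z += g[j] * bump
    return z

if __name__ == "__main__":
    f = lambda t: np.sin(np.pi * t)               # assumed test function
    x, h, g = build_piecewise_cubic_mlp(f, N=16)
    xq = np.linspace(0, 1, 400)
    print(np.max(np.abs(cubic_mlp_forward(x, h, g, xq) - f(xq))))
```

At every grid point the bumps sum to g_j + 0.5(g_{j-1} + g_{j+1}), so solving the tridiagonal system in Eq. (7.19) makes the network interpolate f exactly on the grid.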
7.2.4 Discussion

This section builds a bridge between piecewise low-order polynomial approximations and MLP approximations. The approximation errors of piecewise low-order polynomials can be derived using Taylor series expansion. As N = h^{-1} goes to infinity, the errors of piecewise constant, linear and cubic polynomials converge to zero at rates of O(h), O(h^2) and O(h^4), respectively. The same conclusion applies to the corresponding designed MLPs. There is, however, a difference between them. To give an example, consider the kernel density estimation problem [90]:

\hat{f}_h(x) = \sum_{j=1}^{N} \omega_j K(h^{-1}(x - x_j)),   (7.22)

where K(x) is a kernel. The unknown function f(x) at a given point x can be expressed as a summation of weighted kernels, where the weights \omega_j can be solved by a regression method (say, linear regression) for a given K(x). The box, triangle, and third-order unit bump functions are kernels that smooth statistical variations in local intervals. A pre-selected kernel could be too local and rigid to be efficient in handling long- and mid-range correlations in f(x). In contrast, the parameters in the first stage of an MLP allow a richer choice of kernels. Our MLP design does not exploit this flexibility fully. For example, we may adopt a wider variety of \alpha_{j,k} values, i.e., kernels of different supports (see Fig. 7.3).

One focal point of neural network research is to understand the role of activation in a neuron. Historically, the earliest activation was the simple unit step function [78]. It offers "on" and "off" states with an abrupt jump. Such an activation has zero derivative except at x = 0 and, as a result, backpropagation is not able to update parameters. Several variants such as the sigmoid, ReLU and leaky ReLU activations were developed later. Paired activations are used to build kernels of low-order polynomials here. As presented earlier, higher-order nonlinear activations are not fundamental to the universal approximation capability of neural networks; they affect the kernel shape only. Instead, the role of nonlinear activation is interpreted as a tool to resolve the sign confusion problem caused by the cascade of multiple layers in [57] and [66].

7.3 Generalization to Multivariate Vector Functions

Our presentation has focused on the case of 1D input/output so far. In general, the regression problem can be a q-dimensional vector function of p variables:

F: x -> f(x), where x \in [0, 1]^p, f \in R^q,   (7.23)

and p, q \ge 1. We provide a sketch of the generalization from the univariate scalar function to the multivariate vector function in this section. Generalizing our proposed design to a vector function is straightforward since we can design one MLP for each output dimension.

For the generalization from the univariate case to the multivariate case using 1D piecewise low-order polynomials, we can proceed as follows:

1. Partition [0, 1]^p into p-D cells of volume h^p;
2. Conduct the 1D approximation separately;
3. Perform the tensor product of 1D kernels to form p-D kernels (a small numerical sketch of this step is given at the end of this section).

For the MLP design, we cannot perform the tensor product in the third step directly. Instead, we can use the tensor-product result to find the weighted contributions of neighboring grid points to the center of each cell and construct an MLP accordingly. In most practical machine learning applications, the multivariate vector function f(x) is assumed to be sufficiently smooth; that is, its lower-order partial derivatives exist. Consequently, the Taylor series expansion for multivariate functions can be adopted to estimate the convergence rate to the target regression function.
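The following is a minimal sketch, under our own assumptions, of the tensor-product step for p = 2: 2D kernels are formed as products of the 1D unit triangle kernels of Eq. (7.12), and the approximation is a weighted sum of grid samples. It illustrates the idea numerically rather than the corresponding MLP weight assignment.

```python
# Tensor product of 1D triangle kernels for a bivariate approximation (sketch).
import numpy as np

def tri(t):
    """Unit triangle kernel t(x) of Eq. (7.12)."""
    return np.maximum(0.0, 1.0 - np.abs(t))

def approx_2d(f, N, xq, yq):
    """Approximate f(x, y) on [0,1]^2 by sum_{i,j} f(x_i, y_j) T_i(x) T_j(y)."""
    h = 1.0 / N
    grid = np.arange(N + 1) * h
    Tx = tri((xq[:, None] - grid[None, :]) / h)   # (len(xq), N+1)
    Ty = tri((yq[:, None] - grid[None, :]) / h)   # (len(yq), N+1)
    F = f(grid[:, None], grid[None, :])           # samples on the 2D grid
    return Tx @ F @ Ty.T                          # separable tensor product

if __name__ == "__main__":
    f = lambda x, y: np.sin(np.pi * x) * np.cos(np.pi * y)   # assumed test function
    xs = np.linspace(0, 1, 100)
    approx = approx_2d(f, N=32, xq=xs, yq=xs)
    exact = f(xs[:, None], xs[None, :])
    print(np.max(np.abs(approx - exact)))         # decays roughly as O(h^2)
```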
7.4 Conclusion and Future Work

A bridge between piecewise low-order polynomials and three-layer MLP approximators was built in this work. Instead of following the traditional path of universal approximation theorems for neural networks based on functional or real analysis, we proposed a signal processing approach to construct MLPs. The construction includes the choice of activation functions, neuron numbers, and filter weights between adjacent layers. Our design decomposes an MLP into two stages: 1) kernel construction in the first stage and 2) weight regression in the second stage. With this approach, we can explain intuitively how an MLP offers a universal approximation to a univariate function.

Kernel design plays an important role in many machine learning and statistical inference problems. It was shown here to be related to the shape of the activation function. Given an MLP architecture, backpropagation solves the kernel construction and weight regression problems jointly via end-to-end optimization. It is, however, an open problem to find the optimal number of neurons in the intermediate layer. On the other hand, the traditional approach decouples the whole problem into two subproblems and solves them one by one in a feedforward manner. It is interesting to study the relationship between kernel design and the first-stage filter weight/bias determination of MLPs. It is also desirable to exploit prior knowledge about data and applications for flexible kernel design in feedforward learning systems.

Acknowledgment

This material is based on research sponsored by US Army Research Laboratory (ARL) under contract number W911NF2020157. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of US Army Research Laboratory (ARL) or the U.S. Government. Computation for the work was supported by the University of Southern California's Center for High Performance Computing (hpc.usc.edu).

Figure 7.1: Responses of (a) neurons R_{j,1} and R_{j,2} individually (dotted lines) and jointly (solid lines), (b) neurons R_{j,3} and R_{j,4}, (c) all four neurons combined, and (d) the approximation of f(x) with a sequence of weighted triangle functions.
Figure 7.2: (a) The unit cubic activation function, (b) responses of R_{j,1} and R_{j,2} individually (dotted lines) and jointly (solid lines), (c) and (d) two designs for the approximation of f(x) with a sequence of weighted third-order bumps.
Figure 7.3: Two kernels of different supports centered at x_j, which can be achieved by adding more neurons to the intermediate layer of an MLP.

Chapter 8
Conclusion and Future Work

8.1 Conclusion

We examined the properties of neural networks experimentally and analytically in this research. For the first part, we performed experiments on convolutional neural networks and observed their behavior. First, we demonstrated how the number of filters in the convolutional layers affects the performance of the networks. We found that the performance improves as the filter number of a convolutional layer increases until the number of filters reaches a certain critical point. Second, we proposed a method to determine the number of filters in the convolutional layers based on SSL. Third, we analyzed the features of SqueezeNet. We proposed a method to compute the cross-entropy values based on saliency. The proposed tool could be helpful in the study of other neural network architectures. We showed that the dimension reduction in the squeeze units does not lower the feature discriminability for most fire modules and that the 1x1 and 3x3 features work jointly to form more discriminative features. We also demonstrated with visualization how expand/1x1 and expand/3x3 capture different features that complement each other.
Through the construction, a one-to-one correspondence between the approximation of an MLP and that of a piecewise low-order polynomial is established. And thus we can apply the theories of piecewise low-order polynomial on our constructed MLPs. This helps to shed light on the universal approximation properties of MLPs. 8.2 Future Work Understanding very deep neural networks In this research, we mainly focused on small networks (e.g., the LeNet-5-like networks or MLPs with one or two hidden layers), nowadays very deep and large networks are used in many applications. However, it is challenging to conduct rigorous mathematicaly analysis on convolutional neural networks (CNNs) with more than two hidden layers. More experiments and theoretical analysis could be done to better understand very deep networks such as ResNet [43] or DenseNet [46]. Understanding adversarial attacks There has been a large number of literatures devoted to this topic. It was rst pointed by Szegedy et al. [114] that deep neural networks (DNN) can be 107 easily fooled by adding carefully-crafted adversarial perturbations to input images. These adver- sarial examples can trick deep learning systems into erroneous predictions with high condence. It was further shown in [61] that these examples exist in the physical world. Even worse, adversarial attacks are often transferable [68]; namely, one can generate adversarial attacks without knowing the parameters of a target model. Goodfellow et al. [37] developed an ecient fast gradient sign method (FGSM) to add adversarial perturbations by computing the gradients of the cost function w.r.t the input. Along this direction, Kurakin et al. [61] proposed a basic iterative method (BIM) that iteratively computes the gradients and takes a small step in the direction (instead of a large one like the FGSM). Later, Dong et al. [28] integrated a momentum term to the BIM to stabilize update directions. Papernot et al. [87] generated an adversarial attack by restricting the L0-norm of perturbations. DeepFool (DF) [83] iteratively calculated perturbations to take adversarial im- ages to the decision boundary which is linearized in the high-dimensional space. In the previous chapters, we have investigated how resource-rich and resource-scarce networks behave dierently in terms of the accuracy. Since the degree of freedom of resource-rich and resource-scarce networks is dierent, how does this aect the robustness of the networks against the adversarial attack? More research can be done along this direction. 108 Bibliography [1] Abdul Ahad, Ahsan Fayyaz, and Tariq Mehmood. Speech recognition using multilayer perceptron. In IEEE Students Conference, ISCON'02. Proceedings., volume 1, pages 103{ 109. IEEE, 2002. [2] Ethem Alpaydin. Gal: Networks that grow when they learn and shrink when they forget. International journal of pattern recognition and Articial Intelligence, 8(01):391{414, 1994. [3] Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. There are many consistent explanations of unlabeled data: Why you should average. In International Conference on Learning Representations, 2019. [4] Sebastian Bach, Alexander Binder, Gr egoire Montavon, Frederick Klauschen, Klaus-Robert M uller, and Wojciech Samek. On pixel-wise explanations for non-linear classier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015. [5] Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. 
[5] Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems, pages 3365–3373, 2014.
[6] Kristin P Bennett and Olvi L Mangasarian. Neural network training via linear programming. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 1990.
[7] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.
[8] Nirmal K Bose and Amulya K Garga. Neural network design using voronoi diagrams. IEEE Transactions on Neural Networks, 4(5):778–787, 1993.
[9] Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.
[10] Carlos Cabrelli, Ursula Molter, and Ron Shonkwiler. A constructive algorithm to solve "convex recursive deletion" (CoRD) classification problems via two-layer perceptron networks. IEEE Transactions on Neural Networks, 11(3):811–816, 2000.
[11] Tianping Chen and Hong Chen. Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Transactions on Neural Networks, 6(4):911–917, 1995.
[12] Yueru Chen and C-C Jay Kuo. Pixelhop: A successive subspace learning (ssl) method for object classification. arXiv preprint arXiv:1909.08190, 2019.
[13] Yueru Chen and C-C Jay Kuo. Pixelhop: A successive subspace learning (ssl) method for object recognition. Journal of Visual Communication and Image Representation, page 102749, 2020.
[14] Yueru Chen, Mozhdeh Rouhsedaghat, Suya You, Raghuveer Rao, and C-C Jay Kuo. Pixelhop++: A small successive-subspace-learning-based (ssl-based) model for image classification. arXiv preprint arXiv:2002.03141, 2020.
[15] Yueru Chen, Zhuwei Xu, Shanshan Cai, Yujian Lang, and C-C Jay Kuo. A saak transform approach to efficient, scalable and robust handwritten digits recognition. arXiv preprint arXiv:1710.10714, 2017.
[16] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.
[17] Charles K Chui and Xin Li. Approximation by ridge functions and neural networks with one hidden layer. Journal of Approximation Theory, 70(2):131–141, 1992.
[18] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728, 2016.
[19] Balázs Csanád Csáji et al. Approximation with artificial neural networks. Faculty of Sciences, Eötvös Loránd University, Hungary, 24(48):7, 2001.
[20] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
[21] Jifeng Dai, Yang Lu, and Ying-Nian Wu. Generative modeling of convolutional neural networks. arXiv preprint arXiv:1412.6296, 2014.
[22] Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Ruslan R Salakhutdinov. Good semi-supervised learning that requires a bad gan. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6510–6520. Curran Associates, Inc., 2017.
[23] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[24] Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando De Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.
[25] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
[26] A Victor Devadoss and T Antony Alphonnse Ligori. Forecasting of stock prices using multi layer perceptron. International Journal of Web Technology, 2(2):49–55, 2013.
[27] Xiaohan Ding, Guiguang Ding, Jungong Han, and Sheng Tang. Auto-balanced filter pruning for efficient convolutional neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[28] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Xiaolin Hu, Jianguo Li, and Jun Zhu. Boosting adversarial attacks with momentum. 2018.
[29] Chelsia Amy Doukim, Jamal Ahmed Dargham, and Ali Chekima. Finding the number of hidden neurons for an MLP neural network using coarse to fine search technique. In 10th International Conference on Information Science, Signal Processing and their Applications (ISSPA 2010), pages 606–609. IEEE, 2010.
[30] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
[31] Ronald A Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.
[32] Marcus Frean. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Computation, 2(2):198–209, 1990.
[33] Isha Garg, Priyadarshini Panda, and Kaushik Roy. A low effort approach to structured CNN design using PCA. IEEE Access, 2019.
[34] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[35] Fred Glover. Future paths for integer programming and links to artificial intelligence. Computers & Operations Research, 13(5):533–549, 1986.
[36] David E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Co., Inc., USA, 1st edition, 1989.
[37] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
[38] Kojo Sarfo Gyamfi, James Brusey, Andrew Hunt, and Elena Gaura. Linear classifier design under heteroscedasticity in linear discriminant analysis. Expert Systems with Applications, 79:44–52, 2017.
[39] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems, pages 8527–8537, 2018.
[40] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[41] Babak Hassibi, David G Stork, and Gregory Wolff. Optimal brain surgeon: Extensions and performance comparisons. In Advances in Neural Information Processing Systems, pages 263–270, 1994.
[42] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, second edition, 2017.
[43] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[44] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[45] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[46] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[47] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
[48] Najmeh Sadat Jaddi, Salwani Abdullah, and Abdul Razak Hamdan. Optimization of neural network model using modified bat-inspired algorithm. Applied Soft Computing, 37:71–86, 2015.
[49] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[50] Kenji Kawaguchi. Deep learning without poor local minima. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 586–594. Curran Associates, Inc., 2016.
[51] Zhanghan Ke, Daoye Wang, Qiong Yan, Jimmy Ren, and Rynson WH Lau. Dual student: Breaking the limits of the teacher in semi-supervised learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 6728–6736, 2019.
[52] Scott Kirkpatrick, C Daniel Gelatt, and Mario P Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
[53] Andrei Nikolaevich Kolmogorov. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. In Doklady Akademii Nauk, volume 114, pages 953–956. Russian Academy of Sciences, 1957.
[54] Alex Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
[55] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
[56] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[57] C-C Jay Kuo. Understanding convolutional neural networks with a mathematical model. Journal of Visual Communication and Image Representation, 41:406–413, 2016.
[58] C-C Jay Kuo. The CNN as a guided multilayer RECOS transform [lecture notes]. IEEE Signal Processing Magazine, 34(3):81–89, 2017.
[59] C-C Jay Kuo and Yueru Chen. On data-driven Saak transform. Journal of Visual Communication and Image Representation, 50:237–246, 2018.
[60] C-C Jay Kuo, Min Zhang, Siyang Li, Jiali Duan, and Yueru Chen. Interpretable convolutional neural networks via feedforward design. Journal of Visual Communication and Image Representation, 60:346–359, 2019.
[61] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
[62] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
[63] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
[64] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
[65] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[66] Ruiyuan Lin, Zhiruo Zhou, Suya You, Raghuveer Rao, and C-C Jay Kuo. From two-class linear discriminant analysis to interpretable multilayer perceptron design. arXiv preprint arXiv:2009.04442, 2020.
[67] Cheng-Yuan Liou and Wen-Jen Yu. Ambiguous binary representation in multilayer neural networks. In Proceedings of ICNN'95-International Conference on Neural Networks, volume 1, pages 379–384. IEEE.
[68] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.
[69] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems, pages 6231–6239, 2017.
[70] Teresa B Ludermir, Akio Yamazaki, and Cleber Zanchettin. An optimization methodology for neural network weights and architectures. IEEE Transactions on Neural Networks, 17(6):1452–1459, 2006.
[71] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pages 5058–5066, 2017.
[72] Wan-Duo Ma, J. P. Lewis, and W. Bastiaan Kleijn. The HSIC bottleneck: Deep learning without back-propagation. ArXiv, abs/1908.01580, 2019.
[73] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5188–5196. IEEE, 2015.
[74] Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.
[75] Mario Marchand. Learning by minimizing resources in neural networks. Complex Systems, 3:229–241, 1989.
[76] Mario Marchand, Mostefa Golea, and Pál Ruján. A convergence theorem for sequential learning in two-layer perceptrons. EPL (Europhysics Letters), 11(6):487, 1990.
[77] FM Frattale Mascioli and Giuseppe Martinelli. A constructive algorithm for binary neural networks: The oil-spot algorithm. IEEE Transactions on Neural Networks, 6(3):794–797, 1995.
[78] Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
[79] Marc Mezard and Jean-P Nadal. Learning in feedforward layered networks: The tiling algorithm. Journal of Physics A: Mathematical and General, 22(12):2191, 1989.
[80] T. Miyato, S. Maeda, S. Ishii, and M. Koyama. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2018.
[81] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.
[82] Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert Müller. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65:211–222, 2017.
[83] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.
[84] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
[85] Kamal Nigam and Rayid Ghani. Analyzing the effectiveness and applicability of co-training. In Proceedings of the ninth international conference on Information and Knowledge Management, pages 86–93. ACM, 2000.
[86] Utku Ozbulak. Pytorch CNN visualizations. https://github.com/utkuozbulak/pytorch-cnn-visualizations, 2019.
[87] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, pages 372–387. IEEE, 2016.
[88] Rajesh Parekh, Jihoon Yang, and Vasant Honavar. Constructive neural-network learning algorithms for pattern classification. IEEE Transactions on Neural Networks, 11(2):436–451, 2000.
[89] Rajesh G Parekh, Jihoon Yang, and Vasant Honavar. Constructive neural network learning algorithms for multi-category real-valued pattern classification. 1997.
[90] Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
[91] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[92] Siyuan Qiao, Wei Shen, Zhishuai Zhang, Bo Wang, and Alan Yuille. Deep co-training for semi-supervised image recognition. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision – ECCV 2018, pages 142–159, Cham, 2018. Springer International Publishing.
[93] Thiyagarajan Ramanathan, Abinaya Manimaran, Suya You, and CC Jay Kuo. Robustness of Saak transform against adversarial attacks. In 2019 IEEE International Conference on Image Processing (ICIP), pages 2531–2535. IEEE, 2019.
[94] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546–3554, 2015.
[95] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
[96] Mozhdeh Rouhsedaghat, Yifan Wang, Xiou Ge, Shuowen Hu, Suya You, and C-C Jay Kuo. Facehop: A light-weight low-resolution face gender classification method. arXiv preprint arXiv:2007.09510, 2020.
[97] Walter Rudin. Functional analysis, 1973.
[98] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[99] Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. arXiv preprint arXiv:1702.08400, 2017.
[100] Andrew M Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan D Tracey, and David D Cox. On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124020, 2019.
[101] Franco Scarselli and Ah Chung Tsoi. Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results. Neural Networks, 11(1):15–37, 1998.
[102] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 128(2):336–359, Oct 2019.
[103] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
[104] Jocelyn Sietsma and Robert J.F. Dow. Creating artificial neural networks that generalize. Neural Networks, 4(1):67–79, 1991.
[105] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[106] K Sivakumar and Uday B Desai. Image restoration using a multilayer perceptron with a multilevel sigmoidal function. IEEE Transactions on Signal Processing, 41(5):2018–2022, 1993.
[107] Jack W Smith, JE Everhart, WC Dickson, WC Knowler, and RS Johannes. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Annual Symposium on Computer Application in Medical Care, page 261. American Medical Informatics Association, 1988.
[108] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2018.
[109] David A Sprecher. On the structure of continuous functions of several variables. Transactions of the American Mathematical Society, 115:340–355, 1965.
[110] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net, 2014.
[111] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[112] I Stephen. Perceptron-based learning algorithms. IEEE Transactions on Neural Networks, 50(2):179, 1990.
[113] Jeremias Sulam, Vardan Papyan, Yaniv Romano, and Michael Elad. Multilayer convolutional sparse modeling: Pursuit and dictionary learning. IEEE Transactions on Signal Processing, 66(15):4090–4104, 2018.
[114] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. 2013.
[115] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pages 1195–1204, 2017.
[116] N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5, April 2015.
[117] Swagath Venkataramani, Ashish Ranjan, Kaushik Roy, and Anand Raghunathan. Axnn: Energy-efficient neuromorphic systems using approximate computing. In Proceedings of the 2014 International Symposium on Low Power Electronics and Design, pages 27–32. ACM, 2014.
[118] W. Wang and Z.-H. Zhou. Co-training with insufficient views. Journal of Machine Learning Research, 29:467–482, 2013.
[119] Wei Wang and Zhi-Hua Zhou. Analyzing co-training style algorithms. In Joost N. Kok, Jacek Koronacki, Ramon Lopez de Mantaras, Stan Matwin, Dunja Mladenić, and Andrzej Skowron, editors, Machine Learning: ECML 2007, pages 454–465, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.
[120] Yulong Wang, Hang Su, Bo Zhang, and Xiaolin Hu. Interpret neural networks by identifying critical data routing paths. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8906–8914, 2018.
[121] Donglai Wei, Bolei Zhou, Antonio Torralba, and William Freeman. Understanding intra-class knowledge inside CNN. arXiv preprint arXiv:1507.02379, 2015.
[122] Thomas Wiatowski and Helmut Bölcskei. A mathematical theory of deep convolutional neural networks for feature extraction. IEEE Transactions on Information Theory, 64(3):1845–1866, 2017.
[123] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms, 2017.
[124] Hao Xu, Yueru Chen, Ruiyuan Lin, and C-C Jay Kuo. Understanding convolutional neural networks via discriminant feature analysis. APSIPA Transactions on Signal and Information Processing, 7, 2018.
[125] Jihoon Yang, Rajesh Parekh, and Vasant Honavar. Distal: An inter-pattern distance-based constructive learning algorithm. Intelligent Data Analysis, 3(1):55–73, 1999.
[126] Xin-She Yang. A new metaheuristic bat-inspired algorithm. In Nature Inspired Cooperative Strategies for Optimization (NICSO 2010), pages 65–74. Springer, 2010.
[127] Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S Davis. Nisp: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9194–9203, 2018.
[128] Cleber Zanchettin, Teresa B Ludermir, and Leandro Maciel Almeida. Hybrid training method for MLP: Optimization of architecture and training. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 41(4):1097–1109, 2011.
[129] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
[130] Min Zhang, Yifan Wang, Pranav Kadam, Shan Liu, and C-C Jay Kuo. Pointhop++: A lightweight learning model on point sets for 3D classification. arXiv preprint arXiv:2002.03281, 2020.
[131] Min Zhang, Haoxuan You, Pranav Kadam, Shan Liu, and C-C Jay Kuo. Pointhop: An explainable machine learning method for point cloud classification. arXiv preprint arXiv:1907.12766, 2019.
[132] Min Zhang, Haoxuan You, Pranav Kadam, Shan Liu, and C-C Jay Kuo. Pointhop: An explainable machine learning method for point cloud classification. IEEE Transactions on Multimedia, 2020.
[133] Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu. Interpretable convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8827–8836, 2018.
[134] Xiangyu Zhang, Jianhua Zou, Xiang Ming, Kaiming He, and Jian Sun. Efficient and accurate approximations of nonlinear convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1984–1992, 2015.
[135] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene CNNs. arXiv preprint arXiv:1412.6856, 2014.
[136] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.
[137] Xiaojin Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.
Abstract
Neural networks have been shown to be effective in many applications. To better explain their behavior, we examine the properties of neural networks experimentally and analytically in this research.
In the first part, we conduct experiments on convolutional neural networks (CNNs), observe their behavior, and offer insights and conjectures about CNNs. First, we demonstrate how the accuracy changes with the size of the convolutional layers. Second, we develop a design that determines the size of the convolutional layers based on SSL. Third, as a case study, we analyze the SqueezeNet, which achieves the same accuracy as AlexNet with 50x fewer parameters, by studying the evolution of cross-entropy values across layers and by visualization. Fourth, we offer insights on co-training-based deep semi-supervised learning.
In the second part, we propose new angles from which to understand and interpret neural networks. To understand the behavior of multilayer perceptrons (MLPs) as classifiers, we interpret MLPs as a generalization of a two-class LDA system so that they can handle inputs composed of multiple Gaussian modalities belonging to multiple classes. An MLP design with two hidden layers that also specifies the filter weights is proposed. To understand the behavior of MLPs as regressors, we construct MLPs as piecewise low-order polynomial approximators using a signal processing approach. The constructed MLP contains one input layer, one intermediate layer, and one output layer. Its construction includes the specification of neuron numbers and all filter weights. Through the construction, a one-to-one correspondence between the approximation of an MLP and that of a piecewise low-order polynomial is established.
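As a minimal illustration of the general idea behind the regressor construction (not the exact procedure developed in this thesis), the sketch below hand-specifies the weights of a one-hidden-layer ReLU MLP so that it reproduces a piecewise-linear interpolant of a target function; the knot placement and the sine target are arbitrary choices made for the example.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def pwl_mlp_weights(knots, values):
    # One hidden ReLU unit per interval: unit i computes relu(x - knots[i]).
    slopes = np.diff(values) / np.diff(knots)
    out_weights = np.empty_like(slopes)
    out_weights[0] = slopes[0]
    out_weights[1:] = np.diff(slopes)       # change of slope at each interior knot
    hidden_bias = -knots[:-1]               # relu(x + hidden_bias) = relu(x - knot)
    return hidden_bias, out_weights, values[0]

def mlp_forward(x, hidden_bias, out_weights, out_bias):
    hidden = relu(x[:, None] + hidden_bias[None, :])   # shape (len(x), number of hidden units)
    return hidden @ out_weights + out_bias

# Example: reproduce the piecewise-linear interpolant of sin(x) on [0, pi] with 8 segments.
knots = np.linspace(0.0, np.pi, 9)
b, w, c = pwl_mlp_weights(knots, np.sin(knots))
x = np.linspace(0.0, np.pi, 200)
err = np.max(np.abs(mlp_forward(x, b, w, c) - np.interp(x, knots, np.sin(knots))))
print(err)   # numerically zero: the MLP equals the piecewise-linear interpolant

The point of the sketch is only that the filter weights of such an MLP can be written down in a feedforward manner from the target samples, with no backpropagation involved.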