Validation of an Alternative Neural Decision Tree

by

Yu Leo Lu

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Applied Mathematics)

August 2021

Copyright 2021 Yu Leo Lu

Dedication

I dedicate this thesis to my mother, Zhonghui Yu, who has always been a great source of inspiration and support. This thesis is also dedicated to Dr. Chunming Wang, who encouraged me to pursue the path of data science and machine learning.

Acknowledgements

First, I would like to thank the University of Southern California (USC) for supporting my Ph.D. study. My deepest gratitude goes to my advisor, Dr. Chunming Wang, who brought me into the world of machine learning. He guided me through multiple papers, including this thesis, and provided valuable collaboration experiences with space physicists from NASA JPL and the GFZ German Research Centre. His enthusiasm for mathematical applications kept me constantly engaged with my research and prepared me for my future career in data science. I would also like to thank my committee members, Gary Rosen and C.-C. Jay Kuo, for making my defense an enjoyable moment and for their brilliant comments and suggestions. I would like to express my deepest appreciation to my mother, Zhonghui Yu, who never gave up on me no matter how challenging the situation was, and who always inspired me no matter how frustrating the problem was. I would also like to thank my wife, Sarah Jeam, for her trust, understanding, and never-ending encouragement. Finally, I would like to give special thanks to my best friend Jason Fang and office manager Amy Yung for their countless help throughout the Ph.D. program.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
    1.1 Thesis Organization
Chapter 2: General Statistical and Machine Learning Framework
    2.1 Data Representation
    2.2 Statistical Learning and Consistency
    2.3 Machine Learning Framework
Chapter 3: Introduction of Decision Tree and Artificial Neural Network
    3.1 Decision Tree
        3.1.1 Growing
        3.1.2 Pruning
    3.2 Artificial Neural Network
        3.2.1 Construct the Neural Network
        3.2.2 Gradient-Based Optimization
        3.2.3 Additional Remarks
Chapter 4: Neural Decision Tree
    4.1 Replicating a Decision Tree with a Neural Network
    4.2 Relaxation
Chapter 5: Statistical Consistency
    5.1 Consistency
    5.2 Covering and Packing Numbers
    5.3 VC Dimension
    5.4 Proof of the Main Result
Chapter 6: Exploration of Algorithm Designs and Classification Performance
    6.1 Comparison of NDT to Arbitrary Networks
        6.1.1 MNIST
        6.1.2 CIFAR-10
        6.1.3 SVHN
        6.1.4 Reuters
    6.2 Tuning of NDT
    6.3 Further Observations in Selecting Hyper-parameters
Chapter 7: Building Forecast Models for Plasmasphere Dynamics and Regression Performance
    7.1 Developing a Regression-Based Neural Decision Tree Model for Forecasting Plasmasphere Dynamics
    7.2 A Broader View of the Task of Modeling
    7.3 Discussion and Conclusion
Chapter 8: Conclusion
References
Appendices
    A Selection of the Activation Function in the First Layer of NDT
    B Hoeffding's Inequality
    C Exponential Inequality and the Inspiration by the Covering Number
    D Lemmas for the Proof of the Main Theorem

List of Tables

2.1 The Play Tennis data set
6.1 Data sets used in this thesis
6.2 Comparison of different architectures
6.3 MNIST and CIFAR-10 comparison
6.4 SVHN and Reuters comparison
6.5 Continuous learning feature in MNIST
6.6 Complexity comparison in MNIST
6.7 MNIST comparison
6.8 Reuters comparison
7.1 Attributes in the input for the PINE and NDT models for plasmasphere dynamics
7.2 Approaches to model selection for PINE and for the NDT-based approach
7.3 Robust optimizer
7.4 Comparison of final RMSE for different NDT-constructed models and PINE
7.5 The models and training algorithms are selected for similar final RMSE performance. The times are measured on a personal computer with an Intel® Core™ i7-4790 processor, an NVIDIA GeForce GTX 970 GPU, and a total of 32 GB of RAM.

List of Figures

3.1 A CART sample with names on each node
3.2 A decision tree example. Given any observation $(x, y) \in \mathbb{R}^2 \times \{0,1\}$, this tree partitions the feature space into disjoint subsets $\{R_1, \ldots, R_5\}$ and then predicts the output $\hat{y}$ as the most frequent class.
3.3 An example of a two-hidden-layer neural network. The input $(x_1, \ldots, x_d) \in \mathbb{R}^d$ has $d$ features. The summation-and-sigmoid icon indicates that a linear combination is followed by a non-linear activation function.
3.4 Plots of commonly used activation functions
3.5 Nodes in the $(l+2)$-th layer that $y_i^{(l+1)}$ connects to
4.1 An example of a decision tree with its associated NDT
4.2 2D partition of the feature space into 5 subsets, each with the specific shading associated with the 5 leaves in Figure 4.1(a). The 4-tuple vector is the binary representation of the 4 decision nodes in the same tree.
4.3 Replicating a decision tree with a neural network in 3 steps
6.1 MNIST testing error for different architectures. Each epoch means the training data has been used to update the weights once.
6.2 CIFAR-10 testing error. Each epoch means the training data has been used to update the weights once.
6.3 SVHN testing error. Each epoch means the training data has been used to update the weights once.
6.4 Reuters testing error. Each epoch means the training data has been used to update the weights once.
7.1 Comparison of the rate of reduction of RMSE for NDT and PINE using first-order gradient-descent optimization methods
7.2 Model predictions of the electron density field for the June 26-27, 2001 storm
7.3 Since the Van Allen Probes collect data along their orbits, instantaneous global imaging of the plasmasphere density field is unavailable, and a spatially non-uniform distribution of data is inherent to the measurement approach.
7.4 Local statistics of the training data show distinct spatial variability in both the average and the standard deviation of electron density.
7.5 When Cartesian coordinates are used for the geolocation of training data in the NDT training process, spatial continuity in the predicted electron density field is achieved. Since polar geolocation is used in the training of PINE, the electron density field produced by PINE can have visible spatial discontinuities.
7.6 Effect of weighted stat scaling and stat scaling only
7.7 Effect of weighted stat scaling and stat scaling only
7.8 Distribution of singular values and projection of the data onto principal directions
7.9 Electron density fields predicted by NDT and PINE for input parameters perturbed by one standard deviation in the direction of each of the first 5 principal components
7.10 Difference in electron density fields predicted by NDT and PINE for input parameters perturbed by ±1 standard deviation in the direction of each of the first 5 principal components
7.11 Difference in electron density fields predicted by different models trained by NDT for input parameters perturbed by ±1 standard deviation in the direction of each of the first 5 principal components
8.1 A tree example with 2 decision nodes $j$ and $k$ and 3 leaf nodes
8.2 Comparison between the indicator activation $\mathbb{I}$, the hyperbolic tangent $t(x)$, and the bounded ReLU $r_1(x)$ from the first hidden layer $h$ to the second hidden layer. (Left) $\tilde{h}$ is activated by $\mathbb{I}$. (Middle) $\tilde{h}$ is activated by $t(x)$. (Right) $\tilde{h}$ is activated by $r_1(x)$.

Abstract

This dissertation presents a modified version of the Neural Decision Tree (NDT), initially proposed by Biau, with extended statistical consistency results, various benchmark studies, and real-world space weather applications. We performed multiple numerical experiments to investigate several critical features of this alternative NDT. Our experimental results demonstrate the attractiveness of NDT, which adopts a strategically selected network structure with appropriate initial weights, over traditional artificial neural networks. Our results indicate that NDT provides higher classification accuracy on the various multi-category classification benchmark data sets we tested than an arbitrarily constructed neural network with a comparable number of degrees of freedom. Empirical evidence also suggests that NDT is more stable and more robust to the hyper-parameters and gradient-based optimization algorithms commonly involved in neural network training. In addition, we demonstrate the feasibility of NDT in solving a regression problem in space weather. We built a prediction model for the plasmasphere electron density with solar and geomagnetic measurements as inputs. In this application, we show how NDT readily provides sophisticated network layouts for building a predictive model, thus taking advantage of the deep-learning potential of the neural network.

Chapter 1

Introduction

Decision Tree algorithms (CART) [1] and Artificial Neural Networks (ANNs) [2] [3] are two of the most popular machine-learning methods [4] [5]. The highly efficient back-propagation technique for evaluating the gradient of the output with respect to the weights defining an ANN enables iterative stochastic gradient-based optimization methods to continuously "learn" and update the network from available training data. However, for people starting on a new application, the lack of reliable guidelines for selecting a network architecture, i.e., the number of layers and neurons of the ANN, can be bewildering. The decision tree algorithm (CART), on the other hand, seeks to iteratively partition the feature space into K disjoint subsets, on each of which a simple model relating the dependent variable Y to the feature vector X can be built, and therefore seems more accessible. It has an intrinsic feature-selection component that seeks to reduce uncertainty, measured by entropy or Gini impurity. The flowchart-like structure is often self-limited by the threshold criteria used to determine whether further partitioning is beneficial. Nevertheless, continuous learning and updating are not as efficient for a decision tree as for a neural network, and the commonly employed partitioning along coordinate axes significantly limits the effectiveness of a CART.

This thesis proposes an approach that combines decision trees and neural networks, which we call the Neural Decision Tree (NDT). As suggested in [6], it is possible to construct an ANN that replicates the behavior of a decision tree. The feature selection inherent in the construction of a CART can provide the structure of an ANN, namely the numbers of neurons and layers, as well as initial weights determined by the training data.
With gradient-based methods that further optimize the weights of this initial ANN, the constraints that CART places on decision boundaries are removed during the learning process.

There have been several published research articles on initializing neural networks with a decision tree. In particular, Kontschieder [7] employed a stochastic gradient descent approach that minimizes the risk of a differentiable splitting function to learn how trees and forests partition the data set. Biau [6] provided theoretical results for weak consistency of the optimal classifier when the training data is uniformly distributed in a bounded subset of the d-dimensional feature space. The basic idea of NDT, formalized in Biau's paper published in April 2018, is quite intuitive. However, to achieve superior performance, detailed design considerations can have significant effects. Our work here represents an enhancement over Biau's results in the following areas:

• On the theoretical side, we extended the statistical consistency results from weak consistency of the optimal classifier to strong consistency for an arbitrary distribution, not limited to a uniform distribution of data in the feature space, under the assumption that the class of NDT functions can asymptotically approximate the regression function.

• We discovered the significant role that a strategic selection of activation functions can play in NDT. The improved performance of our modified NDT is due, in substantial part, to our selection of the activation functions and the associated initialization of the weights for the neurons.

• Unlike Biau's work, which focused only on benchmark regression problems, this thesis carries out extensive studies of the algorithm for both multi-category classification and regression problems. We first investigated critical aspects of our NDT on several commonly used classification benchmark data sets. These experimental results demonstrate attractive properties of our NDT over traditional, randomly initialized ANNs.

• In addition to the benchmark analysis, we demonstrated the capabilities of NDT when facing a new application with limited prior experience with the data. Notably, we developed forecast models for plasmasphere dynamics in space weather using NDT. This example illustrates the capability of NDT to shorten the process of discovering promising model structures.

1.1 Thesis Organization

In the next chapter, we shall first review the general statistical and machine learning framework. Chapter 3 introduces two of the most popular machine learning methods: the Decision Tree and the Artificial Neural Network. In Chapter 4, we propose the construction of an NDT for both multi-category classification and regression. We present the theoretical consistency results in Chapter 5. In Chapter 6, we shall exhibit the experimental results of our algorithm in comparison to ad hoc selections of neural network structures with random initial weights on benchmark classification data sets. In Chapter 7, we shall present a space weather application of NDT, building forecast models for plasmasphere dynamics, in comparison to previous models constructed by a traditional machine-learning approach. In Chapter 8, we provide a summary of our work on NDT with further remarks on possible future research.
Chapter 2

General Statistical and Machine Learning Framework

Machine learning (ML) is a set of methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data or to perform other kinds of decision making under uncertainty [8]. The most widely used form of ML is the supervised (or predictive) learning approach, whose goal is to learn a mapping from inputs to outputs. To be specific, let $X$ be an $\mathbb{R}^d$-valued random vector and $Y$ be an $\mathbb{R}$-valued random variable, and let $\mathcal{D}$ be the joint probability distribution of $(X, Y)$. The goal is to find a (measurable) function $f : \mathbb{R}^d \to \mathbb{R}$ such that $f(X)$ is a "good approximation" of $Y$. The random vector $X$ is called the predictor or input, and the random variable $Y$ is called the response or output. Each element of $X$ is a feature or an attribute, so there are $d$ features in $X$.

One way to obtain a better approximation is to reduce the error between $f(X)$ and $Y$, measured by a function $L = L(f(X), Y)$ called a loss function (sometimes also called a risk or error function). On the other hand, minimizing the error between $f(X)$ and $Y$ depends not only on the specific loss function $L$ but also on the joint distribution $\mathcal{D}$, which is usually unknown for $(X, Y)$. In most applications, it is often possible to obtain data generated from the distribution $\mathcal{D}$. Hence, we start with the tabular representation of the data.

2.1 Data Representation

Broadly speaking, data is a collection of facts, such as numbers, words, and measurements. One of the most common ways to represent data is as a table consisting of rows and columns. Consider the sample tabular data for outdoor tennis availability in Table 2.1. Each row is an observation (or an instance) that contains the same number of features; it provides information on the weather conditions of a day and the binary decision of whether to play tennis. Each column provides information on the same feature across all observations. For example, the first observation (row) has an input of [9, 98, 42%] and an output of "No".

Table 2.1: The Play Tennis data set. The first three feature columns form the input; the last column is the output.

Day | Wind Speed (mph) | Temperature (°F) | Humidity | Play Tennis
 1  |        9         |        98        |   42%    |     No
 2  |        3         |        80        |   35%    |     Yes
 3  |        6         |        70        |   58%    |     Yes
 4  |        7         |        60        |   62%    |     No
 5  |        9         |        55        |   40%    |     No
 6  |       14         |        75        |   35%    |     No
 7  |        2         |        85        |   38%    |     Yes
 8  |        6         |        90        |   40%    |     Yes
 9  |        8         |        75        |   76%    |     No
10  |        4         |        80        |   71%    |     No

In general, there are two types of features: quantitative and categorical. Quantitative features are those with numeric values. Categorical features, on the other hand, take values in a finite set of categories with no natural ordering. In the example above, Wind Speed, Temperature, and Humidity in the input are all quantitative features, and Play Tennis in the output is a categorical feature. Although the input can also contain categorical variables in some applications, this thesis focuses only on problems with quantitative inputs.

A row of quantitative features can be conveniently described as a feature vector. The vector space associated with these vectors is called the feature space. In the outdoor tennis example of Table 2.1, the first feature vector is $[9, 98, 42\%]$, and its associated feature space is $\mathbb{R} \times \mathbb{R} \times [0,1] \subseteq \mathbb{R}^3$, where $[0,1]$ denotes all numbers between 0 and 1 inclusive. The matrix obtained by vertically concatenating every row feature vector is called the design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ (in bold to distinguish it from the $d$-dimensional random vector $X$), where $n$ is the number of observations and $d$ is the number of features. The vector obtained by vertically concatenating each scalar output is called the response vector $\mathbf{Y} \in \mathbb{R}^{n \times 1}$.
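As a concrete illustration, and as a preview of the explicit matrices written out next, the following minimal sketch encodes the Play Tennis table as a design matrix and response vector. Python with NumPy is assumed here purely for illustration; the thesis does not prescribe any implementation, and the variable names are ours.

```python
import numpy as np

# Quantitative inputs: wind speed (mph), temperature (°F), humidity.
# Humidity is stored as a fraction so the feature space matches R x R x [0, 1].
# Each row is one observation x_i, so X has shape (n, d) = (10, 3).
X = np.array([
    [9, 98, 0.42],
    [3, 80, 0.35],
    [6, 70, 0.58],
    [7, 60, 0.62],
    [9, 55, 0.40],
    [14, 75, 0.35],
    [2, 85, 0.38],
    [6, 90, 0.40],
    [8, 75, 0.76],
    [4, 80, 0.71],
])

# Categorical output "Play Tennis", one label per observation.
Y = np.array(["No", "Yes", "Yes", "No", "No",
              "No", "Yes", "Yes", "No", "No"])

n, d = X.shape          # n = 10 observations, d = 3 features
x_1, y_1 = X[0], Y[0]   # first observation: [9, 98, 0.42] -> "No"
print(n, d, x_1, y_1)
```

For the multi-category setting discussed later, the categorical labels would typically be expanded into a K-column indicator (one-hot) representation, mirroring the construction of $\tilde{Y}$ in Section 2.2.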
Referring back to the example, the design matrix $\mathbf{X}$ and the response vector $\mathbf{Y}$ are:
\[
\mathbf{X} =
\begin{bmatrix}
9 & 98 & 42\% \\
3 & 80 & 35\% \\
6 & 70 & 58\% \\
7 & 60 & 62\% \\
\vdots & \vdots & \vdots \\
4 & 80 & 71\%
\end{bmatrix}
\in \mathbb{R}^{10 \times 3},
\qquad
\mathbf{Y} =
\begin{bmatrix}
\text{No} \\ \text{Yes} \\ \text{Yes} \\ \text{No} \\ \vdots \\ \text{No}
\end{bmatrix}
\in \mathbb{R}^{10 \times 1}.
\]
Together, $(\mathbf{X}, \mathbf{Y})$ is our data, and we denote the $i$-th observation by $(x_i, y_i) \in \mathbb{R}^{1 \times d} \times \mathbb{R}$. Note that each observation $(x_i, y_i)$ is a particular instance of the random variable $(X, Y)$ with distribution $\mathcal{D}$. To study statistical consistency properties, we often refer to the set $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ as the data, where the $(X_i, Y_i)$ are independent and identically distributed (i.i.d.) copies of the random variables $(X, Y)$ with the same distribution $\mathcal{D}$. This should not be confused with the design matrix and its response vector $(\mathbf{X}, \mathbf{Y}) \in \mathbb{R}^{n \times d} \times \mathbb{R}^{n \times 1}$.

2.2 Statistical Learning and Consistency

We tend to select learning methods for finding the function $f$ based on whether the output is quantitative or categorical. We say the problem is classification when the output is categorical, and regression when the output is quantitative. First, we shall consider the case when the output $Y$ is quantitative, i.e., a regression problem. One of the most commonly used loss functions in regression is the $L_2$ loss:

Definition 2.2.1 Let $(X, Y)$ be a random vector where $X$ is $\mathbb{R}^d$-valued and $Y$ is $\mathbb{R}$-valued. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a measurable function so that $f(X)$ approximates $Y$. The $L_2$ loss of $f$ is defined as
\[
\mathbb{E}\,|f(X) - Y|^2,
\]
where the expectation is taken with respect to the joint distribution of $X$ and $Y$.

It is known that the regression function $m(x) = \mathbb{E}(Y \mid X = x)$ is the $L_2$ loss minimizer:
\[
\mathbb{E}(Y \mid X) = \operatorname*{arg\,min}_{f : \mathbb{R}^d \to \mathbb{R}} \mathbb{E}\,|f(X) - Y|^2.
\]
Indeed, let $m(X) = \mathbb{E}(Y \mid X)$ and rewrite the $L_2$ loss by subtracting and adding $m(X)$:
\[
\mathbb{E}\,|f(X) - Y|^2 = \mathbb{E}\,|f(X) - m(X) + m(X) - Y|^2
= \mathbb{E}\,|f(X) - m(X)|^2 + \mathbb{E}\,|m(X) - Y|^2 + 2\,(\text{cross term}),
\]
where
\[
(\text{cross term}) = \mathbb{E}_{X,Y}\big[(f(X) - m(X))(m(X) - Y)\big]
= \mathbb{E}_X\Big[\mathbb{E}_Y\big[(f(X) - m(X))(m(X) - Y) \,\big|\, X\big]\Big] = 0.
\]
The last equality holds because, conditioning on $X$, $(f(X) - m(X))$ is a constant and $\mathbb{E}_Y[(m(X) - Y) \mid X] = 0$. Hence, the $L_2$ loss decouples into two parts:
\[
\mathbb{E}\,|f(X) - Y|^2 = \mathbb{E}\,|f(X) - m(X)|^2 + \mathbb{E}\,|m(X) - Y|^2. \tag{2.1}
\]
Therefore, the $L_2$ loss is minimized whenever $f(X) = m(X) = \mathbb{E}(Y \mid X)$.

Note that in general applications it is impossible to obtain $m(X)$ directly, because the joint distribution of $X$ and $Y$ is usually unknown. One solution is to estimate $m(X)$ from the data. More precisely, let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be $n$ independent and identically distributed copies of the random variables $(X, Y)$ with distribution $\mathcal{D}$, and let $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ be the data. Instead of finding the $L_2$ minimizer $m(X)$, we will try to find the empirical $L_2$ minimizer $m_n$ such that
\[
m_n = \operatorname*{arg\,min}_{f \in \mathcal{F}_n} \frac{1}{n} \sum_{j=1}^{n} |f(X_j) - Y_j|^2,
\]
where $\mathcal{F}_n$ is a set of functions that depends on the data size $n$ and the learning method we select. In general, the empirical $L_2$ minimizer $m_n(X)$ will not equal the actual $L_2$ minimizer $m(X)$. We would like $m_n(X)$ to be "as close as possible" to $m(X)$ as the sample size $n$ goes to infinity. To be more precise, note that $m_n(X)$ is a random variable that depends on $D_n$ and $X$. Let $\mu$ denote the distribution of $X$; the $L_2$ error between $m_n(X)$ and $m(X)$ is then defined as
\[
\mathbb{E}_X |m_n(X) - m(X)|^2 = \int |m_n(x) - m(x)|^2 \,\mu(dx). \tag{2.2}
\]
Note that $\int |m_n(x) - m(x)|^2 \,\mu(dx)$ is a random variable that depends on the data $D_n$. To study the statistical properties of this random variable, we consider the following definitions of statistical consistency:

Definition 2.2.2
1.
We sayfm n g is weakly consistent for a given distributionD of(X;Y) if lim n!¥ E Z jm n (X) m(X)j 2 m(dx) = 0: 8 2. We sayfm n g is strongly consistent for a given distributionD of(X;Y) if lim n!¥ Z jm n (X) m(X)j 2 m(dx)= 0 a:s:(with probability one): 3. We sayfm n g is weakly universally consistent if it is weakly consistent for all distribution of (X;Y) with E(Y 2 )<¥ 4. We sayfm n g is strongly universally consistent if it is strongly consistent for all distribution of(X;Y) with E(Y 2 )<¥ When the output Y is categorical, we may still be able to use the L 2 loss to evaluate the differ- ence between the d-dimensional random vector X andfC 1 ;:::;C K g-valued random variable Y by reconstructing Y into a K-dimensional random vector ˜ Y . The k th component of ˜ Y takes on value 1 if Y yields class C k and takes on the value 0 otherwise for k= 1;:::;K. With such 1 to 1 relation- ship between Y and ˜ Y , our goal becomes finding the function f :R d !R K such thatE( f(X) ˜ Y) is minimized. 2.3 Machine Learning Framework In practical supervised machine-learning applications, we only have access to the design matrix and its associated response vector(X;Y)2R nd R n1 . Our goal becomes finding a measurable f :R d !R such that L( f(X);Y) is minimized for a loss function L. The expression f(X)2R n1 is the row-wise operations of the design matrix X, so the i th entry of f(X) is f(x i ) where x i 2R d is the i th row of X. One common loss function L for evaluating the difference between f(X) and Y is the mean square error (MSE): MSE( f(X);Y)=jj f(X) Yjj 2 2 = 1 n n å i=1 j f(x i ) y i j 2 9 where y i 2R is the i th entry of Y. On the other hand, the performance evaluation of f cannot solely relies on the loss L( f(X);Y) since it is possible to create some function f that actually “learn” nothing about the relationship between X and Y, yet the loss L( f(X);Y)= 0. Consider the following example, let f be a dictionary function defined as f(x)= 8 > > < > > : y; if x is an observation in X ¯ Y; otherwise where ¯ Y2R is the average (or most frequent category) of the quantitative (or categorical) output. In this case, the loss is always 0 since f(X)= Y. However, for any unseen tabular data(X 2 ;Y 2 ) that different from the original, we will expect a large loss L( f(X 2 );Y 2 ) since f(X 2 )= ¯ Y which is simply an average (or most frequent) dummy method. The solution to this problem is by reserving part of the data for testing purpose only. To be specific, we will first divide (X;Y) into two disjoint subsets: the training data set (X train ;Y train ) and the testing data set (X test ;Y test ). Both data sets have same features but disjoint observations. Then we will find f by minimizing the training loss L( f(X train );Y train ) on the training data set only. Once we find the function f , we will evaluate this function on the testing data set by computing the testing loss L( f(X test );Y test ). Because the testing data set (X test ;Y test ) are never involved in the process of finding f , we generally expect f to produce slightly higher testing loss than training. On the other hand, if a given function f yields a small training loss but significantly larger testing, we are said to be overfitting the data. The dictionary function mentioned before is an extreme example of overfitting with zero training loss and large testing loss. In general, overfitting happens because the method is trying too hard to find patterns in the training data. 
Some patterns may be caused by random chance rather than the actual relationship between X and Y. Therefore, we will find the function f through minimizing the loss on the training data set and evaluate the performance of f on the testing data set. 10 In machine-learning framework, the function f is called the model, and the learning method to find f is called the algorithm. In the next chapter, we will focus on finding the measurable function f using two of the most popular machine learning algorithms: Decision Tree and Neural Network. 11 Chapter 3 Introduction of Decision Tree and Artificial Neural Network Decision Tree (DT), is a widely used data-modeling technique for over fifty years in the field of machine learning and statistics [9]. A decision tree seeks to partition the feature space with hy- perplanes parallel to coordinates axes into subsets for which a simple model can be built relating dependent variable Y to feature vector X seems more accessible. It has an intrinsic feature se- lection component that seeks to reduce uncertainty entropy or Gini-impurity. The interpretable flowchart-like structure and widely available software (CART [1], C4.5 [10]) help the techniques gain acceptance and popularity. Artificial Neural Networks (ANN), often shorten as Neural Network (NN), have been consider to be the universal function approximators [11] [3]. An ANN models the output y by applying a non-linear activation function to a linear combination of the outputs of the previous layer, starting with the input x as the outputs of the zeroth layer or input-layer. Initial weight parameters in the linear combination are typically randomly selected. Optimization of weights is further carried out by highly efficient, iterative gradient-based optimization methods. 3.1 Decision tree In general, CART partitions the feature space into K disjoint subsetfR 1 ;:::;R K g, and then one fits a simple model for each R k for k= 1;:::K. In order to partition the feature space, there are two questions we need to answer first: 12 1. How to partition the feature space (grow a tree) 2. How many disjoint subsets is sufficient (prune a tree) Before diving into the detail of those two steps, it is helpful to introduce some common termi- nologies below with Figure 3.1 for CART that appear multiple times in this thesis. Figure 3.1: A CART sample with names on each nodes • Root Node represent the entire data(X;Y), and the splits start here • Decision Node is a sub-node that needs further splitting • Leaf/Terminal Node is a node that does not splits anymore. • Parent Node and Child Node is a pair of nodes between a non-leaf node and two sub-nodes that it directly connected to. 3.1.1 Growing A binary tree partitions the feature space into disjoint subsets by iteratively splitting the training data set into subsets according to whether or not x j < b for some j= 1:::d, and some b2R, for all input training vector x2R d . Figure 3.2 provides an example of a classification binary tree. 13 Figure 3.2: A decision tree example. Given any observation(x;y)2R 2 f0;1g, this tree partitions the feature space into disjoint subsetfR 1 ;:::;R 5 g and then predict the output ˆ y as the most frequent class. Selection of x j and b in each partition can be based on a greedy algorithm to minimize a predefined splitting criterion, often referred to as the loss function. To illustrate this idea, let Q m denotes the set of observations in node m and Q m1 ;Q m2 denote the left and right child of Q m respectively. 
Note that when m= 0, Q 0 is the entire training data set. Mathematically, we have Q m =fObservations in node mg Q m1 ( j;s)=fx2 Q m : x j < sg Q m2 ( j;s)=fx2 Q m : x j sg for m 0. Once we select a loss function Loss m ( j;s), we seek the splitting feature j and split point s to minimize the weighted sum of loss from left and right child: min j;s w 1 Loss m1 ( j;s)+ w 2 Loss m2 ( j;s) 14 where Loss m1 ( j;s) and Loss m2 ( j;s) are the loss from left and right child with weights w 1 and w 2 respectively in their parent Q m . Below are some commonly used loss function: • MSE: Loss m = 1 jQ m j å x i 2Q m (y i c) 2 where c= ave(y i jx i 2 Q m ) denotes the average of the training output andjQ m j denotes the number of observations in node m. • Entropy: Loss m = å k2K ˆ p m;k log ˆ p m;k where ˆ p m;k = 1 jQ m j å fi:x i 2Q m g I(y i = k) (for classification only) is the empirical probability (or frequency) of observations that have class k in node m. The entropy measures the level of surprise of the outcome from a given distribution. In a multinomial distribution of K categories, the least surprise case is when the probability of the k th category is 1 with the rest categories being 0 (so it is not surprised at all when we see the output) for 1 k K. The entropy of the least surprise case is 0. The most surprising case is when the probability of each category is equally likely, which will result in entropy of logK. • Gini-impurity: Loss m = å k2K ˆ p m;k (1 ˆ p m;k ) Note that Gini-impurity measures how often a randomly chosen observations are incorrectly classified if it was randomly labeled based on the distribution of those K classes in node m. It reaches the minimum of 0 if all observations have the same class in node m. • Mis-classification rate: Loss m = 1 jQ m j å fi:x i 2Q m g I(y i 6= argmax k ˆ p m;k ) 15 Note that the MSE criterion is capable of both regression and classification problems. On the other hand, Entropy, Gini-impurity, and Mis-classification rate are also popular choices in classification. Overall, the procedure of growing a tree can be summarized as the following: 1. Select the splitting feature j and the split point s that minimize w 1 Loss m1 ( j;s)+ w 2 Loss m2 ( j;s) starting from the root node m= 0. 2. Define each sub-node that directly connects to Q m as Q 1 = Q 01 ( j;s)=fx2 Q 0 : x j < sg Q 2 = Q 02 ( j;s)=fx2 Q 0 : x j sg 3. Repeat this process for Q 1 , Q 2 ,. . . until a stopping criterion is met. 3.1.2 Pruning Generally speaking, there are two ways to decide the number of disjoint subset in order to achieve sufficient performance. • Pre-pruning method stop growing the tree earlier by applying stopping criteria before it grows to a full-size tree that perfectly classifies the training set. • Post-pruning allows the tree to grow fully that perfectly classify the training set, and then we “trim” the fully-grown tree by a cost function. One pre-pruning stopping criterion sets a threshold value b for the loss reduction from parent to child nodes. We stop splitting if Loss m w 1 Loss m1 ( j;s)+ w 2 Loss m2 ( j;s) 0), which equals 1 if x> 0; and 0 otherwise • Logistic (or sometimes referred as sigmoid): 1 1+ e x • Hyperbolic tangent: 1 2 (tanh(x)+ 1) • Bounded rectified linear units (bounded ReLU): min(max(0;x);1). Definition 3.2.2 A neuron is a function g fromR d toR given by g(x)=s(xa+ b) wheres is an activation function,x2R 1d ,a2R d1 and b2R. 
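The activation functions of Definition 3.2.1 and the neuron of Definition 3.2.2 are simple enough to state directly in code. The following is a minimal sketch; Python with NumPy is assumed for illustration only, and the function and variable names are ours rather than anything prescribed by the thesis.

```python
import numpy as np

# Activation functions from Definition 3.2.1 (all map R into [0, 1]).
def threshold(x):
    """Indicator I(x > 0): 1 if x > 0, else 0."""
    return (x > 0).astype(float)

def logistic(x):
    """Standard logistic (sigmoid): 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def scaled_tanh(x):
    """Hyperbolic tangent rescaled to [0, 1]: (tanh(x) + 1) / 2."""
    return 0.5 * (np.tanh(x) + 1.0)

def bounded_relu(x):
    """Bounded ReLU: min(max(0, x), 1)."""
    return np.clip(x, 0.0, 1.0)

# A neuron g(x) = sigma(x a + b) from Definition 3.2.2, where x is a
# 1 x d row vector, a is a d x 1 column vector, and b is a scalar offset.
def neuron(x, a, b, sigma=logistic):
    return sigma(x @ a + b)

# Illustrative use: a neuron that fires when the first feature exceeds 5.
x = np.array([[9.0, 98.0, 0.42]])        # 1 x d input row vector
a = np.array([[1.0], [0.0], [0.0]])      # d x 1 weight vector
b = -5.0                                 # offset
print(neuron(x, a, b, sigma=threshold))  # -> [[1.]]
```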
18 Figure 3.4: Plots of commonly used activation functions In neural network, we will often consider neurons in a set which we refer as a layer. Therefore, it is natural to extend the definition of an activation function element-wise,s :R n ![0;1] n : [s(x)] i =s(x i ) Hence, the vector-valued activation is the scalar-valued activation that applied on each element of the vector. We will use the same notations for both cases for convenience. Consider a training dataset with N training examples. Let M be the number of attributes from the input and K be the number of attributes in the output. In addition, let(X;Y)2R NM R NK denote the training data and(x;y)2R 1M R 1K denote a single training example. Note that in classification, K attributes from the output correspond to K classes from the categorical output. 19 3.2.1 Construct the Neural Network To construct a neural network, the first layer (set of neurons) is the input layer which is basically the input x. Each subsequent layer is a group of linear combination of previous layer composite with an activation function. Such process is called the Feed Forward. The last layer of the network is called the output layer, and a layer between the input layer and the output layer is called the hidden layer. For example, in figure 3.3, the first hidden layer has two neurons and the second hidden layer has five neurons. Next, suppose there are L 1 hidden layers in the network, and let y (0) 2R 1d be the input layer x. Denotefy (1) ;:::;y (L1) g as the set of hidden layers and y (L) as the output layer of the network. From(l) th layer to(l+ 1) th layer, we have: z (l+1) i = y (l) W (l+1) i (3.1) y (l+1) i =s(z (l+1) i ) (3.2) for l= 0;1;:::;L 1 where • W (l+1) i 2R M l 1 is the pre-determined weight column parameter, • M l is a pre-determined constant such that y (l) 2R 1M l , so M l determines the number of neurons in l th layer. • y (l) W (l+1) i is the matrix multiplication, • y (l+1) i (scalar) is the i th node at layer(l+ 1), Note that from (l) th layer to (l+ 1) th layer, the choice of activation functions s can be different than choices from (l+ 1) th to (l+ 2) th layer, so equation (3.2) should technically be written as y (l+1) i =s l (z (l+1) i ). For simplicity, we will assume each layer’s activation function are the same for the remaining of this section. 20 Overall, the computation sequence of each layer is a feed-forward in an increasing order: y (0) ! y (1) !! y (L) 3.2.2 Gradient-Based Optimization Once every layer of neural network y (0) ;y (1) ;:::;y (L) is constructed, the next step is to minimize the difference between the predicted output y (L) and the actual output y by adjusting weight parameters between each layer. We will first define the loss function to measure the difference between y (L) and y: E := 1 2 jjy (L) yjj 2 2 = 1 2 K å k=1 (y (L) k y k ) 2 To write y (L) in term of a composite function of the input x and weights W (L) ;W (L1) ;:::;W (1) , we have: y (L) =s(y (L1) W (L) ) =s(s(y (L2) W (L1) )W (L) ) =s(s(s(y (L3) W (L2) )W (L1) )W (L) ) = . . . =s(s(s(s(xW (1) )W (L2) )W (L1) )W (L) ) where W (l+1) 2R M l M l+1 is the weight matrix with i th column being W (l+1) i for l= 0;1;:::L 1. Let w (l+1) ji denote the j th component of W (l+1) i , so w (l+1) ji is the weight parameter from y (l) j to y (l+1) i . 
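Before turning to the weight updates, the feed-forward computation of equations (3.1) and (3.2), together with the loss E, can be summarized in a short sketch. This is an illustrative Python/NumPy implementation under assumed layer sizes and random weights, not the exact code used in the thesis; for simplicity it applies the same logistic activation at every layer.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def feed_forward(x, weights, sigma=logistic):
    """Compute y^(0) -> y^(1) -> ... -> y^(L) for a row-vector input x.

    weights is a list [W1, ..., WL], where Wl has shape (M_{l-1}, M_l),
    so that z^(l) = y^(l-1) W^(l) and y^(l) = sigma(z^(l)).
    Returns the list of layer outputs [y^(0), ..., y^(L)].
    """
    layers = [x]
    for W in weights:
        z = layers[-1] @ W          # linear combination, eq. (3.1)
        layers.append(sigma(z))     # element-wise activation, eq. (3.2)
    return layers

def loss(y_pred, y_true):
    """Squared-error loss E = 0.5 * ||y^(L) - y||_2^2."""
    return 0.5 * np.sum((y_pred - y_true) ** 2)

# Illustrative network matching Figure 3.3: d = 3 inputs, hidden layers
# with 2 and 5 neurons, and K = 2 output neurons, randomly initialized.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)),
           rng.normal(size=(2, 5)),
           rng.normal(size=(5, 2))]

x = np.array([[9.0, 98.0, 0.42]])   # 1 x d input row vector
y = np.array([[1.0, 0.0]])          # one-hot target with K = 2 classes
layers = feed_forward(x, weights)
print(loss(layers[-1], y))
```

Keeping the list of intermediate outputs y^(l) is convenient because the back-propagation step described next reuses exactly these quantities to form the products of sigma'(z^(l+1)) with y^(l).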
To minimize the loss E, we will update w (l+1) ji using iterative gradient-based optimization: w (l+1) ji w (l+1) ji h ¶E ¶w (l+1) ji 21 where • i ranges from 1 to M l+1 , the number of nodes in layer(l+ 1), • j ranges from 1 to M l , and • l ranges from 0 to L 1, and • h is the step size in gradient descent. For simplicity, we assume the step size h to be fixed for each weight update. To compute the gradient term ¶E ¶w (l+1) ji , we apply the chain rule: ¶E ¶w (l+1) ji = ¶E ¶y (l+1) i dy (l+1) i dz (l+1) i ¶z (l+1) i ¶w (l+1) ji = ¶E ¶y (l+1) i s 0 (z (l+1) i ) y (l) j where the s 0 (z (l+1) i ) and y (l) j is derived from equation(3.1) and (3.2) respectively in the feed- forward step. It remains to compute ¶E ¶y (l+1) i for l= 0;:::L1. Because E= 1 2 å K k=1 (y (L) k y k ) 2 , we will find ¶E ¶y (l+1) i in backward, starting from the last layer ¶E ¶y (L) i Case 1, the output layer when l+ 1= L: ¶E ¶y (L) i = ¶ ¶y (L) i 1 2 K å k=1 (y (L) k y k ) 2 =(y (L) i y i ) Case 2, the hidden layer when l+ 1= 1;:::;L 1: For(l+ 1) th hidden layer y (l+1) , its i th component y (l+1) i , will only affect E through nodes in (l+2) th layer that he is directly connected to. Therefore, we will compute ¶E ¶y (l+1) i in backward base on the derivative with respect to the next layer y (l+2) as shown in Figure 3.5: 22 Figure 3.5: Nodes in(l+ 2) th layer that y (l+1) i connects to ¶E ¶y (l+1) i = M l+2 å q=1 ¶E ¶y (l+2) q ¶y (l+2) q ¶z (l+2) q ¶z (l+2) q ¶y (l+1) i = M l+2 å q=1 ¶E ¶y (l+2) q s 0 (z (l+2) q ) w (l+2) iq ; where ¶E ¶y (l+2) q is already computed since the computation sequence of the derivative is from the output layer L to the input layer: ¶E ¶y (L) ! ¶E ¶y (L1) !! ¶E ¶y (0) : The backward computation of the derivative in finding the gradient is called the back propagation. 3.2.3 Additional Remarks Remark 1: Note that each observation will update weight parameters w (l+1) ji once, and the entire training data set with N observations will therefore, update weights N times. Updating weights per 23 observation is called the Stochastic Gradient Descent (SGD). In contrast to SGD, another common weights update is the standard Gradient Descent, which updates weights when an entire training data set is fed into the network. Therefore, a training data set with N observations will only update weights once. In this case, the loss E will then be redefine as: E (GD) def = 1 N N å n=1 E (SGD) = 1 N N å n=1 1 2 jjy (L) n y n jj 2 2 To summarize key comparisons in between GD and SGD: • In standard gradient descend, weights will be updated once we sum all examples’ loss, whereas in SGD, weights are updated upon examining each training example. • The standard gradient descend requires more computation per weight update because the loss function E (GD) is averaged over N training data. On the other hand, averaging all obser- vations reduces the noise of each individuals, so GD can often used with a larger step size per weight update than SGD. • In cases where there are multiple local minima in E (GD) = 1 N å N i=1 E (SGD) with respect to weights w (l+1) i , using SGD can sometimes avoid falling into these local minima because it uses ¶E (SGD) ¶W to guide its search direction. Remark 2: Recall that in feed-forward, z (l+1) i = y (l) W (l+1) i . 
Sometimes, we would like to add some bias to z (l+1) i that does not depends on the previous layer y (l) as: z (l+1) i = y (l) W (l+1) i + b (l+1) i 24 Instead of adjusting every calculation in back-propagation, we can simply redefine the weight and the l th layer as: ˜ W (l+1) i = 2 6 4 b (l+1) i W (l+1) i 3 7 5 , ˜ y (l) = 2 6 4 1 y (l) 3 7 5 : so that z (l+1) i = ˜ y (l) ˜ W (l+1) i = y (l) W (l+1) i + b (l+1) i Hence, the rest of computation in back-propagation can remain unchanged. Remark 3: Given a neural networkN 0 , we can always add more neurons and layers to N 0 with zero initial weights, so the output of the new network (sayN 1 ) is the same as the output ofN 0 . Doing so will add more complexity of the network and therefore reduce the training loss. On the other hand,N 1 does not guarantee to have a lower testing loss as mentioned in Section 2.3. 25 Chapter 4 Neural Decision Tree Artificial neural networks (ANNs) have been considered to be universal function approximators. They have the capability to approximate the function without knowing its type. However, select- ing a network architecture to achieve the smallest approximation error is a significant difficulty. Decision Trees, on the other hand, can partition the feature space with hyperplanes and fit a sim- ple model in the partition. Although continuous learning for a decision tree is not as efficient as for a neural network and linear splits along the coordinate often limit its effectiveness, trees can be easily implemented and provide a decision boundary in almost any situation. Such natural complementary between ANNs and decision trees inspires us to find a hybrid method combining both. In this chapter, we shall present our approach for constructing a Neural Decision Tree (NDT). The intuition behind the NDT comes from a basic comparison between a single elementary neuron and a decision node in a binary tree generated by CART. To compare a neuron and a tree node, let s represent an elementary neuron with input x2R d : s(x)= a(w T x b); w2R n ;b2R; (4.1) 26 (a) A decision tree with labeled and enu- merated 4 decision nodes and 5 leaf nodes. (b) The corresponding 2-hidden-layers NDT. Figure 4.1: An example of decision tree with its associated NDT. where a :R7![0;1] is referred to as an activation function. When a=I is the indicator function for non-negative real numbers, the function s can be rewritten as s(x)= 8 > < > : 1 w T x b 0; 0 w T x b< 0: As a result, s essentially creates a partition ofR d by the hyperplane w T x b= 0 into two subsets S 1 = s 1 (1) and S 0 = s 1 (0). The action of a decision node in a binary tree is indeed a such partition as well, except that a common decision tree partitions the feature space according to the value of a single component x j of feature vector x. Thus, by taking w= e j , the partitions created by s are S 1 =fx2R d ;x j bg; S 0 =fx2R d ;x j < bg: Consequently, by representing every decision node in a binary tree with an elementary neuron of the above form, it is possible to determine from the outputs of these neurons which leaf node an input vector x should be placed in. Since each leaf is assigned a single class in classification tree or a node average in a regression tree, it is therefore possible to reproduce the outcome of a decision tree exactly using a neural network in which activation functions are indicator function. 27 More concretely, our construction of a NDT requires two phases: 1. 
A neural network is constructed using an indicator functionI(x)= 1 for x> 0 andI(x)= 0 otherwise, as an activation function to replicate the input/output relationship of a decision tree and provide initial weights for the NDT. 2. Replacing activation functions at various layers with strategically selected “smoother” acti- vation function to relax the decision boundary from trees. As a result a typical NDT has 2 hidden layers that represent the set of decision and terminal nodes respectively as in Figure 4.1(b). 4.1 Replicating a decision Tree with a Neural Network In this section, we shall demonstrate how to replicate a multi-classification tree to a neural network using an indicator activation function. First, we denote the bold(X;y)2R nd R n as the training data and the (x;y)2R 1d R as one single observation. In particular, we will treat the input x2R 1d as a row vector for notation simplicity in this section. Consider a standard binary treeT. Let K be the number of decision nodes. For a decision node j, the splitting criterion is denoted by x q( j) < d j . As a binary tree,T has K+ 1 leaves, and each leaf is assigned to one of m categories. First Layer: The first hidden layer, h, is constructed to replicate the set of decision nodes inT as in Figure 4.3(a). Hence, h2R 1K contains K number of neurons. Given an input x2R 1d as a row vector, let h j =I(xW (1) j + b (1) j ) be the j th neuron of h. The initial weight vector W (1) j 2R d and a offset b (1) j 2R for j= 1:::K n will be selected such that the output of the neuron equals to one when the criterion for the split of decision node j is verified, and zero otherwise. Note that the real-valued indicator function isI(x)= 1 for x> 0 andI(x)= 0, and the vector-valued function I(x) can be generalized by element-wise operation. For a decision node j, the splitting criterion is denoted by x q( j) < d j where x q( j) denote the q( j)’s attribute of the input x. The weight vector W (1) j and the offset b (1) j are then initialized as the following: 28 Figure 4.2: 2D partition of the feature space into 5 subsets each with specific shading associated with 5 leaves in Figure 4.1(a). The 4-tuple vector is the binary representation of 4 decision nodes in the same tree. W (1) i; j = 8 > > < > > : 1 if i= q( j); 0 otherwise (4.2) b (1) j = d j (4.3) for i= 1:::d. Hence, given any input x, the output h=I(xW (1) + b (1) ) is a binary 0;1 vector that represents the splitting results of the treeT where W (1) =[W (1) 1 ;;W (1) K ]2R dK ; b (1) =[b (1) 1 ;;b (1) K ]2R 1K : . For the sample tree in Figure 4.1(a), the 4-tuple vector in Figure 4.2 marks different binary representation. Second Layer: The output of the second hidden layer r2R K+1 is designed as a binary vector with K+ 1 entries representing the K+ 1 leaves in a binary tree with K decision nodes. The j-th 29 entry of r, r j (x)= 1 if the input x should be in the partition represented by the j-th leaves. It is important to note that each value of the binary vector h uniquely identifies a leaf on the tree. Thus for each neuron r j =I(hW (2) j + b (2) j ), the initial weights W (2) j 2R K and offsets b (2) j 2R for j = 1;:::;K+ 1 are defined such that when the value of input binary vector is associated with leaf j, the neuron produces an output of one, and zero otherwise. Let P j denote the set of all possible binary vectors from the first layer that is associated with leaf node j. 
If for any vector p2 P j f0;1g K the i-th component p i = 1, then the criterion for the i-th decision must be verified for leaf j. Similarly, if for any vector p2 P j , p i = 0 the criterion for the i-th decision must be false. On the other hand if p i can be either 0 or 1 then the i-th decision does not impact leaf j. The weights W (2) j and offsets b (2) j are given by: for i= 1:::K W (2) i; j = 8 > > > > > > < > > > > > > : 1 if p i = 1; 8p2 P j ; 1 if p i = 0; 8p2 P j ; 0 if p i can be either 0 or 18p2 P j : (4.4) b (2) j = å fi:W (2) i; j =1g 1 + 1 2 : (4.5) Hence, the output of the second layer r=I(hW (2) + b (2) ) is also a binary vector with only a single component equals to 1 which, for a given input x, indicates that it belongs to the partition ofT. Figure 4.2 demonstrate the 2D partition of each region (leaf node) with its associate binary representation. 30 The intuition of such initialization is the following: if an input x belongs leaf node j inT, then hW (2) j = å fi:W (2) i j =1g 1 hW (2) j + b (2) j = å fi:W (2) i j =1g 1+ b (2) j = 1 2 : Otherwise, hW (2) j + b (2) j < 1 2 . Output Layer: Depends on whether the problem is classification or regression, the output layer will be constructed differently. For a classification problem, the output layer has the same number of neurons as the number of categories in a classification problem. For class j, the role of the output layer is simply assign a value 1 to the j-th neuron whenever an input vector x is lead to a leaf assigned to category j, and 0 to all other neuron in this layer. Let C be the number of classes. The output layer has the form y j (3) =I(rW (3) j + b (3) j )2R C where initial weights W (3) j 2R (K n +1) and offsets b (3) j 2R for j= 1:::C are given by W (3) i; j = 8 > > < > > : 1; if leaf i yield class j 1; if leaf i does not yield class j b (3) j = 0 as in Figure 4.3(c). For a regression problem, the output layer has a single neuron. and it represent the final output from the treeT. The neuron will select the regression output from the associate leaf node. Let fC 1 ;:::C K+1 g be the regression output for leaf nodef1;:::K+ 1g and W (3) 2R (K+1)1 ;b (3) 2R 31 be the weight and offsets from the layer r to the output layer. The initialization of W (3) and b (3) are given by W (3) j = C j b (3) = 0 for j= 1;:::K+ 1. At last, the neural network output is y (3) = rW (3) + b (3) : Hence, for any given input x, y (3) =I(rW (3) + b (3) ) is identical to the output of the treeT. One of the differences between NDT construct in previous work by Biau and ours is in the selection of weights and offsets. In our case, weights and offsets are selected so that both h and r are binary vector inf0;1g m instead off1;1g m for m= K n ;K n + 1 in the previous work. Even though this appears to be a simple deviation, our choice enable us to partially preserve the splitting criterion in a decision tree during subsequent relaxation of decision boundaries. We shall discuss the detail of selection of activation functions in Section 4.2, Relaxation. 4.2 Relaxation The main purpose of initializing an ANN with a decision tree is that the partition of the feature space created by CART offers a rough approximation of the level sets of the true classifier. How- ever, the restrictive use by CART of only hyperplanes perpendicular to axes of the feature space is 32 (a) Replicating from decision nodes of the treeT to first hid- den layer of NDT. (b) Replicating from leaf nodes ofT to second hidden layer of NDT. 
(c) Replicating from the output ofT to the output of NDT. Figure 4.3: Replicating a decision tree with neural network in 3 steps. 33 unlikely to be optimal for an efficient approximation. In order to enable optimization techniques such as stochastic gradient descent to train the ANN that initialized with a decision tree, we need to replace indicator I(x) by a “smooth” activation function s(x). The selection of activation functions can have a significant impact on the perfor- mance of the final NDT. Our experience indicates that without a strategic selection of activation functions could remove any advantage of NDT over uninitialized ANN. Appendix A illustrate the impact of a properly chosen activation function for the first layer. From the input x to the first hidden layer h, our experience suggests the bounded Rectified Linear (ReL) activation function s 1 (x)= min(max(0;x);1) (4.6) where s 1 (x) is the activation for h j =s 1 (xW (1) j + b (1) j ). This selection make use of the fact that s 1 (x) has a strict 0 as a lower bound, so h partially preserves the splitting criterion of the deci- sion tree. For second layer r j =s 2 (hW (2) j + b (2) j ), we suggest using the standard logistic function s 2 (x)= 1 1+e x . For the output layer y (3) j =s 3 (rW (3) j + b (3) j ) we use the standard soft-max function s 3 (r) j = e r j =å C j=1 e r j , where C is the number of classes to estimate the probability of class j. That is,s 3 (r) j P(Y = jjX), for j= 1:::C. 34 Chapter 5 Statistical Consistency An important characteristic of a promising statistical learning algorithm is the convergence of the optimally constructed classifier toward the ’true’ underlying class assignment as the volume of training data tend toward infinity. Algorithms with this characteristics are referred to as statistically consistent. In this chapter, we shall establish the consistency of the empirical L 2 loss minimizer over the set of 2 layer NDT functions based on the statistical learning framework in Section 2.2. 5.1 Consistency To simplify the notation, we will only establish the consistency result in binary classification in this section. Generalization to multiple category classification problem and bounded regression problem follows essentially identical arguments albeit with more cumbersome notations. Let(X;Y) denote a random vector where X isR d -valued, and Y isf0;1g-valued. Recall from equation (2.2), the regression function m(X)=E(YjX) is the minimizer of the L 2 riskE X;Y j f(X) Yj 2 over the set of Borel measurable f :R d !R with respect to probability distribution of(X;Y). Let (X 1 ;Y 1 );:::;(X n ;Y n ) be independent and identically distributed (i.i.d.) samples of (X;Y), and let D n =f(X i ;Y i )g n i=1 represents the training data set. The empirical L 2 risk minimizer m n is defined as: m n = arg min f2F n 1 n n å j=1 j f(X j )Y j j 2 ; 35 whereF n is the set of 2 hidden-layer NDT functions that may depend on the sample size n. To defineF n more specifically, let K n be the number of decision nodes from the tree that initializes NDT, so K n + 1 is the number of nodes in the second hidden layer. Let W (1) 2R dK n , W (2) 2R K n (K n +1) , W (3) 2R (K n +1)1 represent the vectors of weights defining all neurons in the first, second and the output layer of a NDT, respectively. Let b (1) 2R K n , b (2) 2R K n +1 and b (3) 2R be offsets for the neurons of the three layers. 
We assume that there exists a constant c> 0 such that for the second hidden layer, output layer, and for all n, jjW (2) jj 1 +jjW (3) jj 1 +jjb (2) jj 1 +jb (3) j cK n ; (5.1) wherekW (2) k 1 ,kb (2) k 1 represent the L 1 matrix norm for these matrices. In fact, if all entries in W (2) , b (2) W (3) and b (3) are bounded by a constant, the above automatically assumption holds. Let l be the parameter tuple: l = W (1) ;W (2) ;W (3) ;b (1) ;b (2) ;b (3) Let W (i)T denote the transpose of W (i) for i= 1;2;3. Given a column vector X2R d , our two hidden-layer neural network function has the following form: f l (X)=s 3 (W (3)T s 2 (W (2)T s 1 (W (1)T X+ b (1) )+ b (2) )+ b (3) ) (5.2) where s i is bounded, non-decreasing, and globally Lipschitz continuous activation function for layer i= 1;2;3 such that lim x!¥ s(x)= 0, lim x!¥ s(x)= 1. Also, we use s i to represent a vector-valued function for which the activation function is applied element-wise to its arguments. That is, for Z2R d ,[s(Z)] j =s(z j ) for j= 1;:::;d. Hence, the set of 2 hidden-layer neural decision tree functions,F n , can be written as: 36 F n = f l (X) : X2R d ;l satisfies (5.1) (5.3) Theorem 5.1.1 (Main Result: Strongly Universal Consistency of m n ) Let (X;Y)2R d f0;1g be a random vector. Let(X 1 ;Y 1 );:::;(X n ;Y n ) be i.i.d. samples of(X;Y) and D n =f(X i ;Y i )g n i=1 . Let F n be the class of neural networks defined above, and let m n be the empirical L 2 loss minimizer in F n that depends on D n . If K 2 n log(K 4 n ) n ! 0 and inf f2F n E X ( f(X) m(X)) 2 ! 0 as n!¥, then for any distribution for(X;Y), E X (m n (X) m(X))= Z jm n (x) m(x)j 2 m X (dx)! 0 a:s: In prior work, [12] provides general result of consistency for 1 (hidden) layer standard neural network. [6] presents a weak (converge in expectation) consistency proof of 2 hidden-layer NDT with assumption of uniform distribution for X. In Theorem 5.1.1, we gave a strong (converge almost surely) universal (for any distribution of X) consistency proof of 2 layer NDT assuming m(X) can be approximated by functions ofF n in L 2 . The proof of Theorem 5.1.1 involves concepts from covering and packing numbers (in section 5.2) and VC dimension (in section 5.3). The actual proof will start in section 5.4. 5.2 Covering and Packing Numbers The concept of covering and packing number has an important role in obtaining upper bounds for the exponential inequality, which is used to prove the Theorem 5.1.1. The relationship between the covering number and the exponential inequality is briefly discussed in Appendix C. We first introduce the concept of covering and packing numbers and study the relationship between them. 37 Definition 5.2.1 Let G be a set of functions g :R d !R. Let z n d =fz 1 ;:::z n g be n fixed points in R d , so z i 2R d for i= 1;:::;n. Fore > 0, (1). Thee-cover of G on z n d is a set of functionsfg 1 ;:::;g N g where g 1 ;:::g N :R d !R for some N2N that for every g2 G, there exist a g j 2fg 1 ;:::;g N g such that 1 n n å i=1 jg(z i ) g j (z i )j<e (2). Thee-coveringnumberN (e;G;z n d ) on z n d is the minimal N2N such thatfg 1 ;:::;g N g is ane-cover of G. (3). Thee-packing of G on z n d is the set of functionsfg 1 ;:::;g N g where g 1 ;:::g N 2 G for some N2N that for all 1 j< k N, 1 n n å i=1 jg j (z i ) g k (z i )je (4). The e-packing number M(e;G;z n d ) on z n d is the maximal N2N such thatfg 1 ;:::;g N g is ane-packing of G As the next Lemma shows, the covering and packing number are closely related. Lemma 5.2.1 Let z n d =fz 1 ;:::z n g. 
For any z 1 ;:::;z n 2R d , we have M(2e;G;z n d )N (e;G;z n d ) M(e;G;z n d ) Proof 5.2.1 For the first inequality, letfh 1 ;:::;h l g be 2e-packing of G. Then fix any g2 G, let U e (g)= h :R d !R : 1 n n å i=1 jh(z i ) g(z i )j<e 38 Note that U e (g) contains at most one of h i ’s from 2e-packing. If not, say both h 1 ;h 2 2 U e (g), then 1 n n å i=1 jh 1 (z i ) h 2 (z i )j= 1 n n å i=1 jh 1 (z i ) g(z i )+ g(z i ) h 2 (z i )j 1 n n å i=1 jh 1 (z i ) g(z i )j+ 1 n n å i=1 jh 2 (z i ) g(z i )j 2e (because h 1 ;h 2 2 U e (g)) which contradicts the fact that h 1 ;h 2 comes from the 2e-packing of G. Therefore U e (g) contains at most one h i ’s, and this implies anye-covering of G contains at least l=jfh 1 ;:::;h l gj functions, which proves the first inequality. For the second inequality, letfg 1 ;:::;g l g bee-packing of G with maximum size. Let h2 G that different from g i ’s. Thenfh;g 1 ;:::;g l g is a set with cardinality l+1, and it cannot be ane packing of G. That means, there exist g j 2fg 1 ;:::;g l g such that 1 n n å i=1 jh(z i ) g j (z i )j<e This implies the setfg 1 ;:::;g l g is already enough to cover G, and therefore, the size of any e-cover of G will be equal or less than l= M(e;G;z n d ). The next two lemmas compares the covering number of class of functions that are sum or product of functions from other sets to each individual set. Lemma 5.2.2 Let F and G be two families of functions fromR d toR. If F L G denote the set of functionsf f+ g : f2 F;g2 Gg, then for any z 1 ;:::;z n 2R d ande;d > 0, we have N (e+d;F M G;z n d )N (e;F;z n d ) N(d;G;z n d ) where z n d =fz 1 ;:::;z n g. 39 Proof 5.2.2 Letf f 1 ;:::; f K g andfg 1 :::;g L g be the smallest size e and d cover of F and G re- spectively on z n d . Then for any f2 F, g2 G, there exist f k 2f f 1 ;:::; f K g, g l 2fg 1 :::;g L g such that 1 n n å i=1 j f(z i ) f k (z i )j<e and 1 n n å i=1 jg(z i ) g l (z i )j<d Then by triangle inequality: 1 n n å i=1 j f(z i ) f k (z i )+ g(z i ) g l (z i )j 1 n n å i=1 j f(z i ) f k (z i )j+ 1 n n å i=1 jg(z i ) g l (z i )j <e+d This implies setf f k + g l : k= 1:::K;l= 1:::Lg is an(e+d)-cover of F L G on z n d . Lemma 5.2.3 Let F and G be two families of functions fromR d toR such thatj f(x)j M 1 ;jg(x)j M 2 for all x2R d ; f2 F;g2 G. If F N G denote the set of functionsf f g : f2 F;g2 Gg, then for any z 1 ;:::;z n 2R d ande;d > 0, we have N (e+d;F O G;z n d )N (e=M 2 ;F;z n d ) N(d=M 1 ;G;z n d ) where z n d =fz 1 ;:::;z n g. Proof 5.2.3 Letf f 1 ;:::; f K g andfg 1 :::;g L g be the smallest sizee=M 2 andd=M 1 cover of F and G respectively on z n d . Then for any f2 F, g2 G, there exist f k 2f f 1 ;:::; f K g, g l 2fg 1 :::;g L g such that 1 n n å i=1 j f(z i ) f k (z i )j< e M 2 and 1 n n å i=1 jg(z i ) g l (z i )j< d M 1 40 By triangle inequality and basic algebra, we have: 1 n n å i=1 j f(z i )g(z i ) f k (z i )g l (z i )j = 1 n n å i=1 f(z i )g(z i ) f(z i )g l (z i ) f k (z i )g l (z i )+ f(z i )g l (z i ) = 1 n n å i=1 f(z i )[g(z i ) g l (z i )]+ g l (z i )[ f(z i ) f k (z i )] 1 n n å i=1 f(z i )[g(z i ) g l (z i )] + 1 n n å i=1 g l (z i )[ f(z i ) f k (z i )] M 1 1 n n å i=1 jg(z i ) g l (z i )j+ M 2 1 n n å i=1 j f(z i ) f k (z i )j <e+d This implies setf f k g l : k= 1:::K;l= 1:::;Lg is the(e+d)-cover of F N G on z n d . Next, the Lemma below is about the covering number between a set of functions whose mem- bers are constant multiple of functions from the other set. Lemma 5.2.4 Let F be families of functions fromR d toR. Let G=fs( f); f2 Fg wheres :R! 
R is an activation function in Definition 3.2.1 with globally Lipschitz continuity with Lipschitz constant L, i.e. s isjs(u)s(v)j Lju vj for all u;v;2R and some C2R. Then for any z 1 ;:::;z n 2R d N (e;G;z n d )N (e=L;F;z n d ) where z n d =fz 1 ;:::;z n g. Proof 5.2.4 Letf f 1 ;:::; f n g be the smallest size e=L-cover of F, then for any f2 F, there exist f j 2f f 1 ;:::; f n g such that 1 n n å i=1 j f(z i ) f j (z i )j< e L 41 From the Lipschitz continuity, we have: 1 n n å i=1 js( f(z i ))s( f 1 (z i ))j< L 1 n n å i=1 j f(z i ) f 1 (z i )j <e Next, we will extend the definition of covering and packing number by letting Z 1 ;:::;Z n to be a sequence ofR d -valued i.i.d. random vectors and denote Z n d =fZ 1 ;:::;Z n g. Then thee-covering number of G on Z n d , denote asN (e;G;Z n d ), is then a random variable. Theorem 5.2.1 (Also see [12] Theorem 9.1) Let Z 1 ;:::;Z n be a sequence of i.i.d. random vari- able. Let G be a set of functions g :R d ![0;B]. For any n, anye > 0, P sup g2G 1 n n å i=1 g(Z i )Eg(Z) >e 8EN (e=8;G;Z n d )e ne 2 128B 2 where Z n d =fZ 1 ;:::;Z n g. Proof 5.2.5 The proof of this theorem includes 4 steps. In step 1, we consider another set of n i.i.d. copies of Z 1 : Z 0 1 ;Z 0 2 ;:::;Z 0 n (independent with Z 1 ;:::;Z n ) and claim P sup g2G 1 n n å i=1 g(Z i )Eg(Z) >e 2P sup g2G 1 n n å i=1 g(Z i ) 1 n n å i=1 g(Z 0 i ) > e 2 In step 2, we introduce U 1 ;:::;U n , as n i.i.d. discrete random variable, independent with Z 1 ;:::;Z n ;Z 0 1 ;:::;Z 0 n , such that U i = 8 > > < > > : 1 with probability 1 2 1 with probability 1 2 42 for i= 1;:::;n. Then we will show the latter probability (of step 1) can be simplify as: P sup g2G 1 n n å i=1 g(Z i ) 1 n n å i=1 g(Z 0 i ) > e 2 2P sup g2G 1 n n å i=1 U i g(Z i ) > e 4 At last, we bound above the later probability (of step 2) with covering number in step 3 and apply Hoeffding’s inequality (see Appendix B) in step 4. Step1: Let Z 0 1 ;Z 0 2 ;:::;Z 0 n be n i.i.d. copies of Z 1 that independent with Z 1 ;:::;Z n with the same distribution as Z. Let g be a function g2 G such that g = 8 > > < > > : g2 G s:t: j 1 n å n i=1 g(Z i )Eg(Z)j>e if such g2 G exist any other arbitrary g2 G else Note that g depends on Z n d , and denoteE[g (Z)jZ n d ] to be expectation of g (Z) with respect to Z given Z n d . Now consider P sup g2G 1 n n å i=1 g(Z i ) 1 n n å i=1 g(Z 0 i ) > e 2 P 1 n n å i=1 g (Z i ) 1 n n å i=1 g (Z 0 i ) > e 2 P 1 n n å i=1 g (Z i )E[g (Z)jZ n d ] >e | {z } event E 1 ; and 1 n n å i=1 g (Z 0 i )E[g (Z)jZ n d ] e 2 | {z } eventE 2 The last inequality is from the fact that if event E 1 and E 2 implies some event C, then P(C) P(E 1 E 2 ). Recall the law of total probability of P(E 1 E 2 ) condition on Z n d : P(E 1 E 2 )=E Z n d [P(E 1 E 2 jZ n d )] 43 Because the event E 1 depends on Z n d only, given Z n d = z n d , we know whether the event E 1 holds true or not. If E 1 holds true, then P(E 1 E 2 jZ n d )= P(E 2 jZ n d ). If not, then P(E 1 E 2 jZ n d )= 0. 
That is, P(E 1 E 2 jZ n d )= 8 > > < > > : P(E 2 jZ n d ) if E 1 holds true 0 if E 1 is false Together with law of total probability, we have: P(E 1 E 2 )=E[P(E 1 E 2 jZ n d )] =E I(E 1 ) P(E 2 jZ n d ) where I(E 1 )= 8 > > < > > : 1 if E 1 holds true 0 if E 1 is false To lower bound P(E 2 jZ n d ), we use Chebyshev’s Inequality: 1 P(E 2 jZ n d )= P 1 n n å i=1 g (Z 0 i )E[g (Z)jZ n d ] > e 2 Z n d Var(g (Z)jZ n d ) n( e 2 ) 2 B 2 =4 n( e 2 ) 2 = B 2 ne 2 where g 2 G implies g (Z)2[0;B], and 44 Var(g (Z)jZ n d )= Var g (Z) B 2 Z n d E g (Z) B 2 2 Z n d B 2 =4 Hence, for n large enough, in particular, n 2B 2 e 2 (if not, see Step 4), we have 1 P(E 2 jZ n d ) B 2 ne 2 1 2 which implies P(E 2 jZ n d )= P 1 n n å i=1 g (Z 0 i )E[g (Z)jZ n d ] e 2 Z n d > 1 2 Back to P(E 1 E 2 ), we see P(E 1 E 2 )=E I(E 1 ) P(BjZ n d ) 1 2 P(E 1 ) = 1 2 P 1 n n å i=1 g (Z i )E[g (Z)jZ n d ] >e = 1 2 P sup g2G 1 n n å i=1 g(Z i )Eg(Z) >e Where the last equality is by definition of g . Put everything together, we have: P sup g2G 1 n n å i=1 g(Z i )E[g(Z)jZ n d ] >e 2P sup g2G 1 n n å i=1 g(Z i ) 1 n n å i=1 g(Z 0 i ) > e 2 45 Step 2: Let U 1 ;:::;U n be n i.i.d. discrete uniform random variable onf1;1g, independent with Z 1 ;:::;Z n ;Z 0 1 ;:::;Z 0 n . Note that if one randomly interchange some component of Z n d and Z 0 n d , the joint distribution of Z n d ;Z 0 n d will remain the same because of the independency between U n 1 , Z n d , and Z 0 n d , and the i.i.d. property of Z n d ;Z 0 n d . Hence P sup g2G 1 n n å i=1 g(Z i ) 1 n n å i=1 g(Z 0 i ) > e 2 = P sup g2G 1 n n å i=1 U i [g(Z i ) g(Z 0 i )] > e 2 randomly interchange P sup g2G 1 n n å i=1 U i g(Z i ) > e 4 + P sup g2G 1 n n å i=1 U i g(Z 0 i ) > e 4 2P sup g2G 1 n n å i=1 U i g(Z i ) > e 4 Apply total probability, take the expectation w.r.t. Z n d : P sup g2G 1 n n å i=1 U i g(Z i ) > e 4 =E P sup g2G 1 n n å i=1 U i g(Z i ) > e 4 Z n d Hence, in step 3 and 4, we will condition Z n d = z n d , for some fixed z n d 2R d and claim that P sup g2G 1 n n å i=1 U i g(z i ) > e 4 N (e=8;G;z n d ) 2exp ne 2 128B 2 Step3: Note that event sup g2G j 1 n å n i=1 U i g(z i )j> e 4 implies eventj 1 n å n i=1 U i g(z i )j> e 4 for some g2 G. That means, P sup g2G 1 n n å i=1 U i g(z i ) > e 4 P 9g2 G : 1 n n å i=1 U i g(z i ) > e 4 Next, let G e=8 bee=8-cover of G on z n d with minimal size. Fix g2 G, there exist ˜ g2 G e=8 such that 1 n n å i=1 jg(z i ) ˜ g(z 1 )j< e 8 46 Then 1 n n å i=1 U i g(z i ) = 1 n n å i=1 [U i g(z i )U i ˜ g(z i )+U i ˜ g(z i )] 1 n n å i=1 [g(z i ) ˜ g(z i )] + 1 n n å i=1 U i ˜ g(z i ) < 1 n n å i=1 U i ˜ g(z i ) + e 8 Hence, eventj 1 n å n i=1 U i g(z i )j> e 4 for some g2 G implies 1 n å n i=1 jU i ˜ g(z i )j+ e 8 > e 4 and P 9g2 G : 1 n n å i=1 U i g(z i ) > e 4 P 9 ˜ g2 G e=8 : 1 n n å i=1 U i ˜ g(z i ) + e 8 > e 4 jG e=8 j max ˜ g2G e=8 P 1 n n å i=1 U i ˜ g(z i ) > e 8 =N (e=8;G;z n d ) max ˜ g2G e=8 P 1 n n å i=1 U i ˜ g(z i ) > e 8 Step4: we will show for ˜ g2 G e=8 , P 1 n n å i=1 U i ˜ g(z i ) > e 8 2exp ne 2 128B 2 W.L.O.G., we can assume 0 ˜ g(z) B;8z2R d . If not, one can always truncate ˜ g(z) between 0 and B and still being a member of G e=8 . In addition, U 1 ˜ g(z 1 );:::;U n ˜ g(z n ) are n independent random variable. Together, B U i ˜ g(z i ) B;8i= 1:::;n Apply Hoeffding’s inequality (See Appendix B), P 1 n n å i=1 U i ˜ g(z i ) > e 8 2exp 2N (e=8) 2 (2B) 2 2exp ne 2 128B 2 47 Recall back in step 1, we assume n large, in particular, n 2B 2 e 2 . If not, then ne 2 2B 2 < 1 and 8e ne 2 128B 2 > 8e 1 > 1. 
Then the theorem trivially holds true as the right hand side is greater than 1. 5.3 VC Dimension The VC dimension provides an upper bounds on the packing number of sets of functions, which are not necessarily subsets of some finite-dimensional vector space of functions. We shall start with the following definitions: Definition 5.3.1 LetA be a collection of subsets ofR d and z 1 ;:::;z n 2R d for some integer n. Denote s(A;fz 1 ;:::;z n g)= A\fz 1 ;:::;z n g : A2A as the number of distinct subsets offz 1 ;:::;z n g of the form A\fz 1 ;:::;z n g. If every subset of fz 1 ;:::;z n g can be represented in the form of A\fz 1 ;:::;z n g, then s(A;fz 1 ;:::;z n g)= 2 n . In this case, we sayA shattersfz 1 ;:::;z n g, orfz 1 ;:::;z n g is shattered byA . Denote S(A;n)= max fz 1 :::;z n g2R d s(A;fz 1 :::;z n g) to be the n th shatter coefficient ofA . That means S(A;n) is the maximal number distinct subsets of n points that can be picked out byA . Definition 5.3.2 The VC dimension ofA is defined as: V A = supfn2N : S(A;n)= 2 n g That is, V A is the largest number n such that there exist a setfz 1 ;:::z n g with cardinality n that is shattered byA . 48 The next Sauer-Shelah Lemma and its corollary will be used to derive the uppder bounds on the packing number. Lemma 5.3.1 (Sauer–Shelah Lemma) For any n2N , S(A;n) V A å i=0 n i Proof 5.3.1 See [13] or [14] Corollary 5.3.1 For any n2N , S(A;n)(n+ 1) V A In addition, if V A is finite, then for all n V A , S(A;n) en V A V A Proof 5.3.2 From Sauer-Shelah lemma and binomial theorem: S(A;n) V A å i=0 n i = V A å i=0 n! (n i)! 1 i! V A å i=0 n i V A i =(n+ 1) V A 49 If V A <¥ and for n such that V A n 1, apply Sauer-Shelah lemma and binomial theorem, we have: V A n V A S(A;n) V A n V A V A å i=0 n i V A å i=0 V A n i n i n å i=0 V A n i n i = 1+ V A n n = e V A and V A n V A S(A;n) e V A =) S(A;n) en V A V A Theorem 5.3.1 ([12] Theorem 9.4) Let G be a class of functions g :R d ![0;B] with V G + 2. Denote G + = f(z;t)2R d R : g(z) tg : g2 G be a collection of sets in form forf(z;t)2R d R : g(z) tg for g2 G. Let z 1 ;:::;z n 2R d and denote z n d =fz 1 ;:::;z n g Then for any 0<e B=4, we have M(e;G;z n d ) 3 2eB e log 3eB e V G + where V G + is the VC dimension of the set G + . Proof 5.3.3 The proof will consist of 3 steps. In step 1, we bound the packing number M(e;G;z n d ) to a shatter coefficient of G + , S(G + ;k) for some integer k. In step 2, we apply corollary 5.3.1 to obtain bound for this shatter coefficient S(G + ;k). In step 3, we handle the inequality from step 2 to obtain the final result. 50 Step 1: Let m= M(e;G;z n d ) to be the e-packing number of G and ¯ G=fg 1 ;:::;g m g be an e-packing of G. For any integer k, let Q 1 ;:::;Q k 2R d be i.i.d. random variable with empirical distributionn n on z n d . That means if Qn n , then P(Q= z i )= 1 n . Moreover, if f;g2 ¯ G, then Ej f(Q) g(Q)j= 1 n n å i=1 j f(z i ) g(z i )je Next, let T 1 ;:::;T k uni f orm(0;B) and independent with Q 1 ;:::;Q k . Let R i =(Q i ;T i ) for i= 1;:::;k and R k d+1 =fR 1 ;:::;R k g. Let G f =f(z;t)2R d R : f(z) tg for f :R d ![0;B]. 
We have: S(G + ;k)= max f(z 1 ;t 1 );:::(z k ;t k )2R d Rg s(G + ;f(z 1 ;t 1 );:::;(z k ;t k )g) Es(G + ;fR 1 ;:::;R k g) (5.4) Es(fG f : f2 ¯ Gg;fR 1 ;:::;R k g) (5.5) Es fG f : f2 ¯ G;G f \ R k d+1 6= G g \ R k d+1 for all g2 ¯ G; f6= gg;R k d+1 (5.6) =E å f2 ¯ G I(G f \ R k d+1 6= G g \ R k d+1 for all g2 ¯ G;g6= f) (5.7) = å f2 ¯ G P(G f \ R k d+1 6= G g \ R k d+1 for all g2 ¯ G;g6= f) = å f2 ¯ G 1 P(G f \ R k d+1 = G g \ R k d+1 for some g2 ¯ G;g6= f) (5.8) å f2 ¯ G 1 m max g2 ¯ G;g6= f P(G f \ R k d+1 = G g \ R k d+1 ) (5.9) From (5.4) to (5.5), we are using the fact that G + =fG f : f2 GgfG f : f2 ¯ Gg implies Es(G + ;R k d+1 )Es(fG f : f2 ¯ Gg;R k d+1 ). Similarly, (5.5) implies (5.6). From (5.6) to (5.7), we 51 used the definition of s(;R k d+1 ). From (5.8) to (5.9), we are using the following fact by union bound: P(E i for some event E i ;i= 1;2;:::;m) = P m [ i=1 E i m å i=1 P(E i ) m max i=1;:::;m P(E i ) Next, we will need to bound P(G f \ R k d+1 = G g \ R k d+1 ). Because R 1 ;:::R k are i.i.d., we have: P(G f \ R k d+1 = G g \ R k d+1 )= P(G f \fR 1 g= G g \fR 1 g;:::;G f \fR k g= G g \fR k g) = P(G f \fR 1 g= G g \fR 1 g) k From law of total probability, we have, P(G f \fR 1 g= G g \fR 1 g) = 1 P(G f \fR 1 g6= G g \fR 1 g) = 1E P(G f \fR 1 g6= G g \fR 1 gjQ 1 ) where the expectation of the last equality is taken with respect to the random variable Q 1 . Then, 52 1E P(G f \fR 1 g6= G g \fR 1 gjQ 1 ) (5.10) = 1E P f(Q 1 )< T 1 < g(Q 1 ) or g(Q 1 )< T 1 < f(Q 1 ) Q 1 (5.11) = 1E j f(Q 1 ) g(Q 1 )j B (5.12) = 1 1 B 1 n n å i=1 j f(z i ) g(z i )j (5.13) 1 e B From (5.10) to (5.11), we first note that (z;t)2 G f implies f(z) t. G f \fR 1 g = G f \ f(Q 1 ;T 1 )g, so if G f \fR 1 g is not empty, then f(Q 1 ) T 1 . That means given Q 1 , if G f \fR 1 g6= G g \fR 1 g, then either f(Q 1 ) < T 1 < g(Q 1 ) or g(Q 1 ) < T 1 < f(Q 1 ). From (5.11) to (5.12), we recall that f;g2 ¯ G G and G is a set of functions fromR d to[0;B]. Then the probability inside (5.11) is 1 B j f(Q 1 ) g(Q 1 )j. From (5.12) to (5.13), we recall that Ej f(Q) g(Q)j= 1 n n å i=1 j f(z i ) g(z i )je: That means, P(G f \ R k d+1 = G g \ R k d+1 )= P(G f \fR 1 g= G g \fR 1 g) k 1 e B k e ek B and max g2 ¯ G;g6= f P(G f \ R k d+1 = G g \ R k d+1 ) e ek B 53 Hence, S(G + ;k) å f2 ¯ G 1 m max g2 ¯ G;g6= f P(G f \ R k d+1 = G g \ R k d+1 ) å f2 ¯ G 1 m e ek B = m 1 m e ek B Set k=b B e log(2m)c. For anye2(0;B=4], we have 1 m e ek B 1 mexp e B B e log(2m) 1 = 1 1 2 exp e B 1 1 2 exp 1 4 1 3 Therefore m= M(e;G;z n d ) 3S(G + ;k) (5.14) where k=b B e log(2M(e;G;z n d ))c Step2: 54 If k V G +, then B e log(2M(e;G;z n d )) 1 V G + and M(e;G;z n d ) 1 2 exp e(V G ++ 1) B e 2 e V G + 3 8elog(12e) V G + 3 2eB e log 3eB e V G + where we have used 0<e B=4 again in the last inequality and the proof is done. If k> V G +, then we can then apply corollary 5.3.1 on (5.14): m 3S(G + ;k) 3 ek V G + V G + 3 eB eV G + log(2m) V G + To complete the proof, we need an inequality from the next step. Step3: We claim that for a2R + ;b2N ;a e and b 2, x 3 a b log(2x) b implies x 3 2alog(3a) b Once the claim is justified, letting a= eB e ;b= V G + m 3 eB eV G + log(2m) V G + implies M(e;G;z n d )= m 3 2eB e log 3eB e V G + 55 and the proof is done. To justify the claim, first note that x 3 a b log(2x) b is equivalent as (2x) 1=b 6 1=b alog(2x) 1=b Let u=(2x) 1=b ;c= 6 1=b a, we have u clog(u). 
Note that if we can show u clog(u) implies u 2clog(c), then x= 1 2 u b 1 2 2clog(c) b = 1 2 2 6 1=b a log(6 1=b a) b 3 2alog(3a) b where the last inequality uses 6 1=b 3 for b 2. To complete the proof, we need to show u clog(u)=) u 2clog(c) which is equivalent to show u> 2clog(c)=) u> clog(u) Let f 1 (u)= u; f 2 (u)= clog(u). Because c= 6 1=b a> a e, we have f 0 1 (u)= 1> 1 2log(e) 1 2log(c) = c 2clog(c) > c u = f 0 2 (u) Moreover, f 1 (2clog(c))> f 2 (2clog(c)) since f 1 (2clog(c))> f 2 (2clog(c))() 2clog(c)> clog(2clog(c)) () 2clog(c)> clog(2)+ clog(c)+ clog(log(c)) () clog(c) clog(2) clog(log(c))> 0 () clog c 2log(c) > 0 () c> 2log(c) 56 And indeed, c> 2log(c) when c> e. Together we have, if u> 2clog(c), then • f 0 1 (u)> f 0 2 (u) • f 1 (2clog(c))> f 2 (2clog(c)) and that shows f 1 (u)> f 2 (u), i.e. u> clog(u). The last theorem in this section derive an upper bound of the VC dimension of an r-dimensional vector space of functions. Theorem 5.3.2 ([12] Theorem 9.5) Let G be an r-dimensional vector space of functions fromR d toR, then V A r where A = fz : g(z) 0g : g2 G Proof 5.3.4 We would like to show for any set that contains r+ 1 points or more,A cannot shat- ter it. That is, letfz 1 ;:::;z r+1 g, z i 2R d be distinct r+ 1 points, we claim that there exist a subset Hfz 1 ;:::;z r+1 g such that nofz : g(z) 0g2A contains precisely H. To find such H, let L : G!R r+1 be a linear transformation: L(g)=[g(z 1 );:::;g(z r+1 )] T and let LG denote the image of L. Note that LG is a subspace ofR r+1 and dim(LG) dim(G)= r. That means there exist g =[g 1 ;:::;g r+1 ] T such thatg2R r+1 ,g6= 0, and g T L(g)=g 1 g(z 1 )+:::g r+1 g(z r+1 )= 0 for all g2 G 57 Here one can assume that at least one ofg i ’s is negative, because if not, redefineg =g. Next, we separate the sumg T L(g) into two parts, one withg i 0, and another withg i < 0: g T L(g)= å i:g i 0 g i g(z i )+ å i:g i <0 g i g(z i ) Together withg T L(g)= 0, we have: å i:g i 0 g i g(z i ) | {z } LHS = å i:g i <0 (g i )g(z i ) | {z } RHS With help of this equation, we claim when H=fz i :g i 0; 1 i r+1g, nofz : g(z) 0g2 A contains precisely H. To show this, we use contradiction. Suppose if there exist g2 G such that fz : g(z) 0g precisely pick those z i ’s with g i 0. Then for the rest of z i ’s with g i < 0, we have g(z i )< 0. That is, the LHS is non-negative (g i 0 and g(z i ) 0), but the RHS is negative (g i < 0 and g(z i )< 0), which is a contradiction. 5.4 Proof of the Main Result From Lemma D.1, We obtain E X ((m n (X) m(X)) 2 ) 2 sup f2F n 1 n n å k=1 j f(X k )Y k j 2 E X;Y ( f(X)Y) 2 ) (5.15) + inf f2F n E X ( f(X) m(X)) 2 : (5.16) Intuitively, we upper bound the L 2 risk of the empirical minimizer into two quantities. The first quantity (5.15) makes sure that the empirical L 2 risk off(X i ;Y i )g n 1 will be close to the L 2 loss uniformly overF n . The second quantity (5.16) is because our estimate cannot be better (with respect to the L 2 loss) than the best function inF n , The infimum bound (5.16) is guaranteed by our assumption that (5.16)! 0 as n!¥ which is equivalent to assume that the setF n is dense in 58 the space of of admissible functions m(X). For the (5.15), we note that by Borel-Cantelli lemma, if ¥ å n=1 P sup f2F n E X;Y j f(X)Yj 2 1 n n å j=1 j f(X j )Y j j 2 >e <¥ (5.17) then sup f2F n E X;Y j f(X)Yj 2 1 n n å j=1 j f(X j )Y j j 2 ! 
0 a:s: Before continuing our argument, we adopt a new notation for clarity by defining Z =(X;Y), Z i =(X i ;Y i ) and H n = h :R d R!R : h(x;y)=j f(x) yj 2 for f2F n (5.18) We note that in classification, we havej f(x)j 1;jyj 1, which implies 0 h(x;y) 4. For regression, one usually perform input-output processing for better efficiency. The processed output will be bounded within1. For this set of bounded functionH n , Theorem 5.2.1 gives an upper bound depending on the covering number forH n as following: P sup f2F n Ej f(X)Yj 2 1 n n å j=1 j f(X j )Y j j 2 >e = P sup h2H n Eh(Z) 1 n n å j=1 h(Z j ) >e 8EN e=8;H n ;Z n d+1 exp ne 2 128(4) 2 (5.19) where Z n d+1 =fZ 1 :::Z n g denote the set of (d+1)-dimensional random variables. The termN (e=8;H n ;Z n d+1 ) denote thee=8 covering number for the set of random variables Z n d+1 from Definition 5.2.1. In Ap- pendix C, we briefly discuss sufficient conditions for (5.17) and why we have to use concepts from covering number. 59 To find an upper bound forEN (e=8;H n ;Z n d+1 ), let z n d+1 =fz 1 ;:::;z n g be an instance of Z n d+1 where z i =(x i ;y i )2R d R for i= 1:::n. Let x n d =fx 1 ;:::x n g. The following bound is established in Lemma D.2. N (e=8;H n ;z n d+1 )N (e=32;F n ;x n d ) (5.20) Next, we will define a sequence of sets of functions defined on x2R d for reconstructingF n : G 0 =fa T x+ b : a2R d ;b2Rg G 1 =fs 1 (a T x+ b) : a2R d ;b2Rg G 2 =fwg : g2 G 2 ;jwj cK n g G 3 = K n å i=1 g i +b : g i 2 G 3 ;jbj cK n G 4 =fs 2 (g) : g2 G 4 g G 5 =fwg : g2 G 5 ;jwj cK n g G 6 = K n +1 å i=1 g i +a : g i 2 G 5 ;jaj cK n F n =fs 3 (g) : g2 G 6 g wheres i is the same as in (5.2) for i= 1;2;3. First, note that G 0 is a set of linear vector space of dimension d+ 1. Let G 0 += f(z;t) : g(z) tg : g2 G 0 denote the set of all sub-graph of G 0 . Let VC(G 0 +) denote the VC dimension of G 0 +. Then it is not hard to show that the VC(G 0 +) d+ 1. Moving forward to G 1 , because applying the composition of non-decreasing function does not increase the VC dimension of set of sub-graphs, 60 we have VC(G 1 +) VC(G 0 +) d+ 1. Next, we will apply Theorem 5.3.1 on G 1 . Because G 1 is bounded between 0 and 1, we have N (e;G 1 ;x n d ) 3 2e e log 3e e VC(G 1 +) 3 3e e 2d+4 : To obtain the bound forN (e;F n ;x n d ), we will repeatedly apply Lemma 5.2.2 and Lemma 5.2.3. N (e;G 2 ;x n d )N e 2 ;fw :jwj cK n g;x n d N e 2cK n ;G 1 ;x n d 4cK n e 3 3e 2cK n e 2d+4 12ecK n e 2d+5 N (e;G 3 ;x n d )N e K n + 1 ;fb :jbj cK n g;x n d N e K n + 1 ;G 2 ;x n d K n 2(K n + 1)cK n e 12e(K n + 1)cK n e (2d+5)K n 12e(K n + 1)cK n e (2d+5)K n +1 Since s 2 is a bounded, non-decreasing, and globally Lipschitz continuous activation function with Lipschitz constant L, letg 2 denote maxfL;1g. Then Lemma 5.2.4 shows N (e;G 4 ;x n d )N e g 2 ;G 3 ;x n d 12eg 2 (K n + 1)cK n e (2d+5)K n +1 : 61 Repeating the argument for G 5 and G 6 , we have N (e=8;H n ;x n d ) N (e=32;F n ;x n d ) N e=32;G 6 ;x n d 768e(K n + 1)(K n + 2)c 2 K 2 n e (2d+5)K 2 n +(2d+7)K n +3 Returning to equation (5.19), we have P sup f2F n Ej f(X)Yj 2 1 n n å j=1 j f(X j )Y j j 2 >e 8EN e=8;H n ;Z n d exp ne 2 128(4b 2 n ) 2 8 768e(K n + 1)(K n + 2)c 2 K 2 n e (2d+5)K 2 n +(2d+7)K n +3 exp ne 2 128(4 2 ) = 8a n : To show convergence of series in (5.17), we have ¥ å n=1 P sup f2F n Ej f(X)Yj 2 1 n n å j=1 j f(X j )Y j j 2 >e ¥ å n=1 8a n ¥ å n=1 8exp n e 2 2048 Q n <¥ 62 where Q n = (2d+ 5)K 2 n +(2d+ 7)K n + 3 n log 768e(K n + 1)(K n + 2)c 2 K 2 n e : The last inequality because as n!¥, K 2 n log(K 4 n ) n ! 
0 implies Q n ! 0, and that means [ e 2 2048 Q n ]> e 2 4096 for large n. In addition,å ¥ n=1 e cn converges for any positive constant c> 0 by the ratio test. Together we have,å ¥ n=1 e n(cQ n ) converges by the comparison test. The result then follows from Borel-Cantelli lemma. 63 Chapter 6 Exploration of Algorithm Designs and Classification Performance Theoretical consistency shown in the previous chapter is predicated on our ability to identify opti- mal classifier m n based on available data set D n . When a neural network is involved, the dimension of the space of coefficients, also referred to as weights, defining the network is typically quite con- siderable. The optimization criterion is generally highly non-convex with respect to the weights. This chapter uses numerical experimental results to demonstrate NDT’s superior ability to deter- mine an optimal classifier efficiently. Indeed, in the subsequent subsections, we shall present the following features of the NDT: • NDT provides an appropriate structure tailored to a problem for a neural network. In section 6.1, we shall compare the classification performance of neural decision trees (NDT) to other networks with arbitrary architecture. • As for most other statistical learning algorithms, NDT can be constructed using several hyper-parameters, such as the amount of data used to train the CART and the specified depth of the CART. Section 6.2 focus on effect of these hyper-parameters that affect NDT. • In addition to providing structure and initialization to a neural network, NDT also seems to offer robustness of its performance with respect to a few hyper-parameters common to 64 all neural networks. Section 6.3 compares the dependence of algorithm performance on parameters such as batch sizes and learning rates. The comparative performances of NDT and other approaches are established using four dif- ferent benchmark data sets. They are widely used as experimental data sets in machine learning publications such as [15], [16], and [17]. Table 6.1 gives a brief description of the data sets, and they are: • MNIST from [18]: Grey-scaled image of hand written digits. Each observation is a grey- scale 28x28 pixel image with 10 associated classes (0-9). There are a total of 60000 images of training examples and 10000 images of testing examples. • CIFAR-10 from [19]: A classification data set contains 60,000 32x32x3 RGB color images in 10 different classes. Each class has 6,000 images. There are 50,000 images for training and 10,000 images of testing. The 10 different classes are airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. • Street View House Number (SVHN) from [15]: A real-world data set of color image of house numbers collected by Google Street View. The format we use contains 99,000 32x32x3 RGB color images in 10 different class labels (0-9). There are 73k training images and 26k testing images. Table 6.1: Data sets used in this paper Training Set Testing Set Dimensions classes MNIST 60K 10K 784 (2828) 10 CIFAR-10 60K 10K 3072 (32323) 10 SVHN 73K 26K 3072 (32323) 10 Reuters 8982 2246 2000 46 65 6.1 Comparison of NDT to Arbitrary Networks When starting a new application, there are minimal guidelines for selecting a neural network archi- tecture. This section shall compare NDT to a network without tree initialization and other networks with arbitrary architectures. 
To evaluate such a broad range of ANN constructions, we will select an architecture given by the NDT and compare it with doubling and halving the number of neurons, as well as increasing and decreasing the number of layers. Table 6.2 summarized the structure and construction of NDT and the other five networks in comparison with the following details: • NDT: 2-layer neural decision tree constructed based on Section 4. The size of the binary tree is controlled by a maximum tree depth of 5. • NN: 2-layer neural network with the same number of neurons and layers as NDT. Instead of tree initialization, weights are initialized in random Gaussian, and activation functions are logistic for both layers • NN-D: similar to NN with double the amount of neurons in each layer. • NN-H: similar to NN with half the amount of neurons in each layer. • NN-1: 1-hidden layer network. The amount of neurons in the hidden layer is the same as the total amount of neurons in NDT. Weights and activation functions are the same as NN. • NN-3: 3-hidden layers network with the same total amount of neurons as NDT. Neurons are placed evenly in each of the 3 hidden layers. Weights and activation functions are the same as NN. For the training procedure, the decision tree (DT) is implemented using Python scikit-learn from [20]. We will first train the DT using the entire data set with a maximum depth of 5, and it will be included in the comparison with other networks. The network optimization algorithm is batched gradient descent using the TensorFlow2 [21]. We will use a batch size of 1000 and learning rate of 50 in batched gradient descent for all data sets. The error % will be recorded for 66 Table 6.2: Comparison of different architectures Total Hidden Weight Activation Function Neurons Layers Initialized layer-1 layer-2 layer-3 NDT P 2 Tree r 1 s - NN P 2 Gaussian s s - NN-D 2P 2 Gaussian s s - NN-H P=2 2 Gaussian s s - NN-1 P 1 Gaussian s - - NN-3 P 3 Gaussian s s s P is the total number of neurons in NDT, s indicates logistic activation function, and r 1 indicates bounded ReL activation function defined in (4.6). each epoch, and one epoch means every batched training data is being used to update the weight onece. Clearly, such fixed hyper-parameter selection may not be optimal for all data set, however, these selection may constitute a first attempt for a new application. In addition, section 6.3 provides further experiments with other parameter choices. At last, each experiment is repeated 10 times for methods with random Gaussian weight initialization, and the error % is the average over 10 repeated runs. Moreover, the classification error for DT is a constant over all epochs because we only train it once to initialize NDT. 6.1.1 MNIST Our first experiment considers handwritten digit recognition from the MNIST data set. Figure 6.1 shows the testing error for all methods over 20 epochs. The initial error in epoch 0 for NDT does not match the error rate of DT due to the smoothness of the activation function. However, the error drastically decreased from epoch 0 to epoch 1. With the bounded ReL activation, the NDT successively preserves the splitting criterion from the DT and optimizes the decision boundary even further with just one single epoch. In comparison, other networks without tree initialization take much more epochs to converge, and arbitrarily selected architecture does not provide any advantages. In fact, there is a good chance that picking a random architecture may end up with a 67 poor model fit like NN-3. 
At last, NDT not only provides an architecture with a well-preserved weight initialization, but it also remains the best model for the rest of the epochs. Table 6.3 provides the particular architecture and testing error at the last epoch for the MNIST data set. Table 6.3: MNIST and CIFAR-10 Comparison MNIST CIFAR-10 Architecture Error % Architecture Error % NDT 784-31-32-10 5.73 3072-31-32-10 59.92 NN 784-31-32-10 11.63 3072-31-32-10 68.6 NN-D 784-62-64-10 11.23 3072-62-64-10 67.7 NN-H 784-15-16-10 11.3 3072-15-16-10 68.96 NN-1 784-63-10 12.73 3072-63-10 81.45 NN-3 784-21-21-21-10 21.93 3072-21-21-21-10 72.63 Figure 6.1: MNIST Testing error for different architectures. Each epoch means the training data is being used to update weight once. 6.1.2 CIFAR-10 The second experiment is performed using the CIFAR-10 data set. In Figure 6.2, the DT effectively offers the NDT a warm start for a lower classification error. Compare to the MNIST experiment in Section 6.1.1, the risk of an arbitrarily selected network is even more significant. Without a 68 proper initialization, NN-1 starts diverging at epoch 4. Most importantly, consider NN and NDT that share the same architecture as in Table 6.3. The only difference between NN and NDT is the weight initialization and first layer activation. Yet the difference between the two methods is approximately 10% for each epoch! Figure 6.2: CIFAR-10 testing error. Each epoch means the training data is being used to update weight once. 6.1.3 SVHN For this data set, the learning rate of 50 is relatively large for the NDT, as shown in Figure 6.3. Unlike the divergence of NN-1 from the CIFAR-10 experiment in Section 6.1.2, NDT follows the guideline from DT and manages to have the lowest error until the last epoch. 6.1.4 Reuters Because this data set has approximately 9000 of training data, keeping a batch size of 1000 means weights are only updated 9 times for each epoch. Without tree initialization, all networks yield relatively poor performance as in Figure 6.4. As in Table 6.4, both NN and NDT share the same 69 Table 6.4: SVHN and Reuters Comparison SVHN Reuters Architecture Error % Architecture Error % NDT 3072-31-32-10 43.26 2000-25-26-10 31.66 NN 3072-31-32-10 47.81 2000-25-26-10 68.09 NN-D 3072-62-64-10 49.90 2000-50-52-10 69.89 NN-H 3072-15-16-10 52.03 2000-12-13-10 64.62 NN-1 3072-63-10 81.63 2000-51-10 60.73 NN-3 3072-21-21-21-10 67.01 2000-17-17-17-10 61.81 Figure 6.3: SVHN testing error. Each epoch means the training data is being used to update weight once. 70 architecture, but their last-epoch error difference is more than 35%. In addition, neither adding nor deleting neurons/layers helps in error reduction. Figure 6.4: Reuters testing error. Each epoch means the training data is being used to update weight once. 6.2 Tuning of NDT The decision tree and the neural network have naturally complementary strengths. A tree can pro- vide a rough and easily explainable initial decision boundary for classification, which the mirroring network can improve through iterative gradient-based optimization. NDT also provides continuous learning capability, which is not well developed for a typical CART-like approach. Once a new batch of data becomes available following the initial development of the NDT, one does not need to retrain the tree model with the entire data set for an update. Instead, we can still use the origi- nal tree and further adapt the model with the new data by gradient descent. 
This experiment will simulate the situation by initializing NDT from the tree that constructs with 50% and 20% of the entire data, and we will call them NDT-P50 and NDT-P20, respectively. The detail is summarized in the Table 6.5, and the NDT with 100% tree portion is the same method as in section 6.1. We should emphasize that we do not suggest using less than 100% of the available data to create the 71 initial decision tree. Our experiment here serves to demonstrate the ability of NDT to incorporate incrementally available data. Another aspect of the NDT is the complexity. In general, a more complex network with deeper layers and more neurons is necessary to discover the hidden pattern for a more complicated data set. One way is to initialize the NDT naturally from a more complex and deeper tree. In this experiment, we will vary the complexity of the tree by the maximum depth. Recall that all previous NDT has a maximum tree depth of 5. In Table 6.6, we will compare NDT with NDT-6 and NDT-8, which correspond to the NDT that has a maximum depth of 6 and 8 respectively. Table 6.5: Continuous Leaning Feature in MNIST Tree MNIST Reuters portion Architecture error % Architecture error % NDT 100% 784-31-32-10 5.73 2000-25-26-46 31.65 NDT-P50 50% 784-31-32-10 5.5 2000-22-23-46 35.75 NDT-P20 20% 784-31-32-10 4.9 2000-20-21-46 33.57 Table 6.6: Complexity Comparison in MNIST Max MNIST Reuters Depth Architecture error % Architecture error % NDT 5 784-31-32-10 5.73 2000-25-26-46 31.65 NDT-6 6 784-63-64-10 4.42 2000-42-43-46 31.03 NDT-8 8 784-252-253-10 2.82 2000-104-105-46 29.96 Surprisingly, with only 20% of tree initialization as in Table 6.5, the NDT-P20 has an even lower error rate than the fully initialized NDT in the MNIST data set. Since there are a total of 60000 training images, constructing a NDT with 12000 is generally enough to obtain a moderate initial decision boundary from the tree. The mirroring network can then further improved the boundary once the rest of the data is available. As for the Reuters data set, the NDT with less than 100% data initialization has a relatively higher error rate. On the other hand, it is not a huge difference considering the fact that NDT-P20 was initialized by only 1800 samples. Next, increasing the network complexity by allowing a maximum tree depth of 6 and 8, both NDT-6 and NDT-8 indeed lowered the error rate compared to the original NDT as shown in Table 72 6.6. In addition, the relative error from NDT to NDT-8 on the MNIST data set is lowered by more than 50%. The architecture from both NDT-6 and NDT-8 on MNIST reaches the maximum depth as their second layer have 64= 2 6 and 2532(2 7 = 128;2 8 = 256) respectively. On the Reuters data set, the NDT-8 has an actual tree depth of 7, so its relative error rate is lowered by about 5%. Overall, such initialization of NDT naturally provides a way to increase its complexity for deeper training effectively. 6.3 Further observations in selecting hyper-parameters Our last experiment involves hyper-parameters selection for gradient optimization, and we will demonstrate it with the Reuters data set. The learning rate in gradient descent determines the proportion of going against gradient per weight update. Neither a too large nor a too low rate is preferred. The former may result in the algorithm diverging, and the latter will cause a slower convergence. In general, such hyper-parameter selection has to be done by multiple runs in the validation set. For a large data set, it can be highly computationally expensive. 
Table 6.7 and Table 6.8 summarize the testing error % of each methods under different combi- nation of batch sizes and learning rates for the MNIST and Reuters data set respectively. If we treat this experiment as a validation process of selecting hyper-parameter, the lowest error rate for each parameter combo all comes from NDT. In fact, the worst performance of NDT still performs better than the best of all arbitrarily constructed ANNs. One conjecture we have for such robustness of NDT in the Gradient Descend is that the tree already offers a solution close enough to the local minimum of the minimization problem. Having different learning rates or batch sizes does not change much when the gradient norm is already small. 73 Table 6.7: MNIST Comparison bs 1000 bs 600 lr 25 lr 50 lr 100 lr 25 lr 50 lr 100 NDT 5.5 5.73 5.97 5.56 5.87 6.26 NN 16.39 11.63 27.01 12.05 11.27 9.32 NN-D 14.66 11.23 34.03 15.08 10.19 9.79 NN-H 16.03 11.30 19.02 13.07 11.51 18.21 NN-1 14.97 12.73 10.26 14.02 10.28 9.5 NN-3 15.58 21.93 74.48 13.64 19.69 89.91 Table 6.8: Reuters Comparison bs 1000 bs 100 lr 50 lr 100 lr 200 lr 50 lr 100 lr 200 NDT 31.66 32.77 31.83 29.7 30.1 29.61 NN 68.09 56.71 61.8 47.15 42 37.52 NN-D 69.89 66.93 76.77 50.94 45.41 63.72 NN-H 64.62 60.44 54 46.57 42.22 38.58 NN-1 60.73 65.49 53.54 53.83 41.87 40.2 NN-3 68.81 65.93 58.49 46.4 44.05 38.86 Testing classification error % with two batch sizes (bs) and three learning rates (lr). 74 Chapter 7 Building Forecast Models for Plasmasphere Dynamics and Regression Performance In Chapter 6, we have seen the superior performance with continuous learning and robustness of the Neural Decision Tree (NDT) on the benchmark data sets. In this chapter, we will demonstrate the performance and practicality of NDT in a real-world problem that has limited guidelines for applying machine learning. The application we will be studied with is building forecast models for the plasmasphere dynamic in space weather. Space weather is concerned with the influence of solar and interplanetary magnetic fields (IMF) on Earth. Specifically, strong perturbations in the IMF due to solar corona mass ejections (CME) can cause significant changes in the Earth’s magnetosphere that are known to sometimes lead to catastrophic disruptions for the power networks. Similarly, the plasma in the Earth’s upper atmo- sphere, which includes ionosphere from 80 to 2,000 km above the Earth’s surface and plasmasphere between 2,000 and 22,000km altitude, plays a vital role in wireless communication between sta- tions on the Earth’s surface, as well as, to satellites in space. In fact, the density of plasma in these regions of the atmosphere determines the delay in the speed of propagation, as well as the amount of diffraction of the radio wave by the ionosphere and plasmasphere. However, unlike the tropo- sphere, or more commonly known as regular weather, the direct measurements of the ionosphere and plasmasphere are considerably lower. The phenomenology of these layers of the atmosphere is much more complex than that of the lower atmosphere. As a result, the possibility of taking 75 advantage of machine-learning techniques for building ionosphere and plasmasphere models has attracted the attention of the space weather community. Indeed, one of the motivations for this thesis is to explore the feasibility of finding a suitable machine-learning technique for building predictive space weather models. 
7.1 Developing a Regression Based Neural Decision Tree Model for Forecasting Plasmasphere Dynamics The initial goal of our work is to evaluate the NDT’s ability to produce an ANN model with compatible performance to the PINE model reported in [22] with minimal manual adjustment. Unlike PINE with is an ANN with a single hidden layer, ANN models generated by the NDT algorithm always have at least two hidden layers. We use identical set data to train our NDT as that was used for PINE’s training. As reported in [22], the plasmasphere electron density is derived from the upper hybrid frequency with is retrieved from measurements by the Electric and Magnetic Field Instrument Suite and Integrated Science instrumentation suite (EMFISIS) on the Van Allen Probes satellites using the Neural-network-based Upper hybrid Resonance Determination (NURD) algorithm [23]. The input variables for the models are selected through repeated experimentation by Zhelavskaya et al. to include recent time-history of solar and geomagnetic parameters obtained initially from NASA’s OmniWeb data service. A complete list of attributes for the model inputs X is shown in Table 7.1 below. It is useful to notes that density data are retrieved along the spacecrafts’ orbit over time, there- fore the sampling in spatial variables L and MLT are entirely dependent on orbit of the Van Allen Probes. The sampling frequency for the rest of input variables varies from 3 hour to one sec- ond. The moving averaged values over intervals of different length help to provide stability of the model. While the training data consists of a large collection of matched pairs X i ;y i where y i is the plasmasphere electron density at a specific location given by(L i ;MLT i ), the true utility of the 76 Table 7.1: Attributes in the input for PINE and NDT models for plasmasphere dynamics Row Index Name Time Stamp 1 AE Current 2 kp Current 3 SymH Current 4 F107 Current 5 L Altitude in a.u. 6 MLT Magnetic local time 7-12 AE avg Averages for AE over previous 3,6,12,24,36,48 hours 13-18 kp avg Averages for kp over previous 3,6,12,24,36,48 hours 19-24 SymH avg Averages for SymH over previous 3,6,12,24,36,48 hours 25-30 F10.7 avg Averages for F10.7 over previous 3,6,12,24,36,48 hours resulting model for predicting the plasmasphere dynamics is to produce the entire electron den- sity field over the Earth equatorial plane for a given set of solar and geomagnetic data X. This constitutes an extension of the traditional supervised learning paradigm in the sense that for each input vector X the actual intended output is a 2-dimensional scalar field. However, the training data available to us consists of point-wise values of the desired field at different time. An analogy in the context of image recognition would be trying to determine if an image is that of a dog when instead of given the entire image, we have only one single pixel of the image at a given time. This extension obviously substantially increases the challenge for model training. Consequently, there are important features for the desired output field that are not explicitly represented by the data. We shall discuss these additional properties in the next section. In this section, we focus our attention on constructing a regression model using NDT that can accurately reproduce the plasmasphere electron density at discrete points. In particular, we would like to attempt to answer the following questions: 1. 
Can a NDT initiated neural network with similar model complexity automatically produce the performance in terms of prediction least square error similar to PINE? 2. Does NDT provide substantially favorable initialization that the convergence rate for the training process is accelerated compared with randomly initiated network as seen in [24]? 77 3. Does NDT initiated neural network deliver robustness in optimization similar to what we have seen for other problems [24]? 4. Can NDT initiated neural network with reduced model complexity produce comparable per- formance in terms of prediction error? Before presenting the NDT’s construction of plasmasphere dynamics models, it is helpful to pro- vide a brief description of our use of the data set prepared by Zhelavskaya and her colleagues. As mentioned previously, the total data set available consists of matched pairs of solar and geomag- netic measurements to plasmasphere electron density at a specific altitude L and geomagnetic local time MLT covering the time period from October 1st, 2012 to May 12th, 2016. For the training and model selection work by Zhelavskaya and her colleagues, this data set is partitioned into train- ingT and testing or validation subsetsV with a ratio of 9 to 1 in data volume by randomized sampling without repetition. In order to simplify the direct comparison of model performance, we use the same partitions as Zhelavskaya et al. in all comparisons of RMSE among the models. In selecting a suitable network structure for PINE, Zhelavskaya et al. consider a single-hidden layer neural network withf23,30,38,45,53g neurons as candidate structures. To decide on a suit- able size for the network, they have used the approach of 5 fold cross-validation to select a structure with the lowest RMSE. That is, by partitioning the training subsetT into five equal-sized subsets and using any 4 of them for model training and the remaining one for measuring RMSE perfor- mance. The average of the 5 RMSE values is used to represent the performance for the specific size of the neural network. It is worth reminding us that since all training of neural network for PINE follow the common approach of random initialization of the weights defining a network; a single model evaluation involves two sources of randomization: selection of data making the 4-subsets of the five folder cross-validation and the randomization of the initial weights. As a result, a meaning- ful assessment of the performance of a network structure also involves a repeated training process for each training-validation step in the five-fold cross-validation to average out the effect of ran- domized initialization. The enormous computational efforts required in selecting suitable models among candidate design renders consideration of more elaborate network structure prohibitively 78 expensive. Indeed, with just five candidate model structures and m randomized initialization for each step in the five-fold cross-validation, a total of 25m model training and validation process is required. If the approach is extended to two hidden layers structures, the combinatorial explosion of candidates will make the selection nearly impossible computationally. As presented in Section 4, NDT selects the network architecture and the initial weights for neurons based on the decision tree, which is created through preliminary processing of training data. 
The architecture selection and weight initialization remove the need for repeated training to average out the effect of random initialization, as was the case in a common neural network eval- uation. Moreover, a single criterion on either the minimum threshold for RMSE reduction when creating a new decision node in the tree or the maximum number of nodes required automatically allows the NDT construction process to select a promising network layout involving two hidden layers with appropriate initial weights for the neurons. In fact, since the construction of CART is relatively insensitive to the volume of data used as shown in [24], it allows us to bypass the cross-validation step in establishing a reliable and representative performance measure for a given network structure. As a result, in this section, all performance comparisons between the final se- lection for the PINE model with 45 neurons and models created by the NDT algorithm are derived using the entire subsetT for training and evaluated on the subsetV . Table 7.2 compares model selection approaches for NDT and PINE. Table 7.2: Approaches for model selection for PINE and for NDT based approach NDT PINE Candidate architectures 2 hidden-layers 1 layer Network initialization Replicating CART Random weights Scoring RMSE TrainingT and validationV 5-fold cross-validation usingT TrainingT and validationV The most popular optimization algorithm for training a neural network is the Stochastic Gra- dient Descent (SGD), for which the gradient with respect to the weight vector is evaluated on the performance of a single data point or of a small patch of data points using the highly efficient backward propagation algorithm. The weights are then updated by a small fixed fraction, often 79 referred to as a step, in the negative direction of the gradient vector. SGD is particularly attractive for applications involving continuous learning when incremental data availability allows continu- ous improvement of a model. As a first-order optimization technique, SGD does have a tendency, in some cases, to be slow in final convergence to a local minimum. In these situations, quadratic quasi-Newton methods such as Levenberg-Marquardt (LM) often provide improved convergence. However, the price for ’faster’ convergence in terms of the number of iterations is often much more computationally intensive iteration. As a result, LM algorithm based training typically uses gradi- ent evaluated on the model performance over the entire training data set. Our experiments indicate that NDT-created neural networks can achieve significantly faster convergence compared with net- work structure used by PINE with randomized weights when the SGD algorithm is used. As shown in Figures 7.1a and 7.1b, the decrease in RMSE is much faster during the training for NDT than for PINE. In these experiments, the entirety of the nearly 3 million training data points is group into 293 patches for 10,000 data points each for a SGD update of weights defining a network. When all 293 have been used once, the optimization process is said to reaches one epoch. The patches are then being reused in a new epoch of the training process. At the end of each epoch, the RMSE is evaluated on the entire training data setT and validation data setV . 
From Figures 7.1a and 7.1b we observe that not only a much faster reduction of RMSE for NDT as the training progress than that for PINE, the rate of decrease also shows a smoother approach in Figure 7.1a to a local minimum without the intermediate slowing down as seen for PINE. In the previous efforts by Zhelavskaya et al., it was found that the LM optimization technique was necessary to deliver slightly lower RMSE for both training and validation. Our experiments also confirm their observation. However, a common variant of the SGD method, Adaptive Moment Estimation [25], is often referred to as Adam algorithm with similar efficiency as SGD method can produce near-identical performance in terms of final RMSE level as LM algorithm as shown in Table 7.3 below. As we can see in Table 7.3, using the Adam algorithm, the RMSE level for NDT is nearly identical to that of PINE when trained with the LM algorithm, although LM seems able to reduce RMSE to an even lower level for both the training and validation data. A relevant question 80 (a) Changes in the sum of RMSE as functions of iteration number during the training of PINE and NDT in SGD. (b) Changes in the sum of RMSE as functions of iteration number during the training of PINE and NDT in Adam Figure 7.1: Comparison of Rate of reduction of RMSE for NDT and PINE using first order gradient descent type of optimization methods. is whether or not these minuscule differences have any significance statistically or in terms of model prediction accuracy. We shall attempt to address this issue later. Table 7.3: Robust Optimizer NDT PINE Optimization Algorithm SGD Adam LM SGD LM Training RMSE 0.3226 0.3158 0.3043 0.3649 0.3145 Testing RMSE 0.3316 0.3282 0.3204 0.3811 0.3282 The NDT model used in the comparison shown in Table 7.3 above is a model for which we limited the total number of decision node in CART to 25 so that the overall dimension of the weight vector for the resulting NDT is nearly identical to the PINE model with 45 neurons. We have also experimented with NDT models with a much lower degree of freedom involving a much smaller number of neurons in the network. Indeed, as shown in Table 7.4, compared with the default NDT initiated by a CART with 25 decision nodes, NDT initiated with CARTs with 15 or 10 decision nodes produced comparable or even lower RMSE level when optimized with the LM algorithm. Since the ultimate goal of our work is to produce a predictive model for plasmasphere dynam- ics, or more concretely, generate electron density field on the equatorial plane for a given solar and 81 Table 7.4: Comparison of final RMSE for different NDT constructed models and PINE. NDT PINE # of Decision nodes for NDT 25 15 10 Dimension of weight params 1478 738 443 1441 Fraction to dimension of PINE 100% 50% 30 % 100% Training RMSE 0.3043 0.3081 0.3198 0.3145 Testing RMSE 0.3204 0.3162 0.3256 0.3282 magnetic condition specified by the input vector X, we plotted in Figure 7.2 the predicted electron density field for all 4 models listed in Table 7.4 for a time period of known plasmasphere storm from June 26 to June 27, 2001. Figure 7.2: Model predictions of electron density field for June 26-27, 2001 storm. As we can see in Figure 7.2 the difference in the model predictions are quite minuscule con- sistent with their RMSE performance despite of their substantial difference in model complexity. However, the computation intensity in training these models can be vastly different as illustrate in Table 7.5 below. 
As we can see, the time required for training a model with high number of weights can be an order of magnitude longer than one that requires fraction number of weights. In addition to being more robust and stable, models with fewer parameters tend to have much higher informa- tion content measured by AIC or BIC indices. The fast training process also allow us to explore 82 other critical issues relevant for the development of a regression based model as we shall discuss in the next section. Our experimental results demonstrate that NDT algorithm can deliver high performance regression neural network models through inherently sophisticated multiple hidden layer structures. Table 7.5: The models and training algorithms are select for similar final RMSE performance. The times are measured on a personal computer wit a Intel® Core™ i7-4790 Processor CPU, a NVIDIA GeForce GTX 970 GPU and a total of 32 GB ROM. NDT PINE # of nodes for NDT 25 25 15 10 Optimization Algorithm Adam LM LM LM LM Train/Test RMSE 0.32/0.33 0.30/0.32 0.31/0.32 0.32/0.33 0.31/0.33 Time (minutes) 13.13 162.16 28.76 11.91 158.06 7.2 A Broader View of Task of Modeling As we have indicated at the beginning of Section 7.1 that the construction of a plasmasphere dynamics model based on the type of data available to us is particularly challenging because unlike for most supervised learning applications, for each solar and magnetic condition, our training data is not the ultimate model response which should be the electron density field in the Earth’s equatorial plane. Instead, each data point merely provides the density at a specific point in the plasmasphere. Since data are collected along the orbit of Van Allen Probes, the amount of data available over a 24 hour time period covers only a small fraction of the space in the plasmasphere as shown in Figure 7.3a. It would take several months’ worth of data to cover a significant portion of the plasmasphere. The underlying values for the solar and magnetic conditions can undergo substantial changes over this period of time. Consequently, the problem of obtaining a predictive model of electron density distribution for plasmasphere using solar and magnetic field observation seems extremely challenging even seemly unrealistic. More discussions on this aspect of model will be given in the next section. We also note that the spatial distribution of data is highly non- uniform as shown in Figure 7.3b. This is, of course, a result of the orbit for the Van Allen Probes 83 where the orbit reaches its highest point, and it is tangential to the circle at L= 6 on the equatorial plane. Consequently, a much larger number of training data is available at altitude L= 6. A closer examination of the preliminary descriptive statistical analysis of the available data shows both the average and empirical standard deviation of electron density are systematically spatially dependent (Figure 7.4a and 7.4b). (a) Data available for a period of 24 hours from Van Allen Probes (b) Spatial Distribution of Data Figure 7.3: Since Van Allen Probes collect data along their orbits, instantaneous global imaging of the plasmasphere density field is obviously unavailable and spatially non-uniform distribution of data is inherent to the measurement approach. We recall the fundamental assumptions that leads to statistical consistency of the regression analyses include the following: 1. Residual errors in data points are independent and identically distributed. 
Thus, least squares regression leads to the optimal estimation of the mean electron density.

2. The distribution of the training data should reflect the distribution of conditions that require prediction. Since the true goal of our prediction is the electron density on the entire equatorial plane at a given time, the data points should ideally be uniformly distributed. Moreover, the electron densities at different points on the equatorial plane are clearly not identically distributed. Indeed, the density at low altitudes is substantially higher than in the high-altitude region, as shown in Figure 7.4a.

Figure 7.4: Local statistics of the training data show distinct spatial variability in both the average and the standard deviation of the electron density. (a) Average electron density in the plasmasphere. (b) Empirical standard deviation of electron density in the plasmasphere.

Another property inherent in our understanding of the physics is that the electron density should be spatially continuous. However, when the spatial coordinates L and MLT are used, the spatial input data are defined over the rectangular area $[0,6] \times [0,24]$. As far as the training of a model is concerned, nothing indicates that the boundaries MLT = 0 and MLT = 24 are actually the same spatial points. On the other hand, a transformation to Cartesian coordinates $x_m = L\cos(2\pi\,\mathrm{MLT}/24)$ and $y_m = L\sin(2\pi\,\mathrm{MLT}/24)$ explicitly guarantees continuity across the boundary at MLT = 0. Naturally, when the training data volume is large and densely covers all areas of the input space, the optimal regression predictor would produce a spatially continuous electron density field. However, data from the Van Allen Probes are not sufficiently dense near the region where MLT = 0. As a result, we can clearly see a spatial discontinuity at MLT = 0 in the PINE prediction for the storm period of June 26-27, 2001 when the model is trained with the geolocation of the data registered in polar coordinates; see Figure 7.5. Figure 7.5 also shows that the spatial discontinuity is removed in the NDT prediction when the training data are geolocated in Cartesian coordinates.

Figure 7.5: When Cartesian coordinates are used for the geolocation of training data in the NDT training process, spatial continuity in the predicted electron density field is achieved. Since polar geolocation is used in the training of PINE, the electron density field produced by PINE can have visible spatial discontinuities.

The deviation from the basic statistical assumptions for regression underlying model training may mean, in practice, that the same relative residual errors in the high-density region weigh significantly more in the model training process than those in the low-density region, or that regions with a higher abundance of data have oversize importance. The ability of NDT to quickly select a suitable network configuration enables us to quickly explore approaches that address these high-level data analysis issues, which stem from our understanding of the physical properties of the plasmasphere. A usual remedy for the disparity in the spatial and statistical data distribution is a re-scaling of the raw data. In particular, we can partition the plasmaspheric region into areas where the data density and statistics are similar. In our case, the partition is according to altitude. Let $\mathcal{A}_k$, $k = 1,\dots,K$, be defined by

$$\mathcal{A}_k = \{(l, mlt) : l_{k-1} \le l < l_k\}.$$

We define a function $k(l)$ by $(l, mlt) \in \mathcal{A}_{k(l)}$ for all $l$.
Consider the localized sample mean and standard deviation defined by

$$\bar{y}_k = \frac{1}{N_k} \sum_{(l_i, mlt_i) \in \mathcal{A}_k} y_i, \qquad s_k^2 = \frac{1}{N_k - 1} \sum_{(l_i, mlt_i) \in \mathcal{A}_k} (y_i - \bar{y}_k)^2,$$

where $N_k = |\{i : (l_i, mlt_i) \in \mathcal{A}_k\}|$. Then a normalized version of the electron density is defined by

$$\hat{y}_i = \frac{y_i - \bar{y}_{k(l_i)}}{s_{k(l_i)}}. \qquad (7.1)$$

When a new regression neural network is trained to predict $\hat{y}$ instead of $y$, the training data are more consistent with the statistical assumptions of regression analysis. In the subsequent discussion, we refer to models trained with data scaled by local statistics as statistically scaled models. Naturally, the output $\hat{y}(x)$ of a statistically scaled model must be restored to the original scale by

$$y(x) = \hat{y}(x)\, s_{k(l)} + \bar{y}_{k(l)}.$$

Similarly, we can remedy the non-uniform spatial distribution of data by scaling. Let $\rho_k$ be the number density of data points in the region $\mathcal{A}_k$. We can replace the standard deviation in (7.1) by $\hat{s}_k = s_k / \sqrt{\rho_k}$. We refer to a model trained with such variable weights for the data points as a weighted model.

The scaling and weighting of data are equivalent to a change of the regression performance metric. Therefore, it is expected that the new models will produce a larger RMSE when tested against the validation data set than in the previous training, where lowering RMSE is the optimization criterion. However, these new variants of the model may provide a better representation of plasmasphere dynamical features when compared to actual imagery of the plasmasphere electron density field. To illustrate the effects of our data transformations, we simulated the plasmasphere electron density field during the storm of June 26-27, 2001 as in [22]; see Figure 7.6. As a reference, we show in Figure 7.7 the prediction of the plasmasphere density under normal conditions, defined by the mean values of the solar and magnetic input parameters.

Figure 7.6: Effect of statistical scaling and of weighting combined with statistical scaling: storm predictions of June 26-27, 2001.

Figure 7.7: Effect of statistical scaling and of weighting combined with statistical scaling: reference predictions under normal conditions.

Not surprisingly, all 4 variant models show essentially the same density field under normal conditions. However, comparing with Figure 7.7, we observe in Figure 7.6 that all 4 models show an enhancement of the electron density in the mid-afternoon (lower-left) region of the equatorial plane as a clockwise rotation during the onset of the storm at around 12 UT on June 26, 2001. As the storm progressed, we observe a significant depletion of the electron density at high altitude, while a remnant of the enhancement at around 15 MLT persisted for at least 6 hours, until 0 UT on June 27, 2001, before the density field returned to a near-normal state. The 4 variant models give somewhat different predictions of this transient period. In fact, all NDT models with Cartesian spatial registration of the data show a much slower process, with the enhancement persisting strongly in the afternoon (lower-left) region. Also, the progression of the decline of the enhanced region appears more detailed in the NDT predictions, with a much more localized enhanced region toward the end of the storm at around 0 UT on June 27. Although the question of which of these variant models is consistently capable of producing more realistic predictions of plasmasphere dynamics during storm conditions cannot be resolved by the anecdotal comparison shown here, the NDT variants presented show that careful data representation can alter the final construction of a trained model.
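To make the preceding data-representation steps concrete, the following minimal Python sketch illustrates the Cartesian geolocation transform and the local statistical scaling of Equation (7.1). It is only schematic: the arrays L, MLT and y, the bin edges, and the helper names are hypothetical stand-ins, not the code used to produce the results in this chapter.

import numpy as np

def to_cartesian(L, MLT):
    # Map polar geolocation (L, MLT) to Cartesian inputs so that MLT = 0 and
    # MLT = 24 coincide, enforcing spatial continuity across midnight.
    phi = 2.0 * np.pi * MLT / 24.0
    return L * np.cos(phi), L * np.sin(phi)

def statistically_scale(L, y, l_edges):
    # Normalize y by the local mean and standard deviation of its altitude bin
    # A_k = {l_{k-1} <= L < l_k}, as in Equation (7.1); the per-bin statistics
    # are returned so predictions can later be restored to the original scale.
    k = np.digitize(L, l_edges) - 1
    y_hat = np.empty_like(y, dtype=float)
    stats = []
    for i in range(len(l_edges) - 1):
        mask = (k == i)
        y_bar, s = y[mask].mean(), y[mask].std(ddof=1)
        # For the weighted model, s would be replaced by s / sqrt(rho_k),
        # with rho_k the number density of data points in bin A_k.
        y_hat[mask] = (y[mask] - y_bar) / s
        stats.append((y_bar, s))
    return y_hat, stats

# Illustrative synthetic usage only.
L = np.random.uniform(1.0, 6.0, 1000)
MLT = np.random.uniform(0.0, 24.0, 1000)
y = np.random.lognormal(mean=2.0 - 0.3 * L, sigma=0.5)
x_m, y_m = to_cartesian(L, MLT)
y_hat, stats = statistically_scale(L, y, l_edges=np.linspace(1.0, 6.0, 11))

Restoring a statistically scaled prediction then amounts to applying $y(x) = \hat{y}(x)\, s_{k(l)} + \bar{y}_{k(l)}$ with the stored per-bin statistics.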
The different scalings and weightings of the training data provide effective ways to construct a plurality of models that, taken as an ensemble, may deliver more reliable predictions of plasmasphere conditions.

7.3 Discussion and Conclusion

Our numerical experimental results presented in Sections 7.1 and 7.2 show that NDT provides an appropriate selection of the neural network structure based on the available training data, and that the method also leads to a good initialization of the neural network. These features not only yield excellent performance in reducing regression errors, as shown in Section 7.1, but the fast convergence of NDT also enables us to focus on the physical and theoretical statistical aspects of the modeling problem.

Even though the comparison between models with different degrees of adherence to standard statistical theoretical assumptions and physical constraints seems to produce qualitatively similar predictions for the storm event of June 26-27, 2001, a deeper examination of these models can reveal substantial differences among them. For this purpose, we first perform a principal component analysis of the input parameters, i.e., AE, Kp, F107, SymH and their near-time histories. More precisely, we first normalize each component of the vector X as follows:

$$V_{i,j} = \frac{X_{i,j} - \bar{X}_i}{s_i}, \quad \text{where} \quad \bar{X}_i = \frac{1}{N} \sum_{j=1}^{N} X_{i,j}, \qquad s_i^2 = \frac{1}{N-1} \sum_{j=1}^{N} (X_{i,j} - \bar{X}_i)^2, \qquad (7.2)$$

for each of the components $i = 1,\dots,28$ of the input vectors $X_j$, $j = 1,\dots,N$, obtained by removing the components for L and MLT. Consider the eigenvalues $\lambda_i^2$ and eigenvectors $u_i$ of the matrix $VV^T$. The values $\lambda_i$ and vectors $u_i$ are therefore the principal values and principal components of the normalized data set $V_j$, $j = 1,\dots,N$. Figure 7.8 shows that there are 5 to 6 dominant principal components in our training data set.

Figure 7.8: Distribution of singular values and projection of the data onto the principal directions.

Examining the projections of the data onto the principal components also reveals that the outliers for the first and second principal components are clearly either all negative or all positive. Given the small number of these outliers, and the fact that electron density data over the periods when these outliers occur are very limited, we do not expect the trained neural network model for plasmasphere dynamics to be capable of modeling the extreme conditions represented by these outliers. Indeed, the predictions of the plasmasphere density under the conditions $X = \bar{X} \pm \lambda_i u_i\, \mathrm{diag}(s_1,\dots,s_{28})$ for the first 5 principal components in Figure 7.9 show signs of model saturation, indicated by near-zero density at high altitude.

Figure 7.9: Electron density fields predicted by NDT and PINE for input parameters perturbed by one standard deviation in the direction of each of the first 5 principal components.

Note from the right panel in Figure 7.8 that the outliers in the first 5 principal components are often far beyond one standard deviation away from the mean value. However, perturbation of the input parameters by more than one standard deviation can sometimes lead to non-physical inputs. Therefore, the results in Figure 7.9 actually understate the issue of model saturation. These results are entirely expected because of the limited availability of data during extreme conditions. The model saturation also reveals the limitation of data-driven models trained with our data regarding their ability to predict the plasmasphere density under extreme conditions.
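As an illustration of the principal component analysis just described, the short Python sketch below normalizes a hypothetical input matrix as in Equation (7.2), extracts principal values and directions from $VV^T$, and forms the perturbed inputs $X = \bar{X} \pm \lambda_i u_i\,\mathrm{diag}(s_1,\dots,s_{28})$ used above. The array shapes and names are assumptions for the sketch only, not our production code.

import numpy as np

def principal_components(X):
    # X has shape (28, N): the solar/magnetic input components (L and MLT
    # removed) for N samples. Normalize each component as in Equation (7.2).
    X_bar = X.mean(axis=1, keepdims=True)
    s = X.std(axis=1, ddof=1, keepdims=True)
    V = (X - X_bar) / s
    # Eigen-decomposition of V V^T; the eigenvalues are the squared principal
    # values lambda_i^2 and the eigenvectors u_i are the principal directions.
    lam_sq, U = np.linalg.eigh(V @ V.T)
    order = np.argsort(lam_sq)[::-1]
    lam = np.sqrt(np.maximum(lam_sq[order], 0.0))
    return lam, U[:, order], X_bar.ravel(), s.ravel()

def perturbed_input(X_bar, s, lam_i, u_i, sign=+1.0):
    # Input vector perturbed along one principal direction,
    # X = X_bar +/- lambda_i * u_i * diag(s_1, ..., s_28).
    return X_bar + sign * lam_i * u_i * s

# Illustrative synthetic usage only.
X = np.random.randn(28, 5000)
lam, U, X_bar, s = principal_components(X)
X_plus = perturbed_input(X_bar, s, lam[0], U[:, 0], sign=+1.0)
X_minus = perturbed_input(X_bar, s, lam[0], U[:, 0], sign=-1.0)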
We are also interested in the systematic differences among the model variants under moderate conditions. In particular, we would like to understand whether or not the principal components identified in the solar and magnetic inputs of the models lead to physically meaningful characteristics in the predicted electron density field. To do this, we evaluate the difference in the predicted electron density field when the input parameters are perturbed by ±1 standard deviation from the mean values, i.e., the difference between the predicted fields at the two perturbed inputs. In Figure 7.10, these differences are shown for the first 5 principal components for the weighted NDT and for PINE. In addition to the spatial discontinuity at mlt = 0, which is visible in the PINE-predicted electron density field under perturbation of the principal components, there are also noticeable differences under perturbation of the input parameters along the other principal components. In particular, for both the 2nd and 4th principal components, the enhancement of electron density near midnight at high altitudes has much more finely resolved structures for the NDT model.

Figure 7.10: Differences in the electron density fields predicted by NDT and PINE for input parameters perturbed by ±1 standard deviation in the direction of each of the first 5 principal components.

Further comparison among the variant NDT models, shown in Figure 7.11, reveals a progression of changes in the perturbation patterns. Indeed, when only the geolocation registration is changed from polar to Cartesian coordinates, the patterns produced by NDT are similar to those predicted by PINE, but without the spatial discontinuities at mlt = 0. However, other spatial features in the perturbed electron density fields emerge as the additional scalings of the data are introduced.

Figure 7.11: Differences in the electron density fields predicted by the different models generated by NDT for input parameters perturbed by ±1 standard deviation in the direction of each of the first 5 principal components.

Without extensive independent validation data, it is difficult or impossible to conclude which model variants are more appropriate for representing the changes in the plasmasphere electron density field under characteristic changes in the input parameters. However, the models generated by NDT based on different physical and statistical considerations provide a wide range of alternative models for the prediction of plasmasphere dynamics. When they are taken as an ensemble, we are more likely to capture the diversity of the dynamical behavior of the plasmasphere.

Chapter 8

Conclusion

This thesis presents an enhancement of the hybrid supervised learning algorithm, the Neural Decision Tree (NDT), for both classification and regression. Since there are multiple ways to replicate a CART with an MLP using an indicator activation, a strategically selected approach to mapping and selecting activation functions becomes a crucial step in achieving an effective hybrid method. In addition to an improved statistical strong consistency for a much wider class of distributions of data samples, our experiments with NDT demonstrate the potential benefits of our approach in various machine learning applications. Compared to a randomly initialized network, the NDT approach naturally leads to a more sophisticated neural network structure. NDT has proven successful in both benchmark classification and space weather regression problems.
One of the most attractive aspects of the NDT approach is its ability to identify an appropriate network structure from the decision tree initialization without prior experience. This feature is particularly relevant for new applications in which limited expertise in machine-learning methods is available. In fact, the construction of NDT is not limited to a fully connected two-layer neural network. Additional elements can be added based on the specific features of an application. For example, combining NDT with convolutional filters may be better suited to image recognition. These results give us confidence that, as a general tool, NDT can provide benefits for a wide range of applications.

References

1. Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. Classification and regression trees. Wadsworth International Group (1984).
2. Yegnanarayana, B. Artificial neural networks (PHI Learning Pvt. Ltd., 2009).
3. Shanmuganathan, S. in Artificial neural network modelling 1–14 (Springer, 2016).
4. Kaur, S. & Jindal, S. A survey on machine learning algorithms. Int. J. Innovative Res. Adv. Eng. (IJIRAE) 3, 2349–2763 (2016).
5. Das, K. & Behera, R. N. A survey on machine learning: concept, algorithms and applications. International Journal of Innovative Research in Computer and Communication Engineering 5, 1301–1309 (2017).
6. Biau, G., Scornet, E. & Welbl, J. Neural random forests. Sankhya A 81, 347–386 (2018).
7. Kontschieder, P., Fiterau, M., Criminisi, A. & Rota Bulo, S. Deep neural decision forests. In Proceedings of the IEEE International Conference on Computer Vision (2015), 1467–1475.
8. Murphy, K. P. Machine learning: a probabilistic perspective (MIT Press, 2012).
9. Loh, W.-Y. Fifty years of classification and regression trees. International Statistical Review 82, 329–348 (2014).
10. Quinlan, J. R. C4.5: Programs for machine learning (Elsevier, 2014).
11. Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2, 303–314 (1989).
12. Györfi, L., Kohler, M., Krzyzak, A. & Walk, H. A distribution-free theory of nonparametric regression (Springer Science & Business Media, 2006).
13. Sauer, N. On the density of families of sets. Journal of Combinatorial Theory, Series A 13, 145–147 (1972).
14. Shelah, S. A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics 41, 247–261 (1972).
15. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B. & Ng, A. Y. Reading digits in natural images with unsupervised feature learning (2011).
16. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1929–1958 (2014).
17. Ioffe, S. & Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
18. LeCun, Y. & Cortes, C. MNIST handwritten digit database (2010).
19. Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Tech. rep. (Citeseer, 2009).
20. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011).
21. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., et al.
TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org (2015).
22. Zhelavskaya, I. S., Shprits, Y. Y. & Spasojević, M. Empirical modeling of the plasmasphere dynamics using neural networks. Journal of Geophysical Research: Space Physics 122, 11,227–11,244 (2017).
23. Zhelavskaya, I., Spasojevic, M., Shprits, Y. & Kurth, W. Automated determination of electron density from electric field measurements on the Van Allen Probes spacecraft. Journal of Geophysical Research: Space Physics 121, 4611–4625 (2016).
24. Lu, Y. L. & Wang, C. Validation of an alternative neural decision tree. In 2020 IEEE International Conference on Big Data (Big Data) (2020), 3682–3691.
25. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
26. Maas, A. L., Hannun, A. Y. & Ng, A. Y. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML 30 (2013), 3.
27. Glorot, X., Bordes, A. & Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (2011), 315–323.
28. Hoeffding, W. in The Collected Works of Wassily Hoeffding 409–426 (Springer, 1963).

Appendices

A Selection of the Activation Function in the First Layer of NDT

In order to analyze the impact of the first-layer activation, it is helpful to first consider the two-hidden-layer feed-forward neural network that replicates the decision tree $\mathcal{T}$ mathematically:

$$y^{(3)} = f(x) = \mathbb{I}\big(rW^{(3)} + b^{(3)}\big) = \mathbb{I}\big(\mathbb{I}(hW^{(2)} + b^{(2)})W^{(3)} + b^{(3)}\big) = \mathbb{I}\big(\mathbb{I}(\mathbb{I}(xW^{(1)} + b^{(1)})W^{(2)} + b^{(2)})W^{(3)} + b^{(3)}\big),$$

where $x \in \mathbb{R}^{1 \times d}$ is an input $d$-dimensional row vector, $\{W^{(i)}, b^{(i)}\}_{i=1}^{3}$ are the weights and biases of hidden layers 1 and 2 and of the output layer, respectively, and $\mathbb{I}(x)$ is the indicator function, which equals 1 if $x > 0$ and 0 otherwise. To further improve the decision boundary through gradient-based optimization, we need to replace the indicator activation $\mathbb{I}$ with smoother (differentiable almost everywhere) functions $\sigma(x)$ as follows:

$$y^{(3)} = f(x) = \sigma_3\big(\sigma_2(\sigma_1(xW^{(1)} + b^{(1)})W^{(2)} + b^{(2)})W^{(3)} + b^{(3)}\big),$$

where $\sigma_1$, $\sigma_2$ and $\sigma_3$ are three functions with non-zero derivatives. The selection of activation functions can have a significant impact on the performance of the final NDT. In several of our classification experiments, some activation functions even caused the NDT to lose the advantages inherited from the decision tree. To our knowledge, this issue has not been addressed previously.

Figure 8.1: A tree example with 2 decision nodes j and k and 3 leaf nodes.

First, $\sigma_1$ replaces the indicator $\mathbb{I}(x)$ in the hyperplane layer $h = \mathbb{I}(xW^{(1)} + b^{(1)})$, where $x \in \mathbb{R}^{1 \times d}$ is a row input vector. Note that the different combinations of 0's and 1's in $h$ encode the various ways of going left or right at each decision node, i.e., $h$ determines the tree structure. In addition, since the indicator function is discontinuous at 0, a small variation around 0 can yield a clear distinction. It turns out that replacing $\mathbb{I}$ by an activation function $\sigma_1$ that approaches 0 too slowly as $x \to -\infty$ may result in a loss of the advantage gained from the tree initialization. To illustrate this issue, consider the example of decision nodes j and k shown in Figure 8.1, and let $\tilde{h}_j$ and $\tilde{h}_k$ denote the two corresponding neurons in the first hidden layer before the activation function is applied, so that

$$\tilde{h}_j = [xW^{(1)} + b^{(1)}]_j, \qquad \tilde{h}_k = [xW^{(1)} + b^{(1)}]_k.$$

Assume $\tilde{h}_j = 0.9$ and $\tilde{h}_k = -0.01$ for some input vector $x$.
The indicator activation $\mathbb{I}$ yields a clear distinction between decision nodes j and k:

$$\mathbb{I}(\tilde{h}_j) = 1 \quad \text{and} \quad \mathbb{I}(\tilde{h}_k) = 0.$$

On the contrary, for the scaled hyperbolic tangent activation $t(x) = \frac{1}{2}(\tanh(x) + 1)$, we have

$$t(\tilde{h}_j) \approx 0.85 \quad \text{and} \quad t(\tilde{h}_k) \approx 0.49,$$

which already gives far less distinction between neuron j and neuron k. Moreover, this loss of distinction is carried forward and magnified in the second hidden layer r. To see this, we continue the example and construct the weights for leaf node i in Figure 8.1. Recalling the construction of $W^{(2)}$ in (4.4), the i-th column of $W^{(2)}$ and its corresponding bias are

$$W^{(2)}_i = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \qquad b^{(2)}_i = -0.5.$$

Then

$$\mathbb{I}(\tilde{h})\,W^{(2)}_i + b^{(2)}_i = 0.5, \qquad t(\tilde{h})\,W^{(2)}_i + b^{(2)}_i = 0.85 - 0.49 - 0.5 = -0.14,$$

which demonstrates that the two activations, indicator and hyperbolic tangent, yield completely opposite signs. In a binary classification problem, this sign difference in the second hidden layer yields completely opposite classes after the soft-max activation in the output layer!

Figure 8.2: Comparison between the indicator activation $\mathbb{I}$, the hyperbolic tangent $t(x)$ and the bounded ReLU $r_1(x)$ from the first hidden layer h to the second hidden layer. (Left) $\tilde{h}$ is activated by $\mathbb{I}$. (Middle) $\tilde{h}$ is activated by $t(x)$. (Right) $\tilde{h}$ is activated by $r_1(x)$.

To solve this issue, our first approach is to pick an activation for $\sigma_1$ with a hard 0 as a lower bound, so that the outcome of the activation is 0 whenever its argument is negative. The ReLU function $r(x) = \max(0, x)$ naturally comes to mind. In addition, the hard 0's in the first layer create large sparsity compared to smooth activation functions such as the logistic and hyperbolic tangent [26], [27]. One potential problem of the ReLU activation in NDT is its lack of an upper bound. Therefore, we add an upper bound of 1 to obtain the bounded ReLU activation as shown in 3.4:

$$r_1(x) = \min(\max(x, 0), 1).$$

Hence, $r_1(x)$ not only has a hard 0 but is also bounded above by 1, which further preserves the tree structure. For the activation function $\sigma_1(x)$ in the hyperplane layer, we therefore suggest

$$\sigma_1(x) = r_1(x) = \min(\max(0, x), 1).$$

A short numerical sketch illustrating this comparison of activations is given after Appendix C below.

B Hoeffding's Inequality

Lemma B.1 Let $X_1, \dots, X_n$ be independent real-valued random variables bounded by $[a_1, b_1], \dots, [a_n, b_n]$ respectively, where $a_i, b_i \in \mathbb{R}$, $i = 1, \dots, n$. Then

$$P\left( \left| \bar{X} - E(\bar{X}) \right| > \epsilon \right) \le 2 \exp\left( - \frac{2n\epsilon^2}{\frac{1}{n}\sum_{i=1}^{n} (b_i - a_i)^2} \right),$$

where $\epsilon > 0$ and $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the sample mean.

Proof B.1 See [28] (also [12], Appendix Lemma A.3).

C Exponential Inequality and the Inspiration by the Covering Number

Let $Z, Z_1, \dots, Z_n$ be a set of i.i.d. $\mathbb{R}^d$-valued random variables. For $n \in \mathbb{N}$, let $\mathcal{G}_n$ be a set of functions $g : \mathbb{R}^d \to \mathbb{R}$. We are interested in sufficient conditions for

$$\sup_{g \in \mathcal{G}_n} \left| \frac{1}{n}\sum_{i=1}^{n} g(Z_i) - E g(Z) \right| \to 0 \quad a.s.$$

as $n \to \infty$. Note that without the supremum, for any fixed $g$ with $E g(Z) < \infty$, the strong law of large numbers implies, as $n \to \infty$,

$$\left| \frac{1}{n}\sum_{i=1}^{n} g(Z_i) - E g(Z) \right| \to 0 \quad a.s.$$

If $g : \mathbb{R}^d \to [0, B]$ is bounded, then Hoeffding's inequality (Appendix B) implies

$$P\left( \left| \frac{1}{n}\sum_{i=1}^{n} g(Z_i) - E g(Z) \right| > \epsilon \right) \le 2 \exp\left( - \frac{2n\epsilon^2}{B^2} \right).$$

Together with the union bound, we have

$$P\left( \sup_{g \in \mathcal{G}_n} \left| \frac{1}{n}\sum_{i=1}^{n} g(Z_i) - E g(Z) \right| > \epsilon \right) \le 2 |\mathcal{G}_n| \exp\left( - \frac{2n\epsilon^2}{B^2} \right),$$

where $|\mathcal{G}_n|$ denotes the cardinality of $\mathcal{G}_n$. That means, if

$$\sum_{n=1}^{\infty} |\mathcal{G}_n| \exp\left( - \frac{2n\epsilon^2}{B^2} \right) < \infty, \qquad (C.1)$$

then the Borel-Cantelli lemma implies $\sup_{g \in \mathcal{G}_n} \left| \frac{1}{n}\sum_{i=1}^{n} g(Z_i) - E g(Z) \right| \to 0$ almost surely. On the other hand, condition (C.1) is never satisfied in our setting because the cardinality $|\mathcal{G}_n|$ is always infinite. Hence, we implement the idea of the covering number to obtain the results in Theorem 5.2.1.
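Returning to Appendix A, the sign flip described there can be checked numerically. The short Python sketch below is illustrative only: it uses the example values $\tilde{h}_j = 0.9$, $\tilde{h}_k = -0.01$ and the leaf-node weights $W^{(2)}_i = [1, -1]^T$, $b^{(2)}_i = -0.5$ from Figure 8.1, and compares the indicator, the scaled hyperbolic tangent, and the bounded ReLU activations.

import numpy as np

def indicator(x):
    # I(x) = 1 if x > 0, and 0 otherwise.
    return (x > 0).astype(float)

def scaled_tanh(x):
    # t(x) = (tanh(x) + 1) / 2.
    return 0.5 * (np.tanh(x) + 1.0)

def bounded_relu(x):
    # r_1(x) = min(max(x, 0), 1).
    return np.clip(x, 0.0, 1.0)

h_tilde = np.array([0.9, -0.01])   # pre-activation values at decision nodes j and k
W_i = np.array([1.0, -1.0])        # i-th column of W^(2) for the leaf node
b_i = -0.5

for name, act in [("indicator", indicator),
                  ("scaled tanh", scaled_tanh),
                  ("bounded ReLU", bounded_relu)]:
    z = act(h_tilde) @ W_i + b_i
    print(f"{name:12s}: input to second hidden layer = {z:+.2f}")
# The indicator gives +0.50 and the bounded ReLU gives +0.40, both preserving
# the positive sign dictated by the tree, while the scaled hyperbolic tangent
# gives about -0.14, flipping the sign and hence the predicted class.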
D Lemmas for the Proof of the Main Theorem

Lemma D.1 Let $(X, Y) \in \mathbb{R}^d \times \{0, 1\}$ be a random variable, and let $(X_1, Y_1), \dots, (X_n, Y_n)$ be i.i.d. copies of $(X, Y)$. Let $\mathcal{F}_n$ be a class of functions $f : \mathbb{R}^d \to \mathbb{R}$ depending on $(X_j, Y_j)_{j=1}^{n}$, let $m_n(X)$ be the empirical $L_2$ risk minimizer as in (5.1), and let $m(X) = E(Y \mid X)$ be the $L_2$ risk minimizer. Then

$$\int (m_n(x) - m(x))^2 \,\mu(dx) \le 2 \sup_{f \in \mathcal{F}_n} \left| \frac{1}{n}\sum_{j=1}^{n} (f(X_j) - Y_j)^2 - E(f(X) - Y)^2 \right| + \inf_{f \in \mathcal{F}_n} E(f(X) - m(X))^2. \qquad (D.1)$$

Proof D.1 Let $\mathcal{F}_n$ be the class of neural networks defined in (5.3). We have

$$E_X\big((m_n(X) - m(X))^2\big) = E_{X,Y}(m_n(X) - Y)^2 - E_{X,Y}(m(X) - Y)^2$$
$$= E_{X,Y}(m_n(X) - Y)^2 - \inf_{f \in \mathcal{F}_n} E_{X,Y}(f(X) - Y)^2 + \inf_{f \in \mathcal{F}_n} E_{X,Y}(f(X) - Y)^2 - E_{X,Y}(m(X) - Y)^2$$
$$= E_{X,Y}(m_n(X) - Y)^2 - \inf_{f \in \mathcal{F}_n} E_{X,Y}(f(X) - Y)^2 + \inf_{f \in \mathcal{F}_n} E_X(f(X) - m(X))^2.$$

Since

$$E_{X,Y}(m_n(X) - Y)^2 - \inf_{f \in \mathcal{F}_n} E_{X,Y}(f(X) - Y)^2 = \sup_{f \in \mathcal{F}_n} \big( E_{X,Y}(m_n(X) - Y)^2 - E_{X,Y}(f(X) - Y)^2 \big),$$

and by the definition of $m_n$ as the minimizer of the empirical $L_2$ error on the training data, we have, for any $f \in \mathcal{F}_n$,

$$E_{X,Y}(m_n(X) - Y)^2 - E_{X,Y}(f(X) - Y)^2$$
$$= \left[ E_{X,Y}(m_n(X) - Y)^2 - \frac{1}{n}\sum_{k=1}^{n} |m_n(X_k) - Y_k|^2 \right] + \left[ \frac{1}{n}\sum_{k=1}^{n} |m_n(X_k) - Y_k|^2 - \frac{1}{n}\sum_{k=1}^{n} |f(X_k) - Y_k|^2 \right] + \left[ \frac{1}{n}\sum_{k=1}^{n} |f(X_k) - Y_k|^2 - E_{X,Y}(f(X) - Y)^2 \right]$$
$$\le \left[ E_{X,Y}(m_n(X) - Y)^2 - \frac{1}{n}\sum_{k=1}^{n} |m_n(X_k) - Y_k|^2 \right] + \left[ \frac{1}{n}\sum_{k=1}^{n} |f(X_k) - Y_k|^2 - E_{X,Y}(f(X) - Y)^2 \right],$$

where the middle bracket is non-positive because $m_n$ minimizes the empirical error. Bounding each remaining bracket by the supremum in (D.1) completes the proof.

Lemma D.2 Using the notation defined in (5.18) for $\mathcal{H}_n$ and $Z_d^n$, inequality (5.20) holds.

Proof D.2 Let $h, h_j \in \mathcal{H}_n$ be two arbitrary elements of $\mathcal{H}_n$ and consider the following inequality:

$$\frac{1}{n}\sum_{i=1}^{n} |h(Z_i) - h_j(Z_i)| = \frac{1}{n}\sum_{i=1}^{n} \Big| |f(X_i) - y_i|^2 - |f_j(X_i) - y_i|^2 \Big| = \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - f_j(X_i)| \cdot |f(X_i) - y_i + f_j(X_i) - y_i| \le 4 \cdot \frac{1}{n}\sum_{i=1}^{n} |f(X_i) - f_j(X_i)|.$$

If $\mathcal{N}(\epsilon/32, \mathcal{F}_n, X_d^n)$ is the $\epsilon/32$-covering number of $\mathcal{F}_n$ with respect to the $L_1$ norm, then there exist $f_j \in \mathcal{F}_n$, $j = 1, \dots, \mathcal{N}(\epsilon/32, \mathcal{F}_n, X_d^n)$, such that for any $f \in \mathcal{F}_n$ there exists $j$ with $\frac{1}{n}\sum_{i=1}^{n} |f(X_i) - f_j(X_i)| < \epsilon/32$. This implies $\frac{1}{n}\sum_{i=1}^{n} |h(Z_i) - h_j(Z_i)| < \epsilon/8$. Hence, the $h_j \in \mathcal{H}_n$, $j = 1, \dots, \mathcal{N}(\epsilon/32, \mathcal{F}_n, X_d^n)$, form one possible $\epsilon/8$-cover of $\mathcal{H}_n$ over $Z_d^n$. Since $\mathcal{N}(\epsilon/8, \mathcal{H}_n, Z_d^n)$ is the smallest cardinality of an $\epsilon/8$-cover of $\mathcal{H}_n$ over $Z_d^n$, inequality (5.20) holds.