Statistical Insights into Deep Learning and Flexible Causal Inference

by

Hao Wu

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Applied Mathematics)

May 2020

Copyright 2020 Hao Wu

Dedication

To my family for their support and love...

Acknowledgements

My PhD study at the University of Southern California has been a memorable journey and a precious treasure that I will cherish for the rest of my life. First of all, I express my sincere appreciation to my advisors, Prof. Yingying Fan and Prof. Jinchi Lv, for their extraordinary advice and constant support in guiding me through the research world of modern statistics and deep learning. This experience will have a lasting impact on my future career path. Also, I would like to thank Prof. Jianfeng Zhang for being a co-advisor at my home department and providing positive encouragement. In addition, I want to thank Prof. Jason Fulman and Prof. Sergey Lototsky for serving on my committee and offering valuable insights. Additionally, I would like to express my deep thanks to Prof. Jin Ma, the Director of Graduate Studies Prof. Susan Montgomery, the Department Chair Prof. Eric Friedlander, and the Office Manager Amy Yung for their kind support throughout my PhD study, which I will remember forever. Lastly, I would like to thank my parents for their endless love. I also want to thank all my friends for keeping me company on this journey.

Table of Contents

Dedication ii Acknowledgements iii List of Tables viii List of Figures x Abstract xii 1 Introduction 1 2 Statistical Insights into Deep Learning 6 2.1 Introduction . . . 6 2.2 Experiments . . . 13 2.2.1 Model setting and some initial results . . . 13 2.2.2 Classification methods without clustering versus with clustering . . . 17 2.2.3 Deep neural networks via TensorFlow . . . 19 2.3 Some statistical insights and heuristic arguments . . . 23 2.3.1 Gaussian classification in one subspace . . . 23 2.3.2 Gaussian classification in two subspaces . . . 25 2.4 Real data application . . . 27 2.5 Conclusion . . . 29 2.6 Supplementary results and studies . . . 30 2.6.1 Building deep neural networks . . . 30 2.6.2 Proof of Theorem 2 . . . 32 2.6.3 Some heuristic arguments on how DNN learns in subspace classification . . . 38 2.6.4 Figures from real data application . . . 42 2.6.5 Additional large-scale simulation study . . . 44 3 Flexible Causal Inference 47 3.1 Introduction . . . 47 3.2 Research method . . . 50 3.2.1 Data construction . . . 50 3.2.2 Framework of flexible large-scale causal inference . . . 50 3.2.3 Training and testing . . . 54 3.3 Main findings . . . 57 3.3.1 Observational study without matching . . .
. . . 57 vi 3.3.2 Observational study with matching . . . . . . . . . . . . . . 59 3.3.3 Randomized experiment . . . . . . . . . . . . . . . . . . . . 64 3.4 Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 3.4.1 Variations of features . . . . . . . . . . . . . . . . . . . . . . 68 3.4.2 Observed confounding factor . . . . . . . . . . . . . . . . . . 70 3.4.3 Unobserved confounding factor . . . . . . . . . . . . . . . . 72 3.5 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.6 Supplementary results and studies . . . . . . . . . . . . . . . . . . . 74 3.6.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . 74 3.6.2 Tuning process of deep learning . . . . . . . . . . . . . . . . 76 3.6.3 Tuning process of other machine learning methods . . . . . . 81 4 Discussions 86 References 89 vii List of Tables 2.1 Bias-corrected distance correlations (DC) between different pairs of variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2 Bias-corrected distance correlations (DC) for the top hidden layer of DNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3 Clustering error rates on hidden layers . . . . . . . . . . . . . . . . 22 2.4 Bias-corrected distance correlations (DC) for real data application . 28 2.5 DNN with hidden units [10] . . . . . . . . . . . . . . . . . . . . . . 31 2.6 DNN with hidden units [20] . . . . . . . . . . . . . . . . . . . . . . 32 2.7 DNN with hidden units [50] . . . . . . . . . . . . . . . . . . . . . . 32 2.8 DNN with hidden units [100] . . . . . . . . . . . . . . . . . . . . . 32 2.9 Bias-corrected distance correlations (DC) for large-scale simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 2.10 Clustering error rates on hidden layers for large-scale simulation study 46 3.1 Numerical results of observation data without matching . . . . . . . 57 3.2 Numerical results of observation data after matching . . . . . . . . 59 3.3 Numerical results of experimental data . . . . . . . . . . . . . . . . 65 viii List of Figures 2.1 DNN structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2 Classification error rate for lasso logistic regression . . . . . . . . . 15 2.3 Classification error rates by different methods . . . . . . . . . . . 18 2.4 Classification error rates for the top hidden layer of DNN by differ- ent methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.5 Classification error rate of DNN . . . . . . . . . . . . . . . . . . . 20 2.6 t-SNE visualization . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.7 Classification of cats and dogs (copyright by Google Images) . . . 31 2.8 Traffic signs of labels 0 to 61 . . . . . . . . . . . . . . . . . . . . . 43 2.9 Instances of traffic signs in each cluster . . . . . . . . . . . . . . . 43 2.10 Classification error rates of LDA and DNN . . . . . . . . . . . . . 45 2.11 Classification error rate for LDA regarding subspaces . . . . . . . 46 3.1 Data construction . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.2 Deep Neural Network (DNN) . . . . . . . . . . . . . . . . . . . . . 52 3.3 10-fold cross validation . . . . . . . . . . . . . . . . . . . . . . . . 56 ix 3.4 Data visualization of observational data . . . . . . . . . . . . . . . 62 3.5 Correlation matrix of observational data . . . . . . . . . . . . . . 63 3.6 Tuning results for deep learning (matched observational data) . . 
64 3.7 Accuracies trend for deep learning (matched observational data) . 65 3.8 Data visualization of observational data . . . . . . . . . . . . . . . 66 3.9 Activation function - ReLU . . . . . . . . . . . . . . . . . . . . . . 77 3.10 Activation function - Sigmoid . . . . . . . . . . . . . . . . . . . . 77 3.11 Standard DNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 3.12 DNN after applying dropout . . . . . . . . . . . . . . . . . . . . . 80 3.13 Adam algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 3.14 Batch normalization visualization . . . . . . . . . . . . . . . . . . 83 3.15 Batch normalization algorithm . . . . . . . . . . . . . . . . . . . . 84 x Abstract Deep learning has benefited almost every aspect of modern big data applications. Yet its statistical properties still remain largely unexplored. To gain some statisti- cal insights into this, we design a simple simulation setting where we generate data from some latent subspace structure with each subspace regarded as a cluster. We empirically demonstrate that the performance of DNN is very similar to the ideal two-step procedure of clustering followed by classification (unsupervised plus super- vised). However, none of the hidden layers in DNN conducts successful clustering based on a series of simulation studies. We provide some statistical insights and heuristic arguments to support our empirical discoveries and further demonstrate the revealed phenomenon on the real data application of traffic sign recognition. In the field of causal inference, quantifying the average treat effects (ATE) for intervention policies plays an important role. Yet deep learning has not been well investigated in treatment effect inference applications. To this end and motivated by the application of blood donation intervention via shortage mobile messages, in this paper we exploit deep learning to learn the complex interaction structures between the treatment and the covariates in high-dimensional feature space and construct the counterfactuals using the learned nonparametric function by deep learning. Our framework estimates ATE on full dimensional space including the treatment and covariates, and it requires very mild assumptions on the distribution xi of each data point and the treatment given the covariates that are easily satisfied in practice. xii Chapter 1 Introduction Nowadays the advances of information technologies make big data more preva- lent in real applications. As one of the popular methodologies, deep learning has been applied successfully to a wide range of machine learning tasks such as image recognition, computer vision, natural language processing, and machine transla- tion. Researchers have made lots of efforts to develop algorithms for these complex problems and deep learning has shown great performance on different tasks, which have significant impact on the growth of field literature and applications; see, e.g., [17; 87; 89]. However, the interpretability of deep learning is one of the major chal- lenging perspectives of the modern development of deep learning in both academic research and industrial applications since it is highly nonparametric compared to the conventional method such as linear regression or logistic regression. In recent years, it has drawn great attentions and there are some research work from different 1 perspectives including probabilistic interpretation, information theory, and approx- imation theory; see, e.g., [44; 69; 73]. 
However, the statistical properties of deep learning still remain largely unexplored. The work presented in Chapter 2 is motivated by this challenge and by real-life problems. One example demonstrates the motivation: the task of classifying images of cats and dogs engaged in different activities, such as eating, sleeping, and running. Roughly speaking, each animal activity can be mimicked by one of the lower-dimensional subspaces, and the union of these subspaces forms a high-dimensional space in which animals can engage in different activities. We design a simple simulation setting where the generated data have a latent subspace structure, with each subspace regarded as a cluster. It is commonly believed nowadays that deep neural nets (DNNs) benefit from representational learning. Based on our empirical studies, we demonstrate that the performance of DNN is comparable to the ideal two-step procedure of unsupervised clustering followed by supervised classification. This motivates us to ask: does DNN indeed mimic the two-step procedure statistically? That is, do the bottom layers of DNN try to cluster first and then the top layers classify within each cluster? To answer this question, we conduct a series of simulation studies and, to our surprise, none of the hidden layers in DNN conducts successful clustering. In some sense, our results provide an important complement to the common belief of representational learning, suggesting that at least in some model settings, although the performance of DNN is comparable to the ideal two-step procedure knowing the true latent cluster information a priori, it does not really do clustering in any of its layers. We provide theoretical proofs showing that one can asymptotically achieve the ideal Bayes risk when classifying the data within one subspace, and that the latent subspace structure indeed introduces additional difficulties in classification, which are reflected in an additional risk term. We also provide some statistical insights and heuristic arguments that, under a probabilistic and geometric construction, the training loss of the DNN can closely approximate that of the ideal two-step classification procedure; therefore, the classification accuracy of DNN can closely mimic that of the two-step procedure, even though the hidden layers of DNN do not successfully recover the latent subspace labels. Moreover, we demonstrate the revealed phenomenon on the real data application of traffic sign recognition.

Causal inference has been widely applied in various areas including economics, the social sciences, and medical science. One of the major tasks is quantifying the average treatment effect (ATE), especially for intervention policies; see, e.g., [42; 6; 41]. The advances of information technologies naturally make big observational data more prevalent in real applications nowadays, where the experiments may not be completely randomized. Another common feature of such data is high dimensionality, where the number of covariates that can be constructed from the rich history of customers at the individual level can be large. As a popular tool of nonparametric learning, deep learning has been applied successfully to a wide range of machine learning tasks such as computer vision and machine translation. The connections between machine learning and causal inference have drawn attention from researchers in recent years; see, e.g., [6; 66; 14]. Yet deep learning has not been well investigated in treatment effect inference applications.
To this end and motivated by the application of blood donation intervention via shortage mobile messages, Chapter 3 presents the work exploiting deep learning to learn the complex interaction structures between the treatment and the covariates in high-dimensional feature space and to construct the counterfactuals using the nonparametric function learned by deep learning. Such an approach provides a natural nonparametric estimate of the average treatment effect (ATE) by taking the difference of the empirical averages of the predicted mean responses for the treatment and control groups, respectively. Our simulation and empirical studies reveal that the suggested approach to treatment effect inference based on deep learning performs well compared to other machine learning methods and is adaptive to both big observational data and randomized experimental data, with comparable performance to some traditional causal inference methods such as matching and propensity score that reduce the dimension to zero and one, respectively. We analyze the ATE from a different perspective: our framework estimates the ATE on the full-dimensional space including the treatment and covariates, and it requires only very mild assumptions on the distribution of each data point and of the treatment given the covariates that are easily satisfied in practice. We also provide some insights into the mechanisms of the suggested approach. Our new framework of flexible treatment effect inference is generally applicable to other applications in non-profit and for-profit industries. This general direction has been pinpointed by leading experts in the field of causal inference as the next milestone.

Chapter 2
Statistical Insights into Deep Learning

2.1 Introduction

Deep learning is a popular machine learning method that has gained a lot of interest in recent years. It has dramatically improved the state-of-the-art in image processing, speech recognition, text analysis, and many other domains such as healthcare and finance; see, e.g., [33; 3; 62; 55; 34]. In particular, there has been a significant amount of research on the classification of different objects by deep learning; see, e.g., [10; 79; 11; 29; 51]. To name a few, [49] proposed a multilevel deep learning approach for land cover and crop type classification with multi-temporal multi-source satellite imagery. [17] created a deep learning-based classification scheme by stacking denoising auto-encoders to segment organs at risk in the optic region in brain cancer. [20] demonstrated the effectiveness of deep learning in dermatology, applying the technique to both general skin conditions and specific cancers.

Despite their popularity, most deep learning methods are regarded as black-box procedures in the sense that their statistical properties are largely unknown. In recent years, researchers have attempted to unveil the statistical properties behind deep learning. Existing theoretical developments focus mainly on algorithms, probabilistic understanding, and approximation theory in deep learning. For example, [64] demonstrated how different smoothness classes lead to satisfactory results for approximation by ReLU networks and Gaussian networks on the entire Euclidean space. [68] developed a generative probabilistic model for deep learning that explicitly captures latent nuisance variation. [69] used the classical theory of ordinary differential equations and replaced a potentially fundamental puzzle about generalization in deep learning with elementary properties of gradient optimization techniques.
[44] introduced a new perspective to analyze the converging dynam- ics of neural network via the Neural Tangent Kernel (NTK). [73] suggested that compression dynamics in the information plane are not a general feature of deep networks, but are critically influenced by the nonlinearities employed by the net- work. [58] introduced a general framework of DeepPINK for reproducible feature selection in deep neural networks. Nevertheless the statistical insights into deep learning methods are still mostly unexcavated. 7 In this paper, we attempt to provide some understandings on the statistical performance of basic deep learning methods through numerically and theoretically studying a simple statistical model for two-class classification in high-dimensional space. Our model assumes that observations are independently and identically distributed in some high-dimensional space which is the union of several latent lower-dimensional subspaces. On each subspace, we further assume that the obser- vations are independent and follow a mixture of two Gaussian distributions with known labels. Note that the subspace structure is completely latent and unavail- able to us. This model is remotely motivated from some real life problems such as classifying cats and dogs using image data, where different pictures can correspond to different activities (e.g., eating, sleeping, and running). Roughly speaking, each lower-dimensional subspace in our model can be regarded as one type of activities of animals, and the union of all these subspaces mimics the reality that pictures for each animal species can consist of different activities. Another example is skin cancer detection, which aims at successfully classifying cancer cells and non-cancer ones. Skin cancer is among the most common of all human cancers, with 1 mil- lion people in the U.S. diagnosed each year with some type of the disease. There are three major types of skin cancers: basal cell carcinoma (BCC), squamous cell carcinoma (SCC), and melanoma. They are likely to grow in different areas which could be thought of as clusters. A further example is the recognition of traffic signs that lie in different groups based on the level of brightness and the angle of view, which plays an important role in self-driving cars. These examples motivate us 8 to consider the high-dimensional classification problem under the latent structural assumption of lower-dimensional subspaces. We acknowledge that our statistical model is definitely an oversimplified version of the previously discussed real life examples. However, since our intension is to provide some statistical insights into deep learning methods, studying this oversimplified model should not be a big issue for our specific purpose. It is popularly believed nowadays that deep neural nets (DNNs) benefit from representational learning, meaning that the bottom layers of DNN try to extract representations of different clusters (subspaces in our statistical model), and then top layers try to use these representations to help with classifications. However, through applying deep learning methods to a simulated data set from our statis- tical model discussed above, we discover some surprising facts which provide an important complement to the common belief of representational learning. To bet- ter understand our message, let us temporarily walk away from the deep learning framework and think about the ideal procedure for classification under our model setting. 
Ideally, if the clusters (or subspaces) are known to us, then the best we can do is to first separate the data according to subspaces, and then conduct classification within each subspace. In the following, we will name this two-step procedure relying on the oracle subspace information the ideal procedure, and use it as a benchmark to evaluate the performance of DNN. It is worth mentioning that clustering methods are popular in the data mining literature. For example, [80] formulated a solution to recover the union of subspaces in a high-dimensional setting with elegant theoretical guarantees. Many adapted algorithms have also been developed. [2] designed a customized neural network coupled with K-means clustering and added a clustering hardening loss to guide the process of updating the network and encourage clustering of the feature space, which is also supporting evidence for our paper that the regular DNN does not efficiently conduct clustering. In this paper, we intend to study this phenomenon of regular DNN through a simple statistical model, provide some statistical insights, and demonstrate the revealed phenomenon on a real data application. For face recognition, [56; 15; 87; 88; 89] have dedicated a lot of effort to the design of new loss functions that minimize the distances between the deep features of the same class, thereby improving face identification and verification accuracy empirically. Since it is impractical to precollect all possible class identities for training, those interesting works focus on compressing the learned deep features of the same class toward their centers and enhancing their discriminative power for verifying new unseen classes without label prediction. In contrast, in our paper the data have a latent subspace structure on top of the classes, where the subspaces (clusters) can be viewed as superclasses at a higher level. Under such a model structure, we explicitly study how DNN learns without clustering in advance and why it achieves comparable classification performance to the ideal two-step procedure, that is, unsupervised clustering to partition data into subspaces (superclasses) followed by supervised classification for further recognition within each subspace.

Through simulation studies, we discover that the classification error rate of DNN is very close to that of the ideal procedure, motivating us to ask whether the bottom layers of DNN indeed conduct clustering first and then the top layers classify within each learned cluster, which can be regarded as some type of representational learning. However, our simulation studies suggest otherwise. More specifically, we discover that none of the layers in DNN successfully recovers the cluster/subspace structure. This message suggests that at least in some high-dimensional classifications, the way DNN learns is different from the commonly believed representational learning.

To strengthen our empirical evidence, we further provide some theoretical results and insights under our statistical model. Our theoretical findings are consistent with our empirical ones: the ideal procedure can have a very low classification error rate and greatly outperform the naive procedure of blind classification which completely ignores the subspace structure in the data. We also provide some heuristic statistical arguments on the performance of DNN. Furthermore, we demonstrate the revealed phenomenon on the real data application of traffic sign recognition, which plays a crucial role in self-driving vehicles.
11 Table 2.1: Bias-corrected distance corre- lations (DC) between different pairs of variables Variable 1 Variable 2 DC X Y 0.0779 X (1) Y (1) 0.3817 X (2) Y (2) 0.2826 X (3) Y (3) 0.4004 f X (1) Y (1) 0.4740 f X (2) Y (2) 0.3345 f X (3) Y (3) 0.4038 Y Y F 0.8341 b Y Y F 0.8829 Xc Z subspace 0.8925 X bottomLayer Z subspace 0.3428 X topLayer Z subspace 0.2420 X twoLayers Z subspace 0.2965 Table 2.2: Bias-corrected distance corre- lations (DC) for the top hidden layer of DNN Variable 1 Variable 2 Training DC Test DC b Y Y F 0.8829 0.8354 Y lasso Y F 0.8717 0.8175 Y knn Y F 0.8707 0.8197 Y linearSVM Y F 0.8650 0.8081 Y kernelSVM Y F 0.8746 0.8310 Y randomForest Y F 0.8345 0.8217 X: nD feature matrix; Y : vector of true class labels; Y F : predicted class labels by the ideal procedure; b Y : predicated class labels by DNN with two hidden layers; f X (k) : n k p coefficient matrix corresponding to the k th subspace;X (k) = f X (k) A T k : n k D feature matrix corresponding to subspace k;Y (k) : labels for observations on subspace k; X c : transformed X via clustering; X bottomLayer : n-vector representing the hidden layerafter theinput layer;X topLayer : n-vector representingthe hiddenlayer before the output layer; X twoLayers : the augmented matrix [X bottomLayer ;X topLayer ]; Z subspace : the latent subspace label for each observation 12 2.2 Experiments 2.2.1 Model setting and some initial results Consider a sample of independent observations in R D from a union of three lower- dimensional subspaces S 1 ;S 2 ;S 3 , where each of the three subspaces is of dimen- sionality p with p < D. For each subspace S k (for k 2 f1; 2; 3g), denote by A k 2 R Dp the basis matrix whose columns are orthonormal vectors spanning the subspace. For an observation from subspace S k , suppose it can be repre- sented as x = A k e x, where e x 2 R p follows the mixture Gaussian distribution e x (1Y )N p ( (k) 0 ; ) +YN p ( (k) 1 ; ) with (k) j 2R p (for j2f1; 2g) the mean vectors and 2R pp the covariance matrix. Here,Y is a Bernoulli random variable with probability of success 1/2 representing the class label. Note that the latent subspace structure, i.e., the basis matrices (A 1 ;A 2 ;A 3 ), is completely unavailable to us. Suppose we have n labeled observations (x i ;Y i ), i = 1; ;n, independently sampled from the above latent subspace model. Denote by X = [x T 1 ; ;x T n ] T 2 R nD the matrix whose columns are feature vectors, and Y = (Y 1 ; ;Y n ) T the vector collecting the labels. Let (X (k) ;Y (k) ) be the observations corresponding to the kth subspace, and f X (k) the corresponding coefficient matrix, that is, X (k) = f X (k) A T k . Notethatforeachk2f1; 2; 3g,X (k) ; f X (k) ;Y (k) , andA k areunobservable 13 tous. Ourgoalistotrainaclassifier b C(x)suchthatforanewunlabeledobservation x, the trained classifier b C(x) can predict the class label ofx with high success rate. We simulate n k = 800 observations from subspace S k , and set D = 100 and p = 40. The augmented basis matrix [A 1 ;A 2 ;A 3 ]2R D(3p) is non-degenerate with rank D. So we end up with n = 2; 400 observations (x i ;Y i ) in total, where each feature vectorx i has dimensionality D = 100. Leta = (0:317; 0:318; 0:684) T and d = (1:098; 1:092;1:104) T . We set (k) 0 =A T k a(k)1 and (k) 1 =A T k (a(k)+d(k))1, where 1 is a vector of 1’s, for k = 1; 2; 3, where a(k) and d(k) stand for the kth entry of the vectorsa andd, respectively. 
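To make the simulation design above concrete, the following is a minimal data-generation sketch in NumPy under the stated setting (D = 100, p = 40, n_k = 800 per subspace). The helper name simulate_subspace_data, the random seed, and the reading of the all-ones vector as D-dimensional are assumptions made only for illustration.

```python
import numpy as np

def simulate_subspace_data(D=100, p=40, n_k=800, K=3, seed=0):
    """Generate (X, Y, Z): features from a union of K latent p-dimensional
    subspaces of R^D, class labels Y, and latent subspace labels Z."""
    rng = np.random.default_rng(seed)
    a = np.array([0.317, 0.318, 0.684])      # a(k) as in the text
    d = np.array([1.098, 1.092, -1.104])     # d(k) as in the text
    X, Y, Z = [], [], []
    for k in range(K):
        # Orthonormal basis A_k in R^{D x p} for subspace S_k.
        A_k, _ = np.linalg.qr(rng.standard_normal((D, p)))
        mu0 = a[k] * (A_k.T @ np.ones(D))            # mu_0^(k) = A_k^T a(k) 1
        mu1 = (a[k] + d[k]) * (A_k.T @ np.ones(D))   # mu_1^(k) = A_k^T (a(k)+d(k)) 1
        y = rng.integers(0, 2, size=n_k)             # Bernoulli(1/2) class labels
        coef = rng.standard_normal((n_k, p)) + np.where(y[:, None] == 1, mu1, mu0)
        X.append(coef @ A_k.T)                       # embed coefficients into R^D
        Y.append(y)
        Z.append(np.full(n_k, k))
    return np.vstack(X), np.concatenate(Y), np.concatenate(Z)

X, Y, Z = simulate_subspace_data()
print(X.shape)   # (2400, 100)
```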
Figure 2.1: DNN structure 14 Figure 2.2: Classification error rate for lasso logistic regression As discussed in Section 2.1, if the oracle information on each observation’s sub- space identity is known, the best one can do is the ideal procedure. We apply the ideal procedure to the simulated data and use the result as a benchmark to evaluate the performance of other methods. We also apply neural network with two hidden layers to the simulated data set (x i ;Y i ). More details about our neural network can be found in a later section. The third comparison method is the Fisher’s linear discriminant analysis (LDA) applied to (x i ;Y i ). Note that for the last two methods, we do not use the oracle information on observations’ subspace identity. Table 2.1 shows the distance correlations (DC) [85; 70] between various pairs of variables which is a specific case of Maximum Mean Discrepancy (MMD) [45; 53; 19; 47; 4; 28] with the distance kernel. DC provides an effective measure of 15 the nonlinear dependency between two random variables of potentially different dimensions and data types defined as dCor(u;v) = dCov(u;v) p dVar(u)dVar(v) (2.1) with dCov 2 (u;v) = 1 c p c q Z R p+q j u;v (x;y) u (x) v (y)j 2 jjxjj p+1 jjyjj q+1 dxdy: (2.2) Hereu andv are two random vectors of arbitrary dimensions p andq, respectively, u (x), v (y), and u;v (x;y) are the characteristic functions of u, v, and (u;v), and the constantc k = (1+k)=2 =((1 +k)=2) is half of the surface area of the unit sphere inR k+1 . In particular, DC is zero if and only ifu andv are independent. See [84] for details of the bias-corrected version of DC. Several messages can be drawn. First, the classification problem is challenging as reflected from the very low DC between feature matrixX and label vectorY . This will be further confirmed later by Figure 2.3. Second, knowing the subspace/cluster identity can help with classification, as evidenced from the much higher DCs between the class labelsY (k) and submatrices X (k) or f X (k) , and the high DC betweenY andY F withY F denoting the predicted training labels by the ideal procedure. In fact, the classification error rate by the idealprocedureis4.34%. Third, thepredictedtraininglabelsbyDNN(thestructure 16 is shown in Figure 2.1) b Y mimicY F very closely, with distance correlation as high as 0:8829. Fourth, none of the hidden layers cluster the data well, as shown by the low distance correlations in the last three rows in Table 2.1. These results suggest a surprising fact that DNN does not learn the cluster representations in data well but still manages to classify well. We next provide more empirical evidence in the following sections. 2.2.2 Classification methods without clustering versus with clustering In the last section we have seen that the ideal procedure based on oracle subspace information performs well in classification. We further demonstrate here that even with empirically learned cluster information, the performance of various popular classifiers can be very close to that of the ideal procedure. Specifically, we directly apply popularly used classification methods including LDA, Lasso logistic regres- sion, KNN,decisiontree, randomforest, andSVMwithlinearandnonlinearkernels, to the simulated data. To compare, we also implement a two-step procedure where in the first step, we use subspace clustering algorithm [67] to cluster the data first, and then in the second step we apply each of the aforementioned classification meth- ods within each identified cluster. 
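The two-step procedure just described can be sketched as follows, reusing X and Y from the data-generation sketch above. K-means is used here only as a convenient stand-in for the subspace clustering algorithm of [67], and lasso logistic regression mirrors the second-step classifier; the error rates reported in the text should not be expected from this simplified sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, Y, test_size=0.25, random_state=0)

# Step 1: learn cluster labels (stand-in for the subspace clustering of [67]).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_tr)
c_tr, c_te = km.labels_, km.predict(X_te)

# Step 2: fit one classifier within each learned cluster.
models = {}
for c in np.unique(c_tr):
    models[c] = LogisticRegressionCV(penalty="l1", solver="saga", max_iter=5000).fit(
        X_tr[c_tr == c], y_tr[c_tr == c])

# Classify each test point with the model of its assigned cluster.
y_hat = np.array([models[c].predict(x[None, :])[0] for c, x in zip(c_te, X_te)])
print("two-step test error:", np.mean(y_hat != y_te))
```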
The number of repetitions is 100, with the mean error for clustering being 0.0008. The very low clustering error also suggests that the latent subspace structure is easy to learn. 17 Figure 2.3: Classification error rates by different methods It is shown that all methods have improved performance after we incorporate the clustering step, with clustering followed by lasso logistic regression the best with overall classification error rate 5.46%. In addition, with clustering, almost all methods (except for KNN because of possible curse of dimensionality), perform similarly and comparable to the ideal procedure of classification error rate 4.34%. For illustration purpose, the specific plot for lasso logistic regression is shown below in Figure 2.2. Similar to Table 2.1, we calculate the DC between the predicted class labels by various methods in Figure 2.3. The results are presented in Table 2.2. The last column corresponds to the DC’s for test data. These results confirm again that DNN has performance very close to these two-step procedures. 18 Figure 2.4: Classification error rates for the top hidden layer of DNN by different methods 2.2.3 Deep neural networks via TensorFlow We implemented deep learning by TensorFlow on the hardware 8-core 2.4 gigahertz Intel E5-2665 processors and NVIDIA K20 Kepler GPU accelerators. To speed up the training process of DNN, we use parallel computing with 24 threads/CPUs (2 sockets, 6 cores per socket and 2 threads per core) and 2 GPUs. Figure 2.5 shows the average error rate over 100 test data sets. It can be seen fromFigure2.5thattheclassificationerrorratefollowsaU-shapedcurve, regardless of whether dropout is used or not on both CPU or GPU. For the same number of layers, the error rate decreases as the number of hidden units grows. Additionally, for the same number of hidden units in each hidden layer , the error rate declines first and then rises afterward. This explains why we focus on DNN with 2 hidden 19 layers and 100 hidden units in each layer (dropout = 0.5) (see Tables 2.5–2.8 in Section 2.6 for detailed numerical results for various numbers of layers with dropout = 0.5). Later, we add dropout = 0.2 for the input layer, and apply centerization and normalization to achieve lower error rate 0.0640 (CPU) and 0.0644(GPU). Figure 2.5: Classification error rate of DNN To examine whether any layers of DNN carry the latent cluster information, we apply various popularly used clustering methods to the bottom hidden layer X bottomLayer ,tophiddenlayerX topLayer ,andthecombinationofthetwohiddenlayers X bottomLayer + X topLayer . The methods we experimented with include the greedy 20 Figure 2.6: t-SNE visualization subspace clustering (greedySC)[67], the robust subspace clustering via thresholding (RSCT)[35], the K-means clustering [57], the hierarchical clustering [71], and the SVD coupled with the aforementioned two methods, where the clustering error rates are shown in Table 2.3. The clustering error rate is the mismatch ratio of cluster labels relative to the true labels up to permutation of classes. To get some visualization, we provide the t-SNE plots [61] in Figure 2.6. We can see from the plots that DNN did not preserve the clustering pattern and learned features in a different mechanism. The patterns shown in the three plots also support the trend in the DC values that we calculated in Table 2.1. 
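For reference, a minimal Keras sketch of the network described in Section 2.2.3 (two hidden layers with 100 units each, dropout 0.5 after each hidden layer, dropout 0.2 on the input, and centered and normalized features) is given below, together with the extraction of the hidden-layer features X_bottomLayer and X_topLayer that are fed to the clustering methods. The ReLU activations, the Adam optimizer, the layer names, and the number of epochs are assumptions not pinned down in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_dnn(input_dim=100):
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dropout(0.2),                                    # input dropout
        layers.Dense(100, activation="relu", name="bottom_hidden"),
        layers.Dropout(0.5),
        layers.Dense(100, activation="relu", name="top_hidden"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),                  # two-class output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Center and normalize the features, then train (X, Y from the simulation sketch above).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
dnn = build_dnn(X_std.shape[1])
dnn.fit(X_std, Y, epochs=100, batch_size=64, validation_split=0.1, verbose=0)

# Hidden-layer features used in the clustering diagnostics discussed here.
X_bottom = tf.keras.Model(dnn.inputs, dnn.get_layer("bottom_hidden").output).predict(X_std)
X_top = tf.keras.Model(dnn.inputs, dnn.get_layer("top_hidden").output).predict(X_std)
```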
These high clustering error rates here with the low clustering error rate in Section 2.2.2, as well as the t-SNE visualization in Figure 2.6, suggest that DNN does not conduct efficient clustering in any of its layers. 21 Table 2.3: Clustering error rates on hidden layers Method X bottomLayer X topLayer X bottomLayer + X topLayer GreedySC 0.431913 0.470013 0.470083 RSCT 0.403967 0.430333 0.421375 K-means 0.448750 0.474450 0.460292 SVD + K-means 0.444742 0.470392 0.464654 Hierarchical 0.666054 0.666050 0.666079 SVD + Hierarchical 0.666075 0.666025 0.666071 We also apply the classification methods discussed in Section 2.2.2 to just the top hidden layer. As seen from Figure 2.4, the classification performance is highly similar to that in Figure 2.3 with the exception of the decision tree method. These results reinforce our previous statement that DNN has comparable classification results to two-step procedures without performing clustering first. In addition, we have also conducted a larger-scale simulation study in which both the sample size and dimensionality are enlarged by ten times. Indeed the implementation of such a large-scale study was very time-consuming. However, the patterns and phenomenon observed previously stay the same in the large scale. A short summary of the main results for this additional large-scale simulation study is presented in Section 2.6.5 (the full detailed results are available upon request). 22 2.3 Somestatisticalinsightsandheuristicarguments We attempt to provide some theoretical justifications on our empirical findings in this section. Our results have three components: classification within each cluster, classification without learning the cluster information, and some heuristic statistical arguments on how DNN learns. 2.3.1 Gaussian classification in one subspace We focus on classification within one subspace in this section. For simplicity, we drop all the superscripts corresponding to the subspace and thus the two class distributions areN p ( 0 ; ) andN p ( 1 ; ), and the observations are (x i ;Y i ), i = 1; ;n. We restrict ourselves to lower-dimensional subspace, so the dimensionality is p. The class prior probabilities are assumed to be equal. In this two-class Gaussian classification, the Bayes rule takes the form B (x) = 1f T 1 (x)> 0g; where = 1 0 and = 1 2 ( 0 + 1 ). With training data, we can plug in the sample estimates of i and , denoted by b i and b , and the corresponding LDA classifier is F (x) = 1f b T b 1 (xb )> 0g with b =b 1 b 0 : 23 It is well known that the Bayes risk, or the classification error rate of the Bayes rule B (x), takes the form R B = (m=2); where m 2 = T 1 is the Mahalanobis distance between two classes. For the LDA F (x), conditional on the training data, the classification error rate is R n ( F ) = 1 2 X k=0;1 (1) k+1 T b 1 ( k b k ) + b b 1 b =2 ( b Tb 1 b 1b ) 1=2 ! ; where = 1 with the standard Gaussian cumulative distribution function. The following result is a direct consequence from [52] (Theorem 1 (ii)). We include it for completeness. Theorem 1. Assume that p=n! 0 as n!1. Then for large enough n, R n ( F ) =R B +o p (1): (2.3) The above result suggests that when the latent subspace structure is known, classification can be highly accurate as long as the Mahalanobis distance m 2 is large enough. This is consistent with our results in Section 2.2.2. See, for example, [21; 24; 25] for more developments on high-dimensional classification. 
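As a quick numerical illustration of Theorem 1, the sketch below simulates a single subspace with identity covariance and compares the plug-in LDA test error with the Bayes risk Phi(-m/2); the particular means and sample sizes are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
p, n_train, n_test = 40, 4000, 50000
mu0, mu1 = np.zeros(p), 0.5 * np.ones(p)     # two class means, Sigma = I_p

def sample(n):
    y = rng.integers(0, 2, size=n)
    x = rng.standard_normal((n, p)) + np.where(y[:, None] == 1, mu1, mu0)
    return x, y

x_tr, y_tr = sample(n_train)
x_te, y_te = sample(n_test)

m = np.linalg.norm(mu1 - mu0)                         # Mahalanobis distance (Sigma = I_p)
lda = LinearDiscriminantAnalysis().fit(x_tr, y_tr)    # plug-in rule delta_F
print("Bayes risk Phi(-m/2):", norm.cdf(-m / 2))
print("plug-in LDA test error:", np.mean(lda.predict(x_te) != y_te))
```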
2.3.2 Gaussian classification in two subspaces

We now consider the case where data are from the union of some lower-dimensional subspaces. To simplify the proof, we consider two subspaces $S_1$ and $S_2$ with $p = D/2$, and assume that the basis matrices take the forms $A_1^T = [I_p, 0] \in \mathbb{R}^{p\times D}$ and $A_2^T = [0, I_p] \in \mathbb{R}^{p\times D}$. In addition, on each subspace $S_k$, we assume that the common covariance matrix for the mixture Gaussian is $I_p$. We also assume that conditional on the class label $Y$, observations have equal chance of coming from each of these two subspaces. These oversimplifications allow us to provide an explicit expression for the overall classification error rate.

With the above model structure, class $Y = 0$ observations have feature vectors independently drawn from the mixture of two degenerate Gaussian distributions $N(A_1\mu_0^{(1)}, A_1A_1^T)$ and $N(A_2\mu_0^{(2)}, A_2A_2^T)$ with equal mixing probability. Similarly, class $Y = 1$ observations have feature vectors independently drawn from the mixture of two degenerate Gaussian distributions $N(A_1\mu_1^{(1)}, A_1A_1^T)$ and $N(A_2\mu_1^{(2)}, A_2A_2^T)$ with equal mixing probability. Note that the subspace structure is completely latent to us and we only observe data pairs $(x_i, Y_i)$. We will focus on the LDA classification rule obtained by pretending that the observations were from a two-class Gaussian classification model. This corresponds to the first boxplot labeled as LDA in Figure 2.3.

Theorem 2. Assume that $\mu_0^{(1)} = \mu_1^{(2)} = 0$ and $\|\mu_1^{(1)}\|_2 = \|\mu_0^{(2)}\|_2$. Then the Bayes rule has risk
$$R_B = \frac{1}{4} + \frac{1}{2}\,\Phi\big(-\|\mu_1^{(1)}\|_2\big). \qquad (2.4)$$
Assume further that $p/n \to 0$. Then for large enough $n$, it holds that
$$R_n(\delta_F) = R_B + o_p(1).$$

The full proof is provided in Section 2.6.2. It is interesting to see from (2.4) that as long as $\|\mu_1^{(1)}\|_2$ is large enough, the overall classification error rate is close to $1/4$. In view of our simulation results in Section 2.2.2, our theory is consistent with the result in Figure 2.3 about LDA. Moreover, using our notation in this section, the asymptotic classification error within one subspace (as shown in Theorem 1) can be written as $\Phi(-\|\mu_1^{(1)}\|_2/2)$. Combining this result with Theorem 2, we can see that the latent subspace structure does impose additional difficulties in classification, and the extra difficulty is mainly reflected in the term $1/4$.

In addition, we provide in Section 2.6.3 some heuristic statistical arguments on how DNN learns in our simple subspace classification model introduced earlier, which support our empirical findings.

2.4 Real data application

In addition to the simulation studies, we also demonstrate the revealed phenomenon on the real data application of traffic sign recognition, which is a crucial task in self-driving vehicles. The real data set (cropped images) is composed of 62 classes labeled as 0 to 61. The sizes of the images are around $128 \times 128$ pixels, and each pixel contains an RGB color tuple with values ranging from 0 to 255. An overview of the images is shown in Figure 2.8 [63] in Section 2.6.4. We resize the images to $32 \times 32$ pixels, which is beneficial for recognition. Thus the number of features for each image is $32 \times 32 \times 3 = 3072$. See also [9; 77]. Each image of a traffic sign may be captured by the camera from the front view or the side view, and the brightness of the image could be low or high.
We apply three clustering strategies: 1) four clusters with low brightness and front view (LBFV), low brightness and side view (LBSV), high brightness and front view (HBFV), and high brightness and side view (HBSV);2) two clusters with low brightness (LB) and high brightness (HB); and3) two clusters with front view (FV) and side view (SV). These strategies are motivated by the intermediate results by clustering algorithms (the images are clustered by brightness) and the different aspect ratios of the images (the images are clustered by view angle). For instance, the images of label 21 (stop sign) can be partitioned into four clusters as shown in Figure 2.9 in Section 2.6.4. 27 Table 2.4: Bias-corrected distance correlations (DC) for real data application Clustering Strategy Variable 1 Variable 2 DC 1) LBFV, LBSV, X bottomLayer Z 1 0.0558 HBFV, HBSV X topLayer Z 1 0.0508 X twoLayers Z 1 0.0516 2) LB, HB X bottomLayer Z 2 0.0631 X topLayer Z 2 0.0583 X twoLayers Z 2 0.0591 3) FV, SV X bottomLayer Z 3 0.0311 X topLayer Z 3 0.0276 X twoLayers Z 3 0.0281 X bottomLayer : deep feature by the hidden layer after the input layer; X topLayer : deep feature by the hidden layer before the output layer; X twoLayers : the augmented matrix [X bottomLayer ;X topLayer ]; Z i : the one-hot cluster identities of the strategy i);L: low,H: high,B: brightness;F: front,S: side,V: view Without applying the cluster structure a priori, the classification error rates by linear discriminant analysis (LDA), random forest, support vector machine (SVM), and deep learning are 17.38%, 12.86%, 12.42%, and 4.73%, respectively. If we utilize the cluster identities based on the above three strategies, then the classification error rates can be reduced for the clusters, which images have low brightness and/or front view (LBFV of strategy 1, LB of strategy 2, and FV of strategy3), whichmatchesourintuitiononthedifficultylevelofdetectingthetraffic signs. The sample sizes for these four clusters HBFV, HBSV, LBFV, and LBSV in the training data set are 1706, 799, 1380, and 690, respectively. As a result, random 28 forest can lower the classification error rate by 1% using the clustering strategies 1 and 2, SVM benefits from strategy 3 and also decreases the classification error rate by 1%, and the four-cluster strategy 1 facilitates LDA to obtain a much improved classification error rate of 9.56% error rate. So does deep learning capture the cluster structure internally over some hidden layers? As in our analysis for the simulation studies, we calculate the distance correlations of the deep features by the first hidden layer, second hidden layer, and their combinations, respectively, with the one hot cluster identities of the three clustering strategies; see Table 2.4. We see that the distance correlations are quite low. These real data results are consistent with our simulation experiment and theoretical interpretation that DNN does not learn the cluster structure but yet is capable of achieving comparable or even better classification performance in comparison to other machine leaning methods with clustering added on top. 2.5 Conclusion The main purpose of this paper is to provide some statistical insights into how DNN learns in the subspace classification model. We have designed a simple sim- ulation study on two-class classification with data generated from latent subspace structure. We compared DNN with the two-step procedures of clustering followed by classification. 
We discovered that although DNN has comparable classification 29 performance to the two-step procedure, it does not conduct efficient clustering in any of its layers. This is a surprising result and in some sense provides an impor- tant complement to the common belief of representational learning for DNN. We also provided theoretical results to support our empirical findings and presented some heuristic statistical arguments on how DNN learns. The real data application of traffic sign recognition for self-driving vehicles further showcased the revealed phenomenon. 2.6 Supplementary results and studies One real life problem that motivates our study is shown in Figure 2.7, where we classify cats and dogs from the a collection of images capturing different activities (e.g., eating, sleeping, and running). In our model, we mimic the set of images through a high-dimensional space which is the union of three subspaces, and each low-dimensional subspace corresponds to one type of activities. 2.6.1 Building deep neural networks We train and test different structures of DNNs over 100 training and test data sets with the same sample size, which are randomly generated according to our model setting. We choose the optimal hyper parameters on the validation set (10% of the 30 Figure 2.7: Classification of cats and dogs (copyright by Google Images) training set). DNNs are run on both CPUs and GPUs, we discover that when the network is shallow or has small number of hidden units, CPUs outperform GPUs in terms of computational speed. However, as the number of hidden layers and units increases, GPUs definitely run faster. Table 2.5: DNN with hidden units [10] Layer Training error Test error Run time (sec) CPU GPU CPU GPU CPU GPU 1 0.1282 0.1283 0.1548 0.1536 8.1665 14.4766 2 0.0923 0.0875 0.1328 0.1274 9.5305 21.9774 3 0.1016 0.1262 0.1518 0.1521 11.0447 19.4718 4 0.0993 0.0961 0.1555 0.1580 12.7921 26.6293 5 0.1104 0.1263 0.1697 0.1826 14.3730 28.7950 10 0.4439 0.4483 0.4540 0.4575 22.6595 38.1027 31 Table 2.6: DNN with hidden units [20] Layer Training error Test error Run time (sec) CPU GPU CPU GPU CPU GPU 1 0.1051 0.1085 0.1324 0.1339 9.5807 19.5555 2 0.0500 0.0510 0.0940 0.0956 12.1759 21.2431 3 0.0391 0.0399 0.1033 0.1051 14.8497 23.4055 4 0.0330 0.0330 0.1213 0.1233 18.1428 26.5479 5 0.0342 0.0318 0.1469 0.1433 21.0214 37.0104 10 0.2165 0.1794 0.2649 0.2371 35.0329 50.8225 Table 2.7: DNN with hidden units [50] Layer Training error Test error Run time (sec) CPU GPU CPU GPU CPU GPU 1 0.0848 0.0847 0.1139 0.1137 10.1360 19.5282 2 0.0306 0.0300 0.0795 0.0795 15.1895 17.0301 3 0.0105 0.0105 0.0812 0.0812 20.1912 16.5940 4 0.0057 0.0054 0.0967 0.0956 25.2243 17.9423 5 0.0033 0.0030 0.1389 0.1342 30.0675 29.2289 10 0.0170 0.0255 0.1565 0.1599 53.2125 40.1224 Table 2.8: DNN with hidden units [100] Layer Training error Test error Run time (sec) CPU GPU CPU GPU CPU GPU 1 0.0781 0.0777 0.1077 0.1073 12.8185 18.7767 2 0.0201 0.0206 0.0765 0.0764 21.2060 21.5077 3 0.0030 0.0031 0.0770 0.0775 29.3675 24.0765 4 0.0013 0.0013 0.0813 0.0807 37.3774 26.7721 5 0.0009 0.0008 0.1002 0.0990 45.5618 23.3326 10 0.0021 0.0045 0.1422 0.1404 84.7540 42.6131 2.6.2 Proof of Theorem 2 Proof. Recall that under our model assumption, class Y = 0 observations have feature vectors independently drawn from the mixture of two degenerate Gaussian 32 distributionsN (A 1 (1) 0 ;A 1 A T 1 ) andN (A 2 (2) 0 ;A 2 A T 2 ) with equal mixing probabil- ity. 
distributions $N(A_1\mu_0^{(1)}, A_1A_1^T)$ and $N(A_2\mu_0^{(2)}, A_2A_2^T)$ with equal mixing probability.
Similarly, class Y = 1 observations have feature vectors independently drawn from the mixture of two degenerate Gaussian distributionsN (A 1 (1) 1 ;A 1 A T 1 ) and N (A 2 (2) 1 ;A 2 A T 2 ) with equal mixing probability. Therefore, for a random observa- tionx2R D from class k, we have the condition mean m k E[xjY =k] = 0 B @ 1 2 (1) k 1 2 (2) k 1 C A ; and the conditional covariance matrix Var(xjY =k) = 0 B @ 1 2 I p + 1 4 (1) k ( (1) k ) T 0 0 1 2 I p + 1 4 (2) k ( (2) k ) T 1 C A : Therefore, for an observation randomly drawn, the unconditional covariance matrix can be calculated as Var(x) = E[Var(xjY )] + Var(E[xjY ]) = 0 B @ C 11 C 12 C 21 C 22 1 C A : where C 11 = 1 2 I p + 3 16 ( (1) 0 ( (1) 0 ) T + (1) 1 ( (1) 1 ) T ); C 12 = 1 16 ( (1) 0 ( (2) 0 ) T + (1) 1 ( (2) 1 ) T ); 33 C 21 = 1 16 ( (2) 0 ( (1) 0 ) T + (2) 1 ( (1) 1 ) T ); C 22 = 1 2 I p + 3 16 ( (2) 0 ( (2) 0 ) T + (2) 1 ( (2) 1 ) T ): Recall that we assume (1) 0 = 0 and (2) 1 = 0. Thus the above conditional mean and unconditional covariance matrix can be simplified as m 0 = E[xjY = 0] = 0 B @ 0 1 2 (2) 0 1 C A ; m 1 = E[xjY = 1] = 0 B @ 1 2 (1) 1 0 1 C A ; and 0 Var(x) = 0 B @ 1 2 I p + 3 16 (1) 1 ( (1) 1 ) T 0 0 1 2 I p + 3 16 (2) 0 ( (2) 0 ) T 1 C A ; respectively. 34 Now suppose we completely ignore the subspace structure and use the Bayes rule from two class Gaussian classification, that is, by pretending that the data were from Gaussian distributions N (m 0 ; 0 ) andN (m 1 ; 0 ); we obtain the following decision rule (x) = 1fr(x) := x m 0 +m 1 2 T 0 1 (m 1 m 0 )> 0g: (2.5) We next calculate the classification error of the above decision rule (2.5). For a new observationx with true class label Y, by conditional probability, P ((x)6=Y ) =P ((x) = 0jY = 1) 1 2 +P ((x) = 1jY = 0) 1 2 : (2.6) We first calculate P ((x) = 0jY = 1). Since conditional on class 1, a random observation has equal probability of coming from the two subspaces, we have P ((x) = 0jY = 1) = 1 2 P ((x) = 0jY = 1;x2S 1 ) + 1 2 P ((x) = 0jY = 1;x2S 2 ): (2.7) Now, recall that for a random observationx from classY = 1 and subspaceS 1 , we have xN (A 1 (1) 1 ;A 1 A T 1 ) 35 Therefore, the classification rule (x) has Gaussian distribution with mean and variance E[r(x)jY = 1;x2S 1 ] = 3 8 ( (1) 1 ) T ( 1 2 I p + 3 16 (1) 1 ( (1) 1 ) T ) 1 (1) 1 + 1 8 ( (2) 0 ) T ( 1 2 I p + 3 16 (2) 0 ( (2) 0 ) T ) 1 (2) 0 = 3 4 k (1) 1 k 2 2 1 + 3 8 k (1) 1 k 2 2 + 1 4 k (2) 0 k 2 2 1 + 3 8 k (2) 0 k 2 2 = k (1) 1 k 2 2 1 + 3 8 k (1) 1 k 2 2 Var(r(x)jY = 1;x2S 1 ) = 1 4 ( (1) 1 ) T ( 1 2 I p + 3 16 (1) 1 ( (1) 1 ) T ) 2 (1) 1 = k (1) 1 k 2 2 (1 + 3 8 k (1) 1 k 2 2 ) 2 ; respectively, where we have used the assumption thatk (2) 0 k 2 2 =k (1) 1 k 2 2 . Thus, it can be calculated that P ((x) = 0jY = 1;x2S 1 ) = k (1) 1 k 2 : 36 Similarly, it can be derived that P ((x) = 0jY = 1;x2S 2 ) = k (2) 1 k 2 = 1=2; P ((x) = 1jY = 0;x2S 1 ) = k (1) 0 k 2 = 1=2; P ((x) = 1jY = 0;x2S 2 ) = k (2) 0 k 2 : This proves the first result of the theorem. Combining the above result with (2.6) and (2.7) we arrive at R( B ) =P ((x)6=Y ) = 1 4 + 1 2 k (2) 0 k 2 : When the population parameters are estimated from sample, as long as the sample size satisfying thatp=n! 0, then using similar arguments as in [52], we can prove that conditional on training data, the classification error rate of the plug-in classifier F (x) satisfies that with asymptotic probability 1, R n ( F )!R( B ): This completes the proof of the theorem. 
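The risk expression derived above can also be checked by simulation. The following is a minimal Monte Carlo sketch of the simplified two-subspace model, fitting LDA while ignoring the latent subspace labels and comparing its test error with 1/4 + (1/2)Phi(-||mu_1^(1)||_2); the dimension, mean vectors, and sample sizes are illustrative assumptions, and scikit-learn's LDA plays the role of the plug-in classifier delta_F.

```python
import numpy as np
from scipy.stats import norm
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
p = 10
mu11 = 1.5 * np.ones(p) / np.sqrt(p)    # mu_1^(1); mu_0^(1) = 0
mu02 = 1.5 * np.ones(p) / np.sqrt(p)    # mu_0^(2); mu_1^(2) = 0 and equal norms

def sample(n):
    y = rng.integers(0, 2, size=n)                  # class label
    s = rng.integers(0, 2, size=n)                  # latent subspace label
    coef = rng.standard_normal((n, p))              # N(0, I_p) within-subspace noise
    coef += np.where(((y == 1) & (s == 0))[:, None], mu11, 0.0)
    coef += np.where(((y == 0) & (s == 1))[:, None], mu02, 0.0)
    x = np.zeros((n, 2 * p))
    x[s == 0, :p] = coef[s == 0]                    # S_1 spans the first p coordinates
    x[s == 1, p:] = coef[s == 1]                    # S_2 spans the last p coordinates
    return x, y

x_tr, y_tr = sample(100000)
x_te, y_te = sample(200000)
lda = LinearDiscriminantAnalysis().fit(x_tr, y_tr)  # ignores the subspace structure
print("empirical error :", np.mean(lda.predict(x_te) != y_te))
print("theoretical risk:", 0.25 + 0.5 * norm.cdf(-np.linalg.norm(mu11)))
```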
37 2.6.3 Some heuristic arguments on how DNN learns in sub- space classification In this section we intend to provide some heuristic statistical arguments on how DNN learns in our simple subspace classification model introduced earlier. In such a model, there arek subspacesS j ’s of the ambient feature spaceR D with 1jk and within each subspace there are two classes that are labeled as 0 and 1. For simplicity, we assume that the k subspaces S j ’s are generated independently and that each subspaceS j has dimensionalityp =p j and is uniformly distributed on the Grassmann manifoldG D;p , which is comprised of allp-dimensional subspaces ofR D . See, e.g., [59] for some brief background on the geometry and invariant measure of Grassmann manifold. To simplify the technical exposition, we further assume that for each subspace S j , the data distributions corresponding to the two classes 0 and 1 areN (0;I p j ) andN ( j ;I p j ), respectively and that each class on each subspace has sample size n 0 . Thus the total sample size is n = 2kn 0 . Moreover assume that the k vectors j ’s are also statistically independent of each other. Denote by H j the (p j 1)-dimensional affine hyperplane passing through point 2 1 j and with normal vector j . The half space of subspace S j containing point j is referred to as the positive half space for brevity. Then an ideal classification procedure is assigning the true subspace cluster label to each data point and classifying it as class 1 or 0 according to whether or not the data point lies in the positive half space of the corresponding subspace. Such an ideal classification procedure is an optimal 38 classifier since it knows the true subspace cluster label and the underlying classifier given each subspace is the Bayes classifier. We next gain some statistical insights into how DNN tries to mimic closely the above ideal classification procedure without recovering the latent subspace cluster label. Although the k subspaces S j ’s lie in the same ambient Euclidean space R D , it is more useful to view them as points on the Grassmann manifolds which are compact Riemannian homogeneous spaces. To ease the technical presentation of the underlying geometry, we further assume that all the k subspaces S j ’s have the same dimensionality p. Thus (S j ) 1jk is simply an independent and identically distributed (i.i.d.) sample from the uniform (Haar) distribution on the Grassmann manifold G D;p . By the classical manifold embedding theory from geometry and topology, the Grassmann manifoldG D;p which is a curved space of dimensionp(D p) can be embedded into a higher-dimensional Euclidean space where the distance on the Grassmann manifold is preserved. Here in a similar spirit, DNN tries to embed the points on each subspaceS j in a high-dimensional latent Euclidean space that is different than the original ambient Euclidean space R D . Let us consider DNN with multiple hidden layers and the sigmoid layer at the top for two-class classification. As is common in the more recent deep learning literature, we consider the rectified linear unit (ReLU) activation function for all the neurons in the hidden layers. For a given ReLU neuronl, assume that the inputs are z i with 1 i m l , the bias is w 0l , and the network weights are w il with 1 39 i m l . 
Denote by z l = (1;z 1 ; ;z m l ) T and w l = (w 0l ;w 1l ; ;w m l l ) T the input vector and the weight vector, respectively, associated with this particular ReLU neuron, where the bias is regarded as the weight for the constant input 1. Then the output of this neuron is given by h l (z l ; w l ) = (z T l w l ) + , where a + = max(0;a) for each scalar a. In view of such a representation, each ReLU neuron in the first (bottom) hidden layer projects the input vector x = (x 1 ; ;x D ) T 2R D onto the lineft(w 1l ; ;w m l l ) T :t2Rg with a given center and scale, and zeros out all the negative values of this new coordinate, that is, collapsing the negative half of the real line to a single point zero. Assume that there are at least kp ReLU neurons in the bottom hidden layer. It is easy to see that for each 1 j k, there exist a set of p ReLU neurons with some suitable bias and weight parameters such that the resulting p ReLU coordinates as given above provide a one-to-one mapping from the positive half space associated with subspace S j to R p + and shrink the corresponding negative half space toward the origin (by comparing each data point in the positive half space and its mirror image with respect to H j ), where R + stands for the positive half of the real line. To simplify the arguments assume that the total number of ReLU neurons in the bottom hidden layer is exactly kp; otherwise we assign zero values to all the parameters of those remaining ReLU neurons for the moment. By our construction, the network parameters for the p ReLU neurons associated with subspace S j depend only on S j and j , and thus are independent of all other S l ’s and l ’s with 1 l6= j k. To mimic the random initialization of network 40 parameters in the training of DNN, let us further add some small independent Gaussian noises to the network parameters for the bottom hidden layer. Applying the classical deviation bounds for light-tailed distributions with conditioning we can show that as the growth rates of k, p, and D relative to sample size n are properly controlled, the above choice of random network weights maps each subspaceS j into a higher-dimensional spaceR kp + and with significant probability, such a mapping is a small perturbation of the aforementioned mapping from subspace S j toR p + ; see, e.g., the technical arguments in [22; 26; 27; 59]. Observe that in our above probabilistic and geometric construction, all the data pointsinthepositivehalfspaceassociatedwitheachsubspaceS j tendtoconcentrate on a relatively large compact set in R kp + nf0g under the latent ReLU coordinates, whereas all the remaining data points tend to concentrate around the origin due to the shrinkage effect mentioned before. Such a geometric representation of the two classes in a higher-dimensional latent Euclidean space leads to the fact that the training loss of the DNN can approximate closely that of the above ideal classifica- tion procedure as the sample sizen increases. Therefore, the resulting classification accuracy of the DNN can mimic closely that of the ideal classification procedure. Meanwhile, it is important to notice that by its nature, the latent representation given by the ReLU neurons in the hidden layers generally does not lead to successful recovery of the latent subspace cluster labels. 
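As a toy illustration of the ReLU coordinate described above, the few lines below show how a single ReLU neuron projects inputs onto the direction of its weight vector (with the bias acting as the weight of the constant input 1) and collapses the negative half of that line to zero; the numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal((8, 5))      # a few input vectors in R^5
w = rng.standard_normal(5)           # weights of one ReLU neuron
b = 0.1                              # bias, i.e., the weight of the constant input 1

z = b + x @ w                        # signed coordinate along the direction w
h = np.maximum(z, 0.0)               # ReLU: the negative half-line collapses to zero
print(np.round(z, 2))
print(np.round(h, 2))
```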
If the width of the neural network is limited or there are multiple classes within each subspace cluster, we need to use the ReLU neurons in additional hidden layers to build up an effective embedding of the subspaces in a high-dimensional latent Euclidean space. A formal, rigorous theory on the above heuristic arguments and statistical insights for DNN in the subspace classification model will be presented in a separate paper.

2.6.4 Figures from real data application

The real data that support the findings of this study are openly available at https://btsd.ethz.ch/shareddata/.

Figure 2.8: Traffic signs of labels 0 to 61
Figure 2.9: Instances of traffic signs in each cluster

2.6.5 Additional large-scale simulation study

Additionally, we simulate $n_k = 8{,}000$ observations from each subspace $S_k$, and set $D = 1{,}000$ and $p = 40$. So we end up with $n = 24{,}000$ observations $(x_i, Y_i)$ in total, where each feature vector $x_i$ has dimensionality $D = 1{,}000$. The patterns and phenomena observed previously stay the same at the large scale. The classification error rate of LDA without clustering is 0.1795, while the classification error rate of DNN (without clustering) is 0.0361, which is comparable to the value 0.0371 for LDA with the knowledge of cluster identities; see Figures 2.10 and 2.11 for details. To examine whether any layer of DNN carries the latent cluster information, again we apply various popularly used clustering methods to the bottom hidden layer (bHidden), the top hidden layer (tHidden), and the combination of the two hidden layers (bHidden + tHidden) in Table 2.10. We also calculate the bias-corrected distance correlation of $Z_{subspace}$ paired with $X_c$, $X_{bottomLayer}$, and $X_{topLayer}$ in Table 2.9. As seen in these tables, the high clustering error rates and the decreasing trend of bias-corrected distance correlations stay the same at the large scale, suggesting that DNN does not really do clustering in any of its layers in the large-scale model setting.

Figure 2.10: Classification error rates of LDA and DNN
Figure 2.11: Classification error rate for LDA regarding subspaces

Table 2.9: Bias-corrected distance correlations (DC) for large-scale simulation study

  Variable 1       Variable 2     DC
  X_c              Z_subspace     0.8944
  X_bottomLayer    Z_subspace     0.2014
  X_topLayer       Z_subspace     0.1501

  X_c: transformed X via clustering; X_bottomLayer: n-vector representing the hidden layer after the input layer; X_topLayer: n-vector representing the hidden layer before the output layer; Z_subspace: the latent subspace label for each observation.

Table 2.10: Clustering error rates on hidden layers for large-scale simulation study

  Method               X_bottomLayer    X_topLayer    X_bottomLayer + X_topLayer
  GreedySC             0.479288         0.555154      0.540575
  RSCT                 0.498617         0.495413      0.497163
  K-means              0.502529         0.513588      0.509113
  SVD + K-means        0.510063         0.513175      0.506933
  Hierarchical         0.666613         0.666617      0.666625
  SVD + Hierarchical   0.666613         0.666617      0.666613

Chapter 3
Flexible Causal Inference

3.1 Introduction

Causal inference plays a crucial role in economics, social sciences, business applications, and medical science. One important problem is estimating the treatment effect, which has been studied widely with different causal inference methods and based on different model specifications. The treatment can refer to a strategy for behavioral intervention, a web page with a new format, or a medical treatment for some specific disease. Most of the existing studies in the causal inference literature have focused on the estimation of the average treatment effect (ATE) [37; 30; 32; 42; 6; 41].
This problem of treatment effect estimation and inference is extremely important since decision makers may discover certain strategies that can effectively motivate a positive social behavior to spur economic development, a new design of an e-commerce web page to attract more potential customers, or a single medical treatment to facilitate the prevention or cure of some disease.

"Deep learning: with massive amounts of computational power, machines can now recognize objects and translate speech in real time. Artificial intelligence is finally getting smart." by Robert D. Hof [39].

As one of the most popular machine learning methods, deep learning has increasingly improved the state of the art in various areas such as image classification, speech recognition, text analysis, healthcare, and audio recognition [55; 3; 62; 17; 51; 18]. Deep learning can build advanced and complex architectures, and has recently been implemented successfully on dramatically large feature spaces. In particular, various nonparametric models defined by deep nets have been successfully developed and applied to a wide range of big data applications from different disciplines.

Recently, researchers have not been limited to using deep learning as a black-box technical tool in a variety of applications, but in fact have been studying its theoretical explanations, such as probabilistic interpretation, information theory, approximation theory, and statistical insights [68; 64; 69; 74; 16; 7; 90; 58]. Yet deep learning methods are much less understood in the challenging setting of causal inference. While the amount of data is growing incredibly every day, many important challenges arise in causal inference. How can we design methods to take advantage of the large number of observations in the data? How can we utilize the high-dimensional feature space to estimate the causal effect precisely? Researchers have paid growing attention to the connections between machine learning and causal inference in recent years; see, e.g., [31; 5; 6; 66; 14]. Nevertheless, we know very little about how we can take advantage of modern deep learning methods to help infer the causal effect. The interface of deep learning and causal inference is still largely unexplored.

To empower large-scale causal inference with modern machine learning, in this paper we suggest a novel unified framework of adaptive causal inference via the fully nonparametric approach of deep learning and showcase its effectiveness on the causal estimates for both observational data and a randomized experiment. We also provide discussions on the mechanisms underlying the effectiveness of our approach and offer some justifications for our discoveries. This approach can be valuable in any context where big observational data or a big experiment is available. We focus on a specific context, motivating blood donations via shortage mobile messages [82; 83], but our novel framework can be extended to other applications in non-profit and for-profit industries.

3.2 Research method

3.2.1 Data construction

The database of the blood bank contains every donation record from the year 2005 to 2010. Instead of estimating the causal effect through several indicators of historical donations over time periods via regressions, in this paper we flatten the data structure so that each unique donor has a large number of feature variables; a visualization is presented in Figure 3.1.

Figure 3.1: Data construction
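As an illustration of this flattening step, the following minimal pandas sketch turns long-format donation records into one wide feature vector per donor. The column names and values here are hypothetical placeholders; the actual variables used in our study are described in Sections 3.3.1 and 3.3.2.

    import pandas as pd

    # Hypothetical long-format donation records: one row per donor per half year.
    records = pd.DataFrame({
        "donor_id":      [1, 1, 2],
        "half_year":     ["2009H1", "2009H2", "2009H1"],
        "donate_volume": [400, 0, 200],
        "weight":        [70, 71, 55],
    })
    demographics = pd.DataFrame({
        "donor_id":  [1, 2],
        "gender":    ["M", "F"],
        "age_range": ["26-30", "22-25"],
    })

    # Flatten: one row per donor, one column per (variable, half-year) pair.
    wide = records.pivot_table(index="donor_id", columns="half_year",
                               values=["donate_volume", "weight"], aggfunc="first")
    wide.columns = [f"{var}_{hy}" for var, hy in wide.columns]
    wide = wide.reset_index()

    # Attach demographics, fill missing half years, and one-hot encode categorical features.
    flat = demographics.merge(wide, on="donor_id", how="left").fillna(0)
    flat = pd.get_dummies(flat, columns=["gender", "age_range"])
    print(flat.head())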
The detailed description of the data and its construction will be given separately for the observational and experimental data in the next section.

3.2.2 Framework of flexible large-scale causal inference

A well-known framework for causal inference was developed by Neyman and Rubin [76]. In their causal model, there are two potential outcomes, $Y_i(1)$ and $Y_i(0)$, representing the outcomes with and without treatment, respectively. Their values in our research lie in the set $\{0, 1\}$, where 1 means that data point $i$ has the desired effect (the donor eventually donated in the target year) and 0 means that it does not (the donor did not donate in the target year). The treatment indicator is $T_i \in \{0, 1\}$, where 0 means that data point $i$ belongs to the control group and 1 means that it is in the treatment group. A widely used metric is the average treatment effect (ATE), which is defined as $E[Y_i(1) - Y_i(0)]$. If the treatment effect $Y_i(T_i = 1) - Y_i(T_i = 0)$ is constant across units $i$ and the outcome in the control group has a linear relationship with the covariates $X_i$, then the relationship of the triple $(X_i, T_i, Y_i)$ for each unit $i$ can be represented by $Y_i = \alpha + \tau T_i + X_i^T \beta + \varepsilon$, where $\alpha$, $\tau$, and $\beta$ are model parameters and $\varepsilon$ represents the model error term. We make the following two key assumptions.

Figure 3.2: Deep Neural Network (DNN)

Assumption 1 (unconfoundedness). $T_i \perp\!\!\!\perp (Y_i(0), Y_i(1)) \mid X_i$.

Assumption 2 (overlap). $0 < P[T_i = 1 \mid X_i] < 1$.

Under the above two assumptions, we can estimate the ATE by the sample counterpart $\hat{E}[Y_i \mid T_i = 1] - \hat{E}[Y_i \mid T_i = 0]$, where $\hat{E}$ represents the sample mean. In this paper, we estimate the treatment effect (of mobile messaging on a donor's decision) from a new perspective through the fully nonparametric approach of deep learning. We assemble a large number of features for all donors in our sample by combining their detailed demographics (such as age, gender, education, occupation, marriage status, and resident status) with their complete donation history (all donation records across the past 10 years). In addition, we perform a basic one-hot encoding of categorical features. The individuals belong to two independent groups, the treated and control groups, indicating whether or not they received the shortage message. The group indicator is another binary feature, treated, with value 0 or 1. We estimate the treatment effect on the target outcome donate (labeled as 0 or 1) through classification by deep learning and some other important machine learning approaches including logistic regression, Lasso logistic regression [86], decision tree, and random forest. The treatment effect in our framework is estimated as the difference of the average predicted probabilities.

3.2.3 Training and testing

The structure of the deep neural network (DNN) is shown in Figure 3.2. It has an input layer, an output layer, and hidden layers between them. The edges represent the weights connecting two neurons. Each neuron in the current layer is a weighted sum of all the neurons in the previous layer transformed by an activation function, so a DNN processes information in a way loosely analogous to how human brain neurons do. The two popularly used activation functions are ReLU in the hidden layers and sigmoid or softmax in the output layer. To overcome the issue of overfitting, we can use techniques including $l_1$ and $l_2$ regularization [65], dropout [81], and batch normalization [43].
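As a concrete reference point, the following minimal sketch assembles such an architecture with the tensorflow.keras API. The layer sizes, regularization strengths, dropout rate, and training settings shown here are illustrative placeholders rather than the tuned values used in our experiments, and X_train and y_train stand for the flattened donor features and the donate outcome.

    import tensorflow as tf
    from tensorflow.keras import layers, regularizers

    n_features = 161  # e.g., the matched observational data of Section 3.3.2

    model = tf.keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4)),
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),   # predicted probability of donate = 1
    ])

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy",    # cross-entropy loss for the binary outcome
                  metrics=["accuracy"])

    # model.fit(X_train, y_train, epochs=50, batch_size=128, validation_split=0.1)

The grid search and cross-validation protocol described below would be wrapped around this model-building step.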
We can also pick the desired optimizer and the associated learning rate, number of epochs, batch size, number of hidden layers, number of neurons in each layer, preprocessing methods, and so on. Also, [12; 75; 13; 48; 36; 78] provide more discussions of the structure, hyperparameters, and implementations of DNN in specific domains.

The data with the extended features are randomly split 100 times into training and test data sets. Each training data set contains half of the individuals belonging to the treated group and half of the individuals belonging to the control group, and the corresponding test data set contains the remaining halves. We train our classification models separately on the 100 training data sets and then test their prediction performance on the corresponding 100 test data sets. The treatment effect of mobile messaging is estimated as the difference of the average value of the predicted probabilities for donors in the treated and control groups.

This problem draws attention to imbalanced classes: there is a worldwide shortage of blood supply due to low participation in blood donation, i.e., the proportion of zero-value outcomes donate in our model is extremely high. To overcome the imbalance, we have used a variety of data preprocessing approaches, including resampling techniques (random subsampling of the majority class (donate = 0), random oversampling of the minority class (donate = 1), the "SMOTE" method to generate synthetic data, and adding sample weights based on the prior probabilities of the original data) and scaling methods (min-max, unit-scale, zero-center, and normalization).

We explore the tuning process of deep learning by grid search over an eight-dimensional hyperparameter space, selecting the desired scaling and resampling methods illustrated above. In order to pick the desired values of the hyperparameters (see Section 3.6 for detailed descriptions), we utilize 10-fold cross-validation, which is shown in Figure 3.3. In each of the 100 training and test splits, we randomly select 10% of the training data to validate the predictive performance of a group of hyperparameters, repeatedly train our model on the other 90% of the data and record the prediction accuracy on the held-out 10%, and then average the accuracy over the 10 repetitions. Finally, we pick the group of hyperparameters that leads to the highest average accuracy, train this model on the whole training data set, and record the accuracy on the test data. The overall training and test accuracies are the average values over the 100 random splits.

Figure 3.3: 10-fold cross validation

3.3 Main findings

3.3.1 Observational study without matching

Table 3.1: Numerical results of observational data without matching

  Method                      Training Accuracy   Test Accuracy     Training Causal Effect   Test Causal Effect
  Logistic Regression         0.9814 (0.0011)     0.9747 (0.0011)   0.0692 (0.0076)          0.0697 (0.0070)
  Lasso Logistic Regression   0.9795 (0.0016)     0.9757 (0.0012)   0.0663 (0.0079)          0.0671 (0.0069)
  Decision Tree               0.9821 (0.0013)     0.9763 (0.0010)   0.0316 (0.0160)          0.0319 (0.0125)
  Random Forest               0.9886 (0.0013)     0.9762 (0.0015)   0.0286 (0.0051)          0.0230 (0.0031)
  Deep Learning               0.9699 (0.0012)     0.9697 (0.0012)   0.0170 (0.0093)          0.0170 (0.0094)

Data description

We have 24,606 observational individuals, among which 1,749 donors are in the treated group who received the shortage mobile messages and the other 22,857 donors are in the control group who did not receive the messages.
We construct the data in an innovative way by flattening the stacked features into a high-dimensional space that consists of the demographics and the historical records in each calendar half year. See Section 3.3.2 for a detailed description of the features; the unequal total numbers of features across data sets are mainly due to the different numbers of races that appear in the data. As stated before, the data are randomly split 100 times to obtain training and test data sets for training models and making predictions. We use the proportion of zero-label outcomes as the baseline; the average values for the training and test data sets are 0.9699 and 0.9697, respectively.

The features include donor_id, age_range (encoded into 5 binary variables), gender, marriage_status, education_status (bachelor or not), id_dummy (encoded into 4 binary variables), test_dummy (indicator of passing a battery of blood tests for HIV, hepatitis, syphilis, and other diseases), resident_types, race, occupation, blood_type, half-year historical records that consist of weight, diastolic blood pressure (DBP), systolic blood pressure (SBP), pulse, donate_volume, and plasma (blood plasma components or whole blood), and the group indicator treated. The outcome is donate with values 0 or 1 representing whether the donor donates in the target half year.

Numerical results

The average prediction accuracies and treatment effects of all models are shown in Table 3.1, where the standard deviation (SD) is given inside the parentheses. We observe that deep learning effectively prevents overfitting, and the estimated causal effect of 1.7% by deep learning in the high-dimensional feature space without matching is consistent with the analysis in the previous work of [83], which used the tools of matching and regression.

3.3.2 Observational study with matching

Table 3.2: Numerical results of observational data after matching

  Method                      Training Accuracy   Test Accuracy     Training Causal Effect   Test Causal Effect
  Logistic Regression         0.9112 (0.0056)     0.8986 (0.0061)   0.0028 (0.0098)          0.00003 (0.0116)
  Lasso Logistic Regression   0.9057 (0.0051)     0.9069 (0.0051)   0.0335 (0.0)             0.0294 (0.0)
  Decision Tree               0.9076 (0.0052)     0.9049 (0.0056)   0.0012 (0.0036)          0.0011 (0.0034)
  Random Forest               0.9069 (0.0054)     0.9066 (0.0054)   0.0017 (0.0025)          0.0006 (0.0025)
  Deep Learning               0.9072 (0.0056)     0.9038 (0.0057)   0.0102 (0.0073)          0.0095 (0.0075)

Matching strategy

Among the observational individuals other than the treated ones, we select the subsample of donors who have blood type AB or O (since the blood bank sends messages to donors with blood type A or B as the treated group) and who have donated at least once before the target half year; eventually 95,183 donors are potentially selected for the control group. Next we perform one-to-one matching for each treated donor among these 95,183 individuals, such that their demographics and past donation behavior are the same on 16 dimensions: gender, education (bachelor degree or not), marriage status (married or not), age range (18-21, 22-25, 26-30, 31-40, and 40+), id_dummy (in the city where we collected the long-term data, in other cities in the same province, in other provinces, or unknown), and the 11 half-year donation indicators before the target half year. In the historical record, a potential donor faced three choices, not donating, donating alone, and donating along with a group, and we use the eleven half-year indicators to store this information (0: did not donate; 1: donated individually; 2: donated in a group).
Our one-to-one matching ensures that the Euclidean norm of the distance between each treated donor and the corresponding matched control is zero. More specifically, we perform the matching algorithm in two steps. First, for individual $i$ in a calendar half year $t$, we define a variable $d_{it}$ equal to 0 if he/she did not donate, 1 if he/she made an individual donation at $t$, and 2 if he/she made a group donation. Then for an individual $j$ in the pool of potential controls, we define the Euclidean distance between $i$ and $j$ as $ED_{ij} = (\sum_t (d_{it} - d_{jt})^2)^{1/2}$. To be conservative, we focus on the pool of controls with zero distance to the treated ($ED_{ij} = 0$). In the second step, among those with zero distance, we search for donors that share the same gender, education, marriage status, age range, and location indicator. Eventually, 1,494 donors in the treated group have matched control donors; 142 of these treated individuals donated their blood after they received the shortage mobile message, and 138 of the 1,494 matched control individuals donated in the target half year.

Data construction

After the matching and the introduction of a binary variable treated serving as the group indicator, we have 161 features in total, including donor_id, age_range (encoded into 5 binary variables), gender, marriage_status, education_status (bachelor or not), id_dummy (encoded into 4 binary variables), test_dummy (indicator of passing a battery of blood tests for HIV, hepatitis, syphilis, and other diseases), resident_types, race, occupation, blood_type, and half-year historical records that consist of weight, diastolic blood pressure (DBP), systolic blood pressure (SBP), pulse, donate_volume, and plasma (blood plasma components or whole blood). The outcome is still donate with values 0 or 1 representing whether the donor donates in the target half year. Again, we randomly split the data 100 times into training and test data sets; in each split the training and test data sets have the same number (747) of individuals in the treated and control groups, and the average baselines are 0.9060 and 0.9066 for the training and test data sets, respectively.

Figure 3.4 presents the 3D visualization and the correlation matrix for the matched observational data. The red points have outcome donate equal to 1, and the blue points have outcome equal to 0. We can see from the figure that the data are highly imbalanced and scattered. Figure 3.5 shows the correlations between each feature and the outcome, where the last column of the triangular matrix represents the values for the outcome donate, and the color bar on the right refers to the range of correlation values between 0 and 1.

Figure 3.4: Data visualization of observational data
Figure 3.5: Correlation matrix of observational data

Numerical results

We further examine the performance of deep learning in estimating the causal effect using the observational data with matching. Following the previous study [83], we perform the matching algorithm in two steps. First, we select the control units with zero distance on the donation indicators $d_{it}$ in each half year, where $i$ indexes the potential donor, $t$ indexes time in half years, and $d_{it}$ takes values in $\{0, 1, 2\}$ for not donating, donating individually, and donating in a group, respectively. In the second step, we further search for donors that share the same gender, education, marriage status, age range, and location indicator. Finally, 1,494 donors in the treated group have matched control donors.
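A minimal sketch of this two-step matching is given below. It assumes hypothetical pandas data frames treated and controls whose columns d1 through d11 hold the half-year donation indicators and whose remaining columns hold the five exact-match demographic variables; these names are placeholders, not the actual database fields.

    import pandas as pd

    hist_cols = [f"d{t}" for t in range(1, 12)]                       # 11 half-year indicators in {0, 1, 2}
    demo_cols = ["gender", "education", "marriage", "age_range", "id_dummy"]

    def match_controls(treated: pd.DataFrame, controls: pd.DataFrame) -> pd.DataFrame:
        """One-to-one matching: zero Euclidean distance on the donation history
        (equivalent to identical d_it vectors) plus identical demographics."""
        pairs, used = [], set()
        for _, row in treated.iterrows():
            # Step 1: zero distance on the half-year donation indicators.
            cand = controls[(controls[hist_cols] == row[hist_cols].values).all(axis=1)]
            # Step 2: exact match on the demographic dimensions.
            cand = cand[(cand[demo_cols] == row[demo_cols].values).all(axis=1)]
            cand = cand[~cand["donor_id"].isin(used)]
            if len(cand) > 0:
                match = cand.iloc[0]
                used.add(match["donor_id"])
                pairs.append((row["donor_id"], match["donor_id"]))
        return pd.DataFrame(pairs, columns=["treated_id", "control_id"])

    # Usage (with hypothetical data frames loaded elsewhere):
    # pairs = match_controls(treated_df, controls_df)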
From Table 3.2, we can see that deep learning leads to the causal effect estimate closest to the previous analysis result of 1.7% in [83] compared with the other machine learning methods, while the estimated value is lower than the one obtained by deep learning without matching.

Figure 3.6: Tuning results for deep learning (matched observational data)
Figure 3.7: Accuracies trend for deep learning (matched observational data)

3.3.3 Randomized experiment

Experiment description

We designed a randomized experiment in which a sample of 24,989 past donors was randomly assigned to the treated and control groups. A total of 10,994 donors in the treated group received the shortage message, and 13,995 donors were not message recipients. Each individual has 244 expanded features including donor id, gender, marriage status, blood type, id dummy, resident category, race, test dummy, education, occupation, 19 half-year histories (donation indicator, donation age, weight, pulse, SBP, DBP, donate volume, plasma), treated, and the outcome donate. Again, we randomly split the data 100 times into training and test data sets; in each split, the training data set has 12,494 donors, among which 5,497 are in the treated group and 6,997 are in the control group, and the remaining half is stored in the test data set.

Figure 3.8 presents the 3D data visualization and the correlation matrix of each feature and the outcome for the experimental data. Similar to the matched observational data, the sample points are extremely imbalanced and scattered, and the correlation values are relatively small and approach zero.

Figure 3.8: Data visualization of experimental data

Numerical results

The predictive results of the various classification algorithms for the experimental data are listed in Table 3.3. We notice that the estimated causal effects using lasso logistic regression and decision tree are around zero. The values are relatively higher when using logistic regression and random forest. However, the estimate using deep learning achieves the 0.3% that was calculated in [82].

Table 3.3: Numerical results of experimental data

  Method                      Training Accuracy   Test Accuracy     Training Causal Effect   Test Causal Effect
  Logistic Regression         0.9842 (0.0585)     0.9792 (0.0584)   0.0010 (0.0188)          0.0014 (0.01817)
  Lasso Logistic Regression   0.9917 (0.0006)     0.9917 (0.0006)   0.0 (0.0)                0.0 (0.0)
  Decision Tree               0.9917 (0.0006)     0.9917 (0.0007)   0.000005 (0.00005)       -0.000003 (0.00003)
  Random Forest               0.9938 (0.0006)     0.9917 (0.0006)   0.0011 (0.0006)          0.0005 (0.0005)
  Deep Learning               0.9908 (0.0018)     0.9903 (0.0021)   0.0030 (0.0010)          0.0030 (0.0010)

3.4 Mechanism

Recall that if the treatment effect $Y_i(T_i = 1) - Y_i(T_i = 0)$ is constant across units $i$ and the outcome in the control group has a linear relation with the covariates $X_i$, then the relation of the triple $(X_i, T_i, Y_i)$ for each unit $i$ can be represented by $Y_i = \alpha + \tau T_i + X_i^T \beta + \varepsilon$ under the unconfoundedness and overlap assumptions. Also, we can estimate the average treatment effect (ATE) by the sample estimate $\hat{E}[Y_i \mid T_i = 1] - \hat{E}[Y_i \mid T_i = 0]$.

In this section, we interpret how deep learning is able to achieve the above numerical results for the observational and experimental data from three perspectives: variations of features, the observed confounding factor, and the unobserved confounding factor. In addition, we discuss which kinds of data suit the traditional methods and which suit our new framework.

3.4.1 Variations of features

For the observational data, the number of units in the control group is much larger than the number in the treated group.
If we applied one-to-one matching to each treated unit, we would discard many control units. Nevertheless, these units are not redundant; instead, the control units contain a great deal of useful information that helps capture the relation between the covariates, the treatment indicator, and the outcome. Traditionally, for example, if we use the difference-in-differences (DID) regression model [50; 1; 8]

$Y_i = \alpha + \beta_1 t_i + \beta_2 T_i + \beta_3 t_i T_i + X_i^T \gamma + \varepsilon_i$,

the estimated treatment effect is the coefficient of the interaction term between time $t_i$ and the active treatment $T_i$ (here a dummy variable for receiving the shortage mobile message). Among the machine learning approaches we have applied, logistic regression and lasso logistic regression rely on the relation

$\log\!\left( \frac{P(Y_i = 1)}{1 - P(Y_i = 1)} \right) = \alpha + \tau T_i + X_i^T \beta + \varepsilon_i$,

where $P(Y_i = 1)$ represents the probability that person $i$ eventually donated after receiving the shortage message. Deep learning, as well as decision tree and random forest, makes predictions based on a nonlinear model; to simplify the notation,

$P(Y_i = 1) = F(X_i, T_i) + \varepsilon_i$,

where the function $F(\cdot)$ is highly nonlinear. Our framework utilizes the classification models to estimate the ATE by

$\frac{1}{n_1} \sum_{i=1}^n \hat{P}(Y_i = 1)\,\mathbb{1}(T_i = 1) - \frac{1}{n_0} \sum_{i=1}^n \hat{P}(Y_i = 1)\,\mathbb{1}(T_i = 0)$,

where $\mathbb{1}(\cdot)$ is the indicator function and $n_1$ and $n_0$ are the numbers of treated and control units in the sample of size $n$.

What if we apply matching so that the units differ in the treatment while remaining the same on the major characteristics? Then the covariates $X_i$ for $i = 1, 2, \ldots, n$ would concentrate in a small range; in other words, the model would effectively be dominated by the error term $\varepsilon_i$. From one perspective, the R-squared measure is defined by

$R^2 = 1 - \frac{\sum_i \varepsilon_i^2}{\sum_i (Y_i - \bar{Y})^2}$.

It can be seen as the fraction of variance explained. Thus if the errors $\varepsilon_i$ are relatively large, then the residual sum of squares will be large and R-squared will be quite low. So the proportion of variance explained by the model is low, and the true underlying relation for the triple $(X_i, T_i, Y_i)$ is not well captured. From the other perspective, the noisy data consist of a large amount of meaningless information that leads to skewed conclusions. Additionally, matching also decreases the sample size, which leads to underfitting of deep learning, since deep learning usually works efficiently in complex cases. Therefore, matching on the fixed dimensions reduces the variations of features and the sample size, while these variations are important for deep learning (as well as other machine learning methods) to capture the true relation between the combined covariates $(X, T)$ and the outcome $Y$.

3.4.2 Observed confounding factor

In the model $Y_i = \alpha + \tau T_i + X_i^T \beta + \varepsilon$, if the confounding factors are observed, we may use the matching strategy to pair up treated and control units following some metric (e.g., [83] used the Euclidean distance) so that the two units in each pair differ only in the treatment assignment while sharing exactly the same characteristics on all other dimensions. Matching can reduce the overt bias for the traditional methods, such as difference-in-differences (DID): we only need to regress on time, intervention, and the interaction term between time and intervention, and the coefficient of the last term gives the estimated treatment effect. However, our empirical results reveal that deep learning on the observational data without matching achieves comparable performance. This can be interpreted using statistical insights into deep learning similar to those in the recent work of [90].
In that paper, the high-dimensional data points have a latent subspace structure, and it was proved in theory that it is easier to classify the data within each cluster, while it is difficult to conduct the classification if the cluster labels are unknown. Surprisingly, the classification performance of deep learning without clustering is extremely close to that of the two-step procedure in which one first learns the unsupervised subspace/cluster labels and then learns the supervised class labels. Some heuristic arguments were made therein that, with the popularly used activation function and suitably chosen weights, deep learning is able to internally project data points from different hidden subspaces into a high-dimensional Euclidean space. It turns out that deep learning without clustering has almost the same classification performance as the other machine learning methods coupled with clustering. The matching and clustering strategies both group observed units with similar characteristics together. The metric we use to match the treated and control pairs can also be regarded as the distance we would use to build the similarity graph for clustering. Following the theoretical results and heuristic arguments in [90], deep learning may make projections into a higher-dimensional space that internally replicates the matching. There is no need to specify the dimensions of the features or which features we should use to match the units. Instead, deep learning can grasp the information of the latent pairs from the high-dimensional feature space automatically.

3.4.3 Unobserved confounding factor

Then what about an unobserved confounding factor? In a linear regression model

$Y_i = \alpha + \tau T_i + X_i^T \beta + \varepsilon_i$,

there might be endogeneity when the regressors are correlated with $\varepsilon_i$. For example, suppose $Y_i$ and $T_i$ represent income and college education, and $X_i$ are other controls; the treatment factor may be correlated with the error term $\varepsilon_i$, which captures the parents' education not covered by the other predictors $X_i$. In terms of our application in the priceless market of blood donation, the behavioral intervention of the shortage message might be correlated with the donor's health record in the year before the target half year: if the donor had a severe disease during that year, then the donor should not be a potential message recipient. It has been shown in [54] that deep learning with two hidden layers has a great prediction performance for binary classification in heterogeneous data sets associated with complex diseases. The heterogeneity in the diagnosis of complex diseases can be related to the correlated error term $\varepsilon_i$ in our deep learning model.

3.5 Contribution

In our research study, we propose an important yet underexplored approach, deep learning, to analyze the causal effect of a behavioral intervention (i.e., mobile messaging) in motivating a prosocial activity (i.e., blood donations). We bridge two streams of causal effect estimation: fitted coefficients of OLS in reduced-form analyses of several outcomes, and predicted probabilities from deep learning as one of the most powerful recent high-dimensional methodologies. This research study includes the rule to construct the features, the strategies to handle imbalanced data, the detailed tuning process of deep learning, the training and testing process via cross-validation (CV), data visualizations, and diagrams and algorithms for deep learning. We build some important connections between the powerful prediction method of deep learning and flexible large-scale causal inference.
Based on the empirical results, deep learning is able to replicate the causal effect for the unmatched observational, matched observational, and randomized experimental studies within our unified framework. There is no need to decide how many or which features to use to match the treated and control pairs, or to resort to different approaches across different studies. Deep learning can automatically capture the observed and unobserved confounding factors in certain scenarios. We have provided the mechanisms and some insights behind our empirical findings, and our work contributes to the development of large-scale causal inference with modern machine learning. Our framework can be applied to a wide range of business applications in both the non-profit and for-profit sectors. The deep learning approach can be especially helpful for firms when big observational data (with many observations and records) are available.

3.6 Supplementary results and studies

3.6.1 Data description

The data sets used in this paper come from a blood bank in a provincial capital city of China with a population of more than 8 million. The responsibility of the blood bank is to supply blood to 18 hospitals in the city and to maintain the administrative database that includes donation records from January 1, 2005 to August 10, 2013. All identity-related information is carefully removed, and each donor is identified by a unique, scrambled ID that allows us to track the donation behavior over time in our research studies. The records include the donation time, location, amount, quality ("pass" the test or not), and form (voluntarily or in a group), as well as the donor's gender, age, marriage status, education, and occupation. The structure and construction of the data set used in this paper have been illustrated in Section 3.2.1.

This city has encountered more severe blood shortages since the year 2010, partly related to the increasing demand. To address the shortage, the blood bank has conducted a mobile messaging program that sent shortage messages to existing donors to encourage donation of whole blood, dating back to September 3, 2010. The message was specific to blood types A and B, allowing us to track the donors in the control group who did not receive the shortage message but have the same demographics as the message recipients (see Section 3.3.2 for the detailed matching strategy). Additionally, the blood bank conducted a randomized experiment in 2014 with 80,000 participants in a separate study (see Section 3.3.3 for the data structure of the experimental data used in this paper). That program sent a simple reminder message to existing donors that did not mention a shortage. In comparison with the much higher treatment effect we discovered in the observational data, where the message emphasized the shortage, the reminder message in the experimental data leads to only a slight increase in the donation rate, suggesting that announcing a shortage to donors is important to drive growth in the donation rate.

3.6.2 Tuning process of deep learning

The two widely used activation functions are ReLU and sigmoid. They help capture the relations between neurons and can be computed efficiently in gradient propagation. Since we are making binary predictions, the sigmoid function

$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$

produces a curve with an "S" shape and output values in the range of 0 and 1, which is especially beneficial when we estimate the treatment effect by the predicted probabilities.
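To make this last point concrete, the following minimal sketch (on synthetic stand-in data, with a logistic model as a placeholder for any classifier that outputs predicted probabilities, such as the deep net) shows how our framework turns the predicted probabilities into a treatment effect estimate: the difference of the average predicted probabilities between the treated and control groups.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Synthetic stand-in for the flattened donor data: covariates X, treatment T, outcome Y.
    n, p = 2000, 20
    X = rng.normal(size=(n, p))
    T = rng.integers(0, 2, size=n)
    logit = -2.0 + 0.5 * T + X[:, 0] - 0.5 * X[:, 1]
    Y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

    # Fit a classifier on the full feature space (covariates plus treatment indicator).
    features = np.column_stack([X, T])
    clf = LogisticRegression(max_iter=1000).fit(features, Y)

    # Estimate the ATE as the difference of the average predicted probabilities
    # between the treated and control groups.
    p_hat = clf.predict_proba(features)[:, 1]
    ate_hat = p_hat[T == 1].mean() - p_hat[T == 0].mean()
    print(f"estimated ATE: {ate_hat:.4f}")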
The sigmoid function (a special case of the softmax function for multiple classes) is usually employed in learning algorithms to transform a linear input $t = wx + b$, where $w$ and $b$ are the weights and bias, into a nonlinear output. For the hidden layers, we use the ReLU activation function

$\mathrm{ReLU}(x) = \max(0, x)$,

which introduces sparsity and is computationally efficient in gradient propagation. The graphs of these two popular activation functions are shown in Figures 3.9 and 3.10.

Figure 3.9: Activation function - ReLU
Figure 3.10: Activation function - Sigmoid

In deep learning, the most important hyperparameters include number_layers, number_units, batch_size, number_epochs, dropout_prob, learning_rate, l1-regularizer, and l2-regularizer (explained in the next paragraph). They are tuned on the validation data sets, which consist of 10% of the training data sets, while the remaining 90% are used for training the models separately (this procedure is presented in Figure 3.3).

The above eight hyperparameters play different roles in the deep network. number_layers is the number of hidden layers between the input layer and the output layer. number_units is the number of hidden units within a layer. batch_size is the number of subsamples given to the deep network after which a parameter update happens. number_epochs is the number of times the whole training data set is learned; we increase this number until the validation accuracy starts decreasing even though the training accuracy keeps increasing (which indicates overfitting). learning_rate defines how quickly a network updates its parameters; a low rate slows down the learning process but converges smoothly, while a large rate speeds up the learning but may not converge. Dropout, l1-regularization, and l2-regularization are regularization techniques to prevent overfitting and thus improve the generalization power. Each unit is retained with some probability dropout_prob, or temporarily removed otherwise (see Figures 3.11 and 3.12). The l1-regularizer and l2-regularizer penalize the absolute values and the squared values of the weights, which tends to drive some weights to exactly zero or to drive all the weights toward smaller values, respectively; the loss function then becomes

$L_{total}(w; x) = L_{model}(w; x) + L_{reg}(w)$,

where $L_{reg}(w) = \sum_i w_i^2$ (l2-regularization) or $\sum_i |w_i|$ (l1-regularization).

Figure 3.11: Standard DNN
Figure 3.12: DNN after applying dropout

For the deep learning model of binary classification, we use the cross-entropy loss function

$-\frac{1}{n} \sum_i \left[ y_i \log f(x_i) + (1 - y_i) \log(1 - f(x_i)) \right]$,

where $X = \{x_1, x_2, \ldots, x_n\}$ is the set of input observations in the training data set, each $y_i$ is either 0 or 1, and $f(x)$ represents the output activation of the deep net given input $x$.

In addition to the above hyperparameters, we use the Adam optimizer, an effective optimization algorithm (one of the best optimizers at present) derived from adaptive moments (introduced in [46]). It inherits the benefits of both the Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp); the update looks like that of RMSProp except that a smoothed version of the gradient is used instead of the raw stochastic gradient. Adam has the following configuration parameters: $\alpha$, $\beta_1$, $\beta_2$, and $\epsilon$. $\alpha$ is the step size or learning rate; smaller values make learning slower but possibly get closer to the optimum, while larger values of $\alpha$ result in faster learning. $\beta_1$ and $\beta_2$ are the exponential decay rates for the first and second moment estimates, respectively. $\epsilon$ is a very small number (e.g., $10^{-8}$) that avoids division by zero in the computation.
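A minimal numpy sketch of the Adam update just described is given below (the full pseudocode appears in Figure 3.13). The toy objective and the larger step size in the example call are only for illustration; the defaults in the function signature match the recommended settings discussed next.

    import numpy as np

    def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update: exponential moving averages of the gradient and of its
        elementwise square, bias-corrected, then a scaled parameter step."""
        m = beta1 * m + (1 - beta1) * grad          # first moment estimate
        v = beta2 * v + (1 - beta2) * grad ** 2     # second moment estimate
        m_hat = m / (1 - beta1 ** t)                # bias corrections
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

    # Toy example: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta.
    theta = np.array([5.0, -3.0])
    m = v = np.zeros_like(theta)
    for t in range(1, 501):
        theta, m, v = adam_step(theta, theta, m, v, t, alpha=0.1)
    print(theta)  # ends up close to the minimizer at the origin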
The full algorithm is shown in Figure 3.13, where $g_t^2$ denotes the elementwise square and $\beta_1^t$ and $\beta_2^t$ denote $\beta_1$ and $\beta_2$ at time step $t$, respectively. Many popular deep learning libraries, including TensorFlow, follow the default settings of these parameters recommended by [46]: $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.

Figure 3.13: Adam algorithm

In regard to regularization, we also use batch normalization, which normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation, and consequently scales and shifts each layer by two trainable parameters (multiplying by $\gamma$ and adding $\beta$). Figures 3.14 and 3.15 present the graph and the algorithm of this technique; it is claimed to accelerate training by reducing internal covariate shift [43].

Figure 3.14: Batch normalization visualization
Figure 3.15: Batch normalization algorithm

In the numerical results, Figure 3.6 presents the prediction accuracies on the training and validation data sets in our tuning process. It can be noticed that oversampling by "SMOTE" performs better than the other resampling methods. It is also interesting to find in Figure 3.7 that the test accuracy increases toward its baseline as the training accuracy decreases from the overfitting status to the baseline.

3.6.3 Tuning process of other machine learning methods

Additionally, we conduct numerical experiments on logistic regression [40], lasso logistic regression [86], decision tree [72], and random forest [38] for comparison. Logistic regression follows the model

$\log\!\left( \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} \right) = w_0 + wx$.

It fits the probability that a given data point belongs to class 1 as

$P(y = 1 \mid x) = \frac{1}{1 + e^{-(w_0 + wx)}}$,

where $wx = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p$. The graph of the logistic function is the same as that of the sigmoid activation function, with its "S" shape and range in (0, 1). The parameter vector $w$ can be estimated by maximum likelihood estimation (MLE), i.e., by minimizing the negative log-likelihood

$L = -\sum_{i=1}^n \left[ y_i \log P(y_i = 1 \mid x_i) + (1 - y_i) \log(1 - P(y_i = 1 \mid x_i)) \right]$.

Lasso logistic regression involves $\lambda$ as the tuning parameter, and the modified loss function is

$L + \lambda \sum_i |w_i|$.

We implement 10-fold cross-validation and grid search on the training data set to select the desired tuning parameters. For instance, the parameter $\lambda$ in lasso logistic regression controls the strength of the penalty so that it can perform variable selection in the linear model (for further references and software, see [60; 23]). In terms of the decision tree, the parameters max_depth, min_sample_split, min_sample_leaf, and max_feature are tuned; they indicate how deep the tree can be, the minimum number of samples required to split an internal node, the minimum number of samples required at a leaf node, and the number of features to consider when looking for the best split. Regarding the random forest, we select the value of number_tree (the number of trees to build before taking the majority vote or the average of predictions) in addition to the previous parameters of the decision tree model.

Chapter 4
Discussions

The advances of information technologies make big data informatics more prevalent in real applications nowadays. As a popular methodology, deep learning has been applied in a variety of areas and has shown great performance in different tasks. Yet its statistical insights are much less explored. Chapter 2 presents the work where the simulation data are generated from latent subspaces, which is remotely motivated by real applications.
An ideal procedure would be an unsupervised clustering followed by supervised classification, so that one first learns the subspace identities and then classifies within each subspace. In our empirical studies, the conventional machine learning methods achieve a lower classification error rate with this two-step procedure than by classifying directly in one step. We provide theoretical proofs showing that if the subspace identities are known in advance, the classification can achieve the ideal Bayes risk asymptotically, while the latent subspace structure indeed introduces an additional difficulty term, which explains our empirical findings. When we explore the performance of deep learning on this latent subspace classification task, to our surprise, deep learning can closely approximate the performance of the two-step procedure built from conventional machine learning methods. Does deep learning conduct clustering internally to learn the subspace structure? According to a series of simulations, deep learning does not successfully capture the clustering information in any of its hidden layers. Then how does deep learning mimic the two-step procedure? We provide some statistical insights and heuristic arguments, under our probabilistic and geometric settings, that the ReLU neurons project the data to a higher-dimensional space and that the training loss of deep learning approximates that of the two-step procedure, which supports our empirical discoveries. We further conduct an experiment on the real application of traffic sign recognition to demonstrate the revealed phenomenon. In some sense, our results provide an important complement to the common belief of representational learning. It is worth investigating extensions of this work to other model settings and other DNN architectures (e.g., CNN, sparse nets) to interpret how DNN actually learns.

Causal inference plays an important role in wide-ranging applications including economics, social science, and medical science. One major aspect is quantifying the average treatment effect (ATE) by taking the counterfactuals of the predicted mean responses for the treatment and control groups, which is studied through regressions with matching or with the propensity score, approaches that reduce the dimension to zero and one, respectively, in some traditional causal inference methods. But it is costly to make decisions on matching and to select a subset of control units. In the face of growing big data informatics, we analyze the ATE from a different perspective: we utilize the full-dimensional space including the covariates and the treatment indicator, and the approach requires very mild assumptions on the distribution of each data point. We do not need to figure out the number of characteristics, or which characteristics to use to reduce the dimension, as required in traditional studies. The deep learning based framework is able to achieve estimates comparable to those from traditional causal inference methods such as matching, and it adapts to both observational data and randomized experimental data in our motivating real application of estimating the effect of encouraging blood donations via shortage mobile messages. The suggested deep learning based framework also performs well compared with other conventional machine learning methods. The framework can be especially helpful to firms when big observational data (with many observations and records) are available. It is worth investigating the boundary of the framework with respect to the range of data volume and variety.
Additionally, this framework can be extended to other applications in non-profit and for-profit industries such as forecasting and managing the traffic of blood donors, the capacity of medicine at hospitals, and the customer flow for e-commerce. 88 References [1] Abadie, A. (2005). Semiparametric difference-in-differences estimators. The Review of Economic Studies, 72(1):1–19. [2] Aljalbout, E., Golkov, V., Siddiqui, Y., Strobel, M., and Cremers, D. (2018). Clustering with deep learning: Taxonomy and new methods. arXiv preprint arXiv:1801.07648. [3] Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., et al. (2016). Deep speech 2: End-to-end speech recognition in english and mandarin. In International Conference on Machine Learning, pages 173–182. [4] Arbel, M., Sutherland, D., Bińkowski, M., and Gretton, A. (2018). On gradi- ent regularizers for mmd gans. In Advances in Neural Information Processing Systems, pages 6700–6710. [5] Athey, S. (2015). Machine learning and causal inference for policy evaluation. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 5–6. ACM. [6] Athey, S. and Imbens, G. W. (2015). Machine learning methods for estimating heterogeneous causal effects. Stat, 1050(5). [7] Bengio, Y., Goodfellow, I. J., and Courville, A. (2015). Deep learning. Nature, 521(7553):436–444. [8] Branas, C. C., Cheney, R. A., MacDonald, J. M., Tam, V. W., Jackson, T. D., and Ten Have, T. R. (2011). A difference-in-differences analysis of health, safety, and greening vacant urban space. American Journal of Epidemiology, 174(11):1296–1306. 89 [9] Cao, J., Song, C., Peng, S., Xiao, F., and Song, S. (2019). Improved traffic sign detection and recognition algorithm for intelligent vehicles. Sensors, 19(18):4021. [10] Chen, Y., Lin, Z., Zhao, X., Wang, G., and Gu, Y. (2014). Deep learning-based classification of hyperspectral data. IEEE Journal of Selected topics in applied earth observations and remote sensing, 7(6):2094–2107. [11] Cireşan, D., Meier, U., and Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745. [12] Collobert, R.andWeston, J.(2008). Aunifiedarchitecturefornaturallanguage processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM. [13] Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42. [14] Demirkaya,E.,Vossler,P.,Fan,Y.,Lv,J.,andWang,J.(2020). Nonparametric inference of heterogeneous treatment effects with two-scale DNN. Manuscript. [15] Deng, J., Guo, J., and Zafeiriou, S. (2018). Arcface: Additive angular margin loss for deep face recognition. arXiv preprint arXiv:1801.07698. [16] Deng, L., Yu, D., et al. (2014). Deep learning: methods and applications. Foundations and Trends in Signal Processing, 7(3–4):197–387. [17] Dolz, J., Reyns, N., Betrouni, N., Kharroubi, D., Quidet, M., Massoptier, L., and Vermandel, M. (2017). A deep learning classification scheme based on augmented-enhanced features to segment organs at risk on the optic region in brain cancer patients. arXiv preprint arXiv:1703.10480. [18] Dong, C., Loy, C. C., He, K., and Tang, X. (2014). Learning a deep convolu- tional network for image super-resolution. 
In European Conference on Computer Vision, pages 184–199. Springer. [19] Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. (2015). Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906. [20] Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., and Thrun, S. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118. 90 [21] Fan, J. and Fan, Y. (2008). High dimensional classification using features annealed independence rules. The Annals of statistics, 36(6):2605–2637. [22] Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimen- sional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911. [23] Fan, J. and Lv, J. (2011). Nonconcave penalized likelihood with NP- dimensionality. IEEE Transactions on Information Theory, 57(8):5467–5484. [24] Fan, Y., Jin, J., and Yao, Z. (2013). Optimal classification in sparse gaussian graphic model. The Annals of Statistics, 41(5):2537–2571. [25] Fan, Y., Kong, Y., Li, D., and Zheng, Z. (2015). Innovated interaction screening for high-dimensional nonlinear classification. The Annals of Statistics, 43(3):1243–1272. [26] Fan, Y. and Lv, J. (2013). Asymptotic equivalence of regularization methods in thresholded parameter space. Journal of the American Statistical Association, 108(503):1044–1061. [27] Fan,Y.andTang,C.Y.(2013). Tuningparameterselectioninhighdimensional penalized likelihood. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):531–552. [28] Gao, L., Lv, J., and Shao, Q. (2019). Asymptotic distributions of high- dimensional nonparametric inference with distance correlation. arXiv preprint arXiv:1910.12970. [29] Glorot, X., Bordes, A., and Bengio, Y. (2011). Domain adaptation for large- scale sentiment classification: A deep learning approach. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 513–520. [30] Gordon, B. R., Zettelmeyer, F., Bhargava, N., and Chapsky, D. (2018). A comparison of approaches to advertising measurement: Evidence from big field experiments at facebook. Manuscript. [31] Grimmer, J. (2015). We are all social scientists now: how big data, machine learning, and causal inference work together. PS: Political Science & Politics, 48(1):80–83. [32] Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, pages 315–331. 91 [33] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778. [34] Heaton, J., Polson, N., and Witte, J. H. (2017). Deep learning for finance: deep portfolios. Applied Stochastic Models in Business and Industry, 33(1):3–12. [35] Heckel, R. and Bölcskei, H. (2015). Robust subspace clustering via threshold- ing. IEEE Transactions on Information Theory, 61(11):6320–6342. [36] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Se- nior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97. [37] Hirano, K., Imbens, G. W., and Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71(4):1161–1189. 
[38] Ho, T. K. (1995). Random decision forests. In Proceedings of the Third International Conference on Document Analysis and Recognition, volume 1, pages 278–282. IEEE.
[39] Hof, R. D. (2013). Deep learning – with massive amounts of computational power, machines can now recognize objects and translate speech in real time. Artificial intelligence is finally getting smart.
[40] Hosmer Jr, D. W., Lemeshow, S., and Sturdivant, R. X. (2013). Applied Logistic Regression, volume 398. John Wiley & Sons.
[41] Huang, N., Mojumder, P., Sun, T., Lv, J., and Golden, J. (2020). Not registered? Please sign up now: a randomized field experiment on the timing of registration request. Manuscript.
[42] Huang, N., Sun, T., Chen, P.-y., and Golden, J. (2017). Social media integration and e-commerce platform performance: A randomized field experiment. Manuscript.
[43] Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
[44] Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580.
[45] Jones, T. and Forrest, S. (1995). Fitness distance correlation as a measure of problem difficulty for genetic algorithms. In Proceedings of the Sixth International Conference on Genetic Algorithms.
[46] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[47] Kong, Y., Li, D., Fan, Y., and Lv, J. (2017). Interaction pursuit in high-dimensional multi-response regression via distance correlation. The Annals of Statistics, 45(2):897–922.
[48] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.
[49] Kussul, N., Lavreniuk, M., Skakun, S., and Shelestov, A. (2017). Deep learning classification of land cover and crop types using remote sensing data. IEEE Geoscience and Remote Sensing Letters, 14(5):778–782.
[50] Lechner, M. et al. (2011). The estimation of causal effects by difference-in-difference methods. Foundations and Trends in Econometrics, 4(3):165–224.
[51] Lee, H., Pham, P., Largman, Y., and Ng, A. Y. (2009). Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in Neural Information Processing Systems, pages 1096–1104.
[52] Li, Q. and Shao, J. (2015). Sparse quadratic discriminant analysis for high dimensional data. Statistica Sinica, pages 457–473.
[53] Li, R., Zhong, W., and Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499):1129–1139.
[54] Li, X., Liu, L., Zhou, J., and Wang, C. (2018). Heterogeneity analysis and diagnosis of complex diseases based on deep learning method. Scientific Reports, 8(1):6155.
[55] Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., van der Laak, J. A., van Ginneken, B., and Sánchez, C. I. (2017). A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88.
[56] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., and Song, L. (2017). SphereFace: Deep hypersphere embedding for face recognition. arXiv preprint arXiv:1704.08063.
[57] Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137.
[58] Lu, Y., Fan, Y., Lv, J., and Noble, W. S. (2018). DeepPINK: reproducible feature selection in deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS 2018), pages 8689–8699.
[59] Lv, J. (2013). Impacts of high dimensionality in finite samples. The Annals of Statistics, 41(4):2236–2262.
[60] Lv, J. and Fan, Y. (2009). A unified approach to model selection and sparse recovery using regularized least squares. The Annals of Statistics, 37(6A):3498–3528.
[61] Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.
[62] Majumder, N., Poria, S., Gelbukh, A., and Cambria, E. (2017). Deep learning-based document modeling for personality detection from text. IEEE Intelligent Systems, 32(2):74–79.
[63] Mathias, M., Timofte, R., Benenson, R., and Van Gool, L. (2013). Traffic sign recognition – how far are we from the solution? In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE.
[64] Mhaskar, H. N. and Poggio, T. (2016). Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 14(06):829–848.
[65] Ng, A. Y. (2004). Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, page 78.
[66] Obermeyer, Z. and Emanuel, E. J. (2016). Predicting the future? Big data, machine learning, and clinical medicine. The New England Journal of Medicine, 375(13):1216.
[67] Park, D., Caramanis, C., and Sanghavi, S. (2014). Greedy subspace clustering. In Advances in Neural Information Processing Systems, pages 2753–2761.
[68] Patel, A. B., Nguyen, T., and Baraniuk, R. G. (2015). A probabilistic theory of deep learning. arXiv preprint arXiv:1504.00641.
[69] Poggio, T., Kawaguchi, K., Liao, Q., Miranda, B., Rosasco, L., Boix, X., Hidary, J., and Mhaskar, H. (2017). Theory of deep learning III: explaining the non-overfitting puzzle. arXiv preprint arXiv:1801.00173.
[70] Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science, 334(6062):1518–1524.
[71] Rokach, L. and Maimon, O. (2005). Clustering methods. In Data Mining and Knowledge Discovery Handbook, pages 321–352. Springer.
[72] Safavian, S. R. and Landgrebe, D. (1991). A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3):660–674.
[73] Saxe, A., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B., and Cox, D. (2018). On the information bottleneck theory of deep learning. In International Conference on Learning Representations.
[74] Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61:85–117.
[75] Seide, F., Li, G., and Yu, D. (2011). Conversational speech transcription using context-dependent deep neural networks. In Twelfth Annual Conference of the International Speech Communication Association.
[76] Sekhon, J. S. The Neyman–Rubin model of causal inference and estimation via matching methods.
[77] Shao, F., Wang, X., Meng, F., Rui, T., Wang, D., and Tang, J. (2018). Real-time traffic sign detection and recognition method based on simplified Gabor wavelets and CNNs. Sensors, 18(10):3192.
[78] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484.
[79] Socher, R., Huval, B., Bath, B., Manning, C. D., and Ng, A. Y. (2012). Convolutional-recursive deep learning for 3D object classification. In Advances in Neural Information Processing Systems, pages 656–664.
[80] Soltanolkotabi, M., Elhamifar, E., Candès, E. J., et al. (2014). Robust subspace clustering. The Annals of Statistics, 42(2):669–699.
[81] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.
[82] Sun, T., Gao, G. G., and Jin, G. Z. (2015). Mobile messaging for offline social interactions: a large field experiment. Technical report, National Bureau of Economic Research.
[83] Sun, T., Lu, S. F., and Jin, G. Z. (2016). Solving shortage in a priceless market: Insights from blood donation. Journal of Health Economics, 48:149–165.
[84] Székely, G. J. and Rizzo, M. L. (2013). The distance correlation t-test of independence in high dimension. Journal of Multivariate Analysis, 117:193–213.
[85] Székely, G. J., Rizzo, M. L., and Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794.
[86] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, pages 267–288.
[87] Wang, F., Cheng, J., Liu, W., and Liu, H. (2018a). Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926–930.
[88] Wang, H., Wang, Y., Zhou, Z., Ji, X., Li, Z., Gong, D., Zhou, J., and Liu, W. (2018b). CosFace: Large margin cosine loss for deep face recognition. arXiv preprint arXiv:1801.09414.
[89] Wen, Y., Zhang, K., Li, Z., and Qiao, Y. (2019). A comprehensive study on center loss for deep face recognition. International Journal of Computer Vision, pages 1–16.
[90] Wu, H., Fan, Y., and Lv, J. (2020). Statistical insights into DNN learning in subspace classification. Stat, to appear.
Abstract
Deep learning has benefited almost every aspect of modern big data applications, yet its statistical properties remain largely unexplored. To gain some statistical insights, we design a simple simulation setting in which data are generated from a latent subspace structure, with each subspace regarded as a cluster. We empirically demonstrate that the performance of a deep neural network (DNN) is very similar to that of the ideal two-step procedure of clustering followed by classification (unsupervised plus supervised). However, a series of simulation studies shows that none of the hidden layers in the DNN conducts successful clustering. We provide some statistical insights and heuristic arguments to support our empirical discoveries and further demonstrate the revealed phenomenon on a real data application of traffic sign recognition.

In the field of causal inference, quantifying the average treatment effect (ATE) for intervention policies plays an important role, yet deep learning has not been well investigated for treatment effect inference. To this end, and motivated by the application of blood donation intervention via shortage mobile messages, we exploit deep learning to learn the complex interaction structures between the treatment and the covariates in a high-dimensional feature space and construct counterfactuals from the nonparametric function learned by the deep neural network. Our framework estimates the ATE on the full-dimensional space including the treatment and the covariates, and it requires only mild assumptions on the distribution of each data point and on the treatment given the covariates, assumptions that are easily satisfied in practice.
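As a concrete illustration of the two-step benchmark described above, the following minimal Python sketch first clusters the training data (unsupervised step) and then fits a separate classifier within each recovered cluster (supervised step). The particular choices here, scikit-learn's k-means and linear discriminant analysis with two clusters and every class present in each cluster, are illustrative assumptions rather than the exact procedure used in the dissertation itself.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_two_step(X_train, y_train, n_clusters=2):
    # Step 1 (unsupervised): recover the latent subspaces as clusters.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_train)
    # Step 2 (supervised): fit one classifier per recovered cluster
    # (this sketch assumes every class appears in every cluster).
    classifiers = {}
    for c in range(n_clusters):
        in_cluster = km.labels_ == c
        classifiers[c] = LinearDiscriminantAnalysis().fit(
            X_train[in_cluster], y_train[in_cluster])
    return km, classifiers

def predict_two_step(km, classifiers, X_test):
    clusters = km.predict(X_test)  # route each test point to a cluster
    y_hat = np.empty(len(X_test), dtype=int)
    for c, clf in classifiers.items():
        mask = clusters == c
        if mask.any():
            y_hat[mask] = clf.predict(X_test[mask])
    return y_hat

The counterfactual construction in the second paragraph can likewise be read as a plug-in ATE estimate: fit a flexible regression of the outcome on the treatment and the covariates jointly, then average the difference between its predictions with the treatment switched on and off for every unit. The small TensorFlow network below, with an arbitrary architecture and training budget, is only one possible stand-in for the learned nonparametric function, not the dissertation's implementation.

import numpy as np
import tensorflow as tf

def estimate_ate(T, X, y, epochs=50):
    # Learn the outcome surface f(treatment, covariates) on the full space.
    Z = np.column_stack([T, X]).astype("float32")
    f = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    f.compile(optimizer="adam", loss="mse")
    f.fit(Z, np.asarray(y, dtype="float32"), epochs=epochs, verbose=0)
    # Counterfactual predictions: everyone treated versus everyone untreated.
    y1 = f.predict(np.column_stack([np.ones(len(X)), X]).astype("float32"), verbose=0)
    y0 = f.predict(np.column_stack([np.zeros(len(X)), X]).astype("float32"), verbose=0)
    return float(np.mean(y1 - y0))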
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Large-scale inference in multiple Gaussian graphical models
Nonparametric ensemble learning and inference
Feature selection in high-dimensional modeling with thresholded regression
Leveraging sparsity in theoretical and applied machine learning and causal inference
Robust causal inference with machine learning on observational data
Learning distributed representations from network data and human navigation
Statistical methods for causal inference and densely dependent random sums
Scalable multivariate time series analysis
Statistical learning in High Dimensions: Interpretability, inference and applications
Deep learning techniques for supervised pedestrian detection and critically-supervised object detection
Efficient machine learning techniques for low- and high-dimensional data sources
Topics in selective inference and replicability analysis
Learning fair models with biased heterogeneous data
Model selection principles and false discovery rate control
Human appearance analysis and synthesis using deep learning
Exploration of human microbiome through metagenomic analysis and computational algorithms
High dimensional estimation and inference with side information
Data-driven image analysis, modeling, synthesis and anomaly localization techniques
Asset Metadata
Creator
Wu, Hao (author)
Core Title
Statistical insights into deep learning and flexible causal inference
School
College of Letters, Arts and Sciences
Degree
Doctor of Philosophy
Degree Program
Applied Mathematics
Publication Date
04/24/2020
Defense Date
03/13/2020
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
big data,causal inference,clustering,deep learning,DNN,OAI-PMH Harvest,observational studies,statistical insights,subspace classification
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Zhang, Jianfeng (committee chair), Fan, Yingying (committee member), Fulman, Jason (committee member), Lototsky, Sergey (committee member), Lv, Jinchi (committee member)
Creator Email
hwu409@usc.edu, wuhao19892012@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-287412
Unique identifier
UC11673425
Identifier
etd-WuHao-8319.pdf (filename), usctheses-c89-287412 (legacy record id)
Legacy Identifier
etd-WuHao-8319.pdf
Dmrecord
287412
Document Type
Dissertation
Rights
Wu, Hao
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA