LARGE-SCALE INFERENCE IN MULTIPLE GAUSSIAN GRAPHICAL MODELS

by

Yongjian Kang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (APPLIED MATH)

May 2016

Copyright 2016 Yongjian Kang

Dedication

To my family

Acknowledgements

It has been an interesting journey during my five years of Ph.D. study at USC. First, I would like to express my deepest appreciation to my advisors, Prof. Fan and Prof. Lv, for their unlimited patience, constant encouragement, and excellent guidance. It is their effort that helped me go deep into modern statistical research, and they provided endless support to me like family. They have been persevering with me throughout the time of my study. This dissertation would have been impossible without their guidance. Their exceptional insights and ongoing encouragement kept me fighting through all difficulties.

I would also like to thank Prof. Zhang for serving as my advisor in my home department and Prof. Mikulevicius for serving as my committee member. They patiently audited my research and provided generous help all the time. I also want to thank Prof. Ren, who is one of the collaborators on the second part. He is very intelligent and talented, and his hard work and immense knowledge brought our research to a new level.

I also want to express my thanks to our department committee, who approved my application and offered me such a good chance to study in one of the best Ph.D. programs in the world. The faculty here are very responsible and talented, and learning from them helped me step into the wonderful palace of mathematics.

Finally, I want to express my deepest appreciation to my parents. They are the best parents in the world. Their endless love accompanies me through the journey and keeps me from feeling alone. I also want to thank all my friends and relatives. They are my solid back-up, and my life would not be so great without them. I feel extremely lucky to have so many people helping and supporting me all the time.

Contents

Dedication
Acknowledgements
List of Figures
List of Tables
Abstract
Chapter 1: Group-wise square-root Lasso
  1.1 Introduction
  1.2 Motivation
  1.3 Problem formulation
  1.4 Our model
  1.5 Algorithm and theorem
  1.6 Numerical Studies
    1.6.1 Settings
    1.6.2 Results
  1.7 Proof of Theorems
    1.7.1 Proof of Theorem 1.5.1
Chapter 2: Large-Scale Inference in Multiple Gaussian Graphical Models
  2.1 Introduction
  2.2 Motivation
  2.3 Prologue: Testing for a Single Graph
  2.4 Two Testing Procedures for Multiple Graphs
  2.5 Tuning-free Algorithm
    2.5.1 Computational Algorithm
  2.6 Simulation Studies
    2.6.1 Model Settings
    2.6.2 Testing Results
    2.6.3 Precision Matrices Estimation
    2.6.4 Heavy-tailed Distribution
  2.7 Real Data Analysis
Chapter 3: Robust Subspace Classification
  3.1 Introduction
  3.2 Methodology
  3.3 Algorithm
  3.4 Numerical Study
    3.4.1 Simulation Study
    3.4.2 Real data analysis
Chapter 4: Interaction Screening for Logistic Regression
  4.1 Introduction
  4.2 Interaction Screening Method for logistic regression
  4.3 Simulation studies
    4.3.1 Simulation settings
    4.3.2 Screening Performance
    4.3.3 Selection Performance
  4.4 Real data analysis
    4.4.1 Real Data Analysis
  4.5 Proof of Theorems
    4.5.1 Proof of Theorem 4.2.1
    4.5.2 Proof of Theorem 4.2.2
    4.5.3 Lemmas and their proofs
Bibliography

List of Figures

2.1 Common edges between C1 and the other types identified by the two proposed testing procedures

List of Tables

1.1 Multiple regression performance for our method and Lasso. Standard errors are in parentheses.
2.1 Testing results for model I with α = 0.05. Mean values are reported with standard errors in parentheses.
2.2 Testing results for model II with α = 0.05. Mean values are reported with standard errors in parentheses.
2.3 Estimation results for model I.
2.4 Estimation results for model II.
2.5 Average computation cost measured by CPU time in seconds.
2.6 Testing results for the Laplace distribution with α = 0.05.
2.7 Top 10 nodes with highest degrees identified by the two proposed testing procedures, in descending order.
3.1 Simulation results for subspace classification.
3.2 Simulation results for subspace classification for the second model.
3.3 Results on the SPECT data.
4.1 Screening probability of the true interaction. Case 1 with parameter 0.5.
4.2 Screening probability of the true interaction. Case 2 with parameter 0.25.
4.3 Selection performance of the Lasso fit for case 1 with parameter 0.5.
4.4 Selection performance of the Lasso fit for case 2 with parameter 0.25.
4.5 Prediction error, misclassification rate and model size on the breast cancer data in [33] over 50 random splits. Standard errors are in parentheses.
4.6 Numbers of appearances of main-effect variables and interactions among 50 experiments that are greater than or equal to 4.

Abstract

This dissertation addresses four challenging problems in large-scale statistical inference in the high-dimensional setting. In this setting, where the number of features can exceed the sample size, difficulties such as high collinearity among the predictors, spurious variables, and noise accumulation arise for classical statistical models. As a powerful tool for producing interpretable models, sparse modeling has gained popularity for analyzing large-scale data sets and is one of the most active topics in modern statistical research.

In the first part of this dissertation, we consider the multiple regression problem in the high-dimensional setting. Most existing regression methods focus on a single data set. When the dimension is low, ordinary least squares estimators have been widely used; in the high-dimensional setting, regularized regression has been studied, and the Lasso is popular because it provides a sparse solution path. However, regression over multiple data sets that share the same support has seldom been studied. In this part, we model the problem as a single linear regression with a block-diagonal design matrix by combining the multiple data sets, and, to make full use of their common support, we treat the coefficients with the same index across data sets as one group factor. We thus propose an estimator based on group regularization, which we call the group-wise square-root Lasso estimator. Since the typical group Lasso formulation suffers from a tuning problem, in that the optimal tuning parameter depends on the noise level σ, we define our loss function as the sum over data sets of the L2 norms of the residual vectors. We propose our new estimator based on this formulation, develop an algorithm to compute it, and prove that the algorithm returns the global optimum. Numerical analysis demonstrates that our method achieves better performance than existing methods.

In the second part, we consider large-scale inference in multiple Gaussian graphical models. Traditionally, estimated residuals have been widely used to recover a single Gaussian graph, and the existing literature covers both unbiased estimation and statistical inference. However, estimating multiple graphs simultaneously has seldom been studied. In the high-dimensional setting, where the dimension is very large compared with the sample size, and motivated by scientific applications such as genomics, it is common to assume that the precision matrices share a common sparsity structure. In order to provide reliable inference and recover the graphs, we propose to apply our group-wise square-root Lasso to obtain improved estimates of the residuals, and based on these we propose two test statistics for inference on the graph structure. It is also straightforward to extend our method to estimation. Our new method is supported by simulations and theoretical results.

In the third part, we develop a new subspace classification method. In real classification problems, it is well known that in many cases the data within each class approximately lie on a lower-dimensional subspace of the feature space. Based on this assumption, we develop a new algorithm in which we treat each training sample as a feature and a new observation as the dependent variable. We cast classification as a regression problem and use the sparse group Lasso to select the support of the features. We can adjust the tuning parameter so that only one group of features is selected, thereby determining the category of the new observation. We prove that our method achieves one hundred percent accuracy with large probability asymptotically. Our method also has the advantages that there are few parameters to tune and that the idea is easy to interpret through geometric analysis.

In the fourth part, we provide an interaction screening method for logistic regression. A naive approach to interaction terms in logistic regression would be to treat all interactions among features as new features and add them to the model. The problem is that the number of new features is then of the same order as the square of the original number of features, so that, especially when the dimension is high, solving the resulting model becomes very challenging. We develop a fast and accurate screening method and use the penalized generalized linear model introduced in [5] for selection. Our screening method is a two-step procedure that uses conditional correlations as test statistics. We show that our method achieves good performance in both accuracy and computational cost, as illustrated in simulation and real data settings.

Chapter 1

Group-wise square-root Lasso

1.1 Introduction

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (also called independent variables or covariates), denoted X. The case of one explanatory variable is called simple linear regression; with more than one explanatory variable, the process is called multiple linear regression. Multiple linear regression has a wide range of real applications, such as price prediction and the estimation of potential consumers. The model takes the form

    y_i = x_i^T β + ε_i,   i = 1, 2, ..., n,

where i denotes the index of the observation, x_i and β are vectors of dimension p, and ε_i is the unknown noise with mean 0. For simplicity, we assume ε_i ~ i.i.d. N(0, σ²). The problem is, given Y and X, how to provide a robust estimator β̂. The development of efficient algorithms for the estimation of large-scale models has been a major goal of statistical learning research.

Variable selection in high-dimensional linear regression models has become a very active topic of research in the last decade. In linear models, postulating that some components of β are zero is equivalent to assuming that the corresponding predictors are unrelated to the response after controlling for the predictors with non-zero components [8]. In many real applications, it is very common that not all explanatory variables contribute to the model; in fact, only a small portion of the explanatory variables may actually affect the dependent variable. Regularization is therefore introduced to provide additional information in order to solve an ill-posed problem or to prevent overfitting. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on the model parameters. The Lasso is one of the most famous regularization penalties.
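To make this concrete before turning to the Lasso literature, the short sketch below simulates the sparse linear model above with p > n and contrasts a minimum-norm least squares fit with an L1-penalized fit. It is purely illustrative: scikit-learn's Lasso is used only as a generic sparse solver (it is not the estimator developed in this chapter), and the penalty value alpha = 0.1 is an arbitrary assumption rather than a tuned choice.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 100, 200, 3                       # p > n, only s predictors are active
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = [3.0, -2.0, 1.5]                 # arbitrary nonzero coefficients
y = X @ beta + rng.standard_normal(n)

# Least squares is ill-posed when p > n (X'X is singular); lstsq returns the
# minimum-norm solution, which is dense and overfits.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# An L1 penalty shrinks most coefficients exactly to zero.
beta_l1 = Lasso(alpha=0.1).fit(X, y).coef_

print("nonzeros (least squares):", np.sum(np.abs(beta_ols) > 1e-8))
print("nonzeros (L1 penalty):   ", np.sum(np.abs(beta_l1) > 1e-8))
```

The penalized fit keeps only a handful of coefficients, which is exactly the kind of sparsity that the penalties discussed next are designed to produce.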
Tibshirani [9] proposed the least absolute shrinkage and selection operator (LASSO) using the idea of the L 1 norm of being upper bounded. Under the constraint that the L 1 norm of is less than a tuning parametert, solving is transferred to solving an optimization problem using Lagrange Multiplier. It has been well studied that estimator via Lasso enjoys some of the favorable properties such as its solution is sparse. This implies that LASSO could be used in sparse modelling. However, even though LASSO enjoys many desired properties, it does not have an explicit solution. Efron, et.al [10] developed a new model selection algorithm called LARS. Since LASSO could be implemented by a simple modication of the LARS algorithm, LASSO gained a lot popularity after it could be solved numerically. Note that regularization conduct shrinkage on coecients estimation, Zheng et. al. [4] discussed the shrinkage eect on thresholded regression. Besides this, Fan and Lv [5] explored other types of regularization and discussed the corresponding properties of estimator under dierent regularization. More recently, a large literature focusing on the selection of groups of predictors has been developed and attracted large attention. This problem requires methods that set to zero entire groups of coecients and is the focus of this work, [11]. In 2 many regression problems we are interested in nding important explanatory factors in predicting the response variable, where each explanatory factor may be represented by a group of derived input variables. The most common example is the additive model with polynomial or nonparametric components. In both situations, each component in the additive model may be expressed as a linear combination of a number of basis functions of the original measured variable. In such cases the selection of important measured variables corresponds to the selection of groups of basis functions, Yuan and Lin [12]. However, a big problem for this approach is that due to its loss function, the optimal value of tuning parameter depends on (the noise level) and the accurate estimation of whenp>n may be as dicult as the original problem of selection. The consideration of the square-root form of the criterion was rst proposed by Owen [13] in the statistics literature, and a similar approach is the Scaled Lasso by Sun and Zhang [14]. Belloni et al. [15] studied theoretically the estimation and prediction accuracy of square-root Lasso and showed that it is similar to that of the Lasso, with the net advantage that optimality can be achieved for a tuning sequence independent of . This makes this version of the Lasso-type procedure much more appealing when p is large, especially when p > n. Bunea, Lederer and She [11] studied group square-root Lasso and they provided a tuning-free parameter for their estimator, which avoids the heavy tuning. In this chapter, we consider a multi-task problem where we have several sets of data and their true coecients share same sparsity pattern. This problem has a wide real application such as gene chip signal procession and multiple price prediction. However, this problem has seldomly been studied due to the diculty of mathematical formu- lation and numerical solution. We will provide our formula for our estimation and 3 provide an algorithm to obtain our estimation. We also provide theoretical support of proposed algorithm. 
The contributions of our work are: Proposing a new estimator for multiple data sets regression problem, which we call Group-wise square-root Lasso estimator. Developing an algorithm with optimization techniques to obtain our desired es- timator within nite iteration. Providing theoretical support which guarantees the convergence of our algorithm and its global optimality. The rest of this chapter is organized as following. We will discuss our motivation and address Group-wise square-root Lasso estimator in Section 1.2 and 1.3. We will provide our mathematical formulation in Section 1.4. We will present our algorithm to obtain the estimator and provide theoretical support in Section 1.5. We provide corresponding numerical analysis in Section 1.6. 1.2 Motivation For traditional regression problems, we just have one set of datafY;Xg, where Y is a n by 1 vector and X is an n by p design matrix. When the dimension is low and solution is not sparse, ordinary least square estimator has been widely used. To solve this, classical method is to minimize the L 2 norm of empirical loss: b = arg min 2R p ( n 1 n X i=1 (y i p X j=1 j x ij ) 2 ) 4 And with some algebra, we could get b = (X T X) 1 X T y This is called OLS estimator. However, this estimator fails when the dimension is high since X 0 X is not invert- ible. Also when the dimension is high, it is very natural to assume the solution is sparse since not every explanatory variable should be included into the model. Thus to provide sparse solution, given the data set, we add regularization and follow corre- sponding algorithm to get the estimation b . Since from a Bayesian point of view many regularization techniques correspond to imposing certain prior distributions on model parameters, we choose dierent penalty based on dierent estimation target. However, regressions of multiple data sets with same support has seldomly been studied. As technique develops very fast, massive data are generated and usually we could get multiple data sets which are of same experiment essentially. Thus it is very natural to assume multiple data sets share common support if the problem could be transferred to a regression problem. Thus estimating each model separately doesn't meet our expectation because it doesn't make use of common information and is very expensive in computation. Based on this purpose, we propose a new estimator b and name it as group-wise square-root Lasso. I will present how we dene it, how we construct computation algorithm, how to prove our algorithm and provide numerical studies in following section. As shown below, our method takes advantage of their common structure and provides a more accurate estimation. 5 1.3 Problem formulation Suppose we are given several sets of datafY (i) ;X (i) g; i = 1; 2;:::k. For each set, the dimension is p and sample size is n. In order to transfer it to one regression problem, we construct a new response variableY and covariatesX by combining them together as following: Y = 0 B B B B B B B @ Y (1) Y (2) ::: Y (k) 1 C C C C C C C A X = 0 B B B B B B B @ X (1) X (2) . . . X (k) 1 C C C C C C C A Thus Y is a vector with length nk and X is a block diagonal matrix with size (nk;pk). Based on above X and Y , we dene our new and as following: = 0 B B B B B B B @ (1) (2) ::: (k) 1 C C C C C C C A = 0 B B B B B B B @ (1) (2) ::: (k) 1 C C C C C C C A where is a vector with lengthpk and is a vector with lengthnk. 
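The stacking just described can be assembled directly. The following sketch, with illustrative helper names that are not taken from the text, builds the combined response Y and the block-diagonal design X from k individual data sets.

```python
import numpy as np
from scipy.linalg import block_diag

def stack_datasets(X_list, Y_list):
    """Combine k data sets (each an n-by-p design with an n-vector response)
    into the single regression Y = X beta + eps of Section 1.3."""
    Y = np.concatenate(Y_list)      # length n*k stacked response
    X = block_diag(*X_list)         # (n*k) x (p*k) block-diagonal design
    return Y, X

# toy usage with k = 3 data sets of size n = 50, p = 20
rng = np.random.default_rng(1)
k, n, p = 3, 50, 20
X_list = [rng.standard_normal((n, p)) for _ in range(k)]
Y_list = [X_list[t] @ rng.standard_normal(p) + rng.standard_normal(n) for t in range(k)]
Y, X = stack_datasets(X_list, Y_list)
print(Y.shape, X.shape)             # (150,) (150, 60)
```

In the stacked coefficient vector β = (β^(1)', ..., β^(k)')', the coefficient of feature l in data set t sits at position (t-1)p + l, so the l-th group factor introduced in the next section collects positions l, l + p, ..., l + (k-1)p.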
Thus we could formulate our problem as to estimate for following regression problem: Y =X + 6 where (j) ;j = 1; 2;:::k share same support and N(0; 2 I pk ). Without loss of generality, we assume all covariates in X i have been standardized. 1.4 Our model To make full use of their same sparsity pattern, we use group Lasso penalty. Since we know group Lasso penalty enjoys group-wise sparsity, we set up a group factor of with same index in its group, which means (j) = ( (1) (j) ; (2) (j) ;:::; (k) (j) ) Totally, we have p such group factors. Thus variable selection is transformed to the selection among these group factors. Considering typical group lasso formulation suers from tuning problem since op- timal tuning parameter is dependent with noise level , we denote ~ Q t ( (t) ) = Y (t) X (t) (t) 2 for t2 [k]. For the loss function of each data set, we use square root of ~ Q t . By using this loss function, we could avoid the heavy tuning of . The consideration of the square-root form of the criterion was rst proposed by Owen [13] in the statistics literature. Belloni et al. [15] studied theoretically the estimation and showed that it is similar to that of the Lasso, with the net advantage that optimality can be achieved for a tuning sequence independent of. This makes this version of the Lasso-type procedure much more appealing when p is large, especially when p>n. Thus, the group-wise square-root Lasso is dened as the solution to the following optimization problem, 7 ^ = arg min 2R kp ( n 1=2 k X t=1 kY (t) X (t) (t) k + p X l=1 (l) ) : (1.1) where = ( (1) T ;:::; (k) T ) T 2 R pk . We note that is the true vector and T =fj : (j) 6= 0g withjTjs. One big issue for this estimator is that it doesn't have an explicit solution. Hence we develop an algorithm which could return us the solution of above optimization problem. 1.5 Algorithm and theorem In this section, we propose the computational algorithm to solve for our estimator . We are inspired by the algorithm from Bunea, Lederer and She [11] where they solve the group square-root lasso. Note that our model is formulated with dierent loss function where we use the sum of group-wise L 2 norm of residuals as loss function. Note that we want to solve following optimization problem: ^ = arg min 2R kp ( n 1 k X t=1 kY (t) X (t) (t) k + p X l=1 (l) ) : (1.2) where Y (t) 2R n , X (t) 2R np and (t) 2R p are respectively the response, the design matrix and the regression coecient vector corresponding to the tth graph with t = 1; ;k, the vector = (( (1) ) 0 ; ( (2) ) 0 ; ; ( (k) ) 0 ) 0 2R kp is formed by stacking (k) 's one underneath the other, and (l) is the subvector of formed by taking out eachlth component of (t) fort = 1; ;k. Similarly, we dene the subvectors ^ (l) ,l = 1; ;p and ^ (t) , t = 1; ;k. To solve the optimization problem in (1:1), we propose a scaled iterative thresh- olding procedure to obtain the estimator of . This algorithm is designed specically 8 for our optimization problem with convergence guarantees, motivated by the algorithm for group square-root Lasso with homogeneous noises in [11] as well as a more general algorithm developed in [8]. In the end, to further reduce the bias of the estimator ^ incurred by the penalty in (1:1), one may ret an ordinary least squares estimate on the support of ^ to obtain the nal estimate. Our algorithm consists of the following steps. First, we scale the response vector, the design matrix and the parameter Y (t) =K 0 !Y (t) ; X (t) =K 0 ! 
X (t) ;=K 0 !; t2 [k]; (1.3) whereK 0 is some preselected large enough scalar. It is clear that solution remains the same after scaling. This step reduces the norm of the design matrix, which guarantees the convergence of the iterative algorithm as shown in the proof of Theorem 1.5.1 later. Hereafter we slightly abuse the notation and still use Y (t) , X (t) and to denote the response, the design matrix and the penalty parameter after rescaling. RecallkAk ` 2 denotes the spectral norm of matrixA. The choice ofK 0 = max t kX (t) k ` 2 , suggested by the equation (1.10) in the proof of Theorem 1.5.1, works well in our simulation studies. Secondly, with the rescaled data matrix, we solve the optimization problem in (1.1) iteratively and use (m) to denote the solution in the mth iteration. In particular, (0) represents certain initial estimator of . In our simulation studies and real data analysis, we pick (0) = 0, which works well. We will use (m) (t) and (m) (l) to denote the subvectors of (m) following the same notation as in (1.1). Specically, in the (m + 1)th iteration with (m), we set R(m) = ((R(m) (1) ) 0 ; ; (R(m) (k) ) 0 ) 0 with R(m) (t) = (X (t) ) 0 (X (t) (m) (t) Y (t) )= (n) 1=2 X (t) (m) (t) Y (t) , t2 [k]. Let R(m) (l) be the subvector ofR(m)2R pk corresponding to thelth group, i.e.,R(m) (l) = 9 (R(m) (1) l ;:::;R(m) (k) l ) 0 2R k with R(m) (t) l the lth component of the vector R(m) (t) . In addition, we denote the scaling factor A(m) = k X t=1 (n) 1=2 X (t) (m) (t) Y (t) 1 : With the new notation, we update (m + 1) as (m + 1) (l) = ! (m) (l) R(m) (l) A(m) ; A(m) ; l2 [p]; (1.4) where ! is the multivariate soft-thresholding operator dened as ! (0;) := 0 and ! (a;) :=a(kak ;)=kak for the non-zero vectora with (t;) := sgn(t)(jtj) + , the soft-thresholding rule. We stop the iteration when dierence between the estimates from two subsequent iterations is small enough. The following theorem guarantees the convergence of our algorithm to a global optimum. Theorem 1.5.1. Suppose that > 0 and the following regularity condition holds: min t2[k] inf 2A t X (t) Y (t) >c 0 ; where for each t2 [k], A t =fv(m) (t) + (1v)(m + 1) (t) :v2 [0; 1];m = 0; 1; 2;:::g and c 0 is a positive constant. Then for K 0 large enough, the sequence of iterates (m) starting with any (0) will converge to the global optimum of (1.1). Remark 1.5.1. The regularity condition in Theorem 1.5.1 is not strong, especially in the high dimensional settings, where X (t) Y (t) = 0 typically has no solution. 10 Let ^ be the solution generated by our algorithm. We have shown that we guarantee the existence and uniqueness of ^ and also ^ is a global minimum of (1.1). 1.6 Numerical Studies In this section, we will test our proposed method on simulated data sets. In order to justify our performance, we also compare our method with LASSO, where we use LASSO on the same data set by solving each regression problem separately. 1.6.1 Settings We generate k i.i.d data setsfX (i) g; i = 1; 2;:::k and each has n observations and p dimensions. We use the model Y (i) =X (i) (i) + (i) to generate Y (i) , where (i) ; i = 1; 2;:::k are vectors with length p and share the same support, which means8i, (i) (j) 6= 0; j = 1; 2;:::p or (i) (j) = 0; j = 1; 2;:::p. And the noise (i) N(0;I n ); i = 1; 2;:::k . Note that for each observation from fX (i) g; i = 1; 2;:::k, it follows standard multivariate normal distribution, which means any row of X (i) follows N(0;I p ). 
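The data-generating scheme just described can be sketched as follows: k independent data sets with standard Gaussian rows and coefficient vectors sharing the support {1, 2, 3} (indices 0-2 in 0-based indexing) but with different nonzero values. The nonzero values drawn here are placeholders; the exact coefficients used in the study are specified in the next paragraph, and the default k, n, p match the settings given there.

```python
import numpy as np

def make_shared_support_data(k=10, n=50, p=100, support=(0, 1, 2), seed=2):
    """Draw k independent data sets Y^(t) = X^(t) beta^(t) + eps^(t) whose
    coefficient vectors share the same support but differ in nonzero values."""
    rng = np.random.default_rng(seed)
    X_list, Y_list, beta_list = [], [], []
    for t in range(k):
        X = rng.standard_normal((n, p))            # each row ~ N(0, I_p)
        beta = np.zeros(p)
        # placeholder nonzero values on the common support
        beta[list(support)] = rng.uniform(1.0, 3.0, size=len(support))
        Y = X @ beta + rng.standard_normal(n)      # noise ~ N(0, I_n)
        X_list.append(X); Y_list.append(Y); beta_list.append(beta)
    return X_list, Y_list, beta_list

X_list, Y_list, beta_list = make_shared_support_data()
```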
Note that β = (β^(1)', β^(2)', ..., β^(k)')' and β^(i) = (β^(i)_1, β^(i)_2, ..., β^(i)_p) for i = 1, 2, ..., k. For all i, the support of β^(i) is {1, 2, 3}; that is, for all i ≤ k and all j = 1, 2, ..., p, we have β^(i)_j ≠ 0 if and only if j = 1, 2, 3. We set β^(i)_1 = 3.1 + 0.1i, β^(i)_2 = 2.1 − 0.1i and β^(i)_3 = 2 + 0.1i, so that the nonzero coefficients differ across all k groups. Based on the above model, we use our proposed algorithm to solve for β. Meanwhile, we also use the Lasso as a comparison: instead of combining the data sets, we apply the Lasso to the regression on each data set separately, obtain an estimate of each β^(i), and combine all the estimates to obtain the final estimate of β. For the model parameters, we set the number of groups k = 10, the number of observations within each group n = 50, and the number of features p = 100.

1.6.2 Results

We generate 100 repeats in total, independently, and within each repeat the observations are generated independently. For each repeat, we obtain β̂ as the estimate of β for both our method and the Lasso. To measure the accuracy of the estimation, we compute two quantities, the estimation error β̂ − β and the estimation loss X(β̂ − β), and for each we calculate the L1, L2 and L∞ norms for both methods. We also investigate how well both methods recover the support of β, which in this case is j = 1, 2, 3. In each repeat, we count the number of false negatives (FN) and false positives (FP) for both methods and report the averages over the 100 repeats.

Table 1.1 summarizes the performance of our proposed method and of the Lasso applied separately; standard errors are in parentheses. Compared to the separate Lasso fits, our method is superior in every respect. For support recovery, both methods recover the true support in every repeat, but the Lasso produces 0.4 false positives on average per repeat while our method produces none. For the estimation error, the Lasso is larger in all three norms, meaning our estimate is closer to the truth in all three respects. The same holds for the estimation loss, so in terms of refitting y our method also provides an estimate closer to the truth. Together, these results show that our method outperforms the separate Lasso in this simulation setting.

Table 1.1: Multiple regression performance for our method and Lasso. Standard errors are in parentheses.

                          norm    our method            Lasso
    Estimation error      L1      4.8608 (0.7916)       5.2004 (0.8101)
                          L2      1.0133 (0.1529)       1.0796 (0.2217)
                          L∞      0.3733 (0.0849)       0.4216 (0.1687)
    Estimation loss       L1      261.1353 (8.2322)     273.4386 (12.2501)
                          L2      13.9238 (0.4282)      14.5717 (0.7412)
                          L∞      1.6361 (0.1588)       1.8677 (0.4205)
    FN                            0                     0
    FP                            0                     0.4

1.7 Proof of Theorems

1.7.1 Proof of Theorem 1.5.1

Proof. The proof consists of two parts. First we prove that our algorithm has a unique guaranteed convergence point β(∞). Then we show this point is the global optimum of (1.1).

Step 1: Convergence of β(m). Define

    F(β) = n^{-1/2} Σ_{t=1}^{k} ||Y^(t) − X^(t) β^(t)|| + λ Σ_{l=1}^{p} ||β_(l)||

as the objective function we want to minimize with respect to β. To prove the desired result, we first construct a surrogate function and then prove that the updating rule optimizes the surrogate function.
Then we explore the relationship between the objective function and the surrogate function, from where we can conclude that the limit of (m) is optimal for our objective function. We start with introducing a surrogate function: G(; ) = k X t=1 Y (t) X (t) (t) p n + 1 2 k X t=1 1 p n Y (t) X (t) (t) k k 2 + p X l=1 (l) (1.5) + k X t=1 1 p n Y (t) X (t) (t) ( (t) (t) ) 0 (X (t) ) 0 (X (t) (t) Y (t) ); where (t) and (l) are the subvectors of dened similarly as (t) and (l) . Thus, F () =G(;). Let R (t) = (X (t) ) 0 (X (t) (t) Y (t) )=((n) 1=2 kY (t) X (t) (t) k) and R = ((R (1) ) 0 ; ; (R (k) ) 0 ) 0 . We can rewrite the last term in (1.5) as k X t=1 1 p n Y (t) X (t) (t) ( (t) (t) ) 0 (X (t) ) 0 (X (t) (t) Y (t) ) = ( ) 0 R: 15 Let A = P k t=1 ((n) 1=2 kY (t) X (t) (t) k) 1 . Thus, given , minimizing G over is equivalent to minimizing the following objective function formed by the last three components of G in (1.5) with respect to 1 2 Ak k 2 + p X l=1 (l) + ( ) 0 R: This is also equivalent to minimizing the following objective function with respect to 1 2 + R A 2 + A p X l=1 (l) : (1.6) Combining the above results we obtain that for a given , the minimizer to G(; ) dened in (1.5) is the same as the minimizer to (1.6). Now set =(m) and correspondingly, we dene the vector R(m) and the scalar A(m) similarly to the vectorR and the scalarA with replaced with(m). We update (m + 1) as the minimizer to (1.6). Then (m + 1) also minimizes G((m); ). Note that the optimization problem in (1.6) is separable and can be written in the following form p X l=1 1 2 (l) R (l) A (l) 2 + A (l) : (1.7) In view of (1.7), the optimization problem in (1.6) can be solved by minimizing each of the p terms inside of the summation. The solution has explicit form. Specically, by Lemma 1 and Lemma 2 in [8], we obtain that (m + 1) has the explicit form (m + 1) (l) = ! (m) (l) R(m) (l) A(m) ; A(m) ; l2 [p]; (1.8) 16 where R(m) (l) above is the subvector of R(m) dened in a similar way to (l) as a subvector of . Thus, it follows that G((m);(m + 1))G((m);(m)) =F ((m)): On the other hand, using Taylor series expansion, we get for any and , k X t=1 Y (t) X (t) (t) p n + k X t=1 1 p n X (t) (t) Y (t) ( (t) (t) ) 0 (X (t) ) 0 (X (t) (t) Y (t) ) k X t=1 Y (t) X (t) (t) p n k X t=1 ( (t) (t) ) 0 (X (t) ) 0 X (t) ( (t) (t) ) 2 p n X (t) (t) Y (t) ; (1.9) with (t) lying between the line segment connecting (t) and (t) for each t2 [k]. Denote bykXk ` 2 the spectral norm of a matrix X. By combining (1.9) and (1.5) and setting =(m) and =(m + 1), it follows that F ((m))F ((m + 1))G((m);(m + 1))F ((m + 1)) k X t=1 ((m + 1) (t) (m) (t) ) 0 (X (t) ) 0 X (t) ((m + 1) (t) (m) (t) ) 2 p n X (t) (t) Y (t) + 1 2 A(m)k(m + 1)(m)k 2 = k X t=1 ((m + 1) (t) (m) (t) ) 0 0 @ A(m) 2 I (X (t) ) 0 X (t) 2 p n X (t) (t) Y (t) 1 A ((m + 1) (t) (m) (t) ) k X t=1 1 2 p n 0 B @ 1 X (t) (m) (t) Y (t) X (t) 2 ` 2 2 X (t) (t) Y (t) 1 C A (m + 1) (t) (m) (t) 2 (1.10) 17 where I in the second last step means the identity matrix of appropriate size. We need to show the right hand side of (1.10) is positive. At the initial step m = 0, it is clear that this can be achieved by picking a large enoughK 0 in the scaling step (1.3) as long askX (t) (t) Y (t) k6= 0. This fact and the regularity condition are able to guarantee F ((m)) is decreasing. Set B 0 = (n (0) ) 1=2 F ((0)) and recall kX (t) (t) Y (t) k > c 0 . Indeed, it suces thatkX (t) k 2 ` 2 < c 0 =B 0 . 
Specically, we are able to use induction to show that for all t and m, X (t) (m) (t) Y (t) B 0 and F ((m))F ((0)): (1.11) Thus, combining the above inequalities we arrive at the following conclusion 1 X (t) (m) (t) Y (t) X (t) 2 ` 2 2 X (t) (t) Y (t) 1 2B 0 : Consequently, F ((m)) F ((m + 1)) by (1.10). Thus F ((m)) is monotonically decreasing in m. Since F ((m)) is lower bounded from below by 0, it follows that lim m!1 F ((m)) exists and lim m!1 jF ((m+1))F ((m))j = 0. This together with (1.10) proves lim m!1 k(m + 1)(m)k = 0: (1.12) Also notice that for all m = 0; 1; , k(m)k p X l=1 (m) (l) F ((m)) F ((0)) ; which ensures that (m) is uniformly bounded in a closed subset of R kp . So for all m = 0; 1; , the vector(m) is uniformly bounded and in a closed subset ofR kp . This implies(m) has at least one convergence point. This together with (1.12) ensures that 18 (m) has a unique limit point , which is a xed point of the thresholding rule in (1.8). It remains to prove (1.11). When m = 0, it is easy to verify thatkX (t) (0) (t) Y (t) k B 0 and F ((m)) F ((0)). Let us assume for all m T , it holds that kX (t) (m) (t) Y (t) kB 0 and F ((m))F ((0)). It follows that 1 X (t) (T ) (t) Y (t) X (t) 2 ` 2 2 X (t) (t) Y (t) > 1 2B 0 : Thus, based on (1.10),F ((T +1))F ((T ))F ((0)). We can also getkX (t) (T + 1) (t) Y (t) k (n (0) ) 1=2 F ((T + 1)) (n (0) ) 1=2 F ((0)) = B 0 . Thus, (1.11) also holds for m =T + 1. This completes the proof of (1.11) for all m and t and also concludes the proof of the rst step. Step 2: Global optimality. We next show that is the global optimum of (1.1). Since F () is a summation of two convex functions of , it follows that F () is also a convex function with respect to . Thus is a global minimizer of F () if and only if it satises the following KKT conditions ((X (t) ) 0 (X (t) (t) Y (t) )) l p n X (t) (t) Y (t) = (t) l (l) ; for (t) l 6= 0 (1.13) ((X (t) ) 0 (X (t) (t) Y (t) )) l p n X (t) (t) Y (t) ; for (t) l = 0; (1.14) where the subscript l in both equations means the lth component of a vector. Recall that is the xed point of the soft thresholding rule in (1.8), i.e., (l) = ! (l) R (l) A ; A ; l2 [p]; 19 where R (l) and A are dened similarly as R(m) (l) and A(m) in (1.8) expect that we replace (m) with . Next note that when (l) = 0, by the denition of soft- thresholding rule,kR (l) =A k=A , which entails thatkR (l) k. So ((X (t) ) 0 (X (t) (t) Y (t) )) l p n X (t) (t) Y (t) =jR (t) l jkR (l) k; for (t) l = 0: (1.15) This veries the second KKT condition (1.14) for the xed point . When (l) 6= 0, by soft-thresholding rule, (l) = (l) R (l) A A (l) R (l) A (l) R (l) A : (1.16) By taking ` 2 norm on both sides, we obtain thatk (l) k =k (l) R (l) =A k=A . Furthermore, (1.16) can be rearranged as A (l) R (l) A = R (l) A (l) R (l) A : Combining the above two results, we obtain that (l) =R (l) (l) R (l) A A =R (l) (l) ; which further entails that R (t) l = (X (t) ) 0 (X (t) (t) Y (t) ) l p n X (t) (t) Y (t) = (t) l (l) ; for (t) l 6= 0: (1.17) This veries the rst KKT condition (1.13) for . Combining (1.15) and (1.17), we can conclude that () is a global minimum of (1.1). 20 Chapter 2 Large-Scale Inference in Multiple Gaussian Graphical Model 2.1 Introduction In this chapter, we are going to present Large-Scale Inference in Multiple Gaussian Graphical Model. We are going to provide both joint testing and joint estimation in Multiple Gaussian Graphical Model where we use our group-wise square-root Lasso estimator in previous chapter. 
Estimation of dependency networks for high dimensional data sets is especially desirable in many scientic areas. It attracts lots of practical attention to recover the underlying network formed by a large number of individuals who are sparsely related. Graphical models provide a feasible approach to explore the conditional independence structure among a large set of nodes. Specically, Gaussian graphical model has proven to be a very powerful formalism to infer dependence structures of various data sets. In Gaussian graphical models, the conditional independence is fully characterized by the sparsity pattern in the precision matrix, which is the inverse of the covariance matrix. Inference on the structure of Gaussian graphical model is challenging when the dimension is greater than the sample size, Liu [16]. Some literature provide estimation of precision matrix. Liu [16] proposed a method for inference on the structure of Gaussian graphical model via bias correction. This provides the approach that we could statistically justify the graph structure. Fan and Lv [1] provided Innovated scalable 21 ecient estimation in ultra-large Gaussian graphical models. However, previous work seldom consider inference multiple graphs together which is more and more important in elds such as biology and sociology. Some work addressed how to provide joint estimation among multiple graphs, such as JGL and MPE. However, these methods did not provide further inference after estimation. In this chapter, we present our multiple Gaussian Graphs inference, where we propose two test statistics thus provide inference on multiple graph structure. Based on the inference, we could further obtain the estimation of the precision matrices. We will provide both theoretical and numerical support of our method. 2.2 Motivation The Gaussian graphical model, a powerful tool for investigating the relationship among a large number of random variables in a complex system, is used in a wide range of scientic applications. A central question for Gaussian graphical models is how to recover the structure of an undirected Gaussian graph, [17]. It is well known that recovering the structure of an undirected Gaussian graph G = (V, E) is equivalent to recovering the support of the population precision matrix of the data in the Gaussian graphical model. Let X = (X 1 ;X 2 ;:::;X p )N(; ); where is the population covariance matrix. The precision matrix, denoted by = (w ij ), is dened as the inverse of covariance matrix, = 1 . There is an edge between V i andV j , that is, (i;j)2E, if and only ifw ij 6= 0. Consequently, the support recovery of the precision matrix yields the recovery of the structure of the graph G. Inference 22 on one graph has been widely studied. However, how to estimate multiple graphs together is rarely studied. Suppose we havek classes of data, each of whichX (t) follows a multivariate Gaussian distribution X (t) = (X (t) 1 ;:::;X (t) p ) 0 N(0; ( (t) ) 1 ), where (t) 2 R pp denotes the precision matrix of the tth class. Moreover, the distributions of X (1) ;X (2) ;:::;X (k) are independent. As we mentioned in the introduction, the precision matrix (t) = (! (t) ij ) pp re ects the conditional dependency relationship among p random variables X (t) 1 ;:::;X (t) p . In particular,! (t) ab = 0 is equivalent to thatX (t) a andX (t) b are condition- ally independent given all other variables. In Gaussian graphical model, there is no edge between two nodes X (t) a and X (t) b whenever ! (t) ab = 0. 
In high-dimensional setting wherep is very large compared with sample size, moti- vated by scientic applications such as genomics, it is common to assume each (t) has certain sparsity structure. Typical goals involve inference of ! (t) ab as well as estimation of (t) . When k = 1, i.e., there is only one class of data, our setting coincides with the typical single Gaussian graphical model. However, whenever k 2, one may wish to borrow strength across all k classes of data to have a more accurate estimation of k precision matrices if k classes are related to each other. In this paper, we assume thatk classes share similar sparsity structure. In other words, we expect for each pair of nodes (a;b)2 [p] [p] with a6= b, either ! (t) ab = 0 simultaneously for all t2 [k] or alternatively, the nonzero signal for the vector (! (1) ab ;:::;! (k) ab ) 0 is signicant. We denote ! 0 ab = (! (1) ab ;:::;! (k) ab ) 0 2R k and the edge set of k graphs can be dened as E =f(a;b) :! 0 ab 6= 0;a6=bg: (2.1) 23 Our aim of the current paper is to test the null hypothesis H 0;ab :! 0 ab = (! (1) ab ;! (2) ab ;:::;! (k) ab ) 0 = 0: (2.2) Depending on the types of alternatives, we introduce two dierent fully data-driven test statistics later in this section. We also establish statistical properties which imply superior performance over those obtained by estimating each class individually. To facilitate our presentation, the following notations will be used. For a vector a = (a 1 ; ;a p ) 0 , we usea j to denote the subvector with thejth component removed. For a matrix A = (a ij ), we use A ;j to denote the jth column, use A j;j to denote the subvector ofA ;j with thejth component removed, andA ;j to denote the submatrix of A with the jth column removed. 2.3 Prologue: Testing for Single Graph It is well known that for any j 2 [p], the conditional distribution of X (t) j given all remaining variables X (t) j in class t follows the following Gaussian distribution, X (t) j jX (t) j N(X (t)0 j C (t) j ; 1=! (t) jj ); (2.3) where the coecient C (t) j = (t) j;j =! (t) jj . This motivates the neighborhood regression approach X (t) j = X (t)0 j C (t) j + (t) j , where (t) j N(0; 1=! (t) jj ) and is independent of X (t) j . To test whether each pair ! (t) ab = 0, it is worthwhile to investigate the covariance of residuals (t) a and (t) b obtained by regressingX (t) a andX (t) b against its remaining variables respectively. In fact, it is easy to check that Cov( (t) a ; (t) b ) = ! (t) ab ! (t) aa ! (t) bb . (2.4) 24 In particular, there is no edge between nodesX (t) a andX (t) b if and only if the covariance of two residuals is zero Cov( (t) a ; (t) b ) = 0. Hence the graph structure of the tth class is re ected by the support of the covariance matrix Cov( (t) ), where the residual vector (t) = ( (t) 1 ; (t) 2 ;:::; (t) p ) 0 . Assume for each class t2 [k], we have n (t) i.i.d. observations X (t) 1; ;X (t) 2; ;:::;X (t) n (t) ; , whereX (t) i; = (X (t) i;1 ;:::;X (t) i;p ) 0 N(0; (t) ). Furthermore, we assume that observations among dierent classes are independent. Let us denote the n by p-dimensional data matrix of the tth class as X (t) = (X (t) 1; ;X (t) 2; ;:::;X (t) n (t) ; ) 0 . In other words, the ith row of X (t) corresponds to the ith observation X (t) i; . Without loss of generality, we focus on the edge between nodes X (t) 1 and X (t) 2 and consider the testing problem (2.2) with a = 1 and b = 2. 
Motivated by the conditional distribution (2.3), we can write the residuals of this neighborhood regression withj = 1 and 2 by then (t) -dimensional vectorsE (t) ;1 andE (t) ;2 respectively, where E (t) ;j = (E (t) 1;j ;E (t) 2;j ;:::;E (t) n (t) ;j ) 0 and E (t) i;j =X (t) i;j X (t)0 i;j C (t) j ; where j = 1; 2; i2 [n (t) ]: (2.5) Here the coecient C (t) j = (t) j;j (! (t) jj ) 1 and E (t) i;j i:i:d: N(0; (! (t) jj ) 1 ) for i2 [n (t) ]. Observing the covariance structure of the noise (2.4), we can simply take the sample covariance of E (t) ;1 and E (t) ;2 to obtain an ecient estimator of Cov( 1 ; 2 ) = ! (t) 12 ! (t) 11 ! (t) 22 if we have access to the residualsE (t) ;1 andE (t) ;2 . Unfortunately, the residuals can only be estimated from the regression model (2.5) and are unknown to us. Whenk = 1, this problem is considered in [16] for single Gaussian graphical model. We remove the subscript in our notation to emphasize that there is only one class. Let ^ C j be any estimator of C j for j = 1; 2 satisfying some relatively weak statistical properties. The estimated residuals can be written as ^ E i;j =X i;j X 0 i;j ^ C j ; where j = 25 1; 2;i2 [n]. Due to estimation errors induced by ^ C 1 and ^ C 2 , a test statistic for null ! 12 = 0 is proposed based on a bias correction idea. In particular, the test statistics is dened as T 1;2 = 1 n n X i=1 ^ E i;1 ^ E i;2 + n X i=1 ^ E i;1 2 ^ C 2;1 + n X i=1 ^ E i;2 2 ^ C 1;2 ! ; where ^ C 2;1 is the coordinate of ^ C 2 corresponding to feature X ;1 and ^ C 1;2 is dened similarly. The initial estimator ^ C j was picked by using Dantzig selector or Lasso for neighborhood regression based on (2.3). It is shown in [16] that under mild assumption on and maximum node degree s := max a P b6=a 1f! ab 6= 0g = o( p n logp ), the statistic is asymptotically normal distributed with mean zero and computable variance under null ! 12 = 0. Hence the testing problem can be easily solved asymptotically. 2.4 Two Testing Procedures for Multiple Graphs For multiple class setting k > 1, it is an easy extension to apply this bias corrected test statistic following the same method for each single graph as that in [16]. However, the structural similarity information is not used for this approach. To this end, we would like to take advantage of the relationship among k classes by considering the above approach simultaneously. Observe the conditional distributions of k classes X (t) j jX (t) j N(X (t)0 j C (t) j ; 1=! (t) jj ) for t 2 [k] share similar sparsity structure on the coecients C (t) j due to the fact C (t) j = (t) j;j =! (t) jj . For example, if we have the group structure ! 0 ab = 0, then it is clear that C (t) a;b = 0 for all t 2 [k]. This fact suggests that signicant improvement can be achieved for estimation of the residuals E (t) i;j simultaneously for t2 [k]. Indeed this is the case. 26 We rst introduce some notation. For each j2 [p], let the joint coecients of the neighborhood regressions for k classes be C 0 j = C (1)0 j ;:::;C (k)0 j 2R (p1)k : Suppose we have some general estimator ^ C 0 j = ( ^ C (1)0 j ;:::; ^ C (k)0 j ) 0 ofC 0 j . Then we dene the estimation error of C 0 j as j = ^ C 0 j C 0 j . Denote the k-dimensional subvectors of C 0 j and ^ C 0 j indexed by the lth group by C 0 j(l) = C (1) j;l ;C (2) j;l ;:::;C (k) j;l 0 , ^ C 0 j(l) = ^ C (1) j;l ; ^ C (2) j;l ;:::; ^ C (k) j;l 0 2R k ; (2.6) respectively. The j(l) is dened as a subvector of j similarly. 
We denote the joint sparsity of k classes as the maximum node degree ofE in (2.1), s := max a X b6=a 1f! 0 ab 6= 0g (2.7) We assume the following general conditions for our model. (A1) LetX (t) N(0; ( (t) ) 1 ), where (t) 2R pp ,t2 [k]. Moreover, we assume there exists some absolute constantM > 0 such that 1=M min ( (t) ) max ( (t) ) M for all t2 [k]. (A2) The sample sizes of each of thek classes are comparable: n (1) n (2) :::n (k) . We denote this level byn (0) whenever the rate involves. For concreteness, we set n (0) = min t fn (t) g and assume max t fn (t) g=n (0) M 0 . The following statistical properties are the working assumptions for our testing problem of k classes. Suppose for each j2 [p], certain initial estimator ^ C 0 j satises with high probability, 27 1 p k k j k C 1 s 1 + (logp)=k n (0) 1=2 ; (2.8) X l6=j 1 p k j(l) C 2 s 1 + (logp)=k n (0) 1=2 ; (2.9) 1 k k X t=1 X (t) ;j ^ C (t) j C (t) j 2 n (t) C 3 s 1 + (logp)=k n (0) : (2.10) Indeed, we propose a tuning-free approach to obtain ^ C 0 j satisfying all the properties above. The analysis of our estimator is new due to heterogeneity of noises across dierent classes. In particular, typical tuning free procedures such as scaled Lasso or square root Lasso would not work in the current setting. Obtaining initial estimator of C 0 j , we have the estimated residuals for each t2 [k], ^ E (t) i;j =X (t) i;j X (t)0 i;j ^ C (t) j ; where j = 1; 2;i2 [n (t) ]: (2.11) The bias issue appears in the high-dimensional setting. Motivated by [16], we dene the bias corrected statistic T (t) n;k;1;2 for each t2 [k] as follows, T (t) n;k;1;2 = 1 n (t) 0 @ n (t) X i=1 ^ E (t) i;1 ^ E (t) i;2 + n (t) X i=1 ^ E (t) i;1 2 ^ C (t) 2;1 + n (t) X i=1 ^ E (t) i;2 2 ^ C (t) 1;2 1 A : In addition, due to the distribution of (2.5), we dene ^ ! (t) jj = ( P n (t) i=1 ^ E (t) i;j ^ E (t) i;j =n (t) ) 1 as an estimator of noise level ! (t) jj of class t2 [k] for node j = 1; 2. Indeed, after 28 bias correction, the statistic T (t) n;k;1;2 is close to J (t) n;k;1;2 , a constant times the covariance between (t) 1 and (t) 2 . J (t) n;k;1;2 = [1! (t) 11 (^ ! (t) 11 ) 1 ! (t) 22 (^ ! (t) 22 ) 1 ] ! (t) 12 ! (t) 11 ! (t) 22 : (2.12) Due to the group structure and the target of our hypothesis in (2.2)H 0;12 :! 0 12 = 0, we naturally construct our test statistics using certain function of all T (t) n;k;1;2 , t2 [k]. Thanks to the joint estimation accuracy ofC 0 j in terms of the properties (2.8)-(2.10), we rst present a Chi-based test statistic U n;k;1;2 dened in (2.13) for null H 0;12 :! 0 12 = 0 against alternative hypothesis for which the condition is imposed on the` 2 normk! 0 12 k. In other words, our test statistic is powerful whenever the signalk! 0 12 k is larger than some testable region boundary. To introduce our test statistic, we dene U 2 n;k;1;2 = k X t=1 n (t) ^ ! (t) 22 ^ ! (t) 11 T (t) n;k;1;2 2 : (2.13) In addition, to characterize its limiting distribution under null, we dene U n;k;1;2 as follows, V (t) n;k;1;2 = s ! (t) 22 ~ ! (t) 11 n (t) n (t) X i=1 E (t) i;1 E (t) i;2 EE (t) i;1 E (t) i;2 ; (2.14) U 2 n;k;1;2 = k X t=1 V (t) n;k;1;2 2 = k X t=1 ! (t) 22 ~ ! (t) 11 n (t) 0 @ n (t) X i=1 E (t) i;1 E (t) i;2 EE (t) i;1 E (t) i;2 1 A 2 ; where ~ ! (t) jj = ( P n (t) i=1 E (t) i;j E (t) i;j =n (t) ) 1 is the oracle estimator of ! (t) jj , assuming E (t) ;1 and E (t) ;2 are observable. Note that under null, the Gaussian vector E (t) ;2 N(0;I(! (t) 22 ) 1 ) is independent of E (t) ;1 . 
Therefore V (t) n;k;1;2 N(0; 1) and they are independent of each 29 other for t2 [k]. Consequently, under null hypothesis, we have that U 2 n;k;1;2 2 (k). We are ready to summarize the rst main theorem on Chi-based test. Theorem 2.4.1. Assume (A1) and (A2) hold. In addition, assume we have initial estimators of ^ C 0 j ,j = 1; 2, which satisfy working properties (2.8)-(2.10) with probability at least 1C 0 p 1 . Besides, assumes (k + logp)=n (0) =o(1) and log(k= 1 ) =O(s(1 + (logp)=k)). Then with probability at least 1 (12 +C 0 )p 1 4 1 , we have k X t=1 n (t) ^ ! (t) 22 ^ ! (t) 11 T (t) n;k;1;2 J (t) n;k;1;2 2 ! 1=2 U n;k;1;2 C s k + logp p n (0) ; where C > 0 depends on M;M 0 ;;C 1 ;C 2 and C 3 . In particular, under the null hypoth- esis in (2.2) H 0;12 :! 0 12 = 0, we have U 2 n;k;1;2 2 (k), which yields U n;k;1;2 U n;k;1;2 C s k + logp p n (0) Theorem 2.4.1 implies that under H 0;12 , we should propose a test using statistic U n;k;1;2 : for any xed 2 (0; 1), an asymptotic test with signicance level is dened by 2 = 1 U n;k;1;2 >z l2 k (1) ; (2.15) wherez l2 k (1) denotes the 100(1)th-percentile of the Chi distribution with degree of freedom k. In the asymptotic setting in which p;k;s are possibly diverging as n (0) !1, the following corollary shows that under the sample size requirementn (0) s 2 (k + logp) 2 , 2 is indeed an asymptotic test with signicance level . Corollary 2.4.1. Assume conditions of Theorem 2.4.1 hold with 1 =o(1) and > 1. Moreover, the sample size satises that s 2 (k + logp) 2 = o(n (0) ). Then test 2 dened in (2.15) has asymptotic signicance level : 30 While it is natural to apply Chi-based testing procedure (2.15) to test if a vector ! 0 12 is zero, some other testing procedures may be more powerful with certain extra knowledge. In what follows, we present an alternative test based on a linear functional of! 0 12 , which is closely related to its` 1 norm. In some applications, the sign relationship of one target edge across k graphs are provided implicitly or explicitly. For example, one may expect all! (t) 12 ,t2 [k] have the same sign, i.e. they are either all non-positive or all non-negative. To this end, testing the null hypothesisH 0;12 :! 0 12 = 0 is equivalent to testingk! 0 12 k 1 =j P k t=1 ! (t) 12 j = 0. In a more general setting, the sign relationship can be represented by a sign vector = ( 1 ;:::; k ) 0 2f1;1g k uniquely (or up to a single sign) such thatk! 0 12 k 1 = P k t=1 t ! (t) 12 (ork! 0 12 k 1 =j P k t=1 t ! (t) 12 j = 0). Given this sign vector , we propose a linear functional-based test statistic V n;k;1;2 as follows, V n;k;1;2 () = k X t=1 t q n (t) ^ ! (t) 22 ^ ! (t) 11 T (t) n;k;1;2 : (2.16) To characterize its limiting distribution under null, we dene V n;k;1;2 () = k X t=1 t V (t) n;k;1;2 = k X t=1 t s ! (t) 22 ~ ! (t) 11 n (t) n (t) X i=1 E (t) i;1 E (t) i;2 EE (t) i;1 E (t) i;2 ; where V (t) n;k;1;2 is dened in (2.14). With this extra sign information, our test statistic is powerful whenever the signal k! 0 12 k 1 is suciently large. Note that under null, V (t) n;k;1;2 N(0; 1) are independent with each other fort2 [k]. Consequently we have thatV n;k;1;2 ()N(0;k) for any sign vector . We are ready to summarize the second main theorem on linear functional- based test. 31 Theorem 2.4.2. Assume the assumptions in Theorem 2.4.1 hold. Then with proba- bility at least 1 (12 +C 0 )p 1 4 1 , we have k X t=1 t q n (t) ^ ! (t) 22 ^ ! 
(t) 11 T (t) n;k;1;2 J (t) n;k;1;2 V (t) n;k;1;2 C s k + logp p n (0) ; (2.17) where C > 0 only depends on M;M 0 ;;C 1 ;C 2 and C 3 . In particular, under the null hypothesis in (2.2)H 0;12 :! 0 12 = 0, we haveJ (t) n;k;1;2 = 0 andV n;k;1;2 ()N(0;k), which yield V n;k;1;2 ()V n;k;1;2 () C s k + logp p n (0) Theorem 2.4.2 quanties the behavior of the test statisticV n;k;1;2 () underH 0;12 . In addition, if the sign vector is given uniquely such thatk! 0 12 k 1 = P k t=1 t ! (t) 12 under alternative, then Theorem 2.4.2 and the denition of J (t) n;k;1;2 in (2.12) suggest a one- sided test as follows: for any xed2 (0; 1), an asymptotic test with signicance level is given by 1 = 1 V n;k;1;2 () p k <z() ; (2.18) wherez() denotes the 100()th-percentile of the standard normal distribution. If the sign vector is given up to a single sign (e.g. we only know all t , t2 [k] are identical), then it is more natural to dene a two-sided test. We omit the details of this two-sided test for simplicity. Based on Theorem 2.4.2, the following corollary shows that under the sample size requirement n (0) s 2 k 1 (k + logp) 2 , 1 is indeed an asymptotic test with signicance level . Corollary 2.4.2. Assuming conditions of Theorem 2.4.2 hold with 1 = o(1) and > 1. Moreover, the sample size satises that s 2 k 1 (k + logp) 2 =o(n (0) ), then test 1 dened in (2.18) has asymptotic signicance level . 32 While most results in literature either focus on testing procedures for a single graph or develop estimation procedures of multiple graphs without statistical inference, The- orems 2.4.1-2.4.2 in this paper provide statistical inference procedures over multiple graphs at a rst time. Since our procedures are closely related to that in [16], it is interesting to compare them with possible ones established by applying the method in [16] to each of the k graphs. Remark 2.4.1. The advantage of our testing procedure 1 is re ected on the sample size requirement s 2 k 1 (k + logp) 2 =o(n (0) ) in Corollary 2.4.2, thanks to the structural similarity information across k graphs which makes working assumptions (2.8)-(2.10) possible. In comparison, to test H 0;12 : ! 0 12 = 0, one can apply the procedure in [16] to each one of the k graphs and then construct a similar linear functional type test as (2.18). However, it can be shown that a stronger sample size assumptions 2 k (logp) 2 = o(n (0) ) is required. Remark 2.4.2. Due to the reason discussed in Remark 2.4.1, the advantage of our testing procedure 2 is re ected on the sample size requirement in Corollary 2.4.1. In comparison, one can construct a similar Chi-based test as (2.15) based on residuals ^ E (t) i;j which are obtained through applying the procedure in [16] to each individual graph. For such a test, it can be shown that s 2 k (logp) 2 =o(n (0) ) is required. Therefore, whenever k =o((logp) 2 ), our procedure 2 has a weaker sample size requirement. 2.5 Tuning-free Algorithm In this section, we propose a tuning-free method to estimate the coecientsC 0 j for each j2 [p] and show that our estimator ^ C 0 j satises the working assumptions (2.8)-(2.10). By symmetry, we only focus on the setting j = 1 without loss of generality. Recall that for each graph t2 [k], we have data matrix X (t) = (X (t) 1; ;X (t) 2; ;:::;X (t) n (t) ; ) 0 with 33 i.i.d. rows X (t) i; = (X (t) i;1 ;:::;X (t) i;p ) 0 N(0; ( (t) ) 1 ), where i2 [n (t) ]. The node-wise conditional distribution (2.3) suggests the following joint linear regression model, 0 B B B B B B B @ X (1) ;1 X (2) ;1 . . . 
X (k) ;1 1 C C C C C C C A = 0 B B B B B B B @ X (1) ;1 X (2) ;1 . . . X (k) ;1 1 C C C C C C C A 0 B B B B B B B @ C (1) 1 C (2) 1 . . . C (k) 1 1 C C C C C C C A + 0 B B B B B B B @ E (1) ;1 E (2) ;1 . . . E (k) ;1 1 C C C C C C C A = X 0 ;1 C 0 1 +E 0 ;1 2R N ; (2.19) where N = P k t=1 n (t) , E (1) ;1 = (E (t) 1;1 ;:::;E (t) n (t) ;1 ) 0 is dened in (2.5) and consists of i.i.d. components with N(0; (! (t) 11 ) 1 ). Moreover, we have the group sparsity structure of the coecients C 0 1 . In other words, all but at most s subvectors are zero vectors, i.e. C 0 1(l) = 0 for mostl2 [p]nf1g, whereC 0 1(l) 2R k and sparsitys are dened in (2.6) and (2.7) setting respectively. The joint group structure and sparsity structure naturally motivate us to apply a variate of group Lasso to estimate the coecient C 0 1 . The statistical properties of standard group Lasso are well understood and imply a faster rate of convergence of estimatingC 0 1 as well as X 0 ;1 C 0 1 , compared with the standard Lasso approach. See, for instance, [18], [19] for further details. However, the optimal choice of tuning parameter critically depends on the common noise level and typically is unknown. Hence a practical and data-driven choice of is needed to lead to optimal estimation. This further step has been investigated recently in [11] and [20] by extending square-root Lasso ([15]) and scaled Lasso ([14]) to the group structure respectively. A signicant dierence of the standard group Lasso setting from the current model in (2.19) lies in the noise level. Indeed, instead of common noise level for all components ofE 0 ;1 , we allow each class has its own noise level, namely (! (t) 11 ) 1 ,t2 [k]. The strategy 34 of square-root Lasso or scaled Lasso, which essentially includes an extra parameter of noise level, can only handle homogeneous noises. In this section, we extend the group square-root Lasso a step further to allow heterogeneous noises. We comment that this extension usually is never trivial and the novelty of our analysis is due to an intrinsic constant upper bound of tted residual level for each class, as a property of graphical models. We rst introduce some notation. LetQ t ( (t) ) = 1 n (0) kX (t) ;1 X (t) ;1 (t) k 2 fort2 [k], where we set (t) = ( (t) 2 ;:::; (t) p ) 0 2 R (p1) to match the index set of C (t) 1 . We then collect all (t) to form a k(p 1)-dimensional vector 0 = ( (1)0 ; (2)0 ;:::; (k)0 ) 0 . The lth group of 0 is denoted by 0 (l) = ( (1) l ;:::; (k) l ) 0 2 R k in the same way as we dened C 0 1(l) for C 0 1 . We further set a diagonal matrix of dimension (p 1); D (t) 1 = diag(X (t)0 ;1 X (t) ;1 =n (t) ) and then collect them to form a k(p 1)-dimensional diagonal matrix D 1 . The submatrix of D 1 corresponding to the lth group is denoted by D 1(l) . In particular, the tth entry on the diagonal of D 1(l) is X (t)0 ;l X (t) ;l =n (t) . The group square-root Lasso with heterogeneous noises (GSRLH) is dened as the solution to the following optimization problem, ^ C 0 1 = arg min 0 2R k(p1) ( k X t=1 Q 1=2 t ( (t) ) + p X l=2 D 1=2 1(l) 0 (l) ) ; (2.20) where the parameter will be provided explicitly later, which is independent with the noise levels (! (t) 11 ) 1 , t2 [k]. The GSRLH dened in (2.20) yields an estimator of C 0 1 . For estimation of C 0 j with any j 2 [p], we can simply replaced 1 by j in the above method. 
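For concreteness, the GSRLH objective in (2.20) can be evaluated as in the following minimal sketch (the function name and interface are illustrative only and are not tied to any particular implementation):

```python
import numpy as np

def gsrlh_objective(Xs, ys, B, lam, n0):
    """Evaluate the GSRLH objective in (2.20).

    Xs : list of k design matrices X^(t) of shape (n_t, p-1)
         (the columns of the t-th data matrix other than variable 1)
    ys : list of k response vectors X^(t)_{,1} of shape (n_t,)
    B  : array of shape (k, p-1); row t holds beta^(t)
    lam: penalty level lambda
    n0 : minimum class sample size n^(0) appearing in Q_t
    """
    k, pm1 = B.shape
    # sum_t Q_t(beta^(t))^{1/2}, with Q_t = ||y_t - X_t beta_t||^2 / n0
    loss = sum(np.linalg.norm(ys[t] - Xs[t] @ B[t]) / np.sqrt(n0)
               for t in range(k))
    # group penalty: for each variable l, || D_{1(l)}^{1/2} beta_(l) ||_2,
    # where the t-th diagonal entry of D_{1(l)} is ||X^(t)_{,l}||^2 / n_t
    penalty = 0.0
    for l in range(pm1):
        d = np.array([Xs[t][:, l] @ Xs[t][:, l] / Xs[t].shape[0]
                      for t in range(k)])
        penalty += np.sqrt(np.sum(d * B[:, l] ** 2))
    return loss + lam * penalty
```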
The optimization coincides with standard square-root Lasso when k = 1 while becomes the standard group square-root Lasso as proposed in [11] if P k t=1 Q 1=2 t ( (t) ) is replaced by ( P k t=1 Q t ( (t) )) 1=2 . However, when the noise levels vary 35 across dierent classes, the standard group square-root Lasso cannot be carried over to take into account this issue. As shown in the analysis of Theorem 2.5.1 below, one important ingredient for the success of GSRLH estimators is the following event: for any xed > 1, we dene B 1 = 8 < : max 2lp D 1=2 E1 D 1=2 1(l) X 00 ;(l) E 0 ;1 p n (0) 1 + 1 9 = ; ; (2.21) where X 0 ;(l) 2R Nk is the submatrix of X 0 ;1 formed by columns with indices corre- sponding to the lth group and D E1 is the diagonal matrix with its tth diagonal entry being squared ` 2 norm of the noise vector E (t) ;1 , that is ( D E1 ) tt =kE (t) ;1 k 2 . Similarly we can dene eventsB j for any j2 [p]. This event can be interpreted as that the pure noise incurred is dominated by the penalty level. In order to guarantee eventB 1 holds with high probability, we need to carefully pick a sharp choice of , which is independent with noise levels of E 0 ;1 . Given conditions (A1) and (A2), the following result summarizes the prediction as well as estimation bounds of our GSRLH estimators. Theorem 2.5.1. Let ^ C 0 j be the solution of (2.20) with the following well specied parameter = + 1 1 k + 2 logp + 2 p k logp n (0) (1) 1=2 ; where 2 = 8( logp + logk)=n (0) = o(1). Then with probability at least 1 3p +1 eventB j holds. Assume conditions (A1) and (A2) are satised and the maximum node 36 degreesC n (0) = logp withC > 0 depending on,M andM 0 , then there exists some constant C depending on M and M 0 such that X l2[p]nfjg 1 p k ^ C 0 j(l) C 0 j(l) Cs 1 + (logp)=k n (0) 1=2 ; (2.22) ^ C 0 j C 0 j C s 1 + (logp)=k n (0) 1=2 ; (2.23) 1 k k X t=1 X (t) ;1 ^ C (t) j C (t) j 2 n (0) Cs 1 + (logp)=k n (0) ; (2.24) with probability at least 1 4p +1 . Remark 2.5.1. The novelty of analysis follows from an intrinsic upper bound of the tted residual level. It is worthwhile to point out, with the knowledge of this quan- tity, we can also apply regular group Lasso with a tuning parameter depending on this quantity and obtain corresponding justiable theorem. However, the intrinsic upper bound in our analysis does not appear in the optimization and only provides a theoreti- cal support while regular group Lasso has to apply it in the tuning parameter explicitly. Consequently, this possibly loose intrinsic upper bound yields large bias for regular group Lasso but still sharp result for GSRLH. Remark 2.5.2. The proof of Theorem 2.4.1, 2.4.2 and 2.5.1 is mostly worked out by Prof. Ren, Prof. Fan and Prof. Lv. I really appreciate their hard work in helping prove these. Thus I abbreviate the corresponding proof here and readers with interests could refer to their paper for details. It is interesting to compare the sharpness of our parameter and the one used in [11] for the homogeneous noises setting. First, one advantage of our choice of comes from the scaling matrix D 1 which makes the noise per column of X 0 ;(l) homogeneous and sharpens by a factor formed as the ratio of the largest and the smallest ` 2 norm 37 among all columns. Besides, due to simple block diagonal structure of X 0 ;(l) , a direct and sharp Chi-square tail probability ([21]) provides sharper constant factors for both k and logp. In practice, we can also calculate the sharp parameter by simulation. 
Specif- ically, we simulate T = 10000 times the value of k D 1=2 E1 D 1=2 1(2) X 00 ;(2) E 0 ;1 k=(n (0) ) 1=2 and pick the 100(1 1=p ) percentile of its empirical distribution as our choice of ( 1)=( + 1) with > 1. We take > 1 because of the union bound argument, given that only the setting l = 2 is simulated. To this end, we note the components of D 1=2 E1 D 1=2 1(2) X 00 ;(2) E 0 ;1 are independent and their distributions can be characterized easily since they do not depend on the variance of X 00 ;(2) or E 0 ;1 . For each replication T2 [10000], we simulate the tth component of D 1=2 E1 D 1=2 1(2) X 00 ;(2) E 0 ;1 independently as follows: generate Z 1;t;T N(0;I)2 R n (t) and Z 2;t;T N(0;I)2 R n (t) independently and calculate Z t;T = (n (t) ) 1=2 Z 0 1;t;T Z 2;t;T =(kZ 1;t;T kkZ 1;t;T k) 1=2 . Hence the simulated value ofk D 1=2 E1 D 1=2 1(2) X 00 ;(2) E 0 ;1 k can be written as ( P k t=1 Z 2 t;T ) 1=2 . Formally, we pick our parameter sim as sim = 1 p n (0) + 1 1 inf ( v : 10000 X T=1 1 ( ( k X t=1 Z 2 t;T ) 1=2 <v ) =10000 1 1=p ) : (2.25) 2.5.1 Computational Algorithm In this section, we propose our computational algorithm to solve GSRLH (2.20) eciently. We rst standardize each column of X 0 ;1 to have ` 2 norm (n (t) ) 1=2 . The resulting new design matrix after standardization is denote by X 0 ;1 = 38 diagf X (1) ;1 ; ; X (k) ;1 g, where X 0 ;1 = X 0 ;1 D 1=2 1 with D 1 dened in Section 2.5. Consider the following optimization problem: ^ C 0 1 = arg min 0 2R k(p1) ( k X t=1 Q 1=2 t ( (t) ) + p X l=2 0 (l) ) : (2.26) where Q t ( (t) ) =kX (t) ;1 X (t) ;1 (t) k 2 =n (0) for t2 [k], and the notations 0 , (t) , and 0 (l) are dened similarly as in (2.20). It is seen that the objective function in (2.26) diers from the one in (2.20) only by a scaling matrix D 1=2 1 , and their minimizers are related in the way that ^ C 0 1 = D 1=2 1 ^ C 0 1 with ^ C 0 1 dened in (2.20). Thus, the problem of solving (2.20) reduces to the problem of solving (2.26). It suces to discuss our new algorithm for solving (2.26). To ease the presentation, we abuse the notation in this section and rewrite (2.26) in the following general form with simplied notation ^ = arg min 2R kp ( (n (0) ) 1=2 k X t=1 kY (t) X (t) (t) k + p X l=1 (l) ) : (2.27) which we could easily observe that this is exactly the same formulation with 1.1. Read- ers could refer to the last chapter for the details of our algorithm to solve this. Thus we use our algorithm introduced in last chapter to obtain our initial estimation of . Then we ret an ordinary least squares estimate on the support of ^ to obtain the nal estimate. 2.6 Simulation Studies In this section, we implement our proposed procedure with some simulated data sets. Section 2.6.1 presents the hypothesis testing results of our methods. Since almost all existing methods on multiple graphs focus on estimation instead of testing, in Section 39 2.6.3 we modify our procedure to obtain estimates of the precision matrices and then compare with some popularly used existing methods such as MPE [22], GGL and FGL [23]. All these three methods provide joint estimations of precision matrices. MPE is based on a weighted constrained ` 1 and ` 1 minimization and is implemented via a second-order cone programming. Both GGL and FGL maximize the penalized log likelihood function to estimate the precision matrix, with the dierence that the generalized fused Lasso penalty is used for FGL while the generalized group Lasso penalty is used for GGL. 
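Returning to the data-driven tuning parameter, the Monte Carlo construction of $\lambda_{sim}$ in (2.25) is straightforward to implement. The sketch below is illustrative only: the function name is arbitrary, the default values of $\alpha$ and $\delta$ are placeholders, and the normalization of $Z_{t,T}$ by $\|Z_{1,t,T}\|\,\|Z_{2,t,T}\|$ is an assumption.

```python
import numpy as np

def lambda_sim(n_ts, n0, p, alpha=2.0, delta=1.0, T=10000, rng=None):
    """Monte Carlo choice of the penalty level as in (2.25).

    n_ts  : list of class sample sizes n^(1), ..., n^(k)
    n0    : n^(0) = min_t n^(t)
    p     : number of variables
    alpha : constant alpha > 1 from the union-bound argument (placeholder default)
    delta : constant delta in the tail level 1/p^delta (placeholder default)
    """
    rng = np.random.default_rng() if rng is None else rng
    stats = np.empty(T)
    for rep in range(T):
        z_sq = 0.0
        for n_t in n_ts:
            z1 = rng.standard_normal(n_t)
            z2 = rng.standard_normal(n_t)
            # Z_t = sqrt(n_t) <z1, z2> / (||z1|| ||z2||)  (assumed normalization)
            z = np.sqrt(n_t) * (z1 @ z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))
            z_sq += z ** 2
        stats[rep] = np.sqrt(z_sq)           # simulated value of (sum_t Z_t^2)^{1/2}
    v = np.quantile(stats, 1.0 - 1.0 / p ** delta)
    return (alpha + 1) / (alpha - 1) * v / np.sqrt(n0)
```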
To demonstrate the robustness of our method, we also simulate data from heavy-tailed distribution such as the Laplace distribution and present the results in Section 2.6.4. Throughout the simulations, the number of repetition is 100 and the parameter = sim is picked from equation (2.25) using simulation with = 1 and =1. We comment that one can also use the choice of in Theorem 2.5.1, which has similar but slightly worse performance, compared with sim . 2.6.1 Model Settings We consider two dierent model settings for generating precision matrices (t) = (! (t) ij ), t2 [k]. In both cases, the block diagonal structure is used to introduce sparsity into the precision matrices in the sense that entries outside of the diagonal blocks are all equal to zero. In Model I, all k precision matrices share the same block diagonal structure and all diagonal blocks have the same size. For a xed pair (i;j) withi6=j, if the (i;j) entry is inside of a diagonal block, then! (1) ij ; ;! (k) ij are independently simulated from the uniform distribution U[0:2; 0:4] or U[0:6; 1:2], depending on whether it belongs to the upper half diagonal blocks or the lower half diagonal blocks. All other o-diagonal entries within the diagonal blocks are simulated similarly and independently. Finally, the diagonal entries are set to be 1 for the upper half diagonal blocks and 3 for the lower half. 40 Note that in the above Model I, for the (i;j) entry, the vector (! (1) ij ; ;! (k) ij ) 0 is either a vector of zero or has nonzero values across all k coordinates. To introduce more variety into the sparsity pattern, we further consider Model II, which diers from Model I only in the data generating scheme of entries inside the diagonal blocks. For an entry (i;j) with i6= j inside a diagonal block, we rst ip a fair coin. If it is a head then the vector (! (1) ij ; ;! (k) ij ) 0 is generated in the same way as that for Model I. If it is a tail, then we randomly draw an integer k 0 from the uniform distribution over 1; 2; ;k and then set ! (t) ij = 0 for all t6=k 0 and generate ! (k 0 ) ij from U[0:2; 0:4] orU[0:6; 1:2], depending on whether (i;j) belongs to the upper half diagonal blocks or lower half diagonal blocks. It is seen that Model II is sparser than Model I. For each model, we consider three dierent sets of parameters by varying the di- mension p and group size k whereas x the sample size at n (t) = n = 100, t2 [k] for Model I and n (t) =n = 200, t2 [k] for Model II. We also x the block size at 8. 2.6.2 Testing Results We demonstrate the performance of our proposed methods with data simulated from the above described model settings. For each simulated data set and each pair (i;j) withi6=j, we apply the Chi-based test 2 and linear functional-based test 1 with sign vector = (1; 1) 0 to detect whether edges exist between nodes i and j for any of the k graphs. The signicance level is set to be = 0:05. We use two dierent methods to calculate the critical values. The rst method calculates the critical values using the asymptotic null distributions derived in Theorems 2.4.1 and 2.4.2. The corresponding critical values are named as \Theoretical" in Tables 2.1 and 2.2. The second method, named as \Empirical" in Tables 2.1 and 2.2, computes the critical values empirically based on the values of U n;k;i;j for 2 (or V n;k;i;j () for 1 ) of the entries outside the diagonal blocks. 
More specically, since the entries outside the diagonal blocks are 41 all equal to zero across all k graphs, the 5% critical value can be calculated as the 95th-percentile of the pooled test statistics for all such null entries. Note that the \Empirical" critical value relies on the knowledge of true nulls, and thus can only be calculated in simulation studies. The purpose of using both methods is to compare the \Theoretical" values with the \Empirical" ones to justify our ndings on the null distributions in Theorems 2.4.1 and 2.4.2. With these critical values, we can calculate the false negative rates (FNR) and false positive rate (FPR). It is obvious that with the \Empirical" critical value the FPR should be exactly 5% and thus we omit its values in the table. The FNR based on both critical values are reported. In fact, it is seen from Tables 2.1 and 2.2 that the \Theoretical" values are very close to \Empirical" values, indicating that the asymptotic null distributions we found describe the empirical distributions very closely. The numbers reported in Tables 2.1 and 2.2 are mean values with standard errors in the parentheses. To better evaluate these methods, we then vary the critical value and generate a full ROC curve. The areas under the ROC curves are reported in the last columns of Tables 2.1 and 2.2. It is seen that both methods have close to 1 areas under ROC curves in all settings. 42 Table 2.1: Testing results for model I with = 0:05. Mean values are reported with standard errors in parentheses. k p FNR (10 2 ) FPR ROC area Empirical Theoretical (10 2 ) (10 2 ) Setting 1 5 50 0.375(0.484) 0.369(0.454) 5.044(0.656) 99.90(0.078) 1 Setting 2 10 50 0(0) 0(0) 4.945(0.752) 100(0) Setting 3 10 200 0.001(0.014) 0.001(0.014) 5.005(0.170) 100(0) Setting 1 5 50 3.268(1.568) 3.161(1.422) 5.123(0.722) 99.26(0.319) 2 Setting 2 10 50 0.006(0.060) 0.006(0.060) 5.352(0.751) 100(0.010) Setting 3 10 200 0.077(0.100) 0.077(0.098) 4.896(0.177) 99.97(0.019) 43 Table 2.2: Testing results for model II with = 0:05. Mean values are reported with standard errors in parentheses. k p FNR FPR ROC area Empirical Theoretical (10 2 ) (10 2 ) Setting 1 5 50 0.226(0.043) 0.224(0.038) 5.151(0.821) 94.54(1.346) 1 Setting 2 10 50 0.327(0.041) 0.327(0.038) 5.046(0.932) 90.26(2.07) Setting 3 10 200 0.306(0.017) 0.305(0.016) 5.04(0.233) 91.12(0.771) Setting 1 5 50 0.066(0.019) 0.064(0.017) 5.125(0.747) 98.42(0.520) 2 Setting 2 10 50 0.099(0.021) 0.094(0.020) 5.416(0.750) 97.66(0.560) Setting 3 10 200 0.090(0.010) 0.090(0.010) 5.017(0.149) 97.79(0.302) 44 From Table 2.1, we see that the testing procedure 1 is signicantly better than 2 in all settings under model I. From setting 1 to setting 2, both testing procedures become better while from setting 2 to setting 3, both procedures become worse. These are consistent with our theoretical results. To understand this, taking the entry (1; 2) for example, the separating rate for alternative H l1 1;12 (s;;) with the corresponding optimal test 1 isk! 0 12 k 1 n p k=n (0) . Since the entries of ! 0 12 are i.i.d. from uniform distribution U[0:2; 0:4], as k increases, the separating rate condition becomes weaker becausek! 0 12 k 1 grows linearly with k while the right hand side n p k=n (0) grows at a slower rate of p k. Thus, the growth ofk makes the separating rate condition easier to be satised. From setting 2 to setting 3, the dimensionp becomes much larger, and thus the area under ROC becomes smaller. The same explanation applies to the Chi-based test 2 . 
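For reference, the "Empirical" calibration and the error rates just described can be computed as in the following sketch (the names and the upper-tail rejection convention are illustrative choices consistent with the description above):

```python
import numpy as np

def empirical_cutoff_and_rates(stats, is_null, alpha=0.05):
    """Empirical critical value and error rates from pooled null statistics.

    stats   : array of test statistics (U_{n,k,i,j} or V_{n,k,i,j}) for all pairs i < j
    is_null : boolean array, True for entries outside the diagonal blocks
              (true nulls across all k graphs)
    """
    # 5% empirical critical value = 95th percentile of the pooled null statistics
    cutoff = np.quantile(stats[is_null], 1.0 - alpha)
    reject = stats > cutoff                  # assumed upper-tail rejection region
    fpr = reject[is_null].mean()             # close to alpha by construction
    fnr = (~reject[~is_null]).mean()         # missed true edges
    return cutoff, fpr, fnr
```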
Comparing Table 2.1 with Table 2.2, we see that the performances of testing pro- cedures 2 and 1 both become worse. This is reasonable because Model II is sparser than Model I and the separating rate conditions are harder to be satised for these sparser entries with only one nonzero value acrossk graphs because this nonzero entry needs to have magnitude much larger than n p k=n (0) for 1 or n p k 1=2 =n (0) for 2 . Thus, dierent from Table 2.1, the separating rate conditions become easier for denser entries with all k nonzero values as k increases, whereas these conditions become more stringent for sparser entries with only one nonzero value as k increases. This increased diculty for sparser entries is more sever for linear functional-based test 1 than for Chi-based test phi 2 in view of the separating rates n . 2.6.3 Precision Matrices Estimation As mentioned at the beginning of this section, almost all existing methods on multiple graphs focus on the estimation part. To compare with these existing methods, we 45 modify our procedure to generate sparse estimates of the precision matrices. More specically, we propose a two-step procedure. In the rst step, for each entry (i;j) with i6=j, we conduct hypothesis testing at signicance level to see if the null hypothesis H 0;ij is rejected or not. The critical values at signicance level are calculated using the asymptotic distribution obtained in Theorems 2.4.1 and 2.4.2. In the second step, for eachi2 [p], we estimate the (i;i) entry of thetth graph as ^ ! (t) ii and for each rejected null hypothesis H 0;ij , we estimate the (i;j) entry of the tth graph as^ ! (t) ii ^ ! (t) jj T (t) n;k;i;j , suggested by equation (2.12) where the notation is the same as previous section. It is seen that our two-step procedure has one tuning parameter, the signicance level . To tune , we generate an independent validation set with the same sample size n (t) = n = 100 in Model I and n (t) = n = 200 in Model II for all t2 [k]. Then for each given value of , we obtain a set of sparse estimates ^ 0 = ( ^ (1) ; ; ^ (k) ) of the k graphs using the training data and calculate the value of loss function L( ^ 0 ) = k X t=1 n log(det( ^ (t) )) tr( ^ (t) ^ (t) ) o ; (2.28) where and ^ (1) ; ; ^ (k) are the sample covariance matrix estimates using the valida- tion data. The value of is then chosen by minimizing the loss function over a grid of 10 values for . We consider three comparison methods MPE, GGL and FGL, each with one reg- ularization parameter to tune. For fair comparison, for each method we use the same validation set to tune the regularization parameter and choose the one minimizing the loss function in (2.28) over a grid of 10 values. 46 Table 2.3: Estimation Results for Model I. Setting k p Method ` 1 ` 2 ` F Setting 1 5 50 1 4.968(0.041) 3.417(0.036) 6.657(0.036) 2 5.68(0.070) 3.894(0.081) 7.578(0.131) MPE 7.556(0.024) 6.347(0.056) 11.53(0.083) GGL 8.331(0.009) 7.289(0.005) 13.05(0.005) FGL 7.989(0.046) 7.247(0.044) 13.13(0.069) Setting 2 10 50 1 5.117(0.102) 3.281(0.103) 6.416(0.194) 2 5.191(0.104) 3.333(0.108) 6.542(0.202) MPE 7.075(0.022) 5.618(0.048) 10.44(0.070) GGL 8.193(0.006) 7.241(0.005) 12.98(0.010) FGL 8.132(0.003) 7.461(0.003) 13.36(0.004) Setting 3 10 200 1 5.84(0.096) 3.997(0.116) 14.3(0.474) 2 6.466(0.111) 4.674(0.142) 16.79(0.594) MPE - - - GGL 8.467(0.006) 7.489(0.003) 27.01(0.003) FGL - - - 47 Table 2.4: Estimation Results for Model II. 
Setting k p Method ` 1 ` 2 ` F Setting 1 5 50 1 3.651(0.035) 2.091(0.018) 4.723(0.023) 2 3.368(0.045) 2.042(0.023) 4.392(0.043) MPE 4.909(0.020) 3.289(0.015) 6.668(0.018) GGL 7.087(0.009) 5.155(0.004) 9.653(0.005) FGL 6.748(0.007) 4.942(0.004) 9.563(0.006) Setting 2 10 50 1 3.095(0.018) 1.898(0.009) 4.213(0.011) 2 3.019(0.020) 1.878(0.011) 4.099(0.013) MPE 3.613(0.013) 2.264(0.010) 4.325(0.014) GGL 5.708(0.006) 4.325(0.003) 8.238(0.004) FGL 5.606(0.005) 4.27(0.003) 8.228(0.004) Setting 3 10 200 1 6.035(0.077) 2.7(0.018) 11.18(0.078) 2 5.595(0.085) 3.448(0.061) 15.19(0.306) MPE - - - GGL 6.976(0.005) 5.195(0.004) 18.23(0.004) FGL - - - 48 To evaluate the performance of dierent methods, we calculate the matrix 1-norm loss, spectral norm loss and the Frobenius norm loss of the estimation errors, and denote them as ` 1 , ` 2 and ` F respectively. The mean values are reported in Tables 2.3 and 2.4 with standard errors presented in parentheses. For setting 3 of both Model I and Model II, the computational cost for FGL and MPE are so large that we could not nish all 100 simulations within a reasonable amount of time so the results are not reported. To give readers a better idea on the computational cost of various methods, we record the average computation time for each method over 100 repetitions and present the results in Table 2.5. Since the computational cost of 1 is almost identical to that of 2 , only the results for 2 are reported. It is seen from Table 2.3 that in all three settings, both 2 and 1 outperform FGL and GGL signicantly. We also observe that in setting 1, 1 outperforms MPE and 2 is comparable to MPE while in setting 2, 2 and 1 outperform MPE signicantly. Similar phenomenons can be observed from Table 2.4. In view of the computational cost presented in Table 2.5, our methods are much faster than MPE and FGL in all settings. Thus, the overall performance of our methods are superior to all three competitors. Note that setting 1 diers from setting 2 only in the number of graphs k. Thus, it is fair to conclude that compared to other methods, our methods have greater advantages in estimating a large number of graphs simultaneously, which is in line with our theoretical ndings that our methods allow k to diverge with p at a faster rate. 49 Table 2.5: Average computation cost measured by CPU time in seconds. Setting 1 (10 0 ) Setting 2 (10 1 ) Setting 3 (10 2 ) 2 MPE GGL FGL 2 MPE GGL FGL 2 MPE GGL FGL Model I 7.2 57.7 9.2 64.8 2.1 8.7 2.6 13.5 3.9 36.7 3.7 18.2 Model II 18.1 69.8 18.2 44.4 3.0 10.0 3.5 28.7 6.8 38.6 5.9 23.1 50 2.6.4 Heavy-tailed Distribution The purpose of this section is to investigate the robustness of our method to heavy- tailed distributions such as the Laplace distribution. For each of the previously intro- duced model settings, after generating the precision matrix (t) , instead of simulating the data matrix X (t) from the multivariate normal distribution with mean vector 0 and covariance matrix ( (t) ) 1 , we simulate X (t) from the multivariate Laplace distribution with covariance matrix ( (t) ) 1 . More specically, we rst generate a random vector whose components are i.i.d Laplace random variables with location parameter 0 and scale parameter 1= p 2, and then multiply this vector by ( (t) ) 1=2 to obtain the desired Laplace random vector. All other settings are the same as in previous section. Table 2.6 presents the testing results of our methods for the Laplace distribution. 
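Before discussing the results, the Laplace data generation just described can be sketched as follows (an illustrative helper; we use the symmetric square root of $(\Omega^{(t)})^{-1}$):

```python
import numpy as np
from scipy.linalg import sqrtm

def simulate_laplace_sample(Omega, n, rng=None):
    """Draw n rows with covariance Omega^{-1} and heavy-tailed Laplace components."""
    rng = np.random.default_rng() if rng is None else rng
    p = Omega.shape[0]
    # i.i.d. Laplace(0, 1/sqrt(2)) entries have unit variance
    Z = rng.laplace(loc=0.0, scale=1.0 / np.sqrt(2), size=(n, p))
    # multiply by (Omega)^{-1/2} so that the covariance becomes Omega^{-1}
    Sigma_half = np.real(sqrtm(np.linalg.inv(Omega)))
    return Z @ Sigma_half
```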
Compared to Table 2.1 and Table 2.2, we observe that across all settings of Models I and II, the performance of our methods stays almost the same when the normal distribution is replaced with Laplace distribution, demonstrating the robustness of our methods to the heavy-tailed distributions. We also explored other heavy-tailed distributions such as thet-distribution with degrees of freedom 5 and the results are very similar. To save space, these results are not presented here but are always available upon request. 51 Table 2.6: Testing Results for Laplace distribution with = 0:05 Model I k p FNR(10 2 ) FPR ROC area Empirical Theoretical (10 2 ) (10 2 ) Setting 1 5 50 0.345(0.480) 0.357(0.440) 4.986(0.723) 99.91(0.068) 1 Setting 2 10 50 0(0) 0(0) 5.089(0.991) 100(0) Setting 3 10 200 0(0) 0(0) 5.03(0.172) 100(0) Setting 1 5 50 3.012(1.555) 2.810(1.438) 5.293(0.669) 99.32(0.287) 2 Setting 2 10 50 0(0) 0(0) 5.701(0.824) 100(0.004) Setting 3 10 200 0.066(0.094) 0.063(0.094) 5.073(0.171) 99.98(0.016) Model II k p FNR FPR ROC area Empirical Theoretical (10 2 ) (10 2 ) Setting 1 5 50 0.226(3.594) 0.226(3.414) 5.046(0.973) 94.49(1.311) 1 Setting 2 10 50 0.317(3.765) 0.319(3.497) 5,011(0.908) 90.88(1.806) Setting 3 10 200 0.309(1.574) 0.308(1.567) 5.048(0.219) 91.03(0.766) Setting 1 5 50 0.069(0.020) 0.066(0.019) 5.388(0.854) 98.43(0.512) 2 Setting 2 10 50 0.093(0.020) 0.090(0.019) 5.375(0.725) 97.66(0.629) Setting 3 10 200 0.089(0.010) 0.088(0.010) 5.083(0.177) 97.83(0.320) 52 2.7 Real Data Analysis In this section, we demonstrate the performance of our methods using a real data set on the epithelial ovarian cancer. As introduced in [24], ovarian cancer has six molecular subtypes, which are noted as C1 to C6 following the notations in [24]. It is discovered in the same paper that there is a signicant dierence in expression levels of genes associated with stromal and immune cell types between C1 and other subtypes. It has also been discovered that C1 patients suer from a lower survival rate. We consider the RNA expression data containing n (1) = 78 patients from C1 subtype and n (2) = 113 patients from all other subtypes combined together. The number of genes under consideration is p = 87. Our purpose is to recover the graphs of genes related to the apoptosis pathway from the KEGG database [25] for disease subtype C1 and other subtypes combined so that we can identify which genes are crucial in both disease subtype C1 and all other subtypes combined. Thus, the number of graphs we aim to recover is k = 2. We apply our proposed methods to this afore-introduced data set with signicance level = 0:001. For an entry (i;j) with i6= j, if the corresponding null hypothesis H 0;ij is rejected then we think there is an edge connecting node i and node j in at least one of the two graphs. We present the connectivity structures identied by 1 and 2 in Figure 2.1. We further want to nd out which nodes are crucial in dening the connectivity structure identied in Figure 2.1. Motivated by the denition of central nodes introduced in [22], we dene important nodes as the ones with largest degrees in the graphs presented in Figure 2.1. Table 2.7 lists the top 10 nodes with highest degrees identied by 1 and 2 . Since two graphs are considered, there are two possible sign vectors (1; 1) 0 and (1;1) 0 up to a single sign for our linear functional- based test 1 . Without the knowledge of the sign vector, we test both relationships, 53 i.e., sum and subtraction, and conduct the corresponding two-sided tests. 
However, the result of subtraction is not convincing since the corresponding graph is too sparse. The largest degree among all nodes is 4 and second largest degree is 2. All the other degrees are less or equal to 1. Thus we only present the result of sum. Among these genes, 1L1B, MYD88, NFKB1 and PIK3R5 have been identied as key genes and been implicated in ovarian cancer risk or progression [22, 26]. In addition, BIRC3 and FAS have been proved to function importantly in ovarian cancer and in particular it has been discovered that upregulation of FAS reverses the development of resistance to Cisplatin in epithelial ovarian cancer [27, 28], which demonstrates the importance of the nodes identied by our method in ovarian cancer. 54 Figure 2.1: Common edges between C1 and other types identied by 1 and 2 55 Table 2.7: Top 10 nodes with highest degrees identied by 1 and 2 in descending order Method Nodes 1 MYD88, NFKB1, CSF2RB, PIK3R5, FAS, PIK3CG, TRADD, BIRC3, IL1B, NFKBIA 2 NFKB1, MYD88, CSF2RB, PIK3R5, BIRC3, PIK3CG, FAS, IL1B, CAPN1, NFKBIA 56 Chapter 3 Robust Subspace Classication 3.1 Introduction In machine learning and statistics, classication is dened as the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known. An example would be assigning a given email into "spam" or "non-spam" classes or assigning a diagnosis to a given patient as described by observed characteristics of the patient. You may also wish to apply classication techniques to predict whether the weather of tomorrow will be sunny, rainy or snowy [34]. In the terminology of machine learning, classication is considered an instance of supervised learning, i.e. learning where a training set of correctly identied observa- tions is available. The corresponding unsupervised procedure is known as clustering. Often, the individual observations are analyzed into a set of quantiable properties, known variously as explanatory variables or features. Some classic parametric classi- ers are SVM, Naive Bayes and Logistic Regression. Classic non-parametric classier isk Nearest Neighbors (kNN).kNN is an algorithm that classies the new cases based on distance which is dened as similarity measures or distance measures with existing observations. In real classication problems, it is well known that in many cases the data within each class approximately lies on a lower dimensional subspace of the feature space. That is to say, even though the number of feature may be extremely large, such as 57 gene expression data, only a few features determines the observation. Under this case, instead of exploring every feature, it is more attractive if we could identify these dominating features and explore their behavior. Based on this assumption, we dene this type of classication problems as subspace classication. Subspace classication has a wide application in real life such as gene expression and pattern classication. Subspace methods in data analysis was rst studied in the 1930s by Hotelling [29]. Kramer and Mathews [31] observed the importance of the subspace methods in data compression and optimal reproduction in the 1950s. Ten years later, Watanabe et al. [32] published their paper in pattern classication which is regarded as the rst application of subspace method. With the pioneering work of Kohonen et al. [30], Learning subspace methods have been widely developed up to now. 
However, the subspace classification problem has seldom been studied. We want to take advantage of this structure and develop a robust subspace classification method, so in this chapter we present our algorithm for subspace classification. The rest of this chapter is organized as follows. We first present the mathematical formulation of subspace classification and then present our algorithm. We prove that, asymptotically, our method achieves perfect classification, and numerical examples illustrate its superiority.

3.2 Methodology

We first explain the notation used later. Suppose we have $l$ subspaces $S_1, S_2, \ldots, S_l$ of $\mathbb{R}^n$ of dimensions $d_1, d_2, \ldots, d_l$. Each subspace $S_i$ has a set of linearly independent basis vectors $\{u_1, \ldots, u_{d_i}\}$, which can be combined into an $n \times d_i$ matrix $U_i$ that therefore has rank $d_i$, for all $i \le l$. Thus
$$x_i = U_i z, \qquad z \in \mathbb{R}^{d_i}, \; x_i \in \mathbb{R}^n,$$
which means that, given a subspace basis $U_i$ and a subspace vector $z \in \mathbb{R}^{d_i}$, we can map it into $\mathbb{R}^n$ to obtain $x_i \in \mathbb{R}^n$. We say a sample $y_i \in \mathbb{R}^n$ comes from the $i$th subspace if it follows
$$y_i = x_i + \varepsilon, \qquad \varepsilon \in \mathbb{R}^n, \; x_i \in \mathbb{R}^n,$$
where $x_i$ is defined as above and $\varepsilon$ is independent noise with mean zero. The inverse signal-to-noise ratio $\sigma^2$ is defined as $\sigma^2 = \max E\|\varepsilon\|_2^2 / \|x\|_2^2$, and we further assume that $\sigma$ is bounded above by a positive constant.

We say that samples generated from the same subspace belong to the same category. Suppose we have $p$ samples from every category and we know the label of each sample. We write $y_i^j$ for the $j$th sample from the $i$th category, for all $i \le l$ and $j \le p$. Thus we have training data $Y$ of dimension $lp \times n$, where
$$Y = \begin{pmatrix} y_1^0 \\ y_2^0 \\ \vdots \\ y_l^0 \end{pmatrix}
\qquad\text{and}\qquad
y_i^0 = \begin{pmatrix} y_i^1 \\ y_i^2 \\ \vdots \\ y_i^p \end{pmatrix}.$$

Our subspace classification problem is then: given the training data $Y$, determine to which category, i.e. to which of the subspaces $1$ to $l$, a new observation $y_0$ belongs. To attack this problem, we want to make full use of the fact that the observations are generated from subspaces. Our idea is that, if no two subspaces are approximately parallel, a new observation is best explained by a linear combination of the existing observations from the same category rather than from other categories: if we remove the noise from each observation, a new observation lies exactly in the subspace spanned by the existing observations from its own category. As long as the noise level is controlled, we can still expect that, when no two subspaces are approximately parallel, a new observation lies closest to the span of the samples from its own category. If some subspaces are approximately parallel, however, this idea does not work, since the subspaces spanned by those categories are not distinguishable. We therefore add a restriction on the angle between any two subspaces, which is introduced later.

3.3 Algorithm

Given the above idea, when no two subspaces are approximately parallel, we want to determine which subspace a new observation is closest to. A naive implementation would compute the subspace spanned by each category and the corresponding distance. However, this is not feasible in the high-dimensional setting. First, the computational cost is huge because the dimension is high.
In particular, when the training set is very large, the computational cost is of the same order as the square of the number of observations. Second, there is no well-established procedure for computing the empirical subspace from the existing observations. Moreover, in the high-dimensional setting errors accumulate, so such a calculation can be as bad as random guessing. This naive approach is therefore not a good way to implement the subspace idea.

To avoid these drawbacks, we implement our idea through regularized regression. As introduced in the previous chapters, regularized regression handles sparse modeling when the dimension is high. Specifically, the Lasso, i.e. $L_1$-regularized regression, enjoys the following properties: it places an $L_1$ penalty on the regression coefficients and thus yields sparsity; when the tuning parameter $\lambda$ is large enough, all coefficients $\beta$ are set to 0; and as $\lambda$ decreases, the support of $\beta$ grows one coefficient at a time. The group Lasso enjoys the same property at the group level. To ease the presentation, we introduce two extensions of the Lasso, the group Lasso and the sparse group Lasso. The group Lasso is formulated as
$$\arg\min_{\beta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \|Y - X\beta\|_2^2 + \lambda \sum_{k=1}^{K} \sqrt{p_k}\, \|\beta_{(k)}\|_2 \Big\},$$
and the sparse group Lasso is formulated as
$$\arg\min_{\beta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \|Y - X\beta\|_2^2 + (1-\alpha)\lambda \sum_{k=1}^{K} \sqrt{p_k}\, \|\beta_{(k)}\|_2 + \alpha\lambda \|\beta\|_1 \Big\}, \qquad (3.1)$$
where $\alpha$ is a positive parameter less than 1. The difference between the two is that the sparse group Lasso adds an extra $L_1$ penalty, so it yields solutions that are sparse at both the group level and the individual feature level.

Recall that the distance between an observation and a subspace can be modeled as the loss of a regression. We therefore consider a regression model in which the new observation is the dependent variable and a transform of the existing labeled training data supplies the covariates. This is a new idea and differs from traditional regression: traditionally, the samples are the row observations and their features are the covariates, whereas here each sample acts as a feature and the feature values across all samples act as the row observations. Recall that a nonzero coefficient indicates that the corresponding feature is important. The intuition is geometric: we expect the coefficients of the samples from the same category to be significantly nonzero, since the new observation should be close to the linear subspace spanned by those samples. To determine which coefficients are significantly nonzero, we use the sparse group Lasso to solve this high-dimensional regression.

To assign a label to a new high-dimensional observation, we regard it as the response variable and the labeled observations as covariates. We fit a linear regression with the sparse group Lasso penalty, which penalizes the norm of each candidate group, and nonzero regression coefficients indicate that the corresponding group is selected. We increase the penalty until only one group is selected and assign that group's label to the new observation. We also provide theoretical support for our method, showing that under certain conditions its accuracy tends to 100 percent in probability asymptotically; the sparse group Lasso penalty is used in the proof of the corresponding theorem.
Thus, given the training set $Y$ and a new observation $Y_0$, our classification method consists of two steps. Let $Y = Y_0^T$ and $X = Y^T$ in (3.1), which is equivalent to the following:
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^{lp}} \Big\{ \frac{1}{2n} \|Y_0^T - Y^T\beta\|_2^2 + (1-\alpha)\lambda \sum_{k=1}^{l} \sqrt{p}\, \|\beta_{(k)}\|_2 + \alpha\lambda \|\beta\|_1 \Big\}, \qquad (3.2)$$
where $\alpha$ is a positive parameter less than 1 and
$$\beta_{(k)} = \big(\beta_{(k-1)p+1}, \beta_{(k-1)p+2}, \ldots, \beta_{kp}\big)' \in \mathbb{R}^{p}.$$
This means that the coefficients belonging to the same category are treated as one group of features, so there are $l$ groups of features in total in (3.2). We initially set $\lambda$ large enough that all coefficients $\beta$ are 0, then gradually decrease $\lambda$ until the first group of features is selected, record the label of the selected group, and assign that label to the new observation.

Compared with other classification methods, our method has several advantages:

- No parameters to tune. Although the algorithm updates $\lambda$, there is no tuning parameter in the model, so we avoid the tuning step, which is often a substantial cost, and the result is not influenced by the choice of a tuning parameter.

- Computational efficiency. As explained above, our method needs no tuning, which saves a great deal of computation. Moreover, the core of our method is solving a sparse group Lasso, for which fast-converging implementations already exist, so the method is computationally efficient.

- Easy to interpret, with geometric theoretical support. Our idea has a solid geometric foundation, and our theoretical result translates this geometric insight into a quantitative analysis. We expect the algorithm to achieve perfect performance as the sample size increases.

- Handles high-dimensional data well. Classical methods fail as the number of features increases. In our algorithm, however, features are transformed into observations, so a larger number of features means more observations, which improves performance. This turns the usual difficulty of increasing dimension into an advantage.

3.4 Numerical Study

In this section, we investigate the numerical performance of our method on both simulated data sets and real data sets. Since our method has no parameter to tune, to provide a fair comparison we evaluate the performance of the nearest neighbour (NN) and $k$ nearest neighbour (kNN) classifiers on the same data.

3.4.1 Simulation Study

First Model

For the first model in the simulation study, we generate three settings. For each category in every setting, we generate an $n \times d$ matrix whose entries are independently drawn from the standard normal distribution, and we compute an orthonormal basis of its range as our $U$, following the notation of the Methodology section. We then generate a $d \times 1$ vector $z$ whose entries are also independently drawn from the standard normal distribution and normalize $z$ to have norm 1. This gives $x = Uz$, and the sample is obtained as $y = x + \varepsilon$.
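The per-category data generation just described, together with a simplified version of the classification rule in (3.2), can be sketched as follows. Two caveats: the Gaussian form of the noise with per-coordinate scale $\sigma/\sqrt{n}$ is an assumption, chosen so that $E\|\varepsilon\|^2 = \sigma^2\|x\|^2$ in line with the inverse signal-to-noise ratio of Section 3.2; and the classifier below drops the extra $\ell_1$ term in (3.2) (a pure group penalty), in which case the first group to enter the group Lasso path as $\lambda$ decreases is the one maximizing $\|X_{(k)}^T y\|_2$ (all groups have the same size $p$), so the rule reduces to a maximum group-correlation score rather than the exact sparse group Lasso path. The function names are illustrative.

```python
import numpy as np

def simulate_subspace_samples(n, d, p, sigma, rng=None):
    """Draw p noisy samples from one random d-dimensional subspace of R^n."""
    rng = np.random.default_rng() if rng is None else rng
    # orthonormal basis of the range of an n x d Gaussian matrix
    U, _ = np.linalg.qr(rng.standard_normal((n, d)))
    samples = np.empty((p, n))
    for j in range(p):
        z = rng.standard_normal(d)
        z /= np.linalg.norm(z)                 # normalize z to unit norm
        x = U @ z                              # noiseless signal in the subspace
        # assumed noise model: N(0, (sigma^2 / n) I), so E||eps||^2 = sigma^2 ||x||^2
        samples[j] = x + (sigma / np.sqrt(n)) * rng.standard_normal(n)
    return samples

def classify_subspace(Y_train, labels, y_new):
    """Simplified rule from (3.2) with the pure group penalty: assign the label
    of the group maximizing ||X_(k)^T y||_2, i.e. the first group that would
    enter the group Lasso path as lambda decreases from a large value."""
    scores = {lab: np.linalg.norm(Y_train[labels == lab] @ y_new)
              for lab in np.unique(labels)}
    return max(scores, key=scores.get)
```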
To evaluate the performance of all three methods, we generate an independent testing set whose number of samples for each label is also p = 10. All other parameter are the same with the training set. We generate 100 repeats for all three settings independently. Note that for our method and NN, there is no parameter to tune. For kNN, we need to tune the number of nearest neighbour k, which we use ve fold cross validation to tune. To measure the performance, we calculate the False labeled Rate for each repeat and compute the average across 100 repeats. Standard errors are in parentheses. The results are presented in table 3.1. From the table we could observe that our method is superior than others signi- cantly. When l = 2, which means it is binary classication problem, our false labelled rate is less than 0.05 while others are both around 0.2. Whenl increases to 5, our false labelled rate is just around 0.1 while others increase to around 0.4. Our false labelled rate is just around one quarter of those of other methods. When l increases to 11, our false labelled rate increases to around 0.15 while others increase to around 0.5. This means under this setting, NN and kNN are as bad as random guessing. 65 Table 3.1: Simulation results for subspace classication False labeled Rate l Our kNN NN Model I 2 0.0455(0.0439) 0.1985(0.0911) 0.2165(0.0927) Model II 5 0.108(0.0495) 0.3896(0.0703) 0.415(0.0815) Model III 11 0.1577(0.0394) 0.5155(0.0543) 0.5373(0.0555) 66 Second Model For our second model in simulation study, we also generate three settings. Across all settings, we set noise level = 2, dimension of every subspaced = 100, total dimension n = 1000 and training samples for each label p = 10. The only parameter that is dierent across all settings is the number of category equal to l = 2; 5; 11. To evaluate the performance of all three methods, we generate an independent testing set whose number of samples for each label is alsop = 10. All other parameter are the same with the training set. We generate 100 repeats for all three settings independently. Note that for our method and NN, there is no parameter to tune. ForkNN, we need to tune the number of nearest neighbourk, which we use ve fold cross validation to tune. The only dierence between the second model and the rst model is that we increase the noise level. To measure the performance, we calculate the False labeled Rate for each repeat and compute the average across 100 repeats. Standard errors are in parentheses. The results are presented in table 3.2. From the table we could observe that our method is still superior than others signicantly. When l = 2, which means it is binary classication problem, our false labelled rate is less than 0.18 while others are both around 0.4. When l increases to 5, our false labelled rate is just around 0.37 while others increase to around 0.66. When l increases to 11, our false labelled rate increases to around 0.5 while others increase to around 0:8. 67 Table 3.2: Simulation results for subspace classication for second model False labeled Rate l Our kNN NN Model I 2 0.179(0.09) 0.3795(0.11) 0.3995(0.131) Model II 5 0.3696(0.074) 0.6666(0.068) 0.6802(0.064) Model III 11 0.496(0.049) 0.795(0.078) 0.78(0.066) 68 3.4.2 Real data analysis We also evaluate the performance of our method on a real data set. We apply both our method and nearest neighbour on the Single Proton Emis- sion Computed Tomography (SPECT) images data. And it is available at https://archive.ics.uci.edu/ml/datasets/SPECT+Heart. 
The data set describes cardiac diagnoses. Each patient is classified into one of two categories, normal or abnormal. The database consists of a training set with 80 samples and a testing set with 187 samples, and the labels of all 267 samples are given. The training set contains 40 normal and 40 abnormal samples. For each patient, 44 features were created and there are no missing values. As in the simulation study, we apply both our method and the nearest neighbour classifiers to this data set. The results are shown in Table 3.3: our method mislabels only 66 of the 187 testing samples, whereas kNN mislabels 80 and NN mislabels 74. Together with the simulation study, this shows that our method performs significantly better than NN and kNN numerically.

Table 3.3: Results of SPECT data

                     Our   kNN   NN
False labeled number  66    80   74

Chapter 4
Interaction Screening for Logistic Regression

4.1 Introduction

Like other classification methods, logistic regression is a powerful tool for identifying the category to which an observation belongs. Unlike other classification methods, its model works with probabilities, describing the likelihood that an observation belongs to a specific category. For binary logistic regression, the dependent variable $Y$ is either 0 or 1, and instead of predicting whether $Y=1$ or $0$ directly, the model specifies $P(Y=1)=\frac{e^{\theta}}{1+e^{\theta}}$, where the parameter $\theta$ is a function of the features $X$; it is typically assumed that $\theta=\sum_{i=1}^{p}\beta_i X_i$. One then uses the training set to learn the coefficients $\beta$ and thereby obtain an estimated logistic regression model.

However, there is no explicit solution for fitting the logistic regression model, so several numerical approaches have been developed. Traditionally, gradient descent and stochastic gradient descent are widely used, but these methods break down, because of computational cost and error accumulation, when the dimension is high, which is the typical case in modern high-dimensional research. Fan and Lv [5] used regularization, adding a penalty to obtain a sparse solution. In analogy with linear regression, it is intuitive that a variable screening step helps when the dimension is high, and Fan and Lv [2] proposed a screening step that first reduces the feature space to a much smaller dimension.

However, most existing methods do not work when there are interaction terms in the model. That is, they fail to handle the case where interactions among features are essential, namely $\theta=\sum_{i=1}^{p}\beta_i X_i+\sum_{i,j} r_{ij} X_i X_j$. Interactions are believed to play an important role in real applications such as gene expression. A naive approach is to treat all pairwise interactions among features as new features and add them to the model. The problem is that the number of new features is then of the same order as the square of the original number of features, so when the dimension is high, solving the model becomes very challenging. A similar problem has been considered in Fan et al. [3]. In this chapter, we develop a fast and accurate screening method and use the penalized generalized linear model introduced in [5] for selection. Our method achieves good performance in both accuracy and computation cost, as illustrated in the simulation and real data settings.
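To make the naive approach concrete, the following sketch expands all pairwise interactions and fits an $\ell_1$-penalized logistic regression; the library calls and penalty constant are illustrative choices, and this is the expensive baseline discussed above, not the screening method developed in this chapter.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

def naive_interaction_logistic(X, y, C=1.0):
    """Naive baseline: add all p(p-1)/2 pairwise interactions as features and
    fit an l1-penalized logistic regression.  With p = 1000 this creates about
    500,000 columns, which is exactly the computational burden noted above."""
    expander = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False)
    X_full = expander.fit_transform(X)   # main effects + all pairwise interactions
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    return model.fit(X_full, y)
```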
4.2 Interaction Screening Method for logistic re- gression We propose our interaction Screening Method based on the intuition that if a certain interaction is in the model, it will in uence the probability forY to be 1. That is to say, 71 if we collect all the values of a certain interaction for each category, they would possess a large dierence if this interaction is in the model. Thus, if we want to consider a certain pair (i;j), the conditional correlation between X i and X j given Y will have a signicant change compared to original correlation. Given above idea, we conduct a two step procedure for our screening method. First, we could compute the correlation matrices (c ij ) pp of the whole sample and conditional correlation (c 1 ij ) pp and (c 0 ij ) pp where c 0 ij = cor(x i ;x j jY = 0) and c 1 ij = cor(x i ;x j jY = 1). Thus for a certain interactionX i X j ;i<j, we dene its test statistics as T ij =max(jc 0 ij c ij j;jc 1 ij c ij j). After this, we determine a threshold and only test statistics larger than this threshold could lead to its corresponding interaction to be chosen in the model. After this screening step, we combine all original covariates and these new screened interactions as our new feature space. Then we apply penalized generalized linear model introduced in [5] for selection. This would provide a nal sparse logistic regression model. The advantage of our method consists of followings: Our screening method is very intuitive. The insight is pretty simple and the idea could be simply implemented by the dierence of conditional correlations. The computation cost is relatively small. As to our screening step, we just need to compute the conditional correlation matrices. This is eciently implemented in many statistical softwares. Thus the computation cost is relatively small even when the dimension is high. Selection step rets the model. Even though a certain interaction passes the screening step, it is not necessary for it to be chosen in the model. We t a selection step after the screening which helps the improvement in accuracy. 72 Following theorem would provide an insight why our method works. We will con- sider the simplest example, where the parameter is equal to = rX 1 X 2 ;r > 0. X 1 and X 2 are iid N(0; 1), so cor(X 1 ;X 2 ) = 0. Recall that our logistics model is that Y is a binary variable equal to either 0 or 1 and P (Y = 1) = e 1+e . We want to illustrate that conditional correlation between X 1 andX 2 givenY will have a signicant change compared to original correlation. So we aim to establish an inequality between the conditional correlation and a specic function to show this change. Theorem 4.2.1 (Correlation inequalities). Assume = rX 1 X 2 and X 1 and X 2 are iid N(0,1), Cor(X 1 ;X 2 jY = 1) =E(X 1 X 2 jY = 1)c r r + 1 > 0 (4.1) where when r 1 c = 1:12 and when 0<r 1 c = 0:56 Thus under the setting in 4.2.1,T 12 is signicantly larger than 0. Following theorem indicates that all other statistics would be 0. Theorem 4.2.2. Assume =rX 1 X 2 and X 1 and X 2 are iid N(0,1), T ij = 0; 8(i;j)6= (1; 2) (4.2) 4.3 Simulation studies In this section, we investigate the performance of our method in nite samples via sim- ulations. Our method consists of two procedures, screening and selection. In screening procedure we will a certain amount of interactions to be included for next step. 
Then in selection procedure we combine all explanatory variables and screened interaction 73 terms from last step as our nal explanatory variables and apply the selection algo- rithms to determine which variables to be included and its corresponding coecients. First we will evaluate our screening performance and compare to three existing meth- ods, SIS, DCSIS and SIRI. Next we will evaluate the performance after nal selection in dierent models under dierent settings. 4.3.1 Simulation settings We consider the logistic regression model where the response variable Y is either 0 or 1. And P (Y = 1) = e 1 +e where = x + P 1i<jn ij X i X j and x = (X 1 ;X 2 ; ;X p ) T is a normal random vector with its mean equal to zero and its covariance matrix as = ( ij ) pp with ij = jijj and is a constant. For each simulation, we choose the dimension of variables p = 1000 and the number of observations n = 200. For each model and setting, we run 100 simulations.We construct four dierent models and each model has two settings: = 0:25 and = 0:5.The rst three models are concerning about strong, anti and weak heredity, while the fourth model are two interactions with dierent coecients. Each model is dened as following, Model 1 (strong heredity) = 2X 1 +X 3 + 3X 1 X 3 (4.3) Model 2 (weak heredity) = 2X 1 +X 3 + 3X 1 X 4 (4.4) 74 Model 3 (anti heredity) = 2X 1 +X 3 + 3X 4 X 5 (4.5) Model 4 (interactions only with dierent coecients) = 3X 1 X 2 + 2X 3 X 5 (4.6) 4.3.2 Screening Performance In this section, we evaluate our screening performance and compare the results to three other methods, SIS, DCSIS and SIRI. For our screening procedure, since the response variable should be either 0 or 1, we could regard it as a classication problem: 0 as one group and 1 response as the other group. For each simulation, we divide the observations into two groups and compute the correlation matrix of explanatory variables inside the group. Thus for each simulation, we have a pair of correlation matrices (c 0 ij ) pp and (c 1 ij ) pp wherec 0 ij =cor(x i ;x j jY = 0) andc 1 ij =cor(x i ;x j jY = 1). We could also compute the correlation matrices among all observations, (c ij ) pp where c ij = cor(x i ;x j ). Then we dene our screening matrix as (cs ij ) pp , where cs ij = max(jc 0 ij c ij j;jc 1 ij c ij j). This screening matrix interprets how large the dierence is between the correlation and conditional correlation for a certain pair of x. Since this screening matrix is symmetric and its diagonal is zero, we just consider its upper- triangular part, which has p(p1) 2 entries in total. We rank the entries and keep the top the terms of a given threshold. From [6],the suggested number of new features to be kept is j n logn k and based on our model setting j n logn k = 37. Only if the entry cs ij is kept, the corresponding interaction term x i x j is kept for selection. 75 For each simulation, we could verify whether the true interaction terms are in- cluded in these j n logn k interactions. If a true interaction is not included in these j n logn k interactions, then this would be considered as a false negative. For each model and setting among all 100 simulations, we compute Screening probability for a true inter- action, which is dened as the ratio of how many times the true interaction is kept to the total times which is 100. The larger Screening probability is, the better a screen- ing method is. We compare the screening probability of our method to other three screening methods: SIS, DCSIS and SIRI. 
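A minimal sketch of the screening step is given below; the function name and interface are illustrative, but the statistic $T_{ij}$ and the $\lfloor n/\log n\rfloor$ cutoff follow the description above. A small usage example with data resembling Model 1 follows the function.

```python
import numpy as np

def screen_interactions(X, y, n_keep=None):
    """Interaction screening: for each pair (i, j), compute
    T_ij = max(|c0_ij - c_ij|, |c1_ij - c_ij|), where c_ij is the marginal
    correlation and c0_ij, c1_ij are the correlations within Y = 0 and Y = 1.
    Keep the pairs with the largest T_ij."""
    n, p = X.shape
    if n_keep is None:
        n_keep = int(np.floor(n / np.log(n)))     # suggested floor(n / log n)
    c_all = np.corrcoef(X, rowvar=False)
    c0 = np.corrcoef(X[y == 0], rowvar=False)
    c1 = np.corrcoef(X[y == 1], rowvar=False)
    T = np.maximum(np.abs(c0 - c_all), np.abs(c1 - c_all))
    iu = np.triu_indices(p, k=1)                  # upper triangle, i < j
    order = np.argsort(T[iu])[::-1][:n_keep]
    return list(zip(iu[0][order], iu[1][order]))  # screened pairs (i, j)

# Illustrative usage with data resembling Model 1 (strong heredity):
rng = np.random.default_rng(0)
n, p, rho = 200, 1000, 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
theta = 2 * X[:, 0] + X[:, 2] + 3 * X[:, 0] * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-theta)))
kept_pairs = screen_interactions(X, y)
```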
Table 1 and Table 2 present the screening probability of true interactions kept after screening procedure of each method. In model 1, we have strong heredity and all methods achieve near perfect screening probability. In model 2 where we only have weak heredity, our method still achieve perfect performance while other methods achieves 0.6 at most. In model 3 where we only have anti heredity, our method still achieve near perfect screening probability which is equal or close to 1. SIRI has a screening probability 0.75 in case 1, while all other probabilities in two cases are less than 0.3. In model 4 where we have only interactions, our methods outperform all other methods signicantly. While the screening probability of x 3 x 4 which has the lower coecient is still close to 1 in our methods for both cases, all other methods fail to detect x 3 x 4 among all simulations. 76 Table 4.1: Screening Probability of true interaction. Case 1 with = 0:5. screening probability Model 1 Model 2 Model 3 Model 4 x 1 x 3 x 1 x 4 x 4 x 5 x 1 x 2 x 3 x 5 our model 1 1 0.99 1 0.98 sis 1 0.51 0.04 0 0 dcsis 1 0.6 0.29 0.07 0 siri 0.99 0.44 0.75 0.54 0 Table 4.2: Screening Probability of true interaction. Case 2 with = 0:25. screening probability Model 1 Model 2 Model 3 Model 4 x 1 x 3 x 1 x 4 x 4 x 5 x 1 x 2 x 3 x 5 our model 1 1 1 1 0.95 sis 0.99 0.07 0 0.01 0 dcsis 1 0.08 0 0.01 0 siri 0.98 0.08 0.06 0.05 0 77 4.3.3 Selection Performance In this section, we will evaluate the performance after nal selection in dierent models under dierent settings from screening procedure. Note that we have kept j n logn k = 37 interaction from screening procedure, we regard each kept interaction as a new explanatory variable. Combining them with all main explanatory variables, we have p+ j n logn k = 1037 explanatory variables in total to select from. As to variable selection method, we compare the Lasso, SCAD and SICA in the threshold parameter space and we note them as Lasso t ,SCAD t and SICA t . According to the paper, these three methods produce similarly performance, so we just present the selection performance of Lasso t . The performance of our method after selection is summarized in table 3 and 4. The rst measure we use to evaluate the performance is the prediction error PE, which is dened as E(Y ~ X T ^ ) 2 . ^ is an estimate of calculated in our selection procedure. (X T ;Y ) is a new generated observation of p covariates and the response which is independent with simulation sets and ~ X is an explanatory vector formed by X and its corresponding interactions screened in our screening step. We generate 10,000 independent samples to estimate the prediction error. The second measure is the average number of false positives of main explanatory variable in each simulation, where a main false positive represents a falsely selected irrelevant main eect covariate in the model. We note it as FP main. The third measure is the average number of false positives of interactions in each simulation, where a interaction false positive represents a falsely selected irrelevant interaction in the model, to total number of experiments. We note it as FP inter. Similarly, we can dene false negatives of main explanatory variable as missed true covariates after selection and false negatives of interactions as missed true interactions after selection. 
From Tables 4.3 and 4.4 we can see that, except for Model 1 with $\rho = 0.5$, the differences between our PE and the oracle PE are all within two standard errors of the oracle. As for the other measures, for all models and cases FP main and FN inter are essentially 0, which means that after selection we hardly ever pick up a noise main effect or miss a true interaction. When $\rho = 0.5$, in Model 1 where strong heredity holds, FN main is 0.23 and MSCP $= 0.69$; for all other models, FP inter and FN main are around 0.1 while the MSCPs are all around 0.8. When $\rho = 0.25$, the performance is considerably better than in the first case, with the largest FP inter equal to 0.1 and the largest FN main equal to 0.07; the MSCPs are all around 0.9, the largest being 0.93.

Table 4.3: Selection performance of Lasso$_t$ for Case 1 ($\rho = 0.5$). Standard errors are in parentheses.

  Measure     Model 1          Model 2          Model 3          Model 4
  PE          0.1418(0.0016)   0.1254(0.0012)   0.1068(0.0013)   0.1309(0.0012)
  FP main     0                0.01             0.01             0
  FP inter    0.1              0.12             0.13             0.09
  FN main     0.23             0.06             0.08             0
  FN inter    0                0                0.01             0.05
  MSCP        0.69             0.87             0.78             0.89
  oracle PE   0.1309(0.0034)   0.1203(0.0028)   0.1011(0.0029)   0.1260(0.0024)

Table 4.4: Selection performance of Lasso$_t$ for Case 2 ($\rho = 0.25$). Standard errors are in parentheses.

  Measure     Model 1          Model 2          Model 3          Model 4
  PE          0.1335(0.0011)   0.1243(0.0011)   0.1068(0.0007)   0.1312(0.0011)
  FP main     0.01             0                0.01             0
  FP inter    0.07             0.1              0.08             0.02
  FN main     0.07             0.04             0.01             0
  FN inter    0                0                0                0.05
  MSCP        0.87             0.88             0.91             0.93
  oracle PE   0.1288(0.0029)   0.1199(0.0032)   0.1046(0.0031)   0.1275(0.0027)

4.4 Real data analysis

We apply our method to the breast cancer data set originally studied in [33]. The purpose of the study is to identify, using gene expression data, which genes are related to the relapse of breast cancer in female patients based on clinical outcomes. There are 78 patients in total, 44 in the good prognosis group and 34 in the poor prognosis group. One patient in the poor prognosis group has several missing values, so we discard that patient; the resulting group sizes are $|\{Y=1\}| = 44$ and $|\{Y=0\}| = 33$. As candidate main-effect explanatory variables we use the $p = 231$ genes selected in [33].

We randomly split the 77 samples into a training set of 61 samples and a validation set of 16 samples, and repeat this random split 50 times. For each split we apply our method and compute the prediction error and the misclassification rate on the validation data. For screening, we keep the top $n - 1 = 60$ interactions, where $n = 61$ is the training sample size, and we use Lasso$_t$ for selection. We also record the model size for each split and the selection frequency of each main-effect variable and interaction over the 50 splits; the results are summarized in Tables 4.5 and 4.6. From the tables we see that our method achieves competitive classification performance. Notably, the two main effects $x_{116}$ and $x_{203}$ corresponding to the most frequently selected interaction $x_{116}x_{203}$ are never selected in any of the 50 splits, which illustrates that our method can identify true interaction terms whose corresponding main-effect variables are not significant explanatory variables on their own. A minimal sketch of this split-and-evaluate loop is given below, followed by the summary tables.
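The following sketch outlines the split-and-evaluate loop just described. It reuses screen_interactions from the sketch in Section 4.3.2, and again plain $\ell_1$-penalized logistic regression stands in for Lasso$_t$; the penalty level C and the random seed are assumed values for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def real_data_evaluation(X, y, n_splits=50, n_train=61, C=0.1, seed=0):
    """Repeated random-split evaluation on the real data.

    A sketch only: reuses screen_interactions from the earlier sketch, and
    plain l1-penalized logistic regression stands in for Lasso_t; C and
    seed are assumed values.
    """
    rng = np.random.default_rng(seed)
    pe_list, mr_list, size_list = [], [], []

    for _ in range(n_splits):
        idx = rng.permutation(len(y))
        tr, va = idx[:n_train], idx[n_train:]

        # Screening on the training split only: keep the top n_train - 1 pairs.
        kept, _ = screen_interactions(X[tr], y[tr], n_keep=n_train - 1)

        def augment(A):
            Z = np.column_stack([A[:, i] * A[:, j] for i, j in kept])
            return np.hstack([A, Z])

        model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
        model.fit(augment(X[tr]), y[tr])
        beta = model.coef_.ravel()

        # Validation prediction error and misclassification rate.
        eta = augment(X[va]) @ beta + model.intercept_[0]
        pe_list.append(np.mean((y[va] - eta) ** 2))
        mr_list.append(np.mean((eta > 0).astype(int) != y[va]))
        size_list.append(int(np.count_nonzero(beta)))

    return np.mean(pe_list), np.mean(mr_list), np.mean(size_list)
```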
Table 4.5: Prediction error, misclassification rate and model size on the breast cancer data of [33] over 50 random splits. Standard errors are in parentheses.

  Method        PE               MR               Model size (main)   Model size (interaction)   Model size (all)
  Our method    0.1334(0.0067)   0.1613(0.0137)   4.1(2.55)           2.38(1.46)                 6.48(3.77)

Table 4.6: Numbers of appearances of main-effect variables and interactions over the 50 splits; only those appearing at least 4 times are listed.

  main effect    appearances      interaction       appearances
  x_230          35               x_116 x_203       33
  x_231          25               x_114 x_231       10
  x_31           23               x_78 x_221        4
  x_22           19               x_116 x_419       4
  x_189          13
  x_8            12
  x_3            10
  x_10           9
  x_1            6
  x_216          6
  x_121          5

4.5 Proof of Theorems

4.5.1 Proof of Theorem 4.2.1

Proof. By Lemmas 4.5.3 and 4.5.4,
$$\mathrm{Cor}(X_1,X_2\mid Y=1) = \frac{\mathrm{Cov}(X_1,X_2\mid Y=1)}{\sqrt{\mathrm{Var}(X_1\mid Y=1)\,\mathrm{Var}(X_2\mid Y=1)}} = \frac{E(X_1X_2\mid Y=1) - E(X_1\mid Y=1)E(X_2\mid Y=1)}{\sqrt{\mathrm{Var}(X_1\mid Y=1)\,\mathrm{Var}(X_2\mid Y=1)}} = E(X_1X_2\mid Y=1).$$
Write $\eta_0 = X_1X_2$, so that $\eta = r\eta_0$. Then, using $P(Y=1)=1/2$ from Lemma 4.5.2,
$$E(X_1X_2\mid Y=1) = \int_{-\infty}^{\infty}\eta_0 f(\eta_0\mid Y=1)\,d\eta_0 = \frac{\int_{-\infty}^{\infty}\eta_0 P(Y=1\mid\eta_0)f(\eta_0)\,d\eta_0}{P(Y=1)} = \frac{\int_{-\infty}^{\infty}\eta_0\frac{e^{r\eta_0}}{1+e^{r\eta_0}}f(\eta_0)\,d\eta_0}{P(Y=1)} = 2g(r),$$
where $f(\eta_0)$ denotes the density of $\eta_0$ and $g(r) = \int_{-\infty}^{\infty}\eta_0\frac{e^{r\eta_0}}{1+e^{r\eta_0}}f(\eta_0)\,d\eta_0$. Moreover,
$$
\begin{aligned}
g(r) &= \int_{-\infty}^{0}\eta_0\frac{e^{r\eta_0}}{1+e^{r\eta_0}}f(\eta_0)\,d\eta_0 + \int_{0}^{\infty}\eta_0\frac{e^{r\eta_0}}{1+e^{r\eta_0}}f(\eta_0)\,d\eta_0\\
&= -\int_{0}^{\infty}\eta_0\frac{1}{1+e^{r\eta_0}}f(\eta_0)\,d\eta_0 + \int_{0}^{\infty}\eta_0\frac{e^{r\eta_0}}{1+e^{r\eta_0}}f(\eta_0)\,d\eta_0\\
&= \int_{0}^{\infty}\eta_0\frac{e^{r\eta_0}-1}{1+e^{r\eta_0}}f(\eta_0)\,d\eta_0
= \int_{0}^{\infty}\eta_0\Big(1-\frac{2}{1+e^{r\eta_0}}\Big)f(\eta_0)\,d\eta_0\\
&= \int_{0}^{\infty}\eta_0 f(\eta_0)\,d\eta_0 - \int_{0}^{\infty}\eta_0\frac{2}{1+e^{r\eta_0}}f(\eta_0)\,d\eta_0\\
&= \tfrac{1}{2}E(X_1X_2\mid X_1X_2>0) - \int_{0}^{\infty}\eta_0\frac{2}{1+e^{r\eta_0}}f(\eta_0)\,d\eta_0
= \frac{1}{\pi} - \int_{0}^{\infty}\eta_0\frac{2}{1+e^{r\eta_0}}f(\eta_0)\,d\eta_0,
\end{aligned}
$$
where the second equality substitutes $\eta_0\to-\eta_0$ in the first integral using the symmetry of $f$, and the last equality uses $E(X_1X_2\mid X_1X_2>0)=2/\pi$ from Lemma 4.5.5.

Note that $\eta_0 = X_1X_2$ has density $f(\eta_0) = \frac{1}{\pi}K_0(|\eta_0|)$, where $K_0$ is a modified Bessel function of the second kind, $K_0(\eta_0) = \int_0^{\infty}\frac{\cos(\eta_0 t)}{\sqrt{t^2+1}}\,dt$. Splitting the remaining integral at $\log 3$, and bounding $\frac{2}{1+e^{r\eta_0}}\le 1$ on $[0,\log 3]$ and $\frac{2}{1+e^{r\eta_0}}\le\frac{2}{1+3^{r}}$ on $[\log 3,\infty)$, we obtain
$$
g(r) = \frac{1}{\pi} - \frac{1}{\pi}\int_0^{\log 3}\eta_0\frac{2}{1+e^{r\eta_0}}K_0(\eta_0)\,d\eta_0 - \frac{1}{\pi}\int_{\log 3}^{\infty}\eta_0\frac{2}{1+e^{r\eta_0}}K_0(\eta_0)\,d\eta_0
\ge \frac{1}{\pi}\Big[1 - \int_0^{\log 3}\eta_0K_0(\eta_0)\,d\eta_0 - \frac{2}{1+3^{r}}\int_{\log 3}^{\infty}\eta_0K_0(\eta_0)\,d\eta_0\Big].
$$
We can compute
$$\int_0^{\log 3}\eta_0K_0(\eta_0)\,d\eta_0 = 1 - \log(3)\,K_1(\log 3)\approx 0.44, \qquad \int_{\log 3}^{\infty}\eta_0K_0(\eta_0)\,d\eta_0 = 1 - \int_0^{\log 3}\eta_0K_0(\eta_0)\,d\eta_0\approx 0.56.$$
So
$$
\begin{aligned}
\frac{g(r) - \frac{1}{\pi}\frac{r}{r+1}}{\frac{1}{\pi}\frac{r}{r+1}} &\ge \Big[1 - 0.44 - 0.56\cdot\frac{2}{1+3^{r}} - 1 + \frac{1}{r+1}\Big]\cdot\frac{r+1}{r}\\
&= -0.44 + 0.56\Big(\frac{1}{r} - \frac{2}{1+3^{r}}\cdot\frac{r+1}{r}\Big)
= -0.44 + \frac{0.56}{r}\Big(1 - \frac{2(r+1)}{1+3^{r}}\Big)
= -0.44 + \frac{0.56}{r}\cdot\frac{3^{r}-2r-1}{1+3^{r}}.
\end{aligned}
$$
When $r\ge 1$, $\frac{0.56}{r}\cdot\frac{3^{r}-2r-1}{1+3^{r}}\ge 0$, so
$$\frac{g(r) - \frac{1}{\pi}\frac{r}{r+1}}{\frac{1}{\pi}\frac{r}{r+1}} \ge -0.44,$$
and thus
$$\mathrm{Cor}(X_1,X_2\mid Y=1) = 2g(r) \ge \frac{1.12}{\pi}\cdot\frac{r}{r+1}.$$
When $0\le r\le 1$, since $3^{r}-2r-1\le 0$ and $1+3^{r}\ge 2$,
$$\frac{0.56}{r}\cdot\frac{3^{r}-2r-1}{1+3^{r}} \ge 0.28\Big(\frac{3^{r}-1}{r}-2\Big).$$
By the Taylor expansion $3^{r}\ge 1+r\log 3$, we have $\frac{3^{r}-1}{r}\ge\log 3\ge 1$, so $0.28\big(\frac{3^{r}-1}{r}-2\big)\ge -0.28$. Hence
$$\frac{g(r) - \frac{1}{\pi}\frac{r}{r+1}}{\frac{1}{\pi}\frac{r}{r+1}} \ge -0.72,$$
and thus
$$\mathrm{Cor}(X_1,X_2\mid Y=1) = 2g(r) \ge \frac{0.56}{\pi}\cdot\frac{r}{r+1},$$
which finishes the proof.

4.5.2 Proof of Theorem 4.2.2

First note that $\mathrm{Cor}(X_i,X_j)=0$ for all $i\neq j$. Since $\eta = rX_1X_2$ and the covariates $X_i$ are mutually independent, for every $i>2$ the variable $X_i$ is independent of $\eta$, and thus independent of $Y$. So $\mathrm{Cor}(X_i,X_j\mid Y)=\mathrm{Cor}(X_i,X_j)=0$ for all $i,j>2$, and thus $T_{i,j}=0$ for all $i,j>2$.

For $j>2$,
$$\mathrm{Cor}(X_1,X_j\mid Y=1) = \frac{E(X_1X_j\mid Y=1) - E(X_1\mid Y=1)E(X_j\mid Y=1)}{\sqrt{\mathrm{Var}(X_1\mid Y=1)\,\mathrm{Var}(X_j\mid Y=1)}}.$$
Note that $E(X_j\mid Y=1) = E(X_j) = 0$ and $E(X_1X_j\mid Y=1) = E(X_j)\,E(X_1\mid Y=1) = 0$, since $X_j$ is independent of $(X_1, Y)$. Thus $\mathrm{Cor}(X_1,X_j\mid Y=1)=0$, and similarly $\mathrm{Cor}(X_1,X_j\mid Y=0)=0$. This means that $T_{1,j}=0$ for all $j>2$, and similarly $T_{2,j}=0$ for all $j>2$.

4.5.3 Lemmas and their proofs

Lemma 1 and its proof

Lemma 4.5.1. $P(Y=1,\eta>0) = P(Y=0,\eta<0)$.

Proof.
For every $a>0$,
$$P(Y=1\mid\eta=a) = \frac{e^{a}}{1+e^{a}} = \frac{1}{1+e^{-a}} = P(Y=0\mid\eta=-a).$$
So for every $a>0$, $P(Y=1\mid\eta>a) = P(Y=0\mid\eta<-a)$, and in particular
$$P(Y=1\mid\eta>0) = P(Y=0\mid\eta<0).$$
Because $\eta = rX_1X_2$ with $X_1$ and $X_2$ i.i.d. $N(0,1)$, $\eta$ is symmetric about 0, so $P(\eta>0)=P(\eta<0)=0.5$. Thus, by the Bayes rule,
$$P(Y=1,\eta>0) = P(Y=0,\eta<0).$$

Lemma 2 and its proof

Lemma 4.5.2. $P(Y=1) = P(Y=0) = 0.5$.

Proof. From Lemma 4.5.1 we have $P(Y=1,\eta>0) = P(Y=0,\eta<0)$, and by the same argument $P(Y=1,\eta<0) = P(Y=0,\eta>0)$. Adding the two identities gives
$$P(Y=1,\eta>0) + P(Y=1,\eta<0) = P(Y=0,\eta<0) + P(Y=0,\eta>0),$$
which means $P(Y=1) = P(Y=0)$. Since $Y$ is binary, $P(Y=1)=P(Y=0)=0.5$.

Lemma 3 and its proof

Lemma 4.5.3. $E(X_1\mid Y=1)=0$.

Proof. For every $a>0$,
$$
\begin{aligned}
P(X_1\ge a,\eta<0,Y=1) &= P(X_1\ge a,\eta<0)\,P(Y=1\mid X_1\ge a,\eta<0)\\
&= P(X_1\ge a,\eta<0)\,P(Y=1\mid\eta<0)\\
&= P(X_1\ge a)P(X_2<0)\,P(Y=1\mid\eta<0)\\
&= P(X_1\le -a)P(X_2>0)\,P(Y=1\mid\eta<0)\\
&= P(X_1\le -a,\eta<0)\,P(Y=1\mid\eta<0)\\
&= P(X_1\le -a,\eta<0,Y=1),
\end{aligned}
$$
where the third and fifth equalities use the fact that on each of these events the sign of $X_2$ is determined by the signs of $X_1$ and $\eta$, and the fourth uses the symmetry of $X_1$ and $X_2$. Similarly,
$$P(X_1\ge a,\eta>0,Y=1) = P(X_1\le -a,\eta>0,Y=1).$$
Adding the two identities gives
$$P(X_1\ge a,Y=1) = P(X_1\le -a,Y=1),$$
which means $P(X_1\ge a\mid Y=1) = P(X_1\le -a\mid Y=1)$ for every $a>0$. Hence the conditional distribution of $X_1$ given $Y=1$ is symmetric about 0, and $E(X_1\mid Y=1)=0$.

Lemma 4 and its proof

Lemma 4.5.4. $\mathrm{Var}(X_1\mid Y=1)=1$.

Proof. By Lemma 4.5.3, $\mathrm{Var}(X_1\mid Y=1) = E(X_1^2\mid Y=1)$. Write $t = X_1^2$. Then
$$E(t\mid Y=1) = \int_0^{\infty} t\,f(t\mid Y=1)\,dt = \int_0^{\infty} t\,\frac{f(t,Y=1)}{P(Y=1)}\,dt = \int_0^{\infty} t\cdot 2f(t,Y=1)\,dt. \tag{4.7}$$
So if $f(t,Y=1) = f(t,Y=0)$, then $2f(t,Y=1) = f(t,Y=1)+f(t,Y=0) = f(t)$, and therefore
$$E(t\mid Y=1) = \int_0^{\infty} t\cdot 2f(t,Y=1)\,dt = \int_0^{\infty} t f(t)\,dt = E(X_1^2) = 1.$$
To prove $f(t,Y=1) = f(t,Y=0)$, it suffices to show that for every $a>0$,
$$P(X_1^2\ge a^2, Y=1) = P(X_1^2\ge a^2, Y=0).$$
Note that
$$
\begin{aligned}
P(X_1\ge a,\eta<0,Y=1) &= P(X_1\ge a,\eta<0)\,P(Y=1\mid X_1\ge a,\eta<0)\\
&= P(X_1\ge a,\eta<0)\,P(Y=1\mid\eta<0)\\
&= P(X_1\ge a)P(X_2<0)\,P(Y=1\mid\eta<0)\\
&= P(X_1\ge a)P(X_2>0)\,P(Y=0\mid\eta>0)\\
&= P(X_1\ge a,\eta>0)\,P(Y=0\mid\eta>0)\\
&= P(X_1\ge a,\eta>0,Y=0),
\end{aligned}
$$
where the fourth equality uses the symmetry of $X_2$ and the identity $P(Y=1\mid\eta<0) = P(Y=0\mid\eta>0)$ from the proof of Lemma 4.5.1. Similarly,
$$P(X_1\ge a,\eta>0,Y=1) = P(X_1\ge a,\eta<0,Y=0).$$
Adding the two identities,
$$P(X_1\ge a,\eta<0,Y=1) + P(X_1\ge a,\eta>0,Y=1) = P(X_1\ge a,\eta>0,Y=0) + P(X_1\ge a,\eta<0,Y=0),$$
that is, $P(X_1\ge a,Y=1) = P(X_1\ge a,Y=0)$. In the same way we obtain $P(X_1\le -a,Y=1) = P(X_1\le -a,Y=0)$, and adding these gives
$$P(X_1^2\ge a^2,Y=1) = P(X_1^2\ge a^2,Y=0).$$
So $f(t,Y=1) = f(t,Y=0)$, and therefore
$$\mathrm{Var}(X_1\mid Y=1) = E(X_1^2\mid Y=1) = E(X_1^2) = 1.$$

Lemma 5 and its proof

Lemma 4.5.5. $E(X_1X_2\mid X_1X_2>0) = \frac{2}{\pi}$.

Proof. Because $X_1$ and $X_2$ are symmetric about 0, $X_1X_2$ is also symmetric about 0, so $P(X_1X_2>0)=\frac{1}{2}$. Since $X_1$ and $X_2$ are independent,
$$P(X_1>0,X_2>0) = P(X_1<0,X_2<0) = \frac{1}{4}, \qquad P(X_1>0,X_2>0) + P(X_1<0,X_2<0) = P(X_1X_2>0).$$
Thus
$$
\begin{aligned}
E(X_1X_2\mid X_1X_2>0) &= \frac{1}{2}\big(E(X_1X_2\mid X_1>0,X_2>0) + E(X_1X_2\mid X_1<0,X_2<0)\big)\\
&= \frac{1}{2}\big(E(X_1\mid X_1>0)E(X_2\mid X_2>0) + E(X_1\mid X_1<0)E(X_2\mid X_2<0)\big).
\end{aligned}
$$
Because
$$E(X_1\mid X_1>0) = \int_0^{\infty} x\,\frac{2}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2}\,dx = \sqrt{\frac{2}{\pi}},$$
and similarly $E(X_2\mid X_2>0) = \sqrt{2/\pi}$ and $E(X_1\mid X_1<0) = E(X_2\mid X_2<0) = -\sqrt{2/\pi}$, plugging these in gives
$$E(X_1X_2\mid X_1X_2>0) = \frac{1}{2}\Big(\frac{2}{\pi} + \frac{2}{\pi}\Big) = \frac{2}{\pi}.$$

Bibliography

[1] Fan, Yingying, and Jinchi Lv. "Innovated scalable efficient estimation in ultra-large Gaussian graphical models."

[2] Fan, Jianqing, and Jinchi Lv. "Sure independence screening for ultrahigh dimensional feature space." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70.5 (2008): 849-911.

[3] Fan, Yingying, et al. "Innovated interaction screening for high-dimensional nonlinear classification." The Annals of Statistics 43.3 (2015): 1243-1272.
[4] Zheng, Zemin, Yingying Fan, and Jinchi Lv. "High dimensional thresholded regression and shrinkage effect." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76.3 (2014): 627-649.

[5] Fan, Yingying, and Jinchi Lv. "Asymptotic equivalence of regularization methods in thresholded parameter space." Journal of the American Statistical Association 108.503 (2013): 1044-1061.

[6] Fan, Jianqing, and Jinchi Lv. "Sure independence screening for ultrahigh dimensional feature space." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70.5 (2008): 849-911.

[7] Soltanolkotabi, Mahdi, Ehsan Elhamifar, and Emmanuel J. Candes. "Robust subspace clustering." The Annals of Statistics 42.2 (2014): 669-699.

[8] She, Yiyuan. "An iterative algorithm for fitting nonconvex penalized generalized linear models with grouped predictors." Computational Statistics & Data Analysis 56.10 (2012): 2976-2990.

[9] Tibshirani, Robert. "Regression shrinkage and selection via the lasso." Journal of the Royal Statistical Society: Series B (Methodological) (1996): 267-288.

[10] Efron, Bradley, et al. "Least angle regression." The Annals of Statistics 32.2 (2004): 407-499.

[11] Bunea, Florentina, Johannes Lederer, and Yiyuan She. "The group square-root lasso: Theoretical properties and fast algorithms." IEEE Transactions on Information Theory 60.2 (2014): 1313-1325.

[12] Yuan, Ming, and Yi Lin. "Model selection and estimation in regression with grouped variables." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68.1 (2006): 49-67.

[13] Owen, Art B. "A robust hybrid of lasso and ridge regression." Contemporary Mathematics 443 (2007): 59-72.

[14] Sun, Tingni, and Cun-Hui Zhang. "Scaled sparse linear regression." Biometrika 99.4 (2012): 879-898.

[15] Belloni, Alexandre, Victor Chernozhukov, and Lie Wang. "Square-root lasso: pivotal recovery of sparse signals via conic programming." Biometrika 98.4 (2011): 791-806.

[16] Liu, Weidong. "Gaussian graphical model estimation with false discovery rate control." The Annals of Statistics 41.6 (2013): 2948-2978.

[17] Ren, Zhao, et al. "Asymptotic normality and optimalities in estimation of large Gaussian graphical models." The Annals of Statistics 43.3 (2015): 991-1026.

[18] Lounici, Karim, et al. "Oracle inequalities and optimal inference under group sparsity." The Annals of Statistics (2011): 2164-2204.

[19] Huang, Junzhou, and Tong Zhang. "The benefit of group sparsity." The Annals of Statistics 38.4 (2010): 1978-2004.

[20] Mitra, Ritwik, and Cun-Hui Zhang. "The benefit of group sparsity in group inference with de-biased scaled group lasso." arXiv preprint arXiv:1412.4170 (2014).

[21] Laurent, Beatrice, and Pascal Massart. "Adaptive estimation of a quadratic functional by model selection." The Annals of Statistics (2000): 1302-1338.

[22] Cai, T. Tony, et al. "Joint estimation of multiple high-dimensional precision matrices." The Annals of Statistics 38 (2015): 2118-2144.

[23] Danaher, Patrick, Pei Wang, and Daniela M. Witten. "The joint graphical lasso for inverse covariance estimation across multiple classes." Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76.2 (2014): 373-397.

[24] Tothill, Richard W., et al. "Novel molecular subtypes of serous and endometrioid ovarian cancer linked to clinical outcome." Clinical Cancer Research 14.16 (2008): 5198-5208.

[25] Kanehisa, Minoru, and Susumu Goto. "KEGG: Kyoto encyclopedia of genes and genomes."
Nucleic Acids Research 28.1 (2000): 27-30.

[26] Giudice, Fernanda S., and Cristiane H. Squarize. "The determinants of head and neck cancer: Unmasking the PI3K pathway mutations." Journal of Carcinogenesis & Mutagenesis (2013).

[27] Yang, Fan, et al. "Upregulation of Fas in epithelial ovarian cancer reverses the development of resistance to Cisplatin." BMB Reports 48.1 (2015): 30.

[28] Jonsson, Jenny-Maria, et al. "Distinct gene expression profiles in ovarian cancer linked to Lynch syndrome." Familial Cancer 13.4 (2014): 537-545.

[29] Hotelling, Harold. "Analysis of a complex of statistical variables into principal components." Journal of Educational Psychology 24.6 (1933): 417.

[30] Kohonen, Teuvo, and Erkki Reuhkala. A very fast associative method for the recognition and correction of misspelt words, based on redundant hash addressing. Helsinki University of Technology, 1978.

[31] Kramer, H. P., and Max V. Mathews. "A linear coding for transmitting a set of correlated signals." IRE Transactions on Information Theory 2.3 (1956): 41-46.

[32] Watanabe, Y., S. Millward, and A. F. Graham. "Regulation of transcription of the reovirus genome." Journal of Molecular Biology 36.1 (1968): 107-123.

[33] Van't Veer, Laura J., et al. "Gene expression profiling predicts clinical outcome of breast cancer." Nature 415.6871 (2002): 530-536.

[34] Wikipedia contributors. "Statistical classification." Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 11 Apr. 2016. Web. 11 Apr. 2016.