FEATURE SELECTION IN HIGH-DIMENSIONAL MODELING WITH THRESHOLDED REGRESSION

by Zemin Zheng

A Dissertation Presented to the Faculty of the USC Graduate School, University of Southern California, in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy (Applied Mathematics)

May 2015

Copyright 2015 Zemin Zheng

Dedication

To my family

Acknowledgments

It has been a long journey since I started my Ph.D. career at USC and discovered some interesting topics in high-dimensional statistics. Although my research discoveries are still far from enough, I would never have been able to finish my dissertation without the guidance of my committee members, the help of friends, and the support of my family.

I would like to express my deepest gratitude to my advisors, Dr. Jinchi Lv and Dr. Yingying Fan, for their patience, encouragement, and excellent guidance that have brought me this far. They helped me experience the cutting edge of statistical research and develop methodologies for practical issues beyond the textbooks. Not only did they financially support my research through academic conferences and summer aid, but they also provided valuable suggestions during critical periods.

I would also like to thank Dr. Larry Goldstein for serving as my advisor in the home department; he patiently audited my research to keep it on the right track and gave me much advice and generous help with recommendations to further my research.

I owe many thanks to Dr. Peter Baxendale, who was always willing to help and who participated in my oral and final defense committees. Special thanks go to Dr. Jianfeng Zhang, who served on my oral exam committee and helped me with a recommendation for excellent teaching.

I would also like to thank my friends Pallavi Basu and Yongjian Kang for their contributions to the second part of this dissertation, which grew to a new level based on our discussions and efforts.

Finally, I am deeply indebted to my parents, who stood by me through the good times and bad over the years and gave me endless love and support. I would also like to thank my relatives and friends, who shared their experience with me and provided invaluable suggestions.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 High-Dimensional Variable Selection via Regularization
1.2 Model with Latent Features
Chapter 2: High-Dimensional Thresholded Regression and Shrinkage Effect
2.1 Introduction
2.2 Thresholded Regression
2.3 Global Properties and Shrinkage Effects of Thresholded Regression
2.3.1 Technical Conditions
2.3.2 Global Properties and Shrinkage Effects
2.4 Implementation
2.5 Numerical Studies
2.5.1 Simulation Examples
2.5.2 Real Data Example
2.6 Proofs
2.6.1 Proof of Theorem 1
2.6.2 Proof of Theorem 2
2.6.3 Additional Technical Details
Chapter 3: High-Dimensional Latent Variable Thresholded Regression
3.1 Introduction
3.2 Sparse Regression with Latent Confounding Factors
3.3 Thresholded Regression with Confounding Factors
3.3.1 Multiple Component Spike Models
3.3.2 Thresholded Regression with Estimated Confounding Factors
3.3.3 Technical Conditions
3.3.4 Main Theoretical Results
3.4 Numerical Studies
3.4.1 Implementation
3.4.2 Simulation Examples
3.4.3 Real Data Example
3.5 Discussion
3.6 Proofs
3.6.1 Lemmas
3.6.2 Proof of Theorem 3
3.6.3 Proof of Proposition 1
3.6.4 Proof of Theorem 4
3.6.5 Additional Technical Details
Bibliography

List of Tables

2.1 Means and standard deviations (in parentheses) of different performance measures by all methods over 100 simulations in Section 2.5.1
2.2 Means and standard deviations (in parentheses) of different performance measures by all methods followed by the L_2-regularization over 100 simulations in Section 2.5.1
2.3 Means and standard deviations (in parentheses) of different performance measures by all methods over 100 simulations in Section 2.5.1; the population error standard deviation (SD) σ√(df/(df−2)) equals 0.4472 in the case of p = 1000, and 0.3354 in the case of p = 5000
2.4 Means and standard deviations (in parentheses) of different performance measures by all methods followed by the L_2-regularization over 100 simulations in Section 2.5.1
2.5 Selection probabilities (t-statistics, with magnitude above 2 in boldface) of the most frequently selected predictors, with number up to the median model size, by each method across 100 random splittings of the diabetes data set in Section 2.5.2
3.1 Means and standard errors (in parentheses) of different performance measures for regression coefficients of observable variables by all methods over 100 simulations in Section 3.4.2. M1: model without hidden features, M2: model with confounding factors
3.2 Means and standard errors (in parentheses) of different performance measures for regression coefficients of confounding factors by all methods over 100 simulations in Section 3.4.2
3.3 Means and standard errors (in parentheses) of different performance measures for regression coefficients of observable variables by all methods over 100 simulations in Section 3.4.2.
M1: model without hidden features, M2: model with confounding factors; the population error standard deviation (SD) σ√(df/(df−2)) equals 0.4472
3.4 Means and standard errors (in parentheses) of different performance measures for regression coefficients of confounding factors by all methods over 100 simulations in Section 3.4.2
3.5 Means and standard errors (in parentheses) of classification errors, median model sizes, and selection probabilities for the first two sample principal components by different methods over 50 random splittings of the leukemia data set in Section 3.4.3. M1: model without hidden features, M2: model with hidden features

List of Figures

2.1 Representative prediction risk curves as a function of the ridge parameter λ_1 by all methods in Section 2.5.1 for the case of (p, |β_weak|) = (1000, 0.05)
2.2 Representative L_2-risk curves as a function of the ridge parameter λ_1 by all methods in Section 2.5.1 for the case of (p, |β_weak|) = (1000, 0.05)
2.3 Representative prediction error curves as a function of the ridge parameter λ_1 by all methods on the diabetes data set in Section 2.5.2

Abstract

This dissertation addresses two challenging problems in feature selection in the high-dimensional setting, where the number of covariates can be larger than the sample size. Some notorious difficulties in high-dimensional model selection include high collinearity among the predictors, spurious variables, and noise accumulation. The problem becomes even more challenging when some latent features exist in the hidden structure of the original model.

In the first part of this dissertation, we consider the problem of high-dimensional sparse modeling via regularization, which provides a powerful tool for analyzing large-scale data sets and obtaining meaningful, interpretable models. The use of nonconvex penalty functions shows advantages in selecting important features in high dimensions, but the global optimality of such methods still demands more understanding. In this part, we consider sparse regression with the hard-thresholding penalty, which we show to give rise to thresholded regression. This approach is motivated by its computational advantage and by its close connection with the L_0-regularization, which can be unrealistic to implement in practice but has appealing sampling properties. Under some mild regularity conditions allowing possibly exponentially growing dimensionality, we establish the oracle inequalities of the resulting regularized estimator, as the global minimizer, under various prediction and variable selection losses, as well as the oracle risk inequalities of the hard-thresholded estimator followed by a further L_2-regularization. The risk properties exhibit interesting shrinkage effects under both estimation and prediction losses. We identify the optimal choice of the ridge parameter, which is shown to have simultaneous advantages for both the L_2-loss and the prediction loss. These new results and phenomena are evidenced by simulation and real data examples.

In the second part, we focus on the high-dimensional model with latent features. As a powerful tool for producing interpretable models, sparse modeling has gained popularity for analyzing large-scale data sets. Most existing methods assume implicitly that all features in a model are observable.
Yet some latent confounding factors may exist in the hidden structure of the original model in many applications. Omitting such confounding factors can cause serious issues in both prediction and variable selection, even in the low-dimensional setting. The impacts of latent features are less well understood in high dimensions. In this part, we suggest high-dimensional latent variable thresholded regression, which incorporates confounding factors as the principal components of the random design matrix. We estimate the population principal components by the sample ones and establish asymptotic properties of the estimated prediction vectors in high dimensions for a wide class of distributions. With the aid of these properties, we prove that high-dimensional thresholded regression with estimated confounding factors can still enjoy model selection consistency and oracle inequalities under various prediction and variable selection losses for both observable covariates and latent confounding factors. The new method and results are evidenced by simulation and real data examples.

The second part of the dissertation borrows the strengths of the high-dimensional thresholded regression established in the first part, which shares the appealing hard-thresholding property of setting small coefficients to exact zeros, thus generating a sparse model with high interpretability. The further analysis of the latent model relies on a nice approximation of the confounding part. Both parts employ the power of high-dimensional sparse modeling via regularization.

Chapter 1

Introduction

1.1 High-Dimensional Variable Selection via Regularization

The advances of information technologies in the past few decades have made it much easier than before to collect large amounts of data over a wide spectrum of dimensions in different fields, resulting in big data problems particularly in the domains of biomedical research in molecular biology, imaging data in machine learning, and high-frequency transaction data in economics and finance. As a powerful tool for sparse modeling and variable selection, regularization methods have been widely used to analyze large-scale data sets and produce meaningful, interpretable models. Depending on the type of penalty function used, regularization methods can be grouped into two classes: convex ones and nonconvex ones.

A typical example of a convex penalty is the L_1-penalty, which gives rise to the L_1-regularization methods such as the Lasso (Tibshirani, 1996) and the Dantzig selector (Candes and Tao, 2007). The convexity of these methods makes the implementation efficient and facilitates the theoretical analysis. In a seminal paper, Bickel, Ritov and Tsybakov (2009) established the oracle inequalities of both the Lasso and the Dantzig selector under various prediction and estimation losses and, in particular, proved their asymptotic equivalence under certain conditions.

Despite their convexity and popularity, it has become a well-known phenomenon that convex regularization methods can suffer from a bias issue that is inherited from the convexity of the penalty function. This issue can deteriorate the power of identifying important covariates and the efficiency of estimating their effects in high dimensions.
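To see this bias concretely in the simplest orthonormal-design case, here is a small numerical sketch (our own illustration, with made-up numbers, not code from the dissertation): the soft-thresholding rule induced by the L_1-penalty shrinks every retained coefficient by the threshold λ, whereas the hard-thresholding rule studied in Chapter 2 leaves retained coefficients unshrunk.

```python
import numpy as np

def soft_threshold(z, lam):
    # Soft-thresholding (L1 penalty): surviving coefficients are shrunk by lam
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def hard_threshold(z, lam):
    # Hard-thresholding: surviving coefficients are left untouched
    return z * (np.abs(z) > lam)

z = np.array([0.05, 0.3, 1.0, 3.0])   # hypothetical marginal OLS estimates
lam = 0.5
print(soft_threshold(z, lam))  # [0.  0.  0.5 2.5] -> large signals biased by lam
print(hard_threshold(z, lam))  # [0.  0.  1.  3. ] -> large signals unbiased
```

The constant shrinkage of large coefficients by λ is the bias issue referred to above, and it is what the nonconvex penalties discussed next are designed to avoid.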
To attenuate this issue, Fan and Li (2001) initiated the general framework of nonconcave penalized likelihood with nonconvex penalties, including the proposed smoothly clipped absolute deviation (SCAD) penalty, and showed that the oracle properties can hold for a wide class of nonconvex regularization methods. Other nonconvex regularization methods include the bridge regression using the L_q-penalty for 0 < q < 1 (Frank and Friedman, 1993), the minimax concave penalty (MCP) (Zhang, 2010), and the smooth integration of counting and absolute deviation (SICA) penalty (Lv and Fan, 2009). A main message of these works is that nonconvex regularization can be beneficial in selecting important covariates in high dimensions.

Although there is a growing literature on nonconvex regularization methods, some important questions still remain. As an important step, most existing studies of these methods focus on some local minimizer with appealing properties, due to their general nonconvexity. Yet the properties of the global minimizer need more delicate analysis, and the global theory may depend on the specific form of regularization. A natural question is whether the oracle inequalities hold for nonconvex regularization methods as they do for the L_1-regularization methods (Candes and Tao, 2007; Bickel, Ritov and Tsybakov, 2009), and whether the logarithmic factor of dimensionality that appears commonly in the oracle inequalities is optimal. In the problem of Gaussian mean estimation, there is a well-known phenomenon of Stein's shrinkage effect (Stein, 1956; James and Stein, 1961), stating that the maximum likelihood estimator or least-squares estimator may no longer be optimal in risk under the quadratic loss in multiple dimensions. Thus another natural question is whether similar shrinkage effects hold for these methods under both estimation and prediction losses. In Chapter 2, we intend to provide some partial answers to the aforementioned questions, with a focus on one particular member of the nonconvex family, the hard-thresholding penalty.

1.2 Model with Latent Features

Most existing methods for analyzing large-scale data sets assume implicitly that all features in a model are observable. Yet in many applications, some latent confounding factors may exist in the hidden structure of the original models. In regression analysis, omitting such confounding factors allows noise variables to be included in the model and important covariates to be missed or even to have opposite effects. This can potentially cause serious issues in both prediction and variable selection. Such issues can occur even in low-dimensional settings. In the high-dimensional case, the problem of latent features is even more challenging.

To appreciate the effects of confounding factors, consider a simple example. Educated people are generally unlikely to commit crimes, but many sociological data sets show that cities with better education levels display higher crime rates. Based on these data sets, if we regress crime rates on education levels, there seems to be, counterintuitively, a positive relationship between them. Urbanization is a key confounding factor here. However, it is an unobserved factor and might be estimated by a few demographic variables. Similarly, it is common in the social sciences, epidemiology, and finance that some unobservable variables correlate with both the predictors and the outcome but cannot be measured directly and included as covariates in the quantitative model.
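A minimal simulation of this education–crime illustration (all numbers made up; the sketch is ours, not from the dissertation) shows how an omitted latent factor can flip the sign of a regression coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
urban = rng.normal(size=n)                               # latent confounding factor
educ = 1.0 * urban + rng.normal(size=n)                  # education rises with urbanization
crime = -0.5 * educ + 1.5 * urban + rng.normal(size=n)   # true effect of education is negative

# Omitting the confounder: regress crime on education alone
X1 = np.column_stack([np.ones(n), educ])
b1 = np.linalg.lstsq(X1, crime, rcond=None)[0]

# Including the confounder recovers the negative effect
X2 = np.column_stack([np.ones(n), educ, urban])
b2 = np.linalg.lstsq(X2, crime, rcond=None)[0]

print(b1[1])  # roughly +0.25: the sign is flipped by the omitted factor
print(b2[1])  # roughly -0.5: the true coefficient
```

Once the latent factor, or a good estimate of it such as a principal component of the design, is included, the coefficient on education recovers its true negative sign; this is the idea pursued in Chapter 3.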
We refer the reader to Chapter 2 of Anderson et al. (1980) for examples and detailed discussions of confounding variables, as well as their impacts on estimating the effects of observable predictors.

For the purpose of prediction and variable selection in the presence of potential confounding factors, various statistical methods have been studied in the literature on regression analysis. For example, Frank (2000) extended techniques for sensitivity analyses and defined an index function to measure the impact of a confounding variable on a regression coefficient. Monte Carlo simulation methods were employed in Austin and Brunner (2004) to assess the inflation of the type I error rate when a continuous confounding variable is categorized in logistic regression analyses. Lin, Psaty and Kronmal (1998) presented a general approach to assessing the sensitivity of the point and interval estimates of the primary exposure effect to the residual confounding effects of unmeasured variables, after adjusting for measured covariates in an observational study. In the econometrics literature (Chudik, Pesaran and Tosetti, 2011; Kapetanios, Pesaran and Yamagata, 2011; Pesaran and Tosetti, 2011), confounding factors are often estimated as nuisance parameters. Despite this progress in capturing latent features, relatively few studies deal with confounding factors in high dimensions.

A key contribution of our work is identifying and estimating confounding factors in high-dimensional data. In our setup we allow both the number of observable covariates and the number of latent factors to be large. To the best of our knowledge, this is a new contribution to the case of high-dimensional latent factors. An important objective is to recover the support of the true regression coefficient vector for the observable variables as well as to identify the significant ones among a set of potential confounding factors. It is also necessary to estimate the effects of both observable covariates and latent factors in view of prediction. As latent factors may correlate with both the predictors and the outcome, a natural question is how to select and estimate the effects of observable variables in the presence of unobservable ones. Since it is generally difficult to analyze their impacts when no structure is imposed on the latent features, another natural question is whether potential confounding factors can be effectively estimated. Furthermore, we explore the possibility of simultaneous identification and estimation of the effects of true covariates and confounding factors, enjoying properties such as model selection consistency and nice convergence rates in oracle inequalities for both prediction and estimation.

Chapter 2

High-Dimensional Thresholded Regression and Shrinkage Effect

2.1 Introduction

The L_0-regularization, which amounts to best subset regression, has motivated different forms of regularization. This method was proved to enjoy the oracle risk inequalities under the prediction loss in Barron, Birge and Massart (1999). It is, however, unrealistic to implement in practice due to its combinatorial computational complexity. As an alternative to the L_0-penalty, the hard-thresholding penalty is continuous with a fixed, finite maximum concavity, which controls the computational difficulty. We show that both approaches give rise to a thresholded regression. As is well known in the wavelets literature, the hard-thresholding regularization is equivalent to the L_0-regularization in the case of an orthonormal design matrix.
This connection motivates us to fully investigate the approach of hard-thresholding regularization.

The main contributions of this chapter are twofold. First, we establish comprehensive global properties of the hard-thresholding regularization, including the oracle inequalities under various prediction and variable selection losses and the nonoptimality of the logarithmic factor of dimensionality. Second, we show that the hard-thresholding regularization followed by a further L_2-regularization enjoys interesting Stein's shrinkage effects in terms of risks under both estimation and prediction losses. The identified optimal choice of the ridge parameter is revealed to have simultaneous advantages for both the L_2-loss and the prediction loss, a result that builds an interesting connection between model selection and prediction. These new results and phenomena provide further insights into the hard-thresholding regularization method.

The rest of the chapter is organized as follows. Section 2.2 presents the thresholded regression with the hard-thresholding penalty and the L_0-penalty, and their hard-thresholding property. We establish the global properties of thresholded regression under various prediction and variable selection losses and unveil Stein's shrinkage effects for both estimation and prediction losses, with optimal choices of the ridge parameter, in Section 2.3. Section 2.4 briefly discusses the implementation of thresholded regression. We provide several simulation and real data examples in Section 2.5.

2.2 Thresholded Regression

To address the questions raised in the Introduction, we focus our attention on the linear regression model

y = Xβ + ε,  (2.1)

where y = (y_1, ..., y_n)^T is an n-dimensional response vector, X = (x_1, ..., x_p) is an n × p deterministic design matrix consisting of p covariate vectors, β = (β_1, ..., β_p)^T is a p-dimensional regression coefficient vector, and ε = (ε_1, ..., ε_n)^T is an n-dimensional error vector. The goal of variable selection is to consistently recover the true underlying sparse model supp(β_0) = {j : β_{0,j} ≠ 0, 1 ≤ j ≤ p} for the true regression coefficient vector β_0 = (β_{0,1}, ..., β_{0,p})^T in model (2.1), and to estimate the s = ||β_0||_0 nonzero regression coefficients β_{0,j}.

To produce a sparse estimate of β_0, we consider the approach of penalized least squares, which minimizes

Q(β) = (2n)^{-1} ||y − Xβ||_2^2 + ||p_λ(β)||_1,  (2.2)

the penalized residual sum of squares with penalty function p_λ(t). Here we use the compact notation p_λ(β) = p_λ(|β|) = (p_λ(|β_1|), ..., p_λ(|β_p|))^T with |β| = (|β_1|, ..., |β_p|)^T. The penalty function p_λ(t), defined on t ∈ [0, ∞) and indexed by λ ≥ 0, is assumed to be increasing in both t and λ with p_λ(0) = 0, indicating that the amount of regularization increases with the magnitude of the parameter and with the regularization parameter λ. To align all covariates to a common scale, we assume that each covariate vector x_j is rescaled to have L_2-norm n^{1/2}, matching that of the constant covariate 1 for the intercept. See, for example, the references mentioned in the Introduction for the specific forms of the various penalty functions that have been proposed for sparse modeling.

As elucidated in the Introduction, we focus on the hard-thresholding penalty

p_{H,λ}(t) = (1/2) [λ^2 − (λ − t)_+^2],  t ≥ 0,  (2.3)

which is closely related to the L_0-penalty p_{H_0,λ}(t) = 2^{-1} λ^2 1{t ≠ 0}, t ≥ 0.
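For concreteness, here is a minimal numerical sketch (plain numpy, our own illustrative code mirroring the notation of (2.2)–(2.3), not code from the dissertation) of the penalized least-squares objective with the hard-thresholding and L_0 penalties:

```python
import numpy as np

def p_hard(t, lam):
    # Hard-thresholding penalty (2.3): 0.5 * (lam^2 - (lam - t)_+^2) for t >= 0
    return 0.5 * (lam**2 - np.maximum(lam - t, 0.0)**2)

def p_l0(t, lam):
    # L0 penalty: 0.5 * lam^2 * 1{t != 0}
    return 0.5 * lam**2 * (t != 0)

def Q(beta, y, X, lam, penalty=p_hard):
    # Penalized residual sum of squares (2.2) with the chosen penalty
    n = X.shape[0]
    return np.sum((y - X @ beta)**2) / (2.0 * n) + np.sum(penalty(np.abs(beta), lam))
```

The two penalties coincide at t = 0 and for all t ≥ λ, where both equal λ²/2; they differ only on (0, λ), which is exactly what makes the hard-thresholding penalty continuous while mimicking the L_0-penalty.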
It is well known that in the wavelets setting, with the design matrix X multiplied by n^{-1/2} being orthonormal, that is, n^{-1} X^T X = I_p, the penalized least squares in (2.2) reduces to a componentwise minimization problem with

Q(β) = 2^{-1} ||β̂_ols − β||_2^2 + ||p_λ(β)||_1,

where β̂_ols = n^{-1} X^T y is the ordinary least-squares estimator. In this setting, the use of the hard-thresholding penalty p_{H,λ}(t) gives the componentwise hard-thresholding rule, of the form z 1{|z| > λ}, applied to the ordinary least-squares estimator (Antoniadis, 1996). In contrast, the use of the L_1-penalty p_λ(t) = λt yields the soft-thresholding rule, of the form sgn(z)(|z| − λ)_+. When the L_0-penalty p_{H_0,λ}(t) is used, a hard-thresholding rule identical to that given by the hard-thresholding penalty p_{H,λ}(t) is obtained. We see that in the case of an orthonormal design matrix, the two approaches of hard-thresholding regularization and L_0-regularization are equivalent. This simple connection suggests that they may have a more general connection, which motivates our study.

Moreover, the hard-thresholding penalty in (2.3) is continuous and has fixed, finite maximum concavity

κ(p_{H,λ}) = sup_{0 < t_1 < t_2 < ∞} −[p'_{H,λ}(t_2) − p'_{H,λ}(t_1)] / (t_2 − t_1) = 1,  (2.4)

which is related to the computational difficulty of the regularization method and gives rise to its computational advantage. The computationally attractive method of hard-thresholding regularization indeed shares some similarity with the L_0-regularization in the general case, as shown in the following lemma on the hard-thresholding property.

Lemma 1. For both the hard-thresholding penalty p_{H,λ}(t) and the L_0-penalty p_{H_0,λ}(t), minimizing Q(β) in (2.2) along the j-th coordinate with 1 ≤ j ≤ p, at any p-vector β_j with j-th component zero, gives the univariate global minimizer for that coordinate of the same form β̂(z) = z 1{|z| > λ}, with z = n^{-1}(y − Xβ_j)^T x_j.

The simple observation in Lemma 1 facilitates our technical analysis and enables us to derive parallel results for both methods. Since the global minimizer is necessarily the global minimizer along each coordinate, the characterization of each coordinate in the above lemma shows that the regularized estimators given by both methods are natural generalizations of the univariate hard-thresholded estimator. In this sense, we refer to both methods as thresholded regression using hard-thresholding. There is, however, no guarantee that the two estimators are identical when p > 1. We show in Theorem 1 that the two methods can have similar oracle inequalities under various prediction and variable selection losses, which justifies a further connection between them.

2.3 Global Properties and Shrinkage Effects of Thresholded Regression

2.3.1 Technical Conditions

It is well known that high collinearity is commonly associated with large-scale data sets. High collinearity can lead to unstable estimation or even loss of model identifiability in regression problems. More specifically, there may exist another p-vector β_1 that is different from β_0 such that Xβ_1 is (nearly) identical to Xβ_0 when the dimensionality p is large compared with the sample size n. Thus, to ensure model identifiability and reduce model instability, it is necessary to control the size of sparse models, since it is clear from the geometric point of view that the collinearity among the covariates increases with the dimensionality. This idea was exploited in Donoho and Elad (2003) for the problem of sparse recovery, that is, the noiseless case of (2.1).
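Before turning to the technical conditions, here is a minimal sketch of the closed-form coordinate update given by Lemma 1 (our own illustrative code, assuming the column scaling of Section 2.2); this update is also the building block of the coordinate-optimization implementation discussed in Section 2.4.

```python
import numpy as np

def coordinate_update(y, X, beta, j, lam):
    """Closed-form coordinate minimizer from Lemma 1, assuming the columns of X
    are rescaled to have L2-norm sqrt(n) as in Section 2.2: the update is
    z * 1{|z| > lam} with z = n^{-1} x_j^T (y - X beta_{-j}).
    The same rule arises for the hard-thresholding and the L0 penalty."""
    n = X.shape[0]
    partial_residual = y - X @ beta + X[:, j] * beta[j]   # excludes the j-th covariate
    z = X[:, j] @ partial_residual / n
    return z if abs(z) > lam else 0.0
```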
To ensure the identifiability of β_0, Donoho and Elad (2003) introduced the concept of the spark of a design matrix X, denoted spark(X), defined as the smallest number τ such that there exists a linearly dependent subgroup of τ columns of X. In particular, β_0 is uniquely defined as long as s < spark(X)/2, which provides a basic condition for model identifiability. Since we are interested in variable selection in the presence of noise, we extend their concept of spark to the robust case as follows.

Definition 1. The robust spark M = rspark_c(X) of an n × p design matrix X with bound c is defined as the smallest number τ such that there exists a subgroup of τ columns of n^{-1/2} X whose corresponding submatrix has a singular value less than the given positive constant c.

An equivalent representation of the robust spark M = rspark_c(X) in Definition 1 is the largest number τ such that the following inequality holds:

min_{||δ||_0 < τ, ||δ||_2 = 1} n^{-1/2} ||Xδ||_2 ≥ c.  (3.5)

This inequality provides a natural constraint on the collinearity of sparse models. In view of (3.5), our robust spark condition s < M/2, to be introduced in Condition 2, is in a similar spirit to the restricted eigenvalue condition of Bickel, Ritov and Tsybakov (2009). The restricted eigenvalue condition assumes (3.5) with the L_0-norm constraint ||δ||_0 < τ replaced by the L_1-norm constraint ||δ_{J_0^c}||_1 ≤ c_0 ||δ_{J_0}||_1 for some positive constant c_0, where J_0 ⊂ {1, ..., p} with |J_0| ≤ s', J_0^c is the complement of J_0, and δ_A denotes the subvector of δ consisting of components with indices in a given set A. The robust spark condition s < M/2 requires that (3.5) hold for τ = 2s + 1. Since such an L_0-norm constraint generally defines a smaller subset than the above L_1-norm constraint for s' = 2s, the robust spark condition can be weaker than the restricted eigenvalue condition. It is easy to show that the robust spark rspark_c(X) increases as c decreases and approaches the spark spark(X) as c → 0+. Thus M can generally be any positive integer no larger than n + 1.

To ensure model identifiability and reduce the instability of the estimated model, we consider the regularized estimator β̂ on the union of coordinate subspaces S_{M/2} = {β ∈ R^p : ||β||_0 < M/2}, as exploited in Fan and Lv (2011) to characterize the global optimality of nonconcave penalized likelihood estimators. Thus, throughout the paper, the regularized estimator β̂ is defined as the global minimizer

β̂ = arg min_{β ∈ S_{M/2}} Q(β),  (3.6)

where Q(β) is defined in (2.2). When the size of sparse models exceeds M/2, that is, when β falls outside the space S_{M/2}, there is generally no guarantee of model identifiability.

To facilitate our technical analysis, we make the following three regularity conditions.

Condition 1. ε ∼ N(0, σ^2 I_n) for some positive σ.

Condition 2. It holds that s < M/2, s = o(n), and b_0 = min_{j ∈ supp(β_0)} |β_{0,j}| > (√(16/c^2) ∨ 1) c^{-1} c_2 √((2s + 1)(log p̃)/n), where M is the robust spark of X with bound c given in Definition 1, c_2 ≥ √10 σ is some positive constant, and p̃ = n ∨ p.

Condition 3. ||β_0||_2 is bounded from below by some positive constant and max_{||δ||_0 < M/2, ||δ||_2 = 1} n^{-1/2} ||Xδ||_2 ≤ c_3 for some positive constant c_3.

Condition 1 is standard in the linear regression model. The Gaussian error distribution is assumed to simplify the technical arguments. The theoretical results continue to hold for other error distributions, with a possibly different probability bound in Theorem 1; see, for example, Fan and Lv (2011) for more technical details.
In particular, some numerical results for the t error distribution are presented in Section 2.5.1. Heavy-tailedness of the error distribution typically leads to a lower probability for the prediction and variable selection bounds to hold.

The first part of Condition 2, s < M/2, puts a sparsity constraint on the true model size s that involves the robust spark given in Definition 1. As explained above, such a robust spark condition is needed to ensure model identifiability. We typically assume a diverging ratio of the sample size n to the number of true covariates s, that is, s = o(n), to obtain consistent estimation of β_0. The third part of Condition 2 gives a lower bound on the minimum signal strength for model selection consistency.

Condition 3, which is needed only in Theorem 2, facilitates the derivation of the oracle risk properties of the regularized estimator, which are stronger than the oracle inequalities in Theorem 1. In particular, the first part of Condition 3 assumes that the L_2-norm of β_0 is bounded from below, which is mild and sensible. The second part of Condition 3 is a restricted-eigenvalue-type assumption and requires that the maximum singular value of each submatrix of n^{-1/2} X formed by taking fewer than M/2 columns be bounded from above.

2.3.2 Global Properties and Shrinkage Effects

In view of Lemma 1, the regularization parameter λ determines the threshold level for both the hard-thresholding penalty p_{H,λ}(t) and the L_0-penalty p_{H_0,λ}(t). So a natural idea for ensuring model selection consistency is to choose an appropriately large regularization parameter λ to suppress all noise covariates and retain the important ones. This approach is shown to be effective in the following theorem on the model selection consistency and oracle inequalities of thresholded regression.

Theorem 1. Assume that Conditions 1–2 hold and c^{-1} c_2 √((2s + 1)(log p̃)/n) < λ < b_0 (1 ∧ √(c^2/2)). Then for both the hard-thresholding penalty p_{H,λ}(t) and the L_0-penalty p_{H_0,λ}(t), the regularized estimator β̂ in (2.1) satisfies that, with probability at least

1 − (2/π)^{1/2} c_2^{-1} σ (log p̃)^{-1/2} p̃^{1 − c_2^2/(2σ^2)} − (2/π)^{1/2} c_2'^{-1} σ s (log n)^{-1/2} n^{−c_2'^2/(2σ^2)}

for some positive constant c_2' ≥ √2 σ, it holds simultaneously that:

(a) (Model selection consistency) supp(β̂) = supp(β_0);

(b) (Prediction loss) n^{-1/2} ||X(β̂ − β_0)||_2 ≤ 2 c_2' c^{-1} √(s(log n)/n);

(c) (Estimation losses) ||β̂ − β_0||_q ≤ 2 c^{-2} c_2' s^{1/q} √((log n)/n) for q ∈ [1, 2], and ||β̂ − β_0||_∞ is bounded by the same upper bound as for q = 2.

With the above choice of the regularization parameter λ, the prediction loss of the regularized estimator is within a logarithmic factor (log n)^{1/2} of that of the oracle estimator, which is referred to as the least-squares estimator on the true underlying sparse model. Theorem 1 also establishes the oracle inequalities of the regularized estimator under the L_q-estimation losses with q ∈ [1, 2] ∪ {∞}. These results hold simultaneously with significant probability that converges to one polynomially with the sample size n, since p̃ = n ∨ p. The dimensionality p is allowed to grow up to exponentially fast with the sample size n, in view of the range for λ. The key to deriving these rates is establishing the model selection consistency of the hard-thresholded estimator, that is, the exact recovery of the true underlying sparse model. Such a property enables us to construct a key event with significant probability, on which we can conduct delicate analysis.
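As a quick sanity check on Theorem 1(a), the following self-contained toy simulation (our own sketch; dimensions, signal sizes, and the value of λ are made up and chosen well inside the assumed regime) runs cyclic coordinate descent with the hard-thresholding update of Lemma 1 and typically recovers the true support exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s, sigma = 200, 500, 5, 0.3
X = rng.normal(size=(n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)          # columns scaled to L2-norm sqrt(n)
beta0 = np.zeros(p)
beta0[:s] = 1.0                                      # strong, well-separated signals
y = X @ beta0 + sigma * rng.normal(size=n)

lam, beta = 0.5, np.zeros(p)
r = y.copy()                                         # residual y - X @ beta
for _ in range(10):                                  # cyclic coordinate descent sweeps
    for j in range(p):
        z = X[:, j] @ (r + X[:, j] * beta[j]) / n    # Lemma 1: z = n^{-1} x_j^T (y - X beta_{-j})
        new = z if abs(z) > lam else 0.0             # univariate hard-thresholding
        r += X[:, j] * (beta[j] - new)               # keep the residual in sync
        beta[j] = new

print(np.flatnonzero(beta))                          # typically [0 1 2 3 4]: supp(beta0) recovered
```

With weaker signals or a larger λ the recovery can of course fail, consistent with the lower bound on b_0 in Condition 2.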
The suitable range of the 17 regularization parameter is critical in this theorem since the lower bound on λ is needed for suppressing all noise covariates and the upper bound on λ is needed for retaining all true covariates, although this range is unknown to us in practice. Theorem 1 builds on Lemma 1, both of which share a common feature that the technical arguments apply equally to both hard-thresholding penalty and L 0 - penalty. Thus, under conditions of Theorem 1, the regularized estimators given by both hard-thresholding penalty p H,λ (t) and L 0 -penalty p H 0 ,λ (t) are approximately asymptotically equivalent, that is, having the same convergence rates in the oracle inequalities under various prediction and variable selection losses. This formally justifies the motivation and advantage of studying the hard-thresholding regular- ization. In fact, their approximate asymptotic equivalence extends to the oracle risk inequalities under different prediction and variable selection losses. These results complement those on the oracle risk inequalities under the prediction loss inBarron,BirgeandMassart(1999). Sinceitenjoysthesameappealingproperties as the L 0 -regularization, the hard-thresholding regularization provides an attrac- tive alternative to the L 0 -regularization thanks to its computational advantage, as discussed in Section 2.2. As mentioned in the Introduction, many studies have contributed to the ora- cle inequalities for the L 1 -regularization methods. For instance, Candes and Tao 18 (2007)proved thatthe Dantzigselector canachieve a losswithin a logarithmicfac- tor of the dimensionality compared to that for the oracle estimator. Bunea, Tsy- bakov and Wegkamp (2007) established sparsity oracle inequalities for the Lasso estimator. Bickel, Ritov and Tsybakov (2009) derived parallel oracle inequali- ties for the Lasso estimator and Dantzig selector under the prediction loss and L q -estimation losses with q ∈ [1,2]. A common feature of these results is the appearance of some power of the logarithmic factor logp of the dimensionality p. In contrast, such a factor is replaced by the logarithmic factor logn of the sample size n in our setting. This suggests the general nonoptimality of the logarithmic factor of dimensionality in the oracle inequalities when p grows nonpolynomially with n. Our results are also related to other work on nonconvex regularization methods. Antoniadis and Fan (2001) obtained comprehensive oracle inequalities anduniversalthresholdingparametersforawideclassofgeneralpenaltyfunctions, in the wavelets setting. Zhang (2010) proved that the MCP estimator can attain certain minimax convergence rates for the estimation of regression coefficients in L q -balls. Although providing bounds on different estimation and prediction losses on an event with large probability, the oracle inequalities of the thresholded regres- sion presented in Theorem 1 do not take into account its performance over the full sample space. Thus it is of interest to investigate a stronger property of ora- cle risk inequalities for thresholded regression, where the risk under a loss is its 19 expectation over all realizations. As shown in the proof of Theorem 1, the hard- thresholded estimator b β in (2.1) on its support supp( b β) is exactly the ordinary least-squares estimator constructed using covariates in supp( b β). 
Motivated by such a representation, we consider a refitted estimator constructed by applying a further L 2 -regularization to the thresholded regression b β refitted = (X 1 τX 1 +λ 1 I s 1 ) −1 X 1 τy, (3.7) whereX 1 is a submatrix of the design matrixX consisting of columns in supp( b β), s 1 = k b βk 0 , and λ 1 ≥ 0 is the ridge parameter. In the special case of λ 1 = 0, the above refitted estimator b β refitted becomes the original hard-thresholded estimator b β in (2.1). LetX 0 beasubmatrixofthedesignmatrixXconsistingofcolumnsinsupp(β 0 ) and X 0 τX 0 = PτDP an eigendecomposition with P an orthogonal matrix and D = diag{d 1 ,··· ,d s }. We show that Stein’s shrinkage effects (Stein, 1956; James and Stein, 1961) also hold for the thresholded regression followed by the L 2 - regularization in terms ofrisks under bothestimation and prediction losses. These results are presented in the following theorem on the oracle risk inequalities of the L 2 -regularized thresholded regression. Theorem 2. Assume that conditions of Theorem 1 and Condition 3 hold. Then the L 2 -regularized refitted estimator b β refitted in (3.7) satisfies that: 20 (a) (L 2 -risk). The minimum L 2 -risk Ek b β refitted −β 0 k 2 2 is attained at the opti- mal ridge parameter λ 1 = λ 1,opt = O(skβ 0 k −2 2 ) + O(s 2 n −1 kβ 0 k −4 2 ), with the leading termO(skβ 0 k −2 2 ) sandwichedbetweensσ 2 kβ 0 k −2 2 (λ min /λ max ) 2 and sσ 2 kβ 0 k −2 2 (λ max /λ min ) 2 , and equals O(s/n)+O(s 2 n −2 kβ 0 k −2 2 ) with the lead- ing term O(s/n) being P s j=1 (λ 2 1,opt b 2 j +d j σ 2 )/(d j +λ 1,opt ) 2 ; (b) (L q -risk). The minimum L q -risk Ek b β refitted − β 0 k q q equals O(s/n q/2 ) + O(s 2 kβ 0 k −2 2 /n q/2+1 ) for q ∈ [1,2], and the minimum L ∞ -risk Ek b β refitted − β 0 k ∞ equals O(s 1/2 /n 1/2 )+O(s 3/2 kβ 0 k −2 2 /n 3/2 ); (c) (Prediction risk). The minimum prediction risk n −1 EkX( b β refitted − β 0 )k 2 2 is attained at the optimal ridge parameter λ 1 = λ ′ 1,opt = O(skβ 0 k −2 2 )+O(s 2 n −1 kβ 0 k −4 2 ), with the leadingtermO(skβ 0 k −2 2 ) sandwiched between sσ 2 kβ 0 k −2 2 λ min /λ max and sσ 2 kβ 0 k −2 2 λ max /λ min , and equals O(s/n)+ O(s 2 n −2 kβ 0 k −2 2 ) with the leading term O(s/n) being n −1 P s j=1 [(λ ′ 1,opt ) 2 b 2 j d j + d 2 j σ 2 ]/(d j +λ ′ 1,opt ) 2 , where (b 1 ,··· ,b s )τ = Pβ 0,1 with β 0,1 a subvector of β 0 consisting of all nonzero components, and λ min and λ max are the smallest and largest eigenvalues ofX 0 τX 0 , respectively. Although it has the well-known bias issue, the ridge regression applied after the thresholded regression is shown in Theorem 2 to be capable of improving both estimation and prediction, since the original hard-thresholded estimator is 21 simply the refitted estimator with λ 1 =0 and the minimum risks under the losses are attained at nonzero ridge parameters λ 1 . Intuitively, the bias incurred by an appropriately small amount of L 2 -regularization can be offset by the reduction in estimationvariability,leadingtoimprovement intheoverallrisksoftheregularized estimator. This phenomenon can be clearly seen in the representative L 2 -risk and prediction risk curves as a function of the ridge parameter λ 1 in Section 2.5. The risks drop as λ 1 increases from zero, and start to rise after the minimum risks are attained. The model selection consistency of the thresholded regression plays a key role in deriving the risk properties of the L 2 -regularized refitted estimator. 
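The two-step construction in (3.7) is easy to state in code; here is a minimal sketch (our own helper names, assuming a previously computed hard-thresholded estimate beta_hat, not the author's implementation):

```python
import numpy as np

def ridge_refit(y, X, beta_hat, lam1):
    """Refitted estimator sketched in (3.7): keep the support selected by the
    thresholded regression and refit by ridge with parameter lam1 >= 0."""
    support = np.flatnonzero(beta_hat)
    X1 = X[:, support]
    s1 = len(support)
    coef = np.linalg.solve(X1.T @ X1 + lam1 * np.eye(s1), X1.T @ y)
    beta_refit = np.zeros(X.shape[1])
    beta_refit[support] = coef
    return beta_refit
```

Setting lam1 = 0 gives back the least-squares refit on the selected support, i.e., the original thresholded-regression estimator, which is the baseline against which the shrinkage effects in Theorem 2 are measured.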
The opti- mal risks are attained at nontrivial ridge parameter λ 1 for both the L q -loss and prediction loss. Since s = o(n) by Condition 2 and kβ 0 k 2 is bounded from below by some positive constant by Condition 3, we see that both optimal ridge param- eters λ 1,opt and λ ′ 1,opt for the L 2 -risk and prediction risk, respectively, are of the same leading order O(skβ 0 k −2 2 ). In particular, the leading term O(skβ 0 k −2 2 ) of the optimalridgeparameterforL 2 -riskhasa similarrangetothatoftheoptimalridge parameter forprediction risk, differing by onlya factorofλ max /λ min . Such afactor is the condition number of the Gram matrixX 0 τX 0 resulting from the true design matrix X 0 . In view of (3.5) and Condition 3, its condition number λ max /λ min is sandwiched between 1 and c 2 3 /c 2 . 22 It is interesting to observe that the optimal choices of the ridge parameter for both L 2 -loss and prediction loss are of the same order O(skβ 0 k −2 2 ), which is proportional to the true model size s and has an inverse relationship with kβ 0 k 2 . This indicates that stronger signal leads to smaller optimal L 2 -shrinkage. Thus the optimal ridge parameter has a simultaneous benefit on both the L 2 -loss and prediction loss. Furthermore, the minimum L 2 -risk and minimum prediction risk share the same order of O(s/n), when the risks are minimized by the optimal ridge parameters. These risk properties demonstrate that Stein’s shrinkage effects extend to the thresholded regression followed by the L 2 -regularization under both estimation and prediction losses. The idea of refitting has also been investigated in van de Geer, B¨ uhlmann and Zhou (2011), who established bounds on the prediction loss and L q -loss with q ∈ [1,2] for the thresholded Lasso estimator. The thresholded Lasso is a three- stepprocedurewiththeLassofollowedbyhard-thresholdingandanordinaryleast- squares refitting, while the above L 2 -regularized refitting is a two-step procedure with hard-thresholding and ordinary least-squares refitting automatic in thresh- olded regression. A main difference is that our study focuses on the risk proper- ties and identifying optimal ridge parameters for minimizing the risks. These new riskproperties revealinteresting Stein’s shrinkageeffects inthresholded regression, which was lacking before. 23 2.4 Implementation Efficient algorithms for the implementation of regularization methods include the LQA (Fan and Li, 2001), LARS (Efron et al., 2004), and LLA (Zou and Li, 2008). As an alternative to these algorithms, the coordinate optimization has become popular due to its scalability for large-scale problems; see, for example, Friedman etal. (2007),WuandLange(2008),andFanandLv(2011). Inthispaper,weapply the ICA algorithm (Fan and Lv, 2011) to implement the regularization methods. See Section V in Fan and Lv (2011) for a detailed description of this algorithm. An analysis of convergence properties of this algorithm has been presented in Lin and Lv (2013). In particular, the univariate global minimizer for each coordinate admits a closed form as given in Lemma 1, for both hard-thresholding penalty p H,λ (t) and L 0 -penalty p H 0 ,λ (t). We would like to point out that the algorithm is not guaranteed to find the global minimizer. Although our theory relies on the union of coordinate subspaces S M/2 associ- ated with the robust spark of the design matrix, the implementation via the ICA algorithm does not require the knowledge of such a space. 
It is a path-following algorithm, based on a decreasing grid ofregularization parameter λ, that produces a sequence ofmost sparse solutions to less sparse solutions, with the solution given by the previous λ as a warm start forthe next λ. The collinearity ofsparse models can be tracked easily by calculating the smallest singular value of the subdesign matrix given by the support of each produced sparse solution. 24 To better illustrate our theoretical results and make a fair comparison of all methods, we select the tuning parameters by minimizing the prediction error cal- culated using an independent validation set, with size equal to the sample size in the simulation study. We use the SICA (Lv and Fan, 2009) with penalty p λ (t;a) = λ(a + 1)t/(a + t), with a small shape parameter a such as 10 −4 or 10 −2 , as a proxy of the L 0 -regularization method. Following Lin and Lv (2013), some pilot solutions with larger values of a are computed to stabilize the solution. See also Lin and Lv (2013) for the closed-form solution of the univariate SICA estimator. 2.5 Numerical Studies Inthissection,weinvestigatethefinite-sampleperformanceofregularizationmeth- ods with hard-thresholding (Hard) and SICA penalties, with comparison to the Lasso and oracle procedure which knew the true model in advance. We consider both cases of light-tailed and heavy-tailed errors, with Gaussian distribution for the former and t-distribution for the latter. 25 2.5.1 Simulation Examples Simulation Example 1 We first consider the linear regression model (2.1) with Gaussian error ε ∼ N(0,σ 2 I n ). Wegenerated100datasetsfromthismodelwithtrueregressioncoeffi- cientvectorβ 0 =(v T ,··· ,v T ,0τ) T withthepatternv= (β T strong ,β T weak ) T repeated q times, where β strong = (0.6,0,0,−0.6,0,0)τ and β weak = (0.05,0,0,−0.05,0,0)τ or (0.1,0,0,−0.1,0,0)τ. The coefficient subvectors β strong and β weak stand for the strong signals and weak signals in β 0 , respectively. The two choices of β weak showed the performance offourmethods under different levels ofweak signals. We set q = 3 so that there are six strong signals (with magnitude 0.6) and six weak signals (with magnitude 0.05 or 0.1) in the true coefficient vector. The sample size n was chosen to be 100 and two settings of (p,σ)= (1000,0.4) and (5000,0.3) were considered. For each data set, all the rows of the n×p design matrixX were sampled asindependent andidentically distributed copies from a multivariate nor- mal distribution N(0,Σ) with Σ = (0.5 |i−j| ) 1≤i,j≤p . This allows for correlation among the covariates at the population level. The sample collinearity among the covariates can be at an even higher level due to high dimensionality. We applied the Lasso, Hard, and SICA to produce a sequence of sparse models and selected the tuning parameters as discussed in Section 2.4. 26 0 2 4 6 8 0.35 0.4 0.45 0.5 λ 1 Prediction risk Lasso−L 2 0 2 4 6 8 0.188 0.19 0.192 0.194 0.196 0.198 λ 1 Prediction risk Hard−L 2 0 2 4 6 8 0.188 0.19 0.192 0.194 0.196 0.198 λ 1 Prediction risk SICA−L 2 0 2 4 6 8 0.188 0.19 0.192 0.194 0.196 0.198 λ 1 Prediction risk Oracle−L 2 Figure 2.1: Representative prediction risk curves as a function of the ridge param- eter λ 1 by all methods in Section 2.5.1 for the case of (p,|β weak |)=(1000,0.05). 
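The U-shaped risk curves in Figures 2.1 and 2.2 can be reproduced qualitatively from the leading risk term in Theorem 2(a); the sketch below (our own toy design with made-up numbers, using the oracle support so that no selection step is involved) evaluates the sum of (λ_1² b_j² + d_j σ²)/(d_j + λ_1)² over a grid of ridge parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
n, s, sigma = 100, 6, 0.4
X0 = rng.normal(size=(n, s))
X0 *= np.sqrt(n) / np.linalg.norm(X0, axis=0)      # columns scaled to L2-norm sqrt(n)
beta01 = np.array([0.6, -0.6, 0.6, -0.6, 0.6, -0.6])

d, V = np.linalg.eigh(X0.T @ X0)                   # eigenvalues d_j; V^T plays the role of P
b = V.T @ beta01                                   # coordinates of beta_{0,1} in the eigenbasis
lam_grid = np.linspace(0.0, 8.0, 81)
risk = [np.sum((lam1**2 * b**2 + d * sigma**2) / (d + lam1)**2) for lam1 in lam_grid]

print(lam_grid[int(np.argmin(risk))])              # strictly positive optimal ridge parameter
print(risk[0] > min(risk))                         # True: some shrinkage beats the plain refit
```

The minimizer is strictly positive and of the order s σ² / ||β_0||_2², in line with the optimal ridge parameter identified in Theorem 2.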
0 2 4 6 8 0.45 0.5 0.55 0.6 0.65 λ 1 L 2 −risk Lasso−L 2 0 2 4 6 8 0.18 0.185 0.19 0.195 0.2 0.205 λ 1 L 2 −risk Hard−L 2 0 2 4 6 8 0.18 0.185 0.19 0.195 0.2 0.205 λ 1 L 2 −risk SICA−L 2 0 2 4 6 8 0.16 0.17 0.18 0.19 λ 1 L 2 −risk Oracle−L 2 Figure 2.2: Representative L 2 -risk curves as a function of the ridge parameter λ 1 by all methods in Section 2.5.1 for the case of (p,|β weak |)= (1000,0.05). 27 Table 2.1: Means and standard deviations (in parentheses) of different perfor- mance measures by all methods over 100 simulations in Section 2.5.1 Setting Measure Lasso Hard SICA Oracle p = 1000 PE 0.3025 (0.0479) 0.1862 (0.0086) 0.1862 (0.0103) 0.1829 (0.0100) |β weak | = 0.05 L 2 -loss 0.4007 (0.0653) 0.1679 (0.0238) 0.1678 (0.0276) 0.1505 (0.0324) L 1 -loss 1.7660 (0.2942) 0.5274 (0.0769) 0.5276 (0.0921) 0.4277 (0.0979) L ∞ -loss 0.2012 (0.0418) 0.0804 (0.0258) 0.0790 (0.0255) 0.0854 (0.0207) FP 33.7900 (7.0457) 0.0800 (0.2727) 0.0900 (0.4044) 0 (0) FN-strong 0 (0) 0 (0) 0 (0) 0 (0) FN-weak 5.6000 (0.6513) 5.9900 (0.1000) 5.9900 (0.1000) 0 (0) b σ 0.4295 (0.0473) 0.4158 (0.0328) 0.4155 (0.0351) 0.4000 (0.0347) p = 1000 PE 0.3643 (0.0584) 0.2272 (0.0115) 0.2283 (0.0124) 0.1829 (0.0100) |β weak | = 0.1 L 2 -loss 0.4882 (0.0674) 0.2749 (0.0223) 0.2769 (0.0224) 0.1505 (0.0324) L 1 -loss 2.2134 (0.3202) 0.8466 (0.1018) 0.8553 (0.1052) 0.4277 (0.0979) L ∞ -loss 0.2225 (0.0453) 0.1068 (0.0177) 0.1077 (0.0177) 0.0854 (0.0207) FP 34.4300 (6.9866) 0.0900 (0.3208) 0.1600 (0.5453) 0 (0) FN-strong 0 (0) 0 (0) 0 (0) 0 (0) FN-weak 4.9200 (0.9711) 5.8200 (0.6257) 5.8000 (0.5125) 0 (0) b σ 0.4676 (0.0541) 0.4559 (0.0377) 0.4540 (0.0425) 0.4000 (0.0347) p = 5000 PE 0.2634 (0.0744) 0.1097 (0.0058) 0.1088 (0.0039) 0.1027 (0.0062) |β weak | = 0.05 L 2 -loss 0.4419 (0.0904) 0.1476 (0.0185) 0.1450 (0.0127) 0.1122 (0.0260) L 1 -loss 1.8507 (0.3387) 0.4593 (0.0602) 0.4528 (0.0464) 0.3166 (0.0775) L ∞ -loss 0.2188 (0.0507) 0.0621 (0.0206) 0.0592 (0.0152) 0.0663 (0.0188) FP 37.3900 (4.9826) 0.0600 (0.2778) 0.0100 (0.1000) 0 (0) FN-strong 0 (0) 0 (0) 0 (0) 0 (0) FN-weak 5.8600 (0.3487) 5.9900 (0.1000) 6.0000 (0) 0 (0) b σ 0.3822 (0.0452) 0.3173 (0.0239) 0.3187 (0.0231) 0.2976 (0.0242) p = 5000 PE 0.3603 (0.1089) 0.1838 (0.2401) 0.1489 (0.0070) 0.1027 (0.0062) |β weak | = 0.1 L 2 -loss 0.5594 (0.1054) 0.2830 (0.1654) 0.2581 (0.0136) 0.1122 (0.0260) L 1 -loss 2.4396 (0.3980) 0.8361 (0.4546) 0.7685 (0.1077) 0.3166 (0.0775) L ∞ -loss 0.2584 (0.0618) 0.1117 (0.0704) 0.1016 (0.0066) 0.0663 (0.0188) FP 38.6000 (4.2593) 0.0700 (0.4324) 0.2200 (1.8123) 0 (0) FN-strong 0 (0) 0.1100 (0.7771) 0 (0) 0 (0) FN-weak 5.5300 (0.6269) 5.7700 (0.5096) 5.7100 (0.6403) 0 (0) b σ 0.4417 (0.0557) 0.3826 (0.1214) 0.3629 (0.0429) 0.2976 (0.0242) The overall signal-to-noise ratios in the settings of (p,|β weak |) = (1000,0.05), (1000,0.1), (5000,0.05), and (5000,0.1) are 11.70, 11.77, 20.80, and 20.92, respec- tively. These overall measures, however, do not reflect the individual signal strength for each strong or weak signal, which measures the difficulty of the vari- able selection problem. 
In the case ofp = 1000,the individual signal-to-noiseratio 28 Table 2.2: Means and standard deviations (in parentheses) of different perfor- mance measures by all methods followed by the L 2 -regularization over 100 simu- lations in Section 2.5.1 Setting Measure Lasso-L 2 Hard-L 2 SICA-L 2 Oracle-L 2 p = 1000 PE 0.3501 (0.0538) 0.1851 (0.0083) 0.1852 (0.0101) 0.1812 (0.0098) |β weak | = 0.05 L 2 -loss 0.4464 (0.0635) 0.1658 (0.0237) 0.1657 (0.0277) 0.1437 (0.0329) L 1 -loss 2.5473 (0.3986) 0.5169 (0.0764) 0.5177 (0.0928) 0.4061 (0.1001) L ∞ -loss 0.1698 (0.0412) 0.0752 (0.0250) 0.0741 (0.0248) 0.0772 (0.0206) p = 1000 PE 0.4168 (0.0675) 0.2257 (0.0109) 0.2270 (0.0117) 0.1812 (0.0098) |β weak | = 0.1 L 2 -loss 0.5272 (0.0688) 0.2734 (0.0221) 0.2755 (0.0218) 0.1435 (0.0328) L 1 -loss 3.0270 (0.4430) 0.8366 (0.1016) 0.8450 (0.1045) 0.4053 (0.0990) L ∞ -loss 0.1905 (0.0472) 0.1053 (0.0150) 0.1060 (0.0142) 0.0770 (0.0204) p = 5000 PE 0.2642 (0.0532) 0.1090 (0.0055) 0.1082 (0.0037) 0.1020 (0.0060) |β weak | = 0.05 L 2 -loss 0.4358 (0.0675) 0.1462 (0.0181) 0.1437 (0.0127) 0.1082 (0.0263) L 1 -loss 2.4131 (0.3343) 0.4508 (0.0586) 0.4448 (0.0462) 0.3019 (0.0758) L ∞ -loss 0.1896 (0.0454) 0.0597 (0.0195) 0.0569 (0.0146) 0.0610 (0.0197) p = 5000 PE 0.3594 (0.0816) 0.1830 (0.2403) 0.1481 (0.0071) 0.1020 (0.0060) |β weak | = 0.1 L 2 -loss 0.5492 (0.0833) 0.2823 (0.1656) 0.2574 (0.0141) 0.1082 (0.0264) L 1 -loss 3.0841 (0.4236) 0.8280 (0.4560) 0.7614 (0.1090) 0.3017 (0.0760) L ∞ -loss 0.2242 (0.0562) 0.1116 (0.0704) 0.1013 (0.0058) 0.0610 (0.0198) is0.6 2 /σ 2 =2.25foreachstrongsignal, and0.05 2 /σ 2 =0.0156or0.1 2 /σ 2 = 0.0625 for each weak signal with level 0.05 or 0.1. In the case of p =5000, the individual signal-to-noise ratio is 0.6 2 /σ 2 = 4 for each strong signal, and 0.05 2 /σ 2 = 0.0278 or 0.1 2 /σ 2 = 0.1111 for each weak signal with level 0.05 or 0.1. We see that the six weak covariates have very low signal strength. Their signal strength is even lower when the high dimensionality is taken into account, due to the well-known phenomenon of noise accumulation in high dimensions. To compare the three regularization methods with the oracle procedure, we consider several performance measures. The first measure is the prediction error (PE) defined as E(Y −xτ b β) 2 with b β an estimate and (xτ,Y) an independent 29 observation of the covariates and response. To calculate the expectation, we gen- erated an independent test sample of size 10,000. The second to fourth measures are the L q -estimation losses k b β−β 0 k q with q = 2,1, and ∞, respectively. The fifth to seventh measures are the number of false positives (FP), and numbers of false negatives for strong signals (FN-strong) and false negatives for weak signals (FN-weak) for variable selection, where a false positive means a falsely selected noise covariate in the model and a false negative means a missed true covariate. We also compare the estimated error standard deviationb σ by all methods. Table 2.1 summarizes the comparison results by all methods. As seen in the measure of FN-weak, the weak covariates tended to be excluded by each regular- izationmethodsincetheyhaveverylowsignalstrength. Attheweaksignallevelof 0.05,thankstotheirconcavitybothHardandSICAfollowedverycloselytheoracle procedure in terms of all other measures, while the Lasso produced a much larger model with lower prediction and variable selection accuracy due to its well-known bias issue. 
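For reference, the performance measures just described can be computed in a few lines; this is a schematic helper with hypothetical argument names (an estimate beta_hat, a test sample, and the index sets of strong and weak signals), not the code used to produce the reported tables:

```python
import numpy as np

def performance_measures(beta_hat, beta0, X_test, y_test, strong_idx, weak_idx):
    """Sketch of the measures in Section 2.5.1: prediction error (PE) approximated
    on an independent test sample, L_q estimation losses, and FP/FN counts."""
    pe = np.mean((y_test - X_test @ beta_hat) ** 2)
    l2 = np.linalg.norm(beta_hat - beta0, 2)
    l1 = np.linalg.norm(beta_hat - beta0, 1)
    linf = np.max(np.abs(beta_hat - beta0))
    selected = set(np.flatnonzero(beta_hat))
    true_support = set(np.flatnonzero(beta0))
    fp = len(selected - true_support)                 # falsely selected noise covariates
    fn_strong = len(set(strong_idx) - selected)       # missed strong signals
    fn_weak = len(set(weak_idx) - selected)           # missed weak signals
    return dict(PE=pe, L2=l2, L1=l1, Linf=linf, FP=fp,
                FN_strong=fn_strong, FN_weak=fn_weak)
```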
When the weak signal level increases to 0.1, the performance of each method deteriorated due to the difficulty of recovering weak covariates. We also considered the case of no weak signals with β weak =0. In such case, all methods performed better and their relative performance was the same as in the case with the weak signal level of 0.05, with both Hard and SICA having almost identical performance as the oracle procedure. To save space, these additional simulation results are not included here but are available upon request. 30 We also investigate the risk properties and shrinkage effects of the L 2 - regularized refitted estimators b β refitted defined in (3.7) for all methods. Table 2.2 presents the performance of these shrinkage estimators in the above two settings with the ridge parameter λ 1 selected to minimize the corresponding risks. A com- parison of risks under different losses in Tables 2.1 and 2.2 shows the improvement of the L 2 -regularized refitted estimators over the estimators given by Hard, SICA, and oracle procedure, respectively. These numerical results are in line with the theoretical results in Theorem 2. The results of the L 2 -regularized refitted esti- mator for the Lasso show no improvement in risks. This is because of the bias issue of the Lasso giving rise to a large model. Figures 2.1 and 2.2 depict some representative risk curves as a function of the ridge parameter λ 1 by all methods for the prediction loss and L 2 -loss, respectively. These plots demonstrate Stein’s shrinkage effects for the thresholded regression followed by the L 2 -regularization under both estimation and prediction risks. Simulation Example 2 A natural question is whether the results and phenomena for light-tailed errors hold for heavy-tailed errors or not. We now turn our attention to such a case for the linear regression model (2.1) with t error distribution. The setting of this simulation example is the same as that in Section 2.5.1 except that the error vector isε = ση, where the components of the n-dimensional random vector η are 31 independent and follow the t-distribution with df = 10 degrees of freedom. We compared the Lasso, Hard, and SICA with the oracle procedure in the same two settings of (p,σ)= (1000,0.4)and (5000,0.3). The same performance measures as in Section 2.5.1 are employed for comparison. The means and standard deviations of different performance measures by all methods are listed in Table 2.3. Table 2.4 details the performance of the L 2 - regularized refittedestimators, asdescribed inSection2.5.1,with theridgeparam- eter λ 1 selected to minimize the corresponding risks. The conclusions are similar tothoseinSection2.5.1. Bycomparingtheresultsinthissimulation examplewith those in Section 2.5.1 for Gaussian error, we see that the performance of all meth- odsdeterioratedwhen theerrordistributionbecomes heavy-tailed. BothHardand SICA still followed closely the oracle procedure at the weak signal level of 0.05. We also observe the phenomenon of Stein’s shrinkage effects for the thresholded regression followed by the L 2 -regularization under both estimation and prediction risks in this case of heavy-tailed error distribution. 2.5.2 Real Data Example We apply the Lasso, Hard, and SICA, as well as these methods followed by the L 2 -regularization, to the diabetes data set studied in Efron et al. (2004). 
This data set consists of measurements for n = 442 diabetes patients on the response variable, a quantitative measure of disease progression one year after baseline, and ten baseline variables: sex (sex), age (age), body mass index (bmi), average blood pressure (bp), and six blood serum measurements (tc, ldl, hdl, tch, ltg, glu). Efron et al. (2004) considered the quadratic model with interactions, obtained by adding the squares of all baseline variables except the dummy variable sex, and all interactions between each pair of the ten baseline variables. This results in a linear regression model with p = 64 predictors. We adopt this model to analyze the diabetes data set.

Figure 2.3: Representative prediction error curves as a function of the ridge parameter λ_1 by all methods (Lasso-L_2, Hard-L_2, and SICA-L_2) on the diabetes data set in Section 2.5.2.

We randomly split the full data set 100 times into a training set of 400 samples and a validation set of 42 samples. For each splitting of the data set, we applied each regularization method to the training set with the quadratic model, and calculated the prediction error, as defined in Section 2.5.1, on the validation set. Minimizing the prediction error gives the best model for each regularization method. The means (standard deviations) of these minimum prediction errors over 100 random splittings were 2894.5 (655.5) for Lasso, 2802.5 (635.5) for Hard, and 2800.6 (615.2) for SICA. We see that both Hard and SICA improved over Lasso in prediction accuracy. The relatively large standard deviations indicate the difficulty of the prediction problem for this data set. Based on the estimated model by each method, we also investigated the L_2-regularized refitted estimator with ridge parameter λ_1 selected by the validation set. The means (standard deviations) of their prediction errors over 100 random splittings were 2957.9 (671.3) for Lasso-L_2, 2770.3 (630.9) for Hard-L_2, and 2770.2 (614.9) for SICA-L_2. We observe in Figure 2.3 shrinkage effects for both Hard and SICA followed by the L_2-regularization, whereas the refitting with L_2-regularization did not generally improve the performance of Lasso, as also shown in the simulation studies.

We also calculated the median model size by each method: 18 by Lasso, 8 by Hard, and 8 by SICA. For each method, we computed the percentage of times each predictor was selected and listed the most frequently chosen m predictors in Table 2.5, with m equal to the median model size by the method. Table 2.5 also reports the t-statistics of selected predictors as the ratio of mean to standard deviation, with the means and standard deviations of their coefficients calculated over the 100 random splittings. We see that the set of most frequently selected predictors for Hard is identical to that for SICA, which is in turn a subset of that for Lasso. Some of these selected predictors have t-statistics with magnitude below 2, indicating lower significance. We also observe that the coefficients for predictors sex and hdl estimated by all methods are negative. It is interesting to note that the interaction term sex∗age is found to be significant, although the predictor age is an insignificant variable based on each method.

2.6 Proofs

2.6.1 Proof of Theorem 1

The proof contains two parts.
The first part establishes the model selection con- sistency property of b β with a suitably chosen λ. The second part proves the the oracle prediction properties using the model selection consistency property from the first part. Part 1: Model selection consistency property. We prove supp( b β) = supp(β 0 ) in two steps. In the first step, it will be shown that the number of nonzero elements in b β is no larger than s conditioning on event E defined in (6.32) (see Lemma 2 in Section 2.6.3 of Additional Technical Details), when c 2 c p (2s+1)(loge p)/n< λ< b 0 . We prove this by using the global optimality of b β. By Lemma 1 and λ < b 0 , any nonzero component of the true regression coef- ficient vector β 0 or of the global minimizer b β is greater than λ, which ensures 35 thatkp λ ( b β)k 1 =λ 2 k b βk 0 /2 andkp λ (β 0 )k 1 = sλ 2 /2. Thus, kp λ ( b β)k 1 −kp λ (β 0 )k 1 = (k b βk 0 −s)λ 2 /2. Denote by δ = b β−β 0 . Direct calculations yield Q( b β)−Q(β 0 )= 2 −1 kn − 1 2 Xδk 2 2 −n −1 ετXδ+kp λ (β)k 1 −kp λ (β 0 )k 1 = 2 −1 kn − 1 2 Xδk 2 2 −n −1 ετXδ+(k b βk 0 −s)λ 2 /2. (6.8) On the other hand, Conditional on event E, we have |n −1 ετXδ| ≤ kn −1 ετXk ∞ kδk 1 ≤ c 2 p (loge p)/nkδk 1 ≤ c 2 p (loge p)/nkδk 1 2 0 kδk 2 . (6.9) In addition, by definition and Condition 2, we obtain kδk 0 ≤kβ 0 k 0 +k b βk 0 < M, with M being the robust spark of X. Therefore, Definition 1 entails kn − 1 2 Xδk 2 ≥ ckδk 2 . (6.10) Combining(6.11)withtheinequalities(6.12)and(6.13)establishedabovegives Q( b β)−Q(β 0 )≥ 2 −1 c 2 kδk 2 2 − c 2 p (loge p)/nkδk 1 2 0 kδk 2 +(k b βk 0 −s)λ 2 /2. (6.11) Thus, the global optimality of b β ensures that 2 −1 c 2 kδk 2 2 −c 2 r loge p n kδk 1 2 0 kδk 2 +(k b βk 0 −s)λ 2 /2≤ 0. 36 Reorganizing the above inequality and collecting terms, we get h ckδk 2 − c 2 c r loge p n kδk 1 2 0 i 2 −( c 2 c ) 2 loge p n kδk 0 +(k b βk 0 −s)λ 2 ≤ 0, which gives (k b βk 0 −s)λ 2 ≤ ( c 2 c ) 2 loge p n kδk 0 . (6.12) We next bound the value of k b βk 0 using the above inequality (6.15). Let k = k b βk 0 , thenkδk 0 =k b β−β 0 k 0 ≤ k+s. Thus, itfollows from(6.15)that(k−s)λ 2 ≤ ( c 2 c ) 2loge p n (k+s). Organizing it in terms of k and s, we get k(λ 2 −( c 2 c ) 2 loge p n )≤ s(λ 2 +( c 2 c ) 2 loge p n ). (6.13) Since λ > c 2 c p (2s+1)loge p/n, we have λ 2 −( c 2 c ) 2 (2s+1) loge p n >0 and λ 2 c 2 n− c 2 2 loge p> 2c 2 2 sloge p. Then it follows from inequality (6.16) that k≤ s (λ 2 +( c 2 c ) 2loge p n ) (λ 2 −( c 2 c ) 2 loge p n ) = s(1+ 2c 2 2 loge p λ 2 c 2 n−c 2 2 loge p )< s+1. Therefore, the number of nonzero elements in b β satisfies k b βk 0 ≤ s. The second stepisbasedonthefirststep, wherewe willuse proofby contradic- tionto show thatsupp(β 0 )⊂ supp( b β)with the additionalassumption λ < b 0 c/ √ 2 of the theorem. Suppose that supp(β 0 ) 6⊂ supp( b β), then the number of missed true coefficients k = |supp(β 0 )\supp( b β)| ≥ 1. Thus we have k b βk 0 ≥ s− k and 37 kδk 0 ≤ k b βk 0 +kβ 0 k 0 ≤ 2s. Combining these two results with inequality (6.14) yields Q( b β)−Q(β 0 )≥ (2 −1 c 2 kδk 2 −c 2 r 2sloge p n )kδk 2 −kλ 2 /2. (6.14) Note that for each j ∈ supp(β 0 )\ supp( b β), we have |δ j | = |β 0,j | ≥ b 0 with b 0 being the lowest signal strength in Condition 2. Thus, kδk 2 ≥ √ kb 0 , which together with Condition 2 entails 4 −1 c 2 kδk 2 ≥ 4 −1 c 2 √ kb 0 ≥ 4 −1 c 2 b 0 >c 2 p (2sloge p)/n. Thus, it follows from (6.17) that Q( b β)−Q(β 0 )≥ 4 −1 c 2 kδk 2 2 −kλ 2 /2≥ 4 −1 c 2 kb 2 0 −kλ 2 /2> 0, where the last step is because of the additional assumption λ < b 0 c/ √ 2. 
The above inequality contradicts with the global optimality of b β. Thus, we have supp(β 0 ) ⊂ supp( b β). Combining this with k b βk 0 ≤ s from the first step, we know that supp( b β) =supp(β 0 ). ItfollowsfromLemma1andthecharacterizationofthepenalizedleast-squares estimatorinTheorem1inLvandFan(2009)thatthehard-thresholdedestimator b β on its support supp( b β) is exactly the ordinary least-squares estimator constructed 38 using covariatesinsupp( b β). Withthemodelselection consistency propertyproved above, we have theexplicit formof b β onits supportas (X 0 τX 0 ) −1 X 0 τy, whereX 0 is the submatrix of the design matrix X consisting of columns in supp(β 0 ). Now we derive bounds for the prediction and estimation losses of b β. Part2: Predictionandestimationlosses. TheideaistogettheL 2 -estimation loss bound by the globaloptimality of b β, conditional on the eventE 1 =E∩E ′ with E and E ′ defined in (6.32) (see Lemma 2 in Section 2.6.3 of Additional Technical Details). Conditional on E 1 , we have kδk 0 ≤ s by the model selection consistency prop- erty proved above. Thus, by the Cauchy-Schwarz inequality we have |n −1 ετX 0 δ| ≤ kn −1 ετX 0 k ∞ kδk 1 ≤ c ′ 2 r logn n kδk 1 ≤ c ′ 2 r slogn n kδk 2 , (6.15) Since (6.11) and (6.13) are still true as they depend only on Condition 2 and Definition 1, it follows from (6.18) and the model selection consistency property k b βk 0 =s that Q( b β)−Q(β 0 ) =2 −1 kn −1 Xδk 2 2 −n −1 ετXδ+(k b βk 0 −s)λ 2 /2 ≥ 2 −1 c 2 kδk 2 2 −n −1 ετXδ≥ (2 −1 c 2 kδk 2 −c ′ 2 r slogn n )kδk 2 . 39 Then it follows from the globaloptimality of b β that 2 −1 c 2 kδk 2 −c ′ 2 q slogn n ≤ 0, which gives the L 2 and L ∞ estimation bound as k b β−β 0 k 2 =kδk 2 ≤ 2c −2 c ′ 2 p (slogn)/n, k b β−β 0 k ∞ ≤k b β−β 0 k 2 ≤ 2c −2 c ′ 2 p (slogn)/n. For L q -estimation loss with 1≤ q < 2, applying H¨ older’s inequality gives k b β−β 0 k q =( n X j=1 |δ j | q ) 1/q ≤ ( n X j=1 |δ j | 2 ) 1 2 ( X δ j 6=0 1 2 2−q ) 1 q − 1 2 =kδk 2 kδk 1 q − 1 2 0 ≤ 2c −2 c ′ 2 s 1 q p (logn)/n. (6.16) Finally we prove the bound for oracle prediction loss. Since b β is the global minimizer, it follows from (6.11) and the model selection consistency property that conditioning onE 1 2 −1/2 n − 1 2 kX( b β−β 0 )k 2 ≤ n n −1 ετXδ−(k b βk 0 −s)λ 2 /2 o 1/2 ≤ kn −1 Xτ 0 εk ∞ kδk 1 1/2 ≤ c ′ 2 c −1 p 2s(logn)/n, where the last step is because of the L 1 estimation bound proved above. This completes the proof. 40 2.6.2 Proof of Theorem 2 In this proof, we apply mathematical techniques such as singular value decom- position and Taylor expansion to study the explicit forms of risks of the refitted estimator b β refitted under squared L 2 -loss and squared prediction loss, and to find out the orders and leading terms of the optimal tuning parameter λ 1 and the corresponding minimized risks. The proof consists of two parts. Part 1: Risk properties for b β refitted under L q -estimation loss. We first consider the risk of b β refitted under the squared L 2 -loss and find the order and leading term of the corresponding optimal λ 1 . The main idea is to divide the risk into two parts, and then minimize the first part conditional on event E defined in (6.32),andshow thattheotherparthasasmaller order. Bydefault, allarguments below are conditional onE. Proof of Theorem 1 ensures that supp( b β) = supp(β 0 ) conditional on event E under Conditions 1 and 2. Thus, if we denoteX 0 as the oracle design matrix, then X 1 =X 0 and s 1 = s 0 . Let I s be the s×s identity matrix for a positive integer s. 
It follows that b β refitted = (X 1 τX 1 +λ 1 I s 1 ) −1 X 1 τy =(X 0 τX 0 +λ 1 I s ) −1 X 0 τX 0 β 0 +(X 0 τX 0 +λ 1 I s ) −1 X 0 τε, 41 where in the last step we used y = X 0 β 0 +ε. So the difference between b β refitted and β 0 is b β refitted −β 0 =−λ 1 (X 0 τX 0 +λ 1 I s ) −1 β 0 +(X 0 τX 0 +λ 1 I s ) −1 X 0 τε. Setμ =−λ 1 (X 0 τX 0 +λ 1 I s ) −1 β 0 andA =X 0 (X 0 τX 0 +λ 1 I s ) −1 ,then b β refitted −β 0 = μ+Aτε. Thus, conditioning onE we have k b β refitted −β 0 k 2 2 =μτμ+2μτAτε+ετAAτε. (6.17) In view of (6.17), we consider the expectation of k b β refitted −β 0 k 2 2 by using the following decomposition: Ek b β refitted −β 0 k 2 2 = E{1 E k b β refitted −β 0 k 2 2 }+E{1 E ck b β refitted −β 0 k 2 2 } ≤ E{μτμ+2μτAτε+ετAAτε}+E{1 E ck b β refitted −β 0 k 2 2 }. Since P(E c ) = o(1) by Lemma 2 in Section 2.6.3 of Additional Technical Details, the above inequality becomes an equation asymptotically by the dominated con- vergence theorem, which provides the basis for determining the orders of the risks. 42 To ease the presentation, we do not distinguish between these two representa- tions hereafter. The above decomposition, along with (6.17), Condition 1 and ε∼ N(0,σ 2 I n ), gives Ek b β refitted −β 0 k 2 2 (6.18) ≤μτμ+σ 2 tr(AAτ)+E{1 E ck b β refitted −β 0 k 2 2 } =I 1 (λ 1 )+I 2 (λ 1 )+I 3 (λ 1 ), (6.19) where I 1 (λ 1 ) =λ 2 1 β 0 τ(X 0 τX 0 +λ 1 I s ) −2 β 0 , (6.20) I 2 (λ 1 ) =σ 2 tr(X 0 (X 0 τX 0 +λ 1 I s ) −2 )X 0 τ), (6.21) I 3 (λ 1 ) =E{1 E ck b β refitted −β 0 k 2 2 }. (6.22) We analyze the first two terms, I 1 (λ 1 ) and I 2 (λ 1 ) in (6.18), by singular value decomposition. Since X 0 τX 0 is symmetric and positive semidefinite, there exists s× s orthonormal matrix P such that X 0 τX 0 = PτDP, where D is a diago- nal matrix with nonnegative elements d i ,i = 1,··· ,s, the eigenvalues of X 0 τX 0 . Replacing X 0 τX 0 with PτDP, we get X 0 τX 0 +λ 1 I s = Pτ(D+λ 1 I s )P and (X 0 τX 0 +λ 1 I s ) −2 = Pτ(D+λ 1 I s ) −2 P. 43 Set b =(b 1 ,··· ,b s )τ =Pβ 0 . Then kbk 2 =kβ 0 k 2 and the first term becomes I 1 (λ 1 ) =λ 2 1 β 0 τPτ(D+λ 1 I s ) −2 Pβ 0 = s X i=1 λ 2 1 b 2 i (d i +λ 1 ) 2 and the second term can be simplified as I 2 (λ 1 )= σ 2 tr(X 0 τX 0 (X 0 τX 0 +λ 1 I s ) −2 ) =σ 2 tr(PτDPPτ(D+λ 1 I s ) −2 P) = σ 2 tr(D(D+λ 1 I s ) −2 ) = s X i=1 σ 2 d i (d i +λ 1 ) 2 . Substituting the above two terms into (6.18), we get Ek b β refitted −β 0 k 2 2 ≤ f(λ 1 )+ I 3 (λ 1 ), where f(λ 1 ) = s X i=1 λ 2 1 b 2 i (d i +λ 1 ) 2 + s X i=1 σ 2 d i (d i +λ 1 ) 2 . (6.23) Note that f(λ 1 ) is a sum of two terms, with the first term increasing with λ 1 and the second term decreasing with λ 1 . Besides f(λ 1 ), we have another term E{1 E ck b β refitted −β 0 k 2 2 }, which will be shown to be of a strictly smaller order than f(λ 1 ). Part 1.1: Identifying orders of optimal λ 1 and corresponding f(λ 1 ) for L 2 -risk. It is hard to find the exact λ 1 minimizes f(λ 1 ) since the denominators in the sum are different, but we can surely identify its order, in the following three steps. 44 First of all, we claim that c 2 n ≤ d i ≤ c 2 3 n for all i. It suffices to show that the maximum and minimum eigenvalues ofX 0 τX 0 , denoted as λ max and λ min , can be bounded as c 2 ≤ λ min /n ≤ λ max /n ≤ c 2 3 . To this end, note that as X 0 is the submatrix ofX formed by columns with indices in supp(β 0 ) and|supp(β 0 )| = s< M/2 by Condition 2, we have λ min /n ≥ c 2 by the property of robust spark. On the other hand, since we assumed |supp(β 0 )| < M/2, Condition 3 ensures that λ max /n ≤ c 2 3 . 
So we have proved c 2 ≤ λ min /n ≤ λ max /n ≤ c 2 3 . Since d i ’s are the eigenvalues of X 0 τX 0 , it follows that c 2 n≤ λ min ≤ d i ≤ λ max ≤ c 2 3 n. (6.24) In fact, the same argument applies for any submatrix of X with the number of columns less than M/2. Since |supp( b β 1 )|≤k b βk 0 < M/2 by (2.1), we also have c 2 n≤ λ min (X 1 τX 1 )≤ λ max (X 1 τX 1 )≤ c 2 3 n, (6.25) which will be used later for analyzing I 3 (λ 1 ). 45 Second, we show that the optimal λ 1 that minimizes f(λ 1 ), denoted as λ 1,opt , is of the order o(n). If it is not true, then there exists some constant k > 0 such that λ 1,opt ≥ kn. By (6.24), Condition 3, and since kβ 0 k 2 ≥ O(1), we have f(λ 1,opt )≥ s X i=1 λ 2 1,opt b 2 i (d i +λ 1,opt ) 2 ≥ s X i=1 k 2 n 2 b 2 i (c 2 3 n+kn) 2 = s X i=1 k 2 b 2 i (c 2 3 +k) 2 = k 2 kβ 0 k 2 2 (c 2 3 +k) 2 ≥ O(1). (6.26) However, by the optimality of λ 1,opt , f(λ 1,opt ) ≤ f(0) = P s i=1 σ 2 d i = O( s n ) = o(1). It is a contradiction and thus we must have λ 1,opt =o(n). In the third step, we go one step further to show that the order of λ 1,opt is indeedO(skβ 0 k −2 2 )+O(s 2 n −1 kβ 0 k −4 2 )byapplying Taylorexpansion onf ′ (λ 1 )with λ 1,opt =o(n). Direct calculations yield f ′ (λ 1 ) = s X i=1 2λ 1 b 2 i d i (d i +λ 1 ) 3 − s X i=1 2σ 2 d i (d i +λ 1 ) 3 = s X i=1 2λ 1 b 2 i d i −2σ 2 d i (d i +λ 1 ) 3 . (6.27) Since the optimal λ 1 satisfies f ′ (λ 1,opt ) = 0, we have s X i=1 λ 1,opt b 2 i d i (d i +λ 1,opt ) 3 = s X i=1 σ 2 d i (d i +λ 1,opt ) 3 . (6.28) Wewillrearrangetheaboveequationasaquadraticequationforλ 1,opt byusing Taylor expansion. Since it has been proved that λ 1,opt = o(n), or equivalently, λ 1,opt = o(d i ) for each 1 ≤ i ≤ s, we can apply Taylor expansion with Lagrange 46 remainder to deal with the two fractions in (6.28). For the left hand side of (6.28), we have s X i=1 λ 1,opt b 2 i d i (d i +λ 1,opt ) 3 = s X i=1 λ 1,opt b 2 i d i ( 1 d 3 i − 3λ 1,opt (d i +ω i ) 4 )= λ 1,opt ( s X i=1 b 2 i d 2 i − s X i=1 3b 2 i d i λ 1,opt (d i +ω i ) 4 ), where ω i ’s are numbers between 0 and λ 1,opt . For the right hand side of (6.28), we get s X i=1 σ 2 d i (d i +λ 1,opt ) 3 = s X i=1 σ 2 d i ( 1 d 3 i − 3λ 1,opt d 4 i + 6λ 2 1,opt (d i +γ i ) 5 ) = s X i=1 σ 2 d 2 i −λ 1,opt s X i=1 3σ 2 d 3 i + s X i=1 6σ 2 d i λ 2 1,opt (d i +γ i ) 5 , where γ i ’s are numbers between 0 and λ 1,opt . Equalling the two sides yields λ 1,opt ( s X i=1 b 2 i d 2 i − s X i=1 3b 2 i d i λ 1,opt (d i +ω i ) 4 )= s X i=1 σ 2 d 2 i −λ 1,opt s X i=1 3σ 2 d 3 i + s X i=1 6σ 2 d i λ 2 1,opt (d i +γ i ) 5 . Reorganizing it in terms of the power of λ 1,opt , we obtain: s X i=1 6σ 2 d i (d i +γ i ) 5 + s X i=1 3b 2 i d i (d i +ω i ) 4 ! λ 2 1,opt − s X i=1 b 2 i d 2 i + s X i=1 3σ 2 d 3 i ! λ 1,opt + s X i=1 σ 2 d 2 i =0. (6.29) Its solution for λ 1,opt is −b− √ b 2 −4ac 2a , where a = P s i=1 6σ 2 d i (d i +γ i ) 5 + P s i=1 3b 2 i d i (d i +ω i ) 4 ,b = −( P s i=1 b 2 i d 2 i + P s i=1 3σ 2 d 3 i ) and c = P s i=1 σ 2 d 2 i . We drop the solution λ 1,opt = −b+ √ b 2 −4ac 2a 47 since its order is O(n), which can be proved by analyzing the orders of a,b and c as follows. With c 2 n≤ d i ≤ c 2 3 n, we can immediately calculate the orders of terms in a,b and c as s X i=1 6σ 2 d i (d i +γ i ) 5 = O(sn −4 ), s X i=1 3b 2 i d i (d i +ω i ) 4 = O(n −3 kβ 0 k 2 2 ), s X i=1 b 2 i d 2 i = O(n −2 kβ 0 k 2 2 ), s X i=1 3σ 2 d 3 i = O(sn −3 ), s X i=1 σ 2 d 2 i =O(sn −2 ). Then we have a = O(n −3 kβ 0 k 2 2 ), b = O(n −2 kβ 0 k 2 2 ), and c = O(sn −2 ). 
We know that b 2 = O(n −4 kβ 0 k 4 2 ) is the leading term in b 2 −4ac since 4ac = O(sn −5 kβ 0 k 2 2 ). Since b < 0, both −b and √ b 2 −4ac are positive and they are of the same order O(n −2 kβ 0 k 2 2 ). So the order for −b+ √ b 2 −4ac 2a is O(n −2 kβ 0 k 2 2 )/O(n −3 kβ 0 k 2 2 ) = O(n). Since we have proved λ 1,opt = o(n) before, this rules out the possibility of λ 1,opt = −b+ √ b 2 −4ac 2a , which entails that λ 1,opt = −b− √ b 2 −4ac 2a . We further show that λ 1,opt has a leading order O(skβ 0 k −2 2 ) followed by a secondary order O(s 2 n −1 kβ 0 k −4 2 ), in Section 2.6.3. Plugging λ 1,opt = O(skβ 0 k −2 2 )+O(s 2 n −1 kβ 0 k −4 2 ) into f(λ 1 ) defined in (6.23), we obtain s X i=1 λ 2 1,opt b 2 i (d i +λ 1,opt ) 2 = O( s 2 n 2 kβ 0 k 2 2 )+O( s 3 n 3 kβ 0 k 4 2 ), s X i=1 σ 2 d i (d i +λ 1,opt ) 2 =O( s n )+O( s 2 n 2 kβ 0 k 2 2 ). 48 Thus, the order for f(λ 1,opt ) is O(s/n)+O(s 2 n −2 kβ 0 k −2 2 ). Part 1.2: Bounding the leading term of order O(skβ 0 k −2 2 ) in λ 1,opt . In fact, the leading order O(skβ 0 k −2 2 ) in λ 1,opt comes from −t/(2a), which equals to−c/b since 4ac =2bt. Plugging the definitions of b and c gives − c b = P s i=1 σ 2 d 2 i P s i=1 b 2 i d 2 i + P s i=1 3σ 2 d 3 i . By (6.24), we see that P s i=1 3σ 2 d 3 i is a smaller order term compared with P s i=1 b 2 i d 2 i . Thus, the leading term for− c b is ( P s i=1 σ 2 d 2 i )( P s i=1 b 2 i d 2 i ) −1 . Recall that λ min and λ max stand for the smallest and largest eigenvalues of X 0 τX 0 . With λ min ≤ d i ≤ λ max and P s i=1 b 2 i =kβ 0 k 2 2 , we obtain that the leading term for−c/b can be bounded as sσ 2 kβ 0 k 2 2 λ 2 min λ 2 max = P s i=1 σ 2 λ 2 max P s i=1 b 2 i λ 2 min ≤ P s i=1 σ 2 d 2 i P s i=1 b 2 i d 2 i ≤ P s i=1 σ 2 λ 2 min P s i=1 b 2 i λ 2 max = sσ 2 kβ 0 k 2 2 λ 2 max λ 2 min . So the leading term for λ 1,opt , which is O(skβ 0 k −2 2 ), is between sσ 2 kβ 0 k 2 2 λ 2 min λ 2 max and sσ 2 kβ 0 k 2 2 λ 2 max λ 2 min . In particular, when λ max equals to λ min , which implies all d i ’s are the same, we can solve (6.27) readily to get λ 1,opt = sσ 2 kβ 0 k 2 2 , which coincides with the above bounds for the leading term. Part 1.3: Bounding term E{1 E ck b β refitted −β 0 k 2 2 } in (6.18). Now let us turn to the last term in (6.18): I 3 (λ 1 ) = E{1 E ck b β refitted −β 0 k 2 2 }. We prove in Section 49 2.6.3 that compared with f(λ 1,opt ), the order of E{1 E ck b β refitted −β 0 k 2 2 } is much smaller. Thus f(λ 1,opt ) is the leading term of Ek b β refitted −β 0 k 2 2 and Ek b β refitted −β 0 k 2 2 = O(s/n)+O(s 2 n −2 kβ 0 k −2 2 ) (6.30) for the optimal choice of λ 1 . Part1.4: Boundsforthe L q -risks. BasedontheriskforsquaredL 2 -lossabove, we can derive the bounds for the risks of L q -losses by using H¨ older’s inequality, as shown in Section 2.6.3. The bound under L ∞ -loss follows directly from the inequality Ek b β refitted −β 0 k ∞ ≤ Ek b β refitted −β 0 k 2 . Part 2: Risk properties for b β refitted under prediction loss. In this part, we will find the risk property of the prediction loss for the refitted estimator b β refitted in a very similar way as before. Similarly to (6.17), we have kX( b β refitted − β 0 )k 2 2 = (μτ + ετA)X 0 τX 0 (μ + Aτε) = μτX 0 τX 0 μ +2μτX 0 τX 0 Aτε+ετAX 0 τX 0 Aτε. 
Taking expectation to the prediction loss, we have n −1 EkX( b β refitted −β 0 )k 2 2 = n −1 E{1 E kX( b β refitted −β 0 )k 2 2 }+n −1 E{1 E ckX( b β refitted −β 0 )k 2 2 } ≤ n −1 E{(μτX 0 τX 0 μ+2μτX 0 τX 0 Aτε+ετAX 0 τX 0 Aτε)}+n −1 E n 1 E ckX( b β refitted −β 0 )k 2 2 o = n −1 (μτX 0 τX 0 μ+σ 2 tr(AX 0 τX 0 Aτ))+n −1 E{1 E ckX( b β refitted −β 0 )k 2 2 }. (6.31) 50 Using definitions μ =−λ 1 (X 0 τX 0 +λ 1 I s ) −1 β 0 , A =X 0 (X 0 τX 0 +λ 1 I s ) −1 and X 0 τX 0 = PτDP, we get μτX 0 τX 0 μ = λ 2 1 (Pβ 0 )τ(D+λ 1 I s ) −1 D(D+λ 1 I s ) −1 Pβ 0 = λ 2 1 s X i=1 b 2 i d i (d i +λ 1 ) 2 , and σ 2 tr(AX 0 τX 0 Aτ) = σ 2 tr(PτDPPτ(D+λ 1 I s ) −1 PPτDPPτ(D+λ 1 I s ) −1 P) = σ 2 tr(D(D+λ 1 I s ) −1 D(D+λ 1 I s ) −1 )= s X i=1 σ 2 d 2 i (d i +λ 1 ) 2 . Plugging the above two terms into (6.31) yields 1 n EkX( b β refitted −β 0 )k 2 2 ≤ 1 n λ 2 1 s X i=1 b 2 i d i (d i +λ 1 ) 2 + s X i=1 σ 2 d 2 i (d i +λ 1 ) 2 ! + 1 n E n 1 E ckX( b β refitted −β 0 )k 2 2 o . Set g(λ 1 )= λ 2 1 s X i=1 b 2 i d i (d i +λ 1 ) 2 + s X i=1 σ 2 d 2 i (d i +λ 1 ) 2 , and note that it can be transformed from f(λ 1 ) by multiplying d i in the ith term of each sum. Denote the optimal λ 1 for minimizing g(λ 1 ) as λ ′ 1,opt . In view of (6.26) and (6.29), the same argument applies, we also get λ ′ 1,opt = o(n) and consequently, we can deduce λ ′ 1,opt = O(skβ 0 k −2 2 ) +O(s 2 n −1 kβ 0 k −4 2 ) as the 51 ratio of orders does not change. Then we can prove that the leading term for λ ′ 1,opt is between sσ 2 kβ 0 k 2 2 λ min λmax and sσ 2 kβ 0 k 2 2 λmax λ min and g(λ ′ 1,opt ) = O(s)+O(s 2 n −1 kβ 0 k −2 2 ). The term n −1 E{1 E ckX( b β refitted −β 0 )k 2 2 } can be shown to have a smaller order than O(s 2 n −2 kβ 0 k −2 2 ) similarly as before. Therefore, n −1 EkX( b β refitted −β 0 )k 2 2 = O(s/n)+O(s 2 n −2 kβ 0 k −2 2 ), which concludes the proof. 2.6.3 Additional Technical Details This subsection contains proofs of Lemmas 1–2, and technical details in the proof of Theorem 2. Proof of Lemma 1 Since each covariate vector x j is rescaled to have L 2 -norm n 1/2 , the solution b β(z) = z1 {|z|>λ} with z = n −1 (y−Xβ j )τx j can be easily derived for the uni- variate penalized least-squares estimator for both penalties p H,λ (t) and p H 0 ,λ (t). Lemma 2 and its proof Lemma 2. Define two events E ={kn −1 Xτεk ∞ ≤ c 2 p (loge p)/n} and E ′ ={kn −1 X 0 τεk ∞ ≤ c ′ 2 p (logn)/n} (6.32) 52 with c 2 ≥ √ 10σ and c ′ 2 ≥ √ 2σ some positive constants. Then we have P(E c )≤ (2/π) 1/2 c −1 2 σ(loge p) −1/2 e p 1− c 2 2 2σ 2 → 0, P(E ′ c )≤ (2/π) 1/2 c ′−1 2 σs(logn) −1/2 n − c ′2 2 2σ 2 → 0, as n→∞. Proof of Lemma 2: The proofsfortheinequalities onP(E c )andP(E ′ c )aresimilar, soweonlyoutlinethefirstone. Sincethej-thcovariatevectorx j hasbeenrescaled to have L 2 -norm n 1/2 andε∼ N(0,σ 2 I n ), we have n −1 x j τε∼ N(0,σ 2 /n) for each j. ByBonferroni’sinequality andGaussiantailprobabilitybound(seeProposition 2.2.1 in Dudley, 1999), we have P(E c )≤ p X j=1 P |n −1 x j τε| > c 2 p (loge p)/n ≤ p X j=1 2σ c 2 p loge p 1 √ 2π e − c 2 2 log e p 2σ 2 = √ 2σ c 2 p πloge p e p 1− c 2 2 2σ 2 , which tends to 0 as n→∞ since c 2 ≥ √ 10σ. Technical details in the proof of Theorem 2 53 Identifying the order of λ 1,opt We find the order of λ 1,opt by analyzing −b− √ b 2 −4ac 2a . To this end, we first find out the order of t that satisfies 4ac =2bt. 
By direct calculations, we have 4ac =4 ( s X i=1 6σ 2 d i (d i +γ i ) 5 + s X i=1 3b 2 i d i (d i +ω i ) 4 ) ( s X i=1 σ 2 d 2 i ) =4 ( s X i=1 6sσ 2 d i (d i +γ i ) 5 kβ 0 k 2 2 + s X i=1 3sb 2 i d i (d i +ω i ) 4 kβ 0 k 2 2 ) ( s X i=1 σ 2 kβ 0 k 2 2 sd 2 i ), where P s i=1 σ 2 kβ 0 k 2 2 sd 2 i is of the same order as b, and thus, t should have order 4 ( s X i=1 6sσ 2 d i (d i +γ i ) 5 kβ 0 k 2 2 + s X i=1 3sb 2 i d i (d i +ω i ) 4 kβ 0 k 2 2 ) = O(s 2 n −4 kβ 0 k −2 2 )+O(sn −3 ). Replacing 4ac with 2bt and by Taylor expansion, we have √ b 2 −4ac = √ b 2 −2bt = p (b−t) 2 −t 2 =−(b−t)− t 2 2 p (b−t) 2 −t ′ , where t ′ is a number between 0 and t 2 . By the above calculations on the order of t, we have t 2 2 p (b−t) 2 −t ′ =O(s 2 n −6 )/O(n −2 kβ 0 k 2 2 ) = O(s 2 n −4 kβ 0 k −2 2 ). 54 Thus, combining the above two results, the order of optimal λ can be calculated as follows: λ 1,opt = −b− √ b 2 −4ac 2a = 1 2a (−t+ t 2 2 p (b−t) 2 −t ′ ) =O(n 3 kβ 0 k −2 2 )(O(sn −3 )+O(s 2 n −4 kβ 0 k −2 2 )+O(s 2 n −4 kβ 0 k −2 2 )) =O(skβ 0 k −2 2 )+O(s 2 n −1 kβ 0 k −4 2 ). Therefore, λ 1,opt has a leading order O(skβ 0 k −2 2 ) followed by a secondary order O(s 2 n −1 kβ 0 k −4 2 ). Bounding term I 3 (λ 1 ) We proceed to bound I 3 (λ 1 ) = E{1 E ck b β refitted −β 0 k 2 2 }. Since k b β refitted −β 0 k 2 2 ≤ 2(k b β refitted k 2 2 +kβ 0 k 2 2 ), we have E{1 E ck b β refitted −β 0 k 2 2 }≤ E{1 E c2(k b β refitted k 2 2 +kβ 0 k 2 2 )} =2E(1 E ck b β refitted k 2 2 )+2P(E c )kβ 0 k 2 2 . (6.33) Note that E(1 E ckβ 0 k 2 2 ) = O(P(E c )kβ 0 k 2 2 ) = O( 1 √ loge p e p 1− c 2 2 2σ 2 kβ 0 k 2 2 ) = o(e p 1− c 2 2 2σ 2 kβ 0 k 2 2 ), which is much smaller than O(s 2 n −2 kβ 0 k −2 2 ) since c 2 can be cho- sen arbitrarily large for any given level of signal strength kβ 0 k 2 . So it remains to bound E(1 E ck b β refitted k 2 2 ). 55 We first bound k b β refitted k 2 as k b β refitted k 2 ≤ (X 1 τX 1 +λ 1 I s 1 ) −1 X 1 τk 2 ky 2 = λ max (X 1 (X 1 τX 1 +λ 1 I s 1 ) −2 X 1 τ) 1/2 kyk 2 . By(6.25),wehaveX 1 τX 1 +λ 1 I s 1 ≥ (c 2 n+λ 1 )I s 1 , where≥meansX 1 τX 1 +λ 1 I s 1 − (c 2 n+λ 1 )I s 1 is positive semidefinite. It follows that λ max (X 1 (X 1 τX 1 +λ 1 I s 1 ) −2 X 1 τ)≤ λ max (X 1 X 1 τ) (c 2 n+λ 1 ) 2 = O(n) (c 2 n+λ 1 ) 2 . Then we have k b β refitted k 2 2 ≤ λ max (X 1 (X 1 τX 1 +λ 1 I s 1 ) −2 X 1 τ)kyk 2 2 ≤ O(n)kyk 2 2 (c 2 n+λ 1 ) 2 , (6.34) Ontheotherhand,sincekyk 4 2 =kX 0 β 0 +εk 4 2 =(β 0 τX 0 τX 0 β 0 +2β 0 τX 0 τε+ετε) 2 and ε∼ N(0,σ 2 I n ), by (6.24), we have Ekyk 4 2 = (β 0 τX 0 τX 0 β 0 ) 2 +2(β 0 τX 0 τX 0 β 0 )E(ετε)+4E(β 0 τX 0 τε) 2 +E(ετε) 2 = O(n 2 kβ 0 k 4 2 )+O(n 2 kβ 0 k 2 2 )+O(nkβ 0 k 2 2 )+O(n 2 )= O(n 2 kβ 0 k 4 2 ) (6.35) 56 where all of the four terms above are positive. Combining (6.34) with (6.35) and by the Cauchy-Schwarz inequality, we have E{1 E ck b β refitted k 2 2 }≤ O(n) (c 2 n+λ 1 ) 2 E{1 E ckyk 2 2 }≤ O(n) (c 2 n+λ 1 ) 2 (E{1 2 E c}) 1 2 (Ekyk 4 2 ) 1 2 = O(P(E c ) 1 2 kβ 0 k 2 2 )= O (loge p) − 1 4 e p 1 2 − c 2 2 4σ 2 kβ 0 k 2 2 =o e p 1 2 − c 2 2 4σ 2 kβ 0 k 2 2 , whichisalsostrictlysmallerthanO(s 2 n −2 kβ 0 k −2 2 )sincec 2 canbechosenarbitrarily large for any given level of signal strength kβ 0 k 2 . The above inequality together with (6.33) and Lemma 2 ensures that E{1 E ck b β refitted −β 0 k 2 2 }≤ 2E(1 E ck b β refitted k 2 2 )+2P(E c )kβ 0 k 2 2 ≤ o(e p 1 2 − c 2 2 4σ 2 kβ 0 k 2 2 )= o s 2 n −2 kβ 0 k −2 2 . 
(6.36) Bounds for the L q -risks with q∈ [1,2) We first show that for q ∈ [1,2), E(1 E ck b β refitted −β 0 k q q ) has a smaller order than O(s 2 kβ 0 k −2 2 /n q/2+1 ) when we choose large enough c 2 . Since by (2.1) and s <M/2 in Condition 2,k b β refitted −β 0 k 0 ≤k b β refitted k 0 +kβ 0 k 0 ≤ M 2 + M 2 =M ≤ n+1, then 57 H¨ older’s inequality ensures k b β refitted −β 0 k q q ≤ (n+1) 1−q/2 k b β refitted −β 0 k q 2 . Thus, by Cauchy-Schwarz inequality and (6.36) we have E(1 E ck b β refitted −β 0 k q q )≤ (n+1) 1−q/2 E(1 E ck b β refitted −β 0 k q 2 ) ≤ (n+1) 1−q/2 E(1 E ck b β refitted −β 0 k 2 2 ) q/2 ≤ o(n 1− q 2 p q 4 − c 2 2 q 8σ 2 kβ 0 k q 2 ) (6.37) ≤ O(s 2 kβ 0 k −2 2 /n q/2+1 ), (6.38) where the last step is because c 2 can be chosen sufficiently large. By H¨ older’s inequality, when q∈ [1,2), we have k b β refitted −β 0 k q ≤ s 1/q−1/2 k b β refitted −β 0 k 2 on event E, which together with (6.37) and the L 2 -loss bound (6.30) gives Ek b β refitted −β 0 k q q ≤ s 1−q/2 Ek b β refitted −β 0 k q 2 +E(1 E ck b β refitted −β 0 k q q ) ≤ s 1−q/2 (Ek b β refitted −β 0 k 2 2 ) q/2 +E(1 E ck b β refitted −β 0 k q q ) = O(s/n q/2 )+O(s 2 kβ 0 k −2 2 /n q/2+1 ). 58 Table 2.3: Means and standard deviations (in parentheses) of different perfor- mance measures by all methods over 100 simulations in Section 2.5.1; the popu- lation error standard deviation (SD) σ p df/(df−2) equals 0.4472 in the case of p =1000, and 0.3354 in the case of p = 5000 Setting Measure Lasso Hard SICA Oracle p = 1000 PE 0.3845 (0.0705) 0.2277 (0.0137) 0.2285 (0.0168) 0.2276 (0.0151) |β weak | = 0.05 L 2 -loss 0.4547 (0.0801) 0.1718 (0.0316) 0.1734 (0.0361) 0.1655 (0.0415) L 1 -loss 1.9683 (0.3523) 0.5335 (0.0932) 0.5386 (0.1124) 0.4682 (0.1202) L ∞ -loss 0.2306 (0.0554) 0.0858 (0.0347) 0.0870 (0.0360) 0.0937 (0.0283) FP 32.7800 (8.6311) 0.0600 (0.2778) 0.1000 (0.4606) 0 (0) FN-strong 0 (0) 0 (0) 0 (0) 0 (0) FN-weak 5.6200 (0.6321) 6.0000 (0) 6.0000 (0) 0 (0) Error SD 0.4867 (0.0599) 0.4652 (0.0412) 0.4645 (0.0417) 0.4517 (0.0368) p = 1000 PE 0.4462 (0.0777) 0.2693 (0.0149) 0.2702 (0.0172) 0.2276 (0.0151) |β weak | = 0.1 L 2 -loss 0.5331 (0.0773) 0.2787 (0.0245) 0.2797 (0.0272) 0.1655 (0.0415) L 1 -loss 2.4177 (0.3695) 0.8557 (0.1003) 0.8628 (0.1188) 0.4682 (0.1202) L ∞ -loss 0.2491 (0.0558) 0.1123 (0.0250) 0.1131 (0.0259) 0.0937 (0.0283) FP 34.1400 (8.1996) 0.0600 (0.2387) 0.1300 (0.6139) 0 (0) FN-strong 0 (0) 0 (0) 0 (0) 0 (0) FN-weak 5.0200 (0.9209) 5.9200 (0.2727) 5.8600 (0.4499) 0 (0) Error SD 0.5152 (0.0633) 0.5033 (0.0445) 0.5004 (0.0512) 0.4517 (0.0368) p = 5000 PE 0.3295 (0.1096) 0.1343 (0.0058) 0.1343 (0.0060) 0.1277 (0.0068) |β weak | = 0.05 L 2 -loss 0.4897 (0.1151) 0.1539 (0.0168) 0.1541 (0.0169) 0.1226 (0.0270) L 1 -loss 2.0497 (0.4196) 0.4831 (0.0588) 0.4833 (0.0586) 0.3411 (0.0796) L ∞ -loss 0.2439 (0.0625) 0.0684 (0.0209) 0.0688 (0.0212) 0.0717 (0.0191) FP 38.4600 (5.4558) 0.0300 (0.1714) 0.0300 (0.1714) 0 (0) FN-strong 0 (0) 0 (0) 0 (0) 0 (0) FN-weak 5.8800 (0.3266) 5.9800 (0.1407) 5.9800 (0.1407) 0 (0) Error SD 0.4254 (0.0569) 0.3560 (0.0319) 0.3560 (0.0319) 0.3366 (0.0321) p = 5000 PE 0.4307 (0.1419) 0.1761 (0.0080) 0.1767 (0.0132) 0.1277 (0.0068) |β weak | = 0.1 L 2 -loss 0.6030 (0.1245) 0.2671 (0.0146) 0.2680 (0.0219) 0.1226 (0.0270) L 1 -loss 2.6203 (0.4894) 0.8068 (0.0701) 0.8150 (0.1238) 0.3411 (0.0796) L ∞ -loss 0.2845 (0.0722) 0.1047 (0.0143) 0.1035 (0.0105) 0.0717 (0.0191) FP 38.3900 (5.0510) 0.0500 (0.2190) 0.1800 (1.2092) 0 (0) FN-strong 0 (0) 0 (0) 0 (0) 0 (0) FN-weak 5.5900 
(0.5702) 5.8500 (0.3860) 5.8100 (0.4648) 0 (0) Error SD 0.4828 (0.0625) 0.4039 (0.0354) 0.4005 (0.0458) 0.3366 (0.0321) 59 Table 2.4: Means and standard deviations (in parentheses) of different perfor- mance measures by all methods followed by the L 2 -regularization over 100 simu- lations in Section 2.5.1 Setting Measure Lasso-L 2 Hard-L 2 SICA-L 2 Oracle-L 2 p = 1000 PE 0.4447 (0.0747) 0.2263 (0.0135) 0.2270 (0.0146) 0.2256 (0.0148) |β weak | = 0.05 L 2 -loss 0.5059 (0.0767) 0.1686 (0.0321) 0.1701 (0.0342) 0.1588 (0.0411) L 1 -loss 2.8356 (0.5052) 0.5191 (0.0929) 0.5238 (0.1013) 0.4426 (0.1157) L ∞ -loss 0.1976 (0.0541) 0.0796 (0.0328) 0.0808 (0.0335) 0.0858 (0.0280) p = 1000 PE 0.5170 (0.0835) 0.2676 (0.0149) 0.2684 (0.0172) 0.2256 (0.0148) |β weak | = 0.1 L 2 -loss 0.5828 (0.0767) 0.2770 (0.0250) 0.2780 (0.0277) 0.1588 (0.0412) L 1 -loss 3.3344 (0.5248) 0.8426 (0.1047) 0.8491 (0.1230) 0.4427 (0.1156) L ∞ -loss 0.2180 (0.0567) 0.1099 (0.0218) 0.1107 (0.0227) 0.0858 (0.0281) p = 5000 PE 0.3312 (0.0815) 0.1335 (0.0055) 0.1335 (0.0056) 0.1268 (0.0066) |β weak | = 0.05 L 2 -loss 0.4858 (0.0877) 0.1520 (0.0165) 0.1522 (0.0165) 0.1180 (0.0275) L 1 -loss 2.6985 (0.4105) 0.4719 (0.0587) 0.4724 (0.0581) 0.3256 (0.0807) L ∞ -loss 0.2090 (0.0523) 0.0645 (0.0182) 0.0649 (0.0187) 0.0653 (0.0182) p = 5000 PE 0.4332 (0.1092) 0.1750 (0.0075) 0.1757 (0.0132) 0.1268 (0.0066) |β weak | = 0.1 L 2 -loss 0.5954 (0.0995) 0.2659 (0.0144) 0.2668 (0.0221) 0.1178 (0.0275) L 1 -loss 3.3325 (0.5086) 0.7961 (0.0715) 0.8053 (0.1260) 0.3254 (0.0808) L ∞ -loss 0.2467 (0.0623) 0.1036 (0.0113) 0.1029 (0.0100) 0.0649 (0.0180) Table2.5: Selectionprobabilities(t-statistics,withmagnitudeabove2inboldface) of most frequently selected predictors with number up to median model size by each method across 100 random splittings of the diabetes data set in Section 2.5.2 Predictor Lasso Hard SICA Predictor Lasso Hard SICA sex 0.94 (-2.03) 0.83 (-2.16) 0.82 (-2.07) bp 2 0.54 (0.42) — — bmi 1.00 (17.24) 0.99 (6.25) 1.00 (8.65) glu 2 1.00 (3.95) 0.50 (0.94) 0.57 (1.10) bp 1.00 (5.82) 0.87 (2.51) 0.91 (2.99) sex∗age 0.98 (3.09) 0.87 (2.46) 0.81 (2.00) tc 0.43 (-0.67) — — sex∗bp 0.73 (0.99) — — hdl 1.00 (-3.63) 0.80 (-1.86) 0.79 (-1.83) age∗bp 0.87 (1.27) — — ltg 1.00 (9.27) 1.00 (7.27) 1.00 (8.22) age∗ltg 0.74 (0.82) — — glu 0.85 (1.21) — — age∗glu 0.59 (0.82) — — age 2 0.94 (1.91) — — bmi∗bp 0.99 (2.25) 0.81 (1.94) 0.76 (1.69) bmi 2 0.98 (2.49) — — bp∗hdl 0.47 (0.68) — — 60 Chapter 3 High-Dimensional Latent Variable Thresholded Regression 3.1 Introduction Inthischapter,weexploreaspecialclassofconfoundingfactorsasprincipalcompo- nentsoftherandomdesignmatrix. Principalcomponentsaretreatedasapotential family of confounding factors since they evaluate orthogonal directions reflecting maximal variations in the data. On the other hand, a sparse model may fail to represent the underlying structure (such as in genomics where a relatively large number of genes may be responsible for a certain disease), principal components can help identify the true model by recovering the latent features. For estimating population principal components, we use their sample counterparts and establish their asymptotic properties on an event with significant probability. 61 Toproducesparseandmeaningfulmodels, variousregularizationmethodshave beenwidelyusedinthehigh-dimensionalsettingwithoutassumptionsofconfound- ing factors. 
Some popularly used regularization methods include the Lasso in Tib- shirani (1996), SCAD in Fan and Li (2001), and Dantzig selector in Candes and Tao (2007). In this paper, we suggest regularization methods for high-dimensional variable selection with latent features. However, our methods differ from existing ones with the belief that there are potential confounding factors that need to be identified andestimated. Toaddress thisissue, basedonthesampleprincipalcom- ponents we exploit high-dimensional thresholded regression in Zheng, Fan and Lv (2014) and study its global properties for recovering the true model with hidden features. Themaincontributionsofthispaperarethreefold. First,weexploreonepoten- tial family of confounding factors, the population principal components, estimated by their sample counterparts. In terms of theoretical results, we provide conver- gence rates of estimated prediction vectors, which hold for a wide class of error distributions. Thepredictionpropertiesareimportantinourstudysinceoneofour objectives is to enable better prediction of the response variable, after accounting for the latent features. This is, however, much less studied in literature, and our work is a first attempt in the high-dimensional case. Second, penalized regression withestimatedconfoundingfactorsenablesvariableselectionbothforobservedand unobserved latent features. Moreover, we also estimate the number of important 62 confounding factors, in the context of variable selection and prediction. In a way, we simultaneously perform variable selection for observed variables and confound- ing factors. Third, we show that high-dimensional thresholded regression with estimated confounding factors can still enjoy model selection consistency and ora- cle inequalities under various prediction and estimation losses for both observable covariates and latent confounding factors. In particular, the thresholded regres- sionisshown tobeabletoidentify thetruecovariatesandconfounding factorsand estimate their effects at the same time. Therefore, we provide a new approach, supported by theoretical results, to identifying and estimating latent features for sparse regression. 3.2 Sparse Regression with Latent Confounding Factors Denote by y = (y 1 ,··· ,y n )τ the n-dimensional response vector and X = (x 1 ,··· ,x p ) an n×p random design matrix with p covariates. Assume that each row ofX is a random vector which has mean zero and covariance matrix Σ. Con- sider the linear regression model, y =Xβ+˜ ε, 63 where β = (β 1 ,··· ,β p )τ is a p-dimensional regression coefficient vector and ˜ ε = (˜ ε 1 ,··· ,˜ ε n )τ is n-dimensional error vector. Sparse regression aims at consistently recovering thesupportofandestimatingthetrueregressioncoefficientvectorβ 0 = (β 0,1 ,··· ,β 0,p )τ, which is assumed to be sparse. We consider the high-dimensional setting where the dimensionality p is no less than the sample size n. Thus we use n→∞ and p→∞ exchangeably throughout the paper. At thispoint, we don’tmake any furtherassumptions ontheerrordistribution. Most ofexisting regression methods assume that the errorvector is independent of or uncorrelated with the design matrix, which may not be true in many situations due to the existence of unobservable confounding factors which are related to the design matrix. These unobservable latent factors are generally difficult to capture and include as covariates in quantitative models. Principal components have been applied in various fields to deal with big data problems. 
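A minimal sketch of this basic setting (Python with NumPy; the sizes and the sparsity level are assumptions chosen only to fix notation for the later illustrations) is the high-dimensional sparse linear model with p no smaller than n.

    import numpy as np

    rng = np.random.default_rng(3)
    n, p, s = 100, 500, 3                    # illustrative sizes with p >= n
    beta0 = np.zeros(p)
    beta0[:s] = 1.0                          # sparse true coefficient vector
    Sigma = np.identity(p)                   # placeholder covariance; a spiked Sigma is used below
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    eps_tilde = rng.standard_normal(n)       # error vector, possibly containing latent structure
    y = X @ beta0 + eps_tilde

The covariance matrix Σ is left as a placeholder here; the latent-feature structure that makes ε̃ correlated with X is introduced next.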
We refer the readers to, for example, Ott and Rabinowitz (1999) and Fang, Feng and Yuan (2013) for the application of principal components on eval- uating heritability in modern genomics data. This intrigues us to incorporate confounding factors as principal components of the random design matrix in the linear regression model. In our paper, we allow for the error vector ˜ ε to contain nonsparse latent fea- tures. These components are essentially population principal components related 64 totheK largestpopulationeigenvalues, whereK standsforthenumberofnontriv- ialeigenvalues. WeallowK todivergewiththesamplesizen. LetF = (f 1 ,··· ,f K ) of size n×K be the potential confounding factors given by the leading K popula- tion principal components. The linear regression model after accounting for these factors becomes y =Xβ+Fγ +ε, (2.1) where γ = (γ 1 ,··· ,γ K )τ is a K-dimensional coefficient vector and ε = (ε 1 ,··· ,ε n )τ is an error vector assumed to be independent ofX andF. To ensure model identifiability, each latent factor f k is rescaled to have L 2 -norm n 1/2 as a common scale, matching that of the constant covariate 1 for the intercept. For future prediction, we can transform γ back to the coefficient vector of the original principalcomponentsusingthecorrespondingscales. Assumethatthetrueparam- eter vector γ 0 = (γ 0,1 ,··· ,γ 0,K )τ is also sparse. Besides selection and estimation of observable predictors, we aim to identify the true confounding factors as well as analyze their impacts on the recovery of important observed variables. Kneip and Sarda (2011) considers a model which is a combination of high- dimensional regression model with sparse coefficient vector and functional regres- sion model that can be rewritten using a sparse expansion of the predictor func- tions. Their predictor vector can essentially be decomposed into a sum of two uncorrelatedrandomcomponents: onecomponentwithdiagonalcovariancematrix reflecting individual variabilitieswhiletheothershares commonfactorsandcanbe 65 expressed as linear combinations of their leading principal components. Although both of us propose to include sample principal components as additional explana- tory variables in an augmented regression model, the underlying models are gener- ally different as they assume that the predictors are uncorrelated after subtracting theprincipalcomponents. Bycontrast,ourmodeldoesnotimposesuchanassump- tion. Furthermore, the latent features in Kneip and Sarda (2011) come from the principal components of the functional part instead of the entire predictor vector, so they also need to take into account the estimation errors by using the principal components of the whole design matrix. While our model and that of Kneip and Sarda (2011) are different in terms of modeling of latent variables and scope of applications, we share certain similarities of sparse modeling via regularization. However, the eigen-structures are quite different out of different purposes and our regularization method of thresholded regression is nonconvex instead of a convex one. With unobservable latent factors F, it is challenging to effectively estimate parameter vectors β 0 and γ 0 . We partially overcome this difficulty by assuming that the confounding factors appear in an unknown (linear) form of the observed variables. Then, F can be estimated by its sample counterpart. Denote by ˆ F = ( ˆ f 1 ,··· , ˆ f K ) the corresponding sample principal components of the random design matrixX. 
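As a concrete illustration of this estimation step, the sketch below (Python with NumPy; the spike orders, sample sizes, and variable names are assumptions made purely for illustration) generates a design matrix whose population covariance has two dominating eigenvalues, forms the latent factors F from the leading population principal components, and computes their sample counterparts F̂ from the eigenvectors of the sample covariance matrix S = n⁻¹X′X, with each column rescaled to have L_2-norm n^{1/2}.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, K = 100, 500, 2                         # illustrative sizes
    spikes = np.array([p ** 1.5, p ** 1.2])       # two dominating eigenvalues of assumed polynomial orders
    eigvals = np.concatenate([spikes, np.ones(p - K)])
    U = np.linalg.qr(rng.standard_normal((p, p)))[0]            # population eigenvectors
    X = rng.standard_normal((n, p)) @ (U * np.sqrt(eigvals)).T  # rows have covariance U diag(eigvals) U'

    def rescale(M):
        """Rescale each column to have L_2-norm n^{1/2}, as in the model specification."""
        return M * (np.sqrt(n) / np.linalg.norm(M, axis=0))

    F = rescale(X @ U[:, :K])                     # latent factors: leading population principal components
    S = X.T @ X / n                               # sample covariance matrix (mean-zero covariates)
    eigval_hat, U_full = np.linalg.eigh(S)
    U_hat = U_full[:, ::-1][:, :K]                # leading K sample eigenvectors
    F_hat = rescale(X @ U_hat)                    # estimated confounding factors

    # cos(omega_jj) between each latent factor and its estimate (absolute value
    # because eigenvectors are determined only up to sign)
    cosines = np.abs(np.sum(F * F_hat, axis=0)) / n
    print(cosines)

When the spikes dominate in the sense discussed in the next section, the printed cosines are expected to be close to one; the corresponding convergence rates are made precise in Theorem 3 and Condition 10 below.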
To produce a sparse estimate forthe true coefficient vector (β 0 τ,γ 0 τ)τ, 66 we suggest sparse regression with nonsparse latent features using regularization which minimizes Q((βτ,γτ)τ) =(2n) −1 ky−Xβ− ˆ Fγk 2 2 +kp λ ((β ∗ τ,γτ)τ)k 1 , (2.2) the penalized residual sum of squares with penalty function p λ (·), where β ∗ = n −1/2 (kx 1 k 2 β 1 ,··· ,kx p k 2 β p )τ is the coefficient vector corresponding to the design matrix with each column rescaled to have L 2 -norm n 1/2 . The penalty function p λ (t) is defined on t∈ [0,∞) and indexed by λ ≥ 0. Also, p λ (t) is assumed to be increasing in both λ and t with p λ (0) =0. Here p λ ((β ∗ τ,γτ)τ) is a compact notation for p λ (|(β ∗ τ,γτ)τ|) =(p λ (|β ∗,1 |),··· ,p λ (|β ∗,p |),p λ (|γ 1 |),··· ,p λ (|γ K |))τ, with β ∗ = (β ∗,1 ,··· ,β ∗,p )τ and |(β ∗ τ,γτ)τ| = (|β ∗,1 |,··· ,|β ∗,p |,|γ 1 |,··· ,|γ K |)τ. Each estimated latent factor ˆ f k is also assumed to be rescaled to have L 2 -norm n 1/2 . The regularization method in (2.2) makes the theoretical analyses more chal- lenging in the sense that the matrix of confounding factors F is unobservable and we have to estimate it. In this way, it can simultaneously estimate β and γ, 67 recovering their true supports under certain conditions. Naturally, the accuracy of estimate of F is crucial in this approach. As the estimation errors of confounding factors can be magnified by their effects in the prediction of the response, we con- sider γ in an L ∞ ball B T = {γ ∈ R K : kγk ∞ ≤ T}, that is, all components of γ are assumed to be of magnitude no larger than T. We allow T to diverge with n, with certain orders such that it will not deteriorate the estimation accuracy (see Condition 10 for details). Asymptotic properties of the estimated confounding factors will be established in the next section. 3.3 Thresholded Regression with Confounding Factors We will first establish asymptotic properties of the estimated confounding factors for a wide class of distributions with significant probability. Based on this, we prove global properties including model selection consistency and oracle inequal- ities under various losses for thresholded regression with estimated confounding factors. 3.3.1 Multiple Component Spike Models High-dimensional principal component analysis (PCA) particularly in the context of spiked covariance model, introduced by Johnstone (2001), has been studied in, 68 for example, Paul (2007), Johnstone and Lu (2009), and Jung and Marron (2009), amongothers. Shen, Shen andMarron(2012)explored the(in)consistency ofPCA inmultiple component spike modelswith possible tiers. We adopta similarsetting for the estimation of confounding factors. However, our work extends beyond esti- mating principal components to using the estimated confounding factors in sparse regressionso astoenablevariableselection andprediction. Instead ofstudying the phase transition for the consistency of PCA, we focus on the convergence rates of estimated confounding factors, upon which we prove the nice properties of sparse regression with estimated factors. Furthermore, the PCA is also popularly used in functional data analysis to reduce intrinsically infinite dimensionality to a finite level, pointing to the most significant components of the data. See, for example, Hall and Hosseini-Nasab (2006 and 2009). We adopt the general framework studied in Jung and Marron (2009). 
Assume thattherows oftheobservablen×prandomdesignmatrixX areindependent and identically distributedwithmeanzeroandcovariancematrixΣbutnotnecessarily Gaussian. The eigen-decomposition of the population covariance matrix is given by Σ = UΛU ′ , where Λ is a diagonal matrix of eigenvalues λ 1 ≥ λ 2 ≥ ··· ≥ λ p ≥ 0 and U = (u 1 ,··· ,u p ) is an orthogonal matrix of population eigenvectors. Analogously, the estimated confounding factors are principal components of the sample covariance matrix S = n −1 X ′ X, an unbiased estimate of the population covariance matrix Σ, noting that the covariate vector is assumed to have mean 69 zero. Similarly we have an eigen-decomposition for the sample covariance matrix asS= ˆ U ˆ Λ ˆ U ′ , where ˆ Λisadiagonalmatrixofsample eigenvalues ˆ λ 1 ≥ ˆ λ 2 ≥···≥ ˆ λ p ≥ 0 and ˆ U = (ˆ u 1 ,··· ,ˆ u p ) is an orthogonal matrix of sample eigenvectors. Let Z = Λ −1/2 U ′ X ′ be the sphered data matrix. Then the columns of Z are independent and identically distributed with mean zero and covariance matrix I p . To build our theory, we will assume tail probability bounds for each entry of Z that can be satisfied for a wide class of distributions. The dual matrix of S is defined as S D = n −1 XX ′ = n −1 Z ′ ΛZ. Note that the eigenvalues of S consists of the eigenvalues of S D along with p−n zeros. Thespiked covariancemodel, introduced inJohnstone(2001),hasbeenstudied in substantial amount by Paul (2007), Johnstone and Lu (2009), and Jung and Marron (2009) in the context of consistency of PCA. This model assumes that the eigen-structure of the covariance matrixΣ consists of the first few eigenvalues deviating from one while the remaining are set to be near one. In this paper we consider a more general case, the multiple component spike models (Shen, Shen and Marron, 2012). Assume there are m tiers of eigenvalues which deviate from unity, where 1 ≤ m < n, is the number of tiers for nontrivial eigenvalues. The eigenvalues grow polynomially in p with degree same within each tier but different than those in other tiers such that the m tiers are distinguishable. Set tier sizes to bek 1 ,··· ,k m , where k i ’s arepositive integers such that P m j=1 k j = K <n. Denote 70 by k m+1 = p−K, the number of remaining eigenvalues ofΣ. Define the index set J l of eigenvalues in the lth tier as J l = n l−1 X j=1 k j +1,··· , l−1 X j=1 k j +k l o for l = 1,··· ,m+1. The properties ofsample eigenvectors arestudied comprehensively forthe mul- tiple component spike models with tiered eigenvalues, which serve as a benchmark for estimating confounding factors. For the identifiability of prediction vectors, we assume each tier size to be one in sparse regression with estimated confounding factors, that is, k l = 1 for 1 ≤ l ≤ m, implying K = m in this case. Then the multiple component spike model with tiered eigenvalues will reduce to a multiple component spike model with distinct eigenvalues. 3.3.2 Thresholded Regression with Estimated Confound- ing Factors In this part, we discuss the penalty function to be used in (2.2). As shown in Lemma 1 of Zheng, Fan and Lv (2014), the hard-thresholding penalty p H,λ (t) = 1 2 λ 2 −(λ−t) 2 + , t ≥ 0 shares the similar hard-thresholding property as the L 0 - regularization with penalty p H 0 ,λ (t) = 2 −1 λ 2 1 {t6=0} , t≥ 0. These two methods are exactly equivalent to each other in the wavelet setting with orthonormal design 71 matrix, and can also share similar asymptotic properties in the more general case. 
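The equivalence of the two thresholding rules can be checked numerically. The short sketch below (Python with NumPy; the grid search is used only so that no closed-form solution beyond Lemma 1 of Zheng, Fan and Lv (2014) needs to be assumed) minimizes the univariate penalized least squares 2⁻¹(z − b)² + p_λ(|b|) under both penalties and compares the minimizers with the hard-thresholding rule z·1{|z| > λ}.

    import numpy as np

    def p_hard(t, lam):
        """Hard-thresholding penalty p_{H,lam}(t) = 0.5 * (lam^2 - (lam - t)_+^2)."""
        return 0.5 * (lam ** 2 - np.maximum(lam - t, 0.0) ** 2)

    def p_l0(t, lam):
        """L_0 penalty p_{H_0,lam}(t) = 0.5 * lam^2 * 1{t != 0}."""
        return 0.5 * lam ** 2 * (t != 0)

    def univariate_pls(z, penalty, lam, grid=np.linspace(-3, 3, 6001)):
        """Minimize 0.5 * (z - b)^2 + penalty(|b|, lam) over a fine grid of b."""
        objective = 0.5 * (z - grid) ** 2 + penalty(np.abs(grid), lam)
        return grid[np.argmin(objective)]

    lam = 1.0
    for z in (-2.0, -0.8, 0.3, 0.999, 1.001, 2.5):
        b_hard = univariate_pls(z, p_hard, lam)
        b_l0 = univariate_pls(z, p_l0, lam)
        # both grid solutions match the rule z * 1{|z| > lam}, up to grid resolution
        print(z, b_hard, b_l0, z * (abs(z) > lam))

Both penalties return the same fitted value z·1{|z| > λ} for every z, which is the hard-thresholding behavior exploited in the remainder of this section.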
Since the L 0 -regularizationis challenging to implement, thehard-thresholding reg- ularizationprovidesacomputationallyattractivealternativewithfixed,finitemax- imum concavity which controls the computational difficulty. Because of the hard- thresholdingproperty,bothmethodsarereferredtoasthresholdedregressionusing hard-thresholding. We will use penalty p H,λ (t) for minimizing the penalized least squares in (2.2). A key concept introduced in Zheng, Fan and Lv (2014) is the robust spark rspark c (X) of an n× p design matrix X with bound c, defined as the smallest possible number τ such that there exists a subgroup of τ columns from the n×p matrix n −1/2 ˜ X such that the corresponding submatrix has a singular value less than the given positive constant c, where ˜ X is obtained by rescaling the columns ofX to have a common scale of L 2 -norm n 1/2 . Proposition 1 in Fan and Lv (2013) characterizes the order of rspark c (X) when the design matrix X is generated from Gaussiandistribution. Lv(2013)furtherbuiltalowerbound(inTheorem2)onthe robust spark for more general cases of random design matrix. Set M = ˜ cn/(logp) for some positive constant ˜ c. We will apply Theorem 2 in Lv (2013) to show that M provides a lower bound on the robust spark of matrix (X,F). Following Zheng, Fan and Lv (2014) and Fan and Lv (2013), we consider the regularized estimator ( b βτ,b γτ)τ on the union of coordinate subspaces S M/2 = {(βτ,γτ)τ ∈ R p+K : k(βτ,γτ)τk 0 < M/2} to ensure model identifiability 72 and reduce instability in estimated model. Therefore, the regularized estimator ( b βτ,b γτ)τ is defined as the global minimizer ( b βτ,b γτ)τ = arg min (βτ,γτ)τ∈S M/2 Q (βτ,γτ)τ , (3.3) where Q (βτ,γτ)τ is defined in (2.2). 3.3.3 Technical Conditions In this part, we provide a few technical conditions and discuss their relevance in detail. These conditions are crucial for our main results which will be introduced later. Condition 4 (Orders of eigenvalues). (a). There exist some positive constants c j , α l , and M l such that for any j ∈ J l , 1≤ l≤ m, λ j /p α l =c j +o(K −1 p αm−α m−1 ) and c j ≤ M l , where α l ’s satisfy α 1 >α 2 >···> α m >1. (b). There exists some positive constant M m+1 such that λ j ≤ M m+1 for any j ∈ J m+1 . 73 Condition 5 (Tail probability bounds). (a) For any 1 ≤ j ≤ p and 1 ≤ t ≤ n, the (j,t)th entry of Z, denoted as z jt , has the tail probability bound P(|z jt | > f 1 (p))≤ g 1 (p), if j ∈ J 1 ∪J 2 ∪···∪J m ; P(|z jt | > f 2 (p))≤ g 2 (p), if j ∈ J m+1 , where f 1 , f 2 , g 1 ,g 2 are functions and as n → ∞, g 1 ,g 2 satisfy ng 1 (p) P m l=1 k l + ng 2 (p)k m+1 → 0. (b) It holds that (f 2 1 (p) P m l=t+1 k l M l p α l +f 2 2 (p)k m+1 M m+1 )/p αt → 0, for any t, 1≤ t≤ m, as p→∞. Condition 6. It holds that P(min p j=1 kx j k 2 >0,max p j=1 kx j k 2 ≤ L 1 n 1/2 ,min j∈A 0 kx j k 2 ≥ Ln 1/2 )= 1−α L,L 1 , where A 0 = supp(β 0 ), 1 ≤ L 1 ≤ q logp logn and L are positive constants, and α L,L 1 converges to 0 as n→∞. Condition 7. The error vector ε∼ N(0,σ 2 I n ) for some positive σ. Condition 8. The true model size s =kβ 0 k 0 +kγ 0 k 0 = o(n) and s < M/2. The minimum signal strength satisfies b 0 = min j∈supp(β 0 ) (L|β 0,j |)∧ min j∈supp(γ 0 ) |γ 0,j | > [(4c −1 1 )∨1]c −1 1 c 2 p (2s+1)(logp)/n, 74 where c 1 is defined in Proposition 1 and c 2 ≥ √ 10σL 1 is some positive constant. Condition 9. The inequality kn −1/2 (X,F)δk 2 ≥ ckδk 2 holds for any δ satisfying kδk 0 <M with probability 1−θ n,p approaching one as n→∞. Condition 10. 
The convergence rates in Theorem 3 satisfy cos(ω jj )≥ 1− c 2 2 logn 8K 2 T 2 n when n is large enough for any j, 1≤ j ≤ K, where ω jj is the angle between the jth confounding factor f j and its sample counterpart ˆ f j . Part (a) of Condition 4 requires that the nontrivial eigenvalues in each tier converge to the corresponding order of p with a constant bound M l on the limits c j ’s for the lth tier. The convergence rates are assumed to be of the order of o(K −1 ) to ensure the convergence of the sum of the eigenvalues. The assumption of constant bounds is automatically satisfied if there are finitely many eigenvalues in each tier. Similarly, part (b) of Condition 4 assumes a uniform bound M m+1 for the remaining eigenvalues. Our technical arguments apply to more general cases where the bound M l of each tier can grow at a rate slower than any polynomial order of dimensionality p. However, the tail probability bounds in Condition 5 would then involve more complicated expressions of M l ’s. As pointed out earlier, the columns of the sphered data matrixZ =Λ −1/2 U ′ X ′ are independent and identically distributed with mean zero and covariance matrix I p . Additionally, Condition 5 assumes a relatively mild tail probability bound on eachentryz jt . Thistechnicalconditionisusefultoensuretheconsistencyofsample principal components. It is a fairly weak condition and satisfied by, for example, 75 Gaussian random variables, where the tail probability decays exponentially. We distinguish the indices in J 1 ∪J 2 ∪···∪J m and J m+1 to account for the size of the last tier k m+1 which grows at a rate of p−K, much larger than than that of the first m tiers whose combined cardinality is no larger than n. Condition 6 assumes that with significant probability 1−α L,L 1 , the L 2 -norm of each column vector of the random design matrix is nonzero with an upper bound L 1 n 1/2 as well as a lower bound Ln 1/2 for the columns with indices in A 0 = supp(β 0 ). The nonzero L 2 -norm assumption is a natural condition and can be easily satisfied for a wide class of distributions. The probability for the upper and lower bounds can be close to one when the constant L is small and constant L 1 is large. Condition 6 is needed since rescalings are used in (2.2). Conditions 7 and 8 are similar to the corresponding conditions in Zheng, Fan and Lv (2014) for deriving the global properties of thresholded regression. The Gaussian error distribution is assumed to simplify the technical arguments and can be generalized to other light-tailed error distributions. The first part s <M/2 of Condition 8 puts a sparsity constraint on the true model size s to ensure model identifiability, while the second part gives a lower bound on the minimum signal strength to ensure model selection consistency. Condition 9 is related to the robust spark of matrix (X,F). Let us gain some insights into this condition. Set M 1 = Cn/(logp) with C some positive constant largerthan˜ cinM. Itisreasonabletoassumealowerboundedminimumeigenvalue 76 forthecovariancematrixofanyM 1 columnsof(X,F). Thisassumptionalongwith Conditions 1 and 2 in Lv (2013) on the population distribution of (X,F) yields the result of Theorem 2 in Lv (2013), which says with asymptotic probability one, the robust spark κ c for any submatrix consisting of M columns of (X,F) satisfies κ c ≥ M = ˜ cn/(logp) for sufficiently small constant c and some positive constant ˜ c depending only on c. This implies Condition 9 by the definition of robust spark. 
The tail probability bound given in the proof of Theorem 2 (Lv, 2013) is θ n,p = p ˜ K exp(− ˜ Cn) that decays exponentially with the sample size n, where ˜ C is a positive constant and ˜ K =2 −1 ˜ Cn/(logp). Condition 10 imposes a specific convergence rate O(logn/(K 2 T 2 n)) so as to bound the the estimation error of confounding factors. As discussed in Theo- rem 3, the convergence rates of sample eigenvectors are in polynomial orders of p determined by the relative spike sizes, under some mild conditions on m, k l , and the distribution of Z. Since the convergence rates of sample prediction vectors are shown to be at least as good as those of sample eigenvectors in Theorem 3, Condition 10 can generally hold when the spikes are distinguishable up to order logn/(K 2 T 2 n). 3.3.4 Main Theoretical Results We will provide our main result in this section. We will illustrate that sample eigenvectors andprediction verctor converge topopulationones. Based onthis, we 77 could establish Model selection consistency and oracle inequalities of thresholded regression. The following theorem gives the convergence rates of sample eigenvec- tors and estimated prediction vectors Xˆ u i , 1 ≤ i ≤ K. As mentioned before, we assume each tier size to be one for the identifiability of prediction vectors. Theorem 3 (Convergence rates of sample eigenvectors and prediction vectors). Assume that Conditions 4 and 5 hold, with probability at least 1− ng 1 (p)K − ng 2 (p)(p−K) as p→∞, it holds that (a) for any 1≤ r≤ m, the convergence rate for θ ir = Angle(ˆ u i ,span{u j : j ∈ J r }) is arccos({1− r−1 X t=1 r−1 Y i=t (1+k i )O p (A(t))−O p (A(r))} 1/2 ), where A(t) = (f 2 1 (p) P m l=t+1 k l M l p α l +f 2 2 (p)k m+1 M m+1 )/p αt . (b) Assume in addition that each tier of nontrivial eigenvalues has size one, then for any 1≤ i≤ K, the convergence rate for ω ii = Angle(Xˆ u i ,Xu i ) is at least arccos({1− i−1 X t=1 2 i−t−1 O p (A(t))−O p (A(i))} 1/2 ), where A(t) becomes (f 2 1 (p) P m l=t+1 M l p α l +f 2 2 (p)(p−K)M m+1 )/p αt . It is clear from part (a) of Theorem 3 that the convergence rates of sample eigenvectors rely on A(t), which depends on the tier number m, tier size k l , and 78 relative spike size p α l /p αt . Under some mild conditions on the orders of m, k l (for instance, m,k l ∼ logp) and the distribution of Z (such as Gaussian distribution, f 1 (p),f 2 (p)∼ logp),theanglesθ ir convergetozeroinpolynomialordersofpdeter- mined by p α l /p αt . The larger the relative spike sizes, the faster the convergence rates we have. The results in part (b) show that the convergence rates of sample prediction vectorsareatleastasgoodasthoseofthesampleeigenvectors inpart(a). Part(b) ofTheorem 3 is important forour purpose of sparse regression with latent features since itprovides the prediction properties which determine the accuracy ofestima- tionforconfoundingfactors. Thisismuchlesswellstudiedintheliterature. Tothe bestofourknowledge, ourworkisafirstattemptinthehigh-dimensionalcase. We will show that these convergence rates are critical in establishing global properties ofthresholdedregressionwithestimatedconfoundingfactors. Thekeytoobtaining these asymptotic orders is to conduct delicate analysis on a high probability event with respect toZ. We emphasize that its tail probability nKg 1 (p)+ng 2 (p)(p−K) candecayfairlyfast. Forinstance,inthecaseofGaussiandistribution,g 1 ,g 2 decay exponentially with f 1 ,f 2 . 
Assume that the covariance matrix $\Sigma$ of the random design matrix $X$ admits an eigen-structure that follows a multiple component spike model with distinct eigenvalues. Then the results of Theorem 3 can be applied to the estimated confounding factors. Combining Conditions 9 and 10 yields the following useful proposition.

Proposition 1. Under Conditions 9 and 10, the inequality $\|n^{-1/2}(X, \widehat F)\delta\|_2 \ge c_1\|\delta\|_2$ holds for some positive constant $c_1$ and any $\delta$ satisfying $\|\delta\|_0 < M$, with probability at least $1 - \theta_{n,p} - nKg_1(p) - n(p-K)g_2(p)$, which approaches one as $n\to\infty$.

The positive constant $c_1$ is smaller than, but close to, $c$ for large enough $n$, as shown in the proof of Proposition 1. Proposition 1 implies that the robust spark of any submatrix consisting of $t$ ($t < M$) columns of the augmented design matrix $(X, \widehat F)$ will be close to that of any $t$ ($t < M$) columns of the matrix $(X, F)$ when $F$ is accurately estimated by $\widehat F$. This proposition plays a key role in proving the global properties of thresholded regression in the presence of hidden features.

Theorem 4 (Model selection consistency and oracle inequalities of thresholded regression). Assume that Conditions 4–10 hold and $c_1^{-1}c_2\sqrt{(2s+1)(\log p)/n} < \lambda < b_0[1\wedge(c_1/\sqrt 2)]$. Then for both the hard-thresholding penalty $p_{H,\lambda}(t)$ and the $L_0$-penalty $p_{H_0,\lambda}(t)$, the regularized estimator $(\widehat\beta^\top, \widehat\gamma^\top)^\top$ satisfies, with probability at least
$$1 - 2\sigma(2/\pi)^{1/2}c_2^{-1}(\log p)^{-1/2}(p+K)\,p^{-\frac{c_2^2}{8\sigma^2L_1^2}} - 2\sigma(2/\pi)^{1/2}c_2'^{-1}s(\log n)^{-1/2}n^{-\frac{c_2'^2}{8\sigma^2}} - \big(nKg_1(p)+ng_2(p)(p-K)\big) - \theta_{n,p} - \alpha_{L,L_1},$$
the following simultaneously:

(a) (Model selection consistency) $\mathrm{supp}\big((\widehat\beta^\top, \widehat\gamma^\top)^\top\big) = \mathrm{supp}\big((\beta_0^\top, \gamma_0^\top)^\top\big)$;

(b) (Prediction loss) $n^{-1/2}\|(X,\widehat F)(\widehat\beta^\top, \widehat\gamma^\top)^\top - (X,F)(\beta_0^\top, \gamma_0^\top)^\top\|_2 \le \sqrt 2\,(c_2/4 + 2\sqrt s\,c_2'c_1^{-1})\sqrt{(\log n)/n}$;

(c) (Estimation losses) $\|\widehat\beta - \beta_0\|_q \le 2L^{-1}c_1^{-2}c_2's^{1/q}\sqrt{(\log n)/n}$ and $\|\widehat\gamma - \gamma_0\|_q \le 2c_1^{-2}c_2's^{1/q}\sqrt{(\log n)/n}$ for $q\in[1,2]$, while $\|\widehat\beta - \beta_0\|_\infty$ and $\|\widehat\gamma - \gamma_0\|_\infty$ are bounded by the same upper bounds as for $q=2$, where $c_2'\ge 2\sqrt 2\sigma$ is some positive constant.

Theorem 4 shows that the joint estimator $(\widehat\beta^\top, \widehat\gamma^\top)^\top$ enjoys the nice property of model selection consistency and that its prediction loss is within a logarithmic factor $(\log n)^{1/2}$ of that of the oracle estimator, with the above choice of the regularization parameter $\lambda$. These results are similar to those in Theorem 1 of Zheng, Fan and Lv (2014) without hidden features. The term $(\sqrt 2 c_2/4)\sqrt{(\log n)/n}$ in the prediction loss bound indeed comes from the estimation errors of the confounding factors, but it does not affect the convergence rate. Thus, despite the errors in estimating the confounding factors, by applying thresholded regression to the augmented design matrix $(X,\widehat F)$ we can still recover the true supports of $\beta$ and $\gamma$ simultaneously, as well as obtain a nice convergence rate for prediction. Furthermore, Theorem 4 also establishes the oracle inequalities of the regularized estimator $(\widehat\beta^\top, \widehat\gamma^\top)^\top$ under the $L_q$-estimation losses with $q\in[1,2]\cup\{\infty\}$. These results hold simultaneously with probability converging to one, where the dimensionality $p$ is allowed to grow up to exponentially fast with the sample size $n$.

Since it is generally challenging to study the properties of regularization methods with unobservable confounding factors, we have explored the linear regression model with hidden features for a special class of confounding factors involving the population principal components.
For multiple component spike models with dominating spikes, such that the directions of the principal components are distinguishable, we have exploited the sample principal components as estimated confounding factors. In this way, through penalized regression, we are able to correctly identify the number of underlying confounding factors and effectively estimate their effects for future prediction. Moreover, it is demonstrated in Theorem 4 that when the estimated confounding factors are added to the original model, they do not affect variable selection and estimation for the observable covariates, which implies that variable selection and estimation for observable and unobservable variables can be conducted simultaneously.

In models without latent features, numerous studies have contributed to the variable selection and estimation performance of various kinds of regularization methods. For example, Candes and Tao (2007) proved that the Dantzig selector can achieve a loss within a logarithmic factor of the dimensionality of that of the ideal (oracle) procedure. The oracle property of penalized likelihood estimators was established in Fan and Peng (2004) in the setting of diverging dimensionality. Bickel, Ritov and Tsybakov (2009) derived parallel oracle inequalities for the Lasso estimator and the Dantzig selector under the prediction loss and various estimation losses. Tang and Leng (2010) proposed a penalized empirical likelihood method for parameter estimation and variable selection that enjoys the oracle property. Rank consistency, prediction, and estimation performance bounds were established for a reduced-rank estimation method with adaptive nuclear-norm penalization in Chen, Dong and Chan (2013).

3.4 Numerical Studies

In this section, we discuss the implementation of thresholded regression and investigate the finite-sample performance of regularization methods with the Lasso, hard-thresholding (Hard), and SICA (Lv and Fan, 2009) penalties under our proposed model. We apply these penalties in two different settings, $M_1$ and $M_2$. Model $M_1$ involves only the observable covariates, while model $M_2$ incorporates the confounding factors as latent features. The oracle procedure (Oracle), which knows the true model in advance, is also included as a benchmark. The case of Gaussian error and a linear regression model with one confounding factor is considered in the first example, while in the second example we deal with the case where the error is relatively heavy-tailed, following a t-distribution, and the number of underlying confounding factors is increased to two.

3.4.1 Implementation

There are many efficient algorithms, including the LQA (Fan and Li, 2001), LARS (Efron et al., 2004), and LLA (Zou and Li, 2008), for implementing regularization methods. In this paper, we apply the ICA algorithm (Fan and Lv, 2011) to implement the aforementioned three regularization methods in all numerical studies. A detailed description and an analysis of the convergence properties of this algorithm can be found in Fan and Lv (2011) and Lin and Lv (2013), respectively. See Section 4 of Zheng, Fan and Lv (2014) for detailed discussions of the implementation using the ICA algorithm on the union of coordinate subspaces $S_{M/2}$, as well as the criterion for selecting the tuning parameters by minimizing the prediction error on an independent validation set, which allows a fair comparison of all methods.
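As a complement to the description above, the following minimal sketch (Python/NumPy) shows one simple way to compute a hard-thresholded regression estimate on an augmented design matrix and to tune $\lambda$ on a validation set. It is a simplified stand-in rather than the ICA algorithm of Fan and Lv (2011) used in the numerical studies, and the function names are ours.

```python
import numpy as np

def hard_threshold_fit(A, y, lam, n_iter=200, step=None):
    """Iterative hard-thresholding sketch: gradient steps on the least-squares loss
    followed by hard thresholding of the coefficients at level lam.
    A is the (rescaled) augmented design matrix (X, F_hat)."""
    n, d = A.shape
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2  # step size from the spectral norm
    b = np.zeros(d)
    for _ in range(n_iter):
        b = b + step * A.T @ (y - A @ b)        # gradient step on the squared loss
        b[np.abs(b) < lam] = 0.0                # keep only components larger than lam
    return b

def tune_lambda(A_tr, y_tr, A_val, y_val, lam_grid):
    """Pick lambda minimizing the prediction error on an independent validation set."""
    errors = [np.mean((y_val - A_val @ hard_threshold_fit(A_tr, y_tr, lam)) ** 2)
              for lam in lam_grid]
    return lam_grid[int(np.argmin(errors))]
```

In practice the columns of the augmented design matrix would first be rescaled to have $L_2$-norm $n^{1/2}$, mirroring the rescaling in (2.2), before calling these helpers.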
3.4.2 Simulation Examples

Simulation Example 1

In the first simulation example, we consider the linear regression model (2.1) with Gaussian error $\varepsilon \sim N(0, \sigma^2 I_n)$. We generated 100 data sets from this model with true regression coefficient vector $\beta_0 = (v^\top,\cdots,v^\top,\mathbf 0^\top)^\top$ and true coefficient vector of confounding factors $\gamma_0 = (0.5, \mathbf 0^\top)^\top$, where $v = (0.6, 0, 0, -0.6, 0, 0)^\top$ is repeated $q$ times and $\gamma_0$ is a $K\times 1$ vector with one nonzero component 0.5, denoting the effect of the confounding factor. We set $q = 3$ and $K = 10$ so that there are six nonzero components with magnitude 0.6 in the true coefficient vector $\beta_0$ and 10 leading sample principal components are included in model $M_2$ as estimated confounding factors. We adopted the setting $(p, \sigma) = (1000, 0.4)$. For each data set, the rows of the $n\times p$ design matrix $X$ were sampled as independent and identically distributed (i.i.d.) copies from a multivariate normal distribution $N(0, \Sigma)$ with $\Sigma = \frac{1}{2}(\Sigma_1 + \Sigma_2)$, where $\Sigma_1 = (0.5^{|i-j|})_{1\le i,j\le p}$ and $\Sigma_2 = 0.5I_p + 0.5\,\mathbf 1\mathbf 1^\top$. The choice of $\Sigma_1$ allows for correlation among the covariates at the population level. The key point of this covariance matrix construction is $\Sigma_2$, a matrix with all entries equal to 0.5 except that the diagonal entries are 1. It has an eigen-structure in which the largest eigenvalue is comparable with $p$, which accounts for the hidden feature.

Based on the construction of $\Sigma_1$ and $\Sigma_2$, it is easy to check that the population covariance matrix $\Sigma$ has largest eigenvalue 251.75, with all the others below 1.75. The underlying confounding factor is the population principal component $f$ corresponding to the largest population eigenvalue. For the three regularization methods, model $M_2$ includes the sample principal components corresponding to the 10 largest sample eigenvalues as estimates of the confounding factors. The oracle procedure uses the true confounding factor $f$ instead of the estimated ones. For both models $M_1$ and $M_2$, we applied the Lasso, Hard, SICA, and Oracle to produce a sequence of sparse models and selected the regularization parameter $\lambda$ as discussed in Section 3.4.1.

To compare the performance of the aforementioned four methods under the two different models, we consider several performance measures. The first measure is the prediction error (PE), defined in model $M_2$ as $E(Y - x^\top\widehat\beta - \widehat f^\top\widehat\gamma)^2$, with $\widehat f$ the estimated confounding factors, $(\widehat\beta, \widehat\gamma)$ an estimate, and $(x^\top, Y)$ an independent observation of the covariates and response, and as $E(Y - x^\top\widehat\beta)^2$ in model $M_1$. For the oracle procedure, $\widehat f$ is replaced by $f$, the true confounding factor. To calculate the expectation, an independent test sample of size 10,000 was generated. The second to fourth measures are the $L_q$-estimation losses of the true regression coefficient vector, $\|\widehat\beta - \beta_0\|_q$ with $q = 2, 1$, and $\infty$, respectively. The fifth and sixth measures are the false positives (FP), falsely selected noise covariates, and false negatives (FN), missed true covariates, for variable selection related to $\beta_0$. We also calculated the estimated error standard deviation $\widehat\sigma$ for all methods under both models. The results are summarized in Table 3.1. Table 3.2 displays the results for estimating the effect of the confounding factor, with measures similar to those in Table 3.1: the $L_q$-estimation losses $\|\widehat\gamma - \gamma_0\|_q$ with $q = 2, 1$, and $\infty$, and FP and FN related to $\gamma_0$.
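For concreteness, before turning to the results, the sketch below (Python/NumPy) generates one data set following the design just described. The sample size $n$ is set to an arbitrary illustrative value (it is specified elsewhere in the chapter), the confounding factor is taken as the leading population principal component score rescaled to $L_2$-norm $n^{1/2}$, and the helper is ours rather than the code used for the reported simulations.

```python
import numpy as np

def make_example1_data(n=100, p=1000, q=3, K=10, sigma=0.4, seed=0):
    """One data set from model (2.1) with a single confounding factor (Example 1)."""
    rng = np.random.default_rng(seed)

    # Population covariance Sigma = (Sigma1 + Sigma2) / 2.
    idx = np.arange(p)
    Sigma1 = 0.5 ** np.abs(idx[:, None] - idx[None, :])  # AR(1)-type correlation
    Sigma2 = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))     # compound symmetry, spike ~ p/2
    Sigma = 0.5 * (Sigma1 + Sigma2)

    # True coefficients: v = (0.6, 0, 0, -0.6, 0, 0) repeated q times, then zeros.
    v = np.array([0.6, 0, 0, -0.6, 0, 0])
    beta0 = np.concatenate([np.tile(v, q), np.zeros(p - q * len(v))])
    gamma0 = np.concatenate([[0.5], np.zeros(K - 1)])    # one active confounding factor

    # Rows of X are i.i.d. N(0, Sigma); the factor is the top population PC score.
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    _, eigvec = np.linalg.eigh(Sigma)
    f = X @ eigvec[:, -1]
    f = f / np.linalg.norm(f) * np.sqrt(n)               # rescale to L2-norm n^{1/2}

    y = X @ beta0 + f * gamma0[0] + sigma * rng.standard_normal(n)
    return X, y, beta0, gamma0, f
```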
From Table 3.1, it is clear that the prediction errors and estimation losses in model $M_2$ are much smaller than those in model $M_1$. The estimation of the true regression coefficient vector $\beta_0$ is seriously deteriorated in model $M_1$, with far more selected noise variables, fewer included true covariates, and an inaccurately estimated error standard deviation, compared with model $M_2$. All the methods perform much better with the estimated confounding factors in model $M_2$. The performance of Hard and SICA is comparable to that of the oracle procedure despite the estimation error of the confounding factor, which is in line with the theoretical results in Theorem 4. We also see from Table 3.2 that both Hard and SICA select exactly the true confounding factor and enjoy relatively accurate estimation of its effect.

Table 3.1: Means and standard errors (in parentheses) of different performance measures for regression coefficients of observable variables by all methods over 100 simulations in Section 3.4.2. $M_1$: model without hidden features; $M_2$: model with confounding factors

Setting  Measure           Lasso          Hard           SICA           Oracle
M_1      PE                15.06 (0.17)   11.33 (0.14)   7.70 (0.09)    —
         L2-loss           2.17 (0.01)    3.83 (0.03)    3.12 (0.02)    —
         L1-loss           12.05 (0.07)   19.10 (0.15)   17.67 (0.11)   —
         L∞-loss           0.71 (0.01)    1.52 (0.03)    1.17 (0.02)    —
         FP                47.54 (0.15)   26.26 (0.67)   42.35 (0.63)   —
         FN                4.86 (0.09)    5.10 (0.09)    4.51 (0.11)    —
         $\widehat\sigma$  3.08 (0.04)    0.76 (0.02)    0.20 (0.01)    —
M_2      PE                1.38 (0.02)    0.19 (0.00)    0.19 (0.00)    0.17 (0.00)
         L2-loss           1.22 (0.01)    0.13 (0.01)    0.12 (0.00)    0.12 (0.00)
         L1-loss           4.19 (0.05)    0.27 (0.02)    0.24 (0.01)    0.24 (0.01)
         L∞-loss           0.60 (0.00)    0.09 (0.00)    0.08 (0.00)    0.08 (0.00)
         FP                30.04 (0.67)   0.14 (0.14)    0 (0)          0 (0)
         FN                2.61 (0.07)    0 (0)          0 (0)          0 (0)
         $\widehat\sigma$  0.97 (0.01)    0.41 (0.00)    0.41 (0.00)    0.40 (0.00)

Table 3.2: Means and standard errors (in parentheses) of different performance measures for regression coefficients of confounding factors by all methods over 100 simulations in Section 3.4.2

Measure                    Lasso          Hard           SICA           Oracle
L2-loss (×10^-2)           10.45 (0.16)   0.71 (0.04)    0.70 (0.04)    0.35 (0.02)
L1-loss (×10^-2)           10.50 (0.17)   0.71 (0.04)    0.70 (0.04)    0.35 (0.02)
L∞-loss (×10^-2)           10.44 (0.16)   0.71 (0.04)    0.70 (0.04)    0.35 (0.02)
FP                         0.03 (0.02)    0 (0)          0 (0)          0 (0)
FN                         0 (0)          0 (0)          0 (0)          0 (0)

Simulation Example 2

A natural question is whether these results and phenomena still hold if more confounding factors are involved and the error becomes relatively heavy-tailed. We therefore turn our attention to the case where the linear regression model (2.1) involves more than one population principal component as confounding factors and the error vector $\varepsilon$ follows a t-distribution. In general, the results may not be as good as in the case with only one confounding factor, since the relative sizes of the several leading eigenvalues may not be large enough for significant separation. The heavy-tailed error can also deteriorate the prediction and estimation. These points are illustrated through this simulation example.

The setting of this example is the same as that in Section 3.4.2 except that the population covariance matrix is changed so that the population principal components related to the two largest population eigenvalues are added to the hidden structure as confounding factors, with true coefficient vector $\gamma_0 = (0.5, -0.5, \mathbf 0^\top)^\top$, and the error vector becomes $\varepsilon = \sigma\eta$, where the components of the $n$-dimensional random vector $\eta$ are independent and follow the t-distribution with $df = 10$ degrees of freedom.
In order to generate a covariance matrix with leading eigenvalues of several distinguishable orders of the dimensionality $p$, we consider the block diagonal matrix
$$\widetilde\Sigma = \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix},$$
where $\Sigma_{11} = \frac{3}{4}(\Sigma_1+\Sigma_2)_{1\le i,j\le 200}$ and $\Sigma_{22} = \frac{1}{2}(\Sigma_1+\Sigma_2)_{1\le i,j\le 800}$, with $\Sigma_1$ and $\Sigma_2$ constructed as in Section 3.4.2 but of different dimensions. In this case, the two largest eigenvalues of $\widetilde\Sigma$ are 201.75 and 77.61, respectively, while the others are less than 2.63. We still use 10 sample principal components to estimate the confounding factors and compare the performance of the Lasso, Hard, SICA, and oracle procedure under the two different models $M_1$ and $M_2$. The same performance measures as in Section 3.4.2 are employed for comparison.

The means and standard errors of the different performance measures by all methods are listed in Tables 3.3 and 3.4. The results are similar to those in Section 3.4.2, with all methods performing far better in model $M_2$. However, although the performance measures of Hard and SICA are still close to those of the oracle procedure, both methods deteriorate in comparison with Table 3.1. It is also clear from Table 3.4 that the estimation of the coefficients of the confounding factors becomes more challenging. This shows the difficulty of estimating population principal components when the covariance matrix contains several leading eigenvalues whose relative sizes are not large enough for significant separation. The relatively heavy-tailed error distribution also contributes to the deteriorated performance of the estimation results.

Table 3.3: Means and standard errors (in parentheses) of different performance measures for regression coefficients of observable variables by all methods over 100 simulations in Section 3.4.2. $M_1$: model without hidden features; $M_2$: model with confounding factors; the population error standard deviation (SD) $\sigma\sqrt{df/(df-2)}$ equals 0.4472

Setting  Measure     Lasso          Hard           SICA           Oracle
M_1      PE          22.94 (0.24)   19.26 (0.23)   13.70 (0.16)   —
         L2-loss     2.45 (0.01)    4.53 (0.03)    3.75 (0.03)    —
         L1-loss     13.65 (0.08)   23.19 (0.23)   20.69 (0.14)   —
         L∞-loss     0.87 (0.02)    1.80 (0.03)    1.44 (0.03)    —
         FP          47.12 (0.17)   29.12 (0.70)   36.65 (0.61)   —
         FN          4.72 (0.08)    4.99 (0.09)    4.48 (0.09)    —
         Error SD    3.55 (0.05)    0.85 (0.03)    0.38 (0.02)    —
M_2      PE          3.21 (0.12)    1.22 (0.12)    1.22 (0.12)    0.22 (0.00)
         L2-loss     1.28 (0.01)    0.18 (0.02)    0.18 (0.02)    0.10 (0.00)
         L1-loss     5.00 (0.08)    0.50 (0.08)    0.47 (0.06)    0.21 (0.01)
         L∞-loss     0.60 (0.00)    0.10 (0.01)    0.11 (0.01)    0.07 (0.00)
         FP          35.08 (0.67)   1.76 (0.60)    1.42 (0.46)    0 (0)
         FN          2.81 (0.07)    0.05 (0.05)    0.06 (0.05)    0 (0)
         Error SD    1.16 (0.02)    0.48 (0.01)    0.48 (0.01)    0.45 (0.00)

Table 3.4: Means and standard errors (in parentheses) of different performance measures for regression coefficients of confounding factors by all methods over 100 simulations in Section 3.4.2

Measure     Lasso          Hard           SICA           Oracle
L2-loss     0.22 (0.01)    0.07 (0.00)    0.07 (0.00)    0.01 (0.00)
L1-loss     0.28 (0.01)    0.09 (0.01)    0.09 (0.01)    0.01 (0.00)
L∞-loss     0.20 (0.01)    0.07 (0.00)    0.07 (0.00)    0.01 (0.00)
FP          0 (0)          0 (0)          0 (0)          0 (0)
FN          0 (0)          0 (0)          0 (0)          0 (0)

3.4.3 Real Data Example

We compare the performance of the Lasso, Hard, and SICA under the two different models, with and without confounding factors, on the leukemia data set from high-density Affymetrix oligonucleotide arrays, which was previously studied in Golub et al. (1999) and is available at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. This data set was also analyzed in Fan and Fan (2008).
It consists of 7129 genes and 72 samples from two classes: 47 in class ALL (acute lymphocytic leukemia) and 25 in class AML (acute mylogenous leukemia). Although the response variable is binary, we fit the linear regression model with the regularization methods and used a validation set to select the regularization parameter. With the fitted regression equation, we classify a patient as in class ALL if the predicted response is less than 1/2, and as in class AML otherwise.

We randomly split the 72 samples into a training set of 30 samples (10 in class ALL and 20 in class AML) and a validation set of 42 test samples (37 in class ALL and 5 in class AML). To reduce the dimensionality, for each splitting we first selected the 1000 covariates with the largest two-sample t-statistics in magnitude based on the training set (Fan and Fan, 2008). The underlying confounding factors were estimated by the sample principal components. Only the sample principal components with t-statistics over 0.5 in magnitude were counted as possible confounding factors and added to the residual matrix to obtain an augmented design matrix, where the residual matrix was obtained by subtracting from the original design matrix its projection onto the space spanned by those selected sample principal components. We repeated the random splitting 50 times, obtained a linear classifier for each method in both models using the training set, and calculated the classification error using the test set. The classification results are summarized in Table 3.5. We also calculated the median model size of each method for both models.

Table 3.5: The means and standard errors (in parentheses) of classification errors, median model sizes and selection probabilities for the first two sample principal components by different methods over 50 random splittings of the leukemia data set in Section 3.4.3. $M_1$: model without hidden features; $M_2$: model with hidden features

Setting  Measure                  Lasso         Hard          SICA
M_1      Classification errors    1.54 (0.13)   7.66 (0.49)   3.78 (0.35)
         Median model size        9.5           6             4
M_2      Classification errors    0.72 (0.11)   1.56 (0.18)   0.90 (0.15)
         Median model size        3             2             2
         Selection probability 1  0.94          0.90          0.92
         Selection probability 2  0.76          0.70          0.76

It is clear from Table 3.5 that our proposed model with estimated confounding factors tends to have a smaller median model size and reduces the classification errors by more than half. An interesting phenomenon is that the models with estimated confounding factors selected by all three methods contain almost exactly the first two sample principal components, with a selection probability over 0.9 for the first sample principal component and above 0.7 for the other. It appears that the more often these two sample principal components are selected, the better the performance the regularization method enjoys. This demonstrates that, for this data set, our model using confounding factors estimated by the leading two sample principal components performs much better, since the classification errors are significantly reduced when the hidden features are included.
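The screening-and-augmentation steps used in this example can be summarized in the short sketch below (Python/NumPy). The data loading and the regularized fit itself are omitted, and the helper name is ours, so this is only a schematic of the preprocessing described above, not the code used to produce Table 3.5.

```python
import numpy as np

def augment_with_pcs(X_train, y_train, n_keep=1000, t_cut=0.5):
    """Screen genes by two-sample t-statistics, then augment the residual matrix
    with the sample PCs whose own t-statistics exceed t_cut in magnitude."""
    # Two-sample t-statistics of each gene between the two classes (y in {0, 1}).
    g0, g1 = X_train[y_train == 0], X_train[y_train == 1]
    se = np.sqrt(g0.var(axis=0, ddof=1) / len(g0) + g1.var(axis=0, ddof=1) / len(g1))
    tstat = (g1.mean(axis=0) - g0.mean(axis=0)) / se
    keep = np.argsort(-np.abs(tstat))[:n_keep]           # top n_keep genes
    Xs = X_train[:, keep]

    # Sample principal component scores of the screened matrix.
    Xc = Xs - Xs.mean(axis=0)
    U, d, _ = np.linalg.svd(Xc, full_matrices=False)
    pcs = U * d

    # Keep only PCs whose two-sample t-statistics exceed the cutoff.
    p0, p1 = pcs[y_train == 0], pcs[y_train == 1]
    se_pc = np.sqrt(p0.var(axis=0, ddof=1) / len(p0) + p1.var(axis=0, ddof=1) / len(p1))
    t_pc = (p1.mean(axis=0) - p0.mean(axis=0)) / se_pc
    F_hat = pcs[:, np.abs(t_pc) > t_cut]

    # Residual matrix: remove the projection of Xs onto the selected PCs.
    proj = F_hat @ np.linalg.lstsq(F_hat, Xs, rcond=None)[0]
    return np.hstack([Xs - proj, F_hat])
```

A regularized fit (Lasso, Hard, or SICA) would then be run on the returned augmented matrix, with a patient classified as ALL when the fitted response falls below 1/2.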
3.5 Discussion

We have investigated the problems of prediction and variable selection in the presence of latent features. Our work has focused on one potential family of confounding factors: population principal components. It has been shown that when the eigen-structure of the population covariance matrix follows a general multiple component spike model with dominating spikes, the unobservable confounding factors can be estimated consistently by the sample principal components. This enables the suggested sparse regression with nonsparse latent features to accurately estimate the effects of the observable covariates. In particular, we have exploited thresholded regression and shown that, by incorporating the estimated confounding factors, it can simultaneously identify the number of important confounding factors and estimate their contributions to the response.

To simplify the technical presentation, we have considered the multiple component spike model with dominating spikes, where the leading population principal components are identifiable. It would be interesting to investigate whether similar prediction and variable selection results hold in the more general case of multiple component spike models with tiered eigenvalues. Since the subspace consistency and the corresponding convergence rates of the sample eigenvectors have been established in this paper, a natural idea is to estimate the confounding factors by using orthonormal bases of the spaces spanned by the sample eigenvectors within the same tier.

Another possible extension of our work is to exploit ways other than principal components to deal with unobservable confounding factors. For instance, Chandrasekaran, Parrilo and Willsky (2012) proposed a convex optimization framework for graphical model selection in the setting where the latent and observed variables are jointly Gaussian, with the structure of the observed variables conditional on the latent variables specified by a graphical model. The alternating direction methods in Ma, Xue and Zou (2013) can be used to solve such convex optimization problems. In contrast, our method does not require the Gaussianity assumption. Moreover, it would be interesting to extend the method to more general model settings, including generalized linear models, survival analysis, and panel data. These problems are beyond the scope of the current paper and will be interesting topics for future research.

3.6 Proofs

In this section, we list the lemmas and present the proofs of the main results. The additional technical proofs of all lemmas are provided in Section 3.6.5.

3.6.1 Lemmas

A sample eigenvector is said to be consistent with its population counterpart if the angle between them converges to zero as the dimensionality $p$ goes to infinity. However, when several population eigenvalues belong to the same tier, the corresponding sample eigenvectors may not be distinguishable. In that case, subspace consistency is essential to characterize the asymptotic properties of the sample eigenvectors. We refer the readers to Jung and Marron (2009) for a detailed discussion of consistency and subspace consistency of sample eigenvectors. In this section, we introduce the following lemmas, which are used in the proofs of the main results.

Lemma 3 (Asymptotic orders of sample eigenvalues). Under Conditions 4 and 5, with probability at least $1 - (ng_1(p)\sum_{l=1}^m k_l + ng_2(p)k_{m+1})$, the eigenvalues of the sample covariance matrix $S$ satisfy $\widehat\lambda_j = O_p(p^{\alpha_l})$, $j\in J_l$, $l = 1,\cdots,m$, as $p\to\infty$.

Lemma 4 (Subspace consistency of sample eigenvectors). Under Conditions 4 and 5, with probability at least $1 - (ng_1(p)\sum_{l=1}^m k_l + ng_2(p)k_{m+1})$, the eigenvectors of the sample covariance matrix $S$ satisfy $\mathrm{Angle}(\widehat u_i, \mathrm{span}\{u_j: j\in J_l\})\to 0$, $i\in J_l$, $l = 1,\cdots,m$, as $p\to\infty$.

Lemma 5. Let $\widetilde\varepsilon = (F - \widehat F)\gamma + \varepsilon$ and denote by $X_0$ and $\widehat F_0$ the submatrices of $X$ and $\widehat F$ consisting of the columns in $\mathrm{supp}(\beta_0)$ and $\mathrm{supp}(\gamma_0)$, respectively.
Define two events ˜ E ={kn −1 (X, ˆ F)τ˜ εk ∞ ≤ c 2 p (logp)/n} and E 0 ={kn −1 (X 0 , ˆ F 0 )τ˜ εk ∞ ≤ c ′ 2 p (logn)/n}, with c 2 ≥ √ 10σL 1 and c ′ 2 ≥ 2 √ 2σ some positive constants. Then conditional on the event in Condition 6, we have P(E c 1 )≤ 2σ(2/π) 1/2 c −1 2 (logp) −1/2 (p+K)p − c 2 2 8σ 2 L 2 1 + 2σ(2/π) 1/2 c ′−1 2 s(logn) −1/2 n − c ′2 2 8σ 2 +nKg 1 (p)+ng 2 (p)(p−K)→ 0 as n→∞, where E 1 = ˜ E∩E 0 . Lemma 3 shows that with probability approaching one, with diverging dimen- sionality p, the sample eigenvalues in the first m tiers (corresponding to the non- trivial eigenvalues) grow at the same rates as their population counterparts. The asymptotic order O p (p α l ) applies to each sample eigenvalue, which is useful to establish subspace consistency andconvergence rates forthe corresponding sample eigenvectors. 96 BasedontheasymptoticpropertiesofsampleeigenvaluesinLemma3,Lemmas 4andTheorem1gofurthertoshowthesubspaceconsistencyandconvergencerates ofsampleeigenvectorscorrespondingtotheeigenvaluesinthefirstmtiers,withthe same probability bound. It is immediate that the subspace consistency reduces to theconsistencyofsampleeigenvectorsifeachtierofnontrivialeigenvaluescontains only one element, as in the spiked covariance model. 3.6.2 Proof of Theorem 3 Proof of (a) Thekeyideaofthisproofistonotethefactthatthesampleeigenvectorsindifferent groups are perpendicular to each other, such that we can conduct delicate analysis on the angles between sample eigenvectors and the space spanned by population eigenvectors. Based on the relationship between angles of vector and spanned spaces and angles of two spanned spaces, we prove the result by induction. In fact, the convergence rates for sample eigenvectors with indices in J 1 have already been given in the proof of Lemma 4. We will prove convergence rates for sample eigenvectors in J 2 as a starting point and then get the conclusion by induction. Step1: Convergenceratesofsampleeigenvectorswithindicesin J 2 . This step can also be regarded as the first step for induction. For any i∈ J 1 ∪···∪J m , define θ il = Angle(ˆ u i ,span{u j : j ∈ J l }), l =1,··· ,m, (6.4) 97 as the angle between the sample eigenvector ˆ u i and the space spanned by popula- tion eigenvectors in J l . And note θ ∗ rr = max ˜ u∈span{ ˆ u i :i∈Jr} Angle(˜ u,span{u j : j ∈ J r }) as the angles between the subspace spanned by sample eigenvectors in J r and the subspace spanned by population eigenvectors in J r . For any vector ˜ u in the span{ˆ u i : i∈ J r } with L 2 -norm 1, we have ˜ u = P kr i=1 a i ˆ u i with P kr i=1 a 2 i = 1. TheconvergenceratesofsampleeigenvectorswithindicesinJ 1 areindeedgiven in Proof of Lemma 4. Since cos 2 (θ i1 ) = P j∈J 1 p 2 ji , for any i ∈ J 1 , it shows that conditional on event E as p→∞, cos 2 (θ i1 )≥ 1−O p ((f 2 1 (p) m X l=2 k l M l p α l +f 2 2 (p)k m+1 M m+1 )/p α 1 )= 1−O p (A(1)). (6.5) Because sin 2 θ i1 +cos 2 θ i1 = 1, sin 2 θ i1 ≤ O p (A(1)). Then for any ˜ u in span{ˆ u i : i∈ J 1 } with L 2 -norm 1, we have ˜ u = P k 1 i=1 a i ˆ u i with P k 1 i=1 a 2 i = 1. It follows that sin 2 Angle(˜ u,span{u j : j ∈ J 1 })≤ ( k 1 X i a i sinθ i1 ) 2 98 ≤ ( k 1 X i a i ) 2 O p (A(1)) =k 1 O p (A(1)). BecauseAngle(˜ u,span{u j : j ∈ J 1 })isintheinterval[0,π/2]andthefunctionsin 2 is monotonous in this interval, this means for θ ∗ 11 which is the maximum among all angles above we have sin 2 θ ∗ 11 ≤ k 1 O p (A(1)). 
For a sample eigenvector with index i∈ J 2 , in order to find a lower bound for cos 2 (θ i2 ), we write it as cos 2 (θ i2 )= X j∈J 2 p 2 ji = 1 − X j∈J 1 p 2 ji − X j∈J 3 ∪···∪J m+1 p 2 ji . (6.6) Since the sample eigenvectors in different groups are perpendicular to each other, any sample eigenvectors from tier J 2 is perpendicular to the spanned spaces of sample eigenvectors from tier J 1 . Thus we have for any sample eigenvector with index i∈ J 2 , it should satisfy π 2 −θ 11 ∗≤ θ i1 ≤ π 2 . Thus, together with the above bound, we get X j∈J 1 p 2 ji =cos 2 (θ i1 )≤ sin 2 (θ ∗ 11 )≤ k 1 O p (A(1)). 99 On the other hand, by (6.27) in Proof of Lemma 2, we have X j∈J 3 ∪···∪J m+1 p 2 ji ≤ O p (A(2)). Plugging the above two bounds into (6.6) gives the convergence rates for sample eigenvectors in J 2 , cos 2 (θ i2 )≥ 1−(1+k 1 )O p (A(1))−O p (A(2)). Step 2: Convergence rates of sample eigenvectors with indices in J 3 to J m . For any fixed k, 2≤ k ≤ m−1, assume that the conclusion is true, that is, for any i∈ J s , 1≤ s≤ k, cos 2 (θ is )≥ 1− s−1 X t=1 s−1 Y i=t (1+k i )O p (A(t))−O p (A(s)), we need to show that the conclusion holds for i∈ J k+1 . Given that the conclusion is true for any 1 ≤ s ≤ k, using the same technique in step 1, for any ˜ u in span{ˆ u i : i∈ J s } with L 2 -norm 1, sin 2 Angle(˜ u,span{u j : j ∈ J s })≤ ( ks X i a i sinθ is ) 2 ≤ ( ks X i a i ) 2 { s−1 X t=1 s−1 Y i=t (1+k i )O p (A(t))+O p (A(s))} 100 ≤ k s s−1 X t=1 s−1 Y i=t (1+k i )O p (A(t))+k s O p (A(s)). It follows that sin 2 θ ∗ ss ≤ k s s−1 X t=1 s−1 Y i=t (1+k i )O p (A(t))+k s O p (A(s)). Similarly as before, we write cos 2 (θ i,k+1 ) as cos 2 (θ i,k+1 ) = X j∈J k+1 p 2 ji = 1 − X j∈J 1 ∪···∪J k p 2 ji − X j∈J k+2 ∪···∪J m+1 p 2 ji . We will apply the same technique as in Step 1 to bound P j∈J 1 ∪···∪J k p 2 ji . For any s, 1 ≤ s ≤ k, since the sample eigenvector ˆ u i are perpendicular to the sample eigenvectors in the sth group, we have π 2 −θ ss ∗≤ θ is ≤ π 2 , which gives X j∈Js p 2 ji =cos 2 (θ is )≤ sin 2 (θ ∗ ss )≤ k s s−1 X t=1 s−1 Y i=t (1+k i )O p (A(t))+k s O p (A(s)). 101 By adding up the indices from J 1 to J k , we get X j∈J 1 ∪···∪J k p 2 ji ≤ k X s=1 {k s s−1 X t=1 s−1 Y i=t (1+k i )O p (A(t))+k s O p (A(s))}. Simplifying the above expression by combining same terms gives X j∈J 1 ∪···∪J k p 2 ji ≤ k X t=1 k Y i=t (1+k i )O p (A(t)). On the other hand, by (6.27) in Proof of Lemma 2, we have X j∈J k+2 ∪···∪J m+1 p 2 ji ≤ O p (A(k+1)). Combining these two bounds gives the convergence rates for cos 2 (θ i,k+1 ) as cos 2 (θ i,k+1 )≥ 1− k X t=1 k Y i=t (1+k i )O p (A(t))−O p (A(k+1)). By induction, the results of part (a) hold for any 1≤ k≤ m. Proof of (b) In this proof, we will show that the convergence rates of prediction vectors are at least as fast as those of sample eigenvectors established in part (a) when each tier has size one. We conduct delicate analysis on the angles between sample prediction vectors and populationprediction vectors, and make use ofsome results 102 about sample eigenvalues and inner products between eigenvectors derived in the proofs of Lemmas 3 and 4 in Sections 3.6.5 and 3.6.5. First,accordingtopart(a),wheneachtierhassizeone, whichmeansk j =1for any 1≤ j ≤ m, we could have the convergence rate for θ jr = Angle(ˆ u j ,span{u j : j ∈ J r }) for any r is arccos({1− r−1 X t=1 2 r−t O p (A(t))−O p (A(r))} 1/2 ). Set ω ij to be the angle between Xu i and Xˆ u j . 
Note that X ′ X = n ˆ U ˆ Λ ˆ U ′ and ˆ u j is the eigenvector of X ′ X corresponding to the eigenvalue n ˆ λ j with L 2 -norm equaling 1. Then cos(ω jj ) can be calculated as cos(ω jj ) = (Xu j ) ′ Xˆ u j kXu j k 2 kXˆ u j k 2 = n ˆ λ j u ′ j ˆ u j q n ˆ λ j kXu j k 2 = q n ˆ λ j u ′ j ˆ u j kXu j k 2 . (6.7) For the expression kXu j k 2 at the bottom, we have kXu j k 2 2 =u ′ j X ′ Xu j =nu ′ j ˆ U ˆ Λ ˆ U ′ u j =n p X i=1 ˆ λ i (u ′ j ˆ u i ) 2 = n p X i=1 ˆ λ i p 2 ji . Squaring both sides on (4) and replacing kXu j k 2 2 with the above expression gives cos 2 (ω jj )= n ˆ λ j (u ′ j ˆ u j ) 2 n P p i=1 ˆ λ i p 2 ji = ˆ λ j ku j k 2 2 kˆ u j k 2 2 cos 2 (θ jj ) P p i=1 ˆ λ i p 2 ji = ˆ λ j cos 2 (θ jj ) P p i=1 ˆ λ i p 2 ji , 103 where θ jj is defined in (6.4) as the angle between u j and ˆ u j since each tier of nontrivial eigenvalues contains only one element in the setting of Theorem 3. Tofurthersimplifytheexpressionofcos 2 (ω jj ),wemakeuseof(6.24)fordealing with the denominator P p i=1 ˆ λ i p 2 ji such that cos 2 (ω jj ) = ˆ λ j cos 2 (θ jj ) P p i=1 ˆ λ i p 2 ji = ˆ λ j cos 2 (θ jj ) λ j (n −1 z j z ′ j ) , (6.8) where z j is defined before as the jth row vector of Z. Recall that in Proof of Lemma 3, we get some properties of the sample eigenvalues, especially a lower bound for ˆ λ j as listed in (6.22). Combined with the setting of Theorem 3 that each tier of nontrivial eigenvalues has size 1, it gives ˆ λ j ≥ ϕ 1 (n −1 λ j z ′ j z j ). Replacing ˆ λ j in (6.8) with this bound gives cos 2 (ω jj )= ˆ λ j cos 2 (θ jj ) λ j (n −1 z j z ′ j ) ≥ ϕ 1 (n −1 z ′ j z j )cos 2 (θ jj ) n −1 z j z ′ j . (6.9) Since ϕ 1 (n −1 z ′ j z j ) denotes the largest eigenvalue of n −1 z ′ j z j , a p×p matrix with rank 1, we have all the other eigenvalues to be 0. Thus, ϕ 1 (n −1 z ′ j z j ) =trace(n −1 z ′ j z j )= trace(n −1 z j z ′ j ) =n −1 z j z ′ j . 104 Then the expression in (6.9) indeed gives cos 2 (ω jj )≥ cos 2 (θ jj ). Thus we know that the convergence rates of prediction vectors are at least as good as that of sample eigenvectors, which completes the proof. 3.6.3 Proof of Proposition 1 Since the inequalitykn −1/2 (X,F)δk 2 ≥ ckδk 2 holds for anyδ satisfyingkδk 0 <M with large probability 1−θ n,p by Condition 9, we derive a similar result for (X, ˆ F) byanalyzingtheestimationerrorofconfoundingfactorsbasedonaboveconditions. We first bound the estimation error max 1≤j≤K kf j − ˆ f j k 2 . By Condition 10 and conditional on the event E, we have kf j − ˆ f j k 2 2 ≤kf j k 2 2 +k ˆ f j k 2 2 −2 f ′ j ˆ f j = n+n−2kf j k 2 k ˆ f j k 2 cos(ω jj ) =2n−2ncos(ω jj ) =2n(1−cos(ω jj ))≤ c 2 2 logn 4K 2 T 2 for any 1≤ j ≤ K, which gives the confounding factor estimation error bound max 1≤j≤K kf j − ˆ f j k 2 ≤ c 2 2KT p logn. (6.10) 105 Then by the triangular inequality we have kn −1/2 (X, ˆ F)δk 2 ≥kn −1/2 (X,F)δk 2 −kn −1/2 (X,F)δ−n −1/2 (X, ˆ F)δk 2 ≥ ckδk 2 −n −1/2 k(F− ˆ F)δ 1 k 2 ≥ ckδk 2 −n −1/2 max 1≤j≤K k(f j − ˆ f j )k 2 kδ 1 k 1 , where δ 1 is a subvector of δ consisting of components with respect to F. By applying (6.10) as well as the fact thatkδ 1 k 1 ≤ √ Kkδ 1 k 2 ≤ √ Kkδk 2 , we get kn −1/2 (X, ˆ F)δk 2 ≥ ckδk 2 −n −1/2 · c 2 2KT p logn· √ Kkδk 2 ≥ c 1 kδk 2 , where c 1 is some positive constant no larger than c− c 2 2T q logn nK . Therefore, c 1 is smaller thanbutclose tocforsufficiently largen. Combining above conditions, we have shown that Proposition 1 holds with probability at least 1−θ n,p −nKg 1 (p)− n(p−K)g 2 (p). 
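As an informal numerical companion to Proposition 1 (not part of the proof), the sketch below (Python/NumPy) compares $\|n^{-1/2}(X, F)\delta\|_2/\|\delta\|_2$ with $\|n^{-1/2}(X, \widehat F)\delta\|_2/\|\delta\|_2$ over random sparse vectors $\delta$ in a small single-spike design. The dimensions, the sparsity cap, and the number of random draws are arbitrary illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, K, M = 100, 300, 1, 10            # illustrative sizes; M plays the role of the sparsity cap

# Spiked design: one dominating eigenvalue of order p along u1 = e1.
X = rng.standard_normal((n, p))
X[:, 0] *= np.sqrt(0.5 * p)
u1 = np.zeros(p)
u1[0] = 1.0

# True factor F = X u1 and its estimate F_hat from the top sample PC, both rescaled to norm n^{1/2}.
F = (X @ u1)[:, None]
F = F / np.linalg.norm(F) * np.sqrt(n)
_, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
F_hat = (X @ Vt[0])[:, None]
F_hat = F_hat / np.linalg.norm(F_hat) * np.sqrt(n)

A_true, A_hat = np.hstack([X, F]), np.hstack([X, F_hat])
ratios_true, ratios_hat = [], []
for _ in range(2000):                   # random sparse directions with ||delta||_0 < M
    delta = np.zeros(p + K)
    support = rng.choice(p + K, size=M - 1, replace=False)
    delta[support] = rng.standard_normal(M - 1)
    ratios_true.append(np.linalg.norm(A_true @ delta) / np.sqrt(n) / np.linalg.norm(delta))
    ratios_hat.append(np.linalg.norm(A_hat @ delta) / np.sqrt(n) / np.linalg.norm(delta))

print(f"min ratio with F: {min(ratios_true):.3f}, with F_hat: {min(ratios_hat):.3f}")
# The two minima are typically close, illustrating that replacing F by F_hat barely changes
# the restricted lower bound in Proposition 1 (a constant c_1 close to c).
```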
3.6.4 Proof of Theorem 4 A similar idea as in Zheng, Fan and Lv (2013) is used to prove the global prop- erties of thresholded regression. The proof consists of two parts. The first part illustrates the model selection consistency property of ( b βτ,b γτ)τ with certain λ given in Theorem 4. Based on the model selection consistency property from the first part, we prove the oracle prediction loss is upper bounded as the second part. The key challenge of this proof is that we need to deal with the estimation error 106 of the confounding factors, which we solved by using the convergence properties of the estimated prediction vectors established before. We first prove the properties when the L 2 -norms of columns of the design matrixX are of a common scale n 1/2 , which meansβ ∗ =β and L =1 in this case. Denote s =kβ 0 k 0 +kγ 0 k 0 , b 0 = min j∈supp(β 0 ) (|β 0,j |)∧min j∈supp(γ 0 ) (|γ 0,j |) as the number of overall true covariates and minimum signal strength for both covariates and confounding factors. Part 1: Model selection consistency property. The proof consists of two steps. In the first step, it will be shown that when c 2 c 1 p (2s+1)(logp)/n< λ< b 0 , the number of nonzero elements in ( b βτ,b γτ)τ is no larger than s conditioning on event ˜ E defined in Lemma 5. We prove this by using the global optimality of ( b βτ,b γτ)τ. The second step is based on the first step, where we will use proof by contradiction to show that supp((β 0 τ,γ 0 τ)τ) ⊂ supp(( b βτ,b γτ)τ) with the addi- tional assumption λ <b 0 c 1 / √ 2 of the theorem. By the hard-thresholding property in Lemma 1 of Zheng, Fan and Lv (2013) and λ < b 0 , any nonzero component of the true regression coefficient vector (β 0 τ,γ 0 τ)τ or of the global minimizer ( b βτ,b γτ)τ is greater than λ, which ensures thatkp λ (( b βτ,b γτ)τ)k 1 =λ 2 k( b βτ,b γτ)τk 0 /2andkp λ ((β 0 τ,γ 0 τ)τ)k 1 =sλ 2 /2. Thus, kp λ (( b βτ,b γτ)τ)k 1 −kp λ ((β 0 τ,γ 0 τ)τ)k 1 = (k( b βτ,b γτ)τk 0 −s)λ 2 /2. 107 Denote by δ =( b βτ,b γτ)τ −(β 0 τ,γ 0 τ)τ. Direct calculations yield Q(( b βτ,b γτ)τ)−Q((β 0 τ,γ 0 τ)τ) = 2 −1 kn − 1 2 (X, ˆ F)δk 2 2 −n −1 ˜ ετ(X, ˆ F)δ (6.11) +(k( b βτ,b γτ)τk 0 −s)λ 2 /2, where ˜ ε = (F− ˆ F)γ +ε, the sum of the random error ε and estimation error (F− ˆ F)γ. On the other hand, conditional on event ˜ E, we have |n −1 ˜ ετ(X, ˆ F)δ| ≤ kn −1 ˜ ετ(X, ˆ F)k ∞ kδk 1 (6.12) ≤ c 2 p (logp)/nkδk 1 ≤ c 2 p (logp)/nkδk 1 2 0 kδk 2 . In addition, by Condition 8 and definition of S M/2 , we obtain kδk 0 ≤ k(β 0 τ,γ 0 τ)τk 0 +k( b βτ,b γτ)τk 0 < M, with M being the robust spark of (X, ˆ F). By Proposition 1, we have kn − 1 2 (X, ˆ F)δk 2 ≥ c 1 kδk 2 . (6.13) Plugging inequalities (6.12) and (6.13) into (6.11) gives that Q(( b βτ,b γτ)τ)−Q((β 0 τ,γ 0 τ)τ)≥ 2 −1 c 2 1 kδk 2 2 − c 2 p (logp)/nkδk 1 2 0 kδk 2 +(k( b βτ,b γτ)τk 0 −s)λ 2 /2. (6.14) 108 Thus, the global optimality of ( b βτ,b γτ)τ ensures that 2 −1 c 2 1 kδk 2 2 −c 2 r logp n kδk 1 2 0 kδk 2 +(k( b βτ,b γτ)τk 0 −s)λ 2 /2≤ 0. Reorganizing the above inequality and collecting terms, we get h c 1 kδk 2 − c 2 c 1 r logp n kδk 1 2 0 i 2 −( c 2 c 1 ) 2 logp n kδk 0 +(k( b βτ,b γτ)τk 0 −s)λ 2 ≤ 0, Since h c 1 kδk 2 − c 2 c 1 q logp n kδk 1 2 0 i 2 ≥ 0, this gives us (k( b βτ,b γτ)τk 0 −s)λ 2 ≤ ( c 2 c 1 ) 2 logp n kδk 0 . (6.15) We next bound the value of k( b βτ,b γτ)τk 0 using the above inequality (6.15). Let k =k( b βτ,b γτ)τk 0 , then kδk 0 =k( b βτ,b γτ)τ −(β 0 τ,γ 0 τ)τk 0 ≤ k +s. Thus, it follows from (6.15) that (k−s)λ 2 ≤ ( c 2 c 1 ) 2logp n (k+s). 
Organizing it in terms of k and s, we get k(λ 2 −( c 2 c 1 ) 2 logp n )≤ s(λ 2 +( c 2 c 1 ) 2 logp n ). (6.16) 109 Since λ > c 2 c 1 p (2s+1)logp/n, we have λ 2 −( c 2 c 1 ) 2 (2s+1) logp n >0 and λ 2 c 2 1 n− c 2 2 logp > 2c 2 2 slogp. Thus we have 2c 2 2 logp λ 2 c 2 1 n−c 2 2 logp < 1/s. Then it follows from inequality (6.16) that k≤ s (λ 2 +( c 2 c 1 ) 2logp n ) (λ 2 −( c 2 c 1 ) 2 logp n ) = s(1+ 2c 2 2 logp λ 2 c 2 1 n−c 2 2 logp )< s+1. Therefore, the number of nonzero elements in ( b βτ,b γτ)τ satisfies k( b βτ,b γτ)τk 0 ≤ s. The second step is based on the first step, where we will use proof by contra- diction to show that supp((β 0 τ,γ 0 τ)τ) ⊂ supp(( b βτ,b γτ)τ) with the additional assumption λ < b 0 c 1 / √ 2 of the theorem. Suppose that supp((β 0 τ,γ 0 τ)τ) 6⊂ supp( b βτ,b γτ)τ, then we denote the number of missed true coefficients as k =|supp((β 0 τ,γ 0 τ)τ)\supp( b βτ,b γτ)τ|≥ 1. Thuswehavek( b βτ,b γτ)τk 0 ≥ s−kandkδk 0 ≤k( b βτ,b γτ)τk 0 +k(β 0 τ,γ 0 τ)τk 0 ≤ 2s. Combining these two results with inequality (6.14) yields Q(( b βτ,b γτ)τ)−Q((β 0 τ,γ 0 τ)τ)≥ (2 −1 c 2 1 kδk 2 −c 2 r 2slogp n )kδk 2 −kλ 2 /2. (6.17) 110 Note that for each j ∈ supp((β 0 τ,γ 0 τ)τ)\supp(( b βτ,b γτ)τ), we have |δ j |≥ b 0 with b 0 being the lowest signal strength inCondition 8. Thus,kδk 2 ≥ √ kb 0 , which together with Condition 8 entails 4 −1 c 2 1 kδk 2 ≥ 4 −1 c 2 1 √ kb 0 ≥ 4 −1 c 2 1 b 0 >c 2 p (2slogp)/n. Thus, it follows from (6.17) that Q(( b βτ,b γτ)τ)−Q((β 0 τ,γ 0 τ)τ)≥ 4 −1 c 2 1 kδk 2 2 −kλ 2 /2≥ 4 −1 c 2 1 kb 2 0 −kλ 2 /2 >0, where the last step is because of the additional assumption λ < b 0 c 1 / √ 2. The above inequality contradicts with the global optimality of ( b βτ,b γτ)τ. Thus, we havesupp((β 0 τ,γ 0 τ)τ)⊂ supp(( b βτ,b γτ)τ). Combiningthiswithk( b βτ,b γτ)τk 0 ≤ s from the first step, we know that supp( b βτ,b γτ)τ = supp((β 0 τ,γ 0 τ)τ). Part 2: Prediction and estimation losses. For this part, we want to find the prediction and estimation losse. The idea is to get the L 2 -estimation loss bound by the global optimality of ( b βτ,b γτ)τ, conditional on the event E 1 = ˜ E ∩E 0 with ˜ E and E 0 defined in Lemma 5. Then using similar techniques in part 1, we could derive bounds for the prediction and estimation losses. DenoteX 0 , ˆ F 0 asthesubmatricesofXand ˆ Fconsistofcolumnswithrespectto supp(β 0 ) and supp(γ 0 ), respectively. Conditional on E 1 , we have kδk 0 ≤ s by the 111 model selection consistency property proved above. Thus, by the Cauchy-Schwarz inequality and definition ofE 0 we have |n −1 ετ(X 0 , ˆ F 0 )δ| ≤ kn −1 ετ(X 0 , ˆ F 0 )k ∞ kδk 1 (6.18) ≤ c ′ 2 r logn n kδk 1 ≤ c ′ 2 r slogn n kδk 2 . Recall (6.11)and (6.13)from part 1, it follows from (6.18)and the model selection consistency property k( b βτ,b γτ)τk 0 = s that Q(( b βτ,b γτ)τ)−Q((β 0 τ,γ 0 τ)τ) =2 −1 kn −1 (X, ˆ F)δk 2 2 −n −1 ˜ ετ(X, ˆ F)δ+(k( b βτ,b γτ)τk 0 −s)λ 2 /2 ≥2 −1 c 2 1 kδk 2 2 −n −1 ˜ ετ(X, ˆ F)δ≥ (2 −1 c 2 1 kδk 2 −c ′ 2 r slogn n )kδk 2 . Since ( b βτ,b γτ)τ is the global optimality, we know that 2 −1 c 2 1 kδk 2 −c ′ 2 r slogn n ≤ 0, which gives the L 2 and L ∞ estimation bound as k( b βτ,b γτ)τ −(β 0 τ,γ 0 τ)τk 2 =kδk 2 ≤ 2c −2 1 c ′ 2 p (slogn)/n, k( b βτ,b γτ)τ −(β 0 τ,γ 0 τ)τk ∞ ≤k( b βτ,b γτ)τ −(β 0 τ,γ 0 τ)τk 2 ≤ 2c −2 1 c ′ 2 p (slogn)/n. 
112 For L q -estimation loss with 1≤ q < 2, applying H¨ older’s inequality gives k( b βτ,b γτ)τ−(β 0 τ,γ 0 τ)τk q = ( X j |δ j | q ) 1/q ≤ ( X j |δ j | 2 ) 1 2 ( X δ j 6=0 1 2 2−q ) 1 q − 1 2 =kδk 2 kδk 1 q − 1 2 0 ≤ 2c −2 1 c ′ 2 s 1 q p (logn)/n. Nextweprovetheboundfororaclepredictionloss. Since( b βτ,b γτ)τ istheglobal minimizer, itfollowsfrom(6.11)andthemodelselection consistency propertythat conditioning on E 1 , (2n) −1/2 k(X, ˆ F)(( b βτ,b γτ)τ −(β 0 τ,γ 0 τ)τ)k 2 ≤ n n −1 ˜ ετ(X, ˆ F)δ−(k( b βτ,b γτ)τk 0 −s)λ 2 /2 o 1/2 ≤ n kn −1 (X 0 , ˆ F 0 )τ˜ εk ∞ kδk 1 o 1/2 ≤ c ′ 2 c −1 1 p 2s(logn)/n, where the last step is because of the L 1 estimation bound proved above. Then for oracle prediction loss, together with (6.28)in the proofof Lemma 5, it follows that (2n) −1/2 k(X, ˆ F)( b βτ,b γτ)τ −(X,F)(β 0 τ,γ 0 τ)τk 2 ≤c ′ 2 c −1 1 p 2s(logn)/n+(2n) −1/2 k(F− ˆ F)γ 0 k 2 ≤ (2 √ 2sc ′ 2 c −1 1 + √ 2c 2 L 1 /4) p (logn)/n. 113 Last we will derive our results for general cases when the L 2 -norms of columns ofX are not of the common scale n 1/2 . We can rewrite the penalized least squares as Q((βτ,γτ)τ) =(2n) −1 ky− ˜ Xβ ∗ − ˆ Fγk 2 2 +kp λ ((β ∗ τ,γτ)τ)k 1 , where ˜ X is the design matrix with the L 2 -norm of each column rescaled to n 1/2 and β ∗ = n −1/2 (β 1 kx 1 k 2 ,··· ,β p kx p k 2 )τ is the corresponding coefficient vector defined in (2.2). Since the same conditions hold for β ∗ when the scale does not equal to 1, we can similarly derive the the- oretical properties for ( b β ∗ τ,b γτ)τ as before. By Condition 6, we know that the model selection consistency of ˆ β ∗ implies the model selection consistency of ˆ β. The bound of prediction loss for ˆ β does not change since ˜ X b β ∗ =X b β. The bounds of estimation losses for ˆ β can be induced as k b β−β 0 k 2 ≤ 2L −1 c −2 1 c ′ 2 p (slogn)/n,k b β−β 0 k q ≤ 2L −1 c −2 1 c ′ 2 s 1 q p (logn)/n, k b β−β 0 k ∞ ≤k b β−β 0 k 2 ≤ 2L −1 c −2 1 c ′ 2 p (slogn)/n. The probability for these inequalities to be held is at least 1 subtracted by P(E c 1 ), θ n,p in Condition 9, andα L,L 1 inCondition 6. Given the boundforP(E c 1 )in 114 Lemma 5, we have the above inequalities of estimation and prediction losses hold simultaneously with probability at least 1−2σ(2/π) 1/2 c −1 2 (logp) −1/2 (p+K)p − c 2 2 8σ 2 L 2 1 −2σ(2/π) 1/2 c ′−1 2 s(logn) −1/2 n − c ′2 2 8σ 2 −(nKg 1 (p)+ng 2 (p)(p−K))−θ n,p −α L,L 1 , which concludes the proof. 3.6.5 Additional Technical Details The following lemma is needed in proving Lemma 3. Lemma 6 (Weyl’s inequality, Horn and Johnson (1990)). IfA, B are m×m real symmetric matrices, then for all k = 1,··· ,m, ϕ k (A)+ϕ m (B) ϕ k+1 (A)+ϕ m−1 (B) . . . ϕ m (A)+ϕ k (B) ≤ ϕ k (A+B)≤ ϕ k (A)+ϕ 1 (B) ϕ k−1 (A)+ϕ 2 (B) . . . ϕ 1 (A)+ϕ k (B) , where ϕ i (·) is a function that takes the ith largest eigenvalue for a certain matrix. 115 Proof of Lemma 3 Themainideaofproving Lemma 3istouseinductiontoshowthattheeigenvalues of the sample covariance matrix divided by their corresponding orders of p will converge in distribution to some random variables with support (0,∞). To ease readability, the proof is divided into three steps. Step 1: Large probability event E. In this step, we will define an eventE and its probability approaches 1 as p increases to infinity. Our later discussion will all be based onthis event. We first define a series ofevents, denoted asE jt , 1≤ j ≤ p, 1≤ t≤ n, as E jt = {|z jt |≤ f 1 (p)}, if j ∈ J 1 ∪J 2 ∪···∪J m ,t= 1,··· ,n; {|z jt |≤ f 2 (p)}, if j ∈ J m+1 ,t= 1,··· ,n. 
By Condition 5, each event E jt has a tail probability bound P(E c jt ) ≤ g 1 (p) if j ∈ J 1 ,J 2 ,··· ,J m or P(E c jt ) ≤ g 2 (p) if j ∈ J m+1 . Set E = ∩ n t=1 ∩ p j=1 E jt as the intersection of all events in the series, then by Condition 5 we have P(E c ) =P(∪ n t=1 ∪ p j=1 E c jt )≤ n X t=1 m X l=1 X j∈J l P(E c jt )+ n X t=1 X j∈J m+1 P(E c jt ) ≤ ng 1 (p) m X l=1 k l +ng 2 (p)k m+1 → 0, as p→∞. So, the probability of event E converges to 1 as p increases to infinity. 116 Step 2: Convergence of eigenvalues withindices in J 1 . This step is also the first step for induction. We will show that conditioned on event E, for any i∈ J 1 , ˆ λ i /p α 1 converges to ϕ i (n −1 Z ′ 1 C 1 Z 1 ) in distribution as p → ∞, where C 1 defined in (6.19) and ϕ i (·) is a continuous function that takes the ith largest eigenvalue of a certain matrix. LetCbeap×pdiagonalmatrixwiththefirstK diagonalcomponentsequaling to c j , 1 ≤ j ≤ K, and the rest being 1. We decompose Z,C and Λ into block matrices corresponding to the index groups J 1 ,J 2 ,··· ,J m+1 such that Z = Z 1 Z 2 . . . Z m+1 ,C = C 1 O ··· O O C 2 ··· O . . . . . . . . . . . . O O ··· C m+1 ,Λ = Λ 1 O ··· O O Λ 2 ··· O . . . . . . . . . . . . O O ··· Λ m+1 . (6.19) Then for the dual matrix S D , we have S D =n −1 Z ′ ΛZ = n −1 m+1 X l=1 Z ′ l Λ l Z l . (6.20) Divided by p α 1 on both sides of (6.20) gives p −α 1 S D =n −1 p −α 1 Z ′ 1 Λ 1 Z 1 +n −1 p −α 1 m X l=2 Z ′ l Λ l Z l +n −1 p −α 1 Z ′ m+1 Λ m+1 Z m+1 . (6.21) 117 For a given matrix A, note that the Frobenius norm is defined as kAk F = {tr(AAτ)} 1/2 . For any 1 ≤ j,k ≤ n, the (j,k)th element in P m l=2 Z ′ l Λ l Z l sat- isfies | m X l=2 k l X t=1 λ (l) t z (l) tj z (l) tk |≤ f 2 1 (p) m X l=2 k l X t=1 λ (l) t → f 2 1 (p) m X l=2 k l X t=1 p α l c (l) t ≤ f 2 1 (p) m X l=2 k l M l p α l as p→∞ by Condition 4, where λ (l) t and c (l) t are the tth diagonal elements in Λ l and C l , z (l) tj and z (l) tk are the (t,j)th and (t,k)th elements in Z l . It gives kn −1 p −α 1 m X l=2 Z ′ l Λ l Z l k F ≤ n −1 p −α 1 nf 2 1 (p) m X l=2 k l M l p α l = f 2 1 (p) m X l=2 k l M l p α l /p α 1 , as p→∞. Similarly we have kn −1 p −α 1 Z ′ m+1 Λ m+1 Z m+1 k F ≤ f 2 2 (p)k m+1 M m+1 /p α 1 . Therefore, by Condition 5, it follows that kn −1 p −α 1 m X l=2 Z ′ l Λ l Z l +n −1 p −α 1 Z ′ m+1 Λ m+1 Z m+1 k F ≤ (f 2 1 (p) m X l=2 k l M l p α l +f 2 2 (p)k m+1 M m+1 )/p α 1 → 0 asp→∞. Sowehaveshownthatthesumofthelasttwotermsof(6.21)converges to zero matrix in Frobenius norm, conditioned on event E. 118 By the same argument, it is immediate that as p→∞, kn −1 p −α 1 Z ′ 1 Λ 1 Z 1 −n −1 Z ′ 1 C 1 Z 1 k F ≤ 1 n o(K −1 p αm−α m−1 )nf 1 (p)k 1 ≤ o(p αm−α m−1 )f 1 (p) ByCondition2part(b),whenwesett =m−1, wecouldhavep αm−α m−1 f 1 (p)→ 0. Thus, kn −1 p −α 1 Z ′ 1 Λ 1 Z 1 −n −1 Z ′ 1 C 1 Z 1 k F → 0 Then take back to (18), finally we get as p→∞, kp −α 1 S D −n −1 Z ′ 1 C 1 Z 1 k F ≤kn −1 p −α 1 Z ′ 1 Λ 1 Z 1 −n −1 Z ′ 1 C 1 Z 1 k F +kn −1 p −α 1 m X l=2 Z ′ l Λ l Z l +n −1 p −α 1 Z ′ m+1 Λ m+1 Z m+1 k F → 0. By further applying Corollary 6.3.8 of Horn and Johnson (1990), we have max 1≤i≤n |ϕ i (p −α 1 S D )−ϕ i (n −1 Z ′ 1 C 1 Z 1 )|≤kp −α 1 S D −n −1 Z ′ 1 C 1 Z 1 k F → 0 asp→∞,whereϕ i (·)isacontinuousfunctionthattakestheithlargesteigenvalue of a certain matrix. It follows that for any i∈ J 1 , ˆ λ i /p α 1 =ϕ i (p −α 1 S D ) converges in distribution to ϕ i (n −1 Z ′ 1 C 1 Z 1 ), a random variable with support (0,∞). 119 Step 3: Convergence of eigenvalues with indices in J 2 ,··· ,J m . 
As the second step of induction, we will show ˆ λ i = O p (p αt ) as p→∞, for any i∈ J t , by the established rates of convergence ˆ λ i = O p (p α l ), for any i∈ J l , l = 1,··· ,t−1, t ≥ 2. The basic idea in this step is to use Weyl’s inequality in Lemma 6 to get bothalowerboundandanupperboundonp −αtˆ λ i thatconvergetoasamerandom variable, ϕ i− P t−1 l=1 k l (n −1 Z ′ t C t Z t ), as p→∞. Divided by p αt on both sides of (6.20) gives p −αt S D =n −1 p −αt t−1 X l=1 Z ′ l Λ l Z l +n −1 p −αt m+1 X l=t Z ′ l Λ l Z l . Applying Weyl’s inequality, we get ϕ i (p −αt S D )≤ ϕ 1+ P t−1 l=1 k l (n −1 p −αt t−1 X l=1 Z ′ l Λ l Z l )+ϕ i− P t−1 l=1 k l (n −1 p −αt m+1 X l=t Z ′ l Λ l Z l ) = ϕ i− P t−1 l=1 k l (n −1 p −αt m+1 X l=t Z ′ l Λ l Z l ), wherethefirsttermbecomes0since n −1 p −αt P t−1 l=1 Z ′ l Λ l Z l hasaranknomorethan P t−1 l=1 k l . Similarly as in Step 2, by Conditions 4 and 5, we have kn −1 p −αt m+1 X l=t Z ′ l Λ l Z l −n −1 Z ′ t C t Z t k F → 0, as p→∞. Thus, we obtain an upper bound for ϕ i (p −αt S D ) as ϕ i− P t−1 l=1 k l (n −1 Z ′ t C t Z t ) for any i∈ J t as p→∞. 120 On the other hand, also by Weyl’s inequality, as p→∞, ϕ i (p −αt S D )≥ ϕ i (n −1 p −αt t X l=1 Z ′ l Λ l Z l )+ϕ n (n −1 p −αt m+1 X l=t+1 Z ′ l Λ l Z l ) →ϕ i (n −1 p −αt t X l=1 Z ′ l Λ l Z l ), where the second term vanishes since n −1 p −αt P m+1 l=t+1 Z ′ l Λ l Z l converges to a zero matrix in Frobenius norm by Conditions 4 and 5, similarly as in Step 2. To deal with the term ϕ i (n −1 p −αt P t l=1 Z ′ l Λ l Z l ), we use Weyl’s inequality once more to get ϕ P t l=1 k l (n −1 p −αt t−1 X l=1 Z ′ l Λ l Z l )≤ ϕ i (n −1 p −αt t X l=1 Z ′ l Λ l Z l ) +ϕ 1−i+ P t l=1 k l (−n −1 p −αt Z ′ t Λ t Z t ). Note that the term on the left hand side is zero since the rank of the matrix n −1 p −αt P t−1 l=1 Z ′ l Λ l Z l is at most P t−1 l=1 k l . Thus we have ϕ i (n −1 p −αt t X l=1 Z ′ t Λ t Z t )≥−ϕ 1−i+ P t l=1 k l (−n −1 p −αt Z ′ t Λ t Z t ) =ϕ kt−(1−i+ P t l=1 k l )+1 (n −1 p −αt Z ′ t Λ t Z t )= ϕ i− P t−1 l=1 k l (n −1 p −αt Z ′ t Λ t Z t ), where we make use of the fact that ϕ i (A) = −ϕ n−i+1 (−A) for any n× n real symmetric matrix A, when 1≤ i≤ n. 121 Therefore, we get a lower bound ϕ i− P t−1 l=1 k l (n −1 p −αt Z ′ t Λ t Z t ) for ϕ i (p −αt S D ). In terms of sample eigenvalues, it shows that as p→∞, ˆ λ i =ϕ i (S D )≥ ϕ i− P t−1 l=1 k l (n −1 Z ′ t Λ t Z t ), (6.22) which will be used later for proving convergence properties of sample prediction vectors. By further noting that as p→∞, kn −1 p −αt Z ′ t Λ t Z t −n −1 Z ′ t C t Z t k F → 0, we have ϕ i− P t−1 l=1 k l (n −1 p −αt Z ′ t Λ t Z t )→ϕ i− P t−1 l=1 k l (n −1 Z ′ t C t Z t ). Thus, this lower bound on ϕ i (p −αt S D ) has a limit exactly the same as the upper bound derived before. It follows that for any i∈ J t , p −αtˆ λ i = ϕ i (p −αt S D )→ϕ i− P t−1 l=1 k l (n −1 Z ′ t C t Z t ) indistribution. Notethatϕ i− P t−1 l=1 k l (n −1 Z ′ t C t Z t )isarandomvariablewithsupport (0,∞). TogetherwithStep2,byinduction,weprovethat ˆ λ i =O p (p α l )asp→∞, for any i∈ J l ,l =1,··· ,m. 122 Proof of Lemma 4 The key ingredient of this proof is to measure the angle between sample eigen- vector and the space spanned by population eigenvectors by the sum of inner products between sample and population eigenvectors. In this way, it suffices to show the convergence of the sum of inner products on the high probability event E by induction. We will finish the proof in three steps. 
Step 1: We will show that the subspace consistency of the sample eigenvectors with indices in the group J l , l = 1,··· ,m, is equivalent to X j∈J l p 2 ji → 1, for any i∈ J l , l =1,··· ,m, (6.23) where p ji = u ′ j ˆ u i is the inner product between population eigenvector u j , jth column of U, and sample eigenvector ˆ u i , ith column of ˆ U. For any i ∈ J l , l = 1,··· ,m, we have Angle(ˆ u i ,span{u j : j ∈ J l })= arccos( ˆ u ′ i ( P j∈J l (u ′ j ˆ u i )u j ) kˆ u i k 2 ·k P j∈J l (u ′ j ˆ u i )u j k 2 ) = arccos( P j∈J l (u ′ j ˆ u i ) 2 1·( P j∈J l (u ′ j ˆ u i ) 2 ) 1/2 )= arccos(( X j∈J l p 2 ji ) 1/2 ). Thus, to show Angle(ˆ u i ,span{u j : j ∈ J 1 }) → 0 is the equivalent to proving P j∈J l p 2 ji → 1 as p → ∞ for any i ∈ J l , l = 1,··· ,m. We will prove this by induction. 123 Step 2: This step aims to prove that conditioned on event E, for any i ∈ J 1 , P j∈J 1 p 2 ji → 1 , as p → ∞. It is also the starting point for proving (6.23) by induction. Set P ={p ij } ij = U ′ ˆ U. Then we have P p j=1 p 2 ji = 1 for any i since P is a unitary matrix. Then it suffices to show P j∈J 2 ∪···∪J m+1 p 2 ji → 0 for any i∈ J 1 , as p→∞. Recall that Z =Λ −1/2 U ′ X ′ , we have a connection between Z and P by n −1 ZZ ′ = n −1 Λ −1/2 U ′ X ′ XUΛ −1/2 =Λ −1/2 P ˆ ΛP ′ Λ −1/2 , which gives λ −1 j p X i=1 ˆ λ i p 2 ji = n −1 z j z ′ j . (6.24) for any 1 ≤ j ≤ p, where z j is the jth row vector of Z. So we have for any 1≤ i≤ p, λ −1 j ˆ λ i p 2 ji ≤ n −1 z j z ′ j . It follows that X j∈J 2 ∪···∪J m+1 p 2 ji ≤ X j∈J 2 ∪···∪J m+1 n −1 z j z ′ j λ j / ˆ λ i = n X t=1 X j∈J 2 ∪···∪J m+1 z 2 jt λ j /(n ˆ λ i ), (6.25) where z jt is the (j,t)th entry of Z. 124 Conditional on event E, by Lemma 3 and Condition 5, we have n X t=1 X j∈J 2 ∪···∪J m+1 z 2 jt λ j /(n ˆ λ i )≤ X j∈J 2 ∪···∪Jm f 2 1 (p)λ j / ˆ λ i + X j∈J m+1 f 2 2 (p)λ j / ˆ λ i =O p ((f 2 1 (p) m X l=2 k l M l p α l +f 2 2 (p)k m+1 M m+1 )/p α 1 )→ 0, as p→∞. (6.26) In this way, we proved the subspace consistency for the sample eigenvectors with indices in J 1 . Step 3: In this step, suppose P j∈J l p 2 ji → 1 as p → ∞, for any i ∈ J l , l = 1,··· ,t−1, 2≤ t≤ m, we need to show P j∈Jt p 2 ji → 1, for any i∈ J t . Similarly as before, it suffices to show X j∈J 1 ∪···∪J t−1 p 2 ji + X j∈J t+1 ∪···∪J m+1 p 2 ji → 0, for any i∈ J t . We will prove both two terms above converge to 0 as p → ∞. For the first term, by induction assumption we have X j∈J 1 ∪···∪J t−1 p 2 ji → 1, for any i∈ J 1 ∪···∪J t−1 , as p→∞, 125 where the summation in J 1 ∪···∪J t−1 will make the convergence even faster than only in the corresponding J l , 1≤ l≤ t−1. It leads to X i∈J 1 ∪···∪J t−1 X j∈J 1 ∪···∪J t−1 p 2 ji → X i∈J 1 ∪···∪J t−1 1 = t−1 X l=1 k l , as p→∞. Thus for any i∈ J t , by applying the above convergence result, we have X j∈J 1 ∪···∪J t−1 p 2 ji ≤ X j∈J 1 ∪···∪J t−1 X i∈Jt∪···∪J m+1 p 2 ji = X j∈J 1 ∪···∪J t−1 p X i=1 p 2 ji − X j∈J 1 ∪···∪J t−1 X i∈J 1 ∪···∪J t−1 p 2 ji → t−1 X l=1 k l − t−1 X l=1 k l =0. For the second term P j∈J t+1 ∪···∪J m+1 p 2 ji , similar to (6.25) and (6.26) inStep 2, conditional on event E, we have X j∈J t+1 ∪···∪J m+1 p 2 ji ≤ X j∈J t+1 ∪···∪Jm f 2 1 (p)λ j / ˆ λ i + X j∈J m+1 f 2 2 (p)λ j / ˆ λ i =O p ((f 2 1 (p) m X l=t+1 k l M l p α l +f 2 2 (p)k m+1 M m+1 )/p αt )→ 0, as p→∞, (6.27) by Condition 5. Based on above proof, we have shown that the subspace consistency for the sample eigenvectors with indices in the group J t is also true. 
Thus combined with step 2, we have proved subspace consistency for the sample eigenvectors with 126 indices in the group J l , l = 1,··· ,m. Further analysis regarding convergence rates has been discussed in Theorem 1. Proof of Lemma 5 To prove lemma 3, we will use Bonferroni’s inequality and Gaussian tail bound to derive the probability. As mentioned in Section 3.2, each f j and ˆ f j have been rescaled to have L 2 -norm n 1/2 . Since ε∼ N(0,σ 2 I n ), for each i and j, given x i we have n −1 x i τε∼ N(0,σ 2 kx i k 2 2 /n 2 ) and n −1 ˆ f j τε∼ N(0,σ 2 /n). We first bound the term n −1 |x i τ(F− ˆ F)γ|. Since kγk ∞ is bounded by T, we have kγk 1 ≤ KT. It follows from (6.10) that k(F− ˆ F)γk 2 ≤kγk 1 · max 1≤j≤K kf j − ˆ f j k 2 ≤ KT · c 2 2KT p logn = c 2 2 p logn, (6.28) which, conditioned on the event in Condition 6 and 3, gives n −1 |x i τ(F− ˆ F)γ|≤ n −1 kx i k 2 k(F− ˆ F)γk 2 ≤ c 2 L 1 2 p (logn)/n≤ c 2 2 p (logp)/n. (6.29) Similarly we have n −1 |f j τ(F− ˆ F)γ|≤ c 2 2 p (logn)/n as ˆ f j have L 2 -norm n 1/2 . 127 RecallthatE istheevent with significantprobability 1−(nKg 1 (p)+ng 2 (p)(p− K)) defined in proof of Lemma 3, we will conduct our analysis on this event. By Bonferroni’s inequality, we get P( ˜ E c ∩E)=P( ˜ E c |E)P(E)≤ P( ˜ E c |E) ≤ p X i=1 P |n −1 x i τ˜ ε| > c 2 p (logp)/n|E + K X j=1 P |n −1 ˆ f j τ˜ ε| >c 2 p (logp)/n|E . By definition of ˜ ε, for the first sum above, we have p X i=1 P |n −1 x i τ˜ ε| > c 2 p (logp)/n |E ≤ p X i=1 P |n −1 x i τε| > c 2 p (logp)/n−n −1 |x i τ(F− ˆ F)γ||E . Replacing n −1 |x i τ(F− ˆ F)γ| with the bound in (6.29) and by Gaussian tail prob- ability bound, we get p X i=1 P |n −1 x i τ˜ ε| > c 2 p (logp)/n |E ≤ p X i=1 P |n −1 x i τε| > c 2 2 p (logp)/n ≤ p X j=1 4σ c 2 √ logp 1 √ 2π e − c 2 2 logp 8σ 2 L 2 1 ≤ 2 √ 2σ c 2 √ πlogp p 1− c 2 2 8σ 2 L 2 1 . Similarly we have K X j=1 P |n −1 ˆ f j τ˜ ε| > c 2 p (logp)/n|E ≤ 2 √ 2σK c 2 √ πlogp p − c 2 2 8σ 2 . 128 Since K is no larger than p and L 1 ≥ 1, the bounds for the two sums above give the bound for P( ˜ E c ∩E) as P( ˜ E c ∩E)≤ 2 √ 2σ c 2 √ πlogp (p+K)p − c 2 2 8σ 2 L 2 1 , which tends to 0 as n→∞ for c 2 ≥ √ 10σL 1 . The bound for P(E c 0 ∩E) can be derived similarly as P(E c 0 ∩E)≤ 2 √ 2σs c ′ 2 √ πlogn n − c ′2 2 8σ 2 . Thus, for the intersection event E 1 = ˜ E∩E 0 , it follows that P(E c 1 )≤ P(E c 1 ∩E)+P(E c )≤ P( ˜ E c ∩E)+P(E c 0 ∩E)+P(E c ) ≤ 2 √ 2σ c 2 √ πlogp (p+K)p − c 2 2 8σ 2 L 2 1 + 2 √ 2σs c ′ 2 √ πlogn n − c ′2 2 8σ 2 +(nKg 1 (p)+ng 2 (p)(p−K)), which completes the proof of Lemma 5. 129 Bibliography [1] Anderson, S., Auquier, A., Hauck, W.W., Oakes, D., Vandaele, W., Weisberg H.I.,Bryk,A.S.andKleinman,J.(1980).Statistical MethodsforComparative Studies. New York: Wiley. [2] Antoniadis, A. (1996). Smoothing noisy data with tapered coiflets series. Scand. J. Statist. 23, 313–330. [3] Antoniadis, A. and Fan, J. (2001). Regularization of wavelet approximations (with discussion). J. Amer. Statist. Assoc. 96, 939–967. [4] Austin, P. C. and Brunner, L. J. (2004). Inflation of the type I error rate when a continuous confounding variable is categorized in logistic regression analyses. Statist. Med. 23, 1159–1178. [5] Barron, A., Birge, L. and Massart, P. (1999).Risk bounds for model selection via penalization. Probab. Theory Related Fields 113, 301–413. [6] Bickel, P. J., Ritov, Y. and Tsybakov, A. (2009). Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. 37, 1705–1732. [7] Bunea,F.,Tsybakov, A.andWegkamp,M.(2007).Sparsityoracleinequalities for the Lasso. 
Abstract
This dissertation addresses two challenging problems concerning feature selection in the high-dimensional setting, where the number of covariates can be larger than the sample size. Notorious difficulties in high-dimensional model selection include high collinearity among the predictors, spurious variables, and noise accumulation. The problem becomes even more challenging when latent features exist in the hidden structure of the original model.

In the first part of this dissertation, we consider the problem of high-dimensional sparse modeling via regularization, which provides a powerful tool for analyzing large-scale data sets and obtaining meaningful, interpretable models. The use of nonconvex penalty functions shows advantages in selecting important features in high dimensions, but the global optimality of such methods still demands more understanding. In this part, we consider sparse regression with the hard-thresholding penalty, which we show to give rise to thresholded regression. This approach is motivated by its close connection with L₀-regularization, which has appealing sampling properties but can be unrealistic to implement in practice, and by its computational advantage. Under mild regularity conditions allowing possibly exponentially growing dimensionality, we establish oracle inequalities for the resulting regularized estimator, as the global minimizer, under various prediction and variable selection losses, as well as oracle risk inequalities for the hard-thresholded estimator followed by a further L₂-regularization. The risk properties exhibit interesting shrinkage effects under both estimation and prediction losses. We identify the optimal choice of the ridge parameter, which is shown to be simultaneously advantageous for both the L₂-loss and the prediction loss. These new results and phenomena are evidenced by simulation and real data examples.

In the second part, we focus on high-dimensional models with latent features. As a powerful tool for producing interpretable models, sparse modeling has gained popularity for analyzing large-scale data sets. Most existing methods assume implicitly that all features in a model are observable. Yet latent confounding factors may exist in the hidden structure of the original model in many applications. Omitting such confounding factors can cause serious issues in both prediction and variable selection, even in the low-dimensional setting, and the impacts of latent features are less well understood in high dimensions. In this part, we suggest high-dimensional latent variable thresholded regression, which incorporates confounding factors as the principal components of the random design matrix. We estimate the population principal components by the sample ones and establish asymptotic properties of the estimated prediction vectors in high dimensions for a wide class of distributions. With the aid of these properties, we prove that high-dimensional thresholded regression with estimated confounding factors can still enjoy model selection consistency and oracle inequalities under various prediction and variable selection losses, for both observable covariates and latent confounding factors. The new method and results are evidenced by simulation and real data examples.
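As a toy illustration of the two ingredients summarized above for the first part, namely hard-thresholding, which sets small coefficients exactly to zero, and a further ridge (L₂) refit on the selected support, the sketch below applies a simple pilot-threshold-refit scheme to simulated data. It is only a caricature of the general idea: the dissertation studies the global minimizer of a hard-thresholding penalized least-squares problem, which this pilot-based scheme does not implement, and the pilot estimator, threshold tau, and ridge parameter lam are illustrative choices.

```python
import numpy as np

# Toy sketch: hard-threshold a marginal pilot estimate, then ridge-refit on the
# selected support. This illustrates the hard-thresholding and ridge refitting
# ideas only; it is not the estimator analyzed in the dissertation.

def hard_threshold(v, tau):
    """Hard-thresholding operator: zero out entries with |v_j| <= tau."""
    out = v.copy()
    out[np.abs(out) <= tau] = 0.0
    return out

def threshold_then_ridge(X, y, tau, lam):
    """Threshold a marginal pilot estimate, then ridge-refit on its support."""
    n, p = X.shape
    pilot = X.T @ y / n                                    # simple marginal pilot
    support = np.flatnonzero(hard_threshold(pilot, tau))
    beta = np.zeros(p)
    if support.size:
        Xs = X[:, support]
        beta[support] = np.linalg.solve(Xs.T @ Xs + lam * np.eye(support.size), Xs.T @ y)
    return beta

# Simulated data with n < p and a 3-sparse true coefficient vector.
rng = np.random.default_rng(3)
n, p = 200, 400
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [4.0, -3.0, 3.0]
y = X @ beta_true + rng.standard_normal(n)

beta_hat = threshold_then_ridge(X, y, tau=1.5, lam=1.0)
print(np.flatnonzero(beta_hat))    # selected support, typically {0, 1, 2} here
print(beta_hat[:3])                # ridge-refitted coefficients on the support
```

In this toy setting the thresholding step typically retains only the three true signal coordinates, and the ridge refit then re-estimates their coefficients with a small amount of shrinkage.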
The second part of the dissertation builds on the strengths of the high-dimensional thresholded regression established in the first part, sharing the appealing hard-thresholding property that sets small coefficients to exact zeros and thus yields a sparse, highly interpretable model. The further analysis of the latent model relies on an accurate approximation of the confounding part. Both parts exploit the power of high-dimensional sparse modeling via regularization.
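As a rough numerical companion to the approximation mentioned above, the sketch below checks that when latent factors strongly drive the covariates, the leading sample principal components of the design matrix span nearly the same space as the true confounding factors. The toy factor model, the number of factors K, and all constants are illustrative assumptions; this is not the dissertation's estimation procedure.

```python
import numpy as np
from scipy.linalg import subspace_angles

# Toy factor model: the observed covariates are driven by K latent confounding
# factors plus noise. The top-K sample principal components of the design then
# approximate the span of the true factors, which is the kind of approximation
# the latent-model analysis builds on. All constants are illustrative.
rng = np.random.default_rng(4)
n, p, K = 200, 1000, 2

F = rng.standard_normal((n, K))                 # latent confounding factors
B = rng.standard_normal((K, p))                 # factor loadings on the covariates
X = F @ B + 0.5 * rng.standard_normal((n, p))   # observed design driven by the factors

# Estimated factors: the top-K left singular vectors of the centered design.
U, s, _ = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
F_hat = U[:, :K]

# Largest principal angle between span(F) and span(F_hat); it is close to zero,
# so the estimated confounders closely approximate the true ones.
print(np.max(subspace_angles(F, F_hat)))
```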
Conceptually similar
Large-scale inference in multiple Gaussian graphical models
Prediction and feature selection with regularized regression in integrative genomics
Reproducible large-scale inference in high-dimensional nonlinear models
Robust feature selection with penalized regression in imbalanced high dimensional data
Shrinkage methods for big and complex data analysis
High-dimensional feature selection and its applications
Generalized linear discriminant analysis for high-dimensional genomic data with external information
Model selection principles and false discovery rate control
High-dimensional regression for gene-environment interactions
Statistical insights into deep learning and flexible causal inference
Sparseness in functional data analysis
Essays on factor in high-dimensional regression settings
Essays on econometrics analysis of panel data models
Hierarchical regularized regression for incorporation of external data in high-dimensional models
Penalized portfolio optimization
Scalable multivariate time series analysis
Mixed-integer nonlinear programming with binary variables
Detecting joint interactions between sets of variables in the context of studies with a dichotomous phenotype, with applications to asthma susceptibility involving epigenetics and epistasis
Machine-learning approaches for modeling of complex materials and media
Asset Metadata
Creator: Zheng, Zemin (author)
Core Title: Feature selection in high-dimensional modeling with thresholded regression
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Applied Mathematics
Publication Date: 04/08/2015
Defense Date: 03/23/2015
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: global optimality, hard-thresholding, high dimensionality, latent confounding factors, model selection, OAI-PMH Harvest, prediction and variable selection, principal components, shrinkage effect, thresholded regression
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Fan, Yingying (committee chair), Goldstein, Larry (committee chair), Lv, Jinchi (committee chair), Baxendale, Peter H. (committee member), Zhang, Jianfeng (committee member)
Creator Email: zeminzhe@gmail.com, zeminzhe@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-544029
Unique identifier: UC11298514
Identifier: etd-ZhengZemin-3260.pdf (filename), usctheses-c3-544029 (legacy record id)
Legacy Identifier: etd-ZhengZemin-3260.pdf
Dmrecord: 544029
Document Type: Dissertation
Rights: Zheng, Zemin
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA