Reproducible Large-Scale Inference in High-Dimensional Nonlinear Models

by

Emre Demirkaya

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Applied Mathematics)

August 2019

Copyright 2019 Emre Demirkaya

Dedication

To my family

Acknowledgements

I have always felt lucky for being surrounded by so many wonderful people, whom I would like to thank not just for contributing to this thesis but also for being part of my life.

First and foremost, I want to thank my advisors, Prof. Yingying Fan and Prof. Jinchi Lv. I thank them for including me in so many wonderful projects. Without them, this thesis would not have been possible at all. Their high energy, motivation, commitment, and enthusiasm will always be an inspiration for my research and life. Furthermore, they created an amazing group in which I was grateful to meet Yinfei Kong, Zemin Zheng, Daoji Li, Mahrad Sharifvaghefi, and Yoshimasa Uematsu. I would also like to extend my gratitude to Prof. Gaorong Li, Prof. Yang Feng, and Pallavi Basu for their collaboration and contribution to the papers presented in Chapters 3 and 4. I would like to thank Prof. Larry Goldstein for supervising me, Prof. Jay Bartroff for being my defense committee member, and Prof. Alp Eden for helping me find my academic path.

I'm thankful for having excellent friends: Melike Sırlancı, Güher Çamlıyurt, Can Ozan Oğuz, Ezgi Kantarcı, Enes Özel, Chinmoy Bhattacharjee, and Murat Güner. Each one of them is one of a kind. I appreciate the time, wisdom, and laughter they shared with me.

I especially thank my parents, Emine Demirkaya, Reşat Demirkaya, Hülya Kaba, and Veysel Kaba, for endlessly supporting me and for bearing with being thousands of miles away from their child just so that I could follow my dreams. I'm thankful to have Betül as my sister. As my favorite Peanuts character, Linus, said, "Happiness is having a sister." Being an elder sister, she is also a half-mentor to me.

Finally, I would like to thank my beloved wife, Duygu Kaba, with whom I embarked on this journey in the first place. I am lucky to have a wife with whom I can talk about everything, including math. I especially thank her for believing in me and listening to me so intently. She has been my greatest support through thick and thin. I'm very grateful for the joy she brought to my life.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Nonuniformity of P-values Can Occur Early in Diverging Dimensions
  2.1 Introduction
  2.2 Characterizations of P-values in Low Dimensions
    2.2.1 Maximum likelihood estimation
    2.2.2 Conventional p-values in low dimensions under fixed design
    2.2.3 Conventional p-values in low dimensions under random design
  2.3 Nonuniformity of GLM P-values in Diverging Dimensions
    2.3.1 The wild side of the nonlinear regime
    2.3.2 Main results
  2.4 Numerical Studies
    2.4.1 Simulation examples
    2.4.2 Testing results
  2.5 Discussions
  2.6 Proofs of Main Results
    2.6.1 Proof of Theorem 2.1
    2.6.2 Proof of Theorem 2.2
    2.6.3 Proof of Theorem 2.3
    2.6.4 Proof of Theorem 2.4
Chapter 3: RANK: Large-Scale Inference with Graphical Nonlinear Knockoffs
  3.1 Introduction
  3.2 Power analysis for oracle model-X knockoffs
    3.2.1 Review of the model-X knockoffs framework
    3.2.2 Power analysis in linear models
  3.3 Robustness of graphical nonlinear knockoffs
    3.3.1 Modified model-X knockoffs
    3.3.2 Robustness of FDR control for graphical nonlinear knockoffs
    3.3.3 Robustness of power in linear models
  3.4 Simulation studies
    3.4.1 Model setups and simulation settings
    3.4.2 Estimation procedures
    3.4.3 Simulation results
  3.5 Real data analysis
  3.6 Discussions
  3.7 Proofs of main results
    3.7.1 Proofs of Lemma 3.1 and Theorem 3.1
    3.7.2 Proof of Proposition 3.1
    3.7.3 Lemma 3.2 and its proof
    3.7.4 Proof of Theorem 3.2
    3.7.5 Proof of Theorem 3.3
    3.7.6 Proposition 3.2 and its proof
  3.8 Additional technical details
    3.8.1 Lemma 3.3 and its proof
    3.8.2 Lemma 3.4 and its proof
    3.8.3 Lemma 3.5 and its proof
    3.8.4 Lemma 3.6 and its proof
    3.8.5 Lemma 3.7 and its proof
    3.8.6 Lemma 3.8 and its proof
Chapter 4: Large-Scale Model Selection with Misspecification
  4.1 Introduction
  4.2 Large-scale model selection with misspecification
    4.2.1 Model misspecification
    4.2.2 High-dimensional generalized BIC with prior probability
  4.3 Asymptotic properties of HGBIC_p
    4.3.1 Technical conditions
    4.3.2 Asymptotic expansion of the Bayesian principle of model selection
    4.3.3 Consistency of covariance contrast matrix estimation
    4.3.4 Model selection consistency of HGBIC_p
  4.4 Numerical studies
    4.4.1 Multiple index model
    4.4.2 Logistic regression with interaction
  4.5 Discussions
  4.6 Proofs of main results
    4.6.1 Proof of Theorem 4.1
    4.6.2 Proof of Theorem 4.2
    4.6.3 Proof of Theorem 4.3
  4.7 Technical lemmas
    4.7.1 Lemma 4.1 and its proof
    4.7.2 Lemma 4.2 and its proof
    4.7.3 Lemma 4.3 and its proof
    4.7.4 Lemma 4.4 and its proof
    4.7.5 Lemma 4.5 and its proof
    4.7.6 Lemma 4.6 and its proof
  4.8 Additional technical details
Reference List

List of Tables

2.1 Means and standard deviations (SD) for estimated probabilities of making a type I error in simulation example 1 with $\alpha_0$ the growth rate of dimensionality $p = [n^{\alpha_0}]$. Two significance levels, $a = 0.05$ and 0.1, are considered.
3.1 Simulation results for linear model (3.20) in simulation example 1 with $A = 1.5$ in Section 3.4.1
3.2 Simulation results for linear model (3.20) in simulation example 1 with $A = 3.5$ in Section 3.4.1
3.3 Simulation results for partially linear model (3.21) in simulation example 2 in Section 3.4.1
3.4 Simulation results for single-index model (3.22) in simulation example 3 in Section 3.4.1
3.5 Simulation results for additive model (3.23) in simulation example 4 in Section 3.4.1
3.6 Selected genes and their associated pathways for the real data analysis in Section 3.5
4.1 Average results over 100 repetitions for Example 4.4.1 with all entries multiplied by 100
4.2 Average false positives over 100 repetitions for Example 4.4.1
4.3 Average results over 100 repetitions for Example 4.4.2 with all entries multiplied by 100
4.4 Average false positives over 100 repetitions for Example 4.4.2

List of Figures

2.1 Results of KS and AD tests for testing the uniformity of GLM p-values in simulation example 1 for the diverging-dimensional logistic regression model with uniform orthonormal design under the global null. The vertical axis represents the p-value from the KS and AD tests, and the horizontal axis stands for the growth rate $\alpha_0$ of dimensionality $p = [n^{\alpha_0}]$.
2.2 Results of KS and AD tests for testing the uniformity of GLM p-values in simulation example 2 for the diverging-dimensional logistic regression model with correlated Gaussian design under the global null for varying correlation level $\rho$. The vertical axis represents the p-value from the KS and AD tests, and the horizontal axis stands for the growth rate $\alpha_0$ of dimensionality $p = [n^{\alpha_0}]$.
2.3 Results of the KS test for testing the uniformity of GLM p-values in simulation example 3 for the diverging-dimensional logistic regression model with uncorrelated Gaussian design under the global null for varying sparsity $s$. The vertical axis represents the p-value from the KS test, and the horizontal axis stands for the growth rate $\alpha_0$ of dimensionality $p = [n^{\alpha_0}]$.
2.4 Histograms of null p-values in simulation example 1 from the first simulation repetition for different growth rates $\alpha_0$ of dimensionality $p = [n^{\alpha_0}]$
4.1 The average false discovery proportion (left panel) and the true positive rate (right panel) as the factor varies for Example 4.4.1
4.2 The average false discovery proportion (left panel) and the true positive rate (right panel) as the factor varies for Example 4.4.2

Abstract

Feature selection, reproducibility, and model selection are of fundamental importance in contemporary statistics. Feature selection methods are required in a wide range of applications in order to evaluate the significance of covariates. Meanwhile, reproducibility of the selected features is needed to claim that findings are meaningful and interpretable. Finally, model selection is employed for pinpointing the best set of covariates among a sequence of candidate models produced by feature selection methods.

We show that p-values, a common tool for feature selection, behave differently in nonlinear models and that p-values in nonlinear models can break down earlier than their linear counterparts. Next, we provide important theoretical foundations for model-X knockoffs, a recent state-of-the-art method for reproducibility. We establish power and robustness results for model-X knockoffs. Finally, we tackle the large-scale model selection problem for misspecified models. We propose a novel information criterion that is tailored to both model misspecification and high dimensionality.

Chapter 1

Introduction

Rapid advances of modern technology in the last couple of decades have brought us the era of big data.
In many applications from different disciplines such as the social sciences, health sciences, and engineering [61, 41, 16], big data of unprecedented size is produced and is waiting to be analyzed. In most of these applications, the number of features (a.k.a. covariates), which is referred to as the dimensionality $p$, is often comparable to or even larger than the number of samples $n$. A crucial assumption to deal with high dimensionality is sparsity. When confronted with a sea of variables, we often want only a bucket of variables that are sufficient for our purposes. Therefore, under the sparsity assumption, the essential task is to decide which features to keep and which ones to discard. For this reason, we need to characterize the significance of each variable. Finding a set of explanatory variables based on some feature importance measures is called feature selection.

One of the most common tools for feature selection is p-values, which are produced and employed by both practitioners and researchers. Thanks to the classical large-sample asymptotic theory, the conventional p-values in the Gaussian linear model have well-established theory even when the dimensionality is a non-vanishing fraction of the sample size. However, p-values can break down when the design matrix becomes singular in higher dimensions or when the error distribution deviates from Gaussianity. In Chapter 2, I present [46], joint work with Yingying Fan and Jinchi Lv, where we establish that the breakdown of p-values can occur early in nonlinear models.

Reproducibility is an emerging complication of feature selection in high-dimensional statistics. When feature selection methods are used in multiple testing, they eventually produce false discoveries which cannot be replicated in subsequent studies. Many false discoveries may harm interpretability by deteriorating the model. Furthermore, the early breakdown of p-values, as we discuss in Chapter 2, prohibits their use in high-dimensional nonlinear models. To resolve these problems, we need auxiliary methods to keep the number or the fraction of false discoveries at the target level. One contemporary reproducibility method bypassing the use of p-values is the knockoffs framework introduced in [8]. The model-X knockoffs procedure in [18] extends the original knockoffs framework to the high-dimensional setting where the dependence between the response and the covariates can be arbitrary. However, the model-X knockoffs procedure places the burden on knowledge of the covariate distribution. In Chapter 3, I present [45], which is joint work with Yingying Fan, Gaorong Li, and Jinchi Lv. In [45], we study several theoretical aspects of the model-X knockoffs procedure. We provide theoretical foundations on the power and robustness of model-X knockoffs. We establish that under mild regularity conditions, the power of the oracle knockoffs procedure with known covariate distribution in high-dimensional linear models is asymptotically one as the sample size goes to infinity. When moving away from the ideal case, we suggest the modified model-X knockoffs method called graphical nonlinear knockoffs (RANK) to accommodate the unknown covariate distribution. We provide theoretical justifications of the robustness of our modified procedure by showing that the false discovery rate (FDR) is asymptotically controlled at the target level and the power is asymptotically one with the estimated covariate distribution.

Finally, feature selection methods often yield multiple sets of selected variables. This can happen mainly for two reasons. First, practitioners can use different feature selection methods, which may produce different sets of selected features. Second, most feature selection methods are based on a hyperparameter or tuning parameter. For example, in the Lasso, one can change the parameter controlling the penalty term and obtain different results. In any case, a set of competing candidate models is generated and we want to compare these models and pick the best one. Model selection has a long history and an extensive literature starting from [2]. However, most existing work assumes implicitly that the models are correctly specified or have fixed dimensionality. Yet both features of model misspecification and high dimensionality are prevalent in practice. In Chapter 4, I present [26], joint work with Yang Feng, Pallavi Basu, and Jinchi Lv, in which we tackle the model selection problem in high-dimensional misspecified models. We exploit the framework of model selection principles in misspecified models, which originated in [79], and investigate the asymptotic expansion of the Bayesian principle of model selection in the setting of high-dimensional misspecified models. With a natural choice of prior probabilities that encourages interpretability and incorporates the Kullback-Leibler divergence, we suggest the high-dimensional generalized Bayesian information criterion with prior probability (HGBIC_p) for large-scale model selection with misspecification. Our new information criterion characterizes the impacts of both model misspecification and high dimensionality on model selection. We further establish the consistency of covariance contrast matrix estimation and the model selection consistency of HGBIC_p in ultra-high dimensions under some mild regularity conditions.

Yet another important problem we need to address in statistics is interaction detection. Feature selection discussed in this thesis is interested in detecting the main effects of individual signals. However, in practice, different features affect the outcomes in complicated ways, often interacting with each other in nonlinear ways. Straightforward application of a feature selection method by adding all possible interactions can become computationally infeasible since the number of interactions grows very fast. For instance, for 1000 features, there are 499,500 pairwise interactions and approximately $10^{301}$ total interactions. Simply put, the number of pairwise interactions grows quadratically and that of all possible interactions grows exponentially. The sheer growth of the number of interactions requires us to find new methods [49, 48, 27]. For instance, [27] borrows techniques from [19] and applies random projection and bagging ideas to screen the important pairs of interactions. However, the endeavor of finding the important interacting variables is beyond the scope of this thesis and the detailed discussions are omitted.

Chapter 2

Nonuniformity of P-values Can Occur Early in Diverging Dimensions

2.1 Introduction

As p-values are routinely produced by algorithms, practitioners should perhaps be aware that those p-values are usually based on classical large-sample asymptotic theory. For example, marginal p-values have been employed frequently in large-scale applications when the number of covariates $p$ greatly exceeds the number of observations $n$. Those p-values are based on marginal regression models linking each individual covariate to the response separately.
In these marginal regression models, the ratio of sample size to model dimensionality is equal to $n$, which results in justified p-values as the sample size increases. Yet due to the correlations among the covariates, we often would like to investigate the joint significance of a covariate in a regression model conditional on all other covariates, which is the main focus of this chapter.

A natural question is whether conventional joint p-values continue to be valid in the regime of diverging dimensionality $p$. It is well known that fitting the linear regression model with $p > n$ using ordinary least squares can lead to a perfect fit with a zero residual vector, which renders the p-values undefined. When $p \le n$ and the design matrix is nonsingular, the p-values in the linear regression model are well defined and valid thanks to the exact normality of the least-squares estimator when the random error is Gaussian and the design matrix is deterministic. When the error is non-Gaussian, [66] showed that the least-squares estimator can still be asymptotically normal under the assumption of $p = o(n)$, but is generally no longer normal when $p = o(n)$ fails to hold, making the conventional p-values inaccurate in higher dimensions. For the asymptotic properties of M-estimators for robust regression, see, for example, [66, 85, 86] for the case of diverging dimensionality $p = o(n)$ and [69, 10] for the scenario when the dimensionality $p$ grows proportionally to the sample size $n$.

We have seen that the conventional p-values for the least-squares estimator in the linear regression model can start behaving wildly and become invalid when the dimensionality $p$ is of the same order as the sample size $n$ and the error distribution deviates from Gaussianity. A natural question is whether a similar phenomenon holds for the conventional p-values for the maximum likelihood estimator (MLE) in the setting of diverging-dimensional nonlinear models. More specifically, we aim to answer the question of whether $p \sim n$ is still the breakdown point of the conventional p-values when we move away from the regime of the linear regression model, where $\sim$ stands for asymptotic order.

To simplify the technical presentation, in this chapter we adopt the generalized linear model (GLM) as a specific family of nonlinear models [80]. The GLM with a canonical link assumes that the conditional distribution of $y$ given $X$ belongs to the canonical exponential family, having the following density function with respect to some fixed measure
\[
f_n(y; X, \beta) \equiv \prod_{i=1}^n f_0(y_i; \theta_i, \phi) = \prod_{i=1}^n \left\{ c(y_i) \exp\left[ \frac{y_i \theta_i - b(\theta_i)}{\phi} \right] \right\}, \tag{2.1}
\]
where $X = (x_1, \ldots, x_p)$ is an $n \times p$ design matrix with $x_j = (x_{1j}, \ldots, x_{nj})^T$, $j = 1, \ldots, p$, $y = (y_1, \ldots, y_n)^T$ is an $n$-dimensional response vector, $\beta = (\beta_1, \ldots, \beta_p)^T$ is a $p$-dimensional regression coefficient vector, $\{f_0(y; \theta, \phi): \theta \in \mathbb{R}\}$ is a family of distributions in the regular exponential family with dispersion parameter $\phi \in (0, \infty)$, and $\theta = (\theta_1, \ldots, \theta_n)^T = X\beta$. As is common in GLMs, the function $b(\theta)$ in (2.1) is implicitly assumed to be twice continuously differentiable with $b''(\theta)$ always positive. Popularly used GLMs include the linear regression model, logistic regression model, and Poisson regression model for continuous, binary, and count responses, respectively.

The key innovation of this chapter is the formal justification that the conventional p-values in nonlinear models of GLMs can become invalid in diverging dimensions and that such a breakdown can occur much earlier than in linear models, which spells out a fundamental difference between linear and nonlinear models.

To begin investigating p-values in diverging-dimensional GLMs, let us gain some insights into this problem by looking at the specific case of logistic regression. Recently, [103] established an interesting phase transition phenomenon of perfect hyperplane separation for high-dimensional classification with an elegant probabilistic argument. Suppose we are given a random design matrix $X \sim N(0, I_n \otimes I_p)$ and arbitrary binary $y_i$'s that are not all the same. The phase transition of perfect hyperplane separation happens at the point $p/n = 1/2$. With such a separating hyperplane, there exist some $\beta^* \in \mathbb{R}^p$ and $t \in \mathbb{R}$ such that $x_i^T \beta^* > t$ for all cases $y_i = 1$ and $x_i^T \beta^* < t$ for all controls $y_i = 0$. Let us fit a logistic regression model with an intercept. It is easy to show that, multiplying the vector $(-t, (\beta^*)^T)^T$ by a diverging sequence of positive numbers $c$, we can obtain a sequence of logistic regression fits with the fitted response vector approaching $y = (y_1, \ldots, y_n)^T$ as $c \to \infty$. As a consequence, the MLE algorithm can return a rather wild estimate that is close to infinity in norm when the algorithm is set to stop. Clearly, in such a case the p-value of the MLE is no longer justified and meaningful. The results in [103] have two important implications. First, such results reveal that, unlike in linear models, p-values in nonlinear models can break down and behave wildly when $p/n$ is of order $1/2$; see [69, 10] and the discussions below. Second, these results motivate us to characterize the breakdown point of p-values in nonlinear GLMs with $p \sim n^{\alpha_0}$ in the regime of $\alpha_0 \in [0, 1/2)$. In fact, our results show that the breakdown point can be even much earlier than $n/2$.
It is worth mentioning that our work is different in goals from the limited but growing literature on p-values for high-dimensional nonlinear models, and makes novel contributions to such a problem. The key distinction is that existing work has focused primarily on identifying the scenarios in which conventional p-values or their modifications continue to be valid under some sparsity assumption limiting the growth of the intrinsic dimensionality. For example, [43] established the oracle property including the asymptotic normality for nonconcave penalized likelihood estimators in the scenario of $p = o(n^{1/5})$, while [42] extended their results to the GLM setting of non-polynomial (NP) dimensionality. In the latter work, the p-values were proved to be valid under the assumption that the intrinsic dimensionality $s = o(n^{1/3})$. More recent work on high-dimensional inference in nonlinear model settings includes [107, 6] under sparsity assumptions. In addition, two tests were introduced in [55] for high-dimensional GLMs without or with nuisance regression parameters, but the p-values were obtained for testing the global hypothesis for a given set of covariates, which is different from our goal of testing the significance of individual covariates simultaneously. [87] studied the asymptotic behavior of the MLE for exponential families under the classical i.i.d. non-regression setting, but with diverging dimensionality. In contrast, our work under the GLM assumes the regression setting in which the design matrix $X$ plays an important role in the asymptotic behavior of the MLE $\hat\beta$. The validity of the asymptotic normality of the MLE was established in [87] under the condition of $p = o(n^{1/2})$, but the precise breakdown point in diverging dimensionality was not investigated therein.

Another line of work is focused on generating asymptotically valid p-values when $p/n$ converges to a fixed positive constant. For instance, [69] and [10] considered M-estimators in the linear model and showed that their variance is greater than classically predicted. Based on this result, it is possible to produce p-values by making adjustments for the inflated variance in high dimensions. Recently, [103] showed that a similar adjustment is possible for the likelihood ratio test (LRT) for logistic regression. Our work differs from this line of work in two important aspects. First, our focus is on the classical p-values and their validity. Second, their results concern dimensionality that is comparable to the sample size, while we aim to analyze the problem for a lower range of dimensionality and pinpoint the exact breakdown point of p-values.

The rest of the chapter is organized as follows. Section 2.2 provides characterizations of p-values in low dimensions. We establish the nonuniformity of GLM p-values in diverging dimensions in Section 2.3. Section 2.4 presents several simulation examples verifying the theoretical phenomenon. We discuss some implications of our results in Section 2.5. The proofs of all the results are relegated to the last section.

2.2 Characterizations of P-values in Low Dimensions

To pinpoint the breakdown point of GLM p-values in diverging dimensions, we start with characterizing p-values in low dimensions. In contrast to existing work on the asymptotic distribution of the penalized MLE, our results in this section focus on the asymptotic normality of the unpenalized MLE in diverging-dimensional GLMs, which justifies the validity of conventional p-values. Although Theorems 2.1 and 2.2, to be presented in Sections 2.2.2 and 2.2.3, are in the conventional sense of relatively small $p$, to the best of our knowledge such results were not previously available in the literature in terms of the maximum range of dimensionality $p$ without any sparsity assumption.

2.2.1 Maximum likelihood estimation

For the GLM (2.1), the log-likelihood $\log f_n(y; X, \beta)$ of the sample is given, up to an affine transformation, by
\[
\ell_n(\beta) = n^{-1} \left[ y^T X\beta - 1^T b(X\beta) \right], \tag{2.2}
\]
where $b(\theta) = (b(\theta_1), \ldots, b(\theta_n))^T$ for $\theta = (\theta_1, \ldots, \theta_n)^T \in \mathbb{R}^n$. Denote by $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_p)^T \in \mathbb{R}^p$ the MLE, which is the maximizer of (2.2), and
\[
\mu(\theta) = (b'(\theta_1), \ldots, b'(\theta_n))^T \quad \text{and} \quad \Sigma(\theta) = \mathrm{diag}\{b''(\theta_1), \ldots, b''(\theta_n)\}. \tag{2.3}
\]
A well-known fact is that the $n$-dimensional response vector $y$ in GLM (2.1) has mean vector $\mu(\theta)$ and covariance matrix $\phi\Sigma(\theta)$. Clearly, the MLE $\hat\beta$ is given by the unique solution to the score equation
\[
X^T \left[ y - \mu(X\beta) \right] = 0 \tag{2.4}
\]
when the design matrix $X$ is of full column rank $p$.

It is worth mentioning that for the linear model, the score equation (2.4) becomes the well-known normal equation $X^T y = X^T X\beta$, which admits a closed-form solution. On the other hand, equation (2.4) does not admit a closed-form solution in general nonlinear models. This fact, due to the nonlinearity of the mean function $\mu(\cdot)$, causes the key difference between the linear and nonlinear models. In future presentations, we will occasionally use the term nonlinear GLMs to exclude the linear model from the family of GLMs when necessary.

We will present in the next two sections some sufficient conditions under which the asymptotic normality of the MLE holds. In particular, Section 2.2.2 concerns the case of fixed design and Section 2.2.3 deals with the case of random design. In addition, Section 2.2.2 allows for a general regression coefficient vector $\beta_0$ and the results extend some existing ones in the literature, while Section 2.2.3 assumes the global null $\beta_0 = 0$ and Gaussian random design, which enable us to pinpoint the exact breakdown point of the asymptotic normality for the MLE.
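Since the score equation (2.4) has no closed form for nonlinear GLMs, it is in practice solved iteratively. The following is a minimal illustrative sketch (not from the thesis) of a Newton-Raphson solver for the logistic case, where $b'(\theta)$ is the sigmoid function and $b''(\theta)$ its derivative; the function name `logistic_mle` and the toy data are illustrative assumptions only.

```python
import numpy as np

def logistic_mle(X, y, n_iter=50, tol=1e-10):
    """Solve the score equation X'(y - mu(X beta)) = 0 for the logistic GLM
    by Newton-Raphson (equivalently, iteratively reweighted least squares)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        theta = np.clip(X @ beta, -30, 30)       # guard against overflow in exp
        mu = 1.0 / (1.0 + np.exp(-theta))        # b'(theta)
        w = mu * (1.0 - mu)                      # b''(theta)
        score = X.T @ (y - mu)                   # left-hand side of (2.4)
        hessian = X.T @ (X * w[:, None])         # X' Sigma(theta) X
        step = np.linalg.solve(hessian, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# toy usage on a well-posed low-dimensional design
rng = np.random.default_rng(1)
n, p = 1000, 5
X = rng.standard_normal((n, p))
beta0 = np.array([1.0, -0.5, 0.0, 0.0, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta0)))
print(logistic_mle(X, y))
```

Each Newton step solves a weighted least-squares problem with weights $b''(\theta_i)$, which is why the matrix $X^T \Sigma(\theta) X$ appearing in the asymptotic theory below also drives the numerical algorithm.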
2.2.2 Conventional p-values in low dimensions under fixed design

Recall that we condition on the design matrix $X$ in this section. We first introduce a deviation probability bound that facilitates our technical analysis. Consider both cases of bounded responses and unbounded responses. In the latter case, assume that there exist some constants $M, v_0 > 0$ such that
\[
\max_{1 \le i \le n} E\left\{ \exp\left( \frac{|y_i - b'(\theta_{0,i})|}{M} \right) - 1 - \frac{|y_i - b'(\theta_{0,i})|}{M} \right\} M^2 \le \frac{v_0}{2} \tag{2.5}
\]
with $(\theta_{0,1}, \ldots, \theta_{0,n})^T = \theta_0 = X\beta_0$, where $\beta_0 = (\beta_{0,1}, \ldots, \beta_{0,p})^T$ denotes the true regression coefficient vector in model (2.1). Then by [42, 50], it holds that for any $a \in \mathbb{R}^n$,
\[
P\left( \left| a^T y - a^T \mu(\theta_0) \right| > \|a\|_2\, \varepsilon \right) \le \varphi(\varepsilon), \tag{2.6}
\]
where $\varphi(\varepsilon) = 2 e^{-c_1 \varepsilon^2}$ with $c_1 > 0$ some constant, and $\varepsilon \in (0, \infty)$ if the responses are bounded and $\varepsilon \in (0, \|a\|_2 / \|a\|_\infty]$ if the responses are unbounded.

For nonlinear GLMs, the MLE $\hat\beta$ solves the nonlinear score equation (2.4), whose solution generally does not admit an explicit form. To address such a challenge, we construct a solution to equation (2.4) in an asymptotically shrinking neighborhood of $\beta_0$ that meets the MLE $\hat\beta$ thanks to the uniqueness of the solution. Specifically, define a neighborhood of $\beta_0$ as
\[
\mathcal{N}_0 = \{ \beta \in \mathbb{R}^p : \|\beta - \beta_0\|_\infty \le n^{-\gamma} \log n \} \tag{2.7}
\]
for some constant $\gamma \in (0, 1/2]$. Assume that $p = O(n^{\alpha_0})$ for some $\alpha_0 \in (0, \gamma)$ and let $b_n = o\{\min(n^{1/2 - \gamma} (\log n)^{1/2},\ s_n^{-1} n^{2\gamma - \alpha_0 - 1/2} / (\log n)^2)\}$ be a diverging sequence of positive numbers, where $s_n$ is a sequence of positive numbers that will be specified in Theorem 2.1 below.

We need some basic regularity conditions to establish the asymptotic normality of the MLE $\hat\beta$.

Condition 2.1. The design matrix $X$ satisfies
\[
\left\| \left[ X^T \Sigma(\theta_0) X \right]^{-1} \right\|_\infty = O(b_n n^{-1}), \tag{2.8}
\]
\[
\max_{\beta \in \mathcal{N}_0} \max_{1 \le j \le p} \lambda_{\max}\left[ X^T \mathrm{diag}\{ |x_j| \circ |\mu''(X\beta)| \} X \right] = O(n) \tag{2.9}
\]
with $\circ$ denoting the Hadamard product and derivatives understood componentwise. Assume that $\max_{1 \le j \le p} \|x_j\|_\infty < c_1^{1/2} \{n / (\log n)\}^{1/2}$ if the responses are unbounded.

Condition 2.2. The eigenvalues of $n^{-1} A_n$ are bounded away from $0$ and $\infty$, $\sum_{i=1}^n (z_i^T A_n^{-1} z_i)^{3/2} = o(1)$, and $\max_{1 \le i \le n} E|y_i - b'(\theta_{0,i})|^3 = O(1)$, where $A_n = X^T \Sigma(\theta_0) X$ and $(z_1, \ldots, z_n)^T = X$.

Conditions 2.1 and 2.2 put some basic restrictions on the design matrix $X$ and a moment condition on the responses. For the case of the linear model, bound (2.8) becomes $\|(X^T X)^{-1}\|_\infty = O(b_n / n)$ and bound (2.9) holds automatically since $b'''(\cdot) \equiv 0$. Condition 2.2 is related to the Lyapunov condition.

Theorem 2.1 (Asymptotic normality). Assume that Conditions 2.1-2.2 and probability bound (2.6) hold. Then
a) there exists a unique solution $\hat\beta$ to score equation (2.4) in $\mathcal{N}_0$ with asymptotic probability one;
b) the MLE $\hat\beta$ satisfies that for each vector $u \in \mathbb{R}^p$ with $\|u\|_2 = 1$ and $\|u\|_1 = O(s_n)$,
\[
(u^T A_n^{-1} u)^{-1/2}\, u^T (\hat\beta - \beta_0) \xrightarrow{\ D\ } N(0, \phi), \tag{2.10}
\]
and specifically for each $1 \le j \le p$,
\[
(A_n^{-1})_{jj}^{-1/2} (\hat\beta_j - \beta_{0,j}) \xrightarrow{\ D\ } N(0, \phi), \tag{2.11}
\]
where $A_n = X^T \Sigma(\theta_0) X$ and $(A_n^{-1})_{jj}$ denotes the $j$th diagonal entry of the matrix $A_n^{-1}$.

Theorem 2.1 establishes the asymptotic normality of the MLE and consequently justifies the validity of the conventional p-values in low dimensions. Note that for simplicity, we present here only the marginal asymptotic normality; the joint asymptotic normality also holds for the projection of the MLE onto any fixed-dimensional subspace. This result can also be extended to the case of misspecified models; see, for example, [79].
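For concreteness, here is a hedged sketch (not from the thesis) of how conventional p-values justified by (2.11) are typically computed in software: the unknown $\theta_0$ is replaced by the plug-in value $X\hat\beta$ inside $\Sigma(\cdot)$, which is standard practice rather than part of the theorem itself. The sketch reuses the hypothetical `logistic_mle` function from the previous code block.

```python
import numpy as np
from scipy.stats import norm

def wald_pvalues(X, y, beta_hat):
    """Conventional two-sided p-values based on (2.11): standard errors are the
    square roots of the diagonal of A_n^{-1}, with A_n = X' Sigma(theta_hat) X
    and dispersion phi = 1 for the logistic model."""
    mu = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
    w = mu * (1.0 - mu)                          # b''(theta_hat)
    A_n = X.T @ (X * w[:, None])
    se = np.sqrt(np.diag(np.linalg.inv(A_n)))
    z = beta_hat / se                            # Wald statistic for H_0: beta_j = 0
    return 2.0 * norm.sf(np.abs(z))

# usage: p_vals = wald_pvalues(X, y, logistic_mle(X, y))
```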
As mentioned in the Introduction, the asymptotic normality was shown in [42] for the nonconcave penalized MLE having intrinsic dimensionality $s = o(n^{1/3})$. In contrast, our result in Theorem 2.1 allows for the scenario of $p = o(n^{1/2})$ with no sparsity assumption, in view of our technical conditions. In particular, we see that the conventional p-values in GLMs generally remain valid in the regime of slowly diverging dimensionality $p = o(n^{1/2})$.

2.2.3 Conventional p-values in low dimensions under random design

Under the specific assumption of Gaussian design and global null $\beta_0 = 0$, we can show that the asymptotic normality of the MLE continues to hold without the previous Conditions 2.1-2.2.

Theorem 2.2. Assume that $\beta_0 = 0$, the rows of $X$ are i.i.d. from $N(0, \Sigma)$, $b^{(5)}(\cdot)$ is uniformly bounded in its domain, and $y - \mu(\theta_0)$ has uniformly sub-Gaussian components. Then if $p = O(n^{\alpha})$ with some $\alpha \in [0, 2/3)$, we have the componentwise asymptotic normality
\[
(A_n^{-1})_{jj}^{-1/2}\, \hat\beta_j \xrightarrow{\ D\ } N(0, \phi),
\]
where all the notation is the same as in (2.11).

Theorem 2.2 shows that the conclusions of Theorem 2.1 continue to hold for the case of random design and global null, with the major difference that the dimensionality can be pushed as far as $p \sim n^{2/3}$. The main reasons for presenting Theorem 2.2 under Gaussian design are twofold. First, Gaussian design is a widely used assumption in the literature. Second, our results on the nonuniformity of GLM p-values in diverging dimensions use geometric and probabilistic arguments which require the random design setting; see Section 2.3 for more details. To contrast the two regimes more accurately and maintain a self-contained theory, we have chosen to present Theorem 2.2 under Gaussian design. On the other hand, we would like to point out that Theorem 2.2 is not for practitioners who want to justify the usage of classical p-values. The global null assumption of $\beta_0 = 0$ restricts the validity of Theorem 2.2 in many practical scenarios.

2.3 Nonuniformity of GLM P-values in Diverging Dimensions

So far we have seen that for nonlinear GLMs, the p-values can be valid when $p = o(n^{1/2})$ as shown in Section 2.2, and can become meaningless when $p \ge n/2$ as discussed in the Introduction. Apparently, there is a big gap between these two regimes of growth of the dimensionality $p$. To provide some guidance on the practical use of p-values in nonlinear GLMs, it is of crucial importance to characterize their breakdown point. To highlight the main message with a simplified technical presentation, hereafter we content ourselves with the specific case of the logistic regression model for binary responses. Moreover, we investigate the distributional property in (2.11) for the scenario of true regression coefficient vector $\beta_0 = 0$, that is, under the global null. We argue that this specific model is sufficient for our purpose because if the conventional p-values derived from MLEs fail (i.e., (2.11) fails) for at least one $\beta_0$ (in particular $\beta_0 = 0$), then conventional p-values are not justified. Therefore, the breakdown point for logistic regression is at least the breakdown point for general nonlinear GLMs. This argument is fundamentally different from that of proving the overall validity of conventional p-values, where one needs to prove the asymptotic normality of MLEs under general GLMs rather than any specific model.

2.3.1 The wild side of the nonlinear regime

For the logistic regression model (2.1), we have $b(\theta) = \log(1 + e^\theta)$, $\theta \in \mathbb{R}$, and $\phi = 1$. The mean vector $\mu(\theta)$ and covariance matrix $\Sigma(\theta)$ of the $n$-dimensional response vector $y$ given by (2.3) now take the familiar form of
\[
\mu(\theta) = \left( \frac{e^{\theta_1}}{1 + e^{\theta_1}}, \ldots, \frac{e^{\theta_n}}{1 + e^{\theta_n}} \right)^T \quad \text{and} \quad \Sigma(\theta) = \mathrm{diag}\left\{ \frac{e^{\theta_1}}{(1 + e^{\theta_1})^2}, \ldots, \frac{e^{\theta_n}}{(1 + e^{\theta_n})^2} \right\}
\]
with $\theta = (\theta_1, \ldots, \theta_n)^T = X\beta$.

In many real applications, one would like to interpret the significance of each individual covariate produced by algorithms based on the conventional asymptotic normality of the MLE as established in Theorem 2.1. As argued at the beginning of this section, in order to justify the validity of p-values in GLMs, the underlying theory should at least ensure that the distributional property (2.11) holds for logistic regression under the global null. As we will see empirically in Section 2.4, as the dimensionality increases, p-values from logistic regression under the global null have a distribution that is skewed more and more toward zero. Consequently, classical hypothesis testing methods, which reject the null hypothesis when the p-value is less than the pre-specified significance level, would result in more false discoveries than desired. As a result, practitioners may simply lose the theoretical backup and the resulting decisions based on the p-values can become ineffective or even misleading. For this reason, it is important and helpful to identify the breakdown point of p-values in the diverging-dimensional logistic regression model under the global null.

Characterizing the breakdown point of p-values in nonlinear GLMs is highly nontrivial and challenging. First, the nonlinearity generally causes the MLE to have no analytical form, which makes it difficult to analyze its behavior in diverging dimensions. Second, conventional probabilistic arguments for establishing the central limit theorem of the MLE only enable us to see when the distributional property holds, but not exactly at what point it fails. To address these important challenges, we introduce novel geometric and probabilistic arguments, presented later in the proofs of Theorems 2.3-2.4, that provide a rather delicate analysis of the MLE. In particular, our arguments unveil that the early breakdown point of p-values in nonlinear GLMs is essentially due to the nonlinearity of the mean function $\mu(\cdot)$. This shows that p-values can behave wildly much earlier on in diverging dimensions when we move away from the linear regression model to nonlinear regression models such as the widely applied logistic regression; see the Introduction for detailed discussions on the p-values in diverging-dimensional linear models.

Before presenting the main results, let us look at the specific case of the logistic regression model under the global null. In such a scenario, it holds that $\theta_0 = X\beta_0 = 0$ and thus $\Sigma(\theta_0) = 4^{-1} I_n$, which results in
\[
A_n = X^T \Sigma(\theta_0) X = 4^{-1} X^T X.
\]
In particular, we see that when $n^{-1} X^T X$ is close to the identity matrix $I_p$, the asymptotic standard deviation of the $j$th component $\hat\beta_j$ of the MLE $\hat\beta$ is close to $2 n^{-1/2}$ when the asymptotic theory in (2.11) holds. As mentioned in the Introduction, when $p \ge n/2$ the MLE can blow up with excessively large variance, strong evidence against the distributional property in (2.11). In fact, one can also observe inflated variance of the MLE relative to what is predicted by the asymptotic theory in (2.11) even when the dimensionality $p$ grows at a slower rate with the sample size $n$. As a consequence, the conventional p-values given by algorithms according to property (2.11) can be much biased toward zero and thus produce more significant discoveries than the truth. Such a breakdown of conventional p-values is delineated clearly in the simulation examples presented in Section 2.4.
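The following small simulation sketch (not from the thesis) illustrates the variance inflation described above by comparing the empirical standard deviation of a null coordinate of the MLE with the value predicted by (2.11) under the global null, where $\Sigma(\theta_0) = I_n/4$ is known. It reuses the hypothetical `logistic_mle` function from Section 2.2; the dimensions and replication number are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_rep = 1000, 100
for p in (30, 100, 250, 400):
    est, se = np.empty(n_rep), np.empty(n_rep)
    for r in range(n_rep):
        X = rng.standard_normal((n, p))
        y = rng.binomial(1, 0.5, size=n)              # global null: theta_0 = 0
        est[r] = logistic_mle(X, y)[0]                # a null coordinate of the MLE
        A_n = 0.25 * X.T @ X                          # X' Sigma(theta_0) X under the global null
        se[r] = np.sqrt(np.linalg.inv(A_n)[0, 0])     # SD predicted by (2.11)
    print(f"p = {p}: empirical SD {est.std():.4f} vs predicted {se.mean():.4f} "
          f"(ratio {est.std() / se.mean():.2f})")
```

A ratio drifting above one as $p$ grows is precisely the inflated variance that biases the conventional p-values toward zero.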
2.3.2 Main results

We now present the formal results on the invalidity of GLM p-values in diverging dimensions.

Theorem 2.3 (Uniform orthonormal design). Assume that $n^{-1/2} X$ is uniformly distributed on the Stiefel manifold $V_p(\mathbb{R}^n)$ consisting of all $n \times p$ orthonormal matrices. Then for the logistic regression model under the global null, the asymptotic normality of the MLE established in (2.11) fails to hold when $p \sim n^{2/3}$, where $\sim$ stands for asymptotic order.

Theorem 2.4 (Correlated Gaussian design). Assume that $X \sim N(0, I_n \otimes \Sigma)$ with covariance matrix $\Sigma$ nonsingular. Then for the logistic regression model under the global null, the same conclusion as in Theorem 2.3 holds.

Theorem 2.2 in Section 2.2.3 states that under the global null in the GLM with Gaussian design, the p-value based on the MLE remains valid as long as the dimensionality $p$ diverges with $n$ at a slower rate than $n^{2/3}$. This, together with Theorems 2.3 and 2.4, shows that under the global null, the exact breakdown point for the uniformity of the p-value is $n^{2/3}$. We acknowledge that these results are mainly of theoretical interest because in practice one cannot check precisely whether the global null assumption holds or not. However, these results clearly suggest that in GLMs with diverging dimensionality, one needs to be very cautious when using p-values based on the MLE.

The key ingredients of our new geometric and probabilistic arguments are demonstrated in the proof of Theorem 2.3 in Section 2.6.3. The assumption that the rescaled random design matrix $n^{-1/2} X$ has the Haar measure on the Stiefel manifold $V_p(\mathbb{R}^n)$ greatly facilitates our technical analysis. The major theoretical finding is that the nonlinearity of the mean function $\mu(\cdot)$ can be negligible in determining the asymptotic distribution of the MLE as given in (2.11) when the dimensionality $p$ grows at a slower rate than $n^{2/3}$, but such nonlinearity can become dominant and deform the conventional asymptotic normality when $p$ grows at rate $n^{2/3}$ or faster. See the last paragraph of Section 2.6.3 for more detailed in-depth discussions on such an interesting phenomenon. Furthermore, the global null assumption is a crucial component of our geometric and probabilistic argument. The global null assumption, along with the distributional assumption on the design matrix, ensures the symmetry property of the MLE and the useful fact that the MLE can be asymptotically independent of the random design matrix. In the absence of such an assumption, we may suspect that p-values of the noise variables can be affected by the signal variables due to asymmetry. Indeed, our simulation study in Section 2.4 reveals that as the number of signal variables increases, the breakdown point of the p-values occurs even earlier.

Theorem 2.4 further establishes the invalidity of GLM p-values in high dimensions beyond the scenario of orthonormal design matrices considered in Theorem 2.3. The breakdown of the conventional p-values occurs regardless of the correlation structure of the covariates. Our theoretical derivations also suggest that the conventional p-values in nonlinear GLMs can generally fail to be valid when $p \sim n^{\alpha_0}$ with $\alpha_0$ ranging between $1/2$ and $2/3$, which differs significantly from the phenomenon for linear models as discussed in the Introduction. The special feature of the logistic regression model that the variance function $b''(\theta)$ takes its maximum value $1/4$ at natural parameter $\theta = 0$ leads to a higher transition point of $p \sim n^{\alpha_0}$ with $\alpha_0 = 2/3$ for the case of the global null $\beta_0 = 0$.

2.4 Numerical Studies

We now investigate the breakdown point of p-values for nonlinear GLMs in diverging dimensions, as predicted by our major theoretical results in Section 2.3, with several simulation examples. Indeed, these theoretical results are well supported by the numerical studies.

2.4.1 Simulation examples

Following Theorems 2.3-2.4 in Section 2.3, we consider three examples of the logistic regression model (2.1). The response vector $y = (y_1, \ldots, y_n)^T$ has independent components and each $y_i$ has a Bernoulli distribution with parameter $e^{\theta_i} / (1 + e^{\theta_i})$, where $\theta = (\theta_1, \ldots, \theta_n)^T = X\beta_0$. In example 1, we generate the $n \times p$ design matrix $X = (x_1, \ldots, x_p)$ such that $n^{-1/2} X$ is uniformly distributed on the Stiefel manifold $V_p(\mathbb{R}^n)$ as in Theorem 2.3, while examples 2 and 3 assume that $X \sim N(0, I_n \otimes \Sigma)$ with covariance matrix $\Sigma$ as in Theorem 2.4. In particular, we choose $\Sigma = (\rho^{|j - k|})_{1 \le j, k \le p}$ with $\rho = 0, 0.5$, and $0.8$ to reflect low, moderate, and high correlation levels among the covariates. Moreover, examples 1 and 2 assume the global null model with $\beta_0 = 0$ following our theoretical results, whereas example 3 allows the sparsity $s = \|\beta_0\|_0$ to vary.

To examine the asymptotic results we set the sample size $n = 1000$. In each example, we consider a spectrum of dimensionality $p$ with varying rate of growth with the sample size $n$. As mentioned in the Introduction, the phase transition of perfect hyperplane separation happens at the point $p/n = 1/2$. Recall that Theorems 2.3-2.4 establish that the conventional GLM p-values can become invalid when $p \sim n^{2/3}$. We set $p = [n^{\alpha_0}]$ with $\alpha_0$ in the grid $\{2/3 - 4\delta, \ldots, 2/3 - \delta, 2/3, 2/3 + \delta, \ldots, 2/3 + 4\delta, (\log(n) - \log(2)) / \log(n)\}$ with spacing $\delta = 0.05$. For example 3, we pick $s$ signals uniformly at random among all but the first components, where a random half of them are chosen as $3$ and the other half are set as $-3$.

The goal of the simulation examples is to investigate empirically when the conventional GLM p-values could break down in diverging dimensions. When the asymptotic theory for the MLE in (2.11) holds, the conventional p-values would be valid and distributed uniformly on the interval $[0, 1]$ under the null hypothesis. Note that the first covariate $x_1$ is a null variable in each simulation example. Thus in each replication, we calculate the conventional p-value for testing the null hypothesis $H_0: \beta_{0,1} = 0$. To check the validity of these p-values, we further test their uniformity. For each simulation example, we first calculate the p-values for a total of 1,000 replications as described above and then test the uniformity of these 1,000 p-values using, for example, the Kolmogorov-Smirnov (KS) test [70, 97] and the Anderson-Darling (AD) test [4, 5]. We repeat this procedure 100 times to obtain a final set of 100 new p-values from each of these two uniformity tests. Specifically, the KS and AD test statistics for testing uniformity on $[0, 1]$ are defined as
\[
\mathrm{KS} = \sup_{x \in [0, 1]} |F_m(x) - x| \quad \text{and} \quad \mathrm{AD} = m \int_0^1 \frac{[F_m(x) - x]^2}{x(1 - x)}\, dx,
\]
respectively, where $F_m(x) = m^{-1} \sum_{i=1}^m I_{(-\infty, x]}(x_i)$ is the empirical distribution function for a given sample $\{x_i\}_{i=1}^m$.
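A minimal sketch (not from the thesis) of this testing pipeline: simulate null p-values at a given growth rate $\alpha_0$, then assess their uniformity with the KS test from SciPy and with the computational form of the AD statistic defined above. It reuses the hypothetical `logistic_mle` and `wald_pvalues` sketches from Section 2.2, and the chosen $p \approx n^{2/3}$ is an illustrative setting.

```python
import numpy as np
from scipy.stats import kstest

def ad_uniform(pvals):
    """Anderson-Darling statistic against the uniform[0,1] null, i.e. the usual
    computational form of AD = m * int_0^1 (F_m(x) - x)^2 / (x (1 - x)) dx."""
    x = np.clip(np.sort(pvals), 1e-12, 1 - 1e-12)   # avoid log(0) at the boundary
    m = len(x)
    i = np.arange(1, m + 1)
    return -m - np.mean((2 * i - 1) * (np.log(x) + np.log(1 - x[::-1])))

rng = np.random.default_rng(3)
n, n_rep = 1000, 1000
p = int(round(n ** (2 / 3)))                        # growth rate alpha_0 = 2/3
pvals = np.empty(n_rep)
for r in range(n_rep):
    X = rng.standard_normal((n, p))
    y = rng.binomial(1, 0.5, size=n)                # global null
    pvals[r] = wald_pvalues(X, y, logistic_mle(X, y))[0]
print("KS test of uniformity:", kstest(pvals, "uniform"))
print("AD statistic:", ad_uniform(pvals))
```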
[Figure 2.1 about here; panels: (a) KS test, (b) AD test. Caption: Results of KS and AD tests for testing the uniformity of GLM p-values in simulation example 1 for the diverging-dimensional logistic regression model with uniform orthonormal design under the global null. The vertical axis represents the p-value from the KS and AD tests, and the horizontal axis stands for the growth rate $\alpha_0$ of dimensionality $p = [n^{\alpha_0}]$.]

[Figure 2.2 about here; panels: (a) KS test for $\rho = 0.5$, (b) AD test for $\rho = 0.5$, (c) KS test for $\rho = 0.8$, (d) AD test for $\rho = 0.8$. Caption: Results of KS and AD tests for testing the uniformity of GLM p-values in simulation example 2 for the diverging-dimensional logistic regression model with correlated Gaussian design under the global null for varying correlation level $\rho$. The vertical axis represents the p-value from the KS and AD tests, and the horizontal axis stands for the growth rate $\alpha_0$ of dimensionality $p = [n^{\alpha_0}]$.]

2.4.2 Testing results

For each simulation example, we apply both KS and AD tests to verify the asymptotic theory for the MLE in (2.11) by testing the uniformity of conventional p-values at significance level 0.05. As mentioned in Section 2.4.1, we end up with two sets of 100 new p-values from the KS and AD tests. Figures 2.1-2.3 depict the boxplots of the p-values obtained from both KS and AD tests for simulation examples 1-3, respectively. In particular, we observe that the numerical results shown in Figures 2.1-2.2 for examples 1-2 are in line with our theoretical results established in Theorems 2.3-2.4, respectively, for the diverging-dimensional logistic regression model under the global null, namely that the conventional p-values break down when $p \sim n^{\alpha_0}$ with $\alpha_0 = 2/3$. Figure 2.3 for example 3 examines the breakdown point of p-values with varying sparsity $s$. It is interesting to see that the breakdown point shifts even earlier when $s$ increases, as suggested in the discussions in Section 2.3.2. The results from the AD test are similar, so we present only the results from the KS test for simplicity.

[Figure 2.3 about here; panels: (a) KS test for $s = 0$, (b) KS test for $s = 2$. Caption: Results of the KS test for testing the uniformity of GLM p-values in simulation example 3 for the diverging-dimensional logistic regression model with uncorrelated Gaussian design under the global null for varying sparsity $s$. The vertical axis represents the p-value from the KS test, and the horizontal axis stands for the growth rate $\alpha_0$ of dimensionality $p = [n^{\alpha_0}]$.]
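Anticipating Table 2.1 below, the estimator of the type I error probability used there is simply the fraction of null p-values falling below the significance level. A two-line sketch (not from the thesis; the function name is illustrative):

```python
import numpy as np

def type_i_error(null_pvals, levels=(0.05, 0.1)):
    """Estimate the probability of a type I error at each significance level a
    as the fraction of null p-values falling below a (cf. Table 2.1)."""
    null_pvals = np.asarray(null_pvals)
    return {a: float(np.mean(null_pvals < a)) for a in levels}

# e.g. applied to the null p-values simulated in the previous sketch:
# print(type_i_error(pvals))
```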
To gain further insights into the nonuniformity of the null p-values, we next provide an additional figure in the setting of simulation example 1. Specifically, in Figure 2.4 we present the histograms of the 1,000 null p-values from the first simulation repetition (out of 100) for each value of $\alpha_0$. It is seen that as the dimensionality increases (i.e., $\alpha_0$ increases), the null p-values have a distribution that is skewed more and more toward zero, which is prone to produce more false discoveries if these p-values are used naively in classical hypothesis testing methods. To further demonstrate the severity of the problem, we estimate the probability of making a type I error at significance level $a$ as the fraction of p-values below $a$. The means and standard deviations of the estimated probabilities are reported in Table 2.1 for $a = 0.05$ and 0.1. When the null p-values are distributed uniformly, the probabilities of making a type I error should all be close to the target level $a$. However, Table 2.1 shows that when the growth rate of dimensionality $\alpha_0$ approaches or exceeds $2/3$, these probabilities can be much larger than $a$, which again supports our theoretical findings. Also, it is seen that when $\alpha_0$ is close to but still smaller than $2/3$, the averages of the estimated probabilities exceed $a$ slightly, which could be the effect of finite sample size.

[Figure 2.4 about here; panels (a)-(i) show histograms for $\alpha_0 = 0.47, 0.52, 0.57, 0.62, 0.67, 0.72, 0.77, 0.82, 0.87$. Caption: Histograms of null p-values in simulation example 1 from the first simulation repetition for different growth rates $\alpha_0$ of dimensionality $p = [n^{\alpha_0}]$.]

              $\alpha_0$    0.10    0.47    0.57    0.67    0.77    0.87
  $a = 0.05$    Mean       0.050   0.052   0.055   0.063   0.082   0.166
                SD         0.006   0.007   0.007   0.007   0.001   0.011
  $a = 0.1$     Mean       0.098   0.104   0.107   0.118   0.144   0.247
                SD         0.008   0.010   0.009   0.011   0.012   0.013

Table 2.1: Means and standard deviations (SD) for estimated probabilities of making a type I error in simulation example 1 with $\alpha_0$ the growth rate of dimensionality $p = [n^{\alpha_0}]$. Two significance levels, $a = 0.05$ and 0.1, are considered.

2.5 Discussions

In this chapter we have provided characterizations of p-values in nonlinear GLMs with diverging dimensionality. The major findings are that the conventional p-values can remain valid when $p = o(n^{1/2})$, but can become invalid much earlier in nonlinear models than in linear models, where the latter case can allow for $p = o(n)$. In particular, our theoretical results pinpoint the breakdown point of $p \sim n^{2/3}$ for p-values in the diverging-dimensional logistic regression model under the global null with uniform orthonormal design and correlated Gaussian design, as evidenced in the numerical results. It would be interesting to investigate such a phenomenon for more general classes of random design matrices.

The problem of identifying the breakdown point of p-values becomes even more complicated and challenging when we move away from the setting of the global null. Our technical analysis suggests that the breakdown point $p \sim n^{\alpha_0}$ can shift even earlier, with $\alpha_0$ ranging between $1/2$ and $2/3$. But the exact breakdown point can depend upon the number of signals $s$, the signal magnitude, and the correlation structure among the covariates in a rather complicated fashion. Thus more delicate mathematical analysis is needed to obtain the exact relationship.
We leave such a problem for future investigation. Moving beyond the GLM setting will further complicate the theoretical analysis. As we routinely produce p-values using algorithms, the phenomenon of nonuniformity of p- values occurring early in diverging dimensions poses useful cautions to researchers and practitioners when making decisions in real applications using results from p-value based methods. For instance, 23 when testing the joint signicance of covariates in diverging-dimensional nonlinear models, the eective sample size requirement should be checked before interpreting the testing results. Indeed, statistical inference in general high-dimensional nonlinear models is particularly challenging since obtaining accurate p-values is generally not easy. One possible route is to bypass the use of p-values in certain tasks including the false discovery rate (FDR) control; see, for example, [8, 18, 45] for some initial eorts made along this line. 2.6 Proofs of Main Results We provide the detailed proofs of Theorems 2.1{2.4 in this section. 2.6.1 Proof of Theorem 2.1 To ease the presentation, we split the proof into two parts, where the rst part locates the MLE b in an asymptotically shrinking neighborhoodN 0 of the true regression coecient vector 0 with signicant probability and the second part further establishes its asymptotic normality. Part 1: Existence of a unique solution to score equation (2.4) inN 0 under Condition 2.1 and probability bound (2.6). For simplicity, assume that the design matrix X is rescaled columnwise such thatkx j k 2 = p n for each 1jp. Consider an event E = n kk 1 c 1=2 1 p n logn o ; (2.12) where = ( 1 ; ; p ) T = X T [y( 0 )]. Note that for unbounded responses, the assumption of max p j=1 kx j k 1 < c 1=2 1 fn=(logn)g 1=2 in Condition 2.1 entails that c 1=2 1 p logn is less than 24 min p j=1 fkx j k 2 =kx j k 1 g. Thus bykx j k 2 = p n, probability bound (2.6), and Bonferroni's inequality, we deduce P (E) 1 p X j=1 P j j j>c 1=2 1 p n logn (2.13) 1 2pn 1 = 1Ofn (10) g; since p =O(n 0 ) for some 0 2 (0; ) with 2 (0; 1=2] by assumption. Hereafter we condition on the eventE dened in (2.12) which holds with signicant probability. We will show that for suciently large n, the score equation (2.4) has a solution in the neighborhoodN 0 which is a hypercube. Dene two vector-valued functions () = ( 1 (); ; p ()) T = X T (X) and () = () ( 0 ); 2R p : Then equation (2.4) is equivalent to () = 0. We need to show that the latter has a solution inside the hypercubeN 0 . To this end, applying a second order Taylor expansion of () around 0 with the Lagrange remainder term componentwise leads to () = ( 0 ) + X T ( 0 ) X( 0 ) + r; (2.14) where r = (r 1 ; ;r p ) T and for each 1jp, r j = 1 2 ( 0 ) T r 2 j ( j ) ( 0 ) 25 with j some p-dimensional vector lying on the line segment joining and 0 . It follows from (2.9) in Condition 2.1 that krk 1 max 2N0 p max j=1 1 2 max h X T diagfjx j jj 00 (X)jg X i k 0 k 2 2 (2.15) =O pn 12 (logn) 2 : Let us dene another vector-valued function () h X T ( 0 ) X i 1 () = 0 + u; (2.16) where u =[X T ( 0 )X] 1 ( r). 
It follows from (2.12), (2.15), and (2.8) in Condition 2.1 that for any2N 0 , kuk 1 h X T ( 0 ) X i 1 1 (kk 1 +krk 1 ) (2.17) =O h b n n 1=2 p logn +b n pn 2 (logn) 2 i : By the assumptions of p =O(n 0 ) with constant 0 2 (0; ) and b n =ofmin(n 1=2 p logn; n 2 01=2 =(logn) 2 g, we have kuk 1 =o(n logn): Thus in light of (2.16), it holds for large enough n that when ( 0 ) j =n p logn, j ()n p lognkuk 1 0; (2.18) and when ( 0 ) j =n p logn, j ()n p logn +kuk 1 0; (2.19) 26 where () = ( 1 (); ; p ()) T . By the continuity of the vector-valued function (), (2.18), and (2.19), Miranda's existence theorem [110] ensures that equation () = 0 has a solution b inN 0 . Clearly, b also solves equation () = 0 in view of (2.16). Therefore, we have shown that score equation (2.4) indeed has a solution b inN 0 . The strict concavity of the log-likelihood function (2.2) by assumptions for model (2.1) entails that b is the MLE. Part 2: Conventional asymptotic normality of the MLE b . Fix any 1 j p. In light of (2.16), we have b 0 = A 1 n ( r), which results in (A 1 n ) 1=2 jj ( b j 0;j ) = (A 1 n ) 1=2 jj e T j A 1 n (A 1 n ) 1=2 jj e T j A 1 n r (2.20) with e j 2R p having one for the jth component and zero otherwise. Note that since the smallest and largest eigenvalues of n 1 A n are bounded away from 0 and1 by Condition 2.2, it is easy to show that (A 1 n ) 1=2 jj is of exact order n 1=2 . In view of (2.17), it holds on the eventE dened in (2.12) that A 1 n r 1 h X T ( 0 ) X i 1 1 krk 1 = O b n pn 2 (logn) 2 =o(n 1=2 ); since b n =ofn 2 01=2 =(logn) 2 g by assumption. This leads to (A 1 n ) 1=2 jj e T j A 1 n r =O(n 1=2 )o P (n 1=2 ) =o P (1): (2.21) 27 It remains to consider the term (A 1 n ) 1=2 jj e T j A 1 n = P n i=1 i , where i = (A 1 n ) 1=2 jj e T j A 1 n z i [y i b 0 ( 0;i )]. Clearly, the n random variables i 's are independent with mean 0 and n X i=1 var( i ) = (A 1 n ) 1 jj e T j A 1 n (A n )A 1 n e j =: It follows from Condition 2.2 and the Cauchy{Schwarz inequality that n X i=1 Ej i j 3 = n X i=1 (A 1 n ) 1=2 jj e T j A 1 n z i 3 Ejy i b 0 ( 0;i )j 3 =O(1) n X i=1 (A 1 n ) 1=2 jj e T j A 1 n z i 3 O(1) n X i=1 (A 1 n ) 1=2 jj e T j A 1=2 n 3 2 A 1=2 n z i 3 2 =O(1) n X i=1 z T i A 1 n z i 3=2 =o(1): Thus an application of Lyapunov's theorem yields (A 1 n ) 1=2 jj e T j A 1 n = n X i=1 i D !N(0;): (2.22) By Slutsky's lemma, we see from (2.20){(2.22) that (A 1 n ) 1=2 jj ( b j 0;j ) D !N(0;); showing the asymptotic normality of each component b j of the MLE b . We further establish the asymptotic normality for the one-dimensional projections of the MLE b . Fix an arbitrary vector u2R p withkuk 2 = 1 satisfying the L 1 sparsity boundkuk 1 =O(s n ). In light of (2.16), we have b 0 = A 1 n ( r), which results in (u T A 1 n u) 1=2 (u T b u T 0 ) = (u T A 1 n u) 1=2 u T A 1 n (u T A 1 n u) 1=2 u T A 1 n r: (2.23) 28 Note that since the smallest and largest eigenvalues of n 1 A n are bounded away from 0 and1 by Condition 2.2, it is easy to show that (u T A 1 n u) 1=2 is of exact order n 1=2 . In view of (2.17), it holds on the eventE dened in (2.12) that A 1 n r 1 h X T ( 0 ) X i 1 1 krk 1 = O b n pn 2 (logn) 2 =o(s 1 n n 1=2 ) since b n =ofs 1 n n 2 01=2 =(logn) 2 g by assumption. This leads to (u T A 1 n u) 1=2 u T A 1 n r =O(n 1=2 )kuk 1 kA 1 n rk 1 =o P (1) (2.24) sincekuk 1 =O(s n ) by assumption. It remains to consider the term (u T A 1 n u) 1=2 u T A 1 n = P n i=1 i with i = (u T A 1 n u) 1=2 u T A 1 n z i [y i b 0 ( 0;i )]. 
Clearly, the n random variables i 's are independent with mean 0 and n X i=1 var( i ) = (u T A 1 n u) 1 u T A 1 n (A n )A 1 n u =: It follows from Condition 2.2 and the Cauchy{Schwarz inequality that n X i=1 Ej i j 3 = n X i=1 (u T A 1 n u) 1=2 u T A 1 n z i 3 Ejy i b 0 ( 0;i )j 3 =O(1) n X i=1 (u T A 1 n u) 1=2 u T A 1 n z i 3 O(1) n X i=1 (u T A 1 n u) 1=2 u T A 1=2 n 3 2 A 1=2 n z i 3 2 =O(1) n X i=1 z T i A 1 n z i 3=2 =o(1): 29 Thus an application of Lyapunov's theorem yields (u T A 1 n u) 1=2 u T A 1 n = n X i=1 i D !N(0;): (2.25) By Slutsky's lemma, we see from (2.23){(2.25) that (u T A 1 n u) 1=2 (u T b u T 0 ) D !N(0;); showing the asymptotic normality of any L 1 -sparse one-dimensional projection u Tb of the MLE b . This completes the proof of Theorem 2.1. 2.6.2 Proof of Theorem 2.2 The proof is similar to that for Theorem 2.1. Without loss of generality, we assume that =I p because under global null, a rotation of X yields standard normal rows. First let = ( 1 ; ; p ) T = (X T X) 1 X T [y 0 ], where 0 =b 0 (0)1 with 1 = (1; ; 1) T 2R n because 0 = 0. Then y 0 has i.i.d. uniform sub-Gaussian components and is independent of X = (z 1 ; ; z p )2 R np . Dene event E =fkk 1 c 2 p n 1 logng: By Lemma 2.1, it is seen that P (E) 1o(p a ). Furthermore, dene the neighborhood N 0 =fkk 1 c 3 p n 1 logng (2.26) for some c 3 > c 2 (b 00 (0)) 1 . We next show that the MLE must fall into the region N 0 with probability at least 1O(p a ) following the similar arguments in Theorem 1. 30 First, we dene () = ( 1 (); ; p ()) T X T (X) and () = () ( 0 ) X T [y 0 ]; 2R p : Applying a forth order Taylor expansion of () around 0 = 0 with the Lagrange remainder term componentwise leads to () = ( 0 ) +b 00 (0)X T X( 0 ) + r + s + t; where r = (r 1 ; ;r p ) T , s = (s 1 ; ;s p ) T , t = (t 1 ; ;t p ) T and for each 1jp, r j = b 000 (0) 2 n X i=1 x ij (x T i ) 2 (2.27) s j = b (4) (0) 6 n X i=1 x ij (x T i ) 3 (2.28) t j = 1 24 n X i=1 b (5) (x T i f j )x ij (x T i ) 4 : (2.29) with f j some p-dimensional vector lying on the line segment joining and 0 . Let us dene another vector-valued function () h b 00 (0)X T X i 1 () = 0 + u; (2.30) where u =(b 00 (0)) 1 + [b 00 (0)X T X] 1 (r + s + t). It follows from the above derivation that for any2N 0 , kuk 1 (b 00 (0)) 1 1 + h b 00 (0)X T X i 1 (r + s + t) 1 : 31 Now, we bound the terms on the right hand side. First note that on eventE, (b 00 (0)) 1 1 (b 00 (0)) 1 c 2 p n logn: (2.31) Then, we consider the next term: h b 00 (0)X T X i 1 (r + s + t) 1 . We observe that h b 00 (0)X T X i 1 (r + s + t) 1 jb 00 (0)j 1 h n 1 X T X i 1 1 n 1 (r + s + t) 1 jb 00 (0)j 1 h n 1 X T X i 1 1 n 1 r 1 + n 1 s 1 + n 1 t 1 : By Lemma 2.2, we have that h n 1 X T X i 1 1 1 +O(pn 1=2 ). Lemmas 2.6, 2.7, 2.8 assert that n 1 r 1 + n 1 s 1 + n 1 t 1 =fn 5=6 logn +n 3=25=4 (logn) 3=2 +n 1 (logn) 1=2 +n 23=2 (logn) 3=2 g p n 1 logn: We combine last two bounds so that we have h b 00 (0)X T X i 1 (r + s + t) 1 =o( p n 1 logn) (2.32) with probability at least 1o(p c ) when p =O(n ) with < 2=3. Combining equations (2.31) and (2.32), we obtain that if p =O(n ) with 2 [0; 2=3), then kuk 1 c 3 p n 1 logn: Thus, the MLE must fall into the regionN 0 following the similar arguments in Theorem 2.1. 32 Next, we show the componentwise asymptotic normality of the MLE b . By equation (2.30), we have b =u = (b 00 (0)) 1 (X T X) 1 X T [y 0 ] [b 00 (0)X T X] 1 (r + s + t). 
So, we can write ^ j = (b 00 (0)) 1 n 1 e T j X T [y 0 ] + (b 00 (0)) 1 T e T j [b 00 (0)X T X] 1 (r + s + t) (2.33) where T = e T j (X T X) 1 X T [y 0 ]n 1 e T j X T [y 0 ]. By Lemma 2.9 and Equation (2.32), both n 1=2 (b 00 (0)) 1 T and n 1=2 e T j [b 00 (0)X T X] 1 (r + s + t) converges to zero in probability. So, it is enough to consider the rst summand in (2.33). Now, we show that n 1=2 e T j X T [y 0 ] is asymptotically normal. In fact, we can write e T j X T [y 0 ] = P n i=1 x ij y i where each summand x ij y i is independent over i and has variance b 00 (0). Moreover, P n i=1 Ejx ij y i j 3 = O(n) since jx ij j 3 andjy i j 3 are independent and nite mean. So, we apply Lyapunov's theorem to obtain b 00 (0) 1=2 n 1=2 e T j X T [y 0 ] D !N(0;): Finally, we know thatb 00 (0)n(A 1 n ) jj ! 1 in probability from the remark in Theorem 2.1. Thus, Slutsky's lemma yields (A 1 n ) 1=2 jj ^ j D !N(0;): (2.34) This completes the proof of the theorem. Lemma 2.1. Assume that the components of y 0 are uniform sub-Gaussians. That is, there exist a positive constant C such that P(j(y 0 ) i j>t)C expfCt 2 g for all 1in. Then, it holds that, for some positive constant c 2 , k(X T X) 1 X T (y 0 )k 1 c 2 p n 1 logn: with asymptotic probability 1o(p a ). 33 Proof. We prove the result by conditioning on X. Let E = n 1 X T X I p . Then by matrix inversion, (n 1 X T X) 1 = (I p + E) 1 = I p 1 X k=1 (1) k+1 (E) k = I p E + 1 X k=2 (1) k (E) k = 2I p n 1 X T X + 1 X k=2 (1) k (E) k : Thus, it follows that k(X T X) 1 X T (y 0 )k 1 k2n 1 X T (y 0 )k 1 +kn 2 X T XX T (y 0 )k 1 + n 1 1 X k=2 (1) k (E) k X T (y 0 ) 1 = 1 + 2 + 3 : In the rest of the proof, we will bound 1 , 2 and 3 . Part 1: Bound of 1 . First, it is easy to see that 1 =k2n 1 X T (y 0 )k 1 = 2 max 1jp n 1 n X i=1 x ij (y 0 ) i : We observe that each summand x ij (y 0 ) i is the product of two subgaussian random variables, and so satises P (jx ij (y 0 ) i j>t)C exp(Ct) for some constant C > 0 by Lemma 1 in [48]. Moreover, E[x ij (y 0 ) i ] = 0 sincex ij and (y 0 ) i are independent and have zero mean. Thus, we can use Lemma 2.5 by setting W ij =x ij (y 0 ) i and = 1. So, we get 1 = 2 max 1jp n 1 n X i=1 x ij (y 0 ) i c 2 p n 1 logp (2.35) 34 with probability 1O(p c ) for some positive constants c and c 2 . Part 2: Bound of 2 . Now, we study 2 =kn 2 X T XX T (y 0 )k 1 . Let z k be the k-th column of X, that is z k = Xe k . Direct calculations yield e T k X T XX T (y 0 ) = p X j=1 (z T k z j )(z T j (y 0 )) =kz k k 2 2 z T k (y 0 ) + p X j6=k (z T k z j )(z T j (y 0 )): Thus, it follows that (X T XX T (y 0 ) 1 max k kz k k 2 2 z T k (y 0 ) + max k p X j6=k (z T k z j )(z T j (y 0 )) : (2.36) First, we consier max k kz k k 2 2 z T k (y 0 ) . Lemma 2.10 shows that max k kz k k 2 2 O(n) with probability 1O(p c ). We also have max k z T k (y 0 ) = n 2 1 O( p n logp) by equation (2.35). It follows that max k kz k k 2 2 z T k (y 0 ) max k kz k k 2 2 max k z T k (y 0 ) O(n p n logp): (2.37) Next, leta j = z T k z j =kz k k 2 andb j = z T j (y 0 )=ky 0 k 2 . Then it is easy to see that conditional on z k and y, a j N(0; 1), b j N(0; 1) and cov(a j ;b j jz k ; y) = z T k (y 0 )=(kz k k 2 ky 0 k 2 ). By (E.6) of Lemma 7 in [48], it can be shown that P 0 @ 1 p 1 p X j6=k (z T k z j )(z T j (y 0 )) z T k (y 0 ) ckz k k 2 ky 0 k 2 p p 1 logp z k ; y 1 A =P 0 @ 1 p 1 p X j6=k a j b j z T k (y 0 ) kz k k 2 ky 0 k 2 c p p 1 logp z k ; y 1 A cp c1 ; 35 where c 1 is some large positive constant independent of z k and y. 
Moreover, we can choose c 1 as large as we want by increasing c. Thus, it follows that P 0 @ 1 p 1 p X j6=k (z T k z j )(z T j (y 0 )) z T k (y 0 ) ckz k k 2 ky 0 k 2 p p 1 logp 1 A cp c1 : It follows from probability union bound that P 0 @ 1 p 1 max k 1 kz k k 2 ky 0 k 2 p X j6=k (z T k z j )(z T j (y 0 )) z T k (y 0 ) c p p 1 logp 1 A cp c1+1 : Taking c 1 > 1 yields that with probability at least 1o(p a ) for some a> 0, max k 8 < : 1 kz k k 2 ky 0 k 2 1 p 1 p X j6=k (z T k z j )(z T j (y 0 )) z T k (y 0 ) 9 = ; c p p 1 logp: By Lemma 2.10, we have max k kz k k 2 = p max k kz k k 2 2 O p ( p n). Therefore, by using the fact thatky 0 k 2 O p ( p n), we have max k p X j6=k (z T k z j )(z T j (y 0 )) max k p X j6=k [(z T k z j )(z T j (y 0 ))] (p 1)z T k (y 0 ) + (p 1) max k jz T k (y 0 )j p max k kz k k 2 ky 0 k 2 max k 8 < : 1 kz k k 2 ky 0 k 2 1 p 1 p X j6=k [(z T k z j )(z T j (y 0 )) z T k (y 0 )] 9 = ; +p max k jz T k (y 0 )j cpn p logp p p 1 logp +cp p n logp: (2.38) 36 Combining (2.36){(2.38) yields 2 =kn 2 X T XX T (y 0 )k 1 cp 1=2 n 1 logp =o( p n 1 logn): (2.39) Part 3: Bound of 3 . Finally, we study 3 . We observe that 3 P 1 k=2 (1) k+1 (E) k 1 kn 1 X T (y 0 )k 1 . Lemma 2.3 proves that P 1 k=2 (1) k+1 (E) k 1 O(p 3=2 n 1 ) while equation (2.35) shows that kn 1 X T (y 0 )k 1 =O( p n 1 logp) with probability 1O(p c ). Putting these facts together, we obtain 3 O(p 3=2 n 1 p n 1 logp) =o( p n 1 logn) (2.40) where we use p =O(n 0 ) with 0 2 [0; 2=3). Combining equations (2.35), (2.39), and (2.40), we obtain that with probability at least 1o(p a ), k(X T X) 1 X T (y 0 )k 1 c p n 1 logn: Lemma 2.2. Under the assumptions of Theorem 2.2,k(n 1 X T X) 1 k 1 1 +O(pn 1=2 ) with probability 1O(p c ). Proof. Let E = n 1 X T X I p . Then,kEk 2 C(p=n) 1=2 for some constant C with probability 1O(p c ) by Theorem 4.6.1 in [109]. Furthermore, by matrix inversion, we get (n 1 X T X) 1 = (I p + E) 1 = I p 1 X k=1 (1) k+1 (E) k : 37 Now, we take the norm and use triangle inequalities to get k(n 1 X T X) 1 k 1 kI p k 1 + 1 X k=1 kE k k 1 1 +p 1=2 1 X k=1 kE k k 2 1 +p 1=2 1 X k=1 kEk k 2 1 +Cp 1=2 1 X k=1 ((p=n) 1=2 ) k 1 +Cp 1=2 (p=n) 1=2 where we use the fact that p=n is bounded by a constant less than 1. Lemma 2.3. In the same setting as Lemma 2.2, if E =n 1 X T X I p , then 1 X k=2 (1) k+1 (E) k 1 Cp 3=2 n 1 ; with probability 1O(p c ). Proof. Again, we use thatkEk 2 C(p=n) 1=2 for some constant C with probability 1O(p c ). By similar calculations as in Lemma 2.2, we deduce 1 X k=2 (1) k+1 (E) k 1 1 X k=2 (1) k+1 (E) k 1 1 X k=2 p 1=2 (E) k 2 = 1 X k=2 p 1=2 kEk k 2 1 X k=2 p 1=2 (p=n) k=2 Cp 3=2 n 1 : Lemma 2.4. Let W j be nonnegative random variables for 1 j p that are not necessarily independent. If P(W j > t) C 1 exp(C 2 a n t 2 ) for some constants C 1 and C 2 and for some sequence a n , then for any c> 0, max 1jp W j ((c + 1)=C 2 ) 1=2 a 1=2 n (logp) 1=2 with probability at least 1O(p c ). 38 Proof. Using union bound, we get P ( max 1jp W j >t) X 1jp P (W j >t)pC 1 exp(C 2 a n t 2 ): Taking t =a 1=2 n (logp) 1=2 ((c + 1)=C 2 ) 1=2 concludes the proof since then P ( max 1jp W j >a 1=2 n (logp) 1=2 ((c + 1)=C 2 ) 1=2 )C 1 p c : Lemma 2.5. Let W ij be random variables which are independent over the index i. Assume that there are constants C 1 and C 2 such that P(jW ij j>t)C 1 exp(C 2 t ) with 0< 1. Then, with probability 1O(p c ), max 0jp n 1 n X i=1 (W ij EW ij ) Cn (1=2) (logp) 1=2 ; for some positive constants c and C. Proof. 
We have P(jn 1 P n i=1 (W ij EW ij )j>t)C 3 exp(C 4 n t 2 ) by Lemma 6 of [48] where C 3 and C 4 are some positive constants which only depend on C 1 and C 2 . This probability bound shows that the assumption of Lemma 2.4 holds with a n = n . Thus, Lemma 2.4 nishes the proof. Lemma 2.6. With probability 1O(p c ), the vector r dened in (2.27) satises the bound kn 1 rk 1 =O(n 5=6 logn p n 1 logn). Proof. We begin by observing that both x ij and (x T i =kk 2 ) are standard normal variables. So, using Lemma 1 of [48], we have P(x ij (x T i =kk 2 ) 2 >t)C exp(Ct 2=3 ) for some constant C which does not depend. It is easy to see thatx ij (x T i ) 2 are independent random variables across 39 i's with mean 0. By Lemma 2.5, max 1jp jn 1 P n i=1 x ij x T i kk2 2 j is of order O(n 1=3 (logp) 1=2 ). Moreover,kk 2 p 1=2 kk 1 O(p 1=2 p n 1 logn) when2N 0 . Therefore, kn 1 rk 1 = max 1jp b (3) (0) 2 kk 2 2 n 1 n X i=1 x ij (x T i =kk 2 ) 2 Cpn 1 (logn)n 1=3 (logp) 1=2 =Cn 4=3 (logn) 3=2 =O(n 5=6 (logn) p n 1 logn); since p =O(n ). Lemma 2.7. With probability 1O(p c ), the vector s dened in (2.28) satises the bound kn 1 sk 1 =O((n 3=25=4 (logn) 3=2 + (n 1 (logn) 1=2 ) p n 1 logn). Proof. First, observe that for some constant C,jn 1 s j jCkk 3 2 n 1 P n i=1 x ij x T i kk2 3 . More- over, the summands x ij x T i kk2 3 are independent over i and they satisfy the probability bound P (jx ij x T i kk2 3 j>t)C exp(Ct 1=2 ) by Lemma 1 of [48]. Thus, by Lemma 2.5, we obtain max 1jp n 1 n X i=1 x ij x T i kk 2 3 E " x ij x T i kk 2 3 #! =O(n 1=4 (logp) 1=2 ): Now, we calculate the expected value of the summand x ij x T i kk2 3 . We decompose x T i as x ij j + x T i;j j where x i;j and j are the vectors x i and whose jth entry is removed. We use the independence of x i;j and x ij and get E " x ij x T i kk 2 3 # = 1 kk 3 2 E h x ij x ij j + x T i;j j 3 i = 1 kk 3 2 E h x 4 ij 3 j + 3x 3 ij 2 j x T i;j j + 3x 2 ij j x T i;j j 2 +x ij x T i;j j 3 i = 1 kk 3 2 3 3 j + 3 j k j k 2 2 = 3 j kk 2 : 40 Finally, we can combine the result of Lemma 2.5 and the expected value of x ij x T i kk2 3 . We boundkn 1 sk 1 as follows kn 1 sk 1 Ckk 3 2 max 1jp n 1 n X i=1 x ij x T i kk 2 3 O kk 3 2 (n 1=4 (logp) 1=2 + kk 1 kk 2 ) O kk 3 2 n 1=4 (logp) 1=2 +kk 1 kk 2 2 ) : Since 2N 0 , we havekk 2 = O(p 1=2 n 1=2 (logp) 1=2 ) andkk 1 = O(n 1=2 (logp) 1=2 ). Thus, kn 1 sk 1 =O((n 3=25=4 (logn) 3=2 + (n 1 (logn) 1=2 ) p n 1 logn) when p =O(n ). Lemma 2.8. With probability 1O(p c ), the vector t dened in (2.29) satises the bound kn 1 tk 1 =O(n 23=2 (logn) 3=2 p n 1 logn). Proof. The proof is similar to the proof of Lemma 2.7. Since b (5) () is uniformly bounded, jn 1 t j j Ckk 4 2 n 1 P n i=1 x ij x T i kk2 4 for some constant C. We focus on the summands x ij x T i kk2 4 which are independent across i. Moreover, repeated application of Lemma 1 of [48] yields P(x ij (x T i =kk 2 ) 2 > t) C exp(Ct 2=5 ) for some constant C independent of . We can bound the expected value of the summand by Cauchy-Schwartz: E x ij x T i kk2 4 Ex 2 ij E x T i kk2 8 1=2 = p 105. So, by Lemma 2.5, we get kn 1 tk 1 Ckk 4 2 ( p 105 +n 1=5 (logp) 1=2 ) =O(kk 4 2 ) =O(p 2 n 2 (logn) 2 ): Finally, we can deduce thatkn 1 tk 1 =O(n 23=2 (logn) 3=2 p n 1 logn) when p =O(n ). 41 Lemma 2.9. Let T = e T j (X T X) 1 X T [y 0 ]n 1 e T j X T [y 0 ]. Under the assumptions of Theorem 2.2, we have e T j (X T X) 1 X T [y 0 ]n 1 e T j X T [y 0 ] =o p (n 1=2 ): (2.41) Proof. Since X and y are independent, expectation of T is clearly zero. Then, we consider the variance of T . 
To this end, we condition on X. We can calculate the conditional variance of T as follows var[TjX] = var[e T j ((X T X) 1 n 1 I p )X T (y 0 )jX] =b 00 (0)e T j ((X T X) 1 n 1 I p )X T X((X T X) 1 n 1 I p )e j where we use var[y] =b 00 (0)I n . When we dene E =n 1 X T XI p , simple calculations show that Var[TjX] =b 00 (0)n 1 e T j ((n 1 X T X) 1 I p ) + (n 1 X T XI p ))e j =b 00 (0)n 1 e T j 1 X k=2 (1) k E k ! e j : Now, we can obtain the unconditional variance using the law of total variance. var[T ] =E[var[TjX]] + var[E[TjX]] =b 00 (0)n 1 e T j E 1 X k=2 (1) k E k ! e j : Thus, using Lemma 2.3, we can show that var[T ] =o(n 1 ). Finally, we use Chebyshev's inequality P (jTj>n 1=2 )nvar[T ] =o(1). So, we conclude that T =o p (n 1=2 ) Lemma 2.10. Let x ij be standard normal random variables for 1in and 1jp. Then, max 1jp P n i=1 x 2 ij n +O(n 1=2 (logp) 1=2 ) with probability 1O(p c ) for some positive constant 42 c. Consequently, when logp =O(n ) for some 0< 1, we have max 1jp P n i=1 x 2 ij =O(n), for large enough n with probability 1O(p c ). Proof. Since x ij is a standard normal variable, x 2 ij is subexponential random variable whose mean is 1. So, Lemma 2.5 entails that max 1jp n 1 n X i=1 (x 2 ij 1) =O(n 1=2 (logp) 1=2 ) with probability 1O(p c ). Thus, simple calculations yields max 1jp n X i=1 x 2 ij = max 1jp n + n X i=1 (x 2 ij 1) n +O(n 1=2 (logp) 1=2 ) with probability 1O(p c ). 2.6.3 Proof of Theorem 2.3 To prove the conclusion in Theorem 2.3, we use the proof by contradiction. Let us make an assumption (A) that the asymptotic normality (2.11) in Theorem 2.1 which has been proved to hold when p =o(n 1=2 ) continues to hold when pn 0 for some constant 1=2< 0 1, where stands for asymptotic order. As shown in Section 2.3.1, in the case of logistic regression under global null (that is, 0 = 0) with deterministic rescaled orthonormal design matrix X (in the sense of n 1 X T X =I p ) the limiting distribution in (2.11) by assumption (A) becomes 2 1 n 1=2 b j D !N(0; 1); (2.42) where b = ( b 1 ; ; b p ) T is the MLE. 43 Let us now assume that the rescaled random design matrix n 1=2 X is uniformly distributed on the Stiefel manifoldV p (R n ) which can be thought of as the space of allnp orthonormal matrices. Then it follows from (2.42) that 2 1 n 1=2 b j D !N(0; 1) conditional on X: (2.43) Based on the limiting distribution in (2.43), we can make two observations. First, it holds that 2 1 n 1=2 b j D !N(0; 1) (2.44) unconditional on the design matrix X. Second, b j is asymptotically independent of the design matrix X, and so is the MLE b . Since the distribution of n 1=2 X is assumed to be the Haar measure on the Stiefel manifold V p (R n ), we have n 1=2 XQ d =n 1=2 X; (2.45) where Q is any xed pp orthogonal matrix and d = stands for equal in distribution. Recall that the MLE b solves the score equation (2.4), which is in turn equivalent to equation Q T X T [y(X)] = 0 (2.46) since Q is orthogonal. We now use the fact that the model is under global null which entails that the response vector y is independent of the design matrix X. Combining this fact with (2.45){(2.46) yields Q T b d = b (2.47) by noting that X = (XQ)(Q T ). Since the distributional identity (2.47) holds for any xed pp orthogonal matrix Q, we conclude that the MLE b has a spherical distribution on R p . It is 44 a well-known fact that all the marginal characteristic functions of a spherical distribution have the same generator. 
Such a fact along with (2.44) entails that 2 1 n 1=2 b is asymptotically close to N(0;I p ): (2.48) To simplify the exposition, let us now make the asymptotic limit exact and assume that b N(0; 4n 1 I p ) and is independent of X: (2.49) The remaining analysis focuses on the score equation (2.4) which is solved exactly by the MLE b , that is, X T [y(X b )] = 0; (2.50) which leads to n 1=2 X T [y(0)] =n 1=2 X T [(X b )(0)]: (2.51) Let us rst consider the random variable dened in (2.51). Note that 2[y(0)] has independent and identically distributed (i.i.d.) components each taking value 1 or1 with equal probability 1=2, and is independent of X. Thus since n 1=2 X is uniformly distributed on the Stiefel manifold V p (R n ), it is easy to see that =n 1=2 X T [y(0)] d = 2 1 n 1=2 X T 1; (2.52) where 12R n is a vector with all components being one. Using similar arguments as before, we can show that has a spherical distribution on R p . Thus the joint distribution of is determined completely by the marginal distribution of. For each 1jp, denote by j the jth component 45 of = 2 1 n 1=2 X T 1 using the distributional representation in (2.52). Let X = (x 1 ; ; x p ) with each x j 2R n . Then we have j = 2 1 n 1=2 x T j 1 d = 2 1 (n 1=2 =ke x j k 2 )n 1=2 e x T j 1; (2.53) wheree x j N(0; 4 1 I n ). It follows from (2.53) and the concentration phenomenon of Gaussian measures that each j is asymptotically close toN(0; 4 1 ) and thus consequently is asymptotically close to N(0; 4 1 I p ). A key fact (i) for the nite-sample distribution of is that the standard deviation of each component j converges to 1=2 at rate O P (n 1=2 ) that does not depend upon the dimensionality p at all. We now turn our attention to the second term dened in (2.51). In view of (2.49) and the fact that n 1=2 X is uniformly distributed on the Stiefel manifold V p (R n ), we can show that with signicant probability, kX b k 1 o(1) (2.54) forpn 0 with 0 < 1. The uniform bound in (2.54) enables us to apply the mean value theorem for the vector-valued function around 0 = 0, which results in =n 1=2 X T [(X b )(0)] = 4 1 n 1=2 X T X b + r (2.55) = 4 1 n 1=2 b + r since n 1=2 X is assumed to be orthonormal, where r =n 1=2 X T Z 1 0 h (tX b ) 4 1 I n i dt X b : (2.56) 46 Here, the remainder term r = (r 1 ; ;r p ) T 2R p is stochastic and each component r j is generally of order O P fp 1=2 n 1=2 g in light of (2.49) when the true model may deviate from the global null case of 0 = 0. Since our focus in this theorem is the logistic regression model under the global null, we can in fact claim that each component r j is generally of order O P fpn 1 g, which is a better rate of convergence than the one mentioned above thanks to the assumption of 0 = 0. To prove this claim, note that the variance function b 00 () is symmetric in 2R and takes the maximum value 1=4 at = 0. Thus in view of (2.54), we can show that with signicant probability, 4 1 I n (tX b )cdiagf(tX b ) (tX b )g =ct 2 diagf(X b ) (X b )g (2.57) for all t2 [0; 1], where c > 0 is some constant and stands for the inequality for positive semidenite matrices. Moreover, it follows from (2.49) and the fact that n 1=2 X is uniformly distributed on the Stiefel manifold V p (R n ) that with signicant probability, all the n components of X b are concentrated in the order of p 1=2 n 1=2 . 
This result along with (2.57) and the fact that n 1 X T X =I p entails that with signicant probability, n 1=2 X T Z 1 0 h 4 1 I n (tX b ) i dt X (2.58) n 1=2 X T Z 1 0 c t 2 pn 1 dt X = 3 1 c pn 3=2 X T X = 3 1 c pn 1=2 I p ; where c > 0 is some constant. Thus combining (2.56), (2.58), and (2.49) proves the above claim. We make two important observations about the remainder term r in (2.55). First, r has a spherical distribution on R p . This is because by (2.55) and (2.51) it holds that r = 4 1 n 1=2 b = 4 1 n 1=2 b ; 47 which has a spherical distribution on R p . Thus the joint distribution of r is determined completely by the marginal distribution of r. Second, for the nonlinear setting of logistic regression model, the appearance of the remainder term r in (2.55) is due solely to the nonlinearity of the mean function (), and we have shown that each component r j can indeed achieve the worst-case order pn 1 in probability. For each 1 j p, denote by j the jth component of . Then in view of (2.49) and (2.55), a key fact (ii) for the nite-sample distribution of is that the standard deviation of each component j converges to 1=2 at rate O P fpn 1 g that generally does depend upon the dimensionality p. Finally, we are ready to compare the two random variables and on the two sides of equation (2.51). Since equation (2.51) is a distributional identity in R p , naturally the square root of the sum of var j 's and the square root of the sum of var j 's are expected to converge to the common value 2 1 p 1=2 at rates that are asymptotically negligible. However, the former has ratep 1=2 O P (n 1=2 ) =O P fp 1=2 n 1=2 g, whereas the latter has ratep 1=2 O P fpn 1 g =O P fp 3=2 n 1 g. A key consequence is that when p n 0 for some constant 2=3 0 < 1, there is a profound dierence between the two asymptotic rates in that the former rate is O P fn (10)=2 g =o P (1), while the latter rate becomesO P fn 30=21 g which is now asymptotically diverging or nonvanishing. Such an intrinsic asymptotic dierence is, however, prohibited by the distributional identity (2.51) in R p , which results in a contradiction. Therefore, we have now argued that assumption (A) we started with for 2=3 0 < 1 must be false, that is, the asymptotic normality (2.11) which has been proved to hold when p =o(n 1=2 ) generally would not continue to hold when pn 0 with constant 2=3 0 1. In other words, we have proved the invalidity of the conventional GLM p-values in this regime of diverging dimensionality, which concludes the proof of Theorem 2.3. 2.6.4 Proof of Theorem 2.4 By assumption, XN(0;I n ) with covariance matrix nonsingular. Let us rst make a useful observation. For the general case of nonsingular covariance matrix , we can introduce a change 48 of variable by letting e = 1=2 and correspondingly e X = X 1=2 . Clearly, e XN(0;I n I p ) and the MLE for the transformed parameter vector e is exactly 1=2b , where b denotes the MLE under the original design matrix X. Thus to show the breakdown point of the conventional asymptotic normality of the MLE, it suces to focus on the specic case of XN(0;I n I p ). Hereafter we assume that X N(0;I n I p ) with p = o(n). The rest of the arguments are similar to those in the proof of Theorem 2.3 in Section 2.6.3 except for some modications needed for the case of Gaussian design. Specically, for the case of logistic regression model under global null (that is, 0 = 0), the limiting distribution in (2.11) becomes 2 1 n 1=2 b j D !N(0; 1); (2.59) since n 1 X T X! 
I p almost surely in spectrum and thus 4 1 n(A 1 n ) jj ! 1 in probability as n!1. Here, we have used a claim that both the largest and smallest eigenvalues of n 1 X T X converge to 1 almost surely as n!1 for the case of p =o(n), which can be shown by using the classical results from random matrix theory (RMT) [54, 96, 7]. Note that since XN(0;I n I p ), it holds that n 1=2 XQ d =n 1=2 X; (2.60) where Q is any xed pp orthogonal matrix and d = stands for equal in distribution. By XN(0;I n I p ), it is also easy to see that =n 1=2 X T [y(0)] d = 2 1 n 1=2 X T 1; (2.61) 49 where 12R n is a vector with all components being one. In view of (2.49) and the assumption of XN(0;I n I p ), we can show that with signicant probability, kX b k 1 o(1) (2.62) for p n 0 with constant 0 < 1. It holds further that with signicant probability, all the n components of X b are concentrated in the order of p 1=2 n 1=2 . This result along with (2.57) and the fact that n 1 X T X!I p almost surely in spectrum entails that with asymptotic probability one, n 1=2 X T Z 1 0 h 4 1 I n (tX b ) i dt X (2.63) n 1=2 X T Z 1 0 c t 2 pn 1 dt X = 3 1 c pn 3=2 X T X! 3 1 c pn 1=2 I p ; where c > 0 is some constant. This completes the proof of Theorem 2.4. 50 Chapter 3 RANK: Large-Scale Inference with Graphical Nonlinear Knockos 3.1 Introduction The issues of power and reproducibility are key to enabling rened scientic discoveries in big data applications utilizing general high-dimensional nonlinear models. To characterize the reproducibility of statistical inference, the seminal paper of [11] introduced an elegant concept of false discovery rate (FDR) which is dened as the expectation of the fraction of false discoveries among all the discoveries, and proposed a popularly used Benjamini{Hochberg procedure for FDR control by resorting to the p-values for large-scale multiple testing returned by some statistical estimation and testing procedure. There is a huge literature on FDR control for large-scale inference and various generalizations and extensions of the original FDR procedure were developed and investigated for dierent settings and applications [12, 30, 100, 101, 1, 28, 29, 37, 114, 24, 57, 47, 82, 116, 38, 76, 102]. Most of existing work either assumes a specic functional form such as linearity on the dependence structure of response Y on covariates X j 's, or relies on the p-values for evaluating the signicance of covariates X j 's. Yet in high-dimensional settings, we often do not have such luxury since response Y could depend on covariate vector x through very complicated forms and even when Y and x have simple dependence structure, high dimensionality of covariates can render classical 51 p-value calculation procedures no longer justied or simply invalid [66, 46, 104]. These intrinsic challenges can make the p-value based methods dicult to apply or even fail [18]. To accommodate arbitrary dependence structure of Y on x and bypass the need of calculating accurate p-values for covariate signicance, [18] recently introduced the model-X knockos frame- work for FDR control in general high-dimensional nonlinear models. Their work was inspired by and builds upon the ingenious development of the knocko lter in [8], which provides eective FDR control in the setting of Gaussian linear model with dimensionality p no larger than sample size n. The knocko lter was later extended in [9] to high-dimensional linear model using the ideas of data splitting and feature screening. 
The salient idea of [8] is to construct the so-called "knockoff" variables, which mimic the dependence structure of the original covariates but are independent of the response Y conditional on the original covariates. These knockoff variables can be used as control variables. By comparing the regression outcomes for the original variables with those for the control variables, the relevant set of variables can be identified more accurately and thus the FDR can be better controlled. The model-X knockoffs framework introduced in [18] greatly expands the applicability of the original knockoff filter in that the response Y and covariates x can have arbitrarily complicated dependence structure and the dimensionality p can be arbitrarily large compared to the sample size n. It was theoretically justified in [18] that the model-X knockoffs procedure controls the FDR exactly in finite samples of arbitrary dimensions. However, one important assumption in their theoretical development is that the joint distribution of the covariates x should be known. Moreover, a formal power analysis of the knockoffs framework is still lacking even for the setting of the Gaussian linear model.

Despite the importance of the known covariate distribution in their theoretical development, [18] empirically explored the scenario of unknown covariate distribution for the specific setting of the generalized linear model (GLM) [80] with Gaussian design matrix and discovered that the estimation error of the covariate distribution can have a negligible effect on FDR control. Yet there exist no formal theoretical justifications of the robustness of the model-X knockoffs method, and it is also unclear to what extent such robustness can hold beyond the GLM setting. To address these fundamental challenges, this chapter serves as a first attempt to provide theoretical foundations for the power and robustness of the model-X knockoffs framework. Specifically, the major innovations of the chapter are twofold. First, we will formally investigate the power of the knockoffs framework in high-dimensional linear models with both known and unknown covariate distribution. Second, we will provide theoretical support for the robustness of the model-X knockoffs procedure with unknown covariate distribution in general high-dimensional nonlinear models.

More specifically, in the ideal case of known covariate distribution, we prove that the model-X knockoffs procedure in [18] has asymptotic power one under mild regularity conditions in high-dimensional linear models. When moving away from this ideal scenario, to accommodate the difficulty caused by the unknown covariate distribution we suggest a modified model-X knockoffs method called graphical nonlinear knockoffs (RANK). The modified knockoffs procedure exploits the data splitting idea, where the first half of the sample is used to estimate the unknown covariate distribution and reduce the model size, and the second half of the sample is employed to globally construct the knockoff variables and apply the knockoffs procedure. We establish that the modified knockoffs procedure asymptotically controls the FDR regardless of whether the reduced model contains the true model or not. Such a feature makes our work intrinsically different from that in [9], which requires the sure screening property [40] of the reduced model; see Section 3.3.1 for more detailed discussions on the differences.
In our theoretical analysis of FDR, we still allow for arbitrary dependence structure of response Y on covariates x and assume that the joint distribution of x is characterized by Gaussian graphical model with unknown precision matrix [73]. In the specic case of high-dimensional linear models with unknown covariate distribution, we also provide robustness analysis on the power of our modied procedure. The rest of the chapter is organized as follows. Section 3.2 reviews the model-X knockos framework and provides theoretical justications on its power in high-dimensional linear models. We introduce the modied model-X knockos procedure RANK and investigate its robustness on 53 both FDR control and power with respect to the estimation of unknown covariate distribution in Section 3.3. Section 3.4 presents several simulation examples of both linear and nonlinear models to verify our theoretical results. We demonstrate the performance of our procedure on a real data set in Section 3.5. Section 3.6 discusses some implications and extensions of our work. The proofs of main results and additional technical details are relegated to the end of the chapter. 3.2 Power analysis for oracle model-X knockos Suppose we have a sample (x i ;Y i ) n i=1 of n independent and identically distributed (i.i.d.) observa- tions from the population (x;Y ), where dimensionality p of covariate vector x = (X 1 ; ;X p ) T can greatly exceed available sample size n. To ensure model identiability, it is common to assume that only a small fraction ofp covariatesX j 's are truly relevant to responseY . To be more precise, [18] dened the set of irrelevant featuresS 1 as that consisting of X j 's such that X j is independent of Y conditional on all remaining p 1 covariates X k 's with k6= j, and thus the set of truly relevant featuresS 0 is given naturally byS c 1 , the complement of setS 1 . Features in setsS 0 and S c 0 =S 1 are also referred to as important and noise features, respectively. We aim at accurately identifying these truly relevant features in setS 0 that is assumed to be identiable while keeping the false discovery rate (FDR) [11] under control. The FDR for a feature selection procedure is dened as FDR =E[FDP] with FDP = j b S\S c 0 j j b Sj ; (3.1) where b S denotes the sparse model returned by the feature selection procedure,jj stands for the cardinality of a set, and the convention 0=0 = 0 is used in the denition of the false discovery proportion (FDP) which is the fraction of noise features in the discovered model. Here feature selection procedure can be any favorite sparse modeling method by the choice of the user. 54 3.2.1 Review of model-X knockos framework Our suggested graphical nonlinear knockos procedure in Section 3.3 falls in the general framework of model-X knockos introduced in [18], which we brie y review in this section. The key ingredient of model-X knockos framework is the construction of the so-called model-X knocko variables that are dened as follows. Denition 3.1 ([18]). Model-X knockos for the family of random variables x = (X 1 ; ;X p ) T is a new family of random variables e x = ( e X 1 ; ; e X p ) T that satises two properties: (1) (x T ;e x T ) swap(S) d =(x T ;e x T ) for any subsetSf1; ;pg, where swap(S) means swapping compo- nents X j and e X j for each j2S and d = denotes equal in distribution, and (2)e x? ?Yjx. 
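Since the FDP in (3.1) is the quantity tracked throughout the simulation studies of this chapter, we record here a minimal helper for computing it, together with the empirical true discovery proportion (the sample counterpart of the power measure used later in the chapter). The function name and the toy sets are ours and purely illustrative; the helper is generic and does not depend on the particular construction of knockoffs.

```python
# Empirical FDP as in (3.1), with the 0/0 = 0 convention, plus the fraction of truly
# relevant features recovered; S_hat and S_0 are index sets of selected and true features.
def fdp_and_tdp(S_hat, S_0):
    S_hat, S_0 = set(S_hat), set(S_0)
    fdp = len(S_hat - S_0) / len(S_hat) if S_hat else 0.0
    tdp = len(S_hat & S_0) / len(S_0) if S_0 else 1.0
    return fdp, tdp

# Toy example: 10 relevant features, 12 selected, 9 of them relevant.
print(fdp_and_tdp(S_hat=set(range(9)) | {20, 21, 22}, S_0=set(range(10))))
# Averaging the FDP over independent repetitions estimates the FDR.
```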
We see from Denition 3.1 that model-X knocko variables e X j 's mimic the probabilistic dependency structure among the original features X j 's and are independent of response Y given X j 's. When the covariate distribution is characterized by Gaussian graphical model [73], that is, xN(0; 1 0 ) (3.2) with pp precision matrix 0 encoding the graphical structure of the conditional dependency among the covariates X j 's, we can construct the p-variate model-X knocko random variablee x characterized in Denition 3.1 as e xjxN x diagfsg 0 x; 2diagfsg diagfsg 0 diagfsg ; (3.3) 55 where s is ap-dimensional vector with nonnegative components chosen in a suitable way. In fact, in view of (3.2) and (3.3) it is easy to show that the original features and model-X knocko variables have the following joint distribution 0 B B @ x e x 1 C C A N 0 B B @ 0 B B @ 0 0 1 C C A ; 0 B B @ 0 0 diagfsg 0 diagfsg 0 1 C C A 1 C C A (3.4) with 0 = 1 0 the covariance matrix of covariates x. Intuitively, larger components of s means that the constructed knocko variables deviate further from the original features, resulting in higher power in distinguishing them. The p-dimensional vector s in (3.3) should be chosen in a way such that 0 2 1 diagfsg is positive denite, and can be selected using the methods in [18]. We will treat it as a nuisance parameter throughout our theoretical analysis. With the constructed knocko variables e x, the knockos inference framework proceeds as follows. We select important variables by resorting to the knocko statistics W j = f j (Z j ; e Z j ) dened for each 1 j p, where Z j and e Z j represent feature importance measures for jth covariateX j and its knocko counterpart e X j , respectively, andf j (;) is an antisymmetric function satisfying f j (z j ; ~ z j ) =f j (~ z j ;z j ). For example, in linear regression models, one can choose Z j and e Z j as the Lasso [106] regression coecients of X j and e X j , respectively, and a valid knocko statistic is W j =f j (z j ; ~ z j ) =jz j jj~ z j j. There are also many other options for dening the feature importance measures. Observe that all model-X knocko variables e X j 's are just noise features by the second property in Denition 3.1. Thus intuitively, a large positive value of knocko statistic W j indicates that jth covariate X j is important, while a small magnitude of W j usually corresponds to noise features. The nal step of the knockos inference framework is to sortjW j j's from high to low and select features whose W j 's are at or above some threshold T , which results in the discovered model b S = b S(T ) =f1jp :W j Tg: (3.5) 56 Following [8] and [18], one can choose the threshold T in the following two ways T = min ( t2W : jfj :W j tgj jfj :W j >tgj q ) ; (3.6) T + = min ( t2W : 1 +jfj :W j tgj jfj :W j >tgj q ) ; (3.7) whereW =fjW j j : 1jpgnf0g is the set of unique nonzero values attained byjW j j's and q2 (0; 1) is the desired FDR level specied by the user. The procedures using threshold T in (3.6) and threshold T + in (3.7) are referred to as knockos and knockos + methods, respectively. It was proved in [18] that model-X knockos procedure controls a modied FDR that replaces j b Sj in the denominator by q 1 +j b Sj in (3.1), and model-X knockos + procedure achieves exact FDR control in nite samples regardless of dimensionality p and dependence structure of response Y on covariates x. The major assumption needed in [18] is that the distribution of covariates x is known. 
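Operationally, the selection rule (3.5) with the thresholds (3.6) and (3.7) amounts to only a few lines of code once the knockoff statistics W_j are available. The sketch below (function names ours) follows the standard form of the knockoff filter; the option plus=True corresponds to the knockoffs+ threshold T_+.

```python
# Data-dependent thresholds in the spirit of (3.6)-(3.7) and the selection rule (3.5),
# given a vector W of knockoff statistics; the "+1" offset gives the knockoffs+ version.
import numpy as np

def knockoff_threshold(W, q=0.2, plus=True):
    W = np.asarray(W, dtype=float)
    candidates = np.sort(np.unique(np.abs(W[W != 0])))     # candidate values of t
    offset = 1.0 if plus else 0.0
    for t in candidates:
        if (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= q:
            return t
    return np.inf                                          # no feasible t: select nothing

def knockoff_select(W, q=0.2, plus=True):
    W = np.asarray(W, dtype=float)
    return np.flatnonzero(W >= knockoff_threshold(W, q=q, plus=plus))  # S_hat in (3.5)

print(knockoff_select([3.1, -0.2, 2.4, 0.0, -1.5, 0.7, 2.0], q=0.4))
```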
Throughout the chapter, we implicitly use the threshold T + dened in (3.7) for FDR control in the knockos inference framework but still write it as T for notational simplicity. 3.2.2 Power analysis in linear models Although the knockos procedures were proved rigorously to have controlled FDR in [8, 9, 18], their power advantages over popularly used approaches have been demonstrated only numerically therein. In fact, formal power analysis for the knockos framework is still lacking even in simple model settings such as linear regression. We aim to ll in this gap as a rst attempt and provide theoretical foundations on the power analysis for model-X knockos framework. In this section, we will focus on the oracle model-X knockos procedure for the ideal case when the true precision matrix 0 for the covariate distribution in (3.2) is known, which is the setting assumed in [18]. The robustness analysis for the case of unknown precision matrix 0 will be undertaken in Section 3.3. 57 We would like to remark that the power analysis for the knockos framework is necessary and nontrivial. The FDR and power are two sides of the same coin, just like type I and type II errors in hypothesis testing. The knockos framework is a wrapper and can be combined with most model selection methods to achieve FDR control. Yet the theoretical properties of power after applying the knockos procedure are completely unknown for the case of correlated covariates and unknown covariate distribution. For example, when the knockos framework is combined with the Lasso, it further selects variables from the set of variables picked by Lasso applied with the augmented design matrix to achieve the FDR control. For this reason, the power of knockos is usually lower than that of Lasso. The main focus of this section is to investigate how much power loss the knockos framework would encounter when combined with Lasso. Since the power analysis for the knockos framework is nontrivial and challenging, we content ourselves on the setting of high-dimensional linear models for the technical analysis on power. The linear regression model assumes that y = X 0 +"; (3.8) where y = (Y 1 ; ;Y n ) T is ann-dimensional response vector, X = (x 1 ; ; x n ) T is annp design matrix consisting of p covariates X j 's, 0 = ( 0;1 ; ; 0;p ) T is a p-dimensional true regression coecient vector, and" = (" 1 ; ;" n ) T is an n-dimensional error vector independent of X. As mentioned before, the true modelS 0 = supp( 0 ) which is the support of 0 is assumed to be sparse with size s =jS 0 j, and then rows of design matrix X are i.i.d. observations generated from Gaussian graphical model (3.2). Without loss of generality, all the diagonal entries of covariance matrix 0 are assumed to be ones. As discussed in Section 3.2.1, there are many choices of the feature selection procedure up to the user for producing the feature importance measures Z j and e Z j for covariates X j and knocko variables e X j , respectively, and there are also dierent ways to construct the knocko statistics 58 W j . For the illustration purpose, we adopt the Lasso coecient dierence (LCD) as the knocko statistics in our power analysis. The specic choice of LCD for knocko statistics was proposed and recommended in [18], in which it was demonstrated empirically to outperform some other choices in terms of power. 
The LCD is formally dened as W j =j b j ()jj b p+j ()j; (3.9) where b j () and b p+j () denote thejth and (p +j)th components, respectively, of the Lasso [106] regression coecient vector b () = argmin b2R 2p n (2n) 1 y [X; e X]b 2 2 +kbk 1 o (3.10) with 0 the regularization parameter, e X = (e x 1 ; ;e x n ) T an np matrix whose n rows are independent random vectors of model-X knocko variables generated from (3.3), andkk r for r 0 theL r -norm of a vector. To simplify the technical analysis, we assume that with asymptotic probability one, there are no ties in the magnitude of nonzero W j 's and no ties in the magnitude of nonzero components of Lasso solution in (3.10), which is a mild condition in light of the continuity of the underlying distributions. To facilitate the power analysis, we impose some basic regularity conditions. Assumption 3.1. The components of" are i.i.d. with sub-Gaussian distribution. Assumption 3.2. It holds thatfn=(logp)g 1=2 min j2S0 j 0;j j!1 as n increases. Assumption 3.3. There exists some constant c2 (2(qs) 1 ; 1) such that with asymptotic proba- bility one,j b Sjcs for b S given in (3.5). Condition 3.1 can be relaxed to heavier-tailed distributions at the cost of slower convergence rates as long as similar concentration inequalities used in the proofs continue to hold. Condition 3.2 is assumed to ensure that the Lasso solution b () does not miss a great portion of important 59 features inS 0 . This is necessary since the knockos procedure under investigation builds upon the Lasso solution and thus its power is naturally upper bounded by that of Lasso. To see this, recall the well-known oracle inequality for Lasso [14, 16] that with asymptotic probability one, k b () 0 k 2 =O(s 1=2 ) for chosen in the order off(logp)=ng 1=2 . Then Condition 3.2 entails that for some n !1, O(s 2 ) =k b () 0 k 2 2 P j2 b S c L \S0 2 0;j n 1 (logp) 2 n j b S c L \S 0 j with b S L = suppf b ()g. Thus the number of important features missed by Lassoj b S c L \S 0 j is upper bounded by O(s 2 n ) with asymptotic probability one. This guarantees that the power of Lasso is lowered bounded by 1O( 2 n ); that is, Lasso has asymptotic power one. However, as discussed previously the power of knockos is always upper bounded by that of Lasso. So we are interested in the relative power of knockos compared to that of Lasso. For this reason, Condition 3.2 is imposed to simplify the technical analysis of the knockos power by ensuring that the asymptotic power of Lasso is one. We will show in Theorem 3.1 that there is almost no power loss when applying model-X knockos procedure. Condition 3.3 imposes a lower bound on the size of the sparse model selected by the knockos procedure. Recall that we assume the number of true variables s can diverge with sample size n. The rationale behind Condition 3.3 is that any method with high power should at least be able to select a large number of variables which are not necessarily true ones though. Since it is not straightforward to check, we provide a sucient condition that is more intuitive in Lemma 3.1 below, which shows that Condition 3.3 can hold as long as there exist enough strong signals in the model. We acknowledge that Lemma 3.1 may not be a necessary condition for Condition 3.3. Lemma 3.1. Assume that Condition 3.1 holds and there exists some constant c2 (2(qs) 1 ; 1) such thatjS 2 jcs withS 2 =fj :j 0;j j [sn 1 (logp)] 1=2 g. Then Condition 3.3 holds. We would like to mention that the conditions of Lemma 3.1 are not stronger than Condition 3.2. 
We require a few strong signals, and yet still allow for many very weak ones. In other words, the set of strong signalsS 2 is only a large enough proper subset of the set of all signalsS 0 . 60 We are now ready to characterize the statistical power of the knockos procedure in high- dimensional linear model (3.8). Formally speaking, the power of a feature selection procedure is dened as Power( b S) =E h j b S\S 0 j jS 0 j i ; (3.11) where b S denotes the discovered sparse model returned by the feature selection procedure. Theorem 3.1. Assume that Condition 3.1{3.3 hold, all the eigenvalues of 0 are bounded away from 0 and1, the smallest eigenvalue of 2diag(s) diag(s) 0 diag(s) is positive and bounded away from 0, and = C 1 f(logp)=ng 1=2 with C 1 > 0 some constant. Then the oracle model-X knockos procedure satises that with asymptotic probability one,j b S\S 0 j=jS 0 j 1O( 1 n ) for some n !1, and Power( b S)! 1 as n!1. Theorem 3.1 reveals that the oracle model-X knockos procedure in [18] knowing the true precision matrix 0 for the covariate distribution can indeed have asymptotic power one under some mild regularity conditions. This shows that for the ideal case, model-X knockos procedure can enjoy appealing FDR control and power properties simultaneously. 3.3 Robustness of graphical nonlinear knockos When moving away from the ideal scenario considered in Section 3.2, a natural question is whether both properties of FDR control and power can continue to hold with no access to the knowledge of true covariate distribution. To gain insights into such a question, we now turn to investigating the robustness of model-X knockos framework. Hereafter we assume that the true precision matrix 0 for the covariate distribution in (3.2) is unknown. We will begin with the FDR analysis and then move on to the power analysis. 61 3.3.1 Modied model-X knockos We would like to emphasize that the linear model assumption is no longer needed here and arbitrary dependence structure of response y on covariates x is allowed. As mentioned in Introduction, to overcome the diculty caused by unknown precision matrix 0 we modify the model-X knockos procedure described in Section 3.2.1 and suggest the method of graphical nonlinear knockos (RANK). To ease the presentation, we rst introduce some notation. For each givenpp symmetric posi- tive denite matrix , denote by C = I p diagfsg and B = 2diagfsgdiagfsg diagfsg 1=2 the square root matrix. We denenp matrix e X = (e x 1 ; ;e x n ) T by independently generating e x i from the conditional distribution e x i jx i N C x i ; (B ) 2 ; (3.12) where X = (x 1 ; ; x n ) T is the original np design matrix generated from Gaussian graphical model (3.2). It is easy to show that the (2p)-variate random vectors (x T i ; (e x i ) T ) T are i.i.d. with Gaussian distribution of mean 0 and covariance matrix given by cov(x i ) = 0 , cov(x i ;e x i ) = 0 C , and cov(e x i ) = (B ) 2 + C 0 (C ) T . Our modied knockos method RANK exploits the idea of data splitting, in which one half of the sample is used to estimate unknown precision matrix 0 and reduce the model dimensionality, and the other half of the sample is employed to construct the knocko variables and implement the knockos inference procedure, with the steps detailed below. Step 1. Randomly split the data (X; y) into two folds (X (k) ; y (k) ) with 1k 2 each of sample size n=2. Step 2. Use the rst fold of data (X (1) ; y (1) ) to obtain an estimate b of the precision matrix and a reduced model with support e S. 62 Step 3. 
With estimated precision matrix b from Step 2, construct an (n=2)p knockos matrix b X using X (2) with rows independently generated from (3.12); that is, b X = X (2) (C b ) T + ZB b with Z an (n=2)p matrix with i.i.d. N(0; 1) components. Step 4. Construct knocko statistics W j 's using only data on support e S, that is, W j = W j (y (2) ; X (2) e S ; b X e S ) for j 2 e S and W j = 0 for j 2 e S c . Then apply knockos inference procedure to W j 's to obtain nal set of features b S. Here for any matrix A and subsetSf1; ;pg, the compact notation A S stands for the submatrix of A consisting of columns in setS. As discussed in Section 3.2.1, the model-X knockos framework utilizes sparse regression procedures such as the Lasso. For this reason, even in the original model-X knockos procedure the knocko statistics W j 's (see, e.g., (3.9)) take nonzero values only over a much smaller model than the full model. This observation motivates us to estimate such a smaller model using the rst half of the sample in Step 2 of our modied procedure. When implementing this modied procedure, we limit ourselves to sparse models e S with size bounded by some positive integer K n that diverges with n; see, for example, [50, 78] for detailed discussions and justications on similar consideration of sparse models. In addition to sparse regression procedures, feature screening methods such as [40, 34] can also be used to obtain the reduced model e S. The above modied knockos method diers from the original model-X knockos procedure [18] in that we use an independent sample to obtain the estimated precision matrix b and reduced model e S. In particular, the independence between estimates ( b ; e S) and data (X (2) ; y (2) ) plays an important role in our theoretical analysis for the robustness of the knockos procedure. In fact, the idea of data splitting has been popularly used in the literature for various purposes [44, 36, 95, 9]. Although the work of [9] has the closest connection to ours, there are several key dierences between these two methods. Specically, [9] considered high-dimensional linear model with xed design, where the data is split into two portions with the rst portion used for feature 63 screening and the second portion employed for applying the original knocko lter in [8] on the reduced model. To ensure FDR control, it was required in [9] that the feature screening method should enjoy the sure screening property [40], that is, the reduced model after the screening step contains the true modelS 0 with asymptotic probability one. In contrast, one major advantage of our method is that the asymptotic FDR control can be achieved without requiring the sure screening property; see Theorem 3.2 in Section 3.3.2 for more details. Such major distinction is rooted on the dierence in constructing knocko variables; that is, we construct model-X knocko variables globally in Step 3 above, whereas [9] constructed knocko variables locally on the reduced model. Another major dierence is that our method works with random design and does not need any assumption on how response y depends upon covariates x, while the method in [9] requires the linear model assumption and cannot be extended to nonlinear models. 3.3.2 Robustness of FDR control for graphical nonlinear knockos We begin with investigating the robustness of FDR control for the modied model-X knockos procedure RANK. 
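To make Steps 1–4 concrete, the following self-contained sketch instantiates them in the linear-model setting with hypothetical implementation choices that are ours rather than the chapter's: the graphical lasso for the precision estimate in Step 2, the lasso support for the reduced model, a constant vector s, cross-validated lasso coefficients for LCD-type statistics, and the knockoffs+ filter in Step 4.

```python
# A hypothetical sketch of the RANK steps above for a linear model; the estimator
# choices (graphical lasso, cross-validated lasso, constant s) are illustrative only.
import numpy as np
from scipy.linalg import sqrtm
from sklearn.covariance import GraphicalLassoCV
from sklearn.linear_model import LassoCV

def rank_select(X, y, q=0.2, s_value=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    idx = rng.permutation(n)
    i1, i2 = idx[: n // 2], idx[n // 2:]                    # Step 1: data splitting

    Omega_hat = GraphicalLassoCV().fit(X[i1]).precision_    # Step 2: precision estimate
    S_tilde = np.flatnonzero(LassoCV(cv=5).fit(X[i1], y[i1]).coef_)  # Step 2: reduced model
    if S_tilde.size == 0:
        return S_tilde

    # Step 3: knockoffs as in (3.12); s must keep 2*diag(s) - diag(s) Omega diag(s) PSD.
    D = np.diag(np.full(p, s_value))
    C = np.eye(p) - D @ Omega_hat
    B = np.real(sqrtm(2 * D - D @ Omega_hat @ D))
    X2 = X[i2]
    X_ko = X2 @ C.T + rng.standard_normal(X2.shape) @ B

    # Step 4: LCD-type statistics on the reduced model, then the knockoffs+ filter.
    aug = np.hstack([X2[:, S_tilde], X_ko[:, S_tilde]])
    coef = LassoCV(cv=5).fit(aug, y[i2]).coef_
    m = S_tilde.size
    W = np.abs(coef[:m]) - np.abs(coef[m:])
    ts = np.sort(np.unique(np.abs(W[W != 0])))
    T = next((t for t in ts
              if (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= q), np.inf)
    return S_tilde[W >= T]

# Toy usage on a sparse linear model with 5 true signals.
rng = np.random.default_rng(1)
n, p = 400, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 1.5
y = X @ beta + rng.standard_normal(n)
print(rank_select(X, y, q=0.2))
```

In practice the vector s would be chosen by the methods in [18] rather than set to a constant, and the Lasso would typically be run with a fixed regularization parameter of order {(log p)/n}^{1/2} as in the theoretical analysis; the cross-validated choice above is only for convenience of illustration.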
To simplify the notation, we rewrite (X (2) ; y (2) ) as (X; y) with sample size n whenever there is no confusion, where n now represents half of the original sample size. For each given pp symmetric positive denite matrix , an np knockos matrix e X = (e x 1 ; ;e x n ) T can be constructed with n rows independently generated according to (3.12) and the modied knockos procedure proceeds with a given reduced modelS. Then the FDP and FDR functions in (3.1) can be rewritten as FDR n ( ;S) =E[FDP n (y; X S ; e X S )]; (3.13) where the subscript n is used to emphasize the dependence of FDP and FDR functions on sample size. It is easy to check that the knockos procedure based on (y; X S ; e X 0 S ) satises all the conditions in [18] for FDR control for any reduced modelS that is independent of X and e X 0 , which ensures that FDR n ( 0 ;S) can be controlled at the target level q. To study the robustness 64 of our modied knockos procedure, we will make a connection between functions FDR n ( ;S) and FDR n ( 0 ;S). To ease the presentation, denote by e X 0 = e X 0 the oracle knockos matrix with = 0 , C 0 = C 0 , and B 0 = B 0 . The following proposition establishes a formal characterization of the FDR as a function of the precision matrix used in generating the knocko variables and the reduced modelS. Proposition 3.1. For any given symmetric positive denite 2 R pp andSf1; ;pg, it holds that FDR n ( ;S) =E h g n X S aug H i ; (3.14) where X S aug = [X; e X 0;S ]2R n(p+jSj) , function g n () is some conditional expectation of the FDP function whose functional form is free of andS, and H = 0 B B @ I p C S C 0;S (B T 0;S B 0;S ) 1=2 (B S ) T B S 1=2 0 (B T 0;S B 0;S ) 1=2 (B S ) T B S 1=2 1 C C A : We see from Proposition 3.1 that when = 0 , it holds that H 0 = I p+jSj and thus the value of the FDR function at point 0 reduces to FDR n ( 0 ;S) =E h g n X S aug i ; which can be shown to be bounded from above by the target FDR level q using the results proved in [18]. Since the dependence of FDR function on is completely through matrix H , we can reparameterize the FDR function as FDR n (H ;S). In view of (3.14), FDR n (H ;S) is the expectation of some measurable function with respect to the probability law of X S aug which has matrix normal distribution with independent rows, and thus is expected to be a smooth function 65 of entries of H by measure theory. Motivated by such an observation, we make the following Lipschitz continuity assumption. Assumption 3.4. There exists some constant L > 0 such that for alljSj K n andk 0 k 2 C 2 a n with some constant C 2 > 0 and a n ! 0, FDR n (H ;S) FDR n (H 0 ;S) L H H 0 F , wherekk 2 andkk F denote the matrix spectral norm and matrix Frobenius norm, respectively. Assumption 3.5. Assume that the estimated precision matrix b satisesk b 0 k 2 C 2 a n with probability 1O(p c1 ) for some constants C 2 ;c 1 > 0 and a n ! 0, and thatj e SjK n . The error rate of precision matrix estimation assumed in Condition 3.5 is quite exible. We would like to emphasize that no sparsity assumption has been made on the true precision matrix 0 . Bounding the size of sparse models is also important for ensuring model identiability and stability; see, for instance, [50, 78] for more detailed discussions. Theorem 3.2. Assume that all the eigenvalues of 0 are bounded away from 0 and1 and the smallest eigenvalue of 2diag(s)diag(s) 0 diag(s) is bounded from below by some positive constant. 
Then under Condition 3.4, it holds that sup jSjKn;k 0k2C2an jFDR n (H ;S) FDR(H 0 ;S)jO(K 1=2 n a n ): (3.15) Moreover, under Conditions 3.4{3.5 with K 1=2 n a n ! 0, the FDR of RANK is bounded from above by q +O(K 1=2 n a n ) +O(p c1 ), where q2 (0; 1) is the target FDR level. Theorem 3.2 establishes the robustness of the FDR with respect to the precision matrix ; see the uniform bound in (3.15). As a consequence, it shows that our modied model-X knockos procedure RANK can indeed have FDR asymptotically controlled at the target level q. We remark that the term K 1=2 n in Theorem 3.2 is because Condition 3.4 is imposed through the matrix Frobenius norm, which is motivated from results on the smoothness of integral function from 66 calculus. If one is willing to impose assumption through matrix spectral norm instead of Frobenius norm, then the extra term K 1=2 n can be dropped and the setS can be taken as the full model f1; ;pg. We would like to stress that Theorem 3.2 allows for arbitrarily complicated dependence structure of response y on covariates x and for any valid construction of knocko statistics W j 's. This is dierent from the conditions needed for power analysis in Section 3.2.2 (that is, the linear model setting and LCD knocko statistics). Moreover, the asymptotic FDR control in Theorem 3.2 does not need the sure screening property of Pf e SS 0 g! 1 as n!1. 3.3.3 Robustness of power in linear models We are now curious about the other side of the coin; that is, the robustness theory for the power of our modied knockos procedure RANK. As argued at the beginning of Section 3.2.2, to ease the presentation and simplify the technical derivations we come back to high-dimensional linear models (3.8) and use the LCD in (3.9) as the knocko statistics. The dierence with the setting in Section 3.2.2 is that we no longer assume that the true precision matrix 0 is known and use the modied knockos procedure introduced in Section 3.3.1 to achieve asymptotic FDR control. Recall that for the RANK procedure, the reduced model e S is rst obtained from an independent subsample and then the knockos procedure is applied on the second fold of data to further select features from e S. Clearly if e S does not have the sure screening property of Pf e SS 0 g! 1 as n!1, then the Lasso solution based on [X (2) e S ; e X e S ] as given in (3.18) is no longer a consistent estimate of 0 even when the true precision matrix 0 is used to generate the knocko variables. In addition, the nal power of our modied knockos procedure will always be upper bounded by s 1 j e S\S 0 j. Nevertheless, the results in this section are still useful in the sense that model (3.8) can be viewed as the projected model on support e S. Thus our power analysis here is relative power analysis with respect to the reduced model e S. In other words, we will focus on how much power loss would occur after we apply the model-X knockos procedure to (X (2) e S ; e X e S ; y (2) ) when 67 compared to the power of s 1 j e S\S 0 j. Since our focus is relative power loss, without loss of generality we will condition on the event n e SS 0 o : (3.16) We would like to point out that all conditions and results in this section can be adapted corre- spondingly when we view model (3.8) as the projected model if e S6S 0 . Similarly as in FDR analysis, we restrict ourselves to sparse models with size bounded by K n that diverges as n!1, that is,j e SjK n . With taken as the estimated precision matrix b , we can generate the knocko variables from (3.12). 
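As a sketch of this construction (Step 3 of the modified procedure), the knockoff matrix can be generated from (3.12) as follows; the equicorrelated choice s_j = 1/lambda_max of the estimated precision matrix, used later in Section 3.4.2, is assumed here, and the function name is ours.

    import numpy as np
    from scipy.linalg import sqrtm

    def construct_knockoffs(X2, Omega_hat, seed=0):
        """Step 3 (a sketch): generate the knockoff matrix from (3.12)."""
        n, p = X2.shape
        # Equicorrelated choice s_j = 1 / lambda_max(Omega_hat), as in Section 3.4.2.
        s = 1.0 / np.linalg.eigvalsh(Omega_hat).max()
        D = s * np.eye(p)                                  # diag(s)
        B = np.real(sqrtm(2.0 * D - D @ Omega_hat @ D))    # square root matrix B_Omega
        Z = np.random.default_rng(seed).standard_normal((n, p))
        # Each row is drawn from N(x_i - diag(s) Omega_hat x_i, B^2),
        # the conditional distribution in (3.12).
        return X2 - X2 @ Omega_hat @ D + Z @ B

Any choice of s with 2 diag(s) - diag(s) Omega_hat diag(s) positive semidefinite would be valid; the equicorrelated value above is simply the choice adopted later in the simulations.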
Then the Lasso procedure can be applied to the augmented data (X (2) ; b X; y (2) ) with b X constructed in Step 3 of our modied knockos procedure and the LCD can be dened as c W j =W b ; e S j =j b j (; b ; e S)jj b p+j (; b ; e S)j; (3.17) where b j (; b ; e S) and b p+j (; b ; e S) are the jth and (j +p)th components, respectively, of the Lasso estimator b (; b ; e S) = argmin b e S 1 =0 n n 1 y (2) [X (2) ; b X]b 2 2 +kbk 1 o (3.18) with 0 the regularization parameter and e S 1 =f1j 2p :j62 e S and jp62 e Sg. Unlike the FDR analysis in Section 3.3.2, we now need sparsity assumption on the true precision matrix 0 . Assumption 3.6. Assume that 0 is L p -sparse with each row having at most L p nonzeros for some diverging L p and all the eigenvalues of 0 are bounded away from 0 and1. 68 For each given precision matrix and reduced modelS, we dene W ;S j similarly as in (3.17) except that is used to generate the knocko variables and setS is used in (3.18) to calculate the Lasso solution. Denote by b S =fj :W ;S j TgS the nal set of selected features using the LCD W ;S j in the knockos inference framework. We further dene a class of precision matrices 2R pp A = : is L 0 p -sparse andk 0 k 2 C 2 a n ; (3.19) where C 2 and a n are the same as in Theorem 3.2 and L 0 p is some positive integer that diverges with n. Similarly as in Section 3.2.2, in the technical analysis we assume implicitly that with asymptotic probability one, for all valid constructions of the knocko variables there are no ties in the magnitude of nonzero knocko statistics and no ties in the magnitude of nonzero components of Lasso solution uniformly over all 2A andjSjK n . Assumption 3.7. It holds that Pf b 2Ag = 1O(p c2 ) for some constant c 2 > 0. The assumption on the estimated precision matrix b made in Condition 3.7 is mild and exible. A similar class of precision matrices was considered in [49] with detailed discussions on the choices of the estimation procedures. See, for example, [51, 22] for some more recent developments on large precision matrix estimation. In parallel to Theorem 3.1, we have the following results on the power of our modied knockos procedure with the estimated precision matrix b . Theorem 3.3. Assume that Conditions 3.1{3.2 and 3.6{3.7 hold, the smallest eigenvalue of 2diag(s)diag(s) 0 diag(s) is positive and bounded away from 0,jfj :j 0;j j [sn 1 (logp)] 1=2 gj cs, and = C 3 f(logp)=ng 1=2 with c2 ((qs) 1 ; 1) and C 3 > 0 some constants. Then if [(L p + L 0 p ) 1=2 +K 1=2 n ]a n = o(1) and sfa n + (K n +L 0 p )[n 1 (logp)] 1=2 g = o(1), RANK with estimated precision matrix b and reduced model e S has asymptotic power one. Theorem 3.3 establishes the robustness of the power for the RANK method. In view of Theorems 3.2{3.3, we see that our modied knockos procedure RANK can enjoy appealing properties of 69 FDR control and power simultaneously when the true covariate distribution is unknown and needs to be estimated in high dimensions. 3.4 Simulation studies So far we have seen that our suggested RANK method admits appealing theoretical properties for large-scale inference in high-dimensional nonlinear models. We now examine the nite-sample performance of RANK through four simulation examples. 3.4.1 Model setups and simulation settings Recall that the original knocko lter (KF) in [8] was designed for linear regression model with dimensionality p not exceeding sample size n, while the high-dimensional knocko lter (HKF) in [9] considers linear model with p possibly larger than n. 
To compare RANK with the HKF procedure in high-dimensional setting, our rst simulation example adopts the linear regression model y = X +"; (3.20) where y is ann-dimensional response vector, X is annp design matrix, = ( 1 ; ; p ) T is ap- dimensional regression coecient vector, and" is ann-dimensional error vector. Nonlinear models provide useful and exible alternatives to linear models and are widely used in real applications. Our second through fourth simulation examples are devoted to three popular nonlinear model settings: the partially linear model, the single-index model, and the additive model, respectively. As a natural extension of linear model (3.20), the partially linear model assumes that y = X + g(U) +"; (3.21) 70 where g(U) = (g(U 1 ); ;g(U n )) T is an n-dimensional vector-valued function with covariate vector U = (U 1 ; ;U n ) T , g() is some unknown smooth nonparametric function, and the rest of notation is the same as in model (3.20). In particular, the partially linear model is a semiparametric regression model that has been commonly used in many areas such as economics, nance, medicine, epidemiology, and environmental science [32, 58]. The third and fourth simulation examples drop the linear component. As a popular tool for dimension reduction, the single-index model assumes that y = g(X) +"; (3.22) where g(X) = (g(x T 1 ); ;g(x T n )) T with X = (x 1 ; ; x n ) T ,g() is an unknown link function, and the remaining notation is the same as in model (3.20). In particular, the single-index model provides a exible extension of the GLM by relaxing the parametric form of the link function [67, 98, 59, 74, 63]. To bring more exibility while alleviating the curse of dimensionality, the additive model assumes that y = p X j=1 g j (X j ) +"; (3.23) where g j () = (g j ( 1 ); ;g j ( n )) T for = ( 1 ; ; n ) T , X j represents the jth covariate vector with X = (X 1 ; ; X p ), g j ()'s are some unknown smooth functions, and the rest of notation is the same as in model (3.20). The additive model has been widely employed for nonparametric modeling of high-dimensional data [60, 90, 81, 23]. For the linear model (3.20) in simulation example 1, the rows of the np design matrix X are generated as i.i.d. copies of N(0; ) with precision matrix 1 = ( jjkj ) 1j;kp for = 0 and 0:5. We set the true regression coecient vector 0 2R p as a sparse vector with s = 30 nonzero components, where the signal locations are chosen randomly and each nonzero coecient is selected randomly fromfAg with A = 1:5 and 3:5. The error vector" is assumed to be N(0; 2 I n ) with = 1. We set sample size n = 400 and consider the high-dimensional scenario with dimensionality 71 p = 200; 400; 600; 800, and 1000. For the partially linear model (3.21) in simulation example 2, we choose the true function as g(U) = sin(2U), generate U = (U 1 ; ;U n ) T with i.i.d. U i from uniform distribution on [0; 1], and setA = 1:5 with the remaining setting the same as in simulation example 1. Since the single-index model and additive model are more complex than the linear model and partially linear model, we reduce the true model size s while keeping sample size n = 400 in both simulation examples 3 and 4. For the single-index model (3.22) in simulation example 3, we consider the true link function g(x) = x 3 =2 and set p = 200; 400; 600; 800, and 1000. The true p-dimensional regression coecient vector 0 is generated similarly with s = 10 and A = 1:5. 
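As an illustration of the setting just described, data for simulation example 3 could be generated as in the sketch below; the random +/-A sign convention for the nonzero coefficients and the unit error variance are carried over from simulation example 1 and are assumptions where the text is ambiguous.

    import numpy as np

    def simulate_single_index(n=400, p=600, s=10, A=1.5, rho=0.5, seed=0):
        """Simulation example 3 (a sketch): single-index model (3.22) with
        g(x) = x^3 / 2 and precision matrix (rho^{|j-k|})."""
        rng = np.random.default_rng(seed)

        # Design: rows i.i.d. N(0, Sigma) with Sigma^{-1} = (rho^{|j-k|}).
        Omega0 = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
        Sigma = np.linalg.inv(Omega0)
        X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T

        # Sparse coefficient vector: s random locations with magnitude A
        # (random signs are an assumption where the text is ambiguous).
        beta = np.zeros(p)
        support = rng.choice(p, size=s, replace=False)
        beta[support] = A * rng.choice([-1.0, 1.0], size=s)

        # Response with N(0, 1) errors, as in simulation example 1.
        y = (X @ beta) ** 3 / 2.0 + rng.standard_normal(n)
        return X, y, beta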
For the additive model (3.23) in simulation example 4, we assume that s = 10 of the functions g j ()'s are nonzero with j's chosen randomly fromf1; ;pg and the remaining p 10 functions g j ()'s vanish. Specically, each nonzero function g j () is taken to be a polynomial of degree 3 and all coecients under the polynomial basis functions are generated independently as N(0; 10 2 ) as in [23]. The dimensionality p is allowed to vary with values 200; 400; 600; 800, and 1000. For each simulation example, we set the number of repetitions as 100. 3.4.2 Estimation procedures To implement RANK procedure described in Section 3.3.1, we need to construct a precision matrix estimator b and obtain the reduced model e S using the rst fold of data (X (1) ; y (1) ). Among all available estimators in the literature, we employ the ISEE method in [51] for precision matrix estimation due to its scalability, simple tuning, and nice theoretical properties. For simplicity, we choose s j = 1= max ( b ) for all 1jp, where b denotes the ISEE estimator for the true precision matrix 0 and max standards for the largest eigenvalue of a matrix. Then we can obtain an (n=2) (2p) augmented design matrix [X (2) ; b X], where b X represents an (n=2)p knockos matrix constructed in Step 3 of our modied knockos procedure in Section 3.3.1. To construct the reduced model e S using the rst fold of data (X (1) ; y (1) ), we borrow the strengths from the 72 recent literature on feature selection methods. After e S is obtained, we employ the reduced data (X e S aug ; y (2) ) with X e S aug = [X (2) e S ; b X e S ] to t a model and construct the knocko statistics. In what follows, we will discuss feature selection methods for obtaining e S for the linear model (3.20), partially linear model (3.21), single-index model (3.22), and additive model (3.23) in simulation examples 1{4, respectively. We will also discuss the construction of knocko statistics in each model setting. For the linear model (3.20) in simulation example 1, we obtain the reduced model e S by rst applying the Lasso procedure b (1) = argmin b2R p n n 1 ky (1) X (1) bk 2 2 +kbk 1 o (3.24) with 0 the regularization parameter and then taking the support e S = supp( b (1) ). Then with the estimated b and e S, we construct the knocko statistics as the LCD (3.17), where the estimated regression coecient vector is obtained by applying the Lasso procedure on the reduced model as described in (3.18). The regularization parameter in Lasso is tuned using the K-fold cross-validation (CV). For the partially linear model (3.21) in simulation example 2, we employ the proling method in semiparametric regression based on the rst fold of data (X (1) ; U (1) ; y (1) ) by observing that model (3.21) becomes a linear model when conditioning on the covariate vector U (1) . Consequently we need to estimate both the proled response E(y (1) jU (1) ) and the proled covariates E(X (1) jU (1) ). To this end, we adopt the local linear smoothing estimators [35] \ E(y (1) jU (1) ) and \ E(X (1) jU (1) ) of E(y (1) jU (1) ) and E(X (1) jU (1) ) using the Epanechnikov kernel K(u) = 0:75(1u 2 ) + with the optimal bandwidth selected by the generalized cross-validation (GCV). Then we dene the Lasso estimator b (1) for the p-dimensional regression coecient vector similarly as in (3.24) with y (1) and X (1) replaced by y (1) \ E(y (1) jU (1) ) and X (1) \ E(X (1) jU (1) ), respectively. The reduced model is then taken as e S = supp( b (1) ). For knocko statistics c W j , we set c W j = 0 for all j62 e S. 
On the 73 support e S, we construct c W j =j b j jj b p+j j with b j and b p+j the Lasso coecients obtained by applying the model tting procedure described above to the reduced data (X e S aug ; U (2) e S ; y (2) ) in the second subsample with X e S aug = [X (2) e S ; b X e S ]. To t the single-index model (3.22) in simulation example 3, we employ the Lasso-SIR method in [75]. The Lasso-SIR rst divides the sample of m =n=2 observations in the rst subsample (X (1) ; y (1) ) intoH slices of equal lengthc, and constructs the matrix H = 1 mc (X (1) ) T MM T X (1) , where M = I H 1 c is an mH matrix that is the Kronecker product of the identity matrix I H and the constant vector 1 c of ones. Then the Lasso-SIR estimates the p-dimensional regression coecient vector b (1) using the Lasso procedure similarly as in (3.18) with the original response vector y (1) replaced by a new response vectore y (1) = (c 1 ) 1 MM T X (1) 1 , where 1 denotes the largest eigenvalue of matrix H and 1 is the corresponding eigenvector. We set the number of slices H = 5. Then the reduced model is taken as e S = supp( b (1) ). We then apply the tting procedure Lasso-SIR discussed above to the reduced data (X e S aug ; y (2) ) with X e S aug = [X (2) e S ; b X e S ] and construct knocko statistics in a similar way as in partially linear model. To t the additive model (3.23) in simulation example 4, we apply the GAMSEL procedure in [23] for sparse additive regression. In particular, we choose 6 basis functions each with 6 degrees of freedom for the smoothing splines using orthogonal polynomials for each additive component and set the penalty mixing parameter = 0:9 in GAMSEL to obtain estimators of the true functions g j ()'s. The GAMSEL procedure is rst applied to the rst subsample (X (1) ; y (1) ) to obtain the reduced model e S, and then applied to the reduced data (X e S aug ; y (2) ) with X e S aug = [X (2) e S ; b X e S ] to obtain estimatesb g j andb g p+j for the additive functions corresponding to the jth covariate and its knocko counterpart with j2 e S, respectively. The knocko statistics are then constructed as c W j =kb g j k 2 n=2 kb g p+j k 2 n=2 for j2 e S (3.25) 74 and c W j = 0 forj62 e S, wherekb g j k n=2 represents the empirical norm of the estimated functionb g j () evaluated at its observed points and n=2 stands for the size of the second subsample. It is seen that in all four examples above, intuitively large positive values of knocko statistics c W j provide strong evidence against the jth null hypothesis H 0;j : j = 0 or H 0;j :g j = 0. For all simulation examples, we set the target FDR level at q = 0:2. 
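To make the selection step concrete, the sketch below computes LCD-type knockoff statistics on the reduced support and applies the knockoff and knockoff+ thresholds at a target level q. The inputs are the second-fold design and its knockoff copy restricted to the reduced model (obtained as in the earlier sketches); the function name is ours.

    import numpy as np
    from sklearn.linear_model import LassoCV

    def knockoff_filter_lcd(X2_S, Xk_S, y2, q=0.2, plus=True):
        """Step 4 (a sketch): LCD statistics on the reduced support and the
        knockoff(+) threshold at target FDR level q."""
        k = X2_S.shape[1]
        coef = LassoCV(cv=5).fit(np.hstack([X2_S, Xk_S]), y2).coef_

        # Lasso coefficient difference statistics, as in (3.17);
        # features outside the reduced model implicitly have W_j = 0.
        W = np.abs(coef[:k]) - np.abs(coef[k:])

        # Smallest threshold t with estimated FDP at most q
        # (offset 1 gives the modified knockoff+ threshold T_+).
        offset = 1 if plus else 0
        T = np.inf
        for t in np.sort(np.unique(np.abs(W[W != 0]))):
            if (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= q:
                T = t
                break
        return np.flatnonzero(W >= T)      # selected indices within the reduced model

The returned indices are relative to the reduced model and are mapped back to the original feature labels when reporting the final set of selected features.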
Table 3.1: Simulation results for linear model (3.20) in simulation example 1 with A = 1:5 in Section 3.4.1 RANK RANK + RANKs RANKs + p FDR Power FDR Power FDR Power FDR Power 0 200 0.2054 1.00 0.1749 1.00 0.1909 1.00 0.1730 1.00 400 0.2062 1.00 0.1824 1.00 0.2010 1.00 0.1801 1.00 600 0.2263 1.00 0.1940 1.00 0.2206 1.00 0.1935 1.00 800 0.2385 1.00 0.1911 1.00 0.2247 1.00 0.1874 1.00 1000 0.2413 1.00 0.2083 1.00 0.2235 1.00 0.1970 1.00 0.5 200 0.2087 1.00 0.1844 1.00 0.1875 1.00 0.1692 1.00 400 0.2144 1.00 0.1879 1.00 0.1954 1.00 0.1703 1.00 600 0.2292 1.00 0.1868 1.00 0.2062 1.00 0.1798 1.00 800 0.2398 1.00 0.1933 1.00 0.2052 0.9997 0.1805 0.9997 1000 0.2412 1.00 0.2019 1.00 0.2221 0.9984 0.2034 0.9984 Table 3.2: Simulation results for linear model (3.20) in simulation example 1 with A = 3:5 in Section 3.4.1 RANKs RANKs + HKF HKF + p FDR Power FDR Power FDR Power FDR Power 0 200 0.1858 1.00 0.1785 1.00 0.1977 0.9849 0.1749 0.9837 400 0.1895 1.00 0.1815 1.00 0.2064 0.9046 0.1876 0.8477 600 0.2050 1.00 0.1702 1.00 0.1964 0.8424 0.1593 0.7668 800 0.2149 1.00 0.1921 1.00 0.1703 0.7513 0.1218 0.6241 1000 0.2180 1.00 0.1934 1.00 0.1422 0.7138 0.1010 0.5550 0.5 200 0.1986 1.00 0.1618 1.00 0.1992 0.9336 0.1801 0.9300 400 0.1971 1.00 0.1805 1.00 0.1657 0.8398 0.1363 0.7825 600 0.2021 1.00 0.1757 1.00 0.1253 0.7098 0.0910 0.6068 800 0.2018 1.00 0.1860 1.00 0.1374 0.6978 0.0917 0.5792 1000 0.2097 0.9993 0.1920 0.9993 0.1552 0.6486 0.1076 0.5524 75 Table 3.3: Simulation results for partially linear model (3.21) in simulation example 2 in Section 3.4.1 RANK RANK + RANKs RANKs + p FDR Power FDR Power FDR Power FDR Power 0 200 0.2117 1.00 0.1923 1.00 0.1846 0.9976 0.1699 0.9970 400 0.2234 1.00 0.1977 1.00 0.1944 0.9970 0.1747 0.9966 600 0.2041 1.00 0.1776 1.00 0.2014 0.9968 0.1802 0.9960 800 0.2298 1.00 0.1810 1.00 0.2085 0.9933 0.1902 0.9930 1000 0.2322 1.00 0.1979 1.00 0.2113 0.9860 0.1851 0.9840 0.5 200 0.2180 1.00 0.1929 1.00 0.1825 0.9952 0.1660 0.9949 400 0.2254 1.00 0.1966 1.00 0.1809 0.9950 0.1628 0.9948 600 0.2062 1.00 0.1814 1.00 0.2038 0.9945 0.1898 0.9945 800 0.2264 1.00 0.1948 1.00 0.2019 0.9916 0.1703 0.9906 1000 0.2316 1.00 0.2033 1.00 0.2127 0.9830 0.1857 0.9790 Table 3.4: Simulation results for single-index model (3.22) in simulation example 3 in Section 3.4.1 RANK RANK + RANKs RANKs + p FDR Power FDR Power FDR Power FDR Power 0 200 0.1893 1 0.1413 1 0.1899 1 0.1383 1 400 0.2163 1 0.1598 1 0.245 0.998 0.1676 0.997 600 0.2166 1 0.1358 1 0.2314 0.999 0.1673 0.998 800 0.1964 1 0.1406 1 0.2443 0.992 0.1817 0.992 1000 0.2051 1 0.134 1 0.2431 0.969 0.1611 0.962 0.5 200 0.2189 1 0.1591 1 0.2322 1 0.1626 1 400 0.2005 1 0.1314 1 0.2099 0.996 0.1615 0.995 600 0.2064 1 0.1426 1 0.2331 0.998 0.1726 0.998 800 0.2049 1 0.1518 1 0.2288 0.994 0.1701 0.994 1000 0.2259 1 0.1423 1 0.2392 0.985 0.185 0.983 Table 3.5: Simulation results for additive model (3.23) in simulation example 4 in Section 3.4.1 RANK RANK + RANKs RANKs + p FDR Power FDR Power FDR Power FDR Power 0 200 0.1926 0.9780 0.1719 0.9690 0.2207 0.9490 0.1668 0.9410 400 0.2094 0.9750 0.1773 0.9670 0.2236 0.9430 0.1639 0.9340 600 0.2155 0.9670 0.1729 0.9500 0.2051 0.9310 0.1620 0.9220 800 0.2273 0.9590 0.1825 0.9410 0.2341 0.9280 0.1905 0.9200 1000 0.2390 0.9570 0.1751 0.9350 0.2350 0.9140 0.1833 0.9070 0.5 200 0.1904 0.9680 0.1733 0.9590 0.2078 0.9370 0.1531 0.9330 400 0.2173 0.9650 0.1701 0.9540 0.2224 0.9360 0.1591 0.9280 600 0.2267 0.9600 0.1656 0.9360 0.2366 0.9340 0.1981 0.9270 800 0.2306 0.9540 0.1798 0.9320 0.2332 0.9150 0.1740 0.9110 
1000 0.2378 0.9330 0.1793 0.9270 0.2422 0.8970 0.1813 0.8880

3.4.3 Simulation results

To gain some insights into the effect of data splitting, we also implemented our procedure without the data splitting step. To differentiate, we use RANKs to denote the procedure with data splitting and RANK to denote the procedure without data splitting. To examine the feature selection performance, we look at both measures of FDR and power. The empirical versions of FDR and power based on 100 replications are reported in Tables 3.1-3.2 for simulation example 1 and Tables 3.3-3.5 for simulation examples 2-4, respectively. In particular, Table 3.1 compares the performance of RANK and RANK+ with that of RANKs and RANKs+, where the subscript + stands for the corresponding method when the modified knockoff threshold T+ is used. We see from Table 3.1 that RANK and RANK+ mimic closely RANKs and RANKs+, respectively, suggesting that data splitting is more of a technical assumption than a practical necessity. In addition, the FDR is approximately controlled at the target level of q = 0.2 with high power, which is in line with our theory.

Table 3.2 summarizes the comparison of RANKs with the HKF procedure for the high-dimensional linear regression model. Although both methods are based on data splitting, their practical performance is very different. It is seen that although HKF controls the FDR below the target level, it suffers from a loss of power due to the use of the screening step, and the power deteriorates as the dimensionality p increases. In contrast, the performance of RANKs is robust across different correlation levels and dimensionality p. It is worth mentioning that the HKF procedure with data recycling performed generally better than that with data splitting alone, so only the results for the former version are reported in Table 3.2 for simplicity. For the high-dimensional nonlinear settings of the partially linear model, single-index model, and additive model in simulation examples 2-4, we see from Tables 3.3-3.5 that RANKs and RANKs+ performed well and similarly to RANK and RANK+ in terms of both FDR control and power across different scenarios. These results demonstrate the model-X feature of our procedure for large-scale inference in nonlinear models.

3.5 Real data analysis

In addition to the simulation examples presented in Section 3.4, we also demonstrate the practical utility of our RANK procedure on a gene expression data set based on Affymetrix GeneChip microarrays for the plant Arabidopsis thaliana in [113]. It is well known that isoprenoids play a key role in plant and animal physiological processes such as photosynthesis, respiration, regulation of growth, and defense against pathogens. In particular, [64] found that many of the genes expressed preferentially in mature leaves are readily recognizable as genes involved in photosynthesis, including rubisco activase (AT2G39730), fructose bisphosphate aldolase (AT4G38970), and two glycine hydroxymethyltransferase genes (AT4G37930 and AT5G26780). Thus isoprenoids have become important ingredients in various drugs (e.g., against cancer and malaria), fragrances (e.g., menthol), and food colorants (e.g., carotenoids). See, for instance, [113, 91, 88] on studying the mechanism of isoprenoid synthesis in a wide range of applications.
The aforementioned data set in [113] consists of 118 gene expression patterns under various experimental conditions for 39 isoprenoid genes, 15 of which are assigned to the regulatory pathway, 19 to the plastidal pathway, and the remaining 5 isoprenoid genes encode protein located in the mitochondrion. Moreover, 795 additional genes from 56 metabolic pathways are incorporated into the isoprenoid genetic network. Thus the combined data set is comprised of a sample of n = 118 gene expression patterns for 834 genes. This data set was studied in [115] for identifying genes that exhibit signicant association with the specic isoprenoid gene GGPPS11 (AGI code AT4G36810). Motivated by [115], we choose the expression level of isoprenoid gene GGPPS11 as the response and treat the remaining p = 833 genes from 58 dierent metabolic pathways as the covariates, in which the dimensionality p is much larger than sample size n. All the variables are logarithmically transformed. To identify important genes associated with isoprenoid gene GGPPS11, we employ the RANK method using the Lasso procedure with target FDR level q = 0:2. The implementation of RANK is the same as that in Section 3.4 for the linear model. Since the sample size of this 78 Table 3.6: Selected genes and their associated pathways for real data analysis in Section 3.5 RANK RANK + Pathway Gene Pathway Gene Calvin AT4G38970 Calvin AT4G38970 Carote AT1G57770 Carote AT1G57770 Folate AT1G78670 Folate AT1G78670 Inosit AT3G56960 Phenyl AT2G27820 Phenyl AT2G27820 Purine AT3G01820 Purine AT3G01820 Ribo AT4G13700 Ribo AT2G01880 Ribo AT2G01880 Starch AT5G19220 Starch AT5G19220 Lasso Pathway Gene Pathway Gene Berber AT2G34810 Porphy AT4G18480 Calvin AT4G38970 Pyrimi AT5G59440 Calvin AT3G04790 Ribo AT2G01880 Glutam AT5G18170 Starch AT5G19220 Glycol AT4G27600 Starch AT2G21590 Pentos AT3G04790 Trypto AT5G48220 Phenyl AT2G27820 Trypto AT5G17980 Porphy AT1G03475 Mevalo AT5G47720 Porphy AT3G51820 data set is relatively low, we choose to implement RANK without sample splitting, which has been demonstrated in Section 3.4 to be capable of controlling the FDR at the desired level. Table 3.6 lists the selected genes by RANK, RANK + , and Lasso along with their associated pathways. We see from Table 3.6 that RANK, RANK + , and Lasso selected 9 genes, 7 genes, and 17 genes, respectively. The common set of four genes, AT4G38970, AT2G27820, AT2G01880, and AT5G19220, was selected by all three methods. The values of the adjusted R 2 for these three selected models are equal to 0:7523, 0:7515, and 0:7843, respectively, showing similar level of goodness of t. In particular, among the top 20 genes selected using the Elem-OLS method with entrywise transformed Gram matrix in [115], we found that ve genes (AT1G57770, AT1G78670, AT3G56960, AT2G27820, and AT4G13700) selected by RANK are included in such a list of top 20 genes, and three genes (AT1G57770, AT1G78670, and AT2G27820) picked by RANK + are contained in the same list. 79 To gain some scientic insights into the selected genes, we conducted Gene Ontology (GO) enrichment analysis to interpret, from the biological point of view, the in uence of selected genes on isoprenoid gene GGPPS11, which is known as a precursor to chloroplast, carotenoids, tocopherols, and abscisic acids. Specically, in the enrichment test of GO biological process, gene AT1G57770 is involved in carotenoid biosynthetic process. 
In the GO cellular component enrichment test, genes AT4G38970 and AT5G19220 are located in the chloroplast, chloroplast envelope, and chloroplast stroma; gene AT1G57770 is located in the chloroplast and mitochondrion; and gene AT2G27820 is located in the chloroplast, chloroplast stroma, and cytosol. The GO molecular function enrichment test shows that gene AT4G38970 has fructose-bisphosphate aldolase activity and gene AT1G57770 has carotenoid isomerase activity and oxidoreductase activity. These scientific insights in terms of biological process, cellular component, and molecular function suggest that the selected genes may have a meaningful biological relationship with the target isoprenoid gene GGPPS11. See, for example, [64, 89, 112] for more discussions on these genes.

3.6 Discussions

Our analysis in this chapter reveals that the suggested RANK method, which exploits the general framework of model-X knockoffs introduced in [18], can asymptotically control the FDR in general high-dimensional nonlinear models with unknown covariate distribution. The robustness of the FDR control under an estimated covariate distribution is enabled by imposing the Gaussian graphical structure on the covariates. Such a structural assumption has been widely employed to model the association networks among the covariates and has been extensively studied in the literature. Our method and theoretical results are powered by scalable large precision matrix estimation with statistical efficiency. It would be interesting to extend the robustness theory of the FDR control beyond Gaussian designs as well as to heavy-tailed data and dependent observations.

Our work also provides a first attempt at power analysis for the model-X knockoffs framework. The nontrivial technical analysis establishes that RANK can have asymptotic power one in the high-dimensional linear model setting when the Lasso is used for sparse regression. It would be interesting to extend the power analysis for RANK to a wide class of sparse regression and feature screening methods including SCAD, SIS, and many other concave regularization methods [39, 40, 34, 50]. Though more challenging, it is also important to investigate the power property of RANK beyond linear models. Our RANK procedure utilizes the idea of data splitting, which plays an important role in our technical analysis. Our numerical examples, however, suggest that data splitting is more of a technical assumption than a practical necessity. It would be interesting to develop theoretical guarantees for RANK without data splitting. These extensions are interesting topics for future research.

3.7 Proofs of main results

We provide the proofs of Theorems 3.1-3.3, Propositions 3.1-3.2, and Lemmas 3.1-3.2 in this section. Additional technical details for the proofs of Lemmas 3.3-3.8 are provided in Section 3.8. To ease the technical presentation, we first introduce some notation. Let $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ denote the smallest and largest eigenvalues of a symmetric matrix. For any matrix $A = (a_{ij})$, denote by $\|A\|_1 = \max_j \sum_i |a_{ij}|$, $\|A\|_{\max} = \max_{i,j} |a_{ij}|$, $\|A\|_2 = \lambda_{\max}^{1/2}(A^T A)$, and $\|A\|_F = [\mathrm{tr}(A^T A)]^{1/2}$ the matrix $\ell_1$-norm, entrywise maximum norm, spectral norm, and Frobenius norm, respectively. For any set $S \subset \{1, \ldots, p\}$, we use $A_S$ to represent the submatrix of $A$ formed by columns in set $S$ and $A_{S,S}$ to denote the principal submatrix formed by columns and rows in set $S$.

3.7.1 Proofs of Lemma 3.1 and Theorem 3.1

Observe that the choice of $\widetilde{S} = \{1, \ldots, p\}$ certainly satisfies the sure screening property.
We see that Lemma 3.1 and Theorem 3.1 are specic cases of Lemma 3.6 in Section 3.8.4 and Theorem 3.3, respectively. Thus we only prove the latter ones. 3.7.2 Proof of Proposition 3.1 In this proof, we will consider andS as deterministic parameters and focus only on the second half of sample (X (2) ; y (2) ) used in FDR control. Thus, we will drop the superscripts in (X (2) ; y (2) ) whenever there is no confusion. For a given precision matrix , the matrix of knocko variables e X = [e x 1 ; ;e x n ] T can be generated using (3.12) with 0 replaced by . Here, we use the superscript to emphasize the dependence of knockos matrix on . Recall that for a given setS with k =jSj, we calculate the knocko statistics W j 's using (y; X S ; e X S ). Thus, the FDR function can be written as FDR n ( ;S) =E[FDP n (y; X S ; e X S )] =E g 1;n (X; e X S ) ; (3.26) whereg 1;n (X; e X S ) =E[FDP n (y; X S ; e X S ) X; e X S ]. It is seen that the functiong n is the conditional FDP when knocko variablese x i , 1in, are simulated using and only variables in setS are used to construct knocko statistics W j . We want to emphasize that since given X the response y is independent of e X S , the functional form of g 1;n is free of the matrix used to generate knocko variables. Using the technical arguments in [18], we can show that FDR n ( 0 ;S)q for any sample size n and all subsetsSf1; ;pg that are independent of the original data (X (2) ; y (2) ) used in the knockos procedure. Observe that the only dierence between FDR n ( ;S) and FDR n ( 0 ;S) is 82 that dierent precision matrices are used to generate knocko variables. We restrict ourselves to the following data generating scheme e x i = (C ) T x i + B z i ; i = 1; ;n; where C = I p diagfsg, z i N(0; I p )2 R p are i.i.d. normal random vectors that are independent of x i 's, and B = 2diagfsgdiagfsg diagfsg 1=2 . For simplicity, writee x (0) i =e x 0 i , B 0 = B 0 , and C 0 = C 0 , i.e., the matrices corresponding to the oracle case. Then restricted to setS, e x i;S = (C S ) T x i + (B S ) T z i ; e x (0) i;S = C T 0;S x i + (B 0;S ) T z i ; where the subscriptS means the submatrix (subvector) formed by columns (components) in setS. We want to make connections betweene x i;S ande x (0) i;S . To this end, construct x i;S = (C S ) T x i + e B T B T 0;S z i ; (3.27) where e B = (B T 0;S B 0;S ) 1=2 (B S ) T B S 1=2 . Then it is seen that (x i ;e x i;S ) and (x i ; x i;S ) have identical joint distribution. Although x i;S cannot be calculated in practice for a given due to its dependency on 0 , the random vector x i;S acts as a proxy ofe x i;S in studying the FDR function. In fact, by construction (3.26) can be further written as FDR n ( ;S) =E g 1;n (X; e X S ) =E g 1;n (X; X S ) ; (3.28) where X S = [ x 1;S ; ; x n;S ] T . Observe that the randomness in both e X (0) S and X S is fully determined by the same random matrices X and ZB 0;S , which are independent of each other and whose rows are i.i.d. copies from 83 N(0; 0 ) and N(0; B T 0;S B 0;S ), respectively. For this reason, we can rewrite the FDR function in (3.28) as FDR n ( ;S) =E[g 1;n (X S aug H )]; where X S aug = [X; e X 0;S ] = [X; XC 0;S + ZB 0;S ]2 R n(p+k) is the augmented matrix collecting columns of X and e X 0;S , and H = 0 B B @ I p C S C 0;S e B 0 e B 1 C C A ; which completes the proof of Proposition 3.1. 3.7.3 Lemma 3.2 and its proof Lemma 3.2. Assume thatk 0 k 2 =O(a n ) with a n ! 0 some deterministic sequence and all the notation the same as in Proposition 3.1. 
If min f2diag(s) diag(s) 0 diag(s)gc 0 and max ( 0 )c 1 0 for some constant c 0 > 0, then it holds that k e B I k k 2 c 1 k 0 k 2 =O(a n ); where e B is given in (3.27) and c 1 > 0 is some uniform constant independent of setS. Proof. We use C to denote some generic positive constant whose value may change from line to line. First note that (B S ) T B S B T 0;S B 0;S = diag(s)( 0 )diag(s) S;S : (3.29) 84 Further, since 0 2 1 diag(s) is positive denite it follows thatksk 1 2 max ( 0 ) 2c 1 0 . Thus it holds that k(B S ) T B S B T 0;S B 0;S k 2 Ck( 0 ) S;S k 2 Ck 0 k 2 =O(a n ): For n large enough, by the triangle inequality we have min ((B S ) T B S ) min (B T 0;S B 0;S ) + min ((B S ) T B S B T 0;S B 0;S ) min (B T 0 B 0 )O(a n ) = min 2diag(s) diag(s) 0 diag(s) O(a n ) c 0 =2: In addition, min ((B 0;S ) T B 0;S ) min ((B 0 ) T B 0 ) = min 2diag(s) diag(s) 0 diag(s) c 0 =2. The above two inequalities together with Lemma 2.2 in [92] entail that (B S ) T B S 1=2 (B 0;S ) T B 0;S 1=2 2 ( p c 0 =2 + p c 0 =2) 1 k(B S ) T B S (B 0;S ) T B 0;S k 2 Ck 0 k 2 =O(a n ); (3.30) where the last step is because of (3.29). Thus it follows that k e B I k k 2 (B S ) T B S 1=2 (B 0;S ) T B 0;S 1=2 2 k (B 0;S ) T B 0;S 1=2 k 2 Ck 0 k 2 1=2 min ((B 0 ) T B 0 )Ck 0 k 2 ; (3.31) where the last step comes from assumption min (B T 0 B 0 ) = min (2diag(s) diag(s) 0 diag(s)) c 0 . This concludes the proof of Lemma 3.2. 85 3.7.4 Proof of Theorem 3.2 We now proceed to prove Theorem 3.2 with the aid of Lemma 3.2 in Section 3.7.3. We use the same notation as in the proof of Proposition 3.1 and use C > 0 to denote a generic constant whose value may change from line to line. We start with proving (3.15). By Condition 3.4, we have jFDR(H ;S) FDR(H 0 ;S)jLkH H 0 k F ; (3.32) where the constant L is uniform over allk 0 kC 2 a n andjSjK n . Denote by k =jSj. By the denition of H , it holds that H H 0 = H I p+k = 0 B B @ 0 C S C 0;S e B 0 e B I k 1 C C A : By the denition and matrix norm inequality, we deduce kH H 0 k F =kC S C 0;S e Bk F +k e B I k k F p kkC S C 0;S e Bk 2 + p kk e B I k k 2 p K n kC S C 0;S k 2 +kC 0;S ( e B I k )k 2 +k e B I k k 2 p K n 1 +kC 0;S k 2 k e B I k k 2 +kC C 0 k 2 : Since 0 2 1 diag(s) is positive denite, it follows that s j 2 max ( 0 ) 2= min ( 0 ) C. ThuskC 0;S k 2 kC 0 k 2 = kI 0 diag(s)k 2 1 +k 0 k 2 kdiag(s)k 2 C. This along with kC C 0 k 2 =k( 0 )diag(s)k 2 Ca n and Lemma 3.2 entails thatkH H 0 k F can be further bounded as kH H 0 k F C p K n a n : 86 Combining the above result with (3.32) leads to sup fjSjKn;k 0kCang jFDR(H ;S) FDR(H 0 ;S)jO( p K n a n ); (3.33) which completes the proof of (3.15). We next establish the FDR control for RANK. By Condition 3.7, the eventE 0 =fk b 0 k 2 C 2 a n g occurs with probability at least 1O(p c1 ). Since b and e S are estimated from independent subsample (X (1) ; y (1) ), it follows from (3.15) that E FDP n (X (2) e S ; b X e S ) E 0 E FDP n (X (2) e S ; e X 0; e S ) E 0 sup jSjKn;k 0kC2an E FDP n (X (2) S ; e X S ) E 0 E FDP n (X (2) S ; e X 0;S ) E 0 = sup jSjKn;k 0kC2an E FDP n (X (2) S ; e X S ) E FDP n (X (2) S ; e X 0;S ) = sup jSjKn;k 0kC2an FDR(H ;S) FDR(H 0 ;S) O( p K n a n ): (3.34) Now note that by the property of conditional expectation, we have FDR n ( b ; e S) FDR n ( 0 ; e S) = E FDP n (X (2) e S ; b X e S ) E 0 E FDP n (X (2) e S ; e X 0; e S ) E 0 P(E 0 ) + E FDP n (X (2) e S ; b X e S ) E c 0 E FDP n (X (2) e S ; e X 0; e S ) E c 0 P(E c 0 ) I 1 +I 2 : Let us rst consider term I 1 . 
By (3.34), it holds that jI 1 j E FDP n (X (2) e S ; b X e S ) E 0 E FDP n (X (2) e S ; e X 0; e S ) E 0 O( p K n a n ): 87 We next consider term I 2 . Since FDP is always bounded between 0 and 1, we have jI 2 j 2P(E c 0 )O(p c1 ): Combining the above two results yields FDR n ( b ; e S) FDR n ( 0 ; e S) O( p K n a n ) +O(p c1 ): This together with the result of FDR n ( 0 ; e S)q mentioned in the proof of Proposition 3.1 in Section 3.7.2 completes the proof of Theorem 3.2. 3.7.5 Proof of Theorem 3.3 In this proof, we will drop the superscripts in (X (2) ; y (2) ) whenever there is no confusion. By the denition of power, for any given precision matrix and reduced modelS the power can be written as Power( ;S) =E[f(X S ; e X S ; y)]; where f is some function describing how the empirical power depends on the data. Note that f(X S ; e X S ; y) is a stochastic process indexed by , and we care about the mean of this process. Our main idea is to construct another stochastic process indexed by which has the same mean but possibly dierent distribution. Then by studying the mean of this new stochastic process, we can prove the desired result. We next provide more technical details of the proof. The proxy process is dened as X S = XC S + ZB 0;S (B T 0;S B 0;S ) 1=2 B S T B S 1=2 ; (3.35) 88 where C S is the submatrix of C = I p diagfsg, B S is the submatrix of B = diag(s) diag(s) diag(s) 1=2 , and B 0 = B 0 . It is easy to see that X S and e X S dened using (3.12) have the same distribution. Since Z is independent of (X; y), we can further conclude that (X S ; e X S ; y) and (X S ; X S ; y) have the same joint distribution for each given andS. Thus the power function can be further written as Power( ;S) =E[f(X S ; e X S ; y)] =E[f(X S ; X S ; y)]: Therefore, we only need to study the power of the knockos procedure based on the pseudo data (X S ; X S ; y). To simplify the technical presentation, we will slightly abuse the notation and still use b = b () = b (; ;S) to represent the Lasso solution based on pseudo data (X S ; X S ; y). We will use c and C to denote some generic positive constants whose values may change from line to line. Dene e G = 1 n e X T KO e X KO 2R (2p)(2p) and e = 1 n e X T KO y2R 2p (3.36) with e X KO = [X; X ]2R n(2p) the augmented design matrix. For any given setSf1; ;pg with k =jSj, (2p) (2p) matrix A, and (2p)-vector a, we will abuse the notation and denote by A S;S 2R (2k)(2k) the principal submatrix formed by columns and rows in setfj :j2S or jp2 Sg and a S 2 R 2k the subvector formed by components in setfj :j2S or jp2Sg. For any pp matrix B (or p-vector b), we dene B S (or b S ) in the same way meaning that columns (or components) in setS will be taken to form the submatrix (or subvector). 89 With the above notation, note that the Lasso solution b = ( b 1 ; ; b 2p ) T = b (; ;S) restricted to variables in S can be obtained by setting b j = 0 for j 2f1 j 2p : j 62 S and jp62Sg and minimizing the following objective function b S = arg min b2R 2k 1 2 b T e G S;S be T S b +kbk 1 : (3.37) By Proposition 3.2 in Section 3.7.6, with probability at least 1O(p c ) it holds that sup 2A;jSjKn k b (; ;S) T k 2 =O( p s); (3.38) sup 2A;jSjKn k b (; ;S) T k 1 =O(s); (3.39) where =C p (logp)=n with C > 0 some constant. Denote byW ;S j the LCD based on the above b (; ;S). Recall that by assumption, there are no ties in the magnitude of nonzero W ;S j 's and no ties in the nonzero components of the Lasso solution with asymptotic probability one. 
LetjW ;S (1) jjW ;S (p) j be the ordered knocko statistics according to magnitude. Denote by j the index such thatjW ;S (j ) j =T. Then by the denition of T, it holds thatT <W ;S (j +1) 0. We next analyze the two cases of W ;S (j +1) = 0 andT <W ;S (j +1) < 0 separately. Case 1. For the case of W ;S (j +1) = 0, b S =fj :W ;S j > 0g and jfj :W ;S j < 0gj jfj :W ;S j > 0gj q by denition. Ifjfj :W ;S j < 0gj>q 1 s for some constant q 1 > 0, then it reduces to case 2 and we can prove that TO() and the rest follows. 90 Ifjfj :W ;S j < 0gjo(s), then b S \S 0 = supp(W ;S )\S 0 fj :W ;S j < 0g\S 0 supp(W ;S )\S 0 o(s) (3.40) since b S = supp(W ;S )nfj :W ;S j < 0g. Now, we focus on supp(W ;S )\S 0 . We observe that supp(W ;S )f1; ;pgnS 1 ; (3.41) whereS 1 =f1jp : b j (; ;S) = 0g. Meanwhile, note that in view of (3.39) we have with probability at least 1O(p c ), O(s) sup 2A;jSjKn k b (; ;S) 0 k 1 sup 2A;jSjKn X j2S 1 \S0 j b j (; ;S) 0;j j = X j2S 1 \S0 j 0;j jjS 1 \S 0 j min j2S0 j 0;j j: By Condition 3.2 and =O( p (logp)=n), we can further derive from the above inequality that jS 1 \S 0 j =o(s); which together withjS 0 j =s entails that j f1; ;pgnS 1 \S 0 j [1o(1)]s: Combining this with (3.41) leads to supp(W ;S )\S 0 f1; ;pgnS 1 \S 0 [1o(1)]s: (3.42) 91 Thus with asymptotic probability one, it holds uniformly over all 2A andjSjK n that b S \S 0 s 1o(1) by equations (3.40) and (3.42). Case 2. We next consider the case ofT <W ;S (j +1) < 0. By the denitions of T and j , we have jfj :W ;S j Tgj + 2 jfj :W ;S j Tgj >q (3.43) since otherwise we would reduce T tojW ;S (j +1) j to get the new smaller threshold with the criterion still satised. We next boundT using the results in Lemma 3.6 in Section 3.8.4. Observe that (3.43) and Lemma 3.6 lead tojfj :W ;S j 6Tgj>qjfj :W ;S j >Tgj 2qcs 2 with asymptotic probability one. Moreover, when W ;S j T we havej b j (; ;S)jj b j+p (; ;S)jT and thusj b j+p (; ;S)jT . Using (3.39), we obtain O(s) =k b (; ;S) T k 1 X j:W ;S j 6T j b j+p (; ;S)jTjfj :W ;S j 6Tgj: Combining these results yields O(s)T (qcs 2) and thus it holds that TO(): (3.44) 92 We now proceed to prove the theorem by showing that Type II error is small. By Condition 3.2 of p n=(logp) min j2S0 j 0;j j!1 and assumption =C p (logp)=n, there exists some n !1 such that min j2S0 j 0;j j n as n!1. In light of (3.39), we derive O(s) =k b (; ;S) T k 1 = p X j=1 (j b j (; ;S) j j +j b j+p (; ;S)j) X j2S0\( b S ) c (j b j (; ;S) j j +j b j+p (; ;S)j) X j2S0\( b S ) c (j b j (; ;S) j j +j b j (; ;S)jT ) sincej b j+p (; ;S)jj b j (; ;S)jT when j2 ( b S ) c . Using the triangle inequality and since j j j n when j2S 0 , we can conclude that O(s) X j2S0\( b S ) c (j j jT ) ( n T ) fj2S 0 \ ( b S ) c g : Then it follows that jfj2S 0 \ ( b S ) c gj s = 1 jfj2S 0 \ ( b S ) c gj s 1O n T = 1O( 1 n ) = 1o(1) uniformly over all 2A andjSjK n since TO(). Combining the above two scenarios, we have shown that with asymptotic probability one, uniformly over all 2A andjSjK n it holds that jfj2S 0 \ ( b S ) c gj s 1o(1): (3.45) 93 This along with the assumption Pf b 2Ag = 1O(p c2 ) in Condition 3.7 gives Power( b ; e S) =E h jfj2S 0 \ ( b S b ) c gj s i E h jfj2S 0 \ ( b S b ) c gj s b 2A i Pf b 2Ag [1o(1)][1O(p c2 )] = 1o(1); where the second to the last step is due to (3.45). This concludes the proof of Theorem 3.3. 3.7.6 Proposition 3.2 and its proof Proposition 3.2. 
Assume that Conditions 3.1 and 3.6 hold, the smallest eigenvalue of 2diag(s) diag(s) 0 diag(s) is positive and bounded away from 0, and = C 3 f(logp)=ng 1=2 with C 3 > 0 some constant. Let T = ( T 0 ; 0; ; 0) T 2R 2p be the expanded vector of true regression coecient vector. If [(L p +L 0 p ) 1=2 +K 1=2 n ]a n =o(1) and sfa n +L 0 p [(logp)=n] 1=2 + [K n (logp)=n] 1=2 g =o(1), then with probability at least 1O(p c3 ), sup 2A;jSjKn k b (; ;S) T k 1 =O(s) and sup 2A;jSjKn k b (; ;S) T k 2 =O(s 1=2 ); where b (; ;S) is dened in the proof of Theorem 3.3 in Section 3.7.5 and c 3 > 0 is some constant. Proof. We adopt the same notation as used in the proof of Theorem 3.3 in Section 3.7.5. Let us introduce some key events which will be used in the technical analysis. Dene E 3 = n sup k 0kC2an;jSjKn ke S e G S;S T;S k 1 0 o ; (3.46) E 4 = n sup k 0k2C2an;jSjKn k e G S;S G S;S k max C 5 a 2;n o ; (3.47) 94 where 0 =C 4 p (logp)=n and a 2;n =a n + (L 0 p +K n ) p (logp)=n with C 4 ;C 5 > 0 some constants. Then by Lemmas 3.4 and 3.7 in Sections 3.8.2 and 3.8.5, P(E 3 \E 4 ) = 1O(p c0 ) (3.48) for some constant c 0 > 0. Hereafter we will condition on the eventE 3 \E 4 . Since b S is the minimizer of the objective function in (3.37), we have 1 2 b T S e G S;S b S e T S b S +k b S k 1 1 2 T T;S e G S;S T;S e T S T;S +k T;S k 1 : Some routine calculations lead to 1 2 ( b S T;S ) T e G S;S ( b S T;S ) +k b S k 1 T T;S e G S;S b S + T T;S e G S;S T;S +e T S ( b S T;S ) +k T;S k 1 = (e S e G S;S T;S ) T ( b S T;S ) +k T;S k 1 k b S T;S k 1 ke S e G S;S T;S k 1 +k T;S k 1 : (3.49) Let b = b T . Then we can simplify (3.49) as 1 2 b T S e G S;S b S +k b S k 1 k b S k 1 ke S e G S;S T;S k 1 +k T;S k 1 0 k b S k 1 +k T;S k 1 : (3.50) Observe thatk b S k 1 =k b S0 k 1 +k b SnS0 k 1 andk T;S k 1 =k T;S0 k 1 +k T;SnS0 k 1 =k T;S0 k 1 with S 0 the support of true regression coecient vector. Then it follows fromk b S0 T;S0 k 1 k T;S0 k 1 k b S0 k 1 that 1 2 b T S e G S;S b S +k b SnS0 k 1 0 k b S k 1 +k b S0 T;S0 k 1 : 95 Denote byk b S0 k 1 =k b S0 T;S0 k 1 andk b SnS0 k 1 =k b SnS0 T;SnS0 k 1 =k b SnS0 k 1 . Then we can further deduce 1 2 b T S e G S;S b S +k b SnS0 k 1 0 k b S k 1 +k b S0 k 1 = 0 k b S0 k 1 + 0 k b SnS0 k 1 +k b S0 k 1 ; that is, 1 2 b T S e G S;S b S + ( 0 )k b SnS0 k 1 ( + 0 )k b S0 k 1 : (3.51) When 2 0 , it holds that 1 2 b T S e G S;S b S + 2 k b SnS0 k 1 3 2 k b S0 k 1 : (3.52) Since b T S e G S;S b S 0, we obtain the basic inequality k b SnS0 k 1 3k b S0 k 1 (3.53) on eventE 3 . It follows from (3.52) that b T S G S;S b S 3k b S0 k 1 + b T S (G S;S e G S;S ) b S (3.54) with G = 0 B B @ 0 0 diag(s) 0 diag(s) 0 1 C C A : With some matrix calculations, we can show that min (G)C min f2diag(s) diag(s) 0 diag(s)gCc 0 ; 96 where the last step is by assumption. Thus the left hand side of (3.54) can be bounded from below by c 0 Ck b S k 2 2 . It remains to bound the right hand side of (3.54). For the rst term, it follows from the Cauchy{Schwarz inequality that 3k b S0 k 1 3 p sk b S0 k 2 3 p sk b S k 2 3 p s s b T S G S;S b S cC 0 C p sk b S k 2 ; (3.55) where the last step is becausekG S;S k 2 2k 0 k 2 + 2k 0 diag(s)k 2 4k 0 k 2 + 2ksk 1 C uniformly over allS. 
For the last term b T S (G S;S e G S;S ) b S on the right hand of (3.54), by conditioning on eventE 4 and using the Cauchy{Schwarz inequality, the triangle inequality, and the basic inequality (3.53) we can obtain b T S (G S;S e G S;S ) b S kG S;S e G S;S k max k b S k 2 1 kG S;S e G S;S k max (k b S0 k 1 +k b SnS0 k 1 ) 2 16kG S;S e G S;S k max k b S0 k 2 1 16skG S;S e G S;S k max k b S0 k 2 2 16skG S;S e G S;S k max k b S k 2 2 = Csa 2;n k b S k 2 2 : Combining the above results, we can reduce inequality (3.54) to c 0 Ck b S k 2 2 C p sk b S k 2 +Csa 2;n k b S k 2 2 : Since sa 2;n ! 0, it holds for n large enough that k b S k 2 =k b S T;S k 2 =O( p s): 97 Further, by (3.55) we have k b S T;S k 1 =O(s): Note that by denition, b S c = 0 and T;S c = 0. Therefore, summarizing the above results completes the proof of Proposition 3.2. 3.8 Additional technical details This section contains additional technical details for the proofs of Lemmas 3.3{3.8. 3.8.1 Lemma 3.3 and its proof Lemma 3.3. Assume that X = (X ij )2R np has independent rows with distribution N(0; 0 ), max ( 0 )M, and" = (" 1 ; ;" n ) T has i.i.d. components with Pfj" i j>tgC 1 exp(C 1 1 t 2 ) for t> 0 and some constants M;C 1 > 0. Then we have P 1 n X T " 1 C p (logp)=n 1p c for some constant c> 0 and large enough constant C > 0. Proof. First observe that P(jX ij j>t) 2 expf(2M) 1 t 2 g for t> 0, since X ij N(0; 0;jj ) and 0;jj max ( 0 )M, where 0;jj denotes thejth diagonal entry of matrix 0 . By assumption, we also have P(j" i j>t)C 1 expfC 1 1 t 2 g. Combining these two inequalities yields P(j" i X ij j>t)P(j" i j> p t) +P(jX ij j> p t) C 1 expfC 1 1 tg + 2 expf(2M) 1 tg C 2 expfC 1 2 tg; 98 where C 2 > 0 is some constant that depends only on constants C 1 and M. Thus by Lemma 6 in [48], there exists some constant e C 1 > 0 such that P(jn 1 n X i=1 " i X ij j>z) e C 1 expf e C 1 nz 2 ) (3.56) for all 0<z< 1. Denote by X j the jth column of matrix X. Then by (3.56), the union bound leads to 1P n 1 X T " 1 z =P n 1 X T " 1 >z =P max 1jp jn 1 " T X j j>z p X j=1 P(jn 1 n X i=1 " i X ij j>z) p e C 1 expf e C 1 nz 2 ): Letting z =C p (logp)=n in the above inequality, we obtain P n 1 X T " 1 C p (logp)=n 1 e C 1 p ( e C1C 2 1) : Taking large enough positive constant C completes the proof of Lemma 3.3. 3.8.2 Lemma 3.4 and its proof Lemma 3.4. Assume that all the conditions of Proposition 3.2 hold anda n [(L p +L 0 p ) 1=2 +K 1=2 n ] = o(1). Then we have P ( sup 2A;jSjKn e S e G S;S T;S 1 C 4 p (logp)=n ) = 1O(p c4 ) for some constants c 4 ;C 4 > 0. 99 Proof. In this proof, we usec andC to denote generic positive constants and use the same notation as in the proof of Proposition 3.2 in Section 3.7.6. Since T = ( T 0 ; 0;:::; 0) T with 0 the true regression coecient vector, it is easy to check that e X KO T = X 0 . In view of y = X 0 +", it follows from the denitions ofe and e G that e S e G S;S T;S = 1 n e X T KO;S X 0 + 1 n e X T KO;S " 1 n e X T KO;S e X KO;S T;S = 1 n X T KO;S " + 1 n ( e X KO;S X KO;S ) T ": Using the triangle inequality, we deduce ke S e G S;S T;S k 1 1 n X T KO;S " 1 + 1 n ( e X KO;S X KO;S ) T " 1 : We will bound both terms on the right hand side of the above inequality. By Lemma 3.3, we can show that for the rst term, 1 n X T KO;S " 1 1 n X T KO " 1 C p (logp)=n with probability at least 1p c for some constants C;c> 0. 
We will prove that with probability at least 1o(p c ), 1 n ( e X KO;S X KO;S ) T " 1 Ca n (L p +L 0 p ) 1=2 p (logp)=n +Ca n p n 1 K n (logp): (3.57) Then the desired result in this lemma can be shown by noting that a n [(L p +L 0 p ) 1=2 +K 1=2 n ]! 0. It remains to prove (3.57). Recall that matrices X S and X 0;S can be written as X S = X(I diagfsg) S + ZB 0;S (B T 0;S B 0;S ) 1=2 (B S ) T B S 1=2 ; X 0;S = X(I 0 diagfsg) S + ZB 0;S ; 100 where the notation is the same as in the proof of Proposition 3.2 in Section 3.7.6. By the denitions of e X KO and X KO , it holds that 1 n ( e X KO;S X KO;S ) T " 1 = 1 n ( X S X 0;S ) T " 1 ; (3.58) where X S and X 0;S represent the submatrices formed by columns inS. We now turn to analyzing the term n 1 ( X S X 0;S ) T ". Some routine calculations give 1 n ( X S X 0;S ) T " = 1 n ( 0 )diagfsg S T X T " + 1 n (B S ) T B S 1=2 (B T 0;S B 0;S ) 1=2 I B T 0;S Z T ": Thus it follows from s j 2 max ( 0 ) for all 1jp and the triangle inequality that 1 n ( X S X 0;S ) T " 1 2 max ( 0 ) 1 n ( 0;S S ) T X T " 1 + 1 n (B S ) T B S 1=2 (B T 0;S B 0;S ) 1=2 I B T 0;S Z T " 1 : (3.59) We rst examine the upper bound for 1 n ( 0;S S ) T X T " 1 in (3.59). Since 2A and 0 is L p -sparse, by Lemma 3.3 we deduce 1 n ( 0;S S ) T X T " 1 1 n ( 0 )X T " 1 k 0 k 1 1 n X T " 1 q L p +L 0 p k 0 k 2 C p (logp)=n Ca n (L p +L 0 p ) 1=2 p (logp)=n: (3.60) 101 We can also bound the second term on the right hand side of (3.59) as 1 n (B S ) T B S 1=2 (B T 0;S B 0;S ) 1=2 I B T 0;S Z T " 1 (B S ) T B S 1=2 (B T 0;S B 0;S ) 1=2 I 1 1 n B T 0;S Z T " 1 p 2jSj (B S ) T B S 1=2 (B T 0;S B 0;S ) 1=2 I 2 1 n B T 0;S Z T " 1 p 2K n Ca n p (logp)=n =Ca n p n 1 K n (logp); where the second to the last step is entailed by Lemma 3.2 in Section 3.7.3 and Lemma 3.5 in Section 3.8.3. Therefore, combining this inequality with (3.58){(3.60) results in (3.57), which concludes the proof of Lemma 3.4. 3.8.3 Lemma 3.5 and its proof Lemma 3.5. Under the conditions of Proposition 3.2, it holds that with probability at least 1O(p c ), sup jSjKn 1 n B T 0;S Z T " 1 C p (logp)=n for some constant C > 0. Proof. Since this is a specic case of Lemma 3.8 in Section 3.8.6, the proof is omitted. 3.8.4 Lemma 3.6 and its proof Lemma 3.6. Under the conditions of Proposition 3.2 and Lemma 3.1, there exists some constant c2 (2(qs) 1 ; 1) such that with asymptotic probability one,j b S j cs holds uniformly over all 2A andjSjK n , where b S =fj :W ;S j >Tg. 102 Proof. Again we use C to denote generic positive constants whose values may change from line to line. By Proposition 3.2 in Section 3.7.6, we have with probability at least 1O(p c1 ) that uniformly over all 2A andjSjK n , max 1jp j b j (; ;S) 0;j jC p sn 1 (logp) and max 1jp j b j+p (; ;S)jC p sn 1 (logp) for some constants C;c 1 > 0. Thus for each 1jp, we have W ;S j =j b j (; ;S)jj b j+p (; ;S)j j b j+p (; ;S)jC p sn 1 (logp): (3.61) On the other hand, for each j2S 2 =fj : 0;j p sn 1 (logp)g it holds that W ;S j =j b j (; ;S)jj b j+p (; ;S)j j 0;j jj b j (; ;S) 0;j jj b j+p (; ;S)jC p sn 1 (logp): (3.62) Thus in order for any W ;S j , 1jp to fall belowT, we must have W ;S j T for all j2S 2 . This entails that fj :W ;S j Tg jS 2 jcs; (3.63) which completes the proof of Lemma 3.6. 103 3.8.5 Lemma 3.7 and its proof Lemma 3.7. Assume that all the conditions of Proposition 3.2 hold and a 2n = a n + (L 0 p + K n )f(logp)=ng 1=2 =o(1). Then it holds that P ( sup 2A;jSjKn e G S;S G S;S max C 8 a 2;n ) = 1O(p c8 ) for some constants c 8 ;C 8 > 0. Proof. 
In this proof, we adopt the same notation as used in the proof of Proposition 3.2 in Section 3.7.6. In light of (3.36), we have e G =n 1 [X; X ] T [X; X ]. Thus the matrix dierence e G S;S G S;S can be represented in block form as e G S;S G S;S = 1 n 0 B B @ X T S X S ( X S ) T X S X T S X S ( X S ) T X S 1 C C A 0 B B @ 0 0 diagfsg 0 diagfsg 0 1 C C A S;S = 0 B B @ n 1 X T S X S 0;S;S n 1 ( X S ) T X S 0 diagfsg S;S n 1 X T S X S 0 diagfsg S;S n 1 ( X S ) T X S 0;S;S 1 C C A : Note that the o-diagonal blocks are the transposes of each other. Then we see thatk e G S;S G S;S k max can be bounded by the maximum ofk 1 k max ,k 2 k max , andk 3 k max with 1 =n 1 X T S X S 0;S;S ; 2 =n 1 X T S X S 0 diagfsg S;S ; 3 =n 1 ( X S ) T X S 0;S;S : 104 To bound these three terms, we dene three events E 5 = n kn 1 X T X 0 k max C p (logp)=n o ; E 6 = n sup jSjKn n 1 B T 0;S Z T X 1 C p (logp)=n o ; E 7 = n sup jSjKn n 1 B T 0;S Z T ZB 0;S B T 0;S B 0;S max C p (logp)=n o : By Lemma 3.8 in Section 3.8.6, it holds that P (E 6 ) 1O(p c ) and P (E 7 ) 1O(p c ). Using Lemma A.3 in [13], we also have P (E 5 ) 1O(p c ). Combining these results yields P (E 5 \E 6 \E 7 ) 1O(p c ) with c> 0 some constant. Let us rst consider term 1 . Conditional onE 5 , it is easy to see that k 1 k max kn 1 X T X 0 k max C p (logp)=n: (3.64) We next boundk 2 k max conditional onE 5 \E 6 . To simplify the notation, denote by e B S; = (B T 0;S B 0;S ) 1=2 (B S ) T B S 1=2 . By the denition of X S , we deduce 2 =n 1 X T S X S 0 diagfsg S;S =n 1 X T S X(I diagfsg) S +n 1 X T S ZB 0;S e B S; 0 diagfsg S;S = (n 1 X T X 0 )(I diagfsg) S;S + diagfsg 0 diagfsg S;S +n 1 X T S ZB 0;S e B S; 2;1 + 2;2 + 2;3 : We will examine the above three terms separately. 105 Since isL 0 p -sparse,kI 0 diag(s)k 2 kIk 2 +k 0 diag(s)k 2 C, andk( 0 )diagfsgk 2 Ca n , we have I diagfsg 1 q L 0 p I diagfsg 2 q L 0 p I 0 diagfsg 2 + ( 0 )diagfsg 2 C q L 0 p : (3.65) Thus it follow from (3.65) that conditional onE 5 , k 2;1 k max = (n 1 X T X 0 )(I diagfsg) S;S max (n 1 X T X 0 )(I diagfsg) max n 1 X T X 0 max I diagfsg 1 C q L 0 p p (logp)=n: (3.66) For term 2;2 , it holds that k 2;2 k max = diagfsg 0 diagfsg S;S max CkI 0 k max Ck 0 k 2 k 0 k 2 Ca n : (3.67) Note that by Lemma 3.2 in Section 3.7.3, we have k e B S; k 1 p jSjk e B S; k 2 p jSj(k e B S; Ik 2 + 1)C p jSjC p K n 106 whenjSjK n . Then conditional onE 6 , it holds that k 2;3 k max =kn 1 X T S ZB 0;S e B S; k max kn 1 X T S ZB 0;S k max k e B S; k 1 C p n 1 K n (logp): (3.68) Thus combining (3.66){(3.68) leads to k 2 k max Cfa n + q n 1 L 0 p (logp) + p n 1 K n (logp)g: (3.69) We nally deal with term 3 . Some routine calculations show that 3 =n 1 ( X S ) T X S 0;S;S : =n 1 (I diagfsg) T S X T + ( e B S; ) T B T 0;S Z T X(I diagfsg) S + ZB 0;S e B S; 0;S;S = n 1 (I diagfsg) T X T X(I diagfsg) 0 + B T 0 B 0 S;S +n 1 ( e B S; ) T B T 0;S Z T X(I diagfsg) S + (I diagfsg) T S X T ZB 0;S e B S; + ( e B S; ) T B T 0;S Z T ZB 0;S e B S; B T 0;S B 0;S 3;1 + 3;2 + T 3;2 + 3;3 : 107 Conditional on eventE 5 , with some simple matrix algebra we derive k 3;1 k = n 1 (I diagfsg) T X T X(I diagfsg) 0 + B T 0 B 0 S;S max n 1 (I diagfsg) T X T X(I diagfsg) 0 + B T 0 B 0 max (I diagfsg) T (n 1 X T X 0 )(I diagfsg) max + (I diagfsg) T 0 (I diagfsg) 0 + 2diagfsg diagfsg 0 diagfsg max kn 1 X T X 0 k max k(I diagfsg)k 2 1 +kdiagfsg(I 0 )k max +k(I 0 )diagfsgk max +kdiagfsg( 0 0 )diagfsgk max CL 0 p p (logp)=n +Ca n ; (3.70) where the last step used (3.65) and calculations similar to (3.67). 
It follows from (3.65) and the previously proved resultk e B S; k 1 C p K n forjSjK n that conditional on eventE 6 , k 3;2 k =kn 1 ( e B S; ) T B T 0;S Z T X(I diagfsg) S k max k e B S; k 1 kn 1 B T 0;S Z T Xk max k(I diagfsg) S k 1 C p K n q L 0 p n 1 (logp) =C q n 1 K n L 0 p (logp): (3.71) 108 Finally, by Lemma 3.2 it holds that conditioned onE 7 , k 3;3 k = n 1 ( e B S; ) T B T 0;S Z T ZB 0;S e B S; B T 0;S B 0;S max ( e B S; ) T (n 1 B T 0;S Z T ZB 0;S B T 0;S B 0;S ) e B S; max + ( e B S; ) T B T 0;S B 0;S e B S; B T 0;S B 0;S max n 1 B T 0;S Z T ZB 0;S B T 0;S B 0;S max k e B S; k 2 1 +Ca n CK n p (logp)=n +Ca n : (3.72) Therefore, combining (3.70){(3.72) results in k 3 k max Ca n +C(L 0 p +K n + q K n L 0 p ) p (logp)=n Ca n + 2C(L 0 p +K n ) p (logp)=n; which together with (3.64) and (3.69) concludes the proof of Lemma 3.7. 3.8.6 Lemma 3.8 and its proof Lemma 3.8. Under the conditions of Proposition 3.2, it holds that with probability at least 1O(p c ), sup jSjKn 1 n B T 0;S Z T X max C p (logp)=n; sup jSjKn n 1 B T 0;S Z T ZB 0;S B T 0;S B 0;S max C p (logp)=n for some constants c;C > 0. 109 Proof. We still use c and C to denote generic positive constants. We start with proving the rst inequality. Observe that sup jSjKn 1 n B T 0;S Z T X max 1 n B T 0 Z T X max : Thus it remains to prove P 1 n B T 0 Z T X max C p (logp)=n o(p c ): (3.73) Let U = ZB 0 2 R np and denote by U j the jth column of matrix U. We see that the components of U j are i.i.d. Gaussian with mean zero and variance e T j B T 0 B 0 e j , and the vectors U j are independent of ". Let e U j = (e T j B T 0 B 0 e j ) 1=2 U j . Then it holds that e U j N(0; I n ). Since X ij N(0; 0;jj ) and 0;jj max ( 0 )C with C > 0 some constant, it follows from Bernstein's inequality that for t> 0, P 1 n B T 0 Z T X max tkB T 0 B 0 k 2 p X j=1 P 1 n (U j ) T X i tkB T 0 B 0 k 2 p X j=1 P 1 n ( e U j ) T X i t Cp exp(Cnt 2 ): Taking t =C p (logp)=n with large enough constant C > 0 in the above inequality yields P 1 n B T 0 Z T X max C p (logp)=nkB T 0 B 0 k 2 Cp c 110 for some constant c> 0. Thus with probability at least 1O(p c ), it holds that 1 n B T 0 Z T X max C p (logp)=nkB T 0 B 0 k 2 =C p (logp)=nkdiag(s) diag(s) 0 diag(s)k 2 C p (logp)=n; which establishes (3.73) and thus concludes the proof for the rst result. The second inequality follows from sup jSjKn n 1 B T 0;S Z T ZB 0;S B T 0;S B 0;S max n 1 B T 0 Z T ZB 0 B T 0 B 0 max and Lemma A.3 in [13], which completes the proof of Lemma 3.8. 111 Chapter 4 Large-Scale Model Selection with Misspecication 4.1 Introduction The problem of model selection has been studied extensively by many researchers in the past several decades. Among others, well-known model selection criteria include the Akaike information criterion (AIC) [2, 3] and Bayesian information criterion (BIC) [93], where the former is based on the Kullback-Leibler (KL) divergence principle of model selection and the latter is originated from the Bayesian principle of model selection. A great deal of work has been devoted to understanding and extending these model selection criteria to dierent model settings; see, for example, [15, 53, 71, 68, 20, 21, 77, 83, 31, 65]. The connections between the AIC and cross- validation have been investigated in [99, 56, 84] for various contexts. 
In particular, [52] showed that classical information criteria such as AIC and BIC can no longer be consistent for model selection in ultra-high dimensions and proposed the generalized information criterion (GIC) for tuning parameter selection in high-dimensional penalized likelihood, for the scenario of correctly specified models. See also [8, 17, 18, 94, 46, 45] for some recent work on high-dimensional inference for feature selection. Most existing work on model selection and feature selection makes an implicit assumption that the model under study is correctly specified or of fixed dimensions. Given the practical importance of model misspecification, [111] laid out a general theory of maximum likelihood estimation in misspecified models for the case of fixed dimensionality and independent and identically distributed (i.i.d.) observations. [25] also studied maximum likelihood estimation of a multi-dimensional log-concave density when the model is misspecified. Recently, [79] investigated the problem of model selection with model misspecification and derived asymptotic expansions of both the KL divergence and Bayesian principles in misspecified generalized linear models, leading to the generalized AIC (GAIC) and generalized BIC (GBIC), for the case of fixed dimensionality. A specific form of prior probabilities motivated by the KL divergence principle led to the generalized BIC with prior probability (GBIC$_p$). Yet both model misspecification and high dimensionality are prevalent in contemporary big data applications. Thus an important question is how to characterize the impacts of both model misspecification and high dimensionality on model selection. We intend to provide some answers to this question in this chapter.
Let us first gain some insight into the challenges of this problem by considering a motivating example. Assume that the response $Y$ depends on the covariate vector $(X_1, \ldots, X_p)^T$ through the functional form
$Y = f(X_1) + f(X_2 - X_3) + f(X_4 - X_5) + \varepsilon, \quad (4.1)$
where $f(x) = x^3/(x^2 + 1)$ and the remaining setting is the same as in Section 4.4.1. Consider sample size $n = 200$ and vary the dimensionality $p$ from 100 to 3200. Without any prior knowledge of the true model structure, we take the linear regression model
$\mathbf{y} = \mathbf{Z}\beta + \varepsilon \quad (4.2)$
as the working model and apply some information criteria in the hope of recovering the oracle working model consisting of the first five covariates, where $\mathbf{y}$ is an $n$-dimensional response vector, $\mathbf{Z}$ is an $n \times p$ design matrix, $\beta = (\beta_1, \ldots, \beta_p)^T$ is a $p$-dimensional regression coefficient vector, and $\varepsilon$ is an $n$-dimensional error vector. When $p = 100$, the traditional AIC and BIC, which ignore model misspecification, tend to select a model of size larger than five. As expected, GBIC$_p$ in [79] works well, selecting the oracle working model over 90% of the time. However, when $p$ is increased to 3200, these methods fail to select such a model with significant probability and the prediction performance of the selected models deteriorates. This motivates us to study the problem of model selection in high-dimensional misspecified models. In contrast, our new method can recover the oracle working model with significant probability in this challenging scenario.
The main contributions of this chapter are threefold. First, we provide the asymptotic expansion of the Bayesian principle of model selection in high-dimensional misspecified generalized linear models, which involves delicate and challenging technical analysis.
Motivated by the asymptotic expansion and a natural choice of prior probabilities that encourages interpretability and incorporates Kullback-Leibler divergence, we suggest the high-dimensional generalized BIC with prior probability (HGBIC p ) for large-scale model selection with misspecication. Second, our work provides rigorous theoretical justication of the covariance contrast matrix estimator that incorporates the eect of model misspecication and is crucial for practical implementation. Such an estimator is shown to be consistent in the general setting of high-dimensional misspecied models. Third, we establish the model selection consistency of our new information criterion HGBIC p in ultra-high dimensions under some mild regularity conditions. In particular, our work provides important extensions to the studies in [79] and [52] to the cases of high dimensionality and model misspecication, respectively. The aforementioned contributions make our work distinct from other studies on model misspecication including [17, 65, 94]. The rest of the chapter is organized as follows. Section 4.2 introduces the setup for model misspecication and presents the new information criterion HGBIC p based on Bayesian principle of model selection. We establish a systematic asymptotic theory for the new method in Section 4.3. Section 4.4 presents several numerical examples to demonstrate the advantages of our newly 114 suggested method for large-scale model selection with misspecication. We provide some discussions of our results and possible extensions in Section 4.5. The proofs of main results and additional technical details are relegated to the end of the chapter. 4.2 Large-scale model selection with misspecication 4.2.1 Model misspecication The main focus of this chapter is investigating ultra-high dimensional model selection with model misspecication in which the dimensionality p can grow nonpolynomially with sample size n. We denote byM an arbitrary subset with size d of all p available covariates and X = (x 1 ; ; x n ) T the corresponding nd xed design matrix given by the covariates in modelM. Assume that conditional on the covariates in modelM, the response vector Y = (Y 1 ; ;Y n ) T has independent components and each Y i follows distribution G n;i with density g n;i , with all the distributions G n;i unknown to us in practice. Denote by g n = Q n i=1 g n;i the product density and G n the corresponding true distribution of the response vector Y. Since the collection of true distributionsfG n;i g 1in is unknown to practitioners, one often chooses a family of working models to t the data. One class of popular working models is the family of generalized linear models (GLMs) [80] with a canonical link and natural parameter vector = ( 1 ; ; n ) T with i = x T i , where x i is a d-dimensional covariate vector and = ( 1 ; ; d ) T is a d-dimensional regression coecient vector. Let > 0 be the dispersion parameter. Then under the working model of GLM, the conditional density of response y i given the covariates in modelM is assumed to take the form f n;i (y i ) = expfy i i b( i ) +c(y i ;)g; (4.3) 115 where b() and c(;) are some known functions with b() twice dierentiable and b 00 () bounded away from 0 and1. F n denotes the corresponding distribution of the n-dimensional response vector y = (y 1 ; ;y n ) T with the product density f n = Q n i=1 f n;i assuming the independence of components. 
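To make the working-model family in (4.3) concrete, the following sketch (not part of the thesis; all function names are illustrative) spells out the canonical-link ingredients $b$, $b'$, and $b''$ for the Gaussian and Bernoulli working models, the two working models used repeatedly in this chapter.
```python
import numpy as np

# Canonical-link building blocks of the working GLM density (4.3):
#   f(y_i) = exp{ [y_i * theta_i - b(theta_i)] / phi + c(y_i, phi) }.

# Gaussian working model (linear regression, phi = sigma^2):
#   b(theta) = theta^2 / 2,  b'(theta) = theta,  b''(theta) = 1.
def b_gaussian(theta):
    return 0.5 * theta**2

# Bernoulli working model (logistic regression, phi = 1):
#   b(theta) = log(1 + e^theta).
def b_bernoulli(theta):
    return np.logaddexp(0.0, theta)

def mean_bernoulli(theta):       # b'(theta), the working mean
    return 1.0 / (1.0 + np.exp(-theta))

def var_bernoulli(theta):        # b''(theta), the working variance
    mu = mean_bernoulli(theta)
    return mu * (1.0 - mu)

# Log-density contribution of one observation under the Bernoulli working
# model (phi = 1, c(y, phi) = 0): y_i * theta_i - b(theta_i).
def logdensity_bernoulli(y_i, theta_i):
    return y_i * theta_i - b_bernoulli(theta_i)
```
Under the Gaussian working model, maximizing the resulting log-density over the regression coefficients reduces to least squares up to constants, which is why the linear model (4.2) can serve as a working model even when the true regression function is nonlinear.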
Since the GLM is chosen by the user, the working distribution F n can be generally dierent from the true unknown distribution G n . For the GLM in (4.3) with natural parameter vector, let us dene two vector-valued functions b() = (b( 1 ); ;b( n )) T and() = (b 0 ( 1 ); ;b 0 ( n )) T , and a matrix-valued function () = diagfb 00 ( 1 ); ;b 00 ( n )g. The basic properties of GLM give the mean vector Ey =() and the covariance matrix cov(y) = () with = X. The product density of the response vector y can be written as f n (y;;) = n Y i=1 f n;i (y i ) = exp " y T X 1 T b(X) + n X i=1 c(y i ;) # ; (4.4) where 1 represents the n-dimensional vector with all components being one. Since GLM is only our working model, (4.4) results in the quasi-log-likelihood function [111] ` n (y;;) = logf n (y;;) = y T X 1 T b(X) + n X i=1 c(y i ;): (4.5) Hereafter we treat the dispersion parameter as a known parameter and focus on our main parameter of interest. Whenever there is no confusion, we will slightly abuse the notation and drop the functional dependence on . The quasi-maximum likelihood estimator (QMLE) for the parameter vector in our working model of GLM (4.3) is dened as b n = arg max 2R d ` n (y;); (4.6) 116 which is the solution to the score equation n () =@` n (y;)=@ = X T [y(X)] = 0: (4.7) For the linear regression model with(X) = X, such a score equation becomes the familiar normal equation X T y = X T X. Note that the KL divergence [72] of our working model F n from the true model G n is dened as I(g n ;f n (;)) =E logg n (Y)E` n (Y;) with the response vector Y following the true distribution G n . As in [79], we consider the best working model that is closest to the true model under the KL divergence. Such a model has parameter vector n;0 = arg min 2R d I(g n ;f n (;)), which solves the equation X T [EY(X)] = 0: (4.8) We see that equation (4.8) is simply the population version of the score equation given in (4.7). Following [79], we introduce two matrices that play a key role in model selection with model misspecication. Note that under the true distribution G n , we have cov X T Y = X T cov(Y)X. Computing the score equation at n;0 , we dene matrix B n as B n = cov n ( n;0 ) = cov X T Y = X T cov(Y)X (4.9) with cov(Y) = diagfvar(Y 1 ); ; var(Y n )g by the independence assumption and under the true model. Note that under the working model F n , it holds that cov X T Y = X T (X)X. We then dene matrix A n () as A n () = @ 2 I(g n ;f n (;)) @ 2 =E @ 2 ` n (Y;) @ 2 = X T (X)X; (4.10) 117 and denote by A n = A n ( n;0 ). Hence we see that matrices A n and B n are the covariance matrices of X T Y under the best working model F n ( n;0 ) and the true model G n , respectively. To account for the eect of model misspecication, we dene the covariance contrast matrix H n = A 1 n B n as revealed in [79]. Observe that A n and B n coincide when the best working model and the true model are the same. In this case, H n is an identity matrix of size d. 4.2.2 High-dimensional generalized BIC with prior probability Given a set of competing modelsfM m :m = 1; ;Mg, a popular model selection procedure using Bayesian principle of model selection is to rst put nonzero prior probability Mm on each model M m , and then choose a prior distribution Mm for the parameter vector in the corresponding model. We used m =jM m j to denote the dimensionality of candidate modelM m and suppress the subscript m for conciseness whenever there is no confusion. 
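Before turning to the prior specification, it may help to see the QMLE of Section 4.2.1 concretely. The sketch below (not from the thesis; the logistic specialization, coefficient values, and all names are illustrative assumptions) solves the score equation (4.7) by Newton-Raphson for a Bernoulli working model fitted to data that actually contain an ignored interaction term.
```python
import numpy as np

def qmle_logistic(X, y, n_iter=50, tol=1e-8):
    """Newton-Raphson for the score equation X^T [y - mu(X beta)] = 0
    under a logistic working model (canonical link), i.e. the QMLE in (4.6)."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        theta = X @ beta                      # natural parameter vector
        mu = 1.0 / (1.0 + np.exp(-theta))     # mu(theta) = b'(theta)
        score = X.T @ (y - mu)                # score function in (4.7)
        W = mu * (1.0 - mu)                   # b''(theta), working variances
        A = X.T @ (W[:, None] * X)            # A_n(beta) = X^T Sigma(X beta) X
        step = np.linalg.solve(A, score)
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta

# Toy usage: the working model omits an interaction that is present in the
# data-generating process, so the fit is deliberately misspecified.
rng = np.random.default_rng(0)
n, d = 300, 5
X = rng.standard_normal((n, d))
eta_true = X @ np.array([1.0, -1.0, 1.0, 1.0, -1.0]) + 2.0 * X[:, 0] * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta_true)))
beta_hat = qmle_logistic(X, y)   # QMLE under the (misspecified) working model
```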
Assume that the density function of $\mu_{\mathcal{M}_m}$ is bounded in $\mathbb{R}^{\mathcal{M}_m} = \mathbb{R}^{d_m}$ and locally bounded away from zero in a shrinking neighborhood of $\beta_{n,0}$. The Bayesian principle of model selection is to choose the most probable model a posteriori; that is, choose the model $\mathcal{M}_{m_0}$ such that
$m_0 = \arg\max_{m \in \{1, \ldots, M\}} S(\mathbf{y}; \mathcal{M}_m, F_n), \quad (4.11)$
where the log-marginal-likelihood is
$S(\mathbf{y}; \mathcal{M}_m, F_n) = \log\big\{ \alpha_{\mathcal{M}_m} \int \exp[\ell_n(\mathbf{y}, \beta)] \, d\mu_{\mathcal{M}_m}(\beta) \big\} \quad (4.12)$
with the log-likelihood $\ell_n(\mathbf{y}, \beta)$ as defined in (4.5) and the integral over $\mathbb{R}^{d_m}$.
The choice of prior probabilities $\alpha_{\mathcal{M}_m}$ is important in high dimensions. [79] suggested the use of prior probability $\alpha_{\mathcal{M}_m} \propto e^{-D_m}$ for each candidate model $\mathcal{M}_m$, where the quantity $D_m$ is defined as
$D_m = E\big[ I(g_n; f_n(\cdot, \hat\beta_{n,m})) - I(g_n; f_n(\cdot, \beta_{n,m,0})) \big] \quad (4.13)$
with the subscript $m$ indicating a particular candidate model. The motivation is that the further the QMLE $\hat\beta_{n,m}$ is away from the best misspecified GLM $F_n(\cdot, \beta_{n,m,0})$, the lower the prior probability we assign to that model. In the high-dimensional setting when dimensionality $p$ can be much larger than sample size $n$, it is sensible to also take into account the complexity of the space of all possible sparse models with the same size as $\mathcal{M}_m$. Such an observation motivates us to consider a new prior probability of the form
$\alpha_{\mathcal{M}_m} \propto p^{-d} e^{-D_m} \quad (4.14)$
with $d = |\mathcal{M}_m|$. The complexity factor $p^{-d}$ is motivated by the asymptotic expansion of $\big[\binom{p}{d}\, d!\big]^{-1}$. In fact, an application of Stirling's formula yields $\log \big[\binom{p}{d}\, d!\big]^{-1} = -d \log p = \log(p^{-d})$ up to an additive term of order $o(d)$ when $d = o(p)$. The factor $\binom{p}{d}^{-1}$ was also exploited in [20], who showed that using the term $\binom{p}{d}^{-\gamma}$ with some constant $0 < \gamma \le 1$, the extended BIC can be model selection consistent for the scenario of correctly specified models with $p = O(n^{\kappa})$ for some positive constant $\kappa$ satisfying $1 - (2\kappa)^{-1} < \gamma$. A different form of integrating the number of candidate models into the prior was considered in [105] when the model under study is correctly specified. Moreover, we add the term $d!$ to reflect a stronger prior on model sparsity. See also [52] for the characterization of model selection in ultra-high dimensions with correctly specified models.
The asymptotic expansion of the Bayes factor for the Bayesian principle of model selection in Theorem 4.1, to be presented in Section 4.3.2, motivates us to introduce the high-dimensional generalized BIC with prior probability (HGBIC$_p$) as follows for large-scale model selection with misspecification.
Definition 4.1. We define $\mathrm{HGBIC}_p = \mathrm{HGBIC}_p(\mathbf{y}; \mathcal{M}, F_n)$ of model $\mathcal{M}$ as
$\mathrm{HGBIC}_p = -2\ell_n(\mathbf{y}, \hat\beta_n) + 2(\log p^*)|\mathcal{M}| + \mathrm{tr}(\hat{\mathbf{H}}_n) - \log|\hat{\mathbf{H}}_n|, \quad (4.15)$
where $\hat{\mathbf{H}}_n$ is a consistent estimator of $\mathbf{H}_n$ and $p^* = p n^{1/2}$.
In correctly specified models, the term $\mathrm{tr}(\hat{\mathbf{H}}_n) - \log|\hat{\mathbf{H}}_n|$ in (4.15) is asymptotically close to $|\mathcal{M}|$ when $\hat{\mathbf{H}}_n$ is a consistent estimator of $\mathbf{H}_n = \mathbf{A}_n^{-1}\mathbf{B}_n = I_d$. Thus, compared to BIC with factor $\log n$, the HGBIC$_p$ contains a larger factor of order $\log p$ when the dimensionality $p$ grows nonpolynomially with sample size $n$. This leads to a heavier penalty on model complexity, similarly as in [52]. As shown in [79] for GBIC$_p$, the HGBIC$_p$ defined in (4.15) can also be viewed as a sum of three terms: the goodness of fit, model complexity, and model misspecification; see [79] for more details. Our new information criterion HGBIC$_p$ provides an important extension of the original GBIC$_p$ in [79], which was proposed for the scenario of model misspecification with fixed dimensionality, by explicitly taking into account the high dimensionality of the whole feature space.
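Given a fitted candidate model, Definition 4.1 is straightforward to evaluate once a consistent estimate of $\mathbf{H}_n$ is available (Section 4.3.3 discusses such an estimator). The sketch below is a minimal illustration rather than the thesis's implementation; the function name and arguments are hypothetical.
```python
import numpy as np

def hgbic_p(loglik_hat, H_hat, model_size, p, n):
    """HGBIC_p of a candidate working model, following (4.15):
        -2 * l_n(y, beta_hat) + 2 * log(p*) * |M| + tr(H_hat) - log det(H_hat),
    with p* = p * sqrt(n) and H_hat a consistent estimate of A_n^{-1} B_n."""
    log_p_star = np.log(p * np.sqrt(n))
    sign, logdet = np.linalg.slogdet(H_hat)
    if sign <= 0:
        raise ValueError("H_hat should be positive definite")
    return (-2.0 * loglik_hat + 2.0 * log_p_star * model_size
            + np.trace(H_hat) - logdet)

# Among a sequence of candidate sparse models, the one minimizing HGBIC_p is
# selected.  In a correctly specified model H_hat is close to the identity,
# so tr(H_hat) - log det(H_hat) is close to |M| and the criterion behaves
# like a BIC-type rule with the heavier log(p * sqrt(n)) complexity factor.
```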
4.3 Asymptotic properties of HGBIC p 4.3.1 Technical conditions We list the technical conditions required to prove the main results and the asymptotic properties of QMLE with diverging dimensionality. Denote by Z the full design matrix of sizenp whose (i;j)th entry is x ij . For any subsetM off1; ;pg, Z M denotes the submatrix of Z formed by columns whose indices are inM. When there is no confusion, we drop the subscript and use X = Z M for xedM. For theoretical reasons, we restrict the parameter space toB 0 which is a suciently large convex and compact set of R p . We consider parameters with bounded support. Namely, we 120 deneB(M) =f2B 0 : supp() =Mg andB =[ jMjK B(M) where the maximum support size K is taken to be o(n). Moreover, we assume that c 0 b 00 (Z)c 1 0 for any2B where c 0 is some positive constant. We use the following notation. For matrices,kk 2 ,kk 1 , andkk F denote the matrix operator norm, entrywise maximum norm, and matrix Frobenius norm, respectively. For vectors,kk 2 and kk 1 denote the vector L 2 -norm and maximum norm, and (v) i represents the ith component of vector v. Denote by min () and max () the smallest and largest eigenvalues of a given matrix, respectively. Assumption 4.1. There exists some positive constant c 1 such that for each 1in, P (jW i j>t)c 1 exp(c 1 1 t) (4.16) for any t > 0, where W = (W 1 ; ;W n ) T = YEY. The variances of Y i are bounded below uniformly in i and n. Assumption 4.2. Let u 1 and u 2 be some positive constants and e m n = O(n u1 ) a diverging sequence. We have the following bounds (i) maxfkEYk 1 ; sup 2B k(Z)k 1 g e m n ; (ii) P n i=1 [EYi((X n;0 ))i] 2 var(Yi) 2 =O(n u2 ). For simplicity, we also assume that e m n diverges faster than logn. Assumption 4.3. Let K =o(n) be a positive integer. There exist positive constants c 2 and u 3 such that (i) For anyMf1; ;pg such thatjMjK, c 2 min (n 1 Z T M Z M ) max (n 1 Z T M Z M )c 1 2 ; 121 (ii) kZk 1 =O(n u3 ); (iii) For simplicity, we assume that columns of Z are normalized: P n i=1 x 2 ij =n for allj = 1; ;p. Condition 4.1 is a standard tail condition on the response variable Y . This condition ensures that the sub-exponential norm of the response is bounded. Conditions 4.2 and 4.3 have their counterparts in [52]. However, Condition 4.2 is modied to deal with model misspecication. More specically, the means of the true distribution and tted model as well as their relations are assumed in Condition 4.2. The rst part simultaneously controls the tail behavior of the response and tted model. The second part ensures that the mean of the tted distribution does not deviate from the true mean too signicantly. We would like to point out that such a condition does not limit the generality of model misspecication since the misspecication considered here is due to the distributional mismatch between the working model and the underlying true model. Even in the misspecied scenario, the tted mean vector from the working model can approximate the true mean vector under certain regularity conditions. Condition 4.3 is on the design matrix X. The rst part is important for the consistency of QMLE b n and uniqueness of the population parameter. Conditions 4.2 and 4.3 also provide bounds for the eigenvalues of A n () and B n . See [52] for further discussions on these assumptions. For the following conditions, we dene a neighborhood around n;0 . Let n = e m n p logp = O(n u1 p logp). We dene the neighborhood N n ( n ) =f2 R d :k(n 1 B n ) 1=2 ( n;0 )k 2 (n=d) 1=2 n g. 
We assume that (n=d) 1=2 n converges to zero so that N n ( n ) is an asymptotically shrinking neighborhood of n;0 . Assumption 4.4. Assume that the prior density relative to the Lebesgue measure 0 on R d (h()) = d M d0 (h()) satises inf 2Nn(2n) (h())c 3 and sup 2R d (h())c 1 3 ; (4.17) 122 where c 3 is a positive constant, and h() = (n 1 B n ) 1=2 . Assumption 4.5. Let V n () = B 1=2 n A n ()B 1=2 n , V n = V n ( n;0 ) = B 1=2 n A n B 1=2 n , and e V n ( 1 ; ; d ) = B 1=2 n e A n ( 1 ; ; d )B 1=2 n , where e A n ( 1 ; ; d ) is the matrix whose jth row is the corresponding row of A n ( j ) for each j = 1; ;d. There exists some sequence n ( n ) such that n ( n ) 2 n d converges to zero, (i) max 2Nn(2n) maxfj min (V n () V n )j;j max (V n () V n )jg n ( n ); (ii) max 1 ;; d 2Nn(n) k e V n ( 1 ; ; d ) V n k 2 n ( n ). Similar versions of Conditions 4.4 and 4.5 were imposed in [79]. Under Condition 4.4, the prior density is bounded above globally and bounded below in a neighborhood of n;0 . This condition is used in Theorem 4.1 for the asymptotic expansion of the Bayes factor. Condition 4.5 is on the continuity of the matrix-valued function V n and e V n in a shrinking neighborhood N n (2 n ) of n;0 . The rst and second parts control the expansions of expected log-likelihood and score functions, respectively. Condition 4.5 ensures that the remainders are negligible in approximating the log-marginal-likelihood S(y;M m ;F n ). See [79] for more discussions. 4.3.2 Asymptotic expansion of Bayesian principle of model selection We now provide the asymptotic expansion of Bayesian principle of model selection with the prior introduced in Section 4.2.2. As mentioned earlier, the Bayesian principle chooses the model that maximizes the log-marginal-likelihood given in (4.12). To ease the presentation, for any2R d , we dene a quantity ` n (y;) =` n (y;)` n (y; b n ); (4.18) 123 which is the deviation of the quasi-log-likelihood from its maximum. Then from (4.12) and (4.18), we have S(y;M m ;F n ) =` n (y; b n ) + logE Mm [U n () n ] + log Mm ; (4.19) whereU n () = exp[n 1 ` n (y;)]. With the choice of the prior probability in (4.14), it is clear that log Mm =D m d logp: (4.20) Aided by (4.19) and (4.20), some delicate technical analysis unveils the following expansion of the log-marginal-likelihood. Theorem 4.1. Let Mm =Cp d e Dm with C > 0 some normalization constant and assume that Conditions 4.1, 4.2(i), 4.3(i), 4.3(iii), 4.4, and 4.5 hold. If (n=d) 1=2 n =o(1), then we have with probability tending to one, S(y;M;F n ) =` n (y; b n ) (logp )jMj 1 2 tr(H n ) + 1 2 logjH n j + log(Cc 4 ) +o(e n ); (4.21) where H n = A 1 n B n , p =pn 1=2 , e n = tr(A 1 n B n )_ 1, and c 4 2 [c 3 ;c 1 3 ]. Theorem 4.1 lays the foundation for investigating high-dimensional model selection with model misspecication. Based on the asymptotic expansion in (4.21), our new information criterion HGBIC p in (4.15) is dened by replacing the covariance contrast matrix H n with a consistent estimator b H n . The HGBIC p naturally characterizes the impacts of both model misspecication and high dimensionality on model selection. A natural question is how to ensure a consistent estimator for H n . We address such a question in the next section. 124 4.3.3 Consistency of covariance contrast matrix estimation For practical implementation of HGBIC p , it is of vital importance to provide a consistent estimator for the covariance contrast matrix H n . 
To this end, we consider the plug-in estimator b H n = b A 1 n b B n with b A n and b B n dened as follows. Since the QMLE b n provides a consistent estimator of n;0 in the best misspecied GLM F n (; n;0 ), a natural estimate of matrix A n is given by b A n = A n ( b n ) = X T (X b n )X: (4.22) When the model is correctly specied, the following simple estimator b B n = X T diag nh y(X b n ) i h y(X b n ) io X (4.23) with denoting the componentwise product gives an asymptotically unbiased estimator of matrix B n . Theorem 4.2. Assume that Conditions 4.1{4.3 hold, n 1 A n () is Lipschitz in operator norm in the neighborhood N n ( n ), d =O(n 1 ), and logp =O(n 2 ) with constants satisfying 0< 1 < 1=4, 0<u 3 < 1=4 1 , 0<u 2 < 14 1 4u 3 , 0<u 1 < 1=22 1 u 3 , and 0< 2 < 14 1 2u 1 2u 3 . Then the plug-in estimator b H n = b A 1 n b B n satises that tr( b H n ) = tr(H n ) +o P (1) and logj b H n j = logjH n j +o P (1) with signicant probability 1Ofn +p 18c2 2 n g, where is some positive constant and n is a slowly diverging sequence such that n e m n p Kn 1 logp! 0. Theorem 4.2 improves the result in [79] in two important aspects. First, the consistency of the covariance contrast matrix estimator was justied in [79] only for the scenario of correctly specied models. Our new result shows that the simple plug-in estimator b H n still enjoys consistency in the general setting of model misspecication. Second, the result in Theorem 4.2 holds for the case of high dimensionality. These theoretical guarantees are crucial to the practical implementation of 125 the new information criterion HGBIC p . Our numerical studies in Section 4.4 reveal that such an estimate works well in a variety of model misspecication settings. 4.3.4 Model selection consistency of HGBIC p We further investigate the model selection consistency property of information criterion HGBIC p . Assume that there are M = o(n ) sparse candidate modelsM 1 ; ;M M , where is some suciently large positive constant. At rst glance, such an assumption may seem slightly restrictive since it rules out an exhaustive search over all p d possible candidate models. However, our goal in this chapter is to provide practitioners with some tools for comparing a set of candidate models that are available to them. In fact, the set of sparse models under model comparison in practice can be often smaller, e.g., polynomial instead of exponential in sample size, even under the ultra-high dimensional setting. One example is that people may apply dierent algorithms each of which can lead to a possibly dierent model. Another example is the use of certain regularization method with a sequence of sparse models generated by a path algorithm, which will be demonstrated in our numerical studies. For each candidate modelM m , we have the HGBIC p criterion as dened in (4.15) HGBIC p (M m ) =2` n (y; b n;m ) + 2(logp )jM m j + tr( b H n;m ) logj b H n;m j; (4.24) where b H n;m is a consistent estimator of H n;m and p =pn 1=2 . Assume that there exists an oracle working model in the sequencefM m :m = 1; ;Mg that has support identical to the set of all important features in the true model. Without loss of generality, suppose thatM 1 is such oracle working model. 126 Theorem 4.3. Assume that all the conditions of Theorems 4.1{4.2 hold and the population version of HGBIC p criterion in (4.24) is minimized atM 1 such that for some > 0, min m>1 HGBIC p (M m ) HGBIC p (M 1 ) > (4.25) with HGBIC p (M m ) =2` n (y; n;m;0 ) + 2(logp )jM m j + tr(H n;m ) logjH n;m j. 
Then it holds that min m>1 fHGBIC p (M m ) HGBIC p (M 1 )g> =2 (4.26) with asymptotic probability one. Theorem 4.3 formally establishes the model selection consistency property of the new information criterion HGBIC p for large-scale model selection with misspecication in that the oracle working model can be selected among a large sequence of candidate sparse models with signicant probability. Such a desired property is an important consequence of results in Theorems 4.1 and 4.2. 4.4 Numerical studies We now investigate the nite-sample performance of the information criterion HGBIC p in com- parison to the information criteria AIC, BIC, GAIC, GBIC, and GBIC p in high-dimensional misspecied models via simulation examples. 4.4.1 Multiple index model The rst model we consider is the following multiple index model Y =f( 1 X 1 ) +f( 2 X 2 + 3 X 3 ) +f( 4 X 4 + 5 X 5 ) +"; (4.27) 127 where the response depends on the covariates X j 's only through the rst ve ones in a nonlinear fashion and f(x) = x 3 =(x 2 + 1). Here the rows of the np design matrix Z are sampled as i.i.d. copies from N(0;I p ), and the n-dimensional error vector"N(0; 2 I n ). We set the true parameter vector 0 = (1;1; 1; 1;1; 0; ; 0) T and = 0:8. We vary the dimensionality p from 100 to 3200 while keeping the sample size n xed at 200. We would like to investigate the behavior of dierent information criteria when the dimensionality increases. Although the data was generated from model (4.27), we t the linear regression model (4.2). This is a typical example of model misspecication. Note that since the rst ve variables are independent of the other variables, the oracle working model is M 0 = supp( 0 ) =f1; ; 5g. Due to the high dimensionality, it is computationally prohibitive to implement the best subset selection. Thus we rst applied Lasso followed by least-squares retting to build a sequence of sparse models and then selected the nal model using a model selection criterion. In practice, one can apply any preferred variable selection procedure to obtain a sequence of candidate interpretable models. We report the consistent selection probability (the proportion of simulations where selected model c M = M 0 ), the sure screening probability [40, 34] (the proportion of simulations where selected mode c MM 0 ), and the prediction error E(Y z Tb ) 2 with b an estimate and (z;Y ) an independent observation for z = (X 1 ; ;X p ) T . To evaluate the prediction performance of dierent criteria, we calculated the average prediction error on an independent test sample of size 10,000. The results for prediction error and model selection performance are summarized in Table 4.1. In addition, we calculate the average number of false positives for each method in Table 4.2. From Table 4.1, we observe that as the dimensionality p increases, the consistent selection probability tends to decrease for all criteria except the newly suggested HGBIC p , which maintains a perfect 100% consistent selection probability throughout all dimensions considered. Generally speaking, GAIC improved over AIC, and GBIC, GBIC p performed better than BIC in terms of both prediction and variable selection. In particular, the model selected by our new information 128 Table 4.1: Average results over 100 repetitions for Example 4.4.1 with all entries multiplied by 100. 
Consistent selection probability with sure screening probability in parentheses p AIC BIC GAIC GBIC GBIC p HGBIC p Oracle 100 1(100) 71(100) 5(100) 72(100) 92(100) 100(100) 100(100) 200 0(100) 43(100) 4(100) 44(100) 79(100) 100(100) 100(100) 400 0(100) 27(100) 1(100) 31(100) 67(100) 100(100) 100(100) 800 0(100) 16(100) 1(100) 21(100) 57(100) 100(100) 100(100) 1600 0(100) 13(100) 2(100) 18(100) 49(100) 100(100) 100(100) 3200 0(100) 5(100) 3(100) 8(100) 45(100) 100(100) 100(100) Mean prediction error with standard error in parentheses 100 97(1) 84(1) 89(1) 84(1) 82(1) 82(1) 82(1) 200 103(1) 84(1) 89(1) 84(1) 81(1) 80(1) 80(1) 400 112(2) 88(1) 94(1) 87(1) 84(1) 82(1) 82(1) 800 120(1) 93(1) 97(1) 92(1) 86(1) 83(1) 83(1) 1600 121(1) 94(1) 96(1) 93(1) 87(1) 82(1) 82(1) 3200 117(1) 93(1) 93(1) 91(1) 84(1) 80(1) 80(1) criterion HGBIC p delivered the best performance with the smallest prediction error and highest consistent selection probability across all settings. An interesting observation is the comparison between GBIC p and HGBIC p in terms of model selection consistency property. While GBIC p is comparable to HGBIC p when the dimensionality is not large (e.g., p = 100 and 200), the dierence between these two methods increases as the dimensionality increases. In the case when p = 3200, HGBIC p has 100% of success for consistent selection, while GBIC p has a success rate of only 45%. This conrms the necessity of including the logp factor with p =pn 1=2 in the model selection criterion to take into account the high dimensionality, which is in line with the results in [52] for the case of correctly specied models. Table 4.2: Average false positives over 100 repetitions for Example 4.4.1. AIC BIC GAIC GBIC GBIC p HGBIC p 100 8.55 0.51 3.49 0.48 0.08 0.00 200 13.05 1.07 3.70 0.98 0.28 0.00 400 17.68 1.65 4.66 1.49 0.40 0.00 800 21.28 2.61 4.57 2.17 0.70 0.00 1600 22.20 3.01 4.40 2.68 0.96 0.00 3200 22.48 3.92 4.07 3.20 0.86 0.00 129 Figure 4.1: The average false discovery proportion (left panel) and the true positive rate (right panel) as the factor varies for Example 4.4.1. 0.5 1.0 1.5 2.0 2.5 3.0 3.5 0.00 0.05 0.10 0.15 0.20 0.25 factor FDP p=200 p=800 p=3200 0.5 1.0 1.5 2.0 2.5 3.0 3.5 0.80 0.85 0.90 0.95 1.00 factor TPR p=200 p=800 p=3200 We further study a family of model selection criteria induced by the HGBIC p and characterized as follows HGBIC p; (M m ) =2` n (y; b n;m ) + h 2(logp )jM m j + tr( b H n;m ) logj b H n;m j i ; (4.28) where is a positive factor controlling the penalty level on both model misspecication and high dimensionality. Note that HGBIC p; with = 1 reduces to our original HGBIC p . Here we examine the impact of the factor on the false discovery proportion (FDP) and the true positive rate (TPR) for the selected model c M compared to the oracle working model M 0 . In Figure 4.1, we observe that as increases, the average FDP drops sharply as it gets close to 1. In addition, we have the desired model selection consistency property (with FDP close to 0 and TPR close to 1) when 2 [1; 2]. This gure demonstrates the robustness of the introduced HGBIC p; criteria. 130 4.4.2 Logistic regression with interaction Our second simulation example is the high-dimensional logistic regression with interaction. 
We simulated 100 data sets from the logistic regression model with interaction and an n-dimensional parameter vector = Z + 2x p+1 + 2x p+2 ; (4.29) where Z = (x 1 ; ; x p ) is an np design matrix, x p+1 = x 1 x 2 and x p+2 = x 3 x 4 are two interaction terms, and the rest of the setting is the same as in the simulation example in Section 4.4.1. For each data set, the n-dimensional response vector y was sampled from the Bernoulli distribution with success probability vector [e 1 =(1 +e 1 ); ;e n =(1 +e n )] T with = ( 1 ; ; n ) T given in (4.29). As in Section 4.4.1, we consider the case where all covariates are independent of each other. We chose 0 = (2:5;1:9; 2:8;2:2; 3; 0; ; 0) T and set sample size n = 300. Although the data was generated from the logistic regression model with parameter vector (4.29), we t the logistic regression model without the two interaction terms. This provides another example of misspecied models. As argued in Section 4.4.1, the oracle working model is supp( 0 ) =f1; ; 5g which corresponds to the logistic regression model with the rst ve covariates. To build a sequence of sparse models, we applied Lasso followed by maximum-likelihood retting based on the support of the estimated model. Since the goal in logistic regression is usually classication, we replace the prediction error with the classication error rate. Tables 4.3 and 4.4 show similar conclusions to those in Section 4.4.1. Again HGBIC p outperformed all other model selection criteria with greater advantage as the dimensionality increases (e.g., p 800). As in Example 4.4.1, we also present the trend of FDP and TPR as varies in Figure 4.2. From the gure, we observe that it is a more dicult setting than the multiple index model to reach model selection consistency. The proposed HGBIC p criterion with the choice of = 1 appears to strike a good balance between FDP and TPR. 131 Table 4.3: Average results over 100 repetitions for Example 4.4.2 with all entries multiplied by 100. Consistent selection probability with sure screening probability probability in parentheses p AIC BIC GAIC GBIC GBIC p HGBIC p Oracle 100 0(100) 30(100) 0(100) 32(100) 55(100) 99(99) 100(100) 200 0(100) 27(100) 0(100) 29(100) 41(100) 95(97) 100(100) 400 0(100) 12(100) 0(100) 16(100) 35(100) 95(95) 100(100) 800 0(100) 2(99) 0(100) 4(99) 12(99) 94(95) 100(100) 1600 0(100) 0(100) 0(100) 1(100) 5(100) 84(85) 100(100) 3200 0(100) 0(100) 0(100) 1(100) 1(100) 79(84) 100(100) Mean classication error (in percentage) with standard error in parentheses 100 25.3(0.3) 16.1(0.2) 27.4(0.3) 16.1(0.2) 15.7(0.2) 15.2(0.2) 15.2(0.2) 200 24.9(0.3) 17.2(0.3) 28.4(0.3) 16.9(0.3) 16.5(0.2) 15.5(0.2) 15.4(0.2) 400 25.0(0.3) 19.7(0.4) 28.7(0.3) 17.8(0.3) 16.8(0.3) 15.3(0.2) 15.2(0.2) 800 24.7(0.3) 21.9(0.4) 28.0(0.3) 18.8(0.3) 17.7(0.3) 15.7(0.2) 15.5(0.2) 1600 26.0(0.4) 24.3(0.4) 28.4(0.3) 20.2(0.3) 18.7(0.3) 15.9(0.3) 15.4(0.2) 3200 25.7(0.3) 24.4(0.4) 28.2(0.3) 20.7(0.3) 19.5(0.3) 15.9(0.2) 15.3(0.2) Figure 4.2: The average false discovery proportion (left panel) and the true positive rate (right panel) as the factor varies for Example 4.4.2. 0.4 0.6 0.8 1.0 1.2 1.4 1.6 0.00 0.05 0.10 0.15 0.20 0.25 factor FDP p=200 p=800 p=3200 0.4 0.6 0.8 1.0 1.2 1.4 1.6 0.70 0.75 0.80 0.85 0.90 0.95 1.00 factor TPR p=200 p=800 p=3200 132 Table 4.4: Average false positives over 100 repetitions for Example 4.4.2. 
AIC BIC GAIC GBIC GBIC p HGBIC p 100 55.71 1.57 86.50 1.48 0.78 0.00 200 40.83 3.24 119.82 2.14 1.33 0.02 400 35.25 11.74 130.33 4.27 2.24 0.00 800 31.78 18.22 140.20 6.00 3.45 0.01 1600 30.25 22.65 142.81 8.02 4.80 0.01 3200 28.41 22.31 142.75 8.61 6.15 0.05 4.5 Discussions Despite the rich literature on model selection, the general case of model misspecication in high dimensions is less well studied. Our work has investigated the problem of model selection in high-dimensional misspecied models and characterized the impacts of both model misspecication and high dimensionality on model selection, providing an important extension of the work in [79] and [52]. The newly suggested information criterion HGBIC p has been shown to perform well in high-dimensional settings. Moreover, we have established the consistency of the covariance contrast matrix estimator that captures the eect of model misspecication in the general setting, and the model selection consistency of HGBIC p in ultra-high dimensions. The logp term in HGBIC p with p =pn 1=2 is adaptive to high dimensions. In the setting of correctly specied models, [52] showed that a similar term is necessary for the model selection consistency of information criteria when the dimensionality p grows fast with the sample size n. It would be interesting to study optimality property of the information criteria HGBIC p and the HGBIC p; dened in (4.28) under model misspecication, and investigate these model selection principles in more general high-dimensional misspecied models such as the additive models and survival models. It would also be interesting to combine the strengths of the newly suggested HGBIC p and the recently introduced knockos inference framework [8, 18, 45] for more stable and enhanced large-scale model selection with misspecication. These problems are beyond the scope of this chapter and are interesting topics for future research. 133 4.6 Proofs of main results We provide the proofs of Theorems 4.1{4.3 in this section. Additional technical details are provided in the next section. 4.6.1 Proof of Theorem 4.1 We consider the decomposition of S(y;M m ;F n ) in (4.19) and deal with terms logE Mm [U n () n ] and log Mm separately by invoking Taylor's expansion. In fact, logE Mm [U n () n ] is based on ` n (y;), the deviation of the quasi-log-likelihood from its maximum, while log Mm is the log-prior probability which depends onD m =E[I(g n ;f n (; b n;m ))I(g n ;f n (; n;m;0 ))], expected dierence in the KL divergences. In light of consistency of the estimator b n as shown in Lemma 4.1, we focus only on the neighborhood of n;0 . First, we make a few remarks on the technical details of the proof. Throughout the proof, we condition on the event e Q n =f b n 2 N n ( n )g, where N n ( n ) =f2 R d :k(n 1 B n ) 1=2 ( n;0 )k 2 (n=d) 1=2 n g; B n = X T cov(Y)X, n =O(L n p logp) and b n is the unrestricted MLE. Note that the eigenvalues ofn 1 A n () andn 1 B n are bounded away from 0 and1 by Conditions 4.1 and 4.3. This follows from the fact that eigenvalues of M T NM lie between min (N) min (M T M) and max (N) max (M T M) for any matrix M and positive semidenite symmetric matrix N. Therefore, from Lemma 4.1 we have that P ( e Q n )! 1 as n!1. To establish this theorem we require a possibly dimension dependent bound on the quantity kn 1=2 X b n k 2 . This can be achieved by putting some restriction on the parameter space. Let M n () =f2 R d :kXk 1 logng be a neighborhood, where is some positive constant. 
One way of bounding the quantitykn 1=2 X b n k 2 is to restrict the QMLE b n on the set M n (). Here, the constant can be chosen as large as desired to make M n () large enough, whereas the neighborhood N n ( n ) is asymptotically shrinking. Then, we have N n ( n ) M n () for all suciently large n, which implies that conditional on e Q n , the restricted MLE coincides with its 134 unrestricted version. Hereafter in this proof b n will be referred to as the restricted ML unless specied otherwise. Part I: expansion of the term logE Mm [U n () n ]. Recall that U n () = exp n 1 ` n (y;) and ` n (y;) =` n (y;)` n (y; b n ). First, we observe that the maximum value of the function ` n (y;) is attained at = b n . Moreover, we have @ 2 ` n (y;)=@ 2 =A n () from (4.10) where A n () = X T (X)X. Then, we consider Taylor's expansion of the log-likelihood function ` n (y;) around b n in a new neighborhood e N n ( n ) =f2 R d :k(n 1 B n ) 1=2 ( b n )k 2 (n=d) 1=2 n g. We get ` n (y;) = 1 2 ( b n ) T @ 2 ` n (y; )=@ 2 ( b n ) (4.30) = n 2 T V n ( ); where lies on the line segment joining and b n , = n 1=2 B 1=2 n ( b n ), and V n () = B 1=2 n A n ()B 1=2 n . Since b n 2 e N n ( n ), by the convexity of the neighborhood e N n ( n ) we have 2 e N n ( n ). We also note that conditional on the event e Q n , it holds that e N n ( n )N n (2 n ). Now, we will bound U n () n over the region e N n ( n ) using Taylor's expansion in (4.30). By Condition 4.5, we get q 1 ()1 e Nn(n) ()n 1 ` n (y;)1 e Nn(n) ()q 2 ()1 e Nn(n) (); (4.31) where q 1 () = 1 2 T [V n n ( n )I d ] and q 2 () = 1 2 T [V n + n ( n )I d ]. Then, we consider the linear transformation h() = (n 1 B n ) 1=2 . For suciently large n, we obtain E M [e nq2() 1 e Nn(n) ()]E M [U n () n 1 e Nn(n) ()]E M [e nq1() 1 e Nn(n) ()]; (4.32) where M denotes the prior distribution on h()2R d for the modelM. 135 The nal expansion of logE M [U n () n ] results from combination of Lemmas 4.7{4.10. The expressions E M [U n () n 1 e N c n (n) ] and R 2R d e nqj 1 e N c n (n) d 0 for j = 1; 2 in Lemmas 4.8 and 4.10 converge to zero faster than any polynomial rate in n sincee n = min (V n )=2 is bounded away from 0. Moreover, Lemmas 4.7 and 4.9 yield logE M [U n () n ] = log ( 2 n d=2 jV n n ( n )I d j 1=2 ) + logc 4 ; where c 4 2 [c 3 ;c 1 3 ]. Finally, we observe that jV n n ( n )I d j 1=2 =jV n j 1=2 jI d n ( n )V 1 n j 1=2 =jV n j 1=2 f1 +O[ n ( n )tr(V 1 n )]g 1=2 =jV n j 1=2 f1 +O[ n ( n )d 1 min (V n )]g 1=2 =jV n j 1=2 [1 +o(1)]; where we use Condition 4.5. So, we obtain logE M [U n () n ] = log ( 2 n d=2 jV n j 1=2 [1 +o(1)] ) + logc 4 = logn 2 d + 1 2 logjA 1 n B n j + log(2) 2 d + logc 4 +o(1): (4.33) This completes the expansion of logE M [U n () n ]. Part II: expansion of the prior term log Mm . Now, we consider the prior term log Mm which depends on b n throughD m . Simple calculation shows that log Mm =D m + logCd logp: (4.34) We aim to provide a decomposition ofD m in terms of H n . Observe thatD m =E n ( b n ) n ( n;0 ) where n () = E` n (e y;), and e y is an independent copy of y. We expand E n ( b n ) around 136 n ( n;0 ). In the GLM setup, we observe that ` n (e y;) = e y T X 1 T b(X) and n () = (Ee y T )X 1 T b(X). Then, we split E n ( b n ) in the region e Q n and its complement, that is, E n ( b n ) =Ef n ( b n )1 e Qn g +Ef n ( b n )1 e Q c n g (4.35) =Ef n ( b n )1 e Qn g +Ef[(E~ y) T (X b n ) 1 T b(X b n )]1 e Q c n g: First, we aim to show that the second term on the right-hand side of (4.35) iso(1). 
Performing componentwise Taylor's expansion of b() around 0 and evaluating at X b n , we obtain b(X b n ) = b(0) +b 0 (0)X b n + r, where r = (r 1 ; ;r n ) T with r i = 2 1 b 00 ((X i ) i )(X b n ) 2 i and 1 ; ; n lying on the line segment joining b n and 0. Thus, we get Efj(Ee y) T X b n 1 T b(X b n )j1 e Q c n gOfn logn +n +n(logn) 2 gP ( e Q c n ) =o(1) (4.36) for suciently largen. The last inequality follows from the fact thatP ( e Q c n ) converges to zero faster than any polynomial rate. To verify the orders, we recall that b n is the constrained MLE andb 00 () is bounded away from 0 and1. Thus, we obtain following bounds for the four termsj(Ee y) T X b n j = O(n logn),j1 T b(0)j =O(n),jb 0 (0)1 T X b n j =O(n logn), andj1 T rj =O(n(logn) 2 ). Now, we consider the rst term on the right-hand side of (4.35). We begin by expanding n () around n;0 conditioned on the event f Q n . By the denition of n;0 , n () attains its maximum at n;0 . By evaluating Taylor's expansion of n () around n;0 at b n , we derive n ( b n ) = n ( n;0 ) 1 2 ( b n n;0 ) T A n ( )( b n n;0 ) = n ( n;0 ) 1 2 ( b n n;0 ) T A n ( b n n;0 ) s n 2 ; 137 where A n () =@ 2 ` n (y;)=@ 2 , A n = A n ( n;0 ), and is on the line segment joining n;0 and b n . The second equality is obtained by taking s n = ( b n n;0 ) T [A n ( ) A n ]( b n n;0 ). Furthermore, setting C n = B 1=2 n A n and v n = C n ( b n n;0 ) simplies the above expression to n ( b n ) = n ( n;0 ) 1 2 v T n [(C 1 n ) T A n C 1 n ]v n s n 2 : (4.37) In (4.37), we rst handle the term s n . Note that on the event f Q n , by the convexity of the neighborhood N n ( n ) we have 2N n ( n ). Then, Condition 4.5 implies s n 1 f Qn = ( b n n;0 ) T (A n ( ) A n )( b n n;0 ) 1 f Qn (4.38) = [B 1=2 n ( b n n;0 )] T [V n ( ) V n ][B 1=2 n ( b n n;0 )] 1 f Qn n ( n ) 2 n d1 f Qn ; where V n () = B 1=2 A n ()B 1=2 n and V n = V( n;0 ). We then deduce that E(s n 1 f Qn ) = o(1), since n ( n ) 2 n d1 f Qn =o(1) by Condition 4.5. Therefore, (4.37) becomes E[ n ( b n )1 f Qn ] =E[ n ( n;0 ) 1 2 v T n [(C 1 n ) T A n C 1 n ]v n 1 f Qn ] +o(1): (4.39) We provide a decomposition of v n to handle the term v T n [(C 1 n ) T A n C 1 n ]v n in (4.39). Dene ( n ) = X T [y(X n )]. From the score equation we have ( b n ) = 0. From (4.8), it holds that X T [Ey(X n;0 )] = 0. For any 1 ; ; d 2R d , denote by e A n ( 1 ; ; d ) add matrix with jth row the corresponding row of A n ( j ) for each j = 1; ;d. Then, we dene matrix-valued 138 function e V n ( 1 ; ; d ) = B 1=2 n e A n ( 1 ; ; d )B 1=2 n . Assuming the dierentiability of () and applying the mean-value theorem componentwise around n;0 , we obtain 0 = n ( b n ) = n ( n;0 ) e A n ( 1 ; ; d )( b n n;0 ) = X T (yEy) e A n ( 1 ; ; d )( b n n;0 ); where each of 1 ; ; d lies on the line segment joining b n and n;0 . Therefore, we have the decomposition v n = C n ( b n n;0 ) = u n + w n ; (4.40) where u n = B 1=2 n X T (yEy) and w n = h e V n ( 1 ; ; d ) V n ih B 1=2 n ( b n n;0 ) i . We handle the quadratic term v T n [(C 1 n ) T A n C 1 n ]v n in (4.39) by using the decomposition of v n . For simplicity of notation, denote by R n = (C 1 n ) T A n C 1 n . Recall that C n = B 1=2 n A n . With some calculations we obtain E(u T n R n u n ) =Ef(yEy) T XA 1 n X T (yEy)g =Eftr(A 1 n X T (yEy)(yEy) T X)g = tr(A 1 n B n ): Note that E(u T n R n u n 1 f Qn ) = E(u T n R n u n ) E(u T n R n u n 1 f Qn c). From Lemma 4.1, we have P( f Q n c ) ! 0 as n ! 1. We set e n = tr(A 1 n B n )_ 1, hereby n is bounded away from zero. 
We apply Vitali's convergence theorem to show that E(u T n R n u n 1 f Qn c ) =o(e n ). To establish uniform integrability, we use Lemma 4.6 which states that sup n Ej(u T n R n u n )=e n j 1+ <1 for some constant > 0. This leads to E(u T n R n u n 1 f Qn c ) =o(e n ). Hence we have 1 2 E(u T n R n u n 1 f Qn ) = 1 2 tr(A 1 n B n ) +o(e n ): (4.41) 139 Now, it remains to show that E[(w T n R n w n + 2w T n R n u n )1 f Qn ] =o(e n ): (4.42) Using the denition of R n and w n , we can bound w T n R n w n : w T n R n w n =kR 1=2 n w n k 2 2 k e V n ( 1 ; ; d ) V n k 2 2 2 n dtr(A 1 n B n ): So, on the event f Q n , it holds that E(w T n R n w n 1 f Qn ) =o(e n ) by Condition 4.5. For the cross term w T n R n u n , applying the Cauchy{Schwarz inequality yields jE(w T n R n u n 1 f Qn )jE(kR 1=2 n w n k 2 2 1 f Qn ) 1=2 E(ku T n R 1=2 n k 2 2 ) 1=2 E[k e V n ( 1 ; ; d ) V n k 2 1 f Qn n d 1=2 tr(A 1 n B n )]: Thus, we obtain that E(w T n R n u n 1 f Qn ) =o(e n ). Note that Efj n ( n;0 )j1 f Qn cg is of order o(1) by similar calculations as in (4.36). Then, combining (4.35), (4.39), (4.41) and (4.42) yields Ef n ( b n )g = n ( n;0 ) 1 2 tr(A 1 n B n ) +o(e n ): (4.43) Combining (4.34) and (4.43) yields the expansion log M = 1 2 tr(A 1 n B n ) + logCd logp +o(e n ): Part I and Part II conclude the proof of Theorem 4.1. 140 4.6.2 Proof of Theorem 4.2 In the beginning of the proof, we demonstrate that the theorem follows from the consistency of b A n and b B n . Next, we establish the consistency of b A n and b B n . The consistency of b A n follows directly from the Lipschitz assumption; however, the consistency of b B n is harder to prove. To accomplish this, we break down b B n and invoke Bernstein-type tail inequalities and concentration theorems to handle challenging pieces. We rst introduce some notation to simplify the presentation of the proof. k () denotes the eigenvalues arranged in increasing order. Denote the spectral radius of dd square matrix M by (M) = max 1kd fj k (M)jg. kk 2 denotes the matrix operator norm. o P () denotes the convergence in probability of the matrix operator norm. We want to show that logj b H n j = logjH n j +o P (1) and tr( b H n ) = tr(H n ) +o P (1). To establish both equalities, it is enough to show that b H n = H n +o P (1=d). Indeed, assume that b H n = H n +o P (1=d) is established. In that case, we observe that jtr( b H n ) tr(H n )j =jtr( b H n H n )jd( b H n H n ) =dk b H n H n k 2 =o P (1); where the equality of the spectral radius and the operator norm follows from the symmetry of the matrix b H n H n . Moreover, we have j logj b H n j logjH n jjd max 1kd j log k ( b H n ) log k (H n )j =d max 1kd log max ( k ( b H n ) k (H n ) ; k (H n ) k ( b H n ) )! d max 1kd max ( k ( b H n ) k (H n ) ; k (H n ) k ( b H n ) ) 1 ! d max 1kd j k ( b H n ) k (H n )j minf k ( b H n ); k (H n )g : (4.44) 141 Recall that the smallest and largest eigenvalues of bothn 1 B n andn 1 A n are bounded away from 0 and1. (See the note in the beginning of the proof of Theorem 4.1.) So, we get k (H n ) =O(1) and 1 k (H n ) =O(1) uniformly for all 1kd. An application of Weyl's theorem shows that j k ( b H n ) k (H n )j( b H n H n ) for each k. We have ( b H n H n ) =k b H n H n k 2 =o P (1=d). Hence, the right-hand side of (4.44) is o P (1). Now, we proceed to show that b H n = H n +o P (1=d). It suces to prove that n 1 b A n = n 1 A n +o P (1=d) and n 1 b B n =n 1 B n +o P (1=d). 
To see the suciency, note that b H n H n = (n 1 b A n ) 1 (n 1 d b B n ) (n 1 A n ) 1 (n 1 dB n ) = (n 1 b A n ) 1 (n 1 d b B n ) (n 1 b A n ) 1 (n 1 dB n ) + (n 1 b A n ) 1 (n 1 dB n ) (n 1 A n ) 1 (n 1 dB n ): Then, b H n = H n +o P (1=d) can be obtained by repeated application of the following properties of the operator norm:k(I d M) 1 k 2 1=(1kMk 2 ) ifkMk 2 < 1,kMNk 2 kMk 2 kNk 2 , and kM + Nk 2 kMk 2 +kNk 2 , where M and N are dd matrices [62]. Part 1: prove n 1 b A n = n 1 A n +o P (1=d). From Lemma 4.1 we have,k b n n;0 k 2 = O P f(n=d) 1=2 n g, which entails b n = n;0 +O P f(n=d) 1=2 n g. Then it follows from the Lipschitz assumption for n 1 A n () in the neighborhood N n ( n ) that n 1 b A n =n 1 A n +o P (1=d). Part 2: prove n 1 b B n = n 1 B n +o P (1=d). We need to control the term y(X b n ). In correctly specied models,(X n;0 ) andEy are the same. So, it is enough to introduce the mean Ey which is close to both y and(X b n ). However, it is harder to control the term y(X b n ) in misspecied models since we need to deal with both(X n;0 ) and Ey. 142 First, we use the fact that(X n;0 ) and(X b n ) are close. To accomplish this, we add and subtract(X n;0 ) to get the following decomposition: n 1 b B n =n 1 X T diag nh y(X b n ) i h y(X b n ) io X = G 1 + G 2 + G 3 ; where G 1 =n 1 X T diagf[y(X n;0 )] [y(X n;0 )]gX; G 2 = 2n 1 X T diagf[y(X n;0 )] [(X n;0 )(X b n )]gX; G 3 =n 1 X T diagf[(X b n )(X n;0 )] [(X b n )(X n;0 )]gX: Next, we introduce Ey to obtain terms yEy andEy(X n;0 ) both of which can be kept small. We split G 1 as G 1 = G 11 + G 12 + G 13 and G 2 as G 2 = G 21 + G 22 , where G 11 =n 1 X T diagf(yEy) (yEy)gX; G 12 = 2n 1 X T diagf(yEy) [Ey(X n;0 )]gX; G 13 =n 1 X T diagf[Ey(X n;0 )] [Ey(X n;0 )]gX; G 21 = 2n 1 X T diagf(yEy) [(X n;0 )(X b n )]gX; G 22 = 2n 1 X T diagf[Ey(X n;0 )] [(X n;0 )(X b n )]gX: 143 Now, we will control each of the above terms separately. Before we begin, we observe that for any matrices M and N, we have P (dkM Nk 2 t)P (dkM Nk F t) d 2 max 1j;kd P (jM jk N jk jt=d 2 ); (4.45) wherekk F denotes the matrix Frobenius norm and M jk denotes the (j;k)th entry of M. Therefore, it is enough to bound P (jM jk N jk jt=d 2 ) by o(1=d 2 ) to show that M = N +o p (1=d). Part 2a) prove G 11 =n 1 B n +o P (1=d). We will use Bernstein-type tail inequality. First, note that EG 11 = n 1 B n and G jk 11 = n 1 P n i=1 fx ij x ik [y i Ey i ] 2 g = P n i=1 a jk i q 2 i , where a jk i = n 1 x ij x ik var(y i ) and q i =fvar(y i )g 1=2 (y i Ey i ). Let a jk = (a jk 1 ; ;a jk n ) T . Then we have ka jk k 2 2 = O(n 4u31 ) sincekXk 1 = O(n u3 ) from Condition 4.3. It may be noted that q i 's are 1-sub-exponential random variables from Condition 4.1 and so q 2 i 's are 2-sub-exponential random variables. Furthermore, sup 1in var(q 2 i ) =O(1). To see this, we note var(q 2 i )Eq 4 i 4 4 (4 1 [Eq 4 i ] 1=4 ) 4 4 4 sup m1 n m 1 (Ejq i j m ) 1=m o 4 =O(1); where we use Lemma 4.5. Then combining (4.45) with Lemma 4.12 for a choice of = 2, we deduce P (dkG 11 EG 11 k 2 t)d 2 max 1j;kd P (jG jk 11 EG jk 11 jt=d 2 ) Cd 2 expfCt 1=2 n 1 4 u3 =dg for some constant C. Since d =O(n 1 ) and u< 1=4u 3 , the right-hand side of above equation tends to zero. Thus, we obtain G 11 =EG 11 +o P (1=d) =n 1 B n +o P (1=d). 144 Part 2b) prove G 12 = o P (1=d). Similar to the previous part, we invoke Bernstein-type tail inequality. 
Observe that G jk 12 =n 1 P n i=1 2fx ij x ik [Ey(X n;0 )] i [y i Ey i ]g = P n i=1 ~ a jk i q i , where ~ a jk i = 2n 1 var(y i ) 1=2 x ij x ik [Ey(X n;0 )] i and q i =fvar(y i )g 1=2 (y i Ey i ). Then, we getk~ a jk k 2 2 =O(n 4u3+u2=23=2 ) by Conditions 4.2 and 4.3. By Lemma 4.11, we have P (dkG 12 k 2 t)d 2 max 1j;kd P (jG jk 12 jt=d 2 ) Cd 2 expfCtn 3 4 2u3 u 2 4 =d 2 g for some constant C. Since d =O(n 1 ) and 3=4 2u 3 u 2 =4 2 1 > 0, the right-hand side of above equation tends to zero. Hence, we have G 12 =o P (1=d). Part 2c) prove G 13 =o(1=d). We derive kG 13 k 2 2 kn 1 n X i=1 fx i x T i [Ey i [(X n;0 )] i ] 2 gk 2 F = 1j;kd [ n X i=1 a jk i [Ey i [(X n;0 )] i ] 2 =var(y i )] 2 n X i=1 f[Ey i [(X n;0 )] i ] 2 =var(y i )g 2 1j;kd ka jk k 2 2 ; where the last step follows from the componentwise Cauchy{Schwarz inequality. From Conditions 4.2 and 4.3, we getkG 13 k 2 2 = O(n u2 d 2 n 4u31 ). Therefore, G 13 = o(1=d) since d = O(n 1 ) and u 2 + 4 1 + 4u 3 1< 0. Part 2d) prove G 21 = o(1=d 2 ). Bounding G 21 is the trickiest part. The use of classical Bernstein-type inequalities are prohibited since the summation includes two random quantities y and b . Instead, we will apply concentration inequalities. 145 We start by truncating the random variable y by conditioning on the set n =fkWk 1 C 1 logng which is dened in Lemma 4.2. Since b n belongs to the neighborhoodN n ( n ) by Lemma 4.1, we get jG jk 21 j =j2n 1 n X i=1 x ij x ik [y i Ey i ][(X n;0 )(X b n )] i j sup n 2Nn(n) 2n 1 j n X i=1 x ij x ik [y i Ey i ][(X n;0 )(X n )] i j: Then, we can separate the right-hand side by conditioning on n . So, we havejG jk 21 j G jk 211 +G jk 212 where G jk 211 = sup n 2Nn(n) 2n 1 j n X i=1 x ij x ik [y i Ey i ][(X n;0 )(X n )] i 1 n j; G jk 212 = sup n 2Nn(n) 2n 1 j n X i=1 x ij x ik [y i Ey i ][(X n;0 )(X n )] i (1 1 n )j: First, we bound EG jk 211 . We take a Rademacher sequencef i g n i=1 independent of y. Then, we apply symmetrization and contraction inequalities in [16] as follows. EG jk 211 =E sup n 2Nn(n) 2n 1 j n X i=1 x ij x ik [y i Ey i ][(X n;0 )(X n )] i 1 n j 4n 1 E sup n 2Nn(n) j n X i=1 i x ij x ik y i [(X n;0 )(X n )] i 1 n j 4n 1 c 0 E sup n 2Nn(n) j n X i=1 i x ij x ik y i [X n;0 X n ] i 1 n j 4n 1 c 0 sup n 2Nn(n) k n;0 n k 2 Ek n X i=1 i x ij x ik y i 1 n x i k; where the last step follows from the Cauchy{Schwarz inequality. We observe that sup n 2Nn(n) k n;0 n k 2 n 1=2 d 1=2 n and Ek P n i=1 i x ij x ik y i 1 n x i k 2 ( P n i=1 x 2 ij x 2 ik E[y 2 i 1 n ]kx i k 2 2 ) 1=2 . So, we can boundEG jk 211 by 4c 0 n 3=2 d 1=2 n ( P n i=1 x 2 ij x 2 ik E[y 2 i 1 n ]kx i k 2 2 ) 1=2 . Using Conditions 4.2 146 and 4.3, we obtainEG jk 211 =O(n 1+2u3 d n e m n ). Sinced =O(n 1 ) and1+2u 3 +3 1 +2u 1 + 2 =2< 0, we deduce EG jk 211 =o(1=d 2 ). Furthermore, we need to bound 2jx ij x ik y i [(X n;0 )(X n )] i 1 n j for any n 2N n ( n ) in order to use the concentration theorem in [16]. We use Lemma 4.2 to bound y i : 2jx ij x ik y i [(X n;0 )(X n )] i 1 n j 2jx ij jjx ik jj(y i Ey i +Ey i )j1 n j[(X n;0 )(X n )] i j 2jx ij jjx ik j(jEy i j +C 1 log(n))j[(X n;0 )(X n )] i j: Since b 00 (X) c 1 0 for any joining the line segment n;0 and n , we havej[(X n;0 ) (X n )] i j c 1 0 kx i k 2 k n;0 n k 2 for any n 2 N n ( n ). When we put last two inequalities together with Conditions 4.2 and 4.3, we get 2jx ij x ik y i [(X n;0 )(X n )] i 1 n jc i; n where c i; n =O(n 2u3 e m n )kx i k 2 k n;0 n k 2 . 
Moreover, we have sup n 2Nn(n) n 1 n X i=1 c 2 i; n O(n 1+4u3 e m 2 n ) sup n 2Nn(n) k n;0 n k 2 2 n X i=1 kx i k 2 2 O(n 1+4u3 e m 2 n d 2 2 n ) where we use the fact thatk n;0 n k 2 2 =O(n 1 d 2 n ) for any n 2N n ( n ). Thus, we can use the concentration inequality in [16] which yields P (G jk 211 EG jk 211 +t)C exp C nt 2 n 1+4u3 e m 2 n d 2 2 n ; (4.46) for some constant C. Now, take any ~ t > 0. We know that EG jk 211 < ~ t=(2d 2 ) for large enough n. Then by taking t = ~ t=(2d 2 ) in equation (4.46), we obtain P (G jk 211 ~ t=d 2 )C expfC ~ t 2 n 2+4u3 e m 2 n d 6 2 n g: 147 Since2 + 4u 3 + 6 1 + 4u 1 + 2 < 0, we have P (G jk 211 ~ t=d 2 ) =o(1=d 2 ). Lastly, G jk 212 = 0 on the event n which holds with probability at least 1O(n ) by Lemma 4.2. Therefore, we obtain G 21 =o(1=d 2 ) by using (4.45). Part 2e) prove G 22 =o(1=d). First, we apply the Cauchy{Schwarz inequality to obtain jG jk 22 j 2 = 2 n X i=1 h n 1 var(y i ) 1=2 x ij x ik [(X n;0 )(X b n )] i i [Ey(X n;0 )] i var(y i ) 1=2 ! 2 4 n X i=1 n 2 var(y i )x 2 ij x 2 ik [(X n;0 )(X b n )] 2 i n X i=1 [Ey(X n;0 )] 2 i var(y i ) Since b n lies in the regionN n ( n ) with high probability andb 00 () is bounded, [(X n;0 )(X b n )] 2 i can be bounded bykx i k 2 2 O(n 1 d 2 n ). Condition 4.2 and the Cauchy{Schwarz inequality yield P n i=1 [var(y i )] 1 [Ey(X n;0 )] 2 i O(n 1=2+u2=2 ). We further use Conditions 4.1 and 4.3 to obtain jG jk 22 j 2 =O(n 3=2+4u3+u2=2 d 2 2 n ). Since d =O(n 1 ) and3=2 + 4u 3 +u 2 =2 + 6 1 + 2u 1 + 2 < 0, we getjG jk 22 j 2 =o(1=d 4 ). Thus, we obtain G 22 =o p (1=d). Part 2f) prove G 3 =o(1=d). We decompose (i;j)th entry of G 3 as follows jG jk 3 j =n 1 j n X i=1 x ij x ik [(X n;0 )(X b n )] 2 i j n 1 n X i=1 jx ij jjx ik j[(X n;0 )(X b n )] 2 i =O(n 1+2u3 d 2 2 n ); where the last line is similar to Part 2e. So,jG jk 3 j =o(1=d 2 ) since1 + 2u 3 + 4 1 + 2u 1 + 2 < 0. Therefore, we get G 3 =o(1=d). We have nished the proof of Part 2. This concludes the proof of Theorem 4.2 with the desired probability bound 1Ofn +p 18c2 2 n g. 148 4.6.3 Proof of Theorem 4.3 Theorem 4.3 is a direct consequence of Theorem 4.2, Lemma 4.1, and assumption (4.25). To see this, observe that the dierence in the sample version HGBIC p can be written as the sum of the population version HGBIC p and the terms consisting of dierences of likelihood, tr(H n ) and log(det(H n )) between the sample and population versions. That is, HGBIC p (M m ) HGBIC p (M 1 ) = HGBIC p (M m ) HGBIC p (M 1 ) 2[` n (y; b n;m )` n (y; n;m;0 )] + 2[` n (y; b n;1 )` n (y; n;1;0 )] + [tr( b H n;m ) tr(H n;m )] [tr( b H n;1 ) tr(H n;1 )] [logj b H n;m j logjH n;m j] + [logj b H n;1 j logjH n;1 j]: The equation (4.25) suggests that the rst line is bounded below by for any m > 1. Then we focus on the remaining terms. Let m = 2; ;M be xed. The consistency of QMLE in Lemma 4.1 implies that2[` n (y; b n;m )` n (y; n;m;0 )] + 2[` n (y; b n;1 )` n (y; n;1;0 )] converges to zero with probability at least 1O(n ) for some constant > 0 . Moreover, Theorem 4.2 proves that the last two lines are also of order o() with probability at least 1O(n ). Therefore,fHGBIC p (M m ) HGBIC p (M 1 )g > =2 with probability 1O(n ) for any xed m> 1. Applying the union bound over all M =o(n ) competing models completes the proof of Theorem 4.3. This section contains key lemmas, their proofs, and additional technical details. We aim to establish the asymptotic consistency of QMLE uniformly over all modelsM such thatjMjK whereK =o(n). For this purpose, we extend our notation. 
n;0 (M) denotes the parameter vector for the working model and is dened as the minimizer of the KL-divergence whose support isM: n;0 (M) = arg min 2B(M) I(g n ;f n (;;)): n;0 (M) is estimated by the QMLE b (M) which is dened as b (M) = arg max 2B(M) ` n (): 149 4.7 Technical lemmas 4.7.1 Lemma 4.1 and its proof Lemma 4.1 (Uniform consistency of QMLE). Assume Conditions 4.1, 4.2(i), 4.3(i), and 4.3(iii) hold. If L n p Kn 1 logp! 0, then sup jMjK;Mf1;;pg 1 p jMj k b (M) n;0 (M)k 2 =O p h L n p n 1 logp i ; where L n = 2e m n +C 1 logn. e m n is a diverging sequence which appears in Condition 4.2 and C 1 is the positive constant from Lemma 4.2. Proof. First, we construct the auxiliary parameter vector b u (M) as follows. For any sequence N n , we take u = (1 +k b (M) n;0 (M)k 2 =N n ) 1 and dene b u (M) =u b (M) + (1u) n;0 (M). We havek b u (M) n;0 (M)k 2 =uk b (M) n;0 (M)k 2 N n by the denition ofu. So, b u (M) belongs to the neighborhoodB M (N n ) =f2R d ; supp() =M :k n;0 (M)k 2 N n g. Moreover, we observe thatk b u (M) n;0 (M)k 2 N n =2 impliesk b (M) n;0 (M)k 2 N n . Thus, it is enough to boundk b u (M) n;0 (M)k 2 to prove the theorem. Now, we considerk b u (M) n;0 (M)k 2 . First, the concavity of ` n and the denition of b (M) yield ` n ( b u (M))u` n ( b (M)) + (1u)` n ( n;0 (M)) u` n ( b u (M)) + (1u)` n ( n;0 (M)): So, by rearranging terms, we get ` n ( n;0 (M)) +` n ( b u (M)) 0: (4.47) 150 Besides, for any2B M (N n ), we have E[` n ( n;0 (M))` n ()] =I(g n ;f n (;;))I(g n ;f n (; n;0 (M);)) 0; (4.48) by the optimality of n;0 (M). Combining (4.47) and (4.48) gives 0E[` n ( n;0 (M))` n ( b u (M))] ` n ( n;0 (M)) +` n ( b u (M)) +E[` n ( n;0 (M))` n ( b u (M))] sup 2B M (Nn) `()E[` n ()]f` n ( n;0 (M))E[` n ( n;0 (M))]g =nT M (N n ); (4.49) since b u (M)2B M (N n ). On the other hand, for any2B M (N n ), E[` n ( n;0 (M))` n ()] =EY T Z M ( n;0 (M)) 1 T (b(Z M n;0 (M)) b(Z M )) =(Z M n;0 (M))Z M ( n;0 (M)) 1 T (b(Z M n;0 (M)) b(Z M )); since n;0 (M) satises the score equation: Z T M [EY(Z M )] = 0. Furthermore, applying the second order Taylor expansion yields E[` n ( n;0 (M))` n ()] = 1 2 n;0 (M) T Z T M (Z M )Z M n;0 (M) ; 151 where lies on the line segment connecting n;0 (M) and . Then, we use Condition 4.3 and the assumption that c 0 b 00 (Z) c 1 0 for any2B. So, we get E[` n ( n;0 (M))` n ()] 1 2 nc 0 c 2 k n;0 (M) b u (M)k 2 2 : Therefore, for any2B M (N n ), k n;0 (M)k 2 2 2(c 0 c 2 ) 1 n 1 E[` n ( n;0 (M))` n ()]: (4.50) Finally, we take a slowly diverging sequence n such that n L n p K log(p)=n! 0. Then, we choose N n = n L n p jMjn 1 logp. Since b u (M)2B M (N n ), we combine equations (4.49) and (4.50) to obtain sup jMjK 1 p jMj k n;0 (M) b u (M)k 2 sup jMjK T M (N n ) jMj 1=2 p 2(c 0 c 2 ) 1 n 1 =O p [L n p n 1 logp]; where the last step follows from Lemma 4.4. This completes the proof of Lemma 4.1. 4.7.2 Lemma 4.2 and its proof Lemma 4.2. Assume that Y 1 ; ;Y n are independent and satisfy Condition 4.1. Then, for any constant > 0, there exist large enough positive constants C 1 and C 2 such that kWk 1 C 1 logn; (4.51) with probability at least 1O(n ) and, kn 1=2 E[Wj n ]k 2 =O((logn)n C2 ); (4.52) where n =fkWk 1 C 1 logng. 152 Proof. We take t =C 1 logn in Condition 4.1. So we get P (kWk 1 C 1 logn) 1n max in P (jW i j>C 1 logn) 1c 1 n 1c 1 1 C1 : We chooseC 1 large enough so that 1c 1 1 C 1 0. Thus, we haveP (kWk 1 C 1 logn) = 1O(n ) where we pick =c 1 1 C 1 1> 0. This proves the rst part of the lemma. Now, we proceed the proof of the second part of the lemma. 
We will bound each termE[W i j n ] for i = 1; ;n. SincefW i g for i = 1; ;n are independent, the conditional expectation E[W i j n ] can be written as follows E[W i j n ] =E[W i jjW i jC 1 logn] = E[W i 1fjW i jC 1 logng] P (jW i jC 1 logn) : Since EW = 0 by denition, we get E[W i 1fjW i jC 1 logng] =E[W i 1fjW i j>C 1 logng]. Last two equalities result in jE[W i j n ]j E[jW i j1fjW i j>C 1 logng] P (jW i jC 1 logn) : We already showed that the denominator P (jW i jC 1 logn) can be bounded below by 1O(n ) uniformly in i. Thus, it suces to bound the numerator E[jW i j1fjW i j>C 1 logng]. Indeed, we have E[jW i j1fjW i j>C 1 logng] = Z 1 0 P (jW i j1fjW i j>C 1 logngt)dt = Z C1 logn 0 P (jW i j1fjW i j>C 1 logngt)dt + Z 1 C1 logn P (jW i j1fjW i j>C 1 logngt)dt = Z C1 logn 0 P (jW i jC 1 logn)dt + Z 1 C1 logn P (jW i jt)dt C 1 lognP (jW i jC 1 logn) + Z 1 C1 logn c 1 exp(c 1 1 t)dt C 1 lognc 1 exp(c 1 1 C 1 logn) +c 2 1 exp(c 1 1 C 1 logn); 153 where we use Condition 4.1 in the last two steps. This concludes the proof of Lemma 4.2 by choosing C 2 =c 1 1 C 1 . 4.7.3 Lemma 4.3 and its proof Lemma 4.3. Under Condition 4.2, the function dened by (x T i ;Y i ) = Y i x T i b(x T i ) is Lipschitz continuous with the Lipschitz constant L n = 2e m n +C 1 logn conditioned on the set n =fkWk 1 C 1 logng given in Lemma 4.2. Proof. We consider the dierence (x T i 1 ;Y i )(x T i 2 ;Y i ) for any 1 and 2 in R p . We observe that j(x T i 1 ;Y i )(x T i 2 ;Y i )jjY i jjx T i ( 1 2 )j +jb(x T i 1 )b(x T i 2 )j: We can bound jY i j on n using Condition 4.2 as jY i j kYk 1 kEYk 1 +kWk 1 e m n + C 1 log(n). Then we apply the mean-value theorem to obtain jb(x T i 1 ) b(x T i 2 )j jb 0 ( ~ )jjx T i ( 1 2 )j where ~ lies on the line segment connecting 1 and 2 . Thus, we get jb(x T i 1 )b(x T i 2 )j e m n jx T i ( 1 2 )j by Condition 4.2. Hereby, we showed thatj(x T i 1 ;Y i ) (x T i 2 ;Y i )j (2e m n +C 1 logn)jx T i 1 x T i 2 j conditioned on n . Thus, (;Y i ) is Lipschitz continuous with the Lipschitz constant L n = 2e m n +C 1 logn conditioned on the set n . This completes the proof of Lemma 4.3. 4.7.4 Lemma 4.4 and its proof Lemma 4.4. Assume that Conditions 4.1, 4.2(i), 4.3(i), and 4.3(iii) hold. Dene the neighborhood B M (N) =f2R d ; supp() =M :k n;0 (M)k 2 Ng and T M (N) = sup 2B M (N) n 1 ` n ()` n ( n;0 (M))E[` n ()` n ( n;0 (M))] : 154 If n is a slowly diverging sequence such that n L n p Kn 1 logp! 0, then sup jMjK 1 jMj T M n L n p jMjn 1 logp =O(L 2 n n 1 logp) with probability at least (1e 2 p 18c2 2 n )(1O(n )), where L n = 2e m n +C 1 logn. Proof. To prove the lemma, we condition on the set n =fkYEYk 1 C 1 logng. We observe that ` n ()` n ( n;0 (M))E[` n ()` n ( n;0 (M))] ` n ()` n ( n;0 (M))E[` n ()` n ( n;0 (M))j n ] +jE[` n ()` n ( n;0 (M))]E[` n ()` n ( n;0 (M))j n ]j; by the triangle inequality. Thus, T M (N n ) can be bounded by the sum of the following two terms: e T M (N n ) = sup 2B M (Nn) n 1 ` n ()` n ( n;0 (M))E[` n ()` n ( n;0 (M))j n ] ; and R M (N n ) = sup 2B M (Nn) n 1 fE[` n ()` n ( n;0 (M))]E[` n ()` n ( n;0 (M))j n ]g That is, T M (N n ) e T M (N n ) +R M (N n ): (4.53) In the rest of the proof, we will show the following bounds R M (N n ) =o L 2 n logp n ; (4.54) 155 and e T M (N n ) =O p L 2 n logp n : (4.55) First, we consider R M (N n ). 
We split R M (N n ) by the Cauchy{Schwarz inequality so that R M (N n ) = sup 2B M (Nn) n 1 j(EYE[Yj n ]) T X[ n;0 (M)]j kn 1=2 (EYE[Yj n ])k 2 sup 2B M (Nn) kn 1=2 X[ n;0 (M)]k 2 : We have kn 1=2 (EYE[Yj n ])k 2 =kn 1=2 (E[Wj n ])k 2 =O(n C2 logn) by Lemma 4.2. We also have kn 1=2 X( n;0 (M))k 2 ( max (n 1 X T M X M )) 1=2 k n;0 (M)k 2 c 1=2 2 N n ; for any2B M (N n ). Therefore, R M () =O(N n n C2 logn). So, (4.54) follows by taking C 2 large enough. Next, we deal with the term e T M (N n ) by showing (4.55). We observe that the dierence ` n ()` n ( n;0 (M)) can be written as ` n ()` n ( n;0 (M)) = n X i=1 n Y i [x T i x T i n;0 (M)] [b(x T i )b(x T i n;0 (M))] o = n X i=1 (x T i ;Y i )(x T i n;0 (M);Y i ) : In Lemma 4.3, we showed that (x T i ;Y i ) = Y i x T i b(x T i ) is Lipschitz continuous with the Lipschitz constant L n conditioned on the set n . 156 Next, we choose a Rademacher sequencef i g n i=1 . Then, we apply symmetrization and concen- tration inequalities in [16] as follows: E[ e T M (N n )j n ] 2E " sup 2B M (Nn) n 1 n X i=1 i (x T i ;Y i )(x T i n;0 (M);Y i ) j n # 4L n E " sup 2B M (Nn) n 1 n X i=1 i (x T i x T i n;0 (M)) j n # : Furthermore, we have E " sup 2B M (Nn) n 1 n X i=1 i (x T i x T i n;0 (M)) j n # E " n 1 sup 2B M (Nn) k n;0 (M)k 2 k n X i=1 i (x i ) M k 2 j n # E " n 1 N n k n X i=1 i (x i ) M k 2 j n # =n 1 N n E 2 6 4 0 @ X j2M n X i=1 i x ij ! 2 1 A 1=2 3 7 5 n 1 N n 0 @ X j2M E 2 4 n X i=1 i x ij ! 2 3 5 1 A 1=2 =N n n 1=2 jMj 1=2 ; where we use the Cauchy{Schwarz inequality and the assumption P n i=1 x 2 ij =n. Therefore, we obtain the bound E[ e T M (N n )j n ] 4L n N n n 1=2 jMj 1=2 : (4.56) 157 For any2B M (N n ), we have n 1 n X i=1 j(x T i n;0 (M);Y i )(x T i ;Y i )j 2 n 1 L 2 n n X i=1 jx T i n;0 (M)x T i j 2 =n 1 L 2 n ( n;0 (M)) T X T M X M ( n;0 (M)) L 2 n c 1 2 N 2 n : Then we apply Theorem 14.2 in [16] to obtain P e T M (N n )E[ e T M (N n )j n ] +tj n exp nc 2 t 2 8L 2 n N 2 n : Now, we take t = 4L n N n n 1=2 jMj 1=2 u for some positive u that will be chosen later. So, we get P ( e T M (N n ) 4L n N n n 1=2 jMj 1=2 (1 +u)j n ) exp(2c 2 u 2 jMj) by using (4.56). We choose N n =L n n 1=2 jMj 1=2 (1 +u). So, it follows that P e T M (N n ) jMj 4L 2 n n 1 (1 +u) 2 j n ! exp(8c 2 u 2 jMj): Thus, we have P sup jMjK e T M (N n ) jMj 4L 2 n n 1 (1 +u) 2 j n ! X jMjK P e T M (N n ) jMj 4L 2 n n 1 (1 +u) 2 j n ! X kK p k exp(8c 2 u 2 k) X kK pe k k exp(8c 2 u 2 k): 158 Now, we choose u = n p logp. So, for n large enough, we get X kK pe k k exp(8c 2 u 2 k) = X kK pe k k p 8c2 2 n k = X kK (ep (18c2 2 n ) ) k k k X kK ep (18c2 2 n ) k! e 2 p 18c2 2 n : So far, the probability of the event e T M (N n ) = O(L 2 n logp=n), which we call A, is bounded below conditional on n . Simple calculation yields P(A)P(A\ n ) =P( n )P(Aj n ). Thus, P (A) (1e 2 p 18c2 2 n )(1O(n )). So, (4.55) follows. We have shown (4.54) and (4.55), which control the terms e T M (N n ) and R M (N n ), respectively. Thus, (4.53) concludes the proof of Lemma 4.4. 4.7.5 Lemma 4.5 and its proof Lemma 4.5. Let q i 's be n independent, but not necessarily identically distributed, scaled and centered random variables with uniform sub-exponential decay, that is, P (jq i j>t)C exp(C 1 t) for some positive constant C. Letkq i k 1 denote the sub-exponential norm dened by kq i k 1 := sup m1 n m 1 (Ejq i j m ) 1=m o : Then, we havekq i k 1 e 1=e C(C_ 1) for all i. Proof. 
From the condition on sub-exponential tails, we derive Ejq i j m =m Z 1 0 x m1 P (jq i jx)dxCm Z 1 0 x m1 exp(C 1 x)dx =CmC m Z 1 0 u m1 exp(u)du =CmC m (m)CmC m m m ; 159 where the last line follows from the denition of the Gamma function. Taking the mth root, we have (Ejq i j m ) 1=m (Cm) 1=m Cm: Rewriting above equation, we obtain m 1 (Ejq i j m ) 1=m m 1=m C 1=m Ce 1=e (C_ 1)C; for all m 1. Since the bound is independent of m, it holds thatkq i k 1 e 1=e C(C_ 1) for all i. This completes the proof of Lemma 4.5. 4.7.6 Lemma 4.6 and its proof Lemma 4.6. Under Condition 4.1, for some constant > 0, we have sup n Ej(u T n R n u n )=e n j 1+ <1; where u n = B 1=2 n X T (YEY), R n = B 1=2 n A 1 n B 1=2 n , ande n = tr(A 1 n B n )_ 1. Proof. From the expression of u T n R n u n , we have u T n R n u n =(YEY) T XA 1 n X T (YEY) =[(YEY) T cov(Y) 1=2 ][cov(Y) 1=2 XA 1 n X T cov(Y) 1=2 ] [cov(Y) 1=2 (YEY)]: 160 Denote S n = cov(Y) 1=2 XA 1 n X T cov(Y) 1=2 and q = cov(Y) 1=2 (YEY). We decompose u T n R n u n into two terms, the summations of the diagonal entries and the o-diagonal entries, respectively, u T n R n u n = q T S n q = n X i=1 s ii q 2 i + X 1i6=jn s ij q i q j ; where s ij and q i denote the (i;j)th entry of S n and ith entry of q. Then, we have E(u T n R n u n ) 2 = n X i=1 s 2 ii E(q 4 i ) + X 1i6=jn s ii s jj E(q 2 i )E(q 2 j ) + 2 X 1i6=jn s 2 ij E(q 2 i )E(q 2 j ): Using Condition 4.1 and the sub-Gaussian norm bound in Lemma 4.5, both quantities E(q 4 i ) and E(q 2 i )E(q 2 j ) can be uniformly bounded by a common constant. Hence E(u T n R n u n ) 2 O(1)f[tr(S n )] 2 + tr(S 2 n )g: Since S n is positive semidenite it holds that tr(S 2 n ) [tr(S n )] 2 . Finally noting that tr(S n ) = tr(A 1 n B n ) e n , we see that sup n Ej(u T n R n u n )=e n j 1+ <1 for = 1, which concludes the proof of Lemma 4.6. 4.8 Additional technical details Lemmas 4.7{4.10 below are similar to those in [79]. Their proofs can be found in [79] or with minor modications. Lemma 4.7. Under Condition 4.4, for j = 1; 2; we have c 5 Z 2R d e nqj 1 e Nn(n) d 0 E M h e nqj 1 e Nn(n) i c 6 Z 2R d e nqj 1 e Nn(n) d 0 : (4.57) 161 Lemma 4.8. Conditional on the event e Q n , for suciently large n we have E M [U n () n 1 e N c n (n) ] expf[e n n ( n )=2]d 2 n g (4.58) exp[(e n =2)d 2 n ]; wheree n = min (V n )=2. Lemma 4.9. It holds that Z 2R d e nq1 d 0 = 2 n d=2 jV n n ( n )I d j 1=2 (4.59) and Z 2R d e nq2 d 0 = 2 n d=2 jV n + n ( n )I d j 1=2 : (4.60) Lemma 4.10. For j = 1; 2, it holds that Z 2R d e nqj 1 e N c n (n) d 0 2 ne n d=2 exp h ( p e n d 2 n p d) 2 =2 i : (4.61) Lemma 4.11 ([108]). For independent sub-exponential random variablesfy i g n i=1 , we have that the sub-exponential norm of q i =fvar(y i )g 1=2 (y i Ey i ) is bounded by some positive constant C 3 . Moreover, the following Bernstein-type tail probability bound holds P ( j n X i=1 a i q i jt ) 2 exp C 3 min t 2 C 2 3 kak 2 2 ; t C 3 kak 1 for a2R n , t 0. 162 Lemma 4.11 rephrases Proposition 5.16 of [108] for the case wherekq i k 1 C 3 . Further, for our proof we need to characterize the concentration of the square of a sub-exponential random variable. In this regard, we dene a general -sub-exponential random variable which satises P (j j>t )H exp(t=H) for H;t> 0. Note that the usual sub-exponential q i 's are 1-sub-exponential random variables. It may be useful to note that = 1=2 corresponds to sub-Gaussian random variables. Lemma 4.12 ([33]). 
For independent -sub-exponential random variables q 2 i , the following Bernstein-type tail probability bound holds P ( j n X i=1 a i q 2 i E[ n X i=1 a i q 2 i ]jt ) C 4 exp 2 6 4C 4 0 @ t sup i var 1=2 (q 2 i )kak 2 1 A 2 2+ 3 7 5 for a2R n , t sup i var 1=2 (q 2 i )kak 2 , and C 4 > 0 depending on the choice of ;H. The proof of Lemma 4.12 follows from that of Lemma 8.2 in [33]. 163 Reference List [1] F. Abramovich, Y. Benjamini, D. L. Donoho, and I. M. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. The Annals of Statistics, 34(2):584{653, 2006. [2] H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (eds. B. N. Petrov and F. Csaki), Akademiai Kiado, Budapest, pages 267{281, 1973. [3] H. Akaike. A new look at the statistical model identication. IEEE Transactions on Automatic Control, 19:716{723, 1974. [4] T. W. Anderson and D. A. Darling. Asymptotic theory of certain \goodness-of-t" criteria based on stochastic processes. Annals of Mathematical Statistics, 23:193{212, 1952. [5] T. W. Anderson and D. A. Darling. A test of goodness-of-t. Journal of the American Statistical Association, 49:765{769, 1954. [6] S. Athey, G. W. Imbens, and S. Wager. Ecient inference of average treatment eects in high dimensions via approximate residual balancing. arXiv preprint arXiv:1604.07125, 2016. [7] Z. D. Bai. Methodologies in spectral analysis of large dimensional random matrices, a review. Statistica Sinica, 9:611{677, 1999. [8] R. F. Barber and E. J. Cand es. Controlling the false discovery rate via knockos. The Annals of Statistics, 43:2055{2085, 2015. [9] R. F. Barber and E. J. Cand es. A knocko lter for high-dimensional selective inference. arXiv:1602.03574, 2016. [10] D. Bean, P. J. Bickel, N. E. Karoui, and B. Yu. Optimal M-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences of the United States of America, 110:14563{14568, 2013. [11] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57:289{300, 1995. [12] Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165{1188, 2001. [13] P. J. Bickel and E. Levina. Regularized estimation of large covariance matrices. The Annals of Statistics, 36:199{227, 2008. [14] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, 37:1705{1732, 2009. 164 [15] H. Bozdogan. Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52:345{370, 1987. [16] P. B uhlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011. [17] P. B uhlmann and S. van de Geer. High-dimensional inference in misspecied linear models. Electronic Journal of Statistics, 9:1449{1473, 2015. [18] E. Cand es, Y. Fan, L. Janson, and J. Lv. Panning for gold: `model-X' knockos for high dimensional controlled variable selection. Journal of the Royal Statistical Society Series B, 80:551{577, 2018. [19] T. I. Cannings and R. J. Samworth. Random-projection ensemble classication. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):959{1035, 2017. [20] J. Chen and Z. Chen. 
Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95:759{771, 2008. [21] K. Chen and K.-S. Chan. Subset ARMA selection via the adaptive lasso. Statistics and Its Interface, 4:197{205, 2011. [22] M. Chen, Z. Ren, H. Zhao, and H. H. Zhou. Asymptotically normal and ecient estimation of covariate-adjusted gaussian graphical model. Journal of the American Statistical Association, 111:394{406, 2016. [23] A. Chouldechova and T. Hastie. Generalized additive model selection. arXiv:1506.03850, 2015. [24] S. Clarke and P. Hall. Robustness of multiple testing procedures against dependence. The Annals of Statistics, 37:332{358, 2009. [25] M. Cule, R. Samworth, and M. Stewart. Maximum likelihood estimation of a multi- dimensional log-concave density (with discussion). Journal of the Royal Statistical Society Series B, 72:545{607, 2010. [26] E. Demirkaya, Y. Feng, P. Basu, and J. Lv. Large-scale model selection with misspecication. arXiv preprint arXiv:1803.07418, 2018. [27] E. Demirkaya and J. Lv. Discussion of \Random-projection ensemble classication". Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):1008{1009, 2017. [28] B. Efron. Correlation and large-scale simultaneous signicance testing. Journal of American Statistical Association, 102:93{103, 2007. [29] B. Efron. Size, power and false discovery rates. The Annals of Statistics, 35:1351{1377, 2007. [30] B. Efron and R. Tibshirani. Empirical bayes methods and false discovery rates for microarrays. Genetic Epidemiology, 23:70{86, 2002. [31] S. Eguchi. Model comparison for generalized linear models with dependent observations. Econometrics and Statistics, 5:171{188, 2017. [32] R. Engle, C. Granger, J. Rice, and A. Weiss. Semiparametric estimates of the relation between weather and electricity sales. Journal of the American Statistical Association, 81:310{320, 1986. 165 [33] L. Erd} os, H.-T. Yau, and J. Yin. Bulk universality for generalized wigner matrices. Probability Theory and Related Fields, 154:341{407, 2012. [34] J. Fan and Y. Fan. High-dimensional classication using features annealed independence rules. The Annals of Statistics, 36:2605{2637, 2008. [35] J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. London: Chapman & Hall/CRC, 1996. [36] J. Fan, S. Guo, and N. Hao. Variance estimation using retted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society Series B, 74:37{65, 2012. [37] J. Fan, P. Hall, and Q. Yao. To how many simultaneous hypothesis tests can normal, student's t or bootstrap calibration be applied? Journal of the American Statistical Association, 102:1282{1288, 2007. [38] J. Fan, X. Han, and W. Gu. Control of the false discovery rate under arbitrary covariance dependence (with discussion). Journal of American Statistical Association, 107:1019{1045, 2012. [39] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of American Statistical Association, 96:1348{1360, 2001. [40] J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society Series B, 70:849{911, 2008. [41] J. Fan and J. Lv. A selective overview of variable selection in high dimensional feature space (invited review article). Statistica Sinica, 20:101{148, 2010. [42] J. Fan and J. Lv. Nonconcave penalized likelihood with NP-dimensionality. 
IEEE Transac- tions on Information Theory, 57:5467{5484, 2011. [43] J. Fan and H. Peng. Nonconcave penalized likelihood with diverging number of parameters. The Annals of Statistics, 32:928{961, 2004. [44] J. Fan, R. J. Samworth, and Y. Wu. Ultrahigh dimensional variable selection: beyond the linear model. Journal of Machine Learning Research, 10:1829{1853, 2009. [45] Y. Fan, E. Demirkaya, G. Li, and J. Lv. RANK: large-scale inference with graphical nonlinear knockos. Journal of the American Statistical Association, to appear, 2018. [46] Y. Fan, E. Demirkaya, and J. Lv. Nonuniformity of p-values can occur early in diverging dimensions. Journal of Machine Learning Research, to appear, 2019. [47] Y. Fan and J. Fan. Testing and detecting jumps based on a discretely observed process. Journal of Econometrics, 164:331{344, 2011. [48] Y. Fan, Y. Kong, D. Li, and J. Lv. Interaction pursuit with feature screening and selection. arXiv preprint arXiv:1605.08933, 2016. [49] Y. Fan, Y. Kong, D. Li, and Z. Zheng. Innovated interaction screening for high-dimensional nonlinear classication. The Annals of Statistics, 43:1243{1272, 2015. [50] Y. Fan and J. Lv. Asymptotic equivalence of regularization methods in thresholded parameter space. Journal of the American Statistical Association, 108:1044{1061, 2013. [51] Y. Fan and J. Lv. Innovated scalable ecient estimation in ultra-large gaussian graphical models. The Annals of Statistics, 44:2098{2126, 2016. 166 [52] Y. Fan and C. Y. Tang. Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society Series B, 75:531{552, 2013. [53] D. Foster and E. George. The risk in ation criterion for multiple regression. The Annals of Statistics, pages 1947{1975, 1994. [54] S. Geman. A limit theorem for the norm of random matrices. The Annals of Proabability, 8:252{261, 1980. [55] B. Guo and S. X. Chen. Tests for high dimensional generalized linear models. Journal of the Royal Statistical Society Series B, 78:1079{1102, 2016. [56] P. Hall. Akaike's information criterion and kullback-leibler loss for histogram density estimation. Probability Theory and Related Fields, 85:449{467, 1990. [57] P. Hall and Q. Wang. Strong approximations of level exceedences related to multiple hypothesis testing. Bernoulli, 16:418{434, 2010. [58] W. H ardle, H. Liang, and J. T. Gao. Partially Linear Models. Heidelberg: Springer Physica Verlag, 2000. [59] W. H ardle and T. M. Stoker. Investigating smooth multiple regression by the method of average derivatives. Journal of the American statistical Association, 84:986{995, 1989. [60] T. Hastie and R. Tibshirani. Generalized Additive Models. London: Chapman & Hall/CRC, 1990. [61] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edition). Springer, 2009. [62] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985. [63] J. L. Horowitz. Semiparametric and nonparametric methods in econometrics. Springer, 2009. [64] D. P. Horvath, R. Schaer, and E. Wisman. Identication of genes induced in emerging tillers of wild oat (avena fatua) using arabidopsis microarrays. Weed Science, 51:503{508, 2003. [65] H.-L. Hsu, C.-K. Ing, and H. Tong. On model selection from a nite family of possibly misspecied time series models. The Annals of Statistics, 47(2):1061{1087, 2018. [66] P. J. Huber. Robust regression: Asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 1:799{821, 1973. [67] H. Ichimura. 
Semiparametric least squares (sls) and weighted sls estimation of single-index models. Journal of Econometrics, 58:71{120, 1993. [68] C.-K. Ing. Accumulated prediction errors, information criteria and optimal forecasting for autoregressive time series. The Annals of Statistics, 35:1238{1277, 2007. [69] N. E. Karoui, D. Bean, P. J. Bickel, C. Lim, and B. Yu. On robust regression with high- dimensional predictors. Proceedings of the National Academy of Sciences of the United States of America, 110:14557{14562, 2013. [70] A. Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. G. Ist. Ital. Attuari., 4:83{91, 1933. 167 [71] S. Konishi and G. Kitagawa. Generalised information criteria in model selection. Biometrika, 83:875{890, 1996. [72] S. Kullback and R. A. Leibler. On information and suciency. The Annals of Mathematical Statistics, 22:79{86, 1951. [73] S. L. Lauritzen. Graphical Models. Oxford University Press, 1996. [74] Q. Li and J. S. Racine. Nonparametric econometrics: theory and practice. Princeton University Press, 2007. [75] Q. Lin, Z. Zhao, and J. S. Liu. Sparse sliced inverse regression for high dimensional data. arXiv:1611.06655, 2016. [76] W. Liu and Q.-M. Shao. Phase transition and regularized bootstrap in large-scale t-tests with false discovery rate control. The Annals of Statistics, 42:2003{2025, 2014. [77] W. Liu and Y. Yang. Parametric or nonparametric? a parametricness index for model selection. The Annals of Statistics, pages 2074{2102, 2011. [78] J. Lv. Impacts of high dimensionality in nite samples. The Annals of Statistics, 41:2236{ 2262, 2013. [79] J. Lv and J. S. Liu. Model selection principles in misspecied models. Journal of the Royal Statistical Society Series B, 76:141{167, 2014. [80] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, London, 1989. [81] L. Meier, S. van de Geer, and P. B uhlmann. High-dimensional additive modeling. The Annals of Statistics, 37:3779{3821, 2009. [82] L. Meng, F. Sun, X. Zhang, and M. S. Waterman. Sequence alignment as hypothesis testing. J. Comput. Biol., 18:677{691, 2011. [83] Y. Ninomiya and S. Kawano. AIC for the Lasso in generalized linear models. Electronic Journal of Statistics, 10:2537{2560, 2016. [84] H. Peng, H. Yan, and W. Zhang. The connection between cross-validation and Akaike information criterion in a semiparametric family. Journal of Nonparametric Statistics, 25:475{485, 2013. [85] S. Portnoy. Asymptotic behavior of M-estimators of p regression parameters when p 2 =n is large; i. consistency. The Annals of Statistics, 12:1298{1309, 1984. [86] S. Portnoy. Asymptotic behavior of M-estimators of p regression parameters when p 2 =n is large; ii. normal approximation. The Annals of Statistics, 13:1403{1417, 1985. [87] S. Portnoy. Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to innity. The Annals of Statistics, 16:356{366, 1988. [88] A. Preli c, S. Bleuler, P. Zimmermann, A. Wille, P. B uhlmann, W. Gruissem, L. Hennig, L. Thiele, and E. Zitzler. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22:1122{1129, 2006. [89] F. Ramel, C. Sulmon, M. Bogard, I. Cou ee, and G. Gouesbet. Dierential patterns of reactive oxygen species and antioxidative mechanisms during atrazine injury and sucrose-induced tolerance in arabidopsis thaliana plantlets. BMC Plant Biology, 9:1{18, 2009. 168 [90] P. Ravikumar, H. Liu, J. Laerty, and L. Wasserman. Spam: sparse sdditive models. 
Journal of the Royal Statistical Society Series B, 71:1009{1030, 2009. [91] J. Sch afer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4:1544{1615, 2005. [92] B. A. Schmitt. Perturbation bounds for matrix square roots and pythagorean sums. Linear algebra and its applications, 174:215{227, 1992. [93] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:461{464, 1978. [94] R. D. Shah and P. B uhlmann. Goodness-of-t tests for high dimensional linear models. Journal of the Royal Statistical Society Series B, to appear, 80:113{135, 2018. [95] R. D. Shah and R. J. Samworth. Variable selection with error control: Another look at stability selection. Journal of the Royal Statistical Society Series B, 75:55{80, 2013. [96] J. W. Silverstein. The smallest eigenvalue of a large dimensional wishart matrix. The Annals of Proabability, 13:1364{1368, 1985. [97] N. Smirnov. Table for estimating the goodness of t of empirical distributions. Annals of Mathematical Statistics, 19:279{281, 1948. [98] T. M. Stoker. Consistent estimation of scaled coecients. Econometrica, pages 1461{1481, 1986. [99] M. Stone. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society Series B, pages 44{47, 1977. [100] J. D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society Series B, 64:479{498, 2002. [101] J. D. Storey, J. E. Taylor, and D. Siegmund. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unied approach. Journal of the Royal Statistical Society Series B, 66:187{205, 2004. [102] W. Su and E. J. Cand es. Slope is adaptive to unknown sparsity and asymptotically minimax. The Annals of Statistics, 44:1038{1068, 2016. [103] P. Sur and E. J. Cand es. A modern maximum-likelihood theory for high-dimensional logistic regression. arXiv preprint arXiv:1803.06964, 2018. [104] P. Sur, Y. Chen, and E. J. Cand es. The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square. arXiv preprint arXiv:1706.01191, 2017. [105] P. Szulc. Weak consistency of modied versions of Bayesian information criterion in a sparse linear regression. Probability and Mathematical Statistics, 32:47{55, 2012. [106] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58:267{288, 1996. [107] S. van de Geer, P. B uhlmann, Y. Ritov, and R. Dezeure. On asymptotically optimal condence regions and tests for high-dimensional models. The Annals of Statistics, 42:1166{1202, 2014. [108] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2012. 169 [109] R. Vershynin. High-dimensional probability. An Introduction with Applications, 2016. [110] M. N. Vrahatis. A short proof and a generalization of Miranda's existence theorem. Proceed- ings of the American Mathematical Society, 107:701{703, 1989. [111] H. White. Maximum likelihood estimation of misspecied models. Econometrica, 50:1{25, 1982. [112] S. Wienkoop, M. Glinski, N. Tanaka, V. Tolstikov, O. Fiehn, and W. Weckwerth. 
Linking protein fractionation with multidimensional monolithic reversed-phase peptide chromatogra- phy/mass spectrometry enhances protein identication from complex mixtures even in the presence of abundant proteins. Rapid Commun. Mass Spectrom., 18:643{650, 2004. [113] A. Wille, P. Zimmermann, E. Vranov a, A. F urholz, O. Laule, S. Bleuler, L. Hennig, A. Preli c, P. von Rohr, L. Thiele, et al. Sparse graphical gaussian modeling of the isoprenoid gene network in arabidopsis thaliana. Genome biology, 5:R92, 2004. [114] W. B. Wu. On false discovery control under dependence. The Annals of Statistics, 36:364{380, 2008. [115] E. Yang, A. Lozano, and P. Ravikumar. Elementary estimators for high-dimensional linear regression. In Proceedings of the 31st International Conference on Machine Learning (ICML- 14), pages 388{396, 2014. [116] Y. Zhang and J. S. Liu. Fast and accurate approximation to signicance tests in genome-wide association studies. Journal of the American Statistical Association, 106:846{857, 2011. 170
Abstract
Feature selection, reproducibility, and model selection are of fundamental importance in contemporary statistics. Feature selection methods are required in a wide range of applications in order to evaluate the significance of covariates. Meanwhile, reproducibility of selected features is needed to claim that findings are meaningful and interpretable. Finally, model selection is employed for pinpointing the best set of covariates among a sequence of candidate models produced by feature selection methods.

We show that p-values, a common tool for feature selection, behave differently in nonlinear models and can break down earlier than their linear counterparts. Next, we provide important theoretical foundations for model-X knockoffs, a recent state-of-the-art method for reproducibility, establishing its power and robustness. Finally, we tackle the large-scale model selection problem for misspecified models and propose a novel information criterion tailored to both model misspecification and high dimensionality.
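To make the first claim above concrete, the short simulation sketch below repeatedly fits a null logistic regression and tracks how far the Wald p-value of a single null coefficient drifts from uniformity as the dimension grows. The sample size, dimensions, and number of replicates are illustrative choices only, and the statsmodels fit is used merely as a convenient maximum likelihood routine; this is a sketch of the phenomenon, not the simulation design used in the thesis.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, n_rep = 200, 200

for p in (5, 40, 80):  # growing dimension relative to n
    pvals = []
    for _ in range(n_rep):
        X = rng.standard_normal((n, p))
        y = rng.binomial(1, 0.5, size=n)              # global null: response independent of X
        fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
        pvals.append(fit.pvalues[1])                  # Wald p-value of the first (null) covariate
    # Under exact uniformity about 5% of p-values fall below 0.05; the fraction
    # inflates noticeably once p is no longer small relative to n.
    print(p, float(np.mean(np.asarray(pvals) < 0.05)))
```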