MODEL SELECTION PRINCIPLES AND FALSE DISCOVERY RATE CONTROL

by

Pallavi Basu

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BUSINESS ADMINISTRATION)

May 2016

Copyright 2016 Pallavi Basu

To all my Teachers, incl. parents, brother, and family. Friends and students, and to my Love and his family. I have learned many invaluable lessons, of work and life, from you all, if not only. And I patiently wait to learn more as life unveils! All mistakes are mine solely.

Contents

List of Tables
List of Figures
Abstract

1 Introduction
1.1 Model selection in high-dimensional misspecified models
1.1.1 High-dimensional inference
1.1.2 Model selection principles
1.1.3 Model misspecification
1.2 wFDR control in large-scale multiple testing
1.2.1 Large-scale multiple testing
1.2.2 False discovery rate (FDR) criterion
1.2.3 Weighted FDR control
1.3 Thesis outline
1.4 Synthesis

2 Model Selection in High-Dimensional Misspecified Models
2.1 Collaboration
2.2 Introduction
2.3 Model misspecification
2.4 High-dimensional model selection in misspecified models
2.4.1 Technical conditions and asymptotic properties of QMLE in high dimensions
2.4.2 Generalized AIC in misspecified models
2.4.3 Estimation of covariance contrast matrix
2.4.4 Generalized BIC in misspecified models
2.4.5 Consistency of selection criterion
2.5 Numerical studies
2.5.1 Simulation examples
2.5.2 Real data examples
2.6 Additional tables
2.7 Discussion

3 Weighted False Discovery Rate Control in Large-Scale Multiple Testing
3.1 Collaboration
3.2 Introduction
3.3 Problem formulation
3.3.1 Model and notation
3.3.2 Weighted error criterion and power function
3.4 Oracle procedure for wFDR control
3.4.1 Oracle procedure
3.4.2 Comparison with the optimality results in Spjøtvoll (1972) and Benjamini and Hochberg (1997)
3.4.3 Optimal ranking: VCR vs. WPO
3.5 Data-driven procedures and asymptotics
3.5.1 Proposed test statistic and its estimation
3.5.2 Proposed testing procedure and its asymptotic properties
3.6 Simulation studies
3.6.1 Group-wise weights
3.6.2 General weights
3.7 Application to GWAS
3.7.1 Framingham Heart Study
3.7.2 Multiple testing and wFDR control
3.7.3 Results
3.8 Discussion

4 Application and Future Directions
4.1 Collaboration
4.2 wFDR control of mutual fund performance alphas
4.3 Multiple testing with confident directions
4.3.1 Motivation
4.3.2 Problem formulation
4.3.3 One possible adaptive solution
4.3.4 Proposed testing procedure
4.4 Control of the false discovery exceedance (FDX)
4.4.1 Motivation
4.4.2 Problem formulation
4.4.3 Exact and implementable solutions
4.4.4 Connections to other method(s)

5 Proofs of Results in Chapter 2
5.1 Acknowledgments
5.2 Proofs of some main results
5.2.1 Proof of Theorem 1
5.2.2 Proof of Theorem 3
5.2.3 Proof of Theorem 4
5.3 Proofs of additional theorems
5.3.1 Proof of Theorem 2
5.3.2 Proof of Theorem 5
5.3.3 Proof of Theorem 6
5.3.4 Proof of Theorem 7
5.4 Proofs of lemmas
5.4.1 Proof of Lemma 1
5.4.2 Proof of Lemma 2
5.4.3 Proof of Lemma 8
5.4.4 Lemma for overfitted models
5.4.5 Lemma for underfitted models

6 Proofs of Results in Chapter 3
6.1 Proof of Theorem 8
6.2 Proof of Theorem 9
6.2.1 Notations
6.2.2 A useful lemma
6.2.3 Proof of Theorem 9
6.3 Proofs of propositions
6.3.1 Proof of Proposition 2
6.3.2 Proof of Proposition 3

References

List of Tables

2.1 Simulation results for Example 2.5.1 with all entries multiplied by 100 when the model is misspecified, with the oracle results based on both strong effects and weak effects.
2.2 Simulation results for Example 2.5.1 with all entries multiplied by 100 when the model is correctly specified, with the oracle results based on both strong effects and weak effects.
2.3 Simulation results for Example 2.5.1 with all entries multiplied by 100.
2.4 Simulation results for Example 2.5.1 with all entries multiplied by 100.
2.5 Example 2.5.1. Median false positives with median false negatives (strong/weak effects) in parentheses when the model is misspecified.
2.6 Example 2.5.1. Median false positives with median false negatives (strong/weak effects) in parentheses when the model is correctly specified.
2.7 Example 2.5.1. Median false positives with median false negatives in parentheses.
2.8 Example 2.5.1. Median false positives with median false negatives in parentheses.
2.9 Results for Prostate and Neuroblastoma data sets.
3.1 ETP values (in original scale) of various methods corresponding to Figure 3.2
3.2 Number of selected SNPs at different levels for the Male population
3.3 Some previously reported SNPs detected by all three methods
3.4 Some previously reported SNPs detected by Methods II and III only
3.5 Some previously reported SNPs detected by Method III only.

List of Figures

3.1 Comparison under group-wise weights: BH97 (or BH95), WPO, DD (proposed), and AZ. The efficiency gain of the proposed method increases as c_1 and c_2 become more distinct.
3.2 Comparison under group-wise weights: WPO, DD (proposed), and AZ. The efficiency gain of the proposed method is more pronounced when signals are weak.
3.3 Comparison with general weights: WPO, DD (proposed), and AZ. All methods control the wFDR approximately at the nominal level. The efficiency gains of the proposed method become more pronounced when (i) the signal strength decreases, (ii) the signals become more sparse, or (iii) the test level decreases.
3.4 Comparison with general weights: WPO, DD (proposed), and AZ. The unweighted AZ method fails to control the wFDR at the nominal level. The efficiency gain of the proposed method increases as signals become weaker.

Abstract

In this thesis we discuss problems in two cutting-edge topics of modern statistics, namely, high-dimensional statistical inference and large-scale multiple testing.

Model selection is indispensable to high-dimensional sparse modeling in selecting the best set of covariates among a sequence of candidate models. Most existing work assumes implicitly that the model is correctly specified or of fixed dimensions. Yet model misspecification and high dimensionality are common in real applications. We investigate two classical principles of model selection, the Kullback-Leibler divergence principle and the Bayesian principle, in the setting of high-dimensional misspecified models. Asymptotic expansions of these principles reveal that the effect of model misspecification is crucial and should be taken into account, leading to the generalized AIC and generalized BIC in high dimensions. With a natural choice of prior probabilities, we suggest the generalized BIC with prior probability, which involves a logarithmic factor of the dimensionality in penalizing model complexity. We further establish the consistency of the covariance contrast matrix estimator in a general setting. Our results and new methods are supported by numerical studies.

The use of weights provides an effective strategy to incorporate prior domain knowledge in large-scale inference. Our work studies weighted multiple testing in a decision-theoretic framework. We develop oracle and data-driven procedures that aim to maximize the expected number of true positives subject to a constraint on the weighted false discovery rate (wFDR). The asymptotic validity and optimality of the proposed methods are established. The results demonstrate that incorporating informative domain knowledge enhances the interpretability of results and the precision of inference. Simulation studies show that the proposed method controls the error rate at the nominal level, and the gain in power over existing methods is substantial in many settings. An application to a genome-wide association study is discussed.

Further, we discuss connections between the two topics, applications to area(s) of business, and possible future directions.

Chapter 1

Introduction

In this chapter we provide a brief and gentle introduction to Chapters 2 and 3. The concepts and the questions of concern have accordingly been simplified in their presentation.

1.1 Model selection in high-dimensional misspecified models

1.1.1 High-dimensional inference

With the introduction of linear regression by Gauss in the early nineteenth century, it would be a mere understatement to say that the topic has largely shaped the statistical literature. Since then, several generalizations of the `linearity' in the relationship and of the assumption of normality have been implemented. A notable landmark is the unification of the various formulations, by Nelder and Wedderburn, under the name of generalized linear models.

We let the data $\{(x_i, y_i)\}_{i=1}^n$ follow the distribution $y_i \sim P_{\beta_0}(\cdot \mid x_i)$, where $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, and the parameter vector $\beta_0 \in \mathbb{R}^d$. We are interested in the high-dimensional situation, where $p$ is larger than $n$. The ordinary least squares problem is then ill-posed, which is commonly addressed by Tikhonov's ridge regression or Tibshirani's lasso, among other methods.
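As a small illustration of this setting, the sketch below fits the lasso over a grid of penalty levels on simulated data with p larger than n. The sizes, coefficients, and penalty grid are hypothetical choices made only for illustration and are not settings used in this thesis.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 500                                   # hypothetical sizes with p > n
beta_true = np.zeros(p)
beta_true[:5] = [1.0, -1.0, 0.8, -0.8, 1.2]       # an assumed sparse truth
X = rng.standard_normal((n, p))
y = X @ beta_true + 0.25 * rng.standard_normal(n)

# Ordinary least squares is ill-posed here (p > n); the L1 penalty restores a
# well-defined sparse solution.  Varying the penalty traces out a sequence of
# candidate models (supports), the input to the model selection question below.
for alpha in [0.2, 0.1, 0.05, 0.02]:
    fit = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    support = np.flatnonzero(fit.coef_)
    print(f"alpha = {alpha:.2f}: {support.size} covariates selected")
```

Each penalty level yields a different candidate support, and the question taken up next is how to choose among such candidates in a principled way.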
One problem of inference concerns the identification of the d-dimensional subspace of the p-dimensional space and, when possible, a quantification of the relative importance. A simple example in the context of linear regression can be, step 1, to try all p choose d variables for all d for which the problem is well-posed, step 2, run a linear regression for every choice, and define the `best' choice as the one with the lowest prediction error, preferably out-of-sample. In this example, step 1 leads to identification of the best d-subspace, and the standardized coefficients from step 2 provide estimates of relative importance within the directions of the chosen space. More generally, the framework of statistical inference discussed here involves a choice of models, which then leads to the question: Given data, a few natural assumptions, and several competing models, how do we systematically decide on the `best' model? We explore this question further. Note that inference need not be model based; models are used for convenience, both technical and interpretational. Model-free inference is in its nascent stages of research and we do not pursue it here.

1.1.2 Model selection principles

Among the many ways to choose the best model, two classical model selection principles have been the Akaike information criterion (AIC) and Schwarz's Bayesian information criterion (BIC). Both of these principles penalize the in-sample prediction error by a factor of the dimension of the model to avoid overfitting. In the case of simple linear regression, with, say, unit noise variance, the `best' model guided by these principles can be written as
$$\widehat{m} = \arg\min_m \{\mathrm{RSS}(k_m) + 2\lambda k_m\},$$
where $\mathrm{RSS}(k_m)$ denotes the residual sum of squares for the model $m$ with $k_m$ parameters. The penalization parameter is $\lambda = 1$ for AIC. For BIC it is $\lambda = (\log n)/2$. Foster and George's risk inflation criterion (RIC) model selection rule takes $\lambda = \log p$, with $p$ being the number of variables available for potential inclusion. We raise several questions: In the situation of high-dimensional inference, which of these penalizations can be `justified'? Further, under what conditions can they really pin down the true model with high certainty? We will explore these questions.

1.1.3 Model misspecification

In Section 1.1.1 we discussed model-based inference but only alluded to model-free inference. Taking an intermediate step, and in practice, it could well be true that the set of models that have been considered does not include the true model. Besides, there is no way to verify this within the framework of model selection principles. Although pioneered by the econometrician White back in 1982, the statistical literature is rather incomplete on the model misspecification front. Much of what exists concerns maximum likelihood estimation (MLE), called quasi-MLE in this case to emphasize the misspecification of the likelihood function. So we can ask: How do we appropriately formulate the model selection principles to accommodate model misspecification? And if so, would we be able to estimate that consistently from the given data? And even further, would that then be able to select the model that fits `best', again with desirably high certainty, relative to the dimensions $(n, p, d)$ of the problem? We attempt to answer all these questions.

1.2 wFDR control in large-scale multiple testing

1.2.1 Large-scale multiple testing

An alternative framework to the regression problem is that of hypothesis testing.
The concepts date back to the early twentieth century, with Sir Ronald Fisher, perhaps in jest, trying to solve the famous lady tasting tea problem. This led to two new areas of statistics: hypothesis testing and design of experiments, often very well connected. Around the same time, Neyman and Pearson proposed a decision-theory-based cost-benefit analysis philosophy leading to the foundational principles of modern hypothesis testing. About half a century later, several researchers became concerned with the issue of multiple comparisons or testing. The basic concern is very simple: If there are 1000 hypotheses that are each tested at a p-value threshold of 0.01, about 10 of them will be wrongly rejected by design. This observation leads to a chain of questions: Can we allow for so many wrong rejections? If not, how can we adjust the testing procedures to correct for the unwanted wrong rejections? We explore these questions here.

The first attempt to correct for the unwanted number of wrong rejections was to scale the p-value threshold by a factor of the number of hypotheses tested, and goes by the name of the Bonferroni correction. However, that results in a conservative test procedure and often yields very few rejections or discoveries. This gave way to the fascinating question: How can we balance multiplicity control and power of a testing procedure?

1.2.2 False discovery rate (FDR) criterion

To achieve higher power and yet retain multiplicity control, Benjamini and Hochberg's FDR criterion proposes to control
$$\mathrm{E}\left(\frac{V}{\max\{R, 1\}}\right) \le q,$$
where $V$ denotes the number of false rejections out of $R$ rejections and $q$ denotes the desired nominal level of the test procedure. The intuition here is that a few wrong rejections are allowed, but only up to a desired fraction of the number of rejections or discoveries. Except in the situation where all nulls are true, the criterion allows for more power than the Bonferroni correction. The celebrated procedure controlling the FDR, known as the BH procedure, ranks the $m$ hypotheses by increasing p-values. The maximum number $k$ of hypotheses is rejected such that $p_{(k)} \le qk/m$. It can be shown that the BH procedure controls the FDR at level $q$ under independence. The most elegant proof of this fact has been articulated in proof two of https://statpacking.wordpress.com. With the advent of microarrays, followed by expression quantitative trait loci (eQTLs) in genome-wide association studies (GWAS), procedures controlling the FDR have become increasingly relevant.

We can ask: Does the BH procedure provide the most powerful test among all the procedures that control the FDR at the nominal level? Note that the threshold for the BH procedure does not use the information in the actual p-values beyond the ranks. Are we losing any information then? Are p-values even the right statistic to rank by? Is there a need to modify the FDR definition under certain situations, particularly, say, under prior or additional information? We will explore these questions.

1.2.3 Weighted FDR control

In a situation where the hypotheses to be tested may not all be equally relevant, it is important to distinguish between them. Conceptually this may be done by assigning unequal weights to the type-I and type-II decision errors. The multiple testing problem with prior information, that is, information which distinguishes the hypotheses, can then be reformulated in a decision-theoretic framework. Under this framework we ask: What is the `optimal' ranking of the hypotheses for the weighted testing problem?
And what may be a reasonable threshold for the rejection of the null hypotheses? How does the method perform in comparison to p-value based methods? Further, can the suggested procedure achieve the balance between multiplicity control and power? We attempt to answer these questions.

1.3 Thesis outline

Besides Section 1.4, where we briefly provide a connection between model selection principles and FDR controlling method(s), the rest of the thesis is organized as follows. Chapter 2 attempts to answer the questions raised about model selection principles for high-dimensional misspecified models. Chapter 3 addresses the problem of weighted multiple testing. In Chapter 4 we discuss application(s) to business, and some future work. All technical proofs are provided in Chapters 5 and 6.

1.4 Synthesis

In this section we elaborate one explicit connection between the ideas of false discovery rate controlling procedures and model selection principles. Consider the multivariate normal means problem:
$$y_i = \mu_i + \varepsilon_i,$$
where the $\varepsilon_i \sim N(0, \sigma^2)$ are independent and identically distributed, for $i = 1, \ldots, n$. Let us suppose that $\sigma$ is known, for simplicity. Given observations $y_i$, the problem is to estimate the unknown means $\mu_i$. Let us also assume that only a few of the $\mu_i$'s, their number unknown to us, are nonzero, a phenomenon called `sparsity'. Perhaps a more realistic assumption is that a few of the nonzero $\mu_i$'s are `large' and we want to be able to identify and estimate those as accurately as possible.

We describe an FDR based estimation procedure. Let $t_k := \sigma\,\Phi_c^{-1}(q/2 \cdot k/n)$, where $\Phi_c(\cdot)$ denotes the normal right-tail probability and $q$ denotes the nominal level. Define $\hat{k}$ as the largest $k$ such that $|y|_{(k)} \ge t_k$, with $|y|_{(1)}$ being the largest in absolute value. This procedure, which chooses the `largest' such index, is called the `step-up' method. The hard-thresholded estimates of the unknown means are then defined by
$$\hat{\mu}_k^{\hat{k}} := \begin{cases} y_k & \text{if } |y_k| \ge t_{\hat{k}}, \\ 0 & \text{otherwise.} \end{cases}$$
To reiterate, this is hard-thresholding estimation of the unknown means where the data-adaptive threshold is determined by the BH procedure, a concept developed from the FDR framework.

Now we illustrate how it may also be derived from model selection principles. Define $\mathrm{RSS}(k_0) := \sum_k (y_k - \hat{\mu}_k^{k_0})^2$. The FDR based hard-thresholding procedure can be formulated as a penalized model selection procedure with
$$\hat{k} := \arg\min_{k_0}\left\{\mathrm{RSS}(k_0) + \sigma^2 k_0\left(k_0^{-1}\sum_{k=1}^{k_0}[\Phi_c^{-1}(q/2 \cdot k/n)]^2\right)\right\}.$$
We simplify the penalty term $\lambda_{k_0,n} := k_0^{-1}\sum_{k=1}^{k_0}[\Phi_c^{-1}(q/2 \cdot k/n)]^2$ when $k_0 \ll n$. From properties of Gaussian quantiles and a Taylor expansion we have
$$\lambda_{k_0,n} \approx k_0^{-1}\sum_{k=1}^{k_0} 2\log(2n/qk) \ge 2\log(2n/qk_0) = 2\log(n/k_0) + 2\log(2/q).$$
This observation has been known and is transcribed in the works of many scholars such as Felix Abramovich, Yoav Benjamini, David Donoho, and Iain Johnstone, among others. The penalty observed here is thus more strict than that of AIC but slightly more relaxed than that of BIC. We think that this is an interesting compromise, trading some of the desirable variable or model selection consistency in the hope of achieving more power, that is, identifying a higher number of true positives.
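Before turning to possible extensions, a minimal numerical sketch of the FDR based hard-thresholding estimator just described may help make it concrete. It follows the step-up rule and threshold formula above, assuming $\sigma$ is known; the nominal level q = 0.1 and the simulated means are illustrative choices only.

```python
import numpy as np
from scipy.stats import norm

def fdr_hard_threshold(y, sigma=1.0, q=0.1):
    """Hard-thresholding of normal means with a BH-style step-up threshold."""
    n = len(y)
    abs_sorted = np.sort(np.abs(y))[::-1]           # |y|_(1) >= |y|_(2) >= ...
    k = np.arange(1, n + 1)
    t = sigma * norm.isf(q / 2 * k / n)             # t_k = sigma * Phi_c^{-1}(q/2 * k/n)
    passing = np.nonzero(abs_sorted >= t)[0]
    if passing.size == 0:
        return np.zeros(n)                          # nothing exceeds its threshold
    k_hat = passing.max() + 1                       # largest k with |y|_(k) >= t_k
    return np.where(np.abs(y) >= t[k_hat - 1], y, 0.0)

# toy check: a few large means hidden among many zero means
rng = np.random.default_rng(0)
mu = np.concatenate([np.full(10, 4.0), np.zeros(990)])
y = mu + rng.standard_normal(1000)
mu_hat = fdr_hard_threshold(y, sigma=1.0, q=0.1)
print("number of nonzero estimates:", np.count_nonzero(mu_hat))
```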
It is natural to ask whether this connection may be extended to a regression set-up, especially for the case of non-orthogonal design matrices. In particular, we can ask which form of model selection principle leads to an FDR based estimator for identifying and estimating the slope coefficients. We believe that such a connection has not been theoretically established in the literature so far, and we forward this challenging problem to the future.

Chapter 2

Model Selection in High-Dimensional Misspecified Models

2.1 Collaboration

This chapter is a collaborative work with Yang Feng and Jinchi Lv.

2.2 Introduction

With rapid advances of modern technology, high-throughput data sets of unprecedented size, such as genetic and proteomic data, fMRI and functional data, and panel data in economics and finance, are frequently encountered in many contemporary applications. In these applications, the dimensionality p can be comparable to or even much larger than the sample size n. A key assumption that often makes large-scale inference feasible is the sparsity of signals, meaning that only a small fraction of covariates contribute to the response when p is large compared to n. High-dimensional modeling with dimensionality reduction and feature selection plays an important role in these problems. A sparse modeling procedure typically produces a sequence of candidate models, each involving a possibly different subset of covariates. An important question is how to compare different models in high dimensions when models are possibly misspecified.

The problem of model selection has a long history with numerous contributions by many researchers. Among others, well-known model selection criteria are the AIC (Akaike, 1973 and 1974) and BIC (Schwarz, 1978), where the former is based on the Kullback-Leibler (KL) divergence principle of model selection and the latter originates from the Bayesian principle. A great deal of work has been devoted to understanding and extending these methods. See, for example, Bozdogan (1987), Foster and George (1994), Konishi and Kitagawa (1996), Ing (2007), Chen and Chen (2008), Chen and Chan (2011), Ing and Lai (2011), Liu and Yang (2011), and Chang et al. (2014) in different model settings. The connections between the AIC and cross-validation have been investigated in Stone (1977), Hall (1990), and Peng et al. (2013) in various contexts. Model selection criteria such as AIC and BIC are frequently used for tuning parameter selection in regularization methods. For instance, model selection in the context of penalized likelihood methods has been studied in Fan and Li (2001), Wang et al. (2007), Wang et al. (2009), Zhang et al. (2010), and Fan and Tang (2013). In particular, Fan and Tang (2013) showed that classical information criteria such as AIC and BIC can be inconsistent for model selection when the dimensionality p grows very fast relative to the sample size n.

Most existing work on model selection usually makes an implicit assumption that the model under study is correctly specified or of fixed dimensions. For example, White (1982) laid out a general theory of maximum likelihood estimation in misspecified models for the case of fixed dimensionality and independent and identically distributed (i.i.d.) observations. Recently, Lv and Liu (2014) investigated the problem of model selection with model misspecification and derived asymptotic expansions of both KL divergence and Bayesian principles in misspecified generalized linear models, leading to the generalized AIC and generalized BIC, for the case of fixed dimensionality. A specific form of prior probabilities motivated by the KL divergence principle leads to the generalized BIC with prior probability (GBIC_p-L). Yet model misspecification and high dimensionality are both common in real applications.
Thus a natural and important question is how to characterize the impact of model misspecification on model selection in high dimensions. We intend to provide some answer to this question in this work. Our analysis enables us to suggest the generalized BIC with prior probability (GBIC_p) that involves a logarithmic factor of the dimensionality in penalizing model complexity. (Here the notation GBIC_p-L emphasizes that that criterion is for the low-dimensional case, while the original notation GBIC_p in Lv and Liu (2014) is reserved for the high-dimensional counterpart.)

To gain some insights into the challenges of the aforementioned problem, let us consider a motivating example. Assume that the response Y depends on the covariate vector $(X_1, \ldots, X_p)^T$ through the functional form
$$Y = f(X_1) + f(X_2 X_3) + f(X_4 X_5) + \varepsilon, \qquad (2.1)$$
where $f(x) = x^3/(x^2 + 1)$ and the remaining setting is as specified in Section 2.5.1. Consider sample size n = 100 and vary the dimensionality p from 200 to 3200. Without prior knowledge about the true model structure, we take the linear regression model
$$y = X\beta + \varepsilon \qquad (2.2)$$
as the working model, with the same notation therein, and apply some information criteria in the hope of recovering the oracle working model consisting of the first five covariates. When p = 200, the traditional AIC and BIC, which ignore model misspecification, tend to select a model with size larger than five. As expected, GBIC_p-L works reasonably well by selecting the oracle working model half of the time. However, when p is increased to 3200, these methods fail to select such a model with significant probability and the prediction performance of the selected models deteriorates. This motivates us to study the problem of model selection in high-dimensional misspecified models. In contrast, our newly suggested GBIC_p can recover the oracle working model with significant probability in this challenging scenario.

The main contributions of our work are threefold. First, we establish a systematic theory of model selection with model misspecification in high dimensions. The asymptotic expansions for different model selection principles involve delicate and challenging technical analysis. Second, our work provides rigorous theoretical justification of the covariance contrast matrix estimator that incorporates the effect of model misspecification and is crucial for practical implementation. Such an estimator is shown to be consistent in the general setting of high-dimensional misspecified models. Third, we suggest the use of a new prior in the expansion for GBIC_p involving the log p term. This criterion has connections to the model selection criteria in Chen and Chen (2008) and Fan and Tang (2013) with the log p factor for the case of correctly specified models.

The rest of the work is organized as follows. Section 2.3 introduces the setup for model misspecification. We present some key asymptotic properties of the quasi-maximum likelihood estimator and provide asymptotic expansions of the KL divergence and Bayesian model selection principles in high dimensions in Section 2.4. Section 2.5 demonstrates the performance of different model selection criteria in high-dimensional misspecified models through several simulation and real data examples. We provide some discussions of our results and possible extensions in Section 2.7.
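To make the motivating example reproducible in spirit, the following sketch generates data from the functional form (2.1) and fits the misspecified linear working model (2.2) on a candidate subset. The random seed is arbitrary and the noise level sigma = 0.25 is taken from the settings of Section 2.5.1, so this is an illustration rather than the exact study design.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 100, 3200, 0.25                      # n and p as in the motivating example
f = lambda x: x**3 / (x**2 + 1.0)

X = rng.standard_normal((n, p))
y = f(X[:, 0]) + f(X[:, 1] * X[:, 2]) + f(X[:, 3] * X[:, 4]) + sigma * rng.standard_normal(n)

# Misspecified working model: ordinary linear regression on a candidate subset,
# here the oracle working model consisting of the first five covariates.
subset = [0, 1, 2, 3, 4]
XS = X[:, subset]
beta_hat, *_ = np.linalg.lstsq(XS, y, rcond=None)
rss = float(np.sum((y - XS @ beta_hat) ** 2))
print("fitted coefficients:", np.round(beta_hat, 3), "RSS:", round(rss, 3))
```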
2.3 Model misspecification

Assume that conditional on the covariates, the n-dimensional random response vector $Y = (Y_1, \ldots, Y_n)^T$ has a true unknown distribution $G_n$ with density function
$$g_n(y) = \prod_{i=1}^n g_{n,i}(y_i), \qquad (2.3)$$
where $y = (y_1, \ldots, y_n)^T$. Model (2.3) entails that all components of Y are independent but not necessarily identically distributed. Consider a set of d covariates out of all p available covariates, where p can be much larger than n. Denote by X the corresponding $n \times d$ deterministic design matrix. To simplify the technical presentation, we focus on the case of deterministic design. In practice, one chooses a family of working models to fit the data. Model misspecification generally occurs when the family of distributions is misspecified or some true covariates are missed.

Since the true model $G_n$ is unknown, we choose a family of generalized linear models (GLMs) $F_n(\cdot, \beta) = F_n(z; X, \beta)$ with a canonical link as our working models, each of which has density function
$$f_n(z; \beta)\, d\mu_0(z) = \prod_{i=1}^n f_0(z_i; \theta_i)\, d\mu_0(z_i) \equiv \prod_{i=1}^n \exp[z_i\theta_i - b(\theta_i)]\, d\mu(z_i), \qquad (2.4)$$
where $z = (z_1, \ldots, z_n)^T$, $\theta = (\theta_1, \ldots, \theta_n)^T = X\beta$ with $\beta \in \mathbb{R}^d$, $b(\cdot)$ is a smooth convex function, $\mu_0$ is the Lebesgue measure, and $\mu$ is some fixed measure on $\mathbb{R}$. Assume that $b''(\cdot)$ is continuous and bounded away from 0, X is of full column rank d, and the $EY_i^2$ are bounded. Clearly $\{f_0(z; \theta): \theta \in \mathbb{R}\}$ is a family of distributions in the regular exponential family and may not contain the $g_{n,i}$'s.

To ease the presentation, define two vector-valued functions $B(\theta) = (b(\theta_1), \ldots, b(\theta_n))^T$ and $\mu(\theta) = (b'(\theta_1), \ldots, b'(\theta_n))^T$, and a matrix-valued function $\Sigma(\theta) = \mathrm{diag}\{b''(\theta_1), \ldots, b''(\theta_n)\}$. For any n-dimensional random vector Z with distribution $F_n(\cdot, \beta)$ given by (2.4), it holds that $EZ = \mu(X\beta)$ and $\mathrm{Cov}(Z) = \Sigma(X\beta)$. The density function (2.4) can be rewritten as
$$f_n(z; \beta) = \exp\left[z^T X\beta - 1^T B(X\beta)\right]\prod_{i=1}^n \frac{d\mu}{d\mu_0}(z_i),$$
where $d\mu/d\mu_0$ denotes the Radon-Nikodym derivative. Given the observations y and X, this gives the quasi-log-likelihood function
$$\ell_n(y, \beta) = \log f_n(y; \beta) = y^T X\beta - 1^T B(X\beta) + \sum_{i=1}^n \log\frac{d\mu}{d\mu_0}(y_i). \qquad (2.5)$$
The quasi-maximum likelihood estimator (QMLE) of the d-dimensional parameter vector $\beta$ is defined as
$$\hat{\beta}_n = \arg\max_{\beta \in \mathbb{R}^d} \ell_n(y, \beta), \qquad (2.6)$$
which is the solution to the score equation $\Psi_n(\beta) = \partial\ell_n(y, \beta)/\partial\beta = X^T[y - \mu(X\beta)] = 0$. This equation becomes the normal equation $X^T y = X^T X\beta$ in the linear regression model.

The KL divergence (Kullback and Leibler, 1951) of the model $F_n(\cdot, \beta)$ from the true model $G_n$ can be written as $I(g_n; f_n(\cdot, \beta)) = E\log g_n(Y) - E\ell_n(Y, \beta)$. The best working model that is closest to the true model under the KL divergence has parameter vector $\beta_{n,0} = \arg\min_{\beta \in \mathbb{R}^d} I(g_n; f_n(\cdot, \beta))$, which solves the equation
$$X^T[EY - \mu(X\beta)] = 0. \qquad (2.7)$$
We introduce two matrices that play a key role in model selection with model misspecification. Define
$$B_n = \mathrm{Cov}(X^T Y) = X^T\mathrm{Cov}(Y)X \qquad (2.8)$$
with $\mathrm{Cov}(Y) = \mathrm{diag}\{\mathrm{Var}(Y_1), \ldots, \mathrm{Var}(Y_n)\}$ by the independence assumption, and
$$A_n(\beta) = \frac{\partial^2 I(g_n; f_n(\cdot, \beta))}{\partial\beta^2} = -\frac{\partial^2\ell_n(y, \beta)}{\partial\beta^2} = X^T\Sigma(X\beta)X, \qquad (2.9)$$
and set $A_n = A_n(\beta_{n,0})$. Observe that $A_n$ and $B_n$ are the covariance matrices of $X^T Y$ under the best misspecified GLM $F_n(\cdot, \beta_{n,0})$ and the true model $G_n$, respectively.

2.4 High-dimensional model selection in misspecified models

We now present the asymptotic expansions of both the KL divergence and Bayesian model selection principles in high-dimensional misspecified GLMs.
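Before listing the technical conditions, a small numerical sketch may help fix ideas about the two matrices A_n and B_n defined in Section 2.3. It uses the Gaussian linear working model with unit dispersion, for which Sigma(X beta) is the identity, together with a hypothetical heteroscedastic truth; both choices are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
X = rng.standard_normal((n, d))

# Gaussian linear working model with unit dispersion: b(theta) = theta^2 / 2,
# so Sigma(X beta) = I_n and A_n(beta) = X^T X for every beta.
A_n = X.T @ X

# Hypothetical true variances Var(Y_i): heteroscedastic, so the working family
# does not contain the truth.
var_y = np.exp(X[:, 0])
B_n = X.T @ (var_y[:, None] * X)                   # B_n = X^T Cov(Y) X

H_n = np.linalg.solve(A_n, B_n)                    # covariance contrast matrix A_n^{-1} B_n
print("tr(H_n) =", round(float(np.trace(H_n)), 2), "vs d =", d)
```

Under a correctly specified model the two matrices coincide and tr(H_n) = d; the gap between them is exactly what the criteria developed below account for.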
2.4.1 Technical conditions and asymptotic properties of QMLE in high dimensions

We list a few technical conditions required to prove the asymptotic properties of the QMLE with diverging dimensionality. Denote by $\|\cdot\|_2$ the vector $L_2$-norm and the matrix operator norm.

Condition 1. There exists some constant H > 0 such that for each $1 \le i \le n$, $P(|q_i| > t) \le H\exp(-t^2/H)$ for any $t \ge 0$, where $(q_1, \ldots, q_n)^T = \mathrm{Cov}(Y)^{-1/2}(Y - EY)$.

Condition 2. There exist positive constants $c_1$, $c_0 > 8c_1^2 H$, and $r < 1/4$ such that for sufficiently large n, $\min_{\beta \in N_n(\delta_n)}\lambda_{\min}\{V_n(\beta)\} > c_1 n^{-r}$ and $\lambda_{\min}(B_n) \ge d\,\delta_n^2$, where $\delta_n = n^{r}(c_0\log n)^{1/2}$, $N_n(\delta_n) = \{\beta \in \mathbb{R}^d: \|(n^{-1}B_n)^{1/2}(\beta - \beta_{n,0})\|_2 \le (n/d)^{-1/2}\delta_n\}$, and $V_n(\beta) = B_n^{-1/2}A_n(\beta)B_n^{-1/2}$. Moreover, $d = o\{n^{(1-4r)/3}(\log n)^{-2/3}\}$.

Condition 3. Assume $\sum_{i=1}^n (x_i^T B_n^{-1}x_i)^{3/2} = o(1)$ and $\max_{1\le i\le n} E|Y_i - EY_i|^3 = O(1)$.

Condition 4. Assume
$$\max_{\beta_1, \ldots, \beta_d \in N_n(\delta_n)}\|\widetilde{V}_n(\beta_1, \ldots, \beta_d) - V_n\|_2 = O(d\,n^{-1/2}\delta_n),$$
where $V_n = V_n(\beta_{n,0}) = B_n^{-1/2}A_n B_n^{-1/2}$ and $\widetilde{V}_n(\beta_1, \ldots, \beta_d) = B_n^{-1/2}\widetilde{A}_n(\beta_1, \ldots, \beta_d)B_n^{-1/2}$ with $\widetilde{A}_n(\beta_1, \ldots, \beta_d)$ a $d \times d$ matrix whose jth row is the corresponding row of $A_n(\beta_j)$ for each $1 \le j \le d$. Moreover, $\lambda_{\max}(V_n)$ is of polynomial order in n.

Conditions 1 and 2 are some basic assumptions for establishing the consistency of the QMLE $\hat{\beta}_n$ in Theorem 1. In particular, Condition 1 assumes that the standardized response has a sub-Gaussian distribution, which facilitates the derivation of the deviation probability bound. Conditions 2-4 are similar to those in Lv and Liu (2014), except for some major differences due to the high-dimensional setting. In particular, Condition 2 allows the minimum eigenvalue of $V_n(\beta)$ to converge to zero at a certain rate as n increases in a neighborhood $N_n(\delta_n)$ of $\beta_{n,0}$. Such a neighborhood is wider compared to that for the case of fixed dimensionality. The dimensionality d of the QMLE is allowed to diverge with n. Conditions 3 and 4 are imposed to establish the asymptotic normality of $\hat{\beta}_n$.

Theorem 1. (Consistency of QMLE). Under Conditions 1-2, the QMLE $\hat{\beta}_n$ satisfies $\hat{\beta}_n - \beta_{n,0} = o_P(1)$ and further $\hat{\beta}_n \in N_n(\delta_n)$ with probability $1 - O(n^{-\alpha})$ for some large positive constant $\alpha$.

Theorem 2. (Asymptotic normality). Under Conditions 1-4, the QMLE $\hat{\beta}_n$ satisfies
$$D_n C_n(\hat{\beta}_n - \beta_{n,0}) \xrightarrow{D} N(0, I_m),$$
where $C_n = B_n^{-1/2}A_n$ and $D_n$ is any $m \times d$ matrix such that $D_n D_n^T = I_m$.

Theorems 1 and 2 establish the consistency and asymptotic normality of the QMLE in high-dimensional misspecified GLMs. These results provide the theoretical foundation for the technical analyses in Sections 2.4.2-2.4.4. The asymptotic theory of the QMLE reduces to that of the maximum likelihood estimator (MLE) when the model is correctly specified. Our results extend those in Lv and Liu (2014) for the case of fixed dimensionality. We next introduce a few additional conditions for deriving the asymptotic expansions of the two model selection principles.

Condition 5. There exists some constant $\alpha_1$ with $0 < \alpha_1 < \alpha/2 - 1$ such that $b''(\cdot) = O(n^{\alpha_1})$ and, for sufficiently large n, $N_n(\delta_n) \subset M_n(\alpha_1) = \{\beta \in \mathbb{R}^d: \|X\beta\|_\infty \le \alpha_1\log n\}$, where the constant $\alpha$ is given in Theorem 1.

Condition 6. Assume that $\pi(h(\gamma)) = \frac{d\mu_M}{d\mu_0}(h(\gamma))$ satisfies
$$\inf_{\gamma \in N_n(2\delta_n)}\pi(h(\gamma)) \ge c_2 \quad\text{and}\quad \sup_{\gamma \in \mathbb{R}^d}\pi(h(\gamma)) \le c_3 \qquad (2.10)$$
with $c_2, c_3 > 0$ some constants, and
$$\rho_n(\delta_n) = \max_{\beta \in N_n(2\delta_n)}\max\{|\lambda_{\min}(V_n(\beta) - V_n)|, |\lambda_{\max}(V_n(\beta) - V_n)|\} = o\{n^{-(1-r)/3}\}.$$

Condition 7.
Assume that $n^{-1}A_n(\beta)$, $n^{-1}X^T\mathrm{diag}\{|\mu(X\beta) - \mu(X\beta_{n,0})|\}X$, and $n^{-1}X^T\mathrm{diag}\{[\mu(X\beta) - \mu(X\beta_{n,0})] \circ [\mu(X\beta) - \mu(X\beta_{n,0})]\}X$ are Lipschitz (in operator norm) with constant L > 0 in $N_n(\delta_n)$, and $\|X\|_\infty = O(n^{\alpha_2})$ with constant $0 \le \alpha_2 < r$, where $\circ$ represents the Hadamard (componentwise) product and $\|\cdot\|_\infty$ denotes the entrywise matrix $L_\infty$-norm.

Condition 8. Assume $\sum_{i=1}^n\{[EY_i - (\mu(X\beta_{n,0}))_i]^2/\mathrm{Var}(Y_i)\}^2 = O(n^{\alpha_3})$ with some constant $0 \le \alpha_3 \le 4(r - \alpha_2)$.

The first part of Condition 5 holds naturally for linear and logistic regression models, and is introduced to accommodate the case of Poisson regression. The second part of Condition 5 is a mild assumption ensuring that the restricted QMLE coincides with its unrestricted version with significant probability, which is key to the asymptotic expansion of the KL divergence principle in high dimensions in Theorem 3. It is worth mentioning that the set $M_n(\alpha_1)$ grows with n, while the neighborhood $N_n(\delta_n)$ is asymptotically shrinking. Condition 6 is similar to the one in Lv and Liu (2014), except that we need to specify the rate at which $\rho_n(\delta_n)$ converges to zero. Condition 7 requires the Lipschitz property for those matrix-valued functions. The bound on the entrywise matrix $L_\infty$-norm of the design matrix is mild. Condition 8 is a sensible assumption bounding the effect of the model bias. In particular, Conditions 7 and 8 are introduced only for proving the consistency of the covariance contrast matrix estimator in the general setting in Theorem 4.

2.4.2 Generalized AIC in misspecified models

Given a sequence of subsets $\{M_m: m = 1, \ldots, M\}$ of the full model $\{1, \ldots, p\}$, we can construct a sequence of QMLEs $\{\hat{\beta}_{n,m}: m = 1, \ldots, M\}$ by fitting the GLM (2.4). A natural question is how to compare those fitted models. The QMLEs $\{\hat{\beta}_{n,m}: m = 1, \ldots, M\}$ become the MLEs when the model is correctly specified. Akaike's principle of model selection is choosing the model $M_{m_0}$ that minimizes the KL divergence $I(g_n; f_n(\cdot, \hat{\beta}_{n,m}))$ of the fitted model $F_n(\cdot, \hat{\beta}_{n,m})$ from the true model $G_n$, that is,
$$m_0 = \arg\min_{m \in \{1, \ldots, M\}} I(g_n; f_n(\cdot, \hat{\beta}_{n,m})), \qquad (2.11)$$
where
$$I(g_n; f_n(\cdot, \hat{\beta}_{n,m})) = E\log g_n(\widetilde{Y}) - \eta_n(\hat{\beta}_{n,m}) \qquad (2.12)$$
with $\eta_n(\beta) = E\ell_n(\widetilde{Y}, \beta)$ and $\widetilde{Y}$ an independent copy of Y. Thus
$$m_0 = \arg\max_{m \in \{1, \ldots, M\}} \eta_n(\hat{\beta}_{n,m}) = \arg\max_{m \in \{1, \ldots, M\}} E_{\widetilde{Y}}\,\ell_n(\widetilde{Y}, \hat{\beta}_{n,m}),$$
which shows that Akaike's principle of model selection is equivalent to choosing the model $M_{m_0}$ that maximizes the expected log-likelihood, with the expectation taken with respect to an independent copy of Y. Using the asymptotic theory of the MLE, Akaike (1973) showed that for the case of i.i.d. observations, $\eta_n(\hat{\beta}_n)$ can be asymptotically expanded as $\ell_n(y, \hat{\beta}_n) - |M|$, which leads to the seminal AIC for comparing competing models:
$$\mathrm{AIC}(y; M) = -2\ell_n(y, \hat{\beta}_n) + 2|M|. \qquad (2.13)$$
For simplicity, we drop the last term in (2.5), which does not depend on $\beta$, and redefine the quasi-log-likelihood as $\ell_n(y, \beta) = y^T X\beta - 1^T B(X\beta)$ hereafter.

Theorem 3. Under Conditions 1-5, we have with probability tending to one,
$$E\eta_n(\hat{\beta}_n) = E\ell_n(y, \hat{\beta}_n) - \mathrm{tr}(H_n) + o\{\mathrm{tr}(H_n)\}, \qquad (2.14)$$
where $H_n = A_n^{-1}B_n$.

Theorem 3 generalizes the corresponding result in Lv and Liu (2014) to high dimensions. However, we would like to point out that our new technical analysis differs substantially from theirs due to the challenges of diverging dimensionality. The asymptotic expansion in Theorem 3 enables us to introduce the generalized AIC (GAIC) as follows.

Definition 1. We define the GAIC of model M as
$$\mathrm{GAIC}(y; M; F_n) = -2\ell_n(y, \hat{\beta}_n) + 2\,\mathrm{tr}(\hat{H}_n), \qquad (2.15)$$
where $\hat{H}_n$ is a consistent estimator of $H_n$ specified in Section 2.4.3.
When the model is correctly specified, it holds that $\mathrm{tr}(\hat{H}_n) \approx \mathrm{tr}(I_d) = |M|$, under which GAIC reduces to AIC asymptotically. We demonstrate in the simulation studies that GAIC can improve over the original AIC substantially in the presence of model misspecification.

2.4.3 Estimation of covariance contrast matrix

From the asymptotic expansions for the GAIC, GBIC, and GBIC_p (the latter two to be introduced in Section 2.4.4), a common term is the covariance contrast matrix $H_n$, which characterizes the impact of model misspecification. Therefore, providing an accurate estimator for such a matrix $H_n$ is of vital importance in the application of these information criteria. Consider the plug-in estimator $\hat{H}_n = \hat{A}_n^{-1}\hat{B}_n$ with $\hat{A}_n$ and $\hat{B}_n$ defined as follows. Since the QMLE $\hat{\beta}_n$ provides a consistent estimator of $\beta_{n,0}$ in the best misspecified GLM $F_n(\cdot, \beta_{n,0})$, a natural estimate of the matrix $A_n$ is given by
$$\hat{A}_n = A_n(\hat{\beta}_n) = X^T\Sigma(X\hat{\beta}_n)X. \qquad (2.16)$$
When the model is correctly specified, the following simple estimator
$$\hat{B}_n = X^T\mathrm{diag}\left\{[y - \mu(X\hat{\beta}_n)] \circ [y - \mu(X\hat{\beta}_n)]\right\}X \qquad (2.17)$$
gives an asymptotically unbiased estimator of $B_n$.

Theorem 4. (Consistency of estimator). Assume that Conditions 1-3 and 7-8 hold, the eigenvalues of $n^{-1}A_n$ and $n^{-1}B_n$ are bounded away from 0 and $\infty$, and $d = o\{n^{(1-4r)/4}\}$. Then the plug-in estimator $\hat{H}_n$ satisfies $\mathrm{tr}(\hat{H}_n) = \mathrm{tr}(H_n) + o_P(1)$ and $\log|\hat{H}_n| = \log|H_n| + o_P(1)$.

Theorem 4 improves the result in Lv and Liu (2014) in two important aspects. First, the consistency of the covariance contrast matrix estimator was previously justified in Lv and Liu (2014) for the case of a correctly specified model. Our new result shows that the simple plug-in estimator $\hat{H}_n$ still enjoys consistency in the general setting of model misspecification. Second, the result in Theorem 4 holds for the case of diverging dimensionality. These theoretical guarantees are crucial to the practical implementation of those information criteria. Our numerical studies reveal that such an estimate works well in a variety of model misspecification settings.

2.4.4 Generalized BIC in misspecified models

Given a set of competing models $\{M_m: m = 1, \ldots, M\}$, a popular Bayesian model selection procedure is to first put nonzero prior probability $\alpha_{M_m}$ on each model $M_m$, and then choose a prior distribution $\mu_{M_m}$ for the parameter vector in the corresponding model. Assume that the density function of $\mu_{M_m}$ is bounded in $\mathbb{R}^{|M_m|} = \mathbb{R}^{d_m}$ with $d_m = |M_m|$ and locally bounded away from zero throughout the domain. The Bayesian principle of model selection is to choose the most probable model a posteriori, that is, choose the model $M_{m_0}$ such that
$$m_0 = \arg\max_{m \in \{1, \ldots, M\}} S(y; M_m; F_n), \qquad (2.18)$$
where the log-marginal-likelihood is
$$S(y; M_m; F_n) = \log\int\alpha_{M_m}\exp[\ell_n(y, \beta)]\,d\mu_{M_m}(\beta) \qquad (2.19)$$
with the log-likelihood $\ell_n(y, \beta)$ as in (2.5) and the integral over $\mathbb{R}^{d_m}$. To ease the presentation, for any $\beta \in \mathbb{R}^d$ we define the quantity
$$\delta\ell_n(y, \beta) = \ell_n(y, \beta) - \ell_n(y, \hat{\beta}_n), \qquad (2.20)$$
which is the deviation of the quasi-log-likelihood from its maximum. Then from (2.19) and (2.20), we have
$$S(y; M_m; F_n) = \ell_n(y, \hat{\beta}_n) + \log E_{\mu_{M_m}}[U_n(\beta)^n] + \log\alpha_{M_m}, \qquad (2.21)$$
where $U_n(\beta) = \exp[n^{-1}\delta\ell_n(y, \beta)]$.

Theorem 5. Under Conditions 1-3 and 6, we have with probability tending to one,
$$S(y; M; F_n) = \ell_n(y, \hat{\beta}_n) - \frac{\log n}{2}|M| + \frac{1}{2}\log|H_n| + \log\alpha_M + \frac{\log(2\pi)}{2}|M| + \log c_n + o(1), \qquad (2.22)$$
where $H_n = A_n^{-1}B_n$ and $c_n \in [c_2, c_3]$.

The asymptotic expansion of the Bayes factor in Theorem 5 leads us to introduce the generalized BIC (GBIC) as follows.

Definition 2.
We define the GBIC of model M as
$$\mathrm{GBIC}(y; M; F_n) = -2\ell_n(y, \hat{\beta}_n) + (\log n)|M| - \log|\hat{H}_n|, \qquad (2.23)$$
where $\hat{H}_n$ is a consistent estimator of $H_n$. It is clear from (2.23) that, relative to BIC, which replaces the factor 2 in (2.13) with log n in penalizing model complexity, GBIC contains the extra term $-\log|\hat{H}_n|$. This additional term reflects the effect of model misspecification. When the model is correctly specified, GBIC reduces to BIC asymptotically.

The choice of the prior probabilities $\alpha_{M_m}$ is important in high dimensions. Lv and Liu (2014) suggested the prior probability $\alpha_{M_m} \propto e^{-D_m}$ for each candidate model $M_m$, where the quantity $D_m$ is defined as
$$D_m = E\left[I(g_n; f_n(\cdot, \hat{\beta}_{n,m})) - I(g_n; f_n(\cdot, \beta_{n,m,0}))\right] \qquad (2.24)$$
and the subscript m indicates a particular candidate model. The motivation is that the further the QMLE $\hat{\beta}_{n,m}$ is away from the best misspecified GLM $F_n(\cdot, \beta_{n,m,0})$, the lower prior we assign to that model. In the high-dimensional setting when p can be much larger than n, it is sensible to take into account the complexity of the space of all possible sparse models with the same size as $M_m$. This observation motivates us to consider a new prior of the form
$$\alpha_{M_m} \propto \binom{p}{d}^{-1} e^{-D_m} \qquad (2.25)$$
with $d = |M_m|$. Such a complexity factor has been exploited in the extended BIC (EBIC) in Chen and Chen (2008), who showed that, using the term $\binom{p}{d}^{\gamma}$ with some constant $0 < \gamma \le 1$, the EBIC can be model selection consistent for $p = O(n^{\kappa})$ with some positive constant $\kappa$ satisfying $1 - (2\kappa)^{-1} < \gamma$. Under the assumption of d = o(p), an application of Stirling's formula shows that, up to an additive constant, it holds that $\log\alpha_{M_m} \approx -D_m - d\log p - d + d\log d$. Thus for the prior defined in (2.25), we have an additional term $-(\log p + 1 - \log d)|M|$ in the asymptotic expansion for GBIC. When p is of order $n^{\kappa}$ with some constant $\kappa > 0$, this new term is of the same order as $-(\log n)|M|$. When log p is of order $n^{\kappa}$ with some constant $0 < \kappa < 1$, the log p term dominates that involving log n. Fan and Tang (2013) proposed a similar $\log(\log n)\log p$ term to ameliorate the BIC for the case of correctly specified models with non-polynomially growing dimensionality p. The following theorem provides the asymptotic expansion of the Bayes factor with the particular choice of prior in (2.25).

Theorem 6. Assume that Conditions 1-6 hold, $\alpha_{M_m} = C\binom{p}{d}^{-1}e^{-D_m}$ with C > 0 some normalization constant, and d = o(p). Then we have with probability tending to one,
$$S(y; M; F_n) = \ell_n(y, \hat{\beta}_n) - (\log p^*)|M| - \frac{1}{2}\mathrm{tr}(H_n) + \frac{1}{2}\log|H_n| + \log(Cc_n) + o(1), \qquad (2.26)$$
where $H_n = A_n^{-1}B_n$, $p^* = \max\{n, p\}$, and $c_n \in [c_2, c_3]$.

Similarly to the GBIC, we now define a new information criterion, the generalized BIC with prior probability (GBIC_p), based on Theorem 6.

Definition 3. We define the GBIC_p of model M as
$$\mathrm{GBIC}_p(y; M; F_n) = -2\ell_n(y, \hat{\beta}_n) + 2(\log p^*)|M| + \mathrm{tr}(\hat{H}_n) - \log|\hat{H}_n|, \qquad (2.27)$$
where $\hat{H}_n$ is a consistent estimator of $H_n$.

In correctly specified models, the term $\mathrm{tr}(\hat{H}_n) - \log|\hat{H}_n|$ is asymptotically close to $|M|$ when $\hat{H}_n$ is a consistent estimator of $H_n = I_d$. Thus, compared to BIC with factor log n, the GBIC_p contains a larger factor log p when p grows non-polynomially with n. This leads to a heavier penalty on model complexity, similarly as in Fan and Tang (2013). As pointed out in Lv and Liu (2014), the right-hand side of (2.27) can be viewed as a sum of three terms: the goodness of fit, the model complexity, and the model misspecification. An important distinction from the low-dimensional counterpart of GBIC_p is that our new criterion explicitly takes into account the dimensionality of the whole feature space.
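The three criteria are straightforward to compute once the plug-in estimator of Section 2.4.3 is available. The sketch below does so for the Gaussian linear working model with unit dispersion, in which case the QMLE is ordinary least squares; it is a minimal illustration under these assumptions, not the authors' implementation.

```python
import numpy as np

def gaic_gbic_gbicp(y, X_full, support, p_total):
    """GAIC (2.15), GBIC (2.23), and GBIC_p (2.27) for a candidate support under a
    Gaussian linear working model with unit dispersion, b(theta) = theta^2 / 2."""
    XS = X_full[:, list(support)]
    n, d = XS.shape
    beta_hat, *_ = np.linalg.lstsq(XS, y, rcond=None)    # QMLE = least squares here
    theta = XS @ beta_hat
    loglik = y @ theta - 0.5 * np.sum(theta ** 2)         # ell_n(y, beta) up to a constant
    resid = y - theta
    A_hat = XS.T @ XS                                     # A_n(beta_hat), since Sigma = I_n
    B_hat = XS.T @ (resid[:, None] ** 2 * XS)             # X^T diag{(y - mu) o (y - mu)} X
    H_hat = np.linalg.solve(A_hat, B_hat)                 # plug-in covariance contrast matrix
    tr_H = float(np.trace(H_hat))
    logdet_H = float(np.linalg.slogdet(H_hat)[1])
    log_pstar = np.log(max(n, p_total))
    gaic = -2 * loglik + 2 * tr_H
    gbic = -2 * loglik + np.log(n) * d - logdet_H
    gbicp = -2 * loglik + 2 * log_pstar * d + tr_H - logdet_H
    return gaic, gbic, gbicp

# Hypothetical usage, e.g. on (X, y) simulated as in the motivating example:
# print(gaic_gbic_gbicp(y, X, [0, 1, 2, 3, 4], p_total=X.shape[1]))
```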
2.4.5 Consistency of selection criterion

Consider a sequence of models denoted by $\{M_m: m = 1, \ldots, M_n\}$, each unique, with $|\mathrm{supp}(M_m)| = d_{n,m}$. Recall our generalized criterion for model selection:
$$\mathrm{GBIC}_p(M_m) = -2\ell_n(y, \hat{\beta}_{n,m}) + 2(\log p^*)|M_m| + \mathrm{tr}(\hat{H}_{n,m}) - \log|\hat{H}_{n,m}|,$$
where $\log p^* = \max\{\log p, \log n\}$. We make the following general assumptions in order to establish the consistency of our selection criterion.

Condition 9. Assume that there exists a set, the true support, which is contained in our sequence of models. Without loss of generality, suppose that $M_1$ is the true support.

Condition 10. Assume that $d_{n,m} \le K_n$ and $K_n = o\{n^{(1-4r)/3}(\log n)^{-2/3}\}$.

Our goal is to show that, with probability tending to 1,
$$\inf_{m > 1}\{\mathrm{GBIC}_p(M_m) - \mathrm{GBIC}_p(M_1)\} > 0. \qquad (2.28)$$
We split the set on which the infimum is computed into two categories: Category I and Category II. Motivated by the terminology in Shao (1993), Category I refers to those models where $\mathrm{supp}(M_m) \not\supseteq \mathrm{supp}(M_1)$ and Category II refers to those with $\mathrm{supp}(M_m) \supsetneq \mathrm{supp}(M_1)$. We make a small change to the definition of Shao (1993), in the sense that the true model does not belong to Category II in our definition. We deal with the two categories separately; in other words, we show that (2.28) holds separately for the two cases.

Recalling some notation: $\beta_{n,m,0} = \arg\min_{\beta \in \mathbb{R}^{d_{n,m}}} I(g_n; f_n(\cdot, \beta))$, the parameter vector that is closest to the true model under the KL divergence with the given support, and the QMLE $\hat{\beta}_{n,m} = \arg\max_{\beta \in \mathbb{R}^{d_{n,m}}} \ell_n(y, \beta)$. Further recall that $\hat{H}_{n,m}$ involves $(y, X, \hat{\beta}_{n,m})$. We aim to derive the appropriate magnitude of $a_n$ such that we may define
$$\mathrm{GBIC}_p(M_m) = -2\ell_n(y, \hat{\beta}_{n,m}) + 2a_n|M_m| + \mathrm{tr}(\hat{H}_{n,m}) - \log|\hat{H}_{n,m}|.$$

Theorem 7. With high probability it holds that
$$\inf_{\substack{\mathrm{supp}(M_m) \ne \mathrm{supp}(M_1) \\ |\mathrm{supp}(M_m)| \le K_n}}\{\mathrm{GBIC}_p(M_m) - \mathrm{GBIC}_p(M_1)\} > 0.$$

2.5 Numerical studies

The asymptotic expansions of both the KL divergence and Bayesian principles in Section 2.4 have enabled us to introduce the GAIC, GBIC, and GBIC_p for model selection in high dimensions with model misspecification. We now investigate their performance in comparison to the information criteria AIC, BIC, and GBIC_p-L in high-dimensional misspecified models via simulation examples as well as two real data sets. For each simulation study, we set the
Therefore, we rst applied the regularization method SICA (Lv and Fan, 2009) to build a sequence of sparse models and then selected the nal model using a model selection criterion. In practice, one can apply any preferred variable selection procedure to obtain a sequence of candidate models. In addition to comparing the models selected by dierent information criteria, we also considered the estimate based on the oracle working modelM 0 =f1; ; 10g as a benchmark and used both measures of prediction and variable selection. Denote by c M the selected model. We split the oracle working model into the set of strong eects M 0;s =f1; ; 5g and that of weak eectsM 0;w =f6; ; 10g. It is interesting to observe that all criteria tend to miss the entire set of weak eects M 0;w due to their very low signal strength. Therefore, 24 we focused on comparing the model selection performance in recovering the set of strong eects M 0;s . We report the strong eect consistent selection probability (the portion of simulations where c M = M 0;s ), the strong eect inclusion probability (the portion of simulations where c M M 0;s ), and the prediction error E(Y x T b ) 2 with b an estimate and (x T ;Y ) an independent observation. To evaluate the prediction performance of dierent criteria, we calculated the average prediction error on an independent test sample of size 10,000. The results for prediction error and model selection performance are summarized in Table 2.1. The number of false positivesj c M\M c 0 j and the numbers of false negatives for strong eects j c M c \M 0;s j and weak eectsj c M c \M 0;w j, respectively, are reported in Table 2.5. It is clear that as the dimensionality p increases, the consistent selection probability tends to decrease and the prediction error tends to increase for all information criteria. Generally speaking, GAIC improved over AIC, and GBIC, GBIC p -L, and GBIC p performed better than BIC in terms of both prediction and variable selection. In particular, the model selected by our new information criterion GBIC p delivered the best performance with the smallest prediction error and highest strong eect consistent selection probability across all settings. Meanwhile it is also interesting to see what results dierent model selection criteria lead to when the model is correctly specied. To this end, we regenerate the solution path based on the linear regression model with the interaction x p+1 = x 1 x 2 added. The same performance measures are calculated for this scenario with the results reported in Tables 2.2 and 2.6. A comparison of these results with those in Tables 2.1 and 2.5 gives several interesting observations. First, all model selection criteria have a better performance when the model is correctly specied in terms of both model selection and prediction. Second, it is worth noting that while all model selection criteria except AIC work reasonably well for the correctly specied model, all but the newly proposed GBIC p have a very low consistent selection probability under both model misspecication and high dimensionality. Third, it 25 Table 2.1: Simulation results for Example 2.5.1 with all entries multiplied by 100 when the model is misspecied, with the oracle results based on both strong eects and weak eects. 
Strong eect consistent selection probability with inclusion probability p AIC BIC GAIC GBIC GBIC p -L GBIC p Oracle 200 0(99) 29(99) 21(99) 32(99) 67(98) 73(98) 100(100) 400 0(100) 9(100) 8(100) 19(100) 54(100) 76(100) 100(100) 1600 0(100) 0(100) 9(100) 0(100) 27(100) 66(100) 100(100) 3200 0(100) 0(100) 4(100) 0(100) 16(100) 64(100) 100(100) Median prediction error with robust standard deviation in parentheses 200 164(35) 130(13) 130(10) 128(12) 125(8) 125(8) 121(7) 400 162(29) 154(38) 129(13) 131(22) 125(9) 122(10) 120(7) 1600 168(31) 172(28) 134(13) 170(28) 129(14) 125(10) 121(7) 3200 159(22) 169(23) 135(14) 167(23) 134(15) 125(13) 120(8) Table 2.2: Simulation results for Example 2.5.1 with all entries multiplied by 100 when the model is correctly specied, with the oracle results based on both strong eects and weak eects. Strong eect consistent selection probability with inclusion probability p AIC BIC GAIC GBIC GBIC p -L GBIC p Oracle 200 2(100) 82(100) 81(99) 82(100) 87(100) 91(100) 100(100) 400 8(100) 76(100) 76(100) 84(100) 90(100) 94(100) 100(100) 1600 39(95) 74(99) 65(89) 79(99) 88(100) 96(100) 100(100) 3200 64(94) 84(98) 72(88) 84(98) 94(100) 95(100) 100(100) Median prediction error with robust standard deviation (RSD) in parentheses 200 13.6(1.9) 11.2(1.0) 11.2(1.0) 11.2(1.0) 11.6(1.3) 11.7(1.2) 7.0(0.4) 400 12.1(1.4) 11.5(1.3) 11.5(1.2) 11.5(1.2) 11.7(1.3) 11.8(1.0) 6.9(0.4) 1600 12.4(8.3) 12.0(7.9) 11.9(9.8) 12.0(8.0) 12.2(7.7) 12.4(7.3) 7.0(0.4) 3200 21.2(10.2) 20.7(9.4) 21.8(11.0) 20.7(9.4) 20.4(8.8) 20.3(8.5) 7.0(0.3) is interesting to see that GBIC p outperforms the existing methods even under the correctly specied model in terms of consistent selection probability. 26 Multiple index model We next consider another model misspecication setting that involves the multiple index model Y =f( 1 X 1 ) +f( 2 X 2 + 3 X 3 ) +f( 4 X 4 + 5 X 5 ) +"; (2.30) where the response depends on the covariates only through the rst ve ones but with non- linear functions and f(x) =x 3 =(x 2 + 1). Here the design matrix X = (x 1 ; ; x p ) was gen- erated as in Section 2.5.1. We set the true parameter vector 0 = (1; 1;1; 1;1; 0; ; 0) T , n = 100, and = 0:25. Note that the oracle working model is M 0 = supp( 0 ) =f1; ; 5g for this example. Although the data was generated from model (2.30), we t the linear regression model (2.2). The results are summarized in Tables 2.3 and 2.7. The consistent selection probability and inclusion probability are now calculated based on M 0 . In general, the conclusions are similar to those in Example 2.5.1. An interesting obser- vation is the comparison between GBIC p -L and GBIC p in terms of model selection. While GBIC p -L is comparable to GBIC p when the dimension is not large (p = 200), the dier- ence between these two methods increases as the dimensionality increases. In the case when p = 3200, GBIC p has 77% success probability of consistent selection, while all the other criteria have at most 5% success probability. This conrms the necessity of including the logp factor in the model selection criterion to take into account the high dimensionality, which is in line with the conclusion in Fan and Tang (2013) for the case of correctly specied models. Logistic regression with interaction Our last simulation example is high-dimensional logistic regression with interaction. 
We sim- ulated 100 data sets from the logistic regression model with interaction and ann-dimensional parameter vector = X + 2x p+1 + 2x p+2 ; (2.31) 27 Table 2.3: Simulation results for Example 2.5.1 with all entries multiplied by 100. Consistent selection probability with inclusion probability p AIC BIC GAIC GBIC GBIC p -L GBIC p Oracle 200 2(100) 4(100) 2(100) 6(100) 51(100) 65(100) 100(100) 400 1(100) 1(100) 2(100) 1(100) 28(100) 67(100) 100(100) 1600 0(100) 0(100) 3(100) 0(100) 5(100) 63(100) 100(100) 3200 0(100) 0(100) 5(100) 0(100) 5(100) 77(100) 100(100) Median prediction error with RSD in parentheses 200 26(3) 26(3) 26(3) 26(3) 23(3) 23(2) 22(1) 400 28(3) 28(3) 27(3) 28(3) 25(4) 23(2) 22(1) 1600 31(3) 31(3) 30(4) 31(3) 30(4) 23(4) 22(1) 3200 31(4) 31(4) 30(3) 31(4) 30(3) 23(2) 22(1) where X = (x 1 ; ; x p ) is an np design matrix, x p+1 = x 1 x 2 and x p+2 = x 3 x 4 are two interaction terms, and the rest is the same as in (2.29). For each data set, the n-dimensional response vector y was sampled from the Bernoulli distribution with success probability vector [e 1 =(1 +e 1 ); ;e n =(1 +e n )] T with = ( 1 ; ; n ) T given in (2.31). As in Section 2.5.1, we consider the case where all covariates are independent of each other. We chose 0 = (2:5;1:9; 2:8;2:2; 3; 0; ; 0) T and set sample size n = 200. Although the data was generated from the logistic regression model with parameter vector (2.31), we t the logistic regression model without the two interaction terms. This provides another example of misspecied models. As argued in Section 2.5.1, the oracle working model is supp( 0 ) =f1; ; 5g which corresponds to the logistic regression model with the rst ve covariates. Since the goal in logistic regression is usually classication, we replace the prediction error with the classication error rate. Tables 2.4 and 2.8 show similar phenomenon as in Sections 2.5.1 and 2.5.1. Again GBIC p outperformed all other model selection criteria with greater advantage for the high-dimensional case (e.g., p = 3200). 28 Table 2.4: Simulation results for Example 2.5.1 with all entries multiplied by 100. Consistent selection probability with inclusion probability p AIC BIC GAIC GBIC GBIC p -L GBIC p Oracle 200 0(99) 32(94) 1(99) 39(94) 49(91) 49(91) 100(100) 400 0(99) 19(97) 0(99) 36(93) 50(92) 55(92) 100(100) 1600 0(96) 0(96) 0(94) 21(90) 35(88) 47(81) 100(100) 3200 0(95) 0(95) 0(96) 10(90) 21(86) 41(72) 100(100) Median classication error rate with RSD in parentheses 200 22(3) 15(2) 16(2) 15(1) 14(1) 14(1) 14(1) 400 21(3) 16(5) 17(2) 15(1) 15(1) 15(1) 13(1) 1600 21(2) 21(2) 18(1) 16(3) 15(1) 16(2) 14(1) 3200 22(2) 21(2) 19(2) 18(3) 15(2) 15(2) 13(1) 2.5.2 Real data examples We nally consider two gene expression data sets: Prostate (Singh et al., 2002) and Neu- roblastoma (Oberthuer et al., 2006). The prostate data set contains p = 12601 genes with n = 136 samples including 59 positives and 77 negatives. The neuroblastoma (NB) data set, available from the MicroArray Quality Control phase-II (MAQC-II) project (MAQC Con- sortium, 2010), consists of gene expression proles forp = 10707 genes from 239 patients (49 positives and 190 negatives) of the German Neuroblastoma Trials NB90-NB2004 with the 3-year event-free survival (3-year EFS) information available. See those references for more detailed description of the data sets. We t the logistic regression model with SICA implemented with ICA algorithm (Fan and Lv, 2011). 
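As a concrete companion to the simulation design just described, the following minimal Python sketch generates data from the logistic model with the two interaction terms and reports the test classification error of the misspecified working model. For brevity it fits the oracle working model with the first five covariates rather than running the SICA path and the information criteria, and the independent standard normal design is an assumption carried over from the earlier examples; all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 3200                          # sample size and an illustrative dimensionality
beta = np.zeros(p)
beta[:5] = [2.5, -1.9, 2.8, -2.2, 3.0]    # nonzero part of the true coefficient vector

def simulate(size):
    """Draw (X, y) from the logistic model with two interaction terms, as in (2.31)."""
    X = rng.standard_normal((size, p))    # independent covariates (assumed N(0, 1))
    eta = X @ beta + 2 * X[:, 0] * X[:, 1] + 2 * X[:, 2] * X[:, 3]
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    return X, y

X_tr, y_tr = simulate(n)
X_te, y_te = simulate(10000)

# Misspecified working model: main effects only, restricted to the oracle
# support {1, ..., 5}; the two interaction terms are deliberately omitted.
oracle = [0, 1, 2, 3, 4]
fit = LogisticRegression(C=1e6, max_iter=2000).fit(X_tr[:, oracle], y_tr)
err = np.mean(fit.predict(X_te[:, oracle]) != y_te)
print(f"Test classification error of the misspecified oracle model: {err:.3f}")
```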
Before applying the regularization method, we exploited the sure independence screening approach to reduce the dimensionality. The random permutation idea (Fan et al., 2011) was applied to determine the threshold for marginal screening. After the screening step, the numbers of retained variables are 430 (prostate) and 2778 (neuroblastoma), respec- tively. We then chose the nal model using those six model selection criteria. Moreover, we randomly split the data into training (80%) and testing (20%) sets for 100 times, and reported the median test classication error rate along with the median model size in Table 2.9. 29 From Table 2.9, for the prostate data set the best criterion appears to be GBIC p -L, which has the smallest test classication error rate. For the neuroblastoma data set, if we only look at the median test classication error rate, GBIC p -L again has the best performance with a small model size. It is worth noting that GBIC p leads to the most parsimonious model, with median model size 3, at the expense of slightly increasing the test classication error rate. From the results of real examples, it is evident that by taking into account the eect of model misspecication, the performance of the original model selection criteria can be improved in general. This is important since the true model structure is generally unavailable to us in real applications. Our results suggest that the term involving model misspecication in the asymptotic expansions is usually nonnegligible for model selection. 2.6 Additional tables In Tables 2.5{2.8, we report additional variable selection results for the three simulation examples in Section 2.5.1. Table 2.5: Example 2.5.1. Median false positives with median false negatives (strong/weak eects) in parentheses when the model is misspecied. AIC BIC GAIC GBIC GBIC p -L GBIC p 200 24(0/4) 2(0/5) 2(0/5) 1(0/5) 0(0/5) 0(0/5) 400 23(0/4) 19(0/5) 3(0/5) 4(0/5) 0(0/5) 0(0/5) 1600 24(0/5) 25(0/5) 4(0/5) 23(0/5) 1(0/5) 0(0/5) 3200 19(0/5) 25(0/5) 4(0/5) 25(0/5) 2(0/5) 0(0/5) Table 2.6: Example 2.5.1. Median false positives with median false negatives (strong/weak eects) in parentheses when the model is correctly specied. AIC BIC GAIC GBIC GBIC p -L GBIC p 200 73(0/0) 0(0/4) 0(0/4) 0(0/4) 0(0/4) 0(0/4) 400 77(0/0) 0(0/4) 0(0/4) 0(0/4) 0(0/4) 0(0/5) 1600 4(0/2) 0(0/5) 0(0/5) 0(0/5) 0(0/5) 0(0/5) 3200 0(0/5) 0(0/5) 0(0/5) 0(0/5) 0(0/5) 0(0/5) 30 Table 2.7: Example 2.5.1. Median false positives with median false negatives in parentheses. p AIC BIC GAIC GBIC GBIC p -L GBIC p 200 4(0) 4(0) 4(0) 3(0) 0(0) 0(0) 400 5(0) 5(0) 5(0) 5(0) 4(0) 0(0) 1600 8(0) 8(0) 8(0) 8(0) 7(0) 0(0) 3200 8(0) 8(0) 8(0) 8(0) 8(0) 0(0) Table 2.8: Example 2.5.1. Median false positives with median false negatives in parentheses. p AIC BIC GAIC GBIC GBIC p -L GBIC p 200 29(0) 1(0) 11(0) 1(0) 0(0) 0(0) 400 25(0) 5(0) 14(0) 1(0) 0(0) 0(0) 1600 19(0) 18(0) 14(0) 3(0) 1(0) 0(0) 3200 18(0) 17(0) 13(0) 9(0) 1(0) 0(0) 2.7 Discussion Despite the rich literature on model selection, the general case of model misspecication in high dimensions is less well studied. Our work has investigated the problem of model selection in high-dimensional misspecied models and characterized the impact of model misspecication. The newly suggested information criterion GBIC p involving a logarithmic factor of the dimensionality in penalizing model complexity has been shown to perform well in high-dimensional settings. 
Moreover, we have established the consistency of the covariance Table 2.9: Results for Prostate and Neuroblastoma data sets. Median classication error rate (in percentage) with RSD in parentheses Data set AIC BIC GAIC GBIC GBIC p -L GBIC p Prostate 19(9) 15(6) 15(9) 15(9) 13(9) 15(10) NB 18(5) 18(5) 18(3) 18(3) 18(5) 19(5) Median model size with RSD in parentheses Prostate 15.0(3.7) 8.5(4.5) 3.0(1.5) 6.0(3.7) 6.0(3.7) 5.0(3.0) NB 27.0(3.0) 26.0(2.2) 8.5(3.7) 6.0(3.7) 5.0(3.0) 3.0(2.2) 31 contrast matrix estimator that captures the eect of model misspecication in the general setting. The logp term in GBIC p is adaptive to high dimensions. In the setting of correctly specied models, Fan and Tang (2013) showed that such a term is necessary for the model selection consistency of information criteria when the dimensionality diverges fast with the sample size. It would be interesting to study the optimality of those dierent information criteria under model misspecication. It would also be interesting to investigate model selection principles in more general high-dimensional misspecied models such as the additive models and survival models. These problems are beyond the scope of the current work and are interesting topics for future research. 32 Chapter 3 Weighted False Discovery Rate Control in Large-Scale Multiple Testing 3.1 Collaboration This chapter is a collaborative work with T. Tony Cai, Kiranmoy Das, and Wenguang Sun. 3.2 Introduction In large-scale studies, relevant domain knowledge, such as external covariates, scientic insights and prior data, is often available alongside the primary data set. Exploiting such information in an ecient manner promises to enhance both the interpretability of scien- tic results and precision of statistical inference. In multiple testing, the hypotheses being investigated often become \unequal" in light of external information, which may be re ected by dierential attitudes towards the relative importance of testing units or the severity of decision errors. The use of weights provides an eective strategy to incorporate informative domain knowledge in large-scale testing problems. In the literature, various weighting methods have been advocated for a range of multiple comparison problems. A popular scheme, referred to as the decision weights or loss weights approach, involves modifying the error criteria or power functions in the decision process (Benjamini and Hochberg, 1997). The idea is to employ two sets of positive constants 33 a a a = fa i : i = 1; ;mg and b b b = fb i : i = 1; ;mg to take into account the costs and gains of multiple decisions. Typically, the choice of the weights a a a and b b b re ects the degree of condence one has toward prior beliefs and external information. It may also be pertinent to the degree of preference that one has toward the consequence of one class of erroneous/correct decisions over another class based on various economical and ethical considerations. For example, in the spatial cluster analysis considered by Benjamini and Heller (2007), the weighted false discovery rate was used to re ect that a false positive cluster with larger size would account for a larger error. Another example arises from genome-wide association studies (GWAS), where prior data or genomic knowledge, such as prioritized subsets (Lin and Lee, 2012), allele frequencies (Lin et al., 2014) and expression quantitative trait loci information (Li et al., 2013), can often help to assess the scientic plausibility of signicant associations. 
To incorporate such information in the analysis, a useful strategy is to up-weight the gains for the discoveries in preselected genomic regions by modifying the power functions in respective testing units (P~ ena et al., 2011; Sun et al., 2015). We assume in this work that the weights have been pre-specied by the investigator. This is a reasonable assumption in many practical settings. For example, weights may be assigned according to economical considerations (Westfall and Young, 1993), external covariates (Benjamini and Heller, 2007; Sun et al., 2015) and biological insights from prior studies (Xing et al., 2010). We mention two alternative formulations for weighted multiple testing. One popular method, referred to as the procedural weights approach by Benjamini and Hochberg (1997), involves the adjustment of thep-values from individual tests. In GWAS, Roeder et al. (2006) and Roeder and Wasserman (2009) proposed to utilize linkage signals to up-weight the p- values in preselected regions and down-weight thep-values in other regions. It was shown that the power to detect association can be greatly enhanced if the linkage signals are informative, yet the loss in power is small when the linkage signals are uninformative. Another useful weighting scheme, referred to as the class weights approach, involves allocating varied test levels to dierent classes of hypotheses. For example, in analysis of the growth curve data (Box, 1950), Westfall and Young (1993, page 186) proposed to allocate a higher family-wise 34 error rate (FWER) to the class of hypotheses related to the primary variable \gain" and a lower FWER to the secondary variable \shape". We focus on the decision weights approach in the present work. This weighting scheme is not only practically useful for a wide range of applications, but also provides a powerful framework that enables a unied investigation of various weighting methods. Specically, the proposal in Benjamini and Hochberg (1997) involves the modication of both the error rate and power function. The formulation is closely connected to classical ideas in compound decision theory that aim to optimize the tradeos between the gains and losses when many simultaneous decisions are combined as a whole. Our theory reveals that if the goal is to maximize the power subject to a given error rate, then the modications via decision weights would lead to improved multiple testing methods with sensible procedural weights or class weights, or both. For example, in GWAS, the investigators can up-weight the power functions for discoveries in genomic regions that are considered to be more scientic plausible or biologically meaningful; this would naturally up-weight thep-values in these regions and thus yield weighting strategies similar to those suggested by Roeder and Wasserman (2009). In large clinical trials, modifying the power functions for respective rejections at the primary and secondary end points would correspond to the allocation of varied test levels across dierent classes of hypotheses, leading to weighting strategies previously suggested by Westfall and Young (1993). The false discovery rate (FDR; Benjamini and Hochberg, 1995) has been widely used in large-scale multiple testing as a powerful error criterion. Following Benjamini and Hochberg (1997), we generalize the FDR to weighted false discovery rate (wFDR), and develop optimal procedures for wFDR control under the decision weights framework. 
We rst construct an oracle procedure that maximizes the weighted power function subject to a constraint on the wFDR, and then develop a data-driven procedure to mimic the oracle and establish its asymptotic optimality. The numerical results show that the proposed method controls the wFDR at the nominal level, and the gain in power over existing methods is substantial in many settings. Our optimality result in the decision weights framework marks a clear 35 departure from existing works in the literature that are mainly focused on the derivation of optimal procedural weights subject to the conventional FDR criterion and unweighted power function (Roeder and Wasserman, 2009; Roquain and van de Wiel, 2009). Our research also makes a novel contribution to the theory of optimal ranking in mul- tiple testing. Conventionally, a multiple testing procedure operates in two steps: ranking the hypotheses according to their signicance levels and then choosing a cuto along the rankings. It is commonly believed that the rankings remain the same universally at all FDR levels. For example, the ranking based on p-values or adjusted p-values in common practice is invariant to the choice of the FDR threshold. The implication of our theory is interesting, for it claims that there does not exist a ranking that is universally optimal at all test levels. Instead, the optimal ranking of hypotheses depends on the pre-specied wFDR level. That is, the hypotheses may be ordered dierently when dierent wFDR levels are chosen. This point is elaborated in Section 3.3. The rest of the article is organized as follows. Section 2 discusses a general framework for weighted multiple testing. Sections 3 and 4 develop oracle and data-driven wFDR procedures and establish their optimality properties. Simulation studies are conducted in Section 5 to investigate the numerical performance of the proposed methods. An application to GWAS is presented in Section 6. Section 7 concludes the article with a discussion of related and future works. 3.3 Problem formulation This section discusses a decision weights framework for weighted multiple testing. We rst introduce model and notation and then discuss modied error criteria and power functions. 36 3.3.1 Model and notation Suppose that m hypotheses H 1 ; ;H m are tested simultaneously based on observations X 1 ; ;X m . Let = ( 1 ; ; m )2f0; 1g m denote the true state of nature, where 0=1 indicates a null/non-null case. Assume that observationsX i are independent and distributed according to the following model X i j i (1 i )F 0i + i F 1i ; (3.1) where F 0i and F 1i are the null and non-null distributions for X i , respectively. Denote by f 0i and f 1i the corresponding density functions. Suppose that the unknown states i are Bernoulli (p i ) variables, where p i = P ( i = 1). The mixture density is denoted by f i = (1p i )f 0i +p i f 1i . Consider the widely used random mixture model (Efron et al., 2001; Storey, 2002; Gen- ovese and Wasserman, 2002) X i F = (1p)F 0 +pF 1 : (3.2) This model, which assumes that all observations are identically distributed according to a common distribution F , can sometimes be unrealistic in applications. In light of domain knowledge, the observations are likely to have dierent distributions. For example, in the context of a brain imaging study, Efron (2008) showed that the proportions of activated voxels are dierent for the front and back halves of a brain. In GWAS, certain genomic regions contain higher proportions of signicant signals than other regions. 
In the adequate yearly progress study of California high schools (Rogasa, 2003), the densities ofz-scores vary signicantly from small to large schools. We develop theories and methodologies for model (3.1) for it considers dierent non-null proportions and densities; this allows the proposed method to be applied to a wider range of situations. 37 The multi-group model considered in Efron (2008) and Cai and Sun (2009), which has been widely used in applications, is an important case of the general model (3.1). The multi- group model assumes that the observations can be divided into K groups. LetG k denote the index set of the observations in groupk,k = 1; ;K. For eachi2G k , i is distributed as Bernoulli(p k ), and X i follows a mixture distribution: (X i ji2G k )f k = (1p k )f 0k +p k f 1k ; (3.3) wheref 0k andf 1k are the null and non-null densities for observations in groupk. This model will be revisited in later sections. See also Ferkingstad et al. (2008) and Hu et al. (2010) for related works on multiple testing with groups. 3.3.2 Weighted error criterion and power function This section discusses a generalization of the FDR criterion in the context of weighted mul- tiple testing. Denote the decisions for the m tests by = ( 1 ; ; m )2f0; 1g m , where i = 1 indicates that H i is rejected and i = 0 otherwise. The weighted false discovery rate (wFDR) is dened as wFDR = E m P i=1 a i (1 i ) i E m P i=1 a i i ; (3.4) where a i is the weight indicating the severity of a false positive decision. For example, a i is taken as the cluster size in the spatial cluster analyses conducted in Benjamini and Heller (2007) and Sun et al. (2015). As a result, rejecting a larger cluster erroneously corresponds to a more severe decision error. Remark 1. Our denition of the wFDR is slightly dierent from that considered in Ben- jamini and Hochberg (1997), which denes the wFDR as the expectation of a ratio. The consideration of using a ratio of two expectations (or a marginal version of the wFDR) is only to facilitate our theoretical derivations. Genovese and Wasserman (2002) showed that, 38 in large-scale testing problems, the dierence between the marginal FDR (mFDR) and FDR is negligible under mild conditions. The asymptotic equivalence in the weighted case can be established similarly. To compare the eectiveness of dierent weighted multiple testing procedures, we dene the expected number of true positives ETP =E m X i=1 b i i i ! ; (3.5) where b i is the weight indicating the power gain when H i is rejected correctly. The use of b i provides a useful scheme to incorporate informative domain knowledge. In GWAS, larger b i can be assigned to pre-selected genomic regions to re ect that the discoveries in these regions are more biologically meaningful. In spatial data analysis, correctly identifying a larger cluster that contains signal may correspond to a larger b i , indicating a greater decision gain. By combining the concerns on both the error criterion and power function, the goal in weighted multiple testing is to maximize the ETP subject to the constraint wFDR: (3.6) The optimal solution to (3.6) is studied in the next section. 3.4 Oracle procedure for wFDR control The basic framework of our theoretical and methodological developments is outlined as follows. In Section 3.1, we assume thatp i ,f 0i , andf i in the mixture model (3.1) are known by an oracle and derive an oracle procedure that maximizes the ETP subject to a constraint on the wFDR. 
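As a quick numerical illustration of the weighted criteria defined above, the following Python sketch approximates the wFDR (3.4), as a ratio of expectations, and the ETP (3.5) by Monte Carlo for a simple decision rule. The fixed Lfdr cutoff used here is only a placeholder, not the proposed procedure, and the common two-point normal mixture (a special case of model (3.1)) and log-normal weights are assumptions made purely for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m, p, mu, cutoff = 5000, 0.2, 2.0, 0.2

def one_replication():
    theta = rng.binomial(1, p, m)                       # true states, Bernoulli(p)
    x = rng.normal(mu * theta, 1.0)                     # observations from the mixture
    a = np.ones(m)                                      # error weights a_i
    b = rng.lognormal(np.log(3), 1.0, m)                # power weights b_i (data-independent)
    f = (1 - p) * norm.pdf(x) + p * norm.pdf(x, mu)     # mixture density
    lfdr = (1 - p) * norm.pdf(x) / f                    # oracle Lfdr, eq. (3.7)
    delta = (lfdr <= cutoff).astype(int)                # placeholder decision rule
    return (np.sum(a * (1 - theta) * delta),            # weighted false discoveries
            np.sum(a * delta),                          # weighted discoveries
            np.sum(b * theta * delta))                  # weighted true positives

reps = np.array([one_replication() for _ in range(200)])
print("wFDR ~", reps[:, 0].mean() / reps[:, 1].mean())  # ratio of expectations, eq. (3.4)
print("ETP  ~", reps[:, 2].mean())                      # eq. (3.5)
```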
Connections to the literature and a discussion on optimal ranking are included 39 in Sections 3.2 and 3.3. In Section 4, we develop a data-driven procedure to mimic the oracle and establish its asymptotic validity and optimality. 3.4.1 Oracle procedure The derivation of the oracle procedure involves two key steps: the rst is to derive the optimal ranking of hypotheses and the second is to determine the optimal threshold along the ranking that exhausts the pre-specied wFDR level. We discuss the two issues in turn. Consider model (3.1). Dene the local false discovery rate (Lfdr, Efron et al. 2001) as Lfdr i = (1p i )f 0i (x i ) f i (x i ) : (3.7) The wFDR problem (3.6) is equivalent to the following constrained optimization problem maximize E m P i=1 b i i (1 Lfdr i ) subject to E m P i=1 a i i (Lfdr i ) 0. (3.8) Let S =fi : Lfdr i g and S + =fi : Lfdr i > g. Then the constraint in (3.8) can be equivalently expressed as E ( X S + a i i (Lfdr i ) ) E ( X S a i i ( Lfdr i ) ) : (3.9) Consider an optimization problem which involves packing a knapsack with a capacity given by the right hand side of equation (3.9). Every available object has a known value and a known cost (of space). Clearly rejecting a hypothesis in S is always benecial as it allows the capacity to expand, and thus promotes more discoveries. The key issue is how to eciently utilize the capacity (after all hypotheses in S are rejected) to make as many discoveries as possible inS + . Each rejection inS + would simultaneously increase the power 40 and decrease the capacity. We propose to sort all hypotheses in S + in an decreasing order of the value to cost ratio (VCR). Equations (3.8) and (3.9) suggest that VCR i = b i (1 Lfdr i ) a i (Lfdr i ) : (3.10) To maximize the power, the ordered hypotheses are rejected sequentially until maximum capacity is reached. The above considerations motivate us to consider the following class of decision rules (t) =f i (t) :i = 1; ;mg, where i (t) = 8 > < > : 1; if b i (1 Lfdr i )>ta i (Lfdr i ); 0; if b i (1 Lfdr i )ta i (Lfdr i ): (3.11) We brie y explain some important operational characteristics of testing rule (3.11). First, if we let t> 0, then the equation implies that i (t) = 1 for all i2S ; hence all hypotheses in S are rejected as desired. (This explains why the VCR is not used directly in (3.11), given that the VCR is not meaningful inS .) Second, a solution path can be generated as we vary t continuously from large to small. Along the path (t) sequentially rejects the hypotheses in S + according to their VCRs. Denote byH (1) ; ;H (m) the hypotheses sequentially rejected by . (The actual ordering of the hypotheses within S does not matter in the decision process since all are always rejected.) The next task is to choose a cuto along the ranking to achieve exact wFDR control. The diculty is that the maximum capacity may not be attained by a sequential rejection procedure. To exhaust the wFDR level, we permit a randomized decision rule. Denote the Lfdr values and the weights corresponding to H (i) by Lfdr (i) , a (i) , and b (i) . Let C(j) = j X i=1 a (i) (Lfdr (i) ) (3.12) 41 denote the capacity up to jth rejection. According to the constraint in equation (3.8), we choose k = maxfj :C(j) 0g so that the capacity is not yet reached when H (k) is rejected but would be exceeded if H (k+1) is rejected. The idea is to split the decision point at H (k+1) by randomization. LetU be a Uniform (0; 1) variable that is independent of the truth, the observations, and the weights. 
Dene t = b (k+1) 1 Lfdr (k+1) a (k+1) Lfdr (k+1) and p = C(k) C(k + 1)C(k) : LetI A be an indicator, which takes value 1 if event A occurs and 0 otherwise. We propose the oracle decision rule OR =f i OR :i = 1; ;mg, where i OR = 8 > > > > > < > > > > > : 1 if b i (1 Lfdr i )>t a i (Lfdr i ); 0 if b i (1 Lfdr i )<t a i (Lfdr i ); I U<p if b i (1 Lfdr i ) =t a i (Lfdr i ): (3.13) Remark 2. The randomization step is only employed for theoretical considerations to enforce the wFDR to be exactly . Thus the optimal power can be eectively character- ized. Moreover, only a single decision point at H (k+1) is randomized, which has a negligible eect in large-scale testing problems. We do not pursue randomized rules for the data-driven procedures developed in later sections. Let wFDR( ) and ETP( ) denote the wFDR and ETP of a decision rule , respectively. Theorem 8 shows that the oracle procedure (3.13) is valid and optimal for wFDR control. Theorem 8. Consider model (3.1) and oracle procedure OR dened in (3.13). LetD be the collection of decision rules such that for any 2D , wFDR( ). Then we have (i). wFDR( OR ) =. (ii). ETP( OR ) ETP( ) for all 2D . 42 3.4.2 Comparison with the optimality results in Spjtvoll (1972) and Benjamini and Hochberg (1997) Spjtvoll (1972) showed that the likelihood ratio (LR) statistic T i LR = f 0i (x i ) f 1i (x i ) (3.14) is optimal for the following multiple testing problem maximize E \H 1i m P i=1 i subject to E \H 0i m P i=1 i , (3.15) where\H 0i and\H 1i denote the intersections of the nulls and non-nulls, respectively. The error criterion E \H 0i f P i a i i g is referred to as the intersection tests error rate (ITER). A weighted version of problem (3.15) was considered by Benjamini and Hochberg (1997), where the goal is to maximize E \H 1i m P i=1 b i i subject to E \H 0i m P i=1 a i i . (3.16) The optimal solution to (3.16) is given by the next proposition. Proposition 1. (Benjamini and Hochberg, 1997). Dene the weighted likelihood ratio (WLR) T i IT = a i f 0i (x i ) b i f 1i (x i ) : (3.17) Then the optimal solution to (3.16) is a thresholding rule of the form i IT = (T i IT < t ), where t is the largest threshold that controls the weighted ITER at level . The ITER is very restrictive in the sense that the expectation is taken under the con- junction of the null hypotheses. The ITER is inappropriate for mixture model (3.1) where 43 a mixture of null and non-null hypotheses are tested simultaneously. To extend intersection tests to multiple tests, dene the per family error rate (PFER) as PFER( ) =E ( m X i=1 a i (1 i ) i ) : (3.18) The power function should be modied correspondingly. Therefore the goal is to maximize E m P i=1 b i i i subject to E m P i=1 a i (1 i ) i . (3.19) The key dierence between the ITER and PFER is that the expectation in (3.18) is now taken over all possible combinations of the null and non-null hypotheses. The optimal PFER procedure is given by the next proposition. Proposition 2. Consider model (3.1) and assume continuity of the LR statistic. LetD PF be the collection of decision rules such that for every 2D PF , PFER( ) . Dene the weighted posterior odds (WPO) T i PF = a i (1p i )f 0i (x i ) b i p i f 1i (x i ) : (3.20) Denote by Q PF (t) the PFER of i PF = I(T i PF < t). Then the oracle PFER procedure is PF = ( i PF : i = 1; ;m), where i PF = I(T i PF < t PF ) and t PF = supft : Q PF (t) g: This oracle rule satises: (i). ETP( PF ) =. (ii). ETP( PF ) ETP( ) for all 2D PF . 
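To make the ranking and thresholding steps concrete, the sketch below gives a minimal Python implementation of the oracle procedure (3.13), omitting the randomization at the boundary, which only matters for attaining the level exactly. It assumes the Lfdr values and the weights are known to the oracle; function and variable names are illustrative.

```python
import numpy as np

def oracle_wfdr(lfdr, a, b, alpha):
    """Oracle wFDR rule: reject along the VCR ordering until the capacity is used up.

    lfdr  : oracle local false discovery rates, eq. (3.7)
    a, b  : decision weights for false positives and true positives
    alpha : target wFDR level
    Returns a 0/1 decision vector (the randomized boundary decision is omitted).
    """
    lfdr, a, b = map(np.asarray, (lfdr, a, b))
    excess = a * (lfdr - alpha)            # capacity consumed (>0) or freed (<=0) by a rejection
    vcr = np.full(len(lfdr), np.inf)       # hypotheses in S- are always rejected first
    pos = excess > 0
    vcr[pos] = b[pos] * (1 - lfdr[pos]) / excess[pos]   # value-to-cost ratio, eq. (3.10)
    order = np.argsort(-vcr)               # decreasing VCR
    capacity = np.cumsum(excess[order])    # C(j) in eq. (3.12)
    feasible = np.nonzero(capacity <= 0)[0]
    k = feasible.max() + 1 if feasible.size else 0      # k = max{j : C(j) <= 0}
    delta = np.zeros(len(lfdr), dtype=int)
    delta[order[:k]] = 1
    return delta
```

Replacing the VCR ordering in this sketch with the WPO ordering, while keeping the same capacity-based stopping rule, reproduces the kind of comparison discussed in Section 3.4.3 below.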
Our formulation (3.6) modies the conventional formulations in (3.16) and (3.19) to the multiple testing situation with an FDR type criterion. These modications lead to methods that are more suitable for large-scale scientic studies. The oracle procedure (3.13) uses the VCR (3.10) to rank the hypotheses. The VCR, which optimally combines the decision weights, signicance measure (Lfdr) and test level, produces a more powerful ranking than the WPO (3.20) in the wFDR problem; this is explained in detail next. 44 3.4.3 Optimal ranking: VCR vs. WPO Although the WPO is optimal for PFER control, it is suboptimal for wFDR control. This section discusses a toy example to provide some insights on why the WPO rank- ing is dominated by the VCR ranking. We simulate 1000 z-values from a mixture model (1p)N(0; 1) +pN(2; 1) with p = 0:2. The weights a i are xed at 1 for all i, and b i are generated from log-normal distribution with location parameter ln 3 and scale parameter 1. At wFDR level = 0:10, we can reject 68 hypotheses along the WPO ranking, with the number of true positives being 60; in contrast, we can reject 81 hypotheses along the VCR ranking, with the number of true positives being 73. This shows that the VCR ranking enables us to \pack more objects" under the capacity wFDR = 0:1 compared to the WPO ranking. Detailed simulation results are presented in Section 5. Next we give some intuitions on why the VCR ranking is more ecient in the wFDR problem. The test level , which can be viewed as the initial capacity for the error rate, plays an important role in the ranking process. Under the wFDR criterion, the capacity may either increase or decrease when a new rejection is made; the quantity that aects the current capacity is the excessive error rate (Lfdr i ). A dierent would yield a dierent excessive error rate and hence a dierent ranking. (This is very dierent from the PFER criterion, under which the capacity always decreases when a new rejection is made and is not useful in ranking.) The next example shows that, although the WPO ranking always remains the same the VCR ranking can be altered by the choice of . Example 1. Consider two unitsA andB with observed values and weightsx A = 2:73,x B = 3:11, b A = 83:32, and b B = 11:95. The Lfdr values are Lfdr A = 0:112 and Lfdr B = 0:055, ranking B ahead of A. Taking into account of the decision weights, the WPO values are WPO A = 0:0015 and WPO B = 0:0049, ranking A ahead of B, and this ranking remains the same at all wFDR levels. At = 0:01, we have VCR A = 725:4 and VCR B = 250:9, yielding the same ranking as the WPO. However, at = 0:05, we have VCR A = 1193:5 and VCR B = 2258:6, reversing the previous ranking. This reversed ranking is due to the small 45 excessive error rate (Lfdr B ) at = 0:05, which makes the rejection ofB, rather thanA, more \protable." 3.5 Data-driven procedures and asymptotics The oracle procedure (3.13) cannot be implemented in practice since it relies on unknown quantities such as Lfdr i and t . This section develops a data-driven procedure to mimic the oracle. We rst propose a test statistic to rank the hypotheses and discuss related estimation issues. A step-wise procedure is then derived to determine the best cuto along the ranking. Finally, asymptotic results on the validity and optimality of the proposed procedure are presented. 3.5.1 Proposed test statistic and its estimation The oracle procedure utilizes the ranking based on the VCR (3.10). 
However, the VCR is only meaningful for the tests in S + and becomes problematic when both S and S + are considered. Moreover, the VCR could be unbounded, which would lead to diculties in both numerical implementations and technical derivations. We propose to rank the hypotheses using the following statistic (in increasing values) R i = a i (Lfdr i ) b i (1 Lfdr i ) +a i jLfdr i j : (3.21) As shown in the next proposition,R i always ranks hypotheses inS higher than hypotheses inS + (as desired), and yields the same ranking as that by the VCR (3.10) for hypotheses in S + . The other drawbacks of VCR can also be overcome by R i : R i is always bounded in the interval [1; 1] and is a continuous function of the Lfdr i . Proposition 3. (i) The rankings generated by the decreasing values of VCR (3.10) and increasing values of R i (3.21) are the same in both S and S + . (ii) The ranking based on increasing values of R i always puts hypotheses in S ahead of hypotheses in S + . 46 Next we discuss how to estimateR i ; this involves the estimation of the Lfdr statistic (3.7), which has been studied extensively in the multiple testing literature. We give a review of related methodologies. If all observations follow a common mixture distribution (3.2), then we can rst estimate the non-null proportionp and the null densityf 0 using the methods in Jin and Cai (2007), and then estimate the mixture densityf using a standard kernel density estimator (e.g. Silverman, 1986). If all observations follow a multi-group model (3.3), then we can apply the above estimation methods to separate groups to obtain corresponding estimates ^ p k , ^ f 0k , and ^ f k ,k = 1; ;K. The theoretical properties of these estimators have been established in Sun and Cai (2007) and Cai and Sun (2009). In practice, estimation problems may arise from more complicated models. Related theories and methodologies have been studied in Storey (2007), Ferkingstad et al. (2008), and Efron (2008, 2010); theoretical supports for these estimators are yet to be developed. The estimated Lfdr value forH i is denoted by d Lfdr i . By convention, we take d Lfdr i = 1 if d Lfdr i > 1. This modication only facilitates the development of theory and has no practical eect on the testing results (since rejections are essentially only made for small d Lfdr i 's). The ranking statistic R i can therefore be estimated as b R i = a i ( d Lfdr i ) b i (1 d Lfdr i ) +a i j d Lfdr i j : (3.22) The performance of the data driven procedure relies on the accuracy of the estimate d Lfdr i ; some technical conditions are discussed in the next subsection. 3.5.2 Proposed testing procedure and its asymptotic properties Consider b R i dened in (3.22). Denote by b N i = a i ( d Lfdr i ) the estimate of excessive error rate whenH i is rejected. Let b R (1) ; ; b R (m) be the ordered test statistics (in increasing values). The hypothesis and estimated excessive error rate corresponding to b R (i) are denoted by H (i) and b N (i) . The idea is to choose the largest cuto along the ranking based on b R i so 47 that the maximum capacity is reached. Motivated by the constraint in (3.8), we propose the following step-wise procedure. Procedure 1. (wFDR control with general weights). Rank hypotheses according to b R i in increasing values. Let k = max j : j P i=1 b N (i) 0 : Reject H (i) , for i = 1;:::;k. It is important to note that in Procedure 1, b R i is used in the ranking step whereas b N i (or a weighted transformation of d Lfdr i ) is used in the thresholding step. 
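A minimal Python sketch of Procedure 1 is given below, assuming the estimated Lfdr values have already been obtained as discussed above; it simply mirrors the ranking and thresholding steps, and the names are illustrative.

```python
import numpy as np

def procedure_1(lfdr_hat, a, b, alpha):
    """Data-driven wFDR procedure with general weights (Procedure 1).

    Rank the hypotheses by the estimated statistic R_i of eq. (3.22) in
    increasing order, then reject as long as the running sum of the estimated
    excessive error rates N_i stays at or below zero.
    """
    lfdr_hat = np.minimum(np.asarray(lfdr_hat, dtype=float), 1.0)  # cap at 1, by convention
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_hat = a * (lfdr_hat - alpha)                                 # estimated excessive error rates
    r_hat = n_hat / (b * (1 - lfdr_hat) + a * np.abs(lfdr_hat - alpha))  # eq. (3.22)
    order = np.argsort(r_hat)                                      # increasing R_i
    running = np.cumsum(n_hat[order])
    feasible = np.nonzero(running <= 0)[0]
    k = feasible.max() + 1 if feasible.size else 0                 # largest feasible number of rejections
    delta = np.zeros(len(lfdr_hat), dtype=int)
    delta[order[:k]] = 1
    return delta
```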
The ranking by d Lfdr i is in general dierent from that by b R i . In some applications where the weights are proportional, i.e. a a a = cb b b for some constant c > 0, then the rankings by b R i and d Lfdr i are identical. Specically b R i is then monotone in d Lfdr i . Further, choosing the cuto based on b N i is equivalent to that of choosing by a weighted d Lfdr i . This leads to an Lfdr based procedure (Sun et al., 2015), which can be viewed as a special case of Procedure 1. Procedure 2. (wFDR control with proportional weights). Rank hypotheses according to d Lfdr i in increasing values. Denote the hypotheses and weights corresponding to d Lfdr (i) by H (i) and a (i) . Let k = max 8 < : j : j X i=1 a (i) ! 1 j X i=1 a (i) d Lfdr (i) 9 = ; : Reject H (i) , for i = 1;:::;k. Next we investigate the asymptotic performance of Procedure 1. We rst give some regularity conditions for the weights. Our theoretical framework requires that the decision weights must be obtained from external sources such as prior data, biological insights, or economical considerations. In particular, the observed datafX i : i = 1; ;mg cannot be used to derive the weights. The assumption is not only crucial in theoretical developments, but also desirable in practice (to avoid using data twice). Therefore given the domain knowledge, the decision weights do not depend on observed values. Moreover, a model with random (known) weights is employed for technical convenience, as done in Genovese et al. (2006) and Roquain and van de Wiel (2009). We assume that the weights are independent 48 with each other across testing units. Formally, denote e i the external domain knowledge for hypothesis i, we require the following condition. Condition 11. (i) (a i ;b i jX i ; i ;e i ) d (a i ;b i je i ) for 1 i m. (ii) (a i ;b i ) and (a j ;b j ) are independent for i6=j. In weighted multiple testing problems, the analysis is always carried out in light of the external information e i implicitly. The notation of conditional distribution on e i will be suppressed when there is no ambiguity. In practice, the weightsa i andb i are usually bounded. We need a weaker condition in our theoretical analysis. Condition 12. (Regularity conditions on the weights). LetC andc be two positive constants. E(sup i a i ) =o(m);E(sup i b i ) =o(m);E(a 4 i )C, and minfE(a i );E(b i )gc. A consistent Lfdr estimate is needed to ensure the large-sample performance of the data- driven procedure. Formally, we need the following condition. Condition 13. It holds that d Lfdr i Lfdr i = o P (1). Also, d Lfdr i d ! Lfdr, where Lfdr is an independent copy of Lfdr i . Remark 3. Condition 3 is a reasonable assumption in many applications. We give a few important scenarios where Condition 3 holds. For the simple random mixture model (3.2), it can be shown that the estimators proposed in Jin and Cai (2007) satisfy ^ p p ! p and Ek ^ f 0 f 0 k 2 ! 0. In addition, it is known that the kernel density estimator satises Ek ^ f fk 2 ! 0. It follows from Sun and Cai (2007) that Condition 3 holds when the above estimators are used. For the multi-group model (3.3), let ^ p k , ^ f k0 , and ^ f k be estimates of p k , f k0 , and f k such that ^ p k p ! p k , Ek ^ f k0 f k0 k 2 ! 0, Ek ^ f k f k k 2 ! 0, k = 1; ;K. Let d Lfdr i = (1 ^ p k ) ^ f 0k (x i )= ^ f k (x i ) if i2G k . It follows from Cai and Sun (2009) that Condition 3 holds when we apply Jin and Cai's estimators to the groups separately. 
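For completeness, the following sketch shows one simple way to produce estimated Lfdr values of the kind required by Condition 3 under the common mixture model (3.2). It assumes a known N(0, 1) null and uses a Gaussian kernel density estimate for the mixture density together with a Storey (2002)-type estimate of the null proportion; these are convenient stand-ins for, not implementations of, the estimators of Jin and Cai (2007) cited above.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

def estimate_lfdr(z, lam=0.5):
    """Plug-in Lfdr estimate under model (3.2) with a theoretical N(0, 1) null.

    lam is the p-value threshold of the Storey-type null-proportion estimate.
    Returns values capped at 1, following the convention in the text.
    """
    z = np.asarray(z, dtype=float)
    pvals = 2 * norm.sf(np.abs(z))
    pi0_hat = min(1.0, np.mean(pvals > lam) / (1.0 - lam))  # estimate of the null proportion
    f_hat = gaussian_kde(z)(z)                               # kernel estimate of the mixture density
    return np.clip(pi0_hat * norm.pdf(z) / f_hat, 0.0, 1.0)

# Toy check on data from 0.8 N(0,1) + 0.2 N(2,1); the resulting estimates can
# then be fed into Procedure 1 or Procedure 2 above.
rng = np.random.default_rng(2)
theta = rng.binomial(1, 0.2, 5000)
z = rng.normal(2.0 * theta, 1.0)
lfdr_hat = estimate_lfdr(z)
```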
The oracle procedure (3.13) provides an optimal benchmark for all wFDR procedures. The next theorem establishes the asymptotic validity and optimality of the data-driven 49 procedure by showing that the wFDR and ETP levels of the data-driven procedure converge to the oracle levels as m!1. Theorem 9. Assume Conditions 1-3 hold. Denote by wFDR DD the wFDR level of the data- driven procedure (Procedure 1). Let ETP OR and ETP DD be the ETP levels of the oracle procedure (3.13) and data-driven procedure, respectively. Then we have (i). wFDR DD = +o(1): (ii). ETP DD =ETP OR = 1 +o(1): 3.6 Simulation studies In all simulation studies, we consider a two-point normal mixture model X i (1p)N(0; 1) +pN(; 2 );i = 1; ;m: The nominal wFDR is xed at = 0:10. Section 5.1 considers the comparison of dierent methods under the scenario where there are two groups of hypotheses and within each group the weights are proportional. Section 5.2 compares our methods with existing methods using general weights a i and b i that are generated from probability distributions. The proposed method (Procedure 1 in Section 4.2) is denoted by4 DD. Other methods include: 1. The wFDR method proposed by Benjamini and Hochberg (1997); denoted by2 BH97. In simulations wherea i = 1 for alli, BH97 reduces to the well-known step-up procedure in Benjamini and Hochberg (1995), denoted by BH95. 2. A stepwise wFDR procedure, which rejects hypotheses along the WPO (3.20) ranking sequentially and stops at k = max j : j P i=1 b N (i) 0 , with b N (i) dened in Section 4.2. The method is denoted by WPO. Following similar arguments in the proof of Theorem 2, we can show that the WPO method controls the wFDR at the nominal level asymptotically. This is also veried by our simulation. Meanwhile, we expect 50 that the WPO method will be outperformed by the proposed method (4 DD), which operates along the VCR ranking. 3. The adaptive z-value method in Sun and Cai (2007), denoted by + AZ. AZ is valid and optimal in the unweighted case but suboptimal in the weighted case. 3.6.1 Group-wise weights This section considers group-wise weights. Our setting is motivated by the applications to GWAS, where the hypotheses can be divided into two groups: those in preselected regions and those in other regions. It is desirable to assign varied weights to separate groups to re ect that the discoveries in preselected regions are more biologically meaningful. The rst simulation study investigates the eect of weights. Consider two groups of hypotheses with group sizes m 1 = 3000 and m 2 = 1500. In both groups, the non-null proportion is p = 0:2 and the non-null distribution is N(1:9; 1). We x a i = 1 for all i. Hence BH97 reduces to the method proposed in Benjamini and Hochberg (1995), denoted by BH95. The wFDR reduces to the regular FDR, and all methods being considered are valid for FDR control. For hypotheses in group 1, we letc 1 =a i =b i . For hypotheses in group 2, we letc 2 =a i =b i . We choosec 1 = 3 and varyc 2 . Hence the weights are proportional with respective groups and vary across groups. In each simulation setting, we apply the four methods to the simulated data set and obtain the wFDR and ETP levels by averaging the multiple testing results over 200 replications. In Figure 3.1, we plot the wFDR and ETP levels of dierent methods as functions of c 2 , which is varied over [0:1; 0:8]. Panel (a) shows that all methods control the wFDR under the nominal level, and the BH97 method is conservative. 
Panel (b) shows that the proposed method dominates all existing methods. The proposed method is followed by the WPO method, which outperforms the unweighted methods (AZ and BH95) since $b_i$, the weights in the power function, are incorporated in the testing procedure. The BH97 (or BH95) method has the smallest ETP. As $c_2$ approaches 1, that is, as the weights $a_i$ and $b_i$ equalize, the relative differences among the various methods (other than BH95) become smaller.

Figure 3.1: Comparison under group-wise weights: BH97 (or BH95), WPO, DD (proposed), and AZ. Panels (a) and (b) plot the wFDR and the ETP as functions of $c_2$. The efficiency gain of the proposed method increases as $c_1$ and $c_2$ become more distinct.

In the second simulation study, we investigate the effect of the signal strength. As before, consider two groups of hypotheses with group sizes $m_1 = 3000$ and $m_2 = 1500$. Under this setting $c_1$ and $c_2$ are fixed at 3 and 0.33, respectively. The non-null proportion is $p = 0.2$ and the signal strength $\mu$ is varied from 1.75 to 2.5. We apply the different methods to the simulated data sets and obtain the wFDR and ETP levels as functions of $\mu$ by averaging results over 200 replications. The simulation results are summarized in Figure 3.2. We can see from Panel (a) that all methods control the wFDR at the nominal level 0.1 approximately (the BH95 method is very conservative and its result is not displayed). Panel (b) shows that the proposed method dominates the other competing methods, and the gain in power is more pronounced when the signals are weak. (The ETP increases rapidly with increasing signal strength. For better visualization of the results, we present the graph on a logarithmic scale. See Table 3.1 for the results of the BH95 method, as well as the ETP levels on the original scale.)

Figure 3.2: Comparison under group-wise weights: WPO, DD (proposed), and AZ. Panels (a) and (b) plot the wFDR and $\ln$ ETP as functions of $\mu$. The efficiency gain of the proposed method is more pronounced when signals are weak.

3.6.2 General weights

In applications where domain knowledge is precise (e.g., spatial cluster analysis), dividing the hypotheses into groups and assigning group-wise weights would not be satisfactory. This section investigates the performance of our method when random weights $(a_i, b_i)$ are generated from a bivariate distribution. In the third simulation study, we test $m = 3000$ hypotheses with $a_i$, the weights associated with the wFDR control, fixed at 1. We generate $b_i$, the weights associated with the power (or ETP), from a log-normal distribution with location parameter $\ln 3$ and scale parameter 1. The location parameter is chosen so that the median weight is 3, similar to the previous settings. We apply the different methods over 200 replications. The simulation results are summarized in Figure 3.3. The first row fixes $\alpha = 0.10$ and $p = 0.2$, and plots the wFDR and ETP as functions of $\mu$. The second row fixes $\alpha = 0.10$ and $\mu = 1.9$, and plots the wFDR and ETP as functions of $p$. The last row fixes $p = 0.2$ and $\mu = 1.9$, and plots the wFDR and ETP as functions of $\alpha$. In the plots, we omit the BH95 method, which is very conservative, and present the ETP on a logarithmic scale.
The following observations can be made: (i) all methods control the wFDR approximately at the nominal level; (ii) by exploiting the weights $b_i$, the WPO method outperforms the unweighted AZ method; (iii) the proposed method outperforms all competing methods; (iv) Panel (f) shows that the gain in power of the proposed method over the WPO method varies with the nominal level $\alpha$; (v) similar to the observations in the previous simulation studies, the difference between the WPO method and the proposed method decreases with increasing signal strength, and the efficiency gain of the proposed method is larger as signals become more sparse.

Table 3.1: ETP values (in original scale) of various methods corresponding to Figure 3.2.

mu             1.75   1.80   1.85   1.90   1.95   2.0    2.1    2.2    2.3    2.4    2.5
BH95           102.5  125.8  150.6  179.6  204.6  237.0  301.7  361.5  431.2  501.0  567.4
AZ             278.9  312.6  350.9  388.3  420.9  460.4  536.8  599.4  667.2  733.2  789.4
WPO            285.7  328.1  379.4  428.0  468.9  514.9  599.4  666.4  737.8  800.1  852.3
DD (proposed)  346.7  382.6  425.1  467.3  504.8  545.4  620.4  681.3  748.1  808.7  858.2

In the last simulation study, the $a_i$'s are assigned to two groups of hypotheses with group sizes $m_1 = 3000$ and $m_2 = 1500$. In groups 1 and 2, we fix $a_i = 1$ and $a_i = 3$, respectively. Conventional FDR methods are only guaranteed to work when all $a_i$ are fixed at 1; under this setting, we expect that the unweighted AZ method may fail to control the wFDR. We then generate random weights $b_i$ from a log-normal distribution with location $\ln 6$ and scale 1. The non-null proportion for group 1 is 0.2, and that for group 2 is 0.1. The mean of the non-null distribution for group 1, $\mu_1$, is varied over $[-3.75, -2]$, while that for group 2 is fixed at 2. The simulation results are shown in Figure 3.4. We can see that the unweighted AZ method fails to control the wFDR at the nominal level, which verifies our conjecture. The observations regarding the ETP are similar to those in the previous simulation study. Overall, all numerical studies together substantiate our theoretical results and affirm the use of the methodology in various settings.

Figure 3.3: Comparison with general weights: WPO, DD (proposed), and AZ. Panels (a)-(b), (c)-(d), and (e)-(f) plot the wFDR and $\ln$ ETP as functions of $\mu$, $p$, and $\alpha$, respectively. All methods control the wFDR approximately at the nominal level. The efficiency gains of the proposed method become more pronounced when (i) the signal strength decreases, (ii) the signals become more sparse, or (iii) the test level decreases.

Figure 3.4: Comparison with general weights: WPO, DD (proposed), and AZ. Panels (a) and (b) plot the wFDR and $\ln$ ETP as functions of $\mu_1$. The unweighted AZ method fails to control the wFDR at the nominal level. The efficiency gain of the proposed method increases as signals become weaker.

3.7 Application to GWAS

Weighted FDR procedures have been widely used in GWAS to prioritize the discoveries in pre-selected genomic regions. This section applies the proposed method to the analysis of a data set from the Framingham Heart Study (Fox et al., 2007; Jaquish, 2007).
A brief description of the study, the implementation of our methodology, and the results are discussed in turn. 3.7.1 Framingham Heart Study The goal of the study is to decipher the genetic architecture behind the cardiovascular dis- orders for the Caucasians. Started in 1948 with nearly 5,000 healthy subjects, the study is currently in its third generation of the participants. The biomarkers responsible for the car- diovascular diseases, for e.g., body mass index (BMI), weight, blood pressure, and cholesterol level, were measured longitudinally. 56 We analyze a subset of the original data set containing 977 subjects with 418 males and 559 females, whose BMIs are measured over time. Subjects are mostly from 29 years to 85 years old. The current data set also contains genetic information or genotype group of each participant over 5 million single nucleotide polymorphism (SNPs) on dierent chromosomes. Following the traditional GWAS, we exclude the rare SNPs, that is, SNPs with minor allele frequency less than 0.10, from our analyses. Male and Female populations are analyzed separately. For purpose of illustration, we only report the results from the Male population. 3.7.2 Multiple testing and wFDR control We consider the BMI as the response variable and develop a dynamic model to detect the SNPs associated with the BMI. Let Y i (t ij ) denote the response (BMI) from the i-th subject at time t ij , j = 1;:::;T i . Consider the following model for longitudinal traits: Y i (t ij ) =f(t ij ) + k G ik + i0 + i1 t ij + i (t ij ); (3.23) where f() is the general eect of time that is modeled by a polynomial function of suitable order, k is the eect of the k-th SNP on the response and G ik denotes the genotype of the i-th subject for the k-th SNP. We also consider the random intercepts and random slopes, denoted 0i and 1i , respectively, for explaining the subject-specic response trajectories. A bivariate normal distribution for i = ( 0i ; 1i ) is assumed. Moreover, we assume that the residual errors are normally distributed with zero mean, and covariance matrix i with an order-one auto-regressive structure. We t model (3.23) for each SNP and obtain the estimate of the genetic eect b k . If we reject the null hypothesis H 0 : k = 0 vs. H 1 : k 6= 0, then we conclude that the k-th SNP has a signicant association with the BMI. Since we have nearly 5 million SNPs, the false discovery rate needs to be controlled for making scientically meaningful inference. For each k, we take standardized b k as our z-scores and obtain the estimated ranking statistic b R k 57 as described in (3.22). For selecting the weights, we use the following three methods, with a k = 1 in all the three cases: Method I: Here we take b k = 1; this is just the unweighted case. Method II: We rst perform a baseline association test for each SNP. We consider only the baseline BMI for each subject and group the response values as High (BMI higher than 25), Low (BMI lower than 18.5), and Medium. For each SNP, we have three genotypes and thus we get a 3 3 table for each SNP and perform a chi-square association test. The p-values are recorded. Now we partition the SNPs into three groups based on thesep-values (lower than 0.01, higher than 0.10, and in between). For each group, we compute the average of the inverse of the p-values and take this average as b k 's for all the SNPs belonging to this particular group. Method III: We consider the dynamic model (3.23) and derive the p-values for testing H 0 : k = 0 vs. 
H 1 : k 6= 0 for the Female population. We partition the SNPs into three groups based on these p-values: lower than 0.01, higher than 0.10, and in between. For each group, we compute the average of the inverse of the p-values and take this average as the b k 's for all the SNPs belonging to this particular group while analyzing the data from the Male population. Similar methodology of deriving weights from a reference population has been previously explored in Xing et al. (2010). 3.7.3 Results In Table 2, we present the number of selected SNPs from three dierent methods at dierent levels. Now we study in detail the SNPs identied at = 10 6 ; this choice of is typical in GWA studies. In Table 3, we list some important SNPs (previously reported in the literature) detected by all three methods. For example, SNPs on chromosomes 3, 5, and 17 were reported in Das et al. (2011, Human Heredity) as signicant SNPs associated with high blood pressure and related cardio-vascular diseases. SNPs on chromosome 6 and 8 were 58 reported in Das et al. (2011, Human Genetics). Li et al. (2011, Bioinformatics) reported the SNP on chromosome 10 to be associated with BMI. Table 3.2: Number of selected SNPs at dierent levels for the Male population Method I Method II Method III 10 3 1384 988 1093 10 4 832 447 518 10 5 271 69 91 10 6 86 12 33 Table 3.3: Some previously reported SNPs detected by all three methods Chromosome SNP Position Trait/Disease (associated with) 3 ss66149495 16,140,422 Blood Pressure 5 ss66501706 147,356,971 Blood Pressure 6 ss66068448 131,562,687 BMI 8 ss66359352 11093585 BMI 10 ss66311679 32,719,838 BMI 17 ss66154967 29,846,491 Blood Pressure In Table 4, we list previously reported SNPs which were detected only by Methods II and III. Das et al. (2011) reported the SNP on chromosome 12. Li et al. (2011) reported the SNPs on chromosomes 1, 10, 20, and 22. Table 3.4: Some previously reported SNPs detected by Methods II and III only Chromosome SNP Position Trait/Disease (associated with) 1 ss66185476 8,445,140 BMI 10 ss66293192 32,903,593 BMI 12 ss66379521 130,748,789 Blood Pressure 20 ss66171460 22,580,931 BMI 22 ss66055592 23,420,006 BMI In Table 5, we list some previously reported SNPs which were detected only by Method III. Das et al. (2011) reported the SNP on chromosome 19. Li et al. (2011) reported the SNPs on chromosomes 1, 10, and 22. Note that 11 out of 12 SNPs identied by method II have been, as tabulated in Tables 3 and 4, previously identied in dierent studies. The SNP ss66077670 on Chromosome 9 is the only identied SNP that has not been previously reported, to our knowledge, and may be further explored by domain experts. 59 Table 3.5: Some previously reported SNPs detected by Method III only. Chromosome SNP Position Trait/Disease (associated with) 1 ss66364251 198321700 BMI 10 ss66303064 32,995,111 BMI 19 ss66092412 56,060,316 Blood Pressure 22 ss66164329 23,420,370 BMI 3.8 Discussion In the multiple testing literature, procedural, decision, and class weights are often viewed as distinct weighting schemes and have been mostly investigated separately. Although this work focuses on the decision weights approach, the decision-theoretic framework enables a unied investigation of other weighting schemes. For example, a comparison of the LR (3.14) and WLR (3.17) demonstrates how the LR statistic may be adjusted optimally to account for the decision gains/losses. This shows that procedural weights may be derived in the decision weights framework. 
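Looking back at the weight constructions of Section 3.7.2, the step shared by Methods II and III can be written down in a few lines. The sketch below partitions the SNPs into three groups by auxiliary p-values (from the baseline association tests in Method II, or from the reference Female population in Method III) and assigns every SNP the group average of the inverse p-values as its power weight, with the error weight fixed at 1; the thresholds 0.01 and 0.10 are those stated above, and the array names are illustrative.

```python
import numpy as np

def group_weights(aux_pvals, cut_low=0.01, cut_high=0.10):
    """Group-wise power weights b_k from auxiliary p-values (Methods II and III).

    Groups: p < 0.01, 0.01 <= p <= 0.10, p > 0.10. Every SNP in a group gets
    the group average of 1/p as its weight; a p-value of exactly zero would
    need a small floor in practice.
    """
    aux_pvals = np.asarray(aux_pvals, dtype=float)
    group = np.where(aux_pvals < cut_low, 0, np.where(aux_pvals > cut_high, 2, 1))
    b = np.ones_like(aux_pvals)
    for g in range(3):
        mask = group == g
        if mask.any():
            b[mask] = np.mean(1.0 / aux_pvals[mask])
    a = np.ones_like(b)                      # a_k = 1 in all three methods
    return a, b
```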
Moreover, the difference between the WLR (3.17) and the WPO (3.20) shows the important role that $p_i$ plays in multiple testing. In particular, the WPO (3.20) provides important insights on how prior beliefs may be incorporated in a decision weights approach to derive appropriate class weights. To see this, consider the multi-group model (3.3). Following the arguments in Cai and Sun (2009), we can conclude that in order to maximize the power, different FDR levels should be assigned to different classes. Similar suggestions for varied class weights have been made in Westfall and Young (1993, pages 169 and 186). These examples demonstrate that the decision weights approach provides a powerful framework to derive both procedural weights and class weights.

We have assumed that the decision weights are pre-specified by the investigators. It is of interest to extend the work to the setting where the weights are unknown. Due to the variability in the quality of external information, the subjectivity of investigators, and the complexity of modeling and analysis, a systematic study of this issue is beyond the scope of the current work. Notable progress has been made, for example, in Roeder and Wasserman (2009) and Roquain and van de Wiel (2009). However, these methods are mainly focused on the weighted p-value approach under the unweighted FDR criterion, and hence do not apply to the framework in Benjamini and Hochberg (1997). Moreover, the optimal decision rule in the wFDR problem is in general not a thresholding rule based on the adjusted p-values. Much work is still needed to derive decision weights that would optimally incorporate domain knowledge in large-scale studies.

Chapter 4 Application and Future Directions

4.1 Collaboration

This chapter is based on collaborative work with Luella Fu, Brian Reich, and Wenguang Sun. We have also benefited from discussions with Trambak Banerjee and Yun Ling in the context of the finance application(s), and with Weinan Wang in the area of multiple testing. This is ongoing work.

4.2 wFDR control of mutual fund performance alphas

In the highly cited work of Barras, Scaillet, and Wermers (2010), the authors make use of the Storey (2002) estimator of the null proportion to correct for false discoveries. The aim is to separate out mutual fund managers whose higher excess returns arise by luck alone. Mutual fund performance is measured by the intercept, called the `alpha', when the excess return of a fund is regressed on market indicators (e.g., the Fama and French three-factor model). Given that thousands of funds are tested simultaneously for performance, this naturally calls for a correction for false discoveries. First, prior information on the funds is often readily available. For example, the funds may be categorized as growth, aggressive growth, or growth and income. The credit ratings of the funds provide another way of grouping the funds based on prior information. In this situation, we may be interested in weighting the categories differently, leading to a weighted multiple testing problem.

Further, two other aspects of the problem can be of interest; both of these issues, and possible ways to mitigate them, are discussed in the next two sections. Some unskilled funds could by chance be very lucky, and similarly some skilled ones may be unlucky. This calls for a correction for false directions. Furthermore, it may well be the case that many alphas are non-zero, yet we are only interested in those which deviate substantially from zero.
Further, two other aspects of the problem can be of interest; both issues, and possible ways to mitigate them, are discussed in the next two sections. Some unskilled funds could by chance be very lucky, and similarly some skilled ones may get unlucky as well; this calls for a correction for false directions. Further, it may well be the case that many alphas are non-zero, whereas we are only interested in those which deviate largely from zero. Correction of the false discovery rate, moreover, only controls the false discovery proportion on average. In practice we are interested in controlling the false discovery proportion with high certainty. In particular, if the null distribution has fatter or skewed tails, it seems intuitive that the variability or skewness of the false discovery proportion will increase as well.

4.3 Multiple testing with confident directions

4.3.1 Motivation

John Tukey (1991), in The Philosophy of Multiple Comparisons, alluded to the idea that "what we should be answering first is 'can we tell the direction in which the effects of A differ from the effects of B?' In other words, can we be confident about the direction from A to B? Is it up, down, or uncertain?" He further provided suggestions on follow-up questions. It is useful to note that "uncertain" here refers to an indecision and, in particular, does not indicate acceptance of the null hypothesis that the effects of A and B are equal. Our state space here is {−1, +1}, while our decision space is {−1, 0, +1}. Because an indecision further calls for a test of inequality of the effects of A and B, we do not consider a falsely rejected true null as an error.

4.3.2 Problem formulation

Given observations Z = (Z_i)_{i∈[m]}, we want to test the hypotheses H = (H_i^1, H_i^2)_{i∈[m]}, where H_i^1 : μ_i < 0 and H_i^2 : μ_i > 0, under the model

Z_i | θ_i ∼ 1{θ_i = −1} F_{1i}(μ_i, σ) + 1{θ_i = +1} F_{2i}(μ_i, σ).

Here θ_i denotes the true state of nature. We are interested in deciding on δ_i such that the ETP is maximized subject to FDR control at level α. Our definition of the ETP in this context is

ETP := E [ Σ_i { 1{δ_i = −1} 1{θ_i = −1} + ω 1{δ_i = +1} 1{θ_i = +1} } ],

and the FDR to be controlled is

FDR := E [ Σ_i { 1{δ_i = +1} 1{θ_i = −1} + 1{δ_i = −1} 1{θ_i = +1} } ] / E [ Σ_i { 1{δ_i = −1} + 1{δ_i = +1} } ] ≤ α.

We interpret ω as the benefit of correctly identifying an increase when it is truly an increase. Note that ω > 1 may be a valid choice in this context. In the case that it is more beneficial to identify a decrease, we may choose ω < 1.

4.3.3 One possible adaptive solution

We view this as a knapsack problem with three knapsacks: bags 1 and 2 corresponding to hypotheses 1 and 2, and bag 3 for the waste, or reject, option. The waste bag does not contribute to the total profit and has unlimited capacity. Bags 1 and 2 have symmetric cost and asymmetric profit (that is, neither error is more expensive than the other, but it is more useful to discover objects of one type) and a limited total capacity. The profit-to-weight ratio (PWR) for each test is

PWR_i := max{ P(θ_i = −1 | Z) / P(θ_i = +1 | Z) , ω P(θ_i = +1 | Z) / P(θ_i = −1 | Z) }.

When ω = 1, and in the situation that the μ_i's are inherently continuous, hence assuming P(θ_i = +1 | Z) = 1 − P(θ_i = −1 | Z), the adaptive procedure is to rank by decreasing

T_i(z_i) := max{ P(θ_i = −1 | Z_i = z_i) , 1 − P(θ_i = −1 | Z_i = z_i) },

and to label the intermediate decisions Δ_i = −1 if P(θ_i = −1 | Z_i = z_i) > 1/2 and Δ_i = +1 otherwise. Then stop at the largest k such that

(1/k) Σ_{j∈[k]} (1 − T_(j)) ≤ α,

where T_(j) is the j-th order statistic in decreasing order. Let Δ_(j) be the corresponding decisions as defined. The final decision rule is then δ_(j) = Δ_(j) for j ∈ [k] and δ_(j) = 0 otherwise, where zero indicates indecision (the reject option). Asymmetric errors can be handled by putting appropriate weights and redefining the ranking statistic. Viewing the ranking as a profit-to-weight ratio also makes the approach generalizable to more than two hypotheses.

When ω > 1 and both P(θ_i = +1 | Z) and P(θ_i = −1 | Z) are greater than α, the intermediate decision Δ_i = −1 is preferred when

P(θ_i = −1 | Z) / [P(θ_i = +1 | Z) − α] > ω P(θ_i = +1 | Z) / [P(θ_i = −1 | Z) − α],

or, P(θ_i = −1 | Z) / [1 − P(θ_i = −1 | Z) − α] > ω [1 − P(θ_i = −1 | Z)] / [P(θ_i = −1 | Z) − α],

or, (ω − 1) P(θ_i = −1 | Z)² − P(θ_i = −1 | Z) [2ω − α(1 + ω)] + ω(1 − α) < 0,

where if we choose ω = 1 we retrieve the condition P(θ_i = −1 | Z) > 1/2, of course assuming that α < 1. Returning to the case in point, that is, when ω > 1, the quadratic equation has real roots if and only if

α²(1 + ω)² + 4ω(1 − 2α) ≥ 0,

for which a sufficient condition is α ≤ 1/2. It may be enough to restrict the discussion to this case for practical purposes. Further, the expression is negative when P(θ_i = −1 | Z) takes values between

( { 2ω − α(1 + ω) − √[α²(1 + ω)² + 4ω(1 − 2α)] } / [2(ω − 1)] , { 2ω − α(1 + ω) + √[α²(1 + ω)² + 4ω(1 − 2α)] } / [2(ω − 1)] ).

Noting that the quantity on the right is larger than 1, the above condition is equivalent to

P(θ_i = −1 | Z) > 1/2 + { (1 − α)(1 + ω) − √[α²(1 + ω)² + 4ω(1 − 2α)] } / [2(ω − 1)].

Again assuming that α ≤ 1/2, the excess of the above quantity over 1/2 takes the same sign as (ω − 1). Hence, if ω > 1 we require a stricter condition on P(θ_i = −1 | Z) in order to choose Δ_i = −1. Recall that this is only an intermediate decision and does not yet include the reject option. By a similar argument, when ω < 1 we can relax the condition on P(θ_i = −1 | Z) for choosing Δ_i = −1.

Note that for α ≤ 1/2, having both P(θ_i = +1 | Z) and P(θ_i = −1 | Z) less than α is impossible. The somewhat more puzzling cases are when one of the two quantities P(θ_i = +1 | Z) and P(θ_i = −1 | Z) is less than α and the other is not. For the symmetric case this is fine, since P(θ_i = −1 | Z) < α implies P(θ_i = +1 | Z) > 1 − α > α, and hence the decision that expands the knapsack also has the higher profit; thus we should choose Δ_i = +1. Building on that, for ω > 1 it still holds that ω P(θ_i = +1 | Z) > P(θ_i = −1 | Z), and the same conclusion applies. Thus the only remaining case is when P(θ_i = +1 | Z) < α, where clearly P(θ_i = −1 | Z) > P(θ_i = +1 | Z) but it need not be the case that P(θ_i = −1 | Z) > ω P(θ_i = +1 | Z). We consider this situation by cases defined by the amount of asymmetry, that is, by the value of ω.

When ω ≤ (1 − α)/α, in the case that P(θ_i = +1 | Z) < α we have P(θ_i = −1 | Z) > 1 − α. Then

P(θ_i = −1 | Z) > 1 − α ≥ ωα > ω P(θ_i = +1 | Z).

Thus Δ_i = −1 is the dominant intermediate decision for the i-th hypothesis. The next question is then: in what kind of situations would we want to choose ω > (1 − α)/α? The next paragraph provides some sample-size-order sufficiency conditions for intermediate optimality.

Fix ε > 0. For m the number of tests and observations Z = (Z_i)_{i∈[m]}, classify the hypotheses into disjoint classes:

(Class I) P(θ_i = +1 | Z) < α,
(Class II) α ≤ P(θ_i = +1 | Z) < α(1 + ε),
(Class III) α(1 + ε) ≤ P(θ_i = +1 | Z) ≤ 1 − α(1 + ε),
(Class IV) 1 − α(1 + ε) < P(θ_i = +1 | Z) ≤ 1 − α, and
(Class V) 1 − α < P(θ_i = +1 | Z).

We note the following. Class V holds if and only if P(θ_i = −1 | Z) < α. Classes I and V are the only two classes that can potentially contribute to `expanding' the knapsack. Class III holds if and only if α(1 + ε) ≤ P(θ_i = −1 | Z) ≤ 1 − α(1 + ε). In Class V we know that the optimal intermediate decision is Δ_i = +1. Further, Class III candidates should be allotted decisions Δ_i = +1 before Class I candidates. Let m_I be the number of hypotheses in Class I, and define m_III and m_V similarly. Then the maximum capacity that could be freed by Class V is m_V α, while the minimum capacity that would be occupied by all of Class III is m_III αε. If, even after deciding all the hypotheses in Class I to be Δ_i = −1, there is just enough space to decide Δ_i = +1 for those in Class III, then none of the hypotheses in Class I can possibly have Δ_i = +1; note that they can still have Δ_i = −1 or the reject option. The above condition translates to

m_I α ≤ m_III αε − m_V α, or (m_I + m_V)/m_III ≤ ε,

but these quantities are random, hence one possible condition:

Condition 1. For the choice of α there exists a positive ε such that

lim_{m→∞} P( (m_I + m_V)/m_III > ε ) = 0.

4.3.4 Proposed testing procedure

The following procedure is `asymptotically optimal' when α ≤ 1/2 and either ω ≤ (1 − α)/α or Condition 1 holds:

Procedure 3.
1. If P(θ_i = −1 | Z) ≤ α then δ_i = +1, or if P(θ_i = +1 | Z) ≤ α then δ_i = −1. Define the total available capacity as W := Σ_{i ∈ Step 1} { α − min[ P(θ_i = −1 | Z), P(θ_i = +1 | Z) ] }.
2. For the remaining hypotheses, define the intermediate decisions Δ_i ∈ {−1, +1} according to whether or not the first argument of PWR_i is the maximum, where PWR_i := max{ P(θ_i = −1 | Z)/P(θ_i = +1 | Z), ω P(θ_i = +1 | Z)/P(θ_i = −1 | Z) }. Rank these hypotheses by decreasing PWR_i.
3. Going down this ranking, decide δ_i = Δ_i as long as Σ_{i ranked in Step 2} { P(θ_i = −Δ_i | Z) − α } ≤ W. After that, exercise the reject option.

4.4 Control of the false discovery exceedance (FDX)

4.4.1 Motivation

Control of the FDR only guarantees control of the expected value of the false discovery proportion (FDP). In practice, both the variability and the skewness of the FDP can be high for common procedures such as BH. It is therefore often desirable to control the FDP with high certainty instead of simply controlling its expectation. With this motivation, the FDX is a natural quantity to control. We thus aim to maximize power subject to a tail probability bound on the FDP.

4.4.2 Problem formulation

Given observations Z = (Z_i)_{i∈[m]}, we want to test the hypotheses H = (H_i^0, H_i^1)_{i∈[m]}, where H_i^0 : μ_i = 0 and H_i^1 : μ_i ≠ 0. In the random mixture model framework,

Z_i | θ_i ∼ 1{θ_i = 0} F_{0i}(0) + 1{θ_i = 1} F_{1i}(μ_i),

where μ_i is some location parameter and θ_i denotes the true state of nature. We are interested in making a decision about δ_i such that the ETP is maximized subject to FDX control. Our definition of the ETP in this context is

ETP := E [ Σ_i 1{δ_i = 1} 1{θ_i = 1} ],

and FDX control requires

FDX := P( FDP > γ ) ≤ ε,

for given (γ, ε) ∈ (0, 1)², where

FDP := Σ_i 1{θ_i = 0} 1{δ_i = 1} / ( Σ_i 1{δ_i = 1} ∨ 1 ).

4.4.3 Exact and implementable solutions

Here we provide one exact solution to the problem; it can be shown that this procedure controls the FDX at level (γ, ε). We then propose another procedure, an approximate one, which is less computationally intensive.

Procedure 4.
1. Rank by increasing lfdr test statistic T_i := P(θ_i = 0 | data). Without loss of generality, assume the hypotheses are so ranked.
2. Reject up to K := max{ k : P( |false| > γk | data ) ≤ ε }, where |false| denotes the number of false rejections among the top k.

We further make two modifications to the suggested procedure, both aiming to reduce computation. First, we compute a simple bound on the number of possible rejections, which may be implemented by standard methods in the literature (e.g., Sun and Cai, 2007). This step is exact and is based on the fact that

P( |false| > γk | data ) ≤ ε implies E( |false| | data ) ≤ γk + ε(1 − γ)k.

Second, we approximate by the binomial right tail (BRT), where

BRT(k) := P( Bin(k, T_(k)) > γk ).

This is a conservative estimate of the tail probability; with some more computation we could hope to be a little less conservative here.

Procedure 5.
1. Rank by increasing lfdr test statistic T_i := P(θ_i = 0 | data). Without loss of generality, assume the hypotheses are so ranked.
2. First reject up to K_1 := max{ k ∈ [m] : E( |false| | data ) ≤ γk + ε(1 − γ)k }.
3. Finally reject only up to K_2 := max{ k ∈ [K_1] : BRT(k) ≤ ε }.

A minimal computational sketch of Procedure 5 is given shortly below.

4.4.4 Connections to other method(s)

The following step-up oracle method has been proposed in Roquain and Villers (2011). The general idea is to sort the p-values p_(1) ≤ p_(2) ≤ ⋯ ≤ p_(m) (define p_(0) := 0) and find k̂ := max{ k ∈ {0, 1, …, m} : p_(k) ≤ t_k }.
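Returning to Procedure 5 above, the following is a minimal computational sketch, assuming the lfdr statistics T_i = P(θ_i = 0 | data) have already been estimated. Under the usual posterior interpretation of the lfdr, the expected number of false rejections among the top k is the running sum of the ordered lfdr values. The function name and the use of scipy.stats.binom for the binomial right tail are illustrative choices, not part of the original text.

import numpy as np
from scipy.stats import binom

def fdx_procedure5(lfdr, gamma, eps):
    # Step 1: rank hypotheses by increasing lfdr.
    lfdr = np.asarray(lfdr, dtype=float)
    order = np.argsort(lfdr)
    t_sorted = lfdr[order]
    k_values = np.arange(1, len(lfdr) + 1)

    # Step 2: E(|false| | data) among the top k is the cumulative sum of the
    # ordered lfdr values; keep k while it stays below gamma*k + eps*(1-gamma)*k.
    expected_false = np.cumsum(t_sorted)
    ok = expected_false <= gamma * k_values + eps * (1.0 - gamma) * k_values
    K1 = int(k_values[ok].max()) if ok.any() else 0

    # Step 3: conservative refinement via the binomial right tail,
    # BRT(k) = P(Bin(k, T_(k)) > gamma*k) <= eps.
    K2 = 0
    for k in range(1, K1 + 1):
        if binom.sf(gamma * k, k, t_sorted[k - 1]) <= eps:
            K2 = k
    return order[:K2]   # indices of the rejected hypotheses

# Illustrative usage with made-up lfdr values
rejected = fdx_procedure5([0.01, 0.02, 0.60, 0.05, 0.90], gamma=0.10, eps=0.05)

Note that the BRT check is applied only for k up to K_1, so the tail computation is restricted to candidate rejection counts already allowed by the exact expectation bound.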
Let G(t) := 0 F 0 (t) + (1 0 )F 1 (t) be the mixture distribution of the p-values and dene t m+1 := 1 then for 1km nd t k := maxft2 [0;t k+1 ]jP [X k > k]g where X k Bin(k; 0 t=G(t)): The expressionF 0 (t) =t is used due to the fact that under the null the p-values are uniformly distribution. This is the oracle case. In the data driven procedure the quantities 0 andG(t) have to be estimated. However the authors did not propose a method to estimate these unknown quantities and assume 0 1 and refers Chi and Tan (2008) to use the procedure instead with t CT k := maxft2 [0;t CT k+1 ]jP [X k > k]g where X k Bin(k; 1^mt=k); (4.1) with the convention that t CT m+1 := 1. 71 Chapter 5 Proofs of Results in Chapter 2 5.1 Acknowledgments We are thankful to Larry Goldstein for the insightful course on the concentration of measure phenomenon and to Anirban Basak for pointing to most recent result(s) on the topic. 5.2 Proofs of some main results This chapter presents the proofs of Theorems 1 and 3{4. The proofs of all other theorems and technical lemmas also follows. For notational simplicity, throughout the proofs we may specify the orders of dierent quantities without stating the exact constants, and use the notation y for observed response and Y for random response interchangeably when it is convenient. 5.2.1 Proof of Theorem 1 Throughout this proofkk 2 denotes the Euclidean norm of a given vector. The main idea of the proof is to obtain a probabilistic lower bound for the eventf b n 2N n ( n )g: To accomplish that we rst consider an event that is a subset of this event and calculate the probabilistic lower bound for the smaller event. Recall that b 00 () is continuous and bounded away from 0, and X is of full column rank d. Recall the denition N n ( n ) =f2R d :k(n 1 B n ) 1=2 ( n;0 )k 2 (n=d) 1=2 n g, and 72 let @N n ( n ) denote the boundary of this neighborhood. Since N n ( n ) is compact, ` n (y;) a continuous strictly concave function, whenever the event Q n = ` n (y; n;0 )> max 2@Nn(n) ` n (y;) (5.1) occurs, b n will be inN n ( n ). The strict concavity of the log-likelihood function follows from the positive deniteness of A n () = X T (X)X, which is the negative of the Hessian of the log-likelihood. This property entails that on the event Q n , the global maximizer b n must belong to the interior of the neighborhood N n ( n ). Hereafter we condition on the event Q n dened in (5.1). The technical arguments that follow herein, in order to prove thatQ n holds with signicant probability, require delicate analyses due to growing dimensionality d. Applying Taylor's expansion to the log-likelihood function` n (y;) around n;0 , we obtain ` n (y;)` n (y; n;0 ) = ( n;0 ) T n ( n;0 ) 1 2 ( n;0 ) T A n ( )( n;0 ); where is on the line segment joining and n;0 and n ( n;0 ) = X T [y(X n;0 )]. By letting u = 1 n d 1=2 B 1=2 n ( n;0 ), the above Taylor's expansion can be rewritten as ` n (y;)` n (y; n;0 ) =d 1=2 n u T B 1=2 n n ( n;0 )d 2 n u T V n ( )u=2; (5.2) where V n () = B 1=2 n A n ()B 1=2 n . From the denition of u, 2 @N n ( n ) is equivalent tokuk 2 = 1, and 2 @N n ( n ) implies 2N n ( n ) since N n ( n ) is convex. Also it is clear that max kuk 2 =1 u T B 1=2 n n ( n;0 ) =kB 1=2 n n ( n;0 )k 2 : (5.3) 73 From Condition 2, for n suciently large, min 2Nn(n) min fV n ()g>c 1 n r where 0<r< 1=4. 
Using this condition and since 2N n ( n ), it holds that min kuk 2 =1 u T V n ( )u min 2Nn(n) min fV n ()g>c 1 n r : (5.4) Hence by combining (5.38){(5.39) and taking a supremum on the boundary@N n ( n ) in (5.37) we derive max 2@Nn(n) ` n (y;)` n (y; n;0 )<d 1=2 n [kB 1=2 n n ( n;0 )k 2 (5.5) 2 1 c 1 n r d 1=2 n ]: By (2.7), we have X T [Ey(X n;0 )] = 0. Hence, n ( n;0 ) = X T [y(X n;0 )] = X T (yEy). Denote by W = B 1=2 n n ( n;0 ) = B 1=2 n X T (yEy). Notice that EW = 0 and Cov(W) = B 1=2 n Cov( n ( n;0 ))B 1=2 n = B 1=2 n B n B 1=2 n =I d . Clearly the left hand side of (5.5) is negative with probability given by PfkWk 2 2 1 c 1 n r d 1=2 n g: (5.6) From the expression of W, we have kWk 2 2 =(yEy) T XB 1 n X T (yEy) =[(yEy) T Cov(y) 1=2 ][Cov(y) 1=2 XB 1 n X T Cov(y) 1=2 ] [Cov(y) 1=2 (yEy)]; where denotes product. Denote by R = Cov(y) 1=2 XB 1 n X T Cov(y) 1=2 and q = Cov(y) 1=2 (y Ey). It is easy to check that R 2 = R. Therefore, R is a projection matrix with rank tr(R) =d. In addition, we have Eq = 0 and Cov(q) =I n . 74 We now decomposekWk 2 2 into two terms, the summations of the diagonal entries and the o-diagonal entries, respectively, kWk 2 2 = n X i=1 r ii q 2 i + X 1i6=jn r ij q i q j ; (5.7) where r ij denotes the (i;j)-entry of R. Next we obtain probabilistic bounds for each of the two terms. [Proof for sub-Gaussian random variables] From the sub-Gaussian tail condition for q in Condition 1, there exists some positive constant H such that for any t 0, P (jq i j>t)H exp(t 2 =H); (5.8) for all 1in. Thus for any t 0, it holds that P ( n \ i=1 fq 2 i t 2 g ) 1 n X i=1 Pfq 2 i >t 2 g 1nH exp(t 2 =H): (5.9) On the event \ n i=1 fq 2 i t 2 g, we can bound the rst term of (5.7) as n X i=1 r ii q 2 i t 2 tr(R) =dt 2 : (5.10) Denote by R D a diagonal matrix with diagonal entries r ii . As a result, we observe that X 1i6=jn r ij q i q j = q T (RR D )q. It is easy to see thatE q T (R R D )q = 0. We will use a version of the Hanson-Wright inequality (see, e.g., Theorem 1.1 of Rudelson and Vershynin, 2013) to obtain the concentration bound of the quadratic form q T (R R D )q. But we rst start with some notation and preparation. 75 Letkk 2 denote the sub-Gaussian norm of a sub-Gaussian random variable dened askk 2 = sup m1 m 1=2 (Ejj m ) 1=m . From Condition 1, that is, the condition on sub- Gaussian tails, we derive Ejq i j m =m Z 1 0 x m1 P (jq i jx)dxHm Z 1 0 x m1 exp(x 2 =H)dx = (Hm=2)H m=2 Z 1 0 u m=21 exp(u)du = (Hm=2)H m=2 (m=2) (Hm=2)H m=2 (m=2) m=2 ; where the last line follows directly from the denition of the Gamma function. Taking the m-th root, we have (Ejq i j m ) 1=m (Hm=2) 1=m H 1=2 (m=2) 1=2 : Rewriting after bounding (1=2) (1=m)+(1=2) by 1, we obtain m 1=2 (Ejq i j m ) 1=m m 1=m H (1=2)+(1=m) e 1=e (H 3=2 _ 1) since m 1. Therefore, it holds thatkq i k 2 c 4 for all i, where c 4 =e 1=e (H 3=2 _ 1). We now need bounds on the operator and Frobenius norms of RR D . Denotekk 2 and kk F as the matrix operator and Frobenius norms, respectively. Note thatkRk 2 = 1 and kRk 2 F = tr(R 2 ) = tr(R) =d. Thus using the fact P i6=j r 2 ij d, we obtainkR R D k 2 F d. Sincejr ii j 1, we further obtainkR R D k 2 kRk 2 +kR D k 2 2. Thereby, a direct application of the Hanson-Wright inequality yields P ( X i6=j r ij q i q j >dt 2 ) =P jq T (R R D )qj>dt 2 (5.11) 2 exp c 5 min d 2 t 4 c 4 4 kR R D k 2 F ; dt 2 c 2 4 kR R D k 2 2 exp c 5 min dt 4 c 4 4 ; dt 2 2c 2 4 2 expfc 6 dt 2 g 76 for any t > c 4 = p 2, where c 5 and c 6 are some positive constants. 
To ensure t > c 4 = p 2, we choose n = n r (c 0 logn) 1=2 for some constant c 0 > 8c 2 1 H and t = 2 3=2 c 1 n r n . Therefore, the probability bound (5.11) holds for large enough n. Combining (5.9) and (5.11), with probability at least 1nH exp(t 2 =H) 2 exp(c 6 dt 2 ), we have kWk 2 2 X i r ii q 2 i + X i6=j r ij q i q j 2dt 2 : In view of our choice of t, it holds that (2dt 2 ) 1=2 = 2 1 c 1 n r d 1=2 n and thus PfkWk 2 2 1 c 1 n r d 1=2 n g 1nH exp(t 2 =H) 2 exp(c 6 dt 2 ) 1O(n ); where = [(c 2 1 c 0 =(8H) 1)]^ (c 2 1 c 6 c 0 =8). Note that > 0 since c 0 > 8c 2 1 H. This leads to P (Q n ) 1O(n ): (5.12) The positive constant can be large if c 0 in n is chosen to be large. From Condition 2, min (B n )!1 at a faster rate than d 2 n . Then we have the consistency b n n;0 =o P (1). [Proof for sub-exponential random variables] First we state an useful lemma. Lemma 1. (Concentration bounds for sub-exponential random variables) Let q i 's be n inde- pendent, but not necessarily identically distributed, scaled and centered random variables with uniform sub-exponential decay, that is, P (jq i j>t)H exp(t=H): Then the following holds: 77 1. Letkq i k 1 denote the sub-exponential norm dened by kq i k 1 := sup m1 m 1 (Ejq i j m ) 1=m : Then we havekq i k 1 c 4 for all i, where c 4 :=e 1=e H(H_ 1). 2. Further, let r ij denote the (i;j)-entry of any real valued matrix. Then, (a) for the diagonal entries, it holds that P 8 < : n X i=1 r ii q 2 i n X i=1 r ii n " n X i=1 r 2 ii # 1=2 9 = ; C exp(c 1=2 n ); (b) and for the o-diagonal entries we have that P 8 < : X i6=i 0 r ii 0q i q i 0 n " X i6=i 0 r 2 ii 0 # 1=2 9 = ; C exp(c 1=4 n ); where n is any user-dened expression with n 1 and in particular may depend on n. The positive constants C and c depend only on H 1 and H 2 . To begin with choose n :=d 1=2 n 2 r for r> 0. For all n> 2 we have that, P ( n X i=1 r ii q 2 i 2dn 2 r ) P ( n X i=1 r ii q 2 i d(1 +n 2 r ) ) P ( n X i=1 r ii q 2 i n X i=1 r ii n d 1=2 ) ; 78 where we use that fact that tr(R) =d. Further using, [ P n i=1 r 2 ii ] 1=2 d 1=2 , we derive P ( n X i=1 r ii q 2 i n X i=1 r ii n d 1=2 ) P 8 < : n X i=1 r ii q 2 i n X i=1 r ii n " n X i=1 r 2 ii # 1=2 9 = ; C exp(c 1=2 n ) =C exp(cd 1=4 n r )C exp(cn r ) where we apply part 2(a) of Lemma 1. Thus we obtain, P ( n X i=1 r ii q 2 i 2dn 2 r ) C exp(cn r ): (5.13) Again with the choice of n := d 1=2 n 2 r and using the fact that h P i6=j r 2 ij i 1=2 d 1=2 we derive, P ( X i6=j r ij q i q j dn 2 r ) P 8 < : X i6=j r ij q i q j n " X i6=j r 2 ij # 1=2 9 = ; C exp(c 1=4 n ); applying part 2(b) of Lemma 1. Substituting the value of n and using d 1=8 1, we obtain P ( X i6=j r ij q i q j dn 2 r ) C exp(cn r ): (5.14) Combining (5.13) and (5.14), with probability at least 1 2C exp(cn r ), we have kWk 2 2 X i r ii q 2 i + X i6=j r ij q i q j 3dn 2 r : In other words, P (kWk 2 p 3d 1=2 n r ) 1O(expfcn r g): 79 With our choice of n =c 1 1 2 p 3n r+ r , this leads to P (Q n ) 1O(expfcn r g): (5.15) 5.2.2 Proof of Theorem 3 DeneE =f b n 2 N n ( n )g, where b n stands for the QMLE. Note thatE does depend on n, but for simplicity of notation we will omit the subscript n in sequel. To establish this theorem we require a possibly dimension dependent bound on the quantitykn 1=2 X b n k 2 . The need for bounding the specied quantity, particularly with growing dimensionality, can be intuitively understood by trying to put some restriction on the parameter space. This is analogous to the case of penalized likelihood. 
Recall the neighborhood M n ( 1 ) =f2 R d :kXk 1 1 logng, where 1 is some positive constant satisfying 1 <=2 1. One way of bounding the quantitykn 1=2 X b n k 2 is to restrict the QMLE b n on the set M n ( 1 ). As mentioned in Theorem 1, the constant can be large ifc 0 is chosen to be large, which ensures that 1 is positive. From Condition 5, N n ( n ) M n ( 1 ) for all suciently large n to ensure that conditional onE, the restricted MLE coincides with its unrestricted version. However, this condition is very mild in the sense that the constant 1 can be chosen as large as desired to make M n ( 1 ) large enough, whereas the neighborhood N n ( n ) is asymptotically shrinking. Hereafter in this proof b n will be referred to as the restricted MLE, unless specied otherwise. Recall that n () =E` n (e y;), wheree y is an independent copy of y. In the GLM setup, we have ` n (e y;) =e y T X 1 T B(X) and n () = (Ee y T )X 1 T B(X). Part 1: Expansion of E n ( b n ). We approach the proof by splitting E n ( b n ) in the regionE and its complement, that is, E n ( b n ) =Ef n ( b n )1 E g +Ef n ( b n )1 E cg (5.16) =Ef n ( b n )1 E g +Ef[(E~ y) T (X b n ) 1 T B(X b n )]1 E cg; 80 where the second equality follows from the denition of n (). We aim to show that the second term on the right hand side of (5.16) is o(1). Perform- ing componentwise Taylor's expansion of B() around 0 and evaluating at X b n , we obtain B(X b n ) = B(0) +b 0 (0)X b n + r, where r = (r 1 ; ;r n ) T with r i = 2 1 b 00 ((X i ) i )(X b n ) 2 i and 1 ; ; n lying in the line segment joining b n and 0. Recall that b n is the con- strained MLE here, EY 2 i is bounded uniformly in i and n, and b 00 () =O(n 1 ) uniformly in its argument. The condition on b 00 () can be much weakened in many cases including linear and logistic regression models. This condition also accommodates Poisson regression where b 00 () = exp() for 2R since b() = exp(). Then it follows that Efj(Ee y) T X b n 1 T B(X b n )j1 E cgOfn logn +n +n 1+ 1 (logn) 2 gP (E c ) (5.17) Ofn 2( 1 +1) gP (E c ) =o(1) for suciently large n. The last inequality follows from the fact that > 2( 1 + 1) and we recall that P (E c ) = O(n ). To verify the orders, we note that the four bounds j(Ee y) T X b n j n max 1in (Ey 2 i ) 1=2 1 logn, j1 T B(0)j = O(n), jb 0 (0)1 T X b n j O(1)n 1 logn, andj1 T rjn max 1in jr i j nO(n 1 )( 1 logn) 2 . On the eventE, we rst expand n () around n;0 . By the denition of n;0 , n () attains its maximum at n;0 . By Taylor's expansion of n () around n;0 and evaluating at b n , we derive n ( b n ) = n ( n;0 ) 1 2 ( b n n;0 ) T A n ( )( b n n;0 ) (5.18) = n ( n;0 ) 1 2 ( b n n;0 ) T A n ( b n n;0 ) s n 2 = n ( n;0 ) 1 2 v T n [(C 1 n ) T A n C 1 n ]v n s n 2 ; 81 where A n () =@ 2 ` n (y;)=@ 2 , A n = A n ( n;0 ),s n = ( b n n;0 ) T [A n ( )A n ]( b n n;0 ), v n = C n ( b n n;0 ), and is on the line segment joining n;0 and b n . Then it follows that js n 1 E j = ( b n n;0 ) T (A n ( ) A n )( b n n;0 ) 1 E (5.19) = [B 1=2 n ( b n n;0 )] T [V n ( ) V n ][B 1=2 n ( b n n;0 )] 1 E kV n ( ) V n k 2 2 n d1 E ; where V n () = B 1=2 A n ()B 1=2 n and V n = V( n;0 ). Note that on the event E, by the convexity of the neighborhood N n ( n ) we have 2 N n ( n ). From Condition 4, max 1 ;; d 2Nn(n) k e V n ( 1 ; d )V n k 2 =O(d 1=2 n 1=2 ). Therefore we deduce thatE(s n 1 E ) is of order O(d 3=2 n 1=2 2 n ) =o(1), which follows from (5.26) in the proof of Theorem 2. 
From (5.25) in the proof of Theorem 2, we have the decomposition v n = u n + w n with u n = B 1=2 n X T (yEy) and w n = h e V n ( 1 ; ; d ) V n ih B 1=2 n ( b n n;0 ) i : For simplicity of notation, denote by R n = (C 1 n ) T A n C 1 n . Recall that C n = B 1=2 n A n . With some calculations we obtain E(u T n R n u n ) =Ef(yEy) T XA 1 n X T (yEy)g =Eftr(A 1 n X T (yEy)(yEy) T X)g = tr(A 1 n B n ): Note that E(u T n R n u n 1 E ) = E(u T n R n u n ) E(u T n R n u n 1 E c). From Theorem 1, we have P (E c )! 0 asn!1. Let n = tr(A 1 n B n )_ 1 ensuring that this quantity is bounded away from zero. We will apply Vitali's convergence theorem to show thatE(u T n R n u n 1 E c) =o( n ). To establish uniform integrability we use the following lemma, the proof of which has also been provided. Lemma 2. For some constant > 0, sup n Ej(u T n R n u n )= n j 1+ <1. 82 This leads to E(u T n R n u n 1 E c) =o( n ). Hence we have 1 2 E(u T n R n u n 1 E ) = 1 2 tr(A 1 n B n ) +o( n ): It remains to show that E[(w T n R n w n + 2w T n R n u n )1 E ] =o( n ): (5.20) Note that on the eventE, we have w T n R n w n =kR 1=2 n w n k 2 2 k e V n V n k 2 2 2 n dtr(A 1 n B n ): In view of the assumption max 1 ;; d 2Nn(n) k e V n ( 1 ; ; d )V n k 2 =O(d 1=2 n 1=2 ), it holds that E(w T n R n w n 1 E ) = o( n ). For the cross term w T n R n u n , applying the Cauchy-Schwarz inequality yields jE(w T n R n u n 1 E )jE(kR 1=2 n w n k 2 2 1 E ) 1=2 E(ku T n R 1=2 n k 2 2 ) 1=2 E[k e V n V n k 2 1 E n d 1=2 tr(A 1 n B n )]; which entails that E(w T n R n u n 1 E ) = o( n ). Note that Efj n ( n;0 )j1 E cg is of order o(1) by similar calculations as in (5.17). Thus combining (5.16) { (5.20) yields Ef n ( b n )g = n ( n;0 ) 1 2 tr(A 1 n B n ) +o( n ). Part 2: Expansion of E` n (y; n;0 ). Similarly we expand ` n (y;) around b n and evaluate at n;0 . From Condition 5, N n ( n ) M n ( 1 ) for suciently large n, we see that 83 n;0 2 M n ( 1 ). On the eventE, since ` n (y;) attains its maximum at the restricted MLE b n , we have ` n (y; n;0 ) =` n (y; b n ) 1 2 ( b n n;0 ) T A n ( )( b n n;0 ) (5.21) =` n (y; b n ) 1 2 ( b n n;0 ) T A n ( b n n;0 ) s n 2 =` n (y; b n ) 1 2 v T n [(C 1 n ) T A n C 1 n ]v n s n 2 : Then similarly as in Part 1, we can obtain Ef` n (y; n;0 )1 E g =Ef` n (y; b n )1 E g 1 2 tr(A 1 n B n ) +o( n ): If we can show that Efj` n (y; n;0 )j1 E cg and Efj` n (y; b n )j1 E cg are both of order o(1), then we obtain the desired asymptotic expansion Ef n ( b n )g =Ef` n (y; b n )g tr(A 1 n B n ) +o( n ): To see why Efj` n (y; n;0 )j1 E cg is of order o(1), we derive Efjy T X n;0 1 T B(X n;0 )j1 E cg O(n logn)P (E c ) 1=2 +Ofn +n logn +n 2+ 1 (logn) 2 gP (E c ) Ofn 2( 1 +1) gP (E c ) =o(1); similarly as in (5.17) and using E[jy T X n;0 j1 E c] E[jy T X n;0 j 2 ] 1=2 P (E c ) 1=2 . Similarly we can also show that Efj` n (y; b n )j1 E cg is of order o(1). The only dierence in the above derivation is to boundkX b n k 1 instead ofkX n;0 k 1 , which holds from the denition of the restricted QMLE. This concludes the proof. 84 5.2.3 Proof of Theorem 4 In view of the expansions of GAIC, GBIC, and GBIC p , we need to show that logj b H n j = logjH n j + o P (1) and tr( b H n ) = tr(H n ) + o P (1). To establish this we show that b H n = H n +o P (1=d), where theo P () denotes the convergence in probability of the matrix operator norm. Let M be a dd square matrix. Denote by tr(M) = tr(M)=d the normalized trace and (M) = max 1kd fj k (M)jg the spectral radius. 
Then we have jtr( b H n ) tr(H n )j =djtr( b H n H n )j d( b H n H n ) =dk b H n H n k 2 =o P (1); wherekk 2 denotes the matrix operator norm. The equality of the spectral radius and the operator norm follows from the symmetry of the matrix b H n H n . Similarly dene the normalized log determinant, that is, logjMj = (logjMj)=d for any arbitrary matrix M. Denote k () as the eigenvalues arranged in the increasing order. Then we have j logj b H n j logjH n jjdjlogj b H n j logjH n jj d max 1kd j log k ( b H n ) log k (H n )j d max 1kd log ( 1 + k ( b H n ) k (H n ) 1 ) : (5.22) Recall that we assume that the smallest and largest eigenvalues of both n 1 B n and n 1 A n are bounded away from 0 and1. It then follows that k (H n ) =O(1) and 1 k (H n ) =O(1) uniformly for all 1kd. An application of Weyl's theorem shows that j k ( b H n ) k (H n )j( b H n H n ) for each k. We have ( b H n H n ) =k b H n H n k 2 = o P (1=d). Hence the right hand side of (5.22) is o P (1). 85 Now we proceed to show that b H n = H n +o P (1=d). It suces to prove that n 1 b A n = n 1 A n +o P (1=d) and n 1 b B n = n 1 B n +o P (1=d). We use the following properties of the operator norm (Horn and Johnson, 1985): k(I d M) 1 k 2 1=(1kMk 2 ) ifkMk 2 < 1, kMNk 2 kMk 2 kNk 2 , andkM + Nk 2 kMk 2 +kNk 2 , where M and N aredd matrices. To see the suciency note that (n 1 b A n ) 1 (n 1 d b B n ) (n 1 A n ) 1 (n 1 dB n ) =(n 1 b A n ) 1 (n 1 d b B n ) (n 1 b A n ) 1 (n 1 dB n ) + (n 1 b A n ) 1 (n 1 dB n ) (n 1 A n ) 1 (n 1 dB n ): Then the desired result b H n = H n +o P (1=d) can be obtained by repeated application of the above properties of the operator norm. Part 1: Proven 1 b A n =n 1 A n +o P (1=d). From Theorem 1 we have,k(n 1 B n ) 1=2 ( b n n;0 )k 2 = O P f(n=d) 1=2 n g, which along with the assumption that the smallest eigenvalue of n 1 B n is bounded away from 0 entails b n = n;0 + O P f(n=d) 1=2 n g. Then it fol- lows from the Lipschitz assumption for n 1 A n () from Condition 7 in the neighborhood N n ( n ) and Theorem 1 that n 1 b A n = n 1 A n + o P (1=d), which holds for our choice of d =ofn (14r)=3 (logn) 2=3 g and n . Part 2: Prove n 1 b B n =n 1 B n +o P (1=d). We rst split n 1 b B n as n 1 b B n =n 1 X T diag nh y(X b n ) i h y(X b n ) io X = G 1 + G 2 + G 3 ; where G 1 =n 1 X T diagf(y(X n;0 )) (y(X n;0 ))gX; G 2 = 2n 1 X T diagf(y(X n;0 )) [(X n;0 )(X b n )]gX; G 3 =n 1 X T diagf[(X b n )(X n;0 )] [(X b n )(X n;0 )]gX: 86 We will state two lemmas before proceeding with the proof. Dene the sub-exponential norm of a sub-exponential random variable as kk 1 = sup m1 m 1 (Ejj m ) 1=m : Lemma 3. For independent sub-Gaussian random variablesfy i g n i=1 , we have that q 2 i = (y i Ey i ) 2 =Var(y i ) is sub-exponential with norm bounded by 2c 2 4 , where c 4 is as dened in the proof of Theorem 1. Moreover, the following Bernstein-type tail probability bound holds P j n i=1 a i q 2 i E[ n i=1 a i q 2 i ]jt 2 exp c 10 min t 2 4c 4 4 kak 2 2 ; t 2c 2 4 kak 1 for a2R n , t 0, and c 10 > 0. Lemma 4. For independent sub-Gaussian random variablesfy i g n i=1 with q i =fVar(y i )g 1=2 (y i Ey i ), the following tail probability bound holds Pfj n i=1 a i q i jtge exp c 11 t 2 c 2 4 kak 2 2 for a2R n , t 0, and c 11 > 0. Lemma 3 follows from Lemma 5.14 and Proposition 5.16 of Vershynin (2012). Note that here we dene the sub-exponential random variable as the square of a sub-Gaussian random variable and the bound on the norm follows by our previous observation thatkq i k 2 c 4 in the proof of Theorem 1. 
Lemma 4 rephrases Proposition 5.10 of Vershynin (2012) for the case wherekq i k 2 c 4 . Further split G 1 as G 1 = G 11 + G 12 + G 13 where G 11 =n 1 X T diagf(yEy) (yEy)gX; G 12 = 2n 1 X T diagf(yEy) [Ey(X n;0 )]gX; G 13 =n 1 X T diagf[Ey(X n;0 )] [Ey(X n;0 )]gX: 87 Note that EG 11 = n 1 B n and G 11 = n 1 n i=1 fx i x T i [y i Ey i ] 2 g = n i=1 A i q 2 i , where A i = n 1 Var(y i )x i x T i . Then it holds that for any positive t, P (kG 11 EG 11 k 2 t)P (kG 11 EG 11 k F t) d 2 max 1j;kd P (jG jk 11 EG jk 11 jt=d); (5.23) wherekk F denotes the matrix Frobenius norm and G jk 11 denotes the (j;k) entry of G 11 . Recall from Condition 7 thatkXk 1 =O(n 2 ) with 0 2 <r. Dene a jk i =n 1 Var(y i )x ij x ik and a jk = (a jk 1 ; ;a jk n ) T . We haveka jk k 2 2 = O(n 1 n 4 2 ). Then combining (5.23) with Lemma 3, we deduce P (dkG 11 EG 11 k 2 t)d 2 max 1j;kd P (jG jk 11 EG jk 11 jt=d 2 ) 2d 2 expfc 12 t 2 n 14 2 =d 4 g for some constant c 12 > 0. Note that d =ofn (14r)=4 g, we obtain G 11 =EG 11 +o P (1=d). By Condition 8 and Lemma 4, we have P (dkG 12 k 2 t)d 2 P (jG jk 12 jt=d 2 )ed 2 expfc 13 t 2 n 14 2 +(1 3 )=2 =d 4 g; where c 13 > 0 is some constant. Hence from d = ofn (14r)=4 g and 0 3 4(r 2 ), we have G 12 =o P (1=d). To show that G 13 =o(1=d), we derive kG 13 k 2 2 kn 1 n i=1 fx i x T i [Ey i [(X n;0 )] i ] 2 gk 2 F = 1j;kd [ n i=1 a jk i [Ey i [(X n;0 )] i ] 2 =Var(y i )] 2 n i=1 f[Ey i [(X n;0 )] i ] 2 =Var(y i )g 2 1j;kd ka jk k 2 2 ; where the last step follows from the component-wise Cauchy-Schwarz inequality. From Con- dition 8, G 13 = o(1=d). Combining the above derivations yields G 1 = EG 1 +o P (1=d) = 88 n 1 B n +o P (1=d). To see that G 2 = o P (1=d), note that (y(X n;0 )) i = (y i Ey i ) + (Ey i [(X n;0 )] i ) and apply similar arguments as above. By the Lipschitz Condition 7 in the neighborhood N n ( n ), we have G 3 =o P (1=d), which completes the proof. Now we discuss the proofs of Theorems 2 and 5{6, and technical lemmas. 5.3 Proofs of additional theorems 5.3.1 Proof of Theorem 2 Recall that C n = B 1=2 n A n . To establish the asymptotic normality of the QMLE b n , we prove the following D n C n ( b n n;0 ) D !N(0;I m ); (5.24) for any md matrix D n such that D n D T n =I m with m xed. From the score equation we have ( b n ) = X T [y(X b n )] = 0. From (2.7), it holds that X T [Ey(X n;0 )] = 0. For any 1 ; ; d 2R d , denote by e A n ( 1 ; ; d ) add matrix withj-th row the corre- sponding row of A n ( j ) for each j = 1; ;d, and matrix-valued function e V n ( 1 ; d ) = B 1=2 n e A n ( 1 ; ; d )B 1=2 n . Assuming the dierentiability of () and applying the mean- value theorem componentwise around n;0 , we obtain 0 = n ( b n ) = n ( n;0 ) e A n ( 1 ; ; d )( b n n;0 ) = X T (yEy) e A n ( 1 ; ; d )( b n n;0 ); where each of 1 ; ; d lies on the line segment joining b n and n;0 . It follows from this expansion that C n ( b n n;0 ) = u n + w n ; (5.25) 89 where u n = B 1=2 n X T (yEy) and w n = h e V n ( 1 ; ; d ) V n ih B 1=2 n ( b n n;0 ) i ; where V n = V n ( n;0 ) = B 1=2 n A n B 1=2 n . Therefore we have D n C n ( b n n;0 ) = D n u n + D n w n : By the Cram er-Wold theorem, it suces to show that for any unit vector a 2 R m , a T D n C n ( b n n;0 ) D ! N(0; 1). Further by Slutsky's lemma, it is sucient to show that a T D n u n D !N(0; 1) and a T D n w n =o P (1) for any unit vector a. Part 1 (Asymptotic normality of a T D n u n ): We will build on the conditions required to apply the Lyapunov central limit theorem (CLT). 
For an arbitrary unit vector a2R m , consider the asymptotic distribution of v n = a T D n u n = a T D n B 1=2 n X T (yEy) = n X i=1 z i ; where z i = a T D n B 1=2 n x i (y i Ey i ), i = 1; ;n, and X = (x 1 ; ; x n ) T . Since z i 's are independent and have mean zero, we derive Var(v n ) = n X i=1 Var(z i ) = a T D n B 1=2 n X T Cov(y)XB 1=2 n D T n a = a T D n B 1=2 n B n B 1=2 n D T n a = 1: From Condition 3, we have max 1in Ejy i Ey i j 3 M for some positive constant M and P n i=1 (x T i B 1 n x i ) 3=2 =o(1). Then an application of the Cauchy-Schwarz inequality yields n X i=1 Ejz i j 3 = n X i=1 ja T D n B 1=2 n x i j 3 Ejy i Ey i j 3 M n X i=1 ja T D n B 1=2 n x i j 3 M n X i=1 kD T n ak 3 2 kB 1=2 n x i k 3 2 =M n X i=1 (x T i B 1 n x i ) 3=2 ! 0; 90 noting thatkD T n ak 2 2 = a T D n D T n a = a T I m a = 1. Therefore by applying Lyapunov's CLT, we obtain a T D n u n = n X i=1 z i D !N(0; 1): Part 2 (To show a T D n w n is o(1) in probability): Conditional on the eventf b n 2 N n ( n )g and using the fact thatkD T n ak 2 = 1, we have ja T D n w n jkD T n ak 2 kw n k 2 kw n k 2 k e V n V n k 2 kB 1=2 n ( b n n;0 )k 2 k e V n V n k 2 d 1=2 n ; where the last step follows from the denition of the neighborhood N n ( n ) =f2 R d : k(n 1 B n ) 1=2 ( n;0 )k 2 (n=d) 1=2 n g and given thatf b n 2N n ( n )g. From Condition 4, max 1 ;; d 2Nn(n) k e V n ( 1 ; ; d ) V n k 2 =O(d 1=2 n 1=2 )O(dn 1=2 n ). Again conditional on the eventf b n 2N n ( n )g and noticing that each j dened previously for 1 j d lies in N n ( n ) due to its convexity, it holds that ja T D n w n j =O(d 3=2 n 1=2 2 n ) =o(1); (5.26) where we choose n =n r (c 0 logn) 1=2 as in the proof of Theorem 1 and d =ofn (14r)=3 (logn) 2=3 g with 0r< 1=4. Since the eventf b n 2N n ( n )g holds with probability tending to 1, a T D n w n =o P (1). Also note that the convergence to zero in probability is uniform in a and D n . Therefore, combining parts 1 and 2 nishes the proof. 91 5.3.2 Proof of Theorem 5 Throughout the proof we condition on the event e Q n =f b n 2N n ( n )g, whereN n ( n ) =f2 R d :k(n 1 B n ) 1=2 ( n;0 )k 2 (n=d) 1=2 n g, B n = X T Cov(Y)X, and b n is the unrestricted MLE. From Theorem 1 we have shown that as n!1, P ( e Q n )! 1: Recall from (2.20) that ` n (y;) = ` n (y;)` n (y; b n ). Then the maximum value zero of this function is attained at = b n . It follows from (2.9) that @ 2 ` n (y;)=@ 2 =A n (); where A n () = X T (X)X. By Taylor's expansion of the likelihood function ` n (y;) around b n in the new neighborhood e N n ( n ) = f 2 R d : k(n 1 B n ) 1=2 ( b n )k 2 (n=d) 1=2 n g, we derive ` n (y;) = 1 2 ( b n ) T @ 2 ` n (y; )=@ 2 ( b n ) (5.27) = n 2 T V n ( ); where lies on the line segment joining and b n , =n 1=2 B 1=2 n ( b n ), and V n () = B 1=2 n A n ()B 1=2 n . Since b n 2 e N n ( n ), by the convexity of the neighborhood e N n ( n ) we have 2 e N n ( n ). Also note that conditional on the event e Q n , it holds that e N n ( n ) N n (2 n ). We dene n ( n ) = max 2Nn(2n) maxfj min (V n () V n )j;j max (V n () V n )jg with V n = V n ( n;0 ). Using Taylor's expansion (5.27) over the region e N n ( n ), we obtain q 1 ()1 e Nn(n) ()n 1 ` n (y;)1 e Nn(n) ()q 2 ()1 e Nn(n) (); (5.28) 92 where q 1 () = 1 2 T [V n n ( n )I d ] and q 2 () = 1 2 T [V n + n ( n )I d ]. Dene U n () = exp [n 1 ` n (y;)] which takes values in the interval [0; 1] by deni- tion. From Condition 2, for n large, min 2Nn(n) min fV n ()g > c 1 n r with 0 < r < 1=4 and n ( n ) = ofn (1r)=3 g. 
Since n;0 belongs to N n ( n ), this assumption yields n ( n ) min (V n )=2 for suciently large n. To see this, note that since (1r)=3 > r we have n ( n )n r = o(1) whereas min (V n )n r > c 1 . Consider the linear transformation h() = (n 1 B n ) 1=2 . For suciently large n, we obtain E M [e nq 2 () 1 e Nn(n) ()]E M [U n () n 1 e Nn(n) ()] (5.29) E M [e nq 1 () 1 e Nn(n) ()]; where M denotes the prior distribution onh()2R d for modelM. Before proceeding with the proof we state a few useful lemmas. From Condition 6, the prior density relative to the Lebesgue measure 0 onR d , (h()) = d M d 0 (h()), satises inf 2Nn(2n) (h())c 2 and sup 2R d (h())c 3 ; (5.30) where c 2 and c 3 are some positive constants. Lemma 5. Under (5.30), for j = 1; 2; we have c 2 Z 2R d e nq j 1 e Nn(n) d 0 E M h e nq j 1 e Nn(n) i c 3 Z 2R d e nq j 1 e Nn(n) d 0 : (5.31) Lemma 6. Conditional on the event e Q n , for suciently large n we have E M [U n () n 1 e N c n (n) ] expf[ n n ( n )=2]d 2 n g (5.32) exp[( n =2)d 2 n ]; where n = min (V n )=2. 93 Lemma 7. It holds that Z 2R d e nq 1 d 0 = 2 n d=2 jV n n ( n )I d j 1=2 (5.33) and Z 2R d e nq 2 d 0 = 2 n d=2 jV n + n ( n )I d j 1=2 : (5.34) Lemma 8. For j = 1; 2, it holds that Z 2R d e nq j 1 e N c n (n) d 0 2 n n d=2 exp h ( p n d 2 n p d) 2 =2 i (5.35) Now we proceed with the proof. From Condition 2, n = n r (c 0 logn) 1=2 . Then the expression in (5.32) converges to zero faster than any polynomial rate in n. Let us rewrite the right hand side of (5.35) as exp d 2 ( p n 2 n 1) 2 + d 2 [log(2) log(n n )] ; which converges to zero faster than any polynomial rate in n. From Condition 2, d = ofn (14r)=3 (logn) 2=3 g with 0<r< 1=4. Then it follows that jV n n ( n )I d j 1=2 =jV n j 1=2 jI d n ( n )V 1 n j 1=2 =jV n j 1=2 f1 +O[ n ( n )tr(V 1 n )]g =jV n j 1=2 f1 +O[ n ( n )d 1 min (V n )]g =jV n j 1=2 [1 +o(1)]: 94 Combining Lemmas 5{8 yields logE M [U n () n ] = log ( 2 n d=2 jV n j 1=2 [1 +o(1)] ) + logc n = logn 2 d + 1 2 logjA 1 n B n j + log(2) 2 d + logc n +o(1); where c n 2 [c 2 ;c 3 ]. This completes the proof. 5.3.3 Proof of Theorem 6 The proof follows from the proof of Theorem 5 and Part 1, that is, expansion of E n ( b n ), of the proof of Theorem 3. 5.3.4 Proof of Theorem 7 Consider the expression GBIC p (M m )GBIC p (M 1 ) = 2T 11 2T 12 2T 13 + 2a n (jM m jjM 1 j) +T 31 +T 32 T 41 T 42 ; where, T 11 :=f` n (y; b n;m )` n (y; n;m;0 )gf` n (y; b n;1 )` n (y; n;1;0 )g; T 12 :=f` n (y; n;m;0 )E y ` n (y; n;m;0 )gf` n (y; n;1;0 )E y ` n (y; n;1;0 )g; T 13 :=E y ` n (y; n;m;0 )E y ` n (y; n;1;0 ); T 31 :=ftrH( b n;m ) trH( n;m;0 )gftrH( b n;1 ) trH( n;1;0 )g; T 32 := trH( n;m;0 ) trH( n;1;0 ); T 41 :=flogjH( b n;m )j logjH( n;m;0 )jgflogjH( b n;1 )j logjH( n;1;0 )jg; T 42 := logjH( n;m;0 )j logjH( n;1;0 )j; 95 where H( n;m;0 ) = A 1 n ( n;m;0 )B n . From proof of Theorem 4, we have both T 31 ;T 41 = (jM m jjM 1 j)o P (1). Using Lemma 9, we have T 11 = (jM m jjM 1 j)O p (n r logn), where r2 [0; 1=4). Using Lemma 10, we obtain T 12 = (jM m jjM 1 j)O p (n 1=2 log 3=2 n) with r2 (0; 1=4). Dene n := inf supp(Mm)6=supp(M 1 ) jsupp(Mm)jKn T 13 = inf supp(Mm)6=supp(M 1 ) jsupp(Mm)jKn fI(g n ;f n (; n;m;0 ))I(g n ;f n (; n;1;0 ))g: Assume n > 0, since n;1;0 is the true support. Also assumejT 32 j = (jM m jjM 1 j)O(1) and T 42 = (jM m jjM 1 j)o(1). 
Then we obtain, GBIC p (M m )GBIC p (M 1 ) = 2T 13 + (jM m jjM 1 j)(a n O p (n r logn) O p (n 1=2 log 3=2 n) +o p (1)) >2 n jM 1 ja n d n O p (n 1=2 log 3=2 n) = n f2K n 1 n [a n +O p (n 1=2 log 3=2 n)]g > n f2K n 1 n a n g; when a n = (logp) and logp = (n ), when > 1=2. Assume n is growing so that a n =o(K 1 n n ). 5.4 Proofs of lemmas Lemmas 3 and 4 have been discussed in the paragraph following them. The proofs of Lemmas 5{7 can be found in Lv and Liu (2014). 96 5.4.1 Proof of Lemma 1 1. From the condition on sub-exponential tails, we derive Ejq i j m =m Z 1 0 x m1 P (jq i jx)dxHm Z 1 0 x m1 exp(x=H)dx =HmH m Z 1 0 u m1 exp(u)du =HmH m (m)HmH m m m ; where the last line follows from the denition of the Gamma function. Taking them-th root, we have (Ejq i j m ) 1=m (Hm) 1=m Hm: Rewriting we obtain m 1 (Ejq i j m ) 1=m m 1=m H 1=m He 1=e (H_ 1)H since m 1. Therefore, it holds thatkq i k 1 c 4 for all i, where c 4 =e 1=e H(H_ 1). 2. These proofs can be found in literature (for e.g., see Lemma 8.3 of Erd} os et al. (2012)). 5.4.2 Proof of Lemma 2 From the expression of u T n R n u n , we have u T n R n u n =(yEy) T XA 1 n X T (yEy) =[(yEy) T Cov(y) 1=2 ][Cov(y) 1=2 XA 1 n X T Cov(y) 1=2 ] [Cov(y) 1=2 (yEy)]: 97 Denote by S n = Cov(y) 1=2 XA 1 n X T Cov(y) 1=2 and q = Cov(y) 1=2 (yEy). We decompose u T n R n u n into two terms, the summations of the diagonal entries and the o-diagonal entries, respectively, u T n R n u n = n X i=1 s ii q 2 i + X 1i6=jn s ij q i q j ; where s ij denotes the (i;j)-entry of S n . Then we have E(u T n R n u n ) 2 = n X i=1 s 2 ii E(q 4 i ) + X 1i6=jn s ii s jj E(q 2 i )E(q 2 j ) + X 1i6=jn s 2 ij E(q 2 i )E(q 2 j ): Using the sub-Gaussian norm bound c 4 , both quantities E(q 4 i ) and E(q 2 i )E(q 2 j ) can be uni- formly bounded by a common constant. Hence E(u T n R n u n ) 2 O(1)f[tr(S n )] 2 + tr(S 2 n )g: Since S n is positive semidenite it holds that tr(S 2 n ) [tr(S n )] 2 . Finally noting that tr(S n ) = tr(A 1 n B n ), we see that sup n Ej(u T n R n u n )= n j 1+ <1 for = 1. 5.4.3 Proof of Lemma 8 From the denition of q j () for j = 1; 2, we derive exp(nq j ) = exp((n=2) T [V n n ( n )I d ]) exp(n( n n ( n )=2) T ) exp((n n )=2 T ): (5.36) 98 Then we have Z 2R d e nq j 1 e N c n (n) d 0 Z 2R d e nn 2 T 1 e N c n (n) d 0 = 2 n n d=2 P (k(n n ) 1=2 Zk 2 2 (n=d) 1 2 n ) = 2 n n d=2 P (kZk 2 2 n d 2 n ); where ZN(0;I d ). Using the chi-square tail bound, that is, for any positivex it is known thatP (kZk 2 2 d 2 p dx+2x) exp(x) and after minor modication it holds thatP (kZk 2 2 ( p d+ p 2x) 2 ) exp(x). With this observation, dene x = ( p n d 2 n p d) 2 =2 and the proof concludes. 5.4.4 Lemma for overtted models Lemma 9. (Bound on dierence of log-likelihood) Under Conditions 1-2, we have ` n (y; b n;m )` n (y; n;m;0 ) =jM m jO p (n r logn). Proof of Lemma 9. In the proof we remove the dependence on m for b n;m and n;m;0 for notational simplicity. From Theorem 1, b n 2N n ( n ) with probability 1O(n ), for large positive constant. Further, following some steps of the proof: Applying Taylor's expansion to the log-likelihood function ` n (y;) around n;0 , we obtain ` n (y;)` n (y; n;0 ) = ( n;0 ) T n ( n;0 ) 1 2 ( n;0 ) T A n ( )( n;0 ); where is on the line segment joining and n;0 and n ( n;0 ) = X T [y(X n;0 )]. By letting u = 1 n d 1=2 B 1=2 n ( n;0 ), the above Taylor's expansion can be rewritten as ` n (y;)` n (y; n;0 ) =d 1=2 n u T B 1=2 n n ( n;0 )d 2 n u T V n ( )u=2; (5.37) where V n () = B 1=2 n A n ()B 1=2 n . 
99 From here on condition on the event that b n 2 N n ( n ). Note that2 N n ( n ) implies 2N n ( n ) since N n ( n ) is convex. Also it is clear that max kuk 2 =1 u T B 1=2 n n ( n;0 ) =kB 1=2 n n ( n;0 )k 2 : (5.38) From Condition 2, for n suciently large, min 2Nn(n) min fV n ()g>c 1 n r where 0<r< 1=4. Using this condition and since 2N n ( n ), it holds that min kuk 2 =1 u T V n ( )u min 2Nn(n) min fV n ()g>c 1 n r : (5.39) We derive that ` n (y;)` n (y; n;0 ) =fd 1=2 n (u T u) 1=2 g[u T =(u T u) 1=2 ]B 1=2 n n ( n;0 )fd 2 n u T ug: [u T =(u T u) 1=2 ]V n ( )[u=(u T u) 1=2 ]=2 fd 1=2 n (u T u) 1=2 gkB 1=2 n n ( n;0 )k 2 fd 2 n u T ug2 1 c 1 n r : By 2.7, we have X T [Ey(X n;0 )] = 0. Hence, n ( n;0 ) = X T [y(X n;0 )] = X T (y Ey). Denote by W = B 1=2 n n ( n;0 ) = B 1=2 n X T (yEy). Recall that P (kWk 2 2 1 c 1 n r d 1=2 n ) 1O(n ): Hence from here on we suppose thatfkWk 2 2 1 c 1 n r d 1=2 n g holds. Then we derive ` n (y;)` n (y; n;0 ) =fd 1=2 n (u T u) 1=2 gkWk 2 fd 2 n u T ug2 1 c 1 n r d2 1 c 1 n r 2 n (u T u) 1=2 [1 (u T u) 1=2 ]: From the denition of u,2N n ( n ) is equivalent tokuk 2 1. Therefore, ` n (y; b )` n (y; n;0 ) sup 2Nn(n) f` n (y;)` n (y; n;0 )gd2 3 c 1 n r 2 n : 100 From our choice of n = n r (c 0 logn) 1=2 , hence we obtain that ` n (y; b n;m )` n (y; n;m;0 ) = jM m jO p (n r logn), wherejM m j =d here. 5.4.5 Lemma for undertted models. Lemma 10. (Deviation of log-likelihood) Under Conditions 1, 3, and 5 we have ` n (Y; n;m;0 )E Y ` n (Y; n;m;0 ) =jM m jO p (n 1=2 log 3=2 n), for r2 (0; 1=4). Proof of Lemma 10. We have, ` n (Y; n;m;0 )E Y ` n (Y; n;m;0 ) =[YE Y Y] T Cov(Y) 1=2 Cov(Y) 1=2 X n;m;0 = X i a i q i ; wherea i = Var(Y i ) 1=2 (X n;m;0 ) i andq i := [Y i E Y Y i ] T Cov(Y i ) 1=2 . Hence under Conditions 3 and 5kak 2 2 O(1)n 2 1 log 2 n. Then by Lemma 3, and by choosing t :=jM m jn 1=2 log 3=2 n, we have that, P (j` n (Y; n;m;0 )E Y ` n (Y; n;m;0 )jjM m jn 1=2 log 3=2 n) e exp c 11 jM m j 2 logn c 2 4 O(1) : 101 Chapter 6 Proofs of Results in Chapter 3 In this chapter we prove all the technical results presented in chapter 3. 6.1 Proof of Theorem 8 Proof of Part (i). To show that wFDR( OR ) =, we only need to establish that E U;a a a;b b b;X X X ( m X i=1 a i i OR (Lfdr i ) ) = 0; where the notationE U;a a a;b b b;X X X denotes that the expectation is taken overU;a a a;b b b, andX X X. Accord- ing to the denitions of the capacity function C() and threshold t , we have m X i=1 a i i OR (Lfdr i ) =C(k) +I(U <p )fC(k + 1)C(k)g: It follows from the denition of p that E Uja a a;b b b;X X X ( m X i=1 a i i OR (Lfdr i ) ) =C(k) +fC(k + 1)C(k)gp = 0; where the notation E Uja a a;b b b;X X X indicates that the expectation is taken over U while holding (a a a;b b b;X X X) xed. Therefore E U;a a a;b b b;X X X ( m X i=1 a i i OR (Lfdr i ) ) = 0; (6.1) and the desired result follows. 102 Proof of Part (ii). Let be an arbitrary decision rule such that wFDR( ). It follows that E a a a;b b b;X X X ( m X i=1 a i E( i ja a a;b b b;x x x)(Lfdr i ) ) 0: (6.2) The notation E( i jx x x;a a a;b b b) means that the expectation is taken to average over potential randomization conditional on the observations and weights. LetI + =fi : i OR E( i jx x x;a a a;b b b)> 0g andI =fi : i OR E( i jx x x;a a a;b b b)< 0g. Fori2I + , we have i OR = 1 and hence b i (1 Lfdr i ) t a i (Lfdr i ). Similarly for i2I , we have i OR = 0 and so b i (1 Lfdr i )t a i (Lfdr i ). 
Thus X i2I + [I i OR E( i jx x x;a a a;b b b) fb i (1 Lfdr i )t a i (Lfdr i )g 0: Note that i OR is perfectly determined byX X X except for (k + 1)th decision. Meanwhile, b (k+1) 1 Lfdr (k+1) t a (k+1) Lfdr (k+1) = 0 by our choice of t . It follows that E a a a;b b b;X X X " m X i=1 E( i OR jx x x;a a a;b b b)E( i jx x x;a a a;b b b) fb i (1 Lfdr i )t a i (Lfdr i )g # 0: (6.3) Recall that the power function is given by ETP( ) =E ( m X i=1 E( i jx x x;a a a;b b b)b i (1 Lfdr i ) ) for any decision rule . Combining equations (6.1) { (6.3) and noting that t > 0, we claim that ETP( OR ) ETP( ) and the desired result follows. 103 6.2 Proof of Theorem 9 6.2.1 Notations We rst recall and dene a few useful notations. LetI A be an indicator function, which equals 1 if event A occurs and 0 otherwise. Let N i =a i (Lfdr i ); b N i =a i ( d Lfdr i ); R i = a i (Lfdr i ) b i (1 Lfdr i ) +a i jLfdr i j ; b R i = a i ( d Lfdr i ) b i (1 d Lfdr i ) +a i j d Lfdr i j ; Q(t) = 1 m m X i=1 N i I R i t and b Q(t) = 1 m m X i=1 b N i I b R i t for t2 [0; 1]: Note that Q(t) and b Q(t), the estimates for oracle and data driven capacities, are non- decreasing and right-continuous. We can further dene OR = infft2 [0; 1] :Q(t) 0g and b = infft2 [0; 1] : b Q(t) 0g: (6.4) Next we construct a continuous version ofQ() for later technical developments. Specically, for 0R (k) <tR (k+1) , let Q c (t) =f1r(t)gQ R (k) +r(t)Q R (k+1) ; where c indicates \continuous" and r(t) = (tR (k) )=(R (k+1) R (k) ). Let R (m+1) = 1 and N (m+1) = 1. Similarly we can dene a continuous version of b Q(t). For 0 b R (k) <t b R (k+1) , let b Q c (t) = [1b r(t)] b Q( b R (k) ) +b r(t) b Q( b R (k+1) ); 104 withb r(t) = (t b R (k) )=( b R (k+1) b R (k) ). Now the inverses of Q c (t) and b Q c (t) are well dened; denote these inverses by Q c;1 (t) and b Q c;1 (t), respectively. By construction, it is easy to see that I R i OR =I R i Q c;1 (0) andI b R i b =I b R i b Q c;1 (0) : 6.2.2 A useful lemma We rst state and prove a lemma that contains some key facts to prove the theorem. Lemma 11. Assume that Conditions 1-3 hold. For any t2 [0; 1], we have (i) E b N i I [ b R i t] N i I [R i t] 2 =o(1), (ii) E n b N i I [ b R i t] N i I [R i t] b N j I [ b R j t] N j I [R j t] o =o(1), and (iii) b Q c;1 (0)Q c;1 (0) p ! 0. Proof of Part (i). We rst decompose E b N i I [ b R i t] N i I [R i t] 2 into three terms: E b N i I [ b R i t] N i I [R i t] 2 = E[( b N i N i ) 2 I b R i t;R i t ] +E[ b N 2 i I b R i t;R i >t ] +E[N 2 i I b R i >t;R i t ]: (6.5) Next we argue below that all three terms are of o(1). First, it follows from the denitions of b N i and N i that E n ( b N i N i ) 2 I b R i t;R i t o =E a 2 i Lfdr i d Lfdr i 2 I b R i t;R i t E a 2 i Lfdr i d Lfdr i 2 : By an application of Cauchy-Schwarz inequality, we have E a 2 i Lfdr i d Lfdr i 2 E(a 4 i ) 1=2 E Lfdr i d Lfdr i 4 1=2 : 105 It follows from Condition 12 that E(a 4 i ) = O(1). To show E Lfdr i d Lfdr i 4 = o(1), note that both Lfdr i and d Lfdr i are in [0; 1]. Hence E Lfdr i d Lfdr i 4 EjLfdr i d Lfdr i j. Using the fact that Lfdr i d Lfdr i =o P (1), the uniform integrability for bounded random variables, and the Vitali convergence theorem, we conclude that EjLfdr i d Lfdr i j = o(1). Therefore, the rst term in (6.5) is of o(1). Next we show thatE b N 2 i I b R i t;R i >t =o(1). 
Applying Cauchy-Schwarz inequality again, we have E b N 2 i I b R i t;R i >t (1) 2 E(a 4 i ) 1=2 n P b R i t;R i >t o 1=2 : Condition 12 implies that E(a 4 i ) = O(1); hence we only need to show that P ( b R i t;R i > t) =o(1). Let > 0 be a small constant. Then P ( b R i t;R i >t) =P b R i t;R i 2 (t;t +] +P b R i t;R i >t + P (R i 2 (t;t +]) +P (j b R i R i j>): SinceR i is a continuous random variable, we can nd t > 0 such thatP (R i 2 (t;t+])<"=2 for a given". For this xed t > 0, we can show thatP (j b R i R i j> t )<"=2 for suciently large n. This follows from Lfdr i d Lfdr i = o P (1) and the continuous mapping theorem. Similar argument can be used to prove that E[N 2 i I b R i >t;R i t ] = o(1), hence completing the proof of part (i). Proof of Part (ii). As X i and X j are identically distributed and our estimates are invariant to permutation, we have E n ( b N i I [ b R i t] N i I [R i t] )( b N j I [ b R j t] N j I [R j t] ) o E b N i I [ b R i t] N i I [R i t] 2 : The desired result follows from part (i). Proof of Part (iii). Dene Q 1 (t) = E(N i I R i t ), where the expectation is taken over (a a a;b b b;X X X; ). Let 1 = infft2 [0; 1] :Q 1 (t) 0g: 106 We will show that (i) Q c;1 (0) p ! 1 and (ii) b Q c;1 (0) p ! 1 . Then the desired result b Q c;1 (0)Q c;1 (0) p ! 0 follows from (i) and (ii). Fixt2 [0; 1]. By Condition 12 and WLLN, we have thatQ(t) p !Q 1 (t). SinceQ c;1 () is continuous, for any"> 0, there exists a> 0 such thatjQ c;1 (Q 1 ( 1 ))Q c;1 (Q c ( 1 ))j< " wheneverjQ 1 ( 1 )Q c ( 1 )j<. It follows that PfjQ 1 ( 1 )Q c ( 1 )j>g (6.6) P jQ c;1 (Q 1 ( 1 ))Q c;1 (Q c ( 1 ))j>" = P jQ c;1 (0) 1 j>" : (6.7) Equation (6.7) holds since Q 1 ( 1 ) = 0 by the continuity of R i , and Q c;1 (Q c ( 1 )) = 1 by the denition of inverse. Therefore we only need to show that for any t2 [0; 1];Q c (t) p ! Q 1 (t). Note that EjQ(t)Q c (t)j E(sup i a i ) m ! 0; by Condition 12. Using Markov's inequality, Q(t)Q c (t) p ! 0. Following from Q(t) p ! Q 1 (t), we have Q c (t) p ! Q 1 (t). Therefore (6.6) and hence (6.7) goes to 0 as m!1, establishing the desired result (i) Q c;1 (0) p ! 1 . To show result (ii) b Q c;1 (0) p ! 1 , we can repeat the same steps. In showingQ c;1 (0) p ! 1 , we only used the facts that (a) Q(t) p ! Q 1 (t), (b) Q c;1 () is continuous, and (c) Q(t)Q c (t) p ! 0. Therefore to prove b Q c;1 (0) p ! 1 , we only need to check whether the same conditions (a) b Q(t) p !Q 1 (t), (b) b Q c;1 () is continuous, and (c) b Q(t) b Q c (t) p ! 0 still hold. It is easy to see that (b) holds by denition, and (c) holds by noting that Ej b Q(t) b Q c (t)j E(sup i a i ) m ! 0: The only additional result we need to establish is (a). 107 Previously, we have shown that Q(t) p ! Q 1 (t). Therefore the only additional fact that we need to establish is thatj b Q(t)Q(t)j p ! 0. Now consider the following quantity: Q =f b Q(t)Q(t)g [Ef b Q(t)gEfQ(t)g]: (6.8) By repeating the steps of part (i) we can show that jEf b Q(t)gEfQ(t)gj =jE(N i I R i t )E( b N i I b R i t )j! 0: (6.9) By denition, Q = m 1 P m i=1 f b N i I [ b R i t] N i I [R i t] g [E( b N i I b R i t )E(N i I R i t )]: For an application of WLLN for triangular arrays (see, for e.g., Theorem 6.2 of Billingsley, 1991), we need to show that var( m X i=1 f b N i I [ b R i t] N i I [R i t] g)=m 2 ! 
0: Using the result in Part (i) we deduce that, m 2 Var ( m X i=1 b N i I [ b R i t] N i I [R i t] ) m 2 E ( m X i=1 b N i I [ b R i t] N i I [R i t] ) 2 = 1 1 m E n b N i I [ b R i t] N i I [R i t] b N j I [ b R j t] N j I [R j t] o + 1 m E b N i I [ b R i t] N i I [R i t] 2 =o(1): It follows from the WLLN for triangular arrays thatjQj p ! 0. Combining (6.8) and (6.9), we conclude thatj b Q(t)Q(t)j p ! 0, which completes the proof. 108 6.2.3 Proof of Theorem 9 Proof of Part (i). Consider the oracle and data driven thresholds OR and b dened in Equation (6.4) in section 6.2.1. The wFDRs of the oracle and data-driven procedures are wFDR OR = E P i a i (1 i ) i OR E P i a i i OR ; wFDR DD = E P i a i (1 i )I b R i b E P i a i I b R i b : Making the randomization explicit, the wFDR of the oracle procedure is wFDR OR = E m 1 P i a i (1 i )I R i OR +m 1 a i (1 i ) i OR E m 1 P i a i I R i OR +m 1 a i i OR ; where i indicates the randomization point in a realization. Note that both Efa i (1 i ) i OR =mg and Efa i i OR =mg are bounded by E(a i =m). Hence by Condition 12 both quantities are of o(1). From the discussion in section 6.2.1,I R i OR =I R i Q c;1 (0) andI b R i b =I b R i b Q c;1 (0) . According to Part (iii) of Lemma 11, we havef b R i b Q c;1 (0)gfR i Q c;1 (0)g = o P (1). Following the proof of Lemma 11 that E n a i (1 i )I b R i b Q c;1 (0)0 o =E a i (1 i )I R i Q c;1 (0)0 +o(1): (6.10) It follows thatm 1 Ef P i a i (1 i )I b R i b g!m 1 Ef P i a i (1 i )I R i OR g: Similarly, we can show that E a i I b R i b Q c;1 (0)0 =E a i I R i Q c;1 (0)0 +o(1): (6.11) 109 Further from Condition 12 the quantity m 1 E( P i a i I R i OR ) is bounded away from zero. To see this, note that Condition 1 implies that a i is independent of Lfdr i . It follows that m 1 E m X i=1 a i I R i OR ! =E (a i I R i OR )ce p > 0; (6.12) whereP (Lfdr(X))e p for somee p 2 (0; 1] for the choice of the nominal level2 (0; 1) and X, an i.i.d copy of X i . This holds for any non-vanishing . (Note that all hypotheses with Lfdr i < will be rejected automatically). Therefore we conclude that wFDR DD = wFDR OR +o(1) = +o(1): Proof of Part (ii). The quantity m 1 ETP DD is dened as m 1 E b i i I [ b R i b ] . Making the randomization explicit, we have m 1 ETP OR =E 1 m X i b i i I [R i OR ] + 1 m b i i i OR ! ; where i indicates the randomized point. By Condition 12 0m 1 E b i i i OR Eb i m E sup i b i m =o(1): From the discussion in section 6.2.1,I R i OR =I R i Q c;1 (0) andI b R i b =I b R i b Q c;1 (0) . Repeat- ing the steps in proving the wFDR, we can show that E b i i I [ ^ R i ^ ] =E b i i I [R i OR ] +o(1): Finally, it is easy to show that E b i i I [R i OR ] c(1)e p ; which is bounded below by a nonzero constant. Here the positive constantc is as dened in Condition 12 ande p is dened immediately after (6.12). We conclude that ETP DD =ETP OR = 1 +o(1) and the proof is complete. 110 6.3 Proofs of propositions 6.3.1 Proof of Proposition 2 Let > 0 be the relative cost of a false positive to a false negative. 
6.3 Proofs of propositions

6.3.1 Proof of Proposition 2

Let $\lambda > 0$ be the relative cost of a false positive to a false negative. Consider the following weighted classification problem with loss function
$$L_{\boldsymbol a,\boldsymbol b}(\boldsymbol\theta, \boldsymbol\delta) = \sum_{i=1}^m \big\{\lambda a_i(1-\theta_i)\delta_i + b_i\theta_i(1-\delta_i)\big\}. \quad (6.13)$$
We aim to find the $\boldsymbol\delta$ that minimizes the posterior loss $E_{\boldsymbol\theta\mid\boldsymbol X}\{L_{\boldsymbol a,\boldsymbol b}(\boldsymbol\theta,\boldsymbol\delta)\}$:
$$\operatorname*{arg\,min}_{\boldsymbol\delta} \sum_{i=1}^m \big\{\lambda a_i P(\theta_i=0 \mid X_i)\delta_i + b_i P(\theta_i=1\mid X_i)(1-\delta_i)\big\} = \operatorname*{arg\,min}_{\boldsymbol\delta} \sum_{i=1}^m \big\{\lambda a_i P(\theta_i=0\mid X_i) - b_i P(\theta_i=1\mid X_i)\big\}\delta_i.$$
Therefore the optimal decision rule $\boldsymbol\delta^{\mathrm{PF}} = (\delta_i^{\mathrm{PF}} : i = 1,\ldots,m)$ is of the form
$$\delta_i^{\mathrm{PF}} = I\left\{\frac{\lambda\, a_i P(\theta_i = 0\mid X_i)}{b_i P(\theta_i = 1\mid X_i)} < 1\right\}, \quad (6.14)$$
which reduces to the test statistic defined in (3.20). Next note that $Q^{\mathrm{PF}}(t)$ is a continuous and increasing function of $t$. Therefore we can find $t^{\mathrm{PF}}$ such that $Q^{\mathrm{PF}}(t^{\mathrm{PF}})$ equals the prescribed PFER level. For an arbitrary decision rule $\boldsymbol\delta$ in the constrained class $\mathcal D$, we must have $\mathrm{ETP}(\boldsymbol\delta) \le \mathrm{ETP}(\boldsymbol\delta^{\mathrm{PF}})$. Suppose not; then there exists $\boldsymbol\delta^* \in \mathcal D$ such that $\mathrm{PFER}(\boldsymbol\delta^*) = \mathrm{PFER}(\boldsymbol\delta^{\mathrm{PF}})$ and $\mathrm{ETP}(\boldsymbol\delta^{\mathrm{PF}}) < \mathrm{ETP}(\boldsymbol\delta^*)$. Consider a weighted classification problem with $\lambda = 1/t^{\mathrm{PF}}$. Then we can show that $\boldsymbol\delta^*$ has a smaller classification risk than $\boldsymbol\delta^{\mathrm{PF}}$, contradicting the optimality of $\boldsymbol\delta^{\mathrm{PF}}$ established in (6.14). Therefore we must have $\mathrm{ETP}(\boldsymbol\delta) \le \mathrm{ETP}(\boldsymbol\delta^{\mathrm{PF}})$.

6.3.2 Proof of Proposition 3

Proof of Part (i). For convenience of notation, define $S_i = 1/\mathrm{VCR}_i$. We show that rankings by increasing values of $R_i$ and of $S_i$ are the same. If $i \in \mathcal S^+$, then all values are positive. Sorting by increasing $S_i$ is the same as sorting by decreasing $(1/S_i) + 1$, and hence by increasing $1/\{(1/S_i) + 1\}$, which is precisely sorting by increasing $R_i$. If $i \in \mathcal S^-$, then all values are negative. Sorting by increasing $S_i$ is the same as sorting by decreasing $(1/S_i) - 1$, and hence by increasing $1/\{(1/S_i) - 1\}$, which is again the same as sorting by increasing $R_i$. The desired result follows.

Proof of Part (ii). The result follows directly from the facts that (a) $R_i$ is negative when $i \in \mathcal S^-$ and (b) $R_i$ is positive if $i \in \mathcal S^+$.

References

[1] Akaike, H. (1973). Information Theory and an Extension of the Maximum Likelihood Principle. In Second International Symposium on Information Theory (eds. B. N. Petrov and F. Csaki), Akademiai Kiado, Budapest, 267-281.
[2] Akaike, H. (1974). A New Look at the Statistical Model Identification. IEEE Trans. Auto. Control, 19, 716-723.
[3] Barras, L., Scaillet, O., and Wermers, R. (2010), "False Discoveries in Mutual Fund Performance: Measuring Luck in Estimated Alphas," The Journal of Finance, 65, 179-216.
[4] Benjamini, Y., and Heller, R. (2007), "False Discovery Rates for Spatial Signals," Journal of the American Statistical Association, 102, 1272-1281.
[5] Benjamini, Y., and Hochberg, Y. (1995), "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," Journal of the Royal Statistical Society, Series B, 57, 289-300.
[6] Benjamini, Y., and Hochberg, Y. (1997), "Multiple Hypotheses Testing with Weights," Scandinavian Journal of Statistics, 24, 407-418.
[7] Billingsley, P. (1991), Probability and Measure (2nd ed.), New York: John Wiley & Sons.
[8] Box, G. E. P. (1950), "Problems in the Analysis of Growth and Wear Curves," Biometrics, 6, 362-389.
[9] Bozdogan, H. (1987). Model Selection and Akaike's Information Criterion (AIC): The General Theory and its Analytical Extensions. Psychometrika, 52, 345-370.
[10] Cai, T. T., and Sun, W. (2009), "Simultaneous Testing of Grouped Hypotheses: Finding Needles in Multiple Haystacks," Journal of the American Statistical Association, 104, 1467-1481.
[11] Chang, C.-H., Huang, H.-C. and Ing, C.-K. (2014). Asymptotic Theory of Generalized Information Criterion for Geostatistical Regression Model Selection. Ann. Statist., to appear.
[12] Chen, J. and Chen, Z. (2008). Extended Bayesian Information Criteria for Model Selection With Large Model Spaces. Biometrika, 95, 759-771.
[13] Chen, K. and Chan, K. S. (2011). Subset ARMA Model Selection via the Adaptive Lasso. Statistics and Its Interface, 4, 197-205.
[14] Chi, Z. and Tan, Z. (2008). Positive False Discovery Proportions: Intrinsic Bounds and Adaptive Control. Statistica Sinica, 18, 837-860.
[15] Das, K., Li, J., Fu, G., Wang, Z., and Wu, R. (2011), "Genome-Wide Association Studies for Bivariate Sparse Longitudinal Data," Human Heredity, 72, 110-120.
[16] Das, K., Li, J., Wang, Z., Tong, C., Fu, G., Li, Y., Xu, M., Ahn, K., Mauger, D., Li, R., and Wu, R. (2011), "A Dynamic Model for Genome-Wide Association Studies," Human Genetics, 129, 629-639.
[17] Efron, B. (2008), "Simultaneous Inference: When Should Hypothesis Testing Problems be Combined?" The Annals of Applied Statistics, 2, 197-223.
[18] Efron, B. (2010), Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, New York: Cambridge University Press.
[19] Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001), "Empirical Bayes Analysis of a Microarray Experiment," Journal of the American Statistical Association, 96, 1151-1160.
[20] Erdős, L., Yau, H.-T., and Yin, J. (2012). Bulk Universality for Generalized Wigner Matrices. Probab. Theory Relat. Fields, 154, 341-407.
[21] Fan, J., Feng, Y. and Song, R. (2011). Nonparametric Independence Screening in Sparse Ultra-High Dimensional Additive Models. J. Amer. Statist. Assoc., 106, 544-557.
[22] Fan, J. and Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. J. Amer. Statist. Assoc., 96, 1348-1360.
[23] Fan, J. and Lv, J. (2011). Nonconcave Penalized Likelihood with NP-Dimensionality. IEEE Transactions on Information Theory, 57, 5467-5484.
[24] Fan, Y. and Tang, C. Y. (2013). Tuning Parameter Selection in High Dimensional Penalized Likelihood. J. Roy. Statist. Soc. Ser. B, 75, 531-552.
[25] Ferkingstad, E., Frigessi, A., Rue, H., Thorleifsson, G., and Kong, A. (2008), "Unsupervised Empirical Bayesian Multiple Testing with External Covariates," The Annals of Applied Statistics, 2, 714-735.
[26] Foster, D. and George, E. (1994). The Risk Inflation Criterion for Multiple Regression. Ann. Statist., 22, 1947-1975.
[27] Fox, C. S., Heard-Costa, N., Cupples, L. A., Dupuis, J., Vasan, R. S., and Atwood, L. D. (2007), "Genome-Wide Association to Body Mass Index and Waist Circumference: the Framingham Heart Study 100K Project," BMC Medical Genetics, 8, S18.
[28] Genovese, C., and Wasserman, L. (2002), "Operating Characteristics and Extensions of the False Discovery Rate Procedure," Journal of the Royal Statistical Society, Series B, 64, 499-517.
[29] Hall, P. (1990). Akaike's Information Criterion and Kullback-Leibler Loss for Histogram Density Estimation. Probability Theory and Related Fields, 85, 449-467.
[30] Horn, R. A. and Johnson, C. R. (1985). Matrix Analysis. New York: Cambridge University Press.
[31] Hu, J. X., Zhao, H., and Zhou, H. H. (2010), "False Discovery Rate Control With Groups," Journal of the American Statistical Association, 105, 1215-1227.
[32] Ing, C.-K. (2007). Accumulated Prediction Errors, Information Criteria and Optimal Forecasting for Autoregressive Time Series. Ann. Statist., 35, 1238-1277.
[33] Ing, C.-K. and Lai, T. L. (2011). A Stepwise Regression Method and Consistent Model Selection for High-Dimensional Sparse Linear Models. Statistica Sinica, 21, 1473-1513.
[34] Jaquish, C. E. (2007), "The Framingham Heart Study, on its Way to Becoming the Gold Standard for Cardiovascular Genetic Epidemiology?" BMC Medical Genetics, 8, 63.
[35] Jin, J., and Cai, T. T. (2007), "Estimating the Null and the Proportion of Nonnull Effects in Large-Scale Multiple Comparisons," Journal of the American Statistical Association, 102, 495-506.
[36] Konishi, S. and Kitagawa, G. (1996). Generalised Information Criterion in Model Selection. Biometrika, 83, 875-890.
[37] Kullback, S. and Leibler, R. (1951). On Information and Sufficiency. Ann. Math. Statist., 22, 79-86.
[38] Li, J., Das, K., Fu, G., Li, R., and Wu, R. (2011), "The Bayesian Lasso for Genome-Wide Association Studies," Bioinformatics, 27, 516-523.
[39] Li, L., Kabesch, M., Bouzigon, E., Demenais, F., Farrall, M., Moffatt, M. F., Lin, X., and Liang, L. (2013), "Using eQTL Weights to Improve Power for Genome-Wide Association Studies: a Genetic Study of Childhood Asthma," Frontiers in Genetics, 4, Article 103.
[40] Lin, C.-Y., Xing, G., Ku, H.-C., Elston, R. C., and Xing, C. (2014), "Enhancing the Power to Detect Low-Frequency Variants in Genome-Wide Screens," Genetics, 196, 1293-1302.
[41] Lin, W.-Y., and Lee, W.-C. (2012), "Improving Power of Genome-Wide Association Studies with Weighted False Discovery Rate Control and Prioritized Subset Analysis," PLoS ONE, 7, e33716.
[42] Liu, W. and Yang, Y. (2011). Parametric or Nonparametric? A Parametricness Index for Model Selection. Ann. Statist., 39, 2074-2102.
[43] Lv, J. and Fan, Y. (2009). A Unified Approach to Model Selection and Sparse Recovery Using Regularized Least Squares. Ann. Statist., 37, 3498-3528.
[44] Lv, J. and Liu, J. S. (2014). Model Selection Principles in Misspecified Models. J. Roy. Statist. Soc. Ser. B, 76, 141-167.
[45] MAQC Consortium (2010). The MicroArray Quality Control (MAQC)-II Study of Common Practices for the Development and Validation of Microarray-based Predictive Models. Nat. Biotechnol., 28, 827-841.
[46] Oberthuer, A., Berthold, F., Warnat, P., Hero, B., Kahlert, Y., Spitz, R., Ernestus, K., König, R., Haas, S., Eils, R., Schwab, M., Brors, B., Westermann, F. and Fischer, M. (2006). Customized Oligonucleotide Microarray Gene Expression Based Classification of Neuroblastoma Patients Outperforms Current Clinical Risk Stratification. Journal of Clinical Oncology, 24, 5070-5078.
[47] Peña, E. A., Habiger, J. D., and Wu, W. (2011), "Power-Enhanced Multiple Decision Functions Controlling Family-Wise Error and False Discovery Rates," The Annals of Statistics, 39, 556-583.
[48] Peng, H., Yan, H. and Zhang, W. (2013). The Connection Between Cross-Validation and Akaike Information Criterion in a Semiparametric Family. Journal of Nonparametric Statistics, 25, 475-485.
[49] Roeder, K., Bacanu, S.-A., Wasserman, L., and Devlin, B. (2006), "Using Linkage Genome Scans to Improve Power of Association in Genome Scans," The American Journal of Human Genetics, 78, 243-252.
[50] Roeder, K., and Wasserman, L. (2009), "Genome-Wide Significance Levels and Weighted Hypothesis Testing," Statistical Science, 24, 398-413.
[51] Rogosa, D. (2003), "Student Progress in California Charter Schools, 1999-2002." Available at http://statweb.stanford.edu/~rag/draft/charter9902.pdf.
[52] Roquain, E., and van de Wiel, M. A. (2009), "Optimal Weighting for False Discovery Rate Control," Electronic Journal of Statistics, 3, 678-711.
[53] Roquain, E., and Villers, F. (2011), "Exact Calculations for False Discovery Proportion with Application to Least Favorable Configurations," The Annals of Statistics, 39, 584-612.
[54] Rudelson, M. and Vershynin, R. (2013). Hanson-Wright Inequality and Sub-Gaussian Concentration. Electron. Commun. Probab., 18, 1-9.
[55] Schwarz, G. (1978). Estimating the Dimension of a Model. Ann. Statist., 6, 461-464.
[56] Silverman, B. W. (1986), Density Estimation for Statistics and Data Analysis, London: Chapman and Hall.
[57] Singh, D., Febbo, P., Ross, K., Jackson, D., Manola, J., Ladd, C., Tamayo, P., Renshaw, A., D'Amico, A. and Richie, J. (2002). Gene Expression Correlates of Clinical Prostate Cancer Behavior. Cancer Cell, 1, 203-209.
[58] Spjøtvoll, E. (1972), "On the Optimality of Some Multiple Comparison Procedures," The Annals of Mathematical Statistics, 43, 398-411.
[59] Stone, M. (1977). An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike's Criterion. J. Roy. Statist. Soc. Ser. B, 39, 44-47.
[60] Storey, J. D. (2002), "A Direct Approach to False Discovery Rates," Journal of the Royal Statistical Society, Series B, 64, 479-498.
[61] Storey, J. D. (2007), "The Optimal Discovery Procedure: A New Approach to Simultaneous Significance Testing," Journal of the Royal Statistical Society, Series B, 69, 347-368.
[62] Sun, W., and Cai, T. T. (2007), "Oracle and Adaptive Compound Decision Rules for False Discovery Rate Control," Journal of the American Statistical Association, 102, 901-912.
[63] Sun, W., Reich, B., Cai, T. T., Guindani, M., and Schwartzman, A. (2015), "False Discovery Control in Large-Scale Spatial Multiple Testing," Journal of the Royal Statistical Society, Series B, 77, 59-83.
[64] Tukey, J. W. (1991), "The Philosophy of Multiple Comparisons," Statistical Science, 6, 100-116.
[65] Vershynin, R. (2012). Introduction to the Non-Asymptotic Analysis of Random Matrices. In Compressed Sensing: Theory and Applications, Chapter 5, 210-268. Cambridge University Press.
[66] Wang, H., Li, B. and Leng, C. (2009). Shrinkage Tuning Parameter Selection with a Diverging Number of Parameters. J. R. Statist. Soc. B, 71, 671-683.
[67] Wang, H., Li, R. and Tsai, C. L. (2007). Tuning Parameter Selectors for the Smoothly Clipped Absolute Deviation Method. Biometrika, 94, 553-568.
[68] Westfall, P. H., and Young, S. S. (1993), Resampling-Based Multiple Testing, New York: John Wiley and Sons.
[69] White, H. (1982). Maximum Likelihood Estimation of Misspecified Models. Econometrica, 50, 1-25.
[70] Xing, C., Cohen, J. C., and Boerwinkle, E. (2010), "A Weighted False Discovery Rate Control Procedure Reveals Alleles at FOXA2 that Influence Fasting Glucose Levels," The American Journal of Human Genetics, 86, 440-446.
[71] Zhang, Y., Li, R. and Tsai, C. L. (2010). Regularization Parameter Selections via Generalized Information Criterion. J. Amer. Statist. Assoc., 105, 312-323.
Abstract
In this thesis we discuss problems in two cutting-edge topics of modern statistics, namely, high-dimensional statistical inference and large-scale multiple testing.

Model selection is indispensable to high-dimensional sparse modeling in selecting the best set of covariates among a sequence of candidate models. Most existing work assumes implicitly that the model is correctly specified or of fixed dimensions. Yet model misspecification and high dimensionality are common in real applications. We investigate two classical Kullback-Leibler divergence and Bayesian principles of model selection in the setting of high-dimensional misspecified models. Asymptotic expansions of these principles reveal that the effect of model misspecification is crucial and should be taken into account, leading to the generalized AIC and generalized BIC in high dimensions. With a natural choice of prior probabilities, we suggest the generalized BIC with prior probability, which involves a logarithmic factor of the dimensionality in penalizing model complexity. We further establish the consistency of the covariance contrast matrix estimator in a general setting. Our results and new methods are supported by numerical studies.

The use of weights provides an effective strategy to incorporate prior domain knowledge in large-scale inference. Our work studies weighted multiple testing in a decision-theoretic framework. We develop oracle and data-driven procedures that aim to maximize the expected number of true positives subject to a constraint on the weighted false discovery rate. The asymptotic validity and optimality of the proposed methods are established. The results demonstrate that incorporating informative domain knowledge enhances the interpretability of results and precision of inference. Simulation studies show that the proposed method controls the error rate at the nominal level, and the gain in power over existing methods is substantial in many settings. An application to genome-wide association study is discussed.

Further we discuss connections between the two topics, applications to area(s) of business, and possible future directions.
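For reference, the weighted error and power criteria described in the abstract take the following form in the notation used in the proofs above, where $\theta_i$ indicates whether the $i$-th null hypothesis is false, $\delta_i \in \{0,1\}$ is the decision, and $a_i$, $b_i$ are the decision weights:
$$\mathrm{wFDR}(\boldsymbol\delta) = \frac{E\big\{\sum_{i=1}^m a_i (1-\theta_i)\,\delta_i\big\}}{E\big\{\sum_{i=1}^m a_i\,\delta_i\big\}}, \qquad \mathrm{ETP}(\boldsymbol\delta) = E\Big\{\sum_{i=1}^m b_i\,\theta_i\,\delta_i\Big\}.$$
The goal is to maximize $\mathrm{ETP}$ subject to $\mathrm{wFDR} \le \alpha$, which is the constraint analysed in Theorem 9.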
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Reproducible large-scale inference in high-dimensional nonlinear models
Feature selection in high-dimensional modeling with thresholded regression
Large-scale inference in multiple Gaussian graphical models
Optimizing penalized and constrained loss functions with applications to large-scale internet media selection
Shrinkage methods for big and complex data analysis
Large-scale multiple hypothesis testing and simultaneous inference: compound decision theory and data driven procedures
Nonparametric ensemble learning and inference
Comparing robustness to outliers and model misspecification between robust Poisson and log-binomial models
Two essays on financial econometrics
Essays on high-dimensional econometric models
Two essays in econometrics: large N T properties of IV, GMM, MLE and least square model selection/averaging
Characterization and discovery of genetic associations: multiethnic fine-mapping and incorporation of functional information
Essays on factor in high-dimensional regression settings
Robust feature selection with penalized regression in imbalanced high dimensional data
Model selection methods for genome wide association studies and statistical analysis of RNA seq data
Optimal dividend and investment problems under Sparre Andersen model
Topics in selective inference and replicability analysis
Computational models and model-based fMRI studies in motor learning
Building better species distribution models with machine learning: assessing the role of covariate scale and tuning in Maxent models
Equilibrium model of limit order book and optimal execution problem
Asset Metadata
Creator: Basu, Pallavi (author)
Core Title: Model selection principles and false discovery rate control
School: Marshall School of Business
Degree: Doctor of Philosophy
Degree Program: Business Administration
Publication Date: 04/20/2016
Defense Date: 03/21/2016
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: AIC, Bayesian principle, BIC, class weights, decision weights, high dimensionality, Kullback-Leibler divergence principle, model misspecification, model selection, multiple testing with groups, OAI-PMH Harvest, prioritized subsets, value to cost ratio, weighted p-value
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Lv, Jinchi (committee chair), Sun, Wenguang (committee chair), Ferson, Wayne (committee member), James, Gareth (committee member), Radchenko, Peter (committee member)
Creator Email: pallavi.basu.2016@marshall.usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-236091
Unique identifier: UC11277473
Identifier: etd-BasuPallav-4320.pdf (filename), usctheses-c40-236091 (legacy record id)
Legacy Identifier: etd-BasuPallav-4320.pdf
Dmrecord: 236091
Document Type: Dissertation
Rights: Basu, Pallavi
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA