Large-Scale Multiple Hypothesis Testing and Simultaneous Inference: Compound Decision Theory and Data Driven Procedures

by Weinan Wang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (Business Administration)

December 2018

Copyright 2018 Weinan Wang

Dedication

To my parents, who have always supported me throughout my journey in life, and to Marcos, who made my entire Ph.D. a lot more interesting.

Acknowledgements

First and foremost, I would like to express my sincerest gratitude to my Ph.D. advisor and mentor, Dr. Wenguang Sun. Dr. Sun tirelessly guided me throughout graduate school with his domain expertise and research intuition; he is the person who made this research and my career possible. I would also like to thank Dr. Gareth James and Dr. Gourab Mukherjee for serving on my dissertation committee and providing feedback throughout the thesis process; I really appreciate the time and effort they put into it. I would also like to acknowledge the faculty and students in the Data Sciences and Operations department at Marshall for attending my research talks and giving excellent feedback on both my presentation skills and my research. In particular, I would like to thank Trambak Banerjee and Luella Fu, with whom I had valuable discussions on the most interesting research questions. Last but not least, I would like to thank my parents, who have always been my unwavering support, not just during my Ph.D. but for the twenty-two years before it as well. I owe my life and any achievements to them.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Covariate Assisted Ranking and Screening for Large-Scale Two-Sample Inference
    1.1.1 Applications
    1.1.2 Relevant literature and our proposal
    1.1.3 Main contributions
  1.2 Multistage Adaptive Testing of Sparse Signals
    1.2.1 Non-adaptive vs. adaptive designs
    1.2.2 Applications and statistical challenges
    1.2.3 Problem formulation
    1.2.4 Main contributions
  1.3 Summary of Chapters
Chapter 2: Covariate Assisted Ranking and Screening for Large-Scale Two-Sample Inference
  2.1 Extracting Structural Information Using an Auxiliary Sequence
    2.1.1 Constructing the primary and auxiliary statistics
    2.1.2 A bivariate random mixture model
    2.1.3 Problem formulation
  2.2 Oracle and Data-Driven Procedures
    2.2.1 Oracle procedure with pairs of observations
    2.2.2 Approximating T_OR via screening
    2.2.3 Estimation of the test statistic
    2.2.4 A refined estimator
    2.2.5 The CARS procedure
    2.2.6 The case with unknown variances and non-Gaussian errors
  2.3 Extensions and Connections with Existing Work
    2.3.1 A general bivariate model
    2.3.2 Discrete case: multiple testing with groups
    2.3.3 Discretization and uncorrelated screening
    2.3.4 The "pooling within" strategy for information integration
    2.3.5 Capturing sparsity information via screening
    2.3.6 Heterogeneity, correlation and empirical null
  2.4 Numerical Results
    2.4.1 The implementation and R package CARS
    2.4.2 Simulation I: known variances
    2.4.3 Simulation II: estimated variances and non-Gaussian errors
    2.4.4 Simulation III: means with opposite signs
    2.4.5 Application in supernova detection
  2.5 Discussion
  2.6 Proofs of Main Theorems (Theorems 1 and 2)
  2.7 Supplemental Materials
    2.7.1 Additional theory and proofs of other results (proofs of Propositions 1-7; effects of baseline removal; proofs of the claims in Remarks 3 and 7; the conservative bivariate density estimator; screening via Lfdr)
    2.7.2 Supplementary numerical results (unknown and equal variances; comparison with sample splitting; stability of tuning; comparison of ranking; dependent tests; supernova example images; analysis of the microarray time-course data; comparison of mFDR and FDR)
Chapter 3: Multistage Adaptive Testing of Sparse Signals
  3.1 Oracle and SMART Rules for Sparse Recovery
    3.1.1 A decision-theoretic formulation
    3.1.2 Oracle procedure
    3.1.3 Approximation of oracle thresholds and its properties
    3.1.4 The SMART procedure
    3.1.5 Connections to existing ideas
  3.2 Limits of Sparse Recovery
    3.2.1 The fundamental limits in adaptive designs
    3.2.2 Comparison with existing results
  3.3 Simulation
    3.3.1 Estimation of the oracle statistic
    3.3.2 Simulation 1: simple thresholding vs. compound thresholding
    3.3.3 Simulation 2: SMART vs. distilled sensing
  3.4 Applications
  3.5 Proofs (Theorems 4-8)
  3.6 Supplemental Materials (derivation of thresholds; application to an astronomical survey)
Chapter 4: Ongoing and Future Research
  4.1 Online FDR Control
    4.1.1 Introduction
    4.1.2 Problem formulation
    4.1.3 Oracle procedures (offline and online)
    4.1.4 Data-driven procedure
    4.1.5 Simulation studies
  4.2 Anomaly Detection in User Engagement Metrics with False Discovery Rate Control
    4.2.1 Introduction
    4.2.2 Methodologies (STL decomposition; adaptive-Z procedure; online FDR control)
    4.2.3 Implementation of real-time STL + adaptive-Z (parameters for STL decomposition and real-time prediction; null distribution and non-null proportion estimation; the real-time anomaly detection algorithm)
    4.2.4 Numerical results (simulations; experimental results)
Bibliography

List of Tables

2.1 Thresholds and total number of rejections (in parentheses) of different testing procedures.

List of Figures

2.1 A graphical illustration of the smoothing estimator (2.3.3): (a) the counts of S_i are approximately uniformly distributed on the right, and the bias decreases while the variability increases as the tuning parameter increases; (b) the histogram of all p-values, which are likewise approximately uniform on the right.
2.2 Two-sample tests with known variances: FDR and MDR plotted against varying non-null proportions (top row), conditional proportions (middle row) and effect sizes (bottom row) for DD, BH, OR, AZ and US.
2.3 Two-sample tests with unknown variances and non-Gaussian errors: FDR and MDR plotted against varying non-null proportions (top row) and conditional proportions (bottom row) for DD, BH, OR, AZ and US.
2.4 Comparison of BH, DD and OR when the nonzero means have opposite signs.
2.5 Images taken on August 23, 24 and 25, 2011 (left to right); the arrows indicate the explosion of SN 2011fe.
2.6 General case (unknown variance): FDR and MDR by non-null proportion (top row) and conditional proportion (bottom row) for DD, BH, OR, AZ and US.
2.7 The special case of equal variances.
2.8 Effect of a tuning parameter (DD and OR).
2.9 Effect of a tuning parameter (DD and OR).
2.10 Comparison of ranking.
2.11 Comparison of methods under weak dependence.
2.12 Left to right: BH procedure, adaptive-Z procedure, uncorrelated screening, CARS.
2.13 Asymptotic equivalence of mFDR and FDR for OR and DD.
3.1 Comparison with single thresholding: DD.SM, OR.SM, DD.ST and OR.ST.
3.2 Comparison with DS: DS, DD.SM, OR.SM, DD.ST and OR.ST.
3.3 SMART and DS comparison: (a) and (b) show the original radio telescope image and the image tainted with white noise; (c) and (d) compare SMART and DS when the total numbers of observations are about the same; (f) shows the resulting image when DS is run for 12 stages, and (e) shows the image produced by SMART using the recorded error rates from the 12-stage DS.
4.1 Simulation setting 1, with varying effect size, p = 0.01.
4.2 Simulation setting 2, with varying effect size, p = 0.05.
4.3 Simulation setting 3, with linearly varying p_t and varying effect size.
4.4 Simulation setting 4, with sinusoidally varying p_t and varying effect size.
4.5 Sample user engagement metrics data at Snap Inc.
4.6 Seasonal and trend patterns for sample user engagement metrics.
4.7 Simulation setting 1, varying effect size with p = 0.05, i.i.d. random noise.
4.8 Simulation setting 2, varying proportion of signals with uniformly distributed signal strengths, i.i.d. random noise.
4.9 Simulation setting 3, varying proportion of signals with uniformly distributed signal strengths, noise generated from ARIMA(2, 0, 1).
4.10 Detected anomalies in user engagement metrics from March 10 to March 15, 2018.
4.11 Detected anomalies in user engagement metrics from February 25 to March 5, 2018.
4.12 Detected anomalies in user engagement metrics from February 10 to March 20, 2018.

Abstract

Two-sample multiple testing has a wide range of applications. The conventional practice first reduces the original observations to a vector of p-values and then chooses a cutoff to adjust for multiplicity. However, this data reduction step can cause significant loss of information and thus lead to suboptimal testing procedures. In this thesis, I present a new framework for two-sample multiple testing that incorporates a carefully constructed auxiliary variable in inference to improve power. A data-driven multiple testing procedure is developed by employing a covariate-assisted ranking and screening (CARS) approach that optimally combines the information from both the primary and auxiliary variables. The proposed CARS procedure is shown to be asymptotically valid and optimal for false discovery rate (FDR) control. The procedure is implemented in the R package CARS. Numerical results confirm the effectiveness of CARS in FDR control and show that it achieves substantial power gain over existing methods. CARS is also illustrated through an application to the analysis of a satellite imaging data set for supernova detection.

In addition to the two-sample multiple testing problem, multistage designs are used in a wide range of scientific fields. By allocating sensing resources adaptively, one can effectively eliminate null locations and localize signals with a smaller study budget. We formulate a decision-theoretic framework for simultaneous multistage adaptive testing and study how to minimize the total number of measurements while meeting pre-specified constraints on both the false positive rate (FPR) and missed discovery rate (MDR).
The new procedure, which effectively pools information across individual tests using a simultaneous multistage adaptive ranking and thresholding (SMART) approach, achieves precise error rate control and leads to great savings in total study costs. Numerical studies confirm the effectiveness of SMART for FPR and MDR control and show that it achieves substantial power gain over existing methods. The SMART procedure is demonstrated through the analysis of high-throughput screening data and spatial imaging data.

I also discuss ongoing and future research directions in this thesis, particularly on the topics of online FDR control and structured multiple testing. As an application, I discuss in detail the use of online FDR control for real-time anomaly detection of user engagement metrics at Snap Inc.

Chapter 1
Introduction

This chapter provides a brief introduction to large-scale two-sample multiple testing and multistage sequential analysis. Section 1.1 reviews two-sample inference in the literature and our motivation for the CARS procedure, together with its applications and contributions. Section 1.2 introduces multistage testing under both adaptive and non-adaptive designs, together with its applications and problem formulation, and summarizes the main contributions of our SMART procedure. Section 1.3 outlines the remainder of this dissertation.

1.1 Covariate Assisted Ranking and Screening for Large-Scale Two-Sample Inference

A common goal in modern scientific studies is to identify features that exhibit differential levels across two or more conditions. The task becomes difficult in large-scale comparative experiments, where differential features are sparse among thousands or even millions of features being investigated. The conventional practice is to first reduce the original samples to a vector of p-values and then choose a cutoff to adjust for multiplicity. However, this first data reduction step can cause significant loss of information and thus lead to suboptimal testing procedures. This thesis proposes new strategies to extract structural information in the sample using an auxiliary covariate sequence and develops optimal covariate-assisted inference procedures for large-scale two-sample multiple testing problems.

1.1.1 Applications

We focus on a setting where both mean vectors are individually sparse. Such a setting arises naturally in many modern scientific applications. For example, the detection of sequentially activated genes in time-course microarray experiments involves identifying varied effect sizes across different time points [28, 103]. Since only a small fraction of genes are differentially expressed from the baseline, the problem of identifying varied levels over time essentially reduces to a multiple testing problem with several high-dimensional sparse vectors (after removing the baseline effects). A second example arises from the detection of supernova explosions considered in later sections: potential locations can be identified by testing for sudden changes in brightness in satellite images taken over a period of time. After the measurements are converted into grey-scale images and vectorized, multiple tests are conducted to compare the intensity levels between two sparse vectors. Another case in point is the analysis of differential networks, where the goal is to detect discrepancies between two or more networks with possibly sparse edges.
1.1.2 Relevant literature and our proposal

We first describe the conventional framework for two-sample inference and then discuss its limitations. Let $X$ and $Y$ be two random vectors recording the measurement levels of the same $m$ features under two experimental conditions. The population mean vectors are $\mu_x = E(X) = (\mu_{x1}, \ldots, \mu_{xm})^\top$ and $\mu_y = E(Y) = (\mu_{y1}, \ldots, \mu_{ym})^\top$. A classical formulation for identifying differential features is to carry out $m$ two-sample tests:
\[
H_{i,0}: \mu_{xi} = \mu_{yi} \quad \text{vs.} \quad H_{i,1}: \mu_{xi} \ne \mu_{yi}, \qquad 1 \le i \le m. \tag{1.1.1}
\]
Suppose we have collected two random samples $\{X_1, \ldots, X_{n_1}\}$ and $\{Y_1, \ldots, Y_{n_2}\}$ as independent copies of $X$ and $Y$, respectively. The standard practice starts with a data reduction step: a two-sample $t$-statistic $T_i$ is computed to compare the two conditions for feature $i$, and $T_i$ is then converted to a p-value or z-value. Finally, a significance threshold is chosen to control the multiplicity. However, this conventional practice, which only utilizes a vector of p-values, may suffer from substantial information loss.

This thesis proposes a new testing framework that involves two steps. In the first step, besides the usual primary test statistics, an auxiliary covariate sequence is constructed from the original data to capture important structural information that is discarded by conventional practice. In the second step, the auxiliary covariates are combined with the primary test statistics to construct a multiple testing procedure with improved accuracy. Our idea is that the hypotheses become "unequal" in light of the auxiliary sequence. A key step in our methodological development is to incorporate this heterogeneity by recasting the problem in the framework of multiple testing with a covariate sequence. This requires a carefully constructed pair of statistics that leads to a simple bivariate model and an easily implementable methodology. Chapter 2 discusses strategies for constructing the pair of primary and auxiliary variables (Section 2.1) and then develops oracle and data-driven multiple testing procedures for the resulting bivariate model (Section 2.2). The proposed method employs a covariate-assisted ranking and screening (CARS) approach that simultaneously incorporates the primary and auxiliary information in multiple testing. We show that the CARS procedure controls the false discovery rate at the nominal level and outperforms existing methods in power.

We mention two related strategies in the literature: testing following screening and testing following grouping. In the first strategy, the hypotheses are formed and tested hierarchically via a screen-and-clean method [120, 83, 121, 114, 22]. Following that strategy, one can first inspect the sample to identify the union support of $\mu_x$ and $\mu_y$, and then conduct two-sample tests on the narrowed subset to further eliminate the null locations with no differential levels. The screen-and-clean approach requires sample splitting to ensure independence between the screening and testing stages and to avoid selection bias [90]. However, even though the screening stage can significantly narrow down the focus, sample splitting often leads to loss of power. For example, the empirical studies in [99] concluded that a two-stage analysis is in general inferior to a naive joint analysis that combines the data from both stages.
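For contrast with what follows, the short R sketch below spells out the conventional p-value pipeline described at the start of this subsection: reduce each feature to a two-sample statistic, convert to a p-value, and adjust for multiplicity. The data matrices, the choice of Welch's statistic with a normal approximation, and the simulation settings are illustrative assumptions, not part of the proposed methodology.

conventional_two_sample <- function(X, Y, alpha = 0.05) {
  n1 <- ncol(X); n2 <- ncol(Y)
  xbar <- rowMeans(X); ybar <- rowMeans(Y)
  vx <- apply(X, 1, var); vy <- apply(Y, 1, var)
  # Welch-type two-sample statistics, one per feature
  tstat <- (xbar - ybar) / sqrt(vx / n1 + vy / n2)
  # two-sided p-values via a normal approximation (n1, n2 assumed large)
  pval <- 2 * pnorm(-abs(tstat))
  # Benjamini-Hochberg adjustment and rejection set
  rej <- which(p.adjust(pval, method = "BH") <= alpha)
  list(tstat = tstat, pval = pval, rejected = rej)
}

# Example with simulated data: about 5% of features differ between conditions.
set.seed(1)
m <- 2000; n1 <- 30; n2 <- 30
mu_diff <- ifelse(runif(m) < 0.05, 1, 0)
X <- matrix(rnorm(m * n1, mean = mu_diff), m, n1)
Y <- matrix(rnorm(m * n2, mean = 0), m, n2)
res <- conventional_two_sample(X, Y)
length(res$rejected)

Everything after the reduction to pval operates on the p-values alone, which is exactly the information loss the covariate-assisted approach is designed to avoid.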
The second strategy [71] can be described as testing following grouping, in which the hypotheses are analyzed in groups via a divide-and-test method. [71] developed an uncorrelated screening (US) method, which first divides the hypotheses into two groups according to a screening statistic and then applies multiple testing procedures to the groups separately to identify non-null cases. It was shown in [71] that US controls the error rate at the nominal level and outperforms competing methods in power.

1.1.3 Main contributions

Our approach marks a clear departure from existing methods. Both the screen-and-clean and divide-and-test strategies involve dichotomizing a continuous variable, which fails to fully utilize the auxiliary information. By contrast, our proposed CARS procedure models the screening covariate as a continuous variable and employs a novel ranking and selection procedure that optimally integrates the information from both the primary and auxiliary variables.

1.2 Multistage Adaptive Testing of Sparse Signals

Suppose we wish to recover the support of a $p$-vector $\mu = (\mu_1, \ldots, \mu_p) \in \mathbb{R}^p$ based on measurements from variables $X_1, \ldots, X_p$. Let $\mathcal{S} = \{i : \mu_i \ne 0\}$ denote the support of $\mu$. We focus on a setup where measurements on the variables are performed sequentially. Consider the following multistage random mixture model [7, 115, 74]:
\[
X_{ij} \sim (1-\epsilon) F_0 + \epsilon F_{1i}, \qquad i = 1, \ldots, p, \tag{1.2.2}
\]
where the measurements are taken in stages $j = 1, 2, \ldots$, and $X_{ij}$ follows the null distribution $F_0$ if $i \notin \mathcal{S}$ and the non-null distribution $F_{1i}$ if $i \in \mathcal{S}$. The model assumes that $F_0$ is identical for all $i \notin \mathcal{S}$, whereas $F_{1i}$ can vary across $i \in \mathcal{S}$; allowing heterogeneous $F_{1i}$ is desirable in applications where the non-zero coordinates have different effect sizes. The mixing proportion $\epsilon$ is usually small and can be understood as follows: $X_{i1}$ has probability $1-\epsilon$ of being a null case and probability $\epsilon$ of being a signal. Denote by $f_0$ and $f_{1i}$ the corresponding densities. The goal of the sparse recovery problem is to find a subset that contains virtually all and only the signals.

1.2.1 Non-adaptive vs. adaptive designs

Consider a setting where each measurement is associated with a fixed cost; hence collecting many repeated measurements on all variables is prohibitively expensive in large-scale studies where $p$ is in the order of thousands or even millions. In applications with a sequential design, it may not be necessary to measure all features at all stages. Let $\mathcal{A}_j \subseteq \{1, \ldots, p\}$ denote the set of coordinates where measurements are performed at stage $j$ and $X_{\mathcal{A}_j} = (X_{ij} : i \in \mathcal{A}_j)$ the corresponding observations. In a non-adaptive setting, the $X_{ij}$ are sampled at every stage following a pre-fixed policy, and each coordinate $i$ is expected to receive the same amount of measurement budget. In an adaptive sampling design, the sampling scheme varies across coordinates, with the flexibility of adjusting $\mathcal{A}_j$ in response to the information collected at previous stages.

The adaptive sampling and inference framework provides a powerful approach to sparse estimation and testing problems. Intuitively, the sensing resources in later stages can be allocated in a more cost-effective way to reflect our updated contextual knowledge during the course of the study; hence greater precision in inference can be achieved with the same study budget or computational cost.
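The following sketch illustrates the contrast just described under model (1.2.2). It is a distilled-sensing-style toy illustration, not the SMART procedure: at each stage only the currently active coordinates are measured, and coordinates whose running mean falls below a fixed cutoff are dropped. The choices $F_0 = N(0,1)$, $F_{1i} = N(3,1)$, the cutoff of 0, and five stages are illustrative assumptions.

set.seed(2)
p <- 10000; eps <- 0.01; n_stages <- 5
theta <- rbinom(p, 1, eps)             # 1 = signal, 0 = null
mu    <- ifelse(theta == 1, 3, 0)

active <- rep(TRUE, p)                 # A_j: coordinates still being measured
sum_x  <- numeric(p); n_obs <- numeric(p)
for (j in seq_len(n_stages)) {
  idx <- which(active)
  x   <- rnorm(length(idx), mean = mu[idx])   # one new measurement per active coordinate
  sum_x[idx] <- sum_x[idx] + x
  n_obs[idx] <- n_obs[idx] + 1
  # adaptive step: drop coordinates whose running mean looks null
  active[idx] <- (sum_x[idx] / n_obs[idx]) > 0
}
cat("total measurements (adaptive):", sum(n_obs), "\n")
cat("non-adaptive design would use:", p * n_stages, "\n")
cat("signals still active:", sum(active & theta == 1), "of", sum(theta), "\n")

Because roughly half of the remaining nulls are discarded at every stage, the adaptive scheme spends only a fraction of the non-adaptive budget while retaining most signals, which is the intuition behind the cost savings discussed above.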
A plethora of powerful multistage testing and estimation procedures have been developed under this flexible framework; recent developments include hierarchical testing procedures for pattern recognition [18, 77, 104], distilled sensing and sequential thresholding methods for sparse detection [54, 73, 74], multi-scale search and open-loop feedback control algorithms for adaptive estimation [7, 115, 116], and sequentially designed compressed sensing [55, 75]. These works demonstrate that methodologies adopting adaptive designs can substantially outperform those developed under non-adaptive settings. Our goal is to develop a cost-effective multistage sampling and inference procedure that narrows down the focus in a sequential manner to identify the vector support reliably. The proposed strategy consists of a stopping rule for selecting the coordinates at which measurements should be performed, together with a testing rule for deciding whether a coordinate contains a signal.

1.2.2 Applications and statistical challenges

Multistage experiments have been widely used in many scientific fields, including environmental sciences [32, 107]; microarray, RNA-seq, and protein array experiments [78, 88]; geostatistical analysis [86, 20]; genome-wide association studies [94, 89]; and astronomical surveys [77]. We first describe two applications and discuss important statistical issues in multistage design and analysis.

High-throughput screening (HTS, [122, 19]) is a large-scale hierarchical process that has played a crucial role in fast-advancing fields such as stem cell biology and drug discovery. In drug discovery, HTS involves testing a large number of chemical compounds in multiple stages, including target identification, assay development, primary screening, confirmatory screening, and follow-up of hits. The accurate selection of useful compounds is an important issue at each stage of HTS. For example, at the primary screening stage, a library of compounds is tested to generate an initial list of active compounds, or hits; the goal of this stage is to reduce the size of the library significantly with a negligible false negative rate. In the confirmatory screening stage, the hits are further investigated to generate a list of confirmed hits, which are used to generate "leads" for developing drug candidates. As the lab costs for leads generation are very high, the important task at the confirmatory stage is to construct a subset with a negligible false positive rate while keeping as many useful compounds as possible. Another application arises from large-scale astronomical surveys for detecting sparse objects of interest. When millions of images are taken at high frequencies, the computational cost of an exhaustive search through every single image and pixel, which often involves testing billions of hypotheses, is prohibitively expensive. A multistage decision process can lead to great savings in total sensing effort by quickly narrowing down the focus to a much smaller subset of the most promising spots in the images [33]. These two applications are discussed in detail in Section 3.4.

In the design and analysis of large-scale multistage studies, the inflation of decision errors and soaring study costs are among the top concerns. First, to identify useful signals effectively, we need to control the false negative rate to be small at all stages, since missed signals will not be revisited in subsequent stages.
Second, to reduce the study costs, it is desirable to eliminate as many locations as possible at each stage. Finally, to avoid misleading scientific conclusions, the final stage of analysis calls for a strict control of the false positive rate. We aim to develop a simultaneous inference framework for multistage decision processes that addresses the above issues integrally.

1.2.3 Problem formulation

The sequential sparse recovery problem is a compound decision problem [84] in which each component problem involves testing a single hypothesis $H_0: \mu = 0$ vs. $H_1: \mu \ne 0$ based on sequentially collected observations. The basic framework is formulated in the seminal work of Wald [110], where the following constrained optimization problem is studied:
\[
\text{minimize } E(N) \quad \text{subject to } P_{H_0}(\text{Reject } H_0) \le \alpha_0 \ \text{ and } \ P_{H_1}(\text{Accept } H_0) \le \beta_0. \tag{1.2.3}
\]
Here $N$ is the stopping time, and $\alpha_0$ and $\beta_0$ are pre-specified Type I and Type II error rates. $E(N)$, which represents the average sampling cost, characterizes the efficiency of a sequential procedure. The sequential probability ratio test (SPRT) is shown to be optimal [110, 97] for the single sequential testing problem (1.2.3) in the sense that it has the smallest $E(N)$ among all sequential procedures at level $(\alpha_0, \beta_0)$.

When many coordinate-wise sequential decisions are made simultaneously, the control of inflated decision errors becomes a critical issue. Denote $\theta_i = I(\mu_i \ne 0)$, where $I(\cdot)$ is an indicator function. Let $\pmb\delta = (\delta_1, \ldots, \delta_p)$, where $\delta_i = 0/1$ indicates that $\mu_i$ is classified as a null/non-null case. The true and estimated supports are then $\mathcal{S} = \{i : \theta_i = 1\}$ and $\hat{\mathcal{S}} = \{i : \delta_i = 1\}$, respectively. Define the false positive rate (FPR) and missed discovery rate (MDR) as
\[
\mathrm{FPR}(\pmb\delta) = \frac{E\{\sum_{i=1}^p (1-\theta_i)\delta_i\}}{E(\sum_{i=1}^p \delta_i)}
\quad \text{and} \quad
\mathrm{MDR}(\pmb\delta) = \frac{E\{\sum_{i=1}^p \theta_i(1-\delta_i)\}}{E(\sum_{i=1}^p \theta_i)}. \tag{1.2.4}
\]
The FPR, also referred to as the marginal false discovery rate (mFDR) [47, 102], is asymptotically equivalent to the widely used false discovery rate (FDR, [13]). The FDR is a powerful and popular error criterion in large-scale testing problems; the main consideration for using the FPR (as opposed to the FDR) is analytical convenience. The MDR is also called the "missed rate" [106]. An alternative measure to the MDR is the false negative rate (FNR, [47, 92]). To compare efficiency, we use the expected average stopping time
\[
\mathrm{EAST}(\pmb{N}) = E\Big( p^{-1} \textstyle\sum_{i=1}^p N_i \Big), \tag{1.2.5}
\]
where $\pmb{N} = (N_1, \ldots, N_p)$, with $N_i$ being the stopping time (the number of measurements) at unit $i$. Let $\alpha$ and $\beta$ be the pre-specified FPR and MDR levels. We study the following constrained optimization problem:
\[
\text{minimize } \mathrm{EAST}(\pmb{N}) \quad \text{subject to } \mathrm{FPR}(\pmb\delta) \le \alpha \ \text{ and } \ \mathrm{MDR}(\pmb\delta) \le \beta. \tag{1.2.6}
\]
The formulation (1.2.6) naturally extends the classical formulation (1.2.3) in [110, 97] to the compound decision setting. Moreover, it reflects the two key concerns in the design and analysis of large-scale multistage experiments: the inflation of decision errors (FPR and MDR) and soaring study costs (EAST).

1.2.4 Main contributions

We formulate the sequential sparse recovery problem as a multiple testing problem with sequentially collected observations. A new adaptive testing procedure is developed under the compound decision-theoretic framework. The proposed procedure, which employs a simultaneous multistage adaptive ranking and thresholding (SMART) approach, not only utilizes all information collected through multiple stages but also exploits the compound decision structure to pool information across different coordinates. We show that SMART controls the FPR and MDR and achieves the information-theoretic lower bounds.
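Before turning to the numerical evidence, the sketch below shows how the error and cost measures in (1.2.4)-(1.2.5) can be evaluated empirically over Monte Carlo replications. The matrix layout (one row per replication) and the placeholder decision rule in the example are our own assumptions for illustration.

# Empirical versions of FPR, MDR (1.2.4) and EAST (1.2.5), pooled over replications.
fpr_hat  <- function(theta, delta) sum((1 - theta) * delta) / max(sum(delta), 1)
mdr_hat  <- function(theta, delta) sum(theta * (1 - delta)) / max(sum(theta), 1)
east_hat <- function(N) mean(rowMeans(N))

# Example: evaluate an arbitrary (placeholder) rule on simulated replications.
set.seed(3)
reps <- 100; p <- 500
theta <- matrix(rbinom(reps * p, 1, 0.05), reps, p)   # true indicators
delta <- matrix(rbinom(reps * p, 1, 0.06), reps, p)   # placeholder decisions
N     <- matrix(sample(1:5, reps * p, replace = TRUE), reps, p)  # stopping times
c(FPR = fpr_hat(theta, delta), MDR = mdr_hat(theta, delta), EAST = east_hat(N))

Note that the ratios of pooled sums estimate the ratio-of-expectations form used in (1.2.4), which is the marginal (mFDR-type) definition rather than the expectation-of-ratio (FDR-type) definition.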
Numerical studies confirm the effectiveness of SMART for error rate control and demonstrate that it leads to substantial savings in study costs.

SMART has several advantages. First, existing methods such as distilled sensing (DS, [54]) and simple sequential thresholding (SST, [74]) use fixed thresholding rules that do not offer accurate error rate control. Moreover, the literature on sequential multiple testing [10, 50, 118, 51] has focused on the control of false positive findings, so the issues of MDR control and optimal design in the adaptive setting remain open. By contrast, SMART aims to solve the constrained minimization problem (1.2.6), which addresses adaptive design, FPR control, and MDR control integrally. Second, although existing methods (e.g., [54, 74]) employ a multistage design, their stopping and testing rules at stage $k$ depend only on the observations at the current stage, and the observations from previous stages $j = 1, \ldots, k-1$ are discarded; SMART instead utilizes all information available from the first stage up to the current stage, which leads to more powerful testing and stopping rules. Finally, as opposed to DS and SST, which ignore the compound decision structure, SMART employs the ranking and compound thresholding idea from multiple testing to pool information across individual tests. Consequently, SMART controls the error rates more precisely and yields substantial efficiency gains.

1.3 Summary of Chapters

The rest of the thesis is organized as follows. Chapter 2 discusses the CARS procedure in detail. Section 2.1 reviews existing work on large-scale two-sample inference and presents the idea of extracting structural information using an auxiliary sequence. Section 2.2 introduces the oracle and data-driven CARS procedures together with the relevant estimation methods. Section 2.3 discusses extensions of CARS and its connections with existing work. In Section 2.4, simulation results demonstrate that CARS controls the false discovery rate in finite samples and uniformly dominates existing methods; the gain in power is substantial in many settings. We also illustrate the method on a time-course satellite image data set, where the application shows the improved sensitivity of the proposed method in identifying changes between images taken over time.

Chapter 3 discusses the SMART procedure for multistage analysis. Section 3.1 formulates a compound decision-theoretic framework for sequential sparse recovery problems and proposes oracle and SMART rules for FPR and MDR control. Section 3.2 derives the fundamental limits for sparse inference and establishes the asymptotic optimality of the proposed SMART procedure. Section 3.3 presents numerical comparisons with competing methods to demonstrate the merits of SMART. Section 3.4 applies SMART to data from high-throughput screening studies.

Chapter 4 discusses ongoing and future directions of this research, namely online FDR control and structured multiple testing. Section 4.1 describes the motivation and the proposed online FDR procedure and presents preliminary simulation results. Section 4.2 discusses a specific use case: real-time anomaly detection for user engagement metrics at Snap Inc. using online FDR control. I also discuss the problem of structured multiple testing and preliminary results.
Chapter 2
Covariate Assisted Ranking and Screening for Large-Scale Two-Sample Inference

Two-sample multiple testing has a wide range of applications. The conventional practice first reduces the original observations to a vector of p-values and then chooses a cutoff to adjust for multiplicity. However, this data reduction step can cause significant loss of information and thus lead to suboptimal testing procedures. In this chapter, we introduce a new framework for two-sample multiple testing that incorporates a carefully constructed auxiliary variable in inference to improve power. A data-driven multiple testing procedure is developed by employing a covariate-assisted ranking and screening (CARS) approach that optimally combines the information from both the primary and auxiliary variables. The proposed CARS procedure is shown to be asymptotically valid and optimal for false discovery rate (FDR) control. The procedure is implemented in the R package CARS. Numerical results confirm the effectiveness of CARS in FDR control and show that it achieves substantial power gain over existing methods. CARS is also illustrated through an application to the analysis of a satellite imaging data set for supernova detection.

2.1 Extracting Structural Information Using an Auxiliary Sequence

Suppose $\{X_{ij} : 1 \le j \le n_x\}$ and $\{Y_{ik} : 1 \le k \le n_y\}$, $i = 1, \ldots, m$, are repeated measurements of generic independent random variables $X_i$ and $Y_i$, respectively. Let $\mu_0 = (\mu_{0i} : 1 \le i \le m)$ be a latent baseline vector that is itself sparse (including the special case $\mu_{0i} = 0$ for all $i$). Consider the following hierarchical model:
\[
X_{ij} = \mu_{0i} + \delta_{xi} + \epsilon_{xij}, \qquad Y_{ik} = \mu_{0i} + \delta_{yi} + \epsilon_{yik}, \tag{2.1.1}
\]
with corresponding population means $\mu_{xi} = \mu_{0i} + \delta_{xi}$ and $\mu_{yi} = \mu_{0i} + \delta_{yi}$. For ease of presentation, we focus on the Gaussian model for the error terms, $\epsilon_{xij} \overset{\text{iid}}{\sim} N(0, \sigma_{xi}^2)$ and $\epsilon_{yik} \overset{\text{iid}}{\sim} N(0, \sigma_{yi}^2)$; more general settings are discussed in Section 2.2.6. We assume that $\delta_{xi}$ and $\delta_{yi}$, which can be viewed as random perturbations from the baseline, each follow a mixture of a point mass at zero (the Dirac delta function) and unspecified densities $g_x$ and $g_y$ of nonzero effects.

Remark 1. Model (2.1.1) can be applied to scenarios with non-sparse $\mu_x$ and $\mu_y$ when some baseline measurements are available; see Section 2.7.1.8 for further details. The proposed methodology only requires $\bar X_i$ and $\bar Y_i$ to be normal. In practical situations where $n_x$ and $n_y$ are large, the method works well without the normality assumption; numerical results with non-Gaussian errors are provided in Section 2.4.3.

Let $n = n_x + n_y$, and denote $\pi_x = n_x/n$ and $\pi_y = n_y/n$. The population means $\mu_{xi}$ and $\mu_{yi}$ are estimated by $\bar X_i = n_x^{-1}\sum_{j=1}^{n_x} X_{ij}$ and $\bar Y_i = n_y^{-1}\sum_{k=1}^{n_y} Y_{ik}$, respectively. The two-sample inference problem is concerned with the simultaneous testing of the $m$ hypotheses $H_{i,0}: \mu_{xi} = \mu_{yi}$ vs. $H_{i,1}: \mu_{xi} \ne \mu_{yi}$, $i = 1, \ldots, m$. Let $I(\cdot)$ be an indicator function, and let $T_{1i}$ and $T_{2i}$ be summary statistics that contain information about $\theta_{1i} = I(\mu_{xi} \ne \mu_{yi})$ (the support of the mean difference) and $\theta_{2i} = I(\mu_{xi} \ne 0 \text{ or } \mu_{yi} \ne 0)$ (the union support), respectively. $T_{1i}$ is the primary statistic in inference and $T_{2i}$ is an auxiliary covariate. The term "auxiliary" indicates that we do not use $T_{2i}$ to make inference on $\theta_{1i}$ directly; instead, we aim to incorporate $T_{2i}$ in inference to (indirectly) support the evidence provided by the primary variable $T_{1i}$.
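To fix ideas before proceeding, here is a simulation sketch of the hierarchical model (2.1.1). Only the structure follows the model; the sparsity levels, effect distributions, and heteroscedastic variances are illustrative assumptions, and the variable names are ours.

simulate_model_211 <- function(m = 5000, nx = 50, ny = 60,
                               p_base = 0.10, p_x = 0.05, p_y = 0.05) {
  mu0     <- ifelse(runif(m) < p_base, rnorm(m, 2, 1), 0)    # sparse baseline
  delta_x <- ifelse(runif(m) < p_x,    rnorm(m, 1, 0.5), 0)  # sparse perturbations
  delta_y <- ifelse(runif(m) < p_y,    rnorm(m, 1, 0.5), 0)
  sig_x   <- runif(m, 0.5, 1.5)                              # heteroscedastic errors
  sig_y   <- runif(m, 0.5, 1.5)
  X <- matrix(rnorm(m * nx, mean = mu0 + delta_x, sd = sig_x), m, nx)
  Y <- matrix(rnorm(m * ny, mean = mu0 + delta_y, sd = sig_y), m, ny)
  list(X = X, Y = Y,
       theta1 = as.integer(delta_x != delta_y),                          # support of mean difference
       theta2 = as.integer((mu0 + delta_x) != 0 | (mu0 + delta_y) != 0), # union support
       sig_x = sig_x, sig_y = sig_y)
}
dat <- simulate_model_211()
table(theta1 = dat$theta1, theta2 = dat$theta2)  # theta1 = 1 occurs only where theta2 = 1

The cross-tabulation makes the logical relationship discussed next visible: no coordinate has a mean difference without belonging to the union support.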
The intuition is that, since the union support is sparse when both $\mu_x$ and $\mu_y$ are sparse, exploiting this structural information improves the efficiency of the tests. To see this, note that the continuity of the nonzero effects implies that, with probability one, $\theta_{1i}$ and $\theta_{2i}$ obey the following logical relationship:
\[
\theta_{1i} = 0 \quad \text{if} \quad \theta_{2i} = 0. \tag{2.1.2}
\]
Hence the auxiliary sequence can be utilized to assist inference by providing supplementary evidence on whether a hypothesis is promising. We first discuss how to construct the primary and auxiliary statistics from the original data, then introduce a bivariate random mixture model to describe their joint distribution, and finally formulate a decision-theoretic framework for two-sample simultaneous inference with an auxiliary covariate.

2.1.1 Constructing the primary and auxiliary statistics

A key step in our formulation is to construct a pair of statistics $(T_{1i}, T_{2i})$ such that (i) the pair extracts information from the data effectively, and (ii) the pair leads to a simple bivariate model through which the logical relationship (2.1.2) can be exploited. To focus on the main ideas, we first discuss the Gaussian case with known variances; extensions to two-sample tests with non-Gaussian errors and unknown variances are discussed in Section 2.2.6.

The general strategy for constructing the pair $(T_{1i}, T_{2i})$ is as follows. First, $T_{1i}$ is used to capture the information on $\theta_{1i}$; hence $\bar X_i - \bar Y_i$ should be incorporated in its expression. Second, to capture the information on the union support $\theta_{2i}$, we propose to use the weighted sum $\bar X_i + \kappa_i \bar Y_i$, where $\kappa_i > 0$ is a weight to be specified. Under the normality assumption, the covariance of $\bar X_i - \bar Y_i$ and $\bar X_i + \kappa_i \bar Y_i$ is $\sigma_{xi}^2/n_x - \kappa_i \sigma_{yi}^2/n_y$. This motivates the choice $\kappa_i = (\pi_y \sigma_{xi}^2)/(\pi_x \sigma_{yi}^2)$, which yields zero correlation, a crucial property for simplifying the model and facilitating the methodological development. Finally, the difference and the weighted sum are standardized to make the statistics comparable across tests. Combining these considerations, we propose the following pair of statistics to summarize the information in the data:
\[
(T_{1i}, T_{2i}) = \left( \sqrt{\frac{n_x n_y}{n}}\, \frac{\bar X_i - \bar Y_i}{\sigma_{pi}},\ \ \sqrt{\frac{n_x n_y}{n}}\, \frac{\bar X_i + \kappa_i \bar Y_i}{\sqrt{\kappa_i}\, \sigma_{pi}} \right), \tag{2.1.3}
\]
where $\sigma_{pi}^2 = \pi_y \sigma_{xi}^2 + \pi_x \sigma_{yi}^2$. Denote $\pmb{T}_1 = (T_{1i} : 1 \le i \le m)$ and $\pmb{T}_2 = (T_{2i} : 1 \le i \le m)$.

2.1.2 A bivariate random mixture model

We develop a bivariate model to describe the joint distribution of $T_{1i}$ and $T_{2i}$. Let $\pmb\theta_i = (\theta_{1i}, \theta_{2i})$. Assume that the $\pmb\theta_i$ are independent and identically distributed (i.i.d.) bivariate random vectors taking values in the Cartesian product space $\{0,1\}^2 = \{(0,0), (0,1), (1,0), (1,1)\}$. For each combination $\pmb\theta_i = (j,k)$, $(T_{1i}, T_{2i})$ are jointly distributed with conditional density $f(t_{1i}, t_{2i} \mid \theta_{1i}=j, \theta_{2i}=k)$. Denote $\pi_{jk} = P(\theta_{1i}=j, \theta_{2i}=k)$. In practice we do not know $(\theta_{1i}, \theta_{2i})$ but only observe $(T_{1i}, T_{2i})$ from the mixture model
\[
f(t_{1i}, t_{2i}) = \sum_{(j,k) \in \{0,1\}^2} \pi_{jk}\, f(t_{1i}, t_{2i} \mid \theta_{1i}=j, \theta_{2i}=k). \tag{2.1.4}
\]
Denote $\pi_j = P(\theta_{ji} = 1)$, $j = 1, 2$, and assume $\pi_1 > 0$. The goal is to determine the value of $\theta_{1i}$ based on the pairs $\{(T_{1i}, T_{2i}) : 1 \le i \le m\}$.
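The construction (2.1.3) is straightforward to compute; the sketch below does so for the known-variance case, using the data object dat from the simulation sketch above (an assumption of this example). The function names are ours, and this is not the CARS package implementation.

cars_pair <- function(X, Y, sig2_x, sig2_y) {
  nx <- ncol(X); ny <- ncol(Y); n <- nx + ny
  pix <- nx / n; piy <- ny / n
  xbar <- rowMeans(X); ybar <- rowMeans(Y)
  kappa <- (piy * sig2_x) / (pix * sig2_y)      # zero-covariance weight
  sig_p <- sqrt(piy * sig2_x + pix * sig2_y)    # sigma_{pi}
  scale <- sqrt(nx * ny / n)
  T1 <- scale * (xbar - ybar) / sig_p
  T2 <- scale * (xbar + kappa * ybar) / (sqrt(kappa) * sig_p)
  cbind(T1 = T1, T2 = T2)
}

# Under the null, T1 and T2 should be (essentially) uncorrelated by construction.
TT <- cars_pair(dat$X, dat$Y, dat$sig_x^2, dat$sig_y^2)
cor(TT[dat$theta1 == 0, "T1"], TT[dat$theta1 == 0, "T2"])

The near-zero correlation among the null coordinates reflects the choice of the weight $\kappa_i$, which is exactly what makes the bivariate model tractable in the next subsection.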
The mixture model (2.1.4) is difficult to analyze in general. However, if $T_{1i}$ and $T_{2i}$ are carefully constructed as in Section 2.1.1, several simplifications can be made. First, the logical relationship (2.1.2) implies that $\pi_{10} = 0$; thus only three terms remain in (2.1.4). Second, according to our construction (2.1.3), $T_{1i}$ and $T_{2i}$ are conditionally independent:
\[
f(t_{1i}, t_{2i} \mid \mu_{xi}, \mu_{yi}) = f(t_{1i} \mid \mu_{xi}, \mu_{yi})\, f(t_{2i} \mid \mu_{xi}, \mu_{yi}). \tag{2.1.5}
\]
The next proposition utilizes (2.1.5) to further simplify the model.

Proposition 1. The conditional independence (2.1.5) implies that
\[
f(t_{1i}, t_{2i} \mid \theta_{1i}=0, \theta_{2i}=0) = f(t_{1i} \mid \theta_{1i}=0)\, f(t_{2i} \mid \theta_{2i}=0),
\]
\[
f(t_{1i}, t_{2i} \mid \theta_{1i}=0, \theta_{2i}=1) = f(t_{1i} \mid \theta_{1i}=0)\, f(t_{2i} \mid \theta_{1i}=0, \theta_{2i}=1), \ \text{and}
\]
\[
f(t_{1i}, t_{2i} \mid \theta_{1i}=0) = f(t_{1i} \mid \theta_{1i}=0)\, f(t_{2i} \mid \theta_{1i}=0). \tag{2.1.6}
\]
The last equation shows that $T_{1i}$ and $T_{2i}$ are independent under the null hypothesis $H_{i,0}: \theta_{1i} = 0$, a critical result for our later methodological and theoretical developments. The joint density is given by
\[
f(t_{1i}, t_{2i}) = \pi_{00}\, f(t_{1i} \mid \theta_{1i}=0)\, f(t_{2i} \mid \theta_{2i}=0)
+ \pi_{01}\, f(t_{1i} \mid \theta_{1i}=0)\, f(t_{2i} \mid \theta_{1i}=0, \theta_{2i}=1)
+ \pi_{11}\, f(t_{1i}, t_{2i} \mid \theta_{1i}=1, \theta_{2i}=1). \tag{2.1.7}
\]

2.1.3 Problem formulation

Our goal is to make inference on $\theta_{1i} = I(\mu_{xi} \ne \mu_{yi})$, $1 \le i \le m$, by simultaneously testing the $m$ hypotheses $H_{i,0}: \theta_{1i} = 0$ vs. $H_{i,1}: \theta_{1i} = 1$. Compared to conventional approaches, we aim to develop methods that utilize the $m$ pairs $\{(T_{1i}, T_{2i}) : 1 \le i \le m\}$ instead of the single vector $\{T_{1i} : 1 \le i \le m\}$. This new problem can be recast and solved in the framework of multiple testing with a covariate: $T_{1i}$ is viewed as the primary statistic for assessing significance, and $T_{2i}$ is viewed as a covariate that assists inference by providing supporting information.

The concepts of error rate and power are similar to those in conventional settings. A multiple testing procedure is represented by a thresholding rule of the form
\[
\pmb\delta = \{\delta_i = I(S_i < t) : i = 1, \ldots, m\} \in \{0,1\}^m, \tag{2.1.8}
\]
where $\delta_i = 1$ if we reject hypothesis $i$ and $\delta_i = 0$ otherwise. Here $S_i$ is a significance index that ranks the hypotheses from most significant to least significant, and $t$ is a threshold. In large-scale testing problems, the false discovery rate (FDR, [13]) has been widely used to control the inflation of Type I errors. For a given decision rule $\pmb\delta = (\delta_i : 1 \le i \le m)$ of the form (2.1.8), the FDR is defined as
\[
\mathrm{FDR} = E\left\{ \frac{\sum_{i=1}^m (1-\theta_{1i})\delta_i}{\big(\sum_{i=1}^m \delta_i\big) \vee 1} \right\}, \tag{2.1.9}
\]
where $x \vee y = \max(x, y)$. A closely related concept is the marginal false discovery rate (mFDR), defined by
\[
\mathrm{mFDR} = \frac{E\{\sum_{i=1}^m (1-\theta_{1i})\delta_i\}}{E(\sum_{i=1}^m \delta_i)}. \tag{2.1.10}
\]
[47] showed that $\mathrm{mFDR} = \mathrm{FDR} + O(m^{-1/2})$ when the BH procedure [13] is applied to $m$ independent tests. We use the mFDR mainly for technical convenience in obtaining optimality results. Proposition 7 in Section 2.6.2 gives sufficient conditions under which the mFDR and FDR are asymptotically equivalent and shows that these conditions are fulfilled by our proposed method. Define the expected number of true positives $\mathrm{ETP} = E\big(\sum_{i=1}^m \theta_{1i}\delta_i\big)$. Other related power measures include the missed discovery rate (MDR, [106]), the average power [39], and the false negative rate (FNR, [47, 91]). Our optimality results are developed based on the mFDR and ETP. We call a multiple testing procedure valid if it controls the mFDR at the nominal level $\alpha$ and optimal if it has the largest ETP among all valid mFDR procedures.

2.2 Oracle and Data-Driven Procedures

The basic framework of our methodological development is as follows. We first consider an ideal situation in which an oracle knows all parameters in model (2.1.7). Section 2.2.1 derives an oracle procedure. Sections 2.2.2 and 2.2.3 discuss an approximation strategy and related estimation methods, with a refinement given in Section 2.2.4. The data-driven procedure and extensions are presented in Sections 2.2.5 and 2.2.6.
2.2.1 Oracle procedure with pairs of observations

The marginal density function of $T_{ji}$ is defined as $f_j = (1-\pi_j) f_{j0} + \pi_j f_{j1}$, where $\pi_j = P(\theta_{ji}=1)$ and $f_{j0}$ and $f_{j1}$ are the null and non-null densities of $T_{ji}$, respectively. Conventional FDR procedures, which are developed based on a vector of p-values or z-values, are essentially univariate inference procedures that only utilize the information in $T_{1i}$. Define the local false discovery rate (Lfdr, [41])
\[
\mathrm{Lfdr}_1(t_1) = \frac{(1-\pi_1) f_{10}(t_1)}{f_1(t_1)}, \tag{2.2.1}
\]
where the subscript "1" indicates a quantity associated with $T_{1i}$. It was shown in [102] that the optimal univariate mFDR procedure is a thresholding rule of the form
\[
\pmb\delta(\mathrm{Lfdr}_1, c) = \big[\, I\{\mathrm{Lfdr}_1(t_{1i}) < c\} : 1 \le i \le m \,\big], \tag{2.2.2}
\]
where $0 \le c \le 1$ is a cutoff. Denote by $Q_{LF}(c)$ the mFDR level of $\pmb\delta(\mathrm{Lfdr}_1, c)$, and let $c^* = \sup\{c : Q_{LF}(c) \le \alpha\}$ be the largest cutoff satisfying the mFDR constraint. Then $\pmb\delta^* = \pmb\delta(\mathrm{Lfdr}_1, c^*)$ is optimal among all univariate mFDR procedures in the sense that it has the largest ETP subject to $\mathrm{mFDR} \le \alpha$.

The next theorem derives an oracle procedure for mFDR control when the pairs $(T_{1i}, T_{2i})$ are given. We shall see that the performance of $\pmb\delta^*$, the optimal univariate procedure, can be greatly improved by exploiting the information in $T_{2i}$. The oracle procedure under the bivariate model (2.1.7) has two important components: an oracle statistic $T_{OR}^i$ that optimally pools information from both $T_{1i}$ and $T_{2i}$, and an oracle threshold $t_{OR}$ that controls the mFDR with the largest ETP.

Theorem 1. Suppose $(T_{1i}, T_{2i})$ follow model (2.1.7). Let
\[
q(t_2) = (1-\pi_1)\, f(t_2 \mid \theta_{1i} = 0). \tag{2.2.3}
\]
Define the oracle statistic
\[
T_{OR}^i(t_1, t_2) = P(\theta_{1i} = 0 \mid T_{1i} = t_1, T_{2i} = t_2) = \frac{q(t_2)\, f_{10}(t_1)}{f(t_1, t_2)}, \tag{2.2.4}
\]
where $f(t_1, t_2)$ is the joint density in (2.1.7). Then:
(a) For $0 < \lambda \le 1$, let $Q_{OR}(\lambda)$ be the mFDR level of the testing rule $\{I(T_{OR}^i < \lambda) : 1 \le i \le m\}$. Then $Q_{OR}(\lambda) < \lambda$ and $Q_{OR}(\lambda)$ is non-decreasing in $\lambda$.
(b) Suppose we choose $\alpha < Q_{OR}(1)$. Then the oracle threshold $t_{OR} = \sup\{\lambda : Q_{OR}(\lambda) \le \alpha\}$ exists uniquely and $Q_{OR}(t_{OR}) = \alpha$. Furthermore, define the oracle rule $\pmb\delta_{OR} = (\delta_{OR}^i : i = 1, \ldots, m)$, where
\[
\delta_{OR}^i = I(T_{OR}^i < t_{OR}). \tag{2.2.5}
\]
Then $\pmb\delta_{OR}$ is optimal in the sense that $\mathrm{ETP}_{\pmb\delta} \le \mathrm{ETP}_{OR}$ for any $\pmb\delta$ in $\mathcal{D}_\alpha$, where $\mathcal{D}_\alpha$ is the collection of all testing rules based on $\pmb{T}_1$ and $\pmb{T}_2$ such that $\mathrm{mFDR} \le \alpha$.

Remark 2. The oracle statistic $T_{OR}^i$ is the posterior probability that $H_{i,0}$ is true given the pair of primary and auxiliary statistics; it serves as a significance index providing evidence against the null. Section 2.2.2 discusses $q(t_2)$ in detail and explains that it roughly describes how frequently a $T_{2i}$ from the null distribution would fall into a neighborhood of $t_2$. The estimation of $T_{OR}$ and $q(t_2)$ is discussed in Section 2.2.3.

Remark 3. Theorem 1 indicates that pooling auxiliary information does not result in efficiency loss, provided that the $T_{2i}$ are carefully constructed according to the principles described in Section 2.1.1. Consider the "worst-case scenario" in which $T_{2i}$ is completely non-informative: $n_x = n_y$, $\sigma_{xi}^2 = \sigma_{yi}^2$ and $\mu_{xi} = -\mu_{yi}$. In Section 2.7.1.9 of the Supplementary Material we show that, under these conditions, $T_{OR}^i$ reduces to the Lfdr statistic (2.2.1) and the oracle (bivariate) procedure coincides with the optimal univariate rule (2.2.2). Contrary to the common intuition that incorporating $T_{2i}$ might hurt performance, Theorem 1 shows that the oracle procedure dominates the optimal univariate procedure (2.2.2); hence the power is never decreased by pooling the auxiliary information in $T_{2i}$. Further numerical evidence is provided in Section 2.4.4.

The oracle rule (2.2.5) motivates a stepwise procedure that operates in two steps: ranking and thresholding. The ranking step orders all hypotheses from most significant to least significant according to $T_{OR}$, and the thresholding step identifies the largest threshold along the ranking subject to the constraint on the FDR. Specifically, denote by $T_{OR}^{(1)} \le \cdots \le T_{OR}^{(m)}$ the ordered oracle statistics and by $H_{(1)}, \ldots, H_{(m)}$ the corresponding hypotheses. The stepwise procedure operates as follows:
\[
\text{let } k = \max\Big\{ j : \hat Q_{OR}(j) = j^{-1}\textstyle\sum_{i=1}^j T_{OR}^{(i)} \le \alpha \Big\}; \quad \text{reject } H_{(1)}, \ldots, H_{(k)}. \tag{2.2.6}
\]
The quantity $\hat Q_{OR}(j)$, the moving average of the top $j$ ordered statistics, gives an estimate of the FDR (cf. [102]). Thus the stepwise algorithm (2.2.6) identifies the largest threshold subject to the FDR constraint, and the hypotheses whose $T_{OR}^i$ fall below this threshold are declared significant.

2.2.2 Approximating $T_{OR}$ via screening

The oracle statistic $T_{OR}^i$ is unknown and needs to be estimated, but standard methods do not work well for the bivariate model. For example, the popular EM algorithm usually requires specifying a parametric form for the non-null distribution, which is often impractical in large-scale studies where little is known about the alternative; moreover, existing estimators often suffer from low accuracy and convergence issues when signals are sparse. To overcome these difficulties, we propose a new test statistic $T_{OR}^{\tau,i}$ that only involves quantities that can be estimated well from data. The new statistic provides a good approximation to $T_{OR}^i$ and guarantees FDR control.

In (2.2.4), the null density $f_{10}$ is known by construction, and the bivariate density $f(t_1, t_2)$ can be estimated well using a standard kernel method [98, 112]. Hence we focus on the quantity $q(t_2)$. Suppose we are interested in counting how frequently a $T_{2i}$ from the null distribution (i.e., $\theta_{1i} = 0$) falls into an interval in the neighborhood of $t_2$:
\[
Q(t_2, h) = \#\{i : T_{2i} \in [t_2 - h/2,\, t_2 + h/2] \text{ and } \theta_{1i} = 0\}/m.
\]
This quantity is relevant because $q(t_2) = \lim_{h \to 0} E\{Q(t_2, h)\}/h$. The counting task is difficult because we do not know the values of $\theta_{1i}$. Our idea is to first apply a screening step to select the likely nulls and then construct an estimator based on the selected cases. Denote by $P_i$ the p-value associated with $T_{1i} = t_{1i}$. For a large $\tau$, say $\tau = 0.9$, we would reasonably predict that $\theta_{1i} = 0$ whenever $P_i > \tau$, as large p-values most likely come from the null. Hence we may count those $T_{2i}$ with large p-values:
\[
Q^\tau(t_2, h) = \frac{\#\{i : T_{2i} \in [t_2 - h/2,\, t_2 + h/2] \text{ and } P_i > \tau\}}{m(1-\tau)}. \tag{2.2.7}
\]
The adjustment $(1-\tau)$ in the denominator reflects the fact that only $100(1-\tau)\%$ of the data are used in counting the frequency. Let $\mathcal{A}^\tau$ denote the set of possible $t_{1i}$ values such that $P_i > \tau$. Using $Q^\tau$ in place of $Q$, a sensible approximation of $q(t_2)$ is
\[
q^\tau(t_2) = \lim_{h \to 0} \frac{E\{Q^\tau(t_2, h)\}}{h} = \frac{\int_{\mathcal{A}^\tau} f(t_1, t_2)\,dt_1}{1-\tau}. \tag{2.2.8}
\]
Intuitively, a large $\tau$ yields a sample close to one generated from a "pure" null distribution and thus reduces the bias $q^\tau(t_2) - q(t_2)$. Our theory reveals that the bias is always positive (Proposition 2) and decreases in $\tau$ (Proposition 4). However, a larger $\tau$ increases the variability of the proposed estimator (fewer samples are available to construct it), affecting the testing procedure adversely. This bias-variance tradeoff is discussed further in Section 2.2.4.
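To make the screening idea concrete, the sketch below evaluates the counting estimate (2.2.7), scaled by the bin width, on a grid of $t_2$ values for two choices of $\tau$. The simulated pure-null pairs, the grid, the bin width, and the specific $\tau$ values are illustrative choices, not recommendations from the text.

q_tau_count <- function(T1, T2, tau, t2_grid, h = 0.2) {
  pval <- 2 * pnorm(-abs(T1))         # two-sided p-values; T1 is N(0,1) under the null
  keep <- T2[pval > tau]              # screened auxiliary statistics
  m    <- length(T1)
  sapply(t2_grid, function(t2)
    sum(abs(keep - t2) <= h / 2) / (m * (1 - tau) * h))
}

# Example: with pure null pairs, q(t2) = (1 - pi_1) * phi(t2) = phi(t2) here,
# so the counting estimate should track the standard normal density.
set.seed(5)
T1 <- rnorm(5000); T2 <- rnorm(5000)
grid <- seq(-3, 3, by = 0.5)
round(cbind(grid,
            qhat_0.8 = q_tau_count(T1, T2, 0.8, grid),
            qhat_0.9 = q_tau_count(T1, T2, 0.9, grid),
            truth    = dnorm(grid)), 3)

Comparing the two columns also illustrates the tradeoff just described: the larger $\tau$ uses fewer screened observations and therefore produces a noisier estimate.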
Substituting $q^\tau(t_2)$ in place of $q(t_2)$, we obtain the following approximation of $T_{OR}^i$:
\[
T_{OR}^{\tau,i}(t_1, t_2) = \frac{q^\tau(t_2)\, f_{10}(t_1)}{f(t_1, t_2)}. \tag{2.2.9}
\]
Some important properties of the approximation (2.2.9) are summarized in the next proposition, which shows that $T_{OR}^{\tau,i}$ always overestimates $T_{OR}^i$. Hence, if we substitute $T_{OR}^{\tau,i}$ in place of $T_{OR}^i$ in (2.2.6), fewer rejections are made, leading to a conservative FDR level.

Proposition 2. (a) $T_{OR}^i(t_1, t_2) \le T_{OR}^{\tau,i}(t_1, t_2)$ for all $\tau$. (b) Let $\pmb\delta_{OR}^\tau$ be the decision rule obtained by substituting $T_{OR}^{\tau,i}$ in place of $T_{OR}^i$ in (2.2.6). Then both the FDR and mFDR levels of $\pmb\delta_{OR}^\tau$ are controlled below level $\alpha$.

2.2.3 Estimation of the test statistic

We now turn to the estimation of $T_{OR}^{\tau,i}$. By construction, the null density $f_{10}(t_1)$ is known. The bivariate density $f(t_1, t_2)$ can be estimated using a kernel method [98, 112]:
\[
\hat f(t_1, t_2) = m^{-1} \sum_{i=1}^m K_{h_1}(t_1 - t_{1i})\, K_{h_2}(t_2 - t_{2i}), \tag{2.2.10}
\]
where $K(t)$ is a kernel function, $h_1$ and $h_2$ are bandwidths, and $K_h(t) = h^{-1} K(t/h)$. To estimate $q^\tau(t_2)$, we first carry out a screening step to obtain the sample $\mathcal{T}(\tau) = \{i : P_{1i} > \tau\}$ and then apply kernel smoothing to the selected observations:
\[
\hat q^\tau(t_2) = \frac{\sum_{i \in \mathcal{T}(\tau)} K_{h_2}(t_2 - t_{2i})}{m(1-\tau)}. \tag{2.2.11}
\]
The next proposition shows that $\hat q^\tau(\cdot)$ converges to $q^\tau(\cdot)$ in the $L_2$ norm.

Proposition 3. Consider $q^\tau$ and $\hat q^\tau$ defined in (2.2.8) and (2.2.11), respectively. Assume that (i) $q^\tau(\cdot)$ is bounded and has continuous first and second derivatives; (ii) the kernel $K$ is a positive, bounded, and symmetric function satisfying $\int K(t)\,dt = 1$, $\int t K(t)\,dt = 0$, and $\int t^2 K(t)\,dt < \infty$; and (iii) $\int_{t_1 \in \mathcal{A}^\tau} f_2^{(2)}(t_2 \mid t_1)\, f_1(t_1)\,dt_1$ is square integrable as a function of $t_2$, where $f_2(t_2 \mid t_1)$ is the conditional density of $T_2$ given $T_1$ and $f_2^{(2)}$ denotes its second derivative in $t_2$. Then, with the common choice of bandwidth $h \asymp m^{-1/6}$, we have
\[
E\|\hat q^\tau - q^\tau\|^2 = E \int \{\hat q^\tau(x) - q^\tau(x)\}^2\,dx \to 0.
\]
Combining the above results, we propose to estimate $T_{OR}$ by the statistic
\[
\hat T_{OR}^\tau(t_1, t_2) = \frac{\hat q^\tau(t_2)\, f_{10}(t_1)}{\hat f(t_1, t_2)} \wedge 1, \tag{2.2.12}
\]
where $\hat q^\tau(t_2)$ and $\hat f(t_1, t_2)$ are given in (2.2.11) and (2.2.10), respectively, and $x \wedge y = \min(x, y)$.

Remark 4. In our proposed estimator, the same bandwidth $h_2$ is used for the kernels in both (2.2.10) and (2.2.11). Using the same bandwidth in the numerator and denominator of (2.2.12) has no impact on the theory but is beneficial for increasing the stability of the estimator. More practical guidelines are provided in Section 2.4.1.

2.2.4 A refined estimator

This section develops a consistent estimator of $q(t_2)$. The proposed estimator is important for the optimality theory in Section 2.2.5; however, it is computationally intensive, it requires much stronger assumptions that should be scrutinized in practice, and the power gain is very limited. In practice we still recommend the simple estimator (2.2.11). This section may be skipped by readers who are mainly interested in the methodology.

We first state some theoretical properties of the approximation error $q^\tau(t_2) - q(t_2)$; these properties help to motivate the new estimator and to prove its consistency. Let the CDF of the p-value associated with $T_{1i}$ be $G(t) = (1-\pi_1)\,t + \pi_1 G_1(t)$, where $G_1$ is the alternative CDF. Denote by $g$ and $g_1$ the corresponding density functions.

Proposition 4. Consider $T_{OR}^\tau$ defined in (2.2.9). (a) Denote by $B_q(\tau) = \int |q^\tau(t_2) - q(t_2)|\,dt_2$ the total approximation error. If $G_1(\cdot)$ is concave, then $B_q(\tau)$ decreases in $\tau$. (b) If $\lim_{x \uparrow 1} g_1(x) = 0$, then $\lim_{\tau \uparrow 1} q^\tau(t_2) = q(t_2)$.

Remark 5. The concavity assumption (or the more general monotone likelihood ratio condition, MLRC) has been commonly used in the literature [100, 47, 102]; the MLRC should be treated with caution [29]. Assumption (b) is also a typical condition [48]; it requires that the null cases dominate on the right side of the p-value histogram. The condition holds for one-sided p-values but can be violated by two-sided p-values [79]. Developing a more general condition is left for future work.

It follows from Proposition 4 that a large $\tau$ helps to reduce the bias, and the bias converges to zero as $\tau \to 1$. However, a large $\tau$ increases the variance of the estimator (2.2.11), which is constructed from the sample $\mathcal{T}(\tau) = \{i : P_{1i} > \tau\}$. To address the bias-variance tradeoff, we propose to first obtain $\hat q^\tau$ for a range of values of $\tau$, say $\{\tau_1, \ldots, \tau_k\}$, and then use a smoothing method to extrapolate the limiting value of $\hat q^\tau$ as $\tau \to 1$. This approach borrows strength from the entire sample to minimize the bias without inflating the variance. Specifically, let $0 < \tau_1 < \cdots < \tau_k < 1$ be ordered, equally spaced points in the interval $(0,1)$, and denote by $\hat q^{\tau_j}(t_2)$ the estimates from (2.2.11), $j = 1, \ldots, k$. We propose the local linear kernel estimator $\hat q^*(t_2)$, defined as the fitted height $\hat\beta_0$ at $\tau = 1$, where $(\hat\beta_0, \hat\beta_1)$ minimizes the weighted kernel least squares criterion $\sum_{j=1}^k \{\hat q^{\tau_j} - \beta_0 - \beta_1(\tau_j - 1)\}^2 K_h(\tau_j - \tau_k)$. For a given integer $r$, denote $\hat s_r = k^{-1}\sum_{j=1}^k (\tau_j - 1)^r K_h(\tau_j - \tau_k)$. It can be shown (e.g., [112], p. 119) that
\[
\hat q^*(t_2) = \frac{k^{-1}\sum_{j=1}^k \{\hat s_2 - \hat s_1 (\tau_j - 1)\}\, K_h(\tau_j - \tau_k)\, \hat q^{\tau_j}}{\hat s_2 \hat s_0 - \hat s_1^2}. \tag{2.2.13}
\]
The next proposition shows that $\hat q^*(t_2)$ is a consistent estimator of $q(t_2)$.

Proposition 5. Consider $\hat q^\tau$ and $\hat q^*$ defined in (2.2.11) and (2.2.13), respectively. Denote $q^{\tau_k,(2)}(t_2) = (d/d\tau)^2 q^\tau(t_2)\big|_{\tau = \tau_k}$. Assume that (i) $q^{\tau_k,(2)}(t_2)$ is square integrable; (ii) $K(\cdot)$ is symmetric about zero and supported on $[-1, 1]$; and (iii) the bandwidth $h$ is a sequence satisfying $h \to 0$ and $kh \to \infty$ as $k \to \infty$. Moreover, assume that Conditions (i)-(iii) of Proposition 3 and Condition (b) of Proposition 4 hold. Then
\[
E\|\hat q^* - q\|^2 = E\int \{\hat q^*(x) - q(x)\}^2\,dx \to 0 \quad \text{as } m \to \infty \text{ and } k \to \infty. \tag{2.2.14}
\]

2.2.5 The CARS procedure

The estimated statistics $\hat T_{OR}^{\tau,i}$ are used as a significance index to rank the relative importance of the hypotheses. Motivated by the stepwise algorithm (2.2.6), we propose the following covariate-assisted ranking and screening (CARS) procedure.

Procedure 1 (CARS procedure). Consider model (2.1.7) and the estimated statistics $\hat T_{OR}^{\tau,i}$ in (2.2.12). Denote by $\hat T_{OR}^{\tau,(1)} \le \cdots \le \hat T_{OR}^{\tau,(m)}$ the ordered statistics and by $H_{(1)}, \ldots, H_{(m)}$ the corresponding hypotheses. Let
\[
k = \max\Big\{ j : j^{-1}\textstyle\sum_{i=1}^j \hat T_{OR}^{\tau,(i)} \le \alpha \Big\},
\]
and reject $H_{(1)}, \ldots, H_{(k)}$.

To ensure good performance of the data-driven procedure, we require the following conditions on the estimated quantities:
(C1) $E\|\hat q^\tau - q^\tau\|^2 \to 0$.
(C1') $E\|\hat q^* - q\|^2 \to 0$.
(C2) $E\|\hat f - f\|^2 = E\big[\iint \{\hat f(t_1, t_2) - f(t_1, t_2)\}^2\,dt_1\,dt_2\big] \to 0$.

Remark 6. Proposition 3 shows that (C1) is satisfied by the proposed estimator (2.2.11). Proposition 5 shows that (C1') is satisfied by the smoothing estimator (2.2.13) under stronger assumptions. Finally, (C2) is satisfied with the standard choices of bandwidth $h_{t_1} \asymp m^{-1/6}$ and $h_{t_2} \asymp m^{-1/6}$; see, for example, page 111 of [112] for a proof.

The asymptotic properties of the CARS procedure are established by the next theorem.

Theorem 2 (Asymptotic validity and optimality of CARS). (a)
If Conditions (C1) and (C2) hold, then both the mFDR and FDR of the CARS procedure are controlled at level +o(1). (b). If Conditions (C1 0 ) and (C2) hold, and we substitute ^ q (2.2.14) in place of ^ q in (2.2.12) to compute ^ T OR , then the FDR level of the CARS procedure is +o(1). Moreover, denote ETP CARS and ETP OR the ETP levels of CARS and the oracle procedure, respectively. Then we have ETP CARS =ETP OR = 1 +o(1). 26 2.2.6 The case with unknown variances and non-Gaussian errors For two-sample tests with unknown and unequal variances, we can estimate 2 xi and 2 yi byS 2 xi = (n x ) 1 P nx j=1 (X ij X i ) 2 and S 2 yi = (n y ) 1 P ny j=1 (Y ij Y i ) 2 ; respectively. Let ^ i = ( y S 2 xi )=( x S 2 yi ) and S 2 pi = y S 2 xi + x S 2 yi : The following pair will be used to summarize the information in the sample: (T 1i ;T 2i ) = r n x n y n X i Y i S pi ; X i + ^ i Y i p ^ i S pi ! : (2.2.15) For the case with unknown but equal variances (e.g. 2 xi = 2 yi ), we modify (2.2.15) as follows. First, ^ i is replaced by = y = x . Second, S 2 pi is instead estimated by S 2 pi = x S 2 xi + y S 2 yi : Finally T 1i and T 2i are plugged into (2.2.12) to compute the CARS statistic, which is further employed to implement Procedure 1. T 1i and T 2i are not strictly independent when estimated variances are used. The following proposition shows that T 1i and T 2i are asymptotically independent under the null. Proposition 6. Consider model (2.1.1). Assume that the error terms (possibly non-Gaussian) of X ij and Y ij have symmetric distributions and nite fourth moments. Then (T 1i ;T 2i ) dened in (2.2.15) are asymptotically independent when H i;0 : xi = yi is true. The expression for asymptotic variance-covariance matrix, which is given in Section 2.7.1.6 of the Supplementary Material, reveals that the asymptotic independence holds for non-Gaussian errors as long as the error distributions are symmetric. Our simulation results in Section 2.4.3 con- rm that unknown variances and non-Gaussian errors have almost no impact on the performance of CARS. Therefore the plug-in methods are recommended for practical applications. The case with skewed distribution requires further research, and a full theoretical justication of CARS methodology is still an open question. 27 2.3 Extensions and Connections with Existing Work This section considers the extension of our theory to a general bivariate model. The results in the general model provide a unied theoretical framework for understanding dierent testing strategies, which helps gain insights on the connections between existing methods. We substitute (T i ;S i ) in place of (T 1i ;T 2i ) in this section. This change re ects a more exible view of the auxiliary covariate: S i can be either continuous or discrete, from either internal or external data, and we do not explicitly estimate the joint density of T i andS i as done in previous sections. Sections 2.3.1 to 2.3.5 assume that T i follow a continuous distribution with a known density under the null; the case with unknown null density is discussed in Section 2.3.6. We allow S i to be either continuous or categorical and hence eliminate the notation 2i . (Previously 2i denotes the union support, which is only needed when T 2i has a density with a point mass at 0.) As a result, the subscript \1" in 1i is suppressed for notational convenience, where i = 0=1 stands for a null/non-null case. 2.3.1 A general bivariate model Suppose T i and S i follow a joint distribution F i (t;s). 
The optimal (oracle) testing rule is given by the next theorem, which can be proven similarly as Theorem 1. Theorem 3. Dene the oracle statistic under the general model T G OR (t;s) =P( i = 0jT i =t;S i =s): (2.3.1) Denote Q G OR () the mFDR level of (T G OR ;), where (T G OR ;) =fI(T G OR (t;s) < ) : 1 i mg. Let OR = supf2 (0; 1) : Q G OR () g: Dene the oracle mFDR procedure under the general model as G OR = (T G OR ; OR ). Then G OR is optimal in the sense that for any such that mFDR , we always have ETP ETP G OR . 28 Theorem 1 can be viewed as a special case of Theorem 3. However, Theorem 3 is of less practical importance as the \best" data-driven solution to Theorem 3 may depend on a number of factors such as (i) whether the auxiliary statistic is categorical or continuous; (ii) whether the null distribution of T i is xed and known; (iii) whether S i and T i are independent, and etc. The key issue is that estimatingT G;i OR is very dicult in a general bivariate model. Under some special cases, T G OR can be approximated well. For example, T OR provides a good approximation to T G OR under bivariate model (2.1.4) and the conditional independence assumption (2.1.6). When S i is categorical, the oracle procedure can also be approximated well. This important special case is discussed next. 2.3.2 Discrete case: multiple testing with groups We now consider a special case where the auxiliary covariateS i is categorical. A concrete scenario is the multi-group random mixture model rst introduced in [40]. See also [24]. The model is useful to handle large-scale testing problems where data are collected from heterogeneous sources. Correspondingly, the m hypotheses may be divided into, say, K groups that exhibit dierent characteristics. Let S i denote the group membership. Assume that S i takes values inf1; ;Kg with prior probabilitiesf 1 ; ; K g. Consider the following conditional distributions: (T i jS i =k)f k 1 = (1 k 1 )f k 10 + k 1 f k 11 ; (2.3.2) for k = 1; ;K; where k 1 is the proportion of non-null cases in group k, f k 10 and f k 11 are the null and non-null densities of T i , and f k 1 = (1 k 1 )f k 10 + k 1 f k 11 is the mixture density for all observations in group k. The model allows the conditional distributions in (2.3.2) to vary across groups; this is desirable in practice when groups are heterogeneous. Remark 7. In Section 2.7.1.10 in the supplement, we present a simple example to show that T i is not sucient (as insightfully pointed out by a reviewer), whereas (T i ;S i ) is sucient. S i is 29 ancillary in the sense that its value is determined by an external process independent from the main parameter. Contrary to the common intuition thatS i is \useless" for inference, our analysis reveals that S i can be informative. The phenomenon is referred to as the ancillarity paradox because, to quote Lehmann (page 420, 70), \the distribution of the ancillary, which should not aect the estimation of the main parameter, has an enormous eect on the properties of the standard estimator." A related phenomenon in the estimation context was discussed by [43]. See also the seminal work by [23] for a paradox in multiple regression. The problem of multiple testing with groups and related problems have received substantial attention in the literature (40, 42, 24, 63, 72, 4). It can be shown that under model (2.3.2), the oracle statistic (2.3.1) is reduced to the conditional local false discovery rate (CLfdr, Cai and Sun 2009) CLfdr i = (1 k 1 )f k 10 (t i )=f k 1 (t i ) for S i =k, k = 1; ;K. 
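For a concrete sense of the group-wise statistic, the sketch below computes a CLfdr-type quantity for z-values with a theoretical N(0,1) null in each group. The Storey-type proportion estimate and the kernel density plug-in are simple stand-ins chosen for brevity; they are not the estimators studied in Cai and Sun (2009).

```r
# A minimal sketch of a group-wise CLfdr-type statistic for z-values with a
# theoretical N(0,1) null in every group; the proportion and density estimators
# below are simple stand-ins for illustration.
clfdr_sketch <- function(z, group, tau = 0.5) {
  out <- numeric(length(z))
  for (k in unique(group)) {
    idx  <- which(group == k)
    zk   <- z[idx]
    pk   <- 2 * pnorm(-abs(zk))
    pi0k <- min(1, mean(pk > tau) / (1 - tau))   # estimated null proportion in group k
    fk   <- approx(density(zk), xout = zk)$y     # kernel estimate of the group mixture
    out[idx] <- pmin(pi0k * dnorm(zk) / fk, 1)   # (1 - pi_1^k) f_10^k(t) / f_1^k(t)
  }
  out
}
```

In the procedure of Cai and Sun (2009), these group-wise values are then pooled and ranked jointly across groups, rather than thresholded separately within each group.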
The CLfdr statistic can be accurately estimated from data when the number of tests is large in separate groups. However, the CLfdr statistic cannot be well estimated when the number of groups is large, or when S i becomes a continuous variable. An important recent progress for exploiting the grouping and hierarchical structures among hypotheses under more generic settings has been made in Liu et al. (2016), wherein an interesting decomposition of the oracle statistic was derived: T i OR (t;s) = 1f1P ( 2i = 0jt;s)gf1P ( 1i = 0jt;s; 2i = 1)g: The decomposition explicitly shows how the auxiliary statistic can be used to adjust the Lfdr statistic, and provides insights on how the grouping eects S i and individual eects T i interplay in simultaneous testing. The logical correlation (2.1.2) can be conceptualized as a hierarchical constraint and exploited more eciently (93). The result on multiple testing with groups motivates an interesting strategy to approximate the oracle rule. For a continuous auxiliary covariate S i , we can rst discretizeS i , then divide the hypotheses into groups according to the discrete variable, and nally apply group-wise multiple 30 testing procedures. This idea is closely related to the uncorrelated screening (US) method in Liu (2014); the connection is discussed next. 2.3.3 Discretization and uncorrelated screening The idea in [71] involves discretizing a continuous S i as a binary variable. Dene index sets G 1 =f1im :S i >g andG 2 =f1im :S i g, where the tuning parameter divides t 1i 's into two groups: T 1 (G 1 ) =ft i : i2G 1 g and T 1 (G 2 ) =ft i : i2G 2 g: The uncorrelated screening (US, Liu 2014) method operates in two steps. First, the BH method (13) is applied at level to the two groups separately, and then the rejected hypotheses from two groups are combined as the nal output. The tuning parameter is chosen in a way such that it yields the largest number of rejections (two groups combined). US is closely related to the separate analysis strategy proposed in [40]. The key dierence is that the groups are known a priori in Efron (2008), whereas the groups are chosen adaptively in Liu (2014). The main merit of US is that the screening statistic is constructed to be uncorrelated with the test statistic, which ensures that the selection bias issue can be avoided. Moreover, the divide-and-test strategy combines the results in both groups; this is dierent from conventional independent ltering approaches [[22]], in which one group is completely ltered out. We now compare dierent methods under a unied framework. Both US and CARS can be viewed as approximations of the oracle rule (2.3.1). The goal is to borrow information from the external covariate S i to improve the eciency of simultaneous inference. US adopts the divide-and-test strategy and only models S i as a binary variable. It suers from information loss in the discretization step. Specically, the auxiliary variables S i can be used to reveal other useful data structures in addition to sparsity. Consider a toy example where the cases on the union support can be divided into two types, respectively characterized by low and high baseline activities; and among the more active ones, a larger proportion would exhibit dierential levels between the two conditions. Intuitively the auxiliary statistics can be used to identify three 31 groups, respectively with no, low and high activities. 
Hence the two-group strategy utilized by US can be potentially outperformed by a three-group strategy that captures the underlying data structure more eectively. In practice, the data structure can be complex and nding the \best" grouping is tricky; this sheds lights on the superiority of CARS, for it fully utilizes the auxiliary data by modeling S i as a continuous variable. The general framework suggests several directions in which US may be improved. First, the information of S i may be better exploited, e.g. by creating more groups. However, it remains unknown how to choose the optimal number of groups. Second, US tests the hypotheses at FDR level for both groups. However, [24] showed that the choice of identical FDR levels across groups can be suboptimal. In order to maximize the overall power, dierent group-wise FDR levels should be chosen. However, no matter how smart a divide-and-test strategy may be, discretizing a continuous covariate would inevitably result in information loss and hence will be outperformed by CARS. 2.3.4 The \pooling within" strategy for information integration In Philosophy and Principles of Data Analysis (109), Tukey coined two witted terms to advocate some of his favorite information integration strategies: borrowing strength and pooling within. The idea of borrowing strength, which was investigated extensively and systematically by researchers in both simultaneous estimation and multiple testing elds, has led to a number of impactful the- ories and methodologies exemplied by the James-Stein's estimator (64) and local false discovery rate methodology (41). By contrast, the direction of \pooling within" has been less explored. Tukey described it, in a very dierent scenario from ours, as a two step strategy that involves rst gathering quantitative indications from \within" dierent parts of the data, and then \pooling" these indications into a single overall index (109, pp. 278). Our work formalizes a theoretical framework for the \pooling within" idea in the context of two-sample multiple testing: rst con- structing multiple indications from within the data (i.e. independent and comparable pairs), and 32 second deriving an overall index (i.e. the oracle statistic) that optimally combines the evidences exhibited from both statistics. Our work diers in several ways from existing works on multiple testing with covariate (42, 119, 96). First, the covariate in other works is collected externally from other data sources, whereas the auxiliary information in our work is gathered internally within the primary data set. Second, in other works it has been assumed that the null density would not be aected by the external covariate. However, the assumption should be scrutinized in practice as it may not always hold. Under our testing framework, the requirement of a xed null density is formalized as the conditional independence between the primary and auxiliary statistics. The conditional independence is proposed as a principle for information extraction, and is automatically fullled by our approach to constructing the auxiliary sequence. CARS makes several new methodological and theoretical contributions. First, under a decision- theoretic framework, the oracle CARS procedure is shown to be optimal for information pooling. Second, existing methodologies on testing with covariate are mostly developed under the Bayesian computational framework and lack theoretical justications. 
By contrast, our data-driven CARS procedure is a non-parametric method and enjoys nice asymptotic properties. Such FDR theories, as far as we know, are new in the literature. Third, the screening approach employed by CARS reveals interesting connections between sparsity estimation and multiple testing with covariates, which is elaborated next.

2.3.5 Capturing sparsity information via screening

A celebrated finding in the FDR literature is that incorporating the estimated proportion of non-nulls [$\pi_1 = P(\theta_i = 1)$] can improve the power (15, 100, 47). In light of $S_i$, the proportion becomes heterogeneous; hence it is desirable to utilize the conditional proportions $\pi_1(s) = P(\theta_i = 1 \mid S_i = s)$ to improve the power of existing methods (119, 96). In a similar vein, earlier works on multiple testing with groups (or discrete $S_i$) reveal that varied sparsity levels across groups can be exploited to construct more powerful methods (42, 24, 63). Estimating $\pi_1(s)$ with a continuous covariate is a challenging problem. Most existing works (119, 96) employ Bayesian computational algorithms that do not provide theoretical guarantees. Notable progress has been made by [21]. However, their theory requires a correct specification of the underlying regression model, which cannot be checked in practice. Next we discuss how the screening idea in Sections 2.2.3 and 2.2.4 can be extended to derive a simple and elegant non-parametric estimator of $\pi_1(s)$.

Fig. 2.1 gives a graphical illustration of the proposed estimator. We generate $m = 10^5$ tests with $\bar X_i \sim N(0,1)$ and $\bar Y_i \sim 0.8\,N(0,1) + 0.2\,N(2,1)$; hence $T_i = (1/\sqrt{2})(\bar X_i - \bar Y_i)$ and $S_i = (1/\sqrt{2})(\bar X_i + \bar Y_i)$. Suppose we are interested in counting how many $S_i$ would fall into the interval $[t_2 - h, t_2 + h]$ with $t_2 = 2$ and $h = 0.3$. The counts are represented by vertical bars in Panel (a) for each $p$-value interval. As shown in Proposition 1, $T_i$ and $S_i$ are independent under the null [cf. Equation (2.1.6)]. Therefore we can see that the counts of $S_i$ are roughly uniformly distributed when the $p$-value of $T_i$ is large. Expanding the interval $[t_2 - h, t_2 + h]$ to the entire real line (which corresponds to discarding the information in $S_i$), we obtain the histogram of all $p$-values [Panel (b)].

[Figure 2.1 appears here. Panel (a): counts of $T_{2i}$ in $[t_2 - h, t_2 + h]$ plotted against the $p$-value of $T_1$, with the screening threshold $\tau = 0.5$ marked. Panel (b): histogram of all $p$-values.]

Figure 2.1: A graphical illustration of the smoothing estimator (2.3.3). (a) The counts of $S_i$ are uniformly distributed on the right. The bias decreases and the variability increases when $\tau$ increases. (b) The histogram of all $p$-values. Similarly, the $p$-values are approximately uniformly distributed on the right.

We start with a description of a classical estimator [95, 100] for $\pi_1$; see [68] for an elaborated discussion of various extensions. Let $Q(\tau) = \#\{P_i > \tau\}$. Then, by Fig. 2.1(b), the expected counts covered by the light grey bars to the right of the threshold can be approximated as $E\{Q(\tau)\} = m(1-\pi_1)(1-\tau)$. Setting the expected and actual counts equal, we obtain
$$\hat\pi_1 = 1 - Q(\tau)/\{m(1-\tau)\}.$$
Next we consider the conditional proportion $\pi_1(s) = P(\theta_i = 1 \mid S_i = s)$. Assume that $\pi_1(s)$ and $f(s)$, the density of $S_i$, are constants in a small neighborhood $[s - h/2, s + h/2]$.
Then the expected counts of the p-values from the null distribution in the interval [sh=2;s +h=2] can be approximated by Q (s;h)f1 1 (s)gf(s)h: The other way of counting can be done using (2.2.7) in Section 2.2.2. In obtaining (2.2.7), we exploit the fact that the counts S i are roughly uniformly distributed to the right of the threshold. Taking the limit, we obtain 1 (s) = 1ff(s)g 1 lim h!0 Q (s;h)=h = 1q (s)=f(s): Finally, utilizing the screening approach (2.2.11), we propose the following nonparametric smoothing estimator ^ 1 (s) = 1 P i2T() K h (ss i ) (1) P m i=1 K h (ss i ) : (2.3.3) Choosing tuning parameter is an important issue but has gone beyond the scope of the current work; see [100, 68] for further discussions. 2.3.6 Heterogeneity, correlation and empirical null Conventional FDR analyses treat all hypotheses exchangeably. However, the hypotheses become \unequal" in light of S i , and it is desirable to incorporate, for example, the varied conditional proportions in a testing procedure to improve the eciency. This section further discusses the case where the heterogeneity is re ected by disparate null densities. A key principle in our construction is that the primary and auxiliary statistics are conditionally independent under the null. However, in many applications where the auxiliary information is collected from external data,S i may be correlated withT i . Then the FDR procedure may become invalid ifS i is incorporated inappropriately. For example, if the grouping variable S i is correlated 35 with the p-value, then applying Benjamini-Hochberg's (BH) procedure to hypotheses in separate groups would be problematic because the null distributions of the p-values in some groups may no longer be uniform. A partial solution to resolve the issue is to estimate the empirical null distributions (37, 67) for dierent groups, instead of using the theoretical null directly. The theory and methodology in [40] and [24], which allow the use of varied empirical nulls across dierent groups, can be applied to control the FDR. However, as we previously mentioned, discretizing a continuous S i fails to fully utilize the auxiliary information. The estimation of the empirical null with a continuousS i is an interesting problem for future research. The nonparametric smoothing idea in (2.3.3) might be helpful but additional diculties may arise. The limitations of the current methodology and open questions will be discussed in Section 2.5. 2.4 Numerical Results This section investigates the numerical performance of CARS using both simulated and real data. We compare the oracle and data-driven CARS procedures, respectively denoted by OR and DD, with existing methods, including the Benjamini-Hochberg procedure (BH, Benjamin and Hochberg, 1995), the adaptive z-value procedure (AZ, Sun and Cai 2007), and the uncorrelated screening procedure (US, Liu 2014). We rst describe the implementation of CARS in Section 2.4.1. Sections 2.4.2 and 2.4.3 respectively consider (i) the case with known and unequal vari- ances and (ii) the case with estimated variances and non-Gaussian errors. Section 2.4.4 provides numerical evidence to show the merit of CARS when the two means have opposite signs. An application to Supernova Detection is discussed in Section 2.4.5. Additional numerical results, including the non-informative case (Section B.2), completely informative case (Section B.3) and dependent tests (Section B.6), are provided in Section B of the Supplementary Material. 
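Before turning to the numerical results, note that the screening-based estimator (2.3.3) of the conditional proportion derived above can be written in a few lines. The sketch below uses a Gaussian kernel and an illustrative bandwidth rule; both are assumptions rather than the choices made in the CARS package.

```r
# A minimal sketch of the screening-based estimator (2.3.3) of the conditional
# non-null proportion pi_1(s); the Gaussian kernel and the bandwidth rule are
# illustrative only.
pi1_conditional <- function(s_eval, s, pval, tau = 0.9, h = NULL) {
  if (is.null(h)) h <- 1.06 * sd(s) * length(s)^(-1/5)
  K    <- function(u) dnorm(u / h) / h
  keep <- which(pval > tau)                    # screened (approximate) null cases
  est  <- sapply(s_eval, function(s0) {
    1 - sum(K(s0 - s[keep])) / ((1 - tau) * sum(K(s0 - s)))
  })
  pmax(0, pmin(1, est))                        # clip to [0, 1] to absorb sampling noise
}
```

Evaluating the function on a grid of s values gives a smooth picture of how the non-null proportion varies with the auxiliary covariate.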
36 2.4.1 The implementation and R-package CARS The R-package CARS has been developed to implement the proposed method. This section de- scribes implementation details and some practical guidelines. The bivariate density estimator ^ f(t 1 ;t 2 ) can become unstable in very sparse settings, which may lead to slightly elevated FDR level (cf. top left panel in Figure 2.2). To increase the stability of CARS in the extremely sparse setting where the non-null proportion is vanishingly small, the CARS package has included a \sparse" option, which implements a conservative but more stable density estimator ^ f (t 1 ;t 2 ) = (1 ^ 2 )f 10 (t 1 )f 20 (t 2 ) +G L () n 1 \ FDR 2 () o ^ f(t 1 ;t 2 j d Lfdr 2 <): (2.4.1) HereG L () =m 1 P m i=1 IfLfdr 2 (t 2i <)g is an empirical CDF, [ FDR 2 () is the estimated FDR level, and is the screening level. The rst term on the right hand side of (2.4.1) is based on known densities, which stabilizes the bivariate density estimate in regions with few observations. Our numerical studies in Section 2.7.2.3 show that the choice of has little impact in the range of 0.1 to 0.3; the default choice in the CARS package is = 0:1. To estimate the bivariate density f(t 1 ;t 2 jLfdr 2 < ), we apply the R package ash to the sampleT =ft 2i : d Lfdr(t 2i ) < g. We explain in Section 2.7.1.11 that the screening step would underestimate f(t 1 ;t 2 ) and hence lead to a conservative FDR control. The performance of the modied density estimator is investigated in Section 2.4.3 for extremely sparse case (including k = 0). For the global null case, one may consider a hybrid strategy as done in [36] that include a global testing step (34, 25) to test the hypothesis that all eects are zero; run CARS if the global null is rejected. In estimating (2.2.11), the CARS package uses the Lfdr as the screening statistic (as opposed to thep-values). A correction factor similar to 1 is needed and can be easily computed from data. Related formulae and computational details are described in Section 2.7.1.12 in the Supplement. Although thep-values lead to simpler and more intuitive descriptions of the methodology, we found 37 that screening via Lfdr leads to improved stability in tuning for nite samples. The intuition is that the Lfdr, which contains information about the sparsity and non-null density, provides a testing rule that is more adaptive to the data. Section 2.7.2.3 in the Supplement investigates the choice of the tuning parameter when the Lfdr is used. In general as increases, the FDR is closer to the nominal level but the stability decreases. The default choice in our package is = 0:9, which has been used in all our simulations. The R package np is used to choose the bandwidths h 1 and h 2 in equation (2.2.10). We have adopted two strategies to improve the performance. First, the bandwidth h 1 and h 2 are chosen based on the normal reference rule restricted to the samples with Lfdr 1 < 0:5. The restriction leads to more informative bandwidths as this subset is the more relevant part of the sample for the multiple testing problem. Second, the sameh 2 has been used in (2.2.11) to obtain the numerator of (2.2.12). This strategy is helpful to increase the stability of the estimator [cf. Remark 4]. Finally we note that these strategies are only practical guidelines in nite samples; the asymptotic theories are not aected. 2.4.2 Simulation I: known variances Consider model (2.1.1). 
Denote $\mu_{x,i_1:i_2} = (\mu_{x,i_1}, \ldots, \mu_{x,i_2})$ and $\mu_{y,i_1:i_2} = (\mu_{y,i_1}, \ldots, \mu_{y,i_2})$ the vectors of consecutive mean components from $i_1$ to $i_2$. The two original samples are denoted $\{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_{n_x}\}$ and $\{\boldsymbol{y}_1, \ldots, \boldsymbol{y}_{n_y}\}$, with corresponding means $\mu_x$ and $\mu_y$. Let $\sigma_{xi} = 1$, $\sigma_{yi} = 2$, $n_x = 50$ and $n_y = 60$. Our simulations use $m = 5000$ and FDR level $\alpha = 0.05$. We consider the following three settings, where different methods are applied to simulated data and the results are averaged over 500 replications. The FDR and average power (the proportion of differential effects that are correctly identified) are plotted as functions of the varied parameter values and displayed in Figure 2.2.

Setting 1: we set $\mu_{x,1:k} = 5/\sqrt{30}$, $\mu_{x,(k+1):(2k)} = 4/\sqrt{30}$, $\mu_{x,(2k+1):m} = 0$, $\mu_{y,1:k} = 2/\sqrt{30}$, $\mu_{y,(k+1):(2k)} = 4/\sqrt{30}$ and $\mu_{y,(2k+1):m} = 0$. Here $k$ denotes the sparsity level: the proportion of locations with differential effects is $k/m$, and the proportion of nonzero locations is $(2k)/m$. We vary $k$ from 100 to 1000 to investigate the effect of sparsity.

Setting 2: we use $k_1$ and $k_2$ to denote the number of nonzero locations and the number of locations with differential effects, respectively. Let $\mu_{x,1:k_2} = 5/\sqrt{30}$, $\mu_{x,(k_2+1):k_1} = 4/\sqrt{30}$, $\mu_{x,(k_1+1):m} = 0$, $\mu_{y,1:k_2} = 2/\sqrt{30}$, $\mu_{y,(k_2+1):k_1} = 4/\sqrt{30}$ and $\mu_{y,(k_1+1):m} = 0$. We fix $k_1 = 2000$ and vary $k_2$ from 100 to 1000. This setting investigates how the informativeness of the auxiliary covariate affects the performance of different methods. Note that as $k_2$ increases, the conditional probability $\pi_{1|1} = P(\theta_{1i} = 1 \mid \theta_{2i} = 1)$ also increases, and the auxiliary covariate becomes more informative.

Setting 3: we fix $k = 750$ and set $\mu_{x,1:k} = \mu_0/\sqrt{30}$, $\mu_{x,(k+1):(2k)} = 3/\sqrt{30}$, $\mu_{x,(2k+1):m} = 0$, $\mu_{y,1:k} = 2/\sqrt{30}$, $\mu_{y,(k+1):(2k)} = 3/\sqrt{30}$ and $\mu_{y,(2k+1):m} = 0$. To investigate the impact of the effect sizes, we vary $\mu_0$ from 3.5 to 5.

We can see that the CARS procedure is more powerful than conventional univariate methods such as BH and AZ, and is superior to US, which only partially utilizes the auxiliary information. A more detailed description of the simulation results is given below. (a) All methods control the FDR at the nominal level 0.05 approximately; BH is slightly conservative and US is very conservative. (b) Univariate methods (BH and AZ) are improved upon by bivariate methods (US and CARS) in most settings, which shows that exploiting the auxiliary information is helpful. (c) US is uniformly dominated by CARS. This is expected because US only models $T_{2i}$ as a binary variable, whereas CARS fully utilizes the information in $T_{2i}$. (d) DD has similar performance as OR in most settings. However, DD can be conservative in FDR control in some settings and hence has less power compared to OR (cf. Setting 3, bottom row of Figure 2.2). This has been predicted by our theory (Proposition 5). (e) Setting 1 shows that the gain in efficiency (of bivariate methods over univariate methods) decreases as $k$ (or the sparsity level) increases. (f) Setting 2 shows that the efficiency gain of CARS increases when $k_2$ increases. Note that $k_2$ is proportional to $\pi_{1|1}$ (the informativeness of the auxiliary covariate). (g) Setting 3 shows that the efficiency gain of CARS increases as the signal strength decreases (note that a smaller $\mu_0$ corresponds to a larger difference in effect sizes).

[Figure 2.2 appears here: for each of Settings 1-3, one panel compares FDR levels and one compares average power across DD, BH, OR, AZ and US.]

Figure 2.2: Two-sample tests with known variances. The FDR and average power are plotted against varied non-null proportions (top row), conditional proportions (middle row) and effect sizes (bottom row) for DD, BH, OR, AZ and US.

2.4.3 Simulation II: estimated variances and non-Gaussian errors

We consider simulation settings similar to the previous section, with three modifications. First, we substitute the estimated variances in place of the known variances. Second, to investigate the performance of our method with non-Gaussian errors, we modify Setting 3 slightly by generating $\epsilon_{xij}$ and $\epsilon_{yik}$ from $t$-distributions with df = 4 and df = 5, respectively. Finally, we vary $k$ from 1 to 200 to investigate the performance of CARS under different sparsity regimes. The modified density estimator defined in (2.4.1) has been used in all settings.

The simulation results are summarized in Figure 2.3. The patterns are very similar to those in Simulation I; our conclusions on the comparison of different methods remain the same. We mention the following points. (a) Settings 1-2 show that CARS works well with estimated variances. (b) Setting 3 shows that CARS is robust to the Gaussian assumption. (c) Under the very sparse setting, the modified CARS procedure is conservative for FDR control but still outperforms competing methods.

[Figure 2.3 appears here: for each setting, one panel compares FDR levels and one compares average power across DD, BH, OR, AZ and US.]

Figure 2.3: Two-sample tests with unknown variances and non-Gaussian errors. The FDR and average power are plotted against varied non-null proportions (top row) and conditional proportions (bottom row). The displayed procedures are DD, BH, OR, AZ and US.

2.4.4 Simulation III: means with opposite signs

Our testing framework utilizes $T_{2i}$ as auxiliary statistics to assist inference. It is possible that $T_{2i}$ may be non-informative, but this auxiliary information cannot hurt. This important point has been explained in Remark 3; see also Section 2.7.1.9 in the Supplementary Material. Here we provide numerical evidence to support the claim.

Consider a setting in which the two means have opposite signs. We shall see that CARS outperforms univariate methods as long as the two means do not cancel each other out precisely. This confirms our claim that CARS, which benefits from an enhanced signal-to-noise ratio by exploiting the auxiliary data, always dominates the univariate methods. Let $\epsilon_{xij} \sim N(0,1)$ and $\epsilon_{yik} \sim N(0,1)$ be iid noise. Set $n_x = n_y = 50$. In our simulation, the number of tests is $m = 10{,}000$.
The two mean vectors are given below:
$$\mu_{x,1:500} = 3/\sqrt{50}, \quad \mu_{x,501:1000} = 4/\sqrt{50}, \quad \mu_{x,1001:m} = 0;$$
$$\mu_{y,1:500} = 3c/\sqrt{50}, \quad \mu_{y,501:1000} = 4/\sqrt{50}, \quad \mu_{y,1001:m} = 0.$$
We vary $c$ from $-1$ to $0$, where $c = -1$ corresponds to the least favorable situation in which the two means cancel out precisely. We apply the BH procedure (BH), the oracle CARS procedure (OR) and the data-driven CARS procedure (DD) to the simulated data sets. The FDR and power are obtained based on 200 replications. The simulation results are summarized in Figure 2.4. We can see that when $c = -1$, all methods have similar power. As $c$ increases to zero, the power gain of CARS becomes more pronounced.

[Figure 2.4 appears here: FDR level comparison and power comparison as functions of $c$ for DD, BH and OR.]

Figure 2.4: Comparison of BH, DD and OR when the nonzero means have opposite signs.

2.4.5 Application in Supernova Detection

This section applies CARS to the analysis of time course satellite imaging data. Figure 2.5 shows the time course g-band images of galaxy M101 collected by the Palomar Transient Factory survey (69). The images indicate the appearance of SN 2011fe, one of the brightest supernovas known to date (81). A major goal of our analysis is to detect the discrepancies between images taken over time so that we can narrow down the potential locations for supernova explosions. More accurate measurements and further investigations will then be carried out on the narrowed subset of potential locations.

[Figure 2.5 appears here: three g-band images of galaxy M101.]

Figure 2.5: From left to right, images are taken respectively on August 23, 24, and 25, 2011. The arrows clearly indicate the explosion of SN 2011fe.

The satellite data are recorded and converted into gray-scale images of size $516 \times 831$ (or $m = 428{,}796$ pixels). Each pixel corresponds to a value ranging from 0 to 1 that indicates the intensity level of the influx from stars. We use image 1 as the baseline. Its gray-scale pixel values are subtracted from those in images 2 and 3. These differences are then vectorized as $\boldsymbol{x} = (x_1, \ldots, x_m)$ and $\boldsymbol{y} = (y_1, \ldots, y_m)$ (respectively representing "image 2 $-$ image 1" and "image 3 $-$ image 1"). We plot the histograms and find that the null distributions of $\boldsymbol{x}$ and $\boldsymbol{y}$ are different. This can be explained by the lapse in times at which these images are taken (the brightness of these g-band images changes gradually over time). The supernova data have a significant amount of background noise in each image, and the average magnitudes of the background noise vary a lot from image to image. To remove the image-specific background noise, we first estimate the empirical null distributions based on the center part of the histograms. The variances of the observations are assumed to be homoscedastic and are estimated using all pixels. For $\boldsymbol{x}$ and $\boldsymbol{y}$, we obtain $N(0.0028, 0.0023)$ and $N(0.044, 0.0027)$, respectively. We then standardize the observations as $\boldsymbol{x}^{st}$ and $\boldsymbol{y}^{st}$, which are used in our analysis. We do not take the difference $\boldsymbol{x} - \boldsymbol{y}$ directly as it would create many false signals due to the varied magnitudes of the background noise. The standardized measurements $\boldsymbol{x}^{st}$ and $\boldsymbol{y}^{st}$ from the two images seem to be comparable, as most pairs $(x^{st}_i, y^{st}_i)$ have similar values.

Next we carry out $m = 428{,}796$ two-sample tests with known variances. For the standardized observations $\boldsymbol{x}^{st}$ and $\boldsymbol{y}^{st}$ the variances are known to be 1. Then $t_{1i} = (x^{st}_i - y^{st}_i)/\sqrt{2}$ and $t_{2i} = (x^{st}_i + y^{st}_i)/\sqrt{2}$. We apply the BH, AZ, US and CARS procedures at FDR levels 0.01%, 1% and 5%. Figure 2.12 in the Supplementary Material shows the rejected pixels in the $516 \times 831$ layout for each method under different FDR levels. The estimated sparsity levels for $\boldsymbol{t}_1$ and $\boldsymbol{t}_2$ are respectively 1.47% and 49.5%. The corresponding estimated support size at $\tau = 0.5$ is 6285. We report the thresholds of the different testing procedures in Table 2.1.

Table 2.1: Thresholds and total number of rejections (in parentheses) of different testing procedures.

  FDR level | BH procedure              | Adaptive Z procedure     | US (two thresholds) | CARS
  $10^{-4}$ | $3.66\times10^{-10}$ (4)  | $2.75\times10^{-4}$ (5)  | 3.24, 5.46 (22)     | $4.38\times10^{-4}$ (35)
  0.01      | $9.91\times10^{-7}$ (22)  | $3.37\times10^{-2}$ (24) | 2.51, 4.87 (38)     | $9.25\times10^{-2}$ (58)
  0.05      | $7.38\times10^{-6}$ (64)  | 0.39 (69)                | 1.92, 4.42 (80)     | 0.26 (109)

We can see that more information can be harvested from the data by using the auxiliary information. In particular, at FDR level 0.01%, the supernova is missed by BH and AZ but detected by CARS. To further quantify our procedure's superiority, we count the total number of rejections in Table 2.1 (numbers in parentheses). We can see that CARS consistently detects more signals from the satellite images than competing methods.

2.5 Discussion

Covariate-assisted multiple testing can be viewed as a special case of a much broader class of problems under the theoretical framework of integrative simultaneous inference, which covers a range of topics including multiple testing with external or prior domain knowledge (14, 9), partial conjunction tests and set-wise inference (11, 103, 35), and replicability analysis (58, 57). A coherent theme in these problems is to combine information from multiple sources to make more informative decisions. Tukey's "pooling within" strategy provides a promising approach in such scenarios, where quantitative indications might be hidden in various parts of massive data sets. The current formulations and methodologies in integrative data analysis differ substantially, and a general theory and methodology is yet to be developed for handling the various types of problems in a unified framework. For instance, in weighted FDR analysis (14), the external domain knowledge is incorporated as weights in modified FDR and power definitions to reflect the varied gains and losses in decisions. By contrast, covariate-assisted multiple testing still utilizes unweighted FDR and power definitions. The connection of CARS to theories on optimal weights is still an open issue (85, 87). Moreover, in partial conjunction tests and replicability analysis, the summary statistics from different studies are of equal importance, which marks a key difference from covariate-assisted inference, where some statistics are primary while others are secondary. We conclude our discussion with a few more open issues.

• Are there better ways to construct the auxiliary sequence? Our theory only shows that CARS is optimal when the pairs are given. How to construct an optimal pair from data is still an open question. For instance, in situations where the two means have opposite signs, the sum of absolute values may better capture the sparsity information but would give rise to a correlated pair, which cannot be handled by the current testing framework.
However, the result is based on very limited empirical studies, which lack theoretical supports. An important direction is to develop new theory and methodology for the dependent case. • How to generalize the idea to settings where the null distribution is unknown? This im- portant situation may arise from the classical two-sample tests where the null distribution is calibrated with permutations. We conjecture that the CARS procedure, which requires explicit form of the null density, may be tailored by using a dierent, probably more ad hoc, approximation. For example, informative weights may be derived from the auxiliary data and incorporated into the permutation basedp-values via some grouping and weighting strategy. • How to construct the auxiliary sequence in more general settings? This article focuses on the two-sample tests. It would be of interest to extend the methodology to simultaneous ANOVA tests. Moreover, CARS provides a useful strategy for extracting the sparsity structure from data. There are other important structures in the data such as heteroscedasticity, hierarchy and dependency, which may also be captured by an auxiliary sequence. It remains an open question on how to extract and incorporate such structural information eectively to improve the power of a testing procedure. • How to summarize the auxiliary information in high-dimensional settings? The proposed CARS methodology requires the joint modeling of the primary and auxiliary statistics, which cannot handle many covariate sequences because the joint density estimator would greatly suer from high-dimensionality. A fundamental issue is to develop new principles for information pooling, or optimal dimension reduction, in multiple testing with high- dimensional covariates. 47 • How to make inference with multiple sequences? In partial conjunction tests and replicability analysis, an important feature is that the means (of summary statistics) from separate studies are individually sparse. We expect that similar strategies for extracting and sharing sparsity information among multiple sequences would improve the accuracy of simultaneous inference. However, as opposed to covariate-assisted inference where there is a sequence of primary statistic, in partial conjunction tests and replicability analysis all sequences are of equal importance, which poses new challenges for problem formulation and methodological development. 2.6 Proofs of Main Theorems This section proves the main theorems. The proofs of other propositions are provided in the Supplementary Material. 2.6.1 Proof of Theorem 1 Proof. We rst show that the two expressions of T i OR in (2.2.4) are equivalent. Recall q (t2) = (1 1)f (t2j1i = 0). Applying Bayes theorem and using the conditional independence between T1i andT2i under the null 1i = 0 (Proposition 1), we obtain T i OR (t1;t2) = P(1i = 0)f(t1;t2j1i = 0) f(t1;t2) = q (t2)f10(t1) f(t1;t2) : Part (a). Let QOR(t) =t. We rst show that t <t. According to the mFDR denition, E (T T T 1 ;T T T 2 ) P m i=1 (T i OR t)I(T i OR <t) = 0; (2.6.1) where the subscript (T T T 1;T T T 2) indicates that the expectation is taken over the joint distribution of (T T T 1;T T T 2). The above equation implies thatt <t; otherwise all terms in the summation on its LHS would be either zero or negative. 48 Next we show that QOR(t) is monotone in t. Let QOR(tj ) = j for j = 1; 2. We only need to show that if t1 <t2, then 12. We argue by contradiction. 
If 1 >2, then (T i OR 2)I(T i OR <t2) = (T i OR 2)I(T i OR <t1) + (T i OR 2)I(t1T i OR <t2) (T i OR 1)I(T i OR <t1) + (12)I(T i OR <t1) +(T i OR 1)I(t1T i OR <t2): Next take expectations on both sides and sum over all i. We claim that E T T T 1 ;T T T 2 P m i=1 (T i OR 2)I(T i OR <t2) > 0: (2.6.2) The above inequality holds since (i)E T T T 1 ;T T T 2 P m i=1 (T i OR 1)I(T i OR <t1) = 0, (ii)E T T T 1 ;T T T 2 P m i=1 (12)I(T i OR <t1) > 0, and (iii) E T T T 1 ;T T T 2 P m i=1 (T i OR 1)I(t1T i OR <t2) > 0; which are respectively due to (2.6.1), the assumption that 1 > 2 and the fact 1 < t1. However, (2.6.2) is a contradiction to our denition of 2, which implies that E T T T 1 ;T T T 2 P m i=1 (T i OR 2)I(T i OR <t2) = 0: Hence we must have 12. Part (b). The oracle threshold is dened as tOR = sup t ft 2 (0; 1) : QOR(t) g: We want to show that at tOR, the mFDR level is attained precisely. Let = QOR(1). Part (a) shows that the continuous function QOR(t) is non-decreasing. Then we always have QOR(tOR) = if < . Dene OR =fI(T i OR < tOR) : 1 i mg. Let = ( 1 ; ; m ) be an arbitrary decision rule such that mFDR( ): It follows that E T T T 1 ;T T T 2 P m i=1 (T i OR ) i OR = 0 and E T T T 1 ;T T T 2 P m i=1 (T i OR ) i 0: (2.6.3) Combining the two results in (2.6.3), we conclude that E T T T 1 ;T T T 2 P m i=1 ( i OR i )(T i OR ) 0: (2.6.4) 49 Next, consider a monotonic transformation of the oracle decision rule i OR =I(T i OR < tOR) via f(x) = (x)=(1x) (note that the derivative _ f(x) = (1)=(1x) 2 > 0). The oracle decision rule is equivalent to i OR =I T i OR 1T i OR <OR , where OR = t OR 1t OR : A useful fact is that <tOR < 1. Hence OR > 0. Note that (i) (T i OR )OR(1T i OR ) < 0 if i OR > i , and (ii) (T i OR )OR(1T i OR ) > 0 if i OR < i . Combining (i) and (ii), we conclude that the following inequality holds for all i: ( i OR i ) (T i OR )OR(1T i OR ) 0: Summing over i and taking expectations, we have E T T T 1 ;T T T 2 " m X i=1 ( i OR i ) n (T i OR )OR(1T i OR ) o # 0: (2.6.5) Combining (2.6.4) & (2.6.5), we have ORE T T T 1 ;T T T 2 ( m X i=1 ( i OR i )(1T i OR ) ) E T T T 1 ;T T T 2 ( m X i=1 ( i OR i )(T i OR ) ) 0: Finally, note thatOR > 0 and that the ETP of a decision rule = (1; ;m) is given byE T T T 1 ;T T T 2 P m i=1 i(1T i OR ) , we conclude that ETP( OR) ETP( ): 2.6.2 Proof of Theorem 2 Proof. We rst provide a summary of useful notations. • Q (t) = m 1 P m i=1 T ;i OR I(T ;i OR < t). This quantity provides a conservative FDR estimate of the decision rule I(T ;i OR <t) : 1im . • ^ Q (t) = m 1 P m i=1 ^ T ;i OR I( ^ T ;i OR < t). This is the data-driven version of Q (t) with T ;i OR being replaced by ^ T ;i OR . • Q 1 (t) =EfT OR I(T OR <t)g denotes the limiting value of Q OR (t), where T OR is a generic member from the samplefT i OR : 1img. • t 1 = supft2 (0; 1) :Q 1 (t)g. Part (a). In Proposition 5, we show that OR is conservative in mFDR control. To establish the desired property in mFDR control, we only need to show that mFDR( DD) = mFDR( OR ) +o(1): 50 Dene a continuous version of Q (t) as follows. For T ;(k) OR <tT ;(k+1) OR , let Q C (t) = tT ;(k) OR T ;(k+1) OR T ;(k) OR Q k + T ;(k+1) OR t T ;(k+1) OR T ;(k) OR Q k+1 ; (2.6.6) where Q k =Q T ;(k) OR . It is easy to see that Q C (t) is continuous and monotone. Hence the inverse of Q C (t), denoted Q ;1 C , is well dened. Moreover, Q ;1 C is continuous and monotone. We can similarly dene a continuous version of ^ Q (t), denoted by ^ Q C (t). ^ Q C (t) is continuous and monotone; so does its inverse ^ Q ;1 C (). 
By construction, we have OR = I T ;i OR Q ;1 C () : 1im and DD = h I n ^ T ;i OR ^ Q ;1 C () o : 1im i : We will show that (i) Q ;1 C () p !t 1 and (ii) ^ Q ;1 C () p !t 1 . (2.6.7) To show (i), note that the continuity of Q ;1 C () implies that for any > 0, we can nd > 0 such that Q ;1 C ()Q ;1 C fQ C (t 1 )g < ifjQ C (t 1 )j<. Hence P (jQ C (t 1 )j>)P Q ;1 C ()Q ;1 C fQ C (t 1 )g > : Next, by the WLLN Q C (t) p ! Q 1 (t): Note that Q 1 (t 1 ) = , we haveP (jQ (t 1 )j>)! 0. By Markov inequality, we conclude that Q ;1 C () p !Q ;1 C fQ C (t 1 )g =t 1 : Next we show (ii). By inspecting the proof of (i), we only need to show that ^ Q C (t) p !Q 1 (t). Denote a variable without index i (e.g. ^ T OR and T OR ) as a generic member from the sample. It follows from Condition (C2) and the continuous mapping theorem that ^ T OR p ! T OR : Note that both T OR and ^ T OR are bounded above by 1. It follows thatE( ^ T OR T OR ) 2 ! 0: Let Ui = T ;i OR I(T ;i OR < t) and ^ Ui = ^ T ;i OR I( ^ T ;i OR < t). We will show thatE( ^ UiUi) 2 = o(1): To see this, consider the following decomposition ( ^ UiUi) 2 = ( ^ T OR T OR ) 2 I( ^ T OR t;T OR t) + ( ^ T OR ) 2 I( ^ T OR t;T OR >t) +(T OR ) 2 I( ^ T OR >t;T OR t) =I +II +III: 51 The rst term I =o(1) becauseE( ^ T OR T OR ) 2 ! 0. Let > 0. Note that T OR is continuous and that ^ T OR p !T OR , we have P( ^ T OR t;T OR >t)PfT OR 2 (t;t +)g +P j ^ T OR T OR j> ! 0: Since ^ T OR is bounded, we conclude that the second termII =o(1). Similarly we can show thatIII =o(1). ThereforeE( ^ UiUi) 2 =o(1): Next we show that ^ Q (t) p !Q 1 (t). Note that Q (t) p !Q 1 (t), we only need to show that ^ Q (t) p ! Q (t). The dependence among ^ Ui in the expression ^ Q (t) = m 1 P i ^ Ui creates some complications. The idea is to apply some standard techniques for the limit of triangular arrays that do not require independence between variables. Consider Sn = P m i=1 ( ^ UiUi). Then E(Sn) = mfE( ^ Ui)E(Ui)g. Applying standard inequalities such as Cauchy-Schwartz, we haveE( ^ UiUi)( ^ UjUj ) =o(1): It follows that m 2 var(Sn)m 1 E( ^ UiUi) 2 + (1 +o(1))E n ( ^ UiUi)( ^ UjUj ) o =o(1): Therefore EfSnE(Sn)=ng 2 ! 0. Applying Chebyshev's inequality, we obtain m 1 fSnE(Sn)g = ^ Q (t)Q (t) p ! 0: Therefore ^ Q (t) p ! Q 1 (t). By denition,j ^ Q C (t) ^ Q (t)j m 1 . We claim that ^ Q C (t) p ! Q 1 (t) and (ii) follows. According to (i) and (ii) in (2.6.7), ^ Q ;1 C () = Q ;1 C () +o P (1). The mFDR levels of the testing procedures are: mFDR( OR ) = PH 0 T ;i OR <Q ;1 C () P T ;i OR <Q ;1 C () ; and mFDR( dd ) = PH 0 ^ T ;i OR < ^ Q ;1 C () P ^ T ;i OR < ^ Q ;1 C () : The operation of our testing procedure implies that Q ;1 C (). It follows thatP T ;i OR <Q ;1 C () is bounded away from zero. We conclude that mFDR( dd ) = mFDR( OR ) +o(1): The result on mFDR control can be extended to FDR control. The next proposition, which is proved in the Supplementary Material, rst gives sucient conditions under which the mFDR and FDR denitions 52 are asymptotically equivalent, and then veries that these conditions are fullled by the proposed CARS procedure dd . It follows from the proposition that CARS controls the FDR at level +o(1). Proposition 7. (a) Consider a general decision rule . LetY = m 1 P m i=1 i. Then mFDR( ) = FDR( ) + o(1) if (i)E(Y) for some > 0, and (ii) Var(Y) = o(1): (b) Conditions (i) and (ii) are fullled by the CARS procedure dd . Part (b). The CARS procedure utilizes ^ q , and the corresponding test statistic is ^ T ;i OR . 
It follows from Conditions (C1 0 ) and (C2), and the continuous mapping theorem that ^ T OR p !TOR. Denote QOR(t) the oracle mFDR function and tOR the oracle threshold. Then QOR(t) = PH 0 (TOR <t) P(TOR <t) =EfTORI(TOR <t)g; tOR = supft2 (0; 1) :QOR(t)g: Dene ^ Q (t) =m 1 P m i=1 ^ T ;i OR I( ^ T ;i OR <t): Similar to (2.6.6), we dene a continuous version of ^ Q (t) and denote it by ^ Q C (t). It can be shown that ^ Q C (t) is continuous and monotone; so does its inverse ^ Q ;1 C (t). The CARS procedure is given by DD = h I n ^ T ;i OR ^ Q ;1 C () o : 1im i : Following the steps in Part (a) we can show that ^ Q C (t) p !QOR(t); ^ Q ;1 C (t) p !tOR: (2.6.8) The operation of CARS implies thatQ ;1 C () (thus the denominator of the mFDR is bounded away from zero). Hence mFDR( DD ) = QOR(tOR) +o(1) = +o(1): Next, we consider the ETP. It follows from ^ T OR p !TOR and (2.6.8) that ETP( DD ) ETP( OR) = PH 1 n ^ T OR < ^ Q ;1 C () o PH 1 (TOR <tOR) = 1 +o(1): 53 2.7 Supplemental Materials This supplement contains the proofs of other theoretical results (Section A) and additional numerical results (Section B) for Chapter 2. 2.7.1 Additional Theory and Proofs of Other Results 2.7.1.1 Proof of Proposition 1 Proof. Let Ax = fx : x 6= 0g. According to the denition of ji and the conditional independence assumption (2.1.5), we have f(t1i;t2ij1i = 0;2i = 0) = f(t1i;t2ijxi = 0;yi = 0) = f(t1ijxi = 0;yi = 0)f(t2ijxi = 0;yi = 0) = f(t1ij1i = 0;xi =yi = 0)f(t2ij2i = 0;xi =yi = 0) = f(t1ij1i = 0)f(t2ij2i = 0): The last equality holds because given ji = 0, tji are just linear combinations of random errors xi and yi, which are independent ofxi andyi,j = 1; 2. DenoteGx () the distribution function ofxi. Then f(t1i;t2ij1i = 0;2i = 1) = Z Ax f(t1i;t2ijxi =yi =x)dGx (x) = Z Ax f(t1ij1i = 0;xi =yi =x)f(t2ijxi =yi =x)dGx (x) = f(t1ij1i = 0) Z Ax f(t2ij1i = 0;xi =yi =x)dGx (x) = f(t1ij1i = 0)f(t2ij1i = 0;2i = 1): 54 For the third equality, note that t1i are independent of both xi and yi given 1i = 0, whereas t2i are correlated with xi and yi given that 1i = 0. Finally, f(t1i;t2ij1i = 0) = f(t1i;t2ij1i = 0;2i = 0)P(2i = 0j1i = 0) +f(t1i;t2ij1i = 0;2i = 1)P(2i = 1j1i = 0) = f(t1ij1i = 0)ff(t2ij1i = 0;2i = 0)P(2i = 0j1i = 0) +f(t2ij1i = 0;2i = 1)P(2i = 1j1i = 0)g = f(t1ij1i = 0)f(t2ij1i = 0): Hence T1i and T2i are independent under the null hypothesis 1i = 0. 2.7.1.2 Proof of Proposition 2 Proof. Part (a). We only need to show thatq (t2)q (t2) for all. We rst split the joint density into two parts and then use the independence between T1i and T2i under the null: q (t2) = (1) 1 Z t 1 2A ff(t1;t2j1i = 0)P(1i = 0) +f(t1;t2j1i = 1)P(1i = 1)gdt1 (1) 1 Z t 1 2A q (t2)f10(t1)dt1 =q (t2): The last equality holds because R t 1 2A f10(t1)dt1 = 1, which cancels with (1) 1 . Part (b). LetR be the index set of hypotheses rejected by OR . The FDR of OR is FDR( OR ) = E P i2R (11i) jRj_ 1 = E T T T 1 ;T T T 2 E P i2R (11i) jRj_ 1 jT T T 1;T T T 2 = E T T T 1 ;T T T 2 1 jRj_ 1 X i2R T i OR ! ; 55 where the last equality follows from the denition of the oracle statistic T i OR =E(11ijT1i;T2i). For all i2RI, we have T i OR T ;i OR . It follows that FDR( OR )E T T T 1 ;T T T 2 1 jRj_ 1 X i2R T ;i OR ! : The last inequality is due to the operation of OR , which guarantees that fjRj_ 1g 1 X i2R T ;i OR for all realizations of (T T T 1;T T T 2). The result on mFDR holds since E ( X i2R (11i) ) E T T T 1 ;T T T 2 X i2R T ;i OR ! 
E T T T 1 ;T T T 2 (jRj_ 1): 2.7.1.3 Proof of Proposition 3 Let ~ N =m ~ G() be the expected size of the screening set. Dene ~ f 2 (t2) = (1= ~ N ) X T 2i 2T 2 () K h 2 (t2T2i): Then ^ q (t2) = ~ G() 1 ~ f 2 (t2): Dene the conditional density of T2i given that Pi > f 2 (t2) =f(t2jPi >) = ~ G() 1 Z A f(t1;t2)dt1: Then q (t2) = ~ G() 1 f 2 (t2): It follows that Ek^ q q k 2 = ~ G() 1 2 E ~ f 2 f 2 2 ; For a xed , we only need to show that E ~ f 2 f 2 2 =E Z n ~ f 2 (t2)f 2 (t2) o 2 dt2! 0: 56 LetE T 2 jT 1 denote the conditional expectation that is taken over T2i's while holding T1i's xed. The conditional density of T2 given T1 is denoted f2(t2jT1i). We will work on the bias and variance in turn. First, E T 2 jT 1 fK h 2 (t2T2i)g =f2(t2jT1i) + h 2 2 2 f (2) 2 (t2jT1i) Z z 2 K(z)dz +o(h 2 2 ): Now considerEf ~ f 2 (t2)g =ET 1 [E T 2 jT 1 f ~ f 2 (t2)g]. It follows that Ef ~ f 2 (t2)g = (1= ~ N )ET 1 h P m i=1 n f2(t2jT1i) + h 2 2 2 f (2) 2 (t2jT1i) R z 2 K(z)dz +o(h 2 2 ) I (T1i2A ) = f 2 (t2) + h 2 2 2 f (2) 2 (t2j) Z z 2 K(z)dz +o(h 2 2 ); where f (2) 2 (t2j) is dened in the proposition. The second equality holds because ET 1 ff(t2jT1i)I(T1i2A )g = Z A f2(t2jt1)f(t1)dt1 = ~ G()f 2 (t2): We have assumed the square integrability of f (2) 2 (t2j), according to which dene R n f (2) 2j o = Z ff (2) 2 (t2j)g 2 dt2: It follows that the integrated squared bias is Z [Ef ~ f 2 (t2)g ~ f 2 (t2)] 2 dt2 = (h 4 2 =4)R n f (2) 2 () o f2(K)g 2 (1 +o(1)); where we use the notation 2(K) = R z 2 K(z)dz. Next, we compute the variance term. Consider the following decomposition Varf ~ f 2 (t2)g = VarT 1 [E T 2 jT 1 f ~ f 2 (t2)g] +ET 1 [Var T 2 jT 1 f ~ f 2 (t2)g]; 57 where the rst and second terms can be respectively computed as VarT 1 [E T 2 jT 1 f ~ f 2 (t2)g] = n m ~ G 2 () o 1 Z ff 2j1 (t2jt1)g 2 f1(t1)dt1ff2(t2)g 2 f1 +o(1)g; ET 1 [Var T 2 jT 1 f ~ f 2 (t2)g] = 1 ~ Nh2 Z A f2(t2jt1)f1(t1)dt1R(K)f1 +o(1)g = (mh2) 1 f 2 (t2)R(K)f1 +o(1)g: It is easy to see that the second term is the leading term (note that we xed k and let m!1). Hence Z Varf ~ f 2 (t2)gdt2 = (mh2) 1 R(K)f1 +o(1)g: The common choice of bandwidthhm 1=5 makesE ~ f 2 f 2 2 ! 0, and the desired result follows. 2.7.1.4 Proof of Proposition 4 Part (a). The total approximation error can be computed as Bq() = Z jq (t2)q (t2)jdt2 = 1G() 1 (11) = 1f1G1()g 1 : Noting that G1(1) = 1, and applying the mean value theorem, we have d d Bq() = f1G1()gg1()(1) (1) 2 = g1()g1() 1 for some 2 (; 1). If G1 is concave, then g1()g1()< 0 and the desired result follows. Part (b). Let (x) =f1G1(x)g=(1x). If g1 satises lim x"1 g1(x) = 0, then using L'Hospital's Rule, we claim that lim x"1 (x) exists and lim x"1 (x) = lim x"1 g1(x) = 0: Consider the following decomposition of the joint density Z A f(t1;t2)dt1 = Z A fq (t2)f10(t1) +1f(t2jt1;1i = 1)f1(t1)gdt1: 58 Noting that both T1i and T2i are standardized, the conditional density f(t2jt1;1 = 1) is bounded by some constant C0. Using the denition of (x), we have q (t2) = (1) 1 Z A f(t1;t2)dt1q (t2) +C01() Applying L'Hospital's Rule again, we have lim "1 q (s)q (s): In Proposition 2.(a), we have shown that q (s)q (s) for all . Combining the two inequalities, we conclude that lim "1 q (s) =q (s). Then the desired result follows. 2.7.1.5 Proof of Proposition 5 Proof. Proposition 4 denes (x) =f1G1(x)g=(1x) and shows that lim k !1q k (t2) =q (t2); jq (t2)q (t2)jC01(): Note that G1(0) = 0 and G1(1) = 1. It follows from the concavity of G1 that () 1. Moreover, R fq k (t2)q (t2)gdt2 =1( k ). 
Hence Z fq k (t2)q (t2)g 2 dt2C01 Z fq k (t2)q (t2)gdt2 =C0 2 1 ( k )! 0 (2.7.1) when k ! 1, according to L'Hospital's Rule and our assumption that lim x"1 g1(x) = 0: Therefore, we only need to show that for a given k , Ek^ q q k k 2 =E Z f^ q (t2)q k (t2)g 2 dt2 m!0 ! 0: Consider ^ q (t2) dened in (2.2.13). Let aj =k 1 (^ s2 ^ s0 ^ s 2 1 ) 1 f^ s2 ^ s1(j 1)gK h (j 1): 59 Then ^ q (t2) = P k j=1 aj ^ q j , P k j=1 aj = 1 and P k j=1 aj 1. Following similar steps in the proof of Proposition 3, Ef^ q (t2)g = ET 1 [E T 2 jT 1 f^ q (t2)g] = k X j=1 aj 1G(j ) 1j " f j 2 (t2) + h 2 2 2 u2(K) Z t 1 2A j f (2) 2 (t2jt1)f1(t1)dt1 +o(h 2 2 ) # = q k (t2) + 1 2 h 2 q k ;(2) (t2)C (K) +o(h 2 ) + h 2 2 2 u2(K) k X j=1 aj 1G(j ) 1j Z t 1 2A j f (2) 2 (t2jt1)f1(t1)dt1 +o(h 2 2 ); where C (K) is a kernel dependent constant. The last equality follows from [112] (pp. 128) for kernel smoothing estimator at boundary points [applying the theory to P k j=1 ajq j (t2)]. Note that 1G( j ) 1 j 1 and P j aj = 1, we have k X j=1 aj 1G(j ) 1j Z t 1 2A j f (2) 2 (t2jt1)f1(t1)dt1 Z f (2) 2 (t2jt1)f1(t1)dt1: According to the square integrability of q k ;(2) (t2) and f (2) 2 (t2j), we can dene R n q k ;(2) o = Z n q k ;(2) (t2) o 2 dt2; R n f (2) t 2 j o = Z n f (2) 2 (t2j) o 2 dt2: Then the leading term of R [Efq (t2)gq k (t2)] 2 dt2 is bounded above by 1 2 h 4 R n q k ;(2) o fC (K)g 2 + 1 2 h 4 2 fu2(K)g 2 R n f (2) t 2 j o ; which converges to 0 when (m;k)!1. The argument below regarding the variance part does not pursue the study of the optimal rate of convergence. In fact, the analysis is intractable because q j are dependent quantities. We have show in Proposition 3 that Z Varf^ q j (t2)gdt2 (mh2) 1 R(K)f1 +o(1)g 60 for all j . It follows that Varf^ q (t2)g (mh2) 1 R(K)( k X j=1 a 2 k )f1 +o(1)g (mh2) 1 R(K)f1 +o(1)g! 0 when m!1. We note that the actual convergence rate would be better because the variance will be greatly reduced by averaging. Combining the results on the bias and variance terms, we have E Z f^ q (t2)q k (t2)g 2 dt2 (2.7.2) 1 2 h 4 R(q k ;(2) )fC (K)g 2 + 1 2 h 4 2 fu2(K)g 2 R(f (2) t 2 jt 1 ) + R(K) mh2 f1 +o(1)g: Combining (2.7.1) and (2.7.2), we conclude that Ek^ q q k 2 ! 0 when (m;k)! 0, completing the proof. 2.7.1.6 Proof of Proposition 6 Proof. We rst point out that the arguments in this proof should be understood as the conditional version (given that xi and yi are known). 
Dene the following central sample moments that are used in later calculations: mxi = 1 nx nx X j=1 (Xijxi); myi = 1 ny ny X j=1 (Yijyi); mxxi = 1 nx nx X j=1 (Xijxi) 2 ; myyi = 1 ny ny X j=1 (Yijyi) 2 : It follows that S 2 xi =mxxim 2 xi ; S 2 yi =myyim 2 yi : Next dene central moments, xi3 =E(Xijxi) 3 ; yi3 =E(Yijyi) 3 ; xi4 =E(Xijxi) 4 ;yi4 =E(Yijyi) 4 : 61 According to the CLT, we have r nxny n 8 > > > > > > > > > > < > > > > > > > > > > : 0 B B B B B B B B B B @ mxi myi mxxi myyi 1 C C C C C C C C C C A 0 B B B B B B B B B B @ 0 0 2 xi 2 yi 1 C C C C C C C C C C A 9 > > > > > > > > > > = > > > > > > > > > > ; L !N(0; i); where i = 0 B B B B B B B B B B @ y 2 xi 0 yxi3 0 0 x 2 yi 0 xyi3 yxi3 0 y(xi4 4 xi ) 0 0 xyi3 0 x(yi4 4 yi ) 1 C C C C C C C C C C A : Let gi(s;t;u;v) = 0 B B @ (st +xiyi)( xv + yu) 1 2 s + yu xv t +xi + y x yi n ( xv + yu) yu xv o 1 2 1 C C A : It follows that _ gi(0; 0; 2 xi ; 2 yi ) = 0 B B @ 1 pi 1 pi 1 2 y(xiyi) 3 pi 1 2 x(xiyi) 3 pi 1 pi 1 2 i 1 pi 1 2 i 1 2 y xi +yi y x 3 pi (1 + 2 i ) 3 2 i 1 2 x xi +yi y x 3 pi 1 2 i 1 C C A : The asymptotic variance-covariance matrix of p x yn(T1i;T2i) | is i T 1 ;T 2 = 0 B B @ 2 i (t1;t1) 2 i (t1;t2) 2 i (t1;t2) 2 i (t2;t2) 1 C C A ; 62 where the entries can be computed by apply the Delta method: 2 i (t1;t1) = 1 + 4 pi (xiyi) 2 x yi3 2 y xi3 + 6 pi 4 (xiyi) 2 3 x yi4 4 yi + 3 y xi4 4 xi ; 2 i (t1;t2) = 4 pi 2 (xiyi) 2 x yi3 1 2 i + 2 y xi3 1 2 i 4 pi 2 xi +yi y x 2 x yi3 1 2 i + 2 y xi3 (1 + 2 i ) 3 2 i ; + 6 pi 4 xi +yi y x (xiyi) 3 y (1 + 2 i ) 3 2 i xi4 4 xi 3 x 1 2 i yi4 4 yi ; 2 i (t2;t2) = 1 + 4 pi xi +yi y x i 2 x yi3 (1 + 2 i ) 2 i 2 y xi3 + 6 pi 4 xi +yi y x 2n 3 x i yi4 4 yi + 3 y (1 + 2 i ) 2 3 i xi4 4 xi o : The o-diagonal terms degenerate to zero when (i) the null hypothesis xi = yi is true and (ii) the distributions are symmetric, i.e. y3i =x3i = 0, completing the proof. 2.7.1.7 Proof of Proposition 7 The proposition can be considered as a special case of the theory in Section D of [9] and can be proved similarly. As the proofs in [9] are done in dierent settings with the weighted FDR denition and random weights, we provide the proof here for completeness. Proof. Part (a). LetX = m 1 P m i=1 (1i)i. Note that whenY = 0 we must haveX = 0. The asymptotic equivalence follows if we can show the following mFDR( ) FDR( )E X Y X EY I(Y > 0) = o(1): (2.7.3) 63 SinceXY and both are non-negative expressions, using Cauchy-Schwarz E X Y X EY I(Y > 0) = E X Y I(Y > 0) jYEYj EY EjYEYj 2 1=2 EY = Var (Y) 1=2 EY ; The desired result follows from Conditions (i) and (ii). Part (b). According to the proof of Theorem 2, Part (a), P T ;i OR <Q ;1 C () is bounded away from zero, then there exists p > 0, such thatP T ;i OR <Q ;1 C () p. Therefore, P ^ T ;i OR < ^ Q ;1 C () =P T ;i OR <Q ;1 C () + o(1); m 1 E ( m X i=1 I T ;i OR <Q ;1 C () ) =P T ;i OR <Q ;1 C () p > 0: LetY 0 =m 1 P m i=1 I(T ;i OR t 1 ). Note that Var(Y 0 ) =m 1 Var n I T ;i OR t1 o m 1 = o(1): To show that Var(Y) = o(1), use the decomposition Var(Y) = Var(Y 0 ) + Var(YY 0 ) + 2Cov(YY 0 ;Y 0 ): We only need to show that Var(YY 0 ) = o(1). Then by Cauchy-Schwarz inequality and using Var(Y 0 ) = o(1), it follows that Cov(YY 0 ;Y 0 ) = o(1): 64 Recall that ^ T OR T OR = oP (1) and ^ Q ;1 C ()t 1 = oP (1). 
We have Var(YY 0 ) =m 2 Var " m X i=1 n I ^ T ;i OR < ^ Q ;1 C () I T ;i OR <t 1 o # m 2 E " m X i=1 n I ^ T ;i OR < ^ Q ;1 C () I T ;i OR <t 1 o # 2 = 1 1 m E 2 4 Y k=i;j;i6=j n I ^ T ;k OR < ^ Q ;1 C () I T ;k OR <t 1 o 3 5 + 1 m E n I ^ T OR < ^ Q ;1 C I (T OR <t 1 ) o 2 = o(1): The last equality follows since E 2 4 Y k=i;j n I ^ T ;k OR < ^ Q ;1 C () I T ;k OR <t 1 o 3 5 E hn I ^ T ;i OR < ^ Q ;1 C () I T ;i OR <t 1 oi =o(1): 2.7.1.8 Eects of baseline removal (Remark 1) This section explains that removing the baseline eects would not violate our assumption on conditional independence. To x ideas, we consider the application context of microarray time-course experiments. Let denote the true baseline expression levels of m genes. Consider two time points x and y. The (unknown) true eect sizes at these two time points are given by x = + x; y = + y: Note that we do not require that is sparse but some baseline measurements should be available. By contrast, we assume that x = (xi : i = 1; ;m) and y = (yi : i = 1; ;m) are sparse vectors representing random perturbation eects from based line levels . The goal is to identify nonzero eects xiyi6= 0. We observe Xi = xi +xi and Yi = yi +yi, with xi N(0; 1) and yi N(0; 1). Moreover, we do not observe the true latent variable . The baseline eects would be measured (at time 0) with some 65 errors Zi = i +zi: We assume that xi, yi and zi are independent with each other. After removing the baseline eects, we have new observations: ~ Xi = i +xi +xi (i +zi); ~ Yi = i +yi +yi (i +zi): The pairs of dierences and sums are given by T1i = ~ Xi ~ Yi = (xiyi) + (xiyi); T2i = ~ Xi + ~ Yi = (xi +yi) + (xi +yi) 2(i +zi): Under the null hypothesisxi =yi, the error termsxiyi andxi +yi are independent, andxiyi is independent of i +zi. Hence T1i and T2i satisfy the conditional independence assumption (2.5), which further leads to Equation (2.6) that is needed in the CARS methodology. 2.7.1.9 Proof of the claim in Remark 3 Proof. We study the special case where (i) the observations have equal sample size and (known) equal variance 2 xi = 2 yi = 2 i , and (ii) the two nonzero means have the same magnitude with opposite signs xi =yi. Then (T1i;T2i) = p n 2i ( Xi Yi; Xi + Yi): Next, note that Xi =xi + xi; Yi =xi + yi; where xiN(0; (2=n) 2 i ) and yiN(0; (2=n) 2 i ). We have f(t1;t2) = Z f(t1;t2jx =)dGx () = Z f(t1jx =)f2(t2)dGx () = f1(t1)f2(t2); f(t2j1i = 0) = f(t2jxi =yi = 0) =f2(t2): 66 It follows that the oracle statistic TOR(t1;t2) = (11)f(t2j1i = 0)f10(t1) f(t1;t2) = (11)f10(t1) f1(t1) ; which gives the Lfdr dened in (2.2.1). 2.7.1.10 Proof of insuciency and further explanation of Remark 7 We prove the claim in Remark 7 that the primary statistic is not a sucient statistic. The proof uses the notations in Section 4. We only need to provide a counter example. Consider the following special case where the grouping variable Si takes two possible values 1 or 2 with equal probability: P (Si = 1) = 0:5; P (Si = 2) = 0:5: If Si =j, j = 1; 2, then Ti follows a mixture distribution Ti (1j )N(0; 1) +jN(1; 1); where j is the conditional proportion of non-null cases for group j. An equivalent way to conceptualize the above model is to introduce an unknown location parameter i: TijiN(i; 1); where i follows a distribution with two point masses 0 and 1: i (1j )0 +j1: Here 0 and 1 are the Dirac delta functions centered at 0 and 1, respectively. Now we prove Ti's insuciency. By denition, if Ti is sucient, then P (Si = 1jTi =t;i) should not depend on i. 
However, some simple algebras give the following: P (Si = 1jTi =t; = 0) = 11 212 ; P (Si = 1jTi =t; = 1) = 1 1 +2 ; and the desired result follows. 67 Remark 8. The ancillarity paradox means that Si seems to be \useless" for inference of i since it is an external variable and the location information should have been captured by Ti. However, making inference using Ti alone would lead to substantial loss of information. In this example, Si is called an ancillary component because (i) Ti alone is insucient; (ii) Si is ancillary; and (iii) (Ti;Si) is sucient. 2.7.1.11 The conservative bivariate density estimator We rst present a formula that gives an equivalent expression for the bivariate density function: f(t1i;t2i) = (12)f(t1i;t2ij2i = 0) +2f(t1i;t2ij2i = 1) = (12)f10(t1i)f20(t2i) +f(t1i;t2i;2i = 1); where the second equality follows from the logical relationship (2.1.2) and the conditional independence between T1i and T2i under the null (Equation 2.1.6). The estimation of f(t1i;t2i;2i = 1) requires the knowledge of 2i, which is unknown in practice. We propose to estimate 2i via a screening procedure 2i = I(Ri < ), where Ri is a screening statistic such as the Lfdr statistic or the p-value and is a tuning parameter. The R package uses Lfdr2(t2i) as the screening statistic with the default screening level = 0:1. LetA =f(t1;t2) :Ri(t2)<g. Dene the following test statistic T ;i OR (t1;t2) =q (t2)f10(t1)=f (t1;t2); (2.7.4) wheref (t1;t2) = (12)f10(t1)f20(t2) +fG() (12)G0()gf(t1;t2j 2 = 1): Consider the following index setW =fi : T i OR < 12g: This set can be viewed as the collection of \interesting" hypotheses that are worthy of further investigation. We shall study the property of T i; OR onW because hypotheses not inW would not have any impact in the multiple testing problem { observe that 2 is small and large T i OR would not play a role in inference according to the operation of Procedure 2.2.6. The next proposition implies that the modied density estimator would lead to a conservative FDR control. Proposition 8. Consider T ;i OR dened in (2.7.4). If i2W, then T i OR T ;i OR . 68 Proof. We rst show that if i2W, then f(t1i;t2i)f(t1i;t2ij2i = 1). Note that P(2i = 0jT1i =t1i;T2i =t2i) = (12)f(t1i;t2ij2i = 0) (12)f(t1i;t2ij2i = 0) +1f(t1i;t2ij2i = 1) : If i2W, it follows from the logical relationship (2.1.2) that P(2i = 0jT1i =t1i;T2i =t2i) =P(1i = 0;2i = 0jT1i =t1i;T2i =t2i)T i OR < 12: Therefore f(t1i;t2ij2i = 0) f(t1i;t2ij2i = 1) for i2W. Without causing ambiguity, we focus on a generic hypothesis i 2 W and drop the index i. It follows that f(t1;t2) = (1 2)f(t1;t2j2 = 0) +2f(t1;t2j2 = 1)f(t1;t2j2 = 1). Hence f(t1;t2;2 = 1) f(t1;t2;2 = 1; 2 = 1) = P (2 = 1; 2 = 1)f(t1;t2j2 = 1; 2 = 1) Cf(t1;t2j 2 = 1); where C =G() (12)G0(), G() =P(Ri <) and G0() =P(Ri <j2i = 0). The last equality follows from the following two facts P( 2 = 1;2 = 1) = P ( 2 = 1)P ( 2 = 1;2 = 0); and f(t1;t2j2 = 1; 2 = 1) = f(t1;t2j2 = 1)IA =P (A) f(t1;t2)IA =P (A) =f(t1;t2j 2 = 1): Therefore, we conclude that f (t1;t2)f(t1;t2) and the desired result follows. Finally, we discuss how to estimate f (t1;t2). The rst term (12)f10(t1)f20(t2) can be estimated by plugging in ^ 2; the null densities f10 and f20 are known. The second term can be rewritten as G()f1Q()gf(t1;t2j 2 = 1); where Q() = (12)G0()=G() can be viewed as the mFDR level of I(Ri < ) for testing hypothesis 2i = 0. In practice, G() can be estimated by the empirical CDF G() = m 1 P m i=1 I(Ri < ). 
To estimate f(t1;t2j 2i = 1), we apply the R package ash to the sample 69 T =fi : Ri < g. [The ash package uses the average shifted histogram, which seems to, according to our numerical studies, provide better estimation than the more standard package np.] The ash package only calculates the density on grids, we further use the function interp surface in the R package fields to interpolate the estimated density. When the Lfdr is used for screening, Q() can be estimated as [cf. [102]] ^ Q() = P m i=1 If d Lfdr2(t2i)<g d Lfdr2(t2i) P m i=1 I( d Lfdr2(t2i)<) : When the p-value is used for screening, we use the well known estimate ^ Q() ==G() [cf. [15, 47]]. 2.7.1.12 Screening via Lfdr: formulae and implementation Consider a screening procedure based on i <, wherei :=i(t1i) is a screening statistic and > 0 is a tuning parameter. Denote G() the CDF of i. LetA =fx :(x)>g and S() =P (A ) the survival function. Then q (t2) = lim h!0 EfQ (t2;h)g h =S() 1 Z A f(t1;t2)dt1; (2.7.5) which can be estimated by ^ q (t2) = P i2T () K h 2 (t2t2i) m ^ S() 1 : (2.7.6) In the p-value approach, we have an explicit correction factor of 1. The factor is now replaced by ^ S(). If the Lfdr is used for screening, then a new correction factor would be needed. We can simulate B data points (e.g. B = 10 5 ) from the null distribution N(0; 1) and calculate their Lfdrs using the method in Sun and Cai (2007), where the parameters are estimated using the primary statisticsft1i : 1img. The correction factor is then obtained as the proportion of Lfdrs exceeding . 70 2.7.2 Supplementary Numerical Results 2.7.2.1 The case with unknown and equal variances Let xijN(0; 1) and yik N(0; 1). Set nx = 50;ny = 60. Therefore our = ny nx = 1:2. We compute T1i and T2i using estimated variances. The following settings are considered in our simulation study. Setting 1: we set x;1:k = 5= p 30, x;(k+1):(2k) = 4= p 30, x;(2k+1):m = 0, y;1:k = 2= p 30, y;(k+1):(2k) = 4= p 30 and y;(2k+1):m = 0. We vary k from 100 to 1000. Setting 2: we use k1 and k2 to denote the number of nonzero locations and the number of locations with dierential eects, respectively. Let x;1:k 2 = 5= p 30, x;(k 2 +1):k 1 = 4= p 30, x;(k 1 +1):m = 0, y;1:k 2 = 2= p 30, y;(k 2 +1):k 1 = 4= p 30 and y;(k 1 +1):m = 0. We x k1 = 2000 and vary k2 from 200 to k1. Setting 3: the sparsity level is xed at k = 750. We set x;1:(k) = 0= p 30, x;(k+1):(2k) = 3= p 30, x;(2k+1):m = 0, y;1:k = 2= p 30, and y;(k+1):(2k) = 3= p 30, y;(2k+1):m = 0. We vary 0 from 3.5 to 5. In Settings 1-3, we apply dierent methods to the simulated data and obtain results by averaging over 500 replications. We plot the FDR and Average Power levels as functions of varied parameter values. The results are displayed in Figure 2.6. We can similarly see that both CARS and US are more powerful than conventional univariate methods such BH and AZ, and CARS is superior than US. 2.7.2.2 Comparison with sample splitting when 1i = 2i Similar to the general setting, we set nx = 50;ny = 60 and construct the primary and auxiliary statistics accordingly. We consider two simulation settings. Setting 1: x;1:k = 5= p 30, x;(k+1):m = 0, y;1:k = 2= p 30 and y;(k+1):m = 0. Vary k from 50 to 1000 to investigate the eect of sparsity on testing results. Setting 2: the sparsity level is xed at k = 500,y;1:m = 0, z;1:k =0= p 30, and z;(k+1):m = 0. Vary 0 from 1.5 to 3.5 to investigate the impact of eect sizes. The sample splitting (SS) method has been added to the comparison. Specically, SS splits both nx andny into two equal parts. 
The rst half of data are used to compute the initial p-values. We then use 71 0.000 0.025 0.050 0.075 0.100 250 500 750 1000 k FDR FDR level Comparison 0.00 0.25 0.50 0.75 1.00 250 500 750 1000 k Average Power Power Comparsion Method DD BH OR AZ US Setting 1 0.000 0.025 0.050 0.075 0.100 0.125 500 1000 1500 k 2 FDR FDR level Comparison 0.00 0.25 0.50 0.75 1.00 500 1000 1500 k 2 Average Power Power Comparsion Method DD BH OR AZ US Setting 2 0.000 0.025 0.050 0.075 0.100 3.5 4.0 4.5 μ 0 FDR FDR level Comparison 0.00 0.25 0.50 0.75 1.00 3.5 4.0 4.5 μ 0 Average Power Power Comparison Method DD BH OR AZ US Setting 3 Figure 2.6: General Case (Unknown Variance): the FDR and MDR varied by non-null proportion (top row) and conditional proportion (bottom row). The displayed procedures are DD (), BH (4), OR (), AZ (+) and US (). 72 the p-value cut-o 0.05 to select locations and use these locations for the second stage analysis. Next the second stage p-values are computed for the selected locations using the second half of the data. Finally we apply the BH procedure to the second stage p-values and record the decisions. The results are displayed in Figure 2.7. The following observations can be made based on the simulation. (a). All methods can control the FDR at the nominal level 0:05, with BH slightly conservative and US very conservative. (b). DD can be very conservative in many settings but still enjoys substantial power gain over competing methods. (c). Univariate methods (BH and AZ) can be greatly improved by US, which is further improved by CARS. (d). Setting 1 shows that the gain in eciency decreases as k increases. (e). Setting 2 shows that the eciency gain of CARS increases as k2 increases. (f). The sample splitting method is inecient. 2.7.2.3 Stability of tuning This section investigates the robustness of tuning. LetT () =fi : Lfdr1(t1i) > g. We rst examine the performance of CARS wrt the choice of by focusing on the case when 1i and 2i are perfectly correlated. The following settings are considered: Setting 1: same as Setting 1 in Section 5.2, except that k = 500. Vary from 0:6 to 0:9. Setting 2: same as Setting 1 above, except that x;1:500 = 4= p 50. Vary from 0:6 to 0:9. Setting 3: same as Setting 1, except that k = 1; 000. Vary from 0.6 to 0.9. Setting 4: same as the Setting 3, except that x;1:1000 = 4= p 50. Vary from 0:6 to 0:9. We apply the oracle procedure (OR) and the data-driven CARS procedure (DD) to the simulated data and record the FDR, power and size of ^ S =f d Lfdr1 > g. Note that for OR, we always have Card(S) = 4500 or 4000. The results are summarized in Figure 2.8. We can see that the value of aects the size of ^ S and in general provides better error control when it is large. This is consistent with 73 0.000 0.025 0.050 0.075 0.100 250 500 750 1000 k FPR FDR level Comparison 0.00 0.25 0.50 0.75 1.00 250 500 750 1000 k Power Power Comparsion Method DD BH OR AZ US SS Setting 1 0.000 0.025 0.050 0.075 0.100 1.5 2.0 2.5 3.0 3.5 k FPR FDR level Comparison 0.00 0.25 0.50 0.75 1.00 1.5 2.0 2.5 3.0 3.5 k Power Power Comparsion Method DD BH OR AZ US SS Setting 2 Figure 2.7: The special case when 1i = 2i . 74 our theory in Proposition 4. However, to avoid in ated variability in estimated quantities, we do not recommend larger 's. As a rule of thumb, our practical recommendation is to take = 0:9 to ensure both precise error control and stability of the methodology. Next we investigate the stability of CARS wrt the choice of in (2.7.4). We x = 0:9 and vary from 0:1 to 0:3. 
The following settings are considered:

Setting 1: same as Setting 1 in Section 5.2, except that $k = 50$.
Setting 2: same as Setting 1 above, except that $\mu_{x,1:50} = 4/\sqrt{50}$.
Setting 3: same as Setting 1, except that $k = 100$.
Setting 4: same as Setting 3, except that $\mu_{x,1:100} = 4/\sqrt{50}$.

The results are summarized in Figure 2.9. We can see that the value of $\upsilon$ affects the size of $\hat{\mathcal{T}}$ but in general does not affect the FDR or power. As a rule of thumb, our practical recommendation is to take $\upsilon = 0.1$.

Figure 2.8: Effect of $\tau$. The panels display the FDR, average power and average cardinality of $\hat{\mathcal{S}}$ as functions of $\tau$ for Settings 1-4. The displayed procedures are DD and OR.

Figure 2.9: Effect of $\upsilon$. The panels display the FDR, average power and average cardinality of $\hat{\mathcal{T}}$ as functions of $\upsilon$ for Settings 1-4. The displayed procedures are DD and OR.

2.7.2.4 Comparison of ranking

A fundamental aspect of the multiple comparison problem is ranking. For example, suppose a biologist with a limited budget asks us to provide a list of the top $k$ genes (say, $k = 10$). Then, on top of FDR control, s/he may care more about how many true findings are actually contained in the list that we provide. We carry out a small simulation study to compare different methods when $k_2 = 20$. The goal is to show that, by exploiting the auxiliary information, CARS yields a much better ranking, in the sense that for a pre-specified budget (a fixed number of rejections) CARS identifies more true positives than competing methods.

Consider the case with unknown and equal variances. Let $\epsilon_{xij} \sim N(0, 1)$ and $\epsilon_{yik} \sim N(0, 1)$. Set $n_x = 50$, $n_y = 60$; therefore the ratio $n_y/n_x = 1.2$. The number of repetitions is 100 and the number of genes is $m = 5000$.

Setting: $\mu_{x,1:20} = 5/\sqrt{50}$, $\mu_{x,21:40} = 4/\sqrt{50}$, $\mu_{x,41:m} = 0$, $\mu_{y,1:20} = 2/\sqrt{50}$, $\mu_{y,21:40} = 4/\sqrt{50}$, $\mu_{y,41:m} = 0$.

We vary the number of rejections $k$ from 5 to 20 and, for each given $k$, calculate how many true positives are among the top $k$ hypotheses. The rankings are based on $p$-values (BH), Lfdr statistics (Sun and Cai 2007), the oracle statistic $T_{OR}$ (OR, proposed with known parameters), and the data-driven statistic $\hat{T}_{OR}$ (DD, proposed with estimated parameters), respectively. The expected number of true positives (ETP) is calculated based on the average of 100 simulated data sets.
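To make the ETP comparison concrete, the following base-R sketch counts the true positives among the top $k$ hypotheses for a given ranking statistic; the function name and the toy data-generating step are illustrative and not part of the original study design.

```r
# Minimal sketch (illustrative names): ETP-at-k for a given ranking.
# 'score' ranks hypotheses (smaller = stronger evidence, e.g. p-value, Lfdr or T_OR);
# 'is_signal' is the true non-null indicator, used only for evaluation.
etp_at_k <- function(score, is_signal, k_seq = 5:20) {
  ord <- order(score)                      # most significant first
  sapply(k_seq, function(k) sum(is_signal[ord[seq_len(k)]]))
}

# Toy example: evaluate one ranking on a single simulated data set.
set.seed(1)
m <- 5000; signal <- c(rep(TRUE, 40), rep(FALSE, m - 40))
z <- rnorm(m) + ifelse(signal, 3, 0)       # toy z-scores
pval <- 2 * pnorm(-abs(z))                 # a univariate ranking (BH order)
etp_at_k(pval, signal)                     # in the study this is averaged over replications
```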
We can see from Figure 2.10 that the ranking of CARS is superior to that of the other methods.

Figure 2.10: Comparison of ranking. ETP versus the number of rejections (5 to 20) for OR, DD, AZ and BH.

2.7.2.5 Dependent tests

This section presents a small simulation study to investigate the performance of CARS under dependence. We demonstrate the performance of CARS under the setting with unknown and equal variances. Let $\Sigma$ be the covariance matrix with entries
\[
\Sigma_{i,i} = 1, \; i = 1, \ldots, m; \qquad
\Sigma_{i,i+1} = \Sigma_{i+1,i} = 0.5, \; i = 1, \ldots, m-1; \qquad
\Sigma_{i,i+2} = \Sigma_{i+2,i} = 0.4, \; i = 1, \ldots, m-2,
\]
and all other entries equal to zero. Let $\pmb{x}_j \sim N(\pmb{0}, \Sigma)$ and $\pmb{y}_k \sim N(\pmb{0}, \Sigma)$. Set $n_x = 50$, $n_y = 60$; therefore the ratio $n_y/n_x = 1.2$. The number of repetitions is 100 and $m = 2000$.

Setting: we set $\mu_{x,1:k} = 5/\sqrt{50}$, $\mu_{x,(k+1):2k} = 4/\sqrt{50}$, $\mu_{x,(2k+1):m} = 0$, $\mu_{y,1:k} = 2/\sqrt{50}$, $\mu_{y,(k+1):2k} = 4/\sqrt{50}$, $\mu_{y,(2k+1):m} = 0$. We vary $k$ from 100 to 500.

The results are summarized in Figure 2.11. Both OR and DD seem to work well for FDR control. However, we want to emphasize that the simulation results here are very limited and we do not intend to draw any conclusions based on these results. Meanwhile, we conjecture that CARS would still be asymptotically valid under some weak dependence assumptions.

Figure 2.11: Comparison of methods under weak dependence. FDR and average power as functions of $k$ for DD, BH, OR, AZ and US.

2.7.2.6 Supernova Example Images

Here we provide additional numerical results for the analysis in Section 2.4.5 in the main text. Figure 2.12 shows the rejected pixels in the $516 \times 831$ layout for each method under different FDR levels.

Figure 2.12: Rejected pixels at nominal FDR levels 0.01%, 1% and 5% (rows). Left to right: BH procedure, Adaptive Z procedure, Uncorrelated Screening, CARS.

2.7.2.7 Analysis of the Microarray Time-Course (MTC) Data

In this section we apply the CARS procedure to the MTC dataset collected by Calvano et al. (2005) for studying systemic inflammation in humans. This dataset contains eight study subjects who are randomly assigned to treatment and control groups and then administered endotoxin and placebo, respectively. Affymetrix U133A chips were used to profile the expression levels of 22,283 genes in human leukocytes measured before infusion (0 hour) and at 2, 4, 6, 9, and 24 hours afterwards.
One of the goals of this experiment is to identify, in the treatment group, early-to-middle response genes that are differentially expressed within 4 hours, thus revealing a meaningful early activation gene sequence that could potentially govern the immune responses. We first preprocess the data according to the steps discussed in Sun and Wei (2011). To further identify the genes that regulate this sequence, we take time point 0 as the baseline and time points 4 and 24 as the interval over which differential gene expressions are estimated. Denote $Y_{j,i}$ the average gene expression value for gene $i$ at time point $j$, and let $\tilde{Y}_{j,i} = Y_{j,i} - Y_{0,i}$ denote the baseline-adjusted expression level for gene $i$ at time point $j$. The primary and auxiliary statistics are
\[
(T_{1i}, T_{2i}) = \sqrt{\tfrac{1}{2}} \left( \frac{\tilde{Y}_{4,i} - \tilde{Y}_{24,i}}{S_p}, \; \frac{\tilde{Y}_{4,i} + \hat{\kappa}\, \tilde{Y}_{24,i}}{\sqrt{\hat{\kappa}}\, S_p} \right),
\]
where $m = 22{,}283$, $S^2_{y,4} = \mathrm{Var}(\tilde{Y}_{4,i})$, $S^2_{y,24} = \mathrm{Var}(\tilde{Y}_{24,i})$, $\hat{\kappa} = S^2_{y,4}/S^2_{y,24}$ and $S_p^2 = S^2_{y,4} + S^2_{y,24}$. At FDR level 0.05, our CARS procedure discovered 429 differentially expressed genes among the total of 22,283 genes. By contrast, the BH procedure discovered 121 genes.
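The sketch below illustrates how the primary/auxiliary pair can be formed from baseline-adjusted expressions. It follows the display reconstructed above; the function and argument names are illustrative, the variance ratio symbol $\hat\kappa$ is an assumption, and the plug-in variance estimates are simplified stand-ins for the thesis' definitions.

```r
# Illustrative sketch (not the thesis' exact estimator): given baseline-adjusted
# expressions Y4t = Y4 - Y0 and Y24t = Y24 - Y0 and variance estimates s4, s24,
# form the primary/auxiliary pair (T1, T2) as in the reconstructed display.
cars_pair <- function(Y4t, Y24t, s4, s24) {
  kap <- s4 / s24                  # assumed variance-ratio estimate
  Sp  <- sqrt(s4 + s24)            # pooled scale
  T1  <- sqrt(1/2) * (Y4t - Y24t) / Sp
  T2  <- sqrt(1/2) * (Y4t + kap * Y24t) / (sqrt(kap) * Sp)
  cbind(T1 = T1, T2 = T2)
}

# Toy usage with simulated baseline-adjusted expressions
set.seed(5)
Y4t <- rnorm(100); Y24t <- rnorm(100)
head(cars_pair(Y4t, Y24t, s4 = var(Y4t), s24 = var(Y24t)))
```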
2.7.2.8 Comparison of mFDR and FDR

This section investigates the asymptotic equivalence of mFDR and FDR in different settings. In our simulation, we choose $m = 5000$, $n_x = 50$, $n_y = 60$, $\mu_{x,1:k} = 5/\sqrt{30}$, $\mu_{x,(k+1):(2k)} = 4/\sqrt{30}$, $\mu_{x,(2k+1):m} = 0$ and $\mu_{y,1:k} = 3/\sqrt{30}$, $\mu_{y,(k+1):(2k)} = 4/\sqrt{30}$, $\mu_{y,(2k+1):m} = 0$. The sparsity level $k$ is varied from 150 to 1000. We apply the oracle CARS procedure (OR) and the data-driven CARS procedure (DD) to the simulated data sets. The mFDR is computed as the ratio of the average number of false positives over the average number of rejections; the FDR is computed as the average of the false discovery proportions. The results are obtained based on 200 replications and summarized in Figure 2.13. We can see that the FDR and mFDR levels are very similar for both the OR and DD methods.

Figure 2.13: Asymptotic equivalence of mFDR and FDR for OR and DD. The panels display the FDR, mFDR and average power as functions of $k$.

Chapter 3

Multistage Adaptive Testing of Sparse Signals

Multistage design has been used in a wide range of scientific fields. By allocating sensing resources adaptively, one can effectively eliminate null locations and localize signals with a smaller study budget. We formulate a decision-theoretic framework for simultaneous multistage adaptive testing and study how to minimize the total number of measurements while meeting pre-specified constraints on both the false positive rate (FPR) and the missed discovery rate (MDR). The new procedure, which effectively pools information across individual tests using a simultaneous multistage adaptive ranking and thresholding (SMART) approach, can achieve precise error rate control and lead to great savings in total study costs. Numerical studies confirm the effectiveness of SMART for FPR and MDR control and show that it achieves substantial power gain over existing methods. The SMART procedure is demonstrated through the analysis of high-throughput screening data and spatial imaging data.

3.1 Oracle and SMART Rules for Sparse Recovery

The sparse recovery problem in the adaptive setting involves the simultaneous testing of $p$ hypotheses:
\[
H_{0,i} : \mu_i = 0 \quad \text{vs.} \quad H_{1,i} : \mu_i \neq 0, \quad 1 \le i \le p, \quad (3.1.1)
\]
based on $p$ streams of observations. In this section, we first formulate the sequential sparse recovery problem in a decision-theoretic framework (Section 3.1.1), then derive an oracle procedure for FPR and MDR control (Section 3.1.2), and finally propose new methodologies to approximate the oracle procedure (Sections 3.1.3-3.1.4).

3.1.1 A decision-theoretic formulation

Let $X_{ij}$ be the measurement of variable $X_i$ at stage $j$ and $\pmb{X}_i^j = (X_{i1}, \ldots, X_{ij})$ the collection of measurements on $X_i$ up to stage $j$. A multistage decision procedure involves choosing a stopping rule and a testing rule for each location. At location $i$, the stopping rule $\tau_i$ consists of a series of functions $\tau_{i1}(\pmb{X}_i^1), \tau_{i2}(\pmb{X}_i^2), \ldots$, where $\tau_{ij}$ takes values in $\{0, 1\}$, with 0 and 1 standing for "taking another observation" and "stopping sampling and making a decision", respectively. We consider a class of multistage designs where the focus is sequentially narrowed down, i.e. the active sets satisfy $\mathcal{A}_1 = \{1, \ldots, p\}$ and $\mathcal{A}_j \subseteq \mathcal{A}_{j-1}$ for $j = 2, 3, \ldots$. It follows that at every coordinate $i \in \mathcal{A}_j$ there are three possible actions: (i) stop sampling and claim $H_{0,i}$ is true; (ii) stop sampling and claim $H_{0,i}$ is false; and (iii) do not make a decision and take another observation. Define the stopping time $N_i = \min\{n \ge 1 : \tau_{in}(\pmb{X}_i^n) = 1\}$. Then the stopping rule $\tau_i$ can be equivalently described by the stopping time $N_i$. The testing rule $\delta_i \in \{0, 1\}$ is carried out at stage $N_i$ (the terminal sampling stage), where $\delta_i = 0/1$ indicates that $i$ is classified as a null/non-null case. A multistage decision procedure is therefore denoted $\pmb{d} = (\pmb{N}, \pmb{\delta})$, where $\pmb{N} = (N_1, \ldots, N_p)$ and $\pmb{\delta} = (\delta_1, \ldots, \delta_p)$ are the stopping times and terminal decisions, respectively.

Assume that the cost of taking one observation is $c$. From a decision-theoretic viewpoint, we can study a weighted classification problem with the following loss function:
\[
L_{\lambda_1, \lambda_2}(\pmb{\theta}, \pmb{d}) = \lambda_1 \sum_{i=1}^p (1-\theta_i)\delta_i + \lambda_2 \sum_{i=1}^p \theta_i(1-\delta_i) + c \sum_{i=1}^p N_i, \quad (3.1.2)
\]
where $\lambda_1$ and $\lambda_2$ are the costs for a false positive decision and a false negative decision, respectively. The sum of the first two terms in (3.1.2) corresponds to the total decision errors, and the last term gives the total sampling costs. The optimal solution to the weighted classification problem is the Bayes sequential procedure $\pmb{d}^*$ that minimizes the expected loss:
\[
E\{L_{\lambda_1, \lambda_2}(\pmb{\theta}, \pmb{d}^*)\} = \inf_{\pmb{d}} E\{L_{\lambda_1, \lambda_2}(\pmb{\theta}, \pmb{d})\}. \quad (3.1.3)
\]
In [17], $\pmb{d}^*$ is represented by a thresholding rule based on the oracle statistic
\[
T_{OR}^{i,j} = P(\theta_i = 0 \mid \pmb{X}_i^j). \quad (3.1.4)
\]
$T_{OR}^{i,j}$ is the posterior probability that case $i$ is a null given $\pmb{X}_i^j$. We view $T_{OR}^{i,j}$ as a significance index reflecting our confidence in claiming that $H_{0,i}$ is true based on all information available at stage $j$. In practice we only specify the FPR and MDR levels $\alpha$ and $\beta$ but do not know $\lambda_1$ and $\lambda_2$. However, the optimal solution to the weighted classification problem (3.1.3) motivates us to conjecture that the optimal solution to the sparse recovery problem (1.2.6) should also be a thresholding rule based on $T_{OR}^{i,j}$. This conjecture will be established rigorously in the next subsection.
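As a small illustration of the loss in (3.1.2), the following R sketch evaluates it for a given realization of the truth, the terminal decisions and the stopping times; the function and argument names are illustrative.

```r
# Minimal sketch: evaluate the weighted classification loss (3.1.2).
loss_weighted <- function(theta, delta, N, lambda1, lambda2, cost) {
  fp <- sum((1 - theta) * delta)      # false positive decisions
  fn <- sum(theta * (1 - delta))      # false negative decisions
  lambda1 * fp + lambda2 * fn + cost * sum(N)
}

# Toy usage
set.seed(2)
theta <- rbinom(10, 1, 0.2); delta <- theta; N <- sample(1:3, 10, replace = TRUE)
loss_weighted(theta, delta, N, lambda1 = 1, lambda2 = 5, cost = 0.1)
```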
3.1.2 Oracle procedure

Let $t_l$ and $t_u$ be constants satisfying $0 \le t_l < t_u \le 1$. Consider a class of sequential testing procedures $\pmb{d}^*(t_l, t_u)$ of the form:
\[
\text{stop sampling for unit } i \text{ at } N_i = \min\{j \ge 1 : T^{i,j}_{OR} \le t_l \text{ or } T^{i,j}_{OR} \ge t_u\}; \text{ decide } \delta_i = 1 \text{ if } T^{i,N_i}_{OR} \le t_l \text{ and } \delta_i = 0 \text{ if } T^{i,N_i}_{OR} \ge t_u. \quad (3.1.5)
\]
Denote by $\mathcal{D}_{\alpha,\beta}$ the collection of all sequential decision procedures which simultaneously satisfy (i) FPR$(\pmb{d}) \le \alpha$ and (ii) MDR$(\pmb{d}) \le \beta$. The following assumption is a standard condition in the sequential analysis literature (e.g. [17, 97]). It essentially requires that $f(\pmb{X}^1_i \mid \theta_i = 0)$ and $f(\pmb{X}^1_i \mid \theta_i = 1)$ differ with some positive probability, which ensures that the sequential testing procedure has a finite stopping time.

Assumption 1. Suppose the dimension $p$ is fixed. Assume that $P_i(Z_{i,1} = 0) < 1$ for $i = 1, \ldots, p$, where $Z_{i,1} = \log \frac{f(\pmb{X}^1_i \mid \theta_i = 1)}{f(\pmb{X}^1_i \mid \theta_i = 0)}$.

The next theorem derives an oracle sequential testing procedure that provides the optimal solution to the constrained optimization problem (1.2.6).

Theorem 4. Consider a class of sequential testing rules $\pmb{d}^*(t_l, t_u)$ taking the form of (3.1.5). Denote $Q_{OR}(t_l, t_u)$ and $\tilde{Q}_{OR}(t_l, t_u)$ the FPR and MDR levels of $\pmb{d}^*(t_l, t_u)$, respectively. Then, under Assumption 1, we have:
(a). $Q_{OR}(t_l, t_u)$ is non-decreasing in $t_l$ for a fixed $t_u$, and $\tilde{Q}_{OR}(t_l, t_u)$ is non-increasing in $t_u$ for a fixed $t_l$.
(b). Let $0 < \alpha, \beta < 1$. Then there exists a pair of oracle thresholds $(t^l_{OR}, t^u_{OR})$, based on which we can define the oracle procedure
\[
\pmb{d}_{OR} \equiv \pmb{d}^*(t^l_{OR}, t^u_{OR}), \quad (3.1.6)
\]
such that $\pmb{d}_{OR}$ is optimal in the sense that (i) FPR$(\pmb{d}_{OR}) \le \alpha$; (ii) MDR$(\pmb{d}_{OR}) \le \beta$; and (iii) ESS$(\pmb{d}_{OR}) \le$ ESS$(\pmb{d})$ for all $\pmb{d} \in \mathcal{D}_{\alpha,\beta}$.

The oracle procedure $\pmb{d}_{OR}$ is a thresholding rule based on the oracle statistic $T_{OR}$ and the oracle thresholds $t^l_{OR}$ and $t^u_{OR}$. However, Theorem 4 only shows the existence of $t^l_{OR}$ and $t^u_{OR}$, which are unknown in practice. In Section 3.1.3, we extend the classical ideas in [110, 97] to derive approximations of $t^l_{OR}$ and $t^u_{OR}$. The approximations guarantee conservative FPR and MDR control but suffer from the "overshoot" problem in sequential testing. Section 3.1.4 develops a ranking and compound thresholding algorithm to improve the approximation.

3.1.3 Approximation of oracle thresholds and its properties

If we focus on a single location $i$, then the sequential probability ratio test (SPRT) [110] is a thresholding rule based on $L_{i,j} = \frac{f(\pmb{X}^j_i \mid \theta_i = 1)}{f(\pmb{X}^j_i \mid \theta_i = 0)}$ and of the form:
if $\log L_{i,j} \le a$, stop sampling and decide $H_{0,i}$ is true;
if $\log L_{i,j} \ge b$, stop sampling and decide $H_{0,i}$ is false;
if $a < \log L_{i,j} < b$, take another observation.

Let $\alpha' = P_{H_{0,i}}(\text{Reject } H_{0,i})$ and $\beta' = P_{H_{1,i}}(\text{Accept } H_{0,i})$ be pre-specified Type I and Type II error rates. Then, by applying Wald's identity [111, 97], the thresholds $a$ and $b$ can be approximated as
\[
\tilde{a} = \log \frac{\beta'}{1 - \alpha'} \quad \text{and} \quad \tilde{b} = \log \frac{1 - \beta'}{\alpha'}. \quad (3.1.7)
\]
The oracle statistic $T^{i,j}_{OR} = (1-\pi)/\{(1-\pi) + \pi L_{i,j}\}$ is monotone in $L_{i,j}$. Therefore, in the multiple hypothesis setting, we can view the oracle rule $\pmb{d}_{OR} = \pmb{d}^*(t^l_{OR}, t^u_{OR})$ as $p$ parallel SPRTs, and the problem boils down to obtaining approximate formulas for the oracle thresholds $t^l_{OR}$ and $t^u_{OR}$ for a given pair of pre-specified FPR and MDR levels $(\alpha, \beta)$. In our derivation, the classical techniques (e.g. Section 7.5.2 in [17]) are used, and the relationships between the FPR and MDR levels $(\alpha, \beta)$ and the Type I and Type II error rates $(\alpha', \beta')$ are exploited. The derivation of the approximation formulas under the FPR and MDR constraints is involved; we provide detailed arguments in Appendix A.
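To illustrate the parallel-SPRT view, here is a minimal R sketch of a single SPRT under an assumed Gaussian mean-shift alternative with Wald's approximate boundaries from (3.1.7); the model, parameter values and function name are illustrative rather than the thesis' specification.

```r
# Minimal sketch: one SPRT stream for H0: mu = 0 vs H1: mu = mu1 (assumed Gaussian model).
sprt_one_stream <- function(mu1 = 3, sigma = 1, alpha1 = 0.05, beta1 = 0.05,
                            theta = 0, max_stage = 100) {
  a <- log(beta1 / (1 - alpha1))          # lower boundary (accept H0), as in (3.1.7)
  b <- log((1 - beta1) / alpha1)          # upper boundary (reject H0), as in (3.1.7)
  llr <- 0
  for (j in seq_len(max_stage)) {
    x <- rnorm(1, mean = if (theta == 1) mu1 else 0, sd = sigma)
    llr <- llr + dnorm(x, mu1, sigma, log = TRUE) - dnorm(x, 0, sigma, log = TRUE)
    if (llr <= a) return(list(N = j, reject = FALSE))
    if (llr >= b) return(list(N = j, reject = TRUE))
  }
  list(N = max_stage, reject = llr >= 0)  # fallback if neither boundary is hit
}

set.seed(3)
sprt_one_stream(theta = 1)   # toy run under the alternative
```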
The approximated thresholds are given by
\[
\tilde{t}^{\,l}_{OR} = \alpha \quad \text{and} \quad \tilde{t}^{\,u}_{OR} = \frac{1}{1 + \beta}. \quad (3.1.8)
\]
The next theorem shows that the pair (3.1.8) is valid for FPR and MDR control.

Theorem 5. Consider multistage model (1.2.2). Denote $\tilde{\pmb{d}}_{OR} = \pmb{d}^*(\tilde{t}^{\,l}_{OR}, \tilde{t}^{\,u}_{OR})$ the thresholding procedure that operates according to (3.1.5) with lower and upper thresholds given by (3.1.8). Then FPR$(\tilde{\pmb{d}}_{OR}) \le \alpha$ and MDR$(\tilde{\pmb{d}}_{OR}) \le \beta$.

The approximations in (3.1.8) lead to very conservative FPR and MDR levels. The main issue is the "overshoot" problem: in the parallel SPRTs the boundaries are not hit exactly. The conservativeness in the error rate control results in increased study budgets. The situation is much exacerbated in high-dimensional settings where most locations are only tested once or twice; in fact, the "step sizes" of many SPRTs tend to be quite large relative to the given boundaries, and overshoots occur with high frequency. Our simulation studies reveal that the actual FPR and MDR levels are often only half of the nominal levels, resulting in substantial efficiency loss. This concern motivates us to develop more powerful methodologies to improve the approximations in (3.1.8).

3.1.4 The SMART procedure

The oracle procedure $\pmb{d}_{OR}$ in (3.1.6) can be interpreted in two ways: one is a collection of $p$ parallel and independent SPRTs, and the other is a stage-wise simultaneous inference procedure. This section takes the latter view. To gain more insight into the overall structure of the problem, we rewrite (3.1.2) as the sum of stage-wise losses:
\[
L_{\lambda_1, \lambda_2}(\pmb{\theta}, \pmb{d}) = \sum_{j=1}^{N} \left[ \sum_{i \in \mathcal{S}_j} \{\lambda_1 (1-\theta_i)\delta_i + \lambda_2 \theta_i (1-\delta_i)\} + c \cdot \mathrm{Card}(\mathcal{A}_j) \right], \quad (3.1.9)
\]
where $N = \max\{N_i : 1 \le i \le p\}$ and $\mathcal{S}_j = \{i : N_i = j\}$. At stage $j$, we stop sampling on $\mathcal{S}_j \subseteq \mathcal{A}_j$ and make terminal decisions at every $i \in \mathcal{S}_j$; the remaining locations become the active set for the next stage, $\mathcal{A}_{j+1} = \mathcal{A}_j \setminus \mathcal{S}_j$, on which new observations are taken. We then proceed to make further decisions on $\mathcal{A}_{j+1}$. The process is repeated until the active set becomes empty.

This simultaneous decision view motivates us to employ the idea of ranking and thresholding that has been widely used in the multiple testing literature. For example, the FDR procedure in [13] first orders all $p$-values from smallest to largest, and then uses a step-up method to choose a cutoff along the $p$-value ranking. Under (3.1.9), we make three types of decisions simultaneously at each stage based on the ranking produced by the ordered $T_{OR}$: (i) identifying non-null cases, (ii) eliminating null cases, and (iii) selecting coordinates for further measurements. Thus the proposed multistage testing procedure similarly operates via a ranking and thresholding scheme.

We describe in Algorithm 1 the proposed method, called SMART, for "simultaneous multistage adaptive ranking and thresholding". The operation of SMART can be described as follows. At stage $j$, we first calculate the oracle statistics in the active set $\mathcal{A}_j$ and sort them from smallest to largest. Then we carry out two thresholding procedures along the ranking: algorithm (a) chooses a lower cutoff $\hat{t}^{\,l,j}_{OR}$, with selected locations claimed as signals; algorithm (b) chooses an upper cutoff $\hat{t}^{\,u,j}_{OR}$, with selected locations claimed as nulls. We stop sampling at locations where definitive decisions are made, and take new observations at the remaining locations for further investigation. The above steps are iterated until convergence.

Algorithm 1. The SMART procedure

Define the lower and upper thresholds $\tilde{t}^{\,l}_{OR} = \alpha$ and $\tilde{t}^{\,u}_{OR} = \frac{1}{1+\beta}$.
Let $\mathcal{A}_j$ be the active set at stage $j$, $j = 1, 2, \ldots$. Iterate Step 1 to Step 3 until $\mathcal{A}_j = \emptyset$.

Step 1 (Ranking). For all $i \in \mathcal{A}_j$, compute $T^{i,j}_{OR}$ and sort them in ascending order: $T^{(1),j}_{OR} \le T^{(2),j}_{OR} \le \cdots \le T^{(k_j),j}_{OR}$, where $k_j = \mathrm{Card}(\mathcal{A}_j)$.

Step 2 (Thresholding).
(a) (Signal discovery). Let $k^s_j = \max\{r : r^{-1} \sum_{i=1}^{r} T^{(i),j}_{OR} \le \tilde{t}^{\,l}_{OR}\}$ and $\hat{t}^{\,l,j}_{OR} = T^{(k^s_j),j}_{OR}$. Define $\mathcal{S}^s_j = \{i \in \mathcal{A}_j : T^{i,j}_{OR} \le \hat{t}^{\,l,j}_{OR}\}$. For all $i \in \mathcal{S}^s_j$, stop sampling and let $\delta_i = 1$.
(b) (Noise elimination). Let $k^e_j = \max\{r : r^{-1} \sum_{i=0}^{r-1} T^{(k_j - i),j}_{OR} \ge \tilde{t}^{\,u}_{OR}\}$ and $\hat{t}^{\,u,j}_{OR} = T^{(k_j - k^e_j + 1),j}_{OR}$. Define $\mathcal{S}^e_j = \{i \in \mathcal{A}_j : T^{i,j}_{OR} \ge \hat{t}^{\,u,j}_{OR}\}$. For all $i \in \mathcal{S}^e_j$, stop sampling and let $\delta_i = 0$.

Step 3 (Updating). Let $\mathcal{A}_{j+1} = \mathcal{A}_j \setminus (\mathcal{S}^s_j \cup \mathcal{S}^e_j)$. Take new observations on $\mathcal{A}_{j+1}$.

The SMART algorithm utilizes stage-wise thresholds $\hat{t}^{\,l,j}_{OR}$ and $\hat{t}^{\,u,j}_{OR}$. The operation of the algorithm implies that we always have $\hat{t}^{\,l,j}_{OR} \ge \tilde{t}^{\,l}_{OR}$ and $\hat{t}^{\,u,j}_{OR} \le \tilde{t}^{\,u}_{OR}$ for all $j$. Hence SMART always uses fewer samples than $\tilde{\pmb{d}}_{OR}$. The next theorem shows that SMART is valid for FPR and MDR control.

Theorem 6. Denote $\pmb{d}_{SM}$ the SMART procedure described in Algorithm 1 with pre-specified FPR and MDR levels $(\alpha, \beta)$. Then FPR$(\pmb{d}_{SM}) \le \alpha$ and MDR$(\pmb{d}_{SM}) \le \beta$.

In the thresholding step, SMART chooses the lower and upper cutoffs based on moving averages of the selected oracle statistics. We call SMART a compound thresholding procedure, since its stage-wise cutoffs $\hat{t}^{\,l,j}_{OR}$ and $\hat{t}^{\,u,j}_{OR}$ are jointly determined by data from multiple locations. By contrast, we call $\tilde{\pmb{d}}_{OR}$ a simple thresholding rule, as the decision at location $i$ depends only on its own data. By adopting a compound thresholding scheme and pooling information across parallel SPRTs, SMART overcomes the overshoot problem of individual SPRTs. To illustrate the point, consider the following toy example.

Example 3.1.1. Let $\alpha = 0.05$. Suppose at stage $j$ the ordered $T_{OR}$ values are $\{0.01, 0.055, 0.07, 0.10, \ldots\}$. If we use the simple thresholding rule $\tilde{\pmb{d}}_{OR}$ with $\tilde{t}^{\,l}_{OR} = \alpha = 0.05$, then we can reject only one hypothesis, with $T^{(1),j}_{OR} = 0.01$; the gap between $T^{(1),j}_{OR}$ and $\alpha$ is 0.04. By contrast, the moving average of the top three statistics is $0.045 < \alpha$; hence SMART rejects three hypotheses. The gap between the moving average and the threshold is only 0.005. Thus the boundary can be hit more precisely by the moving average. Therefore SMART has smaller approximation errors compared to individual SPRTs, which leads to more accurate FPR and MDR control as well as savings in study budgets.
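To make the compound thresholding step concrete, the R sketch below applies Step 2 to a vector of oracle statistics on the current active set and returns the indices claimed as signals, the indices claimed as nulls, and the coordinates carried forward. The thresholds and the form of the noise-elimination rule follow the reconstruction of Algorithm 1 given above, and the function name is illustrative.

```r
# Minimal sketch of SMART Step 2 for one stage (reconstruction; names illustrative).
# 't_or' = oracle statistics T_OR on the current active set; alpha, beta as in (3.1.8).
smart_stage_threshold <- function(t_or, alpha, beta) {
  tl <- alpha; tu <- 1 / (1 + beta)
  ord <- order(t_or)                          # ascending
  ts  <- t_or[ord]; k <- length(ts)
  # (a) signal discovery: largest r with running mean of the smallest r stats <= tl
  lo_mean <- cumsum(ts) / seq_len(k)
  ks <- if (any(lo_mean <= tl)) max(which(lo_mean <= tl)) else 0
  # (b) noise elimination: largest r with running mean of the largest r stats >= tu
  hi_mean <- cumsum(rev(ts)) / seq_len(k)
  ke <- if (any(hi_mean >= tu)) max(which(hi_mean >= tu)) else 0
  signals <- if (ks > 0) ord[seq_len(ks)] else integer(0)
  nulls   <- if (ke > 0) ord[seq(k, k - ke + 1)] else integer(0)
  list(signals = signals, nulls = nulls,
       active_next = setdiff(seq_along(t_or), c(signals, nulls)))
}

# Toy check against Example 3.1.1 (alpha = 0.05): three hypotheses are rejected.
smart_stage_threshold(c(0.01, 0.055, 0.07, 0.10, 0.5), alpha = 0.05, beta = 0.05)$signals
```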
For example, the Bonferroni method can be uniformly improved by compound thresholding rules such as Holm's procedure [62] and Hochberg's procedure [59]. Moreover, the well known BH procedure [13] is also a compound thresholding procedure. To see this, let ^ G(t) be the empirical distribution of p-values. Then the BH threshold, given bytBH ( ^ G) = supft :t= ^ G(t)g [47], depends on a bulk ofp-values. The aforementioned methods employ step-wise algorithms that involve ranking and compound thresholding, which is also adopted by SMART. By pooling information between dierent locations, the approximation errors of simple thresholding rules can be greatly reduced. The distilled sensing (DS, [54]) and single sequential thresholding (SST, [74]) methods provide ecient adaptive sampling schemes that narrow down the focus in a sequential manner. SMART employs a similar adaptive sampling and inference framework but improves DS and SST in several ways. First, as opposed to DS and SST that use xed thresholding rules at all units and through all stages [e.g. DS always uses 0 as the threshold and hence eliminates roughly half of the coordinates in each distillation stage], 89 SMART uses data-driven thresholds that are adaptive to both data and pre-specied FPR and MDR levels. Second, the sampling and inference rules in DS and SST only utilize observations at the current stage. By contrast, the building block of SMART is T i;j OR , which utilizes all observations collected up to the current stage. This can greatly increase the signal to noise ratio and lead to more powerful testing and stopping rules. Third, SMART utilizes compound thresholding to exploit the multiple hypothesis structure, which can greatly improve the accuracy of the error rates control. Finally, DS and SST only have a stopping rule to eliminate null locations. Intuitively, such a scheme is inecient as one should also stop sampling at locations where the evidence against the null is extremely strong. The proposed scheme in Step 2 of SMART involves two algorithms that can simultaneously identify signals and eliminate noise. In other words, SMART stops sampling once a denitive decision (i = 0 ori = 1) is reached; this more exible operation is desirable and would further save study budgets. 3.2 Limits of Sparse Recovery This section discusses the notion of fundamental limits in sparse inference for both xed and adaptive settings. The discussion serves two purposes. First, as the limits re ect the diculties of support recovery under dierent settings, they demonstrate the advantage of using adaptive designs over xed designs. Second, the limit provides an optimality benchmark for what can be achieved asymptotically by any inference procedure in the form of a lower bound; hence we can establish the optimality of an inference procedure by proving its capability of achieving the limit. 3.2.1 The fundamental limits in adaptive designs Dierent from previous sections that consider a nite (and xed) dimension p, the asymptotic analysis in this section assumes thatp!1. Under this asymptotic framework, the FPR and MDR levels are denoted p and p, both of which tend to zero as p!1. In Malloy and Nowark (2014) [74], the fundamental limits for reliable recovery were established under the family wise error rate (FWER) criterion. This section extends their theory to the important FPR and MDR paradigm. 90 The basic setup assumes that the null and alternative distributions F0 andF1 are identical across all locations. 
This more restrictive model is only for theoretical analysis and our proposed SMART procedure works under the more general model (1.2.2), which allows F1i to vary across testing units. Let G and H be two distributions with corresponding densities g and h. Dene D(GkH) = R 1 1 g(x) log g(x) h(x) dx. Then the Kullback-Leibler (KL) divergence, given by DKL(F0;F1) = maxfD(F0kF1);D(F1kF0)g; can be used to measure the distance between F0 and F1. Let be the average number of measurements allocated to each unit. Consider a general multistage decision procedure d d d. The performance of d d d is characterized by its total riskR (d d d) = FPR(d d d)+MDR(d d d). Intuitively,DKL(F0;F1) and together would characterize the possibility of constructing ad d d such that R (d d d)! 0. A decision rule d d d is symmetric if d d df (x x x j i : i 2 Aj )g = fd d d(x x x j i : i 2 Aj )g for all permutation operators ([31]). Most existing multistage testing methods, such as DS, SST and SMART, are symmetric procedures. The fundamental limit, described in the next theorem, gives the minimum condition under which it is possible to construct a symmetricd d d such that R (d d d)! 0. Theorem 7. Fundamental limits (lower bound). Let d d d be a symmetric multistage adaptive testing rule. Assume that < 1 3 . If EAST(d d d) log(4) 1 D KL (F 0 ;F 1 ) , then we must have R (d d d) for all > 0. In asymptotic analyses we typically take an that converges to 0 slowly. Theorem 7 shows that any adaptive procedure with total risk tending to zero must at least have an EAST (or the average sample size per dimension) in the order of logf(4) 1 g. This limit can be used as a theoretical measure to assess the eciency of a multistage procedure. Denoted d dSM = (N N NSM; SM ) the SMART procedure described in Algorithm 1. As we proceed, we need the following assumption that is essentially equivalent to Condition (8) in [74]: E(T i;N i OR jT i;N i OR >tu) tu (1tu)e C 1 +tu ; E(T i;N i OR jT i;N i OR <t l ) t l (1t l )e C 2 +t l (3.2.1) for all possible thresholds t l ; tu. The condition is satised when logL i;1 follows a bounded distribution such Gaussian and exponential distributions. A more detailed discussion on this issue can be found in [49]. The following theorem derives the upper bound and establishes the optimality of SMART. 91 Theorem 8. Asymptotic optimality (Upper bound). Consider the SMART procedure described in Algorithm 1 with lower and upper thresholds t l = 1 f(p) 1+ and tu = (1)f(p) 1+ + (1)f(p) 1+ ; where > 0 is a small constant and f(p) is a function of p that grows to innity at an arbitrarily slow rate. Then under (3.2.1), the SMART procedure satises limp!1R ( SM ) = 0 and lim p!1 EAST(N N NSM ) (1 +) logf(p) minf(D(F0jF1);D(F1jF0)g : If we take =f(p) withf(p)!1 slowly, then the rates in Theorems 7 and 8 would match. Therefore SMART is optimal for adaptive testing under this asymptotic regime. 3.2.2 Comparison with existing results We rst review the limits for xed and adaptive designs in the literature and then compare them with our new limits. Consider a two-point normal mixture model under a single-stage design Xi i:i:d: (1p)N(0; 1) +pN(p; 2 ); (3.2.2) where p =p , p = p 2r logp and 0< < 1; 0<r < 1. The fundamental limits for a range of global and simultaneous inference problems have derived under this setup [34, 26, 108]. Of particular interest is the classication boundary [76, 54, 108], which demarcates the possibility of constructing a subset with both the FPR and MDR tending to zero. 
Under model (3.2.2), the classication boundary is a straight line r = in the -r plane for both the homoscedastic case ( = 1) [76, 54] and heteroscedastic case (6= 1) [108]. Hence the goal of R ! 0 requires that the signal magnitude p must be at least in the order of p logp. This gives the fundamental limit of sparse recovery for xed designs. The rate p logp can be substantially improved in the adaptive setting. For example, Haupt et al. (2011) [54] proposed the distilled sensing (DS) method, which is capable of achieving the classication boundary with much weaker signals. DS is a multistage testing procedure with a total measurement 92 budget of 2p. It assumes that observations follow a mixture model with noise distributed as standard normal. At each stage, DS keeps locations with positive observations and obtain new observations for these locations in the next stage. It was shown in [54] that after k = maxfdlog 2 logpe; 0g + 2 steps, the DS algorithm successfully constructs a subset with both FPR and MDR tending to zero provided that p diverges at an arbitrary rate in the problem dimensionp. A more general result on the limit of sparse recovery was obtained in [74], where the KL divergence and the average measurements per dimension are used in place of the growing signal amplitude p to characterize the diculty of the problem. Let s denote the cardinality of the support. It was shown in [74] that under xed designs, the reliable recovery requires that must be at least in the order of logp, whereas under adaptive designs, the required is in the order of logs. This reveals the advantage of adaptive designs (note logs log logp under the calibration p =p ). SST is optimal in the sense that it achieves the rate of logs asymptotically [74]. The upper and lower bounds presented in Theorems 7 and 8 show that the FPR-MDR paradigm requires fewer samples to guarantee that R ! 0. The result is consistent with the rate achieved by DS under model (3.2.2) [54]. SMART only requires a of the order logf(p), where f(p)!1 at any rate. This slightly improves the rate of logs achieved by SST. The improvement is expected as SST is developed to control the more stringent FWER. 3.3 Simulation Section 3.3.1 discusses the estimation of the oracle statistic TOR under a hierarchical normal mixture model. Sections 3.3.2 and 3.3.3 compare SMART with competitive methods. 3.3.1 Estimation of the oracle statistic In practice the oracle statisticTOR is unknown and needs to be estimated from data. We present detailed formulae for computing TOR in a Bayesian hierarchical model considered in [7, 115]. The model, which utilizes non-informative priors and allows varied signal magnitudes across locations, has been widely used in signal processing by providing a exible framework for a range of estimation and testing problems. 93 Let1; ;p be independent Bernoulli() variables. Assume that observationsXij obey the following multistage model: Xij =i +ij; ijN(0; 2 ); (3.3.1) wherei = 0 if i = 0, and iN i; 2 i ifi = 1. The prior mean and variance are denoted i(0) and 2 i (0), respectively. We use the method in Jin and Cai (2007) [67] to get an initial estimate of the prior probability ^ and the null distribution parameter ^ 2 . Our initialization utilizes non-informative priors: for all i let T i;0 OR = ^ , i(0) be the average of top 100^ % observations, and 2 i (0) = 1. 
At each stage $j$, we collect new observations on the active set and update our estimates of the signal mean and variance, denoted $\mu_i(j)$ and $\sigma_i^2(j)$, together with the posterior null probability $T^{i,j}_{OR}$. Let $\phi(\cdot\,; \mu, \sigma^2)$ denote the density function of a normal distribution with mean $\mu$ and variance $\sigma^2$. The recursive formula is based on
\[
f(\mu_i \mid \theta_i = 1, \pmb{x}_i^{j+1}) \propto f(x_{i,j+1} \mid \mu_i)\, f(\mu_i \mid \pmb{x}_i^j, \theta_i = 1) = \phi(x_{i,j+1}; \mu_i, \sigma^2)\, \phi\{\mu_i; \mu_i(j), \sigma_i^2(j)\}.
\]
Completing the square, we have $f(\mu_i \mid \theta_i = 1, \pmb{x}_i^{j+1}) \propto \exp\left[ -\frac{\{\mu_i - \mu_i(j+1)\}^2}{2\sigma_i^2(j+1)} \right]$. The proposed algorithm operates as follows.

Algorithm 2. Recursive estimation of $T^{i,j}_{OR}$

For $j = 0, 1, 2, \ldots, N$, compute the following quantities in turn:
\[
\mu_i(j+1) = \frac{\sigma_i^2(j)}{\sigma_i^2(j) + \hat{\sigma}^2}\, x_{i,j+1} + \frac{\hat{\sigma}^2}{\sigma_i^2(j) + \hat{\sigma}^2}\, \mu_i(j); \qquad
\sigma_i^2(j+1) = \frac{\sigma_i^2(j)\, \hat{\sigma}^2}{\sigma_i^2(j) + \hat{\sigma}^2};
\]
\[
f_{i,0,j+1} = \phi(x_{i,j+1}; 0, \hat{\sigma}^2); \qquad
f_{i,1,j+1} = \phi\{x_{i,j+1}; \mu_i(j), \sigma_i^2(j) + \hat{\sigma}^2\};
\]
\[
T^{i,j+1}_{OR} = P(\theta_i = 0 \mid \pmb{x}_i^{j+1}) = \frac{T^{i,j}_{OR}\, f_{i,0,j+1}}{T^{i,j}_{OR}\, f_{i,0,j+1} + \left(1 - T^{i,j}_{OR}\right) f_{i,1,j+1}}.
\]
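A minimal R sketch of the recursive update in Algorithm 2 for a single location is given below; it assumes the initial quantities (initial $T_{OR}$, prior mean and variance, and noise variance) are supplied, and the variable and function names are illustrative.

```r
# Minimal sketch of Algorithm 2 for one location (illustrative names).
# x: observations x_{i,1}, ..., x_{i,N} collected over the stages;
# t0: initial T_OR (prior null probability); mu0, s20: prior mean and variance;
# sigma2: estimated noise variance.
recursive_Tor <- function(x, t0, mu0, s20, sigma2) {
  Tor <- numeric(length(x)); mu <- mu0; s2 <- s20; t_or <- t0
  for (j in seq_along(x)) {
    f0 <- dnorm(x[j], mean = 0,  sd = sqrt(sigma2))        # null density
    f1 <- dnorm(x[j], mean = mu, sd = sqrt(s2 + sigma2))   # predictive non-null density
    t_or <- t_or * f0 / (t_or * f0 + (1 - t_or) * f1)      # posterior null probability
    w  <- s2 / (s2 + sigma2)                               # conjugate normal update
    mu <- w * x[j] + (1 - w) * mu
    s2 <- s2 * sigma2 / (s2 + sigma2)
    Tor[j] <- t_or
  }
  Tor
}

# Toy usage: a non-null stream with mean 3
set.seed(4)
recursive_Tor(rnorm(5, mean = 3), t0 = 0.9, mu0 = 3, s20 = 1, sigma2 = 1)
```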
The data are generated from the multistage model (3.3.1). The number of locations is $p = 10^5$. The following two settings are considered:

Setting 1: $\pi = 0.05$; $\mu_i = \mu_0$ for all $i$; vary $\mu_0$ from 2 to 4 with step size 0.2.
Setting 2: draw $\mu_i$ randomly from $U(2, 4)$; vary $\pi$ from 0.05 to 0.2.

In both settings, the FPR, MDR and ESS are computed by averaging over 100 replications and are plotted as functions of the varied parameter values. The results are summarized in Figure 3.2. We can see that the error rates of both OR.SM and DD.SM match well with those of DS, but they require fewer samples. OR.ST and DD.ST also outperform DS, achieving lower error rates with fewer samples.

Figure 3.1: Comparison with single thresholding. The displayed procedures are DD.SM, OR.SM, DD.ST and OR.ST; the panels show the FPR, MDR and ESS as functions of $\mu$ (Settings 1 and 2) and $\pi$ (Setting 3).

3.4 Applications

This section applies SMART and DS to HTS studies. The other application, to satellite image analysis, is provided in the Supplementary Material. We compare the performances of the different methods in two ways: (i) the total sample size needed to achieve pre-specified error rates, and (ii) the actual error rates achieved for a fixed total sample size.

Figure 3.2: Comparison with DS. The displayed procedures are DS, DD.SM, OR.SM, DD.ST and OR.ST; the panels show the FPR, MDR and ESS as functions of $\mu$ (Setting 1) and $\pi$ (Setting 2).

The goal of the HTS study conducted by McKoy et al. (2012) is to identify novel inhibitors of the amyloid beta peptide (A$\beta$), whose aggregation is believed to be a trigger of Alzheimer's disease. In the study, a total of $p = 51{,}840$ compounds are tested, with three measurements recorded for each compound. We use the observed data set as a pilot data set and simulate observations in later stages to illustrate how to design a multistage sampling and inference procedure for identifying useful compounds. We first obtain z-scores based on the average of the three measurements and then estimate the non-null proportion and the null distribution using the method in [67]. The estimated non-null proportion is $\hat\pi \approx 0.0007$, and the estimated null distribution (referred to as the empirical null distribution, [37]) is $N(\hat\mu_0, \hat\sigma_0^2)$ with $\hat\mu_0 = 0.2459$ and $\hat\sigma_0 = 0.6893$. Next, we take the largest $100\hat\pi\%$ of the data and use their average as the signal amplitude, $\hat\mu = 3.194$. The observations in later stages will be generated based on the estimated parameters.
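A small Python sketch of how such later-stage observations can be generated from the pilot estimates is shown below. It is not from the thesis: in particular, reusing the empirical null scale $\hat\sigma_0$ for the signal component is an assumption I make for illustration, since the text does not specify the signal variance here.

```python
import numpy as np

# Pilot estimates reported above for the HTS data.
pi_hat, mu0_hat, sigma0_hat, mu_hat = 0.0007, 0.2459, 0.6893, 3.194

def next_stage_z(theta, rng):
    """Generate z-scores for one later stage: nulls from the empirical null
    N(mu0_hat, sigma0_hat^2); signals centred at the estimated amplitude mu_hat.
    Using sigma0_hat as the signal noise scale is an assumption."""
    means = np.where(theta == 1, mu_hat, mu0_hat)
    return rng.normal(loc=means, scale=sigma0_hat)

rng = np.random.default_rng(1)
p = 51_840
theta = rng.binomial(1, pi_hat, size=p)   # hypothetical ground truth for the simulation
z2 = next_stage_z(theta, rng)             # stage-two observations for the compounds
```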
We set both the FDR and MDR at level 0.1, apply SMART, and record the total sample size. We then apply DS with the sample size recorded by SMART. The results are summarized below.

Methods   FDP         MDP         Total observations
SMART     0.083333    0.1081081   56926
DS        0.9971195   0           77641

Since DS always eliminates half of the locations at each step, for this particular instance DS requires at least $1.5p$ observations and does not offer proper error rate control. Next, we run DS up to 10 stages, record the FPR and MDR levels, and then apply SMART at the same levels and compare the required sample sizes. The results are summarized below. We can see that SMART has smaller error rates while taking significantly fewer samples.

Methods   FDP         MDP          Total observations
DS        0.2340426   0.02702703   104308
SMART     0.1487284   0.02162162   67850

3.5 Proofs

This section provides the proofs of all theorems in the main text. The derivation of the approximation formulae (3.1.8) is provided in the Supplementary Material.

3.5.1 Proof of Theorem 4

Proof. Part (a). Denote $\boldsymbol d(t_l, t_u) = \{(N_i, \delta_i) : 1 \le i \le p\}$. Using the definition of $T_{OR}^{i,N_i}$, we have
\[
E\left\{\sum_{i=1}^p (1-\theta_i)\delta_i\right\}
= E_{\boldsymbol X}\, E_{\boldsymbol\theta \mid \boldsymbol X}\left\{\sum_{i=1}^p (1-\theta_i)\delta_i\right\}
= E_{\boldsymbol X}\left(\sum_{i=1}^p T_{OR}^{i,N_i}\delta_i\right).
\]
Then the FPR is $Q(t_l, t_u) = E\left(\sum_{i=1}^p T_{OR}^{i,N_i}\delta_i\right) \big/ E\left(\sum_{i=1}^p \delta_i\right)$. It follows that
\[
E\left[\sum_{i=1}^p \left\{T_{OR}^{i,N_i} - Q(t_l, t_u)\right\} I\left(T_{OR}^{i,N_i} \le t_l\right)\right]
= E\left[\sum_{i:\, T_{OR}^{i,N_i} \le t_l} \left\{T_{OR}^{i,N_i} - Q(t_l, t_u)\right\}\right] = 0.
\]
The above equation implies that $Q(t_l, t_u) \le t_l$; otherwise every term on the LHS would be negative, resulting in a contradiction. According to Assumption 1, $P_{\theta_i}(N_i < \infty) = 1$ for all $i$; see [17] for a proof. Since $P_{\theta_i}(N_i < \infty) = 1$ for all $i$ and $p$ is finite, we have $P(\max_i N_i < \infty) = 1$, i.e. the oracle procedure has a finite stopping time.

Next we prove that for a fixed $t_u$, $Q_{OR}(t_l, t_u)$ is non-decreasing in $t_l$. Let $Q_{OR}(t_{l,j}, t_u) = \alpha_j$ for $j = 1, 2$. We only need to show that if $t_{l,1} < t_{l,2}$, then $\alpha_1 \le \alpha_2$. Denote by $N_{i,1}$ and $N_{i,2}$ the stopping times for location $i$ corresponding to the thresholds $(t_{l,1}, t_u)$ and $(t_{l,2}, t_u)$, respectively. If $t_{l,1} < t_{l,2}$, then it is easy to see that, for any particular realization of the experiment, we must have $N_{i,1} \ge N_{i,2}$. We shall show that if $t_{l,1} < t_{l,2}$ and $\alpha_1 > \alpha_2$, then we obtain a contradiction. To see this, note that
\begin{align*}
\left(T_{OR}^{i,N_{i,2}} - \alpha_2\right) I\left(T_{OR}^{i,N_{i,2}} \le t_{l,2}\right)
&= \left(T_{OR}^{i,N_{i,2}} - \alpha_2\right) I\left(T_{OR}^{i,N_{i,2}} \le t_{l,1}\right)
+ \left(T_{OR}^{i,N_{i,2}} - \alpha_2\right) I\left(t_{l,1} < T_{OR}^{i,N_{i,2}} \le t_{l,2}\right) \\
&= \left(T_{OR}^{i,N_{i,1}} - \alpha_2\right) I\left(T_{OR}^{i,N_{i,1}} \le t_{l,1}\right)
+ \left(T_{OR}^{i,N_{i,2}} - \alpha_2\right) I\left(t_{l,1} < T_{OR}^{i,N_{i,2}} \le t_{l,2}\right) \\
&\ge \left(T_{OR}^{i,N_{i,1}} - \alpha_1\right) I\left(T_{OR}^{i,N_{i,1}} \le t_{l,1}\right)
+ (\alpha_1 - \alpha_2)\, I\left(T_{OR}^{i,N_{i,1}} \le t_{l,1}\right)
+ \left(T_{OR}^{i,N_{i,2}} - \alpha_1\right) I\left(t_{l,1} < T_{OR}^{i,N_{i,2}} \le t_{l,2}\right).
\end{align*}
The second equality holds because if $T_{OR}^{i,N_{i,2}} \le t_{l,1}$, then we must have $N_{i,1} = N_{i,2}$. Taking expectations on both sides, we have
\[
E\left\{\sum_{i=1}^p \left(T_{OR}^{i,N_{i,2}} - \alpha_2\right) I\left(T_{OR}^{i,N_{i,2}} \le t_{l,2}\right)\right\} = 0
\quad\text{and}\quad
E\left\{\sum_{i=1}^p \left(T_{OR}^{i,N_{i,1}} - \alpha_1\right) I\left(T_{OR}^{i,N_{i,1}} \le t_{l,1}\right)\right\} = 0.
\]
However, since $\alpha_1 > \alpha_2$ and $\alpha_1 \le t_{l,1}$ as shown previously, we must have
\[
E\left\{\sum_{i=1}^p (\alpha_1 - \alpha_2)\, I\left(T_{OR}^{i,N_{i,1}} \le t_{l,1}\right)\right\} > 0
\quad\text{and}\quad
E\left\{\sum_{i=1}^p \left(T_{OR}^{i,N_{i,2}} - \alpha_1\right) I\left(t_{l,1} < T_{OR}^{i,N_{i,2}} \le t_{l,2}\right)\right\} \ge 0.
\]
This leads to a contradiction. Therefore, we conclude that $Q_{OR}(t_l, t_u)$ is non-decreasing in $t_l$ for a fixed $t_u$. Next, we prove that $\tilde Q(t_l, t_u)$ is non-increasing in $t_u$ for a fixed $t_l$.
By the denition of MDR and similar arguments for the FPR part, we have E " p X i=1 1T i;N i OR n I T i;N i OR tu ~ Q(t l ;tu) o # = 0: Since our model has a nite stopping time, naturally we have n i :T i;N i OR tu o [ n j :T j;N j OR t l o = f1; 2; 3; ;pg: It follows that E P n i:T i;N i OR tu o 1T i;N i OR n 1 ~ Q(t l ;tu) o =E ( P j:T j;N j OR t l 1T j;N j OR ~ Q(t l ;tu) ) : We have 1 ~ Q(t l ;tu) ~ Q(t l ;tu) = E ( P j:T j;N j OR t l 1T j;N j OR ) E P n i:T i;N i OR tu o 1T i;N i OR : (3.5.1) Consider two thresholds tu;1 > tu;2. Denote Ni;1 and Ni;2 the corresponding stopping times at loca- tion i. The operation of the thresholding procedure implies that Ni;1 Ni;2, n i :T i;N i;1 OR tu;1 o n i :T i;N i;2 OR tu;2 o , and n j :T j;N j;2 OR t l o n j :T j;N j;1 OR t l o . Therefore, E ( P i:T i;N i;2 OR t u;2 1T i;N i;2 OR ) =E ( P i:T i;N i;2 OR t u;1 1T i;N i;2 OR ) +E P n i:t u;1 >T i;N 2 OR t u;2 o 1T i;N i;2 OR E ( P i:T i;N i;1 OR t u;1 1T i;N i;1 OR ) : We have shown that n j :T j;N j;2 OR t l o n j :T j;N j;1 OR t l o . Moreover, on the set n j :T j;N j;2 OR t l o , we have Ni;1 =Ni;2. It follows that E ( P j:T j;N j;1 OR t l 1T j;N j;1 OR ) E ( P j:T j;N j;2 OR t l 1T j;N j;2 OR ) : 100 Combining the above results, we have E ( P j:T j;N j;1 OR t l 1T j;N j;1 OR ) E ( P i:T i;N i;1 OR t u;1 1T i;N i;1 OR ) E ( P j:T j;N j;2 OR t l 1T j;N j;2 OR ) E ( P i:T i;N i;2 OR t u;2 1T i;N i;2 OR ): Hence if tu;1 >tu;2, then it follows from (3.5.1) that 1 ~ Q(t l ;tu;1) ~ Q(t l ;tu;1) 1 ~ Q(t l ;tu;2) ~ Q(t l ;tu;2) : Therefore ~ Q(t l ;tu;1) ~ Q(t l ;tu;2). We conclude that ~ Q(t l ;tu) is non-increasing in tu for a xed t l . Part (b). The proof is divided into two parts. The rst part describes a process that identies a unique pair of (t l OR ;t u OR ). The second part shows that d d d (t l OR ;t u OR ) has the largest power among all eligible procedures. (1). Oracle thresholds. Let Q(1; 1) = be the theoretical upper bound corresponding to the FPR when all hypotheses are rejected. A prespecied FPR level > 0 is called eligible if < . Let R =ftu : Q(tu;tu) > g. We can see thatR is nonempty if is eligible, since Q(0; 0) = 0 and Q(1; 1) = . Considertu2R. Note thatQ(0;tu) = 0 for alltu, the following threshold is well dened: t l OR (tu) = supft l :Q(t l ;tu)g: (3.5.2) We claim that Qft l OR (tu);tug =. We prove by contradiction. First, according to the continuity of Q(t l ;tu), for every tu2R, we can ndt l (tu) such thatQ (t l (tu);tu) = [sinceQ(0;tu) = 0 andQ(tu;tu)>]. If not the equality does not hold, i.e. we have Qft l OR (tu);tug<; then the monotonicity of Q(t l ;tu) implies that t l (tu)>t l OR (tu), which contradicts the denition of t l OR (tu). The above construction shows that, for every tu2R, we can always identify a unique t l OR (tu) such that Q t l OR (tu);tu =. We say (; ) constitute an eligible pair of prespecied error rates if is eligible, and for this , satises 0 < < sup n ~ Q t l OR (tu);tu :tu2R o : In the above denition, the eligibility of (; ) only depends on the model, but not any given tu. Now consider an eligible pair (; ). The continuity of 101 ~ Qu(tu) ~ Qft l OR (tu);tug implies that we can nd t u such that ~ Q t l OR (t u );t u = . Let t u OR = infftu2 R : ~ Q(t l OR (tu);tu) = g: The pair of oracle thresholds are thus given by (t l OR ;t u OR )ft l OR (t u OR );t u OR g: (2). Proof of optimality. 
Denoted d d = (N N N; ) a sequential procedure that satises FPR(d d d) = ; MDR(d d d) = ; whereN N N = (N 1 ; ;N p ) and = ( 1 ; ; p ) are the corresponding stopping times and decision rules. Deonte EAST(d d d) the expected average stopping times. By denition, we have E ( p X i=1 (T i;N i OR ) i ) = 0; E ( p X i=1 (1T i;N i OR )(1 i ) ) = 0: Now we can sort T i;N i OR as T (1);N (1) OR T (2);N (2) OR T (p);N (p) OR with their corresponding decisions (1) ; (2) ; ; (p) : If does not take the form of there exists a k; such that (i) = 8 > > > < > > > : 1 ik 0 k<ip ; (3.5.3) then we can always modify into such a form with the same EAST and smaller FPR and MDR. Specically, suppose that there exists l1 < l2 such that (l 1 ) = 0 and (l 2 ) = 1, then we swap these two decisions. Such operation can be iterated until the decision rule takes the form as (3.5.3). Denote the new decision rule by d d d 0 = (N N N; 0 ). Since T (l 1 );N (l 1 ) OR T (l 2 );N (l 2 ) OR in each swapping, we can reduce the FPR and MDR: 0 = P p i=1 T i;N i OR 0i E P p i=1 0i P p i=1 T i;N i OR i E P p i=1 i =; 0 = P p i=1 n (1T i;N i OR )(1 0i ) o p P p i=1 n (1T i;N i OR )(1 i ) o p = : Expressing d d d 0 in the form of (3.1.5), we can nd t 0 l and t 0 u such that i;N i = 8 > > > < > > > : 1 T i;N i OR t 0 l 0 T i;N i OR t 0 u ; where Q(t 0 l ;t 0 u ) = 0 ; ~ Q(t 0 l ;t 0 u ) = 0 . Now we claim that t 0 l t l OR (t u OR ) and t 0 u t u OR . We prove by contradiction. First, if we have t u OR >t 0 u , then by the denition of t u OR , we have ~ Q t l OR (t 0 u );t 0 u > ; Q t l OR (t 0 u );t 0 u =: However, we 102 also have ~ Q (t 0 l ;t 0 u ) = 0 : By the denition of ~ Q(t l ;tu), with the sametu, only a largert l could result in a strictly smaller MDR level. Together with the monotonicity of ~ Q(t l ;tu) for a xed tu, we claim that t l OR (t 0 u )<t 0 l . SinceQ(t l ;tu) is non-decreasing in t l for a xedtu, we haveQ t l OR (t 0 u );t 0 u =. It follows that Q (t 0 l ;t 0 u )>; contradicting the fact that Q(t 0 l ;t 0 u ) = 0 . Therefore we must have t 0 u t u OR . Next, assume that t 0 l > t l OR (t u OR ). Then by the denition of t l OR (tu) and monotonicity of Q(t l ;tu), we have Q (t 0 l ;t u OR ) > . Given the fact that Q(t 0 l ;t u OR ) t 0 l , we must have < t 0 l , claiming that t 0 l is always bounded below by regardless of 0 , which leads to a contradiction since we always have Q(t 0 l ;t 0 u ) = 0 <t 0 l . Hence we must havet 0 l t l OR (t u OR ). Therefore EAST(d d d 0 ) = EAST(d d d) EAST(d d dOR) and the desired result follows. 3.5.2 Proof of Theorem 5 Proof. The goal is to show that the pair t l OR = and t u OR = 1 +1 control the FPR and MDR. The FPR part is straightforward since FPR( ~ d d dOR) = E n P p i=1 T i;N i OR I(T i;N i OR ) o E n P p i=1 I(T i;N i OR ) o : To show the MDR part, we rst carry out an analysis of the false negative rate (FNR), which is dened as FNR( ~ d d dOR) = Ef P p i=1 i (1 i )g Ef P p i=1 (1 i )g : According to the operation of ~ d d dOR, the FNR can be further calculated as FNR( ~ d d dOR) = E n P p i=1 (1T i;N i OR )I(T i;N i OR tu) o E n P p i=1 I(T i;N i OR tu) o 1tu = + 1 : Denote ~ d d d i OR = ( ~ N i OR ; ~ i OR ). We have shown that FPR( ~ d d dOR). Suppose the actual FPR level is ~ . ThenE n P p i=1 (1i) ~ i OR o = ~ ( P p i=1 ~ i OR ). 
It follows that (1 ~ )E P p i=1 ~ i OR =E n P p i=1 i ~ i OR o : (3.5.4) 103 Meanwhile, our analysis of the FNR shows that E ( p X i=1 i(1 ~ i OR ) ) + 1 E ( p X i=1 (1 ~ i OR ) ) : (3.5.5) Combining (3.5.4) and (3.5.5), we obtain ( + 1) n p (1 ~ )E P p i=1 ~ i OR o n pE P p i=1 ~ i OR o : It follows thatE P p i=1 ~ i OR p(1)(1 ) ~ +(1)(1~ ) p(1 ) (1~ ) Using (3.5.4) and p =E( P i i), we have E( P p i=1 i ~ i OR ) E( P p i=1 i) E( P p i=1 i): Therefore Ef P p i=1 i(1 ~ i OR )g E( P p i=1 i) and the desired result follows. 3.5.3 Proof of Theorem 6 Proof. Dene stage-wise false positive rate sFPRj and stage-wise false non-discovery rate sFNRj as sFPRj := E n P i2S j (1i)i o E P i2S j i ; sFNRj := E n P i2S j i(1i) o E P i2S j (1i) ; where sFPRj is the ratio of the expected number of false rejections at stage j over the expected number of all rejections at stage j, and sFNRj is the ratio of the expected number of false acceptance over the expected number of all acceptance. By our denition of sFPRj , we have sFPRj = E Pk s j i=1 T (i);j OR E k s j E k s j E k s j =: Similarly, by the denition of sFNRj , we have sFNRj = E n Pk e j 1 i=0 1T k j i;j OR o E k e j E k e j +1 E k e j = + 1 : Therefore SMART controls the sFDR and sFNR at level and +1 , respectively, across all stages. 104 Next we show that if sFPRj and sFNRj are controlled universally at pre-specied levels across all stages, then the global FPR and FNR can be controlled at the same levels. Consider the global FPR rst: FPR = E n P N j=1 P i2S j (1i) i SM o E P N j=1 P i2S j i SM E P N j=1 P i2S j i SM E P N j=1 P i2S j i SM =: Next, we consider the global FNR: FNR = E n P N j=1 P i2S j i 1 i SM o E n P N j=1 P i2S j (1 i SM ) o E n P N j=1 +1 P i2S j 1 i SM o E n P N j=1 P i2S j (1 i SM ) o = + 1 : Finally, according to the arguments in Theorem 2, the MDR satises MDR if FNR +1 , completing the proof. 3.5.4 Proof of Theorem 7 Proof. For a symmetric decision procedure d d d, denote its Type I and Type II errors on unit i by 0 = PH i;0 (Reject Hi;0) and 0 =PH i;1 (Accept Hi;0): It can be shown that the corresponding global error rates are given by FPR(d d d) = (1) 0 (1) 0 +(1 0 ) and MDR(d d d) = 0 ; (3.5.6) respectively. Our result largely follows from the lower bound derived in [74] on family-wise error rate (FWER); we only highlight the main steps on how to go from the FWER paradigm to the FPR/MDR paradigm, which essentially involves exploiting the relationship (3.5.6). From Thm. 2.39 in [97], we have 1 0 log( 0 1 0 ) + (1 0 ) log( 1 0 0 ) D(F0jjF1) and 2 (1 0 ) log ( (1 0 0 ) + 0 log ( 0 1 0 ) D(F1jjF0) ; 105 where1 and2 are the expected stopping times for null and non-null locations, respectively. Furthermore, from [74], we have 1 (1 0 ) log( 0 ) 1 log 2 D(F0jjF1) ; 2 (1 0 ) log( 0 ) 1 log 2 D(F1jjF0) : Using the KL divergence DKL(F0;F1) = maxfD(F0jF1);D(F1jF0)g, the average stopping time of all locations satises = (pp)1 +p2 p (1)(1 0 ) log( 0 ) 1 +(1 0 ) log( 0 ) 1 log 2 DKL(F0;F1) : (3.5.7) We consider two situations. If 0 0 , then (1)(1 0 ) log( 0 ) 1 +(1 0 ) log( 0 ) 1 log 2: Note that for 0 x 1,fx logx : x2 (0; 1)g reaches its minimum when x = e 1 . It follows that 2 1 < e 1=e 0 0 1. Therefore (1 0 ) log( 0 ) 1 log(2 0 ) 1 : Together with (3.5.7), we have log (4 0 ) 1 D KL (F 0 ;F 1 ) : According to our constraint on, we conclude that log(4) 1 DKL(F0;F1) log(4 0 ) 1 : Therefore, 0 and R (d d d). 
If 0 0 , then we can similarly show that (1 0 ) log 01 log 2 DKL(F0;F1) log 1 2 0 log 2 DKL(F0;F1) log 1 4 0 DKL(F0;F1) : It follows that log(4) 1 DKL(F0;F1) log(4 0 ) 1 ; which implies 0 . Under the assumption that < 1 3 and 1 2 , we have 1 2 12 1 . Consider the function k(x) = (1)x (1)x+ . It is easy to see that k(x) is monotonically increasing in x. Hence R (d d d) (1) 0 (1) 0 +(1 0 ) (1) 0 (1) 0 + (1) (1) + (1) (1) 12 1 + =; 106 completing the proof. 3.5.5 Proof of Theorem 8 Proof. We have already shown that when t l = and tu = 1 +1 , the SMART procedure controls the FDR and MDR at level and , respectively. Let = = 1 f(p) 1+ . With the choice of t l and tu mentioned above, we have lim p!1 R (d d d) = lim p!1 2 f(p) 1+ = 0; which proves the rst part of the theorem. Next we establish the upper bound. Consider p simultaneous SPRTs with the same threshold t l and tu. The operation of our SMART procedure uses these thresholds for the moving averages; hence the SPRT approach with the same t l and tu will always take more samples. It is sucient to show that the result holds for simultaneous SPRTs. We convert thresholdst l andtu to the thresholds for SPRTs: A = (1)(1tu ) tu; ;B = (1)(1t l ) t l : Under our specications, we further have A = (1) +1 ;B = (1)(1) : From [74], we have 0 B 1 ; 0 A. According to Assumption (3.2.1), we have E(logL i;N i j logL i;N i < logA) logAC1; E(logL i;N i j logL i;N i > logB) logB +C2 for some positives constants C1 and C2. ConsiderE i =0 logL i;N i . Then E i =0 logL i;N i = (1)E i =0 logL i;N i j logL i;N i < logA E i =0 (logL i;N i j logL i;N i > logB) (1)(logA 1 +C1) logB (1)(logA 1 +C1) logA 1 +C1: 107 Likewise, we can show thatE i =1 (logL i;N i ) logB +C2: LetC3 = (1)C1 +C2. According to Wald's identity (P490, [17]), we have E i =1 (Ni) = E i =1 logL i;N i E i =1 (logL i;1 ) = E i =1 logL i;N i D(F1jF0) ; E i =0 (Ni) = E i =0 logL i;N i E i =0 (logL i;1 ) = E i =0 logL i;N i D(F0jF1) : It follows that lim p!1 logf(p) = lim p!1 (1)E i =0 (Ni) +E i =1 (Ni) logf(p) lim p!1 (1)(logA 1 +C1) logf(p)D(F0jF1) + lim p!1 (logB +C2) logf(p)D(F1jF0) = lim p!1 (1) log +1 (1) + log (1)(1) +C3 logf(p) minf(D(F0jF1);D(F1jF0)g (3.5.8) = lim p!1 (1) log( 1 ) + log (1) +C3 logf(p) minf(D(F0jF1);D(F1jF0)g (3.5.9) = lim p!1 log( 1 ) logf(p) minf(D(F0jF1);D(F1jF0)g (3.5.10) = 1 + minf(D(F0jF1);D(F1jF0)g : From (3.5.8) to (3.5.9), we have used the fact and are error rates converging to zero; hence +1 (1) = 1 f1+o(1)g and (1)(1) = (1) f1+o(1)g: Equation (3.5.10) uses the fact that = . The desired result follows. 3.6 Supplemental Materials 3.6.1 Derivation of thresholds We need the following assumption in our derivation. The assumption has been commonly adopted in the literature on SPRT (e.g. Berger, 1985; Siegmund, 1985). Assumption 2. ConsiderZi;1 dened in Assumption 1. For alli, we haveP i (Zi;1 = 0)< 1,P i (jZi;1j< 1) = 1,P i (Zi;1 < 0)> 0, andP i (Zi;1 > 0)> 0. Moreover, M i (t) =E i [e tZ i;1 ] exists for all t. 108 We start our derivation by noting that T i;j OR is a monotone function of the likelihood ratio statistic L i;j : T i;j OR =P i = 0jX X X j i = 1= 1 + 1 Li;j : (3.6.11) Henced d d (t l ;tu) can be expressed as a thresholding rule based onL i;j : stops sampling for unit i at time Ni = min j 1 :L i;j A orL i;j B , deciding i;N i = 0 ifL i;j A and i;N i = 1 ifL i;j B. We rst solve (A;B) for a given pair (; ), then transform (A;B) to (t l ;tu). The technique used in our derivation is similar to the classical ideas when deriving the upper and lower thresholds for SPRT. 
Since all testing units operate independently and share the same thresholds, it suffices to focus on the operation of the SPRT at a generic testing unit. Hence, for simplicity, we drop the index $i$ and write $\mathcal L_{i,N_i}$, $\theta_i$ and $Z_{i,k}$ as $\mathcal L_N$, $\theta$ and $Z_{\theta,k}$, respectively. Under the random mixture model, the FPR and MDR of the SPRT with thresholds $(A, B)$ can be calculated as
\[
\mathrm{FPR} = \frac{(1-\pi)\, P(\mathcal L_N > B \mid \theta = 0)}{P(\mathcal L_N > B)}, \qquad
\mathrm{MDR} = P(\mathcal L_N < A \mid \theta = 1).
\]
Let $S_N = \sum_{k=1}^N Z_{\theta,k} = \log \mathcal L_N$. Denote $a = \log A$ and $b = \log B$. Under Assumption 2, $P_\theta(N < \infty) = 1$ and all moments of $N$ exist. There exists a unique nonzero number $t_\theta$ for which $M_\theta(t_\theta) = 1$ ([17]). This fundamental identity then implies
\[
1 = E_\theta\left\{\exp(t_\theta S_N)\, M_\theta(t_\theta)^{-N}\right\} = E_\theta\{\exp(t_\theta S_N)\}
\approx \exp(t_\theta a)\, P_\theta(S_N \le a) + \exp(t_\theta b)\, P_\theta(S_N \ge b).
\]
In the above approximation, we ignore the overshoots and pretend that $S_N$ hits the boundaries $a$ and $b$ exactly. In this idealized situation, $S_N$ has a two-point distribution $\tilde P_\theta$:
\[
\tilde P_\theta(S_N = a) = P_\theta(S_N \le a) \quad\text{and}\quad \tilde P_\theta(S_N = b) = P_\theta(S_N \ge b).
\]
Moreover, Assumption 2 implies that
\[
1 = P_\theta(N < \infty) = P_\theta(\mathcal L_N \le A) + P_\theta(\mathcal L_N \ge B) = P_\theta(S_N \le a) + P_\theta(S_N \ge b).
\]
Thus we can solve from the above that
\[
P_\theta(\mathcal L_N \ge B) = P_\theta(S_N \ge b) \approx \frac{1 - \exp(t_\theta a)}{\exp(t_\theta b) - \exp(t_\theta a)}.
\]
According to Assumption 2, $P_\theta(|Z_{\theta,k}| < \infty) = 1$ for $\theta = 0, 1$. Then $t_{\theta=0} = 1$ and $t_{\theta=1} = -1$ (p. 493, [17]). It follows that
\[
\mathrm{FPR} \approx \frac{(1-\pi)\frac{1-A}{B-A}}{P(\mathcal L_N \ge B)}
= \frac{(1-\pi)\frac{1-A}{B-A}}{(1-\pi)\frac{1-A}{B-A} + \pi\frac{1-1/A}{1/B-1/A}}
= \frac{1-\pi}{(1-\pi) + \pi B},
\]
\[
\mathrm{MDR} \approx 1 - \frac{1-\exp(-a)}{\exp(-b) - \exp(-a)} = 1 - \frac{1-1/A}{1/B - 1/A} = \frac{A(B-1)}{B-A}.
\]
Setting $\mathrm{FPR} = \alpha$ and $\mathrm{MDR} = \gamma$ and solving for $A$ and $B$, we obtain
\[
A \approx \frac{\gamma(\alpha^{-1}-1)(1-\pi)}{(\alpha^{-1}-1)(1-\pi) + \pi(\gamma-1)}, \qquad
B \approx \frac{(\alpha^{-1}-1)(1-\pi)}{\pi}.
\]
The relationship (3.6.11) implies $A = (1-\pi)(1-t_u)/(\pi t_u)$ and $B = (1-\pi)(1-t_l)/(\pi t_l)$. Transforming from $\mathcal L_{i,j}$ to $T_{OR}^{i,j}$, the corresponding thresholds can be obtained; in particular, $t_l^{OR} = \alpha$. To ensure effective MDR control, we choose a more stringent (smaller) upper threshold, $t_u^{OR} \le \frac{1}{\gamma + 1}$.

3.6.2 Application to astronomical surveys

In astronomical surveys, a common goal is to separate sparse targets of interest (stars, supernovae, or galaxies) from background noise. We consider a dataset from the Phoenix Deep Survey (PDS), a multi-wavelength survey of a region over 2 degrees in diameter in the southern constellation Phoenix. Fig. 3.3(a) shows a telescope image from the PDS. It has $616 \times 536 = 330{,}176$ pixels, among which 1131 pixels exhibit a signal amplitude of at least 2.98. In practice we monitor the same region for a fixed period of time. After taking high-resolution images, it is of interest to narrow down the focus quickly using a sequential testing procedure, so that limited computational resources can be used to explore certain regions more closely.

The image is converted into gray scale with the signal amplitudes standardized. Fig. 3.3(b) depicts an image contaminated with Gaussian white noise. We apply SMART by setting both the FPR and MDR at 5%, record the total number of measurements, and finally apply DS with the recorded sample size. We can see that SMART controls the error rates precisely. The resulting images for SMART and DS are shown in Fig. 3.3(c) and (d), respectively; SMART produces much sharper images than DS.

Methods   FDP          MDP          Total observations
SMART     0.06321335   0.05658709   368796
DS        0.9864338    0.00265252   495866

Next we implement DS up to 12 stages, record the FDR and MDR levels, and then apply SMART at the recorded error rates. The required sample sizes of the two methods are summarized below. We can see that SMART controls the error rates with fewer observations.

Methods   FDP          MDP          Total observations
DS        0.07237937   0.01414677   670331
SMART     0.03192407   0.00795756   479214
The resulting images for SMART and DS are shown in Fig. 3.3(e) and (f); we can see that SMART produces slightly sharper images than DS.

Figure 3.3: SMART and DS comparison. Panels: (a) original data; (b) noisy observation; (c) SMART result 1; (d) DS result 1; (e) SMART result 2; (f) DS result 2. Panels (a) and (b) show the original radio telescope image and the image tainted with white noise, respectively. Panels (c) and (d) compare SMART and DS when the total numbers of observations are about the same. Panel (f) shows the resulting image when implementing DS for 12 stages; panel (e) shows the image produced by SMART when using the error rates recorded from the 12-stage DS.

Chapter 4
Ongoing and Future Research

4.1 Online FDR Control

Under the online constraint, multiple testing problems deal with hypotheses that arrive in a stream, and decisions must be made immediately after they arrive. Most importantly, the decision at any given time $t$ cannot depend on any future, unseen observations. Traditional generalized $\alpha$-investing type algorithms deal with this issue using p-values, which are not the fundamental building block in large-scale multiple testing. We propose a novel empirical Bayes type online FDR control procedure that is more robust and data-adaptive. We start with a user-specified nominal FDR level $\alpha \in (0, 1)$ and guarantee any-time FDR control. Numerical results confirm the effectiveness of the new procedure in FDR control and show that it achieves a substantial power gain over existing p-value based methods.

4.1.1 Introduction

A common goal in modern statistical studies is to distinguish the nulls from the non-nulls in a collection of multiple hypotheses, with suitable control of different types of error. In this article, we aim to control the false discovery rate (FDR), defined as the expected ratio of false discoveries to total discoveries. Specifically, we aim to propose a procedure that can guarantee that the FDR is bounded by a pre-specified constant $\alpha \in (0, 1)$.

The seminal work of Benjamini and Hochberg [13] proposes the BH procedure, which lays the foundation for multiple testing. Other notable procedures include the BHY procedure [16] and the adaptive-z procedure [102], among others. However, such procedures are intrinsically offline, meaning that an algorithm testing $m$ hypotheses receives the entire batch of observations at once. In the online version of the problem, the number of hypotheses is not known in advance; in fact, it could be an infinite stream. In such a setting, the decision to reject or accept the current location's null hypothesis must be made before the next one arrives.

There are multiple motivating justifications for considering the online constraint. For instance, even when the entire batch of observations is at hand, we may still choose to process them in a particular order, especially when we have a prior belief that non-nulls appear earlier in the new ordering; in such cases, we can expect more discoveries than with offline algorithms. This is the main motivation underlying Foster and Stine's paper [44]. Furthermore, in public databases, different research teams test their respective hypotheses on the same database over time, and thresholding individual p-values is inefficient and may deteriorate the results. Another example is multiple A/B tests or multi-armed bandit tests, where internet companies aim to conduct a sequence of tests over time or the data at hand are intrinsically sequential. Such a setting has a wide array of applications.
For instance, in post-selection hypothesis tests, at each step of the forward stepwise and LARS procedures (online feature selection), multiple A/B tests with the goal of identifying an alternative website design that provides improvements over the default design [82] and quality preserving databases (QPD) [2]. Specically, the quality preserving database (QPD) framework has been employed widely by re- searchers coming from dierent backgrounds. Examples include Stanford's HIVdb which serves the com- munity of anti-HIV treatment researches, WTCCC's large-scale data for whole-genome association studies which is distributed to selected research groups, and the NIH in uenza virus resource. Such framework requires a false discovery rate controlling procedure operates sequentially, without prior knowledge of even the total number of hypotheses. Our goal is to construct a powerful FDR controlling procedure (asymptotically) that can potentially handle an innite amount of hypotheses without much loss of power due to the framework constraint. We can formulate such online multiple testing problems as follows. Suppose we have hypotheses arrive sequentially in a stream, and our goal is to recover the support of such potentially innite dimensional vector = (1;2; )2R 1 based on measurements from variables X1;X2; ; in an online manner, 114 i.e., at each step, the investigator must decide whether to reject the current hypothesis without knowledge of future hypotheses. DenoteS =ft :t6= 0g as the support of . Consider the following random mixture model: t Ber(t); Xtjt (1t)F0 +tF1t; t = 1; 2; ; (4.1.1) wheret here is a covariate which can be conceptualized as the time label (or domain knowledge, ordering, etc), characterizing the sequential nature of the problem. Xt follow the null distribution F0 if t = 2S and the non-null distribution F1t if t2S. The model assumes a universal null F0 if t = 2S and a potentially varying non-null distributionF1t. HeterogenousF1t setting encompasses cases where non-zero coordinates have dierent eect sizes. The mixing proportion t varies with the covariate t and is usually assumed to be a smooth function of t. Furthermore, it can be understood as follows: Xt has probability 1t of being a null case and probability t of being a signal. Denote f0 and f1t the corresponding densities. The goal is to make af0; 1g decision for each arriving location, while controlling the False Discovery Rate (FDR) at the nominal level up to the current stage. Foster and Stine [44] designed the rst online-investing procedures that can control mFDR dynam- ically. However, Alpha-investing rules might stop the testing after some number of rejected hypotheses, and there is no general guarantee that alpha-investing controls FDR. Aharoni and Rosset [1] extended the -investing idea to a more generalized class (GAI) which controls the mFDR as well. Javanmard and and Montanari [65] propose the LORD and LOND algorithms, which are two special cases of GAI methods, they can appropriately control FDR for independent p-values. [82] further propose the GAI++ methods, which uniformly improve the power of GAI methods and can deal with more general cases. However, all methods mentioned above arep-value based, which is not the fundamental building block in large-scale hypothesis testing since it fails to capture the overall structure of the compound decision problem [102]. For -investing methods, they can not guard against issues like \piggybacking" and \- death". 
The recently proposed GAI++ methods take a relatively passive approach to such issues by allowing "indecisions". Furthermore, GAI and GAI++ methods require the pre-specification of a string of constants, which is not completely data-adaptive. In this paper, we propose a novel local-fdr based empirical Bayes online FDR algorithm that is suitable for the large-scale setting. We first formulate the problem under a decision-theoretic framework and formally define $\mathrm{FDR}_t$, the error statistic we aim to control. We then propose the oracle offline procedure and explain how it motivates the oracle online procedure. In the subsequent section, we formally introduce the data-driven procedure and the related estimation issues. Numerical simulations and a real data analysis on genomic data confirm the power gain over existing methods and the validity of our data-driven online FDR procedure.

4.1.2 Problem formulation

At time $t = 0$, we pre-specify the level $\alpha$ at which we aim to control the FDR at all subsequent times. The online FDR control problem can be formulated under the decision-theoretic framework. Denote $\theta_t = I(\mu_t \neq 0)$, where $I(\cdot)$ is an indicator function. Let $\boldsymbol\delta = (\delta_1, \delta_2, \ldots)$, where $\delta_t = 0/1$ indicates that location $t$ is classified as a null/non-null case. Further let $\boldsymbol\delta^t = (\delta_1, \delta_2, \ldots, \delta_t)$ be the collection of all decisions up to time $t$. Define the false discovery rate at time $t$ ($\mathrm{FDR}_t$) as
\[
\mathrm{FDR}_t(\boldsymbol\delta^t) = E\left(\frac{\sum_{i=1}^t (1-\theta_i)\delta_i}{\sum_{i=1}^t \delta_i \vee 1}\right). \tag{4.1.2}
\]
We wish to construct a data-driven procedure that can effectively control $\mathrm{FDR}_t$ below the nominal level for all $t$. The FDR is a powerful and popular error criterion in large-scale testing problems. Some existing online multiple testing algorithms control a variant of the FDR, the marginal false discovery rate (mFDR), which is asymptotically equivalent to the FDR under some regularity conditions [13].

4.1.3 Oracle procedures

4.1.3.1 Offline oracle procedure

In this subsection we first focus on the oracle procedures under the traditional offline setting, where all data are observed and the relevant distributional information is known. This essentially means that we can rank our test statistics and choose a cutoff along the path, as most existing multiple testing procedures do. As mentioned above, p-value based methods are in general inefficient in large-scale multiple testing; empirical Bayes type methods such as the adaptive-z procedure [102] are in general more powerful. In the offline setting, we extend the adaptive-z procedure one step further by incorporating the covariate $t$ into the test statistic. Specifically, we consider the oracle offline CLfdr (conditional local false discovery rate) procedure with the modified Lfdr statistic
\[
\mathrm{CLfdr}_t = P(\theta_t = 0 \mid X_t = x_t) = \frac{(1-\pi_t) f_0(x_t)}{f_t(x_t)}, \qquad t = 1, \ldots, T.
\]
Recall that $f_t(x_t) = (1-\pi_t) f_0(x_t) + \pi_t f_{1t}(x_t)$ is the mixture density, $\pi_t$ the prior probability that location $t$ is non-null, and $f_0$ the null density. Suppose we have measurements $x_1, \ldots, x_T$; then the oracle offline CLfdr procedure operates as follows.

Algorithm 1. The oracle offline CLfdr procedure.
Ranking: sort the CLfdr statistics from smallest to largest: $\mathrm{CLfdr}_{(1)}, \ldots, \mathrm{CLfdr}_{(T)}$.
Thresholding: if $\mathrm{CLfdr}_{(1)} > \alpha$, accept the null for all locations. Otherwise, let $\lambda = \max\{\mathrm{CLfdr}_{(k)} : \frac{1}{k}\sum_{i=1}^k \mathrm{CLfdr}_{(i)} \le \alpha\}$ and set $\delta_t = I(\mathrm{CLfdr}_t \le \lambda)$, $t = 1, \ldots, T$.

Denote an offline procedure by $\boldsymbol d_o$ and, specifically, denote the CLfdr procedure by $\boldsymbol d_o^{OR}$. Furthermore, let $\mathcal D_o^{\alpha,T}$ be the collection of all offline procedures that satisfy $\mathrm{FDR}_T(\boldsymbol d_o) \le \alpha$. The following theorem states that the oracle offline CLfdr procedure is valid and optimal for $\mathrm{FDR}_T$ control.

Theorem 9.
The oracle offline CLfdr procedure is valid and optimal for $\mathrm{FDR}_T$ control in the sense that
\[
\mathrm{FDR}_T\left(\boldsymbol d_o^{OR}\right) \le \alpha; \qquad
\mathrm{ETP}(\boldsymbol d) \le \mathrm{ETP}\left(\boldsymbol d_o^{OR}\right) \text{ for all } \boldsymbol d \in \mathcal D_o^{\alpha,T}.
\]

The above oracle offline CLfdr procedure can only control the FDR at the end stage, i.e. when all observations are at hand and ranking is possible. It does not provide any theoretical guarantee on $\mathrm{FDR}_t$ for $t < T$, and it does not operate online. However, it is a key starting point for developing the online FDR procedure, as it provides valuable insight into how to set the thresholds in a data-adaptive fashion.

4.1.3.2 Online oracle procedure

Now we switch our focus to the online setting, where we must make decisions as hypotheses arrive. This natural constraint prohibits the ranking of the test statistics. We develop a new oracle procedure that conforms to this restriction and can operate indefinitely, unlike the p-value based alpha-investing type methods, which can potentially terminate. One major concern for any online FDR procedure is the chance of an early rejection that eats up the budget; [82] refer to this phenomenon as $\alpha$-death and tackle it by introducing "indecisions". We instead stay within the formulation of our problem and deal with this issue by incorporating a dynamic screening step on the test statistics of the online procedure.

The motivation comes from the offline oracle procedure: at the end stage $T$, we are essentially thresholding all the CLfdr statistics at the level $\mathrm{CLfdr}_{(k)}$, where $k$ is the total number of rejections. If we knew this quantity in advance, we could simply use it to threshold the test statistics and recover exactly the decisions of the traditional setting, even under the online constraint, for optimal $\mathrm{FDR}_T$ control. However, this oracle threshold cannot be known beforehand, and the offline procedure can only control $\mathrm{FDR}_T$. The idea is therefore to approximate this threshold at each time point $t$ using the largest CLfdr value we could have rejected so far.

Before introducing the online oracle procedure, we first define some necessary notation. Let $\mathcal A_t = \{i : i \le t, \delta_i = 1\}$ be the collection of locations rejected up until time $t$, and let $\lambda_t$ be the threshold applied to $\mathrm{CLfdr}_t$. Our oracle procedure operates as follows.

Algorithm 2. The oracle online FDR procedure.
Initialization: $\mathcal A_0 = \emptyset$, $\lambda_0 = 1$.
Updating the barrier: at time $t$, sort the CLfdr statistics from smallest to largest: $\mathrm{CLfdr}_{(1)}, \ldots, \mathrm{CLfdr}_{(t)}$. If $\mathrm{CLfdr}_{(1)} > \alpha$, let $\lambda_t = \lambda_{t-1}$. Otherwise, let $\lambda_t = \max\{\mathrm{CLfdr}_{(k)} : \frac{1}{k}\sum_{i=1}^k \mathrm{CLfdr}_{(i)} \le \alpha,\ k < t\}$.
Decision: if $\mathrm{CLfdr}_t \le \lambda_t$, set $\delta_t = I\left(\frac{\sum_{i \in \mathcal A_{t-1}} \mathrm{CLfdr}_i + \mathrm{CLfdr}_t}{|\mathcal A_{t-1}| + 1} \le \alpha\right)$; otherwise set $\delta_t = 0$.

The above procedure essentially runs two procedures in parallel: the online decision-making procedure and the offline oracle procedure that determines the adaptive threshold. The threshold acts as a barrier that prevents issues such as $\alpha$-death and piggybacking (a string of bad decisions made on the strength of previously acquired budget); as more observations come in, we can expect this threshold to converge to the optimal threshold for each $t$ in the offline procedure.

Now we show that the above procedure is valid for $\mathrm{FDR}_t$ control at any time point $t$. Similarly, denote an online procedure by $\boldsymbol d_{on}$ and, specifically, denote the oracle online FDR procedure by $\boldsymbol d_{on}^{OR}$. Furthermore, let $\mathcal D_{on}^{\alpha,T}$ be the collection of all online procedures that satisfy $\mathrm{FDR}_T(\boldsymbol d_{on}) \le \alpha$. The following theorem states that the oracle online procedure is valid for $\mathrm{FDR}_t$ control, $t = 1, \ldots, T$.

Theorem 10. The oracle online FDR procedure is valid for $\mathrm{FDR}_t$ control for $t = 1, \ldots, T$, i.e.
FDRt(d d d OR on ); t = 1; ;T: Proof. 8t, FDRt =E P t i=1 (1i)i P t i=1 i_ 1 ! = EfE( P t i=1 (1i)ijXi))g jAtj = Ef P i2A t CLfdrig jAtj : 4.1.4 Data-driven Procedure In this section we propose the data-driven version of our oracle online FDR procedure. It primarily involves the estimation of the CLfdr statistics. Recall the denition of our CLfdr statistic, CLfdrt(xt) = (1t)f0(xt) ft(xt) ; t = 1; 2; : We assume a general null distribution f0 which is common practice in the literature. So the key quantities for estimation are t and ft(xt). We will discuss further in detail on the practical estimation of these two quantities. 119 In order to motivate the intuition behind our estimation methods, let's rst circle back to some applications of any online FDR control procedure, specically quality preserving databases (QPD) and A/B testing. Normally, such databases or servers already possess a large quantity of existing data which is extremely informative for providing the baseline initial estimates ^ f1 and ^ 1. Therefore, we can reasonably assume that at the beginning of our online FDR procedure, we already have a decent estimate for the above two quantities. The key issue now is on how to further update them as new observations arrive. We rst discuss how to estimate the quantityt. If we have reason to believe thatt is xed through out allt due to the nature of the problem, we can employ Jin and Cai's method for its initial estimation. We can update t periodically after T new observations arrive. In a more general setting, we can assumet varies witht. Even so, we should assumet is intrinsically a continuous function of t. • Fixed t: Jin and Cai's method. Varying t: First obtain a sampleTt() =fxi : Pi > g in a neighborhood of t (tN i t). Let K(t) be a kernel function, h is the bandwidth, K h (t) =h 1 K(t=h) andN = CardfT ()g. We propose the following estimator ^ (t) = P X i 2T t () K h (tXi) (1) P t i=tN K h (tXi) : (4.1.3) Theorem 11. E R t f^ (t)g(t)g 2 dt! 0 and t t for all . (Sun) • xed density: standard kernel method. Conditional density: ^ ft(xjt) = P t j=1 K h t (jt)K hx (xjx) P t j=1 K h t (jt) : (4.1.4) Theorem 12. E ^ ft(xjt)ft(xjt) 2 =E Z Z n ^ ft(xjt)ft(xjt) o 2 dtds! 0: (4.1.5) The result is known in the literature. 120 Now we have an estimated version of our CLfdr statistic: \ CLfdr t (xt) = (1 ^ (t))f0(xt) ^ ft(xjt) ; t = 1; 2; ; The data-driven online FDR procedure can be operated as follows. Algorithm 3. The data-driven online FDR procedure Initialization: A 0 =;; 0 = 1, hyper-parameter = 0:6. Updating the Barrier: At time t, sort \ CLfdr from smallest to largest: \ CLfdr (1) ; ; \ CLfdr (t) : If \ CLfdr (1) >, let t = t1 . Otherwise, let t = maxf \ CLfdr (k) : 1 k P k i=1 \ CLfdr (i) ; k<tg. Decision: If \ CLfdr t t , t =I P i2A t1 \ CLfdr i + \ CLfdr t jAt1j+1 . Otherwise t = 0. Theorem 13. Under the general model, the proposed data-driven online FDR procedure controls the FDRt. Theorem 14. Under the xed model, n! ? . the proposed data-driven online FDR procedure controls the FDRt, and its MDRt converges to that of the oracle MDRt. 4.1.5 Simulation studies In this section, we investigate the numerical performance of our data-driven online FDR procedure along with several others under various settings; specically, LOND (GAI), the oracle online FDR procedure and the recently proposed LORD++ (GAI++). For the general simulation setup, we choose m = 5000, total number of 1000 iterations, pre-specied FDR level at 0:05. 
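Before presenting the results, here is a minimal Python sketch of the decision mechanism shared by the oracle and data-driven procedures (Algorithms 2 and 3). The function name is hypothetical, and the CLfdr values are assumed to be supplied, exactly in the oracle case and estimated as in (4.1.3)-(4.1.4) in the data-driven case; the barrier update here pools all statistics seen so far, as a simplification.

```python
import numpy as np

def online_fdr_stream(clfdr_stream, alpha=0.05):
    """Process CLfdr statistics one at a time and return the 0/1 decisions."""
    seen, rejected_sum, n_rejected = [], 0.0, 0
    barrier, decisions = 1.0, []
    for clfdr_t in clfdr_stream:
        seen.append(clfdr_t)
        # Barrier update: offline step-up rule applied to all statistics observed so far.
        s = np.sort(np.array(seen))
        if s[0] <= alpha:
            running_mean = np.cumsum(s) / np.arange(1, s.size + 1)
            barrier = s[running_mean <= alpha].max()
        # Decision: the candidate must pass the barrier, and the running CLfdr average
        # over the (augmented) rejection set must stay below alpha.
        if clfdr_t <= barrier and (rejected_sum + clfdr_t) / (n_rejected + 1) <= alpha:
            decisions.append(1)
            rejected_sum += clfdr_t
            n_rejected += 1
        else:
            decisions.append(0)
    return decisions
```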
We investigate the missed discovery rate (MDR),
\[
\mathrm{MDR}_t = \frac{\sum_{i=1}^t \theta_i (1-\delta_i)}{\sum_{i=1}^t \theta_i}, \qquad t = 1, \ldots, m,
\]
and the FDR at various time points, ranging from 500 to 5000 with step size 500. For the data-driven procedure, we need an initial burn-in period in order to establish a stable estimate of the density function. Specifically, we generate simulated data at 500 locations for the initial density estimation, and update the estimates every 200 observations. The following simulation settings are investigated:

1. $p_t = p = 0.01$; vary $\mu$ from 2 to 4 with step size 0.5.
2. $p_t = p = 0.05$; vary $\mu$ from 2 to 4 with step size 0.5.
3. Vary $p_t$ linearly from 0 to 0.5; vary $\mu$ from 2 to 4 with step size 0.5.
4. $p_t = \{\sin(2\pi t/m) + 1\}/4$, so that $p_t$ ranges between 0 and 0.5; vary $\mu$ from 2 to 4 with step size 0.5.

The results are summarized in Figures 4.1-4.4.

Figure 4.1: Simulation Setting 1, with varying $\mu$ and $p = 0.01$. (Panels show the intermediate FDR and MDR levels over $t$ for Oracle_Gamma, LOND, DD and LORD++.)

Figure 4.2: Simulation Setting 2, with varying $\mu$ and $p = 0.05$. (Panels as in Figure 4.1.)

Figure 4.3: Simulation Setting 3, with $p_t$ varying linearly and varying $\mu$. (Panels as in Figure 4.1.)

Figure 4.4: Simulation Setting 4, with sine-shaped varying $p_t$ and varying $\mu$. (Panels as in Figure 4.1.)

4.2 Anomaly Detection in User Engagement Metrics with False Discovery Rate Control

Anomalies are patterns in data that do not conform to a well-defined notion of normal behavior; the problem of finding these patterns is referred to as anomaly detection. At Snap Inc., a large number of user engagement metrics are stored in Google Cloud in the form of time series, and detecting anomalies in such time series data in a robust fashion can give meaningful insights and enable appropriate subsequent actions. In this paper, we tackle this problem by transforming it into a multiple testing problem in the statistical domain. We first use STL (seasonal-trend decomposition using Loess) to decompose the time series data, and we then apply the empirical Bayes adaptive-z procedure to the residual terms for false discovery rate (FDR) control at any nominal level. Numerical studies confirm the effectiveness of our approach and its superiority over existing methods in detecting various kinds of anomalies in user engagement metrics. We have also deployed this methodology at Snap Inc.

4.2.1 Introduction

In many modern companies, we are seeing an exponential increase in the availability of streaming, time-series data. At Snap Inc., the most important data sets of this form are the user engagement metrics (active users, session length, session frequency, etc.), which are key indicators of company performance,
external events, and potential infrastructure outages. Therefore, fast and reliable detection of anomalies in such user engagement metrics has significant implications and use cases.

Figure 4.5: A sample of user engagement metrics at Snap Inc.

By definition, the task of anomaly detection aims to find "an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism" [56]. There is a large body of research on time-series anomaly detection, dating back to [45], spanning domains such as statistics, signal processing, finance, and econometrics [56, 6, 61, 117]. Many techniques are supervised methods, which are unsuitable for robust anomaly detection since they hardly detect new and unknown anomalies [52]. Other techniques, such as smoothing, clustering, and simple thresholding methods, are only capable of detecting spatial anomalies [3]. For detecting temporal anomalies, change-point detection methods using scan statistics and likelihood ratio tests have been employed in genomics, engineering, and materials science [80, 53, 105, 8]. Such methods are often sensitive to the size of the windows and to pre-specified thresholds [3], making it difficult to control the number of false positives while maintaining good detection power.

At Snap Inc., most of the user engagement metrics exhibit strong seasonality with an underlying trend; Figure 4.5 is an illustrative example of such data. Most existing anomaly detection methods are not specifically tailored to such characteristics. In [60], a hybrid approach using STL and the extreme Studentized deviate (ESD) test is proposed, which is robust to a high percentage of anomalies and can select an arbitrary number of anomalies for rejection. However, their Seasonal ESD and Seasonal Hybrid ESD methods do not provide any control of the false discovery rate(1), which is of utmost importance considering the vast amount of data being processed and the potentially inflated number of false positives due to the statistical artifact known as "multiple testing"(2).
(1) https://en.wikipedia.org/wiki/False_discovery_rate
(2) https://en.wikipedia.org/wiki/Multiple_comparisons_problem

Considering the aforementioned issues, we develop a novel non-parametric empirical Bayes method for anomaly detection in such user engagement metrics at Snap Inc. The proposed method is fully data-driven and, under some mild assumptions, provides optimal power at any nominal false discovery rate. Specifically, we first use STL (seasonal-trend decomposition using Loess) to decompose the time series data into three components: seasonal, trend, and residual. We then model the residuals as a two-group mixture of noise and signals, and the adaptive-z procedure is subsequently applied for valid FDR control. The method is evaluated extensively on user engagement metrics at Snap Inc., in addition to simulated data.

4.2.2 Methodologies

4.2.2.1 STL decomposition

As mentioned in the previous section, user engagement metrics at Snap demonstrate strong seasonality and cyclic behavior. STL is a non-parametric technique introduced by [30]. It decomposes a time series into three additive components, seasonal, trend and remainder:
\[
Y_i = T_i + S_i + R_i, \qquad i = 1, \ldots, N, \tag{4.2.6}
\]
where $N$ is the total number of measured data points. The algorithm consists of two nested loops. In the outer loop, STL assigns robustness weights to each data point depending on the size of the remainder (thereby accounting for transient, aberrant behavior, i.e. anomalies); in the nested inner loop, it iteratively updates the trend and seasonal components. Specifically, it operates according to the following outline [46]:

1. Initialize the trend as $T_i^{(0)} = 0$ and $R_i^{(0)} = 0$.
2. Outer loop: calculate the robustness weights. Run $n_{(o)}$ times:
• Calculate $R_i$.
• Calculate the robustness weights $\rho_i = B(|R_i|/h)$, where $h = 6\,\mathrm{median}(|R_i|)$ and $B$ is the bi-square weight function
\[
B(u) = \begin{cases} (1-u^2)^2, & 0 \le u < 1, \\ 0, & u \ge 1. \end{cases}
\]
• On the initial loop, $\rho_i = 1$ for all $i$.
For example, if the series is monthly with a yearly periodicity, then the rst sub-cycle series is the January values, the second is the February values, and so forth. 129 no anomalies in our original time series data, we should expect STL to capture all the seasonality and trend movement, while the residual terms are just noise. However, with anomalies present, the residual terms Ri can be thought of as a mixture of two groups, the noise and potential signals (i.e. anomalies). Therefore, we can formulate our anomaly detection problem as a multiple testing problem. Specically, let 1; ;N be independent Bernoulli(p) variables and let Ri be generated as Riji (1i)F0 +iF1; i = 1; ;N: (4.2.7) Ri are observed, while the variables i are unobserved. Here F0 is the cdf (cumulative distribution function) for the noise, and F1 the cdf for the signals. Then the marginal cdf of R is the mixture distribution F (r) = (1p)F0(r) +pF1(r), and the probability distribution function (pdf) is f(r) = (1p)f0(r) +pf1(r). Under such framework, the anomaly detection problem can be further casted as a multiple testing problem: H0i :i = 0; H1i :i6= 0; i = 1; ;N; (4.2.8) where the solution to which can be represented by a decision rule, = (1; ;m)2I =f0; 1g N : (4.2.9) We aim to control the false discovery rate (FDR), the expected proportion of false discoveries among all discoveries, under the nominal level : FDR =E P N i=1 (1i)i P N i=1 i_ 1 ! : (4.2.10) In large-scale statistical inference, FDR control has become the standard practice [12, 101, 5, 27]. They can be further categorized into two groups,p-value based and empirical Bayes 4 type methods. In general, p-value based techniques are in general inferior than empirical Bayes type methods because it fails to take into consideration the compound structure of the problem [101, 27, 113]. 4 https://en.wikipedia.org/wiki/Empirical_Bayes_method 130 Most notably, adaptive-Z procedure demonstrated stability and robustness under various settings. It operates as follows: 1. Calculate the Lfdr test statistics: Lfdri(ri) = (1p)f0(ri) f(ri) ; i = 1; ;N; 2. Rank them from smallest to largest: Lfdr (1) ; Lfdr (2) ; ; Lfdr (N) : 3. Adaptive step-up procedure: Let k = maxfl : 1 l l X j=1 Lfdr (j) g; then reject all H0i; i = 1; ;k: In addition to FDR control, another attractive property for the adaptive-Z procedure is that we can alternatively specify the total number of k anomaly points desired as well. Instead of employing the adaptive step-up procedure, we can simply elect corresponding locations for the top ranked k Lfdr statistics for rejection. [101] establishes that adaptive-Z procedure is both asymptotically valid and optimal for FDR control. 4.2.2.3 Online FDR Control In order to enable real-time anomaly detection, one key component in our paradigm is online FDR control. In the previous section, we discussed the operation of the adaptive-Z procedure. Essentially, it involves rst ranking the Lfdr test statistics, then a cuto is chosen along this ranking for FDR control. However, under the online constraint, we can no longer rank the test statistics, decisions must be made immediately after they arrive. Hence, we propose the following variation of the adaptive-Z procedure for online FDR control. 131 For each time point t, dene t = (1;2; ;t) be the collection of all decisions up until time t, we aim to control false discovery rate at time t as FDRt( t ) =E P t i=1 (1i)i P t i=1 i_ 1 ! : The goal here is to eectively control FDRt below nominal level for all t. 
LetAt =fi :it;i = 1g be the collection of locations we rejected up until timet, be the nominal FDR level, then the following procedure would guarantee control of the FDRt for all t: 1. Initialization:A0 =;: 2. Decision: t =I P i2A t1 Lfdr i +Lfdr t jA t1 j+1 . The above procedure is valid for FDRt control (proof in Appendix). 4.2.3 Implementation of real time STL+adaptive-Z In order to practically implement the STL and adaptive-Z procedure in real time for user engagement metrics, we need to specify the parameters for STL decomposition and the estimation methods for key quantities in the Lfdr test statistics, together with how to enable real time decision making. We devote this section to practical implementation setup for our method. 4.2.3.1 Parameters for STL decomposition and real-time prediction For STL decomposition, There are a total number of six primary parameters involved: • n (p) n (p) n (p) : the periodicity of the seasonality, e.g., if we were to model daily data with 10 minutes intervals, n (p) = 144 7. • n (i) n (i) n (i) : number of cycles through the inner loop, it should typically be large enough to reach conver- gence, default is 2. • n (o) n (o) n (o) : number of cycles through the outer loop, default is 1. • n (l) n (l) n (l) : the span in lags for the LPF. It's recommended to take the next odd integer greater thann (p) . • n (s) n (s) n (s) : smoothing parameter for the seasonal lter. As it increases, each cycle-subseries becomes smoother. 132 • n (t) n (t) n (t) : smoothing parameter of the trend behavior. In our use case, we want to specify a larger value for a smoother trend. In practice, the most important two parameters aren (p) andn (s) , we recommend choosingn (p) based on total number of daily data while letn (s) be relatively large, therefore believing the changes are resulted from aberrant behaviors (in the residuals) instead of seasonal behaviors. Using simulated data, we further noticed that by letting the number of inner and outer iterations both equal to 2 yields more stable results than default values. For real time prediction, we can use the method predict in R on our tted STL model, then use the dierence of the observed value and the tted value as the residuals. 4.2.3.2 Null distribution and non-null proportion estimations in adaptive-Z procedure The oracle statistics in our adaptive-Z procedure is the local false discovery rate Lfdr: Lfdri = (1p)f0(zi) f(zi) ; i = 1; ;N: (4.2.11) p here can be interpreted as the proportion of anomalies among all observations, which intrinsically should be small, while f0(r) is the null distribution. After standardizing the observations Ri into z-scores zi, we can either use the theoretical null distribution N(0; 1) as f0, or we can use the method proposed in [66] to obtain consistent estimators for both ^ f0 =N(^ 0; ^ 2 0 ) (the empirical null distribution) and ^ p using empirical characteristic function and Fourier analysis. Efron [38] studied a dataset on breast cancer, which he insightfully pointed out that in large scale multiple testing, sometimes the empirical null distribution is more appropriate. In our use case on anomaly detection, we suggest the usage of the estimated empirical null distribution. 
4.2.3.2 Null distribution and non-null proportion estimation in the adaptive-Z procedure

The oracle statistic in our adaptive-Z procedure is the local false discovery rate (Lfdr):

\mathrm{Lfdr}_i = \frac{(1-p) f_0(z_i)}{f(z_i)}, \quad i = 1, \ldots, N. \qquad (4.2.11)

Here p can be interpreted as the proportion of anomalies among all observations, which should intrinsically be small, and f_0 is the null density. After standardizing the observations R_i into z-scores z_i, we can either use the theoretical null distribution N(0, 1) as f_0, or use the method proposed in [66] to obtain consistent estimators of both \hat{f}_0 = N(\hat{\mu}_0, \hat{\sigma}_0^2) (the empirical null distribution) and \hat{p}, based on the empirical characteristic function and Fourier analysis. Efron [38] studied a breast cancer data set and insightfully pointed out that in large-scale multiple testing the empirical null distribution is sometimes more appropriate. For our anomaly detection use case, we suggest using the estimated empirical null distribution.

Specifically, consider the empirical characteristic function

\varphi_n(t) = \varphi_n(t; z_1, \ldots, z_N; N) = \frac{1}{N}\sum_{j=1}^{N} e^{i t z_j}

and its expectation, the characteristic function

\varphi(t) = \varphi(t; \mu, \sigma; N) = \frac{1}{N}\sum_{j=1}^{N} e^{i \mu_j t - \sigma_j^2 t^2 / 2}.

\varphi naturally splits into two parts, \varphi(t) = \varphi_0(t) + \tilde{\varphi}(t), where

\varphi_0(t) = \varphi_0(t; \mu, \sigma; N) = (1 - p_N)\, e^{i \mu_0 t - \sigma_0^2 t^2 / 2},
\tilde{\varphi}(t) = \tilde{\varphi}(t; \mu, \sigma; N) = p_N \cdot \mathrm{Ave}_{\{j : (\mu_j, \sigma_j) \neq (\mu_0, \sigma_0)\}}\, e^{i \mu_j t - \sigma_j^2 t^2 / 2},

which correspond to the null effects and the non-null effects, respectively. For any differentiable complex-valued function f with |f(t)| \neq 0, define the two functionals

\sigma_0^2(f; t) = -\frac{\frac{d}{dt}|f(t)|}{t\,|f(t)|}, \qquad (4.2.12)
\mu_0(f; t) = \frac{\mathrm{Re}(f(t))\,\mathrm{Im}(f'(t)) - \mathrm{Re}(f'(t))\,\mathrm{Im}(f(t))}{|f(t)|^2}. \qquad (4.2.13)

Evaluating the functionals at \varphi_0 recovers the null parameters exactly: \sigma_0^2(\varphi_0; t) = \sigma_0^2 and \mu_0(\varphi_0; t) = \mu_0 for all t \neq 0. We cannot evaluate \varphi_0 directly; however, for a large enough t we have \varphi_n(t) \approx \varphi(t) \approx \varphi_0(t), so the contribution of the anomaly points to the empirical characteristic function is negligible. In [66], an adaptive way to choose t is proposed: for a given \gamma \in (0, 1/2), set

\hat{t}_n(\gamma) = \hat{t}_n(\gamma; \varphi_n) = \inf\{ t : |\varphi_n(t)| = n^{-\gamma}, \; 0 \le t \le \log n \}. \qquad (4.2.14)

Then the plug-in estimators

\hat{\sigma}_0^2 = \sigma_0^2\big(\varphi_n; \hat{t}_n(\gamma)\big), \qquad \hat{\mu}_0 = \mu_0\big(\varphi_n; \hat{t}_n(\gamma)\big) \qquad (4.2.15)

are proven to be uniformly consistent over a wide class of models. Empirically, a smaller \gamma (for instance, 0.01 or 0.05) yields more stable results.

Subsequently, after estimating the null distribution N(\hat{\mu}_0, \hat{\sigma}_0^2), the following estimator is consistent for the proportion after standardization of the R_i, for any \gamma \in (0, 1/2):

\hat{p}_n\big(\sqrt{2\gamma \log n};\, R_1, \ldots, R_N; N\big) \qquad (4.2.16)
\quad = \frac{1}{N}\sum_{j=1}^{N}\Big[ 1 - \kappa\big(R_j; \sqrt{2\gamma \log n}\big) \Big], \quad \text{where} \qquad (4.2.17)
\kappa\big(x; \sqrt{2\gamma \log n}\big) = \int_{-1}^{1} \omega(\xi)\, e^{\gamma \xi^2 \log n} \cos\big(\sqrt{2\gamma \log n}\, \xi x\big)\, d\xi \qquad (4.2.18)

and \omega(\cdot) is a weight function on [-1, 1] as specified in [66]. In practice, we evaluate this quantity over a grid of \gamma and pick the value that gives the largest estimate \hat{p}. Alternatively, one can omit the estimation of \hat{p} altogether, since

\mathrm{Lfdr}_i = \frac{(1-p) f_0(z_i)}{f(z_i)} < \frac{f_0(z_i)}{f(z_i)}, \quad i = 1, \ldots, N,

which provides conservative control of the FDR under the nominal level \alpha.

4.2.3.3 Real-time anomaly detection algorithm

We now summarize the implementation details in the following algorithm (data-driven STL + online adaptive-Z procedure):

1. Specify n_p and n_s for the STL decomposition and obtain the trend T_i, seasonal S_i and residual R_i components.
2. Standardize the residuals R_i into z-scores z_i, then estimate the null distribution N(\hat{\mu}_0, \hat{\sigma}_0^2) (and optionally the proportion \hat{p} as well).
3. Use standard univariate kernel density estimation to obtain an estimate \hat{f} of the mixture density, evaluated on a grid D.
4. Obtain the estimated Lfdr test statistics for the training data:
\widehat{\mathrm{Lfdr}}_i = \frac{(1-\hat{p}) \hat{f}_0(z_i)}{\hat{f}(z_i)}, \quad i = 1, \ldots, N.
5. Determine the initial active set A_N. For any incoming new data point t, estimate the densities \hat{f}_t(z_t) and \hat{f}_{0t}(z_t) and the proportion \hat{p}_t based on the nearest neighborhood with indices \{t-K, \ldots, t\} (here K, the size of the neighborhood, is a hyper-parameter). Subsequently, let
\delta_t = \mathbb{I}\left\{ \frac{\sum_{i \in A_{t-1}} \widehat{\mathrm{Lfdr}}_i + \widehat{\mathrm{Lfdr}}_t}{|A_{t-1}| + 1} \le \alpha \right\}, \quad t = N+1, N+2, \ldots

We can use the methodology proposed in [27] to estimate the varying proportion \hat{p}_t, or simply set it to 0 for a conservative estimate. Updates can be done daily.
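To tie the pieces together, here is a minimal R sketch of steps 2-5 above, assuming the standardized residuals of the training window are in a numeric vector z. For simplicity it uses the conservative choice \hat{p} = 0 discussed above, approximates the derivative of the empirical characteristic function by finite differences, and omits the rolling re-estimation over the nearest K neighbors; the function names are illustrative.

# Step 2: estimate the empirical null N(mu0, sigma0^2) from standardized residuals z.
ecf <- function(t, z) mean(exp(1i * t * z))            # empirical characteristic function phi_n(t)

estimate_null <- function(z, gamma = 0.05) {
  n     <- length(z)
  grid  <- seq(1e-3, log(n), length.out = 2000)
  mods  <- vapply(grid, function(t) Mod(ecf(t, z)), numeric(1))
  idx   <- which(mods <= n^(-gamma))[1]                # first t with |phi_n(t)| <= n^(-gamma)
  t_hat <- if (is.na(idx)) log(n) else grid[idx]
  h     <- 1e-4                                        # finite-difference step
  f_m <- ecf(t_hat - h, z); f_p <- ecf(t_hat + h, z); f_c <- ecf(t_hat, z)
  d_mod <- (Mod(f_p) - Mod(f_m)) / (2 * h)             # d/dt |phi_n(t)| at t_hat
  d_phi <- (f_p - f_m) / (2 * h)                       # phi_n'(t_hat)
  sigma0_sq <- -d_mod / (t_hat * Mod(f_c))             # functional (4.2.12)
  mu0 <- (Re(f_c) * Im(d_phi) - Re(d_phi) * Im(f_c)) / Mod(f_c)^2   # functional (4.2.13)
  list(mu0 = mu0, sigma0 = sqrt(max(sigma0_sq, 1e-8)))
}

# Steps 3-4: kernel density estimate of the mixture f and the Lfdr statistics (p_hat = 0 is conservative).
lfdr_hat <- function(z, mu0, sigma0, p_hat = 0) {
  f_hat  <- approxfun(density(z), rule = 2)            # kernel estimate of the mixture density f
  f0_hat <- dnorm(z, mean = mu0, sd = sigma0)          # estimated empirical null density
  pmin((1 - p_hat) * f0_hat / f_hat(z), 1)
}

# Step 5: online decision for an incoming standardized residual whose Lfdr estimate is lfdr_t.
online_update <- function(active_lfdr, lfdr_t, alpha = 0.1) {
  reject <- (sum(active_lfdr) + lfdr_t) / (length(active_lfdr) + 1) <= alpha
  if (reject) active_lfdr <- c(active_lfdr, lfdr_t)    # the rejected point joins the active set A_t
  list(reject = reject, active_lfdr = active_lfdr)
}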
4.2.4 Numerical Results

We now demonstrate the performance of our proposed real-time anomaly detection algorithm using simulated data and real data from Snap Inc. For the simulated data, the goal is to demonstrate the validity of our online FDR control procedure together with its power in detection. To our knowledge, there are no other real-time anomaly detection algorithms that guarantee FDR control at every time point t.

4.2.4.1 Simulations

In this section, we simulate data with known anomaly points and seasonal patterns to demonstrate the performance of our proposed real-time anomaly detection algorithm. We compare the oracle and data-driven versions of our algorithm (denoted OR and DD, respectively) with Twitter's anomaly detection algorithm (denoted TW). Note that Twitter's algorithm uses simple thresholding of test statistics, which cannot guarantee FDR control even globally, let alone at every time point. We consider settings in which the anomalies have different signal strengths, the errors are auto-correlated, and the proportion of anomalies varies.

We use real user engagement metric data at Snap Inc. to extract the seasonal and trend components (shown in Figure 4.6) and manually add anomalies with varying structures and effect sizes. We add random noise with distribution N(0, \sigma_0^2), where \sigma_0 = 144 based on our sample data. We consider a total of m = 4458 observations and continuously monitor FDR_t at t = 600, 1000, \ldots, 4200 (step size 400). FDR_t and MDR_t (the missed discovery rate at time t) are plotted for comparison.

[Figure 4.6: Seasonal and trend patterns for sample user engagement metrics.]

To make the comparisons more meaningful, we choose \alpha = 0.1 as the universal nominal FDR level for the DD and OR procedures, and a significance level of 0.001 for Twitter's AnomalyDetection algorithm. MDR_t is defined as

\mathrm{MDR}_t = E\left\{ \frac{\sum_{i=1}^t \theta_i (1-\delta_i)}{\left(\sum_{i=1}^t \theta_i\right) \vee 1} \right\}, \quad t = 1, 2, \ldots,

and power is defined as 1 − MDR. At the same realized FDR level, a lower MDR means a more powerful procedure. Specifically, we consider the following settings (a data-generation sketch for Setting 1 follows the list); the corresponding results are summarized in Figures 4.7 to 4.9:

• Setting 1: the signal effect size (fixed) varies from 3.5\sigma_0 to 4.5\sigma_0, the proportion of signals is p = 0.05, the error terms are i.i.d. N(0, \sigma_0^2) with \sigma_0 = 144 (the size of \sigma_0 is based on the empirical null distribution estimated from the real data at Snap Inc.), and the signal locations are determined by independent Bernoulli(p) draws.
• Setting 2: the signal effect sizes are uniformly sampled from 3.5\sigma_0 to 5\sigma_0, the proportion of signals varies over p = 0.02, 0.03, 0.05, the error terms are i.i.d. N(0, \sigma_0^2) with \sigma_0 = 144, and the signal locations are determined by independent Bernoulli(p) draws.
• Setting 3: the signal effect sizes are uniformly sampled from 3.5\sigma_0 to 5\sigma_0, the proportion of signals varies over p = 0.02, 0.03, 0.05, the error terms are generated from an ARIMA model of order (2, 0, 1) with standard deviation 144 (the ARIMA model is estimated from real data at Snap Inc.), and the signal locations are determined by independent Bernoulli(p) draws.
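For concreteness, a minimal R sketch of how data in the spirit of Setting 1 can be generated; the seasonal and trend components below are simple placeholders for the components extracted from the real metric, and the ARMA coefficients in the Setting 3 variant are illustrative rather than the values estimated from Snap data.

# Setting 1 (sketch): fixed effect size, i.i.d. Gaussian noise.
set.seed(2018)
m        <- 4458
sigma0   <- 144
p        <- 0.05
effect   <- 4 * sigma0                                  # one of 3.5, 4, 4.5 times sigma0
theta    <- rbinom(m, size = 1, prob = p)               # latent anomaly indicators
noise    <- rnorm(m, mean = 0, sd = sigma0)             # i.i.d. Gaussian errors
seasonal <- sigma0 * sin(2 * pi * seq_len(m) / 144)     # placeholder for the extracted seasonal component
trend    <- numeric(m)                                  # placeholder for the extracted trend component
y        <- trend + seasonal + noise + theta * effect   # observed series with injected anomalies

# Setting 3 variant: auto-correlated errors from an ARMA(2, 1) model (illustrative coefficients).
# noise <- as.numeric(arima.sim(list(order = c(2, 0, 1), ar = c(0.5, 0.2), ma = 0.3),
#                               n = m, sd = sigma0))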
Our real-time anomaly detection algorithm always controls the FDR below the nominal level \alpha = 0.1 at any time point, and it adheres closely to the oracle case in which all model parameters are assumed known. Twitter's AnomalyDetection algorithm performs reasonably well too; however, there is no direct relationship between the specified significance level (0.001 here) and the realized FDR level. Furthermore, Twitter's method fails to control the FDR in an online fashion in Setting 1 and appears overly restrictive in the other settings. Our data-driven procedure uniformly achieves better power (lower MDR) across the various anomaly effect sizes and proportions, even when the errors are correlated.

[Figure 4.7: Simulation Setting 1, varying effect size with p = 0.05, i.i.d. random noise. Panels show intermediate FDR and MDR levels over t for the OR, DD and TW methods.]
[Figure 4.8: Simulation Setting 2, varying proportion of signals with uniformly distributed signal strengths, i.i.d. random noise.]
[Figure 4.9: Simulation Setting 3, varying proportion of signals with uniformly distributed signal strengths, noise generated from ARIMA(2, 0, 1).]

4.2.4.2 Experimental Results

In this section, we apply our method to real user engagement metrics at Snap Inc. We present three use cases to showcase the power of our real-time anomaly detection algorithm; the nominal FDR level is \alpha = 0.01. We also run Twitter's anomaly detection algorithm for comparison (significance level 0.01). In Figures 4.10 to 4.12, the anomaly points labeled by our data-driven procedure and those labeled by Twitter's algorithm are marked with different symbols (triangles denote the points labeled by Twitter's algorithm).

In the first use case, we detect the spikes on March 14th, 2018 due to a student walkout, which Twitter's method fails to flag.

[Figure 4.10: Detected anomalies on user engagement metrics from March 10th to March 15th, 2018.]

In the second use case, our algorithm also picks up an anomaly triggered by a soccer game in Paris on February 28th, 2018.

[Figure 4.11: Detected anomalies on user engagement metrics from February 25th to March 5th, 2018.]

In the third use case, our algorithm picks up a level-change anomaly triggered by snow storm Emma on February 28th, 2018.

[Figure 4.12: Detected anomalies on user engagement metrics from February 10th to March 20th, 2018.]

Bibliography

[1] E. Aharoni and S. Rosset. Generalized α-investing: definitions, optimality results and application to public databases. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):771-794, 2014.
[2] E. Aharoni, H. Neuvirth, and S. Rosset. The quality preserving database: A computational framework for encouraging collaboration, enhancing power and controlling false discovery. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(5):1431-1437, 2011.
[3] S. Ahmad and S. Purdy. Real-time anomaly detection for streaming analytics. arXiv preprint arXiv:1607.02480, 2016.
[4] R. F. Barber and A. Ramdas. The p-filter: multi-layer FDR control for grouped hypotheses. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2017.
[5] R. F. Barber, E. J. Candès, et al. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055-2085, 2015.
[6] V. Barnett and T. Lewis. Outliers in Statistical Data. Wiley, 1974.
[7] E. Bashan, G. Newstadt, and A. O. Hero. Two-stage multiscale search for sparse targets.
IEEE Transactions on Signal Processing, 59(5):2331-2341, 2011.
[8] M. Basseville, I. V. Nikiforov, et al. Detection of Abrupt Changes: Theory and Application, volume 104. Prentice Hall, Englewood Cliffs, 1993.
[9] P. Basu, T. T. Cai, K. Das, and W. Sun. Weighted false discovery rate control in large-scale multiple testing. Journal of the American Statistical Association, 2017.
[10] Y. Benjamini and R. Heller. False discovery rates for spatial signals. J. Amer. Statist. Assoc., 102:1272-1281, 2007.
[11] Y. Benjamini and R. Heller. Screening for partial conjunction hypotheses. Biometrics, 64:1215-1222, 2008.
[12] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), pages 289-300, 1995.
[13] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. B, 57:289-300, 1995.
[14] Y. Benjamini and Y. Hochberg. Multiple hypotheses testing with weights. Scandinavian Journal of Statistics, 24:407-418, 1997.
[15] Y. Benjamini and Y. Hochberg. On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics, 25:60-83, 2000.
[16] Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Ann. Statist., 29(4):1165-1188, 2001.
[17] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, 1985.
[18] G. Blanchard and D. Geman. Hierarchical testing designs for pattern recognition. Ann. Statist., 33:1155-1202, 2005.
[19] K. H. Bleicher, H.-J. Böhm, K. Müller, and A. I. Alanine. Hit and lead generation: beyond high-throughput screening. Nature Reviews Drug Discovery, 2(5):369-378, 2003.
[20] P. E. Blom, S. J. Fleischer, and Z. Smilowitz. Spatial and temporal dynamics of Colorado potato beetle in fields with perimeter and spatially targeted insecticides. Environmental Entomology, 31(1):149-159, 2002.
[21] S. M. Boca and J. T. Leek. A regression framework for the proportion of true null hypotheses. Preprint, bioRxiv, 3567, 2017.
[22] R. Bourgon, R. Gentleman, and W. Huber. Independent filtering increases detection power for high-throughput experiments. Proceedings of the National Academy of Sciences, 107(21):9546-9551, 2010.
[23] L. D. Brown. An ancillarity paradox which appears in multiple linear regression. The Annals of Statistics, pages 471-493, 1990.
[24] T. T. Cai and W. Sun. Simultaneous testing of grouped hypotheses: Finding needles in multiple haystacks. J. Amer. Statist. Assoc., 104:1467-1481, 2009.
[25] T. T. Cai and Y. Wu. Optimal detection of sparse mixtures against a given null distribution. IEEE Transactions on Information Theory, 60:2217-2232, 2014.
[26] T. T. Cai, J. Jin, and M. G. Low. Estimation and confidence sets for sparse normal mixtures. Ann. Statist., 35(6):2421-2449, 2007.
[27] T. T. Cai, W. Sun, and W. Wang. CARS: Covariate assisted ranking and screening for large-scale two-sample inference. Journal of the Royal Statistical Society, Series B, under review, 2016.
[28] S. E. Calvano, W. Xiao, D. R. Richards, R. M. Felciano, H. V. Baker, R. J. Cho, R. O. Chen, B. H. Brownstein, J. P. Cobb, S. K. Tschoeke, et al. A network-based analysis of systemic inflammation in humans. Nature, 437:1032-1037, 2005.
[29] H. Cao, W. Sun, and M. R. Kosorok.
The optimal power puzzle: scrutiny of the monotone likelihood ratio assumption in multiple testing. Biometrika, 100:495-502, 2013.
[30] R. B. Cleveland, W. S. Cleveland, and I. Terpenning. STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics, 6(1):3, 1990.
[31] J. B. Copas. On symmetric compound decision rules for dichotomies. Ann. Statist., 2:199-204, 1974.
[32] R. Cormack. Statistical challenges in the environmental sciences: a personal view. Journal of the Royal Statistical Society, Series A (Statistics in Society), pages 201-210, 1988.
[33] S. G. Djorgovski, A. Mahabal, C. Donalek, M. J. Graham, A. J. Drake, B. Moghaddam, and M. Turmon. Flashes in a star stream: Automated classification of astronomical transient events. In E-Science (e-Science), 2012 IEEE 8th International Conference on, pages 1-8. IEEE, 2012.
[34] D. Donoho and J. Jin. Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist., 32:962-994, 2004.
[35] L. Du and C. Zhang. Single-index modulated multiple testing. The Annals of Statistics, 42:1262-1311, 2014.
[36] G. Durand. Adaptive p-value weighting with power optimality. arXiv preprint arXiv:1710.01094, 2017.
[37] B. Efron. Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. Journal of the American Statistical Association, 99(465):96-104, 2004. doi:10.1198/016214504000000089.
[38] B. Efron. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the American Statistical Association, 99(465):96-104, 2004.
[39] B. Efron. Size, power and false discovery rates. Ann. Statist., 35:1351-1377, 2007.
[40] B. Efron. Microarrays, empirical Bayes and the two-groups model. Statist. Sci., 23:1-22, 2008.
[41] B. Efron, R. Tibshirani, J. D. Storey, and V. Tusher. Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc., 96:1151-1160, 2001.
[42] E. Ferkingstad, A. Frigessi, H. Rue, G. Thorleifsson, and A. Kong. Unsupervised empirical Bayesian multiple testing with external covariates. The Annals of Applied Statistics, 2:714-735, 2008.
[43] D. P. Foster and E. I. George. A simple ancillarity paradox. Scandinavian Journal of Statistics, pages 233-242, 1996.
[44] D. P. Foster and R. A. Stine. α-investing: a procedure for sequential control of expected false discoveries. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(2):429-444, 2008.
[45] A. J. Fox. Outliers in time series. Journal of the Royal Statistical Society, Series B (Methodological), pages 350-363, 1972.
[46] D. R. Gardner. STL algorithm explained: STL part II. http://www.gardner.fyi/blog/STL-Part-II/, 2018. Accessed: 2018-06-05.
[47] C. Genovese and L. Wasserman. Operating characteristics and extensions of the false discovery rate procedure. J. R. Stat. Soc. B, 64:499-517, 2002.
[48] C. Genovese and L. Wasserman. A stochastic process approach to false discovery control. Ann. Statist., 32:1035-1061, 2004.
[49] M. Ghosh. Bounds for the expected sample size in a sequential probability ratio test. Journal of the Royal Statistical Society, Series B (Methodological), pages 360-367, 1960.
[50] J. J. Goeman and U. Mansmann. Multiple testing on the directed acyclic graph of gene ontology. Bioinformatics, 24(4):537-544, 2008.
[51] J. J. Goeman and A. Solari.
The sequential rejection principle of familywise error control. The Annals of Statistics, 38(6):3782-3810, 2010.
[52] N. Görnitz, M. Kloft, K. Rieck, and U. Brefeld. Toward supervised anomaly detection. Journal of Artificial Intelligence Research, 46:235-262, 2013.
[53] N. Hao, Y. S. Niu, and H. Zhang. Multiple change-point detection via a screening and ranking algorithm. Statistica Sinica, 23(4):1553, 2013.
[54] J. Haupt, R. M. Castro, and R. Nowak. Distilled sensing: Adaptive sampling for sparse detection and estimation. IEEE Transactions on Information Theory, 57:6222-6235, 2011.
[55] J. Haupt, R. G. Baraniuk, R. M. Castro, and R. D. Nowak. Sequentially designed compressed sensing. In SSP, pages 401-404, 2012.
[56] D. M. Hawkins. Identification of Outliers, volume 11. Springer, 1980.
[57] R. Heller and D. Yekutieli. Replicability analysis for genome-wide association studies. The Annals of Applied Statistics, 8:481-498, 2014.
[58] R. Heller, M. Bogomolov, and Y. Benjamini. Deciding whether follow-up studies have replicated findings in a preliminary large-scale omics study. Proceedings of the National Academy of Sciences, 111:16262-16267, 2014.
[59] Y. Hochberg. A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75(4):800-802, 1988.
[60] J. Hochenbaum, O. S. Vallis, and A. Kejariwal. Automatic anomaly detection in the cloud via statistical learning. arXiv preprint arXiv:1704.07706, 2017.
[61] V. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85-126, 2004.
[62] S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65-70, 1979.
[63] J. X. Hu, H. Zhao, and H. H. Zhou. False discovery rate control with groups. Journal of the American Statistical Association, 105(491):1215-1227, 2010.
[64] W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pages 361-379, 1961.
[65] A. Javanmard and A. Montanari. Online rules for control of false discovery rate and false discovery exceedance. arXiv preprint arXiv:1603.09000, 2016.
[66] J. Jin and T. T. Cai. Estimating the null and the proportion of nonnull effects in large-scale multiple comparisons. Journal of the American Statistical Association, 102(478):495-506, 2007.
[67] J. Jin and T. T. Cai. Estimating the null and the proportion of nonnull effects in large-scale multiple comparisons. J. Amer. Statist. Assoc., 102:495-506, 2007.
[68] M. Langaas, B. H. Lindqvist, and E. Ferkingstad. Estimating the proportion of true null hypotheses, with application to DNA microarray data. J. Roy. Statist. Soc. B, 67:555-572, 2005.
[69] N. M. Law, S. R. Kulkarni, R. G. Dekany, E. O. Ofek, R. M. Quimby, P. E. Nugent, J. Surace, C. C. Grillmair, J. S. Bloom, M. M. Kasliwal, et al. The Palomar Transient Factory: system overview, performance, and first results. Publications of the Astronomical Society of the Pacific, 121(886):1395, 2009.
[70] E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer Science & Business Media, 2006.
[71] W. Liu. Incorporation of sparsity information in large-scale multiple two-sample t tests. arXiv preprint arXiv:1410.4282, 2014.
[72] Y. Liu, S. K. Sarkar, and Z. Zhao. A new approach to multiple testing of grouped hypotheses. Journal of Statistical Planning and Inference, 179:1-14, 2016.
[73] M. Malloy and R. Nowak.
Sequential analysis in high-dimensional multiple testing and sparse recovery. In Information Theory Proceedings (ISIT), 2011 IEEE International Symposium on, pages 2661-2665. IEEE, 2011.
[74] M. L. Malloy and R. D. Nowak. Sequential testing for sparse recovery. IEEE Transactions on Information Theory, 60(12):7862-7873, 2014.
[75] M. L. Malloy and R. D. Nowak. Near-optimal adaptive compressed sensing. IEEE Transactions on Information Theory, 60(7):4001-4012, 2014.
[76] N. Meinshausen and J. Rice. Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. Ann. Statist., 34:373-393, 2006.
[77] N. Meinshausen, P. Bickel, J. Rice, et al. Efficient blind search: Optimal power of detection under computational cost constraints. The Annals of Applied Statistics, 3(1):38-60, 2009.
[78] P. Müller, G. Parmigiani, C. Robert, and J. Rousseau. Optimal sample size for multiple testing: the case of gene expression microarrays. Journal of the American Statistical Association, 99(468):990-1001, 2004.
[79] P. Neuvial. Asymptotic results on adaptive false discovery rate controlling procedures based on kernel estimators. The Journal of Machine Learning Research, 14:1423-1459, 2013.
[80] Y. S. Niu and H. Zhang. The screening and ranking algorithm to detect DNA copy number variations. Ann. Appl. Stat., 6(3):1306-1326, 2012. doi:10.1214/12-AOAS539. URL https://doi.org/10.1214/12-AOAS539.
[81] P. E. Nugent, M. Sullivan, S. B. Cenko, R. C. Thomas, D. Kasen, D. A. Howell, D. Bersier, J. S. Bloom, S. R. Kulkarni, M. T. Kandrashoff, A. V. Filippenko, J. M. Silverman, G. W. Marcy, A. W. Howard, H. T. Isaacson, K. Maguire, N. Suzuki, J. E. Tarlton, Y.-C. Pan, L. Bildsten, B. J. Fulton, J. T. Parrent, D. Sand, P. Podsiadlowski, F. B. Bianco, B. Dilday, M. L. Graham, J. Lyman, P. James, M. M. Kasliwal, N. M. Law, R. M. Quimby, I. M. Hook, E. S. Walker, P. Mazzali, E. Pian, E. O. Ofek, A. Gal-Yam, and D. Poznanski. Supernova SN 2011fe from an exploding carbon-oxygen white dwarf star. Nature, 480(7377):344-347, 2011. URL http://dx.doi.org/10.1038/nature10644.
[82] A. Ramdas, F. Yang, M. J. Wainwright, and M. I. Jordan. Online control of the false discovery rate with decaying memory. arXiv preprint arXiv:1710.00499, 2017.
[83] A. Reiner-Benaim, D. Yekutieli, N. E. Letwin, G. I. Elmer, N. H. Lee, N. Kafkafi, and Y. Benjamini. Associating quantitative behavioral traits with gene expression in the brain: searching for diamonds in the hay. Bioinformatics, 23(17):2239-2246, 2007.
[84] H. Robbins. Asymptotically subminimax solutions of compound statistical decision problems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 1950, pages 131-148, Berkeley and Los Angeles, 1951. University of California Press.
[85] K. Roeder and L. Wasserman. Genome-wide significance levels and weighted hypothesis testing. Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 24, 2009.
[86] F. A. Roesch. Adaptive cluster sampling for forest inventories. Forest Science, 39(4):655-669, 1993.
[87] E. Roquain and M. A. van de Wiel. Optimal weighting for false discovery rate control. Electronic Journal of Statistics, 3:678-711, 2009.
[88] D. Rossell and P. Müller. Sequential stopping for high-throughput experiments. Biostatistics, 14(1):75-86, 2013.
[89] N. Rothman, M. Garcia-Closas, N. Chatterjee, N. Malats, X. Wu, J. D. Figueroa, F. X. Real, D. Van Den Berg, G. Matullo, D. Baris, et al.
A multi-stage genome-wide association study of bladder cancer identifies multiple susceptibility loci. Nature Genetics, 42(11):978-984, 2010.
[90] D. Rubin, S. Dudoit, and M. van der Laan. A method to increase the power of multiple testing procedures through sample splitting. Statistical Applications in Genetics and Molecular Biology, 5(1):Article 19, 2006.
[91] S. K. Sarkar. Some results on false discovery rate in stepwise multiple testing procedures. Ann. Statist., 30:239-257, 2002.
[92] S. K. Sarkar. FDR-controlling stepwise procedures and their false negatives rates. J. Stat. Plan. Infer., 125(1):119-137, 2004.
[93] S. K. Sarkar and Z. Zhao. Local false discovery rate based methods for multiple testing of one-way classified hypotheses. arXiv preprint arXiv:1712.05014, 2017.
[94] J. M. Satagopan, E. Venkatraman, and C. B. Begg. Two-stage designs for gene-disease association studies with sample size constraints. Biometrics, 60(3):589-597, 2004.
[95] T. Schweder and E. Spjøtvoll. Plots of p-values to evaluate many tests simultaneously. Biometrika, 69(3):493-502, 1982.
[96] J. G. Scott, R. C. Kelly, M. A. Smith, P. Zhou, and R. E. Kass. False discovery rate regression: an application to neural synchrony detection in primary visual cortex. Journal of the American Statistical Association, 110:459-471, 2015.
[97] D. Siegmund. Sequential Analysis: Tests and Confidence Intervals. Springer Science & Business Media, 1985.
[98] B. W. Silverman. Density Estimation for Statistics and Data Analysis, volume 26. CRC Press, 1986.
[99] A. D. Skol, L. J. Scott, G. R. Abecasis, and M. Boehnke. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat. Genet., 38:209-213, 2006.
[100] J. D. Storey. A direct approach to false discovery rates. J. R. Stat. Soc. B, 64:479-498, 2002.
[101] W. Sun and T. T. Cai. Oracle and adaptive compound decision rules for false discovery rate control. Journal of the American Statistical Association, 102(479):901-912, 2007.
[102] W. Sun and T. T. Cai. Oracle and adaptive compound decision rules for false discovery rate control. J. Amer. Statist. Assoc., 102:901-912, 2007.
[103] W. Sun and Z. Wei. Large-scale multiple testing for pattern identification, with applications to time-course microarray experiments. J. Amer. Statist. Assoc., 106:73-88, 2011.
[104] W. Sun and Z. Wei. Hierarchical recognition of sparse patterns in large-scale simultaneous inference. Biometrika, 102:267-280, 2015.
[105] M. Szmit and A. Szmit. Usage of modified Holt-Winters method in the anomaly detection of network traffic: Case studies. Journal of Computer Networks and Communications, 2012, 2012.
[106] J. Taylor, R. Tibshirani, and B. Efron. The "missed rate" for the analysis of gene expression data. Biostatistics, 6(1):111-117, 2005.
[107] S. K. Thompson and G. A. Seber. Adaptive Sampling. John Wiley & Sons, Inc., New York, NY, 1996.
[108] T. T. Cai and W. Sun. Optimal screening and discovery of sparse signals with applications to multistage high throughput studies. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):197-223, 2017.
[109] J. W. Tukey. The Collected Works of John W. Tukey. Volume 3, 1994.
[110] A. Wald. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117-186, 1945.
[111] A. Wald. Sequential Analysis. Wiley, New York, 1947.
[112] M. Wand and M. Jones. Kernel Smoothing. Chapman & Hall, London, 1995.
[113] W. Wang and W.
Sun. Multistage adaptive testing of sparse signals. arXiv preprint arXiv:1707.07215, 2017.
[114] L. Wasserman and K. Roeder. High-dimensional variable selection. Ann. Statist., 37:2178-2201, 2009.
[115] D. Wei and A. O. Hero. Multistage adaptive estimation of sparse signals. IEEE Journal of Selected Topics in Signal Processing, 7(5):783-796, 2013.
[116] D. Wei and A. O. Hero. Performance guarantees for adaptive estimation of sparse signals. IEEE Transactions on Information Theory, 61(4):2043-2059, 2015.
[117] K. Worden, G. Manson, and N. R. Fieller. Damage detection using outlier analysis. Journal of Sound and Vibration, 229(3):647-667, 2000.
[118] D. Yekutieli. Hierarchical false discovery rate-controlling methodology. Journal of the American Statistical Association, 103(481):309-316, 2008.
[119] R. W. Zablocki, A. J. Schork, R. A. Levine, O. A. Andreassen, A. M. Dale, and W. K. Thompson. Covariate-modulated local false discovery rate for genome-wide association studies. Bioinformatics, 30:2098-2104, 2014.
[120] S. Zehetmayer, P. Bauer, and M. Posch. Two-stage designs for experiments with a large number of hypotheses. Bioinformatics, 21(19):3771-3777, 2005.
[121] S. Zehetmayer, P. Bauer, and M. Posch. Optimized multi-stage designs controlling the false discovery or the family-wise error rate. Stat. Med., 27(21):4145-4160, 2008.
[122] J.-H. Zhang, T. D. Chung, and K. R. Oldenburg. A simple statistical parameter for use in evaluation and validation of high throughput screening assays. Journal of Biomolecular Screening, 4(2):67-73, 1999.
Abstract
Two-sample multiple testing has a wide range of applications. The conventional practice first reduces the original observations to a vector of p-values and then chooses a cutoff to adjust for multiplicity. However, this data reduction step can cause a significant loss of information and thus lead to suboptimal testing procedures. In this thesis, I present a new framework for two-sample multiple testing that incorporates a carefully constructed auxiliary variable in inference to improve power. A data-driven multiple testing procedure is developed by employing a covariate-assisted ranking and screening (CARS) approach that optimally combines the information from both the primary and auxiliary variables. The proposed CARS procedure is shown to be asymptotically valid and optimal for false discovery rate (FDR) control. The procedure is implemented in the R package CARS. Numerical results confirm the effectiveness of CARS in FDR control and show that it achieves substantial power gain over existing methods. CARS is also illustrated through an application to the analysis of a satellite imaging data set for supernova detection.

In addition to the two-sample multiple testing problem above, multistage designs have been used in a wide range of scientific fields. By allocating sensing resources adaptively, one can effectively eliminate null locations and localize signals with a smaller study budget. We formulate a decision-theoretic framework for simultaneous multistage adaptive testing and study how to minimize the total number of measurements while meeting pre-specified constraints on both the false positive rate (FPR) and the missed discovery rate (MDR). The new procedure, which effectively pools information across individual tests using a simultaneous multistage adaptive ranking and thresholding (SMART) approach, can achieve precise error rate control and lead to great savings in total study costs. Numerical studies confirm the effectiveness of SMART for FPR and MDR control and show that it achieves substantial power gain over existing methods. The SMART procedure is demonstrated through the analysis of high-throughput screening data and spatial imaging data.

I also discuss ongoing and future research directions in this thesis, particularly on the topics of online FDR control and structured multiple testing. As an application, I discuss in detail the use of online FDR control for real-time anomaly detection of user engagement metrics at Snap Inc.