Large scale inference with structural information

By Bowen Gang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (MATHEMATICS)

May 2020

Copyright 2020 Bowen Gang

Abstract

High dimensional problems often have structures that can be exploited to improve statistical performance. The structural information can be in the form of heteroscedasticity, sparsity or dependence, among others. This work is an attempt to use empirical Bayes methods to effectively integrate structural information into hypothesis testing and estimation procedures.

The dissertation is divided into three parts. Chapter 1 of the thesis studies the problem of hypothesis testing under heteroscedasticity. The traditional approach uses standardization. However, there can be a significant loss in information from basing hypothesis tests on standardized statistics rather than the full data. We develop a new class of heteroscedasticity-adjusted ranking and thresholding (HART) rules that aim to improve existing methods by simultaneously exploiting commonalities and adjusting heterogeneities among the study units. The main idea of HART is to bypass standardization by directly incorporating both the summary statistic and its variance into the testing procedure. A key message is that the variance structure of the alternative distribution, which is subsumed under standardized statistics, is highly informative and can be exploited to achieve higher power. The proposed HART procedure is shown to be asymptotically valid and optimal for false discovery rate (FDR) control. Our simulations demonstrate that HART achieves a substantial power gain over existing methods at the same FDR level. We illustrate the implementation through a microarray analysis of myeloma.

Chapter 2 studies hypothesis testing in an online setting. The hypotheses arrive in a stream, and a real-time decision must be made before the next data point arrives. The error rate is required to be controlled at all decision points. Conventional simultaneous testing rules are no longer applicable due to the more stringent error constraints and the absence of future data. Moreover, the online decision-making process may come to a halt when the total error budget, or alpha-wealth, is exhausted. This work develops new structure-adaptive online rules for false discovery rate (FDR) control. The proposed procedure is a novel alpha-investing rule that precisely characterizes the trade-offs between different actions in online decision making. The procedure captures useful structural information of the dynamic model, learns the optimal threshold adaptively in an ongoing manner and optimizes the alpha-wealth allocation in the next period. We present theory and numerical results to show that the proposed method controls the FDR at all decision points and achieves a substantial power gain over existing online FDR procedures.

In Chapter 3 we consider the problem of simultaneous estimation of a sequence of dependent parameters that are generated from a hidden Markov model. Based on observing a noise-contaminated vector of observations from such a sequence model, we consider simultaneous estimation of all the parameters, irrespective of their hidden states, under squared error loss. We study the role of statistical shrinkage for improved estimation of these dependent parameters.
Being completely agnostic about the distributional properties of the unknown underlying hidden Markov model, we develop a novel non-parametric shrinkage algorithm. Our proposed method elegantly combines Tweedie-based non-parametric shrinkage ideas with efficient estimation of the hidden states under Markovian dependence. Based on extensive numerical experiments, we establish the superior performance of our proposed algorithm compared to non-shrinkage-based state-of-the-art parametric as well as non-parametric algorithms used in hidden Markov models. We provide decision theoretic properties of our methodology and exhibit its enhanced efficacy over popular shrinkage methods built under independence. We demonstrate the application of our methodology on real-world datasets for analyzing temporally dependent social and economic indicators, such as search trends and unemployment rates, as well as for estimating spatially dependent copy number variations.

Acknowledgements

I would like to thank my adviser Wenguang Sun for his patience, support and inspiring discussions on the subject matter; Stanislav Minsker for his excellent lectures on regression and machine learning; and Remigijus Mikulevicius for serving on my committee. I would like to thank all my co-authors: Luella Fu, Weinan Wang, Gareth James and Gourab Mukherjee. I would not have been able to complete this thesis without their help. Special thanks to Gourab Mukherjee for helpful discussion and advice on writing and presentation. I must thank all my friends from USC, especially Nicolle Sandoval Gonzalez, Michael Earnest, Fanhui Xu, Zhanerke Temirgaliyeva, Chukiat Phonsom, Jie Ruan, Jianjun Luo, Man Luo, Tianyou Wang, Yushen Wu and Yutong Tan, for all the fun time together. I would also like to thank all the members of the math department for their support and friendliness. I thank the Department of Recreational Sports for providing two excellent gyms, where I can release stress and build some muscle. Lastly, I want to thank my parents for their continual love and support and my grandpas for giving me a wonderful childhood.

To my parents and grandparents.

Contents

Abstract
Acknowledgements
List of Tables
List of Figures

1 Information Loss and Power Distortion from Standardizing in Multiple Hypothesis Testing
  1.1 Introduction
  1.2 Problem Formulation and the Issue of Standardizing
    1.2.1 Problem formulation
    1.2.2 Data processing and power loss: an illustrative example
    1.2.3 Heteroscedasticity and empirical null distribution
  1.3 HART: Heteroscedasticity Adjusted Ranking and Thresholding
    1.3.1 The oracle rule under heteroscedasticity
    1.3.2 Data-driven procedure and computational algorithms
    1.3.3 Theoretical properties of HART
  1.4 Simulation
    1.4.1 Implementation of HART
    1.4.2 Comparison in general settings
    1.4.3 Comparison under a two-group model
  1.5 Data Analysis
  1.6 Discussion
    1.6.1 Multiple testing with side information
    1.6.2 Open issues and future directions
  1.7 Formulas for the Illustrative Example
  1.8 Proofs of Theorems
    1.8.1 Proof of Theorem 1.3.1
    1.8.2 Proof of Theorem 1.3.3
  1.9 Proof of Proposition 1.3.2
    1.9.1 Proof of Step (a)
    1.9.2 Proof of Step (b)
    1.9.3 Proof of Step (c)
  1.10 Proof of Lemmas
    1.10.1 Proof of Lemma 1.8.1
    1.10.2 Proof of Lemma 1.9.1
    1.10.3 Proof of Lemma 1.9.2
    1.10.4 Proof of Lemma 1.9.3

2 Structure Online-Adaptive Rules for False Discovery Rate Control in Dynamic Models
  2.1 Introduction
  2.2 Oracle and Adaptive Rules for Online FDR Control
    2.2.1 Model and Problem Formulation
    2.2.2 The oracle rule for simultaneous testing
    2.2.3 Adapting to local structures by Clfdr: an illustration
    2.2.4 A new alpha-investing framework
    2.2.5 Oracle-assisted adaptive learning and the SAST algorithm
  2.3 Data-Driven SAST and Its Theoretical Properties
    2.3.1 Data-driven procedure and computational algorithms
    2.3.2 Theoretical properties of data-driven SAST
    2.3.3 Theory for data streams with fixed distributions
  2.4 Simulation
    2.4.1 Implementation Details
    2.4.2 Comparisons of online FDRs and MDRs
    2.4.3 Effects of the barrier
  2.5 Applications
    2.5.1 Time series anomaly detection
    2.5.2 IMPC dataset Genotype Discovery
  2.6 Proof of main theorems
    2.6.1 Proof of Theorem 1
    2.6.2 Proof of Theorem 2.3.3
    2.6.3 Proof of Theorem 3
  2.7 Proof of propositions
    2.7.1 Proof of Proposition 2.3.1
    2.7.2 Proof of Proposition 2.3.2
    2.7.3 Growing domain version of Proposition 2.3.2
    2.7.4 Proof of Lemma 2.7.1
    2.7.5 Proof of Lemma 2.7.2
    2.7.6 Proof of Lemma 2.7.3
  2.8 Optimality of the Clfdr rule in simultaneous testing

3 Large Scale Shrinkage Estimation under Markovian Dependence
  3.1 Introduction
  3.2 Shrinkage estimation in a hidden Markov model
    3.2.1 Model and notations
    3.2.2 Oracle estimator under independence
    3.2.3 Oracle estimator under HMM dependence
  3.3 Data-Driven Estimator and Computational Algorithms
    3.3.1 The modified Baum-Welch algorithm
    3.3.2 Choice of h
    3.3.3 Initialization and the HMMT estimator
  3.4 Asymptotic Properties of the proposed estimator
  3.5 Simulation Studies
    3.5.1 Comparison of the MSEs
    3.5.2 Comparison of the estimated f_1's
  3.6 Real data analysis
    3.6.1 Copy number variation
    3.6.2 Internet search trend
    3.6.3 Change in unemployment rate
  3.7 Discussion
  3.8 Proofs

List of Tables

1.1 Numbers of genes (% of total) that are selected by each method.
2.1 Algorithm 1. The oracle SAST rule.
2.2 Algorithm 2. The data-driven SAST.
2.3 Discovery comparison NYC.
2.4 Discovery comparison IMPC.
3.1 MSE comparison.

List of Figures

1.1 Rejection region comparison.
1.2 Plots of the density functions of Z.
1.3 Performance comparison, general setting.
1.4 Performance comparison, two-group setting.
1.5 Histogram of z-values and p-values.
1.6 Histogram of S_i (left), scatter plot of (Z_i, S_i) (right).
1.7 Scatter plot of rejected hypotheses by each method.
2.1 Structure-adaptiveness: Clfdr vs weighted p-values.
2.2 Simulation results for Settings 1 and 2.
2.3 Simulation results for Settings 3 and 4.
2.4 The incorporation of the barrier greatly reduces the MDR levels.
2.5 NYC Taxi passenger count time series.
2.6 NYC taxi data.
2.7 Anomaly detection.
3.1 Schematic diagram of the data generation process.
3.2 Estimated density comparison.
3.3 IMR103 data.
3.4 Truncated IMR103 data.
3.5 Density estimate of mean BAC log2 ratio.
3.6 Internet search trend data.
3.7 Density estimate of search trend.
3.8 Unemployment rate data.
3.9 Density estimate of unemployment rate.
3.10 Change of unemployment rate.

Chapter 1

Information Loss and Power Distortion from Standardizing in Multiple Hypothesis Testing

1.1 Introduction

In a wide range of modern scientific studies, multiple testing frameworks have been routinely employed by scientists and researchers to identify interesting cases among thousands or even millions of features. A representative sampling of settings where multiple testing has been used includes: genetics, for the analysis of gene expression levels (Tusher et al., 2001; Dudoit et al., 2003; Sun and Wei, 2011); astronomy, for the detection of galaxies (Miller et al., 2001); neuro-imaging, for the discovery of differential brain activity (Pacifico et al., 2004; Schwartzman et al., 2008); education, to identify student achievement gaps (Efron, 2008a); data visualization, to find potentially interesting patterns (Zhao et al., 2017); and finance, to evaluate trading strategies (Harvey and Liu, 2015).
The standard practice involves three steps: reduce the data in different study units to a vector of summary statistics; standardize the summary statistics to obtain significance indices such as z-values or p-values; and find a threshold of significance that corrects for multiplicity. Given a summary statistic $X_i$ with associated standard deviation $\sigma_i$, traditional multiple testing approaches begin by standardizing the observed data, $Z_i = X_i/\sigma_i$, which is then used to compute the p-value based on a problem-specific null distribution. Finally, the p-values are ordered, and a threshold is applied to keep the rate of Type I error below a pre-specified level.

Classical approaches concentrated on setting a threshold that controls the family-wise error rate (FWER), using methods such as the Bonferroni correction or Holm's procedure (Holm, 1979a). However, the FWER criterion becomes infeasible once the number of hypotheses under consideration grows into the thousands. The seminal contribution of Benjamini and Hochberg (1995) proposed replacing the FWER by the false discovery rate (FDR) and provided the BH algorithm for choosing a threshold on the ordered p-values which, under certain assumptions, is guaranteed to control the FDR.

While the BH procedure offers a significant improvement over classical approaches, it only controls the FDR at level $\alpha(1-\pi)$, where $\pi$ is the proportion of non-nulls, suggesting that its power can be improved by incorporating an adjustment for $\pi$ into the procedure. Benjamini and Hochberg (2000), Storey (2002) and Genovese and Wasserman (2002) proposed to first estimate the non-null proportion by $\hat\pi$ and then run BH at level $\alpha/(1-\hat\pi)$. Efron et al. (2001) proposed the local false discovery rate (Lfdr), which incorporates, in addition to the sparsity parameter, information about the alternative distribution. Sun and Cai (2007) proved that the z-value optimal procedure is an Lfdr thresholding rule and that this rule uniformly dominates the p-value optimal procedure in Genovese and Wasserman (2002). The key idea is that the shape of the alternative could potentially affect the rejection region, but this important structural information is lost when converting z-values to p-values. For example, when the means of non-null effects are more likely to be positive than negative, taking this asymmetry of the alternative into account increases the power. However, the sign information is not captured by conventional p-value methods, which only consider information about the null. This work further investigates the usefulness of the configuration of the alternative distribution characterized by the variance structure, which is discarded by standardizing.

Although a wide variety of multiple testing approaches have been proposed, they almost all begin with the standardized data $Z_i$ (or its associated p-value, $P_i$). In fact, in large-scale studies where the data are collected from intrinsically diverse sources, the standardization step has been upheld as conventional wisdom, for it takes into account the variability of the summary statistics and suppresses the heterogeneity, enabling one to compare multiple study units on an equal footing. For example, in microarray studies Efron et al. (2001) first compute standardized two-sample t-statistics for comparing the gene expression levels across two biological conditions and then convert the t-statistics to z-scores, which are further employed to carry out FDR analyses.
Binomial data are also routinely standardized by rescaling the number of successes $X_i$ by the number of trials $n_i$ to obtain success probabilities $\hat p_i = X_i/n_i$ and then converting the probabilities to z-scores (Efron, 2008a,b). However, while standardization is an intuitive, and widely adopted, approach, we argue in this paper that there can be a significant loss in information from basing hypothesis tests on $Z_i$ rather than the full data $(X_i, \sigma_i)$. [Footnote 1: Unless otherwise stated, the term "full data" specifically refers to the pair $(X_i, \sigma_i)$ in this chapter. In practice, the process of deriving the pair $(X_i, \sigma_i)$ from the original (full) data could also suffer from information loss. This goes beyond the scope of this work; see the rejoinder of Cai et al. (2019) for related discussions.] This observation, which we formalize later in the paper, is based on the fact that the power of tests can vary significantly as $\sigma$ changes, but this difference in power is suppressed when the data are standardized and treated as equivalent. In the illustrative example in Section 1.2.2, we show that by accounting for differences in $\sigma$, an alternative ordering of rejections can be obtained, allowing one to identify more true positives at the same FDR level.

This chapter develops a new class of heteroscedasticity-adjusted ranking and thresholding (HART) rules for large-scale multiple testing that aim to improve existing methods by simultaneously exploiting commonalities and adjusting heterogeneities among the study units. The main strategy of HART is to bypass standardization by directly incorporating $(X_i, \sigma_i)$ into the testing procedure. We adopt a two-step approach. In the first step, a new significance index is developed by taking into account the alternative distribution of each $X_i$ conditioned on $\sigma_i$; hence HART avoids power distortion. This kind of conditioning is not possible for standardized values since the $\sigma_i$ are subsumed under $Z_i$. Then, in the second step, the significance indices are ordered and the smallest ones are rejected up to a given cutoff. We develop theory to show that HART is optimal for integrating the information from both $X_i$ and $\sigma_i$. Numerical results are provided to confirm that HART controls the FDR in finite samples and uniformly dominates existing methods in power.

The findings are impactful for three reasons. First, the observation that standardization can be inefficient has broad implications since, due to inherent variabilities or differing sample sizes between study units, standardized tests are commonly applied to large-scale heterogeneous data to make different study units comparable. Second, our finding enriches the recent line of research on multiple testing with side and structural information (e.g., Lei and Fithian, 2018; Cai et al., 2019; Li and Barber, 2019, among others). In contrast with existing works that have focused on the usefulness of the sparsity structure, our characterization of the impact of heteroscedasticity, or more concretely the shape of the alternative distribution, is new. Finally, HART convincingly demonstrates the benefits of leveraging structural information in high-dimensional settings when the number of tests is in the thousands or more. Ideas from HART apply to smaller data sets as well, but the algorithm is designed to capitalize on copious data in ways not possible for procedures intended for moderate amounts of data, and thus is most useful in large-scale testing scenarios where the structure can be learned from data with good precision.
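To make the conventional standardize-then-threshold pipeline concrete, here is a minimal sketch, assuming a Gaussian model and illustrative parameter values (nothing below is taken from the dissertation): it standardizes heteroscedastic summary statistics, converts them to two-sided p-values, and applies the BH step-up rule.

```python
import numpy as np
from scipy.stats import norm

def bh_threshold(pvals, alpha=0.10):
    """Benjamini-Hochberg step-up rule: return a boolean rejection vector."""
    m = len(pvals)
    order = np.argsort(pvals)
    # largest k with p_(k) <= alpha * k / m
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

# Standard practice: standardize, convert to two-sided p-values, threshold.
rng = np.random.default_rng(0)
m = 1000
sigma = rng.uniform(0.5, 4.0, m)          # heteroscedastic standard deviations
theta = rng.random(m) < 0.1               # 10% non-nulls (illustrative)
X = rng.normal(np.where(theta, 2.0, 0.0), sigma)   # summary statistics
Z = X / sigma                             # standardization
P = 2 * norm.sf(np.abs(Z))                # two-sided p-values
print(bh_threshold(P, alpha=0.10).sum(), "rejections")
```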
1.2 Problem Formulation and the Issue of Standardizing

This section first describes the problem formulation and then discusses an example to illustrate the key issue.

1.2.1 Problem formulation

Suppose the summary statistics $X_1, \dots, X_m$ are normal variables obeying the distribution
\[ X_i \mid \mu_i, \sigma_i^2 \;\overset{\text{ind}}{\sim}\; N(\mu_i, \sigma_i^2), \tag{1.2.1} \]
where $\mu_i$ follows a mixture model with a point mass at zero and $\sigma_i$ is drawn from an unspecified prior:
\[ \mu_i \overset{\text{iid}}{\sim} (1-\pi)\,\delta_0(\cdot) + \pi\, g_\mu(\cdot), \qquad \sigma_i^2 \overset{\text{iid}}{\sim} g_\sigma(\cdot). \tag{1.2.2} \]
Here $\pi$ is the proportion of nonzero signals, $\delta_0(\cdot)$ is a Dirac delta function, and $g_\mu(\cdot)$ and $g_\sigma(\cdot)$ are unknown density functions. Following tradition in dealing with heteroscedasticity problems (e.g. Xie et al., 2012; Weinstein et al., 2018), we assume that the $\sigma_i$ are known. This simplifies the discussion and enables us to focus on key ideas. For practical applications, we use a consistent estimator of $\sigma_i$. The goal is to simultaneously test $m$ hypotheses:
\[ H_{0,i}: \mu_i = 0 \quad \text{vs.} \quad H_{1,i}: \mu_i \neq 0, \qquad i = 1, \dots, m. \tag{1.2.3} \]

The multiple testing problem (1.2.3) is concerned with the simultaneous inference of $\boldsymbol\theta = \{\theta_i = \mathbb I(\mu_i \neq 0): i = 1, \dots, m\}$, where $\mathbb I(\cdot)$ is an indicator function. The decision rule is represented by a binary vector $\boldsymbol\delta = (\delta_i : 1 \le i \le m) \in \{0, 1\}^m$, where $\delta_i = 1$ means that we reject $H_{0,i}$, and $\delta_i = 0$ means that we do not reject $H_{0,i}$. The false discovery rate (FDR) (Benjamini and Hochberg, 1995), defined as
\[ \mathrm{FDR} = E\left[\frac{\sum_i (1-\theta_i)\delta_i}{\max\{\sum_i \delta_i,\, 1\}}\right], \tag{1.2.4} \]
is a widely used error criterion in large-scale testing problems. A closely related criterion is the marginal false discovery rate
\[ \mathrm{mFDR} = \frac{E\{\sum_i (1-\theta_i)\delta_i\}}{E(\sum_i \delta_i)}. \tag{1.2.5} \]
The mFDR is asymptotically equivalent to the FDR for a general set of decision rules satisfying certain first- and second-order conditions on the number of rejections (Basu et al., 2018), including p-value based tests for independent hypotheses (Genovese and Wasserman, 2002) and weakly dependent hypotheses (Storey et al., 2004). We shall show that our proposed data-driven procedure controls both the FDR and mFDR asymptotically; the main consideration for using the mFDR criterion is to derive optimality theory and facilitate methodological developments.

We use the expected number of true positives, $\mathrm{ETP} = E(\sum_{i=1}^m \theta_i\delta_i)$, to evaluate the power of an FDR procedure. Other power measures include the missed discovery rate (MDR, Taylor et al., 2005), average power (Benjamini and Hochberg, 1995; Efron, 2007) and the false negative rate or false non-discovery rate (FNR, Genovese and Wasserman, 2002; Sarkar, 2002). Cao et al. (2013) showed that under the monotone likelihood ratio condition (MLRC), maximizing the ETP is equivalent to minimizing the MDR and FNR. The ETP is used in this chapter because it is intuitive and simplifies the theory. We call a multiple testing procedure valid if it controls the FDR at the nominal level and optimal if it has the largest ETP among all valid FDR procedures.
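As a concrete illustration of these criteria, the following hedged sketch (illustrative names, not from the thesis) tallies the false discovery proportion and the number of true positives for a decision rule when the ground truth is known, as in simulations; averaging these quantities over replications estimates the FDR and the ETP.

```python
import numpy as np

def fdp(delta, theta):
    """False discovery proportion: false rejections over total rejections (or 1)."""
    delta = np.asarray(delta, dtype=bool)
    theta = np.asarray(theta, dtype=bool)
    return ((~theta) & delta).sum() / max(delta.sum(), 1)

def true_positives(delta, theta):
    """Number of true positives; its average over replications estimates the ETP."""
    delta = np.asarray(delta, dtype=bool)
    theta = np.asarray(theta, dtype=bool)
    return (theta & delta).sum()

# Example: 10 hypotheses, 3 non-nulls, a rule that rejects 4 of them.
theta = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
delta = np.array([0, 1, 1, 0, 1, 0, 0, 0, 0, 1])
print(fdp(delta, theta))             # 0.5: two of the four rejections are false
print(true_positives(delta, theta))  # 2
```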
The building blocks for conventional multiple testing procedures are standardized statistics such as $Z_i$ or $P_i$. The tacit rationale in conventional practice is that, in terms of the standardized means $\mu_i/\sigma_i$, the simultaneous inference problem
\[ H_{0,i}: \mu_i/\sigma_i = 0 \quad \text{vs.} \quad H_{1,i}: \mu_i/\sigma_i \neq 0, \qquad i = 1, \dots, m, \tag{1.2.6} \]
is equivalent to the formulation (1.2.3); hence the standardization step has no impact on multiple testing. However, this seemingly plausible argument, which only takes into account the null distribution, fails to consider the change in the structure of the alternative distribution. Next we present an example to illustrate the information loss and power distortion from standardizing.

1.2.2 Data processing and power loss: an illustrative example

The following diagram describes a data processing approach that is often adopted when performing hypothesis tests:
\[ (X_i, \sigma_i) \;\to\; Z_i = X_i/\sigma_i \;\to\; P_i = 2\Phi(-|Z_i|). \tag{1.2.7} \]
We start with the full data consisting of $X_i$ and $\sigma_i^2 = \mathrm{Var}(X_i \mid \mu_i)$. The data are then standardized, $Z_i = X_i/\sigma_i$, and finally converted to a two-sided p-value, $P_i$. Typically these p-values are ordered from smallest to largest, a threshold is chosen to control the FDR, and hypotheses with p-values below the threshold are rejected.

Here we present a simple example to illustrate the information loss that can occur at each of these data compression steps. Consider a hypothesis testing setting with $H_{0,i}: \mu_i = 0$ and the data coming from a normal mixture model,
\[ X_i \mid \theta_i \;\overset{\text{ind}}{\sim}\; (1-\pi)N(0, \sigma_i^2) + \pi N(\mu_a, \sigma_i^2), \tag{1.2.8} \]
where $\sigma_i \sim U[0.5, 4]$, $\mu_a = 2$ and $\pi = P(\theta_i = 1) = 0.1$.

We examine three possible approaches to controlling the FDR at $\alpha = 0.1$. In the p-value approach we reject for all p-values below a given threshold. Note that, when the FDR is exhausted, this is the uniformly most powerful p-value based method (Genovese and Wasserman, 2002), so it is superior to, for example, the BH procedure. Alternatively, in the z-value approach we reject for all suitably small $P(H_0 \mid Z_i)$, which is in turn the most powerful z-value based method (Sun and Cai, 2007). Finally, in the full data approach we reject when $P(H_0 \mid X_i, \sigma_i)$ is below a certain threshold, which we show later is optimal given $X_i$ and $\sigma_i$. In computing the thresholds, we assume that there is an oracle knowing the alternative distribution; the formulas for our theoretical calculations are provided in Section 1.7 of the Appendix. For the model given by (1.2.8) these rules correspond to:
\[ \boldsymbol\delta_p = \{\mathbb I(P_i \le 0.0006): 1\le i\le m\} = \{\mathbb I(|Z_i| \ge 3.43): 1\le i\le m\}, \]
\[ \boldsymbol\delta_z = \{\mathbb I(P(H_0 \mid Z_i) \le 0.24): 1\le i\le m\} = \{\mathbb I(Z_i \ge 3.13): 1\le i\le m\}, \]
\[ \boldsymbol\delta_{\mathrm{full}} = \{\mathbb I(P(H_0 \mid X_i, \sigma_i) \le 0.28): 1\le i\le m\}, \]
with the thresholds chosen such that the FDR is exactly 10% for all three approaches. However, while the FDRs of these three methods are identical, the average powers, $\mathrm{AP}(\boldsymbol\delta) = \frac{1}{m\pi} E(\sum_{i=1}^m \theta_i\delta_i)$, differ significantly:
\[ \mathrm{AP}(\boldsymbol\delta_p) = 5.0\%, \qquad \mathrm{AP}(\boldsymbol\delta_z) = 7.2\%, \qquad \mathrm{AP}(\boldsymbol\delta_{\mathrm{full}}) = 10.5\%. \tag{1.2.9} \]

To better understand these differences, consider the left hand plot in Figure 1.1, which illustrates the rejection regions for each approach as a function of $Z$ and $\sigma$. [Footnote 2: The p-value method will also reject for large negative values of $Z$ but, to keep the figure readable, we have not plotted that region.]

[Figure 1.1: Rejection region comparison. Left: Rejection regions for the p-value approach (black line), z-value approach (red line) and full data approach (green line) as a function of $Z$ and $\sigma$. Approaches reject for all points above their corresponding line. Right: Heat map of the relative proportions (on a log scale) of alternative vs null hypotheses for different $Z$ and $\sigma$. Blue corresponds to lower ratios and purple to higher ratios. The solid black line represents equal fractions of null and alternative, while the dashed line corresponds to three times as many alternative as null.]

In the blue region all methods fail to reject the null hypothesis, while all methods reject in the black region. The green region corresponds to the space where the full data approach rejects the null while the other two methods do not. Alternatively, in the red region both the z-value and full data methods reject while the p-value approach fails to do so.
Finally, in the white region the full data approach fails to reject while the z-value method does reject.

We first compare $\boldsymbol\delta_z$ and $\boldsymbol\delta_p$. Let $\pi_+$ and $\pi_-$ denote the proportions of positive effects and negative effects, respectively. Then $\pi_+ = 0.1$ and $\pi_- = 0$. This asymmetry of the alternative distribution can be captured by $\boldsymbol\delta_z$, which uses a one-sided rejection region. (Note that this asymmetric rejection region is not pre-specified but a consequence of the theoretical derivation. In practice $\boldsymbol\delta_z$ can be emulated by an adaptive z-value approach that is fully data-driven (Sun and Cai, 2007).) By contrast, $\boldsymbol\delta_p$ enforces a two-sided rejection region that is symmetrical about 0, trading off extra rejections in the region $Z_i \le -3.43$ for fewer rejections in the region where $3.13 \le Z_i \le 3.43$. As all nonzero effects are positive, negative z-values are highly unlikely to come from the alternative; this accounts for the 2.2% loss in AP for the p-value method.

Next consider $\boldsymbol\delta_{\mathrm{full}}$ vs $\boldsymbol\delta_z$. The full data approach trades off extra rejections in the green space for fewer rejections in the white space. This may seem like a sub-optimal trade-off given that the green space is smaller. However, the green space actually contains many more true alternative hypotheses. Approximately 3.8% of the true alternatives occur in the green region as opposed to only 0.5% in the white region, which accounts for the 3.3% higher AP for the full data approach.

At first Figure 1.1 may appear counterintuitive. Why should we reject for low z-values in the green region but fail to reject for high z-values in the white region? The key observation here is that not all z-values are created equal. In the green region the observed data are far more consistent with the alternative hypothesis than with the null hypothesis. For example, with $Z = 4$ and $\sigma = 0.5$ our observed $X$ is four standard deviations from the null mean but exactly equal to the alternative mean. Alternatively, while it is true that in the white region the high z-values suggest that the data are inconsistent with the null hypothesis, they are also highly inconsistent with the alternative hypothesis. For example, with $Z = 4$ and $\sigma = 2$ our observed $X$ is 8, which is four standard deviations from the null mean, but also three standard deviations from the alternative mean. Given that 90% of observations come from the null hypothesis, we do not have conclusive evidence as to whether these data are from the null or the alternative. A z-value of 4 with $\sigma = 0.5$ is far more likely to come from the alternative hypothesis than is a z-value of 4 with $\sigma = 2$.

The right hand plot of Figure 1.1 makes this clear. Here we have plotted (on a log scale) the relative proportions of alternative vs null hypotheses for different $Z$ and $\sigma$. Blue corresponds to lower ratios and purple to higher ratios. The solid black line represents equal fractions of null and alternative, while the dashed line corresponds to three times as many alternative as null. Clearly, for the same z-value, alternative hypotheses are relatively more common for low $\sigma$ values. Notice how closely the shape of the dashed line maps the green rejection boundary in the left hand plot, which indicates that the full data method is correctly capturing the regions with the most alternative hypotheses. By contrast, the p-value and z-value methods fail to correctly adjust for different values of $\sigma$.
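To make the "not all z-values are created equal" point concrete, here is a small hedged sketch (illustrative code, not part of the thesis) that evaluates the full-data statistic $P(H_0 \mid x, \sigma)$ under the toy model (1.2.8); the rejection thresholds quoted above come from an oracle calibration of the FDR that is not reproduced here.

```python
import numpy as np
from scipy.stats import norm

# Toy model (1.2.8): X | theta ~ (1-pi)*N(0, sigma^2) + pi*N(mu_a, sigma^2),
# with pi = 0.1 and mu_a = 2. The common 1/sigma factors in the densities cancel.
PI, MU_A = 0.1, 2.0

def t_full(x, sigma):
    """Posterior null probability P(H0 | x, sigma) under the toy model."""
    f0 = norm.pdf(x / sigma)             # null component
    f1 = norm.pdf((x - MU_A) / sigma)    # alternative component
    return (1 - PI) * f0 / ((1 - PI) * f0 + PI * f1)

# The same z-value, z = 4, is far stronger evidence against the null when sigma
# is small (x sits on the alternative mean) than when sigma is large.
for sigma in (0.5, 1.0, 2.0, 4.0):
    x = 4.0 * sigma
    print(f"sigma = {sigma:>3}:  P(H0 | x, sigma) = {t_full(x, sigma):.3f}")
```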
Figure 1.2 provides one further way to understand the effect of standardizing the data. Here we have plotted the density functions of $Z$ under the null hypothesis (black solid) and the alternative hypothesis (red dashed) for different values of $\sigma$. The densities have been multiplied by the relative probability of each hypothesis occurring, so points where the densities cross correspond to an equal likelihood for either hypothesis. The blue line represents an observation, which is fixed at $Z = 2$ in each plot. The alternative density is centered at $Z = 2/\sigma$, so when $\sigma$ is large the standardized null and alternative are very similar, making it hard to know which distribution $Z = 2$ belongs to. As $\sigma$ decreases, the standardized alternative distribution moves away from the null and becomes more consistent with $Z = 2$. However, eventually the alternative moves past $Z = 2$ and it again becomes unclear which distribution our data belong to. Standardizing means that the null hypothesis is consistent for all values of $\sigma$, but the alternative hypothesis can change dramatically as a function of the standard deviation.

[Figure 1.2: Plots of the density functions of $Z$ under the null hypothesis (black solid) and alternative hypothesis (red dashed) for different values of $\sigma$ (panels: $\sigma = 4$, $2$, $1$, $1/2$). The blue line represents an observation at $Z = 2$.]

To summarize, the information loss that occurs in both steps of the data processing pipeline (1.2.7) reveals the essential role of the alternative distribution in simultaneous testing. This structure of the alternative is not captured by the p-value, which is calculated based only on the null. Our result (1.2.9) in the toy example shows that by exploiting (i) the overall asymmetry of the alternative via the z-value and (ii) the heterogeneity among individual alternatives via the full data, the average power of conventional p-value based methods can be doubled.

1.2.3 Heteroscedasticity and empirical null distribution

In the context of simultaneous testing with composite null hypotheses, Sun and McLain (2012) argued that the conventional testing framework, which involves rescaling or standardization, can become problematic: "In multiple testing problems where the null is simple ($H_{0,i}: \mu_i = 0$), the heteroscedasticity in errors can be removed by rescaling all $\sigma_i$ to 1. However, when the null is composite, such a rescaling step would distort the scientific question." Sun and McLain (2012) further proposed the concept of the empirical composite null as an extension of Efron's empirical null (Efron, 2004) for testing composite nulls $H_{0,i}: \mu_i \in [-a_0, a_0]$ under heteroscedastic models.

It is important to note that the main message of this chapter, which focuses on the impact of heteroscedasticity on the alternative instead of the null, is fundamentally different from that in Sun and McLain (2012). In fact, we show that even when the null is simple, the heteroscedasticity still matters. Our finding, which somewhat contradicts the above quote, is more striking and even counter-intuitive. Moreover, we shall see that our data-driven HART procedure, which is based on Tweedie's formula (or the f-modeling approach, Efron, 2011), is very different from the deconvoluting kernel method (or g-modeling approach) in Sun and McLain (2012); see the footnote below. The new two-step bivariate estimator in Section 1.3.2 is novel and highly nontrivial; the techniques employed in the proofs of the theory are also very different.
[Footnote 3: The deconvoluting kernel method has an extremely slow convergence rate. Our numerical studies show that the method in Sun and McLain (2012) only works for composite nulls, where the uncertainties in estimation can be smoothed out over an interval $[-a_0, a_0]$. However, the deconvoluting method is highly unstable and does not work well when testing simple nulls $H_{0,i}: \mu_i = 0$. Our numerical results show that the two-step method in Section 1.3.2 works much better.]

1.3 HART: Heteroscedasticity Adjusted Ranking and Thresholding

The example in the previous section presents a setting where hypothesis tests based on the full data $(X_i, \sigma_i)$ can produce higher power than tests based on the standardized data $Z_i$. In this section we formalize this idea and show that the result holds in general for heteroscedasticity problems. We first assume that the distributional information is known and derive an oracle rule based on the full data in Section 1.3.1. Section 1.3.2 develops data-driven schemes and computational algorithms to implement the oracle rule. Finally, theoretical properties of the proposed method are established in Section 1.3.3.

1.3.1 The oracle rule under heteroscedasticity

Note that the models given by (1.2.1) and (1.2.2) imply that
\[ X_i \mid \sigma_i \;\overset{\text{ind}}{\sim}\; f_{\sigma_i}(x) = (1-\pi) f_{0,\sigma_i}(x) + \pi f_{1,\sigma_i}(x), \tag{1.3.1} \]
where $f_{0,\sigma}(x) = \frac{1}{\sigma}\phi(x/\sigma)$ is the null density, $f_{1,\sigma}(x) = \frac{1}{\sigma}\int \phi\{(x-\mu)/\sigma\}\, g_\mu(\mu)\, d\mu$ is the alternative density, $\phi(x)$ is the density of a standard normal variable, and $f_\sigma(x)$ is the mixture density. For the standardized data $Z_i = X_i/\sigma_i$, Model 1.3.1 reduces to
\[ Z_i \overset{\text{iid}}{\sim} f(z) = (1-\pi) f_0(z) + \pi f_1(z), \tag{1.3.2} \]
where $f_0(z) = \phi(z)$, $f_1(z)$ is the non-null density, and $f(z)$ is the mixture density of the z-values. As discussed previously, a standard approach involves converting the z-value to a two-sided p-value $P_i = 2\Phi(-|Z_i|)$, where $\Phi(\cdot)$ is the standard normal cdf. The mixture model based on p-values is
\[ P_i \overset{\text{iid}}{\sim} g(p) = (1-\pi)\,\mathbb I_{[0,1]}(p) + \pi g_1(p), \qquad p \in [0, 1], \tag{1.3.3} \]
where $\mathbb I(\cdot)$ is an indicator function, and $g(\cdot)$ and $g_1(\cdot)$ are the mixture density and non-null density of the p-values, respectively. Models 1.3.2 and 1.3.3 provide a powerful and flexible framework for large-scale inference and have been used in a range of related problems such as signal detection, sparsity estimation and multiple testing [e.g. Efron et al. (2001); Storey (2002); Genovese and Wasserman (2002); Donoho and Jin (2004); Newton et al. (2004); Jin and Cai (2007)].

The oracle FDR procedures for Models 1.3.2 and 1.3.3 are both known. We first review the oracle z-value procedure (Sun and Cai, 2007). Define the local FDR (Efron et al., 2001)
\[ \mathrm{Lfdr}_i = P(H_0 \mid z_i) = P(\theta_i = 0 \mid z_i) = \frac{(1-\pi) f_0(z_i)}{f(z_i)}. \tag{1.3.4} \]
Sun and Cai (2007) showed that the optimal z-value FDR procedure is given by
\[ \boldsymbol\delta_z = [\mathbb I\{\mathrm{Lfdr}(z_i) < c^*\}: 1\le i\le m], \tag{1.3.5} \]
where $c^*$ is the largest Lfdr threshold such that $\mathrm{mFDR} \le \alpha$. Similarly, Genovese and Wasserman (2002) showed that the optimal p-value based FDR procedure is given by
\[ \boldsymbol\delta_p = [\mathbb I\{P_i < c^*\}: 1\le i\le m], \tag{1.3.6} \]
where $c^*$ is the largest p-value threshold such that $\mathrm{mFDR} \le \alpha$.
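As a quick illustration of the z-value oracle rule, the following hedged sketch computes the Lfdr statistic (1.3.4) for a known two-group model; the non-null density used here is an arbitrary illustrative choice, not one derived in the thesis.

```python
import numpy as np
from scipy.stats import norm

def lfdr(z, pi, f1_pdf):
    """Oracle local FDR (1.3.4) for the z-value two-group model (1.3.2)."""
    f0 = norm.pdf(z)                       # theoretical null density
    f = (1 - pi) * f0 + pi * f1_pdf(z)     # marginal mixture density of z-values
    return (1 - pi) * f0 / f

def f1_pdf(z):
    # Illustrative (assumed) non-null z-density: standardized means near 3.
    return norm.pdf(z, loc=3.0, scale=1.0)

z = np.array([-1.0, 2.0, 3.0, 4.0])
print(np.round(lfdr(z, pi=0.1, f1_pdf=f1_pdf), 3))
# Rejecting {Lfdr(z_i) < c} for the largest c such that mFDR <= alpha gives the
# oracle z-value rule (1.3.5); the data-driven analogue plugs in estimates.
```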
Next we derive the oracle rule based on the $m$ pairs $\{(x_i, \sigma_i): i = 1, \dots, m\}$. This new problem can be recast and solved in the framework of multiple testing with a covariate sequence. Consider Model 1.3.1 and define the heterogeneity-adjusted significance index
\[ T_i \equiv T(x_i, \sigma_i) = P(\theta_i = 0 \mid x_i, \sigma_i) = \frac{(1-\pi) f_{0,\sigma_i}(x_i)}{f_{\sigma_i}(x_i)}. \tag{1.3.7} \]
Let $Q(t)$ denote the mFDR level of the testing rule $[\mathbb I\{T_i < t\}: 1\le i\le m]$. Then the oracle full data procedure is denoted
\[ \boldsymbol\delta_{\mathrm{full}} = [\mathbb I\{T_i < t^*\}: 1\le i\le m], \quad \text{where } t^* = \sup\{t: Q(t) \le \alpha\}. \tag{1.3.8} \]
The next theorem provides the key result showing that $\boldsymbol\delta_{\mathrm{full}}$ has the highest power amongst all $\alpha$-level FDR rules based on $\{(x_i, \sigma_i): i = 1, \dots, m\}$.

Theorem 1.3.1. Let $\mathcal D_\alpha$ be the collection of all testing rules based on $\{(x_i, \sigma_i): i = 1, \dots, m\}$ such that $\mathrm{mFDR} \le \alpha$. Then $\mathrm{ETP}_{\boldsymbol\delta} \le \mathrm{ETP}_{\boldsymbol\delta_{\mathrm{full}}}$ for any $\boldsymbol\delta \in \mathcal D_\alpha$. In particular, we have $\mathrm{ETP}_{\boldsymbol\delta_p} \le \mathrm{ETP}_{\boldsymbol\delta_z} \le \mathrm{ETP}_{\boldsymbol\delta_{\mathrm{full}}}$.

Based on Theorem 1.3.1, our proposed methodology employs a heteroscedasticity-adjusted ranking and thresholding (HART) rule that operates in two steps: first rank all hypotheses according to $T_i$, and then reject all hypotheses with $T_i \le t^*$. We discuss in Section 1.3.2 our finite sample approach for implementing HART using estimates of $T_i$ and $t^*$.

1.3.2 Data-driven procedure and computational algorithms

We first discuss how to estimate $T_i$ and then turn to $t^*$. Inspecting the formula (1.3.7) for $T_i$, the null density $f_{0,\sigma_i}(x_i)$ is known and the non-null proportion $\pi$ can be estimated by $\hat\pi$ using existing methods such as Storey's estimator (Storey, 2002) or Jin-Cai's estimator (Jin and Cai, 2007). Hence we focus on the problem of estimating $f_{\sigma_i}(x_i)$.

There are two possible approaches for implementing this step. The first involves directly estimating $f_{\sigma_i}(x_i)$, while the second is implemented by first estimating $f_{1,\sigma_i}(x_i)$ and then computing the marginal density via
\[ \hat f_{\sigma_i}(x_i) = (1-\hat\pi) f_{0,\sigma_i}(x_i) + \hat\pi \hat f_{1,\sigma_i}(x_i). \tag{1.3.9} \]
Our theoretical and empirical results strongly suggest that this latter approach provides superior results, so we adopt this method.

Remark 1. The main concern about the direct estimation of $f_{\sigma_i}(x_i)$ is that the tail areas of the mixture density are of the greatest interest in multiple testing but are unfortunately the hardest parts to estimate accurately, due to the few observations in the tails. The fact that $f_{\sigma_i}(x_i)$ appears in the denominator exacerbates the situation. The decomposition in (1.3.9) increases the stability of the density estimate by incorporating the known null component.

Standard bivariate kernel methods (Silverman, 1986; Wand and Jones, 1994) are not suitable for estimating $f_{1,\sigma_i}(x_i)$ because, unlike a typical variable, $\sigma_i$ plays a special role in a density function and needs to be modeled carefully. Fu et al. (2018) recently addressed a closely related problem using the following weighted bivariate kernel estimator:
\[ \hat f_\sigma(x) := \sum_{j=1}^m \frac{\phi_{h_\sigma}(\sigma - \sigma_j)}{\sum_{j=1}^m \phi_{h_\sigma}(\sigma - \sigma_j)}\, \phi_{h_{x,j}}(x - x_j), \tag{1.3.10} \]
where $h = (h_x, h_\sigma)$ is a pair of bandwidths, $\phi_{h_\sigma}(\sigma - \sigma_j)/\{\sum_{j=1}^m \phi_{h_\sigma}(\sigma - \sigma_j)\}$ determines the contribution of $(x_j, \sigma_j)$ based on $\sigma_j$, $h_{x,j} = h_x\sigma_j$ is a bandwidth that varies across $j$, and $\phi_h(z) = \frac{1}{\sqrt{2\pi}\,h}\exp\{-\frac{z^2}{2h^2}\}$ is a Gaussian kernel. The variable bandwidth $h_{x,j}$ up-weights/down-weights observations corresponding to small/large $\sigma_j$; this suitably adjusts for the heteroscedasticity in the data.

Let $\mathcal M_1 = \{i: \theta_i = 1\}$. In the ideal setting where $\theta_j$ is observed, one could extend (1.3.10) to estimate $f_{1,\sigma_i}(x_i)$ via
\[ \tilde f_{1,\sigma}(x) = \sum_{j\in\mathcal M_1} \frac{\phi_{h_\sigma}(\sigma - \sigma_j)}{\sum_{k\in\mathcal M_1} \phi_{h_\sigma}(\sigma - \sigma_k)}\, \phi_{h_{x,j}}(x - x_j). \tag{1.3.11} \]
Given that $\theta_j$ is unknown, we cannot directly implement (1.3.11). Instead we apply a weighted version of (1.3.11),
\[ \hat f_{1,\sigma_i}(x_i) = \sum_{j=1}^m \frac{\hat w_j\, \phi_{h_\sigma}(\sigma_i - \sigma_j)}{\sum_{k=1}^m \hat w_k\, \phi_{h_\sigma}(\sigma_i - \sigma_k)}\, \phi_{h_{x,j}}(x_i - x_j), \tag{1.3.12} \]
with weights $\hat w_j$ equal to an estimate of $P(\theta_j = 1 \mid x_j, \sigma_j)$.
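A hedged sketch of the kernel machinery in (1.3.10) and (1.3.12) is given below; bandwidths and observation weights are passed in as arguments, all names and numbers are illustrative, and the thesis selects the bandwidths as described in Section 1.4.1. The two-step choice of the weights $\hat w_j$ is described next.

```python
import numpy as np

def gauss_kernel(z, h):
    """Gaussian kernel phi_h(z) with bandwidth h."""
    return np.exp(-0.5 * (z / h) ** 2) / (np.sqrt(2 * np.pi) * h)

def weighted_kernel_density(x0, sigma0, x, sigma, h_x, h_sigma, w=None):
    """
    Weighted bivariate kernel estimate in the spirit of (1.3.10)/(1.3.12):
    kernel weights in the sigma-direction, a variable bandwidth h_x * sigma_j
    in the x-direction, and optional observation weights w_j (estimates of
    P(theta_j = 1 | x_j, sigma_j)); uniform weights recover (1.3.10).
    """
    x, sigma = np.asarray(x, dtype=float), np.asarray(sigma, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    omega = w * gauss_kernel(sigma0 - sigma, h_sigma)   # sigma-direction weights
    omega = omega / omega.sum()
    return np.sum(omega * gauss_kernel(x0 - x, h_x * sigma))

# Usage sketch with illustrative bandwidths:
rng = np.random.default_rng(1)
sigma = rng.uniform(0.5, 4.0, 2000)
x = rng.normal(np.where(rng.random(2000) < 0.1, 2.0, 0.0), sigma)
print(weighted_kernel_density(2.0, 1.0, x, sigma, h_x=0.3, h_sigma=0.3))
```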
In particular, we adopt a two-step approach:

1. Compute $\hat f^{(0)}_{1,\sigma_i}(x_i)$ via (1.3.12) with initial weights $\hat w^{(0)}_j = 1 - \hat T^{(0)}_j$ for all $j$, where $\hat T^{(0)}_j = \min\big\{\frac{(1-\hat\pi) f_{0,\sigma_j}(x_j)}{\hat f_{\sigma_j}(x_j)},\, 1\big\}$, $\hat\pi$ is the estimated non-null proportion, and $\hat f_{\sigma_j}(x_j)$ is computed using (1.3.10).

2. Compute $\hat f^{(1)}_{1,\sigma_i}(x_i)$ via (1.3.12) with updated weights $\hat w^{(1)}_j = 1 - \hat T^{(1)}_j$, where
\[ \hat T^{(1)}_j = \frac{(1-\hat\pi) f_{0,\sigma_j}(x_j)}{(1-\hat\pi) f_{0,\sigma_j}(x_j) + \hat\pi \hat f^{(0)}_{1,\sigma_j}(x_j)}. \]

This leads to our final estimate of $T_i = P(H_0 \mid x_i, \sigma_i)$:
\[ \hat T_i = \hat T^{(2)}_i = \frac{(1-\hat\pi) f_{0,\sigma_i}(x_i)}{(1-\hat\pi) f_{0,\sigma_i}(x_i) + \hat\pi \hat f^{(1)}_{1,\sigma_i}(x_i)}. \]
In the next section, we carry out a detailed theoretical analysis to show that both $\hat f_{\sigma_i}(x_i)$ and $\hat T_i$ are consistent estimators, with $E\|\hat f_{\sigma_i} - f_{\sigma_i}\|^2 = E\int \{\hat f_{\sigma_i}(x) - f_{\sigma_i}(x)\}^2\, dx \to 0$ and $\hat T_i \overset{P}{\to} T_i$, uniformly for all $i$.

To implement the oracle rule (1.3.8), we need to estimate the optimal threshold $t^*$, which can be found by carrying out the following simple stepwise procedure.

Procedure 1.3.2 (data-driven HART procedure). Rank hypotheses by increasing order of $\hat T_i$. Denote the sorted ranking statistics $\hat T_{(1)} \le \cdots \le \hat T_{(m)}$ and $H_{(1)}, \dots, H_{(m)}$ the corresponding hypotheses. Let
\[ k = \max\Big\{j: \frac{1}{j}\sum_{i=1}^{j} \hat T_{(i)} \le \alpha\Big\}. \]
Then reject the corresponding ordered hypotheses $H_{(1)}, \dots, H_{(k)}$.

The idea of the above procedure is that if the first $j$ hypotheses are rejected, then the moving average $\frac{1}{j}\sum_{i=1}^{j}\hat T_{(i)}$ provides a good estimate of the false discovery proportion, which is required to fulfill the FDR constraint. Comparing with the oracle rule (1.3.8), Procedure 1.3.2 can be viewed as its plug-in version:
\[ \boldsymbol\delta_{dd} = \{\mathbb I(\hat T_i \le \hat t): 1\le i\le m\}, \quad \text{where } \hat t = \hat T_{(k)}. \tag{1.3.13} \]
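The step-wise cutoff of Procedure 1.3.2 is simple to implement once the ranking statistics are available; the following is a minimal hedged sketch (illustrative names and values, not the thesis code).

```python
import numpy as np

def hart_stepwise(t_hat, alpha=0.10):
    """
    Step-wise thresholding of Procedure 1.3.2: reject the k hypotheses with the
    smallest ranking statistics, where k is the largest j such that the running
    average of the j smallest T-hat values stays at or below alpha.
    Returns a boolean rejection vector aligned with t_hat.
    """
    t_hat = np.asarray(t_hat, dtype=float)
    order = np.argsort(t_hat)
    running_avg = np.cumsum(t_hat[order]) / np.arange(1, len(t_hat) + 1)
    below = running_avg <= alpha
    reject = np.zeros(len(t_hat), dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0]) + 1   # number of rejections
        reject[order[:k]] = True
    return reject

# Usage sketch with made-up ranking statistics:
t_hat = np.array([0.01, 0.40, 0.05, 0.90, 0.12, 0.03])
print(hart_stepwise(t_hat, alpha=0.10))
```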
The theoretical properties of Procedure 1.3.2 are studied in the next section.

1.3.3 Theoretical properties of HART

In Section 1.3.1 we have shown that the (full data) oracle rule $\boldsymbol\delta_{\mathrm{full}}$ (1.3.8) is valid and optimal for FDR analysis. This section discusses the key theoretical result, Theorem 1.3.3, which shows that the performance of $\boldsymbol\delta_{\mathrm{full}}$ can be achieved by its finite sample version $\boldsymbol\delta_{dd}$ (1.3.13) when $m \to \infty$. Inspecting (1.3.13), the main steps involve showing that both $\hat T_i$ and $\hat t$ are "close" to their oracle counterparts. To ensure good performance of the proposed procedure, we require the following conditions.

(C1) $\mathrm{supp}(g_\sigma) \subset (M_1, M_2)$ and $\mathrm{supp}(g_\mu) \subset (-M, M)$ for some $M_1 > 0$, $M_2 < \infty$, $M < \infty$.

(C2) $h_x = o\{(\log m)^{-1}\}$.

(C3) $\lim_{m\to\infty} m h_x h_\sigma^2 = \infty$, $\lim_{m\to\infty} m^{1-\epsilon} h_\sigma h_x^2 = \infty$ and $\lim_{m\to\infty} m^{-\epsilon/2} h_\sigma^{-2} h_x^{-1} = 0$ for some $\epsilon > 0$.

(C4) $\hat\pi \overset{p}{\to} \pi$.

Remark 2. Condition (C2) is standard in density estimation theory; see for example Brown and Greenshtein (2009) and Silverman (1986). Condition (C3) is satisfied for most popular choices of $h_x$, for example in Wand and Jones (1994), where $h_x \asymp m^{-1/5}$ minimizes the mean integrated squared error (MISE). Jin-Cai's estimator (Jin and Cai, 2007) fulfills Condition (C4) in a wide class of mixture models.

Our theory is divided into two parts. The next proposition establishes the theoretical properties of the proposed density estimator $\hat f_\sigma$ and the plug-in statistic $\hat T_i$. The convergence of $\hat t$ to $t^*$ and the asymptotic properties of $\boldsymbol\delta_{dd}$ are established in Theorem 1.3.3.

Proposition 1.3.2. Suppose Conditions (C1) to (C4) hold. Then
\[ E\|\hat f_\sigma - f_\sigma\|^2 = E\int \{\hat f_\sigma(x) - f_\sigma(x)\}^2\, dx \to 0, \]
where the expectation $E$ is taken over $(\boldsymbol X, \boldsymbol\mu, \boldsymbol\sigma)$. Further, we have $\hat T_i \overset{P}{\to} T_i$.

Next we turn to the performance of our data-driven procedure $\boldsymbol\delta_{dd}$ when $m \to \infty$. A key step in the theoretical development is to show that $\hat t \overset{P}{\to} t^*$, where $\hat t$ and $t^*$ are defined in (1.3.13) and (1.3.8), respectively.

Theorem 1.3.3. Under the conditions in Proposition 1.3.2, both the mFDR and FDR of $\boldsymbol\delta_{dd}$ are controlled at level $\alpha + o(1)$, and $\mathrm{ETP}_{\boldsymbol\delta_{dd}}/\mathrm{ETP}_{\boldsymbol\delta_{\mathrm{full}}} = 1 + o(1)$.

In combination with Theorem 1.3.1, these results demonstrate that the proposed finite sample HART procedure (Procedure 1.3.2) is asymptotically valid and optimal.

1.4 Simulation

We first describe the implementation of HART in Section 1.4.1. Section 1.4.2 presents results for the general setting where $\sigma_i$ comes from a continuous density function. In Section 1.4.3, we further investigate the effect of heterogeneity under a mixture model where $\sigma_i$ takes on one of two distinct values.

1.4.1 Implementation of HART

The accurate estimation of $\hat T_i$ is crucial for ensuring good performance of the HART procedure. The key quantity is the bivariate kernel density estimator $\hat f_{1,\sigma}(x)$, which depends on the choice of tuning parameters $h = (h_x, h_\sigma)$. Note that the ranking and selection process in Procedure 1.3.2 only involves small $\hat T_i$. To improve accuracy, the bandwidth should be chosen based on the pairs $(x_i, \sigma_i)$ that are less likely to come from the null. We first implement Jin and Cai's method (Jin and Cai, 2007) to estimate the overall proportion of non-nulls in the data, denoted $\hat\pi$. We then compute $h_x$ and $h_\sigma$ by applying Silverman's rule of thumb (Silverman, 1986) to the subset of the observations $\{x_i: P_i < \hat\pi\}$. When implementing HART, a jackknifed method is recommended. Specifically, we first estimate $f_\sigma(x)$ using the data without $(X_i, S_i)$, and then plug in the unused data $(X_i, S_i)$ to calculate $\hat T_i$. This jackknifed method tends to be more conservative. As shown in the proof of Proposition 1.3.2, the asymptotic properties of $\hat T_i$ hold for both the regular and jackknifed approaches.

1.4.2 Comparison in general settings

We consider simulation settings according to Models 1.2.1 and 1.2.2, where the $\sigma_i$ are uniformly generated from $U[1, \sigma_{\max}]$. We then generate $X_i$ from the two-component normal mixture model
\[ X_i \mid \sigma_i \;\overset{\text{iid}}{\sim}\; (1-\pi) N(0, \sigma_i^2) + \pi N(2, \sigma_i^2). \]
In the first setting, we fix $\sigma_{\max} = 3$ and vary $\pi$ from 0.05 to 0.15. In the second setting, we fix $\pi = 0.1$ and vary $\sigma_{\max}$ from 2.5 to 3.5. Four methods are compared: the ideal full data oracle procedure (OR), the z-value oracle procedure of Sun and Cai (2007) (ZOR), the Benjamini-Hochberg procedure (BH) and the proposed data-driven HART procedure (DD). The nominal FDR level is set to $\alpha = 0.1$. For each setting, the number of tests is $m = 20{,}000$, and each simulation is run over 100 repetitions. The FDR is then estimated as the average of the false discovery proportion $\mathrm{FDP}(\boldsymbol\delta) = \sum_{i=1}^m \{(1-\theta_i)\delta_i\}/(\sum_{i=1}^m \delta_i \vee 1)$, and the average power is estimated as the average proportion of true positives that are correctly identified, $\sum_{i=1}^m (\theta_i\delta_i)/(m\pi)$, both over the number of repetitions. The results for differing values of $\pi$ and $\sigma_{\max}$ are respectively displayed in the first and second rows of Figure 1.3.
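For readers who wish to reproduce a rough version of Setting 1, here is a hedged sketch (illustrative code, not the thesis implementation) that generates one replication and evaluates the FDP and power of the BH baseline; the oracle and data-driven HART rules would be evaluated in the same way.

```python
import numpy as np
from scipy.stats import norm

def simulate_setting1(m=20000, pi=0.10, sigma_max=3.0, seed=0):
    """Setting 1 data: sigma_i ~ U[1, sigma_max],
    X_i | theta_i ~ (1 - pi) N(0, sigma_i^2) + pi N(2, sigma_i^2)."""
    rng = np.random.default_rng(seed)
    sigma = rng.uniform(1.0, sigma_max, m)
    theta = rng.random(m) < pi
    x = rng.normal(np.where(theta, 2.0, 0.0), sigma)
    return x, sigma, theta

def bh(pvals, alpha=0.10):
    """Benjamini-Hochberg step-up rule (baseline comparison)."""
    m = len(pvals)
    order = np.argsort(pvals)
    ok = pvals[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if ok.any():
        reject[order[: np.max(np.nonzero(ok)[0]) + 1]] = True
    return reject

# One replication: FDP and true-positive proportion of BH on standardized
# two-sided p-values; averaging over 100 replications gives the FDR and
# average power curves reported for BH in Figure 1.3.
x, sigma, theta = simulate_setting1()
reject = bh(2 * norm.sf(np.abs(x / sigma)), alpha=0.10)
fdp = ((~theta) & reject).sum() / max(reject.sum(), 1)
power = (theta & reject).sum() / theta.sum()
print(f"FDP = {fdp:.3f}, power = {power:.3f}")
```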
[Figure 1.3: Performance comparison, general setting (panels: (a) FDR comparison varying π; (b) power comparison varying π; (c) FDR comparison varying σ_max; (d) power comparison varying σ_max). Comparison when $\sigma_i$ is generated from a uniform distribution. We vary $\pi$ in the top row and $\sigma_{\max}$ in the bottom row. All methods control the FDR at the nominal level. DD has roughly the same FDR but higher power compared to ZOR in all settings.]

Next we discuss some important patterns of the plots and provide interpretations. Panel (a) of Figure 1.3 shows that all methods appropriately control the FDR at the nominal level, with BH being slightly conservative. Panel (b) illustrates the advantage of the proposed HART procedure over existing methods. When $\pi$ is small, the power of OR can be 40% higher than that of ZOR. This shows that exploiting the structural information in the variance can be extremely beneficial. DD has lower power compared to OR due to the inaccuracy in estimation. However, DD still dominates ZOR and BH in all settings. We can also see that ZOR dominates BH and the efficiency gain increases as $\pi$ increases.

To explain the power gain of ZOR over BH, let $\pi_+$ and $\pi_-$ denote the proportions of true positive signals and true negative signals, respectively. Then $\pi_+ = \pi$ and $\pi_- = 0$. This asymmetry can be captured by ZOR, which uses a one-sided rejection region. By contrast, BH adopts a two-sided symmetric rejection region. Under the setting being considered, the power loss due to the conservativeness of BH is essentially negligible, whereas the failure to capture important structural information in the alternative accounts for most of the power loss.

From the second row of Figure 1.3, we can again see that all methods control the FDR at the nominal level. OR dominates the other three methods in all settings. DD is less powerful than OR but has a clear advantage over ZOR, with slightly lower FDR and higher power.

1.4.3 Comparison under a two-group model

To illustrate the heteroscedasticity effect more clearly, we conduct a simulation using a simpler model where $\sigma_i$ takes on one of two distinct values. The goal is to illustrate that the heterogeneity adjustment is more useful when there is greater variation in the standard deviations among the testing units. Consider the setup in Models 1.2.1 and 1.2.2. We first draw $\sigma_i$ randomly from two possible values $\{\sigma_a, \sigma_b\}$ with equal probability, and then generate $X_i$ from the two-point normal mixture model $X_i \mid \sigma_i \overset{\text{iid}}{\sim} (1-\pi)N(0, \sigma_i^2) + \pi N(\mu, \sigma_i^2)$. In this simpler setting, it is easy to show that HART reduces to the CLfdr method in Cai and Sun (2009), where the conditional Lfdr statistics are calculated for separate groups defined by $\sigma_a$ and $\sigma_b$.

As previously, we apply BH, ZOR, OR and DD to data with $m = 20{,}000$ tests, and the experiment is repeated on 100 data sets. We fix $\pi = 0.1$, $\mu = 2.5$, $\sigma_a = 1$ and vary $\sigma_b$ from 1.5 to 3. The FDRs and powers of the different methods are plotted as functions of $\sigma_b$, with results summarized in the first row of Figure 1.4. In the second row, we plot the group-wise z-value cutoffs and group-wise powers as functions of $\sigma_b$ for the DD method.

We can see that DD has almost identical performance to OR, and the power gain over ZOR becomes more pronounced as $\sigma_b$ increases. This is intuitive, because more variation in $\sigma$ tends to lead to more information loss in standardization. The bottom row shows the z-value cutoffs for ZOR and DD for each group. We can see that, in comparison to ZOR, which uses a single z-value cutoff, HART uses different cutoffs for each group. The z-value cutoff is bigger for the group with larger variance, and the gap between the two cutoffs increases as the degree of heterogeneity increases. In Panel (d), we can see that the power of Group b decreases as $\sigma_b$ increases. These interesting patterns corroborate those we observed in our toy example in Section 1.2.2.

[Figure 1.4: Performance comparison, two-group setting (panels: (a) FDR comparison varying σ_b; (b) power comparison varying σ_b; (c) z-value cutoffs for the two groups and ZOR; (d) power comparison for the two groups). Two groups with $\sigma_b$ varying from 1.5 to 3. As $\sigma_b$ increases, the cutoff for group a decreases whereas the cutoff for group b increases. The power for tests in group b drops quickly as $\sigma_b$ increases. This corroborates our calculations in the toy example in Section 1.2.2 and the patterns revealed by Figure 1.1.]

1.5 Data Analysis

This section compares the adaptive z-value procedure (AZ, the data-driven implementation of ZOR; Sun and Cai, 2007), BH, and HART on a microarray data set.
The data set measures expression levels of 12,625 genes for patients with multiple myeloma: 36 patients for whom magnetic resonance imaging (MRI) detected focal lesions of bone (lesions), and 137 patients for whom MRI scans could not detect focal lesions of bone (without lesions) (Tian et al., 2003). For each gene, we calculate the differential gene expression level ($X_i$) and its standard error ($S_i$). The FDR level is set at $\alpha = 0.1$.

We first address two important practical issues. The first issue is that the theoretical null $N(0, 1)$ (red curve on the left panel of Figure 1.5) is much narrower than the histogram of z-values. Efron (2004) argued that a seemingly small deviation from the theoretical z-curve can lead to severely distorted FDR analyses. For this data set, an analysis based on the theoretical null would inappropriately reject too many hypotheses, resulting in a very high FDR. To address the distortion of the null, we adopted the empirical null approach (Efron, 2004) in our analysis. Specifically, we first used the middle part of the histogram, which contains 99% of the data, to estimate the null distribution as $N(0, 1.30^2)$ [see Efron (2004) for more details]. The new p-values are then converted from the z-values based on the estimated empirical null: $P_i = 2\tilde\Phi(-|Z_i|)$, where $\tilde\Phi$ is the CDF of a $N(0, 1.30^2)$ variable.
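The conversion from z-values to p-values under the estimated empirical null can be sketched as follows; this is a hedged illustration with made-up z-values, where the 1.30 standard deviation is the estimate reported above.

```python
import numpy as np
from scipy.stats import norm

def empirical_null_pvalues(z, null_sd=1.30):
    """Two-sided p-values against an empirical null N(0, null_sd^2),
    i.e. P_i = 2 * Phi_tilde(-|Z_i|) with Phi_tilde the N(0, null_sd^2) CDF."""
    return 2 * norm.cdf(-np.abs(z), loc=0.0, scale=null_sd)

# Illustrative z-values: against the theoretical null N(0, 1) the larger ones
# look highly significant; the wider empirical null tempers that assessment.
z = np.array([0.5, 2.0, 3.0, 4.0])
print(np.round(2 * norm.cdf(-np.abs(z)), 5))          # theoretical null N(0, 1)
print(np.round(empirical_null_pvalues(z, 1.30), 5))   # empirical null N(0, 1.3^2)
```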
The second issue is the estimation of f_σ(x), which usually requires a relatively large sample size to ensure good precision. Figure 1.6 presents the histogram of S_i and the scatter plot of S_i versus Z_i.

[Figure 1.6: Histogram of S_i (left); scatter plot of (Z_i, S_i) (right).]

Based on the histogram, we propose to focus only on data points with S_i less than 1 (12,172 out of 12,625 genes are kept in the analysis) to ensure the estimation accuracy of T̂_i. Compared to conventional approaches, there is no efficiency loss because no hypothesis with S_i > 1 is rejected by BH at α = 0.1; note that the BH p-value cutoff is 6 × 10⁻⁵, which corresponds to a z-value cutoff of 5.22; see also Figure 1.7. (If BH rejects hypotheses with large S_i, we recommend carrying out a group-wise FDR analysis, which first tests hypotheses at α in separate groups and then combines the testing results, as suggested by Efron (2008a).)

Finally, we apply BH, AZ and HART to the data points with S_i < 1. BH uses the new p-values P_i based on the estimated empirical null N(0, 1.3²). Similarly, AZ uses Lfdr statistics where the null is taken as the density of a N(0, 1.3²) variable. When implementing HART, we estimate the non-null proportion using Jin and Cai's method with the empirical null taken as N(0, 1.3²). We further employ the jackknifed method to estimate f_σ(x) by following the steps in Section 1.4.1. We summarize the number of rejections by each method in Table 1.1 and display the testing results in Figure 1.7, where the hypotheses rejected by each method are marked in different colors. HART rejects more hypotheses than BH and AZ.

Table 1.1: Numbers of genes (% of total) that are selected by each method.
  α-level   BH          AZ          HART
  0.1       8 (0.07%)   25 (0.2%)   122 (1%)

The numbers should be interpreted with caution, as BH and AZ have employed the empirical null N(0, 1.3²) whereas HART has utilized the null density N(0, σ_i²) conditioned on the individual σ_i; it remains an open issue how to extend the empirical null approach to the heteroscedastic case. Since we do not know the ground truth, it is difficult to assess the power gains. However, the key point of this analysis, and the focus of our paper, is to compare the shapes of the rejection regions to gain some insights into the differences between the methods. It can be seen that for this data set the rejection rules of BH and AZ depend only on Z_i. By contrast, the rejection region for HART depends on both Z_i and S_i. HART rejects more z-values when S_i is small compared to BH and AZ. Moreover, HART does not reject any hypothesis when S_i is large. This pattern is consistent with the intuitions we gleaned from the illustrative example (Figure 1.1) and the results we observed in the simulation studies (Figure 1.4, Panel c).

[Figure 1.7: Scatter plots of rejected hypotheses by each method. Green: BH; blue: AZ; red: HART. AZ and BH reject every hypothesis to the right of the dashed line. The rejection region for HART depends on both z and σ.]

1.6 Discussion

1.6.1 Multiple testing with side information

Multiple testing with side or auxiliary information is an important topic that has received much attention recently. The research directions are wide-ranging as there are various types of side information, which may be either extracted from the same data using carefully constructed auxiliary sequences or gleaned from secondary data sources such as prior studies, domain-specific knowledge, structural constraints and external covariates.
The recent works by Xia et al. (2019), Li and Barber (2019) and Cai et al. (2019) have focused on utilizing side information that encodes the sparsity structure. By contrast, our work investigates the impact of the alternative distribution, showing that incorporating σ_i can be extremely useful for improving the ranking and power in multiple testing. In the context of FDR analysis, the key issue is that the hypotheses become unequal in light of side information. Efron (2008a) argued that ignoring the heterogeneity among study units may lead to FDR rules that are inefficient, noninterpretable and even invalid. We discuss two lines of work to further put our main contributions in context and to guide future research developments.

Grouping, pioneered by Efron (2008a), provides an effective strategy for capturing the heterogeneity in the data. Cai and Sun (2009) showed that the power of FDR procedures can be much improved by utilizing new ranking statistics adjusted for grouping. Recent works along this direction, including Liu et al. (2016), Barber and Ramdas (2017) and Sarkar and Zhao (2017), develop general frameworks for dealing with a class of hierarchical and grouping structures. However, the groups can be characterized in many ways and the optimal grouping strategy still remains unknown. Moreover, discretizing a continuous covariate by grouping leads to a loss of information. HART directly incorporates σ_i into the ranking statistic and hence eliminates the need to define groups.

Weighting is another widely used strategy for incorporating side information into FDR analyses (Benjamini and Hochberg, 1997; Genovese et al., 2006; Roquain and Van De Wiel, 2009; Basu et al., 2018). For example, when the sparsity structure is encoded by a covariate sequence, weighted p-values can be constructed to up-weight the tests at coordinates where signals appear to be more frequent (Hu et al., 2012; Xia et al., 2019; Li and Barber, 2019). However, the derivation of weighting functions for directly incorporating heteroscedasticity seems to be rather complicated (Peña et al., 2011; Habiger et al., 2017). Notably, Habiger (2017) developed novel weights for p-values as functions of a class of auxiliary parameters, including σ_i as a special case, for a generic two-group mixture model. However, the formulation is complicated and the weights are hard to compute: the methodology requires handling the derivative of the power function, estimating several unknown quantities and tuning a host of parameters.

The works by Ignatiadis et al. (2016) and Lei and Fithian (2018) suggest that heterogeneous variances can be incorporated into inference as one type of side information. [Similar ideas were also included in earlier works by Efron (2008a) and Cai and Sun (2009) in their analyses of the school data.] However, because the variance issue is not the focus of their work, the discussions in Ignatiadis et al. (2016) and Lei and Fithian (2018) are very brief and it is unclear to us how to implement their proposals. They did not provide numerical evidence to illustrate whether or why incorporating the variances is beneficial. To the best of our knowledge, our work is the first to provide intuitive illustrations, clear evidence and novel insights on why heteroscedasticity is useful in large-scale inference (Section 2.2). The message is crucial as standardization has been a common practice in multiple testing.
In contrast with existing works, HART provides a simple and powerful approach for adjusting heteroscedasticity with proven validity and optimality properties.

1.6.2 Open issues and future directions

We conclude the chapter by discussing several open issues. First, HART works better for large-scale problems where the density with heteroscedastic errors can be well estimated. For problems with several hundred tests or fewer, p-value based algorithms such as BH or the WAMDF approach (Habiger, 2017) are more suitable. Another promising direction for dealing with smaller-scale problems, suggested by Castillo and Roquain (2018), is to employ spike and slab priors to produce more stable empirical Bayes estimates (with frequentist guarantees under certain conditions). Second, in practice the model given by (1.2.2) can be extended to

μ_i | σ_i ∼ (1 − π(σ_i)) δ_0(·) + π(σ_i) g_μ(· | σ_i),  σ_i² ∼ iid g(·),   (1.6.1)

where both the sparsity level and the distribution of non-null effects depend on σ_i; this setting has been considered in a closely related work by Weinstein et al. (2018). The heterogeneity-adjusted statistic is then given by

T_i = P(θ_i = 0 | x_i, σ_i) = (1 − π(σ_i)) f_{0,i}(x_i) / f_i(x_i),   (1.6.2)

where the varying proportion π(σ_i) indicates that σ_i also captures the sparsity structure. This is possible, for example, in applications where observations from the alternative have larger variances compared to those from the null. An interesting, but challenging, direction for future research is to develop methodologies that can simultaneously incorporate both the sparsity and heteroscedasticity structures into inference. Third, the HART-type methodology can only handle one covariate sequence {σ_i : 1 ≤ i ≤ m}. It would be of great interest to develop new methodologies and principles of information pooling for multiple testing with several covariate sequences. Finally, our work has assumed that the σ_i are known in order to illustrate the key message (i.e. the impact of the alternative distribution on the power of FDR analyses). Although this is a common practice, it is desirable to carefully investigate the impact of estimating σ_i on the accuracy and stability of large-scale inference.

1.7 Formulas for the Illustrative Example

Consider Model 1.2.8 in Section 1.2.2. We derive the formulas for the oracle p-value, oracle z-value and oracle full data procedures. δ_p corresponds to the thresholding rule I(|Z_i| > t_p), where

t_p = inf[ t > 0 : 2(1 − π)Φ̃(t) / { 2(1 − π)Φ̃(t) + π ∫ {Φ̃(t + μ_a/σ) + Φ̃(t − μ_a/σ)} dG(σ) } ≤ α ],

with Φ̃ being the survival function of the N(0, 1) variable. δ_z is a one-sided thresholding rule of the form I(Z_i > t_z), where

t_z = inf[ t > 0 : (1 − π)Φ̃(t) / { (1 − π)Φ̃(t) + π ∫ Φ̃(t − μ_a/σ) dG(σ) } ≤ α ].

δ_full is of the form I{P(θ_i = 0 | x_i, σ_i) < α}. It can be written as I{Z_i > t_{z,σ}(α)}, where

t_{z,σ}(α) = [ μ_a² + 2σ² log{(1 − π)(1 − α)/(πα)} ] / (2σμ_a).

Denote α* the optimal threshold. Hence δ_full is given by I{P(θ_i = 0 | x_i, σ_i) < α*}, where

α* = sup[ α' ∈ [0, 1] : (1 − π) ∫ Φ̃{t_{z,σ}(α')} dG(σ) / { (1 − π) ∫ Φ̃{t_{z,σ}(α')} dG(σ) + π ∫ Φ̃{t_{z,σ}(α') − μ_a/σ} dG(σ) } ≤ α ].

The optimal cutoffs can be solved numerically from the above. The powers are given by

AP(δ_p) = ∫ { Φ̃(t_p + μ_a/σ) + Φ̃(t_p − μ_a/σ) } dG(σ);
AP(δ_z) = ∫ Φ̃(t_z − μ_a/σ) dG(σ);
AP(δ_full) = ∫ Φ̃{ t_{z,σ}(α*) − μ_a/σ } dG(σ).

1.8 Proofs of Theorems

1.8.1 Proof of Theorem 1.3.1

We divide the proof into two parts. In Part (a), we establish two properties of the testing rule δ_full(t) = {I(T_i < t) : 1 ≤ i ≤ m} for an arbitrary 0 < t < 1. In Part (b) we show that the oracle rule δ_full(t*) attains the mFDR level α exactly and is optimal amongst all FDR procedures at level α.

Part (a).
Denote(t) the mFDR level of full (t). We shall show that (i)(t)<t for all 0<t< 1 and that (ii) (t) is nondecreasing in t. Note thatEf P m i=1 (1 i ) i g =E X; ( P m i=1 T i i ). According to the denition of (t), we have E X; ( m X i=1 fT i (t)gI(T i t) ) = 0: (1.8.1) We claim that (t)<t. Otherwise if (t)t, then we must have T i <t(t). It follows that the LHS must be negative, contradicting (2.8.15). Next we show (ii). Let (t j ) = j . We claim that if t 1 < t 2 , then we must have 1 2 . We argue by contradiction. Suppose that t 1 <t 2 but 1 > 2 . Then (T i 2 )I(T i <t 2 ) = (T i 1 )I(T i <t 1 ) + ( 1 2 )I(T i <t 1 ) + (T i 2 )I(t 1 T i <t 2 ) (T i 1 )I(T i <t 1 ) + ( 1 2 )I(T i <t 1 ) + (T i 1 )I(t 1 T i <t 2 ): It follows thatEf P m i=1 (T i 2 )I(T i <t 2 )g> 0 sinceEf P m i=1 (T i 1 )I(T i <t 1 )g = 0 according to (2.8.15), 1 > 2 and T i t 1 > 1 , contradicting (2.8.15). Hence we must have 1 < 2 : Part (b). Let =(1). In Part (a), we show that(t) is non{decreasing int. It follows that for all < , there exists a t such that t = supft :(t ) =g. By denition, t is the oracle threshold. Consider an arbitrary decision ruled = (d 1 ;:::;d m )2f0; 1g m such that mFDR(d). We have 25 E n P m i=1 (T i ) full i o = 0 andEf P m i=1 (T i )d i g 0: Hence E ( m X i=1 ( full i d i )(T i ) ) 0: (1.8.2) Consider transformation f(x) = (x)=(1x). Note that f(x) is monotone, we rewrite full i = I [f(T i )=(1T i )g<], where = (t )=(1t ). In Part (a) we have shown that<t OR < 1, which implies that > 0. Hence E " m X i=1 ( full i d i )f(T i )(1T i )g # 0: (1.8.3) To see this, consider the terms where full i d i 6= 0. Then we have two situations: (i) full i > d i or (ii) full i < d i . In situation (i), full i = 1, implying thatf(T i )=(1T i )g < . In sit- uation (ii), full i = 0, implying thatf(T i )=(1T i )g . Therefore we always have ( full i d i )f(T i )(1T i )g 0. Summing over them terms and taking the expectation yield (2.8.18). Combining (2.8.17) and (2.8.18), we obtain 0E ( m X i=1 ( full i d i )(T i ) ) E ( m X i=1 ( full i d i )(T i ) ) : Finally, since > 0, it follows that E n P m i=1 ( full i d i )(T i ) o > 0. Finally, we apply the denition of ETP to conclude that ETP( full ) ETP(d) for alld2D . 1.8.2 Proof of Theorem 1.3.3 We begin with a summary of notation used throughout the proof: Q(t) =m 1 P m i=1 (T i )IfT i <tg. b Q(t) =m 1 P m i=1 ( ^ T i )If ^ T i <tg. Q 1 (t) =Ef(T OR )IfT OR <tgg. t 1 = supft2 (0; 1) :Q 1 (t) 0g: the \ideal" threshold. For T (k) OR <t<T (k+1) OR , dene a continuous version of b Q(t) as b Q C (t) = t b T (k) OR b T (k+1) OR b T (k) OR b Q k + b T (k+1) OR t b T (k+1) OR b T (k) OR b Q k+1 ; 26 where b Q k = b Q b T (k) OR . Since b Q C (t) is continuous and monotone, its inverse b Q 1 C is well{dened, continuous and monotone. Next we show the following two results in turn: (i) b Q(t) p ! Q 1 (t) and (ii) b Q 1 C (0) p !t 1 . To show (i), note that Q(t) p ! Q 1 (t) by the WLLN, so that we only need to establish that b Q(t)Q(t) p ! 0. We need the following lemma, which is proven in Section 1.10. Lemma 1.8.1. Let U i = (T i )I(T i <t) and b U i = ( ^ T i )If ^ T i <tg. ThenE b U i U i 2 =o(1). By Lemma 1.8.1 and Cauchy-Schwartz inequality,E n b U i U i b U j U j o =o(1). Let S m = P m i=1 b U i U i . It follows that Var m 1 S m m 2 m X i=1 E b U i U i 2 +O 0 @ 1 m 2 X i;j:i6=j E n b U i U i b U j U j o 1 A =o(1): By Proposition 1.3.2,E(m 1 S m )! 0, applying Chebyshev's inequality, we obtainm 1 S m = b Q(t) Q(t) p ! 0. Hence (i) is proved. 
Notice that Q 1 (t) is continuous by construction, we also have b Q(t) p ! b Q C (t). Next we show (ii). Since b Q C (t) is continuous, for any " > 0, we can nd > 0 such that b Q 1 C (0) b Q 1 C n b Q C (t 1 ) o <" if b Q C (t 1 ) <. It follows that P n b Q C (t 1 ) > o P n b Q 1 C (0) b Q 1 C n b Q C (t 1 ) o >" o : Proposition 1.3.2 and the WLLN imply that b Q C (t) p ! Q 1 (t): Note that Q 1 (t 1 ) = 0. Then P b Q C (t 1 ) > ! 0: Hence we have b Q 1 C (0) p ! b Q 1 C n b Q C (t 1 ) o = t 1 , completing the proof of (ii). To show FDR( dd ) = FDR( full ) + o(1) = + o(1), we only need to show mFDR( dd ) = mFDR( full ) +o(1). The result then follows from the asymptotic equivalence of FDR and mFDR, which was proven in Cai et al. (2019). Dene the continuous version ofQ(t) asQ C (t) and the corresponding threshold asQ 1 C (0). Then by construction, we have dd = h I n ^ T i b Q 1 C (0) o : 1im i and full = I T i Q 1 C (0) : 1im : Following the previous arguments, we can show that Q 1 C (0) p ! t 1 . It follows that b Q 1 C (0) = 27 Q 1 C (0) +o p (1): By construction mFDR( full ) =. The mFDR level of dd is mFDR( dd ) = P H0 n ^ T i b Q 1 C (0) o P n ^ T i b Q 1 C (0) o : From Proposition 2, ^ T i p !T i . Using the continuous mapping theorem, mFDR dd = mFDR full + o(1) = +o(1). The desired result follows. Finally, using the fact that ^ T i p !T i and b Q 1 C (0) p !Q 1 C (0), we can similarly show that ETP( dd )=ETP( full ) = 1 +o(1): 1.9 Proof of Proposition 1.3.2 Summary of notation The following notation will be used throughout the proofs: ^ f (x) = P m j=1 h ( j ) P m i=1 h ( i ) hxj (xx j ). ^ f 1; (x) = P m j=1 h ( j )I( j = 1) P m i=1 h ( i )I( i = 1) hxj (xx j ). ~ f 1; (x) = P m j=1 h ( j )P ( j = 1jx j ; j ) P m i=1 h ( i )P ( i = 1jx i ; i ) hxj (xx j ). ^ f 1; (x) = P m j=1 h ( j ) ^ P ( j = 1jx j ; j ) P m i=1 h ( i ) ^ P ( i = 1jx i ; i ) hxj (xx j ). ^ f (x) = (1 ^ )f 0; (x) + ^ ^ f 1; (x). The basic idea is that a consistent one{step estimator constructed via ^ f (x) leads to a consistent two{step estimator via ^ f (x). By Condition (C4) and the triangle inequality, it is sucient to show that E Z n ^ f 1; (x)f 1; (x) o 2 dx! 0: (1.9.4) Let u j = h ( j ) P m i=1 h ( i ) . A direct consequence of condition (C1) is 0< C 1 m Eu j C 2 m <1 for some positive constants C 1 and C 2 . Let C 0 = min(1;C 1 ).Consider event A = ( m X i=1 j m < C 0 2 m ) : (1.9.5) 28 By Hoeding's inequality and Condition (C2),P (A C )O(h 2 x ) exp(C 02 m=2)O(h 2 x )! 0: There- fore it suces to prove (1.9.4) underA. We establish the result in three steps: 1. E R f ^ f 1; (x)f 1; (x)g 2 dx! 0: 2. E R f ~ f 1; (x) ^ f 1; (x)g 2 dx! 0: 3. E R f ^ f 1; (x) ~ f 1; (x)g 2 dx! 0. The proposition then follows from the triangle inequality. 1.9.1 Proof of Step (a) Let b j = h ( j )I( j = 1) P m i=1 h ( i )I( i = 1) . It is easy to show that n ^ f 1; (x)f 1; (x) o 2 = m X j=1 m X k=1 b j b k f hx k (xx k )f 1; (x)g hxj (xx j )f 1; (x) : Under condition (C1) and eventA, we have E(b j b k ) = O(m 2 ). Using standard arguments in density estimation theory (e.g. Wand and Jones (1994) page 21), and the fact thatE P m j=1 (b j ) 2 = O(m 1 h 1 ); (Proof of Lemma 4 in Fu et al. (2018)) we haveE R f ^ f 1; (x)f 1; (x)g 2 dx =O (mh h x ) 1 +h 4 x : Under condition (C2) and (C3) the RHS! 0, establishing Step (a). 1.9.2 Proof of Step (b) Let b j = h ( j )P ( j = 1jx j ; j ) P m i=1 h ( i )P ( i = 1jx i ; i ) . 
Then n ~ f 1; (x) ^ f 1; (x) o 2 = m X j=1 (b j b j ) 2 2 hxj (xx j ) + X (j;k):j6=k (b j b j )(b k b k ) hxj (xx j ) hx k (xx k ): (1.9.6) We rst boundE(b j b j ) 2 . WriteE(b j b j ) 2 =fE(b j b j )g 2 +Var(b j b j ). It is clear that E(b j ) andE(b j ) are both O(m 1 ). HencefE(b j b j )g 2 =O(m 2 ). Next consider Var(b j b j ) = Var(b j ) +Var(b j ) 2Cov(b j ;b j ). We have by condition (C3) Var(b j ) =Var I( j = 1) h ( j ) P m i=1 h ( i )I( i = 1) E(b j ) 2 =O(m 2 h 1 ): SimilarlyVar(b j ) =O(m 2 h 1 ): It follows thatCov(b j ;b j ) =O(m 2 h 1 ): ThereforeVar(b j b j ) = O(m 2 h 1 ) andE(b j b j ) 2 = O(m 2 h 1 ). Using the fact that R 2 hxj (xx j )dx = O(h 1 x ), we 29 have Z E m X j=1 (b j b j ) 2 2 hxj (xx j )dx =Of(mh x h ) 1 g! 0: (1.9.7) Next we boundEf(b j b j )(b k b k )g for j6=k. Consider the decomposition Ef(b j b j )(b k b k )g =E(b j b j )E(b k b k ) +Cov(b j b j ;b k b k ): (1.9.8) Our goal is to show thatEf(b j b j )(b k b k )g =O(m 3 h 2 ) +O(m 4 h 4 ). It suces to show E j;x j;x j;x f(b j b j )(b k b k )g =O(m 3 h 2 ) +O(m 4 h 4 ): (1.9.9) Observe that Var n 1 mh 1 P m i=1 h ( j )I( i = 1)j;x ;x ;x o =O(m 1 ) and E j;x j;x j;x ( 1 mh 1 m X i=1 h ( j )I( i = 1) ) = 1 mh 1 m X i=1 h ( j )P ( i = 1j i ;x i ): Applying Chebyshev's inequality, 1 mh 1 m X i=1 h ( j )I( i = 1) 1 mh 1 m X i=1 h ( j )P ( i = 1j i ;x i ) p ! 0: It follows that for any > 0, P ( m X i=1 h ( j )I( i = 1) m X i=1 h ( j )P ( i = 1j i ;x i ) <mh 1 ) ! 1: UnderA dened in (1.9.5), we have P m i=1 h ( j )I( i = 1)>h 1 C 3 m for some C 3 , and P ( m X i=1 h ( j )P ( i = 1j i ;x i )<h 1 C 3 m ) ! 0: (1.9.10) The boundedness of b j and b j and (2.7.10) imply that we only need to prove (1.9.9) on the event A = (x x x; ) : P m i=1 h ( j )P ( i = 1j i ;x i )h 1 C 3 m : We shall considerE j;x j;x j;x (b j b j ) and 30 Cov(b j b j ;b k b k j ;x x x) in turn. Write E j;x j;x j;x (b j ) =E j;x j;x j;x I( j = 1) h ( j ) P m i=1 h ( i )I( i = 1) =P ( j = 1jx j ; j )E j;x j;x j;x h ( j ) P m i=1 h ( i )I( i = 1) +Cov j ; h ( j ) P m i=1 h ( i )I( i = 1) ;x ;x ;x : Let Y = h ( j ) P m i=1 h ( i )I( i = 1) . We state three lemmas that are proven in Section 1.10. Lemma 1.9.1. Under eventA , we haveE j;x j;x j;x (Y )E j;x j;x j;x (Yj j = 1) =O(m 2 ) andE j;x j;x j;x (Y ) E j;x j;x j;x (Yj j = 0) =O(m 2 h 1 ). Lemma 1.9.2. Under eventA , we have E j;x j;x j;x h ( j ) P m i=1 h ( i )I( i = 1) = h ( j ) P m i=1 h ( i )P ( i = 1jx i ; i ) +O(m 2 h 2 ): Lemma 1.9.3. Under eventA , we have Cov(b j b j ;b k b k j ;x x x) =O(m 3 h 2 ). According to Lemma 1.9.1, we have Cov j ; h ( j ) P m i=1 h ( i )I( i = 1) ;x ;x ;x (1.9.11) = Z fP ( j = 0jx j ; j )P ( j = 1jx j ; j )gfyE j;x j;x j;x (Y )gf Yjj =1;;x ;x ;x (y)dy Z f1P ( j = 1jx j ; j )g 2 fyE j;x j;x j;x (Y )gf Yjj =0;;x ;x ;x (y)dy =O(m 2 h 1 ): Together with Lemma 1.9.2, we have E j;x j;x j;x I( j = 1) h ( j ) P m i=1 h ( i )I( i = 1) = P ( j = 1jx j ; j ) h ( j ) P m i=1 h ( i )P ( i = 1jx i ; i ) +O(m 2 h 2 ): (1.9.12) It follows thatE(b j b j ) =O(m 2 h 2 ) andE(b j b j )E(b k b k ) =O(m 4 h 4 ). The decomposition (3.8.11) and Lemma 1.9.3 together imply Ef(b j b j )(b k b k )g = O(m 3 h 2 ) +O(m 4 h 4 ). It follows that Z E X (j;k):j6=k (b j b j )(b k b k ) hxj (xx j ) hxj (xx k )dx =Of(mh x h 2 ) 1 +O (mh ) 2 ! 0: (1.9.13) Combing (1.9.6), (3.8.10) and (1.9.13), we conclude thatE R f ~ f 1; (x) ^ f 1; (x)g 2 dx! 0. 
31 1.9.3 Proof of Step (c) Let q j = P ( j = 1j j ;x j ), ^ q j = ^ P ( j = 1j j ;x j ) = min ( (1 ^ )f 0;j (x j ) ^ f j (x j ) ; 1 ) and ^ f 1; (x) = P m j=1 h ( j )^ q j P m i=1 h ( i )^ q i hxj (xx j ): Write ^ q j =q j +a j , thenja j j 1and a j =o P (1). We have E Z n ^ f 1; (x) ~ f 1; (x) o 2 dx = O ( h 1 x m 2 E h ( j )^ q j P m i=1 h ( i )^ q i h ( j )q j P m i=1 h ( i )q i 2 ) = O h 1 x h 2 Ea 2 j : Next we explain why the last equality holds. Let c i = h ( i )h . Then E h ( j )^ q j P m i=1 h ( i )^ q i h ( j )q j P m i=1 h ( i )q i 2 = E a j P m i=1 h ( i )q i q j P m i=1 h ( i )a i f P m i=1 h ( i )^ q i gf P m i=1 h ( i )q i g 2 = h 2 1 m 4 O 2 4 E ( a j m X i=1 c i q i q j m X i=1 c i a i ) 2 3 5 = h 2 m 4 O 2 4 E 8 < : m 2 a 2 j 2ma j m X i=1 a i + m X i=1 a i ! 2 9 = ; 3 5 = h 2 m 2 O E a 2 j : The last line holds by noting thatE (a j a i ) q E(a 2 j )E(a 2 i ) =OfE(a 2 j )g: The next step is to bound E(a 2 j ). Note that a j = O ( f 0;j (x j )f ^ f j (x j )f j (x j )g f j (x j ) ^ f j (x j ) ) . By the construction of ^ q j , we have ^ f j (x j ) (1 ^ )f 0;j (x j ). Hence a j =O ( 1 ^ f j (x j ) f j (x j ) ) and E(a 2 j ) =O 2 4 E ( 1 ^ f j (x j ) f j (x j ) ) 2 3 5 : LetK j = j p p logmM; j p p logm +M : By the Gaussian tail bound, Pfx j 62K j g = O(m =2 ): By the boundedness ofa 2 j and the fact thath 1 x h 2 m =2 ! 0 (Condition (C3)), we only need to considerE 2 4 1 2 ^ f j (x j ) f j (x j ) + ( ^ f j (x j ) f j (x j ) ) 2 x j 3 5 for x j 2K j . Let f (x j ) = R (x)f(1) 0 (x j x) +g (x j x)gdx. Dene a jacknifed version of ^ f ;(j) j that is formed without the pair ( j ;x j ). It follows that Ef ^ f ;(j) j (x j )jx j g = Z Z p 2 +h 2 x 2 j (x)f(1) 0 (x j x) +g (x j x)gg ( j )d j dx: 32 By the intermediate value theorem and Condition (C1), Ef ^ f ;(j) j (x j )jx j g = Z p 2 +h 2 x c (x)f(1) 0 (x j x) +g (x j x)gdx for some constant c. Next consider the ratio Ef ^ f ;(j) j (x j )jx j g f j (x j ) = R p 2 +h 2 x c (x)f(1) 0 (x j x) +g (x j x)gdx R (x)f(1) 0 (x j x) +g (x j x)gdx : By Condition (C1), supp(g )2 (M;M) with M <1, we have inf xjM<x<xj +M p 2 +h 2 x c (x) (x) Ef ^ f ;(j) j (x j )jx j g f j (x j ) sup xjM<x<xj +M p 2 +h 2 x c (x) (x) : Note that x j 2 ( j p p logmM; j p p logm +M). The above inmum and supremum are taken over x2K = ( j p p logm 2M; j p p logm + 2M). Using Taylor expansion p 2 +h 2 x c (x) (x) = p 2 +h 2 x c " 1 + 1 X k=1 1 k! h 2 x cx 2 2( 2 +h 2 x c) k # ; we have inf x2K p 2 +h 2 x c (x) (x) = p 2 +h 2 x c = 1 +O(h 2 x ): Similarly, sup x2K p 2 +h 2 x c ( 1 + 1 X k=1 1 k! h 2 x cx 2 2( 2 +h 2 x c) k ) = 1 +O(h 2 x ) +O " sup 1 X k=1 1 k! h 2 x cx 2 2( 2 +h 2 x c) k # : It follows that Ef ^ f ;(j) j (x j )jx j g f j (x j ) = 1 +O(h 2 x ) +O ( sup 1 X k=1 1 k! h 2 x cx 2 2( 2 +h 2 x c) k ) ; (1.9.14) 1 2 Ef ^ f ;(j) j (x j )jx j g f j (x j ) = 1 +O(h 2 x ) +O ( sup 1 X k=1 1 k! h 2 x cx 2 2( 2 +h 2 x c) k ) : (1.9.15) Next considerE ( ^ f ;(j) j (x j ) f j (x j ) x j ) 2 = " Ef ^ f ;(j) j (x j )jx j g f j (x j ) # 2 +Var ( ^ f ;(j) j (x j ) f j (x j ) x j ) : By the same com- putation on page 21 of Wand and Jones (1994), Var ( ^ f ;(j) j (x j ) f j (x j ) x j ) =O (mh x ) 1 f j (x j ) 1 +o (mh x ) 1 f j (x j ) 2 : Since x j 2K j , f j (x j ) C 3 m =2 for some constant C 3 , together with Condition (C3), we have 33 h 1 x Var ( ^ f ;(j) j (x j ) f j (x j ) x j ) =o(1): It follows from (1.9.14) and (1.9.15) that h 1 x 2h 1 x E ( ^ f ;(j) j (x j ) f j (x j ) x j ) +h 1 x E ( ^ f ;(j) j (x j ) f j (x j ) x j ) 2 = O ( h x + sup 1 X k=1 1 k! 
h 2k1 x c k x 2k 2 k ( 2 +h 2 x c) k ) +o(1): (1.9.16) By condition (C2) and the range of x, the RHS goes to 0. Let S j = P m i=1 h ( j i ) and S j = P m i6=j h ( j i ): Some algebra shows ^ f j (x j ) = S j S j ^ f ;(j) j (x j ) + 1 2S j h h x j . We use the fact that f j (x j ) C 3 m =2 for some constant C 3 and Condition (C3) to claim that onA , h 1 x Ef ^ f j (x j )jx j g f j (x j ) =h 1 x Ef ^ f ;(j) j (x j )jx j g f j (x j ) +o(1): (1.9.17) Similar computation shows that h 1 x E ( ^ f j (x j ) f j (x j ) x j ) 2 =h 1 x E ( ^ f ;(j) j (x j ) f j (x j ) x j ) 2 +o(1): (1.9.18) (1.9.16), (1.9.17) and (1.9.18) together impliesh 1 x h 2 Efa 2 j jx j g! 0. HenceE R n ^ f 1; (x) ~ f 1; (x) o 2 dx! 0 and Step (c) is established. 1.10 Proof of Lemmas 1.10.1 Proof of lemma 1.8.1 Using the denitions of b U i and U i , we can show that b U i U i 2 = ^ T i T i 2 I ^ T i t;T i t + ^ T i 2 I ^ T i t;T i >t + (T i ) 2 I ^ T i >t;T i t : Denote the three sums on the RHS as I, II, and III respectively. By Proposition 2,E(I) =o(1). Let "> 0. Consider P ^ T i t;T i >t P ^ T i t;T i 2 (t;t +") +P ^ T i t;T i t +" PfT i 2 (t;t +")g +P (jT i T i j>") The rst term on the right hand is vanishingly small as "! 0 because b T i OR is a continuous random variable. The second term converges to 0 by Proposition 2. we conclude thatII =o(1). In a similar 34 fashion, we can show that III =o(1), thus proving the lemma. 1.10.2 Proof of lemma 1.9.1 Note thatE j;x j;x j;x (Yj j = 0)E j;x j;x j;x Y E j;x j;x j;x (Yj j = 1). It is sucient to boundE j;x j;x j;x (Yj j = 0)E j;x j;x j;x (Yj j = 1). The lemma follows by noting that E j;x j;x j;x (Yj j = 0)E j;x j;x j;x (Yj j = 1) = E j;x j;x j;x ( h ( j ) P i6=j h ( i ) i ) E j;x j;x j;x ( h ( j ) P i6=j h ( i ) i + h ( j ) ) = E j;x j;x j;x ( 2 h ( j ) f P i6=j h ( i ) i gf P i6=j h ( i ) i + h ( j )g ) E j;x j;x j;x ( 2 h ( j ) ( P i6=j h ( i ) i ) 2 ) =O(m 2 h 1 ): 1.10.3 Proof of lemma 1.9.2 Let Z = P m i=1 h ( i )I( i = 1), We expand 1 Z aroundE j;x j;x j;x (Z) and take expected value: E j;x j;x j;x 1 Z =E j;x j;x j;x " 1 E j;x j;x j;x (Z) 1 fE j;x j;x j;x (Z)g 2 (ZE j;x j;x j;x Z) + 1 X k=3 (1) k1 fE j;x j;x j;x (Z)g k (ZE j;x j;x j;x Z) k1 # : The series converges onA. Moreover, using normal approximation of binomial distribution, it can be shown thatE j;x j;x j;x (ZE j;x j;x j;x Z) k =O((mh 1 ) k=2 ). The lemma follows by noting thatE j;x j;x j;x Z 1 = fE j;x j;x j;x (Z)g 1 +O(m 2 h 2 ). 1.10.4 Proof of lemma 1.9.3 Consider b j = h ( j )I( j = 1) P m i=1 h ( i )I( i = 1) dened in Section 1.9.1. Let ~ b j = j P m i=1 i . By Condi- tion (C1), Cov(b j ;b k j ;x x x) = Ofh 2 Cov( ~ b j ; ~ b k j ;x x x)g: Note that Cov( ~ b j ; ~ b k j ;x x x) = E j;x j;x j;x ( ~ b j ~ b k ) E j;x j;x j;x ( ~ b j )E j;x j;x j;x ( ~ b k ): Using similar argument as in the proof for (1.9.12), we have E j;x j;x j;x ( ~ b j ) = P ( j = 1j ;x x x) P m i=1 P ( i = 1j ;x x x) +O(m 2 ) andE j;x j;x j;x ( ~ b k ) = P ( k = 1j ;x x x) P m i=1 P ( i = 1j ;x x x) +O(m 2 ): It follows that E j;x j;x j;x ( ~ b j )E j;x j;x j;x ( ~ b k ) = P ( j = 1j ;x x x) P m i=1 P ( i = 1j ;x x x) P ( k = 1j ;x x x) P m i=1 P ( i = 1j ;x x x) + O(m 3 ): Next we computeE j;x j;x j;x ( ~ b j ~ b k ). Note thatE j;x j;x j;x ( ~ b j ~ b k ) =P ( j = 1j ;x x x)E j;x j;x j;x k ( P m i=1 i ) 2 j = 1 : Using similar arguments as the proof for (1.9.11), we have Cov k ; 1 ( P m i=1 i ) 2 j = 1; ;x x x = 35 O(m 3 ): Let k = ( j : 1jm;j6=k). 
We have E j;x j;x j;x k ( P m i=1 i ) 2 j = 1 =P ( k = 1j ;x x x)E k 1 ( P m i=1 i ) 2 j = 1; ;x x x +O(m 3 ): Using similar arguments in Lemmas 1.9.2 and 1.9.1, we have E k 1 ( P m i=1 i ) 2 j = 1; ;x x x = 1 E j;x j;x j;x ( P m i=1 i ) 2 +O(m 3 ): In the previous equation, the conditional expectationE k can be replaced byE because the term k only aects the ratio by a term of orderO(m 3 ). Note thatE j;x j;x j;x ( P m i=1 i ) 2 = E j;x j;x j;x ( P m i=1 i ) 2 + Var( P m i=1 i j ;x x x) and Var( P m i=1 i j ;x x x)m. We have E j;x j;x j;x k ( P m i=1 i ) 2 j = 1 = P ( k = 1j ;x x x) f P m i=1 P ( i = 1j ;x x x)g 2 +O(m 3 ): Finally, the lemma follows from the fact that Cov( ~ b j ; ~ b k j ;x x x) =E j;x j;x j;x ( ~ b j ~ b k )E j;x j;x j;x ( ~ b j )E j;x j;x j;x ( ~ b k ) =O(m 3 ): 36 Chapter 2 Structure Online{Adaptive Rules for False Discovery Rate Control in Dynamic Models 2.1 Introduction The online multiple testing problem is concerned with the investigation of a possibly innite stream of null hypothesesfH 1 ;H 2 ;g in an ongoing manner based on sequentially collected datafX 1 ;X 2 ;g. At each time point, the investigator must make a real-time decision after X t arrives, without know- ing future datafX t+1 ;X t+2 ;g. The control of multiplicity in sequential testing typically involves imposing serial constraints on error rates over time, which requires that, for example, the family wise error rate (FWER) or false discovery rate (FDR; Benjamini and Hochberg, 1995) must fall below a pre{specied level at all decision points. The online testing problem may arise from a range of applications. For example, the quality preserving database (QPD) framework (Aharoni et al., 2010) has been widely employed by many research teams from diverse backgrounds. Some notable databases include Stanford's HIVdb that serves the community of anti-HIV treatment groups, WTCCC's large-scale database that is dis- tributed to assist various whole-genome association studies, and the National Health Institute (NIH) in uenza virus resource (IVR) that has been intensively queried by numerous researchers for design- ing new vaccines and treatments. The proper and ecient management of these large databases 37 calls for new analytical tools for handling thousands of hypothesis tests with real{time decisions made in a sequential fashion. In particular, the control of false positive errors is among the top concerns in statistical analyses using the QPD framework. For instance, the NIH IVR has been used to investigate thousands of biomedical hypotheses and, per the record in PubMed, has lead to more than 1,000 scientic publications as of January 2020. Another important application scenario, which is frequently encountered in nance, social media and mobile computing, is the real{time detection of anomalies based on high{frequency and large{scale time series data. For example, large travel ser- vice providers closely monitor the number of changes or cancellation requests of existing itineraries. An abnormal spike usually signies an unexpected event. It is important for the company to detect such events early and make necessary adjustments. The development of online detection system plays a key role for providing novel and timely marketing insights and avoiding adverse nancial losses. The large-scale testing problems under the online setup poses several new issues that are not present in conventional \oine" setup. First, a real{time decision must be made before the next data point arrives. 
This makes conventional step-wise testing methods no longer applicable. For instance, the well-known Holm's procedure (Holm, 1979b) for FWER control and Benjamini–Hochberg's procedure for FDR control both involve first ordering all observed p-values and then choosing a threshold along the ranking. However, the ranking step becomes impossible due to the absence of future data. Second, in contrast with conventional FWER and FDR criteria that only require an overall assessment of the multiplicity in simultaneous testing, the online methods must proceed with more stringent error constraints that are imposed sequentially at every decision point. This not only leads to decreased power in detecting signals but also makes the development of online testing rules more challenging. Third, the data stream often encodes useful local structures, including signal magnitudes, sparsity levels and grouping patterns, that may vary over time. It is crucial to develop flexible and adaptive online rules to exploit the underlying domain knowledge and informative structures. Fourth, the online decision-making process, which proceeds sequentially without knowledge of the future, may come to a halt when the total error budget, or alpha-wealth, is exhausted. As a result, the investigator may miss all potential discoveries in the future. This concern must be carefully addressed because in many applications the hypothesis tests are conducted in an ongoing manner with unpredictable patterns; even the total number of hypotheses to be investigated can be unknown. Finally, how to wisely allocate and invest the alpha-wealth to ensure validity in error control while maintaining high statistical power of online testing rules in the long run has remained a key issue that requires much research.

The online FDR control problem has received growing interest and great progress has been made recently. The alpha-investing (AI) idea (Foster and Stine, 2008) and its various generalizations (Aharoni and Rosset, 2014; Ramdas et al., 2017; Javanmard et al., 2018) have served as the basic framework and proved to be effective. Carefully designed AI rules are capable of handling an infinite stream of hypotheses and incorporating informative domain knowledge into the dynamic decision-making process. Beginning with a pre-specified alpha-wealth, the key idea in AI algorithms is that each rejection gains extra alpha-wealth, which may be subsequently used to make more discoveries at later time points. The generalized AI (GAI) algorithms (Aharoni and Rosset, 2014; Robertson and Wason, 2018; Lynch et al., 2017) are developed for a wider class of pay-out functions, enabling the construction of new online rules with increased power. The GAI++ framework (Ramdas et al., 2017) improves the power of GAI methods uniformly and is capable of dealing with more general settings. The new class of weighted GAI++ methods is flexibly designed to allow "indecisions" and is capable of integrating prior domain knowledge. To alleviate the "piggybacking" and "alpha-death" issues of AI rules, Ramdas et al. (2017) discussed the concept of decaying memory FDR. To effectively incorporate structural information into online inference, the SAFFRON procedure (Ramdas et al., 2018) derived a sequence of thresholds that are adaptive to estimated sparsity levels and showed that the power can be much improved.

This chapter develops a new class of structure-adaptive sequential testing (SAST) rules for online FDR control with several new features.
First, in contrast with existing AI and GAI rules whose building blocks are p-values, the class of SAST rules is built upon the conditional local false discovery rate (Clfdr), which optimally adapts to important local structures in the data stream. Second, the sequential rejection rule based on Clfdr leads to a novel alpha-investing framework that is fundamentally different from that in Foster and Stine (2008). The new framework precisely characterizes the tradeoffs between different actions in online decision making, which provides key insights for designing more powerful online FDR rules. The new AI framework also reveals that SAST automatically avoids the "alpha-death" issue in the sense that its operation always reserves budget to reject new hypotheses, and can proceed in an ongoing manner to any time point in the future. Finally, by adaptively learning from past experiences and dynamically allocating the alpha-wealth, SAST can effectively avoid the "piggybacking" issue and improve its performance as more data are acquired. Our theoretical and numerical results demonstrate that SAST is effective for online FDR control and achieves substantial power gain over existing methods in many settings.

The chapter is organized as follows. Section 2 first introduces the model and problem formulation, and then develops the oracle SAST procedure for online FDR control by assuming that model parameters are known. Section 3 discusses computational algorithms, proposes the data-driven SAST rule and establishes its theoretical properties. Simulation is conducted in Section 4 to investigate the finite sample performance of SAST and compare it with existing methods. SAST is illustrated in Section 5 through applications for identifying differentially expressed genes and detecting anomalies in time series data. The proofs are provided in the end.

2.2 Oracle and Adaptive Rules for Online FDR Control

We first describe the model and problem formulation in Section 2.2.1, then discuss three key elements in the proposed SAST rule in turn: a new test statistic to capture the structural information in the data stream (Sections 2.2 and 2.3); a new alpha-investing framework to characterize the gains and losses in sequential decision making (Section 2.4); and a new adaptive learning algorithm to optimize the alpha-wealth allocation (Section 2.5).

2.2.1 Model and Problem Formulation

Denote 𝒯 a continuous temporal domain and t ∈ 𝒯 a time point. Let T ⊂ 𝒯 be a discrete, ordered and evenly spaced index set for time labels. (Footnote: T may be taken either as {1, 2, ..., t} on a growing domain or as a set of points that lie on a fixed-domain regular grid {1/t, 2/t, ..., (t − 1)/t, 1} with t → ∞. To simplify notation, we take T to be the set of positive integers.) Suppose we are interested in testing a sequence of null hypotheses {H_t : t ∈ T} based on a data stream X = (X_t : t ∈ T). To describe the true states of nature, define Bernoulli variables θ_t, where θ_t = 0/1 if H_t is true/false. Let {π_t ≡ P(θ_t = 1) : t ∈ T} denote the local sparsity levels that may vary over time. The observations can be described using a hierarchical model:

θ_t ∼ Bernoulli(π_t),  X_t | θ_t ∼ F_t = (1 − π_t)F_0 + π_t F_{1t},   (2.2.1)

where F_0 and F_{1t} are the null and non-null distributions, respectively. Denote f_0 and f_{1t} the corresponding density functions. We assume that F_0 is known and identical for all t ∈ T. By contrast, π_t and f_{1t} can vary smoothly in t ∈ T.

Remark 3. The inhomogeneity assumption reflects that signals may either vary in strength or appear with different frequencies over time. This structural information can be highly informative. The smoothness assumption makes it possible to pool information from the observations in the neighborhood of t. We do not impose further assumptions on π_t and F_{1t}, both of which will be
We do not impose further assumptions on t and F 1t , both of which will be 1 T may be taken either asf1; 2; ;tg on a growing domain or a set of points that lie on a xed-domain regular grid:f 1 t ; 2 t ;:::; t1 t ; 1g with t!1. To simplify notation, we takeT as the set of positive integers. 40 estimated non-parametrically. LetX X X t = (X i :i2T; it) be the collection of observations up to time t. Consider a class of online decision rules =f t (X X X t ) :t2Tg2f0; 1g T , where t (X X X t ) represents a real-time decision in the sense that t only depends on information available at time t, with t = 1 indicating that H t is rejected and t = 0 otherwise. Denote t =f i (X X X i ) :i2T; itg the collection of decisions up to t. The online FDR problem is concerned with the performance of a stream of real{time decisions. For decisions up to t, let FDR t ( t ) =E ( P it;i2T (1 i ) i ( P it;i2T i )_ 1 ) ; (2.2.2) where the superscript \t" denotes that the FDR is evaluated at a specic time point. The goal is to construct a real{time decision rule =f t (X X X t ) :t2Tg that controls the FDR t at level for all t2T. To compare the power of dierent testing rules, dene the average power (AP) and missed discovery rate (MDR) as AP t ( t ) = E( P it;i2T i i ) E( P it;i2T i ) ; MDR t ( t ) = 1 AP t ( t ): (2.2.3) AP t (MDR t ) can be viewed as the overall reward (loss) of a sequential testing rule up to t. To simplify the discussion, throughout this section we assume that the distributional information such as the non-null proportion t and density function f t in Model 2.2.1 are known. Section 3 considers the case where the model parameters are unknown and discusses in detail related estimation and implementation issues. 2.2.2 The oracle rule for simultaneous testing The goal of this section is to justify the fundamental role of Clfdr as the building block of the proposed online FDR rule. The online decision-making process is complicated due to the serial constraints on FDR and absence of future data. To focus on the essential issue, we rst consider an ideal setup where a hypothetical oracle observes all data in a local neighborhood at once and makes a batch of simul- taneous decisions. Let d denote the size of a neighborhood. Consider the collection of hypothe- ses in a neighborhood prior to t d: fH i : t d + 1 i t g. Denote the neighborhood N d (t ) =ft d + 1; ;t g and the simultaneous decisions =f i :i2N d (t )g, where i is allowed to depend on the entire d-vector X X X =fX i : i2N d (t )g. Unlike (2.2.2), we only require 41 the FDR control for the d simultaneous decisions: FDR s ( ) =E (P i2N d (t ) (1 i ) i ( P i2N d (t ) i )_ 1 ) ; (2.2.4) where the superscript \s" indicates a simultaneous{type FDR denition. The simultaneous testing of multiple hypotheses can be conceptualized as a two-stage inferential process: rstly ranking all hypotheses according to a signicance index and secondly choosing a cuto along the ordered sequence. This process can be described by a thresholding rule of the form =fI( i c) :i2N d (t )g; whereI() is an indicator function, i is the signicance index of H i and c is the cuto of i . For example, the BH procedure uses the p-value as the signicance index to order the hypotheses, and implements a step-up algorithm to determine a data-driven cuto c. However, the p-value is inecient for online FDR analysis as it fails to capture the important structural information in the data stream. 
We propose to use the Clfdr (Cai and Sun, 2009) as the significance index to order the hypotheses:

Clfdr_t(x_t) = P(θ_t = 0 | X_t = x_t) = (1 − π_t) f_0(x_t) / f_t(x_t),  for t ∈ T.   (2.2.5)

Denote Clfdr_(1), ..., Clfdr_(d) the ordered Clfdr values in N_d(t*) and H_(1), ..., H_(d) the corresponding hypotheses. To determine the cutoff for simultaneous testing, we apply the step-wise algorithm

k = max{ j : (1/j) Σ_{i=1}^{j} Clfdr_(i) ≤ α }.   (2.2.6)

Then the threshold is c = Clfdr_(k) and we reject H_(1), ..., H_(k). The Clfdr rule (2.2.6) may be viewed as an oracle rule that sees all data in a local neighborhood at once and then makes simultaneous decisions. In Appendix 2.8, we establish the optimality property of the Clfdr rule for simultaneous testing under the "offline" setup. An infinite data stream can be approximated by sequential data points arriving in batches. Intuitively, the Clfdr statistic provides a good building block for developing new online sequential testing rules as it is optimal for simultaneous inference in each batch of data points.

Remark 4. In the "offline" setup for simultaneous testing with a covariate sequence, which includes the Clfdr rule (2.2.6) as a special case, Cai et al. (2019) develops an asymptotic optimality theory. We can similarly show that (2.2.6) is asymptotically optimal in the sense that it achieves the benchmark of a hypothetical oracle. However, the optimality issue in the online setup, which depends on many other factors such as the optimal allocation of alpha-wealth and the prediction of future patterns over time, is still open and requires much research.

2.2.3 Adapting to local structures by Clfdr: an illustration

The incorporation of structural information and domain knowledge promises to improve the power of existing FDR procedures (Genovese et al., 2006; Cai and Sun, 2009; Hu et al., 2010; Lei and Fithian, 2018; Cai et al., 2019). For example, the works by Hu et al. (2010), Li and Barber (2019) and Xia et al. (2019) showed that weighted p-values can be constructed to capture the varying sparsity levels of ordered or grouped hypotheses. In contrast with the p-value, the Clfdr takes into account important structural information such as π_t and f_t, which makes Clfdr an ideal building block for multiple testing with inhomogeneous data streams. We present an example to illustrate the advantage of the Clfdr rule. Consider the following situation where the data stream {X_1, X_2, ..., X_t, ...} obeys a random mixture model with varying sparsity levels:

X_t ∼ (1 − π_t)N(0, 1) + π_t N(μ, 1).   (2.2.7)

Model (2.2.7) is a special case of Model (2.2.1): the null and alternative densities are fixed and the dynamic part is fully captured by the varying proportion π_t. The key idea of Clfdr and of the weighted p-value (in the form p_t/w_t, where w_t is the weight for H_t) is to upweight the hypotheses in a local neighborhood where signals appear more frequently (e.g. in clusters). In addition to the varying π_t, Clfdr also automatically adapts to f_t, leading to further power improvement. To compare the effectiveness of different weighting methods, we simulate a data stream for testing m = 5000 hypotheses. The top row in Figure 2.1 sets π_t = 0.5 in the blocks [1001 : 1150], [2001 : 2150], [3001 : 3100] and [4001 : 4150], and π_t = 0.01 elsewhere. We vary μ from 2 to 4. The bottom row sets μ = 2.5 and varies π_t from 0.2 to 0.9 in the above blocks. The block structure is highly informative and can be exploited by Clfdr and weighted p-values to improve the power.
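As a concrete reference point for the comparison that follows, here is a minimal R sketch (an illustration, not the authors' implementation) of the step-wise rule (2.2.6) applied to one batch of hypotheses, assuming the oracle Clfdr values are available.

```r
## Minimal sketch of the step-wise rule (2.2.6) for a single neighborhood.
clfdr.stepup <- function(clfdr, alpha = 0.05) {
  ord <- order(clfdr)                                   # sort Clfdr from smallest to largest
  ok  <- cumsum(clfdr[ord]) / seq_along(ord) <= alpha   # running average of the ordered values
  k   <- if (any(ok)) max(which(ok)) else 0             # largest j whose running average is <= alpha
  rejected <- logical(length(clfdr))
  if (k > 0) rejected[ord[seq_len(k)]] <- TRUE          # reject H_(1), ..., H_(k)
  rejected
}
```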
We apply the following methods at FDR level α = 0.05, assuming that the model parameters in (2.2.7) are known: BH (Benjamini and Hochberg, 1995), the structure-adaptive BH algorithm (SABHA; Li and Barber, 2019) using weighted p-values with w_t = 1/(1 − π_t), the GAP method (Xia et al., 2019) using weighted p-values with w_t = π_t/(1 − π_t), and the Clfdr rule (2.2.6).

[Figure 2.1 (panels: FDR and power comparisons, varying μ in the top row and π in the bottom row; methods: BH, CLfdr, SABHA, GAP): Structure-adaptiveness: Clfdr vs weighted p-values.]

We can see that all methods control the FDR at the nominal level. In terms of power, BH can be improved by SABHA and GAP, both of which are dominated by the Clfdr rule.

2.2.4 A new alpha-investing framework

Existing FDR methods such as the BH and Clfdr procedures are simultaneous inference procedures that involve first ordering the significance indices (p-value or Clfdr) of all hypotheses and then applying a step-wise algorithm to the ordered sequence to determine the threshold. However, the ranking and thresholding strategy cannot be applied to the online setting, where the investigator must make real-time decisions without seeing future observations. This section discusses how to avoid the overflow of the FDR at any given time t and how to efficiently allocate the alpha-wealth to increase the power.

We start with a novel interpretation of the alpha-investing idea by recasting the Clfdr algorithm (2.2.6) as a varying-capacity knapsack process. Denote R_t ⊂ {H_1, H_2, ..., H_t} the collection of rejected hypotheses at time t. The decision process (2.2.6) can be conceptualized as a sequence of comparisons of two quantities: the nominal FDR level and the average of the rejected Clfdr values. Specifically, (2.2.6) motivates us to consider the constraint

Ave{Clfdr_i : i ∈ R_t} ≤ α,  for all t ∈ T,   (2.2.8)

where Ave(A) denotes the average of the elements in set A. The simultaneous testing setup is only concerned with one constraint at the last time point, when all data have been observed. By contrast, the online setup poses a series of constraints, e.g. (2.2.8) must be fulfilled for every t to avoid the overflow of FDR^t (2.2.2). We view (2.2.8) as a dynamic decision process resembling a knapsack problem, where H_t can only be rejected when the following constraint is satisfied:

Clfdr_t − α ≤ C_t := Σ_{H_i ∈ R_{t−1}} (α − Clfdr_i),  for t = 1, 2, ...,   (2.2.9)

where C_t is the capacity (of the knapsack) at time t, with the default choice C_1 = 0. The capacity may either expand or shrink over time, depending on the sequential decisions along the data stream. This dynamic process can be described as follows. The initial capacity is C_1 = 0. Starting from t = 1, we reject H_t if (2.2.9) is fulfilled. If H_t with Clfdr_t < α is rejected, then the capacity C_t increases by α − Clfdr_t (gain); hence we earn bonus room. By contrast, if H_t with Clfdr_t > α is rejected, then C_t decreases by Clfdr_t − α (loss). The decision process (2.2.9) provides a new alpha-investing framework that precisely characterizes the gains and losses in sequential testing.
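The capacity check in (2.2.9) amounts to one line of bookkeeping. The R sketch below is a minimal illustration (the function name and the default level are ours, not the paper's); the test is equivalent to requiring that the average of the rejected Clfdr values, including the current one, stays below α.

```r
## Minimal sketch of the capacity constraint (2.2.9).
## `clfdr.rejected` holds the Clfdr values of hypotheses rejected so far (R_{t-1}).
can.reject <- function(clfdr.t, clfdr.rejected, alpha = 0.05) {
  C.t <- sum(alpha - clfdr.rejected)   # capacity built up by past rejections; 0 if none yet
  clfdr.t - alpha <= C.t               # equivalent to: mean of rejected Clfdr's, including t, <= alpha
}
```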
In contrast with the alpha-investing framework in Foster and Stine (2008), which views each rejection as a gain of extra alpha-wealth, the new characterization (2.2.9) reveals that not all rejections are created equal: rejections with small Clfdr will lead to increased alpha-wealth whereas rejections with large Clfdr will lead to decreased alpha-wealth. This view provides key insights for designing more powerful online FDR rules. Moreover, the new AI framework reveals that utilizing Clfdr rules can automatically avoid the "alpha-death" issue. Specifically, the process (2.2.9) can always reject new hypotheses with Clfdr < α regardless of the current budget, and can proceed in an ongoing manner to any time point in the future.

2.2.5 Oracle-assisted adaptive learning and the SAST algorithm

To efficiently allocate the alpha-wealth, we need to further refine the online algorithm (2.2.9) to avoid making imprudent rejections that can potentially eat up all the budget. The specific issue is referred to as "piggybacking" (Ramdas et al., 2017), which, in a vivid way, describes the phenomenon that a string of bad decisions is made due to previously acquired budget. To see the necessity of taking careful actions, suppose that we have accumulated some bonus room over time before observing a very large Clfdr_t satisfying (2.2.9). Although rejecting H_t is an action that obeys the FDR constraint, the action can be unwise since it is possible that we can invest the extra "cost", Clfdr_t − α, to make more discoveries at later time points. A practical strategy is to incorporate a "barrier" λ_t and modify (2.2.9) as

Clfdr_t < λ_t  and  Clfdr_t − α ≤ Σ_{H_i ∈ R_{t−1}} (α − Clfdr_i).   (2.2.10)

The barrier can effectively prevent "piggybacking" by filtering out large Clfdr_t and hence saving budget for the future. The choice of λ_t depends on the pattern of future hypotheses. However, all online methods must proceed without seeing the future. To resolve the issue, consider the oracle Clfdr rule (2.2.6) that sees all data in a local neighborhood at once. If we assume that the hypothesis stream is "locally stable" in its patterns, then λ_t may be informed by the oracle rule (2.2.6) simultaneously conducted on N_d(t), a local neighborhood prior to t. The rationale is to use recent past data to get some idea about the patterns of hypotheses that will arrive in the near future. Concretely, we first order {H_i : i ∈ N_d(t)} according to their Clfdr values, then run the "offline" algorithm (2.2.6) to set the barrier λ_t = Clfdr_(k+1). The online algorithm, by acting as if it sees the future, can effectively filter out large Clfdr values and hence avoid inefficient investments. The operation of algorithm (2.2.6) on N_d(t) also implies that the barrier λ_t may be either raised or lowered according to the varying π_t and f_t in the dynamic model, which is desirable in practice for dealing with inhomogeneous data streams. In Section 2.4.3, we illustrate that incorporating the barrier can greatly reduce the MDR (2.2.3).

Finally, we present the proposed structure-adaptive sequential testing (SAST) rule (oracle version with known parameters) in Algorithm 1. The SAST algorithm essentially utilizes the sequential constraints (2.2.10) with barriers set by the offline algorithm (2.2.6). We can see that Algorithm 1 runs two parallel procedures: an online procedure for making real-time decisions and an "offline" procedure for determining the barrier.
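The barrier update just described can be sketched in a few lines of R. The function below is an illustration (its name, defaults, and the handling of the corner case where every value in the window is rejectable are our own choices, not the paper's): it runs the offline step-wise rule on the most recent window and returns Clfdr_(k+1) as the new barrier.

```r
## Minimal sketch of the barrier update used in Algorithm 1.
update.barrier <- function(clfdr.window, alpha = 0.05, lambda.prev = 0.05) {
  s <- sort(clfdr.window)
  if (s[1] > alpha) return(lambda.prev)            # nothing rejectable in the window: keep the old barrier
  k <- sum(cumsum(s) / seq_along(s) <= alpha)      # k = max{j : Q_t(j) <= alpha}
  if (k < length(s)) s[k + 1] else lambda.prev     # lambda_t = Clfdr_(k+1); fall back if k equals the window size
}
```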
Thus the information of every data point has been used twice: first X_t is used for real-time decision-making at time t, then X_t is stored as past data so that we can "learn from experience" via the offline oracle. The following theorem shows that Algorithm 1 is valid for online FDR control.

Table 2.1: Algorithm 1. The oracle SAST rule.
Initialization: R_0 = ∅; λ_0 = α.
Updating the barrier: Let N_d(t) = {t − d + 1, ..., t}. Sort {Clfdr_i : i ∈ N_d(t)} from smallest to largest and denote the ordered statistics {Clfdr^t_(1), Clfdr^t_(2), ...}. If Clfdr^t_(1) > α, keep the same barrier λ_t = λ_{t−1}. Otherwise let k = max{j : Q_t(j) ≤ α}, where Q_t(j) = (1/j) Σ_{i=1}^{j} Clfdr^t_(i), and update the barrier as λ_t = Clfdr^t_(k+1).
Decision: Let R_t = {i ≤ t : δ_i = 1} and denote |R_t| its cardinality. If Clfdr_t < λ_t and {|R_{t−1}| + 1}^{−1} {Σ_{i∈R_{t−1}} Clfdr_i + Clfdr_t} ≤ α, then δ_t = 1. Otherwise δ_t = 0.

Theorem 2.2.1. Consider the online FDR procedure δ = (δ_t : t ∈ T), where δ_t is determined by Algorithm 1. Denote δ^t = (δ_i : i ≤ t, i ∈ T). Assume that the Clfdr values are known. Then we have FDR^t(δ^t) ≤ α for all t ∈ T.

2.3 Data-Driven SAST and Its Theoretical Properties

We first develop estimation methodologies and computational algorithms to implement the SAST rule in Section 2.3.1, then establish the theoretical properties of the data-driven procedure in Section 2.3.2.

2.3.1 Data-driven procedure and computational algorithms

We assume that the null distribution of z-values, f_0, is known, which is a standard practice in the literature. (Footnote: In situations where the empirical null is more appropriate (Efron, 2004), f_0 can be first estimated using the method in Jin and Cai (2007) and then treated as known.) The key quantities remaining to be estimated are π_t and f_t(x). In our motivating applications such as queries of QPDs and anomaly detection in high-frequency time series, the databases or servers have already collected large amounts of data at the beginning of the online FDR analysis. Let {X_{−K_0}, ..., X_{−1}, X_0} denote the available data and suppose we start online testing at t = 1 with a data stream {X_1, X_2, ...}. (Footnote: In situations where the online FDR analysis must start without prior data, we suggest applying existing methods such as LOND first and then switching to SAST as more data are acquired.) The conditional density f_t can be estimated using standard (one-sided) bivariate kernel methods (Silverman, 1986):

f̂_t(x) = Σ_{j=t−d+1}^{t−1} K_{h_t}(j − t) K_{h_x}(x − x_j) / Σ_{j=t−d+1}^{t−1} K_{h_t}(j − t),   (2.3.1)

where d ≤ K_0 is the length of the moving window that includes a pre-specified number of observations, K(t) is a kernel function, and h_t and h_x are the bandwidths, with K_h(t) = h^{−1}K(t/h).

Remark 5. In the analysis of large-scale high-frequency time series data such as the NYC taxi data (Section 2.5), we can pre-specify d, say, to be 1000 to speed up the computation. This has virtually no impact on the estimator f̂_t (compared to using all previous data). Otherwise we can always set d = t. Note that our estimator follows the standard practice in density estimation, which does not include X_t when estimating f_t(x) at time t. This avoids overestimating Clfdr, which would lead to overflow of FDR^t.

Next we propose a weighted screening approach to estimate the unknown proportions {π_t : t ∈ T}. The key idea is to use a kernel, which weights observations by their distance to t, to pool information from nearby time points. Let h_t be the bandwidth and K a kernel function satisfying ∫K(t)dt = 1, ∫tK(t)dt = 0 and ∫t²K(t)dt < ∞.
Consider a screening procedure T_t(τ) = {t − d + 1 ≤ i ≤ t − 1 : P_i > τ}, where τ is a pre-specified threshold. We propose the following estimator based on Cai et al. (2019):

π̂_t = 1 − Σ_{i∈T_t(τ)} K_{h_t}(t − i) / { (1 − τ) Σ_{i=t−d+1}^{t−1} K_{h_t}(t − i) }.   (2.3.2)

(Footnote: We recommend using the same h_t in both (2.3.1) and (2.3.2) to stabilize the performance.) Now we provide some intuition for the estimator (2.3.2). First, at time t, define v_h(t, i) = K_{h_t}(|t − i|)/K_{h_t}(0). We can view m_t = Σ_{i=t−d+1}^{t−1} v_h(t, i) as the "total" number of observations at time t. Suppose we are interested in counting how many null p-values are greater than τ among the m_t "observations" at t. The empirical count is given by Σ_{i∈T_t(τ)} v_h(t, i), whereas the expected count is given by {Σ_{i=t−d+1}^{t−1} v_h(t, i)}(1 − π_t)(1 − τ). Equation (2.3.2) can be derived by first setting the expected and empirical counts equal and then solving for π_t. In Section 2.3.2 we show that π̂_t is a consistent estimator of

π*_t = 1 − (1 − τ)^{−1} P(P_t > τ),   (2.3.3)

which always underestimates π_t and guarantees (conservative) FDR control (Proposition 2.3.1).

Remark 6. There is a bias-variance tradeoff in the choice of τ for the proposed estimator π̂_t. We shall see that when τ increases, the "purity" of the screening subset T(τ) increases, which decreases the approximation bias of π*_t (desirable). At the same time, when τ increases, the sample size for estimating π_t decreases, thereby increasing the variance of the estimator π̂_t (undesirable). The common choice of τ is 0.5. In Section 2.4.1, we discuss a data-driven algorithm that chooses τ adaptively.

Combining (2.3.1) and (2.3.2), we propose to estimate the Clfdr as

\widehat{Clfdr}_t = min{ (1 − π̂_t) f_0(x_t) / f̂_t(x_t), 1 },  t ∈ T.   (2.3.4)

Our proposed data-driven rule implements Algorithm 1 by substituting \widehat{Clfdr}_t in place of Clfdr_t. The data-driven algorithm is summarized in Algorithm 2.

Table 2.2: Algorithm 2. The data-driven SAST.
Initialization: R_0 = ∅; λ_0 = α.
Estimation: \widehat{Clfdr}_t = min{(1 − π̂_t) f_0(x_t) / f̂_t(x_t), 1}, where π̂_t and f̂_t are defined by (2.3.2) and (2.3.1), respectively.
Updating the barrier: Let N_d(t) = {t − d + 1, ..., t}. Sort {\widehat{Clfdr}_i : i ∈ N_d(t)} from smallest to largest and denote the ordered statistics {\widehat{Clfdr}^t_(1), \widehat{Clfdr}^t_(2), ...}. If \widehat{Clfdr}^t_(1) > α, keep the same barrier λ_t = λ_{t−1}. Otherwise let k = max{j : Q_t(j) ≤ α}, where Q_t(j) = (1/j) Σ_{i=1}^{j} \widehat{Clfdr}^t_(i), and update the barrier as λ_t = \widehat{Clfdr}^t_(k+1).
Decision: Let R_t = {i ≤ t : δ_i = 1} and denote |R_t| its cardinality. If \widehat{Clfdr}_t < λ_t and {|R_{t−1}| + 1}^{−1} {Σ_{i∈R_{t−1}} \widehat{Clfdr}_i + \widehat{Clfdr}_t} ≤ α, then δ_t = 1. Otherwise δ_t = 0.

2.3.2 Theoretical properties of data-driven SAST

This section aims to show that the data-driven SAST procedure is asymptotically valid for online FDR control. Our theoretical analysis is divided into three steps. The first step (Proposition 2.3.1) shows that a hypothetical rule, which substitutes

Clfdr*_t = (1 − π*_t) f_0(x_t) / f_t(x_t)   (2.3.5)

in place of Clfdr_t in Algorithm 1, is conservative for online FDR control.

Proposition 2.3.1. Consider π*_t defined by (2.3.3); then we have π*_t ≤ π_t and Clfdr*_t ≥ Clfdr_t. Hence the hypothetical rule using (2.3.5) is valid (and conservative) for online FDR control.

The second step (Proposition 2.3.2) shows that \widehat{Clfdr}_t is a consistent estimator of Clfdr*_t. We prove the result by appealing to the infill-asymptotics framework (Stein, 2012), which converts the set of time points {1, 2, ..., t} on a growing domain to a set of points that lie on a fixed-domain regular grid: {1/t, 2/t, ..., (t − 1)/t, 1}.
2.3.2 Theoretical properties of data-driven SAST

This section aims to show that the data-driven SAST procedure is asymptotically valid for online FDR control. Our theoretical analysis is divided into three steps. The first step (Proposition 2.3.1) shows that a hypothetical rule, which substitutes

$$\mathrm{Clfdr}_t^* = \frac{(1-\pi_t^*)\, f_0(x_t)}{f_t(x_t)} \qquad (2.3.5)$$

in place of $\mathrm{Clfdr}_t$ in Algorithm 1, is conservative for online FDR control.

Proposition 2.3.1. Consider $\pi_t^*$ defined by (2.3.3); then we have $\pi_t^* \le \pi_t$ and $\mathrm{Clfdr}_t^* \ge \mathrm{Clfdr}_t$. Hence the hypothetical rule using (2.3.5) is valid (and conservative) for online FDR control.

The second step (Proposition 2.3.2) shows that $\widehat{\mathrm{Clfdr}}_t$ is a consistent estimator of $\mathrm{Clfdr}_t^*$. We prove the result by appealing to the in-fill asymptotics framework (Stein, 2012), which converts the set of time points $\{1, 2, \ldots, t\}$ on a growing domain to a set of points that lie on a fixed-domain regular grid: $\{\tfrac{1}{t}, \tfrac{2}{t}, \ldots, \tfrac{t-1}{t}, 1\}$. The discussions in Stein (2012) indicate that the in-fill model is equivalent to the growing-domain model under mild conditions: when $t \to \infty$, the asymptotic arguments, which respectively correspond to letting the grid become denser and denser in the fixed interval $(0,1]$ and letting the domain $\{1, 2, \ldots, t\}$ grow to infinity, can be established in essentially the same manner. We state the fixed-domain theory as it naturally connects to the familiar density estimation theory, where the notations and regularity conditions are standard and easy to understand. The growing-domain version of the theory is briefly discussed in Appendix 2.7.3. We can similarly define the bivariate density estimator and the following conditional proportion estimator:

$$\hat\pi_t = 1 - \frac{\sum_{i \in \mathcal{T}_t(\tau)} K_{h_t}(1-i/t)}{(1-\tau)\sum_{i=t-d+1}^{t-1} K_{h_t}(1-i/t)}. \qquad (2.3.6)$$

The two estimators (2.3.2) and (2.3.6) are essentially identical (with rescaled bandwidths). The following regularity conditions are needed in our theory. These conditions are standard in density estimation.

(A1): For any $s \in (0,1]$ and $\epsilon > 0$, there exists $\delta$ such that if $|s - s'| \le \delta$, $s' \in (0,1]$, then $\int |f_s(x) - f_{s'}(x)|\,dx < \epsilon$.
(A2): $h_x \to 0$, $h_t \to 0$ and $t h_x h_t \to \infty$.
(A3): $f_j(x) < C$ and $\int |f_j''(x)|\,dx < C$ for all $j$.
(A4): $d h_t \to \infty$ and $d \ge c\, t h_t$ for some $c > 0$.

Proposition 2.3.2. Suppose (A1)-(A4) hold; then $\widehat{\mathrm{Clfdr}}_t \overset{p}{\to} \mathrm{Clfdr}_t^*$.

In the third step of our theoretical analysis (Theorem 2.3.3), we establish the asymptotic validity of the data-driven SAST procedure for online FDR control.

Theorem 2.3.3. Assume the conditions in Proposition 2.3.2 hold. Then for any given time $t$, the data-driven SAST rule (Algorithm 2) controls the $\mathrm{FDR}_t$ at level $\alpha$ asymptotically.

2.3.3 Theory for data streams with fixed distributions

SAST learns from past decisions and improves its performance over time through the assistance of an offline oracle. The barrier $\lambda_t$ becomes more informative as more tests are conducted. Specifically, the initial barrier is set to be $\alpha$ at time $t = 1$, which is very conservative. In the special case when the mixture model has fixed $\pi_t$ and $f_t$ over time, we can show that the barrier $\lambda_t$ converges to $\lambda_{OR}$, where $\lambda_{OR}$ is the optimal threshold of the "offline" oracle procedure in Section 2.2.2. Hence SAST behaves like an oracle that sees all data points (including future ones). Our numerical results show that the FDR levels of SAST are conservative at the beginning, but the FDR becomes closer to $\alpha$ as we sequentially update the barrier with information from more time points.

Theorem 2.3.4. Assume the conditions from Theorem 2.3.3 hold. Then under the model with $\pi_t \equiv \pi$ and $f_t \equiv f$, the data-driven barrier $\hat\lambda_t \to \lambda_{OR}$ as $t \to \infty$, where $\lambda_{OR}$ is the optimal threshold of the oracle FDR procedure for simultaneous testing defined in Section 2.2.2.

2.4 Simulation

In this section, we first provide some details on implementation. Comprehensive simulation studies are conducted in Section 2.4.2 to compare the oracle and data-driven SAST procedures with other existing online FDR rules. Section 2.4.3 presents an example with simulated data to illustrate the merit of including a barrier in online sequential testing.

2.4.1 Implementation Details

In our simulation, we use a data-driven $\tau$ which is determined by the BH algorithm at level $\alpha = 0.5$. Roughly speaking, in the subset $\tilde{\mathcal{T}}(\tau) = \{s \in \mathcal{S} : p(s) \le \tau\}$, 50% of the cases come from the null (i.e., the expected proportion of false positives made by BH). It is anticipated that in the remaining set $\mathcal{T}(\tau) = \{s \in \mathcal{S} : p(s) > \tau\}$, which is used to construct our estimator, the majority of the cases should come from the null. A small sketch of this choice of $\tau$ is given below.
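Concretely, one way to read the description above is to take $\tau$ as the largest p-value rejected by BH at level 0.5; the following R sketch reflects this reading (it is our interpretation, not code from the thesis), with a fallback to the standard choice $\tau = 0.5$ when BH makes no rejections.

```r
# Data-driven choice of the screening threshold tau via BH at level 0.5 (Section 2.4.1).
choose_tau <- function(pvals, alpha = 0.5, default = 0.5) {
  rejected <- p.adjust(pvals, method = "BH") <= alpha   # BH rejection set
  if (any(rejected)) max(pvals[rejected]) else default  # largest rejected p-value
}
```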
This data-driven scheme ensures a small bias in approximation while maintaining a larger sample size compared to the standard choice of $\tau = 0.5$. The conditional density function $\hat f_t(x)$ is estimated using the R function density, where the bandwidths $h_x$ and $h_t$ are chosen based on Silverman (1986).

2.4.2 Comparisons of online FDRs and MDRs

We compare the proposed SAST procedure with its competitors for online FDR control. The following methods are included in the comparison:
- SAST with known $\pi_t$ and $f_t$ (SAST.OR, Algorithm 1);
- SAST with estimated model parameters (SAST.DD, Algorithm 2);
- LOND: the method proposed by Javanmard and Montanari (2016);
- LORD++: the GAI++ rule proposed by Ramdas et al. (2017).

For the general simulation setup, we choose $m = 5000$ and the pre-specified FDR level $\alpha = 0.05$. The data are simulated from the following model:

$$X_t \sim (1-\pi_t)\, N(0,1) + \pi_t\, N(\mu, 1).$$

For the data-driven method, we need an initial burn-in period. In the simulation we generate 500 data points prior to $t = 1$ to form an initial density estimate. The varying density and proportion estimates are updated every 200 time points. The following simulation settings are considered (a small sketch of the data-generating process is given after the list):

1. Block pattern: $\pi_t = 0.01$ for $t \in (1,1000] \cup (1200,2000] \cup (2200,3000] \cup (3200,4000] \cup (4200,5000]$; $\pi_t = 0.6$ for $t \in (1000,1200] \cup (2000,2200]$; $\pi_t = 0.8$ for $t \in (3000,3200] \cup (4000,4200]$. Vary $\mu$ from 2 to 4.2 with step size 0.2.
2. Constant pattern: $\pi_t = 0.05$ for $t = 1, \ldots, m$. Vary $\mu$ from 2 to 4.2 with step size 0.2.
3. Linear pattern: vary $\pi_t$ linearly from 0 to 0.5. Vary $\mu$ from 2 to 4.2 with step size 0.5.
4. Sine pattern: $\pi_t = \{\sin(2\pi t/m) + 1\}/4$, so that $\pi_t$ ranges between 0 and 0.5. Vary $\mu$ from 2 to 4.2 with step size 0.5.
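The following R sketch illustrates how one stream is generated under Setting 1 (block pattern); the value $\mu = 3$ and the use of two-sided p-values are illustrative choices of ours.

```r
# Minimal sketch of the data-generating process for Setting 1 (block pattern).
m    <- 5000
pi_t <- rep(0.01, m)
pi_t[c(1001:1200, 2001:2200)] <- 0.6
pi_t[c(3001:3200, 4001:4200)] <- 0.8
mu    <- 3                                       # one of the signal strengths considered
theta <- rbinom(m, 1, pi_t)                      # latent signal indicators
x     <- rnorm(m, mean = mu * theta, sd = 1)     # X_t ~ (1 - pi_t) N(0,1) + pi_t N(mu,1)
pvals <- 2 * pnorm(-abs(x))                      # two-sided p-values under the N(0,1) null
```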
We apply the different methods at $\alpha = 0.05$. The empirical FDR and MDR levels are evaluated using the average of the false discovery proportions and missed discovery proportions from 1000 replications. To investigate the performance of the different methods in the online setting, we display the empirical $\mathrm{FDR}_t$ and $\mathrm{MDR}_t$ levels at various time points, where the intermediate evaluation points range from 1500 to 5000 with step size 500. The results for the block and constant patterns are summarized in Figure 2.2, and the results for the linear and sine patterns are summarized in Figure 2.3. The following observations can be made from the simulation results.

- All methods control $\mathrm{FDR}_t$ at the nominal level at all decision points being considered. SAST.OR achieves the nominal level very precisely. SAST.DD is conservative. LOND and LORD++ are more conservative compared to SAST.DD.
- SAST.DD is inferior to SAST.OR. This is largely due to the conservativeness of the estimator $\hat\pi_t$. The gap in performance between SAST.DD and SAST.OR narrows as the signal strength becomes stronger, in which situation the estimator $\hat\pi_t$ becomes more precise.
- In general LOND can be much improved by LORD++, which can be further improved by SAST.DD. The gap in power between SAST.DD and LORD++ narrows as the signal strength becomes stronger, in which situation it is easier to separate the signals from the null cases.
- When $\pi_t$ is fixed over time, the signals arrive at a constant rate and there is no informative structural information in the data stream. SAST.DD still outperforms LOND and LORD++ because the novel alpha-investing framework via Clfdr leads to more precise FDR control and optimizes the alpha-wealth allocation in the online setting.
- The MDRs for SAST.DD and SAST.OR show a downward trend. This is because we start with no alpha-wealth; both oracles tend to be strict in the beginning, e.g. the first rejection must satisfy $\mathrm{Clfdr} < \alpha$.

[Figure 2.2 about here. Caption: Simulation results for Settings 1 and 2. Signal proportions are varied in a block fashion (block structured $\pi$) and kept constant ($\pi = 0.05$), respectively; various signal strengths ($\mu$ from 2.6 to 4.2) are investigated as well. Panels display the intermediate FDR and MDR levels over time for the Oracle, LOND, LORD++ and DD methods. Our data-driven and oracle procedures provide significantly more power while controlling the FDR under the nominal level in comparison with others.]

[Figure 2.3 about here. Caption: Simulation results for Settings 3 and 4. Signal proportions are varied in linear ($\pi$ from 0 to 0.5) and sine ($\pi$ from 0 to 0.5) patterns, respectively. Panels display the intermediate FDR and MDR levels over time for the Oracle, LOND, LORD++ and DD methods. Our data-driven and oracle procedures provide significantly more power while controlling the FDR under the nominal level in comparison with others.]

2.4.3 Effects of the barrier

This section presents a toy example to illustrate that the barrier, which aims to prevent the "piggybacking" issue (Ramdas et al., 2017), can greatly reduce the MDR by allocating the existing alpha-wealth in a more cost-effective way. Consider the previous block-structured setting (Setting 1 in Section 2.4.2).
Figure 2.4 shows the FDR and MDR comparisons for the following methods at FDR level $\alpha = 0.05$: (i) the oracle SAST rule (oracle); (ii) the oracle SAST rule with no barrier (oracle nob); (iii) the data-driven SAST rule with estimated parameters (DD, Section 2.3); (iv) the data-driven SAST rule with no barrier (DD nob).

[Figure 2.4 about here. Caption: The incorporation of the barrier greatly reduces the MDR levels.]

We can see from the comparison that although the FDR levels of the two oracle methods are roughly the same, the MDR levels are greatly reduced by incorporating the barrier (hence the alpha-wealth is invested more efficiently). The same pattern can be observed for the two data-driven procedures.

2.5 Applications

Online FDR rules are useful for a wide range of scenarios. We discuss two applications, respectively for anomaly detection in large-scale time series data and genotype discovery under the QPD framework.

2.5.1 Time series anomaly detection

The NYC taxi dataset can be downloaded from the Numenta Anomaly Benchmark (NAB) repository (Ahmad et al., 2017), which contains useful tools and datasets for evaluating algorithms for anomaly detection in streaming, real-time applications. The dataset records the counts of NYC taxi passengers every 30 minutes from July 1, 2014 to January 31, 2015, during which period five known anomalies occurred (the NYC marathon, Thanksgiving, Christmas, New Year's Day and a snow storm). In Figure 2.5, we plot the time series, with the known anomalous intervals displayed in red rectangles.

[Figure 2.5 about here. Caption: NYC taxi passenger count time series from July 1st 2014 to Jan 31st 2015. Blue lines are Loess-smoothed time series indicating the overall trend change.]

We formulate the anomaly detection problem as an online sequential multiple testing problem. The basic setup can be described as follows. The null hypothesis $H_t$ corresponds to no anomaly at time $t$. We claim that an anomaly occurs at $t$ if $H_t$ is rejected. A rejection within the red intervals is considered to be a true discovery.

The application of online FDR rules requires summarizing the stream of count data as a sequence of p-values or Clfdr statistics. However, directly calculating the p-values based on this dataset would be problematic as the data demonstrate strong trend and seasonality patterns. We first use the R package stlplus to carry out an STL decomposition (Seasonal-Trend decomposition using Loess smoothing; Cleveland et al., 1990) to remove the seasonal and trend components. The residuals, displayed in the top three rows of Figure 2.6, are standardized and modeled using a two-component mixture (2.2.1). However, as can be seen from the histogram at the bottom of Figure 2.6, the null distribution is approximately normal but deviates from a standard normal. Following the method in Jin and Cai (2007), we estimate the empirical null distribution as N(0.028, 0.618). We apply the BH procedure (pretending all observations are seen at once), LOND, LORD++ and SAST.DD at FDR level 0.0001. For the SAST.DD method, the neighborhood size $d$ and the initial burn-in period are both chosen to be 500. In calculating the Clfdr, $f_0(x)$ is taken as the density of the estimated empirical null $\hat F_0$.
Moreover, the p-values are obtained by the formula $P_i = 2\hat F_0(-|Z_i|)$, where the z-scores are computed based on the residuals. Figure 2.7 summarizes the anomaly points detected by the different methods. We can see that for the several labeled anomaly time periods, SAST detects more points than the other methods. Table 2.3 summarizes the total number of rejections within the labeled time windows.

Table 2.3: Discovery comparison, NYC taxi data. Number of discoveries made by various online and offline FDR procedures, nominal FDR level at 0.0001.
    Method                       Number of discoveries
    Offline BH procedure         179
    Online SAST.DD (proposed)    201
    Online LOND                  137
    Online LORD++                178

It may appear counter-intuitive that SAST, being an online procedure, rejects more than the offline BH procedure. The reason is that for this dataset the anomalies appear in clusters. This structural information is captured by the Clfdr statistic, which forms the building block of SAST and leads to improved power in detecting structured signals appearing in blocks (Section 2.2.3).

[Figure 2.6 about here. Caption: NYC taxi data. Top three rows: time series of the remainder component from the STL decomposition, with the known anomaly regions marked in red rectangles. Bottom row: histogram of the remainder term from the STL decomposition; the red curve indicates the estimated empirical null distribution N(0.028, 0.618).]

[Figure 2.7 about here. Caption: Anomaly detection. Anomaly points detected by various algorithms; our data-driven SAST procedure detects the most anomaly points within the labeled windows marked by red rectangles. Nominal significance level chosen as 0.0001.]

2.5.2 IMPC dataset: genotype discovery

In this section, we demonstrate the SAST procedure on a real dataset from the International Mouse Phenotyping Consortium (IMPC). This dataset, which has been analyzed in Karp et al. (2017), involves a large study to functionally annotate every protein coding gene by exploring the impact of gene knockouts. This dataset and the resulting family of hypotheses are constantly growing as new results come in. Karp et al. (2017) tested both the roles of genotype and sex as modifiers of genotype effects, resulting in two sets of p-values: one set for testing genotype effects, and the other for sexual dimorphism. This dataset has been widely used for comparing online FDR algorithms. Currently it is available as one of the application datasets in the R package OnlineFDR that implements methods such as LORD, LOND and LORD++.

In order to implement our proposed SAST procedure, we need the original z-scores instead of p-values. However, the directions of effects cannot be determined based on the p-values alone. Hence, we transform the p-values into z-scores by introducing a Bernoulli random variable to ensure asymptotic symmetry around 0:

$$z = X\,\Phi^{-1}(p/2) - (1-X)\,\Phi^{-1}(p/2), \quad \text{where } X \sim \mathrm{Bernoulli}(0.5).$$

(A small sketch of this conversion is given at the end of this subsection.) We recommend that in the future the biomedical community report, in addition to p-values, the effect sizes, so that the direction and magnitude of an interesting signal are also known; in fact, converting z-scores to p-values may lead to loss of information (cf. Sun and Cai, 2007).

Table 2.4: Discovery comparison, IMPC data. Number of discoveries made by various online and offline FDR procedures for the IMPC dataset on genotypes, nominal FDR level at 0.05.
    Method                      Genotype discoveries    Method type
    SAST                        12975                   online
    BH                          12907                   offline
    LORD++                      8517                    online
    LOND                        2905                    online
    Fixed threshold 0.0001      4158                    online

Table 2.4 summarizes the total number of discoveries made by each method. We can see that SAST makes more discoveries than the other alpha-investing methods. Similar to the analysis in Section 2.5.1, SAST rejects more hypotheses than the offline BH procedure. One possible explanation is that Clfdr is more powerful than p-values since it captures useful structural information in the data stream.
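As referenced above, the p-value-to-z-score conversion can be sketched in R as follows; the function name is ours and the random seed is included only for reproducibility.

```r
# Sketch of the randomized p-value-to-z-score conversion used for the IMPC data:
#   z = X * Phi^{-1}(p/2) - (1 - X) * Phi^{-1}(p/2),  X ~ Bernoulli(0.5).
p_to_z <- function(p, seed = 1) {
  set.seed(seed)
  X <- rbinom(length(p), 1, 0.5)              # random sign indicator
  X * qnorm(p / 2) - (1 - X) * qnorm(p / 2)   # symmetric z-scores around 0
}
```

Since qnorm(p/2) is negative for small p, the construction assigns each hypothesis a large |z| with a random sign, so the resulting z-scores are symmetric around zero under the null.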
2.6 Proof of main theorems 2.6.1 Proof of Theorem 1 Note that the Clfdr is dened as Clfdr i =P( i = 0jX i ). Then by the denition of FDR and double expectation theorem, we have: FDR t =E X ( (jR t j_ 1) 1 X i2Rt Clfdr i ) : By construction of the decision rule, (jR t j_1) 1 P i2Rt Clfdr i for all realization ofX X X. It follows that FDR t . 2.6.2 Proof of Theorem 2.3.3 We need the following lemma: Lemma 2.6.1. Suppose a n p ! 0 andja n j is bounded for all n, then lim n!1 P n i=1 a i n p ! 0. The proof of lemma 2.6.1 is elementary thus omitted. By denition of our algorithm, if [ Clfdr t 60 then t = 1. Note that, for any > 0, P ( 1 X i=1 I(Clfdr i <)<1 ) =P ( 1 [ M=1 1 X i=1 I(Clfdr i <)<M ) 1 X M=1 P ( 1 X i=1 I(Clfdr i <)<M ) : Note that Clfdr i is a random variable from random mixture model (2.2.1) with a non-vanishing proportion of nonzero signals, we have P ( 1 X i=1 I(Clfdr i <)<M ) = 0 for every M. We havePf P 1 i=1 I(Clfdr i <)<1g = 0. Now, P 1 i=1 I( [ Clfdr i )<1 would implyj [ Clfdr i Clfdr i j> innitely many times. By Proposition 2.3.2,P(j [ Clfdr i Clfdr i j>)! 0. It follows thatP n P 1 i=1 I( [ Clfdr i )<1 o = 0; hencejR t j!1. By Proposition 2.3.2 and Lemma 2.6.1, we have P i2Rt [ Clfdr i Clfdr i jR t j p ! 0: Finally, the operation of Algorithm 2 implies that P i2Rt [ Clfdr i jR t j ; It follows that FDR( ) =E X P i2Rt Clfdr i jR t j E X P i2Rt Clfdr i jR t j =E X " P i2Rt [ Clfdr i jR t j # +o(1) +o(1): 2.6.3 Proof of theorem 3 Note when both f t and t are xed over time, the Clfdr statistic reduces to Lfdr i := (1)f0(xi) f(xi) . The optimal threshold in the oine simultaneous testing setup would be independent of time t and the chosen neighborhood. The oracle oine rule coincides with the oracle procedure described in Section 3.2 of Sun and Cai (2007). We now introduce some notations: ^ U t ( ) =t 1 P t i=1 ( [ Clfdr (i) )I \ fClfdr (i) < g U t ( ) =t 1 P t i=1 (Clfdr (i) )IfClfdr (i) < g. U t 1 ( ) =Ef(Clfdr )IfClfdr < gg. 1 = supf 2 (0; 1);U t 1 ( ) 0g is the \ideal" threshold. 61 Note that ^ U t is discrete. To facilitate the theoretical analysis, we dene, for [ Clfdr (i) < < [ Clfdr (i+1) , a continuous version of ^ U t : ^ U t C ( ) = [ Clfdr (i) [ Clfdr (i+1) [ Clfdr (i) ^ U t i + [ Clfdr (i+1) [ Clfdr (i+1) [ Clfdr (i) ^ U t i+1 ; where ^ U t i = ^ U t ( [ Clfdr (i) ). It is easy to verify that ^ U t C is continuous and monotone. Hence its inverse ^ U t;1 C is well dened, continuous and monotone. Next we show the following two results in turn: (i) ^ U t ( ) p !U t 1 ( ) and (ii) ^ U t;1 C (0) p ! 1 . Proof of (i). Note that U t ( ) p ! U t 1 ( ) by the WLLN, so that we only need to establish that ^ U t ( ) p !U t ( ). We need to following lemma: Lemma 2.6.2. Let V i = (Clfdr i )I(Clfdr i < ) and ^ V i = ( [ Clfdr i )If [ Clfdr i < g. Then E ^ V i V i 2 =o(1). Proof of Lemma 2.6.2. Using the denitions of ^ V i and V i , we can show that ^ V i V i 2 = [ Clfdr i Clfdr i 2 I [ Clfdr i ; Clfdr i + [ Clfdr i 2 I [ Clfdr i ; Clfdr i > + (Clfdr i ) 2 I [ Clfdr i > ; Clfdr i : Let us refer to the three sums on the right hand as I, II, and III respectively. By step 2 in the proof of Theorem 2, I =o(1). Then let "> 0, and consider that P [ Clfdr i ; Clfdr i > P [ Clfdr i ; Clfdr i 2 ( ; +") +P [ Clfdr i t; Clfdr i +" PfClfdr i 2 ( ; +")g +P Clfdr i [ Clfdr i >" The rst term on the right hand is vanishingly small as"! 0 because [ Clfdr i is a continuous random variable. The second term converges to 0 by Proposition 2.3.2. Noting that 0 [ Clfdr i 1, we concludeII =o(1). 
In a similar fashion, we can show thatIII =o(1), thus proving the lemma. Let S t = P t i=1 ( ^ V i V i ), by Lemma 2.6.2 and the Cauchy-Schwartz inequality, E n ^ V i V i ^ V j V j o =o(1): 62 It follows that Var t 1 S t =t 2 Var(S t )t 2 t X i=1 E ^ V i V i 2 +O 0 @ 1 t 2 X i;j:i6=j E n ^ V i V i ^ V j V j o 1 A =o(1): By Proposition 2.3.2,E t 1 S t ! 0, applying Chebyshev's inequality, we obtain t 1 S t = ^ U t U t p ! 0; establishing (i). Proofof(ii). Since ^ U t C is continuous, for any> 0, we can nd> 0 such that ^ U t;1 C (0) ^ U t;1 C n ^ U t C ( 1 ) o < " if ^ U t C ( 1 ) <. It follows that P n ^ U C ( 1 ) > o P n ^ U t;1 C (0) ^ U t;1 C n ^ U t C ( 1 ) o >" o : Proposition 2.3.2 and the WLLN imply that ^ U t C ( ) p !U t 1 ( ): Note that U t 1 ( 1 ) = 0, then, P ^ U t C ( 1 ) > ! 0: Hence, we have ^ U t;1 C (0) p ! ^ U t;1 C n ^ U t C ( 1 ) o = 1 ; (2.6.7) completing the proof of (ii). 63 2.7 Proof of propositions 2.7.1 Proof of Proposition 2.3.1 LetA =fx :P 0 (x)>g, where P 0 (x) is the p-value of x. Then (1) 1 P(P t >) = (1) 1 Z A f 0 (x)(1 t ) + t f 1t (x)dx (1) 1 Z A f 0 (x)(1 t )dx = (1 t ): Hence t = 1(1) 1 P(P t >) 1(1 t ) = t : By denition of Clfdr t , we have Clfdr t Clfdr t . Let OR be the decision rule described in Algorithm 1 with Clfdr t used in place of Clfdr t . Let R be the index set of hypotheses rejected by OR . The FDR of OR is FDR( OR ) =E P iR (1 i ) jR_ 1j =E X X X E P iR (1 i ) jR_ 1j X =E X 1 jR_ 1j X i2R Clfdr i ! : Since Clfdr t Clfdr t , it follows that FDR( OR )E X 1 jR_ 1j X i2R Clfdr i ! : The last inequality is due to the denition of OR which guarantees that 1 jR_ 1j X i2R Clfdr i : 2.7.2 Proof of Proposition 2.3.2 Under the in-ll model, we write ^ f t (x) = P t1 j=td+1 K ht (1j=t)K hx (x j x) P t1 j=td+1 K ht (1j=t) : We rst state 3 lemmas that will be proved in turn. 64 Lemma 2.7.1. Under the assumption of Proposition 2.3.2,E R n ^ f t (x)f t (x) o 2 dx! 0: Lemma 2.7.2. Under the assumptions of Proposition 2.3.2, k^ t k 2 = Z f^ t (x) (x)g 2 dx! 0: Lemma2.7.3. Let ^ t , ^ f t (x), and ^ f 0 be estimates such thatEk^ t t k 2 ! 0,Ek ^ f t (x)f t (x)k 2 ! 0, Ek ^ f 0 f 0 k 2 ! 0, and thenEk [ Clfdr t Clfdr t k 2 ! 0. By Lemma 2.7.1 and Lemma 2.7.2, together with the fact thatf 0 is known, it follows from Lemma 2.7.3 thatEk [ Clfdr t Clfdr t k 2 ! 0: Since convergence in second order mean implies convergence in probability, we have [ Clfdr t p ! Clfdr t : 2.7.3 Growing domain version of Proposition 2.3.2 In the growing domain framework, Proposition 2.3.2 takes the following form: Proposition 2.7.4. Suppose: (A1'): For any> 0,9T such that for all integersi;j on the interval hj t p log(t)h t k ;t i ,t>T , we have R jf i (x)f j (x)jdx<. (A2'): h x ! 0, h t t and h x h t !1. (A3'): f j (x)<C and R jf 00 j (x)jdx<C for all j. (A4'): dch t for some c> 0. We have [ Clfdr t p ! Clfdr t : The proof follows the same line as the proof of proposition 2.3.2, thus omitted. 2.7.4 Proof of Lemma 2.7.1 We rst computeE ^ f t (x)f t (x). 
Note thatEK hx (X j x) = R K(z)f j (xh x z)dz: Using Taylor expansion, we have f j (xh x z) =f j (x)h x zf 0 j (x) + 1 2 h 2 x z 2 f 00 j (x) +o(h 2 x ): It follows that EK hx (X j x)f t (x) =f j (x)f t (x) + 1 2 h 2 x f 00 j (x) Z z 2 K(z)dz +o(h 2 x ): 65 Let A = P t1 j=td+1 n K ht (1j=t) 1 2 h 2 x f 00 j (x) R z 2 K(z)dz +K ht (1j=t)o(h 2 x ) o : Then E ^ f t (x)f t (x) = P t1 j=td+1 K ht (1 j t )ff j (x)f t (x)g +A P t1 j=td+1 K ht (1 j t ) : Z jE ^ f t (x)f t (x)jdx =O ( P t1 j=td+1 K ht (1j=t) R jf j (x)f t (x)jdx P t1 j=td+1 K ht (1j=t) ) +o(h 2 x )! 0: To see why the last expression goes to 0, note that for any > 0; by Assumption (A1), we can take such that for all i> (1)t, R jf i (x)f t (x)jdx<. Hence, P t1 j=td+1 K ht (1j=t) R jf j (x)f t (x)jdx P t1 j=td+1 K ht (1j=t) = P b(1)tc j=td+1 K ht (1j=t) R jf j (x)f t (x)jdx P t1 j=td+1 K ht (1j=t) + P t1 j=b(1)tc+1 K ht (1j=t) R jf j (x)f t (x)jdx P t1 j=td+1 K ht (1j=t) P b(1)tc j=td+1 K ht (1j=t) P t1 j=td+1 K ht (1j=t) +: Note that h t ! 0, we conclude that P b(1)tc j=td+1 K ht (1j=t) = O n R 1 K ht (x)dx o ! 0: Also, since dh t !1 as t!1, we have P t1 j=td+1 K ht (1j=t)c 0 h 1 t for some c 0 . Thus P b(1)tc j=td+1 K ht (1j=t) P t1 j=td+1 K ht (1j=t) ! 0, and lim t!1 P t1 j=td+1 K ht (1j=t) R jf j (x)f t (x)jdx P t1 j=td+1 K ht (1j=t) = 0: It follows from the boundedness ofE ^ f t and f t (x) that Z jE ^ f t (x)f t (x)j 2 dx! 0: (2.7.8) Next we compute Var n ^ f t (x) o : VarfK hx (X j x)g = 1 h x Z K(z) 2 f j (xh x z)dzff j (x) +o(1)g 2 = 1 h x Z K(z) 2 (f j (x) +o(1))dzff j (x) +o(1)g 2 =O 1 h x Z K(z) 2 dzf j (x) +o(h 1 x ): 66 Some additional calculations give Var ^ f t (x) = P t1 j=td+1 fK ht (1j=t)g 2 O n 1 hx R K(z) 2 dzf j (x) o f P t1 j=td+1 K ht (1j=t)g 2 =O 2 6 4(th x ) 1 R 1 0 K 2 ht (x)dx n R d=t 0 K ht (x)dx o 2 3 7 5 =O 2 4 (th x h t ) 1 = ( Z d=t 0 K ht (x)dx ) 2 3 5 Therefore, by assumption (A3) and (A4), Z Var ^ f t (x)dx =O (th t h x ) 1 ! 0: (2.7.9) SinceE R f ^ f t (x)f t (x)g 2 = R fE ^ f t (x)f t (x)g 2 +Varf ^ f t (x)gdx; (2.7.8) and (2.7.9) together imply thatE R f ^ f t (x)f t (x)g 2 ! 0. 2.7.5 Proof of lemma 2.7.2 Dene ^ P(P t >) := P i2T t () K h t (1i=t) P t1 i=td+1 K h t (1i=t) : Let P t be the p-value of X t . We will show k ^ P(P t >)P(P t >)k 2 ! 0: (2.7.10) We rst rewrite the term P i2Tt() K ht (1i=t) P t1 i=td+1 K ht (1i=t) = P i2Tt() K ht (1i=t)I(P (x i )>) P t1 i=td+1 K ht (1i=t) : By Lemma 2.7.1 we have,E R B ^ f t (x)f t (x)dx! 0 for everyB. In particular, takeB =fx :P (x)> g use the denition of ^ f t we have P t1 j=td+1 K ht (1j=t)E R P(x)> K hx (x j x)dx P t1 j=td+1 K ht (1j=t) !P(P t >): (2.7.11) To show the lemma, it is sucient to show E Z P(x)> K hx (x j x)dx!EfI(P (x j )>)g: (2.7.12) 67 To see why (2.7.12) implies (2.7.10), note that (2.7.12) implies P t1 i=td+1 (1j=t)E R P(x)> K hx (x j x)dx P t1 i=td+1 K ht (1j=t) ! P t1 i=td+1 K ht (1j=t)EfI(P (x j )>)g P t1 i=td+1 K ht (1j=t) : (2.7.13) Next note that E ( P t1 i=td+1 K ht (1i=t)I(P (x i )>) P t1 i=td+1 K ht (1i=t) ) = P t1 i=td+1 K ht (1j=t)EfI(P (x j )>)g P t1 i=td+1 K ht (1j=t) ; and Var ( P t1 i=td+1 K ht (1i=t)I(P (x i )>) P t1 i=td+1 K ht (1i=t) ) =O 2 6 4t 1 R 1 0 K 2 ht (x)dx n R d=t 0 K ht (x)dx o 2 3 7 5 =O 2 4 (th t ) 1 = ( Z d=t 0 K ht (x)dx ) 2 3 5 : By (A4), we have n R 1 0 K ht (x)dx o 2 c for some constant c> 0. Now th t !1, implies Var P t1 i=td+1 K h t (1i=t)I(P(xi)>) P t1 i=td+1 K h t (1i=t) ! 0. By Chebyshev's inequality, P t1 i=td+1 K ht (1i=t)I(P (x i )>) P t1 i=td+1 K ht (1i=t) p ! 
P t1 i=td+1 K ht (1j=t)EfI(P (x j )>)g P t1 i=td+1 K ht (1j=t) : Combining (2.7.13) , (2.7.11), (A1) and (A2), ^ P(P t >) = P t1 i=td+1 K ht (1i=t)I(P (x i )>) P t1 i=td+1 K ht (1i=t) p !P(P t >): Therefore (2.7.10) follows. We now show (2.7.12). Let = p h x . Write E Z P(x)> K hx (x j x)dx =E Z P(x)>;jxjxj< K hx (x j x)dx +E Z P(x)>;jxjxj> K hx (x j x)dx: Use the normal tail bound, Z P(x)>;jxjxj> K hx (x j x)dx Z jxjxj> K hx (x j x)dx 2 expf1=(2h x )g! 0: 68 DeneA =fx j :P (x j )>g, let f j be the density function for X j . Note that E Z P(x)>;jxjxj< K hx (x j x)dx = Z 1 1 Z P(x)>;jxjxj< K hx (x j x)f j (x j )dxdx j = Z A Z jxjxj< K hx (x j x)dxf j (x j )dx j = Z A [1Of2 exp(1=(2h x ))g]f j (x j )dx j ! Z A f j (x j )dx j =EfI(P (x j )>)g: Hence (2.7.12) is proved. The lemma follows. 2.7.6 Proof of lemma 2.7.3 Note that f t (x) is continuous and positive on the real line, then there exists K 1 = [M;M] such thatP(x2K c 1 )! 0 as M!1. Let inf x2K1 f t (x) =l 0 and A ft =fx :j ^ f t (x)f t (x)jl 0 =2g. Note that Ek ^ f t (x)f t (x)k 2 (l 0 =2) 2 P(A ft ); thenP(A ft )! 0; we claim thatf t and ^ f t are bounded below by a positive number for larget except for an event that has a low probability. Similar arguments can be applied to the upper bound of ^ f t and f t , as well as the cases for f 0 and ^ f 0 . Therefore, we conclude that f 0 ; ^ f 0 ;f t , and ^ f t are all bounded in the interval [l a ;l b ]; 0<l a <l b <1 for large t except for an event A that has probability tends to 0. Hence 0<l a < inf z2A minff 0 ; ^ f 0 ;f t ; ^ f t g< sup z2A c maxff 0 ; ^ f 0 ;f t ; ^ f t g<l b <1: Next note that [ Clfdr t Clfdr t = ^ f 0 f t ( t ^ t ) + (1 t )f( ^ f 0 f 0 ) + (1 t )f 0 (f t ^ f t ) ^ f t f t ; we conclude that [ Clfdr t Clfdr t 2 c 1 ( t ^ t ) 2 +c 2 ^ f 0 f 0 2 +c 3 ^ f t f t 2 in A c : It is easy to see thatk [ Clfdr t Clfdr t k 2 is bounded by some constant L, then Ek [ Clfdr t Clfdr t k 2 LP(A ) +c 1 Ek^ t t k 2 +c 2 Ek ^ f t f t k 2 +c 3 Ek ^ f 0 f 0 k 2 : According to the assumptions, we further have that for a given> 0, there existsM2Z + such that 69 we can ndA ,P(A )<=(4L), and at the same timeEk^ t t k 2 <=(4c 1 ),Ek ^ f t f t k 2 <=(4c 2 ), andEk ^ f 0 f 0 k 2 <=(4c 3 ) for alltM. Consequently, we haveEk [ Clfdr t Clfdr t k 2 < fortM, and the desired result follows. 2.8 Optimality of the Clfdr rule in simultaneous testing The optimality of the Clfdr rule in simultaneous testing is summarized in the following proposition. The idea in the proof essentially follows that in Cai et al. (2019). We provide it here for completeness. Proposition 2.8.1. Consider a class of decision rules ( ) =fI(CLfdr i < ) : 1 i mg for simultaneous testing of hypothesesfH i : i2N d (t)g in the neighborhood of t. Denote Q OR ( ) the marginal FDR of ( ). If<Q OR (1), then the oracle threshold OR := supf :Q OR ( )g exists and is unique. Dene the oracle rule OR =fI(CLfdr i OR ) : 1img. Then OR is optimal for simultaneous testing in the sense that mFDR ( OR ); ETP (d d d ) ETP ( OR ) for alld d d such that mFDR(d d d ): Proof. The proof has two parts. In (a), we establish two properties of the testing rule that thresholds the Clfdr at an arbitrary ,fI(Clfdr i < ) : 1 i mg. We show that it produces mFDR < for all and that its mFDR is monotonic in t. In (b) we show that when the threshold is OR , the testing rule, OR , exactly attains the mFDR level and is optimal amongst all valid testing procedures controls mFDR at level . Part(a). For the testing rulefI(Clfdr i < ) : 1 i mg, let Q OR ( ) = . 
We rst show that < . Since Clfdr i =P ( i = 0jX i =x i ), then E ( m X i (1 i ) i ) =E X "( m X i E jX (1 i ) i )# =E X m X i Clfdr i i ! (2.8.14) where notation E is the expected value taken over (X;), notation E X is the expectation taken over the distribution of (X), andE jX is the expectation taken over, holding (X) xed. We use (2.8.14) in the denition of Q OR ( ) to get E X ( m X i=1 (Clfdr i )I(Clfdr i ) ) = 0: (2.8.15) The equality above implies that < . To see this, consider that all potentially non{zero terms arise when Clfdr i , and when this is the case, either (i) Clfdr i < , (ii) Clfdr i < , or 70 (iii) Clfdr i < . Notice (i) produces zero or positive terms on the LHS of (2.8.15), (ii) produces zero or negative terms, and (iii) produces negative terms. If , then only (iii) is possible, which contradicts the RHS. Thus, the testing rule is valid. Next, we show that Q OR ( ) is nondecreasing in . That is, letting Q( j ) = j , if 1 < 2 , then 1 2 . We argue by contradiction. Suppose that 1 < 2 but 1 > 2 . First, it cannot be that I(Clfdr i < 2 ) = 0 for all i, because that implies 1 = 2 (both equal 0). Next, since 1 < 2 , (Clfdr i 2 )I(Clfdr i < 2 ) = (Clfdr i 2 )I(Clfdr i < 1 ) + (Clfdr i 2 )I( 1 Clfdr i < 2 ) and rewrite (Clfdr i 2 )I(Clfdr i < 1 ) = (Clfdr i 1 )I(Clfdr i < 1 ) + ( 1 2 )I(Clfdr i < 1 ): If 2 < 1 , then (Clfdr i 2 )I(Clfdr i < 2 )(Clfdr i 1 )I(Clfdr i < 1 ) + ( 1 2 )I(Clfdr i < 1 ) (2.8.16) + (Clfdr i 1 )I( 1 Clfdr i < 2 ): It follows that E ( m X i=1 (Clfdr i 2 )I(Clfdr i < 2 ) ) > 0: To see this, consider the expectation of the sum over m tests for the three RHS terms of (2.8.16), which we reference as (i), (ii), and (iii) respectively. First, (i) is zero because of (2.8.15). Then for each Clfdr i < 2 , either (ii) is positive because 2 < 1 , or (iii) is positive because 1 < 1 . However, (2.8.15) establishes thatEf P m i=1 (Clfdr i 2 )I(Clfdr i < 2 )g = 0, leading to a contra- diction. Hence, 1 < 2 : Part(b). The oracle threshold is dened as OR = sup f 2 (0; 1) : Q OR ( ) g. First, let =Q OR (1), which represents the largest mFDR level that the oracle testing procedure can be. By part (a), Q OR ( OR ) is non{decreasing. Via the squeeze theorem, for all < , this implies that Q OR ( OR ) =. Next, consider the power of OR =fI(Clfdr i < OR ) : 1img compared to that of an arbi- trary decision rule d = (d 1 ;:::;d m ) such that mFDR(d ) . Using the previous result from part(a), it follows that E ( m X i=1 (Clfdr i ) i OR ) = 0 and E ( m X i=1 (Clfdr i )d i ) 0: 71 Take the dierence of the two expressions to obtain E ( m X i=1 ( i OR d i )(Clfdr i ) ) 0: (2.8.17) Next apply a transformation f(x) = (x)=(1x) to each i OR . Note that because f 0 (x) = (1 )=(1 x) 2 > 0, f(x) is monotonically increasing. Then order is preserved: if Clfdr i < OR then f(Clfdr i ) < f( OR ) and likewise for Clfdr i i > OR . This means we can rewrite i OR = I [f(Clfdr i )=(1 Clfdr i )g< OR ], where OR = ( OR )=(1 OR ). It will be useful to note that, from part (a), we have < OR < 1, which implies that OR > 0. Then, E " m X i=1 ( i OR d i )f(Clfdr i ) OR (1 Clfdr i )g # 0: (2.8.18) To see this, consider that if i OR d i 6= 0, then either (i) i OR > d i or (ii) i OR < d i . If (i), then i OR = 1 and it follows thatf(Clfdr i )=(1 Clfdr i )g < OR . If (ii), then i OR = 0 and f(Clfdr i )=(1 Clfdr i )g OR . For both cases, ( i OR d i )f(Clfdr i ) OR (1 Clfdr i )g 0: Summing over all m terms and taking the expectation yields (2.8.18). 
Combine (2.8.17) and (2.8.18) to obtain 0E ( m X i=1 ( i OR d i )(Clfdr i i ) ) OR E ( m X i=1 ( i OR d i )(Clfdr i ) ) : Finally, since OR > 0, it follows thatE P m i=1 ( i OR d i )(Clfdr i ) > 0. After distributing the ( i OR d i ) term and separating the expectations for the sums of the two decision rules, we apply the denition of ETP () =E P m i=1 i (Clfdr i ) to conclude that ETP ( OR )ETP (d ). 72 Chapter 3 Large Scale Shrinkage Estimation under Markovian Dependence 3.1 Introduction Shrinkage is a useful notion that provides an elegant and powerful framework for compound esti- mation problems. The topic has been extensively studied since the celebrated work of James and Stein (1961). A plethora of in uential results obtained by various researchers show that shrinkage often leads to substantial improvements in the performances of conventional estimators. The opti- mal directions and magnitudes of shrinkage estimators in parametric models have been extensively studied in the literature, see Fourdrinier et al. (2017); Johnstone (2013); Ahmed and Reid (2012); Carlin and Louis (2010) for reviews of related topics. In real-world applications, parametric families of estimators often have limited usage. Nonparametric shrinkage methods, exemplied by Tweedie's formula (Efron, 2011), have received renewed attention in large-scale inference problems where thou- sands of parameters are estimated simultaneously; see Brown and Greenshtein (2009); Jiang et al. (2009); Koenker and Mizera (2014); Saha and Guntuboyina (2017); Dicker and Zhao (2016) for some important recent developments. One essential limitation of existing non-parametric shrinkage methods is that most assume that observations are independently sampled from an underlying distribution. However, observations arising from large-scale estimation problems are often dependent. Ignoring the dependence struc- ture may result in signicant loss of eciency and invalid inference. This chapter aims to develop non-parametric shrinkage estimators under the widely used hidden Markov model and show that the 73 eciency of existing nonparametric methods can be greatly improved by incorporating the Marko- vian dependence structure. Consider the problem of simultaneous estimation of a sequence of dependent parameters that are generated from a Hidden Markov Model (HMM) (Rabiner, 1989; Churchill, 1992). Markovian dependence provides a powerful tool for modeling data arising from a wide range of modern scientic applications where parameters of interest are spatially or temporally correlated. This chapter focuses on an important class of HMMs that have two underlying states: one being the default (null) state where the process is in-control; the other being the abnormal (non-null) state where the process is out-of-control. The observed data are independent conditional on the unknown states. The two- state HMM is a popular model that can be used to describe many data sets collected from various applications. For example, in estimating the Copy Number Variations (CNV) across a genome (Efron and Zhang, 2011; Jiang et al., 2015), there are regions with high and low variations that can be well described by the non-null and null states, respectively. In event analysis application, such as estimating hourly social media trends of a marketing campaign there are dormant (null) and active (non-null) states which can be adaptively incorporated into an HMM for building a better tracker. 
To the best of our knowledge, the important problem of nonparametric shrinkage estimation under dependence has not been studied in the literature. The key idea in our methodology is to combine the ideas in the elegant Tweedie's formula (Brown and Greenshtein, 2009; Jiang et al., 2009; Efron, 2011) and the forward-backward algorithms in HMMs to infer the underlying states, which is further utilized to facilitate fast and robust estimation of the unknown eect sizes (mean parameters) under Markovian dependence. We establish decision-theoretic properties of the proposed estimator and exhibit its enhanced ecacy over popular shrinkage methods developed under the independence assumption. In contrast with existing algorithms in HMMs that proceed with pre-specied families of parametric densities (e.g. Gaussian mixtures), our method is nonparametric and capable of handling a wider class of distributions. Through extensive numerical experiments, we demonstrate the superior performance of our proposed algorithm over existing state-of-the-art methods. The chapter is organized as follows: in Section 2 we formulate the problem, develop an oracle estimator and compare this estimator with classical shrinkage estimators that do not incorporate HMM structure. In Section 3, we propose a data-driven procedure that mimics the oracle estimator and discuss how to overcome the new diculties and challenges in the non-parametric approach. In Section 4, we establish theoretical properties of our proposed estimator and show that the non- parametric approach oers substantial eciency gain over HMM algorithms employing Gaussian mixture models. In Section 5, we present numerical experiments to compare our estimator with 74 existing methods. Section 6 applies the proposed method to three real data examples. Section 7 concludes the chapter with a discussion on interesting extensions. 3.2 Shrinkage estimation in a hidden Markov model Consider a two-state hidden Markov model. Suppose we are interested in estimating the mean vector = ( 1 ; ; n ) based on observed outputX X X = (X 1 ; ;X n ). In HMMs, the observed dataX X X can be viewed as a contaminated version of the eect sizes , where the contaminations are white noises following a zero-mean normal distribution with known variance 2 . The HMM further assumes that i s are independently distributed conditional on the unknown states = ( 1 ; ; n )2f0; 1g n , where i are Bernoulli variables forming a Markov chain. The latent state [ = 0] usually represents the null or the in-control state and [ = 1] corresponds to the non-null or the out-of-control state. For example, in CNV studies, X i is the observed number of repeats in the genome, [ i = 0] represents healthy part of genome, and [ i = 1] represents the part of the genome that is susceptible to perturbation in diseased patients. 3.2.1 Model and notations Formally, the data generation process can be described by a hierarchical model: X i = i + i ; for i = 1;:::;n and i iid N(0; 2 ); (3.2.1) where given i , i are conditionally independent following unknown prior distributions g 0 and g 1 : i j i = 0 iid g 0 and i j i = 1 iid g 1 : We assume that i s follow a spatially homogeneous Markov process with transition probabilities a jk =P ( i =kj i1 =j);j;k2f0; 1g: The above probabilities are unknown and obey standard stochastic constraints 0 < a jk < 1;a j0 + a j1 = 1: Figure 3.2.1 presents a schematic diagram of the model. We aim to nd a decision rule ^ ^ ^ for estimating the unknown mean vector . 
The goal is to minimize the Bayes risk R =Efl(^ ; )g, 75 where l(^ ; ) is the mean squared error (MSE, or l 2 loss): l(^ ; ) =n 1 n X i=1 E( i ^ i ) 2 : Figure 3.1: Schematic diagram of the data generation process. We observe xis only and would like to estimate the is which are related through the Markov chain f1;2; ;ng. Let = (A; ( 0 ; 1 );f 0 ;f 1 ) denote the HMM parameters, whereA is the transition matrix, ( 0 ; 1 ) is the initial distribution, and f k (x) :=f(xj =k) = Z (x)g k ()d: k = 0; 1: We use to denote the normal density with mean 0 and variance 2 with the sux dropped for standard normal density. In practice, we often assume that the null distributiong 0 is either known or can be modeled by a pre-specied parametric family 1 . By contrast, we do not specify any parametric forms ong 1 because we usually do not have sucient knowledge on the non-null process g 1 , which may be dicult to describe using any pre-specied parametric families. 3.2.2 Oracle estimator under independence Ignoring the dependence structure, the optimal solution that minimizes the Bayes risk is given by Tweedie's formula: E( i jx i ) =x i + 2 f 0 (x i ) f(x i ) where f 0 (x) = d dx f(x); (3.2.2) 1 We shall assume that f 0 is in certain parametric form. However, the parameters of f 0 are not necessarily to be known. This is a reasonable assumption, as in real world applications, researchers often have reasonably good ideas about the behavior of an observation from the in-control state. 76 and f is the marginal distribution of all the observed X. The formula rst appeared in Robbins (1955), who attributed the idea to Maurice Kenneth Tweedie. Tweedie's formula can be implemented using empirical Bayes methods for constructing a class of non-parametric estimators (Brown and Greenshtein, 2009; Efron, 2011). The crucial observation is that it works directly with the marginal distribution, which is in particular attractive in large-scale estimation problems since f and f 0 can be well estimated using standard density estimation techniques (Silverman, 1986). 3.2.3 Oracle Estimator under HMM dependence Our idea is to extend Tweedie's formula to correlated observations under the HMM dependence. This can eectively increase the statistical eciency by borrowing strength from adjacent observations. It has been shown in the multiple testing literature that exploiting the dependence structure can greatly improve the power of existing false discovery rate methods (Sun and Cai, 2009; Wei et al., 2009; Sun et al., 2015). We further show in this chapter that dependence structure is also highly informative and can be exploited to improve the accuracy of shrinkage estimators. Our methodological development is divided into two steps. We rst assume that an oracle knows the hidden states and study the oracle rule. Then we discuss the case when the states are unknown and propose data-driven methods to emulate the oracle rule. If the states i are known, then it is natural to apply Tweedie's formula to two separate states: ~ i FOR (x x x) = 1 X k=0 ( x i + 2 f 0 k (x i ) f k (x i ) ) If i =kg: (3.2.3) The oracle estimator (3.2.3) ~ i FOR (x x x) provide a benchmark in shrinkage estimation, i.e. the theo- retically achievable lower limit of the estimation risk. Next we consider a \weaker" oracle estimator that only knows , the collection of hyper- parameters in the HMM. Then the optimal solution is given by the next lemma. Lemma 3.2.1. Consider Model (3.2.1) and assume that HMM parameters are known. 
Then the estimator that minimizes the MSE is given by ^ i FBayes (x x x; ) = 1 X k=0 E( i j i =k;x i )P( i =kjx x x) = 1 X k=0 ( x i + 2 f 0 k (x i ) f k (x i ) ) P( i =kjx x x): (3.2.4) The proof of lemma is straightforward and hence omitted. The formula (3.2.4) can be viewed 77 as an extension of Tweedie's formula under HMM dependence. The next section considers the case where HMM parameters are unknown. We shall develop data-driven estimators and computational algorithms to emulate the Bayes estimator (3.2.4). 3.3 Data-Driven Estimator and Computational Algorithms We discuss how to estimatef 0 ; f 1 andP( i =kjx x x) to implement the Bayes estimator. The proposed estimator is a non-parametric Tweedie-based shrinkage estimator under dependence (TD). In light of (3.2.4), we consider a class of estimators in the form: ^ FTD i = ^ p i ( x i + 2 ^ f 0 1 (x i ) ^ f 1 (x i ) ) + (1 ^ p i ) ( x i + 2 ^ f 0 0 (x i ) ^ f 0 (x i ) ) ; (3.3.5) where ^ p i are estimates of the true conditional probability P ( i = 1jX), and ^ f 0 and ^ f 1 are estimates of f 0 and f 1 . Next we describe an algorithm for constructing estimates of the unknown quantities in (3.3.5). 3.3.1 The modied Baum-Welch algorithm The most well-known method for constructing estimators of the form (3.3.5), under the conventional setting with pre-specied parametric families, is the Baum-Welch (BW) algorithm (Rabiner, 1989; Yang et al., 2015; Baum et al., 1970). We emphasize that the BW algorithm must proceed with user specied parametric forms forf k 's. However, whilef 0 , the in-control ( = 0) distribution, can be well modeled by parametric densities, the out-of-control observations may not be easily approximated by parametric densities. For example, the Gaussian mixture model with a xed number of components is not suitable for approximating very heavy tailedf 1 . To overcome the issue, we describe an HMM- based Tweedie (HMMT) estimator that employs parametricf 0 and non-parametricf 1 . The essential component in our estimator is a generalized BW algorithm that updates the estimates of f 0 and f 1 iteratively based on posterior probability estimates of the latent states. To avoid the identiability issues, assume f 0 is unimodal and 0 > 0:5. These assumptions are usually reasonable in applications. In our HMMT estimator,f 0 is assumed to be a Gaussian density [;] with possibly unknown location and scale . The Gaussian model can be generalized to Gaussian mixtures as well as other parametric families without essential diculty. In contrast with existing HMM algorithms that utilize parametric assumptions on f 1 , we consider a nonparametric approach to estimating f 1 . The proposed methodology employs a class of weighted kernel density 78 estimators ^ f 1 [w;h](x) = n X i=1 w i h K x i x h ; (3.3.6) where P n i=1 w i = 1, h is the bandwidth and K() is a standard Gaussian kernel function. Other choices of kernel such as the Cauchy kernel can also be used particularly in applications that are sensitive to the tails of the density. With an initial choice ofA (0) , ^ f (0) 0 and ^ f (0) 1 , the forward-backward propagation of the Baum- Welch algorithm provides updated estimates of the posterior probabilities ^ p (0) i . These estimates can be utilized to update ^ f 1 using new weights w (0) i = ^ p (0) i = P n i=1 ^ p (0) i and ^ f 0 : ^ f 1 (1) = ^ f 1 [ ^ w (0) ;h]; and ( (1) ; (1) ) = arg max 2R;0 n X i=1 1 ^ p (0) i log (X i ) The process is repeated until convergence. 
3.3.2 Choice of h Choosing an appropriate bandwidth h is crucial for constructing an ecient estimator. If h is too small, then we will end up with estimating i by the naive estimatorx i . Ifh is too large, then it will be dicult to identify ^ f 1 , which leads to a severely biased and inaccurate estimator. This section describe a cross validation method for choosing the tuning parameter h. The rst step is to split the observed sample X X X into (U U U;V V V ) by artically adding independent Gaussian noise: U U U = X X X + Z Z Z; V V V = X X X (1=)Z Z Z, where Z Z Z d = N(0;I n ). Note that U U U;V V V are normally distributed random eects with same mean but dierent variances 2 u = (1 + 2 ) 2 and 2 v = (1 + 2 ) 2 . By construction,U U U andV V V are mutually independent conditioned on . The key idea in the second step is to useU U U for estimation andV V V for bandwidth selection. One can think of as a measure of how much data we use for estimation (Brown et al., 2013). When = 0, the entire data is used for estimation and none for hyper-parameter calibration. This chapter uses = 10%. The steps are repeated for a list of prexed h values yielding estimates of the posterior proba- bilitiesf^ p i [h] : 1ing of the latent states and marginal densities estimates ^ f 1 [h] and ^ f 0 [h]. For each h value, we estimate the MSE as d mse[h] = n X i=1 2 4 X j2f0;1g ^ q i;j [h] ( u i + 2 u ^ f 0 j [h](u i ) ^ f j [h](u i ) ) v i 3 5 2 ; (3.3.7) 79 where, ^ q 1;j [h] = ^ p 1 [h] and ^ q 0;j [h] = 1 ^ p 1 [h]. The optimal value of bandwidth is chosen as: ^ h = arg min h d mse[h]: Subsequently, the HMMT estimator is computed by running the generalized BW algorithm with the bandwidth set at ^ h and = 0. 3.3.3 Initialization and the HMMT estimator Another important caveat about Algorithm 1 is the initialization step. To prevent the algorithm getting trapped in a local maximum, a good initialization is important. For this purpose we use the algorithm developed in Ko et al. (2015) for estimation in an HMM with unknown number of changing points. As the probability of the process being in control 0 0:5, (0) is estimated by the mode of the component with the largest probability which is subsequently attributed to ^ f (0) 0 . The remaining states as estimated via a Dirichlet process model contributes towards the out-of-process marginal densities ^ f (0) 1 . It was found that such an initialization produced reasonable values of ^ A (0) . Henceforth for simplicity, we assume = 1. The detailed estimation procedure is summarized by Algorithm 1 below. 80 Algorithm 1 HMM-Tweedie Estimator 1: Sample splitting. With = 0:1, add independent noiseZ Z ZN(0;I n ): U U U =X X X +Z Z Z; V V V =X X X 1 Z Z Z. UseU U U for estimation andV V V for bandwidth selection. 2: Intialization Use the algorithm in Ko et al. (2015) to get estimates (0) =f ^ A (0) ; (0) 0 ; (0) 1 ; ^ f (0) 0 ; ^ f (0) 1 g. 3: E step: Fort 1, compute the following in thet iteration using the previous iteration estimates. For i = 1;:::;n and j;k = 0; 1 compute a. The forward variable: (t) i (j) =P[ (t1) ](U 1 =u 1 ;:::;U i =u i j i =j); b. The backward variable: (t) i (j) =P[ (t1) ](U i+1 =u i+1 ;:::;U n =u n j i =j); c. Posterior probabilities: ^ p (t) i (j) = (t) i (j) (t) i (j)=( (t) i (0) (t) i (0) + (t) i (1) (t) i (1)); (t) i (j;k) =P (t1)( i =j; i+1 =kjU U U) = ^ p (t) i (j)A (t1) jk ^ f (t1) k (u i+1 ) (t) i+1 (k)= (t) i (j) 4: M step: Update the parameters and the non-parametric density estimate a. (t) j =n 1 P n i=1 ^ p (t) i (j) b. 
A (t) jk = P n1 i=1 ^ (t) i (j;k)=( P n1 i=1 ^ p (t) i (j)) c. ^ w (t) i = ^ p (t) i (1)=( P n i=1 ^ p (t) i (1)) d. ^ f (t) 1 [h;u u u](y) =h 1 2 u P n i=1 ^ w (t) i K(h 1 1 u (yu i )). e. ^ f (t) 0 =N( (t) ; (t) ) where, ( (t) ; (t) ) = arg max 2R;0 P n i=1 ^ p (t) i (0) log (u i ). 5: Iterate until the parameters and the non-parametric density estimate converges. 6: Repeat steps 3-5 for a list ofh values and store the nal marginal densities estimates ^ f 0 [h]; ^ f 1 [h] and the posterior probabilitiesf^ p i [h](j) : 1in; 0j 1g of the latent states. 7: Choose bandwidth ^ h = arg min h d mse[h] where, d mse[h] = P n i=1 " P j2f0;1g ^ p i [h](j) ( u i + 2 u ^ f 0 j [h](u i ) ^ f j [h](u i ) ) v i # 2 : 8: Repeat steps 1 to 5 with bandwidth ^ h and = 0. Based on these posterior es- timates ^ p i (j) and density estimates ^ f 0 and ^ f 1 report location estimates: ^ FHMMT i = P j2f0;1g ^ p i (j) x i + ^ f 0 j (x i ) ^ f j (x i ) ! . 81 3.4 Asymptotic Properties of the proposed estimator For establishing risk properties of our proposed shrinkage estimators, we consider the following assumption that is popularly imposed for theoretical analysis in HMM starting from the pioneering works of Bickel et al. (1996, 1998): A1. are from a statitionary ergodic Markov chain and the transition probabilities are such that 0 < a 00 ;a 11 < 1. To facilitate the proof, we also assume that the Markov chain begins in its stationary state and is reversible. Algorithm 1 includes several modications to the conventional Baum-Welch algorithm of Baum et al. (1970). For understanding the crucial dierences rst consider = 1, i.e, no sample splitting. The average log-likelihood for complete datafx x x; g assuming known is: l n () =n 1 log X P (X X X; j) where P (X X X; j) = 1 n Y i=2 a i1;i n Y i=1 f i (x i ) : When is unknown, interchanging logarithm and summation above, we iteratively maximize the following lower bound ~ l n ( (t) j (t1) ) of l n ( t ), ~ l n ( (t) j (t1) ) =Q n ( (t) j (t1) ) +H n ( (t1) ); where, nH n ( (t1) ) =E jX X X; (t1) logP ( jX X X; (t1) ) is the entropy, and Q n decouples as Q n = Q 1;n +Q 2;n with nQ 1;n ( (t) j (t1) ) =E 1jX X X; (t1) logP ( 1 ) + n X i=2 E i1;ijX X X; (t1) logP ( i j i1 ) andnQ 2;n ( (t) j (t1) ) = P n i=1 E ijX X X; (t1) logP (x i j i1 ;f 1 ;f 0 ). We iteratively maximizeQ n ( (t) j (t1) ) over (t) based on previous iterate (t1) . If the posterior probabilities for [ i = 1] are updated to ^ p (t) i then, nQ 2;n ( t j (t1) ) equals n X i=1 ^ p (t) i log 8 < : n X j=1 w t j 1 h K x i x j h 9 = ; + n X i=1 (1 ^ p (t) i ) logf (x i )g: Although Q 2;n is concave in (w t 1 ;:::;w t n ), it is not strongly concave. Updating w t j by maximiz- ing Q 2;n over the n 1 dimensional hyperplane as n!1 would not yield good solutions. We interchange the logarithm and summation in the rst term in Q 2;n above. The resultant ~ Q 2;n have a closed form maxima at w t j = ^ p (t) j = P n i=1 ^ p (t) i . f (t) 1 as demonstrated in Algorithm 1 is 82 accordingly updated. Standard forward-backward procedure (Rabiner, 1989) is used to maxi- mize Q 1;n and produce updated posterior probabilities ^ p (t) i . For i = 1;:::;n and j;k = 0; 1 the forward variable: (t) i (j) = P (t1)(X 1 = x 1 ;:::;X i = x i j i = j) and the backward variable: (t) i (j) =P (t1)(X i+1 = x i+1 ;:::;X n = x n j i = j) are computed. 
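To make the final step of Algorithm 1 concrete, the following R sketch computes the HMM-Tweedie location estimates from posterior probabilities $\hat p_i(1)$ (e.g. obtained from the forward-backward recursions), a fitted Gaussian $\hat f_0$ with mode $\hat\mu$ and scale $\hat\sigma_0$, and a weighted Gaussian kernel density estimate of $f_1$, assuming $\sigma = 1$ as above; the function and argument names are ours.

```r
# Minimal sketch of the final HMM-Tweedie estimate (sigma = 1 assumed):
#   theta_hat_i = sum_j p_hat_i(j) * { x_i + f_hat_j'(x_i) / f_hat_j(x_i) }.
hmmt_estimate <- function(x, p1, mu0, s0, h) {
  w <- p1 / sum(p1)                                # weights of the f_1 kernel estimate
  ratio1 <- sapply(x, function(u) {
    k   <- dnorm((u - x) / h)                      # Gaussian kernel evaluations
    f1  <- sum(w * k) / h                          # f_hat_1(u)
    df1 <- sum(w * k * (x - u) / h^2) / h          # f_hat_1'(u)
    df1 / f1
  })
  oc <- x + ratio1                                 # Tweedie estimate, out-of-control state
  ic <- x - (x - mu0) / s0^2                       # Tweedie estimate under Gaussian f_0
  (1 - p1) * ic + p1 * oc
}
```

For a Gaussian $\hat f_0$ the ratio $\hat f_0'(x)/\hat f_0(x) = -(x - \hat\mu)/\hat\sigma_0^2$ is available in closed form, which is what the in-control branch above uses.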
and the posterior probabilities are updated as: ^ p (t) i = (t) i (1) (t) i (1)=( (t) i (0) (t) i (0) + (t) i (1) (t) i (1)) : (3.4.8) The objective Q 1;n + ~ Q 2;n is bounded above and increasing at each step, therefore, our algorithm will converge. In particualr, as n!1, the true solution is a xed point of the algorithm. To understand the risk properties of ^ FHMMT it is important to understand the risk properties of the estimate off 1 that is nally obtained from the proposed algorithm. For that purpose consider the following two estimators: f For 1;n [h](u) = n X i=1 i 1 n X i=1 i K ux i h ; and, ^ f 1;n [h](u) = n X i=1 ^ p i;n 1 n X i=1 ^ p i;n K ux i h : (3.4.9) The rst is not essentially as estimator but an oracle choice which uses knowledge of unknown where as the ^ f 1;n [h] uses estimates ^ p i;n of posterior probabilities P ( i = 1jX X X) instead of the true . Consider ^ p i;n being based on (3.4.8) at an arbitrary iterative step. Barring sample splitting, the estimate of f 1 in every iterative step of Algorithm 1 has the functional form of ^ f 1;n [h]. Consider f 1 to be from a smooth class of function and has bounded support. Assume f 1 is -Holder continuous, i.e., f 1 2H where H = h : d s dx s h(x) d s dy s h(y) Ljxyj for all s 1; and x;y2R;L> 0 : Then, we know (ch. 6 of Wasserman, 2006) that the squaredL 2 distance of the oracle kernel density estimator from the true f 1 , d 2 f For 1;n [h];f 1 = Z f For 1;n [h](u)f 1 (u) 2 du 83 has the following asymptotic expected value: E X X X d 2 f For 1;n [h];f 1 =O h 2 + (nh) 1 for any h > 0: Under assumption A1, we show that ^ f 1;n [h] can be decomposed as ^ f 1;n [h] =f For 1;n [h] + ^ R n [h] where the residual ^ R n [h] is asymptotically negligible as n!1 provided ^ p i;n are asymptotically unbiased estimates of i . As such, E ( ;X X X) Z ^ R 2 n [h](u)du =O logn nh + (E ( ;X X X) ^ p 1;n 1 ) 2 h 2 as n!1 where, 1 =P ( i = 1) based on the stationary distribution of the Markov chain on whose existence is ensured by assumption A1. The results above are summarized in the lemma below. Lemma 3.4.1. If f 1 is -Holder continuous and has bounded support then under assumption A1, the integratedL 2 Bayes risk of non-parametric density estimators of the form (3.4.8)-(3.4.9) for any h> 0 is E (;X X X) d 2 ^ f 1;n [h];f 1 =O h 2 + logn nh + (E ( ;X X X) ^ p 1;n 1 ) 2 h 2 as n!1: In particular, if ^ p 1;n are unbiased and h Fo n = (logn=n) 1=(2 +1) then, E (;X X X) d 2 ^ f 1;n [h Fo n ];f 1 =Of(logn=n) 2 =(2 +1) g as n!1: The above results shows that with good posterior probability estimates, the algorithm 1 provides consistent non-parametric estimation of f 1 . As f 0 is indexed by a parametric Gaussian density, its consistent estimation also subsequently follows. The free parameters in is A = (a 00 ;a 11 ) and , and f 1 . For understanding the evolution of the estimates in Algorithm 1 from the initial solutions towards the true , note that under assumption A1, the population criteria Q 1 (j ~ ) := lim n!1 Q 1;n (j ~ ) and ~ Q 2 (j ~ ) := lim n!1 ~ Q 2;n (j ~ ) are well-dened. 
The free parameters in $\Lambda$ are $A = (a_{00}, a_{11})$, $(\mu,\sigma)$, and $f_1$. For understanding the evolution of the estimates in Algorithm 1 from the initial solutions towards the true $\Lambda$, note that under assumption A1 the population criteria $Q_1(\Lambda\mid\tilde\Lambda) := \lim_{n\to\infty} Q_{1,n}(\Lambda\mid\tilde\Lambda)$ and $\tilde Q_2(\Lambda\mid\tilde\Lambda) := \lim_{n\to\infty}\tilde Q_{2,n}(\Lambda\mid\tilde\Lambda)$ are well-defined. Suppose our initial solution $\Lambda^0_n$ is in the vicinity of $\Lambda$, so that the first-stage estimates $\hat f^{(1)}_{0,n}$ and $\hat f^{(1)}_{1,n}$ are improvements over the initial estimates $\hat f^{(0)}_{0,n}$ and $\hat f^{(0)}_{1,n}$ in the sense that $d(\hat f^{(1)}_{1,n}, f_1) < d(\hat f^{(0)}_{1,n}, f_1)$ and $d(\hat f^{(1)}_{0,n}, f_0) < d(\hat f^{(0)}_{0,n}, f_0)$. Then maximizing $Q_1(A\mid\hat f^{(1)}_{0,n},\hat f^{(1)}_{1,n})$ and subsequently applying (3.4.8) produce better estimates $\hat p^{(1)}_{i,n}$ of the true posterior probabilities, so that $\hat f^{(2)}_{1,n}$ given by (3.4.9) is better than $\hat f^{(1)}_{1,n}$, with $d(\hat f^{(2)}_{1,n}, f_1) < d(\hat f^{(1)}_{1,n}, f_1)$.

If this feature of the evolution is maintained over the successive iterations, the bias in the posterior probability estimates decreases at every step and $\hat f^{(t)}_{0,n}$ and $\hat f^{(t)}_{1,n}$ will ultimately converge to the true $f_0$ and $f_1$, respectively, at the rate given in Lemma 3.4.1. Having a good initial solution with negligible asymptotic bias, based on which the successive iterates have the aforementioned property, is essential for successful convergence of Algorithm 1. We denote these properties of our initial solution by assumption A3, while assumption A2 imposes $\gamma$-Hölder continuity on $f_1$.

The results in Lemma 3.4.1 provide the risk optimality of our proposed $\hat\mu^{\mathrm{HMMT}}$. To facilitate a universal optimality statement that is appropriate for both non-sparse and sparse regimes, we impose the following assumption on the location parameters, as illustrated in Brown and Greenshtein (2009):

A4. For any fixed positive $\epsilon$, we have $n^{-\epsilon}\sup_{i=1,\ldots,n}|\mu_i|\to 0$ as $n\to\infty$.

Based on our set-up in Section 3.2.1 and assumption A1, the above condition is equivalent to restricting the support of the priors $g_1$ and $g_0$ to $[-S_n, S_n]$, where $n^{-\epsilon}S_n\to 0$ as $n\to\infty$ for any positive $\epsilon$.

For any fixed $h>0$, let $\hat p_i[h]$, $\hat f_1[h]$ and $\hat\mu[h]$ be the final estimates of $P(\theta_i = 1\mid\boldsymbol X)$, $f_1$ and the mode of $f_0$, respectively, from Algorithm 1. Recall that, in the absence of sample splitting, the estimate of the mean when the process is out of control is
$\hat\mu^{\mathrm{OC}}_i[h] = x_i + \frac{\hat f'_1[h](x_i)}{\hat f_1[h](x_i)} = x_i + \frac{\sum_{j=1}^{n}\hat p_j[h](x_j - x_i)K\{(x_i - x_j)/h\}}{h^2\sum_{j=1}^{n}\hat p_j[h]K\{(x_i - x_j)/h\}}.$
Without sample splitting (that is, skipping the bandwidth-selection steps 7 and 8 of Algorithm 1), the HMMT estimator is $\hat\mu^{\mathrm{HMMT}}_i[h] = (1 - \hat p_i[h])\hat\mu[h] + \hat p_i[h]\hat\mu^{\mathrm{OC}}_i[h]$. As the HMMT estimator involves a ratio of the estimates $\hat f'_1$ and $\hat f_1$, the approximation error bound in Lemma 3.4.1 on $\hat f_1$ does not trivially lead to optimal risk properties of the HMMT estimator. For achieving asymptotically optimal risk performance, consider a modified robust version of the HMMT estimator that truncates the out-of-control mean values:
$\hat\mu^{\mathrm T}_i[h] = (1 - \hat p_i[h])\hat\mu[h] + \hat p_i[h]\,\mathrm{sign}\big(\hat\mu^{\mathrm{OC}}_i[h]\big)\max\big(|\hat\mu^{\mathrm{OC}}_i[h]|,\ \sqrt{3\log n}\big).$
The following result, akin to Theorem 1 of Brown and Greenshtein (2009), shows that $\hat{\boldsymbol\mu}^{\mathrm T}$ is asymptotically oracle optimal, as it can achieve the mean squared error of the oracle shrinkage estimator defined in (3.2.3). The proofs of the results in this section are provided in the supplements.

Theorem 3.4.2. Under assumptions A1-A4, for any $h_n$ such that $h_n\log n\to 0$ but $n^{\epsilon}h_n\to\infty$ for any positive $\epsilon$,
$\lim_{n\to\infty}\frac{E\|\boldsymbol\mu_n - \hat{\boldsymbol\mu}^{\mathrm T}_n\|_2^2}{E\|\boldsymbol\mu_n - \tilde{\boldsymbol\mu}^{\mathrm{OR}}_n\|_2^2} = 1.$
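A minimal sketch of the HMMT location estimate and its truncated variant, assuming a Gaussian kernel and unit noise variance; the function name is hypothetical and the code mirrors the formulas displayed above rather than the thesis implementation.

```python
import numpy as np
from scipy.stats import norm

def hmmt_estimate(x, p_hat, mu0, h, truncate=False):
    """HMMT location estimates from posterior probabilities and a weighted Gaussian KDE.

    x      : observations
    p_hat  : estimated posterior probabilities P(theta_i = 1 | x)
    mu0    : estimated mode of the in-control density f0
    h      : bandwidth of the Gaussian kernel
    """
    x = np.asarray(x, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    n = len(x)
    w = p_hat / p_hat.sum()
    diffs = (x[:, None] - x[None, :]) / h              # (x_i - x_j)/h
    K = norm.pdf(diffs)
    # Tweedie-type correction f1'/f1 of the posterior-weighted Gaussian KDE
    num = (w[None, :] * (x[None, :] - x[:, None]) * K).sum(axis=1)
    den = h ** 2 * (w[None, :] * K).sum(axis=1)
    mu_oc = x + num / den                              # out-of-control estimate
    if truncate:
        # truncation rule of the robust variant used in Theorem 3.4.2
        mu_oc = np.sign(mu_oc) * np.maximum(np.abs(mu_oc), np.sqrt(3 * np.log(n)))
    return (1.0 - p_hat) * mu0 + p_hat * mu_oc
```

With `truncate=True` the out-of-control component is kept at least $\sqrt{3\log n}$ in magnitude, matching the rule displayed above.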
3.5 Simulation Studies

This section conducts simulation studies to compare the performance of our proposed HMMT estimator with the following four estimators:
(a) GMM.3: we use the BW algorithm and Gaussian mixture models with 3 mixing components for modeling $f_1$;
(b) GMM.DP: similar to GMM.3, we use the BW algorithm and Gaussian mixture models. However, the number of components in the Gaussian mixture is not fixed as before; instead, it is estimated using the Dirichlet Process model as described in Ko et al. (2015);
(c) T.ND: we ignore the Markovian dependence structure and apply Tweedie's formula. The algorithm in Fu et al. (2018) has been used to choose the tuning parameter;
(d) OR: the oracle estimator in (3.2.3), which uses knowledge of the latent parameters.

3.5.1 Comparison of the MSEs

In Table 3.1 we report the mean squared errors averaged over 50 replications. We consider 12 simulation scenarios. In each scenario, we simulate n = 2,000 observations. The latent states are generated from a two-state Markov chain where the transition matrix has the probability of remaining in control, $A_{00}$, fixed at 0.95, while the probability of remaining out of control, $A_{11}$, is varied between 0.2 and 0.8. Across all the scenarios $g_0$ was fixed as the distribution with point mass at 0. In cases I to VI, the out-of-control prior $g_1$ was generated from the following 6 densities:
(a) uniform distribution with support on [-9, 9];
(b) asymmetric triangle distribution on [-30, 30] with mode at 6;
(c) Levy distribution with scale 7 on [0, ∞);
(d) non-central chi-square distribution with 3 degrees of freedom and non-centrality parameter 2;
(e) Weibull distribution with shape 2 and scale 5;
(f) Burr distribution with location 0, scale 2 and shape parameters 2 and 0.5.
In cases VII to XII, we study the performance of the estimators when $g_1$ is generated from mixtures of the above densities. The following observations can be made based on the simulation results.
• Across all the regimes, our proposed HMMT estimator significantly improves on the non-parametric shrinkage estimator T.ND, which does not use the dependence structure; this shows the importance of taking the dependence structure into account. In many cases, incorporating the Markov structure results in an efficiency gain even when the model is mis-specified.
• The HMMT estimator also improves on the Gaussian mixture model based estimation strategies and has error rates quite close to those of the oracle estimator in (3.2.3).
• GMM.3 sometimes has higher MSE than even the non-parametric shrinkage estimator T.ND that does not use the dependence structure. This reflects the usefulness of the kernel-density-based non-parametric approach in HMMT.

3.5.2 Comparison of the estimated $f_1$'s

When implementing the proposed HMMT method, the GMM.DP estimate has been used in the initialization step. This indicates that, in the cases where HMMT improves significantly over GMM.DP, the proposed algorithm arrives at an estimate of $f_1$ that is sufficiently far away from the Gaussian mixture family. This would typically happen when the distribution of the out-of-control averages is heavy-tailed. In Figure 3.2, the estimated $f_1$ across these simulation regimes is displayed for the case $A_{11} = 0.8$. Across all the studied regimes, the HMMT estimate is evidently smoother and its shape is closer to the truth.
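A sketch of the data-generating mechanism described in Section 3.5.1; the uniform prior on [-9, 9] (case I) and the unit noise variance are illustrative assumptions of this sketch.

```python
import numpy as np

def simulate_hmm_data(n=2000, a00=0.95, a11=0.8, g1_sampler=None, seed=0):
    """Simulate observations from the two-state hidden Markov model of Section 3.5.1.

    State 0 (in control): mean drawn from g0 = point mass at 0.
    State 1 (out of control): mean drawn from the prior g1.
    Observations: x_i = mu_i + N(0, 1) noise.
    """
    rng = np.random.default_rng(seed)
    if g1_sampler is None:
        g1_sampler = lambda size: rng.uniform(-9, 9, size)    # case I prior
    A = np.array([[a00, 1 - a00], [1 - a11, a11]])
    # start the chain from its stationary distribution
    pi1 = (1 - a00) / (2 - a00 - a11)
    theta = np.zeros(n, dtype=int)
    theta[0] = rng.binomial(1, pi1)
    for i in range(1, n):
        theta[i] = rng.binomial(1, A[theta[i - 1], 1])
    mu = np.where(theta == 1, g1_sampler(n), 0.0)
    x = mu + rng.standard_normal(n)
    return x, mu, theta
```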
Table 3.1: MSE comparison. As the probability of remaining out of control, A11, and the out-of-control prior g1 vary, the MSEs of the Tweedie estimator that does not use the dependence structure (T.ND), the conventional Baum-Welch algorithm using Gaussian mixture models with 3 mixing components (GMM.3) and with the number of components estimated using the Dirichlet Process model (GMM.DP) are reported, along with the performance of our proposed HMMT estimator and that of the oracle estimator in (3.2.3).

Scenario  Out-of-control prior g1                      A11   T.ND   GMM.3  GMM.DP  HMMT   Oracle
I         Unif[-9,9]                                   0.2   0.345  0.213  0.137   0.130  0.112
                                                       0.4   0.361  0.217  0.150   0.138  0.131
                                                       0.6   0.439  0.317  0.190   0.177  0.168
                                                       0.8   0.429  0.502  0.263   0.253  0.238
II        Triangle                                     0.2   0.378  0.224  0.134   0.124  0.110
                                                       0.4   0.294  0.242  0.154   0.136  0.122
                                                       0.6   0.319  0.336  0.190   0.176  0.164
                                                       0.8   0.515  0.579  0.284   0.258  0.250
III       Levy                                         0.2   0.313  0.175  0.128   0.120  0.116
                                                       0.4   0.309  0.192  0.144   0.135  0.132
                                                       0.6   0.356  0.215  0.171   0.163  0.160
                                                       0.8   0.468  0.375  0.240   0.235  0.232
IV        Non-central chi-square                       0.2   0.358  0.195  0.123   0.117  0.110
                                                       0.4   0.330  0.199  0.151   0.149  0.141
                                                       0.6   0.340  0.246  0.171   0.167  0.159
                                                       0.8   0.521  0.392  0.232   0.225  0.223
V         Weibull                                      0.2   0.352  0.174  0.134   0.130  0.124
                                                       0.4   0.370  0.184  0.149   0.143  0.136
                                                       0.6   0.402  0.195  0.165   0.162  0.160
                                                       0.8   0.447  0.289  0.239   0.233  0.229
VI        Burr                                         0.2   0.329  0.191  0.128   0.122  0.116
                                                       0.4   0.365  0.200  0.152   0.150  0.142
                                                       0.6   0.428  0.247  0.176   0.174  0.167
                                                       0.8   0.421  0.377  0.233   0.228  0.217
VII       0.4 Unif[3,8] + 0.6 Levy                     0.2   0.363  0.180  0.121   0.118  0.111
                                                       0.4   0.394  0.173  0.128   0.127  0.121
                                                       0.6   0.389  0.248  0.169   0.166  0.165
                                                       0.8   0.506  0.367  0.230   0.223  0.221
VIII      0.4 Non-central chi-square + 0.6 Triangle    0.2   0.340  0.253  0.109   0.106  0.102
                                                       0.4   0.428  0.268  0.139   0.132  0.124
                                                       0.6   0.435  0.333  0.176   0.166  0.160
                                                       0.8   0.541  0.591  0.239   0.228  0.226
IX        0.4 Unif[3,8] + 0.6 Weibull                  0.2   0.262  0.153  0.132   0.123  0.112
                                                       0.4   0.345  0.163  0.136   0.135  0.126
                                                       0.6   0.395  0.175  0.156   0.153  0.146
                                                       0.8   0.437  0.257  0.219   0.217  0.210
X         0.5 Weibull + 0.5 Levy                       0.2   0.301  0.177  0.123   0.119  0.106
                                                       0.4   0.321  0.196  0.142   0.138  0.129
                                                       0.6   0.362  0.226  0.170   0.165  0.156
                                                       0.8   0.462  0.375  0.240   0.238  0.231
XI        0.5 Non-central chi-square + 0.5 Burr        0.2   0.295  0.176  0.122   0.120  0.112
                                                       0.4   0.329  0.205  0.150   0.147  0.138
                                                       0.6   0.370  0.276  0.178   0.174  0.166
                                                       0.8   0.456  0.404  0.244   0.242  0.235
XII       0.6 Non-central chi-square + 0.4 Unif[3,8]   0.2   0.332  0.136  0.123   0.115  0.107
                                                       0.4   0.359  0.160  0.150   0.139  0.132
                                                       0.6   0.407  0.209  0.185   0.185  0.180
                                                       0.8   0.473  0.307  0.236   0.205  0.202

Figure 3.2: Estimated density comparison. From top to bottom (by columns), we have the estimated density $\hat f_1$ from GMM.DP (blue) and HMMT (red) across the cases considered in Table 3.1. The true $f_1$ is plotted in black.

3.6 Real data analysis

In this section we illustrate our methodology by applying it to several real data analyses.

3.6.1 Copy number variation

We analyze the IMR103 data in Sharp et al. (2006), which are displayed in Figure 3.3. The gene copy number is the number of copies of a particular gene in the genotype of an individual. It is widely believed that in healthy cells the average copy number should be 2; a shift away from 2 is a genomic disorder and is usually related to certain diseases. It is clear from the left panel of Figure 3.3 that in the region from 100000 to 110000 there is a shift away from 0 in the log2 ratio. We take $g_0$ to be the point mass at 0 and model the data using an HMM. The two hidden states 0 and 1 can be interpreted as healthy/unhealthy genes. The noise standard deviation is estimated by the sample standard deviation of the first 10000 data points (0.183). The right panel of Figure 3.3 shows the histogram of the first 10000 data points along with the density function of N(0, 0.183^2). We will take this as $f_0$.

Figure 3.3: IMR103 data. Left panel: the IMR103 data; the x-axis is the gene number, the y-axis is the mean BAC log2 ratio. Right panel: histogram of the first 10000 genes' mean BAC log2 ratio.
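As a small illustration of the step just described, the null standard deviation can be read off a presumed-null stretch of the sequence as follows; the array name `log2_ratio` is a placeholder for the IMR103 measurements, not a variable defined in the thesis.

```python
import numpy as np

def estimate_null_sd(y, null_range=slice(0, 10000)):
    """Sample standard deviation of a presumed-null stretch of the sequence."""
    null_part = np.asarray(y, dtype=float)[null_range]
    return null_part.std(ddof=1)

# Example with a hypothetical array `log2_ratio` of mean BAC log2 ratios:
# sigma0 = estimate_null_sd(log2_ratio)          # approximately 0.183 here
# f0 = lambda t: np.exp(-0.5 * (t / sigma0) ** 2) / (np.sqrt(2 * np.pi) * sigma0)
```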
For the data analysis, we focus on the data from positions 111000 to 112999 (call it $\boldsymbol U$) and from positions 113000 to 117999 (call it $\boldsymbol V$). For each data point in $\boldsymbol U$ we add a random noise $\epsilon_1\sim N(0, 0.183^2)$ and define $\boldsymbol U_1 = \boldsymbol U + \boldsymbol\epsilon_1$. For each data point in $\boldsymbol V$ we also add a noise from the same distribution, $\epsilon_2\sim N(0, 0.183^2)$, and let $\boldsymbol V_1 = \boldsymbol V + \boldsymbol\epsilon_2$ and $\boldsymbol V_2 = \boldsymbol V - \boldsymbol\epsilon_2$. Under this construction, $\boldsymbol V_1$ and $\boldsymbol V_2$ are independent with the same mean. We construct the estimators based on $\boldsymbol U_1$, estimate the mean vector of $\boldsymbol V_1$ (call the estimate $\hat{\boldsymbol\mu}$), and use the average of $\|\hat{\boldsymbol\mu} - \boldsymbol V_2\|_2^2$ as the prediction error. The plots of $\boldsymbol U$ and $\boldsymbol V$ are shown in the left and right panels of Figure 3.4, respectively.

Figure 3.4: Truncated IMR103 data. Left panel: the IMR103 data, gene numbers 111000 to 112999; the x-axis is the gene number, the y-axis is the mean BAC log2 ratio. Right panel: the IMR103 data, gene numbers 113000 to 118000; the x-axis is the gene number, the y-axis is the mean BAC log2 ratio.

The estimated $f_1$ using the Gaussian mixture model and our non-parametric method are displayed in Figure 3.5. Using the sample-splitting method discussed in Section 3.2, the bandwidth is chosen to be 0.28. Finally, we apply the GMM.DP and HMMT methods to the data set. The prediction error for GMM.DP is 0.130, whereas the prediction error of the proposed HMMT is 0.121. This illustrates the benefit of using the nonparametric approach to modeling $f_1$.

Figure 3.5: Density estimate of mean BAC log2 ratio. Left panel: estimated $f_1$ for the mean BAC log2 ratio using GMM. Right panel: estimated $f_1$ for the mean BAC log2 ratio using HMMTweedie.
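The noise-splitting evaluation used above can be sketched as follows. The helper `predict_means` is a hypothetical stand-in for fitting the model on one sequence and returning posterior-mean estimates for another, and sigma0 = 0.183 is the null standard deviation estimated earlier.

```python
import numpy as np

def split_and_score(U, V, predict_means, sigma0=0.183, seed=0):
    """Noise-splitting evaluation described in Section 3.6.1.

    U, V          : two disjoint stretches of the observed sequence
    predict_means : hypothetical function that is trained on one sequence and
                    returns estimated means for another sequence
    """
    rng = np.random.default_rng(seed)
    U = np.asarray(U, dtype=float)
    V = np.asarray(V, dtype=float)
    eps1 = rng.normal(0.0, sigma0, size=len(U))
    eps2 = rng.normal(0.0, sigma0, size=len(V))
    U1 = U + eps1                    # training sequence
    V1 = V + eps2                    # sequence whose means are estimated
    V2 = V - eps2                    # independent copy with the same means
    mu_hat = predict_means(train=U1, test=V1)
    return np.mean((mu_hat - V2) ** 2)    # prediction error
```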
3.6.2 Internet search trend

This section applies the proposed methods to the analysis of search trend data. The keyword of interest is "NBA". The data are collected from Google Trends for the period 08/29/2013 to 05/28/2018. According to Google, "Numbers represent search interest relative to the highest point on the chart for the given region and time. Hence a value of 100 is the peak popularity for the term. A value of 50 means that the term is half as popular. A score of 0 means there was not enough data for this term." Because of the increasing accessibility of the Internet, if we directly take the time interval to be from 08/29/2013 to 05/28/2018, there will be a clear increasing trend. To adjust for that, we collect the data by taking the time window to be 4 months. Since the numbers are relative, homoscedastic errors seem to be a reasonable assumption. Figure 3.6 displays the search trend data and its histogram. The bottom panel of Figure 3.6 shows the histogram of the data with values less than 30, where the line represents the density function of (1/3)N(5, 3) + (1/3)N(13, 3) + (1/3)N(25, 3).

Figure 3.6: Internet search trend data. Top left panel: plot of the search interest score; the x-axis is time, the y-axis is the Google search interest score for the keyword "NBA". Top right panel: histogram of the search interest score for the keyword "NBA". Bottom panel: histogram of the search interest score (<30) for the keyword "NBA".

The two hidden states 0 and 1 can be interpreted as no event/event. An event could be an important game (all-star games, finals, etc.). To estimate the homoscedastic error, we compute the standard deviation of the data from 08/29/2013 to 10/02/2013; since this is during the off-season, the search trend should be stable. It turns out the standard deviation is around 3. We are primarily interested in the times when search interest is high, so we use $f_0$ to model the times when search interest is low. By inspection, we take $f_0$ to be the density of (1/3)N(5, 3) + (1/3)N(13, 3) + (1/3)N(25, 3).

The estimated $f_1$ is displayed in Figure 3.7. For the proposed HMMT method, the bandwidth is chosen to be 3.

Figure 3.7: Density estimate of search trend. Left panel: estimated $f_1$ using GMM. Right panel: estimated $f_1$ using HMMTweedie.

Using the estimated transition matrix and $f_1$, we can calculate the expected proportion of days when the search interest is greater than 80. The Gaussian mixture model gives an estimate of 0.0294 and the non-parametric model gives an estimate of 0.0266, whereas the proportion obtained from the data is 0.0243. We can see that the HMMT method provides the better estimate.
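The exceedance calculation just described can be sketched as follows, treating the fitted in-control and out-of-control densities as the marginal densities of the observations in the two states; the threshold of 80, the integration grid, and the function names are illustrative assumptions.

```python
import numpy as np

def expected_exceedance(A_hat, f0_hat, f1_hat, threshold=80.0, grid=None):
    """Expected proportion of time points whose value exceeds `threshold`
    under a fitted two-state HMM, as in the search-trend analysis."""
    A_hat = np.asarray(A_hat, dtype=float)
    if grid is None:
        grid = np.linspace(threshold, 150.0, 2000)    # numerical integration grid
    dx = grid[1] - grid[0]
    # stationary distribution of the fitted chain
    pi1 = (1 - A_hat[0, 0]) / (2 - A_hat[0, 0] - A_hat[1, 1])
    pi0 = 1.0 - pi1
    # P(X > threshold) = pi0 * P0(X > t) + pi1 * P1(X > t), via Riemann sums
    tail0 = float(np.sum(f0_hat(grid)) * dx)
    tail1 = float(np.sum(f1_hat(grid)) * dx)
    return pi0 * tail0 + pi1 * tail1
```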
3.6.3 Change in unemployment rate

Finally, we illustrate the method using unemployment data. The left panel of Figure 3.8 is a plot of the monthly unemployment rate change from February 1948 to August 2018. The data are from the U.S. Bureau of Labor Statistics website. Since we are only considering the changes in the unemployment rate, it is reasonable to assume that the errors are homoscedastic.

Figure 3.8: Unemployment rate data. Left panel: plot of the unemployment rate change from February 1948 to August 2018; the x-axis is time, the y-axis is the unemployment rate change. Right panel: histogram of the unemployment rate change from February 1948 to August 2018.

We can see from the left panel of Figure 3.8 that most of the time the change in the unemployment rate is close to 0, and dramatic changes often appear in clusters, exhibiting a dependence structure that can be reasonably described by an HMM. Suppose we are interested in the months where there is a big increase in the unemployment rate. Economists and policy makers may want to focus on those times and try to figure out the possible causes, so as to avoid increases in the unemployment rate in the future. The right panel of Figure 3.8 shows a histogram of the unemployment rate changes of at most 0.2%, together with the density function of 0.8 N(0, 0.11) + 0.2 N(0.25, 0.11). The density function matches the data quite well; therefore, we take $f_0$ to be 0.8 N(0, 0.11) + 0.2 N(0.25, 0.11). The estimated $f_1$ is given in Figure 3.9. For the proposed HMMT, we choose the bandwidth to be 0.15. The Gaussian mixture model assumes there are 6 components. However, as we can see from the histogram, the mixture model is inadequate. By contrast, the proposed nonparametric model describes the data quite well; this clearly shows the flexibility and benefit of the HMMT method.

Figure 3.9: Density estimate of unemployment rate. Left panel: estimated $f_1$ for the unemployment rate change using GMM. Right panel: estimated $f_1$ for the unemployment rate change using HMMTweedie.

Next we apply GMM.DP and HMMT to analyze the unemployment rate data (monthly, seasonally adjusted, January 1960 to June 2018), which are displayed in Figure 3.10. The data are obtained from the website of the Federal Reserve Bank of St. Louis. The top right panel of Figure 3.10 presents the estimated true change of the unemployment rate using a Gaussian mixture model. The bottom panel of Figure 3.10 is the estimate obtained using the proposed HMMT. We can see that HMMT has a clear "de-noising" effect on the data: the HMMT estimate is less scattered and more robust than GMM.DP. This can help researchers and policy makers get a better idea of the true change in the unemployment rate and focus their studies on the policies over the months that had a real impact on the unemployment rate.

Figure 3.10: Change of unemployment rate. Top left panel: plot of the change of the unemployment rate, January 1960 to June 2018. Top right panel: estimated true change of the unemployment rate, January 1960 to June 2018, using GMM. Bottom panel: estimated true change of the unemployment rate, January 1960 to June 2018, using HMMTweedie.

3.7 Discussion

We developed a non-parametric estimator for estimating a mean vector under Markovian dependence. It will be interesting to investigate whether, following the bivariate kernel density approach in Fu et al. (2018), the proposed methodology can be extended to heteroscedastic Markov set-ups with known variances. Developing shrinkage algorithms when the variance of the observations is unknown and needs to be jointly estimated from the data will also be important. Estimation in multivariate HMMs, where the observed outputs $X_i$ are vectors, finds wide applications (Kale et al., 2003; Fiecas et al., 2017). In this context it will be useful to study shrinkage estimation, particularly in the presence of latent structural properties on the covariances.

3.8 Proofs

Proof of Lemma 3.4.1

Note that it is sufficient to bound $E_{(\boldsymbol\theta,\boldsymbol X)}\,d^2\big(\hat f_{1,n}[h], f^{Or}_{1,n}[h]\big)$; the result then follows from the triangle inequality. Let $\mathcal B$ be the event that $|\sum_{i=1}^{n}\theta_i - \pi_1 n| < \frac12\pi_1 n$; throughout the proof we assume that $\mathcal B$ holds, since by Hoeffding's inequality $P(\mathcal B^c) = O(e^{-cn})$ for some constant $c>0$. Write $b_j = \theta_j/\sum_{i=1}^{n}\theta_i$, $\hat b_j = \hat p_{j,n}/\sum_{i=1}^{n}\hat p_{i,n}$, and $\phi_h(x - x_j) = \frac1h K\big(\frac{x-x_j}{h}\big)$. Then
$\big\{\hat f_{1,n}[h](x) - f^{Or}_{1,n}[h](x)\big\}^2 = \sum_{j=1}^{n}(\hat b_j - b_j)^2\phi_h^2(x - x_j) + \sum_{j\neq k}(\hat b_j - b_j)(\hat b_k - b_k)\phi_h(x - x_j)\phi_h(x - x_k).$
We first bound $E(\hat b_j - b_j)^2$. Write $E(\hat b_j - b_j)^2 = \{E(\hat b_j - b_j)\}^2 + \mathrm{Var}(\hat b_j - b_j)$. It is clear that $E(b_j)$ and $E(\hat b_j)$ are both of order $O(n^{-1})$; hence $\{E(\hat b_j - b_j)\}^2 = O(n^{-2})$. Next consider $\mathrm{Var}(\hat b_j - b_j) = \mathrm{Var}(b_j) + \mathrm{Var}(\hat b_j) - 2\,\mathrm{Cov}(b_j,\hat b_j)$. We have
$\mathrm{Var}(b_j) = \mathrm{Var}\Big(\frac{\theta_j}{\sum_{i=1}^{n}\theta_i}\Big)\leq E\Big(\frac{1}{\sum_{i=1}^{n}\theta_i}\Big)^2 = O(n^{-2}).$
Similarly $\mathrm{Var}(\hat b_j) = O(n^{-2})$. It follows from the Cauchy-Schwarz inequality that $\mathrm{Cov}(b_j,\hat b_j) = O(n^{-2})$. Therefore $\mathrm{Var}(\hat b_j - b_j) = O(n^{-2})$ and $E(\hat b_j - b_j)^2 = O(n^{-2})$. Using the fact that $\int\phi_h^2(x - x_j)\,dx = O(h^{-1})$, we have
$\int E\sum_{j=1}^{n}(\hat b_j - b_j)^2\phi_h^2(x - x_j)\,dx = O\{(nh)^{-1}\}. \qquad (3.8.10)$
Next we bound $E\{(\hat b_j - b_j)(\hat b_k - b_k)\}$ for $j\neq k$. Consider the decomposition
$E\{(\hat b_j - b_j)(\hat b_k - b_k)\} = E(\hat b_j - b_j)E(\hat b_k - b_k) + \mathrm{Cov}(\hat b_j - b_j,\,\hat b_k - b_k). \qquad (3.8.11)$
Note that
$E b_j = \pi_1 E\Big(\frac{1}{\sum_{i=1}^{n}\theta_i}\Big) + \mathrm{Cov}\Big(\theta_j,\ \frac{1}{\sum_{i=1}^{n}\theta_i}\Big).$
Assumption A1 implies $\mathrm{Cov}(\theta_j,\theta_k) = O(\rho^{|j-k|})$ for some $\rho\in(0,1)$. Hence for every $j$ we can focus on its $\log n$ neighborhood, and it follows that
$E\Big(1\Big/\sum_{i=1}^{n}\theta_i\ \Big|\ \theta_j = 1\Big) - E\Big(1\Big/\sum_{i=1}^{n}\theta_i\ \Big|\ \theta_j = 0\Big) = O(n^{-2}\log n).$
Some elementary calculation shows $\mathrm{Cov}\big(\theta_j,\ 1/\sum_{i=1}^{n}\theta_i\big) = O(n^{-2}\log n)$. By Lemma 3 of Fu et al. (2019), together with the Markov structure, we have
$E\Big(\frac{1}{\sum_{i=1}^{n}\theta_i}\Big) = \frac{1}{\pi_1 n} + O(n^{-2}\log n).$
Hence $E b_j = 1/n + O(n^{-2}\log n)$. Note that
$E\hat b_j = 1/n + O\Big(\frac{E\hat p_{1,n} - \pi_1}{n} + \frac{\log n}{n^2}\Big).$
It follows that $E(\hat b_j - b_j)E(\hat b_k - b_k) = O\Big(\frac{(E\hat p_{1,n} - \pi_1)^2}{n^2} + \frac{\log^2 n}{n^4}\Big)$. By Lemma 4 of Fu et al. (2019) and the Markov structure we also have $\mathrm{Cov}(\hat b_j - b_j,\,\hat b_k - b_k) = O(n^{-3}\log n)$. Hence,
$\int E\sum_{j\neq k}(\hat b_j - b_j)(\hat b_k - b_k)\phi_h(x - x_j)\phi_h(x - x_k)\,dx = O\Big(\frac{\log n}{nh} + \big(E_{(\boldsymbol\theta,\boldsymbol X)}\hat p_{1,n} - \pi_1\big)^2 h^{-2}\Big).$
The lemma follows.

Proof of Theorem 3.4.2

First, notice that as a consequence of A1, $E\|\hat{\boldsymbol\mu}^{\mathrm{OR}}(\boldsymbol x)\|_2^2 = O(n)$. Thus the conditions of Theorem 1 in Brown and Greenshtein (2009) are satisfied. Our theorem is an adaptation of Theorem 1 in Brown and Greenshtein (2009); the proof follows the same logic and we provide a sketch. We will show $E\|\hat{\boldsymbol\mu}^{\mathrm T}_n - \tilde{\boldsymbol\mu}^{\mathrm{OR}}_n\|_2^2 = o(n)$. Let $p_i = P(\theta_i = 1\mid\boldsymbol x)$. Note that it follows from the proof of Lemma 3.4.1 that $\hat p_i - p_i = o(1)$.
This fact together with Lemma 2 in Brown and Greenshtein (2009) implies that we only need to show
$E\sum_{i=1}^{n}\Big\{\frac{f'_1[h](X_i)}{f_1[h](X_i)} - \frac{\hat f'_1[h](X_i)}{\hat f_1[h](X_i)}\Big\}^2 = o(n), \qquad (3.8.12)$
where $f_1[h](x) = \int\frac{1}{\sqrt{1+h^2}}\,\varphi\Big(\frac{x-\mu}{\sqrt{1+h^2}}\Big)\,dg_1(\mu)$. Following the proof of Lemma 3 in Brown and Greenshtein (2009), we first show that for an independent sample $\tilde{\boldsymbol X} = (\tilde X_1,\ldots,\tilde X_n)$ with $\tilde X_i\sim N(\mu_i, 1)$, $i = 1,\ldots,n$, we have
$E\sum_{i=1}^{n}\Big\{\frac{f'_1[h](\tilde X_i)}{f_1[h](\tilde X_i)} - \frac{\hat f'_1[h](\tilde X_i)}{\hat f_1[h](\tilde X_i)}\Big\}^2 = o(n). \qquad (3.8.13)$
We can write $\frac{\hat f'_1[h](x_i)}{\hat f_1[h](x_i)} = \frac{f'_1[h](x_i) + R_1}{f_1[h](x_i) + R_2}$. Then, on the region $\mathcal R$ defined in the proof of Lemma 3 in Brown and Greenshtein (2009),
$E\sum_{i=1}^{n}\Big\{\frac{f'_1[h](\tilde X_i)}{f_1[h](\tilde X_i)} - \frac{\hat f'_1[h](\tilde X_i)}{\hat f_1[h](\tilde X_i)}\Big\}^2 = O\bigg\{E\Big(\frac{R_1}{f_1[h](\tilde X_i) + R_2}\Big)^2 + E\Big(\frac{C R_2}{f_1[h](\tilde X_i) + R_2}\Big)^2\bigg\}.$
A consequence of Lemma 3.4.1 is that $E(R_i)\to 0$, $i = 1,2$, as $n\to\infty$, so we only need to bound the variances of the $R_i$. The variances of $R_1$ and $R_2$ are the variances of the density estimators $\hat f'_1[h](\tilde X_i)$ and $\hat f_1[h](\tilde X_i)$. Notice that the variances of $f^{Or\,\prime}_{1,n}[h](\tilde X_i)$ and $f^{Or}_{1,n}[h](\tilde X_i)$ satisfy (55) in Brown and Greenshtein (2009). Since $\mathrm{Var}(R_1)$ and $\mathrm{Var}(R_2)$ are of the same order as $\mathrm{Var}\,f^{Or\,\prime}_{1,n}[h](\tilde X_i)$ and $\mathrm{Var}\,f^{Or}_{1,n}[h](\tilde X_i)$, respectively, $R_1$ and $R_2$ also satisfy (55). Using Bernstein's inequality,
$P\{R_2 - E(R_2) < -0.5\,f_1[h](\tilde X_i)\} < 1/(4C^2).$
Since $E(R_2)\to 0$, as $n\to\infty$,
$P\{R_2 < -0.5\,f_1[h](\tilde X_i)\}\leq 1/(4C^2).$
Using the same argument as in the proof of Lemma 3 in Brown and Greenshtein (2009), (3.8.13) follows. To show (3.8.12), we use the same argument as in Brown and Greenshtein (2009); since the Markov structure contributes at most a factor of $\log n$, the conclusion is not affected. (3.8.12) and Lemma 2 in Brown and Greenshtein (2009), together with the Cauchy-Schwarz inequality, prove the theorem.

Bibliography

Aharoni, E., Neuvirth, H., and Rosset, S. (2010). The quality preserving database: A computational framework for encouraging collaboration, enhancing power and controlling false discovery. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(5):1431-1437.

Aharoni, E. and Rosset, S. (2014). Generalized alpha-investing: definitions, optimality results and application to public databases. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):771-794.

Ahmad, S., Lavin, A., Purdy, S., and Agha, Z. (2017). Unsupervised real-time anomaly detection for streaming data. Neurocomputing, 262:134-147.

Ahmed, S. E. and Reid, N. (2012). Empirical Bayes and Likelihood Inference, volume 148. Springer Science & Business Media.

Barber, R. F. and Ramdas, A. (2017). The p-filter: multilayer false discovery rate control for grouped hypotheses. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):1247-1268.

Basu, P., Cai, T. T., Das, K., and Sun, W. (2018). Weighted false discovery rate control in large-scale multiple testing. Journal of the American Statistical Association, 113:1172-1183.

Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164-171.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. B, 57:289-300.

Benjamini, Y. and Hochberg, Y. (1997). Multiple hypotheses testing with weights. Scandinavian Journal of Statistics, 24:407-418.

Benjamini, Y. and Hochberg, Y. (2000). On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics, 25:60-83.
Bickel, P. J., Ritov, Y., et al. (1996). Inference in hidden Markov models I: Local asymptotic normality in the stationary case. Bernoulli, 2(3):199-228.

Bickel, P. J., Ritov, Y., Ryden, T., et al. (1998). Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. The Annals of Statistics, 26(4):1614-1635.

Brown, L. D. and Greenshtein, E. (2009). Nonparametric empirical Bayes and compound decision approaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics, 37:1685-1704.

Brown, L. D., Greenshtein, E., and Ritov, Y. (2013). The Poisson compound decision problem revisited. Journal of the American Statistical Association, 108(502):741-749.

Cai, T. T. and Sun, W. (2009). Simultaneous testing of grouped hypotheses: Finding needles in multiple haystacks. J. Amer. Statist. Assoc., 104:1467-1481.

Cai, T. T., Sun, W., and Wang, W. (2019). CARS: covariate assisted ranking and screening for large-scale two-sample inference (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81:187-234.

Cao, H., Sun, W., and Kosorok, M. R. (2013). The optimal power puzzle: scrutiny of the monotone likelihood ratio assumption in multiple testing. Biometrika, 100(2):495-502.

Carlin, B. P. and Louis, T. A. (2010). Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall/CRC.

Castillo, I. and Roquain, E. (2018). On spike and slab empirical Bayes multiple testing. arXiv:1808.09748.

Churchill, G. A. (1992). Hidden Markov chains and the analysis of genome structure. Computers & Chemistry, 16(2):107-115.

Cleveland, R. B., Cleveland, W. S., McRae, J. E., and Terpenning, I. (1990). STL: a seasonal-trend decomposition. Journal of Official Statistics, 6(1):3-73.

Dicker, L. H. and Zhao, S. D. (2016). High-dimensional classification via nonparametric empirical Bayes and maximum likelihood inference. Biometrika, 103:21-34.

Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist., 32:962-994.

Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003). Multiple hypothesis testing in microarray experiments. Statist. Sci., 18(1):71-103.

Efron, B. (2004). Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J. Amer. Statist. Assoc., 99(465):96-104.

Efron, B. (2007). Size, power and false discovery rates. Ann. Statist., 35:1351-1377.

Efron, B. (2008a). Microarrays, empirical Bayes and the two-groups model. Statist. Sci., 23:1-22.

Efron, B. (2008b). Simultaneous inference: When should hypothesis testing problems be combined? Ann. Appl. Stat., 2:197-223.

Efron, B. (2011). Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602-1614.

Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc., 96:1151-1160.

Efron, B. and Zhang, N. R. (2011). False discovery rates and copy number variation. Biometrika, 98(2):251-271.

Fiecas, M., Franke, J., von Sachs, R., and Tadjuidje Kamgaing, J. (2017). Shrinkage estimation for multivariate hidden Markov models. Journal of the American Statistical Association, 112(517):424-435.

Foster, D. P. and Stine, R. A. (2008). Alpha-investing: a procedure for sequential control of expected false discoveries. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(2):429-444.
Fourdrinier, D., Strawderman, W. E., and Wells, M. T. (2017). Shrinkage Estimation. Springer.

Fu, L., Gang, B., James, G. M., and Sun, W. (2019). Information loss and power distortion from standardizing in multiple hypothesis testing. arXiv preprint arXiv:1910.08107.

Fu, L., James, G., and Sun, W. (2018). Nonparametric empirical Bayes estimation on heterogeneous data.

Genovese, C. and Wasserman, L. (2002). Operating characteristics and extensions of the false discovery rate procedure. J. R. Stat. Soc. B, 64:499-517.

Genovese, C. R., Roeder, K., and Wasserman, L. (2006). False discovery control with p-value weighting. Biometrika, 93(3):509-524.

Habiger, J., Watts, D., and Anderson, M. (2017). Multiple testing with heterogeneous multinomial distributions. Biometrics, 73(2):562-570.

Habiger, J. D. (2017). Adaptive false discovery rate control for heterogeneous data. Statistica Sinica, pages 1731-1756.

Harvey, C. R. and Liu, Y. (2015). Backtesting. The Journal of Portfolio Management, 42(1):13-28.

Holm, S. (1979a). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65-70.

Holm, S. (1979b). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, pages 65-70.

Hu, J. X., Zhao, H., and Zhou, H. H. (2010). False discovery rate control with groups. Journal of the American Statistical Association, 105(491):1215-1227.

Hu, J. X., Zhao, H., and Zhou, H. H. (2012). False discovery rate control with groups. Journal of the American Statistical Association, 105:1215-1227.

Ignatiadis, N., Klaus, B., Zaugg, J. B., and Huber, W. (2016). Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nature Methods, 13(7):577.

James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361-379.

Javanmard, A., Montanari, A., et al. (2018). Online rules for control of false discovery rate and false discovery exceedance. The Annals of Statistics, 46(2):526-554.

Jiang, W., Zhang, C.-H., et al. (2009). General maximum likelihood empirical Bayes estimation of normal means. The Annals of Statistics, 37(4):1647-1684.

Jiang, Y., Oldridge, D. A., Diskin, S. J., and Zhang, N. R. (2015). CODEX: a normalization and copy number variation detection method for whole exome sequencing. Nucleic Acids Research, 43(6):e39.

Jin, J. and Cai, T. T. (2007). Estimating the null and the proportion of nonnull effects in large-scale multiple comparisons. J. Amer. Statist. Assoc., 102:495-506.

Johnstone, I. M. (2013). Gaussian Estimation: Sequence and Wavelet Models. Version: 11 June, 2013. Available at http://www-stat.stanford.edu/~imj.

Kale, A., Chowdhury, A. K. R., and Chellappa, R. (2003). Towards a view invariant gait recognition algorithm. IEEE, page 143.

Karp, N. A., Mason, J., Beaudet, A. L., Benjamini, Y., Bower, L., Braun, R. E., Brown, S. D., Chesler, E. J., Dickinson, M. E., Flenniken, A. M., et al. (2017). Prevalence of sexual dimorphism in mammalian phenotypic traits. Nature Communications, 8:15475.

Ko, S. I., Chong, T. T., Ghosh, P., et al. (2015). Dirichlet process hidden Markov multiple change-point model. Bayesian Analysis, 10(2):275-296.

Koenker, R. and Mizera, I. (2014). Convex optimization, shape constraints, compound decisions, and empirical Bayes rules. Journal of the American Statistical Association, 109(506):674-685.
Lei, L. and Fithian, W. (2018). AdaPT: an interactive procedure for multiple testing with side information. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):649-679.

Li, A. and Barber, R. F. (2019). Multiple testing with the structure-adaptive Benjamini-Hochberg algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(1):45-74.

Liu, Y., Sarkar, S. K., and Zhao, Z. (2016). A new approach to multiple testing of grouped hypotheses. Journal of Statistical Planning and Inference, 179:1-14.

Lynch, G., Guo, W., Sarkar, S. K., Finner, H., et al. (2017). The control of the false discovery rate in fixed sequence multiple testing. Electronic Journal of Statistics, 11(2):4649-4673.

Miller, C., Genovese, C., Nichol, R., Wasserman, L., Connolly, A., Reichart, D., Hopkins, D., Schneider, J., and Moore, A. (2001). Controlling the false-discovery rate in astrophysical data analysis. Astronomical Journal, 122:3492-3505.

Newton, M. A., Noueiry, A., Sarkar, D., and Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, 5(2):155-176.

Pacifico, M., Genovese, C., Verdinelli, I., and Wasserman, L. (2004). False discovery control for random fields. Journal of the American Statistical Association, 99(468):1002-1014.

Peña, E. A., Habiger, J. D., and Wu, W. (2011). Power-enhanced multiple decision functions controlling family-wise error and false discovery rates. Annals of Statistics, 39(1):556.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286.

Ramdas, A., Yang, F., Wainwright, M. J., and Jordan, M. I. (2017). Online control of the false discovery rate with decaying memory. In Advances in Neural Information Processing Systems, pages 5650-5659.

Ramdas, A., Zrnic, T., Wainwright, M., and Jordan, M. (2018). SAFFRON: an adaptive algorithm for online control of the false discovery rate. In International Conference on Machine Learning, pages 4286-4294.

Robbins, H. (1955). An empirical Bayes approach to statistics. Office of Scientific Research, US Air Force.

Robertson, D. S. and Wason, J. (2018). Online control of the false discovery rate in biomedical research. arXiv preprint arXiv:1809.07292.

Roquain, E. and Van De Wiel, M. A. (2009). Optimal weighting for false discovery rate control. Electronic Journal of Statistics, 3:678-711.

Saha, S. and Guntuboyina, A. (2017). On the nonparametric maximum likelihood estimator for Gaussian location mixture densities with application to Gaussian denoising. arXiv preprint arXiv:1712.02009.

Sarkar, S. K. (2002). Some results on false discovery rate in stepwise multiple testing procedures. Ann. Statist., 30:239-257.

Sarkar, S. K. and Zhao, Z. (2017). Local false discovery rate based methods for multiple testing of one-way classified hypotheses. arXiv:1712.05014.

Schwartzman, A., Dougherty, R. F., and Taylor, J. E. (2008). False discovery rate analysis of brain diffusion direction maps. Ann. Appl. Stat., 2(1):153-175.

Sharp, A. J., Hansen, S., Selzer, R. R., Cheng, Z., Regan, R., Hurst, J. A., Stewart, H., Price, S. M., Blair, E., Hennekam, R. C., et al. (2006). Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome. Nature Genetics, 38(9):1038.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis, volume 26. CRC Press.
Stein, M. L. (2012). Interpolation of Spatial Data: Some Theory for Kriging. Springer Science & Business Media.

Storey, J. D. (2002). A direct approach to false discovery rates. J. Roy. Statist. Soc. B, 64:479-498.

Storey, J. D., Taylor, J. E., and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J. Roy. Statist. Soc. B, 66(1):187-205.

Sun, W. and Cai, T. T. (2007). Oracle and adaptive compound decision rules for false discovery rate control. J. Amer. Statist. Assoc., 102:901-912.

Sun, W. and Cai, T. T. (2009). Large-scale multiple testing under dependence. J. R. Stat. Soc. B, 71:393-424.

Sun, W. and McLain, A. C. (2012). Multiple testing of composite null hypotheses in heteroscedastic models. Journal of the American Statistical Association, 107(498):673-687.

Sun, W., Reich, B. J., Cai, T. T., Guindani, M., and Schwartzman, A. (2015). False discovery control in large-scale spatial multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(1):59-83.

Sun, W. and Wei, Z. (2011). Large-scale multiple testing for pattern identification, with applications to time-course microarray experiments. J. Amer. Statist. Assoc., 106:73-88.

Taylor, J., Tibshirani, R., and Efron, B. (2005). The "miss rate" for the analysis of gene expression data. Biostatistics, 6(1):111-117.

Tian, E., Zhan, F., Walker, R., Rasmussen, E., Ma, Y., Barlogie, B., and Shaughnessy, J. D. (2003). The role of the Wnt-signaling antagonist DKK1 in the development of osteolytic lesions in multiple myeloma. New England Journal of Medicine, 349(26):2483-2494. PMID: 14695408.

Tusher, V. G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U. S. A., 98(9):5116-5121.

Wand, M. P. and Jones, M. C. (1994). Kernel Smoothing, volume 60 of Chapman and Hall/CRC Monographs on Statistics and Applied Probability. Chapman and Hall/CRC.

Wasserman, L. (2006). All of Nonparametric Statistics. Springer Science & Business Media.

Wei, Z., Sun, W., Wang, K., and Hakonarson, H. (2009). Multiple testing in genome-wide association studies via hidden Markov models. Bioinformatics, 25(21):2802-2808.

Weinstein, A., Ma, Z., Brown, L. D., and Zhang, C.-H. (2018). Group-linear empirical Bayes estimates for a heteroscedastic normal mean. Journal of the American Statistical Association, 113:698-710.

Xia, Y., Cai, T. T., and Sun, W. (2019). GAP: A general framework for information pooling in two-sample sparse inference. Journal of the American Statistical Association, to appear.

Xie, X., Kou, S., and Brown, L. D. (2012). SURE estimates for a heteroscedastic hierarchical model. Journal of the American Statistical Association, 107(500):1465-1479.

Yang, F., Balakrishnan, S., and Wainwright, M. J. (2015). Statistical and computational guarantees for the Baum-Welch algorithm. In Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on, pages 658-665. IEEE.

Zhao, Z., De Stefani, L., Zgraggen, E., Binnig, C., Upfal, E., and Kraska, T. (2017). Controlling false discoveries during interactive data exploration. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD '17, pages 527-540, New York, NY, USA. ACM.