NONPARAMETRIC EMPIRICAL BAYES METHODS FOR LARGE-SCALE INFERENCE UNDER HETEROSCEDASTICITY by Luella Fu A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (BUSINESS ADMINISTRATION) December 2018 Copyright 2018 Luella Fu To my family for loving me. You have always supported me, and I can’t thank you enough. To every friend or person who inspired a sticky note on my office wall. You give shape to my ideals and inspire me to be better. To every teacher from high school into graduate school who encouraged learning and teaching. Where would anyone I know be without people like you? ii Acknowledgments The most rewarding part of my Ph.D. comprised not only discovery about statistics but also per- sonal fulfillment. I cannot really separate one from the other, and I have many people to thank for both. Part of my personal satisfaction and joy derived from being around not just good thinkers but good people. It is a rare person who can maintain an amiable nature at all times, but my co-advisor Wenguang Sun is such a person. His ability to bring his kindness and excitement to our research meetings, at the exclusion of his other tasks and circumstances, helped me as much as his exper- tise. This is saying a lot, since I owe to Wen my entire knowledge of the multiple testing field. His insights into the interplay between theory and simulation were particularly enlightening. Further, his humility is awe-inspiring and worthy of emulation. Also, I have always admired wit, creativity, and commonsense; luckily, I got to benefit from all of these with Gareth James as my co-advisor. Gareth brought exciting and much needed perspectives to my research. His suggestions elevated its scope while his edits made it more intelligible. From him, I also learned sanity checks for data analysis, no small contribution to my mental health. I hope to honor the high standards for research set by both of my advisors. Other researchers helped me along the way as well. Gourab Mukherjee, Jacob Bien, and Sid- darth Sivaramakrishnan were kind enough to be on my dissertation committee. Also, Siddarth, Thomas Valente, Megan Jacobs, Nathan Cobb, Jody Brookover, and Amanda Graham introduced me to different positive collaborative styles. Jacob helpfully asked questions and gave comments in seminars, while Gourab Mukherjee bestowed unique advice and exuded enthusiastic interest. I am grateful to benefit from such examples of geniality, mentorship, and thoughtful research. The people I spent the most time with were fellow students, and I hardly do them justice by mentioning just a few qualities they brought into my life. Courtney Paulson took statistical nerdi- ness and irreverent humor to a new level with her statistics limericks. She is also a paragon teacher, who along with other passionate teachers, fueled my enthusiasm for and fascination with teaching. Pallavi “Life Coach” Basu has such deep empathy, making my setbacks and achievements her own; she got me through tough personal crises. Pallavi was also the one I most loved to talk to about iii papers and statistics ideas, because I could draw on her seemingly endless knowledge of statistics literature. She also has the driest humor of anyone I have ever met, so much so that it took two years for me to fully appreciate, but it was more rewarding for the slow realization! 
Xinghao Qiao was brave and dogged as the first example I could follow in navigating the stages of our Ph.D. program. I thank him for being so eager to ease the journey for the rest of us. Trambak Banerjee, I thank for thinking that I knew things but who actually informed me about things. Weinan Wang, I thank for apt gifs and frank conversations. Negin Golrezaei, Josh Derenski, Simeng Shao, Wilson Lin, Brad Rava, and Mike Huang, I thank for being enjoyable company and inspirational in their sustainable work habits, kindness, and enthusiasm. Also thanks to Guang Li, who so casually said the most profound thing I heard from any student or faculty about work, life, and death. Other people, in my life outside school, also made me happy and studious, and they continue to make me happy and good (at least better than I would have been). Dushan Perera and my mother Yun Zhou both provided excellent companionship when I worked, making it possible for me to be both extremely studious and extremely contented. Dushan was also my moral companion on a philosophical quest to understand the causes of stress and effects of behavior on environment. My life became much more tranquil and therefore productive after I met him. Xiaotong Li, my father Yigang Fu, and my sister Sue Fu were my braintrust when it came to lining up priorities and overcoming difficulties. Also, thanks to Chris Wu who pulled all-nighters with me during my first two Ph.D. years, before I eventually discovered that this routine led to a sort of sleepiness involving visions of Mickey Mouse tap-dancing across the classroom blackboard. Many friends, including Xioatong, Jonathan O’Connell, Irene Graham, and Carson Tang also helped me by sharing with me their interesting statistical problems encountered at work. They reminded me of why I enjoy and am fascinated by statistics. Other experiences outside of my studies helped me as well. The Visions and V oices program and the Joint Education Project at USC gave me exposure to different life views, and volunteering gave me something less selfish to do in my six years than forward my own career. Thank you to everyone who made this Ph.D. possible and who deepened my understanding of meaningful topics. I owe my achievements to you. iv Table of Contents Dedication ii Acknowledgments iii Table of Contents v List of Tables x List of Figures xi Abstract xiii 1 Introduction 1 1.1 Example from exploratory analysis of Facebook networks . . . . . . . . . . . 2 1.1.1 Background and motivation . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.2 Statistical questions raised by exploratory results . . . . . . . . . . . 3 1.1.3 Larger scale, more problems . . . . . . . . . . . . . . . . . . . . . . 7 1.2 Estimating a Vector of Means . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.2.1 The homoscedastic case and its history . . . . . . . . . . . . . . . . . 7 1.2.2 Linear shrinkage estimators . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.3 Nonlinear thresholding estimators . . . . . . . . . . . . . . . . . . . 9 1.2.4 Tweedie’s formula . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.2.5 Nonparametric empirical Bayes methodology . . . . . . . . . . . . . 10 1.3 Multiple Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 v 1.3.1 Problem setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.3.2 Tests based on standardized statistics . . . . . . . . . . . . . . . . . . 12 1.4 Summary of Chapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. 13 2 Nonparametric Empirical Bayes Estimation 15 2.1 Tweedie’s Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.1.1 Tweedie’s formula . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.1.2 Multivariate Tweedie’s formula . . . . . . . . . . . . . . . . . . . . . 19 2.2 NEST for the Heteroscedastic Normal Means Problem . . . . . . . . . . . . 22 2.2.1 Weighted kernel density estimation . . . . . . . . . . . . . . . . . . . 22 2.2.2 Selecting the tuning parameters . . . . . . . . . . . . . . . . . . . . . 24 2.3 Asymptotic Properties of NEST . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.1 Asymptotic setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.3.2 Asymptotic optimality of NEST . . . . . . . . . . . . . . . . . . . . 27 2.3.3 NEST and selection bias . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4 Numeric Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4.1 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4.2 Selection bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.4.3 California API data . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.5 Proofs and supplementary material . . . . . . . . . . . . . . . . . . . . . . . 41 2.5.1 Expressions for Common Members of the Exponential Family . . . . 42 2.5.2 Proof of Theorem 2.1.1 . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.5.3 Proof of Proposition 2.2.1 . . . . . . . . . . . . . . . . . . . . . . . 45 vi 2.5.4 Three lemmas for Proposition 2.3.1 . . . . . . . . . . . . . . . . . . 46 2.5.5 Proof of Proposition 2.3.2 . . . . . . . . . . . . . . . . . . . . . . . 48 2.5.6 Proof of Proposition 2.3.3 . . . . . . . . . . . . . . . . . . . . . . . 49 2.5.7 Proof of Theorem 2.3.2 . . . . . . . . . . . . . . . . . . . . . . . . . 50 2.5.8 Proof of Theorem 2.3.3 . . . . . . . . . . . . . . . . . . . . . . . . . 51 2.5.9 Proof of Lemma 2.5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . 52 2.5.10 Proof of Lemma 2.5.3 . . . . . . . . . . . . . . . . . . . . . . . . . . 54 2.5.11 Proof of Lemma 2.5.4 . . . . . . . . . . . . . . . . . . . . . . . . . . 57 2.5.12 Table of large school proportions . . . . . . . . . . . . . . . . . . . . 61 3 The Mountain Comes to Muhammad 62 3.1 Oracle Rule for FDR Control . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.1.1 Notations and definitions . . . . . . . . . . . . . . . . . . . . . . . . 65 3.1.2 Lfdr and oracle rule with standardized statistics . . . . . . . . . . . . 66 3.1.3 Oracle rule adjusted for . . . . . . . . . . . . . . . . . . . . . . . . 67 3.1.4 The Mountain Comes to Muhammad . . . . . . . . . . . . . . . . . . 68 3.1.5 Connections to existing works . . . . . . . . . . . . . . . . . . . . . 70 3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.2.1 Estimation ofT OR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.2.2 Estimation oft OR . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 3.2.3 Muhammad Meets the Mountain (3M) and its theoretical properties . 74 3.2.4 Extension to the case where i and i are correlated . . . . . . . . . . 75 3.3 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 vii 3.3.1 Comparison for grouped hypotheses . . . . . . . . . . . . . . . . . . 76 3.3.2 Comparison in more general settings . . . . . . . . . . . . . . . . . . 78 3.4 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . 80 3.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 3.5.1 Proof of Theorem 3.1.1 . . . . . . . . . . . . . . . . . . . . . . . . . 84 3.5.2 Proof of Proposition 3.2.1 . . . . . . . . . . . . . . . . . . . . . . . 87 3.5.3 Proof of Theorem 3.2.1 . . . . . . . . . . . . . . . . . . . . . . . . . 88 3.5.4 Proof of Lemma 3.5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . 91 3.5.5 Proof of Lemma 3.5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4 Future Directions 92 4.1 Top-Percent Selection in Large-scale Inference with Heteroscedastic Error . . 92 4.1.1 Motivating Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 93 4.1.2 Selection and Comparison Issues . . . . . . . . . . . . . . . . . . . . 94 4.1.3 Goals, Notation, and Proposal . . . . . . . . . . . . . . . . . . . . . 95 4.1.4 Oracle procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.1.5 Data-driven procedure . . . . . . . . . . . . . . . . . . . . . . . . . 100 4.1.6 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 4.1.7 Sketch of Proof (a): Validity of the Procedure 1 . . . . . . . . . . . . 101 4.2 FDX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 4.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 4.2.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 103 4.2.3 Oracle and data-driven procedures . . . . . . . . . . . . . . . . . . . 104 viii 4.2.4 Theoretical properties and ideas for establishing them . . . . . . . . . 106 4.2.5 Connections to other methods . . . . . . . . . . . . . . . . . . . . . 107 4.2.6 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 4.2.7 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 Bibliography 114 ix List of Tables 1.1 Social network metrics of interest and their definition. . . . . . . . . . . . . . 4 1.2 Two-sample comparisons of smokers versus nonsmokers on four types of net- works and eleven metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1 Normal model mean squared errors with standard errors in parentheses. Bolded terms represent oracle, and best performing competitor(s). . . . . . . . . . . . 34 2.2 Sparse model mean squared errors with standard errors in parentheses. Bolded terms represent best performing methods. . . . . . . . . . . . . . . . . . . . 35 2.3 Two-point model mean squared errors with standard errors in parentheses. Bolded terms represent best performing methods. . . . . . . . . . . . . . . . . . . . 36 2.4 Mean squared errors (standard errors) for the proportion of large schools. Bolded terms represent best performances. . . . . . . . . . . . . . . . . . . . . . . . 40 2.5 Proportion of large schools. Bolded terms represent best performances. . . . . 61 3.1 Comparison of multiple testing procedures: number of genes (% of total) selected as significantly different between ALL and AML patients. . . . . . . . . . . . 81 3.2 Comparison of multiple testing procedures: number of genes (% of total) selected as significantly different between DLBCL and FL patients. . . . . . . . . . . 81 3.3 Comparison of multiple testing procedures: number of genes (% of total) selected as significantly different between multiple myeloma patients with and without lesions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
83 x List of Figures 2.1 Differences ^ for Naive (dark grey histograms) compared to TF (white his- tograms, left) and NEST (white histograms, right) for the smallest 20 observa- tions across 200 simulations in two settings; top row has X i 0:7N(1;:5 2 + 1 2 ) + 0:3N(1;:5 2 + 3 2 ) and bottom hasX i :7N(0;:5 2 + 1 2 ) +:3N(5;:5 2 + 3 2 ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.2 NEST shrinkage as a function ofx ands, plotted againstx for differents. . . 41 3.1 The Mountain Comes to Muhammad: A graphical illustration. . . . . . . . . 69 3.2 A depiction of information loss due to standardization: the Clfdr (solid blue curve) uses information from i while the Lfdr (dashed black line) does not. 70 3.3 Two groups with varying where 1 = 1 and 2 = 3. . . . . . . . . . . . . . 77 3.4 Two groups with varying 0 where 1 = 1 and 2 = 3 and =:2. . . . . . . 78 3.5 Two groups with varying 2 where 0 = 2 and =:2. . . . . . . . . . . . . 79 3.6 Mixture for with varying where max = 2 (top) or max = 2:5 (bottom). . 80 3.7 Differentially expressed genes between ALL and AML leukemia patients. . . 82 3.8 Differentially expressed genes between DLBCL and FL lymphoma patients. 83 3.9 Differentially expressed genes between multiple myeloma patients with lesions and without lesions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.1 Simulation results varyingp. Methods are denoted by2 LR05 (green),# RV11 (blue), and4 proposed (red). The plot is for the data driven method. . . . . . 109 xi 4.2 Simulation results varyingm. Methods are denoted by2 LR05 (green),# RV11 (blue), and4 proposed (red). The plot is for the data driven method. . . . . . 109 4.3 Simulation results varying. Methods are denoted by2 LR05 (green),# (blue) RV11, and4 proposed (red). The plot is for the data driven method. . . . . . 110 4.4 Simulation results varying Methods are denoted by2 LR05 (green),# RV11 (blue), and4 proposed (red). The plot is for the data driven method. . . . . . 110 4.5 Simulation results varying. Methods are denoted by2 LR05 (green),# RV11 (blue), and4 proposed (red). The plot is for the data driven method. . . . . . 111 4.6 Results for prostate cancer study. In (a), two histograms are overlaid: Lfdr is in dark gray and p-values are in light gray. In (b), methods are denoted by2 LR05 (green),# RV11 (blue), and4 proposed (data-driven) (red). . . . . . . . . . 112 xii Abstract This thesis addresses inferential phenomena that arise from large-scale data and develops nonpara- metric Empirical Bayes (EB) techniques that exploit data for improved inference in estimation and testing. In particular, we consider four settings, unified by the goal of maximizing “good” deci- sions subject to a control of “poor” decisions, in which EB can be successfully implemented for large-scale problems. We will see that both estimation of parameters and hypothesis testing play a role. The simultaneous estimation of a vector of parameters i , based on a corresponding set of observations x i , for i = 1;:::;n, is a key research problem that has resurfaced in the high- dimensional setting due to selection bias concerns and which we study in order to incorporate parameter estimates into multiple testing procedures. In the typical estimation framework, the classic example involves estimating a vector of normal means i subject to a fixed variance term 2 , and the goal is to pool information in a way that corrects for extreme mean estimates that occur by chance. 
However, many practical situations involve heterogeneous data (x i ; i ) where i is a known nuisance parameter e.g. i = 2 i . Effectively pooling information across samples while correctly accounting for heterogeneity presents a significant challenge in large-scale estimation problems. We address this issue by introducing the “Nonparametric Empirical Bayes Smooth- ing Tweedie” (NEST) estimator, which efficiently estimates i without making any parametric assumptions on its distribution and properly adjusts for heterogeneity via a generalized version of Tweedie’s formula. The estimation framework is simple but general enough to accommodate any member of the exponential family of distributions; a thorough study of the normal means problem subject to heterogeneous variances is presented to illustrate the proposed framework. Our theo- retical results show that NEST is asymptotically optimal, while simulation studies show that it outperforms competing methods, with substantial efficiency gains in many settings. The method is further demonstrated on a data set measuring the performance gap in math scores between socioe- conomically advantaged and disadvantaged students in K-12 schools. xiii After addressing the issue of parameter estimation in the presence of heterogeneous nuisance parameters, our next step is to consider the analogous problem in the multiple testing framework. In multiple testing procedures, two common techniques are to pool information across testing units in order to boost power and to standardize statistics, usually yieldingp-values orz-values. We show through analytical and intuitive arguments that for heterogeneous data (x i ; 2 i ), where i = 1;:::;m, it is advantageous to use the original bivariate data rather than the standardizedz i . We propose a method, “Muhammad Meets the Mountain” (3M) that produces test statistics which condition on i . Theory and simulations establish 3M’s ability to improve power. The proposed method is also compared with standardized methods on real microarray data. Then, we discuss two preliminary works that extend our research in multiple testing. One is in importance selection, which combines ideas from parameter estimation and hypothesis testing in order to provide an alternative selection criteria to significance testing, and where the goal is to select interesting rather than significant units. The second is in false discovery exceedance control, which provides stricter Type I error control than the false discovery rate (FDR) but more relaxed control than the familywise error rate (FWER). xiv Chapter 1 Introduction To give a top-down introduction, this thesis is mainly concerned with a single objective, adopted from the multiple hypothesis testing field, that seeks to control poor binary (yes or no) decisions while maximizing the number of good binary decisions. All material are unified by their use of nonparametric empirical Bayes methods to accomplish these ends. To delve one step further into this thesis, there are two avenues we pursue: one, we wish to make “good decisions” that incorporate the magnitude of the underlying parameters and variability of the statistics analyzed. Then, naturally, this thesis veers into the study of estimating parameters as well as hypothesis testing in order to ultimately combine the two ideas. Two, we consider procedures that “control poor decisions” in ways which seem practical in the context of real-world hypothesis testing. 
To step just one level further into the details, we present four works, or in the case of the last two, preliminary works, where the first work on estimating parameters in a particular setting and the second on testing hypotheses in a similar setting build into the third work on incorporating both into a hypothesis testing procedure. As these works focus more on increasing the number of “good decisions” made, the last work turns to consider limiting “poor decisions” in a setting applicable to scientific research, but actually preliminary analysis suggests that viewing “poor decisions” through the lens of the fourth work can also lead to more “good decisions”. The rest of the chapter gives an overview of the inferential problems that subsequent chapters address. As the topics of this thesis all concern complications that originate in large-scale data, the introduction begins with an example from exploratory Facebook analysis that illustrates how the large volume of data available these days creates challenges that did not exist before the informa- tion revolution and then eases the reader into the three main themes that arise in the four works of this thesis: the problem of estimating a vector of means, the framework of multiple testing, and the ideas that define a nonparametric empirical Bayes method. 1 1.1. Example from exploratory analysis of Facebook networks The following concerns preliminary exploration of data from a study comparing the Facebook networks of smokers versus nonsmokers. It represents an actual situation faced by researchers who handle large data sets and points out statistical perplexities that we want to solve. This section is based on work with Megan A. Jacobs, Jody Brookover, Thomas W. Valente, Nathan K. Cobb, and Amanda L. Graham. We begin with some background for the study, present the results, discuss four concerns, and relate these problems to the large-scale inference questions of Chapters 2 - 4. The discussion is rather general in this section, while the formal set-up for large-scale inference is deferred to subsequent sections. 1.1.1 Background and motivation Social networks influence health behavior, including tobacco use and cessation. However, little is known about how the networks of individuals differ by health behavior, specifically whether social networks differ between smokers and non-smokers. Online social networks such as Facebook pro- vide a unique opportunity to compare the structure and characteristics of the networks of smokers and non-smokers. These insights could have important implications for the design of public health campaigns or tobacco cessation interventions that involve social networks. This study explores the structure of ego networks of both smokers and non-smokers collected as part of a randomized control trial conducted within Facebook. The data was collected in 2012 and 2013, shortly before Facebook severely limited the API interface that allowed access to detailed network information. During the parent trial (Cobb et al. 2014, 2016), a total of 14; 010 individuals installed a Face- book smoking cessation app. These individuals comprise the analytic sample for the present anal- yses. Included in the analytic sample are the 9,042 smokers who went on to be randomized in the parent trial, an additional 2,881 smokers who did not meet full eligibility criteria and were not randomized, and 2,087 non-smokers. The total number of smokers is 11; 923 and nonsmokers is 2; 087. 
2 For each individual in the analytic sample, commonly referred to as an ego in the social network literature to indicate that the social network is built around them, we constructed an egocentric net- work out to second degree ties (connections between the individual and their friends as well as the connections between their friends). Four kinds of Facebook networks were examined: Friend- ship, Family, Photo Association, and Group Association networks. The primary Facebook network consists of friendships between users, but users who downloaded the app also consented to give further information about their Facebook activities. Since Facebook allows users to tag friends as “family”, we could label ties as being familial to enable the construction of Family networks. The Photo Association network for each individual was constructed using tags in participants’ Facebook photographs. The Group Association network was constructed using information about Facebook groups that participants joined and the membership of those groups. Eleven network metrics were used in this study: vertices, edges, density, isolates, diameter, communities, betweenness centrality, closeness centrality, transitivity, clusters, and modularity. Table 1.1 provides definitions for these metrics. The first four metrics are descriptive metrics of how large a network is and how many connections it contains. These are lower order network structural terms that are a necessary first step in characterizing a network. The remaining seven metrics are higher order structural measures that can be classified into three types: 1) cohesion (diameter, communities), or how compact a networks is; 2) clustering (transitivity, clusters, and modularity), the extent to which a network is characterized by having distinct, separate pockets of dense interconnectivity separated by bridges; and 3) centrality (closeness, betweenness), or whether there are some individuals that are particularly prominent or central in the network. We calculated these eleven social network metrics using the iGraph 1.0.1 package for R (Csardi and Nepusz 2006). 1.1.2 Statistical questions raised by exploratory results Since the purpose of this study was to discover if any striking network differences existed between smokers and nonsmokers, it was natural to begin the exploration with some basic comparisons 3 Table 1.1: Social network metrics of interest and their definition. 
Social network metric Definition Vertices Number of Facebook friends of an ego (i.e., network size), an ego-level metric Edges Number of friendships in the network, including friendships between an ego’s friends, a network-level metric Density* Fraction of existing friendships over all possible, a network-level metric Isolates Number of individuals with no friends other than the ego, a network-level metric Diameter* Maximum degree of separation between any two individuals in the network, a network-level metric Communities* Number of groups, sorted to increase dense connections within the group and decrease sparse connections outside it (i.e., to maximize modularity), a network-level metric Closeness* Average of how closely associated members are to one another, a network-level metric Betweenness* An average of the relative importance of all individuals within their own network, normalized as 2*betweenness/(vertices-1)(vertices-2), a network-level metric Transitivity* Number of triads divided by all possible connections, a network-level metric Clusters* Number of subnetworks, a network-level metric Modularity The fraction of ties between community members in excess of the expected number of ties within communities if ties were random, a network-level metric * calculated after removing ego and isolates of the network metrics between these two groups. Two-sided t-tests compared smokers and non- smokers on these eleven metrics on the four different networks for a total of 44 tests. On metrics unsuitable for t-testing, the Wilcoxon-Mann-Whitney two-sample rank-sum test was used. T-tests and rank-sum tests detected if and how smoking and nonsmoking Facebook networks differed. While the t-tests compared means of two groups, the rank-sum tests specifically assessed whether the values of one group were higher than those of the other. The p-values are reported in Table 1.2. We point out four complications raised by results like these, which are quite common to large data sets. First, note that most of these comparisons givep-values of zero. Since there are thou- sands of smokers and nonsmokers in the analysis, it is not surprising that even tiny differences are statistically real. It would be more surprising if, say, nonsmoker and smoker groups both had a 4 Table 1.2: Two-sample comparisons of smokers versus nonsmokers on four types of networks and eleven metrics. 
Friendship network metrics Smokers Nonsmokers p-value Vertices, Median (IQR) 196 (94-380) 142 (61-310) 0.0000 Edges, Median (IQR) 1355 (457-4201) 1000 (286-3921) 0.0000 Density 0.12 (1.13E-03) 0.17 (3.62E-03) 0.0000 Isolates, Median (IQR) 11 (5-26) 7 (2-17) 0.0000 Diameter, Median (IQR) 7 (5-9) 6 (5-8) 0.0000 Communities, Median (IQR) 7 (5-11) 5 (4-9) 0.0000 Closeness, Mean (SE) 0.54 (5.03E-04) 0.56 (1.66E-03) 0.0000 Betweenness, Mean (SE) 8.50E-03 (1.50E-04) 0.01 (4.41E-04) 0.0000 Transitivity, Mean (SE) 0.36 (1.34E-03) 0.41 (3.39E-03) 0.0000 Clusters, Median (IQR) 3 (1-5) 2 (1-4) 0.0000 Modularity, Mean (SE) 0.43 (1.53E-03) 0.39 (3.73E-03) 0.0000 Family network metrics Smokers Nonsmokers p-value Vertices, Median (IQR) 5 (3-8) 4 (2-7) 0.0000 Edges, Median (IQR) 5 (2-12) 4 (1-9) 0.0000 Density 0.61 (3.19E-03) 0.66 (8.73E-03) 0.0000 Isolates, Median (IQR) 1 (1-3) 1 (1-2) 0.0228 Diameter, Median (IQR) 1 (0-2) 1 (0-2) 0.0000 Communities, Median (IQR) 1 (1-1) 1 (1-1) 0.0000 Closeness, Mean (SE) 0.77 (1.28E-03) 0.80 (3.85E-03) 0.0000 Betweenness, Mean (SE) 0.13 (3.95E-03) 0.14 (1.20E-02) 0.0072 Transitivity, Mean (SE) 0.38 (1.78E-03) 0.37 (4.94E-03) 0.2881 Clusters, Median (IQR) 1 (0-1) 1 (0-1) 0.0000 Modularity, Mean (SE) 0.15 (2.86E-03) 0.12 (7.70E-03) 0.0000 Photo network metrics Smokers Nonsmokers p-value Vertices, Median (IQR) 5 (3-10) 4 (2-9) 0.0032 Edges, Median (IQR) 5 (2-12) 4 (1-10) 0.0011 Density 0.54 (5.17E-03) 0.58 (1.36E-02) 0.0125 Isolates, Median (IQR) 2 (1-5) 2 (1-5) 0.1695 Diameter, Median (IQR) 0 (0-2) 0 (0-1) 0.0006 Communities, Median (IQR) 1 (1-1) 1 (1-1) 0.0002 Closeness, Mean (SE) 0.74 (2.75E-03) 0.76 (7.29E-03) 0.0081 Betweenness, Mean (SE) 0.15 (1.87E-03) 0.15 (4.97E-03) 0.2105 Transitivity, Mean (SE) 0.23 (5.17E-03) 0.21 (1.34E-02) 0.2082 Clusters, Median (IQR) 0 (0-1) 0 (0-1) 0.0038 Modularity, Mean (SE) 0.15 (4.81E-03) 0.13 (1.35E-02) 0.3878 Group network metrics Smokers Nonsmokers p-value Vertices, Median (IQR) 14 (5-34) 11 (4-27) 0.0000 Edges, Median (IQR) 53 (8-302) 38 (6-199) 0.0000 Density 0.69 (3.00E-03) 0.72 (7.91E-03) 0.0000 Isolates, Median (IQR) 0 (0-1) 0 (0-1) 0.0366 Diameter, Median (IQR) 2 (1-3) 1 (1-3) 0.0345 Communities, Median (IQR) 1 (1-2) 1 (1-2) 0.0000 Closeness, Mean (SE) 0.80 (1.79E-03) 0.82 (4.77E-03) 0.0000 Betweenness, Mean (SE) 0.04 (7.65E-04) 0.03 (1.87E-03) 0.1693 Transitivity, Mean (SE) 0.82 (2.56E-03) 0.84 (6.61E-03) 0.0965 Clusters, Median (IQR) 1 (1-2) 1 (1-2) 0.0000 Modularity, Mean (SE) 0.16 (2.36E-03) 0.14 (6.01E-03) 0.0023 5 median of exactly 196 friends. Yet, especially for data exploration, the goal is to investigate inter- esting phenomenon rather than merely statistically significant phenomenon. This idea is explored further in Chapter 4 under a more challenging setting where the number of tests is in the hundreds or thousands rather than on the scale of ten. Second, if we procured an analogue top-values that could detect interesting differences, how would we guard against erroneous discoveries? For a single test, common practice sets the-level, the probability of detecting a difference when no true difference exists, to a low level, like 5% and uses that as a threshold forp-values. In the multiple testing setting, that-level applied to each test cannot control the probability of finding any false differences below. It is easiest to understand the problem in the context of making 1; 000 comparisons. At =:05, 50 tests would be picked out as interesting when actually not. 
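A quick simulation (purely illustrative, with hypothetical object names) makes the arithmetic concrete; the R code below runs 1,000 two-sample comparisons when no true differences exist:

    set.seed(1)
    n_tests <- 1000
    pvals   <- replicate(n_tests, t.test(rnorm(50), rnorm(50))$p.value)  # every null is true
    sum(pvals < 0.05)             # roughly 50 comparisons look "interesting" by chance alone
    sum(pvals < 0.05 / n_tests)   # a Bonferroni-style family-wise threshold flags almost none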
This has become a concern for the greater science community (see McCarthy et al. (2008), Head et al. (2015) and Heller and Yekutieli (2014)), since this phenomenon produces findings that cannot be replicated. Amusing stories have even been reported in the media about bogus findings, such as chocolate’s ability to aid weight-loss (Bohannon 2015). Third, our analyses contrasting these smoker and nonsmoker networks should account for the group structure of the data. Which groups though? One way to group the tests is by the type of network, but another way is to group by the metric. From the descriptions in Table 1.1, metrics can even be subgrouped based on the kind of latent network structure they capture. Then once groups are found, how should we incorporate them into our analysis? Chapters 2 and 3 discuss some grouping ideas in the context of other problems. The fourth issue, related to but more general than the third, is how do we handle correlated comparisons? This has long been a tricky issue in multiple-testing. We formulate a way to deal with correlation specifically between means and variances in Chapter 3. These issues cover a wide swath of topics. While this thesis will discuss and present work related to these ideas, not all questions will be fully or directly addressed. It is our hope, how- ever, that this example has conveyed two messages: one, large-scale paradigms are different from moderate- or small-scale settings in ways that require new methodology. Two, large-scale infer- ence is motivated by a host of practical questions that deserve to be addressed. 6 1.1.3 Larger scale, more problems The Facebook data is large-scale in the sense that data from many people is aggregated into test statistics, such as the mean and median. Another kind of large-scale data arises from having many statistics, such as means, to analyze. The most common example of this is perhaps in genetics, when genes for cancer and cancer-free patients are routinely studied. Typically, such studies pro- duce average gene expression levels on thousands or more genes compared across patient groups. Not only are the testing challenges discussed in Section 1.1.2 exacerbated by the number of com- parisons conducted, but even estimating means is no longer a straightforward matter. It is this sort of large-scale setting that we consider throughout the rest of the thesis. The next two sections formulate the problems and frame the difficulties of estimating and testing many parallel statistics. 1.2. Estimating a Vector of Means This section formally describes the problem of estimating a vector of means, which is an important special case of the research in Chapter 2, and gives a literature review so that the reader gains a sense of work in the area. This prepares the reader for Chapter 2, where the problem is generalized and additional challenges are presented. 1.2.1 The homoscedastic case and its history As a formal introduction to the problem, consider estimating a population mean vector = ( 1 ; ; n ) T based on a vector of summary statisticsX = (X 1 ; ;X n ) T . The X i are con- ditionally normally distributed, as they would be if they represented sample averages: X i j i ind N( i ; 1 2 ). 
This is a classic low-dimensional problem, with roots in the works of Robbins (1956) and James and Stein (1961), but it has re-emerged as a key research question in the high-dimensional setting thanks to Efron (2011), who recognized that an old idea, shrinkage estimation, could resolve the new issue of selection bias: for massive data sets, some data points are large merely by chance, causing traditional estimators to overestimate the corresponding means.

Shrinkage estimation has provided an effective bias reduction tool for simultaneous estimation problems. There are several popular classes of methods, of which we describe three briefly: linear shrinkage estimators (James and Stein 1961, Efron and Morris 1975, Berger 1976), non-linear thresholding-based estimators motivated by sparse priors (Donoho and Johnstone 1994, Johnstone and Silverman 2004, Abramovich et al. 2006), and full Bayes or empirical Bayes (EB) estimators with unspecified priors (Brown and Greenshtein 2009, Jiang and Zhang 2009, Castillo and van der Vaart 2012).

1.2.2 Linear shrinkage estimators

Linear shrinkage estimators are adapted from and exemplified by the James-Stein (JS) estimator (James and Stein 1961), which has the form
$$\hat{\theta}^{JS}_i = \left\{1 - \frac{m-2}{\sum_{i=1}^m x_i^2}\right\} x_i.$$
Recall that when every $\theta_i$ equals zero, $\sum_{i=1}^m x_i^2$ follows a chi-squared distribution with expectation equal to the degrees of freedom, $m$. Thus, in expectation, for $m \geq 2$ the term in the braces lies between zero and one inclusive, so that the estimator shrinks $x_i$ towards zero. Subsequent work in this vein generalized the formula to shrink $x_i$ towards other points, with the goal of moving $x_i$ closer to the true $\theta_i$. For instance, $x_i$ can be shrunk towards the sample mean $\bar{x}$ through the following modification:
$$\hat{\theta}^{JS'}_i = \bar{x} + \left\{1 - \frac{m-3}{\sum_{i=1}^m (x_i - \bar{x})^2}\right\}(x_i - \bar{x}).$$

1.2.3 Nonlinear thresholding estimators

For nonlinear thresholding estimators, we take Johnstone and Silverman (2004) as our primary example. The method is motivated by a model where the means $\theta_i$ are themselves random variables drawn from a mixture distribution $f_{prior}(\theta)$ that generates mostly zeros:
$$f_{prior}(\theta) = (1-\eta)\,\delta_0(\theta) + \eta\, g(\theta),$$
where $\delta_0$ is the Dirac delta function that has a point mass at $0$ with probability $1$, $\eta$ is a small probability of a non-zero $\theta$, and $g(\theta)$ is an unspecified distribution that often has some desired properties, such as symmetry or unimodality. Because the $\theta_i$ are sparse, members of this class of estimators, which we will denote by TH, use thresholding on $x_i$:
$$\hat{\theta}^{TH}_i = x_i\, I\{|x_i| \geq t(\eta)\},$$
where $I$ is the indicator function and $t$ is a function of the probability $\eta$ of a non-zero $\theta_i$. Prominent features of this class of estimators are that they are non-linear and that, when the magnitude of $x_i$ is below the threshold, the estimated means are zero.

1.2.4 Tweedie's formula

The third group of shrinkage estimators includes Tweedie's formula (TF) (Robbins 1956). Tweedie's rule is an elegant empirical Bayes shrinkage estimator for distributions from the exponential family. Similar to the nonlinear thresholding class of estimators, Tweedie's rule considers a prior for the underlying means, $\theta_i \sim g(\theta)$, but the form of $g(\theta)$ does not need to be specified. Tweedie's formula works with $g(\theta)$ only through the marginal density $f(x)$ of the $x_i$. In the special case where the $X_i \mid \theta_i$ are normally distributed, Tweedie's formula is
$$\hat{\theta}^{TF}_i = E(\theta_i \mid x_i) = x_i + \frac{f^{(1)}(x_i)}{f(x_i)},$$
where $f^{(1)}(x_i)$ is the derivative of $f(x)$ evaluated at $x_i$. The expression to the right of $x_i$ is a shrinkage term that shifts $x_i$ back towards its mean.
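Before moving on, a short R sketch contrasts two of the estimators just described with the naive estimate on simulated homoscedastic data with unit variance; the sparse prior, the rule-of-thumb bandwidth, and all object names are illustrative choices only, not the procedures studied later in this thesis:

    set.seed(1)
    m     <- 2000
    theta <- ifelse(runif(m) < 0.2, rnorm(m, mean = 3), 0)     # mostly-zero means
    x     <- rnorm(m, mean = theta, sd = 1)                    # X_i | theta_i ~ N(theta_i, 1)
    theta_js <- (1 - (m - 2) / sum(x^2)) * x                   # James-Stein shrinkage toward zero
    h      <- 1.06 * sd(x) * m^(-1/5)                          # rule-of-thumb kernel bandwidth
    f_hat  <- sapply(x, function(x0) mean(dnorm(x0 - x, sd = h)))                # estimate of f
    f1_hat <- sapply(x, function(x0) mean((x - x0) / h^2 * dnorm(x0 - x, sd = h))) # its derivative
    theta_tf <- x + f1_hat / f_hat                             # Tweedie's formula with estimated f
    c(naive   = mean((x - theta)^2),
      js      = mean((theta_js - theta)^2),
      tweedie = mean((theta_tf - theta)^2))

Under a sparse prior like this one, the nonlinear Tweedie correction typically improves on both the naive estimate and the linear James-Stein fit.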
The formula is simple and intuitive since its implementa- tion in the empirical Bayes setting only requires the estimation of the marginal density of X i . This property is particularly appealing for large-scale estimation problems where nonparametric density estimates can be easily constructed from data. The resultant EB estimator enjoys optimal- ity properties (Brown and Greenshtein 2009) and delivers superior numerical performance (Efron 2011, Koenker and Mizera 2014). The work of Efron (2011) further convincingly demonstrates that Tweedie’s formula provides an effective tool for removing selection bias when estimating thousands of parameters simultaneously. Most of the research in this area, in particular the study of Tweedie’s formula, has been restricted to homoscedastic models of the form X i j i ; ind N( i ; 2 ). In Chapter 2, we con- sider a more general problem of estimating a vector of exponential family parameters i with heteroscedastic nuisance parameters i , of which the heteroscedastic modelX i j i ; ind N( i ; 2 i ) is a special case. 1.2.5 Nonparametric empirical Bayes methodology We take advantage of this section to highlight what characterizes a nonparametric empirical Bayes method, since this thesis uses the nonparametric empirical Bayes framework. The works discussed in Sections 1.2.2 - 1.2.4 provide a nice set of examples with which to explain how frequentist parametric models, parametric empirical Bayes, and nonparametric empirical Bayes models relate. The JS estimator was developed under a frequentist framework, which considers deterministic i and i . In contrast, Bayesian methods, whether pure or empirical, are characterized by the modeling of parameter i as a random variable following some prior distribution. Interestingly, the JS estimator has a Bayesian interpretation and can be derived from a Bayesian model that assumes i N(0; 2 ) (Efron and Morris 1975) and that uses an unbiased estimator for the unknown 1=(1+ 2 ) quantity that appears in the Bayes formula. The JS estimator can therefore be considered a parametric empirical Bayes model because it uses empirical data to estimate the parameters i 10 and 1=(1 + 2 ) but like a pure Bayes method, it assumes a parametric prior distribution. Tweedie’s formula is a nonparametric empirical Bayes method because it estimates even the prior distribution from data. Historically, the JS estimator has been a more popular springboard than Tweedie’s formula for new methods. It is easy to understand this from a data perspective: until late last century, the large data sets and computational power needed to empirically estimate the entire density function for Tweedie’s formula did not exist. For over a century researchers’ most viable option was to use parametric models. The first chapter of Efron (2010) describes the connections and trade-offs between the JS estimator and Tweedie’s formula in much more detail. In the context of this thesis, it is also interesting to note that Colombo and Morrison (1988) consider a fund-cutting situation amongst British Ph.D. programs that parallels the education funding problem presented in Chapter 2. In both, the criterion for academic success is based on a proportion; yet our approaches greatly differ. The 1988 paper uses a binomial prior to analyze 64 Ph.D. programs and obtains a parametric Bayes estimator, while we leave the prior unspecified but use thousands of K-12th grade schools for our nonparametric Bayes estimator. 
The contrast highlights how even similar questions can be approached in more flexible ways now that nonparametric empirical Bayes methods are practical to use. 1.3. Multiple Testing This section formally introduces the problem of multiple hypothesis testing and describes how testing procedures have evolved from p-value to z-value based statistics. This sets the stage for Chapter 3, where we depart from standardized statistics altogether. 11 1.3.1 Problem setup In the heteroscedastic setting, we observe summary statistics from a hierarchical model with het- eroscedastic errors: X i j i ; i ind N( i ; 2 i ); (1.1) where i are drawn from a partially unknown sparse prior, and are drawn from a fully unknown prior: i iid (1p)I( i = 0) +pg (); 2 i iid g (); (1.2) wherep2 (0; 1) is the proportion of nonzero signals, I( i = 0) is one when theH i;0 is true or otherwise zero, andg () andg () are unknown continuous density functions. The goal is to test m simultaneous hypothesis tests: H i;0 : i = 0 vs. H i;1 : i 6= 0; i = 1; ;m: (1.3) Tests are designed to control Type I errors in the form of the family-wise error rate (FWER) (Holm 1979, Simes 1986, Hochberg 1988, Hommel 1988), thek-family-wise error rate (k-FWER) (Lehmann and Romano 2005a, van der Laan et al. 2004, Romano and Shaikh 2006, Sarkar 2007), or the false discovery rate (FDR) (Benjamini and Hochberg 1995, Genovese and Wasserman 2002). FWER keeps a low probability on incorrectly rejecting any of them hypotheses whilek-FWER keeps a low probability on incorrectly rejecting more than k hypotheses. FDR control is more relaxed in that it allows a low average proportion of false rejections. FDR control tends to scale better than FWER ork-FWER control with larger numbers of tests, making it the most appropriate focus for Chapter 3. 1.3.2 Tests based on standardized statistics Tests are usually based on one of two standardized statistics: p-values orz-values, wherez-value based tests evolved from the former. In this section we review this development. 12 The traditionalp-value multiple testing approaches begin by standardizing the observed data, Z i = X i = i . By assuming that Z i follows the standard normal distribution under the ith null hypothesis, the (two-sided)p-value can be computed asP i = 2PfjZ i j>N(0; 1)g: Then, thep- values are ordered, and a threshold is applied to keep the falsely rejected nulls below a pre-specified level. Benjamini and Hochberg (2000) and Storey (2002) observed that the original FDR-control procedure (BH, Benjamini and Hochberg (1995)), is conservative when some testing units fall under the alternative hypothesis. This is in consequence of the fact that p-values incorporate no information about the alternative distribution. Efron et al. (2001) proposed an Empirical Bayes estimator based on z-values instead and introduced the local false discovery rate (Lfdr). Efron (2007) discussed advantages of usingz-values in reducing the number of parametric assumptions and in estimating non-null densities for power calculations. Sun and Cai (2007) further proved that the z-value optimal procedure is an Lfdr thresholding rule that uniformly dominates the p- value optimal procedure (Genovese and Wasserman 2002), where power gains come from pooling information acrossz-values and exploiting the sample’s distributional information. However, even usingz-values entails a loss of information relative to the full data (x;). 
While standardizing data makes testing units comparable so that information can be pooled, it also treats the power of each z i test as the same. It is intuitive that data with higher variance has smaller power than data with lower variance, but this information gets averaged out when thez i values are pooled. Therefore, a method based on the pairs (x i ; i ) is the focus of Chapter 3 and the power reducing consequences of standardizing are further elucidated. 1.4. Summary of Chapters The rest of the thesis is organized as follows. Chapter 2 picks up where Section 1.2 leaves off, and considers the problem of estimating a vector of parameters from an exponential family in the presence of a varying nuisance parameter. Chapter 3 further explores the implications of using (x i ; i ) rather thanz i as discussed in Section 1.3, develops a method for testing that does not rely 13 on standardized statistics, and also discusses connected research. Finally, Chapter 4 describes two ongoing works in ranking and selection, and in false discovery exceedance control. The last section of each chapter provides supplementary explanations, figures, data, and proofs, as indicated by the main text. 14 Chapter 2 Nonparametric Empirical Bayes Estimation on Heterogeneous Data Suppose that we are interested in estimating a population vector of parameters = ( 1 ;:::; n ) based on a vector of summary statisticsX = (X 1 ; ;X n ) T . The setting where i = i and X i j i N( i ; 2 ) is the most well-known example, but this problem occurs in a wide range of applications. Other common examples include: the Poisson distribution with i = i and the Binomial distribution with i =p i . This fundamental problem, largely resolved in the low-dimensional setting, has recently become a key research question in the high-dimensional setting. For example, in modern large- scale applications it is often of interest to perform selective inference, e.g. first select interesting values of X i and then carry out inferential analysis for the associated i ’s; see Benjamini and Yekutieli (2011), Berk et al. (2013), Weinstein et al. (2013), Lee et al. (2016), which all discuss approaches for the construction of simultaneous confidence intervals after a selection procedure is applied. In multiple testing as well as related ranking and selection problems, it is often desirable to incorporate estimates of the effect sizes i in the decision process to prioritize the selection of more scientifically meaningful hypotheses (Benjamini and Hochberg 1997, Sun and McLain 2012, He et al. 2015, Henderson and Newton 2016, Basu et al. 2017). We explore this subject further in Chapter 4. Most of the research in this area has been restricted to models of the formf(x i j i ) where the distribution ofX i is only a function of i . In situations involving a nuisance parameter it is gen- erally assumed to be known and identical for allX i . For example, homoscedastic Gaussian models of the formX i j i ; ind N( i ; 2 ) involve the nuisance parameter = 2 , but 2 is usually treated as known and constant for allX i . However, in large-scale studies the data are often collected from 15 heterogeneous sources; hence the nuisance parameters often vary over the n observations. Per- haps the most common example, and the setting we concentrate on most, involves heteroscedastic errors, where 2 varies overX i . Microarray data (Erickson and Sabatti 2005, Chiaretti et al. 2004), returns on mutual funds (Brown et al. 
1992), and the state-wide school performance gaps, consid- ered in Section 2.4.3, are all instances of large-scale data where genes, funds, or schools have heterogeneous variances. Heteroscedastic errors also arise in ANOV A analysis and linear regres- sion (Weinstein et al. 2018). Moreover, in compound Binomial problems, the heterogeneity is reflected by unequal sample sizes across different estimation units. In Section 2.4.2, we demon- strate that the conventional Tweedie’s formula cannot eliminate selection bias for heterogeneous data. Moreover, failing to account for heterogeneity leads to invalid false discovery rate methods (Efron 2008b, Cai and Sun 2009), unstable multiple testing procedures (Tusher et al. 2001) and inefficient ranking and selection algorithms (Henderson and Newton 2016), exacerbating the repli- cability crisis in large-scale studies (Tusher et al. 2001). However, relatively few methodologies are available to address this issue. The heterogeneity setting can be formulated using a hierarchical approach. For n parallel studies, the data for theith study is summarized byX i which is modeled as coming from a member of the exponential family of distributions f i (x i j i ) ind = expf i x i i ( i )gh i (x i ); i = 1;:::;n; (2.1) where i and i are independent and drawn from unspecified priors i iid G (); i iid G (): (2.2) Many common families of distributions fall into this framework, including the Binomial (with i = logfp i =(1p i )g and i = n i ), Gamma (with i = i and i = i ), and Beta (with i = i and i = i ). In particular, the heteroscedastic problem is a special case of (2.1) and (2.2) where i = i , i = 2 i , and X i j i ; 2 i follows a normal distribution. A common goal is then to find the estimator, or make the decision, that minimizes the expected squared error loss. Following 16 tradition, we assume 2 i , or in general i , are known (take for example, Robbins (1951), Brown and Greenshtein (2009), Xie et al. (2012), and Weinstein et al. (2018)) and for implementation, use a consistent estimator, as discussed in Weinstein et al. (2018). An alternative to estimating the nuisance parameter involves placing an objective prior on i as done in Jing et al. (2016), which extends the model in Xie et al. (2012) with known variance to the case of unknown variance. A plausible-seeming solution to the heteroscedastic problem might be to scale eachX i by i so that a homoscedastic method could be applied to X sc i = X i = i , before undoing the scaling on the final estimate of i . Indeed, this is essentially the approach taken whenever we compute standardized test statistics, such as t- or z-scores. A similar standardization is performed when we compute ^ p i = X i =n i in the Binomial setting. However, perhaps somewhat surprisingly, we demonstrate in both our theoretical and empirical results that this approach disregards important information so is inefficient. More advanced methods have been developed, but all suffer from various limitations. For instance, the methods proposed by Xie et al. (2012), Tan (2015), Jing et al. (2016), Kou and Yang (2017), and Zhang and Bhattacharya (2017) are designed for heteroscedastic data but assume a parametric Gaussian prior, which leads to loss of efficiency when the prior is misspecified. In contrast, we propose a two-step approach, “Nonparametric Empirical Bayes Smoothing Tweedie” (NEST), which first estimates the marginal distribution ofX i ,f (x), and second predicts i using a generalized version of Tweedie’s formula. 
NEST thus departs from linear shrinkage and other techniques that focus on specific forms of the prior distribution. A significant challenge in the heterogeneous setting is that $f_\sigma(x)$ varies with $\sigma$, so we must estimate a two-dimensional function. NEST addresses this issue using a kernel which weights observations by their distance in both the $x$ and $\sigma$ dimensions. The intuition here is that the density function should change smoothly as a function of $\sigma$, so observations with variability close to $\sigma$ can be used to estimate $f_\sigma(x)$. Once we obtain this two-dimensional density function, we simply apply Tweedie's formula to estimate the parameter corresponding to any particular combination of $x$ and $\sigma$. Compared to grouping methods (Weinstein et al. 2018), which group data according to $\sigma_i$ and then apply linear estimators within each group, NEST is capable of pooling information from all samples to construct a more efficient estimator.

NEST has four clear advantages. First, it is both easy to understand and compute but nevertheless handles general settings. Second, NEST does not rely on any parametric assumptions on $G_\theta(\theta)$. In fact it makes no explicit assumptions about the prior, since it directly estimates the marginal density $f_\sigma(x)$ using a nonparametric kernel method. Third, we prove that NEST requires only a few simple assumptions to achieve asymptotic optimality for a broad class of models. Additionally, we develop a Stein-type unbiased risk estimate (SURE) criterion for bandwidth tuning, which explicitly resolves the bias-variance trade-off issue in compound estimation. Finally, we demonstrate numerically that NEST can provide high levels of estimation accuracy relative to a host of benchmark comparison methods.

The rest of the chapter is structured as follows. Section 2.1 develops a generalization of Tweedie's formula to the multivariate setting and provides the corresponding oracle estimator. In Section 2.2, we present the NEST decision rule and algorithm, and describe the SURE criterion for choosing bandwidths. Section 2.3 describes the asymptotic setup and provides the main theorem establishing the asymptotic optimality of NEST. Section 2.4 concludes with a comparison of methods in several simulations and a data application. The proofs are given in Section 2.5. Additional theoretical and numerical results are provided in the Supplementary Material.

2.1. Tweedie's Formula

In Section 2.1.1, we review Tweedie's formula for univariate models, while Section 2.1.2 generalizes the formula to multivariate models; a special case of this general result gives the oracle estimator for the nuisance parameter setting.

2.1.1 Tweedie's formula

Efron (2011) demonstrated that for the hierarchical model (2.1) and (2.2), with $\eta_i = \eta$, the estimator minimizing the expected squared error loss is given by $\hat{\theta}^{TF} = (\hat{\theta}^{TF}_i : 1 \leq i \leq n)$, where
$$\hat{\theta}^{TF}_i = E(\theta_i \mid x_i) = l^{(1)}_{f,\eta}(x_i) - l^{(1)}_{h,\eta}(x_i), \qquad (2.3)$$
where $l_{f,\eta}(x) = \log f_\eta(x)$ and $l_{h,\eta}(x) = \log h_\eta(x)$. In general $h_\eta(x)$ can be directly computed for any particular member of the exponential family, while $f_\eta(x)$ can be estimated based on the observed data. In particular, when $f_{\eta_i}(x_i \mid \theta_i)$ is a $N(\theta_i, \sigma_i^2)$ density with $\sigma_i^2 = \sigma^2$, then
$$\hat{\theta}^{TF}_i = E(\theta_i \mid x_i) = x_i + \sigma^2 \frac{d}{dx}\log f_\sigma(x_i) = x_i + \sigma^2\, \frac{f^{(1)}_\sigma(x_i)}{f_\sigma(x_i)}, \qquad (2.4)$$
as first appeared in Robbins (1956), where $f_\sigma(x) = \int f_\sigma(x \mid \theta)\, dG_\theta(\theta)$ is the marginal density of $X_i$, and $f^{(1)}_\sigma(x) = \frac{d}{dx} f_\sigma(x)$ its derivative.
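Formula (2.3) is not tied to the Gaussian case. As a quick non-Gaussian illustration, the R sketch below uses the Poisson counterpart of the same empirical Bayes idea, Robbins' (1956) classical estimator $E(\lambda_i \mid x_i) = (x_i + 1) f(x_i + 1)/f(x_i)$; the Gamma prior is chosen purely for illustration, and the exponential-family formulas actually used in this thesis are the ones collected in Section 2.5.1:

    set.seed(5)
    n      <- 5000
    lambda <- rgamma(n, shape = 2, rate = 1)              # illustrative prior, unknown in practice
    x      <- rpois(n, lambda)
    f_hat  <- tabulate(x + 1, nbins = max(x) + 2) / n     # empirical pmf of x at 0, 1, 2, ...
    lambda_hat <- (x + 1) * f_hat[x + 2] / f_hat[x + 1]   # Robbins' estimator (x+1) f(x+1) / f(x)
    c(naive = mean((x - lambda)^2), robbins = mean((lambda_hat - lambda)^2))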
An important property of the formula is that it only requires the estimation of the marginal distribution ofX i in order to compute the estimator, which is particularly appealing in large-scale studies where one observes thousands of X i , making it possible to obtain an accurate estimate off (x). 2.1.2 Multivariate Tweedie’s formula Suppose we observe x = (x 1 ;;:::;x n ) T with conditional distribution f (xj) where = ( 1 ;::: n ) T is a vector of known nuisance parameters and = ( 1 ;:::; n ) T are parameters of interest. We assume that comes from an unknown prior distribution G (). Furthermore, we assume that, conditional on,x comes from an exponential family of the form f (xj) = exp T x () h (x): (2.5) 19 The Gaussian distribution, f (xj) = 1 p 2jj exp 1 2 (x) T 1 (x) ; (2.6) is the most common case of (2.5), where = 1 , = , () = 1 2 T , and h (x) = 1 p 2jj exp 1 2 x T 1 x . We demonstrate that Tweedie’s formula can be extended to the multivariate setting. Let = f 1 (x); ; n (x)g T be an estimator for. Denote l n (;) = n 1 kk 2 2 the squared error loss of estimating using. The compound Bayes risk of under squared error loss is r(;G ) = Z Z l n (;)f (xj)dxdG (): (2.7) The next theorem derives the optimal estimator under risk (2.7). Theorem 2.1.1. (The multivariate Tweedie’s formula). Under Model 2.5, the optimal estimator that minimizes (2.7) is (x) =E(jx;) =l (1) f; (x)l (1) h; (x) (2.8) where l (1) f; (x) = n d dx 1 l f; (x);:::; d dxn l f; (x) o T ;l f; (x) = logf (x);f (x) = R f (xj)dG(), and l h; (x) = logh (x). In particular, whenxj follows the Gaussian distribution (2.6) then l (1) h; (x) = 1 x and (x) =E(jx;) =x + l (1) f; (x) =x + f (1) (x) f (x) ; (2.9) wheref (x) = R f (xj)dG () is the marginal density ofX. Remark 2.1.1. For the Gaussian case, if is the minimizer of the risk (2.7), then will be the minimizer of r(;G ) = Z Z l n (;)f (xj)dxdG () (2.10) 20 becauser(;G ) = 2 r(;G ). Hence we end up with the same estimator for using either (2.8) or (2.9), both of which will be denoted hereinafter. Consider the special case where 1 ;:::; n are independent and alsox 1 ;:::;x n are mutually independent sof (xj) = Q i expf i x i i ( i )gh i (x i ). For the Gaussian case this holds when = diagf 2 1 ;:::; 2 n g is a diagonal matrix, and the elements i in are independent and fol- low a common prior distributionG (). Then the next corollary gives the result for the nuisance parameter setting with which we are concerned. Corollary 2.1.2. Under Models 2.1, & 2.2, the optimal estimator is = ( 1 ; ; n ) T , where i =E( i jX i =x i ; i ) =l (1) f; i (x i )l (1) h; i (x i ); (2.11) where l f; i (x i ) = logf i (x i ) and l h; i (x i ) = logh i (x i ). In particular, for the Gaussian case, l (1) h; i (x i ) =x i = 2 i and i =x i + 2 i l (1) f; i (x i ) =x i + 2 i f (1) i (x i ) f i (x i ) : (2.12) The proof of the corollary is straightforward and omitted. The estimator (2.12) envelopes previous work. If we assume homoscedastic errors i = , then f i (x i ) = f (x i ) and (2.12) reduces to the univariate Tweedie’s formula (2.4). Section 2.5.1 provides analogous calculations to (2.12) for the Binomial, Negative Binomial, Gamma and Beta distributions. Remark 2.1.2. Notice that Corollary 2.1.2 also demonstrates why the standardization approach is sub-optimal. Applying Tweedie directly to the standardized dataX sc i and then multiplying by i gives an estimate of sc i = i x sc i + f (1) x sc(x sc i ) f x sc(x sc i ) ! 
=x i + 2 i f (1) x (x i ) f x (x i ) (2.13) where f x (x) = R f x (xj)dG () and f x sc(x sc ) = R 1 f x ( x j)dG (). The last equality in (2.13) can be established by verifying that f (1) x sc (x sc i ) f x sc(x sc i ) = i f (1) x (x i ) fx(x i ) . Equations (2.12) and (2.13) have similar forms but the key distinction is that (2.13) computes the density averaged over all while the oracle estimator (2.12) uses the conditional distribution at = i . 21 The Bayes rule (2.12) is an oracle estimator, which cannot be implemented directly because the densitiesf i (x) are typically unknown in practice. The next section introduces an empirical Bayes (EB) approach that can tackle the implementation issue. The EB approach also provides a powerful framework for studying the risk properties of the proposed estimator. 2.2. NEST for the Heteroscedastic Normal Means Problem This section describes our proposed NEST approach. We primarily concentrate on the Gaussian heteroscedastic case to illustrate the proposed estimation framework. Methodological extensions to other distributions can be done by following similar strategies; related formulae are provided in Section 2.5.1. In Section 2.2.1, we consider an empirical Bayes framework for estimating the oracle rule and discuss new challenges with nuisance parameters. Section 2.2.2 presents our SURE criterion for selecting the tuning parameters. 2.2.1 Weighted kernel density estimation In order to implement (2.12) we must first form estimates forf i (x i ) andf (1) i (x i ). We propose a weighted kernel density estimator. Leth = (h x ;h ) be tuning parameters (bandwidths). Define ^ f ;h (x) := n X j=1 w j h xj (xx j ); (2.14) where w j w j (;h ) = h ( j )=f P n j=1 h ( j )g is the weight that determines the contribution from (x j ; j ), h xj = h x j is a bandwidth that varies across j, and h (z) = 1 p 2h exp n z 2 2h 2 o is a Gaussian kernel. The weights w j have been standardized to ensure that ^ f ;h (x) itself is a proper density function. When the errors are homoscedastic,w j = 1=n for allj and (2.14) becomes the usual kernel density estimator. Next we give explanations for the proposed estimator. First, estimatingf i (x i ) directly is dif- ficult as we only have one pair of observations (x i ; i ) for each density function. To exploit the 22 fact thatf (x) changes smoothly as a function of, we propose using weights, which are deter- mined by a kernel function, to borrow strength from observations with variability close to i , while placing little weight on points where i and j are far apart. Second, we seth xj = h x j , which provides a varying bandwidth to adjust for the heteroscedasticity in the data. Specifically, the band- width of the kernel placed on the pointX j is proportional to j ; hence data points observed with higher variations are associated with flatter kernels. Our numerical results show that the varying bandwidth provides substantial efficiency gain over fixed bandwidths. A related idea has been used in the variable kernel method (e.g. Silverman (1986), pp. 21), which employs bandwidths that are proportional to the sparsity of the data points in a region. Finally, a plethora of kernel functions may be used to construct our estimator. We have chosen the Gaussian kernel h () to facilitate our theoretical analysis. Another advantage of using the Gaussian kernel is that it leads to good numer- ical performance. 
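A minimal sketch of the weighted kernel density estimator (2.14), with Gaussian kernels in both the x and sigma directions; the function and variable names are illustrative and do not refer to released code.

```python
import numpy as np

def gauss(z, h):
    """Gaussian kernel phi_h(z) with bandwidth h."""
    return np.exp(-0.5 * (z / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

def f_hat(x0, s0, x, sigma, h_x, h_s):
    """Weighted KDE (2.14) of the marginal density at the point (x0, s0)."""
    w = gauss(s0 - sigma, h_s)
    w = w / w.sum()                    # weights w_j(s0, h_s) sum to one
    h_xj = h_x * sigma                 # varying bandwidth: flatter kernels for noisier points
    return np.sum(w * gauss(x0 - x, h_xj))
```

Observations with sigma_j close to s0 dominate the weights, and observations with large sigma_j contribute flatter kernels, which is exactly the varying-bandwidth behavior described above.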
In contrast, as observed by Brown and Greenshtein (2009), kernels with heavy tails typically introduce a significant amount of bias in the corresponding EB estimates. Compared to the choice of kernel, the selection of tuning parametersh is a more important issue; a detailed discussion is given in Section 2.2.2. We follow the standard method in the literature (e.g. Wand and Jones (1994)) to obtain the estimate of the derivative ^ f (1) ;h (x): ^ f (1) ;h (x) := d dx ^ f ;h (x) = n X j=1 w j h xj (xx j ) x j x h 2 xj : (2.15) Our methodological development is focused on a class of estimation procedures of the form h h (X) = ( h;i : 1in) T , where h;i =x i + 2 i ^ f (1) i ;h (x i ) ^ f i ;h (x i ) : (2.16) 23 2.2.2 Selecting the tuning parameters Implementing (2.16) requires selecting the tuning parametersh = (h x ;h ). Ideally, these param- eters should be chosen to minimize the true risk, which is unknown in practice. We propose using Stein’s unbiased risk estimate (SURE; Stein (1981)) as a criterion for tuningh. Our SURE method requires a training dataset to estimate the densities, as well as an independent dataset to evaluate the true risk. Next we describe a cross-validation approach for constructing the estimator and calculating the corresponding SURE function, which is further used for selecting the tuning parameters. LetX =f1; ;ng denote the index set of all observations. We first divide the data intoK equal or nearly equal subsets so that for each foldk = 1;:::;K, there is a holdout subsetX k taken from the full datasetX . LetX C k =XnX k . For eachi2X k , we first use allx j 2X C k to estimate the density (and corresponding first and second derivatives), and then evaluate the risk using the following SURE function: S i (h) :=S(h;x i ; 2 i ) = 2 i + i 4 2 6 4 2 ^ f i ;h (x i ) ^ f (2) i ;h (x i ) n ^ f (1) i ;h (x i ) o 2 n ^ f i ;h (x i ) o 2 3 7 5 ; (2.17) where the second derivative is computed as ^ f (2) ;h (x) = X j2X C k w j h xj (xx j ) h 2 xj ( xx j h xj 2 1 ) : In the above formula, the dependence of S i (h) upon the training data has been suppressed for notational simplicity. The compound SURE function can be obtained by combining all individ- ual SURE functions defined by (2.17): S(h) = P n i=1 S i (h). The tuning parameters are cho- sen to minimize S(h): ^ h = argmin h S(h): Substituting ^ h in place of h in (2.16) provides the proposed “Nonparametric Empirical Bayes Smoothing Tweedie” (NEST) estimator, denoted ^ = ( ^ 1 ; ; ^ n ). 24 Remark 2.2.1. The SURE criterion is different from the proposals developed in the density esti- mation literature for bandwidth selection. Since the emphasis in the kernel smoothing literature is to pick a bandwidth to produce a good estimate of the density, this bandwidth may not be the same as that which produces the best decision rule to estimate. In fact, the theoretical anal- yses in Brown and Greenshtein (2009) (and their Remark 5) suggest that the bandwidth for the compound estimation problem should converge to 0 “just faster” than (logn) 1 . By contrast, the optimal choice of bandwidth is equal toh x n 1=5 for a continuously twice differentiable den- sity (e.g. Wand and Jones (1994)). Our numerical results show that the SURE criterion leads to much improved performance compared to using the standard bandwidth in the density estimation literature. Finally we prove that (2.17) provides an unbiased estimate of the risk. 
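The pieces above fit together as follows: estimate the density and its first two derivatives by the weighted kernel formulas, plug them into the decision rule (2.16), and score each candidate bandwidth pair by the summed SURE criterion (2.17). The self-contained sketch below does exactly this, except that, for brevity, it evaluates SURE on the full sample rather than through the K-fold scheme described above, so it should be read as an illustration of the formulas rather than a faithful reimplementation.

```python
import numpy as np

def gauss(z, h):
    return np.exp(-0.5 * (z / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

def nest_and_sure(x, sigma, h_x, h_s):
    """Return the NEST estimates (2.16) and the summed SURE criterion (2.17)."""
    n = len(x)
    mu_hat = np.empty(n)
    sure = 0.0
    for i in range(n):
        w = gauss(sigma[i] - sigma, h_s)
        w /= w.sum()
        h_xj = h_x * sigma
        k = gauss(x[i] - x, h_xj)
        u = (x - x[i]) / h_xj
        f0 = np.sum(w * k)                            # density estimate (2.14)
        f1 = np.sum(w * k * u / h_xj)                 # first derivative (2.15)
        f2 = np.sum(w * k * (u**2 - 1.0) / h_xj**2)   # second derivative, as in (2.17)
        mu_hat[i] = x[i] + sigma[i]**2 * f1 / f0      # decision rule (2.16)
        sure += sigma[i]**2 + sigma[i]**4 * (2.0 * f0 * f2 - f1**2) / f0**2   # S_i(h) in (2.17)
    return mu_hat, sure

def tune_bandwidths(x, sigma, grid_x, grid_s):
    """Pick (h_x, h_s) minimizing the SURE criterion over a user-supplied grid."""
    scores = {(hx, hs): nest_and_sure(x, sigma, hx, hs)[1] for hx in grid_x for hs in grid_s}
    return min(scores, key=scores.get)
```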
Let X be a generic random variable obeying Models 2.1 & 2.2, from which we also have an independent sample of training data. Denote the unknown mean and 2 the variance associated withX . The corre- sponding estimator in the class (2.16) is denoted h . Again, the dependence of h on the training data is suppressed. For a fixed , the risk associated with h isR( h ; ) = E Xj ( h ) 2 ; where the expectation is taken with respect toX given and the training data. The corresponding Bayes risk is given byr( h ;G) = R R( h ; )dG ( ): The next proposition justifies the SURE function (2.17). Proposition 2.2.1. Consider the heteroscedastic Models 2.1 & 2.2. Then we have R( h ; ) = E Xj fS(h;X ; 2 )g andr( h ;G) =E X; fS(h;X ; 2 )g, where the last expectation is taken with respect to the joint distribution of (X ; ) for a fixed training dataset. 2.3. Asymptotic Properties of NEST This section studies the risk behavior of the proposed NEST estimator and establishes its asymp- totic optimality. Section 2.3.1 describes the basic setup. The main theorem is presented in Sec- tion 2.3.2, where we also explain the main steps and provide some intuitions behind the proofs. 25 Finally, in Section 2.3.3 we quantify the degree of bias that occurs in selecting extreme values and show that NEST asymptotically corrects this bias. 2.3.1 Asymptotic setup Consider the hierarchical Models 2.1 and 2.2. We are interested in estimating based on the observed (X; 2 ); this is referred to as a compound decision problem (Robbins 1951, 1956) as the performances of then coordinate-wise decisions will be combined and evaluated together. Let = ( 1 ; ; n ) T be a general decision rule. We call a simple rule if for alli, i depends only on (x i ; i ), and a compound rule if i depends also on (x j ; j ), j 6= i (Robbins 1951). Let D = (X;; 2 ). Then the (compound) Bayes risk is r(;G) =E D fl n (;)g; (2.18) whereG is used to denote the unspecified joint prior on (; 2 ). Consider the oracle setting where the marginal densityf (x) is known. We have shown that the oracle rule that minimizes the Bayes risk is [cf. Equation 2.12], which is a simple rule, a useful fact that can be exploited to simplify our analysis. Concretely, let (X;;) be a generic triple of random variables from the hierarchical Model 2.1 and 2.2, and the (scalar) oracle rule for estimating based on (X;). It follows that the risk of the compound estimation problem (e.g. using to estimate) reduces to the risk of a single estimation problem (e.g. using for estimating): r( ;G) =r( ;G) :=E X;; ( ) 2 : (2.19) The oracle risk (2.19) characterizes the optimal performance of all decision rules. Moreover, it is easy to analyze as it can be explicitly written as the following integral E X;; ( ) 2 = Z Z Z ( ) 2 (x)dxdG ()dG (): (2.20) 26 We focus on the setting where the oracle risk is bounded below by a constant: r( ;G)C > 0. Following Robbins (1964), we call a decision rule asymptotically optimal if lim n!1 r(;G) =r( ;G): (2.21) The major goal of our theoretical analysis is to show that the NEST estimator ^ is asymp- totically optimal in the sense of (2.21). The NEST estimator is difficult to analyze because it is a compound rule, i.e. the decision for i depends on all of the elements ofX and 2 . The compound risk of the form (2.18) cannot be explicitly written as simple integrals as done in (2.20) because each X i is used twice: for both constructing the estimator and evaluating the risk. We discuss strategies to overcome this difficulty in the next section. 
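Because the oracle risk (2.19) is the risk of a single scalar decision, it can be approximated by plain Monte Carlo rather than by handling the compound rule directly. The sketch below does so for an assumed normal prior, under which the oracle rule has a simple closed form; the specific priors are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
m, tau = 3.0, 1.0                          # assumed G_mu = N(m, tau^2)
B = 200_000                                # Monte Carlo replications

mu = rng.normal(m, tau, size=B)
sigma = rng.uniform(0.1, 2.0, size=B)      # assumed G_sigma
x = rng.normal(mu, sigma)

# Oracle rule delta*(x, sigma) = E(mu | x, sigma) under the normal prior.
oracle = m + (tau**2 / (tau**2 + sigma**2)) * (x - m)

# Monte Carlo approximation of the oracle risk r(delta*, G) in (2.19)-(2.20).
print(np.mean((oracle - mu) ** 2))
```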
2.3.2 Asymptotic optimality of NEST We first describe the assumptions that are needed in our theory. Assumption 2.3.1. All means are bounded above by a sequence, which grows to infinity at a rate slower than any polynomial ofn, i.e.8i,j i jC n , whereC n =o(n ) for every> 0. This is considered to be a mild assumption as it allows one to take C n = Of(logn) k g for any constantk. The particular case ofk = 1=2 corresponds to interesting scenarios in a range of large-scale inference problems such as signal detection (Donoho and Jin 2004), sparse estimation (Abramovich et al. 2006), and false discovery rate analysis (Meinshausen and Rice 2006, Cai and Sun 2017). For example, in the signal detection problem considered by Donoho and Jin (2004), choosing n = p 2r logn, where 0 < r < 1, makes the global testing problem neither trivial nor too difficult. Under Assumption 2.3.1, it is natural to consider a truncated version of NEST which returns the component-wise max( ^ i ;K logn) for some largeK. The truncation tends to slightly improve the numerical performance. For notational simplicity, the truncated estimator is also denoted by ^ and 27 subsequently used in our proof. The modification is proposed mainly for theoretical considerations; we justify in Section 2.5.9 that the truncation will always reduce the MSE. Assumption 2.3.2. The variances are uniformly bounded, i.e. there exist 2 l and 2 u such that 2 l 2 i 2 u for alli. This is a reasonable assumption for most real life cases. When 2 i measures the variability of a summary statistic for an inference unit, the assumption is fulfilled when the sample size for obtaining the summary statistic is neither too large nor too small. Next we state our main theorem, which formally establishes the asymptotic optimality of the proposed NEST estimator. Theorem 2.3.1. Consider Models 2.1 and 2.2. Leth x n x andh n s , where x and s are small constants such that 0< x + s < 1. Then under Assumptions 1-2, the NEST estimator ^ is asymptotically optimal in the sense of (2.21). In the remainder of this section, we explain the main ideas and steps in the proof of the theorem. The challenge in analyzing NEST lies in the dependence of ^ i upon all of the elements of X and 2 ; hence the asymptotic analysis, which involves expectations over the joint distributions of (X;; 2 ), is difficult to handle. To overcome the difficulty, we divide the task by proving three propositions. The first proposition involves the study of the risk from applying NEST to a new pair of observations (X; 2 ) obeying Models 2.1 and 2.2: ^ =X + 2 ^ f (1) ; ^ h (X) ^ f ; ^ h (X) ; where ^ f ; ^ h and ^ f (1) ; ^ h are constructed fromf(x i ; 2 i ) : 1ing. As the data used to construct the NEST estimator ^ are independent of the new observation, the Bayes risk for estimating can be expressed as r( ^ ;G) =E D E X;; n ( ^ ) 2 o : (2.22) The risk (2.22) is relatively easy to analyze becauseE X;; n ( ^ ) 2 o can be evaluated explicitly as RRR ( ^ ) 2 (x)dxdG ()dG (): 28 The following proposition, which constitutes a key step in establishing the asymptotic optimal- ity, demonstrates thatr( ^ ;G) is asymptotically equal to the oracle risk. Proposition 2.3.1. Suppose we apply two scalar decisions: the oracle estimator and the NEST estimator ^ , to a new pair (X; 2 ) obeying Models 2.1 and 2.2. 
Then we have E D E X;; n ( ^ ) 2 o =o(1): It follows that lim n!1 r( ^ ;G) =r( ;G): Proposition 2.3.1 is proven via three lemmas, which introduce two intermediate estimators ~ and , defined rigorously in Section 2.5.4, that help bridge the gap between the NEST and oracle estimators. Intuitively, ~ is constructed based on a kernel density estimator that eliminates the randomness inX i , and is obtained as an approximation to ~ by further teasing out the variability in 2 j . The analysis involves the study of the relationships of the risks of ~ and to the risks of the oracle rule and NEST method ^ . The three lemmas respectively show that (i) the risk of is close to that of ; (ii) the risk of ~ is close to that of , and (iii) the risk of ^ is close to that of ~ . Therefore Proposition 2.3.1 follows by combining (i) to (iii). Next we show that the proof of the theorem essentially boils down to proving the asymptotic optimality of a jackknifed NEST estimator ^ = ( ^ 1 ; ; ^ n ), where ^ i represents the ith decision, in which the density and its derivative are fitted using data X i ; 2 i =f(x j ; j ) : 1 jn;j6=ig, and then applied to (x i ; i ). The risk function for the jackknifed estimator is r ^ ;G =E D n 1 ^ 2 2 : (2.23) The next proposition shows that the compound risk of ^ is equal to the univariate risk r( ^ ;G) from applying NEST to a new pair (X; 2 ). 29 Proposition 2.3.2. Consider the jackknifed NEST estimator ^ . Then under the assumptions and conditions of Theorem 2.3.1, we have r ^ ;G =r( ^ ;G) =r( ;G) +o(1): Finally, the following proposition shows that the jackknifed NEST estimator is asymptotically equivalent to the full NEST estimator. Proposition 2.3.3. Consider the full NEST estimator ^ and the jackknifed NEST estimator ^ . Then under the assumptions and conditions of Theorem 2.3.1, we haver ^ ;G = r ^ ;G + o(1): Combining Propositions 2.3.1 to 2.3.3, we complete the proof of Theorem 2.3.1, thereby estab- lishing the asymptotic optimality of NEST. 2.3.3 NEST and selection bias In large-scale inference, attention has recently moved beyond multiple testing, namely selecting a few non-null cases from a large number of hypotheses, to a new direction that involves estimating the effect sizes of the selected non-null cases. A widely used formulation is the “post-selection inference” framework, which has been applied to, for example, the construction of simultaneous confidence intervals for selected parameters (Berk et al. 2013, Lee et al. 2016) and the hierarchical testing of structured hypotheses (Yekutieli 2008, Benjamini and Bogomolov 2014, Sun and Wei 2015). These types of questions have been largely driven by the modern “big data” problems, which often involve first searching large data sets for potential targets for inference, and then making refined decisions on a subset of parameters that appear to be worthy of investigating. A central issue in large-scale selective inference is that the classical inference tool fails to take into account the selection bias (or snooping bias) when estimating many means simultaneously. Efron (2011) demonstrated convincingly in his “exponential example” that the histogram of the uncorrected differences z i i of the largest 20 z-values from 2,000 observations is centered 30 around some positive constant; hence we have a systematic bias due to the selection effect. 
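Efron's phenomenon is easy to reproduce numerically. The sketch below repeatedly draws 2,000 heteroscedastic observations, keeps the 20 largest, and compares the naive errors among the selected cases with the errors over all cases; the priors used here are illustrative choices, not those of Efron's original example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 2000, 200
selected_err, overall_err = [], []
for _ in range(reps):
    mu = rng.normal(0.0, 1.0, size=n)         # illustrative prior on the means
    sigma = rng.uniform(0.5, 3.0, size=n)     # heteroscedastic noise levels
    x = rng.normal(mu, sigma)
    top = np.argsort(x)[-20:]                 # select the 20 largest observations
    selected_err.append(np.mean(x[top] - mu[top]))
    overall_err.append(np.mean(x - mu))

print(f"mean error among the selected: {np.mean(selected_err):.2f}")   # systematically positive
print(f"mean error over all cases:     {np.mean(overall_err):.3f}")    # essentially zero
```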
Next we discuss two theoretical results: the first explicitly characterizes the magnitude of the selection bias under hierarchical Models 2.1 and 2.2, and the second explains why the bias can be removed asymptotically by the proposed NEST estimator. We also present numerical examples in Section 2.4.2 to illustrate the effectiveness of NEST in correcting the selection bias as well as the failure of the homoscedastic Tweedie method. Theorem 2.3.2. Consider Models 2.1 and 2.2 and the selection eventfX > tg. When applying the naive estimatorX, the expected value of the selection bias is given by E X;jX>t; 2(X) = 2 f (t) 1F (t) ; (2.24) wheref (t) = R (t)g()d andF (t) =P (X >tj 2 ) = R 1 t f (x)dx. We omit results for other types of selection events such asfX <tg andfX <t 1 orX >t 2 g, which can be derived in a similar fashion. The next theorem establishes the asymptotic unbiased- ness of the NEST estimator. Theorem 2.3.3. Under the assumptions and conditions of Theorem 2.3.1, the bias term of the NEST estimator is vanishingly small: E DjX i >t ( ^ i i ) =o(1): (2.25) This remarkable result follows from the convergence of NEST w.r.t. the oracle estimator (2.1.1) E D E XjX>t; 2( ^ ) =o(1); (2.26) and one elegant property of the correction term in the oracle estimator, which says that the condi- tional expectation of the ratio E X;jX>t; 2 ( f (1) (X) f (X) ) = f (t) 1F (t) ; (2.27) 31 after multiplying by 2 , precisely cancels out the expected selection bias in (2.24). Both (2.26) and (2.27) are proved in Section 2.5.8. 2.4. Numeric Results This section compares the performance of NEST relative to several competing methods using simulated data in Section 2.4.1, examines how NEST removes selection bias for the heteroscedastic setting in Section 2.4.2, and demonstrate all methods on a real dataset in Section 2.4.3. 2.4.1 Simulation We compare seven approaches: the naive method (Naive), usingX without shrinkage; Tweedie’s formula (TF) from Brown and Greenshtein (2009); the scaling method mentioned in Section 2, which inputs scaledX sc i =X i = i and outputs the rescaled mean (Scaled); the group linear method (Group L) of Weinstein et al. (2018); a grouping method (k-Groups), which casts the grouping idea of Weinstein et al. (2018) into a Tweedie’s formulation, by first creatingk equal sized groups based on the variances and second applying the univariate Tweedie’s formula within each group; the semi-parametric monotonically constrained SURE method (SURE-M) from Xie et al. (2012); and from the same paper, the semi-parametric grand-mean sure-shrinkage method (SURE-SG). Note, we implement a truncated version of NEST that has little impact on the numerical performance but is consistent with theory (discussed right after Assumption 2.3.1 in Section 2.3.2). Mean-squared errors (MSE), and associated standard errors (SE), are computed using the differences between the estimated mean from each method and the true, over a total of 50 simulation runs. We simulateX i j i ; i iid N( i ; 2 i ) fori = 1;:::;n, wheren,g andg vary across settings with two sample sizes: n = 5; 000 and 10; 000. Moreover, three scenarios forg are investigated. In the first setting i N(3; 1 2 ). In the second setting, i are drawn from a sparse mixture distribution with a 0:7 probability of a point mass at zero, and a 0:3 probability of a draw from N(3; 0:3 2 ). Finally, under the third setting, i are drawn from a two-point mixture model, with:5 32 chance of a point mass at 0 and:5 chance of a point mass at 3. 
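The three configurations of g_mu just listed are easy to state in code; the sketch below is a minimal version (the function name is illustrative), with the noise levels sigma_i drawn as described next.

```python
import numpy as np

def draw_mu(setting, n, rng):
    """Draw mu_i under the three prior settings of Section 2.4.1."""
    if setting == "normal":       # mu_i ~ N(3, 1^2)
        return rng.normal(3.0, 1.0, size=n)
    if setting == "sparse":       # 0.7 point mass at zero, 0.3 draw from N(3, 0.3^2)
        nonzero = rng.random(n) < 0.3
        return np.where(nonzero, rng.normal(3.0, 0.3, size=n), 0.0)
    if setting == "two-point":    # point masses at 0 and 3, each with probability 0.5
        return np.where(rng.random(n) < 0.5, 0.0, 3.0)
    raise ValueError(f"unknown setting: {setting}")

mu = draw_mu("sparse", 5000, np.random.default_rng(4))
```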
For $g_\sigma$, we simulate $\sigma_i \overset{iid}{\sim} U[0.1, \sigma_M]$ and test three different values of $\sigma_M$ for each $g_\mu$ setting. The values of $\sigma_M$ are chosen so that the ratio between variances, $\mathrm{var}(\mu)/\mathrm{var}(X)$, is reasonably close to 0.9, 0.75, and 0.5 despite the different models for $g_\mu$. This provides an anchor for comparison across the three $g_\mu$ settings. In terms of the relationship between the variance of $\mu$ and the average variance of $X$ conditional on $\mu$, a ratio of 0.9 means $\mathrm{var}(\mu) = 9.5\, E\{\mathrm{var}(X\mid\mu)\}$; a ratio of 0.75 means $\mathrm{var}(\mu) = 3\, E\{\mathrm{var}(X\mid\mu)\}$; and a ratio of 0.5 means $\mathrm{var}(\mu) = E\{\mathrm{var}(X\mid\mu)\}$. The first setting is relatively easy, the second is challenging, and the third is extreme. Because each $g_\mu$ setting has different hyper-parameters, there are small deviations from the exact ratios $\mathrm{var}(\mu)/E\{\mathrm{var}(X\mid\mu)\} = 9.5$, 3, and 1.

All methods, except for Naive, require tuning parameters. For the methods that use kernel estimators (NEST, TF, Scaled, and k-Groups) we minimize our SURE criterion to select the kernel bandwidth, $h_x$, or $(h_x, h_\sigma)$ for NEST, over a grid of possible values. The same bandwidth is used for both the density function and its derivatives. The k-Groups methods additionally must be tuned for the number of groups. To this end, we run the group method on evenly sized groups of 2, 5, and 10. Also, the linear shrinkage methods (Group L, SURE-M, and SURE-SG) have a different set of tuning parameters. The tuning is handled via optimization techniques in the code provided by Weinstein et al. (2018).

Table 2.1 shows mean squared errors for the normal model $\mu_i \overset{iid}{\sim} N(3, 1^2)$. For this setting, since the analytical form of the mixture density is known, we can additionally compare performance against the oracle rule (2.4). SURE-M consistently leads performance under the normal setting and uniformly achieves oracle performance. Since SURE-M is a linear shrinkage method that moves $X$ towards a global center, it is expected to perform particularly well under this normal setting, which concentrates points around a center at 3. When $n = 5{,}000$, the next best is SURE-SG, but NEST has an MSE only 1-5% higher than the oracle's. When $n = 10{,}000$, SURE-M and SURE-SG lead by a smaller gap, with several competing methods, including NEST, also achieving oracle performance when $\mathrm{var}(\mu)$ is 9.6 times $E\{\mathrm{var}(X\mid\mu)\}$. In the extreme case where $\mathrm{var}(\mu) = E\{\mathrm{var}(X\mid\mu)\}$ and the gap between NEST and the best estimator is largest, NEST is 3% higher than the oracle.

Table 2.1: Normal model mean squared errors with standard errors in parentheses. Bolded terms represent the oracle and the best performing competitor(s).
n Method Var() EfVar(Xj)g = 9:6 Var() EfVar(Xj)g = 3 Var() EfVar(Xj)g = 1 5000 Oracle 0.090 (0.000) 0.223 (0.001) 0.411 (0.002) Naive 0:103(0:000) 0:336(0:001) 1:024(0:004) NEST 0:091(0:000) 0:226(0:001) 0:430(0:002) TF 0:091(0:000) 0:228(0:002) 0:444(0:002) Scaled 0:095(0:000) 0:270(0:001) 0:659(0:004) Group L 0.090 (0.000) 0:224(0:001) 0:416(0:002) 2-Groups 0:091(0:001) 0:231(0:002) 0:462(0:012) 5-Groups 0:092(0:001) 0:238(0:004) 0:519(0:032) 10-Groups 0:093(0:001) 0:247(0:006) 0:570(0:053) SURE-M 0.090 (0.000) 0.223 (0.001) 0.412 (0.002) SURE-SG 0.090 (0.000) 0:224(0:001) 0:415(0:002) 10000 Oracle 0.090 (0.000) 0.223 (0.001) 0.411 (0.001) Naive 0:103(0:000) 0:335(0:001) 1:022(0:003) NEST 0.090 (0.000) 0:225(0:001) 0:424(0:001) TF 0.090 (0.000) 0:228(0:001) 0:461(0:015) Scaled 0:095(0:000) 0:267(0:001) 0:638(0:003) Group L 0.090 (0.000) 0:224(0:001) 0:415(0:001) 2-Groups 0.090 (0.000) 0:228(0:001) 0:469(0:017) 5-Groups 0:091(0:000) 0:233(0:003) 0:476(0:019) 10-Groups 0:091(0:001) 0:237(0:004) 0:538(0:047) SURE-M 0.090 (0.000) 0.223 (0.001) 0.411 (0.001) SURE-SG 0.090 (0.000) 0.223 (0.001) 0:413(0:001) Table 2.2 shows the performance of the sparse model for i . For this setting, we apply a stabilizing technique to shrinkage estimators that follow Tweedie’s formula (2.9), which ensures that the estimate ^ has the same sign as the original data pointx. In particular for (x;) we set ^ S = Ifsgn(x) = sgn(^ )g ^ , whereI is an indicator function. We do not modify the linear shrinkage methods since they shrink X towards a specific location. In this sparse case, NEST does best in the relatively easy scenario whereVar() = 9:5E(Var(Xj) and in the challenging scenario whereVar() = 3E(Var(Xj). The linear shrinkage methods are the worst performers other than the naive method in the easiest scenario but perform best in the extreme scenario where the expected conditional variance ofX is as large as the variance of. In this final scenario, the data is so noisy that the grand mean becomes the best estimator, which is essentially what the linear shrinkage methods produce. Even so, in this extreme setting, NEST has 22% the error of the naive 34 Table 2.2: Sparse model mean squared errors with standard errors in parentheses. Bolded terms represent best performing methods. n Method Var() EfVar(Xj)g = 9:5 Var() EfVar(Xj)g = 3 Var() EfVar(Xj)g = 1 5000 Naive 0:369(0:002) 1:840(0:008) 6:011(0:026) NEST 0.124 (0.001) 0.684 (0.004) 1:298(0:011) TF 0:154(0:001) 0:775(0:003) 1:598(0:007) Scaled 0:182(0:002) 0:824(0:005) 1:713(0:014) Group L 0:286(0:001) 0:773(0:003) 1:180(0:004) 2-Groups 0:139(0:002) 0:775(0:024) 1:704(0:132) 5-Groups 0:168(0:019) 0:885(0:071) 2:833(0:996) 10-Groups 0:211(0:040) 0:970(0:089) 2:248(0:277) SURE-M 0:285(0:001) 0:764(0:002) 1.159 (0.003) SURE-SG 0:286(0:001) 0:771(0:003) 1:174(0:004) 10000 Naive 0:369(0:001) 1:839(0:005) 6:009(0:015) NEST 0.118 (0.001) 0.661 (0.003) 1:230(0:006) TF 0:153(0:001) 0:773(0:002) 1:588(0:004) Scaled 0:178(0:001) 0:812(0:003) 1:629(0:010) Group L 0:286(0:001) 0:770(0:002) 1:170(0:002) 2-Groups 0:134(0:001) 0:730(0:009) 1:473(0:023) 5-Groups 0:136(0:005) 0:785(0:054) 1:727(0:238) 10-Groups 0:149(0:011) 0:836(0:067) 1:902(0:287) SURE-M 0:285(0:001) 0:763(0:002) 1.155 (0.002) SURE-SG 0:286(0:001) 0:767(0:002) 1:165(0:002) method compared to best performer’s 19%. NEST does particularly well in the 9:5 and 3 scenarios due to its ability to adapt to multiple centers and varying distributions. 
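The sign-stabilization rule applied to the Tweedie-type estimators in the sparse setting amounts to a one-line post-processing step; a sketch (with illustrative names) is:

```python
import numpy as np

def stabilize(x, mu_hat):
    """Zero out any estimate whose sign disagrees with that of the observation."""
    return np.where(np.sign(x) == np.sign(mu_hat), mu_hat, 0.0)
```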
The simulations also highlight a trade-off between the grouping and homogeneous methods. Grouping methods capture some of the heteroscedasticity in the data, but when the number of groups divides the data into clusters that are much smaller than the full set, the estimator can become unstable, particularly for data with broad ranges of variance. Homogeneous methods have the opposite characteristics: their use of the whole data set makes them very stable, but can result in biased estimators in heterogeneous data settings. NEST can be seen as a continuous version of the discrete grouping method, and hence provides a compromise between the two approaches: it makes efficient use of all of the data but does not suffer from the bias introduced by using the homogeneous methods. 35 Table 2.3: Two-point model mean squared errors with standard errors in parentheses. Bolded terms represent best performing methods. n Method Var() EfVar(Xj)g = 9:2 Var() EfVar(Xj)g = 3:2 Var() EfVar(Xj)g = 1 5000 Oracle 0.037 (0.001) 0.279 (0.002) 0.758 (0.003) Naive 0:242(0:001) 0:701(0:003) 2:161(0:010) NEST 0.048 (0.001) 0.306 (0.003) 0.856 (0.006) TF 0:072(0:001) 0:374(0:003) 0:957(0:004) Scaled 0:129(0:001) 0:480(0:003) 1:169(0:006) Group L 0:208(0:001) 0:475(0:002) 0:904(0:003) 2-Groups 0:061(0:002) 0:343(0:006) 0:951(0:024) 5-Groups 0:066(0:006) 0:392(0:032) 1:120(0:124) 10-Groups 0:085(0:013) 0:459(0:066) 1:412(0:261) SURE-M 0:208(0:001) 0:472(0:002) 0:893(0:003) SURE-SG 0:209(0:001) 0:475(0:002) 0:901(0:003) 10000 Oracle 0.037 (0.001) 0.281 (0.001) 0.764 (0.0.002) Naive 0:243(0:001) 0:704(0:002) 2:171(0:007) NEST 0.045 (0.001) 0.302 (0.002) 0.839 (0.003) TF 0:071(0:001) 0:377(0:002) 0:962(0:002) Scaled 0:127(0:001) 0:480(0:002) 1:160(0:004) Group L 0:209(0:001) 0:476(0:001) 0:905(0:002) 2-Groups 0:058(0:001) 0:343(0:004) 0:927(0:014) 5-Groups 0:055(0:003) 0:347(0:016) 0:965(0:047) 10-Groups 0:063(0:005) 0:387(0:034) 1:135(0:135) SURE-M 0:209(0:001) 0:474(0:001) 0:897(0:002) SURE-SG 0:210(0:001) 0:476(0:001) 0:903(0:002) Table 2.3 corresponds to i drawn from the two-point model, with half centered at 0 and half at 3. In this setting, the analytical oracle rule is available and included in the comparisons. NEST uniformly performs best in this case. Whenn = 5; 000, it has errors that are 10% - 29% higher than the oracle MSE. Next best is the 2-Groups method with errors 23% - 64% higher than the oracle’s. Whenn is 10; 000, the pattern is the same, with NEST leading shrinkage methods with errors between 7% - 22% while the next best method, the 2-Groups method, has errors 21% - 61% higher than the oracle’s. Tweedie’s-formula type methods appear to do better in a setting like this one, where there are multiple underlying means and highly non-normal distributions. 36 Pooled TF µ ^ −µ Frequency −20 −10 0 10 20 0 200 400 600 800 NEST µ ^ −µ Frequency −20 −10 0 10 20 0 200 400 600 800 Pooled TF µ ^ −µ Frequency −20 −10 0 10 20 0 200 400 600 800 1000 NEST µ ^ −µ Frequency −20 −10 0 10 20 0 200 400 600 800 1000 Figure 2.1: Differences ^ for Naive (dark grey histograms) compared to TF (white histograms, left) and NEST (white histograms, right) for the smallest 20 observations across 200 simulations in two settings; top row hasX i 0:7N(1;:5 2 + 1 2 ) + 0:3N(1;:5 2 + 3 2 ) and bottom hasX i :7N(0;:5 2 + 1 2 ) +:3N(5;:5 2 + 3 2 ). 2.4.2 Selection bias We also explore the relative abilities of NEST and TF to correct for selection bias in heteroscedastic data. We demonstrate the selection bias problem through two simple settings. 
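The sketch below previews the first of the two settings specified next: it generates the heteroscedastic two-group data, selects the 20 smallest observations, and compares the naive errors with those of the group-wise (oracle) Tweedie correction, which is what NEST reduces to in this two-group setting. The closed-form correction uses the known mixture parameters, so it is an oracle benchmark rather than the fitted NEST estimator.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 5000, 200
naive_err, corrected_err = [], []
for _ in range(reps):
    mu = rng.normal(1.0, 0.5, size=n)                   # mu_i ~ N(1, 0.5^2)
    sigma = np.where(rng.random(n) < 0.7, 1.0, 3.0)     # sigma_g = 1 w.p. 0.7, 3 w.p. 0.3
    x = rng.normal(mu, sigma)
    # Within each variance group the marginal is N(1, 0.5^2 + sigma_g^2), so the oracle
    # (group-wise Tweedie) rule shrinks x toward 1 by a group-specific factor.
    shrink = 0.5**2 / (0.5**2 + sigma**2)
    mu_hat = 1.0 + shrink * (x - 1.0)
    sel = np.argsort(x)[:20]                            # the 20 smallest observations
    naive_err.append(np.mean(x[sel] - mu[sel]))
    corrected_err.append(np.mean(mu_hat[sel] - mu[sel]))

print(f"naive error among selected:     {np.mean(naive_err):.2f}")      # clearly negative
print(f"corrected error among selected: {np.mean(corrected_err):.2f}")  # close to zero (cf. Figure 2.1)
```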
In the first,X i j i N( i ; 2 g ), wherei = 1;:::; 5000, i N(1;:5 2 ), and g = 1 (or 3) with 70 (or 30%) probability. In the second setting, g is the same but i come from two centers X i :7N(0;:5 2 + 1 2 ) + :3N(5;:5 2 + 3 2 ). In these two-group settings, NEST reduces to TF applied to separate groups. We plot histograms for the differences ^ of the 20 smallest X i generated from each of 200 simulations. This gives a total of 4; 000 differences for each of the three estimators: Naive, TF, and NEST. In Figure 2.1, the top row corresponds to the setting with a single center 0 = 1. Naive is plotted with TF on the left while Naive is plotted with NEST on the right. The top row shows 37 Naive shifted leftward, but both TF and NEST are centered around zero. However, while TF has effectively removed the bias, it is clear that NEST has significantly smaller variance than TF, and only slightly higher variance than Naive. NEST is able to fully use the information from 2 i and produce separate shrinkage effects for the two groups while TF must take a weighted average of the shrinkage. There is a further, and more serious, consequence to the average shrinkage approach that TF takes. While NEST removes selection bias by shrinking eachX i by an appropriate amount, TF is over-shrinking someX i and under-shrinking others. Specifically, TF’s shrinkage is 2 g f 0 (x i ) f(x i ) = ( 0 x i ) 2 g 1 v 2 1 w i + 1 v 2 2 (1w i ) ; wherew i =fp v 1 (x i 0 )g=fp v 1 (x i 0 )+(1p) v 2 (x i 0 )g;g = 1 or 2,p is the proportion ofX i fromg = 1, andv 2 g = 2 + 2 g . Since these weights are positive, and 1=v 2 g increases as g decreases,x i ’s with the smaller standard deviation will be under-shrunk whilex i ’s with the larger standard deviation will be over-shrunk. The phenomenon is well-captured by the bottom row of Figure 2.1. Here, the NEST differences shown in the right plot are still unimodal and centered around zero, but TF shrinkage issues become visible in the presence of distinct centers. The average shrinkage effect is slightly too mild for the group that should be shrunk toward zero, shifting its peak slightly left of zero, and far too severe for the group that should be shrunk towards five, shifting its peak towards ten. Thus heteroscedasticity prevents even TF from removing selection bias, but NEST is still able to do so. 2.4.3 California API data Next, we compare NEST and its competitors on California Academic Performance Index (API) school testing data. The API data files are publicly available on the California Department of Education’s website, and the data were described in Rogosa (2003) and analyzed previously by Efron (2008b) and Sun and McLain (2012) in the context of multiple testing (with heteroscedas- tic errors). The data focuses on the within-school achievement gap, for grades 2-12, between 38 socio-economically advantaged (SEA) and socio-economically disadvantaged (SED) students as measured by the difference between the proportion in each group who passed California’s standard- ized math tests, ^ p A ^ p D . During 2002-2010, these test scores were collected in accordance with the No Child Left Behind act and used to unlock federal funding for high-performing schools. With about 7; 000 schools evaluated each year, these performance gaps are susceptible to mean bias. If the observed achievement gaps are not corrected, small schools with fewer testers are particularly likely to fall in the extreme tails, obscuring large schools with lower variances from scrutiny. 
For this data, we compare the proportion of large schools selected into the top 100 achievement gaps estimated by each procedure. The data was cleaned prior to analysis. Observed achievement gaps were calculated over three- year-periods so that the 9 years of data were aggregated into 7 windows: from 2002-2004, 2003- 2005, :::, to 2008-2010. The achievement gaps were calculated using the difference of passing rates in each school between socioeconomic groups, 100(^ p A ^ p D ), and served asX i (in percent- ages). We chose schools wheren A , number of SEA students, andn D , number of SED students, were at least 30 each. Additionally, we used schools that had at least 5 students who passed and 5 students who failed the math test for both SED and SEA groups. The standard deviations, corre- sponding tos i , were estimates for i and were calculated using s = 100 p ^ p A (1 ^ p A )=n A + ^ p D (1 ^ p D )=n D : After cleaning, there were approximately 6; 500 schools in each of the 7 windows, although the number varied slightly from window to window. To quantify the performance of each method, schools were classified as small, medium, or large based on the socioeconomic group with the smaller number of testers at each school,n min = min(n A ;n D ). A school was small if the minimum number of testers fell into the bottom quartile of n min amongst all schools, and large if it lay in the top quartile. Since we have defined large schools as those in the top 25% of the population, and, absent information to the contrary, it is reasonable to assume that school size is independent of achievement gap, this is the proportion of large schools expected to appear in the top 100 schools. Competing methods were compared using 39 Table 2.4: Mean squared errors (standard errors) for the proportion of large schools. Bolded terms represent best performances. Method MSE (SE) Method MSE (SE) Naive 0:012(0:002) 2 Group 0:007(0:002) NEST 0.001 (0.001) 3 Group 0:007(0:002) TF 0:006(0:002) 4 Group 0:007(0:002) Scaled 0:014(0:002) 5 Group 0:007(0:002) SURE-M 0:006(0:002) Group L 0:005(0:002) SURE-SG 0:005(0:001) the squared difference between the observed fraction of large schools chosen in the top 100 by that method, relative to the nominal ideal fraction of 25%, averaged across the 7 three-year windows. We compared NEST to the same methods as in the simulation study. Our SURE criterion was used to tune the bandwidth when applicable. Each bandwidth was tuned on a grid of 0:1 intervals, where the endpoints were adjusted for each method. Thek-Groups method was also tuned over 2, 3, 4, and 5 groups. Since the standard deviations were unimodal and did not suggest particular clusters, the groups were constructed based on evenly spaced quantiles over the data. The mean squared errors (MSE) in Table 2.4 show that NEST adjusts achievement gaps in a way that most fairly represents large schools. In fact NEST had an average deviation from the benchmark 25% of only about 2:5%. By contrast the Naive method significantly under-counted large schools by approximately 12:5%. Even SURE-SG, which was the next most competitive method, had an average deviation over twice that of NEST, at 5:5%. Table 2.5 in Section 2.5 shows the full breakdown of proportions of large schools selected by each method on each of the seven windows of data. Figure 2.2 visualizes the NEST shrinkage function for various values ofs. The black line with circles, corresponding tos = 0, is flat since no shrinkage is needed for noiseless observations. 
We see that shrinkage is approximately linear inx, with the most negativex experiencing the greatest positive shrinkage and vice versa for positivex. Similarly, ass increases, the slope of the shrinkage line becomes steeper, and the shrinkage increases. However, notice that the shrinkage is not linear 40 −20 0 20 40 60 −1.5 −1.0 −0.5 0.0 0.5 1.0 x shrinkage s= 0 1 2 3 5 10 Figure 2.2: NEST shrinkage as a function ofx ands, plotted againstx for differents. ins (ors 2 ), with the marginal change in shrinkage declining ass grows larger. This differs from the standardization approach (2.13) which, for a given value ofx, must be linear ins 2 , by construction. 2.5. Proofs and supplementary material This section proves the main theoretical results and technical lemmas of this chapter. It also pro- vides additional material previously referenced. 41 2.5.1 Expressions for Common Members of the Exponential Family We observe (x 1 ; 1 );:::; (x n ; n ) with conditional distribution f i (x i j i ) = expf i z i ( i )gh i (z i ); (2.28) where i is a known nuisance parameter and i is an unknown parameter of interest. In addition to the Gaussian distribution, there are several common cases of (2.28). Binomial: f n i (x i j i ) = n i ! x i !(nn i x i )! p x i i (1p i ) n i x i = expf i x i ( i )gh n i (x i ); where i = log p i 1p i ; i =n i ; ( i ) =n i log (1 +e i ), andh n i (x i ) = n i ! x i !(n i x i )! . Negative Binomial: f r i (x i j i ) = (x i +r i 1)! x i !(r i 1)! p z i i (1p i ) r i = expf i x i ( i )gh r i (x i ); where i = logp i ; i =r i ; ( i ) =r i log (1e i ), andh r i (x i ) = (x i +r i 1)! z i !(r i 1)! . Gamma: f i (x i j i ) = 1 ( i ) i i x i 1 i exp( i x i ) = expf i x i ( i )gh i (x i ); where i = i ; i = i ; ( i ) = i log( i ), andh i (x i ) = 1 ( i ) x 1 i . Beta: f i (z i j i ) = 1 B( i ; i ) x i i (1x i ) i 1 = expf i z i ( i )gh i (z i ); 42 wherez i = logx i ; i = i ; i = i ; ( i ) = logB( i ; i ) andh i (z i ) = (1e z i ) i 1 . Hence, we can computel 0 h; (z) explicitly for these distributions. Binomial:l 0 h;n i (x i ) = P x i k=1 1 k + P n i x i k=1 1 k 2 where is the Euler-Mascheroni constant Negative Binomial:l 0 h;r i (x i ) = 8 > < > : P x i +r i 1 k=x i +1 1 k r i > 1 0 r i = 1 Gamma:l 0 h; i (x i ) = (1 i ) 1 x i Beta:l 0 h; i (z i ) = ( i 1) e z i 1e z i = ( i 1) x i 1x i . Combining these expressions with (2.11) we can expressE (jx) as follows: Binomial:E n i log( p i 1p i )jx i = P x i k=1 1 k + P n i x i k=1 1 k 2 +l 0 f;n i (x i ) Negative Binomial:E r i (logp i jx i ) =l 0 f;r i (x i ) + 8 > < > : P x i +r i 1 k=x i +1 1 k r i > 1 0 r i = 1 Gamma:E i ( i jx i ) = ( i 1) 1 x i l 0 f; i (x i ) Beta:E i ( i jz i ) = ( i 1) x i 1x i +l 0 f; i (z i ). 43 2.5.2 Proof of Theorem 2.1.1 As described in Efron (2011), by Bayes rule, g (jx) =f (xj)g()=f (x): But for exponential families of the form given by (2.5) g (jx) = exp x T (x) [g() exp( ())]; where (x) = log f (x) h (x) . This implies thatjx is also an exponential family with canonical parameterx and cumulant generating function (cgf) logE exp T t = (x +t) (x): We can differentiate the cgf to derive the moments ofjx. In particular E (jx) = @ @t (x +t)j t=0 =l 0 f; (x)l 0 h; (x): (2.29) The proof for the Gaussian case can be derived either as a special case of (2.29) or following similar arguments to those in Brown (1971) and Johnstone (2015). 
We begin by expanding the formula for the partial derivatives off (xj): f (1) (x) = Z 1 f (xj)G () Z 1 xf (xj)dG (): Then after pulling out 1 on the left side and dividing both sides byf (x), we get: f (1) (x) f (x) = 1 R f (xj)dG () f (x) x : 44 The right side can be simplified according to the Bayes rule: E(jx; ) = R f (xj)dG () R f (xj)dG () = R f (xj)dG () f (x) : Finally left-multiplying both sides by and takingx to the left side yields the desired result. 2.5.3 Proof of Proposition 2.2.1 The proof involves a simple application of the following Stein’s lemma. Lemma 2.5.1. (Stein (1981)). Consider X N(; 2 ) and a differentiable function h with its derivative denoted by h (1) . If Efh(X)(X)g and E h (1) (X) exist, then we have Efh(X)(X)g = 2 E h (1) (X) : Proof. Consider the class of estimators h . Leth(X ) = 2 ^ f (1) ;h (X) ^ f ;h (X) . Then h = X +h(X ). Expanding the riskR( h ; ) =E Xj ( h ) 2 , we get three terms inside the expectation: E Xj f(X ) 2 + 2h(X )(X ) +h 2 (X )g: The expectation of the first term is 2 . Applying Stein’s lemma to the second term, we get E Xj f2h(X )(X )g = 2 2 E Xj h (1) (X ) = 2 4 E Xj 2 6 4 ^ f ;h (X ) ^ f (2) ;h (X ) n ^ f (1) ;h (X ) o 2 n ^ f ;h (X ) o 2 3 7 5 : The third term can be easily computed as E Xj fh 2 (X )g = 4 E Xj n ^ f (1) ;h (X ) o 2 = n ^ f ;h (X ) o 2 : Combining the three terms gives the desired equality R ( h ; ) = E Xj S( h ;X ; 2 ). The second part of the theorem follows directly from the first part. 45 2.5.4 Three lemmas for Proposition 2.3.1 Consider a generic triple of variables (X;; 2 ) from Model 2.1 and 2.2. The oracle estimator and NEST estimators ^ are respectively constructed based on f (x) = R (x)dG () and ^ f ; ^ h (x). In our derivation, we use the notation ^ f (x), where the dependence of ^ f on ^ h and observed data is suppressed. We first define two intermediate estimators ~ and to bridge the gap between ~ and . The estimator ~ is based on ~ f, which is defined as ~ f (x) =E X;j 2 n ^ f (x) o ; (2.30) to eliminate the variability in X i and i . The expression of ~ f (x) can be obtained in two steps. The first step takes conditional expectation overX while fixing and 2 : E Xj; 2f ^ f (x)g = n X j=1 ! j Z hx j (yx) j (y j )dy = n X j=1 ! j j (x j ); where 2 = 1 +h 2 x . The previous calculation uses the fact that Z 1 (y 1 ) 2 (y 2 )dx = ( 2 1 + 2 2 ) 1=2( 1 2 )dx: The next step takes another expectation over conditional on 2 . Using notationfgfg(x) = R g()f(x)d, we have ~ f (x) =E j 2 n X j=1 ! j v j (x j ) = n X j=1 ! j g j (x): 46 The second estimator is based on f , which is defined as the limiting value of ~ f , to eliminate the variability from j . Note that ~ f (x) = P n j=1 g j (x) h ( j ) P n j=1 h ( j ) = R fg y g (x) h (y)dG (y) R h (y)dG (y) +K n ; where K n is bounded andE 2(K n ) = O(n ) for some > 0. Let L n n l , 0 < l < s . DefineA := [L n ; +L n ]; with its complement denotedA C . Next we show that the integral R A C fg y g (x) h (y)dG (y) is vanishingly small. To see this, consider the value of h () on the boundaries ofA C , where the density is the greatest. Then, h (L n ) = ( p 2h ) 1 exp 1 2 (L n =h ) 2 : The choice of a polynomial rate L n ensures that for all y 2 A C , h (y ) = O(n " ) for some" > 0. Moreover, the support of i is bounded. 
It follows that R A C fg y g (x) h (y )dG (y) = O(n " ): OnA , we apply the mean value theorem for definite integrals to conclude that there exists 2A such that Z A fg y g (x) h (y)dG (y) =fg g (x) Z A h (y)g (y)dy: Following similar arguments we can show that Z h (y)dG (y) = Z A h (y)dG (y)f1 +O(n )g: Cancelling the term R A h (y)dG (y) from top and bottom, the limit of ~ f (x) can be repre- sented as: f (x) :=fg g(x); for some 2A : (2.31) Finally, the derivatives ~ f (1) (x) and f (1) (x), as well as ~ and , can be defined correspondingly. 47 According to the triangle inequality, to prove Proposition 2.3.1, we only need to establish the following three lemmas, which are proved, in order, from Section 2.5.9 to Section 2.5.11 in the Supplementary Material. Lemma 2.5.2. Under the conditions of Proposition 2.3.1, Z Z Z ( ) 2 (x)dxdG ()dG () =o(1): Lemma 2.5.3. Under the conditions of Proposition 2.3.1, E 2 Z Z Z ( ~ ) 2 (x)dxdG ()dG () =o(1): Lemma 2.5.4. Under the conditions of Proposition 2.3.1, E X;; 2 Z Z Z ( ^ ~ ) 2 (x)dxdG ()dG () =o(1): 2.5.5 Proof of Proposition 2.3.2 Consider applying i to estimate i . Then r i ( ^ i ;G) =E X i ; 2 i Z Z Z ( ^ i ) 2 (x)dxdG ()dG (): Then the risk function for the jacknifed estimator is r ^ ;G =E X;; 2 n 1 ^ 2 2 =n 1 n X i=1 r i ^ i ;G : (2.32) Noting thatr i ( ^ i ;G) is invariant toi, and ignoring the notational difference betweenn andn 1, we can see that the average risk of ^ from (2.32) is equal to the univariate riskr( ^ ;G) defined by (2.22), thereby converting the compound risk of a vector decision to the risk of a scalar decision: r ^ ;G =r( ^ ;G): 48 2.5.6 Proof of Proposition 2.3.3 Consider the full and jackknifed density estimators ^ f i (x i ) and ^ f i (x i ). Denote ^ f (1) i (x i ) and ^ f (1); i (x i ) the corresponding derivatives. LetS i = P n j=1 h ( i j ); S i = P j6=i h ( i j ): Some algebra shows the following relationships: ^ f (x i ) = S i S i ^ f i (x i ) + 1 2S i h h x i ; ^ f (1) i (x i ) = S i S i ^ f (1); i (x i ): (2.33) The full and jackknifed NEST estimators are respectively given by: ^ i =x i + 2 i ^ f (1) i (x i ) ^ f i (x i ) ; ^ i =x i + 2 i ^ f (1); i (x i ) ^ f i (x i ) : (2.34) Then according to (2.33) and (2.34), we have ( ^ i ^ i ) 2 = i 2h x h 2 ( ^ f (1); i (x i ) ^ f i (x i ) ) 2 ( 1 S i ^ f i (x i ) ) 2 = O(h 2 x h 2 )O(C 02 n ) n S i ^ f i (x i ) o 2 Define Q i = S i ^ f i (x i ) (2h x h i ) 1 = P j6=i h ( i j ) h xj (x i x j ). Then E X;; 2 Q i = (n 1)E X i ; i ; 2 i q(X i ; 2 i ); where theq function can be derived using arguments similar to those in Section 2.5.4: q(x; 2 ) = Z fg y g (x) h (y)dG (y): LetY ij = h ( i j ) h xj (x i x j ) and Y i be its average. ThenE( Y i ) =Efq(X i ; 2 i )g. LetA i be the event such that Y < 1 2 E( Y ). Then applying Hoeffding’s inequality, we claim that P(A i )P j Y i E( Y i )j 1 2 ( Y i ) =O(n ) 49 for some> 0. Note that on eventA i , we have E X;; 2 n ( ^ i ^ i ) 2 1 A i o =O(C 02 n )O(n ) =o(1): for alli. On the complement, E X;; 2 n ( ^ i ^ i ) 2 1 A C i o O(h 2 x h 2 )O(C 02 n ) Q i 2 O(h 2 x h 2 )O(C 02 n ) 1 2 (n 1)Efq(X i ; 2 i )g 2 Our assumption on bounded 2 i implies thatEfq(X i ; 2 i ) is bounded below by a constant. Hence E X;; 2 n ( ^ i ^ i ) 2 1 A C i o =O(n 2 h 2 x h 2 C 02 n ) =o(1): Finally we apply the triangle inequality and the compound risk definition to claim that E X;; 2 n 1 ^ 2 2 =E X;; 2 n 1 ^ 2 2 +o(1); which proves the desired result. 2.5.7 Proof of Theorem 2.3.2 Consider the distribution ofX restricted to the region [t; +1]. 
The density of the truncatedX is given by f (xjX >t) = d dx R x t f (y)dy R 1 t f (y)dy = f (x) 1F (x) forx>t; (2.35) 50 wheref andF are defined in the theorem. The expected value ofX can be computed as Z 1 t (x)f (xjX >t)dx = R 1 t (x)f (x)dx 1F (x) = RR 1 t (x) (x)dxdG() 1F (x) = 2 RR 1 t d (x)dG() 1F (x) = 2 R (t)dG() 1F (x) = 2 f (t) 1F (t) ; which gives the desired result. 2.5.8 Proof of Theorem 2.3.3 We first prove that the oracle estimator is unbiased. Using the density of the truncated variable (2.35), the conditional expectation of the ratiof (1) (X)=f (X) can be computed as E X;jX>t ( f (1) (X) f (X) ) = Z 1 t f (1) (x) f (x) f (x) 1F (t) dx = f (t) 1F (t) : By Theorem 2.3.2, we claim that the oracle estimator is unbiased. Next we discuss the general steps to show thatE DjX i >t ( ^ i i ) =o(1). Details will be omitted. We only need to prove the convergence of NEST to the oracle Tweedie estimator under theL 2 risk: E DjX i >t ( ^ i i ) 2 =o(1): (2.36) Following similar arguments to those in previous sections, the proof of (2.36) essentially boils down to proving the result for a generic triple (X;; 2 ) and the corresponding decision rules ^ 51 and :E D n E XjX>t ( ^ ) 2 o = o(1): The proof can be done similarly to that of Proposition 3 as the conditional density functionf (xjX >t) is proportional tof (x). 2.5.9 Proof of Lemma 2.5.2 We first argue in Section 2.5.9 that it is sufficient to prove the result over the following domain R x :=fx :C n lognxC n + logng: (2.37) This simplification can be applied to the proofs of other lemmas. Truncating the domain Our goal is to show that ( ^ ) 2 is negligible onR C x . Sincejj C n by Assumption 2.3.1, the oracle estimator is bounded: =E(Xj; 2 ) = R (x)dG () R (x)dG () <C n : Let C 0 n = C n + logn. Consider the truncated NEST estimator ^ ^C 0 n . The two intermediate estimators ~ and are truncated correspondingly without altering their notations. Let 1 Rx be the indicator function that is 1 onR x and 0 elsewhere. Our goal is to show that Z Z Z R C x ( ^ ) 2 (x)dxdG ()dG () =O(n ) (2.38) for some small > 0. Note that for allx2R C x , the normal tail density vanishes exponentially: (x) = O(n 0 ) for some 0 > 0. The desired result follows from the fact that ( ^ ) 2 = o(n ) for any> 0, according to the assumption onC n . 52 Proof of the lemma We first apply triangle inequality to obtain ( ) 2 4 ( f (1) (x) f (x) ) 2 f (x) f (x) 2 2 4 ( f (1) (x) f (1) (x) 1 ) 2 + f (x) f (x) 1 2 3 5 2 : Hence the lemma follows if we can prove the following facts forx2R x . (i)f (1) (x)=f (x) =O(C 0 n ), whereC 0 n =C n + logn. (ii) f (x)=f (x) = 1 +O(n " ) for some"> 0. (iii) f (1) (x)=f (1) (x) = 1 +O(n " ) for some"> 0. To prove (i), note that = O(C n ) as shown earlier, andx = O(C 0 n ) ifx2 R x . The oracle estimator satisfies = x + 2 f (1) (x)=f (x). By Assumption 2, G has a finite support, so we claim thatf (1) (x)=f (x) =O(C n ). Now consider claim (ii). LetA := n :jxj p log(n) o . Following similar arguments to the previous sections, we apply the normal tail bounds to claim that (x) =Ofn 1=(2 2 +1) g. Similar arguments apply tof (x) when2A . Therefore f (x) f (x) = R 2A (x)dG () R 2A (x)dG () 1 +O(n 1 ) (2.39) for some 1 > 0. Next, we evaluate the ratio in the range ofA : (x) (x) = ( ) exp 1 2 (x) 2 1 ( ) 2 1 2 = 1 +O(n 2 ) (2.40) 53 for some 2 > 0. This result follows from our definition of , which is in the range of [L n ; + L n ] for someL n n l . 
Since the result (2.40) holds for all inA , we have Z 2A (x)dG () = Z 2A (x) (x) (x) dG () = Z 2A (x)dG () 1 +O(n 2 ) : Together with (2.39), claim (ii) holds true. To prove claim (iii), we first show that f (1) (x) = Z (x) x 2 dG () = Z 2A (x) x 2 dG () 1 +O(n 2 ) for some > 0. The above claim holds true by using similar arguments for normal tails (as the term (x) essentially has no impact on the rate). We can likewise argue that f (1) (x) = Z A 2 ( ) 2 (x) (x) (x) x 2 dG () = f (1) (x)f1 +O(n " )g for some > 0. This proves (iii) and completes the proof of the lemma. Note that the proof is done without using the truncated version of . Since the truncation will always reduce the MSE, the result holds for the truncated automatically. 2.5.10 Proof of Lemma 2.5.3 It is sufficient to prove the result overR x defined in (2.37). Begin by defining R 1 = ~ f (1) (x) f (1) (x) andR 2 = ~ f (x) f (x). Then we can represent the squared difference as ( ~ ) 2 =O 0 @ R 1 f (x) +R 2 2 + " R 2 f (1) (x) f (x)f f (x) +R 2 g # 2 1 A : (2.41) 54 ConsiderL n defined in the previous section. We first study the asymptotic behavior ofR 2 . R 2 = X j 2A w j f j (x)f (x) +K n (); (2.42) where the last term can be calculated as K n () = X j 2A C w j f j (x)f (x) =O 0 @ X j 2A C w j 1 A : The last equation holds since both f j (x) and f (x) are bounded according to our assumption 2 l 2 j 2 u for allj. ConsiderA := n :jxj p log(n) o . We have f (x) f j (x) = R 2A (x)dG () R 2A j (x)dG () 1 +O(n 1 ) for some 1 > 0, and in the range of2A , we have (x)= (x) = 1 +O(n 2 ) for some 2 > 0 and allj such that j 2A . We conclude that the first term in (2.42) isO(n ) for some> 0 sincef j (x) is bounded and P j2N w j 1. Now we focus on the asymptotic behavior of P j 2A C ! j (). LetK 1 be the event that n 1 n X j=1 h ( j )< 1 2 fg h g() andK 2 the event that n 1 n X j=1 1 f j 2A C g h ( j )> 2 Z A C g (y)(y)dy: 55 LetY j = h ( j ). Then fora j Y j b j , we use Hoeffding’s inequality P j YE( Y )jt 2 exp ( 2n 2 t 2 P n j=1 (b i a i ) 2 ) : Takingt = 1 2 E(Y i ), we have P(K 1 ) 2 exp (1=2)n 2 fE(Y i )g 2 nO(h 1 ) =O(n ) for some > 0. Similarly we can show thatP(K 2 ) = O(n ) for some > 0. Moreover, on the eventK =K C 1 \K C 2 , we have X j 2A C ! j () 4 R A C g (y)(y)dy fg h g() =O(n ) for some > 0. We use the same in the previous arguments, which can be achieved easily by appropriate adjustments (taking the smallest). Previously we have shown that the first term in (2.42) isO(n ). Hence on eventK,R 2 =O(n ) for some> 0. Now consider the domainR x . DefineS x :=fx : f (x) > n 0 g, where 0 < 0 < . On R x \S C x , we have Z Z Rx\S C x ( ~ ) 2 f (x)dxdG () =OfC 02 n P(R x \S C x )g =O(n ) (2.43) for some> 0. The previous claim holds true since the length ofR x is bounded byC 0 n , and both ~ and are truncated byC 0 n . Now we only need to prove the result for the regionR x \S x . On eventK, we have E 2 0 @ 1 K Z Z Rx\Sx " R 2 f (1) (x) f (x)f f (x) +R 2 g # 2 f (x)dxdG () 1 A = O(C 02 n )O n ( 0 ) ; 56 which isO(n ) for some> 0. On eventK C , E 2 1 K C Z Z Rx\Sx ( ~ ) 2 f (x)dxdG () =O(C 02 n )O(n ); which is also O(n ). Hence the risk regarding the second term of (2.41) is vanishingly small. Similarly, we can show that the first term satisfies E 2 Z Z Rx\Sx R 1 f (x) +R 2 2 f (x)dxdG () ! =O(n ): Together with (2.43), we establish the desired result. 2.5.11 Proof of Lemma 2.5.4 LetS 1 = ^ f (1) (x) ~ f (1) (x) andS 2 = ^ f (x) ~ f (x). 
Then ( ~ ^ ) 2 2 4 2 4 ( ~ f (1) (x) ~ f (x) ) 2 S 2 S 2 + ~ f (x) 2 + S 1 S 2 + ~ f (x) 2 3 5 : (2.44) According to the definition of ~ f (x) [cf. equation (2.30)], we haveE X;j 2(S 2 ) = 0: By doing differentiation on both sides we further haveE X;j 2(S 1 ) = 0: A key step in our analysis is to study the variance ofS 2 . We aim to show that V X;; 2(S 2 ) =O(n 1 h 1 h 1 x ): (2.45) To see this, first note that V X;j 2(S 2 ) = n X j=1 w 2 j V X;j 2f h xj (xX j )g; where 57 Vf h xj (xX j )g = Z f h xj (xy)g 2 fg j g(y)dy Z h xj (xy)fg j g(y)dy 2 = 1 h x 2 j Z 2 (z)g (x +h x j z)dz 1 j Z (z)g j (x +h x j z)dz 2 = 1 h x 2 j Z 2 (z)dz f j (x)f1 +o(1)g 1 j f j (x) 2 f1 +o(1)g = O(h 1 x ): Next we shall show that E 2 n P n j=1 w 2 j o =O (n 1 h 1 ): (2.46) Observe that h ( j ) =O(h 1 ) for allj. Therefore we have n X j=1 2 h ( j ) =O(h 1 ) n X j=1 h ( j ); which further implies that n X j=1 w 2 j = P n j=1 2 h ( j ) n P n j=1 h ( j ) o 2 = O(n 1 h 1 ) n 1 P n j=1 h ( j ) : LetY j = h ( i ) and Y =n 1 P n j=1 Y i . Then 0Y j ( p 2h ) 1 and E(Y j ) =fg h g() =g () +O(h 2 ): LetE 1 be the event such that Y < 1 2 E( Y ). We apply Hoeffding’s inequality to obtain P Y < 1 2 E( Y ) P j YE( Y )j 1 2 E( Y ) 2 exp 2n 2 g h () n(2) 1 h 2 2 exp(Cnh 2 ) =O(n 1 ): 58 Note that P n j=1 w 2 j P n j=1 w j = 1. We have E( n X j=1 w 2 j ) = E n X j=1 w 2 j 1 E ! +E n X j=1 w 2 j 1 C E ! = O(n 1 h 1 ) +O(n 1 ) = O(n 1 h 1 ); proving (2.46). Next, consider the variance decomposition V X;; 2(S 2 ) =V 2fE X;j 2(S 2 )g +E 2fV X;j 2(S 2 )g: The first term is zero, and the second term is given by E 2fV X;j 2(S 2 )g =O(h 1 x )E P n j=1 w 2 j =O(n 1 h 1 h 1 x ): We simplify the notation and denote the variance of S 2 byV(S 2 ) directly. ThereforeV(S 2 ) = O(n ) for some > 0. Consider the following spaceQ x =fx : ~ f (x) > n 0 g, where 2 0 < . In the proof of the previous lemmas, we showed that onR x , ~ f (x) =f (x)f1 +O(n )g +K n ; whereK n is a bounded random variable due to the variability of 2 j , andE 2(K n ) = O(n " ) for some"> 0. Next we show it is sufficient to only considerQ x . To see this, note that E 2 Z Z Rx\Q C x ( ~ ) 2 f (x)dxdG () = E 2 Z Z Rx\Q C x ( ~ ) 2 h ~ f (x)f1 +O(n )g +K n i dxdG () = O(C 03 n ) n O(n 0 ) +O(n 0 ) +O(n 1=2 ) o ; 59 which is alsoO(n ) for some> 0. Let Y j =w j h xj (xX j )w j g j (x) and Y =n 1 P n j=1 Y j . ThenE(Y j ) = 0,S 2 = P n j=1 Y j , and 0Y j D n , whereD n h 1 x . Let E 2 be the event such thatS 2 < 1 2 ~ f (x). Then by applying Hoeffding’s inequality, P(E 2 ) P j YE( Y )j 1 2 ~ f (x) 2 exp ( 2n 2 f 1 2 ~ f (x)g 2 nD 2 n ) =O(n ) for some> 0. Note that on eventE 2 , we have E X;; 2 n ( ^ ~ ) 2 1 E 2 o =O(C 2 n )O(n ) =o(1): Therefore, we only need to focus on the eventE C 2 , on which we have ~ f (x) +S 2 1 2 ~ f (x): It follows that onE C 2 , we have fS 2 =( ~ f (x) +S 2 )g 2 4S 2 2 =f ~ f (x)g 2 : Therefore the first term on the right of (2.44) can be controlled as E X;; 2 0 @ 1 E C 2 Z Z Rx\Qx ( ~ f (1) (x) ~ f (x) ) 2 S 2 S 2 + ~ f (x) 2 f (x)dxdG () 1 A = O(C 02 n )O(n (2 0 ) ) =O(n ) for some> 0. Hence we show that the first term of (2.44) is vanishingly small. 60 For the second term in (2.44), we need to evaluate the variance term of S 1 , which can be similarly shown to be of order O(n ) for some > 0. Following similar arguments, we can prove that the expectation of the second term in (2.44) is also vanishingly small, establishing the desired result. 
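The concentration arguments in the preceding lemmas repeatedly invoke Hoeffding's inequality for bounded variables. As an informal numerical illustration (not part of the proofs), the short R simulation below compares the empirical tail probability of a sample mean of bounded variables with the Hoeffding bound 2 exp{-2 n t^2 / (b - a)^2}; the sample size, bounds, and deviation t are arbitrary choices made only for this example.

# Illustrative check of Hoeffding's inequality for a mean of [a, b]-bounded
# variables: P(|Ybar - E(Ybar)| >= t) <= 2 * exp(-2 * n * t^2 / (b - a)^2).
set.seed(1)
n <- 200; a <- 0; b <- 1; t <- 0.1; reps <- 10000
ybar <- replicate(reps, mean(runif(n, a, b)))    # E(Y_j) = 1/2
empirical <- mean(abs(ybar - 0.5) >= t)          # observed tail frequency
bound     <- 2 * exp(-2 * n * t^2 / (b - a)^2)   # Hoeffding upper bound
c(empirical = empirical, bound = bound)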
2.5.12 Table of large school proportions

Table 2.5 shows, for each method, the proportion of large schools among the 100 schools it selects, over three-year windows ranging from 2002-2004 to 2008-2010. NEST is among the closest to giving large schools 25% representation in 6 of the 7 windows.

Table 2.5: Proportion of large schools. Bolded terms represent best performances.

Method     2002-2004  2003-2005  2004-2006  2005-2007  2006-2008  2007-2009  2008-2010
Naive        0.10       0.11       0.12       0.12       0.12       0.14       0.16
NEST         0.20       0.19       0.26       0.26       0.29       0.25       0.24
TF           0.13       0.16       0.15       0.18       0.19       0.24       0.29
Scaled       0.11       0.12       0.12       0.12       0.12       0.17       0.18
2 Group      0.13       0.15       0.15       0.18       0.18       0.25       0.29
3 Group      0.13       0.15       0.15       0.16       0.18       0.22       0.26
4 Group      0.13       0.15       0.15       0.17       0.18       0.25       0.27
5 Group      0.13       0.15       0.15       0.17       0.17       0.26       0.29
Group L      0.13       0.16       0.17       0.18       0.24       0.29       0.28
SURE-M       0.16       0.17       0.19       0.21       0.24       0.34       0.28
SURE-SG      0.16       0.17       0.18       0.21       0.22       0.32       0.26

Chapter 3
The Mountain Comes to Muhammad: Caveats on Using Standardized Statistics in Multiple Testing

Multiple testing frameworks are routinely employed by scientists and researchers to compare many features simultaneously across many groups. For example, biologists often compare gene expression levels across treated and untreated groups in order to find genes that respond to treatment. Generally, multiple testing involves three steps: reduce the data to standardized test statistics, usually z-values; calculate p-values or analogous statistics based on the z-values; and finally find a threshold of significance that corrects for multiplicity. The standardization step is upheld as conventional wisdom, but this chapter rigorously demonstrates, through theory and intuitive examples, that standardization has negative effects for heteroscedastic multiple testing. This chapter also proposes an ameliorating testing procedure, "Muhammad Meets the Mountain" (3M), suitable for large-scale, heterogeneous data. It eliminates the need to standardize heteroscedastic errors and provides an optimal multiple testing procedure for finding non-zero means.

The findings are impactful for two reasons. First, the message runs counter to the widely used and commonly accepted practice of standardizing data. This has broad implications since standardized multiple tests are often applied to large-scale, heterogeneous data. Multiple testing has been used not only in the analysis of gene expression levels (a few works include Tusher et al. (2001), Dudoit et al. (2003), Efron (2008b) and Sun and Wei (2011)) but also in astronomy, for the detection of undiscovered galaxies (Miller et al. 2001); in neuro-imaging, for the discovery of differential brain activity (Genovese et al. 2002, Schwartzman et al. 2008); in education, to identify student achievement gaps (Efron 2008b); in data visualization, to find potentially interesting patterns (Zhao et al. 2017); and in finance, to evaluate trading strategies (Harvey and Liu 2015). These data often have variances that differ across testing units, due either to inherent differences or to differing sample sizes between units, and so standardization is commonly used to make testing units comparable. Second, standardization exacts a price, but the proposed method avoids these drawbacks and efficiently uses heterogeneous large-scale data, where testing units number in the thousands or more.
Ideas from 3M apply to smaller data sets as well, but the algorithm is designed to capitalize on copious data in ways not possible for procedures intended for moderate amounts of data, and thus is most powerful in large-testing scenarios. The main idea of 3M is to bypass standardization by directly incorporating (x i ; i ) into the testing procedure. The procedure has two steps: in the first step, a statistic that captures the likely significance of each testing unit is calculated using a bivariate density estimator for (x i ; i ). A key feature of this step is that the alternative distribution for each x i is conditioned on the standard deviation i , so that 3M avoids power distortion. Note that this kind of conditioning is not possible for standardized values since the i are subsumed under z i . Then, in the second step, the significance statistics are ordered, and the smallest of these so-called ranking statistics are rejected at a cutoff. Related works that focus on adjusting for heterogeneous testing power fall into two categories: grouping and weighting. When auxiliary information is available, the grouping approach first separates data into groups and secondly applies a multiple testing procedure separately to each (Efron 2008a). The choice of groups is usually context specific. One advantage of 3M is that it eliminates the need to define groups as part of the testing procedure, which is particularly liberating when the groups can be divided in many possible ways. It also pools information from all tests rather than discretizing information by group. This is desirable because, as noted in Section 1.3.2, testing procedures can gain power from pooling information across tests. The weighting approach first calculates two statistics for all testing units:p-values and weights. It then forms ranking statistics by ordering the weightedp-values. Finally the weighting approach uses these ordered statistics to accept or reject hypotheses. It speaks to the challenging nature 63 of designing good weights that this framework has produced many different methods of weight- ing (a representative collection includes Genovese et al. (2006), Ferkingstad et al. (2008), Roeder and Wasserman (2009), Hu et al. (2012), and Li and Foygel Barber (2016)), but no weighting method that directly incorporates heteroscedasticity currently exists under this framework. The chief advantages 3M has over weighting methods are threefold. One, it directly incorporates infor- mation from into the test statistic. Second, finding p-value weights as function of variance is messy, but 3M can bypass this difficulty since it does not use p-values. Rather, the local false discovery rate (Lfdr) statistic it employs adapts to heteroscedasticity in an intuitive way. Third, the choice of weights is embedded in the algorithm and does not rely on prior information. More generally, the incorporation of the standard deviation in 3M distinguishes the method from techniques that incorporate auxiliary information. The standard deviation is more integral to the data’s distribution than a typical auxiliary variable, and so it is not used merely to segment or weight data but plays a critical role in the density estimation. By using the standard deviation, 3M is thus able to enhance the accuracy of the statistic by which the significance of each test is assessed. We provide more discussion on the connections between methods in Section 3.1, after 3M is more fully explained. The rest of the chapter is organized as follows. 
Section 3.1 reviews the standardized oracle rule to control for FDR, introduces the proposed oracle rule, provides intuition and theory, and connects it to the larger body of literature. Section 3.2 describes the data-driven procedure and its theoretical properties. Section 3.3 contains simulations, and Section 3.4 demonstrates the estimator on microarray analyses of three medical conditions: leukemia, lymphoma, and myeloma. 3.1. Oracle Rule for FDR Control This section formally gives the notation for the multiple-testing set-up considered, reviews the standardized Lfdr method, describes the oracle method, illustrates the difference between the two methods, proposes 3M, and compares 3M to other methods. 64 3.1.1 Notations and definitions Suppose (X 1 ; 1 );:::; (X m ; m ) are pairs of observations from the following model: X i j i ; i ind N( i ; 2 i ); (3.1) where i are drawn from a partially unknown sparse prior, and are drawn from a fully unknown prior: i iid (1p)I( i = 0) +pg (); 2 i iid g (); (3.2) wherep2 (0; 1) is the proportion of nonzero signals, I( i = 0) is one when theH i;0 is true or otherwise zero, andg () andg () are unknown continuous density functions. The goal is to test m simultaneous hypothesis tests: H i;0 : i = 0 vs. H i;1 : i 6= 0; i = 1; ;m: (3.3) Let the underlying truth be = ( i : 1 i m)2f0; 1g m , where i = 1 whenH i;1 is true and i = 0 otherwise. The testing procedure produces a vector of decisions = ( i :2 1im)2 f0; 1g m , where i = 1 means we rejectH 0;i , and i = 0 means we do not reject the null. The false discovery rate (FDR) metric (Benjamini and Hochberg 1995) is a widely used crite- rion for controlling the Type I error rate across multiple hypotheses. It is defined as: FDR =E P i (1 i ) i P i i _ 1 ; (3.4) where i _ 1 denotes the maximum of i and 1. It sets the default for FDR to zero when no null hypotheses are rejected. Otherwise, FDR is the expected proportion of falsely rejected nulls over the total number of rejections. A closely related but more convenient criterion is the marginal false discovery rate (mFDR): 65 mFDR = E ( P i (1 i ) i ) E ( P i i _ 1) : (3.5) The mFDR is asymptotically equivalent to the FDR for a broad range of settings, including p-value- based tests for independent hypotheses (Genovese and Wasserman 2002),p-value-based tests for weakly dependent hypotheses (Storey et al. 2004), and for a general set of decision rules that satisfy first- and second-order conditions (Basu et al. 2017). The form of the mFDR makes it easier to establish theoretical properties of our testing procedure. We describe the Lfdr statistic (Efron et al. 2001) next, which requires the following notation. Denote the true proportion of non-null tests as = P( i = 1). The null density is f 0 (), the non-null density isf 1 (), and the marginal density isf() = (1)f 0 () +f 1 (). 3.1.2 Lfdr and oracle rule with standardized statistics For the standardized data, the Lfdr statistic is defined as Lfdr(z) =P ( = 0jz) = (1)f 0 (z) f(z) : (3.6) An optimal procedure is one that has the largest expected number of true positives (ETP) among all procedures based on z-values where FDR . Sun and Cai (2007) showed that the optimal FDR procedure is a thresholding rule with the form (Lfdr;c) = [IfLfdr(z i )<cg : 1im] where the thresholdc2 [0; 1]. Specifically, the optimal threshold is the largest cutoff under the mFDR constraint, c = supfc : Q LF (c) g, where Q LF (c) is the mFDR level of(Lfdr;c). 
Then the optimal FDR procedure is (Lfdr;c ) = [IfLfdr(z i )<c g : 1im]: 66 3.1.3 Oracle rule adjusted for We present an alternative to standardizing. Consider the following conditional local false discovery rate (Clfdr) test statistic T i OR (x;) =P ( = 0jx;) = (1)f(j i = 0)f (xj i = 0) f(x;) ; (3.7) which simplifies to T i OR (x;) = (1)f (xj i = 0) f (x) : when i and 2 i are independent. LetQ OR (t) denote the mFDR level of the testing rule [IfT OR (x i ; i )<tg : 1im]: Define the oracle threshold t OR = supft :Q OR (t)g: (3.8) The optimal procedure for FDR control OR = ( i OR : 1im) is given by the ranking statistic T i OR and the thresholdt OR : i OR =IfT i OR <t OR g: (3.9) Theorem 3.1.1 (Optimality and validity of the oracle rule.). LetD be the collection of all testing rules based on (x;) such that mFDR . Then ETP ETP OR for any2D . The above theorem is proved in Section 3.5 and states that amongst testing rules based on (x;) that control the FDR at level, (3.9) has highest power. 67 3.1.4 The Mountain Comes to Muhammad This section presents an example that explores the effects of heteroscedasticity on power and that illustrates the loss of power induced by pooling standardized statistics. Consider the following random mixture model, X i (1)N( 0 ; 2 i ) +N( 1 ; 2 i ): (3.10) where the null 0 = 0, the alternative 1 = 3, = :25, and i are known but heterogeneous. Imagine that for each, an observation is drawn such that the standardized observation isx 0 = 3. Also, the weighted null distribution can be calculated using (1)(x=), the weighted alternative distribution using(x= 1 =), and the Clfdr rule (3.7) using Clfdr(x;) = (1)(x=) (1)(x=) +(x= 1 =) ; where() is the standard normal distribution. Since the Clfdr isP ( = 1jx;), here it measures power for fixedx= =x 0 at different. Figure 3.1 depicts the alternative distribution and Clfdr(x;) at various . The blue solid curve represents (1)(x=), and the fixed observationx 0 is the red dash on thex-axis. The green dashed curves represent(x= 3=), for = 1=2, 2=3, 1, 2, and 6, with the black dots indicating the Clfdr(x;). We see that Clfdr (x 0 ) is smallest at (x 0 ; 1) and the null is most likely to be rejected in favor of the alternative. As gets larger, the alternative distribution ofx= draws closer to the null distribution, so that it becomes uncertain from which distributionx 0 is drawn. As gets smaller, the alternative distribution ofz pulls away from the support of the null but also pulls further away fromx 0 so thatx 0 is again as likely to be from the null as the alternative. The result is that some tests will favor the null hypothesis even thoughx 0 is from the alterna- tive distribution.This seeming paradox brings to mind the proverb of Sir Francis Bacon: “If the mountain will not come to Muhammad, then Muhammad must go to the mountain.” This case pro- vides an inversion: Muhammad (the red dash) stands still, while the mountain (the standardized alternative distribution) moves around him. 68 Figure 3.1: The Mountain Comes to Muhammad: A graphical illustration. -2 0 2 4 6 8 0.05 0.10 0.15 0.20 0.25 0.30 x Density Curves x0 CLfdr=0.75 sigma=0.5 CLfdr=0.093 sigma=0.67 CLfdr=0.03 sigma=1 CLfdr=0.09 sigma=2 CLfdr=0.43 sigma=6 The mountain comes to Muhammad. The mountain runs over Muhammad. The mountain moves away from Muhammad. If you standardize everything then the null distribution and x stays the same but the alternative distribution gets closer and closer to the null. 
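The Clfdr values marked in Figure 3.1 can be reproduced directly from the two-group formula above. The short R sketch below does this for π = .25, μ1 = 3, and an observation fixed at x0 = 3σ for each σ, exactly as in the example; it is included only as an illustration, and the printed values agree with the figure up to rounding.

# Clfdr(x, sigma) for model (3.10) with the standardized observation held at
# x0 / sigma = 3, illustrating how the conditional statistic changes with sigma.
p1    <- 0.25                  # proportion of non-nulls
mu1   <- 3                     # alternative mean
sigma <- c(1/2, 2/3, 1, 2, 6)
z0    <- 3                     # x0 / sigma is fixed at 3 for every sigma
clfdr <- (1 - p1) * dnorm(z0) /
         ((1 - p1) * dnorm(z0) + p1 * dnorm(z0 - mu1 / sigma))
round(clfdr, 3)                # 0.750 0.093 0.032 0.093 0.431

The Clfdr is smallest at σ = 1 and rises as σ moves away from one in either direction, matching the black dots in Figure 3.1.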
At first, the alternative drawing closer makes the data look as if they must have come from the alternative, but eventually the alternative is almost as far from the data as the null is, so we would end up sticking with the null.

Next we show that the standardized Lfdr (3.6) does not retain this information. We consider a sequence of σ from .4 to 6 with increments of .05. For this setting, the σ_i are discretely and evenly distributed, so the Lfdr is calculated as

Lfdr(z) = (1 − π)φ(z) / { (1 − π)φ(z) + π n_σ^{-1} Σ_σ φ(z − 3/σ) },

where z = x_0/σ and n_σ = 115 is the number of elements in the sequence. Figure 3.2 depicts the Clfdr and the Lfdr as functions of σ, where the black circles correspond to the Clfdr values from Figure 3.1. The solid blue line is the Clfdr (3.7), which grows as σ strays from one, and the dashed black line is the Lfdr, which is constant regardless of σ. Standardized statistics ignore the moving mountain at their own peril. The key takeaway from these figures is that the same z-value can have different power depending on σ, and while the different powers are captured by P(θ = 0 | x, σ), the information is lost to the standardized analog P(θ = 0 | z).

Figure 3.2: A depiction of information loss due to standardization: the Clfdr (solid blue curve) uses information from σ_i while the Lfdr (dashed black line) does not.

3.1.5 Connections to existing works

The 3M procedure follows two lines of work, which we characterize separately because the main emphasis has been on different issues, though the techniques have some overlap. The first line of work focuses on adapting testing procedures to the different powers of each testing unit. The second addresses how to incorporate side information.

Two main approaches have tackled disparate power amongst tests: grouping and p-value weighting, with considerable cross-over between the two. Grouping, pioneered by Efron (2008a), argues for testing groups separately from one another. Cai and Sun (2009) discretized groups to calculate ranking statistics but combined groups for decision making. Barber and Ramdas (2017) proposed a method to control FDR over groups and individual tests simultaneously that involves applying the Simes (Simes 1986) threshold over groups. Methods for p-value weighting focus on modifying the BH procedure to increase power. We point out Roquain and Van De Wiel (2009) in particular, as it derives the optimal weighting scheme in this class of multiple testing procedures. Also, Hu et al. (2012) combines ideas from both camps by assigning weights to p-values based on groups.

Research in using side information has gone in as many directions as there are varieties of side information (Benjamini and Heller (2007), Fortney et al. (2015), Li and Barber (2017), to name a few). Surprisingly few works directly incorporate heteroscedasticity as additional information. The closest works to 3M are Habiger (2017) (WAMDF) and Lei and Fithian (2018) (AdaPT). WAMDF is based on a Bayesian two-group model very similar to (3.1)-(3.2), in which the null and alternative distributions are conditioned on a non-negative auxiliary parameter; in that framework, σ may be viewed as a special case. The work is notable for its derivation of p-value weights as a function of the auxiliary parameter, but the test formulation is quite complicated, as it requires handling the derivative of the power function, calculating a normalizing constant k, estimating per-test parameters, estimating non-null proportions p_i for each test, tuning a parameter, and tuning an upper bound on an adaptive threshold.
The advantage of 3M over this method is ease of use. AdaPT is an iterative, easily comprehended method that flexibly incorporates side-information while controlling FDR. It uses the null and alternative distributions of thep-values, rather than the observed data, conditioned on covariates, but one must know how the distributions of thep-values depend on the covariates. This makes the method challenging to use for heteroscedastic, as it is difficult to find the functional relationship between and the distributions of thep-values. 3.2. Methodology The statistics T OR and t OR are unknown in practice and need to be estimated from data. This section provides methods for estimating these values. 3.2.1 Estimation ofT OR We first consider the case where i and i are independent: T i OR (x i ; i ) = (1)f i (x i j i =0) f i (x i ) . The non-null proportion can be estimated using ^ from Jin and Cai (2007). By (1.1), the null 71 f i (x i j i = 0) is a known density function. The mixture density f i (x i ), however, is challeng- ing to estimate. Bivariate kernel estimators (Silverman 1986, Wand and Jones 1994) cannot be used because i is unlike an arbitrary random variable in that it dictates the spread of a density function. The smoothing kernel estimator from NEST (Fu et al. 2018) incorporates i appropri- ately, and therefore we use NEST as a starting point. The NEST weighted kernel is given by the formula: ^ f ;h (x) := n X j=1 w j h xj (xx j ); (3.11) whereh = (h x ;h ) is a pair of bandwidths,w j w j (;h ) = h ( j )=f P n j=1 h ( j )g is the weight that determines the contribution of (x j ; j ) based on j ,h xj = h x j is a bandwidth that varies across j, and h (z) = 1 p 2h exp n z 2 2h 2 o is a Gaussian kernel. The form is similar to a conditional kernel estimator (Hyndman et al. 1996), but the variable bandwidthh xj adds an important element of stability to the estimator. It downweights observations corresponding to large j so that more variablex j do not produce misleadingly small ranking statistics (the statistics that play the role thatp-values,T OR or the Lfdr do) by chance. As it stands, (3.11) does not correct for selection bias and requires further modification in order to be practical for multiple testing. In multiple testing, extreme values play a particularly important role since they indicate promising signals, but these are the values which are hardest to accurately estimate. Ideally, our estimator will capture the true division between null and alternative densities as the true marginal density does:f (x) = (1)f (xj = 0) +f (xj = 1), but two obstacles stand in our way. One, there are a small number of non-nulls and two, we do not, of course, know which data are non-null. As a result, data from the null distribution will often appear in the tail ends of (3.11) by chance and seem as if they were from the alternative distribution. Our solution is a two-stage procedure. First, we initially divide testing units into likely null or alternative hypotheses based onp-valuesfp i : 1 i mg. Lets denote a screening threshold for thep-values, s i = I(p i <s), and s = ( s 1 ;:::; s m ). Then, the screened observations are the set s = 1. Second, we 72 employ (3.11) to estimatef (xj = 1) by using only the (x j ; j ) corresponding to s j = 1 , and we approximate the entire marginal distributionf (x) as ^ f s (x) = (1 ^ )f (xj = 0) + ^ ^ f ;h (xj s = 1); (3.12) where ^ andf (xj = 0) are as previously proposed. 
Combining the estimators for,f (xj = 0), andf (x) gives a data-driven approximation toT i OR : b T s;i OR = (1 ^ )f i (x i j i = 0) ^ f s i (x i ) : (3.13) This estimator benefits from the stability off i (x i j i = 0) in the denominator and increases the accuracy off i (x i ) for extreme values by screening out most of the observations from i = 0 for the density estimate off i (x i j i = 1). In practice, to estimate s we suggest a screening threshold =:2, which we use in the numerical studies of Sections 3.3 and 3.4. The next proposition establishes that the oracle version of the thresholding rule I( b T s;i OR t OR ) has conservative FDR control. Proposition 3.2.1. Let T s;i OR = (1)f i (x i j i = 0) f s i (x i ) ; wheref s i (x i ) = (1)f i (x i j i = 0) +f i (x i j s i = 1), and s OR =fI(T s;i OR ) <t OR : 1 i mg. Also letR denote the index set of hypotheses rejected by s OR , andW =fi :T i OR < 1g, the set of promising hypotheses. Assume that (C1)RW. Then (a)T i OR T s;i OR , and (b) for s OR , the FDR< and mFDR<. The proof is in Section 3.5. Condition (C1) says that the rejected hypotheses should not have ranking statistics that are very large. Since the operation of s OR indicates that only smallT i OR will be rejected, in practice, this condition is trivially fulfilled. 73 3.2.2 Estimation oft OR First, to find the oracle thresholdt OR that we use with our ranking statisticsT i OR , we follow two steps. For FDR control, we first order the hypotheses according to the values of the oracle statistics, and then use the following stepwise algorithm. Let T (i) OR denote the ordered oracle statistics and H (1) ; ;H (m) the corresponding hypotheses. Set k = max ( j : 1 j j X i=1 T OR (i) ) ; (3.14) and then rejectH (1) ; ;H (k) . The valueT (k) OR is the oracle thresholdt OR . Then, to estimate ^ t OR , we replaceT (k) OR with b T (k) OR . 3.2.3 Muhammad Meets the Mountain (3M) and its theoretical properties This section describes the proposed testing procedure and provides its theoretical properties. Procedure 1 (3M Procedure). Rank hypotheses by increasing order of ^ T s;i OR and denote the sorted ranking statistics ^ T (S;1) OR ::: ^ T (S;m) OR . Letk = max n j : 1 j P j i=1 T (S;i) OR o : Then reject the corresponding ordered hypotheses,H (1) ;:::;H (k) . Next, we establish the theoretical properties of Procedure 1. Previously, Theorem 3.1.1 showed that the oracle procedure controls the FDR at level and has the largest power among all valid FDR procedures. Now, we present Theorem 3.2.1, which states that the data-driven procedure is valid by controlling the mFDR and FDR below level +o(1). Theorem 3.2.1 (Asymptotic validity of 3M). Suppose Conditions (C1) from Proposition 3.2.1 and the following conditions hold: (C2) ^ p ! (C3)Ek ^ f s f s k 2 =E h R f ^ f s (x)f s (x)g 2 dx i ! 0 Then both the mFDR and FDR of the 3M procedure are controlled at level +o(1). 74 The proof is provided in Section 3.5. Also note that (C2) holds in a wide class of mixture models as shown in Jin and Cai (2007) and Jin (2008). Additionally, Condition (C3) holds by our proposed estimator; see Fu et al. (2018) for a proof. 3.2.4 Extension to the case where i and i are correlated This idea is adapted from Cai et al. (2016), where more technical details can be found. The idea is to replace an estimate of, the overall proportion of non-nulls, by an estimate of the conditional proportion of non-nulls,() = P ( i = 1j) or() = 1P ( i = 0j). 
This estimate is given by the following nonparametric smoothing estimator: ^ () = 1 P i2T () K h ( i ) (1) P m i=1 K h ( i ) ; (3.15) where is a tuning parameter that represents a screening threshold,T is the set of i where the p-value of i exceeds, andK is a univariate smoothing kernel estimator with bandwidthh. Note that for this estimator, we need to assume i j = 0 follow a distribution for which we can calculate thep-value, but the marginal distribution of is general. 3.3. Simulation We consider simulations from the set-up of (3.1)-(3.2), where the densities for i g , and i g are varied across settings. The methods compared are the oracle version of 3M (OR), the data-driven 3M (3M), the Benjamini-Hochberg procedure (BH, Benjamini and Hochberg (1995)), and the adaptive z-value procedure (AZ, Sun and Cai (2007)). The desired FDR level is 0:1. For each setting, m = 5; 000 tests, and each simulation is also run over 200 repetitions. Then, for a set of decisions = ( 1 ;:::; m ),FDR() is estimated as the average false discovery proportion, FDP () = P m i=1 f(1 i ) i g=( P m i=1 i _ 1) and power is estimated as the average proportion of true positives, P m i=1 ( i i )=(mp) , over the number of repetitions. 75 3.3.1 Comparison for grouped hypotheses In the first set of simulations, we consider a two group model where i takes values inf 1 ; 2 g randomly with equal probability. The i come from two points: 0 with proportion and 1 with 1 proportion. We explore FDR and ETP under varying, 1 and 2 . For all settings, 1 = 1, and when 2 does not vary, it is set to a default of 3. Similarly, the default for is :1, and the default for 1 = 2:5. First, we consider the effects of varying the sparsity from:04 to:2 by:02. Second, we vary 1 from 2 to 3 by:2. Third, we vary 2 from 1:4 to 3:0 by:2. This gives a total of 24 simulation settings. For 3M, in the two-group setting,T i OR (x;) = (1)f (xj i =0) f (x) can be estimated as: ^ T i OR (x;) = (1 ^ ) (x) ^ f (x) ; where ^ is found using the estimator in Jin and Cai (2007), and ^ f (x) reduces to estimating the marginal densities for the two separate groups 1 and 2 . We estimate these by applying the conventional density estimator separately to samplesT j =fx i : i = j g, where j = 1 and 2. The R base package implements this density estimator. In these settings, we do not use the conservative procedure with the screening step. For the first setting with varied , Figure 3.3 shows that even over varied sparsity levels, all methods control FDR close to the nominal level. 3M is slightly elevated for very sparse, but it remains reasonably close to the target level. In terms of power, there is a substantial gap between OR and AZ, but almost no gap between OR and 3M. In this case, 3M exceeds OR performance because it has higher FDR, but the gains in power of 3M over AZ are not entirely due to this disparity, and we believe that the power of 3M should nevertheless be close to that of OR. Consider that for increasing, the FDR control of 3M draws closer to that of OR, and simultaneously, the power of 3M approaches the level of the OR from above. 76 Figure 3.3: Two groups with varying where 1 = 1 and 2 = 3. 0.05 0.10 0.15 0.20 0.05 0.10 0.15 0.20 FDR Comparison π FDR BH AZ OR 3M α 0.05 0.10 0.15 0.20 0.10 0.15 0.20 0.25 0.30 Power Comparison π Power For the second setting with varying 0 , Figure 3.4 shows that the FDR levels of AZ and 3M are both slightly above that of the OR, with 3M at the highest levels, but still close to the desired level. 
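As a concrete illustration of the two-group plug-in estimator described above, the following R sketch computes T-hat and applies the running-mean threshold of (3.14). It is a minimal sketch rather than the exact implementation used in these simulations: the data are freshly simulated, and pi.hat is a fixed placeholder standing in for the Jin and Cai (2007) estimator.

# Two-group plug-in statistic: T.hat = (1 - pi.hat) * dnorm(x, 0, sigma) / f.hat,
# with f.hat estimated separately within each sigma group via density().
set.seed(2)
m      <- 5000
sigma  <- sample(c(1, 3), m, replace = TRUE)
theta  <- rbinom(m, 1, 0.1)
x      <- rnorm(m, mean = 2.5 * theta, sd = sigma)
pi.hat <- 0.1                                   # placeholder for the Jin-Cai estimate

T.hat <- numeric(m)
for (s in unique(sigma)) {
  idx        <- which(sigma == s)
  dens       <- density(x[idx])                 # conventional kernel estimate for this group
  f.hat      <- approx(dens$x, dens$y, xout = x[idx], rule = 2)$y
  T.hat[idx] <- (1 - pi.hat) * dnorm(x[idx], 0, s) / f.hat
}
T.hat <- pmin(T.hat, 1)                         # the oracle statistic is a probability

alpha  <- 0.1
ord    <- order(T.hat)                          # step-up rule (3.14) on sorted statistics
k      <- max(c(0, which(cumsum(T.hat[ord]) / seq_len(m) <= alpha)))
reject <- logical(m); if (k > 0) reject[ord[seq_len(k)]] <- TRUE
sum(reject)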
In terms of power, we again see a large gap between the performance of OR and AZ, but 3M closely follows the power of OR for all values of μ_0.

Finally, for the third setting, Figure 3.5 shows how the different methods fare as σ_2 rises. The FDR patterns are similar to before. The power of all methods decreases, with large gaps between BH and AZ and between AZ and OR. However, 3M exhibits similar behavior to before and follows the power pattern of OR closely. Again, though 3M has a higher FDR than AZ, the difference in power cannot be entirely attributed to this fact. Notice that as σ_2 grows, the power of AZ drops off while that of OR remains high.

Figure 3.4: Two groups with varying μ_0, where σ_1 = 1, σ_2 = 3, and π = .2. Panels compare FDR and power for BH, AZ, OR, and 3M at target level α.

The trend of 3M closely following OR in power seems to indicate that 3M ranks hypotheses very accurately. We also note that while the FDR levels of 3M are not alarmingly high, for realistic settings we still prefer the more conservative estimator of Section 3.2.2, which we consider next.

3.3.2 Comparison in more general settings

We have assumed that the nonzero μ_i form a point mass and that the σ_i^2 fall into two groups. These are highly idealized settings. In the following settings, the σ_i are uniformly generated from U[.5, σ_max], where σ_max = 2 or 2.5, and the non-zero μ_i are generated from a normal mixture model g_μ = .5 N(1.5, .1^2) + .5 N(2, .1^2). Again, π is varied from .04 to .2 by .02, generating a total of 18 settings.

In this case, OR and 3M are modified. To compute OR when there are no simple analytic formulas for the marginal density f_σ(x), we perform numerical integration of the product of f(X | μ) and the true prior density g_μ(μ) on a fine grid of μ values. For 3M, the bandwidths h_x and h_σ are found using the hdrcde package in R. We also use the more conservative procedure described in Section 3.2.1 with the screening threshold set to .2.

Figure 3.5: Two groups with varying σ_2, where μ_0 = 2 and π = .2. Panels compare FDR and power for BH, AZ, OR, and 3M at target level α.

Figure 3.6 shows that for σ_max = 2, the FDR control for 3M is conservative, lower even than that of the BH procedure. Yet 3M has power competitive with that of AZ. For σ_max = 2.5, 3M again adopts a conservative FDR level for all π, even for the sparsest π, where AZ has an FDR around .14. Yet the power of 3M exceeds that of all other data-driven methods. These simulations demonstrate how 3M can achieve dramatic efficiency gains over other methods, and the stabilized version of 3M provides a practical method for conservative FDR control.

Figure 3.6: Mixture prior for μ with varying π, where σ_max = 2 (top) or σ_max = 2.5 (bottom). Panels compare FDR and power for BH, AZ, OR, and 3M.

3.4. Data Analysis

In this section, three testing procedures (AZ, BH, and 3M) are compared on three microarray data sets, and for each data set the number of genes selected by each method at different FDR levels is reported. The three data sets analyze different diseases: leukemia, lymphoma, and multiple myeloma.
All focus on two groups of patients, and the goal is to distinguish between the two types of patient conditions so that appropriate treatment can be administered. For each gene, we calculate the differential gene expression level by subtracting the average gene expression level of one group from the average gene expression level of the other. Standard deviations are calculated using the standard error of the difference (an illustrative sketch of this computation is given below, after the lymphoma results). The FDR levels considered are α = .01, .05, and .1.

Table 3.1: Comparison of multiple testing procedures: number of genes (% of total) selected as significantly different between ALL and AML patients.

α-level    3M             AZ             BH
.01        840 (11.8%)    908 (12.7%)    680 (9.5%)
.05        1497 (21%)     1732 (24.3%)   1235 (17.3%)
.1         2054 (28.8%)   2481 (34.8%)   1621 (22.7%)

The first data set measures differential gene expression levels for 7,129 genes between 47 acute lymphoblastic leukemia (ALL) and 25 acute myeloid leukemia (AML) patients (Golub et al. 1999). Table 3.1 shows that AZ selects the most genes at all α-levels, followed by 3M. This is consistent with Section 3.3.2, where AZ had the largest number of rejections but comparable power to 3M. Here we do not know the power of each method, but the top row of Figure 3.7 shows the different genes picked out by each method at varying levels of FDR control. The genes are plotted as grey (x_i, σ_i) points, and color-coded symbols are drawn over the points selected by each method, with red 3M diamonds overlaying black BH crosses overlaying AZ "x"s when all three methods reject the same hypotheses. Many of the genes selected by all methods fall in a "V" shape spreading out from 0, which is shown in the bottom row of Figure 3.7 by a side-by-side comparison of the three methods at α = .05. The "V" region represents a set of points where |Z| is roughly constant. However, at every level, there are also genes with slightly higher standard error selected by 3M but ignored by AZ. In contrast, the genes selected by AZ but ignored by 3M are more often genes in the main "V" region where all methods already select many genes.

Table 3.2: Comparison of multiple testing procedures: number of genes (% of total) selected as significantly different between DLBCL and FL patients.

α-level    3M             AZ             BH
.01        840 (18.9%)    1366 (19.2%)   895 (12.6%)
.05        2309 (32.4%)   2607 (36.6%)   1393 (19.5%)
.1         3366 (47.2%)   3965 (55.6%)   1766 (24.7%)

Figure 3.7: Differentially expressed genes between ALL and AML leukemia patients.

The second data set measures differential gene expression levels for 7,129 genes between 58 diffuse large B-cell lymphoma (DLBCL) and 19 follicular lymphoma (FL) patients (Shipp et al. 2002). Table 3.2 and the first row of Figure 3.8 show patterns similar to the leukemia analysis, with AZ selecting the most genes and 3M the second most. The second row of Figure 3.8 shows the plot at α = .05 with the methods separated, to better compare performance in regions where all methods have many rejections. Figure 3.8 again shows that all methods reject genes in the "V" shape of the plot, but two interesting patterns appear. In the upper plots, as α increases from .01 to .1, AZ adds hypotheses to its rejection set that were in the rejection set of 3M at a lower FDR level. This highlights the differences between their rankings of hypotheses. Secondly, at the least conservative testing level α = .1, AZ adds a risky gene with a large differential expression level but also a very high standard error that the other two methods do not.
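As referenced above, the per-gene summary statistics used in all three analyses are a difference of group means together with its standard error. The R sketch below is a hypothetical illustration: the matrix names exprA and exprB and the function name gene_stats are invented for the example, and the final comment indicates how the BH baseline could be obtained from the resulting z-values.

# Hypothetical per-gene computation of (x_i, s_i): exprA and exprB are
# genes-by-patients expression matrices for the two patient groups.
gene_stats <- function(exprA, exprB) {
  dbar <- rowMeans(exprA) - rowMeans(exprB)          # differential expression level
  se   <- sqrt(apply(exprA, 1, var) / ncol(exprA) +
               apply(exprB, 1, var) / ncol(exprB))   # standard error of the difference
  data.frame(x = dbar, s = se, z = dbar / se)
}
# For the BH comparison: p <- 2 * pnorm(-abs(stats$z)); p.adjust(p, "BH") <= alpha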
Finally, the third data set measures differential gene expression levels for 12; 625 genes for patients with multiple myeloma, 36 for whom magnetic resonance imaging (MRI) detected focal lesions of bone (lesions), and 137 for whom MRI scans could not detect focal lesions (without 82 Figure 3.8: Differentially expressed genes between DLBCL and FL lymphoma patients. Table 3.3: Comparison of multiple testing procedures: number of genes (% of total) selected as significantly different between multiple myeloma patients with and without lesions. -level 3M AZ BH :01 134 (1:1%) 111 (:9%) 65 (:5%) :05 362 (2:9%) 385 (3:1%) 210 (1:7%) :1 607 (4:8%) 724 (5:7%) 422 (3:3%) lesions) of bone (Tian et al. 2003). Table 3.3 indicates that for the strictest -level, 3M selects the most genes, followed by AZ. The pattern from the previous data analyses returns for higher . Figure 3.9 is again divided into an upper half where varies and a lower half that shows methods separately at =:05. The figure shows that many of the genes uniquely selected by 3M are extreme for their standard deviations and extremely under-expressed for patients with lesions compared to those without lesions. 3.5. Proofs This section proves the main theoretical results and technical lemmas of this chapter. 83 Figure 3.9: Differentially expressed genes between multiple myeloma patients with lesions and without lesions. 3.5.1 Proof of Theorem 3.1.1 The proof has two parts. In (a), we establish two properties of the testing rule that thresholds the oracle statistic at an arbitraryt,fI(T i OR < t) : 1 i mg. We show that it is valid, i.e., that it produces mFDR< t for allt and that its mFDR is monotonic int. In (b) we show that when the threshold ist OR , the testing rule, OR , exactly attains the mFDR level and is optimal amongst all valid testing procedures inD . Part(a). For the testing rulefI(T i OR < t) : 1 i mg, letQ OR (t) = t . We first show that t <t. SinceT i OR =P ( = 0jX =x i ; = i ), then E ( m X i (1 i ) i ) =E X; "( m X i E jX; (1 i ) i )# =E X; m X i T i OR i ! (3.16) 84 where notation E is the expected value taken over (X;;), notation E X; is the expectation taken over the joint distribution of (X;), andE jX; is the expectation taken over, holding the pair (X;) fixed. We use (3.16) in the definition ofQ OR (t) to get E X; ( m X i=1 (T i OR t )I(T i OR t) ) = 0: (3.17) The equality above implies that t < t. To see this, consider that all potentially non-zero terms arise whenT i OR t, and when this is the case, either (i) T i OR < t, (ii)T i OR < t, or (iii) T i OR <t. Notice (i) produces zero or positive terms on the LHS of (3.17), (ii) produces zero or negative terms, and (iii) produces negative terms. If t t, then only (iii) is possible, which contradicts the RHS. Thus, the testing rule is valid. Next, we show thatQ OR (t) is nondecreasing int. That is, lettingQ(t j ) = j , ift 1 < t 2 , then t 1 t 2 . We argue by contradiction. Suppose thatt 1 < t 2 but 1 > 2 . First, it cannot be that I(T i OR <t 2 ) = 0 for alli, because that implies 1 = 2 (both equal 0). 
Next, sincet 1 <t 2 , (T i OR 2 )I(T i OR <t 2 ) = (T i OR 2 )I(T i OR <t 1 ) + (T i OR 2 )I(t 1 T i OR <t 2 ) and rewrite (T i OR 2 )I(T i OR < t 1 ) = (T i OR 1 )I(T i OR < t 1 ) + ( 1 2 )I(T i OR < t 1 ): If 2 < 1 , then (T i OR 2 )I(T i OR <t 2 )(T i OR 1 )I(T i OR <t 1 ) + ( 1 2 )I(T i OR <t 1 ) (3.18) + (T i OR 1 )I(t 1 T i OR <t 2 ): It follows that E ( m X i=1 (T i OR 2 )I(T i OR <t 2 ) ) > 0: To see this, consider the expectation of the sum over m tests for the three RHS terms of (3.18), which we reference as (i), (ii), and (iii) respectively. First, (i) is zero because of 3.17. Then for eachT i OR <t 2 , either (ii) is positive because 2 < 1 , or (iii) is positive because 1 <t 1 . 85 However, (3.17) establishes thatEf P m i=1 (T i OR 2 )I(T i OR <t 2 )g = 0, leading to a contra- diction. Hence, 1 < 2 : Part(b). The oracle threshold is defined as t OR = sup t ft2 (0; 1) : Q OR (t) g. First, let = Q OR (1), which represents the largest mFDR level that the oracle 3M testing procedure can be. By part (a),Q OR (t OR ) is non-decreasing. Via the squeeze theorem, for all< , this implies thatQ OR (t OR ) =. Next, consider the power of OR =fI(T i OR <t OR ) : 1img compared to that of an arbi- trary decision rule = ( 1 ;:::; m ) such thatmFDR( ) . Using the previous result from part(b), it follows that E ( m X i=1 (T i OR ) i OR ) = 0 and E ( m X i=1 (T i OR ) i ) 0: Take the difference of the two expressions to obtain E ( m X i=1 ( i OR i )(T i OR ) ) 0: (3.19) Next apply a transformation f(x) = (x)=(1x) to each i OR . Note that because f 0 (x) = (1)=(1x) 2 > 0, f(x) is monotonically increasing. Then order is preserved: if T i OR < t OR then f(T i OR ) < f(t OR ) and likewise for T i OR > t OR . This means we can rewrite i OR = I [f(T i OR )=(1T i OR )g< OR ], where OR = (t OR )=(1t OR ). It will be useful to note that, from part (a), we have<t OR < 1, which implies that OR > 0. Then, E " m X i=1 ( i OR i ) (T i OR ) OR (1T i OR ) # 0: (3.20) To see this, consider that if i OR i 6= 0m, then either (i) i OR > i or (ii) i OR < i . If (i), then i OR = 1 and it follows thatf(T i OR )=(1T i OR )g < OR . If (ii), then i OR = 0 and f(T i OR )=(1T i OR )g OR . For both cases, ( i OR i )f(T i OR ) OR (1T i OR )g 0. Summing over allm terms and taking the expectation yields (3.20). 86 Combine (3.19) and (3.20) to obtain 0E ( m X i=1 ( i OR i )(T i OR ) ) OR E ( m X i=1 ( i OR i )(T i OR ) ) : Finally, since OR > 0, it follows thatEf P m i=1 ( i OR i )(T i OR )g> 0. After distributing the ( i OR i ) term and separating the expectations for the sums of the two decision rules, we apply the definition ofETP () =Ef P m i=1 i (T i OR )g to conclude thatETP ( OR )ETP ( ) for all 2D . 3.5.2 Proof of Proposition 3.2.1 Part(a). We only need to show thatf i (x i )f s i (x i ) fori2W. Note that ifi2W, then (1)f i (x i j i = 0) (1)f i (x i j i = 0) +f i (x i j i = 1) < 1: This implies that f i (x i j i = 0)<f i (x i j i = 1) (3.21) for alli2W. In the following, consider a generici2W, and dropi. LetA s =fy : 2(jyj)< sg. Then, f (xjA s ) = 0 s f (xj = 0) + 1 s f (xj = 1)<f (xj = 1) where j s = P ( = jjA s ) for j = 0; 1. The inequality comes from 0 s + 1 s = 1 and (3.21). Therefore, f s = (1)f (xj = 0) +f (xjA s )<f (xj = 1): Part(b). We prove that (i) FDR( s OR ) and (ii) mFDR( s OR ). For (i), the FDR of s OR can be computed as FDR( s OR ) =E x; E P i2R (1 i ) jRj_ 1 x; =E x; ( (jR_ 1j) 1 X i2R T i OR ; ) 87 wherejRj is the number of rejections, and the last equality follows from T i OR = E i f(1 i )jx i ; i g. 
Then, from (a) and Procedure 1, E x; ( (jR_ 1j) 1 X i2R T i OR ; ) E x; ( (jR_ 1j) 1 X i2R T s;i OR ; ) : For (ii), the result on the mFDR holds since mFDR( s OR ) = E P i2R (1 i ) E(jRj_ 1) andE P i2R (1 i ) =E P m i s;i OR (1 i ) E x; P m i T s;i OR s;i OR E x; (jRj_ 1). 3.5.3 Proof of Theorem 3.2.1 We begin with a summary of notation used throughout the proof: Q s (t) = m 1 P m i=1 T s;i OR IfT s;i OR < tg. This is a conservative estimate of the FDR for the oracle decision rulefI(T s;i OR <t) : 1img. b Q s (t) = m 1 P m i=1 b T s;i OR If b T s;i OR < tg. This is the conservative FDR estimate of the data- driven decision rulefI( b T s;i OR <t) : 1img. Q s 1 (t) = EfT s OR g is the limiting FDR level, where T s OR is a generic member from the samplefT s;i OR : 1img. t s 1 (t) = supft2 (0; 1) :Q s 1 (t)g is the “ideal” threshold. We first show that FDR( 3M ) = FDR( s OR ) +o(1) +o(1). Sincet is continuous but the ranking statisticsT s;i OR are discrete, forT s;(k) OR <t<T s;(k+1) OR , define a continuous version of b Q s (t) as b Q s C (t) = t b T s;(k) OR b T s;(k+1) OR b T s;(k) OR b Q s k + b T s;(k+1) OR t b T s;(k+1) OR b T s;(k) OR b Q s k+1 ; 88 where b Q s k = b Q s ( b T s;(k) OR ) is the estimated FDR whenk hypotheses are rejected. It is easy to verify that b Q s C (t) is continuous and monotone. Hence, its inverse b Q s;1 C is well-defined, continuous, and monotone. Next, we shall show the following two results in turn: (i) b Q s (t) p !Q s 1 (t) and (ii) b Q s;1 C () p ! t s 1 . We shall need the following two lemmas, which are proved after this proof. Lemma 3.5.1. E b T s OR T s OR 2 =o(1) and b T s OR p !T s OR . Lemma 3.5.2. LetU i =T s;i OR I(T s;i OR <t) and b U i = b T s;i OR If b T s;i OR <tg. ThenE b U i U i 2 =o(1). To show (i), note thatQ s (t) p ! Q s 1 (t) by the WLLN, so that we only need to establish that b Q s (t) p !Q s (t). LetS m = P m i=1 b U i U i . ThenE (S m ) =m n E b U i E (U i ) o . By Lemma 3.5.2 and the Cauchy-Schwartz inequality,E n b U i U i b U j U j o =o(1). It follows that E m 1 fS m E (S m )g =m 1 Var(S m )m 1 E b U i U i 2 +f1 +o(1)gE n b U i U i b U j U j o =o(1): Applying Chebyshev’s inequality, we obtainm 1 fS m E (S m )g = b Q s (t)Q s (t) p ! 0. Hence (i) is proved. Next, we show (ii). Since b Q s C (t) is continuous, for any " > 0, we can find > 0 such that b Q s;1 C () b Q s;1 C n b Q s C (t s 1 ) o <" if b Q s C (t s 1 ) <. It follows that P n b Q s C (t s 1 ) > o P n b Q s C (t s 1 ) > o : Lemma 3.5.1 and the WLLN imply that b Q s C (t) p !Q s 1 (t): Note thatQ s 1 (t s 1 ) =. Then, P b Q s C (t s 1 ) > ! 0: 89 By the Markov inequality, we have b Q s;1 C () p ! b Q s;1 C n b Q s C (t s 1 ) o =t s 1 ; (3.22) completing the proof of (ii). Finally, we can similarly define the continuous version ofQ s (t) asQ s C (t) and the corresponding threshold asQ s;1 C (). Then by construction, we have 3m = h I n b T s;i OR b Q s;1 C () o : 1im i and s OR = I T s;i OR Q s;1 C () : 1im : Also, following the previous arguments, we can show that Q s;1 C () p !t s 1 : (3.23) According to (3.22) and (3.23), we have b Q s;1 C () =Q s;1 C () +O p (1): (3.24) Note that the mFDR level of s OR is mFDR( s OR ) = P H 0 T s;i OR Q s;1 C () P T s;i OR Q s;1 C () Moreover, the operation of the testing procedure implies that Q s;1 C () . Note also that the mFDR level of 3m is mFDR( 3m ) = P H 0 n b T s;i OR b Q s;1 C () o P n b T s;i OR b Q s;1 C () o : From Lemma 3.5.1, b T s;i OR p ! 
T s;i OR , (3.23), and by the continuous mapping theorem, mFDR ( 3m ) = mFDR ( s OR ) +o(1). In Proposition 3.2.1, we showed mFDR ( s OR ) , and so the desired result follows. 90 For the result on FDR control, similar arguments as above apply. Also, an alternative proof comes from the asymptotic equivalence between FDR and mFDR as proved in Appendix D of Basu et al. (2017) under conditions which are satisfied by 3m . 3.5.4 Proof of Lemma 3.5.1 The first result can be proved by following similar arguments to Sun and Cai (2007) and is omitted The second result follows from the first and the fact that b T s OR is bounded by 1. 3.5.5 Proof of Lemma 3.5.2 Using the definitions of b U i andU i , we can show that b U i U i 2 = b T s;i OR T s;i OR 2 I b T s;i OR t;T s;i OR t + b T s;i OR 2 I b T s;i OR t;T s;i OR >t + T s;i OR 2 I b T s;i OR >t;T s;i OR >t : Let us refer to the three sums on the right hand asI,II, andIII respectively. By Lemma 3.5.1, I =o(1). Then let"> 0, and consider that P b T s;i OR t;T s;i OR >t P T s;i OR 2 (t;t +") +P T s;i OR T s;i OR >" : (3.25) The first term on the right hand is vanishingly small as" p ! 0 because b T s;i OR is a continuous random variable. The second term converges to 0 by Lemma 3.5.1. Noting that T s;i OR 1, we conclude II =o(1). In a similar fashion, we can show thatIII =o(1), thus proving the lemma. 91 Chapter 4 Future Directions This chapter presents preliminary work for two future topics that address large-scale inference using an EB framework. The first topic seeks to discover “important” cases rather than significant ones in multiple testing. The second topic probabilistically controls the false discovery proportion for any single experiment rather than controlling the average false discovery proportion over a series of repeated experiments as the FDR does. 4.1. Top-Percent Selection in Large-scale Inference with Het- eroscedastic Error This section revisits and unifies the ideas of Chapters 2 and 3 in a multiple-testing setting where estimating mean effects and accounting for heteroscedastic errors are crucial for the discovery of “important” hypotheses. The word “important” is necessarily vague, as what one views as impor- tant is contextual, but we give some examples in Section 4.1.1 of what may constitute important findings to different people. For a general sense of the problem, imagine that a group of researchers tests a thousand hypotheses and finds that one hundred of them are statistically significant. If they only have the financial, mental, or temporal budget to investigate ten of these significant hypothe- ses, which should they choose? We will see in Section 4.1.2 that whether they filter by lowest p-values or the largest (smallest) sufficient statistics, the result is unsatisfactory. In Section 4.1.3 we present two criterion for a satisfactory result. In Section 4.1.4 and 4.1.5, we indicate the direc- tion of the method we wish to develop in order to achieve these criterion. 92 4.1.1 Motivating Problems Three examples where we would want to select candidates based on a criterion that balances sig- nificance and magnitude of the sufficient statistic are the following: NBA basketball player recruiters try to find a limited number of promising players to add to their teams’ rosters within a recruiting budget. 
They may review the 3-point basket rates of basketball players to decide which players are best, but players with very high throw rates but very few games are risky bets while players with many games may not have outstanding performance or may be difficult to recruit due to competition from other recruiters. Education public-policy makers have a limited budget to distribute to public schools. For example, in public policy implemented during 2012-2013, the federal government required all states to produce annual yearly progress (AYP) reports, which determined which schools would receive government funding. The AYP study was conducted at thousands of pri- mary and secondary schools, and reported the contrasting success rates on standardized math exams of social-economically advantaged (SEA) and social-economically disadvan- taged (SED) students. Small schools could have large achievements between the scores of SEA and SED students that didn’t reflect school quality, while large schools might not have demonstrated such disparate scores but still appeared to have more significant achievement gaps. Yelp: The idea of partial ranking is also relevant to challenges heralded by the information revolution. Namely, the internet has made vast amounts of information so readily avail- able that it becomes necessary for developers to build tools to help web surfers navigate the information (example: Google’s search bar helps users locate the websites most relevant to their immediate needs). One website where partial ranking can be a tool to help people allocate the limited resource of their attention is Yelp. Yelp lists thousands of local restau- rants, but any given user only looks through a subset of the top-listed businesses. Currently, Yelp allows users to order restaurants by most rated, which puts restaurants with hundreds or 93 thousands of reviews at the top of the search, or by highest reviews. The highest review filter does not truly order businesses by rating, because it highly weights restaurants by number of reviews. As a result, both sorting algorithms provide viewers with more certainty about restaurant quality but may not provide many interesting results for people looking for new places to explore. Of course, a list sorted purely by rating would favor restaurants with very few reviews, which viewers may also want to disregard. However, a balance can be struck between high ratings and many reviews that better filters information for users. It would be helpful for Yelp visitors to have the option to threshold their restaurant lists based on, say, “4 stars or higher” where the resulting list would show both heavily reviewed restaurants that average 4 stars or more and less established restaurants that have unusually high ratings. 4.1.2 Selection and Comparison Issues We use the AYP score gap to develop our discussion of the selection issue. Letting i index the schools to rank, where i = 1;:::m = 7866 total schools, for school i, denote the math-score success rates asY i for the SEA group andY 0 i for the SED group. The number of students in the SEA isn i , and, in the SED groups,n 0 i . The common way in multiple testing of comparing these groups is to first obtain az-value for each school: z i = Y i Y 0 i p Y i (1Y i )=n i +Y 0 i (1Y 0 i )=n 0 i ; i = 1; ;m and then select largejz i j. One problem is that thez-values tend to rank the largest schools highest. 
In fact eight out of ten of the largestjz i j belong to schools that have sample sizes in the largest 1% of all schools (Sun and McLain 2012). These schools with largestjz i j are not always the most “interesting”. Small schools with largeY i Y 0 i are likely to hide in midranks, but these are of great interest. Also, for a large number of candidates, only a partial list may be manageable or relevant. Perhaps we limit our attention to the schools with an SEA-SED gap greater than 10% (or for the basketball players, to those who make at least 30% of their baskets), but in this case, we must then 94 pick a threshold, 0 . For the AYP study, this threshold represents the true performance gap which only 10% of all high schools exceed. However, producing the threshold 0 can be difficult. 4.1.3 Goals, Notation, and Proposal It is our goal to address the two challenges of making small, interesting z-values more prominent and subsetting the full list of candidates to a top percentile. We envision a ranking and selection statisticS, whereS produces a set of testing units for each top percentilep% that we specify. This statistic would even allow us to varyp% from 0% to 100% and produce a set of candidates at each percentile. Therefore, it is desirable to develop a procedure with the following two properties: (P1) Effective error rate control in the selection process: wrongly asserting that an effect size exceeds 0 when it actually does not should be infrequent. This is important to avoid unnec- essary losses in later actions. (P2) Sensible thresholding of the selected units: they should be selected so that both true effect sizes and statistical significances are respected. These properties direct us to optimize metrics we explain after introducing notation. Suppose X i , i = 1; ;m, are independent observations from a random mixture model with possibly heteroscedastic errors: X i = i + i ; i N(0; 2 i ); i = 1;:::;m (4.1) In a multiple comparison problem, such as the AYP example, i denotes the unknown differential of subjecti under two conditions A and B . Further, i come from an underlying distribution i F =pI 0 + (1p)G ( 0 ) 2 i F () (4.2) 95 where I 0 is the Dirac delta function at 0,p is the probability of being 0 , andG ( 0 ) is a contin- uous distribution function for nonzero. The variances 2 i are random variables from distribution F , whose values are assumed known, or it is possible to estimate them well. With respect to (P1) and (P2), the goal is to testm simultaneous hypothesis tests: H i;0 : i 0 vs. H i;1 : i > 0 ; i = 1; ;m: (4.3) We let the parameter i = I( i > 0 ) be an indicator function that gives the true state of theith item, while i 2f0; 1g is the decision we make about that item, with i = 1 being a decision in favor ofH i;1 to select the item and i = 0 being theH i;0 decision to leave the item out. Denote = ( 1 ; ; m ). A selection error occurs if we assert that an effect size exceeds 0 when it actually does not. By convention this incorrect choice is called a false positive decision. In large-scale selection problems, false positive decisions are inevitable if we wish to discover interesting effects with a reasonable power. Instead of aiming to avoid any false positives, a practical goal is to keep small the false selection (or discovery) rate (FDR), which is the expected proportion of false positives among all selections. 
A selection error occurs if we assert that an effect size exceeds μ_0 when it actually does not; by convention this incorrect choice is called a false positive decision. In large-scale selection problems, false positive decisions are inevitable if we wish to discover interesting effects with reasonable power. Instead of aiming to avoid any false positives, a practical goal is to keep small the false selection (or discovery) rate (FDR), which is the expected proportion of false positives among all selections. Controlling this expected proportion is of particular interest in large-scale studies and therefore accomplishes (P1). Accomplishing (P1) can also be done by controlling the marginal false discovery rate (mFDR), defined as
$$
\mathrm{mFDR} = \frac{E\left\{\sum_i (1 - \theta_i)\,\delta_i\right\}}{E\left\{\left(\sum_i \delta_i\right) \vee 1\right\}}, \tag{4.4}
$$
where the lower bound in the denominator makes the mFDR equal to zero when no items are selected (or discovered). Under certain conditions, the mFDR asymptotically equals the FDR.

In light of (P2), we also want to maximize correct selections. Correct selections are measured by the expected number of true positives (ETP), which characterizes the power of a multiple testing procedure:
$$
\mathrm{ETP} = E\left(\sum_i \theta_i\,\delta_i\right) = E\left\{\sum_i I(\mu_i > \mu_0)\,\delta_i\right\}. \tag{4.5}
$$
This criterion encourages the proper detection of μ_i surpassing the threshold μ_0, so that ETP selects item i, but it is indifferent to the magnitude of the exceedance, so ETP does not rank what it selects. If we modify ETP and use instead
$$
\mathrm{ETP}^* = E\left\{\sum_i (\mu_i - \mu_0)\,\delta_i\right\}, \tag{4.6}
$$
we obtain a selection criterion that depends on both effect size and significance.

Remark 4.1.1. Indeed, a more direct alteration of ETP is E{Σ_i θ_i (μ_i - μ_0) δ_i}; but under (4.6), when θ_i = 0, the term μ_i - μ_0 is negative, so selecting such an item decreases ETP*. Therefore, (4.6) accounts for θ_i without having to include it explicitly.

Combining the above ideas, the selection procedure
$$
\text{maximize } \mathrm{ETP}^* \text{ subject to } \mathrm{mFDR} \le \alpha \tag{4.7}
$$
accomplishes (P1) and (P2).

4.1.4 Oracle procedure

Let f_{0,σ_i} represent the density of X_i under the null hypothesis, namely f_{0,σ_i}(x_i) = \int_{-\infty}^{\mu_0} f_{\sigma_i}(x_i \mid \mu)\, g(\mu)\, d\mu, where f_{σ_i}(x_i | μ) is the conditional density of X_i given μ and σ_i, g(·) is the density of the μ_i, and f_{σ_i}(x_i) is the marginal density of X_i. To solve the optimization problem we propose the following algorithm.

Procedure 1.
1. Select and rank the strongest candidates using the Clfdr:
   (a) Calculate
   $$
   \mathrm{Clfdr}_i = P(\mu_i \le \mu_0 \mid x_i, \sigma_i) = \frac{(1 - p)\, f_{0,\sigma_i}(x_i)}{f_{\sigma_i}(x_i)} \quad \text{for } i = 1, \ldots, m. \tag{4.8}
   $$
   (b) If Clfdr_i < α, assign δ_i = 1 and rank these items at the top of the list from smallest Clfdr to largest.
2. For the remaining m_0 items, rank by the statistic T = (T_1, ..., T_{m_0}):
   (a) Calculate the ranking statistic
   $$
   T_i = \frac{E(\mu_i \mid x_i) - \mu_0}{\mathrm{Clfdr}_i} \quad \text{for } i = 1, \ldots, m_0. \tag{4.9}
   $$
   (b) Order the items from largest to smallest T_i and denote the sorted list T_{(1)}, ..., T_{(m_0)}.
3. Threshold the list using T and Clfdr:
   (a) With the m_0 items ranked by T, use the Clfdr to calculate a cut-off k* ∈ {1, ..., m_0},
   $$
   k^* = \sup\Big\{k : \tfrac{1}{k} \sum_{i=1}^{k} \mathrm{Clfdr}_{(i)} \le \alpha\Big\}, \tag{4.10}
   $$
   where Clfdr_{(i)} is the Clfdr of the item ranked ith by T, and let the estimated oracle threshold be
   $$
   \hat{t}_{\mathrm{OR}} = T_{(k^*)}. \tag{4.11}
   $$
   (b) On the ordered list, let δ = 1 for items 1, ..., k* and δ = 0 for items k* + 1, ..., m_0. Alternatively, on the unordered list, assign δ_i = I(T_i ≥ \hat{t}_{OR}) for i = 1, ..., m_0.

The algorithm's goal is to correctly choose members in the top Y% of a list. The idea is that for each X_i we calculate two statistics, (Clfdr_i, T_i). The Clfdr statistic depends on μ_0 and selects significant X_i, while the T statistic ranks them. Later we will use a modified ranking statistic R instead to simplify the algorithm.

For Step 1, notice in (4.8) that every ith test may have a different null density f_{0,σ_i}(·) and marginal density f_{σ_i}(·) from the other tests because of the heteroscedastic errors σ_i^2. Also keep in mind that, while the order of items within our selection does not matter, it is still particularly important to include items that more surely belong in the top Y%. This goal is reflected in how items with low Clfdr are the first to enter our list.
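The ranking and thresholding logic of Procedure 1 is straightforward to express in code once Clfdr_i and E(μ_i | x_i) are available; the sketch below assumes these two quantities are supplied (for example, computed from an assumed prior or estimated as in Section 4.1.5) and is only meant to illustrate Steps 1-3.

```python
import numpy as np

def oracle_selection(clfdr, post_mean, mu0, alpha):
    """Steps 1-3 of Procedure 1, taking Clfdr_i and E(mu_i | x_i) as inputs.

    Returns a boolean selection vector delta aligned with the input order.
    """
    clfdr = np.asarray(clfdr, float)
    post_mean = np.asarray(post_mean, float)
    m = clfdr.size
    delta = np.zeros(m, dtype=bool)

    # Step 1: items with Clfdr_i < alpha are selected outright.
    step1 = clfdr < alpha
    delta[step1] = True

    # Step 2: rank the remaining items by T_i = (E(mu_i|x_i) - mu0) / Clfdr_i.
    rest = np.where(~step1)[0]
    T = (post_mean[rest] - mu0) / clfdr[rest]
    order = rest[np.argsort(-T)]                  # largest T first

    # Step 3: largest k whose running mean of Clfdr stays at or below alpha.
    running_mean = np.cumsum(clfdr[order]) / np.arange(1, order.size + 1)
    ok = np.where(running_mean <= alpha)[0]
    if ok.size > 0:
        k_star = ok.max() + 1
        delta[order[:k_star]] = True
    return delta
```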
Observing that the ranking statistic T is easier to interpret if it is bounded, we propose ranking by a monotone transformation of T that is bounded in [-1, 1]:
$$
R_i = 2\left(\frac{e^{1/T_i}}{1 + e^{1/T_i}} - \frac{1}{2}\right). \tag{4.12}
$$
We order the R_i from smallest to largest. Using the statistic R simplifies the algorithm: since R automatically floats items with Clfdr_i < α to the top, Step 1 of the procedure can be eliminated, and all m cases can be treated together using Steps 2-3. Note that for any percentile, we can still show that there is a unique corresponding μ_0.

Also, addressing the matter of finding a threshold μ_0: the procedure proposes a random number of selections rather than a predetermined one. It is reasonable to believe that this more flexible framework is more efficient than fixed-quantile methods if our goal is to select an unknown number of "interesting" cases. In varying μ_0 to find the most appropriate threshold, the procedure may exhibit an interesting property: when the threshold is relaxed from, say, the top 1% to the top 50%, it is possible that new subjects entering at the larger threshold will rank higher than subjects that had already entered at the stricter threshold. Therefore, a subject's rank also depends on the percentile. With this procedure, one can see how the list changes with μ_0, much as the q-values in Storey (2007) produce a list of rejected hypotheses for each α-level. For our importance ranking procedure, there is a unique R-value (the R_{k*} corresponding to (4.11)) for each top-percentile ranked list.

Theoretical properties of the oracle procedure

Though this is still preliminary work, we conjecture that Procedure 1 achieves (4.7), and we give ideas for the proof as well as some intuitive arguments. The proof can be split into two parts: we must establish (a) that the mFDR and FDR are controlled at level α, and (b) that the ETP* is higher than that of any other procedure based on x. For (a): first, once mFDR control is shown, it is easy to extend the result to the FDR using a result from Basu et al. (2017); second, the testing units selected in Step 1(b) of Procedure 1 will not endanger the validity of the procedure, so we need only concern ourselves with the testing units where Clfdr_i ≥ α. We give a rough sketch of this argument in Section 4.1.6. For (b), it should be possible to follow arguments similar to those in Basu et al. (2017), because we make decisions in a similar manner, balancing potential gains (E(μ_i | x_i) - μ_0) against potential costs (Clfdr_i). Intuitively, a knapsack argument applies: if we wish to fill a knapsack of given capacity with tests so as to maximize the total value Σ_{i ∈ sack} {E(μ_i | x_i) - μ_0}, then it makes sense to add tests to the knapsack by density, that is, by the value-to-cost ratio, when, as here, we can exactly achieve the capacity by doing so. A formal proof that the procedure of Section 4.1.4 controls the Clfdr at exactly α and attains the largest ETP* would combine ideas from the optimality proofs in Basu et al. (2017) and Section 3.5.
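Before moving to estimation, a small sketch of the bounded transform (4.12) is given for reference; the exact published form of the formula could not be fully recovered from the scanned text, so the expression used in the code should be read as an assumption.

```python
import numpy as np

def bounded_rank_statistic(T):
    """Monotone map of T_i into (-1, 1), following the reconstruction of (4.12):
    R_i = 2 * (exp(1/T_i) / (1 + exp(1/T_i)) - 1/2), equivalently tanh(1/(2*T_i)),
    which is the numerically stable form used below."""
    T = np.asarray(T, float)
    return np.tanh(1.0 / (2.0 * T))

R = bounded_rank_statistic([0.5, 2.0, -1.0, 10.0])   # values lie strictly inside (-1, 1)
```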
4.1.5 Data-driven procedure

Three parameters must be estimated from data in order to use our procedure: \hat{μ}_0, an estimate of the desired quantile; \hat{μ}_i, an estimate of the mean of the ith item; and \widehat{Clfdr}_i, an estimate of P(μ_i ≤ μ_0 | X_i). We are encouraged by the fact that \hat{μ}_i can be estimated using NEST from Chapter 2. We also believe it is possible to use the method of Sun and McLain (2012) to obtain \widehat{Clfdr}_i, though we are not entirely satisfied with applying that work's estimator of f_0(x) to our setting and may develop a new method to replace it. It may further be possible to estimate the quantile \hat{μ}_0 using the quantiles of the data.

4.1.6 Proofs

4.1.7 Sketch of Proof (a): Validity of Procedure 1

By design of the procedure, we have
$$
\frac{1}{k^*}\sum_{i=1}^{k^*} \mathrm{Clfdr}_{(i)} \le \alpha,
$$
where the Clfdr_{(i)} are ordered from small to large. We drop the "( )" in the rest of the proof and assume the indices i are always ordered by Clfdr. Then we consider the mFDR. For the numerator, we use the rule of conditional expectation and condition on X:
$$
\frac{E_X E_{\theta \mid X}\left\{\sum_{i=1}^m (1-\theta_i)\,\delta_i\right\}}{E\left(\sum_{i=1}^m \delta_i\right)}
= \frac{E_X E_{\theta \mid X}\left\{\sum_{i=1}^{k^*} (1-\theta_i)\right\}}{E\left(\sum_{i=1}^m \delta_i\right)}.
$$
The equality follows from the fact that, given X, δ_i = 0 for (ordered) i > k*. Next, inside the inner expectation of the numerator, we divide and multiply by Σ_{i=1}^m δ_i. We use the fact that δ depends on θ only through X, so that, given X, δ is independent of θ and we can take Σ_{i=1}^m δ_i out of the inner expectation. For the denominator, we can likewise apply the expectation over just X:
$$
\frac{E_X E_{\theta\mid X}\left[\sum_{i=1}^m \delta_i \left\{\sum_{i=1}^{k^*}(1-\theta_i) \Big/ \sum_{i=1}^m \delta_i\right\}\right]}{E\left(\sum_{i=1}^m \delta_i\right)}
= \frac{E_X\left[\sum_{i=1}^m \delta_i \; E_{\theta\mid X}\left\{\sum_{i=1}^{k^*}(1-\theta_i) \Big/ \sum_{i=1}^m \delta_i\right\}\right]}{E_X\left(\sum_{i=1}^m \delta_i\right)}
= E_X\left[\frac{E_{\theta\mid X}\left\{\sum_{i=1}^{k^*}(1-\theta_i)\right\}}{\sum_{i=1}^m \delta_i}\right].
$$
Given X, Σ_{i=1}^m δ_i = k*, yielding
$$
E_X\left[\frac{E_{\theta\mid X}\left\{\sum_{i=1}^{k^*}(1-\theta_i)\right\}}{k^*}\right].
$$
Since E_{θ|X}(1 - θ_i) = Clfdr_i, the display above equals
$$
E_X\left(\frac{\sum_{i=1}^{k^*} \mathrm{Clfdr}_i}{k^*}\right) \le E_X(\alpha) = \alpha.
$$
We have shown that E{Σ_{i=1}^m (1-θ_i)δ_i} / E(Σ_{i=1}^m δ_i) ≤ α; therefore we have mFDR control at the α-level.

4.2 FDX

This section proposes a method to control the false discovery exceedance (FDX). The key characteristic of the FDX is that it controls the proportion of false positives (FDP) through a probability bound, not an expectation. This section is based on ongoing work with Pallavi Basu and Wenguang Sun. Section 4.2.1 begins with our motivation for using the FDX criterion and its relationship to the FDR and FWER criteria. Sections 4.2.2 and 4.2.3 formally set up the problem and provide a procedure to control the FDX, while Section 4.2.5 reviews some comparable methods. Section 4.2.6 provides some preliminary simulations and data analysis. Finally, Section 4.2.7 contains proofs about the optimality of the proposed FDX method.

4.2.1 Motivation

In multiple testing, scientific findings usually come from experiments that have been performed only a few times or even just once, yet researchers often use the FDR, a Type I error control based on averages, in their testing procedures. The hope is that good FDR control will yield reproducible findings. In this section, we discuss shortcomings of the FDR with respect to this goal, how the FDX can remedy the issues, and how the FDX compares with other well-known Type I error criteria.

The FDR guarantees control of the expected value of the FDP, but this does not control individual FDPs. For example, even if we perfectly control the FDR at 5%, this allows five out of a hundred repeated experiments to produce FDPs of 100% as long as the other ninety-five produce FDPs of 0%. This is a highly dramatized example, but in practice, popular FDR procedures such as the Benjamini-Hochberg procedure (BH) (Benjamini and Hochberg 1995) do display high variability and skewness in the FDP across implementations. It is therefore desirable to control the FDP for each experiment so that the results from a single operation of a testing procedure are reliable.
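To illustrate the point about FDP variability (rather than to reproduce any result in this thesis), the following sketch repeatedly applies BH to simulated data and records the realized FDP of each run; the simulation settings are arbitrary.

```python
import numpy as np
from scipy.stats import norm

def bh_reject(pvals, alpha):
    """Benjamini-Hochberg step-up: returns a boolean rejection vector."""
    m = pvals.size
    order = np.argsort(pvals)
    thresh = alpha * np.arange(1, m + 1) / m
    below = np.where(pvals[order] <= thresh)[0]
    k = below.max() + 1 if below.size else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

rng = np.random.default_rng(0)
fdps = []
for _ in range(200):                       # 200 repeated "experiments"
    theta = rng.random(1000) < 0.05        # sparse signals
    z = np.where(theta, rng.normal(2.5, 1, 1000), rng.normal(0, 1, 1000))
    p = norm.sf(z)                         # one-sided p-values
    rej = bh_reject(p, alpha=0.10)
    fdps.append((rej & ~theta).sum() / max(rej.sum(), 1))
# np.mean(fdps) sits near the nominal level, but np.std(fdps) and the upper
# quantiles of fdps show how variable the realized FDP can be from run to run.
```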
Due to the popularity of the FDR, there is relatively little research that focuses on other kinds of control. Yet, from a reproducibility standpoint, the FDX is very relevant because it does not rely on averages as the FDR does. In contrast to the FDR, the FDX keeps a low probability, α, on the event that the FDP exceeds an acceptable proportion, γ. By keeping a low probability on undesirably high levels of FDP each time the testing procedure is carried out, rather than keeping the average FDP low over many hypothetical experiments to which the same procedure is applied, FDX procedures reduce the variability of the FDP's distribution and increase the reproducibility of scientific discoveries. The FDX is most similar to the k-FWER (itself an extension of the FWER), except that it controls the probability of an undesirable event defined not by a number of rejections k but by a proportion of rejections γ. This allows FDX methods to scale up with the number of true rejections, as FDR methods do.

4.2.2 Problem formulation

In this section, we formally describe the testing set-up and the FDX criterion. Suppose we are given observations Z = (Z_i : i = 1, ..., m). We want to test the hypotheses H = (H_i^0, H_i^1), where H_i^0: μ_i = 0 and H_i^1: μ_i ≠ 0, with μ_i some location parameter. Under the random mixture model framework,
$$
Z_i \mid \theta_i, \mu_i \sim (1 - \theta_i)\, f_{0i} + \theta_i\, f_{1i}(\mu_i), \qquad P(\theta_i = 1) = \pi,
$$
where π is the proportion of non-null hypotheses; I(·) denotes an indicator function; θ_i denotes the true state of nature, with θ_i = 0 representing the null hypothesis and θ_i = 1 the alternative; f_{0i} is the density of Z_i under the null hypothesis; and f_{1i} is the density of Z_i under the alternative hypothesis. We must make a decision δ_i ∈ {0, 1} indicating our belief about θ_i. Ultimately, we are interested in generating δ = (δ_1, ..., δ_m) such that the ETP is maximized subject to FDX control. Our definition of ETP in this context is the expected number of correctly rejected hypotheses,
$$
\mathrm{ETP} := E\Big[\sum_i \{I(\theta_i = 1)\,\delta_i\}\Big],
$$
and FDX control requires
$$
\mathrm{FDX} := P(\mathrm{FDP} > \gamma) \le \alpha
$$
for given (γ, α) ∈ (0, 1)², where γ represents a tolerance level on the FDP and α a low probability. Here
$$
\mathrm{FDP} := \frac{\sum_i \{I(\theta_i = 0)\,\delta_i\}}{\left(\sum_i \delta_i\right) \vee 1},
$$
where (Σ_i δ_i) ∨ 1 denotes the maximum of Σ_i δ_i and 1.

4.2.3 Oracle and data-driven procedures

This section first proposes an oracle solution to control the (γ, α)-level FDX and then provides a data-driven approximation to the oracle solution that can be used in practice.

Exact Solution

For FDX control, we are interested in calculating P(FDP > γ), which is equivalent to P(|false| > γk), where k is the number of rejections made and |false| is the number of those rejections that are actually null. This probability can be found using the Poisson binomial distribution, which generalizes the binomial distribution to the case where each trial has a different success probability p_i. Denote the Poisson binomial distribution by PBD(k, p), where k is the total number of trials and p = (p_i : i = 1, ..., k). Next, to find the p_i, consider the following local false discovery rate (lfdr) statistic:
$$
T^{\mathrm{OR}}_i := P(\theta_i = 0 \mid z_i) = \frac{(1 - \pi)\, f_{0i}(z_i)}{f_i(z_i)}, \tag{4.13}
$$
where π is the proportion of non-nulls and f_i = (1 - π) f_{0i} + π f_{1i} is the marginal density of Z_i. Then the following gives the oracle procedure to control the (γ, α)-level FDX.

Procedure 1.
1. Consider the lfdr test statistics T^{OR}_i as in (4.13) for i = 1, ..., m, and denote the ranked statistics T^{OR}_{(1)}, ..., T^{OR}_{(m)}.
2. Calculate P_{|data}(|false| > γk) = P{PBD(k, p^{(k)}) > γk}, where p^{(k)} = (T^{OR}_{(1)}, ..., T^{OR}_{(k)}), and reject up to K := max{k : P_{|data}(|false| > γk) ≤ α}.
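The only nontrivial computation in Step 2 is the Poisson binomial tail probability. A minimal sketch is given below, with the PBD probability mass function built by direct convolution and the lfdr values assumed to be supplied; the brute-force loop recomputes the PMF for every k and is meant only to illustrate the definition, not to be efficient.

```python
import numpy as np

def poisson_binomial_pmf(p):
    """PMF of the number of successes among independent Bernoulli(p_i) trials,
    computed by sequential convolution."""
    pmf = np.array([1.0])
    for pi in p:
        pmf = np.convolve(pmf, [1.0 - pi, pi])
    return pmf                                      # length len(p) + 1

def oracle_fdx_rejections(lfdr, gamma, alpha):
    """Step 2 of the oracle FDX procedure: the largest k such that
    P(|false| > gamma*k) <= alpha when the k smallest lfdr values are rejected."""
    t = np.sort(np.asarray(lfdr, float))
    K = 0
    for k in range(1, t.size + 1):
        pmf = poisson_binomial_pmf(t[:k])
        tail = pmf[int(np.floor(gamma * k)) + 1:].sum()   # P(|false| > gamma*k)
        if tail <= alpha:
            K = k
    return K                                        # reject the K smallest lfdr values
```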
Implementable Solution

For implementation, only T^{OR}_i, i = 1, ..., m, needs to be estimated with a data-driven T_i. Since this estimation can be done via Sun and Cai (2007), the main focus of this section is on optional modifications to the procedure that reduce its computational time. We present the computationally fast version of Procedure 1 and then discuss the data-driven procedure based on it.

We modify Procedure 1 with two changes: first, we consider rejecting only the T^{OR}_{(i)} where i is less than an upper bound K_1 on the expected number of rejections; second, we find a conservative cut-off point among the first 1, ..., K_1 ordered tests. In the first step, the screening criterion relies on three facts: one, we require P_{|data}(|false| > γk) ≤ α; two, the upper bound on |false| is the total number of rejections k; and three, when |false| satisfies the tolerance level |false| ≤ γk, the largest number of false rejections is γk. This gives
$$
E_{\mid \text{data}}(|\text{false}|) \le k\{\gamma + (1 - \gamma)\alpha\},
$$
where γ + (1 - γ)α is a conservatively large bound because the upper bound k is used for the case where an undesirably large number of false rejections is made and the upper bound γk is used for the case where an acceptable number of false rejections is made.

In the second step, we approximate the Poisson binomial probability with a computationally simpler binomial right tail (BRT) probability, where
$$
\mathrm{BRT}(k) := P\{\mathrm{Bin}(k, T_{(k)}) > \gamma k\}.
$$
This is a conservative estimate of the tail probability. Note that it is possible to use the first step without the second step for a less conservative approximation that still gains speed over Procedure 1. Including both steps, however, the faster procedure is as follows:

Procedure 2.
1. Consider the lfdr test statistics T^{OR}_i as in (4.13) for i = 1, ..., m, and denote the ranked statistics T^{OR}_{(1)}, ..., T^{OR}_{(m)}.
2. First reject up to K_1 := max[k ∈ {1, ..., m} : E_{|data}(|false|) ≤ k{γ + (1 - γ)α}].
3. Finally reject only up to K_2 := max{k ∈ [K_1] : BRT(k) ≤ α}.

Finally, to perform the data-driven method, replace T^{OR}_i with T_i as already discussed; to estimate E_{|data}(|false|)/k, use standard methods for approximating the marginal false discovery rate (mFDR) in the literature (for example, Sun and Cai (2007)).

4.2.4 Theoretical properties and ideas for establishing them

This section describes the theoretical progress of this work and gives a sense of the road map for establishing properties of Procedures 1 and 2. Generally, we wish to show two properties for the oracle procedure: it is valid, i.e., the procedure controls the FDX at level (γ, α), and optimal, i.e., it maximizes the ETP amongst procedures that control the FDX at (γ, α). For the oracle procedure, the proof of validity comes directly from the Poisson binomial distribution and the definition of T^{OR}_i, so it is trivial. For optimality, Section 4.2.7 gives a fairly complete proof that an optimal decision rule must rank the X_i by increasing values of T^{OR}_i. Intuitively, the best such rule is then the one that thresholds the FDX for a given γ at exactly α.

For the data-driven procedure, we begin by establishing that the oracle version of Procedure 2 is valid, which, similar to the proof for Procedure 1, is trivial. We hope to then show that the data-driven method has FDX that is asymptotically controlled at α, and that its power converges to that of the oracle method.
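A minimal sketch of Procedure 2 from Section 4.2.3, assuming lfdr values (oracle or data-driven) are supplied; the running sum of the smallest lfdr values plays the role of E_{|data}(|false|), and scipy's binomial survival function gives the BRT probability. The screening bound k{γ + (1 - γ)α} is the reconstruction described above.

```python
import numpy as np
from scipy.stats import binom

def fast_fdx_rejections(lfdr, gamma, alpha):
    """Procedure 2: screen by the expected number of false rejections,
    then apply the binomial right-tail (BRT) cut-off."""
    t = np.sort(np.asarray(lfdr, float))
    m = t.size
    k = np.arange(1, m + 1)

    # Step 2: E_{|data}(|false|) equals the cumulative sum of the smallest lfdrs.
    efalse = np.cumsum(t)
    ok1 = np.where(efalse <= k * (gamma + (1 - gamma) * alpha))[0]
    if ok1.size == 0:
        return 0
    K1 = ok1.max() + 1

    # Step 3: BRT(k) = P{Bin(k, T_(k)) > gamma*k}, a conservative tail bound.
    kk = np.arange(1, K1 + 1)
    brt = binom.sf(np.floor(gamma * kk), kk, t[:K1])
    ok2 = np.where(brt <= alpha)[0]
    return int(ok2.max() + 1) if ok2.size else 0     # K2: reject this many smallest lfdrs
```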
4.2.5 Connections to other methods

Lehmann and Romano (2005b)

Lehmann and Romano (2005b) offer the following step-down method. Sort the p-values p_{(1)} ≤ p_{(2)} ≤ ... ≤ p_{(m)} (define p_{(0)} := 0), then reject the corresponding hypotheses 1, ..., k up to
$$
\max\left\{k : p_{(k)} \le \alpha_k / C_{\lfloor \gamma m \rfloor + 1}\right\},
$$
where ⌊x⌋ is the largest integer less than or equal to x, C_n = Σ_{i=1}^n 1/i, and α_k = {(⌊γk⌋ + 1)α} / (m + ⌊γk⌋ + 1 - k). This version of the algorithm controls the false discovery exceedance at the α-level for any set of p-values.

Roquain and Villers (2011) and Chi and Tan (2008)

The following step-up oracle method has been proposed in Roquain and Villers (2011). The general idea is to sort the p-values p_{(1)} ≤ p_{(2)} ≤ ... ≤ p_{(m)} (define p_{(0)} := 0) and find \hat{k} := max{k ∈ {0, 1, ..., m} | p_{(k)} ≤ t_k}. Let G(t) := π_0 F_0(t) + (1 - π_0) F_1(t) be the mixture distribution of the p-values, define t_{m+1} := 1, and then for 1 ≤ k ≤ m find
$$
t_k := \max\left\{t \in [0, t_{k+1}] \;\middle|\; P[X_k > \gamma k] \le \alpha\right\}, \qquad \text{where } X_k \sim \mathrm{Bin}\{k, \pi_0 t / G(t)\}.
$$
The expression F_0(t) = t uses the fact that p-values are uniformly distributed under the null. This is the oracle case; in the data-driven procedure the quantities π_0 and G(t) have to be estimated. However, the authors did not propose a method to estimate these unknown quantities and perhaps assume π_0 ≈ 1. They refer to Chi and Tan (2008) (CT) and use a procedure with
$$
t^{\mathrm{CT}}_k := \max\left\{t \in [0, t^{\mathrm{CT}}_{k+1}] \;\middle|\; P[X_k > \gamma k] \le \alpha\right\}, \qquad \text{where } X_k \sim \mathrm{Bin}(k, 1 \wedge mt/k), \tag{4.14}
$$
with the convention that t^{CT}_{m+1} := 1. We compare our proposed data-driven method with that of CT.

4.2.6 Numerical results

In this section, the proposed method (Procedure 2) and the other methods are compared in several simulations and a real-data application.

Simulations

In all settings, we simulate z_i | θ_i ~ind (1 - θ_i) N(0, 1) + θ_i N(μ, 1) for i = 1, ..., m = 5,000, where θ_i ~iid Bernoulli(p). As defaults, we set μ = 1.9, p = 0.2, γ = 0.2, and α = 0.05. Different settings demonstrate the impact of varying p, m, μ, γ, or α. The total number of repetitions is 200 for each setting. The reported metrics are the proportion of repetitions in which the FDP exceeds γ, and the expected number of true positives, calculated as the average number of true positives over the repetitions. We compare the proposed method to the methods described in Section 4.2.5: Lehmann and Romano (2005b) (LR05) and Roquain and Villers (2011) (RV11).

Simulation setting I: varying the proportion of true signals

Figure 4.1 shows a simulation for p = 0.1 to p = 0.9 in increments of 0.1, with the other parameters set to their default values. In the extremely sparse case, the proposed method comes closest to the nominal α-level. When p = 0.2, the proposed method has an ETP of about 264, the ETP of RV11 is about 237, and that of LR05 is about 2. As the proportion of true signals increases, all methods become very conservative, but the proposed method shows large power gains over LR05 and an increasingly advantageous performance relative to RV11.

[Figure 4.1: Simulation results varying p, for the data-driven methods LR05 (green), RV11 (blue), and proposed (red). Panel (a): Proportion(FDP > 0.2), target 0.05, against p. Panel (b): ETP against p.]

[Figure 4.2: Simulation results varying m, for the data-driven methods LR05 (green), RV11 (blue), and proposed (red). Panel (a): Proportion(FDP > 0.2), target 0.05, against m. Panel (b): ETP against m.]

Simulation setting II: varying the number of tests

Figure 4.2 shows a simulation for m = 1,000 to m = 7,000 in intervals of 1,000. It demonstrates the power gains of the proposed method over LR05 and RV11. In particular, the proposed method has relatively greater efficiency gains as the number of tests grows.
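The two-group simulation design above is straightforward to reproduce in outline; the sketch below generates one data set under the default setting (μ = 1.9, p = 0.2, m = 5,000) and computes the oracle lfdr statistic (4.13), which has a closed form for this model.

```python
import numpy as np
from scipy.stats import norm

def simulate_two_group(m=5000, p=0.2, mu=1.9, seed=0):
    """One draw from z_i ~ (1 - theta_i) N(0,1) + theta_i N(mu,1), theta_i ~ Bern(p)."""
    rng = np.random.default_rng(seed)
    theta = rng.random(m) < p
    z = rng.standard_normal(m) + mu * theta
    return z, theta

def oracle_lfdr(z, p=0.2, mu=1.9):
    """T_i^OR = (1 - p) f0(z_i) / f(z_i) for the two-group normal mixture,
    where p here is the non-null proportion of the simulation."""
    f0 = norm.pdf(z)
    f1 = norm.pdf(z, loc=mu)
    return (1 - p) * f0 / ((1 - p) * f0 + p * f1)

z, theta = simulate_two_group()
lfdr = oracle_lfdr(z)
# lfdr can be passed to the FDX procedures sketched earlier in this section.
```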
[Figure 4.3: Simulation results varying μ, for the data-driven methods LR05 (green), RV11 (blue), and proposed (red). Panel (a): Proportion(FDP > 0.2), target 0.05, against μ. Panel (b): ETP against μ.]

[Figure 4.4: Simulation results varying γ, for the data-driven methods LR05 (green), RV11 (blue), and proposed (red). Panel (a): Proportion(FDP > 0.2), target 0.05, against γ. Panel (b): ETP against γ.]

Simulation settings III & IV

Simulation settings III and IV respectively vary the underlying mean and the upper bound on the false discovery proportion, and they portray similar patterns. Figure 4.3 shows a simulation for μ = 1.5 to μ = 2.1 in increments of 0.1, while Figure 4.4 shows a simulation for γ = 0.01, 0.05, 0.1, 0.2, 0.3, and 0.4. Both demonstrate the gain in power over LR05 and competitiveness with RV11.

[Figure 4.5: Simulation results varying α, for the data-driven methods LR05 (green), RV11 (blue), and proposed (red). Panel (a): Proportion(FDP > 0.2), with the target varying with α, against α. Panel (b): ETP against α.]

Simulation setting V: varying the false discovery exceedance control

Figure 4.5 shows a simulation for α = 0.01 to α = 0.1 in increments of 0.01. It shows the advantages of the proposed method and RV11 over LR05. While RV11 has the greatest power for very low probability bounds on the FDX, the proposed method has an advantage once α crosses approximately 0.03.

Application

We compare the same methods from the simulations on real microarray data. The data set was used in Farcomeni (2009) to benchmark FDX methods and is available in the R package "rda". It contains 2000 gene expression levels from each of 40 colon cancer patients and 22 healthy patients. With the goal of selecting the genes that differ between cancer and healthy patients, we generate two-sample t-statistics between the two patient groups. For α = 0.1 and γ evaluated at 0.01 and from 0.05 to 1 in increments of 0.05, Figure 4.6 shows the difference between the proposed method and the other methods. The methods contrast in two ways. First, the lfdr values and the p-values suggest vastly different probability distributions amongst the 2000 genes tested. Second, because the p-values are all quite high, the numbers of genes selected by LR05 and RV11 are very low, whereas, because the lfdrs of the most significant 518 genes are essentially zero, the proposed method selects many more genes.

[Figure 4.6: Results for the colon cancer study. Panel (a): overlaid histograms of the lfdr values (dark gray) and p-values (light gray). Panel (b): number of genes selected against γ, for LR05 (green), RV11 (blue), and the proposed data-driven method (red).]

4.2.7 Proofs

Proofs of main results

Suppose there exists an "optimal" decision rule for the problem described in Section 4.2.2. For almost all X, it must rank rejections by increasing values of the lfdr.

Rough proof. The idea is to provide a general alteration scheme of the decision rule, however obtained, which is "better" than the existing one. Fix m, the number of tests or decisions to be made.
Suppose not: then for some set of X with P(X) > 0, the "optimal" decision rule must have indices j, ℓ such that lfdr_j < lfdr_ℓ but δ_j = 0 and δ_ℓ = 1. Note that (j, ℓ) is random and is a function of X. Condition on all such X, denote the "optimal" decision rule by δ, and define δ^new by δ^new_j = 1, δ^new_ℓ = 0, and δ^new_i = δ_i otherwise. Conditional on X, δ^new is more powerful, that is, ETP(δ^new) > ETP(δ). On the other set, that is, wherever δ is already ranked by lfdr, we do not change δ. So overall δ^new is more powerful. Moreover,
$$
P^{\mathrm{new}}_{\mid \text{data}}(|\text{false}| > \gamma k_0) \le P_{\mid \text{data}}(|\text{false}| > \gamma k_0). \tag{4.15}
$$
Note that again, on the other set we do not change the rule; so overall, if δ is valid then so is δ^new.

Now to show (4.15). Under P_{|data}, we have |false| ∼ PBD({lfdr_i's, lfdr_ℓ}, k_0), where PBD denotes a Poisson binomial random variable. Similarly, under P^new_{|data}, we have |false| ∼ PBD({lfdr_i's, lfdr_j}, k_0). Recall that lfdr_j < lfdr_ℓ. To simplify notation, we show the following: for random variables Y ∼ PBD({p_1, p_2, ..., p_{k_0}}, k_0) and Y^new ∼ PBD({p_1^new, p_2, ..., p_{k_0}}, k_0) with p_1^new < p_1, we have P(Y^new > γk_0) ≤ P(Y > γk_0). We compute by the recursive formula P(Y > γk_0) = p_1 P(\bar{Y} > γk_0 - 1) + (1 - p_1) P(\bar{Y} > γk_0), where \bar{Y} ∼ PBD({p_2, ..., p_{k_0}}, k_0 - 1). Then
$$
\begin{aligned}
P(Y > \gamma k_0) &= P(\bar{Y} > \gamma k_0) + p_1\left\{P(\bar{Y} > \gamma k_0 - 1) - P(\bar{Y} > \gamma k_0)\right\} \\
&= P(\bar{Y} > \gamma k_0) + p_1\, P(\bar{Y} = \lceil \gamma k_0 - 1 \rceil) \\
&\ge P(\bar{Y} > \gamma k_0) + p_1^{\mathrm{new}}\, P(\bar{Y} = \lceil \gamma k_0 - 1 \rceil) \\
&= P(Y^{\mathrm{new}} > \gamma k_0),
\end{aligned}
$$
where ⌈x⌉ refers to the nearest integer strictly larger than x.

Bibliography

F. Abramovich, Y. Benjamini, D. L. Donoho, and I. M. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist., 34:584-653, 2006. ISSN 0090-5364.

R. F. Barber and A. Ramdas. The p-filter: multilayer false discovery rate control for grouped hypotheses. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):1247-1268, 2017. doi: 10.1111/rssb.12218. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12218.

P. Basu, T. T. Cai, K. Das, and W. Sun. Weighted false discovery rate control in large-scale multiple testing. Journal of the American Statistical Association, 2017. doi: 10.1080/01621459.2017.1336443. URL https://doi.org/10.1080/01621459.2017.1336443.

Y. Benjamini and M. Bogomolov. Selective inference on multiple families of hypotheses. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):297-318, 2014.

Y. Benjamini and R. Heller. False discovery rates for spatial signals. J. Amer. Statist. Assoc., 102:1272-1281, 2007. ISSN 0162-1459.

Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. B, 57:289-300, 1995. ISSN 0035-9246.

Y. Benjamini and Y. Hochberg. Multiple hypotheses testing with weights. Scandinavian Journal of Statistics, 24:407-418, 1997.

Y. Benjamini and Y. Hochberg. On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics, 25:60-83, 2000.

Y. Benjamini and D. Yekutieli. False discovery rate-adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association, 100(469):71-81, 2011. doi: 10.1198/016214504000001907. URL https://doi.org/10.1198/016214504000001907.

J. O. Berger. Admissible minimax estimation of a multivariate normal mean with arbitrary quadratic loss. Ann. Statist., 4(1):223-226, 1976. doi: 10.1214/aos/1176343356. URL http://dx.doi.org/10.1214/aos/1176343356.

R. Berk, L. Brown, A. Buja, K.
Zhang, and L. Zhao. Valid post-selection inference. Ann. Statist., 41 (2):802–837, 04 2013. doi: 10.1214/12-AOS1077. URLhttps://doi.org/10.1214/ 12-AOS1077. J. Bohannon. I fooled millions into thinking chocolate helps weight loss. here’s how., 2015. L. D. Brown. Admissible estimators, recurrent diffusions, and insoluble boundary value problems. The Annals of Mathematical Statistics, 42(3):855–903, 1971. L. D. Brown and E. Greenshtein. Nonparametric empirical Bayes and compound decision approaches to estimation of a high-dimensional vector of normal means. The Annals of Statis- tics, 37:1685–1704, 2009. S. J. Brown, W. Goetzmann, R. G. Ibbotson, and S. A. Ross. Survivorship bias in performance studies. The Review of Financial Studies, 5(4):553–580, 1992. doi: 10.1093/rfs/5.4.553. URL +http://dx.doi.org/10.1093/rfs/5.4.553. T. Cai and W. Sun. Optimal screening and discovery of sparse signals with applications to mul- tistage high throughput studies. Journal of the Royal Statistical Society: Series B (Statisti- cal Methodology), 79(1):197–223, 2017. ISSN 1467-9868. doi: 10.1111/rssb.12171. URL http://dx.doi.org/10.1111/rssb.12171. T. T. Cai and W. Sun. Simultaneous testing of grouped hypotheses: Finding needles in multiple haystacks. J. Amer. Statist. Assoc., 104:1467–1481, 2009. T. T. Cai, W. Sun, and W. Wang. Cars: covariate assisted ranking and screening for large-scale two-sample inference. Technical Report, 2016. I. Castillo and A. van der Vaart. Needles and straw in a haystack: Posterior concentration for pos- sibly sparse sequences. Ann. Statist., 40(4):2069–2101, 08 2012. doi: 10.1214/12-AOS1029. URLhttp://dx.doi.org/10.1214/12-AOS1029. Z. Chi and Z. Tan. Positive false discovery proportions: Intrinsic bounds and adaptive control. Statistica Sinica, 18(3):837–860, 2008. ISSN 10170405, 19968507. URL http://www. 115 jstor.org/stable/24308519. S. Chiaretti, X. Li, R. Gentleman, A. Vitale, M. Vignetti, F. Mandelli, J. Ritz, and R. Foa. Gene expression profile of adult t-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood, 103(7):2771–2778, 4 2004. ISSN 0006-4971. doi: 10.1182/blood-2003-09-3243. N. K. Cobb, M. A. Jacobs, J. Saul, E. P. Wileyto, and A. L. Graham. Diffusion of an evidence- based smoking cessation intervention through facebook: a randomised controlled trial study protocol. BMJ Open, 4(1), 2014. ISSN 2044-6055. doi: 10.1136/bmjopen-2013-004089. URLhttp://bmjopen.bmj.com/content/4/1/e004089. N. K. Cobb, M. A. Jacobs, P. Wileyto, T. Valente, and A. L. Graham. Diffusion of an evidence- based smoking cessation intervention through facebook: A randomized controlled trial. Amer- ican Journal of Public Health, 106(6):1130–1135, 2016. doi: 10.2105/AJPH.2016.303106. URLhttp://dx.doi.org/10.2105/AJPH.2016.303106. PMID: 27077358. R. A. Colombo and D. G. Morrison. Blacklisting social science departments with poor ph.d. submission rates. Management Science, 34(6):696–706, 1988. doi: 10.1287/mnsc.34.6.696. URLhttps://doi.org/10.1287/mnsc.34.6.696. G. Csardi and T. Nepusz. The igraph software package for complex network research. InterJour- nal, Complex Systems:1695, 2006. URLhttp://igraph.org. D. Donoho and J. Jin. Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist., 32:962–994, 2004. ISSN 0090-5364. D. L. Donoho and J. M. Jonhstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81 (3):425, 1994. doi: 10.1093/biomet/81.3.425. URL+http://dx.doi.org/10.1093/ biomet/81.3.425. S. 
Dudoit, J. P. Shaffer, and J. C. Boldrick. Multiple hypothesis testing in microarray experiments. Statist. Sci., 18(1):71–103, 02 2003. doi: 10.1214/ss/1056397487. URL https://doi. org/10.1214/ss/1056397487. B. Efron. Size, power and false discovery rates. Ann. Statist., 35:1351–1377, 2007. ISSN 0090- 5364. 116 B. Efron. Simultaneous inference: When should hypothesis testing problems be combined? Ann. Appl. Stat., 2:197–223, 2008a. B. Efron. Microarrays, empirical Bayes and the two-groups model. Statist. Sci., 23:1–22, 2008b. ISSN 0883-4237. B. Efron. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Predic- tion. Institute of Mathematical Statistics Monographs. Cambridge University Press, 2010. doi: 10.1017/CBO9780511761362. B. Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011. B. Efron and C. N. Morris. Data analysis using stein’s estimator and its generalizations. Journal of the American Statistical Association, 70(350):311–319, 1975. B. Efron, R. Tibshirani, J. D. Storey, and V . Tusher. Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc., 96:1151–1160, 2001. ISSN 0162-1459. S. Erickson and C. Sabatti. Empirical Bayes estimation of a sparse vector of gene expression change. Statistical applications in genetics and molecular biology, 4(1):1132, 2005. A. Farcomeni. Generalized augmentation to control the false discovery exceedance in multi- ple testing. Scandinavian Journal of Statistics, 36(3):501–517, 2009. ISSN 1467-9469. doi: 10.1111/j.1467-9469.2008.00633.x. URL http://dx.doi.org/10.1111/j. 1467-9469.2008.00633.x. E. Ferkingstad, A. Frigessi, H. Rue, G. Thorleifsson, and A. Kong. Unsupervised empirical Bayesian multiple testing with external covariates. The Annals of Applied Statistics, 2:714– 735, 2008. K. Fortney, E. Dobriban, P. Garagnani, C. Pirazzini, D. Monti, D. Mari, G. Atzmon, N. Barzilai, C. Franceschi, A. B. Owen, and S. K. Kim. Genome-wide scan informed by age-related dis- ease identifies loci for exceptional human longevity. PLOS Genetics, 11(12):1–23, 12 2015. doi: 10.1371/journal.pgen.1005728. URL https://doi.org/10.1371/journal. pgen.1005728. 117 L. Fu, G. James, and W. Sun. Nonparametric empirical bayes estimation on heterogeneous data. Available athttp://www-bcf.usc.edu/ ˜ wenguans/Papers/NEST.html, 2018. URLhttp://www-bcf.usc.edu/ ˜ wenguans/Papers/NEST.html. C. Genovese and L. Wasserman. Operating characteristics and extensions of the false discovery rate procedure. J. R. Stat. Soc. B, 64:499–517, 2002. ISSN 1369-7412. C. R. Genovese, N. A. Lazar, and T. Nichols. Thresholding of statistical maps in functional neu- roimaging using the false discovery rate. Neuroimage, 15(4):870–878, 2002. C. R. Genovese, K. Roeder, and L. Wasserman. False discovery control with p-value weighting. Biometrika, 93(3):509–524, 2006. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular clas- sification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999. ISSN 0036-8075. doi: 10.1126/science.286.5439.531. URLhttp://science.sciencemag.org/content/286/5439/531. J. D. Habiger. Adaptive false discovery rate control for heterogeneous data. Statistica Sinica, 27 (4):1731–1756, 2017. C. R. Harvey and Y . Liu. Backtesting. The Journal of Portfolio Management, 42(1):13–28, 2015. ISSN 0095-4918. 
doi: 10.3905/jpm.2015.42.1.013. URL http://jpm.iijournals. com/content/42/1/13. L. He, S. K. Sarkar, and Z. Zhao. Capturing the severity of type ii errors in high-dimensional multiple testing. Journal of Multivariate Analysis, 142:106 – 116, 2015. ISSN 0047-259X. doi: https://doi.org/10.1016/j.jmva.2015.08.005. URLhttp://www.sciencedirect. com/science/article/pii/S0047259X1500189X. M. L. Head, L. Holman, R. Lanfear, A. T. Kahn, and M. D. Jennions. The extent and consequences of p-hacking in science. PLOS Biology, 13(3):1–15, 03 2015. doi: 10.1371/journal.pbio. 1002106. URLhttps://doi.org/10.1371/journal.pbio.1002106. R. Heller and D. Yekutieli. Replicability analysis for genome-wide association studies. Ann. Appl. Stat., 8(1):481–498, 03 2014. doi: 10.1214/13-AOAS697. URLhttps://doi.org/10. 118 1214/13-AOAS697. N. C. Henderson and M. A. Newton. Making the cut: improved ranking and selection for large- scale inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78:1467–9868, 2016. Y . Hochberg. A sharper bonferroni procedure for multiple tests of significance. Biometrika, 75(4): 800–802, 1988. S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian journal of statis- tics, pages 65–70, 1979. G. Hommel. A stagewise rejective multiple test procedure based on a modified bonferroni test. Biometrika, 75(2):383–386, 1988. J. X. Hu, H. Zhao, and H. H. Zhou. False discovery rate control with groups. Journal of the American Statistical Association, 105:1215–1227, 2012. R. J. Hyndman, D. M. Bashtannyk, and G. K. Grunwald. Estimating and visualizing conditional densities. Journal of Computational and Graphical Statistics, 5(4):315–336, 1996. ISSN 10618600. URLhttp://www.jstor.org/stable/1390887. W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the The- ory of Statistics, pages 361–379, Berkeley, Calif., 1961. University of California Press. URL http://projecteuclid.org:443/euclid.bsmsp/1200512173. W. Jiang and C.-H. Zhang. General maximum likelihood empirical bayes estimation of normal means. Ann. Statist., 37(4):1647–1684, 08 2009. doi: 10.1214/08-AOS638. URL http: //dx.doi.org/10.1214/08-AOS638. J. Jin. Proportion of non-zero normal means: universal oracle equivalences and uniformly consis- tent estimators. J. Roy. Statist. Soc. B, 70:461–493, 2008. J. Jin and T. T. Cai. Estimating the null and the proportional of nonnull effects in large-scale multiple comparisons. J. Amer. Statist. Assoc., 102:495–506, 2007. ISSN 0162-1459. B.-Y . Jing, Z. Li, G. Pan, and W. Zhou. On sure-type double shrinkage estimation. Journal of the American Statistical Association, 111(516):1696–1704, 2016. 119 I. M. Johnstone. Gaussian estimation: Sequence and wavelet models. Draft version, 2015. I. M. Johnstone and B. W. Silverman. Needles and straw in haystacks: empirical Bayes estimates to possibly sparse sequences. Annals of Statistics, 32(4):1594–1649, 2004. R. Koenker and I. Mizera. Convex optimization, shape constraints, compound decisions, and empirical Bayes rules. Journal of the American Statistical Association, 109(506):674–685, 2014. S. C. Kou and J. J. Yang. Optimal shrinkage estimation in heteroscedastic hierarchical linear mod- els. Big and Complex Data Analysis: Methodologies and Applications (Springer International Publishing), pages 249–284, 2017. J. D. Lee, D. L. Sun, Y . Sun, and J. E. Taylor. 
Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016. E. Lehmann and J. P. Romano. Generalizations of the familywise error rate. The Annals of Statis- tics, 33(3):1138–1154, 2005a. E. L. Lehmann and J. P. Romano. Generalizations of the familywise error rate. Ann. Statist., 33 (3):1138–1154, 06 2005b. doi: 10.1214/009053605000000084. URL http://dx.doi. org/10.1214/009053605000000084. L. Lei and W. Fithian. Adapt: an interactive procedure for multiple testing with side information. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 0(0), 2018. doi: 10.1111/rssb.12274. URLhttps://rss.onlinelibrary.wiley.com/doi/abs/ 10.1111/rssb.12274. A. Li and R. F. Barber. Accumulation tests for fdr control in ordered hypothesis testing. Journal of the American Statistical Association, 112(518):837–849, 2017. doi: 10.1080/01621459. 2016.1180989. URLhttps://doi.org/10.1080/01621459.2016.1180989. A. Li and R. Foygel Barber. Multiple testing with the structure adaptive Benjamini- algorithm. ArXiv e-prints, jun 2016. M. I. McCarthy, G. R. Abecasis, L. R. Cardon, D. B. Goldstein, J. Little, J. P. A. IIoannidis, and J. N. Hirschhorn. Genome-wide association studies for complex traits: Consensus, uncer- tainty and challenges. Nature Reviews Genetics, 9(356):287–298, 2008. 120 N. Meinshausen and J. Rice. Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. Ann. Statist., 34:373–393, 2006. C. Miller, C. Genovese, R. Nichol, L. Wasserman, A. Connolly, D. Reichart, D. Hopkins, J. Schnei- der, and A. Moore. Controlling the false-discovery rate in astrophysical data analysis. Astro- nomical Journal, 122:3492–3505, 2001. H. Robbins. Asymptotically subminimax solutions of compound statistical decision problems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 1950, pages 131–148, Berkeley and Los Angeles, 1951. University of California Press. H. Robbins. An empirical Bayes approach to statistics. Proc. Third Berkeley Symp. on Math. Statistic. and Prob., 1:157–163, 1956. H. Robbins. The empirical Bayes approach to statistical decision problems. Annals of Mathemat- ical Statistics., 35:1–20, 1964. K. Roeder and L. Wasserman. Genome-wide significance levels and weighted hypothesis testing. Statistical science: a review journal of the Institute of Mathematical Statistics, 24(4):398, 2009. D. Rogosa. Accuracy of api index and school base report elements: 2003 academic performance index, california department of education. Report, 2003. J. P. Romano and A. M. Shaikh. Stepup procedures for control of generalizations of the familywise error rate. Ann. Statist., 34(4):1850–1873, 2006. ISSN 0090-5364. E. Roquain and M. A. Van De Wiel. Optimal weighting for false discovery rate control. Electronic journal of statistics, 3:678–711, 2009. E. Roquain and F. Villers. Exact calculations for false discovery proportion with application to least favorable configurations. Ann. Statist., 39(1):584–612, 02 2011. doi: 10.1214/10-AOS847. URLhttp://dx.doi.org/10.1214/10-AOS847. S. K. Sarkar. Stepup procedures controlling generalized FWER and generalized FDR. Ann. Statist., 35(6):2405–2420, 2007. ISSN 0090-5364. A. Schwartzman, R. F. Dougherty, and J. E. Taylor. False discovery rate analysis of brain diffusion direction maps. Ann. Appl. Stat., 2(1):153–175, 2008. ISSN 1932-6157. 121 M. A. Shipp, K. N. Ross, P. Tamayo, A. P. Weng, J. L. Kutok, R. C. Aguiar, M. 
Gaasenbeek, M. Angelo, M. Reich, G. S. Pinkus, T. S. Ray, M. A. Koval, K. W. Last, A. Norton, T. A. Lister, J. Mesirov, D. S. Neuberg, E. S. Lander, J. C. Aster, and T. R. Golub. Diffuse large b-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8:68–74, 2002. B. W. Silverman. Density estimation for statistics and data analysis, volume 26. CRC press, 1986. J. Simes. An improved bonferroni procedure for multiple tests of significance. Biometrika, 73: 751–754, 12 1986. C. M. Stein. Estimation of the mean of a multivariate normal distribution. Ann. Statist., 9(6): 1135–1151, 11 1981. doi: 10.1214/aos/1176345632. URL http://dx.doi.org/10. 1214/aos/1176345632. J. D. Storey. A direct approach to false discovery rates. J. Roy. Statist. Soc. B, 64:479–498, 2002. ISSN 1369-7412. J. D. Storey. The optimal discovery procedure: a new approach to simultaneous significance testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(3): 347–368, 2007. J. D. Storey, J. E. Taylor, and D. Siegmund. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J. Roy. Statist. Soc. B, 66(1):187–205, 2004. W. Sun and T. T. Cai. Oracle and adaptive compound decision rules for false discovery rate control. J. Amer. Statist. Assoc., 102:901–912, 2007. ISSN 0162-1459. W. Sun and A. C. McLain. Multiple testing of composite null hypotheses in heteroscedastic mod- els. Journal of the American Statistical Association, 107(498):673–687, 2012. W. Sun and Z. Wei. Large-scale multiple testing for pattern identification, with applications to time-course microarray experiments. J. Amer. Statist. Assoc., 106:73–88, 2011. W. Sun and Z. Wei. Hierarchical recognition of sparse patterns in large-scale simultaneous infer- ence. Biometrika, 102:267–280, 2015. 122 Z. Tan. Improved minimax estimation of a multivariate normal mean under heteroscedasticity. Bernoulli, 21:574–603, 2015. E. Tian, F. Zhan, R. Walker, E. Rasmussen, Y . Ma, B. Barlogie, and J. D. Shaughnessy. The role of the wnt-signaling antagonist dkk1 in the development of osteolytic lesions in mul- tiple myeloma. New England Journal of Medicine, 349(26):2483–2494, 2003. doi: 10. 1056/NEJMoa030847. URL https://doi.org/10.1056/NEJMoa030847. PMID: 14695408. V . G. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U. S. A., 98(9):5116–5121, April 2001. M. J. van der Laan, S. Dudoit, and K. S. Pollard. Augmentation procedures for control of the generalized family-wise error rate and tail probabilities for the proportion of false positives. Statistical applications in genetics and molecular biology, 3(1), 2004. M. P. Wand and M. C. Jones. Kernel Smoothing, volume 60 of Chapman and Hall CRC Mono- graphs on Statistics and Applied Probability. Chapman and Hall CRC, 1994. A. Weinstein, W. Fithian, and Y . Benjamini. Selection adjusted confidence intervals with more power to determine the sign. Journal of the American Statistical Association, 108(501):165– 176, 2013. doi: 10.1080/01621459.2012.737740. URLhttps://doi.org/10.1080/ 01621459.2012.737740. A. Weinstein, Z. Ma, L. D. Brown, and C.-H. Zhang. Group-linear empirical bayes estimates for a heteroscedastic normal mean. Journal of the American Statistical Association, 0(0):1– 13, 2018. doi: 10.1080/01621459.2017.1280406. URLhttps://doi.org/10.1080/ 01621459.2017.1280406. X. 
Xie, S. Kou, and L. D. Brown. Sure estimates for a heteroscedastic hierarchical model. Journal of the American Statistical Association, 107(500):1465–1479, 2012. D. Yekutieli. Hierarchical false discovery rate–controlling methodology. J. Amer. Statist. Assoc., 103(481):309–316, 2008. X. Zhang and A. Bhattacharya. Empirical Bayes, sure, and sparse normal mean models. Preprint, 2017. 123 Z. Zhao, L. De Stefani, E. Zgraggen, C. Binnig, E. Upfal, and T. Kraska. Controlling false discoveries during interactive data exploration. In Proceedings of the 2017 ACM Inter- national Conference on Management of Data, SIGMOD ’17, pages 527–540, New York, NY , USA, 2017. ACM. ISBN 978-1-4503-4197-4. doi: 10.1145/3035918.3064019. URL http://doi.acm.org/10.1145/3035918.3064019. 124
Abstract
This thesis addresses inferential phenomena that arise from large-scale data and develops nonparametric Empirical Bayes (EB) techniques that exploit data for improved inference in estimation and testing. In particular, we consider four settings, unified by the goal of maximizing “good” decisions subject to a control of “poor” decisions, in which EB can be successfully implemented for large-scale problems. We will see that both estimation of parameters and hypothesis testing play a role. ❧ The simultaneous estimation of a vector of parameters ηᵢ, based on a corresponding set of observations xᵢ, for i = 1,...,n, is a key research problem that has resurfaced in the high-dimensional setting due to selection bias concerns and which we study in order to incorporate parameter estimates into multiple testing procedures. In the typical estimation framework, the classic example involves estimating a vector of normal means μᵢ subject to a fixed variance term σ², and the goal is to pool information in a way that corrects for extreme mean estimates that occur by chance. However, many practical situations involve heterogeneous data (xᵢ,ξᵢ) where ξᵢ is a known nuisance parameter. Effectively pooling information across samples while correctly accounting for heterogeneity presents a significant challenge in large-scale estimation problems. We address this issue by introducing the “Nonparametric Empirical Bayes Smoothing Tweedie” (NEST) estimator, which efficiently estimates ηᵢ without making any parametric assumptions on its distribution and properly adjusts for heterogeneity via a generalized version of Tweedie's formula. The estimation framework is simple but general enough to accommodate any member of the exponential family of distributions
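Although the abstract only sketches NEST, its reference to Tweedie's formula can be made concrete with a minimal homoscedastic-normal illustration: for X_i ~ N(μ_i, σ²), E(μ_i | x_i) = x_i + σ² d/dx log f(x_i), with the marginal density f estimated from the data. The code below is only an illustration of that classical formula under a simple kernel estimate, not the NEST estimator itself.

```python
import numpy as np
from scipy.stats import gaussian_kde

def tweedie_normal(x, sigma=1.0):
    """Tweedie's formula for X_i ~ N(mu_i, sigma^2):
    E(mu_i | x_i) = x_i + sigma^2 * d/dx log f(x_i),
    with f estimated by a Gaussian kernel and the derivative taken numerically."""
    x = np.asarray(x, float)
    kde = gaussian_kde(x)
    eps = 1e-3
    score = (np.log(kde(x + eps)) - np.log(kde(x - eps))) / (2 * eps)
    return x + sigma**2 * score

rng = np.random.default_rng(0)
mu = rng.normal(0.0, 1.0, 1000)
x = mu + rng.normal(0.0, 1.0, 1000)      # sigma = 1, so the Bayes rule is roughly x/2
est = tweedie_normal(x, sigma=1.0)       # extreme x are shrunk toward the bulk of the data
```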