High Dimensional Estimation and Inference with Side Information

by

Jiajun Luo

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Applied Mathematics)

August 2021

Copyright 2021 Jiajun Luo

Dedication

To my parents, who have always been my unwavering support through my adventures in life.

Acknowledgements

First and foremost, I would like to express my sincere gratitude to my Ph.D. advisor and mentor, Dr. Wenguang Sun, for his support, kindness and inspiring discussions on this research. Without his guidance, invaluable suggestions on my dissertation's framework and precious instruction, it would have been impossible for me to complete this dissertation. I owe great appreciation to Dr. Stanislav Minsker and Dr. Remigijus Mikulevicius for serving on my committee and providing feedback on this dissertation. Special thanks to Stanislav for his excellent lectures on statistics and machine learning.

I would like to thank all my co-authors: Dr. Gourab Mukherjee and Dr. Yunjin Choi. I would not have been able to complete the dissertation without their help. Special thanks to Gourab for his helpful discussion and advice on the theory of multivariate normal means problems. I would also like to thank all the faculty members and students in the math department for their consistent help, which has made my entire Ph.D. life more interesting and enjoyable.

Last but not least, I would like to thank my parents for their continual love and unhesitating support. I owe all my accomplishments to them.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Nonparametric Empirical Bayes Estimation with Auxiliary Sequences
    1.1.1 Compound rules, structural knowledge and side information
    1.1.2 Nonparametric Integrative Tweedie
    1.1.3 Main ideas of our approach
    1.1.4 Comparison with existing works and our contributions
  1.2 Sparse Estimation with Spatial Data
    1.2.1 High dimensional sparse estimation
    1.2.2 Spatial information
    1.2.3 Our contributions and a toy example
  1.3 Summary of Chapters
Chapter 2: Improved Nonparametric Empirical Bayes Estimation By Auxiliary Sequences
  2.1 Methodology
    2.1.1 Nonparametric integrative Tweedie
    2.1.2 Nonparametric estimation via convex programming
    2.1.3 Computational details
  2.2 Theory
    2.2.1 Compound estimation and kernelized Stein's discrepancy
    2.2.2 Theoretical properties of KSD
    2.2.3 The inversion problem
    2.2.4 Theoretical properties of $\hat h_n$ and the data-driven NIT
    2.2.5 Benefits and caveats in exploiting auxiliary data
  2.3 Simulation
    2.3.1 Simulation 1: integrative estimation with one auxiliary sequence
    2.3.2 Simulation 2: integrative estimation in two-sample inference of sparse means
    2.3.3 Simulation 3: integrative estimation with multiple auxiliary sequences
  2.4 Applications
    2.4.1 Integrative non-parametric estimation of gene expressions
    2.4.2 Leveraging auxiliary information in predicting monthly sales
  2.5 Proofs
    2.5.1 Proof of Lemma 2.1
    2.5.2 Proof of Proposition 2.2
    2.5.3 Proof of Theorem 2.1
    2.5.4 Proof of Theorem 2.2
    2.5.5 Proof of Theorem 2.3
    2.5.6 Proof of Theorem 2.4
    2.5.7 Proof of Theorem 2.5
    2.5.8 Proof of Proposition 2.3
    2.5.9 Proof of Proposition 2.4
  2.6 Supplemental Materials
    2.6.1 Further details on the gene expression estimation example
    2.6.2 Further details on the monthly sales prediction example
Chapter 3: Spatially Adaptive FDR Thresholding for Sparse Estimation
  3.1 Methodology
    3.1.1 Model and problem setup
    3.1.2 Structured FDR analysis
    3.1.3 SAFT procedure
    3.1.4 Properties of the local sparsity estimator
  3.2 Theory
    3.2.1 Oracle procedure
    3.2.2 Benefits of local sparsity structure in exploiting spatial data
    3.2.3 Data-driven procedure
  3.3 Simulation
    3.3.1 Estimating the conditional proportions
    3.3.2 The block-wise 1D setting with piece-wise constants
    3.3.3 The block-wise 1D setting with trapezoid patterns
    3.3.4 Tuning the FDR level via MCV
    3.3.5 Simulation in the 2D setting
  3.4 Application on Real Data
  3.5 Proofs
    3.5.1 Proof of Proposition 3.1
    3.5.2 Proof of Theorem 3.1
    3.5.3 Proof of Proposition 3.4
    3.5.4 Proof of Theorem 3.2
    3.5.5 Properties of the mean detection function

List of Tables

2.1 % gain in prediction errors by different estimators over the naive unshrunken estimator of gene expressions of INFA-regulated infected cells.
2.2 Average % gains over the naive unshrunken estimator for monthly beer sales prediction.
2.3 Correlation matrix of the monthly sales of different products.
2.4 Monthwise % gains for monthly beer sales prediction over the naive unshrunken estimator.
3.1 The first row and the second row respectively present the MSEs and the ratios of MSEs over the unweighted method (BH) for each method. The smallest value is underlined. Tr.: the original values of the training set are used as estimators. ABDJ: the hard-thresholding method proposed by Abramovich et al. (2006); BH: the BH-based (unweighted) method; SAFT: the SAFT-weighted method.

List of Figures

1.1 A multiple testing example with 2D data. The heat map of the parameters is given in Figure 1.1a. For the estimation problem, the mean square error of SAFT is 0.233 while the mean square error of the BH procedure is 0.401, which illustrates the effectiveness of the local sparsity structure in estimation. The heat maps of the SureBlock estimator, the BH procedure and our SAFT estimator are shown in Figures 1.1b, 1.1c and 1.1d.
2.1 The inversion problem.
2.2 The performance of the oracle NIT procedure and the data-driven NIT procedure with different numbers of auxiliary sequences.
2.3 Percentage of efficiency gain.
2.4 Simulation results for one given auxiliary sequence.
2.5 Two-sample inference of sparse means.
2.6 Integrative estimation with multiple auxiliary sequences.
2.7 Panel A: Heatmaps of the gene expression datasets showing the four expression vectors corresponding to the observed, validation and auxiliary sequences. Panel B: scatterplot of the effect size estimates of 3000 genes based on Tweedie and NIT (using both S^U and S^I). The magnitude of the auxiliary variables used in the NIT estimate is reflected by different colors.
2.8 Distribution of monthly sales of beer across stores (left) and the pairwise distribution of joint sales of different products in the month of July (right).
2.9 Top panel: Scatterplot and names of genes where the Tweedie and Integrated Tweedie effect-size estimates disagreed by more than 50%. The other panels show the different molecular function types and biological processes that are impacted by these genes.
2.10 Distribution of the 866 stores across different states in the USA.
2.11 Scatterplot of the logarithm of beer demand estimates in the month of August. The magnitudes of the corresponding auxiliary variables used in the IT estimate are reflected in the different colors.
3.1 Example 1 for the efficiency gain of the SAFT procedure.
3.2 Example 2 for the efficiency gain of the SAFT procedure.
3.3 2D triangular, rectangle and circular patterns.
3.4 MSE comparisons: the linear block pattern.
3.5 MSE comparisons: the trapezoid block pattern.
3.6 Comparison between the MCV criterion and the MSE.
3.7 2D triangular, rectangle and circular patterns.
3.8 MSE comparisons: 2D rectangle and triangle patterns.
3.9 MSE comparisons: 2D circular pattern.
3.10 Heatmaps of the log-scale daily average number of shared bike rides in Seoul in April 2019.
3.11 Histograms of the differences between log-scaled daily average observations over weekends and weekdays.
3.12 Heatmap of the estimated conditional probabilities.

Abstract

High dimensional estimation problems have a wide range of applications and are extensively studied in the literature. Recent advances in technology make it possible to collect vast amounts of data. Side information can come in the form of auxiliary sequences, extracted from a secondary dataset or from the same dataset using carefully constructed statistics, or in the form of dependence structure, such as spatial information. Incorporating and leveraging this side information in testing, estimation and inference can improve the performance of statistical procedures. This dissertation aims to effectively integrate side information into estimation and inference procedures.

This dissertation is divided into three chapters. Chapter 1 introduces the problems we will study. In Chapter 2, we consider the problem of nonparametric empirical Bayes estimation with auxiliary data. We present a new framework to extract and transfer structural knowledge encoded in auxiliary data to assist the simultaneous estimation of thousands or even millions of parameters in the target domain. We devise a class of nonparametric integrative Tweedie (NIT) estimators for empirical Bayes data-sharing shrinkage estimation. When applied in conjunction with reproducing kernels and convex optimization techniques, NIT provides superior and robust performance and scales well with a growing number of parameters. The new estimation framework is capable of handling multiple and possibly correlated auxiliary sequences and is flexible enough to incorporate various structural constraints into the data-driven decision rule. We develop theory to establish the convergence rates for the risk of the data-driven NIT. The theory provides important insights on the benefits and caveats of utilizing multivariate auxiliary data.
Numerical studies show that our approach achieves substantial gains in empirical performance over existing methods in many settings.

Chapter 3 studies the thresholding problem for sparse estimation with spatial data. Exploiting spatial information in high-dimensional estimation promises to improve the accuracy of statistical procedures. This dissertation develops a new class of spatially adaptive false discovery rate thresholding (SAFT) procedures by extending the elegant false discovery rate thresholding estimator (Abramovich et al., 2006) to spatial settings. The idea is to first construct robust and structure-adaptive weights by estimating the local sparsity levels, and then to set spatially adaptive thresholds through the weighted Benjamini-Hochberg (BH) procedure. SAFT is data-driven and assumption-lean. Theoretical results demonstrate its superior asymptotic performance over the original false discovery rate thresholding estimator in spatial settings. The finite sample performance is studied using both simulated data and real data, which shows that the proposed SAFT method outperforms existing methods in various settings.

Chapter 1

Introduction

This chapter gives a brief introduction to the problems considered in this dissertation, including nonparametric empirical Bayes estimation with auxiliary sequences and false discovery rate thresholding for sparse estimation with spatial data. Section 1.1 provides an introduction to nonparametric empirical Bayes estimation in the literature and our motivation for incorporating side information into estimation, together with the main ideas of our approach and its contributions. Section 1.2 provides an introduction to false discovery rate thresholding estimation and structured multiple testing; furthermore, we describe the problem formulation for spatial data and the main ideas of our proposed procedure, together with our contributions and a toy example. Section 1.3 gives an outline of the remaining chapters for reference.

1.1 Nonparametric Empirical Bayes Estimation with Auxiliary Sequences

In a broad class of integrative inference problems such as meta-analysis, replicability analysis, multi-task learning and multi-view data analysis, an essential task is to combine information from multiple sources to make valid and informative decisions. Consider a compound estimation problem where $Y = (Y_i : 1\le i\le n)$ is a vector of summary statistics in the target domain obeying
\[ Y_i = \theta_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma_i^2). \tag{1.1.1} \]
We are interested in estimating the high-dimensional parameter $\theta = E(Y) = \{\theta_i : 1\le i\le n\}$. Suppose we also collect $K$ auxiliary sequences $S^k = \{S^k_i : 1\le i\le n\}$, $1\le k\le K$, from related source domains. Denote the auxiliary data $S = (S_1, \ldots, S_n)^T$, where $S_i = (S^1_i, \ldots, S^K_i)^T$ is the side information associated with unit $i$. Assume that the $S_i$ follow some unspecified multivariate distribution $F_S$. Let $\eta = (\eta_1, \ldots, \eta_n)^T$ be a latent vector. The relationship between the primary and auxiliary data can be formalized using a hierarchical model:
\[ \theta_i = h_\theta(\eta_i, \epsilon_{y,i}), \quad 1\le i\le n; \qquad S^j_i = h_{s^j}(\eta_i, \epsilon_{s^j,i}), \quad 1\le j\le K, \tag{1.1.2} \]
where $h_\theta$ and $h_{s^j}$ are unspecified functions, and $\epsilon_{y,i}$ and $\epsilon_{s^j,i}$ are random perturbations that are independent of the latent vector $\eta$. This hierarchical model provides a general framework that can be utilized to incorporate both continuous and discrete auxiliary data into inference. The model offers much flexibility, allowing the strength of the side information to vary from fully informative to completely non-informative.
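To fix ideas, the following minimal simulation (plain NumPy; the particular choices of $h_\theta$, $h_{s^j}$ and the noise scales are illustrative assumptions, not part of the model) generates data from the hierarchical structure (1.1.1)-(1.1.2): a shared latent vector $\eta$ drives both the primary means and the auxiliary sequences, so the auxiliary data carry structural information about $\theta$ even though $Y_i$ and $S^j_i$ need not share the same distribution.

```python
import numpy as np

rng = np.random.default_rng(42)
n, K = 5000, 2

# latent vector eta shared by the target and source domains
eta = rng.choice([0.0, 2.0], size=n)

# primary parameters: theta_i = h_theta(eta_i, eps_{y,i}) for an illustrative h_theta
theta = eta + 0.5 * rng.normal(size=n)

# auxiliary sequences: s^j_i = h_{s^j}(eta_i, eps_{s^j,i}); one continuous, one binary
S = np.empty((n, K))
S[:, 0] = eta + rng.normal(size=n)        # continuous source sequence
S[:, 1] = (eta > 0).astype(float)         # discrete (group) source sequence

# observed summary statistics in the target domain, model (1.1.1)
sigma = 1.0
Y = theta + sigma * rng.normal(size=n)
```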
In transfer learning for large-scale estimation, we aim to extract and transfer structural knowledge encoded in the auxiliary data $S$ in the source domain to assist the simultaneous estimation of thousands or even millions of parameters in the target domain. Several issues arise in this data aggregation process. First, conventional meta-analytical methods often aim to estimate an overall effect by constructing weighted estimators to combine data across several subpopulations. However, such methods become problematic when the source and target distributions differ. A new estimation framework is necessary for transfer learning, in which the data from source domains should strictly play a secondary role by providing supplementary information. A key principle is that inaccurate auxiliary information, or a mismatch between the target and source distributions, should not lead to negative learning. Second, existing meta-analytical methods, which construct weighted estimators for only one or a few parameters, can be highly inefficient in large-scale estimation problems. When thousands of parameters are estimated simultaneously, useful structural knowledge, which is often encoded in auxiliary data sources, is highly informative but has been much underexploited in conventional analyses. Finally, most theories in transfer learning have focused on classification algorithms. A new theoretical framework is needed to understand the benefits and caveats of transfer learning in the important context of large-scale shrinkage estimation.

1.1.1 Compound rules, structural knowledge and side information

Consider a compound decision problem where we make simultaneous inference on $n$ parameters $\{\theta_i : 1\le i\le n\}$ based on $\{Y_i : 1\le i\le n\}$ from $n$ independent experiments. Let $\delta = (\delta_i : 1\le i\le n)$ be a decision rule. Then $\delta$ is separable or simple if $\delta_i$ depends on $Y_i$ alone, whereas $\delta$ is compound if $\delta_i$ depends on other $Y_j$, $j\ne i$. Classical ideas such as compound decision theory (Robbins, 1951), empirical Bayes methods (Robbins, 1964) and the James-Stein shrinkage estimator (Stein, 1956), as well as the more recent local false discovery rate methodologies (Efron et al., 2001; Sun and Cai, 2007), showed that compound rules are in general superior to simple rules. The idea is that the joint structure of all observations is highly informative; this structural knowledge can be exploited to construct more efficient classification, estimation and multiple testing procedures. For example, the subminimax rule in Robbins (1951) showed that the disparity in the proportions of positive and negative signals can be used to inform simple classification rules, and the adaptive z-value procedure in Sun and Cai (2007) showed that the asymmetry in the shape of the alternative distribution can be utilized to construct more powerful false discovery rate (FDR; Benjamini and Hochberg, 1995) procedures.

In light of auxiliary data, the inference units become unequal. This heterogeneity provides new structural knowledge that can be further utilized to improve the efficiency of existing compound rules. The idea is to first learn the refined structure of the high-dimensional object through auxiliary data and then apply the structural knowledge to the target domain. For example, in genomics research, prior data and domain knowledge may be used to define prioritized subsets of genes. Roeder and Wasserman (2009) proposed to up-weight the p-values in prioritized subsets where genes are more likely to be associated with the disease.
Structured multiple testing is an important topic that has received much recent attention; see Lei and Fithian (2018), Cai et al. (2019), Li and Barber (2019), Ignatiadis and Huber (2020) and Ren and Candes (2020) for a partial list of references. These works show that the power of existing FDR methods can be substantially improved by utilizing auxiliary data to place differential weights, or to set varied thresholds, on the corresponding test statistics. Similar ideas have been adopted by a few recent works on shrinkage estimation. For example, Weinstein et al. (2018) and Banerjee et al. (2020) respectively showed that the variance and sparsity structures can be highly informative in compound estimation. The idea is to incorporate the side information into inference by first creating groups and then constructing group-wise linear shrinkage or soft-thresholding estimators.

1.1.2 Nonparametric Integrative Tweedie

Tweedie's formula is an elegant and celebrated result that has received renewed interest recently (Jiang and Zhang, 2009; Brown and Greenshtein, 2009; Efron, 2011; Koenker and Mizera, 2014). Under the nonparametric empirical Bayes framework, the formula is particularly appealing for large-scale estimation problems because it (a) is simple to implement; (b) effectively pools information from all samples and provides the optimal shrinkage direction; (c) removes the selection bias; and (d) enjoys frequentist optimality properties asymptotically.

This dissertation develops a class of general empirical Bayes (GEB) estimators based on nonparametric integrative Tweedie (NIT) to extract and incorporate useful structural knowledge from both primary and auxiliary data. NIT has several merits in the transfer learning setup. First, NIT allows the target and source distributions to differ, which effectively avoids negative learning. We shall see that no distributional assumptions are imposed on $S$, which is allowed to be categorical, numerical or of mixed type. The auxiliary data are only used to provide structural knowledge about the high-dimensional parameter in the target domain. Second, NIT provides a general framework for incorporating various types of structural information in auxiliary data. By contrast, the grouping methods in Weinstein et al. (2018) and Banerjee et al. (2020) can only leverage heteroscedasticity or sparsity structures in the data. Moreover, the grouping strategy involves discretizing continuous covariates, which not only leads to loss of information but also becomes intrinsically difficult to implement for multivariate covariate sequences. Finally, in contrast with the linear EB shrinkage estimators (Xie et al., 2012; Tan, 2015; Jing et al., 2016; Zhang and Bhattacharya, 2017), which are only optimal under parametric Gaussian priors, NIT belongs to the class of GEB estimators that have been shown to enjoy asymptotic optimality and minimax optimality properties for a wide class of models (Zhang, 1997; Brown and Greenshtein, 2009).

1.1.3 Main ideas of our approach

The EB implementation of Tweedie's formula typically involves two steps: first estimating the marginal distribution and then predicting the unknown $\theta$ using a plug-in rule; a minimal numerical sketch of this two-step recipe is given below. We first describe some important developments in the literature. Zhang (1997) showed that a truncated GEB estimator, which is based on a Fourier infinite-order smoothing kernel, asymptotically achieves both the Bayes and minimax risks.
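The sketch below is a minimal version of the two-step recipe in the classical setting with no auxiliary data: the marginal density and its derivative are estimated with a Gaussian kernel, and the plug-in Tweedie estimate is $y_i + \sigma^2 \hat f'(y_i)/\hat f(y_i)$. The fixed bandwidth and unit noise variance are simplifying assumptions and not the specific constructions reviewed in this subsection.

```python
import numpy as np

def tweedie_kde(y, sigma2=1.0, bandwidth=0.3):
    """Two-step EB estimate: (i) Gaussian-kernel estimates of the marginal density f
    and its derivative; (ii) plug into Tweedie's formula y + sigma2 * f'/f.
    Normalizing constants cancel in the ratio, so they are omitted."""
    diff = y[:, None] - y[None, :]                      # y_i - y_j
    w = np.exp(-0.5 * (diff / bandwidth) ** 2)          # Gaussian kernel weights
    f_hat = w.mean(axis=1)                              # unnormalized density estimate
    f_prime_hat = (-diff / bandwidth ** 2 * w).mean(axis=1)
    return y + sigma2 * f_prime_hat / f_hat

# toy example: sparse normal means
rng = np.random.default_rng(0)
theta = np.where(rng.random(2000) < 0.1, 3.0, 0.0)
y = theta + rng.normal(size=2000)
theta_hat = tweedie_kde(y)
```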
The GMLEB approach by Jiang and Zhang (2009) implements Tweedie's formula by estimating the unknown prior distribution via the Kiefer-Wolfowitz estimator. GMLEB is approximately minimax and universally reduces the estimation risk. Brown and Greenshtein (2009) employed a nonparametric EB estimator via Gaussian kernels and showed that the estimator achieves asymptotic optimality for both dense and sparse means. Empirically, GMLEB outperforms the kernel method of Brown and Greenshtein (2009). However, GMLEB is computationally intensive and may not be suitable for data-intensive applications. The connection between compound estimation and convex optimization was pioneered by Koenker and Mizera (2014), who showed that the Kiefer-Wolfowitz estimator can be cast as a convex program (with shape constraints) and solved by modern interior point methods. The algorithm is fast and compares favorably to competing methods. However, the nonparametric GEB approach to compound estimation with covariates has not been pursued in the literature.

This dissertation develops new methodology and theory for large-scale shrinkage estimation under the transfer learning setup by employing a set of modern computational and theoretical tools. The implementation of NIT essentially boils down to the estimation of the log-gradient of the convoluted marginal density, $h(x_i)$, where $x_i = (y_i, s_i^T)^T$ is multivariate due to the side information. Through a carefully designed reproducing kernel Hilbert space (RKHS) representation of Stein's discrepancy, we recast compound estimation as a convex optimization problem, where the optimal shrinkage factor is found by searching among all feasible score function embeddings in the RKHS. The algorithm is computationally fast and scalable, and enjoys superior performance empirically. The kernelized optimization framework provides a rigorous and powerful mathematical interface for theoretical analysis. By appealing to RKHS theory and concentration theory for U-statistics, we derive the appropriate order of the kernel bandwidth, establish the asymptotic optimality of the data-driven NIT, and explicitly characterize the impact of the dimension of $S_i$ on the rate of convergence. The theory provides insights on the benefits and caveats of using auxiliary data in transfer learning for high-dimensional estimation problems.

1.1.4 Comparison with existing works and our contributions

Transfer learning algorithms for shrinkage estimation construct new compound rules to reduce estimation risk by incorporating structural knowledge encoded in auxiliary data. It is a new area with relatively few works, most of which are confined to specific classes of estimators and tailored to specific types of side information. NIT provides a unified, flexible and powerful framework that is capable of leveraging auxiliary information from multiple sources. The nonparametric EB formulation has several advantages. First, existing works (e.g. Ke et al., 2014; Cohen et al., 2013; Kou and Yang, 2017) require that the conditional mean function be specified in the form $m(S_i) = E(\theta_i \mid S_i) = S_i^T\beta$, where $\beta$ is a vector of regression coefficients that is usually unknown. Ignatiadis and Wager (2019) derive the minimax rates of convergence for a broad class of functions $m(S_i)$, but their work is confined to the class of linear shrinkage estimators.
By contrast, NIT does not require the specification of any functional relationship, and its asymptotic optimality holds for a much wider class of prior distributions. Second, NIT is capable of incorporating various types of side information and handling multivariate covariates. By contrast, the methods in Xie et al. (2012), Weinstein et al. (2018) and Fu et al. (2020) focus only on utilizing the variance structure, and the method in Banerjee et al. (2020) is only suitable for capturing the sparse structure. Third, unlike the methods in Weinstein et al. (2018) and Fu et al. (2020), which are computationally intensive and often unstable, our convex optimization approach is fast and scalable, produces stable estimates, and provides a flexible tool for incorporating various structural constraints such as monotonicity and unbiasedness. Finally, NIT eliminates the need to define groups, which avoids the information loss caused by discretization. The formulation through the Mahalanobis distance provides a useful tool for handling multiple and possibly correlated covariates.

Our work makes several contributions in theory. First, we establish convergence rates for both the kernelized Stein's discrepancy and the $L^2$ risk of the data-driven NIT. Second, our theoretical framework facilitates a systematic investigation of the usefulness of side information. The new theory provides important insights on the benefits and caveats of utilizing multivariate auxiliary data. Finally, we develop new analytical tools to formalize the conjectures in Liu et al. (2016b), establishing the isometry between the $L^2$ space of score functions (equipped with the natural $L^2$ norm, which corresponds to Fisher's divergence in the marginal density domain) and the feasible score function embeddings in the RKHS. Specifically, the data-driven NIT involves mapping the observed data to the RKHS and applying shrinkage estimators on the mapped data. However, the Hilbert space norm can behave very differently from Fisher's divergence as $n$ increases. We provide a rigorous analysis showing that the optimal score function in the RKHS-transformed space indeed yields optimal risk in the $L^2$ space asymptotically. The theoretical tool employed in our analysis is also useful for studying the properties of RKHS approximations in a range of statistical problems such as nonparametric least squares regression and goodness-of-fit testing.

1.2 Sparse Estimation with Spatial Data

Sparsity is a common and essential phenomenon in high dimensional data analysis. For example, a bike-sharing company may be interested in investigating the differences in the numbers of shared-bike trips between weekdays and weekends in specified areas, as we study further in Section 3.4. In some regions the numbers of trips are almost the same between weekdays and weekends, while in other regions the differences are significant. Moreover, these significant differences tend to appear more frequently in certain regions. This kind of spatial information is also common in numerous applications, including genomics, image processing, disease mapping and environmental studies, yet few analytical tools are available for leveraging spatial structure in high dimensional data analysis. The second theme of this dissertation is to develop a new framework that incorporates sparse and spatial structures simultaneously to improve the efficiency of estimating a high dimensional sparse parameter vector.

1.2.1 High dimensional sparse estimation

The normal means estimation problem arises frequently in a wide range of applications.
In the literature, shrinkage methods, exemplified by the seminal work of James and Stein (1961), have demonstrated significant success both theoretically and numerically. Popular approaches include the linear shrinkage estimator (James and Stein, 1961; Efron and Morris, 1975), thresholding-based non-linear shrinkage methods (Donoho and Johnstone, 1994; Abramovich et al., 2006; Cai and Zhou, 2009) and empirical Bayes (EB) methods, also known as Tweedie's formula (Brown and Greenshtein, 2009; Efron, 2011; Weinstein et al., 2018; Fu et al., 2019). In sparse estimation problems, thresholding-based shrinkage methods often give more accurate and interpretable results. This dissertation considers a class of soft-thresholding estimators for high dimensional sparse estimation problems. Let $\eta(\cdot, t)$ be the soft-thresholding operator such that
\[ \eta(Y_i, t) = \begin{cases} Y_i, & \text{if } |Y_i| \le t,\\ t\,\mathrm{sign}(Y_i), & \text{otherwise.} \end{cases} \tag{1.2.3} \]
Then the soft-thresholding estimator is given coordinate-wise by $\hat\theta(Y, t) = Y - \eta(Y, t)$. Conventional thresholding methods choose the threshold $t$ by minimizing Stein's unbiased risk estimate (SURE) (Donoho and Johnstone, 1995), but instability has been observed over some sparse parameter spaces (Donoho and Johnstone, 1995). Abramovich et al. (2006) proposed an elegant data-adaptive thresholding scheme by connecting the decision theory of sparse estimation to false discovery rate (FDR) analysis in multiple testing.

The FDR is an important concept in the field of multiple testing. In traditional multiple testing, many techniques have been developed to control the type I error when simultaneously testing $n$ hypotheses $H_{0i}$ versus $H_{1i}$, $i = 1, \ldots, n$. The traditional approach controls the family-wise error rate (FWER) at a specified level $q$. Benjamini and Hochberg (1995) argued that controlling the FWER is too conservative and proposed to control the FDR instead. Formally, let $V$ be the number of false discoveries among the $R$ rejected hypotheses. Define the FDP as the proportion of false discoveries among the discoveries: $\mathrm{FDP} = V/R$ if $R > 0$ and $0$ if $R = 0$. The false discovery rate is defined as $\mathrm{FDR} = E\{\mathrm{FDP}\}$, which is a widely used error criterion in large-scale testing problems. There is a simple step-up FDR-controlling procedure. First convert the observations $Y_i$ to p-values $p_i$. Let $p_{(1)} \le \cdots \le p_{(n)}$ be the ordered p-values, with $H_{(0i)}$ the corresponding hypotheses. Define
\[ \hat k = \max\{k \ge 1 : p_{(k)} \le (k/n)\, q\}. \tag{1.2.4} \]
The FDR-controlling procedure rejects all hypotheses $H_{(0i)}$ corresponding to the ordered indices $i = 1, \ldots, \hat k$. Benjamini and Hochberg (1995) showed that, for independent hypothesis tests, the FDR of the above BH procedure is bounded by $q$. Unlike the FWER, which focuses solely on false positives, the FDR trades off false positives against false negatives, and thus adapts to sparsity. Owing to this adaptivity, the FDR thresholding estimator achieves asymptotic minimaxity simultaneously over a wide range of sparse parameter spaces and loss functions, and adapts to the unknown degree of sparsity. The corresponding threshold is inherently data-adaptive: it is higher for sparse signals and lower for dense ones.

1.2.2 Spatial information

Exploiting spatial structure and leveraging information from nearby locations can help test and estimate signals more accurately.
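Before spatial structure enters the picture, it helps to have the non-spatial baseline of Section 1.2.1 in concrete form. The sketch below (unit-variance Gaussian noise assumed; function and variable names are illustrative) implements the soft-thresholding operator (1.2.3) with the threshold derived from the BH step-up cut-off (1.2.4); the spatially adaptive procedures discussed in the remainder of this chapter modify only how this threshold is set at each coordinate.

```python
import numpy as np
from scipy.stats import norm

def soft_threshold(y, t):
    """Coordinate-wise soft-thresholding estimator, cf. (1.2.3)."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def bh_threshold(y, q=0.1):
    """FDR thresholding: convert observations to two-sided p-values, find the BH
    step-up cut-off (1.2.4), and return the matching threshold on |y|.
    Returns +inf when no hypothesis is rejected."""
    n = len(y)
    p_sorted = np.sort(2.0 * norm.sf(np.abs(y)))
    passed = np.nonzero(p_sorted <= q * np.arange(1, n + 1) / n)[0]
    if passed.size == 0:
        return np.inf
    k_hat = passed[-1] + 1                        # number of rejections
    return norm.isf(p_sorted[k_hat - 1] / 2.0)    # |y| cut-off matching p_(k_hat)

# usage
rng = np.random.default_rng(1)
theta = np.where(rng.random(1000) < 0.05, 4.0, 0.0)
y = theta + rng.normal(size=1000)
theta_hat = soft_threshold(y, bh_threshold(y, q=0.1))
```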
Recent advances in structured multiple testing have demonstrated the usefulness of spatial information in testing and inference (Basu et al., 2018; Lei and Fithian, 2018; Li and Barber, 2019; Cai et al., 2020). The key idea is that the coordinates become "unequal" in light of the spatial information. These works demonstrate that the spatial structure of the mean vector, together with the sparsity information, can be highly informative for constructing more efficient estimation and testing procedures. Most spatial analyses assume parametric models, including hidden Markov models (Sun and Cai, 2009), Gaussian random field models (Sun et al., 2015), Ising models (Shu et al., 2015) and graphical models (Liu et al., 2016a). But these assumptions may not hold in practice. This dissertation aims to develop a class of FDR-thresholding estimators for spatial data that can learn the structural information of the spatial process without parametric assumptions on the underlying data or prior knowledge about the clusters.

The main goal is to leverage the local sparsity information from nearby locations and adaptively set coordinate-wise thresholds for the estimation units. We propose a spatially adaptive FDR thresholding (SAFT) estimator that consists of three main steps. First, we estimate the local sparsity structure by a screening and smoothing approach. Then, weighted p-values are constructed based on the estimated local sparsity structure, and the weighted BH procedure is applied to obtain a cut-off on the weighted p-values. Finally, the spatially adaptive thresholds are calculated from this cut-off. A schematic sketch of this recipe is given at the end of this chapter.

The key idea of the proposed SAFT procedure is to capture the inhomogeneity by using the weighted BH algorithm to adaptively set varied thresholds at the corresponding coordinates. Consider an ideal situation where the local sparsity structure is known. Intuitively, we should use a lower/higher threshold when the local sparsity level is large/small (e.g. when signals are abundant/sparse in a local neighborhood). When the local sparsity structure is unknown, we assume that it changes smoothly and estimate it via a screening approach. A useful view of this screening approach is that coordinates which are "close" in spatial location constitute a "local neighborhood" sharing similar distributional characteristics, such as sparsity levels and signal amplitudes. By contrast, coordinates which are "far apart" form different neighborhoods with possibly distinct characteristics. This class of locally adaptive estimation algorithms aims to simultaneously (i) integrate information from adjacent coordinates to estimate the local sparsity levels and (ii) exploit the heterogeneity in sparsity levels among remote coordinates to adaptively narrow down the focus. This involves (i) calculating differential weights, (ii) carrying out an FDR analysis and (iii) setting varied thresholds at the corresponding coordinates in light of the spatial structure.

1.2.3 Our contributions and a toy example

High dimensional estimation and inference with side information or auxiliary sequences is an important topic that has received much attention recently (Banerjee et al., 2020; Ignatiadis and Wager, 2019). Spatially adaptive thresholding via SAFT offers a flexible framework to simultaneously incorporate the sparsity structure and the spatial structure into estimation. It has several advantages over existing methods. First, it extends the elegant FDR-thresholding framework (Abramovich et al., 2006) by incorporating the spatial structure.
Numerical and theoretical results demonstrate the resulting improvement in estimation efficiency under various settings. Second, conventional estimation-with-side-information methods (Banerjee et al., 2020) assume that the side information is extracted from a secondary dataset, or from the same dataset using carefully constructed sequences; the SAFT method only requires the intrinsic ordering of the spatial data. This spatial ordering is different from grouping and clustering, where qualitative and quantitative auxiliary variables are used to create groups or clusters before estimation, which requires prior knowledge or intensive computation. Third, the SAFT method is model-free and assumption-lean, whereas conventional spatial modeling methods rely on Gaussian priors or parametric assumptions which may be misspecified in practice. Finally, in theory we show that the proposed SAFT procedure improves the efficiency of estimation significantly and is asymptotically minimax in sparse settings.

We illustrate the effectiveness of the SAFT procedure through a toy example from a multiple testing problem with 2D spatial data. The heatmap of the parameters is given in the top left panel of Figure 1.1, where the signals appear more frequently in two circular regions. Cai et al. (2020) demonstrate that, by incorporating the estimated local sparsity structure, the weighted BH procedure significantly outperforms the BH procedure in the multiple testing setting. In estimation, Cai and Zhou (2009) proposed the SureBlock method, which adaptively sets varied thresholds for different blocks. But the choice of block structure requires prior knowledge, and it is difficult to decide the size and shape of the blocks for spatial data. Moreover, spatial data sometimes do not form contiguous regions and only fluctuate in signal sparsity levels. Our proposed SAFT procedure does not require any assumption on the block structure. The proposed estimator uses coordinate-wise thresholding and is capable of adapting to structural sparsity in a more flexible manner than blockwise thresholding. Comparing the heat maps of the estimates obtained by SureBlock, BH-based thresholding and SAFT, shown in Figures 1.1b, 1.1c and 1.1d, the SAFT procedure achieves the best performance.

1.3 Summary of Chapters

The rest of the dissertation is organized as follows. Chapter 2 picks up where Section 1.1 leaves off and considers the proposed NIT procedure for estimating a vector of normal means in the presence of auxiliary sequences. We propose a new framework to leverage information from auxiliary sequences. The improvements in estimation are illustrated through theoretical results, numerical simulations and real data applications. Chapter 3 picks up where Section 1.2 leaves off. We discuss the problem formulation, its benefits and the proposed SAFT procedure, and provide theoretical results and applications on simulated and real data.

[Figure 1.1 (panels: (a) 2D parameters, (b) SureBlock, (c) BH, (d) SAFT): A multiple testing example with 2D data. The heat map of the parameters is given in Figure 1.1a. For the estimation problem, the mean square error of SAFT is 0.233 while the mean square error of the BH procedure is 0.401, which illustrates the effectiveness of the local sparsity structure in estimation. The heat maps of the SureBlock estimator, the BH procedure and our SAFT estimator are shown in Figures 1.1b, 1.1c and 1.1d.]
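The schematic sketch referenced in Section 1.2.2 is given below. It illustrates the SAFT recipe on a 1D grid: estimate local sparsity by screening and smoothing, form weighted p-values, and convert the weighted BH cut-off into coordinate-wise soft thresholds. The screening level, the Gaussian smoothing kernel and the weighting $w_i \propto \hat\pi_i/(1-\hat\pi_i)$ are illustrative assumptions rather than the exact specification developed in Chapter 3.

```python
import numpy as np
from scipy.stats import norm

def saft_sketch(y, coords, q=0.1, screen_q=0.2, bw=5.0):
    """Schematic spatially adaptive FDR thresholding on a 1D spatial grid."""
    n = len(y)
    p = 2.0 * norm.sf(np.abs(y))

    # Step 1: local sparsity via screening + kernel smoothing of screening indicators
    screened = (p <= screen_q).astype(float)
    W = np.exp(-0.5 * ((coords[:, None] - coords[None, :]) / bw) ** 2)
    pi_hat = np.clip(W @ screened / W.sum(axis=1), 1e-3, 1 - 1e-3)

    # Step 2: sparsity-adaptive weights (normalized to average one) and weighted p-values
    w = pi_hat / (1.0 - pi_hat)
    w *= n / w.sum()
    pw = p / w

    # Step 3: weighted BH cut-off, then coordinate-wise thresholds and soft-thresholding
    pw_sorted = np.sort(pw)
    passed = np.nonzero(pw_sorted <= q * np.arange(1, n + 1) / n)[0]
    cut = pw_sorted[passed[-1]] if passed.size else 0.0
    t = norm.isf(np.minimum(w * cut, 1.0) / 2.0)   # higher weight -> lower threshold
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

# usage: signals concentrated in one spatial region
rng = np.random.default_rng(2)
coords = np.arange(1000.0)
theta = np.where((coords > 400) & (coords < 500) & (rng.random(1000) < 0.5), 3.0, 0.0)
theta_hat = saft_sketch(theta + rng.normal(size=1000), coords)
```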
Chapter 2

Improved Nonparametric Empirical Bayes Estimation By Auxiliary Sequences

This chapter studies the problem of nonparametric empirical Bayes estimation with auxiliary sequences. Section 2.1 describes the methodology of the proposed NIT procedure. The theoretical properties are studied in Section 2.2. Simulations are conducted in Section 2.3, and applications to real data are studied in Section 2.4. The proofs are provided in Section 2.5, and further results on the applications are given in the supplemental materials.

2.1 Methodology

Let $\delta(y, s) = (\delta_i : 1\le i\le n)$ be an estimator of $\theta$ and $L^2_n(\theta, \delta) = n^{-1}\sum_{i=1}^n (\delta_i - \theta_i)^2$ be the mean squared error in compound estimation. Define the risk $R_n(\theta, \delta) = E_{Y, S \mid \theta}\, L^2_n(\theta, \delta)$ and the Bayes risk $B_n(\delta) = \int R_n(\theta, \delta)\, dG(\theta)$, where $G(\theta)$ is an unspecified prior. This section first develops an oracle rule under the transfer learning setup and then discusses how the rule can be implemented via convex programming.

2.1.1 Nonparametric integrative Tweedie

We study the optimal rule that minimizes the Bayes risk. The integrative Tweedie formula, given by the next proposition, generalizes Tweedie's formula (Robbins, 1964; Efron, 2011) from the classical setup to the transfer learning setup.

Proposition 2.1 (Integrative Tweedie's formula). Consider the hierarchical model (1.1.1) and (1.1.2). Let $f(y \mid s)$ be the conditional density of $Y$ given $S$. The optimal estimator that minimizes the Bayes risk is $\delta^*(y, s) = (\delta^*_i : 1\le i\le n)$, where $\delta^*_i := \delta^*(y_i, s_i)$ and
\[ \delta^*(y, s) = y + \sigma^2 \nabla_y \log f(y \mid s). \tag{2.1.1} \]

Integrative Tweedie's formula provides a model-free approach to extract and transfer useful structural information from the auxiliary data. Existing works on estimation with side information require that the form of the conditional mean function $m(S_i) = E(Y_i \mid S_i)$ be pre-specified (Ke et al., 2014; Cohen et al., 2013; Kou and Yang, 2017). Moreover, the effectiveness of transfer learning algorithms for classification critically depends on the similarity between the target and source distributions. By contrast, integrative Tweedie makes no specific assumptions on the relationship between $Y$ and $S$ or between the target and source distributions. Hence it effectively avoids negative learning. The key idea is that the auxiliary data are only used to assist inference by providing structural knowledge about the primary statistics. The next two examples respectively show that (a) if the two distributions match perfectly, then integrative Tweedie emulates an oracle that combines the two sequences optimally; and (b) if the two distributions differ, substantial efficiency gains can still be achieved if the auxiliary sequence encodes useful structural knowledge.

Example 2.1. Suppose $(Y_i, S_i)$ are conditionally independent given $(\theta_{i,y}, \theta_{i,s})$. We begin by considering a case where $S_i$ is simply an independent copy of $Y_i$:
\[ \theta_i \sim N(\mu_0, \tau^2), \qquad Y_i = \theta_i + \epsilon_i, \qquad S_i = \theta_i + \epsilon'_i, \tag{2.1.2} \]
where $\epsilon_i \sim N(0, \sigma^2)$ and $\epsilon'_i \sim N(0, \sigma^2)$. Intuitively, the optimal estimator combines $Y_i$ and $S_i$ through $X_i = (Y_i + S_i)/2 \sim N(\theta_i, \tfrac{1}{2}\sigma^2)$. Then we have
\[ \hat\theta^{op}_i = \frac{\tfrac{1}{2}\sigma^2\mu_0 + \tau^2 X_i}{\tfrac{1}{2}\sigma^2 + \tau^2} = \frac{\sigma^2\mu_0 + \tau^2(y_i + s_i)}{\sigma^2 + 2\tau^2}, \tag{2.1.3} \]
which will be used as the benchmark. (If we slightly alter the model, say by shifting $S_i$ away from $\theta_i$ at some random locations $i$, or by letting $S_i = f(\theta_i) + \epsilon'_i$, then this simple averaging strategy would no longer work.) For this bivariate normal model, the conditional distribution of $Y_i$ given $S_i$ is
\[ Y_i \mid S_i = s_i \;\sim\; N\!\left( \frac{\tau^2 s_i + \sigma^2\mu_0}{\tau^2 + \sigma^2},\; \frac{\sigma^2(2\tau^2 + \sigma^2)}{\tau^2 + \sigma^2} \right). \]
It follows that
\[ l'(y_i \mid s_i) = \nabla_{y_i}\log f(y_i \mid s_i) = \frac{(\tau^2 s_i + \sigma^2\mu_0) - (\tau^2 + \sigma^2)\, y_i}{\sigma^2(2\tau^2 + \sigma^2)}. \]
Substituting into (2.1.1), we obtain $\delta^*_i = \frac{\sigma^2\mu_0 + \tau^2(y_i + s_i)}{\sigma^2 + 2\tau^2}$, recovering the optimal estimator (2.1.3).

Example 2.2.
Suppose that $S_i$ is a group indicator. The two groups ($S = 1$ and $S = 2$) have equal sample sizes. The primary data obey $Y_i \mid S_i = k \sim (1-\pi_k) N(0, 1) + \pi_k N(\mu_i, 1)$, with $\pi_1 = 0.01$, $\pi_2 = 0.4$ and $\mu_i \sim N(2, 1)$. Hence we have two sparse mean vectors with varying sparsity levels across the two groups. Denote the oracle Bayes rules with and without side information by $\delta^*_i(Y_i, S_i)$ and $\delta^*_i(Y_i)$, respectively. Some calculations show that $\{B(\delta^*_i(Y_i)) - B(\delta^*_i(Y_i, S_i))\}/B(\delta^*_i(Y_i)) = 0.216$, which shows that incorporating auxiliary information can reduce the risk by as much as 21.6%. It is important to note that the distributions of $Y_i$ and $S_i$ are very different (continuous vs. binary), making it difficult for conventional machine learning algorithms to pool information across the target and source domains. NIT is still highly effective since the sparse structure of $Y_i$ is encoded in the grouping variable $S_i$.

2.1.2 Nonparametric estimation via convex programming

This section develops a data-driven algorithm to implement the oracle rule. Denote the collection of all data by $X = (y, s^1, \ldots, s^K) := (x_1, \ldots, x_n)^T$, where $y = (y_1, \ldots, y_n)^T$ are the summary statistics in the target domain, $s^k = (s^k_1, \ldots, s^k_n)^T$ are the summary statistics collected from the $k$th source domain, $1\le k\le K$, and $x_i = (x_{i1}, \ldots, x_{i,K+1})^T$, the $i$th row of $X$, is the observed data associated with unit $i$, $1\le i\le n$. Let $K_\lambda(x, x')$ be a kernel function that is integrally strictly positive definite, with $\lambda$ a tuning parameter to be specified. A detailed discussion of the construction of the kernel $K_\lambda(\cdot,\cdot)$ and the choice of $\lambda$ is provided in Section 2.1.3. Our strategy is to directly estimate the shrinkage factor $h^1(x) = \{\nabla_{y_i} \log f(y_i \mid s_i) : 1\le i\le n\}$ by solving a convex program. Define the following $n\times n$ matrices:
\[ n^2 K_\lambda := [K_\lambda(x_i, x_j)]_{ij}, \qquad n^2 \nabla_k K_\lambda := [\nabla_{x_{jk}} K_\lambda(x_i, x_j)]_{ij}, \qquad n^2 \nabla^2_k K_\lambda := [\nabla_{x_{jk}} \nabla_{x_{ik}} K_\lambda(x_i, x_j)]_{ij}. \]
For a fixed $\lambda \in \Lambda$, let $\hat h^k(\lambda) = \big(\hat h^k_1(\lambda), \ldots, \hat h^k_n(\lambda)\big)^T$ be the solution to the following quadratic program, for $k = 1, \ldots, K+1$:
\[ \hat h^k = \operatorname*{argmin}_{h^k \in V^k_n} \Big\{ \hat S^k(h^k, \lambda) := (h^k)^T K_\lambda\, h^k + 2 (h^k)^T \nabla_k K_\lambda \mathbf 1 + \mathbf 1^T \nabla^2_k K_\lambda \mathbf 1 \Big\}, \tag{2.1.4} \]
where $V^k_n = \{h = (h_1, \ldots, h_n) : M^k_1 h \le a^k_1,\; M^k_2 h = a^k_2\}$ is a convex set and $M^k_1$, $M^k_2$, $a^k_1$, $a^k_2$ are known matrices and vectors that enforce linear constraints on the components of $\hat h^k$. (If the interest is only in estimating the parameters in the target domain, then only the first quadratic program needs to be solved. We present the general formulation for practical applications where $\hat h^k$ provides an estimate of $h^k(x) = \{\nabla_{x_{ik}} \log f(x_{ik} \mid \{x_{ij} : j\ne k, 1\le j\le K+1\}) : 1\le i\le n\}$, which can be employed to construct shrinkage estimators for the related source domains.)

The proposed class of data-driven nonparametric integrative Tweedie (NIT) rules is of the form $\delta^\lambda(\cdot) = \{\delta^\lambda_i(\cdot) : 1\le i\le n\} = \{ y_i + \sigma^2_y \hat h^1_i(\lambda) : 1\le i\le n \}$. Section 2.2 shows that, for a suitably chosen $\lambda$, optimizing $\hat S^1(\hat h^1, \lambda)$ is asymptotically equivalent to finding the $\hat h^1$ that is as close as possible to the true score function $h^1(x) = \{\nabla_{y_i}\log f(y_i \mid s_i) : 1\le i\le n\}$. Since $h^1$ is the optimal shrinkage factor in integrative Tweedie, intuitively the convex program (2.1.4) yields asymptotically optimal solutions to the compound estimation problem; rigorous theories underpinning these intuitions are developed in Section 2.2, and a small numerical sketch of the unconstrained version of (2.1.4) is given below.

Our convex optimization approach marks a clear departure from existing GMLEB methods (Jiang and Zhang, 2009; Koenker and Mizera, 2014; Gu and Koenker, 2017), in which the transfer learning setup has not been addressed.
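The sketch below solves the unconstrained version of (2.1.4) in closed form: without the linear constraints in $V^1_n$, the minimizer of $h^T K_\lambda h + 2 h^T \nabla_1 K_\lambda \mathbf 1$ is $h = -K_\lambda^{-1}\nabla_1 K_\lambda \mathbf 1$, computed here with a small ridge term for numerical stability. A plain RBF kernel on the stacked data and a fixed bandwidth are simplifying assumptions; the constrained program and the MCV-tuned bandwidth of Section 2.1.3 are not implemented.

```python
import numpy as np

def nit_unconstrained(y, S, sigma2=1.0, lam=0.7, ridge=1e-6):
    """Closed-form solution of the unconstrained analogue of the NIT program (2.1.4):
    h = -(K + ridge*I)^{-1} (grad_1 K) 1, then theta_hat_i = y_i + sigma2 * h_i."""
    X = np.column_stack([y, S])                   # x_i = (y_i, s_i), covariates as numeric
    D = X[:, None, :] - X[None, :, :]             # pairwise differences
    Kmat = np.exp(-0.5 * np.sum(D ** 2, axis=2) / lam ** 2)   # RBF kernel matrix
    # gradient of K(x_i, x_j) with respect to the y-coordinate of x_j
    gradK = Kmat * (X[:, None, 0] - X[None, :, 0]) / lam ** 2
    b = gradK.sum(axis=1)                         # (grad_1 K) 1
    h = -np.linalg.solve(Kmat + ridge * np.eye(len(y)), b)
    return y + sigma2 * h

# toy check in the spirit of Example 2.2: a binary side sequence encoding sparsity
rng = np.random.default_rng(0)
n = 500
group = rng.integers(0, 2, n)
is_signal = rng.random(n) < np.where(group == 1, 0.4, 0.01)
theta = np.where(is_signal, rng.normal(2.0, 1.0, n), 0.0)
y = theta + rng.normal(size=n)
theta_hat = nit_unconstrained(y, group.astype(float))
print("MSE(naive):", np.mean((y - theta) ** 2), "MSE(NIT sketch):", np.mean((theta_hat - theta) ** 2))
```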
Such GMLEB methods cannot be easily extended to handle multivariate auxiliary data. Moreover, NIT has several additional advantages over existing nonparametric GEB methods in theory and computation. First, in comparison with the GMLEB method (Jiang and Zhang, 2009), the convex program (2.1.4) is fast and scalable. Second, Brown and Greenshtein (2009) proposed to estimate the score function by the ratio $\hat f^{(1)}/\hat f$, where $\hat f$ is a kernel density estimate and $\hat f^{(1)}$ is its derivative. By contrast, our direct optimization approach avoids computing ratios and produces more stable and accurate estimates. Third, our convex program can be fine-tuned by selecting a suitable $\lambda$. This leads to improved numerical performance and enables a more disciplined theoretical analysis compared to the proposal in Koenker and Mizera (2014). Finally, $\sum_{k=1}^{K+1}\hat S^k$ can be conceptualized as an empirical measure of kernelized Stein's discrepancy (Liu et al., 2016b; Chwialkowski et al., 2016), which provides a useful analytical tool for establishing rates of convergence.

2.1.3 Computational details

We provide computational details regarding (a) how to use the constraints to enforce unbiasedness and monotonicity conditions; (b) how to construct kernel functions that handle multivariate and possibly correlated covariates; and (c) how to choose the bandwidth $\lambda$.

1. Structural constraints. We illustrate how to impose appropriate constraints to ensure unbiasedness and monotonicity. The unbiasedness constraint can be achieved by setting $\sum_{i=1}^n h_i = 0$, with the corresponding linear constraint $M_2 h = a_2$, where $M_2 = [1, \ldots, 1]$ and $a_2 = 0$. The monotonicity constraints, pursued in Koenker and Mizera (2014), have been shown to be highly effective for improving estimation accuracy. For ease of presentation, assume that $y_1 \le y_2 \le \cdots \le y_n$; then $M_1$ and $a_1$ can be chosen such that
\[ \sigma^2 h_{i-1} - \sigma^2 h_i \le y_i - y_{i-1}. \tag{2.1.5} \]
Precisely, we have
\[ M_1 = \begin{bmatrix} \sigma^2 & -\sigma^2 & 0 & \cdots & 0 & 0\\ 0 & \sigma^2 & -\sigma^2 & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & 0 & \cdots & \sigma^2 & -\sigma^2 \end{bmatrix}, \qquad a_1 = \begin{pmatrix} y_2 - y_1\\ y_3 - y_2\\ \vdots\\ y_n - y_{n-1} \end{pmatrix}. \tag{2.1.6} \]

2. Kernel functions. The kernel function needs to be carefully constructed to deal with various complications in the multivariate setting, where the auxiliary sequences $S^k$, $1\le k\le K$, may (i) have different measurement units; (ii) be correlated; and (iii) contain both continuous and categorical variables. We propose to use the generalized Mahalanobis distance (Krusinska, 1987), which is capable of handling correlated and heteroscedastic variables of mixed types. When all variables are continuous, the distance is given by $\|x - x'\|_{\Sigma_x} = \sqrt{(x - x')^T \Sigma_x^{-1} (x - x')}$, where $\Sigma_x$ is the sample covariance matrix, which we assume can be well estimated from the data. The RBF kernel is $K_\lambda(x, x') = \exp\{-0.5\,\lambda^{-2}\,\|x - x'\|^2_{\Sigma_x}\}$, where $\lambda$ is the bandwidth to be specified below.
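A minimal construction of this kernel matrix for continuous covariates is sketched below; the use of a pseudo-inverse of the sample covariance and the function name are illustrative choices, and the generalized distance for mixed-type variables is not implemented.

```python
import numpy as np

def mahalanobis_rbf_kernel(X, lam):
    """Gram matrix K_lam(x_i, x_j) = exp(-0.5 * lam^-2 * ||x_i - x_j||_Sigma^2),
    using the generalized Mahalanobis distance with the sample covariance of X."""
    Sigma = np.atleast_2d(np.cov(X, rowvar=False))
    Sigma_inv = np.linalg.pinv(Sigma)                  # pseudo-inverse for stability
    D = X[:, None, :] - X[None, :, :]                  # pairwise differences
    sq = np.einsum('ijk,kl,ijl->ij', D, Sigma_inv, D)  # squared Mahalanobis distances
    return np.exp(-0.5 * sq / lam ** 2)

# usage: rows of X are x_i = (y_i, s_i1, ..., s_iK)
X = np.random.default_rng(1).normal(size=(200, 3))
K = mahalanobis_rbf_kernel(X, lam=1.0)
```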
Compared to the Euclidean distance, which treats every coordinate equally, the Mahalanobis distance is more suitable for combining data collected from heterogeneous sources: it is unitless, scale-invariant and takes the correlation in the data into account.

3. Modified cross-validation (MCV). Implementing the NIT estimator requires selecting the tuning parameter $\lambda$. Ideally, this parameter should be chosen to minimize the mean squared error $L^2_n(\delta^{neb}, \theta)$, which is unknown in practice. Brown et al. (2013) use a modified cross-validation (MCV) technique for choosing the tuning parameter in the Poisson setting. We propose to apply MCV to select $\lambda$ in the NIT procedure. Let $\zeta_i \sim N(0, \sigma^2_y)$, $1\le i\le n$, be i.i.d. auxiliary variables, independent of the observations $y$ and $s^1, \ldots, s^K$, and let $\alpha > 0$. Define $a_i = y_i + \alpha\zeta_i$ and $b_i = y_i - \zeta_i/\alpha$. Note that the primary observations are $y_i = \theta_i + \epsilon_i$, where $\theta_i$ is the parameter of interest and $\epsilon_i \stackrel{i.i.d.}{\sim} N(0, \sigma^2_y)$; then the noise terms $\epsilon_i + \alpha\zeta_i$ and $\epsilon_i - \zeta_i/\alpha$ are independent. MCV uses the sub-samples $a_1, \ldots, a_n$ for the construction of the family of estimators $\delta^{neb}(a, S, \lambda)$, while the auxiliary sub-samples $b_1, \ldots, b_n$ are used for validation. Let $\delta^\lambda = \delta(a, S, \lambda)$. It can be shown that the validation loss is
\[ \hat L(\lambda, \alpha) = \frac{1}{n}\sum_{i=1}^n (\delta^\lambda_i - b_i)^2 - \sigma^2_y(1 + 1/\alpha^2). \tag{2.1.7} \]
Proposition 2.3 in Section 2.2.4 establishes the asymptotic consistency of the validation loss (2.1.7), justifying the MCV algorithm. The data-driven bandwidth is chosen by minimizing $\hat L(\lambda, \alpha)$: $\lambda^* = \operatorname*{argmin}_{\lambda\in\Lambda} \hat L(\lambda, \alpha)$. The proposed NIT estimator is given by
\[ \hat\delta = \{\hat\delta_i : 1\le i\le n\} = \{ y_i + \sigma^2_y\, \hat h^1_i(\lambda^*) : 1\le i\le n \}. \tag{2.1.8} \]

2.2 Theory

This section develops asymptotic theory for the data-driven NIT estimator. We assume that the i.i.d. observations $X_i = (Y_i, S_{i1}, \ldots, S_{iK})^T$, $i = 1, \ldots, n$, follow a continuous multivariate distribution $P$. Let $X = (X_1, \ldots, X_n)^T$, denote by $X^k$ the $k$th column of $X$, $k = 1, \ldots, K+1$, and let $\theta^k = E(X^k)$. We discuss the joint estimation of the $K+1$ high-dimensional parameter vectors $\theta^1, \ldots, \theta^{K+1}$, bearing in mind that (a) the primary vector $Y = (Y_i : 1\le i\le n)^T$ in one study can be exploited as auxiliary data for another related study; and (b) the theoretical properties apply to every column $\theta^k$, including the most relevant case $k = 1$.

2.2.1 Compound estimation and kernelized Stein's discrepancy

We first provide some intuition behind the optimization approach (2.1.4). The integrative Tweedie formula shows that the optimal shrinkage factor for estimating $\theta^k$ is $h^k_f(X) = \{\nabla_{x_{ik}}\log f(x_{ik} \mid x_{i,-k}) : 1\le i\le n\}$, where $x_{i,-k} = \{x_{ij} : 1\le j\le K+1, j\ne k\}$. Denote $H_f = (h^1_f, \ldots, h^{K+1}_f)$ and let $H = (h^1, \ldots, h^{K+1})$ be any estimate of $H_f$. Consider the sample criterion (2.1.4) and define
\[ \hat M_\lambda(H) = \sum_{k=1}^{K+1} \hat S^k(h^k, \lambda) = \sum_{k=1}^{K+1}\Big[ (h^k)^T K_\lambda\, h^k + 2(h^k)^T\nabla_k K_\lambda \mathbf 1 + \mathbf 1^T \nabla^2_k K_\lambda \mathbf 1 \Big]. \tag{2.2.9} \]
Let $h_f(x) = \{\nabla_{x_k}\log f(x_k \mid x_{-k}) : 1\le k\le K+1\}$ be the score function of a generic observation $x = (x_1, \ldots, x_{K+1})^T$ with density $f$.
To see why the optimization criterion (2.1.4) makes sense, we conceptualize $H$ as the score function of another distribution $f^*$, $h(x) = \{\nabla_{x_k}\log f^*(x_k \mid x_{-k}) : 1\le k\le K+1\}^T$, evaluated at $x_1, \ldots, x_n$: $H = [h(x_1), \ldots, h(x_n)]^T$. We show in Proposition 2.2 that the sample criterion $\hat M_\lambda(H)$ is a V-statistic that provides an asymptotically unbiased estimate of the following population criterion:
\[ M_\lambda\{h_f(\cdot), h(\cdot)\} = E_{(x, x')\sim f}\big\{ (h_f(x) - h(x))^T K_\lambda(x, x') (h_f(x') - h(x')) \big\} \equiv D_\lambda(f, f^*), \tag{2.2.10} \]
where $D_\lambda(f, f^*)$ is the kernelized Stein's discrepancy (KSD) between $f$ and $f^*$ (Liu et al., 2016b; Chwialkowski et al., 2016). It is easy to see that $D_\lambda(f, f^*) = 0$ if $f = f^*$. Moreover, the KSD is equivalently and fully represented by the discrepancy between the corresponding score functions, i.e. $D_\lambda(f, f^*) \equiv D_\lambda(h_f, h)$. Noting that $h_f$ is the optimal shrinkage factor we are trying to recover, we conclude that minimizing $\hat M_\lambda(H)$ is essentially the process of finding the $h$ that is as close as possible to $h_f$ (in the RKHS norm), which in turn yields asymptotically consistent solutions to the compound estimation problem. Rigorous theories underpinning the above intuitions are spelled out in the next sections.

The empirical estimation scheme in equation (2.2.9) is extremely useful in practical applications such as goodness-of-fit tests (Liu et al., 2016b; Chwialkowski et al., 2016), where the direct evaluation of $M_\lambda\{h_f(\cdot), h(\cdot)\}$ using equation (2.2.10) is difficult: even though $h$ may have a parametric form, $h_f$ is unknown in general, since we only access $f$ through the samples $\{x_i\}_{i=1}^n$. By comparison, the evaluation of $M_\lambda\{h_f(\cdot), h(\cdot)\}$ via equation (2.2.9) is possible in practice even if $h_f$ is unknown, since it only requires $h$ and $\{x_i\}_{i=1}^n$; it does not require access to the true score function $h_f$. This appealing merit makes the KSD criterion popular in a wide range of applications (Liu et al., 2016b; Chwialkowski et al., 2016; Liu and Wang, 2016; Oates et al., 2017).

2.2.2 Theoretical properties of KSD

Note that the sample criterion $\hat M_\lambda(H)$ in equation (2.2.9) is a V-statistic and a biased estimator of the population criterion $M_\lambda\{h_f(\cdot), h(\cdot)\}$, with the property that $M_\lambda\{h_f(\cdot), h(\cdot)\} = 0$ if and only if $h = h_f$. Denote
\[ M_\lambda(\hat h_n) = \frac{1}{n(n-1)}\sum_{i\ne j} E_{x_1,\ldots,x_n\sim f}\Big\{ (h_f(x_i) - \hat h_n(x_i))^T K_\lambda(x_i, x_j)\,(h_f(x_j) - \hat h_n(x_j)) \Big\}. \tag{2.2.11} \]
The next proposition shows that $M_\lambda(\hat h_n)$ is bounded in terms of Fisher-divergence-type quantities.

Proposition 2.2. Let $K_\lambda(\cdot,\cdot)$ be the RBF kernel with bandwidth parameter $\lambda \in \Lambda$, where $\Lambda$ is a compact subset of $\mathbb{R}^+$ bounded away from zero. Then we have
\[ M_\lambda(\hat h_n) \;\le\; \frac{E\|h_f\|_2^2 \,\vee\, E\|\hat h_n\|_2^2}{n-1}. \tag{2.2.12} \]
This proposition illustrates that the distance in the RKHS is bounded in terms of Fisher's divergence, but not vice versa.

[Figure 2.1: The inversion problem. The RKHS distance always satisfies $d_B(t_{f_1}, t_{f_2}) \le d_A(\delta_{f_1}, \delta_{f_2})$, but the reverse bound $d_A(\delta_{f_1}, \delta_{f_2}) \le C_0\, d_B(t_{f_1}, t_{f_2})$ does not always hold.]

2.2.3 The inversion problem

Assume that the marginal density is $f = g \star \varphi_\sigma$, where $g$ is some prior density and $\varphi_\sigma$ denotes the Gaussian noise density. Consider the score functions of all such densities $f$: $h_f = \nabla_{x}\log f(x)$. Let $f_0$ denote the true marginal, with $f_0 = g_0 \star \varphi_\sigma$ for this deconvolution problem.
Topology A, on the score functions $\boldsymbol h_f$ where $f$ varies over all possible distributions in $\mathbb{R}^d$, is equipped with the $L^2_{f_0}$-distance: $d_A(\boldsymbol h_{f_1}, \boldsymbol h_{f_2}) = \|\boldsymbol h_{f_1} - \boldsymbol h_{f_2}\|_{L^2_{f_0}}$. Our goal is to show that the output of our algorithm, $\hat{\boldsymbol h}_n$, is asymptotically close to $\boldsymbol h_{f_0}$, i.e.,
$$\mathbb{E}_{\boldsymbol x_1, \dots, \boldsymbol x_n \sim f_0}\, \frac{1}{n} \sum_{i=1}^n \big\| \hat{\boldsymbol h}_n[\boldsymbol x_1, \dots, \boldsymbol x_n](\boldsymbol x_i) - \boldsymbol h_{f_0}(\boldsymbol x_i) \big\|_2^2 \to 0 \quad \text{as } n \to \infty. \quad (2.2.13)$$
On the other hand, we consider topology B, which is the RKHS of the marginal density; its metric is the Hilbert space norm $\|\cdot\|_{\mathcal H_\lambda}$ induced by a Gaussian kernel of bandwidth $\lambda$. Let $\boldsymbol t_f$ be the projection of an arbitrary $\boldsymbol h_f$ from topology A onto topology B, and denote the distance in topology B by $d_B(\boldsymbol t_{f_1}, \boldsymbol t_{f_2}) = \|\boldsymbol t_{f_1} - \boldsymbol t_{f_2}\|_{\mathcal H_\lambda}$. Note that the Hilbert space $\mathcal H_\lambda$ is dense in the $L^2$ space and $\boldsymbol h_f - \boldsymbol t_f$ is negligible for any fixed $f$. The output of our algorithm satisfies $\|\hat{\boldsymbol t}_n - \boldsymbol t_{f_0}\|_{\mathcal H_\lambda} = O(\tfrac{1}{n})$. However, this does not directly imply the desired result $\|\hat{\boldsymbol h}_n - \boldsymbol h_{f_0}\|_{L^2_{f_0}} = O(\tfrac{1}{n})$, because we only have
$$d_B(\boldsymbol t_{f_1}, \boldsymbol t_{f_2}) \le d_A(\boldsymbol h_{f_1}, \boldsymbol h_{f_2}), \quad (2.2.14)$$
but not $d_A(\boldsymbol h_{f_1}, \boldsymbol h_{f_2}) \le C_0\, d_B(\boldsymbol t_{f_1}, \boldsymbol t_{f_2})$, which would be the mark of a quasi-isometry between topologies A and B. What we show is that if $f$ is a well-behaved distribution, then the following reverse-direction bound also holds for our estimator $\hat{\boldsymbol h}_n$ when the bandwidth $\lambda = n^{-\frac{1}{K+2}} \to 0$ as $n \to \infty$:
$$\mathbb{E}_{\boldsymbol x_1, \dots, \boldsymbol x_n \sim f_0}\, \frac{1}{n} \sum_{i=1}^n \big\| \hat{\boldsymbol h}_n[\boldsymbol x_1, \dots, \boldsymbol x_n](\boldsymbol x_i) - \boldsymbol h_{f_0}(\boldsymbol x_i) \big\|_2^2 = O_p\!\left( \frac{\log^{K+1+3d/2}(n)}{n^{1/(2K+4)}} \right). \quad (2.2.15)$$
The argument exploits the stochastic continuity of $\hat{\boldsymbol t}_n$, which is a by-product of our convex optimization criterion, together with regularity conditions on the tails of $f_0$; the proof rests on the asymptotic quasi-geodesic relationship between the two topologies A and B. It has been conjectured in Liu et al. (2016b), and suggested by several other heuristic observations on kernel methods, that $d_B$ converges to $d_A$ as $\lambda \to 0$. Our proof formalizes these conjectures and heuristic assumptions for the deconvolution problem.

2.2.4 Theoretical properties of $\hat{\boldsymbol h}_n$ and the data-driven NIT

This section investigates the theoretical properties of the shrinkage factor that appears in the decision rule (2.1.1). In Theorem 2.1 we establish the asymptotic consistency of our target estimator $\hat{\boldsymbol h}_n$. We impose the following regularity conditions, which will aid our technical analysis throughout this section.

Assumption 2.1. The density $f$ is differentiable and Lipschitz continuous, i.e., there exists a constant $L'_f > 0$ such that $|f(\boldsymbol x) - f(\boldsymbol x')| \le L'_f \|\boldsymbol x - \boldsymbol x'\|_2$.

Assumption 2.2. For some large $n$ and small $\epsilon > 0$, when $\|\boldsymbol x_{n_1} - \boldsymbol x_{n_2}\|_2 \le \epsilon$, there exists $\hat L_{n-1} > 0$, depending on $\boldsymbol x_1, \dots, \boldsymbol x_n$ except $\boldsymbol x_{n_1}$, such that $\|\hat{\boldsymbol h}_n(\boldsymbol x_{n_1}) - \hat{\boldsymbol h}_n(\boldsymbol x_{n_2})\|_2 \le \hat L_{n-1} \|\boldsymbol x_{n_1} - \boldsymbol x_{n_2}\|_2$, $\|\hat{\boldsymbol h}_n(\boldsymbol x_{n_1}) - \hat{\boldsymbol h}_{n-1}(\boldsymbol x_{n_2})\|_2 \le \hat L_{n-1} \|\boldsymbol x_{n_1} - \boldsymbol x_{n_2}\|_2$, and $\mathbb{E} \hat L_{n-1}^2 < \infty$, where $\hat{\boldsymbol h}_{n-1}(\boldsymbol x)$ is the empirical estimator based on $\boldsymbol x_1, \dots, \boldsymbol x_n$ except $\boldsymbol x_{n_1}$.

Assumption 2.3. There exists a constant $d > 0$ such that $\|\boldsymbol h_f(\boldsymbol x)\|_2 = O(\|\boldsymbol x\|_2^d)$ and $\|\hat{\boldsymbol h}_n(\boldsymbol x)\|_2 = O(\|\boldsymbol x\|_2^d)$.

Assumption 2.4. There exists a constant $\gamma > 0$ such that $P(\|\boldsymbol x\|_2 > t) \le e^{-\gamma t}$.

Assumption 2.1 is a basic condition on a continuous density function. Owing to technical difficulties, Assumption 2.2 further requires that the empirical estimator be Lipschitz continuous.
In Assumption 2.3 and 2.4, we consider the case when the distribution f is sub-exponential. In next theorem, we establish the bound on the population of dierence between our target estimator ^ h h h n and the true score functionh h h f under L 1 and L 2 norm. Theorem 2.1. Under assumptions 2.1-2.4, if = n 1 K+2 , when n is large, we have E x x x 1 ;;x x xnf h 1 n n X i=1 k ^ h h h n (x x x i )h h h f (x x x i )k 1 i =O log 2d+2K+2 (n) n 1=(K+2) ; (2.2.16) and E x x x 1 ;;x x xnf h 1 n n X i=1 k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 i =O log 3d+2K+2 (n) n 1=(K+2) : (2.2.17) Under the same assumptions, similar error bound is achieved for empirical version. Theorem 2.2. Under assumptions 2.1-2.4 , if = n 1 K+2 , when n is large, we have 1 n n X i=1 k ^ h h h n (x x x i )h h h f (x x x i )k 1 =O p log 2K+2d+2 (n) n 1=(K+2) ; (2.2.18) and 1 n n X i=1 k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 =O p log 2K+3d+2 (n) n 1=(K+2) : (2.2.19) Finally, we have the following results on estimation error. Theorem 2.3. Under assumptions 2.1-2.4 , if = n 1 K+2 , then L 1 n ( ^ neb ; )L 1 n ( ; ) =O p log 2d+2K+2 (n) n 1=(K+2) ; (2.2.20) and L 2 n ( ^ neb ; )L 2 n ( ; ) =O p log K+1+3d=2 (n) n 1=(2K+4) : (2.2.21) 23 Next, we consider the theoretical property of the NIT estimator when the distribution has heavy tail. The Cauchy distribution is considered in the following assumptions. Assumption 2.5. P (kx x xk 2 t) =O( 1 t 1+K ). Assumption 2.6. There exists constant C > 0, such thatkh h h f (x x x)k 2 C and k ^ h h h n (x x x)k 2 C. The following theorems give the theoretical properties of NIT estimator when the distribution has dense tails. Theorem 2.4. Under assumptions 2.1, 2.2, 2.5 and 2.6, if = n 1 K+2 , we have E x x x 1 ;;x x xnf h 1 n n X i=1 k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 i =O n 1 3(K+2) : (2.2.22) Theorem 2.5. Under assumptions 2.1, 2.2, 2.5 and 2.6, if = n 1 K+2 , then L 2 n ( ^ neb ; )L 2 n ( ; ) =O p n 1 6(K+2) : (2.2.23) Next, we establish the asymptotic consistency of validation loss which is dened in Section 2.1.3. Our asymptotic consistency of validation loss is shown in the following proposition. Proposition 2.3. If = n 1 K+2 , under assumptions 2.1-2.4 or 2.1-2.2, 2.5-2.6, we have for all , lim !0 lim n!1 n ^ L(;)L 2 n ( ^ neb ; ) o = 0: (2.2.24) 2.2.5 Benets and caveats in exploiting auxiliary data Consider a toy example where the latent vector is drawn from a two{point mixture model, with .5 chance of a point mass at 0 and .5 chance of a point mass at 2, i.e. i 0:5 f0g +0:5 f2g . The mean vectors are simulated as i = i + y;i and k;i = i + k;i , 1 k K with y;i ; k;i i.i.d. N (0; 1). Then generateY i N ( i ; 1) andS k;i N ( k;i ; 1), 1kK. We varyK from 1 to 5 and compare the performance of oracle NIT procedure and data-driven NIT procedure in Figure (2.2). We see that whenK increases, the gap between oracle and data-driven NIT procedures increases, which is illustrated in equation (2.2.23). However, by incorporating more side information as K increase, 24 Figure 2.2: The performance of oracle NIT procedure and data-driven NIT procedure with dierent number of auxiliary sequences the eciency gain of the oracle NIT procedure increases, which leads to lower MSE of data-driven NIT procedure. The amount of eciency gain of Bayes rule (2.1.1) depends on the usefulness of the side in- formation. We use the Fisher information to measure the eciency gain of side information. 
The Fisher information of the primary observation is defined by
$$I(f_y) = \int \left\{ \frac{f'(y)}{f(y)} \right\}^2 f(y)\, dy. \quad (2.2.25)$$
Similarly, we can define the Fisher information of the conditional density:
$$I(f_{y \mid \boldsymbol s}) = \int \left\{ \frac{\nabla_y f(y \mid \boldsymbol s)}{f(y \mid \boldsymbol s)} \right\}^2 f(y, \boldsymbol s)\, dy\, d\boldsymbol s. \quad (2.2.26)$$
The following proposition quantifies the efficiency gain from using side information.

Proposition 2.4. The efficiency gain of side information is given by
$$B_n\{\boldsymbol\delta^{TF}(\boldsymbol y)\} - B_n\{\boldsymbol\delta(\boldsymbol y, \boldsymbol S)\} = \sigma_y^4 \left\{ I(f_{y \mid \boldsymbol s}) - I(f_y) \right\}. \quad (2.2.27)$$
Moreover,
$$B_n\{\boldsymbol\delta^{TF}(\boldsymbol y)\} - B_n\{\boldsymbol\delta(\boldsymbol y, \boldsymbol S)\} \ge 0, \quad (2.2.28)$$
where the equality is attained if and only if the primary observation $y$ is independent of the side information $\boldsymbol s$.

Inequality (2.2.28) demonstrates the robustness of the Bayes rule (2.1.1) even in the worst case, where the side information is non-informative. We provide a simple example to illustrate the power of side information in the shrinkage estimation problem. Assume that the primary observations are $\boldsymbol y$ and there is one auxiliary sequence $\boldsymbol s$, with mean vectors $\boldsymbol\theta$ and $\boldsymbol\mu$, respectively, whose joint distribution is
$$\begin{pmatrix} \theta_i \\ \mu_i \end{pmatrix} \sim N(\boldsymbol\xi, \Sigma), \quad (2.2.29)$$
where $\boldsymbol\xi = (\xi_y, \xi_s)^T$ and $\Sigma = \begin{pmatrix} 1 & \eta \\ \eta & 1 \end{pmatrix}$. Then $y_i$ and $s_i$ are generated as $y_i \sim N(\theta_i, 1)$ and $s_i \sim N(\mu_i, \sigma_s^2)$. Note that $(\theta_i, \mu_i)^T$ can equivalently be generated by first simulating a bivariate standard normal latent vector and then applying a location-scale transformation, so this simple example is covered by models (1.1.1) and (1.1.2), and $\eta$ controls the usefulness of the side information. We consider the following two scenarios to show the power of side information. First, we vary $\sigma_s$ from 0 to 1 and fix $\eta = 0.8$, which illustrates the power of side information under different noise levels of the auxiliary data. In the second scenario, we vary $\eta$ from 0 to 1 and fix $\sigma_s = 0.5$. The corresponding efficiency gains are shown in Figure 2.3.

[Figure 2.3: Percentage of efficiency gain. Panel (a): Scenario 1, efficiency gain as a function of $\sigma_s$. Panel (b): Scenario 2, efficiency gain as a function of $\eta$.]

The oracle decision rule (2.1.1) cannot be implemented in practice. We discuss in Section 2.1.2 our finite-sample approach for implementing the oracle decision rule (2.1.1).

2.3 Simulation

We consider three different settings where the structural information is encoded in (a) one given auxiliary sequence that shares information with the primary sequence through a common latent vector (Section 2.3.1); (b) one auxiliary sequence constructed within the same data to capture the sparsity structure of the primary sequence (Section 2.3.2); (c) multiple auxiliary sequences that share a common structure with the primary sequence (Section 2.3.3). We conduct simulations to compare the following methods:

- The James-Stein estimator (JS).
- The empirical Bayes Tweedie (EBT) estimator implemented using kernel smoothing as described in Brown and Greenshtein (2009).
- The non-parametric maximum likelihood estimator (NPMLE), which implements Tweedie's formula using the convex optimization approach (Koenker and Mizera, 2014); the method is implemented by the R package "REBayes" (Koenker and Gu, 2017).
- Empirical Bayes with cross-fitting (EBCF) by Ignatiadis and Wager (2019).
- The oracle NIT procedure (2.1.1) with known $f(y \mid \boldsymbol s)$ (NIT.OR).
- The data-driven NIT procedure (2.1.8), obtained by solving the convex program (NIT.DD).

The last three methods, which utilize auxiliary data, are expected to outperform the first three methods when the side information is highly informative. The MSE of NIT.OR is provided as the benchmark for assessing the efficiency of the various methods.
27 In the implementation of NIT.DD, we impose generalized Mahalanobis distance, discussed in Section 2.1.3, to compute the RBF kernel with bandwidth . A data-driven choice of the tuning parameter is obtained by rst solving optimization problems 2.1.4 over a grid of values and then computing the corresponding modied cross-validation (MCV) loss. We choose the with minimum MCV loss as the data-driven bandwidth. 2.3.1 Simulation 1: integrative estimation with one auxiliary sequence Let = ( i : 1in) be a latent vector obeying a two-point normal mixture: i 0:5N (0; 1) + 0:5N (1; 1): The primary data Y Y Y = (Y i : 1 i n) in the target domain are simulated according to the following hierarchical model: i N ( i ; 2 ); Y i N ( i ; 1): By contrast, the auxiliary dataS S S = (S i : 1in) obeys i N ( i ; 2 ); S i N ( i ; 2 s ): Hence both the primary parameter i and auxiliary parameter i are related to a common latent variable i , with controlling the amount of common information shared by i and i . We further use s to re ect the noise level when collecting data in the source domain. The auxiliary sequenceS S S becomes more useful when both and s decrease. We consider the following settings to investigate the impact of , s and sample size n on the performance of dierent methods. Setting 1: we x n = 1000 and 0:1, then vary s from 0:1 to 1. Setting 2: we x n = 1000 and s 1, then vary from 0:1 to 1. Setting 3: we x s 0:5 and 0:5, then vary n from 100 to 1000. Finally we consider a setup where the auxiliary sequence is a binary vector. In the implementa- tion of NIT.DD for categorical variables, we use indicator function to compute the pairwise distance 28 between categorical variables. Precisely, assume that s i and s j are two categorical variables, then the distance d(s i ;s j ) = 1(s i =s j ). Setting 4: Let = ( i : 1 i n) be a latent vector obeying a Bernoulli distribution i Bernoulli(p). The primary sequence in the target domain is generated according to a hierarchical model: i N (2 i ; 0:25); y i N ( i ; 1): The auxiliary vector is a noisy ver- sion of the latent vector: s i (1 i )Bernoulli(0:05) + i Bernoulli(0:9). We xn = 1000 and vary p from 0:05 to 0:5. We apply dierent methods to data simulated according to the above models and compute the MSEs using 100 replications. The simulation results for Settings 1-4 are displayed in Figure 2.4. We summarize some important patterns of the plots and provide interpretations. (a). The integrative methods (NIT.DD, EBCF) outperform univariate methods (JS, NPMLE, EBT) that do not utilize side information in most settings. NIT.DD uniformly dominates EBCF. The eciency gain is substantial in many settings. (b). Settings 1-2 shows that the eciency gain of the integrative methods decreases when and s increase (e.g. the auxiliary data become less informative or noisier). (c). Setting 3 shows that the sample size has big impacts on integrative empirical Bayes estimation. A large sample size is essential for eectively incorporating the side information. The EBCF may underperform univariate methods when n is small. (d). The gap between NIT.OR and NIT.DD narrows whenn increases, the performance of NIT.OR is asymptotically achieved by NIT.DD for n greater than 600. (e). Setting 4 shows that the side information can be highly informative even when the types of primary and auxiliary data do not match. 2.3.2 Simulation 2: Integrative estimation in two-sample inference of sparse means This section considers compound estimation in two-sample inference. 
Let X 1i and X 2i be two Gaussian random variables. Denote 1i = E(X 1i ) and 2i = E(X 2i ), 1 i n. Suppose we are 29 (a) Setting 1 (b) Setting 2 (c) Setting 3 (d) Setting 4 Figure 2.4: Simulation results for one given auxiliary sequence. 30 interested in estimating the dierences =f 1i 2i : 1ing. The primary statistic is given by Y Y Y =fX 1i X 2i : 1ing. However, it is argued in Cai et al. (2019) that the primary statisticY Y Y is not a sucient statistic.Consider the case where both 1 and 2 are individually sparse. Then an important fact is that the union supportU =fi : 1i 6= 0 or 2i 6= 0g is also sparse. The intuition is that the sparsity structure of is captured by an auxiliary parameter =f 1i + 2i : 1ing. Our idea is to construct an auxiliary sequenceS S S =fX 1i +X 2i : 1ing and incorporateS S S into inference to improve the eciency 3 . To illustrate the eectiveness of the integrative estimation strategy, we simulate data according to the following two settings and obtain primary and auxiliary data asY Y Y =fX 1i X 2i : 1in andS S S =fX 1i +X 2i : 1ing. Setting 1: X 1i and X 2i are generated from X 1i N ( 1i ; 1) and X 2i N ( 2i ; 1), where 1 [1 :k] = 2:5; 2 [1 :k] = 1 1 [k + 1 : 2k] = 1; 2 [k + 1 : 2k] = 1 1 [2k + 1 :n] = 0; 2 [2k + 1 :n] = 0 The sparsity level of is controlled by k. We x n = 1000 and vary k from 50 to 450 to investigate the impact of sparsity level on the eciency of dierent methods. Setting 2: X 1i and X 2i are generated from X 1i N ( 1i ; 1) and X 2i N ( 2i ; 1), where 1 [1 :k] = 1; 2 [1 :k] = 1 1 [k + 1 : 500] = 2:5; 2 [k + 1 : 500] = 1 1 [501 :n] = 0; 2 [501 :n] = 0 The primary parameter becomes more sparse when k increases. We x n = 1000 and vary k from 50 to 450 to investigate the eciency gain of the integrative estimation strategy. We apply dierent methods to simulated data and calculate the MSEs using 100 replications. The simulation results are displayed in Figure 2.5. The following patterns can be observed. 3 It can be shown thatf(X1iX2i;X1i +X2i) : 1ing is minimal sucient and retains all information about . 31 (a) Setting 1 (b) Setting 2 Figure 2.5: Two-sample inference of sparse means. (a). The side information provided by the auxiliary sequence can be highly informative for reducing the estimation risk. Our proposed methods (NIT.DD, NIT.OR) have smaller MSEs than competing methods (EBCF, JS, NPMLE, EBT). The eciency gain over univariate methods (JS, EBT, NPMLE) is more pronounced when signals become more sparse. (b). EBCF is dominated by NIT, and can be inferior to univariate shrinkage methods. (c). The class of linear estimators is inecient under the sparse setting. For example, the NPMLE method dominates the JS estimator, and the eciency gain increases when the signals become more sparse. 2.3.3 Simulation 3: integrative estimation with multiple auxiliary sequences This section considers a setup where auxiliary data are collected from multiple source domains. DenoteY Y Y the primary sequence andS S S j , 1j 4, the auxiliary sequences. In our simulation, we assume that the primary vector Y =E(Y Y Y ) share some common information with auxiliary vectors 32 j S = E(S S S j ), 1 j 4 through a latent vector , which obeys a mixture model with two point masses at 0 and 2 respectively: i 0:5 f0g + 0:5 f2g ; 1in: There can be various ways to incorporate auxiliary data from multiple sources. 
We consider, in addition to NIT.DD that utilizes all sequences, an alternative strategy that involves rstly constructing a new auxiliary sequence S S S = 1 4 (S S S 1 +S S S 2 +S S S 3 +S S S 4 ) to reduce the dimension and secondly applying NIT.DD to the pair (Y Y Y; S S S); this strategy is denoted by NIT1.DD. Intuitively, if all auxiliary sequences share identical side information, then data reduction via S S S is lossless. However, if the auxiliary data are collected from heterogeneous sources with dierent structures and measurement units, then NIT1.DD may distort the side information and lead to substantial eciency loss. To illustrate the benets and caveats of dierent data combination strategies, we rst consider the scenario where all sequences share a common structure via the same latent vector (Settings 1-2). Then we turn to the scenario where the auxiliary sequences share information with the primary data in distinct ways (Settings 3-4). In all simulations below we use n = 1000 and 100 replications. Setting 1: The primary and auxiliary data are generated from the following models: Y i = Y i + Y i ; S j i = j i + j i ; (2.3.30) where Y i N ( i ; 2 ), j i N ( i ; 2 ), 1j 4, Y i N (0; 1) and j i N (0; 2 s ), 1in. We x = 0:5 and vary s from 0:1 to 1. Setting 2: the data are generated using the same models as in Setting 1 except that we x s = 0:5 and vary from 0:1 to 1. Setting 3: We generateY Y Y andS S S j using model (2.3.30). However, we now allow j to have dierent structures across j. Specically, let 1 [1 : 500] = [1 : 500]; 1 [501 :n] = 0; 2 [1 : 500] = 0; 2 [501 :n] = [501 :n]: 33 The following construction implies that only the rst two sequences are informative in infer- ence: Y i N ( 1 i ; 2 ); j i N ( 1 i ; 2 );j = 1; 2; j i N ( 2 i ; 2 );j = 3; 4: We x = 0:5 and vary s from 0:1 to 1. Setting 4: the data are generated using the same models as in Setting 3 except that we x s = 0:5 and vary from 0:1 to 1. We apply dierent methods to simulated data and summarized the results in Figure 3. We make the following remarks. (a). The univariate methods (JS, NPMLE, EBT) are dominated by the integrative methods (NIT.DD, NIT.OR, EBCF, NIT1.DD). The eciency gain is more pronounced when and s are small. (b). EBCF is dominated by NIT.DD. Compared to the setting with one auxiliary sequence, the gap between the performances of NIT.OR and NIT.DD has widened because of the increased complexity of the estimation problem in higher dimensions. (c). In Settings 1-2, NIT1.DD is more ecient than NIT.DD as there is no loss in data reduction and fewer sequences are utilized in estimation. (d). In Settings 3-4, the average S S S does not provide an eective way to combine the information in auxiliary data. Since the last two sequences are not useful, such a data reduction step leads to substantial information loss. NIT1.DD still outperforms univariate methods but is much worse than EBCF and NIT.DD. The simulation results show that it is potentially benecial to reduce the dimension of auxiliary data. However, there can be signicant information loss if the data reduction step is carried out improperly. It would be of interest to develop principled methods for data reduction for extracting structural information from a large number of auxiliary sequences. 2.4 Applications This section compares NIT and its competitors on gene expression data and monthly sales data. 34 (a) Setting 1 (b) Setting 2 (c) Setting 3 (d) Setting 4 Figure 2.6: Integrative estimation with multiple auxiliary sequences. 
35 2.4.1 Integrative non-parametric estimation of gene expressions The data set in Sen et al. (2018) measures expression levels of n = 3; 000 genes from cells that are without interferon alpha (INFA) protein and have been infected with varicella-zoster virus (VZV), which is known to cause chickenpox and shingles in humans (Zerboni et al., 2014). Interferons in uence host to defense against VZV and so, it is important to estimate the gene expressions in infected cells with regulated INFA; further details about the dataset is provided in sec. 2.6.1 of the supplement. There were two vectors (samples) of gene expressions from these concerned in ected cell- types.We use one vector, denoted Y Y Y , to construct the estimates and the other, denoted ~ Y Y Y , for validation. Alongside the primary data, we also collect auxiliary data, denoted S S S U andS S S I , which respectively correspond to gene expression levels for uninfected cells and infected cells without INFA regulation. We consider estimating the expression levels of n = 3000 genes based on Y Y Y;S S S U and S S S I . Fig- ure 2.7 shows the heatmap of the primary, the auxiliary and the validation sequences. We im- plemented the following estimators (a) the modied James-Stein (JS) following Xie et al. (2012), (b) Non-parametric Tweedie estimator without auxiliary information, (c) Empirical Bayes with cross-tting (EBCF) by Ignatiadis and Wager (2019) and the Non-parametric Integated Tweedie (NIT) with auxiliary infomation: (d) with S S S U only, (e) with S S S I only (f) using both auxiliary se- quences. The mean square prediction errors of the above estimates were computed with respect to the validation vector ~ Y Y Y . Table 2.1 reports the percentage gain acheived over the naive unshrunken estimator that uses Y Y Y to estimate . It shows that non-parametric shrinkage produces an additional 0.6% gain over parametric JS and using auxiliary information via NIT yields a further 5.2% gain. In particular, NIT method outperforms EBCF by 1.7% gain, which also leverage side information from both S U andS I . Figure 2.7B shows the dierences between the Tweedie and NIT estimates. The dierences are more pronounced in the left tails where Tweedie estimator is seen to overestimate compared to NIT. The JS and NIT eective size estimates disagree by more than 50% at 28 genes (which are listed in supplement g 2.9 ). These genes impact 35 biological processes and 12 molecular functions 36 in human cells (see supplement g 2.9 bottom two panels) implying that important inferential gains are made by using auxiliary information via our proposed NIT methodology. Figure 2.7: Panel A: Heatmaps of the gene expression datasets showing the four expression vectors corresponding to the observed, validation and auxiliary sequences. Panel B: scatterplot of the eect size estimates of 3000 genes based on Tweedie and NIT (using both S U andS I ). Magnitude of the auxiliary variables used in the NIT estimate is re ected by dierent colors. Table 2.1: % gain in prediction errors by dierent estimators over the naive unshrunken estimator of gene expressions of INFA regulated infected cells. Methods James-Stein Tweedie EBCF usingS S S U &S S S I NIT usingS S S U NIT usingS S S I NIT usingS S S U &S S S I % Gain 3.5 4.1 7.6 6.9 7.5 9.3 MSE 2.014 2.001 1.927 1.930 1.951 1.895 2.4.2 Leveraging auxiliary information in predicting monthly sales We consider the total monthly sales of beers across n = 866 stores of a retail grocery chain. 
These stores are spread across dierent states in the USA (see g 2.10 in supplements). The data is extracted from Bronnenberg et al. (2008) which is widely studied for interventory management and consumer preference analysis (see Bronnenberg et al., 2012; Banerjee et al., 2018 and the references therein). Let Y Y Y t be the n dimensional vector denoting the monthly sales of beer across the n stores in month t2f1;:::; 12g. For inventory planning, it is economically important to estimate future demand. In this context, we consider estimating the monthly demand vector (across stores) for month t using the previous month's sales Y Y Y t1 . We use the rst six months t = 1;:::; 6 for estimating store demand variabilities ^ 2 i ;i = 1;:::;n. For t = 7;:::; 12, using estimators based on 37 montht's sales, we calculate their demand prediction error for montht+1 by using its monthly sale data for validation. Among the estimators, we introspect the modied James-Stein (JS) estimator of Xie et al. (2012): ^ t+1 i [JS] = c JS t i + 1 n 3 P i ^ 2 i (Y t i c JS t i ) 2 + (Y t i c JS t i ) where c JS t i = P n i=1 2 i Y t i P n i=1 2 i ; as well as the Tweedie (T) estimator ^ t+1 i [T] =Y t i + ^ i ^ h i where ^ h i are estimates ofr 1 logf(^ 1 i Y t i ) based on the marginal density of standardized sales. We also consider the sales of three other products: milk, deodorant and hotdog from these stores. They are not directly related to the sale of beers but they might contain possibly useful information regarding consumer preferences to beers particularly as they share zip-code and other store specic responses. We use them as auxiliary sequences in our NIT methodology. Figure 2.8 shows the distribution of beer sales (across stores) Figure 2.8: Distribution of monthly sales of beer across stores (on left) and the pairwise distribution of joint sales of dierent products in the month of July (in right). for dierent months and the pairwise distribution of the sales of dierent products. Further details about the dataset is provided in the supplement Sec 2.6.2. In table 2.2, we report the average % gain in predictive error by the JS, T and IT estima- tors (using dierent combinations of auxiliary sequences) over the naive unshrunken estimator ^ t+1;naive =Y Y Y t for the demand prediction problem at t = 7;:::; 12. The detailed month-wise gains and the loss function are provided in the supplement. Using auxiliary variables via our proposed IT 38 framework yields signicant additional gains over non-integrative methods. However, the improve- ment slackens as increasing auxiliary sequences are incorporated. It is to be noted that demand data is highly complex and heterogeneous and n = 866 might not be adequate large for successful non-parametric estimation. As such suitably anchored parametric JS estimator produced better prediction that non-parmatric Tweedie. Also, as demonstrated in table 2.4 of the supplement, there are months where shrinkage estimation does not yield positive gains. Nevertheless, the IT methodology proposed here produced signicant advantage in the repeated trials. It produces on av- erage 7.7% gain over unshrunken methods and attains an additional 3.7% gain over non-parametric shrinkage methods. Table 2.2: Average % gains over the naive unshrunken estimator for monthly beer sales prediction JS Tweedie IT-Milk IT-Deodorant IT-Hotdog IT-M&D IT-M&H IT-D&H IT-M&D&H 5.7 4.0 6.0 7.1 6.8 6.1 6.6 7.5 7.7 2.5 Proofs 2.5.1 Proof of Lemma 2.1 Proof. 
Using the fact that f(y;s s s) = R f(y;s s sj)dh () and f(y;s s sj) = f(yjs s s;)f(s s sj), we expand the partial derivative of f(y;s s s): r y f(y;s s s) = 2 Z f(yjs s s;)f(s s sj)dh ()y Z f(yjs s s;)f(s s sj)dh () = 2 Z f(y;s s sj)dh ()yf(y;s s s) : Then left-multiplying by 2 and dividing by f(y;s s s) on both sides, we get 2 r y f(y;s s s) f(y;s s s) = R f(y;s s sj)dh () f(y;s s sj) y: (2.5.31) 39 According to the Bayes rule: E(jy;s s s) = R f(y;s s sj)dh () f(y;s s s) : (2.5.32) Combining equations (2.5.31) and (2.5.32), we have E(jy;s s s) =y + 2 r y f(y;s s s) f(y;s s s) : 2.5.2 Proof of Proposition 2.2 Proof. By the construction of the ^ h h h n , we have b M ;n ( ^ h h h n ) b M ;n (h h h f ): (2.5.33) Note that b M ;n ( ^ h h h n ) and b M ;n (h h h f ) are V-statistics ofM ( ^ h h h n ) andM (h h h f ) respectively. Taking the expectation on the both sides of equation (2.5.33), we get n 2 n n 2 M ( ^ h h h n ) + n n 2 Ek ^ h h h n k 2 + 1 n 2 n n 2 M (h h h f ) + n n 2 Ekh h h f k 2 + 1 : (2.5.34) Notice thatM (h h h f ) = 0 and then the above inequality implies M ( ^ h h h n ) 1 n 1 Ekh h h f k 2 Ek ^ h h h n k 2 : (2.5.35) 2.5.3 Proof of Theorem 2.1 To prove main theorem we begin by proving an important technical lemmas that will be used in the main proofs. 40 Lemma 2.6. Under assumption (2)-(4) , we have E h k ^ h h hn(x x x i )h h h f (x x x i )k 2 2 f(x x x i ) i C K+1 1 n 1 Ekh h h f k 2 2 Ek ^ h h h n+1 k 2 2 + ( logn) K+3 +Ek ^ h h hn(x x x i )h h h f (x x x i )k 2 2 K+2 logn : (2.5.36) Proof of Lemma 2.6. Proposition 2.2 shows that M ( ^ h h h n ) 1 n1 Ekh h h f k 2 Ek ^ h h h n k 2 . We consider the following decomposition of M ( ^ h h h n ) for any > 0 and i<n, M ( ^ h h h n ) =E h ^ h h h n (x x x i )h h h f (x x x i ) T e 1 2 2 kx x x i x x xnk 2 ^ h h h n (x x x n )h h h f (x x x n ) 1 (kx x x i x x x n k<) i +E h ^ h h h n (x x x i )h h h f (x x x i ) T e 1 2 2 kx x x i x x xnk 2 ^ h h h n (x x x n )h h h f (x x x n ) 1 (kx x x i x x x n k) i : First we control the second term of right hand side. Let = logn, then e 1 2 2 kx x x i x x xnk 2 1 (kx x x i x x x n k) is always bounded by n 0:5 logn , which implies that the second term is bounded by Ekh h h f ^ h h h n k 2 n 0:5 logn using Cauchy Schwarz inequality. When n is large, we will have Ekh h h f ^ h h h n k 2 n 0:5 logn is bounded by 1 n1 Ekh h h f k 2 Ek ^ h h h n k 2 . Then the above decomposion of M ( ^ h h h n ) implies that E h ^ h h h n (x x x i )h h h f (x x x i ) T e 1 2 2 kx x xix x xnk 2 ^ h h h n (x x x n )h h h f (x x x n ) 1 (kx x x i x x x n k<) i 2 n 1 Ekh h h f k 2 Ek ^ h h h n k 2 : (2.5.37) Next, we consider the following further decomposition: E h k ^ h h h n (x x x i )h h h f (x x x i )k 2 e 1 2 2 kx x xix x xnk 2 1 (kx x x i x x x n k<) i =E h ^ h h h n (x x x i )h h h f (x x x i ) T e 1 2 2 kx x xix x xnk 2 ^ h h h n (x x x n )h h h f (x x x n ) 1 (kx x x i x x x n k<) i +E h ^ h h h n (x x x i )h h h f (x x x i ) T e 1 2 2 kx x xix x xnk 2 ^ h h h n (x x x i )h h h f (x x x i ) ^ h h h n (x x x n )h h h f (x x x n ) T 1 (kx x x i x x x n k<) i : (2.5.38) The rst term of right hand side is already bounded in inequality (2.5.37). Let's consider the second term. 
Using Assumption and the constraint thatkx x x i x x x n k<, when is small, we have k ^ h h h n (x x x i )h h h f (x x x i ) ^ h h h n (x x x n )h h h f (x x x n ) k (L f + ^ L n1 ): (2.5.39) 41 Then use the inequality (2.5.39) and Cauchy Schwarz inequality, we get the following bound on the second term of decomposition 2.5.38 E h ^ h h h n (x x x n )h h h f (x x x n ) ^ h h h n (x x x i )h h h f (x x x i ) T e 1 2 2 kx x xix x xjk 2 ^ h h h n (x x x i )h h h f (x x x i ) 1 (kx x x i x x x n k<) i E h (L f + ^ L n1 )e 1 2 2 kx x xix x xnk 2 k ^ h h h n (x x x i )h h h f (x x x i )k1 (kx x x i x x x n k<) i n E (L f + ^ L n1 )1 (kx x x i x x x n k<) 2 E e 1 2 kx x xix x xnk 2 k ^ h h h n (x x x i )h h h f (x x x i )k 2 1 (kx x x i x x x n k<) o1 2 : (2.5.40) Note that ^ L n1 is not depend onx x x n . ThenE (L f + ^ L n1 )1 (kx x x i x x x n k<) 2 can be simplied as E(L f + ^ L n1 ) 2 E x x xn 1 (kx x x i x x x n k<). Using the fact that p(x x x n )C p from the assumption, we can show that E x x xn 1 (kx x x i x x x n k<) C f (K+1)=2 ( K+1 2 +1) K+1 , where (x) is the gamma function. Thus we have the following bound: E (L f + ^ L n1 )1 (kx x x i x x x n k<) 2 E (L f + ^ L n1 ) 2 C f (K+1)=2 ( K+1 2 + 1) K+1 : (2.5.41) Combining equations (2.5.38), (2.5.40) and using the fact that e 1 2 kx x x i x x xnk 2 e 1 2 2 kx x x i x x xnk 2 , we get E h k ^ h h h n (x x x i )h h h f (x x x i )k 2 e 1 2 kx x xix x xjk 2 1 (kx x x i x x x n k<) i E (L f + ^ L n1 ) 2 C f (K+1)=2 ( K+1 2 + 1) K+1 E h k ^ h h h n (x x x i )h h h f (x x x i )k 2 e 1 2 kx x xix x xnk 2 1 (kx x x i x x x n k<) i 1 2 + 2 n 1 Ekh h h f k 2 Ek ^ h h h n k 2 : (2.5.42) Solving this quadratic inequality for E h k ^ h h h n (x x x i )h h h f (x x x i )k 2 e 1 2 kx x x i x x x j k 2 1 (kx x x i x x x n k<) i , we have the follwing bound: E h k ^ h h h n (x x x i )h h h f (x x x i )k 2 e 1 2 kx x x i x x x j k 2 1 (kx x x i x x x n k<) i E (L f + ^ L n1 ) 2 C f (K+1)=2 ( K+1 2 + 1) K+3 + 4 n 1 Ekh h h f k 2 Ek ^ h h h n k 2 : (2.5.43) 42 Finally, we consider the following decomposition: E h k ^ h h h n1 (x x x i )h h h f (x x x i )k 2 e 1 2 2 kx x x i x x xnk 2 1 (kx x x i x x x n k<) i E h k ^ h h h n (x x x i )h h h f (x x x i )k 2 e 1 2 2 kx x x i x x xnk 2 1 (kx x x i x x x n k<) i +E h k ^ h h h n1 (x x x i ) ^ h h h n (x x x i )k 2 e 1 2 2 kx x x i x x xnk 2 1 (kx x x i x x x n k<) i : (2.5.44) Our goal is to prove the desired bound on the left hand side through analyzing the two terms from the right hand side. The rst term is bounded in equation (2.5.43) and we can show that the second term is bounded by 2 2 E ^ L 2 n1 1 (kx x x i x x x n k<) since assumption (2.2) implies thatk ^ h h h n1 (x x x i ) ^ h h h n (x x x i )k 2 ^ L n kx x x n x x x n1 k. Note that ^ L n1 is independent of x x x n . Then E ^ L 2 n1 1 (kx x x i x x x n k<) can be simplies as E ^ L 2 n1 E x x xn 1 (kx x x i x x x n k<) where E ^ L 2 n1 is a nite constant. Using the fact thatf(x x x n )C f from the assumption, we can show thatE x x xn 1 (kx x x i x x x n k<)C f (K+1)=2 ( K+1 2 +1) K+1 . Then combining equation (2.5.43), we have E h k ^ h h h n1 (x x x i )h h h f (x x x i )k 2 e 1 2 2 kx x xix x xnk 2 1 (kx x x i x x x n k<) i =O K+3 + 1 n 1 Ekh h h f k 2 Ek ^ h h h n k 2 : (2.5.45) Next, we assume there are n + 1 i.i.d. samples from f(x x x): x x x 1 ; ;x x x n+1 . 
Equation 2.5.45 shows that E h k ^ h h h n (x x x i )h h h f (x x x i )k 2 e 1 2 2 kx x xix x xn+1k 2 1 (kx x x i x x x n+1 k<) i C K+3 + 1 n Ekh h h f k 2 Ek ^ h h h n+1 k 2 : (2.5.46) 43 Notice that the estimator ^ h h h n is independent ofx x x n+1 andi<n + 1. Hence the left hand side can be simplied asEk ^ h h h n (x x x i )h h h f (x x x i )k 2 E x x x n+1 h e 1 2 2 kx x x i x x x n+1 k 2 1 (kx x x i x x x n+1 k<) i . By the assumption of the theorem,jf(x x x n+1 )f(x x x i )jL 0 f kx x x n+1 x x x i k. It follows that for anyx x x i , Z f(x x x i )e 1 2 2 kx x x i x x x n+1 k 2 1 (kx x x i x x x n+1 k<)dx x x n+1 =E x x x n+1 h e 1 2 2 kx x x i x x x n+1 k 2 1 (kx x x i x x x n+1 k<) i + Z f(x x x i )f(x x x n+1 ) e 1 2 2 kx x x i x x x n+1 k 2 1 (kx x x i x x x n+1 k<)dx x x n+1 E x x x n+1 h e 1 2 2 kx x x i x x x n+1 k 2 1 (kx x x i x x x n+1 k<) i +L 0 f Z e 1 2 2 kx x x i x x x n+1 k 2 1 (kx x x i x x x n+1 k<)dx x x n+1 : (2.5.47) Hence, multiplying byk ^ h h h n (x x x i )h h h f (x x x i )k 2 and taking the expectation yields Ek ^ h h h n (x x x i )h h h f (x x x i )k 2 f(x x x i ) Z e 1 2 2 kx x x i x x x n+1 k 2 1 (kx x x i x x x n+1 k<)dx x x n+1 Ek ^ h h h n (x x x i )h h h f (x x x i )k 2 E x x x n+1 h e 1 2 2 kx x x i x x x n+1 k 2 1 (kx x x i x x x n+1 k<) i +L 0 p Ek ^ h h h n (x x x i )h h h f (x x x i )k 2 Z e 1 2 2 kx x x i x x x n+1 k 2 1 (kx x x i x x x n+1 k<)dx x x n+1 : (2.5.48) Now, we compute R e 1 2 2 kx x x i x x x n+1 k 2 1 (kx x x i x x x n+1 k<)dx x x n+1 . Since = logn , notice that Z e 1 2 2 kx x x i x x x n+1 k 2 1 (kx x x i x x x n+1 k<)dx x x n+1 = K+1 Z e 1 2 kx x xk 2 1 (kx x xk< logn)dx x x: (2.5.49) Whenn!1, R e 1 2 kx x xk 2 1 (kx x xk< logn)dx x x! R e 1 2 kx x xk 2 dx x x. LetC 1 = R e 1 2 kx x xk 2 dx x x. Hence, when n is large enough, R e 1 2 2 kx x x i x x x n+1 k 2 1 (kx x x i x x x n+1 k<)dx x x n+1 is between 1 2 C 1 K+1 and C 1 K+1 . Therefore, combining (2.5.46) and (2.5.48), E h k ^ h h h n (x x x i )h h h f (x x x i )k 2 f(x x x i ) i C K+1 1 n Ekh h h f k 2 2 Ek ^ h h h n+1 k 2 2 + ( logn) K+3 +Ek ^ h h h n (x x x i )h h h f (x x x i )k 2 K+2 logn : (2.5.50) In proving the theorem, the constant C may represent dierent values throughout the proof. 44 Proof. Using Lemma 2.6, we have E h k ^ h h hn(x x xi)h h h f (x x xi)k 2 2 f(x x xi) i C K+1 1 n Ekh h h f k 2 2 Ek ^ h h hn+1k 2 2 + ( logn) K+3 +Ek ^ h h hn(x x xi)h h h f (x x xi)k 2 2 K+2 logn : (2.5.51) Next, we will derive the bound on E h k ^ h h h n (x x x i )h h h f (x x x i )k 1 i . If we restrict x x x i infkx x x i k 2 2 logng for some constant C, using Cauchy-Schwarz inequality, we get E h k ^ h h h n (x x x i )h h h f (x x x i )k 1 1 (kx x x i k 2 2 logn) i n E h k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 f(x x x i ) io1 2 Z 1 (kx x x i k 2 2 logn)dx x x i 1 2 : (2.5.52) Notice that R 1 (kx x x i k 2 2 logn)dx x x i =C log K+1 n. 
Meanwhile, sincep(x x x i ) is sub-exponential with parameter , when n is large enough, E h k ^ h h h n (x x x i )h h h f (x x x i )k 1 1 (kx x x i k 2 > 2 logn) i E h (K + 1)kx x x i k d 2 1 (kx x x i k 2 > 2 logn) i Cn 1 : (2.5.53) Combining 2.5.51, 2.5.52 and 2.5.53, we derive the following bound: h Ek ^ h h hn(x x xi)h h h f (x x xi)k1 i 2 C logn K+1 1 n 1 Ekh h h f k 2 2 Ek ^ h h hn+1k 2 2 + ( logn) K+3 +Ek ^ h h hn(x x xi)h h h f (x x xi)k 2 2 K+2 logn +O(n 2 ): (2.5.54) Similarly, we can show thatEkh h h f k 2 2 Ek ^ h h h n+1 k 2 2 C(K +1) log d (n)Ek ^ h h h n+1 (x x x i )h h h f (x x x i )k 1 +O(n 1 ) and Ek ^ h h h n (x x x i )h h h f (x x x i )k 2 2 C(K + 1) log d nEk ^ h h h n (x x x i )h h h f (x x x i )k 1 +O(n 1 ). Then equation 2.5.55 can be simplied as Ek ^ h h hn(x x xi)h h h f (x x xi)k1 C 2 log K+d+2 n 2 C logn K+1 ( log d nEk ^ h h hn+1(x x xi)h h h f (x x xi)k1 n 1 + 2 K+3 (logn) K+d+3 ) +O(n 2 ): (2.5.55) Noting = (n 1 K+2 ), we can check thatEk ^ h h h n (x x x i )h h h f (x x x i )k 1 ! 0 as n!1. 45 Dene A n = max(Ek ^ h h h n (x x x i )h h h f (x x x i )k 1 ;n 1 K+2 log K+2 n). From equation 2.5.55, we claim that there exists a constant C, such that A n C(logn) K+d+1 n 1 2K+4 p A n+1 : (2.5.56) Applying (2.5.56) recursively, we have A n C(logn) K+d+1 n 1 2K+4 p A n+1 C(logn) K+d+1 n 1 2K+4 q C (log(n + 1)) K+d+1 (n + 1) 1 2K+4 p A n+2 : (2.5.57) Notice that when n is large, (log(n + 1)) K+d+1 (n + 1) 1 2K+4 (logn) K+d+1 n 1 2K+4 . Hence, A n C(logn) K+d+1 n 1 2K+4 1+ 1 2 A 1 4 n+2 : (2.5.58) Apply this inequality recursively, we obtain A n C(logn) K+d+1 n 1 2K+4 1++ 1 2 m A 1 2 m+1 n+m+1 : (2.5.59) SinceEk ^ h h h n (x x x i )h h h f (x x x i )k 1 ! 0 asn!1, by the denition ofA n , we haveA n < 1 whenn is large, which implies for any m> 0, A n C(logn) K+d+1 n 1 2K+4 1++ 1 2 m : (2.5.60) Then, let m!1, we proved that A n C(logn) 2K+2d+2 n 1 K+2 , which implies Ek ^ h h h n (x x x i )h h h f (x x x i )k 1 C(logn) 2K+2d+2 n 1 K+2 : (2.5.61) Next, we show the bound onEk ^ h h h n (x x x i )h h h f (x x x i )k 2 2 . Consider the following decomposition: Ek ^ h h h n (x x x i )h h h f (x x x i )k 2 2 =Ek ^ h h h n (x x x i )h h h f (x x x i )k 2 2 1 (kx x x i k 2 > 2 logn) +Ek ^ h h h n (x x x i )h h h f (x x x i )k 2 2 1 (kx x x i k 2 2 logn): 46 Similar to equation 2.5.53, we have Ek ^ h h h n (x x x i )h h h f (x x x i )k 2 2 1 (kx x x i k 2 > 2 logn) = O(n 1 ) when n is large. On the other hand, Ek ^ h h h n (x x x i )h h h f (x x x i )k 2 2 1 (kx x x i k 2 2 logn) log d nEk ^ h h h n (x x x i )h h h f (x x x i )k 1 1 (kx x x i k 2 2 logn) =C(logn) 2K+3d+2 n 1 K+2 : We proved the desired bound on Ek ^ h h h n (x x x i )h h h f (x x x i )k 2 2 . 2.5.4 Proof of Theorem 2.2 Proof. Here we prove the results for mean square error. The proof for L 1 norm follows the same procedure.For any> 0, letM = 1 anda n = log 2K+2 n n 2=(K+2) , then using Chebyshev's inequality, we have P 1 n n X i=1 k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 >Ma n ! 1 Ma n E 1 n n X i=1 k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 : (2.5.62) Since x x x 1 ; ;x x x n are i.i.d. samples, by Theorem (2.1) we have E 1 n P n i=1 k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 = Ek ^ h h h n (x x x n )h h h f (x x x n )k 2 2 =O(a n ). Notice that M = 1 . Hence 1 Ma n E 1 n n X i=1 k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 =O(): (2.5.63) Then we proved that 1 n n X i=1 k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 =O p log 2K+2 n n 2=(K+2) : (2.5.64) 2.5.5 Proof of Theorem 2.3 Proof. 
Here we give the proof for L 2 norm. The proof for L 1 norm follows the same procedure.By denition of ^ neb and , L 2 n ( ^ neb ; ) = 1 n n X i=1 2 y ^ h 1 i r y logf(y i js s s i ) 2 : (2.5.65) 47 Assume thatx x x i = (y i ;s s s T i ) T , we have 1 n n X i=1 ^ h 1 i r y logf(y i js s s i ) 2 1 n n X i=1 k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 : (2.5.66) Meanwhile, Theorem (2.2) shows that 1 n n X i=1 k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 =O p log 2K+2 n n 2=(K+2) : (2.5.67) It follows thatL 2 n ( ^ neb ; ) =O p log 2K+2 n n 2=(K+2) . Next, using Triangular inequality, we have q L 2 n ( ^ neb ; ) q L 2 n ( ; ) q L 2 n ( ^ neb ; ) =O p log K+1 n n 1=(K+2) : (2.5.68) Hence, q L 2 n ( ^ neb ; )! q L 2 n ( ; ) as n!1. It follows that L 2 n ( ^ neb ; )L 2 n ( ; ) = q L 2 n ( ^ neb ; ) q L 2 n ( ; ) q L 2 n ( ^ neb ; ) + q L 2 n ( ; ) 3 q L 2 n ( ; ) q L 2 n ( ^ neb ; ) q L 2 n ( ; ) =O p log K+1 n n 1=(K+2) : (2.5.69) 2.5.6 Proof of Theorem 2.4 Proof. Denote = 1 3(K+1)(K+2) . We consider the following decomposition: E h k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 i =E h k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 1 (kx x xk 2 n ) i +E h k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 1 (kx x xk 2 >n ) i : 48 Under the conditions of the theorem, we haveE h k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 1 (kx x xk 2 >n ) i =O(n (K+1) ). We next show the bound on the rst term. Applying the Cauchy-Schwarz inequality, we have E h k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 1 (kx x xk 2 n ) i n E h k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 f(x x x i )1 (kx x xk 2 n ) io1 2 Z k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 1 (kx x xk 2 n ) dx x x i 1 2 : (2.5.70) By assumptions, R k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 1 (kx x xk 2 n ) dx x x i Cn (K+1) . Moreover, Lemma 2.6 shows E h k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 f(x x x i ) i C K+1 1 n 1 Ekh h h f k 2 2 Ek ^ h h h n+1 k 2 2 + ( logn) K+3 +Ek ^ h h h n (x x x i )h h h f (x x x i )k 2 2 K+2 logn : Denote A n =E h k ^ h h h n (x x x i )h h h f (x x x i )k 2 2 i . Let = (n 1 K+2 ) and note = 1 3(k+1)(k+2) , then we have A 2 n Cn 2 3(K+2) +Cn 1 3(K+2) h n 1 K+2 p A n+1 +n 1 K+1 A n logn +n 2 K+2 log K+3 (n) i : We can check A n ! 0 as n!1. Thus, A 2 n Cn 2 3(K+2) +Cn 1 3(K+2) h n 1 K+2 +n 1 K+1 logn +n 2 K+2 log K+3 (n) i =O(n 2 3(K+2) ): This concludes the proof of Theorem 2.4. 2.5.7 Proof of Theorem 2.5 Proof. The proof of this theorem is the same as the proof of Theorem 2.3. We refer this proof to proof of Theorem 2.3 and omit it here. 2.5.8 Proof of Proposition 2.3 Proof. By Theorem 2.3 or Theorem 2.5, we have lim n!1 ^ L(;) = lim n!1 L 2 n ( (a a a;S S S);b b b) (2.5.71) 49 and lim n!1 L 2 n ( ^ neb ; ) = lim n!1 L 2 n ( (y y y;S S S); ): (2.5.72) Meanwhile, by law of large numbers, lim n!1 L 2 n ( (a a a;S S S);b b b) =E a;s s s;b ( (a;s s s)b) 2 2 y (1 + 1= 2 ) (2.5.73) and lim n!1 L 2 n ( (y y y;S S S); ) =E y;s s s; ( (y;s s s)) 2 : (2.5.74) Note thatb =+= where;N (0; 2 y ) and by the construction ofa andb,a is conditionally independent of b given . Hence, E a;s s s;b ( (a;s s s)b) 2 =E E a;s s s;bj ( (a;s s s)b) 2 j =E E a;s s s;bj ( (a;s s s)) 2 + (=) 2 j : (2.5.75) the last equation is due to the fact that E(=) = 0. 
SinceE(=) 2 = 2 y (1 + 1= 2 ), then E a;s s s;b ( (a;s s s)b) 2 2 y (1 + 1= 2 ) =E a;s s s; ( (a;s s s)) 2 : (2.5.76) Combine (2.5.71), (2.5.72), (2.5.73), (2.5.74) and (2.5.76), lim n!1 ^ L(;)L 2 n ( ^ neb ; ) =E y;s s s; ( (y;s s s)) 2 E a;s s s; ( (a;s s s)) 2 : (2.5.77) Finally, notice that a =y + and let ! 0, we proved that lim !0 lim n!1 ^ L(;)L 2 n ( ^ neb ; ) = 0: (2.5.78) 50 2.5.9 Proof of Proposition 2.4 Proof. Proposition 4.5 in Johnstone (2011) shows that B n ( TF (y y y)) = 2 4 I(f y ). Using similar arguments, we have B n ( (y y y;S S S)) = 2 4 I(f yjs s s ). Then B n ( TF (y y y))B n ( (y y y;S S S)) = 4 (I(f yjs s s )I(f y )): Next, we prove that I(f yjs s s )I(f y ) is non-negative. Using the denition of I(f yjs s s ), we have: I(f yjs s s ) = ZZ f(y)r y f(s s sjy) +f(s s sjy)r y f(y) f(y;s s s) 2 f(y;s s s)dyds s s = ZZ r y f(s s sjy) f(s s sjy) 2 f(y;s s s)dyds s s + ZZ r y f(y) f(y) 2 f(y)dy + 2 ZZ r y f(s s sjy)r y f(y)dyds s s: Note that ZZ r y f(s s sjy)r y f(y)dyds s s = Z r y f(y)r y Z f(s s sjy)ds s s dy = 0: and RR ryf(s s sjy) f(s s sjy) 2 f(y;s s s)dyds s s 0. Then we proved that I(f yjs s s )I(f y ). 2.6 Supplemental Materials 2.6.1 Further details on the gene expressions estimation example The data considered here was collected in Sen et al. (2018) via RNA sequencing. The set of genes in the sequencing kit was same across all the experiments. The standard deviations of the expressions values corresponding to the dierent genes were estimated from related gene expression samples which contain replications under dierent experimental conditions. Pooling data across these experiments, unexpressed and lowly expressed genes were ltered out. We consider estimating 51 the expression vector of n = 3000 resultant genes based on primary vector Y Y Y and auxiliary sequences S S S U and S S S I . In Figure 2.9 (top panel), we list the 28 genes for which the Tweedie and IT estimates disagreed by more than 50%. According to PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classication System (Mi et al., 2012), those genes impact 12 molecular functions and 35 biological processes in human cells. The bottom two panels of gure 2.9 show the dierent function and process types that are impacted. 2.6.2 Further details on predicting monthly sales data example The data is extracted from Bronnenberg et al. (2008). We consider the monthly sales at store level for 4 dierent commodities: beer, milk, deodorant and hotdog. There are 866 stores. The distribution of store across dierent US states is shown in gure 2.10. Table 2.3 shows the correlation between the dierent products. In table 2.4, we report the average % gain in predictive error by the JS, T and IT estimators (using dierent combinations of auxiliary sequences) over the naive unshrunken estimator ^ t;naive = Y Y Y t1 for the demand prediction problem at t = 7;:::; 12. For estimator ^ we report, Gain t ( ^ ) = P n i=1 ^ 2 i ( ^ t i ^ y t i ) 2 P n i=1 ^ 2 i ( ^ t;naive ^ y t i ) 2 100% for t = 7;:::; 12: The last column in the table 2.4, reports the average performance of these methods over the six successive trails. These average gains are reported in table 2.2 of the main paper. Table 2.3: Correlation matrix of the monthly sales of dierent products. Products (1) (2) (3) (4) (1) Beer 1.00 (2) Milk 0.33 1.00 (3) Deod 0.16 0.63 1.00 (4) Hotdog 0.84 0.38 0.19 1.00 In Figure (2.11), we compare the prediction of monthly sales in August using Tweedie and IT-M&D&H. The magnitude of side-information is denoted by color. 
The signicant dierences between Tweedie and IT are in the left-tails. 52 Table 2.4: Monthwise % gains for monthly beer sales prediction over the naive unshrunken estima- tor. July August September October November December Average James-Stein 9.7 2.4 10.8 -2.7 -16.2 -3.7 5.7 Tweedie 7.5 7.5 9.6 -7.2 -22.6 -2.8 4 IT -Milk 11.7 5.2 9.4 -7.4 -8.8 -8.2 6 IT -Deo 11.3 5.1 10.7 -10.6 -13.7 3.7 7.1 IT -Hotdog 12.4 2.6 11.9 -3.2 -13.2 -6.5 6.8 IT-M&D 10.7 5.9 9.8 -7.4 -8.7 -7 6.1 IT-M&H 10.3 5.7 10.8 -4.3 -10.3 -4.8 6.6 IT-D&H 11.7 6.8 11 -8.2 -9.1 -0.6 7.5 IT-M&D&H 11.2 6.8 10.9 -8.1 -7.2 1.8 7.7 6 8 10 12 6 8 10 12 log(Tweedie) log(IT−M&D&H) Side Information small medium large Figure 2.11: Scatterplot of the logarithm of beer demand estimates in the month of August. The magnitudes of the corresponding auxiliary variables used in the IT estimate are re ected in the dierent colors. 53 Figure 2.9: Top panel: Scatterplot and names of genes where Tweedie and Integrated Tweedie eectsize estimates disagreed by more than 50%. The other panels show the dierent molecular function types and biological processes that are impacted by these genes. 54 20 40 60 80 100 120 140 Number of Stores Figure 2.10: Distribution of the 866 stores across dierent states in USA. 55 Chapter 3 Spatially Adaptive FDR Thresholding for Sparse Estimation This chapter develops the SAFT procedure for sparse estimation. Section 3.1 describes the proposed SAFT procedure. The theoretical properties of SAFT are investigated in Section 3.2. Simulations are conducted in Section 3.3 and the performance on applications is studied in Section 3.4. The proofs are provided in section 3.5. 3.1 Methodology This section develops the SAFT procedure. Section 3.1.1 describes the model and problem setup. The literature of structured FDR analysis is reviewed in Section 3.1.2. Section 3.1.3 provides the details of the SAFT procedure, whereas the estimation of local sparsity levels is studied in Section 3.1.4. 3.1.1 Model and problem setup LetSR d be ad-dimensional spatial domain ands2S a spatial location. The data are observed on a nite, regular lattice SS. In theoretical analysis, we let S!S and consider the inll- asymptotics framework (Stein, 2012; Cai et al., 2020). Let Y s be the observations and s the parameters of interest. Suppose that Y s = s +z s ; z s i:i:d: N(0; 1);s2S: (3.1.1) where z s are the noises in observations and is assumed to be known or can be well estimated from data (Abramovich et al., 2006; Brown and Greenshtein, 2009; Weinstein et al., 2018; Xie 56 et al., 2012). Denote n =jSj and let s = P( s 6= 0), which is used to denote the local sparsity levels. Ignoring the inhomogeneity captured by spatial location s, it reduces to the widely used mixture model (Abramovich et al., 2006; Brown and Greenshtein, 2009; Jiang and Zhang, 2009) where = P( s 6= 0) is a constant across all locations. We aim to estimate the unknown mean vector =f s g s2S lying in a xed parameter space n under the mean square error (^ ; ) =n 1 X s2S E s (^ s s ) 2 ; (3.1.2) where ^ =f^ s g s2S is an estimator of . 3.1.2 Structured FDR analysis Structured FDR analysis is an important topic that has received much attention. Recent works on multiple testing with side information includes AdaPT (Lei and Fithian, 2018), SABHA (Li and Barber, 2019) and LAWS (Cai et al., 2020). AdaPT adaptively estimates a Bayes-optimal rejection threshold on p-values and controls FDR in nite samples. SABHA incorporates prior information to reweight the p-values in a data-adaptive way. 
The performances of AdaPT and SABHA are heavily depend on the quality of prior information or human interactions. LAWS reweights the p-values adapting to local sparsity levels and works the best for spatial setting. The estimation of local sparsity levels will be discussed in section 3.1.4 and we assume the estimated local sparsity levels are ^ s . LAWS proceeds in three steps. First constructs the weights ^ w s = ^ s 1^ s ; then calculates the weighted p-values as p ^ w s =p s = ^ w s ; nally orders the weighted p-values from the smallest to largest p ^ w (1) ; ;p ^ w (n) . Denote corresponding null hypotheses H (1) ; ;H (n) .A step-wise algorithm is used to select a threshold on p-values: k ^ w = max ( k :k 1 X s2S ^ s p ^ w (k) q ) ; (3.1.3) and then reject H (1) ; ;H (k ^ w ) . LAWS shows their weights can separate the clustered p-values eectively. 57 3.1.3 SAFT procedure This section develops the SAFT procedure. Abramovich et al. (2006) proposed a FDR threshold adaptive to the unknown global sparsity. In this article, we aim to develop FDR thresholds adaptive to the unknown local sparsity structure for spatial data. The key idea is that the observations become unequal when being aware of the local sparsity structure, and thus the coordinate-wise thresholds will be desired. Let ^ w s be the estimated locally adaptive weights. Dene the weighted p-values p ^ w s = p s = ^ w s . We arrange the weighted p-values in ascending order as p ^ w (k) and then apply the BH procedure to the weighted p-values with FDR level q. Let ^ k = max n k 1 :p ^ w (k) kq=n o : (3.1.4) Note this p-values threshold selection is dierent from the algorithm used by LAWS in (3.1.3). It follows that the explicit FDR-based thresholding is ^ t s = ~ 1 1 2 min ^ w s ^ kq=n; 1 ; s2S; (3.1.5) where ^ kq=n is the threshold on the unweighted p-values and 1 aims to stabilize the thresholds. The choice of locally adaptive weights is not unique. Our SAFT procedure works for all locally adaptive weights encoding useful informations. In the theory and simulation of this dissertation, we adopt the weights from LAWS and slightly modify weights to achieve the desired theoretical results. This has no impact on the estimation results as the goal is not to control the FDR and the rescaling of weighted p-values does not change the ranking. Dene the estimated locally adaptive weights ^ w s = ^ s ^ (1 ^ s ) ; s2S; (3.1.6) where ^ =n 1 P s2S ^ s and ^ s are estimated by the screening approach (3.1.13) [we suppress the tuning parameter in the expression for ease of presentation]. To increase the stability of the procedure, we take ^ s = (1) if ^ s > 1 with a small positive constant . 58 The performance of the proposed SAFT procedure depends on the choice of the FDR level through ^ t s (q). In multiple testing, the goal is to control the FDR at a specied level. But in the normal means problem, FDR serves as a tool for solving the estimation problem and we use the FDR levelq as a tuning parameter. Ideally we should choose the FDR level to minimize the target error criterion, which is unknown in practice. Brown et al. (2013) use a modied cross validation (MCV) technique for choosing the tuning parameter for Poisson distribution. We propose to apply MCV to set FDR level adaptively to minimize the approximated risk; this guarantees to prevent negative learning asymptotically when pooling information from non-informative spatial data. Let s N(0; 2 n ), s 2 S be i.i.d. auxiliary variables, independent of observations y y y. 
Dene a s = y s + s s and b s = y s s = where is a non-zero constant. Note the observations y s = s + s , where s is the interested parameter and s i.i.d. N(0; 2 n ). Then a s and b s are independent conditional on s . The MCV uses sub-samplesa a a =fa s g s2S for the construction of the family of estimators ^ (a a a;q), while the auxiliary sub-samples b b b =fb s g s2S are used for validation. Let ^ = ^ (a a a;q). It can be shown that the validation loss is ^ L(q;) = 1 n X s2S (^ s b s ) 2 2 n (1 + 1= 2 ): (3.1.7) Let ^ q be the minimizer of ^ L(q;). The proposed SAFT estimator is given by ^ s (y s ) =y s n y s ; ^ t s (^ q) ; s2S: (3.1.8) Next proposition shows that when ! 0, the validation loss gives unbiased estimation of the risk function. To conduct the theoretical properties of the validation loss, we impose the following assumption which assume the proposed thresholds are robust to noises on the observations. Assumption 3.1. For all q2 [0; 1], lim !0 E [t s (y y y;q)t s (a a a;q)] 2 = 0. Proposition3.1. Under assumption 3.1, for allq2 [0; 1], we have lim !0 E ^ L(q;) = (^ (t s (q)); ). 59 3.1.4 Properties of the local sparsity estimator The local sparsity levels s are unknown in practice. This section proposes the estimator of local sparsity levels and studies its theoretical properties. . The direct estimation of s is dicult. We propose an intermediate quantity which approximates s : s = 1 Pfp s >g 1 ; 0< < 1: (3.1.9) The relative bias of the approximation can be calculated as s s s = 1G 1 (js) 1 (3.1.10) where G 1 (js) is the conditional cumulative distribution functions (CDF) of non-null p-value at s. This provides a good approximation to s because (a) s desirably leads to conservative FDR control for all ; (b) as becomes large, the relative biasf s s g= s approaches zero. The following proposition shows that under a mild condition, the relative bias vanishes. Assumption 3.2. When the tuning parameter " 1, we have lim "1 g 1 (js) = 0. Proposition 3.2. Under assumption 3.2 , we have lim "1 s s s = 0: (3.1.11) Remark 3.1. Assumption 3.2 is a typical condition (Genovese and Wasserman, 2004), which requires that the null cases are dominant on the right of the p-value histogram. The condition holds for one-sided p-values but can be violated by two-sided p-values (Neuvial, 2013). Next we can estimate s via a smoothing and screening approach. To pool information from nearby locations, we use a kernel function to assign weights to observations according to their distances to s. Let K be a positive, bounded and symmetric kernel function. Denote by K h (t) = h 1 K(t=h), whereh is the bandwidth. Denev h (s;s 0 ) = K h (ss 0 ) K h (0) , whereK h (ss 0 ) is computed as 60 a function of the Euclidean distancekss 0 k. Letm s = P s 0 2S v h (s;s 0 ), which can be conceptualized as the \total sample size" at coordinate s. This kernel smoothing step utilizes the local structure to borrow strength from observations close to s. DenoteT () =fs2 S : p s > g. Suppose we are interested in counting how many p-values from the null are greater than among the m s \observations". The empirical count is given by P s 0 2T v h (s;s 0 ), whereas the expected count is f P s 0 2S v h (s;s 0 )gf1 s g(1): (3.1.12) Setting the expected and empirical counts equal, the proposed estimator is ^ s = 1 P s 0 2T v h (s;s 0 ) (1) P s 0 2S v h (s;s 0 ) : (3.1.13) We estimate s by pooling information from nearby locations via approach (3.1.13). 
The following assumptions will be necessary for the theoretical guarantee of local sparsity estimation (Cai et al., 2020). Assumption 3.3. Assume that s has continuous rst and second partial derivatives and there exists a constant C > 0 such thatC min (s) max (s)C uniformly for all s2S. Assumption3.4. Assume that Var( P s2S Ifp s >g)C 0 P s2S Var(Ifp s >g) for some constant C 0 > 1. Proposition 3.3. Under assumption 3.2-3.4, if h/jSj 1=3 , we have, uniformly for all s2S, Ef^ s s g 2 =O jSj 2=3 (1) 2 ! ; as S!S: (3.1.14) We refer the proof of this proposition to the proof the Proposition 1 in Cai et al. (2020). 3.2 Theory This section investigates the theoretical properties of the proposed SAFT procedure. Section 3.2.1 considers the oracle procedure where the local sparsity levels are assumed to be known. In Section 3.2.2, the benets of incorporating the spatial structure into estimation are illustrated through two examples. We study the theoretical properties of data-driven procedure in Section 3.2.3. 61 Dene the ` 0 quasi-norm byk k 0 = #fs : s 6= 0g. For normal means problem, the common notion of sparsity is that there is a relatively small proportion of nonzero coecients. The corre- sponding parameter space of is n =f :k k 0 ng. It's well known that the asymptotic of the minimax risk on n (Abramovich et al., 2006) is inf ^ sup 2n (^ ;) 2 2 n log 1 ; (3.2.15) if! 0 whenn!1. A corresponding asymptotically minimax estimator is given by soft or hard thresholding at thresholdt = (2 log 1 ) 1=2 . When is dense, no precise formula for asymptotic of minimax risk is available. To conduct a systematic study of the in uence of spatial structure on estimating , borrowing the idea of block thresholding, we further assume the sparsity levels s satisfy a block sparse model: the observed lattice S has K blocksfB k g 1kK and the sparsity levels will be a constant within the block, i.e. s k if s2 B k where k is the constant sparsity level for block B k . We dene the parameter space for block sparse model b n = n :k k k 0 n k k ; 1kK o (3.2.16) wheren k =jB k j is the number of observed samples in block B k and k 2R n k is the corresponding norm means. In theory, we are interested in the maximal risk of the proposed estimator ^ on parameter space b n : sup 2 b n (^ ; ). When k ! 0, by (3.2.15), the asymptotic of minimax risk for block B k is 2 2 n k k log( k ) 1 . Hence, when max k k ! 0 as n!1, the asymptotic of minimax risk on b n is inf ^ sup 2 b n (^ ; ) X k 2 2 n k k log( k ) 1 : (3.2.17) Notice that when k for all blocks B k , we have b n = n , and hence inf ^ sup 2 b n (^ ; ) = inf ^ sup 2n (^ ; ); (3.2.18) which means the spatial structure is not informative. 62 3.2.1 Oracle procedure This section studies the theoretical properties of the SAFT procedure in oracle case where the local sparsity levels s are assumed to be known. Some assumptions will be imposed to achieve the desired theoretical results in oracle case. Let be a small positive number. Assumption 3.5 (False discovery rate control). q n 2 [b 1 = logn;b 2 =(logn) ]. Assumption 3.6 (Sparsity). 2 [n 1 (logn) 5 ;n b 3 ] with b 3 > 0 and s 1, s2S. Assumption 3.5 requires the FDR level is vanishing slowly. Similar assumptions are considered in Meinshausen and Rice (2006) and Cai and Sun (2017). Assumption 3.6 is the standard assumption for FDR thresholding (Abramovich et al., 2006). Note that we only assume the global sparsity level is vanishing and the local sparsity levels can be dense. 
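As a quick numerical illustration of the display (3.2.15), the risk of soft thresholding at t = (2 log eta^{-1})^{1/2} on an eta-sparse mean vector with strong signals can be compared with 2 sigma_n^2 n eta log eta^{-1}. The sketch below is a rough Monte Carlo check under the assumption sigma_n = 1; the particular n, eta and signal strength are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta, reps = 100_000, 0.001, 20
t = np.sqrt(2.0 * np.log(1.0 / eta))          # threshold (2 log eta^{-1})^{1/2}
theta = np.zeros(n)
theta[: int(n * eta)] = 10.0 * t              # strong signals: per-coordinate risk ~ 1 + t^2

risks = []
for _ in range(reps):
    y = theta + rng.standard_normal(n)
    est = np.sign(y) * np.maximum(np.abs(y) - t, 0.0)    # soft thresholding at t
    risks.append(np.sum((est - theta) ** 2))

print("empirical risk     :", np.mean(risks))
print("2 n eta log(1/eta) :", 2 * n * eta * np.log(1.0 / eta))
```

The ratio of the two printed quantities is approximately (1 + t^2)/t^2 and approaches one only as eta tends to zero, consistent with the asymptotic nature of (3.2.15).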
Next, we study the asymptotic of maximal risk of the soft-thresholding estimator with threshold t s [w s ] with known weight w s in oracle case. Dene S 1 =fB k : k n b for some positive constant bg; (3.2.19) and S 2 =fB k : k (logn) b for some positive constant bg: (3.2.20) Here we consider two cases of sparsity levels in theoretical analysis since the minimax risk has dierent properties in sparse and dense situations. The SAFT procedure also works when the sparsity level is in between. In order to illustrate the main properties in theoretical analysis, without loss of generality, we assume all local sparsity levels are lying in these two cases. DenoteR A (S 1 ) = 2 2 P k2S 1 n k k log( k ) 1 andR NA (S i ) = 2 2 P k2S i n k k log 1 fori = 1; 2. Let R A be the asymptotic of minimax risk adaptive to the spatial structure and R NA be the asymptotic of minimax risk without awareness of spatial structure. 63 Theorem 3.1. Consider the block sparse model, we assume that S 1 [S 2 =S. Under assumptions 3.5 and 3.6, we have: sup 2 b n (^ (t[w]); ).R A (S 1 ) +o (R NA (S 2 )): (3.2.21) as S!S. The following corollary considers two special cases of Theorem 3.1, which illustrates the e- ciency gain in estimation of the proposed SAFT procedure. Corollary 3.1. Under the same assumptions of Theorem 3.1, we have (a) If P k2S 1 n k k n ! 1 as n!1, we have sup 2 b n (^ (t[w]); ).R A : (3.2.22) (b) If P k2S 2 n k k n ! 1 as n!1, then sup 2 b n (^ (t[w]); ).o (R NA ): (3.2.23) In part (a), we assume most signals are from S 1 asymptotically. Then SAFT estimator is asymptotically minimax adaptive to the spatial structure. In part (b), we consider the case that most signals appear in blocks S 2 asymptotically. Then SAFT estimator signicantly improves minimax risk not adaptive to spatial structure. 3.2.2 Benets of local sparsity structure in exploiting spatial data This section studies the benets of local sparsity structure in exploiting spatial data in sparse esti- mation. The following proposition illustrates that we always have eciency gain in the asymptotic of minimax risk by leveraging the local sparsity structure. Proposition 3.4. Under model 3.1.1, if max k k ! 0 when n!1, we have R NA R A , the equality is achieved if and only if k for all blocks B k . Next, we give two concrete examples to illustrate the eciency gain of our SAFT procedure. 64 Figure 3.1: Example 1 for eciency gain of SAFT procedure. Example 3.1. Consider the example shown in Figure 3.1 with 4 blocks. The length and local sparsity are illustrated in the gure. Asymptotically, most signals appear in the rst and second block. Assume n = 1, some calculus gives, as n!1, R A 3 5 n 4=5 logn: (3.2.24) Noting that 2n 1=5 , we have R NA 4 5 n 4=5 logn: (3.2.25) Thus, by leveraging side information, we reduce the minimax risk by 25% asymptotically. The infor- mation gain is from setting more accurate threshold for the second block. For rst block, the optimal threshold adaptive to local sparsity is q 2 log 1 1 and the optimal threshold not adaptive to local spar- sity is p 2 log 1 . Noting 1 = n 1=5 and = 2n 1=5 , we have q 2 log 1 1 p 2 log 1 . Thus asymptotically we set the same threshold for the rst block even if we are aware of the local sparsity structure. Next we consider the second block. Since 2 =n 1=10 , q 2 log 1 2 1 2 p 2 log 1 . Thus we reduce the minimax risk of second block by 50% and the total minimax risk by 25% asymptot- ically. The eciency gain is mainly because of the smaller and more accurate threshold for the second block. 
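The arithmetic of Example 3.1 can also be reproduced numerically. Since Figure 3.1 is not reproduced in this transcript, the block lengths below are an assumed configuration, consistent with the sparsity levels eta_1 = n^{-1/5} and eta_2 = n^{-1/10} quoted in the example and with the displays (3.2.24) and (3.2.25): block 1 of length of order n, block 2 of length n^{9/10}, and the remaining blocks asymptotically negligible; sigma_n = 1 throughout.

```python
import numpy as np

for p in [8, 16, 32, 64, 128]:
    n = 10.0 ** p
    n1, eta1 = n, n ** (-1 / 5)               # block 1: length ~ n, sparsity n^{-1/5}
    n2, eta2 = n ** (9 / 10), n ** (-1 / 10)  # block 2: length n^{9/10}, sparsity n^{-1/10}
    eta = (n1 * eta1 + n2 * eta2) / n         # global sparsity level, ~ 2 n^{-1/5}
    R_A = 2 * (n1 * eta1 * np.log(1 / eta1) + n2 * eta2 * np.log(1 / eta2))
    R_NA = 2 * n * eta * np.log(1 / eta)
    print(f"n = 1e{p}:  R_A / R_NA = {R_A / R_NA:.3f}")
# The ratio decreases slowly toward 3/4, i.e. the 25% reduction claimed in the
# example; the slow convergence reflects the log 2 correction hidden in eta ~ 2 n^{-1/5}.
```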
As shown in Theorem 3.1 and Theorem 3.2, the SAFT procedure will achieve the asymptotic of minimax risk R A . 65 Figure 3.2: Example 2 for eciency gain of SAFT procedure. Example 3.2. Consider the example shown in Figure 3.2 with 3 blocks. The length and local sparsity are illustrated in the gure. Most the signals appear in the second block. Since 0:1n 1=10 , we have R NA 0:02n 9=10 logn: (3.2.26) The sparsity level of the second block is nite and we don't have closed-form formula for the asymp- totic of minimax risk. But we could nd an upper bound on the asymptotic of minimax risk for the second block. Using the upper bound on asymptotic of minimax risk of soft-thresholding estimator from Section 8 in Johnstone (2011), we have the following bound on the minimax risk for the second block with threshold : R A;2 .n 2 (1 2 )e 2 =2 +n 2 2 (1 + 2 ): (3.2.27) Choosing = q 2 log 1 2 which is nite, we have R A;2 .n 2 (1 2 ) 2 +n 2 2 (1 + 2 ); .O(n 2 2 ) =O(n 9=10 ): (3.2.28) 66 Since most signals are from the second block, it impliesR A =O(n 9=10 ). Hence, combining (3.2.26), we have R A =o(R NA ): (3.2.29) The asymptotic of minimax risk adaptive to the local sparsity structure is asymptotically negligible compared to the asymptotic of minimax risk not aware of the local sparsity structure. As shown in Theorem 3.1 and Theorem 3.2, the maximal risk of SAFT procedure is also asymptotically negligible compared to R NA . 3.2.3 Data-driven procedure The next theorem establishes the theoretical properties of the data-driven SAFT procedure, which utilizes the estimated weights via equation 3.1.13. Proposition 3.3 illustrates the technical diculty in estimating local sparsity when there are very few signals. Therefore, the analysis of theoretical properties of the data-driven procedure requires stronger assumption on sparsity. Assumption 3.7 (Sparsity). lies in the interval [c 1 n b 2 ;n b 3 ] and s 2 [c 2 n b 2 ; 1] with b 3 b 2 < 2 9 . Theorem 3.2. Consider the block sparse model, we assume thatS 1 [S 2 =S. If = 1 ( 1 log logn ) andh/jSj 1=3 , under assumptions 3.2-3.5 and 3.7, we have the following result on the asymptotic of the maximal risk: sup 2 b n (^ (t[ ^ w]);).R A (S 1 ) +o (R NA (S 2 )): (3.2.30) as S!S. 3.3 Simulation This section investigates simulation studies to compare the proposed SAFT procedure with other competitive methods. The implementation details are discussed in Section 3.3.1. The simulation 67 results in 1D setting is analyzed in Section 3.3.2 and 3.3.3. The parameter tuning using MCV method is discussed in section 3.3.4. The 2D simulated data are considered in Section 3.3.5. We consider SureBlock method (Cai and Zhou, 2009) and soft-thresholding estimator whose thresholds are chosen from the weighted BH procedure. The following methods with dierent weights are applied to the simulated data: SureBlock (Cai and Zhou, 2009): it selects the optimal block size and threshold by minimizing the SURE criterion. Benjamini-Hochberg procedure (BH): w s = 1; SABHA (Li and Barber, 2019) with known s (SABHA.OR): w s = 1 1s ; data-driven SABHA with estimated ^ s (SABHA.DD): ^ w s = 1 1^ s ; SAFT with known s (SAFT.OR): w s = s (1s) ; data-driven SAFT with estimated ^ s (SAFT.DD): ^ w s = ^ s ^ (1^ s) . The inclusion of SABHA method is to illustrate the superiority and robustness of SFAT weight over the SABHA weight. The oracle SABHA and SAFT methods select the optimal FDR level based on MSE while the data-driven methods and the BH procedure use the MCV criterion to choose the FDR level. 
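Since the data-driven methods above tune the FDR level by MCV, a minimal sketch of that rule (Section 3.1.3) is recorded here, assuming unit noise. The callable fit_saft is a hypothetical stand-in for the SAFT estimator viewed as a function of the perturbed data and q, for instance the pipeline sketched after Section 3.1.4 applied with noise level (1 + alpha^2)^{1/2}; it is an assumption of this sketch, not a routine defined in the dissertation.

```python
import numpy as np

def mcv_select_q(y, fit_saft, q_grid, alpha=0.1, rng=None):
    """Modified cross validation for the FDR level q, assuming unit noise:
    split y into a = y + alpha*eps and b = y - eps/alpha, fit on a, validate on b
    with the loss (3.1.7), and return the q minimizing the validation loss."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    eps = rng.standard_normal(n)                 # auxiliary noise, independent of y
    a = y + alpha * eps                          # estimation sub-sample
    b = y - eps / alpha                          # validation sub-sample
    losses = []
    for q in q_grid:
        theta_hat = fit_saft(a, q)               # SAFT fit as a function of (a, q)
        # Validation loss (3.1.7); the constant removes the inflated noise variance.
        losses.append(np.mean((theta_hat - b) ** 2) - (1.0 + 1.0 / alpha ** 2))
    return q_grid[int(np.argmin(losses))]
```

The selected level q_hat is then plugged into the thresholds (3.1.5) computed on the original data y, yielding the estimator (3.1.8).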
The mean square error (MSE) of dierent methods are computed over 300 replications. 3.3.1 Estimating the conditional proportions The weights in the SAFT procedure are mainly depend on the proposed local sparsity estimator dened in approach (3.1.13) which captures the sparsity structure of the spatial data. This section rst discusses its detailed implementation and then illustrates its eectiveness through simulated data. In creating the selective subsetT , we choose the tuning parameter as the p-value threshold from the BH procedure. Next, we study the informative of the proposed local sparsity using simulated spatial data. We generate m = 5000 samples from the following normal mixture model: X i j i ind (1 i )N(0; 1) + i N(; 1); i Bernoulli( i ): (3.3.31) 68 0 1000 2000 3000 4000 5000 0.0 0.4 0.8 Linear Block s π s 0 1000 2000 3000 4000 5000 0.0 0.4 0.8 Trapezoid Block s π s Figure 3.3: 2D circular triangular, rectangle and circular pattern. We consider two patterns for 1D spatial data: linear block and trapezoid block. The signals appear in four specied intervals. In the top panel of gure 3.3, we consider the setup where the signals appear with elevated constant frequencies in the following blocks: (s) = 0:9 for s2 [1001; 1300][ [2001; 2300]; (s) = 0:7 for s2 [3001; 3300][ [4001; 4300]: (3.3.32) For the rest locations, we have(s) = 0:01. In the bottom panel, the signals have trapezoid pattern in the same blocks. As we can see, the proposed estimator (dashed blue line) can potentially capture the sparsity structure of the spatial data (solid red line). 3.3.2 The block-wise 1D setting with piece-wise constants This section compares the SAFT procedure with competitive methods. The simulated data are generated from the mixture model considered in Section 3.3.1 and the signals appear with the constant sparsity structure in the same blocks. In the left panel of Figure 3.4, we vary the alternative 69 (a) (b) Figure 3.4: MSE comparisons: the linear block pattern. signal from 2 to 12 to investigate the impact of the alternative signal strength. In the right panel, we x = 10 and let s = in the above specied blocks and s = 0:01 elsewhere. Then we vary from 0:5 to 1 to investigate the impact of sparsity structure. Finally, we can see the eciency gain of the SAFT method over the competing methods in signicant even if the alternative signal is weak (left panel of Figure 3.4). This shows the advantage of SAFT method, which integrates the local information from nearby locations via kernel estimation. Meanwhile, SureBlock method works better than the BH procedure only when the local sparsity in the specied blocks is close to 1 and the alternative eect size is large. Our SAFT method outperforms the SureBlock method signicantly. The improvement of SAFT method is more pronounced when the signals are more aggregated in the specied blocks as shown in the right panel of Figure 3.4. This is consistent with the intuition that when the gap between sparsity structures is more signicant, the spatial structure is more informative. Moreover, compared to the SABHA method, we can see that SAFT method is more robust with dierent spatial structures. 70 (a) (b) Figure 3.5: MSE comparisons: the trapezoid block pattern. 3.3.3 The block-wise 1D setting with trapezoid patterns We generate the simulated data under the setup where the local sparsity structure s follows a trapezoid block pattern, which is described in Section 3.3.1. The results are summarized in Figure 3.5. 
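For reference, a minimal generator for the one-dimensional designs of Section 3.3.1 is sketched below: it draws one replicate of the normal mixture (3.3.31) with the linear-block sparsity (3.3.32); the trapezoid setting of this subsection differs only in the shape of pi_s within the same blocks.

```python
import numpy as np

def simulate_linear_blocks(mu, m=5000, rng=None):
    """Draw one replicate from the mixture (3.3.31) with the block sparsity (3.3.32)."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.arange(1, m + 1)
    pi = np.full(m, 0.01)                                    # baseline sparsity
    pi[((s >= 1001) & (s <= 1300)) | ((s >= 2001) & (s <= 2300))] = 0.9
    pi[((s >= 3001) & (s <= 3300)) | ((s >= 4001) & (s <= 4300))] = 0.7
    theta = rng.binomial(1, pi)                              # signal indicators
    x = theta * mu + rng.standard_normal(m)                  # (1-theta)N(0,1) + theta N(mu,1)
    return x, theta, pi

x, theta, pi = simulate_linear_blocks(mu=10.0)
```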
As the experiments for 1D linear blocks setting, in the left panel we vary the alternative signal strength while the impact of dierent sparsity levels is investigated in the right panel. We can see that all the adaptive BH procedures outperform the BH procedure and the SAFT method achieves much more information gain compared to SABHA method. Meanwhile, compared to SABHA method, the data-driven SAFT is more consistent to the oracle procedure. 3.3.4 Tuning the FDR level via MCV The performance of proposed SAFT procedure depends on the choice of the FDR level. We consider an optimization criterion base on modied cross validation (MCV). In this section, we study the accuracy of the MCV criterion under dierent FDR level and sample size. We consider the blockwise 1D setting with piece-wise constants as section 3.4. we x = 3 and 0 = 0:8 in the specied blocks. We compare the MCV criterion with MSE with sample size 2000 and 5000 using one replication with = 1. Figure 3.6 shows that when sample size is 2000, the curve of MCV criterion has similar 71 (a) (b) Figure 3.6: Comparison between MCV criterion and MSE trend as the curve of MSE when the FDR levelq varies. When the sample size is increased to 5000, MCV criterion is almost identical to MSE for dierent FDR level. This illustrates the eectiveness of SAFT procedure in choosing optimal FDR level. 3.3.5 Simulation in 2D setting This section provides the simulation results in the 2D setting. In the implementation of SureBlock (Cai and Zhou, 2009), we restrict the 2D blocks to be the rectangles with the same size. We generate data from the previous mixture model on a 100 100 lattice under the setup where s follows (1) the 2D triangle and rectangle pattern; (2) the 2D double-circular pattern, as shown in Figure 3.7. We set s = 0:9 for the left triangular and the left circular respectively, s = 0:7 for the right rectangle and right circular and set s = 0:01 for other locations. Similar to 1D settings, we rst vary the from 3 to 13 to illustrate the impact of alternative signal strength. Then we x = 10, let s = 0 in the triangle, rectangle and circular regions, s = 0:01 for the rest locations. We vary 0 from 0:3 to 1 to investigate the impact the dierent levels of sparsity structure. The results are summarized in Figure 3.8 and Figure 3.9 respectively. 72 Figure 3.7: 2D circular triangular, rectangle and circular pattern. (a) (b) Figure 3.8: MSE comparisons: 2D rectangle and triangle pattern. 73 (a) (b) Figure 3.9: MSE comparisons: 2D circular pattern. As we can see, BlockSure method performs worse than the BH procedure. This result is rea- sonable since we restrict the blocks to be rectangles in the implementation of BlockSure method. However, the blocks in 2D case can have some non-rectangle patterns. Thus, the performance the BlockSure method will not be guarantee in 2D case. In contrast, Our SAFT method is more robust and exible to blocks with dierent patterns. By leveraging the local sparsity structure, the MSE of the BH procedure has been signicantly improved by SAFT.OR and SAFT.DD for varying alter- native signal strength and sparsity level. Meanwhile, compared to the SABHA method, SAFT.OR achieves better MSE performance and the corresponding data-driven method has more consistent approximation to the oracle procedure. 3.4 Application on Real Data This section presents real data applications of the proposed method. We investigate the number of trips of shared bikes in Seoul, South Korea between April 1st to 28th in 2019. 
The dataset consists of detailed information of each trip including Global Positioning System (GPS) location at every minute along with transaction ID, starting time, ending time, starting location, and ending location. In our analysis, we laid a 860 by 580 grid on Seoul map chopping both longitude and 74 (a) Weekdays (b) Weekends (c) Dierences between (a) and (b) Figure 3.10: Heatmaps of the log-scale daily average number of shared bike rides in Seoul in April, 2019 latitude by 0.0005, and counted the number of rides passed through each grid in log-scale. Grid points on which bikes obviously cannot pass through, such as river, are removed. The grid consists of 182099 valid grid points in total. The counts for weekdays and weekends are recorded separately to address the fact that a few locations would show distinct dierences depending on the land use. For example, leisure districts such as riversides are visited more often over weekends while business districts show higher frequency over weekdays. Figure 3.10 presents heatmaps of the average counts per day in log-scale over weekdays and weekends respectively along with their dierence. In order to detect locations with signicantly dierent daily usages between weekdays and weekends, the proposed methods are applied on the dierences of the log-scaled average counts on 182099 points. The performance is evaluated by the mean squared error (MSE) on a test set. Out of four-week period, the rst two weeks (from April 1st to April 14th, 2019) is used as a training set and the last two weeks (from April 15th to April 28th, 2019) as a test set. Each set is scaled by the reciprocal of its standard deviation, and i = 1 is plugged in for all i = 1; ;N. Figure 3.11 illustrates histograms of the training and the test sets. The conditional probability s is estimated using the 2-dimensional estimation method proposed in Cai et al. (2020) with the parameters set to h = 2 and = 0:032. A heatmap of the estimated s is displayed in Figure 3.12. Table 3.1 presents the estimation performances. For comparison, the results from two other methods are presented together. \Tr." uses the original values of the training set as estimators for 75 (a) Training set (b) Test set Figure 3.11: Histograms of the dierences between log-scaled daily average observations over week- ends and weekdays Figure 3.12: Heatmap of the estimated conditional probabilities s . the test set. \ABDJ" denotes the hard-thresholding method based on FDR control proposed by Abramovich et al. (2006): ^ F;k = 8 > > < > > : y k ; jy k j ^ t F 0 otherwise: 76 Tr. ABDJ BH SAFT MSE 1.235 1.147 1.033 0.905 Ratio 1.196 1.110 1.000 0.876 Table 3.1: The rst row and the second row respectively present MSEs and the ratios of MSEs over unweighted method (BH) for each method. The smallest value is underlined. Tr.: the original values of the training set are used as estimators. ABDJ: hard-thresholding method proposed by Abramovich et al. (2006) , BH: BH-based (unweighted) method, SAFT: SAFT-weighted method. For this hard-thresholding method, the FDR level, which is in association with the tuning parameter ^ t F , is set to 0.90. The most favorable tuning parameter value is chosen that yields the smallest MSE for ABDJ. Both of the proposed methods show promising results over competing methods including the unweighted (BH) method. The proposed methods achieve good results despite the fact that the dataset is subject to some model misspecication issues. 
The model misspecication problems include normality assumption violation since the data is left-skewed as can be seen in Figure 3.11. This suggests that the performance can improve further in suitable settings. 3.5 Proofs 3.5.1 Proof of Proposition 3.1 Proof. Note that b s = s +z s s = and z s ; s N(0; 2 ) , we have E ^ L(q;) = 1 n X s2S E (^ s (a a a;q) s ) 2 2 1 n X s2S E ^ s (a a a;q)(z s s =); (3.5.33) andE ^ s (a a a;q)(z s s =) = 0 since a s is independent of z s s = conditional on s . This implies E ^ L(q;) = 1 n X s2S E (^ s (a a a;q) s ) 2 : (3.5.34) Note that ^ s (y y y;q) = y s n (y s ;t(y y y;q)) and lim !0 E(y s a s ) 2 = 0. Then it is enough to show that lim !0 E [ (y s ;t(y y y;q)) (a s ;t(a a a;q))] 2 = 0. Consider the following decomposition E [ (y s ;t(y y y;q)) (a s ;t(a a a;q))] 2 2E [ (y s ;t(y y y;q)) (y s ;t(a a a;q))] 2 + 2E [ (y s ;t(a a a;q)) (a s ;t(a a a;q))] 2 (3.5.35) 77 Next, we show the limits of these two terms both approach zero when ! 0. By the denition soft-thresholding operator, the rst term is bounded as [ (y s ;t(y y y;q)) (y s ;t(a a a;q))] 2 (t(y y y;q)t(a a a;q)) 2 : (3.5.36) By assumption 3.1, we have lim !0 E (t(y y y;q)t(a a a;q)) 2 = 0, which implies lim !0 [ (y s ;t(y y y;q)) (y s ;t(a a a;q))] 2 = 0: (3.5.37) Consider the second term and note [ (y s ;t(a a a;q)) (a s ;t(a a a;q))] 2 (y s a s ) 2 . Since lim !0 E (y s a s ) 2 = 0, we have lim !0 E [ (y s ;t(a a a;q)) (a s ;t(a a a;q))] 2 = 0: (3.5.38) Thus, we proved this proposition. 3.5.2 Proof of Theorem 3.1 We present a stronger version of theorem 3.1, which could be further utilized in the theoretical analysis of the data-driven procedure. Here we use the same notation as theorem 3.1. Theorem 3.3. Assume thatS 1 [S 2 =S. If we further assume ^ ! 0 and ^ ss s ! 0 as n!1 , under assumptions 3.5 and 3.6, we have: sup 2n (^ (t k ^ w);).R A (S 1 ) +o (R NA (S 2 )): (3.5.39) as S!S. Remark 3.2. If we let ^ s = s , Theorem 3.1 is the corollary of Theorem 3.3. We will use following notations of some constants. Denote n = 1=(b 5 p 2 log logn) with b 5 = 1qn 4 p 2 . Let q 0 = maxf 1+qn 2 ; 2qn g. Let t s [ ^ w s ;k] = 1 1 2 minf( ^ w s k=n)q; 1g be the threshold with weight ^ w s and cut-o k. Note that the proposed threshold is t s [ ^ w s ] =t s [ ^ w s ; ^ k] where ^ k is the cut-o obtaining from the weighted BH procedure. Denote r S (;) =E [^ ()] 2 as the risk for soft-thresholding estimator with threshold . 78 The proof sketch of Theorem 3.3 is as follows: Step 1: we show that ^ k is close to the mean discovery number k() with high probability. The proof is highly depend on the properties of the mean detection function with threshold t s [ ^ w s ]. Step 2: We consider two cases of ^ k. First, when 2 n k n ^ k ( 1 1qn + n )k n , the proof is straightforward since ^ k is close to k n . Second, when 1 ^ k 2 n k n , we can show that most non-zero signals are below threshold t s [ ^ w s ; 2 n k n ]. Then the estimator can achieve the desired property since most i are small. Theoretical properties of ^ k: The analysis of uctuations in ^ k depends on the mean detection function. Given the cut-ok in the BH procedure, we dene the exceedance number N(k) = #fi : y (i) t (i) [ ^ w (i) ;k]g. Since y (k) t (k) [ ^ w (k) ;k] if and only if N(k) k, thus we are interested in the values of k for which N(k)k. 
Dene the mean exceedance function over thresholds t s [ ^ w s ;k] as k varies: M(k;) =E N(k) = X s2S P (j s +z s jt s [ ^ w s ;k]) = X s p s ( s ;k): (3.5.40) Herep s (;k) = ~ (t s [ ^ w s ;k]) + (t s [ ^ w s ;k]), (A) denotes the probability of eventA under the standard Gaussian probability distribution and ~ (A) = 1 (A). We extend the values of k from positive integer to positive real values. Dene the mean discovery number k() = inffk2R + :M(k;) =kg: (3.5.41) To control the behaviors of the stochastic quantity ^ k, with the quantity n as in assumptions, we dene k () = 8 > > < > > : k() n k n ; k() 2 n k n 0; k()< 2 n k n (3.5.42) and k + () =k()_ n k n + n k n : (3.5.43) 79 We say that an event A n () is b n -likely if there exist constants c 0 , c 1 not depending on n and b n such that sup 2 b n P fA c n ()gc 0 expfc 1 log 2 ng: (3.5.44) Proposition 3.5. Under assumption 3.5 3.6 and , if ^ ! as n!1, it is b n -likely that k () ^ kk + (): (3.5.45) The proof of this proposition is based on the properties of the mean detection function provided in Section 3.5.5. The technical diculties in proving the properties of the mean detection function is due to the dierent sparsity for dierent blocks. A key point is that the mean discovery number is bounded and monotone in . To see the boundedness of the mean discovery number, note that the largest mean discovery number M is obtained by setting the non-zero components to be very large, which means M(k) = sup 2 b n M(k;) =k n + X s 2(n s k s ) ~ (t k [ ^ w s ]) =k n + X s (n s k s ) min(k ^ w s q n =n; 1) k n +kq n 1 1 max s s +o(1) k n +kq n 1 +o(1) (3.5.46) using the denition of t s [ ^ w s ;k]. Solving M(k) = k yields a solution ~ k k n = 1 qn (1 +o(1)) when q n ! 0 as n!1 and n is large. In particular, we have k()k n = 1 q n (1 +o(1)) k n = 1q 0 (3.5.47) for all 2 b n when n is large. Before proving this Proposition, we collect the necessary lemma and proposition from Abramovich et al. (2006). 80 Lemma 3.4 (Lemma 7.1 in Abramovich et al. (2006)). Suppose that Y l , l = 1; ;n, are indepen- dent 0=1 variables with P (Y l = 1) =p l . Let N = P N 1 Y l and M =EN = P n 1 p l . Then P(Nk) expf 1 4 Mh(k=M)g; if k<M; (3.5.48) P(Nk) expf 1 4 Mh(k=M)g; if k>M; (3.5.49) where h(x) = minfjx 1j;jx 1j 2 g. Proposition 3.6 (Proposition 7.2 in Abramovich et al. (2006)). Under Assumption 3.5 and 3.6 , we have (a) If k()k 1 (1q 0 ) 1 k n and k 2 =k 1 + n k n , then k 2 M k 2 1 (1q 0 ) 3 n : (3.5.50) (b) If 2 n k n k() (1q 0 ) 1 k n and k 1 =k() n k n , then 1 k 1 M k 1 (1q 0 ) 2 n : (3.5.51) Proof of Proposition 3.5. Dene the exceedance numbers with cut-o k: N(k) = #fi : y (i) t (i) [ ^ w (i) ;k]g (3.5.52) Then we have ^ k = maxfk :p (k) kq n n g = maxfk :N(k)kg: (3.5.53) The right-hand side inequality is due to the fact thatN(k)k if and only if p (k) kqn n . Similarly, we dene k 2 + 1 := minfk :p (k) kq n n g = minfk :N(k)kg; (3.5.54) 81 it follows that ^ kk 2 . Next, we show it is b n -likely that ^ kk + () and k 2 k (). This implies it is n -likely that k () ^ kk + (). 1. Let k 1 =k()_ n k n and k 2 =k 1 + n k n . From (3.5.53), f ^ kk 2 g [ lk 2 fN(l)lg: (3.5.55) For lk 2 >k(), we haveE N(l) =M(l;)<l. Then using Lemma 3.4, we have P (N(l)l) exp 1 4 M l h(l=M l ) ; M l =M(l;): (3.5.56) Sinceh(x) is increasing whenx> 1 andl=M l is increasing inl,h(l=M l )h(k 2 =M k 2 ). Note thatk 1 and k 2 satisfy the assumptions of Proposition 3.6(a) and so h(k 2 =M k 2 ) 2 2 n with = (1q 0 ) 3 . 
Since M l is increasing in l and M 1 c(logn) 3 , we have M l c(logn) 3 : (3.5.57) Combining (3.5.55), (3.5.56) and (3.5.57), we get P ^ kk 2 n expf 1 4 M 1 2 2 n g n expfc 1 (logn) 2 g expfc 2 (logn) 2 g; (3.5.58) when n is large. 2. Next we assume that k() 2 n k n and let k 1 =k() n k n . Using (3.5.54), we nd fk 2 <k 1 g =fk 2 + 1k 1 g [ lk 1 fN(l)lg: (3.5.59) Since lk 1 <k(), we necessarily have M l >l and then P (N(l)l) exp 1 4 M l h(l=M l ) : (3.5.60) 82 Note that l=M l is increasing in l and since k 1 and k() satisfy the assumptions of Proposition 3.6(b), we get h(l=M l ) 1 k 1 M k 1 2 2 2 n : (3.5.61) Finally, as M l M 1 (logn) 3 , we obtain P (k 2 <k 1 ) expfc 2 (logn) 2 g; (3.5.62) in the sane way as for (3.5.58). Now we are ready to prove Theorem 3.3. Proof of Theorem 3.3. Proposition 3.5 demonstrates that it is b n -likelyk () ^ kk + (). Then Ek^ (t ^ k )k 2 1 f ^ k= 2[k ();k + ()]g P ^ k = 2 [k ();k + ()] supk^ (t ^ k )k 2 ; c 0 expfc 1 log 2 ng2n logn; = 2c 0 n 1c 1 logn logn: (3.5.63) which is o(R A ) as n!1. Thus, in the following proof, we assume k () ^ k k + () without loss of generality. Next, we divide the proof into two parts based on the block's sparsity level. Part (1): for all blocks B k 2 S 1 , we show that the asymptotic of maximal risk is R A (S 1 ) +o(R NA ). We consider two cases separately and demonstrate that the risk is minimax in each case. First, ifk() 2 n k n , using (3.5.47), we have 2 n k n ^ k 1 1q 0 k n + n k n := n k n with n = 1 1q 0 + n . Since ^ w s !w s when n is large, it implies t s [ ^ w s ; ^ k] is between t s [1:1w s ; n k n ] and t s [0:9w s ; 2 n k n ] whenn is large enough. We derive the desired theoretical properties directly. Note that [^ ()] 2 is rst decreasing and then increasing in . This implies [^ ()] 2 achieves the maximum values at the boundary points. Thus we have h ^ ts[ ^ ws; ^ k] s i 2 max [^ (ts[1:1ws;nkn])s] 2 ; [^ (ts[0:9ws; 2nkn])s] ; (3.5.64) which implies it will be enough to show P s r S (t[1:1w s ; n k n ]; s ) and P s r S (t[0:9w s ; 2 n k n ]; s ) are both R A (S 1 ) +o(R NA ). 83 On parameter space b n , for block B k , there are n k k non-null signals and n k (1 k ) null observations. For ease of presentation, we adopt the notation of subscript k;i , 1 k K, 1in k and we further assume the rstn k k means k;i are non-null, i.e. k;i 6= 0, 1in k k . Then using the properties of risk function for soft thresholding, we have X B k 2S 1 X s2B k r S (t s [1:1w s ; n k n ]; s ) = X B k 2S 1 ns X i=n k k +1 r S (t k;i [1:1w k;i ; n k n ]; 0) + X B k 2S 1 n k k X i=1 r S (t k;i [1:1w k;i ; n k n ]; k;i ) (3.5.65) wherer S (; 0) 2 3 () when!1 (equation (8.7) in Johnstone (2011)). Since() ~ () as !1, we have r S (t k;i [1:1w k;i ; n k n ]; 0) 2 (t k;i [1:1w k;i ; n k n ]) 2 ~ (t k;i [1:1w k;i ; n k n ]); . 2 n k n q n w k;i 2n = n q n k;i 1 k;i : (3.5.66) Noting that s 1 and q n ! 0, we get ns X i=n k k +1 r S (t k;i [1:1w k;i ; n k n ]; 0).n k k n q n : (3.5.67) This implies X B k 2S 1 ns X i=n k k +1 r S (t k;i [1:1w k;i ; n k n ]; 0). X B k 2S 1 o(n k k ) =o (R A (S 1 )): (3.5.68) Consider the risk for non-zero components. Using the inequality r S (;) 1 + 2 (Lemma 8.3 in Johnstone (2011)), we have X B k 2S 1 n k k X i=1 r S (t k;i [1:1w k;i ; n k n ]; s;i ) X B k 2S 1 n k k X i=1 1 + (t k;i [1:1w k;i ; n k n ]) 2 (3.5.69) 84 Since (t k;i [1:1w k;i ; n k n ]) 2 2 log 2n(1 k ) 1:1 k nknqn (Lemma 12.3 in Abramovich et al. 
(2006)) and k n n, noting q n b 1 = logn and k n b , we get n k k X i=1 r S (t k;i [1:1w k;i ; n k n ]; s;i ). n k k X i=1 2 log( k ) 1 (1 +o(1)) 2n k k log( k ) 1 (1 +o(1)); (3.5.70) which implies X B k 2S 1 n k k X i=1 r S (t k;i [1:1w k;i ; n k n ]; s;i ).R A (S 1 ): (3.5.71) Therefore, we have X B k 2S 1 X s2B k r s (t[1:1w s ; n k n ];).R A (S 1 ): (3.5.72) Similarly, following the same arguments, we can show that P B k 2S 1 P s2B k r S (t[0:9w s ; 2 n k n ];). R A (S 1 ). Combined with equation (3.5.68), we obtain P B k 2S 1 P s2B k r S t[ ^ w s ; ^ k]; . R A (S 1 ) when k() 2 n k n . Next, if k() 2 n k n , we have 1 ^ k k() + n k n 3 n k n from Proposition 3.5. Since M(k;)=k is decreasing ink andM(k();) = 1 by the denition ofk(), we haveM(2 n k n ;) 2 n k n . We will show that most non-zero signals s;i are below threshold t k;i [w k;i ; 2 n k n ]. De- note N = #f(k;i) : k;i t k;i [w k;i ; 2 n k n ]g. When k;i t k;i [w k;i ; 2 n k n ], p 2nkn;k;i ( k;i ) = ~ (t k;i [w k;i ; 2 n k n ] k;i )+(t k;i [w k;i ;k n ] k;i ) 1 2 . SinceM(2 n k n ;) = P k;i p 2nkn;k;i ( k;i ) 2 n k n , it impliesN 4 n k n . Now, we consider the risk for our proposed threshold whenn is large. Note that h ^ t[ ^ w s ; ^ k] s i 2 max n [^ (t[1:1w s ; 3 n k n ]) s ] 2 ; [^ (t[0:9w s ; 1]) s ] 2 o : (3.5.73) Following the same arguments as equation (3.5.72) , we will have X B k 2S 1 X s2B k r S (t[1:1w s ; 3 n k n ]; s ).R A (S 1 ): (3.5.74) 85 Then we only have to show P B k 2S 1 P s2B k r S (t[0:9w s ; 1]; s ).R A (S 1 ) +o(R NA ). Note that X B k 2S 1 X s2B k rS (t[0:9ws; 1];s) = X B k 2S 1 n k k X i=1 rS (t k;i [0:9w k;i ; 1]; 0) + X B k 2S 1 n k X i=n k k +1 rS (t k;i [0:9w k;i ; 1]; k;i ): Next we will bound these two terms separately. Noting that r S (;) 1 + 2 (Lemma 8.3 in Johnstone (2011)), we have X B k 2S 1 X i r S (t k;i [0:9w k;i ; 1]; k;i ) 1 f k;i <t k;i [w k;i ;2nkn]g X B k 2S 1 X in k k (1 + 2 k;i )1 f0< k;i <t k;i [w k;i ;2nkn]g + X B k 2S 1 ns X i=ks+1 r S (t k;i [0:9w k;i ; 1]; 0); . X B k 2S 1 n k k 1 + (t k;i [w k;i ; 2 n k n ]) 2 + X B k 2S 1 ns X i=ks+1 r S (t k;i [0:9w k;i ; 1]; 0): (3.5.75) Following the same arguments as equation (3.5.69), we can have X B k 2S 1 n k k 1 + (t k;i [w k;i ; 2 n k n ]) 2 .R A (S 1 ): (3.5.76) For the second term of equation (3.5.75), using r S (; 0) e 2 =2 (equation (8.7) in Johnstone (2011)), we obtain X B k 2S 1 n k X i=n k k +1 r S (t k;i [0:9w k;i ; 1]; 0) X B k 2S 1 n k X i=n k k +1 e (t k;i [0:9w k;i ;1]) 2 =2 : (3.5.77) Since t k;i [0:9w k;i ; 1] q 2 log( 2n 0:9w k;i qn ), then X B k 2S 1 n k X i=n k k +1 e (t k;i [0:9w k;i ;1]) 2 . X B k 2S 1 n k X i=n k k +1 0:9w k;i q n 2n ; = X B k 2S 1 n k X i=n k k +1 0:9 k q n (1 k )2n : (3.5.78) 86 Notice that k 1 and P k n k k = n, the above term is bounded by q n =(2), which is o (R A (S 1 )). Combined with equation (3.5.76), we obtain X B k 2S 1 X i r S (t k;i [0:9w k;i ; 1]; k;i ) 1 f k;i <t k;i [w k;i ;2nkn]g .R A (S 1 ): (3.5.79) Meanwhile, using r S (;) 1 + 2 ,we obtain X B k 2S 1 X i r S (t k;i [0:9w k;i ; 1]; k;i )1 f k;i t k;i [w k;i ;2nkn]g N(1 + (t k;i [0:9w k;i ; 1]) 2 ) .4 n k n 2c logn; (3.5.80) for some constant c. Since n ! 0 as n!1, we have X B k 2S 1 X i r S (t k;i [0:9w k;i ; 1]; k;i )1 f k;i t k;i [w k;i ;2nkn]g =o(R NA ): (3.5.81) Thus, we have proved P B k 2S 1 P i r S (t k;i [0:9w k;i ; 1]; k;i ) R A (S 1 )(1 +o(1)) +o(R NA ), and then we obtain P B k 2S 1 P s2B k r S t s [ ^ w s ; ^ k]; s .R A (S 1 ) +o(R NA ) when k() 2 n k n . 
Therefore, we have showed for all 2 b n , X B k 2S 1 X s2B k ER ^ t s [ ^ w s ; ^ k] ; s .R A (S 1 ) +o(R NA ); (3.5.82) which implies sup 2 b n (^ (t[w]);).R A (S 1 ) +o(R NA ): (3.5.83) Part (2): for all blocks B k 2S 2 , we will prove that the risk is o(R A (S 1 )) +o(R NA (S 2 )). Noting thatR NA (S 2 ) = P s2S 2 n s s logn andR A (S 1 ) = P s2S 1 n s s logn , we only have to show that P B k 2S 2 P i r S (t k;i [ ^ w k;i ; ^ k]; k;i ) . o (n logn) for all . From assumptions we have q n ! 0 asn!1. Notice that ^ k n k n with n = 1 1q 0 + n and ^ w k;i !w k;i . Then the proposed thresholdt k;i [ ^ w k;i ; ^ k] = ~ 1 min ^ w k;i knqn 2n ; 1 still approached innity whenn!1. We will modify the proof from Part (1) to show the desired theoretical results. 87 First, we consider the situation when k() 2 n k n , which means k n ^ k n k n . Equation (3.5.64) implies that it is enough to show P s r S (t s [1:1w s ; n k n ]; s ) and P s r S (t s [0:9w s ; 2 n k n ]; s ) are both o (n logn) asymptotically. From equation (3.5.67) and equation (3.5.69) in Part (1), for block B k , we have n k X i=1 r S (t k;i [1:1w k;i ; n k n ]; k;i ).n k k n q n + n k k X i=1 1 + (t k;i [1:1w k;i ; n k n ]) 2 (3.5.84) Notice thatq n b 1 (logn) 1 and whenB k 2S 2 , the block sparsity k (logn) b for some constant b. Using (t k;i [1:1w k;i ; n k n ]) 2 2 log 2(1 k ) 1:1 k nqn , we get (t k;i [1:1w k;i ; n k n ]) 2 .c log logn; (3.5.85) for some constant c. Thus, X B k 2S 2 n k X i=1 ER (^ (t k;i [1:1w k;i ; n k n ]); k;i ).o( X B k 2S 2 n k k ) +O( X B k 2S 2 n k k log logn); =o(n logn): (3.5.86) Similarly, we can also show P B k 2S 2 P n k i=1 r S (t k;i [0:9w k;i ;k n ]; k;i ) . o(n logn) using the same arguments. Combined with (3.5.86), we prove that X B k 2S 2 X i r S (t k;i [ ^ w k;i ; ^ k]; k;i ).o (n logn); (3.5.87) when k() 2 n k n . Next, whenk() 2 n k n , we have 1 ^ k 3 n k n . From Part (1), we obtain the number of sig- nals above the thresholdt s [w s ; 2 n k n ] is bounded by 4 n k n , i.e. N 4 n k n . Equation (3.5.73) im- plies that it is enough to show P B k 2S 2 P i r S (t k;i [1:1w k;i ; 3 n k n ]; k;i ) and P B k 2S 2 P i r S (t k;i [0:9w k;i ; 1]; k;i ) are both o (n logn) asymptotically. P B k 2S 2 P i r S (t k;i [1:1w k;i ; 2 n k n ]; k;i ) can be shown to be o (n logn) asymptotically using the same arguments as we prove equation (3.5.86). Then it is enough to show P B k 2S 2 P i r S (t k;i [0:9w k;i ; 1]; k;i ) is o (n logn) asymptotically. 88 Using equations (3.5.75) and (3.5.78), we have X B k 2S 2 X i r S (t k;i [0:9w k;i ; 1]; k;i ) . X B k 2S 2 n k k 1 + (t k;i [w k;i ; 1]) 2 + X B k 2S 2 0:9n k k q n (1 k )2n (3.5.88) By equation (3.5.78), the rst term of above equation is bounded by P B k 2S 2 n k k (1 +o(log logn)), which is o (n logn) asymptotically. The second term of the above equation is bounded by q n =2, which is nite. Thus, we obtain X B k 2S 2 X i r S (t k;i [0:9w k;i ; 1]; k;i ) =o (n logn): (3.5.89) This implies P B k 2S 2 P i r S (t k;i [ ^ w k;i ; ^ k]; k;i ) . o (n logn) when k() 2 n k n . Combined with (3.5.87), we prove sup 2 b n (^ (t[w]);).o(R A (S 1 )) +o(R NA (S 2 )): (3.5.90) 3.5.3 Proof of Proposition 3.4 Proof. By the denition of R NA and R A , we have R NA R A = 2 S X s=1 s n s 2 log 1 r=2 2 log 1 s r=2 : (3.5.91) Denote f(x) =x(logx 1 ) r=2 , x2 (0; 1). Some calculation show that f 0 (x) = (logx 1 ) r=21 (logx 1 r 2 ); (3.5.92) and f 00 (x) =x 1 (logx 1 ) r=22 ( r 2 logx 1 r 2 (1 r 2 )): (3.5.93) 89 Hence, f(x) is concave when 0<r 2. 
Then using Jensen's inequality, we get X s n s n f( s )f( X s n s n s ); (3.5.94) which implies X s s n s log 1 s r=2 X s s n s log 1 r=2 : (3.5.95) Hence, R NA R A 0. 3.5.4 Proof of Theorem 3.2 Proof. First, from Proposition 3.3, we haveE ^ ss s 2 =O (log logn) 2 n 2=3b 2 andE ^ 2 =O (log logn) 2 n 2=3b 2 . Apply the Chebyshev's inequality, we obtain P ^ > (log logn) 1 O (log logn) 3 n 2=3b 2 andP ^ s s s > (log logn) 1 O (log logn) 3 n 2=3b 2 ; (3.5.96) Denote the eventA s = n ^ > (log logn) 1 o [ n ^ ss s > (log logn) 1 o . When the observation y s = s +z s and , we have E As [^ (t s [ ^ w s ]) s ] 2 E Z As 2z 2 s + 2(t s [ ^ w s ]) 2 (z s )dz s : (3.5.97) Since (z) is the density function of Gaussian distribution, P(jz s j 2 logn)n 2 logn . Note that t s [ ^ w s ] p c logn for some constantc by denition. Thus the right hand size of equation (3.5.97) is bounded by C(logn) 2 P(A s ) +n 2 logn . Notice that b 2 < 2 9 , then X s E As [^ (t s [ ^ w s ]) s ] 2 =o n 7=9 (logn) 3 =o(R A ): (3.5.98) Therefore, we can assume that ^ ! 0 and ^ ss s ! 0 asn!1 without loss of generality. Then the assumptions of Theorem 3.3 is satised. Using Theorem 3.3, we proved the desired results. 90 3.5.5 Properties of the mean detection function This section collects some properties of the mean detection function. Without loss of generality, we assume that 1 2 n in this section. Proposition 3.7. The mean detection function has following properties: 1. _ M(k;)> 0 and M(k;) 0. 2. @ @k M(k;) k 0. Proof. One checks that @t k [ ^ w i ] @k = 8 > > < > > : q n ^ w i =(2n(t k [ ^ w i ])); if ^ w i qnk n < 1 0; if ^ w i qnk n 1 (3.5.99) Writing _ M, M for partial derivative w.r.t. k, calculus shows that _ M(k;) = n X i=1 h i (k;) (3.5.100) with h i (k;) = @t k [ ^ w i ] @k (t k [ ^ w i ] i ) +(t k [ ^ w i ] i ) = 8 > > < > > : (q n ^ w i =n)e 2 i =2 cosh(t k [ ^ w i ] i ); if ^ w i qnk n < 1 0; if ^ w i qnk n 1 (3.5.101) Note that there exists ~ i such that ^ ! ~ i 1, then ^ w ~ i qnk n < 1. Hence, _ M(k;)> 0. Meanwhile, M(k;) =h 0 i (k;) (3.5.102) with h 0 i (k;) = 8 > > < > > : q 2 n ^ w 2 i = 2n 2 (t k [ ^ w i ]) i e 2 i =2 sinh(t k [ ^ w i ] i ); if ^ w i qnk n < 1 0; if ^ w i qnk n 1 (3.5.103) 91 so M(k;) 0. Finally, sinceM(0;) = 0, there exists ~ k2 [0;k] such that the threshold exceedance function k 1 M(k;) =k 1 (M(k;)M(0;)) = _ M( ~ k;), and hence, for each , the exceedance function is decreasing in k: @ @k M(k;) k = 1 k [ _ M(k;) _ M( ~ k;)] 0: (3.5.104) Proposition 3.8. Under Assumptions 3.5 and 3.6, the mean detection function has following property: _ M( n k n ;) (1 +q n )=2. Proof. Let us consider the non-zero coordinates and zero coordinates separately. Without loss of generality, we assume that 1 2 n . Then M(k;) = kn X i=1 [ e (t k [ ^ w i ] i ) + (t k [ ^ w i ] i )] + n X i=kn+1 min( q n k ^ w i n ; 1): (3.5.105) Using (3.5.99) and the fact that (x)(0), we have _ M(k;) = kn X i=1 @t k [ ^ w i ] @k [(t k [ ^ w i ] i ) +(t k [ ^ w i ] i )] + n X i=kn+1 q n ^ w i n 1 ( qnk^ w i n <1) kn X i=1 2(0) @t k [ ^ w i ] @k + X i q n ^ w i n (3.5.106) In particular, if k = n k n , we have qnk ^ w i n = n q n ^ w i ! 0, as n!1 by assumption. Then using equation (12.10) in Abramovich et al. (2006),2(0) @t k [ ^ w i ] @k 2(0) kt k [ ^ w i ] Whenn is large enough, using Lemma 12.3 in Abramovich et al. (2006) and noting n k n ^ w i ! 
0, we get (t nkn [ ^ w i ]) 2 2 log 2 n q n ^ w i 0:99 log log 2 n q n ^ w i (3.5.107) Hence, when n is large, _ M( n k n ;) kn X i=1 2 n k n q 2 log 2=( n q n ^ w i ) +q n : (3.5.108) 92 Since n ! 0, we have n q n ^ w i =2 (logn) when n is large. Then equation 3.5.108 implies _ M( n k n ;) kn X i=1 2 n k n p log logn +q n = 1 +q n 2 : (3.5.109) Proposition 3.9. Under Assumptions 3.6 and 3.5 , the mean detection function has the following property: M(1;)c(logn) 3 . Before proving this proposition, we collect some necessary denitions. A key feature of n is that for blocks, onlyk s =n s n;s coordinates may be non-zero. Totally we will have k n = P S s=1 k s non-zero coordinates. Assume that 1 2 n 0. Denote p k;i () = ~ (t k [ ^ w i ]) + (t k [ ^ w i ]). Hence, the number of false discoveries at threshold t k [ ^ w i ] from the nk n zero coordinates is at most: n X i=kn+1 p k;i (0) = n X i=kn+1 kq n ^ w i =nkq n ; (3.5.110) since p k;i (0) = min(kq n ^ w i =n; 1). True positive rate: The true positive rate using threshold t k [ ^ w s;i ] is dened by k () = 1 k n kn X i=1 p k;i ( i ): (3.5.111) Corollary 3.2. Under assumption, ifn is suciently large, then uniformly over in n for which k<k(), the global average true positive rate using thresholds t k [ ^ w i ] satises k () kq n k k n : (3.5.112) Proof. From the denition of the mean exceedance number M(k;), we have k n k () =M(k;) n X i=kn+1 p k;i (0): (3.5.113) 93 Since k <k() and M(k;) is decreasing in k, we have M(k;) k. Meanwhile, from (3.5.110), we have n X i=kn+1 p k;i (0)kq n : (3.5.114) Hence, we get k n k ()kq n k, as desired. Dene a bi-threshold function g i () =g k;;i () =p k;i (p 1 ;i ()); ^ w i k=n 1; k< n k n : (3.5.115) Lemma 3.5. If k< n k n , then !g k;;i () is convex and increasing. Proof. First note that when k k n , kqn ^ w i n nknqn ^ w i n ! 0. Hence, the threshold t k [ ^ w i ] = ~ 1 2 min( kqn ^ w i n ; 1) = ~ kqn ^ w i 2n . Then the proof of this lemma is the same as Lemma 6.3. in Abramovich et al. (2006) and is omitted. Next we give the proof of Proposition 3.9. Proof of Proposition 3.9. For convenience, we abbreviate n k n by . Lemma 3.5 shows that bi- threshold function g i =g k;;i is convex when k<. Denoting i =p ;i ( i ), we have M(k;) = n X i=1 g k;;i ( i ) kn X i=1 g k;;i ( i ): (3.5.116) From the denition of g i , we have g k;;i ( i ) =p k;i ( i ) ~ (t k [ ^ w i ] i ): (3.5.117) Next, we seek an upper bound for t k [ ^ w i ] i =t k [ ^ w i ]t [ ^ w i ] +t [ ^ w i ] i : (3.5.118) 94 First note that i =p ;i ( i ) = (t [ ^ w i ] i ) + ~ (t [ ^ w i ] i ) (3.5.119) which means that i 2 ~ (t [ ^ w i ] i ) from which we get t [ ^ w i ] i z( i =2) q 2 log(2 1 i ); (3.5.120) using (12.8). Using (12.14) in Abramovich et al. (2006) and k<, we have 0t k [ ^ w i ]t [ ^ w i ] r 2 log( 2n q n k ^ w i ) r 2 log( 2n q n ^ w i ) + 3=2: (3.5.121) One can check that q 2 log( 2n qnk ^ w i ) q 2 log( 2n qn ^ w i ) is increasing in ^ w i when k < . Using the assumption that n b 3 , we have ^ w i 2n 1b 3 . It implies t k [ ^ w i ]t [ ^ w i ] s 2 log( n b 3 q n k ) s 2 log( n b 3 q n ) + 3=2: (3.5.122) Combining 3.5.116 and 3.5.117, when = 1, we have M(1;) kn X i=1 ~ 0 @ s 2 log( n b 3 q n ) s 2 log( n b 3 q n ) + 3=2 + q 2 log(2 1 i ) 1 A : (3.5.123) Notice that b 1 = logn q n < 1 and = n k n k n . Hence, when n is large, q 2 log( n b 3 qn ) q 2 log( n b 3 qn ) + 3=2 p b 3 logn p b 3 log(n=k n )). Denote h(x) = ~ v n + p 2 log(2x 1 ) with v n = p b 3 logn p b 3 log(n=k n ). Then M(1;) P kn i=1 h( i ). 
Computing the rst and second derivative of h(x), we get h 0 (x) = v n + p 2 log(2x 1 ) 2 x p 2 log(2x 1 ) ; (3.5.124) 95 and h 00 (x) = v n + p 2 log(2x 1 ) 2x 2 log(2x 1 ) " 4 v n + p 2 log(2x 1 ) + 2 p 2 log(2x 1 ) 4 p 2 log(2x 1 ) # ; (3.5.125) hence h 00 (x)> 0 when 0x< 1. Applying Jensen's inequality to 3.5.123, we have M(1;)k n ~ v n + q 2 log(2 1 ) (3.5.126) where = 1 kn P kn i=1 p ;i ( i ) dened in 3.5.111. Then using Corollary 3.2 and noting (1q n (1 + o(1)))=k n = (1q n (1 +o(1))) n c 4 p logn for some constant c 4 , when n is large, we have M(1;)k n ~ (v n +s n ) (3.5.127) with s n = p 3 log logn. We may rewrite k n in terms of v n , obtaining logk n =v b r 4 b 3 logn v 2 n b 3 : (3.5.128) Note that the bound ~ (x)(x)=(2x) holds for x> p 2. Applying this bound we conclude k n ~ (v n +s n ) e s 2 n =2 2(v n +s n ) exp v n r 4 b 3 logn 2 +b 3 2b 3 v n s n : (3.5.129) Since e s 2 n =2 = (logn) 3 2 and 2(v n + s n ) 3 p b 3 logn, the rst factor is bounded below by c 0 (logn) 2 for some constant c 0 . Next, we bound the main exponential term. Let g(v) = v q 4 b 3 logn 2+b 3 2b 3 vs n . Note that k n 2 [(logn) 2 ;n 1b 3 ]. We now evaluate the values of v and g(v) corresponding to these bounds on k n . At the lower bound, k n = (logn) 2 ; using p a p a=(2 p a), then v n 5 p b 3 log logn 2 p logn =:v 1n 96 and one checks thatg(v 1n ) 5 log logn 1 whenn is large. At the upper bound,k n =n 1b 3 ; then v n 1 p b 3 2 (1 + b 3 2 )(1 p b 3 ) logn +c =:v 2n and one checks that g(v 2n ) g(v 1n ) when n is large. Since g(v) is a concave quadratic function, its minimum is attained at the boundaries. This implies e g(vn) e g(v 1n ) e 1 (logn) 5 : (3.5.130) Combined with the lower bound on the rst factor in( 3.5.129), we proved the desired result. 97 Bibliography Abramovich, F., Benjamini, Y., Donoho, D. L., Johnstone, I. M., et al. (2006). Adapting to unknown sparsity by controlling the false discovery rate. The Annals of Statistics, 34(2):584{653. Banerjee, T., Mukherjee, G., and Paul, D. (2018). Improved shrinkage prediction under a spiked covariance structure. Banerjee, T., Mukherjee, G., and Sun, W. (2020). Adaptive sparse estimation with side information. Journal of the American Statistical Association, 115(532):2053{2067. Basu, P., Cai, T. T., Das, K., and Sun, W. (2018). Weighted false discovery rate control in large- scale multiple testing. Journal of the American Statistical Association, 113(523):1172{1183. Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. B, 57:289{300. Bronnenberg, B. J., Dub e, J.-P. H., and Gentzkow, M. (2012). The evolution of brand preferences: Evidence from consumer migration. American Economic Review, 102(6):2472{2508. Bronnenberg, B. J., Kruger, M. W., and Mela, C. F. (2008). Database paper|the iri marketing data set. Marketing science, 27(4):745{748. Brown, L. D. and Greenshtein, E. (2009). Nonparametric empirical bayes and compound decision approaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics, 37:1685{1704. Brown, L. D., Greenshtein, E., and Ritov, Y. (2013). The poisson compound decision problem revisited. Journal of the American Statistical Association, 108(502):741{749. 98 Cai, T. and Sun, W. (2017). Optimal screening and discovery of sparse signals with applications to multistage high-throughput studies. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):197. Cai, T. 
T., Sun, W., and Wang, W. (2019). CARS: Covariate assisted ranking and screening for large-scale two-sample inference (with discussion). J. Roy. Statist. Soc. B, 81:187{234. Cai, T. T., Sun, W., and Xia, Y. (2020). Laws: A locally adaptive weighting and screening approach to spatial multiple testing. Journal of the American Statistical Association, pages 1{30. Cai, T. T. and Zhou, H. H. (2009). A data-driven block thresholding approach to wavelet estimation. The Annals of Statistics, 37(2):569{595. Chwialkowski, K., Strathmann, H., and Gretton, A. (2016). A kernel test of goodness of t. JMLR: Workshop and Conference Proceedings. Cohen, N., Greenshtein, E., and Ritov, Y. (2013). Empirical bayes in the presence of explanatory variables. Statistica Sinica, pages 333{357. Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. Journal of the american statistical association, 90(432):1200{1224. Donoho, D. L. and Johnstone, J. M. (1994). Ideal spatial adaptation by wavelet shrinkage. biometrika, 81(3):425{455. Efron, B. (2011). Tweedie ~ Os formula and selection bias. Journal of the American Statistical Association, 106(496):1602{1614. Efron, B. and Morris, C. (1975). Data analysis using stein's estimator and its generalizations. Journal of the American Statistical Association, 70(350):311{319. Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc., 96:1151{1160. Fu, L., Gang, B., James, G. M., and Sun, W. (2019). Information loss and power distortion from standardizing in multiple hypothesis testing. arXiv preprint, page arXiv:1910.08107. 99 Fu, L. J., Sun, W., and James, G. M. (2020). Nonparametric empirical bayes estimation on heterogeneous data. arXiv preprint arXiv:2002.12586. Genovese, C. and Wasserman, L. (2004). A stochastic process approach to false discovery control. Ann. Statist., 32:1035{1061. Gu, J. and Koenker, R. (2017). Unobserved heterogeneity in income dynamics: An empirical bayes perspective. Journal of Business & Economic Statistics, 35(1):1{16. Ignatiadis, N. and Huber, W. (2020). Covariate powered cross-weighted multiple testing. arXiv preprint arXiv:1701.05179. Ignatiadis, N. and Wager, S. (2019). Covariate-powered empirical bayes estimation. arXiv preprint arXiv:1906.01611. James, W. and Stein, C. (1961). Estimation with quadratic loss. Berkeley Symposium on Mathe- matical Statistics and Probability, pages 361{379. Jiang, W. and Zhang, C.-H. (2009). General maximum likelihood empirical bayes estimation of normal means. The Annals of Statistics, 37(4):1647{1684. Jing, B.-Y., Li, Z., Pan, G., and Zhou, W. (2016). On sure-type double shrinkage estimation. Journal of the American Statistical Association, 111(516):1696{1704. Johnstone, I. M. (2011). Gaussian estimation: Sequence and wavelet models. Unpublished manuscript. Ke, T., Jin, J., and Fan, J. (2014). Covariance assisted screening and estimation. Annals of statistics, 42(6):2202{2242. Koenker, R. and Gu, J. (2017). REBayes: An R package for empirical bayes mixture methods. Journal of Statistical Software, 82(8):1{26. Koenker, R. and Mizera, I. (2014). Convex optimization, shape constraints, compound decisions, and empirical bayes rules. Journal of the American Statistical Association, 109(506):674{685. Kou, S. and Yang, J. J. (2017). Optimal shrinkage estimation in heteroscedastic hierarchical linear models. In Big and Complex Data Analysis, pages 249{284. Springer. 100 Krusi nska, E. (1987). 
A valuation of state of object based on weighted mahalanobis distance. Pattern Recognition, 20(4):413{418. Lei, L. and Fithian, W. (2018). Adapt: an interactive procedure for multiple testing with side infor- mation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):649{ 679. Li, A. and Barber, R. F. (2019). Multiple testing with the structure-adaptive benjamini{hochberg algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(1):45{ 74. Liu, J., Zhang, C., Page, D., et al. (2016a). Multiple testing under dependence via graphical models. Annals of Applied Statistics, 10(3):1699{1724. Liu, Q., Lee, J., and Jordan, M. (2016b). A kernelized stein discrepancy for goodness-of-t tests. International conference on machine learning, pages 276{284. Liu, Q. and Wang, D. (2016). Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in neural information processing systems, pages 2378{2386. Meinshausen, N. and Rice, J. (2006). Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. The Annals of Statistics, 34(1):373{393. Mi, H., Muruganujan, A., and Thomas, P. D. (2012). Panther in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic acids research, 41(D1):D377{D386. Neuvial, P. (2013). Asymptotic results on adaptive false discovery rate controlling procedures based on kernel estimators. The Journal of Machine Learning Research, 14(1):1423{1459. Oates, C. J., Girolami, M., and Chopin, N. (2017). Control functionals for monte carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695{718. Ren, Z. and Cand es, E. (2020). Knockos with side information. arXiv preprint arXiv:2001.07835. 101 Robbins, H. (1951). Asymptotically subminimax solutions of compound statistical decision prob- lems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Proba- bility, 1950, pages 131{148, Berkeley and Los Angeles. University of California Press. Robbins, H. (1964). The empirical bayes approach to statistical decision problems. The Annals of Mathematical Statistics, 35(1):1{20. Roeder, K. and Wasserman, L. (2009). Genome-wide signicance levels and weighted hypothesis testing. Statistical science: a review journal of the Institute of Mathematical Statistics, 24(4):398. Sen, N., Sung, P., Panda, A., and Arvin, A. M. (2018). Distinctive roles for type i and type ii interferons and interferon regulatory factors in the host cell defense against varicella-zoster virus. Journal of virology, 92(21):e01151{18. Shu, H., Nan, B., and Koeppe, R. (2015). Multiple testing for neuroimaging via hidden markov random eld. Biometrics, 71(3):741{750. Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Technical report, STANFORD UNIVERSITY STANFORD United States. Stein, M. L. (2012). Interpolation of spatial data: some theory for kriging. Springer Science & Business Media. Sun, W. and Cai, T. T. (2007). Oracle and adaptive compound decision rules for false discovery rate control. J. Amer. Statist. Assoc., 102:901{912. Sun, W. and Cai, T. T. (2009). Large-scale multiple testing under dependence. J. R. Stat. Soc. B, 71:393{424. Sun, W., Reich, B. J., Cai, T. T., Guindani, M., and Schwartzman, A. (2015). False discovery control in large-scale spatial multiple testing. 
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(1):59{83. Tan, Z. (2015). Improved minimax estimation of a multivariate normal mean under heteroscedas- ticity. Bernoulli, 21(1):574{603. 102 Weinstein, A., Ma, Z., Brown, L. D., and Zhang, C.-H. (2018). Group-linear empirical bayes estimates for a heteroscedastic normal mean. Journal of the American Statistical Association, 113(522):698{710. Xie, X., Kou, S., and Brown, L. D. (2012). Sure estimates for a heteroscedastic hierarchical model. Journal of the American Statistical Association, 107(500):1465{1479. Zerboni, L., Sen, N., Oliver, S. L., and Arvin, A. M. (2014). Molecular mechanisms of varicella zoster virus pathogenesis. Nature reviews microbiology, 12(3):197{210. Zhang, C.-H. (1997). Empirical bayes and compound estimation of normal means. Statistica Sinica, 7(1):181{193. Zhang, X. and Bhattacharya, A. (2017). Empirical bayes, sure and sparse normal mean models. arXiv preprint arXiv:1702.05195. 103
Abstract
High dimensional estimation problems have a wide range of applications and have been extensively studied in the literature. Recent advances in technology make it possible to collect vast amounts of data, which often come with useful side information. The side information can take the form of auxiliary sequences, extracted from a secondary dataset or from the same dataset using carefully constructed statistics, or of a dependence structure such as spatial information. Incorporating and leveraging this side information in testing, estimation and inference can improve the performance of statistical procedures. This dissertation aims to effectively integrate side information into estimation and inference procedures.
This dissertation is divided into three chapters. Chapter 1 introduces the problems under study. In Chapter 2, we consider the problem of nonparametric empirical Bayes estimation with auxiliary data. We present a new framework to extract and transfer structural knowledge encoded in auxiliary data to assist the simultaneous estimation of thousands or even millions of parameters in the target domain. We devise a class of nonparametric integrative Tweedie (NIT) estimators for empirical Bayes data-sharing shrinkage estimation. When applied in conjunction with reproducing kernels and convex optimization techniques, NIT provides superior and robust performance and scales well as the number of parameters grows. The new estimation framework is capable of handling multiple, possibly correlated, auxiliary sequences and can flexibly incorporate various structural constraints into the data-driven decision rule. We develop theory establishing convergence rates for the risk of the data-driven NIT estimator, which provides important insights into the benefits and caveats of utilizing multivariate auxiliary data. Numerical studies show that our approach achieves substantial gains in empirical performance over existing methods in many settings.
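To make the shrinkage mechanism concrete, the display below records the Gaussian form of Tweedie's formula and its extension to a single auxiliary covariate. This is a standard statement written in illustrative notation (the symbols X_i, S_i, \theta_i, \sigma and the densities f are placeholders, not the exact notation of Chapter 2) and is not a substitute for the formal development there:

E[\theta_i \mid X_i = x] = x + \sigma^2 \, \frac{\partial}{\partial x} \log f(x),
E[\theta_i \mid X_i = x, S_i = s] = x + \sigma^2 \, \frac{\partial}{\partial x} \log f(x \mid s),

where X_i \sim N(\theta_i, \sigma^2), f is the marginal density of X_i, and f(\cdot \mid s) is the conditional density of X_i given the side information S_i = s. Roughly, the integrative Tweedie idea is that the auxiliary sequence enters the estimator only through the conditional score \frac{\partial}{\partial x} \log f(x \mid s), which is unknown but can be estimated nonparametrically, for instance via reproducing kernels and convex optimization.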
Chapter 3 studies the thresholding problem for sparse estimation with spatial data. Exploiting spatial information in high-dimensional estimation promises to improve the accuracy of statistical procedures. This dissertation develops a new class of spatially adaptive false discovery rate thresholding (SAFT) procedures by extending the elegant false discovery rate thresholding estimator (Abramovich et al., 2006) to spatial settings. The idea is to first construct robust, structure-adaptive weights by estimating the local sparsity levels, and then to set spatially adaptive thresholds through a weighted Benjamini-Hochberg (BH) procedure. SAFT is data-driven and assumption-lean. Theoretical results demonstrate its superior asymptotic performance over the original false discovery rate thresholding estimator in spatial settings. The finite-sample performance is studied using both simulated and real data, which show that the proposed SAFT method outperforms existing methods in various settings.
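As a rough illustration of the second step, the minimal Python sketch below applies a weighted BH step-up rule to two-sided normal p-values and keeps only the selected coordinates, in the spirit of FDR thresholding. The function name, the nominal level q, and the assumption that location-specific weights (for example, built from estimated local sparsity levels) are supplied externally are illustrative choices for this sketch; it is not the dissertation's SAFT implementation.

import numpy as np
from scipy.stats import norm

def weighted_fdr_threshold(x, weights, sigma=1.0, q=0.1):
    # Minimal sketch: weighted BH step-up on two-sided normal p-values,
    # followed by keep-or-kill (hard) thresholding of the selected coordinates.
    m = len(x)
    w = np.asarray(weights, dtype=float)
    w = w * m / w.sum()                        # normalize weights to average one
    pvals = 2.0 * norm.sf(np.abs(x) / sigma)   # two-sided p-values under N(0, sigma^2)
    adj = pvals / w                            # weighted p-values
    order = np.argsort(adj)
    boundary = q * np.arange(1, m + 1) / m     # BH step-up boundary q*k/m
    passed = np.nonzero(adj[order] <= boundary)[0]
    est = np.zeros(m)
    if passed.size > 0:
        k = passed[-1] + 1                     # largest k whose weighted p-value clears the boundary
        selected = order[:k]
        est[selected] = x[selected]            # retain selected observations, zero out the rest
    return est

# Example usage with constant weights (reduces to unweighted FDR thresholding):
# est = weighted_fdr_threshold(np.random.normal(size=1000), np.ones(1000), q=0.1)

With all weights equal to one, the rule reduces to unweighted FDR thresholding; spatially varying weights lower the effective threshold in regions estimated to be signal-rich and raise it in regions estimated to be sparse.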