SHRINKAGE METHODS FOR BIG AND COMPLEX DATA ANALYSIS

by

Trambak Banerjee

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BUSINESS ADMINISTRATION)

May 2020

Copyright 2020 Trambak Banerjee

Acknowledgments

I am indebted to Professor Gourab Mukherjee for his mentorship and undivided attention through the course of my PhD. I have benefited immensely from our various discussions around statistical shrinkage, and those have greatly influenced the course of my research. I express my sincere gratitude to Professor Wenguang Sun for his support and counsel during my tenure at USC. The research meetings with Professor Sun have been instrumental in stimulating my interest in empirical Bayes techniques. I am extremely grateful to Professors Gareth James, Jinchi Lv and Shantanu Dutta for serving on my dissertation committee. Professors James and Lv provided invaluable feedback on my research through numerous research seminars, for which I thank them greatly. The High Dimensional Statistics class offered by Professor Lv was extremely helpful in introducing me to the current literature in this area. I became aware of several exciting problems in Marketing through my interactions with Professor Dutta, and those have led to some natural extensions of the research presented in this thesis. I thank my collaborators, Professors Debashish Paul, Qiang Liu, Peter Radchenko and Pulak Ghosh, for their valuable feedback on our various projects. I have been fortunate enough to be taught by Professors Arnab Chakraborty and Lieven Vandenberghe, and I thank them for being two of the best teachers that I know.

My days at USC were extremely enjoyable thanks to my fellow PhD students, both past and present. I thank Joshua Derenski, Jihoon Hong, Kai Yoshiyoka (at UC Irvine) and Professors Courtney Paulson and Luella Fu for their companionship and for being the exceptional individuals that they are. I have been blessed with a supportive and caring family, and this thesis is theirs as much as it is mine. My deepest gratitude to my parents for their unconditional love and encouragement through thick and thin. My wife, Dr. Padma Sharma, has been a phenomenal source of inspiration for me, and the Bayesian Econometrician in her has often offered perspectives that have made research most enjoyable. Finally, I thank Veer for bringing out his best performance on every instance of our ball games. He has, perhaps intentionally, restored and maintained sanity throughout this process and continues to do so with delight.

Table of Contents

Acknowledgments
Contents
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Adaptive sparse estimation with side information
  1.2 Empirical Bayes estimation in discrete linear exponential family
  1.3 Improved shrinkage prediction under a spiked covariance structure
  1.4 Hierarchical variable selection in multivariate mixed models
2 Adaptive Sparse Estimation with Side Information
  2.1 Introduction
  2.2 Adaptive sparse estimation with side information
  2.3 Theoretical analysis
  2.4 Numerical results
  2.5 Discussion
3 Empirical Bayes Estimation in the Discrete Linear Exponential Family
  3.1 Introduction
  3.2 A general framework for compound estimation in DLE family
  3.3 Theory
  3.4 Numerical results
  3.5 Real data analyses
4 Improved Shrinkage Prediction under a Spiked Covariance Structure
  4.1 Introduction
  4.2 Predictive setup: aggregative predictors and side-information
  4.3 Proposed methodology and asymptotic properties
  4.4 Proposed predictive rule for aggregative predictions
  4.5 Simulation studies
  4.6 Real data illustration with groceries sales data
5 Hierarchical Variable Selection in Multivariate Mixed Models
  5.1 Introduction
  5.2 Motivating data: activity, engagement, churn and promotion effects in freemium mobile games
  5.3 CEZIJ modeling framework
  5.4 Variable selection in CEZIJ
  5.5 Estimation procedure
  5.6 Analysis of freemium mobile games using CEZIJ
  5.7 Discussion
References
A Technical details related to Chapter 2
  A.1 The auxiliary screening approach
  A.2 Proofs
  A.3 Additional simulation experiments
  A.4 Microarray Time Course (MTC) Data
  A.5 Choice of K
B Technical details related to Chapter 3
  B.1 Results under the squared error loss
  B.2 Proofs
C Technical details related to Chapter 4
  C.1 Preliminary expansions for eigenvector and eigenvalues
  C.2 Proofs
D Technical details related to Chapter 5
  D.1 Details around the maximization problem in equation (5.9)
  D.2 Prediction equations
  D.3 Split-and-Conquer approach and numerical experiments
  D.4 Data description
  D.5 Variable selection by split-and-conquer: voting results

List of Tables

2.1 One-sample estimation with side information: risk estimates and estimates of $\hat{T}$ for ASUS at $m = 200$. Here $n^*_k = |I^*_k|$ and $n_k = |\hat{I}_k|$ for $k = 1, 2$.
2.2 Two-sample estimation with side information: risk estimates and estimates of $\hat{T}$ for ASUS at $n = 5000$. Here $n^*_k = |I^*_k|$ and $n_k = |\hat{I}_k|$ for $k = 1, 2$.
2.3 Summary of SureShrink and ASUS methods (RNA-Seq data). $n_k = |\hat{I}_k|$ for $k = 1, 2$.
3.1 Poisson compound decision problem under scaled squared error loss: risk ratios $R^{(1)}_n(\theta,\delta)/R^{(1)}_n(\theta,\delta^{\mathrm{neb}}_k)$ at $n = 5000$ for estimating $\theta$.
3.2 Poisson compound decision problem under squared error loss: risk ratios $R^{(0)}_n(\theta,\delta)/R^{(0)}_n(\theta,\delta^{\mathrm{neb}}_k)$ at $n = 5000$ for estimating $\theta$.
3.3 Binomial compound decision problem under scaled squared error loss: risk ratios $R^{(1)}_n(\theta,\delta)/R^{(1)}_n(\theta,\delta^{\mathrm{neb}}_k)$ at $n = 5000$ for estimating $\theta$.
3.4 Binomial compound decision problem under the usual squared error loss: risk ratios $R^{(0)}_n(\theta,\delta)/R^{(0)}_n(\theta,\delta^{\mathrm{neb}}_k)$ at $n = 5000$ for estimating $\theta$.
3.5 Loss ratios of the competing methods for estimating $\theta$.
3.6 Loss ratios of the competing methods for estimating $\theta$. News article genre: Economy; social media: Facebook.
3.7 Loss ratios of the competing methods for estimating $\theta$. News article genre: Microsoft; social media: LinkedIn.
4.1 Relative Error estimates (REE) of the competing predictive rules at $m = 15$ for Scenarios 1 and 2 under Experiment 1. The numbers in parentheses are standard errors over 500 repetitions.
4.2 Relative Error estimates (REE) of the competing predictive rules at $m = 15$ for Scenarios 1 and 2 under Experiment 2. The numbers in parentheses are standard errors over 500 repetitions.
4.3 Relative Error estimates (REE) of the competing predictive rules at $m = 15$ for Scenarios 1 and 2 under Experiment 3. The numbers in parentheses are standard errors over 500 repetitions.
4.4 Loss ratios (4.15) across six predictive rules for four products.
5.1 Parameter constraints and their interpretation. Here $\beta^{(s)}_{\mathrm{prom}(i)}$ indicates the fixed effect coefficient for promotion $i = \mathrm{I},\ldots,\mathrm{VI}$ under model $s = 1,\ldots,5$.
5.2 Selected fixed effect coefficients and their estimates under the sub-models Act. Indicator, Activity Time, Engag. Indicator and Engagement Amount and Dropout. The selected random effects are those variables that exhibit a (*) over their estimates. See Table D.3 for a detailed description of the covariates.
5.3 Results of predictive performance of the CEZIJ model and Benchmarks I to IV.
For activity and engagement indicators, the false positive (FP) rate and the false negative (FN) rate averaged over the 29 time points are reported. For non-zero activity time and engagement amounts, the ratio of prediction errors (5.10) of Benchmarks I to IV to the CEZIJ model averaged over the 29 time points is reported.
A.1 Asymptotic performance of ASUS: risk estimates and estimates of $\hat{T}$ for ASUS at $n = 5000$. Here $n^*_k = |I^*_k|$ and $n_k = |\hat{I}_k|$ for $k = 1, 2$.
A.2 Risk estimates and estimates of $\hat{T}$ for Aux-Scr at $n = 5000$ under the setting of the simulation experiment described in chapter 2.4.2. Here $n^*_k = |I^*_k|$ and $n_k = |\hat{I}_k|$ for $k = 1, 2$.
A.3 Summary of the performance of SureShrink and ASUS on MTC data. Here $n_k = |\hat{I}_k|$ for $k = 1, 2$.
D.1 Simulation setting I ($n = 500$, $m = 30$, $p = 10$, $K = 5$): average False Negatives (FN), average False Positives (FP) for fixed (composite or not) and random effects, and % of datasets with non-hierarchical selection.
D.2 Simulation setting II ($n = 2000$, $m = 30$, $p = 20$, $K = 10$): average False Negatives (FN), average False Positives (FP) for fixed (composite or not) and random effects, and % of datasets with non-hierarchical selection.
D.3 List of covariates and the five responses. The gaming characteristics are marked with a (*).
D.4 Summary statistics of the covariates reporting % of 0, mean, the 25th, 50th, 75th and 95th percentiles and the standard deviation for all active players (activity indicator equal to 1) across all $m = 60$ days. For timesince, however, the statistics are reported for all players and not just active ones.
D.5 Summary of the promotion strategies.
D.6 The number of times each candidate predictor is selected as a fixed effect and a random effect across the $K = 20$ splits for the five sub-models. For each sub-model, the predictors with at least 12 occurrences across the 20 splits were selected.

List of Figures

2.1 Heat map showing the average expression levels in the RNA-seq study. Left panel: VZV; right panel from top to bottom: C1, C2, C3 and C4, where the number of replicates (patients) is shown in parentheses. We can see that 80-90% of the genes under the VZV condition are unexpressed (black), and the same sparse structure seems to be roughly maintained in the other four experimental conditions. Useful side information on sparsity can be extracted from secondary data (C1-C4) and combined with the primary data (VZV) to construct more efficient estimators.
2.2 Toy example depicting ASUS. Left: SureShrink estimator at $t = 0.6$. Middle: ASUS with group 0 and $t_0 = 4.2$. Right: ASUS with group 1 and $t_1 = 0.15$.
2.3 One-sample estimation with side information for scenario S1: estimated risks of different estimators. Left: ASUS versus EBT and EJS. Right: ASUS versus Aux-Scr.
2.4 One-sample estimation with side information for scenario S2: estimated risks of different estimators. Left: ASUS versus EBT and EJS. Right: ASUS versus Aux-Scr.
2.5 Two-sample estimation with side information: average risks of different estimators. Left: Scenario S1; Right: Scenario S2.
2.6 Left: heat map showing the following from top to bottom: average expression levels of VZV, C1, C2, C3 and C4 across their respective replicates (in parentheses). Right: SURE estimate of the risk of $\hat{\theta}^{S}_i(t)$ at $t = 0.61$ versus an unbiased estimate of the risk of ASUS for different values of the grouping hyper-parameter.
2.7 (a) Histogram of gene expressions for VZV. Group 1 is $\hat{I}_2$ and Group 0 is $\hat{I}_1$. (b) A network of 20 new genes highlighted in black with their interaction partners.
3.1 Poisson compound decision problem under scaled squared error loss: risk estimates of the various estimators for Scenarios 1 to 4.
3.2 Poisson compound decision problem under squared error loss: risk estimates of the various estimators for Scenarios 1 to 4.
3.3 Binomial compound decision problem under scaled squared error loss: risk estimates of the various estimators for Scenarios 1 to 4.
3.4 Binomial compound decision problem under squared error loss: risk estimates of the various estimators for Scenarios 1 to 4.
3.5 Observed Juvenile Delinquency rates in 2012. The top 500 and bottom 500 counties are plotted. The data on Florida arrests is not available in the US Department of Justice and Federal Bureau of Investigation (2014) database.
3.6 Estimated Juvenile Delinquency rates of the 1000 counties exhibited in Figure 3.5. Left: estimation under squared error loss ($k = 0$). Right: estimation under scaled squared error loss ($k = 1$).
4.1 Experiment 1, Scenario 1 (Generalized absolute loss). Left: Relative Error estimates as $m$ varies over $(15, 20, 25, 30, 35, 40, 45, 50)$. Right: magnitude of the sorted shrinkage factors $\hat{f}^{\mathrm{prop}}_i$ averaged over 500 repetitions at $m = 15$ and sandwiched between its 10th and 90th percentiles.
4.2 Experiment 1, Scenario 2 (Linex loss). Left: Relative Error estimates as $m$ varies over $(15, 20, 25, 30, 35, 40, 45, 50)$. Right: magnitude of the sorted shrinkage factors $\hat{f}^{\mathrm{prop}}_i$ averaged over 500 repetitions at $m = 15$ and sandwiched between its 10th and 90th percentiles.
4.3 Experiment 2, Scenario 1 (Linex loss). Left: Relative Error estimates as $m$ varies over $(15, 20, 25, 30, 35, 40, 45, 50)$. Right: magnitude of the sorted shrinkage factors $\hat{f}^{\mathrm{prop}}_i$ averaged over 500 repetitions at $m = 15$ and sandwiched between its 10th and 90th percentiles.
4.4 Experiment 2, Scenario 2 (Linex loss). Left: Relative Error estimates as $m$ varies over $(15, 20, 25, 30, 35, 40, 45, 50)$. Right: magnitude of the sorted shrinkage factors $\hat{f}^{\mathrm{prop}}_i$ averaged over 500 repetitions at $m = 15$ and sandwiched between its 10th and 90th percentiles.
4.5 Experiment 3, Scenario 1 (Generalized absolute loss). Left: Relative Error estimates as $m$ varies over $(15, 20, 25, 30, 35, 40, 45, 50)$.
Right: magnitude of the sorted shrinkage factors $\hat{f}^{\mathrm{prop}}_i$ averaged over 500 repetitions at $m = 15$ and sandwiched between its 10th and 90th percentiles.
4.6 Experiment 3, Scenario 2 (Generalized absolute loss). Left: Relative Error estimates as $m$ varies over $(15, 20, 25, 30, 35, 40, 45, 50)$. Right: magnitude of the sorted shrinkage factors $\hat{f}^{\mathrm{prop}}_i$ averaged over 500 repetitions at $m = 15$ and sandwiched between its 10th and 90th percentiles.
4.7 CASP predicted weekly demand of the grocery items across the US averaged over the two prediction weeks, November 7, 2011 to November 20, 2011.
4.8 Role of coordinate-wise shrinkage in CASP across US states for four grocery items. In the figures, one minus the shrinkage factor is displayed, so deeper shades denote higher shrinkage.
5.1 Game play flowchart.
5.2 (a) Proportion of players active and proportion of players with positive engagement over 60 days. (b) Proportion of player churn from day 31 to day 60. (c) Median activity sandwiched between its 25th and 75th percentiles. (d) Mean engagement amount and the total engagement amount over the 60 days.
5.3 Schematic diagram of our joint model for player $i$. The suffix denoting the day number is dropped for presentational ease.
5.4 Heatmap of the $47 \times 47$ correlation matrix obtained from $\hat{\Sigma}$. On the horizontal axis are the selected composite effects of the four sub-models: AI, Activity Time, EI and Engagement Amount. The horizontal axis begins with the intercept from the AI model and ends with a2fight source from the Engagement Amount model.
5.5 Two networks that demonstrate several cross correlations across the models. Blue lines represent positive correlations and red lines represent negative correlations. The model numbers are inside the parentheses next to the predictor names. Left: key cross correlations between the sub-models AI and Activity Time. Right: key cross correlations between the sub-models Activity Time and EI.
5.6 Functional cluster analysis of predicted dropout probabilities over time. The plot presents three cluster centroids. The shaded bands around the centroids are the 25th and 75th percentiles of the churn probabilities. The vertical shaded regions in the graph correspond to the days on which different promotion strategies were in effect. The number of clusters was identified using prediction strength (Tibshirani and Walther 2005).
A.1 Pictorial representation of the coordinate-wise decomposition of the maximal risk $R_{AS}$. Here $j, k = 1, 2$ and $j \neq k$.
A.2 Asymptotic performance of ASUS: average risks of different estimators. The dashed line represents the risk of the oracle estimator $\tilde{\theta}^{SI}_i(T_{OR})$. Left: Scenario S1; Right: Scenario S2.
A.3 Two numerical solutions.
A.4 Toy example of chapter 2.2.3. Left: SURE estimate of the risk of ASUS as $K$ varies. Right: estimates of the risks for ASUS with $K = 2$, SureShrink, and ASUS with no side information but with $K = n/\log n$.
D.1 Computing time comparison for a fixed regularization parameter. Left: Simulation setting I with $n = 500$, $m = 30$, $p = 10$, $K = 5$. Right: Simulation setting II with $n = 2000$, $m = 30$, $p = 20$, $K = 10$.
D.2 Empirical CDF of Activity Time and Engagement Amount.
D.3 Distribution of the six promotion strategies over 60 days.

Abstract

In this thesis we discuss novel shrinkage methods for estimation, prediction and variable selection problems. We begin with large scale estimation and consider the problem of estimating a high-dimensional sparse parameter in the presence of side information that encodes the sparsity structure. In a wide range of fields including genomics, neuroimaging, signal processing and finance, such side information promises to yield more accurate and meaningful results. However, few analytical tools are available for extracting and combining information from different data sources in high-dimensional data analysis. We develop a general framework for incorporating side information into the sparse estimation problem and develop new theories to characterize regimes in which our proposed procedure far outperforms competitive shrinkage estimators. When the parameter of interest is not necessarily sparse and the data available for estimation are discrete, we propose a Nonparametric Empirical Bayes (NEB) framework for compound estimation in such discrete models. Specifically, we consider the discrete linear exponential family, which includes a wide class of discrete distributions frequently arising from modern big data applications, and develop a flexible framework for compound estimation with both regular and scaled squared error losses. We develop theory to show that the class of NEB estimators enjoys strong asymptotic properties, and present comprehensive simulation studies as well as real data analyses demonstrating the superiority of the NEB estimator over competing methods.

Contemporary applications in finance, health-care and supply chain management often require simultaneous predictions of several dependent variables when the true covariance structure is unknown. In these multivariate applications, the true covariance structure can often be well represented through a spiked covariance model. We propose a novel shrinkage rule for prediction in a high-dimensional non-exchangeable hierarchical Gaussian model with an unknown spiked covariance structure. We propose a family of commutative priors for the mean parameter, governed by a power hyper-parameter, which covers a spectrum from perfect independence to highly dependent scenarios. Corresponding to popular loss functions such as quadratic, generalized absolute, and linex losses, these prior models induce a wide class of shrinkage predictors that involve quadratic forms of smooth functions of the unknown covariance. By using uniformly consistent estimators of these quadratic forms, we propose an efficient procedure for evaluating these predictors which outperforms factor model based direct plug-in approaches.
We further improve our predictors by introspecting possible reductions in their variability through a novel coordinate-wise shrinkage policy that only uses covariance level information and can be adaptively tuned using the sample eigen structure. We establish the asymptotic optimality of our proposed procedure and present simulation experiments as well as real data examples illustrating the efficacy of the proposed method.

We conclude by considering a specific business problem in Marketing that involves predicting player activity and engagement in app-based freemium mobile games. Longitudinal data from such games usually exhibit a large set of potential predictors, and choosing the relevant set of predictors is highly desirable for various purposes including improved predictability. We propose a scalable joint modeling framework that conducts simultaneous, coordinated selection of fixed and random effects in high-dimensional generalized linear mixed models. Our modeling framework simultaneously analyzes player activity, engagement and drop-outs (churns) in app-based mobile freemium games and addresses the complex inter-dependencies between a player's decision to use a freemium product, the extent of her direct and indirect engagement with the product, and her decision to permanently drop its usage. The proposed framework extends the existing class of joint models for longitudinal and survival data in several ways. It not only accommodates extremely zero-inflated responses in a joint model setting but also incorporates domain-specific, convex structural constraints on the model parameters. Moreover, for analyzing such large-scale datasets, variable selection and estimation are conducted via a distributed computing based split-and-conquer approach that massively increases scalability and provides better predictive performance than competing predictive methods.

Chapter 1
Introduction

The fundamental notion of shrinkage lies in carefully introducing biases in estimators with the goal of improving overall statistical performance. Led by the pioneering works of Stein (1956) and James and Stein (1961), and along with Herbert Robbins' empirical Bayes connection (Robbins 1956), the ideas of statistical shrinkage have proven to be extremely valuable in designing optimal procedures for large scale estimation, testing and prediction problems (Efron 2012, Fourdrinier et al. 2017). The traditional role of these shrinkage methods, however, has been evolving thanks to the tremendous pace at which techniques for scientific data production and storage are advancing (Efron and Hastie 2016). Besides being big, these modern datasets exhibit complex structural properties that often require non-standard inferential attributes such as asymmetric loss functions, information sharing across disparate sources, as well as intricate modeling caveats exacerbated by unobserved heterogeneity. These fundamental characteristics of present-day big and complex data pose challenges not only in developing flexible statistical algorithms but also in optimally tuning them to obtain sound theoretical properties. In this thesis, we discuss novel shrinkage methods for the analysis of such big and complex data. The statistical methods discussed here address shrinkage estimation, prediction and variable selection problems, and their subsequent application in areas such as Virology, Inventory Management and Marketing. In what follows, we provide an overview of the various chapters of this thesis.
1.1. Adaptive sparse estimation with side information

Recent technological advancements in data gathering, processing and sharing have led to the accumulation of huge digital repositories which can be cheaply and readily accessed for the collection of supplementary information pertinent to several inferential problems that arise in genomics, finance and engineering. One of the biggest challenges for fruitful use of such side information is to assess the quality and relevance of the information present in these readily available supplementary data sources and thereafter to optimally assemble the extracted intelligence with the information present in the primary data sources. In high dimensional problems, where signals are often sparse, borrowing information from such supplementary data sources will be extremely useful because even a little additional information regarding the signal sparsity can lead to a significant boost in the performance of statistical algorithms. In chapter 2, we develop a robust method for efficient processing and assembling of relevant side information for the problem of estimating a high dimensional sparse parameter. Estimation problems of this flavor arise in several applications (Johnstone and Silverman 2004); for example,

1. in the estimation of differential gene expression in microarray experiments (Matsui and Noma 2011, Holland et al. 2016, Erickson et al. 2005) or in developing genetic biomarkers for personalized medicine (Matsui 2013);

2. in astronomy, where the goal might be to estimate the average change in pixel densities over time given noisy versions of the images, along with the knowledge that many of the pixel densities may be 0 (Nugent et al. 2011, Cai et al. 2018+).

In these applications, we typically observe a vector $Y = (Y_1, Y_2, \ldots, Y_n)$ whose components satisfy
$$Y_i = \theta_i + \epsilon_i,$$
where the $\epsilon_i$ are independent $N(0, \sigma_i^2)$ with $\sigma_i^2$ known. The aim is to recover $\theta = (\theta_1, \ldots, \theta_n)$, many of whose coordinates might be small or zero. Thresholding is a natural approach that is frequently employed in these problems, where the thresholding estimator relies on a threshold $t$ to determine whether $\theta_i$ is to be estimated as 0. The threshold is generally chosen in a data driven fashion and, therefore, depends critically on the sparsity of the signal. Often in these settings, we may encounter a separate set of observations $S = (S_1, \ldots, S_n)$ that hold sparsity information about $\theta$. We call such a set as having side information about the parameter $\theta$. For example, in the context of differential gene effect estimation across two experimental conditions, the expression levels of the same genes from a related study done in the past may hold relevant information about the effect sizes that are small or negligible in the current study. In chapter 2, we propose an efficient and adaptive methodology for incorporating this extra information into the estimation process. When such side information is available, our methodology adapts to the unknown sparsity and uses the extra information in $S$ to partition the estimation problem into groups with different sparsity levels, thereby fine tuning the choice of threshold $t$ for the thresholding estimators in each group. Moreover, our empirical evidence and theoretical analysis suggest that thresholding estimators that rely on the proposed methodology to incorporate the side information can provide superior mean square error rates for estimating $\theta$ than estimators that rely only on the information in $Y$. An interesting feature of our methodology is that it does not require $S$ to be sparse. In other words, the side information can be dense and yet hold sparsity information about $\theta$.
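To make the role of group-specific thresholds concrete, here is a minimal simulation sketch (not code from the thesis): it applies soft thresholding with a harsher threshold in the group that the auxiliary sequence flags as likely null and a milder one elsewhere. The grouping cutoff, the threshold values and the data-generating choices are illustrative assumptions only.

```python
import numpy as np

def soft_threshold(y, t):
    """Soft-thresholding operator: shrink y toward 0 by t, zero out |y| <= t."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

rng = np.random.default_rng(0)
n = 10_000
theta = np.zeros(n)
theta[:1000] = rng.uniform(3, 5, size=1000)      # a sparse set of true signals
y = theta + rng.normal(size=n)                   # primary data with sigma_i = 1
s = np.abs(theta + rng.normal(size=n))           # noisy, non-sparse side information

# Split coordinates into two groups by the auxiliary sequence and use a large
# threshold in the (likely null) group, a small one in the (likely signal) group.
tau = np.quantile(s, 0.8)                        # illustrative grouping cutoff
likely_null = s <= tau
theta_hat = np.empty(n)
theta_hat[likely_null] = soft_threshold(y[likely_null], t=2.5)
theta_hat[~likely_null] = soft_threshold(y[~likely_null], t=0.5)

print("MSE with group-specific thresholds:", np.mean((theta_hat - theta) ** 2))
print("MSE with a single threshold       :", np.mean((soft_threshold(y, 1.5) - theta) ** 2))
```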
1.2. Empirical Bayes estimation in discrete linear exponential family

While chapter 2 is devoted to the classical normal means problem (see Brown and Greenshtein (2009), Efron (2011), Xie et al. (2012), Tan (2015), Fourdrinier et al. (2017), Weinstein et al. (2018) and the references therein), where the data are naturally continuous, in many modern large scale inference problems one often encounters discrete or count data. Empirical Bayes approaches for such discrete compound decision problems date back to the famous Robbins' formula (Robbins 1956) for the Poisson model. Recently, and under the same model, Brown et al. (2013) propose a three step smoothing approach that vastly improves the Robbins estimator, while Koenker and Mizera (2014) and Koenker and Gu (2017) use the non-parametric maximum likelihood approach of Kiefer and Wolfowitz (1956) to construct efficient decision rules under the Poisson and Binomial models. The aforementioned works usually consider the squared error loss for constructing their empirical Bayes estimates, which rely on approximating the posterior mean through smooth functionals of the unknown marginal density $f$ of the data. In chapter 3, we develop a general framework for empirical Bayes compound estimation of several parameters in the discrete linear exponential (DLE) family. Let $Y$ be a non-negative integer valued random variable. The distribution of $Y$ is said to belong to a DLE family with parameter $\theta > 0$ if its probability mass function is of the form
$$p(y \mid \theta) = \frac{a_y\, \theta^y}{g(\theta)}, \qquad y \in \{0\} \cup \mathbb{N},$$
where $a_y$ and $g(\theta)$ are known functions such that $a_y \geq 0$ is independent of $\theta$ and $g(\theta)$ is a normalizing factor which is a differentiable function of $\theta$. This family includes several popular discrete distributions, such as the Poisson, Binomial, Negative Binomial and Geometric distributions, that routinely arise in modern big data applications. Under the hierarchical model
$$Y_i \mid \theta_i \overset{\text{ind.}}{\sim} \mathrm{DLE}(\theta_i), \qquad \theta_i \overset{\text{i.i.d.}}{\sim} G(\cdot), \qquad i = 1, \ldots, n,$$
with $G(\cdot)$ being an unspecified prior distribution on $\theta_i$, we consider estimating $\theta = (\theta_1, \ldots, \theta_n)$ based on $Y = (Y_1, \ldots, Y_n)$. We work with the Bayes estimators in this wide class of discrete distributions, which involve ratios of the unknown marginal probability mass functions. Unlike the popular $f$-modeling strategies, we estimate the ratio functionals of these marginals directly using a convex optimization criterion and thus avoid plugging the estimated marginal pmf into the Bayes rule. Moreover, motivated by applications that often require asymmetries in decision making, our estimation framework uses a scaled squared error loss which regulates such asymmetries, and is far more general as it covers a much wider range of distributions, incorporates structural constraints like monotonicity of the data driven decision rule, and presents a unified approach to compound estimation in discrete models.
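For the Poisson member of this family, the classical Robbins (1956) formula mentioned above gives $E(\theta_i \mid Y_i = y) = (y+1)\,f(y+1)/f(y)$, where $f$ is the marginal pmf of the counts. The sketch below implements this $f$-modeling baseline with empirical frequencies; it is shown only as a point of reference and is not the NEB estimator developed in chapter 3.

```python
import numpy as np

def robbins_poisson(y):
    """Classical Robbins (1956) estimator of E[theta_i | Y_i] for Poisson counts.

    Replaces the unknown marginal pmf f with empirical frequencies in
    E[theta | Y = y] = (y + 1) * f(y + 1) / f(y).
    """
    y = np.asarray(y)
    counts = np.bincount(y, minlength=y.max() + 2)
    f = counts / len(y)                          # empirical marginal pmf
    with np.errstate(divide="ignore", invalid="ignore"):
        est = (y + 1) * f[y + 1] / f[y]          # f[y] > 0 for every observed y
    return np.nan_to_num(est, nan=0.0, posinf=0.0)

# quick illustration with an (assumed) Gamma prior on theta
rng = np.random.default_rng(1)
theta = rng.gamma(shape=2.0, scale=1.5, size=5000)
y = rng.poisson(theta)
print("MLE (y itself) MSE :", np.mean((y - theta) ** 2))
print("Robbins formula MSE:", np.mean((robbins_poisson(y) - theta) ** 2))
```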
1.3. Improved shrinkage prediction under a spiked covariance structure

In chapter 4, we study the problem of statistical prediction under the predictive framework of Aitchison and Dunsmore (1976) and Geisser (1993), where the observed past $X \sim N_n(\theta, \Sigma)$ and the future $Y \sim N_n(\theta, m_0^{-1}\Sigma)$ are assumed independent given $(\theta, \Sigma)$. Here $\theta$ is an unknown $n \times 1$ vector, $m_0 > 0$ is a known constant and $\Sigma$ is an unknown $n \times n$ positive definite matrix assumed to have a spiked covariance structure (see for example Onatski, Moreira, and Hallin (2014), Kritchman and Nadler (2009), Fan, Liao, and Mincheva (2013), El Karoui (2008), Dobriban, Leeb, Singer, et al. (2020), Ma (2013), Cai, Ma, and Wu (2013) and the references therein). The multivariate applications that motivated this study, such as the high dimensional demand prediction problem discussed in chapter 4.6, often have low-dimensional representations, and thus the variability in such data can be well represented through a spiked covariance structure on $\Sigma$. The prediction problem under this framework is to compute $\hat{q} = \{\hat{q}_i(X) : 1 \leq i \leq n\}$ based on the past data $X$ such that $\hat{q}$ optimally predicts $Y$ under some loss function. We propose a coordinate-wise adaptive shrinkage predictive rule that conducts shrinkage prediction under a hierarchical setup which imposes a non-exchangeable Gaussian prior on $\theta$ of the form
$$\theta \mid \eta, \lambda, \alpha, \Sigma \sim N_n(\eta, \lambda\,\Sigma^{\alpha}).$$
Here the shape parameter $\alpha$ is key to controlling the essential characteristics of the posterior density of $\theta$, and when $\alpha = 0$ the exchangeable prior on the locations resembles the set-up of Xie et al. (2012) with known diagonal covariance. Corresponding to popular loss functions such as quadratic, generalized absolute and linex losses, the Bayes predictors in this hierarchical model involve quadratic forms of smooth functions of the population covariance matrix $\Sigma$. Our approach uses results on the behavior of eigenvalues and eigenvectors of high-dimensional sample covariance matrices (Paul 2007, Onatski 2012, Baik and Silverstein 2006) to construct uniformly consistent estimators of the quadratic forms involving $\Sigma$, and develops a bias-correction principle that leads to an efficient approach for evaluating the Bayes predictors.
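As a rough illustration of the kind of spectral bias correction that such spiked-covariance calculations rely on, the sketch below simulates a rank-two spiked model and debiases the top sample eigenvalues using the standard phase-transition map for spiked eigenvalues. The rank, spike values and noise level are arbitrary choices, and this is not the CASP procedure of chapter 4.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 400                        # dimension n, sample size m; aspect ratio gamma = n/m
gamma = n / m
spikes = np.array([25.0, 10.0])        # population spiked eigenvalues (unit noise variance)

# Build a rank-2 spiked covariance Sigma = I + sum_k (spike_k - 1) v_k v_k'
V = np.linalg.qr(rng.normal(size=(n, len(spikes))))[0]
Sigma = np.eye(n) + (V * (spikes - 1.0)) @ V.T

X = rng.multivariate_normal(np.zeros(n), Sigma, size=m)
sample_eigs = np.linalg.eigvalsh(X.T @ X / m)[::-1][:len(spikes)]  # known zero mean

# Debias the leading sample eigenvalues by inverting the map
# ell = lambda * (1 + gamma / (lambda - 1)), which biases them upward.
b = 1.0 + sample_eigs - gamma
debiased = (b + np.sqrt(b ** 2 - 4.0 * sample_eigs)) / 2.0

print("population spikes :", spikes)
print("sample eigenvalues:", np.round(sample_eigs, 2))
print("debiased estimates:", np.round(debiased, 2))
```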
1.4. Hierarchical variable selection in multivariate mixed models

In chapter 5 we consider a business problem involving app-based freemium mobile games, where the product managers and marketers associated with these games are often interested in predicting player engagement and the monetization potential of their playing base. The longitudinal data associated with these games usually exhibit numerous covariates related to both game-specific and player-specific variables, and choosing the relevant set of covariates is highly desirable for improving predictability. The high-dimensionality of these datasets, however, renders classical variable selection techniques incompetent. We develop a novel shrinkage algorithm that conducts variable selection from a large set of potential predictors in a Generalized Linear Mixed Model (GLMM) based joint model. To produce interpretable effects, our proposed framework imposes a hierarchical structure on the selection mechanism and includes covariates either as fixed effects or as composite effects, where the latter are those covariates that have both fixed and random effects (Hui et al. 2017a). Efficient selection of fixed and random effect components in a mixed model framework has received considerable attention in recent years (Bondell et al. 2010, Fan and Li 2012, Lin et al. 2013). Our approach is related to the CREPE (Composite Random Effects PEnalty) estimator of Hui et al. (2017a), which conducts hierarchical variable selection in a GLMM with a single longitudinal outcome and employs the Monte Carlo EM (MCEM) algorithm of Wei and Tanner (1990) to maximize the likelihood. However, unlike Hui et al. (2017a), we conduct hierarchical variable selection in a joint model of multiple correlated longitudinal outcomes, incorporate convexity constraints on the fixed effects, and rely on a distributed computing based approach to process large longitudinal data-sets.

Chapter 2
Adaptive Sparse Estimation with Side Information

2.1. Introduction

Recent technological advancements have made it possible to collect vast amounts of data with various types of side information such as domain knowledge, expert insights, covariates in the primary data, and secondary data from related studies. In a wide range of fields including genomics, neuroimaging and signal processing, incorporating side information promises to yield more accurate and meaningful results. However, few analytical tools are available for extracting and combining information from different data sources in high-dimensional data analysis. This article aims to develop new theory and methodology for leveraging side information to improve the efficiency in estimating a high-dimensional sparse parameter. We study the following closely related issues: (i) how to properly extract or construct an auxiliary sequence to capture useful sparsity information; (ii) how to combine the auxiliary sequence with the primary summary statistics to develop more efficient estimators; and (iii) how to assess the relevance and usefulness of the side information, as well as the robustness and optimality of the proposed method.

2.1.1 Motivating applications

Sparsity is an essential phenomenon that arises frequently in modern scientific studies. In a range of data-intensive application fields such as genomics and neuroimaging, only a small fraction of data contain useful signals. The detection, estimation and testing of a high-dimensional sparse object have many important applications and have been extensively studied in the literature (Donoho and Jin 2004, Johnstone and Silverman 2004, Abramovich et al. 2006). For instance, in the RNA-seq study that will be analyzed in Section 2.4.3, the goal is to estimate the true expression levels of $n = 53{,}216$ genes for the virus strain VZV, which is the causative agent of varicella (chickenpox) and zoster (shingles) in humans (Zerboni et al. 2014). The parameter of interest (the population mean vector of gene expression) is sparse, as it is known that very few genes in the generic RNA-seq kits express themselves in these single-cell virology studies (Sen et al. 2018). The accurate identification and estimation of nonzero large effects is helpful for the discovery of novel genetic biomarkers, which constitutes a key step in the development of new treatments and personalized medicine (Matsui 2013, Holland et al. 2016, Erickson et al. 2005). Another example arises from microarray time-course (MTC) experiments (see Appendix A.4), where the goal is to identify genes that exhibit a specific pattern of differential expression over time. The temporal pattern, which can be revealed by estimating the differences in expression levels of genes between two time points, would help gain insights into the mechanisms of the underlying biological processes (Calvano et al. 2005, Sun and Wei 2011). After baseline removal, the parameter of interest is the difference between two mean vectors that are both individually sparse.
In practice, the intrinsic sparsity structure of the high-dimensional parameter is often captured by side information, which can either be obtained as summary statistics from secondary data sources or be constructed as a covariate sequence from the original data. For instance, in the RNA-seq data, expression levels corresponding to four other experimental conditions (C1, C2, C3 and C4) are also available for the same $n$ genes through related studies conducted in the lab. The heat map in Figure 2.1 shows that the sparse structure of the mean transcription levels of the genes for VZV is roughly maintained by the same set of genes in subjects from the other four conditions. The common structural information shared by both cases (VZV) and controls (C1 to C4) can be exploited to construct more efficient estimation procedures. In the two-sample sparse estimation problem considered in the MTC study (analyzed in Appendix A.4), we illustrate that a covariate sequence can be constructed from the original data matrix to assist inference by capturing the sparseness of the mean difference. Intuitively, incorporating side information promises to improve the efficiency of existing methods and the interpretability of results. However, in conventional practice, such useful auxiliary data have been largely ignored in analysis.

Figure 2.1: Heat map showing the average expression levels in the RNA-seq study. Left panel: VZV; right panel from top to bottom: C1, C2, C3 and C4, where the number of replicates (patients) is shown in parentheses. We can see that 80-90% of the genes under the VZV condition are unexpressed (black), and the same sparse structure seems to be roughly maintained in the other four experimental conditions. Useful side information on sparsity can be extracted from the secondary data (C1-C4) and combined with the primary data (VZV) to construct more efficient estimators.

2.1.2 ASUS: a general framework for leveraging side information

In this article, we develop a general integrative framework for sparse estimation that is capable of handling side information that may be extracted from (i) prior or domain-specific knowledge, (ii) a covariate sequence based on the same (original) data, or (iii) summary statistics based on secondary data sources. Let $\theta = (\theta_1, \ldots, \theta_n)$ be an unknown high-dimensional sparse parameter. Our study focuses on the class of non-linear thresholding estimators [see Chs. 8 and 13 of Johnstone (2015) and Ch. 11 of Mallat (2008)], which have been widely used in the sparse case where many coordinates of $\theta$ are small or zero. The proposed estimation framework involves two steps: first constructing an auxiliary sequence $S = (S_i : 1 \leq i \leq n)$ to capture the sparse structure, and second combining $S$ with the primary statistics, denoted $Y = (Y_i : 1 \leq i \leq n)$, via a group-wise adaptive thresholding algorithm. Our idea is that the coordinates of $\theta$ become non-exchangeable in light of side information. To reflect this heterogeneity, we divide all coordinates into $K$ groups based on $S_i$. The side information is then incorporated in our estimation procedure by applying soft-thresholding estimators separately within each group, thereby fine tuning the group-wise thresholds to capture the varied sparsity levels across groups. The optimal grouping and thresholds are chosen adaptively via a data-driven approach, which employs Stein's unbiased risk estimate (SURE) criterion to minimize the total estimation risk.
The proposed method, which carries out adaptive SURE-thresholding using side information (ASUS), is shown to have robust performance and enjoy optimality properties. ASUS is simple and intuitive, but nevertheless provides a general framework for information pooling in sparse estimation problems. Concretely, since ASUS does not rely on any functional relationship between $S$ and $\theta$, it is robust and effective in leveraging side information in a wide range of scenarios. In chapter 2.2.2, we demonstrate that this flexible framework can be applied to various sparse estimation problems.

The amount of efficiency gain of ASUS depends on two factors: (i) the usefulness of the side information; and (ii) the effectiveness in utilizing the side information. To understand the first issue, we formulate in chapter 2.3 a hierarchical model to assess the informativeness of an auxiliary sequence. Our theoretical analysis characterizes the conditions under which methods ignoring side information are suboptimal compared to an "oracle" with perfect knowledge of the sparsity structure. To investigate the second issue, chapter 2.3 establishes precise conditions under which ASUS is asymptotically optimal, in the sense that its maximal risk is close to the theoretical limit attained by the oracle. Finally, we carry out a theoretical analysis of the robustness of ASUS; our results show that pooling non-informative side information would not harm the performance of data combination procedures. Our asymptotic results are built upon the elegant higher-order minimax risk evaluations developed by Johnstone (1994).

2.1.3 Connections with existing work and our contributions

ASUS is a non-linear shrinkage estimator that incorporates relevant side information by choosing data-adaptive thresholds to reflect the varied sparsity levels across groups. We use the SURE criterion for simultaneous tuning of the grouping and shrinkage parameters. Our methodology is related to Xie et al. (2012), Tan (2015) and Weinstein et al. (2018), which utilized SURE to devise algorithms reflecting optimal shrinkage directions. However, these works were developed for different purposes (addressing the heteroscedasticity issue in the data) and do not cover the sparse case. The notion of side information in estimation has been explored in several research fields. In information theory, for instance, sparse source coding with side information is a well studied problem (Wyner 1975; Cover and Thomas 2012; Watanabe et al. 2015). However, these methodologies focus on very different goals and cannot be directly applied to our problem. In the statistical literature, the use of side information in sparse estimation problems has been mainly limited to regression settings where the side information must be in the form of a linear function of $\theta$ (Ke et al. 2014, Kou and Yang 2015). By contrast, our estimation framework utilizes a more flexible scheme that does not require the specification of any functional relationship between $\theta$ and the side information. The proposed ASUS algorithm is simple and intuitive but nevertheless enjoys strong numerical and theoretical properties. Our simulation studies show that it can substantially outperform competitive methods in many settings. ASUS is a robust data combination
First, we develop general principles for constructing and pooling the side information, which guarantees proper information extraction and ro- bust performance of ASUS. Second, we formulate a theoretical framework to assess the usefulness of side information. Third, we establish precise conditions under which ASUS is asymptotically optimal. Finally, we extend the sparse minimax decision theory of Johnstone (2015), which provides the foundation for a range of sparse inference problems (Abramovich et al. 2006, 2007, Cai et al. 2014, Tibshirani et al. 2014, Collier et al. 2017), to derive new high-order characterizations of the maximal risk of soft-thresholding estimators. 2.1.4 Organization of this chapter Chapter 2.2 describes the proposed ASUS procedure. Chapter 2.3 presents theoretical analyses. The numer- ical performances of ASUS are investigated using both simulated and real data in chapter 2.4. Chapter 2.5 concludes with a discussion. Additional numerical results and proofs are given in Appendix A. 2.2. Adaptive sparse estimation with side information This section first describes the model and assumptions (chapter 2.2.1), then discusses how to construct the auxiliary sequence (chapter 2.2.2), and finally proposes the methodology (chapter 2.2.3). 13 2.2.1 Model and assumptions To conduct a systematic study of the influence of side information for estimating, we consider a hierar- chical model that relates the primary and auxiliary data sets through a latent vector = ( 1 ;:::; n ), which represents the noiseless side information that encodes the sparsity information of. The latent vector cannot be observed directly but may be partially revealed by an auxiliary sequence (noisy side information) S = (S 1 ; ;S n ). For instance, in the RNA-seq example, the parameter of interest is the population mean of the gene expression levels for diseased patients, and the latent variable i may represent the quantitative outcome of a complex gene regulation process that determines whether gene i expresses itself under the influence of a certain experimental condition. The primary and secondary data respectively correspond to gene expression levels for the patients from the concerned (i.e. VZV infected) and other related groups. The primary and auxiliary statisticsY i andS i for genei can be constructed based on the corresponding sample means. Forn parallel units, the summary statisticY i for theith unit is modeled by Y i = i + i ; i N(0; 2 i ); (2.1) where, by convention, 2 i are assumed to be known or can be well estimated from data (e.g. (Brown and Greenshtein 2009, Xie et al. 2012, Weinstein et al. 2018)). We further assume that both andS are related to the latent vector through some unknown real-valued functionsh andh s : i = h ( i ; 1i ); (2.2) S i = h s ( i ; 2i ); (2.3) 14 where 1i and 2i follow some unspecified priors, and represent independent random perturbations that are independent of i ; concrete examples for Models 2.1 to 2.3 are discussed in chapter 2.2.2. Remark 2.2.1. The above model can be conceptualized as a Bayesian hierarchical model: Y i j( i ;S i )N( i ; 2 i ); ( i ;S i )j i f 1 (j i )f 2 (sj i ); i iid f 3 (); wheref 1 ;f 2 ;f 3 are unknown densities. In Equations 2.2 and 2.3, i is a random quantity and independent of 1i and 2i . As a special case of Equation 2.2, we can write i =h ( i ) without the random perturbations 1i . Our theory is mainly stated in terms of random i ’s for ease of presentation. 
2.2.2 Constructing the auxiliary sequence: principles and examples

A key step in our methodological development is to properly extract side information using an auxiliary sequence. The sequence $S$ can be constructed from various data sources, including the following three basic settings: (i) prior or domain-specific knowledge; (ii) covariates or discarded data in the same primary data set; or (iii) secondary data from related studies. We stress that our estimation framework is valid for all three settings as long as $S$ fulfills the following two fundamental principles.

The first principle is informativeness, which requires that $S_i$ should be chosen or constructed in a way that encodes the sparse structure effectively. The second principle is conditional independence, which requires that $S_i$ must be conditionally independent of $Y_i$ given the latent variable $\nu_i$. The conditional independence assumption, which is implied by Models 2.1 to 2.3, ensures proper shrinkage directions and plays a key role in establishing the robustness of ASUS. Examples 1 to 4 below present specific instances of auxiliary sequences fulfilling these principles, wherein the auxiliary sequences may either be readily available from distinct but related experiments or be carefully constructed from the same (original) data to capture important structural information that is discarded by conventional practice.

Example 1. Prioritized subset analysis (PSA, Li et al. 2008). In genome wide association studies, prior data and domain knowledge such as known gene functions or interactions may be used to construct an auxiliary sequence $S$ that can prioritize the discovery of SNPs in certain genomic regions. Typically, the primary data set can be summarized as a vector $Y = (Y_1, \ldots, Y_n)$, where the $Y_i$ are either taken as differential allele frequencies between diseased and control groups, or as z-values based on $\chi^2$-tests assessing the association between the allele frequency and the disease status. Let $S = (S_1, \ldots, S_n) \in \{-1, 1\}^n$ be an auxiliary sequence, where $S_i = 1$ if SNP $i$ is in the prioritized subset and $S_i = -1$ otherwise. $S$ can be viewed as a perturbation of the true state sequence $\nu = (\nu_1, \ldots, \nu_n)$, where $\nu_i = 1$ if SNP $i$ is associated with the disease and $\nu_i = -1$ otherwise. The informativeness and independence principles are fulfilled when (i) the prioritized subset contains SNPs that are more likely to hold disease susceptible variants and (ii) the perturbations of $\nu$ are random (hence $Y_i$ and $S_i$ are conditionally independent given $\nu_i$). Both (i) and (ii) seem reasonable assumptions in PSA studies.

Example 2. One-sample inference. In the RNA-seq study, let the primary data be $\{Y_{i,j} : i = 1, \ldots, n;\ j = 1, \ldots, k_y\}$, which record the expression levels of $n$ genes from $k_y$ subjects infected by VZV. The primary statistics are $Y = (\bar{Y}_1, \ldots, \bar{Y}_n)$, where $\bar{Y}_i = k_y^{-1}\sum_{j=1}^{k_y} Y_{i,j}$.
In the RNA-seq study, let the primary data befY i;j :i = 1; ;n;j = 1; ;k y g that record the expression levels ofn genes fromk y subjects infected by VZV . The primary statis- tics areY = ( Y 1 ; ; Y n ), where Y i =k 1 y P ky j=1 Y i;j . Let the secondary data befX i;j :i = 1; ;n;j = 16 1; ;k x g that record the expression levels of the samen genes fork x subjects but under different Con- ditions C1 to C4. The auxiliary sequence can be constructed asS = (S 1 ; ;S n ) = (j X 1 j; ;j X n j), where X i = k 1 x P kx j=1 X i;j . Thus although we record the expression levels of the same set ofn genes, in the case of the primary data the genes are infected with the VZV virus whereas for the secondary data the expression levels are recorded under the influence of agents that are different from that of the VZV virus. The latent state i represents whether genei expresses itself under any of the conditions. Now we check whether the two information extraction principles are fulfilled. First, the informativeness principle holds since, as demonstrated by the heat map in Figure 2.1, inactive genes under VZV are likely to remain inac- tive under the other conditions. The sparse structure is captured by the auxiliary sequence, where a small S i signifies an inactive gene. Second, chapter 2.2.1 has explained how the RNA-seq data may be sensibly conceptualized via Models (2.1) to (2.3), where Y i and S i are conditionally independent given the latent variable i , fulfilling the second principle. Example 3. Two-sample inference. Consider the MTC study discussed in the introduction (and analyzed in Appendix A.4). LetfY i;j;t d : i = 1;:::;n;j = 1;:::;k i ;d = 0; 1; 2g record the expression levels of n genes from k i subjects at time points t 0 (baseline), t 1 and t 2 . Let Y i;d = k 1 i P k i j=1 (Y i;j;t d Y i;j;t 0 ) be the average expression levels of gene i at time point t d after baseline adjustment, d = 1; 2. Denote i;d =E( Y i;d ) and d = ( i;d : 1in). Then both 1 and 2 are individually sparse. The parameter of interest is i = i;1 i;2 , which can be estimated by the primary statisticY i = Y i;1 Y i;2 . Denote the union supportU =fi : i;1 6= 0 or i;2 6= 0g. ThenU can be exploited to screen out zero effects since if i = 2U, we must have i = 0. Consider the sequence S i =j Y i;1 + i Y i;2 j, where i = ^ i;1 =^ i;2 and ^ 2 i;d = (k i 1) 1 P k i j=1 (Y i;j;t d Y i;j;t 0 Y i;d ) 2 . Then the auxiliary sequence is informative since a large S i provides strong evidence thati2U. The union support encodes the sparse structure of . Moreover,Y i andS i are asymptotically independent with our choice of i (Proposition 6 in Cai et al. 2018+). Hence both principles are fulfilled. 17 Example 4. Estimation under the ANOVA setting. This example is an extension of Example 3 to multi- sample inference. Considerm conditionsd = 1;:::;m,m 2. The parameter of interest is n1 = a, where nm = ( 1 ;:::; m ), i;d =E( Y i;d ) and a m1 is a vector of known weights. Here may represent a weighted average of true transcription levels ofn genes acrossm time points. Let D i = ( Y i;1 ;:::; Y i;m ) be the vector of average expression level of gene i for the m time points after baseline adjustment and denoteD nm = (D 1 ;:::; D n ) T . To estimate, our proposed framework suggests using the usual unbiased estimatorY Y Y =Da as the primary statistic, andS S S =Db as the auxiliary sequence for some weights b. The informativeness principle from Example 3 continues to hold under this setting. 
To fulfil the independence principle, we choose b such thatCov(Y Y Y;S) = 0. In Examples 3 and 4, the auxiliary sequenceS is constructed from the same original data matrix. We give some intuitions to explain whyS is useful. The conventional practice reduces the original data into a vector of summary statisticsY . However, this data reduction step often causes significant loss of informa- tion and thus leads to suboptimal procedures. Specifically, the information on the sparseness of the union supportU is lost in the data reduction step. The key idea in Example 3 is that the auxiliary sequenceS S S captures the structural information on sparsity, which is discarded by conventional practice. Therefore by incorporatingS S S into the inferential process we can improve the efficiency of existing methods. Note thatY is not a sufficient statistic for estimating , the minimax estimation error based on (Y;S) can greatly im- prove the performance of all estimators that are based onY alone; a rigorous theoretical analysis is carried out in the proof of Theorem 2.3.1. To summarize, the above examples illustrate that the side information can be either “external” (Examples 1-2) or “internal” (Examples 3-4). The key in the proposed estimation framework, which we discuss next, is to construct a proper auxiliary sequence that fulfills the two funda- mental principles. We shall develop a unified estimation framework that is capable of handling both internal and external side information. 18 We conclude this section with two remarks. First, the conditional independence assumption can be relaxed; the methodology would work as long asY i andS i are conditionally uncorrelated (c.f. Proposition 1). Second, we do not require Y i or i to be related to S i through any functional forms; hence classical regression techniques (even nonparametric models) cannot be applied in the above scenarios. We aim to develop a general information pooling strategy that does not involve any prescribed functional relationships; a methodology in this spirit is described next. 2.2.3 The ASUS estimator and its risk properties LetY Y Y andS denote the primary statistics and auxiliary sequence obeying Models (2.1) to (2.3). Let t (:) be a soft-thresholding operator such that t (Y i ) = 8 > > > > < > > > > : Y i 1 i , ifjY i 1 i jt; t sign(Y i 1 i ), otherwise: The proposed ASUS estimator operates in two steps: first constructing K groups using S, and second applying soft-thresholding within each group usingY . The construction of the groups relies only onS. The tuning parameters for both grouping and shrinkage are determined using the SURE criterion. Procedure 1. Fork = 1;:::;K and =f 1 < ::: < K1 g, denote b I k =fi : k1 < S i k g with 0 =1, K =1. Consider the following class of shrinkage estimators: ^ SI i (T ) :=Y i + i t k (Y i ) ifi2 b I k ; (2.4) 19 where,T =f 1 ;:::; K1 ; t 1 ;:::;t K g and each of the threshold hyper-parameterst 1 ;:::;t K varies in [0;t n ] witht n = (2 logn) 1=2 . Thus, the set of all possible hyper-parameterT values isH n = R K1 + [0;t n ] K . Define the SURE function S(T;Y Y Y;S) =n 1 2 6 4 n X i=1 2 i + K X k=1 X i2 b I k 2 i (jY i 1 i j^t k ) 2 2 2 i I(jY i 1 i jt k ) 3 7 5: (2.5) Let ^ T = argmin T2Hn S(T;Y Y Y;S). Then, the ASUS estimator is given by ^ SI i ( ^ T ). Remark 2.2.2. When is very sparse, the empirical fluctuations in the SURE function would have non- negligible effects on thresholding procedures. 
We suggest choosing $t_1,\ldots,t_K$ for a given grouping by implementing a hybrid scheme that is similar to the SureShrink estimator of Donoho and Johnstone (1995), e.g. setting $t_k = t_n$ if $|\hat{I}_k|^{-1}\sum_{i\in \hat{I}_k}\{(Y_i^2/\sigma_i^2)\wedge t_n^2\} - 1 \le n^{-1/2}\log^{3/2} n$.

We present a toy example to illustrate why ASUS works. Consider the two-sample inference problem described by Example 3 in chapter 2.2.2. Let $\theta_i = \mu_{i,1} - \mu_{i,2}$ and $\bar{Y}_{i,d} \sim N(\mu_{i,d}, 0.25)$, where $d = 1, 2$, $i = 1,\ldots,n$, and $n = 10^4$. For $\mu_1$ we generate the first 20% of its coordinates randomly from Unif(4, 6), the next 20% randomly from Unif(2, 3) and set the remaining coordinates to 0. For $\mu_2$, the first 20% are from Unif(1, 2), the next 20% from Unif(1, 6) and the remaining coordinates are set to 0. Finally, we let $Y_i = \bar{Y}_{i,1} - \bar{Y}_{i,2}$ and $S_i = |\bar{Y}_{i,1} + \bar{Y}_{i,2}|$. The left panel in Figure 2.2 presents the histogram of $\boldsymbol{Y} = (Y_i : 1 \le i \le n)$, where the lighter shade corresponds to $Y_i$ with $\theta_i = 0$. The SureShrink estimator in Donoho and Johnstone (1995) chooses threshold $t = 0.6$ for all observations, resulting in an MSE of 0.338. Imagine that an oracle has perfect knowledge of the two groups ($\theta_i = 0$ vs. $\theta_i \ne 0$). In group 0, SureShrink chooses $t_0 = 4.2$, whereas in group 1, SureShrink chooses $t_1 = 0.15$. The total MSE is reduced to 0.20 by adopting varied thresholds for the two groups. In practice, the groups cannot be identified perfectly but can be partially revealed by the auxiliary statistic $S_i = |\bar{Y}_{i,1} + \bar{Y}_{i,2}|$, where a small $S_i$ signifies a possible zero effect. Our simulation studies in chapter 2.4 show that by exploiting the side information in $S_i$, ASUS achieves substantial gain in performance over conventional methods.

Figure 2.2: Toy example depicting ASUS. Left: SureShrink estimator at $t = 0.6$. Middle: ASUS with group 0 and $t_0 = 4.2$. Right: ASUS with group 1 and $t_1 = 0.15$.

Let $l_n(\theta, \hat{\theta}) = n^{-1}\|\hat{\theta} - \theta\|_2^2$ denote the squared error loss of estimating $\theta$ using $\hat{\theta}$. For each member $\hat{\theta}^{SI}(T)$ in our class of estimators, $T \in \mathcal{H}_n$, denote its risk by $r_n(T; \theta) = E\big[l_n\{\theta, \hat{\theta}^{SI}(T)\}\big]$, where the expectation is taken with respect to the joint distribution of $(Y_i, S_i)$. The next proposition shows that (2.5) provides an unbiased estimate of the true risk.

Proposition 1. Consider Models (2.1) to (2.3). Then, given $\theta_i$, the pair $\big\{(Y_i - \theta_i)\,\eta_{t_k}(Y_i),\; I(i \in \hat{I}_k)\big\}$ are uncorrelated. It follows that $r_n(T; \theta) = E\{S(T, \boldsymbol{Y}, \boldsymbol{S})\}$.

Next we study the large-sample behavior of the proposed SURE criterion. As in Xie et al. (2012), we impose the following assumption on the fourth moment of the noise distributions:

(A1) $\limsup_{n\to\infty} \, n^{-1}\sum_{i=1}^n \sigma_i^4 < \infty$.

The following theorem shows that the risk estimate $S(T, \boldsymbol{Y}, \boldsymbol{S})$ is uniformly close to the true risk as well as the loss, justifying our proposed hyper-parameter estimate $\hat{T}$. Compared to Xie et al. (2012) (theorem 3.1) and Brown et al. (2018) (theorem 4.1), we obtain explicit rates of convergence by tracking the empirical fluctuations in the SURE function through sharper concentration inequalities.

Theorem 2.2.1. Under Assumption A1, with $c_n = n^{1/2}(\log n)^{-\gamma}$ for any $\gamma > 3/2$, we have

(a) $\lim_{n\to\infty} c_n\, E\Big\{\sup_{T\in\mathcal{H}_n} \big|S(T, \boldsymbol{Y}, \boldsymbol{S}) - r_n(T; \theta)\big|\Big\} = 0$;

(b) $\lim_{n\to\infty} c_n\, E\Big[\sup_{T\in\mathcal{H}_n} \big|S(T, \boldsymbol{Y}, \boldsymbol{S}) - l_n\{\theta, \hat{\theta}^{SI}(T)\}\big|\Big] = 0$,

where the expectation is with respect to the joint distribution of $(\boldsymbol{Y}, \boldsymbol{S})$.

Define $T^{OL}$ as the minimizer of the true loss function: $T^{OL} = \operatorname{argmin}_{T\in\mathcal{H}_n} l_n\{\theta, \hat{\theta}^{SI}(T)\}$. $T^{OL}$ is referred to as the oracle loss hyper-parameter as it involves knowledge of $\theta$.
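Because $T^{OL}$ requires the unknown $\theta$, it can only be computed in simulation experiments where the truth is available. The following minimal base-R sketch (not part of the asus package; all function and argument names are illustrative) makes the definition concrete for the two-group case by searching a grid of candidate hyper-parameters and minimizing the true loss.

```r
# Minimal sketch: oracle-loss hyper-parameter T^OL for a two-group rule on
# simulated data where theta is known. All names are illustrative.
soft_est <- function(y, thr) sign(y) * pmax(abs(y) - thr, 0)   # Y_i + sigma_i * eta_t(Y_i)

oracle_loss_hyper <- function(y, s, theta, sigma, xi_grid, t_grid) {
  # t_grid would typically span [0, sqrt(2 * log(length(y)))]
  best <- list(loss = Inf)
  for (xi in xi_grid) {
    g1 <- s <= xi                                   # group I_1-hat based on S
    for (t1 in t_grid) for (t2 in t_grid) {
      est <- numeric(length(y))
      est[g1]  <- soft_est(y[g1],  sigma[g1]  * t1)
      est[!g1] <- soft_est(y[!g1], sigma[!g1] * t2)
      loss <- mean((est - theta)^2)                 # l_n(theta, hat theta^SI(T))
      if (loss < best$loss) best <- list(loss = loss, xi = xi, t1 = t1, t2 = t2)
    }
  }
  best
}
```

Replacing the line that evaluates `mean((est - theta)^2)` with the SURE criterion (2.5) turns this oracle search into the data-driven search for $\hat{T}$ described in chapter 2.4.1; the minimizer above, by contrast, is the oracle benchmark $T^{OL}$.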
It provides the theoretical limit that one can reach if allowed to minimize the true loss. Let ^ SI (T OL ) be the corresponding oracle loss estimator. The following corollary establishes the asymptotic optimality of ^ T . Corollary 2.2.2. Under assumption A1, if lim n!1 c n n 1=2 log n = 0 for any> 3=2, then (a) The loss of ^ SI ( ^ T ) converges in probability to the loss of ^ SI (T OL ): lim n!1 P h l n n ; ^ SI ( ^ T ) o l n n ; ^ SI (T OL ) o +c 1 n i = 0 for any> 0: (b) The risk of ^ SI ( ^ T ) converges to the risk of the oracle loss estimator: lim n!1 c n E h l n n ; ^ SI ( ^ T ) o l n n ; ^ SI (T OL ) oi = 0: 22 2.2.4 Approximating the Bayes rule by ASUS This section discusses a Bayes setup and illustrates how ASUS may be conceptualized as an approximation to the Bayes oracle estimator. Consider a hierarchical model where i has an unspecified prior andY i ind: N( i ; 2 i ) with 2 i known. In the absence of any auxiliary sequenceS and when i are all equal to, say, the optimal estimator is i =E( i jy i ) =y i + 2 f 0 (y i ) f(y i ) ; (2.6) which is known as Tweedie’s formula (Efron 2011). When the marginal densities f(y i ) are unknown, (2.6) can be implemented in an empirical Bayes (EB) framework. For example, Brown and Greenshtein (2009) used kernel methods to estimate unknown densities and showed that the resulting EB estimator is asymptotically optimal under mild conditions. Under the sparse setting, an effective approach to incorporate the sparsity structure is to consider, for example, spike-and-slab priors (Johnstone and Silverman 2004). In decision theory it has been established that the posterior median is minimax optimal under spike-and-slab priors; see Thoerem 1 of Johnstone and Silverman (2004). Hence the soft-threshold estimators can be viewed as good surrogates to the Bayes rule under sparsity. When the sparsity level is unknown, the threshold should be chosen adaptively using a data-driven method. For a given pair of primary and auxiliary statistics (Y i ;S i ), the Bayes oracle estimator is i =E( i jY i ;S i ): (2.7) Conditionally onS i , a Tweedie’s formula for equation (2.7) can be written which would require estimating the conditional marginal densitiesf(y i js i ) and its derivatives. ASUS can be viewed as a two-step approx- imation to the oracle estimator (2.7). The first step involves using the auxiliary sequence to divide the n 23 coordinates intoK groups: i ^ k (Y i ) =E( i jY i ;i2G k ) =E( i jY i ;S i =k); which can be viewed as a discrete approximation to the oracle rule (2.7) by discretizingS i as a categorical variableS i taking values k = 1; ;K. The second step involves setting thresholds for separate groups to incorporate the updated structural information from the auxiliary sequence. This step makes sense because under the sparse regime, it is natural to use the class of soft-thresholding estimators as a convenient surrogate to the Bayes rule, and ideally the threshold should be set differently to reflect the varied sparsity levels across the groups. Finally the optimal grouping and optimal thresholds are chosen by minimizing a SURE criterion. This Bayesian interpretation reveals that ASUS may suffer from information loss in the discretization step. However, fully utilizing the auxiliary data by modeling S as a continuous variable is practically impossible under the ASUS framework since the search algorithm cannot deal with a diverging number of groups. 
Moreover, directly implementing (2.7) using bivariate Tweedie approaches is highly nontrivial and requires further research. ASUS, thus, seems to provide a simple, feasible yet effective framework to incorporate the side information. 2.3. Theoretical analysis This section studies the theoretical properties of ASUS under the important setting where is sparse. By contrast, the results of chapter 2.2.3 hold for any sequence. To simplify the presentation, we focus on a class of thresholding estimators that utilize two groups. The two-group model provides a natural choice for some important applications such as the prioritized subset analysis and RNA-seq study, but the proposed ASUS framework can handle more groups. The major goal of our theoretical analysis is to gain insights on sparse inference with side information, for which the simple two-group setup helps in two ways. First, 24 it leads to a concise and intuitive characterization of the potential influence of side information on simulta- neous estimation. Second, it enables us to develop precise conditions under which ASUS is asymptotically optimal. 2.3.1 Asymptotic set-up Consider hierarchical Models (2.1) to (2.3). We begin by considering an oracle estimator ~ SI n (T OR n ) that directly uses the noiseless side information: ~ SI i;n (T OR n ) := 8 > > > > < > > > > : Y i + i t 1 (Y i ) ifi2I ? 1;n ; Y i + i t 2 (Y i ) ifi2I ? 2;n ; (2.8) whereI 1;n =fi : i g,I 2;n =fi : i >g, and T OR n := ( ? n ;t 1;n ;t 2;n ) = argmin T2R[0;tn][0;tn] El n n ; ~ SI (T ) o : (2.9) Remark 2.3.1. Both the oracle estimator ~ SI n (T OR n ) and the oracle loss estimator ^ SI (T OL ) assume the knowledge of . However, they are different in that the former creates groups based on, whereas the latter usesS S S. The purposes of introducing these two oracle estimators are different: ^ SI (T OL ) is used to assess the effectiveness of the SURE criterion; by contrast, ~ SI n (T OR n ) is employed to evaluate the usefulness of the noiseless side information, i.e. the maximal improvement in performance that can be achieved by incorporating . 25 Denote 1;n =n 1 P n i=1 I( i ? n ) and 2;n = 1 1;n . Intuitively, the optimal partition ? n (within the class of thresholding procedures utilizing two groups) is chosen to maximize the “discrepancy” between the two groups. For units in groupI ? k;n , the mixture density of i is given by g k;n () = (1p k;n ) 0 + p k;n h k;n (); k = 1; 2; (2.10) where 0 is a dirac delta function (null effects),h k;n is the (alternative) empirical density of non-null effects. Following remark 2.2.1, our theory developed based on the empirical density (2.10) can handle both random and deterministic models; this can be more clearly seen in our proofs of the theorems. Here p k;n is the conditional proportion of non-null effects for a given group and may be conceptualized as the probability that a randomly selected unit in groupI ? k;n is a non-null effect. We consider an asymptotic set-up based on the sparse estimation framework in chapter 8.6 of Johnstone (2015), which has been widely used in high-dimensional sparse inference (Johnstone and Silverman 1997, Abramovich et al. 2006, Donoho et al. 1998, Mukherjee and Johnstone 2015, Cai and Sun 2017). Let p 1;n = n andp 2;n = n for some 0 < < 1. Define n = 1 1;n 2;n . 
Consider the following parameter space n (;; n ) = 2R n :kk 0 n(n + n n )=(1 + n ) : The maximal risk of ASUS over n (;; n ) is R AS n (;; n ) = sup 2n(;;n) r n ( ^ T;): 26 Correspondingly, over the same parameter space n (;; n ), we letR OS n (;; n ) denote the maximal risk of the oracle procedure ~ SI n (T OR n ), andR NS n (;; n ) the minimax risk of all soft thresholding esti- mators without side information. The risk differenceR NS n R OS n is a key quantity that will be used in later analysis as the benchmark decision theoretic improvement due to incorporation of side information. Specifically, the noiseless side information is useful if it provides non-negligible improvement on the risk: lim n!1 n(R NS n R OS n ) =1: (2.11) Moreover, the ASUS estimator is asymptotically optimal if its risk improvement overR NS n (;; n ) is asymptotically equal to that of the oracle: RI n = R NS n R AS n R NS n R OS n ! 1 asn!1: (2.12) 2.3.2 Usefulness of side information We focus on Model (2.10), a hypothetical model based on the oracle partition ? n . We state a few condi- tions that are needed in later analysis; some are essential for characterizing the situations where the side information is useful, i.e. the oracle estimator ~ SI n (T OR n ) would provide non-negligible efficiency gain over competitive estimators. (A2.1) lim n!1 n n 0 = 0 for some 0 <. (A2.2) For some < 1=2 andk n = logn, lim n!1 k n (1 1;n ) =1. (A2.3) For some < 1=2, lim n!1 n 1;n p 1;n =1. (A2.4) Let 2 n =n 1 P n i=1 2 i and 0< lim inf n!1 2 n lim sup n!1 2 n <1. 27 Remark 2.3.2. (A2.1) implies 2;n p 2;n =( 1;n p 1;n )! 0, which ensures that the oracle partition is effective in the sense that the two resulting groups have different sparsity levels. The asymmetric condition can be easily flipped for generalization. (A2.2) is a mild condition which allows 1;n to approach 1 but at a controlled rate. (A2.3) prevents the trivial setting where ASUS reduces to the SureShrink procedure with universal threshold p 2 logn, i.e. the side information would not have any influence in the estimation process. See lemma 3 in Appendix A.2.2 which shows that if lim n!1 n 1=2 1;n p 1;n <1, then ASUS reduces to the SureShrink procedure, i.e. there is no need for creating groups. (A2.4) is a mild condition that is satisfied in most real life applications. Now we study the usefulness of the noiseless side information. Following the theory in Johnstone (1994), the next theorem explicitly evaluates the risk differenceR NS n R OS n up to higher order terms. The analysis overcomes the crudeness of the first order asymptotics for evaluating thresholding rules as pointed out by Bickel (1983) and Johnstone (1994). Theorem 2.3.1. Consider the oracle estimator defined in (2.8)-(2.9). Under assumption A2.1, withk n = logn, for all < 1, we have, R NS n (;; n )R OS n (;; n ) = 1;n p 1;n 2 n n log 1 1;n (2 3 1 k 1 n ) +O(k n ) o : It follows from (A2.3) that lim n!1 n(R NS n R OS n ) =1, establishing (2.11). 2.3.3 Asymptotic optimality of ASUS To evaluate the efficiency of ASUS, we need to compare the segmentation used by ASUS with that used by the oracle estimator. For a given segmentation hyper-parameter, define ~ q jk i;n () :=P n ( ^ I j i jI k i forj;k2f1; 2g; i = 1;:::;n; 28 where ^ I 1 i =fS i g,I 1 i =f i ? n g, ^ I 2 i = Rn ^ I 1 i ,I 2 i = RnI 1 i , and the probability operatorP n is based on Model (2.10). 
Let q jk i;n () = ~ q jk i;n () if inf 2R 2;n ~ q 12 n () + 1;n ~ q 21 n ()< inf 2R 1;n ~ q 11 n () + 2;n ~ q 22 n () and otherwiseq jk i;n () = ~ q kk i;n () andq kk i;n () = 1q jk i;n () forj6=k. Denote the weighted average q jk n () = P n i=1 q jk i;n () 2 i P n i=1 2 i ; j;k2f1; 2g: Viewing the data-driven grouping step of ASUS as a classification procedure with the oracle segmentation corresponding to the true states, we can conceptualizeq 21 n ( n ) andq 12 n ( n ) as misclassification rates. Define the efficiency ratio E n = R NS n R OS n R AS n R OS n : (2.13) For notational simplicity, the dependence of this ratio on;; n is not explicitly marked. It follows from (2.12) thatRI n = 1E 1 n . Hence a largerE n signifies better performance of ASUS. In particular,E n !1 implies the asymptotic optimality of ASUS. The poly-log rates in the following theorem are sharp. Theorem 2.3.2. Assume (A2.1) – (A2.4) hold. Letk n = logn. If there exists a sequencef n g n1 such that lim n!1 k 2 n q 21 n ( n ) = 0 and lim n!1 n q 12 n ( n ) = 0; (2.14) then ASUS is asymptotically optimal. In particular, for all < 1 we have lim inf n!1 k n E n 2 lim inf n!1 log 1 1;n : (2.15) 29 Next we present two hierarchical models, respectively with sub-Gaussian (SG) and sub-Exponential (SExp) tails, under which the misclassification rates can be adequately controlled. Let S i j i be in- dependent random variables with i := i ( i ) and ( i ( i );b i ( i )) such that Efexp((S i i ))g exp( 2 i 2 =2) for all i and alljj b 1 i : Let lim sup i b i < 1, lim sup i i < 1 and b n = sup 1in max(2 2 i ;b i ). When b i = 0, the distribution of S i has sub-Gaussian tails. For two partitions A andB of the setf1;:::;ng, define the` 1 distance between the two setsf i :i2Ag andf i :i2Bg by dist(A;B) = inffjxyj :x2A;y2Bg. Letc n = b n (2 logk n + log n ). The following lemma provides a sufficient condition under which the requirements on misclassification rates (2.14) are satisfied. The proof of the lemma follows directly from the standard bounds for sub-Gaussian and sub-Exponential tails. Lemma 2.3.1. LetI 1;n =fi : i ? n g andI 2;n =f1;:::;ngnI 1;n . The requirements on misclassification rates given by (2.14) are satisfied if lim inf n!1 c n dist(I 1;n ;I 2;n )> ; where is 1=2 if sup i b i = 0 and 1 otherwise. 2.3.4 Robustness of ASUS This section carries out a theoretical analysis to address the concern whether the performance of data com- bination procedures would deteriorate when pooling non-informative auxiliary data. We first characterize asymptotic regimes under which auxiliary data are non-informative (while the attention is confined to the prescribed class of two-group ASUS estimators), and then show that under such regimes, ASUS is robust in performance in the sense that it does not under-perform standard soft-thresholding methods. Theorem 2.3.3. Suppose (A2.1) – (A2.4) hold. Let n =n 0 andk n = logn. 30 (a) Consider the following situations: (i) lim n!1 k 1 n n q 21 n ( n ) =1; and (ii) lim n!1 n n q 21 n ( n ) = 0 but lim n!1 k 1 n n q 12 n ( n ) =1. If for all sequencef n g n1 either (i) or (ii) holds, then we must have lim n!1 E n = 1: Hence, the auxiliary data are non-informative. (b) We always have lim inf n!1 E n 1: Thus, even when pooling non-informative auxiliary data ASUS would be at least as efficient as competing soft thresholding based methods that do not use auxiliary data. 
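Both Theorem 2.3.2 and Theorem 2.3.3 are stated in terms of the weighted mis-classification rates $\bar{q}^{21}_n(\xi)$ and $\bar{q}^{12}_n(\xi)$. To make the classification analogy concrete, these quantities have direct empirical analogues on simulated data where the latent vector is known; a minimal base-R sketch is given below (the vector nu stands for the noiseless side information, and all names are illustrative).

```r
# Illustrative sketch: empirical analogues of the group mis-classification rates
# q21 and q12, comparing the data-driven grouping {S_i <= xi} with the oracle
# partition {nu_i <= xi_star}. All names are hypothetical.
misclass_rates <- function(nu, s, xi_star, xi, sigma2 = rep(1, length(nu))) {
  oracle_g1 <- nu <= xi_star           # oracle group I_1 (small latent effects)
  hat_g1    <- s  <= xi                # data-driven group I_1-hat
  w <- sigma2 / sum(sigma2)            # sigma_i^2-weighted average as in the text
  c(q21 = sum(w * (oracle_g1 & !hat_g1)),   # placed in group 2, oracle group 1
    q12 = sum(w * (!oracle_g1 & hat_g1)))   # placed in group 1, oracle group 2
}
```

In simulation, tracking these two quantities as $n$ grows indicates whether the side information falls in the informative regime of Theorem 2.3.2 or the non-informative regime of Theorem 2.3.3.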
Our next result characterizes the performance of soft-thresholding estimators, where their efficacies are measured by the ratio of their respective maximal risks with respect to that of the oracle. The subsequent analysis is carried out using the ratiosR AS n R OS n andR NS n R OS n , instead of the ratios of the risk differences (e.g.RI n andE n ). In this metric, we see that any optimally tuned soft-thresholding procedure is robust; but the improvement due to the incorporation of the side information can be observed in the varied convergence rates. Concretely, we show that the maximal risk of any soft thresholding scheme lies within a constant multiple of the oracle riskR OS n irrespective of the informativeness of the side information. Particularly, if lim inf n!1 1;n > 0, then lim n!1 k n (R NS n R OS n 1) = 0 for all < 1. By contrast,R AS n R OS n tends to 1 at a faster rate under the conditions of Theorem 2.3.2. Lemma 2.3.2. Letc n = log 1 1;n =fk n 1:5 log(2k n ) + 2:5 + log(0)g andk n = logn. For any < 1, under assumptions (A2.1) – (A2.4), we have lim n!1 k 2 n R NS n R OS n min(1 +c n ; =) = 0; lim sup n!1 k 2 n R AS n R OS n min(1 +c n ; =) 0: Under the conditions of Theorem 2.3.2, if there exists> 0 such that lim n!1 k n log 1 1;n =1, then lim n!1 k 1+ n (R NS n R OS n 1) =1 and lim n!1 k 2 n (R AS n R OS n 1) = 0: 31 Hence the risk of ASUS approaches the oracle risk at a faster rate. 2.4. Numerical results In this section we compare the performance of ASUS against several competing methods, including (i) the SureShrink (SS) estimator in Donoho and Johnstone (1995), (ii) the extended James Stein estimator (EJS) discussed in Brown (2008), (iii) the Empirical Bayes Thresholding (EBT) in Johnstone and Silverman (2004), and (iv) the Auxiliary Screening (Aux-Scr) procedure using simulated data in chapter 2.4.2 and a real dataset in chapter 2.4.3. The “Aux-Scr” method is motivated by a comment from a reviewer. The idea is to first utilizeS to conduct a preliminary screening of the data, then discard coordinates that appear to contain little information, and finally apply soft-thresholding estimators on remaining coordinates. A detailed description of the Aux-Scr method is provided in Appendix A.1. More simulation results and an additional real data analysis are provided in Appendices A.3 and A.4. Our numerical results suggest that ASUS enjoys superior numerical performance and the efficiency gain over competitive estimators is substantial in many settings. 2.4.1 Implementation andR-packageasus The R-package asus has been developed to implement our proposed methodology. In this section, we provide some implementation details upon which our package has been built. Our scheme for choosingT involves minimizingS(T;Y Y Y;S) with respect toT . In particular, the optimal T is given by ^ T = argmin 2n;t 1 ;:::;t K 2[0;tn] S(T;Y Y Y;S) (2.16) 32 where n is a collection ofK 1 dimensional distinct points spanning R K1 + andt n denotes the universal threshold of p 2 logn. To solve this minimization problem, we proceed as follows: Let S (1) ;S (n) be the smallest and largestS i respectively. Consider a set ofm n equi-spaced points spanning (S (1) ;S (n) ) and take n to be a mn K1 K 1 matrix where each row is aK 1 dimensional sorted vector constructed out of them n points. For each j in thejth row of n , determineft j 1 ;:::;t j K g by minimizing the SURE function for theK groups b I k . This step can easily be carried out via the hybrid scheme discussed in Donoho and Johnstone (1995). 
Using Proposition 1, we compute $S(T, \boldsymbol{Y}, \boldsymbol{S})$ at $T = \{\xi^j, t^j_1,\ldots,t^j_K\}$, and repeat this process for $j = 1,\ldots,\binom{m_n}{K-1}$ to find $\hat{T}$ using equation (2.16). For choosing an appropriate $K$, the procedure discussed above can be repeated for each candidate value of $K$, and an estimate of $K$ may be taken to be the one that minimizes the SURE estimate of the risk of ASUS over the candidate values of $K$. In Appendix A.5, we present a simple example that demonstrates this procedure for choosing $K$. Our practical recommendation is to take $m_n = 50\log n$ and $K = 2$, which is computationally inexpensive and tends to provide substantial reduction in overall risk against the competing estimators in both the simulations and real data examples we considered.

2.4.2 Simulation

This section presents results from two simulation studies, respectively investigating the performances of ASUS in one-sample and two-sample estimation problems. To reveal the usefulness of side information and investigate the effectiveness of ASUS, we also include the oracle estimator $\tilde{\theta}^{SI}(T^{OR}_n)$ in the comparison. The MSE of the oracle estimator (OR), which provides the lowest attainable risk, serves as a benchmark for assessing the performance of various methods. The R code that reproduces our simulation results can be downloaded from the following link: https://github.com/trambakbanerjee/ASUS.

One-sample estimation with side information

We generate our data based on hierarchical Models (2.1) to (2.3), where we fix $n = 5000$, $K = 2$, and take $h(\nu_i, \psi_{1i}) = \nu_i + \psi_{1i}$. We simulate $\psi_{1i}$ from a sparse mixture model $(1 - n^{-1/2})\delta_0 + n^{-1/2} N(2, 0.01)$. The latent vector $\boldsymbol{\nu}$ is simulated under the following two scenarios:

(S1) the first 50 coordinates are drawn from Unif(6, 7), the next 200 from Unif(2, 3), and the remaining $n - 250$ coordinates are set to 0;

(S2) the first 200 coordinates are drawn from Unif(4, 8), the next 800 from Unif(1, 3), and the remaining $n - 10^3$ coordinates are set to 0;

with $Y_i \sim N(\theta_i, 1)$. In practice, we only observe an auxiliary sequence $\boldsymbol{S}$, which can be viewed as a noisy version of $\boldsymbol{\nu}$. To assess the impact of noise on the performance of ASUS, we consider four different settings. In settings 1 and 2, we simulate $m$ samples of $\psi_2 = (\psi_{21},\ldots,\psi_{2n})$ from two different distributions and generate auxiliary sequences $\boldsymbol{S}_1$ and $\boldsymbol{S}_2$ as follows:

(1) $\psi^{(1)}_{2i} \overset{i.i.d.}{\sim} \text{Laplace}(0, 4)$ with $\boldsymbol{S}_1 = |\boldsymbol{\nu} + \bar{\psi}^{(1)}_2|$,

(2) $\psi^{(2)}_{2i} \overset{i.i.d.}{\sim} \chi^2_{10}$ with $\boldsymbol{S}_2 = |\boldsymbol{\nu} + \bar{\psi}^{(2)}_2|$,

where $\bar{\psi}^{(k)}_2$ is the average of $\psi^{(k)}_2$ over the $m$ samples. For settings 3 and 4, we first introduce perturbations in the latent variable vector and then generate auxiliary sequences $\boldsymbol{S}_3$, $\boldsymbol{S}_4$ as follows:

(3) $\tilde{\nu}_i = \nu_i\, I(\nu_i \ne 0) + \epsilon_i\, \text{LogN}(0, 5/\sqrt{m})\, I(\nu_i = 0)$ with $\boldsymbol{S}_3 = |\tilde{\boldsymbol{\nu}} + \bar{\psi}^{(1)}_2|$, where $\epsilon$ is a vector of $n$ Rademacher random variables generated independently;

(4) $\tilde{\nu}_i = \nu_i\, I(\nu_i \ne 0) + \epsilon_i\, t_{2m/10}\, I(\nu_i = 0)$ with $\boldsymbol{S}_4 = |\tilde{\boldsymbol{\nu}} - \bar{\psi}^{(2)}_2|$, where $\epsilon$ is a vector of $n$ independent Bernoulli random variables with probability of success 0.75.

We vary $m$ from 10 to 200 to investigate the impact of noise. The MSEs are obtained by averaging over $N = 500$ replications. The results for scenarios S1 and S2 are summarized in Table 2.1 and in Figures 2.3 and 2.4, wherein ASUS.j and Aux-Scr.j correspond to versions of ASUS and Aux-Scr that rely on the side information in the auxiliary sequence $\boldsymbol{S}_j$, $j = 1,\ldots,4$.

Figure 2.3: One-sample estimation with side information for scenario S1: Estimated risks of different estimators.
Left: ASUS versus EBT and EJS. Right: ASUS verus Aux-Scr. From the left panels of figures 2.3 and 2.4 we see that ASUS exhibits the best performance when com- pared against EBT, EJS and SureShrink estimators. In particular, ASUS.1, ASUS.2 outperform their coun- terparts ASUS.3, ASUS.4. This reveals how the usefulness of the latent sequence would affect the per- formance of ASUS. Nonetheless, ASUS.3 and ASUS.4 still provide improvements over, and, crucially, are never worse than the SureShrink estimator. This reveals the impact of the accuracy of the auxiliary sequence S (in capturing the information in) on the performance of ASUS. The right panels of figures 2.3 and 2.4 present the risk comparison between ASUS and Aux-Scr using the auxiliary sequencesS 1 ;:::;S 4 . Not sur- prisingly, ASUS and Aux-Scr have almost identical risk performance using the auxiliary sequencesS 1 ;S 2 andS 3 for large m. As m increases, the accuracy of these auxiliary sequences increase but the negative 35 0.3 0.4 0.5 0.6 0.7 50 100 150 200 m risk ASUS.1 ASUS.2 ASUS.3 ASUS.4 EBT EJS OR SS 0.25 0.30 0.35 0.40 50 100 150 200 m risk ASUS.1 ASUS.2 ASUS.3 ASUS.4 Aux−Scr.1 Aux−Scr.2 Aux−Scr.3 Aux−Scr.4 OR SS Figure 2.4: One-sample estimation with side information for scenario S2: Estimated risks of different esti- mators. Left: ASUS versus EBT and EJS. Right: ASUS verus Aux-Scr. Bernoulli perturbations inS 4 interferes with its magnitude so that a smallerjS i4 j may correspond to a signal coordinate. The Aux-Scr procedure which discards observations based on the magnitude of the auxiliary sequence may miss important signal coordinates while relying onS 4 . ASUS, however, does not discard any observations and continues to exploit the available information in the noisy auxiliary sequences. In table 2.1, we report risk estimates and estimates ofT for ASUS whenm = 200. The estimates of the hyper-parameters of Aux-Scr are provided in table A.2 of Appendix A.1 and we only report its risk estimates here in table 2.1. We can see that ASUS.1 and ASUS.2 choose similar thresholding hyper-parameters (t 1 ;t 2 ) as those of the oracle estimator. Moreover, ASUS.4 demonstrates a lower estimation risk than Aux-Scr.4 using the same auxiliary sequenceS 4 . 36 Table 2.1: One-sample estimation with side information: risk estimates and estimates ofT for ASUS at m = 200. Heren ? k =jI ? k j andn k =j b I k j fork = 1; 2. One-sample estimation with side information Scenario S1 Scenario S2 OR ? 2 1.003 t ? 1 ,t ? 2 4.114, 0.138 4.073, 0.133 n ? 1 ,n ? 2 4750, 250 4008, 992 risk 0.095 0.224 ASUS.1 1.342 0.979 t 1 ,t 2 4.114, 0.107 4.073, 0.156 n 1 ,n 2 4748, 252 4008, 992 risk 0.097 0.243 ASUS.2 11.229 5.82 t 1 ,t 2 4.115, 0.106 4.073, 0.137 n 1 ,n 2 4748, 252 4008, 992 risk 0.095 0.228 ASUS.3 1.777 1.778 t 1 ,t 2 4.089, 0.662 3.422, 0.441 n 1 ,n 2 4271, 729 3606, 1394 risk 0.146 0.357 ASUS.4 7.785 8.524 t 1 ,t 2 1.360, 3.653 0.745, 3.864 n 1 ,n 2 1775, 3225 2249, 2751 risk 0.165 0.356 Aux-Scr.1 risk 0.097 0.243 Aux-Scr.2 risk 0.095 0.232 Aux-Scr.3 risk 0.147 0.360 Aux-Scr.4 risk 0.186 0.414 SureShrink risk 0.191 0.429 EBT risk 0.253 0.692 EJS risk 0.408 0.652 Two-sample estimation with side information We consider the problem of estimating the difference of two Gaussian mean vectors. An auxiliary sequence can be constructed from data by following Example 3 in chapter 2.2.2. 
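For concreteness, the construction of the primary and auxiliary statistics in this two-sample setting can be sketched in a few lines of R; here U and V denote the two observation vectors and gamma is the coordinate-wise weight of Example 3, and all names are illustrative.

```r
# Illustrative sketch: primary statistic Y and auxiliary statistic S for the
# two-sample problem, following Example 3 (all names are hypothetical).
make_two_sample_stats <- function(U, V, gamma) {
  list(Y = U - V,               # unbiased for delta_i = mu_{i,1} - mu_{i,2}
       S = abs(U + gamma * V))  # encodes the union support; a large S_i signals a non-null effect
}
# e.g. stats <- make_two_sample_stats(U, V, gamma = sigma1 / sigma2), as in the
# simulation design described next.
```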
We first simulate 1i (1p 1 ) 0 +p 1 Unif(3; 7); 2i (1p 2 ) 0 +p 2 f4g ; where f4g is the dirac delta at 4 and then generate i;1 = 1i + 1i and i;2 = 2i + 2i with 1i ; 2i i:i:d N(0; 0:01). The parameter of interest is = 1 2 and the associated latent side infor- mation vector is = 1 2 . The observations based on the simulated mean vectors are generated as 37 0.4 0.8 1.2 1.6 1000 2000 3000 4000 5000 n risk ASUS Aux−Scr EBT EJS OR SS 0.2 0.4 0.6 0.8 1000 2000 3000 4000 5000 n risk ASUS Aux−Scr EBT EJS OR SS Figure 2.5: Two-sample estimation with side information: Average risks of different estimators. Left: Scenario S1 and Right: Scenario S2. U i N( i;1 ; 2 i;1 ); V i N( i;2 ; 2 i;2 ). Finally, the primary and auxiliary statistics are obtained as Y i = U i V i ; S i =jU i + i V i j. We fix p 1 = n 0:6 , p 2 = n 0:3 , i = i;1 = i;2 and consider two scenarios where i;1 = i;2 = 1 under scenario S1 and ( 2 1;i ; 2 2;i ) i:i:d Unif(0:1; 1) under scenario S2. The estimates of risks are obtained by averaging overN = 1000 replications. We varyn from 500 to 5000 to investigate the impact of the strength of side information. The simulation results are reported in Table 2.2 and figure 2.5. We see that ASUS uses the side information inS and exhibits the best performance across both scenar- ios. In scenario S2, the variances ofY i are smaller, which leads to an improved risk performance of ASUS over scenario S1. Similar to the previous simulation study, the risk of ASUS would not exceed the risk of the SureShrink estimator across both the scenarios. Different magnitudes of the thresholding hyper-parameters (t 1 ;t 2 ) in table 2.2 further corroborates the importance of the auxiliary statisticsS i in constructing groups 38 Table 2.2: Two-sample estimation with side information: risk estimates and estimates ofT for ASUS at n = 5000. Heren ? k =jI ? k j andn k =j b I k j fork = 1; 2. Two-sample estimation with side information Scenario S1 Scenario S2 OR ? 1.947 1.363 t ? 1 ,t ? 2 4.106, 0.137 4.106, 0.424 n ? 1 ,n ? 2 4584, 416 4583, 417 risk 0.185 0.132 ASUS 3.167 2.504 t 1 ,t 2 1.223, 0.253 3.058, 0.323 n 1 ,n 2 4570, 430 4195, 805 risk 0.610 0.239 Aux-Scr 14.385 2.768 t 1 ,t 2 0.955, 0.002 5.708, 0.498 n 1 ,n 2 4991, 9 3681, 1319 risk 0.688 0.258 SureShrink risk 0.688 0.318 EBT risk 0.761 0.311 EJS risk 0.891 0.600 with disparate sparsity levels and thereby improving the overall estimation accuracy. This is particularly true in the case of scenario S2 where EBT and SureShrink are competitive but ASUS is far more efficient because it has constructed two groups where one group holds majority of the signals and ASUS uses the smaller thresholdt 2 to retain the signals. The other group holds majority of the noise wherein ASUS uses the larger thresholdt 1 to shrink them to zero. Moreover, we notice that ASUS provides a better risk performance than Aux-Scr across both the scenarios. Using the side information inS, Aux-Scr discards observations that havejS i j thereby eliminating some potentially information rich signal coordinates and thus returns a higher risk than ASUS. 2.4.3 Analysis of RNA sequence data We compare the performance of ASUS against the SureShrink (SS) estimator for analysis of the RNA sequence data described in the introduction. The goal is to estimate the true expression levels of the n genes that are infected with VZV strain. 
Through previous studies conducted in the lab, expression levels corresponding to four other experimental conditions, namely uninfected cells (C1, 3 replicates), a fibrosarcoma cell line (C2, 3 replicates), and cells treated with interferon gamma (C3, 2 replicates) or interferon alpha (C4, 3 replicates), were also collected. Let $\bar{X}_i$ be the mean expression level of gene $i$ across the four experimental conditions. Set $S_i = |\bar{X}_i|$ with $K = 2$. Let $\hat{\theta}^{S}_i(t)$ denote the SureShrink estimator of $\theta_i$ based on $Y_i$, the mean expression level of gene $i$ under the VZV condition. The standard deviation $\sigma_i$ for the mean expression level pertaining to gene $i$ across the 3 replicates of the VZV strain is derived from the study conducted in Sen et al. (2018).

On the right panel of Figure 2.6, the dotted line represents the minimum of the SURE risk of $\hat{\theta}^{S}(t)$, which is attained at $t = 0.61$. The solid line represents the minimum of the SURE risk of a class of two-group estimators over a grid of $\xi$ values. ASUS chooses the $\xi$ that minimizes the SURE risk (the red dot in Figure 2.6). The resulting risk estimate is 1.99% at $\hat{T} = (1.25, 1.16, 0)$, a significant reduction compared with the risk estimate of 3.69% for $\hat{\theta}^{S}(t)$. In order to evaluate the results in a predictive framework, we next use only two replicates of the VZV strain for calibrating the hyper-parameters and calculate the prediction errors based on the held-out third replicate. The risk reduction by ASUS over SureShrink is about 30%.

Figure 2.6: Left: Heat map showing, from top to bottom, the average expression levels of VZV, C1, C2, C3 and C4 across their respective replicates (in parentheses). Right: SURE estimate of the risk of $\hat{\theta}^{S}_i(t)$ at $t = 0.61$ versus an unbiased estimate of the risk of ASUS for different values of $\xi$.

Figure 2.7: (a) Histogram of gene expressions for VZV. Group 1 is $\hat{I}_2$ and Group 0 is $\hat{I}_1$. (b) A network of 20 new genes highlighted in black with their interaction partners.

Table 2.3: Summary of SureShrink and ASUS methods (RNA-Seq data). $n_k = |\hat{I}_k|$ for $k = 1, 2$.
RNA-Seq data: $n$ = 53,216
SureShrink: $t$ = 0.61; SURE estimate of risk (%) = 3.69
ASUS: $\xi$ = 1.25; $t_1$ = 1.16; $t_2$ = 0; $n_1$ = 39,535; $n_2$ = 13,681; SURE estimate of risk (%) = 1.99

In this example, a reduction in risk is possible because ASUS has efficiently exploited the sparsity information about $\theta$ encoded by $\boldsymbol{S}$. This can be seen, for example, from (i) the stark contrast between the magnitudes of the thresholding hyper-parameters $t_1$, $t_2$ for the two groups in Table 2.3 and (ii) the heat maps in Figure 2.6, where the gene expressions under the four experimental conditions follow the expression pattern of VZV. Moreover, the risk of Aux-Scr in this example was seen to be no better than that of the SureShrink estimator, and Aux-Scr has therefore been excluded from the results reported in Table 2.3. Figure 2.7a presents the distribution of gene expression for genes that belong to groups $\hat{I}_1$ and $\hat{I}_2$. ASUS exploits the side information in $\boldsymbol{S}$ to partition the estimation units into two groups with very different sparsity levels and therefore returns a much smaller risk. The ASUS estimator $\hat{\theta}^{SI}(\hat{T})$ results in the discovery of 114 new genes beyond those discovered using $\hat{\theta}^{S}(t)$. Figure 2.7b shows the network of protein-protein interactions of 20 such genes. The interaction network is generated using NetworkAnalyst (Xia et al. 2015), which maps the chosen genes to a comprehensive high-quality protein-protein interaction (PPI) database based on InnateDB.
A search algorithm is then performed to identify first-order neighbors (genes that directly interact with a given gene) for each of these mapped genes. The resulting nodes and their interaction partners are returned to build the network. In case of the RNA-Seq data, the interaction network of the 20 new genes indicates that ASUS may help reveal important biological synergies between genes that have a high estimated expression level for VZV and other genes in the human genome. 2.5. Discussion In high-dimensional estimation and testing problems, the sparsity structure can be encoded in various ways; we have considered three basic settings where the structural information on sparsity may be extracted from (i) prior or domain-specific knowledge, (ii) covariate sequence based on the same data, or (iii) summary statistics based on secondary data sources. This article develops a general integrative framework for sparse 42 estimation that is capable of handling all three scenarios. We use higher-order minimax optimality tools to establish the adaptivity and robustness of ASUS. Numerical studies using both simulated and real data corroborate the improvement of ASUS over existing methods. We conclude the article with a discussion of several open issues. Firstly, in large-scale compound esti- mation problems, various data structures such as sparsity, heteroscedasticity, dependency and hierarchy are often available alongside the primary summary statistics. ASUS can only handle the sparsity structure; and it is desirable to develop a unified framework that can effectively incorporate other types of structures into inference. New theoretical frameworks will be needed to characterize the usefulness of various types of side information and to establish precise conditions under which the new integrative method is asymptotically optimal. Secondly, in situations where there are multiple auxiliary sequences, it is unclear how to modify the ASUS framework to construct groups using an auxiliary matrix. The computation involved in the search for the optimal group-wise thresholds, which requires the evaluation of the SURE function for every possible combination of group-wise thresholds, quickly becomes prohibitively expensive as the number of columns increases. Finally, the higher dimension would affect the stability of an integrative procedure adversely. A promising idea for handling multiple auxiliary sequences is to construct a new auxiliary sequence that represents the “optimal use” of all available side information. However, the search for this optimal direction of projection is quite challenging. It would be of great interest to explore these directions in future research. 43 Chapter 3 Empirical Bayes Estimation in the Discrete Linear Exponential Family 3.1. Introduction Shrinkage methods, exemplified by the seminal work of James and Stein (1961), have received renewed attention in modern large-scale inference problems (Efron 2012, Fourdrinier et al. 2018). Under this setting, the classical Normal means problem has been extensively studied (Brown 2008, Jiang et al. 2009, Brown and Greenshtein 2009, Efron 2011, Xie et al. 2012, Weinstein et al. 2018). However, in a variety of applications, the observed data are often discrete. For instance, in the News Popularity study discussed in chapter 3.5.2, the goal is to estimate the popularity of a large number of news items based on their frequencies of being shared in social media platforms such as Facebook and LinkedIn. 
Another important application scenario arises from genomics research, where estimating the expected number of mutations across a large number of genomic locations can help identify key drivers or inhibitors of a given phenotype of interest. We mention two main limitations of existing shrinkage estimation methods. First, the methodology and theory developed for continuous variables, in particular for Normal means problem, may not be directly applicable to discrete models. Second, existing methods have focused on the squared error loss. However, the scaled loss (Clevenson and Zidek 1975), which effectively reflects the asymmetries in decision making [cf. Equation (3.3)], is a more desirable choice for many discrete models such as Poisson, where the scaled loss corresponds to the local Kulback-Leibler distance. The scaled loss also provides a more desirable criterion in a range of sparse settings, for example, when the goal is to estimate the rates of rare outcomes in Binomial distributions (Fourdrinier and Robert 1995). Much research is needed for discrete estimation problems under various loss functions. This article develops a general framework for empirical Bayes 44 estimation for the discrete linear exponential (DLE) family, also known as the family of discrete power series distributions (Noack 1950), under both regular and scaled error losses. The DLE family includes a wide class of popular members such as the Poisson, Binomial, negative Binomial and Geometric distributions. LetY be a non-negative integer valued random variable. ThenY is said to belong to aDLE family if its probability mass function (pmf) is of the form p(yj) = a y y g() ; y2f0; 1; 2;g; (3.1) wherea y andg() are known functions such thata y 0 is independent of andg() is a normalizing factor that is differentiable at every. Special cases ofDLE include the Poisson() distribution witha y = (y!) 1 , = and g() = exp (), and the Binomial(m;q) distribution with a y = m y , = q=(1q) and g() = (1 +) m . SupposeY 1 ;:::;Y n obey the following hierarchical model Y i j i ind: DLE( i ); i i:i:d G(); (3.2) whereG() is an unspecified prior distribution on i . The problem of interest is to estimate = ( 1 ;:::; n ) based onY = (Y 1 ;:::;Y n ). Empirical Bayes approaches to this compound decision problem date back to the famous Robbins’ formula (Robbins 1956) under the Poisson model. Important recent progresses by Brown et al. (2013), Koenker and Mizera (2014) and Koenker and Gu (2017) show that Robbins’ estimator can be vastly improved by incorporating smoothness and monotonicity adjustments. The main idea of existing works is to approximate the shrinkage factor in the Bayes estimator as smooth functionals of the unknown marginal pmfp(y). The pmf can be estimated in various ways including the observed empirical 45 frequencies (Robbins 1956), the smoothness-adjusted estimator (Brown et al. 2013) or the shape-constrained NPMLE approach (Koenker and Mizera 2014, Koenker and Gu 2017). This article develops a general non-parametric empirical Bayes (NEB) framework for compound estima- tion in discrete models. We first derive generalized Robbins’ formula (GRF) for theDLE Model (3.2), and then implementGRF via solving a scalable convex program. The powerful convex program, which is care- fully developed based on a reproducing kernel Hilbert space (RKHS) representation of Stein’s discrepancy measure, leads to a class of efficientNEB shrinkage estimators. 
We develop theories to show that theNEB estimator is p n consistent up to certain logarithmic factors and enjoys superior risk properties. Simulation studies are conducted to illustrate the superiority of the proposed NEB estimator when compared to exist- ing state-of-the-art approaches such as Brown et al. (2013), Koenker and Mizera (2014) and Koenker and Gu (2017). We show that the NEB estimator has smaller risk in all comparisons and the efficiency gain is substantial in many settings. There are several advantages of the proposedNEB estimation framework. First, in contrast with exist- ing methods such as the smoothness-adjusted Poisson estimator in Brown et al. (2013), our methodology covers a much wider range of distributions and presents a unified approach to compound estimation in discrete models. Second, our proposed convex program is fast and scalable. It directly produces stable estimates of optimal Bayes shrinkage factors and can easily incorporate various structural constraints into the decision rule. By contrast, the three-step estimator in Brown et al. (2013), which involves smoothing, Rao-Blackwellization and monotonicity adjustments, is complicated, computationally intensive and some- times unstable (as the numerator and denominator of the ratio are computed separately). Third, the RKHS representation of Stein’s discrepancy measure provides a new analytical tool for developing theories such as asymptotic optimality and convergence rates. Finally, theNEB estimation framework is robust to departures from the true model due to its utilization of a generic quadratic program that does not rely on the specific form of a particular DLE family. Our numerical results in chapter 3.4 demonstrate that the NEB estimator 46 has significantly better risk performance than competitive approaches of Efron (2011), Brown et al. (2013) and Koenker and Gu (2017) under a mis-specified Poisson model. An alternative approach to compound estimation in discrete models, as suggested and investigated by Brown et al. (2013), is to employ variance stabilizing transformations, which converts the discrete problem to a classical normal means problem. This allows estimation via Tweedie’s formula for normal variables (Efron 2011), where the marginal density can be estimated using NPMLE (Jiang et al. 2009, Koenker and Mizera 2014) or through kernel density methods (Brown and Greenshtein 2009). However, there are several drawbacks of this approach compared to ourNEB framework. First, Tweedie’s formula is not applicable to scaled error loss whereas our methodology is built upon the generalized Robbins’ formula, which covers both regular and scaled squared error losses. Second, there can be information loss in conventional data processing steps such as standardization, transformation and continuity approximation. While investigating the impact of information loss on compound estimation is of great interest, it is desirable to develop method- ologies directly based on generalized Robbins’ formula that is specifically derived and tailored for discrete variables. Finally, our NEB framework provides a convenient tool for developing asymptotic theories. By contrast, convergence rates are yet to be developed for normality inducing transformations, which can be highly non-trivial. The rest of the chapter is organized as follows. In chapter 3.2, we introduce our estimation framework while chapter 3.3 presents a theoretical analysis of the NEB estimator. 
The numerical performance of our method is investigated using both simulated and real data in chapters 3.4 and 3.5 respectively. Additional technical details and proofs are relegated to Appendix B. 47 3.2. A general framework for compound estimation inDLE family This section describes the proposedNEB framework for compound estimation in discrete models. We first introduce in chapter 3.2.1 the generalized Robbins’ formula for theDLE family (3.2), then propose in chapter 3.2.2 a convex optimization approach for its practical implementation. Details for tuning parameter selection are discussed in chapter 3.2.3. 3.2.1 Generalized Robbins’ formula forDLE models Denote i an estimator of i . Consider a class of loss functions ` (k) ( i ; i ) = k i ( i i ) 2 (3.3) fork2f0; 1g, where` (0) ( i ; i ) is the usual squared error loss, and` (1) ( i ; i ) = 1 i ( i i ) 2 corresponds to the scaled squared error loss (Clevenson and Zidek 1975, Fourdrinier and Robert 1995). In compound estimation, one is concerned with the average loss L (k) n (;) =n 1 n X i=1 ` (k) ( i ; i ): The associated risk is denoted R (k) n (;) = E Y Y Yj L (k) n (;). Let G G G( ) denote the joint distribution of ( 1 ; ; n ). The Bayes estimator (k) that minimizes the Bayes risk B (k) n () = R R (k) n (;)dG G G() is given by Lemma 3.2.1. Lemma 3.2.1 (Generalized Robbins’ formula). Consider theDLE Model (3.2). Letp() = R p(j)dG() be the marginal pmf ofY . Define fork2f0; 1g, w (k) p (y i ) = p(y i k) p(y i + 1k) ; fory i =k;k + 1; : 48 Then the Bayes estimator that minimizes the riskB (k) n () is given by (k) =f (k);i (y i ) : 1ing, where (k);i (y i ) = 8 > > > > < > > > > : a y i k =a y i +1k w (k) p (y i ) ; fory i =k;k + 1; 0; fory i <k : (3.4) Remark 3.2.1. Under the squared error loss (k = 0) withY i j i Poi( i ) anda y i = (y i !) 1 , Lemma 3.2.1 yields (0);i (y i ) = (y i + 1) p(y i + 1) p(y i ) ; (3.5) which recovers the classical Robbins’ formula (Robbins 1956). In contrast, under the scaled loss, we have (1);i (y i ) =y i p(y i ) p(y i 1) fory i > 0 and (1);i (y i ) = 0 otherwise: (3.6) Under scaled error loss the estimator (3.5) can be much outperformed by (3.6) (and vice versa under the regular loss). We develop parallel results for the two types of loss functions. Next we discuss related works for implementing Robbins’ formula under the empirical Bayes (EB) estimation framework. Inspecting (3.4) and (3.5), we can view a y i k =a y i +1k as a naive and known es- timator of i . The ratio functionalw (k) p (y i ), which is unknown in practice, represents the optimal shrink- age factor that depends on p(). Hence, a simple EB approach, as done in the classical Robbins’ for- mula, is to estimate w (k) p (y) by plugging-in empirical frequencies: ^ w (0) n (y) = ^ p n (y)=^ p n (y + 1), where ^ p n (y) = n 1 P n i=1 I(y i = y). It is noted by Brown et al. (2013) that this plug-in estimator can be highly inefficient especially when i are small. Moreover, the numerator and denominator inw (0) p (y) are estimated separately, which may lead to unstable ratios. Brown et al. (2013) showed that Robbins’ formula can be dramatically improved by imposing additional smoothness and monotonicity adjustments. An alternative approach is to estimatep(y) using NPMLE (Jiang et al. 2009) under appropriate shape constraints (Koenker 49 and Mizera 2014). However, efficient estimation of p(y) may not directly translate into an efficient esti- mation of the underlying ratiop(y + 1)=p(y). 
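To fix ideas, the plug-in estimator just described can be written in a few lines of base R for the Poisson model under squared error loss; the function name is illustrative, and this is the baseline rather than the estimator proposed below.

```r
# Minimal sketch: classical Robbins plug-in estimator (3.5) for Poisson means
# under squared error loss, using raw empirical frequencies.
robbins_plugin <- function(y) {
  p_hat <- function(v) mean(y == v)                 # empirical pmf at count value v
  sapply(y, function(yi) (yi + 1) * p_hat(yi + 1) / p_hat(yi))
}
```

Whenever the empirical frequency of $y_i + 1$ is zero or rests on only a handful of observations, the resulting ratio is degenerate or highly variable, which is precisely the instability that the smoothness and monotonicity adjustments cited above are designed to repair.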
We recast the compound estimation problems as a convex program, which directly produces consistent estimates of the ratio functionals w (k) p = n w (k) p (y 1 );:::;w (k) p (y n ) o from data. The estimators are shown to enjoy superior numerical and theoretical properties. Unlike existing works that are limited to regular loss and specific members in theDLE family, our method can handle a wide range of distributions and various types of loss functions in a unified framework. 3.2.2 Shrinkage estimation by convex optimization This section focuses on the scaled squared error loss (k = 1). Methodologies and theories for the case with usual squared error loss (k = 0) can be derived similarly; details are provided in Appendix B.1.1. We first introduce some notations and then present theNEB estimator in Definition 1A. SupposeY is a non-negative integer-valued random variable with pmfp(). Define h (1) 0 (y) = 8 > > > > < > > > > : 1; ify = 0 1w (1) p (y); ify2f1; 2;:::g: (3.7) LetK (y;y 0 ) = expf 1 2 (yy 0 ) 2 g be the positive definite Radial Basis Function (RBF) kernel with bandwidth parameter2 where is a compact subset ofR + bounded away from 0. Given observations y = (y 1 ;:::;y n ) from Model (3.2), leth (1) 0 = n h (1) 0 (y 1 );:::;h (1) 0 (y n ) o . Define operators y K (y;y 0 ) = K (y + 1;y 0 )K (y;y 0 ) and y;y 0K (y;y 0 ) = y 0 y K (y;y 0 ) = y y 0K (y;y 0 ): 50 Consider the followingnn matrices, which are needed in the definition of theNEB estimator: K =n 2 [K (y i ;y j )] ij ; K =n 2 [ y i K (y i ;y j )] ij ; 2 K =n 2 [ y i ;y j K (y i ;y j )] ij : Definition 1A (NEB estimator). Consider theDLE Model (3.2) with loss` (1) ( i ; i ). For any fixed2 , let ^ h (1) n () = n ^ h (1) 1 ();:::; ^ h (1) n () o be the solution to the following quadratic optimization problem: min h2Hn ^ M ;n (h) =h T K h + 2h T K 1 + 1 T 2 K 1; (3.8) whereH n =fh = (h 1 ;:::;h n ) :Ahb;Ch =dg is a convex set andA;C;b andd are known real matrices and vectors that enforce linear constraints on the components ofh. Define ^ w (1) i () = 1 ^ h (1) i (). Then theNEB estimator is given by neb (1) () = n neb (1);i () : 1in o , where neb (1);i () = a y i 1 =a y i ^ w (1) i () ; ify i 2f1; 2;:::g; and neb (1);i () = 0 ify i = 0. Next we provide some insights on why the optimization criterion (3.8) works; theories are developed in chapter 3.3 to establish the properties of the NEB estimator rigorously. Denote h (1) 0 and ~ h (1) as the ratio functionals corresponding to pmfsp and ~ p, respectively. SupposeY i are i.i.d. samples obeyingp(y). Theorem 1 shows that ^ M ;n ( ~ h) =M ( ~ h) +O p log 2 n n 1=2 ; where ^ M ;n ( ~ h) is the objective function in (3.8) andM ( ~ h), also denotedS [~ p](p), is the kernelized Stein’s discrepancy (KSD). Roughly speaking, the KSD measures how different one distributionp is from another distribution ~ p, withS [~ p](p) = 0 if and only if ~ p = p. A key feature of the KSD is thatS [~ p](p) can 51 be equivalently represented by the discrepancy between the corresponding ratio functionalsh (1) 0 and ~ h (1) . Hence, optimizing (3.8) is asymptotically equivalent to finding ~ h (1) that is as close as possible to the true underlyingh (1) 0 , which corresponds to the optimal shrinkage factor in the compound estimation problem. Theorems 2A and 3A demonstrate that (3.8) is an effective convex program in the sense that the minimizer ^ h n is p n consistent with respect toh (1) 0 , and the resultantNEB estimator converges to the Bayes estimator. 
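To illustrate the structure of the convex program, the sketch below builds the matrices $K_\lambda$, $\nabla K_\lambda$ and $\nabla^2 K_\lambda$ and evaluates the criterion (3.8) in base R. It is a simplified illustration rather than the proposed implementation: the RBF parametrization written here, the small ridge term, and the omission of the linear constraints defining $\mathcal{H}_n$ are assumptions made for the sketch, and the function name is hypothetical. In practice the constrained minimization would be handed to a quadratic programming solver (for example, the quadprog or CVXR packages in R).

```r
# Illustrative sketch of the quadratic criterion (3.8) and its unconstrained
# minimizer; constraints (monotonicity, ties, boundary) are omitted here.
neb_ratio_sketch <- function(y, lambda) {
  n   <- length(y)
  rbf <- function(u, v) exp(-outer(u, v, "-")^2 / (2 * lambda))  # assumed RBF parametrization
  K   <- rbf(y, y) / n^2                                  # matrix K_lambda
  dK  <- (rbf(y + 1, y) - rbf(y, y)) / n^2                # forward difference in the first argument
  d2K <- (rbf(y + 1, y + 1) - rbf(y + 1, y) -
          rbf(y, y + 1) + rbf(y, y)) / n^2                # difference in both arguments
  cvec  <- drop(dK %*% rep(1, n))
  ridge <- 1e-6 * mean(diag(K))                           # small ridge for numerical stability
  h <- drop(solve(K + ridge * diag(n), -cvec))            # unconstrained minimizer of (3.8)
  list(h = h,
       w = 1 - h,                                         # estimated ratios w_i = 1 - h_i
       objective = sum(h * (K %*% h)) + 2 * sum(h * cvec) + sum(d2K))
}
```

The estimated ratios $\hat{w}^{(1)}_i(\lambda) = 1 - \hat{h}^{(1)}_i(\lambda)$ then plug into the generalized Robbins' formula to produce the NEB estimates.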
3.2.3 Structural constraints and bandwidth selection In problem (3.8) the linear inequalityAh b can be used to impose structural constraints on the NEB rule neb (1) (). The structural constraints, which may take the form of monotonicity constraints as pursued in, for example, Brown et al. (2013) and Koenker and Mizera (2014), have been shown to be effective for stabilizing the estimator and hence improving the accuracy. For example, when Y i j i Poi( i ) then (1);i (y i ) = (y i + 1)=w (1) p (y i ) andA,b can be chosen such that h (i1) y (i1) + 1 y (i) + 1 h (i) 1 y (i1) + 1 y (i) + 1 ; for 2in andy (1) y (2) y (n) . Moreover, wheny i = 0 we set neb (1);i () = 0 by convention (see lemma 3.2.1). The equality constraintsCh = d accommodate such boundary conditions along with instances of ties for which we requireh i =h j whenevery i =y j . The implementation of the quadratic program in (3.8) requires the choice of a tuning parameter in the RBF kernel. For practical applications, must be determined in a data-driven fashion. For infinitely divisible random variables (Klenke 2014) such as Poisson variables, Brown et al. (2013) proposed a modified cross validation (MCV) method for choosing the tuning parameter. However, the MCV method cannot be applied to distributions with bounded support, e.g. variables that are not infinitely divisible (Sato and Ken- Iti 1999) such as the Binomial distribution. To provide a unified estimation framework for theDLE family, 52 we develop an alternative method for choosing . The key idea is to derive an asymptotic risk estimate ARE (1) n () that serves as an approximation of the true riskR (1) n (;). Then the tuning parameter is chosen to minimizeARE (1) n (). The methodology based on ARE is illustrated below for Poisson and Binomial models under the scaled loss (see Definitions 2A and 3A, respectively). The ideas can be extended to other members in the DLE family. In Appendix B.1.2, we provide relevant details for choosing under the regular lossL (0) n . Definition 2A (ARE of neb (1) () in the Poisson model). Suppose Y i j i ind: Poi( i ). Under the loss ` (1) ( i ;), an ARE of the true risk of neb (1) () is ARE (1;P) n () = 1 n n n X i=1 y i + n X i=1 (y i ) 2 n X i=1 neb (1);i () o ; where (y i ) =f neb (1);j ()g 2 =(y i + 1); y i = 0; 1;:::: withj2f1;:::;ng such thaty j =y i + 1. For the Binomial model, we proceed along similar lines and consider the following asymptotic risk estimate of the true risk. Definition 3A (ARE of neb (1) () in the Binomial model). SupposeY i jq i Bin(m;q i ). Hence in Equations (3.1) and (3.2) we havea y i = m y i and i =q i =(1q i ). Under the loss` (1) ( i ;), an ARE of the true risk of neb (1) () is ARE (1;B) n () = 1 n n n X i=1 y i my i + 1 + n X i=1 (my i ) (y i ) 2 n X i=1 neb (1);i () o ; where (y i ) =f neb (1);j ()g 2 =(y i + 1); y i = 0;:::;m: withj2f1;:::;ng such thaty j =y i + 1. 53 Remark 3.2.2. Although the expression for appears to be identical across Definitions 2A and 3A, it differs with respect to neb (1) (). Specifically, in definition 2A, neb (1) () is theNEB estimator of the Poisson means, whereas in Definition 3A, neb (1) () is theNEB estimator of the Binomial odds. Remark 3.2.3. If for somei, y i + 1 is not available in the observed sampley, (y i ) can be calculated using cubic splines, and a linear interpolation can be used to tackle the boundary point of the observed sample maxima. 
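As an illustration, the criterion of Definition 2A can be evaluated in a few lines of R (the Binomial criterion of Definition 3A is analogous); here delta denotes the vector of NEB estimates $\delta^{neb}_{(1),i}(\lambda)$ obtained at a candidate bandwidth, a simple linear interpolation handles count values $y_i + 1$ that do not appear in the sample, and the function name is illustrative.

```r
# Sketch of the ARE criterion from Definition 2A (Poisson model, scaled loss).
# 'delta' holds the NEB estimates at the candidate bandwidth being evaluated.
are_poisson_scaled <- function(y, delta) {
  n    <- length(y)
  vals <- sort(unique(y))
  dmap <- sapply(vals, function(v) mean(delta[y == v]))   # estimate attached to each count value
  # estimate at y_i + 1; linear interpolation (rule = 2 handles the sample maximum)
  d_next <- approx(vals, dmap, xout = y + 1, rule = 2)$y
  zeta   <- d_next^2 / (y + 1)
  (sum(y) + sum(zeta) - 2 * sum(delta)) / n
}
# Bandwidth selection: evaluate this criterion over a grid of candidate lambda
# values (refitting delta at each) and take the minimizer.
```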
We propose the following estimate of the tuning parameter based on the ARE:
\[
\hat\lambda = \begin{cases} \operatorname*{argmin}_{\lambda \in \Lambda}\, \text{ARE}^{(1, P)}_n(\lambda), & \text{if } Y_i \mid \theta_i \overset{ind.}{\sim} \text{Poi}(\theta_i),\\[2pt] \operatorname*{argmin}_{\lambda \in \Lambda}\, \text{ARE}^{(1, B)}_n(\lambda), & \text{if } Y_i \mid q_i \overset{ind.}{\sim} \text{Bin}(m, q_i). \end{cases} \tag{3.9}
\]
In practice we recommend using $\Lambda = [10, 10^2]$, which works well in all our simulations and real data analyses. In chapter 3.3, we present Lemmas 2 and 3, which provide asymptotic justifications for selecting $\lambda$ using Equation (3.9).

3.3. Theory

This section studies the theoretical properties of the NEB estimator under the Poisson and Binomial models. We first investigate the large-sample behavior of the KSD measure (chapter 3.3.1), then turn to the performance of the estimated ratios $\hat{\boldsymbol{w}}_n$ (chapter 3.3.2), and finally establish the consistency and risk properties of the proposed estimator $\boldsymbol{\delta}^{neb}$ (chapter 3.3.3). The accuracy of the ARE criteria, which are used in choosing the tuning parameter, is also investigated.

3.3.1 Theoretical properties of the KSD measure

To provide motivation and theoretical support for Definition 1A, we introduce the Kernelized Stein's Discrepancy (KSD) (Liu et al. 2016, Chwialkowski et al. 2016) and discuss its connection to the quadratic program (3.8). While the KSD has been used in various contexts including goodness-of-fit tests (Liu et al. 2016), variational inference (Liu and Wang 2016) and Monte Carlo integration (Oates et al. 2017), our theory on its connection to the compound estimation problem and empirical Bayes methodology is novel.

Assume that $Y$ and $Y'$ are i.i.d. copies from the marginal pmf $p$. Consider $h_0$ defined in Equation (3.7). (In Section 3.3.1 we drop the superscript on $h_0$ that indicates whether the loss is scaled or regular; this simplification has no impact since the general idea holds for both types of losses, and the discussion in this section focuses on the scaled loss.) Let $\tilde p$ denote a pmf on the support of $Y$, for which we similarly define $\tilde h$. The KSD, which is formally defined as
\[
S_\lambda[\tilde p](p) = E_p\Big[\big\{\tilde h(Y) - h_0(Y)\big\}\, K_\lambda(Y, Y')\, \big\{\tilde h(Y') - h_0(Y')\big\}\Big], \tag{3.10}
\]
provides a discrepancy measure between $p$ and $\tilde p$ in the sense that
(a) $S_\lambda[\tilde p](p) \ge 0$, and $S_\lambda[\tilde p](p) = 0$ if and only if $p = \tilde p$; and
(b) informally, $S_\lambda[\tilde p](p)$ tends to increase when there is a bigger disparity between $h_0$ and $\tilde h$ (or, equivalently, between $p$ and $\tilde p$).

The direct evaluation of $S_\lambda[\tilde p](p)$ via Equation (3.10) is difficult because $h_0$ is unknown. Noting that the pmf $p$ can be learned well from a random sample $\{Y_1, \ldots, Y_n\} \sim p$, we introduce an alternative representation of the KSD, developed by Liu et al. (2016) in a reproducing kernel Hilbert space (RKHS), that does not directly involve the unknown $h_0$. Concretely, consider a smooth positive definite kernel function $\kappa_\lambda[\tilde p]$:
\[
\kappa_\lambda[\tilde p](u, v) = \tilde h(u)\tilde h(v) K_\lambda(u, v) + \tilde h(u)\, \nabla_v K_\lambda(u, v) + \tilde h(v)\, \nabla_u K_\lambda(u, v) + \nabla_{u, v} K_\lambda(u, v). \tag{3.11}
\]
For i.i.d. copies $(Y, Y')$ from the distribution $p$, it can be shown that
\[
S_\lambda[\tilde p](p) = E_{(Y, Y') \overset{i.i.d.}{\sim} p}\big[\kappa_\lambda[\tilde p](Y, Y')\big] = \frac{1}{n(n-1)}\, E_p\Big[\sum_{1 \le i \ne j \le n} \kappa_\lambda[\tilde h(Y_i), \tilde h(Y_j)](Y_i, Y_j)\Big] := M_\lambda(\tilde h), \tag{3.12}
\]
where $\{Y_1, \ldots, Y_n\}$ is a random sample from $p$. It can be similarly shown that $M_\lambda(\tilde h) = 0$ if and only if $\tilde h = h_0$.
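The kernel in (3.11) can be written down directly; the following hedged R sketch evaluates $\kappa_\lambda[\tilde p](u, v)$ for any candidate ratio functional $\tilde h$ supplied as a vectorized function, again assuming the RBF parameterization used in the earlier sketch.

```r
# Stein kernel of (3.11) built from forward differences of the RBF kernel; h_tilde is
# a vectorized function returning the candidate ratio functional, e.g.
# h_tilde <- function(y) rep(0.5, length(y)).
kappa <- function(u, v, h_tilde, lambda) {
  K    <- function(a, b) exp(-(a - b)^2 / (2 * lambda))
  d_u  <- K(u + 1, v) - K(u, v)                          # nabla_u K
  d_v  <- K(u, v + 1) - K(u, v)                          # nabla_v K
  d_uv <- K(u + 1, v + 1) - K(u + 1, v) - K(u, v + 1) + K(u, v)
  h_tilde(u) * h_tilde(v) * K(u, v) + h_tilde(u) * d_v + h_tilde(v) * d_u + d_uv
}
# The empirical criterion (3.13) is then mean(outer(y, y, kappa, h_tilde = h_tilde, lambda = 50)).
```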
Substituting the empirical distribution $\hat p_n$ in place of the pmf $p$ in (3.12), we obtain the following empirical evaluation scheme for $S_\lambda[\tilde p](p)$ that is both intuitive and computationally efficient:
\[
S_\lambda[\tilde p](\hat p_n) = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \kappa_\lambda[\tilde h(y_i), \tilde h(y_j)](y_i, y_j) := \hat M_{\lambda, n}(\tilde h). \tag{3.13}
\]
Note that (3.13) is exactly the objective function of the quadratic program (3.8).

The empirical representation of the KSD in (3.13) provides an extremely useful tool for solving the discrete compound decision problem under the EB estimation framework. A key observation is that the kernel function $\kappa_\lambda[\tilde p](u, v)$ depends on $\tilde p$ only through $\tilde h$. Meanwhile, the EB implementation of the generalized Robbins' formula [cf. Equations (3.4) and (3.7)] essentially boils down to the estimation of $h_0$. Hence, if $S_\lambda[\tilde p](\hat p_n)$ is asymptotically equal to $S_\lambda[\tilde p](p)$, then minimizing $S_\lambda[\tilde p](\hat p_n)$ with respect to the unknowns $\tilde h = \{\tilde h(y_1), \ldots, \tilde h(y_n)\}$ is effectively the process of finding an $\tilde h$ that is as close as possible to $h_0$, which yields an asymptotically optimal solution to the EB estimation problem. Therefore our formulation of the NEB estimator $\boldsymbol{\delta}^{neb}(\lambda)$ is justified as long as we can establish the asymptotic consistency of the sample criterion $S_\lambda[\tilde p](\hat p_n)$ around the population criterion $S_\lambda[\tilde p](p)$ uniformly over $\lambda$ (Theorem 1).

For a fixed mass function $\tilde p$ on the support of $Y$, we impose the following regularity conditions that are needed in our technical analysis.

(A1) $E_p\,\big|\kappa_\lambda[\tilde h(U), \tilde h(V)](U, V)\big|^2 < \infty$ for all $\lambda \in \Lambda$, where $\Lambda$ is a compact subset of $\mathbb{R}_+$ bounded away from $0$.

(A2) For some $\epsilon \in (0, 1)$, $\limsup_{n \to \infty} n^{-1} \sum_{i=1}^n \exp(\theta_i^{\epsilon}) < \infty$.

(A3) For any function $g$ that satisfies $0 < \|g\|^2_2 < \infty$, there exists a constant $c > 0$ such that $n^{-2} \sum_{i, j = 1}^n g(y_i) K_\lambda(y_i, y_j) g(y_j) > c\, \|g\|^2_2$ for every $\lambda \in \Lambda$.

(A4) The feasible solutions $\boldsymbol{h}_n$ to equations (3.8) and (B.2) satisfy $\sup_{\boldsymbol{h}_n \in \mathcal{H}_n} \|\boldsymbol{h}_n\|_1 = O(n \log n)$.

Remark 3.3.1. Assumption (A1) is a standard moment condition on the kernel function related to V-statistics; see, for example, Serfling (2009). Assumption (A2) ensures that, with high probability, $\max(Y_1, \ldots, Y_n)$ grows at most polylogarithmically in $n$ as $n \to \infty$. This idea is formalized by Lemma A in Appendix B.2. Assumption (A3) is a standard condition which ensures that the KSD $S_\lambda[\tilde p](p)$ is a valid discrepancy measure (Liu et al. 2016, Chwialkowski et al. 2016). Assumption (A4) provides a control on the growth rate of the $\ell_1$ norm of the feasible solutions. In particular, both Assumptions (A3) and (A4) play a critical role in establishing point-wise Lipschitz stability of the optimal solution $\hat h_n(\lambda)$ under perturbations of the tuning parameter $\lambda$ (see Lemma B in Appendix B.2).

Theorem 1. If $p$ and $\tilde p$ are probability mass functions on the support of $Y$ then, under Assumptions (A1) and (A2), we have
\[
\sup_{\lambda \in \Lambda}\; \big|\hat M_{\lambda, n}(\tilde h) - M_\lambda(\tilde h)\big| = O_p\Big(\frac{\log^2 n}{\sqrt{n}}\Big).
\]
In the context of our compound estimation framework, Theorem 1 is significant because it guarantees that the empirical version of the KSD measure, $\hat M_{\lambda, n}(\tilde h)$, is asymptotically close to its population counterpart $M_\lambda(\tilde h)$ uniformly in $\lambda \in \Lambda$. Moreover, together with the fact that $M_\lambda(h_0) = 0$, Theorem 1 establishes that $\hat M_{\lambda, n}(\boldsymbol{h})$ is the appropriate criterion to minimize with respect to $\boldsymbol{h}$. In Theorem 2A, we further show that the resulting estimator of the ratio functionals $w^{(1)}_p$ from equation (3.8) is consistent.

3.3.2 Theoretical properties of $\hat{\boldsymbol{w}}_n$

The optimization problem in (3.8) is defined over a convex set $\mathcal{H}_n \subseteq \mathbb{R}^n$. However, the dimension of $\mathcal{H}_n$, denoted $\dim(\mathcal{H}_n)$, is usually much smaller than $n$.
Consider the Binomial case where $Y_i \mid q_i \sim \text{Bin}(m_i, q_i)$ with $q_i \in (0, 1)$, $m_i \le m < \infty$ and $\theta_i = q_i/(1 - q_i)$. Here $\dim(\mathcal{H}_n)$ is at most $m$ since $\max(Y_1, \ldots, Y_n) \le m$. While boundedness of the support is not always available outside the Binomial case, in most practical applications it is reasonable to assume that the distribution of $\theta_i$ has some finite moments, which ensures that $\dim(\mathcal{H}_n)$ grows slower than $\log n$; see Assumption (A2). In Lemma A we make this precise. The next theorem establishes the asymptotic consistency of $\hat w^{(1)}_n(\lambda)$.

Theorem 2A. Let $K_\lambda(\cdot, \cdot)$ be the positive definite RBF kernel with bandwidth parameter $\lambda \in \Lambda$. If $\lim_{n \to \infty} c_n n^{-1/2} \log^2 n = 0$ then, under Assumptions (A1)-(A3), we have for any $\lambda \in \Lambda$,
\[
\lim_{n \to \infty} \mathbb{P}\Big\{\big\|\hat w^{(1)}_n(\lambda) - w^{(1)}_p\big\|_2 \ge \epsilon\, c_n^{-1}\Big\} = 0 \quad \text{for any } \epsilon > 0,
\]
where $\hat w^{(1)}_n(\lambda) = 1 - \hat h^{(1)}_n(\lambda)$.

Theorem 2A shows that, under the scaled squared error loss, $\hat w^{(1)}_n(\lambda)$, the optimizer of the quadratic form (3.8), provides a consistent estimator of $w^{(1)}_p$, the optimal shrinkage factor in the Bayes rule (Lemma 3.2.1). Theorem 2A is proved in Appendix B.2.3, where we also include the relevant details for proving a companion result under the regular squared error loss.

Remark 3.3.2. The estimation framework in Definition 1A may be used to produce consistent estimators for any member of the DLE family. This allows the corresponding NEB estimator to cover a much wider class of discrete distributions than previously proposed methods. Compared to existing methods (Efron 2011, Brown et al. 2013, Koenker and Mizera 2014, Koenker and Gu 2017), our proposed NEB estimation framework is robust against departures from the true data generating process. This is because the quadratic optimization problem in (3.8) does not rely on the specific form of the distribution of $Y \mid \theta$, and the shrinkage factors are estimated in a non-parametric fashion. The robustness of the estimator is corroborated by our numerical results in chapter 3.4.

3.3.3 Properties of the NEB estimator

In this section we discuss the risk properties of the NEB estimator. We begin with two lemmas showing that, uniformly in $\lambda \in \Lambda$, the gap between the estimated risk $\text{ARE}^{(1)}_n(\lambda)$ and the true risk is asymptotically negligible. This justifies our proposed methodology for choosing the tuning parameter in chapter 3.2.3. In the following two lemmas, we let $c_n$ be a sequence satisfying $\lim_{n \to \infty} c_n n^{-1/2} \log^{5/2} n = 0$.

Lemma 2. Under Assumptions (A3) and (A4) and the Binomial model, we have
(a) $c_n \sup_{\lambda \in \Lambda} \big|\text{ARE}^{(1, B)}_n(\lambda; \boldsymbol{Y}) - R^{(1)}_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{neb}_{(1)}(\lambda))\big| = o_p(1)$;
(b) $c_n \sup_{\lambda \in \Lambda} \big|\text{ARE}^{(1, B)}_n(\lambda; \boldsymbol{Y}) - L^{(1)}_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{neb}_{(1)}(\lambda))\big| = o_p(1)$.

Lemma 3. Under Assumptions (A2), (A3) and (A4) and the Poisson model, we have
(a) $c_n \sup_{\lambda \in \Lambda} \big|\text{ARE}^{(1, P)}_n(\lambda; \boldsymbol{Y}) - R^{(1)}_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{neb}_{(1)}(\lambda))\big| = o_p(1)$;
(b) $c_n \sup_{\lambda \in \Lambda} \big|\text{ARE}^{(1, P)}_n(\lambda; \boldsymbol{Y}) - L^{(1)}_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{neb}_{(1)}(\lambda))\big| = o_p(1)$.

To analyze the quality of the data-driven bandwidth $\hat\lambda$ [cf. Equation (3.9)], we consider an oracle loss estimator $\boldsymbol{\delta}^{or}_{(1)} := \boldsymbol{\delta}^{neb}_{(1)}(\lambda^{orc}_1)$, where
\[
\lambda^{orc}_1 := \operatorname*{argmin}_{\lambda \in \Lambda}\; L^{(1)}_n\big\{\boldsymbol{\theta}, \boldsymbol{\delta}^{neb}_{(1)}(\lambda)\big\}.
\]
The oracle bandwidth $\lambda^{orc}_1$ is not available in practice since it requires knowledge of the unknown $\boldsymbol{\theta}$. However, it provides a benchmark for assessing the effectiveness of the data-driven bandwidth selection procedure in chapter 3.2.3. The following lemma shows that the loss of $\boldsymbol{\delta}^{neb}_{(1)}(\hat\lambda)$ converges in probability to the loss of $\boldsymbol{\delta}^{or}_{(1)}$.

Lemma 4.
Under Assumptions (A2)-(A4), if $\lim_{n \to \infty} c_n n^{-1/2} \log^{5/2} n = 0$, then for both the Poisson and Binomial models we have
\[
\lim_{n \to \infty} \mathbb{P}\Big[L^{(1)}_n\big\{\boldsymbol{\theta}, \boldsymbol{\delta}^{neb}_{(1)}(\hat\lambda)\big\} \ge L^{(1)}_n\big(\boldsymbol{\theta}, \boldsymbol{\delta}^{or}_{(1)}\big) + \epsilon\, c_n^{-1}\Big] = 0 \quad \text{for any } \epsilon > 0.
\]
The loss of the oracle estimator $\boldsymbol{\delta}^{neb}_{(1)}(\lambda^{orc}_1)$ is, of course, bounded below by the loss of the optimal solution $\boldsymbol{\delta}^{\pi}_{(1)}$ (Lemma 3.2.1). Next we study the asymptotic optimality of $\boldsymbol{\delta}^{neb}_{(1)}$, which aims to provide decision-theoretic guarantees on $\boldsymbol{\delta}^{neb}_{(1)}$ in relation to $\boldsymbol{\delta}^{\pi}_{(1)}$. Theorem 3A establishes the optimality theory by showing that (a) the largest coordinate-wise gap between $\boldsymbol{\delta}^{neb}_{(1)}(\hat\lambda)$ and $\boldsymbol{\delta}^{\pi}_{(1)}$ is asymptotically small, and (b) the estimation loss of the NEB estimator converges in probability to the loss of the corresponding Bayes estimator as $n \to \infty$.

Theorem 3A. Under the conditions of Theorem 2A, if $\lim_{n \to \infty} c_n n^{-1/2} \log^4 n = 0$, then for both the Poisson and Binomial models we have
\[
c_n\, \big\|\boldsymbol{\delta}^{neb}_{(1)}(\hat\lambda) - \boldsymbol{\delta}^{\pi}_{(1)}\big\|_\infty = o_p(1).
\]
Furthermore, under the same conditions, we have for both the Poisson and Binomial models,
\[
\lim_{n \to \infty} \mathbb{P}\Big[L^{(1)}_n\big(\boldsymbol{\theta}, \boldsymbol{\delta}^{neb}_{(1)}(\hat\lambda)\big) \ge L^{(1)}_n\big(\boldsymbol{\theta}, \boldsymbol{\delta}^{\pi}_{(1)}\big) + \epsilon\, c_n^{-1}\Big] = 0 \quad \text{for any } \epsilon > 0.
\]

Remark 3.3.3. The second part of Theorem 3A follows from the first part and Lemma 4. In Appendix B.1.2, we discuss the counterpart to Theorem 3A under the squared error loss $L^{(0)}_n$.

3.4. Numerical results

In this section we first discuss, in chapter 3.4.1, the implementation details of the convex program (3.8) and the bandwidth selection procedure (3.9) (see also (B.6) in Appendix B.1.2). We then investigate the numerical performance of the NEB estimator for Poisson and Binomial compound decision problems, respectively, in chapters 3.4.2 and 3.4.3. Both regular and scaled losses are considered. Our numerical results demonstrate that the proposed NEB estimator enjoys superior numerical performance, and the efficiency gain over competing methods is substantial in many settings.

We have developed an R package, npeb, that implements our proposed NEB estimator in Definition 1A (and Definition 1B in Appendix B.1.1) for the Poisson and Binomial models under both regular and scaled losses. Moreover, the R code that reproduces the numerical results in the simulations can be downloaded from the following link: https://github.com/trambakbanerjee/DLE_paper.

3.4.1 Implementation Details

For a fixed $\lambda$ we use the R package CVXR (Fu et al. 2017) to solve the optimization problem in Equation (3.8) [and (B.2) in Appendix B.1.1]. As discussed in chapter 3.2.2, in the implementation under the scaled squared error loss ($k = 1$) the linear inequality constraints, given by $\boldsymbol{A}\boldsymbol{h} \le \boldsymbol{b}$, ensure that the resulting decision rule $\boldsymbol{\delta}^{neb}_{(1)}(\lambda)$ is monotonic, while the equality constraints $\boldsymbol{C}\boldsymbol{h} = \boldsymbol{d}$ handle boundary cases that involve $y_i = 0$ and ties. Moreover, since $w^{(1)}_p(y) > 0$, the inequality constraints also ensure that $h_i < 1$ whenever $y_i > 0$. Implementation under the squared error loss ($k = 0$) follows along similar lines, and the inequality constraints in this case ensure that $h_i + y_i > 0$ whenever $y_i \ge 0$.

A data-driven choice of the tuning parameter $\lambda$ is obtained by first solving problems (3.8) and (B.2) over a grid of values $\{\lambda_1, \ldots, \lambda_s\}$, and then computing the corresponding asymptotic risk estimate $\text{ARE}^{(k)}_n(\lambda_j)$ for $j = 1, \ldots, s$. Then $\lambda$ is chosen according to
\[
\hat\lambda_k := \operatorname*{argmin}_{\lambda \in \{\lambda_1, \ldots, \lambda_s\}} \text{ARE}^{(k)}_n(\lambda),
\]
where $k \in \{0, 1\}$. For all simulations and real data analyses considered in this chapter, we have fixed $s = 10$ and employed an equi-spaced grid over $[10, 10^2]$.
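Putting the pieces together, the sketch below mirrors the tuning loop described above: it fits the (simplified) NEB weights on an equi-spaced grid of bandwidths and keeps the value minimizing the Poisson ARE. It relies on the illustrative helpers neb_weights() and are_poisson() sketched earlier, not on the npeb package, and uses the Poisson form $a_{y-1}/a_y = y$ from Definition 1A.

```r
# Data-driven bandwidth for the Poisson model: grid search over lambda as in (3.9).
select_lambda <- function(y, grid = seq(10, 100, length.out = 10)) {
  are <- sapply(grid, function(lam) {
    w     <- neb_weights(y, lam)               # shrinkage factors w_i(lambda)
    delta <- ifelse(y > 0, y / w, 0)           # NEB estimates under the scaled loss (Poisson)
    are_poisson(y, delta)
  })
  grid[which.min(are)]
}
```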
3.4.2 Simulations: Poisson Distribution

In this section we generate observations $Y_i \mid \theta_i \overset{ind.}{\sim} \text{Poi}(\theta_i)$ for $i = 1, \ldots, n$ and vary $n$ from 500 to 5000 in increments of 500. We consider four different scenarios for simulating $\theta_i$. For each scenario, the following competing estimators are considered:

the proposed estimator, denoted NEB;
the oracle NEB estimator $\boldsymbol{\delta}^{or} := \boldsymbol{\delta}^{neb}(\lambda^{orc})$, denoted NEB OR;
the estimator of Poisson means based on Brown et al. (2013), denoted BGR;
the estimator of Poisson means based on Koenker and Gu (2017), denoted KM;
Tweedie's formula based on Efron (2011) for the Poisson model, denoted TF OR;
Tweedie's formula for the Normal means problem based on transformed data and the convex optimization approach in Koenker and Mizera (2014), denoted TF Gauss. The approach using a transformation was suggested by Brown et al. (2013).

The risk performance of the TF OR method relies heavily on the choice of a suitable bandwidth parameter $h > 0$. We use the oracle loss estimate $h^{orc}$, which is obtained by minimizing the true loss $L^{(0)}_n$. The TF Gauss methodology is only applicable to the Normal means problem, and uses a variance stabilization transformation on $Y_i$ to get $Z_i = 2\sqrt{Y_i + 0.25}$. The $Z_i$ are then treated as approximately Normal random variables with means $\zeta_i$ and variances 1. To estimate the normal means $\zeta_i$, we use the NPMLE approach proposed by Koenker and Mizera (2014). Finally, $\theta_i$ is estimated as $0.25\,\hat\zeta_i^{\,2}$.

It is important to note that the competitors to our NEB estimator focus only on the regular loss $L^{(0)}_n$. Nevertheless, in our simulations we assess the performance of these estimators for estimating $\boldsymbol{\theta}$ under both $L^{(0)}_n$ and $L^{(1)}_n$. Consider the following settings:

Scenario 1: We generate $\theta_i \overset{i.i.d.}{\sim} \text{Unif}(0.5, 15)$ for $i = 1, \ldots, n$.

Scenario 2: We generate $\theta_i \overset{i.i.d.}{\sim} 0.75\,\text{Gamma}(5, 1) + 0.25\,\text{Gamma}(10, 1)$ for $i = 1, \ldots, n$.

In the next two scenarios we assess the robustness of the five competing estimators to departures from the Poisson model. Specifically, we consider the Conway-Maxwell-Poisson distribution (Shmueli et al. 2005), $\text{CMP}(\theta_i, \nu)$. The CMP distribution is a generalization of several well-known discrete distributions; with $\nu < 1$, the CMP is a discrete distribution that has longer tails than the Poisson distribution with parameter $\theta_i$.

Scenario 3: We generate $\theta_i \overset{i.i.d.}{\sim} 0.5\,\delta_{\{10\}} + 0.5\,\text{Gamma}(5, 2)$ for each $i$ and let
\[
Y_i \mid \theta_i \overset{ind.}{\sim} 0.8\,\text{Poi}(\theta_i) + 0.2\,\text{CMP}(\theta_i, \nu),
\]
where we take $\nu = 0.8$ for the CMP distribution.

Scenario 4: We let $\boldsymbol{\theta}$ be an equi-spaced vector of length $n$ in $[1, 5]$ and let $Y_i \mid \theta_i$ follow the CMP distribution with parameters $\theta_i$ and $\nu = 0.8$.

Figure 3.1 (panels (a)-(d)): Poisson compound decision problem under scaled squared error loss. Risk estimates of the various estimators (KM, NEB, NEB OR, TF Gauss, TF OR) for Scenarios 1 to 4, with estimation of $\boldsymbol{\theta}$ under loss $L^{(1)}_n$ as $n$ varies from 500 to 5000.
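For concreteness, a minimal sketch of the data-generating mechanisms for Scenarios 1 and 2 is given below; Scenarios 3 and 4 additionally require a Conway-Maxwell-Poisson sampler, which is not shown here.

```r
# Simulate (theta, y) under Scenario 1 or 2 of the Poisson experiments.
simulate_poisson_scenario <- function(n, scenario = 1) {
  theta <- if (scenario == 1) {
    runif(n, min = 0.5, max = 15)
  } else {
    ifelse(runif(n) < 0.75, rgamma(n, shape = 5, rate = 1), rgamma(n, shape = 10, rate = 1))
  }
  list(theta = theta, y = rpois(n, lambda = theta))
}
dat <- simulate_poisson_scenario(5000, scenario = 2)
```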
Figure 3.2 (panels (a)-(d)): Poisson compound decision problem under squared error loss. Risk estimates of the various estimators (KM, NEB, NEB OR, TF Gauss, TF OR) for Scenarios 1 to 4, with estimation of $\boldsymbol{\theta}$ under loss $L^{(0)}_n$.

The performances of these estimators are presented in Figures 3.1 and 3.2, wherein the risk $R^{(k)}_n(\boldsymbol{\theta}, \cdot)$ of each estimator is estimated using 50 Monte Carlo repetitions for varying $n$. Tables 3.1 and 3.2 report the risk ratios $R^{(k)}_n(\boldsymbol{\theta}, \boldsymbol{\delta})/R^{(k)}_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{neb}_{(k)})$ at $n = 5000$ for $k = 1$ and $k = 0$ respectively, where a risk ratio bigger than 1 indicates a smaller estimation risk for the NEB estimator. For BGR, the modified cross validation approach for choosing the bandwidth parameter was extremely slow in our simulations, and we therefore report its risk performance only at $n = 5000$.

Table 3.1: Poisson compound decision problem under scaled squared error loss. Risk ratios $R^{(1)}_n(\boldsymbol{\theta}, \boldsymbol{\delta})/R^{(1)}_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{neb}_{(1)})$ at $n = 5000$ for estimating $\boldsymbol{\theta}$.

            Scenario
Method       1      2      3      4
KM         1.10   1.04   1.17   1.37
TF Gauss   1.03   1.01   1.10   1.25
TF OR      1.00   1.02   1.14   1.22
BGR        1.22   1.07   1.16   1.37
NEB        1.00   1.00   1.00   1.00
NEB OR     1.00   1.00   0.90   1.00

Table 3.2: Poisson compound decision problem under squared error loss. Risk ratios $R^{(0)}_n(\boldsymbol{\theta}, \boldsymbol{\delta})/R^{(0)}_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{neb}_{(0)})$ at $n = 5000$ for estimating $\boldsymbol{\theta}$.

            Scenario
Method       1      2      3      4
KM         1.00   1.00   1.59   1.21
TF Gauss   1.00   1.01   1.51   1.08
TF OR      1.07   1.12   1.66   1.12
BGR        1.01   1.01   1.55   1.15
NEB        1.00   1.00   1.00   1.00
NEB OR     1.00   1.00   0.90   1.00

From Figures 3.1 and 3.2 and Tables 3.1 and 3.2, we note that the NEB estimator demonstrates an overall competitive risk performance. In particular, when estimation is conducted under the loss $L^{(1)}_n$, the risk ratios of the competing estimators in Table 3.1 reflect a relatively better performance of the NEB estimator, which is not surprising given that KM, TF Gauss and TF OR are designed to estimate $\boldsymbol{\theta}$ under the loss $L^{(0)}_n$. We note that TF Gauss is highly competitive against KM (Koenker and Mizera 2014); this observation was also reported in Brown et al. (2013). Of particular interest are Scenarios 3 and 4, which reflect the relative performance of these estimators under departures from the Poisson model. The NEB estimator has a significantly better risk performance in these settings across both types of losses.

3.4.3 Simulations: Binomial Distribution

In this section we generate $Y_i \mid q_i \overset{ind.}{\sim} \text{Bin}(m_i, q_i)$ for $i = 1, \ldots, n$ and vary $n$ from 500 to 5000 in increments of 500. We consider four different scenarios for simulating $\theta_i = q_i/(1 - q_i)$, and for each scenario we consider the following competing estimators:

the proposed estimator, denoted NEB;
the oracle NEB estimator $\boldsymbol{\delta}^{or} := \boldsymbol{\delta}^{neb}(\lambda^{orc})$, denoted NEB OR;
the estimator of Binomial means based on Koenker and Gu (2017), denoted KM;
Tweedie's formula for Binomial log odds based on Efron (2011) and Fu et al.
(2018), denoted TF OR;
Tweedie's formula for the Normal means problem based on transformed data and the convex optimization approach in Koenker and Mizera (2014), denoted TF Gauss.

For TF OR, analogous to the Poisson case, we continue to use the oracle loss estimate $h^{orc}$ as the choice of the bandwidth parameter. Since the TF Gauss methodology is only applicable to the Normal means problem, it uses a variance stabilization transformation on $Y_i$ to get $Z_i = \arcsin\sqrt{(Y_i + 0.25)/(m_i + 0.5)}$. The $Z_i$ are then treated as approximately Normal random variables with means $\zeta_i$ and variances $(4 m_i)^{-1}$, and estimates of the means $\zeta_i$ are obtained using the NPMLE approach of Koenker and Mizera (2014). Finally, $q_i$ is estimated as $\{\sin(\hat\zeta_i)\}^2$. We note that, unlike in the Poisson case discussed earlier, the competitors to our NEB estimator do not directly estimate the odds. For instance, under a squared error loss both KM and TF Gauss estimate the success probabilities $\boldsymbol{q}$, while TF OR estimates $\log\boldsymbol{\theta}$. Nevertheless, in this simulation experiment we assess the performance of these estimators for estimating the odds under both the squared error loss and its scaled version. The following settings are considered in our simulation:

Scenario 1: We generate $q_i \overset{i.i.d.}{\sim} 0.4\,\delta_{\{0.5\}} + 0.6\,\text{Beta}(2, 5)$ and fix $m_i = 5$ for $i = 1, \ldots, n$.

Scenario 2: We let $\theta_i \overset{i.i.d.}{\sim} 0.8\,\delta_{\{0.5\}} + 0.2\,\text{Gamma}(1, 2)$ and fix $m_i = 10$ for $i = 1, \ldots, n$. In this scenario the odds $\theta_i$ arise from a mixture model that has an 80% point mass at 0.5.

Scenario 3: We let $\theta_i \overset{i.i.d.}{\sim} \chi^2_2$ and fix $m_i = 5$ for $i = 1, \ldots, n$. This scenario is similar to Scenario 2, with the odds $\theta_i$ now arising from a Chi-square distribution with 2 degrees of freedom.

Scenario 4: We generate $q_i \overset{i.i.d.}{\sim} 0.5\,\text{Beta}(1, 1) + 0.5\,\text{Beta}(1, 3)$ and fix $m_i = 10$ for $i = 1, \ldots, n$.

The simulation results are presented in Figures 3.3 and 3.4, wherein the risks of the various estimators are calculated by averaging over 50 Monte Carlo repetitions for varying $n$. Tables 3.3 and 3.4 report the risk ratios $R^{(k)}_n(\boldsymbol{\theta}, \boldsymbol{\delta})/R^{(k)}_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{neb}_{(k)})$ at $n = 5000$ for $k = 1$ and $k = 0$ respectively, where a risk ratio bigger than 1 demonstrates a smaller estimation risk for the NEB estimator.

Figure 3.3 (panels (a)-(d)): Binomial compound decision problem under scaled squared error loss. Risk estimates of the estimators (KM, NEB, NEB OR, TF Gauss) for Scenarios 1 to 4, estimating the odds $\boldsymbol{\theta}$ under loss $L^{(1)}_n$.

We can see from the simulation results that the NEB estimator demonstrates an overall superior risk performance relative to its competitors. In particular, when estimation is conducted under the loss $L^{(1)}_n$, the risk ratios of the competing estimators in Table 3.3 reflect a significantly better performance of the NEB estimator. This is not surprising because KM, TF Gauss and TF OR are designed to estimate $\boldsymbol{q}$ under the loss $L^{(0)}_n$.
This also explains the relatively improved performance of these estimators, as seen through their risk ratios in Table 3.4, where estimation is conducted under the usual squared error loss $L^{(0)}_n$.

Figure 3.4 (panels (a)-(d)): Binomial compound decision problem under squared error loss. Risk estimates of the estimators (KM, NEB, NEB OR, TF Gauss) for Scenarios 1 to 4, estimating the odds $\boldsymbol{\theta}$ under loss $L^{(0)}_n$.

Table 3.3: Binomial compound decision problem under scaled squared error loss. Risk ratios $R^{(1)}_n(\boldsymbol{\theta}, \boldsymbol{\delta})/R^{(1)}_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{neb}_{(1)})$ at $n = 5000$ for estimating $\boldsymbol{\theta}$.

            Scenario
Method       1      2      3      4
KM         1.22   1.38   1.26   1.03
TF Gauss   1.23   1.51   1.34   1.08
TF OR      >10    >10    >10    >10
NEB        1.00   1.00   1.00   1.00
NEB OR     1.00   1.00   1.00   1.00

Table 3.4: Binomial compound decision problem under the usual squared error loss. Risk ratios $R^{(0)}_n(\boldsymbol{\theta}, \boldsymbol{\delta})/R^{(0)}_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{neb}_{(0)})$ at $n = 5000$ for estimating $\boldsymbol{\theta}$.

            Scenario
Method       1      2      3      4
KM         1.01   1.08   1.08   1.06
TF Gauss   1.06   1.06   1.09   1.17
TF OR      >10    >10    >10    >10
NEB        1.00   1.00   1.00   1.00
NEB OR     1.00   0.99   0.99   1.00

Across the four scenarios, TF OR exhibits the poorest performance and appears to suffer from the fragmented approach of estimating the gradient of the log density, $\nabla_y \log p(y)$, wherein $p(y)$ and its first derivative with respect to $y$ are estimated separately using a Gaussian kernel with common bandwidth $h^{orc}$. The approach of using a variance stabilizing transformation to convert the data to approximate normality renders TF Gauss highly competitive with KM (Koenker and Mizera 2014). A similar phenomenon was also reported in Brown et al. (2013) in the context of the Poisson model. However, under the Binomial model, when the primary goal is to estimate the odds, the risk ratios reported in Tables 3.3 and 3.4 suggest that the proposed NEB estimator is by far the best among these competitors under both types of losses.

3.5. Real data analyses

This section illustrates the proposed method for estimating Juvenile Delinquency rates in Poisson models and news popularity in Binomial models.

3.5.1 Estimation of Juvenile Delinquency rates

In this section we consider an application to the Uniform Crime Reporting Program (UCRP) Database (US Department of Justice and Federal Bureau of Investigation 2014), which holds county-level counts of arrests and offenses, ranging from robbery to weapons violations, in 2012. The database is maintained by the National Archive of Criminal Justice Data (NACJD) and is one of the most widely used databases for research on factors that affect juvenile delinquency (JD) rates across the United States; see, for example, Aizer and Doyle Jr (2015), Damm and Dustmann (2014) and Koski et al. (2018). A preliminary and important goal in these analyses is to estimate the JD rates based on the observed arrest data and determine the counties that are among the worst or least affected.
However, with almost 3,000 counties being evaluated, the JD rates are susceptible to selection bias, wherein some of the data points land in the extremes merely by chance and traditional estimators may underestimate or overestimate the corresponding means, especially in counties with a small total number of arrests across all age groups.

Table 3.5: Loss ratios of the competing methods for estimating $\boldsymbol{\theta}$ ($n = 2{,}803$).

             Loss ratio
Method      k = 1   k = 0
NEB          1.00    1.00
BGR          1.18    1.03
KM           1.19    1.03
TF Gauss     1.12    1.01
TF OR        1.11    1.05

For the purpose of our analyses, we use the 2012 UCRP data that span $n = 3{,}178$ counties in the U.S. and consider estimating the mean JD rates as a vector of Poisson means. The observed data for county $i$ is denoted $y_i$, which represents the number of juvenile arrests expressed as a percentage of total arrests in that county in the year 2012. We assume that $Y_i \mid \theta_i \overset{ind.}{\sim} \text{Poi}(\theta_i)$ for $i = 1, \ldots, n$. Figure 3.5 plots the observed data for the top 500 and the bottom 500 counties that have at least 1 juvenile arrest. Campbell county in South Dakota, followed by Fulton county in New York, exhibits the highest observed JD rate in 2012. As discussed in chapter 3.4.2, we consider the following five estimators of $\boldsymbol{\theta}$: NEB, BGR (Brown et al. 2013), KM (Koenker and Gu 2017), TF OR (Efron 2011) and TF Gauss (Koenker and Mizera 2014, Brown et al. 2013).

Figure 3.5: Observed Juvenile Delinquency rates in 2012. The top 500 and bottom 500 counties are plotted. The data on Florida arrests are not available in the US Department of Justice and Federal Bureau of Investigation (2014) database.

Figure 3.6: Estimated Juvenile Delinquency rates of the 1000 counties exhibited in Figure 3.5. Left: estimation under squared error loss ($k = 0$). Right: estimation under scaled squared error loss ($k = 1$).

We use the 2014 UCRP data (US Department of Justice and Federal Bureau of Investigation 2017) to compare the estimation accuracies of these methods under both the $L^{(0)}_n$ and $L^{(1)}_n$ losses. The data were cleaned prior to any analyses to ensure that all counties in the year 2012 had at least one arrest (juvenile or not). This resulted in $n = 2{,}803$ counties to which all methods are applied.

In Figure 3.6 we visualize the shrinkage estimates of the JD rates for the 1000 counties considered in Figure 3.5. The left plot presents the estimates under the squared error loss, while the right plot presents the results under the scaled squared error loss. Notably, the scaled error loss exhibits a larger magnitude of shrinkage for the bigger observations than the squared error loss. Table 3.5 reports the loss ratios $L^{(k)}_n(\boldsymbol{\theta}, \boldsymbol{\delta})/L^{(k)}_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{neb}_{(k)})$, where for any estimator $\boldsymbol{\delta}$ of $\boldsymbol{\theta}$ a ratio bigger than 1 indicates a smaller estimation loss for $\boldsymbol{\delta}^{neb}_{(k)}$. We can see that for estimating $\boldsymbol{\theta}$, all four competitors exhibit loss ratios bigger than 1 under the scaled squared error loss ($k = 1$). This is not surprising since these competitors are designed to estimate $\boldsymbol{\theta}$ under the regular squared error loss ($k = 0$). Interestingly, even under the regular loss, the NEB estimator continues to provide better estimation accuracy than TF OR, BGR and KM, and demonstrates a competitive performance against TF Gauss.

3.5.2 News popularity in social media platforms

Journalists and editors often face the critical task of assessing the popularity of various news items and determining which articles are likely to become popular, so that existing content generation resources can be efficiently managed and optimally allocated to the avenues with maximum potential.
Due to the dynamic nature of news articles, popularity is usually measured by how quickly an article propagates (frequency) and by the number of readers that the article can reach (severity) through social media platforms such as Twitter, YouTube, Facebook and LinkedIn. Predicting these two aspects of popularity from early trends is therefore extremely valuable to journalists and content generators (Bandari et al. 2012).

Table 3.6: Loss ratios of the competing methods for estimating $\boldsymbol{\theta}$. News article genre: Economy; social media platform: Facebook ($n = 3{,}972$).

             Loss ratio
Method      k = 1   k = 0
NEB          1.00    1.00
KM          12.25   13.25
TF Gauss     4.06    3.25
TF OR       81.57   36.24

Table 3.7: Loss ratios of the competing methods for estimating $\boldsymbol{\theta}$. News article genre: Microsoft; social media platform: LinkedIn ($n = 3{,}850$).

             Loss ratio
Method      k = 1   k = 0
NEB          1.00    1.00
KM          41.06   59.64
TF Gauss     9.31    7.44
TF OR       36.14   12.76

In this section, we assess the popularity of several news items based on their frequency of propagation, analyzing a dataset from Moniz and Torgo (2018) that holds 48 hours' worth of social media feedback data on a large collection of news articles from the time of their first publication. For the purposes of our analysis, we consider two popular genres of news from this data set, Economy and Microsoft, and examine how frequently these articles were shared on Facebook and LinkedIn, respectively, over a period of 48 hours from the time of their first publication. Each news article in the data has a unique identifier and 16 consecutive time intervals, each of length 180 minutes, which record whether the article was shared at least once in that time interval. Let $Z_{ij} = 1$ if article $i$ was shared in time interval $j$ and $0$ otherwise, where $i = 1, \ldots, n$ and $j = 1, \ldots, 16$. Let $q_{ij} \in [0, 1]$ denote the probability that news article $i$ is shared in interval $j$. We let $q_{ij} = q_i$ for all $j = 1, \ldots, 16$ and assume that, for each $i$, the $Z_{ij}$ are independent realizations from $\text{Ber}(q_i)$. It follows that $Y_i = \sum_{j=1}^{8} Z_{ij} \overset{ind.}{\sim} \text{Bin}(8, q_i)$.

To assess the popularity of article $i$, we estimate its odds of sharing, given by $\theta_i = q_i/(1 - q_i)$, and consider the following four estimators of $\boldsymbol{\theta}$: NEB, KM (Koenker and Gu 2017), TF Gauss (Koenker and Mizera 2014) and TF OR (Efron 2011, Fu et al. 2018). We use the data on time points $j = 9, \ldots, 16$ to compare the estimation accuracies of these estimators under both the $L^{(0)}_n$ and $L^{(1)}_n$ losses. Tables 3.6 and 3.7 report the loss ratios $L^{(k)}_n(\boldsymbol{\theta}, \boldsymbol{\delta})/L^{(k)}_n(\boldsymbol{\theta}, \boldsymbol{\delta}^{neb}_{(k)})$, where for any estimator $\boldsymbol{\delta}$ of $\boldsymbol{\theta}$ a ratio bigger than 1 indicates a smaller estimation loss for $\boldsymbol{\delta}^{neb}_{(k)}$. We observe that the three competitors to the NEB estimator exhibit loss ratios substantially bigger than 1 under both losses. This is not surprising since these competitors are designed to estimate $\boldsymbol{q}$ and $\log\boldsymbol{\theta}$ under a squared error loss ($k = 0$). However, when the primary goal is to estimate the odds, the proposed NEB estimator is by far the best among these competitors under both losses.

Chapter 4

Improved Shrinkage Prediction under a Spiked Covariance Structure

4.1. Introduction

In every branch of big-data analytics, it is now commonplace to use notions of shrinkage for the construction of robust algorithms and predictors. Over the last decade, driven by applications in a wide range of scientific problems, the traditional roles of statistical shrinkage have rapidly evolved as new perspectives have been introduced to address and exploit complex, latent structural properties of modern datasets.
Incorporating such structural properties vastly improves predictive efficiency. Traditional shrinkage estimators in high- dimensional location models (see Efron (2012), Fourdrinier, Strawderman, and Wells (2017), Zhang (2003), Robbins (1985), Greenshtein and Ritov (2009), Greenshtein and Park (2009), Koenker and Mizera (2014), Dicker and Zhao (2016), Efron and Hastie (2016), Brown and Greenshtein (2009) and the references therein) were developed based on homoscedastic models using notions of spherical symmetry. Recent results of Xie, Kou, and Brown (2012, 2016), Weinstein, Ma, Brown, and Zhang (2018), Tan (2015), Brown, Mukherjee, and Weinstein (2018) have brought to light new shrinkage phenomena in heteroscedastic models. How- ever, these results are based on multivariate set-ups with known covariances. For prediction in location models with dependence, it is difficult to assimilate optimal structural properties of covariances using these shrinkage rules. In a host of modern applications in biology, economics, finance, health-care and supply chain management which are briefly narrated below, we need simultaneous predictions of several dependent variables where including domain specific regularization on their covariances is beneficial. 75 1. In portfolio selection, the vector of next period excess returns on investable assets form a critical component in determining the optimal portfolio weights (Karoui et al. 2011). Prediction programs of different flavors are employed to estimate the future returns with several popular approaches using factor covariance models (Fan, Fan, and Lv 2008, Johnstone and Titterington 2009) to capture the dependence among asset returns (Kozak et al. 2017). 2. In cell-biology, the problems of predicting the expressions of several genes leads to inference in a high-dimensional location model (Cavrois et al. 2017, Sen et al. 2014). Effective statistical methods usually integrate the dependence structure of gene expressions while conducting inference on such high-dimensional location parameters (Sun and Cai 2009). 3. In health-care management, simultaneous prediction of several inventories or resources is very impor- tant for optimal operations. For instance, general operations of a health-care provider need simulta- neous prediction of the number of nurses that it will be needing in its different hospitals (Green et al. 2013). The loss function for the health-care provider is agglomerative across its different hospitals. Another interesting caveat here is that the loss functions here are asymmetric as a hospital would incur an underage cost if too many patients arrive and expensive agency nurses have to be called, and an overage cost if too many regular nurses are scheduled compared to the number of patients. In this chapter, we study shrinkage prediction under such loss functions. In Mukherjee et al. (2015) it was seen that for such compound decision theoretic problems in uncorrelated models, empirical Bayes induced shrinkage can provide better performance than simple coordinate-wise rules. Incorporating the dependence structure among the patient arrivals in different hospitals would improve shrinkage rules developed for uncorrelated models. 4. A topic of vibrant current research in supply chain management (Rudin and Vahn 2014, Levi et al. 
2015) is the inventory optimization problem of distributors and retailers who, based on past sales 76 data, need to predict future demands and balance the trade-offs between stocking too much and incur- ring high depreciation costs on unsold inventory versus stocking too little and suffering tremendous reputation and lost sales costs. Here, we study the optimal stocking problem by analyzing grocery sales data across several retail outlets in USA. For any distributor forecasting the future sales across so many outlets translates to a high-dimensional demand prediction problem where incorporating co-dependencies in the demands among different stores is potentially useful. Often, the data in these highly multivariate applications have approximate low-dimensional representations that can be described through a factor model. Thus, the variability in such data can be well represented through a spiked covariance model (see for example Onatski, Moreira, and Hallin (2014), Kritchman and Nadler (2009), Fan, Liao, and Mincheva (2013), El Karoui (2008), Dobriban, Leeb, Singer, et al. (2020), Ma (2013), Cai, Ma, and Wu (2013) and the references therein). For constructing efficient predictors, it is important to leverage the presence of such covariance structures. However, the well-studied high- dimensionality effects on the eigenvectors and eigenvalues of the sample covariance matrix (Johnstone and Titterington 2009, Onatski 2015, Johnstone and Paul 2018, Paul 2007) also suggest the need for appropriate regularization even while making use of the spiked covariance structure. In this chapter, we propose CASP - a Coordinate-wise Adaptive Shrinkage Prediction rule for shrinkage prediction in high-dimensional Gaussian models with unknown location as well as unknown spiked structured covariances. Motivated by contemporary applications, we consider the set up where we observe only a few observations from the model, but also assume having some auxiliary information on the covariance. We provide a rigorous framework for constructing such auxiliary information based onlagged (see chapter 4.2.3) dis-aggregate level data which often arises in research problems based on industrial datasets. 77 To facilitate a potent and robust notion of shrinkage in such Gaussian models, we consider a hierarchical set-up based on non-exchangeable priors for the mean vector. Our proposed prior structure involves ashape and ascale hyper-parameter, as well as the unknown population covariance, and is designed to describe the structure of the mean vector in terms of its representation in the spectral coordinates of the population covariance. The shape hyper-parameter regulates the contributions of low-variance principal components and produces a wide class of priors ranging from complete independence to highly dependent scenarios. For a range of commonly used loss functions, the Bayes predictors in these hierarchical models involve quadratic forms involving smooth functions of the population covariance. In practice, one needs to estimate these quantities based on available information, which constitutes a core challenge of this formulation. We work under the framework where the auxiliary information allows us to construct an estimator of the unknown population covariance that has degrees of freedom comparable to the dimensionality of the observations. 
Then, we make use of results on the behavior of eigenvalues and eigenvectors of high-dimensional sample covariance matrices (Paul 2007, Onatski 2012, Baik and Silverstein 2006) to develop a bias-correction principle that leads to an efficient approach for evaluating the Bayes predictors. Thereafter, by introducing a novel coordinate-wise shrinkage term, we provide additional improvements on the performance of these Bayes predictors. Our proposed CASP methodology systematically assimilates these key features. In addition, we provide a detailed analysis of the operational characteristics of the proposed CASP procedure for both aggregated and dis-aggregated forecasting problems. Our analysis demonstrates that, even for linearly aggregated prediction problems, the substitution-based CASP procedure is still asymptotically optimal as long as the dimension of the aggregation subspace is suitably small compared to the dimension of the observations.

The chapter is organized as follows. In chapter 4.2, we describe our predictive set-up. In chapter 4.3, our proposed methodology CASP and its asymptotic properties are presented. In chapter 4.4, we analyze the performance of CASP in aggregated predictive models. Numerical performances of our methods are investigated using both simulated and real data in chapters 4.5 and 4.6 respectively. Proofs and additional technical details are relegated to Appendix C.

4.2. Predictive setup: aggregative predictors and side-information

We first introduce the statistical prediction analysis framework of Aitchison and Dunsmore (1976) and Geisser (1993). Then we give an overview of the high-dimensional orthogonal prediction setup (George et al. 2006, George and Xu 2008, Mukherjee et al. 2015), and thereafter introduce aggregative prediction objectives based on dis-aggregative orthogonal predictive models.

4.2.1 Predictive model

Consider an $n$-dimensional Gaussian location model where the observed past $X = (X_1, \ldots, X_n)$ as well as the future observation $Y = (Y_1, \ldots, Y_n)$ are distributed according to a normal distribution with an unknown mean and unknown covariance proportional to $\Sigma$. The past and the future are related only through the unknown parameters $\mu$ and $\Sigma$, conditioned on which they are independent. The orthogonal predictive model is:
\[
\text{Past observations } X \sim N_n(\mu, \Sigma), \qquad \text{and Future } Y \sim N_n(\mu,\, m_0^{-1}\Sigma), \tag{4.1}
\]
where $\Sigma$ is an unknown $n \times n$ positive definite matrix, $\mu$ is an unknown $n \times 1$ vector and $m_0 > 0$ is a known constant (typically $m_0 = 1$). Based on observing $x$, the goal at the dis-aggregative level is to predict $Y$ by $\hat q(x)$ under a loss that is agglomerative across the $n$ dimensions.

4.2.2 Linearly aggregated predictors

In a host of modern applications, we are interested in predicting several linearly aggregated components from model (4.1). The predictant here is $V = AY$, where the transformation matrix $A \in \mathbb{R}^{p \times n}$ is observed, with $p \le n$ and full rank. Instead of prediction at the dis-aggregate level, the goal is to formulate $\hat q = \{\hat q_i(X) : 1 \le i \le p\}$ based on the dis-aggregative past data $X$ such that $\hat q$ optimally forecasts $V$. The loss function is cumulative across the $p$ components of $V$.

Aggregative prediction problems of this flavor arise in several applications. For example, in portfolio selection (Pástor 2000), $A$ might represent the $p \times n$ portfolio weight matrix of $p$ investors and $Y$ is the next-period excess return vector on the $n$ assets.
Similarly, as discussed in chapter 4.6, in supply chain management distributors may need to forecast the future sales of their products across a large number of retail outlets spread over various locations or states. Often, high inter-state transfer costs prevent the distributors from delivering their products to these retail outlets from a central warehouse. Instead, the products are typically sourced at regional or state warehouses, which then distribute them to the retail outlets in the nearby region. In this demand forecasting setup, the matrix $A$ might represent the $p \times n$ aggregation matrix that aggregates the demand for each product across the $n$ retail outlets into $p$ states, and $Y$ is the $n$-dimensional future demand vector at the retail outlets. Such problems, where the target distribution is different from that of the past observations, are more challenging than dis-aggregate level prediction (Komaki 2015, Yano and Komaki, George and Xu 2008). Naturally, in this set-up when $p = n$ and $A = I_n$, we revert to prediction at the dis-aggregate level.

Model (4.1) is a natural extension of the uncorrelated heteroscedastic model of ?. For correlated and known $\Sigma$, shrinkage estimation in (4.1) under quadratic loss has been studied in Kong et al. (2017). However, the parameters of interest studied in multi-level models are correlated in most practical situations. The correlation structure is usually unknown and requires estimation. Here, we consider shrinkage prediction in (4.1) when $\Sigma$ is estimated from observations $W_j = (W_{1j}, \ldots, W_{nj})^T$, where the $W_j \mid \xi_j$ are independently distributed as $N_n(\xi_j, \Sigma)$ for $j = 1, \ldots, m$. Given the parameters $\mu$, $\Sigma$ and $\boldsymbol{\xi} := \{\xi_j : 1 \le j \le m\}$ in our predictive set-up, $X$, $W := [W_1, \ldots, W_m]$ and $Y$ are independent. In many real-world applications the $\xi_j$'s are very different from $\mu$, in which case $W$ provides information about $\Sigma$ but not much about $\mu$. For instance, there may have been a drift in the data generation process over time without affecting the correlation structure. In such rapidly trend-changing environments (Patton and Timmermann 2007, Kozak et al. 2017, Harvey et al. 2016), there often exist related instruments that can be used to estimate the covariances but not the mean. In the asset pricing context, for estimating the joint explanatory power of a large number of cross-sectional stock return predictors, Kozak et al. (2017) suggest using the daily returns to estimate covariances while conducting shrinkage to regulate the uncertainty about the means. Most datasets available for research from industry contain dis-aggregate data from the lagged past, as data from the immediate past could potentially reveal current operational strategies. Prediction in such problems involves the high-dimensional correlated location model (4.1), where only one or very few immediate past observation vectors are available (in which case they can be summarized into $X$ by taking their average). In the following section, we show that for prediction based on such high-dimensional lagged datasets, we can construct the side information $W$.

4.2.3 Construction of suitable side-information regarding covariance

Consider observing $W_t$ for $t = t_0 + 1, \ldots, t_0 + m$ time periods from the drift-changing model
\[
W_t = \mu_t + \epsilon_t, \tag{4.2}
\]
where $\epsilon_t \overset{i.i.d.}{\sim} N_n(0, \Sigma)$. Let $X$ be a vector from this model from the most recent time period, so that $X = \mu_{t_c} + \epsilon_{t_c}$ with $\mu_{t_c}$ much different from $\{\mu_t : t = t_0 + 1, \ldots, t_0 + m\}$, as the time lag $t_c - t_0 - m$ is huge. We are interested in predicting the future vector $Y$ from the next time period $t_c + 1$, where $Y = \mu_{t_c + 1} + \epsilon_{t_c + 1}$.
Compared to the variability produced by $\Sigma$, the difference $\|\mu_{t_c} - \mu_{t_c + 1}\|$ can be ignored. Let $\mu = \mu_{t_c}$; since $\epsilon_t \overset{i.i.d.}{\sim} N_n(0, \Sigma)$ for $t \in \{t_0 + 1, \ldots, t_0 + m\}$, both $X$ and $Y$ can be regarded as coming from model (4.1). Note that, as $X$ consists of only one past vector while $m$ is large, it does not benefit us much to use $X$ for estimating $\Sigma$. Throughout this chapter, we estimate functionals involving $\Sigma$ based only on $W$, whereas our predictors involve $X$ as it contains pivotal information about the current drift. Given the parameters $(\mu_1, \ldots, \mu_m, \mu, \Sigma)$, the quantities $X$, $W := [W_{t_0 + 1}, \ldots, W_{t_0 + m}]$ and $Y$ are independent, and the goal is to predict $V = AY$ by using the information in $W$ and $X$. It is observed in Ke et al. (2014), Cai et al. (2018+), Binkiewicz et al. (2014) and Banerjee et al. (2019) that methods which successfully leverage associated auxiliary information vastly enhance estimation accuracy. We next describe an efficient methodology to extract suitable information about $\Sigma$ from $W$, which we treat as side information on $\Sigma$, while treating $X$ as the primary information on the current model.

Often $\mu_t$ can be well approximated by low-dimensional processes. We use $k$ basis functions to model $\mu_t$ over time and let
\[
W_t = U C_t + \Sigma^{1/2} \epsilon_t, \qquad t = t_0 + 1, \ldots, t_0 + m,
\]
where $C_t$ is a $k \times 1$ vector of basis coefficients, $U \in \mathbb{R}^{n \times k}$ is a matrix of unknown coefficients and $\epsilon_t \overset{i.i.d.}{\sim} N_n(0, I_n)$. In matrix notation, the above can be expressed as
\[
W^T = C U^T + \epsilon\, \Sigma^{1/2},
\]
where $C_{m \times k} = (C_{t_0 + 1}, \cdots, C_{t_0 + m})^T$ and $\epsilon_{m \times n} = (\epsilon_{t_0 + 1}, \cdots, \epsilon_{t_0 + m})^T$. Consider the projection matrix $P_C = C (C^T C)^{-1} C^T$. Then $S = W (I_m - P_C) W^T$ follows an $n$-dimensional Wishart distribution with degrees of freedom $m_k = m - k$ and covariance $\Sigma$. Here $m$ is large and $k$ is much smaller than $m$, and so $S$, suitably scaled, serves as a good approximation to $\Sigma$. Henceforth, without loss of generality, but with a slight abuse of notation that literally replaces $m - k$ by $m$, we assume that $S$ follows $\text{Wishart}_n(m, \Sigma)$.

4.2.4 Spiked covariance structure

We further assume that the unknown covariance $\Sigma$ has a spiked covariance structure (Johnstone and Lu 2012, Baik and Silverstein 2006, Paul and Aue 2014). Such a structure is popularly used to model a signal-plus-noise decomposition of the centered (mean-subtracted) observations, where the signal belongs to a low-dimensional linear subspace and the noise is isotropic (Passemier and Yao 2012, Benaych-Georges and Nadakuditi 2012). Suppose $\Sigma$ in model (4.1) is of the form
\[
\Sigma = \sum_{j=1}^K \ell_j\, p_j p_j^T + \ell_0 \Big(I - \sum_{j=1}^K p_j p_j^T\Big), \tag{4.3}
\]
where $p_1, \ldots, p_K$ are orthonormal, $\ell_1 > \cdots > \ell_K > \ell_0 > 0$, and the number of spikes satisfies $1 \le K \ll n$. Let the spectral decomposition of $S$ be $\sum_{j=1}^{n} \hat\ell_j \hat p_j \hat p_j^T$, where the $\hat p_j$ are orthonormal and $\hat\ell_1 \ge \cdots \ge \hat\ell_n$. A key result in this model is the so-called phase transition of the eigenvalues of $S$ when $n/m \to \gamma > 0$ as $n \to \infty$ and $K$ is fixed (Baik and Silverstein 2006, Paul 2007, Onatski 2012, Benaych-Georges and Nadakuditi 2012). There is also a corresponding phase transition result for the eigenvectors of $S$ (Paul 2007, Onatski 2012). Based on these results, for significant spikes, the principal eigenvectors and eigenvalues can be consistently estimated if $K$ is well estimated. This has been the approach taken by Kritchman and Nadler (2008, 2009), Passemier and Yao (2012) and, more recently, Passemier et al. (2017a).
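The phase-transition phenomenon referred to above is easy to reproduce numerically. The following minimal R sketch builds a spiked covariance matrix of the form (4.3), draws Gaussian data with that covariance, and compares the leading eigenvalues of the scaled matrix $S$ with the population spikes; the particular values of $n$, $m$, $K$ and the spikes are illustrative only.

```r
set.seed(1)
n <- 200; m <- 400; K <- 3
ell <- c(25, 16, 9); ell0 <- 1
P <- qr.Q(qr(matrix(rnorm(n * K), n, K)))                  # orthonormal spike directions
Sigma <- P %*% diag(ell - ell0) %*% t(P) + ell0 * diag(n)  # spiked covariance as in (4.3)
W <- t(chol(Sigma)) %*% matrix(rnorm(n * m), n, m)         # n x m draws with covariance Sigma
S <- W %*% t(W) / m                                        # scaled Wishart_n(m, Sigma)
round(eigen(S, symmetric = TRUE, only.values = TRUE)$values[1:K], 1)
# The leading sample eigenvalues overshoot ell = (25, 16, 9) when n/m is not negligible,
# which is the inflation that the bias-correction of chapter 4.3.1 adjusts for.
```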
Our proposed predictive rule uses a similar strategy to deal with spiked structures and conducts uniform estimation of quadratic forms involving smooth functions of $\Sigma$ by appropriately adjusting the sample eigenvalues and eigenvectors (see chapter 4.3.1). Specifically, we make use of known results on the behavior of sample eigenvalues and eigenvectors to develop a simple substitution principle that leads to consistent estimators of linear functionals of the population eigenvectors and of quadratic forms involving smooth functions of the population covariance matrix.

4.2.5 Hierarchical modeling with non-exchangeable priors

We consider prediction in correlated hierarchical models, which results in non-exchangeability of the corresponding coordinate problems. Hierarchical modeling provides an effective tool for combining information and achieving partial pooling of inference (Xie et al. 2012, Kou and Yang 2015). Though the hierarchical framework allows the use of exchangeable as well as non-exchangeable priors, traditionally shrinkage algorithms in this framework have been developed under exchangeable priors (Fourdrinier et al. 2017, Zhang 2003). However, in many contemporary applications involving correlated Gaussian models, we need non-exchangeable priors to suitably incorporate auxiliary information regarding covariances into the hierarchical framework (Pástor and Stambaugh 2000, Pástor 2000, Harvey et al. 2016). Here, we impose a class of commutative conjugate priors with power decay on the location parameter, which is related to the unknown covariance through the hyper-parameters $\lambda$ and $\alpha$:
\[
(\mu \mid \eta, \lambda, \alpha, \Sigma) \sim N_n(\eta,\, \lambda\, \Sigma^{\alpha}). \tag{4.4}
\]
The shape parameter $\alpha$ is key to controlling the essential characteristics of the posterior density of $\mu$ under model (4.1). As $\alpha$ varies in $[0, \infty)$, it produces a large family of priors capable of reflecting perfect independence to highly dependent scenarios. When $\alpha = 0$, the exchangeable prior on the locations resembles the set-up of Xie et al. (2012) with known diagonal covariance. With $\alpha = 1$ the prior has the same correlation structure as the data, whereas with $\alpha > 1$ the prior is relatively less concentrated towards the dominant variability directions in the data. In the finance literature, this family of priors is widely used in asset pricing for formulating varied economically motivated priors that induce shrinkage estimation of market factors (Kozak et al. 2017). While $\alpha = 0$ corresponds to the diffuse prior in Harvey et al. (2016), $\alpha = 1$ gives the asset pricing prior in Pástor and Stambaugh (2000) and Pástor (2000), and $\alpha = 2$ yields the prior proposed in Kozak et al. (2017) that shrinks the contributions of low-variance principal components of the candidate factors.

The scale parameter $\lambda$ is allowed to vary between $0$ and $\infty$. The location parameter $\eta$ is usually restricted to some pre-specified low-dimensional subspace. For instance, we can restrict $\eta$ to a one-dimensional subspace by taking $\eta = \eta\,\mathbf{1}$ and estimating $\eta$ as a hyper-parameter. For simplicity, we consider $\eta$ to be set to a pre-specified value $\eta_0$. For the lagged data model of equation (4.2), we can set $\eta_0$ to $W C (C^T C)^{-1} C_{new}$ or to the grand mean across coordinates $n^{-1} \mathbf{1}^T W C (C^T C)^{-1} C_{new}$, where $C_{new}$ is a known vector of basis coefficients. Our goal is to construct predictive rules for the aggregated predictants $V$ under different popular loss functions in the hierarchical model governed by (4.1) and (4.4).
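The role of the shape parameter can be visualized by sampling directly from (4.4). The minimal R sketch below assumes the prior covariance takes the power-decay form $\lambda\,\Sigma^{\alpha}$ written above and computes the matrix power through the eigendecomposition of $\Sigma$; the inputs eta0, Sigma, alpha and lambda are placeholders.

```r
# Draw mu from the power-decay prior (4.4) with prior covariance lambda * Sigma^alpha.
draw_prior_mean <- function(eta0, Sigma, alpha, lambda) {
  eig <- eigen(Sigma, symmetric = TRUE)
  Sigma_alpha <- eig$vectors %*% diag(eig$values^alpha) %*% t(eig$vectors)
  R <- chol(lambda * Sigma_alpha)        # upper triangular with t(R) %*% R = lambda * Sigma^alpha
  as.vector(eta0 + t(R) %*% rnorm(length(eta0)))
}
# alpha = 0 reproduces an exchangeable N(eta0, lambda * I) prior, while larger alpha places
# more prior variability along the leading principal components of Sigma.
```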
4.2.6 Popular loss functions and Bayes predictors In this article, we consider popular loss functions that routinely arise in applications - quadratic loss,` 1 loss, generalized absolute loss and Linex loss. While the quadratic loss is the most widely studied loss function in Statistics, the utility and necessity of asymmetric losses, like the generalized absolute loss and Linex loss, has long been acknowledged, for instance in the early works of Zellner and Geisel (1968), Granger (1969), Koenker and Bassett Jr (1978), Zellner (1986). In what follows, we briefly discuss the two aforementioned asymmetric losses and then present the Bayes predictors for the loss functions considered in this chapter. Generalized absolute loss function - also referred to as the check loss (see Chapter 11.2.3 of Press (2009)), is a piecewise linear loss function with two linear segments and uses differential linear weights to measure the amount of underestimation or overestimation. It is the simplest as well as the most popular asymmetric loss function and is fundamental in quantile regression (Koenker and Bassett Jr 1978). If ^ q i (X) represents the predictive estimate of the futureV i , then under generalized absolute loss thei th coordinate incurs a loss L i (V i ; ^ q i (x)) =b i (V i ^ q i ) + +h i (^ q i V i ) + (4.5) 85 whereb i andh i are known positive costs associated with underestimation and overestimation respectively in coordinatei. In inventory management problems (Mukherjee et al. 2015, Rudin and Vahn 2014, Levi et al. 2015) for example, where overestimation leads to depreciation and storage costs, but underestimation may lead to significant reputation costs for the retailers, the generalized absolute loss function arises naturally withb i h i . Linex loss function - of Varian (1975), on the other hand, uses a combination of linear and exponential functions (and hence its name) to measure errors in the two different directions. The loss associated with coordinatei is L i (V i ; ^ q i (x)) =b i n e a i (^ q i V i ) a i (^ q i V i ) 1 o (4.6) wherea i 6= 0;b i > 0 for alli. This loss function is more appropriate for event analysis such as predicting accident counts or crime rates, underestimations of which result in much graver consequences than overes- timations, however for small values ofjaj, linex loss behaves approximately like a quadratic loss function (Zellner 1986). To facilitate the ease of presentation, we define a few notations first which will be used throughout the chapter. Letl p (V; ^ q) = p 1 P p i=1 L i (V i ; ^ q i ) denote the average loss for predictingV using ^ q which only depends onX andS. For eachX =x andS =s, the associated predictive loss isL p ( ; ^ q) =E V l p (V; ^ q) where = A and the expectation is taken over the distribution of the futureV only. The predictive risk is given byE X;S L p ( ; ^ q) which, by sufficiency, reduces toR p ( ; ^ q) = E AX L p ( ; ^ q) wherein the expectation is taken over the distribution ofAX. Note that the expectation overV is already included in L p . Let ~ b i =b i =(b i +h i ) in the generalized absolute loss function, =A A T and define G r;; :=G r;; (;A) = ( 1 1 + 1 1 ) r 1 86 where the dependence ofG r;; on has been kept implicit for notational ease. 
In particular, whenA =I n thenG r;; =H r;; where H r;; :=H r;; () = ( 1 + 1 ) r : Moreover, whenA is a generalpn rectangular matrix then one may expressG r;; in terms ofH r;; as follows r G r;; = n AH 0;;0 A T h A H 0;;0 +H 0;1;0 A T i 1 AH 0;1;0 A T o r AH 0;1;0 A T (4.7) Our goal is to minimizeR p ( ; ^ q) over the class of estimators ^ q for all values of . An essential intermediate quantity in that direction is the Bayes predictive ruleq Bayes which is the unique minimizer of the integrated Bayes risk B p (;) = R R p ( ; ^ q)( j;;)d and Lemma 1 below provides the univariate Bayes estimatorq Bayes i for the loss functions discussed earlier. Lemma 1 (Univariate Bayes Estimator). Consider the hierarchical model in equations (4.1) and (4.4). If were known, the unique minimizer of the integrated Bayes risk for coordinatei is q Bayes i (AXj;;) = e T i A 0 + e T i G 1;1; A(X 0 ) +F loss i (;A;;) fori = 1;:::;p, where F loss i (;A;;) = 8 > > > > > > > > > < > > > > > > > > > : 1 ( ~ b i ) e T i G 1;0; e i +m 1 0 e T i G 0;1;0 e i 1=2 ; for generalized absolute loss a i 2 e T i G 1;0; e i +m 1 0 e T i G 0;1;0 e i ; for linex loss 0; for quadratic loss: 87 Since is unknown, perhaps the simplest approach will be to plug in the sample covariance matrix S inG r;; that appears in the expression of the Bayes estimator. However, this produces a biased, sub- optimal predictor even whenm is comparable in magnitude ton. In chapter 4.3 we describe an efficient methodology for evaluating the Bayes predictive rules. 4.2.7 Discussion We develop a new algorithm for prediction in high dimensional Gaussian models with unknown spiked covariances. Motivated by modern research problems involving dis-aggregate level lagged data in drift changing applications (described in chapter 4.2.3), our prediction framework involves observing only a few observation vectors from the current model which are summarized in a past vector. We construct useful auxiliary or side information on the unknown covariance from the lagged data. Based on these, we propose a flexible shrinkage methodology CASP that integrates available information on the unknown covariance optimally in determining shrinkage directions. The proposed methodology makes several contributions. To summarize, the key features of CASP are: It can be used for prediction under popular symmetric as well as asymmetric losses such as linex and generalized absolute loss. Asymmetric losses arise in modern health care and supply chain manage- ment problems and shrinkage prediction under them differs in fundamental aspects from shrinkage under symmetric losses. CASP improves upon recent shrinkage algorithms developed in Mukherjee et al. (2015) for asymmetric losses by optimally integrating covariance information. It utilizes the phase transition phenomenon of the sample eigenvalues and eigenvectors seen in spiked covariance models (Paul 2007, Onatski 2012) and improves upon naive factor model based method- ology by using bias corrected efficient estimates of quadratic forms involving unknown covariance matrix (see chapters 4.3.1 and 4.5). 88 It is developed based on an hierarchical framework that encompass a wide family of non-exchangeable priors that are used in real world applications (Harvey et al. 2016, Kozak et al. 2017). The prior family involves a shape hyper-parameter that regulates the contributions of low-variance principal components. 
It makes CASP a very flexible shrinkage method which includes the exchangeable priors set-ups of Xie, Kou, and Brown (2012) as a special case. It uses a novel coordinate-wise shrinkage policy that only relies on the covariance level information and introduces possible reduction in the variability of the proposed rules (see chapter 4.3.2). Robust data driven schemes are employed to adaptively tune the hyper-parameters. We provide a detailed asymptotic analysis on the scope of improvement due to this coordinate-wise shrinkage policy and establish asymptotic optimality of our proposed procedure (see Lemma 3 and Theorem 2A). It can be used for prediction in aggregated models. In a lot of applications, we need to forecast linearly combined measurements by dis-aggregate level data. Aggregative set-ups present several challenges. Unlike dis-aggregative level prediction, here the family of prior in (4.4) does not commute with the unknown covariance. CASP is built on a simple substitution principle that we establish is asymptotically efficient in such aggregative set-ups (see chapter 4.4). Across varied simulation regimes, we witness that the improvement in CASP due to incorporation of bias-correction and coordinate-wise shrinkage policies is not just technical but is essential as CASP vastly out-performs other competing shrinkage methods (see chapter 4.5). 4.3. Proposed methodology and asymptotic properties In this section we describe our proposed methodology for the efficient evaluation of the Bayes predic- tive rules in Lemma 1 (chapter 4.3.1) along with the asymptotic properties of the proposed methodology 89 and thereafter discuss the potential improvement in predictive efficiency brought about by coordinate-wise shrinkage (chapter 4.3.2). 4.3.1 Evaluating Bayes predictors in dis-aggregative models Since is unknown, we need to evaluate the Bayes predictive rules based onX andS only which essen- tially reduces to estimating the quadratic formsb T G r;; b uniformly well for all; whereb are known vectors on then dimensional unit sphereS n1 . In particular, for the dis-aggregative model (A =I n ), esti- mating these quadratic forms involvingG r;; reduces to estimating quadratic forms involvingH r;; which is relatively easier. So, we describe our procedure first for the simpler case of the dis-aggregative model and thereafter present the case of the aggregative model in chapter 4.4. We assume the following asymptotic conditions throughout the chapter: A1 Asymptotic regime : Suppose that n = n m1 !2 (0;1) asn!1. A2 Significant spike: Suppose that` K >` 0 (1 + p ). We next present efficient estimatesf ^ ` e j g K j=0 of the dominating eigenvaluesf` j g K j=0 of . Define (x;) = h 1=(x 1) 2 1 +=(x 1) i 1=2 with j =(` j =` 0 ;) Recall that under assumptions A1 and A2 the leading eigenvectors and eigenvalues ofS have the following properties (Paul 2007) ^ ` j ` j 1 + (` j =` 0 1) =O P (n 1=2 ); j = 1;:::;K; (4.8) 90 and ^ p j j p j + q 1 2 j (I P K P T K ) " j p nK ; j = 1;:::;K; (4.9) where P K = [p 1 : : p K ] and" j N(0;I nK ). We will use these properties to ensure that the quadratic forms of the typeb T H r;; b are consistently estimated. WhenK, the number of significant spikes, is known, we have efficient estimates ^ ` e j of` j forj = 0;:::;K (Passemier et al. 
2017b) that involve bias correction of ^ ` j using the approximation properties of equation (4.8) as follows: Let, ^ ` 0 = (nK) 1 P n j=K+1 ^ ` j and then forj = 1;:::;K, let ^ ` 0 j be the solution of the following equation (forx) ^ ` j = ^ ` 0 (x= ^ ` 0 ; n ) =x 1 + n x= ^ ` 0 1 ! : Then, the estimates off` j g K j=0 aref ^ ` e j g K j=0 where ^ ` e 0 = ^ ` 0 1 + n ^ 0 nK ! (4.10) and forj = 1;:::;K, ^ ` e j = ^ ` e 0 2 ( ^ ` j = ^ ` e 0 + 1 n ) + ( ^ ` j = ^ ` e 0 + 1 n ) 2 4 ^ ` j = ^ ` e 0 1=2 ; (4.11) with ^ 0 =K + P K j=1 ^ ` 0 j = ^ ` 0 1 1 . Now, consider the following as an estimate forH r;; ^ H r;; = K X j=1 1 ^ 2 j (h r;; ( ^ ` e j )h r;; ( ^ ` e 0 ))^ p j ^ p T j +h r;; ( ^ ` e 0 )I = K X j=1 " 1 ^ 2 j h r;; ( ^ ` e j ) + 1 1 ^ 2 j ! h r;; ( ^ ` e 0 ) # ^ p j ^ p T j +h r;; ( ^ ` e 0 ) 0 @ I K X j=1 ^ p j ^ p T j 1 A (4.12) 91 whereh r;; (x) = (x 1 + 1 x ) r x is the scalar version ofH r;: , ^ j =( ^ ` e j = ^ ` e 0 ; n ), ^ ` e 0 ; ^ ` e j are from equations (4.10), (4.11) respectively and ^ p j are from equation (4.9). A key aspect regarding the estimates ^ H r;; in equation (4.12) is that they not only involve asymptotic adjustments to the sample eigenvalues through equations (4.10) and (4.11) but also use the phase transition phenomenon of the sample eigenvectors to appropriately adjust them through ^ j and equation (4.9). The following condition ensures that the results on the behavior of the Bayes predictors and their esti- mated versions remain valid uniformly over a collection of hyper-parameters. A3 2 T 0 and2 B 0 where T 0 and B 0 are compact subsets of (0;1) and [0;1), respectively. Notice that A3 implies in particular that 0 <1 for some 0 > 0. For the dis-aggregative model, Theorem 1A proves the asymptotic consistency ofb T ^ H r;; b uniformly over the hyper-parameters (;) and known vectorsb on then dimensional unit sphereS n1 . Theorem 1A (Asymptotic consistency ofb T ^ H r;; b). Under assumptionsA1, A2, andA3, uniformly over 2 T 0 , 2 B 0 andb2B such thatjBj = O(n c ) for any fixedc > 0 andkbk 2 = 1, we have, for all (r;)2f1; 0; 1gR, sup 2T 0 ;2B 0 ;b2B b T ^ H r;; bb T H r;; b =O p r logn n where the dependence ofH r;; on has been kept implicit for notational ease. An important consequence of Theorem 1A is that it allows us, almost immediately, to construct an efficient evaluation scheme for the Bayes predictive rule in Lemma 1 under the dis-aggregative model as follows: 92 Definition 1 (Predictive rule - dis-aggregative model). Under the hierarchical model of equations (4.1) and (4.4), the proposed predictive rule for the dis-aggregative model is given by ^ q approx which is defined as ^ q approx i (XjS;;) = e T i 0 + e T i ^ H 1;1; (X 0 ) + ^ F loss i (S;;) (4.13) where ^ F loss i (S;;) = 8 > > > > > > > > > < > > > > > > > > > : 1 ( ~ b i ) e T i ^ H 1;0; e i +m 1 0 e T i ^ H 0;1;0 e i 1=2 ; for generalized absolute loss a i 2 e T i ^ H 1;0; e i +m 1 0 e T i ^ H 0;1;0 e i ; for linex loss 0; for quadratic loss. Note that ^ q approx is a simple approximation to the Bayes predictive rules in Lemma 1 where, under the dis-aggregative model, the quadratic formsb T H r;; b are replaced by their consistent estimatesb T ^ H r;; b in equation (4.12). Except the second term in (4.13), all the other estimated quadratic forms are symmetric. 
The asymmetric quadratic formb T H r;; c (whereb,c are unit vectors) can also be written as a difference of two symmetric quadratic forms (1=4)f(b +c) T H r;; (b +c) (bc) T H r;; (bc)g, and The- orem 1A can be directly applied to yield Lemma 2 which provides decision theoretic guarantees on the predictors. It shows that uniformly over (;) the largest coordinate-wise gap between ^ q approx andq Bayes is asymptotically small. Lemma 2. Under assumptionsA1,A2 andA3, uniformly over2 T 0 ,2 B 0 , for all (r;)2f1; 0; 1g R, we have, conditionally onX, sup 2T 0 ;2B 0 ^ q approx (XjS;;)q Bayes (Xj;;) 1 X 0 2 _ 1 =O p r logn n : 93 While ^ q approx is an asymptotically unbiased approximation toq Bayes , the averageL 2 distance between ^ q approx andq Bayes is a non-trivial quantity due to the intrinsic variability inX. In the following sub-section, we introduce our Coordinate wise Adaptive Shrinkage Prediction Rule, CASP, that relies on data driven adaptive shrinkage factors to reduce the marginal variability of ^ q approx for any fixedS, and minimize the squared errors of the predictors fromq Bayes . 4.3.2 Improved predictive efficiency by coordinate-wise shrinkage We continue our discussion with respect to the dis-aggregative model (A =I n ) and first introduce a class of coordinate-wise shrinkage predictive rules that includes ^ q approx as a special case. Definition 2 (Class of coordinate-wise shrinkage predictive rules). Consider a class of coordinate-wise shrinkage predictive rulesQ cs =f^ q cs i (XjS;f i ;;)jf i 2R + ;2 T 0 ;2 B 0 g where ^ q cs i (XjS;f i ;;) = e T i 0 +f i e T i ^ H 1;1; (X 0 ) + ^ F loss i (S;;) with ^ F loss i (S;;) as defined in definition 1 andf i 2R + is a shrinkage factor depending only onS. The classQ cs represents a wider class of predictive rules when compared to the linear functional form of the Bayes rule. In particular, it includes ^ q approx from definition 1 whenf i = 1 for alli. The coordinate- wise shrinkage factorsf i do not depend onX but only onS, and their role lies in reducing the marginal variability of the predictive rule as demonstrated in Lemma 3 below. Lemma 3. Suppose that assumptionsA1,A2 andA3 hold. Under the hierarchical model of equations (4.1) and (4.4), asn!1, 94 (a) E n ^ q cs i (XjS;f i ;;)q Bayes i (Xj;;) 2 o is minimized at f OR i = e T i U()e i e T i ^ H 1;1; J () ^ H 1;1; e i +O p r logn n whereU() :=H 1;1; J ()H 1;1; ,J () := + and the expectation is taken with respect to the marginal distribution ofX withS fixed. (b) For any fixed,, with probability 1, lim sup n!1 max 1in f OR i 1: Moreover, letM =f1in :kP K e i k 2 > 0g, where P K denotes theK-dimensional projection matrix associated with theK spiked eigenvalues of . Then, with j(x) := x +x as the scalar version ofJ (), we have max i2M f OR i max i2M e T i U()e i e T i U()e i +j(` 0 )(h 1;1; (` K )h 1;1; (` 0 )) 2 kP K e i k 2 2 +O P r logn n ! ; so that the leading term on the right hand side is less than 1. (c) Also, for any fixed and, we have with probability 1: lim inf n!1 Ek^ q approx (XjS;;)q Bayes (Xj;;)k 2 2 Ek^ q cs (XjS;f OR ;;)q Bayes (Xj;;)k 2 2 1; where, the expectations are taken with respect to the marginal distribution ofX withS fixed. Lemma 3 is proved in Appendix C.2.5. An interesting point to note about the proof of statement (a) of the lemma is that minimizing the squared error essentially reduces to minimizing the variability of ^ q cs 95 as any member inQ cs has asymptotically negligible bias. 
The optimal variance is attained by the oracle shrinkage factors f OR i which assume knowledge of H 1;1; andJ (). Statement (b) shows that these shrinkage factors lie in [0; 1]. It also shows that some of them are actually quite different from 1. Thus, the resultant coordinate-wise shrunken oracle prediction rule greatly differs from ^ q approx . Indeed, statement (b) shows that if the eigenvectors of are relatively sparse, so that for a small number of coordinates i, the quantitieskP K e i k 2 are positive (and relatively large), then the shrinkage factorf OR i for the corresponding coordinates can be significantly smaller than 1. Statement (c) trivially follows from (b) and guarantees that ^ q cs constructed based on the oracle shrinkage factors f OR i are at least as good as ^ q approx in terms of squared error distance from the trueq Bayes predictor. However, as is unknown,f OR i cannot be computed in practice. Theorem 1A allows us to estimate the oracle shrinkage factors consistently and those estimates form a key ingredient in our proposed predictive rule CASP in definition 3 below. Definition 3 (CASP). The coordinate-wise adaptive shrinkage prediction rule is given by ^ q casp 2Q cs with f i = ^ f prop i where ^ f prop i = e T i ^ H 1;1; e i e T i ^ Re i and ^ R = ^ H 1;1; +j( ^ ` e 0 ) K X j=1 ^ 4 j h 1;1; ( ^ ` e j )h 1;1; ( ^ ` e 0 ) 2 ^ p j ^ p T j withj(x) :=x +x as the scalar version ofJ (). Unlike the numerator, the denominator in f OR i is not linear in H r;; and estimating it with desired precision involves second order terms in ^ R. Lemma 4 below shows that indeed ^ f prop i is a consistent estimate off OR i under our hierarchical model. 96 Lemma 4. Under the hierarchical model of equations (4.1) and (4.4), sup 1in j ^ f prop i f OR i j =O p r logn n : Using Lemmas 3 (a) and 4, Theorem 2A below guarantees the oracle optimality of ^ q casp in the class Q cs in the sense that the shrinkage factors ^ f prop i reduce the squared error between CASP and the Bayes predictive rule as much as the oracle shrinkage factorsf OR i would for any predictive rule in the classQ cs . Theorem 2A (Oracle optimality of CASP). Under assumptionsA1,A2 andA3, and the hierarchical model of equations (4.1) and (4.4), we have, conditionally onX, sup 2T 0 ;2B 0 k^ q casp (XjS; ^ f prop ;;) ^ q cs (XjS;f OR ;;)k 2 2 k^ q cs (XjS;f OR ;;)e T i 0 k 2 2 =O p (logn=n): In the following section, we turn our attention to the aggregative model whereA2R pn withp n andAA T is invertible. The proofs of Lemma 4 and Theorem 2A are provided in Appendix C. 4.4. Proposed predictive rule for aggregative predictions Under the aggregative model, recall that equation (4.7) expresses G r;; in terms of H r;; . To estimate G r;; in this setting, we adopt the substitution principle and construct the following estimates ofG r;; ^ G 0;1;0 = A ^ H 0;1;0 A T ^ G 1;0; = A ^ H 0;;0 A T h A ^ H 0;;0 + ^ H 0;1;0 A T i 1 A ^ H 0;1;0 A T ^ G 1;1; = A ^ H 0;;0 A T h A ^ H 0;;0 + ^ H 0;1;0 A T i 1 97 which appear in the functional form of CASP for aggregative models in definition 4 below. Throughout this section, we assume the following regularity condition on the aggregation matrixA: A4 Aggregation matrix: Supposep = o(n) andA2R pn is such that the matrixAA T is invertible and has uniformly bounded condition number even asp;n!1. Definition 4 (CASP for aggregative models). 
For any fixedA obeying assumption A4, consider a class of coordinate-wise shrinkage predictive rulesQ cs A =f^ q cs i (AXjS;f i ;;)jf i 2R + ;2 T 0 ;2 B 0 g where ^ q cs i (AXjS;f i ;;) = e T i A 0 +f i e T i ^ G 1;1; A(X 0 ) + ^ F loss i (S;A;;) and ^ F loss i (S;A;;) are the estimates ofF loss i (;A;;) as defined in Lemma 1 with G r;; replaced by ^ G r;; andf i 2 R + are shrinkage factors depending only onS andA. The coordinate-wise adaptive shrinkage predictive rule for the aggregative model is given by ^ q casp 2Q cs A withf i = ^ f prop i where ^ f prop i = e T i ^ Ne i e T i ^ De i and ^ N = ^ G 1;1; A ^ H 1;1+; A T ^ G 1;1; ^ D = ^ N +j( ^ ` e 0 ) K X j=1 ^ 4 j h 1;1; ( ^ ` e j )h 1;1; ( ^ ` e 0 ) 2 A^ p j (A^ p j ) T withj(x) :=x +x as the scalar version ofJ (). Next, we establish the analogue of Theorem 1A for this set-up. The proof is much more complicated, as for a generalA, the expression in the posterior covariances loses commutativity in multiplicative operations betweenA and . The resultant is that for quadratic form estimation, we need to be precise in tackling 98 the distortion in the spectrum of the posterior variance due to the presence of the linear aggregation matrix A. We show that the substitution principle, which avoids higher order corrections, is still consistent under p . p n situations that hold in most applications in this setting, outside which our consistency bounds deteriorate due to the cost of inversion paid by the simple substitution rules. Theorem 1B (Asymptotic consistency ofb T ^ G r;; b). Under assumptions A1, A2, A3 and A4, uniformly over2 T 0 ,2 B 0 andb2B such thatB = O(n c ) for any fixedc> 0 andjjbjj 2 = 1, we have for all (r;)2f1; 0; 1gR sup 2T 0 ;2B 0 ;b2B b T ^ G r;; bb T G r;; b =O p n max p n ; r logn n o where the dependence ofG r;; on has been kept implicit for notational ease. Theorem 2B (Oracle optimality of CASP). Under assumptions A1, A2, A3 and A4, and the hierarchical model of equations (4.1) and (4.4), we have, conditionally onX, sup 2T 0 ;2B 0 ^ q casp (AXjS; ^ f prop ;;) ^ q cs (AXjS;f OR ;;) 2 2 ^ q cs (AXjS;f OR ;;)e T i A 0 2 2 =O p n max p n 2 ; logn n o Using Theorem 2B, we show that in the aggregative model too the data driven adaptive shrinkage factors ^ f prop i continue to guarantee the oracle optimality of ^ q casp in the classQ cs A . For the proofs of Theorem 1B and 2B we refer the reader to appendices C.2.2 and C.2.6 respectively. Implementation and R package casp - The R package casp has been developed to implement our proposed predictive rule CASP. It uses the following scheme to estimate the prior hyper-parameters (;) and the number of spikesK: 99 For estimating K we use the procedure described in Kritchman and Nadler (2009) that estimates K through a sequence of hypothesis tests determining at each step whether the k th sample eigenvalue came from a spike. To estimate the prior hyper-parameters (;), we first note that marginally X N n ( 0 ;J ()). LetJ inv () = ( + ) 1 . Our scheme for choosing (;) is based on an empirical Bayes approach wherein we maximize the marginal likelihood ofX with respect to (;) withJ () and J inv () replaced by their estimates ^ J = ^ H 1;1+; and ^ J inv = 1 ^ H 1;1; respectively. 
In particular, an estimate of (λ, τ) is given by

(λ̂, τ̂) = argmax_{λ ∈ T_0, τ ∈ B_0}  { − log det( Ĵ ) − (x − η_0)^T Ĵ_inv (x − η_0) }.   (4.14)

To facilitate implementation, the maximization in equation (4.14) is conducted numerically over a bounded rectangle [λ_lb, λ_ub] × [τ_lb, τ_ub] where, in most practical applications, prior knowledge dictates the lower bounds (λ_lb, τ_lb) and upper bounds (λ_ub, τ_ub) of the above intervals. In the simulations and real data examples of chapters 4.5 and 4.6, we use the above scheme to estimate (λ, τ).

4.5. Simulation studies

In this section we assess the predictive performance of CASP across a wide range of simulation experiments. We consider four competing predictive rules that use different methodologies to estimate Σ and thereafter plug their respective estimates of Σ into the Bayes predictive rule of Lemma 1. In what follows, we briefly discuss these competing methods for estimating Σ:

1. q̂^Bcv - the predictive rule that uses the bi-cross-validation approach of Owen and Wang (2016) which, under a heteroscedastic factor model structure, first estimates the number of factors, then constructs an estimate S_Bcv of Σ, and finally plugs S_Bcv into the Bayes predictive rule of Lemma 1. We use the implementation available in the R package esaBcv for our simulations.

2. q̂^Fact - the predictive rule that uses the FactMLE algorithm of Khamaru and Mazumder (2018) to estimate S_Fact by formulating the low-rank maximum likelihood factor analysis problem as a nonlinear, non-smooth semidefinite optimization problem. The implementation of the FactMLE algorithm is available in the R package FACTMLE, wherein we use an estimate K̂ of K as discussed in chapter 4.4.

3. q̂^Poet - the predictive rule that uses the approach of Fan et al. (2013) to estimate S_Poet by first retaining the first K̂ principal components of S and then applying a thresholding procedure to the remaining sample covariance matrix. The implementation of this approach is available in the R package POET, where K̂ is an estimate of the number of spikes from chapter 4.4.

4. q̂^Naive - the Naive predictive rule which first estimates the number of spikes K̂ from the data, reconstructs the sample covariance matrix S_Naive from the leading K̂ eigenvalues and eigenvectors of S, and finally plugs in S_Naive in place of Σ in the Bayes predictive rule q^Bayes of Lemma 1.

To assess the performance of the various predictive rules, we calculate a relative estimation error (REE) which is defined as

REE(q̂) = { R_n(Σ, q̂) − R_n(Σ, q^Bayes) } / { R_n(Σ, q̂^approx) − R_n(Σ, q^Bayes) },

where q̂ is any prediction rule, q̂^approx is CASP with shrinkage factors f_i = 1 for all i, and q^Bayes is the Bayes predictive rule based on knowledge of the unknown Σ. A value of REE larger than 1 implies poorer prediction performance of q̂ relative to q̂^approx, whereas a value smaller than 1 implies better prediction performance. In particular, REE allows us to quantify the relative advantage of using coordinate-wise adaptive shrinkage in our proposed predictive rule q̂^casp.
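As a concrete reference for the evaluation metrics used in the remainder of this section, the short R sketch below computes the generalized absolute loss (4.5), the Linex loss (4.6) and the REE ratio; in practice the risk quantities R_n(·) entering REE are Monte Carlo averages of these losses over the simulation repetitions.

  gen_abs_loss <- function(v, q, b, h) b * pmax(v - q, 0) + h * pmax(q - v, 0)   # eq. (4.5)
  linex_loss   <- function(v, q, a, b) b * (exp(a * (q - v)) - a * (q - v) - 1)  # eq. (4.6)

  ree <- function(risk_hat, risk_approx, risk_bayes) {
    # REE(q.hat) = {R_n(q.hat) - R_n(q.Bayes)} / {R_n(q.approx) - R_n(q.Bayes)}
    (risk_hat - risk_bayes) / (risk_approx - risk_bayes)
  }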
4.5.1 Experiment 1

In the setup of experiment 1, we consider prediction under the dis-aggregated model and sample from an n = 200 variate Gaussian distribution with mean vector equal to 0 and covariance Σ. We impose a spiked covariance structure on Σ with K = 10 spikes, with ℓ_0 fixed at 1, under the following two scenarios.

Scenario 1: we consider the generalized absolute loss function in equation (4.5) with b_i sampled uniformly between (0.9, 0.95), h_i = 1 − b_i, with (λ, τ) = (0.5, 0.25) and the K spikes equi-spaced between 80 and 20.

Scenario 2: we consider the Linex loss function in equation (4.6) with a_i sampled uniformly between (−2, −1), b_i = 1, with (λ, τ) = (1, 1.75) and the K spikes equi-spaced between 25 and 5.

To estimate S, we sample W_j from N_n(0, Σ) independently for m samples, where we allow m to vary over (15, 20, 25, 30, 35, 40, 45, 50). Finally, m_x = 1 copy of X is sampled from N_n(θ, Σ) with m_0 = 1. This sampling scheme is repeated over 500 repetitions and the REE of the competing predictive rules and CASP is presented in figures 4.1 and 4.2 for scenarios 1 and 2 respectively. In table 4.1, we report the REE and the estimates of (K, λ, τ) at m = 15. Using the R package POET, the estimation of S_Poet was extremely slow in our simulations and therefore we report the REE of q̂^Poet only at m = 15 and exclude this predictive rule from the figures.

Table 4.1: Relative Error estimates (REE) of the competing predictive rules at m = 15 for Scenarios 1 and 2 under Experiment 1. The numbers in parentheses are standard errors over 500 repetitions.

                       Scenario 1: (K, λ, τ) = (10, 0.5, 0.25)             Scenario 2: (K, λ, τ) = (10, 1, 1.75)
             K̂          λ̂              τ̂              REE(q̂)       K̂          λ̂              τ̂              REE(q̂)
  CASP       7 (0.04)   0.59 (0.002)   0.27 (0.002)   0.95          4 (0.04)   0.97 (0.004)   1.79 (0.006)   1.00
  Bcv        3 (0.08)   0.58 (0.003)   0.26 (0.003)   1.14          1 (0.04)   1.00 (0.001)   1.75 (0.006)   4.24
  FactMLE    7 (0.04)   0.57 (<1e-3)   0.19 (0.001)   1.68          4 (0.04)   0.98 (<1e-3)   1.55 (0.001)   4.58
  POET       7 (0.04)   0.57 (<1e-3)   0.18 (<1e-3)   2.14          4 (0.04)   0.97 (<1e-3)   1.53 (<1e-3)   7.26
  Naive      7 (0.04)   0.60 (<1e-3)   0.24 (0.001)   1.36          4 (0.04)   1.00 (<1e-3)   1.63 (0.004)   1.87

The left panels of figures 4.1 and 4.2 both suggest a superior risk performance of CASP as m varies. Moreover, when the ratio n/(m − 1) is largest, the right panels of these figures plot the sorted shrinkage factors f̂_i^prop averaged over the 500 repetitions (red line) and sandwiched between their 10th and 90th percentiles (represented by the gray shaded region) under the two scenarios. Under scenario 1 in particular, the estimated shrinkage factors are all smaller than 1, indicating the significant role that the coordinate-wise shrinkage plays in reducing the marginal mean squared error of q̂^casp from q^Bayes. However, as τ increases from 0.25 to 1.75 in scenario 2, the estimated shrinkage factors move closer to 1, and the risk performances of q̂^casp and q̂^approx are indistinguishable from each other, as seen in table 4.1 under scenario 2 wherein the REE of CASP is 1. This is not unexpected because, with a fixed λ > 0 and τ growing above 1, the factor Σ_{j=1}^K ρ̂_j^4 { h_{1,1,τ}(ℓ̂_j^e) − h_{1,1,τ}(ℓ̂_0^e) }^2 in the denominator of f̂_i^prop becomes smaller in comparison to the numerator N̂ in definition 4, and the improvement due to coordinate-wise shrinkage dissipates.

Figure 4.1: Experiment 1 Scenario 1 (Generalized absolute loss): Left - Relative Error estimates as m varies over (15, 20, 25, 30, 35, 40, 45, 50). Right - Magnitude of the sorted shrinkage factors f̂_i^prop averaged over 500 repetitions at m = 15 and sandwiched between their 10th and 90th percentiles.
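For concreteness, the sampling scheme of Experiment 1 described above can be reproduced along the following lines in R. The construction of the spiked eigenvectors is not spelled out in the text, so a random orthonormal basis is used here purely for illustration, and the normalization of S is shown up to the exact Wishart degrees-of-freedom convention.

  set.seed(2)
  n <- 200; K <- 10; m <- 15; l0 <- 1
  spikes <- seq(80, 20, length.out = K)                  # K spikes equi-spaced between 80 and 20 (Scenario 1)
  P <- qr.Q(qr(matrix(rnorm(n * K), n, K)))              # spiked eigenvectors: a random orthonormal set (illustrative)
  Sigma <- P %*% diag(spikes - l0) %*% t(P) + l0 * diag(n)

  R <- chol(Sigma)                                       # Sigma = R'R, used for Gaussian sampling
  W <- matrix(rnorm(m * n), m, n) %*% R                  # m past observations from N_n(0, Sigma)
  S <- crossprod(W) / m                                  # side-information covariance estimate
  theta <- rep(0, n)                                     # location; in the experiment theta follows the prior (4.4)
  X <- drop(rnorm(n) %*% R) + theta                      # the single current observation (m_x = 1, m_0 = 1)
  b <- runif(n, 0.9, 0.95); h <- 1 - b                   # check-loss weights of Scenario 1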
From table 4.1, q̂^Bcv is the most competitive predictive rule next to q̂^casp across both scenarios; however, it seems to suffer from under-estimation of the number of factors K. We notice this behavior of q̂^Bcv across all our numerical and real data examples. The other three predictive rules, q̂^Fact, q̂^Poet and q̂^Naive, exhibit poorer risk performances, and this is not entirely surprising in this setting, primarily because the four competing predictive rules considered here do not involve any asymptotic corrections to the sample eigenvalues and eigenvectors, whereas CASP uses the phase transition phenomenon of the sample eigenvalues and eigenvectors to construct consistent estimators of the smooth functions of Σ that appear in the form of the Bayes predictive rules.

Figure 4.2: Experiment 1 Scenario 2 (Linex loss): Left - Relative Error estimates as m varies over (15, 20, 25, 30, 35, 40, 45, 50). Right - Magnitude of the sorted shrinkage factors f̂_i^prop averaged over 500 repetitions at m = 15 and sandwiched between their 10th and 90th percentiles.

4.5.2 Experiment 2

For experiment 2 we consider the setup of a static factor model with heteroscedastic noise and simulate our data according to the following model:

X_t = μ + B f_t + ε_t,   with   f_t ~ N_K(0, I_K),   μ ~ N_n(η_0, λ Σ^τ)   and   ε_t ~ N_n(0, Δ_n),

where K ≪ n represents the number of latent factors, B is the n × K matrix of factor loadings, f_t is the K × 1 vector of latent factors independent of ε_t, and Δ_n is an n × n diagonal matrix of heteroscedastic noise variances. In this model Σ = B B^T + Δ_n, which coincides with the heteroscedastic factor models considered in Owen and Wang (2016), Fan et al. (2013), Khamaru and Mazumder (2018) for estimating Σ. Thus the three competing predictive rules q̂^Bcv, q̂^Poet and q̂^Fact are well suited for prediction in this model. Factor models of this form are often considered in portfolio risk estimation (see for example Fan et al. (2016)), where the goal is to first estimate the matrix of factor loadings B and the vector of latent factors f_t, and thereafter use the fitted model to sequentially predict A X_{t+s} for s = 1, 2, ..., T, where X_t might represent an n-dimensional vector of stock excess returns and A is the p × n weight matrix that aggregates the predicted excess returns into p ≪ n individual portfolio-level returns. Often an autoregressive structure is imposed on f_t so that f_t = Φ f_{t−1} + v_t, which is the so-called dynamic factor model (Geweke 1977), where Φ is a K × K matrix of autoregressive coefficients and v_t ~ N_K(0, D). For the purposes of this simulation exercise we take n = 200, Φ = 0 and D = I_K with K = 10 factors. We fix η_0 = 0, and simulate the rows of B independently from N_K(0, c I_K) and the diagonal elements of Δ_n independently from Unif(0.5, 1.5).
The elements of the aggregation matrix A are simulated uniformly from (0, 1), with the p = 20 rows normalized to 1. In this experiment, similar to experiment 1, we consider two scenarios:

Scenario 1: we fix (c, λ, τ) = (0.5, 0.5, 0.25).

Scenario 2: we fix (c, λ, τ) = (0.2, 1.5, 2).

To estimate S, we sample W_j from N_n(0, Σ) independently for m samples, where we allow m to vary over (15, 20, 25, 30, 35, 40, 45, 50). Finally, m_x = 1 copy of X_t is sampled from N_n(μ, Σ), and the goal is to predict A X_{t+1} under a Linex loss with b_i = 1 and a_i i.i.d. Unif(1, 2), to emphasize the severity of over-prediction of portfolio excess returns. This sampling scheme is repeated over 500 repetitions and the REE of the competing predictive rules and CASP is presented in figures 4.3 and 4.4 for scenarios 1 and 2 respectively.

Table 4.2: Relative Error estimates (REE) of the competing predictive rules at m = 15 for Scenarios 1 and 2 under Experiment 2. The numbers in parentheses are standard errors over 500 repetitions.

                       Scenario 1: (K, λ, τ) = (10, 0.5, 0.25)             Scenario 2: (K, λ, τ) = (10, 1.5, 2)
             K̂          λ̂              τ̂              REE(q̂)       K̂          λ̂              τ̂              REE(q̂)
  CASP       8 (0.04)   0.59 (0.002)   0.28 (0.002)   0.91          2 (0.04)   1.48 (0.004)   2.09 (0.003)   1.00
  Bcv        3 (0.08)   0.58 (0.003)   0.26 (0.002)   0.87          0 (0.01)   1.50 (<1e-3)   2.04 (0.004)   1.23
  FactMLE    8 (0.04)   0.57 (<1e-3)   0.19 (0.001)   1.49          2 (0.04)   1.48 (<1e-3)   1.87 (0.002)   1.10
  POET       8 (0.04)   0.57 (<1e-3)   0.18 (<1e-3)   1.87          2 (0.04)   1.47 (<1e-3)   1.83 (<1e-3)   1.12
  Naive      8 (0.04)   0.60 (<1e-3)   0.24 (0.001)   1.61          2 (0.04)   1.50 (<1e-3)   2.05 (0.004)   1.27

From the left plot in figure 4.3 we see that q̂^Bcv returns the smallest REE amongst all competing predictive rules in scenario 1 and is closely followed by q̂^casp. This is expected since q̂^Bcv relies on a heteroscedastic factor model structure to estimate Σ; however, even in this scenario CASP is competitive. In particular, the estimated shrinkage factors for CASP are all smaller than 1 (right plot in figure 4.3), which allows CASP to deliver an REE that is substantially less than 1.

Figure 4.3: Experiment 2 Scenario 1 (Linex loss): Left - Relative Error estimates as m varies over (15, 20, 25, 30, 35, 40, 45, 50). Right - Magnitude of the sorted shrinkage factors f̂_i^prop averaged over 500 repetitions at m = 15 and sandwiched between their 10th and 90th percentiles.

Under scenario 2 (left plot in figure 4.4), q̂^Bcv no longer enjoys a superior performance and exhibits a volatile REE profile as m increases from 15 to 50, which can potentially be due to its tendency to under-estimate the number of factors K, as seen from table 4.2 at m = 15. In this scenario, q̂^casp is the most competitive predictive rule; however, with τ > 1, q̂^casp is no better than q̂^approx in terms of REE.

Figure 4.4: Experiment 2 Scenario 2 (Linex loss): Left - Relative Error estimates as m varies over (15, 20, 25, 30, 35, 40, 45, 50). Right - Magnitude of the sorted shrinkage factors f̂_i^prop averaged over 500 repetitions at m = 15 and sandwiched between their 10th and 90th percentiles.
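A compact R sketch of the Experiment 2 data-generating model described above is given below; the constants follow one reading of Scenario 1 and of the row normalization of A, so the snippet is illustrative rather than a verbatim reproduction of the simulation code.

  set.seed(3)
  n <- 200; K <- 10; p <- 20; c0 <- 0.5                  # Scenario 1 constant c
  B <- matrix(rnorm(n * K, sd = sqrt(c0)), n, K)         # rows of B drawn from N_K(0, c0 * I_K)
  psi <- runif(n, 0.5, 1.5)                              # heteroscedastic noise variances (diagonal of Delta_n)
  Sigma <- tcrossprod(B) + diag(psi)                     # Sigma = B B' + Delta_n

  Phi <- 0                                               # autoregressive coefficient set to zero in this exercise
  f_t <- Phi * rnorm(K) + rnorm(K)                       # latent factors, f_t = Phi f_{t-1} + v_t with D = I_K
  X_t <- B %*% f_t + rnorm(n, sd = sqrt(psi))            # one draw of X_t with mu = 0

  A <- matrix(runif(p * n), p, n)
  A <- A / rowSums(A)                                    # p aggregation weights, each row normalized to 1
  V_target <- A %*% X_t                                  # the aggregated quantity A X_{t+1} to be predicted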
4.5.3 Experiment 3

For experiment 3, we consider a slightly different setup where we do not impose a spiked covariance structure on Σ. Instead, we assume that (Σ)_ij = Cov(X_i, X_j) = 0.9^{|i−j|} for i, j = 1, ..., n, thus imposing an AR(1) structure between the n coordinates of X. As in experiment 1, we sample from an n = 200 variate Gaussian distribution with mean vector equal to 0 and covariance Σ. We vary (λ, τ) across two scenarios, taking (λ, τ) as (1, 0.5) and (0.5, 2) in scenarios 1 and 2 respectively. We estimate S using the approach described in experiments 1 and 2, and sample m_x = 1 copy of X from N_n(θ, Σ), with the goal of predicting A Y under a generalized absolute loss function with h_i = 1 − b_i and b_i sampled uniformly from (0.9, 0.95) for i = 1, ..., p. Here Y ~ N_n(θ, Σ) is independent of X, and A is a fixed p × n sparse matrix with the p = 20 rows sampled independently from a mixture distribution with density 0.9 δ_0 + 0.1 Unif(0, 1) and normalized to 1 thereafter. This sampling scheme is repeated over 500 repetitions and the REE of the competing predictive rules and CASP is presented in figures 4.5, 4.6 and table 4.3.

Table 4.3: Relative Error estimates (REE) of the competing predictive rules at m = 15 for Scenarios 1 and 2 under Experiment 3. The numbers in parentheses are standard errors over 500 repetitions.

                       Scenario 1: (λ, τ) = (1, 0.5)                       Scenario 2: (λ, τ) = (0.5, 2)
             K̂          λ̂              τ̂              REE(q̂)       K̂          λ̂              τ̂              REE(q̂)
  CASP       7 (0.06)   1.09 (0.007)   0.40 (0.004)   0.94          7 (0.06)   0.57 (<1e-3)   1.74 (0.001)   1.00
  Bcv        1 (0.08)   0.95 (0.012)   0.36 (0.002)   2.39          1 (0.08)   0.59 (<1e-3)   1.79 (0.002)   4.27
  FactMLE    7 (0.06)   1.16 (0.001)   0.33 (<1e-3)   1.23          7 (0.06)   0.57 (<1e-3)   1.73 (<1e-3)   1.08
  POET       7 (0.06)   1.17 (<1e-3)   0.33 (<1e-3)   1.49          7 (0.06)   0.57 (<1e-3)   1.73 (<1e-3)   1.27
  Naive      7 (0.06)   1.16 (0.001)   0.34 (0.001)   1.26          7 (0.06)   0.57 (<1e-3)   1.73 (<1e-3)   1.11

In this setup, the departure from the factor model leads to a poorer estimate of Σ for CASP than what was observed under experiments 1 and 2; however, the REE of CASP continues to be the smallest amongst all the competing rules. When τ = 2 (scenario 2), q̂^casp and q̂^approx are almost identical in their performance. Amongst the competing methods, q̂^Bcv has the highest REE, possibly exacerbated by the departure from a factor-based model in this experiment, whereas this departure seems to have a comparatively lesser impact on CASP, indicating potential robustness of CASP to mis-specification of the factor model.

Figure 4.5: Experiment 3 Scenario 1 (Generalized absolute loss): Left - Relative Error estimates as m varies over (15, 20, 25, 30, 35, 40, 45, 50). Right - Magnitude of the sorted shrinkage factors f̂_i^prop averaged over 500 repetitions at m = 15 and sandwiched between their 10th and 90th percentiles.
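The AR(1) covariance and the sparse aggregation matrix of Experiment 3 can be generated along the following lines in R; the handling of the point mass at zero and the row normalization reflect one reading of the description above and are shown for illustration only.

  set.seed(4)
  n <- 200; p <- 20
  Sigma <- 0.9^abs(outer(1:n, 1:n, "-"))                 # AR(1) structure: Sigma_ij = 0.9^|i - j|

  A <- matrix(0, p, n)
  for (i in 1:p) {
    nz <- rbinom(n, 1, 0.1)                              # roughly 10% of entries drawn from Unif(0, 1), the rest 0
    A[i, ] <- nz * runif(n)
    A[i, ] <- A[i, ] / sum(A[i, ])                       # normalize each sparse row to 1
  }
  b <- runif(p, 0.9, 0.95); h <- 1 - b                   # check-loss weights for predicting A %*% Y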
● ● ● ● ● ● ● ● 0.95 1.45 1.95 2.45 2.95 3.45 3.95 15 20 25 30 35 40 45 50 m REE ● Bcv CASP FactMLE Naive ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.996 0.997 0.998 0.999 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Dimensions Shrinkage Factors Figure 4.6: Experiment 3 Scenario 2 (Generalized absolute loss): Left - Relative Error estimates asm varies over (15; 20; 25; 30; 35; 40; 45; 50). Right: Magnitude of the sorted shrinkage factors ^ f prop i averaged over 500 repetitions atm = 15 and sandwiched between its 10 th and 90 th percentiles 4.6. Real data Illustration with groceries sales data In this section we analyze a part of the dataset published by Bronnenberg et al. (2008). This dataset has been used in significant studies related to consumer behavior, spending and their policy implications (see for example Bronnenberg et al. (2012), Coibion et al. (2015)). The dataset holds the weekly sales and scanner prices of common grocery items sold in retail outlets across 50 states in the U.S. The retail outlets available in the dataset have identifiers that link them to the city that they serve. In accordance to our lagged data example, we analyze a part of this dataset that spansm = 100 weeks from December 31, 2007 to November 29, 2009 as substantial amount of dis-aggregate data from distant past that will be used for constructing auxiliary information on the covariance. We use 3 weeks from a relatively recent snapshot covering October 31, 2011 to November 20, 2011 as data from the current model. We assume, as in equation (4.2), that there 109 might have been drift change in the sales data across time but the covariances across stores are invariant over time. Our goal is to predict the state level total weekly sales across all retail outlets for four common grocery items: coffee, mayo, frozen pizza and carbonated beverages. We use the most recentT = 2 weeks, from November 7, 2011 to November 20, 2011 as our prediction period and utilize the sales data of week t 1 to predict the state aggregated totals for weekt wheret = 1;:::;T . For each of the four products, the prediction period includes sales across approximately n = 1; 140 retail outlets that vary significantly in terms of their size and quantity sold across theT weeks. Moreover, some of the outlets have undergone merger and even closure during the prediction period which is often recorded as 0 product sales. Let X (p) 0 be then dimensional vector denoting the number of units of productp sold across then outlets in week 0 - October 31, 2011 to November 6, 2011. For our prediction problem, we use a threshold ofs p units for productp and consider only those outlets that have sold atleasts p units in week 0. Figure 4.7: CASP predicted weekly demand of the grocery items across US averaged over the two prediction weeks - November 7, 2011 to November 20, 2011. 110 Figure 4.8: Role of coordinate-wise shrinkage in CASP across US states for four grocery items. In the figures, 1 the shrinkage factors is displayed and so, deeper shades denote higher shrinkage. LetX (p) t1 be then p = P n i=1 I(X (p) 0;i s p ) dimensional vector denoting the number of units of product p sold acrossn p stores in weekt1. For a distributor, it is economically important to predict the aggregated demands (future sales) for each US state as intra-state transport of inventories and transfer of business and tax accounts can be easily executed within state boundaries. 
The time t − 1 prediction problem then is to predict V_t^(p) = A^(p) X_t^(p), where A^(p) is a d_p × n_p matrix that aggregates product p sales across the n_p stores into d_p unique states across the U.S. To evaluate the performance of any predictive rule q̂_t^(p), we use the generalized absolute loss function of equation (4.5) and calculate the time t ratio of the total loss for prediction using q̂_t^(p) to the total loss for prediction using CASP with all shrinkage factors f_i = 1:

L_t(q̂_t^(approx,p), q̂_t^(p)) = [ Σ_{i=1}^{d_p} { b_i (V_{t,i}^(p) − q̂_{t,i}^(p))_+ + h_i (q̂_{t,i}^(p) − V_{t,i}^(p))_+ } ] / [ Σ_{i=1}^{d_p} { b_i (V_{t,i}^(p) − q̂_{t,i}^(approx,p))_+ + h_i (q̂_{t,i}^(approx,p) − V_{t,i}^(p))_+ } ],   (4.15)

where q̂_t^(approx,p) is CASP with f_i = 1 for all i. We set b_i = 0.95 and h_i = 1 − b_i for all i = 1, ..., d_p to emphasize the severity of under-prediction, since over-stocking may lead to holding and storage costs, as all four products have considerably long expiration dates, while under-stocking may translate into substantial lost sales and reputation costs for the retail outlets. In table 4.4 we report this loss ratio L_t for each product p in columns (a) and (b), and for six competing predictive rules: (i) CASP, (ii) the Naive predictive rule as discussed in chapter 4.5, (iii) Bcv (Owen and Wang 2016), (iv) POET (Fan et al. 2013), (v) FactMLE (Khamaru and Mazumder 2018), and (vi) the Unshrunk predictive rule that simply uses the past week's sales to predict the sales in the forthcoming week. To compute an estimate S^(p) of the n_p × n_p population covariance matrix Σ^(p) of X_t^(p), we rely on the additional data on m = 100 weeks available from December 31, 2007 to November 29, 2009 and estimate S^(p) using the methodology described in chapter 4.2.3. In particular, we use the function smooth.spline from the R-package splines2 and choose k = 3 knots corresponding to the 25th, 50th and 95th percentiles of the sales distribution across the n_p stores at each of the m weeks. We complete the specification of our model by setting η = η_0^(p) 1, where η_0^(p) is the median of the average weekly sales of the n_p outlets over the m weeks, and use equation (4.14) to estimate λ^(p) over the interval [0.1, 1] with τ^(p) fixed at 1.

The loss ratios reported in columns (a) and (b) of table 4.4 indicate a competitive performance of CASP relative to the five remaining predictive rules. CASP continues to provide the smallest loss ratios across both weeks, with the only exception being the loss ratio in Week 1 (column (a)) for the product 'Mayo', where CASP is competitive with the predictive rules q̂^Naive (Naive) and q̂^Poet (POET). It is interesting to note that in at least one of the two weeks, the loss ratio of CASP is marginally bigger than 1 across all four product categories, indicating that the coordinate-wise shrinkage factors do not always bring in any significant improvement in prediction performance. This is not entirely unexpected because the hierarchical model assumption of equation (4.4) may not hold in this setting, and thus the model-based shrinkage factors f̂_i^prop may not be the optimal coordinate-wise shrinkage.
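The evaluation criterion (4.15) reduces to a ratio of two aggregated check losses, which can be computed as in the short R helper below (shown with the b_i = 0.95 and h_i = 0.05 weights used here).

  check_loss_total <- function(v, q, b, h) sum(b * pmax(v - q, 0) + h * pmax(q - v, 0))

  loss_ratio <- function(v, q_hat, q_approx, b = 0.95, h = 0.05) {
    check_loss_total(v, q_hat, b, h) / check_loss_total(v, q_approx, b, h)   # eq. (4.15)
  }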
Table 4.4: Loss ratios (4.15) across six predictive rules for four products. Columns (a) and (b) report the ratio for total sales by state in prediction Weeks 1 and 2, respectively.

  Product                               Method     K     (a) Week 1 Loss Ratio   (b) Week 2 Loss Ratio
  Coffee                                CASP       26    0.999                   1.002
  (s_p = 1000, n_p = 233, d_p = 31)     Naive      26    1.044                   1.063
                                        Bcv        17    1.043                   1.036
                                        POET       26    1.047                   1.070
                                        FactMLE    26    1.009                   1.044
                                        Unshrunk   -     1.838                   2.273
  Mayo                                  CASP       26    0.995                   1.004
  (s_p = 500, n_p = 157, d_p = 30)      Naive      26    0.996                   1.016
                                        Bcv        19    1.040                   1.019
                                        POET       26    0.996                   1.022
                                        FactMLE    26    0.999                   1.012
                                        Unshrunk   -     1.084                   2.420
  Frozen Pizza                          CASP       33    1.000                   0.998
  (s_p = 1000, n_p = 359, d_p = 33)     Naive      33    1.177                   1.135
                                        Bcv        19    1.059                   1.091
                                        POET       33    1.033                   1.040
                                        FactMLE    33    1.008                   1.020
                                        Unshrunk   -     4.424                   6.701
  Carb. Beverages                       CASP       37    1.003                   0.984
  (s_p = 5000, n_p = 410, d_p = 33)     Naive      37    1.065                   1.033
                                        Bcv        20    1.073                   1.142
                                        POET       37    1.065                   1.038
                                        FactMLE    37    1.067                   1.059
                                        Unshrunk   -     3.459                   8.885

Figure 4.7 presents the CASP predicted future demand across the different states, while figure 4.8 presents the state-wise distribution of the shrinkage factors of CASP for the four products, plotted as 1 − f̂_i so that a lighter shade in the heatmap corresponds to smaller shrinkage (larger f̂_i^prop). For example, in the case of Coffee, the shrinkage factors are all close to 1 across all the d_p = 31 states, and this effect translates into loss ratios that are almost equal to 1 across the two weeks in table 4.4. In the case of Frozen Pizza, however, the magnitudes of the shrinkage factors are evenly distributed across the d_p = 33 states. For instance, the states of Iowa, followed by Illinois and Oklahoma, exhibit the largest shrinkages, while for Mayo, Georgia and North Dakota have the largest shrinkages in their predicted weekly total sales. In particular, this demonstrates that the shrinkage factors vary across the products because the variability in sales is product specific. More importantly, CASP is flexible enough to capture these inter-product differences and, mainly owing to its bias correction step (see chapter 4.3.1), CASP offers better estimates of future sales than the other popular predictive approaches considered here.

Chapter 5

Hierarchical Variable Selection in Multivariate Mixed Models

5.1. Introduction

Mobile games have become an integral part of modern life (Koetsier 2015). While their almost ubiquitous presence is increasingly reshaping the recreational, socialization, educational and learning media (Statista (2018); see Ch. 1 and 3 of Hwong (2016); Garg and Telang (2012)), the monetization policies associated with these new mobile apps are rapidly revolutionizing the digital marketing and advertisement space in information systems (Appel et al. 2017, Liu et al. 2014). As such, mobile games (per industry standards, formally defined as any app-based game played on an Internet-enabled mobile device such as a tablet or phone) currently comprise 42% of the market share of global gaming products (McDonald 2017), and more than eight hundred thousand mobile games were available for download in the iOS App Store alone, with approximately four hundred new submissions arriving each day (PocketGamer 2018). To understand how quickly the gaming market is growing, a new industry study from Spil Games (Diele 2013) reports that 1.2 billion people are now playing games worldwide, with 700 million of those online. The unprecedented growth and popularity of mobile games has resulted in a market with some very unique consumer characteristics (Boudreau et al. 2017). It is an extremely crowded market with a significant proportion of revenue accumulated through advertisement based on free products (Appel et al. 2017).
Specifically, app retention rates are much lower than the retention rates observed for classical products and services, with reports suggesting that more than 80% of all app users churn (drop out) within the first quarter (Perro 2016, MarketingCharts 2017). The freemium business model (Niculescu and Wu 2011), which offers a certain level of service without cost and sells premium add-on components to generate revenue, is a popular strategy for the monetization of these mobile games. Indeed, industry reports indicate that more than 90% of mobile games start as free, and more than 90% of the profits currently come from products that began as free (AppBrain 2017, Taube 2013). User characteristics in freemium models differ in fundamental aspects from traditional marketing models. This necessitates the development of new analytical methods for modeling freemium behavior.

5.1.1 Freemium model: Player Activity and Engagement

In the freemium market, firms initially attract customers with free usage of their products, with the expectation that free usage will lead customers to engage in future purchases of premium components. However, customers can always remain free users and never need to enjoy the premium components of the product. This is an important distinction from non-freemium business models, where customers must purchase in order to use the product. While the free-to-use part of freemium products helps to attract a consumer base quickly (Kumar 2014), managers are uncertain about whether and how freemium can generate profits (Needleman and Loten 2012), as the majority of consumers do not use the premium part of freemium. As such, unless a game is very popular, in-app purchases contribute an insignificant proportion of its revenue. The mobile marketing automation firm Swrve (Swrve 2016) found that over 48% of all in-game revenue is derived from 0.19% of all players, which is a tiny segment. While in-game (direct) revenue is important, there are several indirect ways of monetizing free users by involving them to engage with the game via social media (through Facebook or Twitter likes and posts of game achievements, inviting social media friends to join the game, and watching, liking or posting YouTube videos related to the game) or the app center. To measure the daily engagement of a player, we judiciously combine her in-app purchases (a direct source of revenue) with her varied involvements with the game in media (an indirect source of monetization), under the notion that purchase is the highest form of engagement. We define a player's daily activity as the time she spends playing the game in the day. Positive daily activity does not always lead to positive engagement. It is commonly believed that as a game grows, with increasing and prolonged player activities, it will have more positive as well as higher engagement values. For game managers it is extremely important to accurately measure player activity, engagement and their co-dependencies. Also, varied retention strategies are often used to curb high churn rates, and their effects need to be properly analyzed. Here, we develop a Constrained Extremely Zero-Inflated Joint (CEZIJ) modeling framework that provides a disciplined statistical program for jointly modeling player activity, engagement and churn in online gaming platforms.
Our proposed framework captures the co-dependencies between usage (activity), direct and indirect revenue (engagement), and dropouts (a time-to-event outcome), and provides a systematic understanding of how the dependent variables influence each other and are influenced by the covariates. Furthermore, the CEZIJ framework can be used to predict the activity, engagement and attrition of new players. The ability to forecast the behavior of new players is critical for managers, as this enables them to better predict the effectiveness of their gaming platform in engaging customers and thus attract future advertisers to their platform.

5.1.2 Joint modeling of player characteristics

Our joint modeling framework uses generalized linear mixed effect models (GLMMs) and relies on a joint system of equations that model the relationships between activity, engagement, and churn. In the activity equations, we separately assess whether consumers are active (i.e., play the game) and the extent of their activity through the amount of time they spend playing the game. Engagement is modeled by the probability of having positive engagement and by a conditional model on the positive engagement values. In the churn equations, we account for permanent churn, identified as those players who are not active for more than 30 consecutive days. Our modeling system addresses the complex interdependencies between (1) the decision to use the free product, (2) how much time will be spent using the free product, (3) the decision to engage, (4) the extent of engagement and (5) the decision to churn. That is, the joint equation system comprehensively uncovers positive, negative, or zero co-dependencies among activity, engagement, and churn in freemium markets. In recent times, joint modeling of multiple outcomes has received considerable attention (Rizopoulos 2012). Many applications consider the modeling of single or multiple longitudinal outcomes and a time-to-event outcome (e.g., Jiang (2007), McCulloch (2008), Rizopoulos et al. (2009, 2010), Banerjee et al. (2014), Rizopoulos and Lesaffre (2014)). Our motivation for jointly modeling the drivers of player gaming traits and dropout arises from the fact that there is heterogeneity across players' outcomes, and one must combine these effects by correlating the multiple responses. Since these responses are measured on a variety of different scales (viz. time spent in hours, revenue in dollars), a flexible solution is to model the association between the different responses by correlating the random heterogeneous effects from each of the responses. Such a model not only gives us a covariance structure with which to assess the strength of association between the responses, but also offers useful insights to managers since, despite the huge popularity of mobile games among users, managers are not certain whether freemium is profitable. Furthermore, it is important for managers to understand how activity and engagement are related to player churn. While customers who frequently use the free product could be more satisfied, thus reducing their probability of churn (Gustafsson et al. 2005), free usage could also be related to a greater probability of churn, as there is little switching cost for customers due to their lower perceived value (Yang and Peterson 2004). Earlier studies used simpler models for churn that are independent of the purchase rate (Jerath et al. 2011). Here we model churn allowing for possible co-dependencies with activity and engagement.
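To make the correlated-random-effects idea concrete, the toy R simulation below draws player-level random effects for activity, engagement and churn from a common multivariate normal, so that their covariance matrix plays the role of the association structure described above. This is purely an illustrative sketch with hypothetical names and constants, not the CEZIJ model or its estimation procedure.

  set.seed(5)
  n_players <- 200; n_days <- 30
  D <- matrix(c( 1.0,  0.5, -0.4,
                 0.5,  1.0, -0.3,
                -0.4, -0.3,  1.0), 3, 3)                # random-effect covariance across the three outcomes
  u <- matrix(rnorm(n_players * 3), n_players, 3) %*% chol(D)

  sim_player <- function(u_i) {
    p_active <- plogis(-0.5 + u_i[1])                   # daily probability of playing
    active   <- rbinom(n_days, 1, p_active)
    engage   <- active * rbinom(n_days, 1, plogis(-3 + u_i[2]))  # engagement is rare even when active
    churn_hz <- plogis(-4 + u_i[3])                     # daily discrete-time churn hazard
    list(active = active, engage = engage,
         churn_day = which(rbinom(n_days, 1, churn_hz) == 1)[1])
  }
  players <- lapply(seq_len(n_players), function(i) sim_player(u[i, ]))

In such a toy setting, a negative entry between the engagement and churn effects, for example, encodes the possibility that more engaged players are less likely to drop out.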
5.1.3 Statistical challenges

The online gaming data, which is the application case described in detail in chapter 5.2, and the particular business model of freemium pose several statistical challenges and necessitate novel extensions of the joint modeling framework. We describe the details below.

(i) Extreme Zero-inflation - Freemium behavior suggests that even if a player is active on a day, it very rarely leads to purchases or social media engagement on her part. Thus, though both activity and engagement are zero-inflated, engagement has an extremely zero-inflated distribution. Mixture distributions, of which zero-inflated distributions are a special case, are commonly used for this kind of data. While multiple models have been developed to accommodate data with excess zeros (see, for example, Olsen and Schafer (2001), Min and Agresti (2005), Han and Kronmal (2006), Alfò et al. (2011), Greene (2009) and the references therein), extremely zero-inflated data has received much less attention, although a few recent works, e.g., Hatfield et al. (2012), show promise. We develop a joint modeling framework that can accommodate extreme zero-inflation. The proposed framework allows us to accommodate large incidences of no engagement by active players, such as those observed in freemium markets, and helps managers more accurately forecast sales potential for businesses with large active customer bases but a small incidence of engagement, by separating the confound between non-active and non-engaged. We highlight that such extremely zero-inflated data are not only relevant to freemium markets but are also common in other businesses wherein a sizable portion of the active consumer base engages in very little purchase activity. For example, in the online setting, we may observe low incidences of online ratings (i.e., 1-5 star ratings), user-generated content creation, banner ad click-throughs, and search ad conversions (see for example Urban et al. (2013), Haans et al. (2013)). Likewise, in the offline setting of purchase data, for example, most product categories comprise less than 5% planned or actual purchases for an individual's visit to the grocery store (Hui et al. 2013). Thus, if managers are interested in assessing promotion effects on sales or individual-level purchase activity in these contexts, we may be confronted with data that contain an extreme number of zeros.

(ii) Parametric Constraints - We develop a framework for incorporating domain-specific structural constraints in our model, as one may have prior knowledge that a vector of parameters lies on a simplex or follows a particular set of inequality constraints. It is quite common in gaming data to have prior information available on various activities of the player. For example, it is well known that player characteristics will have a burgeoning weekend effect, or marketers may have prior knowledge on the comparative efficacies of the retention strategies, particularly if they have known dosage demarcations. Using this side information is extremely important (Banerjee et al. 2019, James et al. 2019), and the CEZIJ framework incorporates such domain expertise through convexity constraints in our model.

(iii) Hierarchical Variable Selection - In online gaming data one usually encounters numerous covariates related to both game-specific and player-specific variables, and choosing the relevant set of covariates is highly desirable for improving predictability.
It is also important that the inferential problems associated with these data properly account for the presence of many possibly spurious covariates. The high-dimensionality of these datasets, however, renders classical variable selection techniques inadequate. We develop a novel algorithm for estimation in the CEZIJ framework that conducts variable selection from a large set of potential predictors in a GLMM based joint model. To produce interpretable effects, CEZIJ imposes a hierarchical structure on the selection mechanism and includes covariates either as fixed effects or as composite effects, where the latter are those covariates that have both fixed and random effects (Hui et al. 2017a) (see chapter 5.4 for details). Efficient selection of fixed and random effect components in a mixed model framework has received considerable attention in recent years (Bondell et al. (2010), Fan and Li (2012), Lin et al. (2013); detailed background is provided in chapter 5.4). A penalized quasi-likelihood (PQL) approach has been used by Hui et al. (2017b) to conduct simultaneous (but non-hierarchical) selection of mixed effects in a GLMM framework with adaptive lasso and adaptive group lasso regularization. The CREPE (Composite Random Effects PEnalty) estimator of Hui et al. (2017a) conducts hierarchical variable selection in a GLMM with a single longitudinal outcome and employs the Monte Carlo EM (MCEM) algorithm of Wei and Tanner (1990) to maximize the likelihood. The CREPE estimator ensures that variables are included in the final model either as fixed effects only or as composite effects. Our proposed CEZIJ framework is related to Hui et al. (2017a) in its ability to conduct hierarchical variable selection in GLMMs. However, unlike Hui et al. (2017a), CEZIJ performs hierarchical variable selection in a joint model of multiple correlated longitudinal outcomes. Additionally, it can also incorporate convexity constraints on the fixed effects.

(iv) Scalability - For any mobile game app, gargantuan volumes of user activity data are automatically accumulated. Analyzing such big datasets involves not only the inferential problems associated with high-dimensional data analysis but also the computational challenges of processing large-scale (sample) longitudinal data. To process large longitudinal datasets, CEZIJ leverages the benefits of distributed computing. Recently, algorithmic developments for increased scalability and reduced computational time without sacrificing the requisite level of statistical accuracy have received significant attention; see, for example, Jordan et al. (2013), Jordan et al. (2018), Lee et al. (2015) and the references therein. A popular approach is to conduct inference independently and simultaneously on $K$ subsets of the full dataset and then form a global estimator by combining the inferential results from the $K$ nodes in a computation-efficient manner. We take a similar approach for the hierarchical selection of fixed and random effects by using the split-and-conquer approach of Chen and Xie (2014), which splits the original dataset into $K$ non-overlapping groups, conducts variable selection separately in each group and uses a majority voting scheme to assimilate the results from the splits.

(v) Prediction and Segmentation - Predictive analysis of new player behavior is fundamental for the maintenance of existing as well as the creation of new advertisement based monetization routes in these gaming platforms.
Statistically, this necessitates the construction of prognostic models that can not only forecast new user activity, engagement and drop-out behavior but also dynamically update such forecasts over time as new longitudinal information arrives. Based on our fitted joint model, we construct drop-out probability profiles (over time) for an out-of-sample generic player population and use them for segmentation of idiosyncratic player behaviors. Segmentation is a key analytical tool for managers. Users in different segments respond differently to varied marketing promotions. This enables managers to use relevant marketing promotions that better match user responses in different segments and increase the efficiency of their marketing campaigns.

We develop our joint modeling framework, which accommodates all of the above mentioned extensions, through an efficient and scalable estimation procedure. To the best of our knowledge, we are the first to study constrained joint modeling of high-dimensional data. Though we demonstrate the applicability of the CEZIJ inferential framework for the disciplined study of freemium behavior, it can be used in a wide range of other applications that require analyzing multiple high-dimensional longitudinal outcomes along with a time-to-event analysis. To summarize, the key features of our CEZIJ framework are:

- Joint modeling of the highly related responses pertaining to daily player activity and engagement as well as the daily dropout probabilities, using the freemium mobile game data described in chapter 5.2;

- The possibility of acute zero-inflation in the player engagement distribution is addressed by modeling the conditional probability of no engagement given that the player had used the app that day (see Figure 5.3);

- Convexity constraints pertinent to domain expertise and prior beliefs are incorporated into the modeling framework in chapter 5.3;

- A penalized EM algorithm (Wei and Tanner 1990) is used for simultaneous selection of fixed and random effects, wherein data-driven weighted $\ell_1$ penalties are imposed on the fixed effects as well as on the diagonal entries of the covariance matrix of the random effects, while the common regularization parameter is chosen by a BIC-type criterion (see equation (5.9));

- Hierarchical selection of the fixed and random effects is conducted in chapter 5.4 by using a re-weighted $\ell_1$ minimization algorithm that alternates between estimating the parameters and redefining the data-driven weights, such that the weights used in any iteration are computed from the solutions of the previous iteration;

- The divide and conquer approach in chapter 5.5 distributes the problem into tractable parallel sub-groups, resulting in increased scalability;

- Prediction of the drop-out probabilities as well as the activity and engagement characteristics of new players, with the predictions being dynamically updated as additional longitudinal information is recorded (see chapter 5.6). Based on these dynamic churn probability curves from our fitted joint model, we conduct segmentations of player profiles that can be used by game managers to develop improved promotion and retention policies specifically targeting different dominant player-types.

5.2. Motivating data: activity, engagement, churn and promotion effects in freemium mobile games

We consider daily player level gaming information for a mobile app game where users use robot avatars to fight other robots till one is destroyed.
There were 38,860 players in our database, and we tracked daily player level activity and purchases for 60 consecutive days starting from the release date of the game. We use one part of the data (players) for estimation and the other part as the hold-out set for prediction (see details in chapter 5.6). There were three modes of the game, and level progression can only be attained through the principal mode. However, the players get rewards (henceforth called in-game rewards) if they win games in all three modes. For the two non-principal modes, collecting rewards is the main objective. The players can use these rewards to improve their fighting equipment through upgrades of their existing inventories, to get access to powerful new robots, or to acquire fancy game themes and background changes. The player can also buy these facilities (add-ons) using real money through direct in-app purchases (IAP). Only 0.28% of the players used real money for buying add-ons. The players are given premium rewards, which are of a much higher order of magnitude than regular rewards, if they promote the game or the developers through social media (inviting friends on Facebook for games, Facebook likes, YouTube likes, tweets), through the app center, or by downloading other related apps from the developer. Approximately 7.2% of the players in our data had premium rewards. We record the daily engagement of a player by appropriately combining her real money purchases (direct source of revenue) with her varied involvement in promoting the game in media (indirect source of monetization), with the notion being that the highest form of engagement is the one leading to purchases. Daily engagement is an extremely zero-inflated variable. We assess player behavior in terms of her daily total playing time (activity), engagement value and drop-out probability. We say that a player has dropped out if she has not logged in for a month consecutively. For each player we have a host of time-dependent covariates generated through the gameplay, which we model as composite effects. They include the current level of the game, the number of games played daily in the three different modes of the game, how the in-game rewards are spent, etc. (see tables D.3 and D.4 in Appendix D.4 for details).

Figure 5.1: Game play flowchart

From a gaming perspective, it is very interesting to study the effects on gaming time of the amount of in-game rewards that the players spend on either upgrading existing robots or purchasing new robots. Another interesting feature of the game was the usage of a "gacha" mechanism (Toto 2012), which allowed the players to gamble in-game currency through lottery draws. The "gacha" is a very popular feature in freemium games (Kanerva 2016). We use the currency employed by players in "gacha", as well as their gains, as covariates in modeling engagement. Also, several promotional and retention strategies that encourage player activity were used by the developers. Figure 5.1 contains a flowchart summarizing the key components of the game. The promotions were intrinsically of four different flavors: (a) award more reward percentages and battery life, (b) sale on robots, (c) Thanksgiving holiday promotions, and (d) email and app-message based notifications for retention. Also, there were three different kinds of sales on robots. Thus, there were six different promotional strategies, with only one of them (if at all) being employed on a single day (see table D.5 and figure D.3 in Appendix D.4).
In figures 5.2a and 5.2b, we present the activity, engagement and churn profiles of the players in our data. Interestingly, the proportion of players with positive engagement is below 10% from day 3 onwards and drops to less than 1% after the first 21 days. Figures 5.2c and 5.2d respectively show the 25th, 50th and 75th percentiles of the distribution of total time played and the average engagement amount on each of the 60 days. Note that from day 20 onwards the distribution of average engagement shows increased variability. This is not unexpected given the observation from figure 5.2a, which shows that the proportion of players with positive engagement falls steadily. Also, the heavy tailed nature of the distributions of positive time played and positive engagement amount is evident from figure D.2 (Appendix D.4), which plots the empirical CDFs of the two variables. So, in the following section we use Log-normal distributions to model the non-zero activity and engagement values. Further details regarding the data are available in Appendix D.4.

Figure 5.2: (a) Proportion of players active and proportion of players with positive engagement over 60 days. (b) Proportion of player churn from day 31 to day 60. (c) Median activity sandwiched between its 25th and 75th percentiles. (d) Mean engagement amount and the total engagement amount over the 60 days.

5.3. CEZIJ modeling framework

Using the aforementioned motivating example, we now introduce our generic joint modeling framework. Consider data from $n$ independent players where every player $i = 1,\ldots,n$ is observed over $m$ time points. Let $A_{ij}$ and $E_{ij}$ denote, respectively, the activity and engagement of player $i$ on day $j$, with $A_i = (A_{i1},\ldots,A_{im})$ and $E_i = (E_{i1},\ldots,E_{im})$ denoting the corresponding vectors of longitudinal measurements taken on player $i$. Let $D_i$ denote the time of dropout for player $i$ and $C_i$ the censoring time. We assume the $C_i$'s are independent of the $D_i$'s; thus $C_i = m$ if player $i$ never drops out. The observed time of dropout is $D_i^{\ast} = \min(D_i, C_i)$, and the longitudinal measurements on any player $i$ are available only over $m_i \le D_i^{\ast}$ time points. Let $\delta_{ij}$ be the indicator of the event that player $i$ is active ($A_{ij} > 0$) on day $j$ and $\gamma_{ij}$ the indicator that she positively engages ($E_{ij} > 0$) on day $j$. Let $\pi_{ij} = \Pr(\delta_{ij} = 1)$, $q_{ij} = \Pr(\gamma_{ij} = 1 \mid \delta_{ij} = 1)$, $\delta_i = (\delta_{i1},\ldots,\delta_{im})$ and $\gamma_i = (\gamma_{i1},\ldots,\gamma_{im})$. Note that $\delta_{ij} = 0$ implies $A_{ij} = E_{ij} = 0$ and also $\gamma_{ij} = 0$. In these gaming apps, it is usually the case that any player's usage of the app produces positive activity (however small). Thus, $\delta_{ij}$ here corresponds to a player's daily activity indicator (AI). It forms the base (first level) of our joint model. The probability $\pi_{ij}$ corresponds to the daily usage probability, whereas $q_{ij}$ corresponds to the conditional probability of positive player engagement given that the player has used the app that day. In Figure 5.3, we provide a schematic diagram of our joint model, where we use two binary random variables, the Activity Indicator (AI) and the Engagement Indicator (EI), denoted respectively by $\delta_{ij}$ and $\gamma_{ij}$. We jointly model the five components $[\delta_i, A_i, \gamma_i, E_i, D_i] := [Y^{(s)}_i : s \in \{1,2,3,4,5\}]$ given the observations. Let $\mathcal{I}$ be the full set of $p$ predictors in the data, with $\mathcal{I}_f \subseteq \mathcal{I}$ the set of fixed effects (time invariant or not) and $\mathcal{I}_c = \mathcal{I} \setminus \mathcal{I}_f$ the set of composite effect predictors, which are modeled by a combination of fixed and random effects. Let $p_f = |\mathcal{I}_f|$ and $p_c = |\mathcal{I}_c|$, so that $p_c + p_f = p$.
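To make the data layout concrete, the following is a minimal sketch, in Python with hypothetical variable and function names, of how the indicators $\delta_{ij}$, $\gamma_{ij}$ and an observed dropout time based on a 30-day inactivity window could be derived from raw activity and engagement arrays. It illustrates the definitions above under our own conventions and is not the implementation used in this chapter.

```python
import numpy as np

def build_indicators(A, E):
    """A, E: (n, m) arrays of daily activity time and engagement value."""
    delta = (A > 0).astype(int)   # AI: delta_ij = 1{A_ij > 0}
    gamma = (E > 0).astype(int)   # EI: gamma_ij = 1{E_ij > 0}; zero whenever delta_ij = 0
    return delta, gamma

def observed_dropout(delta, window=30):
    """Observed dropout day (0-indexed) taken here as the first day of a `window`-long
    inactive run, censored at the last day otherwise; this convention is our choice
    for illustration only."""
    n, m = delta.shape
    d_star = np.full(n, m - 1)          # censored players keep the last observed day
    event = np.zeros(n, dtype=int)      # dropout indicator Delta_i
    for i in range(n):
        run = 0
        for j in range(m):
            run = run + 1 if delta[i, j] == 0 else 0
            if run >= window:
                d_star[i], event[i] = j - window + 1, 1
                break
    return d_star, event
```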
For each of the first four sub-models, $s = 1,\ldots,4$, we consider $p$ fixed effects $\beta^{(s)}$ ($p_f$ of these come from the time invariant components and the rest from the composite components) and $p_c$ random effects $b^{(s)}$, while for the dropout model, $s = 5$, we consider $p$ new fixed effects $\beta^{(5)}$ but share the random effects from the four sub-models and calibrate their effects on dropouts through an association parameter vector. See chapter 5.3.1 for further details. Let $x^{(s)}_{ijk}$ denote the observed $k$th covariate value for the $i$th player on the $j$th day. Let $x^{(s)}_{ij} = \{x^{(s)}_{ijk} : k \in \mathcal{I}\}$ and $z^{(s)}_{ij} = \{z^{(s)}_{ijk} : k \in \mathcal{I}_c\}$ denote the sets of covariate values pertaining to the in-model fixed and random effects; $X^{(s)}$ and $Z^{(s)}$ respectively denote the data for these effects across all $n$ players, and let $\beta = \{\beta^{(s)} : s \in \{1,2,3,4\}\}$ and $b = \{b^{(s)} : s \in \{1,2,3,4\}\}$ be all the fixed and random effects across all players. To join the four models, we take a correlated random effects approach and assume that the random effects governing the four sub-models have a multivariate Gaussian distribution. For player $i$, collect all her random effects in $b_i = (b^{(s)}_i : 1 \le s \le 4)$. We assume that $\{b_i : 1 \le i \le n\}$ are i.i.d. $N(0, \Psi)$, where $\Psi$ is the $4p_c \times 4p_c$ unknown covariance matrix. To model the dropouts, we again consider $p$ new fixed effects $\beta^{(5)}$ but share the random effects from the four sub-models and calibrate their effects on dropouts through an association parameter vector. We model $\big[Y^{(s)} : 1 \le s \le 5 \mid X, Z, \beta, \Psi\big]$ as
$$\prod_{i=1}^{n} \int_{b_i} \big[\delta_i \mid X^{(1)}_i, \beta^{(1)}, Z^{(1)}_i, b^{(1)}_i\big]\, \big[A_i \mid \delta_i, X^{(2)}_i, \beta^{(2)}, Z^{(2)}_i, b^{(2)}_i\big]\, \big[\gamma_i \mid \delta_i, X^{(3)}_i, \beta^{(3)}, Z^{(3)}_i, b^{(3)}_i\big]\, \big[E_i \mid \delta_i, \gamma_i, X^{(4)}_i, \beta^{(4)}, Z^{(4)}_i, b^{(4)}_i\big]\, \big[D_i \mid X^{(5)}_i, \beta^{(5)}, b_i\big]\, db_i.$$
Note that the dimension of each $b^{(s)}_i$ in $b_i$ is $p_c$ and that of $x^{(s)}_{ij}$ is $p$. In the context of our mobile app game data, $p_c = 25$, so $\Psi$ is $100 \times 100$, and $p = 31$ for each of the five sub-models, thus making a set of 155 potential fixed effects (time invariant or not). See chapter 5.6 for more details.

Remark 5.3.1. If we had data pertaining to social media interactions among players, it would be beneficial to include network or group effects among players. In the absence of such network information, we model the $b^{(s)}_i$ as i.i.d. across players.

5.3.1 Longitudinal sub-models and model for Dropouts

Zero inflated Log-normal for modeling Activity - Since player $i$ is active only at some time points $j$, the observed activity $A_i$ has a mix of many zeros and positive observations. In equation (5.1), we consider a zero inflated (ZI) Log-normal model for $A_{ij}$ to capture both the prevalence of these excess zeros and the possibly large values observed in the support of $A_{ij}$. Thus, the model for activity $A_{ij}$ has a mixture distribution with pdf
$$g_1(\delta_{ij}, A_{ij} \mid b^{(1)}_i, b^{(2)}_i) = (1 - \pi_{ij})\, I\{\delta_{ij} = 0\} + \pi_{ij}\, (\sigma_1 A_{ij})^{-1}\, \phi\!\left(\frac{\log A_{ij} - \mu_{ij}}{\sigma_1}\right) I\{\delta_{ij} = 1\}, \tag{5.1}$$
where
$$\mathrm{logit}(\pi_{ij}) = x^{(1)T}_{ij} \beta^{(1)} + z^{(1)T}_{ij} b^{(1)}_i \qquad \text{(binary part)} \tag{5.2}$$
and
$$\mu_{ij} = x^{(2)T}_{ij} \beta^{(2)} + z^{(2)T}_{ij} b^{(2)}_i \qquad \text{(positive activity part).} \tag{5.3}$$
The activity indicator $\delta_{ij}$ is modeled using a logistic regression model with random effects in equation (5.2). In equation (5.3) we use an identity link to connect the expected log activity with the covariates and the random effects. For convenience, hereon the dependence on the fixed effects and covariates is kept implicit in the notation and only the involved random effects are explicitly displayed.
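As an illustration of how a single player-day contributes to the zero-inflated log-normal activity model in (5.1)-(5.3), the short sketch below evaluates the log-density for one observation. It is our own code with hypothetical argument names, not the implementation used for the analyses in this chapter.

```python
import numpy as np
from scipy.stats import norm

def zi_lognormal_loglik(a_ij, x1, x2, z1, z2, beta1, beta2, b1, b2, sigma1):
    """Log-density of (delta_ij, A_ij) under the ZI log-normal model (5.1)-(5.3)."""
    eta = x1 @ beta1 + z1 @ b1                 # linear predictor of the binary part (5.2)
    log_pi = -np.logaddexp(0.0, -eta)          # log pi_ij   (numerically stable logistic)
    log_1m_pi = -np.logaddexp(0.0, eta)        # log(1 - pi_ij)
    if a_ij == 0:                              # inactive day: delta_ij = 0
        return log_1m_pi
    mu = x2 @ beta2 + z2 @ b2                  # mean of log activity (5.3)
    return (log_pi                             # active day: delta_ij = 1
            + norm.logpdf(np.log(a_ij), loc=mu, scale=sigma1)
            - np.log(a_ij))                    # Jacobian term of the log-normal density
```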
Extreme ZI Log-normal for modeling Engagement - Note that $E_i$ also has a mix of zeros and positive observations, but the extreme zero events in the engagement variable are due to: (a) players being inactive on some days and, (b) active players on a day possibly not exhibiting engagement on the same day. To account for this excess prevalence of zeros, we use an Extreme Zero Inflated (EZI) Log-normal model that models $(\delta_{ij}, \gamma_{ij}, E_{ij})$ as a flexible mixture distribution with joint pdf
$$g_2(\delta_{ij}, \gamma_{ij}, E_{ij} \mid b^{(1)}_i, b^{(3)}_i, b^{(4)}_i) = (1 - \pi_{ij})\, I\{\delta_{ij} = 0\} + \pi_{ij}\, g_3(\gamma_{ij}, E_{ij} \mid b^{(3)}_i, b^{(4)}_i)\, I\{\delta_{ij} = 1\}, \tag{5.4}$$
where
$$g_3(\gamma_{ij}, E_{ij} \mid b^{(3)}_i, b^{(4)}_i) = (1 - q_{ij})\, I\{\gamma_{ij} = 0\} + q_{ij}\, (\sigma_2 E_{ij})^{-1}\, \phi\!\left(\frac{\log E_{ij} - \nu_{ij}}{\sigma_2}\right) I\{\gamma_{ij} = 1\}, \tag{5.5}$$
$$\mathrm{logit}(q_{ij}) = x^{(3)T}_{ij} \beta^{(3)} + z^{(3)T}_{ij} b^{(3)}_i \qquad \text{(binary part)} \tag{5.6}$$
$$\nu_{ij} = x^{(4)T}_{ij} \beta^{(4)} + z^{(4)T}_{ij} b^{(4)}_i \qquad \text{(positive engagement part).} \tag{5.7}$$
Note that a player can potentially engage ($E_{ij} \ge 0$) only if she is active ($\delta_{ij} = 1$) on that day. Thus, $g_3(\gamma_{ij}, E_{ij} \mid b^{(3)}_i, b^{(4)}_i)$ in equation (5.4) represents the joint pdf of $(\gamma_{ij}, E_{ij})$ conditional on the event that the player is active, i.e., $\delta_{ij} = 1$. However, even if the player is active, the distribution of engagement can again be a mixture, as the particular player may or may not exhibit positive engagement ($E_{ij} > 0$). Thus, conditional on the player being active, we further model $(\gamma_{ij}, E_{ij})$ using another zero-inflated Log-normal model as shown in equation (5.5). By combining equations (5.4) and (5.5), we intuitively use the EZI model to split the players into two groups: (1) those who are not active and (2) those who are active. Then, conditional on being active, we further split the latter group of players into two additional segments: (1) those who do not engage ($\gamma_{ij} = 0$) and, (2) those who engage ($\gamma_{ij} = 1$) and thus demonstrate positive engagement ($E_{ij} > 0$). Finally, we complete the specification of the EZI Log-normal model by connecting the binary response $\gamma_{ij} \mid \delta_{ij} = 1$ with the covariates and the random effects through a logit link in equation (5.6), and we use an identity link for the expected log engagement $\nu_{ij}$ in equation (5.7). Note that even though we model $\delta_{ij}$ in equations (5.1) and (5.4) using $g_1$ and $g_2$, respectively, there is no discordance as $g_1(\delta_{ij}) = g_2(\delta_{ij})$ for all $(i,j)$.

Model for dropouts - For the discrete time hazard of dropout, we model $h_{ij} := P(D_i = j \mid D_i \ge j, b_i)$ as
$$\mathrm{logit}(h_{ij}) = x^{(5)T}_{ij} \beta^{(5)} + \alpha^T b_i, \tag{5.8}$$
and the pmf of $D_i^{\ast}$ is
$$g_4(D_i^{\ast} = d \mid b_i) = \Big\{\prod_{j=1}^{d-1} (1 - h_{ij})\Big\}\, h_{id}^{\Delta_i}\, (1 - h_{id})^{1 - \Delta_i},$$
where $\Delta_i = I(D_i \le C_i)$ is the indicator of dropout occurrence. Here $\alpha$ is a parameter vector that relates the longitudinal outcomes and the dropout time via the random effects $b_i$. This approach to modeling the dropouts through equation (5.8) is analogous to the shared parameter models in clinical trials that are used to account for potentially Not Missing At Random (NMAR) responses (see Vonesh et al. (2006), Guo and Carlin (2004) for example). If $\alpha = 0$ then the dropout is ignorable given the observed data. Figure 5.3 contains a schematic diagram of our joint model.
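For concreteness, here is a small sketch of the log-pmf $g_4$ implied by the discrete-time hazard specification (5.8) for a single player. The names and signature are illustrative choices of ours, not the chapter's implementation.

```python
import numpy as np

def dropout_logpmf(event, X5, beta5, alpha, b_i):
    """Log-pmf of the observed dropout time under the discrete-time hazard (5.8).

    event : 1 if dropout is observed, 0 if the player is censored (Delta_i)
    X5    : (d, p) covariate rows x_{ij}^{(5)} for the observed days j = 1, ..., d
    """
    eta = X5 @ beta5 + alpha @ b_i         # logit of the hazard h_ij for each day
    log_h = -np.logaddexp(0.0, -eta)       # log h_ij
    log_1m_h = -np.logaddexp(0.0, eta)     # log(1 - h_ij)
    surv = log_1m_h[:-1].sum()             # survival through days 1, ..., d-1
    last = event * log_h[-1] + (1 - event) * log_1m_h[-1]
    return surv + last
```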
Correlating the random effects and linking the sub-models - All the sub-models described above carry information about the playing behavior of individuals and are therefore inter-related. To get the complete picture and to account for the heterogeneity across individuals' outcomes, one must combine these effects by correlating the multiple outcomes. Without inter-relating or jointly considering these outcomes, it is not only hard to answer questions about how the evolution of one response (e.g., activity) is related to the evolution of another (e.g., engagement), or who is likely to drop out, but it is also problematic to model the heterogeneity. In such cases, it is natural to consider models where the dependency among the responses is incorporated via one or more latent variables. A flexible solution is to model the association between the different responses by correlating the random heterogeneous effects from each of the responses. In our joint modeling approach, random effects are assumed for each longitudinal response and they are associated by imposing a joint multivariate distribution on the random effects, i.e., $b_i = (b^{(s)}_i : 1 \le s \le 4) \sim N(0, \Psi)$. Such a model borrows information across the various touch points and offers an intuitive way of describing the dependency between the responses. For example, questions such as "is engagement related to activity for an individual?" or "does higher activity increase the probability of engagement?" can be answered using the estimated covariance structure $\Psi$. Furthermore, we assume that the dependency between the longitudinal outcomes and the risk of dropout is described by the random effects $b_i$ and the covariates. In our context this is reasonable since, for instance, the longitudinal outcome AI may characterize player engagement, and player engagement can in turn influence the risk of dropout.

Figure 5.3: Schematic diagram of our joint model for player $i$. The suffix denoting the day number is dropped for presentational ease.

Table 5.1: Parameter constraints and their interpretation. Here $\beta^{(s)}_{\mathrm{prom}(i)}$ indicates the fixed effect coefficient for promotion $i = I,\ldots,VI$ under model $s = 1,\ldots,5$.

$\beta^{(s)}_{\mathrm{weekend}} \ge 0$, for all $s$ : Expect increased player activity on weekends.
$\beta^{(s)}_{\mathrm{timesince}} \le 0$, for all $s$ : Expect lower player activity as time since last login increases.
$\beta^{(1)}_{\mathrm{promV}}, \beta^{(1)}_{\mathrm{promIV}} \ge 0$ : Expect promotions IV and V to increase player activity.
$\beta^{(1)}_{\mathrm{promV}} \ge \beta^{(1)}_{\mathrm{promIV}}$ : Expect promotion V to have a higher positive impact on player activity than promotion IV.
$\beta^{(2)}_{\mathrm{prom}(i)} \ge 0$ for $i \ne IV$ : All promotions other than IV have a non-negative impact on activity.
$\beta^{(2)}_{\mathrm{promIII}} \ge \beta^{(2)}_{\mathrm{promV}}$ : Promotion III leads to a higher increase in activity than promotion V.
$\beta^{(2)}_{\mathrm{promVI}} \ge \beta^{(2)}_{\mathrm{promII}} \ge \beta^{(2)}_{\mathrm{promV}}$ : Promotion VI has the largest positive impact on activity, followed by promotions II and V.
$\beta^{(2)}_{\mathrm{promV}} \ge \beta^{(2)}_{\mathrm{promIV}}$ : Promotion V leads to a higher increase in activity than promotion IV.

5.3.2 Parametric Constraints

The CEZIJ framework can incorporate any convexity constraints on the fixed effects: $f^{(s)}(\beta^{(s)}) \le 0$, $s = 1,\ldots,5$, where $f^{(s)}$ is any pre-specified convex function. In the mobile game platform modeling application, domain expertise can be incorporated into our framework via these constraints. For example, industry belief dictates that, all other factors remaining fixed, players have a higher chance of being active in the game on weekends than on weekdays. Thus a sign constraint ($\beta^{(s)}_{\mathrm{weekend}} \ge 0$) on the unknown fixed effect coefficient of the variable that indicates whether day $j$ is a weekend is a simple yet effective way to include this additional information in our estimation framework.
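As a concrete illustration (our own, not the authors' code), sign and monotonicity restrictions of the kind listed in Table 5.1 can be encoded as rows of an affine system $C\beta \le d$ and passed to any constrained solver; the coefficient names and ordering below are hypothetical.

```python
import numpy as np

# Encode two Table 5.1 constraints as rows of C @ beta <= d for one sub-model.
names = ["weekend", "promII", "promV", "promVI"]     # hypothetical coefficient order
idx = {v: k for k, v in enumerate(names)}

C, d = [], []

# beta_weekend >= 0        <=>   -beta_weekend <= 0
row = np.zeros(len(names)); row[idx["weekend"]] = -1.0
C.append(row); d.append(0.0)

# beta_promVI >= beta_promII   <=>   beta_promII - beta_promVI <= 0
row = np.zeros(len(names)); row[idx["promII"]] = 1.0; row[idx["promVI"]] = -1.0
C.append(row); d.append(0.0)

C, d = np.vstack(C), np.asarray(d)   # hand (C, d) to the constrained optimizer
```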
Also, the different promotional and retention strategies used in these games are incorporated in the model as fixed effects through binary variables demarcating the days on which they were applied (see figure D.3 in Appendix D.4 for the distribution of the various promotion strategies across the $m = 60$ days). These strategies often have previously known efficacy levels, which imply monotonicity constraints on their effects. For example, the email and app-messaging based retention scheme should have at least a non-negative incremental effect on the daily usage probabilities $\pi_{ij}$; the engagement effect of a promotion that offers a sale on only selected robots cannot exceed the incremental effect of a sale on all robots. As such, in our mobile game application, we assimilate this side information through structured affine constraints: $C^{(s)} \beta^{(s)} \le d^{(s)}$, $s = 1,\ldots,5$, where $C^{(s)}$ and $d^{(s)}$ are known. Details about these constraints are provided in tables 5.1 and D.5 (in Appendix D.4), where we describe the six promotion strategies and the constraints that have been included in our estimation framework along with their business interpretation.

5.4. Variable selection in CEZIJ

In the absence of any prior knowledge regarding the variables that may appear in the true model, we conduct automated variable selection. Selection of fixed and random effect components in a mixed model framework has received considerable attention. Under the special case of a linear mixed effect model, Bondell et al. (2010) and Ibrahim et al. (2011) proposed penalized likelihood procedures to simultaneously select fixed and random effect components, while Fan and Li (2012), Peng and Lu (2012) and Lin et al. (2013) conduct selection of fixed and random effects using a two stage approach. Procedures to select only the fixed effects or only the random effects have also been proposed under a GLMM framework; see Pan and Huang (2014) and the references therein. Simultaneous selection of fixed and random effect components in a GLMM framework is, however, computationally challenging. The high dimensional integral with respect to the random effects in the marginal likelihood of a GLMM often has no analytical form, and several approaches have been proposed to tackle this computational hurdle: for example, Laplacian approximations (Tierney and Kadane 1986), adaptive quadrature approximations (Rabe-Hesketh et al. 2002), penalized quasi likelihood (PQL) (Breslow and Clayton 1993) and the EM algorithm (McCulloch 1997). We use a penalized EM algorithm, and for proper interpretation of composite effects we conduct joint variable selection of fixed and random effects in a hierarchical manner, which ensures that non-zero random effects in the model are accompanied by their corresponding non-zero fixed effects. Let $\theta = \big(\beta^{(1)},\ldots,\beta^{(5)}, \sigma_1, \sigma_2, \alpha, \mathrm{vec}(\Psi)\big) := \big(\eta, \mathrm{vec}(\Psi)\big)$ denote the vector of all parameters to be estimated.
The marginal log-likelihood of the observed data under the joint model is
$$\ell(\theta) = \sum_{i=1}^{n} \log \int p\big(\delta_i, A_i, \gamma_i, E_i, D_i^{\ast} \mid b_i, \theta\big)\, p(b_i \mid \theta)\, db_i = \sum_{i=1}^{n} \ell_i(\theta),$$
where
$$\ell_i(\theta) = -\tfrac{1}{2} \log|\Psi| + \log \int \exp\Big\{ \sum_{j=1}^{m_i} \log p\big(\delta_{ij}, A_{ij}, \gamma_{ij}, E_{ij}, D_i^{\ast} \mid b_i, \theta\big) - \tfrac{1}{2} b_i^T \Psi^{-1} b_i \Big\}\, db_i.$$
We estimate $\theta$ using the EM algorithm for joint models (Rizopoulos 2012), where we treat the random effects $b_i$ as 'missing data' and obtain $\widehat{\theta}$ by maximizing the expected value of the complete data likelihood $\ell^{cl}(\theta, b)$, where
$$\ell^{cl}(\theta, b) = -\frac{n}{2} \log|\Psi| + \sum_{i=1}^{n} \Big\{ \sum_{j=1}^{m_i} \log p\big(\delta_{ij}, A_{ij}, \gamma_{ij}, E_{ij}, D_i^{\ast} \mid b_i, \theta\big) - \frac{1}{2} b_i^T \Psi^{-1} b_i \Big\} = \sum_{i=1}^{n} \ell^{cl}_i(\theta, b_i).$$
Denote the Q-function by $\ell_Q(\theta) = \sum_{i=1}^{n} \mathbb{E}\, \ell^{cl}_i(\theta, b_i)$, where the expectation is over the conditional distribution of $b_i$ given the observations at the current parameter estimates. We solve the following maximization problem involving a penalized Q-function for variable selection:
$$\max_{\theta:\, \Psi \succeq 0}\ \ \ell_Q(\theta) - n\lambda \sum_{s=1}^{5} \sum_{r=1}^{p} \Big\{ c_{sr}\, |\beta_{sr}| + d_{sr}\, \Psi^{(s)}_{rr}\, I\{r \in \mathcal{I}_c\} \Big\} \quad \text{subject to } f^{(s)}(\beta^{(s)}) \le 0,\ s = 1,\ldots,5. \tag{5.9}$$
Here, $\beta^{(s)} = \{\beta_{sr} : r \in \mathcal{I}\}$, and $\Psi$ is notationally generalized to include random effects corresponding to all $p$ fixed effects (time invariant or not) by introducing harmless zero rows and columns corresponding to the time-invariant effects. This is done for presentational ease only, to keep the indices the same for the fixed and random effects, and such a degenerate large matrix never appears in the computations. Also, $\Psi^{(s)}_{rr}$ is the $r$th element of the vector $\big(\Psi_{1+p_c(s-1),\,1+p_c(s-1)},\ldots,\Psi_{p_c s,\, p_c s}\big)$, which represents the diagonal of the segmented covariance matrix corresponding to the $s$th model, $\mathcal{I}_c$ is the index set of all composite effects, and $\lambda > 0$ is the common regularization parameter, which is chosen using a BIC-type criterion (Bondell et al. 2010, Lin et al. 2013, Hui et al. 2017a) given by $\mathrm{BIC}_{\lambda} = -2\, \ell_Q(\widehat{\theta}) + \log(n)\, \mathrm{dim}(\widehat{\theta})$, where $\mathrm{dim}(\widehat{\theta})$ is the number of non-zero components in $\widehat{\theta}$.

In many practical applications the composite effects impose the following hierarchy between fixed and random effects: a random component can have a non-zero coefficient only if its corresponding fixed effect is non-zero (Hui et al. 2017a). To induce such hierarchical selection of fixed and random effects, we solve equation (5.9) using a re-weighted $\ell_1$ minimization algorithm that alternates between estimating $\theta$ and redefining the data-driven weights $(c_{sr}, d_{sr}) \in \mathbb{R}^2_{+}$, such that the weights used in any iteration are computed from the solutions of the previous iteration and are designed, through their construction, to maintain the hierarchy in selecting the fixed and random effects (see Candes et al. (2008), Zhao and Kočvara (2015), Lu et al. (2015) for details on these kinds of approaches). Suppose $\widehat{\theta}^{(t)}$ denotes the solution to the maximization problem in equation (5.9) at iteration $t$. Then for iteration $(t+1)$ we set
$$c^{(t)}_{sr} = \min\Big\{ |\widehat{\beta}^{(t)}_{sr}|^{-\kappa},\ \epsilon_1^{-1} \Big\} \quad \text{and} \quad d^{(t)}_{sr} = \min\Big\{ \big( |\widehat{\Psi}^{(s,t)}_{rr}|\, |\widehat{\beta}^{(t)}_{sr}| \big)^{-\kappa},\ \epsilon_2^{-1} \Big\}$$
with $\kappa = 2$. We take $\epsilon_1 = 10^{-2}$ to provide numerical stability and to allow a non-zero estimate in the next iteration given a zero valued estimate in the current iteration (Candes et al. 2008), and we fix $\epsilon_2 = 10^{-4}$ to enforce a large penalty on the corresponding diagonal element of $\Psi$ in iteration $(t+1)$ whenever $|\widehat{\beta}^{(t)}_{sr}| = 0$. Note that whenever $r \in \mathcal{I}_c$, the penalty $d_{sr}$ on the diagonal elements of $\Psi$ encourages hierarchical selection of the random effects. In appendix D.3.1 we conduct simulation experiments to demonstrate this property of our re-weighted $\ell_1$ procedure for solving equation (5.9).
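The weight update can be summarized in a few lines. The sketch below is our reading of the capped inverse-magnitude rule above (with $\kappa$, $\epsilon_1$, $\epsilon_2$ as defined there) and uses an illustrative array layout, not the chapter's actual code.

```python
import numpy as np

def update_weights(beta_hat, psi_diag_hat, kappa=2.0, eps1=1e-2, eps2=1e-4):
    """Data-driven weights (c_sr, d_sr) for the next re-weighted l1 iteration.

    beta_hat     : current fixed-effect estimates, shape (5, p)
    psi_diag_hat : current diagonal of Psi arranged as (4, p), with zeros in the
                   columns of non-composite effects (illustrative layout)
    """
    with np.errstate(divide="ignore"):
        c = np.minimum(np.abs(beta_hat) ** (-kappa), 1.0 / eps1)
        d = np.minimum((psi_diag_hat * np.abs(beta_hat[:4])) ** (-kappa), 1.0 / eps2)
    # A zero beta keeps c capped at 1/eps1, so that effect can re-enter later, while it
    # pushes d to the cap 1/eps2, heavily penalizing the matching random-effect variance.
    return c, d
```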
We end this section with the observation that although the maximization problem based on criterion (5.9) does not conduct any selection of the association parameters, it achieves that goal implicitly through the selection of the random effects.

5.5. Estimation procedure

In this section, we discuss two key aspects of the estimation process.

Solving the maximization problem - We use an iterative algorithm to solve the maximization problem in equation (5.9), which is analogous to the Monte Carlo EM (MCEM) algorithm of Wei and Tanner (1990). Let $\theta^{(t)}$ denote the parameter estimates at iteration $t$. In iteration $t+1$, the MCEM algorithm performs the following two steps until convergence.

E-step. Recall $Y_i = [Y^{(s)}_i, s = 1,\ldots,5]$. Evaluate $\ell_{Q^{(t)}}(\theta) = \sum_{i=1}^{n} \mathbb{E}_{b_i \mid \theta^{(t)}, Y_i}\, \ell^{cl}_i(\theta, b_i)$, where the expectation is taken with respect to the conditional distribution of $b_i$ given the observations $Y_i$ at the current estimates $\theta^{(t)}$. Thus,
$$\mathbb{E}_{b_i \mid \theta^{(t)}, Y_i}\, \ell^{cl}_i(\theta, b_i) = \int \ell^{cl}_i(\theta, b_i)\, p\big(b_i \mid \delta_i, A_i, \gamma_i, E_i, D_i^{\ast}, \theta^{(t)}\big)\, db_i = \exp\{-\ell_i(\theta^{(t)})\} \int \ell^{cl}_i(\theta, b_i)\, p\big(\delta_i, A_i, \gamma_i, E_i, D_i^{\ast} \mid \theta^{(t)}, b_i\big)\, \varphi_{p_c}\big(b_i \mid 0, \Psi^{(t)}\big)\, db_i,$$
where $\varphi_{p_c}(\cdot \mid 0, \Psi^{(t)})$ is the multivariate normal density with mean $0$ and covariance $\Psi^{(t)}$. Note that the expectation involves a multivariate integration with respect to the random effects $b_i$, which we evaluate by Monte Carlo integration. We approximate it as
$$\frac{\sum_{d=1}^{D} \ell^{cl}_i(\theta, b^{d}_i)\, p\big(\delta_i, A_i, \gamma_i, E_i, D_i^{\ast} \mid \theta^{(t)}, b^{d}_i\big)}{\sum_{d=1}^{D} p\big(\delta_i, A_i, \gamma_i, E_i, D_i^{\ast} \mid \theta^{(t)}, b^{d}_i\big)},$$
where $b^{d}_i$ is a random sample from $\varphi_{p_c}(\cdot \mid 0, \Psi^{(t)})$ and $D = 2000$ is the number of Monte Carlo samples.

M-step. Solve the following maximization problem with the data driven adaptive weights $(c^{(t)}_{sr}, d^{(t)}_{sr})$:
$$\theta^{(t+1)} = \arg\max_{\theta:\, \Psi \succeq 0}\ \ \ell_{Q^{(t)}}(\theta) - n\lambda \sum_{s=1}^{5} \sum_{r=1}^{p} \Big\{ c^{(t)}_{sr}\, |\beta_{sr}| + d^{(t)}_{sr}\, \Psi^{(s)}_{rr}\, I\{r \in \mathcal{I}_c\} \Big\} \quad \text{subject to } f^{(s)}(\beta^{(s)}) \le 0,\ s = 1,\ldots,5.$$
The maximization problem above decouples into separate components that estimate the $\beta^{(s)}$ as solutions to convex problems and $\Psi$ as the solution to a non-convex problem. To solve the convex problems involving $\beta^{(s)}$, we use a proximal gradient descent algorithm after reducing the original problem to an $\ell_1$ penalized least squares fit with convex constraints; see James et al. (2019) for related approaches of this kind. For estimating $\Psi$, we use the coordinate descent algorithm of Wang (2014), which solves a lasso problem and updates one column and row at a time while keeping the rest fixed. Further details regarding our estimation procedure are presented in Appendix D.1.
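The Monte Carlo approximation in the E-step amounts to a self-normalized average over draws from the current random-effects distribution. The snippet below is a small illustrative sketch of that computation for a single player, with names and a signature of our own choosing rather than the chapter's code.

```python
import numpy as np

def mc_estep(loglik_given_b, lcl_given_b, psi_t, D=2000, rng=None):
    """Approximate E[l^cl_i(theta, b_i) | Y_i, theta^(t)] for one player, drawing the
    proposal samples from the current random-effects prior N(0, Psi^(t)).

    loglik_given_b : callable b -> log p(Y_i | theta^(t), b)
    lcl_given_b    : callable b -> l^cl_i(theta, b) for the theta being optimized
    """
    rng = np.random.default_rng() if rng is None else rng
    dim = psi_t.shape[0]
    draws = rng.multivariate_normal(np.zeros(dim), psi_t, size=D)   # b_i^1, ..., b_i^D
    logw = np.array([loglik_given_b(b) for b in draws])
    w = np.exp(logw - logw.max())                                   # stabilized weights
    w /= w.sum()
    vals = np.array([lcl_given_b(b) for b in draws])
    return float(np.dot(w, vals))                                   # self-normalized average
```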
Split and Conquer - To enhance the computational efficiency of the estimation procedure, we use the split-and-conquer approach of Chen and Xie (2014) to split the full set of $n$ players into $K$ non-overlapping groups and conduct variable selection separately in each group by solving $K$ parallel maximization problems of the form (5.9). Following Chen and Xie (2014), the selected fixed and random effects are then determined using a majority voting scheme across all $K$ groups, as described below. Let $\widehat{\beta}^{(s)}[k]$ denote the estimate of the fixed effect coefficients for model $s$, and $\widehat{\Psi}[k]$ the estimate of $\Psi$ (whose $p_c$ diagonal elements corresponding to model $s$ are of interest), obtained on split $k$ by solving the maximization problem (5.9), where $k = 1,\ldots,K$. We construct the sets of selected effects as
$$\text{Set of Fixed Effects: } \widehat{\mathcal{I}}^{(s)} = \Big\{ r : \sum_{k=1}^{K} I\big(\widehat{\beta}_{sr}[k] \neq 0\big) \ge w_0,\ r = 1,\ldots,p \Big\},$$
$$\text{Set of Random Effects: } \widehat{\mathcal{I}}^{(s)}_{R} = \Big\{ r : \sum_{k=1}^{K} I\big(\widehat{\Psi}[k]_{r + p_c(s-1),\, r + p_c(s-1)} > 0\big) \ge w_1,\ r = 1,\ldots,p_c \Big\}.$$
Here $w_0, w_1$ are pre-specified thresholds determining the severity of the majority voting scheme. For large datasets, as in the mobile apps application, a distributed computing framework utilizing the above scheme leads to a substantial reduction in computation time. Appendix D.3 presents a discussion of the split-and-conquer approach along with numerical experiments that demonstrate the applicability of this method in our setting, where data driven adaptive weights are used in the penalty and variable selection is conducted simultaneously across multiple models. Finally, based on the selected fixed and random effect components in $\widehat{\mathcal{I}}^{(s)}$ and $\widehat{\mathcal{I}}^{(s)}_{R}$, we use the entire data and estimate their effects more accurately by maximizing the likelihood based on only those components using the standard EM algorithm.
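A minimal sketch of the voting step follows; the array shapes and names are hypothetical, and in practice the $K$ split-wise fits would be computed in parallel before this aggregation.

```python
import numpy as np

def majority_vote(beta_hats, psi_diag_hats, w0, w1):
    """Combine K split-wise fits by majority voting.

    beta_hats     : array (K, 5, p) of split-wise fixed-effect estimates
    psi_diag_hats : array (K, 4, p_c) of split-wise random-effect variance estimates
    Returns, per sub-model, the indices of the retained fixed and random effects.
    """
    fixed_votes = (beta_hats != 0).sum(axis=0)         # (5, p): votes per coefficient
    random_votes = (psi_diag_hats > 0).sum(axis=0)     # (4, p_c): votes per variance
    fixed_sets = [np.flatnonzero(fixed_votes[s] >= w0) for s in range(5)]
    random_sets = [np.flatnonzero(random_votes[s] >= w1) for s in range(4)]
    return fixed_sets, random_sets
```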
5.6. Analysis of freemium mobile games using CEZIJ

We apply our proposed CEZIJ methodology to the freemium mobile game data discussed in chapter 5.2. This dataset holds player level gaming information for 38,860 players observed over a period of 60 days. The analyses presented here use a sample of 33,860 players for estimation and the remaining 5,000 players for out of sample validation. See Appendix D.4 for a detailed description of the data. For sub-models $s = 1,\ldots,4$, we consider a set of 30 predictors, of which 24 can have composite effects. The 24 composite effects are listed in table D.3 (Serial No. 1-24) of Appendix D.4. The remaining 6 predictors are the 6 promotion strategies summarized in table D.5 of Appendix D.4. We treat these promotion strategies as potential fixed effects with no corresponding random effect counterparts. For the dropout model, which shares its random effects with the four sub-models, the entire list of 30 candidate predictors is taken as potential fixed effects. Overall, the selection mechanism must select random effects from a set of 100 potential random effects (24 for each of the four sub-models and their 4 intercepts) and fixed effects from a set of 155 potential fixed effects (30 for each of the five sub-models and their 5 intercepts). To model the responses at time point $j$, we consider the time $j-1$ values of those predictors that contain gaming characteristics of a player, simply because at time $j$ these characteristics are known only up to the previous time point $j-1$. These gaming characteristics are marked in table D.3 in Appendix D.4. All the remaining predictors, like the weekend indicator and the 6 indicator variables corresponding to the promotion strategies, are applied at time $j$. We initialize the CEZIJ algorithm by fitting a saturated model on a subset of 200 players, which was also used to initialize the weights $c_{sr}, d_{sr}$ in criterion (5.9). Finally, our algorithm is run on $K = 20$ splits, where each split holds $n_k = 1693$ randomly selected players, the majority voting parameters $w_0, w_1$ are fixed at 12, and the regularization parameter $\lambda_k$ is chosen as the value of $\lambda \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 0.25, 0.5, 1, 5\}$ that minimizes $\mathrm{BIC}_{\lambda}$. Table D.6 in Appendix D.5 presents the voting results for each candidate predictor across the 20 splits.

5.6.1 The fitted joint model and its interpretations

The final list of selected predictors and their estimated fixed effect coefficients for the sub-models of Activity Indicator (AI), Activity Time (daily total time played), Engagement Indicator (EI), Engagement Amount and Dropout is presented in Table 5.2. See Table D.3 (Appendix D.4) for a description of the covariates. The selected composite effects are those predictors whose coefficient estimates are marked in Table 5.2. All the selected fixed and random effects obey the hierarchical structure discussed in chapter 5.4. We next discuss the fitted coefficients for each sub-model.

Activity Indicator - For modeling the probability of AI, the CEZIJ methodology selects 18 fixed effects, of which 14 are composite effects. As AI forms the base of our joint model, the fixed effects of its estimated marginal distribution have the least nuanced interpretation among the 5 sub-models. All other things remaining constant, there is an overall increase in the odds of AI by 35% on weekends and an 8% increase in the odds for each level advancement in the game. Similarly, the conditional odds are boosted by 20%, 14% and 9%, respectively, if the gacha game was played, or robot purchases or upgrades were made, on the previous days. Absence of log-in on the previous day adversely affects the odds, with an average decrement of 88% for each absent day. Promotions II and VI, which provide sales on robots at different dosages, are positively associated and increase the odds of AI by approximately 20% and 30%, respectively.

Activity Time - In this case the selection mechanism selects 17 fixed effects, of which 15 are composite effects. The signs of the coefficients of timesince and weekend align with the constraints imposed on them and, along with game characteristics like the number of primary and auxiliary fights played, level progressions, and robot upgrades, continue to provide a similar interpretation as in the AI model. This is the second layer of the joint model, which is conditioned on a positive login occurrence. A key difference between these two models, however, lies in the inclusion of the predictors avg session length, gacha sink and pfight source. They indicate that, keeping other things fixed, players interacting with the game through spending in-game currencies, or winning them through principal fights, on the previous day have a natural incentive to spend more in-game time on the following day. In line with the monotonicity constraints imposed on the promotion strategies for this model, the coefficient for promotion VI is both positive and bigger than the coefficient for promotion II, indicating that the strategy promoting a sale on all robots has a higher impact on activity time than the strategy offering the special 'Boss' robots at a discount.

Engagement Indicator and Amount - Recall from chapter 5.3 that we use an EZI Log-normal model for the engagement amount by first building a separate model for the probability of EI given activity. For the sub-models of EI and the engagement amount, the CEZIJ methodology selects, respectively, 22 fixed effects of which 16 are composite effects, and 6 fixed effects of which 2 are composite effects. Direct interpretation of the fixed effect coefficients is difficult here, as this sub-model is conditioned on the first two sub-models.
We see that some of the key player engagement characteristics, like the number of auxiliary fights played, level progression, and in-app virtual currency spent and earned, seem to positively impact the conditional likelihood of positive engagement at subsequent time points. A significant finding is that, among the three different fight modes, only the second auxiliary fight mode, which involves time restricted fights, appears to lead to substantially higher player engagement, implying that, all other variables remaining constant, player engagement in promoting the game through social media is higher while playing time attack fights.

Table 5.2: Selected fixed effect coefficients and their estimates under the sub-models Act. Indicator, Activity Time, Engag. Indicator, Engag. Amount and Churn. The selected random effects are those variables that exhibit a marker over their estimates. See Table D.3 for a detailed description of the covariates.

Predictors | Act. Indicator $\widehat{\beta}^{(1)}$ | Activity Time $\widehat{\beta}^{(2)}$ | Engag. Indicator $\widehat{\beta}^{(3)}$ | Engag. Amount $\widehat{\beta}^{(4)}$ | Churn $\widehat{\beta}^{(5)}$
intercept | -4.648 | 0.932 | -1.560 | 0.953 | -1.902
avg session length | – | 0.269 | 0.198 | – | –
p fights | 0.378 | 0.169 | -0.126 | – | –
a1 fights | 0.303 | 0.379 | 0.274 | – | –
a2 fights | 0.334 | 0.216 | -0.492 | – | –
level | 0.084 | 0.304 | 0.282 | – | –
robot played | – | – | – | – | –
gacha sink | – | 0.201 | 0.509 | – | 0.129
gacha premium sink | – | – | – | – | –
pfight source | – | 0.144 | – | – | –
a1fight source | 0.030 | -0.239 | -0.727 | – | –
a2fight source | -0.240 | -0.192 | 0.482 | 0.331 | –
gacha source | 0.182 | -0.212 | – | – | –
gacha premium source | – | – | 0.133 | – | –
robot purchase count | 0.134 | – | – | – | –
upgrade count | 0.093 | 0.112 | 0.404 | – | –
lucky draw wg | – | – | -0.240 | – | –
timesince | -2.065 | -0.641 | -0.229 | – | 3.502
lucky draw og | -0.230 | – | -0.469 | – | –
fancy sink | – | – | -0.110 | – | –
upgrade sink | 0.037 | – | -0.272 | – | –
robot buy sink | – | – | 0.159 | – | –
gain gachaprem | – | – | – | – | –
gain gachagrind | -0.127 | 0.180 | – | – | –
weekend | 0.302 | 0.358 | – | – | –
promotion I | – | – | – | -1.153 | -0.894
promotion II | 0.178 | 0.134 | -0.189 | – | -0.934
promotion III | -0.129 | – | -0.166 | -1.791 | -3.500
promotion IV | – | – | 0.164 | -3.345 | -0.673
promotion V | – | – | -5.000 | – | 0.828
promotion VI | 0.290 | 0.249 | 0.131 | -2.389 | -1.509

Dropout - In this case, the selection mechanism selects 9 fixed effects. The sign of the coefficient for timesince is positive, which is natural, and indicates that players who do not frequent the game often (low frequency of AI) exhibit a high likelihood of dropping out at subsequent time points. It is also interesting to see, through gacha sink, that, all else being equal, players who spend more of their virtual currencies on gacha exhibit a high likelihood of dropping out at subsequent time points. This can potentially be explained through a "make-gacha-work-for-all-players" (Agelle 2016) phenomenon, where the player spends a major portion of her virtual currency on gacha but the value of the items won is largely worthless compared to the amount of currency spent, thus inducing a lack of interest in the game at future time points. All the promotions, with the exception of promotion V, reduce the odds of dropout, validating their usage as retention schemes.

Figure 5.4: Heatmap of the $47 \times 47$ correlation matrix obtained from $\widehat{\Psi}$. On the horizontal axis are the selected composite effects of the four sub-models: AI, Activity Time, EI and Engagement Amount. The horizontal axis begins with the intercept from the AI model and ends with a2fight source from the Engagement Amount model.
From the heatmap in figure 5.4, the random effects of the selected composite effect predictors demonstrate correlations within the four sub-models that were modeled jointly, indicating that players exhibit idiosyncratic profiles over time. Moreover, we notice several instances of cross correlations across the four sub-models. For example, from figure 5.5, the random effect associated with the number of championship fights played (predictor p fights) in the AI model has a positive correlation with the amount of virtual currency earned through auxiliary fights (predictor a2fight source) in the model for Activity Time, which suggests that the modeled responses are correlated for a player. Our joint model allows us to borrow information across these related responses and may aid game managers and marketers in understanding how the outcomes depend on each other.

Figure 5.5: Two networks that demonstrate several cross correlations across the models. A blue line represents a positive correlation and a red line represents a negative correlation. The model numbers are inside the parentheses next to the predictor names. Left: Key cross correlations between the sub-models AI and Activity Time. Right: Key cross correlations between the sub-models Activity Time and EI.

5.6.2 Out of sample validation

We use the hold-out sample of 5,000 players from the original data to assess the predictive accuracy of our model. Our scheme consists of predicting the four outcomes - AI, activity time, EI and engagement amount - dynamically over the next 29 days using the fitted model discussed in chapter 5.6.1. Note that the time frame of prediction covers the first 30 days of game usage for each player, so by definition no player drops out, which leaves us with the aforementioned four outcomes to predict. As benchmarks to our fitted model, we consider four competing models, Benchmark I to Benchmark IV, which we describe below.

For Benchmark I we consider a setup where there are no random effects, the outcomes are not modeled jointly and variable selection is conducted using the R package glmmLasso (Schelldorfer et al. 2014), which uses an $\ell_1$-penalized algorithm for fitting high-dimensional generalized linear mixed models (GLMMs), with logit links for AI and EI and identity links for the two continuous outcomes of positive activity time and engagement amount. In the case of Benchmark II, we continue to model the outcomes separately and use the R package rpql (Hui et al. 2017b), which performs joint selection of fixed and random effects in GLMMs using regularized PQL (Breslow and Clayton 1993), with link functions similar to those used in Benchmark I. The remaining two benchmark models rely on the selected variables from the CEZIJ model itself and do not conduct variable selection of their own. In particular, Benchmark III uses the selected predictors from the CEZIJ methodology and models the outcomes via generalized linear models, with logit links for AI and EI and identity links for the two continuous outcomes of positive activity time and engagement amount. Thus Benchmark III, like Benchmark I, represents a setup where there are no random effects and the outcomes are not modeled jointly. Benchmark IV, on the other hand, represents a more sophisticated setup, in that it resembles the fitted CEZIJ model in every aspect except that the random effects across the four sub-models are not correlated.
It achieves this by using the selected fixed and composite effects from the CEZIJ model but employing a slightly modified covariance matrix in which the covariances between random effects originating from different sub-models are set to 0, thus representing a setup where the outcomes are not modeled jointly.

The out of sample validation requires predicting the responses dynamically over time. For Benchmarks I and III this step is easily carried out by running the fitted model on the validation data. However, for Benchmarks II and IV and the CEZIJ model, the prediction mechanism must, respectively, estimate the latent random effects and appropriately account for the endogenous nature of the responses. To do that we utilize the simulation scheme discussed in section 7.2 of Rizopoulos (2012) and section 3 of Rizopoulos (2011), and calculate the expected time $j$ responses given the observed responses until time $j-1$, the estimated parameters and the event that the player has not churned until time $j-1$ (details provided in Appendix D.2).

Table 5.3: Results of predictive performance of the CEZIJ model and Benchmarks I to IV. For the activity and engagement indicators, the false positive (FP) rate / false negative (FN) rate averaged over the 29 time points are reported. For non-zero activity times and engagement amounts, the ratios of the prediction errors (5.10) of Benchmarks I to IV to the CEZIJ model, averaged over the 29 time points, are reported.

Sub-model | Benchmark I | Benchmark II | Benchmark III | Benchmark IV | CEZIJ
Activity Indicator | 1.32% / 6.71% | 0.27% / 7.83% | 1.19% / 6.33% | 5.92% / 4.15% | 5.86% / 4.12%
Total Time Played | 1.742 | 1.961 | 4.662 | 1.041 | 1
Engagement Indicator | 0.09% / 1.87% | 0% / 1.89% | 0.05% / 1.89% | 3.56% / 1.48% | 3.54% / 1.47%
Engagement Amount | 1.408 | 8.619 | 1.217 | 1.067 | 1

Table 5.3 summarizes the predictive performance of CEZIJ and the benchmark models. For AI and EI, table 5.3 presents, for each model, the false positive (FP) rate and the false negative (FN) rate, respectively, averaged over the 29 time points. The FP rate measures the percentage of cases where the model incorrectly predicted activity (or engagement), whereas the FN rate measures the percentage of cases where the model incorrectly predicted no activity (or no engagement). Benchmark II, for example, exhibits the lowest FP rate and the highest FN rate, followed by Benchmark III. The low FP rate of Benchmark II, however, belies the relatively poor performance of this model in predicting the zero inflated responses, which becomes apparent in its higher FN rates, especially for the EI model. The CEZIJ model, along with Benchmark IV, on the other hand, has the lowest FN rates, demonstrating their relatively superior ability to predict the zero inflated responses of AI and EI. For positive activity times and positive engagement values we take a slightly different approach and first calculate the time $j$ prediction errors $\mathrm{PE}_j$ for the benchmark models and CEZIJ as follows. For any model $M \in \{\text{Benchmark I},\ldots,\text{Benchmark IV}, \text{CEZIJ}\}$, we define $\mathrm{PE}^{M}_{j}$ for sub-model $s = 2$ at time $j = 1,\ldots,29$ as
$$\mathrm{PE}^{M}_{j}\big(Y^{(s)}, \widehat{Y}^{(s)}\big) = \sum_{i=1}^{n} \Big|\, \log \widetilde{Y}^{(s)}_{ij} - \log \widehat{Y}^{(s)}_{ij} \Big|, \tag{5.10}$$
where $\widetilde{Y}^{(s)}_{ij} = Y^{(s)}_{ij}$ if $\delta_{ij} = 1$ and $1$ otherwise, and, analogously, $\widehat{Y}^{(s)}_{ij}$ equals model $M$'s predicted activity time if its predicted activity indicator $\widehat{\delta}_{ij} = 1$ and $1$ otherwise.
The time $j$ prediction error for sub-model $s = 4$ is defined in a similar fashion, with $\delta_{ij}, \widehat{\delta}_{ij}$ replaced by $\gamma_{ij}, \widehat{\gamma}_{ij}$ respectively, and measures the total absolute deviation of the prediction from the truth at any time $j$. For notational convenience the dependence of $\mathrm{PE}^{M}_{j}$ on $\delta_{ij}, \widehat{\delta}_{ij}$ (or $\gamma_{ij}, \widehat{\gamma}_{ij}$) has been suppressed, but the inclusion of these predicted and observed indicators in equation (5.10) is aimed at exploiting the dependencies between the responses, if any. For the two sub-models ($s = 2, 4$), table 5.3 presents the ratio of the prediction errors of the benchmarks to the CEZIJ model averaged over the 29 time points, where a ratio in excess of 1 indicates a larger absolute deviation of the prediction from the truth than for the CEZIJ model. All benchmark models exhibit prediction error ratios bigger than 1, with Benchmarks II and III being the worst for the engagement amount and activity time models, respectively. Benchmark IV, on the other hand, profits from the structure of the various components of the CEZIJ model but is unable to account for the dependencies between the responses, which is reflected in its prediction error ratios being slightly bigger than 1; along with the CEZIJ model, however, it continues to demonstrate superior prediction error ratios across the two sub-models.

5.6.3 Player segmentation using predicted churn probabilities

Player sub-populations with similar churn characteristics over time provide valuable insights into user profiles that are more likely to drop out and can be used to design future retention policies specifically targeting those characteristics. In this section we use the fitted churn model of chapter 5.6.1 to predict the temporal trajectories of churn probabilities on a sample of 1,000 players who are 30 days into the game and use the predicted probabilities over the next 25 days to cluster the players into homogeneous sub-groups. The churn probabilities are predicted in a similar fashion as discussed in chapter 5.6.2 and Appendix D.2, where the churn probability at time $j$ is predicted conditional on the estimated parameters, the observed responses until time $j-1$ and the event that the player has not churned until time $j-1$. To determine the player subgroups, we use the R package fda.usc to cluster the rows of the $1000 \times 25$ predicted churn probability matrix using functional K-means clustering. We use the prediction strength algorithm of Tibshirani and Walther (2005) to determine the number of clusters.
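The segmentation step can be mimicked with ordinary K-means applied to the rows of the predicted churn-probability matrix. The sketch below is a simplified stand-in for the functional clustering carried out with the R package fda.usc; the function and variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_churn_curves(churn_probs, n_clusters=3, seed=0):
    """churn_probs: (n_players, n_days) matrix of predicted churn probabilities.

    Returns a segment label per player and the centroid curve of each segment."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(churn_probs)
    return km.labels_, km.cluster_centers_
```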
Figure 5.6: Functional cluster analysis of predicted dropout probabilities over time. The plot presents three cluster centroids. The shaded bands around the centroids are the 25th and 75th percentiles of the churn probabilities. The vertical shaded regions in the graph correspond to the days on which different promotion strategies were in effect. The number of clusters was identified using prediction strength (Tibshirani and Walther 2005).

In figure 5.6 the three cluster centroids segment the sample into groups that demonstrate distinct temporal churn profiles. For instance, cluster 3, which holds almost 48% of the players, exhibits rising churn probabilities until day 5, which then taper down under the influence of promotions I, II and VI. Cluster 2, with 34% of the players, has a different trajectory than cluster 3 and appears to respond favorably to promotion VI. Of particular importance are those players that belong to cluster 1, which holds 18% of the players and is characterized by rising churn probabilities over time. The churn profile of this cluster represents players who have been relatively inactive in the game and continue to be so even under the influence of the various promotion strategies. During days 17 to 19, their churn probabilities are predicted to diminish under the effect of promotion VI; however, subsequent promotions do not appear to have any favorable impact.

These segment curves suggest that there are some key differences in customer attrition patterns. For example, Cluster 1 shows increasing attrition rates over time, which suggests that the game is not able to retain these players. Cluster 2 shows increasing attrition initially, but then the attrition rate starts to decline significantly after 45 days. This segment is potentially beneficial to the platform as it demonstrates that there is a core set of players who are loyal to the game. Players in Cluster 3 on average start with a much higher attrition rate than the other two segments, but their attrition rate tapers down significantly after five days and then stays at a very low level over time. Interestingly, Cluster 3 seems to be responding to promotions I, II and IV. These differences in user behavior across the segments can be leveraged to increase the efficiency of player retention policies. They also suggest that the platform should adopt different business strategies. For instance, in Cluster 3 many players have been weeded out early. This indicates that short term visitors to the gaming portal have left the platform more quickly in Cluster 3 compared to the other two segments. So it is important for the platform to emphasize promotional activities that increase player engagement. On the contrary, for Clusters 1 and 2, it is important for the platform to emphasize promotional activities that increase player log-in or activity. This relative emphasis across the segments can increase the efficiency of marketing campaigns.

5.7. Discussion

We propose a very scalable joint modeling framework, CEZIJ, for unified inference and prediction of player activity and engagement in freemium mobile games. The rapid growth of mobile games globally has generated significant research interest in different business areas such as marketing, management and information sciences. Our proposed algorithm conducts variable selection while maintaining the hierarchical congruity of the fixed and random effects and produces models with interpretable composite effects. A key feature of our framework is that it allows the incorporation of side information and domain expertise through convexity constraints. We exhibit the superior performance of CEZIJ in producing dynamic predictions. It is also used to segment players based on their churn rates, with the analysis revealing several idiosyncratic player behaviors that can be used for targeted marketing of players in future freemium games. The segmentation findings have important business implications for the monetization of these platforms. They can be used to enhance the effectiveness and efficiency of promotional activities and also of future user acquisition and retention strategies. Our inferential framework is based on modern optimization techniques and is very flexible. It can be used in a wide range of big-data applications that require analyzing multiple high-dimensional longitudinal outcomes along with a time-to-event analysis.
In the future, we would like to extend our joint modeling program to provide comprehensive statistical guidance regarding the growth, development and optimal pricing of generic digital products that use the freemium model. For that purpose, it will be interesting to investigate extensions of the CEZIJ modeling framework, in particular the possibility of incorporating non-parametric components for modeling nonlinear time effects, since player behavior may change over time. Furthermore, the current dropout model in equation (5.8) may be enhanced to include more sophisticated structures involving a cumulative-effects parametrization and to conduct variable selection on the high-dimensional vector of association parameters, which the current CEZIJ framework implicitly achieves through the selection of the random effects. An alternative and computationally less demanding approach may be to consider the following low-dimensional representation, wherein the dropout model is of the form $\mathrm{logit}(\pi_{ij}) = x^{(5)T}_{ij}\beta^{(5)} + \sum_{s=1}^{4}\alpha_s\, z^{(s)T}_{ij} b^{(s)}_i$, so that the vector of association parameters is then only $4\times 1$. Finally, while the focus of this chapter is the CEZIJ modeling framework and its applicability to the disciplined study of freemium behavior and to other applications that require analyzing multiple high-dimensional longitudinal outcomes along with a time-to-event analysis, a natural extension of our work, as future research, will be targeted towards estimating standard errors of the estimated coefficients and constructing confidence intervals under the CEZIJ framework using ideas from recent developments in post-selection inference (see, for example, Javanmard and Montanari (2014) and Lee et al. (2016)).

Of the thousands of freemium games that are developed every month, very few go on to earn an adequate amount through IAP (in-app purchases). Most games resemble our data, where a significant part of the revenue is earned through in-game ads and social media usage. In such games, the low incidence of real-money purchases presents a challenge in model development, as the robustness of the estimated model coefficients would be significantly impacted if real-money purchases were modeled as a separate response variable. Thus, in games with very low IAP incidence it is useful to model the combined revenue using game-specific weights that blend direct and indirect engagement, as is done in this chapter. For games with a significant amount of IAP, we envision modeling direct and indirect engagement separately and studying their interactions.

Bibliography

F. Abramovich, Y. Benjamini, D. L. Donoho, and I. M. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist., 34:584–653, 2006.
F. Abramovich, V. Grinshtein, M. Pensky, et al. On optimality of Bayesian testimation in the normal means problem. The Annals of Statistics, 35(5):2261–2286, 2007.
P. Agelle. Getting gacha right: Tips for creating successful in-game lotteries. PocketGamer, 2016. Available at http://www.pocketgamer.biz/comment-and-opinion/63620/getting-gacha-right-tips-for-creating-successful-in-game-lotteries/.
J. Aitchison and I. R. Dunsmore. Statistical prediction analysis. Bulletin of the American Mathematical Society, 82(5):683–688, 1976.
A. Aizer and J. J. Doyle Jr. Juvenile incarceration, human capital, and future crime: Evidence from randomly assigned judges. The Quarterly Journal of Economics, 130(2):759–803, 2015.
M. Alfò, A. Maruotti, and G. Trovato. A finite mixture model for multivariate counts under endogenous selectivity.
Statistics and Computing, 21(2):185–202, 2011. AppBrain. Free vs. paid android apps. AppBrain, July 24, 2017. Available athttp://www.appbrain. com/stats/free-and-paid-android-applications. G. Appel, B. Libai, E. Muller, and R. Shachar. Retention and the monetization of apps. 2017. J. Baik and J. W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. Journal of Multivariate Analysis, 97(6):1382–1408, 2006. R. Bandari, S. Asur, and B. A. Huberman. The pulse of news in social media: Forecasting popularity. ICWSM, 12:26–33, 2012. S. Banerjee, B. P. Carlin, and A. E. Gelfand. Hierarchical modeling and analysis for spatial data. Crc Press, 2014. T. Banerjee, G. Mukherjee, and W. Sun. Adaptive sparse estimation with side information. Journal of the American Statistical Association, pages 1–15, 2019. F. Benaych-Georges and R. R. Nadakuditi. The singular values and vectors of low rank perturbations of large rectangular random matrices. Journal of Multivariate Analysis, 111:120–135, 2012. P. Bickel. Minimax estimation of a normal mean subject to doing well at a point. Recent Advances in Statistics (MH Rizvi, JS Rustagi, and D. Siegmund, eds.), Academic Press, New York, pages 511–528, 1983. J. Bien and R. J. Tibshirani. Sparse estimation of a covariance matrix. Biometrika, 98(4):807–820, 2011. N. Binkiewicz, J. T. V ogelstein, and K. Rohe. Covariate-assisted spectral clustering. arXiv preprint arXiv:1411.2158, 2014. H. D. Bondell, A. Krishna, and S. K. Ghosh. Joint variable selection for fixed and random effects in linear mixed-effects models. Biometrics, 66(4):1069–1077, 2010. 152 J. F. Bonnans and A. Shapiro. Perturbation analysis of optimization problems. Springer Science & Business Media, 2013. K. Boudreau, L. B. Jeppesen, and M. Miric. Freemium, network effects and digital competition: Evidence from the introduction of game center on the apple appstore. 2017. N. E. Breslow and D. G. Clayton. Approximate inference in generalized linear mixed models. Journal of the American statistical Association, 88(421):9–25, 1993. B. J. Bronnenberg, M. W. Kruger, and C. F. Mela. Database paper - the iri marketing data set. Marketing science, 27(4):745–748, 2008. B. J. Bronnenberg, J.-P. H. Dub´ e, and M. Gentzkow. The evolution of brand preferences: Evidence from consumer migration. American Economic Review, 102(6):2472–2508, 2012. L. D. Brown. In-season prediction of batting averages: A field test of empirical bayes and bayes method- ologies. The Annals of Applied Statistics, pages 113–152, 2008. L. D. Brown and E. Greenshtein. Nonparametric empirical bayes and compound decision approaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics, pages 1685–1704, 2009. L. D. Brown, E. Greenshtein, and Y . Ritov. The poisson compound decision problem revisited. Journal of the American Statistical Association, 108(502):741–749, 2013. L. D. Brown, G. Mukherjee, and A. Weinstein. Empirical bayes estimates for a 2-way cross-classified additive model. Annals of Statistics, 2018. T. Cai, W. Sun, and W. Wang. Cars: Covariate assisted ranking and screening for large-scale two-sample inference. To appear: Journal of the Royal Statistical Society, Series B, 2018+. T. T. Cai and W. Sun. Optimal screening and discovery of sparse signals with applications to multistage high throughput studies. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):197–223, 2017. T. T. Cai, Z. Ma, and Y . Wu. 
Sparse pca: Optimal rates and adaptive estimation. The Annals of Statistics, 41 (6):3074–3110, 2013. T. T. Cai, M. Low, and Z. Ma. Adaptive confidence bands for nonparametric regression functions. Journal of the American Statistical Association, 109(507):1054–1070, 2014. S. E. Calvano, W. Xiao, D. R. Richards, R. M. Felciano, H. V . Baker, R. J. Cho, R. O. Chen, B. H. Brown- stein, J. P. Cobb, S. K. Tschoeke, et al. A network-based analysis of systemic inflammation in humans. Nature, 437(7061):1032–1037, 2005. E. J. Candes, M. B. Wakin, and S. P. Boyd. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier analysis and applications, 14(5-6):877–905, 2008. M. Cavrois, T. Banerjee, G. Mukherjee, N. Raman, R. Hussien, B. A. Rodriguez, J. Vasquez, M. H. Spitzer, N. H. Lazarus, J. J. Jones, et al. Mass cytometric analysis of hiv entry, replication, and remodeling in tissue cd4+ t cells. Cell reports, 20(4):984–998, 2017. X. Chen and M.-g. Xie. A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, pages 1655–1684, 2014. K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. JMLR: Workshop and Conference Proceedings, 2016. M. L. Clevenson and J. V . Zidek. Simultaneous estimation of the means of independent poisson laws. Journal of the American Statistical Association, 70(351a):698–705, 1975. O. Coibion, Y . Gorodnichenko, and G. H. Hong. The cyclicality of sales, regular and effective prices: Business cycle and policy implications. American Economic Review, 105(3):993–1029, 2015. 153 O. Collier, L. Comminges, A. B. Tsybakov, et al. Minimax estimation of linear and quadratic functionals on sparsity classes. The Annals of Statistics, 45(3):923–958, 2017. T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012. A. P. Damm and C. Dustmann. Does growing up in a high crime neighborhood affect youth criminal behavior? American Economic Review, 104(6):1806–32, 2014. L. H. Dicker and S. D. Zhao. High-dimensional classification via nonparametric empirical bayes and maxi- mum likelihood inference. Biometrika, 103(1):21–34, 2016. O. Diele. State of online gaming report. Spil Games, 2013. Available at http: //auth-83051f68-ec6c-44e0-afe5-bd8902acff57.cdn.spilcloud.com/v1/ archives/1384952861.25_State_of_Gaming_2013_US_FINAL.pdf. E. Dobriban, W. Leeb, A. Singer, et al. Optimal prediction in the linearly transformed spiked model. The Annals of Statistics, 48(1):491–513, 2020. D. Donoho and J. Jin. Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist., 32: 962–994, 2004. D. L. Donoho and I. M. Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the american statistical association, 90(432):1200–1224, 1995. D. L. Donoho, I. M. Johnstone, et al. Minimax estimation via wavelet shrinkage. The annals of Statistics, 26(3):879–921, 1998. B. Efron. Tweedie?s formula and selection bias. Journal of the American Statistical Association, 106(496): 1602–1614, 2011. B. Efron. Large-scale inference: empirical Bayes methods for estimation, testing, and prediction, volume 1. Cambridge University Press, 2012. B. Efron and T. Hastie. Computer age statistical inference, volume 5. Cambridge University Press, 2016. N. El Karoui. Spectrum estimation for large dimensional covariance matrices using random matrix theory. The Annals of Statistics, 36(6):2757–2790, 2008. S. Erickson, C. Sabatti, et al. Empirical bayes estimation of a sparse vector of gene expression changes. 
Statistical applications in genetics and molecular biology, 4(1):1132, 2005. J. Fan, Y . Fan, and J. Lv. High dimensional covariance matrix estimation using a factor model. Journal of Econometrics, 147(1):186–197, 2008. J. Fan, Y . Liao, and M. Mincheva. Large covariance estimation by thresholding principal orthogonal com- plements. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(4):603–680, 2013. J. Fan, W. Wang, and Y . Zhong. Robust covariance estimation for approximate factor models. arXiv preprint arXiv:1602.00719, 2016. Y . Fan and R. Li. Variable selection in linear mixed effects models. Annals of statistics, 40(4):2043, 2012. D. Fourdrinier and C. P. Robert. Intrinsic losses for empirical bayes estimation: A note on normal and poisson cases. Statistics & probability letters, 23(1):35–44, 1995. D. Fourdrinier, W. E. Strawderman, and M. T. Wells. Shrinkage Estimation. Springer, 2017. D. Fourdrinier, W. E. Strawderman, and M. T. Wells. Shrinkage Estimation. Springer, 2018. J. Friedman, T. Hastie, H. H¨ ofling, R. Tibshirani, et al. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007. A. Fu, B. Narasimhan, and S. Boyd. Cvxr: An r package for disciplined convex optimization. arXiv preprint arXiv:1711.07582, 2017. 154 L. Fu, G. James, and W. Sun. Nonparametric empirical bayes estimation on heterogeneous data. 2018. R. Garg and R. Telang. Inferring app demand from publicly available data. 2012. S. Geisser. Predictive inference. monographs on statistics & applied probability. CRC, London, 1993. E. I. George and X. Xu. Predictive density estimation for multiple regression. Econometric Theory, 24(2): 528–544, 2008. E. I. George, F. Liang, and X. Xu. Improved minimax predictive densities under kullback–leibler loss. The Annals of Statistics, 34(1):78–91, 2006. J. Geweke. The dynamic factor analysis of economic time series. Latent variables in socio-economic models, 1977. C. W. Granger. Prediction with a generalized cost of error function. Journal of the Operational Research Society, 20(2):199–207, 1969. M. Grant, S. Boyd, and Y . Ye. Cvx: Matlab software for disciplined convex programming, 2008. L. V . Green, S. Savin, and N. Savva. Nursevendor problem: Personnel staffing in the presence of endogenous absenteeism. Management Science, 59(10):2237–2256, 2013. W. Greene. Models for count data with endogenous participation. Empirical Economics, 36(1):133–173, 2009. E. Greenshtein and J. Park. Application of non parametric empirical bayes estimation to high dimensional classification. Journal of Machine Learning Research, 10(Jul):1687–1704, 2009. E. Greenshtein and Y . Ritov. Asymptotic efficiency of simple decisions for the compound decision problem. In Optimality: The Third Erich L. Lehmann Symposium, volume 57, pages 266–275, 2009. X. Guo and B. P. Carlin. Separate and joint modeling of longitudinal and event time data using standard computer packages. The American Statistician, 58(1):16–24, 2004. A. Gustafsson, M. D. Johnson, and I. Roos. The effects of customer satisfaction, relationship commitment dimensions, and triggers on customer retention. Journal of marketing, 69(4):210–218, 2005. H. Haans, N. Raassens, and R. van Hout. Search engine advertisements: The impact of advertising state- ments on click-through and conversion rates. Marketing Letters, 24(2):151–163, 2013. C. Han and R. Kronmal. Two-part models for analysis of agatston scores with possible proportionality constraints. 
Communications in StatisticsTheory and Methods, 35(1):99–111, 2006. C. R. Harvey, Y . Liu, and H. Zhu. ...and the cross-section of expected returns. The Review of Financial Studies, 29(1):5–68, 2016. L. A. Hatfield, M. E. Boye, M. D. Hackshaw, and B. P. Carlin. Multilevel bayesian models for survival times and longitudinal patient-reported outcomes with many zeros. Journal of the American Statistical Association, 107(499):875–885, 2012. D. Holland, Y . Wang, W. K. Thompson, A. Schork, C.-H. Chen, M.-T. Lo, A. Witoelar, T. Werge, M. O’Donovan, O. A. Andreassen, et al. Estimating effect sizes and expected replication probabil- ities from gwas summary statistics. Frontiers in genetics, 7, 2016. F. K. Hui, S. M¨ uller, and A. Welsh. Hierarchical selection of fixed and random effects in generalized linear mixed models. Statistica Sinica, 27(2), 2017a. F. K. Hui, S. M¨ uller, and A. Welsh. Joint selection in mixed models using regularized pql. Journal of the American Statistical Association, 112(519):1323–1333, 2017b. S. K. Hui, J. J. Inman, Y . Huang, and J. Suher. The effect of in-store travel distance on unplanned spending: Applications to mobile promotion strategies. Journal of Marketing, 77(2):1–16, 2013. C. Hwong. Using audience measurement data to boost user acquisition and engagement. Verto Analytics, 2016. Available athttp://www.vertoanalytics.com/. 155 J. G. Ibrahim, H. Zhu, R. I. Garcia, and R. Guo. Fixed and random effects selection in mixed effects models. Biometrics, 67(2):495–503, 2011. G. M. James, C. Paulson, and P. Rusmevichientong. Penalized and constrained optimization: An application to high-dimensional website advertising. Journal of the American Statistical Association, pages 1–31, 2019. W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 1, pages 361–379, 1961. A. Javanmard and A. Montanari. Confidence intervals and hypothesis testing for high-dimensional regres- sion. The Journal of Machine Learning Research, 15(1):2869–2909, 2014. K. Jerath, P. S. Fader, and B. G. Hardie. New perspectives on customer death using a generalization of the pareto/nbd model. Marketing Science, 30(5):866–880, 2011. J. Jiang. Linear and generalized linear mixed models and their applications. Springer Science & Business Media, 2007. W. Jiang, C.-H. Zhang, et al. General maximum likelihood empirical bayes estimation of normal means. The Annals of Statistics, 37(4):1647–1684, 2009. I. M. Johnstone. On minimax estimation of a sparse normal mean vector. The Annals of Statistics, pages 271–289, 1994. I. M. Johnstone. Gaussian estimation:sequence and wavelet models. Draft version, 2015. I. M. Johnstone and A. Y . Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 2012. I. M. Johnstone and D. Paul. Pca in high dimensions: An orientation. Proceedings of the IEEE, (99), 2018. I. M. Johnstone and B. W. Silverman. Wavelet threshold estimators for data with correlated noise. Journal of the royal statistical society: series B (statistical methodology), 59(2):319–351, 1997. I. M. Johnstone and B. W. Silverman. Needles and straw in haystacks: Empirical bayes estimates of possibly sparse sequences. Annals of Statistics, pages 1594–1649, 2004. I. M. Johnstone and D. M. Titterington. Statistical challenges of high-dimensional data, 2009. M. I. Jordan, J. D. Lee, and Y . Yang. 
Communication-efficient distributed statistical inference. Journal of the American Statistical Association, (just-accepted), 2018. M. I. Jordan et al. On statistics, computation and scalability. Bernoulli, 19(4):1378–1390, 2013. T. Kanerva. Cultures combined: Japanese gachas are sweeping f2p mobile games in the west. GameRefinery, 2016. Available at http://www.gamerefinery.com/ japanese-gachas-sweeping-f2p-games-west/. N. E. Karoui, A. E. Lim, and G.-Y . Vahn. Estimation error reduction in portfolio optimization with condi- tional value-at-risk. Technical report, 2011. T. Ke, J. Jin, and J. Fan. Covariance assisted screening and estimation. Annals of statistics, 42(6):2202, 2014. K. Khamaru and R. Mazumder. Computation of the maximum likelihood estimator in low-rank factor analysis. arXiv preprint arXiv:1801.05935, 2018. J. Kiefer and J. Wolfowitz. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, pages 887–906, 1956. A. Klenke. Probability Theory: A Comprehensive Course, pages 331–349. Springer London, 2014. R. Koenker and G. Bassett Jr. Regression quantiles. Econometrica: journal of the Econometric Society, pages 33–50, 1978. 156 R. Koenker and J. Gu. Rebayes: An r package for empirical bayes mixture methods. Journal of Statistical Software, 82(1):1–26, 2017. R. Koenker and I. Mizera. Convex optimization, shape constraints, compound decisions, and empirical bayes rules. Journal of the American Statistical Association, 109(506):674–685, 2014. J. Koetsier. Why 2016 is the global tipping point for the mobile economy. Tune, 2015. Available at https://www.tune.com/. F. Komaki. Asymptotic properties of bayesian predictive densities when the distributions of data and target variables are different. Bayesian Analysis, 10(1):31–51, 2015. X. Kong, Z. Liu, P. Zhao, and W. Zhou. Sure estimates under dependence and heteroscedasticity. Journal of Multivariate Analysis, 161:1–11, 2017. S. V . Koski, D. Bowers, and S. Costanza. State and institutional correlates of reported victimization and consensual sexual activity in juvenile correctional facilities. Child and Adolescent Social Work Journal, 35(3):243–255, 2018. S. Kou and J. J. Yang. Optimal shrinkage estimation in heteroscedastic hierarchical linear models. arXiv preprint arXiv:1503.06262, 2015. S. Kozak, S. Nagel, and S. Santosh. Shrinking the cross section. SSRN, 2017. S. Kritchman and B. Nadler. Determining the number of components in a factor model from limited noisy data. Chemometrics and Intelligent Laboratory Systems, 94(1):19–32, 2008. S. Kritchman and B. Nadler. Non-parametric detection of the number of signals: Hypothesis testing and random matrix theory. IEEE Transactions on Signal Processing, 57(10):3930–3941, 2009. V . Kumar. Making” freemium” work. Harvard business review, 92(5):27–29, 2014. J. D. Lee, Y . Sun, Q. Liu, and J. E. Taylor. Communication-efficient sparse regression: a one-shot approach. arXiv preprint arXiv:1503.04337, 2015. J. D. Lee, D. L. Sun, Y . Sun, and J. E. Taylor. Exact post-selection inference, with application to the lasso. Ann. Statist., 44(3):907–927, 06 2016. doi: 10.1214/15-AOS1371. URLhttps://doi.org/10. 1214/15-AOS1371. R. Levi, G. Perakis, and J. Uichanco. The data-driven newsvendor problem: new bounds and insights. Operations Research, 63(6):1294–1306, 2015. C. Li, M. Li, E. M. Lange, and R. M. Watanabe. Prioritized subset analysis: improving power in genome- wide association studies. 
Human heredity, 65(3):129–141, 2008. B. Lin, Z. Pang, and J. Jiang. Fixed and random effects selection by reml and pathwise coordinate optimiza- tion. Journal of Computational and Graphical Statistics, 22(2):341–355, 2013. C. Z. Liu, Y . A. Au, and H. S. Choi. Effects of freemium strategy in the mobile app market: An empirical study of google play. Journal of Management Information Systems, 31(3):326–354, 2014. Q. Liu and D. Wang. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems, pages 2378–2386, 2016. Q. Liu, J. D. Lee, and M. I. Jordan. A kernelized stein discrepancy for goodness-of-fit tests. In Proceedings of the International Conference on Machine Learning (ICML), 2016. C. Lu, Z. Lin, and S. Yan. Smoothed low rank and sparse matrix recovery by iteratively reweighted least squares minimization. IEEE Transactions on Image Processing, 24(2):646–654, 2015. Z. Ma. Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2): 772–801, 2013. S. Mallat. A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way. Academic Press, 3rd edition, 2008. ISBN 0123743702, 9780123743701. 157 MarketingCharts. App retention rates still low, but improving. Marketing Charts, Feb 20, 2017. Available at http://www.marketingcharts.com/online/ app-retention-rates-still-low-but-improving-75135/. S. Matsui. Genomic biomarkers for personalized medicine: development and validation in clinical studies. Computational and mathematical methods in medicine, 2013, 2013. S. Matsui and H. Noma. Estimating effect sizes of differentially expressed genes for power and sample-size assessments in microarray experiments. Biometrics, 67(4):1225–1235, 2011. C. McCulloch. Joint modelling of mixed outcome types using latent variables. Statistical Methods in Medical Research, 17(1):53–73, 2008. C. E. McCulloch. Maximum likelihood algorithms for generalized linear mixed models. Journal of the American statistical Association, 92(437):162–170, 1997. E. McDonald. The global games market. Newzoo, 2017. Available at https://newzoo.com/ insights/articles/the-global-%20games-market-%20will-reach-108-% 209-billion-%20in-2017-%20with-mobile-%20taking-42. Y . Min and A. Agresti. Random effect models for repeated measures of zero-inflated count data. Statistical modelling, 5(1):1–19, 2005. N. Moniz and L. Torgo. Multi-source social feedback of online news feeds. arXiv preprint arXiv:1801.07055, 2018. G. Mukherjee and I. M. Johnstone. Exact minimax estimation of the predictive density in sparse gaussian models. Annals of statistics, 43(3):937, 2015. G. Mukherjee, L. D. Brown, and P. Rusmevichientong. Efficient empirical bayes prediction under check loss using asymptotic risk estimates. arXiv preprint arXiv:1511.00028, 2015. S. E. Needleman and A. Loten. When freemium fails. WSJ, 2012. Available at https://www.wsj. com/articles/SB10000872396390443713704577603782317318996. M. F. Niculescu and D. J. Wu. When should software firms commercialize new products via freemium business models. Under Review, 2011. A. Noack. A class of random variables with discrete distributions. The Annals of Mathematical Statistics, 21(1):127–132, 1950. P. E. Nugent, M. Sullivan, S. B. Cenko, R. C. Thomas, D. Kasen, D. A. Howell, D. Bersier, J. S. Bloom, S. Kulkarni, M. T. Kandrashoff, et al. Supernova sn 2011fe from an exploding carbon-oxygen white dwarf star. Nature, 480(7377):344–347, 2011. C. J. Oates, M. Girolami, and N. Chopin. 
Control functionals for monte carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718, 2017. M. K. Olsen and J. L. Schafer. A two-part random-effects model for semicontinuous longitudinal data. Journal of the American Statistical Association, 96(454):730–745, 2001. A. Onatski. Asymptotics of the principal components estimator of large factor models with weakly influen- tial factors. Journal of Econometrics, 168(2):244–258, 2012. A. Onatski. Asymptotic analysis of the squared estimation error in misspecified factor models. Journal of Econometrics, 186(2):388–406, 2015. A. Onatski, M. J. Moreira, and M. Hallin. Signal detection in high dimension: The multispiked case. The Annals of Statistics, 42(1):225–254, 2014. A. B. Owen and J. Wang. Bi-cross-validation for factor analysis. Statistical Science, 31(1):119–139, 2016. J. Pan and C. Huang. Random effects selection in generalized linear mixed models via shrinkage penalty function. Statistics and Computing, 24(5):725–738, 2014. 158 D. Passemier and J.-F. Yao. On determining the number of spikes in a high-dimensional spiked population model. Random Matrices: Theory and Applications, 1(01):1150002, 2012. D. Passemier, Z. Li, and J. Yao. On estimation of the noise variance in high dimensional probabilistic prin- cipal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2017a. D. Passemier, Z. Li, and J. Yao. On estimation of the noise variance in high dimensional probabilistic prin- cipal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):51–67, 2017b. L. P´ astor. Portfolio selection and asset pricing models. The Journal of Finance, 55(1):179–223, 2000. L. P´ astor and R. F. Stambaugh. Comparing asset pricing models: an investment perspective. Journal of Financial Economics, 56(3):335–381, 2000. A. J. Patton and A. Timmermann. Properties of optimal forecasts under asymmetric loss and nonlinearity. Journal of Econometrics, 140(2):884–918, 2007. D. Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, pages 1617–1642, 2007. D. Paul and A. Aue. Random matrix theory in statistics: A review. Journal of Statistical Planning and Inference, 150:1–29, 2014. H. Peng and Y . Lu. Model selection in linear mixed effect models. Journal of Multivariate Analysis, 109: 109–129, 2012. J. Perro. Mobile apps: Whats a good retention rate? Localytics, March 28, 2016. Available at http: //Info.Localytics.Com/Blog/Mobile-apps-Whats-A-Good-Retention-Rate. PocketGamer. Number of applications submitted per month to the itunes app store. Pocket Gamer, 2018. Available athttp://www.pocketgamer.biz/metrics/app-store/submissions/. D. Pollard. A few good inequalities. available at: http://www.stat.yale.edu/ ˜ pollard/ Books/Mini/Basic.pdf, 2015. S. J. Press. Subjective and objective Bayesian statistics: principles, models, and applications, volume 590. John Wiley & Sons, 2009. S. Rabe-Hesketh, A. Skrondal, A. Pickles, et al. Reliable estimation of generalized linear mixed models using adaptive quadrature. The Stata Journal, 2(1):1–21, 2002. D. Rizopoulos. Dynamic predictions and prospective accuracy in joint models for longitudinal and time-to- event data. Biometrics, 67(3):819–829, 2011. D. Rizopoulos. Joint models for longitudinal and time-to-event data: With applications in R. CRC Press, 2012. D. Rizopoulos and E. Lesaffre. 
Introduction to the special issue on joint modelling techniques. Statistical methods in medical research, 23(1):3–10, 2014. D. Rizopoulos, G. Verbeke, and E. Lesaffre. Fully exponential laplace approximations for the joint mod- elling of survival and longitudinal data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3):637–654, 2009. D. Rizopoulos, G. Verbeke, and G. Molenberghs. Multiple-imputation-based residuals and diagnostic plots for joint models of longitudinal and survival outcomes. Biometrics, 66(1):20–29, 2010. H. Robbins. An empirical bayes approach to statistics. Technical report, COLUMBIA UNIVERSITY New York City United States, 1956. H. Robbins. The empirical bayes approach to statistical decision problems. In Herbert Robbins Selected Papers, pages 49–68. Springer, 1985. 159 C. Rudin and G.-Y . Vahn. The big data newsvendor: Practical insights from machine learning. 2014. K.-i. Sato and S. Ken-Iti. L´ evy processes and infinitely divisible distributions. Cambridge university press, 1999. J. Schelldorfer, L. Meier, and P. B¨ uhlmann. Glmmlasso: an algorithm for high-dimensional generalized linear mixed models using l1-penalization. Journal of Computational and Graphical Statistics, 23(2): 460–477, 2014. N. Sen, G. Mukherjee, A. Sen, S. C. Bendall, P. Sung, G. P. Nolan, and A. M. Arvin. Single-cell mass cytometry analysis of human tonsil t cell remodeling by varicella zoster virus. Cell reports, 8(2): 633–645, 2014. N. Sen, P. Sung, A. Panda, and A. M. Arvin. Distinctive roles for type i and type ii interferons and interferon regulatory factors in the host cell defense against varicella-zoster virus. Journal of virology, pages JVI– 01151, 2018. R. J. Serfling. Approximation theorems of mathematical statistics, volume 162. John Wiley & Sons, 2009. G. Shmueli, T. P. Minka, J. B. Kadane, S. Borle, and P. Boatwright. A useful distribution for fitting discrete data: revival of the conway–maxwell–poisson distribution. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(1):127–142, 2005. R. S. Singh. On the glivenko-cantelli theorem for weighted empiricals based on independent random vari- ables. The Annals of Probability, pages 371–374, 1975. Statista. How many hours in a typical week would you say you play games? Statista, 2018. Available at https://www.statista.com/statistics/273311/ time-spent-gaming-weekly-in-the-uk-by-age/. C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 197–206, Berkeley, Calif., 1956. University of California Press. URLhttps://projecteuclid.org/euclid.bsmsp/1200501656. C. M. Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6): 1135–1151, 1981. W. Sun and T. Cai. Large-scale multiple testing under dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):393–424, 2009. W. Sun and Z. Wei. Multiple testing for pattern identification, with applications to microarray time-course experiments. Journal of the American Statistical Association, 106(493):73–88, 2011. Swrve. Monetization report 2016. swrve, 2016. Available at https://www.swrve.com/images/ uploads/whitepapers/swrve-monetization-report-2016.pdf. Z. Tan. Improved minimax estimation of a multivariate normal mean under heteroscedasticity. Bernoulli, 21(1):574–603, 2015. 
A. Taube. People spend way more on purchases in free apps than they do downloading paid apps. Business Insider, December 30, 2013. Available at http://www.businessinsider.com/ inapp-purchases-dominate-revenue-share-2013-12. R. Tibshirani and G. Walther. Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3):511–528, 2005. doi: 10.1198/106186005X59243. URL http://dx. doi.org/10.1198/106186005X59243. R. J. Tibshirani et al. Adaptive piecewise polynomial estimation via trend filtering. The Annals of Statistics, 42(1):285–323, 2014. L. Tierney and J. B. Kadane. Accurate approximations for posterior moments and marginal densities. Jour- nal of the american statistical association, 81(393):82–86, 1986. 160 S. Toto. Gacha: Explaining japans top money-making social game mechanism. Kantan Games, 2012. Available athttps://www.serkantoto.com/2012/02/21/gacha-social-games/. G. L. Urban, G. Liberali, E. MacDonald, R. Bordley, and J. R. Hauser. Morphing banner advertising. Marketing Science, 33(1):27–46, 2013. US Department of Justice and Federal Bureau of Investigation. Uniform Crime Reporting Program Data: County-Level Detailed Arrest and Offense Data, United States, 2012. Inter-University Consortium for Political and Social Research Ann Arbor, MI, 2014. doi: https://doi.org/10.3886/ICPSR35019.v1. US Department of Justice and Federal Bureau of Investigation. Uniform Crime Reporting Program Data: County-Level Detailed Arrest and Offense Data, United States, 2014. Inter-University Consortium for Political and Social Research Ann Arbor, MI, 2017. doi: https://doi.org/10.3886/ICPSR36399.v2. H. R. Varian. A bayesian approach to real estate assessment. Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage, pages 195–208, 1975. E. F. V onesh, T. Greene, and M. D. Schluchter. Shared parameter models for the joint analysis of longitudinal data and event times. Statistics in medicine, 25(1):143–163, 2006. H. Wang. Coordinate descent algorithm for covariance graphical lasso. Statistics and Computing, 24(4): 521–529, 2014. S. Watanabe, S. Kuzuoka, and V . Y . Tan. Nonasymptotic and second-order achievability bounds for coding with side-information. IEEE Transactions on Information Theory, 61(4):1574–1605, 2015. G. C. Wei and M. A. Tanner. A monte carlo implementation of the em algorithm and the poor man’s data augmentation algorithms. Journal of the American statistical Association, 85(411):699–704, 1990. A. Weinstein, Z. Ma, L. D. Brown, and C.-H. Zhang. Group-linear empirical bayes estimates for a het- eroscedastic normal mean. Journal of the American Statistical Association, pages 1–13, 2018. A. Wyner. On source coding with side information at the decoder. IEEE Transactions on Information Theory, 21(3):294–300, 1975. J. Xia, E. E. Gill, and R. E. Hancock. Networkanalyst for statistical, visual and network-based meta-analysis of gene expression data. Nature protocols, 10(6):823–844, 2015. X. Xie, S. Kou, and L. D. Brown. Sure estimates for a heteroscedastic hierarchical model. Journal of the American Statistical Association, 107(500):1465–1479, 2012. X. Xie, S. C. Kou, and L. Brown. Optimal shrinkage estimation of mean parameters in family of distributions with quadratic variance. Annals of statistics, 44(2):564, 2016. Z. Yang and R. T. Peterson. Customer perceived value, satisfaction, and loyalty: The role of switching costs. Psychology & Marketing, 21(10):799–822, 2004. K. Yano and F. Komaki. 
Information criteria for prediction when distributions of data and target variables are different. Statistica Sinica.
A. Zellner. Bayesian estimation and prediction using asymmetric loss functions. Journal of the American Statistical Association, 81(394):446–451, 1986.
A. Zellner and M. S. Geisel. Sensitivity of control to uncertainty and form of the criterion function. U. of Chicago, 1968.
L. Zerboni, N. Sen, S. L. Oliver, and A. M. Arvin. Molecular mechanisms of varicella zoster virus pathogenesis. Nature Reviews Microbiology, 12(3):197–210, 2014.
C.-H. Zhang. Compound decision theory and empirical Bayes methods: invited paper. Ann. Statist., 31(2):379–390, 2003.
Y.-B. Zhao and M. Kočvara. A new computational method for the sparsest solutions to systems of linear equations. SIAM Journal on Optimization, 25(2):1110–1134, 2015.

Appendix A
Technical details related to Chapter 2

This chapter contains a detailed description of the Auxiliary Screening procedure (Aux-Scr) (chapter A.1), proofs of the results in chapters 2.2 and 2.3 (chapters A.2.1 and A.2.2 respectively), additional simulation experiments (chapter A.3), a real data analysis (chapter A.4) and an example that demonstrates a data-driven procedure for choosing $K$ (chapter A.5).

A.1 The auxiliary screening approach

We consider a potential competitor of ASUS, called Aux-Scr, which uses the auxiliary sequence $S$ to conduct a preliminary screening of the primary data, thereby discarding data instances that contain little information and retaining the potentially information-rich primary data for estimation. Using the notation of chapter 2, we define Aux-Scr for $K=2$ groups as follows. Let $Y$ and $S$ denote the primary statistics and the auxiliary sequence obeying models (2.1)-(2.3) of chapter 2.2.1, and let $\eta_t(\cdot)$ be the soft-thresholding operator
$$\eta_t(Y_i)=\begin{cases} -\,Y_i\sigma_i^{-1}, & \text{if } |Y_i\sigma_i^{-1}|\le t,\\ -\,t\,\mathrm{sign}(Y_i\sigma_i^{-1}), & \text{otherwise.}\end{cases}$$
The Aux-Scr estimator operates in two steps: first it constructs $K=2$ groups using the magnitude of $S$, with $|S_i|\le\tau$ in group 1 and $|S_i|>\tau$ in group 2. It then conducts soft-thresholding estimation using the primary statistics $Y$ in group 2 and sets $\hat\theta_i=0$ for all coordinates that belong to group 1. The tuning parameters for both grouping and shrinkage are determined using the SURE criterion.

Procedure 2. For $k=1,2$, denote $\hat I_1=\{i: |S_i|\le\tau\}$ and $\hat I_2=\{i: |S_i|>\tau\}$. Consider the following class of shrinkage estimators:
$$\hat\theta^{SI}_i(T) := Y_i + \sigma_i\,\eta_{t_k}(Y_i) \quad \text{if } i\in\hat I_k,$$
where $T=\{\tau,t_2\}$, $t_2$ varies in $[0,t_n]$ with $t_n=(2\log n)^{1/2}$, and $t_1=\max\{|Y_i\sigma_i^{-1}|: i\in\hat I_1\}$. Thus, the set of all possible hyper-parameter values is $\mathcal H_n=\mathbb R^{+}\times[0,t_n]$. Define the SURE function
$$S(T,Y,S) = n^{-1}\Big[\sum_{i=1}^{n}\sigma_i^2 + \sum_{k=1}^{K}\sum_{i\in\hat I_k}\big\{\sigma_i^2\,(|Y_i\sigma_i^{-1}|\wedge t_k)^2 - 2\,\sigma_i^2\, I(|Y_i\sigma_i^{-1}|\le t_k)\big\}\Big]. \tag{A.1}$$
Let $\hat T=\operatorname{argmin}_{T\in\mathcal H_n} S(T,Y,S)$. Then the Aux-Scr estimator is given by $\hat\theta^{SI}_i(\hat T)$ with $t_1=\max\{|Y_i\sigma_i^{-1}|: i\in\hat I_1\}$.

Following the arguments in the proof of Proposition 1, it can be shown that equation (A.1) is an unbiased estimate of the true risk. Moreover, unlike ASUS, the thresholding hyper-parameter $t_1$ is fixed and the SURE criterion is used to select only the grouping hyper-parameter $\tau$ and the thresholding hyper-parameter $t_2$.
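To make Procedure 2 concrete, the following R sketch evaluates the SURE criterion (A.1) over a grid of candidate values of $(\tau, t_2)$ and returns the resulting Aux-Scr estimate. It is a minimal sketch under the assumption of known noise standard deviations; the function name aux_scr, the grid resolution and the argument names (y, s, sigma) are illustrative choices and are not taken from the thesis code.

```r
# Minimal illustrative sketch of Aux-Scr (Procedure 2); not the thesis code.
# y: primary statistics, s: auxiliary sequence, sigma: known noise sd's.
aux_scr <- function(y, s, sigma) {
  n  <- length(y)
  z  <- abs(y / sigma)                      # |Y_i / sigma_i|
  tn <- sqrt(2 * log(n))                    # maximal threshold t_n

  sure <- function(tau, t2) {               # SURE criterion in equation (A.1)
    g2 <- abs(s) > tau                      # group 2: retained for thresholding
    t1 <- if (any(!g2)) max(z[!g2]) else 0  # group-1 threshold forces zero estimates
    tk <- ifelse(g2, t2, t1)                # coordinate-wise threshold
    mean(sigma^2 + sigma^2 * pmin(z, tk)^2 - 2 * sigma^2 * (z <= tk))
  }

  grid <- expand.grid(tau = quantile(abs(s), probs = seq(0, 1, by = 0.05)),
                      t2  = seq(0, tn, length.out = 50))
  vals <- mapply(sure, grid$tau, grid$t2)
  best <- grid[which.min(vals), ]

  g2  <- abs(s) > best$tau                  # final grouping at the selected tau
  est <- ifelse(g2, sign(y) * pmax(abs(y) - sigma * best$t2, 0), 0)
  list(estimate = est, tau = best$tau, t2 = best$t2, sure = min(vals))
}
```

Restricting $\tau$ to sample quantiles of $|S|$ is purely a convenience for the sketch; any grid over $\mathbb R^{+}$, or a continuous optimizer, could be used instead.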
When compared to Aux-Scr, ASUS has three distinct advantages: optimality, robustness and adaptivity. First, the screening strategy does not address the important issue of how to set an optimal group-wise cutoff in the screening stage; this issue is resolved by the SURE criterion in ASUS. Second, the "divide-and-threshold" strategy adopted by ASUS is clearly more effective than the "screening" strategy, which directly throws away a lot of data. When $S$ is imperfect in capturing the sparsity structure, the screening step would inevitably miss important signal coordinates. By contrast, ASUS is more robust to noisy side information as it only utilizes $S$ to divide $Y$ into groups; no coordinates are discarded directly. Finally, Aux-Scr uses the same threshold for all coordinates that pass the preliminary screening stage. By contrast, ASUS is more adaptive to the unknown sparsity as it sets varied group-wise thresholds to reflect the possibly varied sparsity levels across groups.

A.2 Proofs

A.2.1 Proofs of the results in Chapter 2.2

Proof of Proposition 1 - Recall that $r_n(T,\theta)=n^{-1}E\|\theta-\hat\theta^{SI}(T)\|^2$, where the expectation is taken with respect to the joint distribution of $(Y_i,S_i)$ given $\theta_i$ for $i=1,2,\ldots,n$. Now, expanding $\|\theta-\hat\theta^{SI}(T)\|^2$ as $\|\theta-Y\|^2+\|\hat\theta^{SI}(T)-Y\|^2+2\langle Y-\theta,\hat\theta^{SI}(T)-Y\rangle$ and taking expectations, we have
$$n\,r_n(T,\theta)=\sum_{i=1}^{n}\sigma_i^2+\sum_{i=1}^{n}\sum_{k=1}^{K}\sigma_i^2\,E\big\{\eta^2_{t_k}(Y_i)\,I(i\in\hat I_k)\big\}+2\sum_{i=1}^{n}\sum_{k=1}^{K}\sigma_i\,E\big\{(Y_i-\theta_i)\,\eta_{t_k}(Y_i)\,I(i\in\hat I_k)\big\}.$$
Observe that from models (2.1)-(2.3), the pairs $\big\{(Y_i-\theta_i)\eta_{t_k}(Y_i),\ I(i\in\hat I_k)\big\}$ are uncorrelated for each $i$ and for all $k=1,\ldots,K$. Further, note that by Lemma 1 of Stein (1981),
$$E\big\{\eta'_{t_k}(Y_i)\big\}=\sigma_i^{-1}\int_{\mathbb R}\eta'_{t_k}(u)\,\phi\Big(\frac{u-\theta_i}{\sigma_i}\Big)\,du=\sigma_i^{-2}\,E\big\{\eta_{t_k}(Y_i)(Y_i-\theta_i)\big\}.$$
Thus, $-\sigma_i^2\,E\,I(|Y_i\sigma_i^{-1}|\le t_k)=\sigma_i\,E\big\{\eta_{t_k}(Y_i)(Y_i-\theta_i)\big\}$, which completes the proof.

Proof of Theorem 2.2.1, statement (a) - First note that we can decompose $S(T,Y,S)$ into $K$ components: $S(T,Y,S)=\sum_{k=1}^{K}S_k(T_k,Y,S)$, where
$$S_k(T_k,Y,S)=n^{-1}\sum_{i=1}^{n}\Big\{\sigma_i^2-2\sigma_i^2\,I(|Y_i\sigma_i^{-1}|\le t_k)+\sigma_i^2\big(|Y_i\sigma_i^{-1}|\wedge t_k\big)^2\Big\}\,I\big\{S_i\in(\tau_{k-1},\tau_k]\big\}$$
and $T_k=\{\tau_{k-1},\tau_k,t_k\}$. Let
$$S^k_i(T_k,Y_i,S_i)=\Big\{\sigma_i^2-2\sigma_i^2\,I(|Y_i\sigma_i^{-1}|\le t_k)+\sigma_i^2\big(|Y_i\sigma_i^{-1}|\wedge t_k\big)^2\Big\}\,I\big\{S_i\in(\tau_{k-1},\tau_k]\big\}$$
and notice that $S^k_i(T_k,Y_i,S_i)$ is bounded above by $\sigma_i^2(1+t_k^2)$ for each $i$. Also, we can decompose the risk $r_n(T,\theta)=\sum_{k=1}^{K}r^k_n(T_k,\theta)$, where
$$r^k_n(T_k,\theta)=n^{-1}\sum_{i=1}^{n}E\Big[\big\{Y_i+\sigma_i\eta_{t_k}(Y_i)-\theta_i\big\}\,I\big\{S_i\in(\tau_{k-1},\tau_k]\big\}\Big]^2=n^{-1}\sum_{i=1}^{n}r^k_i(T_k,\theta_i),$$
noting that $r^k_i(T_k,\theta_i)\le 2\sigma_i^2(1+t_k^2)$; the last inequality follows from the upper bound on the risk of the soft-thresholding estimator at threshold $t_k$. Now, by the triangle inequality, it is enough to show that
$$c_n\,E\Big\{\sup_{T_k\in\mathbb R^2\times[0,t_n]}\big|S_k(T_k,Y,S)-r^k_n(T_k,\theta)\big|\Big\}<\infty \quad\text{for each } k \text{ and for all large } n. \tag{A.2}$$
Based on the form of $S_k(T_k,Y,S)$, we consider a re-parametrization of the problem with respect to $0\le\tilde\gamma_{k-1}<\tilde\gamma_k\le 1$, where $\tilde\gamma_k=\max_{i\in I_k}F_{S_i}(\tau_k)$, $\tilde\gamma_{k-1}=\min_{i\in I_k}F_{S_i}(\tau_{k-1})$ and $F_{S_i}$ is the distribution function of $S_i$. The only $(\tau_{k-1},\tau_k)$-dependent quantity in the expression of $S_k(T_k,Y,S)$ is $\hat I_k=\{i:\tau_{k-1}<S_i\le\tau_k\}$, which is re-parametrized to $\hat I_{\tilde k}=\{i:\tilde\gamma_{k-1}<F_{S_i}(S_i)\le\tilde\gamma_k\}$. This facilitates the analysis since the supremum with respect to $\tilde T_k=\{\tilde\gamma_{k-1},\tilde\gamma_k,t_k\}$ is now over a compact set. We will mimic the proof of Proposition 1 of Donoho and Johnstone (1995), hereafter referred to as DJ95P1, to prove equation (A.2); similar arguments continue to hold for the other terms.
165 LetS k ( ~ T k ;Y Y Y;S)r n ( ~ T k ;) =n 1 P n i=1 U i ( ~ T k ) =V n ( ~ T k ) whereEU i ( ~ T k ) = 0 and from the upper bounds on S k i (T k ;Y i ;S i ) and r k i (T k ; i ),jU i ( ~ T k )j 3(1 +t 2 n ) 2 i . Now replace Z n (t) in DJ95P1 with V n ( ~ T k ) and notice that Hoeffding’s inequality gives, for a fixed ~ T k and (for now) arbitraryr n > 1, P n V n ( ~ T k ) > r n p n o 2 exp n nr n 18(1 +t 2 n ) 2 P n i=1 4 i o Next, for a perturbation ~ T 0 k =f~ 0 k1 ; ~ 0 k ;t 0 k g of ~ T k where ~ 0 k > ~ k , ~ 0 k1 > ~ k1 and t 0 k > t k , we wish to bound the increments V n ( ~ T k )V n ( ~ T 0 k ) . To that effect, define ~ T (r;s) k to be ~ T k but with components (r;s) replaced by the components (r;s) of ~ T 0 k ,r < s = 1; 2; 3. Then we can write, dropping the subscript k from ~ T k for brevity, thatjS k ( ~ T;Y Y Y;S)S k ( ~ T 0 ;Y Y Y;S)j is bounded above by the sum of three terms: jS k ( ~ T;Y Y Y;S)S k ( ~ T (3) ;Y Y Y;S)j,jS k ( ~ T (3) ;Y Y Y;S)S k ( ~ T (2;3) ;Y Y Y;S)j andjS k ( ~ T (2;3) ;Y Y Y;S)S k ( ~ T 0 ;Y Y Y;S)j. The first term is bounded byn 1 (2 +t 02 t 2 )N n (t;t 0 ) which follows directly from the proof of DJ95P1 withN n (t;t 0 ) = P n i=1 2 i I(t <jY i 1 i j t 0 ). The second term is bounded byn 1 (3 +t 02 )M n (~ k ; ~ 0 k ) where M n (~ ; ~ 0 ) = P n i=1 2 i I(~ < F S i (s i ) < ~ 0 ) and similarly the third term is bounded by n 1 (3 + t 02 )M n (~ k1 ; ~ 0 k1 ). For the riskr k n ( ~ T;), we follow the same decomposition and upper boundjr k n ( ~ T;)r k n ( ~ T 0 ;)j by r k n ( ~ T;)r k n ( ~ T (3) ;) + r k n ( ~ T (3) ;)r k n ( ~ T (2;3) ;) + r k n ( ~ T (2;3) ;)r k n ( ~ T 0 ;) From the proof of DJ95P1, we upper bound the first term above by 5n 1 0n t n P n i=1 2 i as long asjtt 0 j< 0n for some 0n > 0. The second and the third terms are upper-bounded by 2n 1 1n (1 +t 2 n ) P n i=1 2 i and 2n 1 2n (1 +t 2 n ) P n i=1 2 i respectively as long asj~ k ~ 0 k j < 1n andj~ k1 ~ 0 k1 j < 2n for some 1n ; 2n > 0. 166 Hence, we can boundn V n ( ~ T k )V n ( ~ T 0 k ) by 2 +t 02 t 2 N n (t;t 0 ) + 3 +t 02 n M n (~ k ; ~ 0 k ) +M n (~ k1 ; ~ 0 k1 ) o + (A.3) 5 0n t n n X i=1 2 i + 2 1n + 2n 1 +t 2 n n X i=1 2 i Following the proof of DJ95P1, we choose 0n ; 1n ; 2n such that 0n t n ; 1n t 2 n and 2n t 2 n are allo(n 1=2 ) and for largen we useEN n (t;t 0 ) + (3 +t 2 n )fEM n (~ k ; ~ 0 k ) +EM n (~ k1 ; ~ 0 k1 )gc 0 n 0n +c 1 n 1n + c 2 n 2n =O(r n n 1=2 ) for some absolute constantsc 0 ;c 1 ;c 2 . This and the bound in equation (A.3) establish r n = p n =O(c 1 n ) directly from the proof of DJ95P1 which proves the desired` 1 convergence of equation (A.2). Proof of Theorem 2.2.1, statement (b) - Due to the result of theorem 2.2.1 part (a), proving the result in part (b) essentially reduces to showingc n E h sup T2Hn l n f; ^ SI (T )gr n (T;) i <1. Note that the loss l n f; ^ SI (T )g decomposes as the sum of K losses: l n f; ^ SI (T )g = P K k=1 l k n f; ^ SI (T k )g where l k n f; ^ SI (T k )g = n 1 P n i=1 n Y i + i t k (Y i ) i o 2 I n S i 2 ( k1 ; k ] o andT k =f k1 ; k ;t k g. As the risk is just the expectation of the loss, by triangle inequality, it is enough to show c n E h sup T k 2R 2 [0;tn] l k n f; ^ SI (T k )gEl k n (; ^ SI (T k )) i <1 for alli and for all largen Note that each of the lossesl k n again decomposes into two parts: A n = n 1 n X i=1 n i Z i i t k sign( i + i Z i ) o 2 I n S i 2 ( k1 ; k ] o I n j i + i Z i j> i t k o B n = n 1 n X i=1 2 i I n S i 2 ( k1 ; k ] o I n j i + i Z i j i t k o whereZ i ’s are i.i.dN(0; 1) random variables. 
167 We next prove that for some 0 , (i) there exist functions g n and h n such thatP n c n sup T k A n EA n > o g n () andP n c n sup T k B n EB n > o h n () for all > 0 and for all largen, and (ii) both lim sup R 1 0 g n ()d<1, lim sup R 1 0 h n ()d<1. This establishes the desired result. We deal withB n first and, without loss of generality, establish the bound for B n =n 1 P n i=1 2 i I n S i 2 ( k1 ; k ] o I n i + i Z i i t k o . Now P n c n sup T k B n EB n > o =P n c n sup T k B n EB n > andF n o +P F c n (A.4) where the setF n = fmax i=1;:::;n jZ i j (1 + ) p 2 logng, and P F c n (0)n for all large n. We bound the first term on the right side of equation (A.4) by using the Glivenko-Cantelli theorem for weighted empirical measures (Singh 1975). As t n = p 2 logn and t k 2 [0;t n ], onF n the weights in B n can be positive only when 2 i (2 + ) 2 2 i 2 logn. We next use the inequality in equation (6) of Singh (1975) with a in that equation equaling c 1 n P n i=1 4 i . Further, note that for all large n, c 1 n P n i=1 4 i q (2 +) 4 (2 logn) 2 P n i=1 4 i which is the maximum possible` 2 norm of the weights 2 i . This, along with assumption A1 and the fact that P n i=1 2 i p n( P n i=1 4 i ) 1=2 gives P n sup T k c n B n EB n >;F n o < 4 n log n q P n i=1 4 i exp n 2 log 2(1) n 2(2 +) 4 o Now if P n i=1 4 i = o(c 1 n ), then the desired bound onE sup T k c n B n EB n is obvious; else the above probability is bounded above by h n () = 4n 2 exp n 2 log 2(1) n=2(2 +) 4 o which satisfies the afore- mentioned integrability condition. Thus, the proof of the result forB n is complete. We now turn our attention to A n . Again, without loss of generality, we prove the bound for A n = n 1 P n i=1 (Z i t k ) 2 I n S i 2 ( k1 ; k ] o I n i +Z i >t k o . As we saw in the case ofB n , the variances 2 i appear only throughn 1 P n i=1 2 i which is finite by assumption A1. Thus, we take i = 1 for all i and 168 decomposeA n as sum of three parts by expanding (Z i t k ) 2 =Z 2 i 2Z i t k +t 2 k . The bound on the third term follows directly by the traditional Glivenko-Cantelli theorem and by noting thatt 2 k 2 logn. Here we establish the` 1 convergence result for the first term. The proof for the second term is very similar. We further reduce the problem. Without loss of generality, we assume i = 0 and prove the` 1 conver- gence result forA n = n 1 P n i=1 Z 2 i I n S i 2 ( k1 ; k ] o I n Z i > t k o . We again apply the same technique as withB n and control the probabilityP n sup T k c n A n EA n >;F n o . Similarly as withB n , but now conditioned onfZ i :i = 1;:::;ng, the above probability is easily controlled at the desired rate by applying equation (6) of Singh (1975), i.e,P n sup T k c n A n EA n >;F n jZ 1 ;:::;Z n o g n () where g n does not depend onZ i and for some 0 > 0, R 1 0 g n ()d<1 for all largen with P n i=1 Z 2 i c n !1 asn!1. This establishes the desired` 1 result forA n and completes the proof. Proof of Corollary 2.2.2 - Both statements of this corollary follow from result (b) of Theorem 2.2.1. For statement (a), note that for any > 0, the probabilityP h l n f; ^ SI ( ^ T )g l n f; ^ SI (T OL )g +c 1 n i is bounded above byP h l n f; ^ SI ( ^ T )gS( ^ T;Y Y Y;S)l n f; ^ SI (T OL )gS(T OL ;Y Y Y;S) +c 1 n i , which converges to 0 by Theorem 2.2.1 (b). Statement (b) of this corollary follows as the difference l n f; ^ SI ( ^ T )g l n f; ^ SI (T OL )g can be decomposed as sum ofl n f; ^ SI ( ^ T )gS( ^ T;Y Y Y;S),S(T OL ;Y Y Y;S)l n f; ^ SI (T OL )g andS( ^ T;;;Y;S) S(T OL ;Y Y Y;S). 
By definition, the last term is not positive. Thus, the sum is bounded above by 2 sup T2Hn jS(T;Y Y Y;S)l n f; ^ SI (T )gj which converges to 0 at the prescribed rate by Theorem 2.2.1 (b). A.2.2 Proofs of the results of Chapter 2.3 We use Theorem 2 of Johnstone (1994), which provides an explicit higher order evaluation of the maximal risk of the soft threshold estimator with the best possible choice of threshold. We restate the theorem 169 abet for the symmetric case, which slightly increases the maximal risk presented in equation (17) of the aforementioned theorem. Result 1. Consider the class of univariate soft-threshold estimators ^ S (x) = sign(x)(x) + for 0. If the parameter 2 R is such that P ( = 0) 1. Then as ! 0, the best choice of threshold is f(t) +O(t 3 logt) and the minimal possible risk is H(t) = (h(t) + 36t 2 logt +O(t 2 )) where t = p 2 log 1 and f(t) = p t 2 6 logt + 2 log(0) andh(t) =f 2 (t) + 5: (A.5) Proof of Theorem 2.3.1 - Directly applying the above result we have, R NS n (;; n ) = ( 1;n n + 2;n n )H(t n ) 2 n ; where, 2 n =n 1 n X i=1 2 i ; andt 2 n = 2 log( 1;n n + 2;n n ) 1 as the density level is at most 1;n n + 2;n n in n (;; n ). Now, if we completely know the latent side information then again applying equation (17) of Johnstone (1994) separately to the two groups:fi : i ? n g andfi : i > ? n g we have: R OS n (;; n ) =f 1;n n H(t 1;n ) + 2;n n H(t 2;n )g 2 n wheret 2 1;n = 2k n ;t 2 2;n = 2k n : Also, t 2 n = t 2 1;n 2 log 1;n +O( 2;n 1 1;n n ). By Assumption (A2.1) there exists > 0 such that 2;n 1 1;n n <n for all largen. Thus, asn!1 withc 0 = 5 + 2 log(0), and ~ k n =k n = logk n , R OS n = 1;n p 1;n 2 n f2k n 3 log(2k n ) +c 0 +O( ~ k 1 n )g R NS n = 1;n p 1;n 2 n 2k n + 2 log 1 1;n 3 log(2k n ) 3 log 1 + (k n ) 1 log 1 1;n +c 0 +O( ~ k 1 n ) ; 170 from which the lemma follows. To understand the phenomenon here in a simplifier lens, consider the first order approximations:R NS n 2 n 1;n p 1;n f(t n ),R OS n 2 n 1;n p 1;n f(t 1;n ),f(t n )f(t 1;n ) 2 log 1 1;n and R NS n R OS n (2 log 1 1;n ) 1;n p 1;n 2 n Thus, the gain due to incorporation of side information is essentially due to the fact that we can use a lower threshold for the subgroup with smaller sparsity than that used in the agglomerative case with no side information and this is exactly the phenomenon depicted in Figure 2.2. Proofs of Theorems 2.3.2, 2.3.3 and Lemma 2.3.2 - Note that in our asymptotic set-up, there exists> 0 such that p 1;n 1;n (p 2;n 2;n ) 1 n for all largen: (A.6) We will be using this property to simplify our calculations by restricting ourselves to dominant terms. As such we will be ignoring terms which areo(p 1;n 1;n 2 n k 1 n ). Without loss of generality we assume thatSj has monotone likelihood ratio inS and considerq jk i;n () := P n ( ^ I j i jI k i forj;k2f1; 2g; i2f1;:::;ng; where, ^ I 1 i =fS i g,I 1 i =f i ? n g, ^ I 2 i = Rn ^ I 1 i ,I 2 i = RnI 1 i andq jk n () =n P n i=1 q jk i;n () 2 i = 2 n . Proof of Lemma 2.3.2 - Dividing the difference in Theorem 2.3.1 by the expression ofR OS n from the display above it we get R NS n =R OS n = 1 + [2 log 1 1;n 3 logf1 + (k n ) 1 log 1 1;n g] f2k n 3 log(2k n ) +c 0 g +O(k n ); (A.7) 171 wherec 0 = 5 + 2 log(0) and < 2. The first result of the lemma now follows by noting log 1 1;n 0 < which is due to Assumption (A2.1). Note that, iffc n ! 0 thenR NS n =R OS n ! 1. From the above display it follows thatk 1+ n (R NS n =R OS n 1) k n log 1 1;n f1 1:5(k n ) 1 g +O(k +1+ n ) where < 2. 
Thus, we have the first part of the third result. Its second part follows directly from the proof of Theorem 2.3.2, which is present after the proof of this lemma. Next, we establish the upper bound on the maximal risk of ASUS given in the second statement of lemma 2.3.2. LetR KS n denotes the maximal risk of ASUS when we can set any possible thresholds in ASUS including those depending on the density levelsp 1;n ;p 2;n as well as the mixing probabilities 1;n and 2;n . However, we do not know the latent variable or its subsequent oracle optimal groupsfi : i n g and fi : i > n g. Thus, by definitionR OS n R KS n R NS n . Now, ASUS always chooses the thresholds and the segmentation hyper-parameter in a data-dependent fashion minimizing the SURE criterion. We next apply theorem 2.2.1 which tells us that the maximal risk of ASUSR AS n can not be much bigger thanR KS n . As such, theorem 2.2.1 compounded with theorem 4a of DJ95 impliesR AS n R KS n R F n If 2 n 3d n g + o( 1;n p 1;n 2 n k 1 n ) whereR F n is the risk of ASUS with fixed threshold of p 2k n andd n =n 1=2 log 3=2 n and 2 n =n 1 P n i=1 2 i ^(2k n ). By Lemma 8.3 of Johnstone (2015),R F n n 1 +n 1 P n i=1 f 2 i ^(1+2k n )g 2 i . Thus,R AS n R NS n +o( 1;n p 1;n 2 n k 1 n ) and the result follows from (A.7). Proof of Theorem 2.3.2 - First consider the situation where the sparsity levelsp 1;n andp 2;n are known. Due to the product structure of our ASUS estimator, we first concentrate on its maximal risk for each of theith coordinate. This reduces to an univariate risk analysis. If noise variance equals 1, univariate soft-threshold estimators with threshold has: (a) the risk at the origin:g 1 ()(1 +O( 2 )) for large whereg 1 () = 4()= 3 (b) the maximal risk at the non-origin points: g 2 () = 1 + 2 and the maximum is attained when the parametric value is1. 172 Now, if the probability of the parameter i being non-zero isp then the maximal risk of the soft-threshold estimator with threshold isg(p;) = (1p)g 1 () +pg 2 (). As i is generated from the two group model of equation (2.8) with density levelsp k;n ,q 12 i;n andq 21 i;n are the probabilities of mis-classifying group 2 and group 1 respectively and thresholdst k;n were used for those detected in groupk = 1; 2. Note, that without mis-classification the maximal risk at each coordinatei is weighted by the group probabilities k;n and the optimal threshold choices aref( p 2k n ) andf( p 2k n ) where,f is defined in (A.5). However, under mis-classification these thresholds will change and the optimal thresholds will bef(m opt 1;n [i]) andf(m opt 2;n [i]) where, m opt k;n [i] = 2 log k;n q kk i;n p k;n + j;n q kj i;n p j;n k;n q kk i;n + j;n q kj i;n 1=2 forj6=k: (A.8) These can not be used asq jk i;n are not known while constructing the estimator. We are interesting in deriving upper bounds on the maximal risk of ASUS and so, unlike the optimal thresholds which depend oni, here we consider thresholdst 1;n andt 2;n which are uniform over the groups. With mis-classification, the proba- bilitiesq jk i;n will be also involved into the expression for maximal risk as now i coming from group 1 (say) might be treated with either thresholdt 1;n (when correctly classified) ort 2;n (when incorrectly classified). Figure A.1 provides a pictorial representation of how the probabilitiesq jk i;n enter this decomposition. 
The maximal risk for coordinatei is given by 2 X j=1 2 X k=1 k;n q jk i;n g(p k;n ;t j;n ) 2 i f1 +o(1)g: (A.9) 173 Figure A.1: Pictorial representation of coordinate-wise decomposition of maximal riskR AS . Herej;k = 1; 2 andj6=k. We fix a threshold oft 1;n = f( p 2k n ) andt 2;n = f( p 2 k n ) where is allowed to vary in [;]. These thresholds when substituted in (A.9) will produce a upper bound on the maximal risk. Doing so and using (A.6), we see that the maximal risk for theith coordinate is upper bounded by 2 i h 1;n p 1;n n q 11 i;n g 2 (t 1;n ) +q 21 i;n g 2 (t 2;n ) o + 2 X j=1 2 X k=1 k;n q jk i;n g 1 (t j;n ) +O 1;n p 1;n k n i (A.10) Now, consider the second term in equation (A.10). Ast 2;n t 1;n , we lower bound and upper bound it by A n + 1;n q 11 i;n g 1 (t 1;n ) andA n + 1;n g 1 (t 1;n ) whereA n = 2;n q 22 i;n g 1 (t 2;n ) + 2;n q 12 i;n g 1 (t 1;n ). Define ~ k n = k n = logk n . Note thatg 1 (t 1;n ) = 4p 1;n f1 +O( ~ k 1 n )g andg 1 (t 2;n ) = 4(0)t 3 2;n (t 2;n )(1 + O(k 1 n )). Thus, with n ( ) =n (0)(t 2;n )t 3 2;n , t 2;n =f( p 2 k n ) the second term in (A.10) is bounded above by 4 2;n p 2;n fq 22 i;n n n ( ) + n q 12 i;n + 1gf1 +O( ~ k 1 n )g (A.11) 174 as n = 2;n = 1;n . Now, consider the first term in equation (A.10). We have, q 11 i;n g 2 (t 1;n ) +q 21 i;n g 2 (t 2;n ) =g 2 (t 1;n ) +q 21 i;n fg 2 (t 2;n )g 2 (t 1;n )g andg 2 (t 2;n )g 2 (t 1;n ) = 2( )k n 3 log log( =) := n ( ). Thus, the first term in equation (A.10), 1;n p 1;n n q 11 i;n g 2 (t 1;n ) +q 21 i;n g 2 (t 2;n ) o = 1;n p 1;n n g 2 (t 1;n ) +q 21 i;n n ( ) o (A.12) Now, g 2 (t 1;n ) = 1 +t 2 1;n = h( p 2k n ) 4, and so, from equations (A.11) and (A.12), maximal risk of ASUS for coordinatei is upper bounded by 1;n p 1;n 2 i h h( p 2k n ) +fq 21 i;n n ( ) + 4 n q 12 i;n + 4q 22 i;n n n ( )gf1 +O( ~ k 1 n )g +O ~ k 1 n i and therefore the maximal risk over then coordinates of the ASUS estimator when thresholds can be directly chosen depending on the density levelsp 1;n andp 2;n is R KS n 1;n p 1;n 2 n h h( p 2k n ) +fq 21 n n ( ) + 4 n q 12 n + 4q 22 n n n ( )gf1 +O( ~ k 1 n )g +O ~ k 1 n i where q jk n () = P n i=1 q jk i;n () 2 i P n i=1 2 i for j;k2f1; 2g. Recall,R KS n denotes the maximal risk of ASUS when we can set any possible thresholds in ASUS including those depending on the density levels p 1;n ;p 2;n as well as the mixing probabilities 1;n and 2;n . Now, consider the general case when those are unknown. ASUS always chooses the thresholdst 1;n andt 2;n and the segmentation hyper-parameter n in a data-dependent fashion minimizing the SURE criterion. We next apply theorem 2.2.1 similarly as in the 175 proof of lemma 2.3.2 which provides us withR AS n R KS n o( 1;n p 1;n 2 n k 1 n ) Also, from calculation in the previous subsection for any < 1, we knowR OS n 1;n p 1;n 2 n fh( p 2k n ) +o(k n )g and so R AS n R OS n 1;n p 1;n 2 n h fq 21 n n ( ) + 4 n q 12 n + 4q 22 n n n ( )gf1 +O( ~ k 1 n )g +O( ~ k 1 n ) i : (A.13) Again, in our asymptotic set-up there exists > 0 such that n n 0 for all large n. Choosing = + 0 ,we have n ( ) = 2 0 k n 3 log log(1 + 0 =) and n n =o(n 0 ) for some 0< 0 <. Thus, with this choice of , based on (A.13) the controls onq 12 n andq 21 n stated in the theorem impliesR AS n R OS n O( ~ k 1 n ). This, along with theorem 2.3.1 provide us with the desired result for the theorem as well as the third result of lemma 2.3.2. Proof of Theorem 2.3.3, statement (a) - Consider case (ii) first. Note that lim n!1 n n q 21 n ( n ) = 0 implies n q 21 i;n ( n ) =o(1) asn!1 for alli. 
Hence for each coordinatei, the optimal threshold for group 1 considering two groups in the data isf(m opt 1;n [i]) (see equation (A.8)). As n q 21 i;n ( n ) = o(1) , we have m opt 1;n = p 2k n f1 +o(k 1 n )g for any > 0. Thus, the threshold here asymptotically equalst 1;n used in the proof of theorem 2.3.2: part b, before. Concentrating on only the j = 1;k = 1 and j = 2;k = 1 terms in equation (A.9) by the previously conducted analysis we have the maximal risk of theith coordinate bounded below by 1;n p 1;n 2 i h h( p 2k n ) + 4 n q 12 i;n f1 +O( ~ k 1 n )g i . Thus, if ASUS considers two groups thenR KS n 1;n p 1;n 2 n [h( p 2k n ) + 4 n q 12 i;n f1 +O( ~ k 1 n )g] and the ratio (R NS n R OS n )=(R KS n R OS n ) diverges to1 unless lim sup n q 12 n k 1 n <1 as log 1 1;n = 0 k n . In that case we use a uniform choice of thresholdt 1;n =t 2;n =f(m opt n ) wherem opt n =f2 log( 1;n p 1;n + 2;n p 2;n )g 1=2 . This along with part (a) completes the proof for case (ii). For case (i), note that them opt 1;n for theith coordinate defined in (A.8) equals m opt 1;n [i] =f2 logp 1 1;n + 2 log(1 + n q 21 i;n =q 11 i;n )g 1=2 f1 +o(1)g asn!1: 176 Now, considering the risk at the origin forj = 1;k = 1 term in equation (A.9), we see that it will contain at least an extra additive component ofO( 1;n p 1;n n q 21 i;n =q 11 i;n 2 i ) overh( p 2k n ). Thus, the average maximal risk over then coordinates is bounded below by 1;n p 1;n n h( p 2k n ) 2 n +O n n 1 n X i=1 q 21 i;n =q 11 i;n 2 i o Asq 11 i;n 1 for alli, the second term on right side above is bounded below byO( 1;n p 1;n n q 21 n 2 n ) which provides us with the desired result. Proof of Theorem 2.3.3, statement (b) - By definition,R KS n R NS n and by application of theorem 1 (as in the proof of lemma 2) we haveR AS n R KS n o( 1;n p 1;n 2 n k 1 n ). Also, by assumption A2.2, there exists some < 1 such that limk =2 n (1 1 ) =1. This impliesk n log 2 1;n , which coupled with theorem 2.3.1 gives us with the desired result. Proof of Lemma 2.3.1 - Without loss of generality, assume that the marginal distribution of the auxiliary sequenceS given the latent parameter has monotone likelihood ratio in the statisticS. Also, let 2;n = inff i : i 2 I 2;n g > supf i : i 2 I 1;n g = 1;n . Then, 2;n = 1;n +d n . Under this reduction q jk i;n () :=P n ( ^ I j i jI k i forj;k2f1; 2g; i2f1;:::;ng; where, the sets ^ I 1 i =fS i g,I 1 i =f i ? n g, ^ I 2 i = Rn ^ I 1 i ,I 2 i = RnI 1 i . Let b n = sup i max(2 2 i ;b i ). Now, set n = 1;n + b n (2k n + logk n ) . Thus, for all largen the condition imposed ond n in the lemma implies n 2;n b n (log n + logk n ) . Next, note that,q 21 n ( n ) 2 n sup i q 21 i;n ( n ) 2 n sup i P (S i n j 1;n ) and thereafter, using the tail bounds of the sub-Exponential/Gaussian distributions, we have for alli, P (S i n j 1;n ) expf( n 1;n ) 1+I(b i =0) = max(2 2 i ;b i )gk 2 n = logk n for all large n. Similarly, it follows thatq 12 n ( n ) 1 n = logk n , which completes the proof. 177 Statement and proof of Lemma 3 - Lemma 3 - If our parametric space (;; n ) is such that lim sup n!1 n 1=2 1;n p 1;n <1, ASUS con- vergences in probability to the SureShrink procedure with the fixed threshold choice of p 2 logn. Proof. Define, 2 n =n 1 P n i=1 2 i ^ (5k n ) for some prefixed> 0. Note that 2 n 5 1;n p 1;n k n (1 +o(1)) and so, by the condition of the lemma 2 n =d n ! 0 whered n =n 1=2 log 3=2 n. Defines 2 n =n 1 P n i=1 y 2 i ^ (2k n ) 1 where Y i = i + Z i and Z i are i.i.d. N(0; 1) . Let F n = fmax i z 2 i < 3k n g. 
Note that P(s_n^2 \ge d_n) \le P(F_n^c) + P(s_n^2 \ge d_n \text{ and } F_n). The first term converges to 0, and the second term on the right side above is bounded by P\big(n^{-1}\sum_{i=1}^n Y_i^2 - 1 \ge d_n,\; n^{-1}\sum_{i=1}^n \theta_i^2 \le \tau_n^2\big), where \tau_n^2 is the truncated average defined at the start of this proof; this probability converges to 0 as \tau_n^2/d_n \to 0 (see the proof of theorem 4(b) of DJ95). This completes the proof.

A.3. Additional simulation experiments

In this section, we present a number of simulation experiments demonstrating the asymptotic performance of ASUS as n increases. We fix m = 50 and allow n to vary from 500 to 5000 in increments of 100. To simulate the parameter vector, we continue to use the setup of the one sample problem discussed in chapter 2.4.2 and simulate \theta_{1i} as before, but vary the sparsity levels of the sparse component \mu under scenarios S1 and S2 as follows:

(S1) 1% of the n coordinates of \mu are drawn from Unif(6, 7), 4% from Unif(2, 3), and the remaining 95% are 0.
(S2) 4% of the n coordinates of \mu are drawn from Unif(4, 8), 16% from Unif(1, 3), and the remaining 80% are 0.

The parameter vector is then \theta = \mu + \theta_1, and Y_i \sim N(\theta_i, \sigma_i^2), where \sigma_i^2 = 1 for all i under S1 and \sigma_i^2 are i.i.d. Unif(0.1, 1) under S2. For scenario S1, we consider two side information sequences S_1 and S_2 as follows: (ASUS.1) S_{i1} = |N(\mu_0, \nu_i^2) + \epsilon_{2i}| for i \in I_1^* and S_{i1} = |N(\mu_1, \nu_i^2) + \epsilon_{2i}| for i \in I_2^*, with \mu_0 = \sqrt{\log k_n} and \mu_1 = 0; and (ASUS.2) S_{i2} = |N(\mu_0, \nu_i^2) + \epsilon_{2i}| for i \in I_1^* and S_{i2} = |N(\mu_1, \nu_i^2) + \epsilon_{2i}| for i \in I_2^*, with \mu_0 = \sqrt{k_n} and \mu_1 = 0. Here the \nu_i^2 are i.i.d. Unif(0.1, 1) and \epsilon_{2i} is the average over m samples of N(0, 0.01). Similarly, for scenario S2 we consider two side information sequences S_1 and S_2 as follows: (ASUS.1) S_{i1} = |\chi^2_{1+\sqrt{\log k_n}} + \epsilon_{2i}| for i \in I_1^* and S_{i1} = |\chi^2_1 + \epsilon_{2i}| for i \in I_2^*; and (ASUS.2) S_{i2} = |\chi^2_{1+k_n} + \epsilon_{2i}| for i \in I_1^* and S_{i2} = |\chi^2_1 + \epsilon_{2i}| for i \in I_2^*. Thus S_1 and S_2 differ in the separation of the means of their conditional distributions, with S_2 in scenario S2 having a near optimal separation in the means as prescribed by Lemma 2.3.1 in chapter 2.3.

Figure A.2: Asymptotic performance of ASUS: average risks of different estimators. The dashed line represents the risk of the oracle estimator \tilde{\delta}^{SI}_i(T^{OR}). Left: Scenario S1. Right: Scenario S2.

We repeat this sampling scheme for N = 1000 repetitions and report the results in table A.1 and figure A.2. As observed in the one sample estimation problem, ASUS continues to exhibit the best performance amongst all the competing estimators. However, the efficiency of ASUS in exploiting side information clearly depends on the magnitude of the separation of the conditional means of S under the two groups. For example, ASUS.2, which has a bigger separation between the conditional means of side information S_2, exhibits the best performance across all scenarios. In fact, under scenario S2, S_2 is sub-exponential and the log n separation between the conditional means brings the risk of ASUS.2 closer to the risk of the oracle estimator. As far as ASUS.1 is concerned, the relatively smaller separation in the conditional means of S_1 does not allow ASUS to optimally partition the n coordinates into groups that are heterogeneous in terms of sparsity, and hence it performs only marginally better than the SureShrink estimator. Moreover, we observe that across both scenarios the risk of the auxiliary screening procedure, Aux-Scr, is indistinguishable from the risk of the SureShrink estimator, demonstrating that discarding observations based on the magnitude of the side information may lead to missing important signal coordinates, especially when the side information is corrupted with noise. ASUS, however, does not discard any observations and continues to exploit the available information in the noisy auxiliary sequence. In table A.1, we report risk estimates and estimates of T for ASUS and Aux-Scr at n = 5000.

Table A.1: Asymptotic performance of ASUS: risk estimates and estimates of T for ASUS at n = 5000. Here n_k^* = |I_k^*| and n_k = |\hat{I}_k| for k = 1, 2.

                                             Scenario S1     Scenario S2
  OR          segmentation hyper-parameter*  2.00            0.980
              t_1^*, t_2^*                   4.115, 0.130    4.073, 0.062
              n_1^*, n_2^*                   4750, 250       4000, 1000
              risk                           0.107           0.154
  ASUS.1      segmentation hyper-parameter   1.936           1.904
              t_1, t_2                       1.253, 3.520    0.740, 1.281
              n_1, n_2                       3460, 1540      2857, 2143
              risk                           0.183           0.241
  ASUS.2      segmentation hyper-parameter   1.60            2.918
              t_1, t_2                       0.420, 4.104    0.139, 1.8
              n_1, n_2                       446, 4554       1024, 3976
              risk                           0.126           0.161
  Aux-Scr.1   segmentation hyper-parameter   5.920           34.218
              t_1, t_2                       0.424, -        0, -
              n_1, n_2                       5000, 0         5000, 0
              risk                           0.189           0.254
  Aux-Scr.2   segmentation hyper-parameter   7.375           50.958
              t_1, t_2                       0.424, -        0, -
              n_1, n_2                       5000, 0         5000, 0
              risk                           0.189           0.254
  SureShrink  risk                           0.189           0.254
  EBT         risk                           0.257           0.402
  EJS         risk                           0.411           0.431

In table A.2, we report the estimates of the hyper-parameters of Aux-Scr under the setting of the simulation experiment described in chapter 2.4.2.

Table A.2: Risk estimates and estimates of T for Aux-Scr at n = 5000 under the setting of the simulation experiment described in chapter 2.4.2. Here n_k^* = |I_k^*| and n_k = |\hat{I}_k| for k = 1, 2.

                                             Scenario S1     Scenario S2
  Aux-Scr.1   segmentation hyper-parameter   1.326           0.981
              t_1, t_2                       4.612, 0.109    4.618, 0.159
              n_1, n_2                       4747, 253       4002, 998
              risk                           0.097           0.243
  Aux-Scr.2   segmentation hyper-parameter   11.231          10.752
              t_1, t_2                       4.601, 0.106    4.580, 0.160
              n_1, n_2                       4748, 252       3976, 1024
              risk                           0.095           0.232
  Aux-Scr.3   segmentation hyper-parameter   1.774           1.460
              t_1, t_2                       4.668, 0.665    4.760, 0.544
              n_1, n_2                       4261, 739       3290, 1710
              risk                           0.147           0.360
  Aux-Scr.4   segmentation hyper-parameter   5.072           2.004
              t_1, t_2                       3.036, 1.043    3.500, 0.851
              n_1, n_2                       2023, 2977      984, 4016
              risk                           0.186           0.414

A.4. Microarray Time Course (MTC) Data

Our second real data application is an MTC dataset collected by Calvano et al. (2005) for studying systemic inflammation in humans, and it is an example of a setting where ASUS can be used for two sample estimation problems. This dataset contains eight study subjects who are randomly assigned to case and control groups and then administered endotoxin and placebo, respectively. The expression levels of n = 22,283 genes in human leukocytes are measured before infusion (0 hour) and at 2, 4, 6, 9, and 24 hours afterwards. One of the goals of this experiment is to identify, in the case group, early to middle response genes that are differentially expressed within 4 hours and thus reveal a meaningful early activation gene sequence that governs the immune responses. As discussed in Sun and Wei (2011), the early activation sequence quickly activates many secreted pro-inflammatory factors in response to exterior intrusion. These activated factors subsequently trigger the expression of several transcription factors to initiate the immune response. In the late period, the expression levels of a number of transcription factors limiting the immune response are increased. Finally, the whole system concludes with full recovery and a normal phenotype.
To identify the genes that regulate this sequence, we take time point 0 as the baseline and time points 4 and 24 as the interval over which differential gene expression will be estimated. We follow the data preprocessing steps outlined in Sun and Wei (2011) and denote by Y_{i,j} the arcsinh transformed average gene expression value for gene i at time point j. Let Y_i = \tilde{Y}_{i,4} - \tilde{Y}_{i,24}, where \tilde{Y}_{i,j} = Y_{i,j} - Y_{i,0} denotes the baseline adjusted expression level of gene i at time point j. The side information that we use in this setting is X_i = \tilde{Y}_{i,4} + (\tilde{\sigma}_{i,4}/\tilde{\sigma}_{i,24}) \tilde{Y}_{i,24} with S_i = |X_i| and K = 2, where \tilde{\sigma}_{i,j} is the observed standard deviation of \tilde{Y}_{i,j} across the 4 replicates.

In figure A.3a right, the dotted line represents the SURE estimate of the risk of \hat{\delta}^S(t) at t = 1.13. ASUS uses the information in (Y_i, S_i) and returns an estimate of risk (the red dot) that is significantly smaller than the risk estimate returned by \hat{\delta}^S(t). In order to evaluate the results in a predictive framework, we next used 3 out of the 4 replicates for calibrating the hyper-parameters and calculated the prediction errors of the ASUS and SureShrink procedures based on the held out fourth replicate. Here, the risk reduction by ASUS compared to SureShrink is almost 9%. Figure A.3a left presents a heatmap of the expression levels Y_i in the top panel and a heatmap of the associated side information S_i in the bottom panel. The expression levels for the genes are ordered in terms of their magnitude. Notice how the magnitude of the side information S_i follows the pattern of the expression levels Y_i, largely indicating that Y_i is small whenever S_i is small. ASUS exploits this extra information in S_i and thus performs better than the SureShrink estimator, which relies only on the information in Y_i. Figure A.3b left presents the distribution of gene expression for genes that belong to the groups \hat{I}_1 and \hat{I}_2. In this example, group \hat{I}_2 holds only about 2% of the n genes and is therefore inconspicuous in this plot. We present a magnified version of this plot on the right, which shows in green the distribution of gene expression for genes that belong to \hat{I}_2, and we summarize the results in table A.3.

Table A.3: Summary of the performance of SureShrink and ASUS on MTC data. Here n_k = |\hat{I}_k| for k = 1, 2.

  MTC          n                              22,283
  SureShrink   t                              1.13
               SURE estimate                  1.32
  ASUS         segmentation hyper-parameter   4.11
               t_1                            1.3
               t_2                            0.04
               n_1                            21,791
               n_2                            492
               SURE estimate                  0.62

In this real data example, a reduction in risk is possible because ASUS has efficiently exploited the sparsity information encoded in S. This can be seen, for example, from the stark contrast between the magnitudes of the thresholding hyper-parameters t_1 and t_2 in table A.3. Moreover, the risk of Aux-Scr for this example was seen to be no better than that of the SureShrink estimator and has therefore been excluded from the results reported in table A.3.

A.5. Choice of K

We consider the toy example discussed in chapter 2.2.3 and let K \ge 2. For each candidate value of K we plot the SURE estimate of the risk of ASUS in figure A.4 left. An estimate of K may be taken to be the one that appears at the elbow of this plot, which implies \hat{K} = 2. Often a larger value of K, say K = 5 or 6, may continue to provide a marginal reduction in overall risk compared to K = 2, as seen in this example, but such a reduction in risk comes at the cost of the increased computational burden of conducting a search over O(m_n^{K-1}) points for large n. A cross validation based approach for selecting K in such scenarios is often useful.
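To make the elbow based selection of K concrete, the sketch below computes a SURE type risk estimate over a grid of candidate K values. It is only a simplified stand-in for ASUS, not the actual procedure: the noise variance is taken to be one, the groups are formed by quantiles of the side information rather than by the full search over segmentation hyper-parameters described above, and all function names are ours.

import numpy as np

def sure_soft(y, t):
    # SURE for coordinatewise soft thresholding at level t (unit noise variance)
    return np.sum(1.0 - 2.0 * (np.abs(y) <= t) + np.minimum(y ** 2, t ** 2))

def group_sure(y):
    # Minimized SURE over the usual grid of candidate thresholds for one group
    ts = np.concatenate(([0.0], np.sort(np.abs(y))))
    return min(sure_soft(y, t) for t in ts)

def asus_sure(y, s, K):
    # Total SURE when the n coordinates are split into K groups by quantiles of
    # the side information s (a simplification of the ASUS segmentation search)
    qs = np.quantile(s, np.linspace(0.0, 1.0, K + 1))
    total = 0.0
    for k in range(K):
        upper = (s <= qs[k + 1]) if k == K - 1 else (s < qs[k + 1])
        mask = (s >= qs[k]) & upper
        if mask.any():
            total += group_sure(y[mask])
    return total / len(y)

# Elbow plot over candidate K, cf. figure A.4 (left):
# risks = [asus_sure(y, s, K) for K in range(1, 11)]

A cross validated variant would replace the SURE criterion by the prediction error on held out replicates, in line with the suggestion above.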
On a related note, in figure A.4 right we demonstrate how ASUS reaps the benefits of adapting to the informativeness of side information. We continue with the toy example of chapter 2.2.3 and estimate the hyper-parameter T for K = 2. In contrast to this scheme, we also construct a version of ASUS where the number of groups K is automatically set to n/\log n without the aid of any side information and only the thresholding hyper-parameters t_k are determined using the data driven hybrid scheme of Donoho and Johnstone (1995). The risk of this estimator is denoted by the green dotted line in figure A.4, which clearly indicates that ASUS provides a better risk performance when the segmentation hyper-parameter is chosen in a data driven adaptive fashion.

Figure A.3: (a) Left: Heatmaps of the gene expressions Y and the side information S. Right: SURE estimate of the risk of \hat{\delta}^S_i(t) at t = 1.13 versus an unbiased estimate of the risk of ASUS for different values of the segmentation hyper-parameter. (b) Left: Histogram of the gene expressions Y; Group 1 is \hat{I}_2 and Group 0 is \hat{I}_1. Right: A magnified plot to show \hat{I}_2.

Figure A.4: Toy example of chapter 2.2.3. Left: SURE estimate of the risk of ASUS as K varies. Right: Estimates of the risks of ASUS with K = 2, SureShrink, and ASUS with no side information but with K = n/\log n.

Appendix B

Technical details related to Chapter 3

We first present in Appendix B.1 the results for the NEB estimator under the regular squared error loss, and then provide in Appendix B.2 the proofs and technical details of all the theoretical results in the main text and Appendix B.1.

B.1. Results under the squared error loss

B.1.1 The NEB estimator

In this section we discuss the estimation of w^{(0)}_p, which appears in lemma 3.2.1, under the usual squared error loss (k = 0). Let Y be a non-negative integer-valued random variable with probability mass function (pmf) p and define

h^{(0)}_0(y) = (y + 1)/w^{(0)}_p(y) - y,  y \in \{0\} \cup \mathbb{N}.   (B.1)

Suppose K_\lambda(y, y') = \exp\{-0.5\,\lambda^2 (y - y')^2\} is the positive definite RBF kernel with bandwidth parameter \lambda \in \Lambda, where \Lambda is a compact subset of \mathbb{R}_+ bounded away from 0. Given observations y = (y_1, ..., y_n) from model (3.2), let h^{(0)}_0 = (h^{(0)}_0(y_1), ..., h^{(0)}_0(y_n)) and define the following n x n matrices: n^2 K_\lambda = [K_\lambda(y_i, y_j)]_{ij}, n^2 \nabla K_\lambda = [\Delta_{y_i} K_\lambda(y_i, y_j + 1)]_{ij} and n^2 \nabla^2 K_\lambda = [\Delta_{y_i, y_j} K_\lambda(y_i, y_j)]_{ij}, where \Delta_y K_\lambda(y, y') = K_\lambda(y + 1, y') - K_\lambda(y, y') and \Delta_{y, y'} K_\lambda(y, y') = \Delta_{y'} \Delta_y K_\lambda(y, y') = \Delta_y \Delta_{y'} K_\lambda(y, y').

Definition 1B (NEB estimator of \theta_i). Consider the DLE model (3.2) with loss \ell^{(0)}(\theta_i, \delta_i). For a fixed \lambda \in \Lambda, let \hat{w}^{(0)}_i(\lambda) = (y_i + 1)/(y_i + \hat{h}^{(0)}_i(\lambda)) and \hat{h}^{(0)}_n(\lambda) = (\hat{h}^{(0)}_1(\lambda), ..., \hat{h}^{(0)}_n(\lambda)) be the solution to the following quadratic optimization problem:

min_{h \in H_n} \hat{M}_{\lambda,n}(h) = h^T K_\lambda h + 2 h^T \nabla K_\lambda y + y^T \nabla^2 K_\lambda y,   (B.2)

where H_n = \{h = (h_1, ..., h_n) : A h \le b, C h = d\} is a convex set and A, C, b, d are known real matrices and vectors that enforce linear constraints on the components of h. Then the NEB estimator for a fixed \lambda is given by \delta^{neb}_{(0)}(\lambda) = \{\delta^{neb}_{(0),i}(\lambda) : 1 \le i \le n\}, where

\delta^{neb}_{(0),i}(\lambda) = (a_{y_i}/a_{y_i+1})\, \hat{w}^{(0)}_i(\lambda),  if y_i \in \{0, 1, 2, ...\}.

Theorem 2B. Let K_\lambda(\cdot, \cdot) be the positive definite RBF kernel with bandwidth parameter \lambda \in \Lambda.
If lim n!1 c n n 1=2 logn = 0 then, under assumptionsA1A3, we have for any2 , lim n!1 P h ^ w (0) n ()w (0) p 2 c 1 n i = 0; for any> 0 where ^ w (0) n () = [(Y i + 1)=( ^ h (0) i () +Y i )] i . We now provide some motivation behind the minimization problem in definition 1B for estimating the ratio functionalsw (0) p . Suppose ~ p be a probability mass function on the support ofY and define S [~ p](p) =E p h ( ~ h (0) (Y )h (0) 0 (Y ))K (Y + 1;Y 0 + 1)( ~ h (0) (Y 0 )h (0) 0 (Y 0 )) i (B.3) where h (0) 0 ; ~ h (0) are as defined in equation (B.1) and Y;Y 0 are i.i.d copies from the marginal distribution that has mass functionp.S [~ p](p) in equation (B.3) is the Kernelized Stein’s Discrepancy (KSD) measure 188 that can be used to distinguish between two distributions with mass functionsp; ~ p such thatS [~ p](p) 0 andS [~ p](p) = 0 if and only ifp = ~ p (Liu et al. 2016, Chwialkowski et al. 2016). MoreoverS [~ p](p) = E (U;V ) i:i:d p h [~ p](U;V ) i where [~ p](u;v) is ~ h (0) (u) ~ h (0) (v)K (u;v) + ~ h (0) (u)v v K (u + 1;v) + ~ h (0) (v)u u K (u;v + 1) +uv u;v K (u;v) (B.4) An empirical evaluation scheme forS [~ p](p) is given byS [~ p](^ p n ) where S [~ p](^ p n ) = 1 n 2 n X i;j=1 [~ p](y i ;y j ) (B.5) andY = (Y 1 ;:::;Y n ) is a random sample from the marginal distribution with mass functionp with empir- ical CDF ^ p n . Note that [~ p](u;v) in equation (B.4) involves ~ p only through ~ h (0) and may analogously be denoted by [ ~ h(u); ~ h(v)](u;v) := [~ p](u;v) where we have dropped the superscript from ~ h that indicates that the loss in question is the regular squared error loss. This slight abuse of notation is harmless as the discussion in this section is geared towards the squared error loss only. Under the empirical Bayes compound estimation framework of model (3.2), our goal is to estimateh (0) 0 . To do that we minimizeS [~ p](^ p n ) in equation (B.5) with respect to the unknowns ~ h = ( ~ h(y 1 );:::; ~ h(y n )) and the sample criteria ^ M ;n ( ~ h) = 1 n 2 n X i;j=1 [ ~ h(y i ); ~ h(y j )](y i ;y j ) is thus the objective function of the optimisation problem in equation (B.2) with optimisation variables h i ~ h(y i ) fori = 1;:::;n. Note that ^ M ;n ( ~ h) above is a V-statistic and a biased estimator of the population criteriaM ( ~ h) that is defined in equation (3.12) with [ ~ h(u); ~ h(v)](u;v) given by equation (B.4). 189 B.1.2 Bandwidth choice and asymptotic properties We propose the following asymptotic risk estimate of the true risk of neb (0) () in the Poisson and Binomial model. Definition 2B (ARE of neb (0) () in the Poisson model). Suppose Y i j i ind: Pois( i ). Under the loss ` (0) ( i ;) an ARE of the true risk of neb (0) () is ARE (0;P) n () = 1 n n n X i=1 y i (y i 1) 2 n X i=1 y i (y i ) + n X i=1 f neb (0);i ()g 2 o where (y i ) = neb (0);j (); y i 1 withj2f1;:::;ng such thaty j =y i 1. Definition 3B (ARE of neb (0) () in the Binomial model). SupposeY i jq i Bin(m;q i ) so that in equation (3.1)a y i = m y i and i =q i =(1q i ) in equation (3.2). Under the loss` (0) ( i ;) an ARE of the true risk of neb (0) () is ARE (0;B) n () = 1 n n n X i=1 y i (y i 1) (my i + 2)(my i + 1) 2 n X i=1 y i (y i ) + n X i=1 f neb (0);i ()g 2 o where (y i ) = neb (0);j ()=(my i + 1); y i 1 withj2f1;:::;ng such thaty j =y i 1. 190 Note that if for some indexi,y i 1 is not available in the observed sampley, (y i ) can be calculated using cubic splines. 
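As a computational companion to definition 2B, the sketch below evaluates the Poisson ARE criterion at a single bandwidth value. The function and argument names are ours, the NEB estimates delta are assumed to have been computed already for that bandwidth, and the cubic spline plays exactly the fallback role suggested above when the count y_i - 1 does not appear in the sample.

import numpy as np
from scipy.interpolate import CubicSpline

def are_poisson(y, delta):
    # ARE of the NEB rule under squared error loss in the Poisson model (definition 2B).
    # y: observed counts; delta: NEB estimates at the observed counts for one bandwidth.
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=float)
    # psi(y_i) is the NEB estimate evaluated at the count y_i - 1; interpolate the
    # fitted rule over the distinct observed counts so that unobserved counts are
    # handled by a cubic spline (assumes several distinct counts are observed).
    uniq, idx = np.unique(y, return_index=True)
    rule = CubicSpline(uniq, delta[idx])
    psi = np.where(y >= 1.0, rule(np.maximum(y - 1.0, 0.0)), 0.0)
    n = len(y)
    return (np.sum(y * (y - 1.0)) - 2.0 * np.sum(y * psi) + np.sum(delta ** 2)) / n

# The bandwidth can then be chosen by minimizing this criterion over a grid, as in (B.6).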
We propose the following estimate of the tuning parameter based on the ARE: ^ = 8 > > > > < > > > > : argmin 2 ARE (0;P) n (); ifY i j i ind: Pois( i ) argmin 2 ARE (0;B) n (); ifY i jq i ind: Bin(m;q i ) (B.6) where a choice of = [10; 10 2 ] work well in the simulations and real data analyses of chapters 3.4 and 3.5. Lemmata 3 and 2 continue to provide the large-sample properties of the proposed ARE (0;P) n , ARE (0;B) n criteria. To analyze the quality of the estimates ^ obtained from equation (B.6), we consider an oracle loss estimator or (0) := neb (0) ( orc 0 ) where orc 0 = argmin 2 L (0) n (; neb (0) ()) and Lemma 4 establishes the asymptotic optimality of ^ obtained from equation (B.6). In theorem 3B below we provide decision theoretic guarantees on theNEB estimator and show that the largest coordinate-wise gap between neb (0) ( ^ ) and (0) is asymptotically small. Theorem 3B. Under the conditions of Theorem 2B, if lim n!1 c n n 1=2 log 3 n = 0 then, for the Poisson and the Binomial model, c n neb (0) ( ^ ) (0) 1 =o p (1): Furthermore, under the same conditions, we have for the Poisson and the Binomial model, lim n!1 P h L (0) n n ; neb (0) ( ^ ) o L (0) n (; (0) ) +c 1 n i = 0 for any> 0: 191 B.2. Proofs We will begin this section with some notations and then state two lemmas that will be used in proving the statements discussed in chapter 3.3. Letc 0 ;c 1 ;::: denote some generic positive constants which may vary in different statements. LetD n = f0; 1; 2;:::;C logng and given a random sample (Y 1 ;:::;Y n ) from model (3.2) denoteB n to be the event fmax 1in Y i C logng whereC is the constant given by lemma A below under assumption (A2). Lemma A. Assumption (A2) implies that with probability tending to 1 asn!1, max(Y 1 ;:::;Y n )C logn whereC > 0 is a constant depending on. Our next lemma below is a statement on the pointwise Lipschitz stability of the optimal solution ^ h (k) n () under perturbations on the parameter2 . See, for example, Bonnans and Shapiro (2013) for general results on the stability and sensitivity of parametrized optimization problems. Lemma B. Let ^ h (k) n ( 0 ) be the solution to problems (3.8) and (B.2), respectively, fork2f0; 1g and for some 0 2 . Then, under Assumption (A3), there exists a constantL > 0 such that for any2 the solution ^ h (k) n () to problems (3.8) and (B.2) satisfies ^ h (k) n () ^ h (k) n ( 0 ) 2 Lj 0 j 192 B.2.1 Proof of Lemma 3.2.1 First note that for any coordinate i, the integrated Bayes risk of an estimator (k);i of i is P y i R p(y i j i )` (k) n ( i ; (k);i )dG( i ) which is minimized with respect to (k);i if for each y i , (k);i (y i ) is defined as (k);i (y i ) = argmin (k);i Z p(y i j i )` (k) n ( i ; (k);i )dG( i ) However, R p(y i j i )` (k) n ( i ; (k);i )dG( i ) is a minimum with respect to (k);i when (k);i (y i ) = R p(y i j i ) 1k i dG( i ) R p(y i j i ) k i dG( i ) The result then follows by noting that p(y i k) = R a y i k y i k i =g( i )dG( i ), and p(y i + 1 k) = R a y i +1k y i +1k i =g( i )dG( i ) for y i =k;k + 1;:::. B.2.2 Proof of Theorem 1 Define ~ M (h) = P i;j2Dn [h(i);h(j)](i;j)P(Y =i)P(Y =j) and re-write ^ M ;n (h) as ^ M ;n (h) = 1 n 2 X i;j2Dn [h(i);h(j)](i;j)C ij ; whereC ij is the number of pairs (Y r ;Y s ) in the sample that hasY r =i;Y s =j. 
Now, we have sup 2 ^ M ;n (h)M (h) sup 2 ^ M ;n (h) ~ M (h) + sup 2 M (h) ~ M (h) (B.7) 193 Consider the first term on the right hand side of the inequality in equation (B.7) above and withP i :=P(Y = i) note that assumptionA2 and lemma A imply E sup 2 ^ M ;n (h) ~ M (h) X i;j2Dn E h sup 2 [h(i);h(j)](i;j) C ij n 2 P i P j in 1 +o(1) o X i;j2Dn n E h sup 2 [h(i);h(j)](i;j) i 2 E C ij n 2 P i P j 2 o 1=2 n 1 +o(1) o (B.8) In equation (B.8) above,Ejn 2 C ij P i P j j 2 isO(1=n) and assumptionA1 together with the compactness of and the continuity of [h(i);h(j)](i;j) with respect to imply thatE[sup 2 j [h(i);h(j)](i;j)j] 2 < 1. ThusE sup 2 j ^ M ;n (h) ~ M (h)j isO(log 2 n=n 1=2 ). Now consider the second term on the right hand side of the inequality in equation (B.7) and note that it is bounded above by the following tail sums 2 X i2Dn;j= 2Dn sup 2 j [h(i);h(j)](i;j)jP i P j + X i;j= 2Dn sup 2 j [h(i);h(j)](i;j)jP i P j : But from Assumption (A1),E p sup 2 j [h(U);h(V )](U;V )j <1 and together with assumption (A2) and proof of lemma A, it follows that the terms in the display above areO(n ) for some > 1=2. Now fix an > 0 and letc n = n 1=2 = log 2 n. SinceE p sup 2 j ^ M ;n (h) ~ M (h)j isO(log 2 n=n 1=2 ) there exists a finite constant M > 0 and an N 1 such that c n E sup 2 j ^ M ;n (h) ~ M (h)j M for all n N 1 . Moreover since sup 2 jM (h) ~ M (h)j ! 0 as n ! 1, there exists an N 2 such that sup 2 jM (h) ~ M (h)jM=c n for allnN 2 . Thus witht = 4M=, we haveP(c n sup 2 j ^ M ;n (h) M (h)j>t)< for alln max(N 1 ;N 2 ) which suffices to prove the desired result. 194 B.2.3 Proofs of Theorems 2A and 2B We will first prove Theorem 2A. Note that from equation (3.7), ^ w (1) n ()w (1) p 2 = ^ h (1) n ()h (1) 0 2 : Now from assumptionA3 and for any> 0, there exists a> 0 such that for any2 , P h c n ^ h (1) n ()h (1) 0 2 i P h c n n M ( ^ h (1) n )M (h (1) 0 ) o i : But the right hand side is upper bounded by the sum of P h c n n M ( ^ h (1) n ) ^ M ;n ( ^ h (1) n ) o =3 i , P h c n n ^ M ;n ( ^ h (1) n ) ^ M ;n (h (1) 0 ) o =3 i andP h c n n ^ M ;n (h (1) 0 )M (h (1) 0 ) o =3 i . From theorem 1, the first and third terms go to zero asn!1 while the second term is zero since ^ M ;n ( ^ h (1) n ) ^ M ;n (h (1) 0 ). This proves the statement of theorem 2A. To prove theorem 2B first note that from equation (B.1), ^ w (0) n ()w (0) p 2 2 n X i=1 n Y i + 1 ^ h (0) n;i ()h (0) 0;i o 2 n ^ h (0) n;i ()h (0) 0;i o 2 : From assumptionA2 and Lemma A, there exists a constantc 0 > 0 such that for largen, max 1in (Y i +1) c 0 logn with high probability. Moreover fori = 1; ;n, since ^ w (0) n;i ()> 0 for every2 andw (0) p;i > 0, lemma A and equation (B.1) together imply ^ h (0) n;i ()+c 0 logn> 0 andh (0) 0;i +c 0 logn> 0. Thus, conditional on the eventfmax 1in (Y i + 1)c 0 logng and for any> 0, P h c n logn ^ w (0) n ()w (0) p 2 i P h c 1 c n ^ h (0) n ()h (0) 0 2 i for some constantc 1 > 0. The proof of the statement of theorem 2B thus follows from the proof of theorem 2A above and lemma A. 195 B.2.4 Proof of Lemma 2 We will first prove the two statements of lemma 2 under the scaled squared error loss. The proof for the squared error loss will follow from similar arguments and we will highlight only the im- portant steps. Throughout the proof, we will denote d 1 := inf 2 inf 1in (1 ^ h (1) n;i ()) > 0 and d 2 := inf 2 inf 1in ^ w (0) n;i ()> 0. 
Proof of statement 1 for the scaled squared error loss (k = 1) Note that by triangle inequality sup 2 ARE (1;B) n (;Y )R (1) n (; neb (1) ()) is upper bounded by the following sum sup 2 ARE (1;B) n (;Y )EARE (1;B) n (;Y ) + sup 2 EARE (1;B) n (;Y )R (1) n (; neb (1) ()) : Consider the first term. Using definition 3A, this term is upper bounded by 1 n n X i=1 U i + sup 2 2 n n X i=1 n neb (1);i ()E neb (1);i () o + sup 2 1 n n X i=1 n (mY i ) (Y i )E(mY i ) (Y i ) o (B.9) whereU i = Y i =(mY i + 1) i ,EU i = 0 andEU 2 i <1 sinceY i =(mY i + 1) m <1 for all i = 1; ;n. So, n 1 P n i=1 U i =O p (n 1=2 ). 196 Now consider the second term in equation (B.9) above and define Z n () = n 1 P n i=1 V i () where V i () = neb (1);i ()E neb (1);i (). Recall that for everyi such thatY i > 0, 1 ^ h (1) n;i () > 0 for all2 . Moreover, from definition 1A for the Binomial model, neb (1);i () = [Y i =(mY i + 1)] n 1 ^ h (1) n;i () o 1 : Thus, neb (1);i () m=d 1 andjV i ()j 2m=d 1 for alli = 1;:::;n. Along with the fact thatEV i () = 0 Hoeffding’s inequality gives, for a fixed and (for now) arbitraryr n > 1 P n jZ n ()j> r n p n o 2 exp n r 2 n d 2 1 8m 2 o (B.10) Next for a perturbation 0 of such that (; 0 )2 := [ l ; u ], we will bound the increments Z n () Z n ( 0 ) . To that effect, note that njZ n ()Z n ( 0 )jk neb (1) () neb (1) ( 0 )k 1 +Ek neb (1) () neb (1) ( 0 )k 1 andd 2 1 k neb (1) () neb (1) ( 0 )k 1 mk ^ h (1) n () ^ h (1) n ( 0 )k 1 . Now from lemma B we know that k ^ h (1) n () ^ h (1) n ( 0 )k 1 n 1=2 c 1 j 0 j sup h2N ( ^ h (1) n ()) kr 2 hn; ^ M ;n (h) +o(1)k 2 : Moreover, the Binomial model withm <1 and assumptionA4 imply that the supremum in the display above isO(logn). Thus so long asj 0 j n , Z n ()Z n ( 0 ) c 0 n logn p n : 197 Now choose j =j n 2 and note thatA n =fsup 2 jZ n ()j> 3r n = p ngD n [E n where D n =fsup j jZ n ( j )j>r n = p ng andE n =fsup j sup j j jn jZ n ()Z n ( j )j> 2r n = p ng: Choose n so that n n 1=2 logn = o(n 1=2 ) and note that P(E n ) expf2r 2 n g and P(D n ) 2( u = n ) expfr 2 n d 2 1 =8m 2 g from equation (B.10) and the cardinality of j . Thus, P (A n ) 2( u = n ) expfr 2 n d 2 1 =8m 2 g + expf2r 2 n g Setr n = (s logn) 1=2 (m p 8=d 1 ) = O( p logn). ThenP(A n ) n s 3( u = n ) and thus the second term in equation (B.9) isO p ( p logn=n). We will now consider the third term in equation (B.9) and analyze it in a similar manner to the second term of equation (B.9). Here we will assume that the setI i =fj : Y j = Y i + 1g is non-empty for every i = 1;:::;n. Recall from definition 3A that (mY i ) (Y i ) m=d 2 1 for allY i = 0; 1;:::;m. Define Z n () = n 1 P n i=1 V i () where V i () = (mY i ) (Y i )E(mY i ) (Y i ) withEV i () = 0 and jV i ()j 2m=d 2 1 . Hoeffding’s inequality gives, for a fixed and (for now) arbitraryr n > 1 P n jZ n ()j> r n p n o 2 exp n r 2 n d 4 1 8m 2 o Moreover, for a perturbation 0 of such that (; 0 )2 , the incrementsnjZ n ()Z n ( 0 )j are bounded above bymk 0k 1 +mEk 0k 1 where = ( (Y 1 );:::; (Y n )) for any2 . Now, note that from definition 3A and for the Binomial model, j (Y i ) 0(Y i )j 2(m 2 =d 3 1 )j ^ h n;j () ^ h n;j ( 0 )j; 198 wherej2I i : Therefore jZ n ()Z n ( 0 )j 2 m 3 nd 3 1 h k ^ h (1) n () ^ h (1) n ( 0 )k 1 +Ek ^ h (1) n () ^ h (1) n ( 0 )k 1 i : Now lemma B and assumption A4 imply that the right hand side of the inequality above is j 0 jO(n 1=2 logn). So as long asj 0 j n , choose j = j n 2 . 
In a manner similar to the second term of equation (B.9) define the eventsA n ; D n andE n , and setr n = (s logn) 1=2 (m p 8=d 2 1 ) to conclude that the third term of equation (B.9) continues to be isO p ( p logn=n) which suffices to prove the statement of the result. Proof of statement 2 for the scaled squared error loss (k = 1) Note that sup 2 jARE (1;B) n (;Y )L (1) n (; neb (1) ())j is bounded above by the sum of: sup 2 jARE (1;B) n (;Y )R (1) n (; neb (1) ())j and sup 2 jL (1) n (; neb (1) ())EL (1) n (; neb (1) ())j. The first term isO p ( p logn=n) from statement 1. The second term is bounded above by 2 n sup 2 n X i=1 n neb (1);i ()E neb (1);i () o + 1 n sup 2 n X i=1 1 i nh neb (1);i () i 2 E h neb (1);i () i 2 o (B.11) where the first term in equation (B.11) is O p ( p logn=n) from the proof of statement 1. Now consider the second term in equation (B.11) and defineZ n () = n 1 P n i=1 V i () where i V i () = [ neb (1);i ()] 2 E[ neb (1);i ()] 2 . Recall that from definition 1A for the Binomial model, neb (1);i () = [Y i =(mY i + 1)] n 1 ^ h (1) n;i () o 1 : 199 Thus, [ neb (1);i ()] 2 m 2 =d 2 1 andjV i ()j 2 1 i m 2 =d 2 1 for alli = 1;:::;n. For an arbitraryr n > 1 and fixed, we have from Hoeffding’s inequality, P n jZ n ()j> r n p n o 2 exp n r 2 n d 4 1 n 8m 2 P n i=1 1 i o (B.12) Moreover for a perturbation 0 of such that (; 0 )2 , n X i=1 1 i [ neb (1);i ()] 2 [ neb (1);i ( 0 )] 2 2m 2 d 2 1 ^ h (1) n () ^ h (1) n ( 0 ) 2 n X i=1 1 i : Thus lemma B, assumption A4 and the display above together imply that njZ n () Z n ( 0 )j is bounded above by c 1 j 0 j logn P n i=1 1 i . Now as long asj 0 j n , choose n so that n n 1 logn P n i=1 1 i = o(n 1=2 ) and along with equation (B.12), follow the steps outlined in the proof of the second term in equation (B.9) to conclude that the second term in equation (B.11) isO p ( p logn=n) from which the desired result follows. Proof of statement 1 for the squared error loss (k = 0) The proof of this statement is very similar to the proof of statement 1 under the scaled squared error loss and therefore we highlight the important steps here. To prove statement 1, we will only look at the term sup 2 ARE (0;B) n (;Y ) EARE (0;B) n (;Y ) because under the Binomial 200 model, it can be verified using definition 3B that EARE (0;B) n () = R (0) n (; neb (0) ()). Now note that sup 2 ARE (0;B) n (;Y )EARE (0;B) n (;Y ) is bounded above by 1 n n X i=1 U i + sup 2 2 n n X i=1 n Y i (Y i )EY i (Y i )g + 1 n sup 2 n X i=1 nh neb (0);i () i 2 E h neb (0);i () i 2 o (B.13) whereU i =Y i (Y i 1)=[(mY i + 2)(mY i + 1)] 2 i ,EU i = 0 andEU 2 i <1 sincejU i jm 2 <1 for alli = 1; ;n. So, n 1 P n i=1 U i = O p (n 1=2 ). For the second term in equation (B.13) note that from definition 3B,Y i (Y i )m 2 =d 2 and for a perturbation 0 of such that (; 0 )2 , n X i=1 Y i (Y i ) 0(Y i ) c 2 j 0 j p n lognf1 +o(1)g: The last inequality in the display above follows from definition 3B, lemma B and assumptionA4. Thus the upper bound on the second term of equation (B.13) and the corresponding upper bound on its increments over (; 0 )2 suffice to show that this term isO p ( p logn=n). Finally, the third term in equation (B.13) is bounded above by 4m 2 =d 2 2 andk neb (0) () neb (0) ( 0 )k 1 isj 0 jO( p n logn) for (; 0 )2 from which the desired follows that sup 2 ARE (0;B) n (;Y )EARE (0;B) n (;Y ) isO p ( p logn=n). 
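The uniform-in-the-bandwidth bounds established above and below all follow the same discretization pattern. The display below records it schematically, with a generic tuning parameter \lambda ranging over \Lambda = [\lambda_l, \lambda_u] and generic placeholders B_n (the coordinatewise bound), a_n (the Lipschitz rate of the increments) and \delta_n (the grid spacing) standing in for the case specific quantities.

% Schematic of the recurring discretization argument (generic notation).
% Z_n(\lambda) = n^{-1}\sum_{i=1}^n V_i(\lambda), \quad E\,V_i(\lambda)=0, \quad |V_i(\lambda)|\le B_n,
% |Z_n(\lambda)-Z_n(\lambda')|\le c\,\delta_n a_n/\sqrt{n} \ \text{whenever}\ |\lambda-\lambda'|\le\delta_n.
\begin{align*}
\text{(i) Hoeffding at a fixed } \lambda:\quad
  & P\big(|Z_n(\lambda)| > r_n/\sqrt{n}\big) \le 2\exp\!\big(-r_n^2/(2B_n^2)\big),\\
\text{(ii) union bound over the grid } \lambda_j = j\delta_n:\quad
  & P\big(\max_j |Z_n(\lambda_j)| > r_n/\sqrt{n}\big) \le 2(\lambda_u/\delta_n)\exp\!\big(-r_n^2/(2B_n^2)\big),\\
\text{(iii) increments within grid cells:}\quad
  & \sup_{\lambda\in\Lambda}|Z_n(\lambda)| \le \max_j |Z_n(\lambda_j)| + c\,\delta_n a_n/\sqrt{n}.
\end{align*}
% Taking r_n \asymp B_n\sqrt{\log n} and choosing \delta_n so that \delta_n a_n = o(1) yields
% \sup_{\lambda\in\Lambda}|Z_n(\lambda)| = O_p\big(B_n\sqrt{\log n}/\sqrt{n}\big).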
Proof of statement 2 for the squared error loss (k = 0) We will only look at the term sup 2 jL (0) n (; neb (0) ()) EL (0) n (; neb (0) ())j and show that it is O p ( p logn=n). Note that 2 n sup 2 n X i=1 i n neb (0);i ()E neb (0);i () o + 1 n sup 2 n X i=1 nh neb (0);i () i 2 E h neb (0);i () i 2 o (B.14) 201 is an upper bound on sup 2 jL (0) n (; neb (0) ())EL (0) n (; neb (0) ())j. Analogous to the preceding proofs under lemma 2, we will provide the upper bounds and bounds on the increments with respect to perturbations on for the two terms in equation (B.14). The rest of the proof will then follow from the proof of statement 1 in lemma 2. From definition 1B, we know that neb (0);i () 2m=d 2 . Thus n X i=1 j i neb (0);i ()j 2(m=d 2 ) n X i=1 i : Moreover, from Lemma B and Assumption (A4), n X i=1 i neb (0);i () neb (0);i ( 0 ) c 3 j 0 j n X i=1 i lognf1 +o(1)g and n X i=1 h neb (0);i () i 2 [ neb (0);i ( 0 ) i 2 c 4 j 0 j p n lognf1 +o(1)g for (; 0 )2 . Hoeffding’s inequality and the steps outlined in the proof of statement 1 (k = 1) of lemma 2 will then show that both the terms in equation (B.14) areO p ( p logn=n) which proves the desired result. B.2.5 Proof of Lemma 3 We will first prove the two statements of lemma 3 under the scaled squared error loss. The proof for the squared error loss will follow from similar arguments and we will highlight only the im- portant steps. Throughout the proof, we will denote d 1 := inf 2 inf 1in (1 ^ h (1) n;i ()) > 0 and d 2 := inf 2 inf 1in ^ w (0) n;i ()> 0. Proof of statement 1 for the scaled squared error loss (k = 1) 202 Note that by triangle inequality sup 2 ARE (1;P) n (;Y )R (1) n (; neb (1) ()) is upper bounded by the following sum: sup 2 ARE (1;P) n (;Y )EARE (1;P) n (;Y ) + sup 2 EARE (1;P) n (;Y )R (1) n (; neb (1) ()) : Under the Poisson model and definition 2A it can be shown that E n ARE (1;P) n (;Y ) o =R (1) n (; neb (1) ()): So the second term in the display above is zero. Now consider the first term and note that it is bounded above by 1 n n X i=1 U i + sup 2 2 n n X i=1 n neb (1);i ()E neb (1);i () o + sup 2 1 n n X i=1 n (Y i )E (Y i ) o (B.15) where U i = Y i i and EU i = 0. Then using Hoeffding’s inequality on n 1 P n i=1 U i along with assumptionA3 and lemma A gives that the first term in equation (B.15) isO p (log 3=2 n= p n). Now consider the second term in equation (B.15) and defineV i () = neb (1);i ()E neb (1);i () withZ n () =n 1 P n i=1 V i (). Conditional on the eventB n , we havejV i ()j 2C logn=d 1 . Moreover conditional onB n , assumptions A2A4 and lemma B give Z n ()Z n ( 0 ) c 0 n log 4 n p n f1 +o(1)g whenever (; 0 )2 andj 0 j n . The proof of statement 1 for lemma 2 and the bounds in the display above along with those developed for V i () establish that the second term in equation (B.15) is O p (log 3=2 n= p n). Now for the third term in equation (B.15), define V i () = (Y i )E (Y i ). From 203 assumptionA2 and lemma A, there exists a constantC 0 > 0 such that for largen, max 1in (Y i + 1) C 0 logn with high probability which gives, conditional on the eventB 0 n =fmax 1in (Y i +1)C 0 logng, jV i ()jc 1 log 2 n. Moreover withZ n () =n 1 P n i=1 V i () and conditional on the eventB 0 n , assumptions A2A4 and lemma B give Z n ()Z n ( 0 ) c 2 n log 5 n p n f1 +o(1)g: Now we mimic the proof of statement 1 for lemma 2 to establish that the third term in equation (B.15) is O p (log 5=2 n= p n) which proves the statement of the lemma. 
Proof of statement 2 for the scaled squared error loss (k = 1) For the proof of this statement, we will show that sup 2 jL (1) n (; neb (1) ()) EL (1) n (; neb (1) ())j is O p (log 5=2 n= p n). This term is bounded above by 2 n sup 2 n X i=1 n neb (1);i ()E neb (1);i () o + 1 n sup 2 n X i=1 1 i nh neb (1);i () i 2 E h neb (1);i () i 2 o (B.16) where the first term in equation (B.16) isO p (log 3=2 n= p n) from the proof of statement 1. Now consider the second term in equation (B.16) and defineZ n () = n 1 P n i=1 V i () where i V i () = [ neb (1);i ()] 2 E[ neb (1);i ()] 2 . Conditional on the eventB n , we havejV i ()j 2 1 i (C=d 1 ) 2 log 2 n. Moreover for a pertur- bation 0 of such that (; 0 )2 , n X i=1 1 i [ neb (1);i ()] 2 [ neb (1);i ( 0 )] 2 2C 2 log 2 n d 3 1 ^ h (1) n () ^ h (1) n ( 0 ) 2 n X i=1 1 i 204 conditional on B n . Thus assumptions A2 A4, lemma B and the above display together imply that njZ n ()Z n ( 0 )j is bounded above byc 0 n log 5 n P n i=1 1 i f1 +o(1)g wheneverj 0 j n . Now we follow the steps outlined in the proof of statement 1 for lemma 3 to conclude that the second term in equation (B.16) isO p (log 5=2 n= p n) from which the desired result follows. Proof of statement 1 for the squared error loss (k = 0) The proof of this statement is very similar to the proof of statement 1 under the scaled squared error loss and therefore we highlight the important steps here. To prove statement 1, we will only show that the term sup 2 ARE (0;P) n (;Y )EARE (0;P) n (;Y ) is O p (logn 5=2 = p n) because under the Poisson model, it can be verified using definition 2B that EARE (0;P) n () = R (0) n (; neb (0) ()). Now note that sup 2 ARE (0;P) n (;Y )EARE (0;P) n (;Y ) is bounded above by 1 n n X i=1 U i + sup 2 2 n n X i=1 n Y i (Y i )EY i (Y i )g + 1 n sup 2 n X i=1 nh neb (0);i () i 2 E h neb (0);i () i 2 o (B.17) whereU i = Y i (Y i 1) 2 i andEU i = 0. Then using Hoeffding’s inequality on n 1 P n i=1 U i , along with assumptionA3 and lemma A, gives that the first term in equation (B.17) isO p (log 5=2 n= p n). Now for the second term in equation (B.17), defineV i () = Y i (Y i )EY i (Y i ). Conditional on the eventB n , jV i ()jc 0 log 2 n. Moreover withZ n () =n 1 P n i=1 V i () and conditional on the eventB n , assumptions A2A4 and lemma B give jZ n ()Z n ( 0 )jc 1 n n 1=2 log 5 nf1 +o(1)g 205 wheneverj 0 j n for (; 0 )2 . The proof of statement 1 (k = 1 case) for lemma 3 and the bounds in the display above along with those developed forV i () establish that the second term in equation (B.17) is O p (log 5=2 n= p n). For the third term in equation (B.17), we proceed in a similar manner and define V i () = [ neb (0);i ()] 2 E[ neb (0);i ()] 2 andZ n () =n 1 P n i=1 V i (). From Assumption (A2) and Lemma A, there exists a constantC 0 > 0 such that for largen, max 1in (Y i + 1) C 0 logn with high probability which gives, conditional on the eventB 0 n =fmax 1in (Y i + 1) C 0 logng, (i)jV i ()j c 2 log 2 n, and (ii) under Assumptions (A2)-(A4) and Lemma B, jZ n ()Z n ( 0 )jc 3 n n 1=2 log 4 nf1 +o(1)g wheneverj 0 j n for (; 0 )2 . Thus the third term in equation (B.17) isO p (log 5=2 = p n) which follows from the proof of statement 1 (k = 1 case) for lemma 3 together with the preceding bounds developed forV i () andjZ n ()Z n ( 0 )j, and suffices to prove the desired result. Proof of statement 2 for the squared error loss (k = 0) We will only look at the term sup 2 jL (0) n (; neb (0) ()) EL (0) n (; neb (0) ())j and show that it is O p (log 5=2 n= p n). 
Note that 2 n sup 2 n X i=1 i n neb (0);i ()E neb (0);i () o + 1 n sup 2 n X i=1 nh neb (0);i () i 2 E h neb (0);i () i 2 o (B.18) is an upper bound on sup 2 jL (0) n (; neb (0) ())EL (0) n (; neb (0) ())j. Analogous to the preceding proofs under lemma 3, we will provide the upper bounds and bounds on the increments with respect to perturbations on for the two terms in equation (B.18). The rest of the proof will then follow from the proof of statement 1 in lemma 3. 206 From definition 1B, we know that under the Poisson model and conditional on the event B 0 n , neb (0);i () C 0 logn=d 2 . Moreover, from lemma B, P n i=1 i j neb (0);i () neb (0);i ( 0 )j is bounded above by c 0 j 0 j P n i=1 i log 3 nf1 +o(1)g and P n i=1 j[ neb (0);i ()] 2 [ neb (0);i ( 0 )] 2 j is bounded above by c 1 j 0 jn 1=2 log 4 nf1 +o(1)g for (; 0 )2 . Hoeffding’s inequality and the steps outlined in the proof of statement 1 (k = 1) of lemma 3 will then show that the first term in equation (B.18) isO p (log 3=2 n= p n)and the second term isO p (log 5=2 n= p n) which proves the desired result. B.2.6 Proof of Lemma 4 The statement of this lemma follows from part (2) of Lemmata 3 and 2. We will prove this lemma for the Poisson case first. Note that for any > 0 and k 2 f0; 1g, the probability P h L (k) n (; neb (k) ( ^ )) L (k) n (; or (k) ) +c 1 n i is bounded above by P h L (k) n (; neb (k) ( ^ ))ARE (1;P) n ( ^ ;Y )L (k) n (; or (k) )ARE (1;P) n ( orc ;Y ) +c 1 n i ; which converges to 0 by part (2) of Lemma 3. For the Binomial case, similar arguments using part (2) of Lemma 2 suffice. B.2.7 Proofs of Theorems 3A, 3B We will first prove Theorem 3A. Note thatjj neb (1) ( ^ ) (1) jj 2 1 jj neb (1) ( ^ ) (1) jj 2 2 and neb (1) ( ^ ) (1) 2 2 = n X i=1 h a Y i 1 =a Y i ^ w (1) n;i ( ^ )w (1) p;i i 2 h ^ w (1) n;i ( ^ )w (1) p;i i 2 Now, ^ w (1) n;i ()> 0 for every2 andw (1) p;i > 0. This fact along with assumptionA2 and lemma A imply that there exists a constantc 0 > 0 such thatjj neb (1) ( ^ ) (1) jj 2 c 0 lognk ^ w (1) n ( ^ )w (1) p k 2 . The first result 207 thus follows from the above inequality and Theorem 2A. To prove the second part of the theorem, note that n 1 jL (1) n (; (1) )L (1) n (; neb (1) ( ^ ))j is upper bounded by n X i=1 1 i nh neb (1);i ( ^ ) i 2 h (1);i i 2 o + 2 n X i=1 h neb (1);i ( ^ ) (1);i i (B.19) Now use the fact that ^ w (1) n;i () > 0 for every 2 and w (1) p;i > 0 to deduce, from assumption A2 and Lemma A, thatj neb (1);i ( ^ ) + (1);i j c 1 logn for some constantc 1 > 0. Using this inequality we can now upper bound the display in equation (B.19) by: n X i=1 neb (1);i ( ^ ) (1);i n 2 + c 1 logn i o neb (1) ( ^ ) (1) 1 n 2n +c 1 logn n X i=1 1 i o Thus, L (1) n (; (1) )L (1) n (; neb (1) ( ^ )) neb (1) ( ^ ) (1) 1 n 2 +c 1 logn n n X i=1 1 i o Finally, the result follows from the above display and the first part of this theorem after noting that i > 0 for alli = 1; 2;:::. We will now prove Theorem 3B. Using theorem 2B, the first part of theorem 3B follows along similar lines as the first part of theorem 3A. 
To prove the second part of the theorem, note thatn 1 L (0) n (; (0) ) L (0) n (; neb (0) ( ^ )) is upper bounded by n X i=1 nh neb (0);i ( ^ ) i 2 h (0);i i 2 o + 2 n X i=1 i h neb (0);i ( ^ ) (0);i i (B.20) and the display in equation (B.20) is less than or equal to k neb (0) ( ^ ) (0) k 1 f n X i=1 2 i +c 2 n logng; 208 where we have used the fact that ^ w (0) n;i ()> 0 for every2 ,w (0) p;i > 0 and along with assumptionA2 and Lemma A,j neb (0);i ( ^ ) + (0);i jc 2 logn for some constantc 2 > 0. Thus forn large, L (0) n (; (0) )L (0) n (; neb (0) ( ^ )) c 3 logn neb (0) ( ^ ) (0) 1 from which the desired result follows. B.2.8 Proofs of Lemmata A and B Proof of Lemma A First note that from assumptionA4 ifN(;n) denotes the cardinality of the setfi : i (1+) logng for some> 0, thenN(;n)! 0 asn!1. We will now prove the statement of lemma A for the case when Y i j i ind: Poi( i ). For distributions with bounded support, like the Binomial model, the lemma follows trivially. Under the Poisson model, we haveP(Y i i +t) expf0:5t 2 =( i +t)g for anyt> 0. The above inequality follows from an application of Bennett inequality to the Poisson MGF (see Pollard (2015)). Now consider P(max i=1;:::;n Y i i +t) and note that since Y i are all independent, this probability is given by Q n i=1 [1 expf0:5t 2 =( i +t)g]. Take t = s logn where s 2 =fs + (1+) g > 4. Then with i (1+) logn, the above probability is bounded below bya n =f1n (1+) g n for some > 0. As n!1,a n ! 1 which proves the statement of the lemma. Proof of Lemma B We begin with some remarks on the optimization problems (3.8) and (B.2). Note that the feasible 209 setH n in equation (3.8) (and (B.2)) is compact and independent of. Moreover, the optimization problem in definitions 1A and 1B is convex. Consequently, (i) for all 2 , the optimization takes place in a compact set, and (ii) the optimal solution set corresponding to any 2 is a singleton,f ^ h (k) n ()g. Now fix an > 0. Then for any 2 N ( 0 )\ there exists a > 0 such that the optimal solution h n := ^ h (k) n () 2 N ( ^ h (k) n ( 0 )) and ^ M ;n fh n g ^ M ;n f ^ h (k) n ( 0 )g 0. Moreover, we can re-write ^ M 0 ;n fh n g ^ M 0 ;n f ^ h (k) n ( 0 )g as ^ M 0 ;n fh n g ^ M ;n fh n g ^ M 0 ;n f ^ h (k) n ( 0 )g + ^ M ;n f ^ h (k) n ( 0 )g + ^ M ;n fh n g ^ M ;n f ^ h (k) n ( 0 )g The last term in the display above is negative and thus we can upper bound ^ M 0 ;n fh n g ^ M 0 ;n f ^ h (k) n ( 0 )g by ^ M 0 ;n fh n g ^ M ;n fh n g ^ M 0 ;n f ^ h (k) n ( 0 )g + ^ M ;n f ^ h (k) n ( 0 )g Now apply the mean value theorem with respect toh n to the function ^ M 0 ;n fh n g ^ M ;n fh n g in the display above and notice that ^ M 0 ;n fh (k) n g ^ M 0 ;n fh (k) n ( 0 )g is bounded above by h r hn n ^ M 0 ;n ( h n ) ^ M ;n ( h n ) oi T h h n h (k) n ( 0 ) i where h n = ^ h (k) n ( 0 ) +fh n ^ h (k) n ( 0 )g for some2 (0; 1) andr h ^ M ;n (h) is the partial derivative of ^ M ;n (h) with respect toh. 
Usingr hn [ ^ M 0 ;n (h n ) ^ M ;n (h n )] =r 2 hn; ^ M 0 ;n (h n )( 0 )+o(j 0 j) we get ^ M 0 ;n fh n g ^ M 0 ;n f ^ h (k) n ( 0 )g sup h2N ( ^ h (k) n ( 0 )) h r 2 hn; ^ M 0 ;n (h) +o(1) 2 i 0 h n ^ h (k) n ( 0 ) 2 210 Moreover assumptionA3 implies that ^ M 0 ;n fh (k) n g ^ M 0 ;n f ^ h (k) n ( 0 )gc h (k) n ^ h (k) n ( 0 ) 2 2 The desired result thus follows from the above two displays with L = sup h2N ( ^ h (k) n ( 0 )) kr 2 hn; ^ M 0 ;n (h) +o(1)k 2 =c: 211 Appendix C Technical details related to Chapter 4 In this chapter we present the detailed proofs of all the theorems and lemmas related to Chapter 4. C.1. Preliminary expansions for eigenvector and eigenvalues In this subsection, we put together the key expansions that are needed to prove the theorems. We first express thej-th sample eigenvector ^ p j as ^ p j =a j P K (e j;K + j ) + q 1a 2 j P K;? u j;nK ; (C.1) where P K = [p 1 : : p K ], P K;? is ann (nK) matrix so that [P K : P K;? ] is an orthogonal matrix, a j =kP K ^ p j k2 (0; 1) (without loss of generality, choosing the correct sign), e j;K is thej-th canonical coordinate vector inR K . Moreover, u j;nK is uniformly distributed onS nK1 (the unit sphere inR nK ), so that u j;nK = " j = p nK where" j N(0;I nK ). We shall make use of the following asymptotic expansions (Paul 2007). k j k =O P (n 1=2 ) and a j = j +O P (n 1=2 ) (C.2) 212 Now, forpn, let A be anypn matrix such thatkAk andk(AA T ) 1 k are bounded even asp;n!1. Then, for any b2R p withkbk 2 = 1, we have the expansion hb; A^ p j i = j hb; Ap j i + q 1 2 j p nK hb; AP K;? " j i +(a j j )hb; Ap j i + ( q 1a 2 j q 1 2 j ) 1 p nK hb; AP K;? " j i +a j hb; AP K j i + q 1a 2 j hb; AP K;? " j i(k" j k 1 (nK) 1=2 ): (C.3) Suppose thatB be any collection of unit vectors inR p of cardinalityO(n c ) for some fixedc2 (0;1). Then, from (C.3) we conclude that, uniformly over b2B, hb; A^ p j i j hb; Ap j i =O P ( p logn=n) (C.4) Here, we used the fact thathb; AP K;? " j i N(0; b T A(I P K P T K )A T b),jhb; Ap j ij kAk and, jhb; A j ijkAkk j k =O P (n 1=2 ). Moreover,ja j j j =O P (n 1=2 ) impliesj q 1a 2 j q 1 2 j j = O P (n 1=2 ) andjk" j k 1 (nK) 1=2 j =O P (n 1 ). C.2. Proofs C.2.1 Proof of Theorem 1A First note that for any fixed (r;)2f1; 0; 1gR, and any given and, equation (4.12) gives, for any b2B withkbk 2 = 1 b T ^ H r;; b = K X j=1 1 ^ 2 j (h r;; ( ^ ` e j )h r;; ( ^ ` e 0 ))(hb; ^ p j i) 2 +h r;; ( ^ ` e 0 )kbk 2 213 and from equations (4.10), (4.11), (C.1) and (C.2), the above reduces to b T ^ H r;; b = K X j=1 (h r;; (` j )h r;; (` 0 ))(hb; p j i) 2 +h r;; (` 0 ) +O P ( p logn=n) = b T H r;; b +O P ( p logn=n); (C.5) uniformly overb2B consisting of O(n c ) unit vectors, for any fixed c > 0. Next, since by assumption A3, and belong to compact subsets on which all the quantities in question are smooth functions with uniformly bounded Lischitz seminorm with respect (;), by choosing appropriate grid of (;) of size O(n c 0 ) for somec 0 > 0, we note that the expansion in (C.5) continue to hold uniformly in (;), and hence we have sup 2T 0 ;2B 0 ;b2B jb T ^ H r;; bb T H r;; bj =O P ( p logn=n), thus proving the theorem. C.2.2 Proof of Theorem 1B We only prove the result for fixed (;) since the argument can be extended to compact subsets of (;), under assumption A3, using an argument similar to that used in the proof of Theorem 1A. Since the aggregated Bayes predictive rules involve quadratic forms of the formb T G r;; b, we have the following cases of interest: G 0;1;0 ;G 1;0; andG 1;1; . 
In order to analyze the corresponding estimators of these quantities of interest, we introduce some notations. Let C = AA T , Q = [q 1 : : q K ], where q j = Ap j , and e Q = [e q 1 : :e q K ] wheree q j = ^ 1 j A^ p j . Then, for any2R + , A ^ H 0;;0 A T = ( ^ ` e 0 ) AA T + K X j=1 ( ^ ` e j ) ( ^ ` e 0 ) ^ 2 j A^ p j ^ p T j A T = ( ^ ` e 0 ) C + K X j=1 ( ^ ` e j ) ( ^ ` e 0 ) e q j e q T j = ( ^ ` e 0 ) h C + e Q ( ^ ` e 0 ) ^ I K e Q T i (C.6) Setting = 1, we observe that b T ^ G 0;1;0 b = ^ ` e 0 b T AA T b + P K j=1 ( ^ ` e j ^ ` e 0 ) 1 ^ 2 j (hb;A^ p j i) 2 which is b T G 0;1;0 b +O P ( p logn=n) from equations (C.4) and (C.2). This proves the theorem when = 1. 214 To prove the theorem for any6= 1, we make repeated use of the following basic formula for matrix inversion. Given a symmetric nonsingularpp matrix B, and apq matrix D, B + DD T 1 = B 1 B 1 D I + D T B 1 D 1 D T B 1 : (C.7) Using (C.7) and (C.6), we have, with ^ = ( ^ ` e 0 ) ^ I K , (A ^ H 0;;0 A T ) 1 = ( ^ ` e 0 ) C 1 ( ^ ` e 0 ) C 1 e Q ^ 1=2 h I K + ^ 1=2 e Q T C 1 e Q ^ 1=2 i 1 ^ 1=2 e Q T C 1 We must therefore analyze the behavior of e Q T C 1 e Q. As a preliminary step, we observe that since AP K;? " j N(0; A(I P K P T K )A T ), it follows that 1 p kC 1=2 AP K;? " j k 2 = 1 p trace C 1 A(I P K P T K )A T +O P (p 1=2 ) (C.8) which reduces to 1r A =p +O P (p 1=2 ) wherer A =p 1 trace A T C 1 AP K P T K and jr A jkA T C 1 Ak rank(P K P T K )K: We will use equation (C.8) and the bound onjr A j to control e Q T C 1 e Q. First note that using (C.1), (C.2) and, for any 1j;kK, e q T j C 1 e q k = p T j A T C 1 Ap k (1 +O P (n 1=2 )) + q 1 2 j q 1 2 k j k hAP K;? " k ; C 1 AP K;? " j nK (1 +O P (n 1=2 )) 215 which, using equation (C.8), reduces to p T j A T C 1 Ap k (1 +O P (n 1=2 )) + q 1 2 j q 1 2 k j k pr A +O P (p 1=2 ) nK (1 +O P (n 1=2 )) and finally to q T j C 1 q k (1 +O P (n 1=2 )) +O P (p=n) using the bound onjr A j. Consequently, we have e Q T C 1 e Q = Q T C 1 Q +O P (n 1=2 ) +O P (p=n): (C.9) Now let =` 0 I K . Thenk ^ k =O P (n 1=2 ), and hence, by (C.9), ^ U 1 := h I K + ^ 1=2 e Q T C 1 e Q ^ 1=2 i 1 = h I K + 1=2 Q T C 1 Q 1=2 i 1 + R ;n = U 1 + R ;n ; (C.10) wherekR ;n k = O P (n 1=2 ) +O P (p=n). Furthermore, (A ^ H 0;1;0 A T ) 1 + 1 (A ^ H 0;;0 A T ) 1 can be written as ( ^ ` e 0 ) 1 + 1 ( ^ ` e 0 ) C 1 C 1 e Q ( ^ ` e 0 ) 1 ^ 1=2 1 ^ U 1 ^ 1=2 1 + 1 ( ^ ` e 0 ) ^ 1=2 ^ U ^ 1=2 e Q T C 1 ; which by (C.10) is ( ^ ` e 0 ) 1 + 1 ( ^ ` e 0 ) C 1 C 1 e Q ^ V e Q T C 1 (C.11) where ^ V = V + R 1;n + 1 R ;n with V = ` 1 0 1=2 1 U 1 1=2 1 + 1 ` 0 1=2 U 1=2 and R ;n = ^ ` 0 ^ 1=2 R ;n ^ 1=2 , so thatk R ;n k = O P (n 1=2 ) +O P (p=n) for all . Notice that V is positive def- inite, and hence ^ V is positive definite with probability tending to 1. 216 Define, for x > 0, a ; (x) = x 1 + 1 x . By (C.11), we can write h (A ^ H 0;1;0 A T ) 1 + 1 (A ^ H 0;;0 A T ) 1 i 1 as 1 a ; ( ^ ` e 0 ) C " C 1 a ; ( ^ ` e 0 ) e Q ^ V e Q T # 1 C = 1 a ; ( ^ ` e 0 ) C+ 1 (a ; ( ^ ` e 0 )) 2 e Q " ^ V 1 1 a ; ( ^ ` e 0 ) e Q T C 1 e Q # 1 e Q T and using ^ V = V + R 1;n + 1 R ;n , we can re-write it as 1 a ; ( ^ ` e 0 ) C + 1 (a ; ( ^ ` e 0 )) 2 e Q V 1 1 a ; (` 0 ) Q T C 1 Q + R ;n 1 e Q T wherekR ;n k =O P (n 1=2 ) +O P (p=n). As a consequence, we have b T ^ G 1;0; b =b T G 1;0; b +O P ( p logn=n) +O P (p=n) uniformly overb2B. An analogous calculation yields b T ^ G 1;1; c =b T G 1;1; c +O P ( p logn=n) +O P (p=n) uniformly overb;c2B. 
C.2.3 Proof of Lemma 1 First note that under the hierarchical models (4.1) and (4.4), the posterior distribution of givenAX is N(A 0 +G 1;1; A(X 0 );G 1;0; ). To prove this Lemma, we first fix a few notations. For coordinate 217 i, letv i ;v i andv fi denote thei th diagonal element of 1 , andm 1 0 1 respectively. The minimizer of the univariate Bayes riskB i (;) is given by ^ q i = argmin q Z L i ( i ;q i )( i j(AX) i ) where the posterior distribution( i j(AX) i ) N( i ;! i ) where i = i (AX) i + (1 i )(A 0 ) i , i = v i =(v i +v i ) and! i = (v 1 i +v 1 i ) 1 . We prove the Lemma for the generalized absolute loss and the linex loss functions. The univariate Bayes predictive rules for the other losses considered in chapter 4 will follow from similar arguments. For the linex loss function, note that L i ( i ;q i ) = E V i L i (V i ;q i ) whereL i (V i ;q i ) is the linex loss for coordinatei from equation (4.6). SinceV i N( i ;v fi ), E V i L i (V i ;q i ) =b i h expfa i (q i i ) + (a 2 i =2)v fi ga i (q i i ) 1 i Furthermore,E i j(AX) i L i ( i ;q i ) =b i h expfa i (q i i ) + (a 2 i =2)(v fi +! i )ga i (q i i ) 1 i is convex inq i . Differentiating the above posterior expectation with respect toq i , we get ^ q i = i (AX) i + (1 i )(A 0 ) i a i 2 (v fi +! i ) which completes the proof. For the generalized absolute loss function in equation (4.5), note that E V i L i (V i ;q i ) =b i ( i q i ) + (b i +h i )E(q i i Z) + 218 whereZ is a standard normal random variable. Furthermore, direct calculation yieldsE(q i i Z) + = (q i i )(q i i ) +(q i i ). The Bayes predictive rule then follows from Lemma 2.1 and 2.2 of Mukherjee et al. (2015). C.2.4 Proof of Lemma 2 We prove this lemma for the generalized absolute loss function in equation (4.5). For anyi and fixed (;), it follows from Theorem 1A, ^ q approx i (XjS;;)q Bayes i (Xj;;) e T i ( ^ H 1;1; H 1;1; )(X 0 ) +O p r logn n (C.12) The first term on the right of the inequality above is an asymmetric quadratic form and can be written as a difference of two symmetric quadratic forms as follows kX 0 k 2 4 h (a +e i ) T ( ^ H 1;1; H 1;1; )(a +e i ) (ae i ) T ( ^ H 1;1; H 1;1; )(ae i ) i wherea = (X 0 )=kX 0 k 2 . Re-apply Theorem 1A separately to these two symmetric quadratic forms and note that the above is bounded byO p ( p logn=n)(kX 0 k 2 )(ka +e i k 2 2 +kae i k 2 2 )=4, from which the result follows. C.2.5 Proofs of Lemmata 3 and 4 To prove these lemmas, we use the following result. 219 Lemma A. Under assumptions A1 and A2, uniformly inb2B such thatB = O(n c ) for any fixedc > 0, withkbk 2 = 1, and for all (r;)2f1; 0; 1gR, we have asn!1, sup b2B b T ^ H 1;1; J () ^ H 1;1; bb T H 1;1; J ()H 1;1; b j(` 0 ) K X j=1 (h 1;1; (` j )h 1;1; (` 0 )) 2 (hb; p j i) 2 =O P ( p logn=n): Proof of Lemma A. Let us first define the following quantities: j (h 1 ) = h 1;1; (` j )h 1;1; (` 0 ), j = J (` j )J (` 0 ) and ^ j (h 1 ) = h 1;1; ( ^ ` e j ) h 1;1; ( ^ ` e 0 ) where h r;; is the scalar version of H r;; andJ (x) = x +x being the scalar version ofJ (). 
For any $b\in\mathcal B$ with $\|b\|_2 = 1$, expand $b^T\hat H_{1,1,\lambda}J_\gamma(\cdot)\hat H^T_{1,1,\lambda}b$ as
$$
\sum_{j=1}^K\sum_{j'=1}^K\sum_{k=1}^K \hat\delta_j(h_1)\hat\delta_{j'}(h_1)\hat\alpha_j^{-2}\hat\alpha_{j'}^{-2}\bar\delta_k\,
\langle b,\hat p_j\rangle\langle b,\hat p_{j'}\rangle\langle p_k,\hat p_j\rangle\langle p_k,\hat p_{j'}\rangle
\;+\; J_\gamma(\ell_0)\sum_{j=1}^K\sum_{j'=1}^K \hat\delta_j(h_1)\hat\delta_{j'}(h_1)\hat\alpha_j^{-2}\hat\alpha_{j'}^{-2}\,
\langle b,\hat p_j\rangle\langle b,\hat p_{j'}\rangle\langle\hat p_{j'},\hat p_j\rangle
$$
$$
+\; 2h_{1,1,\lambda}(\hat\ell^e_0)\sum_{j=1}^K\sum_{k=1}^K \hat\delta_j(h_1)\hat\alpha_j^{-2}\bar\delta_k\,
\langle b,\hat p_j\rangle\langle b,p_k\rangle\langle p_k,\hat p_j\rangle
\;+\; 2h_{1,1,\lambda}(\hat\ell^e_0)J_\gamma(\ell_0)\sum_{j=1}^K \hat\delta_j(h_1)\hat\alpha_j^{-2}(\langle b,\hat p_j\rangle)^2
$$
$$
+\; \big(h_{1,1,\lambda}(\hat\ell^e_0)\big)^2\sum_{k=1}^K \bar\delta_k(\langle b,p_k\rangle)^2
\;+\; \big(h_{1,1,\lambda}(\hat\ell^e_0)\big)^2 J_\gamma(\ell_0)\,\|b\|^2.
$$
Then, using equation (4.9), it can be verified that the above asymptotically equals
$$
b^T\Big[\sum_{j=1}^K\delta_j(h_1)p_jp_j^T + h_{1,1,\lambda}(\ell_0)I\Big]
\Big[\sum_{j=1}^K\bar\delta_jp_jp_j^T + J_\gamma(\ell_0)I\Big]
\Big[\sum_{j=1}^K\delta_j(h_1)p_jp_j^T + h_{1,1,\lambda}(\ell_0)I\Big]b
\;+\; J_\gamma(\ell_0)\sum_{j=1}^K\delta_j(h_1)^2(\langle b, p_j\rangle)^2 + O_P\big(\sqrt{\log n/n}\big),
$$
where the $O_P$ term is uniform in $b\in\mathcal B$ consisting of $O(n^c)$ unit vectors. Finally, using the definitions of $\delta_j(h_1)$, $\bar\delta_j$ and arguments similar to those used in proving Theorem 1A, the result follows.

Next, we prove the three statements of Lemma 3.

Proof of Lemma 3, statement (a). First note that $E\big\{\big(q^{\rm cs}_i(X\mid S, f_i, \lambda, \gamma) - q^{\rm Bayes}_i(X\mid \lambda, \gamma)\big)^2\big\}$ can be decomposed as
$$
E^2\big\{q^{\rm cs}_i(X\mid S, f_i) - q^{\rm Bayes}_i(X)\big\} + \mathrm{Var}\big\{q^{\rm cs}_i(X\mid S, f_i) - q^{\rm Bayes}_i(X)\big\},
$$
where the first term represents the squared bias and the second term is the variance. Now consider, for example, the generalized absolute loss function of equation (4.5). Under this loss, the bias with respect to the marginal distribution of $X$ is
$$
\Phi^{-1}(\tilde b_i)\Big[\big(e_i^T\hat H_{1,0,\lambda}e_i + m_0^{-1}e_i^T\hat H_{0,1,0}e_i\big)^{1/2} - \big(e_i^TH_{1,0,\lambda}e_i + m_0^{-1}e_i^TH_{0,1,0}e_i\big)^{1/2}\Big],
$$
which, by Theorem 1A, is $O_P(\sqrt{\log n/n})$. Now the variance term is equal to
$$
f_i^2\, e_i^T\hat H_{1,1,\lambda}J_\gamma\hat H^T_{1,1,\lambda}e_i - 2f_i\, e_i^T\hat H_{1,1,\lambda}J_\gamma H^T_{1,1,\lambda}e_i + e_i^TH_{1,1,\lambda}J_\gamma H^T_{1,1,\lambda}e_i,
$$
which is a quadratic with respect to $f_i$ and is minimized at
$$
f^{OR}_i = \frac{e_i^T\hat H_{1,1,\lambda}J_\gamma H^T_{1,1,\lambda}e_i}{e_i^T\hat H_{1,1,\lambda}J_\gamma\hat H^T_{1,1,\lambda}e_i}.
$$
The numerator in the above expression is an asymmetric quadratic form in $\hat H_{r,\lambda,\gamma}$ and by Theorem 1A it equals $e_i^TH_{1,1,\lambda}J_\gamma H^T_{1,1,\lambda}e_i + \big(e_i^TH_{1,1,\lambda}J^2_\gamma H^T_{1,1,\lambda}e_i\big)^{1/2}O_p(\sqrt{\log n/n})$. By Lemma A, the denominator is
$$
e_i^TH_{1,1,\lambda}J_\gamma H_{1,1,\lambda}e_i + J_\gamma(\ell_0)\sum_{j=1}^K\big(h_{1,1,\lambda}(\ell_j) - h_{1,1,\lambda}(\ell_0)\big)^2(\langle e_i, p_j\rangle)^2 + O_p\big(\sqrt{\log n/n}\big),
$$
which, for fixed $\lambda > 0$, $\gamma \ge 0$, is non-trivial since $\ell_j > \ell_0 > 0$ for all $j = 1,\ldots,K$. Thus, the ratio asymptotically equals
$$
\frac{e_i^TH_{1,1,\lambda}J_\gamma H^T_{1,1,\lambda}e_i}
{e_i^TH_{1,1,\lambda}J_\gamma H^T_{1,1,\lambda}e_i + J_\gamma(\ell_0)\sum_{j=1}^K\big(h_{1,1,\lambda}(\ell_j) - h_{1,1,\lambda}(\ell_0)\big)^2(\langle e_i, p_j\rangle)^2}
\;+\; O_p\Big(\sqrt{\tfrac{\log n}{n}}\Big).
$$
The second term in the denominator is at least as big as $J_\gamma(\ell_0)\big(h_{1,1,\lambda}(\ell_K) - h_{1,1,\lambda}(\ell_0)\big)^2\|P_Ke_i\|^2$. Finally, note that $U(\gamma) = H_{1,1,\lambda}J_\gamma H_{1,1,\lambda}$, from which the result follows.

Proof of Lemma 3, statement (b). From Lemma A, $b^T\hat H_{1,1,\lambda}J_\gamma\hat H^T_{1,1,\lambda}b$ asymptotically equals
$$
b^TH_{1,1,\lambda}J_\gamma H_{1,1,\lambda}b + J_\gamma(\ell_0)\sum_{j=1}^K\big(h_{1,1,\lambda}(\ell_j) - h_{1,1,\lambda}(\ell_0)\big)^2(\langle b, p_j\rangle)^2 + O_p\big(\sqrt{\log n/n}\big),
$$
which is strictly bigger than $b^TH_{1,1,\lambda}J_\gamma H_{1,1,\lambda}b + O_P(\sqrt{\log n/n})$ for any fixed $\lambda > 0$, $\gamma > 0$, and from this the proof immediately follows.

Proof of Lemma 3, statement (c). This follows directly from statements (a) and (b). For any coordinate $i$, by definition of $f^{OR}_i$ in statement (a),
$$
E\Big[\big(q^{\rm approx}_i(X\mid S) - q^{\rm Bayes}_i(X)\big)^2\Big] \ge E\Big[\big(q^{\rm cs}_i(X\mid S, f^{OR}_i) - q^{\rm Bayes}_i(X)\big)^2\Big],
$$
while statement (b) implies that the above inequality holds for all $i$ and for any fixed $\lambda > 0$ and $\gamma > 0$.

Proof of Lemma 4. The proof of this Lemma follows directly using Theorem 1A for the numerator of $\hat f^{\rm prop}_i$ and Lemma A and equations (4.8), (4.9) for the denominator.
Similar arguments using Theorem 1B, Lemma A and equations (4.8), (4.9) prove the result for $\hat f^{\rm prop}_i$ in definition 4.

C.2.6 Proof of Theorems 2A and 2B

We will first prove Theorem 2B for the generalized absolute loss function in equation (4.5). For any $i$ and fixed $\lambda > 0$, $\gamma > 0$, we have
$$
\hat q^{\rm casp}_i(\hat f^{\rm prop}_i) - \hat q^{\rm cs}_i(f^{OR}_i) = \big(\hat f^{\rm prop}_i - f^{OR}_i\big)\,e_i^T\hat H_{1,1,\lambda}A(X - \mu_0),
$$
which can be upper bounded by
$$
\Big|1 - \frac{\hat f^{\rm prop}_i}{f^{OR}_i}\Big|\Big[\big|\hat q^{\rm cs}_i(f^{OR}_i) - e_i^TA\mu_0\big| + \Phi^{-1}(\tilde b_i)\big(e_i^T\hat G_{1,0,\lambda}e_i + m_0^{-1}e_i^T\hat G_{0,1,0}e_i\big)^{1/2}\Big].
$$
Now using Theorem 1B and Lemma 4, $\|\hat q^{\rm casp}(AX\mid\hat f^{\rm prop}) - \hat q^{\rm cs}(AX\mid f^{OR})\|_2^2$ is upper bounded by
$$
\frac{2}{(f^{OR}_{\inf})^2}\Big[\big\|\hat q^{\rm cs}(AX\mid f^{OR}) - e_i^TA\mu_0\big\|_2^2 + \big\{\Phi^{-1}(\tilde b_i)\big\}^2\big(e_i^TG_{1,0,\lambda}e_i + m_0^{-1}e_i^TG_{0,1,0}e_i\big) + c_n\Big]\,O_p\Big(\frac{\log n}{n}\Big),
$$
where $f^{OR}_{\inf} := \inf_{1\le i\le n}f^{OR}_i > 0$ and $c_n = O_p\big(\max\{p/n,\,\sqrt{\log n/n}\}\big)$. The proof then follows by noting that $\|\hat q^{\rm cs}(AX\mid f^{OR}) - e_i^TA\mu_0\|_2^2 > 0$ since $\ell_0 > 0$. The proof of Theorem 2A follows using similar arguments with Theorem 1B and Lemma 4.

Appendix D
Technical details related to Chapter 5

This appendix contains the following items: details around the maximization problem in equation (5.9) (appendix D.1), the prediction equations used in chapter 5.6.3 (appendix D.2), discussion around the split-and-conquer approach (appendix D.3), data description (appendix D.4) and variable selection voting results (appendix D.5).

D.1. Details around the maximization problem in equation (5.9)

In this section, we will first show that the maximization problem in equation (5.9) decouples into separate components that estimate $\beta^{(s)}$ (and $\nu_1, \nu_2$ for the activity and engagement models) and $\Gamma$ as solutions to independent optimization problems (appendix D.1.1). Thereafter, we show that the optimization problems involving $\beta^{(s)}$ are convex and can be solved after reducing the original problem to an $\ell_1$ penalized least squares fit with convex constraints (appendix D.1.2), while the coordinate descent algorithm of Wang (2014) provides a solution to the non-convex problem involving $\Gamma$ (appendix D.1.3).

D.1.1 Simplifying equation (5.9)

Note that in the E-step of chapter 5.5, $\ell^{(t)}_Q(\Theta)$ is approximated by $\sum_{i=1}^n\sum_{d=1}^D \ell^{cl}_i(\Theta; b^d_i)\,w^{(t)}_{id}$, where
$$
w^{(t)}_{id} = p\big(\alpha_{ij}, A_{ij}, \delta_{ij}, E_{ij}, D_i \mid b^d_i, \Theta^{(t)}\big)\Big/\sum_{d'=1}^D p\big(\alpha_{ij}, A_{ij}, \delta_{ij}, E_{ij}, D_i \mid b^{d'}_i, \Theta^{(t)}\big)
$$
is a known constant at iteration $(t)$ and
$$
\ell^{cl}_i(\Theta; b^d_i) = -\frac{1}{2}\log|\Gamma| - \frac{1}{2}b^{dT}_i\Gamma^{-1}b^d_i + \sum_{j=1}^{m_i}\log p\big(\alpha_{ij}, A_{ij}, \delta_{ij}, E_{ij}, D_i \mid b^d_i, \Theta\big).
$$
Moreover, given the random effects $b^d_i$, $\log p(\alpha_{ij}, A_{ij}, \delta_{ij}, E_{ij}, D_i \mid b^d_i, \Theta)$ factorizes into
$$
\log p(\alpha_{ij}\mid b^{d(1)}_i, \beta^{(1)}) + \log p(A_{ij}\mid \alpha_{ij}, b^{d(2)}_i, \beta^{(2)}, \nu_1) + \log p(\delta_{ij}\mid \alpha_{ij}, b^{d(3)}_i, \beta^{(3)}) + \log p(E_{ij}\mid \alpha_{ij}, \delta_{ij}, b^{d(4)}_i, \beta^{(4)}, \nu_2) + \log p(D_i\mid b^d_i, \beta^{(5)}, \phi),
$$
wherein the $s$th term, for $s = 1,\ldots,5$, in the display above is solely a function of the unknown parameter $\beta^{(s)}$ (and $\nu_1, \nu_2, \phi$ for $s = 2, 4, 5$, respectively). This suffices to show that the maximization problem in equation (5.9) decouples into six separate problems for estimating $\beta^{(1)}$, $(\beta^{(2)}, \nu_1)$, $\beta^{(3)}$, $(\beta^{(4)}, \nu_2)$, $(\beta^{(5)}, \phi)$ and $\Gamma$.

D.1.2 Estimating $\beta^{(s)}$

In what follows, we will show that the optimization problems involving $\beta^{(s)}$ (and $\nu_1, \nu_2$ for the activity and engagement models) are convex and can be solved after reducing the original problem to an $\ell_1$ penalized least squares fit with convex constraints.
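Before turning to the individual sub-models, note that the weights $w^{(t)}_{id}$ entering each of these objectives are just the per-draw conditional likelihoods normalized over the $D$ Monte Carlo draws of the random effects. The sketch below (Python, not the authors' MATLAB implementation; the array shape and toy numbers are hypothetical) shows one numerically stable way to compute them:

```python
import numpy as np

def estep_weights(cond_loglik):
    """cond_loglik: array of shape (n, D); entry (i, d) is the conditional
    log-likelihood of player i's observed responses given the d-th draw
    b_i^d of the random effects and the current parameter value.
    Returns w of shape (n, D) with rows summing to one, i.e. the weights
    w_id^{(t)} used to approximate the E-step objective."""
    m = cond_loglik.max(axis=1, keepdims=True)   # stabilize the exponentials
    w = np.exp(cond_loglik - m)
    return w / w.sum(axis=1, keepdims=True)

# toy usage with hypothetical numbers: 3 players, D = 4 draws
ll = np.log(np.array([[0.20, 0.50, 0.20, 0.10],
                      [0.10, 0.10, 0.70, 0.10],
                      [0.25, 0.25, 0.25, 0.25]]))
print(estep_weights(ll))
```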
AI model - First note that $\sum_{i=1}^n\sum_{j=1}^{m_i}\sum_{d=1}^D\log p(\alpha_{ij}\mid b^{d(1)}_i, \beta^{(1)})\,w^{(t)}_{id}$ can be written as $f_1(\beta^{(1)}) + f_2(\beta^{(1)})$ + terms independent of $\beta^{(1)}$, where
$$
f_1(\beta^{(1)}) = -\sum_{i=1}^n\sum_{j=1}^{m_i}\sum_{d=1}^D\log\Big[1 + \exp\big(x^{(1)T}_{ij}\beta^{(1)} + z^{(1)T}_{ij}b^{d(1)}_i\big)\Big]\,w^{(t)}_{id}
$$
is concave in $\beta^{(1)}$ and
$$
f_2(\beta^{(1)}) = \sum_{i=1}^n\sum_{j=1}^{m_i}\sum_{d=1}^D \alpha_{ij}\,x^{(1)T}_{ij}\beta^{(1)}\,w^{(t)}_{id}
$$
is affine in $\beta^{(1)}$. Now from equation (5.9), the minimization problem for $\beta^{(1)}$ is
$$
\min_{\beta^{(1)}}\; f(\beta^{(1)}) + h(\beta^{(1)}) \tag{D.1}
$$
where $f(\beta^{(1)}) = -f_1(\beta^{(1)}) - f_2(\beta^{(1)})$ is convex and differentiable with respect to $\beta^{(1)}$, and $h(\beta^{(1)}) = n\lambda\sum_{r=1}^p c_{1r}|\beta_{1r}| + I_{\mathcal C}(\beta^{(1)})$ is convex but non-differentiable, with $I_{\mathcal C}(\beta^{(1)})$ the indicator function of the closed, convex set $\mathcal C = \{\beta^{(1)} : f^{(1)}(\beta^{(1)}) \le 0\}$. To solve equation (D.1), we use the proximal gradient method that updates $\beta^{(1)}$ at iteration $k = 1, 2, 3, \ldots$ as
$$
\beta^{(1)}_{(k)} = \operatorname{prox}_{t_k, h}\Big(\beta^{(1)}_{(k-1)} - t_k\nabla f\big(\beta^{(1)}_{(k-1)}\big)\Big), \tag{D.2}
$$
where $t_k > 0$ is the step size determined by backtracking line search and
$$
\operatorname{prox}_{t_k, h}(u) = \operatorname*{argmin}_{\beta}\; h(\beta) + \frac{1}{2t_k}\|\beta - u\|_2^2 \tag{D.3}
$$
is the proximal mapping of $h$, with $u = \beta^{(1)}_{(k-1)} - t_k\nabla f(\beta^{(1)}_{(k-1)})$ and
$$
\nabla f\big(\beta^{(1)}_{(k-1)}\big) = \sum_{i=1}^n\sum_{j=1}^{m_i}\sum_{d=1}^D w^{(t)}_{id}\Big\{\Big[1 + \exp\big(-x^{(1)T}_{ij}\beta^{(1)}_{(k-1)} - z^{(1)T}_{ij}b^{d(1)}_i\big)\Big]^{-1} - \alpha_{ij}\Big\}\,x^{(1)}_{ij}
$$
being the derivative of $f(\beta^{(1)})$ with respect to $\beta^{(1)}$ evaluated at $\beta^{(1)}_{(k-1)}$. The proximal mapping in equation (D.3) for our specific application is, unfortunately, not available in an analytical form. We resort to computing the proximal mappings numerically by re-writing the minimization problem in equation (D.3) as an $\ell_1$ penalized least squares fit with convex constraints as follows:
$$
\min_{\tilde\beta^{(1)}}\; \frac{1}{2t}\big\|u - A^{(1)}\tilde\beta^{(1)}\big\|_2^2 + n\lambda\big\|\tilde\beta^{(1)}\big\|_1
\quad\text{subject to}\quad \tilde f^{(1)}(\tilde\beta^{(1)}) \le 0, \tag{D.4}
$$
where $t = t_k$, $\tilde\beta_{1r} = c_{1r}\beta_{1r}$, $A^{(1)}$ is a $p\times p$ diagonal matrix with $A^{(1)}_{r,r} = 1/c_{1r}$ and $\tilde f^{(1)}$ denotes the transformed convexity constraints on $\tilde\beta^{(1)}$. For instance, if $f^{(1)}(\beta^{(1)}) = C^{(1)}\beta^{(1)}$ for some matrix $C^{(1)}$ with $p$ columns, then $\tilde f^{(1)}(\tilde\beta^{(1)}) = C^{(1)}A^{(1)}\tilde\beta^{(1)}$. Finally, we solve (D.4) using CVX (Grant et al. 2008).

Activity Time model - Define $\nu_1 = 1/\sigma_1$, $\tilde\beta^{(2)} = \nu_1\beta^{(2)}$ and re-write $\sum_{i=1}^n\sum_{j=1}^{m_i}\sum_{d=1}^D\log p(A_{ij}\mid \alpha_{ij}, b^{d(2)}_i, \beta^{(2)}, \nu_1)\,w^{(t)}_{id}$ as $-f(\nu_1, \tilde\beta^{(2)})$ + constant terms, where
$$
f(\nu_1, \tilde\beta^{(2)}) = \sum_{i=1}^n\sum_{j=1}^{m_i}\sum_{d=1}^D \alpha_{ij}\,w^{(t)}_{id}\Big[\frac{1}{2}\big(\nu_1\log A_{ij} - x^{(2)T}_{ij}\tilde\beta^{(2)}\big)^2 - \log\nu_1\Big]
$$
is convex in $(\nu_1, \tilde\beta^{(2)})$. Thus from equation (5.9), the minimization problem for $(\nu_1, \tilde\beta^{(2)})$ is
$$
\min_{\nu_1, \tilde\beta^{(2)}}\; f(\nu_1, \tilde\beta^{(2)}) + h(\nu_1, \tilde\beta^{(2)}), \tag{D.5}
$$
where $f(\nu_1, \tilde\beta^{(2)})$ is convex and differentiable with respect to $(\nu_1, \tilde\beta^{(2)})$, and $h(\nu_1, \tilde\beta^{(2)}) = n\lambda\sum_{r=1}^p c_{2r}|\tilde\beta_{2r}| + I_{\mathcal C}(\nu_1, \tilde\beta^{(2)})$ is convex but non-differentiable, with $I_{\mathcal C}(\nu_1, \tilde\beta^{(2)})$ the indicator function of the closed, convex set $\mathcal C = \{(\nu_1, \tilde\beta^{(2)}) : f^{(2)}(\tilde\beta^{(2)}) \le 0,\; \nu_1 \ge \epsilon\}$. Here $\epsilon$ is a small positive number used to enforce $\nu_1 > 0$. To solve (D.5) we use the proximal gradient method discussed in equations (D.2) and (D.3), wherein the proximal mapping of $h$ is given by
$$
\operatorname{prox}_{t_k, h}(u) = \operatorname*{argmin}_{\nu_1, \beta}\; h(\nu_1, \beta) + \frac{1}{2t_k}\big\|(\nu_1, \beta)^T - u\big\|_2^2,
$$
where $u = (\nu_1^{(k-1)}, \tilde\beta^{(2)}_{(k-1)})^T - t_k\nabla f(\nu_1^{(k-1)}, \tilde\beta^{(2)}_{(k-1)})$ and
$$
\nabla f\big(\nu_1^{(k-1)}, \tilde\beta^{(2)}_{(k-1)}\big) =
\begin{bmatrix}
\sum_{i,j,d}\alpha_{ij}w^{(t)}_{id}\Big\{-1/\nu_1^{(k-1)} + \log A_{ij}\big(\nu_1^{(k-1)}\log A_{ij} - x^{(2)T}_{ij}\tilde\beta^{(2)}_{(k-1)}\big)\Big\}\\[4pt]
-\sum_{i,j,d}\alpha_{ij}w^{(t)}_{id}\big(\nu_1^{(k-1)}\log A_{ij} - x^{(2)T}_{ij}\tilde\beta^{(2)}_{(k-1)}\big)\,x^{(2)}_{ij}
\end{bmatrix}
$$
being the derivative of $f(\nu_1, \tilde\beta^{(2)})$ with respect to $(\nu_1, \tilde\beta^{(2)})$ evaluated at $(\nu_1^{(k-1)}, \tilde\beta^{(2)}_{(k-1)})$. The above proximal mapping is computed in CVX by solving an $\ell_1$ penalized least squares fit with convex constraints as shown in equation (D.4).
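To make the updates (D.2)-(D.3) concrete, here is a minimal sketch of the proximal gradient iteration (Python; illustrative only). It drops the convex constraint set $\mathcal C$ and the backtracking line search, in which case the proximal mapping of the weighted $\ell_1$ penalty reduces to coordinate-wise soft-thresholding rather than the constrained CVX fit of equation (D.4); the gradient callable, penalty weights and step size below are placeholders:

```python
import numpy as np

def soft_threshold(u, thresh):
    """Unconstrained l1 proximal map: argmin_b thresh*||b||_1 + 0.5*||b - u||_2^2."""
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

def proximal_gradient(grad_f, beta0, lam_w, t=1e-3, n_iter=500):
    """Proximal gradient iterations in the spirit of (D.2), with a fixed step t
    instead of backtracking line search.
    grad_f : callable returning the gradient of the smooth loss f at beta
    lam_w  : vector of per-coordinate penalty weights (n * lambda * c_{sr})."""
    beta = beta0.copy()
    for _ in range(n_iter):
        u = beta - t * grad_f(beta)          # gradient step
        beta = soft_threshold(u, t * lam_w)  # proximal step, prox_{t,h}(u)
    return beta

# toy usage: l1-penalized logistic regression, no random effects or constraints
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0] + 0.5 * X[:, 1])))
grad = lambda b: X.T @ (1.0 / (1.0 + np.exp(-X @ b)) - y)
print(proximal_gradient(grad, np.zeros(10), lam_w=np.full(10, 5.0), t=1e-3))
```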
EI model - Like the AI model, $\sum_{i=1}^n\sum_{j=1}^{m_i}\sum_{d=1}^D\log p(\delta_{ij}\mid \alpha_{ij}, b^{d(3)}_i, \beta^{(3)})\,w^{(t)}_{id}$ can be written as $f_1(\beta^{(3)}) + f_2(\beta^{(3)})$ + terms independent of $\beta^{(3)}$, where
$$
f_1(\beta^{(3)}) = -\sum_{i=1}^n\sum_{j=1}^{m_i}\sum_{d=1}^D \alpha_{ij}\log\Big[1 + \exp\big(x^{(3)T}_{ij}\beta^{(3)} + z^{(3)T}_{ij}b^{d(3)}_i\big)\Big]\,w^{(t)}_{id}
$$
is concave in $\beta^{(3)}$ and
$$
f_2(\beta^{(3)}) = \sum_{i=1}^n\sum_{j=1}^{m_i}\sum_{d=1}^D \alpha_{ij}\,\delta_{ij}\,x^{(3)T}_{ij}\beta^{(3)}\,w^{(t)}_{id}
$$
is affine in $\beta^{(3)}$. So the minimization problem for $\beta^{(3)}$ in (5.9) is
$$
\min_{\beta^{(3)}}\; f(\beta^{(3)}) + h(\beta^{(3)}), \tag{D.6}
$$
where $f(\beta^{(3)}) = -f_1(\beta^{(3)}) - f_2(\beta^{(3)})$ is convex and differentiable with respect to $\beta^{(3)}$, and $h(\beta^{(3)}) = n\lambda\sum_{r=1}^p c_{3r}|\beta_{3r}| + I_{\mathcal C}(\beta^{(3)})$ is convex but non-differentiable, with $I_{\mathcal C}(\beta^{(3)})$ the indicator function of the closed, convex set $\mathcal C = \{\beta^{(3)} : f^{(3)}(\beta^{(3)}) \le 0\}$. To solve equation (D.6), we use the proximal gradient method discussed in equations (D.2) and (D.3), wherein the proximal mapping of $h$ is given by
$$
\operatorname{prox}_{t_k, h}(u) = \operatorname*{argmin}_{\beta}\; h(\beta) + \frac{1}{2t_k}\|\beta - u\|_2^2,
$$
where $u = \beta^{(3)}_{(k-1)} - t_k\nabla f(\beta^{(3)}_{(k-1)})$ and
$$
\nabla f\big(\beta^{(3)}_{(k-1)}\big) = \sum_{i=1}^n\sum_{j=1}^{m_i}\sum_{d=1}^D \alpha_{ij}\,w^{(t)}_{id}\Big\{\Big[1 + \exp\big(-x^{(3)T}_{ij}\beta^{(3)}_{(k-1)} - z^{(3)T}_{ij}b^{d(3)}_i\big)\Big]^{-1} - \delta_{ij}\Big\}\,x^{(3)}_{ij}
$$
being the derivative of $f(\beta^{(3)})$ with respect to $\beta^{(3)}$ evaluated at $\beta^{(3)}_{(k-1)}$. The above proximal mapping is finally computed in CVX by solving an $\ell_1$ penalized least squares fit with convex constraints as shown in equation (D.4).

Engag. Amount model - Like the Activity Time model, define $\nu_2 = 1/\sigma_2$, $\tilde\beta^{(4)} = \nu_2\beta^{(4)}$ and re-write $\sum_{i=1}^n\sum_{j=1}^{m_i}\sum_{d=1}^D\log p(E_{ij}\mid \alpha_{ij}, \delta_{ij}, b^{d(4)}_i, \beta^{(4)}, \nu_2)\,w^{(t)}_{id}$ as $-f(\nu_2, \tilde\beta^{(4)})$ + constant terms, where
$$
f(\nu_2, \tilde\beta^{(4)}) = \sum_{i=1}^n\sum_{j=1}^{m_i}\sum_{d=1}^D \alpha_{ij}\,\delta_{ij}\,w^{(t)}_{id}\Big[\frac{1}{2}\big(\nu_2\log E_{ij} - x^{(4)T}_{ij}\tilde\beta^{(4)}\big)^2 - \log\nu_2\Big]
$$
is convex in $(\nu_2, \tilde\beta^{(4)})$. Thus from equation (5.9), the minimization problem for $(\nu_2, \tilde\beta^{(4)})$ is
$$
\min_{\nu_2, \tilde\beta^{(4)}}\; f(\nu_2, \tilde\beta^{(4)}) + h(\nu_2, \tilde\beta^{(4)}), \tag{D.7}
$$
where $f(\nu_2, \tilde\beta^{(4)})$ is convex and differentiable with respect to $(\nu_2, \tilde\beta^{(4)})$, and $h(\nu_2, \tilde\beta^{(4)}) = n\lambda\sum_{r=1}^p c_{4r}|\tilde\beta_{4r}| + I_{\mathcal C}(\nu_2, \tilde\beta^{(4)})$ is convex but non-differentiable, with $I_{\mathcal C}(\nu_2, \tilde\beta^{(4)})$ the indicator function of the closed, convex set $\mathcal C = \{(\nu_2, \tilde\beta^{(4)}) : f^{(4)}(\tilde\beta^{(4)}) \le 0,\; \nu_2 \ge \epsilon\}$. Here $\epsilon$ is a small positive number used to enforce $\nu_2 > 0$. To solve (D.7) we use the proximal gradient method discussed in equations (D.2) and (D.3), wherein the proximal mapping of $h$ is given by
$$
\operatorname{prox}_{t_k, h}(u) = \operatorname*{argmin}_{\nu_2, \beta}\; h(\nu_2, \beta) + \frac{1}{2t_k}\big\|(\nu_2, \beta)^T - u\big\|_2^2,
$$
where $u = (\nu_2^{(k-1)}, \tilde\beta^{(4)}_{(k-1)})^T - t_k\nabla f(\nu_2^{(k-1)}, \tilde\beta^{(4)}_{(k-1)})$ and
$$
\nabla f\big(\nu_2^{(k-1)}, \tilde\beta^{(4)}_{(k-1)}\big) =
\begin{bmatrix}
\sum_{i,j,d}\alpha_{ij}\delta_{ij}w^{(t)}_{id}\Big\{-1/\nu_2^{(k-1)} + \log E_{ij}\big(\nu_2^{(k-1)}\log E_{ij} - x^{(4)T}_{ij}\tilde\beta^{(4)}_{(k-1)}\big)\Big\}\\[4pt]
-\sum_{i,j,d}\alpha_{ij}\delta_{ij}w^{(t)}_{id}\big(\nu_2^{(k-1)}\log E_{ij} - x^{(4)T}_{ij}\tilde\beta^{(4)}_{(k-1)}\big)\,x^{(4)}_{ij}
\end{bmatrix}
$$
being the derivative of $f(\nu_2, \tilde\beta^{(4)})$ with respect to $(\nu_2, \tilde\beta^{(4)})$ evaluated at $(\nu_2^{(k-1)}, \tilde\beta^{(4)}_{(k-1)})$. Finally, CVX is used to compute the above proximal mapping by solving an $\ell_1$ penalized least squares fit with convex constraints as shown in equation (D.4).

Dropout model - For the dropout model, we re-write $\sum_{i=1}^n\sum_{j=1}^{m_i}\sum_{d=1}^D\log p(D_i\mid b^d_i, \beta^{(5)}, \phi)\,w^{(t)}_{id}$ as $f_1(\beta^{(5)}, \phi) + f_2(\beta^{(5)}, \phi)$ + constant terms, where
$$
f_1(\beta^{(5)}, \phi) = -\sum_{i=1}^n\sum_{j=1}^{m_i}\sum_{d=1}^D\log\Big[1 + \exp\big(x^{(5)T}_{ij}\beta^{(5)} + \phi^T b^d_i\big)\Big]\,w^{(t)}_{id}
$$
is concave in $(\beta^{(5)}, \phi)$ and
$$
f_2(\beta^{(5)}, \phi) = \sum_{i=1}^n\sum_{j=1}^{m_i}\sum_{d=1}^D D_i\big(x^{(5)T}_{ij}\beta^{(5)} + \phi^T b^d_i\big)\,w^{(t)}_{id}
$$
is affine in $(\beta^{(5)}, \phi)$.
Now from equation (5.9), the minimization problem for $(\beta^{(5)}, \phi)$ is
$$
\min_{\beta^{(5)}, \phi}\; f(\beta^{(5)}, \phi) + h(\beta^{(5)}), \tag{D.8}
$$
where $f(\beta^{(5)}, \phi) = -f_1(\beta^{(5)}, \phi) - f_2(\beta^{(5)}, \phi)$ is convex and differentiable with respect to $(\beta^{(5)}, \phi)$, and $h(\beta^{(5)}) = n\lambda\sum_{r=1}^p c_{5r}|\beta_{5r}| + I_{\mathcal C}(\beta^{(5)})$ is convex but non-differentiable, with $I_{\mathcal C}(\beta^{(5)})$ the indicator function of the closed, convex set $\mathcal C = \{\beta^{(5)} : f^{(5)}(\beta^{(5)}) \le 0\}$. To solve equation (D.8), we use the proximal gradient method discussed in equations (D.2) and (D.3), wherein the proximal mapping of $h$ is given by
$$
\operatorname{prox}_{t_k, h}(u) = \operatorname*{argmin}_{\beta, \phi}\; h(\beta) + \frac{1}{2t_k}\big\|(\beta, \phi)^T - u\big\|_2^2,
$$
where $u = (\beta^{(5)}_{(k-1)}, \phi_{(k-1)})^T - t_k\nabla f(\beta^{(5)}_{(k-1)}, \phi_{(k-1)})$ and
$$
\nabla f\big(\beta^{(5)}_{(k-1)}, \phi_{(k-1)}\big) = \sum_{i=1}^n\sum_{j=1}^{m_i}\sum_{d=1}^D w^{(t)}_{id}\Big\{\Big[1 + \exp\big(-x^{(5)T}_{ij}\beta^{(5)}_{(k-1)} - \phi^T_{(k-1)}b^d_i\big)\Big]^{-1} - D_i\Big\}
\begin{bmatrix} x^{(5)}_{ij}\\ b^d_i \end{bmatrix}
$$
being the derivative of $f(\beta^{(5)}, \phi)$ with respect to $(\beta^{(5)}, \phi)$ evaluated at $(\beta^{(5)}_{(k-1)}, \phi_{(k-1)})$. We use CVX to compute the above proximal mapping by solving an $\ell_1$ penalized least squares fit with convex constraints as shown in equation (D.4).

D.1.3 Estimating $\Gamma$

From equation (5.9), the optimization problem for estimating $\Gamma$ at iteration $(t)$ can be expressed as
$$
\min_{\Gamma\succeq 0}\; \log|\Gamma| + \mathrm{trace}\big(Q\Gamma^{-1}\big) + 2\lambda\|P\circ\Gamma\|_1, \tag{D.9}
$$
where $Q_{4p_c\times 4p_c} = n^{-1}\sum_{i=1}^n\sum_{d=1}^D b^d_i b^{dT}_i\,w^{(t)}_{id}$ and $P_{4p_c\times 4p_c} = \mathrm{diag}\big(d^{(t)}_{s1},\ldots,d^{(t)}_{s,4p_c}\big)$. Here $\circ$ denotes elementwise multiplication and, for any matrix $A$, $\|A\|_1 = \|\mathrm{vec}(A)\|_1 = \sum_{i,j}|A_{ij}|$. The above minimization problem in $\Gamma$ is non-convex (Bien and Tibshirani 2011) and we use the coordinate descent based algorithm of Wang (2014), which updates one row and one column at a time while keeping the remaining elements fixed, to obtain a solution. In particular, given inputs $(Q, P, \lambda)$ and iteration $(k+1)$, the aforementioned algorithm first partitions
$$
\Gamma^{(k+1)} = \begin{pmatrix}\Gamma^{(k)}_{11} & \gamma_{12}\\ \gamma^T_{12} & \gamma_{22}\end{pmatrix},
\qquad
Q = \begin{pmatrix}Q_{11} & q_{12}\\ q^T_{12} & q_{22}\end{pmatrix},
$$
where $\Gamma^{(k)}_{11}$ and $Q_{11}$ are the sub-matrices obtained from the first $4p_c - 1$ columns. Then, with $\beta = \gamma_{12}$ and $\gamma = \gamma_{22} - \gamma^T_{12}\Gamma^{(k)-1}_{11}\gamma_{12}$, it uses coordinate descent algorithms (Friedman et al. 2007) to obtain the estimates $(\hat\beta, \hat\gamma)$ (see equations (5)-(7) in Wang (2014)) and finally updates $\Gamma^{(k+1)}_{12} = \hat\beta$ and $\Gamma^{(k+1)}_{22} = \hat\gamma + \hat\beta^T\Gamma^{(k)-1}_{11}\hat\beta$. This procedure is repeated for every row and column (keeping the others fixed) until convergence.

D.2. Prediction equations

We first focus on the prediction problem discussed in chapter 5.6.3. For player $i$, let $\mathcal Y_i(t) = \{\alpha_{ij}, A_{ij}, \delta_{ij}, E_{ij} : 0\le j\le t\}$ denote the observed responses until time $t$ and let $\rho_i(u\,|\,t)$ be the conditional probability of drop-out at time $u > t > 0$ given no drop-out until time $t$. Then,
$$
\rho_i(u\,|\,t) = \Pr\big(D_i = u \mid D_i > t, \mathcal Y_i(t), \Theta\big)
= \int \Pr\big(D_i = u \mid D_i > t, \mathcal Y_i(t), b_i, \Theta\big)\,p\big(b_i \mid D_i > t, \mathcal Y_i(t), \Theta\big)\,db_i
= \int \Pr\big(D_i = u \mid D_i > t, b_i, \Theta\big)\,p\big(b_i \mid D_i > t, \mathcal Y_i(t), \Theta\big)\,db_i.
$$
Following section 3 of Rizopoulos (2011) and the fitted dropout model in equation (5.8), an estimate of $\rho_i(u\,|\,t)$ is
$$
\hat\rho_i(u\,|\,t) = \Pr\big(D_i = u \mid D_i > t, \hat b_i, \hat\Theta\big),
\qquad\text{where}\quad \hat b_i = \operatorname*{argmax}_b\,\log p\big(b \mid D_i > t, \mathcal Y_i(t), \hat\Theta\big).
$$
In chapter 5.6.2, we are interested in predicting the time-$u$ ($u > t$) expected longitudinal outcomes of AI, Activity, EI and Engagement given the observed responses $\mathcal Y_i(t)$ for a player $i$ who has not dropped out at time $t$. We consider the case of predicting $w_i(u\,|\,t) := E\{A_{iu}\mid D_i > t, \mathcal Y_i(t), \Theta\}$ as an example, as the rest follow along similar lines. Let $\hat\alpha_{iu}$ be the predicted AI at time $u$ conditional on $\mathcal Y_i(t)$ and no dropout until time $t$.
Then note that
$$
E\{A_{iu}\mid D_i > t, \mathcal Y_i(t), \Theta\} = \int E\{A_{iu}\mid b_i, \Theta\}\,p\big(b_i \mid D_i > t, \mathcal Y_i(t), \Theta\big)\,db_i,
$$
and from section 7.2 of Rizopoulos (2012) an estimate of $w_i(u\,|\,t)$ is given by
$$
\hat w_i(u\,|\,t) =
\begin{cases}
0, & \text{if } \hat\alpha_{iu} = 0,\\[4pt]
\exp\Big(x^{(2)T}_{iu}\hat\beta^{(2)} + z^{(2)T}_{iu}\hat b^{(2)}_i + \dfrac{\hat\sigma_1^2}{2}\Big), & \text{otherwise},
\end{cases}
$$
where $\hat b_i = (\hat b^{(s)}_i : 1\le s\le 4) = \operatorname*{argmax}_b\,\log p\big(b \mid D_i > t, \mathcal Y_i(t), \hat\Theta\big)$.

D.3. Split-and-Conquer approach and numerical experiments

In this section, we first discuss the split-and-conquer approach of Chen and Xie (2014) (chapter D.3.1) and thereafter conduct numerical experiments to demonstrate the applicability of this approach in our GLMM setup (chapter D.3.2).

D.3.1 Split-and-Conquer approach

To enhance the computational efficiency of the estimation procedure, CEZIJ uses the split-and-conquer approach of Chen and Xie (2014) to split the full set of $n$ players into $K$ non-overlapping groups and conducts variable selection separately in each group by solving $K$ parallel maximization problems represented by equation (5.9). In the process, our methodology uses data-driven adaptive weights $(c_{sr}, d_{sr})\in\mathbb{R}^2_+$ in the penalty, with the weights in any iteration computed from the solutions of the previous iteration. The selected fixed and random effects are then determined using a majority voting scheme across all the $K$ groups, as described in chapter 5.4. In their original article, however, Chen and Xie (2014) use this approach in a GLM setup and conduct selection, and thereafter estimation of the selected coefficients, by first solving $K$ penalized likelihood problems (with a fixed penalty that may vary with $K$) across the $K$ splits of the data and then averaging across the selected coefficients in each split. Theorem 1 in their paper demonstrates that the estimator so obtained is sign consistent under some regularity conditions and as long as $\log(Kp) = o(n/K)$, where $p$ denotes the number of candidate predictors and $n$ the sample size. Moreover, along with Theorem 1, Theorem 2 establishes that this averaged estimator is asymptotically equivalent to the estimator obtained by solving the penalized likelihood problem on the entire data. While these theoretical results do not directly extend to a GLMM setting, in chapter D.3.2 we empirically demonstrate the applicability of the above scheme in selecting fixed and random effects in our setting, where data-driven adaptive weights are used in the penalty and variable selection is conducted simultaneously across multiple models. In terms of computational efficiency, the split-and-conquer approach is efficient in the sense that if an estimation procedure requires $O(n^ap^b)$ computing steps for some $a > 1$, $b\ge 0$, then the split-and-conquer approach results in an efficiency gain of $O(K^{a-1})$ in computing steps (see theorem 5 in Chen and Xie (2014)). Figure D.1 presents a comparison of the computing time for the two simulation settings considered in chapter D.3.2 and demonstrates that in both these settings CEZIJ, through its split-and-conquer approach for variable selection, offers a potential gain in computational efficiency against the conventional and memory-intensive approach of running the selection algorithm on the undivided data.

D.3.2 Numerical experiments

Here we present numerical experiments that assess the model selection performance of CEZIJ under the longitudinal and Dropout models discussed in chapter 5.3.1. The MATLAB code for these simulation experiments is available at https://github.com/trambakbanerjee/cezij#what-is-cezij.
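A minimal sketch of the majority-voting step that combines the $K$ split-level selections (Python; illustrative only, with a hypothetical selection matrix and threshold) is given below; the same rule, with a threshold of 12 out of $K = 20$ splits, underlies Table D.6 in appendix D.5:

```python
import numpy as np

def majority_vote(selected, threshold):
    """selected : boolean array of shape (K, p); entry (k, r) indicates whether
    predictor r was selected in split k. Returns the indices of predictors
    selected in at least `threshold` of the K splits, plus the vote counts."""
    counts = selected.sum(axis=0)
    return np.flatnonzero(counts >= threshold), counts

# toy usage: K = 20 splits, p = 6 candidate predictors, threshold of 12 votes
rng = np.random.default_rng(3)
sel = rng.random((20, 6)) < np.array([0.9, 0.2, 0.7, 0.1, 0.95, 0.4])
kept, votes = majority_vote(sel, threshold=12)
print(kept, votes)
```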
We consider two simulation settings as follows:

Simulation setting I - We consider a sample of $n = 500$ players and for each player $i$, let $X_i = (X_{i\cdot 1},\ldots,X_{i\cdot p})$ denote the $m\times p$ matrix of candidate predictors, where $X_{i\cdot k} = (x_{i1k},\ldots,x_{imk})^T$. We fix $m = 30$, $p = 10$ and take $I_f = \{1,\ldots,8\}$, $I_c = \{9, 10\}$, so that $p_f = 8$, $p_c = 2$. Thus, the first 8 columns of $X_i$ represent fixed effects while the last 2 represent composite effects. The five responses $[\alpha_i, A_i, \delta_i, E_i, D_i]$ corresponding to equations (5.2), (5.3), (5.6), (5.7) and (5.8) of chapter 5.3.1 are generated from the following models: $\mathrm{logit}(\pi_{ij}) = \beta^{(1)}_0 + x_{ij1}\beta^{(1)}_1 + b_{i1}$; $\mu_{ij} = \beta^{(2)}_0 + x_{ij2}\beta^{(2)}_1 + b_{i2}$ with $\sigma_1 = 0.5$; $\mathrm{logit}(q_{ij}) = \beta^{(3)}_0 + x_{ij3}\beta^{(3)}_1 + b_{i3}$; $\xi_{ij} = \beta^{(4)}_0 + x_{ij4}\beta^{(4)}_1 + b_{i4}$ with $\sigma_2 = 0.5$; and $\mathrm{logit}(h_{ij}) = \beta^{(5)}_0 + x_{ij5}\beta^{(5)}_1 + \phi_1 b_{i1} + \cdots + \phi_4 b_{i4}$, where the true values of the fixed effect coefficients are $\beta^{(1)} = (1, -1.5)$, $\beta^{(2)} = (3.5, -2)$, $\beta^{(3)} = (1, -1)$, $\beta^{(4)} = (3, -3)$, $\beta^{(5)} = (1, 2)$ and $\phi = (\phi_1,\ldots,\phi_4) = (0.1, 0.2, 0.1, -0.2)$. Thus setting I presents a relatively simple scenario wherein there are no composite effects in the true model. The random effects $b_i = (b_{i1},\ldots,b_{i4})$ are sampled from $N_4(0, \Gamma)$, independently for each $i$, where
$$
\Gamma = \begin{pmatrix}
1.0 & 0.2 & 0.4 & 0.5\\
0.2 & 3.0 & 0.9 & 0.7\\
0.4 & 0.9 & 0.8 & 0.5\\
0.5 & 0.7 & 0.5 & 4.0
\end{pmatrix}.
$$
Since the CEZIJ framework can incorporate convexity constraints on the fixed effect coefficients, we impose the following sign constraints: $\beta^{(1)}_0 > 0$, $\beta^{(1)}_1 < 0$, $\beta^{(3)}_1 < 0$, $\beta^{(4)}_0 > 0$, $\beta^{(4)}_1 < 0$. Finally, to complete the specification, we sample $(x_{ij1},\ldots,x_{ij4})$ from $N_4(0, 4I_4)$, independently for each $i = 1,\ldots,n$, $j = 1,\ldots,m$. To ensure that the generated sample contains players that have not churned for at least the first 7 to 10 days, we let $X_{i\cdot 5}$ be an $m$-dimensional ordered sample from $\mathrm{Unif}(-1, 1)$, so that $X_{i\cdot 5} = (x_{i15}\le\cdots\le x_{im5})$, and generate the remaining predictors independently from $\mathrm{Unif}(-1, 1)$. In this respect, $X_{i\cdot 5}$ mimics the variable timesince, which gradually increases with $m$ and appears in the fitted Dropout model in table 5.2 of chapter 5.6.1.

Simulation setting II - In this setting, we consider a larger design and fix $n = 2000$, $m = 30$, $p = 20$ and take $I_f = \{1, 3, 5,\ldots, 8, 11,\ldots,p\}$, $I_c = \{2, 4, 9, 10\}$, so that $p_f = 16$ and $p_c = 4$. The five responses $[\alpha_i, A_i, \delta_i, E_i, D_i]$ are generated from the following models: $\mathrm{logit}(\pi_{ij}) = \beta^{(1)}_0 + x_{ij1}\beta^{(1)}_1 + b_{i1}$; $\mu_{ij} = \beta^{(2)}_0 + b_{i2} + x_{ij2}(\beta^{(2)}_1 + b_{i3})$ with $\sigma_1 = 0.5$; $\mathrm{logit}(q_{ij}) = \beta^{(3)}_0 + x_{ij3}\beta^{(3)}_1 + b_{i4}$; $\xi_{ij} = \beta^{(4)}_0 + b_{i5} + x_{ij4}(\beta^{(4)}_1 + b_{i6})$ with $\sigma_2 = 0.5$; and $\mathrm{logit}(h_{ij}) = \beta^{(5)}_0 + x_{ij5}\beta^{(5)}_1 + \phi_1 b_{i1} + \cdots + \phi_6 b_{i6}$, where the true values of the fixed effect coefficients are identical to setting I and $\phi = (\phi_1,\ldots,\phi_6)\overset{i.i.d.}{\sim}\mathrm{Unif}(-0.3, 0.3)$. The random effects $b_i = (b_{i1},\ldots,b_{i6})$ are sampled from $N_6(0, \Gamma)$, independently for each $i$, where
$$
\Gamma = \begin{pmatrix}
1.0 & 0.2 & 0.3 & 0.4 & 0.5 & 0.3\\
0.2 & 3.0 & 0.2 & 0.9 & 0.7 & 0.1\\
0.3 & 0.2 & 1.0 & 0.2 & 0.3 & 0.2\\
0.4 & 0.9 & 0.2 & 0.8 & 0.5 & 0.4\\
0.5 & 0.7 & 0.3 & 0.5 & 4.0 & 0.3\\
0.3 & 0.1 & 0.2 & 0.4 & 0.3 & 1.0
\end{pmatrix},
$$
and the convexity constraints on the fixed effect coefficients continue to resemble those of setting I. Finally, we continue to let $X_{i\cdot 5}$ be an $m$-dimensional ordered sample from $\mathrm{Unif}(-1, 1)$ and sample the remaining $p - 1$ predictors from a multivariate Gaussian distribution with mean $0$ and covariance matrix $\mathrm{Cov}(x_{ijr}, x_{ijs}) = 0.5^{|r-s|}$, independently for each $i = 1,\ldots,n$, $j = 1,\ldots,m$.
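To make the data-generating mechanism of setting I concrete, the following compact sketch (Python, not the authors' MATLAB code) simulates the random effects, the AI indicator and the conditional activity time. The symbols and numerical values follow the reconstruction above, only the first two sub-models are shown, and the EI, engagement and dropout responses would be generated analogously:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, p = 500, 30, 10
beta1, beta2, sigma1 = np.array([1.0, -1.5]), np.array([3.5, -2.0]), 0.5
Gamma = np.array([[1.0, 0.2, 0.4, 0.5],
                  [0.2, 3.0, 0.9, 0.7],
                  [0.4, 0.9, 0.8, 0.5],
                  [0.5, 0.7, 0.5, 4.0]])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# predictors: (x_ij1,...,x_ij4) ~ N_4(0, 4 I_4), column 5 an ordered Unif(-1, 1)
# sample (mimicking timesince), remaining columns Unif(-1, 1)
X = rng.uniform(-1.0, 1.0, size=(n, m, p))
X[:, :, :4] = rng.normal(0.0, 2.0, size=(n, m, 4))
X[:, :, 4] = np.sort(rng.uniform(-1.0, 1.0, size=(n, m)), axis=1)

b = rng.multivariate_normal(np.zeros(4), Gamma, size=n)       # b_i ~ N_4(0, Gamma)

pi = sigmoid(beta1[0] + X[:, :, 0] * beta1[1] + b[:, [0]])    # AI probability
alpha = rng.binomial(1, pi)                                   # AI indicator: active on day j?
mu = beta2[0] + X[:, :, 1] * beta2[1] + b[:, [1]]             # mean of log activity time
A = alpha * np.exp(mu + sigma1 * rng.standard_normal((n, m))) # activity time, zero when inactive
```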
Recall that CEZIJ uses the split-and-conquer approach of Chen and Xie (2014) to split the full set of $n$ players into $K$ non-overlapping groups and conducts variable selection separately in each group by solving $K$ parallel maximization problems represented by equation (5.9). The selected fixed and random effects are then determined using a majority voting scheme across all the $K$ groups, as described in chapter 5.5. For settings I and II, we fix $(K, \omega_1, \omega_2)$ at $(5, 3, 3)$ and $(10, 6, 6)$, respectively, so that $n/K$ is 100 in setting I and 200 in setting II. For each setting, we generate 50 datasets and assess the model selection performance in terms of the average False Negatives (in-model predictors falsely identified as being out of the model) and average False Positives (out-of-model predictors falsely identified as being in the model) for the fixed effects (composite or not) and the random effects. To evaluate the hierarchical selection property of our framework, we also report the percentage of datasets where our method conducted non-hierarchical selection and chose predictors with random effects only.

Table D.1: Simulation setting I (n = 500, m = 30, p = 10, K = 5) - average False Negatives (FN), average False Positives (FP) for fixed (composite or not) and random effects and % datasets with non-hierarchical selection.

Model           Fixed FN   Fixed FP   Random FN   Random FP   % Non Hier. Selec.
AI              0          2.76       0.16        0.36        0
Activity Time   0          1.44       0           0.04        0
EI              0          4.48       0           1.12        0
Engage. Time    0          1.52       0           0.04        0
Dropout         0          1.52       -           -           -

Table D.2: Simulation setting II (n = 2000, m = 30, p = 20, K = 10) - average False Negatives (FN), average False Positives (FP) for fixed (composite or not) and random effects and % datasets with non-hierarchical selection.

Model           Fixed FN   Fixed FP   Random FN   Random FP   % Non Hier. Selec.
AI              0          4.13       0.07        1           0
Activity Time   0          0.47       0           0           0
EI              0          5.80       0           0.8         0
Engage. Time    0          1.07       0           0           0
Dropout         0          1.13       -           -           -

Tables D.1 and D.2 report the results of these simulation experiments. We see that across both simulation settings, CEZIJ selects the correct in-model predictors for the five models. The relatively higher fixed-effects False Positives for the AI and EI models possibly indicate some over-fitting due to the prevalence of a large number of zeros in these models. However, CEZIJ selects the fixed and random effects in a hierarchical fashion such that no random-effect predictor appears in any of the four models without its fixed-effect counterpart. This is not surprising given the way CEZIJ updates the adaptive weights $(c^{(t)}_{sr}, d^{(t)}_{sr})$ after each iteration. Figure D.1 presents a comparison of the computing time for the two simulation settings considered here. In particular, it demonstrates that in both these settings CEZIJ, through its split-and-conquer approach for variable selection, offers a potential gain in computational efficiency against the conventional and memory-intensive approach of running the selection algorithm on the undivided data.

[Figure D.1: Computing time comparison for a fixed regularization parameter. Left: Simulation setting I with n = 500, m = 30, p = 10, K = 5. Right: Simulation setting II with n = 2000, m = 30, p = 20, K = 10.]
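The FN and FP columns reported in Tables D.1 and D.2 are simple set differences between the selected support and the true support of each sub-model; a short sketch (Python; the index sets in the usage line are hypothetical) is:

```python
def fn_fp(selected, true_support):
    """False negatives: in-model predictors missed; false positives: spurious selections."""
    sel, true = set(selected), set(true_support)
    return len(true - sel), len(sel - true)

# toy usage: true fixed-effect support {1, 5}; the procedure selected {1, 5, 7}
print(fn_fp(selected=[1, 5, 7], true_support=[1, 5]))   # (0, 1)
```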
The efficiency gain reported in these figures, however, relies on the specific system configuration, which in our case was Windows 7, 64-bit, 32GB RAM on an Intel i7-5820K CPU with 12 cores.

D.4. Data description

In this section we describe the data, which holds player-level gaming information for a free-to-play Robot versus Robot fighting game based on the movie Real Steel for Windows, iOS and Android devices. The primary game-play revolves around fighting and upgrading the robots, while the secondary goals are to own as many robots as possible and collect rewards. A key feature of the game is a Lucky Draw, which is a card game where players bet on their earnings to earn exciting in-app consumables, virtual currencies for robot upgrades or even robots! There are 38,860 players with first activity date 24-Oct-2014, and the analyses presented in chapter 5.6 use the cohort of 33,860 players for estimation and the remaining 5,000 players for prediction. In table D.3, we list the raw covariates, along with their description, that were available in the data, and table D.4 presents a descriptive summary of the raw covariates.

[Figure D.2: Empirical CDF of Activity Time and Engagement Amount.]

Promotion strategies - As discussed in chapter 5.2, we also have side information about the different retention and promotion strategies that were used across the 60 days. These strategies were carefully designed by the game marketers to induce player activity and boost engagement and in-app purchases at different points in time. Table D.5 and figure D.3 provide a summary of the 6 different promotion strategies that were used during the 60 days that the players were observed. In what follows, we provide a short description of the 6 promotion strategies.

Promotion strategy I - awards extra energy points or rewards during fights when the player wins a combat.

Table D.3: List of covariates and the five responses. The gaming characteristics are marked with an (*).

Sl No   Covariate                Description
1       avg session length       Average Session Length in Minutes
2       p fights                 Total No. Of Principal Fights Played
3       a1 fights                Auxiliary 1 Fights Played
4       a2 fight                 Auxiliary 2 Fights Played
5       level                    Last Principal Level Fight Played
6       robot played             Total No. Of Robots Played with
7       gacha sink               Amount of In-Game Currency Spent for Gacha
8       gacha premium sink       Amount of Premium In-Game Currency Spent on Gacha
9       pfight source            Amount of In-Game Currency Earned by Playing Principal Fights
10      a1fight source           Amount of In-Game Currency Earned by Playing Auxiliary 1 Fights
11      a2fight source           Amount of In-Game Currency Earned by Playing Auxiliary 2 Fights
12      gacha source             Amount of In-Game Currency Earned by Playing Lucky Draw
13      gacha premium source     Amount of Premium In-Game Currency Earned by Playing the Lucky Draw
14      robot purchase count     No. Of Robots Purchased per Day
15      upgrade count            No. Of Robot Upgrades Done per Day
16      lucky draw ig            No. Of Lucky Draws Played per Day Inside Game
17      timesince                Time Since Last Login in Days
18      lucky draw og            No. Of Lucky Draws Played per Day Outside Game
19      fancy sink               Amount of In-Game Currency Spent on Buying Accessories
20      upgrade sink             Amount of In-Game Currency Spent for Robot Upgrade
21      robot buy sink           Amount of In-Game Currency Spent for Robot Purchase
22      gain gachaprem           % gain over gacha premium sink
23      gain gachagrind          % gain over gacha sink
24      weekend                  Weekend Indicator (0 - No, 1 - Yes)

Sl No   Response                 Description
1       AI                       Whether active in a day (0 - No, 1 - Yes)
2       activity time            Total Time Played in a day in Minutes
3       EI                       Whether positive engagement from the player in a day (0 - No, 1 - Yes)
4       engagement amount        Total positive engagement amount from the player in a day in dollars
5       dropout                  Whether dropped out on that day (0 - No, 1 - Yes)

Promotion strategy II - constitutes the sale of 'boss' robots that possess special combat moves not available in other robots and can only be acquired by defeating the boss robot itself.

Promotion strategy III - provides discounts on the purchase of powerful robots that are usually available in higher levels of the game.

Promotion strategy IV - offered discounts on in-app purchases during the Black-Friday and Thanksgiving holiday week.

Promotion strategy V - designed to promote different robots and their combat skills through emails and notifications.

Promotion strategy VI - provides discounts on the purchase of all robots.

Table D.4: Summary statistics of the covariates reporting % of 0, mean, the 25th, 50th, 75th, 95th percentiles and the standard deviation of all active players ($\alpha_{ij} = 1$) across all m = 60 days. For timesince, however, the statistics are reported for all players and not just active players.

Covariate                 % of 0    Mean     25th     50th     75th     95th      Std.
avg session length        0.01      32.55    7.32     2.66     13.63    30.97     5925.89
p fights                  43.69     2.83     1.00     0.00     4.00     12.00     5.07
a1 fights                 57.11     1.59     0.00     0.00     1.00     8.00      3.57
a2 fight                  84.83     0.64     0.00     0.00     0.00     4.00      2.35
level                     43.69     3.70     1.00     0.00     5.00     15.00     5.06
robot played              30.68     1.15     1.00     0.00     2.00     3.00      1.16
gacha sink                73.06     6.19     0.00     0.00     1.00     31.50     27.65
gacha premium sink        96.80     0.30     0.00     0.00     0.00     0.00      3.76
pfight source             43.69     27.18    0.98     0.00     10.24    116.82    108.44
a1fight source            57.12     3.07     0.00     0.00     1.56     15.82     10.98
a2fight source            84.83     1.47     0.00     0.00     0.00     5.12      12.27
gacha source              71.81     1.67     0.00     0.00     0.80     8.50      6.56
gacha premium source      88.93     0.85     0.00     0.00     0.00     5.00      3.81
robot purchase count      91.62     0.10     0.00     0.00     0.00     1.00      0.38
upgrade count             55.26     3.30     0.00     0.00     3.00     17.00     6.74
lucky draw wg             55.62     1.18     0.00     0.00     1.00     4.00      3.56
timesince                 7.91      22.93    21.00    7.00     37.00    54.00     17.57
lucky draw og             77.97     1.92     0.00     0.00     0.00     10.00     8.56
fancy sink                87.57     0.69     0.00     0.00     0.00     1.60      10.51
upgrade sink              55.50     18.23    0.00     0.00     10.29    85.20     74.21
robot buy sink            91.64     8.68     0.00     0.00     0.00     35.00     56.10
gain gachaprem            98.36     0.04     0.00     0.00     0.00     0.00      0.45
gain gachagrind           77.98     0.13     0.00     0.00     0.00     0.47      1.30
weekend                   63.33     0.37     0.00     0.00     1.00     1.00      0.48

Table D.5: Summary of the promotion strategies

Strategy   Description                             No. of days   %
-          No strategy                             20            33.33
I          More energy or rewards                  8             13.33
II         Sale of boss robots                     4             6.67
III        Discounts on powerful robots            8             13.33
IV         Holiday sale                            7             11.67
V          Promotion via emailing and messaging    5             8.33
VI         Sale on all robots                      8             13.33

In table 5.1 we provide the list of convex constraints imposed on the fixed effect coefficients while solving the maximization problem in equation (5.9).

[Figure D.3: Distribution of the six promotion strategies over 60 days.]
D.5. Variable selection by split-and-conquer: voting results

Table D.6 provides the voting results of the variable selection by the split-and-conquer approach of chapter 5.5.

Table D.6: The number of times each candidate predictor is selected as a fixed effect and as a random effect across the K = 20 splits for the five sub-models. For each sub-model, the predictors with at least 12 occurrences across the 20 splits were selected.

                          AI              Act. Time       EI              Engage. Amnt    Dropout
Predictor                 Fixed  Random   Fixed  Random   Fixed  Random   Fixed  Random   Fixed
Intercept                 20     20       20     20       14     14       20     20       17
avg session length        5      5        18     18       14     14       5      3        3
p fights                  20     20       18     18       14     14       11     11       3
a1 fights                 20     20       18     18       14     14       9      9        3
a2 fights                 20     20       18     18       14     14       11     11       3
level                     15     14       18     18       14     14       3      0        5
robot played              5      5        11     11       0      0        8      8        3
gacha sink                3      0        18     18       14     14       11     0        14
gacha premium sink        0      0        0      0        0      0        5      5        2
pfight source             3      0        18     18       0      0        9      8        3
a1fight source            12     11       18     18       14     14       8      8        5
a2fight source            17     17       18     18       14     14       12     12       11
gacha source              20     20       18     18       0      0        11     11       3
gacha premium source      11     11       0      0        14     14       5      5        2
robot purchase count      17     17       0      0        0      0        3      0        0
upgrade count             20     20       12     12       14     14       5      5        2
lucky draw wg             2      0        2      2        14     14       9      9        3
timesince                 20     20       20     20       14     14       2      0        20
lucky draw og             18     18       0      0        14     14       6      6        3
fancy sink                2      2        0      0        14     14       5      5        0
upgrade sink              17     15       0      0        14     14       6      6        3
robot buy sink            5      5        0      0        14     6        3      2        2
gain gachaprem            3      3        0      0        0      0        2      2        0
gain gachagrind           18     18       18     18       0      0        11     5        3
weekend                   20     12       18     18       0      0        2      0        0
promotion I               0      -        0      -        0      -        12     -        17
promotion II              20     -        18     -        14     -        3      -        17
promotion III             14     -        0      -        14     -        14     -        17
promotion IV              0      -        0      -        14     -        14     -        15
promotion V               0      -        0      -        14     -        0      -        14
promotion VI              20     -        18     -        14     -        14     -        17
# selected                18     14       17     15       22     16       6      2        9
Abstract
In this thesis we discuss novel shrinkage methods for estimation, prediction and variable selection problems. We begin with large scale estimation and consider the problem of estimating a high-dimensional sparse parameter in the presence of side information that encodes the sparsity structure. In a wide range of fields including genomics, neuroimaging, signal processing and finance, such side information promises to yield more accurate and meaningful results. However, few analytical tools are available for extracting and combining information from different data sources in high-dimensional data analysis. We develop a general framework for incorporating side information into the sparse estimation framework and develop new theories to characterize regimes in which our proposed procedure far outperforms competitive shrinkage estimators. When the parameter of interest is not necessarily sparse and the data available for estimation are discrete, we propose a Nonparametric Empirical Bayes (NEB) framework for compound estimation in such discrete models. Specifically, we consider the discrete linear exponential family, which includes a wide class of discrete distributions frequently arising from modern big data applications, and develop a flexible framework for compound estimation with both regular and scaled squared error losses. We develop theory to show that the class of NEB estimators enjoys strong asymptotic properties and present comprehensive simulation studies as well as real data analyses demonstrating the superiority of the NEB estimator over competing methods.

Contemporary applications in finance, health-care and supply chain management often require simultaneous predictions of several dependent variables when the true covariance structure is unknown. In these multivariate applications, the true covariance structure can often be well represented through a spiked covariance model. We propose a novel shrinkage rule for prediction in a high-dimensional non-exchangeable hierarchical Gaussian model with an unknown spiked covariance structure. We propose a family of commutative priors for the mean parameter, governed by a power hyper-parameter, which encompasses settings ranging from perfect independence to highly dependent scenarios. Corresponding to popular loss functions such as quadratic, generalized absolute, and linex losses, these prior models induce a wide class of shrinkage predictors that involve quadratic forms of smooth functions of the unknown covariance. By using uniformly consistent estimators of these quadratic forms, we propose an efficient procedure for evaluating these predictors which outperforms factor model based direct plug-in approaches. We further improve our predictors by introspecting possible reduction in their variability through a novel coordinate-wise shrinkage policy that only uses covariance level information and can be adaptively tuned using the sample eigen structure. We establish asymptotic optimality of our proposed procedure and present simulation experiments as well as real data examples illustrating the efficacy of the proposed method.

We conclude by considering a specific business problem in Marketing that involves predicting player activity and engagement in app-based freemium mobile games. Longitudinal data from such games usually exhibit a large set of potential predictors and choosing the relevant set of predictors is highly desirable for various purposes including improved predictability. We propose a scalable joint modeling framework that conducts simultaneous, coordinated selection of fixed and random effects in high-dimensional generalized linear mixed models. Our modeling framework simultaneously analyzes player activity, engagement and drop-outs (churns) in app-based mobile freemium games and addresses the complex inter-dependencies between a player's decision to use a freemium product, the extent of her direct and indirect engagement with the product and her decision to permanently drop its usage. The proposed framework extends the existing class of joint models for longitudinal and survival data in several ways. It not only accommodates extremely zero-inflated responses in a joint model setting but also incorporates domain-specific, convex structural constraints on the model parameters. Moreover, for analyzing such large-scale datasets, variable selection and estimation are conducted via a distributed computing based split-and-conquer approach that massively increases scalability and provides better predictive performance than competing predictive methods.