Statistical citation network analysis and asymmetric error controls

by

Lijia Wang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(APPLIED MATHEMATICS)

May 2023

Copyright 2023 Lijia Wang

Dedication

To my parents, for your constant love and support.

Acknowledgements

I wish to express my heartfelt gratitude to everyone who has supported and encouraged me throughout my academic journey. First and foremost, I am deeply indebted to my two advisors, Professors Jay Bartroff and Xin Tong, for their unwavering support, guidance, and inspiration. Their invaluable insights and expertise have been instrumental in shaping my research direction and developing my analytical skills.

I would also like to sincerely thank my committee chair, Professor Larry Goldstein, and committee member, Professor Chunming Wang, for their thoughtful feedback and constructive criticism that have significantly improved the quality of my work. I am grateful to my coauthors (not limited to this dissertation), Xu Han, Gary A. Lorden, Y. X. Rachel Wang, Jingyi Jessica Li, and Xiao Han, for their contributions to our collaborative projects and for sharing their knowledge and expertise.

In addition, I want to express my appreciation to my family and friends for their unwavering support, love, and encouragement. I also want to thank my boyfriend, Shunan Yao, for his love, support, and understanding throughout this journey. I must acknowledge my beloved pets, Maomao Wang and Mimi Shen, for their constant companionship and for taking over my position to accompany my parents when I was away for my education. Without the support of these individuals and pets, completing this thesis would not have been possible. I am truly grateful for their presence in my life.
Finally, I want to express my deepest gratitude to my parents, Chun Chen and Qiang Wang, for their unconditional support and for giving me everything I could hope for. Their unwavering love and sacrifices have been my foundation, and I am forever grateful for their presence in my life.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
    1.1 Statistical citation network analysis
    1.2 Optimal hypergeometric confidence intervals
    1.3 False discovery control in skilled mutual fund selection
    1.4 A Hierarchical Neyman-Pearson approach for Classifying COVID-19 Severity
Chapter 2: Statistics in everyone's backyard: an impact study via citation network analysis
    2.1 Introduction
    2.2 Data collection and overview of citation trends
        2.2.1 Data collection
        2.2.2 Citation distributions and trends
    2.3 Comparison between internal and external citations
        2.3.1 Diversity of citing fields over time
        2.3.2 Internal and external impact of most highly cited papers
    2.4 Connecting statistical research communities to external topics by local clustering – methods and theoretical guarantees
        2.4.1 Preliminaries and the DC-SBM
        2.4.2 Adjusted Personalized PageRank under the DC-SBM
        2.4.3 Conductance
        2.4.4 Properties of the population version of conductance
        2.4.5 Proof of Theorem 1
        2.4.6 Simulation studies
    2.5 Case studies from the citation network
    2.6 Discussion
    2.7 Acknowledgements and author contributions
Chapter 3: Optimal hypergeometric confidence intervals
    3.1 Introduction and summary
    3.2 Additional notation
    3.3 Properties of the hypergeometric distribution and auxiliary lemmas
    3.4 α max optimal acceptance sets and modifying intervals for monotonicity
    3.5 α optimal, admissible, nondecreasing acceptance intervals
        3.5.1 Algorithm 2 is α max optimal
        3.5.2 Modification at M = N/2
        3.5.3 Putting it all together: Algorithm 3
    3.6 Optimal admissible confidence intervals
        3.6.1 Confidence and acceptance sets
        3.6.2 Inverted confidence sets
        3.6.3 Size optimality
    3.7 Examples and comparisons
    3.8 Discussion
    3.9 Acknowledgements and author contributions
Chapter 4: Skilled mutual fund selection: false discovery control under dependence
    4.1 Introduction
    4.2 Optimal multiple testing procedure
    4.3 Mutual Fund Data and Challenges
        4.3.1 Equity Fund Data Description
        4.3.2 Fund Performance Measurement
        4.3.3 Challenges: dependence and nonnormality
    4.4 Degree of non-skillness (d-value)
        4.4.1 Fund selection by multiple testing strategy
        4.4.2 Mixture model for nonnormal and dependent test statistics
        4.4.3 Fitting parameters in mixture model
        4.4.4 Solving Equation Set in (4.9)
        4.4.5 Calculating degree of non-skillness
        4.4.6 Comparison with other existing methods
    4.5 Simulation studies
    4.6 Skilled funds selection with FDR control
        4.6.1 Dynamic trading strategy
        4.6.2 Comparison of d-value and p-value
    4.7 Discussion
    4.8 Acknowledgements and author contributions
Chapter 5: Hierarchical Neyman-Pearson classification for prioritizing severe disease categories in COVID-19 patient data
    5.1 Introduction
        5.1.1 Neyman-Pearson paradigm and multi-class classification
        5.1.2 ScRNA-seq data featurization
    5.2 Hierarchical Neyman-Pearson (H-NP) classification
        5.2.1 Under-diagnosis errors in H-NP classification
        5.2.2 H-NP algorithm with high probability control
        5.2.3 H-NP umbrella algorithm for three classes
        5.2.4 General H-NP umbrella algorithm for I classes
        5.2.5 The change of v(k, n'_2, α'_2) in Figure 5.2b
        5.2.6 Simulation studies for the H-NP umbrella algorithm
    5.3 Application to COVID-19 severity classification
        5.3.1 Integrating COVID-19 data
        5.3.2 ScRNA-seq data and featurization
        5.3.3 Results of H-NP classification
    5.4 Discussion
    5.5 Acknowledgements and author contributions
Bibliography

List of Tables

2.1 Papers with significantly different internal and external ranks.
2.2 Top 5 most frequent WoS categories in all the citations towards Azzalini and Valle [9].
2.3 Top 5 most frequent WoS categories in all the citations towards Duval and Tweedie [35].
2.4 Papers whose internal and external citations both rank in the top 20.
2.5 The means and standard deviations of precision and recall rates for local clustering under α = 0.15 and different numbers of seeds. Each setting is simulated 50 times.
2.6 The means and standard deviations of precision and recall rates for local clustering under different α values and numbers of seeds. Each setting is simulated 50 times.
2.7 Means and standard deviations of precision and recall for local clustering, spectral clustering and SCORE+. Each setting is repeated in 50 simulations.
2.8 Summary statistics for the subnetworks in Figure 2.15 compared with the global graph A_s.
3.1 Confidence intervals given by C*(x) = [L(x), U(x)] for α = 0.05, N = 500, n = 100, and x = 0, 1, ..., 100.
3.2 Confidence intervals given by W. Wang's [132] method C for N = 500, n = 100, and α = 0.05.
3.3 For the Beijing air quality data [76], the number n of days with complete measurements, the number x of days with complete measurements classified as hazardous, the point estimate Nx/n (to 1 decimal place) of the number M of annual hazardous days, and the 90% confidence interval C*(x) for M.
4.1 Classification of tested hypotheses.
4.2 Fitted parameters, the proportion of data for L1 regression, and the total variation (TV) for 10-year periods.
4.3 Comparison of our approach (4.4) with the approximate empirical Bayes method for fitting the parameters and other FDR control methods for each combination of sparsity (s.1, s.2) and dependence structures (d.1, d.2, d.3) where θ = 0.1. The average false discovery proportion (FDP), false non-discovery proportion (FNP), and number of selections are calculated over 100 simulations for each setting.
4.4 The values of portfolios by the end of 2019 and the annual returns for each approach.
5.1 Number of patients under each severity level in each dataset. The datasets marked with * were utilized by both studies [96, 128] in their respective analyses.
5.2 Format of age information in each dataset. An example record for a 64-year-old patient is provided for each dataset.
5.3 The average proportion of zero values across patients for each cell type.
5.4 The averages of approximate errors. "error1", "error23", "error21", "error31", "error32", "overall" correspond to R_1⋆(ϕ̂), R_2⋆(ϕ̂), P_2(Ŷ = 1), P_3(Ŷ = 1), P_3(Ŷ = 2) and P(Ŷ ≠ Y), respectively.
5.5 Ranking cell types by their coefficients that quantify the effect of the predictors (cell type expression) on the log odds ratios of the severe category relative to the healthy category in logistic regression with the featurization M.2.

List of Figures

2.1 The distribution of citation counts for the source papers: (a) the proportions of papers falling into different citation brackets; (b) the histogram of papers with citations ≤ 500.
2.2 The general information of statistical journals: (a) the total number of citations per year for each journal; (b) the total number of publications per year for each journal.
2.3 Citation trends and distributions for each journal: (a) the normalized number of citations over the years, where "All Stats" refers to all the source papers; (b) the Lorenz curve for each journal.
2.4 The normalized citation counts for each journal over the years. Four papers with citations greater than 5000 are removed from JRSSB.
2.5 Diversity of citing fields in statistics journals: (a) the annual counts of internal and external citations for all the source papers; (b) the yearly Gini concentration for each journal.
2.6 Breakdown of citations into research categories for each statistical journal: (a) BIOMETRIKA, (b) JASA, (c) AOAS, (d) AOS, and (e) JRSSB.
2.7 The comparison of external and internal ranks for highly cited papers.
2.8 Breakdown of citations for the papers in Table 2.4 aggregated by the five statistical topics: (a) MCMC, (b) Causal, (c) Penalized regression, (d) FDR, and (e) Bayesian model selection.
2.9 The violin plots of (a) precision and (b) recall for local clustering with α = 0.15.
2.10 Examples of conductance plots for different seed numbers when α = 0.15, under the setting of Table 2.5. The dotted line indicates the local minimum selected.
2.11 Examples of conductance plots under Settings 1 and 2. More examples under Setting 3 can be found in Figure 2.10.
2.12 The violin plots of precision and recall for Settings 1-3.
2.13 The categories of citing papers related to: (a) single-cell transcriptomics; (b) flu.
2.14 The conductance plots for each topic: (a) single-cell transcriptomics, (b) labor economics, and (c) flu. The x-axis is truncated at 500 for visual clarity.
2.15 Networks and word clouds generated from the source papers found by local clustering for each topic.
3.1 Confidence intervals C_Piv(x) and C*(x) for α = 0.05, N = 500, n = 100, and x = 0, 1, ..., 100.
3.2 The differences in total size |C_Piv| − |C*|, for N = 500, α = 0.05, and n = 10, 20, ..., 490.
3.3 The computational time of the confidence intervals C_W and C* for N = 500, α = 0.05, and n = 10, 20, ..., 490.
3.4 The computational time of the confidence intervals C_Piv and C* for N = 500, α = 0.05, and n = 10, 20, ..., 490.
3.5 Coverage probability of C* and C_Piv for N = 500, n = 100, and α = 0.05.
3.6 Coverage probability of C_W for N = 500, n = 100, and α = 0.05.
3.7 The computational time of the confidence intervals C_W and C* for N = 200, 400, ..., 1000, and n = N/2.
3.8 The 90% confidence interval C*(x) for the number M of annual hazardous days at different locations in Beijing.
4.1 Annual returns for the mutual funds existing in the 10-year windows from 2000-2009 to 2009-2018. Each cluster center at a certain year represents the annual returns in that year for equity funds that survived during the previous ten years. Moreover, the y-coordinate of each point in a certain cluster is the annual return in that year, so that the number of funds with the same level of annual returns is observable.
4.2 Top 50 eigenvalues of Σ for the mutual fund data between 2009 and 2018.
4.3 Histograms and density plots of standardized OLS estimates for fund α's during (a) 2002-2011 and (b) 2009-2018.
4.4 Histograms and density plots of Z and the simulated data based on the parameter-fitting results from our approximate empirical Bayes method (AEB) during (a) 2002-2011 and (b) 2009-2018.
4.5 The values of portfolios selected by our procedure and other FDR control procedures (BH, Storey, FHG, FC, FAT, HL20 and GLX) from the end of 2009 to the end of 2019. "Skilled funds" refers to our portfolio. "S&P 500" is the value of the S&P 500 index. For "HL18", we construct the portfolio containing funds with HL18 estimation of alphas greater than its 95% quantile.
4.6 The comparison of the p-group and d-group selected based on the data from 2000-2009: (a) the box plots of p-values and d-values, and (b) the estimated annual α's for each 10-year window after that period. ("others" refers to funds with neither top-50 smallest d-values nor p-values.)
4.7 The boxplots for cumulative returns of funds in p-groups and d-groups for the following years. D-values and p-values are computed based on the data during (a) 2000-2009 and (b) 2003-2012.
5.1 Partition of S_it into three kinds of outcomes.
5.2 The influence of t_1 on the error P_3(Ŷ = 2).
5.3 The distribution of approximate errors on the test set when t_1 is the k-th largest element in T_1 ∩ (−∞, t_1). The 95% quantiles of R_1⋆ and R_2⋆ are marked by blue diamonds. The target control levels for R_1⋆(ϕ̂) and R_2⋆(ϕ̂) (α_1 = α_2 = 0.05) are plotted as red dashed lines.
5.4 The distribution of approximate errors when t_1 is the k-th largest element in T_1 ∩ (−∞, t_1). The 95% quantiles of R_1⋆ and R_2⋆ are marked by blue diamonds. The target control levels for R_1⋆(ϕ̂) and R_2⋆(ϕ̂) (α_1 = α_2 = 0.05) are plotted as red dashed lines. The data are generated under setting T1 except that a different sample splitting ratio is used: samples in S_1 are randomly split into 30% for S_1s (score) and 70% for S_1t (threshold); S_2 are split into 25% for S_2s, 70% for S_2t, and 5% for S_2e (evaluation); S_3 are split into 95% for S_3s and 5% for S_3e.
5.5 The distribution of approximate errors when t_1 is the k-th largest element in T_1 ∩ (−∞, t_1). The 95% quantiles of R_1⋆ and R_2⋆ are marked by blue diamonds. The target control levels for R_1⋆(ϕ̂) and R_2⋆(ϕ̂) (α_1 = α_2 = 0.05) are plotted as red dashed lines. The data are generated under setting T1 except that a different sample splitting ratio is used: samples in S_1 are randomly split into 70% for S_1s and 30% for S_1t; S_2 are split into 65% for S_2s, 30% for S_2t, and 5% for S_2e; S_3 are split into 95% for S_3s and 5% for S_3e.
5.6 The distribution and averages of approximate errors on the test set under setting T1. "error23" and "error32" correspond to R_2⋆(ϕ̂) and P_3(Ŷ = 2), respectively.
5.7 The distribution of approximate errors when t_1 is the k-th largest element in T_1 ∩ (−∞, t_1). The 95% quantiles (δ_1 = δ_2 = 0.05) of R_1⋆ and R_2⋆ are marked by blue diamonds. The target control levels for R_1⋆(ϕ̂) and R_2⋆(ϕ̂) (α_1 = α_2 = 0.1) are plotted as red dashed lines. Also, the averages of the R_c are marked by red points in (c).
5.8 The distributions of approximate errors on the test set under setting T1. "error1", "error23" and "error32" correspond to R_1⋆(ϕ̂), R_2⋆(ϕ̂) and P_3(Ŷ = 2), respectively.
5.9 The distributions of approximate errors on the test set under setting T2. "error1" and "error23" correspond to the errors R_1⋆(ϕ̂) and R_2⋆(ϕ̂), respectively.
5.10 The two left UMAP plots are colored based on cell types predicted by scClassify (using [114] as reference). The two right UMAP plots are colored by the batches.
5.11 The proportions of zeros for different severity classes.
5.12 The distribution of the proportion of zero values across patients for each cell type.
5.13 The distribution of approximate errors for each combination of featurization method and base classification method. "error1", "error23", "error21", "error31", "error32", "overall" correspond to R_1⋆(ϕ̂), R_2⋆(ϕ̂), P_2(Ŷ = 1), P_3(Ŷ = 1), P_3(Ŷ = 2) and P(Ŷ ≠ Y), respectively.
Abstract

This dissertation is divided into four parts. The first part investigates a significant question for statistics: how much impact has statistics made on other scientific fields in the era of Big Data? The remaining three parts focus on addressing asymmetric error control problems in statistical learning. Below are summaries of the main results in each part of the dissertation.

In the first part, we examine the influence of statistics on other scientific disciplines by conducting a statistical citation network analysis. Our investigation of citation trends and citing-field compositions over time reveals growing citation diversity. Additionally, we employ a local clustering technique that utilizes personalized PageRank with conductance for size selection to identify the most relevant statistical area for a given external topic in other fields. Through various case studies, we demonstrate that our citation data analysis results align well with our understanding of these external topics. In summary, our findings indicate that the statistical theory and methods developed by the statistics community have increasingly impacted other scientific fields.

The second part proposes a novel and efficient approach for computing exact confidence intervals for the hypergeometric parameter. Specifically, we use minimum-width acceptance intervals and modify them to ensure nondecreasing endpoints while preserving their level. The resulting confidence intervals achieve the minimum possible size and outperform existing methods in both interval size and computation time.

The third part utilizes the multiple testing framework to select skilled mutual funds, using the intercept α of the Carhart four-factor model as a measure of true performance.
We discover that the standardized OLS estimates of the α's exhibit strong dependence and non-normality, making conventional multiple testing methods unsuitable for identifying skilled funds. To address this issue, we propose an optimal multiple testing procedure based on a decision-theoretic perspective, which minimizes a weighted sum of the false discovery rate and the false non-discovery rate. Our proposed testing procedure takes into account the posterior probability of each fund not being skilled, given the information across all the funds in our study. Empirical studies demonstrate that the selected skilled funds perform better than the non-selected funds.

The fourth part investigates COVID-19 severity classification problems. In classifying patients into severity categories, the more important classification errors are "under-diagnosis" errors, in which patients are misclassified into less severe categories. We propose a hierarchical NP (H-NP) framework and an umbrella algorithm that generally adapts to popular classification methods and controls the under-diagnosis errors with high probability. On an integrated collection of single-cell RNA-seq (scRNA-seq) datasets, we explore ways of featurization and demonstrate the efficacy of the H-NP algorithm in controlling the under-diagnosis errors regardless of featurization. Beyond COVID-19 severity classification, the H-NP algorithm applies generally to multi-class classification problems in which the classes have a priority order.

Chapter 1: Introduction

In recent years, the demand for data analysis tools has grown rapidly in various fields, including biology, finance, environmental science, and the arts. To meet this demand, new statistical algorithms and models have been developed. This dissertation introduces four statistical methodologies proposed to address new data analysis challenges in these fields, along with their theoretical guarantees.
The first part of the dissertation addresses a fundamental question for statistics: how does statistics serve as a basic tool in other areas? We propose a local community detection procedure for citation network analysis to address this question. The remaining parts of the dissertation focus on asymmetric error control problems and propose confidence intervals, optimal testing, and Neyman-Pearson classification procedures. The methodologies developed have a wide range of applications in air quality data analysis, COVID-19 severity diagnosis, mutual fund selection, and citation network analysis. In the following sections, detailed introductions are provided for the topics of interest in each chapter.

1.1 Statistical citation network analysis

Statistics is outward-facing and plays a keystone role in other scientific investigations. As technology and methodology continue to advance, the demand for statistical tools to support large and complex data analysis across various fields has grown. However, developing statistical theories and methods requires rigorous mathematical arguments and abstract formulations for generalizability, which makes their direct application in other fields challenging. This has raised the fundamental question of how statistics impacts other scientific investigations.

Citation data, available in databases such as Web of Science, have been used to measure the impact of academic work or fields and to track the movement of ideas between different scientific fields using network analysis techniques. While the Bradley-Terry model has been used in the statistics literature to measure the import and export of knowledge between statistical journals [115], and citation and coauthorship networks have been analyzed for papers in statistical journals [56], there has been limited research on the influence of statistical research on other scientific disciplines.
Our study aims to bridge this gap by exploring the impact of statistical research on other disciplines and connecting communities in statistics with the research topics in other fields that they impact.

We present a comprehensive analysis of the connections between statistics and other fields by investigating citation trends and the compositions of citing fields over time, showing an increasing diversity of fields. To identify the most influential statistical communities, we propose a local clustering procedure consisting of two steps. First, we use the adjusted personalized PageRank (aPPR) algorithm [4] to rank nodes based on their relevance to a target community, using a small subset of seed nodes. We extend existing work [27] to show that aPPR sorts individuals relevant to the target community with high probability under the degree-corrected stochastic block model (DC-SBM). Second, we use conductance [4] to evaluate the quality of the community found along the sorted list of nodes, cutting it at an appropriate size to contain the entire community. We show that this procedure finds all community members to which the seed nodes belong, with high probability, under the DC-SBM. By applying our procedure to citation data, we can identify the most relevant statistical community for an external research topic, such as labor economics, single-cell analysis, or COVID-19. This is the first study to comprehensively analyze the connections between statistics and other fields and to identify the most influential statistical communities in various research topics.

1.2 Optimal hypergeometric confidence intervals

The hypergeometric distribution is commonly used to model situations where X counts the number of items in a simple random sample of size n from a population of size N that possess a binary "special" property, with M items in the population having the special property. This can be denoted by X ∼ Hyper(M, n, N).
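As a small numerical illustration of the Hyper(M, n, N) model (a sketch with arbitrary example values, using only the Python standard library; this is not the air quality data analyzed in Chapter 3):

```python
from math import comb

# Population of N = 500 items, M = 150 of which are "special";
# a simple random sample of size n = 100 is drawn without replacement.
N, M, n = 500, 150, 100

def hyper_pmf(x, M, n, N):
    """P(X = x) when X ~ Hyper(M, n, N): x special items in the sample."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# The pmf sums to 1 over the support max(0, n - (N - M)), ..., min(n, M).
total = sum(hyper_pmf(x, M, n, N) for x in range(0, n + 1))
print(total)  # -> 1.0 up to floating-point error

# A natural point estimate of M from an observed count x is N * x / n.
x_obs = 28
print(N * x_obs / n)  # -> 140.0
```

Constructing exact confidence intervals around such a point estimate is the subject of Chapter 3.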
Two types of exact confidence sets can be constructed: one for M with a known population size N, and the other for N with a known M. These confidence sets are designed to have a coverage probability of at least 1 − α, where α ∈ (0, 1) is user-specified. In practice, the coverage probability can be obtained by inverting level-α acceptance intervals from hypothesis tests. However, due to the discrete nature of the hypergeometric distribution, constructing exact confidence intervals (sets of consecutive integers) can be challenging.

We present a novel and efficient method for constructing exact hypergeometric confidence intervals for M with optimal total size and computational time. Our approach involves inverting tests of the hypotheses H: M = M_0 for M_0 = 0, 1, ..., N, based on a given sample size n and number of successes X. To ensure that the confidence intervals have coverage probability 1 − α, we utilize acceptance intervals [a_M, b_M] that are α max optimal, maximizing the probability of acceptance among all shortest possible level-α intervals. Our method also includes a shifting process that arranges the endpoints of the intervals in nondecreasing order, resulting in proper intervals. We provide a proof of size optimality and demonstrate the efficiency of our algorithm through numerical experiments. Finally, we apply our method to a real-world air quality data set from multiple sites in Beijing, China.

1.3 False discovery control in skilled mutual fund selection

Identifying mutual funds with a genuine ability to generate profits is a formidable challenge and a highly coveted objective in finance. The Carhart four-factor model is commonly used to evaluate fund performance by regressing a fund's excess return on market factors, with the intercept α indicating the fund's profit-making ability in the cross-sectional linear regression model. Funds are categorized as skilled, unskilled, or zero-alpha if α is greater than zero, less than zero, or equal to zero, respectively.
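To make the factor-model setup concrete, here is a minimal sketch of estimating a fund's α by OLS: the intercept of a regression of excess returns on four factors is the estimated α. The data below are synthetic (the factor values, true α, and true loadings are illustrative assumptions, not from the mutual fund data of Chapter 4):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 120  # months of return data

# Synthetic Carhart-style factors: market, size, value, momentum.
F = rng.normal(0.0, 0.04, size=(T, 4))
true_alpha = 0.002
true_beta = np.array([0.9, 0.2, -0.1, 0.05])

# Fund excess returns generated from the factor model plus idiosyncratic noise.
r = true_alpha + F @ true_beta + rng.normal(0.0, 0.01, size=T)

# OLS with an intercept column: the intercept estimate is the fund's alpha-hat.
X = np.column_stack([np.ones(T), F])
coef, *_ = np.linalg.lstsq(X, r, rcond=None)
alpha_hat = coef[0]
print(alpha_hat)  # close to the true alpha of 0.002 for this seed
```

With thousands of funds, one such α̂ is computed per fund and then standardized, which leads to the multiple testing problem described next.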
However, since the true value of α is unobservable, the ordinary least squares (OLS) estimator of α is used to compare funds. Unfortunately, this approach can result in some funds with a true zero or negative α having positive estimated α values due to luck rather than skill, leading to potentially substantial losses. As a result, there is a need for strategies that select genuinely skilled mutual funds while controlling or avoiding the false selection of "lucky" funds. The process of selecting skilled mutual funds involves simultaneously testing H_0: α ≤ 0 for thousands of funds based on test statistics that are the standardized OLS estimates of the α's. Previous work, such as [10], adopted the false discovery rate (FDR) concept and applied Storey's testing procedure to select skilled funds while controlling the FDR at a desired level. However, this approach may not be suitable for mutual funds due to the strong dependence among test statistics, which can lead to over-conservative FDR control. Additionally, the assumption of normality of the test statistics may not hold for mutual fund data, as has been documented in the finance literature. Therefore, alternative methods are needed that can handle the dependence and nonnormality in mutual fund data to select truly skilled mutual funds while avoiding the false selection of "lucky" funds. To address these challenges, we introduce the posterior probability of each fund not being skilled given all the test statistics. We provide the degree of non-skillness (d-value) as an estimate of this posterior probability. Smaller d-values suggest a greater chance of being skilled, encouraging selection. The dependence and nonnormality structure of the test statistics is preserved in the posterior information, so that d-values provide a good ranking of fund performance.
In contrast to p-value-based selection strategies, our multiple testing procedure based on the d-value provides effective control over the FDR with an optimal false nondiscovery rate. To calculate the d-values, we model the empirical distribution of the test statistics using a three-part mixture model to capture the dependence and nonnormality phenomena in the test statistics. We propose a method called "approximate empirical Bayes" to fit the parameters, borrowing the strength of empirical Bayes while relieving the computational burden conferred by a large number of parameters. Our empirical studies show that the selected skilled funds have superior performance.

1.4 A Hierarchical Neyman-Pearson Approach for Classifying COVID-19 Severity

COVID-19 is a disease that can manifest in patients with varying degrees of severity, ranging from no symptoms to critical illness requiring intensive care. The World Health Organization has classified COVID-19 severity into healthy, mild/moderate, and severe categories. Misclassifying patients into a less severe category, an under-diagnosis error, can lead to a loss of critical information when studying the immune responses of severe patients. Therefore, prioritizing the control of this error is crucial in severity classification. However, the classical classification paradigm only minimizes the overall error without considering the priorities of each severity class. To address this issue, the Neyman-Pearson (NP) classification paradigm has been developed to prioritize designated types of errors. While most NP procedures focus on binary classification, few studies [124] have explored controlling asymmetric errors in multi-class classification, and none have provided high-probability controls.
To fill this gap, we propose a hierarchical NP (H-NP) framework for controlling under-diagnosis errors with high-probability guarantees using an H-NP umbrella algorithm (inspired by [127]) that adapts popular scoring-type classification methods, such as logistic regression, random forest, and SVM, to the H-NP paradigm. Furthermore, the increasing availability of single-cell RNA-seq (scRNA-seq) data has provided an opportunity for researchers to study the transcriptional response to COVID-19 severity at the cellular and gene levels. However, the unique data structure of scRNA-seq, with one matrix per patient, requires appropriate featurization methods for conducting severity classification. To address this issue, we integrated nine publicly available scRNA-seq datasets, creating a sample of 740 patients with three severity levels, and explored appropriate methods for feature extraction. We propose four approaches for constructing feature vectors from the matrices and evaluate their performance in combination with different classification methods under both the classical classification and our H-NP classification paradigms. Our numerical results demonstrate the robustness of our H-NP umbrella algorithm, indicating its applicability to various base classification algorithms and featurization methods.

Chapter 2
Statistics in everyone's backyard: an impact study via citation network analysis

2.1 Introduction

As a discipline that focuses on the collection, analysis, and interpretation of data, statistics is outward facing and often serves as a tool in other scientific investigations. The age of Big Data has brought about new challenges and opportunities in many fields, where the postulation, verification, and refinement of scientific models rely on empirical data. In this sense, one would expect statistics to play an increasingly important role in these fields as the need for methods and tools for handling large, complex data increases.
On the other hand, much of the fundamental research in statistical theory and methods requires rigorous mathematical arguments and abstract formulations for generalizability. It can be argued that the technical nature of such work serves as a barrier, making direct adoption of research developments difficult in other fields. In this chapter, we consider measuring the impact of theoretical and methodological research in statistics on other scientific disciplines in recent decades. As John Tukey deftly put it: "the best thing about being a statistician is that you get to play in everyone's backyard." One direct way to measure the impact of academic works is through citation data. In the digital age, comprehensive bibliometric studies have been made possible by the existence of citation databases such as Web of Science and Scopus. From these databases, citations between papers can be extracted, represented as a network, and studied using network analysis techniques. Such citation networks have been used to track the movement of ideas and measure the distance between different scientific fields [108, 130]. Coauthorship networks can also be constructed from publication records for studying the structure of collaboration patterns [91, 86]. More specifically in statistics, the Bradley-Terry model was used to measure the import and export of knowledge between statistical journals [115, 131]. Citation and coauthorship networks have been collected and analyzed for papers in top statistical journals [56]. Rather than focusing on the structure of citation patterns inside statistics, we provide the first comprehensive study analyzing the connections between statistics and other fields. We collect citation information for papers published in selected statistical journals from the Web of Science (WoS) Core Collection.
These published papers are termed source papers for being the source of knowledge export; our complete data contains citations between source papers as well as their citations by papers (termed citing papers) in other journals and fields. Using descriptive statistics, we characterize the trends in citation volumes and the composition of citing fields for the source papers over time, paying attention to fields external to statistics. We compare the internal and external citations for highly cited source papers and identify the corresponding statistical research areas highly ranked by both criteria. Citation trend analysis of these areas allows us to associate them with the external fields on which they have made an intellectual impact. Given a network, one of the most commonly used analysis techniques is community detection, also known as node clustering. On the citation network of source papers, global clustering techniques can be used to partition the nodes into densely connected communities, as has been done in [56], offering a global view of various research areas within statistics. However, we are more interested in connecting these communities in statistics with the research topics in other disciplines they have cast an influence on. That is, given an external research topic (e.g., COVID-19), we consider finding the most relevant community in statistics, with relevance measured by the citation data. A local clustering perspective is particularly suitable in this case since i) we expect the relevant community to be small compared with the whole network, making it challenging to detect by global clustering methods; and ii) the citations between the source papers and the citing papers give a natural way of finding "seed nodes" for local clustering algorithms.
A large class of local clustering algorithms is based on seed expansion: given a small subset of seed nodes from a community of interest, the rest of the community is detected by ranking the other nodes according to the landing probabilities of random walks started from the seeds. Different classes of algorithms correspond to different ways of combining these probabilities for random walks of different lengths, the most popular being versions of personalized PageRank (PPR) [5, 135, 64] and heat kernels [28, 63]. These algorithms have been widely applied to large-scale real networks with much empirical success. More recently, attention has been paid to studying their theoretical properties on networks generated from the stochastic block model (SBM, [52]) and its variants. Kloumann et al. [65] showed that PPR corresponds to the optimal linear classifier under a suitable two-block SBM. Using the more general degree-corrected SBM (DC-SBM, [60]), Chen et al. [27] showed that PPR can include high-degree nodes outside the community of interest, while the adjusted PPR (aPPR) algorithm of [4] corrects this degree bias, achieving consistency in the detection of the target community with high probability. After the nodes have been ranked in terms of their relevance to the target community, it remains to choose the size of the local cluster and cut the sorted list of nodes at the desired size. A scoring function is thus needed to evaluate the quality of the communities found along the sorted list. One of the most widely used scoring functions is conductance [4, 144, 129], which measures the fraction of total edge volume that points outside the cluster. A smaller conductance indicates the cluster is more separated from the rest of the network, hence more likely to be a community on its own.
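The sweep over a ranked node list can be sketched as follows. This is a minimal illustration of conductance minimization over prefixes of a ranked list, not the dissertation's full procedure or its theoretical tuning; the helper names are hypothetical, and the O(n^2) sweep is kept naive for clarity.

```python
def conductance(adj, S):
    # adj: node -> set of neighbours (undirected); S: candidate cluster.
    # conductance = (edges leaving S) / (total edge volume of S), matching
    # "the fraction of total edge volume that points outside the cluster"
    S = set(S)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol = sum(len(adj[u]) for u in S)
    return cut / vol if vol else 1.0

def best_prefix_cut(adj, ranked):
    # sweep prefixes of the ranked list (e.g., by aPPR score) and return the
    # prefix with minimum conductance; proper subsets only, since the full
    # node set trivially has conductance 0
    best_k, best_phi = 1, float("inf")
    for k in range(1, len(ranked)):
        phi = conductance(adj, ranked[:k])
        if phi < best_phi:
            best_k, best_phi = k, phi
    return ranked[:best_k], best_phi
```

On a toy graph of two triangles joined by a single edge, with nodes ranked so that one triangle comes first, the sweep recovers that triangle as the minimum-conductance prefix.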
Assessing the performance of various scoring functions on a large number of real networks, Yang and Leskovec [144] showed that conductance consistently gives good performance in identifying ground-truth communities. The theoretical properties of conductance, however, have not been investigated in the local clustering setting with generative network models. For our local clustering procedure, we adopt aPPR followed by local minimization of conductance. Under the DC-SBM, we show that with high probability, this procedure finds all the nodes in the community to which the seed nodes belong. The rest of the paper is organized as follows. In Section 2.2, we describe the data collection procedure and the various covariates used in our analysis. We provide a summary of citation trends over time and citation distributions for each journal. In Section 2.3, we study the diversity of citing fields by grouping the citing papers according to the research areas they belong to. In particular, we determine which highly cited source papers have high citations both within and outside statistics, as well as those that appear to have a larger impact on one side of the audience. In Section 2.4, we describe our local clustering procedure for finding the statistical community most relevant to an external research topic. We provide a theoretical analysis of its behavior under the DC-SBM and demonstrate its performance on simulated data and a number of case studies from our citation network. We end the paper with a discussion of the merits and limitations of our study, pointing to directions in which it can be extended in the future.
2.2 Data collection and overview of citation trends

2.2.1 Data collection

We conducted our study on all the papers published from 1995 to 2018 in five influential statistics journals: Annals of Applied Statistics (AOAS), Annals of Statistics (AOS), Biometrika, Journal of the American Statistical Association (JASA), and Journal of the Royal Statistical Society: Series B (JRSSB).* Using a Python script, we crawled the bibliographic database Web of Science (WoS) Core Collection to collect citation data for a total of 9,338 papers published in these journals in the time span considered. We only included publications whose document type is listed as "article" in WoS. We call these publications source papers since they act as a source of knowledge for the papers citing them. Among our selected journals, AOS, Biometrika, JASA, and JRSSB are considered by many researchers in the statistics community as top outlets for theoretical and methodological work. We have also included AOAS as a representative journal with a broad applied focus. For each source paper, the WoS database provides a list of papers citing it and the corresponding publication information. We finished extracting these lists before December 2020. In addition to the citations between the source papers, 264,356 papers from other journals (or from the selected five statistics journals but published in 2019 and 2020) cited these source papers; these papers are called citing papers.† Rather than limiting to "article" as we did for the source papers, the citing papers can be of any document type. Based on the lists of citations, we build a citation network that consists of 273,694 nodes, including all the source and citing papers, and edges representing citations between the source papers and from the citing papers to the source papers. This citation network can be represented by a binary adjacency matrix A ∈ {0, 1}^{273694 × 9338}, in which

\[
A_{ij} =
\begin{cases}
1, & i \text{ cites } j; \\
0, & \text{otherwise.}
\end{cases}
\tag{2.1}
\]

* We include both publication names JRSSB used during 1995-1997.
In this matrix, we assign each source paper an index in I_s = {1, ..., 9338} and each citing paper an index in I_c = {9339, ..., 273694}. Our current study does not contain citations from the source papers to the citing papers, since we are primarily interested in the impact of the source papers on other scientific works. We obtained the publication information for both source and citing papers from the WoS database. In particular, the following variables are central to our analysis: (1) article title, (2) publication source title (e.g., journal or conference names), (3) publication year, (4) author keywords, (5) abstract, (6) WoS categories (e.g., "Statistics & Probability" and "Mathematical & Computational Biology"), and (7) research areas (e.g., "Mathematics"). In our dataset, only 98 of all the papers do not have any specified categories (nor research areas), so we label their categories (and research areas) as "NA". We use the broad research areas to classify the general field of each paper and the WoS categories to provide finer classifications when statistics needs to be distinguished from other research fields. More discussion of the division and field classification can be found in Section 2.3. Furthermore, in Section 2.3, we use the variables (2) publication source title, (3) publication year, and (7) research areas to illustrate the change in impact on external and internal areas over time for the selected five journals and their highly cited papers. In Section 2.4, we select papers on target topics based on (1) article title and (5) abstract, and validate the local community found using (4) author keywords. More details about the usage of these variables will be presented in the respective sections.

† The accessibility of citing papers depends on the university library VPN used to access the WoS database.
2.2.2 Citation distributions and trends

As shown in Figure 2.1a, the vast majority of source papers have fewer than 500 citations, with 1.92% of the source papers receiving zero citations. Figure 2.1b further plots the distribution of the citation counts for source papers with citations from 0 to 500. We observe that removing the zero-citation papers would lead to a power-law distribution of the citation counts. Notably, only 0.06% of papers (6 papers) received more than 5,000 citations. Highly cited papers like these will be discussed in more detail in Section 2.3.2. Looking at the trends over the years, the total number of citations for each journal grows consistently in Figure 2.2a, and the growth is not due to the journals expanding their volumes of publications. In fact, Figure 2.2b shows that there was no significant increase in the annual number of publications in any journal except AOAS. AOAS was established in 2007 and subsequently went through a fast growth period before stabilizing. To account for the effect of publication numbers, for each year T, we normalize the annual citation count for each journal by the total number of published papers from 1995 to T in that journal, since any citing paper published in year T is free to cite source papers in the period 1995-T.

Figure 2.1: The distribution of citation counts for the source papers: (a) the proportions of papers falling into different citation brackets; (b) the histogram of papers with citations ≤ 500.
Figure 2.2: The general information of statistical journals: (a) the total number of citations per year for each journal; (b) the total number of publications per year for each journal.

Figure 2.3a shows that the normalized citations still increase consistently over the years for all the journals, among which JRSSB enjoys substantially more citations per article after 2002. AOAS's normalized citations have been growing quickly as a relatively new journal. It is clear that citation counts are not distributed equally across all the papers, and one possible way to measure citation inequality is through Lorenz curves [130, 56]. For journal j, define

\[
L(p) = \frac{\sum_{i=1}^{\lfloor p \times N_j \rfloor} d_{(i)}}{\sum_{i=1}^{N_j} d_{(i)}},
\]

where N_j is the number of publications, p is the percentage, and d_{(1)}, d_{(2)}, ..., d_{(N_j)} are the citation numbers, in nondecreasing order, of the papers in journal j published in 1995-2018. L(p) calculates the percentage of citations shared by the bottom p percent of papers as a measure of inequality. Figure 2.3b plots L(p) as a Lorenz curve for each journal, with curves closer to the bottom right corner indicating a greater extent of inequality. Most journals have highly similar curves, while JRSSB appears to have the most significant inequality.

Figure 2.3: Citation trends and distributions for each journal: (a) the normalized number of citations over the years ("All Stats" refers to all the source papers); (b) the Lorenz curve for each journal.
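The Lorenz curve L(p) defined above is straightforward to compute. The sketch below assumes a plain list of per-paper citation counts for one journal; the function name is illustrative.

```python
def lorenz(citations, p):
    # L(p): share of total citations held by the bottom floor(p * Nj) papers,
    # with citation counts d_(1) <= ... <= d_(Nj) in nondecreasing order
    d = sorted(citations)
    k = int(p * len(d))          # floor(p * Nj) for p in [0, 1]
    return sum(d[:k]) / sum(d)
```

For a journal where one paper holds all the citations, L(p) stays at 0 until the last paper enters; for perfectly equal counts, L(p) tracks p, i.e., the diagonal line of perfect equality.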
This can be explained by the fact that there are four papers that each received more than 5,000 citations, accounting for 49.5% of the total citations towards JRSSB in this period. (Recall that we only have six papers in our dataset with citation numbers exceeding 5,000.) After removing these four papers, the normalized citation counts for JRSSB become much closer to those of the other journals but remain the highest of all journals in Figure 2.4.

Figure 2.4: The normalized citation counts for each journal over the years. Four papers with citations greater than 5,000 are removed from JRSSB.

2.3 Comparison between internal and external citations

As the overall citations for each journal increase over the years, how much of the increase can be attributed to research fields outside statistics? In this section, we break down the citations by their research fields, paying attention to the distinction between internal and external citations. As mentioned in Section 2.2.1, even though the WoS categories and research areas can help us identify the research field each paper belongs to, we still have to decide whether a citation should be considered inside (internal) or outside (external) of statistics. This is a subjective decision in some sense, given the interdisciplinary nature of many research topics in statistics and the overlap of statistics with fields such as mathematics, computational biology, and econometrics. We take the following approach, which perhaps can be viewed as conservative in estimating external impact. We consider two types of internal papers. The first type includes papers containing the tag "Statistics & Probability" in their WoS categories, which applies to all the papers published in common statistics and/or probability journals. These papers are labeled as "STATS" in our subsequent plots.
The second type includes papers whose WoS categories contain the keyword "math" (e.g., "Mathematics" and "Mathematical & Computational Biology"). The additional papers selected by this step are published in journals such as Journal of Econometrics and BMC Bioinformatics, and thus come from fields reasonably close to statistics. In what follows, these papers are labeled as "MATH" and counted as internal citations. The rest of the papers are considered external. This procedure divides our dataset into 83,503 internal and 190,191 external papers. We use the research areas‡ provided by WoS to classify the external papers into five broad categories: arts & humanities (ART), life sciences & biomedicine (BIO), physical sciences (PHY), social sciences (SOC), and technology (TECH). We note that the finer divisions under research areas could also be used to search for the second type of internal papers described above with "math" as a keyword, but doing so would lead to a subset of the papers already selected by the WoS categories.

‡ https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html

2.3.1 Diversity of citing fields over time

Using the category labels discussed above, Figure 2.5a shows the research area breakdown of all the citations over the years. If an external paper lists multiple research areas, each area is weighted equally and contributes a fractional count to the total. As expected, in the earlier years of our period of study, most of the citations are from within statistics. However, the proportion of external citations soon begins to increase at a fast pace and eventually exceeds half. Among the external citations, BIO and TECH carry heavy weights. The same trend for each journal separately is presented in Supplementary Figure 2.6. The proportion of external citations also increases over time for all the journals, with AOAS and JRSSB having larger proportions than the others.
Figure 2.5: Diversity of citing fields in statistics journals: (a) the annual counts of internal and external citations for all the source papers; (b) the yearly Gini concentration for each journal.

One way to summarize the distribution of proportions and put the diversity measure for each journal on the same scale is through the Gini concentration [115]. Let

\[
\text{Gini Concentration} = 100 \times \sum_i s_i^2,
\]

where s_i is the proportion of citations from research category i, and we consider the same categories as shown in Figure 2.5a except that we combine STATS and MATH into one internal category. Journals with more diverse citations across external categories have lower Gini concentrations. Figure 2.5b plots the change in the Gini concentration for each journal over the years. Overall the trends agree with our results in Figure 2.5a and Figure 2.6. All the journals have demonstrated increasing connections with external fields, with AOAS, JASA, and JRSSB being more diverse than the others.

Figure 2.6: Breakdown of citations into research categories for each statistical journal: (a) BIOMETRIKA, (b) JASA, (c) AOAS, (d) AOS, and (e) JRSSB.
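The Gini concentration above is a one-line computation; the sketch below assumes a list of citation counts by research category for a given journal and year (the variable names are illustrative).

```python
def gini_concentration(counts):
    # counts: citations per research category for one journal-year;
    # s_i = counts[i] / total are the category shares, and the statistic is
    # 100 * sum_i s_i^2 (a Herfindahl-type concentration index)
    total = sum(counts)
    return 100 * sum((c / total) ** 2 for c in counts)
```

All citations in one category gives the maximum of 100; spreading citations evenly over m categories gives 100/m, so lower values indicate a more diverse citing audience.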
2.3.2 Internal and external impact of most highly cited papers

In the previous section, we compared the proportions of internal and external citations at an aggregated level within each journal. Now we turn to examine the internal and external impact of specific source papers selected based on their high citation counts. Do highly cited papers always have high impact both internally and externally? To this end, we first rank the source papers according to their internal and external citation counts separately. Focusing on papers in the top 20 list by either internal or external counts, Figure 2.7 shows their respective ranks internally and externally. One can see that most of these papers are ranked high under both criteria except for a few outliers. We focus on the most obvious two (boxed in red) and provide their information in Table 2.1 and further analysis below.

Figure 2.7: The comparison of external and internal ranks for highly cited papers.

Table 2.1: Papers with significantly different internal and external ranks.

    Title                                                          Rank (internal)   Rank (external)   # of Citations
    The multivariate skew-normal distribution [9]                        20                177                749
    A nonparametric trim and fill method of accounting for
      publication bias in meta-analysis [35]                          1,520                 19              1,362

The first paper [9] in Table 2.1 ranks in the top 20 based on internal citation counts, but its external rank is relatively lower in comparison. Since the paper is about distribution theory, unsurprisingly we find most of its citations come from fields closely related to statistics. Table 2.2 provides the top 5 WoS categories and their numbers of occurrences among the citations, with "Statistics & Probability" appearing most often. Also, most of these categories contain the keyword "math", which explains the higher internal rank. The other categories (e.g., "Computer Science, Interdisciplinary Applications") are still closely related to statistics or mathematics.
The paper has reached a larger audience within statistics and mathematics, most likely due to its technical nature. The second paper [35] in Table 2.1 demonstrates the opposite pattern, with a high external rank but a low internal rank. It proposes a practical method of evaluating and adjusting for the possibility of publication bias (e.g., a preference for positive results), a well-known phenomenon in published academic research, especially in meta-analysis, and has thus attracted wide scientific interest. Table 2.3 lists the top 5 most frequent WoS categories among all its citations. One can see that the list is dominated by psychiatry and psychology, while statistics- or mathematics-related categories are not present. We have additionally searched for keywords related to publication bias in the titles and author keywords of the internal papers. The search returns only 59 papers, confirming the topic is less explored internally and could be a potential area for further theoretical and methodological development in statistics.

Table 2.2: Top 5 most frequent WoS categories in all the citations towards Azzalini and Valle [9].

    WoS category                                        Frequency
    Statistics & Probability                                  534
    Mathematics, Interdisciplinary Applications                69
    Computer Science, Interdisciplinary Applications           61
    Mathematical & Computational Biology                       48
    Economics                                                  40

Table 2.3: Top 5 most frequent WoS categories in all the citations towards Duval and Tweedie [35].

    WoS category                                        Frequency
    Psychiatry                                                152
    Psychology, Multidisciplinary                             132
    Psychology, Clinical                                      106
    Public, Environmental & Occupational Health                97
    Medicine, General & Internal                               88

As can be observed in Figure 2.7, most papers have both high internal and external ranks. Table 2.4 lists all the papers that are ranked in the top 20 both internally and externally.
We classify these papers roughly into five topics: Markov chain Monte Carlo (MCMC), causal inference (causal), penalized regression, false discovery rate (FDR), and Bayesian model selection.

Table 2.4: Papers whose internal and external citations both rank in the top 20.

    Title                                                                  Area (statistics)          Rank (internal)  Rank (external)  # of Citations
    Reversible jump Markov chain Monte Carlo computation
      and Bayesian model determination [47]                                MCMC                              8               16            2,868
    Identification of causal effects using instrumental variables [6]      Causal                           18               17            2,125
    Least angle regression [37]                                            Penalized regression              7                8            4,252
    The control of the false discovery rate in multiple testing
      under dependency [13]                                                FDR                              12                6            5,062
    Model selection and estimation in regression with
      grouped variables [147]                                              Penalized regression              9               14            2,935
    Regularization and variable selection via the elastic net [153]        Penalized regression              5                7            5,790
    A direct approach to false discovery rates [116]                       FDR                              14                9            3,186
    Bayesian measures of model complexity and fit [112]                    Bayesian model selection          4                4            6,743
    Controlling the false discovery rate: a practical and powerful
      approach to multiple testing [11]                                    FDR                               2                1           46,899
    Regression shrinkage and selection via the Lasso [125]                 Penalized regression              1                2           16,905

To investigate the influence of these papers on other fields, we consider the citations aggregated by the five topics and break down the citations by category labels, similar to Figure 2.5a. In this case, we have added two category labels: "BE" for the research area "Business & Economics" and "CS" for the research area "Computer Science", since we notice a considerable number of citations from these two areas, especially for causal inference and penalized regression. To avoid double counting, papers with the BE (or CS) label will not be counted in SOC (or TECH), which is the broad category BE (or CS) belongs to in WoS. Similar to before, multiple labels for one paper are weighted equally.
Figure 2.8 shows that the influence on other fields differs by statistical research topic. FDR and Bayesian model selection have always attracted a substantial proportion of citations from BIO, even from the earlier years. MCMC and penalized regression have more citations from CS than the others. On the other hand, causal inference has the largest proportion of citations from SOC and BE among the five topics.

Figure 2.8: Breakdown of citations for the papers in Table 2.4 aggregated by the five statistical topics: (a) MCMC, (b) Causal, (c) Penalized regression, (d) FDR, and (e) Bayesian model selection.

2.4 Connecting statistical research communities to external topics by local clustering – methods and theoretical guarantees

We have seen that different research topics in statistics often have different citation profiles by external field, indicating they may have a heavier influence on some fields and topics and less on others. This prompts us to consider the question: given a specific external research topic, can we identify the most relevant statistical research topic (with relevance measured by our collected citation data)? This section investigates a local clustering approach using aPPR followed by appropriate cutoff selection. We present theoretical studies under the DC-SBM and results on simulated data.
A typical local clustering method starts from one or multiple seed nodes and performs a random walk in the neighborhood of the seeds to gather other relevant nodes. In our setting, we first use keyword search to select a subset of citing papers, $I_t \subset I_c$, from an external topic of interest (see details in Section 2.5). The seed nodes are constructed using citation information between the source papers $I_s$ and the topic papers in $I_t$, and the local clustering is performed on $I_s$ and their network $A^s$. For clustering purposes, we consider two papers as related in content if a citation exists between them; the direction of this citation is less important if we think of it as a form of association. For this reason, we treat $A^s$ as an undirected network in this section. That is,

$$A^s_{ij} = \begin{cases} 1, & \text{there is a citation between } i \text{ and } j; \\ 0, & \text{otherwise}, \end{cases} \tag{2.2}$$

for $i, j \in I_s = \{1,\ldots,9338\}$.

Next we present the details of the local clustering procedure and its theoretical properties under a network model with community structure. Standard order notations $O$, $\Omega$, $O_p$ and $\Omega_p$ will be used throughout.

2.4.1 Preliminaries and the DC-SBM

In order to analyze the behavior of local clustering, we adopt the popular DC-SBM [60], which captures both node heterogeneity and community structure, as the underlying network model. While such a model may not capture all the features of our citation network, the presence of node heterogeneity is reflected in the uneven distribution of citation counts, and it is plausible to assume the underlying communities correspond to different research topics. For convenience of notation, we will describe the DC-SBM and the local clustering procedure using a general symmetric adjacency matrix $A$ and a general set of nodes $I$, with the understanding that they refer to $A^s$ and $I_s$ in our data analysis. In the original SBM [52], $N$ nodes are assigned to $K$ blocks or communities, and the probability of an edge between two nodes only depends on their community memberships.
To abbreviate notation, write the set $\{1,\ldots,n\}$ as $[n]$ for any integer $n$. The set of nodes $I = [N]$ is partitioned into $K$ blocks by the function $g : [N] \to [K]$. Let $n_k$ denote the size of block $k$ and $I_k$ the set of nodes in block $k$ for $k \in [K]$. The proportion of members in block $k$ is $\tau_k = n_k/N$. We consider the case where the number of blocks $K$ is fixed and $\tau_k$ is bounded below by a constant for all $k \in [K]$. The probability of an edge between nodes $i$ and $j$ is

$$A_{ij} \mid g \overset{\text{ind.}}{\sim} \text{Bernoulli}\left(B_{g(i)g(j)}\right), \quad \forall i,j \in I,\ i \neq j,$$

where $B \in [0,1]^{K\times K}$ is the connectivity matrix. We adopt the common parametrization for $B$ as $B = \rho_N S$, where $S$ is a fixed $K \times K$ matrix and $\rho_N$ is the average edge density satisfying $\rho_N \to 0$ at some rate as $N \to \infty$. The DC-SBM introduces node heterogeneity by adding a degree parameter $\theta_i$ for each node $i$, so that the probability of an edge between $i$ and $j$ becomes

$$A_{ij} \mid g, \theta \overset{\text{ind.}}{\sim} \text{Bernoulli}\left(\theta_i \theta_j B_{g(i)g(j)}\right), \quad \forall i,j \in I,\ i \neq j. \tag{2.3}$$

Some constraint is needed on $\theta_i$ for identifiability, and we adopt the constraint $\sum_{i \in I_k} \theta_i = n_k$ for all $k \in [K]$, following [60]. The degree of node $i$ is defined as $d_i = \sum_{j \in I} A_{ij}$. The population adjacency matrix is the conditional expectation of $A$, i.e., $\mathcal{A} = E[A \mid g, \theta]$. It follows that the population node degrees are $\bar d_i = \sum_{j \in I} \mathcal{A}_{ij}$, and the expected degree is $\lambda_N = \frac{1}{N}\sum_{i \in I} \bar d_i$. Note that we also have $\lambda_N = N\rho_N$.

2.4.2 Adjusted Personalized PageRank under the DC-SBM

Given an adjacency matrix $A$, define the diagonal matrix $D = \text{diag}(d_1,\ldots,d_N)$ and the graph transition matrix $P = D^{-1}A$. The personalized PageRank (PPR) vector $p \in [0,1]^N$ is the stationary distribution of the process

$$p^\top = \alpha \pi^\top + (1-\alpha)\,p^\top P,$$

where $\alpha \in (0,1]$ is the teleportation constant, and $\pi \in [0,1]^N$ is a probability vector called the preference vector encoding one or multiple seed nodes. For example, if there is one seed node $v_0 = 1$, then $\pi = (1,0,\ldots,0)^\top$.
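The PPR fixed point above can be approximated by power iteration; this is a minimal sketch (the thesis does not specify a solver, so power iteration is one standard choice, and the cycle graph is an illustrative toy network). The last two lines check the linearity of PPR in the preference vector, which the analysis below relies on in Eq (2.9):

```python
import numpy as np

def ppr(A, pi, alpha=0.15, tol=1e-12, max_iter=100_000):
    """Personalized PageRank: fixed point of p' = alpha*pi' + (1-alpha)*p'P,
    with P = D^{-1}A the graph transition matrix (all degrees assumed > 0)."""
    d = A.sum(axis=1).astype(float)
    P = A / d[:, None]
    p = pi.astype(float).copy()
    for _ in range(max_iter):
        p_next = alpha * pi + (1 - alpha) * (p @ P)
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p_next

# toy network: an 8-node cycle; seed nodes 0 and 1
N = 8
A = np.zeros((N, N), dtype=int)
for i in range(N):
    A[i, (i + 1) % N] = A[(i + 1) % N, i] = 1
e = np.eye(N)
p_mix = ppr(A, 0.3 * e[0] + 0.7 * e[1])                 # two-seed preference
p_lin = 0.3 * ppr(A, e[0]) + 0.7 * ppr(A, e[1])         # linearity check
```

The two vectors `p_mix` and `p_lin` agree up to numerical tolerance, which is exactly the linearity property used later for multiple seed nodes.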
Under a network model with community structure such as the SBM or DC-SBM, the goal is to recover all the nodes with the same community membership as $v_0$ by ranking the elements in the PPR vector $p$. In our setting, we choose source papers that have high citation counts from a set of topic papers as the seed nodes. For a source paper $j \in I_s$ and a set of topic papers $I_t$, its citation count is $a_j = \sum_{i \in I_t} A_{ij}$, where $A$ is the citation network defined in Eq (2.1). The preference vector $\pi \in [0,1]^{9338}$ is calculated as

$$\pi_k = \frac{a'_k}{\sum_{j \in I_s} a'_j}, \quad \text{where } a'_j = \begin{cases} a_j, & a_j \ge t; \\ 0, & a_j < t. \end{cases} \tag{2.4}$$

Here $t$ is a chosen threshold constant. We extend the setting of a single seed node in [65, 27] to multiple seed nodes, but still make the assumption that they all belong to the same community. While it is unlikely that all papers cited by a specific topic come from the same community, the threshold $t$ helps us prune the vector $\pi$ and makes the assumption more reasonable.

Related to PPR, the adjusted personalized PageRank (aPPR) vector is defined as

$$p^*_i = \frac{p_i}{d_i} \quad \text{for } i = 1,\ldots,N, \tag{2.5}$$

where $p_i$ is the $i$th entry in the PPR vector. Chen et al. [27] showed that under the DC-SBM, adjusting by the degrees leads to a consistent ordering of the entries in $p^*$, so that the entries with the highest values belong to the target community. Formally, let $n$ be a community size cutoff. Then the $n$ nodes with the largest $p^*_i$ values are selected as members of the target community, that is,

$$C_n = \{i \mid p^*_i \ge p^*_{(n)}\}, \tag{2.6}$$

where $p^*_{(1)},\ldots,p^*_{(N)}$ is the list of $p^*$ sorted in non-increasing order. Corollary 1 of Chen et al. [27] shows that with $v_0 = 1$ (assuming without loss of generality that it belongs to block 1), provided we know the correct size cutoff $n = n_1$ (recall $n_1 = |I_1|$), the aPPR clustering recovers all the nodes in block 1 with high probability, i.e., $C_{n_1} = I_1$. Specifically, consider the argument of Chen et al. [27] under the DC-SBM.
They constructed the "block-wise" population version of the aPPR vector $p^* \in \mathbb{R}^K$ and proved that when there is a single seed node in block 1,

$$p^*_1 > \max\{p^*_k \mid k = 2,\ldots,K\}. \tag{2.7}$$

The separation between block 1 and the other blocks is defined as $\Delta_\alpha \in [0,1]$,

$$\Delta_\alpha = \frac{p^*_1 - \max\{p^*_k \mid k = 2,\ldots,K\}}{p^*_1}. \tag{2.8}$$

Note that $\Delta_\alpha$ is an increasing function of $\alpha$. This separation, together with an appropriate concentration analysis, allowed them to show in their Corollary 1 that the sample aPPR vector can consistently recover all the nodes in block 1 given the correct size cutoff. The following property of $p^*$ follows easily from the linearity of PPR vectors in general:

$$p^*(\omega_1 \pi_1 + \omega_2 \pi_2) = \omega_1 p^*(\pi_1) + \omega_2 p^*(\pi_2), \quad \text{where } \omega_i \ge 0 \text{ and } \omega_1 + \omega_2 = 1. \tag{2.9}$$

This property enables us to extend their Corollary 1 [27] to the setting with multiple seed nodes in a straightforward way. The following proposition extends their result to better fit our situation, i.e., the use of multiple seed nodes.

Proposition 1. Under the DC-SBM, given a set of seed nodes from the same block, say block 1, and a teleportation constant $\alpha$, assume that

(c.1) $\min_{u \in [N]} \theta_u \ge L_\theta$ and $\max_{u \in [N]} \theta_u \le U_\theta$, where $L_\theta$ and $U_\theta$ are positive constants.

(c.2) For some sufficiently large constant $c_1 > 0$, $\lambda_N > c_1\left(\frac{1-\alpha}{\Delta_\alpha}\right)^2 \log N$.

Then for sufficiently large $N$, with probability at least $1 - O(N^{-5})$, we have $C_{n_1} = I_1$.

Proof of Proposition 1. To extend Corollary 1 of Chen et al. [27], we first check that their assumption $\frac{\max_{i\in I} \bar d_i}{\min_{i\in I} \bar d_i} < c_0$ for some constant $c_0$ holds under our assumption (c.1). We have

$$\bar d_i = \sum_j \mathcal{A}_{ij} = \sum_j \theta_i\theta_j B_{g(i)g(j)} = \theta_i\rho_N\sum_j \theta_j S_{g(i)g(j)}.$$

By assumption (c.1),

$$\frac{\max_{i\in I} \bar d_i}{\min_{i\in I} \bar d_i} \le \frac{\max_{i\in I}\theta_i\sum_j\theta_j S_{g(i)g(j)}}{\min_{i\in I}\theta_i\sum_j\theta_j S_{g(i)g(j)}} \le \frac{U_\theta\max S_{ij}\sum_j\theta_j}{L_\theta\min S_{ij}\sum_j\theta_j} = \frac{U_\theta\max S_{ij}}{L_\theta\min S_{ij}}. \tag{2.10}$$

Since $S$ is a fixed matrix, $\frac{\max_{i\in I} \bar d_i}{\min_{i\in I} \bar d_i}$ is bounded above. It remains to show that the inequality (2.7) holds for multiple seed nodes from the same block.
Without loss of generality, we consider two seed nodes $v_1 = 1$ and $v_2 = 2$ from block 1, with corresponding preference vectors $\pi_1 = (1,0,0,\ldots,0)^\top$ and $\pi_2 = (0,1,0,\ldots,0)^\top$. When $\omega_1 + \omega_2 = 1$ and $\omega_i \ge 0$, $\omega_1\pi_1 + \omega_2\pi_2$ can be considered a preference vector containing two seed nodes from the same block. Now Eq (2.7) applies to $\pi_1$ and $\pi_2$ separately, that is,

$$p^*_1(\pi_1) > \max\{p^*_k(\pi_1) \mid k = 2,\ldots,K\} \quad \text{and} \quad p^*_1(\pi_2) > \max\{p^*_k(\pi_2) \mid k = 2,\ldots,K\}. \tag{2.11}$$

By Eq (2.9) and Eq (2.11), we have

$$\begin{aligned} p^*_1(\omega_1\pi_1 + \omega_2\pi_2) &= \omega_1 p^*_1(\pi_1) + \omega_2 p^*_1(\pi_2) \\ &> \omega_1\max\{p^*_k(\pi_1) \mid k = 2,\ldots,K\} + \omega_2\max\{p^*_k(\pi_2) \mid k = 2,\ldots,K\} \\ &\ge \max\{\omega_1 p^*_k(\pi_1) + \omega_2 p^*_k(\pi_2) \mid k = 2,\ldots,K\} \\ &= \max\{p^*_k(\omega_1\pi_1 + \omega_2\pi_2) \mid k = 2,\ldots,K\}. \end{aligned} \tag{2.12}$$

The rest of the proof is the same as that of Corollary 1 in [27].

2.4.3 Conductance

Given the result in Proposition 1, it remains to choose the correct size $n$ for $C_n$ to fully recover the target community (block 1). To achieve this, an objective function is needed to evaluate the quality of the clusters found. Conductance is a popular objective function, optimized either globally or locally [129, 144], and often used in conjunction with a local clustering algorithm like PPR [5, 140]. It tends to favor small clusters weakly connected to the rest of the graph, and one would expect such an assortative structure in citation networks with communities defined by research topics. For a set of nodes $I' \subseteq I$, we define its conductance $\phi$ as

$$\phi(I') = \frac{\sum_{i\in I'}\sum_{j\notin I'} A_{ij}}{\sum_{i\in I'} A_{i\cdot}}, \tag{2.13}$$

where $A_{i\cdot} = \sum_{j\in I} A_{ij}$. The numerator is known as the cut of the graph partitioned by $I'$ and its complement $(I')^c$, while the denominator represents the volume $\text{vol}(I', I)$. We note that an alternative form of conductance has $\min\{\text{vol}(I',I), \text{vol}((I')^c,I)\}$ in the denominator. The two forms are equivalent when the size of $I'$ is smaller than that of $(I')^c$, a condition we expect to hold for $C_{n_1}$ and its neighborhood, given that $n_1$ is small compared with $N$.
Hence we choose the form in Eq (2.13) for easier bookkeeping. Proposition 1 demonstrates that the aPPR vector sorts the nodes in terms of their relevance to the target community with high probability. The sorted list of nodes leads to a sequence of clusters $\{C_n\}_{n=1}^N$ and their conductance values $\{\phi(C_n)\}_{n=1}^N$. Our next theorem establishes that the correct choice of $n$ occurs at a local optimum along this sequence, justifying the practice of choosing the community size cutoff by inspecting the conductance plot.

Theorem 1. Under the DC-SBM, suppose that (c.1) and (c.2) hold. Recall that $S$ is the fixed component of the connectivity matrix $B$, and $\tau$ is the set of block proportions (Section 2.4.1). Further assume that

(c.3) given $S$ and $\tau$,

$$S_{11}\min_{i>1}\tilde S_{i\cdot} > 2\max_{j>1}S_{1j}\,\tilde S_{1\cdot}, \quad \text{where } \tilde S_{i\cdot} = \sum_j S_{ij}\tau_j. \tag{2.14}$$

Then for sufficiently large $N$, there exists $n'$ with $n' - n_1 = \Omega(N)$ such that

$$\phi(C_{n_1}) - \phi(C_n) \le -\frac{1}{N}\Omega_P(|n - n_1|) \tag{2.15}$$

uniformly for $n \in [n']$.

The proof of Theorem 1 has two major parts. The first part, in Section 2.4.4, analyzes the optimality properties of $\phi$ at the population level with the help of the result in Proposition 1. The second part, in Section 2.4.5, incorporates noise from the adjacency matrix and proves the local optimality result in the theorem. We make two remarks as follows. a) The bound in Eq (2.15) and the lower bound on $n' - n_1$ guarantee that the optimum at $n_1$ is well separated from its neighborhood and that this neighborhood is wide enough to be observed in a conductance plot. b) As can be seen from the proofs in Section 2.4.4, the larger the gap between $S_{11}\min_{i>1}\tilde S_{i\cdot}$ and $2\max_{j>1}S_{1j}\tilde S_{1\cdot}$, the more peaked and easier to spot the local optimum is. In an assortative graph with $S_{ii} > \max_{j\neq i}S_{ij}$, a smaller $\tau_1$ will lead to a smaller $\tilde S_{1\cdot}$ in Eq (2.14), making the inequality more easily satisfied. Thus using conductance as an objective function is well suited to the situation where $n_1$ is a small fraction of $N$.
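The conductance of Eq (2.13) and its sweep over the nested aPPR clusters $\{C_n\}$ can be sketched as follows; this assumes a symmetric 0/1 adjacency matrix, and the `appr_scores` argument stands for the aPPR values of Eq (2.5):

```python
import numpy as np

def conductance(A, nodes):
    """Eq (2.13): cut(I', (I')^c) divided by vol(I', I)."""
    mask = np.zeros(A.shape[0], dtype=bool)
    mask[list(nodes)] = True
    cut = A[mask][:, ~mask].sum()   # edges leaving I'
    vol = A[mask].sum()             # sum of degrees inside I'
    return cut / vol

def conductance_sweep(A, appr_scores):
    """phi(C_n) for n = 1, ..., N-1, with C_n the top-n nodes by aPPR score."""
    order = np.argsort(-np.asarray(appr_scores))
    return [conductance(A, order[:n]) for n in range(1, len(order))]

# toy check: two triangles joined by one edge; phi({0,1,2}) = 1/7
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
phi = conductance(A, [0, 1, 2])
sweep = conductance_sweep(A, [6, 5, 4, 3, 2, 1])
```

In the toy graph the cut of the first triangle is the single bridging edge and its volume is $2+2+3 = 7$, so the sweep dips to $1/7$ exactly at $n = 3$, the planted community size.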
2.4.4 Properties of the population version of conductance

In this section, we analyze the optimality properties of the conductance function at the population level before presenting the sample version in the next section. Such a technique has been widely used in a number of works (e.g., Bickel and Chen [16] and Zhao et al. [150]); our case differs mostly in the construction of the confusion matrix and the analysis of the population version of the objective function. A major difference between the previous works and our analysis is that they aim to recover all the blocks, whereas we are only concerned with the target block (block 1). For a given cutoff set $C_n$ in Eq (2.6) ($n \in [N]$), which essentially partitions all the nodes $I$ into two sets, we consider the label assignment function $z = h(C_n)$. More concretely,

$$z(u) = \begin{cases} 1, & u \in C_n; \\ 2, & u \notin C_n, \end{cases} \tag{2.16}$$

for each node $u \in I$. In other words, we merge blocks $2,\ldots,K$ into one block and collectively call them block 2. Therefore, the correct assignment $z_0$ (as far as block 1 is concerned) should have $z_0(u) = 1$ for $u \in I_1$ and $z_0(u) = 2$ for $u \notin I_1$. Given the aPPR vector $p^*$ and the corresponding sequence $\{C_n\}_{n\in[N]}$, denote

$$\zeta = \{z = h(C_n) \mid n \in [N]\},$$

which is the set of all possible labels generated from $\{C_n\}_{n\in[N]}$. We have $|\zeta| = N$. Recall that Proposition 1 establishes that the aPPR vector recovers block 1 with high probability when $n = n_1$, i.e., $C_{n_1} = I_1$. We will show that under this high probability event, $\phi(C_{n_1})$ is a local minimum by analyzing the neighborhood around $n_1$. It is easy to see that this event also implies the following property of the set $C_n$:

$$C_n \ \begin{cases} \subsetneq \\ = \\ \supsetneq \end{cases}\ I_1 \quad \text{when } n \ \begin{cases} < \\ = \\ > \end{cases}\ n_1. \tag{2.17}$$

In other words, all the nodes in $C_n$ are from block 1 when the cutoff $n < n_1$; $C_n$ is exactly block 1 when $n = n_1$; and the whole of block 1 is contained in $C_n$ when $n > n_1$. For clarity of description, we first study the properties of $\phi$ under the SBM with $K$ blocks.
Following convention, let $\|z - z_0\|_1 = \sum_{u=1}^N 1\{z(u)\neq z_0(u)\}$, and for $1 \le a,b \le 2$,

$$O_{ab}(z) = \sum_{u\neq v} 1\{z(u)=a, z(v)=b\}A_{uv}. \tag{2.18}$$

Define the confusion matrix $R \in [0,1]^{2\times K}$,

$$R_{ab}(z,g) = \frac{1}{N}\sum_{u=1}^N 1\{z(u)=a, g(u)=b\}, \tag{2.19}$$

where $g$ denotes the correct labels as introduced in Section 2.4.1. Let $R$ abbreviate $R(z,g)$, and $RSR^\top$ abbreviate $R(z,g)SR(z,g)^\top$. Note that $R^\top\mathbf{1} = \tau$, where $\tau_k = |I_k|/N$ is the proportion of nodes in block $k$. Let $\mu_N = N^2\rho_N$; then

$$\frac{1}{\mu_N}E[O(z)\mid g] = RSR^\top.$$

For convenience, for a general $2\times 2$ matrix $M$, define

$$F(M) = \frac{M_{11}}{M_{1\cdot}}, \quad \text{where } M_{1\cdot} = M_{11} + M_{12}. \tag{2.20}$$

We immediately have

$$1 - \phi(C_n) = F\left(\frac{O(z)}{\mu_N}\right). \tag{2.21}$$

Moreover, we write $G(z) = F(RSR^\top(z))$, which is the population version of Eq (2.21) and depends only on $z$. The following lemma shows that $z_0$ is a well separated local optimum in a suitable neighborhood defined around $C_{n_1}$. Recall that we are working under the event $C_{n_1} = I_1$, so that Eq (2.17) holds.

Lemma 1. Suppose assumption (c.3) holds under the SBM. Given a large enough $N$ and a sequence $\{C_n\}_{n\in[N]}$, there exists $n'$ satisfying $n' - n_1 = \Omega(N)$ such that

$$G(z) - G(z_0) \le -\frac{1}{N}\Omega(|n - n_1|)$$

uniformly for all $z$ in the set $\zeta' = \{z = h(C_n) \mid n \in [n']\}$.

Proof of Lemma 1. We have

$$G(z) = F(RSR^\top(z)) = \frac{[RSR^\top]_{11}}{[RSR^\top]_{1\cdot}} = \frac{\sum_{i,j}R_{1i}S_{ij}R_{1j}}{\sum_{i,j,k}R_{1i}S_{ij}R_{kj}}. \tag{2.22}$$

According to (2.17), we consider the following cases.

Case 1: $n = n_1$. Then $C_n = I_1$ (i.e., $\{u \mid z(u)=1\} = \{u \mid g(u)=1\}$) and $z = z_0$. Note that $R_{1j} = 0$ for $j \neq 1$, $R_{21} = 0$, and $R_{11} = \tau_1$. By Eq (2.22),

$$G(z_0) = \frac{S_{11}\tau_1^2}{\tau_1\sum_{j,k}S_{1j}R_{kj}} = \frac{S_{11}\tau_1^2}{\tau_1\sum_j S_{1j}\tau_j} = \tau_1\frac{S_{11}}{\tilde S_{1\cdot}}.$$

Case 2: $n < n_1$. By Eq (2.17), $C_n \subsetneq I_1$, that is, $\{u \mid z(u)=1\} \subsetneq \{u \mid g(u)=1\}$. It follows that $R_{1j} = 0$ for $j \neq 1$, and

$$R_{11} = \frac{1}{N}\sum_{u=1}^N 1\{z(u)=1, g(u)=1\} = \frac{\sum_{u=1}^N 1\{z(u)=1\}}{N} = \frac{|C_n|}{N} = \frac{n}{N}. \tag{2.23}$$

We have $\tau_1 - R_{11} = \frac{1}{N}|n - n_1|$, since $\tau_1 = \frac{n_1}{N}$, and

$$G(z) = \frac{S_{11}R_{11}^2}{R_{11}\sum_j S_{1j}\tau_j} = R_{11}\frac{S_{11}}{\tilde S_{1\cdot}}.$$

Therefore,

$$G(z) - G(z_0) = (R_{11} - \tau_1)\frac{S_{11}}{\tilde S_{1\cdot}} \le -\frac{1}{N}\Omega(|n - n_1|).$$
Case 3: $n > n_1$. By Eq (2.17), $C_n \supsetneq I_1$, that is, $\{u \mid z(u)=1\} \supsetneq \{u \mid g(u)=1\}$. Hence $R_{21} = 0$, $R_{11} = \tau_1$, and

$$\sum_{j>1}R_{1j} = \frac{1}{N}\sum_{u=1}^N\sum_{b>1}1\{z(u)=1, g(u)=b\} = \frac{1}{N}\sum_{u=1}^N 1\{z(u)=1, g(u)\neq 1\} = \frac{1}{N}\left(\sum_{u=1}^N 1\{z(u)=1\} - \sum_{u=1}^N 1\{g(u)=1\}\right) = \frac{1}{N}(|C_n| - |I_1|) = \frac{1}{N}|n - n_1|. \tag{2.24}$$

We have

$$G(z) = \frac{\tau_1^2 S_{11} + 2\tau_1\sum_{j>1}S_{1j}R_{1j} + \sum_{i,j>1}R_{1i}S_{ij}R_{1j}}{\sum_{i,j}R_{1i}S_{ij}\tau_j} = \frac{\tau_1^2 S_{11} + 2\tau_1\sum_{j>1}S_{1j}R_{1j} + \sum_{i,j>1}R_{1i}S_{ij}R_{1j}}{\tau_1\tilde S_{1\cdot} + \sum_{i>1}\tilde S_{i\cdot}R_{1i}}. \tag{2.25}$$

Substituting,

$$G(z) - G(z_0) = \frac{\tau_1^2 S_{11} + 2\tau_1\sum_{j>1}S_{1j}R_{1j} + \sum_{i,j>1}R_{1i}S_{ij}R_{1j}}{\tau_1\tilde S_{1\cdot} + \sum_{i>1}\tilde S_{i\cdot}R_{1i}} - \tau_1\frac{S_{11}}{\tilde S_{1\cdot}}.$$

For a small but fixed $\varepsilon > 0$ (to be specified later), by choosing $n'$ satisfying $n' - n_1 = \lfloor N\varepsilon\rfloor$, we can ensure $\max_{j>1}R_{1j} \le \varepsilon$ for all $N > 1/\varepsilon$ and $n \in [n']$ according to Eq (2.24). Then we have

$$\begin{aligned} G(z) - G(z_0) &\le \frac{\tau_1^2 S_{11} + 2\tau_1\max_{j>1}S_{1j}\sum_{j>1}R_{1j} + \varepsilon\max_{i,j>1}S_{ij}\sum_{j>1}R_{1j}}{\tau_1\tilde S_{1\cdot} + \min_{i>1}\tilde S_{i\cdot}\sum_{j>1}R_{1j}} - \tau_1\frac{S_{11}}{\tilde S_{1\cdot}} \\ &= \frac{\left[\tilde S_{1\cdot}\left(2\tau_1\max_{j>1}S_{1j} + \varepsilon\max_{i,j>1}S_{ij}\right) - \tau_1 S_{11}\min_{i>1}\tilde S_{i\cdot}\right]\sum_{j>1}R_{1j}}{\tilde S_{1\cdot}\left(\tau_1\tilde S_{1\cdot} + \min_{i>1}\tilde S_{i\cdot}\sum_{j>1}R_{1j}\right)} \\ &= \frac{\tilde S_{1\cdot}\left(2\max_{j>1}S_{1j} + \varepsilon\tau_1^{-1}\max_{i,j>1}S_{ij}\right) - S_{11}\min_{i>1}\tilde S_{i\cdot}}{\tilde S_{1\cdot}\left(\tilde S_{1\cdot} + \tau_1^{-1}\min_{i>1}\tilde S_{i\cdot}\sum_{j>1}R_{1j}\right)}\left(\frac{1}{N}|n - n_1|\right). \end{aligned} \tag{2.26}$$

According to assumption (c.3), there exists a constant $c > 0$ such that

$$S_{11}\min_{i>1}\tilde S_{i\cdot} - c > 2\max_{j>1}S_{1j}\tilde S_{1\cdot}. \tag{2.27}$$

Let

$$\varepsilon = \frac{c\tau_1}{2\tilde S_{1\cdot}\max_{i,j>1}S_{ij}};$$

it follows that

$$\tilde S_{1\cdot}\left(2\max_{j>1}S_{1j} + \varepsilon\tau_1^{-1}\max_{i,j>1}S_{ij}\right) - S_{11}\min_{i>1}\tilde S_{i\cdot} < -\frac{c}{2} < 0,$$

and (2.26) is upper bounded by

$$-\frac{c/2}{\tilde S_{1\cdot}\left(\tilde S_{1\cdot} + \tau_1^{-1}\min_{i>1}\tilde S_{i\cdot}K\varepsilon\right)}\left(\frac{1}{N}|n - n_1|\right) = -\frac{1}{\frac{2\tilde S_{1\cdot}^2}{c} + \frac{K\min_{i>1}\tilde S_{i\cdot}}{\max_{i,j>1}S_{ij}}}\left(\frac{1}{N}|n - n_1|\right). \tag{2.28}$$

Therefore, $G(z) - G(z_0) < -\frac{1}{N}\Omega(|n - n_1|)$ for all $n \in \{n_1+1,\ldots,n'\}$.

The same result can be shown for a DC-SBM with $K$ blocks by defining a confusion tensor, adding a dimension for the nodes; all the other notations remain the same unless otherwise specified.
Define the confusion tensor $T \in \{0, \frac{1}{N}\}^{2\times K\times N}$ as

$$T_{abu}(z,g) = \frac{1}{N}1\{z(u)=a, g(u)=b\}. \tag{2.29}$$

Then define the degree-corrected confusion matrix $\tilde R \in [0,1]^{2\times K}$,

$$\tilde R_{ab}(z,g) = \sum_{u=1}^N \theta_u T_{abu}(z,g). \tag{2.30}$$

Let $T$ abbreviate $T(z,g)$, and $\tilde R$ abbreviate $\tilde R(z,g)$. Now we have

$$[\tilde R^\top\mathbf{1}]_b = \sum_{a,u}\theta_u T_{abu} = \frac{n_b}{N} = \tau_b. \tag{2.31}$$

Also,

$$\frac{1}{\mu_N}E[O_{ab}\mid g,\theta] = \sum_{i,j=1}^K\sum_{u,v=1}^N T_{aiu}\theta_u S_{ij}\theta_v T_{bjv} = [\tilde RS\tilde R^\top]_{ab}. \tag{2.32}$$

The following lemma is similar to Lemma 1 but extends the result to the DC-SBM.

Lemma 2. Suppose assumptions (c.1) and (c.3) hold under the DC-SBM. Define $\tilde G(z) = F(\tilde RS\tilde R^\top(z))$. Given a large enough $N$ and a sequence $\{C_n\}_{n\in[N]}$, there exists $n'$ satisfying $n' - n_1 = \Omega(N)$ such that

$$\tilde G(z) - \tilde G(z_0) \le -\frac{1}{N}\Omega(|n - n_1|)$$

uniformly for all $z$ in the set $\zeta' = \{z = h(C_n) \mid n \in [n']\}$.

Proof of Lemma 2. We have

$$\tilde G(z) = F(\tilde RS\tilde R^\top(z)) = \frac{[\tilde RS\tilde R^\top]_{11}}{[\tilde RS\tilde R^\top]_{1\cdot}} = \frac{\sum_{i,j}\tilde R_{1i}S_{ij}\tilde R_{1j}}{\sum_{i,j,k}\tilde R_{1i}S_{ij}\tilde R_{kj}}. \tag{2.33}$$

Similar to the proof of Lemma 1, we consider the three cases in Eq (2.17).

Case 1: $n = n_1$. Again we have $C_n = I_1$ and $z = z_0$, so $\tilde R_{1j} = 0$ for $j \neq 1$, $\tilde R_{21} = 0$, and $\tilde R_{11} = \tau_1$. By Eq (2.33), we have

$$\tilde G(z_0) = \tau_1\frac{S_{11}}{\tilde S_{1\cdot}}. \tag{2.34}$$

Case 2: $n < n_1$. By Eq (2.17), $C_n \subsetneq I_1$ (i.e., $\{u \mid z(u)=1\} \subsetneq \{u \mid g(u)=1\}$), so $\tilde R_{1j} = 0$ for $j \neq 1$, and

$$\tau_1 - \tilde R_{11} = \frac{n_1}{N} - \sum_{u=1}^N\frac{\theta_u}{N}1\{z(u)=1, g(u)=1\} = \sum_{u\in I_1}\frac{\theta_u}{N} - \sum_{u\in C_n}\frac{\theta_u}{N} = \frac{1}{N}\sum_{u\in I_1\setminus C_n}\theta_u \ge \frac{L_\theta}{N}\sum_{u\in I_1\setminus C_n}1 = \frac{L_\theta}{N}|n - n_1|. \tag{2.35}$$

The last inequality holds by assumption (c.1). Also, we have $\tilde G(z) = \tilde R_{11}\frac{S_{11}}{\tilde S_{1\cdot}}$. Then,

$$\tilde G(z) - \tilde G(z_0) = (\tilde R_{11} - \tau_1)\frac{S_{11}}{\tilde S_{1\cdot}} \le -\frac{1}{N}\Omega(|n - n_1|).$$

Case 3: $n > n_1$. By Eq (2.17), $C_n \supsetneq I_1$ (i.e., $\{u \mid z(u)=1\} \supsetneq \{u \mid g(u)=1\}$), so $\tilde R_{21} = 0$, $\tilde R_{11} = \tau_1$, and

$$\sum_{j>1}\tilde R_{1j} = \frac{1}{N}\sum_{u=1}^N\theta_u 1\{z(u)=1, g(u)\neq 1\} = \frac{1}{N}\sum_{u\in C_n\setminus I_1}\theta_u \ge \frac{L_\theta}{N}|n - n_1|.$$

Similar to Eq (2.25),

$$\tilde G(z) = \frac{\tau_1^2 S_{11} + 2\tau_1\sum_{j>1}S_{1j}\tilde R_{1j} + \sum_{i,j>1}\tilde R_{1i}S_{ij}\tilde R_{1j}}{\tau_1\tilde S_{1\cdot} + \sum_{i>1}\tilde S_{i\cdot}\tilde R_{1i}}.$$
Now we have

$$\tilde G(z) - \tilde G(z_0) = \frac{\tau_1^2 S_{11} + 2\tau_1\sum_{j>1}S_{1j}\tilde R_{1j} + \sum_{i,j>1}\tilde R_{1i}S_{ij}\tilde R_{1j}}{\tau_1\tilde S_{1\cdot} + \sum_{i>1}\tilde S_{i\cdot}\tilde R_{1i}} - \tau_1\frac{S_{11}}{\tilde S_{1\cdot}}.$$

By assumption (c.1),

$$\tilde R_{ij} = \frac{1}{N}\sum_{u\in[N]}\theta_u 1\{z(u)=i, g(u)=j\} \le \frac{U_\theta}{N}\sum_{u\in[N]}1\{z(u)=i, g(u)=j\} = U_\theta R_{ij}.$$

Here $R$ is the original confusion matrix defined in Eq (2.19). By the same argument as in Lemma 1, we can find a fixed $\varepsilon > 0$ such that, by choosing $n' - n_1 = \lfloor N\varepsilon\rfloor$, we can ensure $\max_{j>1}R_{1j} \le \varepsilon$ for all $N > 1/\varepsilon$ and $n \in [n']$ according to Eq (2.24). Therefore, $\max_{j>1}\tilde R_{1j} \le U_\theta\varepsilon$. Similar to Eq (2.26),

$$\tilde G(z) - \tilde G(z_0) \le \frac{\left[\tilde S_{1\cdot}\left(2\max_{j>1}S_{1j} + \varepsilon\tau_1^{-1}U_\theta\max_{i,j>1}S_{ij}\right) - S_{11}\min_{i>1}\tilde S_{i\cdot}\right]\sum_{j>1}\tilde R_{1j}}{\tilde S_{1\cdot}\left(\tilde S_{1\cdot} + \tau_1^{-1}\min_{i>1}\tilde S_{i\cdot}\sum_{j>1}\tilde R_{1j}\right)}. \tag{2.36}$$

Let

$$\varepsilon = \frac{c\tau_1}{2U_\theta\tilde S_{1\cdot}\max_{i,j>1}S_{ij}}.$$

According to Eq (2.27), we have

$$\tilde S_{1\cdot}\left(2\max_{j>1}S_{1j} + \varepsilon\tau_1^{-1}U_\theta\max_{i,j>1}S_{ij}\right) - S_{11}\min_{i>1}\tilde S_{i\cdot} < -\frac{c}{2} < 0.$$

Eq (2.36) then has an upper bound similar to (2.28), namely

$$-\frac{1}{\frac{2\tilde S_{1\cdot}^2}{c} + \frac{K\min_{i>1}\tilde S_{i\cdot}}{\max_{i,j>1}S_{ij}}}\left(\frac{L_\theta}{N}|n - n_1|\right). \tag{2.37}$$

Therefore, $\tilde G(z) - \tilde G(z_0) < -\frac{1}{N}\Omega(|n - n_1|)$ for $n \in \{n_1+1,\ldots,n'\}$.

2.4.5 Proof of Theorem 1

The proof of the main theorem relies on the optimality properties of the population version derived in the previous section and the concentration inequalities in the following lemma.

Lemma 3. Let $X(z) = O(z)/\mu_N - \tilde RS\tilde R^\top(z)$, and let $\|X\|_\infty$ denote $\max_{ij}|X_{ij}|$. Define the constant $C_S = U_\theta^2\max_{ab}S_{ab}$. Then

$$P\left(\max_{z\in\zeta}\|X(z)\|_\infty \ge \varepsilon\right) \le 8N\exp\left(-\frac{\varepsilon^2\mu_N}{8C_S}\right) \quad \text{for } \varepsilon \le 6C_S, \tag{2.38}$$

$$P\left(\max_{z\in\zeta:\|z-z_0\|_1=m}\|X(z)\|_\infty \ge \varepsilon\right) \le 16\exp\left(-\frac{\varepsilon^2\mu_N}{8C_S}\right) \quad \text{for } \varepsilon \le 6C_S, \tag{2.39}$$

and

$$P\left(\max_{z\in\zeta:\|z-z_0\|_1=m}\|X(z)-X(z_0)\|_\infty \ge \varepsilon\right) \le 16\exp\left(-\frac{N}{16mC_S}\varepsilon^2\mu_N\right) \quad \text{for } \varepsilon \le 12mC_S/N. \tag{2.40}$$

Proof. These are well-known inequalities that can be proved by Bernstein's inequality. For the sake of completeness, we present the details here. We have $\frac{1}{\mu_N}E[O_{ab}\mid g,\theta] = [\tilde RS\tilde R^\top]_{ab}$, thus $\mu_N X_{ab} = O_{ab} - E[O_{ab}\mid g,\theta]$. Also, we have

$$O_{ab} = \sum_{i\neq j}1\{z(i)=a, z(j)=b\}A_{ij} = 2\sum_{i<j}1\{z(i)=a, z(j)=b\}A_{ij}.$$
$\mu_N X_{ab}$ is a sum of independent zero-mean random variables bounded by 1. By Bernstein's inequality,

$$P(|\mu_N X_{ab}| \ge \varepsilon\mu_N) \le 2\exp\left(-\frac{\varepsilon^2\mu_N^2}{2(\text{Var}(\mu_N X_{ab}) + \varepsilon\mu_N/3)}\right) = 2\exp\left(-\frac{\varepsilon^2\mu_N^2}{2(\text{Var}(O_{ab}) + \varepsilon\mu_N/3)}\right).$$

Note that $A_{ij} \overset{\text{ind.}}{\sim} \text{Bernoulli}(\theta_i\theta_j B_{g(i)g(j)})$ and $B_{ij} = \rho_N S_{ij}$; it follows that

$$\text{Var}(A_{ij}) = \theta_i\theta_j\rho_N S_{g(i)g(j)} - (\theta_i\theta_j\rho_N S_{g(i)g(j)})^2 \le \rho_N U_\theta^2\max S_{ij} = \rho_N C_S;$$

$$\text{Var}(O_{ab}) \le 4\cdot\frac{N(N-1)}{2}\rho_N C_S \le 2N^2\rho_N C_S = 2\mu_N C_S.$$

Since $\varepsilon \le 6C_S$, for fixed $a$, $b$, $z$,

$$P(|X_{ab}| \ge \varepsilon) \le 2\exp\left(-\frac{\varepsilon^2\mu_N}{8C_S}\right).$$

Therefore,

$$P(\|X(z)\|_\infty \ge \varepsilon) \le 8\exp\left(-\frac{\varepsilon^2\mu_N}{8C_S}\right) \quad \text{for a fixed } z. \tag{2.41}$$

We have $|\zeta| = N$, which establishes Eq (2.38). Moreover, according to Eq (2.16), we have $|\{z\in\zeta : \|z-z_0\|_1 = m\}| \le 2$, which establishes Eq (2.39). Now assume $z(m+1) = z_0(m+1),\ldots,z(N) = z_0(N)$. Then

$$O_{ab}(z) - O_{ab}(z_0) = 2\sum_{i<j}^m\left(1\{z(i)=a,z(j)=b\} - 1\{z_0(i)=a,z_0(j)=b\}\right)A_{ij} + 2\sum_{i=1}^m\sum_{j=m+1}^N\left(1\{z(i)=a,z(j)=b\} - 1\{z_0(i)=a,z_0(j)=b\}\right)A_{ij};$$

$$\text{Var}(O_{ab}(z) - O_{ab}(z_0)) \le 4\left(\frac{m(m-1)}{2} + m(N-m)\right)\rho_N C_S \le 4mN\rho_N C_S = 4\frac{m}{N}\mu_N C_S.$$

Based on Bernstein's inequality, we have

$$P(\mu_N|X_{ab}(z) - X_{ab}(z_0)| \ge \varepsilon\mu_N) \le 2\exp\left(-\frac{\varepsilon^2\mu_N^2}{2(\text{Var}(O_{ab}(z) - O_{ab}(z_0)) + \varepsilon\mu_N/3)}\right) \le 2\exp\left(-\frac{\varepsilon^2 N\mu_N}{2(4mC_S + \varepsilon N/3)}\right).$$

For $\varepsilon \le 12mC_S/N$ and fixed $a$, $b$, $z$, $z_0$,

$$P(|X_{ab}(z) - X_{ab}(z_0)| \ge \varepsilon) \le 2\exp\left(-\frac{N}{16mC_S}\varepsilon^2\mu_N\right).$$

Then

$$P(\|X(z) - X(z_0)\|_\infty \ge \varepsilon) \le 8\exp\left(-\frac{N}{16mC_S}\varepsilon^2\mu_N\right),$$

which establishes Eq (2.40).

The proof of the main theorem combines the population version result in Lemma 2, which holds under the high probability event established in Proposition 1, and Lemma 3, which controls the noise through concentration.

Proof of Theorem 1. According to Eq (2.21), our goal is the same as showing that there exists $n' - n_1 = \Omega(N)$ such that

$$F\left(\frac{O(z)}{\mu_N}\right) - F\left(\frac{O(z_0)}{\mu_N}\right) \le -\frac{1}{N}\Omega_P(\|z-z_0\|_1), \tag{2.42}$$

where $z \in \zeta'$ and $\zeta' = \{z = h(C_n) \mid n \in [n']\}$, for $\lambda_N$ satisfying (c.2). Note that $\|z-z_0\|_1 = |n-n_1|$ according to the definition in Eq (2.16).
The proof technique is similar to Bickel et al. [17]. By Taylor expansion,

$$F\left(\frac{O(z)}{\mu_N}\right) - F(\tilde RS\tilde R^\top(z)) = \left.\frac{\partial F}{\partial M}\right|_{M=\tilde RS\tilde R^\top(z)}\text{vec}(X(z)) + O(\|X(z)\|_\infty^2) \tag{2.43}$$

and

$$F\left(\frac{O(z_0)}{\mu_N}\right) - F(\tilde RS\tilde R^\top(z_0)) = \left.\frac{\partial F}{\partial M}\right|_{M=\tilde RS\tilde R^\top(z_0)}\text{vec}(X(z_0)) + O(\|X(z_0)\|_\infty^2),$$

where $\frac{\partial F}{\partial M}$ is the partial derivative with respect to the vectorized $M$. $\frac{\partial F}{\partial M}$ is continuous with respect to $M$, so

$$\left.\frac{\partial F}{\partial M}\right|_{M=\tilde RS\tilde R^\top(z)} = \left.\frac{\partial F}{\partial M}\right|_{M=\tilde RS\tilde R^\top(z_0)} + O\left(\|\tilde RS\tilde R^\top(z) - \tilde RS\tilde R^\top(z_0)\|_\infty\right).$$

Note that $\|\tilde RS\tilde R^\top(z) - \tilde RS\tilde R^\top(z_0)\|_\infty = O(\|z-z_0\|_1/N)$. Then we have

$$F\left(\frac{O(z)}{\mu_N}\right) - F(\tilde RS\tilde R^\top(z)) - F\left(\frac{O(z_0)}{\mu_N}\right) + F(\tilde RS\tilde R^\top(z_0)) = \left.\frac{\partial F}{\partial M}\right|_{M=\tilde RS\tilde R^\top(z_0)}\text{vec}(X(z) - X(z_0)) + O\left(\frac{\|z-z_0\|_1}{N}\|X(z)\|_\infty\right) + O(\|X(z)\|_\infty^2) + O(\|X(z_0)\|_\infty^2).$$

Therefore, there exist positive constants $C_1, C_2, C_3, C_4$ such that

$$F\left(\frac{O(z)}{\mu_N}\right) - F\left(\frac{O(z_0)}{\mu_N}\right) \le F(\tilde RS\tilde R^\top(z)) - F(\tilde RS\tilde R^\top(z_0)) + C_1\|X(z) - X(z_0)\|_\infty + C_2\|X(z)\|_\infty\frac{\|z-z_0\|_1}{N} + C_3\|X(z)\|_\infty^2 + C_4\|X(z_0)\|_\infty^2.$$

Under the high probability event described in Eq (2.17), by Lemma 2 there exist $n' - n_1 = \Omega(N)$ and a positive constant $C_0$ satisfying

$$F(\tilde RS\tilde R^\top(z)) - F(\tilde RS\tilde R^\top(z_0)) \le -C_0\frac{\|z-z_0\|_1}{N} \quad \text{for } z\in\zeta'.$$

The other terms can be bounded by concentration, noting that assumption (c.2) implies $\lambda_N \to \infty$. We write

$$P\left(\max_{z\in\zeta':z\neq z_0}\left\{F\left(\frac{O(z)}{\mu_N}\right) - F\left(\frac{O(z_0)}{\mu_N}\right)\right\} \ge -\frac{C_0\|z-z_0\|_1}{5N}\right) \le P\left(\max_{z\in\zeta':z\neq z_0}\left\{-\frac{C_0\|z-z_0\|_1}{N} + C_1\|X(z)-X(z_0)\|_\infty + C_2\|X(z)\|_\infty\frac{\|z-z_0\|_1}{N} + C_3\|X(z)\|_\infty^2 + C_4\|X(z_0)\|_\infty^2\right\} \ge -\frac{C_0\|z-z_0\|_1}{5N}\right) \tag{2.44}$$

$$\le P\left(\max_{z\in\zeta:z\neq z_0}\frac{\|X(z)-X(z_0)\|_\infty}{\|z-z_0\|_1/N} \ge \frac{C_0}{5C_1}\right) + P\left(\max_{z\in\zeta}\|X(z)\|_\infty \ge \frac{C_0}{5C_2}\right) + P\left(\max_{z\in\zeta:z\neq z_0}\frac{\|X(z)\|_\infty^2}{\|z-z_0\|_1/N} \ge \frac{C_0}{5C_3}\right) + P\left(\max_{z\in\zeta:z\neq z_0}\frac{\|X(z_0)\|_\infty^2}{\|z-z_0\|_1/N} \ge \frac{C_0}{5C_4}\right). \tag{2.45}$$

By Eq (2.38), we have

$$P\left(\max_{z\in\zeta}\|X(z)\|_\infty \ge \varepsilon\right) \longrightarrow 0 \quad \text{as } \lambda_N \longrightarrow \infty. \tag{2.46}$$

According to Eq (2.40), we have

$$P\left(\max_{z\in\zeta:\|z-z_0\|_1=m}\|X(z)-X(z_0)\|_\infty \ge \varepsilon\frac{m}{N}\right) \le 16\exp\left(-\frac{m}{16NC_S}\varepsilon^2\mu_N\right).$$
Then,

$$P\left(\max_{z\in\zeta:z\neq z_0}\frac{\|X(z)-X(z_0)\|_\infty}{\|z-z_0\|_1/N} \ge \varepsilon\right) \le \sum_{m=1}^N P\left(\max_{z\in\zeta:\|z-z_0\|_1=m}\frac{\|X(z)-X(z_0)\|_\infty}{m/N} \ge \varepsilon\right) \le \sum_{m=1}^N 16\exp\left(-m\frac{\varepsilon^2\lambda_N}{16C_S}\right) \longrightarrow 0 \quad \text{as } \lambda_N \longrightarrow \infty. \tag{2.47}$$

Also, by Eq (2.39), we have

$$P\left(\max_{z\in\zeta:\|z-z_0\|_1=m}\|X(z)\|_\infty^2 \ge \varepsilon\frac{m}{N}\right) \le 16\exp\left(-\frac{m}{8C_S N}\varepsilon\mu_N\right).$$

Similar to Eq (2.47),

$$P\left(\max_{z\in\zeta:z\neq z_0}\frac{\|X(z)\|_\infty^2}{\|z-z_0\|_1/N} \ge \varepsilon\right) \le \sum_{m=1}^N P\left(\max_{z\in\zeta:\|z-z_0\|_1=m}\frac{\|X(z)\|_\infty^2}{m/N} \ge \varepsilon\right) \le \sum_{m=1}^N 16\exp\left(-m\frac{\varepsilon\lambda_N}{8C_S}\right) \longrightarrow 0 \quad \text{as } \lambda_N \longrightarrow \infty. \tag{2.48}$$

Again, by Eq (2.41),

$$P\left(\|X(z_0)\|_\infty^2 \ge \varepsilon\frac{m}{N}\right) \le 8\exp\left(-\frac{m}{8C_S N}\varepsilon\mu_N\right) \quad \text{for all } m\in[N].$$

Then,

$$P\left(\max_{z\in\zeta:z\neq z_0}\frac{\|X(z_0)\|_\infty^2}{\|z-z_0\|_1/N} \ge \varepsilon\right) \le \sum_{m=1}^N P\left(\frac{\|X(z_0)\|_\infty^2}{m/N} \ge \varepsilon\right) \le \sum_{m=1}^N 8\exp\left(-m\frac{\varepsilon\lambda_N}{8C_S}\right) \longrightarrow 0 \quad \text{as } \lambda_N \longrightarrow \infty. \tag{2.49}$$

Combining Eq (2.44), (2.46), (2.47), (2.48), and (2.49), we have

$$F\left(\frac{O(z)}{\mu_N}\right) - F\left(\frac{O(z_0)}{\mu_N}\right) \le -\frac{1}{N}\Omega_P(\|z-z_0\|_1) \quad \text{for } z\in\zeta'.$$

Algorithm 1: Local clustering
Input: adjacency matrix $A$, preference vector $\pi$, and teleportation constant $\alpha$.
1. Compute the aPPR vector $p^*$ in Eq (2.5) based on $(A, \pi, \alpha)$.
2. Construct the sequence of clusters $\{C_n\}_{n=1}^N$ according to Eq (2.6) and $p^*$.
3. Calculate conductance values $\{\phi(C_n)\}_{n=1}^N$ by Eq (2.13).
4. Find the first local minimum $\phi(C_{n^*})$ in $\{\phi(C_n)\}_{n=1}^N$.
Output: local cluster $C_{n^*}$.

2.4.6 Simulation studies

In this section, we examine the performance of our local clustering procedure, as summarized in Algorithm 1, on data simulated from the DC-SBM. We focus on the case where the target community (block 1) is small compared with the whole graph, as we consider this to be more relevant to our real data structure. Consider a DC-SBM with $K = 2$, $n_1 = 50$, $n_2 = 3000$, and

$$B = \begin{pmatrix} 0.05 & 0.01 \\ 0.01 & 0.05 \end{pmatrix}.$$

To simulate the degree parameters, let $\eta_i \sim \text{Uniform}(1,10)$ for $i = 1,\ldots,n_1+n_2$, and

$$\theta_i = \begin{cases} n_1\eta_i/\sum_{j=1}^{n_1}\eta_j, & i = 1,\ldots,n_1; \\ n_2\eta_i/\sum_{j=n_1+1}^{n_1+n_2}\eta_j, & i = n_1+1,\ldots,n_1+n_2, \end{cases}$$

so that they satisfy the identifiability constraint on $\theta_i$.
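The simulation setting above can be sampled as follows. This is a sketch: it uses a scaled-down $n_2 = 500$ (rather than 3000) to keep the matrices small, but otherwise follows the block structure, the $B$ matrix, and the block-wise normalization of $\theta$ described above:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 50, 500                      # scaled-down version of n1=50, n2=3000
N = n1 + n2
B = np.array([[0.05, 0.01],
              [0.01, 0.05]])
g = np.r_[np.zeros(n1, dtype=int), np.ones(n2, dtype=int)]   # block labels

# degree parameters eta_i ~ Uniform(1, 10), normalized block-wise so that
# sum_{i in block k} theta_i = n_k (the identifiability constraint)
eta = rng.uniform(1, 10, size=N)
theta = np.empty(N)
theta[:n1] = n1 * eta[:n1] / eta[:n1].sum()
theta[n1:] = n2 * eta[n1:] / eta[n1:].sum()

# edge probabilities theta_i * theta_j * B_{g(i)g(j)}, as in Eq (2.3)
prob = np.outer(theta, theta) * B[np.ix_(g, g)]
upper = np.triu(rng.random((N, N)) < prob, k=1)   # sample the upper triangle
A = (upper | upper.T).astype(int)                 # symmetric, zero diagonal
```

Sampling only the upper triangle and mirroring it keeps $A$ symmetric with a zero diagonal, matching the undirected setting of Eq (2.2).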
# of seeds | Precision Mean(%) | Precision SD(%) | Recall Mean(%) | Recall SD(%)
1 | 97.00 | 0.15 | 98.77 | 1.54
5 | 97.49 | 0.22 | 99.23 | 0.94
10 | 97.93 | 0.22 | 99.42 | 0.63
15 | 98.30 | 0.30 | 99.50 | 0.46
20 | 98.60 | 0.40 | 99.56 | 0.37

Table 2.5: The means and standard deviations of precision and recall rates for local clustering under α = 0.15 and different numbers of seeds. Each setting is simulated 50 times.

We investigate the effect of the teleportation constant $\alpha$ and the number of seed nodes, denoted $m$, on the accuracy of the local clustering results. When using $m$ seed nodes, the corresponding preference vector has $\pi_i = 1/m$ for $i \in [m]$ and $\pi_i = 0$ for the other entries. To determine the community size, we search for the first obvious local minimum in the conductance plot, and we find these optimal points usually occur at $n < 55$. Figure 2.10 provides examples of the conductance plots for $\alpha = 0.15$ and different $m$ values; the cases for other $\alpha$ are similar. Table 2.5 shows the average precision and recall rates and their standard deviations for finding members of block 1 with $\alpha = 0.15$ and five seed counts (1, 5, 10, 15, 20); each setting is repeated for 50 simulations. We can see that the precision increases as the number of seeds increases, since more seeds provide more initial information for the clustering. More seeds also help to stabilize the variance of recall and increase the mean recall by a smaller margin. On the other hand, the influence of $\alpha$ is rather minimal. The results for other $\alpha$ values (0.05, 0.25) are presented in Table 2.6. In Figure 2.9, we further illustrate the distributions of these precision and recall values for the case $\alpha = 0.15$.

[Figure 2.9: The violin plots of (a) precision and (b) recall for local clustering with α = 0.15, over seed counts 1, 5, 10, 15, 20.]

For all the case studies from the citation data in the next section, we set $\alpha = 0.15$ as in Chen et al. [27].
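The precision and recall of a recovered cluster against the true block, as reported in Table 2.5, can be computed as follows (a minimal sketch with hypothetical node index sets):

```python
def precision_recall(recovered, truth):
    """Precision = |C ∩ I1| / |C| and recall = |C ∩ I1| / |I1| for a
    recovered cluster C and the true block I1 (iterables of node indices)."""
    C, I1 = set(recovered), set(truth)
    tp = len(C & I1)
    return tp / len(C), tp / len(I1)

prec, rec = precision_recall(recovered=[0, 1, 2, 3], truth=[0, 1, 2, 4, 5])
# three of four recovered nodes are correct; three of five true members found
```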
α | # of seeds | Precision Mean(%) | Precision SD(%) | Recall Mean(%) | Recall SD(%)
0.05 | 1 | 97.00 | 0.14 | 98.77 | 1.54
0.05 | 5 | 97.49 | 0.22 | 99.23 | 0.94
0.05 | 10 | 97.93 | 0.21 | 99.43 | 0.63
0.05 | 15 | 98.30 | 0.30 | 99.51 | 0.45
0.05 | 20 | 98.59 | 0.40 | 99.58 | 0.32
0.25 | 1 | 97.00 | 0.15 | 98.76 | 1.54
0.25 | 5 | 97.48 | 0.21 | 99.24 | 0.94
0.25 | 10 | 97.93 | 0.22 | 99.41 | 0.63
0.25 | 15 | 98.30 | 0.29 | 99.50 | 0.45
0.25 | 20 | 98.60 | 0.40 | 99.55 | 0.36

Table 2.6: The means and standard deviations of precision and recall rates for local clustering under different α values and numbers of seeds. Each setting is simulated 50 times.

For the sake of completeness, we compare the local clustering procedure with common global clustering techniques, including the usual spectral clustering, SCORE [57], and the more recently proposed SCORE+ [58]. We find that SCORE+ performs better than SCORE in most of our experimental settings, so we only present the results of SCORE+ in what follows. We consider the three settings below.

Setting 1: $n_1 = 150$. The preference vector has $\pi_i = 1/50$ for $i = 1,\ldots,50$ and $\pi_i = 0$ otherwise. For the local clustering method, we observe that a local minimum usually occurs for $n < 200$. Thus we search for the minimum point in the range 1–200.

[Figure 2.10: Examples of conductance plots for different seed numbers when α = 0.15, under the setting of Table 2.5. The dotted line indicates the local minimum selected: n = 6, 20, 33, 35, and 44 for (a) 1, (b) 5, (c) 10, (d) 15, and (e) 20 seeds, respectively.]

Figure 2.11a is an example of a conductance plot under this setting. The local minimum is obvious at the point $n = 129$ in this example.
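Picking the first local minimum of the conductance sequence within a search window, as done above (the range 1–200 for Setting 1 and step 4 of Algorithm 1), can be sketched as follows; here `phi[k]` stands for $\phi(C_{k+1})$, and the window bounds are parameters:

```python
import numpy as np

def first_local_min(phi, n_min=2, n_max=None):
    """Return the smallest cluster size n in [n_min, n_max] at which
    phi(C_n) is a strict local minimum, or None if there is none."""
    phi = np.asarray(phi, dtype=float)
    last = len(phi) - 1 if n_max is None else min(n_max, len(phi) - 1)
    for n in range(max(n_min, 2), last + 1):
        # phi is 0-indexed: phi[n-1] corresponds to cluster size n
        if phi[n - 2] > phi[n - 1] < phi[n]:
            return n
    return None

phi = [0.90, 0.80, 0.70, 0.75, 0.60, 0.90]   # toy conductance sequence
n_star = first_local_min(phi, n_min=2, n_max=200)
```

On the toy sequence the first strict dip is at cluster size 3, so `n_star` is 3; raising `n_min` past it (as done for the flu case study, where the first minimum after $n \ge 10$ is taken) would instead return the next dip.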
Setting | Method | Precision Mean(%) | Precision SD(%) | Recall Mean(%) | Recall SD(%)
Setting 1 | Local | 97.49 | 0.50 | 99.11 | 0.45
Setting 1 | Spectral | 97.18 | 2.15 | 94.19 | 15.56
Setting 1 | SCORE+ | 98.30 | 0.32 | 99.87 | 0.07
Setting 2 | Local | 97.53 | 0.47 | 99.35 | 0.31
Setting 2 | Spectral | 97.34 | 1.41 | 91.65 | 17.63
Setting 2 | SCORE+ | 94.71 | 1.38 | 56.28 | 14.70
Setting 3 | Local | 98.60 | 0.40 | 99.56 | 0.37
Setting 3 | Spectral | 96.97 | 0.34 | 52.47 | 7.69
Setting 3 | SCORE+ | 96.77 | 0.01 | 50.03 | 0.06

Table 2.7: Means and standard deviations of precision and recall for local clustering, spectral clustering and SCORE+. Each setting is repeated in 50 simulations.

Setting 2: $n_1 = 100$. The preference vector has $\pi_i = 1/30$ for $i = 1,\ldots,30$ and $\pi_i = 0$ otherwise. In this case, we search for the local minimum within the range 1–150. Figure 2.11b gives an example of a conductance plot under Setting 2. Again the local minimum is clear in the plot.

Setting 3: $n_1 = 50$. The preference vector has $\pi_i = 1/20$ for $i = 1,\ldots,20$ and $\pi_i = 0$ otherwise. Here we search for the local minimum in the range 1–55, as mentioned in Section 2.4.6. All the other parameters (e.g., $K$, $B$ and $\theta_i$) are the same as in Section 2.4.6. The teleportation constant $\alpha$ is set to 0.15 as before. Note that the number of seeds in $\pi$ decreases as the size of block 1 decreases, since we expect fewer seeds to be available for smaller community sizes.

For each setting, we calculate the precision and recall from 50 simulations and record their means and standard deviations in Table 2.7. All the methods have high precision and recall rates in Setting 1. However, as $n_1$ decreases and the two block sizes become more imbalanced in Settings 2 and 3, the performance of spectral clustering and SCORE+ becomes worse, whereas local clustering remains stable with high averages and small standard deviations. As shown in Tables 2.5 and 2.6, which examine the effect of $\alpha$ and seed number under Setting 3, local clustering has slightly higher average precision and substantially higher average recall than the other two methods even when only a single seed is used.
We also note that the standard deviations of local clustering are smaller than those of spectral clustering in all the settings. More detailed distributions of these precision and recall rates under the three settings can be found in the violin plots of Figure 2.12. As expected, local clustering is better suited to the situation with smaller n_1.

Figure 2.11: Examples of conductance plots under Settings 1 and 2, with local minima at n = 129 and n = 70, respectively. More examples under Setting 3 can be found in Figure 2.10.

2.5 Case studies from the citation network

We apply local clustering to our citation data and use the procedure to find the most relevant statistical research areas for given external topics. We choose three external topics (single-cell transcriptomics, labor economics and flu) of high general interest spanning biology, economics and epidemiology, and discuss the results in detail.

Before applying Algorithm 1, it remains to describe the selection of the topic papers I_t and the construction of the preference vector π. (Recall that the adjacency matrix used here is A_s as described in Eq (2.2).) For each external topic, papers in I_t are chosen by keyword searches among the citing papers. More concretely, for the topics of single-cell transcriptomics and labor economics, we find citing papers that contain the relevant keywords ("single-cell" or "single cell" and "RNA-seq" for the topic of single-cell; "labor" for the topic of labor economics) in their abstracts. For a more accurate search result, we further restrict the labor economics papers to the category SOC using the labels in Section 2.3.1. The single-cell papers can come from a more diverse set of categories, and as shown in Figure 2.13a, most of our selected papers
are from BIO. For the topic of flu, we note that many papers may use flu datasets as examples of their analytic methods instead of focusing on the topic itself. To select papers with a sharper focus on the topic, we search for papers with "flu" or "influenza" in their titles instead of abstracts. The proportions of category labels for the flu papers are illustrated in Figure 2.13b, which indicates that most of them are from BIO.

Figure 2.12: The violin plots of precision and recall for Settings 1-3, comparing local clustering, spectral clustering and SCORE+.

Having constructed I_t, the seed nodes in π are chosen from the source papers with high citation counts by I_t. For each topic, we construct the preference vector by Eq (2.4). We choose the threshold t based on the citation counts from the topic papers to the source papers. For the topics "single-cell" and "labor economics", the top papers receive more than 90 citations; we set t = 10. For the topic "flu", the highest citation count is less than 90, and we set t = 5.

Figure 2.13: The categories of citing papers related to: (a) single-cell transcriptomics; (b) flu. The category groups are STATS, MATH, BIO, TECH, SOC and PHY.

The conductance plot for each topic is shown in Figure 2.14. In most cases, there is an obvious local minimum leading to a reasonable community size. In (b), we choose the first minimum occurring after n ≥ 10 for a more plausible subnetwork size and a clearer interpretation of the result.
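The seed-selection rule above can be sketched as follows. Eq (2.4) itself is not reproduced in this excerpt, so as a simplified stand-in the sketch places equal mass on each seed, namely each source paper receiving at least t citations from the topic papers I_t; the citation counts used here are made-up toy values.

```python
# Hedged sketch of building the preference vector pi from topic citations.
# Stand-in for Eq (2.4): uniform mass on seeds, zero elsewhere (an assumption,
# not necessarily the construction used in the dissertation).

def preference_vector(citations_from_topic, t, n_nodes):
    """citations_from_topic: {source paper id: #citations received from I_t}."""
    seeds = [v for v, c in citations_from_topic.items() if c >= t]
    pi = [0.0] * n_nodes
    for v in seeds:
        pi[v] = 1.0 / len(seeds)
    return pi, seeds

counts = {0: 95, 1: 12, 2: 3, 3: 10}      # toy citation counts from I_t
pi, seeds = preference_vector(counts, t=10, n_nodes=5)
print(seeds)               # -> [0, 1, 3]
print(round(sum(pi), 10))  # pi is a probability vector
```

Any paper below the threshold t contributes nothing to π, so the aPPR walk teleports only to the well-cited seeds.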
The size of the target community found for each topic is listed in Table 2.8. We can see that these subnetworks indeed have significantly denser connections (and in some cases, higher clustering coefficients) than the whole network. The subnetworks and the word clouds generated from the keywords of the subnetwork papers can be found in Figure 2.15. We discuss these in more detail below, interpreting the results with our understanding of the topics.

Figure 2.14: The conductance plots for each topic: (a) single-cell transcriptomics, (b) labor economics, and (c) flu. The x-axis is truncated at 500 for visual clarity.

Topic               Size    Graph density   Average clustering coefficient
single-cell           79    0.031           0.608
economic labor       108    0.039           0.402
flu                   30    0.73            0.232
all source papers  9,338    0.001           0.252

Table 2.8: Summary statistics for the subnetworks in Figure 2.15 compared with the global graph A_s.

Single-cell transcriptomics

Rapid advances in single-cell sequencing technologies in the past decade have enabled researchers to profile different aspects of an individual cell, in particular its transcriptome. After appropriate preprocessing, a single-cell transcriptomic dataset usually takes the form of a large, sparse matrix, with tens of thousands of rows representing genes and columns representing cells. The sparse, noisy and heterogeneous nature of such data has proved a fertile ground for the development of statistical and computational methods (see e.g. [62] for a review). Inspecting the subnetwork and word cloud in Figure 2.15a, perhaps unsurprisingly, we find that a significant fraction of the papers selected are concerned with multiple testing and connected to the hub node 79 [11].
As an example, multiple testing is routinely performed in the analysis of single-cell RNA-seq (scRNA-seq) data for identifying differentially expressed genes, which involves applying a statistical test to a large number of genes to determine if their expression levels are significantly different between two sets of cells. The word cloud also suggests clustering as another main keyword; in the subnetwork, clustering is a topic shared by the set of papers tightly knit around nodes 35 [119] and 78 [126]. In the analysis pipeline of scRNA-seq data, clustering is applied to a dimension-reduced scRNA-seq matrix to identify distinct subpopulations of cells, which can correspond to different cell types or states. The related feature selection and model selection problems are highly relevant in this context, as they help researchers determine the genes (features) that distinguish these subpopulations and the total number of subpopulations observed.

Labor economics

Labor economics aims to understand the functioning and dynamics of the markets for wage labor. Many fundamental questions in this subject—How does education affect income? How does healthcare affect income?—are of a causal nature. Economists and governments would like to design policies that might achieve certain economic and social welfare goals based on causal analysis. Randomized controlled trials (RCTs) are usually not available for labor economics problems. Therefore, it is not surprising to see that an overwhelming majority of the statistics papers selected in the subnetwork and word cloud in Figure 2.15b are in the realm of causal inference. Concretely, in the word cloud, the frequently appearing keywords (minus "test") are all technical terms in causal inference: "propensity score", "instrumental variable", "structural model", "matched sampling", "treatment effect", "matching", and "observational study".
Notably, node 78 [6], a hub in the subnetwork, links the structural equations framework in econometrics and the potential outcomes framework in statistics. The paper provides conditions for a causal interpretation of the instrumental variable (IV) estimand, and quantifies the bias resulting from violations of the critical assumptions. Moreover, many cited statistics papers (node 16 [44] and node 18 [73]) are rather recent, and they also appear in the subnetwork. This coincides with the surge of causal inference research in the statistical community in the last few years and offers some evidence that the new developments quickly penetrate into other research fields.

Flu

The global pandemic of Covid-19 has further ignited wide research interest in the modeling and prediction of the spread of an epidemic. We choose flu as an example of an epidemic due to its longer history of study and frequent appearance in the epidemiology literature. As expected, many of the keywords in Figure 2.15c are related to stochastic processes and state-space modeling. The word MCMC appears most often, as MCMC is a commonly used technique for parameter estimation in these epidemic models. Looking more closely at the subnetwork, many of the papers focus on refining the susceptible-infectious-recovered (SIR) model for infectious diseases, including flu and SARS. For the two hub nodes 28 [20] and 3 [34], the former is concerned with the parameter estimation problem for different types of observed data, while the latter extends the SIR model by incorporating an incubation stage and time dynamics to track the spread of flu.

2.6 Discussion

We study the citation network arising from selected statistical papers in the past two decades, a period coinciding with the rise of Big Data and with statistics being perceived to play increasingly important roles in many scientific disciplines.
Unlike previous studies on statistics citation networks, we focus on the connections between statistics and other disciplines and use citation data to investigate the external influence of various statistical works.

First performing descriptive analysis, we show that both the overall volume of citations and the diversity of citing fields have been increasing over time for all the journals considered.

Figure 2.15: Networks and word clouds generated from the source papers found by local clustering for each topic: (a) single-cell transcriptomics; (b) labor economics; (c) flu.

Even typical theoretical
journals such as AOS have been attracting a significant proportion of external citations in recent years, which is quite encouraging. Next, by distinguishing between internal and external citations, we identify research areas in statistics that have high impact under both criteria. The most highly cited papers are ranked high both internally and externally. On the other hand, papers with a large number of external citations but relatively fewer internal citations can point to areas where future development in relevant theory and methods may be rewarded by immediate visibility outside statistics. Lastly, using the technique of local clustering, we identify the statistical research communities most relevant to various external topics of interest. Under the DC-SBM, we prove that the combination of aPPR and conductance selects all nodes in the target community with high probability. We demonstrate the performance of the algorithm using simulated data, examining its stability with respect to the number of seeds and the teleportation constant. Presenting a number of case studies using external topics of high general interest, we show that the communities selected align well with our intuition and understanding of the topics.

Our study takes the first step toward understanding the influence of statistical works on other disciplines that use tools and methods from statistics to aid their discoveries. The data we have collected can be of independent interest, opening opportunities for further modeling and analysis from different perspectives. We also note that some of the limitations of our current study can be addressed by expanding the scope of the data. For example, in analyzing the trend of diversity of citing fields, it would be ideal to collect information about the number of published papers in each citing field and include it as a normalization factor.
The data could also be expanded to include more journals and other types of source publications, such as conferences and books, over a longer period of time to allow for a more comprehensive historical view and richer analysis. We leave the collection and analysis of these more extensive data as future work. Compared with global clustering, the theoretical properties of local clustering techniques are less well characterized under generative network models. Our application and theoretical results on local clustering can be extended to incorporate mixed membership modeling and temporal changes in the evolution of communities. We have currently used textual data (e.g., keywords) as a way to validate the target communities found; it would be more interesting to include such data as covariates in the network model subject to clustering analysis.

We end the discussion by acknowledging the limitations of citation itself as a form of data measuring intellectual influence, some of which have already been pointed out in previous studies [115, 131]. Not all citations carry the same weight – a paper could be mentioned just in the literature review or serve as the foundation that inspired the paper citing it; arguably the latter type of citation is more important. Citations are not always attributed to the correct source, and the modern-day style of research relying on search engines such as Google is likely to bias citations toward papers that already have high citation counts. Many data scientists and practitioners in industry do not necessarily publish their work but can still make use of ideas and tools in statistical papers, resulting in missing citations. Nevertheless, despite these limitations, citation data provide a useful and necessary first passage into investigating the intellectual influence of scientific works.

2.7 Acknowledgements and author contributions

This study is a collaborative effort with Xin Tong and Y.X. Rachel Wang. X.T. and Y.X.R.W.
conceived the study and designed the data collection procedures and methods. L.W. executed the methods, wrote the code, and performed empirical and theoretical analyses. X.T. and Y.X.R.W. supervised the execution. All authors contributed to the writing of the manuscript.

Chapter 3

Optimal hypergeometric confidence intervals

3.1 Introduction and summary

Given integers 0 < n ≤ N and 0 ≤ M ≤ N, a random variable X has the hypergeometric distribution Hyper(M, n, N) if

P_M(X = x) = \binom{M}{x} \binom{N−M}{n−x} / \binom{N}{n}    (3.1)

for all integer values of x such that the quotient (3.1) is defined, with P_M(X = x) = 0 otherwise. It is not hard to see that (3.1) is nonzero if and only if

max{0, M + n − N} ≤ x ≤ min{M, n}.    (3.2)

For a hypergeometric distribution Hyper(M, n, N), where N is the population size, M is the number of population units with some attribute, and n is the given sample size, there are two parametric cases: (i) N is unknown and M is given; (ii) M is unknown and N is given.

The most common setting in which the hypergeometric distribution arises is when X counts the number of items with a certain binary "special" property in a simple random sample (i.e., sampled uniformly without replacement) of size n from a population of size N containing M special items. But the hypergeometric arises in many other ways∗ not involving a simple random sample, such as the analysis of a 2 × 2 contingency table using Fisher's Exact Test, and in other sampling schemes.

Our approach to constructing (1 − α)-confidence intervals for M based on X is by inverting tests of the hypotheses H: M = M_0, which we denote as H(M_0), for M_0 = 0, 1, ..., N. For testing H(M), we utilize acceptance intervals [a_M, b_M] that maximize the acceptance probability P_M(X ∈ [a_M, b_M]) among all shortest possible level-α intervals, a property we call α max optimal, which is discussed in Section 3.4, along with a novel method of shifting a set of α max optimal intervals so that their endpoints a_M, b_M form nondecreasing sequences.
This guarantees that the confidence sets that result from inversion are proper intervals, which is our goal here. After obtaining and shifting a set of α max optimal intervals, in Section 3.5 we discuss how to further modify them to make them admissible, and we discuss the case M = N/2 when N is even, which needs separate handling. With our admissible and monotonic acceptance intervals in hand, in Section 3.6 we prove the size-optimality results for the confidence intervals that result from inversion. In Section 3.7 we present some numerical examples and compare with two existing methods, including the notable, recent method of W. Wang [132]. There we also apply our method to some data about the air quality in China.

3.2 Additional notation

Throughout the chapter we treat the positive integers n and N, and the desired confidence level 1 − α ∈ (0, 1), as fixed quantities, known to the statistician, and inference centers on the unknown value of M. Since the parameter M of interest is an integer, the intervals we consider are actually sets of consecutive integers, which we denote by [a, b] but which actually mean {a, a+1, ..., b}. For an arbitrary set A we let P_M(A) denote P_M(X ∈ A) where X ∼ Hyper(M, n, N), which X will denote throughout unless otherwise specified. For a scalar x we let P_M(x) denote P_M(X = x). We let ⌊y⌋ denote the largest integer ≤ y and ⌈y⌉ the smallest integer ≥ y. For sets A, B we let A \ B = {a ∈ A : a ∉ B} denote the set difference and |A| denote set cardinality, e.g., |[a, b]| = b − a + 1 for integers a ≤ b. For a nonnegative integer j we let [j] = {0, 1, ..., j}.

∗ For readers interested in aspects of the hypergeometric distribution not considered here, we refer them to [48] for its history and naming, [61] for log-concavity and other properties, and [29] and [111] for exponential tail bounds, to name a few.
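In this notation, the density (3.1) and its support (3.2) can be evaluated directly. The minimal Python sketch below (the parameter values are illustrative assumptions only) treats P_M(x) as zero off the support.

```python
# Sketch of the hypergeometric pmf (3.1) with the support restriction (3.2).
from math import comb

def hyper_pmf(x, M, n, N):
    """P_M(X = x) for X ~ Hyper(M, n, N); zero outside the support (3.2)."""
    if not (max(0, M + n - N) <= x <= min(M, n)):
        return 0.0
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

M, n, N = 7, 5, 20            # toy parameter choices
total = sum(hyper_pmf(x, M, n, N) for x in range(n + 1))
print(total)                  # the pmf sums to 1 over the support
```

The support check mirrors (3.2): for x = 6 > min{M, n} = 5, for instance, the function returns 0 rather than evaluating an undefined quotient.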
3.3 Properties of the hypergeometric distribution and auxiliary lemmas

In this section, we first record some well-known properties of the Hyper(M, n, N) distribution in Lemmas 4 and 5, the latter covering unimodality of P_M(x) in x. The content of Lemma 4 and of Lemma 5, part 1, is mentioned in [59, Chapter 6], so we do not prove it here, and the rest follow immediately from the expression

P_M(x) / P_M(x−1) = [(M + 1 − x)(n + 1 − x)] / [x(N − M − n + x)]   for 1 ≤ x ≤ n.    (3.3)

After that we state and prove some needed auxiliary results concerning other types of monotonicity and unimodality: monotonicity of density function ratios with respect to M in Lemma 6, and unimodality of P_M([a, b]) with respect to M in Lemma 7 and with respect to shifts in the interval [a, b] in Lemma 8. Throughout let P_M(x) denote the density (3.1) of the Hyper(M, n, N) distribution.

Lemma 4.
1. We have

X ∼ Hyper(n, M, N) ⇔ X ∼ Hyper(M, n, N)    (3.4)
⇔ n − X ∼ Hyper(N − M, n, N).    (3.5)

2. A useful coupling: For n < N, X ∼ Hyper(M, n+1, N) can be written X = X′ + Y where X′ ∼ Hyper(M, n, N) and Y | X′ ∼ Bern((M − X′)/(N − n)).

3. Monotone likelihood ratio: For every M_1, M_2 ∈ [N] with M_1 < M_2, P_{M_2}(x)/P_{M_1}(x) is nondecreasing in x (with the convention c/0 = ∞ for c > 0).

Lemma 5 (Unimodality properties of the hypergeometric). Let

m = (n+1)(M+1) / (N+2).    (3.6)

1. We have P_M(x−1) < P_M(x), P_M(x−1) = P_M(x), or P_M(x−1) > P_M(x) according as x < m, x = m, or x > m, respectively.    (3.7)

2. argmax_x P_M(x) = [m_1, m_2], where [m_1, m_2] = [⌊m⌋, ⌊m⌋], unless m is an integer, in which case [m_1, m_2] = [m − 1, m].

3. P_M(x) increases strictly on [x_min, m_1] and decreases strictly on [m_2, x_max], where x_min = max{0, M + n − N} and x_max = min{n, M} are the smallest and largest, respectively, of the x values with positive P_M probability.

The next lemma establishes monotonicity in M of the ratios P_M(x_2)/P_M(x_1).

Lemma 6. For fixed N and n, let x_1, x_2 be distinct integers in [0, n] such that 0 < x_2 − x_1 < N − n. Then

P_M(x_2)/P_M(x_1) < P_{M+1}(x_2)/P_{M+1}(x_1)   for x_2 ≤ M ≤ N − n + x_1.    (3.8)

Proof.
We have

P_M(x_2)/P_M(x_1) = \prod_{x=x_1}^{x_2−1} \frac{(M−x)(n−x)}{(N−M−(n−x)+1)(x+1)}    (3.9)
 < \prod_{x=x_1}^{x_2−1} \frac{(M+1−x)(n−x)}{(N−(M+1)−(n−x)+1)(x+1)} = P_{M+1}(x_2)/P_{M+1}(x_1).    (3.10)

The next lemma establishes the unimodality of the probabilities P_M([a, b]) as a function of M. It is helpful to define coupled random variables X and Y as the numbers of white and red balls, respectively, in a simple random sample of n from a box of N balls in which M balls are white, one is red, and the remaining N − (M+1) balls are green. Then X ∼ Hyper(M, n, N) and X + Y ∼ Hyper(M+1, n, N). In the usual notation, P_{M+1}(x) = P(X + Y = x) and P_M(x) = P(X = x). Writing

P_{M+1}(x) − P_M(x) = [P(X = x−1, Y = 1) + P(X = x, Y = 0)] − [P(X = x, Y = 1) + P(X = x, Y = 0)]
 = P(X = x−1, Y = 1) − P(X = x, Y = 1),    (3.11)

and summing over x from a to b yields

P_{M+1}([a, b]) − P_M([a, b]) = P(X = a−1, Y = 1) − P(X = b, Y = 1).    (3.12)

Note that for x such that P(X = x) > 0,

P(X = x, Y = 1) = P(X = x) (n−x)/(N−M) = P_M(x) (n−x)/(N−M)    (3.13)

since x white balls in the sample implies that n − x of the N − M colored (red or green) balls are in the sample, so that the red ball has conditional probability (n−x)/(N−M) of being in the sample. Relation (3.13) is trivially true when P(X = x) = 0, so it is true for all x ∈ [n]. Using (3.12) and (3.13),

(N − M)(P_{M+1}([a, b]) − P_M([a, b])) = (n − (a−1)) P_M(a−1) − (n − b) P_M(b).    (3.14)

This equation provides the basis for the following lemma.

Lemma 7. Assume 0 ≤ a ≤ b ≤ n and b − a < n. Then P_M([a, b]) is nondecreasing for M ≤ M(a, b) and nonincreasing for M ≥ M(a, b), where

M(a, b) = 0 if a = 0;
M(a, b) = N if b = n;
M(a, b) = min{M : (n − (a−1)) P_M(a−1) < (n − b) P_M(b)} otherwise.    (3.15)

Proof. By (3.14),

sgn(P_{M+1}([a, b]) − P_M([a, b])) = sgn((n − (a−1)) P_M(a−1) − (n − b) P_M(b)).    (3.16)

If a = 0, the first term on the right-hand side vanishes and {P_M([a, b])} is therefore nonincreasing. Similarly, if b = n, {P_M([a, b])} is nondecreasing.
It remains to consider only 1 ≤ a ≤ b ≤ n − 1, and it suffices to show that if M_1, M_2 are such that

sgn(P_{M+1}([a, b]) − P_M([a, b])) = +1 if M = M_1, and −1 if M = M_2,    (3.17)

then

M_1 < M_2.    (3.18)

Since the coefficients n − (a−1) and n − b in (3.16) are positive, (3.16) and (3.17) imply that P_{M_1}(a−1) and P_{M_2}(b) must be positive. Therefore, since by (3.2) P_M(x) is positive if and only if x ≤ M ≤ x + N − n, we have

M_1 ∈ I_1 := [a−1, a−1+N−n] and M_2 ∈ I_2 := [b, b+N−n].    (3.19)

The endpoints of I_1 are less than the corresponding endpoints of I_2, so that M_1 cannot be to the right of I_2, and if it is to the left, M_1 < M_2 follows immediately. So assume that M_1 belongs to I_2 and similarly that M_2 belongs to I_1. Then M_1 and M_2 both belong to I_1 ∩ I_2, hence

b ≤ M_1, M_2 ≤ a − 1 + N − n.    (3.20)

Since n − b is positive and P_M(a−1) is positive on this interval, (3.16) and (3.17) imply that

sgn(P_M(b)/P_M(a−1) − (n − (a−1))/(n − b)) = −1 if M = M_1, and +1 if M = M_2,    (3.21)

and the property (3.8) implies that M_1 < M_2.

Lemma 8. For fixed n, N, 0 ≤ a ≤ b < n, and positive integer d ≤ n − b, we have
(i) M(a, b) ≤ M(a+d, b+d), and
(ii) P_M([a, b]) ≤ P_M([a+d, b+d]) for all M ≥ M(a+d, b+d).

Proof. The lemma can be proved by induction on d, by straightforward verification using the definition (3.15) of M(a, b) and the inequality (3.10). We omit the details.

3.4 α max optimal acceptance sets and modifying intervals for monotonicity

In this section we establish properties of acceptance intervals that will guarantee that they still enjoy size optimality when they are appropriately shifted to make their endpoints monotonic. The next definition makes this precise, and we call the property α max optimal. Theorem 2 shows how to modify any set of α max optimal acceptance intervals to produce intervals whose endpoints a_M, b_M are nondecreasing in M, thus producing proper confidence intervals upon inversion, which is discussed in Section 3.6.
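Before constructing acceptance intervals, the unimodality machinery of Section 3.3 lends itself to a quick numerical sanity check. The sketch below, with arbitrary toy parameters, computes the mode interval of Lemma 5 from (3.6) and the turning point M(a, b) of (3.15) by direct search, then verifies the rise-then-fall pattern of P_M([a, b]) asserted in Lemma 7.

```python
# Toy numerical check of Lemma 5 (mode of the pmf) and Lemma 7 (unimodality
# of P_M([a,b]) in M). Parameter values are arbitrary assumptions.
from math import comb, floor

def hyper_pmf(x, M, n, N):
    """Eq (3.1), taken to be zero off the support (3.2)."""
    if not (max(0, M + n - N) <= x <= min(M, n)):
        return 0.0
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

def mode_interval(M, n, N):
    """argmax_x P_M(x), Lemma 5: [floor(m), floor(m)], or [m-1, m] if m is an integer."""
    m = (n + 1) * (M + 1) / (N + 2)
    return [m - 1, m] if m == int(m) else [floor(m), floor(m)]

def M_of_ab(a, b, n, N):
    """Turning point M(a,b) of Eq (3.15), found by direct search."""
    if a == 0:
        return 0
    if b == n:
        return N
    return min(M for M in range(N + 1)
               if (n - (a - 1)) * hyper_pmf(a - 1, M, n, N)
                  < (n - b) * hyper_pmf(b, M, n, N))

n, N = 5, 20
argmax = max(range(n + 1), key=lambda x: hyper_pmf(x, 7, n, N))
Mstar = M_of_ab(1, 3, n, N)
probs = [sum(hyper_pmf(x, M, n, N) for x in range(1, 4)) for M in range(N + 1)]
print(argmax, mode_interval(7, n, N), Mstar)
```

For these parameters the probabilities P_M([1, 3]) indeed rise up to M = M(1, 3) and fall afterwards, matching Lemma 7.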
It is not difficult to construct α max optimal acceptance intervals, and we give a simple and straightforward algorithm for doing so, which we call Algorithm 2 and which we verify is α max optimal in Section 3.5. For the next definition we more generally consider acceptance sets (not necessarily intervals): A level-α acceptance set for H(M) is any subset S_M ⊆ [n] such that

P_M(S_M) ≥ 1 − α.    (3.22)

Definition 1. Fix n, N, and α ∈ (0, 1).

1. Given M ∈ [N], a subset S ⊆ [n] is α optimal for M if P_M(S) ≥ 1 − α and P_M(S*) < 1 − α whenever S* ⊆ [n] with |S*| < |S|. A collection {S_M : M ∈ 𝓜}, 𝓜 ⊆ [N], is α optimal (for 𝓜) if, for all M ∈ 𝓜, S_M is α optimal for M.

2. Given M ∈ [N], a subset S ⊆ [n] is P_M-maximizing if all elements of S have positive P_M-probability and P_M(S) ≥ P_M(S*) whenever |S*| = |S|. A collection {S_M : M ∈ 𝓜}, 𝓜 ⊆ [N], is P_M-maximizing if, for all M ∈ 𝓜, S_M is P_M-maximizing.

3. A collection {S_M : M ∈ 𝓜}, 𝓜 ⊆ [N], is α max optimal (for 𝓜) if it is α optimal and P_M-maximizing.

Our main result concerning α max optimal acceptance intervals, stated in the next theorem, is that they can always be modified so as to make both sequences of endpoints nondecreasing in M while still being α optimal.

Theorem 2. Fix n, N, α ∈ (0, 1). Let 𝓜 ⊆ [N] be an arbitrary set of consecutive integers, and {[a_M, b_M] : M ∈ 𝓜} a set of α max optimal acceptance intervals. For M ∈ 𝓜 define

ā_M = max_{M′ ≤ M} a_{M′} and b̄_M = min_{M′ ≥ M} b_{M′}.    (3.23)

Finally, define

𝓜_a = {M ∈ 𝓜 : a_M < ā_M} and 𝓜_b = {M ∈ 𝓜 : b̄_M < b_M}.    (3.24)

Then the following hold.

1. The sets 𝓜_a and 𝓜_b are disjoint.

2. The adjusted intervals

[a_M^adj, b_M^adj] := [ā_M, b_M + (ā_M − a_M)] for M ∈ 𝓜_a;
[a_M^adj, b_M^adj] := [a_M − (b_M − b̄_M), b̄_M] for M ∈ 𝓜_b;
[a_M^adj, b_M^adj] := [a_M, b_M] for all other M ∈ 𝓜,    (3.25)

are α optimal and have nondecreasing endpoint sequences.

Proof of Theorem 2. We first prove that 𝓜_a and 𝓜_b are disjoint. Take M ∈ 𝓜_a, so that a_M < ā_M, and we will show that M ∉ 𝓜_b, i.e., that b_M ≤ b_{M′} for all M′ > M. Fix such an M′ ∈ 𝓜, and we will consider two cases, comparing a_{M′} and ā_M.

Case 1: a_{M′} < ā_M.
In this case we have a_{M′} < ā_M ≤ ā_{M′}. If equality holds in this last, then there exists M′_ℓ < M < M′ such that a_{M′_ℓ} = ā_M = ā_{M′}, so b_{M′} ≥ b_M by part 1(d)ii of Lemma 9. Otherwise, ā_M < ā_{M′}, so it must be that a new maximum is achieved between M and M′, i.e., ā_{M′} = a_{M′_ℓ} for some M < M′_ℓ < M′. Then a_{M′_ℓ} > ā_M, so b_M < b_{M′_ℓ} by part 1 of Lemma 10. We also have a_{M′_ℓ} = ā_{M′} > a_{M′}, so b_{M′_ℓ} ≤ b_{M′} by part 1 of Lemma 13. Combining these inequalities gives b_{M′} > b_M.

Case 2: a_{M′} ≥ ā_M. We have b_{M′} > b_M by part 1 of Lemma 10, implying that b̄_M = b_M, satisfying the claim.

The proof that M ∈ 𝓜_b implies M ∉ 𝓜_a is similar.

For monotonicity of the endpoints, given M ∈ 𝓜, we will show that a_M^adj ≥ a_{M′}^adj for all M′ < M and b_M^adj ≤ b_{M′}^adj for all M′ > M. We consider only the case M ∈ 𝓜_a, the M ∈ 𝓜_b case being similar, and the M ∉ 𝓜_a ∪ 𝓜_b case being simpler still. For M′ < M we have a_{M′}^adj ≤ ā_{M′} ≤ ā_M = a_M^adj, the first inequality holding for any element of the sequence by the definition (3.25). For M′ > M, if M′ ∈ 𝓜_a then, using Lemma 10 part 2 for the following inequality, we have

b_M^adj = b_M + ā_M − a_M ≤ b_{M′} + ā_{M′} − a_{M′} = b_{M′}^adj.

If M′ ∉ 𝓜_a ∪ 𝓜_b then a_{M′} = ā_{M′} ≥ ā_M, so b_{M′}^adj = b_{M′} ≥ b_M + ā_M − a_M = b_M^adj by part 1 of Lemma 10. Finally, if M′ ∈ 𝓜_b then b_{M′} > b_{M′}^adj = b̄_{M′} = b_{M′_u} for some M′_u > M′, so a_{M′_u} ≥ a_{M′} by part 2 of Lemma 13. Since 𝓜_a and 𝓜_b are disjoint we know that M′ ∉ 𝓜_a, so a_{M′} ≥ ā_{M′}, and combining these last two inequalities gives a_{M′_u} ≥ a_{M′} ≥ ā_{M′} ≥ ā_M. Thus, using part 1 of Lemma 10 gives b_M^adj = b_M + ā_M − a_M ≤ b_{M′_u} = b_{M′}^adj.

That the adjusted intervals are level-α is handled in parts 1a and 2a of Lemma 9, respectively, for the two nontrivial cases. Finally, note that the adjusted intervals have the same length as the original intervals, thus implying length optimality.

The related auxiliary results are given as follows.
The next lemma establishes that anywhere a “gap” a M > a M occurs in the sequence of lower endpoints of α max optimal acceptance intervals, the gap may be “filled” by shifting the interval up the needed amount while maintaining the interval’s acceptance probability and without violating monotonicity in the upper endpointb M . Lemma9. Let{[a M ,b M ] : M ∈M} beα max optimal withM⊆ [N] an interval, anda M ,b M ,M a ,M b as defined in (3.23)-(3.24). 1. IfM ∗ ∈M a then, lettingδ =a M ∗ − a M ∗ , we have (a) P M ∗ ([a M ∗ +δ,b M ∗ +δ ])≥ 1− α , (b) there existsM ℓ ∈M withM ℓ <M ∗ such thata M ℓ =a M ∗ , (c) b M ∗ +δ >b M ℓ for anyM ℓ satisfying 1b, (d) for anyM ℓ satisfying 1b, then for allM ∈[M ℓ ,M ∗ ] we have i. b M − a M ≤ b M ∗ − a M ∗ , ii. b M ≤ b M ∗ . 2. IfM ∗ ∈M b then, lettingδ =b M ∗ − b M ∗ , we have (a) P M ∗ ([a M ∗ − δ,b M ∗ − δ ])≥ 1− α , (b) there existsM u ∈M withM u >M ∗ such thatb Mu =b Mu , (c) a M ∗ − δ b M ℓ . Using this and the fact thata M ∗ +δ =a M ∗ =a M ℓ , we have [a M ℓ ,b M ℓ ]⊆ [a M ∗ +∆ ,b M ∗ +∆] for all ∆ ∈[δ ], (3.26) and thus P M ([a M ℓ ,b M ℓ ])≤ P M ([a M ∗ +∆ ,b M ∗ +∆]) for all M ∈M, ∆ ∈[δ ]. (3.27) We know that P M ∗ ([a M ∗ +δ,b M ∗ +δ ])≤ P M ∗ ([a M ∗ ,b M ∗ ]) (3.28) by Definition 1 since these intervals have the same width. If equality holds in (3.28) then part 1a is proved because the right-hand-side is≥ 1− α . Otherwise strict inequality holds in (3.28), which implies that M ∗ <M(a M ∗ +δ,b M ∗ +δ ) by Lemma 8, part (ii). Then, using unimodality and (3.27), we have P M ∗ ([a M ∗ +δ,b M ∗ +δ ])≥ P M ℓ ([a M ∗ +δ,b M ∗ +δ ])≥ P M ℓ ([a M ℓ ,b M ℓ ])≥ 1− α, finishing the proof of part 1a. For part 1d, using unimodality we have P M ∗ ([a M ∗ +δ,b M ∗ +δ ]) ≥ min(P M ∗ ([a M ∗ +δ,b M ∗ +δ ]),P M ℓ ([a M ∗ +δ,b M ∗ +δ ])) ≥ 1− α, and therefore b M − a M ≤ b M ∗ +δ − (a M ∗ +δ )=b M ∗ − a M ∗ 69 by length optimality of[a M ,b M ]. By this inequality, ifa M ∗ ≥ a M thenb M ∗ ≥ b M . 
Otherwise,a M ∗ M ∈M satisfyb M ≤ b M ∗ , thena M ∗ >a M ∗ − (b M ∗ − b M ∗ )≥ a M . 4. The sequencea M − b M +b M is nondecreasing inM ∈M b . Proof. For part 1, there must beM ℓ < M ∗ such thata M ℓ = a M ∗ > a M ∗ . Then we haveb M ℓ ≤ b M ∗ by Lemma 13, part 1. Combining these two, we have[a M ℓ ,b M ℓ ]⫋[a M ∗ ,b M ∗ ], thus P M ∗ ([a M ℓ ,b M ℓ ])<1− α ≤ P M ℓ ([a M ℓ ,b M ℓ ]), by length optimality of the latter. This implies thatM ∗ ≥ M(a M ℓ ,b M ℓ ) by Lemma 7. We also have that P M ([a M ℓ ,b M ℓ ])≤ P M ∗ ([a M ℓ ,b M ℓ ])<1− α (3.29) 70 sinceM > M ∗ ≥ M(a M ℓ ,b M ℓ ). If it were thatb M ≤ b M ℓ , then we would have[a M ,b M ] ⊆ [a M ℓ ,b M ℓ ] and then (3.29) would imply that P M ([a M ,b M ])≤ P M ([a M ℓ ,b M ℓ ])<1− α, a contradiction. Thus it must be thatb M > b M ℓ . Then we have[a M ℓ ,b M ℓ ]⊆ [a M ℓ ,b M ] and[a M ,b M ]⊆ [a M ℓ ,b M ], P M ℓ ([a M ℓ ,b M ])≥ P M ℓ ([a M ℓ ,b M ℓ ])≥ 1− α and P M ([a M ℓ ,b M ])≥ P M ([a M ,b M ])≥ 1− α. Thus, by unimodality,P M ∗ ([a M ℓ ,b M ])≥ 1− α , and so by length optimality we have b M ∗ − a M ∗ ≤ b M − a M ℓ . (3.30) Note thatb M ∗ +a M ∗ − a M ∗ =a M ℓ +b M ∗ − a M ∗ . By the inequality (3.30), we have b M ∗ +a M ∗ − a M ∗ =a M ℓ +b M ∗ − a M ∗ ≤ a M ℓ +b M − a M ℓ =b M , concluding the proof of part 1. For 2, consider M 1 ,M 2 ∈M a with M 1 < M 2 . We have a M 1 ≤ a M 2 . If strict inequality holds then there is M ℓ,2 ∈M satisfying that M 1 < M ℓ,2 < M 2 and a M ℓ,2 = a M 2 > a M 1 . Then using part 1 and Lemma 9, respectively, for the following inequalities, b M 1 +a M 1 − a M 1 ≤ b M ℓ,2 P(a) and takey > b, the other cases being similar. Since P(·) is increasing at some point between a and y, it must be that the modem in (3.6) exceedsa. Ifm≤ b+1 thenP(b+1)≥ P(y)>P(a), so P([a+1,b+1])− P([a,b])=P(b+1)− P(a)>0, contradicting the probability maximizing property of[a,b]. Otherwisem>b+1 soP(b+1)≥ P(a). If strict inequality holds here then the same argument applies. 
Otherwise q :=P(a)=P(a+1)=...=P(b+1)=...=P(z ∗ − 1)0, again contradicting the probability maximizing property of[a,b]. The following lemma establishes that, inα max optimal intervals, monotonicity can only be violated one endpoint at a time. This property is used to prove that the setsM a andM b in Theorem 2 are disjoint. Lemma 13. LetM = [M 1 ,M 2 ] ⊆ [N] be a nonempty interval and{[a M ,b M ] : M ∈ M} be α max optimal. 1. IfthereexistsM ′ ∈[M 1 +1,M 2 ]and0<d≤ M ′ − M 1 suchthata M ′ − d >a M ′,thenb M ′ − d ≤ b M ′. 2. If there existsM ′′ ∈ [M 1 ,M 2 − 1] and0 < d≤ M 2 − M ′′ such thatb M ′′ +d < b M ′′, thena M ′′ +d ≥ a M ′′. 73 Proof of Lemma 13. SupposeM ′ andd satisfy the hypothesis of the first claim. If P M ′(a M ′) orP M ′ − d (b M ′ − d ) were zero, those points would have been removed from their respective interval, producing shorter inter- vals, implying both probabilities must be positive. It then follows from (3.2) that b M ′ − d ≤ M ′ − d<M ′ ≤ N− n+a M ′. (3.32) Sincea M ′ / ∈ [a M ′ − d ,b M ′ − d ], we haveP M ′ − d (a M ′) ≤ P M ′ − d (b M ′ − d ) by Lemma 12. Applying Lemma 6 withx 1 =a M ′,x 2 =b M ′ − d , and using (3.32), we have P M ′(b M ′ − d ) P M ′(a M ′) > P M ′ − d (b M ′ − d ) P M ′ − d (a M ′) ≥ 1. By Lemma 12 this implies thatb M ′ − d ∈[a M ′,b M ′], and in particular the first claim. The second claim is proved similarly after reflecting the intervals about n/2 andM aboutN/2, and using Lemma 11. 3.5 α optimal,admissable,nondecreasingacceptanceintervals We first verify that Algorithm 2 is α max optimal in Section 3.5.1. This sets the stage to apply Theorem 2 to the Algorithm 2 acceptance intervals {[a M ,b M ]: M ∈[⌊N/2⌋]} (3.33) to obtain the modified intervals {[a adj M ,b adj M ]: M ∈[⌊N/2⌋]}. (3.34) 74 Algorithm2: Givenα ,n, andN, produce a set of levelα acceptance intervals{[a M ,b M ]: M ∈ [N]}. 
Input :N ∈N,n≤ N and0<α< 1 1 forM =0,...,⌊N/2⌋do 2 x min ← max{0,M +n− N};x max ← min{n,N};C,D←⌊ (n+1)(M+1) N+2 ⌋;P ← P M (C) 3 if C >x min then 4 PC← P M (C− 1) 5 else 6 PC← 0 7 end 8 if D <x max then 9 PD← P M (D+1) 10 else 11 PD← 0 12 end 13 whileP <1− α do 14 if PD >PC then 15 D← D+1;P ← P +PD 16 if D <x max then 17 PD← P M (D+1) 18 else 19 PD← 0 20 end 21 else 22 C← C− 1;P ← P +PC 23 if C >x min then 24 PC← P M (C− 1) 25 else 26 PC← 0 27 end 28 end 29 end 30 a N− M ← n− b M 31 b N− M ← n− a M 32 end Output:{[a M ,b M ]} N M=0 A set of acceptance intervals{[a M ,b M ]:M ∈[N]} is admissible if the intervals are equivariant with respect to the reflections M 7→N− M,[a M ,b M ]7→[n− b M ,n− a M ]. That is, if [a N− M ,b N− M ]=[n− b M ,n− a M ] for allM ∈[N]. (3.35) 75 We seek admissible acceptance intervals because they will result in admissible confidence intervals (defined analogously to (3.35) below) upon inversion in Section 3.6. From (3.34), one way to achieve this is to define [a adj M ,b adj M ] forM >⌊N/2⌋ as the reflection of the intervals (3.34) across n/2. This achieves admissibility everywhere except atM =N/2 whenN is even and[a adj N/2 ,b adj N/2 ] is not symmetric aboutn/2. This special case is addressed in Section 3.5.2, where Theorem 4 verifies that taking the M =N/2 interval to be (3.50) does not violate monotonicity. Theorem 4 is general and applies to not just Algorithm 2 but any α max optimal intervals (3.33), thus these modification steps could be carried out starting from any α max optimal intervals resulting inα optimal, admissable acceptance intervals (3.54) with nondecreasing endpoints; this fact is recorded in Theorem 5. We call the result of applying Theorem 5 to Algorithm 2, Algorithm 3. In the next theorem, we want to show that the reflection procedure can give monotonic α optimal intervals. Theorem3. Given anα max optimal set{[a M ,b M ] : M ∈ [⌊N/2⌋]}. 
We construct monotonicα optimal {[a adj M ,b adj M ]: M ∈[⌊N/2⌋]} by applying Theorem 2. We reflect the intervals to get [a adj M ,b adj M ]=[n− b adj M ,n− a adj M ] for allM =⌊N/2⌋+1,...,N. Then,{[a adj M ,b adj M ]: M ∈[N]} is nondecreasing andα optimal. Proof of Lemma 3. Let [a N− M ,b N− M ]=[n− b M ,n− a M ] for allM =⌊N/2⌋+1,...,N. Then, we have{[a M ,b M ]: M ∈[N]} is anα max optimal set by Lemma 11. Note that we don’t directly construct an adjustedα optimal set according to thisα max optimal set. 76 By Theorem 2,{[a adj M ,b adj M ] : M ∈ [⌊N/2⌋]} is nondecreasing andα optimal, so{[a adj M ,b adj M ] : M = ⌊N/2⌋ + 1,...,N} is also nondecreasing and α optimal by symmetry. Then, we want to show that nondecrease still holds after combining them. Now, we consider M h = ⌊N/2⌋− 1, forN is even; ⌊N/2⌋, forN is odd. (3.36) We want to show that a adj M h ≤ a adj N− M h andb adj M h ≤ b adj N− M h . (3.37) These two inequalities are resulted from a adj M h ≤ n− b adj M h . (3.38) If no adjustment applied to[a M h ,b M h ], i.e. [a M h ,b M h ] = [a adj M h ,b adj M h ], then the inequalities (3.37) hold by Lemma 15. If we need to adjust[a M h ,b M h ], we have two cases. Case 1: M h ∈M b (i.e. adjusting down). By Lemma 15,n− b adj M h ≥ a M h >a adj M h . Case 2:M h ∈M a (i.e. adjusting up). It means that there existM ∗ <M h such thata M ∗ =a M h =a adj M h according to Theorem 2. According to Lemma 15, a M h = a M ∗ ≤ n− b M ∗ = a N− M ∗ . By part 1 in Lemma 10,b M h − a M h ≤ b N− M ∗ − a M h because{[a M ,b M ]: M ∈[N]} isα max optimal. Then, n− b adj M h =n− (a M h +b M h − a M h )≥ n− (a M h +b N− M ∗ − a M h )=n− b N− M ∗ =a M ∗ =a adj M h . Therefore, the inequality (3.38) holds, and implies inequalities (3.37). 77 IfN is odd, then{[a adj M ,b adj M ] : M ∈ [N]} is nondecreasing by inequalities (3.37). IfN is even, then we need to look at whether[a adj N/2 ,b adj N/2 ] preserves nondecrease. 
In other words, we want to show that a adj M h ≤ a adj N/2 ≤ a adj N− M h and b adj M h ≤ b adj N/2 ≤ b adj N− M h . (3.39) Now, consider evenN. By Lemma 15 and 14, we havea N/2 ≥ a M forM < N/2, i.e. a N/2 = a N/2 , so a N/2 / ∈M a . Also, N/2 is the greatest integer in [⌊N/2⌋], so a N/2 / ∈M b . Therefore, [a adj N/2 ,b adj N/2 ] = [a N/2 ,b N/2 ]. By Lemma 14 and inequality (3.38), we haveb N/2 ≤ n− a adj M h =b adj N− M h . Then, b adj M h ≤ b adj N/2 =b N/2 ≤ b adj N− M h (3.40) For the nondecrease of lower boundary, we will prove by contradiction. We assumea adj N/2 > a adj N− M h , i.e. a N/2 >n− b adj M h . We immediately haven− b adj M h / ∈[a N/2 ,b N/2 ]. By Lemma 12, P N/2 (a N/2 )≥ P N/2 (n− b adj M h ). (3.41) Ifb adj M h <n− b adj M h , we haven− b adj M h >n/2. The mode (3.6) forM =N/2 ism=(n+1)/2. Because n− b adj M h is an integer, we have a N/2 >n− b adj M h ≥⌊ m⌋, (3.42) soP N/2 (a N/2 ) n− b adj M h , b adj M h > n/2. The mode (3.6) is still m = (n + 1)/2. Similarly, b adj M h ≥ ⌊ m⌋. If a N/2 > b adj M h , then b adj M h / ∈ [a N/2 ,b N/2 ], so we have P N/2 (a N/2 ) ≥ P N/2 (b adj M h ). However, we have P N/2 (a N/2 ) < P N/2 (b adj M h ) because a N/2 > b adj M h ≥ ⌊ m⌋. This means that we must have a N/2 ≤ b adj M h . According to inequality (3.37),b adj M h ∈[a N/2 ,b N/2 ]. 78 The related auxiliary results are given as follows. Lemma14. ForM <N/2,considerthat[a M ,b M ]isanα optimalinterval,andb M ≤ n− a M .[a N/2 ,b N/2 ] isα max optimal. Then,[a N/2 ,b N/2 ]⊆ [a M ,n− a M ]. Proof of Theorem 14. We have[a M ,b M ]⊂ [a M ,n− a M ] and[n− b M ,n− a M ]⊂ [a M ,n− a M ]. Then, P M ([a M ,b M ])≥ P M ([a M ,b M ])≥ 1− α ; P N− M ([n− b M ,n− a M ])≥ P N− M ([n− b M ,n− a M ])≥ 1− α. By unimodality, we have P N/2 ([a M ,n− a M ]>1− α. (3.43) Toward contradiction, suppose thata N/2 P N/2 (a N/2 ), 79 since this last is positive, and this contradicts (3.44). Ifb N/2 >n− a M then similar arguments apply. Lemma15. 
Let{[a M ,b M ]|M =[⌊N/2⌋]} beα max optimal . For a fixed M <N/2,n− a M ≥ b M . Proof of Lemma 15. Let{[a M ,b M ] | [⌊N/2⌋]} be α max optimal. For a fixed M < N/2, note that [n− b M ,n− a M ] should beα max optimal. We want to show thatn− b M ≥ a M . Assumen− b M < a M , which meansb M > n− a M . By Lemma 12,P N− M (b M )≤ P N− M (n− b M ), sinceb M / ∈[n− b M ,n− a M ]. Similarly, we haveP M (b M )≥ P M (n− b M ), sincen− b M / ∈[a M ,b M ]. We know thatb M ≤ M <N/2<N− M ≤ N− n+n− b M . By Lemma 6, P N− M (b M ) P N− M (n− b M ) > P M (b M ) P M (n− b M ) ≥ 1. This contradicts toP N− M (b M )≤ P N− M (n− b M ). Therefore,n− b M ≥ a M , which meansn− a M ≥ b M . 3.5.1 Algorithm2isα maxoptimal Lemma 16. The acceptance intervals{[a M ,b M ] : M ∈ [⌊N/2⌋]} produced by Algorithm 2 are α max optimal. Proof. Fix arbitraryM ∈[⌊N/2⌋] and throughout this proof denote[a M ,b M ] simply by[a,b], andP M by P . Each interval produced by Algorithm 2 is level-α by construction; see the “while” loop in its definition. To verify the other two properties, we first establish that P(y)≥ min{P(a),P(b)}≥ P(x) for all y∈[a,b], x / ∈[a,b]. (3.46) 80 SinceP(·) is unimodal (see Lemma 4) its minimum on any interval is attained at one or both endpoints. At each stage Algorithm 2 discards one such endpoint, leaving an interval whose endpoints are easily seen by induction to be at least as probable as any of the points outside. In particular the final interval has endpoints – hence all points – at least as probable as any point outside. This establishes (3.46) and shows that among intervals of the same length, it clearly maximizes the acceptance probability. To verify size optimality, without loss of generality assumeP(a) = min{P(a),P(b)}. By definition of the algorithm,P([a+1,b]) < 1− α since otherwisea would have been replaced bya+1. Consider a setS ∗ such that|S ∗ | <|[a,b]|, so|S ∗ |≤| [a,b]|− 1 = b− a =|S|, whereS = [a+1,b]. 
This implies that the number of points in S \ S ∗ is no less than the number in S ∗ \ S. This and (3.46) imply that P(S\S ∗ )≥ P(S ∗ \[a+1,b]). Therefore, P(S ∗ )=P(S ∗ \S)+P(S ∗ ∩S)≤ P(S\S ∗ )+P(S ∗ ∩S)=P(S)<1− α, (3.47) showing that[a,b] is size optimal. 3.5.2 Modificationat M =N/2 Throughout this section we consider the case whenN is even so that⌊N/2⌋=N/2 is an integer, and let {[a adj M ,b adj M ]: M ∈[N/2]} (3.48) denote the modification of an arbitrary set of α max optimal intervals {[a M ,b M ]: M ∈[N/2]} (3.49) 81 via Theorem 2. The interval[a adj N/2 ,b adj N/2 ] may not be symmetric aboutn/2, and thus to achieve this it will be replaced in Section 3.5.3 by the symmetric,α optimal interval [h α/ 2 ,n− h α/ 2 ], where h α/ 2 =max x∈[n]: P N/2 (X <x)≤ α/ 2 . (3.50) We note thath α/ 2 ≤ n/2 since P N/2 (X <⌊n/2⌋+1)=P N/2 (X ≤⌊ n/2⌋)≥ 1/2>α/ 2. By definition, (3.50) is the smallest admissible level- α acceptance interval for H(N/2). Theorem 4 verifies that making this choice for the M =N/2 interval preserves monotonicity of the endpoints. Theorem4. Suppose that{[a adj M ,b adj M ] : M ∈ [N]} areα optimal acceptance intervals with nondecreasing endpoint sequences{a adj M } and{b adj M }, and are admissible, which may be written a adj M +b adj N− M =n for M ̸=N/2. (3.51) Let[a ′ ,b ′ ] denote the smallest level-α acceptance interval forM = N/2 that is admissible – that is, satisfies (3.51) forM = N/2 or, equivalently, has midpointn/2. Then replacing[a adj N/2 ,b adj N/2 ] by[a ′ ,b ′ ] preserves the endpoint monotonicity. Proof of Theorem 4. FixM =N/2 and denote[a adj M ,b adj M ] by[a M ,b M ]. It is easily checked that the require- ment that[a M ,b M ] or a replacement of it preserves monotonicity can be expressed as [a M+1 ,b M− 1 ]⊆ [a M ,b M ]⊆ [a M− 1 ,b M+1 ]. (3.52) 82 LetI andJ denote the intervals on the left and right sides of (3.52). It suffices to show that I ⊆ [a ′ ,b ′ ]⊆ J. 
(3.53) Since J contains the α optimal interval [a M ,b M ], its P M probability is at least 1− α and by (3.51) it is symmetrical. Since symmetrical intervals all have the same midpoint, it follows that J contains the smallest interval with those two properties, which is[a ′ ,b ′ ]. It remains to show that [a ′ ,b ′ ] contains I. If [a M ,b M ] is symmetrical, then since it is α optimal, it equals [a ′ ,b ′ ], which therefore contains I. If [a M ,b M ] is not symmetrical, then I, which is symmetrical by (3.51), is strictly contained in[a M ,b M ], which isα optimal. ThereforeP M (I) is less than1− α , hence less thanP M ([a ′ ,b ′ ]). Since the symmetrical intervalsI and[a ′ ,b ′ ] have the same midpoint, evidentlyI is strictly contained in[a ′ ,b ′ ]. Remark 1. It is not necessary to use special calculations to geth α/ 2 in [h α/ 2 ,n− h α/ 2 ] = [a ′ ,b ′ ], since it is easily obtained from anα max optimal interval[a N/2 ,b N/2 ] bya ′ = min{a N/2 ,n− b N/2 },b ′ = n− a ′ . Notethatif[a N/2 ,b N/2 ]isalreadysymmetrical,[a ′ ,b ′ ]isthesameinterval,sothatrecalculating[a N/2 ,b N/2 ] in this way as part of Algorithm 3, for instance, would suffice. 3.5.3 Puttingitalltogether: Algorithm3 Starting with a set ofα max optimal intervals{[a M ,b M ] : M ∈ [⌊N/2⌋]} we can now define a new set of intervals{[a ∗ M ,b ∗ M ] : M ∈ [N]} by (i) applying the adjustments in Theorem 2, (ii) reflecting across n/2 to obtain admissible intervals forM >[⌊N/2⌋], and (iii) ifN is even setting[a ∗ N/2 ,b ∗ N/2 ] to be (3.50). The next theorem shows that the resulting intervals are α optimal, admissible, and have nondecreasing endpoint sequences. 83 Theorem 5. Given a set ofα max optimal intervals{[a M ,b M ] : M ∈ [⌊N/2⌋]}, let{[a adj M ,b adj M ] : M ∈ [⌊N/2⌋]} denote the result of applying Theorem 2, and [a ∗ M ,b ∗ M ]= [a adj M ,b adj M ], forM =0,1,...,⌈N/2⌉− 1; [n− b adj N− M ,n− a adj N− M ], forM =⌊N/2⌋+1,...,N; [h α/ 2 ,n− h α/ 2 ], forM =N/2 ifN is even. 
(3.54) Then{[a ∗ M ,b ∗ M ] : M ∈ [N]} are level-α , admissible, have nondecreasing endpoint sequences, and are size- optimal except possibly for M = N/2 when N is even; in this case, [a ∗ N/2 ,b ∗ N/2 ] is size-optimal among all admissible, level-α intervals. Note that ifN is odd then the first two cases of (3.54) cover all M ∈[N]. Proof. Admissibility is by construction, and monotonicity follows from Theorem 4 at M = N/2 (when N is even) and Theorem 2 everywhere else. Theorem 2 also establishes α -optimality in the first case of (3.54), and to establish it in the second case, letM >⌊N/2⌋ andX ∼ Hyper(M,n,N). Then P M (X ∈[a ∗ M ,b ∗ M ])=P M (X ∈[n− b ∗ N− M ,n− a ∗ N− M ]) =P M (n− X ∈[a ∗ N− M ,b ∗ N− M ])≥ 1− α, (3.55) this last by the first case of (3.54) and since n− X ∼ Hyper(N − M,n,N); see Lemma 4. For size optimality, we will show thatP M ([c,d])≥ 1− α impliesd− c≥ b ∗ M − a ∗ M . By an argument similar to the one above,[n− d,n− c] is level-α for testingH(N− M) and thus no shorter than[a ∗ N− M ,b ∗ N− M ], so d− c=(n− c)− (n− d)≥ b ∗ N− M − a ∗ N− M =(n− a ∗ M )− (n− b ∗ M )=b ∗ M − a ∗ M , 84 as claimed. ForN even, the third case of (3.54) is clearly level-α , and any competing admissible interval forH(N/2) must be of the form[c,n− c]. If this is level-α thenc≤ h α/ 2 , thus it can be no shorter than (3.50). We call Algorithm 3 the result of applying Theorem 5 to Algorithm 2 which, for ease of use, is also given in the algorithmic form. Algorithm 3: Given α , n, and N, calculate a set of level-α acceptance intervals{[a ∗ M ,b ∗ M ] : M ∈[N]}. 
Input :N ∈N,n≤ N and0<α< 1 1 forM =0,...,⌊N/2⌋do 2 C← 0;D← n; 3 whileP M ([C,D])≥ 1− α do 4 a M ← C;b M ← D; 5 if P M (C)b ∗ M+1 ;then 21 a ∗ M =a ∗ M +b ∗ M+1 − b ∗ M ;b ∗ M =b ∗ M+1 ; 22 end 23 a ∗ N− M ← n− b ∗ M ;b ∗ N− M ← n− a ∗ M ; 24 end 25 if N is even then 26 a ∗ N/2 ← max x|P N/2 (X <x)≤ α/ 2 ;b ∗ N/2 ← n− a ∗ N/2 ; 27 else 28 a ∗ ⌊N/2⌋+1 ← n− b ∗ ⌊N/2⌋ ;b ∗ ⌊N/2⌋+1 ← n− a ∗ ⌊N/2⌋ ; 29 end Output:{[a ∗ M ,b ∗ M ]} N M=0 85 3.6 Optimaladmissibleconfidenceintervals 3.6.1 Confidenceandacceptancesets For a setS let2 S denote the power set ofS, i.e., the set of all subsets ofS. Aconfidencesetwithconfidence level1− α is a functionC :[n]→2 [N] such that the coverage probability satisfies P M (M ∈C(X))≥ 1− α for all M ∈[N]. (3.56) For short, we refer to such aC as a(1− α )-confidence set. If a confidence set C is interval-valued (i.e., for allx∈[n],C(x) is an interval) we call it a confidence interval . A confidence set C is admissible if C(x)=N− C(n− x) for allx∈[n]. (3.57) Here, for a setS, the notationN − S means{N − s : s ∈S}. Admissibility (3.57) is an equivariance condition requiring that the confidence set is reflected about N/2 when the data is reflected about n/2. See also Section 3.7 for how this definition compares with that of W. Wang [132]. Similarly, we shall denote a level-α acceptance set by a functionA :[N]→2 [n] such that P M (X ∈A(M))≥ 1− α for all M ∈[N], (3.58) and call an interval-valued (i.e., for all M ∈ [N],A(M) is an interval) acceptance set an acceptance interval † and writeA(M)=[a M ,b M ], or similar. † Note that whereas above we referred to an expression like (3.33) as a set of acceptance intervals, we will now call it an acceptance interval (singular). 
This is to coincide with our terminology for a confidence set, as well as avoid cumbersome phrases like “a set of acceptance sets.” 86 We also need to generalize the concept of admissibility from (3.35) to handle general sets, so we say that an acceptance setA is admissible if A(M)=n− A(N− M) for all M ∈[N]. (3.59) This says that the set is equivariant with respect to reflections M 7→N− M, and specializes to (3.35) for intervals. 3.6.2 Invertedconfidencesets We will construct confidence sets that are inversions of acceptance sets, and vice-versa. If A is a level-α acceptance set then C A (x)={M ∈[N]: x∈A(M)} (3.60) is a(1− α )-confidence set. Conversely, given a (1− α )-confidence set C, A C (M)={x∈[n]: M ∈C(x)} (3.61) is a level-α acceptance set; see, for example, [100, Chapter 9.3]. Moreover,C A C =C andA C A =A, which are immediate from the definitions. However, neither A norC being interval-valued guarantees that its inversion is. We will evaluate confidence and acceptance sets by their total size, which we define as the sum of the cardinalities of each set: Recalling that|·| denotes set cardinality, define the total size of acceptance and confidence sets to be |A|= N X M=0 |A(M)| and |C|= n X x=0 |C(x)|. (3.62) 87 IfA(M)=[a M ,b M ] is an acceptance interval then |A|= N X M=0 |[a M ,b M ]|= N X M=0 (b M − a M +1), (3.63) and similarly for a confidence interval C. Lemma 17 records some basic facts about inverted confidence sets. Lemma17. LetA be an acceptance set. Then the following hold. 1. |C A |=|A|. (3.64) 2. C A is admissible if and only ifA is admissible. 3. If, in addition,A(M) = [a M ,b M ] is interval-valued and the endpoint sequences{a M } and{b M } are nondecreasing, thenC A is interval-valued. Proof. DenoteC A simply byC. For part 1, letting1{·} denote the indicator function, |C|= X x∈[n] |C(x)|= X x∈[n] |{M ∈[N]:x∈A(M)}| = X x∈[n],M∈[N] 1{x∈A(M)}= X M∈[N] |{x∈[n]:x∈A(M)}| = X M∈[N] |A(M)|=|A|. 
(3.65) 88 For part 2, ifA is admissible, C(x)={M ∈[N]: x∈A(M)}={M ∈[N]: x∈n− A(N− M)} ={M ∈[N]: n− x∈A(N− M)}={N− M ∈[N]: n− x∈A(M)} =N−{ M ∈[N]: n− x∈A(M)}=N− C(n− x). A similar argument shows the converse. For part 3, fix arbitrary x ∈ [n] and to show thatC(x) is an interval, suppose that M 1 ,M 2 ∈C(x) and we will show that M ∈ C(X) for all M 1 < M < M 2 . Since M 2 ∈ C(x), x ∈ [a M 2 ,b M 2 ] so x ≥ a M 2 ≥ a M by monotonicity. By a similar argument, x ≤ b M 1 ≤ b M , thus x ∈ [a M ,b M ] so M ∈C(x). 3.6.3 Sizeoptimality We say that a confidence set C is size-optimal among a collection of confidence sets if it achieves the minimum total size in that collection. The results in this section establish size-optimality ofC ∗ =C A ∗ , whereA ∗ = {[a ∗ M ,b ∗ M ] : M ∈ [N]} denotes the result of applying Theorem 5 to any α max optimal acceptance intervals{[a M ,b M ] : M ∈ [N]}. Thus,A ∗ could be the intervals given by Algorithm 3, or the result of starting with any otherα max optimal intervals. Whatever the choice ofA ∗ , note thatC ∗ is an admissible,(1− α )-confidence interval by Lemma 17 Theorems 6 and 7, which follow, are the main results of the chapter. Theorem 6 is the more powerful of the two in that it gives wide conditions under whichC ∗ is size-optimal among admissible confidence sets (not just intervals) and shows that, even in the worst case, the total size|C ∗ | is at most1 point larger than the optimal set. Theorem 7 specializes to intervals and gives conditions for optimality there. In particular, 89 it shows thatC ∗ is size-optimal among all admissible non-empty (i.e.,C(x)̸=∅ for allx) intervals, which are usually preferred in practice. Theorem6. LetC ∗ beasdefinedaboveand C S theclassofalladmissible,(1− α )-confidencesets. Then C ∗ is size-optimal inC S , i.e., |C ∗ |= min C∈C S |C|, (3.66) if either of the following holds: (a) n orN is odd; (b) n,N are even and there is no size-optimalC ∈ C S such that |A C (N/2)| is even. 
If n, N, and |A C (N/2)| are all even for some size-optimalC∈C S , then |C ∗ |≤ min C∈C S |C|+1. (3.67) In addition,C ∗ is size-optimal among allC∈C S such thatA C are all intervals. See also Example 3.7.3 for an instance ofC ∗ failing to be optimal under conditions satisfying part (b). Proof of Theorem 6. First supposeN is odd, and letC ∈ C S be arbitrary; we will show that|C ∗ | ≤ | C|. Since N/2 is not an integer, by Lemma 18 we have|A ∗ (M)| ≤ | A C (M)| for all M ∈ [N] thus, using Lemma 17, |C|=|A C |= N X M=0 |A C (M)|≥ N X M=0 |A ∗ (M)|=|A ∗ |=|C ∗ |, (3.68) as claimed. Now supposeN is even and letC∈C S be size-optimal. By Lemma 18 we have|A ∗ (M)|≤| A C (M)| for all M ∈ [N] other than M = N/2. If n or|A C (N/2)| is odd, then by Lemma 19 there is an in- terval [a,n− a] such that n− 2a + 1 = |A C (N/2)| and P N/2 ([a,n− a]) ≥ P N/2 (A C (N/2)). Since 90 A ∗ (N/2)=[a ∗ N/2 ,b ∗ N/2 ]=[a ∗ N/2 ,n− a ∗ N/2 ] is the shortest admissible acceptance interval forM =N/2, we have |A ∗ (N/2)|=b ∗ N/2 − a ∗ N/2 +1≤ n− 2a+1=|A C (N/2)|. (3.69) This, with the above inequality for theM ̸=N/2 cases, establishes (3.68) in this case. The remaining case – when N, n, and|A C (N/2)| are all even – is handled by Lemma 20, recalling thatC was size-optimal to establish (3.67). For the final statement in the theorem, for any such C,A C is admissible and thus has total size at least |A ∗ |, so|C|=|A C |≥| A ∗ |=|C ∗ |. Theorem7. LetC ∗ beasdefinedaboveand C I theclassofalladmissible,(1− α )-confidenceintervals. Then C ∗ is size-optimal inC I , i.e., |C ∗ |= min C∈C I |C|, (3.70) if either of the following holds: (a) n orN is odd; (b) n,N are even and there is no size-optimalC∈C I such that C(n/2)=∅. (3.71) A sufficient condition for C ∗ to be size-optimal in this case is that α< N/2 n/2 2 , N n . (3.72) In particular,C ∗ is size-optimal among all nonemptyC∈C I regardless of the parity ofn,N. 
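The sufficient condition (3.72) is α < C(N/2, n/2)² / C(N, n), which by the proof of Theorem 7 is exactly P_{N/2}(X = n/2). It is easy to evaluate numerically; a sketch using Python's `math.comb` and scipy's hypergeometric pmf (our choice of tools, not part of the method):

```python
from math import comb
from scipy.stats import hypergeom

# Sufficient condition (3.72) of Theorem 7:
#     alpha < C(N/2, n/2)**2 / C(N, n) = P_{N/2}(X = n/2),
# evaluated for the even pair N = 500, n = 100 used in Section 3.7.
N, n = 500, 100
threshold = comb(N // 2, n // 2) ** 2 / comb(N, n)

# The same quantity via the hypergeometric pmf (scipy's argument
# order is pmf(k, population, n_successes, n_draws)).
pmf_value = hypergeom.pmf(n // 2, N, N // 2, n)
assert abs(threshold - pmf_value) < 1e-9

# alpha = 0.05 lies below the threshold here, so (3.72) holds.
assert 0.05 < threshold
```

Since the threshold is the central pmf value of the Hyper(N/2, n, N) distribution, it shrinks roughly like 1/√n, so (3.72) is most likely to bind for large n and small α.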
91 We comment that the scenario (3.71) seems to be particularly rare sinceC(n/2) is typically the widest confidence interval. Thus, even allowing empty intervals, Theorem 7 establishes size optimality of C ∗ among intervals for most intents and purposes, and (3.67) holds in any case. However, it may be possible to construct an adversarial example with that property. Proof of Theorem 7. Part (a) is a consequence of Theorem 6 sinceC I ⊆ C S . AssumeN andn are even, and there is no size-optimalC satisfying (3.71). LetC be any size-optimal interval and sinceC(n/2) ̸= ∅, there is someM ∈C(n/2). BecauseC(n/2) is admissible, N − M ∈ C(n/2), and becauseC(n/2) is an interval, N/2 ∈C(n/2) since it lies between M and N − M. This implies thatn/2∈A C (N/2), which is admissible and hence symmetric aboutn/2. Using these facts, |A C (N/2)|=2|{x∈A C (N/2)|x<n/2}|+|{n/2}| =2|{x∈A C (N/2)|x<n/2}|+1, which is odd. We then have|C ∗ |≤| C| by Lemma 20. To see that (3.72) is sufficient, suppose there is a C withC(n/2)=∅. Then forM =N/2, we have α ≥ P M (M ̸∈C(X))≥ P M (X =n/2)= N/2 n/2 2 , N n . The proofs of Theorems 6and 7 utilize some auxiliary lemmas, stated and proved as follows. Lemma18. In the setting of Theorem 6, for anyC∈C S andM ̸=N/2, |A ∗ (M)|≤| A C (M)|. (3.73) 92 Proof. Fix M ∈ [N], M ̸= N/2. By Lemma 21 there is an interval [a M ,b M ] such that b M − a M +1 = |A C (M)| andP M ([a M ,b M ]) ≥ P M (A C (M)). SinceM ̸= N/2,A ∗ (M) = [a ∗ M ,b ∗ M ] is size-optimal by Theorem 5 so |A ∗ (M)|=b ∗ M − a ∗ M +1≤ b M − a M +1=|A C (M)|. (3.74) Lemma19. ForevenN,assumeA⊆ [n]isnonemptyandsuchthatx∈A⇒n− x∈A. Thenthereexists c∈[n] such that P N/2 ([c,n− c])≥ P N/2 (A) (3.75) and n− 2c+1= |A|, ifn or|A| is odd, |A|+1, ifn and|A| are even. (3.76) Proof. Ifn is odd thenx̸=n− x for allx∈[n], implying that|A| is even. Letc=(n−| A|+1)/2, an integer. 
If|A|=0 then there is nothing to prove for (3.75), so assume|A|≥ 2, whencec≤ (n− 1)/2=:m, which by (3.6) is the mode of the Hyper(N/2,n,N) density. Thusm∈ [c,n− c] and by (3.5) this density takes the same value at the endpointsc andn− c. Combining these facts implies thatP N/2 (x 1 )≥ P N/2 (x 2 ) for anyx 1 ∈ [c,n− c] andx 2 / ∈ [c,n− c]. ThenP N/2 ([c,n− c])≥ P N/2 (A) now follows from this and the fact that these two sets have the same number of points,n− 2c+1=|A|. Ifn is even and|A| is odd, thenc=(n−| A|+1)/2 is still an integer. If|A|=1 then[c,n− c]={n/2}, the point maximizing P N/2 (·) by Lemma 4, hence (3.75) holds. Otherwise,|A| ≥ 3 and c < m so the argument in the previous paragraph applies. 93 Ifn and|A| are both even, letc=(n−| A|)/2, an integer, andB =[c,n− c− 1]. By unimodality and symmetry ofP N/2 (·) aboutn/2 we have that min x∈B P N/2 (x)=P N/2 (c)=P N/2 (n− c)=max x̸∈B P N/2 (x). It follows from this and|B|=n− 2c=|A| thatP N/2 (B)≥ P N/2 (A), thusP N/2 ([c,n− c])≥ P N/2 (B)≥ P N/2 (A). Lemma20. In the setting of Theorem 6, supposen andN are even. Then, forC∈C S , |C ∗ |≤ |C|, if|A C (N/2)| is odd, |C|+1, if|A C (N/2)| is even. (3.77) Proof. We have|A ∗ (M)| ≤ | A C (M)| for all M ̸= N/2 by Lemma 18. A C (N/2) is admissible, hence symmetric, so by Lemma 19 there is an interval[a,n− a] such thatP N/2 ([a,n− a])≥ P N/2 (A C (N/2)) and n− 2a+1≤ |A C (N/2)|, if|A C (N/2)| is odd, |A C (N/2)|+1, if|A C (N/2)| is even. (3.78) SinceA ∗ (N/2) = [a ∗ N/2 ,b ∗ N/2 ] = [a ∗ N/2 ,n− a ∗ N/2 ] is the shortest admissible acceptance interval for M =N/2, we have |A ∗ (N/2)|=b ∗ N/2 − a ∗ N/2 +1≤ n− 2a+1, (3.79) which is thus≤ the right-hand-side of (3.78). This, with the above inequality for the M ̸= N/2 cases, gives the desired result after summing in an argument like (3.68). Lemma21. ForanyA⊆ [n],thereexistsaninterval[a,b]⊆ [n]suchthatb− a+1=|A|andP M ([a,b])≥ P M (A). 94 Proof. 
BecauseP M (x) is unimodal with respect tox, we can find an interval [a,b] satisfyingb− a+1=|A| andP M (x 1 )≥ P M (x 2 ) for anyx 1 ∈ [a,b] andx 2 / ∈ [a,b]. LetB ={x = 0,...,n| x∈ [a,b]/A}. Then, |B|=|A\[a,b]|, sinceb− a+1=|A|. Therefore, P M (A)=P M (A∩[a,b])+ X x∈A\[a,b] P M (x) ≤ P M (A∩[a,b])+ X x∈B P M (x)=P M ([a,b]). 3.7 Examplesandcomparisons In this section we show examples of our proposed methodC ∗ using Algorithm 3 as the acceptance in- tervalA ∗ , and give some comparisons with other methods. We focus on exact methods with guaranteed coverage probability. A standard method for producing a(1− α )-confidence interval for M is the so-calledmethodofpivoting the c.d.f. ‡ givingC Piv (x)=[L Piv (x),U Piv (x)] where, for fixed nonnegative α 1 +α 2 =α , L Piv (x)=min{M ∈[N]:P M (X ≥ x)>α 1 }, (3.80) U Piv (x)=max{M ∈[N]:P M (X ≤ x)>α 2 }. (3.81) See [23], [26, Chapter 9], or [66]. Takingα 1 = α 2 = α/ 2 is a common choice, and all our calculations of C Piv below use this. W. Wang [132] proposed a method producing a(1− α )-confidence interval for M, which we denote byC W , that cycles through the intervalsC Piv (x), shrinking the intervals where possible while checking ‡ We have heard this method alternatively called the quantile method and the method of extreme tails. 95 that coverage probability is maintained. The algorithm can require multiple passes through the intervals, calculating the coverage probability for all M ∈ [N] multiple times, and is therefore computationally intensive. We compare the computational times ofC W andC ∗ in Examples 3.7.1 and 3.7.2. All calculations ofC W (x) were performed using that author’sR code. Although W. Wang proves that a 1-sided version of his algorithm produces size-optimal intervals (among 1-sided intervals), it is not claimed thatC W is size-optimal. SinceC W produces nonempty in- tervals we know that|C ∗ | ≤ | C W | by Theorem 7. 
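For concreteness, the pivoted intervals (3.80)-(3.81) with α_1 = α_2 = α/2 can be computed by a direct scan over M. The following is a brute-force sketch using scipy (an unoptimized illustration of the definitions, not W. Wang's R code; the function name is ours):

```python
from scipy.stats import hypergeom

def pivot_ci(x, n, N, alpha=0.05):
    """C_Piv(x) = [L_Piv(x), U_Piv(x)] from (3.80)-(3.81), with
    alpha_1 = alpha_2 = alpha/2.  Scans all M in [N]; fine for
    moderate N, but not tuned for speed."""
    a1 = a2 = alpha / 2
    # X ~ Hyper(M, n, N); in scipy's order that is hypergeom(N, M, n),
    # so P_M(X >= x) = sf(x - 1, N, M, n) and P_M(X <= x) = cdf(x, N, M, n).
    L = min(M for M in range(N + 1) if hypergeom.sf(x - 1, N, M, n) > a1)
    U = max(M for M in range(N + 1) if hypergeom.cdf(x, N, M, n) > a2)
    return L, U
```

With α = 0.05, N = 500, n = 100, this should reproduce, e.g., C_Piv(13) = [39, 101] quoted in Example 3.7.1; and for any parameters it inherits the guaranteed coverage P_M(M ∈ C_Piv(X)) ≥ 1 − α of the pivoting construction.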
In the following example we compare C* with C_W in terms of both size and computational time, and indeed exhibit a setting where |C*| < |C_W|.

We also note that W. Wang's definition of admissibility is slightly more restrictive than ours because, in addition to the equivariance requirement in Section 3.6, it also includes the two additional requirements that both sequences of endpoints of C_W(x) be nondecreasing in x, and that any sub-interval of C_W(x) have confidence level strictly less than 1 − α. Our C* satisfies these additional properties; see also Figure 3.1 for an example of the monotonicity of C*. However, we do not include them in our definition of admissibility so that our optimality results apply to a broader class.

Example 3.7.1. We compare C*, C_Piv, and C_W in the setting α = 0.05, N = 500, and n = 10, 20, 30, ..., 490. Figure 3.1 plots the C* and C_Piv intervals as vertical bars for the n = 100 case. Figure 3.1 suggests that the C* intervals are much shorter than C_Piv in this setting, and indeed Figure 3.2 shows the differences in size |C_Piv| − |C*| for n = 10, 20, ..., 490, which are substantial; all the C* intervals are at least 200 points shorter than their corresponding C_Piv intervals, and some are as many as 260 points shorter. These differences are also sizable fractions of the largest possible range [0, N] = [0, 500].

The C_W intervals are very similar to C* and so are not shown in Figures 3.1-3.2. In fact, the sizes |C_W| = |C*| are exactly equal for all values of n considered, except n = 100. To examine this case more closely, we give these confidence intervals explicitly in Tables 3.1 and 3.2. Examining these tables shows very similar, but

Figure 3.1: Confidence intervals C_Piv(x) and C*(x) for α = 0.05, N = 500, n = 100, and x = 0, 1, ..., 100.

Figure 3.2: The differences in total size |C_Piv| − |C*|, for N = 500, α = 0.05, and n = 10, 20, ..., 490.
slightly different intervals, with neither method dominating the other. For example,

|C*(0)| = |[0, 14]| < |[0, 16]| = |C_W(0)| and |C*(13)| = |[40, 102]| > |[40, 101]| = |C_W(13)|. (3.82)

Totaling the sizes gives |C*| = 7129 < 7131 = |C_W|, showing that the C_W intervals are indeed sub-optimal. One property of our method is that it does not necessarily produce intervals that are sub-intervals of C_Piv, which C_W always does since it begins with these intervals before iteratively shrinking them. For example, in this setting C_Piv(13) = [39, 101] which, by (3.82), contains C_W(13) but not C*(13).

Table 3.1: Confidence intervals given by C*(x) = [L(x), U(x)] for α = 0.05, N = 500, n = 100, and x = 0, 1, ..., 100.

x    0 1 2 3 4 5 6 7 8 9 10 11 12 13
L(x) 0 1 3 5 8 12 15 16 22 25 29 32 37 40
U(x) 14 24 31 39 46 52 59 64 72 77 84 89 94 102
x    14 15 16 17 18 19 20 21 22 23 24 25 26 27
L(x) 45 47 53 56 60 65 69 73 78 82 85 90 95 100
U(x) 107 112 117 124 129 134 139 144 152 157 162 167 172 177
x    28 29 30 31 32 33 34 35 36 37 38 39 40 41
L(x) 103 108 113 118 122 125 130 135 140 145 149 153 158 163
U(x) 182 188 194 199 204 209 214 219 224 229 234 239 244 250
x    42 43 44 45 46 47 48 49 50 51 52 53 54 55
L(x) 168 173 178 183 187 191 195 200 205 210 215 220 225 230
U(x) 255 260 265 270 275 280 285 290 295 300 305 309 313 317
x    56 57 58 59 60 61 62 63 64 65 66 67 68 69
L(x) 235 240 245 250 256 261 266 271 276 281 286 291 296 301
U(x) 322 327 332 337 342 347 351 355 360 365 370 375 378 382
x    70 71 72 73 74 75 76 77 78 79 80 81 82 83
L(x) 306 312 318 323 328 333 338 343 348 356 361 366 371 376
U(x) 387 392 397 400 405 410 415 418 422 427 431 435 440 444
x    84 85 86 87 88 89 90 91 92 93 94 95 96 97
L(x) 383 388 393 398 406 411 416 423 428 436 441 448 454 461
U(x) 447 453 455 460 463 468 471 475 478 484 485 488 492 495
x    98 99 100
L(x) 469 476 486
U(x) 497 499 500
Computational time: 0.0019 min.
Total size: 7129

In addition to the total sizes, Tables 3.1 and 3.2 also show the computational times§ used by both methods, at the bottom of each table. Whereas C* took roughly 1/10th of a second (0.0019 minutes) to fill the table, C_W took more than 10 minutes. As mentioned above, this is due to the adjusting technique of C_W, which requires repeated updating of intervals, whereas C* requires just one pass through the acceptance intervals

§ Computed using R's proc.time() function.

Table 3.2: Confidence intervals given by W. Wang's [132] method C_W for N = 500, n = 100, and α = 0.05.

x    0 1 2 3 4 5 6 7 8 9 10 11 12 13
L(x) 0 1 3 5 8 12 15 17 22 25 29 33 37 40
U(x) 16 24 32 39 46 52 59 65 72 78 84 90 95 101
x    14 15 16 17 18 19 20 21 22 23 24 25 26 27
L(x) 45 47 53 56 60 66 69 73 79 82 85 91 96 100
U(x) 107 113 117 124 130 135 141 144 152 157 163 168 173 178
x    28 29 30 31 32 33 34 35 36 37 38 39 40 41
L(x) 102 108 114 118 122 125 131 136 142 145 149 153 158 164
U(x) 182 188 194 200 205 210 215 220 225 231 236 241 246 250
x    42 43 44 45 46 47 48 49 50 51 52 53 54 55
L(x) 169 174 179 183 187 191 195 201 206 211 216 221 226 232
U(x) 253 258 263 268 274 279 284 289 294 299 305 309 313 317
x    56 57 58 59 60 61 62 63 64 65 66 67 68 69
L(x) 237 242 247 250 254 259 264 269 275 280 285 290 295 300
U(x) 321 326 331 336 342 347 351 355 358 364 369 375 378 382
x    70 71 72 73 74 75 76 77 78 79 80 81 82 83
L(x) 306 312 318 322 327 332 337 343 348 356 359 365 370 376
U(x) 386 392 398 400 404 409 415 418 421 427 431 434 440 444
x    84 85 86 87 88 89 90 91 92 93 94 95 96 97
L(x) 383 387 393 399 405 410 416 422 428 435 441 448 454 461
U(x) 447 453 455 460 463 467 471 475 478 483 485 488 492 495
x    98 99 100
L(x) 468 476 484
U(x) 497 499 500
Computational time: 10.1792 min.
Total size: 7131

for adjustment. Figure 3.3 gives a more complete comparison of computational times in this setting. The additional time required by C_W is sizable, even exceeding 25 minutes for values of n near the middle of the range.
A comparison of computational times of C_Piv and C* is shown in Figure 3.4, which shows that the times are much faster overall compared to C_W (the longest times being less than 1/3 of a second), and comparable between the two methods.

Figures 3.5 and 3.6 show the coverage probability for the n = 100 case of the three methods as a function of M = 0, 1, ..., 500. Like their sizes, C_W and C* have very similar coverage probabilities, whereas that of C_Piv is overall higher (an undesirable property once it exceeds 1 − α), especially for values of M near the endpoints 0 and N = 500.

Figure 3.3: The computational time of the confidence intervals C_W and C* for N = 500, α = 0.05, and n = 10, 20, ..., 490.

Figure 3.4: The computational time of the confidence intervals C_Piv and C* for N = 500, α = 0.05, and n = 10, 20, ..., 490.

Example 3.7.2. We compare the computational time of C* and C_W in the setting α = 0.05, N = 200, 400, ..., 1000, and n = N/2.

Figure 3.5: Coverage probability of C* and C_Piv for N = 500, n = 100, and α = 0.05.

Figure 3.6: Coverage probability of C_W for N = 500, n = 100, and α = 0.05.

Wang's [132] method is time-consuming, especially when the sample size and the population size are large. A comparison of computational times of C* and C_W is shown in Figure 3.7. When N = 1000, n = 500, and α = 0.05, the computational time for C_W reached 250 minutes, while C* took only 0.0111 minutes, which shows that our method is efficient.

Example 3.7.3. In this example we show the necessity of part (b) of Theorem 6. That is, we exhibit a setting with n, N, and |A(N/2)| all even for a certain acceptance set A whose inversion C is size-optimal with |C| = |C*| − 1. Set N = 20, n = 6, and α = 0.6. For M ≠ N/2 = 10 define A(M) = [a*_M, b*_M] to be the

Figure 3.7: The computational time of the confidence intervals C_W and C* for N = 200, 400, ..., 1000, and n = N/2.
same intervals given by Theorem 5 and inverted to create C*, and define A(10) = {2, 4}. For all M ≠ 10, A(M) is a level-α interval, and A(10) is as well since

P_{M=10}(2) = P_{M=10}(4) = .244 (3.83)

to 3 decimal places, thus P_{M=10}({2, 4}) > .4 = 1 − α. It can be shown that A*(10) = [2, 4], thus the intervals A have 1 fewer point than A*, so by (3.64) we have that |C| = |C*| − 1.

Example 3.7.4 (Air quality data). In this example we apply our confidence interval C* to data collected by China's Ministry of Environmental Protection (MEP) and discussed by [76]. The MEP collects data on particulate matter (PM2.5) concentration, measured in µg/m³, of fine inhalable particles with diameters less than 2.5 micrometers. The U.S. Environmental Protection Agency [1] classifies the air quality of a given day as "hazardous" if the day's 24-hour average PM2.5 measurement exceeds the set threshold 250.5. Liang et al. [76] analyzed the 2013 to 2015 MEP data and concluded that it was consistent with measurements taken at nearby U.S. diplomatic posts, the U.S. Embassy in Beijing and four U.S. Consulates in other cities. However, a persistent problem with the MEP data is a high degree of missing days. For a given year, if the missing days are assumed to be missing at random with each day of the year equally likely, then the number X of remaining "hazardous" days, conditioned on the number n of remaining days, follows a hypergeometric distribution with N = 365 and unknown actual number M of annual hazardous days, to be estimated as an indication of annual air quality. We focus on the 2015 data from 3 MEP sites in Beijing: Dongsi, Dongsihuan, and Nongzhanguan.
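As a quick numeric check of Example 3.7.3 above, the probability in (3.83) and the level of A(10) = {2, 4} can be verified directly from the hypergeometric density (a minimal sketch; `hyper_pmf` is our own helper):

```python
from math import comb

def hyper_pmf(x, N, M, n):
    """P_M(x): probability of x successes in a sample of n drawn without
    replacement from a population of N containing M successes."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# Example 3.7.3 setting: N = 20, n = 6, alpha = 0.6, M = N/2 = 10.
p2, p4 = hyper_pmf(2, 20, 10, 6), hyper_pmf(4, 20, 10, 6)
print(round(p2, 3), round(p4, 3))   # 0.244 0.244, matching (3.83)
print(p2 + p4 > 1 - 0.6)            # True: A(10) = {2,4} is a level-alpha set
```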
For each of these sites, Table 3.3 shows the number n of days with complete measurements, the observed number x of days with complete measurements classified as "hazardous," the point estimate Nx/n (with N = 365) of the number M of annual hazardous days, and the 90% confidence interval C*(x) for M, which are also plotted in Figure 3.8. The point estimates from the MEP sites are higher (i.e., indicating worse annual air quality) compared to the U.S. Embassy data, similar to the conclusions drawn by [76]. The confidence intervals also show that the MEP estimates are more variable, largely in the direction of indicating worse air quality, with two out of three upper confidence limits being much larger for the MEP sites than for the U.S. Embassy.

Table 3.3: For the Beijing air quality data [76], the number n of days with complete measurements, the number x of days with complete measurements classified as hazardous, the point estimate Nx/n (to 1 decimal place) of the number M of annual hazardous days, and the 90% confidence interval C*(x) for M.

Site         n   x  Nx/n  90% our CI (time, size)       90% Wang's CI (time, size)
Dongsi       292 16 20.0  [17,24] (6E-04 min, 3786)     [17,24] (3.79 min, 3786)
Dongsihuan   166 7  15.4  [10,24] (3E-04 min, 4672)     [10,24] (3.83 min, 4672)
Nongzhanguan 290 11 13.8  [11,17] (6E-04 min, 3818)     [12,17] (3.73 min, 3818)
U.S. Embassy 332 15 16.5  [15,18] (4E-04 min, 2736)     [15,18] (2.90 min, 2736)

Figure 3.8: The 90% confidence intervals C*(x) for the number M of annual hazardous days at different locations in Beijing.

3.8 Discussion

We have presented an efficient method of computing exact hypergeometric confidence intervals. Compared to the standard pivotal method, our method requires similar computational time but produces much shorter intervals.
Our method produces intervals with total size no larger than, and in some cases strictly smaller than, those of the existing nearly-optimal method of W. Wang [132], which is computationally much more costly than both our method and the pivotal method. Therefore we hope our method can provide something near the "best of both worlds" for this problem in terms of computational time and interval size.

The key to our method is the novel shifting of acceptance intervals before inversion, developed in Sections 3.4 and 3.5. We have observed in the numerical examples in Section 3.7, as well as in extensive further computations not included in this chapter, that the needed shifts seem never to exceed a single point, i.e., the δ_i in Theorem 2 satisfy δ_i ≤ 1. This is not needed in our theory, but we close by mentioning it as a tantalizing conjecture.

3.9 Acknowledgements and author contributions

This study is a collaborative effort with Jay Bartroff and Gary Lorden. J.B. and G.L. conceived the study, while L.W. executed the methods, wrote the code, and performed empirical, real-data analyses. J.B. and G.L. supervised the execution. All authors contributed to the design of the methods, theoretical analysis, and the writing of the manuscript.

Chapter 4

Skilled mutual fund selection: false discovery control under dependence

4.1 Introduction

The mutual fund industry is giant, with trillions of dollars of assets under management. It plays an important role in shaping the economy by pooling money from many investors (retail investors as well as institutional investors) to purchase securities. Meanwhile, the industry has a great impact on household finances. Mutual funds in the past decade have become a more important provider of liquidity to households [84]. In 2019, 46.4% of households in the United States owned funds, according to Statista.∗
While the success of the fund industry is commonly attributed to several advantages, namely diversification (reduced risk), marketability (easy access), and professional management, the success of an individual fund depends mainly on the fund manager's capability.

How to select mutual funds with a true capability to make profits is a challenging and substantial problem in finance. Funds with good performance during the prior few years do not guarantee profits for the subsequent years. For example, the Rydex Series Funds: Inverse S&P 500 Strategy Fund generated a total return of 383.3% between 2001 and 2005, but it had a total return of -96.2% between 2006 and 2019. Even in a less extreme case, considering the size of mutual funds, an unskilled mutual fund that loses a small proportion of its investment could incur a huge loss in terms of the dollar amount.

∗ The statistics are available at https://www.statista.com

The Carhart four-factor model [25] has been widely used to evaluate a mutual fund's performance. Regressing the excess return on four market factors, the intercept α in the cross-sectional linear regression model is considered a measure of the fund's true capability to make profits. By convention, a fund with α > 0 is classified as skilled, α < 0 as unskilled, and α = 0 as zero-alpha. In practice, the true value of α is not observed. Instead, the ordinary least squares (OLS) estimator of α is calculated for the comparison between different funds. Some funds may have a zero or even a negative α, meaning that they cannot make profits after adjusting for the market factors. Due to luck, however, their estimated alphas can be positive for some years. Investment in such funds can lead to substantial loss rather than profit in the future. There is an urgent demand to develop strategies for selecting truly skilled mutual funds while avoiding or controlling the false selection of those "lucky" funds (unskilled or zero-alpha funds).
The seminal paper of Barras, Scaillet and Wermers [10] initiated a multiple testing framework for the fund selection problem. Selecting the skilled mutual funds can be formulated as simultaneously testing H_{i0}: α_i ≤ 0 vs. H_{ia}: α_i > 0 for thousands of funds, with the standardized OLS estimates of the α's as test statistics. Adopting the false discovery rate (FDR) concept from Benjamini and Hochberg [12], Barras, Scaillet and Wermers [10] applied the Storey procedure [117] for selecting the skilled funds, aiming to control the FDR at a certain level, where the FDR is defined as the expected proportion of falsely selected "lucky" funds among the total selection.

When we consider funds during a certain period, the test statistics are usually dependent. The dependence can be very strong and can greatly affect the FDR control. For example, such strong dependence has a huge impact on funds' locations in the empirical distribution of the test statistics, which can cause small p-values for "lucky" funds (false selection) or relatively large p-values for skilled funds (mis-selection). Consequently, the selection of "lucky" funds may increase while more skilled funds are ignored. The Storey procedure is usually valid for FDR control under weak dependence but not under strong dependence. Another issue in the aforementioned multiple testing is that the p-values are usually calculated under the assumption that each test statistic is normally distributed. In our fund data, we observe nonnormality and strong dependence in the test statistics, indicating that the method in [10] is not ideal.

To address the above challenging issues arising from the mutual fund data, we consider the conditional probability of each fund not being skilled given the test statistics across all of the funds in our study. For convenience, we call this conditional probability the degree of non-skillness (d-value).
Our multiple testing procedures will be constructed based on the d-values, in contrast with p-value based selection strategies. Intuitively, smaller d-values suggest larger chances of being skilled, and thus encourage selection. The dependence and nonnormality structure of the test statistics is also preserved in the conditional information, so d-values can provide a good ranking of fund performance, which is crucial to fund selection by our multiple testing procedure.

To construct the multiple testing procedure, we start from a decision theoretic perspective. In the decision process, in addition to the falsely selected lucky funds (false discoveries), another type of mistake is the mis-selected skilled funds (false non-discoveries). Control of the false discovery rate (FDR) has been a prevailing problem in large scale multiple testing [12, 14, 103]. Results on the control of the false non-discovery rate (FNR) are fewer yet equally important [102, 104]. Consideration of one type of error alone can lead to unsatisfactory testing procedures. For example, a procedure with FDR control only may still be very conservative, with substantial power loss. A natural step is to minimize a combination of FDR and FNR by an optimal testing procedure. However, this problem has not been well studied in the statistics literature. Alternative approaches to this question can be found in [45] and subsequent papers in that direction, where the marginal FDR and marginal FNR are considered. When the test statistics possess general dependence, especially strong dependence, the marginal FDR (FNR) is not equivalent to the FDR (FNR). Our optimal testing procedure (Theorem 8) provides a solution to this long-standing problem by directly targeting the FDR and FNR. The d-values defined above are the key to achieving optimality.

To calculate the d-values, we need to model the empirical distribution of the test statistics.
We consider a three-part mixture model to capture the dependence and nonnormality phenomena in the test statistics. The mixture model is a powerful tool in high-dimensional statistics for modeling a multi-modal structure. In the existing literature, it is usually assumed that the individual statistics of primary interest are independent; thus the joint likelihood can be decomposed as the product of marginal likelihoods to facilitate statistical analysis. However, such an independence assumption is not satisfied in our data. Another challenging issue in this mixture model is that there are seven free parameters. Conventional techniques such as empirical Bayes would require seven equations to solve for those parameters. This is not desirable in our data analysis, since more equations increase the difficulty of solving for the parameters, and higher moments also have more complicated expressions in terms of the parameters. Instead, we propose a new method called "approximate empirical Bayes" to fit the parameters; it borrows the strength of empirical Bayes while relieving the computational burden of fitting more parameters.

The rest of the chapter is organized as follows. Section 4.2 introduces our multiple testing procedure. Section 4.3 describes the mutual fund data and challenging issues. Section 4.4 develops an approximate empirical Bayes method to fit the parameters and constructs the degree of non-skillness (d-value). Section 4.6 analyzes the mutual fund data using the proposed methods. All the technical derivations, additional figures, and tables are relegated to the Supplementary Materials.

4.2 Optimal multiple testing procedure

In this section, we propose an optimal multiple testing procedure under dependence, which will be applied to the mutual fund selection problem in later sections. Suppose we have a p-dimensional vector Z with unknown mean vector µ = (µ_1, ..., µ_p)^⊤ and a known covariance matrix Σ. We want to simultaneously test H_{i0}: µ_i ≤ 0 vs.
H_{ia}: µ_i > 0, i = 1, ..., p,

based on Z. Without loss of generality, we can assume that all diagonal elements of Σ are 1, since otherwise we can standardize Z.

Table 4.1: Classification of tested hypotheses

            Number not rejected   Number rejected   Total
True Null   U                     V                 p_0
False Null  T                     S                 p_1
Total       p − R                 R                 p

Table 4.1 is the classification of multiple hypothesis testing, where R denotes the total number of rejections (discoveries), V denotes the total number of false rejections (false discoveries), and T denotes the total number of false acceptances (false non-discoveries). We can define the false discovery proportion (FDP) and the false non-discovery proportion (FNP) as

FDP = V/R and FNP = T/(p − R),

respectively, to measure the mistakes that we make in the decision process. For notational convenience, we define 0/0 = 0. Correspondingly, we define the false discovery rate as FDR = E[FDP] and the false non-discovery rate as FNR = E[FNP]. FDR and FNR have been regarded as the Type I error and the Type II error, respectively, for multiple testing procedures [102].

From a decision theoretic perspective, an optimal testing procedure can be constructed to minimize the objective function:

FNR + λ FDR, (4.1)

where λ is a tuning parameter balancing the two types of errors. (The choice of λ is a nontrivial task in practice, but it is not relevant to our main discussion here.) Let us treat λ as a generic variable. Suppose we consider a decision vector a = (a_1, ..., a_p), where a_i = 1 if we reject the i-th null hypothesis and a_i = 0 otherwise. Therefore, a false discovery can be expressed as a_i I(µ_i ≤ 0), where I is an indicator function, and a false non-discovery can be expressed as (1 − a_i) I(µ_i > 0). The objective function can be written as

E[ Σ_{i=1}^p (1 − a_i) I(µ_i > 0) / (p − Σ_{j=1}^p a_j) + λ Σ_{i=1}^p a_i I(µ_i ≤ 0) / Σ_{j=1}^p a_j ].

The above expectation is over the likelihood of Z and the prior distribution of µ.
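The counts in Table 4.1 and the proportions FDP = V/R and FNP = T/(p − R) translate directly into code; a minimal sketch (function name is ours, with the convention 0/0 = 0):

```python
import numpy as np

def fdp_fnp(a, is_null):
    """FDP and FNP for a 0/1 decision vector a and a boolean vector
    is_null marking the true null hypotheses; 0/0 is taken to be 0."""
    a = np.asarray(a, dtype=bool)
    is_null = np.asarray(is_null, dtype=bool)
    p = a.size
    R = a.sum()                    # total rejections (discoveries)
    V = (a & is_null).sum()        # false rejections
    T = (~a & ~is_null).sum()      # false acceptances (false non-discoveries)
    fdp = V / R if R > 0 else 0.0
    fnp = T / (p - R) if p - R > 0 else 0.0
    return fdp, fnp
```

For example, rejecting hypotheses 1 and 2 when hypotheses 1 and 4 are true nulls gives FDP = 1/2 and FNP = 1/2.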
It is equivalent to minimize the expectation of the loss function conditional on Z, that is, to minimize

L(a) = Σ_{i=1}^p [ (1 − a_i) P(µ_i > 0 | Z) / (p − Σ_{j=1}^p a_j) + λ a_i P(µ_i ≤ 0 | Z) / Σ_{j=1}^p a_j ]. (4.2)

Without any delicate analysis, minimizing the objective function would involve optimization over 2^p choices of a. To the best of our knowledge, the minimizer of (4.2) has not been derived in the existing literature. We notice that the number of rejections Σ_{j=1}^p a_j has only p + 1 possible values. This inspires us to decompose the problem into p + 1 optimization tasks, which shares a similar rationale with divide-and-conquer strategies. Our optimization procedure is presented in the following theorem.

Theorem 8. Denote by DONS_(1), ..., DONS_(p) the non-decreasing order of P(µ_i ≤ 0 | Z), i = 1, ..., p. Define the action vector a_0 = (a^{(0)}_1, ..., a^{(0)}_p) = (0, ..., 0). Given a value of λ, for each j ∈ {1, ..., p}, let the action vector a_j = (a^{(j)}_1, ..., a^{(j)}_p) where

a^{(j)}_i = 1, if P(µ_i ≤ 0 | Z) ≤ DONS_(j); 0, if P(µ_i ≤ 0 | Z) > DONS_(j).

Calculate the corresponding loss function

L(a_j) ≡ Σ_{i=1}^p [ (1 − a^{(j)}_i) P(µ_i > 0 | Z) / (p − j) + λ a^{(j)}_i P(µ_i ≤ 0 | Z) / j ] (4.3)

for j = 0, ..., p. The optimal procedure for (4.2) is a_opt = argmin_{a_j : 0 ≤ j ≤ p} L(a_j).

Proof of Theorem 8. Note that for the two special cases when Σ_{j=1}^p a_j = 0 or Σ_{j=1}^p a_j = p, the solutions are clear: the former case corresponds to a_j = 0 for j = 1, ..., p, while the latter corresponds to a_j = 1 for j = 1, ..., p. To find the solution, it is equivalent to consider the p − 1 optimization problems, for each j ∈ {1, ..., p − 1}:

minimize L(a) = Σ_{i=1}^p [ (1 − a_i) P(α_i > 0 | Z) / (p − Σ_{j=1}^p a_j) + λ a_i P(α_i ≤ 0 | Z) / Σ_{j=1}^p a_j ] subject to Σ_{i=1}^p a_i = j.

Note that L(a) can be re-arranged as

L(a) = (p − j)^{−1} Σ_{i=1}^p (1 − a_i)(1 − P(α_i ≤ 0 | Z)) + (λ/j) Σ_{i=1}^p a_i P(α_i ≤ 0 | Z)
     = (p − j)^{−1} Σ_{i=1}^p [1 − P(α_i ≤ 0 | Z)] − j/(p − j) + (1/(p − j) + λ/j) Σ_{i=1}^p a_i P(α_i ≤ 0 | Z)

when Σ_{i=1}^p a_i = j.
Given P(α_i ≤ 0 | Z) for i = 1, ..., p, L(a) is strictly increasing in Σ_{i=1}^p a_i P(α_i ≤ 0 | Z), with the other terms fixed. Consider the decision a_j = (a^{(j)}_1, ..., a^{(j)}_p) with

a^{(j)}_i = 1, if P(α_i ≤ 0 | Z) ≤ DONS_(j); 0, if P(α_i ≤ 0 | Z) > DONS_(j).

It is obvious that a_j minimizes Σ_{i=1}^p a_i P(α_i ≤ 0 | Z) under the restriction Σ_{i=1}^p a_i = j. Therefore, a_j is the minimizer of L(a) under the restriction Σ_{i=1}^p a_i = j.

Finally, we choose the decision minimizing L(a) among a_0, ..., a_p; this decision minimizes the objective function (4.2), so a_opt = argmin_{a_j : 0 ≤ j ≤ p} L(a_j).

For ease of presentation, we consider µ as the mean vector. Theorem 8 remains valid for a more general vector of parameters of interest. Note that if we restrict the total number of rejections to be k, the second term in the loss function (4.3) becomes

λ Σ_{i=1}^p a^{(j)}_i P(µ_i ≤ 0 | Z) / j = λ Σ_{i=1}^k DONS_(i) / k.

The term k^{−1} Σ_{i=1}^k DONS_(i) can be regarded as an FDR conditional on Z. This motivates an FDR control procedure: given a level θ, we select the largest k such that

(1/k) Σ_{i=1}^k DONS_(i) ≤ θ. (4.4)

Instead of minimizing the objective function (4.2), which involves the tuning parameter λ, this procedure targets the FDR control directly. The level θ reflects our tolerance for falsely rejecting hypotheses. We should point out that a procedure focusing only on FDR control may be very conservative, since it can reject fewer hypotheses to avoid false rejections. Ideally, we want to hunt for more true rejections in addition to the FDR control. Fortunately, our procedure achieves this goal, since the false non-discovery rate is minimized, as a result of the monotonicity established in the next result.

Corollary 1. k^{−1} Σ_{i=1}^k DONS_(i) is monotonically nondecreasing in k, and (p − k)^{−1} Σ_{i=k+1}^p DONS_(i) is monotonically nondecreasing in k.

Proof of Corollary 1.
For the first part, we want to show that for a generic positive integer k,

(1/k) Σ_{i=1}^k DONS_(i) ≤ (1/(k+1)) Σ_{i=1}^{k+1} DONS_(i).

This is equivalent to showing (k + 1) Σ_{i=1}^k DONS_(i) ≤ k Σ_{i=1}^{k+1} DONS_(i). The right-hand side is k Σ_{i=1}^k DONS_(i) + k DONS_(k+1), and the term k Σ_{i=1}^k DONS_(i) cancels on both sides. Therefore, it is equivalent to show Σ_{i=1}^k DONS_(i) ≤ k DONS_(k+1), which always holds since DONS_(i) ≤ DONS_(k+1) for i ≤ k. Thus the first conclusion holds.

For the second part, we want to show that for a generic positive integer k,

(p − k)^{−1} Σ_{i=k+1}^p DONS_(i) ≤ (p − k − 1)^{−1} Σ_{i=k+2}^p DONS_(i),

which is equivalent to showing

(p − k − 1) Σ_{i=k+1}^p DONS_(i) ≤ (p − k) Σ_{i=k+2}^p DONS_(i).

The left-hand side can be expanded as (p − k − 1) DONS_(k+1) + (p − k − 1) Σ_{i=k+2}^p DONS_(i), and the common term (p − k − 1) Σ_{i=k+2}^p DONS_(i) cancels against the right-hand side. Therefore, we only need to show that

(p − k − 1) DONS_(k+1) ≤ Σ_{i=k+2}^p DONS_(i).

This always holds since DONS_(k+1) ≤ DONS_(i) for any i ≥ k + 2. Thus the second conclusion holds.

According to the decision a_j in Theorem 8, the FNR conditional on Z can be expressed, when j = k, as

Σ_{i=1}^p (1 − a_i) P(µ_i > 0 | Z) / (p − j) = Σ_{i=k+1}^p (1 − DONS_(i)) / (p − k) = 1 − Σ_{i=k+1}^p DONS_(i) / (p − k).

Therefore, the FNR is monotonically nonincreasing in k by Corollary 1. Combining the two results in Corollary 1, the FNR of the testing procedure (4.4) is minimal among all testing procedures based on the test statistics Z for which the FDR is controlled at level θ.

4.3 Mutual Fund Data and Challenges

4.3.1 Equity Fund Data Description

We download the monthly return data of equity mutual funds from the CRSP survivor-bias-free US mutual fund database available at the Wharton Research Data Services (WRDS).† The monthly returns are net of trading costs and expenses (including fees). We select the funds existing from the beginning of 2000 to the end of 2019 (20 years).
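Before turning to the data, the selection rule (4.4) from the previous section admits a short implementation. A minimal sketch (names are ours; `dons` holds the d-values P(µ_i ≤ 0 | Z), which in practice come from the mixture-model fit of Section 4.4):

```python
import numpy as np

def select_funds(dons, theta):
    """FDR-control rule (4.4): find the largest k such that the average of
    the k smallest d-values DONS_(1),...,DONS_(k) is at most theta, and
    select (reject the nulls of) the k funds with the smallest d-values."""
    dons = np.asarray(dons, dtype=float)
    order = np.argsort(dons)                       # indices by increasing d-value
    running_mean = np.cumsum(dons[order]) / np.arange(1, dons.size + 1)
    ok = np.nonzero(running_mean <= theta)[0]      # running mean is nondecreasing (Corollary 1)
    k = ok[-1] + 1 if ok.size else 0
    return order[:k], k
```

For instance, with d-values (0.5, 0.01, 0.2, 0.05) and θ = 0.1, the rule selects the three funds with d-values 0.01, 0.05, and 0.2, since (0.01 + 0.05 + 0.2)/3 ≤ 0.1 while the full average exceeds it.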
Throughout the chapter, we focus on the equity funds that survived for at least 10 years and analyze the equity funds with complete monthly return data during each 10-year period (2000-2009, 2001-2010, ..., 2009-2018). We notice that WRDS uses "0" to denote a missing value for the monthly return. To be cautious with missing values, we delete any fund with a monthly return of zero in each 10-year period. This data cleaning step reduces the number of mutual funds to several thousand for each 10-year period, e.g., 2,722 equity funds during 2000-2009 and 5,123 funds during 2009-2018. The reason for considering 10-year windows is explained in Section 4.3.2. We believe that a dataset with such a great number of funds is sufficient for our analysis. There are three variables in this dataset: date, fund ID, and monthly return per share.

For all the equity funds that survived in each 10-year period, Figure 4.1 plots their annual returns for the subsequent year. For example, the cluster centered at 2010 represents the annual returns in 2010 for equity funds that survived during 2000-2009. It illustrates that the distributions of fund returns vary over the years, but most funds have annual returns around 0. Our primary goal here is to figure out how to select funds with persistent future performance based on the historical data.

† The monthly return data are available at https://wrds-web.wharton.upenn.edu/wrds/ds/crsp/mfund_q/monthly/index.cfm

To apply the Carhart four-factor model in Section 4.3.2, we download data for the Fama-French market factors during this period from WRDS.‡
There are six variables: (1) date; (2) the monthly excess return on the market, calculated as the value-weighted return on all NYSE, AMEX, and NASDAQ stocks (from CRSP) minus the one-month Treasury bill rate (from Ibbotson Associates); (3) the monthly average return on the three small portfolios minus the average return on the three big portfolios; (4) the monthly average return on the two value portfolios (high BE (Book Equity)/ME (Market Equity) ratios) minus the average return on the two growth portfolios (low BE/ME ratios); (5) the momentum; and (6) the risk-free interest rate (one-month Treasury bill rate).

Figure 4.1: Annual returns for the mutual funds existing in the 10-year windows from 2000-2009 to 2009-2018. Each cluster centered at a certain year represents the annual returns in that year for equity funds that survived during the previous ten years. Moreover, the y-coordinate of each point in a certain cluster is the annual return in that year, so that the number of funds with the same level of annual returns is observable.

4.3.2 Fund Performance Measurement

The Carhart four-factor model [25] has been widely used for measuring a mutual fund's performance. Consider the four-factor model

r_{i,t} = α_i + b_i r_{m,t} + s_i r_{smb,t} + h_i r_{hml,t} + m_i r_{mom,t} + ϵ_{i,t},

‡ The Fama-French market factor data are available at https://wrds-web.wharton.upenn.edu/wrds/ds/famafrench/factors_m.cfm?navId=203

where r_{i,t} is the month-t excess return of fund i over the risk-free rate (proxied by the monthly 30-day T-bill beginning-of-month yield), r_{m,t} is the month-t excess return on the CRSP NYSE/NASDAQ value-weighted market portfolio, r_{smb,t}, r_{hml,t}, and r_{mom,t} are the month-t returns on zero-investment factor-mimicking portfolios for size, book-to-market, and momentum, and ϵ_{i,t} is the unobserved random error.
Adjusting for the market factors, the intercept $\alpha$ reflects a fund manager's true capability to cover the trading costs and expenses. By convention, a positive $\alpha$ indicates skilled performance: skilled fund managers are considered to have stock-picking ability that generates extra profit after fees and expenses, while unskilled fund managers' ability is insufficient to cover the trading costs and expenses. We should point out that there are other asset pricing models in finance. Jensen's alpha, the intercept of the CAPM model, was first introduced in [55] to measure fund performance. The CAPM model directly regresses the excess monthly return on the market factor $r_{m,t}$ and explains fund returns through the overall market. The Fama and French three-factor model [39] further considers $r_{m,t}$, $r_{smb,t}$, and $r_{hml,t}$ as the market factors and has demonstrated prominent performance in asset pricing. It is worth pointing out that our method does not rely on the assumption that the residuals in the four-factor model are independent. Thus, misspecification of the model due to a lack of additional common factors does not invalidate our method. Nevertheless, the Carhart four-factor model is sufficient to demonstrate our method for analyzing mutual fund data, and our method can also be applied to other pricing models according to the reader's preference. The pursuit of positive alpha is based on the rationale that fund managers' investment skills persist over a certain period of years. Such a rationale has been documented in the literature [22]. Therefore, for a fund manager with a positive value of $\alpha$ in the prior several years, we expect that she will maintain her good quality of investment in the future, at least for a few years. In practice, the true $\alpha$'s are unobservable. Instead, we can calculate the ordinary least squares (OLS) estimates of the $\alpha$'s for each 10-year period.
During a certain period, $t = 1,\dots,T$, let $r_i = (r_{i,1},\dots,r_{i,T})^\top$, $r_m = (r_{m,1},\dots,r_{m,T})^\top$, $r_{smb} = (r_{smb,1},\dots,r_{smb,T})^\top$, $r_{hml} = (r_{hml,1},\dots,r_{hml,T})^\top$, $r_{mom} = (r_{mom,1},\dots,r_{mom,T})^\top$, $R = (r_m, r_{smb}, r_{hml}, r_{mom})$, and $\mathbf{1} = (1,\dots,1)^\top$. By standard linear model theory, the cross-sectional OLS estimator for $\alpha_i$ can be written as $\hat\alpha_i = h^\top r_i$, where
$$ h^\top = \big(T - \mathbf{1}^\top R(R^\top R)^{-1}R^\top\mathbf{1}\big)^{-1}\mathbf{1}^\top - \tfrac{1}{T}\mathbf{1}^\top R\big(R^\top R - R^\top\mathbf{1}\tfrac{1}{T}\mathbf{1}^\top R\big)^{-1}R^\top, $$
for $i = 1,\dots,p$. The detailed derivation is as follows.

Derivation of $\hat\alpha_i$: Let $\beta_i = (\alpha_i, b_i, s_i, h_i, m_i)^\top$, $\epsilon_i = (\epsilon_{i,1},\dots,\epsilon_{i,T})^\top$, $R = (r_m, r_{smb}, r_{hml}, r_{mom})$, $\mathbf{1} = (1,\dots,1)^\top$, and $X = (\mathbf{1}, R)$; then the four-factor model can be written as
$$ r_i = X\beta_i + \epsilon_i. $$
The least squares estimator for $\beta_i$ is $\hat\beta_i = (X^\top X)^{-1}X^\top r_i$. We want the least squares estimator for $\alpha_i$, which is the first coordinate of $\hat\beta_i$. Since $X = (\mathbf{1}, R)$,
$$ X^\top X = \begin{pmatrix} n & \mathbf{1}^\top R \\ R^\top\mathbf{1} & R^\top R \end{pmatrix}. $$
To find the inverse of $X^\top X$, note that for an arbitrary invertible block matrix
$$ \begin{pmatrix} A & B \\ C & D \end{pmatrix}, $$
where $A$ and $D$ are symmetric and $B^\top = C$, the inverse is given by
$$ \begin{pmatrix} (A - BD^{-1}C)^{-1} & -A^{-1}B(D - CA^{-1}B)^{-1} \\ -(D - CA^{-1}B)^{-1}CA^{-1} & (D - CA^{-1}B)^{-1} \end{pmatrix}. $$
Therefore,
$$ (X^\top X)^{-1} = \begin{pmatrix} \big(n - \mathbf{1}^\top R(R^\top R)^{-1}R^\top\mathbf{1}\big)^{-1} & -\tfrac{1}{n}\mathbf{1}^\top R\big(R^\top R - R^\top\mathbf{1}\tfrac{1}{n}\mathbf{1}^\top R\big)^{-1} \\ -\big(R^\top R - R^\top\mathbf{1}\tfrac{1}{n}\mathbf{1}^\top R\big)^{-1}R^\top\mathbf{1}\tfrac{1}{n} & \big(R^\top R - R^\top\mathbf{1}\tfrac{1}{n}\mathbf{1}^\top R\big)^{-1} \end{pmatrix}. $$
Since $n = T$, we obtain $h^\top = \big(T - \mathbf{1}^\top R(R^\top R)^{-1}R^\top\mathbf{1}\big)^{-1}\mathbf{1}^\top - \tfrac{1}{T}\mathbf{1}^\top R\big(R^\top R - R^\top\mathbf{1}\tfrac{1}{T}\mathbf{1}^\top R\big)^{-1}R^\top$, with $\hat\alpha_i = h^\top r_i$. It is clear that $\operatorname{cov}(\hat\alpha_i, \hat\alpha_j) = h^\top\operatorname{cov}(r_i, r_j)h$, which shows that there is dependence among $\{\hat\alpha_i\}_{i=1}^p$. Based on the Carhart four-factor model, a larger sample size $T$ yields a more accurate OLS estimate of the true $\alpha$, which encourages us to choose a wider window for the analysis.
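The closed-form hat-vector $h$ above can be checked numerically against a full least-squares fit. A small numpy sketch with simulated factors (sizes and data are illustrative only, not the mutual fund data):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 120                             # 120 months, as in a 10-year window
R = rng.normal(size=(T, 4))         # four factor return series
ones = np.ones(T)

# hat-vector h such that alpha_hat_i = h @ r_i
RtR_inv = np.linalg.inv(R.T @ R)
scale = 1.0 / (T - ones @ R @ RtR_inv @ R.T @ ones)
M = np.linalg.inv(R.T @ R - np.outer(R.T @ ones, ones @ R) / T)
h = scale * ones - (ones @ R @ M @ R.T) / T

# compare with the intercept from a full least-squares fit
r_i = 0.02 + R @ np.array([0.9, 0.2, -0.1, 0.05]) + 0.01 * rng.normal(size=T)
X = np.column_stack([ones, R])
beta_hat = np.linalg.lstsq(X, r_i, rcond=None)[0]
print(np.isclose(h @ r_i, beta_hat[0]))  # True
```

The two agree because the hat-vector is exactly the first row of $(X^\top X)^{-1}X^\top$ obtained from the block inverse above.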
However, a wider window possibly raises the issue of survivor bias; that is, funds surviving for a longer time in the past also indicate the success of those funds. Considering this trade-off, we focus on a 10-year window for applying the Carhart four-factor model, instead of a 5-year window (which reduces estimation accuracy) or a 15-year window (which raises survivor bias).

4.3.3 Challenges: Dependence and Nonnormality

Based on our derivation in Section 4.3.2, the OLS estimates $\{\hat\alpha_i\}_{i=1}^p$ possess covariance dependence. Let $\Sigma^\star$ denote the covariance matrix whose $(i,j)$th element is $h^\top\operatorname{cov}(r_i, r_j)h$, where $\operatorname{cov}(r_i, r_j)$ is calculated as the sample covariance between the observations $r_i$ and $r_j$. We further let $\Sigma$ denote the correlation matrix of $\Sigma^\star$. To illustrate the dependence issue, we consider the equity fund data for the 10-year window from 2009 to 2018 as an example. There were 5,123 equity funds that survived during this period. We plot the eigenvalues of $\Sigma$ in Figure 4.2. The largest eigenvalue is 4,176, which shows that there is strong dependence. Meanwhile, the largest eigenvalue is 2,087 for the 2,722 funds during 2000-2009. The largest eigenvalue increases with the number of funds, suggesting that the strong dependence is not weakened by involving more funds.

Figure 4.2: Top 50 eigenvalues of $\Sigma$ for the mutual fund data between 2009 and 2018.

Various reasons contribute to this strong dependence. One possible cause is herding among mutual funds, a common phenomenon documented in the literature [134]. Herding can arise when mutual funds apply similar investment strategies. For example, funds often purchase the past year's winners and sell the losers. As a result, these funds herd into (or out of) the same stocks at the same time.
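The standardization of $\Sigma^\star$ to a correlation matrix and the scree computation behind Figure 4.2 can be sketched as follows; the toy one-factor covariance below merely stands in for the fund data and mimics a strong common (herding) component.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 200
# toy covariance with one strong common factor, mimicking herding
loadings = rng.normal(size=p)
sigma_star = np.outer(loadings, loadings) + np.eye(p)

# standardize the covariance matrix Sigma_star to a correlation matrix Sigma
d = 1.0 / np.sqrt(np.diag(sigma_star))
sigma = sigma_star * np.outer(d, d)

# scree: eigenvalues in non-increasing order; the top one is of order p
eigvals = np.sort(np.linalg.eigvalsh(sigma))[::-1]
print(eigvals[0] > 10)  # True: one dominant eigenvalue
```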
Another phenomenon we observed is that the empirical distributions of the standardized $\{\hat\alpha_i\}_{i=1}^p$ are nonnormal. Kosowski et al. [67] explain this nonnormality from two aspects: nonnormality of individual mutual fund $\alpha$'s and different levels of risk-taking among funds. The nonnormality arising from distinct levels of risk can be alleviated by standardization, but the normalized statistics are still affected by fund-level nonnormality, which has many causes, such as nonnormal benchmark returns. Figure 4.3 plots two distributions of standardized OLS estimates of $\alpha$ as an example. Figure 4.3a has a bell shape roughly centered at 0, but with a long right tail. In Figure 4.3b, there is a mode on the left, not completely separated from the central bell. The histograms of the standardized OLS estimates for the other 10-year windows also show nonnormality with different shapes.

Figure 4.3: Histograms and density plots of standardized OLS estimates of fund $\alpha$'s during (a) 2002-2011 and (b) 2009-2018.

4.4 Degree of non-skillness (d-value)

4.4.1 Fund selection by multiple testing strategy

Recall that selection of the skilled mutual funds can be formulated as simultaneously testing
$$ H_{i0}: \alpha_i \le 0 \quad \text{vs.} \quad H_{ia}: \alpha_i > 0, \qquad i = 1,\dots,p, \tag{4.5} $$
where $p$ is the number of mutual funds during a certain period. The rejection of the $i$th null hypothesis ($\alpha_i \le 0$) represents the selection of the $i$th fund. Let $\Sigma^\star$ be the covariance matrix calculated in Section 4.3.3, let its diagonal elements be $\{\sigma_i^2\}_{i=1}^p$, and standardize $\Sigma^\star$ to a correlation matrix $\Sigma$. For notational convenience, let $Z = (\hat\alpha_1/\sigma_1,\dots,\hat\alpha_p/\sigma_p)^\top$ denote the vector of standardized OLS estimators, which avoids heteroscedasticity.
As shown by Theorem 8, the optimal multiple testing procedure can be constructed based on $P(\alpha_i \le 0 \mid Z)$ for $i = 1,\dots,p$, since the result in Theorem 8 is not restricted to the mean vector. We consider such conditional probabilities; that is, conditional on the data we have, how likely is it that each fund is not skilled? Intuitively, the smaller the value of $P(\alpha_i \le 0 \mid Z)$, the more likely the corresponding $\alpha_i$ is positive. Since we condition on the test statistics from all the mutual funds in our study, we do not lose the dependence information or the nonnormality structure. The construction of this conditional probability reflects our belief that even if we only look at a particular mutual fund, the information from all other funds is important for understanding that fund's performance. Let $Z_{(i)}$ be the vector of elements in $Z$ excluding $Z_i$; then $P(\alpha_i \le 0 \mid Z)$ can be connected to the local FDR defined in [36]:
$$ P(\alpha_i \le 0 \mid Z) = \frac{f(Z_{(i)} \mid Z_i, \alpha_i \le 0)}{f(Z_{(i)} \mid Z_i)} \times P(\alpha_i \le 0 \mid Z_i), \tag{4.6} $$
where $f(Z_{(i)} \mid Z_i, \alpha_i \le 0)$ and $f(Z_{(i)} \mid Z_i)$ are conditional density functions. The term $P(\alpha_i \le 0 \mid Z_i)$ is the local FDR, where the conditioning information is only $Z_i$. When evaluating the performance of the $i$th mutual fund, the local FDR focuses only on the data of this fund and ignores the information from other funds. In contrast, the first factor in (4.6) contains the dependence information that makes $P(\alpha_i \le 0 \mid Z)$ different from the local FDR. The conditional probability $P(\alpha_i \le 0 \mid Z)$ is the posterior probability of the null hypothesis given the test statistics, so it can be viewed as a generalized version of the local FDR. In the statistics literature, such posterior probabilities have been considered as significance measures when the test statistics possess special dependence structures; e.g., equal correlation is considered in [104], and time series dependence is considered in [122]. The references here are not exhaustive.
In our mutual fund selection problem, the test statistics possess a more general dependence that has not been well studied in the past. For ease of presentation, we call $P(\alpha_i \le 0 \mid Z)$ the degree of non-skillness (d-value) for the $i$th mutual fund. The calculation of $P(\alpha_i \le 0 \mid Z)$ requires delicate analysis, but the effort pays off, as shown by our analysis of the mutual fund data in Section 4.6. Further discussion of d-values is given in Section 4.4.

With d-values, we can perform the FDR control method (4.4) to select skilled funds. By Corollary 1, the monotonicity property of this procedure has an immediate advantage in the mutual fund selection problem. If we consider two $\theta$ levels $\theta_1 \le \theta_2$, then the selection set for $\theta_1$ belongs to the selection set for $\theta_2$ due to monotonicity. Suppose an investor selected skilled funds based on our procedure. She was cautious in the beginning and set $\theta = 0.1$. She then did further investigation and analysis of the selected funds. Later, she realized that her earlier selection was too conservative and would like to update the threshold $\theta$ to 0.15. Fortunately, she does not need to redo the analysis for the previously selected funds, because her new selection is only an expansion of her earlier choice. In contrast, the well-known Benjamini and Hochberg procedure [12] does not possess this property. More specifically, the BH procedure rejects the $k$ most significant tests for the largest $k$ such that $k^{-1}p_{(k)} \le \theta/p$, where $p_{(k)}$ is the $k$th ordered p-value. A larger $k$ might produce a smaller $k^{-1}p_{(k)}$, which violates monotonicity.

4.4.2 Mixture model for nonnormal and dependent test statistics

In this section, we propose an approach to calculate the d-value $P(\alpha_i \le 0 \mid Z)$ for the $i$th fund. Let $\mu_i = \alpha_i/\sigma_i$ for $i = 1,\dots,p$; then the signs of $\alpha_i$ and $\mu_i$ are the same, i.e., $P(\alpha_i \le 0 \mid Z) = P(\mu_i \le 0 \mid Z)$. As a result, we can calculate $P(\mu_i \le 0 \mid Z)$ for the d-value.
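The monotonicity advantage described above can be illustrated with a small sketch of a running-mean thresholding rule on d-values. We assume here, purely for illustration, that (4.4) has the same form as the rule (4.20) stated later for the unskilled side: sort the d-values and keep the largest $k$ whose $k$ smallest values average at most $\theta$.

```python
import numpy as np

def select_by_dvalue(d, theta):
    """Keep the largest k funds whose k smallest d-values have mean <= theta
    (the form of rule (4.20); assumed here as a stand-in for (4.4))."""
    order = np.argsort(d)
    running_mean = np.cumsum(d[order]) / np.arange(1, len(d) + 1)
    k = np.max(np.nonzero(running_mean <= theta)[0], initial=-1) + 1
    return set(order[:k])

d = np.array([0.01, 0.40, 0.05, 0.90, 0.12])
s1 = select_by_dvalue(d, 0.10)   # the cautious threshold
s2 = select_by_dvalue(d, 0.15)   # the relaxed threshold
print(s1 <= s2)  # True: the earlier selection is nested in the later one
```

Because the running mean of sorted d-values is nondecreasing in $k$, raising $\theta$ can only enlarge the selection set, which is exactly the nesting property the investor relies on.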
If the Carhart four-factor model fully explains the mutual fund data, then in the ideal setting where the random errors are Gaussian with known variances, $Z$ conditional on $\mu$ should follow a multivariate normal distribution. Based on our observation of the empirical distribution of $Z$, the nonnormality raises a challenge in modeling the data. This motivates us to propose an appropriate prior for $\mu$. Recall that Figure 4.3a has a heavy right tail, and Figure 4.3b has a bell shape centered roughly at 0 as well as a left mode. We consider a mixture prior to model the data structure as follows:
$$ Z \mid \mu \sim N_p(\mu, \Sigma), \qquad \mu_i \sim \pi_0\delta_{\nu_0} + \pi_1 N(\nu_1, \tau_1^2) + \pi_2 N(\nu_2, \tau_2^2), \quad i = 1,\dots,p, \tag{4.7} $$
where $\pi_0 + \pi_1 + \pi_2 = 1$ and $\delta_{\nu_0}$ denotes a point mass at $\nu_0$ with $\nu_0 \le 0$. Note that $\Sigma$ is a known covariance matrix given the data. Each $\mu_i$ is distributed as a point mass at $\nu_0$ with probability $\pi_0$ to capture the central bell shape, and as a normal distribution $N(\nu_1, \tau_1^2)$ with probability $\pi_1$ or $N(\nu_2, \tau_2^2)$ with probability $\pi_2$ to capture the right and left modes. The empirical distribution of $Z$ shows that the point mass $\nu_0$ is close to 0 but not necessarily exactly at 0. There are many well-established methods for mixture models with independent $Z_i \mid \mu_i$, in which case the joint likelihood decomposes as the product of the marginal likelihoods [97, 98, 50]. In our problem, this independence assumption is not satisfied. Therefore, the conventional expectation-maximization (EM) algorithm cannot be applied here to fit the parameters. To the best of our knowledge, estimation under mixture models with such general dependence is still an open question. To facilitate our analysis, we propose a parametric model here; a more flexible nonparametric modeling technique is of future research interest.
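Sampling from the mixture model (4.7) is straightforward. A minimal sketch using the parameter values that appear later in the simulation studies ($\pi_0 = 0.1$, $\nu_0 = 0$, $\nu_1 = -0.5$, $\nu_2 = 1.2$, $\tau_1^2 = \tau_2^2 = 0.1$) and, purely for brevity, an identity $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5000
pi = [0.1, 0.7, 0.2]               # (pi0, pi1, pi2)
nu0, nu1, nu2 = 0.0, -0.5, 1.2     # point mass and normal component means
tau1 = tau2 = np.sqrt(0.1)

# draw each mu_i from the three-part mixture prior in (4.7)
comp = rng.choice(3, size=p, p=pi)
mu = np.where(comp == 0, nu0,
     np.where(comp == 1, rng.normal(nu1, tau1, p), rng.normal(nu2, tau2, p)))

# Z | mu ~ N_p(mu, Sigma); identity Sigma keeps the sketch small,
# whereas the real Sigma carries the strong fund dependence
Z = mu + rng.normal(size=p)
print(Z.shape)  # (5000,)
```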
4.4.3 Fitting parameters in the mixture model

Recall that the mixture model (4.7) has eight unknown parameters: the proportions $\pi_0$, $\pi_1$, and $\pi_2$; the point mass $\nu_0$; and the parameters $\nu_1$, $\tau_1^2$, $\nu_2$, and $\tau_2^2$ of the two normal distributions. Among these, there are seven free parameters, since we have the restriction $\pi_0 + \pi_1 + \pi_2 = 1$. The degree of non-skillness $P(\mu_i \le 0 \mid Z)$ and the subsequent multiple testing procedure all rely on these parameters, which are unknown in practice. We propose a method called "approximate empirical Bayes" to estimate these parameters. Several challenging issues arise due to the dependence and the number of parameters. Motivated by [40], we can express $Z$ as an approximate factor model. More specifically, let $\lambda_1,\dots,\lambda_p$ be the non-increasing eigenvalues of $\Sigma$, and $\gamma_1,\dots,\gamma_p$ be the corresponding eigenvectors. For a positive integer $l$, we can write
$$ Z = \mu + CV + K, \tag{4.8} $$
where $C = (\sqrt{\lambda_1}\gamma_1,\dots,\sqrt{\lambda_l}\gamma_l)$, $V \sim N_l(0, I_l)$, and $K \sim N_p(0, A)$ with $A = \sum_{i=l+1}^p \lambda_i\gamma_i\gamma_i^\top$. When $l$ is appropriately chosen, $K$ is shown to be weakly dependent [40]. Here, we choose $l = \max\{i \mid \lambda_i > 1\}$, and the weak dependence holds because $\lambda_1$ is extremely large in our data analysis. Let $C = (c_1,\dots,c_p)^\top$ and let the diagonal elements of $A$ be $\eta_1^2,\dots,\eta_p^2$; then $\eta_i^2 = 1 - \|c_i\|^2$. Let $H = Z - CV - \nu_0\mathbf{1}$; then $H \sim N_p(\mu - \nu_0\mathbf{1}, A)$. By a location shift, we have
$$ \mu_i - \nu_0 \sim \pi_0\delta_0 + \pi_1 N(\nu_1 - \nu_0, \tau_1^2) + \pi_2 N(\nu_2 - \nu_0, \tau_2^2), \quad i = 1,\dots,p. $$
Let $u_1 = \nu_1 - \nu_0$ and $u_2 = \nu_2 - \nu_0$. Then, for $i = 1,\dots,p$, the first four moments of $H_i$ (the $i$th element of $H$) are
$$ \begin{aligned} EH_i &= \pi_1 u_1 + \pi_2 u_2, \\ EH_i^2 &= \pi_1 u_1^2 + \pi_2 u_2^2 + \eta_i^2 + \pi_1\tau_1^2 + \pi_2\tau_2^2, \\ EH_i^3 &= \pi_1 u_1^3 + \pi_2 u_2^3 + 3(\tau_1^2 + \eta_i^2)\pi_1 u_1 + 3(\tau_2^2 + \eta_i^2)\pi_2 u_2, \\ EH_i^4 &= \pi_1 u_1^4 + \pi_2 u_2^4 + 6(\tau_1^2 + \eta_i^2)\pi_1 u_1^2 + 6(\tau_2^2 + \eta_i^2)\pi_2 u_2^2 \\ &\quad + 3(\tau_1^2 + \eta_i^2)^2\pi_1 + 3(\tau_2^2 + \eta_i^2)^2\pi_2 + 3\eta_i^4\pi_0. \end{aligned} \tag{4.9} $$
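The eigen-split behind (4.8), with $l = \max\{i \mid \lambda_i > 1\}$, can be sketched as follows. The toy correlation matrix is illustrative only, and the last line checks the diagonal identity $\eta_i^2 = 1 - \|c_i\|^2$ that holds because $\Sigma$ is a correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 100
# toy correlation matrix with a few large eigenvalues (strong factors)
B = rng.normal(size=(p, 3)) * [5.0, 2.0, 1.5]
S = B @ B.T + np.eye(p)
d = 1.0 / np.sqrt(np.diag(S))
Sigma = S * np.outer(d, d)

lam, gamma = np.linalg.eigh(Sigma)
lam, gamma = lam[::-1], gamma[:, ::-1]          # non-increasing order

l = int(np.sum(lam > 1))                        # l = max{i : lambda_i > 1}
C = gamma[:, :l] * np.sqrt(lam[:l])             # loadings of the strong factors
A = (gamma[:, l:] * lam[l:]) @ gamma[:, l:].T   # weakly dependent remainder

eta2 = 1 - np.sum(C**2, axis=1)                 # eta_i^2 = 1 - ||c_i||^2
print(np.allclose(np.diag(A), eta2))  # True
```

By construction $\Sigma = CC^\top + A$ exactly; the approximation in (4.8) lies only in treating $K$ as weakly dependent noise.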
Normally we would need seven equations to solve for these seven free parameters. However, more equations increase the difficulty of solving the equation set, and higher moments of $H_i$ have more complicated expressions in the parameters. Therefore, we only focus on the first four moments of $H_i$ and propose an approximate empirical Bayes method to estimate the parameters. Algorithm 4 presents the method in detail. We point out that Step 12, solving the equation set (4.9), is not a trivial task; for brevity, we relegate the detailed derivation to Section 4.4.4. In Step 1 of the algorithm, note that $\nu_0$ is very close to 0; thus, smaller values of $|Z_i|$ tend to have $\mu_i$ close to 0 in expression (4.8). In Step 10, note that $\{H_i\}_{i=1}^p$ are weakly dependent, so the sample moments are expected to converge to the population moments in probability. We apply this approximate empirical Bayes method to estimate the parameters of the mixture model for the empirical distribution of $Z$ for every 10-year period from 2000-2009 to 2009-2018. Table 4.2 shows that the total variations between the empirical distribution of $Z$ and the distribution of the simulated data based on the fitted mixture model are all below 0.017, suggesting that our fits capture the nonnormal distribution well. Regarding $\nu_1$ and $\nu_2$, it is worth mentioning that we did not restrict $\nu_1 \le \nu_2$ in the fitting; thus, $\nu_1$ and $\nu_2$ can each be the right or the left mode. It takes about 3 minutes to fit the parameters for a 10-year period on a MacBook (time computed by the proc.time() function in R). While our method is more computationally intensive than some testing approaches [12, 117], the effort spent on this more elaborate modeling yields much better performance, as demonstrated in Sections 4.5 and 4.6.1. Figure 4.4 compares the shapes of the empirical distribution of $Z$ and of the simulated data based on the fitted parameters during 2002-2011 and 2009-2018.
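The reliance on sample moments in Step 10 can be sanity-checked in a small simulation: draw $H$ from the location-shifted mixture plus noise and compare its empirical moments with the first two formulas in (4.9). The parameter values below are illustrative, with a constant $\eta_i$ for simplicity.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 200_000
pi0, pi1, pi2 = 0.2, 0.5, 0.3
u1, u2 = -0.8, 1.5
tau1, tau2, eta = 0.3, 0.4, 1.0     # illustrative values; eta_i constant here

comp = rng.choice(3, size=p, p=[pi0, pi1, pi2])
shifted = np.where(comp == 0, 0.0,
          np.where(comp == 1, rng.normal(u1, tau1, p), rng.normal(u2, tau2, p)))
H = shifted + rng.normal(0, eta, p)  # H_i = (mu_i - nu_0) + N(0, eta^2) noise

# first two moment formulas from (4.9)
m1 = pi1 * u1 + pi2 * u2
m2 = pi1 * u1**2 + pi2 * u2**2 + eta**2 + pi1 * tau1**2 + pi2 * tau2**2
print(abs(H.mean() - m1) < 0.02, abs((H**2).mean() - m2) < 0.05)
```

With $p$ this large the sample moments match the formulas to within the stated tolerances, which is the convergence that Step 10 exploits.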
It is clear that the nonnormality phenomenon exists, so a p-value calculation based on the normality assumption is not sufficient here. Our proposed mixture model, on the other hand, can capture this structure. Looking at the empirical distributions, a natural question is whether we should select the funds located in the right tail. From a p-value perspective, mutual funds in the right tail are more extreme

Algorithm 4: Parameter Estimation Algorithm
Input: $Z$, the vector of standardized OLS estimators; $p$, the length of $Z$; $C$, the matrix $(\sqrt{\lambda_1}\gamma_1,\dots,\sqrt{\lambda_l}\gamma_l)$; $\Sigma$, the covariance matrix of $Z$; $D_m$, a search region for $m$ with an increment of 5; $D_{\nu_0}$, a search region for $\nu_0$ with an increment of 0.1; $D_{(\tau_1^2,\tau_2^2)}$, a search region for $(\tau_1^2,\tau_2^2)$ with an increment of 0.01 for both.
1. $Z_{abs} \leftarrow \mathrm{sort}(|Z|)$  // sort the absolute values $|Z_i|$ in increasing order
2. $TV_{comp} \leftarrow 1$
3. for $m$ in $D_m$ do
4.   index $\leftarrow \{i \mid |Z_i| \le Z_{abs}[\mathrm{as.integer}(p \times m\%)]\}$
5.   $Z_{part} \leftarrow Z[\mathrm{index}]$  // select the $m\%$ of the $Z_i$ with the smallest absolute values
6.   $C_{part} \leftarrow C[\mathrm{index},]$  // select the $c_i$ corresponding to the selected $Z_i$
7.   $\hat V \leftarrow$ the coefficient vector of L1regression($Z_{part} \sim C_{part}$)
8.   for $\nu_0$ in $D_{\nu_0}$ do
9.     $\hat H \leftarrow Z - C\hat V - \nu_0\mathbf{1}$  // estimate $H = Z - CV - \nu_0\mathbf{1}$
10.    $(EH_i, EH_i^2, EH_i^3, EH_i^4) \leftarrow \big(\sum_{i=1}^p \hat H_i/p,\; \sum_{i=1}^p \hat H_i^2/p,\; \sum_{i=1}^p \hat H_i^3/p,\; \sum_{i=1}^p \hat H_i^4/p\big)$
11.    for $(\tau_1^2, \tau_2^2)$ in $D_{(\tau_1^2,\tau_2^2)}$ do
12.      $(\hat\pi_0, \hat\pi_1, \hat\pi_2, \hat u_1, \hat u_2) \leftarrow$ solve (4.9) based on $(EH_i, EH_i^2, EH_i^3, EH_i^4, \tau_1^2, \tau_2^2)$
13.      $(\hat\nu_1, \hat\nu_2) \leftarrow (\hat u_1 + \nu_0, \hat u_2 + \nu_0)$
14.      $\tilde Z \leftarrow$ simulated data following $N_p(\hat\mu, \Sigma)$ where $\hat\mu_i \sim \hat\pi_0\delta_{\nu_0} + \hat\pi_1 N(\hat\nu_1, \tau_1^2) + \hat\pi_2 N(\hat\nu_2, \tau_2^2)$, $i = 1,\dots,p$
15.      $TV \leftarrow$ the total variation between $Z$ and $\tilde Z$
16.      if $TV < TV_{comp}$ then
17.        $TV_{comp} \leftarrow TV$  // keep the smallest total variation
18.        $SET_{best} \leftarrow (m, \tau_1^2, \tau_2^2, \nu_0, \hat\pi_0, \hat\pi_1, \hat\pi_2, \hat\nu_1, \hat\nu_2)$
19.      end
20.    end
21.  end
22. end
Output: $SET_{best}$

Table 4.2: Fitted parameters, the proportion of data for L1
regression, and the total variation (TV) for 10-year periods.

Period      m%   τ1^2  τ2^2  ν0    π0     π1        π2     ν1       ν2      TV
2000-2009   15   .11   .16   .00   .1327  7.55E-02  .7918  1.5412   .2625   .0082
2001-2010   30   .10   .16   -.20  .1612  7.00E-02  .7687  1.3997   .0087   .0151
2002-2011   20   .15   .15   -.10  .1552  8.53E-02  .7595  1.1814   .0402   .0067
2003-2012   25   .14   .14   -.20  .2173  6.87E-02  .7139  .7895    -.0935  .0050
2004-2013   25   .15   .15   -.20  .2384  5.73E-03  .7559  1.2090   -.1341  .0077
2005-2014   30   .14   .14   -.10  .2605  7.96E-01  .0182  -.2084   .8643   .0092
2006-2015   35   .12   .17   .00   .1471  2.02E-05  .8529  1.5978   -.1206  .0054
2007-2016   30   .11   .21   .00   .2381  5.50E-05  .7619  -1.126   -.2885  .0106
2008-2017   40   .12   .26   -.10  .2691  2.29E-04  .7307  -1.0716  -.4247  .0115
2009-2018   30   .12   .15   -.10  .0759  1.21E-01  .8035  -1.2035  -.2418  .0169

Figure 4.4: Histograms and density plots of $Z$ and the simulated data based on the parameter-fitting results from our approximate empirical Bayes method (AEB) during (a) 2002-2011 and (b) 2009-2018.

observations and should be selected as significant. However, the d-values from our testing procedure behave quite differently: the funds in the right tail are not necessarily selected by our procedure, and they do not necessarily have persistent performance in the subsequent years. We demonstrate this issue in more detail in Section 4.6.2.

4.4.4 Solving Equation Set in (4.9)

First, we consider $\tau_1 \ne \tau_2$. According to (4.9) and $\pi_0 + \pi_1 + \pi_2 = 1$,
$$ \begin{aligned} EH_i &= \pi_1 u_1 + \pi_2 u_2, \\ EH_i^2 &= \pi_1 u_1^2 + \pi_2 u_2^2 + \eta_i^2 + (1 - \pi_0)\tau_1^2 + (\tau_2^2 - \tau_1^2)\pi_2, \\ EH_i^3 &= \pi_1 u_1^3 + \pi_2 u_2^3 + 3(\tau_1^2 + \eta_i^2)(\pi_1 u_1 + \pi_2 u_2) + 3(\tau_2^2 - \tau_1^2)\pi_2 u_2, \\ EH_i^4 &= \pi_1 u_1^4 + \pi_2 u_2^4 + 6(\tau_1^2 + \eta_i^2)(\pi_1 u_1^2 + \pi_2 u_2^2) + 3(\tau_1^2 + \eta_i^2)^2 - 3\pi_0\tau_1^4 - 6\pi_0\tau_1^2\eta_i^2 \\ &\quad + 6(\tau_2^2 - \tau_1^2)\pi_2 u_2^2 + 3\big[(\tau_2^2 + \eta_i^2)^2 - (\tau_1^2 + \eta_i^2)^2\big]\pi_2. \end{aligned} $$
Therefore, we have the following equations:
$$ \begin{aligned} \pi_1 u_1 + \pi_2 u_2 &= EH_i, \\ \pi_1 u_1^2 + \pi_2 u_2^2 - \pi_0\tau_1^2 + (\tau_2^2 - \tau_1^2)\pi_2 &= EH_i^2 - \eta_i^2 - \tau_1^2, \\ \pi_1 u_1^3 + \pi_2 u_2^3 + 3(\tau_2^2 - \tau_1^2)\pi_2 u_2 &= EH_i^3 - 3(\tau_1^2 + \eta_i^2)EH_i, \\ \pi_1 u_1^4 + \pi_2 u_2^4 + 3\pi_0\tau_1^4 + 6(\tau_2^2 - \tau_1^2)\pi_2 u_2^2 + 3(\tau_2^2 - \tau_1^2)^2\pi_2 &= EH_i^4 - 3(\tau_1^2 + \eta_i^2)^2 - 6(\tau_1^2 + \eta_i^2)(EH_i^2 - \eta_i^2 - \tau_1^2). \end{aligned} $$
Let $\delta = \tau_2^2 - \tau_1^2$, $a = EH_i$, $b_1 = EH_i^2 - \eta_i^2 - \tau_1^2$, $c_1 = EH_i^3 - 3(\tau_1^2 + \eta_i^2)EH_i$, and $d_1 = EH_i^4 - 3(\tau_1^2 + \eta_i^2)^2 - 6(\tau_1^2 + \eta_i^2)(EH_i^2 - \eta_i^2 - \tau_1^2)$. We need to solve the system of equations
$$ \begin{cases} \pi_1 u_1 + \pi_2 u_2 = a \\ \pi_1 u_1^2 + \pi_2 u_2^2 - \pi_0\tau_1^2 + \delta\pi_2 = b_1 \\ \pi_1 u_1^3 + \pi_2 u_2^3 + 3\delta\pi_2 u_2 = c_1 \\ \pi_1 u_1^4 + \pi_2 u_2^4 + 3\pi_0\tau_1^4 + 6\delta\pi_2 u_2^2 + 3\delta^2\pi_2 = d_1 \\ \pi_0 + \pi_1 + \pi_2 = 1. \end{cases} \tag{4.10} $$
Let $b_2 = EH_i^2 - \eta_i^2 - \tau_2^2$, $c_2 = EH_i^3 - 3(\tau_2^2 + \eta_i^2)EH_i$, and $d_2 = EH_i^4 - 3(\tau_2^2 + \eta_i^2)^2 - 6(\tau_2^2 + \eta_i^2)(EH_i^2 - \eta_i^2 - \tau_2^2)$. Similarly, we have
$$ \begin{cases} \pi_1 u_1 + \pi_2 u_2 = a \\ \pi_1 u_1^2 + \pi_2 u_2^2 - \pi_0\tau_2^2 - \delta\pi_1 = b_2 \\ \pi_1 u_1^3 + \pi_2 u_2^3 - 3\delta\pi_1 u_1 = c_2 \\ \pi_1 u_1^4 + \pi_2 u_2^4 + 3\pi_0\tau_2^4 - 6\delta\pi_1 u_1^2 + 3\delta^2\pi_1 = d_2 \\ \pi_0 + \pi_1 + \pi_2 = 1. \end{cases} \tag{4.11} $$
Then we have
$$ (b_1 + b_2)/2 = \pi_1 u_1^2 + \pi_2 u_2^2 - (\tau_1^2 + \tau_2^2)\pi_0 + \delta(\pi_2 - \pi_1). $$
Also,
$$ \begin{aligned} d_1 - d_2 &= 3(\tau_1^4 - \tau_2^4)\pi_0 + 6\delta(\pi_1 u_1^2 + \pi_2 u_2^2) + 3\delta^2(\pi_2 - \pi_1) \\ &= 3(\tau_1^4 - \tau_2^4)\pi_0 + 6\delta\big[(b_1 + b_2)/2 + (\tau_1^2 + \tau_2^2)\pi_0 - \delta(\pi_2 - \pi_1)\big] + 3\delta^2(\pi_2 - \pi_1) \\ &= 3(\tau_2^4 - \tau_1^4)\pi_0 + 3\delta(b_1 + b_2) - 3\delta^2(\pi_2 - \pi_1), \end{aligned} $$
which implies
$$ \pi_2 - \pi_1 = \big[3(\tau_2^4 - \tau_1^4)\pi_0 + 3\delta(b_1 + b_2) - (d_1 - d_2)\big]/(3\delta^2) = (\tau_2^2 + \tau_1^2)\pi_0/\delta + (b_1 + b_2)/\delta - (d_1 - d_2)/(3\delta^2). $$
Therefore, we have
$$ \begin{cases} \pi_1 + \pi_2 = 1 - \pi_0 \\ \pi_2 - \pi_1 = (\tau_2^2 + \tau_1^2)\pi_0/\delta + (b_1 + b_2)/\delta - (d_1 - d_2)/(3\delta^2). \end{cases} $$
This implies
$$ \begin{cases} \pi_1 = -\tau_2^2\pi_0/\delta + \big[1 - (b_1 + b_2)/\delta + (d_1 - d_2)/(3\delta^2)\big]/2 \\ \pi_2 = \tau_1^2\pi_0/\delta + \big[1 + (b_1 + b_2)/\delta - (d_1 - d_2)/(3\delta^2)\big]/2. \end{cases} \tag{4.12} $$
If we could write everything as a function of $\pi_0$, then we could solve an equation in $\pi_0$ first.
We denote (4.10) as
$$ \begin{cases} \pi_1 u_1 + \pi_2 u_2 = a \\ \pi_1 u_1^2 + \pi_2 u_2^2 + \delta\pi_2 = b \\ \pi_1 u_1^3 + \pi_2 u_2^3 + 3\delta\pi_2 u_2 = c \\ \pi_1 u_1^4 + \pi_2 u_2^4 + 6\delta\pi_2 u_2^2 + 3\delta^2\pi_2 = d \\ \pi_1 + \pi_2 = k, \end{cases} $$
where $b = b_1 + \pi_0\tau_1^2$, $c = c_1$, $d = d_1 - 3\pi_0\tau_1^4$, and $k = 1 - \pi_0$. Therefore $a, b, c, d, k$ are all constants or functions of $\pi_0$. Note that
$$ \begin{aligned} a - u_2 k &= \pi_1(u_1 - u_2), \\ b - u_2 a &= \pi_1 u_1(u_1 - u_2) + \delta\pi_2, \\ c - u_2 b &= \pi_1 u_1^2(u_1 - u_2) + 2\delta\pi_2 u_2, \\ d - u_2 c &= \pi_1 u_1^3(u_1 - u_2) + 3\delta\pi_2 u_2^2 + 3\delta^2\pi_2. \end{aligned} $$
Therefore, from
$$ \frac{b - u_2 a - \delta\pi_2}{a - u_2 k} = u_1 = \frac{a - \pi_2 u_2}{\pi_1}, $$
we obtain
$$ \begin{aligned} b\pi_1 - a\pi_1 u_2 - \delta\pi_1\pi_2 &= a^2 - a\pi_2 u_2 - ak u_2 + k\pi_2 u_2^2, \\ \pi_2 u_2^2 &= \big[b\pi_1 - a^2 - \delta\pi_1\pi_2 - a\pi_1 u_2 + a\pi_2 u_2 + a(\pi_1 + \pi_2)u_2\big]/k \\ &= \big[b\pi_1 - a^2 - \delta\pi_1\pi_2 + 2a\pi_2 u_2\big]/k. \end{aligned} $$
Also, from
$$ \frac{c - u_2 b - 2\delta\pi_2 u_2}{b - u_2 a - \delta\pi_2} = \frac{a - \pi_2 u_2}{\pi_1}, $$
we obtain
$$ \begin{aligned} c\pi_1 - b\pi_1 u_2 - 2\delta\pi_1\pi_2 u_2 &= ab - a^2 u_2 - a\delta\pi_2 - b\pi_2 u_2 + a\pi_2 u_2^2 + \delta\pi_2^2 u_2, \\ c\pi_1 - ab + a\delta\pi_2 &= (b\pi_1 + 2\delta\pi_1\pi_2 - a^2 - b\pi_2 + \delta\pi_2^2)u_2 + a\pi_2 u_2^2 \\ &= (b\pi_1 + 2\delta\pi_1\pi_2 - a^2 - b\pi_2 + \delta\pi_2^2)u_2 + a\big[b\pi_1 - a^2 - \delta\pi_1\pi_2 + 2a\pi_2 u_2\big]/k, \end{aligned} $$
so that
$$ \big[k(b\pi_1 + 2\delta\pi_1\pi_2 - a^2 - b\pi_2 + \delta\pi_2^2) + 2a^2\pi_2\big]u_2 = k(c\pi_1 - ab + a\delta\pi_2) - a(b\pi_1 - a^2 - \delta\pi_1\pi_2). $$
Therefore,
$$ u_2 = \frac{k(c\pi_1 - ab + a\delta\pi_2) - a(b\pi_1 - a^2 - \delta\pi_1\pi_2)}{k(b\pi_1 + 2\delta\pi_1\pi_2 - a^2 - b\pi_2 + \delta\pi_2^2) + 2a^2\pi_2}. \tag{4.13} $$
Similarly, from
$$ \frac{d - u_2 c - 3\delta\pi_2 u_2^2 - 3\delta^2\pi_2}{c - u_2 b - 2\delta\pi_2 u_2} = \frac{a - \pi_2 u_2}{\pi_1}, $$
we obtain
$$ \begin{aligned} d\pi_1 - c\pi_1 u_2 - 3\delta\pi_1\pi_2 u_2^2 - 3\delta^2\pi_1\pi_2 &= ac - ab u_2 - 2a\delta\pi_2 u_2 - c\pi_2 u_2 + b\pi_2 u_2^2 + 2\delta\pi_2^2 u_2^2, \\ d\pi_1 - 3\delta^2\pi_1\pi_2 - ac &= (c\pi_1 - ab - 2a\delta\pi_2 - c\pi_2)u_2 + (3\delta\pi_1 + b + 2\delta\pi_2)\pi_2 u_2^2 \\ &= (c\pi_1 - ab - 2a\delta\pi_2 - c\pi_2)u_2 + (3\delta\pi_1 + b + 2\delta\pi_2)\big(b\pi_1 - a^2 - \delta\pi_1\pi_2 + 2a\pi_2 u_2\big)/k. \end{aligned} $$
We then have
$$ u_2 = \frac{k(d\pi_1 - 3\delta^2\pi_1\pi_2 - ac) - (3\delta\pi_1 + b + 2\delta\pi_2)(b\pi_1 - a^2 - \delta\pi_1\pi_2)}{k(c\pi_1 - ab - 2a\delta\pi_2 - c\pi_2) + 2a\pi_2(3\delta\pi_1 + b + 2\delta\pi_2)}. \tag{4.14} $$
Equating (4.13) and (4.14) yields
$$ \frac{k(d\pi_1 - 3\delta^2\pi_1\pi_2 - ac) - (3\delta\pi_1 + b + 2\delta\pi_2)(b\pi_1 - a^2 - \delta\pi_1\pi_2)}{k(c\pi_1 - ab - 2a\delta\pi_2 - c\pi_2) + 2a\pi_2(3\delta\pi_1 + b + 2\delta\pi_2)} = \frac{k(c\pi_1 - ab + a\delta\pi_2) - a(b\pi_1 - a^2 - \delta\pi_1\pi_2)}{k(b\pi_1 + 2\delta\pi_1\pi_2 - a^2 - b\pi_2 + \delta\pi_2^2) + 2a^2\pi_2}. $$
Replacing $\pi_1$ and $\pi_2$ by functions of $\pi_0$ according to (4.12), this becomes an equation in $\pi_0$, since $a, b, c, d, k$ are all constants or functions of $\pi_0$. Solving this equation gives $\pi_0$. We can then calculate $\pi_1$, $\pi_2$, and $u_2$ according to (4.12) and (4.13), and obtain $u_1$ from
$$ u_1 = \frac{d - u_2 c - 3\delta\pi_2 u_2^2 - 3\delta^2\pi_2}{c - u_2 b - 2\delta\pi_2 u_2}. $$
This gives the estimates $\hat\pi_0, \hat\pi_1, \hat\pi_2, \hat u_1, \hat u_2$, and we set $\hat\nu_1 = \hat u_1 + \nu_0$ and $\hat\nu_2 = \hat u_2 + \nu_0$.

For the case $\tau_1 = \tau_2$, we consider the following method. Let $\tau = \tau_1$. Then we have the equations
$$ \begin{aligned} \pi_1 u_1 + \pi_2 u_2 &= EH_i, \\ \pi_1 u_1^2 + \pi_2 u_2^2 - \pi_0\tau^2 &= EH_i^2 - \eta_i^2 - \tau^2, \\ \pi_1 u_1^3 + \pi_2 u_2^3 &= EH_i^3 - 3(\tau^2 + \eta_i^2)EH_i, \\ \pi_1 u_1^4 + \pi_2 u_2^4 + 3\pi_0\tau^4 &= EH_i^4 - 3(\tau^2 + \eta_i^2)^2 - 6(\tau^2 + \eta_i^2)(EH_i^2 - \eta_i^2 - \tau^2). \end{aligned} $$
Let $a = EH_i$, $b = EH_i^2 - \eta_i^2 - \tau^2$, $c = EH_i^3 - 3(\tau^2 + \eta_i^2)EH_i$, and $d = EH_i^4 - 3(\tau^2 + \eta_i^2)^2 - 6(\tau^2 + \eta_i^2)(EH_i^2 - \eta_i^2 - \tau^2)$. We need to solve the system of equations
$$ \begin{cases} \pi_1 u_1 + \pi_2 u_2 = a \\ \pi_1 u_1^2 + \pi_2 u_2^2 - \pi_0\tau^2 = b \\ \pi_1 u_1^3 + \pi_2 u_2^3 = c \\ \pi_1 u_1^4 + \pi_2 u_2^4 + 3\pi_0\tau^4 = d \\ \pi_0 + \pi_1 + \pi_2 = 1. \end{cases} \tag{4.15} $$
Then
$$ u_2 = \frac{c - u_1(b + \pi_0\tau^2)}{(b + \pi_0\tau^2) - u_1 a} = \frac{(d - 3\pi_0\tau^4) - u_1 c}{c - u_1(b + \pi_0\tau^2)}, $$
which implies
$$ \big[(b + \tau^2\pi_0)^2 - ac\big]u_1^2 + \big[a(d - 3\tau^4\pi_0) - c(b + \tau^2\pi_0)\big]u_1 + c^2 - (d - 3\tau^4\pi_0)(b + \tau^2\pi_0) = 0. $$
We denote $A(\pi_0) = (b + \tau^2\pi_0)^2 - ac$, $B(\pi_0) = a(d - 3\tau^4\pi_0) - c(b + \tau^2\pi_0)$, and $C(\pi_0) = c^2 - (d - 3\tau^4\pi_0)(b + \tau^2\pi_0)$. Then
$$ u_1 = \frac{-B(\pi_0) \pm \sqrt{B^2(\pi_0) - 4A(\pi_0)C(\pi_0)}}{2A(\pi_0)}. $$
In fact, by a similar computation the two roots for $u_1$ are the same as the two roots for $u_2$. We assume $u_1 \ne u_2$.
Without loss of generality, we let
$$ u_1 = \frac{-B(\pi_0) - \sqrt{B^2(\pi_0) - 4A(\pi_0)C(\pi_0)}}{2A(\pi_0)}, \qquad u_2 = \frac{-B(\pi_0) + \sqrt{B^2(\pi_0) - 4A(\pi_0)C(\pi_0)}}{2A(\pi_0)}. $$
Also, according to the equation set (4.15), we have
$$ \pi_1 u_1(u_1 - u_2) = b + \pi_0\tau^2 - u_2 a \implies \pi_1 = \frac{b + \pi_0\tau^2 - u_2 a}{u_1(u_1 - u_2)}. $$
Similarly,
$$ \pi_2 = \frac{b + \pi_0\tau^2 - u_1 a}{u_2(u_2 - u_1)}. $$
Therefore,
$$ \begin{aligned} &\frac{b + \pi_0\tau^2 - u_2 a}{u_1(u_1 - u_2)} + \frac{b + \pi_0\tau^2 - u_1 a}{u_2(u_2 - u_1)} + \pi_0 = 1 \\ \implies\; &(u_2 - u_1)(b + \pi_0\tau^2) + (u_1^2 - u_2^2)a + u_1 u_2(u_1 - u_2)(\pi_0 - 1) = 0 \\ \implies\; &-(b + \pi_0\tau^2) + (u_1 + u_2)a + u_1 u_2(\pi_0 - 1) = 0 \\ \implies\; &-(b + \pi_0\tau^2) - \frac{B(\pi_0)a}{A(\pi_0)} + \frac{C(\pi_0)}{A(\pi_0)}(\pi_0 - 1) = 0 \\ \implies\; &-A(\pi_0)(b + \pi_0\tau^2) - B(\pi_0)a + C(\pi_0)(\pi_0 - 1) = 0. \end{aligned} $$
This means that we need to solve
$$ c_3\pi_0^3 + c_2\pi_0^2 + c_1\pi_0 + c_0 = 0, $$
where
$$ \begin{aligned} c_3 &= 2\tau^6, \\ c_2 &= -3\tau^6 - d\tau^2, \\ c_1 &= 3(a^2 - b)\tau^4 + (2ac - 3b^2 + d)\tau^2 + c^2 - bd, \\ c_0 &= -b^3 + 2abc - a^2 d - c^2 + bd. \end{aligned} $$
We select the roots in $(0,1)$. Then we obtain the estimators $\hat\pi_0, \hat\pi_1, \hat\pi_2, \hat u_1$, and $\hat u_2$, with $\hat\nu_1 = \hat u_1 + \nu_0$ and $\hat\nu_2 = \hat u_2 + \nu_0$.

4.4.5 Calculating degree of non-skillness

Following the notation of Section 4.4.3, let $\lambda_1,\dots,\lambda_p$ be the non-increasing eigenvalues of the covariance matrix $\Sigma$, and $\gamma_1,\dots,\gamma_p$ the corresponding eigenvectors. For a positive definite matrix $\Sigma$, the smallest eigenvalue $\lambda_p$ is positive. This helps us construct a strict factor model expression for the test statistics. More specifically, $\Sigma = \sum_{i=1}^p \lambda_i\gamma_i\gamma_i^\top$, and we rearrange this summation as
$$ \Sigma = \sum_{i=1}^p (\lambda_i - \lambda_p)\gamma_i\gamma_i^\top + \lambda_p\sum_{i=1}^p \gamma_i\gamma_i^\top = \sum_{i=1}^p (\lambda_i - \lambda_p)\gamma_i\gamma_i^\top + \lambda_p I. $$
Therefore, the original data $Z$ can be stochastically expressed as
$$ Z = \mu + BW + \xi, \tag{4.16} $$
where $B = (\sqrt{\lambda_1 - \lambda_p}\,\gamma_1,\dots,\sqrt{\lambda_{p-1} - \lambda_p}\,\gamma_{p-1})$, $W \sim N_{p-1}(0, I_{p-1})$, and $\xi \sim N_p(0, \lambda_p I_p)$. Note that if $\lambda_j = \lambda_p$ for some $j$, which corresponds to the spiked covariance model, the dimension of $W$ reduces to $(j-1)$, and the subsequent calculation is greatly simplified. The expression in (4.16) is different from expression (4.8)
in Section 4.4.2: here the elements of $\xi$ are independent, while the random errors in (4.8) are only weakly dependent. Denote the row vectors of $B$ by $b_i$ for $i = 1,\dots,p$. For notational convenience, we use $dN(\mu, \sigma^2)$ to denote the probability density function of $N(\mu, \sigma^2)$. The density of $Z$ can be expressed as
$$ \begin{aligned} f(Z) &= \int\cdots\int f(Z \mid \mu) \prod_{j=1}^p \big[\pi_0\delta_{\nu_0} + \pi_1\,dN(\nu_1, \tau_1^2) + \pi_2\,dN(\nu_2, \tau_2^2)\big]\, d\mu_1 \cdots d\mu_p \\ &= \int\cdots\int \Big[\int f(Z \mid W, \mu) f(W)\, dW\Big] \prod_{j=1}^p \big[\pi_0\delta_{\nu_0} + \pi_1\,dN(\nu_1, \tau_1^2) + \pi_2\,dN(\nu_2, \tau_2^2)\big]\, d\mu_1 \cdots d\mu_p \\ &= \int \prod_{i=1}^p \Big[\int dN(\mu_i + b_i W, \lambda_p)\big[\pi_0\delta_{\nu_0} + \pi_1\,dN(\nu_1, \tau_1^2) + \pi_2\,dN(\nu_2, \tau_2^2)\big]\, d\mu_i\Big] f(W)\, dW \\ &= E_W\Big\{\prod_{i=1}^p \big[\pi_0\,dN(\nu_0 + b_i W, \lambda_p) + \pi_1\,dN(\nu_1 + b_i W, \tau_1^2 + \lambda_p) + \pi_2\,dN(\nu_2 + b_i W, \tau_2^2 + \lambda_p)\big]\Big\}. \end{aligned} \tag{4.17} $$
The last expectation is with respect to $W \sim N_{p-1}(0, I_{p-1})$. Similarly, $\int_{\mu_i \le 0} f(Z, \mu_i)\, d\mu_i$ can be expressed as
$$ \begin{aligned} \int_{\mu_i \le 0} f(Z, \mu_i)\, d\mu_i = E_W\Big\{ &\prod_{j=1}^p \big[\pi_0\,dN(\nu_0 + b_j W, \lambda_p) + \pi_1\,dN(\nu_1 + b_j W, \tau_1^2 + \lambda_p) + \pi_2\,dN(\nu_2 + b_j W, \tau_2^2 + \lambda_p)\big] \\ &\times \frac{\pi_0\,dN(\nu_0 + b_i W, \lambda_p) + \pi_1 G_{\tau_1^2,\lambda_p}(\nu_1, b_i W, Z_i) + \pi_2 G_{\tau_2^2,\lambda_p}(\nu_2, b_i W, Z_i)}{\pi_0\,dN(\nu_0 + b_i W, \lambda_p) + \pi_1\,dN(\nu_1 + b_i W, \tau_1^2 + \lambda_p) + \pi_2\,dN(\nu_2 + b_i W, \tau_2^2 + \lambda_p)}\Big\}, \end{aligned} \tag{4.18} $$
where
$$ G_{\tau^2,\lambda_p}(\mu_0, b_i W, Z_i) = \frac{1}{\sqrt{2\pi(\lambda_p + \tau^2)}} \exp\Big(\frac{\beta_i^2}{2\sigma^2} - \frac{(Z_i - b_i W)^2}{2\lambda_p} - \frac{\mu_0^2}{2\tau^2}\Big)\,\Phi\Big(-\frac{\beta_i}{\sigma}\Big). \tag{4.19} $$

Derivation of Eq. (4.19). We start with
$$ G_{\tau^2,\lambda_p}(\mu_0, b_i W, Z_i) = \int_{\mu_i \le 0} \frac{1}{2\pi\tau\sqrt{\lambda_p}} \exp\Big(-\frac{(Z_i - \mu_i - b_i W)^2}{2\lambda_p}\Big) \exp\Big(-\frac{(\mu_i - \mu_0)^2}{2\tau^2}\Big)\, d\mu_i. $$
We have
$$ \exp\Big(-\frac{(Z_i - \mu_i - b_i W)^2}{2\lambda_p}\Big) \exp\Big(-\frac{(\mu_i - \mu_0)^2}{2\tau^2}\Big) = \exp\Big(-\frac{(\lambda_p + \tau^2)\mu_i^2 - 2\big[\tau^2(Z_i - b_i W) + \mu_0\lambda_p\big]\mu_i}{2\tau^2\lambda_p}\Big) \exp\Big(-\frac{(Z_i - b_i W)^2}{2\lambda_p} - \frac{\mu_0^2}{2\tau^2}\Big). $$
Let
$$ \beta_i = \frac{\tau^2(Z_i - b_i W) + \mu_0\lambda_p}{\lambda_p + \tau^2}, \qquad \sigma^2 = \frac{\tau^2\lambda_p}{\lambda_p + \tau^2}. $$
Then
$$ \exp\Big(-\frac{(Z_i - \mu_i - b_i W)^2}{2\lambda_p}\Big) \exp\Big(-\frac{(\mu_i - \mu_0)^2}{2\tau^2}\Big) = \exp\Big(-\frac{\mu_i^2 - 2\beta_i\mu_i + \beta_i^2}{2\sigma^2}\Big) \exp\Big(\frac{\beta_i^2}{2\sigma^2}\Big) \exp\Big(-\frac{(Z_i - b_i W)^2}{2\lambda_p} - \frac{\mu_0^2}{2\tau^2}\Big). $$
Therefore,
$$ \begin{aligned} G_{\tau^2,\lambda_p}(\mu_0, b_i W, Z_i) &= \frac{1}{\sqrt{2\pi(\lambda_p + \tau^2)}} \exp\Big(\frac{\beta_i^2}{2\sigma^2} - \frac{(Z_i - b_i W)^2}{2\lambda_p} - \frac{\mu_0^2}{2\tau^2}\Big) \int_{\mu_i \le 0} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(\mu_i - \beta_i)^2}{2\sigma^2}\Big)\, d\mu_i \\ &= \frac{1}{\sqrt{2\pi(\lambda_p + \tau^2)}} \exp\Big(\frac{\beta_i^2}{2\sigma^2} - \frac{(Z_i - b_i W)^2}{2\lambda_p} - \frac{\mu_0^2}{2\tau^2}\Big)\,\Phi\Big(-\frac{\beta_i}{\sigma}\Big). \end{aligned} $$
Correspondingly, the degree of non-skillness is $P(\mu_i \le 0 \mid Z) = \int_{\mu_i \le 0} f(Z, \mu_i)\, d\mu_i / f(Z)$.

So far, we have focused on selecting the skilled funds. Identifying unskilled funds is also important in practice, because such a selection can help investors avoid substantial losses. Similar to the multiple testing problem (4.5), we can also test $H_{0i}: \alpha_i \ge 0$ vs. $H_{ai}: \alpha_i < 0$, for $i = 1,\dots,p$. We compute $P(\alpha_i \ge 0 \mid Z) = P(\mu_i \ge 0 \mid Z) = \int_{\mu_i \ge 0} f(Z, \mu_i)\, d\mu_i / f(Z)$. A smaller value of $P(\alpha_i \ge 0 \mid Z)$ indicates that $\alpha_i$ is more likely to be negative. We sort the $P(\alpha_i \ge 0 \mid Z)$ from smallest to largest, and denote by $\mathrm{LOS}_{(1)},\dots,\mathrm{LOS}_{(p)}$ their non-decreasing order. Given a $\theta$ level, we choose the largest $l$ such that
$$ l^{-1}\sum_{i=1}^l \mathrm{LOS}_{(i)} \le \theta. \tag{4.20} $$
For the expression of $\int_{\mu_i \ge 0} f(Z, \mu_i)\, d\mu_i$, we have
$$ \begin{aligned} \int_{\mu_i \ge 0} f(Z, \mu_i)\, d\mu_i = E_W\Big\{ &\prod_{j=1}^p \big[\pi_0\,dN(\nu_0 + b_j W, \lambda_p) + \pi_1\,dN(\nu_1 + b_j W, \tau_1^2 + \lambda_p) + \pi_2\,dN(\nu_2 + b_j W, \tau_2^2 + \lambda_p)\big] \\ &\times \frac{\mathbf{1}\{\nu_0 = 0\}\pi_0\,dN(\nu_0 + b_i W, \lambda_p) + \pi_1 Q_{\tau_1^2,\lambda_p}(\nu_1, b_i W, Z_i) + \pi_2 Q_{\tau_2^2,\lambda_p}(\nu_2, b_i W, Z_i)}{\pi_0\,dN(\nu_0 + b_i W, \lambda_p) + \pi_1\,dN(\nu_1 + b_i W, \tau_1^2 + \lambda_p) + \pi_2\,dN(\nu_2 + b_i W, \tau_2^2 + \lambda_p)}\Big\}, \end{aligned} $$
in which
$$ Q_{\tau^2,\lambda_p}(\mu_0, b_i W, Z_i) = \frac{1}{\sqrt{2\pi(\lambda_p + \tau^2)}} \exp\Big(\frac{\beta_i^2}{2\sigma^2} - \frac{(Z_i - b_i W)^2}{2\lambda_p} - \frac{\mu_0^2}{2\tau^2}\Big)\Big[1 - \Phi\Big(-\frac{\beta_i}{\sigma}\Big)\Big]. $$
The derivation is similar to that of Eq. (4.19).

4.4.6 Comparison with other existing methods

Our approach is different from the existing methods in the mutual fund literature. We compare concretely with the following methods: FC [42]; HL20 [51]; FHG [40]; FAT [69]; GLX [46]; HL18 [50]. Both FC and HL20 consider the type II error in ways distinct from ours.
FC improves the estimates of the proportions of unskilled, zero-alpha, and skilled funds developed in [10] by taking into account the power of the test and the confusion parameter, which increases the power of the test. It is a p-value based approach and ignores the dependence. HL20 uses a double-bootstrap procedure to estimate the FDR and FNR. This approach can be used to compare the FNRs of different testing procedures when those procedures are given, but it does not offer an optimal testing procedure. FAT and GLX both assume an approximate factor model, in which, conditional on the common factors, the random errors have a weakly dependent correlation matrix. By adjusting for the common factors and standardizing by the standard deviation of the random errors, the signal-to-noise ratio can be increased, thus enhancing the testing power. Since the resulting test statistics are nearly independent, FAT further considers the Storey procedure [117], while GLX considers the BH procedure [12]. In practice, the fund returns may not satisfy such an approximate factor model, and the random errors may not have a weakly dependent correlation matrix; that is, after adjusting for the common factors and standardizing, the resulting test statistics can still have general and strong dependence. The BH and Storey procedures will be conservative in such settings. As a related but different approach, FHG directly estimates the false discovery proportion by incorporating the strong dependence. It does not rely on the approximate factor model assumption, but its drawback is that it was designed for settings with very sparse signals. When there is a substantial proportion of signals, FHG will be conservative. All three methods are p-value based approaches, which are not optimal. Our optimal procedure does not rely on the approximate factor model assumption, and it also adapts to different sparsity levels of the signals. HL18 is an estimation approach different from the multiple testing framework.
It utilizes a two-part mixture model for the α's and develops an EM algorithm to estimate the α's. The EM algorithm is implemented under the assumption that, conditional on the common factors, the random errors are independent. As an intermediate step in our approach, we consider a three-part mixture model for the test statistics, in which the dependence is general. Since this violates the independence assumption in HL18, we develop the approximate empirical Bayes method for fitting the parameters. We will demonstrate through numerical studies that our approach outperforms HL18 in terms of density fitting.

4.5 Simulation studies

In the simulation studies, we consider p = 1000. For each simulation round, we first sample 1000 funds without replacement from the 2009-2018 dataset. The i-th item in the sample is fund s(i) from the original data. Then, we construct the new returns r̃_it as

r̃_it = μ_i σ_{s(i)} + R_t β̂_{s(i)} + ε̃_it,   for t = 1, ..., T,

where R_t = (r_{m,t}, r_{smb,t}, r_{hml,t}, r_{mom,t}), β̂_{s(i)} is the corresponding OLS coefficient vector for fund s(i), and σ²_{s(i)} is the s(i)-th diagonal element in the correlation matrix for the OLS estimates of α, which is discussed in Section 4.3.3. We construct μ_i according to the mixture model (4.7) with π₀ = 0.1, ν₀ = 0, ν₁ = −0.5, ν₂ = 1.2, τ₁² = 0.1, τ₂² = 0.1. We consider ε̃_t ∼ N_p(0, Σ_ε), where ε̃_t = (ε_{1,t}, ..., ε_{p,t})ᵀ. Note that π₁, π₂ and Σ_ε will affect the proportion of skilled funds (sparsity) and the dependence structure, respectively. We consider the following settings.

Sparsity: The proportion of skilled funds (sparsity) is mainly determined by π₂, since we set ν₂ > 0. We change the value of π₂ to achieve different levels of sparsity. Correspondingly, π₁ is adjusted to ensure that π₀ + π₁ + π₂ = 1. Two settings are considered as follows.

s.1 π₁ = 0.7, π₂ = 0.2.
s.2 π₁ = 0.2, π₂ = 0.7.

s.2 produces a denser signal setting than s.1.
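The three-part mixture draw for μ under these settings can be sketched as follows (a minimal sketch; the function name and seed are ours, while the parameter values follow the simulation design above):

```python
import numpy as np

def sample_mu(p, pi0, pi1, pi2, nu0=0.0, nu1=-0.5, nu2=1.2,
              tau1_sq=0.1, tau2_sq=0.1, rng=None):
    """Draw mu_1, ..., mu_p from pi0*delta_{nu0} + pi1*N(nu1, tau1^2) + pi2*N(nu2, tau2^2)."""
    rng = np.random.default_rng(rng)
    comp = rng.choice(3, size=p, p=[pi0, pi1, pi2])   # mixture component per fund
    mu = np.where(comp == 0, nu0,
                  np.where(comp == 1,
                           rng.normal(nu1, np.sqrt(tau1_sq), size=p),
                           rng.normal(nu2, np.sqrt(tau2_sq), size=p)))
    return mu

# setting s.1: pi1 = 0.7, pi2 = 0.2 (sparser skilled funds than s.2)
mu = sample_mu(p=1000, pi0=0.1, pi1=0.7, pi2=0.2, rng=0)
```

Under s.2, one would instead pass pi1=0.2, pi2=0.7, raising the proportion of positive-mean (skilled) signals.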
Some testing procedures, such as FHG, are designed for settings with sparse signals, so they are more conservative under s.2.

Dependence: Let Σ_ε = cor(Σ*_ε) and diag(Σ*_ε) = (σ²_{ε,s(1)}, ..., σ²_{ε,s(p)}), where σ²_{ε,s(i)} = Σ_t ε̂²_{s(i),t}/(T − 1) and ε̂_{s(i),t} = r_{s(i),t} − α̂_{s(i)} − R_t β̂_{s(i)}. Three dependence structures are considered as follows.

d.1 Strict factor model: A ∈ R^{p×4}, A_ij ∼ N(0, 4), and Σ_ε = cor(AAᵀ + I_p).
d.2 Power-decay model: A ∈ R^{p×10}, A_ij ∼ N(0, 4), and Σ_ε = cor(AAᵀ + M), where the (i, j)-th element of M is M_ij = ρ^{|i−j|} for i, j = 1, ..., p with ρ = 0.8.
d.3 Long-memory autocovariance model: A ∈ R^{p×10}, A_ij ∼ Uniform(−1, 1), and Σ_ε = cor(AAᵀ + M), where M_ij = 0.5(||i−j|+1|^{2H} − 2|i−j|^{2H} + ||i−j|−1|^{2H}) for i, j = 1, ..., p and H = 0.9.

For all three dependence settings, conditional on the four market factors, the random errors possess additional strong dependence. The structure of M in d.3 has also been considered in [18].

Dependence  Sparsity  Mean       Ours   Storey  BH     FAT    GLX    FC     HL20   FHG
d.1         s.1       FDP        0.093  0       0      0.162  0.043  0.061  0.020  0.068
                      FNP        0.049  0.238   0.238  0.035  0.066  0.068  0.087  0.073
                      selection  222    0       0      252    195    197    170    192
            s.2       FDP        0.097  0.009   0      0.003  0.025  0.082  0.026  0.014
                      FNP        0.033  0.311   0.713  0.686  0.158  0.064  0.172  0.276
                      selection  783    535     0      31     678    760    674    603
d.2         s.1       FDP        0.095  0       0      0      0.050  0.057  0.019  0.073
                      FNP        0.060  0.238   0.238  0.239  0.079  0.070  0.089  0.073
                      selection  212    0       0      0      185    193    168    194
            s.2       FDP        0.101  0.006   0      0      0.023  0.093  0.027  0.013
                      FNP        0.054  0.426   0.711  0.711  0.223  0.058  0.178  0.277
                      selection  778    423     0      0      644    769    669    601
d.3         s.1       FDP        0.100  0       0      0.090  0.043  0.054  0.016  0.055
                      FNP        0.060  0.238   0.238  0.167  0.083  0.068  0.090  0.082
                      selection  213    0       0      101    179    195    167    181
            s.2       FDP        0.100  0.029   0      0      0.020  0.064  0.026  0.013
                      FNP        0.066  0.332   0.711  0.562  0.232  0.085  0.173  0.303
                      selection  770    496     0      291    636    735    670    566

Table 4.3: Comparison of our approach (4.4), with the approximate empirical Bayes method for fitting the parameters, and other FDR
control methods for each combination of sparsity (s.1, s.2) and dependence structures (d.1, d.2, d.3), where θ = 0.1. The average false discovery proportion (FDP), false non-discovery proportion (FNP), and number of selections are calculated over 100 simulations for each setting.

We apply our multiple testing procedure (4.4) for each combination of sparsity and dependence with θ = 0.1. As we consider the parameters in the mixture model unknown, our procedure first applies the approximate empirical Bayes method to fit the model before performing the FDR control method. The average FDP, FNP, and number of selections over 100 simulations are reported in Table 4.3. Furthermore, we compare with other existing FDR control methods [12, 117, 40, 42, 69, 51, 46] and the model fitting procedure [50] in these tables. (Correspondingly, we abbreviate these methods as the BH, Storey, FHG, FC, FAT, HL20, GLX, and HL18 methods.) Table 4.3 illustrates that the average FDP of our approach is closer to 0.1, while the average FNP is smaller than those of the other methods under various sparsity and dependence structures. This is consistent with our theoretical result that our procedure minimizes the FNR while controlling the FDR at the desired level.

4.6 Skilled funds selection with FDR control

In this section, we apply the multiple testing procedure (4.4), utilizing the degrees of non-skillness (d-values), to our downloaded data for equity funds. It shows that our procedure selects funds with persistent performance and produces substantial returns in the subsequent years. The discrepancies in selections compared with the conventional multiple testing procedures mainly result from the differences between p-values and d-values. We compare these two measures of significance in Section 4.6.2 to illustrate the advantage of d-values.

4.6.1 Dynamic trading strategy

We first demonstrate the short-term out-of-sample performance of our procedure. Hypothetically, we had $1 million at the end of 2009.
We selected the skilled funds by the testing procedure (4.4) with θ level 0.15 based on the data from 2000 to 2009. Then we invested in the selected funds with equal weights for the year 2010. At the end of 2010, we re-selected the skilled funds based on the data from 2001 to 2010 and invested in the newly selected funds with equal weights for the year 2011. We continued this trading strategy until the end of 2019. The value of our portfolio at the end of each year is plotted as a red line in Figure 4.5. For comparison, we first consider two conventional multiple testing procedures for selecting funds: the BH procedure and the Storey procedure. The BH procedure is constructed based on the individual p-values. When there is general dependence among the test statistics, it is usually very conservative. The Storey procedure is less conservative than the BH procedure, and it can asymptotically control the FDR under weak dependence of the test statistics. In [10], skilled funds were selected based on the Storey procedure. For a fair comparison, we choose the threshold as 0.15 for the BH procedure and the Storey procedure. Table 4.4 shows that after 10 years, our portfolio grows to $3.52 million (annual return 13.41%), while their portfolios are worth less than $1.75 million. Due to their conservativeness, both the BH procedure and the Storey procedure fail to select any funds for many 10-year periods, resulting in the flat lines in Figure 4.5. It is also worth plotting the growth of a portfolio invested in the S&P 500 index in Figure 4.5, since the S&P 500 is a common benchmark for financial investment. After 10 years, the S&P 500 index would grow to $2.90 million (annual return 11.22%), still much inferior to our constructed portfolio. Furthermore, we construct similar equal-weighted portfolios based on the funds selected by the existing methods discussed in Section 4.5. In particular, we consider FHG, FC, FAT, HL20, GLX, and HL18.
Recall that HL18 is not an FDR control method, so the funds whose HL18 alpha estimates are greater than the 95% quantile are selected. The remaining FDR control procedures above are implemented with θ = 0.15. The returns of these portfolios are recorded in Figure 4.5 and Table 4.4, which show that it is difficult to outperform the S&P 500, while our procedure succeeds.

Table 4.4: The values of the portfolios by the end of 2019 and the annual returns for each approach.

Portfolio    Value (million)  Annual Return (%)
Our method   3.52             13.41
S&P 500      2.90             11.22
HL18         2.83             10.98
FAT          2.32             8.80
HL20         2.27             8.56
FHG          2.24             8.40
GLX          2.18             8.09
Storey       1.73             5.64
FC           1.46             3.89
BH           1.00             0.00

Figure 4.5: The values of the portfolios selected by our procedure and the other FDR control procedures (BH, Storey, FHG, FC, FAT, HL20, and GLX) from the end of 2009 to the end of 2019. "Skilled funds" refers to our portfolio. "S&P 500" is the value of the S&P 500 index. For "HL18", we construct the portfolio containing funds with HL18 alpha estimates greater than their 95% quantile. [Plot: portfolio value in $ million versus year, 2009-2019.]

4.6.2 Comparison of d-value and p-value

The selection of funds based on our d-values is quite different from those based on p-values, e.g., the BH procedure or the Storey procedure. The p-values of our selected skilled funds are not necessarily small, indicating that the funds located in the right tail of the empirical distribution of Z are not necessarily skilled. To present the distinction, for a given 10-year window, we collect the funds with the top 50 smallest p-values (denoted as the p-group) and the funds with the top 50 smallest d-values (denoted as the d-group), after excluding their intersection. We compare these two groups by p-values, d-values, estimated annual α's, and future returns in the following discussion.
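The construction of the two disjoint groups just described can be sketched as follows (a minimal sketch; the function name and toy inputs are ours):

```python
import numpy as np

def disjoint_top_groups(p_values, d_values, k=50):
    """Indices of the k smallest d-values (d-group) and the k smallest
    p-values (p-group), with the overlap of the two sets excluded."""
    p_top = set(np.argsort(p_values)[:k].tolist())
    d_top = set(np.argsort(d_values)[:k].tolist())
    overlap = p_top & d_top
    return sorted(d_top - overlap), sorted(p_top - overlap)

# toy illustration with 4 hypothetical funds and k = 2:
# fund 2 is in both top-2 sets, so it is excluded from both groups
d_group, p_group = disjoint_top_groups([0.01, 0.5, 0.02, 0.9],
                                       [0.40, 0.01, 0.02, 0.9], k=2)
# d_group == [1], p_group == [0]
```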
Figure 4.6a provides example boxplots of the p-values and d-values for the two groups based on the data from 2000-2009. The d-group has a median p-value equal to 0.19, and even its minimum p-value is greater than 0.04. In contrast, all the p-values in the p-group are less than 0.04. Similar phenomena can be observed in other periods. Therefore, the rankings of funds based on d-values and on p-values differ sharply. A natural question is which ranking is better. We answer this question by comparing the fund performance in the p-group and the d-group in Figure 4.6b. The plot of the medians of the estimated annual α's shows that the superior performance of the d-group persists over the subsequent moving 10-year windows, whereas the p-group performs excellently in the beginning but then deteriorates, eventually becoming even worse than the funds with relatively large p-values.

Figure 4.6: The comparison of the p-group and d-group selected based on the data from 2000-2009: (a) the boxplots of p-values and d-values, and (b) the estimated annual α's for each 10-year window after that period. ("others" refers to funds with neither top-50 smallest d-values nor top-50 smallest p-values.) [Panels: (a) boxplots of d-values and p-values for the d-group and p-group; (b) median estimated annual α's over the windows 2000-2009 through 2010-2019 for the d-group, p-group, and others.]

Figure 4.7: The boxplots of the cumulative returns of funds in the p-groups and d-groups for the following years. D-values and p-values are computed based on the data during (a) 2000-2009 (returns over 2010-2019) and (b) 2003-2012 (returns over 2013-2019).

We also look at the cumulative returns of funds in the two groups for the subsequent years over a long horizon.
Figure 4.7 contains example boxplots of the cumulative returns of funds in the p-groups and d-groups. In Figure 4.7a, the cumulative returns of the funds in the d-group, constructed based on the data from 2000-2009, have a median greater than 150% during 2010-2019, whereas the cumulative returns of the p-group have a negative median. Overall, the d-groups strongly outperform the p-groups. Further investigation of the selected groups reveals additional support for our procedure. For instance, our selection of the d-group based on 2003-2012 includes some prominent funds from families such as Fidelity and ProFunds, which have outstanding long-term performance in addition to their short-term performance for the year 2013. For example, Fidelity Select Portfolios: Medical Equipment & Systems Portfolio generated a total return of 297% from 2013 to 2019, while ProFunds: Biotechnology UltraSector generated a total return of 236% during the same period. However, such funds are not in the p-group constructed based on the data from 2003-2012.

4.7 Discussion

We focus on the performance of mutual funds net of trading costs, fees, and expenses. One may instead look at the returns of mutual funds before expenses; the latter directly reflect mutual fund managers' abilities to pick stocks. We can compute the expenses in terms of the expense ratios and add the estimated expenses to the current net returns. On the investors' side, despite the management fees that investors have to pay, they also receive dividends from the investments of some funds in addition to the returns. Some dividend-oriented funds may pay a substantial amount of dividends, which cannot be ignored when calculating profits. In the analysis above, we did not collect information on dividends for each fund. These factors motivate us to collect the related information for the mutual fund data in our future research. Our proposed method can also be a useful tool for portfolio management by screening skilled mutual funds.
In Section 4.6, our dynamic trading strategy is constructed based on equal-weight investment. More advanced strategies based on mean-variance efficiency can be found in [74].

4.8 Acknowledgements and author contributions

This study is a collaborative effort with Xu Han and Xin Tong. X.H. and X.T. conceived the study and designed the methods. L.W. executed the methods, wrote the code, and performed the empirical, theoretical, and real-data analyses. X.H. and X.T. supervised the execution. All authors contributed to the writing of the manuscript.

Chapter 5
Hierarchical Neyman-Pearson classification for prioritizing severe disease categories in COVID-19 patient data

5.1 Introduction

The COVID-19 pandemic has infected more than 763 million people and caused 6.91 million deaths (as of April 2023) [137], prompting collective efforts from the statistics and other communities to address data-driven challenges. Many statistical works have modeled the COVID-19 epidemic dynamics [15, 94], forecasted case growth rates and outbreak locations [21, 121, 88], and analyzed and predicted mortality rates [54, 68]. Classification problems, such as COVID-19 diagnosis (positive/negative) [139, 75, 148] and severity prediction [143, 120, 151, 92], have been tackled by machine learning approaches (e.g., logistic regression, support vector machines (SVM), random forests, boosting, and neural networks; see [3] for a review). In the existing COVID-19 classification works, the commonly used data types are CT images, routine blood tests, and other clinical data, including age, blood pressure, and medical history [89]. In comparison, multiomics data are harder to acquire but can provide better insights into the molecular features driving patient responses to the disease [93]. Recently, the increasing availability of single-cell RNA-seq (scRNA-seq) data offers the opportunity to understand transcriptional responses to COVID-19 severity at the cellular level [136, 114, 99].
More generally, genome-wide gene expression measurements have been routinely used in classification settings to characterize and distinguish disease subtypes, both at the bulk-sample level [2] and, more recently, at the single-cell level [8, 53]. While obtaining such genome-wide data can be expensive, they provide a comprehensive view of the transcriptome and can unveil significant gene expression patterns for diseases with complex pathophysiology, where multiple genes and pathways are involved. Furthermore, as patient-level measurements continue to grow in dimension and complexity (e.g., from single bulk-sample measurements to multi-cell inputs), a supervised learning setting enables us to better establish the connection between patient-level features and the associated disease states, paving the way toward personalized treatment.

This study focuses on patient severity classification using an integrated collection of multi-patient blood scRNA-seq datasets. Based on the WHO guidelines [138], COVID-19 patients have at least three severity categories: healthy, mild/moderate, and severe. The classical classification paradigm aims at minimizing the overall classification error. However, prioritizing the identification of more severe patients can provide important insights into the biological mechanisms underlying disease progression and severity, and can facilitate the discovery of potential biomarkers for clinical diagnosis and therapeutic intervention. Consequently, it is important to prioritize minimizing "under-diagnosis" errors in our classification, where patients are misclassified into less severe categories. Motivated by the gap in existing classification algorithms for severity classification (Section 5.1.1), we propose a hierarchical Neyman-Pearson (H-NP) classification framework that prioritizes under-diagnosis error control in the following sense.
Suppose there are I classes with class labels [I] = {1, 2, ..., I} ordered in decreasing severity. For i ∈ [I − 1], the i-th under-diagnosis error is the probability of misclassifying an individual in class i into any class j with j > i. We develop an H-NP umbrella algorithm that controls the i-th under-diagnosis error below a user-specified level α_i ∈ (0, 1) with high probability while minimizing a weighted sum of the remaining classification errors. Similar in spirit to the NP umbrella algorithm for binary classification in [127], the H-NP umbrella algorithm adapts to popular scoring-type multi-class classification methods (e.g., logistic regression, random forest, and SVM). To our knowledge, the algorithm is the first to achieve asymmetric error control with high probability in multi-class classification.

Another contribution of this study is the exploration of appropriate ways to featurize multi-patient scRNA-seq data. Following the workflow in [79], we integrate 19 publicly available scRNA-seq datasets to form a sample of 864 patients with three levels of severity. For each patient, scRNA-seq data were collected from peripheral blood mononuclear cells (PBMCs) and processed into a sparse expression matrix, which consists of tens of thousands of genes in rows and thousands of cells in columns. We propose four ways of extracting a feature vector from each of these 864 matrices. We then evaluate the performance of each featurization in combination with multiple classification methods under both the classical and H-NP classification paradigms. We note that our H-NP umbrella algorithm is applicable to other featurizations of scRNA-seq data, other forms of patient data, and more general disease classification problems with a severity ordering. Below we review the NP paradigm and the featurization of multi-patient scRNA-seq data as the background of our work.
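As a concrete reading of the error definition above, the empirical i-th under-diagnosis error can be computed from true and predicted labels (a minimal sketch; the function name is ours, and labels 1, ..., I are assumed ordered in decreasing severity):

```python
import numpy as np

def under_diagnosis_errors(y_true, y_pred, n_classes):
    """Empirical i-th under-diagnosis error for i = 1, ..., I-1:
    the fraction of class-i cases predicted into a less severe class j > i."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    errs = []
    for i in range(1, n_classes):
        in_class = y_true == i
        errs.append(float(np.mean(y_pred[in_class] > i)) if in_class.any() else 0.0)
    return errs

# toy check: one of two severe (class 1) patients is called moderate (class 2)
print(under_diagnosis_errors([1, 1, 2, 3], [2, 1, 2, 3], n_classes=3))  # [0.5, 0.0]
```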
5.1.1 Neyman-Pearson paradigm and multi-class classification

Classical binary classification focuses on minimizing the overall classification error, i.e., a weighted sum of the type I and II errors, where the weights are the marginal probabilities of the two classes. However, in many applications the class priorities are not reflected by the class weights, especially in severe disease diagnosis, where the severe class is the minority class and has a smaller weight (e.g., HIV [90] and cancer [33]). One class of methods that addresses this error asymmetry is cost-sensitive learning [38, 85], which assigns different costs to the type I and type II errors. However, such weights may not be easy to choose in practice, especially in a multi-class setting; nor do these methods provide high-probability controls on the prioritized errors. The NP classification paradigm [24, 107, 101] was developed as an alternative framework to enforce class priorities: it finds a classifier that controls the population type I error (the prioritized error, e.g., misclassifying diseased patients as healthy) under a user-specified level α while minimizing the type II error (the error with less priority, e.g., misdiagnosing healthy people as sick). Practically, using an order-statistics approach, [127] proposed an NP umbrella algorithm that adapts all scoring-type classification methods (e.g., logistic regression) to the NP paradigm for classifier construction. The resulting classifier has the population type I error under α with high probability. Besides disease diagnosis, the NP classification paradigm has found diverse applications, including social media text classification [141] and crisis risk control [41]. Nevertheless, the original NP paradigm is for binary classification only. Although several works have aimed to control prioritized errors in multi-class classification [70, 142, 124], they did not provide high-probability control.
That is, if they are applied to severe disease classification, there is a non-trivial chance that their under-diagnosis errors exceed the desired levels.

5.1.2 ScRNA-seq data featurization

In multi-patient scRNA-seq data, every patient has a gene-by-cell expression matrix, which contains the expression levels of genes in cells; genes are matched across patients, but cells are not. For learning tasks with patients as instances, featurization is a necessary step to ensure that all patients have features in the same space. A common featurization approach is to assign every patient's cells to cell types, which are comparable across patients, by clustering [113, 43] and/or manual annotation [49]. Then, each patient's gene-by-cell expression matrix can be converted into a gene-by-cell-type expression matrix using a summary statistic (e.g., every gene's mean expression in a cell type), so that all patients have gene-by-cell-type expression matrices with the same dimensions. We note here that most of the previous multi-patient single-cell studies with a reasonably large cohort used CyTOF data [32], which typically measures 50-100 protein markers, whereas scRNA-seq data have a much higher feature dimension, containing expression values of ∼10⁴ genes. Thus, further featurization is necessary to convert each patient's gene-by-cell-type expression matrix into a feature vector for classification. Following the data processing workflow in [79], we obtain 864 patients' cell-type-by-gene expression matrices, which include 18 cell types and 3,000 genes (after filtering). We propose and compare four ways of featurizing these matrices into vectors. The four ways differ in their treatment of 0 values and their approaches to dimension reduction. Note that we perform featurization as a separate step before classification so that all classification methods are applicable.
Separating the featurization step also allows us to investigate whether a featurization approach maintains robust performance across classification methods.

The rest of the chapter is organized as follows. In Section 5.2, we introduce the H-NP classification framework and propose an umbrella algorithm to control the under-diagnosis errors with high probability. Next, we conduct extensive simulation studies to evaluate the performance of the umbrella algorithm. In Section 5.3, we describe the four ways of featurizing the COVID-19 multi-patient scRNA-seq data and show that the H-NP umbrella algorithm consistently controls the under-diagnosis errors in COVID-19 severity classification across all featurization ways and classification methods.

5.2 Hierarchical Neyman-Pearson (H-NP) classification

5.2.1 Under-diagnosis errors in H-NP classification

We first introduce the formulation of H-NP classification and define the under-diagnosis errors, which are the probabilities of individuals being misclassified into less severe (more generally, less important) classes. In an H-NP problem with I ≥ 2 classes, the class labels i ∈ [I] := {1, 2, ..., I} are ranked in decreasing order of importance, i.e., class i is more important than class j if i < j. Let (X, Y) be a random pair, where X ∈ 𝒳 ⊂ IR^d represents a vector of features and Y ∈ [I] denotes the class label. A classifier ϕ: 𝒳 → [I] maps a feature vector X to a predicted class label. In the following discussion, we abbreviate IP(· | Y = i) as P_i(·). Our H-NP framework aims to control the under-diagnosis errors at the population level in the sense that

R_{i⋆}(ϕ) = P_i(ϕ(X) ∈ {i+1, ..., I}) ≤ α_i   for i ∈ [I − 1],   (5.1)

where α_i ∈ (0, 1) is the desired control level for the i-th under-diagnosis error R_{i⋆}(ϕ). Simultaneously, our H-NP framework minimizes the weighted sum of the remaining errors, which can be expressed as

R_c(ϕ) = IP(ϕ(X) ≠ Y) − Σ_{i=1}^{I−1} π_i R_{i⋆}(ϕ),   where π_i = IP(Y = i).
(5.2)

We note that when I = 2, this H-NP formulation is equivalent to the binary NP classification paradigm (prioritizing class 1 over class 2), with R_{1⋆}(ϕ) being the population type I error. For COVID-19 severity classification with three levels, severe patients, labeled as Y = 1, have the top priority, and we want to control the probability of severe patients not being identified, which is R_{1⋆}(ϕ). The secondary priority is for moderate patients, labeled as Y = 2; R_{2⋆}(ϕ) is the probability of moderate patients being classified as healthy. Healthy patients, who do not need medical care, are labeled as Y = 3. Note that R_{i⋆}(·) and R_c(·) are population-level quantities, as they depend on the intrinsic distribution of (X, Y), and it is hard to control the R_{i⋆}(·)'s almost surely due to the randomness of the classifier.

5.2.2 H-NP algorithm with high probability control

In this section, we construct an H-NP umbrella algorithm that controls the population under-diagnosis errors in the sense that IP(R_{i⋆}(ϕ̂) > α_i) ≤ δ_i for i ∈ [I − 1], where (δ_1, ..., δ_{I−1}) is a vector of tolerance parameters and ϕ̂ is a scoring-type classifier to be defined below. Roughly speaking, we employ a sample-splitting strategy, which uses some data subsets to train the scoring functions from a base classification method and other data subsets to select appropriate thresholds on the scores to achieve population-level error controls. Here, the scoring functions refer to the scores assigned to each possible class label for a given input observation and include, for example, the output of the softmax transformation in multinomial logistic regression. For i ∈ [I], let S_i = {X_j^i}_{j=1}^{n_i} denote n_i independent observations from class i, where n_i is the size of the class. In the following discussion, the superscript on X is dropped for brevity when it is clear which class the observation comes from.
Our procedure randomly splits the class-i observations into (up to) three parts: S_{is} (i ∈ [I]) for obtaining the scoring functions, S_{it} (i ∈ [I − 1]) for selecting the thresholds, and S_{ie} (i = 2, ..., I) for computing the empirical errors. As will be made clear later, S_{1e} and S_{It} are not needed in our procedure, and we only split class 1 and class I into two parts. After splitting, we use the combined observations S_s = ∪_{i∈[I]} S_{is} to train the scoring functions.

We consider a classifier that relies on I − 1 scoring functions T_1, T_2, ..., T_{I−1}: 𝒳 → IR, where the class decision is made sequentially, with each T_i(X) determining whether the observation belongs to class i or to one of the less prioritized classes (i+1), ..., I. Thus, at each step i the decision is binary, allowing us to use the NP Lemma to motivate the construction of our scoring functions. Note that IP(Y = i | X = x) / IP(Y ∈ {i+1, ..., I} | X = x) ∝ f_i(x) / f_{>i}(x), where f_i(x) and f_{>i}(x) represent the density function of X when Y = i and Y > i, respectively, and by the NP Lemma the density ratio is the statistic that leads to the most powerful test with a given level of control on one of the errors. Given a typical scoring-type classification method (e.g., logistic regression, random forest, SVM, or a neural network) that provides the probability estimates ÎP(Y = i | X) for i ∈ [I], we construct our scores from these estimates by defining

T_1(X) = ÎP(Y = 1 | X),   and   T_i(X) = ÎP(Y = i | X) / Σ_{j=i+1}^{I} ÎP(Y = j | X)   for 1 < i ≤ I − 1,

and we select thresholds t_1, ..., t_{I−1} so that IP(R_{i⋆}(ϕ̂) > α_i) ≤ δ_i for all i ∈ [I − 1]. In what follows, we develop our arguments conditional on the data S_s used for training the scoring functions, so that the T_i's can be viewed as fixed functions. According to Eq (5.3), the first under-diagnosis error R_{1⋆}(ϕ̂) = P_1(T_1(X) < t_1) only depends on t_1, while the other under-diagnosis errors R_{i⋆}(ϕ̂) depend on t_1, ..., t_i.
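The scoring functions above can be computed directly from a matrix of estimated class probabilities (a minimal sketch; the helper name is ours, and the subsequent thresholding step of the classifier is omitted):

```python
import numpy as np

def sequential_scores(prob):
    """Given an (n, I) matrix of estimated probabilities IP_hat(Y=i|X),
    return the (n, I-1) matrix of scores:
    T_1 = IP_hat(Y=1|X);  T_i = IP_hat(Y=i|X) / sum_{j>i} IP_hat(Y=j|X), 1 < i <= I-1."""
    prob = np.asarray(prob, dtype=float)
    n, I = prob.shape
    T = np.empty((n, I - 1))
    T[:, 0] = prob[:, 0]
    for i in range(1, I - 1):
        T[:, i] = prob[:, i] / prob[:, i + 1:].sum(axis=1)
    return T

# one observation with estimated probabilities (0.5, 0.3, 0.2) over I = 3 classes
scores = sequential_scores([[0.5, 0.3, 0.2]])  # T_1 = 0.5, T_2 = 0.3/0.2 = 1.5
```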
To achieve the high-probability controls IP(R_{i⋆}(ϕ̂) > α_i) ≤ δ_i for all i ∈ [I − 1], we select t_1, ..., t_{I−1} sequentially using an order-statistics approach. We start with the selection of t_1, which is covered by the following general proposition.

Proposition 2. For any i ∈ [I], denote T_i = {T_i(X) | X ∈ S_{it}}, and let t_{i(k)} be the corresponding k-th order statistic. Further denote the cardinality of T_i by n_i. Assuming that the data used to train the scoring functions and the left-out data are independent, then given a control level α, for another independent observation X from class i,

IP( P_i(T_i(X) < t_{i(k)} | t_{i(k)}) > α ) ≤ v(k, n_i, α) := Σ_{j=0}^{k−1} C(n_i, j) α^j (1 − α)^{n_i − j},   (5.5)

where C(n, j) denotes the binomial coefficient. We remark that, similar to Proposition 1 in Tong et al. [127], if T_i is a continuous random variable, the bound in Eq (5.5) is tight. The proof is a modification of their Proposition 1, presented as follows.

Proof of Proposition 2. Suppose T_i(X) is the classification score of an independent observation from class i, and F_i is the cumulative distribution function of −T_i(X). Then,

P_i(T_i(X) < t_{i(k)} | t_{i(k)}) = P_i(−T_i(X) > −t_{i(k)} | t_{i(k)}) = 1 − F_i(−t_{i(k)}),

and

IP( P_i(T_i(X) < t_{i(k)} | t_{i(k)}) > α )
 = IP( 1 − F_i(−t_{i(k)}) > α )
 = IP( −t_{i(k)} < F_i^{−1}(1 − α) )
 = IP( −t_{i(k)} < F_i^{−1}(1 − α), −t_{i(k+1)} < F_i^{−1}(1 − α), ..., −t_{i(n_i)} < F_i^{−1}(1 − α) )
 = IP( at least n_i − k + 1 of the values {−t : t ∈ T_i} are less than F_i^{−1}(1 − α) )
 = Σ_{j=n_i−k+1}^{n_i} C(n_i, j) [IP(−T_i(X) < F_i^{−1}(1 − α))]^j [1 − IP(−T_i(X) < F_i^{−1}(1 − α))]^{n_i−j}
 = Σ_{j=0}^{k−1} C(n_i, j) [1 − IP(−T_i(X) < F_i^{−1}(1 − α))]^j [IP(−T_i(X) < F_i^{−1}(1 − α))]^{n_i−j}
 = (n_i + 1 − k) C(n_i, k−1) ∫_0^{IP(−T_i(X) < F_i^{−1}(1−α))} u^{n_i−k} (1 − u)^{k−1} du
 ≤ (n_i + 1 − k) C(n_i, k−1) ∫_0^{1−α} u^{n_i−k} (1 − u)^{k−1} du
 = Σ_{j=0}^{k−1} C(n_i, j) α^j (1 − α)^{n_i−j} = v(k, n_i, α).   (5.6)

The inequality holds because IP(−T_i(X) < F_i^{−1}(1 − α)) ≤ 1 − α, and it becomes an equality when F_i is continuous.

Algorithm 5: DeltaSearch(n, α, δ)
Input: size n; level α; tolerance δ.
1: $k = 0$, $v = 0$
2: while $v \le \delta$ do
3:   $v = v + \binom{n}{k}\alpha^k(1-\alpha)^{n-k}$
4:   $k = k + 1$
5: end
Output: $k - 1$

Let $k_i = \max\{k \mid v(k, n_i, \alpha_i) \le \delta_i\}$, which can be computed using Algorithm 5. Then Proposition 2 and Eq (5.4) imply
$$\mathbb{P}\left(R_{i\star}(\widehat{\phi}) > \alpha_i\right) \le \mathbb{P}\left( P_i\left(T_i(X) < t_i \mid t_{i(k_i)}\right) > \alpha_i \right) \le \delta_i \quad \text{for all } t_i \le t_{i(k_i)}. \tag{5.7}$$
We note that for $v(k, n_i, \alpha_i) \le \delta_i$ to have a solution among $k \in [n_i]$, we need $n_i \ge \log\delta_i/\log(1-\alpha_i)$, the minimum sample size required for the class-$i$ set $\mathcal{S}_{it}$.

When $i = 1$, the first inequality in Eq (5.7) in fact becomes an equality, so $t_{1(k_1)}$ is an effective upper bound on $t_1$ when we later minimize the empirical counterpart of $R_c(\cdot)$ in Eq (5.2) with respect to different feasible threshold choices. On the other hand, for $i > 1$ the inequality is mostly strict, which means that the bound $t_{i(k_i)}$ on $t_i$ is expected to be loose and can be improved. To this end, we note that Eq (5.4) can be decomposed as
$$R_{i\star}(\widehat{\phi}) = P_i\left(T_i(X) < t_i \mid T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right) \cdot P_i\left(T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right), \tag{5.8}$$
leading to the following theorem, which finds the upper bound on $t_i$ given the previous thresholds.

Theorem 9. Given the previous thresholds $t_1, \dots, t_{i-1}$, consider all the scores $T_i$ on the left-out class set $\mathcal{S}_{it}$, $\mathcal{T}_i = \{T_i(X) \mid X \in \mathcal{S}_{it}\}$, and a subset of these scores depending on the previous thresholds, defined as $\mathcal{T}'_i = \{T_i(X) \mid X \in \mathcal{S}_{it},\, T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\}$. We use $t_{i(k)}$ and $t'_{i(k)}$ to denote the $k$-th order statistic of $\mathcal{T}_i$ and $\mathcal{T}'_i$, respectively. Let $n_i$ and $n'_i$ be the cardinality of $\mathcal{T}_i$ and $\mathcal{T}'_i$, respectively, and let $\alpha_i$ and $\delta_i$ be the prespecified control level and violation tolerance for the $i$-th under-diagnosis error $R_{i\star}(\cdot)$. We set
$$\hat{p}_i = \frac{n'_i}{n_i}, \quad p_i = \hat{p}_i + c(n_i), \quad \alpha'_i = \frac{\alpha_i}{p_i}, \quad \delta'_i = \delta_i - \exp\{-2 n_i c^2(n_i)\}, \tag{5.9}$$
where $c(n) = O(1/\sqrt{n})$.
Let
$$\bar{t}_i = \begin{cases} t'_{i(k'_i)}, & \text{if } n'_i \ge \log\delta'_i/\log(1-\alpha'_i) \text{ and } \alpha'_i < 1; \\ t_{i(k_i)}, & \text{otherwise}, \end{cases} \tag{5.10}$$
where $k_i = \max\{k \in [n_i] \mid v(k, n_i, \alpha_i) \le \delta_i\}$ and $k'_i = \max\{k \in [n'_i] \mid v(k, n'_i, \alpha'_i) \le \delta'_i\}$. Then
$$\mathbb{P}\left(R_{i\star}(\widehat{\phi}) > \alpha_i\right) = \mathbb{P}\left( P_i\left(T_1(X) < t_1, \dots, T_i(X) < t_i \mid t_i\right) > \alpha_i \right) \le \delta_i \quad \text{for all } t_i \le \bar{t}_i. \tag{5.11}$$

In other words, if the cardinality of $\mathcal{T}'_i$ exceeds a threshold, we can refine the choice of the upper bound according to Eq (5.10); otherwise, the bound in Proposition 2 always applies. The proof of the theorem is provided in the next section; the computation of the upper bound $\bar{t}_i$ is summarized in Algorithm 6. The bound $\bar{t}_i$ guarantees the required high probability control on the $i$-th under-diagnosis error while being tighter than the bound in Eq (5.7). We make three additional remarks as follows.

Remark 2. a) The minimum sample size requirement for $\mathcal{S}_{it}$ is still $n_i \ge \log\delta_i/\log(1-\alpha_i)$, because $t_{i(k_i)}$ in Eq (5.10) always exists when this inequality holds. For instance, if $\alpha_i = 0.05$ and $\delta_i = 0.05$, then $n_i \ge 59$.

b) The choice of $c(n)$ involves a trade-off between $\alpha'_i$ and $\delta'_i$, although under the constraint $c(n) = O(1/\sqrt{n})$, any changes in both quantities are small in magnitude for large $n$. For example, a larger $c(n)$ leads to a smaller $\alpha'_i$ and a larger $\delta'_i$; thus a looser tolerance level comes at the cost of a stricter error control level. In practice, larger $\alpha'_i$ and larger $\delta'_i$ values are desired since they lead to a wider search region for $t_i$. We set $c(n) = 2/\sqrt{n}$ throughout the rest of the chapter. Then, by Eq (5.9), $\alpha'_i$ increases as $n$ increases, and $\delta'_i = \delta_i - e^{-8}$, so the difference between $\delta'_i$ and the prespecified $\delta_i$ is sufficiently small.

c) Eq (5.11) has two cases, as indicated in Eq (5.10). When $\bar{t}_i = t_{i(k_i)}$, the bound remains the same as in Eq (5.7), which is not tight for $i > 1$.
When $\bar{t}_i = t'_{i(k'_i)}$, Eq (5.11) provides a tighter bound through the decomposition in Eq (5.8), where the first factor is bounded by a concentration argument, and the second factor achieves a tight bound in the same way as Proposition 2.

Proof of Theorem 9. We are going to prove that
$$\mathbb{P}\left( P_i\left(T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}, \text{ and } T_i(X) < t_i \mid t_i\right) > \alpha_i \right) \le \delta_i$$
by considering two cases.

Case 1: We consider the event $E = \{n'_i \ge \log\delta'_i/\log(1-\alpha'_i) \text{ and } 1 - \alpha'_i > 0\}$ (case 1 in Eq (5.10)). Under this event, we want to show that
$$\mathbb{P}\left( P_i\left[T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}, \text{ and } T_i(X) < t'_{i(k'_i)} \,\middle|\, t'_{i(k'_i)}\right] > \alpha_i \,\middle|\, E \right) \le \delta_i.$$

Suppose that $T_i(X)$ is the classification score of an independent observation from class $i$, and let $F'_i$ be the cumulative distribution function of the classification score $-T_i(X)$ when $T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}$. Then, similar to the proof of Proposition 2,
$$P_i\left[T_i(X) < t'_{i(k)} \,\middle|\, t'_{i(k)},\, T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right] = 1 - F'_i\left(-t'_{i(k)}\right).$$

Note that $\alpha'_i$ is determined by $n_i$, $n'_i$, and $\alpha_i$. Meanwhile, $\delta'_i$ is fixed, as it only depends on the given values $n_i$ and $\delta_i$. We have
$$\begin{aligned}
\mathbb{P}\left( P_i\left[T_i(X) < t'_{i(k)} \mid t'_{i(k)},\, T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right] > \alpha'_i \,\middle|\, n'_i \right)
&= \mathbb{P}\left[ -t'_{i(k)} > (F'_i)^{-1}(1-\alpha'_i) \,\middle|\, n'_i \right] \\
&= \frac{\mathbb{P}\left[ -t'_{i(k)} > (F'_i)^{-1}(1-\alpha'_i) \text{ and } |\mathcal{T}'_i| = n'_i \right]}{\mathbb{P}\left(|\mathcal{T}'_i| = n'_i\right)}. \tag{5.12}
\end{aligned}$$
Note that the event $\{-t'_{i(k)} > (F'_i)^{-1}(1-\alpha'_i) \text{ and } |\mathcal{T}'_i| = n'_i\}$ indicates that $n'_i$ elements of $\mathcal{S}_{it}$ satisfy $\{T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\}$, and that among these elements, at least $n'_i - k + 1$ have $T_i(X)$ less than $(F'_i)^{-1}(1-\alpha'_i)$. We can consider $\mathcal{S}_{it}$ as independent draws from a multinomial distribution with three kinds of outcomes $(A_1, A_2, A_3)$, defined in Figure 5.1.

Figure 5.1: Partition of $\mathcal{S}_{it}$ into three kinds of outcomes.

Therefore,
$$\mathbb{P}\left[ -t'_{i(k)} > (F'_i)^{-1}(1-\alpha'_i) \text{ and } |\mathcal{T}'_i| = n'_i \right] = \sum_{j=n'_i-k+1}^{n'_i} \binom{n_i}{n_i - n'_i,\; n'_i - j,\; j}\, P_i(A_1)^{n_i-n'_i} P_i(A_2)^{n'_i-j} P_i(A_3)^{j};$$
$$\mathbb{P}\left(|\mathcal{T}'_i| = n'_i\right) = \binom{n_i}{n_i - n'_i} P_i(A_1)^{n_i-n'_i} P_i(A_2 \cup A_3)^{n'_i}.$$
Then, by Eq (5.12),
$$\begin{aligned}
&\mathbb{P}\left( P_i\left[T_i(X) < t'_{i(k)} \mid t'_{i(k)},\, T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right] > \alpha'_i \,\middle|\, n'_i \right) \\
&\quad= \sum_{j=n'_i-k+1}^{n'_i} \binom{n'_i}{j} \left(\frac{P_i(A_2)}{P_i(A_2 \cup A_3)}\right)^{n'_i-j} \left(\frac{P_i(A_3)}{P_i(A_2 \cup A_3)}\right)^{j} \\
&\quad= \sum_{j=0}^{k-1} \binom{n'_i}{j} \left(1 - P_i\left(-T_i(X) < (F'_i)^{-1}(1-\alpha'_i) \mid T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right)\right)^{j} \\
&\qquad\qquad \times P_i\left(-T_i(X) < (F'_i)^{-1}(1-\alpha'_i) \mid T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right)^{n'_i-j} \\
&\quad\le \sum_{j=0}^{k-1} \binom{n'_i}{j} (\alpha'_i)^j (1-\alpha'_i)^{n'_i-j} = v(k, n'_i, \alpha'_i).
\end{aligned}$$
The last inequality holds because $P_i\left(-T_i(X) < (F'_i)^{-1}(1-\alpha'_i) \mid T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right) \le 1 - \alpha'_i$, and it becomes an equality when $F'_i$ is continuous. Also,
$$\mathbb{P}\left( P_i\left[T_i(X) < t'_{i(k'_i)} \mid t'_{i(k'_i)},\, T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right] > \alpha'_i \,\middle|\, n'_i \right) \le \delta'_i,$$
since $k'_i = \max\{k \mid v(k, n'_i, \alpha'_i) \le \delta'_i\}$. Then,
$$\mathbb{P}\left( P_i\left[T_i(X) < t'_{i(k'_i)} \mid t'_{i(k'_i)},\, T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right] > \alpha'_i \,\middle|\, E \right) = \mathbb{E}\left[ \sum_{j=0}^{k'_i-1} \binom{n'_i}{j}(\alpha'_i)^j(1-\alpha'_i)^{n'_i-j} \,\middle|\, E \right] \le \mathbb{E}[\delta'_i \mid E] = \delta'_i. \tag{5.13}$$

On the other hand, note that the event $E = \{\alpha'_i < 1 \text{ and } n'_i \ge \log\delta'_i/\log(1-\alpha'_i)\}$ is equivalent to
$$\left(1 - \frac{\alpha_i n_i}{n'_i + n_i c(n_i)}\right)^{n'_i} \le \delta'_i \tag{5.14}$$
and
$$\frac{\alpha_i n_i}{n'_i + n_i c(n_i)} < 1. \tag{5.15}$$
The left-hand side of the inequality (5.15) is decreasing in $n'_i$.
Also, the left-hand side of the inequality (5.14) is nonincreasing in $n'_i$ when the inequality (5.15) holds, which implies that
$$\mathbb{E}\left[n'_i \,\middle|\, n'_i \ge \frac{\log\delta'_i}{\log(1-\alpha'_i)} \text{ and } \alpha'_i < 1\right] \ge \mathbb{E}\left[n'_i \,\middle|\, n'_i < \frac{\log\delta'_i}{\log(1-\alpha'_i)} \text{ or } \alpha'_i \ge 1\right],$$
i.e., $\mathbb{E}[n'_i \mid E] \ge \mathbb{E}[n'_i \mid E^c]$. Immediately,
$$P_i\left(T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right) = \mathbb{E}\left[\frac{n'_i}{n_i}\right] \le \mathbb{E}\left[\frac{n'_i}{n_i} \,\middle|\, E\right] = \mathbb{E}[\hat{p}_i \mid E],$$
so
$$\begin{aligned}
\mathbb{P}\left(P_i\left(T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right) > p_i \mid E\right)
&= \mathbb{P}\left(P_i\left(T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right) - \hat{p}_i > c(n_i) \mid E\right) \\
&\le \mathbb{P}\left(\mathbb{E}[\hat{p}_i \mid E] - \hat{p}_i > c(n_i) \mid E\right) \\
&= \mathbb{P}\left(\mathbb{E}\left[\frac{n'_i}{n_i} \,\middle|\, E\right] - \frac{n'_i}{n_i} > c(n_i) \,\middle|\, E\right) = \mathbb{P}\left(\mathbb{E}[n'_i \mid E] - n'_i > n_i c(n_i) \mid E\right) \\
&\le e^{-2 n_i c^2(n_i)}.
\end{aligned}$$
The last inequality is by Hoeffding's inequality. Then,
$$\begin{aligned}
&\mathbb{P}\left(P_i\left(T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}, \text{ and } T_i(X) < t'_{i(k'_i)} \mid t'_{i(k'_i)}\right) > \alpha_i \,\middle|\, E\right) \\
&\quad= \mathbb{P}\left[ P_i\left(T_i(X) < t'_{i(k'_i)} \mid t'_{i(k'_i)},\, T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right) \times P_i\left(T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right) > \alpha_i \,\middle|\, E \right] \\
&\quad\le \mathbb{P}\left[ P_i\left(T_i(X) < t'_{i(k'_i)} \mid t'_{i(k'_i)},\, T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right) > \alpha_i/p_i \,\middle|\, E \right] \\
&\qquad + \mathbb{P}\left[ P_i\left(T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}\right) > p_i \,\middle|\, E \right] \\
&\quad\le \delta'_i + e^{-2 n_i c^2(n_i)} = \delta_i. \tag{5.16}
\end{aligned}$$

Case 2: We consider the event $E^c = \{n'_i < \log\delta'_i/\log(1-\alpha'_i) \text{ or } 1 - \alpha'_i \le 0\}$. Under this event, $\bar{t}_i = t_{i(k_i)}$. Since $n_i$ is deterministic, we have
$$\mathbb{P}\left( P_i\left(T_i(X) < t_{i(k)} \mid t_{i(k)}\right) > \alpha_i \,\middle|\, E^c \right) \le \sum_{j=0}^{k-1} \binom{n_i}{j} (\alpha_i)^j (1-\alpha_i)^{n_i-j} = v(k, n_i, \alpha_i).$$
The proof is similar to that of Eq (5.6) and (5.13), and even simpler since $n_i$ is deterministic, so we do not repeat the argument here. Recall that $k_i = \max\{k \mid v(k, n_i, \alpha_i) \le \delta_i\}$, so
$$\mathbb{P}\left( P_i\left(T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}, \text{ and } T_i(X) < t_{i(k_i)} \mid t_{i(k_i)}\right) > \alpha_i \,\middle|\, E^c \right) \le \mathbb{P}\left( P_i\left(T_i(X) < t_{i(k_i)} \mid t_{i(k_i)}\right) > \alpha_i \,\middle|\, E^c \right) \le \delta_i. \tag{5.17}$$

By Eq (5.10), (5.16) and (5.17),
$$\mathbb{P}\left( P_i\left(T_1(X) < t_1, \dots, T_{i-1}(X) < t_{i-1}, \text{ and } T_i(X) < t_i \mid t_i\right) > \alpha_i \right) \le \delta_i,$$
With the set of upper bounds on the thresholds chosen according to Theorem 9, the next step is to find an optimal set of thresholds(t 1 ,t 2 ,...,t I− 1 ) satisfying these upper bounds while minimizing the empir- ical version ofR c ( b ϕ ) , which is calculated using observations inS e = S I i=2 S ie (since class-1 observations are not needed in R c ( b ϕ )). For brevity, we denote all the empirical errors as ˜ R, e.g., ˜ R c . In Section 5.2.6, we will show numerically that Theorem 9 provides a wider search region for the thresholdt i compared to Proposition 2, which benefits the minimization of R c . As our COVID-19 data has three severity levels, in the next section, we will focus on the three-class H-NP umbrella algorithm and describe in more details how the above procedures can be combined to select the optimal thresholds in the final classifier. Algorithm6: UpperBound(S it ,α i ,δ i ,(T 1 ,...,T i ),(t 1 ,...,t i− 1 )) Input : The left-out class-i samples:S it ; level: α i ; tolerance: δ i ; score functions: (T 1 ,...,T i ); thresholds: (t 1 ,...,t i− 1 ). 1 n i ←| S it | 2 {t i(1) ,...,t i(n i ) }← sortT i ={T i (X)|X ∈S it } 3 k i ← DeltaSearch(n i ,α i ,δ i ) 4 t i ← t i(k i ) 5 if i>1then 6 T ′ i ←{ t ′ i(1) ,...,t ′ i(n ′ i ) }= sort{T i (X)|X ∈S it ,T 1 (X)<t 1 ,...,T i− 1 (X)<t i− 1 } ; // Note that n ′ i is random 7 ˆ p i ← n ′ i n i ,p i ← ˆ p i +c(n i ),α ′ i ← α i /p i ,δ ′ i ← δ i − e − 2n i c 2 (n i ) ; // e.g., c(n) = 2 √ n 8 if n ′ i ≥ logδ ′ i /log(1− α ′ i )andα ′ i <1then 9 k ′ i ← DeltaSearch(n ′ i ,α ′ i ,δ ′ i ) 10 t i ← t ′ i(k ′ i ) 11 end 12 end Output:t i 165 5.2.3 H-NPumbrellaalgorithmforthreeclasses Since our COVID-19 data groups patients into three severity categories, we introduce our H-NP umbrella algorithm forI =3. In this case, there are two under-diagnosis errorsR 1⋆ (ϕ )=P 1 (ϕ (X)∈{2,3}) and R 2⋆ (ϕ ) = P 2 (ϕ (X) = 3), which need to be controlled at prespecified levels α 1 ,α 2 with tolerance levels δ 1 ,δ 2 , respectively. 
In addition, we wish to minimize the weighted sum of errors
$$R_c(\phi) = \mathbb{P}(\phi(X) \ne Y) - \pi_1 R_{1\star}(\phi) - \pi_2 R_{2\star}(\phi) = \pi_2 P_2(\phi(X) = 1) + \pi_3\left[P_3(\phi(X) = 1) + P_3(\phi(X) = 2)\right]. \tag{5.18}$$
When $I = 3$, our H-NP umbrella algorithm relies on two scoring functions $T_1, T_2 : \mathcal{X} \to \mathbb{R}$, which can be constructed by Eq (5.3) using the estimates $\widehat{\mathbb{P}}(Y = i \mid X)$ from any scoring-type classification method:
$$T_1(X) = \widehat{\mathbb{P}}(Y = 1 \mid X) \quad \text{and} \quad T_2(X) = \frac{\widehat{\mathbb{P}}(Y = 2 \mid X)}{\widehat{\mathbb{P}}(Y = 3 \mid X)}. \tag{5.19}$$
The H-NP classifier then takes the form
$$\widehat{\phi}(X) = \begin{cases} 1, & T_1(X) \ge t_1; \\ 2, & T_2(X) \ge t_2 \text{ and } T_1(X) < t_1; \\ 3, & \text{otherwise}. \end{cases} \tag{5.20}$$
Here $T_2$ determines whether an observation belongs to class 2 or class 3, with a larger value indicating a higher probability of class 2. Applying Algorithm 6, we can find $\bar{t}_1$ such that any threshold $t_1 \le \bar{t}_1$ will satisfy the high probability control on the first under-diagnosis error, that is, $\mathbb{P}(R_{1\star}(\widehat{\phi}) > \alpha_1) = \mathbb{P}(P_1(T_1(X) < t_1 \mid t_1) > \alpha_1) \le \delta_1$. Recall that the computation of $\bar{t}_2$ (and consequently $t_2$) depends on the choice of $t_1$. Given a fixed $t_1$, the high probability control on the second under-diagnosis error is $\mathbb{P}(R_{2\star}(\widehat{\phi}) > \alpha_2) = \mathbb{P}(P_2(T_1(X) < t_1, T_2(X) < t_2 \mid t_2) > \alpha_2) \le \delta_2$, where $\bar{t}_2$ is computed by Algorithm 6 so that any $t_2 \le \bar{t}_2$ satisfies the constraint.

Figure 5.2: The influence of $t_1$ on the error $P_3(\widehat{Y} = 2)$. (a) The construction of $\mathcal{T}'_2$ with a fixed $t_1$. (b) The effect of decreasing $t_1$.

The interaction between $t_1$ and $t_2$ comes into play when minimizing the remaining errors in $R_c(\widehat{\phi})$. First note that, using Eq (5.18) and (5.20), the other types of errors in $R_c(\widehat{\phi})$ are
$$\begin{aligned} P_2\left(\widehat{\phi}(X) = 1\right) &= P_2\left(T_1(X) \ge t_1\right), \qquad P_3\left(\widehat{\phi}(X) = 1\right) = P_3\left(T_1(X) \ge t_1\right), \\ P_3\left(\widehat{\phi}(X) = 2\right) &= P_3\left(T_1(X) < t_1,\, T_2(X) \ge t_2\right). \end{aligned} \tag{5.21}$$
To simplify the notation, let $\widehat{Y}$ denote $\widehat{\phi}(X)$ in the following discussion. For a fixed $t_1$, decreasing $t_2$ increases $P_3(\widehat{Y} = 2)$ and has no effect on the other errors in (5.21), which means that $t_2 = \bar{t}_2$ minimizes $R_c(\widehat{\phi})$.
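The decision rule of Eq (5.20) is straightforward to express in code; a minimal sketch assuming class-probability estimates in a NumPy array (function name and array layout are ours):

```python
import numpy as np

def hnp_classify(prob, t1, t2):
    """Three-class H-NP decision rule of Eq (5.20).

    prob: (n, 3) array of estimated class probabilities P(Y = i | X).
    t1, t2: thresholds on the scores of Eq (5.19).
    """
    t1_scores = prob[:, 0]                  # T_1(X) = P(Y = 1 | X)
    t2_scores = prob[:, 1] / prob[:, 2]     # T_2(X) = P(Y = 2 | X) / P(Y = 3 | X)
    labels = np.full(len(prob), 3)          # default: least prioritized class 3
    labels[(t1_scores < t1) & (t2_scores >= t2)] = 2
    labels[t1_scores >= t1] = 1
    return labels
```

Note the sequential structure: class 1 is decided first by $T_1$, and only observations with $T_1(X) < t_1$ are passed on to the class-2-versus-3 decision via $T_2$.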
However, the selection of $t_1$ is not as straightforward as that of $t_2$. Figure 5.2a illustrates how the set $\mathcal{T}'_2 = \{T_2(X) \mid X \in \mathcal{S}_{2t},\, T_1(X) < t_1\}$ (as it appears in Theorem 9) is constructed for a given $t_1$, where the elements are ordered by their $T_2$ values. Clearly, more elements are removed from $\mathcal{T}'_2$ as $t_1$ decreases, leading to a smaller $n'_2$. Consider an element of $\mathcal{T}'_2$ with rank $k$ in the ordered list (colored yellow in Figure 5.2a). Then $k$, $n'_2$, $\alpha'_2$, and consequently $v(k, n'_2, \alpha'_2)$, will all be affected by decreasing $t_1$, but the change is not monotonic, as shown in Figure 5.2b. Decreasing $t_1$ could remove elements (dashed circles in Figure 5.2b) either to the left side (case 1) or the right side (case 2) of the yellow element, depending on the values of the scores $T_1$. In case 1, $v(k, n'_2, \alpha'_2)$ decreases, resulting in a larger $\bar{t}_2$ and a smaller $P_3(\widehat{Y} = 2)$ error, whereas the reverse can happen in case 2. The details of how $v(k, n'_2, \alpha'_2)$ changes can be found in Section 5.2.5, with additional simulations in Figure 5.7.

In view of the above, minimizing the empirical error $\tilde{R}_c$ requires a grid search over $t_1$, for which we use the set $\mathcal{T}_1 = \{T_1(X) \mid X \in \mathcal{S}_{1t}\}$. The overall algorithm for finding the optimal thresholds and the resulting classifier is described in Algorithm 7, which we name the H-NP umbrella algorithm. The algorithm for the general case $I > 3$ can be found in the next section.

Algorithm 7: H-NP umbrella algorithm for $I = 3$
Input: sample $\mathcal{S} = \mathcal{S}_1 \cup \mathcal{S}_2 \cup \mathcal{S}_3$; levels $(\alpha_1, \alpha_2)$; tolerances $(\delta_1, \delta_2)$; grid set $\mathcal{A}_1$ (e.g., $\mathcal{T}_1$).
1: $\widehat{\pi}_2 = |\mathcal{S}_2|/|\mathcal{S}|$; $\widehat{\pi}_3 = |\mathcal{S}_3|/|\mathcal{S}|$
2: $\mathcal{S}_{1s}, \mathcal{S}_{1t} \leftarrow$ randomly split $\mathcal{S}_1$
3: $\mathcal{S}_{2s}, \mathcal{S}_{2t}, \mathcal{S}_{2e} \leftarrow$ randomly split $\mathcal{S}_2$
4: $\mathcal{S}_{3s}, \mathcal{S}_{3e} \leftarrow$ randomly split $\mathcal{S}_3$
5: $\mathcal{S}_s = \mathcal{S}_{1s} \cup \mathcal{S}_{2s} \cup \mathcal{S}_{3s}$
6: $T_1, T_2 \leftarrow$ a base classification method on $\mathcal{S}_s$ ; // c.f.
Eq (5.19)
7: $\bar{t}_1 \leftarrow$ UpperBound$(\mathcal{S}_{1t}, \alpha_1, \delta_1, (T_1),$ NULL$)$
8: $\tilde{R}_c = 1$
9: for $t_1 \in \mathcal{A}_1 \cap (-\infty, \bar{t}_1]$ do
10:  $\bar{t}_2 \leftarrow$ UpperBound$(\mathcal{S}_{2t}, \alpha_2, \delta_2, (T_1, T_2), (t_1))$ ; // i.e., Algorithm 6
11:  $\widehat{\phi} \leftarrow$ a classifier with respect to $t_1, \bar{t}_2$
12:  $e_{21} = \sum_{X \in \mathcal{S}_{2e}} \mathbb{1}\{\widehat{\phi}(X) = 1\}/|\mathcal{S}_{2e}|$, $e_3 = \sum_{X \in \mathcal{S}_{3e}} \mathbb{1}\{\widehat{\phi}(X) \in \{1,2\}\}/|\mathcal{S}_{3e}|$
13:  $\tilde{R}_c^{\text{new}} = \widehat{\pi}_2 e_{21} + \widehat{\pi}_3 e_3$
14:  if $\tilde{R}_c^{\text{new}} < \tilde{R}_c$ then
15:    $\tilde{R}_c \leftarrow \tilde{R}_c^{\text{new}}$, $\widehat{\phi}^* \leftarrow \widehat{\phi}$
16:  end
17: end
Output: $\widehat{\phi}^*$

5.2.4 General H-NP umbrella algorithm for $I$ classes

For a general $I$, we conduct a grid search over dimension $I - 2$: for $1 \le i \le I - 2$, the threshold $t_i$ is searched over its feasible grid below the upper bound computed sequentially by Algorithm 6, and $t_{I-1}$ is set to its upper bound, as in the $I = 3$ case.

5.2.5 The behavior of $v(k, n'_2, \alpha'_2)$

Write $v(k, n, \alpha) = \sum_{j=0}^{k-1}\binom{n}{j}\alpha^j(1-\alpha)^{n-j} = F_{n,\alpha}(k-1)$, where $F_{n,\alpha}$ denotes the CDF of a Binomial$(n, \alpha)$ random variable and $P_{n,\alpha}(z) = \binom{n}{z}\alpha^z(1-\alpha)^{n-z}$ the corresponding probability mass function. Two standard identities will be used:
$$F_{n+1,\alpha}(z) = (1-\alpha)F_{n,\alpha}(z) + \alpha F_{n,\alpha}(z-1), \tag{5.22}$$
$$\frac{\partial F_{n,\alpha}(z)}{\partial \alpha} = -n P_{n-1,\alpha}(z). \tag{5.23}$$
Removing one element of $\mathcal{T}'_2$ to the left of the rank-$k$ element (case 1 in Figure 5.2b) replaces $(n'_2, k)$ by $(n'_2 - 1, k - 1)$. For $z > 0$,
$$F_{n+1,\alpha_{n+1}}(z) - F_{n,\alpha_n}(z-1) = F_{n+1,\alpha_{n+1}}(z) - F_{n,\alpha_{n+1}}(z-1) + F_{n,\alpha_{n+1}}(z-1) - F_{n,\alpha_n}(z-1) \ge (1-\alpha_{n+1})P_{n,\alpha_{n+1}}(z) > 0.$$
Note that
$$\alpha'_2 = \frac{\alpha_2}{p_2} = \frac{\alpha_2}{n'_2/n_2 + c(n_2)} = \frac{\alpha_2 n_2}{n'_2 + c(n_2) n_2}.$$
Since $n_2$ is fixed, we can write $\alpha'_2$ in the form $c_1/(n'_2 + c_2)$ for some positive constants $c_1$ and $c_2$; above, $\alpha_n := c_1/(n + c_2)$. In other words, if we remove an element to the left of the yellow element in case 1 of Figure 5.2b, the value of $v(k, n'_2, \alpha'_2)$ will decrease. By contrast, there are situations in case 2 of Figure 5.2b that will increase $v(k, n'_2, \alpha'_2)$, as discussed in the following lemma.

Lemma 23. Let $Z \sim \text{Binomial}(n, \alpha_n)$, where $\alpha_n = c_1/(n + c_2)$ and $c_2 \ge c_1 > 0$. For fixed $z > 0$, $F_{n,\alpha_n}(z)$ is decreasing with respect to $n \ge n_l$, where $n_l = \min\{n \mid (n-1)\alpha_{n+1} \ge z\}$.

Proof. Eq (5.22) implies
$$F_{n+1,\alpha}(z) - F_{n,\alpha}(z) = -\alpha P_{n,\alpha}(z). \tag{5.24}$$
On the other hand, for fixed $n$ and $z$,
$$\frac{d \log P_{n,\alpha}(z)}{d\alpha} = \frac{z}{\alpha} - \frac{n-z}{1-\alpha} \;\begin{cases} > 0, & \alpha < z/n; \\ = 0, & \alpha = z/n; \\ < 0, & \alpha > z/n, \end{cases} \tag{5.25}$$
so $P_{n-1,\alpha}(z)$ is decreasing on $[\alpha_{n+1}, \alpha_n]$ when $(n-1)\alpha_{n+1} \ge z$. Then, according to Eq (5.23),
$$F_{n,\alpha_{n+1}}(z) - F_{n,\alpha_n}(z) \le n(\alpha_n - \alpha_{n+1}) P_{n-1,\alpha_{n+1}}(z). \tag{5.26}$$
By Eq (5.24) and (5.26), for $(n-1)\alpha_{n+1} \ge z$,
$$\begin{aligned}
F_{n+1,\alpha_{n+1}}(z) - F_{n,\alpha_n}(z) &= F_{n+1,\alpha_{n+1}}(z) - F_{n,\alpha_{n+1}}(z) + F_{n,\alpha_{n+1}}(z) - F_{n,\alpha_n}(z) \\
&\le -\alpha_{n+1} P_{n,\alpha_{n+1}}(z) + n(\alpha_n - \alpha_{n+1}) P_{n-1,\alpha_{n+1}}(z) \\
&= -\binom{n}{z}(\alpha_{n+1})^{z+1}(1-\alpha_{n+1})^{n-z} + \frac{n}{n+c_2}\binom{n-1}{z}(\alpha_{n+1})^{z+1}(1-\alpha_{n+1})^{n-z-1} \\
&= \left(\alpha_{n+1} - 1 + \frac{n-z}{n+c_2}\right)\binom{n}{z}(\alpha_{n+1})^{z+1}(1-\alpha_{n+1})^{n-z-1} \\
&\le \left(-\frac{n+1+c_2-c_1}{n+1+c_2} + \frac{n}{n+c_2}\right)\binom{n}{z}(\alpha_{n+1})^{z+1}(1-\alpha_{n+1})^{n-z-1} \\
&= -\frac{(c_2-c_1)(n+c_2)+c_2}{(n+c_2)(n+c_2+1)}\binom{n}{z}(\alpha_{n+1})^{z+1}(1-\alpha_{n+1})^{n-z-1} < 0,
\end{aligned}$$
since $c_2 \ge c_1 > 0$. Furthermore,
$$(n-1)\alpha_{n+1} = \frac{(n-1)c_1}{n+1+c_2}$$
is increasing in $n$, i.e., any $n \ge n_l$ satisfies $(n-1)\alpha_{n+1} \ge z$. This establishes that $F_{n,\alpha_n}(z)$ is decreasing with respect to $n$ when $n \ge n_l$.

This lemma presents a situation for case 2 in Figure 5.2b where $v(k, n'_2, \alpha'_2)$ will increase.

5.2.6 Simulation studies for the H-NP umbrella algorithm

We first examine the validity of our H-NP umbrella algorithm using simulated data from setting T1: $I = 3$, and the feature vectors in class $i$ are generated as $(X^i)^\top \sim N(\mu^i, \mathbf{I})$, where $\mu^1 = (0,-1)^\top$, $\mu^2 = (-1,1)^\top$, $\mu^3 = (1,0)^\top$, and $\mathbf{I}$ is the $2 \times 2$ identity matrix. For each simulated dataset, we generate the feature vectors and labels with 500 observations in each of the three classes. The observations are randomly separated into parts for score training, threshold selection, and computing empirical errors: $\mathcal{S}_1$ is split into $50\%, 50\%$ for $\mathcal{S}_{1s}, \mathcal{S}_{1t}$; $\mathcal{S}_2$ is split into $45\%, 50\%$, and $5\%$ for $\mathcal{S}_{2s}, \mathcal{S}_{2t}$, and $\mathcal{S}_{2e}$; $\mathcal{S}_3$ is split into $95\%, 5\%$ for $\mathcal{S}_{3s}, \mathcal{S}_{3e}$, respectively. All the results in this section are based on 1000 repetitions from a given setting. We set $\alpha_1 = \alpha_2 = 0.05$ and $\delta_1 = \delta_2 = 0.05$. To approximate and evaluate the true population errors $R_{1\star}$, $R_{2\star}$ and $R_c$, we additionally generate 20,000 observations for each class and refer to them as the test set.

First, we demonstrate that Algorithm 7 outputs an H-NP classifier with the desired high probability controls.
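When $I = 3$, Algorithm 7 reduces to a one-dimensional grid search over $t_1$; a minimal sketch, where the callbacks stand in for Algorithm 6 and the empirical-error computation (names are ours):

```python
def grid_search_thresholds(t1_grid, t1_bar, upper_bound2, errors):
    """Algorithm 7 (sketch): pick t_1 in the grid, capped at its upper bound
    t1_bar, that minimizes the empirical weighted error; t_2 is set to its
    own upper bound for each candidate t_1.

    upper_bound2(t1) -> bar t_2 given t1 (Algorithm 6).
    errors(t1, t2)   -> empirical weighted error  pi2 * e21 + pi3 * e3.
    Returns (t1, t2, error) for the best candidate.
    """
    best = (None, None, float("inf"))
    for t1 in (t for t in t1_grid if t <= t1_bar):
        t2 = upper_bound2(t1)
        r = errors(t1, t2)
        if r < best[2]:
            best = (t1, t2, r)
    return best
```

With toy callbacks, e.g., `upper_bound2 = lambda t1: 1 - t1` and `errors = lambda t1, t2: (t1 - 0.2)**2`, the search returns the grid point closest to 0.2 among candidates below the cap.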
More specifically, we show that any $t_1 \le \bar{t}_1$ and $t_2 = \bar{t}_2$ ($\bar{t}_1$, $\bar{t}_2$ are computed by Algorithm 6) will lead to a valid threshold pair $(t_1, t_2)$ satisfying $\mathbb{P}(R_{1\star}(\widehat{\phi}) > \alpha_1) \le \delta_1$ and $\mathbb{P}(R_{2\star}(\widehat{\phi}) > \alpha_2) \le \delta_2$, where $R_{1\star}$ and $R_{2\star}$ are approximated using the test set in each round of simulation. Here, we use multinomial logistic regression to construct the scoring functions $T_1$ and $T_2$, the inputs of Algorithm 7. Figure 5.3 displays the boxplots of various approximate errors with $t_1$ chosen as the $k$-th largest element in $\mathcal{T}_1 \cap (-\infty, \bar{t}_1]$ as $k$ changes. In Figures 5.3a and 5.3b, where the blue diamonds mark the $95\%$ quantiles, we can see that the violation rate of the required error bounds (red dashed lines, representing $\alpha_1$ and $\alpha_2$) is about $5\%$ or less, suggesting our procedure provides effective control on the errors of concern. In this case, in most simulation rounds, $\bar{t}_1$ minimizes the empirical error $\tilde{R}_c$ computed on $\mathcal{S}_{2e}$ and $\mathcal{S}_{3e}$, and $t_1 = \bar{t}_1$ is chosen as the optimal threshold by Algorithm 7 in the final classifier. This coincides with Figure 5.3c, which shows that the largest element in $\mathcal{T}_1 \cap (-\infty, \bar{t}_1]$ (i.e., $\bar{t}_1$) minimizes the approximate error $R_c$ on the test set. We note here that the results from other split settings can be found in Figures 5.4 and 5.5 and show similar trends. As the minimum sample size requirement on $\mathcal{S}_{it}$ is 59 in this case, the setting in Figure 5.3 already exceeds the requirement; further increasing its size makes minimal difference in the performance of the H-NP classifier.

Figure 5.3: The distribution of approximate errors on the test set when $t_1$ is the $k$-th largest element in $\mathcal{T}_1 \cap (-\infty, \bar{t}_1]$; panels (a) $R_{1\star}$, (b) $R_{2\star}$, (c) $R_c$. The $95\%$ quantiles of $R_{1\star}$ and $R_{2\star}$ are marked by blue diamonds. The target control levels for $R_{1\star}(\widehat{\phi})$ and $R_{2\star}(\widehat{\phi})$ ($\alpha_1 = \alpha_2 = 0.05$) are plotted as red dashed lines.
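The roughly $5\%$ violation rates seen in Figure 5.3 can be reproduced in a stylized setting: with continuous scores, the bound of Proposition 2 holds with equality. A quick Monte Carlo spot-check with uniform scores (setup is ours, not part of the simulation study):

```python
import numpy as np

# For k = 1 the violation event {t_(1) > alpha} means no left-out score
# falls below alpha, so its probability is (1 - alpha)^n = v(1, n, alpha).
rng = np.random.default_rng(0)
n, alpha, reps = 59, 0.05, 20000
t1 = rng.uniform(size=(reps, n)).min(axis=1)   # smallest order statistic
rate = (t1 > alpha).mean()                     # empirical violation rate
bound = (1 - alpha) ** n                       # v(1, 59, 0.05), just below 0.05
```

Here $n = 59$ is exactly the minimum sample size for $\alpha = \delta = 0.05$, so the violation probability sits just under the tolerance $\delta$.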
Figure 5.4: The distribution of approximate errors when $t_1$ is the $k$-th largest element in $\mathcal{T}_1 \cap (-\infty, \bar{t}_1]$; panels (a) $R_{1\star}$, (b) $R_{2\star}$, (c) $R_c$. The $95\%$ quantiles of $R_{1\star}$ and $R_{2\star}$ are marked by blue diamonds. The target control levels for $R_{1\star}(\widehat{\phi})$ and $R_{2\star}(\widehat{\phi})$ ($\alpha_1 = \alpha_2 = 0.05$) are plotted as red dashed lines. The data are generated under setting T1 except that a different sample splitting ratio is used: samples in $\mathcal{S}_1$ are randomly split into $30\%$ for $\mathcal{S}_{1s}$ (score) and $70\%$ for $\mathcal{S}_{1t}$ (threshold); $\mathcal{S}_2$ into $25\%$ for $\mathcal{S}_{2s}$, $70\%$ for $\mathcal{S}_{2t}$, $5\%$ for $\mathcal{S}_{2e}$ (evaluation); $\mathcal{S}_3$ into $95\%$ for $\mathcal{S}_{3s}$ and $5\%$ for $\mathcal{S}_{3e}$.

Figure 5.5: The distribution of approximate errors when $t_1$ is the $k$-th largest element in $\mathcal{T}_1 \cap (-\infty, \bar{t}_1]$; panels (a) $R_{1\star}$, (b) $R_{2\star}$, (c) $R_c$. The $95\%$ quantiles of $R_{1\star}$ and $R_{2\star}$ are marked by blue diamonds. The target control levels for $R_{1\star}(\widehat{\phi})$ and $R_{2\star}(\widehat{\phi})$ ($\alpha_1 = \alpha_2 = 0.05$) are plotted as red dashed lines. The data are generated under setting T1 except that a different sample splitting ratio is used: samples in $\mathcal{S}_1$ are randomly split into $70\%$ for $\mathcal{S}_{1s}$ and $30\%$ for $\mathcal{S}_{1t}$; $\mathcal{S}_2$ into $65\%$ for $\mathcal{S}_{2s}$, $30\%$ for $\mathcal{S}_{2t}$, $5\%$ for $\mathcal{S}_{2e}$; $\mathcal{S}_3$ into $95\%$ for $\mathcal{S}_{3s}$ and $5\%$ for $\mathcal{S}_{3e}$.

Next, we check whether Theorem 9 indeed gives a better upper bound on $t_2$ than Proposition 2 for overall error minimization. Recall the two upper bounds in Eq (5.7) ($t_{2(k_2)}$) and Eq (5.10) ($\bar{t}_2$). For each base classification algorithm (e.g., logistic regression), we set $t_1 = \bar{t}_1$ and set $t_2$ equal to these two upper bounds respectively, resulting in two classifiers with different $t_2$ thresholds.
We compare their performance by evaluating the approximate errors $R_{2\star}(\widehat{\phi})$ and $P_3(\widehat{Y} = 2)$ since, as discussed in Section 5.2.3, the threshold $t_2$ only influences these two errors for a fixed $t_1$. Figure 5.6 shows the distributions of the errors, and also their averages, for three different base classification algorithms. Under each algorithm, both choices of $t_2$ effectively control $R_{2\star}(\widehat{\phi})$, but the upper bound from Proposition 2 is overly conservative compared with that of Theorem 9, which results in a notable increase in $P_3(\widehat{Y} = 2)$. This is undesirable since $P_3(\widehat{Y} = 2)$ is one component of $R_c(\widehat{\phi})$, and the goal is to minimize $R_c(\widehat{\phi})$ under appropriate error controls.

Now we consider comparing our H-NP classifier against alternative approaches. Since existing methods for controlling specific errors in the multi-class classification setting do not directly apply to the type of under-diagnosis errors we consider (e.g., [124, 70]) and provide "approximate" instead of high probability error controls, we construct an example of "approximate" error control using the empirical ROC curve approach.

Figure 5.6: The distribution and averages of approximate errors on the test set under setting T1. "error23" and "error32" correspond to $R_{2\star}(\widehat{\phi})$ and $P_3(\widehat{Y} = 2)$, respectively. Average errors ("Prop 1" and "Thm 1" label the bounds from Proposition 2 and Theorem 9):

Base method          | Bound  | error23 | error32
Logistic Regression  | Prop 1 | 0.006   | 0.082
Logistic Regression  | Thm 1  | 0.020   | 0.046
Random Forest        | Prop 1 | 0.004   | 0.077
Random Forest        | Thm 1  | 0.017   | 0.033
SVM                  | Prop 1 | 0.004   | 0.077
SVM                  | Thm 1  | 0.018   | 0.033

In this case, each class of observations is split into two parts: one for training the base classification method, the other for threshold selection using the ROC curve.
Under setting T1, using similar split ratios as before, we separate $\mathcal{S}_i$ into $50\%$ and $50\%$ for $\mathcal{S}_{is}$ and $\mathcal{S}_{it}$, $i = 1, 2, 3$. The same test set is used. We re-compute the scoring functions ($T_1$ and $T_2$) corresponding to the new split. $t_1$ is selected using the ROC curve generated by $T_1$, aiming to distinguish between class 1 (samples in $\mathcal{S}_{1t}$) and class $2'$ (samples in $\mathcal{S}_{2t} \cup \mathcal{S}_{3t}$, merging classes 2 and 3), with specificity calculated as the rate of misclassifying a class-1 observation into class $2'$. Similarly, $t_2$ is selected using $T_2$, dividing samples in $\mathcal{S}_{2t} \cup \mathcal{S}_{3t}$ into class 2 and class 3, with specificity defined as the rate of misclassifying a class-2 observation into class 3. More specifically, in Eq (5.20) we use
$$t_1 = \sup\left\{t : \frac{\sum_{X \in \mathcal{S}_{1t}} \mathbb{1}\{T_1(X) < t\}}{|\mathcal{S}_{1t}|} \le \alpha_1\right\} \quad \text{and} \quad t_2 = \sup\left\{t : \frac{\sum_{X \in \mathcal{S}_{2t}} \mathbb{1}\{T_2(X) < t\}}{|\mathcal{S}_{2t}|} \le \alpha_2\right\}$$
to obtain the classifier for the ROC curve approach.

The comparison between our H-NP classifier and the ROC curve approach is summarized in Figure 5.8. Recalling that $\alpha_i$ and $\delta_i$ are both $0.05$, we mark the $95\%$ quantiles of the under-diagnosis errors by solid black lines and the target error control levels by dotted red lines. First, we observe that the $95\%$ quantiles of $R_{1\star}$ using the ROC curve approach well exceed the target control level, in fact with their averages centering around the target. We also see the influence of $t_1$ on $R_{2\star}$: without a suitable adjustment of $t_2$ based on $t_1$, the control on $R_{2\star}(\widehat{\phi})$ in the ROC curve approach is overly conservative despite it being an approximate error control method, which in turn leads to inflation in the error $P_3(\widehat{Y} = 2)$. In view of this, we further consider a simulation setting where the influence of $t_1$ on $t_2$ is smaller. Setting T2 moves samples in class 1 further away from classes 2 and 3 by setting $\mu^1 = (0, -3)^\top$, while the other parts remain the same as setting T1; $\alpha_i$ and $\delta_i$ are still $0.05$.
The distributions of the under-diagnosis errors are shown in Figure 5.9, demonstrating that the ROC curve approach does not provide the required level of control for $R_{1\star}$ or $R_{2\star}$.

At the end of this section, we provide a simulation example exhibiting that the influence of $t_1$ on the weighted sum of errors $R_c(\widehat{\phi})$ is not monotonic; the cause of the non-monotonicity is explained at the end of Section 5.2.3. Here, setting T3 sets the proportion of each class by taking $n_1 = 1000$, $n_2 = 200$, $n_3 = 800$. As a result, the weight for $P_3(\widehat{Y} \in \{1,2\})$ in $R_c(\widehat{\phi})$ increases, and the weight for $P_2(\widehat{Y} = 1)$ decreases. We set $\mu^1 = (0,0)^\top$, $\mu^2 = (-0.5, 0.5)^\top$, and $\mu^3 = (2,2)^\top$. Under this setting, classes 1 and 2 are closer to each other in Euclidean distance than to class 3. We increase $\alpha_1$ and $\alpha_2$ to $0.1$ to get a wider search region for $t_1$ so that the pattern of $R_c$ is easier to observe. To approximate the true errors $R_{1\star}$, $R_{2\star}$ and $R_c$ on a test set, we generate $30{,}000$, $6{,}000$ and $24{,}000$ observations for classes 1, 2, 3, respectively, so that the ratio of the three classes in the test set is the same as $n_1 : n_2 : n_3$. Other settings are the same as setting T1 in Section 5.2.6. The results of this new simulation setting are presented in Figure 5.7: we can observe that $R_c$ is not monotonically decreasing, but our procedure still maintains effective controls on the under-diagnosis errors.

Figure 5.7: The distribution of approximate errors when $t_1$ is the $k$-th largest element in $\mathcal{T}_1 \cap (-\infty, \bar{t}_1]$; panels (a) $R_{1\star}$, (b) $R_{2\star}$, (c) $R_c$. The $95\%$ quantiles ($\delta_1 = \delta_2 = 0.05$) of $R_{1\star}$ and $R_{2\star}$ are marked by blue diamonds. The target control levels for $R_{1\star}(\widehat{\phi})$ and $R_{2\star}(\widehat{\phi})$ ($\alpha_1 = \alpha_2 = 0.1$) are plotted as red dashed lines. The averages of $R_c$ are marked by red points in (c).
Figure 5.8: The distributions of approximate errors on the test set under setting T1. "error1", "error23" and "error32" correspond to $R_{1\star}(\widehat{\phi})$, $R_{2\star}(\widehat{\phi})$ and $P_3(\widehat{Y} = 2)$, respectively. Summary statistics:

Base method          | Approach | error1 (95% quantile) | error23 (mean) | error32 (mean)
Logistic Regression  | ROC      | 0.074 | 0.023 | 0.096
Logistic Regression  | H-NP     | 0.045 | 0.036 | 0.047
Random Forest        | ROC      | 0.077 | 0.020 | 0.093
Random Forest        | H-NP     | 0.047 | 0.034 | 0.032
SVM                  | ROC      | 0.078 | 0.023 | 0.098
SVM                  | H-NP     | 0.048 | 0.037 | 0.047

Figure 5.9: The distributions of approximate errors on the test set under setting T2. "error1" and "error23" correspond to the errors $R_{1\star}(\widehat{\phi})$ and $R_{2\star}(\widehat{\phi})$, respectively.

5.3 Application to COVID-19 severity classification

5.3.1 Integrating COVID-19 data

We integrate 20 publicly available scRNA-seq datasets to form a total of 864 COVID-19 patients with three severity levels, marked as "Severe/Critical" (318 patients), "Mild/Moderate" (353 patients), and "Healthy" (193 patients). The details of each dataset and its patient composition can be found in Table 5.1. The severe, moderate and healthy patients are labeled as classes 1, 2 and 3, respectively. Before integration, we performed size-factor standardization and log transformation on the raw count expression matrices using the logNormCounts function in the R package scater (version 1.16.2) [87] and generated log-transformed gene expression matrices.
All the PBMC datasets are integrated by scMerge2 [77], which is specifically designed for merging multi-sample and multi-condition studies. Following the standard pipeline for assessing the quality of integration, in Figure 5.10 we show the UMAP projections of all cells from all the studies, obtained from the top 20 principal components of the merged gene-by-cell expression matrix, (a) before integration and (b) after integration. The cells are colored by their cell types (left column) or by which study (or batch) they come from (right column). Before integration, cells from the same cell type are split into separate clusters based on batch labels, indicating the presence of batch effects. After integration, cells from the same cell type are significantly better mixed while the distinctions among cell types are preserved. To construct pseudo-bulk expression profiles, we input the cell types annotated by scClassify [78] (using the cell types in [114] as reference) into scMerge2. The resulting profiles are used to identify mutual nearest subgroups as pseudo-replicates and to estimate the parameters of the scMerge2 model. We select the top 3,000 highly variable genes through the function modelGeneVar in the R package scran [83], and for each patient calculate the average expression of each cell type for the selected genes; i.e., for each patient, the integrated dataset provides an $n_g \times n_c$ matrix recording the average gene expressions, where $n_g$ is the number of genes ($n_g = 3{,}000$) and $n_c$ is the number of cell types ($n_c = 18$).

Other than the scRNA-seq data, we also include age as a predictor in the integrated dataset. Most of the datasets used in our study recorded age information either as an exact number or as an age group, while the rest did not provide this information (see Table 5.2). In the integrated dataset, we use the lower end of the age group recorded for patients with no exact age, and replace the missing values with the average age (52.23).
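The per-patient averaging step described above can be sketched as follows; this is a plain-NumPy illustration of the averaging itself, not the scMerge2 pipeline (array layout and names are ours):

```python
import numpy as np

def pseudo_bulk(expr, cell_types, n_types):
    """Average log-expression of each gene within each cell type for one patient.

    expr: (n_cells, n_genes) array of log-normalized expression values.
    cell_types: (n_cells,) integer cell-type labels in [0, n_types).
    Returns an (n_genes, n_types) matrix; cell types with no cells stay zero.
    """
    n_genes = expr.shape[1]
    out = np.zeros((n_genes, n_types))
    for t in range(n_types):
        mask = cell_types == t
        if mask.any():
            out[:, t] = expr[mask].mean(axis=0)
    return out
```

Applied per patient with $n_c = 18$ cell types and the 3,000 selected genes, this yields the $n_g \times n_c$ matrix described above.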
5.3.2 ScRNA-seq data and featurization

For each patient, PBMC scRNA-seq data is available in the form of a matrix $A^{(j)} \in \mathbb{R}^{n_g \times n_c}$ recording the expression levels of genes in hundreds to thousands of cells, where $n_c = 18$ is the number of cell types and $n_g = 3{,}000$ is the number of genes for analysis. Figure 5.12 shows the distribution of the sparsity levels, i.e., the proportion of genes with zero values, under each cell type across all the patients. Despite having a significant proportion of zeros, several cell types have varying sparsity across the three severity classes (Figure 5.11), suggesting their activity level might be informative for classification.

Table 5.1: Number of patients under each severity level in each dataset. The datasets marked with * were utilized by both studies [96, 128] in their respective analyses.

Publication | Severe/Critical | Mild/Moderate | Healthy | Total
[7] | 4 | 3 | 5 | 12
[19] | 21 | 6 | 5 | 32
[30] | 62 | 31 | 10 | 103
[31] | 9 | 11 | 14 | 34
[71] | 3 | 4 | 5 | 12
[80] | 30 | 3 | 14 | 47
[96]* | - | - | 19 | 19
[99] | 70 | 61 | 20 | 151
[105] | 17 | 19 | 38 | 74
[106] | 2 | 6 | 4 | 12
[109] | 5 | 2 | 3 | 10
[110] | 21 | - | - | 21
[114] | 28 | 53 | 32 | 113
[118] | 12 | 117 | - | 129
[123] | 5 | - | 3 | 8
[128]* | 10 | - | - | 10
[136] | 11 | 20 | 8 | 39
[145] | 6 | 5 | - | 11
[149] | 1 | 8 | 10 | 19
[152] | 1 | 4 | 3 | 8
Total | 318 | 353 | 193 | 864

Since classical classification methods typically use feature vectors as input, appropriate featurization that transforms the expression matrices into vectors is needed. We propose four featurization methods that differ in their treatment of the following aspects.

• As we observe that the sparsity level in some cell types changes across the severity classes, we expect different treatments of zeros to influence the classification performance. Three approaches are proposed: 1) no special treatment (M.1); 2) remove individual zeros but keep all cell types (M.4); 3) remove cell types with a significant amount of zeros across all three classes (M.2 and M.3).
[Figure 5.10 panels: UMAP plots (UMAP1 vs. UMAP2) colored by cell type and by study.] (a) The raw dataset before processing. (b) The dataset integrated through scMerge2. Figure 5.10: The two left UMAP plots are colored based on cell types predicted by scClassify (using [114] as reference).
The two right UMAP plots are colored by batch.

• Dimension reduction is commonly used to project the information in a matrix onto a vector. We consider performing dimension reduction along different directions, namely row projections, which take combinations of genes (M.2), and column projections, which combine cell types with appropriate weights (M.3 and M.4). Since we aim to compare choices of projection direction, we focus on principal component analysis (PCA) as our dimension reduction method.

• We consider two approaches to generate the PCA loadings: 1) overall PCA loadings (M.2 and M.4), where we perform PCA on the whole data to output a loading vector shared by all patients; 2) patient-specific PCA loadings (M.3), where PCA is performed on each matrix A^{(j)} to get an individual-specific loading vector.

Publication    Age recording format    Example
[7]            exact age               64
[19]           not available           NA
[30]           age group               61-70
[31]           exact age               64
[71]           exact age               64
[80]           exact age               64
[96, 128]      exact age               64
[99]           exact age               64
[105]          age group               61-65
[106]          exact age               64
[109]          exact age               64
[110]          exact age               64
[114]          age group               60-69
[118]          exact age               64
[123]          not available           NA
[136]          age group               60-69
[145]          not available           NA
[149]          exact age               64
[152]          exact age               64

Table 5.2: Format of age information in each dataset. An example record for a 64-year-old patient is provided for each dataset.

The details of each featurization method are as follows.

M.1 Simple feature screening: we consider each element A^{(j)}_{uv} (gene u under cell type v) as a possible feature for patient j and use its standard deviation across all patients, denoted as SD_{uv}, to screen the features. Elements that hardly vary across the patients are likely to have low discriminative power for classification. Let SD_{(i)} be the i-th largest element in {SD_{uv} | u ∈ [n_g], v ∈ [n_c]}. The feature vector for each patient consists of the entries in {A^{(j)}_{uv} | SD_{uv} ≥ SD_{(n_f)}}, where n_f is the number of features desired, set to 3,000.
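A minimal sketch of the screening in M.1, on a small random array standing in for the stacked patient matrices; the toy sizes and the numpy-based implementation are our illustration, not the chapter's code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in: N patients, n_g genes, n_c cell types
# (the chapter uses N = 864, n_g = 3,000, n_c = 18, n_f = 3,000).
N, n_g, n_c, n_f = 5, 20, 4, 10
A = rng.random((N, n_g, n_c))

sd = A.std(axis=0)                     # SD_uv, shape (n_g, n_c)
cutoff = np.sort(sd, axis=None)[-n_f]  # the n_f-th largest SD
mask = sd >= cutoff                    # selected (gene, cell type) pairs
features = A[:, mask]                  # one length-n_f vector per patient
print(features.shape)                  # (5, 10)
```

Ties in the standard deviations could select slightly more than n_f entries; with continuous expression values this is not a concern in practice.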
M.2 Overall gene combination: after removing cell types with mostly zero expression values across all patients, we select 17 cell types to construct Ã^{(j)} ∈ R^{n_g × 17}, which only preserves the columns of A^{(j)} corresponding to the selected cell types. Then Ã^{(1)}, ..., Ã^{(N)} are concatenated column-wise to get Ã_all ∈ R^{n_g × (N × 17)}, where N = 864. Let w̃ ∈ R^{n_g × 1} denote the first principal component loadings of (Ã_all)^⊤; the feature vector for patient j is given by X_j = w̃^⊤ Ã^{(j)}.

M.3 Individual-specific cell type combination: for patient j, the loading vector w̃_j ∈ R^{17 × 1} is taken as the absolute values of the first principal component loadings of Ã^{(j)}, the matrix with the 17 cell types selected in M.2. The principal component loading vector w̃_j that produces X_j = (Ã^{(j)} w̃_j)^⊤ is patient-specific, intended to reflect different cell type compositions in different individuals.

M.4 Common cell type combination: we compute an expression matrix Ā averaged over all patients, defined entrywise as Ā_{uv} = (Σ_{j ∈ [N]} A^{(j)}_{uv}) / |{j ∈ [N] | A^{(j)}_{uv} ≠ 0}|, where |·| is the cardinality function. Let w ∈ R^{n_c × 1} denote the first principal component loadings of Ā; then the feature vector for the j-th patient is X_j = (A^{(j)} w)^⊤.

In the featurization methods M.2 and M.3, we remove the cell type ILC, whose zero proportion hardly changes across the three classes (Figure 5.11) and whose average zero proportion is greater than 95% (Table 5.3). 17 cell types are left: B, CD14 Mono, CD16 Mono, CD4 T, CD8 T, DC, gdT, HSPC, MAST, Neutrophil, NK, NKT, Plasma, Platelet, RBC, DN, MAIT.

cell type     zero proportion    cell type     zero proportion
B             0.054              Neutrophil    0.514
CD14 Mono     0.028              NK            0.025
CD16 Mono     0.142              NKT           0.099
CD4 T         0.024              Plasma        0.243
CD8 T         0.042              Platelet      0.232
DC            0.186              RBC           0.730
gdT           0.209              DN            0.524
HSPC          0.548              MAIT          0.220
MAST          0.786              ILC           0.972

Table 5.3: The average proportion of zero values across patients for each cell type.
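As an illustration of M.4, the sketch below computes the nonzero-patient average Ā, takes the first principal component loadings over the cell-type dimension (here via the SVD of the column-centered matrix, one common PCA convention; the chapter's exact convention may differ), and projects each patient's matrix. The toy sizes are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_g, n_c = 6, 30, 5                 # toy sizes
A = rng.random((N, n_g, n_c))
A[A < 0.3] = 0.0                       # inject zeros, mimicking scRNA-seq sparsity

# M.4: average each entry over the patients in which it is nonzero.
nonzero = (A != 0).sum(axis=0)
Abar = A.sum(axis=0) / np.maximum(nonzero, 1)   # guard against all-zero entries

# First principal component loadings of Abar over the cell-type dimension.
_, _, Vt = np.linalg.svd(Abar - Abar.mean(axis=0), full_matrices=False)
w = Vt[0]                              # shape (n_c,)

X = A @ w                              # one length-n_g feature vector per patient
print(X.shape)                         # (6, 30)
```

M.2 is analogous with the roles of rows and columns exchanged, and M.3 replaces the shared loading vector w by a per-patient loading vector computed from Ã^{(j)} alone.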
We next evaluate the performance of these featurizations when used as input to different base classification methods for H-NP classification.

5.3.3 Results of H-NP classification

After obtaining the feature vectors and applying a suitable base classification method, we apply Algorithm 7 to control the under-diagnosis errors. Recall that Y = 1, 2, 3 represent the severe, moderate, and healthy categories, respectively, and the goal is to control R_{1⋆}(φ̂) and R_{2⋆}(φ̂). In this section, we evaluate the performance of the H-NP classifier applied to each combination of a featurization method in Section 5.3.2 and a base classification method (logistic regression, random forest, SVM (linear model)), which is used to train the scores (T_1 and T_2). In each class, we leave out 30% of the data as the test set and split the remaining 70% as follows for training the H-NP classifier: 35% and 35% of S_1 form S_1s and S_1t; 35%, 25%, and 10% of S_2 form S_2s, S_2t, and S_2e; 35% and 35% of S_3 form S_3s and S_3e. For each combination of featurization and base classification method, we randomly split the observations 50 times to produce the results in this section.

In Figure 5.13, the yellow halves of the violin plots show the distributions of different approximate errors from the classical classification methods; Table 5.4 records the averages of these errors. In all the cases, the average of the approximate R_{1⋆} error is greater than 20%, and in many cases greater than 40%. On the other hand, the approximate R_{2⋆} error under the classical paradigm is already relatively low, with averages around 10%. Under the H-NP paradigm, we set α_1 = α_2 = 0.2 and δ_1 = δ_2 = 0.2, i.e., we want to control each under-diagnosis error under 20% at a 20% tolerance level. With the prespecified α_1, α_2, δ_1, δ_2, for a given base classification method, Algorithm 7 outputs an H-NP classifier that controls the under-diagnosis errors while minimizing the weighted sum of the other empirical errors.
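The class-wise splitting above can be sketched as follows. The helper is a toy of our own (we write S_2e for the held-out evaluation part of class 2), and the last fraction in each list is the 30% test set removed first.

```python
import numpy as np

def split_class(n, fracs, rng):
    """Randomly partition indices 0..n-1 into consecutive chunks whose
    sizes are proportional to `fracs`; the remainder joins the last chunk."""
    idx = rng.permutation(n)
    cuts = np.cumsum([int(round(f * n)) for f in fracs[:-1]])
    return np.split(idx, cuts)

rng = np.random.default_rng(0)
# Class sizes from Table 5.1: 318 severe, 353 moderate, 193 healthy.
S1s, S1t, test1 = split_class(318, [0.35, 0.35, 0.30], rng)
S2s, S2t, S2e, test2 = split_class(353, [0.35, 0.25, 0.10, 0.30], rng)
S3s, S3e, test3 = split_class(193, [0.35, 0.35, 0.30], rng)
```

Each of the 50 repetitions in the experiments corresponds to redrawing such a random partition.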
The blue half violin plots in Figure 5.13 and Table 5.4 show the resulting approximate errors after H-NP adjustment. We observe that the common cell type combination featurization M.4 consistently leads to smaller errors under both the classical and H-NP classifiers, especially for the linear classification models (logistic regression and SVM).

Logistic Regression
Featurization  Paradigm   Error1  Error23  Error21  Error31  Error32  Overall
M.1            classical  0.313   0.110    0.241    0.078    0.241    0.330
M.1            H-NP       0.160   0.119    0.416    0.177    0.122    0.344
M.2            classical  0.466   0.153    0.280    0.267    0.370    0.491
M.2            H-NP       0.172   0.091    0.640    0.587    0.215    0.542
M.3            classical  0.248   0.115    0.226    0.060    0.178    0.284
M.3            H-NP       0.159   0.129    0.336    0.108    0.134    0.303
M.4            classical  0.241   0.108    0.216    0.050    0.157    0.267
M.4            H-NP       0.169   0.131    0.305    0.093    0.109    0.285

Random Forest
Featurization  Paradigm   Error1  Error23  Error21  Error31  Error32  Overall
M.1            classical  0.262   0.049    0.257    0.091    0.228    0.293
M.1            H-NP       0.177   0.121    0.356    0.096    0.072    0.297
M.2            classical  0.361   0.095    0.256    0.216    0.452    0.426
M.2            H-NP       0.158   0.122    0.491    0.402    0.247    0.455
M.3            classical  0.314   0.039    0.200    0.113    0.369    0.321
M.3            H-NP       0.178   0.116    0.386    0.148    0.126    0.332
M.4            classical  0.300   0.036    0.219    0.130    0.353    0.323
M.4            H-NP       0.162   0.120    0.407    0.175    0.115    0.340

SVM
Featurization  Paradigm   Error1  Error23  Error21  Error31  Error32  Overall
M.1            classical  0.275   0.091    0.253    0.081    0.219    0.309
M.1            H-NP       0.159   0.118    0.394    0.104    0.158    0.326
M.2            classical  0.437   0.164    0.280    0.281    0.365    0.487
M.2            H-NP       0.175   0.110    0.613    0.542    0.258    0.539
M.3            classical  0.227   0.082    0.229    0.041    0.157    0.255
M.3            H-NP       0.175   0.123    0.295    0.045    0.106    0.269
M.4            classical  0.229   0.077    0.222    0.037    0.160    0.251
M.4            H-NP       0.172   0.119    0.288    0.040    0.104    0.261

Table 5.4: The averages of approximate errors. “error1", “error23", “error21", “error31", “error32", and “overall" correspond to R_{1⋆}(φ̂), R_{2⋆}(φ̂), P_2(Ŷ = 1), P_3(Ŷ = 1), P_3(Ŷ = 2), and P(Ŷ ≠ Y), respectively.
In each plot of Figure 5.13, the two leftmost violin plots are the distributions of the two approximate under-diagnosis errors R_{1⋆} and R_{2⋆}. We mark the 80% quantiles of R_{1⋆} and R_{2⋆} by short black lines (since δ_1 = δ_2 = 0.2), and the desired control levels (α_1 = α_2 = 0.2) by red dashed lines. The four rightmost plots show the approximate errors for the overall risk and the three components in R_c(φ̂) as discussed in Eq. (5.21). For all the featurization and base classification methods, the under-diagnosis errors are controlled at the desired levels with a slight increase in the overall error, which is much smaller than the reduction in the under-diagnosis errors. This demonstrates the consistency of our method and indicates its general applicability to various base classification algorithms chosen by users.

Another interesting phenomenon is that when a classical classification method is conservative for the specified α_i and δ_i, our algorithm will increase the corresponding threshold t_i, which relaxes the decision boundary for the classes less prioritized than i. As a result, the relaxation will benefit some components in R_c(φ̂). In Figure 5.13d, in many cases the classifier produces an approximate error R_{2⋆} less than 0.2 under the classical paradigm, which means it is conservative for the control level α_2 = 0.2 at the tolerance level δ_2 = 0.2. In this case, the H-NP classifier adjusts the threshold t_2 to lower the requirement for class 3, thus notably decreasing the approximate error P_3(Ŷ = 2).

Finally, we show that using this integrated scRNA-seq data in a classification setting enables us to identify genomic features associated with disease severity in patients at the cell type level. First, by combining logistic regression with an appropriate featurization, we generate a ranked list of features (i.e., cell types or genes) that are important in predicting severity.
At the cell type level, we utilize logistic regression with the featurization M.2, which compresses the expression matrix for each patient into a cell-type-length vector, and rank the cell types based on their coefficients for the log odds ratio of the severe category relative to the healthy category. Table 5.5 shows that the top-ranked cell types are CD14+ monocytes, NK cells, CD8+ effector T cells, and neutrophils, all with significant p-values. This is consistent with the known involvement of these cell types in the immune response of severe patients [82, 81, 95].

cell type     p-value     cell type    p-value
CD14 Mono     1.38e-05    RBC          3.91e-01
NK            9.95e-04    CD4 T        4.12e-01
CD8 T         9.10e-03    MAIT         4.16e-01
Neutrophil    9.54e-03    DN           4.26e-01
B             9.49e-02    NKT          6.13e-01
gdT           1.16e-01    MAST         7.81e-01
HSPC          2.47e-01    Plasma       8.61e-01
CD16 Mono     3.52e-01    DC           9.34e-01
Platelet      3.73e-01

Table 5.5: Ranking cell types by the coefficients that quantify the effect of the predictors (cell type expression) on the log odds ratio of the severe category relative to the healthy category in logistic regression with the featurization M.2.

5.4 Discussion

In general disease severity classification, under-diagnosis errors are more consequential in the sense that they increase the risk of patients receiving insufficient medical care. By assuming the classes have a prioritized ordering, we propose an H-NP classification framework and its associated algorithm (Algorithm 7) capable of controlling under-diagnosis errors at desired levels with high probability. The algorithm performs post hoc adjustment on scoring-type classification methods and thus can be applied in conjunction with most methods preferred by users. The idea of choosing thresholds on the scoring functions based on a held-out set bears resemblance to conformal splitting methods [72, 133]. However, our approach differs in that we assign only one label to each observation while maintaining high-probability error controls.
Additionally, our approach prioritizes certain misclassification errors, unlike conformal prediction, which treats all classes equally.

Through simulations and the case study of COVID-19 severity classification, we demonstrate the efficacy of our algorithm in achieving the desired error controls. We have also compared different ways of constructing interpretable feature vectors from the multi-patient scRNA-seq data and shown that the common cell type PCA featurization overall achieves better performance under various classification settings. The use of scRNA-seq data has allowed us to gain biological insights into the disease mechanism and immune response of severe patients. Nevertheless, if the main objective is to build a classifier for triage diagnostics using other clinical variables, one can easily apply our method to other forms of patient-level COVID-19 data with other base classification methods.

Even though our case study has three classes, the framework and algorithm developed are general. Increasing the number of classes has no effect on the minimum size requirement of the left-out part of each class for threshold selection, since it suffices for each class i to satisfy n_i ≥ log δ_i / log(1 − α_i). We also note that the notion of prioritized classes can be defined in a context-specific way. For example, in some diseases like Alzheimer's disease, the transitional stage is considered to be the most important [142].

There are several interesting directions for future work. For small data problems where the minimum sample size requirement is not fulfilled, we might consider adopting a parametric model, under which we can not only develop a new algorithm without a minimum sample size requirement but also study the oracle-type properties of the classifiers.
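Reading the requirement as n_i ≥ log δ_i / log(1 − α_i), the minimum class size is easy to compute; at the levels used in the case study (α_i = δ_i = 0.2) it evaluates to 8. The numerical check below is ours.

```python
import math

def min_class_size(alpha: float, delta: float) -> int:
    # Smallest n with (1 - alpha)^n <= delta,
    # i.e. n >= log(delta) / log(1 - alpha).
    return math.ceil(math.log(delta) / math.log(1.0 - alpha))

print(min_class_size(0.2, 0.2))  # 8, since 0.8**8 <= 0.2 < 0.8**7
```

Tightening either α_i or δ_i increases this lower bound, but the number of classes never enters it.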
In terms of featurizing multi-patient scRNA-seq data, we have chosen PCA as the dimension reduction method in order to focus on other aspects of comparison; more dimension reduction methods can be explored in future work. It is also conceivable that the class labels in the case study are noisy, with possibly biased diagnoses. Accounting for label noise with a realistic noise model and extending the work of [146] to a multi-class NP classification setting will be another interesting direction to pursue.

5.5 Acknowledgements and author contributions

This study is a collaborative effort with Y. X. Rachel Wang, Jingyi Jessica Li, and Xin Tong. All authors contributed to conceiving the study, designing the methods, and writing the manuscript. L.W. executed the methods, wrote the code, and performed empirical, theoretical, and real-data analyses, while the other authors supervised the execution.
Figure 5.11: The proportions of zeros for different severity classes (Severe/Critical, Mild/Moderate, Healthy), with one panel per cell type.

Figure 5.12: The distribution of the proportion of zero values across patients for each cell type.

Figure 5.13: The distribution of approximate errors for each combination of featurization method ((a) M.1, (b) M.2, (c) M.3, (d) M.4) and base classification method (logistic regression, random forest, SVM).
“error1", “error23", “error21", “error31", “error32", “overall" correspond to R 1⋆ ( b ϕ ), R 2⋆ ( b ϕ ),P 2 ( ˆ Y =1),P 3 ( ˆ Y =1),P 3 ( ˆ Y =2) andP( ˆ Y ̸=Y), respectively. 192 Bibliography [1] US Environmental Protection Agency. “The National Ambient Air Quality Standards for Particle Pollution Revised Air Quality Standards For Particle Pollution And Updates To The Air Quality Index (AQI)”. In: (2012). [2] Sara Aibar, Celia Fontanillo, et al. “Analyse multiple disease subtypes and build associated gene networks using genome-wide expression profiles”. In: BMC genomics 16 (2015), pp. 1–10. [3] Norah Alballa and Isra Al-Turaiki. “Machine learning approaches in COVID-19 diagnosis, mortality, and severity risk prediction: A review”. In: IMU 24 (2021), p. 100564. [4] Reid Andersen, Fan Chung, and Kevin Lang. “Local graph partitioning using pagerank vectors”. In: 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06). IEEE. 2006, pp. 475–486. [5] Reid Andersen and Kevin J Lang. “Communities from seed sets”. In: Proceedings of the 15th international conference on World Wide Web. 2006, pp. 223–232. [6] Joshua D Angrist, Guido W Imbens, and Donald B Rubin. “Identification of causal effects using instrumental variables”. In: Journal of the American statistical Association 91.434 (1996), pp. 444–455. [7] Prabhu S Arunachalam, Florian Wimmers, et al. “Systems biological assessment of immunity to mild versus severe COVID-19 infection in humans”. In: Science 369.6508 (2020), pp. 1210–1220. [8] Eirini Arvaniti and Manfred Claassen. “Sensitive detection of rare disease-associated cell subsets via representation learning”. In: Nat. Commun. 8.1 (2017), pp. 1–10. [9] Adelchi Azzalini and A Dalla Valle. “The multivariate skew-normal distribution”. In: Biometrika 83.4 (1996), pp. 715–726. [10] Laurent Barras, Olivier Scaillet, and Russ Wermers. “False discoveries in mutual fund performance: Measuring luck in estimated alphas”. 
In: The Journal of Finance 65.1 (2010), pp. 179–216.
[11] Yoav Benjamini and Yosef Hochberg. “Controlling the false discovery rate: a practical and powerful approach to multiple testing”. In: Journal of the Royal Statistical Society: Series B (Methodological) 57.1 (1995), pp. 289–300.
[12] Yoav Benjamini and Yosef Hochberg. “Controlling the false discovery rate: a practical and powerful approach to multiple testing”. In: Journal of the Royal Statistical Society: Series B (Methodological) 57.1 (1995), pp. 289–300.
[13] Yoav Benjamini and Daniel Yekutieli. “The control of the false discovery rate in multiple testing under dependency”. In: Annals of Statistics (2001), pp. 1165–1188.
[14] Yoav Benjamini and Daniel Yekutieli. “The control of the false discovery rate in multiple testing under dependency”. In: Annals of Statistics (2001), pp. 1165–1188.
[15] Rebecca A Betensky and Yang Feng. “Accounting for incomplete testing in the estimation of epidemic parameters”. In: Int J Epidemiol 49.5 (2020), pp. 1419–1426.
[16] Peter J Bickel and Aiyou Chen. “A nonparametric view of network models and Newman–Girvan and other modularities”. In: Proceedings of the National Academy of Sciences 106.50 (2009), pp. 21068–21073.
[17] Peter J Bickel, Aiyou Chen, Yunpeng Zhao, Elizaveta Levina, and Ji Zhu. “Correction to the proof of consistency of community detection”. In: The Annals of Statistics (2015), pp. 462–466.
[18] Peter J Bickel and Elizaveta Levina. “Covariance regularization by thresholding”. In: (2008).
[19] Pierre Bost, Francesco De Sanctis, et al. “Deciphering the state of immune silence in fatal COVID-19 patients”. In: Nat. Commun. 12.1 (2021), p. 1428.
[20] Tom Britton. “Estimation in multitype epidemics”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60.4 (1998), pp. 663–679.
[21] Logan C Brooks, Evan L Ray, et al. “Comparing ensemble approaches for short-term probabilistic COVID-19 forecasts in the US”.
In: International Institute of Forecasters (2020).
[22] Stephen J Brown and William N Goetzmann. “Performance persistence”. In: The Journal of Finance 50.2 (1995), pp. 679–698.
[23] John P Buonaccorsi. “A note on confidence intervals for proportions in finite populations”. In: The American Statistician 41.3 (1987), pp. 215–218.
[24] Adam Cannon, James Howse, et al. “Learning with the Neyman-Pearson and min-max criteria”. In: Los Alamos National Laboratory, Tech. Rep. LA-UR (2002), pp. 02–2951.
[25] Mark M Carhart. “On persistence in mutual fund performance”. In: The Journal of Finance 52.1 (1997), pp. 57–82.
[26] George Casella and Roger L Berger. Statistical Inference. Cengage Learning, 2021.
[27] Fan Chen, Yini Zhang, and Karl Rohe. “Targeted sampling from massive block model graphs with personalized PageRank”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82.1 (2020), pp. 99–126.
[28] Fan Chung. “A local graph partitioning algorithm using heat kernel pagerank”. In: Internet Mathematics 6.3 (2009), pp. 315–330.
[29] Vasek Chvátal. “The tail of the hypergeometric distribution”. In: Discrete Mathematics 25.3 (1979), pp. 285–287.
[30] COMBAT, David J Ahern, et al. “A blood atlas of COVID-19 defines hallmarks of disease severity and specificity”. In: medRxiv (2021), pp. 2021–05.
[31] Alexis J Combes, Tristan Courau, et al. “Global absence and targeting of protective immune states in severe COVID-19”. In: Nature 591.7848 (2021), pp. 124–130.
[32] Mark M Davis, Cristina M Tato, and David Furman. “Systems immunology: just getting started”. In: Nat. Immunol. 18.7 (2017), pp. 725–732.
[33] Marcel Dettling and Peter Bühlmann. “Boosting for tumor classification with gene expression data”. In: Bioinformatics 19.9 (2003), pp. 1061–1069.
[34] Vanja Dukic, Hedibert F Lopes, and Nicholas G Polson. “Tracking epidemics with Google flu trends data and a state-space SEIR model”.
In: Journal of the American Statistical Association 107.500 (2012), pp. 1410–1426.
[35] Sue Duval and Richard Tweedie. “A nonparametric “trim and fill” method of accounting for publication bias in meta-analysis”. In: Journal of the American Statistical Association 95.449 (2000), pp. 89–98.
[36] Bradley Efron. “Correlation and large-scale simultaneous significance testing”. In: Journal of the American Statistical Association 102.477 (2007), pp. 93–103.
[37] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. “Least angle regression”. In: The Annals of Statistics 32.2 (2004), pp. 407–499.
[38] Charles Elkan. “The foundations of cost-sensitive learning”. In: International Joint Conference on Artificial Intelligence. Vol. 17. 1. Lawrence Erlbaum Associates Ltd, 2001, pp. 973–978.
[39] Eugene F Fama and Kenneth R French. “Common risk factors in the returns on stocks and bonds”. In: Journal of Financial Economics 33.1 (1993), pp. 3–56.
[40] Jianqing Fan, Xu Han, and Weijie Gu. “Estimating false discovery proportion under arbitrary covariance dependence”. In: Journal of the American Statistical Association 107.499 (2012), pp. 1019–1035.
[41] Yang Feng, Xin Tong, and Weining Xin. “Targeted Crisis Risk Control: A Neyman-Pearson Approach”. In: Available at SSRN 3945980 (2021).
[42] Wayne Ferson and Yong Chen. “How many good and bad funds are there, really?” In: Handbook of Financial Econometrics, Mathematics, Statistics, and Machine Learning. World Scientific, 2021, pp. 3753–3827.
[43] Edward A Ganio, Natalie Stanley, et al. “Preferential inhibition of adaptive immune system dynamics by glucocorticoids in patients after acute surgical trauma”. In: Nat. Commun. 11.1 (2020), pp. 1–12.
[44] Peter Ganong and Simon Jäger. “A permutation test for the regression kink design”. In: Journal of the American Statistical Association 113.522 (2018), pp. 494–504.
[45] Christopher Genovese and Larry Wasserman.
“Operating characteristics and extensions of the false discovery rate procedure”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64.3 (2002), pp. 499–517.
[46] Stefano Giglio, Yuan Liao, and Dacheng Xiu. “Thousands of alpha tests”. In: The Review of Financial Studies 34.7 (2021), pp. 3456–3496.
[47] Peter J Green. “Reversible jump Markov chain Monte Carlo computation and Bayesian model determination”. In: Biometrika 82.4 (1995), pp. 711–732.
[48] Anders Hald. A History of Probability and Statistics and Their Applications before 1750. John Wiley & Sons, 2005.
[49] Xiaoyuan Han, Mohammad S Ghaemi, et al. “Differential dynamics of the maternal immune system in healthy pregnancy and preeclampsia”. In: Front Immunol. (2019), p. 1305.
[50] Campbell R Harvey and Yan Liu. “Detecting repeatable performance”. In: The Review of Financial Studies 31.7 (2018), pp. 2499–2552.
[51] Campbell R Harvey and Yan Liu. “False (and missed) discoveries in financial economics”. In: The Journal of Finance 75.5 (2020), pp. 2503–2553.
[52] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. “Stochastic blockmodels: First steps”. In: Social Networks 5.2 (1983), pp. 109–137.
[53] Zicheng Hu, Benjamin S Glicksberg, and Atul J Butte. “Robust prediction of clinical outcomes using cytometry data”. In: Bioinformatics 35.7 (2019), pp. 1197–1203.
[54] Nick James, Max Menzies, and Peter Radchenko. “COVID-19 second wave mortality in Europe and the United States”. In: Chaos 31.3 (2021), p. 031105.
[55] Michael C Jensen. “The performance of mutual funds in the period 1945-1964”. In: The Journal of Finance 23.2 (1968), pp. 389–416.
[56] Pengsheng Ji and Jiashun Jin. “Coauthorship and citation networks for statisticians”. In: The Annals of Applied Statistics 10.4 (2016), pp. 1779–1812.
[57] Jiashun Jin. “Fast community detection by SCORE”. In: The Annals of Statistics 43.1 (2015), pp. 57–89.
[58] Jiashun Jin, Zheng Tracy Ke, and Shengming Luo.
“SCORE+ for Network Community Detection”. In: CoRR abs/1811.05927 (2018). arXiv: 1811.05927. url: http://arxiv.org/abs/1811.05927.
[59] Norman L Johnson, Adrienne W Kemp, and Samuel Kotz. Univariate Discrete Distributions. Vol. 444. John Wiley & Sons, 2005.
[60] Brian Karrer and Mark EJ Newman. “Stochastic blockmodels and community structure in networks”. In: Physical Review E 83.1 (2011), p. 016107.
[61] Julian Keilson and Hans Gerber. “Some results for discrete unimodality”. In: Journal of the American Statistical Association 66.334 (1971), pp. 386–389.
[62] Peter V Kharchenko. “The triumphs and limitations of computational methods for scRNA-seq”. In: Nature Methods (2021), pp. 1–10.
[63] Kyle Kloster and David F Gleich. “Heat kernel based community detection”. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2014, pp. 1386–1395.
[64] Isabel M Kloumann and Jon M Kleinberg. “Community membership identification from small seed sets”. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2014, pp. 1366–1375.
[65] Isabel M Kloumann, Johan Ugander, and Jon Kleinberg. “Block models and personalized PageRank”. In: Proceedings of the National Academy of Sciences 114.1 (2017), pp. 33–38.
[66] Hendrik S Konijn. “Statistical theory of sample survey design and analysis.” In: (1973).
[67] Robert Kosowski, Allan Timmermann, Russ Wermers, and Hal White. “Can mutual fund “stars” really pick stocks? New evidence from a bootstrap analysis”. In: The Journal of Finance 61.6 (2006), pp. 2551–2595.
[68] Peter Kramlinger, Tatyana Krivobokova, and Stefan Sperlich. “Marginal and Conditional Multiple Inference for Linear Mixed Model Predictors”. In: JASA 0.ja (2022), pp. 1–31. doi: 10.1080/01621459.2022.2044826. eprint: https://doi.org/10.1080/01621459.2022.2044826.
[69] Wei Lan and Lilun Du. “A factor-adjusted multiple testing procedure with application to mutual fund selection”.
In: Journal of Business & Economic Statistics 37.1 (2019), pp. 147–157.
[70] Thomas Landgrebe and R Duin. “On Neyman-Pearson optimisation for multiclass classifiers”. In: Proceedings 16th Annual Symposium of the Pattern Recognition Association of South Africa. PRASA, 2005, pp. 165–170.
[71] Jeong Seok Lee, Seongwan Park, et al. “Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19”. In: Sci Immunol 5.49 (2020), eabd1554.
[72] Jing Lei. “Classification with confidence”. In: Biometrika 101.4 (2014), pp. 755–769.
[73] Fan Li, Kari Lock Morgan, and Alan M Zaslavsky. “Balancing covariates via propensity score weighting”. In: Journal of the American Statistical Association 113.521 (2018), pp. 390–400.
[74] Tengfei Li, Kani Chen, Yang Feng, and Zhiliang Ying. “Binary switch portfolio”. In: Quantitative Finance 17.5 (2017), pp. 763–780.
[75] Wei Tse Li, Jiayan Ma, et al. “Using machine learning of clinical data to diagnose COVID-19: a systematic review and meta-analysis”. In: BMC Med Inform Decis Mak 20.1 (2020), pp. 1–13.
[76] Xuan Liang, Shuo Li, Shuyi Zhang, Hui Huang, and Song Xi Chen. “PM2.5 data reliability, consistency, and air quality assessment in five Chinese cities”. In: Journal of Geophysical Research: Atmospheres 121.17 (2016), pp. 10–220.
[77] Yingxin Lin, Yue Cao, et al. “Atlas-scale single-cell multi-sample multi-condition data integration using scMerge2”. In: bioRxiv (2022), pp. 2022–12.
[78] Yingxin Lin, Yue Cao, et al. “scClassify: sample size estimation and multiscale classification of cells using single and multiple reference”. In: Mol Syst Biol 16.6 (2020), e9389.
[79] Yingxin Lin, Lipin Loo, et al. “Scalable workflow for characterization of cell-cell communication in COVID-19 patients”. In: PLoS Comp Biol 18.10 (2022), e1010495.
[80] Can Liu, Andrew J Martins, et al. “Time-resolved systems immunology reveals a late juncture linked to fatal COVID-19”. In: Cell 184.7 (2021), pp.
1836–1857. [81] Jing Liu, Sumeng Li, et al. “Longitudinal characteristics of lymphocyte responses and cytokine profiles in the peripheral blood of SARS-CoV-2 infected patients”. In: EBioMedicine 55 (2020), p. 102763. [82] Carolina Lucas, Patrick Wong, et al. “Longitudinal analyses reveal immunological misfiring in severe COVID-19”. In: Nature 584.7821 (2020), pp. 463–469. [83] Aaron TL Lun, Davis J McCarthy, and John C Marioni. “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor”. In: F1000Research 5 (2016). [84] Yiming Ma, Kairong Xiao, and Yao Zeng. “Bank debt versus mutual fund equity in liquidity provision”. In: Jacobs Levy Equity Management Center for Quantitative Financial Research Paper (2022). [85] Dragos D Margineantu. “Class probability estimation and cost-sensitive classification decisions”. In: Machine Learning: ECML 2002: 13th European Conference on Machine Learning Helsinki, Finland, August 19–23, 2002 Proceedings 13. Springer. 2002, pp. 270–281. [86] Travis Martin, Brian Ball, Brian Karrer, and MEJ Newman. “Coauthorship and citation patterns in the Physical Review”. In: Physical Review E 88.1 (2013), p. 012814. 198 [87] Davis J McCarthy, Kieran R Campbell, et al. “Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R”. In: Bioinformatics 33.8 (2017), pp. 1179–1186. [88] Daniel J McDonald, Jacob Bien, et al. “Can auxiliary indicators improve COVID-19 forecasting and hotspot prediction?” In: PNAS 118.51 (2021). [89] Yassine Meraihi, Asma Benmessaoud Gabis, et al. “Machine learning-based research for covid-19 detection, diagnosis, and prediction: A survey”. In: SN computer science 3.4 (2022), p. 286. [90] Klemens B Meyer and Stephen G Pauker. Screening for HIV: can we afford the false positive rate? 1987. [91] Mark EJ Newman. “The structure of scientific collaboration networks”. In: Proceedings of the national academy of sciences 98.2 (2001), pp. 404–409. 
[92] Anthony Ortiz, Anusua Trivedi, et al. “Effective deep learning approaches for predicting COVID-19 outcomes from chest computed tomography volumes”. In:SciRep 12.1 (2022), pp. 1–10. [93] Katherine A Overmyer, Evgenia Shishkova, et al. “Large-scale multi-omic analysis of COVID-19 severity”. In: Cell Syst. 12.1 (2021), pp. 23–40. [94] Corbin Quick, Rounak Dey, and Xihong Lin. “Regression Models for Understanding COVID-19 Epidemic Dynamics With Incomplete Data”. In: JASA 116.536 (2021), pp. 1561–1577. [95] Anuradha Rajamanickam, Nathella Pavan Kumar, et al. “Dynamic alterations in monocyte numbers, subset frequencies and activation markers in acute and convalescent COVID-19 individuals”. In: Sci. Rep. 11.1 (2021), p. 20254. [96] Anjali Ramaswamy, Nina N Brodsky, et al. “Immune dysregulation and autoreactivity correlate with disease severity in SARS-CoV-2-associated multisystem inflammatory syndrome in children”. In: Immunity 54.5 (2021), pp. 1083–1095. [97] Vikas Raykar and Linda Zhao. “Nonparametric prior for adaptive sparsity”. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics . JMLR Workshop and Conference Proceedings. 2010, pp. 629–636. [98] Vikas C Raykar and Linda H Zhao. “Empirical Bayesian thresholding for sparse signals using mixture loss functions”. In: Statistica Sinica (2011), pp. 449–474. [99] Xianwen Ren, Wen Wen, et al. “COVID-19 immune features revealed by a large-scale single-cell transcriptome atlas”. In: Cell 184.7 (2021), pp. 1895–1913. [100] John A Rice. Mathematical statistics and data analysis. Cengage Learning, 2006. [101] Philippe Rigollet and Xin Tong. “Neyman-pearson classification, convexity and stochastic constraints”. In: JMLR (2011). 199 [102] Sanat K Sarkar. “False discovery and false nondiscovery rates in single-step multiple testing procedures”. In: (2006). [103] Sanat K Sarkar. “Some results on false discovery rate in stepwise multiple testing procedures”. 
In: The Annals of Statistics 30.1 (2002), pp. 239–257. [104] Sanat K Sarkar, Tianhui Zhou, and Debashis Ghosh. “A general decision theoretic formulation of procedures controlling FDR and FNR from a Bayesian perspective”. In: Statistica Sinica (2008), pp. 925–945. [105] Jonas Schulte-Schrepping, Nico Reusch, et al. “Severe COVID-19 is marked by a dysregulated myeloid cell compartment”. In: Cell 182.6 (2020), pp. 1419–1440. [106] Alex R Schuurman, Tom DY Reijnders, et al. “Integrated single-cell analysis unveils diverging immune features of COVID-19, influenza, and other community-acquired pneumonia”. In: Elife 10 (2021), e69661. [107] Clayton Scott and Robert Nowak. “A Neyman-Pearson approach to statistical learning”. In: IEEE Trans. Inf. Theory 51.11 (2005), pp. 3806–3819. [108] Feng Shi, Jacob G Foster, and James A Evans. “Weaving the fabric of science: Dynamic network models of science’s unfolding structure”. In: Social Networks 43 (2015), pp. 73–85. [109] Aymeric Silvin, Nicolas Chapuis, et al. “Elevated calprotectin and abnormal myeloid cell subsets discriminate severe from mild COVID-19”. In: Cell 182.6 (2020), pp. 1401–1418. [110] Sarthak Sinha, Nicole L Rosin, et al. “Dexamethasone modulates immature neutrophils and interferon programming in severe COVID-19”. In: Nat. Med. 28.1 (2022), pp. 201–211. [111] Matthew Skala. “Hypergeometric tail inequalities: ending the insanity”. In: arXiv preprint arXiv:1311.5939 (2013). [112] David J Spiegelhalter, Nicola G Best, Bradley P Carlin, and Angelika Van Der Linde. “Bayesian measures of model complexity and fit”. In: Journal of the royal statistical society: Series b (statistical methodology) 64.4 (2002), pp. 583–639. [113] Natalie Stanley, Ina A Stelzer, et al. “VoPo leverages cellular heterogeneity for predictive modeling of single-cell data”. In: Nat. Commun. 11.1 (2020), pp. 1–9. [114] Emily Stephenson, Gary Reynolds, et al. “Single-cell multi-omics analysis of the immune response in COVID-19”. In: Nat. Med. 
27.5 (2021), pp. 904–916. [115] Stephen M Stigler. “Citation patterns in the journals of statistics and probability”. In: Statistical Science (1994), pp. 94–108. [116] John D Storey. “A direct approach to false discovery rates”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64.3 (2002), pp. 479–498. 200 [117] John D Storey. “A direct approach to false discovery rates”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64.3 (2002), pp. 479–498. [118] Yapeng Su, Daniel Chen, et al. “Multi-omics resolves a sharp disease-state shift between mild and moderate COVID-19”. In: Cell 183.6 (2020), pp. 1479–1495. [119] Catherine A Sugar and Gareth M James. “Finding the number of clusters in a dataset: An information-theoretic approach”. In: Journal of the American Statistical Association 98.463 (2003), pp. 750–763. [120] Liping Sun, Fengxiang Song, et al. “Combination of four clinical indicators predicts the severe/critical symptom of patients infected COVID-19”. In: J. Clin. Virol 128 (2020), p. 104431. [121] Francesca Tang, Yang Feng, et al. “The interplay of demographic variables and social distancing scores in deep prediction of US COVID-19 cases”. In: JASA 116.534 (2021), pp. 492–506. [122] Weihua Tang and Cun-Hui Zhang. “Empirical Bayes methods for controlling the false discovery rate with dependent data”. In: Lecture Notes-Monograph Series (2007), pp. 151–160. [123] Elizabeth A Thompson, Katherine Cascino, et al. “Metabolic programs define dysfunctional immune responses in severe COVID-19 patients”. In: Cell reports 34.11 (2021), p. 108863. [124] Ye Tian and Yang Feng. “Neyman-Pearson Multi-class Classification via Cost-sensitive Learning”. In: arXiv preprint arXiv:2111.04597 (2021). [125] Robert Tibshirani. “Regression shrinkage and selection via the lasso”. In: Journal of the Royal Statistical Society: Series B (Methodological) 58.1 (1996), pp. 267–288. 
[126] Robert Tibshirani, Guenther Walther, and Trevor Hastie. “Estimating the number of clusters in a data set via the gap statistic”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63.2 (2001), pp. 411–423. [127] Xin Tong, Yang Feng, and Jingyi Jessica Li. “Neyman-Pearson classification algorithms and NP receiver operating characteristics”. In: Sci. Adv. 4.2 (2018), eaao1659. [128] Avraham Unterman, Tomokazu S Sumida, et al. “Single-cell multi-omics reveals dyssynchrony of the innate and adaptive immune system in progressive COVID-19”. In: Nat. Commun. 13.1 (2022), p. 440. [129] Twan Van Laarhoven and Elena Marchiori. “Local network community detection with continuous optimization of conductance and weighted kernel k-means”. In: The Journal of Machine Learning Research 17.1 (2016), pp. 5148–5175. [130] Attila Varga. “Shorter distances between papers over time are due to more cross-field references and increased citation rate to higher-impact papers”. In: Proceedings of the National Academy of Sciences 116.44 (2019), pp. 22094–22099. 201 [131] Cristiano Varin, Manuela Cattelan, and David Firth. “Statistical modelling of citation exchange between statistics journals”. In: Journal of the Royal Statistical Society: Series A (Statistics in Society) 179.1 (2016), pp. 1–63. [132] Weizhen Wang. “Exact optimal confidence intervals for hypergeometric parameters”. In: Journal of the American Statistical Association 110.512 (2015), pp. 1491–1499. [133] Wenbo Wang and Xingye Qiao. “Set-Valued Support Vector Machine with Bounded Error Rates”. In: JASA (2022), pp. 1–13. [134] Russ Wermers. “Mutual fund herding and the impact on stock prices”. In: the Journal of Finance 54.2 (1999), pp. 581–622. [135] Joyce Jiyoung Whang, David F Gleich, and Inderjit S Dhillon. “Overlapping community detection using seed set expansion”. In: Proceedings of the 22nd ACM international conference on information & knowledge management. 2013, pp. 2099–2108. 
[136] Aaron J Wilk, Arjun Rustagi, et al. “A single-cell atlas of the peripheral immune response in patients with severe COVID-19”. In: Nat. Med. 26.7 (2020), pp. 1070–1076. [137] World Health Organization. COVID-19 Dashboard. Accessed: April 23, 2023. 2023.url: https://covid19.who.int/. [138] World Health Organization. “WHO R&D blueprint novel coronavirus COVID-19 therapeutic trial synopsis”. In: World Health Organization (2020), pp. 1–9. [139] Jiangpeng Wu, Pengyi Zhang, et al. “Rapid and accurate identification of COVID-19 infection through machine learning based on clinical available blood test results”. In: MedRxiv (2020). [140] Xiao-Ming Wu, Zhenguo Li, Anthony Man-Cho So, John Wright, and Shih-Fu Chang. “Learning with Partially Absorbing Random Walks.” In: NIPS. Vol. 25. 2012, pp. 3077–3085. [141] Lucy Xia, Richard Zhao, et al. “Intentional control of type i error over unconscious data distortion: A neyman–pearson approach to text classification”. In: JASA 116.533 (2021), pp. 68–81. [142] Chengjie Xiong, Gerald van Belle, et al. “Measuring and estimating diagnostic accuracy when there are three ordinal diagnostic groups”. In: Statistics in Medicine 25.7 (2006), pp. 1251–1273. [143] Li Yan, Hai-Tao Zhang, et al. “Prediction of criticality in patients with severe Covid-19 infection using three clinical features: a machine learning-based prognostic model with clinical data in Wuhan”. In: MedRxiv 27 (2020), p. 2020. [144] Jaewon Yang and Jure Leskovec. “Defining and evaluating network communities based on ground-truth”. In: Knowledge and Information Systems 42.1 (2015), pp. 181–213. [145] Changfu Yao, Stephanie A Bora, et al. “Cell-type-specific immune dysregulation in severely ill COVID-19 patients”. In: Cell reports 34.1 (2021), p. 108590. 202 [146] Shunan Yao, Bradley Rava, et al. “Asymmetric Error Control Under Imperfect Supervision: A Label-Noise-Adjusted Neyman–Pearson Umbrella Algorithm”. In: JASA (2022), pp. 1–13. [147] Ming Yuan and Yi Lin. 
“Model selection and estimation in regression with grouped variables”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68.1 (2006), pp. 49–67. [148] Jiawei Zhang, Jie Ding, and Yuhong Yang. “Is a Classification Procedure Good Enough?—A Goodness-of-Fit Assessment Tool for Classification Learning”. In: JASA (2021), pp. 1–11. [149] Xiang-Na Zhao, Yue You, et al. “Single-cell immune profiling reveals distinct immune response in asymptomatic COVID-19 patients”. In: Signal Transduct Target Ther 6.1 (2021), p. 342. [150] Yunpeng Zhao, Elizaveta Levina, and Ji Zhu. “Consistency of community detection in networks under degree-corrected stochastic block models”. In: The Annals of Statistics 40.4 (2012), pp. 2266–2292. [151] Zirun Zhao, Anne Chen, et al. “Prediction model and risk scores of ICU admission and mortality in COVID-19”. In: PloS one 15.7 (2020), e0236618. [152] Linnan Zhu, Penghui Yang, et al. “Single-cell sequencing of peripheral mononuclear cells reveals distinct immune response landscapes of COVID-19 and influenza patients”. In: Immunity 53.3 (2020), pp. 685–696. [153] Hui Zou and Trevor Hastie. “Regularization and variable selection via the elastic net”. In: Journal of the royal statistical society: series B (statistical methodology) 67.2 (2005), pp. 301–320. 203
Asset Metadata
Creator: Wang, Lijia (author)
Core Title: Statistical citation network analysis and asymmetric error controls
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Applied Mathematics
Degree Conferral Date: 2023-05
Publication Date: 05/08/2023
Defense Date: 05/02/2023
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tags: asymmetric error control, network analysis, OAI-PMH Harvest, statistical testing
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Goldstein, Larry (committee chair), Tong, Xin (committee member), Wang, Chunming (committee member)
Creator Email: lijiawan@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113102944
Unique Identifier: UC113102944
Identifier: etd-WangLijia-11796.pdf (filename)
Legacy Identifier: etd-WangLijia-11796
Document Type: Dissertation
Rights: Wang, Lijia
Internet Media Type: application/pdf
Type: texts
Source: 20230508-usctheses-batch-1039 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu