USC Digital Library / University of Southern California Dissertations and Theses
Adapting Statistical Learning for High Risk Scenarios

by

Bradley Rava

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(BUSINESS ADMINISTRATION)

August 2022

Copyright 2022 Bradley Rava

Acknowledgements

Education has truly changed the trajectory of my life. I am indebted to the many people who helped me along the way, to those who saw potential in my academic career, and to the many who took a risk by mentoring me. When I first entered Santa Monica College, I truly believed that I needed to work twice as hard as everyone else in order to have a chance of obtaining a college degree. Earning a Ph.D. reflects a drastic shift in mentality and in the type of person I have become over the seven years I have spent at USC, both as an undergraduate and a graduate student.

I am very indebted to my advisor Gareth M. James, who was always available to discuss research and put an incredible amount of time into my career. He always made time for my projects, even as he became dean of the business school and had many important people to meet with! While he would tell you that he is just doing his job, his mentorship has changed my life for the better.

I am also very thankful for my advisor Xin Tong, who first inspired me to pursue fairness in machine learning as a research topic. Xin has always been extremely supportive of my career and constantly stressed how important it was for our research to have meaningful real-world impact. From Xin I learned that while publications are important, it is more important to do work that helps someone.

I am also grateful to have been able to work with Wenguang (Wen) Sun. Even with his years of research experience, Wen still has the excitement of a first-year graduate student, and it is contagious.
He was a joy to work with, even if we spent too much time talking about future projects instead of the one we were trying to submit.

From the operations management side, Vishal Gupta always served as a great devil's advocate and helped me think about my research through multiple lenses, not just the one from the statistics community.

More recently, I am also thankful to have worked with Matteo Sesia this past year. I have already learned a lot from him, and his passion and dedication to his work are inspirational.

From my undergraduate years, I would also like to thank my undergraduate mentor David Crombecque. He took a chance on me early on and spent the summer of 2014 doing research in topology with me. While I did not end up becoming a topologist, this was my first experience doing research and it served as a base for my current career.

On a personal note, I would like to thank my family for their support throughout my degree. They have gone through a lot in the past five years and always gave me the time I needed to conduct my research. My mother immigrated to this country to make a better life for her children, and I know watching me get hooded this spring semester was very meaningful for her.

Lastly, my father Richard Rava was a huge supporter of my education and made numerous personal sacrifices to help me succeed. His death in the first year of my Ph.D. was traumatic and caused me to rethink what I wanted to accomplish with my life. I am thankful for the love and support that my father gave me and the numerous lessons that have become more and more important as time goes on. This dissertation is dedicated to my father and to my family.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Selection Bias
    1.1.1 Intuition
    1.1.2 Literature Review
  1.2 Label Noise and Asymmetric Error
    1.2.1 Literature Review
    1.2.2 Asymmetric error from two perspectives
  1.3 Fairness in Classification
    1.3.1 Literature Review
  1.4 Multiple Hypothesis Testing
    1.4.1 Literature Review
    1.4.2 Our Contribution
  1.5 Remaining Outline

Chapter 2: Irrational Exuberance: Correcting Bias in Probability Estimates
  2.1 Introduction
  2.2 Methodology
    2.2.1 Modeling p̃_i and Selecting a Loss Function
    2.2.2 The Oracle Estimator
    2.2.3 Estimation
      2.2.3.1 Estimation of g
      2.2.3.2 Estimation of λ and γ*
    2.2.4 Large sample results
  2.3 ECAP Extensions
    2.3.1 Incorporating Bias in p̃_i
    2.3.2 Mixture Distribution
  2.4 Simulation Results
    2.4.1 Unbiased Simulation Results
    2.4.2 Biased Simulation
  2.5 Empirical Results
    2.5.1 ESPN NCAA Football Data
    2.5.2 Election Data
  2.6 Discussion

Chapter 3: Asymmetric error control under imperfect supervision: a label-noise-adjusted Neyman-Pearson umbrella algorithm
  3.1 Introduction
  3.2 Notation and Corruption Model
  3.3 Methodology
    3.3.1 The NP umbrella algorithm without label noise
    3.3.2 Algorithm 1: label-noise-adjusted NP umbrella algorithm with known corruption levels
    3.3.3 Algorithm 1#: label-noise-adjusted NP umbrella algorithm with unknown corruption levels
  3.4 Theory
    3.4.1 Rationale behind Algorithm 1
    3.4.2 Theoretical properties of Algorithm 1
    3.4.3 Theoretical properties of Algorithm 1#
  3.5 Numerical Analysis
    3.5.1 Simulation
      3.5.1.1 Algorithm 1
      3.5.1.2 Algorithm 1#
      3.5.1.3 Benchmark Algorithms
    3.5.2 Real Data Analysis
  3.6 Discussion

Chapter 4: A Burden Shared is a Burden Halved: A Fairness-Adjusted Approach to Classification
  4.1 Introduction
  4.2 Problem Formulation
    4.2.1 A selective inference framework for binary classification
    4.2.2 False selection rate and the fairness issue
    4.2.3 From fair scores to fair classifiers: issues and roadmap
  4.3 Methodology
    4.3.1 The R-value and FASI algorithm
    4.3.2 Why FASI works
  4.4 Theoretical Issues
    4.4.1 Theory on FSR control: main ideas and contributions
    4.4.2 Theoretical R-value
    4.4.3 A sketch of the optimality theory
  4.5 Simulation Results
  4.6 Real Data Examples
    4.6.1 COMPAS Data Analysis
    4.6.2 1994 Census Income Data Analysis
  4.7 Discussions
    4.7.1 Connection to the sufficiency principle
    4.7.2 Other fairness notions
    4.7.3 FSR concepts in multinomial classification

Chapter 5: Ongoing work: Correcting Bias in p-value Estimation
  5.1 Problem Formulation
  5.2 Methodology
    5.2.1 Posterior Probabilities
    5.2.2 Empirical Bayes Adjustment
    5.2.3 Data Driven Estimation
    5.2.4 Estimating h^(-1)
    5.2.5 Estimating γ, λ and θ
  5.3 Ongoing questions

Chapter 6: Conclusion

Chapter 7: Bibliography

Appendix A: Appendix for Irrational Exuberance: Correcting Bias in Probability Estimates
  A.1 Proof of Theorem 1
  A.2 Proof of Theorem 2 and Corollary 1
  A.3 Proof of Theorem 3
  A.4 Proof of Theorem 4
  A.5 Proof of Theorem 5
  A.6 Proof of Theorem 6
  A.7 Proof of Theorem 7
  A.8 Proof of Theorem 8
  A.9 Supplementary Results

Appendix B: Appendix for Asymmetric error control under imperfect supervision: a label-noise-adjusted Neyman-Pearson umbrella algorithm
  B.1 Summary of sampling scheme
  B.2 BINARY SEARCH Algorithm
  B.3 An example for Assumption 3
  B.4 Additional Numerical Results
    B.4.1 Additional Simulations
    B.4.2 CIFAR10 data analysis
    B.4.3 Violin plots for Section 3.5
    B.4.4 Tables for Section 3.5
    B.4.5 Alternative implementation with a positive ε
  B.5 Extra Lemmas
  B.6 Proofs

Appendix C: Appendix for A Burden Shared is a Burden Halved: A Fairness-Adjusted Approach to Classification
  C.1 Supplementary Technical Results
    C.1.1 The conversion algorithm and connections to Storey's q-value
    C.1.2 FSR and mFSR
    C.1.3 R-value is the (BH) q-value of conformal p-values
  C.2 Proof of Theorem 1
    C.2.1 Proof of Part (a)
      C.2.1.1 An empirical process description of the FASI algorithm
      C.2.1.2 Martingale arguments
      C.2.1.3 FSR Control
    C.2.2 Proof of Part (b)
      C.2.2.1 The empirical process description
      C.2.2.2 Martingale arguments
      C.2.2.3 FSR Control
  C.3 Proof of Theorem 11
    C.3.1 A monotonicity property
    C.3.2 Proof of the theorem
  C.4 Additional Numerical Results
    C.4.1 Comparing the R- and R+-values
    C.4.2 Numerical investigations of the factor γ_{c,a}

List of Tables

2.1 Average expected loss for different methods over multiple unbiased simulation scenarios. Standard errors are provided in parentheses.
2.2 Average expected loss for different methods over multiple biased simulation scenarios.
2.3 Bias-corrected ECAP adjustment of FiveThirtyEight's 2018 election probabilities. Reported average ÊC_δ.
3.1 (Approximate) type I error violation rates over 1,000 repetitions for Simulation 1. Standard errors (×10⁻³) in parentheses.
3.2 Averages of (approximate) true type II errors over 1,000 repetitions for Simulation 1. Standard errors (×10⁻³) in parentheses.
3.3 (Approximate) type I error violation rates, and averages of (approximate) true type II errors over 1,000 repetitions for Simulation 2 (m₀ = 0.95, m₁ = 0.05, α = 0.1, and δ = 0.1). Standard errors (×10⁻³) in parentheses.
3.4 (Approximate) type I error violation rates, and averages of (approximate) true type II errors over 1,000 repetitions for Simulation 3 (m₀ = 0.95, m₁ = 0.05, α = 0.1, and δ = 0.1). Standard errors (×10⁻³) in parentheses.
3.5 (Approximate) type I error violation rates over 1,000 repetitions for Simulation 4. Standard errors (×10⁻³) in parentheses.
3.6 (Approximate) type II error violation rates over 1,000 repetitions for Simulation 4. Standard errors (×10⁻³) in parentheses.
3.7 (Approximate) type I error violation rates over 1,000 repetitions for Simulation 5 (m₀ = 0.95, m₁ = 0.05, α = 0.1, and δ = 0.1). Standard errors (×10⁻³) in parentheses.
3.8 (Approximate) type I error violation rates, and averages of (approximate) true type II errors by Algorithm 1 and the original NP umbrella algorithm over 1,000 repetitions for the email spam data. Standard errors (×10⁻³) in parentheses.
3.9 (Approximate) type I error violation rates by Algorithm 1# over 1,000 repetitions for the email spam data. Standard errors (×10⁻³) in parentheses.
3.10 Averages of (approximate) true type II errors by Algorithm 1# over 1,000 repetitions for the email spam data. Standard errors (×10⁻³) in parentheses.
B.1 (Approximate) type I error violation rates over 1,000 repetitions for Simulation 6. Standard errors (×10⁻³) in parentheses.
B.2 Averages of (approximate) true type II errors over 1,000 repetitions for Simulation 6. Standard errors (×10⁻³) in parentheses.
B.3 (Approximate) type I error violation rates over 1,000 repetitions for Simulation 6. Standard errors (×10⁻³) in parentheses.
B.4 (Approximate) type I error violation rates over 1,000 repetitions for Simulation 8. Standard errors (×10⁻²) in parentheses.
B.5 (Approximate) type I error violation rate, and average of (approximate) true type II errors by Algorithm 1 over 1,000 repetitions for the modified CIFAR10 dataset. Standard errors (×10⁻³) in parentheses.
B.6 Averages of (approximate) true type II errors over 1,000 repetitions for Simulation 5 (m₀ = 0.95, m₁ = 0.05, α = 0.1, and δ = 0.1). Standard errors (×10⁻³) in parentheses.
B.7 (Approximate) type I error violation rates, and averages of (approximate) true type II errors by benchmark algorithms over 1,000 repetitions for the email spam data. Standard errors (×10⁻³) in parentheses.

List of Figures

1.1 Intuition behind selection bias.
2.1 Left: Simulated p_i and associated p̃_i; the probability estimates are unbiased. Center: The average value of p_i as a function of p̃_i, i.e. E(p_i | p̃_i) (orange line), is systematically higher than p̃_i (dashed line). Right: The ratio of E(p_i | p̃_i) to p̃_i, as a function of p̃_i; an ideal ratio would be one (dashed line).
2.2 Density functions for p̃_i given p_i = 0.002 (blue/solid), p_i = 0.01 (orange/dot-dashed), and p_i = 0.03 (green/dashed). In all three cases γ* = 0.001.
2.3 Average excess certainty as a function of p̃_i for three different values of γ* (orange/dashed line). All plots exhibit excess certainty far above zero, but the issue grows worse as γ* gets larger, corresponding to more variance in p̃_i. The green (solid) line in each plot corresponds to the average excess certainty for the oracle estimator p_{i0}.
2.4 Plots of E(p̃_i | p_i) as a function of p_i for different values of θ. When θ = 0 (black/solid) the estimates are unbiased. θ = 2 (orange/dashed) corresponds to a setting where p̃_i systematically underestimates large values of p_i, while θ = −3 (green/dot-dashed) represents a situation where p̃_i is an overestimate for large values of p_i.
2.5 Distributions of p used in the simulation.
2.6 A screenshot of the NCAA football win probabilities publicly available on ESPN's website. USC vs. Texas (2017).
2.7 Empirical EC in both the unadjusted and ECAP setting with δ = [0, 0.02].
2.8 ECAP bias-corrected probabilities vs. original FiveThirtyEight probabilities from the classic method.
3.1 Density plots in Example 1. True (lighter and solid) and corrupted (darker and dashed).
3.2 The original NP umbrella algorithm vs. a label-noise-adjusted version for Example 2. The plots in the left panel (blue) are the true type I and II errors for the original NP umbrella algorithm. The plots in the right panel (orange) are the true type I and II errors for the label-noise-adjusted NP umbrella algorithm with known corruption levels. The black dot and vertical bar in every violin represent the mean and standard deviation, respectively. In the top row, the horizontal black line is α = 0.05 and the boundaries between lighter and darker color in each violin plot mark the 1 − δ = 95% quantiles.
3.3 The blue solid curve is the density of true class 0 (i.e., N(0,1)) and the orange dashed curve is the density of corrupted class 0 (i.e., a mixture of N(0,1) and N(2,1) with m₀ = 0.85). The black vertical line marks the threshold of the classifier 1I{X > 2.52}, whose corrupted type I error is 0.05.
3.4 A cartoon illustration of the 1 − δ probability upper bound of type I error.
4.1 The selection of recidivists from a pool of criminal defendants (Broward County, Florida). The target FSR is 25%. Left: the unadjusted approach. Right: the proposed FASI approach.
4.2 All plots have a fixed proportion of signals (50%) from the male group and varying proportions of signal (from 15% to 85%) for the female group. For FCC and RCC, the degree of unfairness increases as p_M and p_F become more disparate. FASI ensures that the group-wise FSRs are effectively controlled and equalized.
4.3 Simulation 1. Top row: the oracle procedure. Bottom row: a data-driven GAM fitting procedure. Left and middle columns: FSR_{1,a} and FSR_{2,a} for both females and males. Right column: the expected proportion of indecisions (EPI). All plots vary the proportion of true class 2 observations from the female protected group, π_{2,F}.
4.4 Simulation 2. Comparable setup to Simulation 1 except that the female and male distributions now differ from each other. However, the results are similar, with FASI correctly controlling the FSRs.
4.5 COMPAS data analysis for predicting recidivists. Left and middle: false selection rate minus the desired control level for varying levels of α for the FCC and FASI methods, respectively. Right: the EPI for both the FCC and FASI methods.
4.6 Census income prediction for individuals who earn more than 50K a year. Left and middle: false selection rate minus the desired control level for varying levels of α for the FCC and FASI methods, respectively. Right: the EPI for both the FCC and FASI methods.
5.1 Estimate of the false discovery rate using the unadjusted approach (orange/solid), SAP Unbiased (green/dot), and SAP Biased (blue/dashed) corrected p-values. The x-axis varies the level of corruption in the unadjusted p-values and the y-axis measures the FDR. The goal is to control FDR at the 10% level (black/dashed).
B.1 Violin plots for (approximate) true type I errors of Simulation 1.
B.2 Violin plots for (approximate) true type II errors of Simulation 1.
B.3 Violin plots for (approximate) true type I errors of Simulation 2.
B.4 Violin plots for (approximate) true type II errors of Simulation 2.
B.5 Violin plots for (approximate) true type I errors of Simulation 3.
B.6 Violin plots for (approximate) true type II errors of Simulation 3.
B.7 Violin plots for (approximate) true type I errors of Simulation 1.
B.8 Violin plots for (approximate) true type II errors of Simulation 1.
B.9 Violin plots for (approximate) true type I errors of Simulation 2.
B.10 Violin plots for (approximate) true type II errors of Simulation 2.
B.11 Violin plots for (approximate) true type I errors of Simulation 3.
B.12 Violin plots for (approximate) true type II errors of Simulation 3.
C.1 The comparison between the R-value and R+-value for varying sizes of the test data set. The left column shows histograms of the R-value (orange) and the right column shows histograms of the R+-value (green). The R-values and R+-values are computed for a fixed base score of s(x, a) = 0.9 based on 1,000 randomly generated data sets.
C.2 Estimates of γ_{1,a} from the simulations in Section 4.5. The solid (green) line represents the estimate of γ_{1,F} for the female protected group, and similarly the orange (long-dashed) line for the male protected group.

Abstract

The advancement of machine learning in the past decade has had profound impacts on society. Algorithms that were previously expensive and hard to implement have become accessible to the general public. As a consequence, automated decision systems have started being used in sensitive areas such as medical diagnosis, bail sentencing, and financial services.
These application areas require a more nuanced view of uncertainty estimation, since the consequences of making a mistake can directly impact people's happiness and quality of life. When artificial intelligence is used to make decisions for many people, it is important that we rigorously control severe consequences and that we properly communicate the uncertainty associated with these mistakes to practitioners.

To address uncertainty estimation in high-risk scenarios, this dissertation aims to help practitioners gain better insight into the predictions their machine learning models are giving them. The first part of this dissertation focuses on predicting extreme events and shows that when practitioners estimate the probabilities of many events occurring, even when those estimates are unbiased, the most extreme probabilities can be shown to be biased estimates of the true probability of the corresponding event occurring. This is problematic when one considers that extreme probabilities tend to be the ones practitioners are most interested in (e.g., the probability of death if one were to undergo a medical procedure). To address this, we propose an empirical Bayes method to correct selection bias when estimating the occurrence of extreme events.

The second part moves past estimating probabilities and focuses directly on classification. When classification is used in sensitive areas, one type of error may have far worse consequences than the other (such as missing sick individuals in medical diagnosis). However, popular techniques that try to control the more severe error below a pre-specified threshold are difficult to use in practice because they assume knowledge of perfect training labels. Our work expands the literature by allowing practitioners to have label noise in their training data, and provides a label-noise-adjusted algorithm that can provably control the more consequential error.
The third part of this dissertation develops a flexible framework for classification that can be used when algorithms are applied across protected groups. This framework requires minimal assumptions and works regardless of the accuracy and type of machine learning algorithm used. In this way, practitioners can be sure that our definition of fairness will be upheld without any restrictive assumptions. Lastly, we finish this dissertation by discussing a work in progress that addresses selection bias in p-value estimation for multiple hypothesis testing.

This dissertation tackles the fundamental problem of uncertainty estimation when the results or predictions of algorithms carry a high societal cost. To address this, Chapters 2-5 cover selection bias in extreme event prediction, label noise in asymmetric classification, fairness in machine learning, and finally selection bias in multiple hypothesis testing. Chapters 2 and 3 are based on previously published work, Chapter 4 on a work in revision, and Chapter 5 on currently ongoing work. As the applications our algorithms are used for become more complicated and nuanced, so too must our tools for inference evolve to meet modern-day societal challenges. This dissertation serves as a starting point for answering these questions, and it will be further expanded upon in my future research career.

Chapter 1

Introduction

We are without a doubt in an artificial intelligence revolution. As algorithms are becoming more sophisticated, so too is computational power becoming cheaper. Consequently, automated algorithms are starting to be used repeatedly in areas where the cost of a mistake can have a high societal cost.
A few examples include algorithms in medicine that are developed to determine which patients are at high risk of having a deadly disease, algorithms in the legal justice system that are used to identify which defendants are at high risk of recommitting a crime, and facial recognition software that is used by border patrol to identify criminals. The use of these algorithms in sensitive areas requires researchers to rethink the tools we have developed to quantify and communicate uncertainty, instead of naively applying previously established methodology. These application areas require special care, since their results can have profound impacts on people's lives. Misdiagnosing patients who are truly sick, accidentally misclassifying a criminal defendant as high risk, or misidentifying an individual at a security checkpoint are consequences that carry a heavier weight than traditional machine learning algorithms were equipped to deal with. In this dissertation we directly tackle the problem of automated AI algorithms in high-risk scenarios, where high risk can be thought of as estimating rare events or making decisions where a mistake can have a high consequence.

When using modern machine learning algorithms in high-risk scenarios, it is important that uncertainty is controlled and properly communicated to practitioners. Modern applications of data science make uncertainty estimation very difficult in these scenarios, which in turn forces practitioners to make decisions without fully understanding the uncertainty of their algorithms. To help practitioners make better decisions in high-risk scenarios, this dissertation offers four distinctly novel techniques that make the implementation of these algorithms more realistic. Specifically, we will be looking at four challenges that high-risk uncertainty estimation presents:
selection bias, label noise with asymmetric error, fairness considerations, and multiple hypothesis testing, each of which will be described in detail below. These four ideas are just a few of many topics that need to be addressed in high-risk automated decision making. Below, we describe four fundamental issues with uncertainty estimation in high-risk scenarios.

1.1 Selection Bias

In modern data science, techniques are rarely applied in isolation. Rather, they are repeated, with their results digested at scale. This results in a large number of outputs for practitioners to handle, many of which are extreme. It is natural for us to focus on extreme observations because these observations tend to be the most informative. Take, for example, an investor who uses machine learning to estimate default probabilities across thousands of risky assets. Taking a risk-averse stance, this investor might be interested in putting her money only in assets that have less than a 1% chance of defaulting. To accomplish this, a natural strategy would be to take the estimated default probabilities and invest only in assets with an estimated default probability less than 1%. A
The first work in this dissertation [55] discusses a novel methodology specifically for correcting selection bias in probability estimates. This will allow practitioners to better understand the uncertainty they have about extreme events occurring. This will be important not only in investing, but also in medicine, where doctors are trying to estimate a patient's chance of having a rare disease, or in politics, where pundits may be interested in the chance that someone wins an election.

Selection bias does not only occur when producing probability estimates. It can also occur in classification. Consider a classification algorithm that decides which customers of a bank should be given a loan. In these scenarios, it is natural to consider separately the group of individuals that are approved or not approved for a loan. However, focusing specifically on a group of classified individuals can also be thought of as a form of selection bias.

1.1.1 Intuition

In Figure 1.1, we demonstrate how selection bias can cause practitioners to underestimate the true level of uncertainty they should have about an event occurring, particularly when they focus on the most extreme probability estimates. As an example, let's pretend that a researcher is interested in 1,000 different events. This researcher has uncertainty about which events might occur, so using the latest machine learning algorithms, she produces probability estimates ˜p for each event in question.

[Figure 1.1: Intuition behind selection bias. (a) 1,000 cards with estimated probabilities. (b) Randomly pick card 106. (c) Observe the true probability. (d) Picking the smallest estimated probability.]

In Figure 1.1, plot (a), we demonstrate this with playing cards, where each card represents a different event. On the front of each card (which is visible to the researcher) the estimated probability ˜p_i is displayed, for i = 1,...,1000.
On the back of each card (which is unseen by the researcher) is the true probability p_i, which represents the true probability of that specific event occurring.

The researcher knows that her estimates might be noisy, with some estimated probabilities being larger and others smaller than the true probability of the corresponding event occurring. However, if the probabilities are generated unbiasedly (i.e. E(˜p_i | p_i) = p_i for all i), then on average her estimates should be correct. However, even with unbiasedness, using ˜p_i as an estimate for p_i can be problematic.

In plot (b), the researcher randomly picks a card. In this case it is ˜p_106. Knowing that each probability has been estimated unbiasedly, what should the researcher's best guess be for the true probability? Indeed, knowing nothing else, her best guess is the estimated probability ˜p_106, which in this case is 0.1. Flipping the card over in plot (c) reveals the true probability p_106 = 0.11, which was close to our best guess.

However, now the researcher is going to pick a different card. She orders all of her estimated probabilities and selects the smallest one, which in this case is ˜p_1 = 0.004. What would her best guess be for the true probability? In fact, our strategy is going to be different than just using ˜p_1 as our best guess. The reason is that we have more information than in the first scenario. We not only know ˜p_1, but we also know that ˜p_1 is the smallest estimated probability out of all 1,000. This information allows us to say something stronger about our guess of p_1. Since ˜p_1 is the smallest estimated probability in the deck, either p_1 is also extremely small or the error associated with estimating ˜p_1 is large and negative. The latter is more likely since we focused our attention on the most extreme ˜p. So our best guess is that p_1 is something larger than ˜p_1.
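To see this effect numerically, the card example can be simulated in a few lines. The sketch below is purely illustrative: the uniform prior on the true probabilities, the noise level gamma, and the beta noise model are assumptions made here for demonstration, not part of the dissertation's methodology.

```python
import random

random.seed(0)
n, gamma = 1000, 0.001  # gamma controls the noise in the estimates (illustrative choice)

# True probabilities for 1,000 events, drawn from a hypothetical prior.
true_p = [random.uniform(0.005, 0.05) for _ in range(n)]

# Unbiased estimates: under this beta noise model, E(p_est | p) = p.
est_p = [random.betavariate(p / gamma, (1 - p) / gamma) for p in true_p]

# The cards with the SMALLEST estimates: their true probabilities tend to
# exceed their estimates, because an extreme estimate is extreme partly
# due to a large negative estimation error (selection bias).
smallest = sorted(range(n), key=lambda i: est_p[i])[:20]
avg_est = sum(est_p[i] for i in smallest) / len(smallest)
avg_true = sum(true_p[i] for i in smallest) / len(smallest)
print(avg_est, avg_true)
```

Averaged over all 1,000 cards the estimates are essentially unbiased, yet among the 20 smallest estimates the true probabilities are systematically larger than the estimates.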
This card example, while simple, shows that selection bias can cause problems when estimating the occurrence of extreme events, even when each event in question is estimated sensibly! Selection bias needs to be taken into account when practitioners are interested in extreme observations; otherwise they will have a distorted view of the population.

In Chapter 2, we offer a way to correct for selection bias in probability estimates. Our method is an empirical Bayes approach that updates probability estimates to take selection bias into account, in particular for the most extreme ones. This method is extremely flexible and can easily be applied to current automated probability estimates. It is non-parametric and primarily involves estimating the score function associated with the probability estimates, so it does not need to make any restrictive assumptions about the prior on the true probabilities.

1.1.2 Literature Review

There has been a lot of research that looks at unbounded observations X_1,...,X_n with E(X_i) = µ_i, where X_i can be any real number. A common goal in this setting is to estimate µ_i based on an observed X_i. In this setting, it is well known that the conditional expectation E(µ_i | X_i) corrects for selection bias with X_i picked in a non-random fashion [35]. Numerous approaches have been suggested to address selection bias, with most methods imposing some form of shrinkage to either explicitly, or implicitly, estimate E(µ_i | X_i). Among linear shrinkage methods, the James-Stein estimator [56] is the most well known, although many others exist [37, 53]. There are also other popular classes of methods, including: non-linear approaches utilizing sparse priors [31, 1, 13, 66], Bayesian estimators [40] and empirical Bayes methods [57, 18, 79]. For Gaussian data, Tweedie's formula [84] provides an elegant empirical Bayes estimate for E(µ_i | X_i), using only the marginal distribution of X_i.
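For Gaussian observations X_i | µ_i ~ N(µ_i, σ²), Tweedie's formula states that E(µ_i | X_i) = X_i + σ² d/dx log f(X_i), where f is the marginal density of the X_i. The sketch below illustrates the formula with a kernel estimate of the marginal; the two-point prior, the bandwidth, and the sample size are hypothetical choices for demonstration only.

```python
import math
import random

random.seed(1)
n, sigma = 1000, 1.0

# Hypothetical prior: half the means are 0, half are 3.
mu = [0.0 if i % 2 == 0 else 3.0 for i in range(n)]
x = [m + random.gauss(0, sigma) for m in mu]

h = 0.4  # kernel bandwidth for the marginal density estimate (assumed, not tuned)

def log_density_slope(t, data, bandwidth):
    """d/dt log f_hat(t) for a Gaussian-kernel estimate of the marginal density."""
    w = [math.exp(-0.5 * ((t - d) / bandwidth) ** 2) for d in data]
    dw = [wi * (d - t) / bandwidth ** 2 for wi, d in zip(w, data)]
    return sum(dw) / sum(w)

# Tweedie's formula: E(mu | x) = x + sigma^2 * d/dx log f(x).
shrunk = [xi + sigma ** 2 * log_density_slope(xi, x, h) for xi in x]

mse_raw = sum((xi - m) ** 2 for xi, m in zip(x, mu)) / n
mse_tweedie = sum((s - m) ** 2 for s, m in zip(shrunk, mu)) / n
print(mse_raw, mse_tweedie)
```

The log-density slope pulls each observation toward the nearer cluster of the marginal, so the shrunken estimates have noticeably lower mean squared error than the raw observations, using only the marginal distribution of the X_i and no knowledge of the prior.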
While less well known than the James-Stein estimator, Tweedie's formula has been shown to be an effective non-parametric approach for addressing selection bias [36]. The approach can be automatically adjusted to lean more heavily on parametric assumptions when little data is available, but in settings such as ours, where large quantities of data have been observed, it provides a highly flexible non-parametric shrinkage method [12, 47]. However, the standard implementation of Tweedie's formula assumes that, conditional on µ_i, the observed data follow a Gaussian distribution. Most shrinkage methods make similar distributional assumptions or else model the data as unbounded, which makes little sense for probabilities. In Chapter 2 of this dissertation, we expand this theory to low probability events. Our approach is designed for probability estimation in settings with a large number of observations.

None of this literature currently addresses selection bias for probability estimates. Chapter 2 provides novel methodology for this setting that is based on a variant of Tweedie's formula tuned to make more sense for probabilities.

1.2 Label Noise and Asymmetric Error

Data is being created at a rapid pace. However, collection techniques, such as employing people from crowd-sourcing platforms, can result in mistakes, and in many applications it has been estimated that around five percent of collected labels are incorrect [77, 82]. This problem is generally referred to as label noise, and while it is widely accepted to exist, most modern machine learning algorithms do not take it into account.

Label noise can be problematic in high-risk scenarios because its severity can invalidate the theoretical guarantees that practitioners expect from popular machine learning techniques.
Unfortunately, while there has been a lot of empirical discussion about the robustness of some algorithms to label noise, and theoretical discussion about bounding the risk under label noise, there has yet to be any work that discusses label noise in high-risk settings. High-risk settings are particularly sensitive to label noise because practitioners have a strong desire to avoid high-risk mistakes, such as failing to identify patients with a rare disease. However, if the methodology they are using does not take label noise into account, this might result in an increase in risk, since the models they are using have been trained on incorrect labels.

In high-risk settings, typically the mistakes that can be made with a classification algorithm are not of equal importance. For example, a biomedical diagnosis that misidentifies a benign tumor as malignant will cause distress and potentially unnecessary medical procedures, but the alternative, where a malignant tumor is classified as benign, will have far worse outcomes. In situations such as these, it is natural for a practitioner to prefer a classification algorithm that can guarantee that the worse error (i.e. misidentifying a malignant tumor as benign) doesn't happen too often.

While any error an algorithm makes has consequences, in high-risk settings it is common to identify a mistake that has a higher cost above all others. Aside from medicine, there are also examples in cyber-security and finance where practitioners want to avoid making errors that could have detrimental ramifications.

This motivated us to develop methodology for addressing label noise in high-risk classification settings. A standard setting to consider for high-risk classification is the NP paradigm.
The NP paradigm [23, 87] aims to mimic the NP oracle

φ*_α ∈ argmin_{φ: R_0(φ) ≤ α} R_1(φ),

where R_0 and R_1 represent the more severe and less severe errors of a classifier φ, respectively, and α ∈ (0,1) is a user-specified level that reflects the priority placed on the more severe error. This framework is ideal for high-risk classification scenarios because the NP classification paradigm respects the following two conditions with high probability: first, the more severe error constraint is respected, and second, the excess less severe error diminishes at an explicit rate (with respect to sample size).

More recently, [99] developed the NP classification umbrella algorithm, which can adapt any scoring-type classification method to satisfy the NP classification paradigm above. This umbrella algorithm was a breakthrough in the asymmetric classification literature because it made the NP classification paradigm easily achievable and implementable in practitioners' current workflows.

In Chapter 3 of this dissertation, we expand the NP umbrella algorithm to the setting where one's training data contains label noise. Our method is a label-noise-adjusted NP umbrella algorithm that corrects for label noise with a lower bound estimate on the level of corruption. This algorithm reduces to the original NP umbrella algorithm when the level of noise is zero. To our knowledge, this work is the first of its kind to offer any theoretical guarantees in this setting.

1.2.1 Literature Review

The study of label noise in supervised learning has been a vibrant field in academia. On the empirical front, researchers have found that some statistical learning methods, such as quadratic discriminant analysis [65] and k-NN [76], can be greatly affected by label noise and have their accuracy seriously reduced, while other methods, such as linear discriminant analysis [64], are more label noise tolerant. Moreover, one can modify AdaBoost [25], the perceptron algorithm [61] and neural networks [96] so that they are more tolerant to label noise.
Data cleansing techniques have also been developed, such as in [44] and [17]. On the theoretical front, [75] provided a guarantee for risk minimization in the setting of convex surrogates. [71] proved label noise tolerance of risk minimization for certain types of loss functions, and [41] extended the result by considering more loss types. [69] proposed learning methods with importance reweighting which can minimize the risk. [14] studied intensely the class-conditional corruption model, a model on which many works on label noise are based. In particular, theoretical results about parameter estimation and consistency of classifiers under this model were presented in their work. Most recently, [22] derived innovative theory of excess risk for general classifiers.

Asymmetric error control in binary classification has been vibrantly studied in the machine learning literature. A common approach by practitioners is to assign a cost to each of the two mistakes their classifier can make, which serve as weights for tuning the trade-off between the severity of the classifier's mistakes. This is referred to as cost-sensitive learning [38, 104]. However, this approach can be problematic when researchers cannot reach a consensus about how to assign exact costs, and there is no guidance as to how costs can be assigned to provably control the most severe error (such as misdiagnosing a patient with cancer) below a user-specified level. The setbacks of cost-sensitive learning led to the development of the Neyman-Pearson (NP) classification paradigm.

1.2.2 Asymmetric error from two perspectives

This dissertation addresses asymmetric error from two different perspectives. In Chapter 3, we discuss how to control asymmetric error with high probability, which is necessary for practitioners who truly must avoid any violations above a pre-specified value.
In Chapter 4, the control is only held in expectation, and we introduce the concept of an indecision group, where we explicitly do not force an algorithm to make a decision on an individual. This modification allows us to control not only the most serious error, but also all errors, below a pre-specified level.

1.3 Fairness in Classification

One of the biggest challenges in high-risk automated decision making is ensuring that algorithms treat people of different backgrounds, races, genders, or any other protected attribute fairly. In particular, the use of machine learning in sensitive applications such as medical diagnosis, recidivism prediction, and facial recognition (among others) has started to demonstrate that current methodology needs to be reworked in order to ensure that everyone has a fair chance of a high quality of life. As machine learning becomes easier to implement and work into our daily lives, we need to make sure we are not systematically disadvantaging people from different protected groups, and that we are communicating uncertainty about fairness properly to practitioners so they can work with algorithms to achieve better outcomes.

In this dissertation, we focus on developing a selective inference methodology that can ensure our definition of fairness across a pre-defined set of protected groups. This methodology is developed with minimal assumptions, and it can be used as a wrapper for any current machine learning pipeline. Many state-of-the-art algorithms are complex, and current fairness algorithms require assumptions that are hard to test. That motivated us to build a framework that is easy to use, so that practitioners can immediately implement it in their work. Crucially, our framework also does not make a decision when it is not confident enough to do so, helping the user know when their algorithm may be inappropriate for an individual.

1.3.1 Literature Review

Fairness has become one of the fastest growing and most important fields in machine learning.
In response, researchers have introduced a rich set of definitions that mathematically formalize fairness for algorithms. Many approaches consider a set of protected attributes, such as race, religion, or gender, and then impose a constraint across them. These approaches are generally referred to as group-wise fairness. However, there is also a growing body of literature that attempts to define fairness without protected attributes, in a way that quantifies fair outcomes at the individual level. In this dissertation, we only consider group fairness notions and leave individual fairness definitions for future work.

For example, a popular starting point for defining fairness is a rule that does not explicitly use protected attributes in the decision making process. This definition, while appealing at first glance, is problematic because it has been widely shown that protected attributes can be encoded in other observations, making algorithms that satisfy fairness through unawareness just as unfair as if they had made no adjustment at all. In addition to fairness through unawareness, a widely used fairness notion is the separation principle [5], which requires that

P(Y ≠ Ŷ | Y = c, A = a) are the same for all a ∈ A. (1)

A third notion of fairness, which is considered in Chapter 4, is the sufficiency principle, or predictive parity [28, 5, 26]. It requires that

P(Y ≠ c | Ŷ = c, A = a) ≤ α for all a ∈ A.

A fourth notion of fairness, in the context of prediction intervals, has been considered in [85]. Rather than conditioning on either Y or Ŷ, these works are concerned with the joint probabilities of (Ŷ, Y). This fairness criterion requires that the misclassification rates are equalized across all protected groups:

P(Y ≠ Ŷ | A = a) are the same for all a ∈ A. (2)

Other popular fairness notions include equalized odds [45, 86] and equalized risks [27]. A highly controversial issue is that different fairness criteria often lead to different algorithms and decisions in practice.
For example, the sufficiency and separation principles can be incompatible with each other [62, 39], and classification parity and calibration can harm the very groups that the algorithms are designed to protect [27].

In this dissertation we do not make any assertions that one definition of fairness is universally superior to others. The implementation of fair algorithms must be considered contextually and deeply thought through with every new application. Much research is still needed to fully understand the trade-offs and caveats between different approaches to fairness.

1.4 Multiple Hypothesis Testing

Hypothesis testing is a powerful tool for researchers to interpret the results of an experiment and see whether they have meaningful findings. As scientific disciplines have been able to collect more data, there has been a need to perform multiple hypothesis tests simultaneously. However, a naive application of single-inference procedures can result in many more false findings than the practitioner might expect. This is due to the multiplicity that arises through making multiple comparisons instead of just one.

In the typical two-sample inference framework, let X and Y be two random vectors that record the measurement levels of the same m features under two experimental conditions. The corresponding population mean vectors can be written as E(X) = µ_x = (µ_x1, ..., µ_xm), and similarly for E(Y). A traditional way of identifying differential features is to carry out m two-sample tests, where H_i,0 and H_i,1 represent the null and alternative hypotheses respectively:

H_i,0: µ_xi = µ_yi vs. H_i,1: µ_xi ≠ µ_yi, 1 ≤ i ≤ m.

To control the effect of multiplicity, it is common to control the false discovery rate (FDR). The FDR is the expected proportion of false rejections out of all hypotheses that a multiple testing procedure classifies as interesting. For R rejected hypotheses containing a false rejections, the FDR can be written as

FDR = E(a/R),

with the convention that a/R = 0 when R = 0.
This measures the expected proportion of falsely rejected hypotheses among those that a multiple testing procedure has selected.

1.4.1 Literature Review

The field of multiple testing is one of the most important and vibrant areas of modern statistics. While most of the literature focuses on controlling the FDR, there is also work that focuses on controlling the family-wise error rate (FWER). FWER control bounds the probability that one or more false rejections occur. However, in practice this is a very strict criterion, and individual deviations from the null hypothesis are hard to detect if the FWER is controlled. This is especially true when the number of hypotheses is very large and the cost of making any one false rejection is not too high. This is why, in practice, controlling the FDR is often preferred over controlling the FWER.

There are many procedures for controlling the FWER and FDR, with varying levels of assumptions about dependence among the p-values and the amount of side information available to the practitioner. For controlling the FWER, a simple approach is Bonferroni's method [29], which rejects all hypotheses with p-values less than a pre-defined control level α divided by the total number of hypotheses n. Another approach is Sidak's procedure [106], which, under the assumption of independence between the p-values, rejects hypotheses below the threshold 1 − (1 − α)^(1/n). Similarly, Hochberg's and Holm's procedures also control the FWER by ranking and thresholding p-values [50, 8].

For controlling the FDR, the Benjamini-Hochberg procedure is perhaps the most popular method used in practice [8]. Assuming mutually independent p-values, it only involves ordering the p-values from smallest to largest and then rejecting all p-values up to the largest ordered p-value satisfying the threshold

p_(i) ≤ (i/n) α,

where p_(i) is the i-th ordered p-value.
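The BH step-up rule just described takes only a few lines to implement; the sketch below is a generic illustration, not tied to the dissertation's methodology or any particular package.

```python
def benjamini_hochberg(pvals, alpha=0.1):
    """Benjamini-Hochberg step-up procedure: reject the hypotheses whose
    p-values are among the k smallest, where k is the largest i with
    p_(i) <= (i/n) * alpha. Returns indices of rejected hypotheses."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices, smallest p first
    k = 0
    for i, idx in enumerate(order, start=1):
        if pvals[idx] <= i / n * alpha:
            k = i  # step-up: keep the LARGEST i passing its threshold
    return sorted(order[:k])

# The three small p-values are all rejected; 0.5 fails every threshold.
print(benjamini_hochberg([0.001, 0.01, 0.02, 0.5], alpha=0.05))
```

Note the step-up character of the rule: a p-value above its own threshold can still be rejected if a larger ordered p-value passes, which is what distinguishes BH from a simple per-hypothesis cutoff.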
Other methods include Storey's procedure and the Benjamini-Yekutieli procedure, which also involve sorting / ranking the p-values and then thresholding them.

1.4.2 Our Contribution

In this dissertation, we address multiple hypothesis testing in Chapter 5: Ongoing Work, where we consider an empirical Bayes adjustment to p-value estimation for multiple hypothesis testing. In this chapter, we consider controlling the FDR with the Benjamini-Hochberg procedure and focus on bias-correcting the p-values themselves. Using techniques similar to the paper discussed in Chapter 2, we show that when p-values are estimated from data, the FDR control can be compromised. However, by adjusting for this noise we can obtain better FDR control in settings where there is not enough data to estimate the p-values well. This work is still ongoing and the content is subject to change.

1.5 Remaining Outline

The rest of this dissertation will go in depth into each topic discussed above. Chapter 2 will cover an empirical Bayes approach to bias-correcting probability estimates. Chapter 3 will cover a label-noise-adjusted Neyman-Pearson umbrella algorithm. Chapter 4 will cover a fairness approach to classification that ensures the rate of erroneous decisions of any machine learning algorithm is effectively equalized across a set of protected groups. Finally, in Chapter 5 we conclude the dissertation by briefly mentioning future work on bias correction in multiple hypothesis testing and summarizing all of this dissertation's main ideas.

The rest of the dissertation is based on two previous publications, one work in revision, and one work in progress.

• Publication [55]: Gareth M. James, Peter Radchenko & Bradley Rava (2022) Irrational Exuberance: Correcting Bias in Probability Estimates, Journal of the American Statistical Association, 117:537, 455-468, DOI: 10.1080/01621459.2020.1787175.
• Publication [103]: Shunan Yao, Bradley Rava, Xin Tong & Gareth James (2022) Asymmetric Error Control Under Imperfect Supervision: A Label-Noise-Adjusted Neyman-Pearson Umbrella Algorithm, Journal of the American Statistical Association, DOI: 10.1080/01621459.2021.2016423.

• Preprint [81]: Bradley Rava, Wenguang Sun, Gareth M. James & Xin Tong (2022) A Burden Shared is a Burden Halved: A Fairness-Adjusted Approach to Classification. Under review. Preprint available on arXiv.

• Work in progress: Bradley Rava, Gareth M. James. Correcting Bias in p-value Estimation.

All proofs and discussions will be provided in the appendix. To apply the concepts in Chapter 2, the "ecap" package is available in the R language on CRAN. Similarly, for Chapter 4, the "fasi" package is available in the R language on CRAN.

Chapter 2
Irrational Exuberance: Correcting Bias in Probability Estimates

We consider the common setting where one observes probability estimates for a large number of events, such as default risks for numerous bonds. Unfortunately, even with unbiased estimates, selecting events corresponding to the most extreme probabilities can result in systematically underestimating the true level of uncertainty. We develop an empirical Bayes approach, "Excess Certainty Adjusted Probabilities" (ECAP), using a variant of Tweedie's formula, which updates probability estimates to correct for selection bias. ECAP is a flexible non-parametric method, which directly estimates the score function associated with the probability estimates, so it does not need to make any restrictive assumptions about the prior on the true probabilities. ECAP also works well in settings where the probability estimates are biased. We demonstrate, through theoretical results, simulations, and an analysis of two real world data sets, that ECAP can provide significant improvements over the original probability estimates.
2.1 Introduction

We are increasingly facing a world where automated algorithms are used to generate probabilities, often in real time, for thousands of different events. Just a small handful of examples include finance, where rating agencies provide default probabilities on thousands of different risky assets [58, 52]; sporting events, where each season ESPN and other sites estimate win probabilities for all the games occurring in a given sport [68]; politics, where pundits estimate the probabilities of candidates winning in congressional and state races during a given election season [89, 91]; or medicine, where researchers estimate the survival probabilities of patients undergoing a given medical procedure [80, 90]. Moreover, with the increasing availability of enormous quantities of data, there are more and more automated probability estimates being generated and consumed by the general public.

Many of these probabilities have significant real world implications. For example, the rating given to a company's bonds will impact their cost of borrowing, or the estimated risk of a medical procedure will affect the patient's likelihood of undertaking the operation. This leads us to question the accuracy of these probability estimates. Let p_i and ˜p_i respectively represent the true and estimated probability of A_i occurring for a series of events A_1,...,A_n. Then, we often seek an unbiased estimator such that E(˜p_i | p_i) = p_i, so ˜p_i is neither systematically too high nor too low. Of course, there are many recent examples where this unbiasedness assumption has not held. For example, prior to the financial crisis of 2008, rating agencies systematically underestimated the risk of default for mortgage backed securities, so E(˜p_i | p_i) < p_i.

Finally, in the right panel we have plotted the ratio of E(p_i | ˜p_i) to ˜p_i. Ideally this ratio should be approximately one, which would, for example, correspond to the true risk of a set of bonds equalling the estimated risk.
However, for small values of ˜p_i we observe ratios far higher than one. So, for example, our investor who only purchases bonds with an estimated default risk of ˜p_i = 0.001 will in fact find that 0.004 of her bonds end up defaulting, a 400% higher risk level than she intended to take!

These somewhat surprising results are not a consequence of this particular simulation setting. This is in fact an instance of selection bias, a well known issue which occurs when the selection of observations is made in such a way, e.g. selecting the most extreme observations, that they can no longer be considered random samples from the underlying population. If this bias is not taken into account, then any future analyses will provide a distorted estimate of the population. Consider the setting where we observe X_1,...,X_n with E(X_i) = µ_i and wish to estimate µ_i based on an observed X_i. Then it is well known that the conditional expectation E(µ_i | X_i) corrects for any selection bias associated with choosing X_i in a non-random fashion [36]. Numerous approaches have been suggested to address selection bias, with most methods imposing some form of shrinkage to either explicitly, or implicitly, estimate E(µ_i | X_i). Among linear shrinkage methods, the James-Stein estimator [56] is the most well known, although many others exist [37, 53]. There are also other popular classes of methods, including: non-linear approaches utilizing sparse priors [31, 1, 13, 66], Bayesian estimators [40] and empirical Bayes methods [57, 18, 79]. For Gaussian data, Tweedie's formula [84] provides an elegant empirical Bayes estimate for E(µ_i | X_i), using only the marginal distribution of X_i. While less well known than the James-Stein estimator, it has been shown to be an effective non-parametric approach for addressing selection bias [36].
The approach can be automatically adjusted to lean more heavily on parametric assumptions when little data is available, but in settings such as ours, where large quantities of data have been observed, it provides a highly flexible non-parametric shrinkage method [12, 47]. However, the standard implementation of Tweedie's formula assumes that, conditional on µ_i, the observed data follow a Gaussian distribution. Most shrinkage methods make similar distributional assumptions or else model the data as unbounded, which makes little sense for probabilities.

What then would be a better estimator for low probability events? In this paper we propose an empirical Bayes approach, called "Excess Certainty Adjusted Probability" (ECAP), specifically designed for probability estimation in settings with a large number of observations. ECAP uses a variant of Tweedie's formula which models ˜p_i as coming from a beta distribution, automatically ensuring the estimate is bounded between 0 and 1. We provide theoretical and empirical evidence demonstrating that the ECAP estimate is generally significantly more accurate than ˜p_i.

This chapter makes three key contributions. First, we convincingly demonstrate that even an unbiased estimator ˜p_i can provide a systematically sub-optimal estimate for p_i, especially in situations where large numbers of probability estimates have been generated. This leads us to develop the oracle estimator for p_i, which results in a substantial improvement in expected loss. Second, we introduce the ECAP method, which estimates the oracle. ECAP does not need to make any assumptions about the distribution of p_i. Instead, it relies on estimating the marginal distribution, and conditional accuracy, of ˜p_i, a relatively easy problem in the increasingly common situation where we observe a large number of probability estimates.
Finally, we extend ECAP to the biased data setting, where ˜p_i represents a biased observation of p_i, and show that even in this setting we are able to recover systematically superior estimates of p_i.

This chapter is structured as follows. In Section 2.2 we first formulate a model for ˜p_i and a loss function for estimating p_i. We then provide a closed form expression for the corresponding oracle estimator and its associated reduction in expected loss. We conclude Section 2.2 by proposing the ECAP estimator for the oracle and deriving its theoretical properties. Section 2.3 provides two extensions. First, we propose a bias-corrected version of ECAP, which can detect situations where ˜p_i is a biased estimator for p_i and automatically adjust for the bias. Second, we generalize the ECAP model from Section 2.2. Next, Section 2.4 contains results from an extensive simulation study that examines how well ECAP works to estimate p_i, in both the unbiased and biased settings. Section 2.5 illustrates ECAP on two interesting real world data sets. The first is a unique set of probabilities from ESPN predicting, in real time, the winner of various NCAA football games, and the second contains the win probabilities of all candidates in the 2018 US midterm elections. We conclude with a discussion and possible future extensions in Section 2.6. Proofs of all theorems are provided in the appendix.
Finally, we provide an approach for generating an estimator $\hat p_i$ for the oracle. In this section we only consider the setting where $\tilde p_i$ is assumed to be an unbiased estimator for $p_i$. We extend our approach to the more general setting where $\tilde p_i$ may be a biased estimator in Section 2.3.1.

2.2.1 Modeling $\tilde p_i$ and Selecting a Loss Function

Given that $\tilde p_i$ is a probability, we model its conditional distribution using the beta distribution. (We consider a more general class of distributions for $\tilde p_i$ in Section 2.3.2.) In particular, we model $\tilde p_i \mid p_i \sim \text{Beta}(\alpha_i, \beta_i)$, where

$$\alpha_i = \frac{p_i}{\gamma^*}, \qquad \beta_i = \frac{1 - p_i}{\gamma^*}, \qquad (1)$$

and $\gamma^*$ is a fixed parameter which influences the variance of $\tilde p_i$. Under (1),

$$E(\tilde p_i \mid p_i) = p_i \quad \text{and} \quad \text{Var}(\tilde p_i \mid p_i) = \frac{\gamma^*}{1 + \gamma^*}\, p_i (1 - p_i), \qquad (2)$$

so $\tilde p_i$ is an unbiased estimate for $p_i$, which becomes more accurate as $\gamma^*$ tends to zero.

Figure 2.2 provides an illustration of the density function of $\tilde p_i$ for three different values of $p_i$. [Figure 2.2: Density functions for $\tilde p_i$ given $p_i = 0.002$ (blue / solid), $p_i = 0.01$ (orange / dot-dashed), and $p_i = 0.03$ (green / dashed). In all three cases $\gamma^* = 0.001$.]

In principle, this model could be extended to incorporate observation-specific variance terms $\gamma^*_i$. Unfortunately, in practice $\gamma^*$ needs to be estimated, which would be challenging if we assumed a separate term for each observation. However, in some settings it may be reasonable to model $\gamma^*_i = w_i \gamma^*$, where $w_i$ is a known weighting term, in which case only one parameter needs to be estimated.

Next, we select a loss function for our estimator to minimize. One potential option would be to use a standard squared error loss, $L(\hat p_i) = E(p_i - \hat p_i)^2$. However, this loss function is not the most reasonable approach in this setting. Consider, for example, the event corresponding to a bond defaulting, or a patient dying during surgery.
If the bond has junk status, or the surgery is highly risky, the true probability of default or death might be $p_i = 0.26$, in which case an estimate of $\hat p_i = 0.25$ would be considered very accurate. It is unlikely that an investor or patient would have made a different decision if they had instead been provided with the true probability of 0.26. However, if the bond, or surgery, is considered very safe, we might provide an estimated probability of $\hat p_i = 0.0001$ when the true probability is somewhat higher, at $p_i = 0.01$. The absolute error in the estimate is actually slightly lower in this case, but the patient or investor might well make a very different decision when given a 1% probability of a negative outcome versus a one in ten thousand chance.

In this sense, the error between $p_i$ and $\hat p_i$ as a percentage of $\hat p_i$ is a far more meaningful measure of precision. In the first example we have a percentage error of only 4%, while in the second instance the percentage error is almost 10,000%, indicating a far more risky proposition. To capture this concept of relative error, we introduce as our measure of accuracy a quantity we call the "Excess Certainty", defined as

$$EC(\hat p_i) = \frac{p_i - \hat p_i}{\min(\hat p_i,\, 1 - \hat p_i)}. \qquad (3)$$

In the first example $EC = 0.04$, while in the second example $EC = 99$. Note, we include $\hat p_i$ in the denominator rather than $p_i$ because we wish to more heavily penalize settings where the estimated risk is far lower than the true risk (irrational exuberance), compared to the alternative where the true risk is much lower. Ideally, the excess certainty of any probability estimate should be very close to zero. Thus, we adopt the following expected loss function,

$$L(\hat p_i, \tilde p_i) = E\left[ EC(\hat p_i)^2 \mid \tilde p_i \right], \qquad (4)$$

where the expectation is taken over $p_i$, conditional on $\tilde p_i$. Our aim is to produce an estimator $\hat p_i$ that minimizes (4) conditional on the observed value of $\tilde p_i$.
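To make the definition concrete, the excess certainty of the two hypothetical estimates above can be computed directly (a minimal sketch; the function name is ours):

```python
def excess_certainty(p_true, p_hat):
    """Excess certainty EC(p_hat) = (p_true - p_hat) / min(p_hat, 1 - p_hat)."""
    return (p_true - p_hat) / min(p_hat, 1.0 - p_hat)

# Risky bond: estimate 0.25 vs truth 0.26 -> EC of about 0.04, essentially harmless.
print(excess_certainty(0.26, 0.25))
# "Safe" bond: estimate 0.0001 vs truth 0.01 -> EC of about 99, badly over-confident.
print(excess_certainty(0.01, 0.0001))
```

The asymmetry of the measure is intentional: a small absolute error on a tiny estimated probability produces an enormous excess certainty, exactly the behavior the loss function (4) is designed to penalize.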
It is worth noting that if our goal were solely to remove selection bias, then we could simply compute $E(p_i \mid \tilde p_i)$, which would be equivalent to minimizing $E[(p_i - \hat p_i)^2 \mid \tilde p_i]$. Minimizing (4) generates a similar shrinkage estimator, which also removes the selection bias, but, as we discuss in the next section, it actually provides additional shrinkage to account for the fact that we wish to minimize the relative, or percentage, error.

2.2.2 The Oracle Estimator

We now derive the oracle estimator, $p_{i0}$, which minimizes the loss function given by (4),

$$p_{i0} = \arg\min_a E\left[ EC(a)^2 \mid \tilde p_i \right]. \qquad (5)$$

Our ECAP estimate aims to approximate the oracle. Theorem 1 below provides a relatively simple closed form expression for $p_{i0}$ and a bound on the minimum reduction in loss from using $p_{i0}$ relative to any other estimator.

Theorem 1. For any distribution of $\tilde p_i$,

$$p_{i0} = \begin{cases} \min\left( E(p_i \mid \tilde p_i) + \dfrac{\text{Var}(p_i \mid \tilde p_i)}{E(p_i \mid \tilde p_i)},\; 0.5 \right), & E(p_i \mid \tilde p_i) \le 0.5 \\[1ex] \max\left( 0.5,\; E(p_i \mid \tilde p_i) - \dfrac{\text{Var}(p_i \mid \tilde p_i)}{1 - E(p_i \mid \tilde p_i)} \right), & E(p_i \mid \tilde p_i) > 0.5. \end{cases} \qquad (6)$$

Furthermore, for any $p'_i \ne p_{i0}$,

$$L(p'_i, \tilde p_i) - L(p_{i0}, \tilde p_i) \ge \begin{cases} E(p_i^2 \mid \tilde p_i) \left[ \dfrac{1}{p'_i} - \dfrac{1}{p_{i0}} \right]^2, & p_{i0} \le 0.5 \\[1ex] E\left([1 - p_i]^2 \mid \tilde p_i\right) \left[ \dfrac{1}{1 - p'_i} - \dfrac{1}{1 - p_{i0}} \right]^2, & p_{i0} \ge 0.5. \end{cases} \qquad (7)$$

Remark 1. Note that both bounds in (7) are valid when $p_{i0} = 0.5$.

We observe from this result that the oracle estimator starts with the conditional expectation $E(p_i \mid \tilde p_i)$ and then shifts the estimate towards 0.5 by an amount $\text{Var}(p_i \mid \tilde p_i) / \min\left( E(p_i \mid \tilde p_i),\, 1 - E(p_i \mid \tilde p_i) \right)$. However, if this would move the estimate past 0.5, then the estimator simply becomes 0.5.

Figure 2.3 plots the average excess certainty (3) from using $\tilde p_i$ to estimate $p_i$ (orange lines) and from using $p_{i0}$ to estimate $p_i$ (green lines), for three different values of $\gamma^*$. Recall that an ideal $EC$ should be zero, but the observed values for $\tilde p_i$ are far larger, especially for higher values of $\gamma^*$ and lower values of $\tilde p_i$.
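The closed form (6) translates directly into code. The sketch below is ours and assumes that estimates of the conditional mean and variance are already available:

```python
def oracle_probability(mean, var):
    """Oracle estimator p_i0 of Theorem 1, given E(p_i | p~_i) and Var(p_i | p~_i).

    Shifts the conditional mean towards 0.5 by var / min(mean, 1 - mean),
    truncating at 0.5 if the shift would overshoot.
    """
    if mean <= 0.5:
        return min(mean + var / mean, 0.5)
    return max(0.5, mean - var / (1.0 - mean))

# A small conditional mean is pushed up, away from over-certainty (0.01 -> 0.02):
print(oracle_probability(0.01, 0.0001))
# A mean whose shift would cross 0.5 is simply truncated at 0.5:
print(oracle_probability(0.49, 0.02))  # 0.5
```

Note how the adjustment is proportionally largest for extreme probabilities, where the denominator $\min(E(p_i \mid \tilde p_i), 1 - E(p_i \mid \tilde p_i))$ is small.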
Note that, as a consequence of the minimization of the expected squared loss function (4), the oracle is slightly conservative, with a negative $EC$, which is due to the variance term in (6).

[Figure 2.3: Average excess certainty as a function of $\tilde p_i$ for three different values of $\gamma^*$ ($\gamma^* = 0.001$, $0.01$, and $0.05$; orange / dashed line). All plots exhibit excess certainty far above zero, but the issue grows worse as $\gamma^*$ gets larger, corresponding to more variance in $\tilde p_i$. The green (solid) line in each plot corresponds to the average excess certainty for the oracle estimator $p_{i0}$.]

It is worth noting that Theorem 1 applies for any distribution of $\tilde p_i \mid p_i$ and does not rely on our model (1). If we further assume that (1) holds, then Theorem 2 provides explicit forms for $E(p_i \mid \tilde p_i)$ and $\text{Var}(p_i \mid \tilde p_i)$.

Theorem 2. Under (1),

$$E(p_i \mid \tilde p_i) = \mu_i \equiv \tilde p_i + \gamma^* \left[ g^*(\tilde p_i) + 1 - 2\tilde p_i \right] \qquad (8)$$

$$\text{Var}(p_i \mid \tilde p_i) = \sigma_i^2 \equiv \gamma^* \tilde p_i (1 - \tilde p_i) + \gamma^{*2} \tilde p_i (1 - \tilde p_i) \left[ g^{*\prime}(\tilde p_i) - 2 \right], \qquad (9)$$

where $g^*(\tilde p_i) = \tilde p_i (1 - \tilde p_i) v^*(\tilde p_i)$, $v^*(\tilde p_i) = \frac{\partial}{\partial \tilde p_i} \log f^*(\tilde p_i)$ is the score function of $\tilde p_i$, and $f^*(\tilde p_i)$ is the marginal density of $\tilde p_i$.

If we also assume that the distribution of $p_i$ is symmetric, then further simplifications are possible.

Corollary 1. If the prior distribution of $p_i$ is symmetric about 0.5, then

$$p_{i0} = \begin{cases} \min\left( E(p_i \mid \tilde p_i) + \dfrac{\text{Var}(p_i \mid \tilde p_i)}{E(p_i \mid \tilde p_i)},\; 0.5 \right), & \tilde p_i \le 0.5 \\[1ex] \max\left( 0.5,\; E(p_i \mid \tilde p_i) - \dfrac{\text{Var}(p_i \mid \tilde p_i)}{1 - E(p_i \mid \tilde p_i)} \right), & \tilde p_i > 0.5, \end{cases} \qquad (10)$$

$$g^*(0.5) = 0, \quad \text{and} \quad g^*(\tilde p_i) = -g^*(1 - \tilde p_i). \qquad (11)$$

A particularly appealing aspect of Theorem 2 and its corollary is that $g^*(\tilde p_i)$ is only a function of the marginal distribution of $\tilde p_i$, so it can be estimated directly using the observed probabilities $\tilde p_i$.
In particular, we do not need to make any assumptions about the distribution of $p_i$ in order to compute $g^*(\tilde p_i)$.

2.2.3 Estimation

In order to estimate $p_{i0}$ we must form estimates for $g^*(\tilde p_i)$, its derivative $g^{*\prime}(t)$, and $\gamma^*$.

2.2.3.1 Estimation of g

Let $\hat g(\tilde p)$ represent our estimator of $g^*(\tilde p)$. Given that $g^*(\tilde p)$ is a function of the marginal distribution of $\tilde p_i$, i.e. $f^*(\tilde p_i)$, one could estimate $g^*(\tilde p_i)$ by $\tilde p_i (1 - \tilde p_i) \hat f'(\tilde p_i) / \hat f(\tilde p_i)$, where $\hat f(\tilde p_i)$ and $\hat f'(\tilde p_i)$ are respectively estimates for the marginal density of $\tilde p_i$ and its derivative. However, this approach requires dividing by the estimated density function, which can produce a highly unstable estimate near the boundary points, precisely the region we are most interested in. Instead, we directly estimate $g^*(\tilde p)$ by choosing $\hat g(\tilde p)$ so as to minimize the risk function, defined as $R(g) = E[g(\tilde p) - g^*(\tilde p)]^2$ for every candidate function $g$. The following result provides an explicit form for the risk.

Theorem 3. Suppose that model (1) holds, and the prior for $p$ has a bounded density. Then,

$$R(g) = E\, g(\tilde p)^2 + 2E\left[ g(\tilde p)(1 - 2\tilde p) + \tilde p(1 - \tilde p) g'(\tilde p) \right] + C \qquad (12)$$

for all bounded and differentiable functions $g$, where $C$ is a constant that does not depend on $g$.

Remark 2. We show in the proof of Theorem 3 that $g^*$ is bounded and differentiable, so (12) holds for $g = g^*$.

Theorem 3 suggests that we can approximate the risk, up to an irrelevant constant, by

$$\hat R(g) = \frac{1}{n} \sum_{i=1}^n g(\tilde p_i)^2 + 2 \cdot \frac{1}{n} \sum_{i=1}^n \left[ g(\tilde p_i)(1 - 2\tilde p_i) + \tilde p_i (1 - \tilde p_i) g'(\tilde p_i) \right]. \qquad (13)$$

However, simply minimizing (13) would provide a poor estimate for $g^*(\tilde p)$ because, without any smoothness constraints, $\hat R(g)$ can be trivially minimized. Hence, we place a smoothness penalty on our criterion by minimizing

$$Q(g) = \hat R(g) + \lambda \int g''(\tilde p)^2 \, d\tilde p, \qquad (14)$$

where $\lambda > 0$ is a tuning parameter which adjusts the level of smoothness in $g(\tilde p)$.
We show in our theoretical analysis in Section 2.2.4 (see the proof of Theorem 4) that, much as in the more standard curve fitting setting, the solution to criteria of the form in (14) can be well approximated using a natural cubic spline, which provides a computationally efficient approach to compute $g(\tilde p)$. Let $b(x)$ represent the vector of basis functions for a natural cubic spline, with knots at $\tilde p_1, \ldots, \tilde p_n$, restricted to satisfy $b(0.5) = 0$. Then, in minimizing $Q(g)$ we need only consider functions of the form $g(\tilde p) = b(\tilde p)^T \eta$, where $\eta$ is the vector of basis coefficients. Thus, (14) can be re-expressed as

$$Q_n(\eta) = \frac{1}{n} \sum_{i=1}^n \eta^T b(\tilde p_i) b(\tilde p_i)^T \eta + 2 \cdot \frac{1}{n} \sum_{i=1}^n \left[ (1 - 2\tilde p_i) b(\tilde p_i)^T + \tilde p_i (1 - \tilde p_i) b'(\tilde p_i)^T \right] \eta + \lambda \eta^T \Omega \eta, \qquad (15)$$

where $\Omega = \int b''(\tilde p) b''(\tilde p)^T \, d\tilde p$. Standard calculations show that (15) is minimized by setting

$$\hat\eta = -\left( \sum_{i=1}^n b(\tilde p_i) b(\tilde p_i)^T + n\lambda\Omega \right)^{-1} \sum_{i=1}^n \left[ (1 - 2\tilde p_i) b(\tilde p_i) + \tilde p_i (1 - \tilde p_i) b'(\tilde p_i) \right]. \qquad (16)$$

If the prior distribution of $p_i$ is not assumed to be symmetric, then $g^*(\tilde p_i)$ should be directly estimated for $0 \le \tilde p_i \le 1$. However, if the prior is believed to be symmetric, this approach is inefficient, because it does not incorporate the identity $g^*(\tilde p_i) = -g^*(1 - \tilde p_i)$. Hence, a superior approach involves flipping all of the $\tilde p_i > 0.5$ across 0.5, thus converting them into $1 - \tilde p_i$, and then using both the flipped and the unflipped $\tilde p_i$ to estimate $g(\tilde p_i)$ between 0 and 0.5. Finally, the identity $\hat g(\tilde p_i) = -\hat g(1 - \tilde p_i)$ can be used to define $\hat g$ on $(0.5, 1]$. This is the approach we use for the remainder of the paper.

Equation (16) allows us to compute estimates for $E(p_i \mid \tilde p_i)$ and $\text{Var}(p_i \mid \tilde p_i)$:

$$\hat\mu_i = \tilde p_i + \hat\gamma \left( b(\tilde p_i)^T \hat\eta + 1 - 2\tilde p_i \right) \qquad (17)$$

$$\hat\sigma_i^2 = \hat\gamma \tilde p_i (1 - \tilde p_i) + \hat\gamma^2 \tilde p_i (1 - \tilde p_i) \left[ b'(\tilde p_i)^T \hat\eta - 2 \right]. \qquad (18)$$

Equations (17) and (18) can then be substituted into (10) to produce the ECAP estimator $\hat p_i$.
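As an illustration of the minimization in (15) and (16), the sketch below solves for the coefficient vector in closed form. To keep the example self-contained it uses a simple polynomial basis vanishing at 0.5 as a stand-in for the natural cubic spline basis; the data and tuning values are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
p_tilde = rng.beta(4, 4, size=500)   # stand-in sample of observed probabilities
n, d, lam = len(p_tilde), 4, 1e-4
j = np.arange(1, d + 1)

# Polynomial basis b_j(x) = (x - 0.5)^j, which satisfies the constraint b(0.5) = 0.
B = (p_tilde[:, None] - 0.5) ** j             # b(p~_i), shape (n, d)
Bp = j * (p_tilde[:, None] - 0.5) ** (j - 1)  # b'(p~_i)

# Penalty matrix Omega_{jk} = integral_0^1 b_j''(x) b_k''(x) dx,
# which is zero whenever j + k is odd.
J, K = np.meshgrid(j, j, indexing="ij")
Omega = np.where(
    (J + K) % 2 == 0,
    J * (J - 1) * K * (K - 1) * 2.0 * 0.5 ** (J + K - 3) / np.maximum(J + K - 3, 1),
    0.0,
)

# Closed-form minimizer (16) of the penalized empirical risk (15).
rhs = ((1 - 2 * p_tilde)[:, None] * B + (p_tilde * (1 - p_tilde))[:, None] * Bp).sum(axis=0)
eta = -np.linalg.solve(B.T @ B + n * lam * Omega, rhs)

def g_hat(x):
    return ((x - 0.5) ** j) @ eta

print(g_hat(0.5))  # exactly 0.0, enforced by the basis constraint b(0.5) = 0
```

The same linear-algebra pattern applies with the actual natural cubic spline basis; only the construction of $b$, $b'$ and $\Omega$ changes.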
2.2.3.2 Estimation of $\lambda$ and $\gamma^*$

In computing (17) and (18) we need to provide estimates for $\gamma^*$ and $\lambda$. We choose $\lambda$ so as to minimize a cross-validated version of the estimated risk (13). In particular, we randomly partition the probabilities into $K$ roughly even groups, $G_1, \ldots, G_K$. Then, for given values of $\lambda$ and $k$, $\hat\eta_{k\lambda}$ is computed via (16), with the probabilities in $G_k$ excluded from the calculation. We then compute the corresponding estimated risk on the probabilities in $G_k$:

$$R_{k\lambda} = \sum_{i \in G_k} \hat h_{ik}^2 + 2 \sum_{i \in G_k} \left[ (1 - 2\tilde p_i) \hat h_{ik} + \tilde p_i (1 - \tilde p_i) \hat h'_{ik} \right],$$

where $\hat h_{ik} = b(\tilde p_i)^T \hat\eta_{k\lambda}$ and $\hat h'_{ik} = b'(\tilde p_i)^T \hat\eta_{k\lambda}$. This process is repeated for $k = 1, \ldots, K$, and

$$R_\lambda = \frac{1}{n} \sum_{k=1}^K R_{k\lambda}$$

is computed as our cross-validated risk estimate. Finally, we choose $\hat\lambda = \arg\min_\lambda R_\lambda$.

To estimate $\gamma^*$ we need a measure of the accuracy of $\tilde p_i$ as an estimate of $p_i$. In some cases that information may be available from previous analyses. For example, if the estimates $\tilde p_i$ were obtained by fitting a logistic regression model, we could compute the standard errors of the estimated coefficients and hence form a variance estimate for each $\tilde p_i$. We would estimate $\gamma^*$ by matching the computed variance estimates to the expression (2) for the conditional variance under the ECAP model. Alternatively, we can use previously observed outcomes of $A_i$ to estimate $\gamma^*$. Suppose that we observe

$$Z_i = \begin{cases} 1, & A_i \text{ occurred}, \\ 0, & A_i \text{ did not occur}, \end{cases} \qquad (19)$$

for $i = 1, \ldots, n$. Then a natural approach is to compute the conditional log-likelihood function for $Z_i$ given $\tilde p_i$. Namely,

$$l_\gamma = \sum_{i : Z_i = 1} \log(\hat\mu_i^\gamma) + \sum_{i : Z_i = 0} \log(1 - \hat\mu_i^\gamma), \qquad (20)$$

where $\hat\mu_i^\gamma$ is the ECAP estimate of $E(p_i \mid \tilde p_i)$ generated by substituting a particular value of $\gamma$ into (17). We then choose the value of $\gamma$ that maximizes (20).

As an example of this approach, consider the ESPN data recording probabilities of victory for various NCAA football teams throughout each season.
To form an estimate of $\gamma^*$ we can take the observed outcomes of the games from last season (or the first couple of weeks of this season if no previous games are available), use these results to generate a set of $Z_i$, and then choose the $\gamma$ that maximizes (20). One could then form ECAP estimates for future games during the season, possibly updating the $\gamma$ estimate as new games are played.

2.2.4 Large Sample Results

In this section we investigate the large sample behavior of the ECAP estimator. More specifically, we show that, under smoothness assumptions on the function $g^*$, the ECAP adjusted probabilities are consistent estimators of the corresponding oracle probabilities, defined in (5). We establish an analogous result for the corresponding values of the loss function, defined in (4). In addition to demonstrating consistency, we also derive the rates of convergence. Our method of proof takes advantage of the theory of empirical processes; however, the corresponding arguments go well beyond a simple application of the existing results.

We let $f^*$ denote the marginal density of the observed $\tilde p_i$ and define the $L_2(\tilde P)$ norm of a given function $u(\tilde p)$ as $\|u\| = \left[ \int_0^1 u^2(\tilde p) f^*(\tilde p) \, d\tilde p \right]^{1/2}$. We denote the corresponding empirical norm, $\left[ (1/n) \sum_{i=1}^n u^2(\tilde p_i) \right]^{1/2}$, by $\|u\|_n$. To simplify the presentation of the results, we define

$$r_n = n^{-4/7} \lambda_n^{-1} + n^{-2/7} + \lambda_n \quad \text{and} \quad s_n = 1 + n^{-4/7} \lambda_n^{-2}.$$

We write $\hat g$ for the minimizer of criterion (14) over all natural cubic spline functions $g$ that correspond to the sequence of $n$ knots located at the observed $\tilde p_i$. For concreteness, we focus on the case where criterion (14) is computed over the entire interval $[0, 1]$. However, all of the results in this section continue to hold if $\hat g$ is determined by only computing the criterion over $[0, 0.5]$, according to the estimation approach described in Section 2.2.3.1. The following result establishes consistency and rates of convergence for $\hat g$ and $\hat g'$.

Theorem 4.
If $g^*$ is twice continuously differentiable on $[0, 1]$, $f^*$ is bounded away from zero, and $n^{-8/21} \ll \lambda_n \ll 1$, then

$$\|\hat g - g^*\|_n = O_p(r_n), \qquad \|\hat g' - g^{*\prime}\|_n = O_p\left( \sqrt{r_n s_n} \right).$$

The above bounds also hold for the $\|\cdot\|$ norm.

Remark 3. The assumption $n^{-8/21} \ll \lambda_n \ll 1$ implies that the error bounds for $\hat g$ and $\hat g'$ are of order $o_p(1)$. When $\lambda_n \asymp n^{-2/7}$, Theorem 4 yields an $n^{-2/7}$ rate of convergence for $\hat g$. This rate matches the optimal rate of convergence for estimating the derivative of a density under the corresponding smoothness conditions [92].

Given a value $\tilde p$ in the interval $(0, 1)$, we define the ECAP estimator, $\hat p = \hat p(\tilde p)$, by replacing $\tilde p_i$, $\gamma^*$, and $g$ with $\tilde p$, $\hat\gamma$, and $\hat g$, respectively, in the expression for the oracle estimator provided by formulas (8), (9) and (10). Thus, we treat $\hat p$ as a random function of $\tilde p$, where the randomness comes from the fact that $\hat p$ depends on the training sample of the observed probabilities $\tilde p_i$. By analogy, we define $p_0$ via (10), with $\tilde p_i$ replaced by $\tilde p$, and view $p_0$ as a (deterministic) function of $\tilde p$. We define the function $W_0(\tilde p)$ as the expected loss for the oracle estimator:

$$W_0(\tilde p) = E\left[ EC(p_0(\tilde p))^2 \mid \tilde p \right],$$

where the expected value is taken over the true $p$ given the corresponding observed probability $\tilde p$. Similarly, we define the random function $\widehat W(\tilde p)$ as the expected loss for the ECAP estimator,

$$\widehat W(\tilde p) = E\left[ EC(\hat p(\tilde p))^2 \mid \tilde p \right],$$

where the expected value is computed given the training sample $\tilde p_1, \ldots, \tilde p_n$ and is again taken over the true $p$ conditional on the corresponding $\tilde p$. The randomness in the function $\widehat W(\tilde p)$ is due to the dependence of $\hat p$ on the training sample.

To state the asymptotic results for $\hat p$ and $\widehat W$, we implement a minor technical modification in the estimation of the conditional variance via formula (9). After computing the value of $\hat\sigma^2$, we set it equal to $\max\{\hat\sigma^2, c\sqrt{r_n s_n}\}$, where $c$ is allowed to be any fixed positive constant.
This ensures that, as the sample size grows, $\hat\sigma^2$ does not approach zero too fast. We note that this technical modification is only used to establish consistency of $\widehat W(\tilde p)$ in the next theorem; all the other results in this section hold both with and without this modification.

Theorem 5. If $g^*$ is twice continuously differentiable on $[0, 1]$, $f^*$ is bounded away from zero, $n^{-8/21} \ll \lambda_n \ll 1$ and $|\hat\gamma - \gamma^*| = o_p(1)$, then

$$\|\hat p - p_0\| = o_p(1) \quad \text{and} \quad \|\hat p - p_0\|_n = o_p(1).$$

If, in addition, $|\hat\gamma - \gamma^*| = O_p(\sqrt{r_n s_n})$, then

$$\int_0^1 \left[ \widehat W(\tilde p) - W_0(\tilde p) \right] f^*(\tilde p) \, d\tilde p = o_p(1) \quad \text{and} \quad \frac{1}{n} \sum_{i=1}^n \left[ \widehat W(\tilde p_i) - W_0(\tilde p_i) \right] = o_p(1).$$

The next result provides the rates of convergence for $\hat p$ and $\widehat W$.

Theorem 6. If $g^*$ is twice continuously differentiable on $[0, 1]$, $f^*$ is bounded away from zero, $n^{-8/21} \ll \lambda_n \ll 1$ and $|\hat\gamma - \gamma^*| = O_p(\sqrt{r_n s_n})$, then

$$\int_\epsilon^{1-\epsilon} \left[ \hat p(\tilde p) - p_0(\tilde p) \right]^2 f^*(\tilde p) \, d\tilde p = O_p(r_n s_n), \qquad \int_\epsilon^{1-\epsilon} \left[ \widehat W(\tilde p) - W_0(\tilde p) \right] f^*(\tilde p) \, d\tilde p = O_p(r_n s_n),$$

$$\frac{1}{n} \sum_{i : \epsilon \le \tilde p_i \le 1 - \epsilon} \left[ \hat p(\tilde p_i) - p_0(\tilde p_i) \right]^2 = O_p(r_n s_n) \quad \text{and} \quad \frac{1}{n} \sum_{i : \epsilon \le \tilde p_i \le 1 - \epsilon} \left[ \widehat W(\tilde p_i) - W_0(\tilde p_i) \right] = O_p(r_n s_n),$$

for each fixed positive $\epsilon$.

Remark 4. The assumption $n^{-8/21} \ll \lambda_n \ll 1$ ensures that all the error bounds are of order $o_p(1)$.

In Theorem 6 we bound the integration limits away from zero and one because the rate of convergence changes as $\tilde p$ approaches those values. However, we note that $\epsilon$ can be set to an arbitrarily small value. The optimal rate of convergence for $\widehat W$ is provided in the following result.

Corollary 2. Suppose that $\lambda_n$ decreases at the rate $n^{-2/7}$ and $|\hat\gamma - \gamma^*| = O_p(n^{-1/7})$. If $f^*$ is bounded away from zero and $g^*$ is twice continuously differentiable on $[0, 1]$, then

$$\int_\epsilon^{1-\epsilon} \left[ \widehat W(\tilde p) - W_0(\tilde p) \right] d\tilde p = O_p(n^{-2/7}) \quad \text{and} \quad \frac{1}{n} \sum_{i : \epsilon \le \tilde p_i \le 1 - \epsilon} \left[ \widehat W(\tilde p_i) - W_0(\tilde p_i) \right] = O_p(n^{-2/7}),$$

for every positive $\epsilon$.

Corollary 2 follows directly from Theorem 6 by balancing out the components in the expression for $r_n$.
2.3 ECAP Extensions

In this section we consider two possible extensions of (1), the model for $\tilde p_i$. In the next subsection we discuss the setting where $\tilde p_i$ can no longer be considered an unbiased estimator for $p_i$, while in the following subsection we suggest a generalization of the beta model.

2.3.1 Incorporating Bias in $\tilde p_i$

So far, we have assumed that $\tilde p_i$ is an unbiased estimate of $p_i$. In practice, probability estimates $\tilde p_i$ may exhibit some systematic bias. For example, in Section 2.5 we examine probability predictions from the FiveThirtyEight.com website on congressional house, senate, and governors races during the 2018 US midterm election. After comparing the actual election results with the predicted probability of a candidate being elected, there is clear evidence of bias in the estimates [89]. In particular, the leading candidate won many more races than would be suggested by the probability estimates. This indicates that the FiveThirtyEight.com probabilities were overly conservative, i.e., that in comparison to $p_i$ the estimate $\tilde p_i$ was generally closer to 0.5; for example, $p_i < E(\tilde p_i \mid p_i) < 0.5$ whenever $p_i < 0.5$.

In this section we generalize (1) to model situations where $E(\tilde p_i \mid p_i) \ne p_i$. To achieve this goal we replace (1) with

$$\tilde p_i \mid p_i \sim \text{Beta}(\alpha_i, \beta_i), \quad \text{where} \quad p_i = h_\theta(\alpha_i \gamma^*) = h_\theta(1 - \beta_i \gamma^*), \qquad (21)$$

$h_\theta(\cdot)$ is a prespecified function, and $\theta$ is a parameter which determines the level of bias of $\tilde p_i$. In particular, (21) implies that for any invertible $h_\theta$,

$$p_i = h_\theta\left( E(\tilde p_i \mid p_i) \right), \qquad (22)$$

so that if $h_\theta(x) = x$, i.e., $h_\theta(\cdot)$ is the identity function, then (21) reduces to (1), and $\tilde p_i$ is an unbiased estimate of $p_i$. To produce a valid probability model, $h_\theta(\cdot)$ needs to satisfy several criteria:

1. $h_0(x) = x$, so that (21) reduces to (1) when $\theta = 0$.

2. $h_\theta(1 - x) = 1 - h_\theta(x)$, ensuring that the probabilities of events $A_i$ and $A_i^c$ sum to 1.

3. $h_\theta(x) = x$ for $x = 0$, $x = 0.5$ and $x = 1$.

4.
$h_\theta(\cdot)$ is invertible for values of $\theta$ in a region around zero, so that $E(\tilde p_i \mid p_i)$ is unique.

The simplest polynomial function that satisfies all these constraints is

$$h_\theta(x) = (1 - 0.5\theta)x - \theta\left[ x^3 - 1.5x^2 \right],$$

which is invertible for $-4 \le \theta \le 2$. Note that for $\theta = 0$ we have $h_0(x) = x$, which corresponds to the unbiased model (1). However, if $\theta > 0$, then $\tilde p_i$ tends to overestimate small $p_i$ and underestimate large $p_i$, so the probability estimates are overly conservative. Alternatively, when $\theta < 0$, then $\tilde p_i$ tends to underestimate small $p_i$ and overestimate large $p_i$, so the probability estimates exhibit excess certainty. Figure 2.4 provides examples of $E(\tilde p_i \mid p_i)$ for three different values of $\theta$, with the green line representing probabilities resulting in excess certainty, the orange line overly conservative probabilities, and the black line unbiased probabilities.

[Figure 2.4: Plots of $E(\tilde p_i \mid p_i)$ as a function of $p_i$ for different values of $\theta$. When $\theta = 0$ (black / solid) the estimates are unbiased. $\theta = 2$ (orange / dashed) corresponds to a setting where $\tilde p_i$ systematically underestimates large values of $p_i$, while $\theta = -3$ (green / dot-dashed) represents a situation where $\tilde p_i$ is an overestimate for large values of $p_i$.]

One of the appealing aspects of this model is that the ECAP oracle (10) can still be used to generate an estimator for $p_i$. The only change is in how $E(p_i \mid \tilde p_i)$ and $\text{Var}(p_i \mid \tilde p_i)$ are computed. The following result generalizes Theorem 2 to the biased setting.

Theorem 7. Suppose that model (21) holds, $p_i$ has a bounded density, and $\mu_i$ and $\sigma_i^2$ are respectively defined as in (8) and (9). Then,

$$E(p_i \mid \tilde p_i) = \mu_i + 0.5\theta\left( 3\sigma_i^2 - 6\mu_i\sigma_i^2 + 3\mu_i^2 - \mu_i - 2\mu_i^3 \right) + O\left( \theta\gamma^{*3/2} \right) \qquad (23)$$

$$\text{Var}(p_i \mid \tilde p_i) = (1 - 0.5\theta)^2 \sigma_i^2 + \theta\sigma_i^2 \, 3\mu_i(1 - \mu_i)\left( 3\theta\mu_i(1 - \mu_i) - 0.5\theta + 1 \right) + O\left( \theta\gamma^{*3/2} \right). \qquad (24)$$
The remainder terms in the above approximations are of smaller order than the leading terms when $\gamma^*$ is small, which is typically the case in practice. As we demonstrate in the proof of Theorem 7, explicit expressions can be provided for the remainder terms. However, the approximation error involved in estimating these expressions is likely to be much higher than any bias from excluding them. Hence, we ignore these terms when estimating $E(p_i \mid \tilde p_i)$ and $\text{Var}(p_i \mid \tilde p_i)$:

$$\widehat{E}(p_i \mid \tilde p_i) = \hat\mu_i + 0.5\theta\left( 3\hat\sigma_i^2 - 6\hat\mu_i\hat\sigma_i^2 + 3\hat\mu_i^2 - \hat\mu_i - 2\hat\mu_i^3 \right) \qquad (25)$$

$$\widehat{\text{Var}}(p_i \mid \tilde p_i) = (1 - 0.5\theta)^2 \hat\sigma_i^2 + \theta\hat\sigma_i^2 \, 3\hat\mu_i(1 - \hat\mu_i)\left( 3\theta\hat\mu_i(1 - \hat\mu_i) - 0.5\theta + 1 \right). \qquad (26)$$

The only remaining issue in implementing this approach involves producing an estimate for $\theta$. However, this can be achieved using exactly the same maximum likelihood approach as the one used to estimate $\gamma^*$, described in Section 2.2.3.2. Thus, we now choose $\theta$ and $\gamma$ to jointly maximize the likelihood function

$$l_{\theta,\gamma} = \sum_{i : Z_i = 1} \log\left( \hat\mu_i^{\theta,\gamma} \right) + \sum_{i : Z_i = 0} \log\left( 1 - \hat\mu_i^{\theta,\gamma} \right), \qquad (27)$$

where $\hat\mu_i^{\theta,\gamma}$ is the bias-corrected ECAP estimate of $E(p_i \mid \tilde p_i)$ from (25), generated by substituting in particular values of $\gamma$ and $\theta$. In all other respects, the bias-corrected version of ECAP is implemented in an identical fashion to the unbiased version.

2.3.2 Mixture Distribution

We now consider another possible extension of (1), where we believe that $\tilde p_i$ is an unbiased estimator for $p_i$ but find the beta model assumption to be unrealistic. In this setting one could potentially model $\tilde p_i$ using a variety of members of the exponential family. However, one appealing alternative is to extend (1) to a mixture of beta distributions:

$$\tilde p_i \mid p_i \sim \sum_{k=1}^K w_k \, \text{Beta}(\alpha_{ik}, \beta_{ik}), \quad \text{where} \quad \alpha_{ik} = \frac{c_k p_i}{\gamma^*}, \quad \beta_{ik} = \frac{1 - c_k p_i}{\gamma^*}, \qquad (28)$$

and $w_k$ and $c_k$ are predefined weights such that $\sum_k w_k = 1$ and $\sum_k w_k c_k = 1$. Note that (1) is a special case of (28) with $K = w_1 = c_1 = 1$.
As $K$ grows, the mixture model can provide as flexible a model as desired, but it also has a number of other appealing characteristics. In particular, under this model it is still the case that $E(\tilde p_i \mid p_i) = p_i$. In addition, Theorem 8 demonstrates that simple closed form solutions still exist for $E(p_i \mid \tilde p_i)$ and $\text{Var}(p_i \mid \tilde p_i)$, and hence also for the oracle ECAP estimator $p_{i0}$.

Theorem 8. Under (28),

$$E(p_i \mid \tilde p_i) = \mu_i \sum_{k=1}^K \frac{w_k}{c_k} \qquad (29)$$

$$\text{Var}(p_i \mid \tilde p_i) = \left( \sigma_i^2 + \mu_i^2 \right) \sum_{k=1}^K \frac{w_k}{c_k^2} - \mu_i^2 \left( \sum_{k=1}^K \frac{w_k}{c_k} \right)^2, \qquad (30)$$

where $\mu_i$ and $\sigma_i^2$ are defined in (8) and (9).

The generalized ECAP estimator can thus be generated by substituting $\hat\mu_i$ and $\hat\sigma_i^2$, given by formulas (17) and (18), into (29) and (30). The only additional complication involves computing values for $w_k$ and $c_k$. For settings with a large enough sample size, this could be achieved using a variant of the maximum likelihood approach discussed in Section 2.2.3.2. However, we do not explore that approach further in this paper.

2.4 Simulation Results

In Section 2.4.1 we compare ECAP to competing methods under the assumption of unbiasedness in $\tilde p_i$. We extend this comparison to the setting where $\tilde p_i$ represents a potentially biased estimate in Section 2.4.2.

[Figure 2.5: Distributions of $p$ used in the simulation: Beta(4,4), 0.5·Beta(6,2) + 0.5·Beta(2,6), and Beta(1.5,1.5).]

2.4.1 Unbiased Simulation Results

In this section our data consist of $n = 1{,}000$ triplets $(p_i, \tilde p_i, Z_i)$ for each simulation. The $p_i$ are generated from one of three possible prior distributions: Beta(4,4), an equal mixture of Beta(6,2) and Beta(2,6), or Beta(1.5,1.5). The corresponding density functions are displayed in Figure 2.5. Recall that ECAP models $\tilde p_i$ as coming from a beta distribution, conditional on $p_i$.
However, in practice there is no guarantee that the observed data will exactly follow this distribution. Hence, we generate the observed data according to

$$\tilde p_i = p_i + p_i^q \left( \tilde p_i^o - p_i \right), \qquad (31)$$

where $\tilde p_i^o \mid p_i \sim \text{Beta}(\alpha, \beta)$ and $q$ is a tuning parameter. In particular, for $q = 0$, (31) generates observations directly from the ECAP model, while larger values of $q$ provide a greater deviation from the beta assumption. In practice we found that setting $q = 0$ can result in $\tilde p$'s that are so small they are effectively zero ($\tilde p_i = 10^{-20}$, for example). ECAP is not significantly impacted by these probabilities but, as we show, other approaches can perform extremely poorly in this scenario. Setting $q > 0$ prevents pathological scenarios and allows us to more closely mimic what practitioners will see in real life. We found that $q = 0.05$ typically gives a reasonable amount of dispersion, so we consider settings where either $q = 0$
In particular the estimated probability is computed using ˆ p JS i = ¯˜ p+(1− c)(˜ p i − ¯˜ p), (32) where ¯˜ p= 1 n P n j=1 ˜ p j andcisatuningparameterchosentooptimizetheestimates. 2 Equation (32) is a convex combination of ˜ p i and the average observed probability ¯˜ p. The JS Opt implementation selects c to minimize the average expected loss in the same fashion as for ECAP Opt, while the JS MLE implementation selects c using the maximum likelihood approach described in Section 2.2.3.2. Note that ECAP Opt and JS Opt represent optimal situations that can not be implemented in practice because they require knowledge of the true distribution of p i . In each simulation run we generate both training and test data sets. Each method is fit on the training data. We then calculate EC(ˆ p i ) 2 for each point in the test data and average over these observations. The results for the three prior distributions, two values of γ ∗ , and two values of q, averaged over 100 simulation runs, are reported in Table 2.1. Since 2 To maintain consistency with ECAP we flip all ˜p i >0.5 across 0.5 before forming ˆ p JS i and then flip the estimate back. 
| γ* | q | Method | Beta(4,4) | 0.5·Beta(6,2) + 0.5·Beta(2,6) | Beta(1.5,1.5) |
|---|---|---|---|---|---|
| 0.005 | 0 | Unadjusted | 0.0116 (0.0001) | 44.9824 (43.7241) | 3.9×10^12 (3.9×10^12) |
| | | ECAP Opt | 0.0095 (0.0001) | 0.0236 (0.0002) | 0.0197 (0.0001) |
| | | JS Opt | 0.0100 (0.0001) | 0.0241 (0.0002) | 0.0204 (0.0002) |
| | | ECAP MLE | 0.0120 (0.0004) | 0.0326 (0.0008) | 0.0294 (0.0009) |
| | | JS MLE | 0.0121 (0.0003) | 1.1590 (0.8569) | 4.8941 (4.7526) |
| | 0.05 | Unadjusted | 0.0100 (0.0001) | 0.0308 (0.0006) | 0.0273 (0.0006) |
| | | ECAP Opt | 0.0085 (0.0000) | 0.0196 (0.0001) | 0.0166 (0.0001) |
| | | JS Opt | 0.0090 (0.0000) | 0.0201 (0.0001) | 0.0172 (0.0001) |
| | | ECAP MLE | 0.0022 (0.0005) | 0.0073 (0.0010) | 0.0084 (0.0011) |
| | | JS MLE | 0.0105 (0.0002) | 0.0265 (0.0006) | 0.0245 (0.0007) |
| 0.03 | 0 | Unadjusted | 2.1×10^8 (2.1×10^8) | 2.4×10^14 (1.6×10^14) | 1.6×10^15 (5.5×10^14) |
| | | ECAP Opt | 0.0391 (0.0002) | 0.0854 (0.0004) | 0.0740 (0.0004) |
| | | JS Opt | 0.0537 (0.0002) | 0.0986 (0.0005) | 0.0899 (0.0005) |
| | | ECAP MLE | 0.0435 (0.0009) | 0.1202 (0.0136) | 0.1203 (0.0152) |
| | | JS MLE | 0.0636 (0.0019) | 1.4×10^13 (1.4×10^13) | 1.2×10^14 (1.1×10^14) |
| | 0.05 | Unadjusted | 0.0887 (0.0010) | 0.3373 (0.0047) | 0.2780 (0.0043) |
| | | ECAP Opt | 0.0364 (0.0002) | 0.0765 (0.0004) | 0.0665 (0.0004) |
| | | JS Opt | 0.0488 (0.0002) | 0.0874 (0.0005) | 0.0801 (0.0005) |
| | | ECAP MLE | 0.0022 (0.0004) | 0.0075 (0.0010) | 0.0078 (0.0014) |
| | | JS MLE | 0.0558 (0.0011) | 0.1213 (0.0066) | 0.1235 (0.0071) |

Table 2.1: Average expected loss for different methods over multiple unbiased simulation scenarios. Standard errors are provided in parentheses.

Since the ECAP Opt and JS Opt approaches both represent oracle-type methods, they should be compared with each other. The ECAP Opt method statistically significantly outperforms its JS counterpart in each of the twelve settings, with larger improvements in the noisy setting where $\gamma^* = 0.03$. The ECAP MLE method is statistically significantly better than the corresponding JS approach in all but five settings.
However, four of these settings correspond to q = 0 and actually represent situations where JS MLE has failed because it has extremely large excess certainty, which impacts both the mean and standard error. In contrast, the performance of the ECAP approach remains stable even in the presence of extreme outliers. Similarly, the ECAP MLE approach statistically significantly outperforms the Unadjusted approach, often by large amounts, except for the five settings with large outliers, which result in extremely bad average performance for the latter method.

2.4.2 Biased Simulation

In this section we extend the results to the setting where the observed probabilities may be biased, i.e., E(p̃_i | p_i) ≠ p_i. To do this we generate p̃_i according to (21) using four different values for θ, {−3, −1, 0, 2}. Recall that θ < 0 corresponds to anti-conservative data, where p̃_i tends to be too close to 0 or 1, θ = 0 represents unbiased observations, and θ > 0 corresponds to conservative data, where p̃_i tends to be too far from 0 or 1. In all other respects our data is generated in an identical fashion to that of the unbiased setting.³

To illustrate the biased setting we opted to focus on the q = 0.05, γ* = 0.005 setting. We also increased the sample size to n = 5,000 because of the increased difficulty of the problem. The two ECAP implementations now require us to estimate three parameters: λ, γ and θ. We estimate λ in the same fashion as previously discussed, while γ and θ are now chosen over a two-dimensional grid of values, with θ restricted to lie between −4 and 2. The two JS methods remain unchanged.

The results, again averaged over 100 simulation runs, are presented in Table 2.2. In the two settings where θ < 0 we note that the unadjusted and JS methods all exhibit significant deterioration in their performance relative to the unbiased θ = 0 scenario. By comparison, the two ECAP methods significantly outperform the JS and unadjusted approaches. A similar pattern is observed for θ > 0.
In this setting all five methods deteriorate, but ECAP is far more robust to the biased setting than unadjusted and JS. It is perhaps not surprising that the bias-corrected version of ECAP outperforms the other methods when the data is indeed biased. However, just as interestingly, even in the unbiased setting (θ = 0) we still observe that ECAP matches or slightly outperforms its JS counterpart, despite the fact that ECAP must estimate θ. This is likely a result of the fact that ECAP is able to accurately estimate θ. Over all simulation runs and settings, ECAP Opt and ECAP MLE respectively averaged absolute errors of only 0.0582 and 0.2016 in estimating θ.

³ Because the observed probabilities are now biased, we replace p_i in (31) with E(p̃_i | p_i).

θ     Method      Beta(4,4)        0.5*Beta(6,2)+0.5*Beta(2,6)   Beta(1.5,1.5)
--------------------------------------------------------------------------------
−3    Unadjusted  0.1749 (0.0005)  0.7837 (0.0025)               0.6052 (0.0030)
      ECAP Opt    0.0019 (0.0000)  0.0109 (0.0000)               0.0086 (0.0000)
      JS Opt      0.0609 (0.0002)  0.2431 (0.0005)               0.1526 (0.0003)
      ECAP MLE    0.0036 (0.0002)  0.0135 (0.0003)               0.0111 (0.0003)
      JS MLE      0.0633 (0.0003)  0.2712 (0.0014)               0.1707 (0.0011)
−1    Unadjusted  0.0319 (0.0001)  0.1389 (0.0007)               0.1130 (0.0008)
      ECAP Opt    0.0051 (0.0000)  0.0150 (0.0000)               0.0124 (0.0001)
      JS Opt      0.0142 (0.0000)  0.0477 (0.0001)               0.0361 (0.0001)
      ECAP MLE    0.0065 (0.0002)  0.0176 (0.0002)               0.0158 (0.0004)
      JS MLE      0.0155 (0.0002)  0.0541 (0.0008)               0.0413 (0.0010)
0     Unadjusted  0.0099 (0.0000)  0.0305 (0.0002)               0.0275 (0.0003)
      ECAP Opt    0.0084 (0.0000)  0.0195 (0.0001)               0.0164 (0.0001)
      JS Opt      0.0088 (0.0000)  0.0199 (0.0001)               0.0171 (0.0001)
      ECAP MLE    0.0102 (0.0003)  0.0229 (0.0003)               0.0206 (0.0005)
      JS MLE      0.0094 (0.0001)  0.0233 (0.0005)               0.0219 (0.0005)
2     Unadjusted  0.0652 (0.0001)  0.2419 (0.0003)               0.1776 (0.0003)
      ECAP Opt    0.0240 (0.0001)  0.0614 (0.0002)               0.0502 (0.0001)
      JS Opt      0.0652 (0.0001)  0.2419 (0.0003)               0.1776 (0.0003)
      ECAP MLE    0.0256 (0.0002)  0.0739 (0.0013)               0.0591 (0.0009)
      JS MLE      0.0652 (0.0001)  0.2419 (0.0003)               0.1776 (0.0003)

Table
2.2: Average expected loss for different methods over multiple biased simulation scenarios.

2.5 Empirical Results

In this section we illustrate ECAP on two real-world data sets. Section 2.5.1 contains our results analyzing ESPN's probability estimates from NCAA football games, while Section 2.5.2 examines probability estimates from the 2018 US midterm elections. Given that for real data p_i is never observed, we need to compute an estimate of EC(p̂_i). Hence, we choose a small window δ, for example δ = [0, 0.02], and consider all observations for which p̃_i falls within δ.⁴ We then estimate p_i via p̄_δ = (1/n_δ) Σ_{i=1}^n Z_i δ_i, where δ_i = I(p̃_i ∈ δ), n_δ = Σ_{i=1}^n δ_i, and Z_i is defined as in (19). Hence we can estimate EC using

    ÊC_δ(p̂_δ) = (p̄_δ − p̂_δ) / p̂_δ,    (33)

where p̂_δ = (1/n_δ) Σ_{i=1}^n p̂_i δ_i denotes the window average of the adjusted estimates.

⁴ In this section, for simplicity of notation, we have flipped all probabilities greater than 0.5, and the associated Z_i, around 0.5, so that δ = [0, 0.02] also includes probabilities between 0.98 and 1.

Figure 2.6: A screenshot of the NCAA football win probabilities publicly available on ESPN's website. USC vs. Texas (2017).

2.5.1 ESPN NCAA Football Data

Each year there are approximately 1,200 Division 1 NCAA football games played within the US. For the last several seasons ESPN has been producing automatic win probability estimates for every game. These probabilities update in real time after every play. Figure 2.6 provides an example of a fully realized game between the University of Southern California (USC) and the University of Texas at Austin (TEX) during the 2017 season. For most of the game the probability of a USC win hovers around 75%, but towards the end of the game the probability starts to oscillate wildly, with both teams having high win probabilities, before USC ultimately wins.⁵ These gyrations are quite common and occasionally result in a team with a high win probability ultimately losing.
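The windowed estimate in equation (33) is simple to compute from the observed probabilities, the adjusted estimates, and the outcome indicators. A minimal Python sketch (our own illustration with hypothetical names; it assumes the probabilities have already been flipped below 0.5 as in footnote 4):

```python
def ec_window(p_tilde, p_hat, z, lo=0.0, hi=0.02):
    """Estimate excess certainty within a window [lo, hi], as in eq. (33).

    p_tilde : observed probabilities (already flipped below 0.5)
    p_hat   : adjusted probability estimates, aligned with p_tilde
    z       : binary outcome indicators, aligned with p_tilde
    """
    idx = [i for i, p in enumerate(p_tilde) if lo <= p <= hi]
    if not idx:
        raise ValueError("no observations fall inside the window")
    p_bar = sum(z[i] for i in idx) / len(idx)         # empirical event rate in window
    p_hat_bar = sum(p_hat[i] for i in idx) / len(idx) # window average of estimates
    return (p_bar - p_hat_bar) / p_hat_bar
```

An estimate near zero indicates the adjusted probabilities in the window match the observed event rate; a value well above zero indicates the estimates are too extreme.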
Of course, even a team with a 99% win probability will end up losing 1% of the time, so these unusual outcomes do not necessarily indicate an error, or a selection bias issue, with the probability estimates.

To assess the accuracy of ESPN's estimation procedure we collected data from the 2016 and 2017 NCAA football seasons. We obtained this unique data set by scraping the win probabilities, and the ultimate winning team, for a total of 1,722 games (about 860 per season), involving an average of approximately 180 probabilities per game. Each game runs for 60 minutes, although the clock is often stopped. For any particular time point t during these 60 minutes, we took the probability estimate closest to t in each of the individual games. We used the entire data set, 2016 and 2017, to compute p̄_δ, which represents the ideal gold standard. However, this estimator is impractical because we would need to collect data over two full years to implement it. By comparison, we used only the 2016 season to fit ECAP and ultimately to compute p̂_δ. We then calculated ÊC_δ(p̂_δ, t) for both the raw ESPN probabilities and the adjusted ECAP estimates. The intuition here is that ÊC_δ(p̂_δ, t) provides a comparison of these estimates to the ideal, but unrealistic, p̄_δ.

In general we found that ÊC_δ(p̂_δ, t) computed on the ESPN probabilities was not systematically different from zero, suggesting ESPN's probabilities were reasonably accurate. However, we observed that, for extreme values of δ, ÊC_δ(p̂_δ, t) was well above zero towards the end of the games. Consider, for example, the solid orange line in Figure 2.7, which plots ÊC_δ(p̂_δ, t) using δ = [0, 0.02] at six different time points during the final minute of these games. We observe that excess certainty is consistently well above zero.

⁵ The game was not chosen at random.
The 90% bootstrap confidence intervals (dashed lines), generated by sampling with replacement from the probabilities that landed inside δ, demonstrate that the difference from zero is statistically significant for most time points. This suggests that towards the end of the game ESPN's probabilities are too extreme, i.e., there are more upsets than would be predicted by their estimates.

Next we applied the unbiased implementation of ECAP, i.e., with θ = 0, separately to each of these six time points and computed ÊC_δ(t) for the associated ECAP probability estimates. To estimate the out-of-sample performance of our method, we randomly picked half of the 2016 games to estimate γ*, and then used ECAP to produce probability estimates on the other half. We repeated this process 100 times and averaged the resulting ÊC_δ(p̂_δ, t) independently for each time point. The solid green line in Figure 2.7 provides the estimated excess certainty. ECAP appears to work well on this data, with excess certainty estimates close to zero. Notice also that ECAP is consistently producing a slightly negative excess certainty, which is actually necessary to minimize the expected loss function (4), as demonstrated in Figure 2.3.

Figure 2.7: Empirical EC in both the unadjusted and ECAP setting with δ = [0, 0.02].

Interestingly, this excess certainty pattern in the ESPN probabilities is no longer apparent in data for the 2018 season, suggesting that ESPN also identified this as an issue and applied a correction to their estimation procedure.

2.5.2 Election Data

Probabilities have increasingly been used to predict election results. For example, news organizations, political campaigns, and others often attempt to predict the probability of a given candidate winning a governor's race, or a seat in the House or Senate.
Among other uses, political parties can use these estimates to optimize their funding allocations across hundreds of different races. In this section we illustrate ECAP using probability estimates produced by the FiveThirtyEight.com website during the 2018 US midterm election cycle. FiveThirtyEight used three different methods, Classic, Deluxe, and Lite, to generate probability estimates for every governor, House, and Senate seat up for election, resulting in 506 probability estimates for each of the three methods.

Interestingly, a previous analysis of this data [89] showed that the FiveThirtyEight probability estimates appeared to be overly conservative, i.e., the leading candidate won more often than would have been predicted by their probabilities. Hence, we should be able to improve the probability estimates using the bias-corrected version of ECAP from Section 2.3.1. We first computed ÊC_δ(p̂_δ) on the unadjusted FiveThirtyEight probability estimates using two different values for δ, i.e., δ_1 = [0, 0.1] and δ_2 = [0.1, 0.2]. We used wider windows for δ in comparison to the ESPN data because we only had one third as many observations. The results for the three methods used by FiveThirtyEight are shown in Table 2.3.

Method    Adjustment    δ_1        δ_2
------------------------------------------
Classic   Unadjusted    -0.6910    -0.8361
          ECAP          -0.2881    -0.0758
Deluxe    Unadjusted    -0.4276    -0.8137
          ECAP          -0.0371     0.1814
Lite      Unadjusted    -0.8037    -0.8302
          ECAP          -0.3876    -0.1118

Table 2.3: Bias-corrected ECAP adjustment of FiveThirtyEight's 2018 election probabilities. Reported values are average ÊC_δ.

Notice that for all three methods and both values of δ the unadjusted estimates are far below zero, and several are close to −1, the minimum possible value. These results validate the previous analysis suggesting the FiveThirtyEight estimates are systematically conservatively biased. Next we applied ECAP separately to each of the three sets of probability estimates, with the value of θ chosen using the MLE approach previously described. Again the results are provided in Table 2.3.
ECAP appears to have significantly reduced the level of bias, with most values of ÊC_δ(p̂_δ) close to zero, and in one case actually slightly above zero. For the Deluxe method with δ_1, ECAP has an almost perfect level of excess certainty. For the Classic and Lite methods, θ = 2 was chosen by ECAP for both values of δ, representing the largest possible level of bias correction. For the Deluxe method, ECAP selected θ = 1.9. Figure 2.8 demonstrates the significant level of correction that ECAP applies to the Classic method FiveThirtyEight estimates. For example, ECAP adjusts probability estimates of 0.8 to 0.89 and estimates of 0.9 to 0.97.

Figure 2.8: ECAP bias-corrected probabilities vs. original FiveThirtyEight probabilities from the Classic method.

2.6 Discussion

In this chapter, we have convincingly demonstrated both theoretically and empirically that probability estimates are subject to selection bias, even when the individual estimates are unbiased. Our proposed ECAP method applies a novel non-parametric empirical Bayes approach to adjust both biased and unbiased probabilities, and hence produce more accurate estimates. The results in both the simulation study and on real data sets demonstrate that ECAP can successfully correct for selection bias, allowing us to use the probabilities with a higher level of confidence when selecting extreme values.

There are a number of possible areas for future work. For example, the ESPN data contains an interesting time series structure to the probabilities, with each game consisting of a probability function measured over 60 minutes. Our current method treats each time point independently and adjusts the probabilities accordingly. However, one may be able to leverage more power by incorporating all time points simultaneously using some form of functional data analysis.
Another potential area of exploration involves the type of data on which ECAP is implemented. For example, consider a setting involving a large number of hypothesis tests and associated p-values, p̃_1, ..., p̃_n. There has been much discussion recently of the limitations around using p-values. A superior approach would involve thresholding based on the posterior probability of the null hypothesis being true, i.e., p_i = P(H_0i | X_i). Of course, in general, p_i is difficult to compute, which is why we use the p-value p̃_i. However, if we were to treat p̃_i as a, possibly biased, estimate of p_i, then it may be possible to use a modified version of ECAP to estimate p_i. If such an approach could be implemented it would likely have a significant impact in the area of multiple hypothesis testing.

Chapter 3

Asymmetric error control under imperfect supervision: a label-noise-adjusted Neyman-Pearson umbrella algorithm

Label noise in data has long been an important problem in supervised learning applications, as it affects the effectiveness of many widely used classification methods. Recently, important real-world applications, such as medical diagnosis and cybersecurity, have generated renewed interest in the Neyman-Pearson (NP) classification paradigm, which constrains the more severe type of error (e.g., the type I error) under a preferred level while minimizing the other (e.g., the type II error). However, there has been little research on the NP paradigm under label noise. It is somewhat surprising that even when common NP classifiers ignore the label noise in the training stage, they are still able to control the type I error with high probability. However, the price they pay is excessive conservativeness of the type I error and hence a significant drop in power (i.e., 1 − type II error). Assuming that domain experts provide lower bounds on the corruption severity, we propose the first theory-backed algorithm that adapts most state-of-the-art classification methods to the training label noise under the NP paradigm.
The resulting classifiers not only control the type I error with high probability under the desired level but also improve power.

3.1 Introduction

Most classification methods assume a perfectly labeled training dataset. Yet, it is estimated that in real-world databases around five percent of labels are incorrect [77, 82]. Labeling errors might come from insufficient guidance to human coders, poor data quality, or human mistakes in decisions, among others [15, 49, 16]. Specifically, in the medical field, a 2011 survey of more than 6,000 physicians found that half said they encountered diagnostic errors at least once a month [70]. The existence of labeling errors in training data is often referred to as label noise, imperfect labels, or imperfect supervision. It belongs to a more general data corruption problem, which refers to "anything which obscures the relationship between description and class" [49].

The study of label noise in supervised learning has been a vibrant field in academia. On the empirical front, researchers have found that some statistical learning methods, such as quadratic discriminant analysis [65] and k-NN [76], can be greatly affected by label noise and have accuracy seriously reduced, while other methods, such as linear discriminant analysis [64], are more label-noise tolerant. Moreover, one can modify AdaBoost [25], the perceptron algorithm [61], and neural networks [96] so that they are more tolerant to label noise. Data cleansing techniques were also developed, such as in [44] and [17]. On the theoretical front, [75] provided a guarantee for risk minimization in the setting of convex surrogates. [71] proved label-noise tolerance of risk minimization for certain types of loss functions, and [41] extended the result by considering more loss types. [69] proposed learning methods with importance reweighting which can minimize the risk. [14] studied intensely the class-conditional corruption model, a model that many works on label noise are based on.
In particular, theoretical results about parameter estimation and consistency of classifiers under this model were presented in their work. Most recently, [22] derived innovative theory of excess risk for general classifiers.

In many classification settings, one type of error may have far worse consequences than the other. For example, a biomedical diagnosis/prognosis that misidentifies a benign tumor as malignant will cause distress and potentially unnecessary medical procedures, but the alternative, where a malignant tumor is classified as benign, will have far worse outcomes. Other related predictive applications include cybersecurity and finance. Despite great advances in the label-noise classification literature, to our knowledge, no classifier has been constructed to deal with this asymmetry in error importance under label noise so as to control the level of the more severe error type.

In this paper, we concentrate on the classification setting involving both mislabeled outcomes and error importance asymmetry. The Neyman-Pearson (NP) paradigm [23, 87], which controls the false-negative rate (FNR, a.k.a. type I error¹) under some desired level while minimizing the false-positive rate (FPR, a.k.a. type II error), provides a natural approach to this problem. However, to the best of our knowledge, there has been no work that studies how label noise issues affect the control of the more severe FNR. We show that if one trains a standard NP classifier on corrupted labels (e.g., the NP umbrella algorithm [99]), then the actual achieved FNR is far below the control target, resulting in a very high, and undesirable, FPR. This problem motivates us to devise a new label-noise-adjusted umbrella algorithm that corrects for the labeling errors to produce a lower FPR while still controlling the FNR.

The construction of such an algorithm is challenging because we must identify the optimal correction level without any training data from the uncorrupted distribution.
To address this challenge, we employ a common class-conditional noise model and derive the population-level difference between the type I errors of the true and corrupted labels. Based on this difference, we propose a sample-based correction term that, even without observing any uncorrupted labels, can correctly adjust the NP umbrella algorithm to significantly reduce the FPR while still controlling the FNR.

¹ Note that type I error in our work is defined to be the conditional probability of misclassifying a 0 instance as class 1. Moreover, we code the more severe class as class 0. In the disease diagnosis example, the disease class would be class 0.

Our approach has several advantages. First, it is the first theory-backed methodology in the label noise setting to control population-level type I error (i.e., FNR) regarding the true labels. Concretely, we can show analytically that the new algorithm produces classifiers that have a high probability of controlling the FNR below the desired threshold with a FPR lower than that provided by the original NP umbrella algorithm. Second, when there are no labeling errors, our new algorithm reduces to the original NP algorithm. Finally, we demonstrate, on both simulated and real-world data, that under the NP paradigm the new algorithm dominates the original unadjusted one and competes favorably against existing methods which handle label noise in classification.

The rest of this chapter is organized as follows. In Section 3.2, we introduce some notation and a corruption model to study the label noise. In Section 3.3, we demonstrate the ineffectiveness of the original NP umbrella algorithm under label noise and propose a new label-noise-adjusted version. The validity and the high-probability type I error control property of the new algorithm are established in Section 3.4. Simulation and real data analysis are conducted in Section 3.5, followed by a Discussion section. All proofs, additional numerical results, and technical results are relegated to the Appendix.
3.2 Notation and Corruption Model

Let (X, Y, Ỹ) be a random triplet, where X ∈ 𝒳 ⊂ IR^d represents features, Y ∈ {0,1} encodes the true class labels, and Ỹ ∈ {0,1} the corrupted ones. Note that in our setting, we cannot observe Y; the observations come from (X, Ỹ). Denote X^0 ≜ X | (Y = 0) and X^1 ≜ X | (Y = 1). Similarly, denote X̃^0 ≜ X | (Ỹ = 0) and X̃^1 ≜ X | (Ỹ = 1). Denote by IP and IE generic probability measure and expectation whose meanings depend on the context. For any Borel set A ⊂ 𝒳, we denote

    P_0(A) = IP(X ∈ A | Y = 0),    P_1(A) = IP(X ∈ A | Y = 1),
    P̃_0(A) = IP(X ∈ A | Ỹ = 0),    P̃_1(A) = IP(X ∈ A | Ỹ = 1).

Then, we denote by F_0, F_1, F̃_0 and F̃_1 their respective distribution functions and by f_0, f_1, f̃_0 and f̃_1 the density functions, assuming they exist. Moreover, for a measurable function T : 𝒳 → IR, we denote, for any z ∈ IR,

    F^T_0(z) = P_0(T(X) ≤ z),    F^T_1(z) = P_1(T(X) ≤ z),
    F̃^T_0(z) = P̃_0(T(X) ≤ z),    F̃^T_1(z) = P̃_1(T(X) ≤ z).

Since the effect of, and adjustment to, the label noise depend on the type and severity of corruption, we need to specify a corruption model to work with. Our choice for this work is the class-conditional noise (contamination) model, which is specified in the next assumption.

Assumption 3.2. There exist constants m_0, m_1 ∈ [0,1] such that for any Borel set A ⊂ 𝒳,

    P̃_0(A) = m_0 P_0(A) + (1 − m_0) P_1(A)  and  P̃_1(A) = m_1 P_0(A) + (1 − m_1) P_1(A).    (1)

Furthermore, assume m_0 > m_1, but both quantities can be unknown. Moreover, let m#_0, m#_1 ∈ [0,1] be known constants such that m#_0 ≥ m_0 and m#_1 ≤ m_1.

Example 1 (An example of Assumption 3.2). Let X^0 ∼ N(μ_0, σ²) and X^1 ∼ N(μ_1, σ²), where μ_0, μ_1 ∈ IR and σ > 0. Then F̃_0(z) = m_0 Φ((z − μ_0)/σ) + (1 − m_0) Φ((z − μ_1)/σ) and F̃_1(z) = m_1 Φ((z − μ_0)/σ) + (1 − m_1) Φ((z − μ_1)/σ), where Φ(·) is the distribution function of N(0,1). With the choice of μ_0 = 0, μ_1 = 1, σ = 1, m_0 = 0.9, and m_1 = 0.05, the density functions f_0, f̃_0, f_1 and f̃_1 are plotted in Figure 3.1.
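The corrupted distribution functions in Example 1 are plain two-component Gaussian mixtures, so they can be evaluated in closed form from the standard normal CDF. A minimal Python sketch (our own illustration; the function names are ours), defaulting to Example 1's parameters:

```python
import math

def norm_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def corrupted_cdf(z, m, mu0=0.0, mu1=1.0, sigma=1.0):
    """Distribution function of a corrupted class under model (1):
    a mixture with weight m on true class 0, N(mu0, sigma^2), and
    weight 1-m on true class 1, N(mu1, sigma^2)."""
    return m * norm_cdf((z - mu0) / sigma) + (1 - m) * norm_cdf((z - mu1) / sigma)
```

With m = 1 the corrupted class coincides with true class 0, and with m = 0 it coincides with true class 1, matching the no-noise endpoints of the model.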
Note that equation (1) specifies perhaps the simplest model for label noise in supervised learning. Here, m_0 and m_1 represent the severity of corruption levels. Concretely, m_0 can be interpreted as the proportion of true 0 observations among corrupted 0 observations, and m_1 the proportion of true 0 observations among corrupted 1 observations. The assumption m_0 > m_1 means that corrupted class 0 resembles true class 0 more than corrupted class 1 does, and that corrupted class 1 resembles true class 1 more than corrupted class 0 does. However, this assumption does not mean that corrupted class 0 resembles true class 0 more than it resembles true class 1 (i.e., m_0 > 1/2), or that corrupted class 1 resembles true class 1 more than it resembles true class 0 (i.e., m_1 < 1/2). Note that, by the way our model is written, m_0 = 1 and m_1 = 0 correspond to the no-label-noise situation; as such, the roles of m_0 and m_1 are not symmetric. Hence, the assumptions m#_0 ≥ m_0 and m#_1 ≤ m_1 mean that we know some lower bounds of the corruption levels.

Figure 3.1: Density plots in Example 1. True (lighter and solid) and corrupted (darker and dashed).

The class-conditional label noise model has been widely adopted in the literature [75, 69, 14]. We note here that the assumption m_0 > m_1 aligns with the total noise assumption π_0 + π_1 < 1 in [14], as π_0 and π_1 in their work correspond to 1 − m_0 and m_1 in Assumption 3.2, respectively. In [75] and [69], the label noise was modeled through the label flipping probabilities: μ_i = IP(Ỹ = 1 − i | Y = i), i = 0, 1. This alternative formulation is related to our formulation via Bayes' rule. An in-depth study of the class-conditional label noise model, including mutual irreducibility and identifiability, was presented in [14]. Moreover, [14] developed a noisy-label-trained classifier based on weighted cost-sensitive surrogate loss and established its consistency.
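The Bayes' rule connection just mentioned can be made explicit: with class prior π_0 = IP(Y = 0) and flip probabilities μ_0 = IP(Ỹ = 1 | Y = 0) and μ_1 = IP(Ỹ = 0 | Y = 1), Bayes' theorem gives m_0 = π_0(1 − μ_0) / (π_0(1 − μ_0) + (1 − π_0)μ_1) and m_1 = π_0 μ_0 / (π_0 μ_0 + (1 − π_0)(1 − μ_1)). A small Python sketch of this conversion (our own; the thesis itself does not provide code):

```python
def flip_probs_to_mixture(pi0, mu0, mu1):
    """Convert label-flip probabilities to the mixture weights of model (1).

    pi0 : IP(Y = 0), the true class 0 prior
    mu0 : IP(Ytilde = 1 | Y = 0), chance a true 0 label is flipped to 1
    mu1 : IP(Ytilde = 0 | Y = 1), chance a true 1 label is flipped to 0
    Returns (m0, m1) = (IP(Y = 0 | Ytilde = 0), IP(Y = 0 | Ytilde = 1)).
    """
    m0 = pi0 * (1 - mu0) / (pi0 * (1 - mu0) + (1 - pi0) * mu1)
    m1 = pi0 * mu0 / (pi0 * mu0 + (1 - pi0) * (1 - mu1))
    return m0, m1
```

With no flipping (μ_0 = μ_1 = 0) this recovers the no-noise endpoints m_0 = 1 and m_1 = 0.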
Similarly, [75] provided two methods to train classifiers, both relying on classification-calibrated surrogate loss; bounds for the respective excess risks of these two methods were also given. Moreover, [69] proposed an importance-reweighting method and extended the result in [75] to all surrogate losses. Other than [14], which briefly discussed the NP paradigm at the population level, in all the aforementioned papers, though loss functions vary, the goal of classification is to minimize the overall risk. Our work focuses on the NP paradigm. Moreover, we focus on high-probability control of the type I error based on finite samples, in contrast to the asymptotic results in the literature.

In this work, we take the perspective that domain experts can provide under-estimates of corruption levels. In the literature, there are existing methods to estimate these levels. For example, [69] and [14] developed methods to estimate the π_i's and μ_i's, and showed consistency of their estimators. In numerical studies, we apply the method in [69] to estimate m_0 and m_1.² Numerical evidence shows that using these estimators in our proposed algorithm fails to establish a high-probability control of the true type I error. In fact, even using consistent and unbiased estimators of m_0 and m_1 as inputs of our proposed algorithm would not be able to control the true type I error with high probability. One such case is demonstrated in Simulation 8 of the Appendix, where the estimators for m_0 and m_1 are normally distributed and centered at the true values. To have high-probability control of the true type I error, we do need the "under-estimates" of corruption levels as in Assumption 3.2.

² Note that though their method targets the μ_i's, estimates of the m_i's in equation (1) can be constructed from those of the μ_i's by Bayes' theorem.

3.3 Methodology

In this section, we first formally introduce the Neyman-Pearson (NP) classification paradigm and review the NP umbrella algorithm [99] for the uncorrupted label scenario (Section 3.3.1).
Then we provide an example demonstrating that, in the presence of label noise, naively implementing the NP umbrella algorithm leads to excessively conservative type I error, i.e., type I error much smaller than the control target α. We analyze and capitalize on this phenomenon, and present new noise-adjusted versions of the NP umbrella algorithm: Algorithm 1 for known corruption levels (Section 3.3.2) and Algorithm 1# for unknown corruption levels (Section 3.3.3). Algorithm 1 can be considered as a special case of Algorithm 1#: m#_0 = m_0 and m#_1 = m_1.

A few additional notations are introduced to facilitate our discussion. A classifier φ : 𝒳 → {0,1} maps from the feature space to the label space. The (population-level) type I and II errors of φ(·) regarding the true labels (a.k.a. true type I and II errors) are respectively R_0(φ) = P_0(φ(X) ≠ Y) and R_1(φ) = P_1(φ(X) ≠ Y). The (population-level) type I and II errors of φ(·) regarding the corrupted labels (a.k.a. corrupted type I and II errors) are respectively R̃_0(φ) = P̃_0(φ(X) ≠ Ỹ) and R̃_1(φ) = P̃_1(φ(X) ≠ Ỹ). In verbal discussion in this paper, type I error without any suffix refers to type I error regarding the true labels.

3.3.1 The NP umbrella algorithm without label noise

The NP paradigm [23, 87] aims to mimic the NP oracle

    φ*_α ∈ argmin_{φ : R_0(φ) ≤ α} R_1(φ),

where α ∈ (0,1) is a user-specified level that reflects the priority towards the type I error. In practice, with or without label noise, based on training data of finite sample size, it is usually impossible to ensure R_0(·) ≤ α almost surely. Instead, we aim to control the type I error with high probability. Recently, the NP umbrella algorithm [99] has attracted significant attention.³
This algorithm works in conjunction with any score-based classification method (e.g., logistic regression, support vector machines, or random forest) to compress a d-dimensional feature measurement to a 1-dimensional score, and then threshold the score to classify. Specifically, given a (score-based) classification method, the NP umbrella algorithm uses a model-free order statistics approach to decide the threshold, attaining a high-probability control on type I error with minimum type II error for that method. Moreover, when coupled with a classification method that matches the underlying data distribution, the NP umbrella algorithm also achieves a diminishing excess type II error, i.e., R_1(φ̂_α) − R_1(φ*_α) → 0. For example, [100] showed that under a linear discriminant analysis (LDA) model, an LDA classifier with the score threshold determined by the NP umbrella algorithm satisfies both the control on type I error and a diminishing excess type II error.⁴

³ At the time of writing, the NP umbrella package has been downloaded over 35,000 times.

Next we review the implementation of the NP umbrella algorithm. Let S^0 = {X^0_j}_{j=1}^{M_0} and S^1 = {X^1_j}_{j=1}^{M_1} respectively be the uncorrupted observations in classes 0 and 1, where M_0 and M_1 are the numbers of observations from each class.⁵ Then, given a classification method (i.e., base algorithm, e.g., logistic regression), the NP umbrella algorithm is implemented by randomly splitting the class 0 data S^0 into two parts: S^0_b and S^0_t. The first part, S^0_b, together with S^1, is used to train the base algorithm, while the second part, S^0_t, determines the threshold candidates. Specifically, we train a base algorithm with scoring function T̂(·) (e.g., the sigmoid function in logistic regression) using S^0_b ∪ S^1, apply T̂(·) on S^0_t (|S^0_t| = n) to get threshold candidates {t_1, ..., t_n}, and sort them in increasing order {t_(1), ..., t_(n)}.
Then the NP umbrella algorithm proposes the classifier φ̂_{k*}(·) = 1I{T̂(·) > t_(k*)}, where

    k* = min{ k ∈ {1, ..., n} : Σ_{j=k}^{n} C(n, j) (1 − α)^j α^{n−j} ≤ δ },    (2)

in which C(n, j) denotes the binomial coefficient and δ is a user-specified tolerance probability of the type I error exceeding α. The key to this approach is that [99] established, for all φ̂_k(·) = 1I{T̂(·) > t_(k)} with k ∈ {1, ..., n},

    IP(R_0(φ̂_k) > α) ≤ Σ_{j=k}^{n} C(n, j) (1 − α)^j α^{n−j},

where IP corresponds to random draws of S^0 and S^1, as well as potential randomness in the classification method (e.g., random forest), and the inequality becomes an equality when T̂ is continuous almost surely. In view of this inequality and the definition of k*, we have IP(R_0(φ̂_{k*}) > α) ≤ δ, and φ̂_{k*} achieves the smallest type II error among the φ̂_k's that respect the (1 − δ) probability control of the type I error.

⁴ These two properties together were coined as the NP oracle inequalities by [83]. Classifiers with these properties were constructed with non-parametric assumptions in [98] and [105].

⁵ Note that the uncorrupted data S^0 and S^1 are not available in our present label noise setting and we only use them here for review purposes.

Figure 3.2: The original NP umbrella algorithm vs. a label-noise-adjusted version for Example 2. The plots in the left panel (blue) are the true type I and II errors for the original NP umbrella algorithm. The plots in the right panel (orange) are the true type I and II errors for the label-noise-adjusted NP umbrella algorithm with known corruption levels. The black dot and vertical bar in every violin represent the mean and standard deviation, respectively.
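Equation (2) depends only on n, α, and δ, so k* can be found by a direct scan over the binomial tail sums. A small self-contained Python sketch (our own illustration, not code from the thesis):

```python
from math import comb

def np_threshold_order(n, alpha, delta):
    """Smallest order-statistic index k* satisfying eq. (2):
    sum_{j=k}^{n} C(n, j) (1 - alpha)^j alpha^(n - j) <= delta.
    Returns None when even k = n violates the bound (n too small)."""
    for k in range(1, n + 1):
        tail = sum(comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
                   for j in range(k, n + 1))
        if tail <= delta:
            return k
    return None
```

For α = δ = 0.05 the smallest usable left-out class 0 sample size is n = 59, since the tail at k = n equals (1 − α)^n, which first drops below δ when n ≥ log δ / log(1 − α).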
In the top row, the horizontal black line is $\alpha = 0.05$ and the boundaries between lighter and darker color in each violin plot mark the $1 - \delta = 95\%$ quantiles.

type I error. We call this algorithm the original NP umbrella algorithm to contrast it with the newly developed versions.

3.3.2 Algorithm 1: label-noise-adjusted NP umbrella algorithm with known corruption levels

Returning to our errors-in-labels problem, one may ask what would happen if we directly applied the original NP umbrella algorithm in the label noise setting. The results are mixed. While the algorithm successfully controls type I error, it tends to be massively conservative, leading to very low type I errors but high type II errors. The next example illustrates this phenomenon.

Example 2. Let $X^0 \sim \mathcal{N}(0,1)$ and $X^1 \sim \mathcal{N}(2,1)$, $m_0 = 0.85$, $m_1 = 0.15$, $\alpha = 0.05$ and $\delta = 0.05$. For simplicity, we use the identity scoring function $\hat T(X) = X$. We generate $N \in \{200, 500, 1000, 2000\}$ corrupted class 0 observations and train a classifier $\hat\phi_{k^*}(\cdot)$ based on them. Due to normality, we can analytically calculate the type I and II errors with respect to the true labels. The above steps are repeated 1,000 times for every value of $N$ to produce the violin plots of both errors shown in the left panel of Figure 3.2. Clearly, all the achieved true type I errors are much lower than the control target $\alpha$, and the true type II errors are very high.⁶

The phenomenon illustrated in the left panel of Figure 3.2 is not a contrived one. Indeed, under the class-conditional noise model (i.e., Assumption 3.2), at the same threshold level, the tail probability of corrupted class 0 is greater than that of true class 0, since the corrupted 0 distribution is a mixture of the true 0 and 1 distributions. Figure 3.3 provides further illustration. In this figure, the black vertical line ($x = 2.52$) marks the threshold of the classifier $\mathbb{1}\{X > 2.52\}$, whose corrupted type I error (i.e., the right-tail probability under the orange dashed curve) is 0.05.
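This threshold can be verified numerically. The sketch below (ours, not the authors' code; standard library only) bisects for the threshold whose corrupted type I error, i.e., the right tail of the mixture $0.85\,\mathcal{N}(0,1) + 0.15\,\mathcal{N}(2,1)$, equals 0.05, and evaluates the true type I error at the same point.

```python
import math

m0 = 0.85
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal CDF

def corrupted_type1(t):
    """Right-tail probability of the corrupted class 0 mixture 0.85*N(0,1) + 0.15*N(2,1)."""
    return 1 - (m0 * Phi(t) + (1 - m0) * Phi(t - 2))

# Bisect for the threshold whose corrupted type I error equals 0.05;
# corrupted_type1 is strictly decreasing in t, so bisection applies.
lo, hi = 0.0, 6.0
for _ in range(60):
    mid = (lo + hi) / 2
    if corrupted_type1(mid) > 0.05:
        lo = mid
    else:
        hi = mid
t = (lo + hi) / 2          # close to the 2.52 marked in Figure 3.3
true_type1 = 1 - Phi(t)    # the true type I error at the same threshold
```

The computed threshold lands near 2.52, and the true type I error there is far below the 0.05 achieved on the corrupted distribution, which is exactly the conservativeness seen in Example 2.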
In contrast, its true type I error (i.e., the right-tail probability under the blue solid curve) is much smaller.

The above observation motivates us to create new label-noise-adjusted NP umbrella algorithms by carefully studying the discrepancy between the true and corrupted type I errors, whose population-level relation is channeled by the class-conditional noise model and can be estimated from data with corrupted labels alone. We first develop a version for known corruption levels (i.e., Algorithm 1) and then a variant for unknown corruption levels (i.e., Algorithm 1#). Although the latter variant is suitable for most applications, we believe that presenting the known corruption level version first streamlines the reasoning and presentation.

For methodology and theory development, we assume the following sampling scheme. Let $\tilde{\mathcal{S}}^0 = \{\tilde X^0_j\}_{j=1}^{N_0}$ be corrupted class 0 observations and $\tilde{\mathcal{S}}^1 = \{\tilde X^1_j\}_{j=1}^{N_1}$ corrupted class 1 ones.

⁶ To make a contrast, we also plot in the right panel of Figure 3.2 the true type I and II errors of $\hat\phi_{k_*}(\cdot)$, the classifier constructed by the label-noise-adjusted NP umbrella algorithm with known corruption levels, to be introduced in the next section. The details of generating the $\hat\phi_{k_*}(\cdot)$'s are skipped here, except we reveal that corrupted class 1 observations, in addition to the corrupted class 0 observations, are also needed to construct the thresholds.

[Figure 3.3 about here: densities of the true and corrupted class 0 distributions, with the classification threshold marked.]

Figure 3.3: The blue solid curve is the density of true class 0 (i.e., $\mathcal{N}(0,1)$) and the orange dashed curve is the density of corrupted class 0 (i.e., a mixture of $\mathcal{N}(0,1)$ and $\mathcal{N}(2,1)$ with $m_0 = 0.85$). The black vertical line marks the threshold of the classifier $\mathbb{1}\{X > 2.52\}$, whose corrupted type I error is 0.05.

The sample sizes $N_0$ and $N_1$ are considered to be non-random, and we assume that all observations in $\tilde{\mathcal{S}}^0$ and $\tilde{\mathcal{S}}^1$ are independent. Then, we divide $\tilde{\mathcal{S}}^0$ into three random disjoint non-empty subsets.
The first two parts, $\tilde{\mathcal{S}}^0_b$ and $\tilde{\mathcal{S}}^0_t$, are used to train the base algorithm and to determine the threshold candidates, respectively. The third part, $\tilde{\mathcal{S}}^0_e$, is used to estimate a correction term that accounts for the label noise. Similarly, we randomly divide $\tilde{\mathcal{S}}^1$ into two disjoint non-empty subsets: $\tilde{\mathcal{S}}^1_b$ and $\tilde{\mathcal{S}}^1_e$. Let $\hat T(\cdot)$ be a scoring function trained on $\tilde{\mathcal{S}}_b = \tilde{\mathcal{S}}^0_b \cup \tilde{\mathcal{S}}^1_b$. We apply $\hat T(\cdot)$ to the elements of $\tilde{\mathcal{S}}^0_t$ and sort the resulting scores in increasing order: $\{t_{(1)}, \ldots, t_{(n)}\}$, where $n = |\tilde{\mathcal{S}}^0_t|$.⁷ These serve as the threshold candidates, just as in the original NP umbrella algorithm. However, instead of $k^*$, the label-noise-adjusted NP umbrella algorithm with known corruption levels takes the order $k_*$ defined by
$$k_* = \min\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha\},$$
where $\alpha_{k,\delta}$⁸ satisfies
$$\sum_{j=k}^{n} \binom{n}{j}\, \alpha_{k,\delta}^{\,n-j}\, (1 - \alpha_{k,\delta})^j = \delta, \qquad (3)$$
$\hat D^+(\cdot) = \hat D(\cdot) \vee 0 := \max(\hat D(\cdot), 0)$ and
$$\hat D(\cdot) = \frac{1 - m_0}{m_0 - m_1}\Big(\hat{\tilde F}^{\hat T}_0(\cdot) - \hat{\tilde F}^{\hat T}_1(\cdot)\Big),$$
in which $\hat{\tilde F}^{\hat T}_0(\cdot)$ and $\hat{\tilde F}^{\hat T}_1(\cdot)$ are empirical estimates of $\tilde F^{\hat T}_0(\cdot)$ and $\tilde F^{\hat T}_1(\cdot)$ based on $\tilde{\mathcal{S}}^0_e$ and $\tilde{\mathcal{S}}^1_e$, respectively.

⁷ In Appendix B.1, we summarize the notation related to the sampling scheme for the readers' convenience.
Algorithm 1: Label-noise-adjusted NP umbrella algorithm with known corruption levels

Input:
  S̃0: sample of corrupted class 0 observations
  S̃1: sample of corrupted class 1 observations
  α: type I error upper bound, 0 < α < 1
  δ: type I error violation rate target, 0 < δ < 1
  m0: probability of a corrupted class 0 sample being of true class 0
  m1: probability of a corrupted class 1 sample being of true class 0

 1: S̃0_b, S̃0_t, S̃0_e ← random split of S̃0
 2: S̃1_b, S̃1_e ← random split of S̃1
 3: S̃_b ← S̃1_b ∪ S̃0_b                               // combine S̃0_b and S̃1_b as S̃_b
 4: T̂(·) ← base classification algorithm(S̃_b)         // train a scoring function on S̃_b
 5: T_t = {t_1, ..., t_n} ← T̂(S̃0_t)                   // apply T̂ to every entry of S̃0_t
 6: {t_(1), ..., t_(n)} ← sort(T_t)
 7: T0_e ← T̂(S̃0_e)
 8: T1_e ← T̂(S̃1_e)                                    // apply T̂ to all elements of S̃0_e and S̃1_e
 9: for k in {1, ..., n} do
10:   α_{k,δ} ← BinarySearch(δ, k, n)                  // compute α_{k,δ} through binary search
11:   F̂0(t_(k)) ← |T0_e|⁻¹ · Σ_{t ∈ T0_e} 1{t ≤ t_(k)}
12:   F̂1(t_(k)) ← |T1_e|⁻¹ · Σ_{t ∈ T1_e} 1{t ≤ t_(k)}    // compute the empirical distributions
13:   D̂(t_(k)) ← (1 − m0)(m0 − m1)⁻¹ (F̂0(t_(k)) − F̂1(t_(k)))   // estimate of R̃0 − R0
14:   D̂⁺(t_(k)) ← D̂(t_(k)) ∨ 0                        // if D̂(t_(k)) is negative, set it to 0
15: end for
16: k_* ← min{k ∈ {1,...,n} : α_{k,δ} − D̂⁺(t_(k)) ≤ α}    // select the order
17: φ̂_{k_*}(·) ← 1{T̂(·) > t_(k_*)}                    // construct an NP classifier
Output: φ̂_{k_*}(·)

⁸ The existence and uniqueness of $\alpha_{k,\delta}$ are ensured by Lemma 7 in the Appendix.

The entire construction process of $\hat\phi_{k_*}(\cdot) = \mathbb{1}\{\hat T(\cdot) > t_{(k_*)}\}$ is summarized and detailed in Algorithm 1. In this algorithm, to solve for $\alpha_{k,\delta}$, we use a binary search subroutine (Algorithm 2 in Appendix B.2) on the function $x \mapsto \sum_{j=k}^{n} \binom{n}{j} x^{n-j}(1-x)^j$, leveraging its strict monotone decreasing property in $x$. Interested readers are referred to the proof of Lemma 7 in the Appendix for further reasoning.
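The two computational steps above, solving equation (3) by binary search and then selecting the order, can be sketched in a few lines of Python. This is our own standard-library sketch, not the authors' package; `t_sorted` are the sorted threshold candidates and `scores0_e`, `scores1_e` are the scores of the estimation sets.

```python
import math
from bisect import bisect_right

def min_threshold_sample_size(alpha, delta):
    """Smallest n with (1 - alpha)^n <= delta, needed for the order to exist."""
    return math.ceil(math.log(delta) / math.log(1 - alpha))

def binom_tail(k, n, a):
    """sum_{j=k}^{n} C(n,j) (1-a)^j a^(n-j), the tail quantity in equations (2)-(3)."""
    return sum(math.comb(n, j) * (1 - a) ** j * a ** (n - j) for j in range(k, n + 1))

def alpha_k_delta(k, n, delta, iters=60):
    """Solve equation (3) for alpha_{k,delta} by bisection; binom_tail(k, n, .)
    is strictly decreasing in its last argument."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if binom_tail(k, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi

def adjusted_order(t_sorted, scores0_e, scores1_e, m0, m1, alpha, delta):
    """The order k_* of Algorithm 1: smallest k with alpha_{k,delta} - D^+(t_(k)) <= alpha,
    where D^+ is the clipped plug-in estimate of the corrupted-vs-true type I error gap."""
    M = (1 - m0) / (m0 - m1)
    s0, s1 = sorted(scores0_e), sorted(scores1_e)
    n = len(t_sorted)
    for k in range(1, n + 1):
        t = t_sorted[k - 1]
        F0 = bisect_right(s0, t) / len(s0)  # empirical corrupted class 0 CDF at t
        F1 = bisect_right(s1, t) / len(s1)  # empirical corrupted class 1 CDF at t
        d_plus = max(M * (F0 - F1), 0.0)
        if alpha_k_delta(k, n, delta) - d_plus <= alpha:
            return k
    return n
```

Because the clipped correction is nonnegative, the returned order is never larger than the order `min{k : alpha_k_delta(k, n, delta) <= alpha}` chosen by the original umbrella algorithm, consistent with Lemma 2 below.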
Currently, we randomly split $\tilde{\mathcal{S}}^0$ and $\tilde{\mathcal{S}}^1$ into three and two equal-sized subgroups, respectively. An optimal splitting strategy could be a subject for future research.

The key to the new algorithm is $\hat D^+(\cdot)$, which adjusts for the label corruption. Indeed, the original NP umbrella algorithm can be seen as a special case of our approach with $\hat D^+(\cdot) = 0$. The numerical advantage of the new algorithm is demonstrated in the right panel of Figure 3.2 and in Section 3.5. We will prove in the next section that the label-noise-adjusted NP classifier $\hat\phi_{k_*}(\cdot) = \mathbb{1}\{\hat T(\cdot) > t_{(k_*)}\}$ controls the true type I error with high probability while avoiding the excessive conservativeness of the original NP umbrella algorithm. Note that, in contrast to the deterministic order $k^*$ in the original NP umbrella algorithm, the new order $k_*$ is random, calling for much more involved technicalities to establish the theoretical properties of $\hat\phi_{k_*}(\cdot)$.

3.3.3 Algorithm 1#: label-noise-adjusted NP umbrella algorithm with unknown corruption levels

For most applications in practice, accurate corruption levels $m_0$ and $m_1$ are inaccessible. To address this, we propose Algorithm 1#, a simple variant of Algorithm 1 that replaces $m_0$ and $m_1$ with estimates $m_0^\#$ and $m_1^\#$. In all other respects the two algorithms are identical. Specifically, when estimating $\tilde R_0 - R_0$, Algorithm 1# uses
$$\hat D_\#(t_{(k)}) = \frac{1 - m_0^\#}{m_0^\# - m_1^\#}\Big(\hat{\tilde F}^{\hat T}_0(t_{(k)}) - \hat{\tilde F}^{\hat T}_1(t_{(k)})\Big) \quad\text{and}\quad \hat D^+_\#(t_{(k)}) = \hat D_\#(t_{(k)}) \vee 0.$$
Then, Algorithm 1# delivers the NP classifier $\hat\phi_{k^\#_*}(\cdot) = \mathbb{1}\{\hat T(\cdot) > t_{(k^\#_*)}\}$, where $k^\#_* = \min\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - \hat D^+_\#(t_{(k)}) \le \alpha\}$. Due to the similarity with Algorithm 1, we do not write Algorithm 1# out in full algorithm format.

Rather than supplying unbiased estimates of $m_0$ and $m_1$, we will demonstrate that it is important that $m_0^\#$ and $m_1^\#$ under-estimate the corruption levels (i.e., $m_0^\# \ge m_0$ and $m_1^\# \le m_1$, as in Assumption 3.2).
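The effect of this substitution is visible in the multiplier alone. The sketch below (ours, with hypothetical corruption values chosen only for illustration) shows that under-estimating the corruption can only shrink the multiplier, so Algorithm 1# sits between Algorithm 1 and the original umbrella algorithm.

```python
def noise_multiplier(m0, m1):
    """M = (1 - m0) / (m0 - m1): the factor relating the correction D to the
    gap between the corrupted class 0 and class 1 score CDFs."""
    return (1.0 - m0) / (m0 - m1)

M_true = noise_multiplier(0.90, 0.10)    # known corruption levels (Algorithm 1)
M_under = noise_multiplier(0.95, 0.05)   # under-estimates (Algorithm 1#)
M_ignore = noise_multiplier(1.00, 0.00)  # noise ignored: the original algorithm

# A smaller multiplier yields a smaller correction, hence a larger threshold:
# Algorithm 1# interpolates between Algorithm 1 (M_true) and the original (0).
assert 0.0 == M_ignore <= M_under <= M_true
```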
In this work, we assume that domain experts supply these under-estimates. While it would be unrealistic to assume that these experts know $m_0$ and $m_1$ exactly, in many scenarios one can provide accurate bounds on these quantities. Investigating data-driven estimators with this property would be interesting future work.

3.4 Theory

In this section, we first elaborate the rationale behind Algorithm 1 (Section 3.4.1), and then show that, under a few technical conditions, this new algorithm induces well-defined classifiers whose type I errors are bounded from above by the desired level with high probability (Section 3.4.2). We then establish a similar result for its unknown-corruption-level variant, Algorithm 1# (Section 3.4.3).

3.4.1 Rationale behind Algorithm 1

Proposition 1. Let $\hat T(\cdot)$ be a scoring function (e.g., the sigmoid function in logistic regression) trained on $\tilde{\mathcal{S}}_b$. Applying $\hat T(\cdot)$ to every element of $\tilde{\mathcal{S}}^0_t$, we get a set of scores. Order these scores and denote them by $\{t_{(1)}, t_{(2)}, \ldots, t_{(n)}\}$, in which $t_{(1)} \le t_{(2)} \le \ldots \le t_{(n)}$. Then, for any $\alpha \in (0,1)$ and $k \in \{1,2,\ldots,n\}$, the classifier $\hat\phi_k(\cdot) = \mathbb{1}\{\hat T(\cdot) > t_{(k)}\}$ satisfies
$$\mathbb{P}\big(\tilde R_0(\hat\phi_k) > \alpha\big) \le \sum_{j=k}^{n} \binom{n}{j} (1-\alpha)^j \alpha^{n-j},$$
in which $\mathbb{P}$ is with respect to the randomness in all training observations, as well as additional randomness if we adopt certain random classification methods (e.g., random forest). Moreover, when $\hat T(\cdot)$ is continuous almost surely, the above inequality holds with equality.

Recall that $\tilde R_0(\cdot)$ denotes the type I error with respect to the corrupted labels. We omit a proof of Proposition 1 as it follows the same argument as its counterpart in [99]. For $\alpha, \delta \in (0,1)$, recall that the original NP umbrella algorithm selects $k^* = \min\{k \in \{1,\ldots,n\} : \sum_{j=k}^{n} \binom{n}{j}(1-\alpha)^j \alpha^{n-j} \le \delta\}$. The smallest $k$ among all that satisfy $\sum_{j=k}^{n} \binom{n}{j}(1-\alpha)^j \alpha^{n-j} \le \delta$ is desirable because we also wish to minimize the type II error.
There is a sample size requirement for this order statistics approach to work, because a finite order $k^*$ must exist. Precisely, the order statistics approach works if the last order does, i.e., $(1-\alpha)^n \le \delta$. This translates to Assumption 3.4.1 on $n$, the sample size of $\tilde{\mathcal{S}}^0_t$:
$$n \ge \lceil \log\delta / \log(1-\alpha) \rceil,$$
in which $\lceil\cdot\rceil$ denotes the ceiling function. This is a mild requirement; for instance, when $\alpha = \delta = 0.05$, $n$ should be at least 59.

In view of Proposition 1, the choice of $k^*$ guarantees $\mathbb{P}(\tilde R_0(\hat\phi_{k^*}) \le \alpha) \ge 1 - \delta$. In other words, if we were to ignore the presence of label noise and apply the original NP umbrella algorithm, the type I error with respect to the corrupted labels, $\tilde R_0$, would be controlled under level $\alpha$ with probability at least $1 - \delta$. Moreover, the achieved $\tilde R_0$ is usually not far from $\alpha$ when the sample size $n$ is much larger than the lower bound requirement. However, this is not our main target; what we really want is to control $R_0$. Example 2 in Section 3.3.2 convincingly demonstrates that in the presence of label noise, the achieved $R_0$ after naive application of the original NP umbrella algorithm can be much lower than the control target $\alpha$. This is not an isolated phenomenon. To aid in analyzing the gap between $R_0$ and $\tilde R_0$, we make the following assumption.

The scoring function $\hat T$ is trained such that $\tilde F^{\hat T}_0(z) > \tilde F^{\hat T}_1(z)$ for all $z \in \mathbb{R}$ with probability at least $1 - \delta_1(n_b)$, where $n_b = |\tilde{\mathcal{S}}_b|$ and $\delta_1(n_b)$ converges to 0 as $n_b$ goes to infinity.

Loosely, Assumption 3.4.1 means that the scoring function trained on corrupted data still has the "correct direction." For any classifier of the form $\hat\phi_c(\cdot) = \mathbb{1}\{\hat T(\cdot) > c\}$, Assumption 3.4.1 implies that with probability at least $1 - \delta_1(n_b)$, $\tilde P_0(\hat\phi_c(X) = 0) > \tilde P_1(\hat\phi_c(X) = 0)$, which means that a corrupted class 0 observation is more likely to be classified as 0 than a corrupted class 1 observation is.
Interested readers can find a concrete example illustrating this mild assumption in Appendix B.3 (Example 3). Now we are ready to describe the discrepancy between $R_0$ and $\tilde R_0$.

Lemma 1. Let $\hat T$ be a scoring function trained on $\tilde{\mathcal{S}}_b$ and let $\hat\phi_c(\cdot) = \mathbb{1}\{\hat T(\cdot) > c\}$ be a classifier that thresholds the scoring function at $c \in \mathbb{R}$. Denote $D(c) = \tilde R_0(\hat\phi_c) - R_0(\hat\phi_c)$. Then, under Assumptions 3.2, 3.4.1 and 3.4.1, for given $\alpha$ and $\delta$, it holds that
$$\mathbb{P}\Big(\inf_{c \in \mathbb{R}} D(c) \ge 0\Big) \ge 1 - \delta_1(n_b) \quad\text{and}\quad \mathbb{P}\big(R_0(\hat\phi_{k^*}) > \alpha - D(t_{(k^*)})\big) \le \delta + \delta_1(n_b),$$
where $k^*$ and $\delta$ are related via equation (2). Moreover, we have
$$D(c) = M\big(\tilde F^{\hat T}_0(c) - \tilde F^{\hat T}_1(c)\big), \qquad (4)$$
where $M = (1 - m_0)(m_0 - m_1)^{-1}$.

Note that $D(c)$ measures the discrepancy between the corrupted type I error and the true type I error of the classifier $\hat\phi_c(\cdot)$. Lemma 1 implies that with high probability, $\hat\phi_{k^*}(\cdot)$ has $R_0$, the type I error with respect to the true labels, under a level that is smaller than the target value $\alpha$, and the gap is measured by $D(t_{(k^*)})$. It is important to note that $D(c)$ is solely a function of the distributions of the corrupted data and does not require any knowledge of the uncorrupted scores, so we are able to estimate this quantity from our observed data.

As argued previously, excessive conservativeness in type I error is not desirable because it is usually associated with a high type II error. Therefore, a new NP umbrella algorithm should adjust for the label noise, so that the resulting classifier respects the true type I error control target but is not excessively conservative. Motivated by Lemma 1, our central plan is to choose a less conservative (i.e., smaller) order than that of the original NP umbrella algorithm, in view of the difference between $R_0$ and $\tilde R_0$. Recall that $\delta \in (0,1)$ is the target type I error violation rate. In the presence of label noise, we do not expect to control at exactly this violation rate, but at some number close to it.
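In the Gaussian setting of Example 2, both corrupted score CDFs are available in closed form, so equation (4) can be checked directly. The sketch below (ours, standard library only) confirms numerically that $D(c)$ coincides with the gap between the corrupted and true type I errors and is nonnegative, as Lemma 1 asserts.

```python
import math

m0, m1 = 0.85, 0.15                # corruption levels from Example 2
M = (1 - m0) / (m0 - m1)
Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal CDF

def F0_tilde(c):
    """CDF of corrupted class 0 scores: the mixture m0*N(0,1) + (1-m0)*N(2,1)."""
    return m0 * Phi(c) + (1 - m0) * Phi(c - 2)

def F1_tilde(c):
    """CDF of corrupted class 1 scores: the mixture m1*N(0,1) + (1-m1)*N(2,1)."""
    return m1 * Phi(c) + (1 - m1) * Phi(c - 2)

def D(c):
    """Equation (4): D(c) = M * (F0_tilde(c) - F1_tilde(c))."""
    return M * (F0_tilde(c) - F1_tilde(c))

# D(c) agrees with the direct gap (corrupted minus true type I error),
# (1 - F0_tilde(c)) - (1 - Phi(c)) = Phi(c) - F0_tilde(c), and is nonnegative.
for c in [x / 10 for x in range(-50, 51)]:
    assert abs(D(c) - (Phi(c) - F0_tilde(c))) < 1e-12
    assert D(c) >= 0.0
```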
For any $\hat\phi_k(\cdot)$, under Assumptions 3.2, 3.4.1 and 3.4.1, Lemma 1 implies $\tilde R_0(\hat\phi_k) \ge R_0(\hat\phi_k)$ with probability at least $1 - \delta_1(n_b)$. Note that the $\delta_1(n_b)$ term is small and asymptotically 0; we ignore it in this section when motivating our new strategy. With this simplification, $\tilde R_0(\hat\phi_k)$ is always greater than $R_0(\hat\phi_k)$, as illustrated in Figure 3.4. The definition of $\alpha_{k,\delta}$ in equation (3) and Proposition 1 imply that, with probability at least $1 - \delta$, $\alpha_{k,\delta} \ge \tilde R_0(\hat\phi_k)$, which corresponds to the green region (the region on the right) in Figure 3.4. Since we only need a $1 - \delta$ probability control on $R_0$, it suffices to control $R_0$ on this region. Combining the results $\alpha_{k,\delta} \ge \tilde R_0(\hat\phi_k)$ and $\tilde R_0(\hat\phi_k) \ge R_0(\hat\phi_k)$, we have the inequalities $\alpha_{k,\delta} \ge \alpha_{k,\delta} - D(t_{(k)}) \ge R_0(\hat\phi_k)$ on the region of interest (recall $D(t_{(k)}) = \tilde R_0(\hat\phi_k) - R_0(\hat\phi_k)$). By the previous argument, $\alpha_{k,\delta}$ can be used as an upper bound for $R_0$, but to obtain a good type II error, a better choice is clearly the smaller $\alpha_{k,\delta} - D(t_{(k)})$. So if $D(t_{(k)})$ were a known quantity, we could set the order to $\tilde k^* = \min\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - D(t_{(k)}) \le \alpha\}$ and propose the classifier $\hat\phi_{\tilde k^*}(\cdot) = \mathbb{1}\{\hat T(\cdot) > t_{(\tilde k^*)}\}$. This is to be compared with the order $k^*$ chosen by the original NP umbrella algorithm, which can be equivalently expressed as $k^* = \min\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} \le \alpha\}$ (Lemma 7 in the Appendix). Then $\tilde k^* \le k^*$, and so $\hat\phi_{\tilde k^*}(\cdot)$ is less conservative than $\hat\phi_{k^*}(\cdot)$ in terms of type I error.

However, $\hat\phi_{\tilde k^*}(\cdot)$ is not accessible because $D$ is unknown. Instead, we estimate $D$ by replacing $\tilde F^{\hat T}_0$ and $\tilde F^{\hat T}_1$ in (4) with their empirical distributions $\hat{\tilde F}^{\hat T}_0$ and $\hat{\tilde F}^{\hat T}_1$, which are calculated using $\tilde{\mathcal{S}}^0_e$ and $\tilde{\mathcal{S}}^1_e$, i.i.d. samples of the corrupted 0 and 1 observations. Note that these estimates are independent of $\tilde{\mathcal{S}}_b$ and $\tilde{\mathcal{S}}^0_t$.
For a given $\hat T$, we define, for every $c \in \mathbb{R}$,
$$\hat D(c) = \frac{1 - m_0}{m_0 - m_1}\Big(\hat{\tilde F}^{\hat T}_0(c) - \hat{\tilde F}^{\hat T}_1(c)\Big) \quad\text{and}\quad k^{**} = \min\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - \hat D(t_{(k)}) \le \alpha - \varepsilon\},$$
in which a small $\varepsilon > 0$ is introduced to compensate for the randomness of $\hat D$ in the theory proofs. For simulation and real data, we simply use $\varepsilon = 0$. Finally, the proposed new label-noise-adjusted NP classifier with known corruption levels is $\hat\phi_{k_*}(\cdot) = \mathbb{1}\{\hat T(\cdot) > t_{(k_*)}\}$, in which $k_*$ is a small twist on $k^{**}$ obtained by replacing $\hat D$ with its positive part. The construction of $\hat\phi_{k_*}(\cdot)$ was detailed in Algorithm 1.

[Figure 3.4 about here: a cartoon comparing the corrupted type I error $\tilde R_0(\hat\phi_k)$ and the true type I error $R_0(\hat\phi_k)$ at threshold $t_{(k)}$, together with the bounds $\alpha_{k,\delta}$ and $\alpha_{k,\delta} - D(t_{(k)})$.]

Figure 3.4: A cartoon illustration of the $1 - \delta$ probability upper bound on the type I error.

We have two comments on the implementation of Algorithm 1. First, though the $\varepsilon$ compensation for randomness is necessary for the theory proof, our empirical results suggest almost identical performance between $\varepsilon = 0$ and any small $\varepsilon$, so we recommend setting $\varepsilon$ to 0 for simplicity, and we do not use the $\varepsilon$ compensation in Algorithm 1. Second, in the order selection criterion of $k_*$ in Algorithm 1, we use $\hat D^+ = \hat D \vee 0 := \max(\hat D, 0)$ instead of $\hat D$, because empirically, although highly unlikely, $\hat D$ can be negative, which would result in $\min\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - \hat D(t_{(k)}) \le \alpha\} \ge \min\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} \le \alpha\}$. In this case, the new order could be greater than $k^*$. Since we aim to reduce the conservativeness of the original NP umbrella algorithm, the possibility of $k_* \ge k^*$ would reverse this effort and worsen the conservativeness. To avoid this, we force the empirical version of $D$ to be non-negative by replacing $\hat D$ with $\hat D^+$ in Algorithm 1.
3.4.2 Theoretical properties of Algorithm 1

In this subsection, we first formally establish that Algorithm 1 gives rise to valid classifiers (Lemma 2) and then show that these classifiers have their true type I errors controlled under $\alpha$ with high probability (Theorem 9).

Lemma 2. Under Assumption 3.4.1, $k_* = \min\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha\}$ in Algorithm 1 exists. Moreover, this label-noise-adjusted order is no larger than the one chosen by the original NP umbrella algorithm; that is, $k_* \le k^*$.

Lemma 2 implies that Algorithm 1 reduces the excessive conservativeness of the original NP umbrella algorithm on the type I error by choosing a smaller order statistic as the threshold. Moreover, if there is no label noise, i.e., when $m_0 = 1$ and $m_1 = 0$, we have $k_* = \min\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} \le \alpha\} = k^*$; that is, Algorithm 1 reduces to the original NP umbrella algorithm.

Another important question is whether Algorithm 1 can control the true type I error with high probability. The following condition is assumed for the rest of this section.

The scoring function $\hat T$ is trained from a class of functions $\mathcal{T}$ such that the density functions of both $\hat T(\tilde X^0)$ and $\hat T(\tilde X^1)$ exist for every $\hat T \in \mathcal{T}$; we denote these two densities by $\tilde f^{\hat T}_0$ and $\tilde f^{\hat T}_1$, respectively. Furthermore, $\sup_{\hat T \in \mathcal{T}} \|\tilde f^{\hat T}_0 \vee \tilde f^{\hat T}_1\|_\infty \le C$ and $\inf_{\hat T \in \mathcal{T}} \inf_{z \in \mathcal{D}_{\hat T}} \tilde f^{\hat T}_0(z) > c$ for some positive $c$ and $C$ with probability $1 - \delta_2(n_b)$, where $\mathcal{D}_{\hat T}$ is the support of $\tilde f^{\hat T}_0$ and is a closed interval, and $\delta_2(n_b)$ converges to 0 as $n_b$ goes to infinity.

Note that Assumption 3.4.2 summarizes assumptions that we make for technical convenience in establishing the next theorem. In particular, we assume the existence of the densities $\tilde f^{\hat T}_0$ and $\tilde f^{\hat T}_1$, which holds if $\tilde X^0$ and $\tilde X^1$ have densities and $\hat T(\cdot)$ is smooth. Moreover, we assume that with high probability, both densities are uniformly bounded from above and $\tilde f^{\hat T}_0(\cdot)$ is uniformly bounded from below.
Recall that in Algorithm 1, we set $k_* = \min\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha\}$ without an $\varepsilon$ term. Setting $\varepsilon = 0$ intuitively seems reasonable since, when the sample size is small, the sets $\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha - \varepsilon\}$ and $\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha\}$ agree with high probability, and, when the sample size is large, concentration of the random variables takes effect, so there is little need to compensate for randomness. Our simulation results further reinforce this intuition. However, we include an $\varepsilon$ term in the next theorem, as it is required by our proof.

Theorem 9. Under Assumptions 3.2, 3.4.1, 3.4.1 and 3.4.2, the classifier $\hat\phi_{k_*}(\cdot)$ given by Algorithm 1 with $k_* = \min\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha - \varepsilon\}$ satisfies
$$\mathbb{P}\big(R_0(\hat\phi_{k_*}) > \alpha\big) \le \delta + \delta_1(n_b) + \delta_2(n_b) + 2e^{-8^{-1} n M^{-2} C^{-2} c^2 \varepsilon^2} + 2e^{-8^{-1} n^0_e M^{-2} \varepsilon^2} + 2e^{-8^{-1} n^1_e M^{-2} \varepsilon^2},$$
in which $n_b = |\tilde{\mathcal{S}}_b|$, $n = |\tilde{\mathcal{S}}^0_t|$, $n^0_e = |\tilde{\mathcal{S}}^0_e|$, and $n^1_e = |\tilde{\mathcal{S}}^1_e|$.

Note that the upper bound on $\mathbb{P}(R_0(\hat\phi_{k_*}) > \alpha)$ is $\delta$, our violation rate control target, plus a few terms which converge to zero as the sample sizes increase. To establish this inequality, we first exclude the complements of the events described in Assumptions 3.4.1 and 3.4.2. Then, we further restrict ourselves to the event, constructed by a Glivenko-Cantelli type inequality, on which $\hat D$ and $D$ differ by at most $2^{-1}\varepsilon$. There, the order selection criterion can be written as $k_* = \min\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - D(t_{(k)}) \le \alpha - 2^{-1}\varepsilon\}$. The main difficulty of the proof is to handle the randomness of the threshold $t_{(k_*)}$. Unlike the deterministic order $k^*$ in the original NP umbrella algorithm, the new order $k_*$ is stochastic. As such, even when conditioning on $\hat T$, $t_{(k_*)}$ is still random and cannot be handled as an ordinary order statistic. Our solution is to find a high probability deterministic lower bound for $t_{(k_*)}$.
To do this, we introduce $c_k$, the $k/n$ quantile of $\tilde F^{\hat T}_0$, which is a deterministic value if we consider $\hat T$ to be fixed. Then, we show that $D(t_{(k)})$ differs from $D(c_k)$ by at most $4^{-1}\varepsilon$ for all $k$, and that $\alpha_{k_*,\delta} - D(c_{k_*}) \le \alpha - 4^{-1}\varepsilon$. Then, we define $k_0 = \min\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - D(c_k) \le \alpha - 4^{-1}\varepsilon\}$, which is another deterministic value, given that $\hat T$ is considered fixed. We then find that $k_0 \le k_*$ and $\alpha_{k_0,\delta} - D(t_{(k_0)}) \le \alpha$ with high probability. Therefore, $t_{(k_0)}$ is a high probability lower bound for $t_{(k_*)}$. Moreover, $t_{(k_0)}$ is an order statistic with deterministic order (for fixed $\hat T$), and thus its distribution can be written as a binomial probability. The fact that $\alpha_{k_0,\delta} - D(t_{(k_0)}) \le \alpha$, combined with Proposition 1, yields that the violation rate of $\hat\phi_{k_0}(\cdot)$ is smaller than $\delta$. Readers are referred to Appendix B.6 for a complete proof.

3.4.3 Theoretical properties of Algorithm 1#

In this subsection, we discuss the properties of Algorithm 1#. Recall that $m_0^\# \ge m_0$ and $m_1^\# \le m_1$ in Assumption 3.2 mean that the corruption levels are "under-estimated." As such, Algorithm 1# produces a more conservative result than Algorithm 1. To see this, note that the only difference between the two algorithms is that $(1 - m_0)(m_0 - m_1)^{-1}$ in Algorithm 1 is replaced with $(1 - m_0^\#)(m_0^\# - m_1^\#)^{-1}$ in Algorithm 1#. The latter is no larger than the former, so the threshold in Algorithm 1# is larger than or equal to that in Algorithm 1.

On the other hand, under Assumption 3.2, Algorithm 1# is still less conservative than the original NP umbrella algorithm. To see this, first consider the case where the label noise is totally "ignored," i.e., $m_0^\# = 1$ and $m_1^\# = 0$. In this case, Algorithm 1# is equivalent to the original NP umbrella algorithm. Then, since usually $m_0^\# < 1$ and $m_1^\# > 0$, Algorithm 1# produces a smaller threshold than the original NP umbrella algorithm.
Therefore, Algorithm 1# overcomes, at least partially, the conservativeness issue of the original NP umbrella algorithm. These insights are formalized in the following lemma.

Lemma 3. Under Assumptions 3.2 - 3.4.1, $k^\#_* = \min\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - \hat D^+_\#(t_{(k)}) \le \alpha\}$ in Algorithm 1# exists. Moreover, the order $k^\#_*$ is between $k_*$ and $k^*$, i.e., $k_* \le k^\#_* \le k^*$.

Next, we establish a high probability control on type I error for Algorithm 1#. Recall that a high probability control on type I error for Algorithm 1 was established in Theorem 9. In view of Lemma 3, $\hat\phi_{k^\#_*}(\cdot)$ produced by Algorithm 1# has a larger threshold, and thus a smaller true type I error, than $\hat\phi_{k_*}(\cdot)$ produced by Algorithm 1. A high probability control on the true type I error of $\hat\phi_{k^\#_*}(\cdot)$ then follows naturally. This result is summarized in the following corollary.

Corollary. Under Assumptions 3.2 - 3.4.2, the classifier $\hat\phi_{k^\#_*}(\cdot)$ given by Algorithm 1# with $k^\#_* = \min\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - \hat D^+_\#(t_{(k)}) \le \alpha - \varepsilon\}$ satisfies
$$\mathbb{P}\big(R_0(\hat\phi_{k^\#_*}) > \alpha\big) \le \delta + \delta_1(n_b) + \delta_2(n_b) + 2e^{-8^{-1} n M^{-2} C^{-2} c^2 \varepsilon^2} + 2e^{-8^{-1} n^0_e M^{-2} \varepsilon^2} + 2e^{-8^{-1} n^1_e M^{-2} \varepsilon^2},$$
in which $n_b = |\tilde{\mathcal{S}}_b|$, $n = |\tilde{\mathcal{S}}^0_t|$, $n^0_e = |\tilde{\mathcal{S}}^0_e|$, and $n^1_e = |\tilde{\mathcal{S}}^1_e|$.

3.5 Numerical Analysis

In this section, we apply Algorithms 1 (known corruption levels) and 1# (unknown corruption levels) to simulated and real datasets, and compare them with other methods in the literature. We present the (approximate) type I error violation rates⁹ and the averages of the (approximate) true type II errors. Besides the simulations in this section, we provide additional simulations in Appendix B.4.1. Furthermore, the violin plots associated with selected simulations are presented in Appendix B.4.3. As a justification of the minor discrepancy between our theory and implementation, readers can find in Appendix B.4.5 the results for a slightly different implementation of Algorithm 1, in which $k_* = \min\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha - \varepsilon\}$ and $\varepsilon = 0.0001$.

⁹ Strictly speaking, the observed type I error violation rate is only an approximation of the real violation rate. The approximation is two-fold: (i) in each repetition of an experiment, the population type I error is approximated by the empirical type I error on a large test set; (ii) the violation rate should be calculated over infinitely many repetitions of the experiment, but we only calculate it from a finite number of repetitions. Such approximation, however, is unavoidable in numerical studies.
In principle, it is possible that setting $\varepsilon > 0$ makes $k_*$ larger than when $\varepsilon = 0$, as $\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha - \varepsilon\}$ is a subset of $\{k \in \{1,\ldots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha\}$. This would make the threshold larger and the type I error and the violation rate smaller. However, since $\varepsilon = 0.0001$ is a very small value, its effect on $k_*$ is very minor. In our numerical studies, the two implementations ($\varepsilon = 0.0001$ in the Appendix vs. $\varepsilon = 0$ in this section) give nearly identical results for all examples. Both implementations generate the same type I errors and type II errors in most (at least 95%) cases. Moreover, the difference in the violation rates of the two implementations is never larger than the very small number $0.1\delta$.

3.5.1 Simulation

3.5.1.1 Algorithm 1.

We present three distributional settings for Algorithm 1 (known $m_0$ and $m_1$). In each setting, $2N$ observations are generated as a training sample, of which half are from the corrupted class 0 and half from the corrupted class 1. The number $N$ varies from 200 to 2,000. To approximate the true type I and II errors, we generate 20,000 true class 0 observations and 20,000 true class 1 observations as the evaluation set. For each distribution and sample size combination, we repeat the procedure 1,000 times. Algorithm 1 ("adjusted") and the original NP umbrella algorithm ("original") are both applied, paired with different base algorithms.

Simulation 1 (Gaussian Distribution).
Let $X^0 \sim \mathcal{N}(\mu_0, \Sigma)$ and $X^1 \sim \mathcal{N}(\mu_1, \Sigma)$, where $\mu_0 = (0,0,0)^\top$, $\mu_1 = (1,1,1)^\top$ and
$$\Sigma = \begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix},$$
and the base algorithm is linear discriminant analysis (LDA). For different $(m_0, m_1, \alpha, \delta)$ combinations, the (approximate) type I error violation rates and the averages of the (approximate) true type II errors generated by Algorithm 1 are reported in Tables 3.1 and 3.2, respectively.

Simulation 2 (Uniform Distribution within Circles). Let $X^0$ and $X^1$ be uniformly distributed within unit circles centered at $(0,0)^\top$ and $(1,1)^\top$, respectively. The base algorithm is logistic regression. We only report the (approximate) type I error violation rates and the averages of

N     | m0=.95, m1=.05, α=.05, δ=.05 | m0=.9, m1=.1, α=.05, δ=.05 | m0=.95, m1=.05, α=.1, δ=.1 | m0=.9, m1=.1, α=.1, δ=.1
      | adjusted / original          | adjusted / original        | adjusted / original        | adjusted / original
200   | .026 (5.03) / .001 (1.00)    | .033 (5.65) / 0 (0)        | .078 (8.84) / .003 (1.73)  | .073 (8.23) / 0 (0)
500   | .031 (5.40) / 0 (0)          | .046 (6.63) / 0 (0)        | .090 (9.05) / .001 (1.00)  | .085 (8.82) / 0 (0)
1,000 | .038 (5.97) / 0 (0)          | .049 (6.83) / 0 (0)        | .105 (9.70) / 0 (0)        | .081 (8.63) / 0 (0)
2,000 | .053 (6.96) / 0 (0)          | .046 (6.63) / 0 (0)        | .087 (8.92) / 0 (0)        | .099 (9.45) / 0 (0)

Table 3.1: (Approximate) type I error violation rates over 1,000 repetitions for Simulation 1. Standard errors (×10⁻³) in parentheses.
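As an aside, the corrupted training samples used in these simulations can be generated directly under the class-conditional noise model. The sketch below (ours, not the authors' code) draws a Simulation 1 style corrupted sample: each observation carrying a corrupted label is a true class 0 draw with the stated probability, and a true class 1 draw otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, mu1 = np.zeros(3), np.ones(3)
Sigma = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]])

def corrupted_sample(n, m, rng):
    """n observations carrying one corrupted label: each is a true class 0 draw
    with probability m, and a true class 1 draw otherwise (class-conditional noise)."""
    from_true0 = rng.random(n) < m
    x0 = rng.multivariate_normal(mu0, Sigma, n)
    x1 = rng.multivariate_normal(mu1, Sigma, n)
    return np.where(from_true0[:, None], x0, x1)

N = 1000
x0_tilde = corrupted_sample(N, 0.95, rng)  # corrupted class 0 sample (m0 = .95)
x1_tilde = corrupted_sample(N, 0.05, rng)  # corrupted class 1 sample (m1 = .05)
```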
N     | m0=.95, m1=.05, α=.05, δ=.05 | m0=.9, m1=.1, α=.05, δ=.05 | m0=.95, m1=.05, α=.1, δ=.1 | m0=.9, m1=.1, α=.1, δ=.1
      | adjusted / original          | adjusted / original        | adjusted / original        | adjusted / original
200   | .685 (7.16) / .706 (4.65)    | .697 (7.06) / .826 (3.54)  | .333 (3.93) / .403 (3.56)  | .369 (4.93) / .537 (4.03)
500   | .481 (4.08) / .590 (2.99)    | .512 (4.92) / .743 (2.79)  | .249 (1.94) / .307 (1.83)  | .257 (2.21) / .436 (2.48)
1,000 | .396 (2.53) / .534 (2.19)    | .387 (2.37) / .663 (1.68)  | .218 (1.18) / .287 (1.22)  | .213 (1.01) / .381 (1.28)
2,000 | .350 (1.51) / .491 (1.45)    | .371 (1.99) / .651 (1.45)  | .201 (.76) / .268 (.77)    | .205 (.87) / .375 (1.01)

Table 3.2: Averages of (approximate) true type II errors over 1,000 repetitions for Simulation 1. Standard errors (×10⁻³) in parentheses.

the (approximate) true type II errors generated by Algorithm 1 for one combination (m0 = .95, m1 = .05, α = .1 and δ = .1) in Table 3.3.

N     | (approximate) violation rate | averages of (approximate) true type II errors
      | adjusted / original          | adjusted / original
200   | .079 (8.53) / .006 (2.44)    | .164 (2.77) / .226 (3.35)
500   | .086 (8.87) / .001 (1.00)    | .123 (.92) / .161 (.80)
1,000 | .085 (8.82) / 0 (0)          | .109 (.61) / .151 (.58)
2,000 | .085 (8.82) / 0 (0)          | .101 (.44) / .142 (.39)

Table 3.3: (Approximate) type I error violation rates, and averages of (approximate) true type II errors over 1,000 repetitions for Simulation 2 (m0 = .95, m1 = .05, α = .1 and δ = .1). Standard errors (×10⁻³) in parentheses.

Simulation 3 (T Distribution). Let $X^0$ and $X^1$ be t-distributed with the shape matrix $\Sigma$ specified in Simulation 1, 4 degrees of freedom, and centers $(0,0,0)^\top$ and $(1,1,1)^\top$, respectively.

The base algorithm is LDA. Similar to the previous simulation, we only report the (approximate) type I error violation rates and the averages of the (approximate) true type II errors generated by Algorithm 1 for one combination (m0 = .95, m1 = .05, α = .1 and δ = .1) in Table 3.4.
N     | (approximate) violation rate | average of (approximate) true type II errors
      | adjusted / original          | adjusted / original
200   | .068 (7.96) / .008 (2.82)    | .526 (5.67) / .575 (4.32)
500   | .085 (8.82) / .002 (1.41)    | .398 (3.32) / .472 (2.59)
1,000 | .090 (9.05) / 0 (0)          | .345 (2.07) / .432 (1.78)
2,000 | .093 (9.19) / 0 (0)          | .314 (1.24) / .401 (1.18)

Table 3.4: (Approximate) type I error violation rates, and averages of (approximate) true type II errors over 1,000 repetitions for Simulation 3 (m0 = .95, m1 = .05, α = .1 and δ = .1). Standard errors (×10⁻³) in parentheses.

The results from Simulations 1-3 confirm that the original NP umbrella algorithm is overly conservative on type I error when there is label noise in the training data, resulting in type I error violation rates (close to) 0 in all settings. In contrast, the label-noise-adjusted Algorithm 1 has its type I errors controlled at the specified level with high probability and achieves much better type II errors.

3.5.1.2 Algorithm 1#.

In this section, we show numerically that under the NP paradigm, "under-estimates" of the corruption levels serve Algorithm 1# well, while "over-estimates" do not.

Simulation 4. The distributional setting is the same as in Simulation 1. Different combinations of $m_0^\#$ and $m_1^\#$ are used. The (approximate) type I error violation rates and the averages of the (approximate) true type II errors generated by Algorithm 1# for one combination (m0 = .95, m1 = .05, α = .1 and δ = .1) are reported in Tables 3.5 and 3.6.

N     | m0#=.93, m1#=.07 | m0#=.95, m1#=.05 | m0#=.97, m1#=.03 | original
200   | .136 (10.85)     | .078 (8.48)      | .055 (7.21)      | .003 (1.73)
500   | .218 (13.06)     | .090 (9.05)      | .038 (6.05)      | .001 (1.00)
1,000 | .324 (14.81)     | .105 (9.70)      | .012 (3.44)      | 0 (0)
2,000 | .462 (15.77)     | .087 (8.92)      | .005 (2.23)      | 0 (0)

Table 3.5: (Approximate) type I error violation rates over 1,000 repetitions for Simulation 4. Standard errors (×10⁻³) in parentheses.
N        m#0 = .93, m#1 = .07   m#0 = .95, m#1 = .05   m#0 = .97, m#1 = .03   original
200      .287 (3.43)            .333 (3.92)            .373 (4.62)            .403 (3.56)
500      .215 (1.61)            .249 (1.94)            .285 (2.22)            .307 (1.83)
1,000    .189 (1.02)            .218 (1.18)            .250 (1.37)            .287 (1.22)
2,000    .174 (.65)             .201 (.76)             .230 (.86)             .268 (.77)

Table 3.6: Averages of (approximate) true type II errors over 1,000 repetitions for Simulation 4. Standard errors (× 10⁻³) in parentheses.

The second-to-last column in Table 3.5 confirms that, using strict under-estimates of the corruption levels (i.e., m#0 > m0 and m#1 < m1), the type I error control objective is satisfied. Note that we also include the strict over-estimate scenarios in the second column (i.e., m#0 < m0 and m#1 > m1), where we see that the type I violation rates exceed the target δ. Hence the under-estimate requirement in the theory part is not merely for technical convenience. Table 3.6 confirms that using strict under-estimates leads to higher type II errors than using the true corruption levels. This is a necessary price to pay for not knowing the exact levels, but it is still better than totally ignoring the label corruption and applying the original NP umbrella algorithm.

We state again that in this work, we rely on domain experts to supply under-estimates of the corruption levels. In the literature, there are existing estimators. For example, we implement estimators proposed by [69] in Simulations 6 and 7 in Appendix B.4.1. There, we see that those estimators do not help Algorithm 1# achieve the type I error control objective. But this is not a problem with these estimators themselves. Even "oracle" consistent and unbiased estimators that center at m0 and m1 do not serve the purpose either, as revealed in Simulation 8 in Appendix B.4.1. As expected, given our discussion about the need for under-estimates of the corruption levels (i.e., m#0 ≥ m0 and m#1 ≤ m1), Algorithm 1# performs poorly using these unbiased estimates.
It could be an interesting topic for future research to identify an efficient method for producing biased estimates that satisfy (with high probability) the bounds necessary to ensure correct type I error control.

3.5.1.3 Benchmark Algorithms.

In the next simulation, we apply existing state-of-the-art algorithms that perform classification on data with label noise. In particular, we apply the backward loss correction algorithm in [78] and the T-revision method in [102]. Since we focus on the NP paradigm, we will report the same (approximate) type I error violation rates and averages of (approximate) true type II errors as for our own methods.

Simulation 5. The distributional setting is the same as in Simulation 1. The (approximate) type I error violation rates and averages of (approximate) true type II errors generated by the benchmark algorithms for one combination (m0 = .95, m1 = .05, α = .1 and δ = .1) are reported in Table 3.7 in the main text and Table B.6 in Appendix B.4.4, respectively.

                                                         N
algorithms                                               200            500            1,000          2,000
T-revision                                               .713 (14.31)   .675 (14.82)   .651 (15.08)   .621 (15.35)
backward loss correction (known corruption levels)       .994 (2.44)    .977 (4.74)    .770 (13.31)   .127 (10.53)
backward loss correction (unknown corruption levels)     .984 (3.97)    .793 (5.20)    .320 (6.89)    .131 (3.60)

Table 3.7: (Approximate) type I error violation rates over 1,000 repetitions for Simulation 5 (m0 = .95, m1 = .05, α = .1 and δ = .1). Standard errors (× 10⁻³) in parentheses.

In Simulation 5, the benchmark algorithms fail to control the true type I error with the pre-specified high probability. This is understandable, as none of the benchmark algorithms have α or δ as inputs. As such, these algorithms, unlike Algorithms 1 or 1#, are not designed for the NP paradigm.

3.5.2 Real Data Analysis

We analyze a canonical email spam dataset [51], which consists of 4,601 observations including 57 attributes describing characteristics of emails and a 0-1 class label.
Here, 1 represents spam email while 0 represents non-spam, and the type I/II errors are defined accordingly. The labels in the dataset are all assumed to be correct. We create corrupted labels according to the class-conditional noise model. Concretely, we flip the labels of true class 0 observations with probability r0 and flip the labels of true class 1 observations with probability r1. Note that m0 and m1 are IP(Y = 0 | Ỹ = 0) and IP(Y = 0 | Ỹ = 1), respectively, while r0 = IP(Ỹ = 1 | Y = 0) and r1 = IP(Ỹ = 0 | Y = 1). In our analysis, we choose m0 = 0.95 and m1 = 0.05, which implies setting r0 = 0.032 and r1 = 0.078 (footnote 10).

For each training and evaluation procedure, we split the data by stratified sampling into training and evaluation sets. Specifically, 20% of the true class 0 observations and 20% of the true class 1 observations are randomly selected to form the training dataset, and the rest of the observations form the evaluation dataset. In total, the training set contains 921 observations and the evaluation set contains 3,680 observations. The larger evaluation set is reserved to better approximate the (population-level) true type I/II errors. We leave the evaluation data untouched, but randomly flip the training data labels according to the calculated r0 and r1. Four base algorithms are coupled with the original and new NP umbrella algorithms, with α = δ = 0.1. We repeat the procedure 1,000 times.

The (approximate) type I error violation rates and averages of (approximate) true type II errors generated by Algorithm 1 and the original NP umbrella algorithm are summarized in Table 3.8. Similar to the simulation studies, we observe that Algorithm 1 correctly controls type I error at the right level, while the original NP umbrella algorithm is significantly overly conservative on type I error, and consequently has much higher type II error.
(Footnote 10: This is an application of the Bayes theorem with IP(Y = 0) estimated to be 0.610, which is the proportion of class 0 observations in the whole dataset.)

We also summarize the results generated by Algorithm 1# in Tables 3.9 and 3.10. Clearly, while strict under-estimates lead to higher type II errors than using the exact corruption levels, the type I error control objective is achieved, and the type II error is better than just ignoring label corruption and applying the original NP umbrella algorithm.

                                  (approximate) violation rate    average of (approximate) true type II errors
                                  adjusted      original          adjusted      original
penalized logistic regression     .082 (8.68)   0 (0)             .205 (2.65)   .272 (2.71)
linear discriminant analysis      .096 (9.32)   0 (0)             .226 (3.05)   .314 (2.77)
support vector machine            .093 (9.19)   .004 (2.00)       .183 (3.15)   .218 (1.93)
random forests                    .080 (8.58)   0 (0)             .120 (1.13)   .152 (1.54)

Table 3.8: (Approximate) type I error violation rates, and averages of (approximate) true type II errors by Algorithm 1 and the original NP umbrella algorithm over 1,000 repetitions for the email spam data. Standard errors (× 10⁻³) in parentheses.

                                  m#0 = 0.93, m#1 = 0.07   m#0 = 0.95, m#1 = 0.05   m#0 = 0.97, m#1 = 0.03   original
penalized logistic regression     .231 (13.33)             .082 (8.68)              .028 (5.22)              0 (0)
linear discriminant analysis      .223 (13.17)             .096 (9.32)              .023 (4.74)              0 (0)
support vector machine            .220 (13.11)             .093 (9.19)              .026 (5.03)              .004 (2.00)
random forest                     .238 (13.47)             .080 (8.58)              .019 (4.32)              0 (0)

Table 3.9: (Approximate) type I error violation rates by Algorithm 1# over 1,000 repetitions for the email spam data. Standard errors (× 10⁻³) in parentheses.
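The Bayes-theorem conversion from the observed-label purities (m0, m1) to the flipping probabilities (r0, r1) used above can be sketched as follows. This is our own illustration, not the authors' code; the class-0 proportion 2788/4601 ≈ 0.606 is our assumption about the spam data's class balance, and with it the stated r0 = 0.032 and r1 = 0.078 are reproduced:

```python
def noise_rates(m0, m1, p0):
    """Invert the class-conditional noise model via the Bayes theorem.
    m0 = IP(Y=0 | Y~=0) and m1 = IP(Y=0 | Y~=1) are the observed-label purities;
    p0 = IP(Y=0) is the clean class-0 proportion.
    Returns (r0, r1) with r0 = IP(Y~=1 | Y=0) and r1 = IP(Y~=0 | Y=1)."""
    a = (1.0 - m0) / m0            # corrupted-to-clean odds among observed 0-labels
    b = (1.0 - m1) / m1            # clean-to-corrupted odds among observed 1-labels
    p1 = 1.0 - p0
    r0 = (p1 / p0 - a) / (b - a)   # from the IP(Y~=0) + IP(Y~=1) balance equation
    r1 = (p0 / p1) * (1.0 - r0) * a
    return r0, r1

# Class-0 proportion of the spam data (2788 non-spam of 4,601; our assumption).
r0, r1 = noise_rates(m0=0.95, m1=0.05, p0=2788 / 4601)
print(round(r0, 3), round(r1, 3))  # 0.032 0.078
```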
                                  m#0 = 0.93, m#1 = 0.07   m#0 = 0.95, m#1 = 0.05   m#0 = 0.97, m#1 = 0.03   original
penalized logistic regression     .165 (2.04)              .205 (2.65)              .254 (3.10)              .272 (2.71)
linear discriminant analysis      .213 (2.54)              .226 (3.05)              .314 (3.37)              .314 (2.77)
support vector machine            .138 (1.20)              .183 (3.15)              .199 (2.11)              .218 (1.93)
random forest                     .102 (.78)               .120 (1.13)              .143 (1.41)              .152 (1.54)

Table 3.10: Averages of (approximate) true type II errors by Algorithm 1# over 1,000 repetitions for the email spam data. Standard errors (× 10⁻³) in parentheses.

To make a comparison, we also apply the loss correction algorithm in [78] and the T-revision method in [102] to the email spam data, with results summarized in Table B.7 in Appendix B.4.4. Since these benchmark algorithms are not designed for the NP paradigm, as discussed in Section 3.5.1, none of them controls the (approximate) true type I error as we desire. In addition to the email spam data, we also apply Algorithm 1 to the CIFAR10 dataset [63] and successfully have the type I error controlled (Appendix B.4.2).

3.6 Discussion

Under the NP paradigm, we developed the first label-noise-adjusted umbrella algorithms. There are several interesting directions for future research. First, we can consider a more complex noise model in which the corruption levels depend on both the class and features. Another direction is to consider data-driven "under-estimates" of the corruption levels in the class-conditional noise model and develop (distributional) model-specific adjustment algorithms. For instance, we can adopt the linear discriminant analysis model, i.e., X0 ∼ N(µ0, Σ) and X1 ∼ N(µ1, Σ).

Chapter 4

A Burden Shared is a Burden Halved: A Fairness-Adjusted Approach to Classification

We study fairness in classification, where one wishes to make automated decisions for people from different protected groups. When individuals are classified, the decision errors can be unfairly concentrated in certain protected groups.
We develop a fairness-adjusted selective inference (FASI) framework and data-driven algorithms that achieve statistical parity in the sense that the false selection rate (FSR) is controlled and equalized among protected groups. The FASI algorithm operates by converting the outputs from black-box classifiers to R-values, which are intuitively appealing and easy to compute. Selection rules based on R-values are provably valid for FSR control, and avoid disparate impacts on protected groups. The effectiveness of FASI is demonstrated through both simulated and real data.

4.1 Introduction

In a broad range of applications, artificial intelligence (AI) systems are rapidly replacing human decision-making. Many of these scenarios are sensitive in nature, where the AI's decision, correct or not, can directly impact one's social or economic status. A few examples include a bank determining credit card limits, stores using facial recognition systems to detect shoplifters, and hospitals attempting to identify which of their patients has a specific disorder. Unfortunately, despite their supposedly unbiased approach to decision-making, there has been increasing evidence that AI algorithms often fail to treat equally people of different genders, races, religions, or other protected attributes. Whether this is due to the historical bias in one's training data, or otherwise, it is important, for both legal and policy reasons, that we make ethical use of data and ensure that decisions are made fairly for everyone regardless of their protected attributes.

Although a large literature has been devoted to developing supervised learning algorithms for improving the prediction accuracy, making reliable and fair decisions in the classification setting remains a critical and challenging problem. AI algorithms typically are forced to make classifications on all new observations without a careful assessment of the associated uncertainty or ambiguity.
The ambiguity can be inherent to the classification problem: even an oracle knowing exactly the probability distribution of the classes cannot control the classification errors at a desired level when the signal-to-noise ratio is too low. This chapter develops a "fairness-adjusted selective inference" (FASI) framework to address the critical issues of uncertainty assessment, error rate control and statistical parity in classification. We provide an indecision option for observations that cannot be selected into any class with confidence. These observations can then be separately evaluated. This practice often aligns with the policy objectives in many real world scenarios. For example, incorrectly classifying a low-risk individual as a recidivist or rejecting a well-deserving candidate for a loan request is much more expensive than turning the case over for a more careful review. A mis-classification is an error, the probability of which must be controlled to be small as its consequence can be severe. By contrast, the cost of an indecision is usually much less. For example, the ambiguity can be mitigated by collecting additional contextual knowledge of the convicted individual or requesting more information from the loan applicant. Under the selective inference [7] framework, we only make definitive decisions on a selected subset of all individuals; the less consequential indecision option is considered as a wasted opportunity rather than an error. A natural error rate notion under this framework is the false selection rate (FSR), which is defined as the expected fraction of erroneous classifications among the selected subset of individuals.

Figure 4.1: The selection of recidivists from a pool of criminal defendants (Broward County, Florida). The target FSR is 25%. Left: the unadjusted approach. Right: the proposed FASI approach.
The goal is to develop decision rules to ensure that the FSR is effectively controlled and equalized across protected groups, while trying to minimize the total wasted opportunities.

However, a classification rule that controls the overall FSR may have disparate impacts on different protected groups. We illustrate the point using the COMPAS data (3, 30). The COMPAS algorithm has been widely used in the US to help inform courts about a defendant's recidivism likelihood, i.e., the likelihood of a convicted criminal recommitting a crime, so any prediction errors could have significant implications. The left hand plot of Figure 4.1 shows the False Selection Proportions (FSP), i.e. the fraction of individuals who did not recommit a crime among those who were classified as recidivists. The classification rule was generated via a Generalized Additive Model (GAM) to achieve the target FSR of 25%. We can see that the green bar, which provides the overall FSP for all races, is close to the target value. Moreover, the rule appears to be "fair" for all individuals, regardless of their protected attributes, in the sense that the same threshold has been applied to the base scores (i.e. estimated class probabilities) produced by the same GAM fit. However, the blue and orange bars show that the FSPs for different racial groups differ significantly from 25%, which is clearly not a desirable situation.

This article introduces a new notion of fairness that requires parity in FSR control across various protected groups. This aligns with the social and policy goals in various decision-making scenarios such as selecting recidivists or determining risky loan applicants, where the burden of erroneous classifications should be shared equally among different genders and/or races. However, the development of effective and fair FSR rules is challenging.
First, controlling the error rate associated with a classifier, such as one built around the GAM procedure, critically depends on the accuracy of the scores. However, the assessment of the accuracy/uncertainty of these scores largely remains unknown. Second, we wish to provide practitioners with theoretical guarantees on the parity and validity of FSR control, regardless of the algorithm being used, including complex black-box classifiers. If we build an algorithm around black-box models then it often becomes intractable to compute thresholds for controlling the FSR over multiple protected groups.

To address these issues, we develop a data-driven FASI algorithm, which is specifically designed to control both the overall FSR, and the (protected) group-wise FSRs, below a user-specified level of risk α. The right panel of Figure 4.1 illustrates the FSPs of FASI on the recidivism data. Now, not only is the overall FSP controlled at 25%, but so are the individual race FSPs. FASI works by converting the base score from a black-box classifier to an R-value, which is intuitive, easy to compute, and comparable across different protected groups. We then show that selecting all observations with R-value no greater than α will result in an FSR of approximately α. Hence, we can directly use this R-value to assign new observations a class label or, for those with high R-values, to assign them to the indecision class.

This chapter makes several contributions. First, we introduce a new notion of fairness involving controlling, not only the overall FSR, but also the FSR for designated sub-groups. Ours is not the only approach to fairness and we do not claim that it is universally superior relative to alternative approaches (34, 45, 86). However, for the settings we consider, controlling all of the sub-group FSRs appears to be a reasonable fairness approach. Second, we develop a data-driven FASI algorithm based on the R-value.
The algorithm, which can be implemented with a user-specified classifier, is intuitively appealing and easy to interpret. Third, we provide theoretical results both justifying the use of R-values and the effectiveness of the FASI algorithm for FSR control. The finite-sample theory is established with minimal assumptions: we allow the base scores to be generated from black-box classifiers, and make no assumptions on the accuracy of these scores. Finally, the strong empirical performance of FASI is demonstrated via both simulated and real data.

The rest of this chapter is structured as follows. In Section 4.2 we define the FSR and describe the problem formulation. Section 4.3 introduces the R-value and FASI algorithm. We establish theoretical properties of FASI in Section 4.4. The numerical results for simulated and real data are presented in Sections 4.5 and 4.6, respectively. Section 4.7 concludes the article with a discussion of related works and possible extensions.

4.2 Problem Formulation

Suppose we observe a data set D = {(X_i, A_i, Y_i) : i = 1, ..., n}, where X_i ∈ R^p is a p-dimensional vector of features, A_i ∈ A is an additional feature representing the protected or sensitive attribute, and Y_i is a class label taking values in C = {1, ..., C}. The goal is to predict the classes for multiple individuals with instances (X_{n+j}, A_{n+j}), j = 1, ..., m.

4.2.1 A selective inference framework for binary classification

To focus on key ideas, we mainly consider the binary classification problem in this article. The extension to the general multi-class setting is discussed in Section 4.7. One important application scenario is the prediction of mortgage default, where Y = 2 indicates default and Y = 1 otherwise. The current practice is to use risk assessment software to produce a confidence score, based on which an individual is classified into "high", "medium" or "low" risk classes.
Let S(x, a) denote such a score that maps an instance (x, a) to a real value, with a higher value indicating a higher risk of default. Suppose we observe a new instance (X*, A*) = (x, a). Consider a class of decision rules of the form:

Ŷ = I{S(x, a) < t_l} + 2 · I{S(x, a) > t_u},

where t_l and t_u are thresholds chosen by the investigator to characterize the lower and upper limits of potential risks and I(·) is the indicator function. Ŷ takes three possible values in the action space Λ = {1, 2, 0}, respectively indicating that an individual has low, high and medium risks of default. The value "0", which is referred to as an indecision or reject option in classification (48, 97, 67), is used to express "doubt", reflecting that there is not sufficient confidence to make a definitive decision. For example, an individual with Ŷ = 1 will be approved for a mortgage, with Ŷ = 2 will be rejected, whereas with Ŷ = 0 will be asked to provide additional information and resubmit the application.

Now we turn to a classification task with multiple individuals whose risk scores are given by S_test = {S_{n+j} : 1 ≤ j ≤ m}. Consider the following decision rule

Ŷ = {Ŷ_{n+j} : 1 ≤ j ≤ m} = {I(S_{n+j} < t_l) + 2 · I(S_{n+j} > t_u) : 1 ≤ j ≤ m}.   (1)

We can view (1) as a selective inference procedure, which selects individuals with extreme scores into the high and low risk classes, while returning an indecision on the remainder. The selective inference view provides a flexible framework that allows other types of classification rules. For example, if it is only of interest to select high-risk individuals, then the action space is Λ = {0, 2}, and one can use the following rule

Ŷ = {Ŷ_{n+j} : 1 ≤ j ≤ m} = {2 · I(S_{n+j} > t_u) : 1 ≤ j ≤ m}.   (2)

4.2.2 False selection rate and the fairness issue

In practice, it is desirable to avoid erroneous selections, which often have negative social or economic impacts.
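Rules (1) and (2) above are simple thresholding maps; a minimal sketch with hypothetical scores (our own illustration):

```python
import numpy as np

def selective_rule(scores, t_l, t_u):
    """Rule (1): 1 = low risk, 2 = high risk, 0 = indecision (medium risk)."""
    s = np.asarray(scores)
    return (s < t_l).astype(int) + 2 * (s > t_u).astype(int)

def select_high_risk(scores, t_u):
    """Rule (2): only the high-risk class is selected; all others are indecisions."""
    s = np.asarray(scores)
    return 2 * (s > t_u).astype(int)

print(selective_rule([0.1, 0.5, 0.9], t_l=0.3, t_u=0.7))   # [1 0 2]
print(select_high_risk([0.1, 0.5, 0.9], t_u=0.7))          # [0 0 2]
```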
In the context of the mortgage example, approving an individual who will default (i.e., Ŷ = 1 but Y = 2) would increase the financial burden of the lender, while rejecting an individual who will not default (i.e., Ŷ = 2 but Y = 1) would lead to loss of profit. In situations where m is large, controlling the inflation of selection errors is a crucial task for policy makers. A practically useful notion is the false selection rate (FSR), which is defined as the expected fraction of erroneous decisions among all definitive decisions. We use the notation FSR^{C'}, where C' ⊂ C = {1, 2} is the set of class labels that we are interested in selecting. To illustrate the definition, we consider two scenarios. In the first, we select individuals from both classes using rule (1). Denote S = {1 ≤ j ≤ m : Ŷ_{n+j} ≠ 0} the index set of the selected cases and |S| its cardinality. Then we have

FSR^{1,2} = E[ Σ_{j∈S} I(Ŷ_{n+j} ≠ Y_{n+j}) / (|S| ∨ 1) ],   (3)

where x ∨ y = max{x, y}, and the exact meaning of E will become clear in Section 4.3 after we explicitly spell out our algorithm. In the second scenario, the goal is to select individuals in class c = 2 using rule (2). Then

FSR^{2} = E[ Σ_{j=1}^{m} I(Ŷ_{n+j} = 2, Y_{n+j} ≠ 2) / ({Σ_{j=1}^{m} I(Ŷ_{n+j} = 2)} ∨ 1) ].   (4)

FSR^{1} can be defined similarly. By allowing for indecisions, one can find a decision rule that simultaneously controls both FSR^{1} and FSR^{2} at a small user-specified level. (Footnote 1: However, such a goal may not be achievable under the standard setup for classification, which does not allow indecisions. For example, it is impossible to make FSR^{1} and FSR^{2} simultaneously small when the minimum condition on the classification boundary (73, 21) fails to hold.)

The FSR is a general concept for selective inference that encompasses important special cases such as the standard misclassification rate, the false discovery rate (FDR; 9) and beyond. If we set both the state space and action space to be {1, 2}, so there are no indecisions, then the FSR defined by (3) reduces to m^{-1} E{Σ_{j=1}^{m} I(Ŷ_{n+j} ≠ Y_{n+j})}, i.e. the standard misclassification rate. To see the connection to the FDR, consider a multiple testing problem with H_{j0}: Y_{n+j} = 2 vs. H_{j1}: Y_{n+j} = 1, j = 1, ..., m. The state space is C = {1, 2}. A multiple testing procedure Ŷ = {Ŷ_{n+j} : 1 ≤ j ≤ m} ∈ {0, 1}^m corresponds to a selection rule that aims to select cases in class 1 only. The action space Λ = {1, 0} differs from the state space C, with Ŷ_{n+j} = 1 indicating that H_{j0} is rejected, and Ŷ_{n+j} = 0 indicating that there is not enough evidence to reject H_{j0}. Then FSR^{1} precisely yields the widely used FDR, the expected fraction of false rejections among all rejections. We use the expected proportion of indecisions (EPI) to describe the power concept (the smaller the EPI the larger the power):

EPI = (1/m) E{Σ_{j=1}^{m} I(Ŷ_{n+j} = 0)} = 1 − E(|S|)/m.   (5)

Compared to erroneous decisions, the losses incurred due to indecisions are less consequential. This leads to a constrained optimization problem where the goal is to develop a selective rule that satisfies FSR ≤ α while making the EPI as small as possible.

Next we turn to the important fairness issue in selective inference. We argue that simply controlling FSR^{c} may not be satisfactory in many contexts. A major concern is that the rate of erroneous decisions might be unequally shared between the protected groups, as illustrated in the COMPAS example. To address this issue, it is desirable to control the FSR for each protected attribute A. In particular, we require that the group-wise FSRs must satisfy:

FSR^{c}_{a} = E[ Σ_{j=1}^{m} I(Ŷ_{n+j} = c, Y_{n+j} ≠ c, A_{n+j} = a) / ({Σ_{j=1}^{m} I(Ŷ_{n+j} = c, A_{n+j} = a)} ∨ 1) ] ≤ α_c, for all a ∈ A.   (6)

We aim to develop a classification rule that fulfills the fairness criterion (6).
This formulation, which adopts a fairness-adjusted error rate constraint, equally bounds the fraction of erroneous decisions among protected groups.

4.2.3 From fair scores to fair classifiers: issues and roadmap

We investigate the important issue of what makes a "fair" classifier. In most classification tasks, the standard operation is to first construct a "base score", and then turn the score into a decision by setting a threshold. Consider a thresholding rule of the form

Ŷ = {c · I(S^c_{n+j} > t) : 1 ≤ j ≤ m}.   (7)

Now we present two possible approaches to calculating S^c, and hence, selecting individuals into class c. The first, referred to as the "full covariate classifier" (FCC), thresholds the scores calculated based on

S^c(x, a) = Pr{Y = c | X = x, A = a}.   (8)

The score is used to assess the likelihood of being in class c given all observed characteristics of an individual. However, incorporating a sensitive attribute into a classifier without appropriate adjustments can lead to unfair decisions. To illustrate, consider the mortgage setup where we simulate a data set that contains a sensitive attribute "gender". The goal is to select individuals into the high risk class with FSR control at 10%; the simulation setup is detailed in Section 4.5. We highlight here that the proportions of individuals with label "2" (default) are different across the groups: for the male group, the proportion p_M is fixed at 50%, whereas for the female group the proportion p_F varies from 15% to 85%. We plot the overall FSR and group-wise FSRs as functions of p_F on the left panel of Figure 4.2. We can see that FCC controls the overall FSR but not the group-wise FSRs. Thresholding rules based on (8) are harmful in the sense that the burden of erroneous decisions is not shared equally among the two gender groups.
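The group-wise criterion (6) has a simple empirical counterpart, the per-group false selection proportion; a minimal sketch with hypothetical labels (our own illustration):

```python
import numpy as np

def groupwise_fsp(y_hat, y_true, group, c=2):
    """Empirical false selection proportion for class c within each protected group:
    among cases selected into class c, the fraction whose true label is not c."""
    y_hat, y_true, group = map(np.asarray, (y_hat, y_true, group))
    fsp = {}
    for a in np.unique(group):
        sel = (y_hat == c) & (group == a)
        fsp[a] = ((y_true != c) & sel).sum() / max(sel.sum(), 1)
    return fsp

y_hat  = np.array([2, 2, 0, 2, 2, 0, 2, 2])
y_true = np.array([2, 1, 2, 2, 2, 1, 1, 1])
group  = np.array(["F", "F", "F", "F", "M", "M", "M", "M"])
fsp = groupwise_fsp(y_hat, y_true, group)
print(fsp)  # group F: 1 error in 3 selections; group M: 2 errors in 3 selections
```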
Figure 4.2: All plots have a fixed proportion of signals (50%) from the male group and varying proportions of signal (from 15% to 85%) for the female group. For FCC and RCC, the degree of unfairness increases as p_M and p_F become more disparate. FASI ensures that the group-wise FSRs are effectively controlled and equalized.

The second approach, referred to as the "reduced covariate classifier" (RCC), is to apply (7) by removing the sensitive attribute A:

S^c(x) = Pr{Y = c | X = x}.   (9)

However, this approach, as illustrated in the middle panel of Figure 4.2, can be harmful as well. While the overall FSR is still controlled at 10%, the issue of unfairness is not mitigated but aggravated, with widened gaps in the group-wise FSRs. We mention two additional drawbacks of the RCC approach. First, ignoring an informative sensitive attribute can lead to substantial power loss. Second, the feature X can be highly predictive of the sensitive attribute A; hence the classifier is likely to form a surrogate encoding of the sensitive attribute based on other features, leading to unfair decisions in a similar fashion as if (8) were used.

To equalize the FSRs among the protected groups, we can consider two strategies: (a) construct new scores by adjusting the current base scores so that the new scores are fair and comparable across groups, or (b) stick to the original base scores and set varied group-adjusted thresholds. Strategy (a), which applies a universal threshold to all individuals, is appealing because the decision making process would be straightforward once adjusted scores are given to practitioners, and the new scores are comparable across the groups. By contrast, strategy (b), although working equally effectively for addressing the fairness issue, can be less intuitive and nontrivial to implement.
For practitioners without a full understanding of the underlying algorithm, strategy (b) can be confusing and even controversial as varied thresholds are being used for different protected groups, causing another level of concern about possible discrimination. As a result, we adopt strategy (a).

We call a score S^c (with respect to class c) unfair or illegal if thresholding rule (7), which applies a universal cutoff t to all scores, produces larger FSRs for one group over the others. We can see from Figure 4.2 that both conditional probabilities (8) and (9) are unfair scores. Two key issues in strategy (a) are: (i) how to construct fair scores and (ii) how to go from fair scores to fair classifiers. Next we develop a data-driven FASI algorithm based on the R-value to address the two issues integrally. The empirical performance of our algorithm is illustrated on the right panel of Figure 4.2. We can see that the FSRs are effectively controlled at the nominal level and properly equalized across the protected groups.

4.3 Methodology

This section develops a fairness-adjusted FSR controlling procedure for binary classification. We focus on the goal of controlling the FSR^c defined in (4). The methodologies for controlling the FSR of the form (3) and the case of multinomial classification will be briefly discussed in Section 4.7.

A major challenge in our methodological development is that most state-of-the-art classifiers are constructed based on complex black-box models, which do not offer theoretical guarantees about the outputs. This makes the uncertainty assessment, particularly the analysis of error rates, very challenging and even intractable. We address the challenge by developing a model-free framework that applies to any black-box classifier, and our methodology only leverages mild assumptions on exchangeability between the observed and future data.
In contrast with existing theories that are asymptotic and require strong assumptions on models or classifiers, the new framework not only eliminates these assumptions, but also yields a powerful finite-sample theory on FSR control.

4.3.1 The R-value and FASI algorithm

We first introduce a significance index, called the R-value, for ranking individuals and then discuss how the R-values can be converted to a selection rule via thresholding.

The FASI algorithm consists of three steps: training, calibrating and thresholding. The observed data set D = {(X_i, A_i, Y_i) : 1 ≤ i ≤ n} is divided into a training set and a calibration set: D = D^train ∪ D^cal. The testing set is denoted D^test. The first step trains a score function, denoted Ŝ^c(x, a), on D^train. By convention a larger score indicates a higher probability of belonging to class c. The scores, often representing estimated class probabilities, can be generated from any user-specified classifier. We make no assumptions on the accuracy of these scores. The only consequence of inaccurate scores is that the number of indecisions will increase, meaning that more effort will be needed to perform separate evaluations.

In the second step, we first calculate Ŝ^c(x, a) for individuals in both D^cal and D^test, then calibrate an R-value for every individual in D^test. Specifically, denote R̂^c_{n+j} ≡ R̂^c_{n+j}(X_{n+j} = x, A_{n+j} = a) and ŝ ≡ Ŝ^c(X_{n+j} = x, A_{n+j} = a), we define

R̂^c_{n+j} = [ {Σ_{i∈D^cal} I(A_i = a, Ŝ^c_i ≥ ŝ, Y_i ≠ c) + 1} / (n^cal_a + 1) ] / [ {Σ_{i∈D^test} I(A_i = a, Ŝ^c_i ≥ ŝ)} / m_a ]   if A_{n+j} = a,   (10)

where m_a = Σ_{i∈D^test} I(A_i = a) and n^cal_a = Σ_{i∈D^cal} I(A_i = a). In situations where the test set is small, we consider the following modified R-value that utilizes both the test and calibration set in the denominator:

R̂^{c,+}_{n+j} = [ {Σ_{i∈D^cal} I(A_i = a, Ŝ^c_i ≥ ŝ, Y_i ≠ c) + 1} / (n^cal_a + 1) ] / [ {Σ_{i∈D^test ∪ D^cal} I(A_i = a, Ŝ^c_i ≥ ŝ) + 1} / (m_a + n^cal_a + 1) ]   if A_{n+j} = a,   (11)

for 1 ≤ j ≤ m.
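The calibration step in (10) can be sketched as follows; this is our own minimal illustration with hypothetical scores and groups, not the authors' reference implementation:

```python
import numpy as np

def r_values(s_test, a_test, s_cal, a_cal, y_cal, c=2):
    """Group-adjusted R-values following eq. (10): for each test point, the
    calibration "mirror" count of wrong-class high scorers in its group,
    relative to the count of test high scorers in the same group."""
    s_test, s_cal, y_cal = map(np.asarray, (s_test, s_cal, y_cal))
    a_test, a_cal = np.asarray(a_test), np.asarray(a_cal)
    r = np.ones(len(s_test))
    for a in np.unique(a_test):
        cal, tst = (a_cal == a), (a_test == a)
        n_cal_a, m_a = cal.sum(), tst.sum()
        for j in np.where(tst)[0]:
            s = s_test[j]
            num = (((s_cal >= s) & (y_cal != c) & cal).sum() + 1) / (n_cal_a + 1)
            den = ((s_test >= s) & tst).sum() / m_a
            r[j] = min(1.0, num / den)  # R-values are capped at 1
    return r

# Toy example: one group, four calibration points, one test point.
r = r_values(s_test=[0.85], a_test=["x"],
             s_cal=[0.9, 0.8, 0.2, 0.1], a_cal=["x"] * 4, y_cal=[2, 1, 1, 1])
print(r)  # [0.2]
```

Thresholding these R-values at a target level then gives the selection rule described next.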
Section C.4.1 in the supplement shows that the R + -value (11) provides a more stable score than the R-value (10) when|D test | is small. Now we provide some intuition behind the R-value. Roughly speaking, the R-value correspondstothesmallestgroup-wiseFSRsuchthatthe(n+j)thindividualisjust selected. In other words, if we make the cut at ˆ R = r, e.g. selecting all instances with R-values less than or equal to r into class c, then we expect that, for every group a∈A, approximately 100r% of the selected cases do not belong to class c. This interpretation is similar to the q-value (94) in FDR analysis. As the R-value can be interpreted as a fraction, we set the R-value to 1 if the quantity in (10) or (11) exceeds 1. The fairness notion has been naturally incorporated into the (group- adjusted) R-value, making it possible to calibrate a universal threshold equalizing the FSRs across the groups. More explanations of (10) and (11) are provided in the next section. In the third thresholding step, we compare the R-value defined in (10) or (11) with a pre-specified FSR level α c . For example, if we are interested in selecting individuals into class c, then the decision rule is ˆ Y Y Y = c· I(R c n+j ≤ α c ):1≤ j≤ m , wherethethresholdissimplytheuser-specified α c . AccordingtotheinterpretationoftheR- value, this step amounts to choosing the largest possible threshold subject to the constraint on group-wise FSRs. Intuitively this can best reduce the wasted opportunities induced by indecisions. If we are interested in selecting both classes, then the decision rule is ˆ Y Y Y = P 2 c=1 c· I(R c n+j ≤ α c ):1≤ j≤ m . To avoid assigning an individual to multiple classes, we classify the individual to the class with the smaller R-value when there is overlapping selection. The proposed FASI algorithm has several attractive properties. 
First, as we explain shortly, the R-value provides an estimate of a fraction, which is standardized between 0 and 1, comparable across protected groups, and easily interpretable. Second, the FSR analysis via R-values is straightforward: practitioners can make decisions directly with the R-values; the threshold is simply the user-specified FSR level. The fairness consideration, wrapped up in the R-value definition, is addressed in a proper and clean way. Finally, the FASI algorithm is model-free and offers a powerful theory on FSR control; this is discussed next.

4.3.2 Why FASI works

Now we explain why the FASI algorithm works. The effectiveness of our algorithm leverages only a mild exchangeability assumption.

Assumption 1. The calibration data and testing data are independent. The triples $\{(X_i, A_i, Y_i) : i \in \mathcal D^{cal} \cup \mathcal D^{test}\}$ are exchangeable.

We start by explaining why the R-value provides a sensible estimate of the FSR. To simplify the discussion, we ignore the sensitive attribute $A$ for the moment and consider a thresholding rule of the form $\hat{\boldsymbol Y} = \{I(\hat S^c_{n+j} \ge t) : 1 \le j \le m\}$. Consider the false selection proportion (FSP) process for $\mathcal D^{test}$:
$$\mathrm{FSP}(t) = \frac{\sum_{i\in\mathcal D^{test}} I\big(\hat S^c_i \ge t,\ Y_i \ne c\big)}{\sum_{i\in\mathcal D^{test}} I\big(\hat S^c_i \ge t\big)}, \qquad (12)$$
with $\mathrm{FSP}(t)=0$ if no individual is selected. The FSP cannot be computed from data because we do not observe the true states $\{Y_i : i \in \mathcal D^{test}\}$. The good news is that under Assumption 1 on exchangeability, the unobserved process $\sum_{i\in\mathcal D^{test}} I(\hat S^c_i \ge t, Y_i \ne c)$ will strongly resemble its "mirror process" in the calibration data, $\sum_{i\in\mathcal D^{cal}} I(\hat S^c_i \ge t, Y_i \ne c)$. Constructing a mirror process and exploiting the symmetry property to make inference is a powerful idea that has been explored in recent works (e.g. 4, 32). Finally, adjusting for the possibly unequal sample sizes between $\mathcal D^{cal}$ and $\mathcal D^{test}$, we obtain the R-value process
$$\hat R^c(t) = \frac{\frac{1}{n^{cal}+1}\left\{\sum_{i\in\mathcal D^{cal}} I\big(\hat S^c_i \ge t,\ Y_i \ne c\big)+1\right\}}{\frac{1}{m}\sum_{i\in\mathcal D^{test}} I\big(\hat S^c_i \ge t\big)}, \qquad (13)$$
where $m$ and $n^{cal}$ are respectively the cardinalities of $\mathcal D^{test}$ and $\mathcal D^{cal}$.
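The resemblance between the unobserved test-side count and its observable calibration mirror can be checked numerically. The following small Monte Carlo sketch uses an assumed data-generating process of our own (not from the dissertation) in which the calibration and test data are exchangeable:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw(n):
    # exchangeable draws: class 2 with prob 0.5; score = noisy class signal
    y = rng.integers(1, 3, size=n)
    s = np.clip(0.5 * (y - 1) + rng.normal(0.25, 0.15, size=n), 0, 1)
    return s, y

s_cal, y_cal = draw(5000)
s_test, y_test = draw(5000)

t = 0.6
false_test = np.sum((s_test >= t) & (y_test != 2))  # unobservable in practice
false_cal = np.sum((s_cal >= t) & (y_cal != 2))     # its observable mirror

# The two per-sample rates agree closely under exchangeability.
print(false_test / 5000, false_cal / 5000)
```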
Finally, the fairness-adjusted R-value defined in (10) can be recovered by restricting the R-value process to a specific group $a \in \mathcal A$ and substituting $\hat s$ in place of $t$ in (13). The $R^+$-value defined by (11) can be conceptualized in a similar fashion.

Remark 5. Comparing (12) and (13), we note that a "+1" has been incorporated into the count of false selections on $\mathcal D^{cal}$. This technical adjustment has virtually no impact on the empirical performance of FASI. However, it ensures that (13) effectively leads to a martingale, which is essential for proving the theory. It is natural to apply the same "+1" adjustment to $n^{cal}$, which makes the algorithm slightly more powerful.

Next we state a theorem that establishes the finite-sample property of FASI. Our theory fundamentally departs from those in existing works: we do not make assumptions regarding the accuracy of $\hat S^c_i$. The accuracy of the classifier affects only the power, not the validity of FSR control. See Section 4.4.3 for practical guidelines on how to construct more informative classifiers/R-values.

Theorem 10. Define $\gamma_{c,a} = E\big[p^{test,a}_{c,null}\big/p^{cal,a}_{c,null}\big]$, where $p^{test,a}_{c,null}$ and $p^{cal,a}_{c,null}$ are the proportions of individuals in group $a$ that do not belong to class $c$ in the test and calibration data, respectively. Then under Assumption 1, we have, for all $a \in \mathcal A$:

1. The FASI algorithm with R-value (10) satisfies $\mathrm{FSR}^{\{c\}}_a \le \gamma_{c,a}\,\alpha_c$;

2. The FASI algorithm with $R^+$-value (11) satisfies $\mathrm{FSR}^{\{c\},*}_a \le \gamma_{c,a}\,\alpha_c$, where
$$\mathrm{FSR}^{\{c\},*}_a = E\left[\frac{\sum_{j=1}^m I\big(\hat Y_{n+j}=c,\ Y_{n+j}\ne c,\ A_{n+j}=a\big)}{\sum_{j=1}^m I\big(\hat Y_{n+j}=c,\ A_{n+j}=a\big)+1}\right]. \qquad (14)$$

Remark 6. In the modified FSR definition (14), the "+1" adjustment accounts for the extra uncertainty in the approximation of the number of rejections. A similar modification, in the context of FDR analysis but for different reasons, has been used in Theorem 1 of [4].

Under Assumption 1, $\gamma_{c,a}$ tends to be very close to 1, resulting in nearly exact control.
This is verified in our simulation studies and real data analyses, which can be found in Appendix Section C.4.2. However, due to the stochastic nature of the ratio, FASI may yield FSRs that deviate from the nominal level. The next corollary shows that a conservative version of the R-value guarantees that the FSR is controlled below $\alpha$.

Corollary 3. Suppose we apply the FASI algorithm with the conservative R-values
$$\tilde R^c_{n+j} = \frac{n^{cal}_a+1}{n^{cal,c}_{a,null}+1}\, \hat R^c_{n+j}, \qquad \tilde R^{c,+}_{n+j} = \frac{n^{cal}_a+1}{n^{cal,c}_{a,null}+1}\, \hat R^{c,+}_{n+j}, \qquad (15)$$
for $1 \le j \le m$, where $n^{cal}_a = \sum_{i\in\mathcal D^{cal}} I(A_i=a)$ and $n^{cal,c}_{a,null} = \sum_{i\in\mathcal D^{cal}} I(A_i=a, Y_i\ne c)$. Further define $n^{test}_a = \sum_{j\in\mathcal D^{test}} I(A_j=a)$ and $n^{test,c}_{a,null} = \sum_{j\in\mathcal D^{test}} I(A_j=a, Y_j\ne c)$. Then the group-wise FSRs satisfy (a) $\mathrm{FSR}^{\{c\}}_a \le E\big[n^{test,c}_{a,null}/n^{test}_a\big]\,\alpha$ for all $a\in\mathcal A$ when the $\tilde R^c_{n+j}$ are used, and (b) $\mathrm{FSR}^{\{c\},*}_a \le E\big[n^{test,c}_{a,null}/n^{test}_a\big]\,\alpha$ for all $a\in\mathcal A$ when the $\tilde R^{c,+}_{n+j}$ are used.

Remark 7. Corollary 3 implies that the FSR levels are controlled strictly at or below $\alpha$. The ratio $n^{test,c}_{a,null}/n^{test}_a$, referred to as the null proportion in multiple testing, also appears in the (conservative) Benjamini-Hochberg (BH) procedure for FDR control. The connection between FASI and BH will be discussed shortly and elaborated in the Appendix. It is anticipated that the FASI algorithm with conservative R-values (15) may be improved by methods that incorporate the unknown ratio $n^{test,c}_{a,null}/n^{test}_a$. This idea has been used in [10] and [93] to improve the power of BH in the context of FDR control. The FASI algorithm with the original R-values may be viewed as such an approach in the sense that it can be recovered by first estimating the unknown ratio $n^{test,c}_{a,null}/n^{test}_a$ by $(n^{cal,c}_{a,null}+1)/(n^{cal}_a+1)$, and then applying the FASI algorithm with the conservative R-values at level $(n^{cal}_a+1)/(n^{cal,c}_{a,null}+1)\,\alpha$.
This leads to a power improvement at the price of the additional factor $\gamma_{c,a}$ in Theorem 10. The rest of this article does not focus on the conservative R-values, since they usually result in a higher proportion of indecisions, while the original R-values are simple and intuitive and offer almost exact control in practice.

Finally, we mention that the R-value has a nice interpretation under the conformal inference framework. Section C.1.3 in the Supplementary Material shows that a variation of our R-value corresponds to the Benjamini-Hochberg (BH) adjusted q-value of the conformal p-values (6) under the one-class classification setting. The connection to conformal inference and the BH method, both of which are model-free, provides insight into why the FASI algorithm is assumption-lean and offers effective FSR control in finite samples, as claimed in Theorem 10.

4.4 Theoretical Issues

In this section we discuss a few important theoretical issues, which provide an in-depth investigation of the R-value and place our contributions in context. Section 4.4.1 explains the challenges and novelty of our FSR theory and sketches the main ideas in the proof of Theorem 10. In Sections 4.4.2 and 4.4.3, we introduce the theoretical R-value and derive the optimal score function under a simplified setup. The theory, which is intuitive and sensible, provides insights for practitioners on how to train score functions for constructing informative R-values.

4.4.1 Theory on FSR control: main ideas and contributions

We explain the main idea behind the proof of Theorem 10. The discussion focuses on the R-value process (13), but can be easily extended to the group-adjusted R-value (10).
Three major challenges in the theoretical analysis are: (a) how to handle the unknown and complex dependence between the scores $\hat S^c_i$ [as the same training data have been used to compute the scores in (13)]; (b) how to evaluate the FSR in classification without knowledge of the theoretical properties of the scores; and (c) how to develop non-asymptotic guarantees on the performance of the FASI algorithm in finite samples.

Inspired by the elegant ideas in the FDR literature [95, 4], we have carefully constructed the R-values so that the corresponding FSP process (13) can be stochastically bounded above by a martingale. In the proof of Theorem 10, we first show that the threshold induced by the FSP process is a stopping time, and then apply Doob's optional stopping theorem to obtain an upper bound for the expectation of the martingale. Finally, we leverage the exchangeability assumption to cancel out the cardinality adjustments and establish the upper bound for the FSR. We stress that our theory uses no assumptions on the underlying models or the quality of the scores, and the control is exact for finite samples.

A closely related idea is the conformal p-value, which was independently proposed in recent work by [6]. In Section C.1.3 of the Supplementary Material, we show that the R-value, which was motivated from a very different perspective, can be derived as the BH q-value of conformal p-values under the one-class classification setting [74, 60, 59]. The theory on FDR control in [6] must deal with a similar complication, as the conformal p-values are also dependent. [6] adopted a novel approach by first showing that the conformal p-values satisfy the PRDS (i.e. positive regression dependent on a subset) condition, and then applying the theory in [11] to prove FDR control. We conjecture that the PRDS approach could be extended to establish our theory.
However, the extension seems to be non-trivial because, as we point out at the end of Section C.1.3, under the binary classification setup our R-values do not utilize conformal p-values explicitly. Our direct application of the martingale theory appears to be simpler and equally effective.

4.4.2 Theoretical R-value

Our discussions in the next two subsections assume an oracle with access to all distributional information, make several simplifying assumptions, and are purely theoretical. The major goal is to develop a theoretical version of the R-value and an optimality theory for FSR control. Our discussions are based on the following mixture model:
$$F(x) = \sum_{a\in\mathcal A} \big\{\pi_{1,a} F_{1,a}(x) + \pi_{2,a} F_{2,a}(x)\big\}, \qquad (16)$$
where $F_{c,a}$ is the conditional CDF of $X$ from class $c$ with attribute $a$ and $\pi_{c,a} = P(Y=c, A=a)$, $c=1,2$. Denote $\pi_a = P(A=a)$, $\pi_{c|a} = P(Y=c \mid A=a)$, and let $f_{c,a}(x)$ be the corresponding conditional density functions.

Consider a selection rule of the form $\hat Y(t) = c\cdot I(S^c \ge t)$, where $t$ denotes a threshold. Suppose an oracle knows the conditional probabilities and conditional CDFs defined above. In Appendix C.1.1, we discuss an algorithm that converts the base score $S^c = s$ (possibly unfair) to a fair score. Specifically, for an individual from group $a$ with $S^c = s$, the conversion algorithm yields the following theoretical R-value:
$$R^c(s) = \inf_{t\le s}\big\{ Q^c_a(t) \big\}, \quad \text{where } Q^c_a(t) = P\big(Y\ne c \mid \hat Y(t)=c,\ A=a\big) \qquad (17)$$
is the conditional error probability when the threshold is $t$.² We shall see that the group-wise FSR (6) is closely connected to $Q^c_a(t)$ (Appendix C.1.2), and that the theoretical R-value can be viewed as the counterpart of the data-driven R-value defined in (10). It corresponds to the smallest conditional error probability such that the individual with score $S^c = s$ is just selected into class $c$.

The next proposition, which follows directly from (17), shows that thresholding the theoretical R-value leads to a fair selective inference procedure.

Proposition 2.
Consider a classifier that claims $\hat Y = c$ if $R^c \le \alpha$. Then
$$P\big(Y \ne c \mid \hat Y = c,\ A = a\big) \le \alpha \quad \text{for all } a \in \mathcal A. \qquad (18)$$

² If the base score satisfies the monotone likelihood ratio condition (MLRC, 24), then the infimum is achieved at $s$ exactly. See Section C.3.1 for related discussions.

The theoretical R-value is a fundamental quantity that is closely connected to the sufficiency principle in the fairness literature (Section 4.7.1). It also plays a central role in developing the optimality theory (next section).

4.4.3 A sketch of the optimality theory

We state and prove an intuitive result: $S^c(x,a) = P(Y=c \mid X=x, A=a)$ is the optimal choice of score function for calibrating the theoretical R-value. A few simplifying assumptions will be required.

Consider the random mixture model (16). Suppose an oracle knows the score function $S^c(x,a) = P(Y=c \mid X=x, A=a)$. The goal is to assign labels "0", "1" and "2" to new instances $\{(X_{n+j}, A_{n+j}) : 1 \le j \le m\}$. We assume that the instances $(X_j, A_j)$ are independent draws from an underlying distribution $F(x,a)$. Our optimality theory will be developed based on a variation of the FSR referred to as the marginal FSR:
$$\mathrm{mFSR}^c_a = \frac{E\left\{\sum_{1\le j\le m:\ A_{n+j}=a} I\big(\hat Y_{n+j}=c,\ Y_{n+j}\ne c\big)\right\}}{E\left\{\sum_{1\le j\le m:\ A_{n+j}=a} I\big(\hat Y_{n+j}=c\big)\right\}}.$$
The relationship between the mFSR and FSR is discussed in Section C.1.2.

We aim to develop a selection rule under the binary classification setting that solves the following constrained optimization problem:
$$\text{minimize the EPI subject to } \mathrm{mFSR}^c_a \le \alpha_c,\ c=1,2,\ \text{for all } a\in\mathcal A. \qquad (19)$$

Denote $S^c_{n+j} = P(Y_{n+j}=c \mid X_{n+j}=x_{n+j}, A_{n+j}=a_{n+j})$. The scores can be converted to theoretical R-values, denoted $R^1_{n+j}$ and $R^2_{n+j}$; the process of conversion is described in the proof of Theorem 11. Define the oracle procedure $\boldsymbol\delta^{OR} = \{\delta^{OR}_j : 1\le j\le m\}$, where
$$\delta^{OR}_j = I\big(R^1_{n+j}\le\alpha_1\big) + 2\, I\big(R^2_{n+j}\le\alpha_2\big). \qquad (20)$$

The optimality of the oracle procedure is established in the next theorem.

Theorem 11. Consider the random mixture model (16).
Assume that $\alpha_1$ and $\alpha_2$ have been properly chosen such that (20) does not have overlapping selections. Let $\mathcal D_{\alpha_1,\alpha_2}$ denote the collection of selection rules that satisfy $\mathrm{mFSR}^c_a \le \alpha_c$ for $c=1,2$ and all $a\in\mathcal A$. Let $\mathrm{EPI}_{\boldsymbol\delta}$ denote the EPI of an arbitrary decision rule $\boldsymbol\delta$. Then the oracle procedure (20) is optimal in the sense that $\mathrm{EPI}_{\boldsymbol\delta^{OR}} \le \mathrm{EPI}_{\boldsymbol\delta}$ for any $\boldsymbol\delta \in \mathcal D_{\alpha_1,\alpha_2}$.

The optimality theory has several important implications for practitioners. The choice of optimal score indicates that in the training stage we should construct the score functions using all features, including the sensitive attribute $A$, to best capture the individual-level information. Scores trained without the sensitive attribute [e.g. (9)] are usually suboptimal. The adjustments for fairness should be made not in the training stage but in the calibration stage, where the fully informative scores can be converted to R-values to adjust the disparity in error rates across the groups.

A by-product of our theory is the finding that the optimal selection rule also equalizes the group-wise error rates. The intuition is that in order to minimize the EPI, the pre-specified mFSR levels must be exhausted in all separate groups; hence the group-wise mFSRs are all equal to the nominal level (and automatically equalized). We conjecture that asymptotic equality holds for our FASI algorithm, which has been corroborated by our numerical studies. A full analysis is complicated by the dependence between the scores; we leave this for future research.

4.5 Simulation Results

This section presents the results from two simulation scenarios comparing FASI to the Full Covariate Classifier (FCC). The Restricted Covariate Classifier (RCC) is not included since, in our simulations, it has systematically larger deviations from the target group-wise FSR levels than the FCC. We demonstrate that both the oracle and data-driven versions of FASI are able to correctly control the group-wise FSRs, while the FCC fails to do so.
The oracle versions of FASI and FCC use the exact posterior probabilities for $S_i$, defined in Equation 8, while the data-driven versions estimate $S_i$ via a GAM classifier [46, 54]. In our experience, similar patterns are observed when other classifiers are used to construct the base scores.

All simulations are run with sample sizes of $|\mathcal D|=2{,}500$ and $|\mathcal D^{test}|=1{,}000$. We generate $\mathcal D^{train}$ and $\mathcal D^{cal}$ using a random split of $\mathcal D$, with $|\mathcal D^{train}| = 1{,}500$ and $|\mathcal D^{cal}| = 1{,}000$. We use gender as our protected attribute, taking two values $A = F$ (females) and $A = M$ (males). The feature vectors $\boldsymbol X \in \mathbb R^3$ are simulated according to model (16) with four components:
$$F(\cdot) = \pi_M\big\{\pi_{1|M} F_{1,M}(\cdot) + \pi_{2|M} F_{2,M}(\cdot)\big\} + \pi_F\big\{\pi_{1|F} F_{1,F}(\cdot) + \pi_{2|F} F_{2,F}(\cdot)\big\},$$
where $\pi_a = P(A=a)$, $\pi_{c|a} = P(Y=c \mid A=a)$ and $F_{c,a}$ is the conditional distribution of $\boldsymbol X$ given $Y=c$ and $A=a$. Let $\pi_M = \pi_F = 0.5$, i.e. the numbers of females and males in the data set are equal. We consider two scenarios in our simulation study that follow this setup.

In the first scenario, the conditional distributions of $\boldsymbol X$ given class $Y$ are assumed to be multivariate normal and identical for males and females:
$$F_{1,M} = F_{1,F} = N(\boldsymbol\mu_1,\ 2\cdot I_3), \qquad F_{2,M} = F_{2,F} = N(\boldsymbol\mu_2,\ 2\cdot I_3),$$
where $I_3$ is a $3\times 3$ identity matrix, $\boldsymbol\mu_1 = (0,1,6)^\top$ and $\boldsymbol\mu_2 = (2,3,7)^\top$. The only difference between males and females is in the conditional proportions: we fix $\pi_{2|M} = P(Y=2 \mid A=M) = 0.5$, while varying $\pi_{2|F} = P(Y=2 \mid A=F)$ from 0.15 to 0.85.

Figure 4.3: Simulation 1. Top row: The oracle procedure.
Bottom row: A data-driven GAM fitting procedure. Left and middle columns: $\mathrm{FSR}_{1,a}$ and $\mathrm{FSR}_{2,a}$ for both females and males. Right column: the expected proportion of indecisions (EPI). All plots vary the proportion of true class 2 observations from the Female protected group, $\pi_{2|F}$.

We simulate 1,000 data sets and apply both the FCC and FASI [with R-values defined in (11)] at FSR level 0.1 to the simulated data sets. The FCC method ignores the protected attributes when computing the R-values, i.e.
$$\hat R^{c,FCC}_{n+j} = \frac{\frac{1}{n^{cal}+1}\left\{\sum_{i\in\mathcal D^{cal}} I\big(\hat S^c_i \ge \hat s,\ Y_i\ne c\big)+1\right\}}{\frac{1}{m+n^{cal}+1}\left\{\sum_{i\in\mathcal D^{test}\cup\mathcal D^{cal}} I\big(\hat S^c_i \ge \hat s\big)+1\right\}}.$$
The corresponding selection rule is $\hat{\boldsymbol Y}^{FCC} = \big\{c\cdot I\big(R^{c,FCC}_{n+j} \le \alpha_c\big) : 1\le j\le m\big\}$.

The FSR levels are computed by averaging the respective FSPs from 1,000 replications. The simulation results are summarized in Figure 4.3. The first and second rows respectively correspond to the oracle and data-driven versions of each method. The first two columns respectively plot the group-wise FSRs for class 1 and class 2 as functions of $\pi_{2|F}$. The final column plots the expected proportion of indecisions (EPI), obtained by averaging the results from 1,000 replications. The following patterns can be observed.

Figure 4.4: Simulation 2. Comparable setup to Simulation 1, except that the female and male distributions now differ from each other. However, the results are similar, with FASI correctly controlling the FSRs.

• Both the FASI method and FCC control the global FSR. For simplicity, we do not include these results in the figures below.
• Shifting our focus to the group-wise FSRs, the FCC fails to control the error rate. When $\pi_{2|F} = 0.5$, by construction we have $\pi_{2|F} = \pi_{2|M}$, making the Female and Male attributes indistinguishable. However, as $\pi_{2|F}$ moves away from $\pi_{2|M} = 0.5$, the gap between the FSR control for Females and Males dramatically widens due to the asymmetry in the proportions of the signals (true class 2 observations) in the male and female groups.

• In comparison, both the oracle and data-driven FASI algorithms are able to roughly equalize the group-wise FSRs between the Female and Male groups. The data-driven version of FASI closely mirrors the behavior of the oracle method. The FSR control is in general effective, except that the FSR levels are slightly elevated in the tails.

• The parity in FSR control is achieved at the price of slightly higher EPI levels.

Our second simulation scenario considers the setting where $F_{c,M} \ne F_{c,F}$ for both $c=1,2$. Denoting the mean of each distribution for class $c$ and protected attribute $a$ as $\boldsymbol\mu_{c,a}$, the data are generated from $F_{c,a} = N(\boldsymbol\mu_{c,a},\ 2\cdot I_3)$, with components $\boldsymbol\mu_{1,M} = (0,1,6)^\top$, $\boldsymbol\mu_{2,M} = (2,3,7)^\top$, $\boldsymbol\mu_{1,F} = (1,2,7)^\top$ and $\boldsymbol\mu_{2,F} = (3,4,8)^\top$. In all other respects Simulations 1 and 2 are identical.

The results for the second simulation scenario are provided in Figure 4.4. We notice very similar patterns to our first simulation setup. Both FASI and FCC are able to control the global FSR (omitted from the figure). FASI controls the group-wise FSRs for all values of $\pi_{2|F}$, while the FCC fails to do so. The data-driven FASI closely emulates the oracle procedure, for both the FSR and EPI levels.
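The two data-generating designs above can be sketched as follows. Parameter values are taken from the text, while the function name and coding details are our own:

```python
import numpy as np

rng = np.random.default_rng(1)

MU = {  # class-conditional means, (class, group) -> mean vector
    (1, "M"): np.array([0., 1., 6.]), (2, "M"): np.array([2., 3., 7.]),
    # Scenario 1 reuses the male means for females; Scenario 2 shifts them:
    (1, "F"): np.array([1., 2., 7.]), (2, "F"): np.array([3., 4., 8.]),
}

def simulate(n, pi2_f, pi2_m=0.5, scenario=2):
    """Draw (X, A, Y) from mixture model (16): pi_M = pi_F = 0.5 and
    X | Y=c, A=a ~ N(mu_{c,a}, 2 I_3)."""
    a = rng.choice(["M", "F"], size=n)
    p2 = np.where(a == "F", pi2_f, pi2_m)
    y = 1 + (rng.random(n) < p2).astype(int)   # Y = 2 with prob pi_{2|a}
    mu = np.empty((n, 3))
    for i in range(n):
        grp = a[i] if scenario == 2 else "M"   # Scenario 1: shared means
        mu[i] = MU[(y[i], grp)]
    x = mu + rng.normal(scale=np.sqrt(2.0), size=(n, 3))
    return x, a, y
```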
In both cases we compare FASI to the FCC approach described in Section 4.5 by randomly assigning 70% of our data set to $\mathcal D$ and the remaining 30% to $\mathcal D^{test}$. We further evenly split $\mathcal D$ into $\mathcal D^{train}$ and $\mathcal D^{cal}$.

Since this is a real data setting, the true posterior probabilities for $S_i$, defined in Equation 8, are unavailable to us. To estimate them, we used a GAM fitting procedure for the COMPAS case study and an AdaBoost fitting procedure for the census income prediction case study [46, 54]. This method can be applied to other applications through the "fasi" package, available in the R language on CRAN.

Figure 4.5: COMPAS data analysis for predicting recidivists. Left and middle: false selection rate minus the desired control level for varying levels of $\alpha$ for the FCC and FASI methods, respectively. Right: the EPI for both the FCC and FASI methods.

4.6.1 COMPAS Data Analysis

In 2016, ProPublica's investigative journalists curated a data set of 6,172 individuals arrested in Broward County, Florida, 3,175 of whom were Black and the remaining 2,997 of other races. In this study, Black and Other are our protected attributes.

The Black and Other groups respectively had 1,773 and 1,217 individuals who actually recidivated in the 2-year time frame that the study considered. We use this two-year window as a proxy for the true label identifying recidivists.

All individuals were assigned a risk score by the COMPAS algorithm (a whole number between 1 and 10) developed by Northpointe Inc. This score was used to inform the judge of each person's risk of recidivating during their bail hearing. The data set contains demographic information about each person, including their race, age, sex, number of prior offenses, and assigned COMPAS risk score.
We randomly split the data set 1,000 times into $\mathcal D$ and $\mathcal D^{test}$, and averaged the difference between the actual and target recidivism FSRs for a range of $\alpha$ between 0.15 and 0.30. The first two columns of Figure 4.5 show the difference between the true and target FSR for the recidivist classification for FCC and FASI, respectively. The last column plots the overall EPI for each method.

As we noted with the simulated data, while the FCC does a good job of controlling the overall FSR (green / circle) of recidivists, it is unable to do so at the race level. The left-hand plot of Figure 4.5 shows the breakdown of the race-wise FSR control for the FCC: the Black attribute (blue / triangle) systematically has an FSR below the desired target level, while the Other attribute (orange / square) systematically has an FSR above the target level. This observation holds for all values of $\alpha$ considered.

In comparison, the middle plot in Figure 4.5 shows the FASI method. For all values of $\alpha$, the FSR is controlled at the desired level both for the protected attributes and for all observations. The right plot in Figure 4.5 also demonstrates that, in this study, FASI obtains a nearly identical EPI to the FCC. This shows that the price of our fairness constraint, in terms of the size of the indecision group, is nearly zero.

4.6.2 1994 Census Income Data Analysis

The US census is the leading body for producing information about the American people. Naturally, the data it collects can directly inform future policy decisions, such as funding programs that provide economic assistance for populations in need. In particular, resources such as food, health care, job training, housing, and other economic assistance rely upon good estimates of a population's income levels. The cost of making unfair decisions when predicting one's income can be severe, since the prediction helps determine how hundreds of billions of dollars in federal funding are spent over the next decade.
In this case study, we use the 1994 US Census data set from the UCI Machine Learning Repository to predict whether an individual earns more than 50,000 dollars a year. The data consist of 32,561 observations on 14, largely demographic, variables, including education level, age, hours worked per week, and others. The protected attributes in this study are Female and Male. The Female group has 10,771 total observations, of which 1,179 make over $50K a year. Similarly, the remaining 21,790 observations are from the Male attribute, of which 6,662 make over $50K a year.

Figure 4.6: Census income prediction for individuals that earn more than $50K a year. Left and middle: false selection rate minus the desired control level for varying levels of $\alpha$ for the FCC and FASI methods, respectively. Right: the EPI for both the FCC and FASI methods.

As in Section 4.6.1, we compare the FCC approach to the FASI method for many values of the FSR control level $\alpha$, ranging from 0.05 to 0.3. The leftmost plot of Figure 4.6 shows results from the FCC method, where both the overall (green / circle) and Male (orange / square) groups have the desired FSR control. However, the FCC is unable to maintain the desired FSR control for the Female attribute (blue / triangle) across all values of $\alpha$.

In comparison, the FASI method, shown in the middle plot, is able to maintain both the overall FSR control and the FSR control for Females and Males across all values of $\alpha$. The rightmost plot provides the estimated EPI of each approach. Unlike for the COMPAS data, for some values of $\alpha$ the FASI method returns a slightly larger indecision group on average in comparison to the FCC method.
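The quantity plotted in Figures 4.5 and 4.6 is the empirical false selection proportion, overall and per protected group, minus the target level. A minimal sketch of that computation (the function name is our own; labels are coded 1/2 with 0 for indecisions):

```python
import numpy as np

def groupwise_fsp(yhat, y, a, c=2):
    """Empirical false selection proportion for class c, overall and per
    protected group: #(selected into c but not truly c) / #(selected)."""
    out = {}
    sel = yhat == c
    groups = [("overall", np.ones_like(sel, dtype=bool))]
    groups += [(g, a == g) for g in np.unique(a)]
    for name, mask in groups:
        n_sel = np.sum(sel & mask)
        out[name] = np.sum(sel & mask & (y != c)) / max(n_sel, 1)
    return out
```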
4.7 Discussions

We conclude this chapter with a discussion of other fairness notions and related error rate concepts.

4.7.1 Connection to the sufficiency principle

Consider the theoretical R-value (17). Proposition 2 shows that thresholding the theoretical R-value at level $\alpha$ ensures that $P(Y\ne c \mid \hat Y=c, A=a) \le \alpha$ for all $a\in\mathcal A$, which is referred to as the sufficiency principle or predictive parity in the fairness literature [28, 5, 26]. The conversion algorithm described in Appendix C.1.1 can be viewed as a method of calibration by group [5], which leads to fulfillment of the sufficiency principle automatically.

The connection between the theoretical R-value (17) and our (empirical) R-value (10) becomes clear by noting that (a) the empirical R-value can be conceptualized as the smallest group-wise FSR such that the individual is just selected, whereas the theoretical R-value corresponds to the smallest conditional error probability such that the individual is just selected; and (b) the group-adjusted FSRs (6) can be viewed as empirical counterparts of the conditional probabilities (18) (Appendix C.1.2). This interesting connection indicates that our FASI algorithm leads to classification rules that approximately satisfy the sufficiency principle.

Our work departs from existing machine learning algorithms tailored to fulfill the sufficiency principle in several ways. The FASI algorithm in Section 4.3.1 is intuitive and easily implementable, whereas existing calibration-by-group methods are often complicated and computationally intensive; the score conversion strategy discussed above, which assumes an oracle and involves many complicated unknown distributions, can be intractable in practice. Moreover, existing machine learning algorithms offer no statistical guarantees on error rate control, and there is no discussion of the inflation of decision errors when multiple individuals must be classified.
By contrast, the FASI algorithm offers nearly exact error rate control in finite samples and addresses the multiplicity issue by controlling the FSR, which is motivated by the powerful and popular FDR idea in multiple hypothesis testing. In contrast with existing methods, our theory works for sophisticated classifiers trained from black-box models without requiring conditions on the accuracy of the outputs.

4.7.2 Other fairness notions

In addition to the sufficiency principle, a widely used fairness notion is the separation principle [5], which requires that
$$P\big(Y\ne\hat Y \mid Y=c,\ A=a\big) \text{ are the same for all } a\in\mathcal A. \qquad (21)$$
A third fairness notion, in the context of prediction intervals, has been considered in [85]. Rather than conditioning on either $Y$ or $\hat Y$, these works are concerned with the joint probabilities of $(\hat Y, Y)$. This fairness criterion requires that the misclassification rates are equalized across all protected groups:
$$P\big(Y\ne\hat Y \mid A=a\big) \text{ are the same for all } a\in\mathcal A. \qquad (22)$$
Other popular fairness notions include equalized odds (45, 86) and equalized risks (27).

A highly controversial issue is that different fairness criteria often lead to different algorithms and decisions in practice. For example, the sufficiency and separation principles can be incompatible with each other (62, 39), and classification parity and calibration can harm the very groups that the algorithms are designed to protect (27). We do not claim that FASI is universally superior to competing approaches, but adjusting group-wise FSRs appears to be a reasonable fairness criterion for the applications under our consideration. Much research is still needed to fully understand the trade-offs and caveats among different approaches to fairness-adjusted inference.

4.7.3 FSR concepts in multinomial classification

The selective inference framework and FSR concepts can be extended from the binary classification setting to more general settings. Denote the collection of all class labels by $\mathcal C = \{1,\ldots,C\}$.
The case with C = 1 corresponds to the one-class classification problem discussed in Appendix C.1.3; in particular, the outlier detection problem recently considered in [43] and [6] can be encompassed by our general framework. For situations with C ≥ 2, denote the set of classes to be selected by C′, and assume C′ ⊂ C. With indecisions being allowed, the action space is given by Λ = {0} ∪ C′, where 0 denotes an indecision. Denote the selection rule Ŷ = {Ŷ_{n+j} : 1 ≤ j ≤ m} ∈ Λ^m. Then the FSR with respect to the subset C′ is defined as the expected fraction of erroneous selections among all selections:

FSR_{C′} = E[ Σ_{j=1}^{m} I(Ŷ_{n+j} ∈ C′, Ŷ_{n+j} ≠ Y_{n+j}) / { Σ_{j=1}^{m} I(Ŷ_{n+j} ∈ C′) ∨ 1 } ]. (23)

The group-wise FSRs taking into account the protected attribute A can be defined analogously to (6) by restricting the selections to specific groups. The EPI (5), which characterizes the power of the selection procedure, remains the same. The development of the R-values and corresponding fairness algorithms is more complicated and will be left for future research.

Chapter 5

Ongoing work: Correcting Bias in p-value Estimation

In this chapter, we briefly discuss one ongoing work with preliminary results. The content in this section is not final and is subject to change.

Another area of automated high risk decision making is multiple hypothesis testing. Multiple testing procedures are specifically designed for identifying sparse signals in large data sets. This arises in a number of high risk applications such as microarray experiments, drug efficacy studies and advertising campaigns. In these settings, a natural quantity to control is the false discovery rate (FDR), which is defined as the expected proportion of false positives among all identified signals. The field of multiple testing has been vibrantly studied in statistics, where the goal is to select a set of hypotheses such that the FDR of that set does not exceed a pre-defined threshold α.
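The quantity inside the expectation in (23) is an empirical fraction that is straightforward to compute for any realized selection rule. As a minimal sketch (the function name and the encoding of an indecision as 0 are our own conventions, not the dissertation's):

```python
def empirical_fsr(y_pred, y_true, c_prime):
    """Fraction of erroneous selections among all selections into classes in c_prime.

    y_pred[j] is the selection rule's output for case j: either a class
    label in c_prime, or 0 for an indecision (no selection made).
    """
    selections = [(p, t) for p, t in zip(y_pred, y_true) if p in c_prime]
    errors = sum(1 for p, t in selections if p != t)
    # The "∨ 1" in (23) guards against dividing by zero when nothing is selected.
    return errors / max(len(selections), 1)
```

For example, `empirical_fsr([1, 2, 0, 1], [1, 1, 2, 1], {1, 2})` makes three selections, one of them erroneous, and returns 1/3.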
However, most methods developed to control FDR assume accurate p-values in order to obtain that control. This assumption can fail in many applications, since practitioners need to estimate their p-values in a data driven way. In multiple hypothesis testing, the most interesting hypotheses tend to correspond to the smallest p-values. But by selecting the smallest p-values, practitioners may be underestimating the true level of uncertainty associated with their multiple testing procedure.

To adjust for this, we propose an empirical Bayes approach that builds off the ECAP method in Chapter 2 [55], called Selection bias Adjusted p-values (SAP). SAP is a flexible non-parametric method that can remove selection bias from the p-value estimation procedure and improve upon the FDR control of many modern multiple testing algorithms.

5.1 Problem Formulation

We consider a set of hypotheses H = {H_1, H_2, H_3, ..., H_m}, where m is large. Our desire is to simultaneously test all m hypotheses about a parameter of interest µ, where for each hypothesis we test

H_{0j}: µ_j = 0 vs. H_{1j}: µ_j ≠ 0, for j = 1, ..., m.

Here H_0 represents the null hypothesis, or uninteresting observations, while H_1 represents the alternative hypothesis that the observation is interesting to the practitioner. In these scenarios, practitioners are not only concerned with rejecting a large number of alternative hypotheses; they are also motivated to ensure that not too many of the hypotheses they reject were rejected incorrectly. In practice, this has led to a vibrant line of research that deals with controlling the false discovery rate (FDR) in multiple hypothesis testing.

The false discovery rate is the expected proportion of incorrectly rejected hypotheses out of the total number of rejections. Concretely, this is defined as

FDR = E(a/R),

where a is the number of incorrectly rejected hypotheses and R is the total number of rejections made.
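The ratio inside the expectation, a/R, is the false discovery proportion realized on a given data set; it is directly computable when ground truth is known, as in simulations. A small illustrative sketch (our own helper, not part of the proposed method):

```python
def false_discovery_proportion(rejected, is_null):
    """a/R: the share of rejected hypotheses that were actually null.

    rejected: indices of the rejected hypotheses.
    is_null[j]: whether hypothesis j is truly null.
    """
    if not rejected:
        return 0.0  # convention: no rejections means no false discoveries
    a = sum(1 for j in rejected if is_null[j])
    return a / len(rejected)
```

Averaging this quantity over repeated simulated data sets gives a Monte Carlo estimate of FDR.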
To control FDR, most methods directly use p-values to select a set of hypotheses to reject. The p-value is a statistic that is a function of the observed data. It can be interpreted as the probability that a practitioner would observe that specific statistic, or one more extreme, assuming the null hypothesis is true, i.e. that the current observation is uninteresting.

In this chapter, we will refer to the estimated p-values as p̃ for simplicity. As stated above, we can write p̃ as a probability

p̃_i = P(X ≥ x_i | H_0),

where x_i is the statistic calculated for hypothesis i, assuming the null hypothesis is true. Using results from classical statistics, a common way of calculating p̃_i is to use the central limit theorem. A typical statistic in this scenario is the properly standardized t-statistic,

t = (x̄ − µ_0) / (s/√n).

Here n is the number of observations for the specific hypothesis the statistic is calculated for, and x̄ and s are the sample average and standard deviation of those observations, respectively. We note that µ_0, the parameter of interest under the null hypothesis, is assumed to be known. As the name suggests, the t-statistic above can be shown to be well approximated by a t-distribution with n − 1 degrees of freedom. The p-value follows directly from t and can be calculated with the cumulative distribution function F_{n−1} of the t-distribution with n − 1 degrees of freedom:

p̃_i(t) = 1 − F_{n−1}(t).

There are many ways to calculate p-values, but most of them rely on some variation of the procedure described above.

Popular algorithms for controlling FDR, like the BH-procedure, directly use p-values when determining which hypotheses to reject. However, using p̃ as a p-value can be problematic, because p̃ is actually an estimate of the true p-value: it requires a sample estimate of the mean and standard deviation for each hypothesis.
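The estimation step above can be sketched in a few lines. Since the exact t CDF is not available in the Python standard library, this illustration substitutes the standard normal CDF for F_{n−1}, which is only a reasonable stand-in for large n; the function name and that simplification are ours, not the dissertation's:

```python
import math
from statistics import mean, stdev

def estimated_p_value(xs, mu0=0.0):
    """One-sided p-value estimate P(T >= t | H0) for a single hypothesis.

    t standardizes the sample mean using the estimated standard deviation s,
    which is exactly where the noise in p-tilde enters when n is small.
    """
    n = len(xs)
    t = (mean(xs) - mu0) / (stdev(xs) / math.sqrt(n))
    return 0.5 * math.erfc(t / math.sqrt(2))  # 1 - Phi(t), normal approximation
```

A sample whose mean equals µ_0 gives t = 0 and hence an estimated p-value of 0.5, while a sample far above µ_0 gives a small estimate; the point of this chapter is that such estimates are themselves noisy.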
These estimates can have varying levels of accuracy depending on how many observations n each hypothesis has.

Figure 5.1: Estimate of the false discovery rate using the unadjusted approach (orange/solid), SAP Unbiased (green/dot) and SAP Biased (blue/dashed) corrected p-values. The x-axis varies the level of corruption in the unadjusted p-values and the y-axis measures the FDR. The goal is to control FDR at the 10% level (black/dashed). [Two panels, each with x-axis "Number of observations per-hypothesis"; left-panel y-axis: FDR; right-panel y-axis: Proportion of True Signal Rejected.]

Let us denote the true p-value by p: the p-value that would be calculated without relying on sample estimates. Methods for controlling FDR assume that they are being supplied with p, when in fact they are using p̃ when establishing significance. When the number of observations per hypothesis is small, the sample estimates x̄ and s can be inaccurate, making p̃ a noisy estimate of p.

Using p̃ as an estimate for p in FDR controlling procedures can be problematic, in particular because the most interesting p-values tend to be the smallest ones, causing practitioners to focus not on their entire set of p-values, but rather on the most extreme ones. Figure 5.1 shows an example. We generated 1,000 p-values and then corrupted them according to Model 1 in Chapter 2. On the x-axis we varied the level of corruption in p̃ and then estimated FDR on the y-axis, with the goal of controlling FDR at 10%. Once the level of corruption gets large, using p̃ in place of the true p-values results in an inflated FDR.

The green and blue lines in Figure 5.1 represent two adjustments based on the ECAP method described in Chapter 2 of this dissertation [55]. The green line is a variant of the unbiased ECAP method and the blue line is a variant of the biased ECAP method, specifically designed for adjusting p-values and controlling FDR.
In the leftmost plot, we see that both the green and blue lines are able to control FDR even when the number of observations per hypothesis is very small. The right plot suggests that in order to obtain proper FDR control, we need to make fewer rejections than the BH-procedure allows, since the unadjusted approach has the highest proportion of rejections.

In this ongoing work, we propose an empirical Bayes method for adjusting estimated p-values called Selection bias Adjusted p-values (SAP). SAP is specifically designed for p-value estimation when the number of hypotheses is large but the number of observations per hypothesis is moderate. SAP uses a variant of Tweedie's formula, which models p̃ as coming from a beta distribution, in a similar way to the ECAP method in Chapter 2 [55].

5.2 Methodology

5.2.1 Posterior Probabilities

In the settings considered in Chapter 2, we aimed to create adjusted probability estimates that minimized excess certainty (3). In the multiple testing setting we instead minimize mean squared error. This means we can rewrite Theorem 1 in Chapter 2, for a small probability q̃, as

q_{i0} = E(q_i | q̃_i).

We note that the correction term Var(p_i | p̃_i)/E(p_i | p̃_i) is dropped, since it is an artifact of minimizing excess certainty.

However, we do not want to directly calculate E(p_i | p̃_i) for our p-values, because p-values by construction do not directly relate to the probability of an event occurring. We can easily fix that issue by transforming them into posterior probabilities, which can then be used in the ECAP framework.

For simplicity, let q_i and q̃_i denote the true and estimated posterior probabilities for a given hypothesis i. For practitioners who are more comfortable with p-values, our algorithm will return bias corrected p-values as an output, so that they can be easily used with modern methods developed to control FDR. To convert between q and p, we can use Bayes' theorem.
For a given threshold T and observation t_i,

q_i = P(case i is null | t_i ≥ T)
    = P(t_i ≥ T | case i is null) · P(case i is null) / P(t_i ≥ T)
    = p_i · π / P(t_i ≥ T)
    = p_i · λ(t_i),

where λ(t_i) = π/P(t_i ≥ T) and π = P(case i is null) is the proportion of true nulls.

Converting between posterior probabilities and p-values is a crucial task when implementing SAP. It relies on getting a good estimate of λ(t_i). We can estimate λ(t_i) by separately estimating the numerator π with the proportion of true nulls in a held out data set, and the denominator P(t_i ≥ T) with a simple empirical estimate, N(T)/N, where N(T) = #{observations greater than T}. We also note that in typical multiple hypothesis settings, π, the true proportion of nulls, is assumed to be nearly 1.

5.2.2 Empirical Bayes Adjustment

We recall that the ECAP adjustment from Chapter 2 tells us that

q̂_i = q̃_i + γ*(g*(q̃_i) + 1 − 2q̃_i).

This means that if we convert ECAP adjusted posterior probabilities to p-values, we can express them as

q̂_i = q̃_i + γ*(g*(q̃_i) + 1 − 2q̃_i)
    = p̃_i·λ̂_i + γ*(g*(p̃_i·λ̂_i) + 1 − 2·p̃_i·λ̂_i)
    = h(p̃_i·λ̂_i).

Using these SAP adjusted posterior probabilities, we can reject all hypotheses whose adjusted posterior probabilities fall below our desired FDR threshold α. This relates directly to the BH-procedure, but with an empirical Bayes adjustment. Noting that for the i-th smallest p-value the empirical estimate gives λ̂_i = π/(i/N), we have

h(p̃_i·λ̂_i) ≤ α
p̃_i·λ̂_i ≤ h^{−1}(α)
p̃_i ≤ h^{−1}(α)/λ̂_i = h^{−1}(α)·(i/N)/π ≈ h^{−1}(α)·i/N,

where the last step is due to the fact that we assume π is nearly 1. So the SAP algorithm tells us that if we threshold our unadjusted p-values at h^{−1}(α)·i/N, we should control FDR at level π·α. This is nearly identical to the BH procedure, but with α replaced by the correction term h^{−1}(α).

5.2.3 Data Driven Estimation

5.2.4 Estimating h^{−1}

The decision rule above crucially depends on knowledge of the function h^{−1}. However, in practice, deriving a closed form expression for this inverse is intractable.
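Although no closed form for h^{−1} is available, h is a simple scalar function, so the step-up decision rule and the inversion can both be sketched numerically. In the sketch below, the names are our own, γ* and the estimated g* are passed in as arguments (SAP itself estimates these quantities from data, as discussed in Section 5.2.5), and the bisection assumes h is increasing on the search interval:

```python
def h(z, gamma, g):
    """h(z) = z + gamma * (g(z) + 1 - 2z): the SAP-adjusted value at z."""
    return z + gamma * (g(z) + 1 - 2 * z)

def h_inverse(alpha, gamma, g, lo=0.0, hi=1.0, tol=1e-10):
    """Find z with h(z) = alpha by bisection, assuming h is increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid, gamma, g) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sap_rejections(p_tilde, alpha, h_inv):
    """Step-up rule: reject the i-th smallest p-value when it is at most
    h_inv(alpha) * i / N. With h_inv the identity this is exactly the
    BH procedure; returns the original indices of the rejected hypotheses."""
    n = len(p_tilde)
    order = sorted(range(n), key=lambda i: p_tilde[i])
    k = 0  # largest rank whose p-value clears its threshold
    for rank, idx in enumerate(order, start=1):
        if p_tilde[idx] <= h_inv(alpha) * rank / n:
            k = rank
    return sorted(order[:k])
```

For instance, with γ = 0.1 and g(z) = 0 we get h(z) = 0.8z + 0.1, so h^{−1}(0.2) = 0.125 < 0.2; feeding a shrunken threshold h_inv(α) < α into the step-up rule makes it strictly more conservative than BH, which is the direction of the correction suggested by Figure 5.1.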
But we can compute the value of h^{−1}(α) numerically for any specific value of α, by searching for the value z_α that satisfies the following expression for h:

α = z_α + γ*(g*(z_α) + 1 − 2z_α).

5.2.5 Estimating γ, λ and θ

The SAP correction procedure crucially depends on good estimates of the level of corruption γ, the level of smoothness λ, and the level of bias θ. These parameters are identical to the ones in Chapter 2 for the ECAP procedure.

For ECAP, we estimated γ and θ by picking the combination of parameters that maximized the binomial log-likelihood of a held out set of events occurring. However, for FDR analysis it makes more sense to pick γ and θ in a way that obtains FDR control as close to a user specified level α as possible. The parameter λ will be estimated in the same way as before.

5.3 Ongoing questions

There is still a lot of work to be done on this project, and we plan on building up the SAP algorithm to offer finite sample control in standard hypothesis testing settings. In its current form, SAP requires a held out data set to obtain good estimates of some tuning parameters. This makes SAP a great choice for online hypothesis testing, where instead of being given all hypotheses at once, the practitioner observes hypotheses coming in a stream over time. As more hypotheses come in, the SAP procedure should become more accurate and would be adaptive to changes in data distributions over time.

Chapter 6

Conclusion

Statistical learning in high-risk scenarios requires us to rethink standard methodology. It is becoming commonplace for machine learning to be applied in these sensitive areas, and this is causing practitioners to face problems that previous researchers never had to consider. This dissertation focused on a few key areas of statistical learning in high-risk scenarios:

1. Selection bias for the occurrence of extreme events.
2. Label noise in training data.
3. High probability error rate control in classification.
4. Fairness in classification.
5.
Selection bias in multiple hypothesis testing.

This list is by no means exhaustive, and future research will build upon these established ideas and add new ones. Automated decision making algorithms are advancing rapidly, so it is imperative that we advance methodology just as quickly, so that we can offer valid uncertainty estimates to practitioners. This will allow us to use machine learning in a variety of different applications and help facilitate better decision making, because the user will have a better understanding of the strengths and weaknesses of their algorithms.

Chapter 7

Bibliography

[1] Felix Abramovich, Yoav Benjamini, David L. Donoho, and Iain M. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist., 34(2):584–653, 2006.
[2] S. Agmon. Lectures on Elliptic Boundary Value Problems. Van Nostrand, Princeton, NJ, 1965.
[3] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks. ProPublica, 23:77–91, 2016.
[4] Rina Foygel Barber and Emmanuel J. Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.
[5] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness in machine learning. NIPS Tutorial, 1:2, 2017.
[6] Stephen Bates, Emmanuel Candès, Lihua Lei, Yaniv Romano, and Matteo Sesia. Testing for outliers with conformal p-values. arXiv preprint arXiv:2104.08279, 2021.
[7] Yoav Benjamini. Simultaneous and selective inference: Current successes and future challenges. Biometrical Journal, 52(6):708–721, 2010.
[8] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300, 1995.
[9] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy.
Statist. Soc. B, 57:289–300, 1995.
[10] Yoav Benjamini and Yosef Hochberg. On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics, 25:60–83, 2000.
[11] Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing under dependency. Ann. Statist., 29(4):1165–1188, 2001.
[12] Yoav Benjamini and Daniel Yekutieli. False discovery rate–adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association, 100(469):71–81, 2005.
[13] Peter J. Bickel and Elizaveta Levina. Covariance regularization by thresholding. Ann. Statist., 36(6):2577–2604, 2008.
[14] Gilles Blanchard, Marek Flaska, Gregory Handy, Sara Pozzi, and Clayton Scott. Classification with asymmetric label noise: Consistency and maximal denoising. Electronic Journal of Statistics, 10(2):2780–2824, 2016.
[15] Pavel Brazdil and Kurt Konolige. Machine Learning, Meta-Reasoning and Logics. Springer, 1990.
[16] Carla Brodley and Mark Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11:131–167, 1999.
[17] Carla E. Brodley and Mark A. Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11:131–167, 1999.
[18] Lawrence D. Brown and Eitan Greenshtein. Nonparametric empirical Bayes and compound decision approaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics, pages 1685–1704, 2009.
[19] T. Tony Cai and Wenguang Sun. Simultaneous testing of grouped hypotheses: Finding needles in multiple haystacks. J. Amer. Statist. Assoc., 104:1467–1481, 2009.
[20] T. Tony Cai, Wenguang Sun, and Weinan Wang. CARS: Covariate assisted ranking and screening for large-scale two-sample inference (with discussion). J. Roy. Statist. Soc. B, 81:187–234, 2019.
[21] Tony Cai and Wenguang Sun. Optimal screening and discovery of sparse signals with applications to multistage high-throughput studies.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(1):197, 2017.
[22] Timothy I. Cannings, Yingying Fan, and Richard J. Samworth. Classification with imperfect training labels. Biometrika, 107(2):311–330, 2020.
[23] Adam Cannon, James Howse, Don Hush, and Clint Scovel. Learning with the Neyman-Pearson and min-max criteria. Los Alamos National Laboratory, Tech. Rep. LA-UR, pages 02–2951, 2002.
[24] Hongyuan Cao, Wenguang Sun, and Michael R. Kosorok. The optimal power puzzle: scrutiny of the monotone likelihood ratio assumption in multiple testing. Biometrika, page ast001, 2013.
[25] Jingjing Cao, Sam Kwong, and Ran Wang. A noise-detection based AdaBoost algorithm for mislabeled data. Pattern Recognition, 45(12):4451–4465, 2012.
[26] Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163, 2017.
[27] Sam Corbett-Davies and Sharad Goel. The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023, 2018.
[28] Roger Crisp. Equality, priority, and compassion. Ethics, 113(4):745–763, 2003.
[29] François Curtin and Pierre Schulz. Multiple correlations and Bonferroni's correction. Biological Psychiatry, 44(8):775–777, 1998.
[30] William Dieterich, Christina Mendoza, and Tim Brennan. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. Northpointe Inc, 2016.
[31] David L. Donoho and Iain M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
[32] Lilun Du, Xu Guo, Wenguang Sun, and Changliang Zou. False discovery rate control under general dependence by symmetrized data aggregation. Journal of the American Statistical Association, to appear: arXiv:2002.11992, 2021.
[33] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
[34] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness.
In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226, 2012.
[35] Bradley Efron. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
[36] Bradley Efron. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
[37] Bradley Efron and Carl Morris. Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association, 70(350):311–319, 1975.
[38] Charles Elkan. The foundations of cost-sensitive learning. In Proceedings of the 17th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'01, pages 973–978, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[39] Sorelle A. Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. The (im)possibility of fairness: different value systems require different mechanisms for fair decision making. Communications of the ACM, 64(4):136–143, 2021.
[40] Andrew Gelman and Cosma Rohilla Shalizi. Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1):8–38, 2012.
[41] Aritra Ghosh, Naresh Manwani, and PS Sastry. Making risk minimization tolerant to label noise. Neurocomputing, 160:93–107, 2015.
[42] Peter J. Green and Bernard W. Silverman. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall/CRC, 1993.
[43] Leying Guan and Rob Tibshirani. Prediction and outlier detection in classification problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), to appear: arXiv:1905.04396, 2021.
[44] Isabelle Guyon, Nada Matic, Vladimir Vapnik, et al. Discovering informative patterns and data cleaning., 1996.
[45] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29:3315–3323, 2016.
[46] T.
Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, 2009.
[47] Nicholas C. Henderson and Michael A. Newton. Making the cut: improved ranking and selection for large-scale inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(4):781–804, 2015.
[48] Radu Herbei and Marten H. Wegkamp. Classification with reject option. The Canadian Journal of Statistics/La Revue Canadienne de Statistique, pages 709–721, 2006.
[49] Ray J. Hickey. Noise modelling and evaluating learning from examples. Artificial Intelligence, 82(1-2):157–179, 1996.
[50] Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979.
[51] Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. Spambase data set. Hewlett-Packard Labs, 1(7), 1999.
[52] John Hull, Mirela Predescu, and Alan White. Bond prices, default probabilities and risk premiums. Journal of Credit Risk, 1:53–60, 2005.
[53] Yuki Ikeda, Tatsuya Kubokawa, and Muni S. Srivastava. Comparison of linear shrinkage estimators of a large covariance matrix in normal and non-normal distributions. Computational Statistics and Data Analysis, 95:95–108, 2016.
[54] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer, 2013.
[55] Gareth M. James, Peter Radchenko, and Bradley Rava. Irrational exuberance: Correcting bias in probability estimates. Journal of the American Statistical Association, 117(537):455–468, 2022.
[56] W. James and Charles Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 361–379, Berkeley, Calif., 1961. University of California Press.
[57] Wenhua Jiang and Cun-Hui Zhang.
General maximum likelihood empirical Bayes estimation of normal means. Ann. Statist., 37(4):1647–1684, 2009.
[58] Stephen Kealhofer. Quantifying credit risk I: Default prediction. Financial Analysts Journal, 59(1):30–44, 2003.
[59] Michael Kemmler, Erik Rodner, Esther-Sabrina Wacker, and Joachim Denzler. One-class classification with Gaussian processes. Pattern Recognition, 46(12):3507–3518, 2013.
[60] Shehroz S. Khan and Michael G. Madden. A survey of recent trends in one class classification. In Irish Conference on Artificial Intelligence and Cognitive Science, pages 188–197. Springer, 2009.
[61] Roni Khardon and Gabriel Wachman. Noise tolerant variants of the perceptron algorithm. Journal of Machine Learning Research, 8(Feb):227–248, 2007.
[62] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.
[63] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[64] Peter A. Lachenbruch. Discriminant analysis when the initial samples are misclassified. Technometrics, 8(4):657–662, 1966.
[65] Peter A. Lachenbruch. Note on initial misclassification effects on the quadratic discriminant function. Technometrics, 21(1):129–132, 1979.
[66] Olivier Ledoit and Michael Wolf. Nonlinear shrinkage estimation of large-dimensional covariance matrices. Ann. Statist., 40(2):1024–1060, 2012.
[67] Jing Lei. Classification with confidence. Biometrika, 101(4):755–769, 2014.
[68] Carson K. Leung and Kyle W. Joseph. Sports data mining: Predicting results for the college football games. Procedia Computer Science, 35:710–719, 2014. Knowledge-Based and Intelligent Information & Engineering Systems 18th Annual Conference, KES-2014 Gdynia, Poland, September 2014 Proceedings.
[69] Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):447–461, 2016.
[70] Owen MacDonald. Physician perspectives on preventing diagnostic errors. QuantiaMD, 2011.
[71] Naresh Manwani and PS Sastry. Noise tolerance under risk minimization. IEEE Transactions on Cybernetics, 43(3):1146–1151, 2013.
[72] Lukas Meier, Sara Van de Geer, and Peter Bühlmann. High-dimensional additive modeling. The Annals of Statistics, 37(6B):3779–3821, 2009.
[73] Nicholai Meinshausen and John Rice. Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. Ann. Statist., 34:373–393, 2006.
[74] Mary M. Moya and Don R. Hush. Network constraints and multi-objective optimization for one-class classification. Neural Networks, 9(3):463–474, 1996.
[75] Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems, pages 1196–1204, 2013.
[76] Seishi Okamoto and Nobuhiro Yugami. An average-case analysis of the k-nearest neighbor classifier for noisy domains. In IJCAI (1), pages 238–245, 1997.
[77] Ken Orr. Data quality and systems theory. Communications of the ACM, 41(2):66–71, 1998.
[78] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1944–1952, 2017.
[79] Sonia Petrone, Stefano Rizzelli, Judith Rousseau, and Catia Scricciolo. Empirical Bayes methods in classical and Bayesian inference. METRON, 72(2):201–215, 2014.
[80] Roy M. Poses, Wally R. Smith, Donna K. McClish, Elizabeth C. Huber, F. Lynne W. Clemo, Brian P. Schmitt, Donna Alexander-Forti, Edward M. Racht, Christopher C. Colenda III, and Robert M. Centor. Physicians' survival predictions for patients with acute congestive heart failure. JAMA Internal Medicine, 157(9):1001–1007, 1997.
[81] Bradley Rava, Wenguang Sun, Gareth M. James, and Xin Tong. A burden shared is a burden halved: A fairness-adjusted approach to classification. Technical Report, 2021.
[82] Thomas Redman. The impact of poor data quality on the typical enterprise. Communications of the ACM, 2(2):79–82, 1998.
[83] Philippe Rigollet and Xin Tong. Neyman-Pearson classification, convexity and stochastic constraints. Journal of Machine Learning Research, 12(Oct):2831–2855, 2011.
[84] Herbert Robbins. An empirical Bayes approach to statistics. Proc. Third Berkeley Symp. on Math. Statist. and Prob., 1:157–163, 1956.
[85] Yaniv Romano, Rina Foygel Barber, Chiara Sabatti, and Emmanuel Candès. With malice toward none: Assessing uncertainty via equalized coverage. 2020. https://hdsr.mitpress.mit.edu/pub/qedrwcz3.
[86] Yaniv Romano, Stephen Bates, and Emmanuel J. Candès. Achieving equalized odds by resampling sensitive attributes. In Advances in Neural Information Processing Systems 33 (NIPS 2020). Curran Associates, Inc., 2020. To appear.
[87] Clayton Scott and Robert Nowak. A Neyman-Pearson approach to statistical learning. IEEE Transactions on Information Theory, 51(11):3806–3819, 2005.
[88] Galen R. Shorack and Jon A. Wellner. Empirical Processes with Applications to Statistics. SIAM, 2009.
[89] Nathaniel Silver. How FiveThirtyEight's 2018 midterm forecasts did. https://fivethirtyeight.com/features/how-fivethirtyeights-2018-midterm-forecasts-did/, Dec 2018. Online; accessed 04 September 2019.
[90] Robert M. Smeenk, Vic J. Verwaal, Ninja Antonini, and Frans A. N. Zoetmulder. Survival analysis of pseudomyxoma peritonei patients treated by cytoreductive surgery and hyperthermic intraperitoneal chemotherapy. Annals of Surgery, 245(1):104–109, 2007.
[91] Souren Soumbatiants, Henry W. Chappell, and Eric Johnson. Using state polls to forecast U.S. presidential election outcomes. Public Choice, 127(1):207–223, Apr 2006.
[92] Charles J. Stone. Optimal rates of convergence for nonparametric estimators.
The Annals of Statistics, pages 1348–1360, 1980.
[93] John D. Storey. A direct approach to false discovery rates. J. Roy. Statist. Soc. B, 64:479–498, 2002.
[94] John D. Storey. The positive false discovery rate: a Bayesian interpretation and the q-value. Ann. Statist., 31:2013–2035, 2003.
[95] John D. Storey, Jonathan E. Taylor, and David Siegmund. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J. Roy. Statist. Soc. B, 66(1):187–205, 2004.
[96] Sainbayar Sukhbaatar and Rob Fergus. Learning from noisy labels with deep neural networks. arXiv preprint arXiv:1406.2080, 2(3):4, 2014.
[97] Wenguang Sun and Zhi Wei. Large-scale multiple testing for pattern identification, with applications to time-course microarray experiments. J. Amer. Statist. Assoc., 106:73–88, 2011.
[98] Xin Tong. A plug-in approach to Neyman-Pearson classification. Journal of Machine Learning Research, 14(1):3011–3040, 2013.
[99] Xin Tong, Yang Feng, and Jingyi Jessica Li. Neyman-Pearson classification algorithms and NP receiver operating characteristics. Science Advances, 4(2):eaao1659, 2018.
[100] Xin Tong, Lucy Xia, Jiacheng Wang, and Yang Feng. Neyman-Pearson classification: parametrics and sample size requirement. Journal of Machine Learning Research, 21:1–18, 2020.
[101] Sara Van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, Cambridge, 2000.
[102] Xiaobo Xia, Tongliang Liu, Nannan Wang, Bo Han, Chen Gong, Gang Niu, and Masashi Sugiyama. Are anchor points really indispensable in label-noise learning? Advances in Neural Information Processing Systems, 32:6838–6849, 2019.
[103] Shunan Yao, Bradley Rava, Xin Tong, and Gareth James. Asymmetric error control under imperfect supervision: A label-noise-adjusted Neyman–Pearson umbrella algorithm. Journal of the American Statistical Association, 0(0):1–13, 2022.
[104] Bianca Zadrozny, John Langford, and Naoki Abe.
Cost-sensitive learning by cost-proportionate example weighting. Third IEEE International Conference on Data Mining, pages 435–442, 2003.
[105] Anqi Zhao, Yang Feng, Lie Wang, and Xin Tong. Neyman-Pearson classification under high-dimensional settings. Journal of Machine Learning Research, 17(213):1–39, 2016.
[106] Zbyněk Šidák. Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62(318):626–633, 1967.

Appendix A

Appendix for Irrational Exuberance: Correcting Bias in Probability Estimates

A.1 Proof of Theorem 1

We begin by computing the derivative of the loss function,

L(x) = (1/x²)E(p_i²|p̃_i) − (2/x)E(p_i|p̃_i) + 1, for x ≤ 0.5,
L(x) = (1/(1−x)²)E(p_i²|p̃_i) − (2x/(1−x)²)E(p_i|p̃_i) + (x/(1−x))², for x > 0.5.

We have

∂L/∂x = −(2/x³)E(p_i²|p̃_i) + (2/x²)E(p_i|p̃_i), for x < 0.5,
∂L/∂x = (2/(1−x)³)E(p_i²|p̃_i) − (2(1+x)/(1−x)³)E(p_i|p̃_i) + 2x/(1−x)³, for x > 0.5,

so that

∂L/∂x ∝ −E(p_i²|p̃_i) + xE(p_i|p̃_i), for x < 0.5,
∂L/∂x ∝ E(p_i²|p̃_i) − E(p_i|p̃_i) + x(1 − E(p_i|p̃_i)), for x > 0.5.

Note that L is a continuous function. If E(p_i|p̃_i) ≤ 0.5 and x* = E(p_i²|p̃_i)/E(p_i|p̃_i) ≤ 0.5, then algebraic manipulations show that ∂L/∂x is negative for all x < x* and positive for x > x*. Hence, p_i0 = x* = E(p_i|p̃_i) + Var(p_i|p̃_i)/E(p_i|p̃_i) minimizes L. Alternatively, if E(p_i|p̃_i) ≤ 0.5 and x* = E(p_i²|p̃_i)/E(p_i|p̃_i) ≥ 0.5, then ∂L/∂x is negative for all x < 0.5 and positive for all x > 0.5, so L is minimized by p_i0 = 0.5.

Analogous arguments show that if E(p_i|p̃_i) > 0.5 and x* = [E(p_i|p̃_i) − E(p_i²|p̃_i)]/(1 − E(p_i|p̃_i)) > 0.5, then ∂L/∂x is negative for all x < x*, zero at x = x*, and positive for x > x*. Hence, p_i0 = x* = E(p_i|p̃_i) − Var(p_i|p̃_i)/(1 − E(p_i|p̃_i)) will minimize L. Alternatively, if E(p_i|p̃_i) > 0.5 and x* = [E(p_i|p̃_i) − E(p_i²|p̃_i)]/(1 − E(p_i|p̃_i)) < 0.5, then ∂L/∂x is negative for all x < 0.5 and positive for all x > 0.5, so L is minimized by p_i0 = 0.5.
To prove the second result, first suppose $E(p_i\mid\tilde p_i)\le 0.5$ and $p_{i0}<0.5$, in which case $L(p_{i0})=1-E(p_i^2\mid\tilde p_i)/p_{i0}^2$. Now let
$$\tilde L(p_i')=E\left[\left(\frac{p_i-p_i'}{p_i'}\right)^2\,\Big|\,\tilde p_i\right]=\frac{1}{p_i'^2}E(p_i^2\mid\tilde p_i)-\frac{2}{p_i'}E(p_i\mid\tilde p_i)+1.$$
Note that $\tilde L(p_i')\le L(p_i')$, with equality for $p_i'\le 0.5$. Hence,
$$L(p_i')-L(p_{i0})\ \ge\ \tilde L(p_i')-L(p_{i0})=E(p_i^2\mid\tilde p_i)\left(\frac{1}{p_i'^2}+\frac{1}{p_{i0}^2}\right)-\frac{2}{p_i'}E(p_i\mid\tilde p_i)=E(p_i^2\mid\tilde p_i)\left(\frac{1}{p_i'}-\frac{1}{p_{i0}}\right)^2.$$
Now consider the case $E(p_i\mid\tilde p_i)\le 0.5$ and $p_{i0}=0.5$. Note that this implies $2E(p_i^2\mid\tilde p_i)>E(p_i\mid\tilde p_i)$. If $p_i'\le 0.5$, then
$$L(p_i')-L(p_{i0})=E(p_i^2\mid\tilde p_i)\left(\frac{1}{p_i'^2}-4\right)-2E(p_i\mid\tilde p_i)\left(\frac{1}{p_i'}-2\right)\ \ge\ E(p_i^2\mid\tilde p_i)\left(\frac{1}{p_i'}-\frac{1}{0.5}\right)^2.$$
Also note that $\big(\frac{1}{p_i'}-\frac{1}{0.5}\big)^2\ge\big(\frac{1}{1-p_i'}-\frac{1}{0.5}\big)^2$. Alternatively, if $p_i'>0.5$, then
$$L(p_i')-L(p_{i0})\ \ge\ \tilde L(1-p_i')-L(p_{i0})\ \ge\ E(p_i^2\mid\tilde p_i)\left(\frac{1}{1-p_i'}-\frac{1}{0.5}\right)^2.$$
Observe that $\big(\frac{1}{1-p_i'}-\frac{1}{0.5}\big)^2\ge\big(\frac{1}{p_i'}-\frac{1}{0.5}\big)^2$. Consequently, we have shown that
$$L(p_i')-L(p_{i0})\ \ge\ E(p_i^2\mid\tilde p_i)\max\left\{\left(\frac{1}{p_i'}-\frac{1}{0.5}\right)^2,\ \left(\frac{1}{1-p_i'}-\frac{1}{0.5}\right)^2\right\}$$
when $E(p_i\mid\tilde p_i)\le 0.5$ and $p_{i0}=0.5$. Thus, we have established the result for the case $E(p_i\mid\tilde p_i)\le 0.5$.

Finally, consider the case $E(p_i\mid\tilde p_i)>0.5$. The result follows by repeating the argument from the case $E(p_i\mid\tilde p_i)\le 0.5$ while replacing all of the probabilities with their complements, i.e., by replacing $p_i$, $p_{i0}$ and $p_i'$ with $1-p_i$, $1-p_{i0}$ and $1-p_i'$, respectively.

A.2 Proof of Theorem 2 and Corollary 1

Throughout the proof, we omit the subscript $i$ for simplicity of notation. We let $f_{\tilde p}(\tilde p\mid p)$ denote the conditional density of $\tilde p$ given that the corresponding true probability equals $p$, and define $f_X(x\mid p)$ by analogy for the random variable $X=\log(\tilde p/[1-\tilde p])$. We will slightly abuse notation and not distinguish between a random variable and its value in the case of $\tilde p$, $p$ and $\eta$.
According to model (1), we have
$$f_{\tilde p}(\tilde p\mid p)=B\!\left(p/\gamma^*,(1-p)/\gamma^*\right)^{-1}\tilde p^{\,p/\gamma^*-1}(1-\tilde p)^{(1-p)/\gamma^*-1},$$
where $B(\cdot,\cdot)$ denotes the beta function. Hence, writing $B$ for $B(p/\gamma^*,(1-p)/\gamma^*)$, we derive
$$\log f_{\tilde p}(\tilde p\mid p)=-\log B+\left(\frac{p}{\gamma^*}-1\right)\big[\log\tilde p-\log(1-\tilde p)\big]+(1/\gamma^*-2)\log(1-\tilde p)=-\log B+\eta x+(1/\gamma^*-2)\log(1-\tilde p), \quad (1)$$
where
$$\eta=\frac{p}{\gamma^*}-1 \quad\text{and}\quad x=\log\frac{\tilde p}{1-\tilde p}.$$
Standard calculations show that
$$f_X(x\mid p)=f_{\tilde p}(\tilde p\mid p)\,\frac{e^x}{(1+e^x)^2}=f_{\tilde p}(\tilde p\mid p)\,\tilde p(1-\tilde p). \quad (2)$$
Note that $\log(1-\tilde p)=-\log(1+e^x)$, and hence
$$\log f_X(x\mid p)=-\log B+\eta x-(1/\gamma^*-2)\log(1+e^x)+x-2\log(1+e^x)=-\log B+\eta x+x-\frac{1}{\gamma^*}\log(1+e^x)=-\log B+\eta x+l_h(x),$$
where $l_h(x)=x-\frac{1}{\gamma^*}\log(1+e^x)$. Consequently, we can apply Tweedie's formula [36] to derive
$$E(p/\gamma^*-1\mid\tilde p)=E(\eta\mid x)=v_X(x)-l_h'(x)=v_X(x)-1+\frac{1}{\gamma^*}\cdot\frac{e^x}{1+e^x}=v_X(x)+\frac{\tilde p}{\gamma^*}-1,$$
where $v_X(x)=[df_X(x)/dx]/f_X(x)$ and $f_X$ is the marginal density of $X$. This implies $E(p\mid\tilde p)=\tilde p+\gamma^* v_X(x)$. In addition, we have
$$\frac{df_X(x)}{dx}=\frac{df_{\tilde p}(\tilde p)}{d\tilde p}\cdot\frac{d\tilde p}{dx}\cdot\frac{e^x}{(1+e^x)^2}+f_{\tilde p}(\tilde p)\,\frac{e^x(1+e^x)^2-2e^{2x}(1+e^x)}{(1+e^x)^4}=\frac{df_{\tilde p}(\tilde p)}{d\tilde p}\left(\frac{e^x}{(1+e^x)^2}\right)^2+f_{\tilde p}(\tilde p)\,\frac{e^x}{(1+e^x)^2}\cdot\frac{1-e^x}{1+e^x}.$$
Using the unconditional analog of formula (2), we derive
$$v_X(x)=\frac{df_X(x)/dx}{f_X(x)}=\frac{df_{\tilde p}(\tilde p)/d\tilde p}{f_{\tilde p}(\tilde p)}\cdot\frac{e^x}{(1+e^x)^2}+\frac{1-e^x}{1+e^x}=v_{\tilde p}(\tilde p)\,\tilde p(1-\tilde p)+1-2\tilde p,$$
where $v_{\tilde p}(\tilde p)=[df_{\tilde p}(\tilde p)/d\tilde p]/f_{\tilde p}(\tilde p)$. Thus, $E(p\mid\tilde p)=\tilde p+\gamma^*\big(\tilde p(1-\tilde p)v_{\tilde p}(\tilde p)+1-2\tilde p\big)$. Similarly, again by Tweedie's formula,
$$\mathrm{Var}(p/\gamma^*-1\mid\tilde p)=\mathrm{Var}(\eta\mid x)=v_X'(x)-l_h''(x)=v_X'(x)+\frac{1}{\gamma^*}\cdot\frac{e^x}{(1+e^x)^2}=v_X'(x)+\frac{\tilde p(1-\tilde p)}{\gamma^*},$$
which implies $\mathrm{Var}(p\mid\tilde p)=\gamma^*\tilde p(1-\tilde p)+\gamma^{*2}v_X'(x)$. Noting that
$$v_X'(x)=\tilde p(1-\tilde p)\big[v_{\tilde p}'(\tilde p)\tilde p(1-\tilde p)+v_{\tilde p}(\tilde p)(1-2\tilde p)-2\big],$$
we derive
$$\mathrm{Var}(p\mid\tilde p)=\gamma^{*2}\tilde p(1-\tilde p)\big[v_{\tilde p}'(\tilde p)\tilde p(1-\tilde p)+v_{\tilde p}(\tilde p)(1-2\tilde p)-2\big]+\gamma^*\tilde p(1-\tilde p).$$
If we define $g^*(\tilde p)=\tilde p(1-\tilde p)v_{\tilde p}(\tilde p)$, then
$$E(p\mid\tilde p)=\tilde p+\gamma^*\big(g^*(\tilde p)+1-2\tilde p\big), \qquad \mathrm{Var}(p\mid\tilde p)=\gamma^*\tilde p(1-\tilde p)+\gamma^{*2}\tilde p(1-\tilde p)\big[g^{*\prime}(\tilde p)-2\big].$$
This completes the proof of Theorem 2.

Finally, we establish some properties of $g^*(\tilde p)$ and prove Corollary 1. We denote the marginal density of $\tilde p$ by $f$. First note that $g^*(1-\tilde p)=-\tilde p(1-\tilde p)\,\frac{d}{d\tilde p}f(1-\tilde p)\big/f(1-\tilde p)$. If $h(p)$ represents the prior density for $p$, then
$$f(\tilde p)=\int_0^1 B(\alpha,\beta)^{-1}\tilde p^{\alpha-1}(1-\tilde p)^{\beta-1}h(p)\,dp. \quad (3)$$
Because the function $h$ is bounded, differentiation under the integral sign is justified, and hence
$$f'(\tilde p)=\int_0^1 B(\alpha,\beta)^{-1}\big[(\alpha-1)\tilde p^{\alpha-2}(1-\tilde p)^{\beta-1}-(\beta-1)\tilde p^{\alpha-1}(1-\tilde p)^{\beta-2}\big]h(p)\,dp, \quad (4)$$
where $\alpha=p/\gamma^*$ and $\beta=(1-p)/\gamma^*$. Substituting $p^*=1-p$, we get
$$f(1-\tilde p)=\int_0^1 B(\beta,\alpha)^{-1}\tilde p^{\alpha-1}(1-\tilde p)^{\beta-1}h(1-p^*)\,dp^*=f(\tilde p)$$
and
$$\frac{d}{d\tilde p}f(1-\tilde p)=\int_0^1 B(\beta,\alpha)^{-1}\big[(\alpha-1)\tilde p^{\alpha-2}(1-\tilde p)^{\beta-1}-(\beta-1)\tilde p^{\alpha-1}(1-\tilde p)^{\beta-2}\big]h(1-p^*)\,dp^*=f'(\tilde p),$$
provided $h(p)=h(1-p)$. Hence, $g^*(1-\tilde p)=-\tilde p(1-\tilde p)f'(\tilde p)/f(\tilde p)=-g^*(\tilde p)$. By continuity of $g^*(\tilde p)$, this result also implies $g^*(0.5)=0$.

To complete the proof of Corollary 1, we note that under the assumption that the distribution of $p_i$ is symmetric, the conditional expected value $E(p_i\mid\tilde p_i)$ lies on the same side of 0.5 as $\tilde p_i$.

A.3 Proof of Theorem 3

As before, we denote the marginal density of $\tilde p$ by $f$. First, we derive a bound for $g^*$. Note that $-1\le\alpha-1\le 1/\gamma^*$ and, similarly, $-1\le\beta-1\le 1/\gamma^*$. Hence, by (3) and (4),
$$-\left(1-\tilde p+\frac{\tilde p}{\gamma^*}\right)f(\tilde p)\ \le\ \tilde p(1-\tilde p)f'(\tilde p)\ \le\ \left(\frac{1-\tilde p}{\gamma^*}+\tilde p\right)f(\tilde p),$$
which implies
$$|g^*(\tilde p)|\le\frac{1}{\gamma^*}. \quad (5)$$
Next, note that
$$\lim_{\tilde p\to 0}\tilde p(1-\tilde p)f(\tilde p)=0 \quad (6)$$
and
$$\lim_{\tilde p\to 1}\tilde p(1-\tilde p)f(\tilde p)=0. \quad (7)$$
Observe that
$$R(g(\tilde p))=E\big(g(\tilde p)-g^*(\tilde p)\big)^2=Eg(\tilde p)^2-2E\{g(\tilde p)g^*(\tilde p)\}+C=Eg(\tilde p)^2-2\int_0^1 g(\tilde p)\tilde p(1-\tilde p)\frac{f'(\tilde p)}{f(\tilde p)}f(\tilde p)\,d\tilde p+C$$
$$=Eg(\tilde p)^2-2\big[g(\tilde p)\tilde p(1-\tilde p)f(\tilde p)\big]_0^1+2\int_0^1\big[g(\tilde p)(1-2\tilde p)+\tilde p(1-\tilde p)g'(\tilde p)\big]f(\tilde p)\,d\tilde p+C$$
$$=Eg(\tilde p)^2+2\int_0^1\big[g(\tilde p)(1-2\tilde p)+\tilde p(1-\tilde p)g'(\tilde p)\big]f(\tilde p)\,d\tilde p+C, \quad (8)$$
where $C$ is a constant that does not depend on $g$, and the second to last line follows via integration by parts. Note that the last line holds when $g$ is bounded, because by (6), $\lim_{\tilde p\to 0}g(\tilde p)\tilde p(1-\tilde p)f(\tilde p)=0$, and by (7), $\lim_{\tilde p\to 1}g(\tilde p)\tilde p(1-\tilde p)f(\tilde p)=0$. In particular, due to inequality (5), relationship (8) holds when $g$ is the true function $g^*$.

A.4 Proof of Theorem 4

We write $\mathcal G_N$ for the class of all natural cubic spline functions $g$ on $[0,1]$ that correspond to the sequence of $n$ knots located at the observed $\tilde p_i$. Given a function $g$, we define $s_g(\tilde p)=2\big[g(\tilde p)(1-2\tilde p)+\tilde p(1-\tilde p)g'(\tilde p)\big]$ and $I^2(g)=\int_0^1[g''(\tilde p)]^2\,d\tilde p$. We also denote $(1/n)\sum_{i=1}^n g^2(\tilde p_i)$ and $\int_0^1 g^2(\tilde p)f^*(\tilde p)\,d\tilde p$ by $\|g\|_n^2$ and $\|g\|^2$, respectively. By Lemma 4 in Appendix A.9, there exists $g_N^*\in\mathcal G_N$ such that $\|g_N^*-g^*\|^2=O_p(\lambda_n^2)$ and
$$\|\hat g-g_N^*\|^2+\lambda_n^2 I^2(\hat g)\ \le\ O_p\big(n^{-2/7}\|\hat g-g_N^*\|\big)+O_p\big(n^{-4/7}I(\hat g)\big)+O_p\big(n^{-4/7}+\lambda_n^2\big).$$
We consider two possible cases: (a) $n^{-4/7}I(\hat g)\le n^{-2/7}\|\hat g-g_N^*\|+n^{-4/7}+\lambda_n^2$, and (b) $n^{-4/7}I(\hat g)>n^{-2/7}\|\hat g-g_N^*\|+n^{-4/7}+\lambda_n^2$.

Under (a) we have
$$\|\hat g-g_N^*\|^2+\lambda_n^2 I^2(\hat g)\le O_p\big(n^{-2/7}\|\hat g-g_N^*\|\big)+O_p\big(n^{-4/7}+\lambda_n^2\big). \quad (9)$$
It follows that $\|\hat g-g_N^*\|=O_p(n^{-2/7}+\lambda_n)$ and $I^2(\hat g)=O_p(n^{-4/7}\lambda_n^{-2}+1)$. However, taking into account the case (a) condition, we also have $I^2(\hat g)=O_p(n^{4/7}\lambda_n^2+1)$, thus leading to $I(\hat g)=O_p(1)$.

Under (b) we have
$$\|\hat g-g_N^*\|^2+\lambda_n^2 I^2(\hat g)\le O_p\big(n^{-4/7}I(\hat g)\big). \quad (10)$$
It follows that $I(\hat g)=O_p(n^{-4/7}\lambda_n^{-2})$ and $\|\hat g-g_N^*\|=O_p(n^{-4/7}\lambda_n^{-1})$.
Collecting all the stochastic bounds we have derived, and using the fact that $f^*$ is bounded away from zero, we deduce
$$\|\hat g-g_N^*\|=O_p\big(n^{-4/7}\lambda_n^{-1}+n^{-2/7}+\lambda_n\big) \quad\text{and}\quad I(\hat g)=O_p\big(1+n^{-4/7}\lambda_n^{-2}\big).$$
Using the bound $\|g_N^*-g^*\|^2=O_p(\lambda_n^2)$, together with the definitions of $r_n$ and $s_n$, we derive
$$\|\hat g-g^*\|=O_p(r_n) \quad\text{and}\quad I(\hat g-g^*)=O_p\big(1+n^{-4/7}\lambda_n^{-2}\big). \quad (11)$$
Applying Lemma 10.9 in [101], which builds on the interpolation inequality of [2], we derive $\|\hat g'-g^{*\prime}\|=O_p(\sqrt{r_n s_n})$. This establishes the error bounds for $\hat g$ and $\hat g'$ with respect to the $\|\cdot\|$ norm.

To derive the corresponding results with respect to the $\|\cdot\|_n$ norm, we first apply bound (17), in which we replace $g_N^*$ with $g^*$. It follows that
$$\|\hat g-g^*\|_n^2-\|\hat g-g^*\|^2=(\tilde P_n-\tilde P)[\hat g-g^*]^2=o_p\big(\|\hat g-g^*\|^2\big)+O_p\big(n^{-1}I^2(\hat g-g^*)\big),$$
where we use the notation from the proof of Lemma 4. Because bounds (11), together with the assumption $\lambda_n\gg n^{-8/21}$, imply
$$I(\hat g-g^*)=O\big(n^{-4/7}n^{16/21}\big)=O\big(n^{4/21}\big),$$
we can then derive
$$\|\hat g-g^*\|_n^2=O\big(\|\hat g-g^*\|^2\big)+O_p\big(n^{-13/21}\big).$$
Because $r_n\ge n^{-2/7}$, we have $r_n^2\ge n^{-13/21}$. Consequently, $\|\hat g-g^*\|_n^2=O(r_n^2)$, which establishes the analog of the first bound in (11) for the $\|\cdot\|_n$ norm.

It remains to derive $\|\hat g'-g^{*\prime}\|_n=O_p(\sqrt{r_n s_n})$. Applying Lemma 17 in [72], in conjunction with Corollary 5 from the same paper, in which we take $\gamma=2/3$ and $\lambda=n^{-3/14}$, we derive
$$(\tilde P_n-\tilde P)[\hat g'-g^{*\prime}]^2=O_p\Big(n^{-5/14}\big[\|\hat g'-g^{*\prime}\|^2+n^{-2/7}I^2(\hat g'-g^{*\prime})\big]\Big).$$
Consequently,
$$\|\hat g'-g^{*\prime}\|_n^2=O_p\big(\|\hat g'-g^{*\prime}\|^2\big)+O_p\big(n^{-9/14}I^2(\hat g'-g^{*\prime})\big).$$
Taking into account bound (11), the definition of $s_n$, the assumption $\lambda_n\gg n^{-8/21}$ and the inequality $r_n\ge n^{-2/7}$, we derive
$$n^{-9/14}I^2(\hat g'-g^{*\prime})=O_p\big(n^{-9/14}s_n^2\big)=O_p\big(n^{-19/42}s_n\big)=O_p(r_n s_n).$$
Thus, $\|\hat g'-g^{*\prime}\|_n^2=O_p\big(\|\hat g'-g^{*\prime}\|^2+r_n s_n\big)=O_p(r_n s_n)$, which completes the proof.
A.5 Proof of Theorem 5

We will take advantage of the results in Theorem 6, which are established independently of Theorem 5. We will focus on proving the results involving integrals, because the results for the averages follow by an analogous argument with minimal modifications.

We start by establishing consistency of $\hat p$. Fixing an arbitrary positive $\tilde\epsilon$, identifying a positive $\epsilon$ for which $\tilde P(0,\epsilon)+\tilde P(1-\epsilon,1)\le\tilde\epsilon/2$, and noting that $\hat p$ and $p_0$ fall in $[0,1]$ for every $\tilde p$, we derive
$$\|\hat p-p_0\|^2\le\tilde\epsilon/2+\int_{\epsilon}^{1-\epsilon}\big|\hat p(\tilde p)-p_0(\tilde p)\big|^2 f^*(\tilde p)\,d\tilde p.$$
By Theorem 6, the second term on the right-hand side of the above display is $o_p(1)$. Consequently,
$$P\big(\|\hat p-p_0\|^2>\tilde\epsilon\big)\to 0 \quad\text{as } n\to\infty.$$
As the above statement holds for every fixed positive $\tilde\epsilon$, we have established that $\|\hat p-p_0\|=o_p(1)$.

We now focus on showing consistency of $\widehat W$. Note that
$$\big[\hat\mu^2(\tilde p)+\hat\sigma^2(\tilde p)\big]^2\big/\hat\mu^2(\tilde p)\ \ge\ \big[2\hat\mu(\tilde p)\hat\sigma(\tilde p)\big]^2\big/\hat\mu^2(\tilde p)=4\hat\sigma^2(\tilde p).$$
Thus, the definition of $\hat p$ implies $\hat p^2(\tilde p)\ge\hat\sigma^2(\tilde p)\wedge 0.25$, and also $\hat p(\tilde p)\ge\hat\mu(\tilde p)\wedge 0.5$, for every $\tilde p\in(0,1)$. Writing $p$ for the true probability corresponding to the observed $\tilde p$, we then derive
$$\widehat W(\tilde p)=\frac{E_p\big[(p-\hat p(\tilde p))^2\mid\tilde p\big]}{\hat p^2(\tilde p)}\le\frac{\sigma^2(\tilde p)}{\hat p^2(\tilde p)}+\frac{[\hat p(\tilde p)-\mu(\tilde p)]^2}{\hat p^2(\tilde p)}\le\frac{|\hat\sigma^2(\tilde p)-\sigma^2(\tilde p)|}{4\hat\sigma^2(\tilde p)}+\frac{[\hat\mu(\tilde p)-\mu(\tilde p)]^2}{\hat\sigma^2(\tilde p)}+7. \quad (12)$$
By Theorem 6, we have $\|\hat\sigma^2-\sigma^2\|=O_p(\sqrt{r_n s_n})=o_p(1)$ and $\|\hat\mu-\mu\|=O_p(\sqrt{r_n s_n})=o_p(1)$. Fix an arbitrary positive $\epsilon$ and define $A_\epsilon=(0,\epsilon)\cup(1-\epsilon,1)$. Applying the Cauchy-Schwarz inequality, and using the imposed technical modification of the ECAP approach to bound $\hat\sigma^2$ from below, we derive
$$\int_{A_\epsilon}\frac{|\hat\sigma^2(\tilde p)-\sigma^2(\tilde p)|}{\hat\sigma^2(\tilde p)}f^*(\tilde p)\,d\tilde p\ \le\ [\tilde P A_\epsilon]^{1/2}\,\frac{\|\hat\sigma^2-\sigma^2\|}{c\sqrt{r_n s_n}}=[\tilde P A_\epsilon]^{1/2}O_p(1)=O_p(\epsilon^{1/2}).$$
Similarly, we derive
$$\int_{A_\epsilon}\frac{|\hat\mu(\tilde p)-\mu(\tilde p)|^2}{\hat\sigma^2(\tilde p)}f^*(\tilde p)\,d\tilde p\ \le\ \int_{A_\epsilon}\frac{|\hat\mu(\tilde p)-\mu(\tilde p)|}{\hat\sigma^2(\tilde p)}f^*(\tilde p)\,d\tilde p=O_p(\epsilon^{1/2}).$$
Note that $|W_0(\tilde p)|\le 1$ for every $\tilde p$.
Thus, combining the bounds for the terms in (12) with the corresponding bound for $|\widehat W-W_0|$ in Theorem 6, we derive
$$\int_0^1\big|\widehat W(\tilde p)-W_0(\tilde p)\big|f^*(\tilde p)\,d\tilde p=O_p(\epsilon^{1/2})+o_p(1).$$
As this bound holds for every positive $\epsilon$, we deduce that $\int_0^1|\widehat W(\tilde p)-W_0(\tilde p)|f^*(\tilde p)\,d\tilde p=o_p(1)$.

A.6 Proof of Theorem 6

We build on the results of Theorem 4 to derive the rates of convergence for $\hat\mu$ and $\widehat W$ for a fixed positive $\epsilon$. Continuity and positivity of $\mu(\tilde p)$ and $p_0(\tilde p)$ imply that both functions are bounded away from zero on the interval $[\epsilon,1-\epsilon]$. Applying Lemma 10.9 in [101], we derive $\|\hat g-g^*\|_\infty=O_p(r_n^{3/4}s_n^{1/4})$. Because $n^{-8/21}\ll\lambda_n\ll 1$, we have $\|\hat g-g^*\|_\infty=o_p(1)$, which implies $\sup_{[\epsilon,1-\epsilon]}|\hat\mu(\tilde p)-\mu(\tilde p)|=o_p(1)$. Also note that $\hat p(\tilde p)\ge\hat\mu(\tilde p)$ for all $\tilde p$. Consequently, there exists an event with probability tending to one, on which the random functions $\hat p(\tilde p)$ and $\hat\mu(\tilde p)$ are bounded away from zero on the interval $[\epsilon,1-\epsilon]$. The stated error bounds for $\hat p$ then follow directly from this observation and the error bounds for $\hat g$ and $\hat g'$ in Theorem 4.

For the remainder of the proof we restrict our attention to the event $A$ (whose probability tends to one), on which the functions $p_0$ and $\hat p$ are both bounded away from zero on $[\epsilon,1-\epsilon]$. We write $p$ for the true probability corresponding to the observed $\tilde p$, and define $G(q)=E[(p-q)^2/q^2\mid\tilde p]$. Let $p^*$ denote the minimizer of $G$, given by $p^*=E[p\mid\tilde p]+\mathrm{Var}[p\mid\tilde p]/E[p\mid\tilde p]$, and note that $G'(q)=2(q-p^*)E(p\mid\tilde p)/q^3$. Denote by $\hat p^*$ our estimator of $p^*$, which is obtained by replacing the conditional expected value and variance in the above formula with their ECAP estimators. While $p^*$ and $\hat p^*$ depend on $\tilde p$, we will generally suppress this dependence in the notation for simplicity. Note that for $\tilde p\in[\epsilon,1-\epsilon]$, the functions $p^*$ and $\hat p^*$ are both bounded away from zero on the set $A$. Fix an arbitrary $\tilde p\le 0.5$.
Define the events $A_1=A\cap\{p^*\le 0.5,\hat p^*\le 0.5\}$, $A_2=A\cap\{p^*>0.5,\hat p^*\le 0.5\}$, $A_3=A\cap\{p^*\le 0.5,\hat p^*>0.5\}$ and $A_4=A\cap\{p^*>0.5,\hat p^*>0.5\}$. Note that $A_4$ implies $\hat p=p_0=0.5$. Writing Taylor expansions for the function $G$ near $p^*$ and $0.5$, we derive the following bounds, which hold for some universal constant $c$ that depends only on $\epsilon$:
$$\big|W_0(\tilde p)-\widehat W(\tilde p)\big|1_{\{A\}}=\big|G(p^*)-G(\hat p^*)\big|1_{\{A_1\}}+\big|G(0.5)-G(\hat p^*)\big|1_{\{A_2\}}+\big|G(p^*)-G(0.5)\big|1_{\{A_3\}}$$
$$\le\ c(p^*-\hat p^*)^2 1_{\{A_1\}}+c(0.5-\hat p^*)^2 1_{\{A_2\}}+c(p^*-0.5)^2 1_{\{A_3\}}\ \le\ c(p^*-\hat p^*)^2.$$
Analogous arguments establish the above bound for $\tilde p>0.5$. The rate of convergence for $\widehat W$ then follows directly from the error bounds for $\hat g$ and $\hat g'$ in Theorem 4.

A.7 Proof of Theorem 7

Throughout the proof we drop the subscript $i$ for simplicity of notation. First note that the derivations in the proof of Theorem 2 also give $E(\gamma^*\alpha\mid\tilde p)=\mu$ and $\mathrm{Var}(\gamma^*\alpha\mid\tilde p)=\sigma^2$, where $\mu$ and $\sigma^2$ are defined in (8) and (9), respectively. These identities hold for both the unbiased and the biased versions of the model; the only difference is in how $\gamma^*\alpha$ relates to $p$. Note that
$$E(p\mid\tilde p)=E\big(h(\gamma^*\alpha)\mid\tilde p\big)=(1-0.5\theta)E(\gamma^*\alpha\mid\tilde p)-\theta\big[E(\gamma^{*3}\alpha^3\mid\tilde p)-1.5E(\gamma^{*2}\alpha^2\mid\tilde p)\big]$$
$$=(1-0.5\theta)\mu-\theta\big[s_3+3\mu\sigma^2+\mu^3-1.5\sigma^2-1.5\mu^2\big], \quad (13)$$
where we use $s_k$ to denote the $k$-th conditional central moment of $\gamma^*\alpha$ given $\tilde p$. By Lemma 5 in Appendix A.9, the $s_3$ term in (13) is $O(\gamma^{*3/2})$, which leads to the stated approximation for $E(p\mid\tilde p)$. We also have
$$\mathrm{Var}(p\mid\tilde p)=\mathrm{Var}\big(h(\gamma^*\alpha)\mid\tilde p\big)=(1-0.5\theta)^2\sigma^2+\theta a,$$
where
$$a=\theta\,\mathrm{Var}\big(\gamma^{*3}\alpha^3-1.5\gamma^{*2}\alpha^2\mid\tilde p\big)-2(1-0.5\theta)\,\mathrm{Cov}\big(\gamma^*\alpha,\ \gamma^{*3}\alpha^3-1.5\gamma^{*2}\alpha^2\mid\tilde p\big).$$
It remains to show that $a=O(\gamma^{*3/2})$. A routine calculation yields
$$a=3\mu(1-\mu)\big(3\theta\mu(1-\mu)-0.5\theta+1\big)\sigma^2+O\Big(\sum_{k=3}^{6}[\sigma^k+s_k]\Big).$$
By Lemma 5, the remainder term is $O(\gamma^{*3/2})$, which completes the proof.

A.8 Proof of Theorem 8

We use the notation from the proof of Theorem 2.
In particular, we omit the subscript $i$ throughout most of the proof, for simplicity of exposition. We represent $\tilde p$ as $\sum_{k=1}^K 1_{\{I=k\}}\xi_k$, where $\xi_k\mid p\sim\mathrm{Beta}(\alpha_k,\beta_k)$ with $\alpha_k=c_k p/\gamma^*$ and $\beta_k=(1-c_k p)/\gamma^*$, and $I$ is a discrete random variable, independent of $p$ and the $\xi_k$, whose probability distribution is given by $P(I=k)=w_k$ for $k=1,\dots,K$. Note that
$$f_{\xi_k}(\tilde p\mid p)=B\!\left(\frac{c_k p}{\gamma^*},\frac{1-c_k p}{\gamma^*}\right)^{-1}\tilde p^{\,c_k p/\gamma^*-1}(1-\tilde p)^{(1-c_k p)/\gamma^*-1}.$$
Hence, writing $B$ for $B\big(\frac{c_k p}{\gamma^*},\frac{1-c_k p}{\gamma^*}\big)$, we derive
$$\log f_{\xi_k}(\tilde p\mid p)=-\log B+\left(\frac{c_k p}{\gamma^*}-1\right)\log\tilde p+\left(\frac{1-c_k p}{\gamma^*}-1\right)\log(1-\tilde p)$$
$$=-\log B+p\,\frac{c_k}{\gamma^*}\log\frac{\tilde p}{1-\tilde p}-\log\tilde p+\frac{1-\gamma^*}{\gamma^*}\log(1-\tilde p)=-\log B+\eta x-\log\tilde p+\frac{1-\gamma^*}{\gamma^*}\log(1-\tilde p),$$
where we have defined $\eta=pc_k/\gamma^*$ and $x=\log\big(\tilde p/[1-\tilde p]\big)$. Repeating the derivations in the proof of Theorem 2 directly below display (1), we derive
$$E(p\mid\xi_k=\tilde p)=\frac{1}{c_k}\Big[\gamma^*\big(g^*(\tilde p)+1-2\tilde p\big)+\tilde p\Big], \qquad \mathrm{Var}(p\mid\xi_k=\tilde p)=\frac{1}{c_k^2}\Big[\gamma^{*2}\tilde p(1-\tilde p)\big(g^{*\prime}(\tilde p)-2\big)+\gamma^*\tilde p(1-\tilde p)\Big].$$
Consequently,
$$E(p\mid\tilde p,I=k)=E(p\mid\xi_k=\tilde p) \quad\text{and}\quad \mathrm{Var}(p\mid\tilde p,I=k)=\mathrm{Var}(p\mid\xi_k=\tilde p). \quad (14)$$
Applying the law of total probability and using the fact that $I$ and $\tilde p$ are independent, we derive
$$E(p\mid\tilde p)=\sum_{k=1}^K w_k\,E(p\mid\tilde p,I=k)=\sum_{k=1}^K\frac{w_k}{c_k}\Big[\gamma^*\big(g^*(\tilde p)+1-2\tilde p\big)+\tilde p\Big].$$
By the law of total variance, we also have
$$\mathrm{Var}(p\mid\tilde p)=\sum_{k=1}^K w_k\,\mathrm{Var}(p\mid\tilde p,I=k)+\sum_{k=1}^K w_k\,E^2(p\mid\tilde p,I=k)-\Big(\sum_{k=1}^K w_k\,E(p\mid\tilde p,I=k)\Big)^2$$
$$=\sum_{k=1}^K\frac{w_k}{c_k^2}\Big[\gamma^{*2}\tilde p(1-\tilde p)\big(g^{*\prime}(\tilde p)-2\big)+\gamma^*\tilde p(1-\tilde p)\Big]+\Big[\gamma^*\big(g^*(\tilde p)+1-2\tilde p\big)+\tilde p\Big]^2\left[\sum_{k=1}^K\frac{w_k}{c_k^2}-\Big(\sum_{k=1}^K\frac{w_k}{c_k}\Big)^2\right].$$
To complete the proof, we use the formulas
$$\mu_i=\tilde p_i+\gamma^*\big[g^*(\tilde p_i)+1-2\tilde p_i\big] \quad\text{and}\quad \sigma_i^2=\gamma^*\tilde p_i(1-\tilde p_i)+\gamma^{*2}\tilde p_i(1-\tilde p_i)\big[g^{*\prime}(\tilde p_i)-2\big]$$
to rewrite the above expressions as $E(p_i\mid\tilde p_i)=\mu_i\sum_{k=1}^K w_k/c_k$ and
$$\mathrm{Var}(p_i\mid\tilde p_i)=\sum_{k=1}^K\frac{w_k}{c_k^2}\sigma_i^2+\mu_i^2\sum_{k=1}^K\frac{w_k}{c_k^2}-\mu_i^2\Big(\sum_{k=1}^K\frac{w_k}{c_k}\Big)^2=(\sigma_i^2+\mu_i^2)\sum_{k=1}^K\frac{w_k}{c_k^2}-\mu_i^2\Big(\sum_{k=1}^K\frac{w_k}{c_k}\Big)^2.$$

A.9 Supplementary Results

Lemma 4. Under the conditions of Theorem 6, there exists a function $g_N^*\in\mathcal G_N$ such that $\|g_N^*-g^*\|^2=O_p(\lambda_n^2)$ and
$$\|\hat g-g_N^*\|^2+\lambda_n^2 I^2(\hat g)\le O_p\big(n^{-2/7}\|\hat g-g_N^*\|\big)+O_p\big(n^{-4/7}I(\hat g)\big)+O_p\big(n^{-4/7}+\lambda_n^2\big).$$

Proof of Lemma 4. We will use empirical process theory notation and write $\tilde P_n g$ and $\tilde P g$ for $(1/n)\sum_{i=1}^n g(\tilde p_i)$ and $\int_0^1 g(\tilde p)f^*(\tilde p)\,d\tilde p$, respectively. Using the new notation, criterion (14) can be written as follows:
$$Q_n(g)=\tilde P_n g^2+\tilde P_n s_g+\lambda_n^2 I^2(g).$$
As we showed in the proof of Theorem 3, the equality $\tilde P g^2+\tilde P s_g=\|g-g^*\|^2$ holds, up to an additive constant, for every candidate function $g\in\mathcal G_N$. Consequently,
$$Q_n(g)=\|g-g^*\|^2+(\tilde P_n-\tilde P)g^2+(\tilde P_n-\tilde P)s_g+\lambda_n^2 I^2(g).$$
Let $g_N^*$ be a function in $\mathcal G_N$ that interpolates $g^*$ at the points $\{0,\tilde p_1,\dots,\tilde p_n,1\}$, with two additional constraints: $g_N^{*\prime}(0)=g^{*\prime}(0)$ and $g_N^{*\prime}(1)=g^{*\prime}(1)$. A standard partial integration argument [similar to that in 42, for example] shows that $I(g_N^*)\le I(g^*)$, which also implies that $g_N^{*\prime}$ is uniformly bounded. Furthermore, we have $\|g_N^*-g^*\|_\infty=O_p(\log(n)/n)$ by the maximum spacing results for the uniform distribution [88, for example], the assumption that $f^*$ is bounded away from zero, and the boundedness of $g_N^{*\prime}$. Consequently, $\|g_N^*-g^*\|^2=O_p(\lambda_n^2)$. Because $Q_n(\hat g)\le Q_n(g_N^*)$, we then have
$$\|\hat g-g_N^*\|^2+\lambda_n^2 I^2(\hat g)\le(\tilde P_n-\tilde P)\big[g_N^{*2}-\hat g^2\big]+(\tilde P_n-\tilde P)\big[s_{g_N^*}-s_{\hat g}\big]+\lambda_n^2\big[I^2(g_N^*)+1\big]. \quad (15)$$
Note that
$$(\tilde P_n-\tilde P)\big[g_N^{*2}-\hat g^2\big]=-(\tilde P_n-\tilde P)[\hat g-g_N^*]^2-2(\tilde P_n-\tilde P)\big(g_N^*[\hat g-g_N^*]\big). \quad (16)$$
Applying Lemma 17 in [72], in conjunction with Corollary 5 from the same paper, in which we take $\gamma=2/5$ and $\lambda=n^{-1/2}$, we derive
$$(\tilde P_n-\tilde P)[g_N^*-\hat g]^2=O_p\big(n^{-1/5}\|\hat g-g_N^*\|^2\big)+O_p\big(n^{-1}I^2(\hat g-g_N^*)\big). \quad (17)$$
Applying Corollary 5 in [72] with the same $\gamma$ and $\lambda$ yields
$$(\tilde P_n-\tilde P)\big(g_N^*[\hat g-g_N^*]\big)=O_p\Big(n^{-2/5}\sqrt{\big\|g_N^*[\hat g-g_N^*]\big\|^2+n^{-4/5}I^2\big(g_N^*[\hat g-g_N^*]\big)}\Big).$$
Using Lemma 10.9 in [101] to express the $L_2$ norm of the first derivative in terms of the norms of the second derivative and the original function, we derive
$$(\tilde P_n-\tilde P)\big(g_N^*[\hat g-g_N^*]\big)=O_p\big(n^{-2/5}\|\hat g-g_N^*\|\big)+O_p\big(n^{-4/5}I(\hat g-g_N^*)\big). \quad (18)$$
Applying Corollary 5 in [72] with $\gamma=2/3$, $\lambda=n^{-3/14}$, and using Lemma 10.9 in [101] again, we derive
$$(\tilde P_n-\tilde P)\big[s_{g_N^*}-s_{\hat g}\big]=O_p\big(n^{-3/7}\|s_{g_N^*}-s_{\hat g}\|\big)+O_p\big(n^{-4/7}I(\hat g-g_N^*)\big).$$
Hence, by Lemma 10.9 in [101],
$$(\tilde P_n-\tilde P)\big[s_{g_N^*}-s_{\hat g}\big]=O_p\big(n^{-3/7}\|\hat g-g_N^*\|^{1/2}I^{1/2}(\hat g-g_N^*)\big)+O_p\big(n^{-4/7}I(\hat g-g_N^*)\big),$$
which leads to
$$(\tilde P_n-\tilde P)\big[s_{g_N^*}-s_{\hat g}\big]=O_p\big(n^{-2/7}\|\hat g-g_N^*\|\big)+O_p\big(n^{-4/7}I(\hat g)\big)+O_p\big(n^{-4/7}I(g_N^*)\big). \quad (19)$$
Combining (15)-(19), and noting the imposed assumptions on $\lambda_n$, we arrive at
$$\|\hat g-g_N^*\|^2+\lambda_n^2 I^2(\hat g)\le O_p\big(n^{-2/7}\|\hat g-g_N^*\|\big)+O_p\big(n^{-4/7}I(\hat g)\big)+O_p\big(n^{-4/7}+\lambda_n^2\big),$$
which completes the proof of Lemma 4.

Lemma 5. Under the conditions of Theorem 7, $\sigma^2=O(\gamma^*)$ and $s_k=O(\gamma^{*3/2})$, for $k\ge 3$.

Proof of Lemma 5. We first show that $E\big(|t-\tilde p|\,\big|\,\tilde p\big)=O(\sqrt{\gamma^*})$ as $\gamma^*$ tends to zero, where we write $t$ for the quantity $\alpha\gamma^*=h_\theta^{-1}(p)$. This result will be useful for establishing the stated bound for $s_k$. Throughout the proof we use the expression $\gtrsim$ to denote inequality $\ge$ up to a multiplicative factor equal to a positive constant that does not depend on $\gamma^*$; we use the analogous convention for $\lesssim$.
We write $f_c(\tilde p)$ for the conditional density of $\tilde p$ given $t=c$, write $f(\tilde p)$ for the marginal density of $\tilde p$, and write $m_\theta(t)$ for the marginal density of $t$. In the new notation, we have
$$E\big(|t-\tilde p|\,\big|\,\tilde p\big)=\int_0^1|t-\tilde p|\,f_t(\tilde p)\,m_\theta(t)\,[f(\tilde p)]^{-1}\,dt.$$
Using Stirling's approximation for the Gamma function, $\Gamma(x)=e^{-x}x^{x-1/2}(2\pi)^{1/2}[1+O(1/x)]$, and applying the bound $x\Gamma(x)=O(1)$ when $t$ is close to zero or one, we derive the following bounds as $\tau$ tends to infinity, where we write $q$ for $1-t$ and $\tilde q$ for $1-\tilde p$:
$$\sqrt\tau\,E\big(|t-\tilde p|\,\big|\,\tilde p\big)=\int_0^1\sqrt\tau\,|t-\tilde p|\,\frac{\Gamma(\tau)}{\Gamma(t\tau)\Gamma(q\tau)}\,\tilde p^{\,t\tau-1}\tilde q^{\,q\tau-1}\,m_\theta(t)[f(\tilde p)]^{-1}\,dt$$
$$\lesssim\int_0^1\sqrt\tau\,|t-\tilde p|\,\frac{1}{\sqrt{2\pi}}\,[\tilde p/t]^{t\tau}[\tilde q/q]^{q\tau}\sqrt{tq}\,\tau\,m_\theta(t)[f(\tilde p)]^{-1}[\tilde p\tilde q]^{-1}\,dt\ \lesssim\ \int_0^1\sqrt\tau\,|t-\tilde p|\,e^{-\frac{\tau(t-\tilde p)^2}{18}}\sqrt\tau\,dt.$$
Implementing a change of variable, $v=\sqrt\tau(t-\tilde p)$, we derive
$$\sqrt\tau\,E\big(|t-\tilde p|\,\big|\,\tilde p\big)\lesssim\int_{\mathbb R}|v|\,e^{-v^2/18}\,dv=O(1).$$
Consequently, $E\big(|t-\tilde p|\,\big|\,\tilde p\big)=O(1/\sqrt\tau)=O(\sqrt{\gamma^*})$.

We now bound $E\big([t-\tilde p]^2\,\big|\,\tilde p\big)$ using a similar argument. Following the derivations above, we arrive at
$$\tau\,E\big([t-\tilde p]^2\,\big|\,\tilde p\big)\lesssim\int_0^1\tau(t-\tilde p)^2\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{\tau(t-\tilde p)^2}{18}}\sqrt\tau\,m_\theta(t)\,dt.$$
Implementing the change of variable $v=\sqrt\tau(t-\tilde p)$, we conclude that
$$\tau\,E\big([t-\tilde p]^2\,\big|\,\tilde p\big)\lesssim\int_{\mathbb R}v^2 e^{-v^2/18}\,dv=O(1).$$
Thus, we have established
$$E\big([t-\tilde p]^k\,\big|\,\tilde p\big)=O(\gamma^{*k/2}), \quad\text{for } k\in\{1,2\}. \quad (20)$$
Analogous arguments lead to the bounds $E\big([t-\tilde p]^k\,\big|\,\tilde p\big)=O(\gamma^{*3/2})$ for $k\ge 3$. We complete the proof of the lemma by noting that
$$s_k=O\Big(E\big([t-\tilde p]^k\,\big|\,\tilde p\big)\Big)+O\big(\gamma^{*3/2}\big), \quad\text{for } k\ge 2.$$
When $k=2$ the above approximation follows from $\sigma^2\le E\big([t-\tilde p]^2\,\big|\,\tilde p\big)$, and when $k=3$ it follows from (20) and
$$s_3=E\big([t-\tilde p]^3\,\big|\,\tilde p\big)+3E\big([t-\tilde p]^2\,\big|\,\tilde p\big)E\big(\tilde p-t\,\big|\,\tilde p\big)+3E\big(t-\tilde p\,\big|\,\tilde p\big)E^2\big(\tilde p-t\,\big|\,\tilde p\big)+E^3\big(\tilde p-t\,\big|\,\tilde p\big)=E\big([t-\tilde p]^3\,\big|\,\tilde p\big)+O\big(\gamma^{*3/2}\big).$$
The derivations for $k\ge 4$ are analogous.
Appendix B
Appendix for Asymmetric error control under imperfect supervision: a label-noise-adjusted Neyman-Pearson umbrella algorithm

B.1 Summary of sampling scheme

This section summarizes our sampling scheme and the related notation for the reader's convenience. First, to review the NP paradigm and to make a contrast with the corrupted setting, we introduced notation for the uncorrupted samples: let $\mathcal S^0=\{X_j^0\}_{j=1}^{M_0}$ and $\mathcal S^1=\{X_j^1\}_{j=1}^{M_1}$, respectively, be the uncorrupted observations in classes 0 and 1, where $M_0$ and $M_1$ are the number of observations from each class. To construct the original NP umbrella algorithm (for uncorrupted data), $\mathcal S^0$ is randomly split into $\mathcal S^0=\mathcal S^0_b\cup\mathcal S^0_t$, where the subscript $b$ reinforces that this part is used to train a base algorithm (e.g., logistic regression, random forest), and the subscript $t$ reinforces that this part of the data is used to find the threshold. In the uncorrupted scenario, we do not split $\mathcal S^1$; all of $\mathcal S^1$ is used together with $\mathcal S^0_b$ to train the base algorithm.

For the corrupted scenario, which is the focus of our paper, we assume the following sampling scheme for methodology and theory development. Let $\tilde{\mathcal S}^0=\{\tilde X_j^0\}_{j=1}^{N_0}$ be the corrupted class 0 observations and $\tilde{\mathcal S}^1=\{\tilde X_j^1\}_{j=1}^{N_1}$ be the corrupted class 1 observations. The sample sizes $N_0$ and $N_1$ are considered to be non-random. The split for the corrupted scenario is more complicated than its uncorrupted counterpart. Concretely, we split $\tilde{\mathcal S}^0$ into three parts, $\tilde{\mathcal S}^0=\tilde{\mathcal S}^0_b\cup\tilde{\mathcal S}^0_t\cup\tilde{\mathcal S}^0_e$, and split $\tilde{\mathcal S}^1$ into two parts, $\tilde{\mathcal S}^1=\tilde{\mathcal S}^1_b\cup\tilde{\mathcal S}^1_e$. The subscripts $b$ and $t$ have the same meaning as in the uncorrupted case, while the subscript $e$ stands for estimation: $\tilde{\mathcal S}^0_e$ and $\tilde{\mathcal S}^1_e$ are used to estimate a correction term that accounts for the label noise.
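The splitting scheme above can be sketched as follows. This is an illustrative Python sketch rather than the authors' implementation, and the split fractions `f_t` and `f_e` are hypothetical choices:

```python
import random

def split_corrupted(S0_tilde, S1_tilde, f_t=0.3, f_e=0.2, seed=0):
    """Split corrupted class-0 data into base/threshold/estimation parts
    and corrupted class-1 data into base/estimation parts."""
    rng = random.Random(seed)
    s0 = list(S0_tilde)
    rng.shuffle(s0)
    n_t, n_e = int(f_t * len(s0)), int(f_e * len(s0))
    S0_t, S0_e, S0_b = s0[:n_t], s0[n_t:n_t + n_e], s0[n_t + n_e:]
    s1 = list(S1_tilde)
    rng.shuffle(s1)
    m_e = int(f_e * len(s1))
    S1_e, S1_b = s1[:m_e], s1[m_e:]
    # S0_b and S1_b together train the base algorithm; S0_t sets the
    # threshold; S0_e and S1_e estimate the label-noise correction term.
    return S0_b, S0_t, S0_e, S1_b, S1_e

parts = split_corrupted(list(range(100)), list(range(50)))
S0_b, S0_t, S0_e, S1_b, S1_e = parts
```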
Given the above decomposition of $\tilde{\mathcal S}^0$ and $\tilde{\mathcal S}^1$, we also use $\tilde{\mathcal S}_b=\tilde{\mathcal S}^0_b\cup\tilde{\mathcal S}^1_b$ to denote all corrupted class 0 and class 1 observations that are used to train the base algorithm in the label-noise-adjusted NP umbrella algorithm. The sample size $n$ is reserved for $|\mathcal S^0_t|$ in the uncorrupted scenario, or for $|\tilde{\mathcal S}^0_t|$ in the corrupted scenario. The other sub-sample size notations are all for the corrupted scenario. In particular, $n_b=|\tilde{\mathcal S}_b|=|\tilde{\mathcal S}^0_b\cup\tilde{\mathcal S}^1_b|$, $n_e^0=|\tilde{\mathcal S}^0_e|$, and $n_e^1=|\tilde{\mathcal S}^1_e|$.

Algorithm 2: Binary Search for $\alpha_{k,\delta}$
Input: $\delta$: a small tolerance level, $0<\delta<1$; $k,n$: two integers such that $k\le n$; $r$: a small error tolerance (we implement $r=10^{-5}$ in our numerical analysis)
1: $\alpha_{\min}\leftarrow 0$
2: $\alpha_{\max}\leftarrow 1$
3: $\delta_{\max}\leftarrow\sum_{j=k}^{n}\binom{n}{j}(1-\alpha_{\min})^j\alpha_{\min}^{n-j}$
4: $\delta_{\min}\leftarrow\sum_{j=k}^{n}\binom{n}{j}(1-\alpha_{\max})^j\alpha_{\max}^{n-j}$
5: $E\leftarrow 2$
6: while $E>r$ do
7:   $\alpha_{\mathrm{middle}}\leftarrow(\alpha_{\min}+\alpha_{\max})/2$
8:   $\delta_{\mathrm{middle}}\leftarrow\sum_{j=k}^{n}\binom{n}{j}(1-\alpha_{\mathrm{middle}})^j\alpha_{\mathrm{middle}}^{n-j}$
9:   if $\delta_{\mathrm{middle}}=\delta$ then Output: $\alpha_{\mathrm{middle}}$
10:  else if $\delta_{\mathrm{middle}}>\delta$ then
11:    $\alpha_{\min}\leftarrow\alpha_{\mathrm{middle}}$
12:  else
13:    $\alpha_{\max}\leftarrow\alpha_{\mathrm{middle}}$
14:  end
15:  $E\leftarrow|\delta_{\mathrm{middle}}-\delta|$
16: end
Output: $\alpha_{\mathrm{middle}}$

B.2 BINARY SEARCH Algorithm

Here $r$ is the error tolerance in the stopping criterion of the binary search; that is, the algorithm stops when $\big|\sum_{j=k}^{n}\binom{n}{j}(1-\alpha_{\mathrm{middle}})^j\alpha_{\mathrm{middle}}^{n-j}-\delta\big|\le r$. Note that, since the sum is decreasing in $\alpha_{\mathrm{middle}}$, lines 10-13 raise $\alpha_{\min}$ when $\delta_{\mathrm{middle}}>\delta$ and lower $\alpha_{\max}$ otherwise.

B.3 An example for Assumption 3

Example 3. Under the same distributional setting as in Example 1, let $\hat T$ be trained by linear discriminant analysis (LDA) on $\tilde{\mathcal S}_b$; that is, $\hat T(X)=\hat{\tilde\sigma}^{-2}(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)X$, in which $\hat{\tilde\mu}_0$ and $\hat{\tilde\mu}_1$ are the sample means of the corrupted class 0 and class 1 observations, respectively, and $\hat{\tilde\sigma}^2$ is the pooled sample variance. For any $z\in\mathbb R$, by Lemma 6 in the Appendix, we have
$$\tilde F_0^{\hat T}(z)-\tilde F_1^{\hat T}(z)=(m_0-m_1)\big(F_0^{\hat T}(z)-F_1^{\hat T}(z)\big).$$
Therefore, when $m_0>m_1$ (as assumed in Assumption 3.2), $\tilde F_0^{\hat T}(z)>\tilde F_1^{\hat T}(z)$ is equivalent to $F_0^{\hat T}(z)>F_1^{\hat T}(z)$.
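The bisection in Algorithm 2 (Appendix B.2) can be sketched in Python as follows. This is an illustrative translation rather than the authors' code; it relies on the fact that the violation probability $\sum_{j=k}^{n}\binom{n}{j}(1-\alpha)^j\alpha^{n-j}$ is continuous and decreasing in $\alpha$, so bisection drives it to within $r$ of $\delta$:

```python
import math

def violation_prob(alpha, k, n):
    # P(at least k successes out of n i.i.d. Bernoulli(1 - alpha) draws):
    # sum_{j=k}^{n} C(n, j) (1 - alpha)^j alpha^(n - j)
    return sum(math.comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
               for j in range(k, n + 1))

def alpha_search(k, n, delta, r=1e-5):
    """Bisect for the alpha at which violation_prob is within r of delta."""
    alpha_min, alpha_max = 0.0, 1.0
    while True:
        alpha_mid = (alpha_min + alpha_max) / 2
        d = violation_prob(alpha_mid, k, n)
        if abs(d - delta) <= r:
            return alpha_mid
        if d > delta:          # probability too large: alpha_mid is too small
            alpha_min = alpha_mid
        else:                  # probability too small: alpha_mid is too large
            alpha_max = alpha_mid
```

For example, `alpha_search(19, 20, 0.05)` returns a level whose violation probability lies within $10^{-5}$ of $0.05$.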
We first fix $\tilde{\mathcal S}_b$; then $\hat T(X^0)\sim N\big(\hat{\tilde\sigma}^{-2}(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)\mu_0,\ \hat{\tilde\sigma}^{-4}(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)^2\sigma^2\big)$ and $\hat T(X^1)\sim N\big(\hat{\tilde\sigma}^{-2}(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)\mu_1,\ \hat{\tilde\sigma}^{-4}(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)^2\sigma^2\big)$. Since these two distributions are normal with the same variance and different means, $F_0^{\hat T}(z)>F_1^{\hat T}(z)$ as long as $\hat{\tilde\sigma}^{-2}(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)\mu_0<\hat{\tilde\sigma}^{-2}(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)\mu_1$, or equivalently, $(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)(\mu_1-\mu_0)>0$. By Lemma 6 in the Appendix, this condition can be written as $(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)(\tilde\mu_1-\tilde\mu_0)/(m_0-m_1)>0$, where $\tilde\mu_0$ and $\tilde\mu_1$ are the means of $\tilde X^0$ and $\tilde X^1$, respectively. When $m_0>m_1$, this is further equivalent to $(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)(\tilde\mu_1-\tilde\mu_0)>0$. Then Assumption 3.4.1 follows from the law of large numbers.

B.4 Additional Numerical Results

B.4.1 Additional Simulations

We apply Algorithm 1# in Simulation 6. For the $m_0^\#$ and $m_1^\#$ needed in Algorithm 1#, we use the estimators proposed by [69]. Technically, [69] estimates the "flip rates" $\mathrm{IP}(\tilde Y=1\mid Y=0)$ and $\mathrm{IP}(\tilde Y=0\mid Y=1)$; our corruption levels can be derived from the flip rates by the Bayes theorem.

Simulation 6. The distributional setting is the same as in Simulation 2. For different $(m_0,m_1,\alpha,\delta)$ combinations, the (approximate) type I error violation rates and the averages of (approximate) true type II errors generated by Algorithm 1# are reported in Tables B.1 and B.2, respectively.

N | m0=.95, m1=.05, α=.05, δ=.05 | m0=.9, m1=.1, α=.05, δ=.05 | m0=.95, m1=.05, α=.1, δ=.1 | m0=.9, m1=.1, α=.1, δ=.1
200 | .067(7.91) | .068(7.96) | .131(10.67) | .101(9.53)
500 | .084(8.78) | .083(8.73) | .134(10.78) | .115(10.09)
1,000 | .463(15.78) | .182(12.21) | .497(15.82) | .197(12.58)
2,000 | .665(14.93) | .190(12.41) | .695(14.57) | .209(12.86)

Table B.1: (Approximate) type I error violation rates over 1,000 repetitions for Simulation 6. Standard errors (×10⁻³) in parentheses.
N | m0=.95, m1=.05, α=.05, δ=.05 | m0=.9, m1=.1, α=.05, δ=.05 | m0=.95, m1=.05, α=.1, δ=.1 | m0=.9, m1=.1, α=.1, δ=.1
200 | .431(9.36) | .589(9.79) | .150(2.44) | .221(4.86)
500 | .219(3.60) | .391(7.25) | .115(.95) | .145(1.49)
1,000 | .140(.99) | .190(2.53) | .082(.72) | .107(1.01)
2,000 | .128(.82) | .175(1.75) | .073(.65) | .101(.88)

Table B.2: Averages of (approximate) true type II errors over 1,000 repetitions for Simulation 6. Standard errors (×10⁻³) in parentheses.

In this simulation, Algorithm 1# fails to control the type I error with the pre-specified high probability. Similar results for additional distributional settings can be found in Simulation 7 of Appendix B.4.1. One might wonder: if we were to use other estimators of $m_0$ and $m_1$, would the result be different? The answer is that the usually "good" estimators do not serve the purpose of high probability control of the type I error. For example, Simulation 8 in Appendix B.4.1 uses consistent and unbiased estimators of $m_0$ and $m_1$, but Algorithm 1# still fails to control the type I error.

Simulation 7. The distributional setting is the same as in Simulation 1. For different $(m_0,m_1,\alpha,\delta)$ combinations, the (approximate) type I error violation rates generated by Algorithm 1# are reported in Table B.3.

N | m0=.95, m1=.05, α=.05, δ=.05 | m0=.9, m1=.1, α=.05, δ=.05 | m0=.95, m1=.05, α=.1, δ=.1 | m0=.9, m1=.1, α=.1, δ=.1
200 | .430(15.66) | .512(15.81) | .530(15.79) | .504(15.82)
500 | .694(14.58) | .488(15.81) | .758(13.55) | .570(15.66)
1,000 | .940(7.51) | .788(13.47) | .953(6.70) | .805(12.54)
2,000 | .950(6.90) | .792(12.80) | .957(6.42) | .818(12.21)

Table B.3: (Approximate) type I error violation rates over 1,000 repetitions for Simulation 7. Standard errors (×10⁻³) in parentheses.

Simulation 8. The distributional setting is the same as in Simulation 1. The $m_0^\#$ and $m_1^\#$ are generated from $N(m_0,1/N)$ and $N(m_1,1/N)$, respectively.
The (approximate) type I error violation rates generated by Algorithm 1# for one combination ($m_0=.95$, $m_1=.05$, $\alpha=.1$ and $\delta=.1$) are reported in Table B.4.

B.4.2 CIFAR10 data analysis

In this section we apply Algorithm 1 to the CIFAR10 dataset [63]. As we focus on binary classification problems, we merge the ten categories of the CIFAR10 dataset into two: "vehicles" and "non-vehicles." The class "vehicles," encoded as 0, contains the original "automobile" and "truck" classes, and the class "non-vehicles," encoded as 1, contains the other eight original classes. The type I/II errors are then defined accordingly. We employ the NP paradigm on this modified dataset to prioritize control over the chance of failing to detect vehicles.

The original CIFAR10 dataset has pre-specified training and test sets, but the number of class 0 observations in the test set is too small (2,000 in total) to produce a reliable approximation to the population-level type I error. Furthermore, given that the train-test procedure has to be repeated multiple times to approximate the type I error violation rate, a fixed train-test split throughout all repetitions does not serve our purpose. As such, we perform stratified splits of the whole modified CIFAR10 dataset (with the newly assigned labels). In particular, 20% of the true class 0 observations and 20% of the true class 1 observations are randomly selected to form the new training set, and the remaining observations form the evaluation set. The training and evaluation sets contain 12,000 and 48,000 observations, respectively. Moreover, the labels of all training observations are artificially corrupted by the same method as in Section 3.5.2 with $m_0=0.95$ and $m_1=0.05$. By the Bayes theorem, the flip rates $r_0=\mathrm{IP}(\tilde Y=1\mid Y=0)$ and $r_1=\mathrm{IP}(\tilde Y=0\mid Y=1)$ are 0.2083 and 0.0104, respectively.
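The flip-rate values quoted above can be recovered from the corruption levels with a short Bayes-theorem computation. The sketch below is illustrative (the variable names are ours); it uses the merged-class prior $\mathrm{IP}(Y=0)=0.2$, since two of the ten CIFAR10 classes form the "vehicles" class:

```python
# Corruption levels: m0 = P(Y=0 | Y~=0), m1 = P(Y=0 | Y~=1),
# and the clean class-0 prior P(Y=0) = 2/10 for the merged dataset.
m0, m1, p_y0 = 0.95, 0.05, 0.2

# P(Y~=0) solves m0 * q + m1 * (1 - q) = P(Y=0).
q = (p_y0 - m1) / (m0 - m1)            # P(Y~=0) = 1/6

# Flip rates by the Bayes theorem.
r0 = m1 * (1 - q) / p_y0               # P(Y~=1 | Y=0)
r1 = (1 - m0) * q / (1 - p_y0)         # P(Y~=0 | Y=1)
print(round(r0, 4), round(r1, 4))      # matches 0.2083 and 0.0104
```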
We apply Algorithm 1 (with the parameter choice $\alpha=\delta=0.1$ and a CNN as the base algorithm) to the training set with corrupted labels and obtain a classifier. Then, the classifier is applied to the untouched evaluation set to calculate the (approximate) true type I and II errors. This procedure is repeated 1,000 times.

N | (approximate) violation rate
200 | .193(1.25)
500 | .208(1.28)
1,000 | .186(1.23)
2,000 | .203(1.27)

Table B.4: (Approximate) type I error violation rates over 1,000 repetitions for Simulation 8. Standard errors (×10⁻²) in parentheses.

In the main text, we have shown that Algorithm 1# with under-estimates of the corruption levels fulfills the goal of high-probability control over the type I error, while other benchmark algorithms do not. To avoid delivering redundant messages, we only apply Algorithm 1 to the modified CIFAR10 dataset, since our primary interest is the type I error control. The (approximate) type I error violation rate and the average of the (approximate) true type II errors are presented in Table B.5. Clearly, Algorithm 1 is able to achieve high probability control of the true type I error under the specified level.

 | (approximate) violation rate | average of (approximate) true type II errors
Algorithm 1 with CNN as base algorithm | .099(9.45) | .150(.87)

Table B.5: (Approximate) type I error violation rate, and average of (approximate) true type II errors, by Algorithm 1 over 1,000 repetitions for the modified CIFAR10 dataset. Standard errors (×10⁻³) in parentheses.

B.4.3 Violin plots for Section 3.5

In this section, we present the violin plots (Figures B.1-B.6) for Simulations 1-3 in Section 3.5. The violin plots for the (approximate) true type I and type II errors over these 1,000 repetitions are plotted for each $(m_0,m_1,\alpha,\delta)$ combination.
Take Figures B.1 and B.2 as an example: the two rows in each figure respectively correspond to the settings $m_0 = 0.95, m_1 = 0.05$ and $m_0 = 0.90, m_1 = 0.10$, while the two columns respectively correspond to $\alpha = 0.05, \delta = 0.05$ and $\alpha = 0.10, \delta = 0.10$. In every plot, the lighter-colored area represents true type I errors above the $1 - \delta$ quantile, while the darker-colored area represents true type I errors below the $1 - \delta$ quantile. The black dots represent the averages of the true type I/II errors, and the bars above and below the dots represent standard deviations.

[Figure B.1: Violin plots for (approximate) true type I errors of Simulation 1. Panels compare the adjusted and original algorithms across training sample sizes N.]

[Figure B.2: Violin plots for (approximate) true type II errors of Simulation 1. Panels compare the adjusted and original algorithms across training sample sizes N.]
[Figure B.3: Violin plots for (approximate) true type I errors of Simulation 2.]

[Figure B.4: Violin plots for (approximate) true type II errors of Simulation 2.]
[Figure B.5: Violin plots for (approximate) true type I errors of Simulation 3.]

[Figure B.6: Violin plots for (approximate) true type II errors of Simulation 3.]

B.4.4 Tables for Section 3.5

In this section, we present Table B.6 for Simulation 5 in Section 3.5.1 and Table B.7 for the email spam data analysis in Section 3.5.2.

  algorithms                                            N = 200      N = 500      N = 1,000    N = 2,000
  T-revision                                            .165 (4.32)  .153 (4.08)  .146 (3.52)  .147 (4.27)
  backward loss correction (known corruption level)     .151 (.77)   .139 (.70)   .161 (.71)   .199 (.69)
  backward loss correction (unknown corruption level)   .158 (.02)   .163 (.02)   .186 (.01)   .192 (.01)

Table B.6: Averages of (approximate) true type II errors over 1,000 repetitions for Simulation 5 ($m_0 = .95$, $m_1 = .05$, $\alpha = .1$ and $\delta = .1$). Standard errors ($\times 10^{-3}$) in parentheses.
  algorithms                                            (approximate)     average of (approximate)
                                                        violation rate    true type II errors
  T-revision                                            .829 (11.91)      .414 (7.01)
  backward loss correction (known corruption level)     .831 (11.86)      .573 (5.49)
  backward loss correction (unknown corruption level)   .750 (13.70)      .631 (5.51)

Table B.7: (Approximate) type I error violation rates, and averages of (approximate) true type II errors by benchmark algorithms over 1,000 repetitions for the email spam data. Standard errors ($\times 10^{-3}$) in parentheses.

B.4.5 Alternative implementation with a positive $\varepsilon$

In this section, we repeat the numerical studies for Simulations 1-3 in Section 3.5 but replace $k_*$ in Algorithm 1 by
\[
\min\left\{ k \in \{1, \ldots, n\} : \alpha_{k,\delta} - \hat{D}^{+}(t_{(k)}) \leq \alpha - \varepsilon \right\},
\]
where $\varepsilon = 0.0001$. The results are presented in Figures B.7 - B.12. The numerical evidence shows that including a small positive $\varepsilon$ in the selection of $k_*$ does not materially affect the performance of the label-noise-adjusted umbrella algorithm. Thus, as a simpler algorithm is always preferred, we recommend taking $\varepsilon = 0$.

B.5 Extra Lemmas

Lemma 6. Under Assumption 3.2, for any measurable function $T : \mathbb{R}^d \to \mathbb{R}$ and arbitrary number $z \in \mathbb{R}$, we have
\[
\tilde{F}^{T}_{0}(z) = m_0 F^{T}_{0}(z) + (1 - m_0) F^{T}_{1}(z) \quad \text{and} \quad \tilde{F}^{T}_{1}(z) = m_1 F^{T}_{0}(z) + (1 - m_1) F^{T}_{1}(z).
\]

[Figure B.7: Violin plots for (approximate) true type I errors of Simulation 1.]
[Figure B.8: Violin plots for (approximate) true type II errors of Simulation 1.]

[Figure B.9: Violin plots for (approximate) true type I errors of Simulation 2.]
[Figure B.10: Violin plots for (approximate) true type II errors of Simulation 2.]

[Figure B.11: Violin plots for (approximate) true type I errors of Simulation 3.]
[Figure B.12: Violin plots for (approximate) true type II errors of Simulation 3.]

Furthermore,
\[
\mathbb{E}\tilde{X}_0 = m_0 \mathbb{E}X_0 + (1 - m_0)\mathbb{E}X_1 \quad \text{and} \quad \mathbb{E}\tilde{X}_1 = m_1 \mathbb{E}X_0 + (1 - m_1)\mathbb{E}X_1.
\]

Proof. The first two equations can be proved in the same way, so we only show the first one. By Assumption 3.2, for any Borel set $A$,
\[
\tilde{P}_0(T^{-1}(A)) = m_0 P_0(T^{-1}(A)) + (1 - m_0) P_1(T^{-1}(A)).
\]
Selecting $A = (-\infty, z]$ yields the result. Similarly, the proofs of the last two equations parallel each other, so we only show $\mathbb{E}\tilde{X}_0 = m_0 \mathbb{E}X_0 + (1 - m_0)\mathbb{E}X_1$. Note that by Assumption 3.2,
\[
\begin{aligned}
\mathbb{E}\tilde{X}_0 &= \int_0^{\infty} \{1 - \tilde{P}_0(X \leq x)\}\,dx - \int_{-\infty}^{0} \tilde{P}_0(X \leq x)\,dx \\
&= m_0 \left[ \int_0^{\infty} \{1 - P_0(X \leq x)\}\,dx - \int_{-\infty}^{0} P_0(X \leq x)\,dx \right] + (1 - m_0) \left[ \int_0^{\infty} \{1 - P_1(X \leq x)\}\,dx - \int_{-\infty}^{0} P_1(X \leq x)\,dx \right] \\
&= m_0 \mathbb{E}X_0 + (1 - m_0)\mathbb{E}X_1.
\end{aligned}
\]

Lemma 7. For any $k \in \{1, \ldots, n\}$ and $\delta \in (0, 1)$, a unique $\alpha_{k,\delta}$ exists. Moreover, under Assumption 3.4.1, $k^* = \min\{k \in \{1, \ldots, n\} : \alpha_{k,\delta} \leq \alpha\}$.

Proof. Let $h_k(x) = \sum_{j=k}^{n} \binom{n}{j} x^{n-j} (1-x)^{j}$ for any $k \in \{1, \ldots, n\}$. Then one can show that, for $k \leq n - 1$ and $x \in (0, 1)$,
\[
\begin{aligned}
h_k'(x) &= \sum_{j=k}^{n-1} (n - j) \binom{n}{j} x^{n-j-1} (1-x)^{j} - \sum_{j=k}^{n} j \binom{n}{j} x^{n-j} (1-x)^{j-1} \\
&= n \sum_{i=k+1}^{n} \binom{n-1}{i-1} x^{n-i} (1-x)^{i-1} - n \sum_{j=k}^{n} \binom{n-1}{j-1} x^{n-j} (1-x)^{j-1} \\
&= -n \binom{n-1}{k-1} x^{n-k} (1-x)^{k-1},
\end{aligned}
\]
which is negative. Thus, $h_k(x)$ is strictly decreasing on $(0, 1)$ for $k \leq n - 1$. Furthermore, $h_n(x) = (1-x)^{n}$, which is also strictly decreasing on $(0, 1)$.
Since for any $k$, $h_k(0) = 1$ and $h_k(1) = 0$, there exists a unique $\alpha_{k,\delta}$ such that $h_k(\alpha_{k,\delta}) = \delta$.

Recall that $k^*$ is defined as the smallest $k$ such that $h_k(\alpha) \leq \delta$. Meanwhile, by monotonicity, for any $k$, the inequality $h_k(\alpha) \leq \delta$ is equivalent to $\alpha_{k,\delta} \leq \alpha$. Assumption 3.4.1 guarantees the existence of a $k$ such that $h_k(\alpha) \leq \delta$; therefore it also guarantees the existence of a $k$ such that $\alpha_{k,\delta} \leq \alpha$. Then, for any $\delta$, $\{k \in \{1, \ldots, n\} : h_k(\alpha) \leq \delta\} = \{k \in \{1, \ldots, n\} : \alpha_{k,\delta} \leq \alpha\}$, and hence $k^* = \min\{k \in \{1, \ldots, n\} : \alpha_{k,\delta} \leq \alpha\}$.

Lemma 8. Given a random variable $X \in \mathbb{R}^d$ with probability measure $P$ and a deterministic measurable function $T : \mathbb{R}^d \to \mathbb{R}$, define the probability measure $P^{T}(B) = P(T(X) \in B)$ for any Borel set $B$. Furthermore, denote the distribution functions of $P$ and $P^{T}$ by $F$ and $F^{T}$, respectively. Let $X_1, X_2, \ldots, X_n \sim X$ be i.i.d. random variables, and let $\hat{F}^{T}(z) = \frac{1}{n}\sum_{j=1}^{n} \mathbb{I}\{T(X_j) \leq z\}$ for any $z \in \mathbb{R}$. Then, for any $\xi > 0$,
\[
P\left( \sup_{z \in \mathbb{R}} \left| \hat{F}^{T}(z) - F^{T}(z) \right| > \xi \right) \leq 2 e^{-2 n \xi^2}.
\]

Proof. Note that $X_1, X_2, \ldots, X_n$ are i.i.d. random variables, and so are $T(X_1), T(X_2), \ldots, T(X_n)$. Denote $T_j = T(X_j)$; then $T_j$ has the probability measure $P^{T}$. The Dvoretzky-Kiefer-Wolfowitz inequality says
\[
P^{T}\left( \sup_{z \in \mathbb{R}} \left| \frac{1}{n}\sum_{j=1}^{n} \mathbb{I}\{T_j \leq z\} - F^{T}(z) \right| > \xi \right) \leq 2 e^{-2 n \xi^2}.
\]
Then, it suffices to show that the left-hand side of the above inequality equals $P\left( \sup_{z \in \mathbb{R}} \left| \hat{F}^{T}(z) - F^{T}(z) \right| > \xi \right)$. Towards that, denote
\[
f_n(x_1, x_2, \ldots, x_n) = \mathbb{I}\left\{ \sup_{z \in \mathbb{R}} \left| \frac{1}{n}\sum_{j=1}^{n} \mathbb{I}\{T(x_j) \leq z\} - F^{T}(z) \right| > \xi \right\},
\]
and
\[
f_0(t_1, t_2, \ldots, t_n) = \mathbb{I}\left\{ \sup_{z \in \mathbb{R}} \left| \frac{1}{n}\sum_{j=1}^{n} \mathbb{I}\{t_j \leq z\} - F^{T}(z) \right| > \xi \right\}.
\]
By Fubini's theorem, it holds that
\[
P\left( \sup_{z \in \mathbb{R}} \left| \hat{F}^{T}(z) - F^{T}(z) \right| > \xi \right) = \mathbb{E}_1 \mathbb{E}_2 \cdots \mathbb{E}_n f_n(X_1, X_2, \ldots, X_n),
\]
and
\[
P^{T}\left( \sup_{z \in \mathbb{R}} \left| \frac{1}{n}\sum_{j=1}^{n} \mathbb{I}\{T_j \leq z\} - F^{T}(z) \right| > \xi \right) = \mathbb{E}^{T}_1 \mathbb{E}^{T}_2 \cdots \mathbb{E}^{T}_n f_0(T_1, T_2, \ldots, T_n),
\]
where $\mathbb{E}_j$ and $\mathbb{E}^{T}_j$ are the expectations taken with respect to $X_j$ and $T_j$ under the probability measures $P$ and $P^{T}$, respectively.
Thus, it suffices to show
\[
\mathbb{E}_1 \mathbb{E}_2 \cdots \mathbb{E}_n f_n(X_1, X_2, \ldots, X_n) = \mathbb{E}^{T}_1 \mathbb{E}^{T}_2 \cdots \mathbb{E}^{T}_n f_0(T_1, T_2, \ldots, T_n),
\]
and we will show this by induction. Denote
\[
f_l(x_1, \ldots, x_l, t_{l+1}, \ldots, t_n) = \mathbb{I}\left\{ \sup_{z \in \mathbb{R}} \left| \frac{1}{n}\left( \sum_{j=1}^{l} \mathbb{I}\{T(x_j) \leq z\} + \sum_{j=l+1}^{n} \mathbb{I}\{t_j \leq z\} \right) - F^{T}(z) \right| > \xi \right\}
\]
for any $l \in \{1, 2, \ldots, n-1\}$, and $A_{n-1}(x_1, \ldots, x_{n-1}) = \{t_n : f_{n-1}(x_1, \ldots, x_{n-1}, t_n) = 1\}$. Then, for any fixed values of $x_1, \ldots, x_{n-1}$,
\[
\mathbb{E}_n f_n(x_1, \ldots, x_{n-1}, X_n) = P\left(T(X_n) \in A_{n-1}(x_1, \ldots, x_{n-1})\right) = P^{T}\left(A_{n-1}(x_1, \ldots, x_{n-1})\right) = \mathbb{E}^{T}_n f_{n-1}(x_1, \ldots, x_{n-1}, T_n),
\]
and thus,
\[
\mathbb{E}_1 \mathbb{E}_2 \cdots \mathbb{E}_n f_n(X_1, \ldots, X_n) = \mathbb{E}_1 \mathbb{E}_2 \cdots \mathbb{E}_{n-1} \mathbb{E}^{T}_n f_{n-1}(X_1, \ldots, X_{n-1}, T_n).
\]
Now, assume that for some $l \in \{2, 3, \ldots, n\}$,
\[
\mathbb{E}_1 \mathbb{E}_2 \cdots \mathbb{E}_n f_n(X_1, \ldots, X_n) = \mathbb{E}_1 \mathbb{E}_2 \cdots \mathbb{E}_{l-1} \mathbb{E}^{T}_{l} \mathbb{E}^{T}_{l+1} \cdots \mathbb{E}^{T}_n f_{l-1}(X_1, \ldots, X_{l-1}, T_l, T_{l+1}, \ldots, T_n).
\]
Then, for any fixed values of $x_1, \ldots, x_{l-2}$, denote
\[
A_{l-2}(x_1, \ldots, x_{l-2}) = \left\{ t_{l-1} : \mathbb{E}^{T}_{l} \mathbb{E}^{T}_{l+1} \cdots \mathbb{E}^{T}_n f_{l-2}(x_1, \ldots, x_{l-2}, t_{l-1}, T_l, \ldots, T_n) = 1 \right\};
\]
we then have
\[
\mathbb{E}_{l-1} \mathbb{E}^{T}_{l} \mathbb{E}^{T}_{l+1} \cdots \mathbb{E}^{T}_n f_{l-1}(x_1, \ldots, x_{l-2}, X_{l-1}, T_l, T_{l+1}, \ldots, T_n) = P\left(T(X_{l-1}) \in A_{l-2}(x_1, \ldots, x_{l-2})\right) = P^{T}\left(A_{l-2}(x_1, \ldots, x_{l-2})\right),
\]
and thus,
\[
\mathbb{E}_{l-1} \mathbb{E}^{T}_{l} \cdots \mathbb{E}^{T}_n f_{l-1}(x_1, \ldots, x_{l-2}, X_{l-1}, T_l, \ldots, T_n) = \mathbb{E}^{T}_{l-1} \mathbb{E}^{T}_{l} \cdots \mathbb{E}^{T}_n f_{l-2}(x_1, \ldots, x_{l-2}, T_{l-1}, \ldots, T_n).
\]
Therefore, by the induction hypothesis, we have
\[
\mathbb{E}_1 \mathbb{E}_2 \cdots \mathbb{E}_n f_n(X_1, \ldots, X_n) = \mathbb{E}_1 \mathbb{E}_2 \cdots \mathbb{E}_{l-2} \mathbb{E}^{T}_{l-1} \mathbb{E}^{T}_{l} \cdots \mathbb{E}^{T}_n f_{l-2}(X_1, \ldots, X_{l-2}, T_{l-1}, T_l, \ldots, T_n).
\]
We conclude the proof by induction.

B.6 Proofs

Proof of Lemma 1. Let us focus on the event in the statement of Assumption 3.4.1, whose complement holds with probability at most $\delta_1(n_b)$.
Meanwhile, by Lemma 6, for any $z \in \mathbb{R}$,
\[
\tilde{F}^{\hat{T}}_{0}(z) - \tilde{F}^{\hat{T}}_{1}(z) = \left[ m_0 F^{\hat{T}}_{0}(z) + (1 - m_0) F^{\hat{T}}_{1}(z) \right] - \left[ m_1 F^{\hat{T}}_{0}(z) + (1 - m_1) F^{\hat{T}}_{1}(z) \right] = (m_0 - m_1)\left\{ F^{\hat{T}}_{0}(z) - F^{\hat{T}}_{1}(z) \right\}.
\]
Furthermore, for any classifier $\phi(X) = \mathbb{I}\{\hat{T}(X) > z\}$,
\[
\tilde{R}_0(\phi) - R_0(\phi) = \left\{ 1 - \tilde{F}^{\hat{T}}_{0}(z) \right\} - \left\{ 1 - F^{\hat{T}}_{0}(z) \right\} = F^{\hat{T}}_{0}(z) - m_0 F^{\hat{T}}_{0}(z) - (1 - m_0) F^{\hat{T}}_{1}(z) = (1 - m_0)\left\{ F^{\hat{T}}_{0}(z) - F^{\hat{T}}_{1}(z) \right\},
\]
which is positive by Assumption 3.4.1. Now, let $D(z) = \tilde{R}_0(\phi) - R_0(\phi) > 0$; then $R_0(\hat{\phi}_{k^*}) > \alpha - D(t_{(k^*)})$ is equivalent to $\tilde{R}_0(\hat{\phi}_{k^*}) > \alpha$, whose probability is at most $\delta$ by Proposition 1. To this end, we have shown
\[
\mathbb{P}\left( R_0(\hat{\phi}_{k^*}) > \alpha - D(t_{(k^*)}) \right) \leq \delta + \delta_1(n_b).
\]

Proof of Lemma 2. By Lemma 7, the set $\{k \in \{1, \ldots, n\} : \alpha_{k,\delta} \leq \alpha\}$ is non-empty. Then the set $\{k \in \{1, \ldots, n\} : \alpha_{k,\delta} - \hat{D}^{+}(t_{(k)}) \leq \alpha\}$ is non-empty since $\hat{D}^{+}(t_{(k)})$ is non-negative, and hence $k_* = \min\{k \in \{1, \ldots, n\} : \alpha_{k,\delta} - \hat{D}^{+}(t_{(k)}) \leq \alpha\}$ exists. Note that $k^* = \min\{k \in \{1, \ldots, n\} : \alpha_{k,\delta} \leq \alpha\}$ by Lemma 7. Since $\{k \in \{1, \ldots, n\} : \alpha_{k,\delta} \leq \alpha\}$ is a subset of $\{k \in \{1, \ldots, n\} : \alpha_{k,\delta} - \hat{D}^{+}(t_{(k)}) \leq \alpha\}$ by the non-negativity of $\hat{D}^{+}$, it can be concluded that $k_* \leq k^*$.

Proof of Lemma 3. Assumption 3.2 implies
\[
0 \leq M^{\#} := \frac{1 - m^{\#}_0}{m^{\#}_0 - m^{\#}_1} \leq M := \frac{1 - m_0}{m_0 - m_1},
\]
and thus $0 \leq \hat{D}^{+}_{\#}(c) \leq \hat{D}^{+}(c)$. Then $\{k \in \{1, \ldots, n\} : \alpha_{k,\delta} \leq \alpha\}$, which is non-empty by Assumption 3.4.1, is a subset of $\{k \in \{1, \ldots, n\} : \alpha_{k,\delta} - \hat{D}^{+}_{\#}(t_{(k)}) \leq \alpha\}$. This implies that $k_{*}^{\#}$ exists and is smaller than or equal to $k^*$. Furthermore, $\{k \in \{1, \ldots, n\} : \alpha_{k,\delta} - \hat{D}^{+}_{\#}(t_{(k)}) \leq \alpha\}$ is also a subset of $\{k \in \{1, \ldots, n\} : \alpha_{k,\delta} - \hat{D}^{+}(t_{(k)}) \leq \alpha\}$, and thus $k_{*}^{\#} = \min\{k \in \{1, \ldots, n\} : \alpha_{k,\delta} - \hat{D}^{+}_{\#}(t_{(k)}) \leq \alpha\}$ is larger than or equal to $k_*$.

Proof of Theorem 9. Let us focus on the event where the statements of both Assumptions 3.4.1 and 3.4.2 hold, whose complement has probability less than $\delta_1(n_b) + \delta_2(n_b)$. Then, let
\[
B_e = \left\{ \sup_{z \in \mathbb{R}} \left| \hat{D}(z) - D(z) \right| \leq 2^{-1}\varepsilon \right\}.
\]
It follows from Lemma 8 that
\[
\mathbb{P}(B^{c}_{e}) \leq \mathbb{P}\left( \sup_{z \in \mathbb{R}} \left| \hat{\tilde{F}}^{\hat{T}}_{0}(z) - \tilde{F}^{\hat{T}}_{0}(z) \right| > \frac{\varepsilon}{4M} \right) + \mathbb{P}\left( \sup_{z \in \mathbb{R}} \left| \hat{\tilde{F}}^{\hat{T}}_{1}(z) - \tilde{F}^{\hat{T}}_{1}(z) \right| > \frac{\varepsilon}{4M} \right) \leq 2 e^{-8^{-1} n^{e}_{0} M^{-2} \varepsilon^{2}} + 2 e^{-8^{-1} n^{e}_{1} M^{-2} \varepsilon^{2}}.
\]
Note that since $D(z)$ is non-negative by Lemma 1, $\left| \hat{D}^{+}(z) - D(z) \right| \leq \left| \hat{D}(z) - D(z) \right| \leq 2^{-1}\varepsilon$ on $B_e$. So, one can conclude that on the event $B_e$, $k_*$ is chosen from all $k$ such that $\alpha_{k,\delta} - D(t_{(k)}) \leq \alpha - 2^{-1}\varepsilon$. Furthermore, denote $c_k = \inf\{y : \tilde{F}^{\hat{T}}_{0}(y) \geq k n^{-1}\}$ and $k_0 = \min\{k \in \{1, \ldots, n\} : \alpha_{k,\delta} - D(c_k) \leq \alpha - 4^{-1}\varepsilon\}$. Note that since $\mathcal{D}_T$ is a closed interval, $c_k$ is well-defined. Let $\tilde{F}^{\hat{T}}_{n}$ be the empirical distribution induced by $\mathcal{T}_t$, i.e., for any $z \in \mathbb{R}$,
\[
\tilde{F}^{\hat{T}}_{n}(z) = \frac{1}{n} \sum_{t \in \mathcal{T}_t} \mathbb{I}\{t \leq z\}.
\]
Denote $B_t = \left\{ \sup_{z \in \mathbb{R}} \left| \tilde{F}^{\hat{T}}_{n}(z) - \tilde{F}^{\hat{T}}_{0}(z) \right| \leq 4^{-1} M^{-1} C^{-1} c \varepsilon \right\}$. Then, by Lemma 8, $\mathbb{P}(B^{c}_{t}) \leq 2 e^{-8^{-1} n M^{-2} C^{-2} c^{2} \varepsilon^{2}}$. So, it remains to show that the probability of the true type I error exceeding $\alpha$ is bounded by $\delta$ on the set $B_t \cap B_e$. Thus, until the end of the proof, we focus on the intersection of both sets.

Note that we have $\tilde{F}^{\hat{T}}_{n}(t_{(k)}) = k n^{-1}$. Then, a Taylor expansion implies
\[
\tilde{F}^{\hat{T}}_{n}(t_{(k)}) - \tilde{F}^{\hat{T}}_{0}(t_{(k)}) = \tilde{F}^{\hat{T}}_{n}(t_{(k)}) - \tilde{F}^{\hat{T}}_{0}(c_k) - \tilde{f}^{\hat{T}}_{0}(c^{*}_{k})(t_{(k)} - c_k) = -\tilde{f}^{\hat{T}}_{0}(c^{*}_{k})(t_{(k)} - c_k),
\]
where $c^{*}_{k}$ lies between $c_k$ and $t_{(k)}$. The above equation then implies $\left| t_{(k)} - c_k \right| \leq 4^{-1} M^{-1} C^{-1} \varepsilon$ for any $k$, according to the lower bound provided by Assumption 3.4.1. Furthermore, $D(t_{(k)}) - D(c_k) = M\{\tilde{f}^{\hat{T}}_{0}(c^{**}_{k}) - \tilde{f}^{\hat{T}}_{1}(c^{**}_{k})\}(t_{(k)} - c_k)$ for some $c^{**}_{k}$ between $c_k$ and $t_{(k)}$. Therefore, Assumption 3.4.1 implies $\left| D(t_{(k)}) - D(c_k) \right| \leq 4^{-1}\varepsilon$.

Suppose $k_* = k'$; then
\[
\alpha_{k',\delta} - D(c_{k'}) \leq \alpha_{k',\delta} - D(t_{(k')}) + 4^{-1}\varepsilon \leq \alpha - 4^{-1}\varepsilon,
\]
and thus $k_* \geq k_0$. Furthermore, we also have
\[
\alpha_{k_0,\delta} - D(t_{(k_0)}) \leq \alpha_{k_0,\delta} - D(c_{k_0}) + 4^{-1}\varepsilon \leq \alpha.
\]
Recall that $D(t_{(k_0)}) = \tilde{R}_0(\hat{\phi}_{k_0}) - R_0(\hat{\phi}_{k_0})$.
Therefore, $R_0(\hat{\phi}_{k_0}) > \alpha$ implies $\tilde{R}_0(\hat{\phi}_{k_0}) > \alpha_{k_0,\delta}$, whose probability is bounded by $\delta$ by Proposition 1.

Proof of Corollary 3.4.3. By Lemma 3, $k_{*}^{\#} \geq k_*$ and thus $t_{(k_*)} \leq t_{(k_{*}^{\#})}$. Therefore, $R_0(\hat{\phi}_{k_*}) \geq R_0(\hat{\phi}_{k_{*}^{\#}})$. Combined with Theorem 9, the previous result yields
\[
\mathbb{P}\left( R_0(\hat{\phi}_{k_{*}^{\#}}) > \alpha \right) \leq \mathbb{P}\left( R_0(\hat{\phi}_{k_*}) > \alpha \right) \leq \delta + \delta_1(n_b) + \delta_2(n_b) + 2 e^{-8^{-1} n M^{-2} C^{-2} c^{2} \varepsilon^{2}} + 2 e^{-8^{-1} n^{e}_{0} M^{-2} \varepsilon^{2}} + 2 e^{-8^{-1} n^{e}_{1} M^{-2} \varepsilon^{2}}.
\]

Appendix C

Appendix for A Burden Shared is a Burden Halved: A Fairness-Adjusted Approach to Classification

This supplement provides additional technical results (Section C.1), proofs of all theorems (Sections C.2-C.3) and additional numerical results (Section C.4).

C.1 Supplementary Technical Results

C.1.1 The conversion algorithm and connections to Storey's q-value

This section presents a general strategy for converting an arbitrary score $S^c(x, a)$ to a fair score $R^c(S^c)$, which is referred to as the theoretical R-value. Although the discussion is purely theoretical, it shows that for every base score there always exists a corresponding fair score. The algorithm can be viewed as a method of calibration by group.

Consider the random mixture model (16). The conversion algorithm consists of three steps. In Step 1, we find the distributional information with respect to the score $S^c$. Let $G^c(s) = \sum_{a \in \mathcal{A}} \pi_a G^{c}_{a}(s)$ be the CDF of $S^c$, where $G^{c}_{a}(s) = \pi_{1|a} G^{c}_{1,a}(s) + \pi_{2|a} G^{c}_{2,a}(s)$; here $G^{c}_{1,a}(s)$ and $G^{c}_{2,a}(s)$ are the conditional CDFs of $S^c$ given $A = a$ and $Y$. Suppose an oracle knows the conditional probabilities and conditional CDFs defined above. In Step 2, we compute the conditional error probability when the threshold is $t$:
\[
Q^{c}_{a}(t) = P(Y \neq c \mid S^c \geq t, A = a) =
\begin{cases}
\dfrac{\pi_{2|a}\{1 - G^{1}_{2,a}(t)\}}{1 - G^{1}_{a}(t)}, & c = 1, \\[2ex]
\dfrac{\pi_{1|a}\{1 - G^{2}_{1,a}(t)\}}{1 - G^{2}_{a}(t)}, & c = 2.
\end{cases}
\]
Finally, in Step 3 we compute a fair score, referred to as the theoretical R-value, for an individual from group $a$ with $S^c = s^c$:
\[
R^{c}(s^c) = \inf_{t \leq s^c} \{ Q^{c}_{a}(t) \}.
\]
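The three-step conversion can be illustrated with a concrete oracle model. The snippet below is purely illustrative (none of the distributional choices come from the text): a single group $a$ with $\pi_{2|a} = 0.3$, class-1 scores $N(1, 1)$, and class-2 scores $N(-1, 1)$. It computes $Q^{1}_{a}(t)$ in closed form and approximates the theoretical R-value by a running infimum over a fine threshold grid.

```python
from math import erf, sqrt

def Phi(x):
    # Standard normal CDF.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Illustrative oracle model (assumed, not from the text): within group a,
# pi_{2|a} = 0.3, G_{1,a}(s) = Phi(s - 1), G_{2,a}(s) = Phi(s + 1).
PI2 = 0.3

def Q(t):
    # Q_a^1(t) = P(Y != 1 | S^1 >= t, A = a) under the oracle model above.
    num = PI2 * (1.0 - Phi(t + 1.0))
    den = (1.0 - PI2) * (1.0 - Phi(t - 1.0)) + PI2 * (1.0 - Phi(t + 1.0))
    return num / den

def R(s, grid_lo=-8.0, steps=2000):
    # Theoretical R-value: infimum of Q over thresholds t <= s,
    # approximated on a grid whose last point is exactly t = s.
    return min(Q(grid_lo + i * (s - grid_lo) / steps) for i in range(steps + 1))
```

As $t \to -\infty$ everyone is selected and $Q(t) \to \pi_{2|a}$, while larger base scores earn (weakly) smaller R-values; the running infimum is what makes the fair score monotone in the base score.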
The theoretical R-value is closely related to the q-value [94], which provides a useful tool in large-scale testing problems due to its intuitive interpretation and easy operation. Suppose we are interested in testing hypotheses $\{H_j : 1 \leq j \leq m\}$ with associated p-values $\{p_j : 1 \leq j \leq m\}$. Let $\pi$ be the proportion of non-nulls and $G(t)$ the alternative distribution of p-values. Define the q-value for hypothesis $H_j$:
\[
q_j = \inf_{t \geq p_j} \mathrm{pFDR}(t) = \inf_{t \geq p_j} \frac{(1 - \pi)t}{(1 - \pi)t + \pi G(t)}. \tag{1}
\]
Roughly speaking, $q_j$ measures the fraction of false discoveries when $H_j$ is just rejected. The operations of the q-value and R-value algorithms are similar. Specifically, an FDR/FSR analysis at level $\alpha$ can be carried out in two steps: first obtain the q-value/R-value for hypothesis/individual $j$, and then reject/select hypothesis/individual $j$ if its q-value/R-value is less than $\alpha$.

C.1.2 FSR and mFSR

It can be shown that, under the random mixture model (16), $\mathrm{mFSR}^{c}_{a} = P(Y \neq c \mid \hat{Y} = c, A = a)$, the conditional probability needed in the sufficiency principle (18); see [94] for a similar result in the FDR literature. Following [20], we can similarly show that, under mild conditions, $\mathrm{FSR}^{c}_{a} = \mathrm{mFSR}^{c}_{a} + o(1)$ when $m_a := |\{1 \leq j \leq m : A_{n+j} = a\}| \to \infty$. This connection shows that, under the simplifying assumptions we have made, the two criteria, namely group-wise FSR control (6) and the sufficiency principle (18), are asymptotically equivalent.

C.1.3 R-value is the (BH) q-value of conformal p-values

In this section, we take a multiple testing point of view to provide further insights on the R-value. We show that the R-value, which was motivated from a very different perspective, can be derived as the q-value of the conformal p-values [6] under the one-class classification setting. To make different ideas comparable, we drop the sensitive attribute $A$ and focus on the unadjusted R-value defined in (13).
The conformal p-value [6] is developed under a different context for nonparametric outlier detection, which is also referred to as one-class classification in machine learning [74, 60, 59]. The basic setup is that one collects a training data set containing only the subjects of one class. Without loss of generality we label the class as "1" and denote the data set as $\mathcal{D}_1$. The goal is to detect "unusual" subjects in $\mathcal{D}^{\mathrm{test}}$, which can be a mixture of observations from class 1 and some other classes. The unusual subjects, referred to as outliers, are defined as subjects which exhibit differential characteristics from the inliers, i.e. subjects of class "1" in the "pure" training data set.

The one-class classification problem can be formulated under the selective inference framework discussed in Section 2.1. Suppose we observe data from two classes and divide the observed data into two subsets $\mathcal{D} = \{(X_i, Y_i) : 1 \leq i \leq n\} = \mathcal{D}_1 \cup \mathcal{D}_2$, with $\mathcal{D}_c = \{X_i : \text{subject } i \text{ is observed with label } Y = c\}$, $c = 1, 2$. Imagine that we simply discard $\mathcal{D}_2$ and focus on a one-class classification problem, where the goal is to detect outliers in $\mathcal{D}^{\mathrm{test}} = \{X_{n+j} : 1 \leq j \leq m\}$, which can be a mixture of observations from both classes. Viewing individuals in class "1" as the null cases, we can formulate an equivalent multiple testing problem:
\[
H_{j0} : Y_{n+j} = 1 \quad \text{vs.} \quad H_{j1} : Y_{n+j} \neq 1 \ (\text{i.e. } Y_{n+j} = 2), \qquad j = 1, \cdots, m.
\]
It is important to note that the practice of discarding $\mathcal{D}_2$ leads to information loss. We revisit this point at the end of the section.

The construction of conformal p-values involves a sample splitting step, which divides the observed data $\mathcal{D}_1$ into two parts: $\mathcal{D}^{\mathrm{train}}$ for training a score function $\hat{S}$ and $\mathcal{D}^{\mathrm{cal}}$ for calibrating a significance index. We view $\hat{S}(X)$ as a conformity score, which measures how different $X$ is from the inliers. The conformal p-value for testing $H_{j0}$, under our notational system¹, corresponds to
\[
\hat{p}(t) = \frac{\sum_{i \in \mathcal{D}^{\mathrm{cal}}} \mathbb{I}\{\hat{S}^{c}_{i} \geq t\} + 1}{n^{\mathrm{cal}} + 1}, \tag{2}
\]
where $c$ is taken to be 2 and $t$ is the observed score $\hat{S}^{c}_{n+j} = t$.
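Equation (2) is just a (shifted) rank of the test score among the calibration scores, so it is straightforward to compute. A minimal sketch, with made-up calibration scores rather than any data set used in the text:

```python
import numpy as np

def conformal_pvalue(cal_scores, t):
    # Conformal p-value of equation (2):
    # (#{calibration scores >= t} + 1) / (n_cal + 1).
    cal_scores = np.asarray(cal_scores)
    return (np.sum(cal_scores >= t) + 1) / (cal_scores.size + 1)

cal = np.arange(1, 100) / 100.0        # 99 toy calibration scores: 0.01, ..., 0.99
p = conformal_pvalue(cal, t=0.95)      # 5 calibration scores are >= 0.95
print(p)  # (5 + 1) / (99 + 1) = 0.06
```

When the test point is exchangeable with the calibration inliers, this p-value is super-uniform, which is what licenses applying the BH procedure to the conformal p-values.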
To see the connection of our R-value to the conformal p-value (2), recall Storey's q-value [93]:
\[
\hat{q}^{\mathrm{ST}}\{\hat{p}(t)\} = (1 - \pi)\,\hat{p}(t) / G\{\hat{p}(t)\},
\]
where $\pi$ is the proportion of non-null cases in $\mathcal{D}^{\mathrm{test}}$ and $G(\cdot)$ is the cumulative distribution function (CDF) of the p-values. Now let $\hat{G}(\cdot)$ denote the empirical process of the scores $\{\hat{S}_i : i \in \mathcal{D}^{\mathrm{test}}\}$:
\[
\hat{G}(t) = \frac{1}{n^{\mathrm{test}}} \sum_{i \in \mathcal{D}^{\mathrm{test}}} \mathbb{I}\{\hat{p}(S_i) \leq \hat{p}(t)\} = \frac{1}{n^{\mathrm{test}}} \sum_{i \in \mathcal{D}^{\mathrm{test}}} \mathbb{I}\{\hat{S}^{c}_{i} \geq t\}, \tag{3}
\]
where the last equality holds because, by (2), a larger score corresponds to a smaller conformal p-value. Next we consider a modification of Storey's q-value, referred to as the BH q-value, which ignores the $(1 - \pi)$ term and substitutes $\hat{G}$ in place of $G$ in Storey's q-value:
\[
\hat{q}^{\mathrm{BH}}(t) = \frac{\hat{p}(t)}{\hat{G}(t)}. \tag{4}
\]

¹We mention a minor point to avoid confusion. Under our setup, a larger score indicates a greater likelihood of being an outlier. This interpretation makes sense in our problem but is in the opposite direction compared to that in [6]. To make the two definitions equivalent, the expression "$S \leq t$" in the conformal p-value definition in [6] has been swapped to "$S \geq t$" in our equation (2).

The superscript "BH" is used because the thresholding rule
\[
\hat{\boldsymbol{Y}} = \left[ \mathbb{I}\left\{ \hat{q}^{\mathrm{BH}}(\hat{S}_{n+j}) \leq \alpha \right\} : 1 \leq j \leq m \right]
\]
Finally we point out that this fundamental connection only holds under the one-class classification setup. The BH q-value (5) will be different from the R-value (13) under the bi- naryclassificationsetup. Specifically,thecardinalitiesofthecalibrationsetswillbedifferent under the two setups, and the equality (6) does not hold. The conformal p-value approach by [6] can still be applied for selective inference under the binary classification setup if the goal is only to detect cases from class 2. However, the conformal p-value method utilizes fewer data points than our R-value approach since the data setD 2 has been discarded. This may lead to loss of information and decreased power. Related issues have gone beyond the scope of this work and will be pursued in future research. C.2 Proof of Theorem 1 C.2.1 Proof of Part (a) C.2.1.1 An empirical process description of the FASI algorithm Suppose we select subjects into class c if the base score S c is great than t. The estimated false discovery proportion (FSP), as a function of t, in group a is given by: ˆ Q c (t)= 1 n cal a +1 n P i∈D cal I( ˆ S c i ≥ t,Y i ̸=c,A i =a)+1 o 1 ma n P (n+j)∈D test I( ˆ S c n+j ≥ t,A j =a) o ∨1 . (1) We choose the smallest t such that the estimated FSP is less than α . Define τ = ˆ Q − 1 c (α )=inf n t: ˆ Q c (t)≤ α o . (2) 162 Consider the R-value defined in (10). For ( n+j) th observation in D test , it is easy to see that ˆ R c n+j = inf t≤ ˆ s n ˆ Q c (t) o , where ˆ s ˆ S c (X n+j = x,A n+j = a). The FASI algorithm can be represented in two equivalent ways: I( ˆ R c n+j ≤ α ) ⇐⇒ I( ˆ S c n+j ≤ τ ). (3) Next we turn to the description of the true FSP process of the FASI algorithm (3) via the representation via ˆ S c . Let V test (t) = X j∈D test I( ˆ S c j ≥ t,Y j ̸=c,A j =a), and R test (t) = X j∈D test I( ˆ S c j ≥ t,A j =a) be the number of false selections and total selections in D test when the threshold is t. 
Fur- thermore, denote V cal (t) = P i∈D cal I( ˆ S c i ≥ t,Y i ̸= c,A i = a) and R cal (t) = P i∈D cal I( ˆ S c i ≥ t,A i =a) the corresponding quantities inD cal . The FSP of the proposed FASI algorithm is given by FSP {c} a (τ )= V test (τ ) R test (τ )∨1 . The operation of the FASI algorithm implies that FSP {c} a (τ ) = V test (τ ) V cal (τ )+1 · V cal (τ )+1 R test (τ )∨1 = ˆ Q c (τ )· n cal a +1 m a · V test (τ ) V cal (τ )+1 ≤ α · n cal a +1 m a · V test (τ ) V cal (τ )+1 , where the last two steps utilize definitions (1) and (2), respectively. C.2.1.2 Martingale arguments A key step to establish the FSR control, i.e. E n FSP {c} a (τ ) o ≤ α , is to show that the ratio V test (t) V cal (t)+1 (4) is a martingale. Suppose that both the calibration and test data (without labels) have been given. It is natural to consider the following filtration that involves two parallel processes: F t = σ V test (s),V cal (s) t l ≤ s≤ t , where t l is lower limit of the threshold. If t l is used, then all subjects are classified to class c. 163 In our proof, we focus on the following discrete-time filtration that describes the misclas- sification process: F k = σ V test (s k ),V cal (s k ) t l ≤ s k ≤ t , wheres k correspondstothethreshold(time)whenexactlyksubjects,combiningthesubjects in bothD cal andD test , are mistakenly classified as Y =c. Note that at time s k , only one of the two following events are possible A 1 = I{V test (s k− 1 )=V test (s k ), andV cal (s k− 1 )=V cal (s k )− 1}, A 2 = I{V test (s k− 1 )=V test (s k )− 1, andV cal (s k− 1 )=V cal (s k )}. According to Assumption 1 which claims that D cal andD test are exchangeable, and the fact that FASI uses same fitted model to compute the scores, we have P(A 1 |F k )= V cal (s k ) V test (s k )+V cal (s k ) ; P(A 2 |F k )= V test (s k ) V test (s k )+V cal (s k ) . 
Toseewhytheratiodefinedin(4)isadiscrete-timemartingalewithrespecttothefiltration F k , note that E V test (s k− 1 ) V cal (s k− 1 )+1 |F k = V test (s k ) V cal (s k ) · V cal (s k ) V test (s k )+V cal (s k ) + V test (s k )− 1 V cal (s k )+1 · V test (s k ) V test (s k )+V cal (s k ) = V test (s k ) V cal (s k )+1 , establishing the desired result. C.2.1.3 FSR Control The threshold τ defined by (2) is a stopping time with respect to the filtration F k since {τ ≤ s k }∈F k . Inotherwords,theeventwhetherthekthmisclassificationoccurscompletely depends on the information prior to time s k (including s k ). Let D test,0 /D cal,0 be the index sets for subjects in the testing/calibration data that do not belong to class c. The following facts/strategies will be used in the proof. (I). When t l is used then all subjects are classified to class c. (II). The sizes of the testing and calibration sets are random. The expectation is taken in steps, first conditional on fixed sample sizes and then taken over all possible sample sizes. (III). The testing data and calibration data are independent (Assumption 1). 164 Inthefinalstepofourproof,weshallapplytheoptionalstoppingtheoremtothefiltration {F k }. The group-wise FSR of the FASI algorithm is FSR {c} a = E{FSP {c} a (τ )} ≤ α · E |D cal |+1 |D test | · E V test (τ ) V cal (τ )+1 |D cal ,D test = α · E |D cal |+1 |D test | · E V test (t l ) V cal (t l )+1 |D cal ,D test = α · E |D cal |+1 |D test | · |D test,0 | |D cal,0 |+1 (5) = α · E |D cal |+1 |D cal,0 |+1 E |D test,0 | |D test | (6) ≤ α · E p test,0 c p cal,0 c γ c α, where p test,0 c and p cal,0 c are the proportion of individuals that do not belong to class c in the test and calibration data, respectively. To get Equation (5) we used fact (I) and to get Equation (6) we used fact (III). This completes the proof. C.2.2 Proof of Part (b) The proof is more complicated as the arguments involve constructing two martingales. We followthesameorganizationoftheproofforPart(a). 
Details are provided for the new arguments and omitted for arguments similar to those in Part (a).

C.2.2.1 The empirical process description

The estimated FSP in group $a$ for a given threshold $t$ is

$$\hat Q_c(t) = \frac{\frac{1}{n^{cal}_a+1}\left\{\sum_{i \in D_{cal}} I(\hat S^c_i \ge t, Y_i \ne c, A_i = a) + 1\right\}}{\frac{1}{m_a+n^{cal}_a+1}\left\{\sum_{j \in D_{test} \cup D_{cal}} I(\hat S^c_j \ge t, A_j = a) + 1\right\}}.$$

As in Part (a), define $\tau = \hat Q_c^{-1}(\alpha) = \inf\{t : \hat Q_c(t) \le \alpha\}$. The data-driven algorithm is then given by $I(\hat S^c_{n+j} \ge \tau)$. Define $V_{test}(t)$, $R_{test}(t)$, $V_{cal}(t)$ and $R_{cal}(t)$ as before. Following arguments similar to those in Part (a), we have

$$\mathrm{FSP}^{\{c\},*}_a(\tau) = \frac{V_{test}(\tau)}{V_{cal}(\tau)+1} \cdot \frac{V_{cal}(\tau)+1}{R_{cal}(\tau)+R_{test}(\tau)+1} \cdot \frac{R_{cal}(\tau)+R_{test}(\tau)+1}{R_{test}(\tau)+1} \le \alpha \cdot \frac{n^{cal}_a+1}{n^{cal}_a+m_a+1} \cdot \frac{V_{test}(\tau)}{V_{cal}(\tau)+1} \cdot \frac{R_{cal}(\tau)+R_{test}(\tau)+1}{R_{test}(\tau)+1}.$$

Next we show that the last two terms in the above product are both martingales.

C.2.2.2 Martingale arguments

In Part (a), we showed that $V_{test}(t)/\{V_{cal}(t)+1\}$ is a discrete-time martingale with respect to the filtration $F_k = \sigma\{V_{test}(s_k), V_{cal}(s_k) : t_l \le s_k \le t\}$, which is defined on the misclassification process. Next we show that $\{R_{cal}(t)+R_{test}(t)+1\}/\{R_{test}(t)+1\}$ is also a discrete-time martingale. Consider the filtration that describes the selection process: $F^*_k = \sigma\{R_{cal}(s^*_k), R_{test}(s^*_k) : t_l \le s^*_k \le t\}$, where $s^*_k$ corresponds to the time at which exactly $k$ subjects are selected. At time $s^*_k$, only one of the following two events is possible:

$$A^*_1 = \{R_{cal}(s^*_{k-1}) = R_{cal}(s^*_k),\; R_{test}(s^*_{k-1}) = R_{test}(s^*_k) - 1\};$$
$$A^*_2 = \{R_{cal}(s^*_{k-1}) = R_{cal}(s^*_k) - 1,\; R_{test}(s^*_{k-1}) = R_{test}(s^*_k)\}.$$

On this backward-running filtration we have

$$P(A^*_1 \mid F^*_k) = \frac{R_{test}(s^*_k)}{R_{cal}(s^*_k)+R_{test}(s^*_k)}, \qquad P(A^*_2 \mid F^*_k) = \frac{R_{cal}(s^*_k)}{R_{cal}(s^*_k)+R_{test}(s^*_k)}.$$
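Under these transition probabilities, each remaining selected subject is equally likely to drop out next as the threshold rises, so the optional-stopping identity for the selection-process ratio $\{R_{cal}+R_{test}+1\}/\{R_{test}+1\}$ can likewise be corroborated numerically. As before, this is an illustrative sketch with arbitrary sample sizes and stopping rule, not part of the original proof.

```python
import numpy as np

def mc_selection_ratio(n_test=40, n_cal=60, stop_test=15,
                       n_rep=20000, seed=0):
    """Monte Carlo check that (R_cal + R_test + 1)/(R_test + 1), evaluated
    at a stopping time of the selection-process filtration, keeps the
    expectation of its value at t_l, (n_cal + n_test + 1)/(n_test + 1)."""
    rng = np.random.default_rng(seed)
    ratios = np.empty(n_rep)
    for r in range(n_rep):
        # Exchangeable order in which selected subjects drop out as the
        # threshold increases (1 = test subject, 0 = calibration).
        labels = np.array([1] * n_test + [0] * n_cal)
        rng.shuffle(labels)
        r_test, r_cal = n_test, n_cal        # at t_l everyone is selected
        ratio = (r_cal + r_test + 1) / (r_test + 1)
        for lab in labels:
            if r_test == stop_test:          # stop when R_test first hits stop_test
                break
            if lab:
                r_test -= 1
            else:
                r_cal -= 1
            ratio = (r_cal + r_test + 1) / (r_test + 1)
        ratios[r] = ratio
    return ratios.mean()

# The Monte Carlo mean should be close to (60 + 40 + 1) / (40 + 1).
```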
It follows that $\{R_{cal}(t)+R_{test}(t)+1\}/\{R_{test}(t)+1\}$ is a martingale since

$$E\left\{\frac{R_{cal}(s^*_{k-1})+R_{test}(s^*_{k-1})+1}{R_{test}(s^*_{k-1})+1} \,\Big|\, F^*_k\right\} = \frac{R_{cal}(s^*_k)+R_{test}(s^*_k)}{R_{test}(s^*_k)} \cdot \frac{R_{test}(s^*_k)}{R_{cal}(s^*_k)+R_{test}(s^*_k)} + \frac{R_{cal}(s^*_k)+R_{test}(s^*_k)}{R_{test}(s^*_k)+1} \cdot \frac{R_{cal}(s^*_k)}{R_{cal}(s^*_k)+R_{test}(s^*_k)} = \frac{R_{cal}(s^*_k)+R_{test}(s^*_k)+1}{R_{test}(s^*_k)+1}.$$

C.2.2.3 FSR control

Note that the threshold $\tau$ is a stopping time with respect to the filtration $F^*_k$. Let $D_{test,0}$ and $D_{cal,0}$ be the index sets of the subjects in the test and calibration data that do not belong to class $c$. Then

$$\mathrm{FSR}^{c,*}_a = E\{\mathrm{FSP}^{c,*}_a(\tau)\} \le \alpha\, E\left[\frac{|D_{cal}|+1}{|D_{cal}|+|D_{test}|+1} \cdot \frac{R_{cal}(\tau)+R_{test}(\tau)+1}{R_{test}(\tau)+1} \cdot E\left\{\frac{V_{test}(\tau)}{V_{cal}(\tau)+1} \,\Big|\, D_{cal}, D_{test}\right\}\right].$$

The term $\{R_{cal}(\tau)+R_{test}(\tau)+1\}/\{R_{test}(\tau)+1\}$ can be factored out because $R_{cal}(\tau)$ and $R_{test}(\tau)$ are constant when $D_{cal}$ and $D_{test}$ are given. According to Part (a), $V_{test}(t)/\{V_{cal}(t)+1\}$ is a backward martingale on $F_k$, and when $t_l$ is used all subjects are classified to class $c$. By the optional stopping theorem we have

$$E\left\{\frac{V_{test}(\tau)}{V_{cal}(\tau)+1} \,\Big|\, D_{cal}, D_{test}\right\} = \frac{V_{test}(t_l)}{V_{cal}(t_l)+1} = \frac{|D_{test,0}|}{|D_{cal,0}|+1}.$$

Next, conditional on the filtration defined on the selection process, we have

$$E\left\{\frac{R_{cal}(\tau)+R_{test}(\tau)+1}{R_{test}(\tau)+1}\right\} = E\left\{\frac{R_{cal}(t_l)+R_{test}(t_l)+1}{R_{test}(t_l)+1}\right\} = E\left\{\frac{|D_{cal}|+|D_{test}|+1}{|D_{test}|+1}\right\}.$$

Combining the above results, we have

$$\mathrm{FSR}^{c,*}_a \le \alpha\, E\left[\frac{|D_{cal}|+1}{|D_{cal}|+|D_{test}|+1} \cdot \frac{|D_{cal}|+|D_{test}|+1}{|D_{test}|+1} \cdot \frac{|D_{test,0}|}{|D_{cal,0}|+1}\right] = \alpha\, E\left[\frac{|D_{cal}|+1}{|D_{cal,0}|+1}\right] E\left[\frac{|D_{test,0}|}{|D_{test}|}\right] \le \gamma_c \alpha,$$

where $\gamma_c$ is defined at the end of Section C.2.1.3. The proof is complete.

C.3 Proof of Theorem 11

The theorem implies that the optimal base score for constructing R-values is $S^c(x,a) = P(Y = c \mid X = x, A = a)$. A similar optimality theory has been developed in the context of multiple testing with groups [19].
However, the proof for the binary classification setup with the indecision option is much more involved; we provide it here for completeness. We first establish an essential monotonicity property in Section C.3.1, then prove the optimality theory in Section C.3.2.

C.3.1 A monotonicity property

Suppose we use $S^c_{n+j}(x,a) = P(Y_{n+j} = c \mid X_{n+j} = x, A_{n+j} = a)$ as the base score. The corresponding theoretical R-values can be obtained via the conversion algorithm in Appendix C.1.1. Under Model 16, the mFSR level with threshold $t$ is $\mathrm{mFSR}^{\{c\}}_a(t) = P(Y \ne c \mid S^c \ge t, A = a)$. The theoretical R-value is defined as $R^c(s^c) = \inf_{t \le s^c} \{\mathrm{mFSR}^{\{c\}}_a(t)\}$. Let $Q^c_a(t)$ be the mFSR level when the threshold is $t$. The next proposition characterizes the monotonic relationship between $Q^c_a(t)$ and $t$.

Proposition 3. $Q^c_a(t)$ is monotonically decreasing in $t$.

The proposition is essential for expressing the oracle procedure as a thresholding rule based on $S^c$. Specifically, denote by $Q^{c,-1}_a(\cdot)$ the inverse of $Q^c_a(\cdot)$. The monotonicity of $Q^c_a(t)$ and the definition of the theoretical R-value together imply that $S^c_j(x,a) = Q^{c,-1}_a(R^c_j)$ for $a \in \mathcal{A}$. For notational convenience, let $T_{n+j}(x,a) = P(Y_{n+j} = 2 \mid X_{n+j} = x, A_{n+j} = a)$. Then $S^1_{n+j} = 1 - T_{n+j}$ and $S^2_{n+j} = T_{n+j}$. Therefore the oracle rule

$$\delta^{OR}_{n+j} = I(R^1_{n+j} \le \alpha_1) + 2\,I(R^2_{n+j} \le \alpha_2)$$

can be equivalently written as

$$\delta^{OR}_{n+j} = I\left\{S^1_{n+j} \ge Q^{-1}_{1,a}(\alpha_1)\right\} + 2\,I\left\{S^2_{n+j} \ge Q^{-1}_{2,a}(\alpha_2)\right\} = I\left\{T_{n+j} \le 1 - Q^{-1}_{1,a}(\alpha_1)\right\} + 2\,I\left\{T_{n+j} \ge Q^{-1}_{2,a}(\alpha_2)\right\}, \qquad (1)$$

for $1 \le j \le m$. This provides a key technical tool in Section C.3.2.

Proof of Proposition 3. Define $\tilde Q^c_a(t) = 1 - Q^c_a(t)$. We only need to show that $\tilde Q^c_a(t)$ is monotonically increasing in $t$. Let $M_a = \{n+1 \le j \le n+m : A_j = a\}$. According to the definition of the mFSR and the definition of $S^c_j$, we have

$$E\left[\sum_{j \in M_a} \left\{S^c_j - \tilde Q^c_a(t)\right\} I(S^c_j > t)\right] = 0, \qquad (2)$$

where the expectation is taken over $D_{test}$.
It is important to note that the oracle procedure, which assumes that all distributional information is known, does not utilize $D_{train}$ and $D_{cal}$. It is easy to see from Equation (2) that $\tilde Q^c_a(t) > t$; otherwise the summation on the LHS would be positive, leading to a contradiction.

Next we show that $t_1 < t_2$ implies $\tilde Q^c_a(t_1) \le \tilde Q^c_a(t_2)$. Assume instead that $\tilde Q^c_a(t_1) > \tilde Q^c_a(t_2)$. Focusing on group $a$, we have

$$\sum_{j \in M_a} \{S^c_j - \tilde Q^c_a(t_1)\} I(S^c_j > t_1) = \sum_{j \in M_a} \{S^c_j - \tilde Q^c_a(t_2) + \tilde Q^c_a(t_2) - \tilde Q^c_a(t_1)\} I(S^c_j > t_1)$$
$$= \sum_{j \in M_a} \{S^c_j - \tilde Q^c_a(t_2)\} I(S^c_j > t_2) + \sum_{j \in M_a} \{S^c_j - \tilde Q^c_a(t_2)\} I(t_1 < S^c_j \le t_2) + \sum_{j \in M_a} \{\tilde Q^c_a(t_2) - \tilde Q^c_a(t_1)\} I(S^c_j > t_1)$$
$$= I + II + III.$$

Taking expectations on both sides, the LHS is zero by (2). However, the RHS is strictly negative: for term $I$, $E(I) = 0$ according to the definition of the mFSR; for term $II$, $E(II) < 0$ since we always have $\tilde Q^c_a(t) > t$; and for term $III$, $E(III) < 0$ since we assumed $\tilde Q^c_a(t_1) > \tilde Q^c_a(t_2)$. It follows that the assumption $\tilde Q^c_a(t_1) > \tilde Q^c_a(t_2)$ cannot hold, and the proposition is proved.

C.3.2 Proof of the theorem

Define the expected number of true selections $\mathrm{ETS} = E\left\{\sum_{j=1}^m I(Y_{n+j} = c, \hat Y_{n+j} = c)\right\}$. It can then be shown that minimizing the EPI subject to the FSR constraint is equivalent to maximizing the ETS subject to the same constraint. According to Proposition 3, the oracle rule can be written as

$$\delta^{OR}_{n+j} = I\left\{T_{n+j} \le 1 - Q^{-1}_{1,a}(\alpha_1)\right\} + 2\,I\left\{T_{n+j} \ge Q^{-1}_{2,a}(\alpha_2)\right\}.$$

The mFSR constraints for the oracle rule imply that

$$E\left\{\sum_{j \in M_a} (T_j - \alpha_1)\, I(\delta^j_{OR} = 1)\right\} = 0, \qquad E\left\{\sum_{j \in M_a} (1 - T_j - \alpha_2)\, I(\delta^j_{OR} = 2)\right\} = 0. \qquad (3)$$

Let $\boldsymbol{\delta} \in \{0,1,2\}^m$ be a general selection rule in $\mathcal{D}_{\alpha_1,\alpha_2}$. Then the mFSR constraints for $\boldsymbol{\delta}$ imply that

$$E\left\{\sum_{j \in M_a} (T_j - \alpha_1)\, I(\delta^j = 1)\right\} \le 0, \qquad E\left\{\sum_{j \in M_a} (1 - T_j - \alpha_2)\, I(\delta^j = 2)\right\} \le 0. \qquad (4)$$

The ETS of $\boldsymbol{\delta} = (\delta^j : n+1 \le j \le n+m)$ is given by

$$\mathrm{ETS}(\boldsymbol{\delta}) = E\left[\sum_{a \in \mathcal{A}} \sum_{j \in M_a} \{I(\delta^j = 1)(1 - T_j) + I(\delta^j = 2)\,T_j\}\right] = \sum_{a \in \mathcal{A}} \left\{\mathrm{ETS}_{1,a}(\boldsymbol{\delta}) + \mathrm{ETS}_{2,a}(\boldsymbol{\delta})\right\}.$$

The goal is to show that $\mathrm{ETS}(\boldsymbol{\delta}_{OR}) \ge \mathrm{ETS}(\boldsymbol{\delta})$; it suffices to show $\mathrm{ETS}_{c,a}(\boldsymbol{\delta}_{OR}) \ge \mathrm{ETS}_{c,a}(\boldsymbol{\delta})$ for all $c$ and $a$. We show $\mathrm{ETS}_{1,a}(\boldsymbol{\delta}_{OR}) \ge \mathrm{ETS}_{1,a}(\boldsymbol{\delta})$ for a given $a$; the remaining inequalities follow by similar arguments. According to (3) and (4), we have

$$E\left[\sum_{j \in M_a} (T_j - \alpha_1)\left\{I(\delta^j_{OR} = 1) - I(\delta^j = 1)\right\}\right] \ge 0. \qquad (5)$$

Let $\lambda_{1,a} = \{1 - Q^{-1}_{1,a}(\alpha_1) - \alpha_1\}/Q^{-1}_{1,a}(\alpha_1)$. It can be shown that $\lambda_{1,a} > 0$. For $j \in M_a$, we claim that the oracle rule can be equivalently written as

$$\delta^j_{OR} = I\left\{\frac{T_j - \alpha_1}{1 - T_j} < \lambda_{1,a}\right\}.$$

Using the previous expression and techniques similar to the Neyman-Pearson lemma, the following result holds for all $j \in M_a$:

$$\left\{I(\delta^j_{OR} = 1) - I(\delta^j = 1)\right\}\left\{T_j - \alpha_1 - \lambda_{1,a}(1 - T_j)\right\} \le 0.$$

It follows that

$$E\left[\sum_{j \in M_a} \left\{I(\delta^j_{OR} = 1) - I(\delta^j = 1)\right\}\left\{T_j - \alpha_1 - \lambda_{1,a}(1 - T_j)\right\}\right] \le 0. \qquad (6)$$

According to (5) and (6), we have

$$\lambda_{1,a}\, E\left[\sum_{j \in M_a} (1 - T_j)\left\{I(\delta^j_{OR} = 1) - I(\delta^j = 1)\right\}\right] = \lambda_{1,a}\left\{\mathrm{ETS}_{1,a}(\boldsymbol{\delta}_{OR}) - \mathrm{ETS}_{1,a}(\boldsymbol{\delta})\right\} \ge 0.$$

Since $\lambda_{1,a} > 0$, the desired result follows.

C.4 Additional Numerical Results

C.4.1 Comparing the R-value and R$^+$-value

In this section, we demonstrate through simulation that the R$^+$-value (11) is more stable than the R-value (10) when $|D_{test}|$ is small. To do this, we examine the distributions of the R-value and R$^+$-value for a fixed base score $s(x,a) = 0.9$. We consider the setting described in Section 4.5 with $F_{1,M} = F_{1,F} = N(\boldsymbol{\mu}_1, 2 \cdot I_3)$ and $F_{2,M} = F_{2,F} = N(\boldsymbol{\mu}_2, 2 \cdot I_3)$. We set $\pi_{2|F} = \pi_{2|M} = 0.8$, $\boldsymbol{\mu}_1 = (1,1,1)^\top$ and $\boldsymbol{\mu}_2 = (2,2,2)^\top$. The base scores are constructed as the oracle class probabilities $P(Y = c \mid X, A)$. In Figure C.2, we compute 1,000 R-values and R$^+$-values for a fixed score of $s = 0.9$ based on randomly generated $D_{cal}$ and $D_{test}$. The size of the calibration set is fixed at $|D_{cal}| = 1{,}000$ and the test set has sizes $|D_{test}| \in \{5, 50, 200\}$.
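For intuition, the two estimators compared in this simulation can be sketched for a single protected group as follows. This is a hypothetical re-implementation consistent with the estimated-FSP formulas used in the proofs of Theorem 10 (a test-only denominator for the R-value; a pooled calibration-and-test denominator for the R$^+$-value); the function names `q_hat` and `r_value` and the candidate threshold grid are illustrative choices, not the original code.

```python
import numpy as np

def q_hat(t, s_cal, wrong_cal, s_test, plus=False):
    """Estimated FSP at threshold t for one protected group (sketch).
    s_cal: calibration scores; wrong_cal: booleans, True when Y_i != c;
    s_test: test scores.  plus=False mimics the R-value (test-only
    denominator), plus=True mimics the R+-value (pooled denominator)."""
    n_cal, m = len(s_cal), len(s_test)
    num = (np.sum((s_cal >= t) & wrong_cal) + 1) / (n_cal + 1)
    if plus:
        den = (np.sum(s_test >= t) + np.sum(s_cal >= t) + 1) / (m + n_cal + 1)
    else:
        den = max(np.sum(s_test >= t), 1) / m
    return num / den

def r_value(s, s_cal, wrong_cal, s_test, plus=False):
    """R-value of a base score s: the smallest estimated FSP achievable
    with any threshold at or below s (candidate thresholds restricted to
    the observed scores plus s itself)."""
    grid = np.concatenate([s_cal, s_test])
    cand = np.append(grid[grid <= s], s)
    return min(q_hat(t, s_cal, wrong_cal, s_test, plus) for t in cand)
```

Repeating this over many draws of a small $D_{test}$ shows the larger spread of the R-value relative to the R$^+$-value, matching the histograms in Figure C.2.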
The columns of Figure C.2 show the histograms of the R-values (left) and R$^+$-values (right), with $|D_{test}|$ increasing from 5 (first row) to 200 (last row). When $|D_{test}| = 5$, the R-value has much more variability than the R$^+$-value. This is because the denominator of the R-value only utilizes 5 observations when computing the total number of selections; by contrast, the R$^+$-value uses 1,005 observations since it has access to data from both $D_{cal}$ and $D_{test}$. Moving further down the rows of Figure C.2, the advantage of the R$^+$-value slowly disappears as $|D_{test}|$ increases, and the variability of the R-value and R$^+$-value becomes almost identical. We conclude from this simulation that the R$^+$-value is more desirable in settings where $|D_{test}|$ is small, since it can use more data to decrease its variability. While the R-value has more variability for small $|D_{test}|$, this disadvantage can be quickly overcome with a reasonably sized test set.

C.4.2 Numerical investigations of the factor $\gamma_{c,a}$

In Theorem 10, we show that the FASI algorithm can control the FSR at level $\gamma_{c,a}\alpha_c$. This section investigates the deviations of $\gamma_{c,a}$ from 1. For simplicity, we only focus on $\gamma_{1,a}$. The setup of the simulations is identical to that in Section 4.5. Figure C.1 shows the estimates of $\gamma_{1,a}$ for both the Female (green solid line) and Male (orange dashed line) groups. We vary $\pi_{2,F}$ from 0.15 to 0.85 while fixing $\pi_{2,M} = 0.5$. The y-axis plots the estimate of $\gamma_{1,a}$ averaged over 1,000 independent simulation runs. In both settings, $\gamma_{1,a}$ is nearly 1 across both the Female and Male groups. In the most extreme setting ($\pi_{1,F} = 0.85$), $\gamma_{1,a}$ deviates from 1 by only 0.01.
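The factor $\gamma_c$ can also be approximated directly from its definition in the proof of Theorem 10, $\gamma_c = E\{(|D_{cal}|+1)/(|D_{cal,0}|+1)\} \cdot E\{|D_{test,0}|/|D_{test}|\}$. The sketch below assumes, purely for illustration, that the numbers of class-0 subjects are Binomial; when the class proportions in the calibration and test data match, the product is essentially 1, consistent with Figure C.1.

```python
import numpy as np

def estimate_gamma(n_cal=1000, n_test=1000, p0_cal=0.5, p0_test=0.5,
                   n_rep=5000, seed=0):
    """Monte Carlo approximation of
    gamma_c = E[(|D_cal|+1)/(|D_cal,0|+1)] * E[|D_test,0|/|D_test|],
    assuming (illustratively) |D_cal,0| ~ Bin(n_cal, p0_cal) and
    |D_test,0| ~ Bin(n_test, p0_test)."""
    rng = np.random.default_rng(seed)
    n_cal0 = rng.binomial(n_cal, p0_cal, size=n_rep)
    n_test0 = rng.binomial(n_test, p0_test, size=n_rep)
    return np.mean((n_cal + 1) / (n_cal0 + 1)) * np.mean(n_test0 / n_test)
```

With mismatched proportions, e.g. `p0_test=0.6` and `p0_cal=0.5`, the estimate moves toward $p^{test,0}_c/p^{cal,0}_c = 1.2$, consistent with the bound $\gamma_c \approx p^{test,0}_c/p^{cal,0}_c$.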
[Figure C.2: Comparison between the R-value and R$^+$-value for varying sizes of the test data set. Each row corresponds to $|D_{test}| \in \{5, 50, 200\}$; the left column shows histograms of the R-value (orange) and the right column histograms of the R$^+$-value (green), computed for a fixed base score of $s(x,a) = 0.9$ based on 1,000 randomly generated data sets.]

[Figure C.1: Estimates of $\gamma_{1,a}$ from the simulations in Section 4.5, shown in two panels (Simulation 1 and Simulation 2) with $\pi_{2,F}$ on the x-axis. The solid (green) line represents the estimate of $\gamma_{1,F}$ for the Female protected group and the orange (long-dashed) line the estimate for the Male protected group.]
Abstract
The advancement of machine learning in the past decade has had profound impacts on society. Algorithms that were previously expensive and hard to implement have become accessible to the general public. As a consequence, automated decision systems are now used in sensitive areas such as medical diagnosis, bail sentencing, and financial services. These application areas require a more nuanced view of uncertainty estimation, since the consequences of a mistake can directly impact people's happiness and quality of life. When artificial intelligence is used to make decisions for many people, it is important that we rigorously control severe consequences and properly communicate the uncertainty associated with these mistakes to practitioners.
To address uncertainty estimation in high-risk scenarios, this dissertation aims to help practitioners gain better insight into the predictions their machine learning models are giving them. The first part of this dissertation focuses on predicting extreme events and shows that when practitioners estimate the probabilities of many events, even if each individual estimate is unbiased, the most extreme probabilities are biased estimates of the true probability of the corresponding event. This is problematic when one considers that the extreme probabilities tend to be the ones practitioners are most interested in (e.g., the probability of death if one were to undergo a medical procedure). To address this, we propose an empirical Bayes method that corrects for selection bias when estimating the occurrence of extreme events. The second part moves past estimating probabilities and focuses directly on classification. When classification is used in sensitive areas, one type of error may have far worse consequences than the other (such as missing sick individuals in medical diagnosis). However, popular techniques that try to control the more severe error below a pre-specified threshold are difficult to use in practice because they assume knowledge of perfect training labels. Our work expands the literature by allowing practitioners to have label noise in their training data, and provides a label-noise-adjusted algorithm that provably controls the more consequential error. The third part of this dissertation develops a flexible framework for classification across protected groups. This framework requires minimal assumptions and works regardless of the accuracy and type of machine learning algorithm used. In this way, practitioners can be sure that our definition of fairness will be upheld without any restrictive assumptions.
Lastly, we finish this dissertation by discussing a work in progress that addresses selection bias in $p$-value estimation for multiple hypothesis testing.
This dissertation tackles the fundamental problem of uncertainty estimation when the results or predictions of algorithms carry a high societal cost. Chapters 2-5 cover selection bias in extreme event prediction, label noise in asymmetric classification, fairness in machine learning, and selection bias in multiple hypothesis testing. Chapters 2 and 3 are based on previously published work, Chapter 4 on work in revision, and Chapter 5 on currently ongoing work.
As the applications our algorithms are used on become more complicated and nuanced, so too must our tools for inference evolve to meet modern-day societal challenges. This dissertation serves as a starting point for answering these questions, and the work will be further expanded upon in my future research.
Conceptually similar
- Large scale inference with structural information
- New methods for asymmetric error classification and robust Bayesian inference
- Statistical learning in High Dimensions: Interpretability, inference and applications
- Topics in selective inference and replicability analysis
- Early-warning systems for crisis risk
- Feature learning for imaging and prior model selection
- Statistical and computational approaches for analyzing metagenomic sequences with reproducibility and reliability
- Leveraging sparsity in theoretical and applied machine learning and causal inference
- Nonparametric empirical Bayes methods for large-scale inference under heteroscedasticity
- Efficient machine learning techniques for low- and high-dimensional data sources
- Advanced techniques for object classification: methodologies and performance evaluation
- Shrinkage methods for big and complex data analysis
- Learning fair models with biased heterogeneous data
- Robust causal inference with machine learning on observational data
- Decision-aware learning in the small-data, large-scale regime
- Learning to diagnose from electronic health records data
- Fast and label-efficient graph representation learning
- Theoretical foundations for dealing with data scarcity and distributed computing in modern machine learning
- Fairness in machine learning applied to child welfare
- Nonparametric ensemble learning and inference
Asset Metadata
Creator: Rava, Bradley (author)
Core Title: Adapting statistical learning for high risk scenarios
School: Marshall School of Business
Degree: Doctor of Philosophy
Degree Program: Business Administration
Degree Conferral Date: 2022-08
Publication Date: 07/22/2022
Defense Date: 07/22/2022
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: calibration by group, classification, empirical Bayes, Fairness, fairness in classification, fairness in machine learning, false discovery rate, false selection rate, label noise, machine learning, Neyman-Pearson (NP) paradigm, OAI-PMH Harvest, selection bias, selective inference, statistical parity, statistics, Tweedie’s formula, type I error, umbrella algorithm
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: James, Gareth Michael (committee chair), Alexander, Kenneth (committee member), Sun, Wenguang (committee member), Tong, Xin (committee member)
Creator Email: brava@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC111373919
Unique identifier: UC111373919
Legacy Identifier: etd-RavaBradle-10916
Document Type: Thesis
Rights: Rava, Bradley
Type: texts
Source: 20220722-usctheses-batch-961 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu