NEW METHODS FOR ASYMMETRIC CLASSIFICATION AND ROBUST BAYESIAN INFERENCE

by

Shunan Yao

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(APPLIED MATHEMATICS)

August 2023

Copyright 2023 Shunan Yao

Dedication

To my parents.

Acknowledgements

I would like to express my gratitude to everyone who has been a part of my journey as a Ph.D. student. Firstly, I am truly grateful for the invaluable support and guidance provided by my advisors, Dr. Stanislav Minsker and Dr. Xin Tong. Their mentorship and diverse perspectives on statistics have been instrumental in shaping me as a researcher. I also want to extend my thanks to my collaborators for their joint work with me, and to Dr. Steven Heilman for serving on my defense committee. Additionally, I would like to acknowledge Dr. Matteo Sesia for his academic support and guidance, and Dr. Chunming Wang for his assistance and advice during my job search.

I am also grateful to the faculty and staff in the Department of Mathematics and the Department of Data Sciences and Operations for their assistance, particularly Ms. Amy Yung, who has helped ensure a smooth Ph.D. process. I owe a debt of gratitude to my friends and cohorts, who have been an integral part of my academic journey and have shared in both my joys and challenges. Special thanks go out to them.

Above all, my deepest appreciation goes to my parents, Huiping Wu and Yuhai Yao, whose unconditional love and support have been my constant source of strength, especially during difficult times. I would also like to thank my girlfriend, Lijia Wang, with whom I have achieved so much.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Median of Means Principle for Bayesian Inference
    2.1 Introduction
        2.1.1 Proposed approach
    2.2 Assumptions and preliminary results
    2.3 Main results
    2.4 Numerical examples and applications
    2.5 Discussion
Chapter 3: Asymmetric error control under imperfect supervision: a label-noise-adjusted Neyman-Pearson umbrella algorithm
    3.1 Introduction
    3.2 Notation and Corruption Model
    3.3 Methodology
    3.4 The NP umbrella algorithm without label noise
    3.5 Algorithm 3.1: label-noise-adjusted NP umbrella algorithm with known corruption levels
    3.6 Algorithm 3.1#: label-noise-adjusted NP umbrella algorithm with unknown corruption levels
    3.7 Theory
        3.7.1 Preliminaries
        3.7.2 Rationale behind Algorithm 3.1
        3.7.3 Theoretical properties of Algorithm 3.1
        3.7.4 Theoretical properties of Algorithm 3.1#
    3.8 Numerical Analysis
        3.8.1 Simulation
        3.8.2 Violin plots for Section 3.8
    3.9 Real Data Analysis
    3.10 Discussion
Chapter 4: Neyman-Pearson and equal opportunity: when efficiency meets fairness in classification
    4.1 Introduction
    4.2 Neyman-Pearson equal opportunity (NP-EO) paradigm
        4.2.1 Mathematical setting
        4.2.2 Equality of opportunity (EO)
        4.2.3 Neyman-Pearson (NP) paradigm
        4.2.4 NP-EO paradigm
    4.3 Preliminaries
        4.3.1 Generalized Neyman-Pearson lemma
        4.3.2 Technical lemmas
    4.4 NP-EO oracle classifier
    4.5 Methodology
        4.5.1 NP-EO umbrella algorithm
    4.6 Numerical results
        4.6.1 Simulation
        4.6.2 Real data analysis
    4.7 Discussion
Bibliography

List of Tables

2.1 MAP estimates of the intercept, regression coefficients, and the standard deviation σ; left and right endpoints of 95% credible intervals in parentheses.
3.1 (Approximate) type I error violation rates over 1,000 repetitions for Simulation 3.1. Standard errors (×10^{-3}) in parentheses.
3.2 Averages of (approximate) true type II errors over 1,000 repetitions for Simulation 3.1. Standard errors (×10^{-3}) in parentheses.
3.3 (Approximate) type I error violation rates, and averages of (approximate) true type II errors over 1,000 repetitions for Simulation 3.2 (m_0 = .95, m_1 = .05, α = .1 and δ = .1). Standard errors (×10^{-3}) in parentheses.
3.4 (Approximate) type I error violation rates, and averages of (approximate) true type II errors over 1,000 repetitions for Simulation 3.3 (m_0 = .95, m_1 = .05, α = .1 and δ = .1). Standard errors (×10^{-3}) in parentheses.
3.5 (Approximate) type I error violation rates over 1,000 repetitions for Simulation 3.4. Standard errors (×10^{-3}) in parentheses.
3.6 (Approximate) type II error violation rates over 1,000 repetitions for Simulation 3.4. Standard errors (×10^{-3}) in parentheses.
3.7 (Approximate) type I error violation rates over 1,000 repetitions for Simulation 3.5. Standard errors (×10^{-2}) in parentheses.
3.8 (Approximate) type I error violation rates, and averages of (approximate) true type II errors by Algorithm 3.1 and the original NP umbrella algorithm over 1,000 repetitions for the email spam data. Standard errors (×10^{-3}) in parentheses.
3.9 (Approximate) type I error violation rates by Algorithm 3.1# over 1,000 repetitions for the email spam data. Standard errors (×10^{-3}) in parentheses.
3.10 Averages of (approximate) true type II errors by Algorithm 3.1# over 1,000 repetitions for the email spam data. Standard errors (×10^{-3}) in parentheses.
4.1 (Approximate) averages of type I and II errors, along with (approximate) violation rates of the NP and EO conditions over 1,000 repetitions for Simulation 4.1. Standard errors of the means (×10^{-4}) in parentheses.
4.2 (Approximate) averages of type I and II errors, along with (approximate) violation rates of the NP and EO conditions over 1,000 repetitions for Simulation 4.2. Standard errors of the means (×10^{-4}) in parentheses.
4.3 (Approximate) averages of type I and II errors, along with (approximate) violation rates of the NP and EO conditions over 1,000 repetitions for the credit card dataset. Standard errors of the means (×10^{-4}) in parentheses.

List of Figures

2.1 Posterior distribution Π̂_N for Example 2.1. The dark blue curve is the density function and the light blue area represents the 95% credible interval.
2.2 Posterior distribution Π̂_N for Example 2.2, scenario 1. The blue curves and blue shaded regions correspond to the density function and 95% credible sets of Π̂_N, whereas the dashed red curves and red shaded region are the standard posterior and its corresponding 95% credible set.
2.3 Posterior distribution Π̂_N for Example 2.2, scenario 2.
2.4 Posterior distribution Π̂_N for Example 2.3, no outliers.
2.5 Posterior distribution Π̂_N for Example 2.3, with outliers.
2.6 Standard posterior distribution for Example 2.3, with outliers.
3.1 Density plots in Example 3.1. True (lighter and solid) and corrupted (darker and dashed).
3.2 The original NP umbrella algorithm vs. a label-noise-adjusted version for Example 3.2. The plots in the left panel (blue) are the true type I and II errors for the original NP umbrella algorithm. The plots in the right panel (orange) are the true type I and II errors for the label-noise-adjusted NP umbrella algorithm with known corruption levels. The black dot and vertical bar in every violin represent the mean and standard deviation, respectively. In the top row, the horizontal black line is α = 0.05, and the boundaries between the lighter and darker color in each violin plot mark the 1 − δ = 95% quantiles.
3.3 The blue solid curve is the density of true class 0 (i.e., N(0,1)) and the orange dashed curve is the density of corrupted class 0 (i.e., a mixture of N(0,1) and N(2,1) with m_0 = 0.85). The black vertical line marks the threshold of the classifier I{X > 2.52}, whose corrupted type I error is 0.05.
3.4 A cartoon illustration of the 1 − δ probability upper bound on the type I error.
3.5 Violin plots for (approximate) true type I errors of Simulation 3.1.
3.6 Violin plots for (approximate) true type II errors of Simulation 3.1.
3.7 Violin plots for (approximate) true type I errors of Simulation 3.2.
3.8 Violin plots for (approximate) true type II errors of Simulation 3.2.
3.9 Violin plots for (approximate) true type I errors of Simulation 3.3.
3.10 Violin plots for (approximate) true type II errors of Simulation 3.3.
4.1 Plots of three classifiers in Example 4.1. The three rows, from top to bottom, illustrate the Bayes classifier, the NP oracle classifier, and the NP-EO oracle classifier, respectively. The left panel illustrates the densities of X_{0,a} and X_{1,a}, and the right panel those of X_{0,b} and X_{1,b}. In every sub-figure, the green curve represents the class 0 density and the orange curve represents the class 1 density. In each row, the two thresholds of the classifier are indicated by the two black vertical lines. The type I and type II errors conditional on the sensitive attribute are depicted respectively as the light green and light orange regions in every sub-figure, with their values marked.
4.2 A cartoon illustration of the choices of k*_a and k*_b. For every T^{y,s}, the circles, or squares, in its corresponding row represent its sorted elements, ascending from left to right. The NP umbrella algorithm picks t^{0,a}_{k*_{0,a}} and t^{0,b}_{k*_{0,b}} in T^{0,a} and T^{0,b}, respectively. Then, t^{1,a}_{l_a} and t^{1,b}_{l_b} are the smallest elements in T^{1,a} and T^{1,b} that are larger than t^{0,a}_{k*_{0,a}} and t^{0,b}_{k*_{0,b}}, respectively. Candidates for k*_a and k*_b are chosen from {l_a + 1, ..., n_{1,a}} and {l_b + 1, ..., n_{1,b}}, respectively, such that equation (4.15) is satisfied. Finally, k*_a and k*_b are selected such that the overall empirical type II error is minimized.

Abstract

This dissertation is devoted to statistical methods in robust Bayesian inference, asymmetric binary classification, and algorithmic fairness. The performance of these methods is evaluated, both theoretically and numerically, and asymptotic guarantees are provided to establish their efficacy.

The first part is related to robust estimation and inference in the Bayesian framework. In this framework, we often consider a family of probability density functions, and the data are generated from an unknown distribution whose probability density belongs to this family. Given a prior distribution on this set of probability density functions, one can construct a posterior distribution. In this part, it is assumed that an “adversary” replaces a random set of data by arbitrary values. The goal is to construct a posterior that is robust to the arbitrary outliers. The contribution of this work consists of two parts: first, we propose a robust posterior distribution based on the median of means (MOM) principle. Second, we show its robustness and a version of the Bernstein-von Mises theorem.

The second part is related to label noise in the Neyman-Pearson (NP) classification framework.
Given a random couple (X, Y) ∈ R^d × {0, 1}, the goal of NP classification is to predict Y in such a way that control of the type I error, i.e., the probability of misclassification given Y = 0, below a pre-specified value is given priority. We study the NP paradigm under the scenario that, instead of the true training data {(X_j, Y_j)}_{j=1}^n, the labels Y in the training data are randomly flipped and the actual training data are {(X_j, Ỹ_j)}_{j=1}^n. The contribution of this work consists of two parts: first, we show that the NP classifier still satisfies the type I error control even though it is trained with noisy labels. However, the price for this control is a high type II error, i.e., the probability of misclassification given Y = 1. Second, we propose an approach that allows one to adjust the NP algorithm to achieve the proper type I and II errors.

The third part is related to algorithmic fairness in NP classification. In binary classification problems, the algorithmic fairness setting assumes that one or more features are sensitive and that the predictive outcome should satisfy certain criteria describing social fairness based on the sensitive features. Among these criteria, we are interested in the Equal Opportunity (EO) framework. In this thesis, we consider an NP classification problem with an EO constraint. The contribution of this part consists of two parts: first, we propose the Neyman-Pearson Equal-Opportunity (NP-EO) paradigm that combines social fairness and institutional interests, and then we present the NP-EO oracle classifier. Second, we propose the NP-EO umbrella algorithm that trains NP-EO classifiers.

Chapter 1: Introduction

Statistics has made significant advances in recent decades, finding far-reaching applications. However, new challenges have emerged for the statistical community. One of these challenges is the need for statistical methods and models to be both sound and robust.
In this thesis, we propose methods addressing these challenges in Bayesian inference and in binary classification problems.

In Bayesian inference, the presence of contaminated data can result in a “misplaced” posterior distribution, making robust methods necessary. Moreover, these robust methods are expected to satisfy other theoretical properties that typically hold in an outlier-free setting. In classification problems, data contamination often leads to reduced prediction accuracy. However, in specific classification paradigms, the impact of corrupted data on the classification outcome may be more subtle and require adjustments based on the specific framework.

Another challenge is the ethical aspect of statistical methods. For instance, in classification problems, it is important for the algorithm to be fair across all social groups. Therefore, algorithmic fairness has become a critical issue in the statistical community.

This dissertation focuses on three specific challenges within the aforementioned setting. The first problem tackled is the development of robust methods for Bayesian inference. The second problem involves addressing the issue of label noise in binary, asymmetric classification scenarios. Lastly, the third problem explores the concept of algorithmic fairness in the context of asymmetric classification. Since the three parts are mostly independent, we provide a more detailed introduction in every chapter separately.

Chapter 2: Median of Means Principle for Bayesian Inference

2.1 Introduction.

Modern statistical and machine learning algorithms typically operate under limited human supervision; therefore robustness - the ability of algorithms to properly handle atypical or corrupted inputs - is an important and desirable property.
Robustness of the most basic algorithms, such as estimation of the mean and covariance structure that serve as “building blocks” of more complex methods, has received significant attention in the mathematical statistics and theoretical computer science communities; the survey papers by [56, 29] provide an excellent overview of the recent contributions to these topics, as well as applications to a variety of statistical problems. The key defining characteristics of modern robust methods are (a) the ability to operate under minimal model assumptions; (b) the ability to handle high-dimensional inputs; and (c) computational tractability. However, many algorithms that provably admit strong theoretical guarantees are not computationally efficient.

In this work, we rely on a class of methods that can be broadly viewed as risk minimization: the output (or the solution) provided by such methods is always a minimizer of a properly defined risk, or cost, function. For example, estimation of the mean µ of a square-integrable random variable Z can be viewed as minimization of the risk L(θ) = E(Z − θ)^2 over θ ∈ R. Since the risk involves the expectation with respect to the unknown distribution, its exact computation is impossible. Instead, risk minimization methods introduce a robust data-dependent “proxy”
Moreover, we show that for the parametric models, our approach preserves one of the key benefits of Bayesian methods - the “built-in” quantification of uncertainty - and leads to asymptotically valid confidence sets. At the core of our method is a version of the median of means principle, and our results demonstrate its potential beyond the widely studied applications in the statistical learning framework. Next, we introduce the mathematical framework used throughout the text. Let ˜ X be a random variable with values in some measurable space and unknown distributionP . Suppose that ˜ X N := ˜ X 1 ,..., ˜ X N are the training data – N i.i.d. copies of ˜ X. We assume that the sample has been modified in the following way: an “adversary” replaces a random set ofO <N observations by arbitrary values, possibly depending on the sample. Only the corrupted valuesX N :=(X 1 ,...,X N ) are observed. Suppose thatP has a densityp with respect to aσ -finite measure µ (for instance, the Lebesgue measure or the counting measure). We will assume thatp belongs to a family of density functions{p θ (·),θ ∈Θ }, where Θ ⊂ R d is a compact subset, and that p ≡ p θ 0 for some unknown θ 0 in the interior of Θ . We will also make the necessary identifiability assumption stating that θ 0 is the unique minimizer ofL(θ ):= E(ℓ(θ,X )− ℓ(θ ′ ,X)) ∗ overθ ∈Θ , whereℓ(θ, ·) is the negative log-likelihood, that is,ℓ(θ, ·)=− logp θ (·). Clearly, an approach based on minimizing the classical empirical risk L N (θ ) := 1 N P N j=1 (ℓ(θ,X j )− ℓ(θ ′ ,X j )) ofEℓ(θ,X ) leads to familiar maximum likelihood estimator (MLE) θ ∗ N . 
At the same time, the main object of interest in the Bayesian approach is the posterior distribution, which is a random probability measure on Θ defined via

Π_N(B | X^N) = ( ∫_B ∏_{j=1}^N p_θ(X_j) dΠ(θ) ) / ( ∫_Θ ∏_{j=1}^N p_θ(X_j) dΠ(θ) )    (2.1)

for all measurable sets B ⊆ Θ. Here, Π is the prior distribution with density π(·) with respect to the Lebesgue measure. The following result, known as the Bernstein-von Mises (BvM) theorem and due to L. Le Cam in its present form (see the book by [85] for its proof and discussion), is one of the main bridges connecting the frequentist and Bayesian approaches.

Theorem 2.1 (Bernstein-von Mises). Under the appropriate regularity assumptions on the family {p_θ, θ ∈ Θ},

‖ Π_N − N( θ*_N, (1/N) (I(θ_0))^{-1} ) ‖_TV →(P) 0,

where θ*_N is the MLE, ‖·‖_TV stands for the total variation distance, I(θ) is the Fisher information matrix, and →(P) denotes convergence in probability (with respect to the distribution of the sample X̃^N).

In essence, the BvM theorem implies that for a given 0 < α < 1, the 1 − α credible set, i.e. the set of (1 − α) posterior probability, coincides asymptotically with the set of (1 − α) probability under the distribution N(θ*_N, (1/N) (I(θ_0))^{-1}), which is well known to be an asymptotically valid (1 − α) “frequentist” confidence interval for θ_0, again under minimal regularity assumptions†. It is well known, however, that the standard posterior distribution is, in general, not robust: if the sample contains even one corrupted observation (referred to as an “outlier” in what follows), the posterior distribution can concentrate arbitrarily far from the true parameter θ_0 that defines the data-generating distribution. A concrete scenario showcasing this fact is given in [7]; another illustration is presented below in Example 2.2. The approach proposed below addresses this drawback: the resulting posterior distribution (a) admits natural MCMC-type sampling algorithms and (b) satisfies quantifiable robustness guarantees as well as a version of the Bernstein-von Mises theorem that is similar to its classical counterpart in the outlier-free setting. In particular, the credible sets associated with the proposed posterior are asymptotically valid confidence intervals that are also robust to sample contamination.

∗ Here, θ′ ∈ Θ is fixed but arbitrary, and we suppress it in the notation for brevity; defining L(θ) this way is convenient in what follows.
† For instance, these are rigorously defined in the book by [85].

Many existing works are devoted to the robustness of Bayesian methods, and we attempt to give a (necessarily limited) overview of the state of the art. The papers by [30] and [39] investigated approaches based on “conditioning on partial information,” while a more recent work by [63] introduced the notion of the “coarsened” posterior; however, the non-asymptotic behavior of these methods in the presence of outliers has not been explicitly addressed. Another line of work on Bayesian robustness models contamination by either imposing heavy-tailed likelihoods, like the Student's t-distribution, on the outliers [81], or by attempting to identify and remove them, as was done by [11].

As mentioned before, the approach followed in this work relies on a version of the median of means (MOM) principle to construct a robust proxy for the log-likelihood of the data and, consequently, a robust version of the posterior distribution. The original MOM estimator was proposed by [70] and later, independently, by [43, 3]. Its versions and extensions were studied more recently by many authors including [53, 55, 52, 64]; we refer the reader to the surveys mentioned in the introduction for a more detailed literature overview. The idea of replacing the empirical log-likelihood of the data by its robust proxy appeared previously in the framework of general Bayesian updating described by [15], where, given the data and the prior, the posterior is viewed as the distribution minimizing a loss expressed as the sum of a “loss-to-data” term and a “loss-to-prior” term. In this framework, [44] adopted different types of f-divergences (such as the one corresponding to the Hellinger distance) in the loss-to-prior term to obtain a robust analogue of the posterior; this approach has been investigated further in [48]. The asymptotic behavior of related types of posteriors was studied by [62], though the framework in that paper is not limited to parametric models while imposing more restrictive regularity conditions than the ones required in the present work. Various
The key idea is to replace the averageN − 1 P N j=1 (ℓ(θ,X j )− ℓ(θ ′ ,X j )) by its robust proxy denoted b L(θ ) ‡ and defined formally in equation (2.3) below, which gives rise to the robust posterior distribution b Π N (B|X N )= R B exp − N b L(θ ) π (θ )dθ R Θ exp − N b L(θ ) π (θ )dθ (2.2) defined for all measurable sets B⊆ Θ . Remark2.1. While it is possible to work with the log-likelihoodℓ(θ,X ) directly, it is often easier and more natural to deal with the incrementsℓ(θ,X )− ℓ(θ ′ ,X). For instance, in the Gaussian regression model with X = (Y,Z) ∈ R× R d , Y = θ T Z +ε with likelihood p θ (y,z) ∝ exp − (y− θ T z) 2 σ 2 exp − z T Σ z 2 and θ ′ = 0, ℓ(θ, (Y,Z))− ℓ(θ ′ ,(Y,Z)) = θ T Z 2 − 2Y · θ T Z which is more manageable than ℓ(θ, (Y,Z)) itself: in particular, the increments do not include the terms involvingY 2 . ‡ Sinceθ ′ is fixed, we will suppress the dependence on θ ′ in the notation for brevity. 7 Note that the density of b Π N (B |X N ) is maximized for b θ N = argmin θ ∈Θ b L(θ )− 1 N logπ (θ ). For in- stance, if the priorΠ is the uniform distribution overΘ , then b θ N =argmin θ ∈Θ b L(θ ) corresponds exactly to the robust risk minimization problem which, as we’ve previously mentioned, is hard due to non-convexity of the function b L(θ ). At the same time, sampling from b Π N (B | X N ) is possible, making the “maximum a posteriori” (MAP) estimator b θ as well as the credible sets associated with b Π N (B |X N ) accessible. The robust risk estimator b L(θ ) employed in this work is based on the ideas related to the median of means principle. The original MOM estimator was proposed by [70] and later by [43, 3]. Its versions and exten- sions were studied more recently by many researchers including [6, 53, 20, 55, 52, 64]. Letk≤ N/2 be a positive integer and{G 1 ,G 2 ,...,G k } bek disjoint subsets (“blocks”) of{1,2,...,N} of equal cardinality |G j |=n≥ N/k,j∈{1,2,...,k}. 
For every θ ∈ Θ, define the block average

L̄_j(θ) = (1/n) ∑_{i ∈ G_j} (ℓ(θ, X_i) − ℓ(θ′, X_i)),

which is the (increment of the) empirical log-likelihood corresponding to the subsample indexed by G_j. Next, let ρ : R → R_+ be a convex, even, strictly increasing smooth function with bounded first derivative; for instance, a smooth (e.g., convolved with an infinitely differentiable kernel) version of Huber's loss H(x) = min(x^2/2, |x| − 1/2) is an example of such a function. Furthermore, let {∆_n}_{n≥1} be a non-decreasing sequence such that ∆_n → ∞ and ∆_n = o(√n). Finally, define

L̂(θ) := argmin_{z ∈ R} ∑_{j=1}^k ρ( √n (L̄_j(θ) − z) / ∆_n ),    (2.3)

which is clearly a solution of a convex optimization problem. The robustness and non-asymptotic performance of L̂(θ) can be quantified as follows. Let σ^2(θ) = Var(ℓ(θ, X) − ℓ(θ′, X)) and ∆̃_n = max(σ(θ), ∆_n); then for all s, and any number of outliers O, such that max(s, O) ≤ ck for some absolute constant c > 0,

|L̂(θ) − L(θ)| ≤ (∆̃_n/∆_n) ( σ(θ) √(s/N) + ∆̃_n (s + O)/(k√n) + √(k/N) · o(1) )

with probability at least 1 − 2e^{−s}, where o(1) → 0 as max(∆_n, n) → ∞. Put simply, under very mild assumptions on ℓ(θ, X) − ℓ(θ′, X), L̂(θ) admits sub-Gaussian deviations around L(θ); moreover, it can absorb a number of outliers of order k. We refer the reader to Theorem 3.1 in [65] for a version of this bound that is uniform over θ, as well as further details. We end this section with two technical remarks.

Remark 2.2. The classical MOM estimator corresponds to the choice ρ(x) = |x|, which is not smooth but is scale-invariant, in the sense that the resulting estimator does not depend on the choice of ∆_n. While the latter property is often desirable, we conjecture that the posterior distribution based on such a “classical” MOM estimator does not satisfy the Bernstein-von Mises theorem, and that smoothness of ρ is important beyond being just a technical assumption. This, perhaps surprising, conjecture is so far only supported by our simulation
Example 2.1. Let $\widetilde X^N=(\widetilde X_1,\dots,\widetilde X_N)$ be i.i.d. with normal distribution $\mathcal N(\theta,1)$, $\theta_0=-30$, and let the prior distribution for $\theta$ be $\mathcal N(-29.50,1)$. Furthermore, let $\rho(x)=|x|$. We sample from the robust posterior distribution for the values $k=20,40,60,80$ and $n=\lfloor 1000/k\rfloor$. The resulting plots are presented in Figure 2.1. The key observation is that the posterior distributions are often multimodal and skewed, unlike the expected "bell shape."

[Figure 2.1: Posterior distribution $\widehat\Pi_N$ for Example 2.1, with panels for $k=20,40,60,80,100$. The dark blue curve is the density function and the light blue area represents the 95% credible interval.]

Remark 2.3. Let us mention that the posterior distribution $\widehat\Pi_N$ is a valid probability measure, meaning that $\widehat\Pi_N(\Theta\mid X^N)=1$. By the definition in display (2.2), it suffices to show that the denominator, $\int_\Theta e^{-N\widehat L(\theta)}\pi(\theta)\,d\theta$, is finite. Indeed, note that $\widehat L(\theta)\ge\widehat L(\widetilde\theta_N)$ for all $\theta$, where $\widetilde\theta_N=\operatorname{argmin}_{\theta\in\Theta}\widehat L(\theta)$, hence
$$
\int_\Theta e^{-N\widehat L(\theta)}\pi(\theta)\,d\theta\le e^{-N\widehat L(\widetilde\theta_N)}\int_\Theta\pi(\theta)\,d\theta.
$$
Therefore, a sufficient condition for $\int_\Theta e^{-N\widehat L(\theta)}\pi(\theta)\,d\theta$ to be finite is $\widehat L(\widetilde\theta_N)>-\infty$ a.s. This is guaranteed by the fact that, under mild regularity assumptions, for any $\theta,\theta'\in\Theta$, $\ell(\theta,x)-\ell(\theta',x)>-\infty$ $P_{\theta_0}$-almost surely.

2.2 Assumptions and preliminary results.

In this section, we list and discuss the regularity conditions that are required for the stated results to hold. The norm $\|\cdot\|$ refers to the standard Euclidean norm everywhere below.

Assumption 2.1. The function $\rho:\mathbb R\mapsto\mathbb R_+$ is convex, even, and such that

(i) $\rho'(z)=z$ for $|z|\le1$ and $\rho'(z)=\mathrm{const}$ for $|z|\ge2$;

(ii) $z-\rho'(z)$ is nondecreasing on $\mathbb R_+$;

(iii) $\rho^{(5)}$ is bounded and Lipschitz continuous.

One example of such a $\rho$ is the smoothed Huber loss: let
$$
H(z)=\frac{z^2}{2}\,I\{|z|\le 3/2\}+\Big(\frac32|z|-\frac34\Big)I\{|z|>3/2\}.
$$
Moreover, set $\psi(z)=C e^{-\frac{4}{1-4z^2}}\,I\{|z|\le 1/2\}$.
Then $\rho(z)=(H\star\psi)(z)$, where $\star$ denotes convolution, satisfies Assumption 2.1. Condition (iii) on the higher-order derivatives is technical in nature and can likely be avoided, at least in some examples; in numerical simulations, we did not notice any difference between the results based on the usual Huber loss and those based on its smoothed version.

The next assumption is a standard requirement related to the local convexity of the loss function $L(\theta)$ at its global minimum $\theta_0$.

Assumption 2.2. The Hessian $\partial_\theta^2 L(\theta_0)$ exists and is strictly positive definite.

In particular, this assumption ensures that in a sufficiently small neighborhood of $\theta_0$,
$$
c(\theta_0)\|\theta-\theta_0\|^2\le L(\theta)-L(\theta_0)\le C(\theta_0)\|\theta-\theta_0\|^2
$$
for some constants $0<c(\theta_0)\le C(\theta_0)<\infty$. The following two conditions allow one to control the "complexity" of the class $\{\ell(\theta,\cdot),\ \theta\in\Theta\}$.

Assumption 2.3. For every $\theta\in\Theta$, the map $\theta'\mapsto\ell(\theta',x)$ is differentiable at $\theta$ for $P$-almost all $x$ (where the exceptional set of measure $0$ can depend on $\theta$), with derivative $\partial_\theta\ell(\theta,x)$. Moreover, for every $\theta\in\Theta$, the envelope function $V(x;\delta):=\sup_{\|\theta'-\theta\|\le\delta}\|\partial_\theta\ell(\theta',x)\|$ of the class $\{\partial_\theta\ell(\theta',\cdot):\|\theta'-\theta\|\le\delta\}$ satisfies $EV^{2+\tau}(X;\delta)<\infty$ for some $\tau\in(0,1]$ and a sufficiently small $\delta=\delta(\theta)$.

An immediate implication of this assumption is the fact that the function $\theta\mapsto\ell(\theta,x)$ is locally Lipschitz. In other words, for any $\theta\in\Theta$, there exists a ball $B(\theta,r(\theta))$ of radius $r(\theta)$ such that for all $\theta',\theta''\in B(\theta,r(\theta))$, $|\ell(\theta',x)-\ell(\theta'',x)|\le V(x;\delta)\|\theta'-\theta''\|$. In particular, this condition suffices to prove consistency of the estimator $\widetilde\theta_N$.

The following condition is related to the modulus of continuity of the empirical process indexed by the gradients $\partial_\theta\ell(\theta,x)$. It is similar to the typical assumptions required for the asymptotic normality of the MLE, such as Theorem 5.23 in the book by [85]. Define
$$
\omega_N(\delta)=E\sup_{\|\theta-\theta_0\|\le\delta}\left\|\sqrt N\,(P_N-P)\big(\partial_\theta\ell(\theta,\cdot)-\partial_\theta\ell(\theta_0,\cdot)\big)\right\|,
$$
where $P_N$ is the empirical distribution based on $\widetilde X^N$.

Assumption 2.4.
The following relation holds:
$$
\lim_{\delta\to0}\limsup_{N\to\infty}\ \omega_N(\delta)=0.
$$
Moreover, the number of blocks $k$ satisfies $k=o(n^\tau)$ as $k,n\to\infty$.

The limitation on the growth of $k$ is needed to ensure that the bias of the estimator $\widetilde\theta_N$ is of order $o(N^{-1/2})$, a fact that we rely on in the proof of Theorem 2.3. Finally, we state a mild requirement imposed on the prior distribution; it is only slightly more restrictive than its counterpart in the classical BvM theorem (for example, Theorem 10.1 in the book by [85]).

Assumption 2.5. The density $\pi$ of the prior distribution $\Pi$ is positive and bounded on $\Theta$, and is continuous on the set $\{\theta:\|\theta-\theta_0\|<c_\pi\}$ for some positive constant $c_\pi$.

Remark 2.4. Most commonly used families of distributions satisfy Assumptions 2.2-2.4. For example, this is easy to check for the normal, Laplace or Poisson families in the location model where $p_\theta(x)=f(x-\theta)$, $\theta\in\Theta$. Other examples covered by our assumptions include linear regression with Gaussian or Laplace-distributed noise. The main work is usually required to verify Assumption 2.4; it relies on the standard tools for bounding empirical processes indexed by classes that are Lipschitz in parameter or have finite Vapnik-Chervonenkis dimension. Examples can be found in the books by [85] and [86].

Next, we introduce some of the technical tools that will be used in the proofs of our main results. Lemmas 2.1-2.4 stated below were established in [64], and are therefore given without proofs. Let $\Theta'\subseteq\Theta$ be a compact set, and define $\widetilde\Delta(\Theta'):=\max\big(\Delta_n,\sigma^2(\Theta')\big)$. The following lemma provides a high-probability bound for $\big|\widehat L(\theta)-L(\theta)\big|$ that holds uniformly over $\Theta'\subseteq\Theta$.

Lemma 2.1. Let $\mathcal L=\{\ell(\theta,\cdot),\ \theta\in\Theta\}$ be a class of functions mapping $S$ to $\mathbb R$, and further assume that $\sup_{\theta\in\Theta}E|\ell(\theta,X)-L(\theta)|^{2+\tau}<\infty$ for some $\tau\in[0,1]$.
Then there exist absolute constantsc, C >0 and a functiong τ,P (x) satisfyingg τ,P (x) x→∞ = o(1), τ =0, O(1), τ > 0 such that for alls>0,n andk satisfying s √ k∆ n E sup θ ∈Θ ′ 1 √ N N X j=1 |ℓ(θ,X j ))− L(θ )|+g τ,P (n) sup θ ∈Θ ′ E|ℓ(θ,X )− L(θ )| 2+τ ∆ 2+τ n n τ/ 2 ≤ c, the following inequality holds with probability at least1− 1 s : sup θ ∈Θ ′ b L(θ )− L(θ ) ≤ C " s· e ∆(Θ ′ ) ∆ n E sup θ ∈Θ ′ 1 N N X j=1 ℓ(θ,X j )− L(θ ) + e ∆(Θ ′ ) g τ,P (n) √ n sup θ ∈Θ ′ E|ℓ(θ,X )− L(θ )| 2+τ ∆ 2+τ n n τ/ 2 !# . This lemma implies that as long asEsup θ ∈Θ ′N − 1/2 P N j=1 |ℓ(θ,X j )− L(θ )| = O(1) and σ 2 (Θ ′ )≲ ∆ n =O(1), 13 sup θ ∈Θ ′ b L(θ )− L(θ ) =O p (N − 1/2 +n − (1+τ )/2 ). Next, Lemma 2.2 below establishes consistency of the estimator e θ N that is a necessary ingredient in the proof of the Bernstein-von Mises theorem. Lemma2.2. e θ N →θ 0 in probability asn,N/n→∞. The following lemma establishes the asymptotic equicontinuity of the processθ 7→ ∂ θ b L(θ )− ∂ θ L(θ ) atθ 0 (recall that the existence of∂ θ b L(θ ) at the neighborhood aroundθ 0 has bee justified in the proof of Theorem 2.3). Lemma2.3. For anyε>0, lim δ →0 limsup k,n→∞ P sup ∥θ − θ 0 ∥≤ δ √ N ∂ θ b L(θ )− ∂ θ L(θ )− ∂ θ b L(θ 0 )− ∂ θ L(θ 0 ) ≥ ε ! =0. This result combined with Lemma 2.2 yields a useful corollary. Note that due to consistency of e θ N , ∂ θ L( e θ N )− ∂ θ L(θ 0 )=∂ 2 θ L(θ 0 ) e θ N − θ 0 +o p e θ N − θ 0 . (2.4) On the other hand, ∂ θ L( e θ N )− ∂ θ L(θ 0 )=∂ θ b L( e θ N )− ∂ θ b L(θ 0 ) + ∂ θ L( e θ N )− ∂ θ b L( e θ N ) − ∂ θ L(θ 0 )− ∂ θ b L(θ 0 ) =∂ θ b L( e θ N )− ∂ θ b L(θ 0 )+r N , 14 wherer N = ∂ θ L( e θ N )− ∂ θ b L( e θ N ) − ∂ θ L(θ 0 )− ∂ θ b L(θ 0 ) . Note that for anyδ > 0, √ N∥r N ∥≤ √ N sup ∥θ − θ 0 ∥≤ δ ∂ θ b L(θ )− ∂ θ L(θ ) − ∂ θ b L(θ 0 )− ∂ θ L(θ 0 ) + √ N∥r N ∥I{∥ e θ N − θ 0 ∥>δ }. The first term converges to 0 in probability by Lemma 2.3 the second term converges to0 in probability by Lemma 2.2. 
Therefore, ∂ 2 θ L(θ 0 ) e θ N − θ 0 +o ∥ e θ N − θ 0 ∥ =− ∂ θ b L(θ 0 )− ∂ θ L(θ 0 ) +o p (N − 1/2 ). Under assumptions of the following Lemma 2.4, √ N ∂ θ b L(θ 0 )− ∂ θ L(θ 0 ) is asymptotically (multivari- ate) normal, therefore,∥∂ θ b L(θ 0 )− ∂ θ L(θ 0 )∥ = O p (N − 1/2 ). Moreover, ∂ 2 θ L(θ 0 ) is non-singular by As- sumption 2.2. It follows that∥ e θ N − θ 0 ∥=O p (N − 1/2 ), and we conclude that √ N( e θ N − θ 0 )=− ∂ 2 θ L(θ 0 ) − 1 √ N ∂ θ b L(θ 0 )− ∂ θ L(θ 0 ) +o p (1). (2.5) Lemma2.4. The following asymptotic relations hold: √ N ∂ θ b L(θ 0 )− ∂ θ L(θ 0 ) d −→N (0,I(θ 0 )) and √ N e θ N − θ 0 d −→N 0,I − 1 (θ 0 ) . The following lemma demonstrates that empirical processes indexed by classes that are Lipschitz in parameter (for example, satisfying assumption 2.3) are “well-behaved.” This fact is well-known but we outline the proof for reader’s convenience. 15 Lemma 2.5. LetF = f θ , θ ∈Θ ′ ⊆ R d be a class of functions that is Lipschitz in parameter, meaning that|f θ 1 (x)− f θ 2 (x)|≤ M(x)∥θ 1 − θ 2 ∥. Moreover, assume thatEM 2 (X)<∞ for somep≥ 1. Then E sup θ 1 ,θ 2 ∈Θ ′ 1 √ n n X j=1 (f θ 1 (X j )− f θ 2 (X j )− P(f θ 1 − f θ 2 )) ≤ C √ ddiam(Θ ′ ,∥·∥)E∥M∥ L 2 (Π n) . Proof. Symmetrization inequality yields that E sup θ 1 ,θ 2 ∈Θ ′ 1 √ n n X j=1 (f θ 1 (X j )− f θ 2 (X j )− P(f θ 1 − f θ 2 )) ≤ CE sup θ 1 ,θ 2 ∈Θ ′ 1 √ n n X j=1 ε j (f θ 1 (X j )− f θ 2 (X j )) =CE X E ε sup θ 1 ,θ 2 ∈Θ ′ 1 √ n n X j=1 ε j (f θ 1 (X j )− f θ 2 (X j )) . As the process f 7→ 1 √ n P n j=1 ε j (f θ 1 (X j )− f θ 2 (X j )) is sub-Gaussian conditionally on X 1 ,...,X n , its (conditional)L p -norms are equivalent toL 1 norm. 
Hence, Dudley’s entropy bound implies that E ε sup θ 1 ,θ 2 ∈Θ ′ 1 √ n n X j=1 ε j (f θ 1 (X j )− f θ 2 (X j )) ≤ CE ε sup θ 1 ,θ 2 ∈Θ ′ 1 √ n n X j=1 ε j (f θ 1 (X j )− f θ 2 (X j )) ≤ C Z Dn(Θ ′ ) 0 H 1/2 (z,T n ,d n )dz, where d 2 n (f θ 1 ,f θ 2 ) = 1 n P n j=1 (f θ 1 (X j )− f θ 2 (X j )) 2 , T n = {(f θ (X 1 ),...,f θ (X n )), θ ∈Θ ′ } ⊆ R n and D n (Θ ′ ) is the diameter of Θ with respect to the distance d n . As f θ (·) is Lipschitz in θ , we have that d 2 n (f θ 1 ,f θ 2 )≤ 1 n P n j=1 M 2 (X j )∥θ 1 − θ 2 ∥ 2 , implying thatD n (Θ ′ )≤∥ M∥ L 2 (Π n) diam(Θ ′ ,∥·∥) and H(z,T n ,d n )≤ H z/∥M∥ L 2 (Π n) ,Θ ′ ,∥·∥ ≤ log C diam(Θ ′ ,∥·∥)∥M∥ L 2 (Π n) z d . 16 Therefore, Z Dn(Θ ′ ) 0 H 1/2 (z,T n ,d n )dz≤ C √ ddiam(Θ ′ ,∥·∥)·∥M∥ L 2 (Π n) and E X E ε sup θ 1 ,θ 2 ∈Θ ′ 1 √ n n X j=1 ε j (f θ 1 (X j )− f θ 2 (X j )) ≤ C √ ddiam(Θ ′ ,∥·∥)E 1/2 ∥M∥ 2 L 2 (Π n) . Next are lemmas that we rely on in the proof of Theorem 2.3. Lemma2.6. The following relation holds: √ N e θ N − θ 0 − Z N P −→ 0. Proof. This lemma is implied directly by display (2.5). Define r N (θ )= ∂ θ L(θ )− ∂ θ b L(θ ) − ∂ θ L(θ 0 )− ∂ θ b L(θ 0 ) . Lemma2.7. For anyθ ∈Θ , b L(θ )= b L(θ 0 )+(θ − θ 0 ) T ∂ θ b L(θ 0 )+ 1 2 (θ − θ 0 ) T ∂ 2 θ L(θ 0 )(θ − θ 0 )+R 1 (θ )+R 2 (θ ), whereR 1 (θ ) andR 2 (θ ) are two functions such that for anyθ satisfying∥θ − θ 0 ∥≤ δ , R 1 (θ )≤∥ θ − θ 0 ∥sup θ ∥r N (θ )∥, and sup θ R 2 (θ ) ∥θ − θ 0 ∥ 2 →0, asδ →0. 17 LetG θ (t)= b L(θ 0 +tv) wherev =θ − θ 0 and note thatG ′ θ (t)=v T ∂ θ b L(θ 0 +tv). Then b L(θ )= b L(θ 0 )+ Z 1 0 G ′ θ (s)ds= b L(θ 0 )+ Z 1 0 G ′ θ (0)ds+ Z 1 0 G ′ θ (s)− G ′ θ (0) ds. (2.6) The first integral equals ∂ θ b L(θ 0 )(θ − θ 0 ). For the second integral, note that the reasoning similar to the one behind equation (2.4) yields that for anyθ ′ ∂ θ b L(θ ′ )− ∂ θ b L(θ 0 )− r N (θ ′ )=∂ 2 θ L(θ 0 )(θ ′ − θ 0 )+R(θ ′ − θ 0 ), whereR(θ − θ 0 ) is a vector-valued function such thatR(θ − θ 0 )/∥θ − θ 0 ∥→0 asθ →θ 0 . 
Therefore, for anys∈(0,1) G ′ θ (s)− G ′ θ (0)=v T ∂ θ b L(θ 0 +sv)− ∂ θ b L(θ 0 ) =sv T ∂ 2 θ L(θ 0 )v+v T r N (θ 0 +sv)+v T R(sv), implying that Z 1 0 G ′ θ (s)− G ′ θ (0) ds= Z 1 0 sv T ∂ 2 θ L(θ 0 )v+v T r N (θ 0 +sv)+v T R(sv) ds = 1 2 v T ∂ 2 θ L(θ 0 )v+ Z 1 0 v T r N (θ 0 +sv)ds+ Z 1 0 v T R(sv)ds. Denoting the last two termsR 1 (θ ) andR 2 (θ ) respectively, and combining the previous display with equa- tion (2.6), we deduce that b L(θ )= b L(θ 0 )+∂ θ b L(θ 0 )(θ − θ 0 )+ 1 2 (θ − θ 0 ) T ∂ 2 θ L(θ 0 )(θ − θ 0 )+R 1 (θ )+R 2 (θ ). 18 Moreover, Z 1 0 v T r N (θ 0 +sv)ds≤ Z 1 0 ∥v∥ sup t∈[0,1] ∥r N (θ 0 +tv)∥ds=∥v∥ sup t∈[0,1] ∥r N (θ 0 +tv)∥. Therefore, forδ > 0 such that∥θ − θ 0 ∥≤ δ ,R 1 (θ )≤∥ θ − θ 0 ∥sup ∥θ − θ 0 ∥≤ δ ∥r N (θ )∥. Furthermore, Z 1 0 v T R(sv)ds≤ Z 1 0 ∥v∥ sup t∈[0,1] ∥R(tv)∥ds=∥v∥ sup t∈[0,1] ∥R(tv)∥. and R 2 (θ 0 +tv) ∥v∥ 2 ≤ sup t∈[0,1] ∥R(tv)∥ ∥v∥ ≤ sup t∈[0,1] R(tv) ∥tv∥ , Thus, sup ∥θ − θ 0 ∥≤ δ ∥R 2 (θ )∥/∥θ − θ 0 ∥ 2 ≤ sup ∥θ − θ 0 ∥≤ δ ∥R(θ − θ 0 )∥/∥θ − θ 0 ∥ which converges to0 as δ →0. Lemma2.8. There exits a sequenceh 0 N such that∥h 0 N ∥→∞,∥h 0 N / √ N∥→0 and sup h:∥h∥≤∥ h 0 N ∥ N R 1 (θ 0 +h/ √ N)+R 2 (θ 0 +h/ √ N) P −→ 0, whereR 1 andR 2 are defined in Lemma 2.7. Proof of Lemma 2.8. Leth ∗ N be a sequence such that∥h ∗ N ∥→∞ and∥h ∗ N / √ N∥→ 0. In view of Lemma 2.3, sup h:∥h∥≤∥ h ∗ N ∥ ∥ √ Nr N (θ 0 +h/ √ N)∥ P −→ 0, 19 wherer N is given in Lemma 2.7. Moreover, leth (1) N be a sequence such that∥h (1) N ∥→∞,∥h (1) N ∥≤∥ h ∗ N ∥ and ∥h (1) N ∥ sup h:∥h∥≤∥ h ∗ N ∥ ∥ √ Nr N (θ 0 +h/ √ N)∥ P −→ 0. Lemma 2.7 implies that sup h:∥h∥≤∥ h (1) N ∥ NR 1 (θ 0 +h/ √ N) ≤ sup h:∥h∥≤∥ h (1) N ∥ √ N∥h∥ sup h:∥h∥≤∥ h (1) N ∥ ∥r N (θ 0 +h/ √ N)∥ ≤∥ h 1 N ∥ sup h:∥h∥≤∥ h (1) N ∥ ∥ √ Nr N (θ 0 +h/ √ N)∥ P −→ 0. Similarly, leth ∗∗ N be a sequence such that∥h ∗∗ N ∥→∞ and∥h ∗∗ N / √ N∥→0. Lemma 2.7 yields that sup h:∥h∥≤∥ h ∗∗ N ∥ R 2 (θ 0 +h/ √ N) ∥h/ √ N∥ 2 →0. 
Finally, leth (2) N be the sequence such thath (2) N →∞,∥h (2) N ∥≤∥ h ∗∗ N ∥ and h (2) N 2 sup h:∥h∥≤∥ h ∗∗ N ∥ R 2 (θ 0 +h/ √ N) ∥h/ √ N∥ 2 →0. Then sup h:∥h∥≤∥ h (2) N ∥ NR 2 (θ 0 +h/ √ N) ≤∥ h (2) N ∥ 2 sup h:∥h∥≤∥ h (2) N ∥ R 2 (θ 0 +h/ √ N) ∥h/ √ N∥ 2 ≤∥ h (2) N ∥ 2 sup h:∥h∥≤∥ h ∗∗ N ∥ R 2 (θ 0 +h/ √ N) ∥h/ √ N∥ 2 →0. Finally, takeh 0 N =argmin h∈{h (1) N ,h (2) N } ∥h∥, and conclude using the triangle inequality. 20 Lemma 2.9. Let {U n } n≥ 1 be a sequence of random vectors that converges to UZ weakly, where Z is a standard random vector of dimension d and U is a d× d invertible matrix. Furthermore, let V be a d× d symmetricpositivedefinitematrixand {a n } n≥ 1 -asequenceofpositivenumbersconvergingtoinfinity. Then Z ∥h∥≥ an e − 1 2 (h− Un) T V(h− Un) dµ (h) P −→ 0. Proof of Lemma 2.9. Note that Z ∥h∥≥ an e − 1 2 (h− Un) T V(h− Un) dh≤ Z ∥h∥≥ an e − 1 2 λ V min ∥h− Un∥ 2 dh, whereλ V min is the smallest eigenvalue ofV . LetC be an arbitrary positive constant andB n ={∥U n ∥≤ C}, then on the setB n , Z ∥h∥≥ an e − 1 2 λ V min ∥h− Un∥ 2 dh≤ Z ∥h∥≥ an e − 1 2 λ V min (∥h∥− C) 2 dh≤ δ for anyδ > 0 asn→∞. Note thatP(B n )→P(∥UZ∥≤ C)≥ P(∥Z∥ 2 ≤ C(λ U T U min ) − 1 ), whereλ U T U min is the smallest eigenvalue ofU T U. For an arbitraryε>0, selectC such thatP(∥Z∥ 2 ≤ C(λ U T U min ) − 1 )≥ 1− ε. Then Z ∥h∥≥ an e − 1 2 (h− Un) T V(h− Un) dh≤ δ with probability at least1− ε, thus the assertion holds. 21 2.3 Mainresults. We are ready to state the main theoretical results for the robust posterior distribution b Π N (·|X N ). Recall that L(θ ) = Eℓ(θ,X ) where ℓ(θ,x ) = − logp θ (x), and let σ (Θ) := sup θ ∈Θ Var(ℓ(θ,X )) and e ∆ = max(∆ n ,σ (Θ)) . The following theorem characterizes robustness properties of the mode of the posterior b Π N (·|X N ) defined as b θ N =argmin θ ∈Θ b L(θ )− 1 N logπ (θ ). Theorem2.2. Under the assumptions given in Section (2.2), with probability at least99%, b θ N − θ 0 2 ≤ C e ∆ O+1 k √ n + r k N o(1) !! 
+O 1 √ N as long asO ≤ ck for some absolute constants c,C > 0. Here, o(1) is a function that converges to 0 as n→∞. Proof. In view of assumption 2.2,∥θ − θ 0 ∥ 2 ≤ c ′ (θ 0 )(L(θ )− L(θ 0 )) whenever∥θ − θ 0 ∥ is sufficiently small. Hence, if we show that b θ N satisfies this requirement, we would only need to estimate L( b θ N )− L(θ 0 ). To this end, denoteL(θ,θ ′ )=L(θ )− L(θ ′ ), and observe that L( b θ N ,θ ′ )=L( b θ N ,θ ′ )− b L( b θ N )+ b L( b θ N )+ 1 N log(1/π ( b θ N ))− 1 N log(1/π ( b θ N )) ≤ L( b θ N ,θ ′ )− b L( b θ N )+ b L(θ 0 )+ 1 N log π ( b θ N ) π (θ 0 ) ! ≤ L(θ 0 ,θ ′ )+2sup θ ∈Θ L(θ,θ ′ )− b L(θ ) + 1 N log π ( b θ N ) π (θ 0 ) ! . If π ( b θ N ) ≤ π (θ 0 ), the last term above can be dropped without changing the inequality. On the other hand, π (θ ) is bounded and π (θ 0 ) > 0, π ( b θ N ) π (θ 0 ) ≤ ∥π ∥∞ π (θ 0 ) , whence the last term is at most C(π,θ 0 ) N . Given ε > 0, assumption 2.2 implies that there existsδ > 0 such thatinf ∥θ − θ 0 ∥≥ ε L(θ ) > L(θ 0 )+δ . LetN be 22 large enough so that C(π,θ 0 ) N ≤ δ/ 2, whenceP ∥ b θ N − θ 0 ∥≥ ε ≤ P sup θ ∈Θ | b L(θ )− L(θ,θ ′ )|>δ/ 2 . It follows from Lemma 2 in [64] (see also Theorem 3.1 in [65]) that under the stated assumptions, sup θ ∈Θ b L(θ )− L(θ,θ ′ ) ≤ o(1)+C e ∆ O k √ n with probability at least99% as long asn,k are large enough andO/k is sufficiently small. Here, o(1) is a function that tends to0 asn → ∞. This shows consistency of b θ N . Next, we will provide the required explicit upper bound on∥ b θ N − θ 0 ∥. As we’ve demonstrated above, it suffices to find an upper bound for L( b θ N )− L(θ 0 ,θ ′ ). We will apply the result of Theorem 2.1 in [60] to deduce that for C large enough, L( b θ N )− L(θ 0 ) ≤ C e ∆ O+1 k √ n + q k N o(1) +O 1 √ N with probability at least 99%. To see this, it suffices to notice that in view of Lemma 2.5, Esup θ ∈Θ 1 N N X j=1 ℓ(θ,X j )− ℓ(θ ′ ,X j )− L(θ,θ ′ ) ≤ C √ N , whereC may depend onΘ and the class{p θ , θ ∈Θ }. 
In particular, stated inequality implies that as long as the number of blocks containing outliers, whose cardinality isO, is not too large, the effect of these outliers on the squared estimation error is limited, regardless of their nature and magnitude. While the fact that the mode of b Π N is a robust estimator ofθ 0 is encouraging, one has to address the ability of the method to quantify uncertainty to fully justify the title of the “posterior distribution.” This is exactly the content of the following result. Let e θ N =argmin θ ∈Θ b L(θ ); it can be viewed as a mode of the posterior distribution corresponding to the uniform prior onΘ . Theorem2.3. Assume the outlier-free framework. Under Assumptions (2.1) - (2.5), b Π N (·|X N )−N e θ N , 1 N (I(θ 0 )) − 1 TV P −→ 0. 23 Moreover, √ N e θ N − θ 0 d −→N 0,I − 1 (θ ) . Proof. In view of the well-known property of the total variation distance, b Π N −N e θ N , 1 N (∂ 2 θ L(θ 0 )) − 1 TV = 1 2 Z Θ π (θ )e − N b L(θ ) R Θ π (θ ′ )e − N b L(θ ′ ) dθ ′ − N d/2 |∂ 2 θ L(θ 0 )| 1/2 (2π ) d/2 e − 1 2 N(θ − e θ N ) T ∂ 2 θ L(θ 0 )(θ − e θ N ) dθ. Next, let us introduce the new variableh= √ N(θ − θ 0 ), multiply the numerator and the denominator on the posterior byN b L(θ 0 ), and set κ N (h)=− N( b L(θ 0 +h/ √ N)− b L(θ 0 ))− N 2 (∂ θ b L(θ 0 )) T (∂ 2 θ L(θ 0 )) − 1 ∂ θ b L(θ 0 ), (2.7) andK N = R R d π (θ 0 +h/ √ N)e κ N (h) dµ (h). The total variation distance can then be equivalently written as ˆ Π N −N e θ N , 1 N (∂ 2 θ L(θ 0 )) − 1 TV = 1 2 Z π (θ 0 +h/ √ N)e κ N (h) K N − |∂ 2 θ L(θ 0 )| 1/2 (2π ) d/2 e − 1 2 (h− √ N( e θ N − θ 0 )) T ∂ 2 θ L(θ 0 )(h− √ N( e θ N − θ 0 )) dh. The functionπ (θ 0 +h/ √ N)e κ N (h) /K N can be viewed a pdf of a new probability measure ˆ Π ′ N . Thus it suffices to show that ˆ Π ′ N −N √ N( e θ N − θ 0 ),(∂ 2 θ L(θ 0 )) − 1 TV P −→ 0. Sinceθ 0 is the unique minimizer ofL(θ ),∂ θ L(θ 0 )=0. 
Next, define H(θ,z )= P k j=1 ρ ′ √ n ¯L j (θ )− z ∆ n ; it is twice differentiable since both ρ andℓ are. It is shown in the proof of Lemma 4 in [64] that∂ z H(θ, b L(θ 0 ))̸= 0 with high probability. Therefore, a unique mappingθ 7→ b L(θ ) exists around the neighborhood ofθ 0 and 24 so do∂ θ b L(θ 0 ) and∂ 2 θ b L(θ 0 ). DenoteZ N =− (∂ 2 θ L(θ 0 )) − 1 √ N∂ θ b L(θ 0 ). Then, Lemma 2.6 holds. In view of this lemma, the total variation distance between the normal lawsN √ N( e θ N − θ 0 ), ∂ 2 θ L(θ 0 ) − 1 and N Z N ,(∂ 2 θ L(θ 0 )) − 1 converges to0 in probability. Hence one only needs to show that ˆ Π ′ N −N Z N ,(∂ 2 θ L(θ 0 )) − 1 TV P −→ 0. Let λ N (h)=− 1 2 (h− Z N ) T ∂ 2 θ L(θ 0 )(h− Z N ) (2.8) and observe that as long as one can establish that Z R d π (θ 0 +h/ √ N)e κ N (h) − π (θ 0 )e λ N (h) dh P −→ 0, (2.9) we will be able to conclude that K N − (2π ) d/2 |∂ 2 θ L(θ 0 )| − 1 π (θ 0 ) = K N − Z R d π (θ 0 )e λ N (h) dµ (h) ≤ Z R d π (θ 0 +h/ √ N)e κ N (h) − π (θ 0 )e λ N (h) dh P −→ 0, so thatK N P −→ (2π ) d/2 |∂ 2 θ L(θ 0 )| − 1 π (θ 0 ). This further implies that Z R d π (θ 0 +h/ √ N)e κ N (h) − K N |∂ 2 θ L(θ 0 )| (2π ) d/2 e λ N (h) dh ≤ Z R d π (θ 0 +h/ √ N)e κ N (h) − π (θ 0 )e λ N (h) dµ (h)+ π (θ 0 )− K N |∂ 2 θ L(θ 0 )| (2π ) d/2 Z R d e λ N (h) dh = π (θ 0 )− K N |∂ 2 θ L(θ 0 )| (2π ) d/2 (2π ) d/2 |∂ 2 θ L(θ 0 )| + Z R d π (θ 0 +h/ √ N)e κ N (h) − π (θ 0 )e λ N (h) dh P −→ 0, 25 and the desired result would follow. Therefore, it suffices to establish that relation (2.9) holds. Moreover, sinceπ =0 outside of a compact setΘ , it is equivalent to showing that Z Θ ′ π (θ 0 +h/ √ N)e κ N (h) − π (θ 0 )e λ N (h) dµ (h) P −→ 0, (2.10) whereΘ ′ ={h:θ 0 +h/ √ N ∈Θ }. Note that ∂ θ b L(θ 0 +h/ √ N)− ∂ θ b L(θ 0 )= 1 √ N ∂ 2 θ b L(θ 0 )h+o P (∥h∥/ √ N). 
Lemma 2.7 yields the following representation forκ N (h) defined in (2.7): κ N (h)=− √ Nh T ∂ θ b L(θ 0 )− 1 2 h T ∂ 2 θ L(θ 0 )h− N 2 (∂ θ b L(θ 0 )) T (∂ 2 θ L(θ 0 )) − 1 ∂ θ b L(θ 0 ) − N R 1 (θ 0 +h/ √ N)+R 2 (θ 0 +h/ √ N) . (2.11) Let us divideΘ ′ into 3 regions: A 1 N ={h∈ Θ ′ :∥h∥≤∥ h 0 N ∥},A 2 N ={h∈ Θ ′ :∥h 0 N ∥ <∥h∥≤ δ √ N} andA 3 N ={h∈ Θ ′ : δ √ N <∥h∥≤ R √ N} whereδ is a sufficiently small positive number and R is a sufficiently large so that {θ ∈R d :∥θ − θ 0 ∥≤ R} containsΘ . Finally,h 0 N is chosen such that∥h 0 N ∥→∞, ∥h 0 N / √ N∥→ 0 and that satisfies an additional growth condition specified in Lemma 2.8. The remainder of the proof is technical and is devoted to proving that each part of the integral (2.10) corresponding to A 1 N , A 2 N , A 3 N converges to0. For the integral over the setA 1 N , observe that Z A 1 N π (θ 0 +h/ √ N)e κ N (h) − π (θ 0 )e λ N (h) dµ (h)≤ Z A 1 N π (θ 0 +h/ √ N) e κ N (h) − e λ N (h) dµ (h)+ Z A 1 N π (θ 0 +h/ √ N)− π (θ 0 ) e λ N (h) dµ (h). 26 To estimate the first term, recall the definition of λ N in display (2.8) and note that λ N (h)=− √ Nh T ∂ θ b L(θ 0 )− 1 2 h T ∂ 2 θ L(θ 0 )h− N 2 (∂ θ b L(θ 0 )) T (∂ 2 θ L(θ 0 )) − 1 ∂ θ b L(θ 0 ). Therefore, recalling thatκ N can be written as in display (2.11), we have that κ N (h)=λ N (h)− N R 1 (θ 0 +h/ √ N)+R 2 (θ 0 +h/ √ N) , hence Z A 1 N π (θ 0 +h/ √ N) e κ N (h) − e λ N (h) dh ≤ sup h∈A 1 N n π (θ 0 +h/ √ N) e − N(R 1 (θ 0 +h/ √ N)+R 2 (θ 0 +h/ √ N)) − 1 o Z h∈A 1 N e λ N (h) dµ (h). Here,sup h∈A 1 N π (θ 0 +h/ √ N)→π (θ 0 ) by the continuity ofπ while sup h∈A 1 N e − N(R 1 (θ 0 +h/ √ N)+R 2 (θ 0 +h/ √ N)) − 1 P −→ 0 by Lemma 2.8. Moreover, by the definition of λ N (see equation (2.8)), the integral factor equals(2π ) d/2 /|∂ 2 θ L(θ 0 )|. Therefore, the first integral converges to 0 in probability. 
For the second integral, observe that Z A 1 N π (θ 0 +h/ √ N)− π (θ 0 ) e λ N (h) dh ≤ sup h∈A 1 N π (θ 0 +h/ √ N)− π (θ 0 ) Z R d e λ N (h) dh = sup h∈A 1 N π (θ 0 +h/ √ N)− π (θ 0 ) (2π ) d/2 |∂ 2 θ L(θ 0 )| →0, 27 by Assumption 2.5. Next, to estimate the integral overA 2 N , note that Z A 2 N π (θ 0 +h/ √ N)e κ N (h) − π (θ 0 )e λ N (h) dh ≤ Z A 2 N π (θ 0 +h/ √ N)e κ N (h) dh+ Z A 2 N π (θ 0 )e λ N (h) dh. For the first term, consider again the representation of κ N as κ N (h)=− √ Nh T ∂ θ b L(θ 0 )− 1 2 h T ∂ 2 θ L(θ 0 )h− N 2 (∂ θ b L(θ 0 )) T (∂ 2 θ L(θ 0 )) − 1 ∂ θ b L(θ 0 ) − N R 1 (θ 0 +h/ √ N)+R 2 (θ 0 +h/ √ N) . Since∂ 2 θ L(θ 0 ) is a positive definite matrix, λ min ∂ 2 θ L(θ 0 ) >0 and, in view of Lemma 2.7, N R 1 (θ 0 +h/ √ N)+R 2 (θ 0 +h/ √ N) ≤∥ h∥ 2 sup ∥h∥≤ δ √ N √ Nr N (θ 0 +h/ √ N) 2∥h 0 ∥ + |R 2 (θ 0 +h/ √ N)| ∥h/ √ N∥ 2 ! ≤ λ min ∂ 2 θ L(θ 0 ) 4 ∥h∥ 2 ≤ 1 4 h T ∂ 2 θ L(θ 0 )h, with probability close to1, for sufficiently small δ . Then κ N (h)≤− √ Nh T ∂ θ b L(θ 0 )− 1 4 h T ∂ 2 θ L(θ 0 )h− N 2 (∂ θ b L(θ 0 )) T (∂ 2 θ L(θ 0 )) − 1 ∂ θ b L(θ 0 ) =− h− 1 2 Z N T ∂ 2 θ L(θ 0 ) h− 1 2 Z N +W N , 28 whereW N = 1 2 N(∂ θ b L(θ 0 )) T (∂ 2 θ L(θ 0 )) − 1 ∂ θ b L(θ 0 ) andW N converges toZ T HZ weakly withZ∼N (0,I d ) andI d andH being ad-dimensional identity matrix 1 2 I(θ 0 )(∂ 2 θ L(θ 0 )) − 1 I(θ 0 ) respectively. Therefore, for any positive increasing sequence{c N }, Z A 2 N π (θ 0 +h/ √ N)e κ N (h) dµ (h)≤ c N sup h∈A 2 N π (θ 0 +h/ √ N) Z h∈A 2 N e − (h− 1 2 Z N) T ∂ 2 θ L(θ 0 )(h− 1 2 Z N) dµ (h) + sup h∈A 2 N π (θ 0 +h/ √ N)e W N Z h∈A 2 N e − (h− 1 2 Z N) T ∂ 2 θ L(θ 0 )(h− 1 2 Z N) dµ (h)I{W N >logc N }. It is easy to see thatsup h∈A 2 N π (θ 0 +h/ √ N) R h∈A 2 N e − (h− 1 2 Z N) T ∂ 2 θ L(θ 0 )(h− 1 2 Z N) dµ (h) converges to0 in probability by Lemma 2.9. Then choosingc N such that c N sup h∈A 2 N π (θ 0 +h/ √ N) Z h∈A 2 N e − (h− 1 2 Z N) T ∂ 2 θ L(θ 0 )(h− 1 2 Z N) dµ (h) P −→ 0, guarantees that the first term converges to 0. 
Meanwhile, the second term is0 with probabilityP(W N ≤ logc N ). Note that for anyC, P(W N ≤ logc N )=P(W N ≤ logc N )− P(W N ≤ logC) +P(W N ≤ logC)− P Z T HZ≤ logC +P Z T HZ≤ logC . P(W N ≤ logc N )− P(W N ≤ logC) is positive forc N large enough by tightness andP(W N ≤ logC)− P Z T HZ≤ logC converges to0 by weak convergence. Thus, forN large enough,P(W N ≤ logc N )≥ P Z T HZ≤ logC for any C. Since I(θ 0 ) and ∂ 2 θ L(θ 0 ) are symmetric and positive definite, so is H. 29 Note that for arbitraryε > 0, one can select a sufficiently large C such thatP ∥Z∥ 2 > logC λ max(H) ≤ ε. Therefore, P(Z T HZ >logC)≤ P(λ max (H)∥Z∥ 2 >logC)≤ ε. Thus, forN large enough, sup h∈A 2 N π (θ 0 +h/ √ N)e W N Z h∈A 2 N e − (h− 1 2 Z N) T ∂ 2 θ L(θ 0 )(h− 1 2 Z N) dµ (h)I{W N >logc N } equals0 with probability at least1− ε for anyε, hence the above term converges to0 in probability. We have so far shown that R A 2 N π (θ 0 +h/ √ N)e κ N (h) dµ (h) converges to0 in probability. Another application of Lemma 2.9 implies that Z A 2 N π (θ 0 )e λ N (h) dµ (h)≤ π (θ 0 ) Z ∥h∥≥ alogN e λ N (h) dµ (h) P −→ 0, which shows the integral overA 2 N converges to0 in probability. For the final part, the integral over A 3 N , observe again that Z A 3 N π (θ 0 +h/ √ N)e κ N (h) − π (θ 0 )e λ N (h) dµ (h) ≤ Z A 3 N π (θ 0 +h/ √ N)e κ N (h) dµ (h)+ Z A 3 N π (θ 0 )e λ N (h) dµ (h). As before, the second integral converges to0 in probability by Lemma 2.9. The first integral can be further estimated as Z A 3 N π (θ 0 +h/ √ N)e κ N (h) dµ (h)≤ Z a≤∥ h/ √ N∥≤ R π (θ 0 +h/ √ N)e κ N (h) dµ (h). 30 On the compact set{θ :δ ≤∥ θ − θ 0 ∥≤ R},L(θ )− L(θ 0 ) attains a minimum positive valuet 1 . Moreover, inf a≤∥ h/ √ N∥≤ R b L(θ 0 +h/ √ N)− b L(θ 0 )≥ inf ∥h/ √ N∥≤ R b L(θ 0 +h/ √ N)− L(θ 0 +h/ √ N) + inf a≤∥ h/ √ N∥≤ R L(θ 0 +h/ √ N)− L(θ 0 ) + L(θ 0 )− b L(θ 0 ) . By Lemma 2.1, the terms in the first and third pair of brackets converge to 0 in probability. 
Thus,
$$
\inf_{a\le\|h/\sqrt N\|\le R}\ \widehat L(\theta_0+h/\sqrt N)-\widehat L(\theta_0)\ge \frac{t_1}{2}
$$
with probability approaching 1. Therefore,
$$
\int_{a\le\|h/\sqrt N\|\le R}\pi(\theta_0+h/\sqrt N)\,e^{\kappa_N(h)}\,d\mu(h)\le e^{-\frac12 Nt_1}\int_{\mathbb R^d}\pi(\theta_0+h/\sqrt N)\,d\mu(h)\le N^{d/2}e^{-\frac12 Nt_1}\to0,
$$
with probability approaching 1. This establishes relation (2.9), and therefore completes the proof.

The message of this result is that in the ideal, outlier-free scenario, the robust posterior distribution $\widehat\Pi_N$ asymptotically behaves like the usual posterior distribution $\Pi_N$, and the credible sets associated with it are asymptotically valid confidence intervals. The main novelty here is the first, "BvM part" of the theorem, while the asymptotic normality of $\widetilde\theta_N$ was previously established by [64].

2.4 Numerical examples and applications.

We illustrate our theoretical findings with the numerical examples presented below. The loss function that we use is Huber's loss defined earlier. While, strictly speaking, it does not satisfy the smoothness requirements, we found that this did not make a difference in our simulations. The algorithm for sampling from the posterior distributions was based on the "No-U-Turn sampler" variant of the Hamiltonian Monte Carlo method [40]. The robust estimator of the log-likelihood, $\widehat L(\theta)$, is approximated via the gradient descent algorithm at every $\theta$. Our first example demonstrates that using Huber's loss in the framework of Example 2.1 is enough for the BvM theorem to hold.

Example 2.2. We consider two scenarios: in the first one, the data are $N=1000$ i.i.d. copies of $\mathcal N(-30,1)$ random variables. In the second scenario, the data are generated in the same way except that 40 randomly chosen observations are replaced with 40 i.i.d. copies of $\mathcal N(10^4,1)$-distributed random variables. Results are presented in Figures 2.2 and 2.3, where the usual posterior distribution is plotted as well for comparison purposes.
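A miniature version of the contaminated scenario of this example can be sketched as follows. This is our own stand-in, not the thesis code: a plain random-walk Metropolis chain replaces the No-U-Turn sampler, the non-smoothed Huber loss replaces the smoothed $\rho$, and the names, step size $0.05$, outlier count $10$, and $\Delta_n=3$ are all assumptions. For the $\mathcal N(\theta,1)$ model the increment is $\ell(\theta,x)-\ell(0,x)=\theta^2/2-\theta x$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 1000, 40
x = rng.normal(-30.0, 1.0, size=N)
x[rng.choice(N, 10, replace=False)] = 1e4        # 10 gross outliers
n = N // k
block_x = x[: n * k].reshape(k, n).mean(axis=1)  # per-block sample means

def huber(z):
    return np.where(np.abs(z) <= 1.0, 0.5 * z**2, np.abs(z) - 0.5)

def argmin_convex(f, lo, hi, iters=60):
    # ternary search for the minimizer of a one-dimensional convex function
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def L_hat(theta, delta_n=3.0):
    # eq. (2.3): rho-aggregated MOM estimate of E[l(theta, X) - l(0, X)]
    bm = 0.5 * theta**2 - theta * block_x
    obj = lambda z: huber(np.sqrt(n) * (bm - z) / delta_n).sum()
    return argmin_convex(obj, bm.min(), bm.max())

def log_post(theta):
    # robust log-posterior (2.2) with the N(-29.5, 1) prior, up to a constant
    return -N * L_hat(theta) - 0.5 * (theta + 29.5) ** 2

theta = -29.5
lp = log_post(theta)
samples = []
for _ in range(2000):
    prop = theta + 0.05 * rng.normal()
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    samples.append(theta)
post = np.array(samples[500:])
print(post.mean())  # concentrates close to the true value -30,
                    # while the raw sample mean x.mean() is ruined (~70)
```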
The main takeaway from this simple example is that the proposed method behaves as expected: as long as the number of blocks $k$ is large enough, the robust posterior distribution concentrates most of its "mass" near the true value of the unknown parameter, while the usual posterior distribution is negatively affected by the outliers. At the same time, in the outlier-free regime, both posteriors perform similarly.

Example 2.3. In this example, we consider a non-synthetic dataset in the linear regression framework. The dataset in question was provided by [27] and describes the qualities of different samples of red and white wine. It contains 11 "subjective" features such as fixed acidity, pH, alcohol, etc., and one "objective" feature, the scoring of wine quality; 4898 white wine samples are selected to perform the linear regression, where the objective feature is the response and the subjective features are the regressors. It is assumed that the data are sampled from the model
$$
Y=\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_8X_8+\varepsilon,
$$
where $\beta_0$ is the intercept, $Y$ is the response, $X_1,X_2,\dots,X_8$ are the chosen regressors (see the detailed variable names in Table 2.1) along with the corresponding coefficients $\beta_1,\beta_2,\dots,\beta_8$, and $\varepsilon$ is the random error with $\mathcal N(0,\sigma^2)$ distribution. Here we remark that, for simplicity, only 8 out of the 11 "subjective" features are selected

[Figure 2.2: Posterior distribution $\widehat\Pi_N$ for Example 2.2, scenario 1. The blue curves and blue shaded regions correspond to the density function and 95% credible sets of $\widehat\Pi_N$, whereas the dashed red curves and red shaded regions are the standard posterior and its corresponding 95% credible set. Panels: $k=20,40,60,80,100$.]
[Figure 2.3: Posterior distribution $\widehat\Pi_N$ for Example 2.2, scenario 2. Panels: $k=20,40,60,80,100$.]

such that this model agrees with the OLS linear regression model generated by best subset selection with minimization of BIC. In the second experiment, 10 randomly chosen response variables are replaced with $\mathcal N(1000,10)$ random outliers. In both cases, the priors for the $\beta_j$ are set to be $\mathcal N(0,10^2)$ for every $j$, and the prior for $\sigma$ is the uniform distribution on $(0,1]$. The block size $n$ is set to 158 and the number of blocks $k$ is 31. The MAP estimates of the $\beta_j$'s and $\sigma$, as well as the two end points of the 95% credible intervals, are reported in Table 2.1. The plots for the different posteriors are presented in Figures 2.4, 2.5 and 2.6. These plots yet again demonstrate that the posterior $\widehat\Pi_N$, unlike its standard version, shows stable behavior when the input data are corrupted.

Table 2.1: MAP estimates of the intercept, the regression coefficients and the standard deviation $\sigma$ under $\widehat\Pi_N$; left and right end points of the 95% credible intervals in parentheses.

intercept: $-0.002\ (-0.026,\ 0.022)$
fixed.acidity: $0.065\ (0.027,\ 0.101)$
volatile.acidity: $-0.207\ (-0.232,\ -0.183)$
residual.sugar: $0.453\ (0.381,\ 0.536)$
free.sulfur.dioxide: $0.077\ (0.051,\ 0.102)$
density: $-0.487\ (-0.600,\ -0.372)$
pH: $0.125\ (0.093,\ 0.160)$
sulphates: $0.075\ (0.049,\ 0.101)$
alcohol: $0.287\ (0.227,\ 0.350)$
$\sigma$: $0.852\ (0.835,\ 0.868)$

2.5 Discussion.

The proposed extension of the median-of-means principle to Bayesian inference yields a version of the posterior distribution possessing several desirable characteristics, such as (a) robustness, (b) valid asymptotic coverage and (c) computational tractability. In addition, the mode of this posterior distribution serves as a robust alternative to the maximum likelihood estimator.
The computational cost of our method is higher compared to the usual posterior distribution, as we need to solve a one-dimensional convex optimization problem to estimate the expected log-likelihood; however, the method is still practical and, unlike many existing alternatives with similar theoretical properties, can be implemented with many off-the-shelf sampling packages. As with many MOM-based methods, the main "tuning parameter" is the number of blocks k: larger k increases robustness, while smaller values of k reduce the bias in the estimation of the likelihood. In many examples, however, this bias is far from the worst-case scenario, and we observed in our simulations that the method behaves well even when the size of each "block" is small. As a practical rule of thumb, we recommend setting k \asymp \sqrt{N} if no prior information about the number of outliers is available.

Now, let us discuss the drawbacks. First of all, the requirement for \Theta to be compact is quite restrictive, and is typically necessary to ensure that the quantity \sigma(\Theta) = \sup_{\theta \in \Theta} \mathrm{Var}(\ell(\theta, X)) appearing in our bounds is finite. The root of this requirement is related to the fact that \hat{L}(\theta), viewed as an estimator of the mean, is not scale-invariant. At the same time, the compactness assumption is satisfied if one has access to some preliminary, "low-resolution" proxy \tilde{\theta} of \theta_0 such that \|\theta_0 - \tilde{\theta}\| \le R for some, possibly large, R > 0. Second, our method is currently tailored only to the case of i.i.d. data and parametric models, which is the most natural setup for demonstrating the "proof of concept." At the same time, it would be interesting to obtain practical and theoretically sound extensions that are applicable in more challenging frameworks.
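The k ≍ √N rule of thumb and the robustness trade-off can be illustrated with the basic median-of-means estimator that underlies the method. This sketch is ours, not the dissertation's implementation (the actual procedure estimates the expected log-likelihood via a one-dimensional convex optimization problem rather than a plain median); all function names are hypothetical.

```python
import math
import statistics

def mom_estimate(values, k=None):
    """Median-of-means: split values into k blocks, average within
    each block, and return the median of the block means."""
    n = len(values)
    if k is None:
        k = max(1, round(math.sqrt(n)))  # rule of thumb: k ~ sqrt(N)
    block_size = n // k
    block_means = [
        sum(values[i * block_size:(i + 1) * block_size]) / block_size
        for i in range(k)
    ]
    return statistics.median(block_means)

# A few gross outliers barely move the MOM estimate, while they
# shift the plain sample mean substantially.
data = [0.1 * i for i in range(100)]               # mean about 4.95
corrupted = data[:-3] + [1000.0, 1000.0, 1000.0]
print(mom_estimate(corrupted))                     # stays near 5
print(sum(corrupted) / len(corrupted))             # dragged above 30
```

Larger k makes the median more resilient to contaminated blocks, at the price of noisier block means; this is the bias-robustness trade-off discussed above.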
Figure 2.4: Posterior distribution \hat{\Pi}_N for Example 2.3, no outliers.

Figure 2.5: Posterior distribution \hat{\Pi}_N for Example 2.3, with outliers.

Figure 2.6: Standard posterior distribution for Example 2.3, with outliers.

Chapter 3

Asymmetric error control under imperfect supervision: a label-noise-adjusted Neyman-Pearson umbrella algorithm

3.1 Introduction.

Most classification methods assume a perfectly labeled training dataset. Yet, it is estimated that in real-world databases around five percent of labels are incorrect [74, 77]. Labeling errors might come from insufficient guidance to human coders, poor data quality, or human mistakes in decisions, among other factors [17, 38, 18]. Specifically, in the medical field, a 2011 survey of more than 6,000 physicians found that half said they encountered diagnostic errors at least once a month [57]. The existence of labeling errors in training data is often referred to as label noise, imperfect labels or imperfect supervision.
It belongs to a more general data corruption problem, which refers to "anything which obscures the relationship between description and class" [38]. The study of label noise in supervised learning has been a vibrant field in academia. On the empirical front, researchers have found that some statistical learning methods, such as quadratic discriminant analysis [50] and k-NN [72], can be greatly affected by label noise and have their accuracy seriously reduced, while other methods, such as linear discriminant analysis [49], are more label-noise tolerant. Moreover, one can modify AdaBoost [23], the perceptron algorithm [45] and neural networks [80] so that they are more tolerant to label noise. Data cleansing techniques were also developed, such as in [35] and [19]. On the theoretical front, [69] provided a guarantee for risk minimization in the setting of convex surrogates. [58] proved label-noise tolerance of risk minimization for certain types of loss functions, and [34] extended the result by considering more loss types. [54] proposed learning methods with importance reweighting which can minimize the risk. [16] studied in depth the class-conditional corruption model, a model on which many works on label noise are based. In particular, theoretical results about parameter estimation and consistency of classifiers under this model were presented in their work. Most recently, [21] derived an innovative theory of excess risk for general classifiers.

In many classification settings, one type of error may have far worse consequences than the other. For example, a biomedical diagnosis/prognosis that misidentifies a benign tumor as malignant will cause distress and potentially unnecessary medical procedures, but the alternative, where a malignant tumor is classified as benign, will have far worse outcomes. Other related predictive applications include cybersecurity and finance.
Despite great advances in the label-noise classification literature, to our knowledge no classifier has been constructed to deal with this asymmetry in error importance under label noise so as to control the level of the more severe error type. In this chapter, we concentrate on the classification setting involving both mislabeled outcomes and error importance asymmetry. The Neyman-Pearson (NP) paradigm [22, 79], which controls the false-negative rate (FNR, a.k.a. type I error*) under some desired level while minimizing the false-positive rate (FPR, a.k.a. type II error), provides a natural approach to this problem. However, to the best of our knowledge, there has been no work that studies how label noise issues affect the control of the more severe FNR. We show that if one trains a standard NP classifier on corrupted labels (e.g., the NP umbrella algorithm [83]), then the actual achieved FNR is far below the control target, resulting in a very high, and undesirable, FPR.

* Note that type I error in our work is defined to be the conditional probability of misclassifying a 0 instance as class 1. Moreover, we code the more severe class as class 0. In the disease diagnosis example, the disease class would be class 0.

This problem motivates us to devise a new label-noise-adjusted umbrella algorithm that corrects for the labeling errors to produce a lower FPR while still controlling the FNR. The construction of such an algorithm is challenging because we must identify the optimal correction level without any training data from the uncorrupted distribution. To address this challenge, we employ a common class-conditional noise model and derive the population-level difference between the type I errors of the true and corrupted labels.
Based on this difference, we propose a sample-based correction term that, even without observing any uncorrupted labels, can correctly adjust the NP umbrella algorithm to significantly reduce the FPR while still controlling the FNR.

Our approach has several advantages. First, it is the first theory-backed methodology in the label noise setting to control population-level type I error (i.e., FNR) regarding the true labels. Concretely, we can show analytically that the new algorithm produces classifiers that have a high probability of controlling the FNR below the desired threshold with an FPR lower than that provided by the original NP umbrella algorithm. Second, when there are no labeling errors, our new algorithm reduces to the original NP algorithm. Finally, we demonstrate on both simulated and real-world data that, under the NP paradigm, the new algorithm dominates the original unadjusted one and competes favorably against existing methods which handle label noise in classification.

The rest of the chapter is organized as follows. In Section 3.2, we introduce some notation and a corruption model to study the label noise. In Section 3.3, we demonstrate the ineffectiveness of the original NP umbrella algorithm under label noise and propose a new label-noise-adjusted version. The validity and the high-probability type I error control property of the new algorithm are established in Section 3.7. Simulation and real data analysis are conducted in Section 3.8, followed by a Discussion section.

3.2 Notation and Corruption Model.

Let (X, Y, \tilde{Y}) be a random triplet, where X \in \mathcal{X} \subset \mathbb{R}^d represents features, Y \in \{0,1\} encodes the true class labels and \tilde{Y} \in \{0,1\} the corrupted ones. Note that in our setting, we cannot observe Y; the observations come from (X, \tilde{Y}). Denote X^0 ≜ X \mid (Y = 0) and X^1 ≜ X \mid (Y = 1). Similarly, denote \tilde{X}^0 ≜ X \mid (\tilde{Y} = 0) and \tilde{X}^1 ≜ X \mid (\tilde{Y} = 1). Denote by P and E a generic probability measure and expectation whose meanings depend on the context.
For any Borel set A \subset \mathcal{X}, we denote

P_0(A) = P(X \in A \mid Y = 0), \quad P_1(A) = P(X \in A \mid Y = 1),
\tilde{P}_0(A) = P(X \in A \mid \tilde{Y} = 0), \quad \tilde{P}_1(A) = P(X \in A \mid \tilde{Y} = 1).

Then, we denote by F_0, F_1, \tilde{F}_0 and \tilde{F}_1 their respective distribution functions and by f_0, f_1, \tilde{f}_0 and \tilde{f}_1 the density functions, assuming they exist. Moreover, for a measurable function T : \mathcal{X} \to \mathbb{R}, we denote, for any z \in \mathbb{R},

F^T_0(z) = P_0(T(X) \le z), \quad F^T_1(z) = P_1(T(X) \le z),
\tilde{F}^T_0(z) = \tilde{P}_0(T(X) \le z), \quad \tilde{F}^T_1(z) = \tilde{P}_1(T(X) \le z).

Since the effect of, and adjustment to, the label noise depend on the type and severity of corruption, we need to specify a corruption model to work with. Our choice for this work is the class-conditional noise (contamination) model, which is specified in the next assumption.

Assumption 3.1. There exist constants m_0, m_1 \in [0,1] such that for any Borel set A \subset \mathcal{X},

\tilde{P}_0(A) = m_0 P_0(A) + (1 - m_0) P_1(A) \quad and \quad \tilde{P}_1(A) = m_1 P_0(A) + (1 - m_1) P_1(A).   (3.1)

Furthermore, assume m_0 > m_1; both quantities can be unknown. Moreover, let m_0^#, m_1^# \in [0,1] be known constants such that m_0^# \ge m_0 and m_1^# \le m_1.

Figure 3.1: Density plots in Example 3.1. True (lighter and solid) and corrupted (darker and dashed).

Example 3.1. [An example of Assumption 3.1] Let X^0 \sim N(\mu_0, \sigma^2) and X^1 \sim N(\mu_1, \sigma^2), where \mu_0, \mu_1 \in \mathbb{R} and \sigma > 0. Then

\tilde{F}_0(z) = m_0 \Phi\left(\frac{z - \mu_0}{\sigma}\right) + (1 - m_0) \Phi\left(\frac{z - \mu_1}{\sigma}\right) \quad and \quad \tilde{F}_1(z) = m_1 \Phi\left(\frac{z - \mu_0}{\sigma}\right) + (1 - m_1) \Phi\left(\frac{z - \mu_1}{\sigma}\right),

where \Phi(\cdot) is the distribution function of N(0,1). With the choice of \mu_0 = 0, \mu_1 = 1, \sigma = 1, m_0 = 0.9, and m_1 = 0.05, the density functions f_0, \tilde{f}_0, f_1 and \tilde{f}_1 are plotted in Figure 3.1.

Note that equation (3.1) specifies perhaps the simplest model for label noise in supervised learning. Here, m_0 and m_1 represent the severity of the corruption levels. Concretely, m_0 can be interpreted as the proportion of true 0 observations among corrupted 0 observations, and m_1 the proportion of true 0 observations among corrupted 1 observations.
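Under Assumption 3.1, a corrupted-class draw can be generated by flipping a biased coin with probability m and then sampling from the true class-0 or class-1 distribution accordingly. A minimal sketch with the parameter values of Example 3.1 (the function name is ours, purely for illustration):

```python
import random

def sample_corrupted(m, mu0, mu1, sigma, rng):
    """Draw one observation from the corrupted mixture
    m * P0 + (1 - m) * P1, with P0 = N(mu0, sigma^2), P1 = N(mu1, sigma^2)."""
    mu = mu0 if rng.random() < m else mu1
    return rng.gauss(mu, sigma)

rng = random.Random(0)
# Example 3.1: mu0 = 0, mu1 = 1, sigma = 1, m0 = 0.9, m1 = 0.05
corrupted_0 = [sample_corrupted(0.90, 0.0, 1.0, 1.0, rng) for _ in range(100000)]
corrupted_1 = [sample_corrupted(0.05, 0.0, 1.0, 1.0, rng) for _ in range(100000)]

# Mixture means follow Lemma 3.1:
# E[X~0] = m0*mu0 + (1-m0)*mu1 = 0.1, E[X~1] = m1*mu0 + (1-m1)*mu1 = 0.95.
print(sum(corrupted_0) / len(corrupted_0))  # close to 0.1
print(sum(corrupted_1) / len(corrupted_1))  # close to 0.95
```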
The assumption m_0 > m_1 means that corrupted class 0 resembles true class 0 more than corrupted class 1 does, and that corrupted class 1 resembles true class 1 more than corrupted class 0 does. However, this assumption does not mean that corrupted class 0 resembles true class 0 more than it resembles true class 1 (i.e., m_0 > 1/2) or that corrupted class 1 resembles true class 1 more than it resembles true class 0 (i.e., m_1 < 1/2). Note that, by the way our model is written, m_0 = 1 and m_1 = 0 correspond to the no-label-noise situation; as such, the roles of m_0 and m_1 are not symmetric. Hence, the assumptions m_0^# \ge m_0 and m_1^# \le m_1 mean that we know some lower bounds on the corruption levels.

The class-conditional label noise model has been widely adopted in the literature [69, 54, 16]. We note here that the assumption m_0 > m_1 aligns with the total noise assumption \pi_0 + \pi_1 < 1 in [16], as \pi_0 and \pi_1 in their work correspond to 1 - m_0 and m_1 in Assumption 3.1, respectively. In [69] and [54], the label noise was modeled through the label flipping probabilities: \mu_i = P(\tilde{Y} = 1 - i \mid Y = i), i = 0, 1. This alternative formulation is related to our formulation via Bayes' rule. An in-depth study of the class-conditional label noise model, including mutual irreducibility and identifiability, was presented in [16]. Moreover, [16] developed a noisy-label-trained classifier based on a weighted cost-sensitive surrogate loss and established its consistency. Similarly, [69] provided two methods to train classifiers, both relying on a classification-calibrated surrogate loss; bounds for the respective excess risks of these two methods were also given. Moreover, [54] proposed an importance reweighting method and extended the result in [69] to all surrogate losses. Other than [16], which briefly discussed the NP paradigm at the population level, in all aforementioned papers, though the loss functions vary, the goal of classification is to minimize the overall risk.
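The Bayes'-rule connection between the flipping probabilities \mu_i and the mixture proportions m_0, m_1 mentioned above can be written out explicitly. The following display is a sketch of ours (not reproduced from the dissertation), where \pi_0 = P(Y = 0) denotes the marginal probability of true class 0:

```latex
% With \pi_0 = P(Y = 0) and \mu_i = P(\tilde Y = 1-i \mid Y = i), Bayes' rule gives
m_0 = P(Y = 0 \mid \tilde Y = 0)
    = \frac{(1-\mu_0)\,\pi_0}{(1-\mu_0)\,\pi_0 + \mu_1\,(1-\pi_0)},
\qquad
m_1 = P(Y = 0 \mid \tilde Y = 1)
    = \frac{\mu_0\,\pi_0}{\mu_0\,\pi_0 + (1-\mu_1)\,(1-\pi_0)}.
```

In particular, translating estimates of the \mu_i's into estimates of the m_i's requires (an estimate of) the class prior \pi_0.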
Our work focuses on the NP paradigm. Moreover, we focus on high-probability control of the type I error based on finite samples, in contrast to the asymptotic results in the literature. In this work, we take the perspective that domain experts can provide under-estimates of the corruption levels. In the literature, there are existing methods to estimate these levels. For example, [54] and [16] developed methods to estimate the \pi_i's and \mu_i's, and showed consistency of their estimators. In numerical studies, we apply the method in [54] to estimate m_0 and m_1.† Numerical evidence shows that using these estimators in our proposed algorithm fails to establish a high-probability control of the true type I error. In fact, even using consistent and unbiased estimators of m_0 and m_1 as inputs to our proposed algorithm would not be able to control the true type I error with high probability. One such case is demonstrated in Simulation 3.5, where the estimators of m_0 and m_1 are normally distributed and centered at the true values. To have high-probability control of the true type I error, we do need the "under-estimates" of the corruption levels as in Assumption 3.1.

† Note that though their method targets the \mu_i's, estimates of the m_i's in equation (3.1) can be constructed from those of the \mu_i's by Bayes' theorem.

3.3 Methodology.

In this section, we first formally introduce the Neyman-Pearson (NP) classification paradigm and review the NP umbrella algorithm [83] for the uncorrupted-label scenario (Section 3.4). Then we provide an example demonstrating that in the presence of label noise, naively implementing the NP umbrella algorithm leads to an excessively conservative type I error, i.e., a type I error much smaller than the control target \alpha. We analyze and capitalize on this phenomenon, and present new noise-adjusted versions of the NP umbrella algorithm: Algorithm 3.1 for known corruption levels (Section 3.5) and Algorithm 3.1# for unknown corruption levels (Section 3.6).
Algorithm 3.1 can be considered as a special case of Algorithm 3.1# with m_0^# = m_0 and m_1^# = m_1.

A few additional notations are introduced to facilitate our discussion. A classifier \phi : \mathcal{X} \to \{0,1\} maps from the feature space to the label space. The (population-level) type I and II errors of \phi(\cdot) regarding the true labels (a.k.a. true type I and II errors) are, respectively, R_0(\phi) = P_0(\phi(X) \ne Y) and R_1(\phi) = P_1(\phi(X) \ne Y). The (population-level) type I and II errors of \phi(\cdot) regarding the corrupted labels (a.k.a. corrupted type I and II errors) are, respectively, \tilde{R}_0(\phi) = \tilde{P}_0(\phi(X) \ne \tilde{Y}) and \tilde{R}_1(\phi) = \tilde{P}_1(\phi(X) \ne \tilde{Y}). In verbal discussion in this chapter, type I error without any suffix refers to the type I error regarding the true labels.

3.4 The NP umbrella algorithm without label noise.

The NP paradigm [22, 79] aims to mimic the NP oracle

\phi^*_\alpha \in \operatorname{argmin}_{\phi : R_0(\phi) \le \alpha} R_1(\phi),

where \alpha \in (0,1) is a user-specified level that reflects the priority towards the type I error. In practice, with or without label noise, based on training data of finite sample size, it is usually impossible to ensure R_0(\cdot) \le \alpha almost surely. Instead, we aim to control the type I error with high probability. Recently, the NP umbrella algorithm [83] has attracted significant attention.‡ This algorithm works in conjunction with any score-based classification method (e.g., logistic regression, support vector machines, or random forest) to compress a d-dimensional feature measurement to a 1-dimensional score, and then thresholds the score to classify. Specifically, given a (score-based) classification method, the NP umbrella algorithm uses a model-free order statistics approach to decide the threshold, attaining a high-probability control on the type I error with minimum type II error for that method.
Moreover, when coupled with a classification method that matches the underlying data distribution, the NP umbrella algorithm also achieves a diminishing excess type II error, i.e., R_1(\hat{\phi}_\alpha) - R_1(\phi^*_\alpha) \to 0. For example, [84] showed that under a linear discriminant analysis (LDA) model, an LDA classifier with the score threshold determined by the NP umbrella algorithm satisfies both the control on type I error and a diminishing excess type II error.§

Next we review the implementation of the NP umbrella algorithm. Let S^0 = \{X^0_j\}_{j=1}^{M_0} and S^1 = \{X^1_j\}_{j=1}^{M_1}, respectively, be the uncorrupted observations in classes 0 and 1, where M_0 and M_1 are the numbers of observations from each class.¶ Then, given a classification method (i.e., base algorithm, e.g., logistic regression), the NP umbrella algorithm is implemented by randomly splitting the class 0 data S^0 into two parts: S^0_b and S^0_t. The first part, S^0_b, together with S^1, is used to train the base algorithm, while the second part, S^0_t, determines the threshold candidates. Specifically, we train a base algorithm with scoring function \hat{T}(\cdot) (e.g., the sigmoid function in logistic regression) using S^0_b \cup S^1, apply \hat{T}(\cdot) on S^0_t (with |S^0_t| = n) to get threshold candidates \{t_1, ..., t_n\}, and sort them in increasing order: \{t_{(1)}, ..., t_{(n)}\}. Then the NP umbrella algorithm proposes the classifier \hat{\phi}_{k^*}(\cdot) = I\{\hat{T}(\cdot) > t_{(k^*)}\}, where

k^* = \min\left\{k \in \{1,...,n\} : \sum_{j=k}^{n} \binom{n}{j} (1-\alpha)^j \alpha^{n-j} \le \delta\right\},   (3.2)

in which \delta is a user-specified tolerance probability of the type I error exceeding \alpha.

‡ At the time of writing, the NP umbrella package has been downloaded over 35,000 times.
§ These two properties together were coined as the NP oracle inequalities by [78]. Classifiers with these properties were constructed under non-parametric assumptions in [82] and [89].
¶ Note that the uncorrupted data S^0 and S^1 are not available in our present label noise setting; we only use them here for review purposes.
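The order selection in (3.2) amounts to scanning exact binomial tail sums. A small sketch of this computation (function names are ours, not from the dissertation's software):

```python
from math import comb

def violation_prob(k, n, alpha):
    """Upper bound on P(R0(phi_k) > alpha) from [83]:
    sum_{j=k}^{n} C(n, j) (1 - alpha)^j alpha^(n - j)."""
    return sum(comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
               for j in range(k, n + 1))

def k_star(n, alpha, delta):
    """Smallest order k whose violation probability is at most delta;
    returns None when even the largest order statistic fails (n too small)."""
    for k in range(1, n + 1):
        if violation_prob(k, n, alpha) <= delta:
            return k
    return None

# With alpha = delta = 0.05, the minimum workable sample size is n = 59,
# and there the only admissible choice is the largest order statistic.
print(k_star(59, 0.05, 0.05))   # 59
print(k_star(58, 0.05, 0.05))   # None, since (1 - 0.05)^58 > 0.05
```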
The key to this approach is that [83] established, for all \hat{\phi}_k(\cdot) = I\{\hat{T}(\cdot) > t_{(k)}\} with k \in \{1,...,n\}, that

P(R_0(\hat{\phi}_k) > \alpha) \le \sum_{j=k}^{n} \binom{n}{j} (1-\alpha)^j \alpha^{n-j},

where P corresponds to random draws of S^0 and S^1, as well as potential randomness in the classification method (e.g., random forest), and the inequality becomes an equality when \hat{T} is continuous almost surely. In view of this inequality and the definition of k^*, we have P(R_0(\hat{\phi}_{k^*}) > \alpha) \le \delta, and \hat{\phi}_{k^*} achieves the smallest type II error among the \hat{\phi}_k's that respect the (1-\delta) probability control of the type I error. We call this algorithm the original NP umbrella algorithm to contrast it with the newly developed versions.

3.5 Algorithm 3.1: label-noise-adjusted NP umbrella algorithm with known corruption levels.

Returning to our errors-in-labels problem, one may ask what would happen if we were to directly apply the original NP umbrella algorithm in the label noise setting. The results are mixed. While this algorithm successfully controls the type I error, it tends to be massively conservative, leading to very low type I errors but high type II errors. The next example illustrates this phenomenon.

Figure 3.2: The original NP umbrella algorithm vs. a label-noise-adjusted version for Example 3.2. The plots in the left panel (blue) are the true type I and II errors for the original NP umbrella algorithm. The plots in the right panel (orange) are the true type I and II errors for the label-noise-adjusted NP umbrella algorithm with known corruption levels. The black dot and vertical bar in every violin represent mean and standard deviation, respectively.
In the top row, the horizontal black line is \alpha = 0.05 and the boundaries between lighter and darker color in each violin plot mark the 1 - \delta = 95% quantiles.

Example 3.2. Let X^0 \sim N(0,1) and X^1 \sim N(2,1), m_0 = 0.85, m_1 = 0.15, \alpha = 0.05 and \delta = 0.05. For simplicity, we use the identity scoring function: \hat{T}(X) = X. We generate N \in \{200, 500, 1000, 2000\} corrupted class 0 observations and train a classifier \hat{\phi}_{k^*}(\cdot) based on them. Due to normality, we can analytically calculate the type I and II errors regarding the true labels. The above steps are repeated 1,000 times for every value of N to graph the violin plots of both errors, as shown in the left panel of Figure 3.2. Clearly, all the achieved true type I errors are much lower than the control target \alpha and the true type II errors are very high.‖

The phenomenon illustrated in the left panel of Figure 3.2 is not a contrived one. Indeed, under the class-conditional noise model (i.e., Assumption 3.1), at the same threshold level, the tail probability of corrupted class 0 is greater than that of true class 0, since the corrupted 0 distribution is a mixture of the true 0 and 1 distributions. Figure 3.3 provides further illustration. In this figure, the black vertical line

‖ To make a contrast, we also plot in the right panel of Figure 3.2 the true type I and II errors of \hat{\phi}_{k_*}(\cdot), the classifier constructed by the label-noise-adjusted NP umbrella algorithm with known corruption levels to be introduced in the next section. The details to generate the \hat{\phi}_{k_*}(\cdot)'s are skipped here, except we reveal that corrupted class 1 observations, in addition to the corrupted class 0 observations, are also needed to construct the thresholds.

Figure 3.3: The blue solid curve is the density of true class 0 (i.e., N(0,1)) and the orange dashed curve is the density of corrupted class 0 (i.e., a mixture of N(0,1) and N(2,1) with m_0 = 0.85).
The black vertical line marks the threshold of the classifier I\{X > 2.52\} whose corrupted type I error is 0.05.

(x = 2.52) marks the threshold of the classifier I\{X > 2.52\} whose corrupted type I error (i.e., the right tail probability under the orange dashed curve) is 0.05. In contrast, its true type I error (i.e., the right tail probability under the blue solid curve) is much smaller.

The above observation motivates us to create new label-noise-adjusted NP umbrella algorithms by carefully studying the discrepancy between the true and corrupted type I errors, whose population-level relation is channeled by the class-conditional noise model and can be estimated based on data with corrupted labels alone. We will first develop a version for known corruption levels (i.e., Algorithm 3.1) and then a variant for unknown corruption levels (i.e., Algorithm 3.1#). Although the latter variant is suitable for most applications, we believe that presenting the known-corruption-level version first streamlines the reasoning and presentation.

For methodology and theory development, we assume the following sampling scheme. Let \tilde{S}^0 = \{\tilde{X}^0_j\}_{j=1}^{N_0} be the corrupted class 0 observations and \tilde{S}^1 = \{\tilde{X}^1_j\}_{j=1}^{N_1} the corrupted class 1 ones. The sample sizes N_0 and N_1 are considered to be non-random numbers, and we assume that all observations in \tilde{S}^0 and \tilde{S}^1 are independent. Then, we divide \tilde{S}^0 into three random disjoint non-empty subsets. The first two parts, \tilde{S}^0_b and \tilde{S}^0_t, are used to train the base algorithm and determine the threshold candidates, respectively. The third part, \tilde{S}^0_e, is used to estimate a correction term to account for the label noise. Similarly, we randomly divide \tilde{S}^1 into two disjoint non-empty subsets: \tilde{S}^1_b and \tilde{S}^1_e. Let \hat{T}(\cdot) be a scoring function trained on \tilde{S}_b = \tilde{S}^0_b \cup \tilde{S}^1_b. We apply \hat{T}(\cdot) to the elements in \tilde{S}^0_t and sort them in increasing order: \{t_{(1)}, ..., t_{(n)}\}, where n = |\tilde{S}^0_t|.
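The conservativeness shown in Figure 3.3 can be reproduced numerically. The sketch below (function names ours) solves, by bisection, for the threshold whose corrupted type I error equals \alpha = 0.05 under Example 3.2's mixture, and then evaluates the true type I error at that same threshold:

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def corrupted_type1(t, m0=0.85, mu0=0.0, mu1=2.0):
    """Right tail of the corrupted class-0 mixture m0*N(mu0,1) + (1-m0)*N(mu1,1)."""
    return m0 * (1 - Phi(t - mu0)) + (1 - m0) * (1 - Phi(t - mu1))

# corrupted_type1 is strictly decreasing in t, so bisection finds the
# unique threshold with corrupted type I error exactly 0.05.
lo, hi = -10.0, 10.0
while hi - lo > 1e-10:
    mid = (lo + hi) / 2
    if corrupted_type1(mid) > 0.05:
        lo = mid
    else:
        hi = mid
t = (lo + hi) / 2
true_type1 = 1 - Phi(t)
print(round(t, 2))      # about 2.52, as in Figure 3.3
print(true_type1)       # far below the target 0.05
```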
These will serve as the threshold candidates, just as in the original NP umbrella algorithm. However, instead of k^*, the label-noise-adjusted NP umbrella algorithm with known corruption levels will take the order k_* defined by

k_* = \min\{k \in \{1,...,n\} : \alpha_{k,\delta} - \hat{D}^+(t_{(k)}) \le \alpha\},

where \alpha_{k,\delta}** satisfies

\sum_{j=k}^{n} \binom{n}{j} \alpha_{k,\delta}^{n-j} (1 - \alpha_{k,\delta})^j = \delta,   (3.3)

\hat{D}^+(\cdot) = \hat{D}(\cdot) \vee 0 := \max(\hat{D}(\cdot), 0), and

\hat{D}(\cdot) = \frac{1 - m_0}{m_0 - m_1} \left(\hat{\tilde{F}}^{\hat{T}}_0(\cdot) - \hat{\tilde{F}}^{\hat{T}}_1(\cdot)\right),

in which \hat{\tilde{F}}^{\hat{T}}_0(\cdot) and \hat{\tilde{F}}^{\hat{T}}_1(\cdot) are empirical estimates of \tilde{F}^{\hat{T}}_0(\cdot) and \tilde{F}^{\hat{T}}_1(\cdot) based on \tilde{S}^0_e and \tilde{S}^1_e, respectively. The entire construction process of \hat{\phi}_{k_*}(\cdot) = I\{\hat{T}(\cdot) > t_{(k_*)}\} is summarized and detailed in Algorithm 3.1. In this algorithm, to solve for \alpha_{k,\delta}, we use a binary search subroutine (Algorithm 3.2) on the function x \mapsto \sum_{j=k}^{n} \binom{n}{j} x^{n-j} (1-x)^j, leveraging its strict monotone decreasing property in x. Interested readers are referred to the proof of Lemma 3.2 for further reasoning. Currently we randomly split \tilde{S}^0 and \tilde{S}^1 respectively into three and two equal-sized subgroups. An optimal splitting strategy could be a subject for future research.

The key to the new algorithm is \hat{D}^+(\cdot), which adjusts for the label corruption. Indeed, the original NP umbrella algorithm can be seen as a special case of our approach where \hat{D}^+(\cdot) = 0.

** The existence and uniqueness of \alpha_{k,\delta} are ensured by Lemma 3.2.
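At the population level, the correction term targets exactly the gap between the corrupted and true type I errors: combining (3.1) with the CDF definitions gives \tilde{R}_0 - R_0 = \frac{1-m_0}{m_0-m_1}(\tilde{F}_0(t) - \tilde{F}_1(t)) at any threshold t. A quick numerical check of this identity in the Gaussian setting of Example 3.2 (this verification script is ours):

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

m0, m1 = 0.85, 0.15
t = 2.52  # the threshold from Figure 3.3

# True class CDFs for X0 ~ N(0,1) and X1 ~ N(2,1), and the corrupted
# mixtures from Assumption 3.1 evaluated at t.
F0, F1 = Phi(t), Phi(t - 2)
F0_tilde = m0 * F0 + (1 - m0) * F1
F1_tilde = m1 * F0 + (1 - m1) * F1

gap = F0 - F0_tilde  # = (1 - F0_tilde) - (1 - F0) = R~0 - R0
D = (1 - m0) / (m0 - m1) * (F0_tilde - F1_tilde)
print(gap, D)  # identical up to floating-point error
```

The sample-based \hat{D} in Algorithm 3.1 simply plugs empirical CDFs into the right-hand side of this identity.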
Algorithm 3.1: Label-noise-adjusted NP umbrella algorithm with known corruption levels

Input:
  \tilde{S}^0: sample of corrupted 0 observations
  \tilde{S}^1: sample of corrupted 1 observations
  \alpha: type I error upper bound, 0 < \alpha < 1
  \delta: type I error violation rate target, 0 < \delta < 1
  m_0: probability of a corrupted class 0 sample being of true class 0
  m_1: probability of a corrupted class 1 sample being of true class 0

1.  \tilde{S}^0_b, \tilde{S}^0_t, \tilde{S}^0_e ← random split of \tilde{S}^0
2.  \tilde{S}^1_b, \tilde{S}^1_e ← random split of \tilde{S}^1
3.  \tilde{S}_b ← \tilde{S}^1_b ∪ \tilde{S}^0_b                        // combine \tilde{S}^0_b and \tilde{S}^1_b as \tilde{S}_b
4.  \hat{T}(·) ← BaseClassificationAlgorithm(\tilde{S}_b)              // train a scoring function on \tilde{S}_b
5.  T_t = {t_1, t_2, ..., t_n} ← \hat{T}(\tilde{S}^0_t)                // apply \hat{T} to every entry in \tilde{S}^0_t
6.  {t_(1), t_(2), ..., t_(n)} ← sort(T_t)
7.  T^0_e ← \hat{T}(\tilde{S}^0_e)
8.  T^1_e ← \hat{T}(\tilde{S}^1_e)                                     // apply \hat{T} to all elements in \tilde{S}^0_e and \tilde{S}^1_e
9.  for k in {1, ..., n} do
10.   \alpha_{k,\delta} ← BinarySearch(\delta, k, n)                   // compute \alpha_{k,\delta} through binary search
11.   \hat{\tilde{F}}^{\hat{T}}_0(t_(k)) ← |T^0_e|^{-1} · \sum_{t \in T^0_e} I{t ≤ t_(k)}
12.   \hat{\tilde{F}}^{\hat{T}}_1(t_(k)) ← |T^1_e|^{-1} · \sum_{t \in T^1_e} I{t ≤ t_(k)}   // compute the empirical distributions
13.   \hat{D}(t_(k)) ← (1 - m_0)/(m_0 - m_1) · (\hat{\tilde{F}}^{\hat{T}}_0(t_(k)) - \hat{\tilde{F}}^{\hat{T}}_1(t_(k)))   // estimate \tilde{R}_0 - R_0
14.   \hat{D}^+(t_(k)) ← \hat{D}(t_(k)) ∨ 0                            // if \hat{D}(t_(k)) is negative, set it to 0
15. end
16. k_* ← min{k ∈ {1,...,n} : \alpha_{k,\delta} - \hat{D}^+(t_(k)) ≤ \alpha}   // select the order
17. \hat{\phi}_{k_*}(·) ← I{\hat{T}(·) > t_(k_*)}                      // construct an NP classifier
Output: \hat{\phi}_{k_*}(·)

The numerical advantage of the new algorithm is demonstrated in the right panel of Figure 3.2 and in Section 3.8. We will prove in the next section that the label-noise-adjusted NP classifier \hat{\phi}_{k_*}(\cdot) = I\{\hat{T}(\cdot) > t_{(k_*)}\} controls the true type I error with high probability while avoiding the excessive conservativeness of the original NP umbrella algorithm.
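The BinarySearch subroutine invoked in Algorithm 3.1 relies on the fact that h_k(x) = \sum_{j=k}^{n} \binom{n}{j} x^{n-j}(1-x)^j is strictly decreasing in x (established in Lemma 3.2), so the equation h_k(\alpha_{k,\delta}) = \delta has a unique root. A minimal sketch of the subroutine (function names are ours):

```python
from math import comb

def h(k, n, x):
    """h_k(x) = sum_{j=k}^{n} C(n, j) x^(n-j) (1-x)^j, strictly decreasing in x."""
    return sum(comb(n, j) * x ** (n - j) * (1 - x) ** j for j in range(k, n + 1))

def alpha_k_delta(k, n, delta, tol=1e-10):
    """Binary search for the unique root of h_k(alpha) = delta on (0, 1)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h(k, n, mid) > delta:  # h too large -> root lies at larger alpha
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# With k = n = 59, h_n(x) = (1 - x)^n, so the root has the closed form
# 1 - delta^(1/n); the binary search recovers it.
print(alpha_k_delta(59, 59, 0.05))  # about 0.0495
```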
Note that, in contrast to the deterministic order k^* in the original NP umbrella algorithm, the new order k_* is random, calling for much more involved technicalities to establish the theoretical properties of \hat{\phi}_{k_*}(\cdot).

Algorithm 3.2: Binary search for \alpha_{k,\delta}

Input:
  \delta: a small tolerance level, 0 < \delta < 1
  k, n: two integers such that k ≤ n
  r: a small error tolerance (we implement r = 10^{-5} in our numerical analysis)

1.  \alpha_min ← 0
2.  \alpha_max ← 1
3.  \delta_max ← \sum_{j=k}^{n} \binom{n}{j} (1 - \alpha_min)^j \alpha_min^{n-j}
4.  \delta_min ← \sum_{j=k}^{n} \binom{n}{j} (1 - \alpha_max)^j \alpha_max^{n-j}
5.  E ← 2
6.  while E > r do
7.    \alpha_middle ← (\alpha_min + \alpha_max)/2
8.    \delta_middle ← \sum_{j=k}^{n} \binom{n}{j} (1 - \alpha_middle)^j \alpha_middle^{n-j}
9.    if \delta_middle = \delta then output \alpha_middle
10.   else if \delta_middle > \delta then
11.     \alpha_min ← \alpha_middle
12.   else
13.     \alpha_max ← \alpha_middle
14.   end
15.   E ← |\delta_middle - \delta|
16. end
Output: \alpha_middle

3.6 Algorithm 3.1#: label-noise-adjusted NP umbrella algorithm with unknown corruption levels.

For most applications in practice, accurate corruption levels m_0 and m_1 are inaccessible. To address this, we propose Algorithm 3.1#, a simple variant of Algorithm 3.1 that replaces m_0 and m_1 with the estimates m_0^# and m_1^#. In all other respects the two algorithms are identical. Specifically, when estimating \tilde{R}_0 - R_0, Algorithm 3.1# uses

\hat{D}_#(t_{(k)}) = \frac{1 - m_0^#}{m_0^# - m_1^#} \left(\hat{\tilde{F}}^{\hat{T}}_0(t_{(k)}) - \hat{\tilde{F}}^{\hat{T}}_1(t_{(k)})\right) \quad and \quad \hat{D}^+_#(t_{(k)}) = \hat{D}_#(t_{(k)}) \vee 0.

Then, Algorithm 3.1# delivers the NP classifier \hat{\phi}_{k^*_#}(\cdot) = I\{\hat{T}(\cdot) > t_{(k^*_#)}\}, where k^*_# = \min\{k \in \{1,...,n\} : \alpha_{k,\delta} - \hat{D}^+_#(t_{(k)}) \le \alpha\}. Due to the similarity with Algorithm 3.1, we do not reproduce the other steps of Algorithm 3.1# in a full algorithm format.

Rather than supplying unbiased estimates of m_0 and m_1, we will demonstrate that it is important that m_0^# and m_1^# are under-estimates of the corruption levels (i.e., m_0^# \ge m_0 and m_1^# \le m_1, as in Assumption 3.1). In this work, we assume that domain experts supply these under-estimates.
While it would be unrealistic to assume that these experts know m_0 and m_1 exactly, in many scenarios one can provide accurate bounds on these quantities. It would be interesting to investigate data-driven estimators that have such a property in future work.

3.7 Theory.

In this section, we first elaborate the rationale behind Algorithm 3.1 (Section 3.7.2), and then show that under a few technical conditions, this new algorithm induces well-defined classifiers whose type I errors are bounded from above by the desired level with high probability (Section 3.7.3). Then we establish a similar result for its unknown-corruption-level variant, Algorithm 3.1# (Section 3.7.4).

3.7.1 Preliminaries.

Before presenting the main theoretical results, we first introduce some preliminary results and technical lemmas that will be useful in later proofs. We start with a theoretical guarantee for the NP umbrella algorithm.

The NP umbrella algorithm developed in [83] adapts all scoring-type classification methods (e.g., logistic regression, random forest, neural nets) so that the resulting classifiers have the type I error bounded from above by a user-specified level \alpha with pre-specified high probability 1 - \delta. In this section, we provide a description of the general NP umbrella algorithm for the readers' convenience. Let S = \{X_j, Y_j\}_{j=1}^{N} be the set of observations (without label corruption). Decompose S as S = S^0 \cup S^1, where S^0 is the set of all instances of class 0 and S^1 is the set of instances of class 1. Assume that the observations in S^0 and S^1 are independent. Split S^0 randomly into two parts, S^0_train and S^0_left-out. The sets S^1 and S^0_train are combined to train a scoring function T (e.g., the sigmoid function in logistic regression). Apply T to the features of all instances of S^0_left-out = \{X^0_1, ..., X^0_n\} and denote \{t_1, ..., t_n\} := \{T(X^0_1), ..., T(X^0_n)\}. Then we have

Proposition 3.1 ([83]).
Denote $\mathcal T = \{t_{(1)}, t_{(2)}, \dots, t_{(n)}\}$, where $t_{(1)} \le t_{(2)} \le \dots \le t_{(n)}$. Then, for any $\alpha \in (0,1)$, the classifier $\varphi_k = I\{T(X) > t_{(k)}\}$ satisfies
$$P\left(R_0(\varphi_k) > \alpha\right) \le \sum_{j=k}^{n} \binom{n}{j} \alpha^{n-j} (1-\alpha)^{j},$$
where the outer $P$ is taken with respect to the randomness of $\mathcal S$. Moreover, when $T(\cdot)$ is continuous almost surely, the above inequality holds with equality.

Hence, the classifier $\varphi(X) = I\{T(X) > t_{(k^*)}\}$ controls the type I error under $\alpha$ with probability at least $1-\delta$, where $k^*$ is the smallest integer $k$ in $\{1, 2, \dots, n\}$ such that
$$\sum_{j=k}^{n} \binom{n}{j} \alpha^{n-j} (1-\alpha)^{j} \le \delta.$$
The smallest such $k$ is chosen because we wish to achieve a type II error as small as possible. Recall that, when only corrupted labels are available, this proposition achieves high-probability control of the corrupted type I error $\tilde R_0$. We omit a proof of Proposition 3.1, as it follows the same argument as its counterpart in [83].

Next, we introduce some technical lemmas that will be useful in the proofs of the main results. The first is a simple extension of Assumption 3.1.

Lemma 3.1. Under Assumption 3.1, for any measurable function $T : \mathbb R^d \to \mathbb R$ and any number $z \in \mathbb R$, we have
$$\tilde F_0^{T}(z) = m_0 F_0^{T}(z) + (1-m_0) F_1^{T}(z) \quad\text{and}\quad \tilde F_1^{T}(z) = m_1 F_0^{T}(z) + (1-m_1) F_1^{T}(z).$$
Furthermore,
$$E\tilde X^0 = m_0\, EX^0 + (1-m_0)\, EX^1 \quad\text{and}\quad E\tilde X^1 = m_1\, EX^0 + (1-m_1)\, EX^1.$$

Proof. The first two equations can be proved in the same way, so we only show the first. By Assumption 3.1, for any Borel set $A$,
$$\tilde P_0\left(T^{-1}(A)\right) = m_0 P_0\left(T^{-1}(A)\right) + (1-m_0) P_1\left(T^{-1}(A)\right).$$
Selecting $A = (-\infty, z]$ yields the result. The proofs of the last two equations are likewise similar in nature, so we only show $E\tilde X^0 = m_0 EX^0 + (1-m_0) EX^1$. Note that, by Assumption 3.1,
$$E\tilde X^0 = \int_0^{\infty}\left(1 - \tilde P_0(X \le x)\right)dx - \int_{-\infty}^{0}\tilde P_0(X \le x)\,dx = m_0\left[\int_0^{\infty}\left(1 - P_0(X \le x)\right)dx - \int_{-\infty}^{0}P_0(X \le x)\,dx\right] + (1-m_0)\left[\int_0^{\infty}\left(1 - P_1(X \le x)\right)dx - \int_{-\infty}^{0}P_1(X \le x)\,dx\right] = m_0\, EX^0 + (1-m_0)\, EX^1.$$
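The order selection implied by Proposition 3.1 is easy to compute directly. Below is a minimal Python sketch (the function name `np_order` is ours, not from the thesis): it returns the smallest $k$ whose binomial tail is at most $\delta$, or `None` when even $k = n$ fails, i.e., when $(1-\alpha)^n > \delta$.

```python
from math import comb

def np_order(n, alpha, delta):
    """Smallest k in {1,...,n} with sum_{j=k}^n C(n,j)(1-alpha)^j alpha^(n-j) <= delta,
    i.e. the order k* of the original NP umbrella algorithm (Proposition 3.1).
    Returns None when (1-alpha)^n > delta, so no valid order exists."""
    for k in range(1, n + 1):
        tail = sum(comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
                   for j in range(k, n + 1))
        if tail <= delta:
            return k
    return None
```

For example, with $\alpha = \delta = 0.05$, the minimal sample size $n = 59$ admits only the last order, `np_order(59, 0.05, 0.05) == 59`, while `np_order(58, 0.05, 0.05)` returns `None`, matching the sample-size requirement discussed below.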
Next, we present a technical lemma that validates the existence of $\alpha_{k,\delta}$; it also provides another useful interpretation of $k^*$.

Lemma 3.2. For any $k \in \{1,\dots,n\}$ and $\delta \in (0,1)$, a unique $\alpha_{k,\delta}$ exists. Moreover, under Assumption 3.2, $k^* = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} \le \alpha\}$.

Proof. Let $h_k(x) = \sum_{j=k}^{n} \binom{n}{j} x^{n-j}(1-x)^{j}$ for any $k \in \{1,\dots,n\}$. Then one can show that, for $k \le n-1$ and $x \in (0,1)$,
$$h_k'(x) = \sum_{j=k}^{n-1} (n-j)\binom{n}{j} x^{n-j-1}(1-x)^{j} - \sum_{j=k}^{n} j\binom{n}{j} x^{n-j}(1-x)^{j-1} = n\sum_{i=k+1}^{n} \binom{n-1}{i-1} x^{n-i}(1-x)^{i-1} - n\sum_{j=k}^{n} \binom{n-1}{j-1} x^{n-j}(1-x)^{j-1} = -n\binom{n-1}{k-1} x^{n-k}(1-x)^{k-1},$$
which is negative. Thus $h_k(x)$ is strictly decreasing on $(0,1)$ for $k \le n-1$. Furthermore, $h_n(x) = (1-x)^n$, which is also strictly decreasing on $(0,1)$. Since $h_k(0) = 1$ and $h_k(1) = 0$ for every $k$, there exists a unique $\alpha_{k,\delta}$ such that $h_k(\alpha_{k,\delta}) = \delta$.

Recall that $k^*$ is defined as the smallest $k$ such that $h_k(\alpha) \le \delta$. Meanwhile, by monotonicity, for any $k$ the inequality $h_k(\alpha) \le \delta$ is equivalent to $\alpha_{k,\delta} \le \alpha$. Assumption 3.2 guarantees the existence of $k$ such that $h_k(\alpha) \le \delta$, and therefore also the existence of $k$ such that $\alpha_{k,\delta} \le \alpha$. Hence, for any $\delta$, $\{k \in \{1,\dots,n\} : h_k(\alpha) \le \delta\} = \{k \in \{1,\dots,n\} : \alpha_{k,\delta} \le \alpha\}$, and so $k^* = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} \le \alpha\}$.

3.7.2 Rationale behind Algorithm 3.1.

For $\alpha, \delta \in (0,1)$, recall that the original NP umbrella algorithm selects $k^* = \min\{k \in \{1,\dots,n\} : \sum_{j=k}^{n}\binom{n}{j}(1-\alpha)^{j}\alpha^{n-j} \le \delta\}$. The smallest $k$ among all that satisfy this inequality is desirable because we also wish to minimize the type II error. There is a sample size requirement for this order statistics approach to work, because a finite order $k^*$ must exist. Precisely, the order statistics approach works if the last order does, that is, if $(1-\alpha)^n \le \delta$. This translates into Assumption 3.2 on $n$, the sample size of $\tilde{\mathcal S}_t^0$. This is a mild requirement: for instance, when $\alpha = \delta = 0.05$, $n$ should be at least $59$.

Assumption 3.2. $n \ge \lceil \log\delta / \log(1-\alpha)\rceil$, in which $\lceil\cdot\rceil$ denotes the ceiling function.
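Because Lemma 3.2 shows that $h_k$ is strictly decreasing from $1$ to $0$ on $(0,1)$, the quantity $\alpha_{k,\delta}$ can be found by bisection, as in Algorithm 3.2. A minimal Python sketch (function names are ours; the iteration cap is a safety device, not part of Algorithm 3.2):

```python
from math import comb

def h(k, n, x):
    # h_k(x) = sum_{j=k}^n C(n,j) x^(n-j) (1-x)^j, as in the proof of Lemma 3.2
    return sum(comb(n, j) * x ** (n - j) * (1 - x) ** j for j in range(k, n + 1))

def alpha_kdelta(k, n, delta, r=1e-10):
    """Bisection for the unique alpha with h_k(alpha) = delta (cf. Algorithm 3.2).
    h_k is strictly decreasing with h_k(0) = 1 and h_k(1) = 0."""
    lo, hi = 0.0, 1.0
    mid = 0.5
    for _ in range(200):          # cap iterations; interval halves each step
        mid = (lo + hi) / 2
        val = h(k, n, mid)
        if abs(val - delta) <= r:
            break
        if val > delta:           # h_k decreasing: need a larger alpha
            lo = mid
        else:
            hi = mid
    return mid
```

For instance, with $k = n = 59$ and $\delta = 0.05$, $h_{59}(x) = (1-x)^{59}$, so the root is $1 - 0.05^{1/59} \approx 0.0495$, which is indeed at most $\alpha = 0.05$, consistent with Lemma 3.2 and the sample-size discussion above.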
In view of Proposition 3.1, the choice of $k^*$ guarantees $P\left(\tilde R_0(\hat\varphi_{k^*}) \le \alpha\right) \ge 1-\delta$. In other words, if we were to ignore the presence of label noise and apply the original NP umbrella algorithm, the type I error with respect to the corrupted labels, $\tilde R_0$, would be controlled under level $\alpha$ with probability at least $1-\delta$. Moreover, the achieved $\tilde R_0$ is usually not far from $\alpha$ when the sample size $n$ is much larger than the lower-bound requirement. However, this is not our main target; what we really want is to control $R_0$. Example 3.2 in Section 3.4 convincingly demonstrates that, in the presence of label noise, the achieved $R_0$ after naive application of the original NP umbrella algorithm can be much lower than the control target $\alpha$, and that example is no exception. To aid in analyzing the gap between $R_0$ and $\tilde R_0$, we make the following assumption.

Assumption 3.3. The scoring function $\hat T$ is trained such that $\tilde F_0^{\hat T}(z) > \tilde F_1^{\hat T}(z)$ for all $z \in \mathbb R$ with probability at least $1-\delta_1(n_b)$, where $n_b = |\tilde{\mathcal S}_b|$ and $\delta_1(n_b)$ converges to $0$ as $n_b$ goes to infinity.

Loosely speaking, Assumption 3.3 means that the scoring function trained on corrupted data still has the "correct direction." For any classifier of the form $\hat\varphi_c(\cdot) = I\{\hat T(\cdot) > c\}$, Assumption 3.3 implies that, with probability at least $1-\delta_1(n_b)$, $\tilde P_0(\hat\varphi_c(X) = 0) > \tilde P_1(\hat\varphi_c(X) = 0)$, which means that a corrupted class-0 observation is more likely to be classified as $0$ than a corrupted class-1 observation is. Readers can find a concrete example that illustrates this mild assumption in Example 3.3.

Example 3.3. Under the same distributional setting as in Example 3.1, let $\hat T$ be trained by linear discriminant analysis (LDA) on $\tilde{\mathcal S}_b$; that is, $\hat T(X) = \hat{\tilde\sigma}^{-2}(\hat{\tilde\mu}_1 - \hat{\tilde\mu}_0)X$, in which $\hat{\tilde\mu}_0$ and $\hat{\tilde\mu}_1$ are the sample means of the corrupted class-0 and class-1 observations, respectively, and $\hat{\tilde\sigma}^2$ is the pooled sample variance. For any $z \in \mathbb R$, by Lemma 3.1, we have
$$\tilde F_0^{\hat T}(z) - \tilde F_1^{\hat T}(z) = (m_0 - m_1)\left(F_0^{\hat T}(z) - F_1^{\hat T}(z)\right).$$
Therefore, when $m_0 > m_1$ (as assumed in Assumption 3.1), $\tilde F_0^{\hat T}(z) > \tilde F_1^{\hat T}(z)$ is equivalent to $F_0^{\hat T}(z) > F_1^{\hat T}(z)$. First fix $\tilde{\mathcal S}_b$; then $\hat T(X^0) \sim \mathcal N\!\left(\hat{\tilde\sigma}^{-2}(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)\mu_0,\ \hat{\tilde\sigma}^{-4}(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)^2\sigma^2\right)$ and $\hat T(X^1) \sim \mathcal N\!\left(\hat{\tilde\sigma}^{-2}(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)\mu_1,\ \hat{\tilde\sigma}^{-4}(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)^2\sigma^2\right)$. Since these two distributions are normal with the same variance and different means, $F_0^{\hat T}(z) > F_1^{\hat T}(z)$ as long as $\hat{\tilde\sigma}^{-2}(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)\mu_0 < \hat{\tilde\sigma}^{-2}(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)\mu_1$, or equivalently, $(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)(\mu_1-\mu_0) > 0$. By Lemma 3.1, this condition can be written as $(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)(\tilde\mu_1-\tilde\mu_0)/(m_0-m_1) > 0$, where $\tilde\mu_0$ and $\tilde\mu_1$ are the means of $\tilde X^0$ and $\tilde X^1$, respectively. When $m_0 > m_1$, this is further equivalent to $(\hat{\tilde\mu}_1-\hat{\tilde\mu}_0)(\tilde\mu_1-\tilde\mu_0) > 0$. Then Assumption 3.3 follows from the law of large numbers.

Now we are ready to describe the discrepancy between $R_0$ and $\tilde R_0$.

Lemma 3.3. Let $\hat T$ be a scoring function trained on $\tilde{\mathcal S}_b$ and $\hat\varphi_c(\cdot) = I\{\hat T(\cdot) > c\}$ be a classifier that thresholds the scoring function at $c \in \mathbb R$. Denote $D(c) = \tilde R_0(\hat\varphi_c) - R_0(\hat\varphi_c)$. Then, under Assumptions 3.1, 3.2 and 3.3, for given $\alpha$ and $\delta$, it holds that
$$P\left(\inf_{c\in\mathbb R} D(c) \ge 0\right) \ge 1-\delta_1(n_b) \quad\text{and}\quad P\left(R_0(\hat\varphi_{k^*}) > \alpha - D(t_{(k^*)})\right) \le \delta + \delta_1(n_b),$$
where $k^*$ and $\delta$ are related via equation (3.2). Moreover, we have
$$D(c) = M\left(\tilde F_0^{\hat T}(c) - \tilde F_1^{\hat T}(c)\right), \tag{3.4}$$
where $M = (1-m_0)(m_0-m_1)^{-1}$.

Proof. Let us focus on the event in the statement of Assumption 3.3, whose complement has probability at most $\delta_1(n_b)$. By Lemma 3.1, for any $z \in \mathbb R$,
$$\tilde F_0^{\hat T}(z) - \tilde F_1^{\hat T}(z) = \left[m_0 F_0^{\hat T}(z) + (1-m_0)F_1^{\hat T}(z)\right] - \left[m_1 F_0^{\hat T}(z) + (1-m_1)F_1^{\hat T}(z)\right] = (m_0-m_1)\left(F_0^{\hat T}(z) - F_1^{\hat T}(z)\right).$$
Furthermore, for any classifier $\varphi(X) = I\{\hat T(X) > z\}$,
$$\tilde R_0(\varphi) - R_0(\varphi) = \left(1 - \tilde F_0^{\hat T}(z)\right) - \left(1 - F_0^{\hat T}(z)\right) = F_0^{\hat T}(z) - m_0 F_0^{\hat T}(z) - (1-m_0)F_1^{\hat T}(z) = (1-m_0)\left(F_0^{\hat T}(z) - F_1^{\hat T}(z)\right),$$
which is positive by Assumption 3.3.
Now, since $D(z) = \tilde R_0(\varphi) - R_0(\varphi) > 0$, the event $R_0(\hat\varphi_{k^*}) > \alpha - D(t_{(k^*)})$ is equivalent to $\tilde R_0(\hat\varphi_{k^*}) > \alpha$, whose probability is at most $\delta$ by Proposition 3.1. To this end, we have shown
$$P\left(R_0(\hat\varphi_{k^*}) > \alpha - D(t_{(k^*)})\right) \le \delta + \delta_1(n_b).$$

Note that $D(c)$ measures the discrepancy between the corrupted type I error and the true type I error of the classifier $\hat\varphi_c(\cdot)$. Lemma 3.3 implies that, with high probability, $\hat\varphi_{k^*}(\cdot)$ has $R_0$, the type I error with respect to the true labels, under a level smaller than the target value $\alpha$, and the gap is measured by $D(t_{(k^*)})$. It is important to note that $D(c)$ is solely a function of the distributions of the corrupted data and does not require any knowledge of the uncorrupted scores, so we are able to estimate this quantity from our observed data.

As argued previously, excessive conservativeness in type I error is not desirable because it is usually associated with a high type II error. Therefore, a new NP umbrella algorithm should adjust to the label noise, so that the resulting classifier respects the true type I error control target but is not excessively conservative. Motivated by Lemma 3.3, our central plan is to choose some less conservative (i.e., smaller) order than that in the original NP umbrella algorithm, in view of the difference between $R_0$ and $\tilde R_0$. Recall that $\delta \in (0,1)$ is the target type I error violation rate. In the presence of label noise, we do not expect to control at this precise violation rate, but only at some number around it.

For any $\hat\varphi_k(\cdot)$, under Assumptions 3.1, 3.2 and 3.3, Lemma 3.3 implies $\tilde R_0(\hat\varphi_k) \ge R_0(\hat\varphi_k)$ with probability at least $1-\delta_1(n_b)$. Since the $\delta_1(n_b)$ term is small and asymptotically $0$, we ignore it in this section when motivating our new strategy. With this simplification, $\tilde R_0(\hat\varphi_k)$ is always greater than $R_0(\hat\varphi_k)$, as illustrated in Figure 3.4.
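To make the discrepancy $D(c)$ of Lemma 3.3 concrete, it can be evaluated in closed form in a univariate Gaussian setting in the spirit of Example 3.3; the specific parameter values and the identity scoring function below are our own illustrative choices, not the thesis's exact settings.

```python
from statistics import NormalDist

# Hypothetical setting: class 0 ~ N(0,1), class 1 ~ N(2,1),
# identity scoring function T(x) = x, corruption levels m0 > m1.
m0, m1 = 0.95, 0.05
F0 = NormalDist(0.0, 1.0).cdf     # true class-0 score CDF
F1 = NormalDist(2.0, 1.0).cdf     # true class-1 score CDF

def F0_tilde(z):                  # corrupted CDFs via Lemma 3.1
    return m0 * F0(z) + (1 - m0) * F1(z)

def F1_tilde(z):
    return m1 * F0(z) + (1 - m1) * F1(z)

M = (1 - m0) / (m0 - m1)

def D(z):                         # discrepancy, equation (3.4)
    return M * (F0_tilde(z) - F1_tilde(z))

# Sanity check: D also equals (1 - m0)(F0 - F1), as in the proof of Lemma 3.3
z = 1.0
assert abs(D(z) - (1 - m0) * (F0(z) - F1(z))) < 1e-12
```

Since $F_0(z) > F_1(z)$ here for every $z$, $D(z)$ is strictly positive, matching the first claim of Lemma 3.3; in practice only the corrupted CDFs are observable, which is exactly why (3.4) is useful.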
The definition of $\alpha_{k,\delta}$ in equation (3.3) and Proposition 3.1 imply that, with probability at least $1-\delta$, $\alpha_{k,\delta} \ge \tilde R_0(\hat\varphi_k)$; this corresponds to the green region (the region on the right) in Figure 3.4. Since we only need $1-\delta$ probability control on $R_0$, it suffices to control $R_0$ on this region. Combining the results $\alpha_{k,\delta} \ge \tilde R_0(\hat\varphi_k)$ and $\tilde R_0(\hat\varphi_k) \ge R_0(\hat\varphi_k)$, we have the inequalities $\alpha_{k,\delta} \ge \alpha_{k,\delta} - D(t_{(k)}) \ge R_0(\hat\varphi_k)$ on the region of interest (recall $D(t_{(k)}) = \tilde R_0(\hat\varphi_k) - R_0(\hat\varphi_k)$). By the previous argument, $\alpha_{k,\delta}$ can be used as an upper bound for $R_0$, but to obtain a good type II error, a better choice is clearly the smaller $\alpha_{k,\delta} - D(t_{(k)})$. So, if $D(t_{(k)})$ were a known quantity, we could set the order to be
$$\tilde k_* = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - D(t_{(k)}) \le \alpha\}$$
and propose the classifier $\hat\varphi_{\tilde k_*}(\cdot) = I\{\hat T(\cdot) > t_{(\tilde k_*)}\}$. This is to be compared with the order $k^*$ chosen by the original NP umbrella algorithm, which can be equivalently expressed as $k^* = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} \le \alpha\}$ (Lemma 3.2). Then $\tilde k_* \le k^*$, and so $\hat\varphi_{\tilde k_*}(\cdot)$ is less conservative than $\hat\varphi_{k^*}(\cdot)$ in terms of type I error.

However, $\hat\varphi_{\tilde k_*}(\cdot)$ is not accessible because $D$ is unknown. Instead, we estimate $D$ by replacing $\tilde F_0^{\hat T}$ and $\tilde F_1^{\hat T}$ in (3.4) with their empirical counterparts $\hat{\tilde F}_0^{\hat T}$ and $\hat{\tilde F}_1^{\hat T}$, which are calculated using $\tilde{\mathcal S}_e^0$ and $\tilde{\mathcal S}_e^1$, i.i.d. samples of the corrupted class-0 and class-1 observations. Note that these estimates are independent of $\tilde{\mathcal S}_b$ and $\tilde{\mathcal S}_t^0$. For a given $\hat T$, we define, for every $c \in \mathbb R$,
$$\hat D(c) = \frac{1-m_0}{m_0-m_1}\left(\hat{\tilde F}_0^{\hat T}(c) - \hat{\tilde F}_1^{\hat T}(c)\right) \quad\text{and}\quad k^{**} = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D(t_{(k)}) \le \alpha - \varepsilon\},$$

[Figure 3.4: A cartoon illustration of the $1-\delta$ probability upper bound of the type I error. The panel plots the corrupted type I error $\tilde R_0(\hat\varphi_k)$ and the true type I error $R_0(\hat\varphi_k)$ against the threshold $t_{(k)}$, together with the bounds $\alpha_{k,\delta}$ and $\alpha_{k,\delta} - D(t_{(k)})$.]
in which a small $\varepsilon > 0$ is introduced to compensate for the randomness of $\hat D$ in the theoretical proofs. For simulations and real data, we simply use $\varepsilon = 0$. Finally, the proposed label-noise-adjusted NP classifier with known corruption levels is $\hat\varphi_{k_*}(\cdot) = I\{\hat T(\cdot) > t_{(k_*)}\}$, in which $k_*$ is a small twist on $k^{**}$ obtained by replacing $\hat D$ with its positive part. The construction of $\hat\varphi_{k_*}(\cdot)$ was detailed in Algorithm 3.1.

We have two comments on the implementation of Algorithm 3.1. First, although the $\varepsilon$ compensation for randomness is necessary for the theoretical proof, our empirical results suggest almost identical performance for $\varepsilon = 0$ and any small $\varepsilon > 0$, so we recommend setting $\varepsilon$ to $0$ for simplicity, and we do not use the $\varepsilon$ compensation in Algorithm 3.1. Second, in the order selection criterion for $k_*$ in Algorithm 3.1, we use $\hat D^+ = \hat D \vee 0 := \max(\hat D, 0)$ instead of $\hat D$ because, although it is highly unlikely, $\hat D$ can be negative empirically, which results in $\min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D(t_{(k)}) \le \alpha\} \ge \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} \le \alpha\}$. In this case, the new order could be greater than $k^*$. Since we aim to reduce the conservativeness of the original NP umbrella algorithm, the possibility of $k_* \ge k^*$ would reverse this effort and worsen the conservativeness. To avoid this, we force the empirical version of $D$ to be non-negative by replacing $\hat D$ with $\hat D^+$ in Algorithm 3.1.

3.7.3 Theoretical properties of Algorithm 3.1.

In this subsection, we first formally establish that Algorithm 3.1 gives rise to valid classifiers (Lemma 3.4) and then show that these classifiers have their true type I errors controlled under $\alpha$ with high probability (Theorem 3.1).

Lemma 3.4. Under Assumption 3.2, $k_* = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha\}$ in Algorithm 3.1 exists. Moreover, this label-noise-adjusted order is no larger than the one chosen by the original NP umbrella algorithm; that is, $k_* \le k^*$.

Proof. By Lemma 3.2, the set $\{k \in \{1,\dots,n\} : \alpha_{k,\delta} \le \alpha\}$ is non-empty.
Since $\hat D^+(t_{(k)})$ is non-negative, the set $\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha\}$ is then also non-empty, so $k_* = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha\}$ exists. Note that $k^* = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} \le \alpha\}$ by Lemma 3.2. Since $\{k \in \{1,\dots,n\} : \alpha_{k,\delta} \le \alpha\}$ is a subset of $\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha\}$ by the non-negativity of $\hat D^+$, we conclude that $k_* \le k^*$.

Lemma 3.4 implies that Algorithm 3.1 reduces the excessive conservativeness of the original NP umbrella algorithm in the type I error by choosing a smaller order statistic as the threshold. Moreover, if there is no label noise, i.e., when $m_0 = 1$ and $m_1 = 0$, we have $k_* = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} \le \alpha\} = k^*$; that is, Algorithm 3.1 reduces to the original NP umbrella algorithm.

Another important question is whether Algorithm 3.1 can control the true type I error with high probability. The following condition is assumed for the rest of this section.

Assumption 3.4. The scoring function $\hat T$ is trained from a class of functions $\mathcal T$ such that the density functions of both $\hat T(\tilde X^0)$ and $\hat T(\tilde X^1)$ exist for every $\hat T \in \mathcal T$; we denote these two densities by $\tilde f_0^{\hat T}$ and $\tilde f_1^{\hat T}$, respectively. Furthermore, $\sup_{\hat T\in\mathcal T}\|\tilde f_0^{\hat T} \vee \tilde f_1^{\hat T}\|_\infty \le C$ and $\inf_{\hat T\in\mathcal T}\inf_{z\in\mathcal D_{\hat T}}\tilde f_0^{\hat T}(z) > c$ for some positive $c$ and $C$ with probability $1-\delta_2(n_b)$, where $\mathcal D_{\hat T}$, the support of $\tilde f_0^{\hat T}$, is a closed interval, and $\delta_2(n_b)$ converges to $0$ as $n_b$ goes to infinity.

Assumption 3.4 summarizes conditions that we impose for technical convenience in establishing the next theorem. In particular, we assume the existence of the densities $\tilde f_0^{\hat T}$ and $\tilde f_1^{\hat T}$, which holds if $\tilde X^0$ and $\tilde X^1$ have densities and $\hat T(\cdot)$ is smooth. Moreover, we assume that, with high probability, both densities are uniformly bounded from above and $\tilde f_0^{\hat T}(\cdot)$ is bounded uniformly from below.
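Putting the preceding pieces together, the label-noise-adjusted order selection can be sketched end to end. The following Python sketch implements the $\varepsilon = 0$ version described above (function names and data layout are ours; it is an illustration of the selection rule, not the thesis's exact implementation):

```python
from math import comb
from bisect import bisect_right

def binom_tail(k, n, alpha):
    # sum_{j=k}^n C(n,j) (1-alpha)^j alpha^(n-j), cf. Proposition 3.1
    return sum(comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
               for j in range(k, n + 1))

def ecdf(sorted_sample, c):
    # empirical CDF of a sorted sample evaluated at c
    return bisect_right(sorted_sample, c) / len(sorted_sample)

def adjusted_order(scores0, e0, e1, m0, m1, alpha, delta):
    """Label-noise-adjusted order k_* (epsilon = 0 version).
    scores0: corrupted class-0 scores playing the role of S0_t;
    e0, e1: corrupted class-0 / class-1 scores from the estimation sets."""
    t = sorted(scores0)
    e0, e1 = sorted(e0), sorted(e1)
    n = len(t)
    M = (1 - m0) / (m0 - m1)
    for k in range(1, n + 1):
        # alpha_{k,delta}: bisection on the decreasing map a -> binom_tail(k,n,a)
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if binom_tail(k, n, mid) > delta:
                lo = mid
            else:
                hi = mid
        a_kd = (lo + hi) / 2
        D_hat = M * (ecdf(e0, t[k - 1]) - ecdf(e1, t[k - 1]))
        if a_kd - max(D_hat, 0.0) <= alpha:     # use D_hat^+ = D_hat v 0
            return k                             # threshold is t[k-1] = t_(k)
    return None
```

With $m_0 = 1$ and $m_1 = 0$ (no label noise), $M = 0$ and the rule reduces to the original order $k^*$ of Lemma 3.2, consistent with the remark after Lemma 3.4; with genuine noise the returned order is smaller, i.e., a less conservative threshold.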
Recall that in Algorithm 3.1 we set $k_* = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha\}$, without an $\varepsilon$ term. Setting $\varepsilon = 0$ is intuitively reasonable: when the sample size is small, the sets $\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha - \varepsilon\}$ and $\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha\}$ agree with high probability, and when the sample size is large, concentration of the random variables takes effect, so there is little need to compensate for randomness. Our simulation results further reinforce this intuition. However, we include an $\varepsilon$ term in the next theorem, as it is required for our proof of the theory to hold.

Theorem 3.1. Under Assumptions 3.1, 3.2, 3.3 and 3.4, the classifier $\hat\varphi_{k_*}(\cdot)$ given by Algorithm 3.1 with $k_* = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha - \varepsilon\}$ satisfies
$$P\left(R_0(\hat\varphi_{k_*}) > \alpha\right) \le \delta + \delta_1(n_b) + \delta_2(n_b) + 2e^{-8^{-1}nM^{-2}C^{-2}c^2\varepsilon^2} + 2e^{-8^{-1}n_e^0 M^{-2}\varepsilon^2} + 2e^{-8^{-1}n_e^1 M^{-2}\varepsilon^2},$$
in which $n_b = |\tilde{\mathcal S}_b|$, $n = |\tilde{\mathcal S}_t^0|$, $n_e^0 = |\tilde{\mathcal S}_e^0|$, and $n_e^1 = |\tilde{\mathcal S}_e^1|$.

Proof. Let us focus on the event on which the statements of both Assumptions 3.3 and 3.4 hold; its complement has probability at most $\delta_1(n_b) + \delta_2(n_b)$. Then, let
$$\mathcal B_e = \left\{\sup_{z\in\mathbb R}\left|\hat D(z) - D(z)\right| \le 2^{-1}\varepsilon\right\}.$$
It follows from the Dvoretzky–Kiefer–Wolfowitz inequality [59] that
$$P(\mathcal B_e^c) \le P\left(\sup_{z\in\mathbb R}\left|\hat{\tilde F}_0^{\hat T}(z) - \tilde F_0^{\hat T}(z)\right| > \frac{\varepsilon}{4M}\right) + P\left(\sup_{z\in\mathbb R}\left|\hat{\tilde F}_1^{\hat T}(z) - \tilde F_1^{\hat T}(z)\right| > \frac{\varepsilon}{4M}\right) \le 2e^{-8^{-1}n_e^0 M^{-2}\varepsilon^2} + 2e^{-8^{-1}n_e^1 M^{-2}\varepsilon^2}.$$
Note that, since $D(z)$ is non-negative by Lemma 3.3, $|\hat D^+(z) - D(z)| \le |\hat D(z) - D(z)| \le 2^{-1}\varepsilon$ on $\mathcal B_e$. So one can conclude that, on the event $\mathcal B_e$, $k_*$ is chosen from all $k$ such that $\alpha_{k,\delta} - D(t_{(k)}) \le \alpha - 2^{-1}\varepsilon$. Furthermore, denote $c_k = \inf\{y : \tilde F_0^{\hat T}(y) \ge kn^{-1}\}$ and $k_0 = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - D(c_k) \le \alpha - 4^{-1}\varepsilon\}$. Since $\mathcal D_{\hat T}$ is a closed interval, $c_k$ is well defined. Let $\tilde F_n^{\hat T}$ be the empirical distribution induced by $\mathcal T_t$; i.e., for any $z \in \mathbb R$,
$$\tilde F_n^{\hat T}(z) = \frac{1}{n}\sum_{t\in\mathcal T_t} I\{t \le z\}.$$
Denote $\mathcal B_t = \{\sup_{z\in\mathbb R}|\tilde F_n^{\hat T}(z) - \tilde F_0^{\hat T}(z)| \le 4^{-1}M^{-1}C^{-1}c\varepsilon\}$. Then, by the Dvoretzky–Kiefer–Wolfowitz inequality, $P(\mathcal B_t^c) \le 2e^{-8^{-1}nM^{-2}C^{-2}c^2\varepsilon^2}$. So it remains to show that the probability of the true type I error exceeding $\alpha$ is bounded by $\delta$ on the set $\mathcal B_t \cap \mathcal B_e$; until the end of the proof, we focus on the intersection of both sets. Note that $\tilde F_n^{\hat T}(t_{(k)}) = kn^{-1}$. A Taylor expansion then implies
$$\tilde F_n^{\hat T}(t_{(k)}) - \tilde F_0^{\hat T}(t_{(k)}) = \tilde F_n^{\hat T}(t_{(k)}) - \tilde F_0^{\hat T}(c_k) - \tilde f_0^{\hat T}(c_k^*)\left(t_{(k)} - c_k\right) = -\tilde f_0^{\hat T}(c_k^*)\left(t_{(k)} - c_k\right),$$
where $c_k^*$ lies between $c_k$ and $t_{(k)}$. The above equation, combined with the lower bound on $\tilde f_0^{\hat T}$ provided by Assumption 3.4, implies $|t_{(k)} - c_k| \le 4^{-1}M^{-1}C^{-1}\varepsilon$ for every $k$. Furthermore, $D(t_{(k)}) - D(c_k) = M\left(\tilde f_0^{\hat T}(c_k^{**}) - \tilde f_1^{\hat T}(c_k^{**})\right)\left(t_{(k)} - c_k\right)$ for some $c_k^{**}$ between $c_k$ and $t_{(k)}$. Therefore, the upper bound in Assumption 3.4 implies $|D(t_{(k)}) - D(c_k)| \le 4^{-1}\varepsilon$. Suppose $k_* = k'$; then
$$\alpha_{k',\delta} - D(c_{k'}) \le \alpha_{k',\delta} - D(t_{(k')}) + 4^{-1}\varepsilon \le \alpha - 4^{-1}\varepsilon,$$
and thus $k_* \ge k_0$. Furthermore, we also have
$$\alpha_{k_0,\delta} - D(t_{(k_0)}) \le \alpha_{k_0,\delta} - D(c_{k_0}) + 4^{-1}\varepsilon \le \alpha.$$
Recall that $D(t_{(k_0)}) = \tilde R_0(\hat\varphi_{k_0}) - R_0(\hat\varphi_{k_0})$. Therefore, $R_0(\hat\varphi_{k_0}) > \alpha$ implies $\tilde R_0(\hat\varphi_{k_0}) > \alpha_{k_0,\delta}$, whose probability is bounded by $\delta$ by Proposition 3.1. Finally, since $k_* \ge k_0$, we have $t_{(k_*)} \ge t_{(k_0)}$ and hence $R_0(\hat\varphi_{k_*}) \le R_0(\hat\varphi_{k_0})$, which completes the proof.

Note that the upper bound for $P\left(R_0(\hat\varphi_{k_*}) > \alpha\right)$ is $\delta$, our violation-rate control target, plus a few terms that converge to zero as the sample sizes increase. To establish this inequality, we first exclude the complements of the events described in Assumptions 3.3 and 3.4. We then further restrict ourselves to the event, constructed via a Glivenko–Cantelli-type inequality, on which $\hat D$ and $D$ differ by at most $2^{-1}\varepsilon$; there, the order selection criterion can be written as $k_* = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - D(t_{(k)}) \le \alpha - 2^{-1}\varepsilon\}$. The main difficulty of the proof is to handle the randomness of the threshold $t_{(k_*)}$.
Unlike the deterministic order $k^*$ in the original NP umbrella algorithm, the new order $k_*$ is stochastic. As such, even conditionally on $\hat T$, $t_{(k_*)}$ is still random and cannot be handled as an ordinary order statistic. Our solution is to find a high-probability deterministic lower bound for $t_{(k_*)}$. To do this, we introduce $c_k$, the $k/n$ quantile of $\tilde F_0^{\hat T}$, which is a deterministic value if we regard $\hat T$ as fixed. We then show that $D(t_{(k)})$ differs from $D(c_k)$ by at most $4^{-1}\varepsilon$ for all $k$, and that $\alpha_{k_*,\delta} - D(c_{k_*}) \le \alpha - 4^{-1}\varepsilon$. Next, we define $k_0 = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - D(c_k) \le \alpha - 4^{-1}\varepsilon\}$, which is another deterministic value given that $\hat T$ is considered fixed. We then find that $k_0 \le k_*$ and that $\alpha_{k_0,\delta} - D(t_{(k_0)}) \le \alpha$ with high probability. Therefore, $t_{(k_0)}$ is a high-probability lower bound for $t_{(k_*)}$. Moreover, $t_{(k_0)}$ is an order statistic with deterministic order (for fixed $\hat T$), so its distribution can be written in terms of binomial probabilities. The fact that $\alpha_{k_0,\delta} - D(t_{(k_0)}) \le \alpha$, combined with Proposition 3.1, yields that the violation rate of $\hat\varphi_{k_0}(\cdot)$ is smaller than $\delta$.

3.7.4 Theoretical properties of Algorithm 3.1#.

In this subsection, we discuss the properties of Algorithm 3.1#. Recall that $m_0^{\#} \ge m_0$ and $m_1^{\#} \le m_1$ in Assumption 3.1 mean that the corruption levels are "underestimated." As such, Algorithm 3.1# produces a more conservative result than Algorithm 3.1. To see this, note that the only difference between the two algorithms is that $(1-m_0)(m_0-m_1)^{-1}$ in Algorithm 3.1 is replaced by $(1-m_0^{\#})(m_0^{\#}-m_1^{\#})^{-1}$ in Algorithm 3.1#. The latter is no larger than the former, so the threshold in Algorithm 3.1# is larger than or equal to that in Algorithm 3.1. On the other hand, under Assumption 3.1, Algorithm 3.1# is still less conservative than the original NP umbrella algorithm.
To digest this, first consider the case where the label noise is totally "ignored," i.e., $m_0^{\#} = 1$ and $m_1^{\#} = 0$; in this case, Algorithm 3.1# is equivalent to the original NP umbrella algorithm. Since usually $m_0^{\#} < 1$ and $m_1^{\#} > 0$, Algorithm 3.1# produces a smaller threshold than the original NP umbrella algorithm. Therefore, Algorithm 3.1# overcomes, at least partially, the conservativeness issue of the original NP umbrella algorithm. These insights are formalized in the following lemma.

Lemma 3.5. Under Assumptions 3.1–3.2, $k_*^{\#} = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D_{\#}^{+}(t_{(k)}) \le \alpha\}$ in Algorithm 3.1# exists. Moreover, the order $k_*^{\#}$ lies between $k_*$ and $k^*$; i.e., $k_* \le k_*^{\#} \le k^*$.

Proof. Assumption 3.1 implies $0 \le M_{\#} \le M$, and thus $0 \le \hat D_{\#}^{+}(c) \le \hat D^{+}(c)$. Then $\{k \in \{1,\dots,n\} : \alpha_{k,\delta} \le \alpha\}$, which is non-empty by Assumption 3.2, is a subset of $\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D_{\#}^{+}(t_{(k)}) \le \alpha\}$. This implies that $k_*^{\#}$ exists and is smaller than or equal to $k^*$. Furthermore, $\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D_{\#}^{+}(t_{(k)}) \le \alpha\}$ is also a subset of $\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D^{+}(t_{(k)}) \le \alpha\}$, and thus $k_*^{\#} = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D_{\#}^{+}(t_{(k)}) \le \alpha\}$ is larger than or equal to $k_*$.

Next, we establish high-probability control of the type I error for Algorithm 3.1#; recall that the corresponding result for Algorithm 3.1 was established in Theorem 3.1. In view of Lemma 3.5, $\hat\varphi_{k_*^{\#}}(\cdot)$ produced by Algorithm 3.1# has a larger threshold, and thus a smaller true type I error, than $\hat\varphi_{k_*}(\cdot)$ produced by Algorithm 3.1. A high-probability control of the true type I error of $\hat\varphi_{k_*^{\#}}(\cdot)$ then follows naturally. This result is summarized in the following corollary.

Corollary 3.1.
Under Assumptions 3.1–3.4, the classifier $\hat\varphi_{k_*^{\#}}(\cdot)$ given by Algorithm 3.1# with $k_*^{\#} = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D_{\#}^{+}(t_{(k)}) \le \alpha - \varepsilon\}$ satisfies
$$P\left(R_0(\hat\varphi_{k_*^{\#}}) > \alpha\right) \le \delta + \delta_1(n_b) + \delta_2(n_b) + 2e^{-8^{-1}nM^{-2}C^{-2}c^2\varepsilon^2} + 2e^{-8^{-1}n_e^0 M^{-2}\varepsilon^2} + 2e^{-8^{-1}n_e^1 M^{-2}\varepsilon^2},$$
in which $n_b = |\tilde{\mathcal S}_b|$, $n = |\tilde{\mathcal S}_t^0|$, $n_e^0 = |\tilde{\mathcal S}_e^0|$, and $n_e^1 = |\tilde{\mathcal S}_e^1|$.

Proof. By Lemma 3.5, $k_*^{\#} \ge k_*$ and thus $t_{(k_*)} \le t_{(k_*^{\#})}$. Therefore, $R_0(\hat\varphi_{k_*}) \ge R_0(\hat\varphi_{k_*^{\#}})$. Combined with Theorem 3.1, this yields
$$P\left(R_0(\hat\varphi_{k_*^{\#}}) > \alpha\right) \le P\left(R_0(\hat\varphi_{k_*}) > \alpha\right) \le \delta + \delta_1(n_b) + \delta_2(n_b) + 2e^{-8^{-1}nM^{-2}C^{-2}c^2\varepsilon^2} + 2e^{-8^{-1}n_e^0 M^{-2}\varepsilon^2} + 2e^{-8^{-1}n_e^1 M^{-2}\varepsilon^2}.$$

3.8 Numerical Analysis.

In this section, we apply Algorithm 3.1 (known corruption levels) and Algorithm 3.1# (unknown corruption levels) to simulated and real datasets, and compare them with other methods in the literature. We present the (approximate) type I error violation rates†† and the averages of the (approximate) true type II errors. As designed in Algorithms 3.1 and 3.1#, $\varepsilon$ in $k_* = \min\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha - \varepsilon\}$ may be chosen as some small positive value. In principle, setting $\varepsilon > 0$ could make $k_*$ larger than when $\varepsilon = 0$, since $\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha - \varepsilon\}$ is a subset of $\{k \in \{1,\dots,n\} : \alpha_{k,\delta} - \hat D^+(t_{(k)}) \le \alpha\}$; this would make the threshold larger and the type I error and the violation rate smaller. However, since $\varepsilon$ is a very small value, its effect on $k_*$ is very minor. Thus, we implement only the version with $\varepsilon = 0$.

3.8.1 Simulation.

Algorithm 3.1. We present three distributional settings for Algorithm 3.1 (known $m_0$ and $m_1$). In each setting, $2N$ observations are generated as a training sample, of which half are from the corrupted class 0 and half from the corrupted class 1. The number $N$ varies from $200$ to $2{,}000$.
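The corrupted training samples in these simulations follow the mixture structure of Assumption 3.1: a corrupted class-0 observation is drawn from the true class-0 distribution with probability $m_0$ and from class 1 otherwise, and analogously with $m_1$ for corrupted class 1. A minimal sketch of such a sampler (the univariate Gaussians below are illustrative stand-ins, not the thesis's exact multivariate settings):

```python
import random

def corrupted_sample(n, m, sample0, sample1, rng):
    """Draw n corrupted observations: each comes from the true class-0
    sampler with probability m and from the class-1 sampler otherwise
    (m = m0 for corrupted class 0, m = m1 for corrupted class 1)."""
    return [sample0() if rng.random() < m else sample1() for _ in range(n)]

rng = random.Random(0)
sample0 = lambda: rng.gauss(0.0, 1.0)   # illustrative true class 0
sample1 = lambda: rng.gauss(1.0, 1.0)   # illustrative true class 1

m0, m1, N = 0.95, 0.05, 2000
corrupted0 = corrupted_sample(N, m0, sample0, sample1, rng)
corrupted1 = corrupted_sample(N, m1, sample0, sample1, rng)

# By Lemma 3.1, E[corrupted class-0 X] = m0*mu0 + (1-m0)*mu1 = 0.05 here
mean0 = sum(corrupted0) / N
```

The empirical mean of the corrupted class-0 sample should hover around $m_0\mu_0 + (1-m_0)\mu_1$, the mixture mean given by Lemma 3.1, which is a convenient sanity check when wiring up such simulations.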
To approximate the true type I and II errors, we generate $20{,}000$ true class-0 observations and $20{,}000$ true class-1 observations as the evaluation set. For each combination of distribution and sample size, we repeat the procedure $1{,}000$ times. Algorithm 3.1 ("adjusted") and the original NP umbrella algorithm ("original") are both applied, paired with different base algorithms.

†† Strictly speaking, the observed type I error violation rate is only an approximation to the real violation rate. The approximation is two-fold: i) in each repetition of an experiment, the population type I error is approximated by the empirical type I error on a large test set; ii) the violation rate should be calculated based on infinitely many repetitions of the experiment, but we only calculate it based on a finite number of repetitions. Such approximation, however, is unavoidable in numerical studies.

Simulation 3.1 (Gaussian Distribution). Let $X^0 \sim \mathcal N(\mu_0, \Sigma)$ and $X^1 \sim \mathcal N(\mu_1, \Sigma)$, where $\mu_0 = (0,0,0)^\top$, $\mu_1 = (1,1,1)^\top$ and
$$\Sigma = \begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix},$$
and the base algorithm is linear discriminant analysis (LDA). For different $(m_0, m_1, \alpha, \delta)$ combinations, the (approximate) type I error violation rates and the averages of the (approximate) true type II errors generated by Algorithm 3.1 are reported in Tables 3.1 and 3.2, respectively.

Table 3.1: (Approximate) type I error violation rates over 1,000 repetitions for Simulation 3.1. Standard errors ($\times 10^{-3}$) in parentheses.
 N     | m0=.95, m1=.05        | m0=.9, m1=.1          | m0=.95, m1=.05        | m0=.9, m1=.1
       | α=.05, δ=.05          | α=.05, δ=.05          | α=.1, δ=.1            | α=.1, δ=.1
       | adjusted    original  | adjusted    original  | adjusted    original  | adjusted    original
 200   | .026(5.03)  .001(1.00)| .033(5.65)  0(0)      | .078(8.84)  .003(1.73)| .073(8.23)  0(0)
 500   | .031(5.40)  0(0)      | .046(6.63)  0(0)      | .090(9.05)  .001(1.00)| .085(8.82)  0(0)
 1,000 | .038(5.97)  0(0)      | .049(6.83)  0(0)      | .105(9.70)  0(0)      | .081(8.63)  0(0)
 2,000 | .053(6.96)  0(0)      | .046(6.63)  0(0)      | .087(8.92)  0(0)      | .099(9.45)  0(0)

Table 3.2: Averages of (approximate) true type II errors over 1,000 repetitions for Simulation 3.1. Standard errors ($\times 10^{-3}$) in parentheses.

 N     | m0=.95, m1=.05        | m0=.9, m1=.1          | m0=.95, m1=.05        | m0=.9, m1=.1
       | α=.05, δ=.05          | α=.05, δ=.05          | α=.1, δ=.1            | α=.1, δ=.1
       | adjusted    original  | adjusted    original  | adjusted    original  | adjusted    original
 200   | .685(7.16)  .706(4.65)| .697(7.06)  .826(3.54)| .333(3.93)  .403(3.56)| .369(4.93)  .537(4.03)
 500   | .481(4.08)  .590(2.99)| .512(4.92)  .743(2.79)| .249(1.94)  .307(1.83)| .257(2.21)  .436(2.48)
 1,000 | .396(2.53)  .534(2.19)| .387(2.37)  .663(1.68)| .218(1.18)  .287(1.22)| .213(1.01)  .381(1.28)
 2,000 | .350(1.51)  .491(1.45)| .371(1.99)  .651(1.45)| .201(.76)   .268(.77) | .205(.87)   .375(1.01)

Simulation 3.2 (Uniform Distribution within Circles). Let $X^0$ and $X^1$ be uniformly distributed within unit circles centered at $(0,0)^\top$ and $(1,1)^\top$, respectively. The base algorithm is logistic regression. We only report the (approximate) type I error violation rates and the averages of the (approximate) true type II errors generated by Algorithm 3.1 for one combination ($m_0 = .95$, $m_1 = .05$, $\alpha = .1$ and $\delta = .1$) in Table 3.3.

Table 3.3: (Approximate) type I error violation rates, and averages of (approximate) true type II errors over 1,000 repetitions for Simulation 3.2 ($m_0 = .95$, $m_1 = .05$, $\alpha = .1$ and $\delta = .1$). Standard errors ($\times 10^{-3}$) in parentheses.
 N     | (approximate) violation rate | averages of (approximate) true type II errors
       | adjusted    original         | adjusted    original
 200   | .079(8.53)  .006(2.44)       | .164(2.77)  .226(3.35)
 500   | .086(8.87)  .001(1.00)       | .123(.92)   .161(.80)
 1,000 | .085(8.82)  0(0)             | .109(.61)   .151(.58)
 2,000 | .085(8.82)  0(0)             | .101(.44)   .142(.39)

Simulation 3.3 (T Distribution). Let $X^0$ and $X^1$ be $t$-distributed with shape matrix $\Sigma$, which was specified in Simulation 3.1, $4$ degrees of freedom, and centers $(0,0,0)^\top$ and $(1,1,1)^\top$, respectively. The base algorithm is LDA. Similar to the previous simulation, we only report the (approximate) type I error violation rates and the averages of the (approximate) true type II errors generated by Algorithm 3.1 for one combination ($m_0 = .95$, $m_1 = .05$, $\alpha = .1$ and $\delta = .1$) in Table 3.4.

Table 3.4: (Approximate) type I error violation rates, and averages of (approximate) true type II errors over 1,000 repetitions for Simulation 3.3 ($m_0 = .95$, $m_1 = .05$, $\alpha = .1$ and $\delta = .1$). Standard errors ($\times 10^{-3}$) in parentheses.

 N     | (approximate) violation rate | averages of (approximate) true type II errors
       | adjusted    original         | adjusted    original
 200   | .068(7.96)  .008(2.82)       | .526(5.67)  .575(4.32)
 500   | .085(8.82)  .002(1.41)       | .398(3.32)  .472(2.59)
 1,000 | .090(9.05)  0(0)             | .345(2.07)  .432(1.78)
 2,000 | .093(9.19)  0(0)             | .314(1.24)  .401(1.18)

The results from Simulations 3.1–3.3 confirm that the original NP umbrella algorithm is overly conservative on type I error when there is label noise in the training data, resulting in type I error violation rates (close to) 0 in all settings. In contrast, the label-noise-adjusted Algorithm 3.1 has type I errors controlled at the specified level with high probability and achieves much better type II errors.

Algorithm 3.1#. In this section, we show numerically that, under the NP paradigm, "under-estimates" of the corruption levels serve Algorithm 3.1# well, while "over-estimates" do not.

Simulation 3.4. The distributional setting is the same as in Simulation 3.1, and different combinations of $m_0^{\#}$ and $m_1^{\#}$ are used.
The (approximate) type I error violation rates and the averages of the (approximate) true type II errors generated by Algorithm 3.1# for one combination ($m_0 = .95$, $m_1 = .05$, $\alpha = .1$ and $\delta = .1$) are reported in Tables 3.5 and 3.6.

Table 3.5: (Approximate) type I error violation rates over 1,000 repetitions for Simulation 3.4. Standard errors ($\times 10^{-3}$) in parentheses.

 N     | m0#=.93, m1#=.07 | m0#=.95, m1#=.05 | m0#=.97, m1#=.03 | original
 200   | .136(10.85)      | .078(8.48)       | .055(7.21)       | .003(1.73)
 500   | .218(13.06)      | .090(9.05)       | .038(6.05)       | .001(1.00)
 1,000 | .324(14.81)      | .105(9.70)       | .012(3.44)       | 0(0)
 2,000 | .462(15.77)      | .087(8.92)       | .005(2.23)       | 0(0)

Table 3.6: Averages of (approximate) true type II errors over 1,000 repetitions for Simulation 3.4. Standard errors ($\times 10^{-3}$) in parentheses.

 N     | m0#=.93, m1#=.07 | m0#=.95, m1#=.05 | m0#=.97, m1#=.03 | original
 200   | .287(3.43)       | .333(3.92)       | .373(4.62)       | .403(3.56)
 500   | .215(1.61)       | .249(1.94)       | .285(2.22)       | .307(1.83)
 1,000 | .189(1.02)       | .218(1.18)       | .250(1.37)       | .287(1.22)
 2,000 | .174(.65)        | .201(.76)        | .230(.86)        | .268(.77)

The second-to-last column of Table 3.5 confirms that, when using strict under-estimates of the corruption levels (i.e., $m_0^{\#} > m_0$ and $m_1^{\#} < m_1$), the type I error control objective is satisfied. Note that we also include the strict over-estimate scenarios in the second column (i.e., $m_0^{\#} < m_0$ and $m_1^{\#} > m_1$), where we see that the type I error violation rates exceed the target $\delta$. Hence, the under-estimate requirement in the theory is not merely a technical convenience. Table 3.6 confirms that using strict under-estimates leads to higher type II errors than using the true corruption levels. This is a necessary price to pay for not
That is, even such “good” estimators do not serve the purpose of high-probability control on type I error.

Simulation 3.5. The distributional setting is the same as in Simulation 3.1. The m#0 and m#1 are generated from N(m0, 100/N) and N(m1, 100/N), respectively. The (approximate) type I error violation rates generated by Algorithm 3.1# for one combination (m0 = .95, m1 = .05, α = .1 and δ = .1) are reported in Table 3.7.

Table 3.7: (Approximate) type I error violation rates over 1,000 repetitions for Simulation 3.5. Standard errors (×10⁻²) in parentheses.

N       (approximate) violation rate
200     .378(1.53)
500     .374(1.53)
1,000   .397(1.55)
2,000   .394(1.55)

3.8.2 Violin plots for Section 3.8.

In this section, we present the violin plots (Figures 3.5 - 3.10) for Simulations 3.1-3.3 in Section 3.8. The violin plots for the (approximate) true type I and type II errors over these 1,000 repetitions are shown for each (m0, m1, α, δ) combination. Take Figures 3.5 and 3.6 as an example: the two rows in each figure respectively correspond to the m0 = 0.95, m1 = 0.05 and m0 = 0.90, m1 = 0.10 settings, while the two columns respectively correspond to α = 0.05, δ = 0.05 and α = 0.10, δ = 0.10. The lighter-colored area of each plot represents true type I errors above the 1−δ quantile, while the darker-colored area represents true type I errors below the 1−δ quantile. The black dots represent the averages of true type I/II errors, and the bars above and below the dots represent standard deviations.
71

[Panels: violin plots of true type I errors versus training sample size N ∈ {200, 500, 1000, 2000}, adjusted vs. original, for the settings (m0 = .95, m1 = .05) and (m0 = .90, m1 = .10), each with (α = .05, δ = .05) and (α = .1, δ = .1).]

Figure 3.5: Violin plots for (approximate) true type I errors of Simulation 3.1.

[Panels: violin plots of true type II errors, same layout as Figure 3.5.]

Figure 3.6: Violin plots for (approximate) true type II errors of Simulation 3.1.
72

[Panels: violin plots of true type I errors versus training sample size N ∈ {200, 500, 1000, 2000}, adjusted vs. original, for the settings (m0 = .95, m1 = .05) and (m0 = .90, m1 = .10), each with (α = .05, δ = .05) and (α = .1, δ = .1).]

Figure 3.7: Violin plots for (approximate) true type I errors of Simulation 3.2.

[Panels: violin plots of true type II errors, same layout as Figure 3.7.]

Figure 3.8: Violin plots for (approximate) true type II errors of Simulation 3.2.
73

[Panels: violin plots of true type I errors versus training sample size N ∈ {200, 500, 1000, 2000}, adjusted vs. original, for the settings (m0 = .95, m1 = .05) and (m0 = .90, m1 = .10), each with (α = .05, δ = .05) and (α = .1, δ = .1).]

Figure 3.9: Violin plots for (approximate) true type I errors of Simulation 3.3.

[Panels: violin plots of true type II errors, same layout as Figure 3.9.]

Figure 3.10: Violin plots for (approximate) true type II errors of Simulation 3.3.

74

3.9 Real Data Analysis.

We analyze a canonical email spam dataset [42], which consists of 4,601 observations including 57 attributes describing characteristics of emails and a 0-1 class label. Here, 1 represents spam email while 0 represents non-spam, and the type I/II errors are defined accordingly. The labels in the dataset are all assumed to be correct. We create corrupted labels according to the class-conditional noise model. Concretely, we flip the labels of true class 0 observations with probability r0 and flip the labels of true class 1 observations with probability r1.
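The class-conditional corruption step can be sketched as follows; this is a minimal illustration (the function name `corrupt_labels` is ours, not from the thesis):

```python
import numpy as np

def corrupt_labels(y, r0, r1, rng):
    """Class-conditional noise: flip a true 0 with probability r0
    and a true 1 with probability r1."""
    y = np.asarray(y)
    u = rng.random(y.shape)
    flip = np.where(y == 0, u < r0, u < r1)
    return np.where(flip, 1 - y, y)

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=100_000)
y_tilde = corrupt_labels(y, r0=0.032, r1=0.078, rng=rng)

emp_r0 = np.mean(y_tilde[y == 0] == 1)   # empirical P(Ytilde = 1 | Y = 0)
emp_r1 = np.mean(y_tilde[y == 1] == 0)   # empirical P(Ytilde = 0 | Y = 1)
```

Only the training labels are corrupted in the analysis; the evaluation set keeps the clean labels so that true type I/II errors can be approximated.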
Note that m0 and m1 are P(Y = 0 | Ỹ = 0) and P(Y = 0 | Ỹ = 1), respectively, while r0 = P(Ỹ = 1 | Y = 0) and r1 = P(Ỹ = 0 | Y = 1). In our analysis, we choose m0 = 0.95 and m1 = 0.05, which implies setting r0 = 0.032 and r1 = 0.078.‡‡

For each training and evaluation procedure, we split the data by stratified sampling into training and evaluation sets. Specifically, 20% of the true class 0 observations and 20% of the true class 1 observations are randomly selected to form the training dataset, and the rest of the observations form the evaluation dataset. In total, the training set contains 921 observations and the evaluation set contains 3,680 observations. The larger evaluation set is reserved to better approximate the (population-level) true type I/II errors. We leave the evaluation data untouched, but randomly flip the training data labels according to the calculated r0 and r1. Four base algorithms are coupled with the original and new NP umbrella algorithms, with α = δ = 0.1. We repeat the procedure 1,000 times.

The (approximate) type I error violation rates and averages of (approximate) true type II errors generated by Algorithm 3.1 and the original NP umbrella algorithm are summarized in Table 3.8. Similar to the simulation studies, we observe that Algorithm 3.1 correctly controls the type I error at the right level, while the original NP umbrella algorithm is significantly overly conservative on type I error, and consequently has much higher type II error. We also summarize the results generated by Algorithm 3.1# in Tables 3.9 and 3.10. Clearly, while strict under-estimates lead to higher type II errors than using the exact corruption levels, the type I error control objective is achieved, and the type II error is better than just ignoring the label corruption and applying the original NP umbrella algorithm.

‡‡ This is an application of the Bayes theorem with P(Y = 0) estimated to be 0.610, which is the proportion of class 0 observations in the whole dataset.

75
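The footnoted Bayes-theorem computation can be reproduced as a small linear solve: the identities m0 = P(Y = 0 | Ỹ = 0) and m1 = P(Y = 0 | Ỹ = 1) are linear in (r0, r1) once π0 = P(Y = 0) is fixed. A sketch using π0 = 0.610 as stated in the footnote (the solution recovers the reported r0 ≈ 0.032 and r1 ≈ 0.078 up to the rounding of π0):

```python
import numpy as np

m0, m1 = 0.95, 0.05         # m0 = P(Y=0 | Ytilde=0), m1 = P(Y=0 | Ytilde=1)
pi0 = 0.610                 # P(Y = 0), as estimated in the footnote
pi1 = 1 - pi0

# Bayes' theorem turns the two identities into a linear system in (r0, r1):
#   pi0*(1 - m1)*r0 + m1*pi1*r1 = m1*pi1            (from the m1 identity)
#   pi0*(1 - m0)*r0 + m0*pi1*r1 = pi0*(1 - m0)      (from the m0 identity)
A = np.array([[pi0 * (1 - m1), m1 * pi1],
              [pi0 * (1 - m0), m0 * pi1]])
b = np.array([m1 * pi1, pi0 * (1 - m0)])
r0, r1 = np.linalg.solve(A, b)   # r0 = P(Ytilde=1|Y=0), r1 = P(Ytilde=0|Y=1)
```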
Table 3.8: (Approximate) type I error violation rates, and averages of (approximate) true type II errors by Algorithm 3.1 and the original NP umbrella algorithm over 1,000 repetitions for the email spam data. Standard errors (×10⁻³) in parentheses.

                                  (approximate) violation rate    avg. of (approximate) true type II errors
                                  adjusted      original          adjusted      original
penalized logistic regression     .082(8.68)    0(0)              .205(2.65)    .272(2.71)
linear discriminant analysis      .096(9.32)    0(0)              .226(3.05)    .314(2.77)
support vector machine            .093(9.19)    .004(2.00)        .183(3.15)    .218(1.93)
random forest                     .080(8.58)    0(0)              .120(1.13)    .152(1.54)

Table 3.9: (Approximate) type I error violation rates by Algorithm 3.1# over 1,000 repetitions for the email spam data. Standard errors (×10⁻³) in parentheses.

                                  m#0=0.93, m#1=0.07    m#0=0.95, m#1=0.05    m#0=0.97, m#1=0.03    original
penalized logistic regression     .231(13.33)           .082(8.68)            .028(5.22)            0(0)
linear discriminant analysis      .223(13.17)           .096(9.32)            .023(4.74)            0(0)
support vector machine            .220(13.11)           .093(9.19)            .026(5.03)            .004(2.00)
random forest                     .238(13.47)           .080(8.58)            .019(4.32)            0(0)

Table 3.10: Averages of (approximate) true type II errors by Algorithm 3.1# over 1,000 repetitions for the email spam data. Standard errors (×10⁻³) in parentheses.

                                  m#0=0.93, m#1=0.07    m#0=0.95, m#1=0.05    m#0=0.97, m#1=0.03    original
penalized logistic regression     .165(2.04)            .205(2.65)            .254(3.10)            .272(2.71)
linear discriminant analysis      .213(2.54)            .226(3.05)            .314(3.37)            .314(2.77)
support vector machine            .138(1.20)            .183(3.15)            .199(2.11)            .218(1.93)
random forest                     .102(.78)             .120(1.13)            .143(1.41)            .152(1.54)

76

3.10 Discussion.

Under the NP paradigm, we developed the first label-noise-adjusted umbrella algorithms. There are several interesting directions for future research. First, we can consider a more complex noise model in which the corruption levels depend on both the class and the features.
Another direction is to consider data-driven “under-estimates” of the corruption levels in the class-conditional noise model and develop (distributional) model-specific adjustment algorithms. For instance, we can adopt the linear discriminant analysis model, i.e., X0 ∼ N(µ0, Σ) and X1 ∼ N(µ1, Σ).

77

Chapter 4

Neyman-Pearson and equal opportunity: when efficiency meets fairness in classification

4.1 Introduction.

Recently, the U.S. Justice Department and the Equal Employment Opportunity Commission warned employers that used artificial intelligence to hire workers about potential unlawful discrimination.∗ Earlier, Amazon was accused of gender bias against women in its deployment of machine learning algorithms to search for top talent.† Evidence that algorithmic decision-making exhibits systematic bias against certain disadvantaged social groups has been accumulating in labor markets [24, 51] and is also growing in many other areas, including credit lending, policing, court decisions, and healthcare treatment [5, 46, 10, 71, 32]. To address the public concern about algorithmic fairness, a number of studies propose to regulate algorithmic design such that disadvantaged groups must receive non-disparate treatments [9, 47, 26, 8]. Statistically, this means that, in carrying out its predictive task, an algorithm ought to prioritize the fairness-related construction, such as purposefully equalizing certain error types of concern. However, efficiency loss could occur, as these fairness-related designs may limit the prediction accuracy [47].

∗ "AI Hiring Tools Can Violate Disability Protections, Government Warns," Wall Street Journal, May 12, 2022. https://www.wsj.com/articles/ai-hiring-tools-can-violate-disability-protections-government-warns-11652390318

† "Amazon scraps secret AI recruiting tool that showed bias against women," October 11, 2018.
https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G

78

Consider a situation where a bank uses an algorithmic classifier to decide whether to approve a loan application based on default status prediction. Here, fairness is a primary concern of society and regulators; concretely, disparity between the denial rates of qualified applicants across sensitive attributes, such as gender or race, is not tolerated. The banks, however, are primarily concerned about efficiency, which can be decomposed into two parts: the false negative rate (i.e., the probability of misclassifying a default case as non-default) and the false positive rate (i.e., the probability of misclassifying a non-default case as default). The false negative rate, due to its connection to financial security, has a higher priority for the banks than the false positive rate. Here and in many other examples, social fairness and economic efficiency could be in conflict. To address this conflict, we propose a novel framework that accommodates a dual focus on efficiency and fairness, as well as the asymmetric importance within the efficiency consideration.

The efficiency part of our framework is based on the Neyman-Pearson (NP) classification paradigm [22, 79]. This paradigm controls the type I error (i.e., the probability of misclassifying a 0 instance as 1) under some desired level α (referred to as the NP constraint) while minimizing the type II error (i.e., the probability of misclassifying a 1 instance as 0). In the loan application example, if we label the default status as 0 and the non-default status as 1, the type I error is the false negative rate and the type II error is the false positive rate. The asymmetric treatment of the NP paradigm permits flexible control over the more consequential error type. The fairness part of our framework borrows a relaxation of the equality of opportunity (EO) concept [37].
Assuming class 1 is the advantaged class, the EO constraint requires achieving the same type II error in all sensitive groups (e.g., race or gender); in the context of loan application, this means that the denial rates of qualified applicants should be equalized across groups. The relaxation we adopt eases the exact rate-equality requirement by allowing a pre-specified difference ε [31, 1]. In verbal discussion, we will still refer to this relaxation as the EO constraint.

Fusing the above efficiency and fairness parts, we obtain the new NP-EO paradigm. A natural question is: for any given α, ε ∈ (0, 1), are the NP constraint for economic efficiency and the EO constraint for social fairness feasible simultaneously? We provide a positive answer to this question. Moreover, leveraging the generalized Neyman-Pearson lemma, we derive an NP-EO oracle classifier.

79

Guided by the NP-EO oracle, we construct finite-sample based classifiers that respect the population-level NP and EO constraints with high probability. We take an umbrella algorithm perspective; that is, we wish to adapt commonly used methods (e.g., logistic regression, random forest, neural nets) to the NP-EO paradigm in a universal way. Similar in spirit to the original NP umbrella algorithm developed in [83] and its variant for corrupted labels in the last chapter, we employ an order statistics approach and make no distributional assumptions on the data in the algorithmic development. But the technicalities here are much more involved than in the NP umbrella algorithms, because we need to determine two thresholds (instead of one) simultaneously. In simulation studies, we demonstrate that NP-EO classifiers are the only classifiers that guarantee both NP and EO constraints with high probability. This advantage of the NP-EO classifiers is further demonstrated on a credit card dataset.

This chapter contributes to the emerging literature on algorithmic fairness.
The overall goal of this scholarly endeavor is to promote algorithmic decision making for the social good, especially for the protection of socially disadvantaged groups. Existing studies have focused on algorithmic bias due to data sampling and engineering [76, 28], the construction of fairness conditions [37, 47], and the way of incorporating ethical concerns into algorithmic optimization [26], among others. However, the fundamental social science problem, the tradeoff between economic efficiency and social equality, is not yet adequately addressed. Some researchers advocate a social-planning approach, in which the algorithmic designer models a social welfare function that captures an explicit preference for a certain socially desirable objective [46, 75]. While this approach provides a useful benchmark to evaluate social welfare in the presence of ethical considerations, how to put it into practice is a great challenge. Social preferences are often difficult to measure and have to be approximated by some measurable outcomes.

80

These proxies can be mismeasured and lead the predictive outcomes astray, as demonstrated in [66] and [71]. As an alternative to the social-planning approach, our approach takes a regulatory perspective, in which a decision maker can pursue their own objective after obeying a certain regulatory constraint. Existing algorithmic designs under the regulatory framework [26] do not explicitly cope with the efficiency-equality tradeoff. Regulatory failure is likely to occur when the efficiency loss caused by the fairness constraint is significant. Our proposed NP-EO approach provides a framework to detect algorithmic bias, evaluate the social loss caused by self-interested algorithms, and regulate algorithms to maintain the regulatory goal while permitting users sufficient freedom to achieve efficiency.

In the algorithmic fairness literature, many criteria have been proposed to define "fairness"; see [8] and the references within.
Our work does not intend to introduce a new fairness criterion. Rather, our framework is flexible enough that the EO constraint can potentially be replaced by other well-defined fairness criteria, and the NP constraint can also be replaced by another efficiency priority. Such efficiency-fairness dual constraints have the potential to be implemented as long as they are simultaneously feasible.

The rest of the chapter is organized as follows. The mathematical setting of the Neyman-Pearson equal opportunity (NP-EO) paradigm is introduced in Section 4.2. Then, Section 4.4 presents the NP-EO oracle classifier. We introduce two NP-EO umbrella algorithms and provide theoretical justification of the first one in Section 4.5. Numerical studies are presented in Section 4.6.

4.2 Neyman-Pearson equal opportunity (NP-EO) paradigm.

4.2.1 Mathematical setting.

Let (X, S, Y) be a random triplet where X ∈ 𝒳 ⊂ R^d represents d features, S denotes a sensitive attribute that takes values in {a, b}, and Y denotes the class label that takes values in {0, 1}. It is not necessary that every feature in X is neutral; we partition the features into X and S to emphasize that we will specifically consider a classifier's societal impacts related to S. We denote by P a generic probability measure whose meaning will be clear in context, and denote respectively by P_Z and P_B the probabilities taken with respect to the randomness of Z and B, for any random variable Z and random set B. Let φ : 𝒳 × {a, b} → {0, 1} be a classifier. The (population-level) type I error and type II error of φ are defined as

R_0(φ) := P(φ(X, S) ≠ Y | Y = 0)  and  R_1(φ) := P(φ(X, S) ≠ Y | Y = 1),

respectively. Next, we denote the type I/II errors conditional on the sensitive attribute by

R^s_y(φ) := P(φ(X, S) ≠ Y | Y = y, S = s),

for y ∈ {0, 1} and s ∈ {a, b}. Then it follows that

R_y(φ) = R^a_y(φ) · p_{a|y} + R^b_y(φ) · p_{b|y},    (4.1)

where p_{s|y} = P(S = s | Y = y) for s ∈ {a, b}.
Each p_{s|y} is assumed to be non-zero, and we use X_{y,s} as a shorthand for X | {Y = y, S = s}, for y ∈ {0, 1} and s ∈ {a, b}. Throughout the chapter, we consider class 1 as the 'advantaged' outcome for individuals, such as 'being hired', 'receiving promotion', 'admission to a college', or 'non-default', and class 0 as the less advantaged outcome for individuals. In the meantime, we understand class 0 as the class that organizations are concerned about and try to avoid, such as 'default'.

82

4.2.2 Equality of opportunity (EO).

Let L_y(φ) := |R^a_y(φ) − R^b_y(φ)|. In the literature on algorithmic fairness, a popular notion of fairness, coined 'equalized odds' (or 'separation'), requires absolute equality across social groups for any outcome, or L_0(φ) = L_1(φ) = 0 in our notation; see [8] and the references therein. [37] formulated a less stringent condition, referred to as 'equality of opportunity', which only requires L_1(φ) = 0. That is, qualified people from different social groups have equal opportunities to obtain the 'advantaged' outcome. This weaker notion of fairness is consistent with the advocacy of productive equity in social science and is acceptable in a wide range of social contexts. However, the requirement of absolute equality is not practical for finite-sample based classifiers: due to the randomness of data, the population-level condition L_1(φ) = 0 can hardly be achieved by any finite training sample procedure. Thus, researchers (e.g., [31, 1]) worked with a relaxed criterion:

L_1(φ) ≤ ε,    (4.2)

for some pre-specified small ε. This condition states that equality of opportunity is satisfied if, for two groups, the difference in the probabilities of classifying an "advantaged" outcome as "disadvantaged" is sufficiently small. This less stringent criterion offers a flexible level of tolerance and could be achieved by finite sample procedures with high probability.
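The quantities defined above are straightforward to estimate from data. A minimal sketch (all names ours) that computes the group-conditional errors R^s_y, checks the decomposition (4.1), and evaluates the type II error disparity appearing in the relaxed EO criterion (4.2):

```python
import numpy as np

def group_errors(pred, y, s):
    """Empirical R^s_y = P(phi(X, S) != Y | Y = y, S = s) for each (y, s)."""
    out = {}
    for yy in (0, 1):
        for ss in ("a", "b"):
            mask = (y == yy) & (s == ss)
            out[(yy, ss)] = np.mean(pred[mask] != yy)
    return out

rng = np.random.default_rng(2)
n = 10000
y = rng.integers(0, 2, size=n)
s = np.where(rng.random(n) < 0.5, "a", "b")
pred = np.where(rng.random(n) < 0.3, 1 - y, y)   # a synthetic classifier for illustration

R = group_errors(pred, y, s)

# decomposition (4.1): R_1 = R^a_1 * p_{a|1} + R^b_1 * p_{b|1}
y1 = y == 1
p_a1 = np.mean(s[y1] == "a")
R1_direct = np.mean(pred[y1] != 1)
R1_decomp = R[(1, "a")] * p_a1 + R[(1, "b")] * (1 - p_a1)

# type II error disparity in the relaxed EO criterion (4.2)
L1 = abs(R[(1, "a")] - R[(1, "b")])
```

Because the synthetic classifier flips labels at the same rate in both groups, its empirical disparity L1 is small; a group-dependent flip rate would inflate it.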
In this chapter, we adopt the relaxed EO condition described by equation (4.2) and refer to it as the EO constraint. Furthermore, we refer to L_1(φ) as the type II error disparity of the classifier φ.

4.2.3 Neyman-Pearson (NP) paradigm.

Like other fairness criteria, the EO constraint draws a boundary to incorporate the societal concern of fairness into algorithmic decision making. In the fairness literature, it was combined with some general loss functions (e.g., [87]). For example, it was incorporated into the classical classification paradigm, which minimizes the overall classification error, i.e., a weighted average of the type I and type II errors, with the weights equal to the marginal probabilities of the two classes. In many applications, however, these weights do not reflect the relative importance of the different error types; as a consequence, classifiers under the classical paradigm could have an undesirably high type I error (or type II error). The inclusion of a fairness criterion, such as the EO constraint, can further complicate the problem by resulting in an (unintended) redistribution of the two types of classification errors, as will be shown by Example 4.1 in Section 4.4.

83

Recall the loan application example. A bank wishes to classify loan applicants so as to control the default risk (controlling the type I error) and gain ample business opportunities (maximizing 1 − type II error). The problem is that the two types of errors are statistically in conflict, and the bank has to balance the trade-off between the goals. Regulation arising from fairness concerns (e.g., through the EO constraint) may help lift the bank's bias against certain social groups and enlarge its business opportunities (lower type II error), but it could also expose the bank to a greater default risk (higher type I error).
To cope with the above problem, we propose using the Neyman-Pearson (NP) paradigm [22, 79, 78], which solves

min_{φ : R_0(φ) ≤ α} R_1(φ),

where α ∈ (0, 1) is a user-specified constant. At the population level, an NP classifier is admissible only if its type I error is no larger than α. Under this admissibility condition, the NP oracle classifier is the one that minimizes the type II error among all feasible classifiers. In the loan example, an NP oracle classifier would control the risk of classifying a default applicant as a non-default one, helping banks manage their financial risk; after securing financial safety, it minimizes the chance of classifying a non-default applicant as a default one, giving banks the maximum possible business opportunities.

84

4.2.4 NP-EO paradigm.

We now integrate the EO constraint into the NP classification paradigm and propose the NP-EO paradigm as follows:

min_{R_0(φ) ≤ α, L_1(φ) ≤ ε} R_1(φ),    (4.3)

where α, ε ∈ (0, 1) are pre-specified numbers. Program (4.3) has joint constraints: the NP constraint R_0(φ) ≤ α, which ensures the most important part of economic efficiency, and the EO constraint L_1(φ) ≤ ε, which enforces the social fairness restriction. In this arrangement, the direct impact of the EO constraint on the type I error R_0 is isolated, and the conflict between efficiency and equality is absorbed by the type II error R_1, which is assumed to be economically less consequential. On the population level, we will derive an NP-EO oracle classifier, i.e., a solution to program (4.3). On the sample level, we will construct finite-sample based classifiers that respect the two constraints in (4.3) with high probability.

The NP-EO paradigm serves simultaneously organizations' primary economic goals and their social responsibility. Desirable as it is, there is no free lunch. Compared to those under the classical paradigm, classifiers under this new paradigm could suffer from a loss in the overall classification error.
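On the sample level, the NP constraint R_0(φ) ≤ α is typically enforced by the order-statistic thresholding of the NP umbrella algorithm (Proposition 3.1): the threshold is the k-th smallest classification score among left-out class-0 points, with k chosen so that the probability of violating the constraint is at most δ. A sketch, assuming the violation-probability bound takes the standard binomial form from the NP umbrella literature:

```python
import numpy as np
from math import comb

def np_order(n, alpha, delta):
    """Smallest order k such that using the k-th smallest class-0 score as the
    threshold gives P(type I error > alpha) <= delta, via the binomial bound
    sum_{j=k}^{n} C(n, j) (1 - alpha)^j alpha^(n - j)."""
    for k in range(1, n + 1):
        violation = sum(comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
                        for j in range(k, n + 1))
        if violation <= delta:
            return k
    return None   # the class-0 sample is too small for this (alpha, delta) pair

rng = np.random.default_rng(3)
scores0 = rng.normal(size=200)               # left-out class-0 scores (toy data)
k = np_order(len(scores0), alpha=0.1, delta=0.1)
threshold = np.sort(scores0)[k - 1]          # classify as 1 when score > threshold
```

Note that `np_order` returns None when n is too small, reflecting the minimum sample size the umbrella approach requires; the NP-EO umbrella algorithms of this chapter must pick two such thresholds, one per sensitive group, simultaneously.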
More subtly, as will be demonstrated in the numerical section, an NP-EO classifier tends to have a smaller type I error and a larger type II error than the corresponding NP classifier with the same α. Returning to the loan application example, a bank is concerned with two private goals, controlling the default risk (R_0) and expanding business opportunity (R_1), and a social goal of maintaining equal opportunity (a small difference between R^a_1 and R^b_1). With the NP-EO paradigm, the risk-control goal is achieved by the condition R_0(φ) ≤ α, where α is a risk level chosen by the bank, and the social goal is achieved by the condition L_1(φ) ≤ ε, where ε is determined by regulation or social norms. With these two goals, the bank has to be modest in the business-expansion goal, potentially paying the cost of having a larger chance of misclassifying non-defaulters as defaulters. While this cost could be more significant for startup banks at the stage of customer expansion, it is small for established banks that have a large customer base.

85

4.3 Preliminaries.

In this section, we provide some preliminary results that are helpful in understanding the main results of this chapter. Some of the preliminary results cited in previous chapters will also be useful, namely, the Bernstein-von Mises theorem (Theorem 2.1) and the NP umbrella algorithm (Proposition 3.1). Next, we introduce other technical preliminaries that help establish the theoretical results.

4.3.1 Generalized Neyman-Pearson lemma.

Theorem 4.1. Let f_1, ..., f_{m+1} be real-valued functions defined on a Euclidean space 𝒳 and integrable with respect to µ, and suppose that for given constants c_1, ..., c_m there exists a critical function φ satisfying

∫ φ f_i dµ = c_i,  i = 1, ..., m.    (4.4)

Let 𝒞 be the class of critical functions φ for which (4.4) holds.

(i) Among all members of 𝒞, there exists one that maximizes ∫ φ f_{m+1} dµ.
(ii) A sufficient condition for a member of 𝒞 to maximize ∫ φ f_{m+1} dµ is the existence of constants k_1, ..., k_m such that

φ(x) = 1 when f_{m+1}(x) > Σ_{i=1}^{m} k_i f_i(x),
φ(x) = 0 when f_{m+1}(x) < Σ_{i=1}^{m} k_i f_i(x).    (4.5)

(iii) If a member of 𝒞 satisfies (4.5) with k_1, ..., k_m ≥ 0, then it maximizes ∫ φ f_{m+1} dµ among all critical functions satisfying ∫ φ f_i dµ ≤ c_i, i = 1, ..., m.

(iv) The set M of points in m-dimensional space whose coordinates are (∫ φ f_1 dµ, ..., ∫ φ f_m dµ) for some critical function φ is convex and closed. If (c_1, ..., c_m) is an inner point of M, then there exist constants k_1, ..., k_m and a test φ satisfying (4.4) and (4.5), and a necessary condition for a member of 𝒞 to maximize ∫ φ f_{m+1} dµ is that (4.5) holds a.e. µ.

4.3.2 Technical lemmas.

Here, we state some technical lemmas that will be used in the proof of Theorem 4.3.

87

Lemma 4.1. Let U_1, U_2, ..., U_n be n independent random variables that are uniformly distributed over [0, 1], and let ξ_j = I{U_j ≤ c} for every j = 1, 2, ..., n, where c ∈ [0, 1] is an arbitrary deterministic constant. Then, for any 1 ≤ p

φ#_{c_a, c_b}(X, S) = I{f_{1,a}(X) > c_a f_{0,a}(X)} · I{S = a} + I{f_{1,b}(X) > c_b f_{0,b}(X)} · I{S = b}.    (4.6)

The following result holds.

Theorem 4.2. For each y ∈ {0, 1} and s ∈ {a, b}, we assume that (i) f_{y,s} exists, (ii) F_{y,s}(z) is continuous on [0, ∞), and (iii) F_{y,s}(0) = 0 and lim_{z→∞} F_{y,s}(z) = 1. Then there exist two non-negative constants c*_a and c*_b such that φ#_{c*_a, c*_b} is an NP-EO oracle classifier.

Proof. First, we state the mathematical foundation for the densities. Let µ = µ_d × M be a measure defined on R^d × {a, b}, where µ_d is the Lebesgue measure on R^d and M is the counting measure on {a, b}. Thus, the random variables (X, S) | {Y = 0} and (X, S) | {Y = 1} both have densities with respect to µ; denote them by f_0 and f_1, respectively. Consider the NP oracle (without the ε-separation constraint), that is, a classifier that minimizes R_1 among all classifiers φ such that R_0(φ) ≤ α. Assume for simplicity that there exists a constant c_α such that

P( f_1(X, S) / f_0(X, S) > c_α | Y = 0 ) = α.
90

By the Neyman-Pearson lemma, the classifier

φ**_α(X, S) = I{ f_1(X, S) / f_0(X, S) > c_α }
            = Σ_{s=a,b} I{ f_{1,s}(X) / f_{0,s}(X) > c_α · P(S = s | Y = 0) / P(S = s | Y = 1) } · I{S = s},

is the NP oracle classifier. Note that

P( f(X | S, Y = 1) P(S | Y = 1) / [ f(X | S, Y = 0) P(S | Y = 0) ] > z | Y = 0 )
  = P( f(X | S, Y = 1) P(S | Y = 1) / [ f(X | S, Y = 0) P(S | Y = 0) ] > z | Y = 0, S = a ) p_{a|0}
    + P( f(X | S, Y = 1) P(S | Y = 1) / [ f(X | S, Y = 0) P(S | Y = 0) ] > z | Y = 0, S = b ) p_{b|0}
  = (1 − F_{0,a}(z p_{a|0} p^{-1}_{a|1})) p_{a|0} + (1 − F_{0,b}(z p_{b|0} p^{-1}_{b|1})) p_{b|0}.

Note that lim_{z→∞} F_{0,a}(z) = 1 and F_{0,a}(0) = 0 by assumption; similarly, F_{0,b} has the same properties. Then, since both F_{0,a} and F_{0,b} are continuous, there exists a c_α > 0 such that the above quantity equals α. Note that φ**_α can be written in the following way:

φ**_α(X, S) = I{ f_{1,a}(X) / f_{0,a}(X) > c**_a } I{S = a} + I{ f_{1,b}(X) / f_{0,b}(X) > c**_b } I{S = b},

where c**_a = c_α p_{a|0} p^{-1}_{a|1} and c**_b = c_α p_{b|0} p^{-1}_{b|1}. Thus, φ**_α = φ#_{c**_a, c**_b}.

Now, there are two cases: L_1(φ#_{c**_a, c**_b}) ≤ ε or L_1(φ#_{c**_a, c**_b}) > ε. In the first case, φ#_{c**_a, c**_b} minimizes R_1 over {φ : R_0(φ) ≤ α, L_1(φ) ≤ ε} since it is the NP oracle classifier, and thus φ#_{c**_a, c**_b} = φ*_{α,ε}.

91

In the second case, assume without loss of generality that R^b_1(φ#_{c**_a, c**_b}) − R^a_1(φ#_{c**_a, c**_b}) > ε. Consider the following optimization problem: over all classifiers φ,

maximize P(φ(X, S) = 1 | Y = 1) = ∫ φ f(x | S = s, Y = 1) P(S = s | Y = 1) dµ_d,

subject to R_0(φ) = ∫ φ f(x | S = s, Y = 0) P(S = s | Y = 0) dµ_d = α,
          R^b_1(φ) − R^a_1(φ) = ∫ φ (f_{1,a}(x) I{S = a} − f_{1,b}(x) I{S = b}) dµ_d = ε.    (4.7)

Here, recall that µ_d is the Lebesgue measure on R^d.
By the generalized Neyman-Pearson lemma (Theorem 4.1), if there exist two non-negative numbers k_1, k_2 such that the classifier

φ′(X, S) = I{ f_{1,S}(X) p_{S|1} > k_1 f_{0,S}(X) p_{S|0} + k_2 (f_{1,a}(X) I{S = a} − f_{1,b}(X) I{S = b}) }
         = I{ f_{1,a}(X) / f_{0,a}(X) > k_1 p_{a|0} / (p_{a|1} − k_2) } I{S = a}
           + I{ f_{1,b}(X) / f_{0,b}(X) > k_1 p_{b|0} / (p_{b|1} + k_2) } I{S = b},

satisfies R_0(φ′) = α and R^b_1(φ′) − R^a_1(φ′) = ε, then it maximizes P(φ(X, S) = 1 | Y = 1), i.e., minimizes R_1(φ), over all φ such that R_0(φ) ≤ α and R^b_1(φ) − R^a_1(φ) ≤ ε, and thus validates the assertion in the theorem. Therefore, it suffices to show the existence of such k_1 and k_2; in particular, k_1 > 0 and k_2 ∈ (0, p_{a|1}).

We claim that there exist two constants C > C′ > 0 such that

R_0( φ#_{C p_{a|0} p^{-1}_{a|1}, C′ p_{b|0} p^{-1}_{b|1}} ) = α,    (4.8)

and

R^b_1( φ#_{C p_{a|0} p^{-1}_{a|1}, C′ p_{b|0} p^{-1}_{b|1}} ) − R^a_1( φ#_{C p_{a|0} p^{-1}_{a|1}, C′ p_{b|0} p^{-1}_{b|1}} ) = ε.    (4.9)

92

In view of this, take

k_1 = C C′ / (C p_{b|1} + C′ p_{a|1})  and  k_2 = (C − C′) / (C p^{-1}_{a|1} + C′ p^{-1}_{b|1}),

and φ′ satisfies the conditions of optimization problem (4.7). Moreover, one can see that k_1 > 0 and k_2 ∈ (0, p_{a|1}) since C > C′ > 0 and

k_2 = (C − C′) / (C p^{-1}_{a|1} + C′ p^{-1}_{b|1}) < C / (C p^{-1}_{a|1}) = p_{a|1}.

Then, the generalized Neyman-Pearson lemma validates the assertion.

The remainder of the proof relies on the following two key functions f and g. For c ∈ [c_α, ∞),

f(c) = inf{ z ≥ 0 : (1 − F_{0,a}(c p_{a|0} p^{-1}_{a|1})) p_{a|0} + (1 − F_{0,b}(z p_{b|0} p^{-1}_{b|1})) p_{b|0} ≤ α }.

Conceptually, for a classifier whose {S = a} section threshold is c p_{a|0} p^{-1}_{a|1} and whose overall type I error is equal to or less than α, f(c) describes its smallest possible {S = b} section threshold. Moreover, define

g(c) = sup{ z ≥ 0 : F_{1,b}(z p_{b|0} p^{-1}_{b|1}) − F_{1,a}(c p_{a|0} p^{-1}_{a|1}) = ε }

on [c_α, V), where V = sup{ z : F_{1,a}(z p_{a|0} p^{-1}_{a|1}) ≤ 1 − ε }. If a classifier whose {S = a} section threshold is c satisfies the ε-separation, g(c) describes the largest possible {S = b} section threshold.
To check that the domain of $g$ is indeed well defined, i.e., $c_\alpha<V$: by our assumption that $F_{1,b}(c_\alpha p_{b|0}p_{b|1}^{-1})-F_{1,a}(c_\alpha p_{a|0}p_{a|1}^{-1})>\varepsilon$, one can conclude that $F_{1,a}(c_\alpha p_{a|0}p_{a|1}^{-1})<1-\varepsilon$, and thus $c_\alpha<V$ by continuity of $F_{1,a}$.

Here, we make several remarks that are useful in the following proofs:
- $f$ is non-increasing, whereas $g$ is positive and non-decreasing;
- by the definition of $c_\alpha$, $0<f(c_\alpha)\le c_\alpha$;
- for any $c>0$,
\[ \big(1-F_{0,a}(cp_{a|0}p_{a|1}^{-1})\big)p_{a|0}+\big(1-F_{0,b}(f(c)p_{b|0}p_{b|1}^{-1})\big)p_{b|0}=\alpha, \quad\text{and}\quad F_{1,b}(g(c)p_{b|0}p_{b|1}^{-1})-F_{1,a}(cp_{a|0}p_{a|1}^{-1})=\varepsilon. \]

It remains to discuss several possible scenarios. If $f(c_\alpha)\le g(c_\alpha)$, then the existence of $C>C'>0$ is given by Lemma 4.4. The scenario where $f(c_\alpha)>g(c_\alpha)$ is more involved. Let $\mathcal{A}_-=\{c\in[c_\alpha,V): g(c)<f(c)\}$ and $\mathcal{A}_+=\{c\in[c_\alpha,V): g(c)\ge f(c)\}$, and denote $A=\sup\mathcal{A}_-$. Depending on whether $A<V$ or $A=V$, this scenario is further divided into two cases. For $A<V$, the proof is finished by an application of Lemma 4.5; otherwise, the conclusion is implied by Lemma 4.6.

Next, we give the three necessary lemmas that conclude the proof of Theorem 4.2. The first lemma deals with the scenario where $f(c_\alpha)\le g(c_\alpha)$.

Lemma 4.4. If $f(c_\alpha)\le g(c_\alpha)$, there exist $C>C'>0$ that satisfy equations (4.8) and (4.9).

Proof. By the definition of $c_\alpha$,
\[ \big(1-F_{0,a}(c_\alpha p_{a|0}p_{a|1}^{-1})\big)p_{a|0}+\big(1-F_{0,b}(c_\alpha p_{b|0}p_{b|1}^{-1})\big)p_{b|0}=\alpha. \]
Moreover,
\[ \big(1-F_{0,a}(c_\alpha p_{a|0}p_{a|1}^{-1})\big)p_{a|0}+\big(1-F_{0,b}(f(c_\alpha)p_{b|0}p_{b|1}^{-1})\big)p_{b|0}=\alpha. \]
Since $f(c_\alpha)\le g(c_\alpha)<c_\alpha$, we have
\[ \big(1-F_{0,a}(c_\alpha p_{a|0}p_{a|1}^{-1})\big)p_{a|0}+\big(1-F_{0,b}(g(c_\alpha)p_{b|0}p_{b|1}^{-1})\big)p_{b|0}=\alpha \]
by monotonicity of $F_{0,b}$. Thus, letting $C=c_\alpha$ and $C'=g(c_\alpha)$ yields the desired result.

The next lemma deals with the case $f(c_\alpha)>g(c_\alpha)$ and $A<V$.

Lemma 4.5. If $f(c_\alpha)>g(c_\alpha)$ and $A<V$, there exist $C>C'>0$ that satisfy equations (4.8) and (4.9).

Proof. By monotonicity of $f$ and $g$, for any $c\in[c_\alpha,V)$, one can conclude that $c\in\mathcal{A}_+$ if $c\in(A,V)$ and $c\in\mathcal{A}_-$ if $c\in[c_\alpha,A)$.
Then, we claim that $A\in\mathcal{A}_+$. Indeed, if $A\in\mathcal{A}_-$, there exists an $a\in(g(A),f(A))$, and, for a sufficiently small positive number $\delta$, by continuity of $F_{0,a}$ and $F_{1,a}$ we have
\[ \big(1-F_{0,a}((A+\delta)p_{a|0}p_{a|1}^{-1})\big)p_{a|0}+\big(1-F_{0,b}(ap_{b|0}p_{b|1}^{-1})\big)p_{b|0}>\alpha \]
and
\[ F_{1,b}(ap_{b|0}p_{b|1}^{-1})-F_{1,a}((A+\delta)p_{a|0}p_{a|1}^{-1})<\varepsilon. \]
Thus, $f(A+\delta)>a>g(A+\delta)$ and $A+\delta\in\mathcal{A}_-$, contradicting the fact that $A+\delta\in\mathcal{A}_+$. Furthermore, since $f(c_\alpha)>g(c_\alpha)$, replacing $A$ with $c_\alpha$ in the previous argument yields the conclusion that $A>c_\alpha$.

Now denote $F=\lim_{c\to A^-}f(c)$ and $G=\lim_{c\to A^-}g(c)$, whose existence is guaranteed by $f$ and $g$ being monotone and bounded: $f(c)>0$ and $g(c)<f(c)\le f(c_\alpha)\le c_\alpha$. Furthermore, $G\ge g(c_\alpha)>0$ by monotonicity of $g$, and $F\ge G$ as $f>g$ on $\mathcal{A}_-$. Then, by continuity of $F_{0,a}$ and $F_{0,b}$,
\begin{align}
\big(1-F_{0,a}(Ap_{a|0}p_{a|1}^{-1})\big)p_{a|0}+\big(1-F_{0,b}(Fp_{b|0}p_{b|1}^{-1})\big)p_{b|0}
&=\big(1-F_{0,a}(Ap_{a|0}p_{a|1}^{-1})\big)p_{a|0}+\lim_{c\to A^-}\big(1-F_{0,b}(f(c)p_{b|0}p_{b|1}^{-1})\big)p_{b|0}\notag\\
&=\big(1-F_{0,a}(Ap_{a|0}p_{a|1}^{-1})\big)p_{a|0}+\lim_{c\to A^-}\Big[\alpha-\big(1-F_{0,a}(cp_{a|0}p_{a|1}^{-1})\big)p_{a|0}\Big]\notag\\
&=\alpha.\tag{4.10}
\end{align}
Similarly, we have
\begin{align}
F_{1,b}(Gp_{b|0}p_{b|1}^{-1})-F_{1,a}(Ap_{a|0}p_{a|1}^{-1})
&=\lim_{c\to A^-}F_{1,b}(g(c)p_{b|0}p_{b|1}^{-1})-F_{1,a}(Ap_{a|0}p_{a|1}^{-1})\notag\\
&=\varepsilon+\lim_{c\to A^-}\Big[F_{1,a}(cp_{a|0}p_{a|1}^{-1})-F_{1,a}(Ap_{a|0}p_{a|1}^{-1})\Big]=\varepsilon.\tag{4.11}
\end{align}
Therefore, by monotonicity, $F_{1,b}(zp_{b|0}p_{b|1}^{-1})-F_{1,a}(Ap_{a|0}p_{a|1}^{-1})=\varepsilon$ for any $z\in[G,g(A)]$, and
\[ \big(1-F_{0,a}(Ap_{a|0}p_{a|1}^{-1})\big)p_{a|0}+\big(1-F_{0,b}(zp_{b|0}p_{b|1}^{-1})\big)p_{b|0}=\alpha \]
for any $z\in[f(A),F]$. Since $A\in\mathcal{A}_+$, $g(A)\ge f(A)$; additionally, $F\ge G$. Then, $[G,g(A)]\cap[f(A),F]\ne\emptyset$. Let $A'$ be an element of this intersection. One can show that $A'\le F\le f(c_\alpha)\le c_\alpha<A$ and $A'\ge G\ge g(c_\alpha)>0$. Taking $C=A$ and $C'=A'$, we have $C>C'>0$, and constraints (4.8) and (4.9) are satisfied.

The last scenario is addressed in the following lemma.

Lemma 4.6. If $f(c_\alpha)>g(c_\alpha)$ and $A=V$, there exist $C>C'>0$ that satisfy equations (4.8) and (4.9).

Proof. Let $G$ and $F$ be defined as in the proof of Lemma 4.5.
Then, similar to equation (4.11), we have $F_{1,b}(Gp_{b|0}p_{b|1}^{-1})-F_{1,a}(Vp_{a|0}p_{a|1}^{-1})=\varepsilon$. Consequently, for every $z\in[G,\infty)$,
\[ F_{1,b}(zp_{b|0}p_{b|1}^{-1})-F_{1,a}(Vp_{a|0}p_{a|1}^{-1})=\varepsilon, \]
as $F_{1,a}(Vp_{a|0}p_{a|1}^{-1})=1-\varepsilon$. Furthermore, similar to equation (4.10),
\[ \big(1-F_{0,a}(Vp_{a|0}p_{a|1}^{-1})\big)p_{a|0}+\big(1-F_{0,b}(Fp_{b|0}p_{b|1}^{-1})\big)p_{b|0}=\alpha, \]
and $0<G\le F$. Then, taking $C=V$ and $C'=F$ suffices, as $C=V>c_\alpha\ge f(c_\alpha)\ge F=C'\ge G>0$.

Now we rationalize the results of Theorem 4.2 on an intuitive level. Theorem 4.2 states that the NP-EO oracle can be decomposed into two separate parts, namely the $S=a$ component and the $S=b$ component. This is natural because, as long as a classifier $\phi$ takes the protected attribute $S$ into consideration, it can always be rewritten in a two-part form, i.e., $\phi(X,S)=\phi_a(X)\cdot\mathbb{I}\{S=a\}+\phi_b(X)\cdot\mathbb{I}\{S=b\}$, where $\phi_a(\cdot)=\phi(\cdot,a)$ and $\phi_b(\cdot)=\phi(\cdot,b)$. Given this two-part form, it is not surprising that the best $\phi_a$ and $\phi_b$, in terms of group-wise type II error performance at any given type I error level, adopt density ratios as scoring functions. Thus, as long as the two thresholds are adjusted so that the NP and EO constraints are satisfied, a classifier of the form in equation (4.6) will have smaller $R^a_1$ and $R^b_1$ than other feasible classifiers, and thus a smaller $R_1$.

Note that, under the general conditions of Theorem 4.2, there is no guarantee of uniqueness of the NP-EO oracle classifier. Indeed, if the value of $F_{y,a}(\cdot)$ does not change on an interval containing $c^*_a$ for each $y\in\{0,1\}$, then, keeping $c^*_b$ the same, any value in this interval yields an NP-EO oracle classifier. Similarly, the threshold for the $S=b$ part is not unique if $F_{y,b}(\cdot)$ is constant in a vicinity of $c^*_b$ for each $y\in\{0,1\}$. Next, we present a simple example of an NP-EO oracle.

Example 4.1. Let $X_{0,a}$, $X_{1,a}$, $X_{0,b}$ and $X_{1,b}$ be $\mathcal{N}(0,1)$-, $\mathcal{N}(4,1)$-, $\mathcal{N}(0,9)$- and $\mathcal{N}(4,9)$-distributed, respectively, and set $\mathbb{P}(S=a,Y=0)=\mathbb{P}(S=a,Y=1)=\mathbb{P}(S=b,Y=0)=\mathbb{P}(S=b,Y=1)=0.25$.
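The error quantities implied by this model can be checked numerically. The sketch below is ours (not part of the thesis); it evaluates the Bayes rule, which within each group reduces to $\mathbb{I}\{X>2\}$ because the class priors are equal and the two within-group class densities share the same variance with means $0$ and $4$:

```python
import math

def norm_cdf(x, mu=0.0, sd=1.0):
    """CDF of N(mu, sd^2), computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

t = 2.0  # Bayes threshold: midpoint of the class means 0 and 4 within each group
# Group a: X_{0,a} ~ N(0,1), X_{1,a} ~ N(4,1); group b: X_{0,b} ~ N(0,9), X_{1,b} ~ N(4,9).
R0a = 1.0 - norm_cdf(t, 0.0, 1.0)   # type I error, group a
R0b = 1.0 - norm_cdf(t, 0.0, 3.0)   # type I error, group b (sd = 3, variance = 9)
R1a = norm_cdf(t, 4.0, 1.0)         # type II error, group a
R1b = norm_cdf(t, 4.0, 3.0)         # type II error, group b

# All four (S, Y) cells have probability 0.25, so p_{a|0} = p_{b|0} = p_{a|1} = p_{b|1} = 0.5.
R0 = 0.5 * R0a + 0.5 * R0b          # overall type I error, about 0.138
R1 = 0.5 * R1a + 0.5 * R1b          # overall type II error
L1 = abs(R1a - R1b)                 # type II error disparity, about 0.23
```

The exact computation gives $R_0=R_1\approx 0.138$ and $L_1\approx 0.23$; the value $0.135$ reported in the text presumably comes from averaging the two-decimal panel values $0.02$ and $0.25$ shown in Figure 4.1.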
Then the Bayes classifier is $\phi_{\text{Bayes}}=\mathbb{I}\{X>2\}$, and the NP oracle classifier for type I error control at $\alpha=0.1$ is $\phi_{\text{NP}}=\mathbb{I}\{X>2.58\}$.‡ If the type I error control level $\alpha$ and the type II error disparity level $\varepsilon$ are both set at $0.1$, the NP-EO oracle classifier is $\phi_{\text{NP-EO}}=\mathbb{I}\{X>3.20\}\mathbb{I}\{S=a\}+\mathbb{I}\{X>2.53\}\mathbb{I}\{S=b\}$. A graphical illustration of this example is depicted in Figure 4.1. We can calculate that $R_0(\phi_{\text{Bayes}})=0.135$, $R_1(\phi_{\text{Bayes}})=0.135$ and $L_1(\phi_{\text{Bayes}})=0.23$, violating both the NP and EO constraints. The NP oracle, compared with the Bayes classifier, has a larger threshold; consequently, $R_0(\phi_{\text{NP}})=0.1$, $R_1(\phi_{\text{NP}})=0.198$ and $L_1(\phi_{\text{NP}})=0.24$. The NP oracle classifier satisfies the NP constraint but violates the EO constraint. The NP-EO oracle classifier is more subtle: its $S=a$ threshold is larger than that of the NP oracle classifier, whereas its $S=b$ threshold is slightly smaller, resulting in $R_0(\phi_{\text{NP-EO}})=0.100$, $R_1(\phi_{\text{NP-EO}})=0.262$ and $L_1(\phi_{\text{NP-EO}})=0.1$, so that the NP-EO oracle classifier satisfies both the NP and EO constraints.

An NP-EO oracle classifier has a nice property: it is invariant to changes in the proportions of the class labels. This insight is made explicit by the following proposition.

Proposition 4.1. Under the conditions of Theorem 4.2, an NP-EO oracle classifier is invariant to a change in $\mathbb{P}(Y=0)$ (or, equivalently, $\mathbb{P}(Y=1)$), as long as the distributions of $X\mid(Y=y,S=s)$ (i.e., of $X_{y,s}$) and of $S\mid(Y=y)$ stay the same for each $y\in\{0,1\}$ and $s\in\{a,b\}$.

Proof. Let $p_1,p_2\in(0,1)$ be two distinct numbers. Moreover, let $(X^{p_1},S^{p_1},Y^{p_1})$ and $(X^{p_2},S^{p_2},Y^{p_2})$ be random triplets with the same distributions except that $\mathbb{P}(Y^{p_1}=0)=p_1$ and $\mathbb{P}(Y^{p_2}=0)=p_2$, respectively. For any $p\in\{p_1,p_2\}$ and an arbitrary classifier $\phi$, we denote by $R^s_q(\phi\mid p)$ the $R^s_q$ of $\phi$ based on the random triplet $(X^p,S^p,Y^p)$, for any $s\in\{a,b\}$ and $q\in\{0,1\}$.
Similarly, $R_0(\phi\mid p)$ and $R_1(\phi\mid p)$ are the type I and type II errors of $\phi$ based on $(X^p,S^p,Y^p)$.

‡ In this example, the sensitive attribute $S$ does not appear in the Bayes classifier or in the NP oracle classifier because the thresholds are the same for the $S=a$ and $S=b$ components. Thus, $S$ can be omitted due to the specific setup of this model.

Figure 4.1: Plots of the three classifiers in Example 4.1. The three rows, from top to bottom, illustrate the Bayes classifier, the NP oracle classifier and the NP-EO oracle classifier, respectively. The left panel illustrates the densities of $X_{0,a}$ and $X_{1,a}$, and the right panel those of $X_{0,b}$ and $X_{1,b}$. In every sub-figure, the green curve represents the class $0$ density and the orange curve the class $1$ density. In each row, the two thresholds of the classifier are indicated by two black vertical lines. The type I and type II errors conditional on the sensitive attribute are depicted, respectively, as the light green and light orange regions in every sub-figure, with their values marked.

Note that, by assumption, $X^{p_1}\mid(S^{p_1}=s,Y^{p_1}=q)$ has the same distribution as $X^{p_2}\mid(S^{p_2}=s,Y^{p_2}=q)$ for each $s\in\{a,b\}$ and $q\in\{0,1\}$. Then, for every $z\in[0,\infty)$,
\[ \mathbb{P}\Big(\frac{f_{1,s}(X^{p_1})}{f_{0,s}(X^{p_1})}\le z\,\Big|\,Y^{p_1}=q,S^{p_1}=s\Big)=\mathbb{P}\Big(\frac{f_{1,s}(X^{p_2})}{f_{0,s}(X^{p_2})}\le z\,\Big|\,Y^{p_2}=q,S^{p_2}=s\Big) \]
for each $s\in\{a,b\}$, $q\in\{0,1\}$. This further implies that, given a classifier $\phi^{\#}_{c_a,c_b}$ of the form in equation (4.6) for arbitrary constants $c_a$ and $c_b$, $R^s_q(\phi^{\#}_{c_a,c_b}\mid p_1)=R^s_q(\phi^{\#}_{c_a,c_b}\mid p_2)$. Thus,
\begin{align*}
R_0(\phi^{\#}_{c_a,c_b}\mid p_1)&=R^a_0(\phi^{\#}_{c_a,c_b}\mid p_1)p_{a|0}+R^b_0(\phi^{\#}_{c_a,c_b}\mid p_1)p_{b|0}\\
&=R^a_0(\phi^{\#}_{c_a,c_b}\mid p_2)p_{a|0}+R^b_0(\phi^{\#}_{c_a,c_b}\mid p_2)p_{b|0}=R_0(\phi^{\#}_{c_a,c_b}\mid p_2).
\end{align*}
Moreover, $R^a_1(\phi^{\#}_{c_a,c_b}\mid p_1)-R^b_1(\phi^{\#}_{c_a,c_b}\mid p_1)=R^a_1(\phi^{\#}_{c_a,c_b}\mid p_2)-R^b_1(\phi^{\#}_{c_a,c_b}\mid p_2)$. Denote by $\phi^{\#}_{p_1}$ and $\phi^{\#}_{p_2}$ NP-EO oracle classifiers for $(X^{p_1},S^{p_1},Y^{p_1})$ and $(X^{p_2},S^{p_2},Y^{p_2})$, respectively. Since $\phi^{\#}_{p_1}$ is an NP-EO oracle classifier for $(X^{p_1},S^{p_1},Y^{p_1})$, it is also an NP-EO classifier for $(X^{p_2},S^{p_2},Y^{p_2})$; similarly, $\phi^{\#}_{p_2}$ is also an NP-EO classifier for $(X^{p_1},S^{p_1},Y^{p_1})$. To this end, it suffices to verify that $\phi^{\#}_{p_1}$ achieves the minimum $R_1$ for $(X^{p_2},S^{p_2},Y^{p_2})$ among all NP-EO classifiers. Indeed, since $\phi^{\#}_{p_1}$ is also of the form in equation (4.6),
\[
R_1(\phi^{\#}_{p_1}\mid p_1)=R^a_1(\phi^{\#}_{p_1}\mid p_1)p_{a|1}+R^b_1(\phi^{\#}_{p_1}\mid p_1)p_{b|1}=R^a_1(\phi^{\#}_{p_1}\mid p_2)p_{a|1}+R^b_1(\phi^{\#}_{p_1}\mid p_2)p_{b|1}=R_1(\phi^{\#}_{p_1}\mid p_2).
\]
Similarly, $R_1(\phi^{\#}_{p_2}\mid p_1)=R_1(\phi^{\#}_{p_2}\mid p_2)$. If $R_1(\phi^{\#}_{p_2}\mid p_2)<R_1(\phi^{\#}_{p_1}\mid p_2)$, one could conclude that $R_1(\phi^{\#}_{p_2}\mid p_1)<R_1(\phi^{\#}_{p_1}\mid p_1)$, violating the fact that $\phi^{\#}_{p_1}$ is an NP-EO oracle classifier. Therefore, $R_1(\phi^{\#}_{p_2}\mid p_2)\ge R_1(\phi^{\#}_{p_1}\mid p_2)$, and since $\phi^{\#}_{p_2}$ is also an NP-EO oracle classifier, $R_1(\phi^{\#}_{p_2}\mid p_2)=R_1(\phi^{\#}_{p_1}\mid p_2)$. Therefore, $\phi^{\#}_{p_1}$ achieves the minimum $R_1$ for $(X^{p_2},S^{p_2},Y^{p_2})$ among all NP-EO classifiers.

4.5 Methodology

In this section, we propose a sample-based NP-EO umbrella algorithm. In theory, Theorem 4.2 indicates that, with proper threshold choices, plugging density ratio estimates into equation (4.6) would lead to classifiers with good theoretical properties. In practice, however, practitioners often prefer certain classification methods (e.g., logistic regression and neural networks), which we will also refer to as base algorithms. Therefore, we construct classifiers of the generic form
\[ \hat\phi(X,S)=\mathbb{I}\{T_a(X)>c_a\}\cdot\mathbb{I}\{S=a\}+\mathbb{I}\{T_b(X)>c_b\}\cdot\mathbb{I}\{S=b\}, \tag{4.12} \]
where $T_a(\cdot)$ and $T_b(\cdot)$ are scoring functions for the groups $S=a$ and $S=b$, respectively.
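The generic form (4.12) is straightforward to express in code. A minimal sketch (ours; the scoring functions and thresholds are placeholders to be supplied by a base algorithm):

```python
from typing import Callable, Sequence

def make_group_threshold_classifier(T_a: Callable[[Sequence[float]], float],
                                    T_b: Callable[[Sequence[float]], float],
                                    c_a: float, c_b: float):
    """Classifier of the form (4.12):
    phi(X, S) = I{T_a(X) > c_a} I{S = a} + I{T_b(X) > c_b} I{S = b}."""
    def phi(x, s):
        if s == "a":
            return int(T_a(x) > c_a)
        return int(T_b(x) > c_b)
    return phi
```

For instance, with the first coordinate as a common score, `make_group_threshold_classifier(lambda x: x[0], lambda x: x[0], 1.0, 2.0)` classifies the point `[1.5]` as class 1 in group $a$ but as class 0 in group $b$, illustrating how the two groups can carry different thresholds.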
The classifier $\hat\phi$ in (4.12) is trained on a finite sample; thus it is random due to the randomness of the sample, and the constraints in program (4.3) cannot, in general, be satisfied with probability $1$. Therefore, we aim to achieve high-probability NP and EO constraints of the form
\[ \mathbb{P}\big(R_0(\hat\phi)>\alpha\big)\le\delta \quad\text{and}\quad \mathbb{P}\big(L_1(\hat\phi)>\varepsilon\big)\le\gamma, \tag{4.13} \]
for pre-specified small $\delta,\gamma\in(0,1)$. Here, the probability $\mathbb{P}$ is taken over the randomness of the training sample.

4.5.1 NP-EO umbrella algorithm

We now construct an algorithm that respects (4.13)§ and achieves a type II error as small as possible. (§ Strictly speaking, we only achieve $\gamma^+$ in the high-probability EO condition, where $\gamma^+\searrow\gamma$ as the sample size diverges.) Denote by $\mathcal{S}^{y,s}$ the set of features $X$ of the observations whose labels are $y$ and whose sensitive attributes are $s$, where $y\in\{0,1\}$ and $s\in\{a,b\}$. We assume that all the $\mathcal{S}^{y,s}$'s are independent, and that the instances within each $\mathcal{S}^{y,s}$ are independent and identically distributed. Each $\mathcal{S}^{y,s}$ is divided into two halves: $\mathcal{S}^{y,s}_{\text{train}}$ for training scoring functions, and $\mathcal{S}^{y,s}_{\text{left-out}}$ for estimating the thresholds in the classifier (4.12). First, all the $\mathcal{S}^{y,s}_{\text{train}}$'s are combined to train a scoring function (e.g., the sigmoid function in logistic regression) $T:\mathcal{X}\times\{a,b\}\mapsto\mathbb{R}$; then we take $T_a(\cdot)=T(\cdot,a)$ and $T_b(\cdot)=T(\cdot,b)$. To determine $c_a$ and $c_b$ such that the constraints are satisfied, we select thresholds to fulfill the NP constraint first and then adjust the thresholds for the EO constraint. A prior result we leverage to achieve the high-probability NP constraint is the NP umbrella algorithm developed by [83]. This algorithm adapts to all scoring-type classification methods (e.g., logistic regression and neural nets); without separating out the protected attribute from the other features, it states that, for an arbitrary (random) scoring function $\mathcal{S}:\mathcal{X}\mapsto\mathbb{R}$ and i.i.d.
class 0 observations $\{X^0_1,X^0_2,\cdots,X^0_n\}$ independent of $\mathcal{S}$, a classifier that controls the type I error under $\alpha$ with probability at least $1-\delta$ and achieves a small type II error can be built as $\mathbb{I}\{\mathcal{S}(X)>s_{(k^*)}\}$, where $s_{(k^*)}$ is the $(k^*)$-th order statistic of $\{s_1,s_2,\cdots,s_n\}:=\{\mathcal{S}(X^0_1),\mathcal{S}(X^0_2),\cdots,\mathcal{S}(X^0_n)\}$ and $k^*$ is the smallest $k\in\{1,2,\cdots,n\}$ such that
\[ \sum_{j=k}^{n}\binom{n}{j}(1-\alpha)^j\alpha^{n-j}\le\delta. \]
The smallest such $k$ is chosen to achieve the smallest type II error. The only condition for the high-probability type I error control is $n\ge\lceil\log\delta/\log(1-\alpha)\rceil$, a mild sample size requirement. More details of this algorithm are provided in Proposition 3.1.

Motivated by the NP umbrella algorithm, we build our method as follows. We apply $T_s(\cdot)$ to each instance in $\mathcal{S}^{y,s}_{\text{left-out}}$ to obtain $\mathcal{T}^{y,s}=\{t^{y,s}_1,t^{y,s}_2,\cdots,t^{y,s}_{n^y_s}\}$, where $n^y_s=|\mathcal{S}^{y,s}_{\text{left-out}}|$, $y\in\{0,1\}$, and $s\in\{a,b\}$. Two pivots are selected based on $\mathcal{T}^{0,a}$ and $\mathcal{T}^{0,b}$, respectively; the selection procedure adopts the NP umbrella algorithm in [83]. Concretely, for the sorted set $\mathcal{T}^{0,a}=\{t^{0,a}_{(1)},t^{0,a}_{(2)},\cdots,t^{0,a}_{(n^0_a)}\}$, the pivot $t^{0,a}_{(k^{0,a}_*)}$ is selected as the $k^{0,a}_*$-th order statistic in $\mathcal{T}^{0,a}$, where $k^{0,a}_*$ is the smallest $k\in\{1,\cdots,n^0_a\}$ such that $\sum_{j=k}^{n^0_a}\binom{n^0_a}{j}(1-\alpha)^j\alpha^{n^0_a-j}\le\delta$. The pivot $t^{0,b}_{(k^{0,b}_*)}$ is selected similarly from $\mathcal{T}^{0,b}$. If $c_a\ge t^{0,a}_{(k^{0,a}_*)}$ and $c_b\ge t^{0,b}_{(k^{0,b}_*)}$, then the classifier $\hat\phi$ in (4.12) satisfies
\[ \mathbb{P}\big(R^a_0(\hat\phi)>\alpha\big)\le\delta \quad\text{and}\quad \mathbb{P}\big(R^b_0(\hat\phi)>\alpha\big)\le\delta \]
by Proposition 1 in [83]. In view of (4.1), the above inequalities guarantee that the NP constraint is achieved with probability at least $1-2\delta$. If we want to strictly enforce the $1-\delta$ probability of type I error control in theory, as in the first inequality of (4.13), the $\delta$ parameter in our algorithm can be replaced by $\delta/2$.¶
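The choice of the pivot order is a one-dimensional search over binomial tails, since $\sum_{j=k}^{n}\binom{n}{j}(1-\alpha)^j\alpha^{n-j}=\mathbb{P}\big(\mathrm{Bin}(n,1-\alpha)\ge k\big)$. A sketch of this search (our illustration, not the authors' implementation):

```python
import math

def np_umbrella_order(n: int, alpha: float, delta: float):
    """Smallest k in {1, ..., n} with P(Bin(n, 1 - alpha) >= k) <= delta, i.e.
    sum_{j=k}^{n} C(n, j) (1 - alpha)^j alpha^(n - j) <= delta.
    Returns None when no such k exists, which happens exactly when the
    sample size condition n >= log(delta) / log(1 - alpha) fails."""
    for k in range(1, n + 1):
        tail = sum(math.comb(n, j) * (1.0 - alpha) ** j * alpha ** (n - j)
                   for j in range(k, n + 1))
        if tail <= delta:
            return k
    return None
```

For example, with $n=100$ and $\alpha=\delta=0.05$ this returns $k^*=99$, while $n=10$ is below the minimal sample size $\lceil\log 0.05/\log 0.95\rceil=59$ and yields no valid order.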
The next step is to adjust the thresholds so that the resulting classifier also satisfies the second inequality in (4.13), i.e., the high-probability EO constraint. Note that, though it is possible to select one of $c_a$ and $c_b$ to be smaller than its respective pivot while the other is larger, in such a way that the overall type I error still meets the NP condition in (4.13), this selection would have to be based on the proportions of the protected attribute values, since $R_0(\phi)=R^a_0(\phi)\cdot\mathbb{P}(S=a)+R^b_0(\phi)\cdot\mathbb{P}(S=b)$ for any classifier $\phi$. However, these proportions are unknown and would need to be estimated, which would introduce another source of randomness into our algorithm. Thus, we decided to choose the pair $(c_a,c_b)$ only among $c_a\ge t^{0,a}_{(k^{0,a}_*)}$ and $c_b\ge t^{0,b}_{(k^{0,b}_*)}$ to satisfy the EO condition in (4.13). Similarly to $\mathcal{T}^{0,a}$ and $\mathcal{T}^{0,b}$, we denote the sorted $\mathcal{T}^{1,s}=\{t^{1,s}_{(1)},t^{1,s}_{(2)},\cdots,t^{1,s}_{(n^1_s)}\}$ for $s\in\{a,b\}$. We will select $c_a$ from $\mathcal{T}^{1,a}$ and $c_b$ from $\mathcal{T}^{1,b}$. Concretely, let
\[ l_a=\sum_{j=1}^{n^1_a}\mathbb{I}\big\{t^{1,a}_j\le t^{0,a}_{(k^{0,a}_*)}\big\} \quad\text{and}\quad l_b=\sum_{j=1}^{n^1_b}\mathbb{I}\big\{t^{1,b}_j\le t^{0,b}_{(k^{0,b}_*)}\big\}. \tag{4.14} \]

¶ However, numerical results in Section 4.6 suggest that this extra cautionary measure does not seem to be necessary in practice, because the subsequent EO adjustment step gears our algorithm towards the more conservative direction for type I error control.

Then, $c_a$ is selected from $\{t^{1,a}_{(j)}: l_a<j\le n^1_a\}$ and $c_b$ is selected from $\{t^{1,b}_{(j)}: l_b<j\le n^1_b\}$. To this end, we investigate the distributions of
\[ r^a_1(i)=\mathbb{P}_{X^{1,a}}\big(T_a(X^{1,a})\le t^{1,a}_{(i)}\big) \quad\text{and}\quad r^b_1(j)=\mathbb{P}_{X^{1,b}}\big(T_b(X^{1,b})\le t^{1,b}_{(j)}\big) \]
for $i>l_a$ and $j>l_b$. We remark that $r^a_1(i)$ and $r^b_1(j)$ are random because only the randomness of $X^{1,a}$ and $X^{1,b}$ is taken in $\mathbb{P}_{X^{1,a}}$ and $\mathbb{P}_{X^{1,b}}$. Note that $r^a_1(i)$ and $r^b_1(j)$ are, respectively, the $S=a$ and $S=b$ components of the type II error of the classifier in (4.12), if we take $c_a=t^{1,a}_{(i)}$ and $c_b=t^{1,b}_{(j)}$.
The left-hand side of the EO condition in (4.13) can be further written as $\mathbb{E}_T\big[\mathbb{P}\big(|r^a_1(i)-r^b_1(j)|>\varepsilon\big)\big]$. For simplicity of narrative, we consider the scoring function $T(\cdot)$ (and hence $T_a(\cdot)$ and $T_b(\cdot)$) as fixed. This is harmless because the sample used to train $T(\cdot)$ was assumed to be independent of the rest. If one can approximate the inner probability as a function of $i$ and $j$, the pair $(i,j)$ can be selected such that $\mathbb{P}\big(|r^a_1(i)-r^b_1(j)|>\varepsilon\big)\le\gamma$.

This approximation is inspired by the following example. Let $X$ and $Y_1,Y_2,\cdots,Y_n$ be continuous, independent and identically distributed random variables. Moreover, let $c$ be a random variable that is independent of $X,Y_1,\cdots,Y_n$, and define $l=\sum_{j=1}^{n}\mathbb{I}\{Y_j\le c\}$. We will approximate the distribution of $\mathbb{P}_X(X\le Y_{(k)})$ conditional on $l$. Elementary probability suggests that the quantity $\mathbb{P}_X(X\le Y_{(k)})\mid l$ is equal in distribution to $g_{c,l}+(1-g_{c,l})B_{k-l,\,n-k+1}$ for $k>l$, where $B_{p,q}\sim\mathrm{Beta}(p,q)$ for any $p$ and $q$, and $g_{c,l}=\mathbb{P}_{Y_1}(Y_1\le c)\mid l$. Moreover, $g_{c,l}$ and $B_{k-l,\,n-k+1}$ are independent. However, the distribution of $g_{c,l}$ is still not clear. One can see that $l=\sum_{j=1}^{n}\mathbb{I}\{Y_j\le c\}$ is exactly a sum of independent Bernoulli random variables with a random success probability $\mathbb{P}_{Y_1}(Y_1\le c)$. Thus, from the Bayesian point of view, the distribution of $g_{c,l}$ is the posterior distribution of $\mathbb{P}_{Y_1}(Y_1\le c)$. By the Bernstein–von Mises theorem, $g_{c,l}$ is "close" to being normally distributed with mean $l/n$ and with variance given by the inverse Fisher information of $\mathbb{P}_{Y_1}(Y_1\le c)$ scaled by $1/n$, which can be estimated by $n^{-1}(l/n)(1-l/n)$.

Now we relate the above example to our setting. Note that, for a new observation $X^{1,a}$, $T_a(X^{1,a})$ and $t^{1,a}_1,\cdots,t^{1,a}_{n^1_a}$ are independent and identically distributed. Furthermore, $t^{0,a}_{(k^{0,a}_*)}$ is also independent of them.
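The Beta component of the decomposition in the example above is the classical order-statistic fact that, for i.i.d. continuous $X,Y_1,\dots,Y_n$, $\mathbb{P}_X(X\le Y_{(k)})\sim\mathrm{Beta}(k,\,n-k+1)$. A quick simulation check of this unconditional building block (our sketch; for $\mathrm{Uniform}(0,1)$ samples, $\mathbb{P}_X(X\le Y_{(k)})$ is simply the order statistic $Y_{(k)}$ itself):

```python
import random
import statistics

random.seed(0)
n, k, reps = 20, 15, 20000
vals = []
for _ in range(reps):
    ys = sorted(random.random() for _ in range(n))
    vals.append(ys[k - 1])  # Y_(k) = P_X(X <= Y_(k)) when X is Uniform(0,1)

mean_hat = statistics.fmean(vals)
var_hat = statistics.pvariance(vals)
beta_mean = k / (n + 1)                                 # Beta(k, n-k+1) mean
beta_var = k * (n - k + 1) / ((n + 1) ** 2 * (n + 2))   # Beta(k, n-k+1) variance
```

The empirical mean and variance should match the Beta moments $k/(n+1)$ and $k(n-k+1)/\big((n+1)^2(n+2)\big)$ up to Monte Carlo error.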
Moreover, by the definition of $l_a$ in equation (4.14), the distribution of
\[ r^a_1(i)\mid l_a=\mathbb{P}_{X^{1,a}}\big(T_a(X^{1,a})\le t^{1,a}_{(i)}\big)\mid l_a \]
can be approximated by $G_{1,a}+(1-G_{1,a})B_{i-l_a,\,n^1_a-i+1}$, if we further assume that $T_a(X^{1,a}),t^{1,a}_1,\cdots,t^{1,a}_{n^1_a}$ are continuous. Here,
\[ G_{1,a}\sim\mathcal{N}\Big(\frac{l_a}{n^1_a},\ \frac{(l_a/n^1_a)(1-l_a/n^1_a)}{n^1_a}\Big). \]
The distribution of $r^b_1(j)\mid l_b$ can be approximated similarly. Next, note that
\[ \mathbb{P}\big(|r^a_1(i)-r^b_1(j)|>\varepsilon\big)=\mathbb{E}_{l_a,l_b}\Big[\mathbb{P}\big(|r^a_1(i)-r^b_1(j)|>\varepsilon\,\big|\,l_a,l_b\big)\Big]. \]
Since the random couple $(r^a_1(i),l_a)$ is independent of $(r^b_1(j),l_b)$ for any fixed $i$ and $j$, the random variable $r^a_1(i)\mid\{l_a,l_b\}=r^a_1(i)\mid l_a$ is also independent of $r^b_1(j)\mid\{l_a,l_b\}=r^b_1(j)\mid l_b$. Therefore, the inner probability on the right-hand side of the above equation can be approximated by estimating the distributions of $r^a_1(i)\mid\{l_a,l_b\}$ and $r^b_1(j)\mid\{l_a,l_b\}$ independently. Let $F_{1,a}(i)$ and $F_{1,b}(j)$ be two independent random variables such that $F_{1,a}(i)=G_{1,a}+(1-G_{1,a})B_{i-l_a,\,n^1_a-i+1}$ in distribution, and $F_{1,b}(j)$ is defined analogously by replacing $a$ with $b$ and $i$ with $j$. Pick $(i,j)$ such that
\[ \mathbb{P}\big(|F_{1,a}(i)-F_{1,b}(j)|>\varepsilon\big)\le\gamma. \tag{4.15} \]
Among these feasible pairs, the one that minimizes the empirical type II error, which can be calculated as $((i-1)+(j-1))/(n^1_a+n^1_b)$, should be selected; i.e., we select the final orders as
\[ (k^*_a,k^*_b)=\operatorname*{arg\,min}_{\text{feasible }(i,j)\text{ satisfying (4.15)}}\ \frac{(i-1)+(j-1)}{n^1_a+n^1_b}. \]
Then our NP-EO classifier is
\[ \hat\phi^*(X,S)=\mathbb{I}\big\{T_a(X)>t^{1,a}_{(k^*_a)}\big\}\cdot\mathbb{I}\{S=a\}+\mathbb{I}\big\{T_b(X)>t^{1,b}_{(k^*_b)}\big\}\cdot\mathbb{I}\{S=b\}. \]
Note that, if there are no $i\in\{l_a+1,\cdots,n^1_a\}$ and $j\in\{l_b+1,\cdots,n^1_b\}$ that satisfy inequality (4.15), we say our algorithm does not provide a viable NP-EO classifier. This is an exception that we have not encountered in our simulation and real-data studies. We summarize the above NP-EO umbrella algorithm in Algorithm 4.2. Note that in Step 8, the NP violation rate control at $\delta/2$ is needed for theoretical purposes (c.f.
Theorem 4.3 and its proof). We will demonstrate through numerical analysis that it suffices to use $\delta$ directly. We also note that the steps to obtain $(k^*_a,k^*_b)$ are summarized as the EO violation algorithm (Step 10) inside Algorithm 4.2, and it is presented separately as Algorithm 4.1 for clarity. Now we are ready to provide a theoretical guarantee for $\hat\phi^*(X,S)$.

Theorem 4.3. Let $\hat\phi^*(\cdot,\cdot)$ be the classifier output by Algorithm 4.2 with parameters $(\alpha,\delta/2,\varepsilon,\gamma)$. Assume that the scoring function $T(\cdot,\cdot)$ is trained such that $T_s(X^{y,s})$ is a continuous random variable whose distribution function is strictly monotone for each $y\in\{0,1\}$ and $s\in\{a,b\}$, and that all the distribution functions of $T_s(X^{y,s})$ have the same support. Furthermore, assume that $n^0_a\ge\log(\delta/2)/\log(1-\alpha)$ and $n^0_b\ge\log(\delta/2)/\log(1-\alpha)$. Then it holds simultaneously that
\[ \text{(a)}\ \mathbb{P}\big(R_0(\hat\phi^*)>\alpha\big)\le\delta \quad\text{and}\quad \text{(b)}\ \mathbb{P}\big(|R^a_1(\hat\phi^*)-R^b_1(\hat\phi^*)|>\varepsilon\big)\le\gamma+\xi(n^1_a,n^1_b), \]

Algorithm 4.1: EO violation algorithm
Input: $l_a,l_b,n_a,n_b$: constants such that $l_a\le n_a$ and $l_b\le n_b$;
  $\varepsilon$: upper bound for the type II error disparity;
  $\gamma$: type II error disparity violation rate target
1  $\hat\gamma\leftarrow 1$
2  $\hat R_1\leftarrow 1$
3  $k^*_a\leftarrow n_a$
4  $k^*_b\leftarrow n_b$
5  for $k_a=l_a+1,l_a+2,\cdots,n_a$ do
6    for $k_b=l_b+1,l_b+2,\cdots,n_b$ do
7      $W_a\leftarrow\mathcal{N}\big(l_a/n_a,\ (l_a/n_a)(1-l_a/n_a)/n_a\big)$
8      $W_b\leftarrow\mathcal{N}\big(l_b/n_b,\ (l_b/n_b)(1-l_b/n_b)/n_b\big)$ ; // $W_a,W_b$ are two normal random variables
9      $B_a\leftarrow\mathrm{Beta}(k_a-l_a,\ n_a-k_a+1)$
10     $B_b\leftarrow\mathrm{Beta}(k_b-l_b,\ n_b-k_b+1)$ ; // $B_a,B_b$ are two Beta-distributed random variables
11     $V_a\leftarrow W_a+(1-W_a)B_a$
12     $V_b\leftarrow W_b+(1-W_b)B_b$
13     $\hat\gamma\leftarrow\mathbb{P}(|V_a-V_b|>\varepsilon)$
14     $\hat R^{\text{new}}_1\leftarrow\big((k_a-1)+(k_b-1)\big)/(n_a+n_b)$
15     if $\hat\gamma\le\gamma$ and $\hat R^{\text{new}}_1<\hat R_1$ then
16       $k^*_a\leftarrow k_a$
17       $k^*_b\leftarrow k_b$
18       $\hat R_1\leftarrow\hat R^{\text{new}}_1$
19     end
20   end
21 end
Output: $(k^*_a,k^*_b)$

in which $\xi(n^1_a,n^1_b)$ converges to $0$ as $n^1_a$ and $n^1_b$ diverge.

Proof. The first assertion in this theorem is simple.
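As an aside before the proof: line 13 of Algorithm 4.1 requires evaluating $\hat\gamma=\mathbb{P}(|V_a-V_b|>\varepsilon)$. The thesis does not prescribe how this probability is computed; a simple Monte Carlo estimate serves as a sketch (ours):

```python
import random

def eo_violation_prob(l_a, l_b, n_a, n_b, k_a, k_b, eps, reps=20000, seed=0):
    """Monte Carlo estimate of P(|V_a - V_b| > eps), where
    V = W + (1 - W) B with W ~ N(l/n, (l/n)(1 - l/n)/n) and
    B ~ Beta(k - l, n - k + 1), as on lines 7-13 of Algorithm 4.1.
    Requires l < k <= n for each group so that the Beta parameters are positive."""
    rng = random.Random(seed)

    def draw(l, n, k):
        w = rng.gauss(l / n, ((l / n) * (1.0 - l / n) / n) ** 0.5)
        b = rng.betavariate(k - l, n - k + 1)
        return w + (1.0 - w) * b

    hits = sum(abs(draw(l_a, n_a, k_a) - draw(l_b, n_b, k_b)) > eps
               for _ in range(reps))
    return hits / reps
```

With such an estimate in hand, the double loop in Algorithm 4.1 simply keeps the feasible pair $(k_a,k_b)$ with the smallest empirical type II error $((k_a-1)+(k_b-1))/(n_a+n_b)$.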
By Proposition 3.1,
\[ \mathbb{P}\Big(\mathbb{P}_{X^{0,a}}\big(T_a(X^{0,a})>t^{0,a}_{(k^{0,a}_*)}\big)>\alpha\Big)\le\delta/2 \quad\text{and}\quad \mathbb{P}\Big(\mathbb{P}_{X^{0,b}}\big(T_b(X^{0,b})>t^{0,b}_{(k^{0,b}_*)}\big)>\alpha\Big)\le\delta/2. \]
Moreover, note that $R_0(\hat\phi^*)$ can be written as
\[ \mathbb{P}_{X^{0,a}}\big(T_a(X^{0,a})>t^{1,a}_{(k^*_a)}\big)\mathbb{P}(S=a\mid Y=0)+\mathbb{P}_{X^{0,b}}\big(T_b(X^{0,b})>t^{1,b}_{(k^*_b)}\big)\mathbb{P}(S=b\mid Y=0). \]

Figure 4.2: A cartoon illustration of the choices of $k^*_a$ and $k^*_b$. For every $\mathcal{T}^{y,s}$, the circles, or squares, in its corresponding row represent its sorted elements, ascending from left to right. The NP umbrella algorithm picks $t^{0,a}_{(k^{0,a}_*)}$ and $t^{0,b}_{(k^{0,b}_*)}$ in $\mathcal{T}^{0,a}$ and $\mathcal{T}^{0,b}$, respectively. Then, $t^{1,a}_{(l_a)}$ and $t^{1,b}_{(l_b)}$ are the largest elements in $\mathcal{T}^{1,a}$ and $\mathcal{T}^{1,b}$ that do not exceed $t^{0,a}_{(k^{0,a}_*)}$ and $t^{0,b}_{(k^{0,b}_*)}$, respectively. Candidates for $k^*_a$ and $k^*_b$ are chosen from $\{l_a+1,\cdots,n^1_a\}$ and $\{l_b+1,\cdots,n^1_b\}$, respectively, such that equation (4.15) is satisfied. Finally, $k^*_a$ and $k^*_b$ are selected such that the overall empirical type II error is minimized.

This, combined with the fact that $t^{1,a}_{(k^*_a)}\ge t^{0,a}_{(k^{0,a}_*)}$ and $t^{1,b}_{(k^*_b)}\ge t^{0,b}_{(k^{0,b}_*)}$, yields
\begin{align*}
\mathbb{P}\big(R_0(\hat\phi^*)>\alpha\big)\le{}&\mathbb{P}\Big(\mathbb{P}_{X^{0,a}}\big(T_a(X^{0,a})>t^{0,a}_{(k^{0,a}_*)}\big)\mathbb{P}(S=a\mid Y=0)>\alpha\,\mathbb{P}(S=a\mid Y=0)\Big)\\
&+\mathbb{P}\Big(\mathbb{P}_{X^{0,b}}\big(T_b(X^{0,b})>t^{0,b}_{(k^{0,b}_*)}\big)\mathbb{P}(S=b\mid Y=0)>\alpha\,\mathbb{P}(S=b\mid Y=0)\Big)\le\delta.
\end{align*}
Next, we proceed with the second assertion. Before presenting the proof, we remark that, as long as $l_a,l_b,n_a$ and $n_b$ are fixed, Algorithm 4.1 is a deterministic procedure. That is, $k^*_a=k^*_a(l_a,l_b,n_a,n_b)$ is a non-random quantity, and neither is $k^*_b$.
Algorithm 4.2: NP-EO algorithm
Input: $\mathcal{S}^{y,s}$: $X$ observations whose label is $y\in\{0,1\}$ and protected attribute is $s\in\{a,b\}$;
  $\alpha$: upper bound for the type I error;
  $\delta$: type I error violation rate target;
  $\varepsilon$: upper bound for the type II error disparity;
  $\gamma$: type II error disparity violation rate target
1  $\mathcal{S}^{y,s}_{\text{train}},\mathcal{S}^{y,s}_{\text{left-out}}\leftarrow$ random split on $\mathcal{S}^{y,s}$ for $y\in\{0,1\}$ and $s\in\{a,b\}$
2  $\mathcal{S}_{\text{train}}\leftarrow\mathcal{S}^{0,a}_{\text{train}}\cup\mathcal{S}^{0,b}_{\text{train}}\cup\mathcal{S}^{1,a}_{\text{train}}\cup\mathcal{S}^{1,b}_{\text{train}}$
3  $T\leftarrow$ base classification algorithm$(\mathcal{S}_{\text{train}})$ ; // $T(\cdot,\cdot):\mathcal{X}\times\{a,b\}\mapsto\mathbb{R}$
4  $T_s(\cdot)\leftarrow T(\cdot,s)$ for $s\in\{a,b\}$
5  $\mathcal{T}^{y,s}\leftarrow T_s(\mathcal{S}^{y,s}_{\text{left-out}})$ for $y\in\{0,1\}$ and $s\in\{a,b\}$
6  $n^y_s\leftarrow|\mathcal{T}^{y,s}|$ for $y\in\{0,1\}$ and $s\in\{a,b\}$
7  $\mathcal{T}^{y,s}=\{t^{y,s}_{(1)},t^{y,s}_{(2)},\cdots,t^{y,s}_{(n^y_s)}\}$ (sorted) for $y\in\{0,1\}$ and $s\in\{a,b\}$
8  $k^{0,s}_*\leftarrow$ the NP umbrella algorithm$(n^0_s,\alpha,\delta/2)$ for $s\in\{a,b\}$
9  $l_s\leftarrow\max\{k\in\{1,2,\cdots,n^1_s\}: t^{1,s}_{(k)}\le t^{0,s}_{(k^{0,s}_*)}\}$ for $s\in\{a,b\}$
10 $k^*_a,k^*_b\leftarrow$ EO violation algorithm$(l_a,l_b,n^1_a,n^1_b,\varepsilon,\gamma)$
Output: $\hat\phi^*(X,S)=\mathbb{I}\{T_a(X)>t^{1,a}_{(k^*_a)}\}\cdot\mathbb{I}\{S=a\}+\mathbb{I}\{T_b(X)>t^{1,b}_{(k^*_b)}\}\cdot\mathbb{I}\{S=b\}$

Now, let us focus on the proof. We denote the classifier given by Algorithm 4.2 as
\[ \hat\phi^*(X,S)=\mathbb{I}\big\{T_a(X)>t^{1,a}_{(k^*_a)}\big\}\mathbb{I}\{S=a\}+\mathbb{I}\big\{T_b(X)>t^{1,b}_{(k^*_b)}\big\}\mathbb{I}\{S=b\}. \]
Let $\xi^a_j=\mathbb{I}\{t^{1,a}_j\le t^{0,a}_{(k^{0,a}_*)}\}$ for every $t^{1,a}_j\in\mathcal{T}^{1,a}$, and $\xi^b_i=\mathbb{I}\{t^{1,b}_i\le t^{0,b}_{(k^{0,b}_*)}\}$ for every $t^{1,b}_i\in\mathcal{T}^{1,b}$. Note that
\[ \mathbb{P}\big(|R^a_1-R^b_1|>\varepsilon\big)=\mathbb{E}_{\mathcal{S}_{\text{train}}}\Big[\mathbb{P}_{\text{left-out}}\big(|R^a_1-R^b_1|>\varepsilon\big)\Big]. \]
The probability $\mathbb{P}_{\text{left-out}}$ is taken with respect to the randomness of all the $\mathcal{S}^{y,s}_{\text{left-out}}$'s. If this inner quantity can be shown to be at most $\gamma$, then $\mathbb{P}\big(|R^a_1-R^b_1|>\varepsilon\big)\le\gamma$. Thus, until the end of the proof, we will only consider the randomness in $\mathbb{P}_{\text{left-out}}$ and take $T_a$ and $T_b$ to be fixed. Next, note that
\[ \mathbb{P}_{\text{left-out}}\big(|R^a_1-R^b_1|>\varepsilon\big)=\mathbb{E}_{\xi}\Big[\mathbb{P}_{\text{left-out}}\big(|R^a_1-R^b_1|>\varepsilon\,\big|\,\xi^a_1,\cdots,\xi^a_{n^1_a},\xi^b_1,\dots,\xi^b_{n^1_b}\big)\Big], \]
where $\mathbb{E}_\xi$ is the expectation taken with respect to $\xi^a_1,\cdots,\xi^a_{n^1_a},\xi^b_1,\dots,\xi^b_{n^1_b}$. Moreover, denote $\xi^a=\big(\xi^a_1,\xi^a_2,\cdots,\xi^a_{n^1_a}\big)$ and $\xi^b=\big(\xi^b_1,\xi^b_2,\cdots,\xi^b_{n^1_b}\big)$.
To this end, we will show that $\mathbb{P}_{\text{left-out}}\big(|R^a_1-R^b_1|>\varepsilon\,\big|\,\xi^a,\xi^b\big)$ is bounded by approximately $\gamma$ with high probability. Consider the quantity
\[ R^a_1=R^a_1(\hat\phi^*)=\mathbb{P}_{\text{left-out}}\big(T_a(X^{1,a})\le t^{1,a}_{(k^*_a)}\big). \]
We remark that, conditional on $\mathcal{S}_{\text{train}}$, $R^a_1$ is solely determined by $\mathcal{S}^{0,a}_{\text{left-out}}$ and $\mathcal{S}^{1,a}_{\text{left-out}}$; thus, $R^a_1$ is independent of $\xi^b$. Furthermore, conditional on $\xi^a$, $k^*_a$ is fixed. Thus, denoting $k^*_a=k_a$, for any $s\in\mathbb{R}$ the conditional distribution function of $R^a_1$ can be written as
\[ \mathbb{P}_{\text{left-out}}\big[R^a_1\le s\,\big|\,\xi^a,\xi^b\big]=\mathbb{P}_{\text{left-out}}\big[R^a_1\le s\,\big|\,\xi^a\big]=\mathbb{E}_{t^{0,a}_{(k^{0,a}_*)}}\Big[\mathbb{P}_{\text{left-out}}\big(R^a_1\le s\,\big|\,\xi^a,t^{0,a}_{(k^{0,a}_*)}\big)\,\Big|\,\xi^a\Big]. \]
Define $G_a=\mathbb{P}_{\text{left-out}}\big(t^{1,a}\le t^{0,a}_{(k^{0,a}_*)}\,\big|\,t^{0,a}_{(k^{0,a}_*)}\big)$, where $t^{1,a}$ is another i.i.d. copy of $t^{1,a}_1,\cdots,t^{1,a}_{n^1_a}$. Then, conditional on $\xi^a$ and $t^{0,a}_{(k^{0,a}_*)}$, $R^a_1$ is equal in distribution to $G_a+(1-G_a)B_a$, where $B_a$ is Beta distributed with parameters $k_a-l_a$ and $n^1_a-k_a+1$, by Lemma 4.7. Here, $l_a=\sum_{j=1}^{n^1_a}\xi^a_j$. Then, since
\begin{align*}
\mathbb{P}_{\text{left-out}}\big[R^a_1\le s\,\big|\,\xi^a\big]&=\mathbb{E}_{t^{0,a}_{(k^{0,a}_*)}}\Big[\mathbb{P}_{\text{left-out}}\big(R^a_1\le s\,\big|\,\xi^a,t^{0,a}_{(k^{0,a}_*)}\big)\,\Big|\,\xi^a\Big]\\
&=\mathbb{E}_{t^{0,a}_{(k^{0,a}_*)}}\big[\mathbb{P}_{B_a}\big(G_a+(1-G_a)B_a\le s\big)\,\big|\,\xi^a\big]\\
&=\mathbb{E}_{B_a}\Big[\mathbb{P}_{t^{0,a}_{(k^{0,a}_*)}}\big(G_a+(1-G_a)B_a\le s\,\big|\,\xi^a\big)\Big]\\
&=\mathbb{E}_{B_a}\big[\mathbb{P}_{F_a}\big(F_a+(1-F_a)B_a\le s\big)\big],
\end{align*}
where $F_a$ is a random variable such that $\mathbb{P}_{F_a}(F_a\le t)=\mathbb{P}_{t^{0,a}_{(k^{0,a}_*)}}(G_a\le t\mid\xi^a)$ for any constant $t$. Here, we use the fact that $G_a$ is constant conditionally on $t^{0,a}_{(k^{0,a}_*)}$ for the second equality, and that $B_a$ is independent of $t^{0,a}_{(k^{0,a}_*)}$ and $\xi^a$ for the third equality. Therefore, the distribution of $R^a_1\mid\xi^a$ is equal to that of $F_a+(1-F_a)B_a$. Similarly, $R^b_1\mid\xi^b$ has the same distribution as $F_b+(1-F_b)B_b$, where $F_b$ and $B_b$ are defined analogously.

Let $V_a=F_a+(1-F_a)B_a$ and $V_b=F_b+(1-F_b)B_b$. Given that $R^a_1$ and $R^b_1$ are independent,
\[ \mathbb{P}_{\text{left-out}}\big(|R^a_1-R^b_1|>\varepsilon\,\big|\,\xi^a,\xi^b\big)=\mathbb{P}_{\text{left-out}}\big(|V_a-V_b|>\varepsilon\,\big|\,\xi^a,\xi^b\big). \]
It remains to show that the distributions of $F_a$ and $F_b$ are close to the Gaussian distributions described in Algorithm 4.1. This is true by the Bernstein–von Mises theorem.
In detail, it is not hard to see that $\mathbb{P}_{F_a}(F_a\le t)=\mathbb{P}_{t^{0,a}_{(k^{0,a}_*)}}(G_a\le t\mid\xi^a)$ is exactly the posterior distribution of $G_a$ given $\xi^a$. One can show that $\xi^a$ is exactly a vector of independent Bernoulli random variables with success rate $G_a$. Moreover, for fixed $t^{0,a}_{(k^{0,a}_*)}$, $l_a/n^1_a$ is the maximum likelihood estimator of $G_a$ by the definition of $l_a$ in display (4.14). Then, the Bernstein–von Mises theorem (see Theorem 2.1) states that
\[ \Big\|\,\mathbb{P}_{F_a}\big(F_a\in\cdot\,\big|\,\xi^a,\xi^b\big)-\mathcal{N}\Big(\frac{l_a}{n^1_a},\ \frac{G^*_a(1-G^*_a)}{n^1_a}\Big)\Big\|_{\mathrm{TV}}\to 0 \]
in probability, where $G^*_a$ is the true success probability of the Bernoulli distribution from which the Bernoulli trials $\xi^a$ are generated. Furthermore,
\[ \Big\|\,\mathcal{N}\Big(\frac{l_a}{n^1_a},\ \frac{G^*_a(1-G^*_a)}{n^1_a}\Big)-\mathcal{N}\Big(\frac{l_a}{n^1_a},\ \frac{(l_a/n^1_a)(1-l_a/n^1_a)}{n^1_a}\Big)\Big\|_{\mathrm{TV}}\to 0 \]
in probability, as $l_a/n^1_a$ converges to $G^*_a$ in probability. Therefore,
\[ \Big\|\,\mathbb{P}_{F_a}\big(F_a\in\cdot\,\big|\,\xi^a,\xi^b\big)-\mathcal{N}\Big(\frac{l_a}{n^1_a},\ \frac{(l_a/n^1_a)(1-l_a/n^1_a)}{n^1_a}\Big)\Big\|_{\mathrm{TV}}\to 0 \]
in probability. That is, for any $\varepsilon',\gamma'$ and sufficiently large $n^1_a$,
\[ \sup_B\Big|\mathbb{P}_{F_a}\big(F_a\in B\,\big|\,\xi^a,\xi^b\big)-\mathbb{P}_{Z_a}(Z_a\in B)\Big|\le\varepsilon' \]
with probability at least $1-\gamma'/2$, where $Z_a\sim\mathcal{N}\big(l_a/n^1_a,\ (l_a/n^1_a)(1-l_a/n^1_a)/n^1_a\big)$. Here, the supremum is taken over all measurable sets. Similarly, for sufficiently large $n^1_b$,
\[ \sup_B\Big|\mathbb{P}_{F_b}\big(F_b\in B\,\big|\,\xi^a,\xi^b\big)-\mathbb{P}_{Z_b}(Z_b\in B)\Big|\le\varepsilon' \]
with probability at least $1-\gamma'/2$. Therefore, denoting $V'_a=Z_a+(1-Z_a)B_a$ and $V'_b=Z_b+(1-Z_b)B_b$, with probability at least $1-\gamma'$,
\begin{align*}
\mathbb{P}_{\text{left-out}}\big(|R^a_1-R^b_1|>\varepsilon\,\big|\,\xi^a,\xi^b\big)&=\mathbb{E}_{F_a}\mathbb{E}_{F_b}\mathbb{E}_{B_a}\mathbb{E}_{B_b}\big[\mathbb{I}\{|V_a-V_b|>\varepsilon\}\big]\\
&\le\mathbb{E}_{Z_a}\mathbb{E}_{Z_b}\mathbb{E}_{B_a}\mathbb{E}_{B_b}\big[\mathbb{I}\{|V'_a-V'_b|>\varepsilon\}\big]+\varepsilon'+(\varepsilon')^2.
\end{align*}
The expectation term on the right-hand side of the inequality is at most $\gamma$ by the design of Algorithm 4.2. Therefore,
\[ \mathbb{E}_\xi\Big[\mathbb{P}_{\text{left-out}}\big(|R^a_1-R^b_1|>\varepsilon\,\big|\,\xi^a,\xi^b\big)\Big]\le\gamma+\varepsilon'+(\varepsilon')^2+\gamma'. \]
Let $\xi(n^1_a,n^1_b)=\varepsilon'+(\varepsilon')^2+\gamma'$. Then $\xi(n^1_a,n^1_b)$ is a function of $n^1_a$ and $n^1_b$ that converges to $0$ as $n^1_a$ and $n^1_b$ go to infinity, and hence the proof is finished.

Here, we present Lemma 4.7, which was used in the previous proof.

Lemma 4.7.
Conditionally on $\xi^a$, $t^{0,a}_{(k^{0,a}_*)}$ and $T$, $R^a_1$ is equal in distribution to $G_a+(1-G_a)B_a$. Furthermore, $B_a$ is independent of $\xi^a$, $t^{0,a}_{(k^{0,a}_*)}$ and $G_a$.

Proof. For $\mathcal{T}^{1,a}=\{t^{1,a}_1,t^{1,a}_2,\cdots,t^{1,a}_{n^1_a}\}$ and $j=1,2,\cdots,n^1_a$, let $U_j=\mathbb{P}_{\text{left-out}}\big(T_a(X^{1,a})\le t^{1,a}_j\big)$. Then the $U_j$'s are independent random variables uniformly distributed on $(0,1)$. To see this, denote $F^{1,a}(z)=\mathbb{P}_{\text{left-out}}\big(T_a(X^{1,a})\le z\big)$; then, for any $z\in(0,1)$,
\[ \mathbb{P}_{\text{left-out}}(U_j\le z)=\mathbb{P}_{\text{left-out}}\big(F^{1,a}(t^{1,a}_j)\le z\big)=\mathbb{P}_{\text{left-out}}\big(t^{1,a}_j\le (F^{1,a})^{-1}(z)\big)=F^{1,a}\big((F^{1,a})^{-1}(z)\big)=z, \]
and the independence is guaranteed by the independence of the $t^{1,a}_j$'s. Moreover, for fixed $t^{0,a}_{(k^{0,a}_*)}$,
\[ \xi^a_j=\mathbb{I}\big\{t^{1,a}_j\le t^{0,a}_{(k^{0,a}_*)}\big\}=\mathbb{I}\big\{U_j\le F^{1,a}\big(t^{0,a}_{(k^{0,a}_*)}\big)\big\}. \]
Then, conditional on $t^{0,a}_{(k^{0,a}_*)}$, $F^{1,a}$ and $\xi^a$, the assertion is given by Lemma 4.1.

In Theorem 4.3, the conditions on the distributions of $T_s(X^{y,s})$ ensure that the Bernstein–von Mises theorem can be invoked. Indeed, take the $\{S=a\}$ part for example: this theorem is applied to the binomial sample $l_a$ defined in equation (4.14), whose success probability is the random quantity $\mathbb{P}_{X^{1,a}}\big(T_a(X^{1,a})\le t^{0,a}_{(k^{0,a}_*)}\big)$. The key issue here is that this random probability needs to be in the interior of $[0,1]$ with probability $1$, which is guaranteed by the assumptions on $T_s(X^{y,s})$. Next, the assumptions on $n^0_a$ and $n^0_b$, adapted from [83], are mild sample size requirements that ensure the high-probability NP constraint (c.f. part (a) of Theorem 4.3).

We note that part (b) of Theorem 4.3 states that the type II error disparity violation rate can be controlled by $\gamma$ plus a term that vanishes asymptotically. This extra term is the price for the errors of the Gaussian approximation to the distributions of $r^a_1$ and $r^b_1$; it is asymptotically negligible by the Bernstein–von Mises theorem.

4.6 Numerical results

In this section, we present simulation and real-data evidence that supports the effectiveness of the newly proposed NP-EO algorithms.
In each simulation setting, all trained algorithms are evaluated on a large test set to approximate the (population-level) type I and type II errors. This procedure is repeated 1,000 times, so that 1,000 copies of estimated type I and type II errors are acquired. Then, the NP violation rate is computed as the proportion of type I errors exceeding the target level defined in the NP constraint. Similarly, the EO violation rate is computed as the proportion of type II error disparities exceeding the target level defined in the EO constraint. Finally, recall that for the NP-EO algorithm, we use δ, instead of δ/2, in Algorithm 4.2.

4.6.1 Simulation.

In all settings, for each y ∈ {0,1} and s ∈ {a,b}, we generate n_{y,s} training observations and 100 n_{y,s} test observations. We compare the NP-EO algorithm with three existing algorithms, namely, the classical algorithm, the NP umbrella algorithm, and the NP umbrella algorithm mixed with random guesses. Here, the classical algorithm (e.g., logistic regression, support vector machines) is the base algorithm without any adjustment for either the NP or the EO constraint. The NP umbrella algorithm adjusts base algorithms for the NP constraint and is described in Proposition 3.1. The NP umbrella algorithm mixed with random guesses, inspired by [37], is a natural baseline under the NP-EO paradigm and works as follows. We start with two classifiers: the first is an NP classifier, ϕ̂_NP, trained by the NP umbrella algorithm; the second is a random-guess classifier ϕ_RG with P(ϕ_RG = 1) = α. Without loss of generality, we assume R^a_1(ϕ̂_NP) > R^b_1(ϕ̂_NP). Then, for an observation in the test sample such that S = a, we use ϕ̂_NP only; for an observation with S = b, ϕ̂_NP is selected with probability p to classify this observation, and with probability 1−p, ϕ_RG is used. Note that R^b_1(ϕ_RG) = 1−α.
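To make the mixing step concrete, here is a minimal sketch of such a mixed classifier; the thresholding classifier `phi_np` below is a hypothetical stand-in for a classifier trained by the NP umbrella algorithm, and the helper name `mixed_classifier` is ours, not from the text.

```python
import numpy as np

def mixed_classifier(x, s, phi_np, p, alpha, rng):
    """NP classifier mixed with random guesses (illustrative sketch).

    For group "a" the NP classifier is always used; for group "b" it is used
    with probability p, and otherwise a random guess that predicts 1 with
    probability alpha is used.
    """
    if s == "a" or rng.random() < p:
        return phi_np(x)
    return int(rng.random() < alpha)  # random guess with P(prediction = 1) = alpha

# Toy usage: a stand-in "NP classifier" thresholding the first coordinate.
rng = np.random.default_rng(0)
phi_np = lambda x: int(x[0] > 0.5)
preds = [mixed_classifier(np.array([0.7]), "b", phi_np, p=0.8, alpha=0.05, rng=rng)
         for _ in range(1000)]
# For this x, the long-run fraction of 1's is p * 1 + (1 - p) * alpha = 0.81.
```

Averaging over the randomness of the guess yields exactly the mixture of group-conditional type II errors described above.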
Then, for this mixed classifier ϕ̂_mixed, R^a_1(ϕ̂_mixed) = R^a_1(ϕ̂_NP) and R^b_1(ϕ̂_mixed) = p R^b_1(ϕ̂_NP) + (1−p)(1−α). As long as ϕ̂_NP is more powerful than ϕ_RG on group a, i.e., R^a_1(ϕ̂_NP) ≤ 1−α, ϕ̂_mixed can achieve equality of opportunity by choosing p properly.

In this simulation, we choose the probability p by 20-fold cross-validation: we train an NP classifier on 19 folds of the training data and compute the estimated R^a_1(ϕ̂_NP) and R^b_1(ϕ̂_NP) on the left-out fold. Since R^a_1(ϕ_RG) and R^b_1(ϕ_RG) are explicit, we can directly estimate R^a_1(ϕ̂_mixed), R^b_1(ϕ̂_mixed), and thus the type II error disparity, for every value of p and for the option of adding random guesses for either S = a or S = b. We traverse all combinations of p = 0, 0.1, 0.2, 0.3, …, 0.9 and the options of adding random guesses to both S components. Next, for every combination, we calculate the estimated type II error disparity on every fold and thus estimate the probability of the type II error disparity exceeding ε. Finally, we select the combination for which this estimated probability is smaller than or equal to γ. If there are multiple such combinations, we select the one with the largest p. The resulting ϕ̂_mixed then satisfies the high-probability NP and EO constraints.

Simulation 4.1. Let X^{y,s} be multivariate Gaussian with mean μ_{y,s} and covariance matrix Σ_{y,s} for each y ∈ {0,1} and s ∈ {a,b}. Here, μ_{0,a} = (1,2,1)^⊤, μ_{1,a} = (0,0,0)^⊤, μ_{0,b} = (0,0,2)^⊤ and μ_{1,b} = (1,0,−1)^⊤. Moreover,
\[
\Sigma_{y,a} = \begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix}
\quad\text{and}\quad
\Sigma_{y,b} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\]
for every y ∈ {0,1}. Furthermore, n_{0,a} = 800, n_{1,a} = 400, n_{0,b} = 1200 and n_{1,b} = 1600. We set α = 0.05, δ = 0.05, ε = 0.2 and γ = 0.05. The base algorithm used is logistic regression. The numerical results associated with this simulation are reported in Table 4.1.

Simulation 4.2.
Let X^{y,s} be uniformly distributed in a three-dimensional ball B_{y,s} with radius 1 and centered at O_{y,s}, where O_{0,a} = (0,0,0)^⊤, O_{1,a} = (1,0,−1)^⊤, O_{0,b} = (1,1,1)^⊤ and O_{1,b} = (−1,1,0)^⊤. Furthermore, n_{0,a} = 800, n_{1,a} = 400, n_{0,b} = 1200 and n_{1,b} = 1600. We also set α = 0.05, δ = 0.05, ε = 0.2 and γ = 0.05. The base algorithm used is logistic regression. The numerical results associated with this simulation are reported in Table 4.2.

Table 4.1: (Approximate) averages of type I and II errors, along with (approximate) violation rates of the NP and EO conditions over 1,000 repetitions for Simulation 4.1. Standard errors of the means (×10⁻⁴) in parentheses.

                              average type I error   average type II error   NP violation rate   EO violation rate
NP-EO                         .012 (1.13)            .480 (12.75)            0 (0)               .046 (66.28)
NP mixed with random guess    .039 (1.50)            .741 (7.25)             .015 (38.46)        0 (0)
NP                            .039 (1.97)            .163 (3.78)             .047 (66.96)        1 (0)
classical                     .094 (1.79)            .096 (1.35)             1 (0)               1 (0)

Table 4.2: (Approximate) averages of type I and II errors, along with (approximate) violation rates of the NP and EO conditions over 1,000 repetitions for Simulation 4.2. Standard errors of the means (×10⁻⁴) in parentheses.

                              average type I error   average type II error   NP violation rate   EO violation rate
NP-EO                         .012 (1.12)            .478 (12.67)            0 (0)               .053 (70.88)
NP mixed with random guess    .035 (1.55)            .588 (21.11)            .006 (24.43)        .009 (29.88)
NP                            .034 (2.45)            .191 (6.43)             .029 (53.09)        1 (0)
classical                     .094 (1.88)            .094 (1.39)             1 (0)               1 (0)

In both simulation scenarios, the classical classifier admits the lowest type II error; the NP classifier comes in second place. This is not surprising, as the NP paradigm controls the type I error to a low level with high probability, thereby resulting in a higher type II error. The NP and EO violation rates are both higher than the target levels for the classical classifier, whereas the NP classifier fails to keep the EO violation rate under the target level.
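The violation rates reported in these tables come from the repeated-evaluation procedure described at the beginning of this section. A minimal sketch of that bookkeeping, using small hypothetical arrays of per-repetition error estimates in place of the 1,000 actual repetitions:

```python
import numpy as np

# Hypothetical per-repetition estimates: each entry would come from evaluating
# one trained classifier on a large test set (four repetitions for brevity).
type1   = np.array([0.03, 0.06, 0.04, 0.07])  # estimated type I errors
type2_a = np.array([0.45, 0.50, 0.60, 0.40])  # estimated type II errors, group a
type2_b = np.array([0.50, 0.80, 0.65, 0.42])  # estimated type II errors, group b

alpha, eps = 0.05, 0.2
# NP violation rate: fraction of repetitions whose type I error exceeds alpha.
np_violation_rate = float(np.mean(type1 > alpha))                    # 0.5 here
# EO violation rate: fraction whose type II error disparity exceeds eps.
eo_violation_rate = float(np.mean(np.abs(type2_a - type2_b) > eps))  # 0.25 here
```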
These two classifiers adopt no design for EO adjustments; thus, it is expected that the EO requirement would fail. The remaining two algorithms, NP-EO and NP mixed with random guesses, are built to achieve the high-probability NP and EO constraints. Both algorithms produce an overall type II error larger than that of the NP paradigm. This is the price paid for equality in our classification algorithms. For reference, we remark that the "nearly trivial" NP-EO classifier, a random guess that returns 1 with probability 0.05 and 0 otherwise, has an overall type II error as high as 0.95. Benchmarked against this result, the classifiers listed in Tables 4.1 and 4.2 have much smaller type II errors. Moreover, in terms of the overall type II error, NP-EO clearly outperforms NP mixed with random guesses, suggesting the effectiveness of our proposed algorithm. In conclusion, the two simulation studies illustrate that the proposed algorithms under the NP-EO paradigm are able to achieve the goals of regulating equality of opportunity and controlling the type I error, while paying only a modest price in terms of the less consequential type II error.

4.6.2 Real data analysis.

In many countries, lenders' discrimination against certain social groups on grounds other than creditworthiness is either illegal or socially unacceptable. Most notably, the Equal Credit Opportunity Act in the US explicitly makes it unlawful for any creditor to discriminate against any applicant on the basis of race, color, sex, and other non-credit-related social factors. Nevertheless, ample evidence shows that Hispanic and Black borrowers have less access to credit or pay higher prices for mortgage loans in the US [67, 25, 36, 12]. With the emergence of the FinTech market, statistical and machine learning techniques have gained increasing popularity in lending decisions by both traditional financial institutions and peer-to-peer lending and crowd-sourcing platforms. An important regulatory concern in this development is whether algorithmic decision-making promotes or impedes impermissible discrimination. Recently, [10] show that algorithmic lending reduces rate disparities between Latinx/African-American borrowers and other borrowers in consumer-lending markets but cannot eliminate the bias. [32] find that, in the US mortgage market, Black and Hispanic borrowers are disproportionately less likely to gain from the introduction of machine learning in lending decisions. Central to the welfare judgement of algorithmic lending is the tradeoff between efficiency (controlling default risk) and equality (non-disparate treatment). In this section, we illustrate how our proposed algorithms can help address this question with an example of potential gender bias in credit card consumption in Taiwan.

We focus on this case for two reasons. First, gender discrimination is a significant phenomenon in credit lending markets worldwide. [2] find that Italian women pay more for overdraft facilities than men. [13] and [4] show that female entrepreneurs face tighter credit availability in Italy and Spain. [73] document a strong correlation between gender bias and credit access across developing countries. Second, practically, the Taiwanese credit card dataset is simple, transparent, and has clear labelling of payment status, which enables an analysis of financial risk. The dataset is from [88] and has been widely used to evaluate various data mining techniques. It describes the given credit, demographic features, and payment history of 30,000 individuals. Importantly, it includes a binary payment status: either default, encoded by 0, or non-default, encoded by 1. The payment status defines the type I/II errors in the classification problem, while the protected attribute is gender.
Fitting such a typical credit-lending problem into the NP-EO classification framework, banks primarily want to control the risk of misclassifying someone who will default as non-default (type I error), although they also want to minimize the chance of turning away non-defaulters (type II error). Furthermore, by regulation or as a social norm, fairness requires banks not to discriminate against qualified applicants on the basis of gender. Therefore, to achieve the dual goal of risk control and fairness, our classification problem needs to satisfy the NP constraint, i.e., a control on the overall type I error, and the EO constraint, i.e., an upper bound on the difference between gender-conditional type II errors. We also note that, since we already illustrated in Section 4.6.1 that the NP classifier mixed with random guesses performs worse than our proposed algorithms in all simulation settings, we do not include this classifier in this real-data section.

Table 4.3: (Approximate) averages of type I and II errors, along with (approximate) violation rates of the NP and EO conditions over 1,000 repetitions for the credit card dataset. Standard errors of the means (×10⁻⁴) in parentheses.

             average type I error   average type II error   NP violation rate   EO violation rate
NP-EO        .081 (3.11)            .720 (6.65)             .033 (56.52)        .034 (57.34)
NP           .088 (3.02)            .700 (6.26)             .111 (99.39)        .482 (158.10)
classical    .633 (4.02)            .059 (1.31)             1 (0)               0 (0)

We use 1/3 of the data for training and the other 2/3 for testing, with stratification in both the protected attribute and the label. As an illustrative example, we set α = 0.1, δ = 0.1, ε = 0.05 and γ = 0.1. The base algorithm used is random forest. The process is repeated 1,000 times, and the numerical results are presented in Table 4.3. Using the classifier trained under the classical paradigm, the high-probability EO constraint is satisfied.
Indeed, the EO violation rate in Table 4.3 is 0, indicating that the random forest under the classical paradigm is "fair" and "equal" in terms of gender. This is not entirely surprising, given that gender bias in modern Taiwan is not a significant concern. The problem with this classifier is that it produces a type I error of 0.633, which is prohibitively high for nearly any financial institution. Benchmarked against the modest NP constraint (α = 0.1), the violation rate is 1, imposing too much risk on the banks. When the NP paradigm alone is employed, the EO violation rate surges to 0.482, demonstrating a conflict between the banks' private gain of improved risk control and society's loss from forgone fairness. When the NP-EO algorithm is employed, both the NP and EO constraints are satisfied with very small violation rates, and the classifier simultaneously achieves the goals of risk control and fairness. The cost that the banks have to bear is missing some potential business opportunities from non-defaulters, which is reflected in the higher overall type II error committed by the NP-EO algorithm.

4.7 Discussion.

This chapter is motivated by two practical needs in algorithmic design: a private user's need to internalize social considerations and a social planner's need to facilitate private users' compliance with regulation. The challenge in fulfilling these needs stems from the conflict between the private and social goals. Notably, the social planner's promotion of fairness and equality may constrain private users' pursuit of profits and efficiency. In an ideal world without measurement and sampling problems, such a private-public conflict would best be resolved by maximizing a social welfare function with well-defined private and public components. Statistical tools hardly play any role in this process.
However, when knowledge about the social welfare function is partial, measurement of each component in the objective is imperfect, and the consequences of predictive errors are uncertain, statistical innovation is called for in resolving the private-public conflict. Our work is a response to this challenge.

In the classification setting, we propose the NP-EO paradigm, in which we incorporate a social consideration into a constrained optimization problem, with the less important private goal (type II error) as the objective and the social goal (equal opportunity) and the more important private goal (type I error) as constraints. Algorithmic decisions with such restrictions provide safeguards against deviations from the social goal and avoid significant damage to the private goal, leaving the private-social conflict mostly absorbed by the less consequential private consideration. We believe that our approach can be applied to a wide range of settings beyond the problem we are handling in this thesis.

We do not claim that our proposed NP-EO paradigm is superior to other classification paradigms. Rather, we are proposing an alternative framework to handle private-social conflicts in algorithmic design. Central to our analysis is a perspective of gaining security through statistical control when multiple objectives have to be compromised. Key to our methodological innovation is a principled way to redistribute specific errors so that the resulting classifiers have high-probability statistical guarantees.

Possible future research directions include, but are not limited to: (i) replacing the type I error constraint by other efficiency constraints, and the EO constraint by other fairness criteria; and (ii) working with parametric models, such as the linear discriminant analysis (LDA) model, to derive model-specific NP-EO classifiers that address small-sample-size problems and satisfy oracle-type inequalities.
Bibliography

[1] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. “A reductions approach to fair classification”. In: International Conference on Machine Learning. PMLR. 2018, pp. 60–69.
[2] Alberto F. Alesina, Francesca Lotti, and Paolo Emilio Mistrulli. “Do Women Pay More for Credit? Evidence from Italy”. In: Journal of the European Economic Association 11 (1 2013), pp. 45–66.
[3] Noga Alon, Yossi Matias, and Mario Szegedy. “The space complexity of approximating the frequency moments”. In: Journal of Computer and System Sciences 58.1 (1999), pp. 137–147.
[4] Pablo de Andrés, Ricardo Gimeno, and Ruth Mateos de Cabo. “The gender gap in bank credit access”. In: Journal of Corporate Finance 71 (2021).
[5] David Arnold, Will Dobbie, and Crystal S Yang. “Racial bias in bail decisions”. In: The Quarterly Journal of Economics 133.4 (2018), pp. 1885–1932.
[6] Jean-Yves Audibert, Olivier Catoni, et al. “Robust linear least squares regression”. In: The Annals of Statistics 39.5 (2011), pp. 2766–2794.
[7] Yannick Baraud, Lucien Birgé, et al. “Robust Bayes-like estimation: Rho-Bayes estimation”. In: Annals of Statistics 48.6 (2020), pp. 3699–3720.
[8] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning. http://www.fairmlbook.org. fairmlbook.org, 2019.
[9] Solon Barocas and Andrew D Selbst. “Big data’s disparate impact”. In: Calif. L. Rev. 104 (2016), p. 671.
[10] Robert Bartlett, Adair Morse, Richard Stanton, and Nancy Wallace. “Consumer-lending discrimination in the FinTech era”. In: Journal of Financial Economics 143.1 (2022), pp. 30–56.
[11] M J Bayarri and James O Berger. “Robust Bayesian bounds for outlier detection”. In: Recent Advances in Statistics and Probability (1994), pp. 175–190.
[12] Patrick Bayer, Fernando Ferreira, and Stephen L. Ross. “What Drives Racial and Ethnic Differences in High-Cost Mortgages? The Role of High-Risk Lenders”. In: The Review of Financial Studies 31 (1 2018), pp. 175–205.
[13] Andrea Bellucci, Alexander Borisov, and Alberto Zazzaro. “Does Gender Matter in Bank-Firm Relationships? Evidence from Small Business Lending”. In: Journal of Banking and Finance 34 (12 2010), pp. 2968–2984.
[14] Anirban Bhattacharya, Debdeep Pati, and Yun Yang. “Bayesian fractional posteriors”. In: The Annals of Statistics 47.1 (2019), pp. 39–66.
[15] Pier Giovanni Bissiri, Chris C Holmes, and Stephen G Walker. “A general framework for updating belief distributions”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78.5 (2016), pp. 1103–1130.
[16] Gilles Blanchard, Marek Flaska, Gregory Handy, Sara Pozzi, and Clayton Scott. “Classification with asymmetric label noise: Consistency and maximal denoising”. In: Electronic Journal of Statistics 10.2 (2016), pp. 2780–2824. doi: 10.1214/16-ejs1193.
[17] Pavel Brazdil and Kurt Konolige. “Machine Learning, Meta-Reasoning and Logics”. In: Springer (1990).
[18] Carla Brodley and Mark Friedl. “Identifying mislabeled training data”. In: Journal of Artificial Intelligence Research 11 (1999), pp. 131–167.
[19] Carla E Brodley and Mark A Friedl. “Identifying mislabeled training data”. In: Journal of Artificial Intelligence Research 11 (1999), pp. 131–167.
[20] Christian Brownlees, Emilien Joly, and Gábor Lugosi. “Empirical risk minimization for heavy-tailed losses”. In: Annals of Statistics 43.6 (2015), pp. 2507–2536.
[21] Timothy I Cannings, Yingying Fan, and Richard J Samworth. “Classification with imperfect training labels”. In: Biometrika 107.2 (2020), pp. 311–330.
[22] Adam Cannon, James Howse, Don Hush, and Clint Scovel. “Learning with the Neyman-Pearson and min-max criteria”. In: Los Alamos National Laboratory, Tech. Rep. LA-UR (2002), pp. 02–2951.
[23] Jingjing Cao, Sam Kwong, and Ran Wang. “A noise-detection based AdaBoost algorithm for mislabeled data”. In: Pattern Recognition 45.12 (2012), pp. 4451–4465.
[24] Aaron Chalfin, Oren Danieli, Andrew Hillis, Zubin Jelveh, Michael Luca, Jens Ludwig, and Sendhil Mullainathan. “Productivity and selection of human capital with machine learning”. In: American Economic Review 106.5 (2016), pp. 124–27.
[25] Kerwin Kofi Charles, Erik Hurst, and Melvin Stephens. “Rates for Vehicle Loans: Race and Loan Source”. In: American Economic Review 98 (2 2008), pp. 315–320.
[26] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. “Algorithmic Decision Making and the Cost of Fairness”. In: arXiv:1701.08230 (2017).
[27] Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. “Modeling wine preferences by data mining from physicochemical properties”. In: Decision Support Systems 47.4 (2009), pp. 547–553.
[28] Bo Cowgill and Catherine E Tucker. “Algorithmic fairness and economics”. In: Columbia Business School Research Paper (2020).
[29] Ilias Diakonikolas and Daniel M Kane. “Recent advances in algorithmic high-dimensional robust statistics”. In: arXiv preprint arXiv:1911.05911 (2019).
[30] Kjell A Doksum and Albert Y Lo. “Consistent and robust Bayes procedures for location based on partial information”. In: The Annals of Statistics (1990), pp. 443–453.
[31] Michele Donini, Luca Oneto, Shai Ben-David, John S Shawe-Taylor, and Massimiliano Pontil. “Empirical risk minimization under fairness constraints”. In: Advances in Neural Information Processing Systems 31 (2018).
[32] Andreas Fuster, Paul Goldsmith-Pinkham, Tarun Ramadorai, and Ansgar Walther. “Predictably unequal? The effects of machine learning on credit markets”. In: The Journal of Finance 77.1 (2022), pp. 5–47.
[33] Abhik Ghosh and Ayanendranath Basu. “Robust Bayes estimation using the density power divergence”. In: Annals of the Institute of Statistical Mathematics 68.2 (2016), pp. 413–437.
[34] Aritra Ghosh, Naresh Manwani, and PS Sastry. “Making risk minimization tolerant to label noise”. In: Neurocomputing 160 (2015), pp. 93–107.
[35] Isabelle Guyon, Nada Matic, Vladimir Vapnik, et al. Discovering Informative Patterns and Data Cleaning. 1996.
[36] Andrew Hanson, Zackary Hawley, Hal Martin, and Bo Liu. “Discrimination in Mortgage Lending: Evidence from a Correspondence Experiment”. In: Journal of Urban Economics 92 (2016), pp. 48–65.
[37] Moritz Hardt, Eric Price, and Nati Srebro. “Equality of opportunity in supervised learning”. In: Advances in Neural Information Processing Systems. 2016, pp. 3315–3323.
[38] Ray J Hickey. “Noise modelling and evaluating learning from examples”. In: Artificial Intelligence 82.1-2 (1996), pp. 157–179.
[39] Peter D Hoff. “Extending the rank likelihood for semiparametric copula estimation”. In: The Annals of Applied Statistics (2007), pp. 265–283.
[40] Matthew D Hoffman and Andrew Gelman. “The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo”. In: J. Mach. Learn. Res. 15.1 (2014), pp. 1593–1623.
[41] Giles Hooker and Anand N Vidyashankar. “Bayesian model robustness via disparities”. In: Test 23.3 (2014), pp. 556–584.
[42] Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. “Spambase data set”. In: Hewlett-Packard Labs 1.7 (1999).
[43] Mark R Jerrum, Leslie G Valiant, and Vijay V Vazirani. “Random generation of combinatorial structures from a uniform distribution”. In: Theoretical Computer Science 43 (1986), pp. 169–188.
[44] Jack Jewson, Jim Q Smith, and Chris Holmes. “Principles of Bayesian inference using general divergence criteria”. In: Entropy 20.6 (2018), p. 442.
[45] Roni Khardon and Gabriel Wachman. “Noise tolerant variants of the perceptron algorithm”. In: Journal of Machine Learning Research 8.Feb (2007), pp. 227–248.
[46] Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. “Human decisions and machine predictions”. In: The Quarterly Journal of Economics 133.1 (2018), pp. 237–293.
[47] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan.
“Inherent trade-offs in the fair determination of risk scores”. In: Proceedings of the 8th Conference on Innovation in Theoretical Computer Science 43.23 (2017), pp. 1–43.
[48] Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. “Generalized variational inference: Three arguments for deriving new posteriors”. In: arXiv preprint arXiv:1904.02063 (2019).
[49] Peter A Lachenbruch. “Discriminant analysis when the initial samples are misclassified”. In: Technometrics 8.4 (1966), pp. 657–662.
[50] Peter A Lachenbruch. “Note on initial misclassification effects on the quadratic discriminant function”. In: Technometrics 21.1 (1979), pp. 129–132.
[51] Anja Lambrecht and Catherine Tucker. “Algorithmic bias? An empirical study of apparent gender-based discrimination in the display of STEM career ads”. In: Management Science 65.7 (2019), pp. 2966–2981.
[52] Guillaume Lecué and Matthieu Lerasle. “Robust machine learning by median-of-means: theory and practice”. In: Annals of Statistics 48.2 (2020), pp. 906–931.
[53] Matthieu Lerasle and Roberto I Oliveira. “Robust empirical mean estimators”. In: arXiv preprint arXiv:1112.3914 (2011).
[54] Tongliang Liu and Dacheng Tao. “Classification with noisy labels by importance reweighting”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 38.3 (2016), pp. 447–461.
[55] Gabor Lugosi and Shahar Mendelson. “Risk minimization by median-of-means tournaments”. In: Journal of the European Mathematical Society 22.3 (2019), pp. 925–965.
[56] Gábor Lugosi and Shahar Mendelson. “Mean estimation and regression under heavy-tailed distributions: A survey”. In: Foundations of Computational Mathematics 19.5 (2019), pp. 1145–1190.
[57] Owen MacDonald. “Physician Perspectives on Preventing Diagnostic Errors”. In: https://www.kff.org/wp-content/uploads/sites/2/2013/05/quantiamd_preventingdiagnosticerrors_whitepaper_1.pdf (2011).
[58] Naresh Manwani and PS Sastry. “Noise tolerance under risk minimization”.
In: IEEE Transactions on Cybernetics 43.3 (2013), pp. 1146–1151.
[59] Pascal Massart. “The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality”. In: The Annals of Probability (1990), pp. 1269–1283.
[60] Timothée Mathieu and Stanislav Minsker. “Excess risk bounds in robust empirical risk minimization”. In: Information and Inference: A Journal of the IMA (2021).
[61] Takuo Matsubara, Jeremias Knoblauch, François-Xavier Briol, Chris Oates, et al. “Robust generalised Bayesian inference for intractable likelihoods”. In: arXiv preprint arXiv:2104.07359 (2021).
[62] Jeffrey W Miller. “Asymptotic normality, concentration, and coverage of generalized posteriors”. In: Journal of Machine Learning Research 22.168 (2021), pp. 1–53.
[63] Jeffrey W Miller and David B Dunson. “Robust Bayesian inference via coarsening”. In: arXiv preprint arXiv:1506.06101 (2015).
[64] Stanislav Minsker. “Asymptotic normality of robust risk minimizers”. In: arXiv preprint arXiv:2004.02328 (2020).
[65] Stanislav Minsker. “Uniform bounds for robust mean estimators”. In: arXiv preprint arXiv:1812.03523 (2018).
[66] Sendhil Mullainathan and Ziad Obermeyer. “Does machine learning automate moral hazard and error?” In: American Economic Review 107.5 (2017), pp. 476–80.
[67] A. H. Munnell, L. Browne, J. McEneaney, and G. Tootel. “Mortgage lending in Boston: Interpreting HMDA data”. In: American Economic Review 86 (1996), pp. 25–54.
[68] Tomoyuki Nakagawa and Shintaro Hashimoto. “Robust Bayesian inference via γ-divergence”. In: Communications in Statistics-Theory and Methods 49.2 (2020), pp. 343–360.
[69] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. “Learning with noisy labels”. In: Advances in Neural Information Processing Systems. 2013, pp. 1196–1204.
[70] Arkadi Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. Wiley-Interscience, 1983.
[71] Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. “Dissecting racial bias in an algorithm used to manage the health of populations”. In: Science 366.6464 (2019), pp. 447–453.
[72] Seishi Okamoto and Nobuhiro Yugami. “An average-case analysis of the k-nearest neighbor classifier for noisy domains”. In: IJCAI (1). 1997, pp. 238–245.
[73] Steven Ongena and Alexander Popov. “Gender Bias and Credit Access”. In: Journal of Money, Credit, and Banking 48 (8 2016), pp. 1691–1724.
[74] Ken Orr. “Data quality and systems theory”. In: Communications of the ACM 41.2 (1998), pp. 66–71.
[75] Ashesh Rambachan, Jon Kleinberg, Jens Ludwig, and Sendhil Mullainathan. “An economic perspective on algorithmic fairness”. In: AEA Papers and Proceedings. Vol. 110. 2020, pp. 91–95.
[76] Ashesh Rambachan and Jonathan Roth. “Bias in, bias out? Evaluating the folk wisdom”. In: arXiv preprint arXiv:1909.08518 (2019).
[77] Thomas Redman. “The impact of poor data quality on the typical enterprise”. In: Communications of the ACM 2.2 (1998), pp. 79–82.
[78] Philippe Rigollet and Xin Tong. “Neyman-Pearson classification, convexity and stochastic constraints”. In: Journal of Machine Learning Research (2011).
[79] Clayton Scott and Robert Nowak. “A Neyman-Pearson approach to statistical learning”. In: IEEE Transactions on Information Theory 51.11 (2005), pp. 3806–3819.
[80] Sainbayar Sukhbaatar and Rob Fergus. “Learning from noisy labels with deep neural networks”. In: arXiv preprint arXiv:1406.2080 2.3 (2014), p. 4.
[81] Markus Svensen and Christopher M Bishop. “Robust Bayesian mixture modelling”. In: Neurocomputing 64 (2005), pp. 235–252.
[82] Xin Tong. “A plug-in approach to Neyman-Pearson classification”. In: Journal of Machine Learning Research 14.1 (2013), pp. 3011–3040.
[83] Xin Tong, Yang Feng, and Jingyi Jessica Li. “Neyman-Pearson classification algorithms and NP receiver operating characteristics”. In: Science Advances 4.2 (2018), eaao1659.
[84] Xin Tong, Lucy Xia, Jiacheng Wang, and Yang Feng. “Neyman-Pearson classification: parametrics and sample size requirement”. In: Journal of Machine Learning Research 21 (2020), pp. 1–18.
[85] Aad W Van der Vaart. Asymptotic statistics. Vol. 3. Cambridge University Press, 2000.
[86] Aad W Van Der Vaart, Adrianus Willem van der Vaart, Aad van der Vaart, and Jon Wellner. Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media, 1996.
[87] Blake Woodworth, Suriya Gunasekar, Mesrob I Ohannessian, and Nathan Srebro. “Learning non-discriminatory predictors”. In: Conference on Learning Theory. PMLR. 2017, pp. 1920–1953.
[88] I-Cheng Yeh and Che-hui Lien. “The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients”. In: Expert Systems with Applications 36.2 (2009), pp. 2473–2480.
[89] Anqi Zhao, Yang Feng, Lie Wang, and Xin Tong. “Neyman-Pearson classification under high-dimensional settings”. In: Journal of Machine Learning Research 17.213 (2016), pp. 1–39.
Abstract
This dissertation is devoted to statistical methods in robust Bayesian inference, asymmetric binary classification, and algorithmic fairness. The performance of these methods is evaluated, both theoretically and numerically, and asymptotic guarantees are provided to establish their efficacy.
The first part is related to robust estimation and inference in the Bayesian framework. In this framework, we often consider a family of probability density functions, and data are generated from an unknown distribution whose probability density belongs to this family. Given a prior distribution on this set of probability density functions, one can construct a posterior distribution. In this part, it is assumed that an "adversary" replaces a random subset of the data by arbitrary values. The goal is to construct a posterior that is robust to such arbitrary outliers. The contribution of this work consists of two parts: first, we propose a robust posterior distribution based on the median-of-means (MOM) principle. Second, we show its robustness and a version of the Bernstein-von Mises theorem.
The second part is related to label noise in the Neyman-Pearson (NP) classification framework. Given a random couple (X,Y) ∈ Rᵈ×{0,1}, the goal of NP classification is to predict Y such that the control of the type I error, i.e., the probability of misclassification given Y = 0, under a pre-specified value is given priority. We study the NP paradigm under the scenario that, instead of the true training data {(X_j, Y_j)}_{j=1}^n, the labels Y in the training data are randomly flipped and the actual training data is {(X_j, Ỹ_j)}_{j=1}^n. The contribution of this work consists of two parts: first, we show that the NP classifier still satisfies the type I error control even though it is trained with noisy labels. However, the price for this control is a high type II error, i.e., the probability of misclassification given Y = 1. Second, we propose an approach that allows one to adjust the NP algorithm to achieve the proper type I and II errors.
The third part concerns algorithmic fairness in NP classification. In binary classification, the algorithmic fairness setting assumes that one or more features are sensitive and that the predicted outcome should satisfy certain criteria of social fairness defined in terms of these sensitive features. Among such criteria, we focus on the Equal Opportunity (EO) framework. In this thesis, we consider an NP classification problem under an EO constraint. The contribution of this part is twofold: first, we propose the Neyman-Pearson Equal-Opportunity (NP-EO) paradigm, which combines social fairness with institutional interests, and present the NP-EO oracle classifier. Second, we propose the NP-EO umbrella algorithm, which trains NP-EO classifiers.
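To illustrate the equal-opportunity side of the constraint, the sketch below equalizes the type II error (miss rate on Y = 1) across two sensitive groups by giving each group its own threshold. This is a simplified illustration only, not the NP-EO umbrella algorithm, which must additionally enforce the type I error constraint; the group labels, score distributions, and function name are hypothetical.

```python
import numpy as np

def eo_thresholds(scores1_by_group, beta):
    """One threshold per sensitive group, each set at the group's own
    beta quantile of class-1 scores, so every group has (approximately)
    the same type II error beta -- i.e., equal opportunity."""
    return {g: float(np.quantile(np.asarray(s, dtype=float), beta))
            for g, s in scores1_by_group.items()}

rng = np.random.default_rng(2)
# hypothetical class-1 score distributions that differ by group
scores1 = {"a": rng.normal(2.0, 1.0, 5000),
           "b": rng.normal(1.0, 1.5, 5000)}
thr = eo_thresholds(scores1, beta=0.10)

miss_a = (scores1["a"] <= thr["a"]).mean()
miss_b = (scores1["b"] <= thr["b"]).mean()
print(miss_a, miss_b)  # both groups miss about 10% of true positives
```

A single shared threshold would instead give group "b" (with lower scores) a much higher miss rate, which is exactly the disparity the EO criterion rules out.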
Conceptually similar
Adapting statistical learning for high risk scenarios
Statistical citation network analysis and asymmetric error controls
Differentially private and fair optimization for machine learning: tight error bounds and efficient algorithms
A rigorous study of game-theoretic attribution and interaction methods for machine learning explainability
Robust estimation of high dimensional parameters
Optimization strategies for robustness and fairness
Scalable optimization for trustworthy AI: robust and fair machine learning
Conformalized post-selection inference and structured prediction
Leveraging sparsity in theoretical and applied machine learning and causal inference
High dimensional estimation and inference with side information
Large scale inference with structural information
Data-efficient image and vision-and-language synthesis and classification
Statistical insights into deep learning and flexible causal inference
Statistical inference for dynamical, interacting multi-object systems with emphasis on human small group interactions
Understanding diffusion process: inference and theory
Algorithms and frameworks for generating neural network models addressing energy-efficiency, robustness, and privacy
Data-driven optimization for indoor localization
Landscape analysis and algorithms for large scale non-convex optimization
Improved computational and statistical guarantees for high dimensional linear regression
Nonparametric estimation of an unknown probability distribution using maximum likelihood and Bayesian approaches
Asset Metadata
Creator: Yao, Shunan (author)
Core Title: New methods for asymmetric error classification and robust Bayesian inference
School: College of Letters, Arts and Sciences
Degree: Doctor of Philosophy
Degree Program: Applied Mathematics
Degree Conferral Date: 2023-08
Publication Date: 06/22/2023
Defense Date: 06/20/2023
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: algorithmic fairness, asymmetric classification, Bayesian inference, label noise, OAI-PMH Harvest, robust statistics
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Minsker, Stanislav (committee chair), Heilman, Steven (committee member), Tong, Xin (committee member)
Creator Email: shunanya@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC113189889
Unique identifier: UC113189889
Identifier: etd-YaoShunan-11982.pdf (filename)
Legacy Identifier: etd-YaoShunan-11982
Document Type: Dissertation
Rights: Yao, Shunan
Internet Media Type: application/pdf
Type: texts
Source: 20230627-usctheses-batch-1058 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu