Optimization Strategies for Robustness and Fairness

by Valeriu Balaban

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING), May 2023. Copyright 2023 Valeriu Balaban.

Acknowledgements

I would like to express my deepest gratitude to my advisor, Paul Bogdan, for his unwavering support, guidance, and mentorship throughout my doctoral journey. Without his expertise, patience, and encouragement, this research and thesis would not have been possible.

I also want to thank the members of my defense committee, Edmond Jonckheere, Rahul Jain, Bhaskar Krishnamachari, and Jyotirmoy Deshmukh, for their valuable feedback and insights. Your constructive criticism has significantly improved the quality of this work.

To my Ph.D. friends, in particular Yannick Bliessener, Ramy Tadros, Martin Martinez, and Jayson Sia: thank you for sharing your knowledge, expertise, and experiences with me. Your friendship and camaraderie have made this journey much more enjoyable and memorable.

I gratefully acknowledge the financial support provided by the Annenberg Fellowship, which jumpstarted my academic journey, and the National Science Foundation (NSF), which supported me through the CAREER award CPS/CNS-1453860 grant. Their generous funding has enabled me to pursue my studies with fewer financial constraints and to focus on my research and academic pursuits.

I am also grateful for the teaching assistantship provided by the Electrical and Computer Engineering Department and the opportunity to teach alongside Mark Redekopp and Allan Weber, which not only helped me improve my teaching skills but also gave me a deeper appreciation for the art of teaching.

Finally, I would like to express my heartfelt thanks to my girlfriend, Jenny Kan, for her love, understanding, and unwavering support. Your patience and encouragement have been a constant source of motivation and inspiration for me.

To all of you, I offer my sincere gratitude and appreciation for your contributions to the research shared in this thesis and to my personal and professional growth.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Optimizing for Robustness
  1.2 Optimizing for Fairness
  1.3 Contributions
  1.4 Notation
Chapter 2: Variance penalization
  2.1 Direct Variance Penalization
  2.2 Distributional Variance Penalization
  2.3 Computational Aspects
Chapter 3: Optimizing for Robustness
  3.1 Positive Variance Penalization
  3.2 Negative Variance Penalization
  3.3 Experimental Results
    3.3.1 Variance Minimization when Training with Limited Data
    3.3.2 Variance Maximization for Training with Mislabeled Data
    3.3.3 Experimental Setup
  3.4 Chapter Conclusion
Chapter 4: Optimizing for Fairness
  4.1 Definition of Fairness
  4.2 Optimization Procedure
  4.3 Distributionally Robust Optimization
  4.4 Computational Aspects
  4.5 Experimental Results
  4.6 Chapter Conclusion
Chapter 5: Conclusion and ongoing work
  5.1 Summary of completed work
  5.2 Ongoing and future work
References
Appendices
  A Supplementary Figures – Fairness
  B Weighted Mean Trick
    B.1 Moments penalization
      B.1.1 Theory
      B.1.2 Clipping
      B.1.3 Literature Review
      B.1.4 Convergence
      B.1.5 Algorithm
    B.2 Robust classification
      B.2.1 Optimizing for robustness
      B.2.2 Experimental results
    B.3 Fair classification
      B.3.1 Optimizing for fairness
      B.3.2 Experimental results
    B.4 Appendix Conclusion
    B.5 Additional Theoretical Results
    B.6 Variance Expansion
    B.7 Supplementary Figures
    B.8 Supplementary Tables

List of Tables

4.1 Numerical results of fairness optimization
5.1 Mean accuracy (%) and standard deviation on clean data over 5 runs. The best result for each scenario is underlined.
5.2 Robust classification CIFAR results with per-class accuracy
List of Figures

2.1 Demonstration of distributional variance penalization on a toy example
3.1 Experimental results for positive variance penalization
3.2 Experimental results for negative variance penalization
4.1 Illustration of fairness-aware objectives
4.2 Experimental results when optimizing fairness on CIFAR-10
5.1 Metrics for CVaR on the CIFAR-10 dataset
5.2 Metrics for the Variance Penalization method on the Adult dataset using $\bar\lambda = 2$ and $\bar\gamma = 0.75$
5.3 Metrics for the CVaR method on the Adult dataset
5.4 Experimental results for the synthetic dataset
5.5 Metrics for the CVaR method on the Synthetic dataset
5.6 Metrics for the Variance Penalization method on the validation set of the Synthetic dataset using $\bar\lambda = 1$ and $\bar\gamma = 0.75$
5.7 Metrics for the Variance Penalization method on the validation set of the Synthetic dataset using $\bar\lambda = 2$ and $\bar\gamma = 0.75$
5.8 Metrics for the Variance Penalization method on the validation set of the Synthetic dataset using $\bar\lambda = 2$ and $\bar\gamma = 1$
5.9 Metrics for the Variance Penalization method on the validation set of the Adult dataset using $\bar\lambda = 1$ and $\bar\gamma = 0.75$
5.10 Metrics for the Variance Penalization method on the validation set of the Adult dataset using $\bar\lambda = 2$ and $\bar\gamma = 1$
5.11 Variance penalization for classification problems
5.12 Polynomial functions for moments penalization
5.13 Algebraic interpretation of moments penalization
5.14 Bounding the cross-entropy loss function using negative variance penalization
5.15 Fair classification results, matching mean
5.16 Fair classification results, matching mean and variance
5.17 Fair classification – CVaR method (various $\alpha$) on the Donini dataset
5.18 Fair classification – moments method (various $\lambda_2$) on the Donini dataset
5.19 Fair classification – moments method ($\lambda_2 = 3$ and various $\lambda_3$) on the Donini dataset
5.20 Fair classification – moments method ($\lambda_2 = 7$ and various $\lambda_3$) on the Donini dataset
5.21 Fair classification – moments method ($\lambda_2 = 10$ and various $\lambda_3$) on the Donini dataset
5.22 Fair classification – moments method ($\lambda_2 = 16$ and various $\lambda_3$) on the Donini dataset

Abstract

We introduce distributional variance penalization, a strategy for learning with limited and/or mislabeled data, and for jointly optimizing individual and group fairness, a closely related problem. While minimizing the loss function currently stands as the training objective for many machine learning applications, it suffers from (i) poor robustness and (ii) an inability to equalize performance across subsets of the training data. With regard to the first issue, we show that we can improve robustness by minimizing the average loss along with penalizing the variance.
In particular, we expand on past studies of directly penalizing the variance, which adjusts the weights of individual samples and thereby improves robustness. However, the weights can take negative values and lead to unstable behavior. We introduce distributional variance penalization, which solves the issue of negative weights: it minimizes the expectation with respect to a distinct distribution that achieves a weighting scheme similar to direct variance penalization. We study the impact of both positive and negative variance penalization in the context of classification and show that generalization and robustness against mislabeled data can be improved for a broad class of loss functions. Experimental results show that test accuracy improves by up to 20% compared to ERM when training with limited or mislabeled data.

With regard to the second issue, we introduce a novel procedure for jointly optimizing individual and group fairness based on distributional variance penalization. We build this procedure upon the paradigm of distributionally robust optimization, in which the goal is to improve robustness and protect against possible shifts in the data distribution. This is of paramount interest when optimizing for fairness, as sensitive groups are often underrepresented, which skews the underlying data distribution. In addition, we introduce a new definition of fairness which extends some existing ones. In principle, our goal is to minimize the empirical mean and variance of the loss values while optimizing for all groups to achieve the same loss mean and variance. Our experimental results show that we can enforce groups to achieve similar loss means and variances and improve the final accuracy for all groups when the training data is skewed.

Chapter 1
Introduction

Empirical risk minimization (ERM) is a key paradigm used in many machine learning applications. The objective of ERM is to minimize a loss function averaged over the training data. This approach has been widely studied in many applications (Krizhevsky, Hinton et al. 2009; Goodfellow et al. 2014; Redmon et al. 2016). A key drawback of ERM is its lack of robustness: poor generalization when training with limited data (Maurer and Pontil 2009; Duchi and Namkoong 2019; Robey et al. 2022), poor noise rejection when training with noise-corrupted data (Li et al. 2021; Song et al. 2022), and poor fairness, as models can not only learn the intrinsic biases present in the dataset but may even amplify them (Zhao et al. 2017; Hashimoto et al. 2018). This work aims to address these limitations by coupling the ERM strategy with a variance-penalizing component and analyzing its implications for various machine learning tasks. In addition, this work aims to convince the reader that, from the perspective of the model, not all samples are equal, and that optimizing a weighted mean is a more flexible approach. For instance, at the end of training, a few samples are left uncaptured by the model and yield large loss values. One might choose to put more weight on those samples to help the model learn (individual fairness) or consider them noise and assign them less weight (robustness). The decision is also reflected in the loss variance: it decreases when hard samples are weighted more and increases otherwise.

To justify theoretically the impact of variance penalization, we use Bennett's inequality (Hoeffding 1963; Maurer and Pontil 2009), which shows that for i.i.d. loss values $Z$ with bounded variance, we have with probability $1-\delta$:
$$\mathbb{E}[Z] - \frac{1}{n}\sum_{i=1}^{n} Z_i \le C_1 \sqrt{\frac{2\,\mathbb{V}[Z]\ln(1/\delta)}{n}} + C_2\,\frac{\ln(1/\delta)}{3n} \tag{1.1}$$

where $C_1$ and $C_2$ are problem-dependent constants. This inequality reveals two things. First, with high probability the empirical mean is close to its theoretical value. This bound, along with similar PAC-Bayes bounds (Seldin et al. 2012; Tolstikhin and Seldin 2013), justifies the practical success of empirical risk minimization (ERM), whose objective is to minimize the mean loss value (Namkoong and Duchi 2017). Second, the difference between the two is bounded by the variance. This second implication has led to a large and growing body of studies investigating variance penalization (Maurer and Pontil 2009; Namkoong and Duchi 2017; Duchi and Namkoong 2019; Staib, Wilder, and Jegelka 2019; Lam 2019; Heinze-Deml and Meinshausen 2021; Hu et al. 2018). Moreover, by penalizing the variance, the intrinsic bias-variance tradeoff of ERM can be controlled. We propose a unified framework that improves robustness when the training data is either limited or mislabeled, which we later extend to optimize for both individual and group fairness.

1.1 Optimizing for Robustness

We note that penalizing the variance is equivalent to changing the weight of individual samples. Consequently, variance penalization magnifies different data clusters, which drives the class boundaries to follow their alignment, as shown in Figure 2.1a. However, direct variance penalization leads to samples having negative weights, which usually causes unstable behavior during training. We solve this issue by using probabilities instead of weights. Toward this end, we introduce distributional variance penalization, which takes the expectation with respect to a new discrete distribution $Q_n$ such that:

$$\mathbb{E}_{Q_n}[\ell(\theta,x,y)] = \mathbb{E}_{P_n}[\ell(\theta,x,y)] + \lambda\,\mathbb{V}_{P_n}[\ell(\theta,x,y)] \tag{1.2}$$

where $P_n$ is the empirical data distribution and $\ell(\theta,x,y)$ are the loss values. Of note, the proposed approach uses probabilities, which are nonnegative.

To improve generalization when training with limited data, we should proportionally increase the weights of samples with large loss values, as those are not yet learned by the model. We accomplish this through positive variance penalization, i.e., variance minimization. On the contrary, to improve noise rejection when training with mislabeled data, we should proportionally decrease the weights of samples with large loss values, as those are usually the mislabeled samples. We accomplish this through negative variance penalization, i.e., variance maximization.

A large and growing body of studies has investigated minimizing the variance (Maurer and Pontil 2009; Robey et al. 2022; Namkoong and Duchi 2017; Duchi and Namkoong 2019; Staib, Wilder, and Jegelka 2019; Lam 2019; Heinze-Deml and Meinshausen 2021; Hu et al. 2018; Li et al. 2021); however, only a few address improving robustness when training with limited data. Duchi and Namkoong (2019) proposed to penalize the variance by taking the expectation with respect to a different distribution. Li et al. (2021) investigated the optimization of a tilted empirical risk, which is equivalent to penalizing all central moments of the loss, including the variance.
Moreover, we also explore the impact of increasing the variance to attenuate the effect of outliers when training with mislabeled data. Previous methods attenuated the impact of outliers by either using learning management (Li et al. 2017; Szegedy et al. 2016; Hendrycks, Lee, and Mazeika 2019; Liu et al. 2020; Harutyunyan et al. 2020; Song et al. 2022) or robust losses (Song et al. 2022; Bartlett, Jordan, and McAuliffe 2006; Ghosh, Kumar, and Sastry 2017; Wang et al. 2019a; Zhang and Sabuncu 2018; Wang et al. 2019b; Feng et al. 2020; Ma et al. 2020; Natarajan et al. 2013; Mnih and Hinton 2012; Xu et al. 2019; Patrini et al. 2016; Rooyen, Menon, and Williamson 2015).

1.2 Optimizing for Fairness

With respect to optimizing for fairness, we consider a model fair if it incurs similar losses for each sensitive group. A large number of studies investigate optimizing for group fairness or individual fairness, but only a few investigate optimizing simultaneously for both (Mehrabi et al. 2021; Caton and Haas 2020). This manuscript aims to contribute to this rarely explored but growing area of research. Publications that concentrate on optimizing group fairness often require the model to achieve similar losses for each sensitive group, e.g., women and men (Hardt et al. 2016; Kusner et al. 2017; Corbett-Davies et al. 2017; Hashimoto et al. 2018; Williamson and Menon 2019). However, Speicher et al. (2018) note that exclusively optimizing for group fairness can deteriorate individual fairness. For this reason, we propose minimizing the empirical loss mean and variance while requiring all groups to achieve the same mean and variance. We show that by decreasing the loss variance, we improve individual fairness at both the group and dataset levels. Moreover, to achieve this goal, we present a two-step optimization procedure based on distributional variance penalization.

Additionally, we link our proposed framework to distributionally robust optimization and show that the trained models are robust against shifts in the underlying data distribution. This is of paramount interest when optimizing for fairness, as it is common for a small number of sensitive groups to be poorly represented in the training dataset, leading to the empirical data distribution diverging from the true underlying one (Fabris et al. 2022). For example, Torralba and Efros (2011) reveal that datasets often favor certain types of samples or contain imprecise category definitions that are unfavorable for certain sensitive groups.

Casting the optimization objective as a weighted sum leads to two additional advantages. First, it allows the proposed method to integrate seamlessly into popular machine learning (ML) frameworks. Second, given that many governments (Munoz, Smith, and Patil 2016; European Commission 2018) and researchers (Bechmann and Zevenbergen 2019) have recently called for more social accountability and understanding of ML, the final training weights can be used as an indicator of bias or to explain the decision-making process in contested cases.

1.3 Contributions

The main contributions, grouped by robustness (CR) and fairness (CF), are as follows:

(CR1) We show that penalizing the variance is equivalent to changing the weight of individual samples, and we exploit this to improve robustness when training with limited data (Theorem 2) and mislabeled data (Theorem 4).

(CR2) We show that variance minimization is equivalent to distributionally robust optimization, as it optimizes the most unfavorable scenario (Equation (3.1)).

(CR3) We prove that maximizing the variance bounds the loss function (Theorem 4) and improves robustness when training with mislabeled data.
(CR4) We show experimentally that both positive and negative variance penalization improve model accuracy (compared to ERM) by up to 20% when training with limited or mislabeled data, respectively.

(CF1) We introduce a new definition of group fairness (Definition 5) which extends some existing ones and requires all groups to achieve the same loss mean and variance.

(CF2) We present a novel procedure to jointly optimize for individual and group fairness (Algorithm 1).

(CF3) We link our algorithm to distributionally robust optimization, where we protect the model against shifts in the data distribution (Equation (3.1)).

(CF4) We show experimentally that when the training data is skewed, the final accuracy improves for all groups if we require sensitive groups to achieve similar loss mean and variance.

The implications of the mathematical framework are much broader than what we investigate in this work. We limit the scope of this manuscript to classification, but with minor changes the framework can also cover regression problems. We expect this simple yet powerful framework to reach a wide audience, as it is straightforward to integrate into existing works while providing great flexibility in optimizing for desired properties of the solution.

1.4 Notation

In subsequent chapters, we use the following notation. We consider the training data to be distributed according to an unknown joint distribution $P$ and to consist of $n$ tuples $(x, y)$, where the attributes $x$ are drawn from $\mathcal{X} \subseteq \mathbb{R}^d$ and the class labels $y$ from $\mathcal{Y} = \{1,\dots,C\}$, with $C$ the number of classes. Given a model $f : \Theta \times \mathcal{X} \to \mathbb{R}^C$ parameterized by $\theta \in \Theta \subseteq \mathbb{R}^t$, our goal is to find the optimal parameters, denoted $\theta^*$, for which the model $f$ correctly predicts the label $y$ given $x$. To evaluate the prediction performance of the model, we use a loss function $\ell : \Theta \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$, with the objective to minimize $\mathbb{E}_P[\ell(\theta,x,y)]$.

Chapter 2
Variance penalization

In this chapter, we explore two approaches to variance penalization.

2.1 Direct Variance Penalization

As the distribution $P$ is unknown, in the ERM setting we minimize the empirical expectation $\mathbb{E}_{P_n}[\ell(\theta,x,y)]$, or equivalently:

$$\min_\theta \sum_{i=1}^{n} w_i\,\ell(\theta,x_i,y_i) \tag{2.1}$$

where $w_i$ are the sample weights with $w_i = \frac{1}{n}$ for all $i$. This simplifies the implementation but hinders robustness, since in practice not all samples are equal (Bengio et al. 2009). We show in the next lemma that training with direct variance penalization is equivalent to changing the weights of individual samples. This result is independent of the loss function.

Lemma 1: Variance Expansion

Let $\ell(\theta,x,y)$ be a loss function, $\lambda$ a variance penalization factor, and $w_i$ the sample weights computed as $w_i = \frac{1}{n} + \frac{\lambda}{n}\big(\ell(\theta,x_i,y_i) - \mathbb{E}_{P_n}[\ell(\theta,x,y)]\big)$. Then variance penalization is equivalent to the weighted sum with weights $w_i$:

$$\mathbb{E}_{P_n}[\ell(\theta,x,y)] + \lambda\,\mathbb{V}_{P_n}[\ell(\theta,x,y)] = \sum_{i=1}^{n} w_i\,\ell(\theta,x_i,y_i).$$

Proof.

1. $\sum_{i=1}^{n} w_i\,\ell(\theta,x_i,y_i) = \mathbb{E}_{P_n}[\ell(\theta,x,y)] + \lambda\,\mathbb{E}_{P_n}\big[\big(\ell(\theta,x,y) - \mathbb{E}_{P_n}[\ell(\theta,x,y)]\big)\,\ell(\theta,x,y)\big]$.
   Proof: By the definition of $w_i$ and the linearity of expectation.

2. $\mathbb{E}_{P_n}\big[\big(\ell(\theta,x,y) - \mathbb{E}_{P_n}[\ell(\theta,x,y)]\big)\,\ell(\theta,x,y)\big] = \mathbb{V}_{P_n}[\ell(\theta,x,y)]$.
   Proof: Writing $\ell(\theta,x,y) = \big(\ell(\theta,x,y) - \mathbb{E}_{P_n}[\ell(\theta,x,y)]\big) + \mathbb{E}_{P_n}[\ell(\theta,x,y)]$ inside the product gives
   $$\mathbb{E}_{P_n}\big[\big(\ell - \mathbb{E}_{P_n}[\ell]\big)^2\big] + \mathbb{E}_{P_n}\big[\ell - \mathbb{E}_{P_n}[\ell]\big]\,\mathbb{E}_{P_n}[\ell] = \mathbb{V}_{P_n}[\ell(\theta,x,y)],$$
   where the second term vanishes since $\mathbb{E}_{P_n}\big[\ell - \mathbb{E}_{P_n}[\ell]\big] = 0$.

3. Q.E.D.
   Proof: By substituting step 2 into step 1.
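To make the variance-expansion identity concrete, the following minimal NumPy sketch (ours, not part of the thesis code base) checks Lemma 1 numerically on hypothetical loss values; all variable names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
losses = rng.uniform(0.0, 3.0, size=8)   # stand-ins for l(theta, x_i, y_i)
lam = 0.5                                # variance penalization factor lambda
n = losses.size

# Direct variance penalization weights from Lemma 1:
# w_i = 1/n + (lam/n) * (l_i - mean(l))
w = 1.0 / n + (lam / n) * (losses - losses.mean())

lhs = losses.mean() + lam * losses.var()  # E[l] + lam * V[l] (population variance)
rhs = np.dot(w, losses)                   # weighted sum with the Lemma 1 weights
assert np.isclose(lhs, rhs)
print(lhs, rhs, "min weight:", w.min())   # min weight can be negative for large |lam|

Note that np.var defaults to the population (1/n) variance, which matches the empirical variance $\mathbb{V}_{P_n}$ used throughout this chapter.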
Remarks. From the formula for $w_i$, we see that samples whose loss value $\ell(\theta,x_i,y_i)$ equals $\mathbb{E}_{P_n}[\ell(\theta,x,y)]$ receive a weight of $\frac{1}{n}$, the same as in the ERM setting. For all other samples, the weight depends on the deviation of the loss value from the empirical mean, and the variance penalization factor $\lambda$ sets the rate at which the weight changes with this deviation. For $\lambda > 0$, samples with above-average loss receive more weight, whereas for $\lambda < 0$, samples with below-average loss receive more weight.

Although the $w_i$ sum to 1, we call them weights and not probabilities, since they can take negative values. In particular, for $\lambda > 0$, a loss value $\ell(\theta,x_i,y_i) < \mathbb{E}_{P_n}[\ell(\theta,x,y)] - \frac{1}{\lambda}$ has a negative weight. Similarly, for $\lambda < 0$, a loss value $\ell(\theta,x_i,y_i) > \mathbb{E}_{P_n}[\ell(\theta,x,y)] + \frac{1}{|\lambda|}$ has a negative weight.

Weights taking negative values usually leads to unstable behavior during optimization, as the objective can alternate between minimization and maximization. This is exacerbated for negative variance penalization, where the samples with negative weights are precisely the samples with large loss values, which are more likely to skew the weighted sum; see the range highlighted in red in Figure 2.1c. Although the issue also occurs for positive variance penalization, it is of less concern there, as the inset in Figure 2.1b shows.

2.2 Distributional Variance Penalization

To solve the problem of negative weights, we use a distributional variance penalization approach that achieves a weighting scheme similar to direct variance penalization. Instead of using the empirical distribution $P_n$, we propose to employ a new discrete distribution $Q_n$ such that taking the expectation $\mathbb{E}_{Q_n}[\ell(\theta,x,y)]$ recovers $\mathbb{E}_{P_n}[\ell(\theta,x,y)] + \lambda\,\mathbb{V}_{P_n}[\ell(\theta,x,y)]$. As probabilities are nonnegative, this solves the problem of negative weights. Henceforth, the optimization objective becomes:

$$\min_\theta \sum_{i=1}^{n} q_i\,\ell(\theta,x_i,y_i) \tag{2.2}$$

where the probabilities $q_i$ come from a discrete distribution $Q_n$.

To construct our surrogate distribution $Q_n$, we measure how much it diverges from the empirical data distribution $P_n$ by the chi-square divergence, $D_{\chi^2}$, calculated as

$$D_{\chi^2}(Q \,\|\, P) = \mathbb{E}_{x \sim P}\!\left[\left(\frac{Q(x)}{P(x)} - 1\right)^{2}\right]$$

where $P(x)$ and $Q(x)$ denote the two densities at $x$. The next theorem shows that we recover the variance penalization problem by selecting the distribution $Q_n$ from an ambiguity set $\mathcal{A}$, which is constructed by bounding the chi-square divergence.

[Figure 2.1: Toy example. (a) A binary classification problem with the decision boundaries of a linear classifier trained with $\gamma = 1$ and varying $\lambda$. (b) Sample probability for positive variance penalization with $\lambda = 0.8$. (c) Sample probability for negative penalization with $\lambda = 0.3$. The left-hand subfigure in (b) and (c) shows the sample probability as a function of the sample loss and compares it to the ERM case (dashed line). The right-hand subfigure in (b) and (c) shows the product of the crossentropy loss and the sample probability as a function of the predicted probability. The insets show where the two methods deviate.]
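As a quick illustration of the divergence used to build the ambiguity set, the following sketch (with made-up numbers) evaluates the discrete chi-square divergence between a candidate reweighting and the uniform empirical distribution.

import numpy as np

def chi_square_divergence(q, p):
    # D_chi2(Q || P) = E_{x~P}[(Q(x)/P(x) - 1)^2], discrete form
    return np.sum(p * (q / p - 1.0) ** 2)

n = 4
p = np.full(n, 1.0 / n)              # empirical distribution P_n (uniform)
q = np.array([0.4, 0.3, 0.2, 0.1])   # a candidate reweighting Q_n
print(chi_square_divergence(q, p))   # 0.2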
Theorem 2: Distributional Variance Penalization

Let $\ell(\theta,x,y)$ be a loss function, $\lambda > 0$ a variance penalization factor, and $\gamma \in \mathbb{R}$ the variance exponent. Then the following bounds hold for the two stochastic optimization problems:

$$\max_{Q_n \in \mathcal{A}} \mathbb{E}_{Q_n}[\ell(\theta,x,y)] \le \mathbb{E}_{P_n}[\ell(\theta,x,y)] + \lambda\,\mathbb{V}_{P_n}[\ell(\theta,x,y)]^{\gamma}$$
$$\min_{Q_n \in \mathcal{A}} \mathbb{E}_{Q_n}[\ell(\theta,x,y)] \ge \mathbb{E}_{P_n}[\ell(\theta,x,y)] - \lambda\,\mathbb{V}_{P_n}[\ell(\theta,x,y)]^{\gamma}$$

where $\mathcal{A} = \left\{ Q_n : D_{\chi^2}(Q_n \,\|\, P_n) \le \frac{\lambda^2}{n^2}\,\mathbb{V}_{P_n}[\ell(\theta,x,y)]^{2\gamma-1} \right\}$.

Proof. Let $Z$ be a random variable with $Z_i = \ell(\theta,x_i,y_i)$.

1. The two stochastic optimization problems optimize $\sum_{i=1}^{n} q_i Z_i$ subject to $q_i \ge 0$, $\sum_{i=1}^{n} q_i = 1$, and $\frac{1}{n}\sum_{i=1}^{n}\big(n q_i - 1\big)^2 \le \frac{\lambda^2}{n^2}\,\mathbb{V}_{P_n}[Z]^{2\gamma-1}$.

2. Writing $q_i$ in terms of a new variable $u_i$ with $q_i = \frac{1}{n} + u_i$, the objective becomes $\frac{1}{n}\sum_{i=1}^{n} Z_i + \sum_{i=1}^{n} u_i Z_i$ subject to $u_i \ge -\frac{1}{n}$, $\sum_{i=1}^{n} u_i = 0$, and $\frac{1}{n}\sum_{i=1}^{n} u_i^2 \le \frac{\lambda^2}{n^2}\,\mathbb{V}_{P_n}[Z]^{2\gamma-1}$.

3. $\frac{1}{n}\sum_{i=1}^{n} Z_i + \sum_{i=1}^{n} u_i Z_i = \mathbb{E}_{P_n}[Z] + n\,\mathbb{E}_{P_n}\big[u\,(Z - \mathbb{E}_{P_n}[Z])\big]$.
   Proof: For the second term,
   $$\sum_{i=1}^{n} u_i Z_i = n\,\mathbb{E}_{P_n}[uZ] = n\,\mathbb{E}_{P_n}\big[u(Z - \mathbb{E}_{P_n}[Z])\big] + n\,\mathbb{E}_{P_n}[u]\,\mathbb{E}_{P_n}[Z] = n\,\mathbb{E}_{P_n}\big[u(Z - \mathbb{E}_{P_n}[Z])\big],$$
   where the last step uses $\mathbb{E}_{P_n}[u] = 0$, which holds since $\sum_{i=1}^{n} u_i = 0$.

4. $n\,\big|\mathbb{E}_{P_n}[u(Z - \mathbb{E}_{P_n}[Z])]\big| \le \lambda\,\mathbb{V}_{P_n}[Z]^{\gamma}$.
   Proof: By the Cauchy–Schwarz inequality,
   $$n\,\big|\mathbb{E}_{P_n}[u(Z - \mathbb{E}_{P_n}[Z])]\big| \le n \sqrt{\mathbb{E}_{P_n}[u^2]\,\mathbb{E}_{P_n}\big[(Z - \mathbb{E}_{P_n}[Z])^2\big]} \le n \sqrt{\frac{\lambda^2}{n^2}\,\mathbb{V}_{P_n}[Z]^{2\gamma-1}\,\mathbb{V}_{P_n}[Z]} = \lambda\,\mathbb{V}_{P_n}[Z]^{\gamma},$$
   where the second inequality uses the constraint on $u^2$ from step 2.

5. Equality, $n\,\big|\mathbb{E}_{P_n}[u(Z - \mathbb{E}_{P_n}[Z])]\big| = \lambda\,\mathbb{V}_{P_n}[Z]^{\gamma}$, holds for $u_i = \pm\frac{\lambda}{n}\,\mathbb{V}_{P_n}[Z]^{\gamma-1}\big(Z_i - \mathbb{E}_{P_n}[Z]\big)$.
   Proof: Cauchy–Schwarz holds with equality when $u_i$ and $Z_i - \mathbb{E}_{P_n}[Z]$ are linearly dependent, which the stated $u_i$ achieves, provided the constraint $u_i \ge -\frac{1}{n}$ is satisfied.

6. When the constraint $u_i \ge -\frac{1}{n}$ is violated, $n\,\big|\mathbb{E}_{P_n}[u(Z - \mathbb{E}_{P_n}[Z])]\big| < \lambda\,\mathbb{V}_{P_n}[Z]^{\gamma}$.
   Proof: To avoid negative probabilities, we clamp any $u_i$ smaller than $-\frac{1}{n}$ to $-\frac{1}{n}$; as a result, $u_i$ and $Z_i - \mathbb{E}_{P_n}[Z]$ are no longer linearly dependent. In this case, $u_i = \pm\frac{\alpha\lambda}{n}\,\mathbb{V}_{P_n}[Z]^{\gamma-1}\big(Z_i - \mathbb{E}_{P_n}[Z] - \mu\big)$ with values below $-\frac{1}{n}$ set to $-\frac{1}{n}$, where $\mu > 0$ and $\alpha > 1$ are two constants selected to satisfy the constraints of step 2.

7. $\max_{Q_n \in \mathcal{A}} \mathbb{E}_{Q_n}[Z] \le \mathbb{E}_{P_n}[Z] + \lambda\,\mathbb{V}_{P_n}[Z]^{\gamma}$ and $\min_{Q_n \in \mathcal{A}} \mathbb{E}_{Q_n}[Z] \ge \mathbb{E}_{P_n}[Z] - \lambda\,\mathbb{V}_{P_n}[Z]^{\gamma}$.
   Proof: Combining steps 3 to 6, with equality achieved when the constraint $u_i \ge -\frac{1}{n}$ is satisfied. Q.E.D.

Remarks. Of note, we generalize the variance penalization problem to any exponent $\gamma$ of the variance term. As special cases, we recover the classical penalization of the variance for $\gamma = 1$ and the penalization of the standard deviation for $\gamma = 1/2$. Moreover, the exponent $\gamma$ controls the dynamic size of the ambiguity set $\mathcal{A}$: for $\gamma = 1/2$ the size of the set remains constant, whereas for all other values it changes as the variance changes. In our experiments, we find that $\gamma \neq 1$ usually yields better results. To find the distribution $Q_n$, one must solve the constrained optimization problem described in step 2 of the proof; however, Algorithm 2 gives a simpler procedure for finding $Q_n$.

Both the direct and distributional variance penalization methods yield equivalent weights for small values of $\lambda$, i.e., when all the weights are positive. However, when some samples would receive negative weights under direct variance penalization, those same samples receive a probability of 0 under distributional variance penalization. This solves the issue of negative weights at the cost of a lower penalized variance. Moreover, samples with a probability of 0 have no impact on the model parameters during the back-propagation phase.
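The following sketch contrasts the two schemes on hypothetical losses. The clamp-and-renormalize step is only a stand-in for the exact $\alpha$, $\mu$ construction of the Theorem 2 proof and is meant to show the qualitative effect, not the precise solution.

import numpy as np

losses = np.array([0.1, 0.2, 0.3, 3.0])  # one large (possibly mislabeled) loss
lam, n = 0.6, losses.size

# Direct penalization with negative lambda (down-weight large losses):
w = 1.0 / n - (lam / n) * (losses - losses.mean())
print(w)  # the last weight is negative, which destabilizes the objective

# Distributional penalization: clamp at zero, renormalize to a distribution.
q = np.clip(w, 0.0, None)
q = q / q.sum()
print(q)  # the outlier now simply receives probability 0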
Figure 2.1 highlights how distributional variance penalization helps improve generalization when training with limited or mislabeled data. Figure 2.1a presents a toy dataset along with several linear decision boundaries obtained for different values of $\lambda$. In this dataset, each class contains two types of data points with distinct alignments: square points are aligned horizontally, and round points are aligned diagonally. In the ERM setting with $\lambda = 0$, i.e., all samples having equal weight, the decision boundary follows the alignment of the majority (here, the alignment of the round points, as they are more numerous). However, by varying $\lambda$, we can bias the decision boundary to favor the alignment of either the round or the square data points, as shown by the colored lines. The different alignments result from different distributions $Q_n$ that weigh each sample differently.

The two colored boxes in Figures 2.1b and 2.1c show the sample probabilities for $\lambda = 0.8$ (top) and $\lambda = -0.3$ (bottom), respectively. Positive variance penalization increases the probability of samples with a higher-than-average loss value, as shown in the left plot of Figure 2.1b. Here, the probability of the square data points increases, as those samples are closer to the decision boundary and thus have a higher loss value. Similarly, if a subset of data is uncaptured by the model, its loss values will be above average, and a positive $\lambda$ will increase its probability and help the model learn.

In contrast, negative variance penalization increases the probability of samples with a lower-than-average loss value, as shown in the left plot of Figure 2.1c. As a result, the decision boundary follows the alignment of the round data points, since those are the furthest from the boundary and thus have the smallest loss. This likewise helps reduce the impact of mislabeled data, as samples with large loss values are more likely to be mislabeled (Zhang et al. 2017a; Arpit et al. 2017).

The red rectangles in Figures 2.1b and 2.1c highlight an important distinction between direct and distributional variance penalization. These insets reveal a bigger issue in the case of negative variance penalization, where the weights are negative for large loss values and are more likely to skew the weighted sum. This causes unstable behavior, as the objective alternates between minimization and maximization.

2.3 Computational Aspects

Applying the method in practice requires an additional step for finding the discrete distribution $Q_n$. Algorithm 1 illustrates the training procedure. We provide a PyTorch implementation on GitHub (https://github.com/valeriu-balaban/improving-robustness-with-variance-penalization).

Algorithm 1: Training with Variance Penalization
Input: $\{x_i, y_i\}_1^n$ – training data; $\ell(\theta,x,y)$ – loss function; $\lambda, \gamma$ – hyper-parameters; $\eta$ – learning rate
while stopping criterion not reached do
    for $i \leftarrow 1$ to $n$ do
        $z_i \leftarrow \ell(\theta, x_i, y_i)$
    $q \leftarrow \mathrm{FindQ}_n(z, \lambda, \gamma)$
    $L \leftarrow \sum_{i=1}^{n} q_i z_i$        // $\mathbb{E}_{Q_n}[\ell(\theta,x,y)]$
    $\theta \leftarrow \theta - \eta\,\nabla_\theta L$
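A minimal PyTorch rendering of the Algorithm 1 loop is sketched below. It assumes a find_qn helper such as the one sketched at the end of this section, and it is a simplification of, not a substitute for, the linked reference implementation.

import torch

def train_variance_penalization(model, loader, find_qn, lam, gamma,
                                epochs=10, lr=0.01):
    # Sketch of the Algorithm 1 training loop.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss(reduction="none")  # per-sample losses
    for _ in range(epochs):
        for x, y in loader:
            z = loss_fn(model(x), y)             # z_i = l(theta, x_i, y_i)
            q = find_qn(z.detach(), lam, gamma)  # treat q_i as constants
            loss = torch.dot(q, z)               # E_{Q_n}[l(theta, x, y)]
            opt.zero_grad()
            loss.backward()
            opt.step()

Detaching the loss vector before computing q anticipates the requirement, explained below, that the probabilities q_i be treated as constants during back-propagation.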
Algorithm 2 shows the procedure $\mathrm{FindQ}_n$ that solves the stochastic optimization problem of Theorem 2. The algorithm follows the approach of Duchi and Namkoong (2019) of projecting the loss values onto the probability simplex to find $Q_n$. However, we extend the algorithm to find both the distribution that minimizes and the distribution that maximizes the expectation. The advantage of this approach is that, to recover the distribution satisfying the constraints of Theorem 2, we only need to find the value of a single variable, $\rho$, compared to two, $\alpha$ and $\mu$. Alternatively, one can set $\rho$ as a training parameter and update its value during back-propagation, which we do not explore in this paper.

Algorithm 2: FindQ_n Procedure
Input: $z$ – loss values; $\lambda, \gamma$ – hyper-parameters
Param: $\epsilon$ – tolerance; $a, b$ – initial search range; $s$ – $+1$ for $\max \mathbb{E}_{Q_n}[z]$, $-1$ for $\min \mathbb{E}_{Q_n}[z]$
Result: $q$ – sample probabilities
$D_{\chi^2} \leftarrow \frac{\lambda^2}{n^2}\,\mathbb{V}_{P_n}[z]^{2\gamma-1}$        // target
repeat
    $\rho \leftarrow \frac{1}{2}(a + b)$
    $r \leftarrow [s(z - \rho)]_+$
    $q \leftarrow r / \sum_{i=1}^{n} r_i$
    $\hat{D}_{\chi^2} \leftarrow \sum_{i=1}^{n} \frac{1}{n}(n q_i - 1)^2$        // current
    if $s(\hat{D}_{\chi^2} - D_{\chi^2}) < 0$ then $a \leftarrow \rho$ else $b \leftarrow \rho$
until $|\hat{D}_{\chi^2} - D_{\chi^2}| < \epsilon$

Next, we highlight the influence of variance penalization on the update strategy for the parameters $\theta$. Consider the case of negative variance penalization and let $R_{P_n} = \mathbb{E}_{P_n}[\ell(\theta,x,y)] - \lambda\,\mathbb{V}_{P_n}[\ell(\theta,x,y)]^{\gamma}$. Differentiating $R_{P_n}$ with respect to $\theta$ gives

$$\nabla_\theta R_{P_n} = \sum_{i=1}^{n} w_i\,\nabla_\theta \ell(\theta,x_i,y_i), \qquad w_i = \frac{1}{n}\Big(1 - 2\lambda\gamma\,\mathbb{V}_{P_n}[\ell(\theta,x,y)]^{\gamma-1}\big(\ell(\theta,x_i,y_i) - \mathbb{E}_{P_n}[\ell(\theta,x,y)]\big)\Big).$$

On the other hand, when penalizing the variance using the result of Theorem 2 by taking the expectation with respect to the distribution $Q_n$, differentiating with respect to $\theta$ gives

$$\nabla_\theta \mathbb{E}_{Q_n}[\ell(\theta,x,y)] = \sum_{i=1}^{n} q_i\,\nabla_\theta \ell(\theta,x_i,y_i), \qquad q_i = \frac{1}{n}\Big[1 - \lambda\alpha\,\mathbb{V}_{P_n}[\ell(\theta,x,y)]^{\gamma-1}\big(\ell(\theta,x_i,y_i) - \mathbb{E}_{P_n}[\ell(\theta,x,y)] - \mu\big)\Big]_+.$$

Of note, the formulas for $w_i$ and $q_i$ are very much alike: both change the impact of the gradient linearly in the distance of the sample loss from the mean. However, in the case of $q_i$, we avoid unstable behavior during optimization, since the $q_i$ are nonnegative. We emphasize that the probabilities $q_i$ must be treated as constants. Otherwise, differentiating introduces an additional term, $\ell(\theta,x_i,y_i)\,\nabla_\theta q_i$, which is negative in the case of negative variance penalization and can lead to unstable behavior. To treat the $q_i$ as constants, use the detach() function in PyTorch or stop_gradient() in TensorFlow.
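Below is a runnable PyTorch sketch of the FindQ_n bisection. The target-divergence formula mirrors the set definition as reconstructed above, the keyword names are ours, and the routine assumes the loss values are not all identical (otherwise r.sum() can vanish).

import torch

def find_qn(z, lam, gamma, s=1, eps=1e-6, lo=None, hi=None, max_iter=100):
    # Bisection search for the surrogate distribution Q_n of Theorem 2.
    # s = +1 maximizes E_{Q_n}[z] (positive penalization),
    # s = -1 minimizes it (negative penalization).
    n = z.numel()
    target = (lam ** 2 / n ** 2) * z.var(unbiased=False) ** (2 * gamma - 1)
    a = z.min().item() if lo is None else lo
    b = z.max().item() if hi is None else hi
    for _ in range(max_iter):
        rho = 0.5 * (a + b)
        r = torch.clamp(s * (z - rho), min=0.0)   # r = [s(z - rho)]_+
        q = r / r.sum()
        d_hat = ((n * q - 1.0) ** 2).sum() / n    # current chi-square divergence
        if s * (d_hat - target) < 0:
            a = rho
        else:
            b = rho
        if abs(d_hat - target) < eps:
            break
    return q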
Chapter 3
Optimizing for Robustness

By training with positive or negative variance penalization, we can optimize for two types of robustness: (i) robustness to distributional shifts when training with limited data, and (ii) robustness to noise when training with mislabeled data.

3.1 Positive Variance Penalization

In what follows, we link positive variance penalization with distributionally robust optimization (DRO) (Shapiro 2017; Rahimian and Mehrotra unpublished), where the goal is to minimize the expectation with respect to the worst-case data distribution. Moreover, we show that positive variance penalization promotes generalization by improving the confidence bound on the empirical mean.

Maurer and Pontil (2009) introduced a concentration bound (Lemma 3) and proposed extending the ERM objective with a variance penalization term. However, they noted that variance penalization is non-convex in many situations where ERM is convex. To address this, Duchi and Namkoong (2019) examined penalizing the standard deviation of the loss in the context of DRO, which they showed preserves the convexity of the loss function. We extend their work and analyze the DRO problem where the size of the ambiguity set is dynamic and changes with the loss variance.

Choosing the probabilities $q_i$ in eq. (2.2) according to the worst-case scenario from Theorem 2 naturally allows us to cast the objective in the form of a DRO problem:

$$\min_{\theta \in \Theta}\,\max_{Q_n \in \mathcal{A}}\,\mathbb{E}_{Q_n}[\ell(\theta,x,y)] \tag{3.1}$$

where $\mathcal{A}$ represents a set of possible data distributions. The DRO goal is to protect against all deviations from a nominal model, which improves robustness against distributional shift (Hashimoto et al. 2018). Of note, the most common DRO formulations consider the size of the ambiguity set $\mathcal{A}$ to be constant; for distributional variance penalization, this is only the case for $\gamma = \frac{1}{2}$. In our experiments, we find that an expanding set, $\gamma < \frac{1}{2}$, yields better results. In this case, as the loss variance decreases, the selected distribution becomes progressively more unfavorable, allowing for better optimization.

In addition, by minimizing the variance through positive variance penalization, we improve the following confidence bound.

Lemma 3: Maurer and Pontil (2009, Theorem 4)

Let $Z_1,\dots,Z_n$ be i.i.d. random variables with values in $[0, M]$ and let $\delta > 0$. Then, with probability at least $1-\delta$:

$$\mathbb{E}_P[Z] - \mathbb{E}_{P_n}[Z] \le \sqrt{\frac{2 M^2\,\mathbb{V}_{P_n}[Z]\ln(2/\delta)}{n}} + \frac{7 M \ln(2/\delta)}{3(n-1)}$$

where $P$ represents the unknown data distribution and $P_n$ the empirical one.

From this lemma, we see that the empirical variance $\mathbb{V}_{P_n}[Z]$ bounds the deviation of the empirical mean from its theoretical value. Thus, by using positive variance penalization, we direct the optimization algorithm to minimize the mean and the variance together, which leads to a tighter concentration bound. Moreover, if the maximum loss value $M$ also decreases during this process, we attain an even tighter bound.
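As a worked illustration of this bound (with hypothetical values for $n$, $M$, and $\delta$), the snippet below evaluates the right-hand side of Lemma 3 for a few variance levels and shows how a smaller empirical variance tightens it.

import numpy as np

n, M, delta = 10_000, 1.0, 0.05          # samples, loss range [0, M], confidence
for var in (0.25, 0.05, 0.01):           # empirical loss variance V_{P_n}[Z]
    bound = np.sqrt(2 * M**2 * var * np.log(2 / delta) / n) \
            + 7 * M * np.log(2 / delta) / (3 * (n - 1))
    print(f"V={var:.2f} -> E_P[Z] - E_Pn[Z] <= {bound:.4f}")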
3.2 Negative Variance Penalization

In this section, we investigate negative variance penalization, which translates to minimizing the expectation taken with respect to the best-case distribution; of note, this is the opposite of the DRO objective. In this scenario, we show that the weight of inliers increases while the weight of outliers decreases, which makes this a viable strategy when training with mislabeled data.

From prior research, we know that overparameterized models such as deep neural networks are capable of learning random datasets (Zhang et al. 2017a). However, if a dataset contains real-world data along with random data, the model prioritizes learning the real-world data before it memorizes the random data (Arpit et al. 2017). Hence, when training with mislabeled data, samples with larger loss values are more likely to be the mislabeled ones and should not be relied upon. From Lemma 1, we know that we can decrease the weight of those samples using negative variance penalization. However, the drawback of this strategy is that samples with large loss values, i.e., $\ell(\theta,x_i,y_i) > \mathbb{E}_{P_n}[\ell(\theta,x,y)] + \frac{1}{|\lambda|}$, receive negative weights, which can lead to unstable behavior during training. In our experiments with direct variance penalization, the optimization algorithm converges only for small values of $\lambda$, which do not improve over the ERM baseline anyway. This is not the case for distributional variance penalization. Choosing the probabilities $q_i$ in eq. (2.2) according to the best-case scenario of Theorem 2 recovers the following objective:

$$\min_{\theta \in \Theta}\,\min_{Q_n \in \mathcal{A}}\,\mathbb{E}_{Q_n}[\ell(\theta,x,y)] \tag{3.2}$$

This approach solves the convergence issue of direct variance penalization, as samples that would receive negative weights now receive a probability of 0 and thus have no effect on the gradients during the back-propagation phase. In our experiments, similar to positive variance penalization, we find that an expanding ambiguity set $\mathcal{A}$ yields better results; in this case, one must use $\gamma > \frac{1}{2}$.

Next, we show that negative variance penalization using the distributional approach bounds the loss function and, as a result, improves robustness when training with mislabeled data. We assume that we do not have access to clean data and that, for each class, correctly labeled samples represent the majority. Formally, this assumption requires $P(\hat{y} = y) > P(\hat{y} = j)$ for all $j \in \mathcal{Y} \setminus \{y\}$, where $y$ denotes the true class labels and $\hat{y}$ the mislabeled ones. Moreover, here robustness means that a model trained on mislabeled data has the same probability of misclassification as a model trained on clean data. Under these conditions, Ghosh, Kumar, and Sastry (2017) outlined distribution-independent sufficient conditions for a loss function to be robust. Specifically, if the loss function satisfies the equality

$$\sum_{j=1}^{C} \ell(\theta,x_i,j) = K \tag{3.3}$$

where $C$ is the number of classes, $K$ is a constant, and the equality holds for all $\theta \in \Theta$ and all $x_i \in \mathcal{X}$, then $\ell$ is a robust loss when all classes are equally mislabeled. However, if the ratio of mislabeled data differs across classes, then for $\ell$ to be a robust loss we must also have $\mathbb{E}_{P_n}[\ell(\theta^*,x,y)] = 0$.

The crossentropy (CE) loss, widely used for classification, is unbounded, does not satisfy the above equality, and is consequently extremely susceptible to mislabeled data. On the other hand, the mean absolute error (MAE) satisfies all the above conditions for a robust loss function but struggles to converge when training deep neural networks. Recent studies (Zhang and Sabuncu 2018; Feng et al. 2020; Wang et al. 2019b; Ma et al. 2020; Wang et al. 2019a) explored relaxing the constraint in eq. (3.3) by requiring the sum to be bounded, and showed that tightening or relaxing the bounds balances robustness against convergence. In a like manner, we next show that negative variance penalization bounds the sum in eq. (3.3).
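The sketch below checks the symmetry condition of eq. (3.3) for the two losses just discussed: for a softmax output $p$, the per-sample CE sum varies with $p$, while the MAE sum is constant at $2(C-1)$. The helper names are ours, and the probability vectors are hypothetical.

import numpy as np

def ce(p, j):   # crossentropy loss for label j
    return -np.log(p[j])

def mae(p, j):  # mean absolute error between p and one-hot(j)
    e = np.zeros_like(p); e[j] = 1.0
    return np.abs(p - e).sum()

for p in (np.array([0.7, 0.2, 0.1]), np.array([0.34, 0.33, 0.33])):
    C = p.size
    print("sum CE :", sum(ce(p, j) for j in range(C)))   # varies with p
    print("sum MAE:", sum(mae(p, j) for j in range(C)))  # always 2*(C-1) = 4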
Theorem 4: Variance Penalization Bounds

Let $\ell(\theta,x,y)$ be an unbounded, non-negative loss function and $q_i$ the solution to $\min_{Q_n \in \mathcal{A}} \mathbb{E}_{Q_n}[\ell(\theta,x,y)]$ with $\mathcal{A} = \left\{ Q_n : D_{\chi^2}(Q_n \,\|\, P_n) \le \frac{\lambda^2}{n^2}\,\mathbb{V}_{P_n}[\ell(\theta,x,y)]^{2\gamma-1} \right\}$. Then:

$$0 \le \sum_{j=1}^{C} q_i\,\ell(\theta,x_i,j) \le \frac{C\Big[1 + \lambda\alpha\,\mathbb{V}_{P_n}[\ell(\theta,x,y)]^{\gamma-1}\big(\mathbb{E}_{P_n}[\ell(\theta,x,y)] + \mu\big)\Big]^2}{4 n \lambda\alpha\,\mathbb{V}_{P_n}[\ell(\theta,x,y)]^{\gamma-1}}$$

where $\alpha \ge 1$ and $\mu \ge 0$ are two constants that emerge when finding the distribution $Q_n$.

Proof. Let $Z$ be a random variable with realizations $Z_i = \ell(\theta,x_i,y_i)$. The probabilities $q_i$ from Theorem 2 are computed as

$$q_i = \frac{1}{n}\Big[1 - \lambda\alpha\,\mathbb{V}_{P_n}[Z]^{\gamma-1}\big(Z_i - \mathbb{E}_{P_n}[Z] - \mu\big)\Big]_+$$

where $\alpha \ge 1$ and $\mu \ge 0$ are constants found when solving the stochastic optimization problem $\min_{Q_n \in \mathcal{A}} \mathbb{E}_{Q_n}[\ell(\theta,x,y)]$, and the operator $[\cdot]_+$ truncates negative values to 0, enforcing the non-negativity constraint on $q_i$.

1. $0 \le q_i Z_i$.
   Proof: Both factors are non-negative, so their product is non-negative.

2. $\frac{\partial}{\partial Z_i}(q_i Z_i) = \frac{1}{n}\Big(1 - \lambda\alpha\,\mathbb{V}_{P_n}[Z]^{\gamma-1}\big(2 Z_i - \mathbb{E}_{P_n}[Z] - \mu\big)\Big)$.
   Proof: To find the maximum of $q_i Z_i$, the probability $q_i$ must be positive, so we drop the non-negativity constraint. Since we are interested in the distance from the mean at which the maximum is achieved, given $\mathbb{E}_{P_n}[Z]$ and $\mathbb{V}_{P_n}[Z]$, we treat these as constants. Then
   $$\frac{\partial}{\partial Z_i}(q_i Z_i) = q_i + Z_i \frac{\partial q_i}{\partial Z_i} = \frac{1}{n}\Big(1 - \lambda\alpha\,\mathbb{V}_{P_n}[Z]^{\gamma-1}\big(Z_i - \mathbb{E}_{P_n}[Z] - \mu\big)\Big) - \frac{1}{n}\lambda\alpha\,\mathbb{V}_{P_n}[Z]^{\gamma-1} Z_i.$$

3. $\frac{\partial}{\partial Z_i}(q_i Z_i) = 0$ for $Z_i^* = \frac{1}{2\lambda\alpha\,\mathbb{V}_{P_n}[Z]^{\gamma-1}} + \frac{\mathbb{E}_{P_n}[Z] + \mu}{2}$.
   Proof: Setting the derivative from step 2 to zero and solving for $Z_i$.

4. $q_i Z_i \le \frac{\big[1 + \lambda\alpha\,\mathbb{V}_{P_n}[Z]^{\gamma-1}(\mathbb{E}_{P_n}[Z] + \mu)\big]^2}{4 n \lambda\alpha\,\mathbb{V}_{P_n}[Z]^{\gamma-1}}$.
   Proof: Since $\frac{\partial^2}{\partial Z_i^2}(q_i Z_i) = -\frac{2}{n}\lambda\alpha\,\mathbb{V}_{P_n}[Z]^{\gamma-1}$ is negative, the product $q_i Z_i$ is maximized at $Z_i^*$. Writing $A = \lambda\alpha\,\mathbb{V}_{P_n}[Z]^{\gamma-1}$ and $m = \mathbb{E}_{P_n}[Z] + \mu$, we have $Z_i^* = \frac{1 + Am}{2A}$, and therefore
   $$q_i Z_i^* = \frac{1}{n}\big(1 - A(Z_i^* - m)\big)\,Z_i^* = \frac{1}{n}\cdot\frac{1 + Am}{2}\cdot\frac{1 + Am}{2A} = \frac{(1 + Am)^2}{4nA}.$$

5. Summing the bound of step 4 over the $C$ class labels $j$ yields the upper bound of the theorem. Q.E.D.

Remarks. The constants $\alpha$ and $\mu$ are the Lagrange multipliers that satisfy the constraints of step 2 in the proof of Theorem 2. The distribution $Q_n$ can be found by solving the constrained optimization problem (Kumar and Amid unpublished); however, we present a much simpler approach in Section 2.3. From this theorem, we see that the upper bound depends on the statistics of the loss values. When penalizing the variance with $\gamma = 1$, only the empirical mean affects the upper bound; in this case, we achieve the tightest upper bound of $(\mathbb{E}_{P_n}[\ell(\theta,x,y)] + \mu)/n$ for $\lambda = \frac{1}{\alpha(\mathbb{E}_{P_n}[\ell(\theta,x,y)] + \mu)}$. However, the tightest upper bound might not yield the best performance, as it may trade convergence for robustness. We therefore suggest performing hyper-parameter tuning to find the values of $\lambda$ and $\gamma$ that offer the best balance of robustness and convergence.
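For intuition on the remark above, the following snippet (with hypothetical values for $\alpha$, $\mu$, and the loss statistics) evaluates the per-class bound for $\gamma = 1$ as $\lambda$ varies and confirms that $\lambda = 1/(\alpha(\mathbb{E}_{P_n}[\ell] + \mu))$ attains the minimum $(\mathbb{E}_{P_n}[\ell] + \mu)/n$.

import numpy as np

# With gamma = 1 and A = lam * alpha, the per-class bound is
# (1 + A*m)^2 / (4*n*A), where m = E[Z] + mu.
n, alpha, mu, mean_loss = 1000, 1.0, 0.0, 0.5
m = mean_loss + mu
for lam in (0.5, 1.0 / (alpha * m), 4.0):
    A = lam * alpha
    print(f"lam={lam:.2f} -> bound={(1 + A * m) ** 2 / (4 * n * A):.6f}")
print("tightest bound m/n =", m / n)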
[Figure 3.1: Applying positive variance penalization when training with limited data. The effect of hyper-parameters $\lambda$ and $\gamma$ on (a) the final loss variance and (b) the test accuracy when training on Fashion–MNIST with the rare group representing 0.2% of the samples. Distributional variance penalization outperforms similar methods on the (c) Fashion–MNIST and (d) CIFAR–10 datasets.]

3.3 Experimental Results

We assess the performance of the proposed robustness framework on two datasets: (i) Fashion–MNIST, which contains grayscale images labeled with one of 10 classes of apparel, and (ii) CIFAR–10, which contains color images labeled with one of 10 classes such as cat, truck, and airplane. We use the classical crossentropy loss function in both the ERM and the variance penalization scenarios. The training procedure and the model configuration are described in Section 3.3.3, and the results are shown in Figures 3.1 and 3.2. These show the average of 5 runs with different model initializations, with error bars denoting the standard deviation.

3.3.1 Variance Minimization when Training with Limited Data

When training with limited data, our goal is for the model to capture all the patterns present in the dataset. To evaluate the generalization capability, for each dataset we use only a small fraction of the samples of classes 9, 2, 3, 5, and 4 (the rare group); for the remaining classes (the common group), we use all the samples. We retain 10% of the samples from each group to form a validation subset, which we use to select the hyper-parameters for each method. We measure the generalization capability by evaluating the trained model on the original test subset, where each class is equally represented. Of note, we select hyper-parameters that yield the lowest loss variance while maintaining a high accuracy on the validation subset; using only the validation accuracy is not enough, as the common group is overrepresented in this set.

Figures 3.1a and 3.1b provide experimental support for Lemma 3 and show that a lower training loss variance improves generalization, which we assess via the test accuracy. Moreover, in all scenarios, when using the optimal set of hyper-parameters, the test accuracy obtained with Distributional Variance Penalization was similar to or higher than that of Direct Variance Penalization, in some cases up to 5% higher, as shown in Figure 3.1b. For this reason, we recommend always using Distributional Variance Penalization unless negative weights are required.

We compare our Distributional Variance Penalization method with several baselines that similarly change the weights of samples during training: Tilted Empirical Risk Minimization (TERM) (Li et al. 2021) applied to the classical crossentropy, and the Focal loss (Lin et al. 2017).
As a result of hyper-parameter tuning, the best parameters for Distributional Variance Penalization are $\gamma = 0.45$ and $\lambda = 0.75$ when training on Fashion–MNIST, and $\gamma = 0.45$ and $\lambda = 2.4$ when training on CIFAR–10. For TERM, we found that $t = 0.316$ and $t = 0.562$ yield the best results on Fashion–MNIST and CIFAR–10, respectively. For the Focal loss, $\gamma = 2$ was the best fit for both datasets.

Figures 3.1c and 3.1d show the results for the Fashion–MNIST and CIFAR–10 datasets, respectively. Compared to ERM, all methods increase the accuracy evaluated on the test dataset, where each class is equally represented. However, only Distributional Variance Penalization registers the highest improvement in all tested scenarios. Our explanation is that the other two methods indirectly penalize the variance with $\gamma = 1$, whereas Figure 3.1b shows that the optimal exponent for the variance term is $\gamma = 0.45$.

3.3.2 Variance Maximization for Training with Mislabeled Data

To evaluate robustness when training with mislabeled data, for each dataset we flip labels to one of the other classes according to the given mislabeled percentage. After that, we retain 10% of the samples to form a validation subset, which we use to select the hyper-parameters for each method. We obtain the performance on clean data using the test subset of each dataset, whose labels we do not alter.

[Figure 3.2: Applying negative variance penalization when training with noisy labels. (a) Direct variance penalization fails, whereas distributional variance penalization improves accuracy by 4% when training on Fashion–MNIST with 20% mislabeled samples. (b) Histogram representation of the probabilities $q_i$. Distributional variance penalization is competitive with methods tailored for training with mislabeled data on the (c) Fashion–MNIST and (d) CIFAR–10 datasets.]

Figures 3.2a and 3.2b show that Distributional Variance Penalization, given the right pair $(\gamma, \lambda)$, can disregard the impact of mislabeled samples by assigning them a probability $q_i = 0$; however, in this case, a few correctly labeled but hard-to-learn samples also receive a probability of 0. In terms of hyper-parameter selection, we found that $\gamma = 0.75$ yields the best result in all tested scenarios, whereas for $\lambda$, the higher the percentage of mislabeled samples, the higher its value should be. In particular, for scenarios with 20%, 40%, and 60% mislabeled samples, we used $\lambda$ values of 0.32, 0.44, and 0.61, respectively. Of note, Direct Variance Penalization often fails to converge, and when it does converge, it improves neither the validation accuracy nor the test accuracy compared to the ERM case. For this reason, we recommend using Distributional Variance Penalization instead.

We compare Distributional Variance Penalization against state-of-the-art methods: TERM (Li et al. 2021) applied to the classical crossentropy, Symmetric crossentropy (Wang et al. 2019b), and Generalized crossentropy (Zhang and Sabuncu 2018). Using hyper-parameter tuning, we obtain the best results for TERM with $t$ equal to $-0.5$, $-0.7$, and $-0.9$ in the scenarios with 20%, 40%, and 60% mislabeled samples, respectively. The following parameters worked best in all scenarios: $q = 0.7$ for Generalized crossentropy, and $\alpha = 0.1$ and $\beta = 1$ for Symmetric crossentropy.

Figures 3.2c and 3.2d illustrate the accuracy evaluated on the test subsets, which contain only clean labels. In all presented scenarios, Distributional Variance Penalization obtains similar or higher final accuracy compared to the specialized methods. In the worst-case scenario (Fashion–MNIST with 60% mislabeled samples), our method trailed the best-performing method by merely 0.6%, while in the best-case scenario (CIFAR–10 with 60% mislabeled samples), it outperformed all other methods by at least 0.9%.

3.3.3 Experimental Setup

We use the same experimental setup as Wang et al. (2019b). In particular, for Fashion–MNIST, we trained a model containing 4 convolutional layers followed by 3 fully connected layers. For CIFAR–10, we trained an 8-layer network composed of 6 convolutional layers followed by 2 fully connected layers. The training batch size was 128 samples. The optimizer was stochastic gradient descent (SGD) with 0.9 momentum and a learning rate of 0.01, divided by 10 every 20 epochs, for a total of 60 training epochs. Moreover, we used weight decay of 0.005 for the convolutional layers and 0.01 for the fully connected layers. For complete implementation details, we refer to the GitHub repository referenced in Section 2.3.
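A PyTorch sketch of the stated optimization recipe is given below. The parameter-group split applies the per-layer weight decay described above; the model architecture itself is left abstract, since the exact layer widths are not specified here, and the helper name is ours.

import torch

def make_optimizer(model):
    # Split parameters so the stated per-layer weight decay applies:
    # 0.005 for convolutional layers, 0.01 for fully connected layers.
    conv, fc = [], []
    for m in model.modules():
        if isinstance(m, torch.nn.Conv2d):
            conv += list(m.parameters())
        elif isinstance(m, torch.nn.Linear):
            fc += list(m.parameters())
    opt = torch.optim.SGD(
        [{"params": conv, "weight_decay": 0.005},
         {"params": fc, "weight_decay": 0.01}],
        lr=0.01, momentum=0.9)
    # Divide the learning rate by 10 every 20 epochs (60 epochs total).
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=20, gamma=0.1)
    return opt, sched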
3.4 Chapter Conclusion

This chapter presents a novel robustness framework based on distributional variance penalization. We reveal two contrasting types of robustness that result from minimizing or maximizing the loss variance. Although we did not cover optimizing simultaneously for both types of robustness, we believe our framework can be extended to cover it. We also emphasize the role of the exponent of the variance term, an often overlooked factor. Lastly, we highlight that the proposed framework makes negative variance penalization a viable technique when training with mislabeled data.

Chapter 4
Optimizing for Fairness

We start by extending the notation. In order to optimize for fairness, each attribute $x$ now contains a sensitive feature $g \in \mathcal{G} = \{1,\dots,n_g\}$, with $n_g$ denoting the number of sensitive groups. Given a model $f$ parameterized by $\theta \in \Theta \subseteq \mathbb{R}^t$, our goal is to find the optimal parameters, denoted $\theta^*$, for which the model $f$ correctly predicts the label $y$ given $x$ while achieving identical prediction performance across all sensitive groups $g$. We consider a fairness-aware learning setup similar to that of Williamson and Menon (2019). In particular, to evaluate the prediction performance of the model, we use a loss function $\ell : \Theta \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$, and our goal becomes minimizing the expected population risk $\mathbb{E}[\ell(\theta,x,y)]$ subject to $\Delta_g[\ell(\theta,x,y)]$ being small, where $\Delta_g$ represents a dissimilarity measure across sensitive groups. Williamson and Menon (2019) use the deviation of the group loss means as the measure of dissimilarity; see Figure 4.1c. Our proposed fairness objective also includes the deviation of the group loss variances as part of the measure; see Figure 4.1d. By also considering the group loss variance, we improve intra-group individual fairness and the confidence bounds on the population and group risks in the low-data regime.

[Figure 4.1: Fairness-aware objectives. The arrows indicate the direction of optimization, and their lengths indicate the strength. (a) In ERM, we minimize the population mean loss. (b) We improve individual fairness by minimizing the variance in addition to the mean. (c) We improve group fairness if we also optimize for all groups to achieve a similar mean loss. (d) We improve both group and individual fairness if we optimize for all groups to achieve similar mean and variance of the loss values.]

In the ERM setting shown in Figure 4.1a, we minimize the empirical expectation $\mathbb{E}_{P_n}[\ell(\theta,x,y)]$, where $P_n$ denotes the empirical data distribution. Or equivalently:

$$\min_\theta \sum_{i=1}^{n} w_i\,\ell(\theta,x_i,y_i) \tag{4.1}$$
3.4 Chapter Conclusion

This chapter presents a novel robustness framework based on distributional variance penalization. We reveal two contrasting types of robustness that result from minimizing or maximizing the loss variance. Although we did not cover optimizing simultaneously for both types of robustness, we believe our framework can be extended to cover it. We also emphasize the role of the exponent of the variance term, an often overlooked factor. Lastly, we highlight that the proposed framework makes negative variance penalization a viable training technique when training with mislabeled data.

Chapter 4

Optimizing for Fairness

We start by extending the notation. In order to optimize for fairness, each attribute $x$ now contains a sensitive feature $g \in \mathcal{G} = \{1,\dots,n_g\}$, with $n_g$ denoting the number of sensitive groups. Given a model $f$ parameterizable by $\theta \in \Theta \subseteq \mathbb{R}^t$, our goal is to find the optimum parameters, denoted by $\theta^*$, for which the model $f$ correctly predicts the label $y$ given $x$ while achieving identical prediction performance across all sensitive groups $g$. We consider a fairness-aware learning setup similar to that of Williamson and Menon (2019). In particular, to evaluate the prediction performance of the model, we use a loss function $\ell : \Theta \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}_+$, and as a result, our goal becomes minimizing the expected population risk $\mathbb{E}[\ell(\theta,x,y)]$ subject to $\Delta_g[\ell(\theta,x,y)]$ being small, where $\Delta_g$ represents a dissimilarity measure of sensitive groups. Williamson and Menon (2019) use the deviation of the group loss mean as a measure of dissimilarity, see Figure 4.1c. However, our proposed fairness objective also includes the deviation of the group loss variance as part of the measure, see Figure 4.1d. By also considering the group loss variance, we improve intra-group individual fairness and the confidence bounds of the population and group risks in the low data regime.

In the ERM setting shown in Figure 4.1a, we minimize the empirical expectation $\mathbb{E}_{P_n}[\ell(\theta,x,y)]$, where $P_n$ denotes the empirical data distribution. Or similarly:

$$\min_\theta \sum_{i=1}^n w_i\, \ell(\theta, x_i, y_i) \qquad (4.1)$$

Figure 4.1: Fairness-aware objectives. The arrows indicate the direction of optimization, and their lengths indicate the strength. (a) In ERM, we minimize the population mean loss. (b) We improve individual fairness by minimizing the variance in addition to minimizing the mean. (c) We improve group fairness if we also optimize for all groups to achieve similar mean loss. (d) We improve both group and individual fairness if we optimize for all groups to achieve similar mean and variance of the loss values.

where $w_i$ are the sample weights with $w_i = \frac{1}{n}, \forall i$. In ERM, we treat all samples equally and ignore the sensitive features. However, this approach can make the model not only learn the biases embedded in the dataset but also amplify them (Zhao et al. 2017; Hashimoto et al. 2018). To avoid this, we dynamically change the weights $w_i$ during training to enforce individual and group fairness.

Much of the current research on fairness investigates changing the sample weights either directly or indirectly using the following methods: (i) reweighting (Williamson and Menon 2019; Jiang and Nachum 2020), (ii) regularization (Goel, Yaghini, and Faltings 2018; Heidari et al. 2018; Jiang et al. 2020; Aghaei, Azizi, and Vayanos 2019), and (iii) constrained optimization (Nabi and Shpitser 2018; Narasimhan 2018; Celis et al. 2019; Cotter et al. 2019). Moreover, the existing literature has extensively studied individual fairness (Grgic-Hlaca et al. 2016; Dwork et al. 2012) and group fairness (Kusner et al. 2017; Corbett-Davies et al. 2017; Williamson and Menon 2019), which generally require the model to incur similar losses across the entire dataset or for each sensitive group, see Figures 4.1b and 4.1c.

However, only a few studies focus on optimizing both group and individual fairness simultaneously (Dwork et al. 2012; Kearns et al. 2019). Notably, Dwork et al. (2012) study metrics for determining which individuals are similar and enforce a fairness constraint such that similar individuals are treated similarly. Kearns et al. (2019) investigate enforcing a group fairness constraint to also hold over an exponentially large collection of subgroups, to bridge group and individual fairness. However, these two methods are not robust to shifts in the underlying data distribution. We aim to address this issue by optimizing jointly for a low deviation of the loss mean and variance between sensitive groups, while also minimizing the loss mean and variance of each group, see Figure 4.1d. By including the variance in the minimization procedure, we improve intra-group individual fairness and the confidence bounds on the optimized population and group risks in the low data regime.

With respect to when fairness is enforced, we have methods applied: (i) pre-training (Calmon et al. 2017; Adler et al. 2018; Gordaliza et al. 2019; McNamara, Ong, and Williamson 2019), (ii) during training (Zafar et al. 2017; Agarwal et al. 2018; Donini et al. 2018; Kearns et al. 2018; Williamson and Menon 2019; Jiang et al. 2020), and (iii) post-training (Hardt et al. 2016; Jiang et al. 2020). Approaches applied pre- and post-training do not require changing the training procedure; however, they have no direct control in mitigating biases during training. Additionally, a large number of studies use different definitions of fairness (Verma and Rubin 2018), which can be grouped as based on: (i) statistical measures (Kleinberg, Mullainathan, and Raghavan 2016; Chouldechova 2017; Corbett-Davies et al. 2017; Berk et al. 2021), (ii) similarity measures (Galhotra, Brun, and Meliou 2017; Kusner et al. 2017), and (iii) causal reasoning (Kilbertus et al. 2017; Nabi and Shpitser 2018). Several studies (Kleinberg, Mullainathan, and Raghavan 2016; Kim, Chen, and Talwalkar 2020; Saravanakumar 2020) investigated the incompatibility of several notions of group fairness.
In particular, the impossibility theorem of fairness (Kleinberg, Mullainathan, and Raghavan 2016; Saravanakumar 2020) states that the fairness metrics demographic parity, equalized odds (Hardt et al. 2016), and predictive parity (Verma and Rubin 2018) are mutually incompatible: except under some trivial conditions, only one can hold at a time. Of note, enforcing fairness often prompts a trade-off in accuracy, for which Zhao and Gordon (2019) provide algorithm-independent lower bounds.

4.1 Definition of Fairness

In this section, we introduce a new definition of fairness which requires all sensitive groups to achieve the same loss mean and variance. First, we define two induced quantities which we use to optimize for fairness. The average loss of the group $g$, denoted by $L_g$, is computed as follows:

$$L_g = \mathbb{E}_{P_n}[\ell(\theta,x,y) \mid G = g]. \qquad (4.2)$$

We can express the ERM objective of Equation (4.1) in terms of the group average losses $L_g$:

$$\min_\theta \sum_{g=1}^{n_g} w_g L_g \qquad (4.3)$$

where $w_g$ are the group weights with $w_g = P(G = g)$. This illustrates that groups with fewer samples receive less weight as their group probability is smaller. At the same time, even if all sensitive groups have equal representation in the dataset, this does not guarantee that all groups will attain the same group average loss $L_g$. To ensure this, we must increase the group weights $w_g$ for groups with a larger average loss $L_g$.

The second induced quantity is the loss variance of the group $g$, denoted by $V_g$ and computed as follows:

$$V_g = \mathbb{E}_{P_n}\big[\big(\ell(\theta,x,y) - L_g\big)^2 \mid G = g\big]. \qquad (4.4)$$

This quantity indicates the intra-group fairness, for which we prefer a small value. A large variance indicates that within the same group, samples receive widely different treatments. We can minimize $V_g$ by increasing the weights of samples with large loss values within each group. In addition, minimizing $V_g$ improves the confidence bounds on the group average loss $L_g$ (Maurer and Pontil 2009). However, one must be aware of the bias-variance tradeoff between $L_g$ and $V_g$.

We now introduce our new definition of fairness, expressed in terms of the group average loss $L_g$ and variance $V_g$.

Definition 5: Fairness Definition
The model $f$ is perfectly fair with respect to the loss $\ell$ if all groups attain the same average loss and loss variance. Formally:

$$L_g = L_{g'} \;\wedge\; V_g = V_{g'} \quad \forall g, g' \in \mathcal{G}. \qquad (4.5)$$

Of note, the above definition extends that of Williamson and Menon (2019), where only the group loss $L_g$ is considered. Likewise, the definition is specified in terms of an abstract loss function $\ell(\theta,x,y)$, which should capture all aspects of the problem besides fairness. Moreover, we improve individual fairness by equalizing the loss variance across all groups, as samples will receive similar treatment irrespective of their group membership.

During the training phase of the model, our goal is to bring the above quantities as close as possible to equality. However, equality, and thus perfect fairness, is rarely achieved in practice. Even if all groups have the same mean and variance of the loss, we cannot conclude that they have the same distribution of loss values; generally, however, distributions diverge less when they share the same mean and variance than when they share only the mean. Of note, the presence of mislabeled samples can skew the distribution of loss values, which can be mitigated by using a robust loss function during training.
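The two induced quantities are straightforward to compute from per-sample losses. The following sketch (our own helper, not part of the main algorithm) evaluates $L_g$ and $V_g$ of Equations (4.2) and (4.4) and the gaps that Definition 5 requires to vanish.

```python
import torch

def group_statistics(losses, groups, num_groups):
    """Group loss mean L_g, Eq. (4.2), and variance V_g, Eq. (4.4)."""
    L = torch.zeros(num_groups)
    V = torch.zeros(num_groups)
    for g in range(num_groups):
        z = losses[groups == g]
        L[g] = z.mean()
        V[g] = ((z - z.mean()) ** 2).mean()  # population variance
    return L, V

losses = torch.rand(1000)              # stand-in per-sample losses
groups = torch.randint(0, 2, (1000,))  # stand-in group memberships
L, V = group_statistics(losses, groups, 2)
mean_gap, var_gap = L.max() - L.min(), V.max() - V.min()  # 0 under Definition 5
```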
4.2 Optimization Procedure

In this section, we present the training procedure that optimizes for fairness based on reweighting individual samples. The weights are assigned such that we minimize the variances of the two induced group quantities, $\mathbb{V}[L_g]$ and $\mathbb{V}[V_g]$. When the two variances are 0, we satisfy the requirements for perfect fairness as stated in Definition 5.

Next, we introduce our fairness-aware objective, which has the following expression:

$$\min_\theta \sum_{i=1}^n w_{g(i)}\, w_i\, \ell(\theta, x_i, y_i) \qquad (4.6)$$

where $g(i)$ represents the group membership of sample $i$, $w_{g(i)}$ is the weight of the group to which sample $i$ belongs, and $w_i$ is the individual weight of sample $i$. We use variance penalization to calculate both sets of weights, $w_g$ and $w_i$. As a result, by minimizing the above equation, we minimize the deviation of the loss mean and variance between sensitive groups, while also minimizing the loss mean and variance of each group. This improves both group and individual fairness. The optimization procedure consists of two steps, one for finding the group weights and one for finding the individual weights, which we describe in more detail below; a condensed sketch follows after the two steps.

Step 1. Compute the weight $w_g$ for each sensitive group $g$. We optimize for group fairness by ensuring that all groups have a similar average loss $L_g$, which we achieve by minimizing the variance $\mathbb{V}[L_g]$. We compute the weight $w_g$ for each group $g$ by penalizing the variance $\mathbb{V}[L_g]$ using the distributional variance penalization. Note that by using these weights, we minimize two quantities simultaneously, $\mathbb{E}[L_g]$ and $\mathbb{V}[L_g]$. Moreover, if $\mathbb{V}[L_g] = 0$, we fulfill the first requirement of Definition 5, i.e., $\forall g, g' \in \mathcal{G}$ we have $L_g = L_{g'}$.

Step 2. Compute the weights $w_i$ for each group $g$. To obtain the weights $w_i$, we perform variance penalization twice: first to obtain the variance penalization factor $\lambda_g$ for each group $g$, and second to obtain the sample weights $w_i$. In this fashion, we can leverage variance penalization to obtain weights $w_i$ that equalize the loss variance $V_g$ between sensitive groups (group fairness) while also minimizing the variance of each group separately (individual fairness). In fact, we equalize the loss variance $V_g$ between sensitive groups by using a larger variance penalization factor for groups with higher loss variances. Similar to the previous step, we fulfill the second requirement of Definition 5 when $\mathbb{V}[V_g] = 0$.
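A condensed sketch of the two steps is shown below. It assumes a routine find_qn(values, lam, gamma) implementing the distributional variance penalization projection (Algorithm 4, presented later) and is a schematic of the procedure rather than the reference implementation from our repository.

```python
import torch

def fairness_weights(z, groups, num_groups,
                     lam_bar, gam_bar, lam, gam, find_qn):
    """Two-step weight computation for the objective in Eq. (4.6)."""
    # Step 1: group weights from the group average losses L_g.
    L = torch.stack([z[groups == g].mean() for g in range(num_groups)])
    w_group = find_qn(L, lam_bar, gam_bar)

    # Step 2: per-group penalization factors from the group variances V_g,
    # then individual weights within each group.
    V = torch.stack([((z[groups == g] - L[g]) ** 2).mean()
                     for g in range(num_groups)])
    lam_g = lam * find_qn(V, lam_bar, gam_bar)

    w = torch.zeros_like(z)
    for g in range(num_groups):
        w[groups == g] = find_qn(z[groups == g], lam_g[g], gam)
    return w_group, w
```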
We provide the pseudocode for this procedure in Algorithm 3. Importantly, the weights must be recomputed after each iteration of the optimization algorithm to account for changes in the distribution of the loss values due to the new model parameters. In addition, the product of the weights sums to 1, i.e., $\sum_{i=1}^n w_{g(i)} w_i = 1$, as in the ERM case. This follows from the fact that within each group the individual weights sum to 1, i.e., $\sum_{i=1}^{m_g} w_i = 1$ where $m_g$ is the number of samples in group $g$, and the group weights also sum to 1, i.e., $\sum_{g=1}^{n_g} w_g = 1$. Moreover, it is crucial to use distributional variance penalization in the above steps to prevent groups from receiving negative weights, as negative weights would switch the objective of a large subset of samples from minimization to maximization, which often leads to unstable behavior during optimization.

The proposed method has the following key advantages: (i) It preserves convexity: if the loss function $\ell(\theta,x,y)$ is convex, then the objective of minimizing Equation (4.6) is also convex, given that the weights are computed using distributional variance penalization. (ii) Minimizing the group variance $V_g$ improves the confidence bounds on the group average loss $L_g$, and in turn, the confidence in the attained fairness. (iii) Lastly, the objective in Equation (4.6) can be rendered as a form of distributionally robust optimization (DRO), protecting against shifts in the underlying data distribution. We elaborate on the last two points in the next section.

In addition, the proposed procedure is general and can optimize for fairness in multiple applications besides classification, given that the optimization objective is in ERM form. For example, for regression problems where we minimize the mean squared error, the fair version will optimize for sensitive groups to attain a similar mean and variance of the error values. It can also be used in other applications where a model is trained, such as clustering with deep learning (Min et al. 2018) and ranking (Zhang et al. 2019).

Of note, the optimization procedure uses distributional variance penalization to compute the weights $w_i$ for each group $g$. Although we present an efficient bisection-based algorithm for projecting the loss values onto the probability simplex to find the distribution $Q_n$ (Algorithm 4), when the number of groups is large it can slow down training. We can mitigate this by changing the stopping criteria of the algorithm, reducing the tolerance when searching for the distribution $Q_n$, or alternatively, switching to approximate algorithms for finding $Q_n$ designed for large-scale applications (Levy et al. 2020).

4.3 Distributionally Robust Optimization

The goal of distributionally robust optimization (DRO) is to protect against all deviations from a nominal model (Shapiro 2017; Rahimian and Mehrotra unpublished), which leads to improved robustness against distributional shift (Hashimoto et al. 2018). ERM minimizes the expected loss with respect to the empirical data distribution, whereas in DRO, we minimize the expected loss with respect to the worst distribution in an uncertainty set $\mathcal{A}$. This equalizes the model performance across subsets of the training data, similarly to the work of Sohoni et al. (2020). Choosing the weights of Equation (4.6) using the distributional variance penalization of Theorem 2 naturally allows us to cast the objective in the form of a DRO:

$$\min_{\theta \in \Theta} \max_{Q_n \in \mathcal{A}} \mathbb{E}_{Q_n}[\ell(\theta,x,y)] \qquad (4.7)$$

where the uncertainty set $\mathcal{A}$ is composed of possible data distributions with a bounded chi-square divergence to the empirical distribution $P_n$. This is meaningful when optimizing for fairness, as some groups are often underrepresented or undersampled, which increases the possibility of the empirical distribution $P_n$ being far from the true one. DRO accommodates this scenario as it improves the generalization performance of ERM on the hardest-to-learn subset.

In addition, minimizing the group variance $V_g$ improves the confidence bound on the group average loss $L_g$, as the next lemma shows.

Lemma 6: Maurer and Pontil (2009)
Let $Z_1, \dots, Z_n$ be i.i.d. random variables with values in $[0, M]$ and let $\delta > 0$. Then, with probability at least $1 - \delta$:

$$\mathbb{E}_P[Z] - \mathbb{E}_{P_n}[Z] \le \sqrt{\frac{2 M^2\, \mathbb{V}_{P_n}[Z] \ln(2/\delta)}{n}} + \frac{7 M \ln(2/\delta)}{3(n-1)}$$

where $P$ is the unknown data distribution and $P_n$ is the empirical one.

This lemma presents a data-dependent bound, expressed in terms of the observable quantity $\mathbb{V}_{P_n}[Z]$, on the deviation of the empirical mean from its theoretical value. Thus, by minimizing the variance, we tighten the above concentration bound. Naturally, the maximum loss value $M$ will also decrease, as samples with the largest loss values receive the most weight, which leads to an even tighter bound.
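To illustrate, the bound of Lemma 6 can be evaluated numerically from a sample. The snippet below is our own illustration, with the constants taken verbatim from the lemma; it shows how the gap shrinks as the empirical variance is driven down.

```python
import numpy as np

def bernstein_gap(z, M, delta=0.05):
    """Right-hand side of the bound in Lemma 6 for losses z in [0, M]."""
    n = len(z)
    return (np.sqrt(2 * M**2 * z.var() * np.log(2 / delta) / n)
            + 7 * M * np.log(2 / delta) / (3 * (n - 1)))

z = np.random.default_rng(0).uniform(0, 1, size=500)
print(bernstein_gap(z, M=1.0))        # baseline gap
print(bernstein_gap(0.5 * z, M=1.0))  # lower variance yields a smaller gap
```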
Alternatively, Buolamwini and Gebru (2018) suggest collecting more representative data as the first step toward eliminating dataset biases. While this tightens the above bound by increasing the number of samples $n$, it may not always be a viable solution, for example, when the real-world data is skewed or when samples with particular properties are costly to collect. In such cases, the proposed variance minimization approach represents an effective solution.

A different method of optimizing for group fairness is based on the conditional value at risk (CVaR), originally introduced by Rockafellar and Uryasev (2000). The fairness-aware version of CVaR proposed by Williamson and Menon (2019) optimizes the following objective:

$$\min_{\theta \in \Theta,\, \rho \in \mathbb{R}} \;\rho + \frac{1}{1-\alpha}\, \mathbb{E}\big[L_g - \rho\big]_+ . \qquad (4.8)$$

At each iteration, this objective minimizes only over the sensitive groups whose average loss $L_g$ is higher than the threshold $\rho$. Both methods assign a weight to sensitive groups based on the distance of the group mean $L_g$ either to the threshold $\rho$ or to the statistic $\mathbb{E}[L_g]$. However, CVaR only optimizes the group mean $L_g$ and ignores the group variance $V_g$, which makes the method sensitive to shifts in the underlying data distribution.

4.4 Computational Aspects

Applying the method in practice requires computing the group and sample weights prior to computing the weighted sum. To compute the two sets of weights, the user must supply four hyper-parameters: $\bar\lambda$ and $\bar\gamma$ to penalize the variance of $L_g$ and $V_g$, and $\lambda$ and $\gamma$ to penalize the variance of the loss values within each group. For hyper-parameter tuning, we recommend initially setting $\bar\gamma = \gamma = 1$ (classical variance penalization) and tuning only $\bar\lambda$ and $\lambda$. Algorithm 3 illustrates the training procedure; the PyTorch implementation is available on GitHub.¹

¹ https://github.com/valeriu-balaban/fairness-var-pen/

Algorithm 3: Training Procedure
Input: $\{(x_i, y_i)\}_{i=1}^n$ – training data; $\ell(\theta,x,y)$ – loss function; $\bar\lambda, \bar\gamma$ – group fairness parameters; $\lambda, \gamma$ – individual fairness parameters; $\eta$ – learning rate
1: while stopping criteria not reached do
2:   for $i \leftarrow 1$ to $n$ do
3:     $z_i \leftarrow \ell(\theta, x_i, y_i)$
4:   with no gradient do
5:     $L_g \leftarrow \mathbb{E}[z \mid G = g]$                  // Step 1
6:     $w_g \leftarrow \text{FindQn}(L_g, \bar\lambda, \bar\gamma)$
7:     $V_g \leftarrow \mathbb{E}[(z - L_g)^2 \mid G = g]$        // Step 2
8:     $\lambda_g \leftarrow \lambda \cdot \text{FindQn}(V_g, \bar\lambda, \bar\gamma)$
9:     for $g \leftarrow 1$ to $n_g$ do
10:      $w_i[G = g] \leftarrow \text{FindQn}(z_i[G = g], \lambda_g, \gamma)$
11:  $\mathcal{L} \leftarrow \sum_{i=1}^n w_{g(i)} w_i z_i$
12:  $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}$

Algorithm 4 shows the procedure FindQn that solves the stochastic optimization problem of Theorem 2. The algorithm expands on the approach of Duchi and Namkoong (2019) of projecting the loss values onto the probability simplex to find $Q_n$. The advantage of the presented method is that, to recover the distribution that satisfies the constraints of Theorem 2, we only need to find the value of a single variable, $\rho$, compared to two, $\alpha$ and $\mu$. Alternatively, one can set $\rho$ as a training parameter and update its value during back-propagation (Levy et al. 2020); however, this variant of the algorithm is not explored in this thesis.

Algorithm 4: FindQn Procedure
Input: $z$ – loss values; $\lambda, \gamma$ – hyper-parameters
Parameters: $\epsilon$ – tolerance; $a, b$ – initial search range
Result: $q$ – sample probabilities
1: $D_{\chi^2} \leftarrow \lambda^2 n^2\, \mathbb{V}_{P_n}[z]^{2\gamma - 1}$        // target divergence
2: repeat
3:   $\rho \leftarrow \frac{1}{2}(a + b)$
4:   $r \leftarrow [z - \rho]_+$
5:   $q \leftarrow r / \sum_{i=1}^n r_i$
6:   $\hat{D}_{\chi^2} \leftarrow \sum_{i=1}^n \frac{1}{n}(n q_i - 1)^2$            // current divergence
7:   if $\hat{D}_{\chi^2} - D_{\chi^2} < 0$ then $a \leftarrow \rho$ else $b \leftarrow \rho$
8: until $|\hat{D}_{\chi^2} - D_{\chi^2}| < \epsilon$
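For reference, a direct Python transcription of the bisection is given below. It is a sketch that mirrors Algorithm 4 line by line (numpy-based, with a fixed iteration cap added for safety), not the optimized implementation from our repository.

```python
import numpy as np

def find_qn(z, lam, gamma, eps=1e-8, max_iter=100):
    """Bisection on rho so that the chi-square divergence of q to the
    uniform distribution matches the target set by (lam, gamma)."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    target = lam**2 * n**2 * z.var() ** (2 * gamma - 1)  # target divergence
    a, b = z.min() - 1.0, z.max()        # initial search range for rho
    for _ in range(max_iter):
        rho = 0.5 * (a + b)
        r = np.clip(z - rho, 0.0, None)  # r <- [z - rho]_+
        q = r / r.sum()
        div = np.mean((n * q - 1.0) ** 2)  # current chi-square divergence
        if abs(div - target) < eps:
            break
        if div < target:
            a = rho                      # need a more peaked q
        else:
            b = rho                      # need a flatter q
    return q
```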
Next, we highlight the influence of variance penalization on the update strategy of the parameters $\theta$. Let $z = \ell(\theta,x,y)$ and define the risk of the direct variance penalization as $R_{P_n} = \mathbb{E}_{P_n}[z] + \lambda\, \mathbb{V}_{P_n}[z]^\gamma$. Then, differentiating yields $\nabla_\theta R_{P_n} = \sum_{i=1}^n w_i \nabla_\theta z_i$ with the following expression for the weights $w_i$:

$$w_i = \frac{1}{n}\Big(1 + 2\lambda\gamma\, \mathbb{V}_{P_n}[z]^{\gamma-1} \big(z_i - \mathbb{E}_{P_n}[z]\big)\Big) \qquad (4.9)$$

On the other hand, when differentiating the objective of the distributional variance penalization, we have $\nabla_\theta \mathbb{E}_{Q_n}[z] = \sum_{i=1}^n q_i \nabla_\theta z_i$, where the probabilities $q_i$ have the following expression:

$$q_i = \Big[\frac{1}{n} + \frac{\lambda\alpha}{n}\, \mathbb{V}_{P_n}[z]^{\gamma-1} \big(z_i - \mathbb{E}_{P_n}[z]\big) - \mu\Big]_+ \qquad (4.10)$$

Of note, the formulas for $w_i$ and $q_i$ are very much alike: both change the impact of the gradient linearly with the distance of the sample loss value from the mean. However, in the case of $q_i$, we can avoid unstable behavior during optimization since the $q_i$ are nonnegative. We emphasize that the probabilities $q_i$ must be treated as constants; otherwise, differentiation produces an additional term, $\ell(\theta,x_i,y_i)\nabla_\theta q_i$, which can be negative and lead to unstable behavior. To treat the $q_i$ as constants, use the detach() function in PyTorch and stop_gradient() in TensorFlow.
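A minimal PyTorch fragment showing the detach pattern (with stand-in losses and weights; any q computed from z behaves the same way):

```python
import torch

z = torch.rand(8, requires_grad=True)  # stand-in per-sample losses
lam = 0.5
# Stand-in probabilities in the spirit of Eq. (4.10).
q = (1.0 / len(z) + lam * (z - z.mean())).clamp(min=0)
q = q / q.sum()

# detach() treats q_i as constants, removing the
# l(theta, x_i, y_i) * grad(q_i) term from the backward pass.
loss = (q.detach() * z).sum()
loss.backward()
```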
4.5 Experimental Results

Figure 4.2: Experimental results for CIFAR-10 when optimizing fairness using distributional variance penalization. The top row (a to c) shows the results when we only optimize for group means ($\lambda = 0$ and $\gamma = 0$), and the bottom row (d to f) when we optimize for both mean and variance (we set $\bar\lambda = 0.8$ and $\bar\gamma = 1$ and vary $\lambda$ and $\gamma$). The black dotted line shows the performance of the baseline model.

We assess the performance of the proposed method on three datasets spanning a wide range of difficulty: (i) the CIFAR-10 dataset (Krizhevsky, Hinton et al. 2009), consisting of color images distributed across 10 classes such as airplane, bird, and cat; (ii) the Adult dataset (Kohavi 1996), created to represent a real-world source of socially relevant data; and (iii) a Synthetic dataset (Donini et al. 2018) created for assessing fairness.

For the CIFAR-10 dataset, we use only 8% of the samples for the first 5 classes to skew the distribution of the training data. To create the sensitive groups, we randomly convert one-third of the samples to grayscale to form the gray group, and leave the rest of the samples unaltered and assign them to the color group. For evaluation, we use the provided test set and, for each class, convert and assign half of the samples to the gray group. These modifications enable us to evaluate the fairness and the robustness to shifts in the data distribution of the proposed method. For the Adult dataset, we use the gender attribute to construct the sensitive groups (two-thirds of the samples belong to the Male group and one-third to the Female group) and use the provided test set for evaluation. For the Synthetic dataset, we use the provided sensitive groups A and B (75% of the samples belong to A and 25% to B), and for the test set, we use a different random seed to generate a new dataset. For each training dataset, we retain 10% of the samples for validation and hyper-parameter selection. We describe the model configuration and training setup at the end of the chapter and distribute our Python scripts as part of our repository (footnote 1).

For each dataset, we train four models: (i) a baseline using the ERM objective, (ii) the Fair CVaR of Williamson and Menon (2019), and two using our proposed objective, Fair Var. Pen.: (iii) one where we equalize only the group means, and (iv) one where we equalize both the mean and the variance. For all four models, we use the cross-entropy as the loss function.

Figure 4.2 shows the fairness results of the Fair Var. Pen. method on the CIFAR-10 dataset for two scenarios: the top row when we only optimize for the group mean, and the bottom row when we optimize for both the group mean and variance. In the first scenario, for $\bar\lambda > 8$ and $\bar\gamma = 0.5$, the loss mean and variance of the two groups closely match. Moreover, the accuracy of both groups increases compared to the baseline, and the difference between the two decreases from 2.7% to 2%. Note that the superior results for $\bar\gamma = 0.5$ follow from the fact that for $\gamma < 1$ the strength of the penalization increases as the variance is minimized. The Fair CVaR method achieves similar values for accuracy and loss variance; however, the loss means of the two groups are farther apart, see Table 4.1 and the figures in the Supplementary Materials. In the second scenario, when we optimize the loss mean and variance (bottom row of Figure 4.2), we further increase the accuracy by 3% on both groups ($\lambda = 16$ and $\gamma = 0.5$) but trade in group fairness, as the difference between the group metrics increases. Note that the accuracy increases because decreasing the variance improves model robustness to distributional shifts. With respect to fairness, we improve individual fairness at the cost of group fairness.

We obtained similar experimental results for the other two datasets, which we outline in Table 4.1, with additional figures provided as part of the Supplementary Materials. On the Adult dataset, our proposed method achieves the lowest difference between group means and the lowest group variance, however, at the cost of a slightly smaller decrease in the accuracy gap. On the Synthetic dataset, our proposed method achieves the lowest variance of the groups at the cost of an increased loss mean, but with no decrease in accuracy. On the other hand, the CVaR method brings the group loss means closer together compared to the other methods; however, its registered loss variance is high and the accuracy of both groups decreases.

To summarize our experimental results: when we only optimize for the group mean, our proposed method Fair Var. Pen. achieves the lowest, or close to the lowest, difference between the group means; this sometimes comes at the cost of a marginal decrease in accuracy (Synthetic dataset). When we optimize to equalize both the group mean and the group variance, our proposed method Fair Var. Pen. achieves the lowest loss variance for all groups, which improves accuracy when the distribution of the training data is skewed (Synthetic and Adult datasets).
                          Loss Mean           Loss Variance        Accuracy (%)
Method                    A     B     diff    A     B     diff     A     B     diff
CIFAR-10 †
  Baseline                4.08  2.42  1.66    1.25  0.49  0.76     59.6  56.9  2.7
  Fair CVaR               3.81  3.18  0.63    0.78  0.84  0.06     60.5  58.4  2.1
  Fair Var. Pen. ‡        3.44  3.39  0.05    1.03  1.08  0.05     60.4  58.4  2.0
  Fair Var. Pen.          2.99  1.99  1.00    0.45  0.24  0.21     62.5  60.1  2.4
Adult
  Baseline                0.21  0.44  0.23    0.22  0.24  0.02     87.1  75.5  11.6
  Fair CVaR               0.25  0.44  0.19    0.19  0.24  0.05     86.6  75.9  10.7
  Fair Var. Pen. ‡        0.26  0.45  0.19    0.17  0.18  0.01     87.1  75.6  11.5
  Fair Var. Pen.          0.26  0.45  0.19    0.12  0.12  0.01     86.5  75.1  11.4
Synthetic
  Baseline                0.32  0.16  0.16    0.32  0.20  0.12     83.6  92.3  8.7
  Fair CVaR               0.33  0.22  0.11    0.28  0.13  0.15     84.7  90.5  5.8
  Fair Var. Pen. ‡        0.33  0.19  0.14    0.32  0.13  0.19     84.6  90.6  6.0
  Fair Var. Pen.          0.48  0.23  0.25    0.04  0.07  0.03     84.2  92.4  8.2

Table 4.1: Entries represent the average of three runs under different random initializations, with the best result for each scenario underlined. † Loss Mean values have exponent e−2 and Loss Variance e−3. ‡ Indicates the setup where we only optimize for the group mean.

Model Configuration and Training Setup

When training on the CIFAR-10 dataset, we use an 8-layer network composed of 6 convolutional layers followed by 2 fully connected layers. The training batch size was 128 samples. The optimizer was stochastic gradient descent (SGD) with 0.9 momentum and a learning rate of 0.01, which is divided by 2 every 10 epochs, for a total of 40 training epochs. Moreover, we used a weight decay of 0.005 for the convolutional layers and 0.01 for the fully connected layers. When training on the Synthetic and Adult datasets, we use a fully connected network with one hidden layer of 8 neurons for Synthetic and 128 neurons for Adult, followed by a sigmoid activation function. For the loss function, we use the classical cross-entropy loss. The training batch size was 128 samples. The optimizer was stochastic gradient descent (SGD) with 0.9 momentum and a learning rate of 0.02, and we train for a total of 40 epochs.

4.6 Chapter Conclusion

We introduce a new definition of group fairness that requires all groups to achieve the same loss mean and variance, and provide a corresponding optimization procedure for enforcing fairness which also preserves the convexity of the objective. One of the most significant findings to emerge from this study is the link between the proposed algorithm and distributionally robust optimization, where we optimize against the worst-case shift in the data distribution. This is of paramount interest when optimizing for fairness, as it is common for some sensitive groups to be poorly represented in the training dataset, which causes their empirical data distribution to diverge from the true underlying data distribution. Although this chapter centers on classification problems, the mathematical framework also covers regression, ranking, and clustering problems.

Chapter 5

Conclusion and Ongoing Work

5.1 Summary of completed work

The completed work investigates optimization strategies based on distributional variance penalization and the flexibility these strategies provide in enforcing properties such as robustness and fair group classification. We address the limitation of direct variance penalization by using probabilities instead of weights. One of the most significant findings to emerge from this study is that we can control the final distribution of the loss values and train models which are robust to distributional shifts and are fair to sensitive groups.
This improves upon the existing CVaR objectives, which optimize only the mean, whereas the proposed solution extends them to also include the variance.

5.2 Ongoing and future work

Although the completed work centered on variance penalization and classification problems, the framework can be extended to cover a generalized definition of variance and applied to a variety of machine-learning problems. In particular, the ongoing (OW) and future work (FW) will focus on:

(OW1) Extending the theoretical results to general instances of alpha-, beta-, and gamma-divergences. This extension will enable us to include other higher-order central moments and generalize the distributional variance penalization method. The advantage is that it allows much finer control over the final distribution of loss values.

(OW2) Natural language processing (NLP): investigating fine-tuning large language models to remove bias. A recent trend in NLP is to pre-train large language models on general text datasets and then fine-tune and distill the pretrained model on an application-specific dataset. The issue is that the majority of large language models are trained on biased datasets, and thus the learned biases are transferred to the fine-tuned model.

(FW1) Clustering and community detection. Recent formulations of clustering as a convex optimization objective can be extended using moments penalization. Penalizing moments will provide finer control over how border samples (clustering) or small communities (community detection) are treated by the algorithm.

(FW2) Regression problems. Although regression problems are considered easier than classification, this is no longer the case when fair regression is desired. Fair regression requires samples across the entire range to be treated equally, which can be achieved by first clustering the samples and then using this information during optimization.

Bibliography

Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Isard, M.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Mané, D.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Viégas, F.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu, Y.; and Zheng, X. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org.

Adler, P.; Falk, C.; Friedler, S. A.; Nix, T.; Rybeck, G.; Scheidegger, C.; Smith, B.; and Venkatasubramanian, S. 2018. Auditing black-box models for indirect influence. Knowledge and Information Systems, 54(1): 95–122.

Agarwal, A.; Beygelzimer, A.; Dudík, M.; Langford, J.; and Wallach, H. 2018. A reductions approach to fair classification. In International Conference on Machine Learning, 60–69. PMLR.

Aghaei, S.; Azizi, M. J.; and Vayanos, P. 2019. Learning optimal and fair decision trees for non-discriminative decision-making. In Proceedings of the AAAI Conference on Artificial Intelligence.

Alabi, D.; Immorlica, N.; and Kalai, A. 2018. Unleashing linear optimizers for group-fair learning and optimization. In Conference On Learning Theory, 2043–2066. PMLR.

Arazo, E.; Ortego, D.; Albert, P.; O'Connor, N.; and McGuinness, K. 2019. Unsupervised label noise modeling and loss correction. In ICML.

Arpit, D.; Jastrzębski, S.; Ballas, N.; Krueger, D.; Bengio, E.; Kanwal, M. S.; Maharaj, T.; Fischer, A.; Courville, A.; Bengio, Y.; et al. 2017. A closer look at memorization in deep networks. In ICML.
Bartlett, P. L.; Jordan, M. I.; and McAuliffe, J. D. 2006. Convexity, classification, and risk bounds. JASA.

Bechmann, A.; and Zevenbergen, B. 2019. AI and Machine Learning: Internet Research Ethics Guidelines (IRE 3.0 6.1). Internet Research: Ethical Guidelines 3.0.

Bekker, A. J.; and Goldberger, J. 2016. Training deep neural-networks based on unreliable labels. In ICASSP.

Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In ICML.

Berk, R.; Heidari, H.; Jabbari, S.; Kearns, M.; and Roth, A. 2021. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research.

Buolamwini, J.; and Gebru, T. 2018. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency. PMLR.

Calmon, F.; Wei, D.; Vinzamuri, B.; Natesan Ramamurthy, K.; and Varshney, K. R. 2017. Optimized Pre-Processing for Discrimination Prevention. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Caton, S.; and Haas, C. 2020. Fairness in machine learning: A survey. arXiv preprint arXiv:2010.04053.

Celis, L. E.; Huang, L.; Keswani, V.; and Vishnoi, N. K. 2019. Classification with fairness constraints: A meta-algorithm with provable guarantees. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 319–328.

Chen, X.; and Gupta, A. 2015. Webly supervised learning of convolutional networks. In ICCV.

Chouldechova, A. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2): 153–163.

Corbett-Davies, S.; Pierson, E.; Feller, A.; Goel, S.; and Huq, A. 2017. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Cotter, A.; Jiang, H.; Gupta, M. R.; Wang, S.; Narayan, T.; You, S.; and Sridharan, K. 2019. Optimization with Non-Differentiable Constraints with Applications to Fairness, Recall, Churn, and Other Goals. J. Mach. Learn. Res., 20(172): 1–59.

Donini, M.; Oneto, L.; Ben-David, S.; Shawe-Taylor, J. S.; and Pontil, M. 2018. Empirical Risk Minimization Under Fairness Constraints. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.

Dua, D.; and Graff, C. 2017. UCI Machine Learning Repository.

Duchi, J.; and Namkoong, H. 2019. Variance-based regularization with convex objectives. The Journal of Machine Learning Research, 20(1): 2450–2504.

Dwork, C.; Hardt, M.; Pitassi, T.; Reingold, O.; and Zemel, R. 2012. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, 214–226.

Dwork, C.; Immorlica, N.; Kalai, A. T.; and Leiserson, M. 2018. Decoupled classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability and Transparency, 119–133. PMLR.

European Commission. 2018. Communication Artificial Intelligence for Europe. https://digital-strategy.ec.europa.eu/en/library/communication-artificial-intelligence-europe. Accessed: 2022-08-05.

Fabris, A.; Messina, S.; Silvello, G.; and Susto, G. A. 2022. Algorithmic Fairness Datasets: the Story so Far. arXiv preprint arXiv:2202.01711.

Feng, L.; Shu, S.; Lin, Z.; Lv, F.; Li, L.; and An, B. 2020. Can cross entropy loss be robust to label noise. In IJCAI.
Galhotra, S.; Brun, Y.; and Meliou, A. 2017. Fairness testing: testing software for discrimination. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering.

Ghosh, A.; Kumar, H.; and Sastry, P. 2017. Robust loss functions under label noise for deep neural networks. In AAAI.

Ghosh, A.; and Lan, A. 2021. Do We Really Need Gold Samples for Sample Weighting under Label Noise? In WACV.

Goel, N.; Yaghini, M.; and Faltings, B. 2018. Non-discriminatory machine learning through convex fairness criteria. In Proceedings of the AAAI Conference on Artificial Intelligence.

Goldberger, J.; and Ben-Reuven, E. 2017. Training deep neural-networks using a noise adaptation layer. In ICLR.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems, 27.

Gordaliza, P.; Del Barrio, E.; Fabrice, G.; and Loubes, J.-M. 2019. Obtaining fairness using optimal transport theory. In International Conference on Machine Learning, 2357–2365. PMLR.

Grgic-Hlaca, N.; Zafar, M. B.; Gummadi, K. P.; and Weller, A. 2016. The case for process fairness in learning: Feature selection for fair decision making. In NIPS Symposium on Machine Learning and the Law, volume 1, 2. Barcelona, Spain.

Hardt, M.; Price, E.; Price, E.; and Srebro, N. 2016. Equality of Opportunity in Supervised Learning. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.

Harutyunyan, H.; Reing, K.; Ver Steeg, G.; and Galstyan, A. 2020. Improving generalization by controlling label-noise information in neural network weights. In ICML.

Hashimoto, T.; Srivastava, M.; Namkoong, H.; and Liang, P. 2018. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, 1929–1938. PMLR.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Heidari, H.; Ferrari, C.; Gummadi, K.; and Krause, A. 2018. Fairness behind a veil of ignorance: A welfare analysis for automated decision making. Advances in Neural Information Processing Systems.

Heidari, H.; Loi, M.; Gummadi, K. P.; and Krause, A. 2019. A moral framework for understanding fair ML through economic models of equality of opportunity. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 181–190.

Heinze-Deml, C.; and Meinshausen, N. 2021. Conditional variance penalties and domain shift robustness. Machine Learning, 110(2): 303–348.

Hendrycks, D.; Lee, K.; and Mazeika, M. 2019. Using pre-training can improve model robustness and uncertainty. In ICML.

Hendrycks, D.; Mazeika, M.; Wilson, D.; and Gimpel, K. 2018. Using trusted data to train deep networks on labels corrupted by severe noise. In NeurIPS.

Hinton, G. E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2012. Improving neural networks by preventing co-adaptation of feature detectors. NeurIPS.

Hoeffding, W. 1963. Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 13–30.

Hu, W.; Niu, G.; Sato, I.; and Sugiyama, M. 2018. Does distributionally robust supervised learning give robust classifiers? In International Conference on Machine Learning, 2029–2037. PMLR.

Jiang, H.; and Nachum, O. 2020. Identifying and correcting label bias in machine learning. In International Conference on Artificial Intelligence and Statistics.
Jiang, L.; Zhou, Z.; Leung, T.; Li, L.-J.; and Fei-Fei, L. 2018. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML.

Jiang, R.; Pacchiano, A.; Stepleton, T.; Jiang, H.; and Chiappa, S. 2020. Wasserstein fair classification. In Uncertainty in Artificial Intelligence, 862–872. PMLR.

Kearns, M.; Neel, S.; Roth, A.; and Wu, Z. S. 2018. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International Conference on Machine Learning, 2564–2572. PMLR.

Kearns, M.; Neel, S.; Roth, A.; and Wu, Z. S. 2019. An empirical study of rich subgroup fairness for machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency.

Kilbertus, N.; Rojas Carulla, M.; Parascandolo, G.; Hardt, M.; Janzing, D.; and Schölkopf, B. 2017. Avoiding discrimination through causal reasoning. Advances in Neural Information Processing Systems, 30.

Kim, J. S.; Chen, J.; and Talwalkar, A. 2020. Fact: A diagnostic for group fairness trade-offs. In International Conference on Machine Learning.

Kleinberg, J.; Mullainathan, S.; and Raghavan, M. 2016. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807.

Kohavi, R. 1996. Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 202–207.

Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.

Kumar, A.; and Amid, E. unpublished. Constrained Instance and Class Reweighting for Robust Learning under Label Noise. arXiv:2111.05428.

Kusner, M. J.; Loftus, J.; Russell, C.; and Silva, R. 2017. Counterfactual Fairness. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Lam, H. 2019. Recovering best statistical guarantees via the empirical divergence-based distributionally robust optimization. Operations Research, 67(4): 1090–1105.

Levy, D.; Carmon, Y.; Duchi, J. C.; and Sidford, A. 2020. Large-scale methods for distributionally robust optimization. Advances in Neural Information Processing Systems, 33: 8847–8860.

Li, T.; Beirami, A.; Sanjabi, M.; and Smith, V. 2021. Tilted empirical risk minimization. ICLR.

Li, Y.; Yang, J.; Song, Y.; Cao, L.; Luo, J.; and Li, L.-J. 2017. Learning from noisy labels with distillation. In ICCV.

Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In ICCV.

Liu, S.; Niles-Weed, J.; Razavian, N.; and Fernandez-Granda, C. 2020. Early-Learning Regularization Prevents Memorization of Noisy Labels. In NeurIPS.

Liu, X.; Li, S.; Kan, M.; Shan, S.; and Chen, X. 2017. Self-error-correcting convolutional neural network for learning with noisy labels. In FG.

Lukasik, M.; Bhojanapalli, S.; Menon, A.; and Kumar, S. 2020. Does label smoothing mitigate label noise? In ICML.

Ma, X.; Huang, H.; Wang, Y.; Romano, S.; Erfani, S.; and Bailey, J. 2020. Normalized loss functions for deep learning with noisy labels. In ICML.

Maurer, A.; and Pontil, M. 2009. Empirical Bernstein bounds and sample variance penalization. Proc. Computational Learning Theory Conference (COLT).

McNamara, D.; Ong, C. S.; and Williamson, R. C. 2019. Costs and benefits of fair representation learning. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 263–270.
Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; and Galstyan, A. 2021. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6): 1–35.

Min, E.; Guo, X.; Liu, Q.; Zhang, G.; Cui, J.; and Long, J. 2018. A survey of clustering with deep learning: From the perspective of network architecture. IEEE Access, 6: 39501–39514.

Mirzasoleiman, B.; Cao, K.; and Leskovec, J. 2020. Coresets for Robust Training of Deep Neural Networks against Noisy Labels. In NeurIPS.

Mnih, V.; and Hinton, G. E. 2012. Learning to label aerial images from noisy data. In ICML.

Munoz, C.; Smith, M.; and Patil, D. 2016. Big Data: A Report on Algorithmic Systems, Opportunity, and Civil Rights. https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/2016_0504_data_discrimination.pdf. Accessed: 2022-08-05.

Nabi, R.; and Shpitser, I. 2018. Fair inference on outcomes. In Proceedings of the AAAI Conference on Artificial Intelligence.

Namkoong, H.; and Duchi, J. C. 2017. Variance-based Regularization with Convex Objectives. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Narasimhan, H. 2018. Learning with complex loss functions and constraints. In International Conference on Artificial Intelligence and Statistics.

Natarajan, N.; Dhillon, I. S.; Ravikumar, P.; and Tewari, A. 2013. Learning with noisy labels. In NIPS.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, 8024–8035. Curran Associates, Inc.

Patrini, G.; Nielsen, F.; Nock, R.; and Carioni, M. 2016. Loss factorization, weakly supervised learning and label noise robustness. In ICML.

Patrini, G.; Rozza, A.; Krishna Menon, A.; Nock, R.; and Qu, L. 2017. Making deep neural networks robust to label noise: A loss correction approach. In CVPR.

Rahimian, H.; and Mehrotra, S. unpublished. Distributionally robust optimization: A review. arXiv:1908.05659.

Redmon, J.; Divvala, S.; Girshick, R.; and Farhadi, A. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779–788.

Ren, M.; Zeng, W.; Yang, B.; and Urtasun, R. 2018. Learning to reweight examples for robust deep learning. In ICML.

Robey, A.; Chamon, L.; Pappas, G. J.; and Hassani, H. 2022. Probabilistically Robust Learning: Balancing Average- and Worst-case Performance. ICML.

Rockafellar, R. T.; and Uryasev, S. 2000. Optimization of conditional value-at-risk. Journal of Risk.

Rooyen, B. v.; Menon, A. K.; and Williamson, R. C. 2015. Learning with symmetric label noise: the importance of being unhinged. In NeurIPS.

Royset, J. O.; and Wets, R. J.-B. 2017. Variational theory for optimization under stochastic ambiguity. SIAM Journal on Optimization, 27(2): 1118–1149.

Saravanakumar, K. K. 2020. The Impossibility Theorem of Machine Fairness – A Causal Perspective. arXiv preprint arXiv:2007.06024.

Seldin, Y.; Laviolette, F.; Cesa-Bianchi, N.; Shawe-Taylor, J.; and Auer, P. 2012. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12): 7086–7093.

Shapiro, A. 2017. Distributionally robust stochastic programming. SIAM Journal on Optimization, 27(4): 2258–2275.
Sohoni, N.; Dunnmon, J.; Angus, G.; Gu, A.; and Ré, C. 2020. No subclass left behind: Fine-grained robustness in coarse-grained classification problems. Advances in Neural Information Processing Systems, 33: 19339–19352.

Song, H.; Kim, M.; Park, D.; Shin, Y.; and Lee, J.-G. 2022. Learning from noisy labels with deep neural networks: A survey. IEEE TNNLS.

Speicher, T.; Heidari, H.; Grgic-Hlaca, N.; Gummadi, K. P.; Singla, A.; Weller, A.; and Zafar, M. B. 2018. A unified approach to quantifying algorithmic unfairness: Measuring individual & group unfairness via inequality indices. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2239–2248.

Staib, M.; Wilder, B.; and Jegelka, S. 2019. Distributionally robust submodular maximization. In The 22nd International Conference on Artificial Intelligence and Statistics, 506–516. PMLR.

Sukhbaatar, S.; Bruna, J.; Paluri, M.; Bourdev, L.; and Fergus, R. 2015. Training convolutional networks with noisy labels. In ICLR.

Sun, H.; and Xu, H. 2016. Convergence analysis for distributionally robust optimization and equilibrium problems. Mathematics of Operations Research, 41(2): 377–401.

Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In CVPR.

Tolstikhin, I.; and Seldin, Y. 2013. PAC-Bayes-empirical-Bernstein inequality. Advances in Neural Information Processing Systems 26 (NIPS 2013), 1–9.

Torralba, A.; and Efros, A. A. 2011. Unbiased look at dataset bias. In CVPR.

Vahdat, A. 2017. Toward robustness against label noise in training deep discriminative neural networks. In NeurIPS.

Veit, A.; Alldrin, N.; Chechik, G.; Krasin, I.; Gupta, A.; and Belongie, S. 2017. Learning from noisy large-scale datasets with minimal supervision. In CVPR.

Verma, S.; and Rubin, J. 2018. Fairness definitions explained. In 2018 IEEE/ACM International Workshop on Software Fairness (FairWare).

Wang, X.; Kodirov, E.; Hua, Y.; and Robertson, N. 2019a. Improved mean absolute error for learning meaningful patterns from abnormal training data. Technical report.

Wang, Y.; Ma, X.; Chen, Z.; Luo, Y.; Yi, J.; and Bailey, J. 2019b. Symmetric cross entropy for robust learning with noisy labels. In ICCV.

Wang, Z.; Hu, G.; and Hu, Q. 2020. Training Noise-Robust Deep Neural Networks via Meta-Learning. In CVPR.

Williamson, R.; and Menon, A. 2019. Fairness risk measures. In International Conference on Machine Learning, 6786–6797. PMLR.

Wu, P.; Zheng, S.; Goswami, M.; Metaxas, D.; and Chen, C. 2020. A Topological Filter for Learning with Label Noise. In NeurIPS.

Xia, X.; Liu, T.; Han, B.; Wang, N.; Gong, M.; Liu, H.; Niu, G.; Tao, D.; and Sugiyama, M. 2020. Part-dependent label noise: Towards instance-dependent label noise. In NeurIPS.

Xia, X.; Liu, T.; Wang, N.; Han, B.; Gong, C.; Niu, G.; and Sugiyama, M. 2019. Are Anchor Points Really Indispensable in Label-Noise Learning? In NeurIPS.

Xu, Y.; Cao, P.; Kong, Y.; and Wang, Y. 2019. L_DMI: A Novel Information-theoretic Loss Function for Training Deep Nets Robust to Label Noise. In NeurIPS.

Yao, J.; Wu, H.; Zhang, Y.; Tsang, I. W.; and Sun, J. 2019. Safeguarded dynamic label regression for noisy supervision. In AAAI.

Yao, Y.; Liu, T.; Han, B.; Gong, M.; Deng, J.; Niu, G.; and Sugiyama, M. 2020. Dual T: Reducing Estimation Error for Transition Matrix in Label-noise Learning. In NeurIPS.

Yuan, B.; Chen, J.; Zhang, W.; Tai, H.-S.; and McMains, S. 2018. Iterative cross learning on noisy labels. In WACV.
Zafar, M. B.; Valera, I.; Rodriguez, M.; Gummadi, K.; and Weller, A. 2017. From Parity to Preference-based Notions of Fairness in Classification. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; and Vinyals, O. 2017a. Understanding deep learning requires rethinking generalization. ICLR.

Zhang, J.; Sheng, V. S.; Li, T.; and Wu, X. 2017b. Improving crowdsourced label quality using noise correction. IEEE Trans. Neural Netw. Learn. Syst.

Zhang, S.; Yao, L.; Sun, A.; and Tay, Y. 2019. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR), 52(1): 1–38.

Zhang, Y.; Zheng, S.; Wu, P.; Goswami, M.; and Chen, C. 2021. Learning with Feature-Dependent Label Noise: A Progressive Approach. In ICLR.

Zhang, Z.; and Sabuncu, M. R. 2018. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS.

Zhang, Z.; Zhang, H.; Arik, S. O.; Lee, H.; and Pfister, T. 2020. Distilling effective supervision from severe label noise. In CVPR.

Zhao, H.; and Gordon, G. 2019. Inherent tradeoffs in learning fair representations. Advances in Neural Information Processing Systems, 32.

Zhao, J.; Wang, T.; Yatskar, M.; Ordonez, V.; and Chang, K.-W. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457.

Zheng, S.; Wu, P.; Goswami, A.; Goswami, M.; Metaxas, D.; and Chen, C. 2020. Error-bounded correction of noisy labels. In ICML.

Appendices

A Supplementary Figures – Fairness

Figure 5.1: Metrics for CVaR on the CIFAR-10 dataset.

Figure 5.2: Metrics for the Variance Penalization method on the Adult dataset using $\bar\lambda = 2$ and $\bar\gamma = 0.75$.

Figure 5.3: Metrics for the CVaR method on the Adult dataset.

B Weighted Mean Trick

The weighted mean trick is an alternative strategy for variance penalization; it is substantially less demanding computationally and yields results similar to the distributional variance penalization, although it has much weaker theoretical support. After presenting the updated notation used throughout this appendix, we introduce the moments penalization strategy and the weighted mean trick, a computational technique that makes moments penalization practical. Next, we illustrate the application of moments penalization, using the weighted mean trick formulation, to optimize for robustness and fairness, respectively.

Updated notation

In this appendix, we use the following notation. A training dataset is defined as $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathcal{X} \subset \mathbb{R}^d$ are the features and $y_i \in \mathcal{Y} = \{1,\dots,k\}$ represent the class labels. Class labels corrupted by noise are indicated with a hat, $\hat{y}_i$. When optimizing for fairness, the dataset is extended to include a sensitive group, $\mathcal{D} = \{(x_i, y_i, s_i)\}_{i=1}^n$, where $s_i \in \mathcal{S} = \{1,\dots,m\}$ indicates the group membership. The classifier $f : \mathbb{R}^d \to \mathcal{C}$ is a mapping from the feature space to the probability simplex $\mathcal{C}$. Given a loss function $\ell : \mathcal{C} \times \mathcal{Y} \to \mathbb{R}_+$, we define the population risk as $L = \mathbb{E}_{\mathcal{D}}[\ell(f(x_i), y_i)]$ and the group risk as $L_s = \mathbb{E}_{\mathcal{D}}[\ell(f(x_i), y_i) \mid s_i = s]$. To simplify the notation, we drop the dataset $\mathcal{D}$ from the expectation subscript and only use a subscript to indicate conditional expectation on a sensitive group, i.e., $\mathbb{E}_s[\ell(f(x_i), y_i)] \Leftrightarrow \mathbb{E}_{\mathcal{D}}[\ell(f(x_i), y_i) \mid s_i = s]$.

Figure 5.4: Experimental results for fairness optimization using distributional variance penalization for the Synthetic dataset. We plot the mean and standard deviation across three random initializations of the model evaluated on the validation set. Top row (a to d): results when we only optimize for groups to have the same mean loss values and ignore the group variance, i.e., $\lambda = 0$. Bottom row (e to h): results when we optimize for groups to have the same loss mean and variance; we set $\bar\lambda = 2$ and $\bar\gamma = 0.75$ and only vary $\lambda$ and $\gamma$. Subplots (a) and (e) show the group loss mean. Subplots (b) and (f) show the group loss variance. Subplots (c) and (g) show the group accuracy. Subplots (d) and (h) show the histogram of the predicted probabilities for the correct label when using $\bar\lambda = 5$, $\bar\gamma = 0.75$, and $\lambda = 0$ for (d), and $\bar\lambda = 2$, $\bar\gamma = 0.75$, $\lambda = 2$, and $\gamma = 0.5$ for (h).
B.1 Moments penalization

B.1.1 Theory

In ERM, the impact of a sample on the model parameters is determined by the balance between its loss value and the average of the training batch. This is especially visible in the later stages of training, when the majority of samples have a small loss value except for a few which are left uncaptured by the model. In that case, we can either consider them as unlearned samples and amplify their impact, or consider them as outliers and suppress their impact on the model. The weighted mean trick consists in assigning weights to each sample based on its loss value to either amplify or suppress its impact on training. This is similar to dropout (Hinton et al. 2012), but for losses; however, the weighted mean trick is a deterministic process, and the weights are not restricted to 0 or 1 but can take any non-negative value.

Figure 5.5: Metrics for the CVaR method on the Synthetic dataset.

Figure 5.6: Metrics for the Variance Penalization method on the validation set of the Synthetic dataset using $\bar\lambda = 1$ and $\bar\gamma = 0.75$.

In what follows, we apply the weighted mean trick to extend the ERM framework to a multi-objective optimization of the mean and higher-order moments of the loss function. First, we show that for some distinct weights, optimizing the weighted mean is equivalent to a simultaneous optimization of the mean and variance.

Figure 5.7: Metrics for the Variance Penalization method on the validation set of the Synthetic dataset using $\bar\lambda = 2$ and $\bar\gamma = 0.75$.

Figure 5.8: Metrics for the Variance Penalization method on the validation set of the Synthetic dataset using $\bar\lambda = 2$ and $\bar\gamma = 1$.

Lemma 7: Variance Expansion
Let $\ell$ be a random variable with finite first and second moments and $W = 1 + \lambda(\ell - \mathbb{E}[\ell])$. Then taking the weighted mean of $\ell$ with weights $W$ is equivalent to:

$$\mathbb{E}[W\ell] = \mathbb{E}[\ell] + \lambda\, \mathbb{V}[\ell] \qquad (B.1)$$

Figure 5.9: Metrics for the Variance Penalization method on the validation set of the Adult dataset using $\bar\lambda = 1$ and $\bar\gamma = 0.75$.

Figure 5.10: Metrics for the Variance Penalization method on the validation set of the Adult dataset using $\bar\lambda = 2$ and $\bar\gamma = 1$.

Proof. Replacing $W$ with its definition and using the linearity of expectation along with Proposition 17 (Appendix B.6), we get:

$$\begin{aligned}
\mathbb{E}[W\ell] &= \mathbb{E}\big[\big(1 + \lambda(\ell - \mathbb{E}[\ell])\big)\ell\big] \\
&= \mathbb{E}[\ell] + \lambda\, \mathbb{E}\big[(\ell - \mathbb{E}[\ell])\ell\big] \\
&= \mathbb{E}[\ell] + \lambda\, \mathbb{E}\big[(\ell - \mathbb{E}[\ell])(\ell - \mathbb{E}[\ell] + \mathbb{E}[\ell])\big] \\
&= \mathbb{E}[\ell] + \lambda\, \mathbb{E}\big[(\ell - \mathbb{E}[\ell])^2\big] \\
&= \mathbb{E}[\ell] + \lambda\, \mathbb{V}[\ell]
\end{aligned}$$
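Identity (B.1) holds exactly for empirical moments as well, which is easy to verify numerically (a quick check of our own, not part of the experimental pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
losses, lam = rng.exponential(size=10_000), 0.3

w = 1 + lam * (losses - losses.mean())    # W = 1 + lambda (l - E[l])
lhs = (w * losses).mean()                 # E[W l]
rhs = losses.mean() + lam * losses.var()  # E[l] + lambda V[l]
assert np.isclose(lhs, rhs)               # Eq. (B.1) holds exactly
```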
Thus, switching from the mean to the weighted mean allows us to control the bias-variance tradeoff through $\lambda$ to improve the distributional robustness of the model (Maurer and Pontil 2009). However, the range of $\lambda$ values that preserves convexity depends on the statistical properties of $\ell$, namely its average and minimum value, as shown by the next lemma.

Lemma 8: Variance Expansion Limits
The variance expansion via Lemma 7 of a convex objective $\ell$ is also convex if $\lambda \in [0, \lambda_{\max}]$, where $\lambda_{\max} = 1/(\mathbb{E}[\ell] - \min(\ell))$.

Proof. We note that $W = p(\ell(x) - \mathbb{E}[\ell])$, where $p$ is a polynomial of degree 1 with $p(x) = \lambda x + 1$. First, we use the result of Proposition 15 (Appendix B.5) to get that $p$ is convex if $\lambda$ is non-negative, which gives the lower bound of the interval. Next, we use Proposition 16 (Appendix B.5), which states that $W(x)\ell(x) = p(\ell(x) - \mathbb{E}[\ell])\ell(x)$ is convex if $[w(x) - w(y)][\ell(x) - \ell(y)] \ge 0$ and $p$ is non-negative. Since $p$ is convex, we replace $p$ in the previous inequality with its first-order condition $p(a) - p(b) \ge \lambda(a - b)$, where $a = \ell(x) - \mathbb{E}[\ell]$ and $b = \ell(y) - \mathbb{E}[\ell]$, to obtain $\lambda[\ell(x) - \ell(y)]^2 \ge 0$. This always holds for $\lambda > 0$; however, $p(\ell(x) - \mathbb{E}[\ell])$ is non-negative only for $\lambda \le 1/(\mathbb{E}[\ell] - \min(\ell))$, which proves the upper bound of the interval.

This lemma shows precisely why directly penalizing the variance does not preserve convexity except when $\lambda$ takes values in a narrow interval. Moreover, the upper limit of this interval is not constant and changes with each iteration of the optimization algorithm. Note that the further the minimum value $\min(\ell)$ is from the average loss $\mathbb{E}[\ell]$, the narrower the $\lambda$ interval is. However, when using the weighted mean trick to penalize the variance and clipping the negative weights to 0, the objective remains convex for any positive $\lambda$. Clipping negative weights to 0 also improves convergence when $\lambda$ takes negative values and the resulting objective is non-convex, as we will see later.

Figure 5.11: Variance penalization for classification problems. Contour lines of the weights distribution for positive $\lambda$ are shown on the left and for negative $\lambda$ in the center plot. The right plot shows the use of variance penalization for outlier suppression or amplification (robustification versus generalization).

Figure 5.11 shows the weights distribution for different values of $\lambda$. When $\lambda$ is positive (top left plot), samples closer to the decision boundary (which also means with larger loss values) receive more weight. On the other hand, when $\lambda$ is negative (top right plot), samples with larger loss values receive less weight. Of note, for $\lambda < 0$ the objective is not convex; however, as we will show later, it still converges to an optimal solution. The bottom two plots show a binary classification problem where each class consists of two clusters: one cluster diagonally aligned containing the majority of points, and another consisting of the three central points aligned along a horizontal line. The parameter $\lambda$ controls the placement of the decision boundary with respect to these two clusters. Positive $\lambda$ values (bottom left plot) place more weight on the three horizontal points, as they are closer to the decision boundary, and favor a model that learns to perfectly separate the two classes. On the other hand, negative $\lambda$ values (bottom right plot) place less weight on the horizontal points close to the boundary and more on the farther ones; therefore, the decision boundary aligns only with respect to the two large clusters.
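In code, the convexity limit of Lemma 8 and the clipping workaround look as follows (an illustrative numpy sketch; the function name variance_weights is ours):

```python
import numpy as np

def variance_weights(losses, lam, clip=True):
    """Weights W = 1 + lambda (l - E[l]); clipping keeps them non-negative."""
    w = 1 + lam * (losses - losses.mean())
    return np.clip(w, 0.0, None) if clip else w

losses = np.random.default_rng(0).exponential(size=1000)
lam_max = 1 / (losses.mean() - losses.min())  # Lemma 8 convexity limit
w = variance_weights(losses, 2 * lam_max)     # beyond lam_max: clip to 0
```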
Variance is not the only central moment that can be penalized using the weighted mean; in fact, any combination of the moments can be penalized. The next theorem generalizes Lemma 7.

Theorem 9: Moments Expansion
Let ℓ be a random variable with finite first n central moments and
\[ W = \sum_{i=1}^{n} \lambda_i (\ell - E[\ell])^{i-1}; \]
then taking the weighted mean of ℓ with weights W is equivalent to:
\[ E[W\ell] = \lambda_1 E[\ell] + \sum_{i=2}^{n} \tilde{\lambda}_i\, E\big[(\ell - E[\ell])^{i}\big], \tag{B.2} \]
where \( \tilde{\lambda}_i = \lambda_i + \lambda_{i+1} E[\ell] \) for i < n and \( \tilde{\lambda}_n = \lambda_n \).

Proof. In this case, we cannot apply Lemma 17 throughout, since for i ≥ 3 the expression (ℓ − E[ℓ])^{i−1} is not guaranteed to have zero mean; as a result, it incurs an additional penalization of the previous moment. The proof follows the same steps as Lemma 7:
\begin{align*}
E[W\ell] &= E\Big[\sum_{i=1}^{n} \lambda_i (\ell - E[\ell])^{i-1}\, \ell\Big]
          = \sum_{i=1}^{n} \lambda_i\, E\big[(\ell - E[\ell])^{i-1} \ell\big] \\
         &= \lambda_1 E[\ell] + \sum_{i=2}^{n} \lambda_i\, E\big[(\ell - E[\ell])^{i-1} (\ell - E[\ell] + E[\ell])\big] \\
         &= \lambda_1 E[\ell] + \sum_{i=2}^{n} \lambda_i\, E\big[(\ell - E[\ell])^{i}\big]
            + \sum_{i=2}^{n} \lambda_i\, E[\ell]\, E\big[(\ell - E[\ell])^{i-1}\big] \\
         &= \lambda_1 E[\ell] + \sum_{i=2}^{n-1} \tilde{\lambda}_i\, E\big[(\ell - E[\ell])^{i}\big]
            + \lambda_n\, E\big[(\ell - E[\ell])^{n}\big]
\end{align*}
For the last step, we combine the two sums by matching the E[(ℓ − E[ℓ])^i] terms and consolidate the penalization factors as λ̃_i = λ_i + λ_{i+1}E[ℓ]. □

We notice that penalizing moments higher than two incurs an additional penalization of the previous moment. For example, penalizing skewness by λ₃ incurs a variance penalization of λ₃E[ℓ].

Theorem 9 also has an algebraic interpretation. The formula for the weights is nothing more than a polynomial in ℓ translated by E[ℓ]. When penalizing the variance, λ₂ controls the slope of a linear equation; when penalizing the skewness, λ₃ controls the curvature of a quadratic equation. Moreover, the penalization factors λ_i also define the placement of the roots of the polynomial and the convexity of the weighted mean objective.

Lemma 10: Convexity of Moments Expansion
Let ℓ : R^n → R₊ and p : R → R₊ be two non-negative convex functions and M ≥ 0; then the product W ℓ(x) with W = p(ℓ(x) − M) is convex if p is non-decreasing.

Proof. We set w(x) = p(ℓ(x) − M), and using the result of Proposition 15 (Appendix B.5), we obtain that w(x) is convex. Next, from Proposition 16 (Appendix B.5), the product w(x)ℓ(x) is convex if [w(x) − w(y)][ℓ(x) − ℓ(y)] ≥ 0. Since p is convex, we use the first-order condition p(a) − p(b) ≥ p′(b)(a − b) and set a = ℓ(x) − M and b = ℓ(y) − M to obtain:
\[ p(\ell(x) - M) - p(\ell(y) - M) \ge p'(\ell(y) - M)\,[\ell(x) - M - \ell(y) + M], \tag{B.3} \]
that is,
\[ w(x) - w(y) \ge p'(\ell(y) - M)\,[\ell(x) - \ell(y)]. \tag{B.4} \]
Substituting this result into the requirement of Proposition 16 (Appendix B.5), we obtain:
\[ p'(\ell(y) - M)\,[\ell(x) - \ell(y)]^2 \ge 0. \tag{B.5} \]
Given that p is non-decreasing, p′(ℓ(y) − M) ≥ 0, which proves that the inequality always holds. □

The above lemma lists two requirements on the polynomial function p such that the weighted mean objective remains convex. First, the function must be non-negative; thus, the negative weights must be clipped to 0. Second, where the function takes positive values, p must be non-decreasing and convex.
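A numerical check of Theorem 9 for n = 3 (a NumPy sketch over empirical moments, where the identity holds exactly; not thesis code and the factors are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
l = rng.exponential(scale=1.0, size=100_000)
mu = l.mean()
lam = [1.0, 0.5, 0.2]  # lambda_1, lambda_2, lambda_3 of Theorem 9

# Weights are a polynomial in (l - E[l]):  W = sum_i lambda_i (l - mu)^(i-1)
c = l - mu
W = lam[0] + lam[1] * c + lam[2] * c ** 2
lhs = np.mean(W * l)

# Right-hand side of (B.2): tilde-lambda_i = lambda_i + lambda_{i+1} * E[l], i < n
m2 = np.mean(c ** 2)
m3 = np.mean(c ** 3)
rhs = lam[0] * mu + (lam[1] + lam[2] * mu) * m2 + lam[2] * m3
print(lhs, rhs)  # agree up to floating-point error
```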
Figure 5.12 shows several examples of such polynomials, written as p(x) = λ₀ + λ₁x + λ₂x² (following the figure, λ_i here indexes the coefficient of x^i). Variance penalization implies that p is an affine function, and thus convex; however, for λ₁ > 0 the polynomial is non-decreasing (left plot), and for λ₁ < 0 it is non-increasing (center plot). Thus, only positive variance penalization results in a convex objective. Of note, Lemma 8 upper-bounds λ₁ if the weights are not clipped to 0; if clipping is used, λ₁ is not upper-bounded. When penalizing skewness, p is a quadratic function (right plot). In this case, if λ₂ > 0, then p is convex and non-decreasing only on a restricted interval instead of the entire real line. Thus, convexity can be preserved by appropriately adjusting the roots of the polynomial such that p is convex and non-decreasing on [min(ℓ) − E[ℓ], +∞).

Figure 5.12: Polynomial functions for moments penalization. Dotted lines show the complete polynomial, whereas solid lines show the functions when the negative part is clipped to 0. The left plot shows two functions which are convex and non-decreasing when their corresponding polynomial is clipped to 0. The center plot shows a convex polynomial that is non-increasing and will thus result in a non-convex objective. The right plot shows a polynomial which is convex and non-decreasing only on the interval [−1, +∞), and thus will yield a convex objective when min(ℓ) − E[ℓ] ≥ −1.

The effect of the polynomial roots on the mean absolute error loss, ℓ_mae, and the mean square error loss, ℓ_mse, when penalizing the variance is shown in Figure 5.13.

B.1.2 Clipping

Clipping negative weights to 0 prevents the optimization objective from switching from minimization to maximization, which is usually an undesirable behavior. Consequently, the samples with a corresponding zero weight reach their maximum contribution to the moments penalization: further increasing the factors λ_i has no effect on those samples, and only samples with non-zero weights participate in the training. As a result, the efficiency of the moments penalization falls slightly. Moreover, the samples with zero weight are effectively excluded from training, a technique used in the past by multiple studies. Rockafellar and Uryasev (2000) used a similar clipping technique to optimize only for samples that are part of the tail of the distribution.

B.1.3 Literature Review

The Moments Expansion theorem extends the variance expansion proposed by Duchi and Namkoong (2019) to include higher-order moments. However, even when only the variance is penalized, the two methods are still slightly different. The Moments Expansion theorem computes the weights directly from the loss values, whereas the variance expansion of Duchi and Namkoong (2019) solves a secondary optimization problem to find the weights. The use of a secondary optimization problem has the advantage of penalizing the variance more consistently, despite being more computationally expensive. On the contrary, when using the moments expansion, the penalized variance can be slightly lower, depending on the number of weights that are 0. The advantage, however, is that the direct computation of the weights makes it easier to include the method in existing analysis frameworks.

A similar method that extends the optimization objective to include penalization factors for higher-order moments was proposed by Li et al. (2021). Their method replaces the ERM objective with a tilted version calculated as \( \frac{1}{t}K(t) = \frac{1}{t}\log E[e^{t\ell}] \), where K(t) is the cumulant-generating function of the loss ℓ. The penalization of the higher-order moments in tilted ERM can be recovered from the power series expansion of K(t). The distinction between the two is that the moments penalization introduced in this appendix represents a generalization of tilted ERM, as it allows any combination of the higher-order moments to be penalized, whereas tilted ERM uses a single parameter that governs the penalization factors.
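For concreteness, a side-by-side sketch of the two objectives on the same batch of losses (illustrative only; the factors are arbitrary and are not tuned to reproduce the tilted expansion exactly):

```python
import numpy as np

rng = np.random.default_rng(3)
l = rng.exponential(scale=1.0, size=10_000)  # stand-in per-sample losses

# Tilted ERM (Li et al. 2021): a single parameter t couples all moment penalties
t = 0.5
tilted_objective = np.log(np.mean(np.exp(t * l))) / t

# Moments penalization: one independent factor per moment (here mean,
# variance, and skewness), computed directly from the loss values
lam = [1.0, 0.3, 0.1]
c = l - l.mean()
W = lam[0] + lam[1] * c + lam[2] * c ** 2
moments_objective = np.mean(W * l)
```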
In summary, the moments penalization implemented using the weighted mean trick is more flexible; however, this flexibility comes at a cost, as there are more parameters to tune when penalizing multiple moments compared to the tilted ERM of Li et al. (2021).

B.1.4 Convergence

The moments penalization problem, along with the variance expansion of Duchi and Namkoong (2019), falls under the class of distributionally robust stochastic programs (DRSP) (Sun and Xu 2016), which is a subclass of ambiguity programs (AP) (Royset and Wets 2017), whose general objective is:
\[ \text{AP:}\quad \min_{\theta \in \Theta}\ \sup_{P \in \mathcal{D}(\theta)} \varphi(\theta, P), \tag{B.6} \]
where Θ is the set of model parameters and D(θ) is the ambiguity set. In DRSP, the bivariate function is φ(θ, P) = E_P[ℓ], where ℓ is the loss function, and the ambiguity set D(θ) is a set of probability distributions. In the case of the moments penalization problem, the ambiguity set D(θ) depends on the model parameters and is a singleton, as the weights uniquely transform the empirical distribution. As a result, the optimal value of the inner maximization problem becomes sup_{P∈D(θ)} φ(θ, P) = E_P[ℓ].

Intuitively, a model will converge if changes in its parameters cause only minor changes in the distribution P, and if with each step the distribution approaches the optimum distribution P*. Formally, to quantify the changes in the distribution, we need a distance or a metric. Sun and Xu (2016) use the total variation metric and a pseudometric to prove uniform convergence, possibly at an exponential rate, if P converges to P* under the total variation metric and ℓ is uniformly bounded (see Sun and Xu 2016, Th. 1 and Prop. 3). Royset and Wets (2017) proposed a hypo-distance metric and proved lop-convergence given that the bivariate function φ(θ, P) satisfies some assumptions (see Royset and Wets 2017, Def. 4.1). Duchi and Namkoong (2019) provide guarantees for a number of stochastic risk minimization problems when only the variance is penalized and P lies in the local neighborhood of the empirical distribution defined using the χ²-divergence. We refer the reader to the works of Sun and Xu (2016) and Royset and Wets (2017) and the references therein for additional guarantees when more information about the problem structure is available, or when other metrics are used.

For the moments penalization problem, the penalization factors λ_i for i ≥ 2 determine how much the distribution P changes when the loss changes. Small values of the penalization factors keep P in the neighborhood of the empirical distribution, whereas large values make the weights sensitive to changes in the loss values, which can cause stability or convergence issues. The exact values depend on the empirical distribution of the data and on the choice of model and loss function. However, when applying the method in practice, we did not encounter any convergence or stability issues, even when performing a hyper-parameter sweep.

B.1.5 Algorithm

To apply the method in practice, the classical batch training algorithm must be extended to include an additional step: the weight calculation. Instead of directly computing the average loss, the user computes the loss value for each sample in the batch and then uses the expression from Theorem 9 to compute the weights and the weighted mean. This procedure is detailed in the algorithm below, where an exponential moving average is used to track the mean (α can be zero for large batch sizes).
The operations for computing the weights are not part of the computation graph, as the weights must be treated as constants. This is achieved using the detach() function in PyTorch (Paszke et al. 2019) and stop_gradient() in TensorFlow (Abadi et al. 2015).

Algorithm 5: Training with Moments Penalization
    input: f(x; θ) – model to be trained
           {x_i, y_i}_{i=1}^n – batch of training data
           {λ_j}_{j=1}^k – penalization factors
           ℓ – loss function
    while stopping criterion not reached do
        for i ← 1 to n do
            ℓ_i ← ℓ(f(x_i; θ), y_i)                      /* sample loss */
        z ← detach(ℓ)                                    /* detach from computation graph */
        w ← [ Σ_{j=1}^k λ_j (z − E_n[z])^{j−1} ]₊        /* weights, clipped at 0 */
        L_w ← (1/n) Σ_{i=1}^n w_i ℓ_i                    /* weighted mean loss */
        θ ← θ − γ ∇_θ L_w                                /* update model parameters */

The moments penalization factors λ_i are the hyper-parameters and are tuned in ascending order with respect to i. However, penalizing higher-order moments might affect the impact of the lower-order ones and can therefore require a few iterations to find the optimal combination. Since the gradient of the weighted mean is the weighted gradient of its elements, the weights control the impact of each sample on the model parameters. Of note, the theoretical results hold when switching from the expectation to the sample expectation, E_n[ℓ] = (1/n) Σ_{i=1}^n ℓ_i.
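A minimal PyTorch sketch of this training step (an illustration of Algorithm 5, not the thesis code; the exponential moving average of the mean is omitted and the example factors are arbitrary):

```python
import torch

def moments_penalized_loss(losses, lams):
    # losses: 1-D tensor of per-sample losses, still attached to the graph
    # lams:   [lambda_1, ..., lambda_k] from Theorem 9
    z = losses.detach()                          # weights are treated as constants
    centered = z - z.mean()
    w = sum(lam * centered ** j for j, lam in enumerate(lams))
    w = torch.clamp(w, min=0.0)                  # clip negative weights to 0
    return (w * losses).mean()                   # weighted mean loss L_w

# usage with a criterion that returns per-sample losses:
#   losses = torch.nn.functional.cross_entropy(model(x), y, reduction="none")
#   loss = moments_penalized_loss(losses, lams=[1.0, 0.5])
#   loss.backward(); optimizer.step()
```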
B.2 Robust classification

We explore the robustification of the cross-entropy function under label noise by bounding the loss values using negative variance penalization. In this case, the resulting objective is not convex; however, for moderately small penalization factors, the objective remains convex on almost the entirety of its domain and convergence is not hindered. Moreover, we also develop a convex version of the variance-penalized cross-entropy, though at a minor price in performance.

The objective of robust classification is to learn an optimal classifier f* that minimizes the average loss for both clean and noisy data. Formally, f* = argmin E[ℓ(f(x), ŷ)], where ŷ represents noisy labels. The noise considered in this appendix corrupts a class label with probability P(ŷ_i ≠ y_i) = η or preserves it with probability P(ŷ_i = y_i) = 1 − η, where η ∈ [0, 1]. We investigate two scenarios: when η does not depend on the class label (class-independent or uniform noise), and when η depends on the class label (class-dependent or asymmetric noise). In what follows, we show that a negative variance penalization factor bounds the loss function. By bounding the loss function, the impact of misclassified noisy samples on the average loss decreases, which in turn reduces the impact of noise on the model during training.

B.2.1 Optimizing for robustness

Ghosh, Kumar, and Sastry (2017) outlined distribution-independent sufficient conditions for a loss function to be robust under both class-dependent and class-independent noise. Specifically, if the loss function satisfies the symmetry constraint
\[ \sum_{i=1}^{k} \ell(f(x), i) = C, \tag{B.7} \]
where C is a constant and the equality holds ∀f, ∀x ∈ X, then ℓ is noise tolerant for class-independent noise when η < (k − 1)/k (see Theorem 1 of Ghosh, Kumar, and Sastry (2017)). Moreover, if the population risk of the optimum classifier satisfies E[ℓ(f*(x), ŷ)] = 0, then ℓ is also noise tolerant for class-dependent noise. Among the commonly used losses, only the mean absolute error (MAE) satisfies the symmetry constraint.

On the other hand, the cross-entropy (CE) loss, which is widely used for classification, not only fails to satisfy the above constraint but is also unbounded, making it extremely susceptible to noise. Specifically, as noise pushes the predicted probability of the correct class toward zero, the cross-entropy loss approaches infinity. Following the work of Ghosh, Kumar, and Sastry (2017), many studies (Zhang and Sabuncu 2018; Feng et al. 2020; Wang et al. 2019b; Ma et al. 2020; Wang et al. 2019a) found that models trained with MAE struggle to converge, and proposed novel losses that rely on boundedness, rather than the symmetry constraint, to achieve robustness while avoiding convergence problems. Zhang and Sabuncu (2018) proposed a generalized cross-entropy loss parameterized by q that recovers the CE loss for q → 0 and MAE for q = 1; for intermediate values of q, the loss is bounded and trades robustness for convergence. Feng et al. (2020) observed that the first two terms of the Taylor series expansion of the CE loss are the MAE and MSE, respectively, and suggested adjusting the number of terms in the Taylor expansion of the CE to balance noise robustness and convergence. Wang et al. (2019b) proposed a two-term loss, complementing the CE loss with a secondary reverse cross-entropy term that satisfies the symmetry constraint. Ma et al. (2020) extended the concept of a two-term loss by combining a robust “active” loss and a robust “passive” loss.

In the following, we use the weighted mean trick to bound a loss function and improve its noise robustness. The next lemma specifies the requirements on the weights:

Lemma 11: Bounded Loss
Let ℓ represent unbounded loss values and W represent non-negative and bounded weights; then the product Wℓ is bounded if there exists a finite threshold ℓ₀ such that loss values greater than ℓ₀ have a corresponding weight of 0.

Proof. For ℓ > ℓ₀, the corresponding weights W are 0, so the product Wℓ is also 0. For ℓ ≤ ℓ₀, since the weights are bounded, the product satisfies 0 ≤ Wℓ ≤ sup(W) ℓ₀ < ∞, and thus Wℓ is bounded. □

Lemma 12
For weights W = λ₀ + λ₁(ℓ − E[ℓ]) with λ₀ > 0, λ₁ < 0, and all higher-order factors zero (indexing the constant and linear coefficients of the weight polynomial as λ₀ and λ₁), loss values ℓ greater than or equal to ℓ₀ = E[ℓ] − λ₀/λ₁ have an associated weight of 0.

Proof. From Theorem 9, the weights are calculated as W = λ₀ + λ₁(ℓ − E[ℓ]), from which we determine the threshold value ℓ₀ = E[ℓ] − λ₀/λ₁ at which W reaches 0. Losses ℓ ≥ ℓ₀ therefore receive a weight of 0 due to the non-negativity constraint. □

Of note, bounding an unbounded loss function in this manner is achieved through clipping; however, the resulting loss function is not convex, as shown in Figure 5.14. The parameters λ₀ and λ₁, together with E[ℓ], establish which samples are considered noise and are excluded from training. Since for λ₁ < 0 the threshold ℓ₀ = E[ℓ] − λ₀/λ₁ is greater than E[ℓ], all excluded samples have a loss value above average. Moreover, the higher the magnitude of λ₁, the closer the threshold ℓ₀ is to E[ℓ]; this can be seen in the left plot of Figure 5.14. The magnitude of λ₁ also dictates by how much the loss values below average are amplified and those above average suppressed. In practice, larger magnitudes of λ₁ decrease the impact of misclassified samples, and of those near the separating hyperplane, on the placement of the decision boundary. This effect is shown in Figure 5.11 by the decision boundary corresponding to negative variance penalization, which is aligned with the diagonal clusters far from the boundary.
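The thresholding behavior of Lemma 12 can be sketched numerically as follows (illustrative NumPy code with arbitrary parameters, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(4)
p_correct = rng.uniform(0.01, 1.0, size=10_000)  # predicted prob. of the true class
ce = -np.log(p_correct)                          # unbounded cross-entropy losses

lam0, lam1 = 1.0, -0.15                          # negative variance penalization
mu = ce.mean()
threshold = mu - lam0 / lam1                     # l_0 = E[l] - lam0/lam1 > E[l]

w = np.clip(lam0 + lam1 * (ce - mu), 0.0, None)  # weights, clipped at 0
assert np.all(w[ce > threshold] == 0)            # high-loss (likely noisy) samples drop out
print(f"max weighted loss = {np.max(w * ce):.3f}  (threshold l_0 = {threshold:.3f})")
```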
Lemma 12 can be extended to include other higher-order moments. In that case, the roots of the weight polynomial determine which samples participate in the optimization problem and which are considered noise. For example, if the decision boundary should be decided by samples with average loss values, a possible solution is to penalize skewness: the weights are then defined by a quadratic equation, and using a negative quadratic coefficient makes the parabola open downward, assigning non-zero weights only to samples around the mean.

Moreover, we also develop a convex version of the variance-penalized cross-entropy by constraining the second derivative of the weighted loss function to be non-negative. We note that the second derivative of the weighted loss w(x)ℓ(x) reaches 0 at x = x_c, with
\[ x_c = \exp\!\Big(1 + \frac{\lambda_1}{2\lambda_2} - \frac{E[\ell]}{2}\Big), \]
and is negative for x < x_c, where x is the predicted probability of the correct class. To constrain the second derivative to be non-negative, we linearly interpolate w(x)ℓ(x) for x < x_c using the derivative at x_c. The weight function for x < x_c is
\[ w(x) = \Big[\, w(x_c)\,\ell(x_c) + \nabla\big(w(u)\ell(u)\big)\big|_{u = x_c}\,(x - x_c) \,\Big] \big/\, \ell(x), \]
and the resulting convex objective is shown in the right plot of Figure 5.14. The black dot marks the transition point: the weighted cross-entropy lies to the right of the dot, and the linear interpolation to its left. Of note, the resulting objective is still bounded, albeit with a higher upper bound.

In general, methods for training neural networks in the presence of noise can be classified into two broad categories: noise-model-free and noise-model-based strategies. Noise-model-free methods focus on reducing the impact of outliers; the two main subcategories are robust losses (Bartlett, Jordan, and McAuliffe 2006; Ghosh, Kumar, and Sastry 2017; Wang et al. 2019a; Zhang and Sabuncu 2018; Wang et al. 2019b; Natarajan et al. 2013; Mnih and Hinton 2012; Xu et al. 2019; Patrini et al. 2016; Rooyen, Menon, and Williamson 2015; Feng et al. 2020; Ma et al. 2020) and learning management, such as meta-learning or regularization (Li et al. 2017; Szegedy et al. 2016; Hendrycks, Lee, and Mazeika 2019; Liu et al. 2020; Harutyunyan et al. 2020). On the other hand, noise-model-based methods estimate the properties of the noise and use this information when training the model. They can be further divided into methods using a noise transition matrix (Patrini et al. 2017; Hendrycks et al. 2018; Chen and Gupta 2015; Bekker and Goldberger 2016; Goldberger and Ben-Reuven 2017; Sukhbaatar et al. 2015; Xia et al. 2019; Yao et al. 2019; Wang, Hu, and Hu 2020; Xia et al. 2020; Yao et al. 2020; Lukasik et al. 2020), noise mitigation requiring a clean dataset (Jiang et al. 2018; Ren et al. 2018; Veit et al. 2017; Zhang et al. 2020; Yuan et al. 2018; Vahdat 2017; Mirzasoleiman, Cao, and Leskovec 2020), or incremental estimation of the distribution of clean data (Liu et al. 2017; Zheng et al. 2020; Arazo et al. 2019; Zhang et al. 2017b; Ghosh and Lan 2021; Zhang et al. 2021; Wu et al. 2020).

B.2.2 Experimental results

The procedure adopted for all experiments is similar to that of other studies investigating robust losses (Zhang and Sabuncu 2018; Feng et al. 2020; Wang et al. 2019b; Ma et al. 2020). The source code to reproduce our results is available online. After corrupting the training datasets (CIFAR-10, CIFAR-100, Fashion-MNIST) with noise, we retain 10% as a secondary validation dataset. This allows us to detect when the model is overfitting on noise by comparing the performance on clean versus noisy validation data.
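As a concrete illustration of the class-independent corruption described next (a sketch under our own naming conventions, not the released experiment code):

```python
import numpy as np

def corrupt_labels(y, eta, num_classes, seed=0):
    """Class-independent (uniform) label noise: with probability eta,
    replace each label with a different class chosen uniformly at random."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    flip = rng.random(len(y)) < eta
    # a random offset in {1, ..., num_classes - 1} guarantees a different class
    offsets = rng.integers(1, num_classes, size=flip.sum())
    y[flip] = (y[flip] + offsets) % num_classes
    return y
```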
For the class-independent noise, we flip a label to any other class with equal probability, such that the ratio of noisy labels is η. For class-dependent noise on CIFAR-10 and Fashion-MNIST, we only flip labels in the source classes {9, 2, 3, 5, 4} to the corresponding target classes {1, 0, 5, 3, 7} according to the noise ratio η; for CIFAR-100, we flip between two randomly selected subclasses within each superclass.

We use the same models as Wang et al. (2019b): when training on CIFAR-10, an 8-layer network composed of 6 convolutional layers followed by 2 fully connected layers; for Fashion-MNIST, a model with 4 convolutional layers followed by 3 fully connected layers; and for CIFAR-100, ResNet-34 (He et al. 2016). We train using SGD with momentum 0.9, weight decay of 0.005 for the convolutional layers and 0.01 for the fully connected layers, and a starting learning rate of 0.01, which we divide by 10 every 20 epochs for a total of 60 training epochs on the CIFAR datasets, and every 5 epochs for a total of 15 training epochs on Fashion-MNIST. The training batch size is 128 samples.

We compare our solution for the robustification of the loss function through moments penalization against five other state-of-the-art methods: i) tilted empirical risk minimization (TERM), introduced by Li et al. (2021); ii) Taylor expansion of cross-entropy (Taylor), proposed by Feng et al. (2020); iii) normalized cross-entropy coupled with reverse cross-entropy (Normalized), investigated by Ma et al. (2020); iv) symmetric cross-entropy (Symmetric), studied by Wang et al. (2019b); and v) generalized cross-entropy (Generalized), explored by Zhang and Sabuncu (2018). For these methods, where possible, we used the parameters suggested by their authors; when a method underperformed, we used cross-validation to find better ones. When training on the CIFAR-10 and Fashion-MNIST datasets we used: λ₁ = 1, λ₂ = −0.5 for Moments and its convex version; t = −0.5 for TERM; t = 2 for Taylor; α = 10, β = 1 for Normalized; α = 0.1, β = 1 for Symmetric; and q = 0.7 for Generalized. When training on CIFAR-100, the parameters were: λ₁ = 1, λ₂ = −0.5 for Moments and its convex version; t = −0.5 for TERM; t = 6 for Taylor; α = 10, β = 0.1 for Normalized; α = 6.0, β = 0.1 for Symmetric; and q = 0.7 for Generalized.

The model accuracies on clean data when training with class-independent and class-dependent noise are summarized in Table 5.1. Of note, we report the accuracy of the model at the end of training, to capture the overfitting on noise exhibited by some models. We observe that training with moments penalization outperforms the other methods in low to moderate noise scenarios and falls behind in high-noise scenarios, while still remaining above the classical cross-entropy. The performance of the convex and non-convex versions is very similar; however, the convex version shows better results in high-noise scenarios under class-dependent noise. In the case of the CIFAR datasets, for class-independent noise with η = 0.2, the proposed method improves the accuracy of the classical cross-entropy by 6%, the highest improvement among the investigated methods. However, this improvement decreases as the noise ratio increases and approaches the accuracy of the classical cross-entropy loss for η = 0.8. Of note, for η = 0.8 the convex version of the moments method underperformed the classical cross-entropy by 0.8%; in all other scenarios, it had higher accuracy.
High-noise scenarios proved challenging for all losses, with the Taylor and Generalized cross-entropy losses failing to converge on CIFAR-10 for η = 0.8; the best accuracy in this regime was registered by the symmetric cross-entropy loss proposed by Wang et al. (2019b). For class-dependent noise, moments penalization registered the highest accuracy on CIFAR-10 for η = 0.1 and η = 0.2, but the lowest accuracy for η = 0.4. To further investigate this behavior, we reran the experiments and monitored the accuracy of each individual class. The classical cross-entropy recorded an accuracy similar to the robust methods on classes affected by noise and outperformed them on classes 6 and 8, neither of which is targeted by the class-dependent noise. The poor performance of the robust methods can be attributed to using parameters tuned for low and moderate noise. In the case of the Fashion-MNIST dataset, a behavior similar to the CIFAR datasets was observed: for low to moderate noise, the moments penalization methods had the highest or second-highest accuracy. Moreover, when the noise was class dependent, the convex version outperformed the non-convex version by at least 6%.

B.3 Fair classification

A fair classifier requires the model to incur similar losses for each sensitive group while achieving good prediction accuracy (Dwork et al. 2018; Hashimoto et al. 2018; Alabi, Immorlica, and Kalai 2018; Heidari et al. 2019; Williamson and Menon 2019; Zafar et al. 2017; Agarwal et al. 2018; Donini et al. 2018; Kearns et al. 2018). Mathematically, given a measure of dissimilarity between group losses ∆(Y, f, S), the objective is to learn a fair classifier f* that minimizes L + λ∆(Y, f, S), where λ controls the trade-off between prediction accuracy and group fairness.

B.3.1 Optimizing for fairness

Generally, fairness-aware methods can be divided by their point of application into three groups: methods applied during training (Zafar et al. 2017; Agarwal et al. 2018; Donini et al. 2018; Kearns et al. 2018; Williamson and Menon 2019; Jiang et al. 2020), pre-training (Calmon et al. 2017; Adler et al. 2018; Gordaliza et al. 2019; McNamara, Ong, and Williamson 2019), and post-training (Hardt et al. 2016; Jiang et al. 2020). Furthermore, these methods can also be classified in terms of fairness enforced at the group level (Zafar et al. 2017; Agarwal et al. 2018; Donini et al. 2018; Kearns et al. 2018; Williamson and Menon 2019; Jiang et al. 2020; Calmon et al. 2017; Adler et al. 2018; Gordaliza et al. 2019; McNamara, Ong, and Williamson 2019; Hardt et al. 2016) or at the individual level (Dwork et al. 2012; Kusner et al. 2017).

In what follows, we introduce a novel fairness-aware convex objective, applied directly during training, that optimizes for fairness amongst sensitive groups. We start by expanding previous definitions of fairness to include higher-order statistics in the dissimilarity measure. The two most common formalisms link perfect fairness to probabilistic independence. The demographic parity formalism (Dwork et al. 2012) requires that f(X) ⊥⊥ S, i.e., knowing the model prediction gives no information about the sensitive groups. On the other hand, the equalized odds formalism (Hardt et al. 2016) requires f(X) ⊥⊥ S | Y, i.e., knowing the true label and the model prediction gives no information about the sensitive groups. The difference between the two is that equalized odds allows the prediction f(X) to depend on the sensitive groups S, but the model should not discriminate more than the ability of S to predict Y.
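As a practical diagnostic for such group-level notions, one can compare group-conditional loss statistics, anticipating the moment-based definition below (a minimal sketch; the array names are our own):

```python
import numpy as np

def group_loss_moments(losses, groups, n_moments=2):
    # Per-group loss statistics: first the raw mean, then central moments.
    # `losses` and `groups` are aligned per-sample arrays.
    stats = {}
    for g in np.unique(groups):
        l = losses[groups == g]
        m = [l.mean()]
        m += [np.mean((l - l.mean()) ** k) for k in range(2, n_moments + 1)]
        stats[g] = m
    return stats

# A model is closer to group fairness when each group's statistics match
# those computed over the entire dataset.
```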
In practice, these notions of perfect fairness cannot be enforced when training a model on a finite number of samples. With Definition 13, we try to strike a balance between theoretical perfect fairness and a practical notion of perfect fairness that can be incorporated into a training algorithm.

Definition 13: Group Fairness
A predictor f is perfectly fair with respect to the loss function ℓ if all sensitive groups attain the same first N moments as the general population (the entire dataset): for n = 1, …, N,
\[
\begin{cases}
L_s = L & \text{if } n = 1, \\
E_s\big[(\ell - L_s)^n\big] = E\big[(\ell - L)^n\big] & \text{otherwise.}
\end{cases} \tag{B.8}
\]

Of note, for n = 1 we use the first raw moment, and for n ≥ 2 we use the central moments. Satisfying Definition 13 implies that each group has the same distribution of loss values. Matching the loss distribution of each sensitive group with that of the entire dataset results in a computational complexity that is linear in the number of groups, much lower than comparing each pair of groups separately. Moreover, if we relax the definition and, instead of equality, bound the absolute difference between moments to be less than ϵ, i.e., |E_s[(ℓ − L_s)^k] − E[(ℓ − L)^k]| < ϵ for all k ≤ n, we obtain a measure of ϵ-approximate fairness.

Next, we show that this definition recovers previous measures as special cases. If we set n = 1, we recover the measure of perfect fairness proposed by Williamson and Menon (2019) (their Definition 1), whereby all sensitive groups attain the same average loss, i.e., L_s = L_{s′} ∀s, s′ ∈ S. If, in addition, we use the 0–1 loss, ℓ(f(x), y) = ⟦f(x) ≠ y⟧, we recover Zafar et al. (2017)’s parity-based fairness measure, which ensures that the group benefits are all equal: P(f(X) ≠ Y | S = s) = P(f(X) ≠ Y | S = s′) ∀s, s′ ∈ S.

Using the result of Theorem 9 (Moments Expansion) and computing the weights separately for each sensitive group, we obtain a fairness-aware objective, as the next theorem shows:

Theorem 14: Fairness through Moments Expansion
Let ℓ be a loss function and r a reference set that contains all the sensitive groups. Minimizing the weighted mean with weights
\[ W = \lambda_0 + \lambda_1\Big(\frac{L_s}{L_r} - 1\Big) + \sum_{i=2}^{m} \lambda_i \Big[(\ell - L_s)^{i-1} - (\ell - L_r)^{i-1}\Big] \tag{B.9} \]
is a fairness-aware objective, where the penalization factors λ_i satisfy the following constraints:
\[
\begin{cases}
\lambda_i = 0 & \text{if } i \ge 1 \text{ and the sample is not in } r, \\
\lambda_i \ge 0 & \text{if } i = 1 \text{ or } i \text{ is even}, \\
\lambda_i \ge 0 & \text{if } E_s[(\ell - L_s)^i] \ge E_r[(\ell - L_r)^i] \text{ and } i \text{ is odd}, \\
\lambda_i < 0 & \text{if } E_s[(\ell - L_s)^i] < E_r[(\ell - L_r)^i] \text{ and } i \text{ is odd}.
\end{cases} \tag{B.10}
\]

Proof. The constraints on the penalization factors ensure that, along with minimizing the mean loss, the moments of the sensitive groups are aligned with those of the reference set. The factor λ₀ controls the weight of the general minimization of the mean loss, whereas the factors λ_i with i ≥ 1 control the weight of aligning the moments. In particular, if a sample is not in the reference set, which includes all the sensitive groups, we set the penalization factors λ_i to 0 for i ≥ 1, as these samples should be excluded when aligning the moments. For even i, since the moments in this case take non-negative values, we restrict the factors to be positive and thus put more weight on groups with higher variance, kurtosis, and so forth, to match those of the reference set. For odd i, the moments can take both positive and negative values, so we change the sign of λ_i such that the effect is equivalent to minimizing the absolute error between the group moments and those of the reference set.
The exception is i = 1: in this case we minimize the raw moment and require λ₁ to be positive, thereby prioritizing a smaller group mean loss over exactly matching the mean loss of the reference. Following the same steps as in Lemma 7 and replacing W in the expectation, we obtain:
\begin{align*}
E[W\ell] &= E\Big[\Big(\lambda_0 + \lambda_1 \frac{L_s - L_r}{L_r}
           + \sum_{i=2}^{m} \lambda_i \big[(\ell - L_s)^{i-1} - (\ell - L_r)^{i-1}\big]\Big)\,\ell\Big] \\
         &= \lambda_0 E[\ell] + \lambda_1 (L_s - L_r)
           + \sum_{i=2}^{m} \lambda_i \Big(E_s\big[(\ell - L_s)^{i-1}\ell\big] - E_r\big[(\ell - L_r)^{i-1}\ell\big]\Big) \\
         &= \lambda_0 E[\ell] + \lambda_1 (L_s - L_r)
           + \sum_{i=2}^{m} \lambda_i \Big(E_s\big[(\ell - L_s)^{i}\big] - E_r\big[(\ell - L_r)^{i}\big]\Big) \\
         &\qquad + \sum_{i=2}^{m} \lambda_i L_r \Big(E_s\big[(\ell - L_s)^{i-1}\big] - E_r\big[(\ell - L_r)^{i-1}\big]\Big) \\
         &= \lambda_0 E[\ell] + \lambda_1 (L_s - L_r)
           + \sum_{i=2}^{m} \tilde{\lambda}_i \Big(E_s\big[(\ell - L_s)^{i}\big] - E_r\big[(\ell - L_r)^{i}\big]\Big)
\end{align*}
For the last step, we combine the two sums by matching the moment terms and consolidate the penalization factors as λ̃_i = λ_i + λ_{i+1}L_r. This is analogous to Theorem 9, where penalizing moments higher than two incurs an additional penalization of the previous moment. □

Of note, the convexity of the objective is preserved if the negative weights are clipped to 0.

The penalization factors control the priorities of this multi-objective optimization problem: λ₀ emphasizes the minimization of the loss in general, and the factors λ_i for i ≥ 1 emphasize fairness by minimizing the difference in the i-th moment between the sensitive groups and the reference set. In particular, λ₁ controls the weight assigned to each entire group, given the difference between the group's average loss and that of the reference set. Sensitive groups with an average loss higher than the reference set receive more weight and are prioritized during optimization. On the other hand, λ₂ assigns individual weights to the samples so that the variance of each group moves toward the variance of the reference set.

An alternative method for ensuring fairness for groups is based on the conditional value at risk (CVaR), originally introduced by Rockafellar and Uryasev (2000) to minimize the tails of a distribution. The fairness-aware version of CVaR was proposed by Williamson and Menon (2019) and optimizes the following objective:
\[ \min_{f \in \mathcal{F},\, \rho \in \mathbb{R}}\ \rho + \frac{1}{1 - \alpha}\, E\big[L_s - \rho\big]_+ \tag{B.11} \]
At each iteration, this objective penalizes only the sensitive groups whose average loss exceeds the threshold ρ. Comparing the objective of Theorem 14 to objective (B.11), we first notice that both clip at 0 in order to preserve convexity. Second, in (B.11), sensitive groups receive a weight based on the distance between their group mean and the threshold ρ, similar to the objective of Theorem 14 when λ₁ > 0 and λ_i = 0 for i ≥ 2. However, objective (B.11) considers only the mean of each sensitive group and no higher-order statistics, as the objective of Theorem 14 does.
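A simplified sketch of such group-aware weighting, in the spirit of Theorem 14 rather than a literal transcription of (B.9) (it applies the variance-expansion weights of Lemma 7 within each group, assumes the reference set is the whole batch, and uses example factors):

```python
import numpy as np

def group_fairness_weights(losses, groups, lam0=1.0, lam1=16.0, lam2=3.0):
    # Up-weight groups whose mean loss exceeds the reference mean, and spread
    # weight within each group toward its higher-loss samples (variance term).
    L_r = losses.mean()                              # reference set = full batch
    w = np.full(losses.shape, lam0, dtype=float)
    for g in np.unique(groups):
        idx = groups == g
        L_s = losses[idx].mean()
        w[idx] += lam1 * (L_s / L_r - 1.0)           # group mean-alignment term
        w[idx] += lam2 * (losses[idx] - L_s)         # within-group variance term
    return np.clip(w, 0.0, None)                     # clipping preserves convexity
```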
B.3.2 Experimental results

To compare the two methods, we follow the same experimental setup as Williamson and Menon (2019). The training datasets are the synthetic dataset from Donini et al. (2018) and the adult dataset (Kohavi 1996; Dua and Graff 2017), which we use twice, to optimize fairness amongst genders and amongst races. For testing we use the test set provided with adult; for the synthetic dataset, we generate a similar dataset using a different random seed. The trained model is a fully connected network with one hidden layer of 8 neurons for synthetic and 128 neurons for adult, followed by a sigmoid activation function. We use the cross-entropy loss and the ADAM optimizer to train the model for 20 epochs, using a 0.02 learning rate and a batch size of 256 samples.

For each configuration of the parameters α and λ_i, we run 50 experiments with different initializations to compute the mean and standard deviation of each metric. The results of these experiments are shown in Figure 5.15. The parameters of the moments method were set to λ₀ = 1, λ₁ = 16, and we varied λ₂ between 0 and 20; for the CVaR-based method, α was varied between 0 and 0.9. From the figure, we observe that the moments method yields much closer group statistics, mean and variance, compared to the CVaR method. This is also reflected in the lower Kullback-Leibler divergence between the group loss distributions and the reference set. In terms of accuracy, we observe similar levels for different values of the parameters α and λ, except when optimizing fairness amongst races (third row), where the moments method consistently converges to a higher accuracy. As expected, the moments method has tighter intervals for the variance across all experiments, since the CVaR method does not explicitly optimize for it. When only the λ₀ and λ₁ parameters are used for the moments method, both methods obtain comparable results, as in this case both optimize only to match the means of the sensitive groups, as illustrated in Figure 5.16. Additional figures supporting this statement are provided in Appendix B.7.

Even when the method shows good results, this does not imply that the trained model will be fair; it merely prevents the model from being biased if the dataset allows it. For example, in the case of a biased dataset with three sensitive groups, where two overlap and one is perfectly separable, the distribution of the loss values of the overlapping groups will differ from that of the separable group (the overlapping groups will have more misclassified samples). In this case, making all three distributions identical is impossible; however, we can still optimize for fairness amongst the overlapping groups.

B.4 Appendix Conclusion

This appendix investigated the optimization of a weighted mean and the flexibility it provides in enforcing properties such as robustness and fair group classification. We extended previous variance penalization methods to include higher-order moments while eliminating their limitations. One of the most significant findings to emerge from this study is that we can control the final distribution of the loss values by penalizing higher-order moments. Moreover, since the weights are required to be non-negative, we use this constraint to our advantage and connect it to the robustification of a loss function. In addition, we extended the moments penalization to form a new objective that ensures fairness amongst sensitive groups. This improves upon the current CVaR objective, which only optimizes the mean, whereas the proposed solution extends it to include the variance and other central moments.

B.5 Additional Theoretical Results

Next, we present two propositions used in the proofs of the main theorems of this appendix.

Proposition 15
Let ℓ : R^n → R₊ and p : R → R be two convex functions; then the composition w(x) = p(ℓ(x) − M), where M ≥ 0, is convex if p is non-decreasing.
Proof. Using the convexity of ℓ, for any x, y ∈ R^n and α ∈ [0, 1] we have:
\[ \ell(\alpha x + (1-\alpha)y) \le \alpha\,\ell(x) + (1-\alpha)\,\ell(y). \tag{B.12} \]
Using the fact that p is non-decreasing along with the above inequality, we obtain:
\[ p\big(\ell(\alpha x + (1-\alpha)y)\big) \le p\big(\alpha\,\ell(x) + (1-\alpha)\,\ell(y)\big), \tag{B.13} \]
and using the convexity of p on the right-hand term:
\[ p\big(\alpha\,\ell(x) + (1-\alpha)\,\ell(y)\big) \le \alpha\,p(\ell(x)) + (1-\alpha)\,p(\ell(y)). \tag{B.14} \]
Combining the two inequalities yields p(ℓ(αx + (1 − α)y)) ≤ α p(ℓ(x)) + (1 − α) p(ℓ(y)), which proves that the composition p(ℓ(x)) is convex. Moreover, since p is convex and ℓ(x) − M is in the domain of p, the function p(ℓ(x) − M) is also convex, as it is a composition with an affine mapping. This proves that w(x) is convex. □

Proposition 16
Let f, g : R^n → R₊ be two convex functions taking non-negative values; then their product h(x) = f(x)g(x) is convex if 0 ≤ [f(x) − f(y)][g(x) − g(y)], ∀x, y ∈ R^n.

Proof. The function h is convex if it satisfies the following inequality, where α ∈ [0, 1]:
\[ 0 \le \alpha h(x) + (1-\alpha)h(y) - h(\alpha x + (1-\alpha)y), \tag{B.15} \]
i.e.,
\[ 0 \le \alpha (fg)(x) + (1-\alpha)(fg)(y) - (fg)(\alpha x + (1-\alpha)y). \tag{B.16} \]
Since the two functions are convex, the following inequalities hold:
\[ f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha)f(y), \tag{B.17} \]
\[ g(\alpha x + (1-\alpha)y) \le \alpha g(x) + (1-\alpha)g(y). \tag{B.18} \]
Given that the two functions take non-negative values, we can multiply the above two inequalities:
\[ f(\alpha x + (1-\alpha)y)\,g(\alpha x + (1-\alpha)y) \le [\alpha f(x) + (1-\alpha)f(y)]\,[\alpha g(x) + (1-\alpha)g(y)], \tag{B.19} \]
that is,
\[ (fg)(\alpha x + (1-\alpha)y) \le [\alpha f(x) + (1-\alpha)f(y)]\,[\alpha g(x) + (1-\alpha)g(y)]. \tag{B.20} \]
Substituting (fg)(αx + (1 − α)y) from (B.20) into inequality (B.16), we obtain:
\[ 0 \le \alpha (fg)(x) + (1-\alpha)(fg)(y) - [\alpha f(x) + (1-\alpha)f(y)]\,[\alpha g(x) + (1-\alpha)g(y)]. \tag{B.21} \]
Multiplying out the two square brackets and then grouping the terms, we get:
\[ 0 \le \alpha(1-\alpha)\,[f(x) - f(y)]\,[g(x) - g(y)]. \tag{B.22} \]
We complete the proof by noting that α(1 − α) ≥ 0, and thus the product of two convex functions is convex if 0 ≤ [f(x) − f(y)][g(x) − g(y)]. □

Figure 5.13: Algebraic interpretation of moments penalization. The left column corresponds to the mean absolute error loss, ℓ_mae, and the right one to the mean square error loss, ℓ_mse. From top to bottom: the original loss function, the weights for four different variance penalization scenarios, and the resulting weighted loss function. In all plots, the mean loss E[ℓ] was chosen to be 1.

Figure 5.14: Bounding the cross-entropy loss function using negative variance penalization. The impact of the λ₁ parameter on the loss shape when the average loss is constant is shown in the left plot, and the impact of the average loss on the shape in the center plot. The right plot shows the convex version of the variance-penalized cross-entropy loss.

Figure 5.15: Comparison of the moments method (Theorem 14), when matching the group mean, and the conditional value at risk (CVaR) in the context of fairness optimization. The top row shows results for the synthetic dataset; the middle and bottom rows show the adult dataset when optimized for gender (a: female, b: male) and race (a: white, b: black).

Table 5.1: Mean accuracy (%) and standard deviation on clean data over 5 runs. The best result for each scenario is underlined.
                  Class independent η                            Class dependent η
Loss              0.2        0.4        0.6        0.8           0.1        0.2        0.3        0.4

CIFAR-10
Classical         84.1±0.2   77.1±0.4   66.3±0.4   36.0±1.5      88.9±0.2   87.2±0.2   84.8±0.4   80.9±0.7
Moments           90.5±0.1   85.0±0.6   69.1±0.4   37.3±2.3      91.4±0.2   89.9±0.0   85.0±3.2   72.9±3.3
Moments-convex    86.8±0.6   80.3±0.3   67.6±0.6   35.2±1.5      89.1±0.1   87.2±0.2   83.5±0.3   78.7±0.4
TERM              89.5±0.1   83.2±0.2   69.2±0.5   37.2±2.0      91.1±0.2   88.1±0.4   83.0±0.7   77.2±1.1
Taylor            90.2±0.2   86.2±0.1   73.0±0.5   10.0±0.0      91.0±0.2   88.4±0.3   83.7±0.4   78.0±1.1
Normalized        90.1±0.1   87.0±0.2   80.8±0.2   32.6±2.9      90.9±0.1   89.7±0.3   86.8±0.5   79.8±0.5
Symmetric         90.0±0.2   86.7±0.2   80.2±1.0   44.8±4.0      90.8±0.2   89.6±0.1   86.5±0.4   79.8±0.9
Generalized       90.4±0.2   86.0±0.2   67.3±2.7   10.0±0.0      91.3±0.1   89.0±0.1   83.7±0.5   74.0±3.2

CIFAR-100
Classical         40.0±0.5   31.1±1.1   20.7±0.5   11.6±0.4      46.8±0.2   42.6±0.4   37.2±0.5   31.7±0.3
Moments           46.1±0.4   39.8±0.7   27.9±0.8   13.8±1.0      47.8±0.9   43.7±0.9   37.2±0.4   30.6±0.3
Moments-convex    44.5±0.8   36.7±0.9   24.8±0.8   13.4±0.8      47.4±0.5   42.9±0.7   36.9±0.5   31.1±0.6
TERM              45.5±0.3   41.5±1.1   32.8±1.4   17.7±0.4      47.1±0.6   44.6±0.7   37.9±0.6   30.9±0.6
Taylor            37.3±0.6   33.0±1.3   27.2±0.8   17.4±0.2      38.7±0.6   35.9±0.7   31.0±1.0   26.6±0.5
Normalized        32.9±0.8   27.8±0.7   22.7±0.4   13.9±0.2      35.0±0.3   32.7±0.3   30.2±0.4   26.5±0.6
Symmetric         43.1±0.4   37.9±0.4   31.2±0.8   19.4±0.5      45.3±0.5   43.1±0.2   41.0±0.5   34.6±0.3
Generalized       38.2±0.4   33.7±0.9   28.1±0.6   18.5±0.4      39.6±0.9   37.9±0.7   35.8±0.6   29.9±0.3

Fashion-MNIST
Classical         92.0±0.1   90.9±0.1   89.0±0.2   78.4±1.0      92.9±0.1   92.5±0.3   92.0±0.7   90.6±1.7
Moments           92.1±0.2   91.6±0.2   89.8±0.1   80.2±1.1      87.0±2.9   82.4±6.7   79.3±6.8   62.1±3.0
Moments-convex    92.4±0.1   91.6±0.1   89.6±0.2   79.7±1.0      92.6±0.2   92.0±0.4   92.4±0.2   88.7±1.5
TERM              92.1±0.1   91.6±0.1   90.0±0.1   80.3±0.9      92.3±0.1   92.0±0.4   91.5±0.9   91.3±0.4
Taylor            90.6±0.0   89.4±0.3   86.8±0.2   75.7±0.7      91.1±0.1   90.7±0.4   89.8±1.4   85.1±3.8
Normalized        91.5±0.1   90.8±0.3   88.9±0.3   79.8±1.3      91.7±0.1   91.7±0.3   90.7±0.7   91.1±0.1
Symmetric         92.3±0.1   91.8±0.1   90.6±0.3   82.8±1.2      92.5±0.1   92.3±0.3   91.9±0.6   91.5±1.3
Generalized       91.2±0.2   90.2±0.2   88.1±0.2   73.9±1.1      91.6±0.1   91.3±0.3   90.7±0.7   90.4±0.3

Figure 5.16: Comparison of the moments method (Theorem 14), when matching the group mean and variance, and the conditional value at risk (CVaR) in the context of fairness optimization. The top row shows results for the synthetic dataset; the middle and bottom rows show the adult dataset when optimized for gender (a: female, b: male) and race (a: white, b: black).

B.6 Variance Expansion

We present an auxiliary lemma, introduced by Duchi and Namkoong (2019) for variance expansion, that is used in the proof of Lemma 7.

Lemma 17: Mean Canceling
Let U and V be two random variables, and let Ū and V̄ denote their expectations; then:
\[ E\big[(U - \bar{U})\,V\big] = E\big[(U - \bar{U})(V - \bar{V})\big] \tag{B.23} \]

Proof. We write V as V − V̄ + V̄ and then use the linearity of the expectation operator:
\begin{align*}
E\big[(U - \bar{U})V\big] &= E\big[(U - \bar{U})(V - \bar{V} + \bar{V})\big] \\
 &= E\big[(U - \bar{U})(V - \bar{V})\big] + \bar{V}\big(E[U] - \bar{U}\big) \\
 &= E\big[(U - \bar{U})(V - \bar{V})\big]
\end{align*}

Now we prove the variance expansion theorem using this lemma.

Theorem 18: Variance Expansion
Let ℓ be a random variable with finite first and second moments and W = 1 + λ(ℓ − E[ℓ]); then taking the weighted mean of ℓ with weights W is equivalent to:
\[ E[W\ell] = E[\ell] + \lambda V[\ell] \tag{B.24} \]
Proof. We replace W with its definition and then use Lemma 17 in the second step to obtain the result:
\begin{align*}
E[W\ell] &= E\big[\big(1 + \lambda(\ell - E[\ell])\big)\,\ell\big] \\
         &= E[\ell] + \lambda\,E\big[(\ell - E[\ell])\,\ell\big] \\
         &= E[\ell] + \lambda\,E\big[(\ell - E[\ell])^2\big] \\
         &= E[\ell] + \lambda V[\ell]
\end{align*}

B.7 Supplementary Figures

Figure 5.17: Fair classification – CVaR method (various α) on the Donini dataset.

Figure 5.18: Fair classification – moments method (various λ₂) on the Donini dataset.

Figure 5.19: Fair classification – moments method (λ₂ = 3 and various λ₃) on the Donini dataset.

Figure 5.20: Fair classification – moments method (λ₂ = 7 and various λ₃) on the Donini dataset.

Figure 5.21: Fair classification – moments method (λ₂ = 10 and various λ₃) on the Donini dataset.

Figure 5.22: Fair classification – moments method (λ₂ = 16 and various λ₃) on the Donini dataset.

B.8 Supplementary Tables

In the context of robust optimization, Table 5.2 shows the model accuracy for each class when trained on CIFAR-10 with class-dependent noise. In this case, we only flip labels in the source classes {9, 2, 3, 5, 4} to the corresponding target classes {1, 0, 5, 3, 7} according to the noise ratio η. We see that when switching to the class mean (Moments-cm), the performance on noisy classes improves. However, using an exponential moving average for the mean (Moments-cme) does not improve the results compared to the normal mean.
Loss / η    Class 0    Class 1    Class 2    Class 3    Class 4    Class 5    Class 6    Class 7    Class 8    Class 9

Moments
  0.1       89.9±0.5   94.3±0.3   78.5±0.7   72.6±0.9   85.4±0.9   78.9±1.5   90.7±0.5   89.2±0.3   92.4±0.3   90.1±0.4
  0.2       90.9±0.5   94.8±0.5   75.5±0.5   71.8±0.9   83.1±0.4   75.7±0.8   90.8±0.8   89.8±0.7   92.4±0.3   87.8±0.8
  0.3       91.3±0.7   95.8±0.7   69.7±1.1   72.3±1.2   77.7±0.4   67.4±0.6   91.4±0.2   90.4±0.4   92.3±0.4   82.2±1.5
  0.4       92.4±0.4   97.4±0.6   41.8±23.4  75.0±1.0   48.4±27.0  49.0±4.0   91.7±0.8   90.8±0.5   92.3±0.5   25.3±34.7

Moments-cm
  0.1       90.7±0.6   94.6±0.3   78.8±0.7   74.1±1.1   84.4±0.6   79.8±0.7   89.6±0.5   89.3±0.6   91.3±0.5   88.6±0.6
  0.2       91.5±0.3   95.1±0.5   74.7±0.5   73.1±0.3   81.3±0.7   76.1±0.6   89.5±0.6   89.9±0.5   90.9±0.4   85.5±0.5
  0.3       92.2±0.4   96.1±0.6   69.6±0.4   71.8±0.6   74.9±1.1   69.4±1.4   89.3±0.8   90.5±0.6   90.8±0.4   78.3±1.2
  0.4       92.4±0.3   96.3±0.5   60.8±0.6   72.0±1.1   65.8±1.5   58.6±2.3   89.7±0.7   90.1±0.8   90.8±0.5   68.9±1.8

Moments-cme
  0.1       90.7±0.7   94.4±0.4   78.0±0.7   74.3±1.6   84.2±0.7   80.3±1.4   89.3±0.4   89.2±0.5   91.0±0.4   88.9±0.6
  0.2       91.6±0.6   95.4±0.2   74.4±0.5   73.3±0.8   80.5±0.9   76.3±0.6   89.0±0.7   89.9±0.5   90.1±0.5   85.0±0.4
  0.3       92.2±0.3   96.1±0.3   69.5±0.8   72.3±1.5   74.6±1.2   69.8±1.5   89.2±0.7   90.2±0.6   90.5±0.2   77.9±0.5
  0.4       92.3±0.3   96.3±0.5   60.6±0.8   70.4±1.1   64.7±0.6   59.2±1.3   89.1±0.4   90.4±0.5   90.4±0.2   68.2±1.1

Classical
  0.1       90.5±0.8   95.0±0.6   80.2±0.5   73.6±1.2   85.0±1.1   79.0±1.0   92.0±0.5   90.7±0.1   93.4±0.4   89.3±0.7
  0.2       91.2±0.3   95.6±0.5   76.8±0.8   72.3±1.0   81.7±1.0   74.2±0.9   92.2±0.6   91.2±0.6   93.5±0.2   85.2±0.7
  0.3       91.9±0.5   96.5±0.4   72.0±0.9   73.6±1.2   77.2±0.8   68.3±1.5   92.2±0.6   91.9±0.4   93.2±0.4   79.3±1.1
  0.4       92.4±0.4   97.1±0.3   63.6±1.0   77.6±0.6   69.0±1.0   51.0±0.9   92.6±0.8   91.9±0.6   93.3±0.1   70.0±2.4

TERM (Li et al. 2021)
  0.1       90.3±0.5   94.8±0.5   79.8±0.7   73.0±1.2   85.8±0.5   79.8±0.7   91.5±0.6   90.1±0.3   93.1±0.4   90.0±0.4
  0.2       91.2±0.7   95.3±0.5   75.9±0.5   72.0±1.1   82.2±1.1   75.0±1.2   91.7±0.8   90.7±0.9   92.9±0.6   85.9±0.4
  0.3       91.7±0.6   96.4±0.3   68.6±0.9   72.9±0.3   74.1±1.3   66.8±0.9   91.8±0.8   91.3±0.3   92.6±0.4   77.3±0.7
  0.4       92.4±0.4   97.0±0.3   54.6±0.9   75.7±0.8   61.1±2.0   49.1±1.5   92.2±0.8   91.7±0.8   92.7±0.5   62.6±2.0

Taylor (Feng et al. 2020)
  0.1       89.4±0.4   94.1±0.2   77.9±0.6   71.1±0.9   84.0±0.6   78.1±0.8   90.4±0.5   88.8±0.5   91.8±0.5   89.0±0.4
  0.2       90.3±0.5   95.2±0.2   72.9±0.8   70.7±1.1   79.6±1.0   73.3±1.1   90.8±0.7   89.5±0.9   91.6±0.3   84.7±0.3
  0.3       90.7±0.4   96.2±0.5   64.4±0.9   73.0±1.1   71.4±1.2   63.7±1.6   91.2±0.9   90.5±0.4   91.7±0.6   75.2±1.1
  0.4       91.5±0.5   96.8±0.4   42.7±7.5   78.9±0.7   55.1±1.1   37.4±2.3   91.3±0.8   90.9±0.7   91.6±0.5   60.7±1.2

Normalized (Ma et al. 2020)
  0.1       89.8±0.6   94.2±0.2   79.3±1.5   72.2±0.8   84.9±0.8   79.6±0.8   90.5±0.5   88.7±0.4   92.2±0.2   89.7±0.6
  0.2       90.3±0.7   94.5±0.5   75.3±0.8   71.4±1.5   82.4±0.7   75.9±1.0   90.2±0.4   88.8±0.9   92.5±0.5   87.2±0.7
  0.3       90.4±0.4   95.3±0.4   69.6±1.0   72.8±1.7   77.7±0.3   69.3±0.9   90.5±0.8   89.8±0.4   92.3±0.3   82.0±1.0
  0.4       90.9±0.6   95.9±0.3   56.3±2.0   76.7±1.3   64.9±1.4   36.9±6.3   90.1±0.9   89.8±0.5   91.8±0.7   67.0±1.5

Symmetric (Wang et al. 2019b)
  0.1       89.9±0.4   94.8±0.4   79.2±0.9   72.7±0.9   85.4±1.1   78.8±1.4   90.6±0.2   89.2±0.2   92.9±0.3   89.4±0.3
  0.2       90.8±0.3   94.9±0.4   75.9±0.7   71.9±0.3   83.2±0.9   75.9±1.0   90.6±0.8   89.2±0.9   92.5±0.4   87.3±0.7
  0.3       91.1±0.6   95.4±0.3   69.1±1.4   72.2±0.6   77.4±1.1   69.6±0.5   90.9±0.8   90.2±0.5   92.6±0.4   82.3±1.3
  0.4       91.4±0.7   96.3±0.1   50.3±3.9   77.5±0.9   65.1±1.4   31.0±4.5   90.9±0.7   89.9±0.5   92.2±0.4   67.5±3.5

Generalized (Zhang and Sabuncu 2018)
  0.1       89.9±0.6   94.3±0.5   78.8±0.6   71.9±0.7   85.0±0.4   79.5±1.0   91.1±0.5   89.2±0.8   92.5±0.5   89.7±0.3
  0.2       90.6±0.2   95.2±0.4   74.6±1.0   71.2±1.5   81.8±0.7   73.9±0.8   91.4±0.7   90.3±0.7   92.4±0.7   86.4±0.7
  0.3       91.4±0.5   96.0±0.3   65.7±0.7   72.3±0.8   73.7±0.8   65.2±1.2   91.6±0.4   91.0±0.4   92.0±0.5   77.5±1.0
  0.4       92.2±0.6   96.7±0.4   9.0±20.2   77.2±1.2   55.6±1.1   42.0±2.1   92.2±0.8   91.3±0.4   92.1±0.7   58.6±2.7

Table 5.2: Class accuracy (%) and standard deviation over 5 runs on clean data for CIFAR-10 affected by class-dependent noise.
Abstract
We introduce distributional variance penalization, a strategy for learning with limited and/or mislabeled data, and for jointly optimizing individual and group fairness -- a closely related problem. While minimizing the loss function currently stands as the training objective for many machine learning applications, it suffers from (i) poor robustness and (ii) inability to equalize performance across subsets of the training data.
With regard to the first issue, we show that we can improve robustness by minimizing the average loss along with penalizing the variance. In particular, we expand on past studies of directly penalizing the variance, which adjusts the weights of individual samples, resulting in improved robustness. However, the weights can take negative values and lead to unstable behavior. We introduce distributional variance penalization, which solves the issue of negative weights. Distributional variance penalization minimizes the expectation with respect to a distinct distribution that achieves a similar weighting scheme as direct variance penalization. We study the impact of both positive and negative variance penalization in the context of classification and show that the generalization and robustness against mislabeled data can be improved for a broad class of loss functions. Experimental results show that test accuracy improves by up to 20% compared to ERM when training with limited or mislabeled data.
In regards to the second issue, we introduce a novel procedure for jointly optimizing individual and group fairness based on distributional variance penalization. We built this procedure upon the paradigm of distributionally robust optimization, in which the goal is to improve robustness and protect against possible shifts in data distribution. This is of paramount interest when optimizing for fairness, as sensitive groups are often underrepresented, which leads to skews in the underlying data distribution. In addition, we also introduce a new definition of fairness which extends some existing ones. In principle, our goal is to minimize the empirical mean and variance of the loss values, while optimizing for all groups to achieve the same loss mean and variance. Our experimental results show that we can enforce groups to achieve similar loss means and variances and improve the final accuracy for all groups when the training data is skewed.