Scalable Optimization for Trustworthy AI:
Robust and Fair Machine Learning
by
Sina Baharlouei
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(INDUSTRIAL AND SYSTEMS ENGINEERING)
May 2024
Copyright 2024 Sina Baharlouei
Acknowledgements
First and foremost, I express my deepest gratitude to my supervisor, Dr. Meisam Razaviyayn, whose guidance, support, and encouragement have been invaluable throughout this journey. His insights and expertise have not only shaped this thesis but have also fostered my growth as a researcher.
I am equally grateful to my committee members, Dr. Sze-Chuan Suen and Dr. Oded Stein, for their
constructive feedback and insightful suggestions, which have significantly strengthened this work. Additionally, I extend my thanks to Dr. John Gunnar Carlsson, Dr. Andres Gomez, Dr. Renyuan Xu, Dr. Vishal
Gupta, Dr. Vatsal Sharan, and Dr. Larry Goldstein, my Thesis Proposal and Qualification Exam Committee
members, whose insightful comments have contributed to the refinement of this thesis.
Heartfelt thanks are due to the University of Southern California for providing the resources and environment conducive to research. The Industrial and Systems Engineering community has been exceptionally supportive, and I am grateful for the opportunities it has provided.
I am indebted to my collaborators, colleagues, and friends for their camaraderie, encouragement, and
intellectual exchange. Special appreciation goes to Dr. Ahmad Beirami for his invaluable guidance and
deep insights into my Ph.D. projects, focusing on fair inference in machine learning.
I express my deepest appreciation to my family for their unwavering love, encouragement, and understanding throughout this journey. Their support has been my anchor during challenging times. In
particular, I extend my gratitude to my mother, Arezoo, and my father, Mohammadreza, for their consistent support before, during, and after completing my Ph.D. Furthermore, I am grateful to my beloved
Maryam for her unwavering support during my remarkable Ph.D. journey.
Lastly, I dedicate this thesis to the memory of Alfréd Rényi, the Hungarian mathematician whose
pioneering works in Probability and Information Theory continue to inspire generations of researchers.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Empirical Risk Minimization Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Vulnerability to Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Poor Performance in the Presence of Batch Effects . . . . . . . . . . . . . . . . . . 3
1.1.3 Vulnerability to Adversarial Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.4 Biased and Unfair Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Chapter 2: Robust Imputation and Inference from Low Order Marginals . . . . . . . . . . 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Robust Inference via Estimating Low-order Moments . . . . . . . . . . . . . . . . . . . . . 13
2.3 Robust Linear Regression in the Presence of Missing Values . . . . . . . . . . . . . . . . . 16
2.3.1 A Distributionally Robust Formulation of Linear Regression . . . . . . . . . . . . . 17
2.3.2 RIFLE for Ridge Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.3 Performance Guarantees for RIFLE . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.4 Imputation of Missing Values and Going Beyond Linear Regression . . . . . . . . . 21
2.4 Robust Classification Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4.1 Robust Quadratic Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.2 Tuning Hyper-parameters of RIFLE . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.3 RIFLE Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5.4 Data Imputation via RIFLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5.5 Sensitivity of RIFLE to the Number of Samples and Proportion of Missing Values . 33
2.5.6 Performance Comparison on Real Datasets . . . . . . . . . . . . . . . . . . . . . . 35
2.5.6.1 Performance of RIFLE on Classification Tasks . . . . . . . . . . . . . . . 36
2.5.6.2 Comparison of Robust Linear Regression and Robust QDA . . . . . . . . 38
Chapter 3: Feature Selection in the Presence of Monotone Batch Effects . . . . . . . . . . 40
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 Problem Formulation and Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.1 MMD-based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.2 Low-Rank MMD Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.3 Ranking Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.1 Simulated Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.2 Hyper-Parameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.3 Interpreting the Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.4 Convergence of the Training Approach . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4 Validate Multi-variate Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4.1 Genomic Data Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Chapter 4: Improving Adversarial Robustness via Joint Classification and Multiple
Explicit Detection Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.1 Verification of feedforward neural networks . . . . . . . . . . . . . . . . . . . . . . 63
4.2.2 Verification of neural networks via IBP . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2.3 Verification of neural networks via β-Crown . . . . . . . . . . . . . . . . . . . . . 65
4.2.4 Training a joint robust classifier and detector . . . . . . . . . . . . . . . . . . . . . 66
4.3 Verification of Neural Networks with Multiple Detection Classes . . . . . . . . . . . . . . . 67
4.3.1 Verification with IBP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.2 Verification with β-Crown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4 Training of Neural Networks with Multiple Detection Classes . . . . . . . . . . . . . . . . 71
4.4.1 Addressing model degeneracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.5.1 Training Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.5.2 Robust Verified Accuracy on Shallow Networks . . . . . . . . . . . . . . . . . . . . 77
4.5.3 Performance on Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 79
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Chapter 5: Rényi Fair Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Rényi Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3 A General Min-Max Framework for Rényi Fair Inference . . . . . . . . . . . . . . . . . . . 87
5.3.1 Computing Rényi Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4 Rényi Fair Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4.1 General Discrete Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4.2 Binary Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.5 Rényi Fair Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.6 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Chapter 6: FERMI: A Stochastic Optimization Framework for Fair Empirical Risk
Minimization via Exponential Rényi Mutual Information . . . . . . . . . . . . 102
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.2 Fair Risk Minimization through ERMI Regularization . . . . . . . . . . . . . . . . . . . . . 106
6.2.1 A Convergent Stochastic Algorithm for Fair Empirical Risk Minimization . . . . . 108
6.2.2 Asymptotic Convergence of Algorithm 12 for Population-level FRMI Objective . . 113
6.3 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.3.1 Fair Classification with Binary Sensitive Attributes using Logistic Regression . . . 116
6.3.1.1 Benchmarking full-batch performance . . . . . . . . . . . . . . . . . . . 116
6.3.1.2 The effect of batch size on fairness/accuracy tradeoffs . . . . . . . . . . . 118
6.3.1.3 The effect of missing sensitive attributes on fairness/accuracy tradeoffs . 119
6.3.2 Fair Binary Classification using Neural Models . . . . . . . . . . . . . . . . . . . . 120
6.3.3 Fair Non-binary Classification with Multiple Sensitive Attributes . . . . . . . . . . 121
6.3.4 Beyond Fairness: Domain Parity Regularization for Domain Generalization . . . . 122
6.4 Discussion and Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Chapter 7: Dr. FERMI: A Stochastic Distributionally Robust Fair Empirical Risk
Minimization Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.1.2 Our Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.2 Preliminaries and Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.2.1 Notions of Group Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.2.2 Fairness Through Regularized ERM . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.2.3 Measuring Fairness Violation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.2.3.1 ERMI as Fairness Violation Measure . . . . . . . . . . . . . . . . . . . . . 135
7.3 Efficient Robustification of ERMI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.3.1 Distributionally Robust Formulation of ERMI . . . . . . . . . . . . . . . . . . . . . 136
7.3.2 The Uncertainty Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.3.3 Simplification of Dr. FERMI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.3.4 Generalization of Dr. FERMI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.4 Algorithms for Solving Dr. FERMI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.4.1 Deterministic (Full-Batch) Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.4.2 Stochastic (Mini-Batch) Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.5 Robustification of the Model Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.6 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.6.1 Modifying Benchmark Datasets to Contain Hybrid Distribution Shifts . . . . . . . 143
7.6.2 Robustness Against Distribution Shifts in Real Datasets . . . . . . . . . . . . . . . 145
7.6.3 Handling Distribution Shift in Non-Binary Case . . . . . . . . . . . . . . . . . . . . 147
7.7 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.8 Broader Impacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Chapter 8: f-FERM: A Scalable Framework for Robust Fair Empirical Risk Minimization 153
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
8.2 Fair Empirical Risk Minimization via f-divergences . . . . . . . . . . . . . . . . . . . . . . 158
8.2.1 A Convergent Stochastic Algorithm for fair ERM via f-Divergences . . . . . . . . 159
8.3 Robust f-FERM in the Presence of Distribution Shifts . . . . . . . . . . . . . . . . . . . . . 163
8.3.1 Robust f-FERM Under ℓp Norms and Small Distribution Shifts . . . . . . . . . . . 164
8.3.2 Robust f-FERM Under ℓ∞ Norms and Potentially Large Distribution Shifts . . . . 166
8.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.4.1 Fairness-Accuracy Tradeoffs on Benchmark Datasets . . . . . . . . . . . . . . . . . 168
8.4.2 Fairness-Accuracy Tradeoffs in the Presence of the Distribution Shift . . . . . . . . 169
8.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Chapter 9: Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Chapter 10: Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
10.1 Chapter 2 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
10.1.1 A Review of Missing Value Imputation Methods in the Literature . . . . . . . . . . 174
10.1.2 Estimating Confidence Intervals of Low-order Moments . . . . . . . . . . . . . . . 176
10.1.3 Robust Ridge Regression with the Optimal Convergence Rate . . . . . . . . . . . . 178
10.1.4 Solving the Dual Problem of the Robust Ridge Linear Regression via ADMM . . . . 179
10.1.5 Quadratic RIFLE: Using Kernels to Go Beyond Linearity . . . . . . . . . . . . . . . 187
10.1.6 Performance of RIFLE and QRIFLE on Synthetic Non-linear Data . . . . . . . . . . 188
10.1.7 Robust Quadratic Discriminant Analysis (Partially Missing Target Feature) . . . . . 189
10.1.8 Generating Missing Values Patterns in Numerical Experiments . . . . . . . . . . . 191
10.1.9 Proof of Lemmas and Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
10.1.10 Dataset Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
10.1.11 Further Discussion on the Consistency of RIFLE . . . . . . . . . . . . . . . . . . . . 203
10.1.12 Numerical Experiments for Convergence of RIFLE Algorithms . . . . . . . . . . . . 203
10.1.13 Execution Time Comparison of RIFLE and Other State-of-the-art Approaches . . . 205
10.2 Chapter 4 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
10.2.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
10.2.2 Bregman-Divergence Method for Optimizing a Convex Function
Over a Probability Simplex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
10.2.3 Proof of Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
10.2.4 Details of β-Crown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
10.2.5 Derivation of equation (4.12) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
10.2.6 A simple example on the benefits and pitfalls of having multiple abstain classes . . 215
10.2.7 Structure of Neural Networks in Section 4.3 . . . . . . . . . . . . . . . . . . . . . . 219
10.2.8 Experiments on Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 219
10.2.9 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
10.2.10 Societal Impacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
10.3 Chapter 5 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
10.3.1 Proof of Theorem 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
10.3.2 Proof of Theorem 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
10.3.3 Rényi Fair K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
10.3.4 Updating w after updating the assignment of each data point in Algorithm 11 . . . 225
10.3.5 Datasets Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
10.3.6 Supplementary Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
10.3.7 Fair Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
10.4 Chapter 6 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
10.4.1 Notions of Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
10.4.2 ERMI: General Definition, Properties, and Special Cases Unraveled . . . . . . . . . 231
10.4.3 Relations Between ERMI and Other Fairness Violation Notions . . . . . . . . . . . 235
10.4.4 Precise Statement and Proofs of Theorem 19 and Theorem 21 . . . . . . . . . . . . 245
10.4.5 Proof of Theorem 21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
10.4.6 Experiment Details and Additional Results . . . . . . . . . . . . . . . . . . . . . . . 260
10.4.6.1 Model description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
10.4.6.2 More comparison to [223] . . . . . . . . . . . . . . . . . . . . . . . . . . 261
10.4.6.3 Performance in the presence of outliers & class-imbalance . . . . . . . . 262
10.4.6.4 Effect of hyperparameter λ on the accuracy-fairness tradeoffs . . . . . . 263
10.4.6.5 Complete version of Figure 6.1 . . . . . . . . . . . . . . . . . . . . . . . . 263
10.4.6.6 Description of datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
10.5 Chapter 7 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
10.5.1 Proof of Theorem 23 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
10.5.2 Proof of Theorem 25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
10.5.3 Proof of Lemma 26 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
10.5.4 Derivation of Algorithm 13 and Details on Computing Gradients . . . . . . . . . . 273
10.5.5 Convergence of Algorithm 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
10.5.6 Convergence of Algorithm 14 (Stochastic version of Dr. FERMI) . . . . . . . . . . . 275
10.5.7 Full-Batch Algorithms for Dr. FERMI under L2 and L∞ Norms . . . . . . . . . . . 276
10.5.8 Algorithms for CVaR-Dr. FERMI and Group-Dr. FERMI . . . . . . . . . . . . . . . . 277
10.5.9 Hyper-Parameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
10.6 Chapter 8 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
10.6.1 f-FERM for other notions of group fairness . . . . . . . . . . . . . . . . . . . . . . 278
10.6.2 f-divergence for continuous sensitive attributes and target variables . . . . . . . . 279
10.6.3 f-divergences cover well-known notions of fairness violation . . . . . . . . . . . . 281
10.6.4 Proof of Proposition 27 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
10.6.5 Derivation of Closed-Form Expressions for Unbiased Gradient Estimators of
f-Divergences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
10.6.6 Formal Statement of Theorem 28 and Proof . . . . . . . . . . . . . . . . . . . . . . 289
10.6.7 A Faster (But Double-Loop) Algorithm for Optimizing (f-FERM) . . . . . . . . . . 293
10.6.8 A First-order Optimization Algorithm for Optimizing (8.10) . . . . . . . . . . . . . 295
10.6.9 Proof of Equation (8.12) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
10.6.10 Details of Tuning Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
10.6.11 Further Experiments on Other Datasets and Notions of Fairness . . . . . . . . . . . 297
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
List of Tables
2.1 Performance comparison of RIFLE, QRIFLE (Quadratic RIFLE), and state-of-the-art methods on several UCI datasets. We applied the imputation methods to three different missing-value proportions for each dataset. The best imputer is highlighted in bold, and the second-best imputer is underlined. Each experiment is repeated 5 times, and the average and standard deviation of performance are reported. . . . . . . . . . . . . . . 32
2.2 Normalized RMSE of RIFLE and several state-of-the-art methods on the Superconductivity, Blog Feedback, and NHANES datasets. The first two datasets contain 30% Missing Not At Random (MNAR) missing values in the training phase, generated by Algorithm 21. Each method is applied 5 times to each dataset, and the result is reported as the average ± standard deviation of NRMSE across experiments. . . . . . . . . . . . . . . 36
2.3 Sensitivity of Linear Discriminant Analysis, Robust LDA (Common Covariance Matrices),
and Robust QDA (Different Covariance matrices for two groups) to the number of training
samples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4 Accuracy of RIFLE, MICE, KNN-Imputer, Expectation Maximization (EM), and Robust
QDA on different discrete, mixed, and continuous datasets. Robust QDA can perform
better than other methods when the input features are continuous, and the target variable
is discrete. However, RIFLE results in higher accuracy in mixed and discrete settings. . . . 38
3.1 F1 Scores for Different Methods and Scenarios . . . . . . . . . . . . . . . . . . . . . . . . 51
4.1 Standard Error and PGD attack success rate on the MNIST dataset for different state-of-the-art approaches. The chosen ϵ for the PGD attack equals 0.4. . . . . . . . . . . . . . . 80
4.2 Standard Error and PGD attack success rate on the CIFAR-10 dataset for different
state-of-the-art approaches. The chosen ϵ for the PGD attack equals 8/255. . . . . . . . . . 80
6.1 Comparison of state-of-the-art in-processing methods (NB = non-binary) on whether they (a)
handle non-binary targets (beyond binary classification), (b) handle non-binary sensitive attributes,
(c) release code that applies to non-binary targets/attributes, (d) extend to arbitrary models, (e)
provide code for stochastic optimization (and whether the gradients are unbiased), (f) provide
convergence guarantees (for stochastic optimization). FERMI is the only method compatible with
stochastic optimization and guaranteed convergence. The only existing baselines for non-binary
classification with non-binary sensitive attributes are [223, 23, 74] (NB code). ∗We refer to the
in-processing method of [157], not their post-processing method. ∗∗We use the term “unbiased” in
statistical estimation sense; not to be confused with bias in the fairness sense. . . . . . . . . . . . 104
7.1 Train and test accuracy and fairness violation of different methods on the ACS PUMS dataset. The reported numbers are the average and standard deviation of the values across 50 states. . . . . 148
7.2 Memory Consumption and Computation Time of Different Approaches in Figure 1. Note
that the reported values are for the batch size of 8. The required memory for the full batch
algorithm is more than 4GB, and the required time is nearly the same (since the number
of epochs is the same). Note that while other approaches lose performance due to the
smaller batch sizes, Dr. FERMI and CVaR Dr. FERMI preserve their performance even for
small batch sizes (as small as 1). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
8.1 Unbiased Estimators for f-divergence Regularizers . . . . . . . . . . . . . . . . . . . . . . 162
10.1 Execution time of RIFLE and other SOTA methods on three datasets. . . . . . . . . . . . . 205
10.2 Standard and Robust Verified error of state-of-the-art approaches on CIFAR-10 dataset. . . 206
10.3 Standard and Robust Verified error of state-of-the-art approaches on CIFAR-10 dataset. . . 220
10.4 Performance and training time of a neural network trained on the Adult dataset. The first and second rows correspond to the networks without and with Rényi correlation regularization, respectively. As the table shows, while adding the Rényi regularizer makes the classifier fairer, it does so with minimal additional computational overhead. . . . . . . . . . . . . . . . . . . . . . . . 228
List of Figures
1.1 Consider the problem of predicting the trait y from the feature vector (x1, . . . , x100). Suppose we have access to three datasets: the first includes measurements of (x1, x2, . . . , x40, y) for n1 individuals; the second collects data from another n2 individuals by measuring (x30, . . . , x80), with no measurements of the target variable y; and the third contains measurements of the variables (x70, . . . , x100, y) for n3 individuals. How should one learn the predictor yˆ = h(x1, . . . , x100) from these three datasets? . . . . . . . . . . . . . . . . . . . . . . 4
1.2 The batch effect can be conceptualized as transformations on each of the ground-truth datasets, which change the distribution of the data in each one in a potentially different way. . . . . . . . 5
1.3 The deep neural network correctly classifies the left image as a Panda. However, by adding
a crafted noise that is invisible to the human eye, the same neural network incorrectly
classifies it as a Gibbon [129]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Different neural networks trained on large-scale datasets exhibit a large bias (accuracy gap) against women and people with darker skin [54]. . . . . . . . . . . . . . . . . . . . . . 7
2.1 Consider the problem of predicting the trait y from the feature vector (x1, . . . , x100). Suppose that we have access to three datasets: the first includes measurements of (x1, x2, . . . , x40, y) for n1 individuals; the second collects data from another n2 individuals by measuring (x30, . . . , x80), with no measurements of the target variable y; and the third contains measurements of the variables (x70, . . . , x100, y) for n3 individuals. How should one learn the predictor yˆ = h(x1, . . . , x100) from these three datasets? . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Prediction of the target variable without imputation. RIFLE estimates confidence intervals
for low-order (first and second-order) marginals from the input data containing missing
values. Then, it solves a distributionally robust problem over the set of all distributions
whose low-order marginals are within the estimated confidence intervals. . . . . . . . . . 12
2.3 Comparing the consistency of RIFLE, MissForest, KNN Imputer, MICE, Amelia, and
Expectation Maximization methods on a synthetic dataset containing 40% of missing values. 31
2.4 Performance comparison of RIFLE, MICE, and MissForest on four UCI datasets: Parkinson, Spam, Wave Energy Converter, and Breast Cancer. For each dataset, we count the number of features on which each method outperforms the others. . . . . . . . . . . . . . . . . . . 33
2.5 Sensitivity of RIFLE, MissForest, Amelia, KNN Imputer, MIDA, and Mean Imputer to the
percentage of missing values on the Drive dataset. Increasing the percentage of missing
value entries degrades the benchmarks’ performance compared to RIFLE. KNN-imputer
implementation cannot be executed on datasets containing 80% (or more) missing entries.
Moreover, Amelia and MIDA do not converge to a solution when the percentage of
missing value entries is higher than 70%. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6 Sensitivity of RIFLE, MissForest, MICE, Amelia, Mean Imputer, KNN Imputer, and MIDA to the number of samples for the imputation of the Blog Feedback dataset containing 40% MCAR missing values. When the number of samples is limited, RIFLE outperforms the other methods, and its performance is very close to that of the non-linear imputer MissForest for larger sample sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.7 Effect of the number of covariance estimations on the performance (left) and run time
(right) of robust LDA on Avila and Magic datasets. Increasing the number of covariance
estimations (k) improves the model’s accuracy on the test data. However, it takes longer
training time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1 The batch effect can be conceptualized as transformations on each of the ground-truth datasets, which change the distribution of the data in each one in a potentially different way. . . . . . . . 43
3.2 F1 Score Versus Normalized Logarithmic Supervised Learning Loss . . . . . . . . . . . . . 52
3.3 Training Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4 Heat-map of MMD before and after transformation . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 Histogram for Validating Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1 The IBP verification for 400 input data points of 2-layer and 3-layer neural networks. Part (a) shows the four labels assigned to the data points. Part (b) demonstrates that IBP can verify 14 points using one of two abstain classes (black triangles), while it cannot verify 13 data points (red ×). Part (c) shows that when IBP is applied to a network with one more layer and one detection class, 8 points are verified by the detection class, while it fails to verify 21 points. The description of both networks can be found in Appendix 10.2.7. . . . . . 68
4.2 Performance of multiple-abstain shallow networks on the MNIST and CIFAR-10 datasets. We compare multiple-abstain neural networks (both regularized and non-regularized versions) with single-abstain networks and networks with one more layer. The top and bottom rows demonstrate the trade-off between standard and robust verified accuracy on the MNIST and CIFAR-10 datasets, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 Performance of shallow networks with multiple detection classes without regularization, with regularization,
and the network with one more layer on CIFAR-10 dataset. As demonstrated, the networks with regularization
work consistently well for 4 ≤ M ≤ 10, which is very close to the best performance we get from networks
without regularization with 4 detection classes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.4 Distribution of natural and adversarial images over different abstain classes on the CIFAR-10 dataset. When there are 10 abstain classes, model degeneracy leads to lower performance than the baseline. Adding the regularization term (rightmost column) utilizes all abstain classes and enhances standard and robust verified accuracy. Standard accuracy is the proportion of correctly classified natural images, while robust verified accuracy is the proportion of images that are robust against all adversarial attacks within the ϵ-neighborhood (ϵ = 16/255). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.5 Performance of β-Crown on verification of neural networks with a single abstain class, 4 abstain classes, 10 abstain classes with regularization, and networks with one more layer (single abstain) on the CIFAR-10 dataset. M = 1 coincides with [303]. . . . . . . . . . . . . . . . . 79
5.1 Trade-off between the accuracy of the classifier and fairness on the Adult dataset under the equality of opportunity notion. (a, b) By increasing λ from 0 to 1000, the EO violation (the blue curve, left axis) approaches 0. The fairer solution comes at the price of a slight increase in the training/test error (red curve, right axis). (c) Comparison of the existing approaches with the Rényi classifier under the equality of opportunity notion. The Rényi classifier demonstrates better accuracy for a given level of fairness measured by EO violation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2 Trade-off between accuracy and fairness for a logistic regression classifier regularized with Rényi, HSIC, and Pearson measures on the German Credit, Adult, and Bank datasets. (Top) The drop in accuracy of the model regularized by Rényi correlation is smaller than that of the same model regularized by HSIC or Pearson correlation. Moreover, as can be observed for both the Bank and Adult datasets, Pearson and HSIC regularizers usually cannot increase p% beyond a certain limit, because removing all linear correlations does not guarantee independence between the predictor and the sensitive attribute. (Bottom) When the sensitive attribute is not binary (or there is more than one sensitive attribute), obtaining a fair model with HSIC or Pearson regularizers is even harder. The model regularized by HSIC or Pearson cannot reduce the DP violation (or increase its reciprocal) beyond a certain threshold. . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.3 Performance and fairness of the K-means algorithm as a function of the Rényi regularizer hyper-parameter λ. By increasing λ, the standard deviation of the components of the w vector (each component represents the relative proportion of the privileged group in the corresponding cluster) is reduced accordingly. Both plots demonstrate that the standard deviation of w decreases quickly with respect to λ, and the increase in loss is small when λ ≤ 0.005. However, to reach a completely fair clustering, a λ ≥ 1 must be chosen, which can increase the loss (right axis, red curve) drastically. . . . . . . . . . . . . . . . . . 101
6.1 Accuracy/Fairness trade-off of FERMI and several state-of-the-art in-processing approaches on
Adult dataset. FERMI offers the best fairness vs. accuracy tradeoff curve in all experiments against
all baselines. [277] only allow for a single output and do not yield a tradeoff curve. Further, the
algorithms by [223] and [23] are equivalent in this binary setting and shown by the red curve. In
the binary/binary setting, FERMI, [223], and [23] all try to solve the same objective Eq. (FRMI obj.).
However, the empirical formulation Eq. (FERMI obj.) and FERMI algorithm that we use results in
better performance, even though we are using a full batch for all baselines in this experiment. . . 117
6.2 Performance of FERMI, [75], [223], [23] with different batch sizes on the Adult dataset. FERMI demonstrates the best accuracy/fairness tradeoff across different batch sizes. . . . . . . . . . 118
6.3 Performance of FERMI and other state-of-the-art approaches on the Adult dataset where 90%
of gender entries are missing. Full-sensitive FERMI is obtained by applying FERMI on the data
without any missing entries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.4 Fair toxic comment detection with different batch sizes. For |B| = 128, the performance of [259] and FERMI are close to each other; however, when the batch size is reduced to 16, FERMI demonstrates a better fairness/performance trade-off. Performance and fairness are measured by the test error and the false-positive gap between different religious sub-groups (Christians vs. Muslim-Jews), respectively. . . . . . . . . . . . . . . . . . . . . . . . . 120
6.5 Comparison between FERMI, [223], [23], and [74] on Communities dataset. [223] outperforms [23,
74], which we believe could be attributed to the effectiveness of ERMI as a regularizer. FERMI
outperforms [223]. This can be attributed to our empirical formulation Eq. (FERMI obj.) and
unbiased stochastic optimization algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.6 Domain generalization on Color MNIST [197] using in-process fair algorithms for demographic
parity. Left panel: The dashed line is the training error and the solid line is the test error. As λ
increases, fairness regularization results in a learned representation that is less dependent on color;
hence training error increases while test error decreases (all algorithms reach a plateau around
λ = 8). We use |B| = 512 for all baselines. Right panel: We plot test error vs. batch size using an
optimized value of λ for each algorithm selected via a validation set. The performance of baselines
drops 10-20% as the batch size becomes small, whereas FERMI is less sensitive to batch size. . . . . 123
7.1 Performance of the trained fair models on test datasets whose distributions differ from the training data. In the left figure, we undersampled the high-income minority group (women), while in the right figure we oversampled it. The proposed methods (Dr. FERMI and CVaR-Dr. FERMI) generally outperform the other benchmarks. In both figures, either Dr. FERMI or CVaR-Dr. FERMI reaches lower demographic parity violations at any given accuracy level. The red dashed line represents the naïve baseline where the model outputs zero with probability p; by increasing p, the model becomes fairer at the cost of a loss in accuracy. . . . . . . . . 145
7.2 Learning fair logistic regression models on four different states as the source (in-distribution)
domain and evaluating them on all 50 states of the US. . . . . . . . . . . . . . . . . . . . . . . . 146
7.3 Performance of different methods regarding fairness and accuracy for small batch sizes of 2 and 8.
Our robust formulation works consistently well even for small batch sizes due to the unbiasedness
of the gradient, while other methods are more sensitive to smaller batch sizes. . . . . . . . . . . . 149
8.1 Performance of different f-divergences as regularizers. The experiment is on the Adult dataset with gender and race as sensitive attributes. While the offered tradeoffs are close to each other for small demographic parity violations, KL-divergence performs exceptionally well in the low-fairness, high-accuracy regime. We do not display the performance for larger batch sizes or when only one sensitive attribute is available, because the differences between the f-divergences are insignificant in those cases. . . . . . . . . . . . . . . . . . . . . . . . . . 168
8.2 Performance of the trained fair models on the Adult dataset with gender and race as two sensitive attributes, for different batch sizes. The red dashed line represents the naïve baseline where the model outputs zero with probability p. By increasing p, the model becomes fairer at the cost of a loss in accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.3 Performance of different state-of-the-art approaches and our two methods for handling distribution
shift. The dataset is adult, and the sensitive attribute is gender. We randomly flip the label of a
proportion of gender entries (from 0 to 20%). As we observe, our approach demonstrates more
robustness against the drop in DP violation compared to other approaches. . . . . . . . . . . . . 170
8.4 Performance of the trained fair models on the new Adult dataset. The model is trained on one state (California or Texas) and evaluated on all 50 states. The distribution of each state's dataset differs from the others; thus, the IID assumption does not hold across the states' datasets. . . . . . . 171
10.1 Performance of RIFLE, QRIFLE, MissForest, Amelia, KNN Imputer, MICE, and Expectation Maximization on quadratic synthetic datasets with different percentages of missing values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
10.2 Consistency of ADMM (Algorithm 19) and Projected Gradient Ascent on function g
(Algorithm 1) on the synthetic datasets with 40%, 60% and 80% missing values. . . . . . . 203
10.3 Convergence of ADMM algorithm to the optimal solution of Problem (10.4) for different
values of ρ. The left plot measures the objective function of Problem (10.4) per iteration
(without considering the constraints), while the right plot demonstrates the constraint
violation of the algorithm per iteration. The constraint violation can be measured by
adding all regularization terms in the augmented Lagrangian function formulated in
Problem (10.6). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
10.4 The performance of the Nesterov acceleration method, projected gradient ascent, and
ADMM on the Super Conductivity dataset vs. the number of iterations. . . . . . . . . . . . 205
10.5 Distribution of adversarial and real data described in the example. While one linear
classifier cannot separate the adversarial (red section) and real (green section) data points,
two detection classes can detect adversarial examples. . . . . . . . . . . . . . . . . . . . . . 216
10.6 Applying K-means algorithm without fairness on the synthetic dataset. . . . . . . . . . . . 225
10.7 Applying fair K-means algorithm with different values of λ on the synthetic dataset. . . . 225
10.8 The relationship between Rényi correlation, Pearson correlation, and normalized mutual information. Direct optimization of normalized mutual information is intractable due to its non-convexity. However, as can be observed on the right-hand side, by driving the Rényi correlation to 0, the normalized mutual information converges to 0 accordingly. . . . 228
10.9 [223] fails to converge to a stationary point whereas our stochastic algorithm easily converges. . . 262
10.10 Comparing FERMI with other methods in the presence of outliers (a random 10% of sensitive attributes in training forced to 0). FERMI still achieves a better trade-off compared to all other baselines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
10.11 Tradeoff of fairness violation vs. test error for the FERMI algorithm on the COMPAS and Adult datasets. The solid and dashed curves correspond to the FERMI algorithm under the demographic parity and equality of opportunity notions, respectively. The left axis demonstrates the effect of changing λ on the test error (red curves), while the right axis shows how the fairness of the model (measured by equality of opportunity or demographic parity violations) depends on λ. . . . . . . . 263
10.12 Tradeoff of demographic parity violation vs test error for FERMI algorithm on COMPAS, German,
and Adult datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
10.13 Tradeoff of equalized odds violation vs test error for FERMI algorithm on COMPAS, German, and
Adult datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
10.14 Tradeoff of equality of opportunity violation vs test error for FERMI algorithm on COMPAS,
German, and Adult datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
10.15 Performance of the trained fair models on COMPAS and German Credit Datasets. . . . . . . . . . 298
Abstract
The advent of artificial intelligence (AI) has revolutionized a diverse set of complex decision-making tasks
in society and industry. Empirical Risk Minimization (ERM), as the dominant paradigm for training and
deploying large-scale AI models, has showcased remarkable efficacy across diverse applications such as object detection, language modeling, and speech recognition. Despite its promising results, ERM grapples with notable limitations that curtail its suitability for critical decision-making tasks. One significant drawback lies in the potential for ERM-based training to yield unfair outcomes, particularly with respect to protected groups such as racial minorities and women. Additionally, the robustness of these models hinges heavily on the assumption that the training data consists of independent and identically distributed (IID) samples of the eventual test (production) data. However, this assumption may be violated when the distributions of the target and source domains differ, stemming from factors such as changes in environment, geographical location, or date. Furthermore, challenges such as batch effects, missing values, and the influence of adversarial agents disrupt the IID assumption, rendering conventional ERM training unreliable and leading to significant performance degradation in such scenarios.
In this thesis, we study optimization methods and algorithms for the robust and fair training of machine-learning models. We examine the robustness of machine-learning models in different scenarios, including:
1. Robust Imputation and Inference in the Presence of Missing Values (Chapter 2)
2. Learning Robust Models in the Presence of Batch Effects (Chapter 3)
3. Robustness Against Adversarial Attacks (Chapter 4)
4. Robustness of Fair Empirical Risk Minimization Against Distributional Changes (Chapter 7 and Chapter 8)
In the second part of the thesis (Chapter 5 and Chapter 6), our focus shifts to developing scalable algorithms that enforce fairness criteria in machine-learning models. Our approach involves introducing novel measures and methodologies designed to integrate fairness constraints into the empirical risk minimization
framework. In the study of fair empirical risk minimization, our primary considerations center around
two pivotal aspects: evaluating the model’s performance through the accuracy-fairness tradeoff and the
amenability to scalable stochastic optimization algorithms. We put forth a diverse set of measures for
quantifying fairness violations in empirical risk minimization tasks, delving into comprehensive analyses
that showcase the convergence and performance guarantees of the optimization algorithms developed for
these measures.
While extensive research has delved into fortifying vanilla empirical risk minimization against distribution shifts, there is a noticeable scarcity of studies and methodologies addressing the robustness of
fair empirical risk minimization models. Therefore, it is crucial to design large-scale fair algorithms that
demonstrate resilience against alterations (shifts) in the data distribution. Unlike the conventional fair
empirical risk minimization, we dispense with the implicit assumption that the training data constitutes
an independently and identically distributed (IID) sample of the test data. In Chapter 7, we focus on learning fair models in the presence of distribution shifts. We build our distributionally robust model on the
stochastic framework developed in Chapter 6. More precisely, we robustify the empirical risk minimization
objective regularized by the Exponential Rényi Mutual Information (ERMI) measure against distributional
changes. ERMI, as a measure of independence, is a special case of f-divergences.
Finally, in Chapter 8, we propose a convergent, scalable (stochastic) framework for distributionally robust fair empirical risk minimization via f-divergences. This chapter generalizes the stochastic framework of Chapter 6. Further, it provides a novel distributionally robust optimization framework with uncertainty sets defined by ℓp-norm balls.
A shared attribute among the formulations and models presented in the thesis is their incorporation of
data-driven constraints or regularization into the vanilla Empirical Risk Minimization (ERM) framework,
resulting in min-max optimization problems. Consequently, the methods and algorithms devised have
broader applicability beyond the specific issues addressed in the thesis, extending their utility to applications such as domain generalization, contrastive learning, and meta-learning. We discuss the broader
impacts of the presented algorithms and methods in the thesis and potential future directions in the Conclusion (Chapter 9).
Chapter 1
Introduction
Recent developments in the field of machine learning have demonstrated promising results in a wide range of applications, from image classification [76, 177, 1] and object detection [378, 322, 250] to recommender systems [92, 145], large language models [97], and community detection in social networks [363]. Empirical Risk Minimization (ERM) provides a general framework that underlies most of the aforementioned learning problems [337, 230]. Formally, a parametric learning problem seeks to predict a response variable $y$ from a vector of features (predictors) $x = [x_1, x_2, \ldots, x_d] \in \mathbb{R}^d$. Let $P^*$ be the joint distribution of $(x, y)$ and $\ell$ be a loss function measuring the error between the model prediction and the ground truth. Ideally, we aim to find the best possible hypothesis in a hypothesis set parameterized by a vector $\theta$ such that the loss function is minimized. Thus, the learning problem can be described as finding the hypothesis that minimizes the population risk:

$$\min_{\theta} \;\; \mathbb{E}_{(x,y)\sim P^*}\Big[\ell\big(h_\theta(x),\, y\big)\Big] \qquad \text{(Population Risk)}$$

Since $P^*$ is rarely known in practical scenarios, one cannot directly optimize (Population Risk). The empirical risk minimization (ERM) framework tackles this issue by minimizing the empirical loss on $n$ given data points, under the assumption that the samples are drawn independently and identically distributed (i.i.d.) from the unknown distribution $P^*$:

$$\min_{\theta} \;\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(h_\theta(x_i),\, y_i\big). \qquad \text{(ERM)}$$
The most common approach for optimizing empirical risk minimization problems is to use first-order methods such as (stochastic) gradient descent and their variations [175, 51, 102]. When the loss function ℓ is convex, gradient descent is guaranteed to converge to the global solution of (ERM). As shown in [187], even for a non-convex loss function, under a set of mild assumptions, gradient descent converges to a local optimum of (ERM) almost surely. Moreover, first-order methods are scalable in the sense that their convergence rate does not grow with the dimension of θ [238], and they can be used when the number of samples n is large (even when the entire dataset cannot fit in memory). For these reasons, and others such as desirable generalization properties, gradient descent and its stochastic variations such as Adam [175] and RMSProp [146] are the most widely used optimization methods for solving ERM problems in applications involving large volumes of data.
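As a concrete illustration of this pipeline, the following is a minimal sketch of solving an instance of (ERM) with mini-batch stochastic gradient descent, using a logistic-regression hypothesis class on synthetic data; the model, data, and hyper-parameters are illustrative assumptions rather than specifics from the thesis.

```python
# Minimal sketch: mini-batch SGD on an (ERM) objective with a
# logistic-regression hypothesis class. Synthetic data; all names and
# hyper-parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = (X @ theta_true > 0).astype(float)          # labels from a ground-truth model

def loss_and_grad(theta, Xb, yb):
    """Average logistic loss on a (mini-)batch and its gradient in theta."""
    p = 1.0 / (1.0 + np.exp(-(Xb @ theta)))
    loss = -np.mean(yb * np.log(p + 1e-12) + (1 - yb) * np.log(1 - p + 1e-12))
    grad = Xb.T @ (p - yb) / len(yb)
    return loss, grad

theta, lr, batch = np.zeros(d), 0.5, 64
for _ in range(500):
    idx = rng.choice(n, size=batch, replace=False)  # i.i.d. mini-batch
    _, g = loss_and_grad(theta, X[idx], y[idx])
    theta -= lr * g                                 # stochastic gradient step

print("final empirical risk:", loss_and_grad(theta, X, y)[0])
```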
1.1 Empirical Risk Minimization Limitations
Despite its wide range of applications, the ERM framework faces a set of challenges in modern applications. Vulnerability to adversarial and out-of-distribution samples [144, 323, 119], changes in the data distribution from the training to the test domain, high bias against protected groups such as women and racial minorities, and poor performance in the presence of non-ignorable missing values [201], outliers [193, 52], and batch effects [314] are among the most critical issues the general ERM framework faces.
1.1.1 Vulnerability to Missing Data
The success of empirical risk minimization relies mainly on the availability of large-scale, high-quality datasets, which may be scarce in many practical problems, especially in medical and health applications [253, 319, 33]. Moreover, many experiments and datasets in such applications suffer from small sample sizes. At the same time, an increasingly large number of datasets are publicly available, and information across them must be combined reliably to fully and effectively utilize what diverse datasets reveal about related research questions. After appending these datasets together, the resulting dataset can contain large blocks of missing values, as the original datasets may not share precisely the same features (Figure 1.1). Such structured missing values significantly challenge statistical inference tasks formulated as empirical risk minimization problems. In Chapter 2, we propose a distributionally robust approach for formulating regression and classification problems in the presence of missing values.
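As a toy illustration of the structured missingness in Figure 1.1 (the feature index ranges and sample sizes below are assumptions chosen for the sketch), appending three studies that measure overlapping feature subsets yields a single matrix with large NaN blocks, in which low-order marginals can still be estimated from the rows that jointly observe a pair of features:

```python
# Toy illustration (assumed feature ranges and sample sizes) of the
# block-missing structure from appending three studies, as in Figure 1.1.
import numpy as np

rng = np.random.default_rng(1)
d = 100
observed_cols = [range(0, 40),     # study 1 measures x1..x40 (plus y)
                 range(29, 80),    # study 2 measures x30..x80 (no y)
                 range(69, 100)]   # study 3 measures x70..x100 (plus y)
sizes = [50, 80, 60]               # n1, n2, n3

blocks = []
for n_i, cols in zip(sizes, observed_cols):
    Xi = np.full((n_i, d), np.nan)                 # all-missing by default
    Xi[:, list(cols)] = rng.normal(size=(n_i, len(cols)))
    blocks.append(Xi)
X = np.vstack(blocks)                              # appended data with NaN blocks

print("fraction of missing entries:", np.isnan(X).mean())
# Low-order (pairwise) moments are still estimable from co-observed rows:
both = ~np.isnan(X[:, 0]) & ~np.isnan(X[:, 35])
print("rows observing both x1 and x36:", both.sum())
```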
1.1.2 Poor Performance in the Presence of Batch Effects
The rising popularity of public-use datasets is driven by the growing emphasis on funding mandates,
a commitment to data transparency, and the prevalent open-source culture encouraging researchers to
share their collected data. The aggregation of interconnected datasets offers enhanced statistical power
during analysis, a particularly valuable advantage in the realms of biological and genomics data. This is
especially relevant as individual experiments in these domains often yield only a limited number of samples due to the substantial costs associated with experiments. The integration of datasets from diverse
environments poses challenges, primarily due to potential biases shared among datasets originating from the same environment.

Figure 1.1: Consider the problem of predicting the trait y from the feature vector (x1, . . . , x100). Suppose we have access to three datasets: the first includes measurements of (x1, x2, . . . , x40, y) for n1 individuals; the second collects data from another n2 individuals by measuring (x30, . . . , x80), with no measurements of the target variable y; and the third contains measurements of the variables (x70, . . . , x100, y) for n3 individuals. How should one learn the predictor yˆ = h(x1, . . . , x100) from these three datasets?

For instance, variations in genome sequencing machines [252], hybridization protocols [358], and transformation methods [283, 279] can introduce batch effects, which refer to systematic
non-biological differences between batches of samples in gene expression data. These batch effects have
the potential to adversely impact the performance of statistical inference algorithms, particularly in tasks
such as feature selection for identifying valuable biomarkers. They can introduce bias into predictions,
elevate false discovery rates, and diminish prediction accuracy [307]. Consequently, the identification and
elimination of batch effects emerge as critical pre-processing stages in the statistical analysis of various
bioinformatics tasks, including (single-cell) RNA sequencing [68], metabolomics analysis [206], and cancer
classification [11, 189].
The goal of batch effect removal is to learn the underlying transformations on the data batches such
that the ground-truth datasets (the data not affected by the aforementioned factors known as batch effects)
are recovered up to bijective mappings (see Figure 1.2). In other words, the optimal transformations on
Figure 1.2: The batch effect can be conceptualized as transformations applied to each of the ground-truth datasets, which change the distribution of the data in each dataset in a potentially different way.
the datasets make their distributions as close as possible to each other. Thus, one can quantify
the quality of batch effect removal based on how close the resulting distributions of the datasets are after the
transformations. We study statistical inference in the presence of batch effects in Chapter 3.
1.1.3 Vulnerability to Adversarial Attacks
Despite the superior performance of modern inference tools such as deep neural networks on vision and
language processing tasks, these models are extremely vulnerable to adversarial attacks [322, 129]. In
particular, by adding a small perturbation to a given natural image, a deep neural network trained within
the ERM framework fails to classify the image correctly (Figure 1.3 [129]).
More recently, [227] showed that such deep networks are also not robust against out-of-distribution samples.
Such adversarial attacks drastically reduce the reliability of image detection systems, especially in self-driving cars [110]. To address the vulnerability of neural networks to adversarial attacks, researchers have designed defense mechanisms that are robust against adversarial attacks [251, 156,
127, 219, 152]. These early papers and their follow-up works provide robustness against certain types of
attacks, such as the Fast Gradient Sign Method (FGSM) and its first-order variations [323, 129]. However,
the overwhelming majority of these defense mechanisms are highly ineffective against more complex
Figure 1.3: The deep neural network correctly classifies the left image as a Panda. However, by adding
a crafted noise that is invisible to the human eye, the same neural network incorrectly classifies it as a
Gibbon [129].
attacks such as adaptive and brute-force methods [329, 63]. This ineffectiveness necessitates 1) the design
of rigorous verification approaches that can measure the robustness of a given network; 2) the development
of defense mechanisms that are verifiably robust against any attack strategy within the class of permissible
attack strategies.
In Chapter 4, we analyze the robust training and verification of neural networks against adversarial
samples. The underlying optimization problems for these tasks have min-max optimization structures. We
propose scalable algorithms with convergence guarantees and demonstrate the superiority of our approach
compared to the state-of-the-art robust training and verification methods.
1.1.4 Biased and Unfair Outcomes
With the widespread adoption of machine learning algorithms in automated decision making, we have
witnessed increasing reports of instances in which ERM-based models result in discrimination against
certain groups of individuals [91, 321, 49, 15]. In this context, discrimination is defined as an unwanted
distinction against individuals based on their membership in a specific group. For instance, [15] presents
an example of a computer-based risk assessment model for recidivism that is biased against certain
Figure 1.4: Neural networks trained on large-scale datasets exhibit a large bias (accuracy gap)
against women and people with darker skin [54].
ethnicities. [54] show that neural networks trained on large datasets for face detection are highly
biased against women and darker-skinned people. In particular, these networks have much higher errors
in detecting darker-skinned females than lighter-skinned males (see Figure 1.4).
Fairness can be imposed on a given empirical risk minimization problem through various methods, such as
adding an (approximate) independence constraint between the sensitive attributes (gender, ethnicity, etc.) and the model's predictions. We
study several measures of independence and the resulting constrained optimization problems obtained by
adding the independence constraints in Chapters 5 to 8. As we show, fair empirical risk minimization can
be modeled as a min-max optimization problem that can be solved efficiently using stochastic first-order
methods. While the robustification of ERM-based models is extensively studied in the literature, methods
and algorithms for preserving the fairness of trained learning models are mostly overlooked. Changes
in the distribution of data can be due to changes in the proportion of protected groups (demographic shift),
changes in the target variable (label shift), input data (covariate/concept) shift, or a combination of these
shifts. In Chapter 7, we formalize a distributionally robust optimization framework for fair empirical risk
minimization in the presence of distribution shifts. More precisely, we robustify the proposed framework
in Chapter 6 by introducing uncertainty sets on the joint distribution of predictions and sensitive attributes.
The advantage of this approach is that it does not rely on access to the causal graph or explicit assumptions
on the aforementioned types of shifts. The primary challenge of the distributionally robust formulation in
Chapter 7 is that the underlying min-max optimization is non-convex-non-concave. As a result, we focus
on several special cases where the ℓp norm balls are utilized as the measure of distributional uncertainties.
We study the resulting optimization problems for these cases and offer convergent stochastic algorithms
with high performance in terms of fairness-accuracy preservation.
The fair empirical risk minimization framework presented in Chapter 6 and Chapter 7 is limited to the
Exponential Rényi Mutual Information (ERMI) as the measure of fairness violation. ERMI is equivalent to
the $\chi^2$-divergence and hence is a special case of $f$-divergences. In Chapter 8, we propose a unified stochastic optimization framework for robust and fair empirical risk minimization via $f$-divergences.
The methodology is based on using Legendre-Fenchel duality to reformulate the problem as a min-max
optimization. Further, several algorithms and techniques are adopted to solve the resulting optimization
problems efficiently and with convergence guarantees. The proposed methodologies and techniques can
be adopted beyond robust and fair machine learning to solve a variety of learning tasks, such as meta-learning, contrastive learning, federated learning, domain generalization, and domain adaptation. We further discuss the broader impact of the methods developed throughout the thesis in the Conclusion chapter
(Chapter 9).
Chapter 2
Robust Imputation and Inference from Low Order Marginals
This chapter is based on the published paper in TMLR 2023 [24] with the same title.
The ubiquity of missing values in real-world datasets poses a challenge for statistical inference and can
prevent similar datasets from being analyzed in the same study, precluding many existing datasets from
being used for new analyses. While an extensive collection of packages and algorithms has been developed for data imputation, the overwhelming majority perform poorly when there are many missing values and
the sample size is low, which are unfortunately common characteristics of empirical data. Such low-accuracy
estimates adversely affect the performance of downstream statistical models.
inference framework for regression and classification in the presence of missing data without imputation. Our
framework, RIFLE (Robust InFerence via Low-order moment Estimations), estimates low-order moments
of the underlying data distribution with corresponding confidence intervals to learn a distributionally robust model. We specialize our framework to linear regression and normal discriminant analysis, and we
provide convergence and performance guarantees. This framework can also be adapted to impute missing data. In numerical experiments, we compare RIFLE to several state-of-the-art approaches (including
MICE, Amelia, MissForest, KNN-imputer, MIDA, and Mean Imputer) for imputation and inference in the
presence of missing values. Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively
small. RIFLE is publicly available at https://github.com/optimization-for-data-driven-science/RIFLE.
2.1 Introduction
Machine learning algorithms have shown promise when applied to various problems, including healthcare,
finance, social data analysis, image processing, and speech recognition. However, this success has mainly relied
on the availability of large-scale, high-quality datasets, which may be scarce in many practical problems,
especially in medical and health applications [253, 319, 33]. Moreover, many experiments and datasets in such applications
suffer from small sample sizes. Despite the small number of data points in each individual study, an increasingly large number of datasets are publicly available. To fully and effectively
utilize information on related research questions from diverse datasets, information across various datasets
(e.g., different questionnaires from multiple hospitals with overlapping questions) must be combined in a
reliable fashion.
Figure 2.1: Consider the problem of predicting the trait $y$ from the feature vector $(x_1, \ldots, x_{100})$. Suppose that we have access to three datasets: the first dataset includes measurements of $(x_1, x_2, \ldots, x_{40}, y)$ for $n_1$ individuals; the second dataset collects data from another $n_2$ individuals by measuring $(x_{30}, \ldots, x_{80})$, with no measurements of the target variable $y$; and the third dataset contains measurements of the variables $(x_{70}, \ldots, x_{100}, y)$ for $n_3$ individuals. How should one learn the predictor $\hat{y} = h(x_1, \ldots, x_{100})$ from these three datasets?
After integrating data from different studies, the obtained dataset can contain large blocks of missing
values, as they may not share the same features (Figure 2.1).
There are three general approaches for handling missing values in statistical inference (classification
and regression) tasks. A naïve method is to remove the rows containing missing entries. However, such
an approach is not an option when the percentage of missingness in a dataset is high. For instance, as
demonstrated in Figure 2.1, the entire dataset would be discarded if we eliminated the rows with at least one
missing entry.
The most common methodology for handling missing values in a learning task is to impute them in a
pre-processing stage. The general idea behind data imputation is that the missing values can be predicted
using the available data entries and correlated features. Imputation algorithms cover a wide range of
methods, including imputing missing entries with the column means (or medians) [205, Chapter 3], least-squares and linear regression-based methods [267, 173, 371, 57, 56], matrix completion and expectation
maximization approaches [95, 125, 148], KNN-based methods [333], tree-based methods [318, 350], and methods
using different neural network structures. Appendix 10.1.1 presents a comprehensive review of these
methods.
The imputation of data allows practitioners to run standard statistical algorithms requiring complete
data. However, the prediction model’s performance can be highly reliant on the accuracy of the imputer.
High error rates in the prediction of missing values by the imputer can lead to catastrophic performance
of the downstream statistical methods executed on the imputed data.
Another class of methods for inference in the presence of missing values relies on robust optimization
over the uncertainty sets on missing entries. [304] and [355] adopt robust optimization to learn the parameters of a support vector machine model. They consider uncertainty sets for the missing entries in the
dataset and solve a min-max problem over those sets. The obtained classifiers are robust to the uncertainty
Figure 2.2: Prediction of the target variable without imputation. RIFLE estimates confidence intervals for
low-order (first and second-order) marginals from the input data containing missing values. Then, it solves
a distributionally robust problem over the set of all distributions whose low-order marginals are within
the estimated confidence intervals.
of missing entries within the uncertainty regions. In contrast to the imputation-based approaches, the robust classification formulation does not carry the imputation error to the classification phase. However,
finding appropriate intervals for each missing entry is challenging, and it is unclear how to determine
the uncertainty range in many real datasets. Moreover, their proposed algorithms are limited to the SVM
classifier.
In this chapter, we propose RIFLE (Robust InFerence via Low-order moment Estimations) for the direct
inference of a target variable based on a set of features containing missing values. The proposed framework
does not require the data to be imputed in a pre-processing stage, although it can also be used as a
pre-processing tool for imputing data. The main idea of the proposed framework is to estimate the first-
and second-order moments of the data and their confidence intervals by bootstrapping on the available
data matrix entries. Then, RIFLE finds the optimal parameters of the statistical model for the worst-case
distribution whose low-order moments (mean and variance) lie within the estimated confidence intervals
(see Figure 2.2). Compared to [304, 355], we estimate uncertainty regions for the low-order marginals
using the bootstrap technique. Furthermore, our framework is not restricted to any particular machine
learning model, such as support vector machines [355].
Contributions: Our main contributions are as follows:
1. We present a distributionally robust optimization framework over the low-order marginals of the
training data distribution for inference in the presence of missing values. The proposed framework
does not require data imputation as a pre-processing stage. In Section 2.3 and Section 2.4, we specialize the framework to ridge regression and classification models, respectively, as two case studies.
The proposed framework provides a novel strategy for inference in the presence of missing data,
especially for datasets with large proportions of missing values.
2. We provide theoretical convergence guarantees and an iteration complexity analysis of the presented algorithms for the robust formulations of ridge linear regression and normal discriminant analysis. Moreover, we show the consistency of the predictions under mild assumptions and analyze the
asymptotic statistical properties of the solutions found by the algorithms.
3. While the robust inference framework is primarily designed for direct statistical inference in the
presence of missing values without performing data imputation, it can also be adopted as an imputation tool. To demonstrate the quality of the proposed imputer, we compare its performance
with several widely used imputation packages, such as MICE [56], Amelia [148], MissForest [318],
KNN-Imputer [333], MIDA [128], and GAIN [357], on real and synthetic datasets. Generally speaking,
our method outperforms all of the mentioned packages when the number of missing entries is large.
2.2 Robust Inference via Estimating Low-order Moments
RIFLE is based on a distributionally robust optimization (DRO) framework over low-order marginals. Assume that $(x, y) \in \mathbb{R}^d \times \mathbb{R}$ follows a joint probability distribution $P^*$. A standard approach for predicting the target variable $y$ given the input vector $x$ is to find the parameter $\theta$ that minimizes the population risk with respect to a given loss function $\ell$:

$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim P^*}\big[\ell(x, y; \theta)\big]. \tag{2.1}$$
Since the underlying distribution of the data is rarely available in practice, the above problem cannot be solved directly. The most common approach for approximating (2.1) is to minimize the empirical risk with respect to $n$ given i.i.d. samples $(x_1, y_1), \ldots, (x_n, y_n)$ drawn from the joint distribution $P^*$:

$$\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell(x_i, y_i; \theta).$$
The above empirical risk formulation assumes that all entries of $x_i$ and $y_i$ are available. Thus, to utilize the empirical risk minimization (ERM) framework in the presence of missing values, one can either remove or impute the missing data points in a pre-processing stage. Training via robust optimization is a natural alternative in the presence of missing data. [304, 355] suggest the following optimization problem that minimizes the loss function for the worst-case scenario over uncertainty sets defined per data point:

$$\min_{\theta} \; \max_{\{\delta_i \in \mathcal{N}_i\}_{i=1}^{n}} \; \frac{1}{n} \sum_{i=1}^{n} \ell(x_i - \delta_i, y_i; \theta), \tag{2.2}$$
where $\mathcal{N}_i$ represents the uncertainty region of data point $i$. [304] obtains the uncertainty sets by assuming a known distribution on the missing entries of the dataset. The main issue in their approach is that the constraints defined on the data points are totally uncorrelated. [355], on the other hand, defines $\mathcal{N}_i$ as a "box" constraint around data point $i$ such that the perturbations can be linearly correlated. For this specific case, they show that solving the corresponding robust optimization problem is equivalent to minimizing a regularized reformulation of the original loss function. Such an approach has several limitations: first, it can only handle a few special cases (the SVM loss with linearly correlated perturbations on the data points). Furthermore, [355] is primarily designed for handling outliers and contaminated data; thus, they do not offer any mechanism for the initial estimation of $x_i$ when several entries of the vector are missing. In this work, we instead take a distributionally robust approach by considering uncertainty on the data distribution instead
of defining an uncertainty set for each data point. In particular, we aim to fit the best parameters of a
statistical learning model for the worst distribution in a given uncertainty set by solving the following:
$$\min_{\theta} \; \max_{P \in \mathcal{P}} \; \mathbb{E}_{(x,y)\sim P}\big[\ell(x, y; \theta)\big], \tag{2.3}$$
where $\mathcal{P}$ is an uncertainty set over the underlying distribution of data. A key observation is that defining the uncertainty set $\mathcal{P}$ in (2.3) is easier and computationally more efficient than defining the uncertainty sets $\{\mathcal{N}_i\}_{i=1}^{n}$ in (2.2). In particular, the uncertainty set $\mathcal{P}$ can be obtained naturally by estimating low-order moments of the data distribution using only the available entries. To explain this idea and to simplify the notation, let $z = (x, y)$, $\bar{\mu}^z \triangleq \mathbb{E}[z]$, and $\bar{C}^z \triangleq \mathbb{E}[zz^T]$. While $\bar{\mu}^z$ and $\bar{C}^z$ are typically not known exactly, one can estimate them (within certain confidence intervals) from the available data by simply ignoring missing entries (assuming the missing value pattern is completely at random, i.e., MCAR). Moreover, we can estimate the confidence intervals via bootstrapping. In particular, we can estimate $\mu^z_{\min}$, $\mu^z_{\max}$, $C^z_{\min}$, and $C^z_{\max}$ from data such that $\mu^z_{\min} \leq \bar{\mu}^z \leq \mu^z_{\max}$ and $C^z_{\min} \leq \bar{C}^z \leq C^z_{\max}$ with high probability (where the inequalities for matrices and vectors denote component-wise relations). In Appendix 10.1.2, we show how a bootstrapping strategy can be used to obtain the confidence intervals described above. Given these estimated confidence intervals from data, (2.3) can be reformulated as
$$\begin{aligned}
\min_{\theta} \; \max_{P} \quad & \mathbb{E}_{P}\big[\ell(z; \theta)\big] \\
\text{s.t.} \quad & \mu^z_{\min} \leq \mathbb{E}_{P}[z] \leq \mu^z_{\max}, \\
& C^z_{\min} \leq \mathbb{E}_{P}[zz^T] \leq C^z_{\max}.
\end{aligned} \tag{2.4}$$
[123] utilize distributionally robust optimization as in (2.3) over the set of positive semi-definite (PSD) cones for robust inference under uncertainty. While their formulation considers $\ell_2$ balls for the constraints on the low-order moments of the data, we use $\ell_\infty$ constraints, which are computationally more natural in the presence of missing entries when combined with bootstrapping. Furthermore, while their method can be applied to general convex losses, it relies on the ellipsoid method and the existence of oracles for performing its steps, which is not applicable in modern high-dimensional problems. Moreover, they assume concavity in the data (the existence of an oracle returning the worst-case data points), which is practically unavailable even for convex loss functions (including the linear regression and normal discriminant analysis models studied in our work).
In Section 2.3, we study the proposed distributionally robust framework described in (2.4) for ridge linear regression. We design efficient, convergent first-order algorithms to solve the problem and show how the algorithms can be used for both inference and imputation in the presence of missing values. Further, in Appendix 10.1.7, we study the proposed distributionally robust framework for classification problems under the normality assumption on the features. In particular, we show how Framework (2.4) can be specialized to robust normal discriminant analysis in the presence of missing values.
2.3 Robust Linear Regression in the Presence of Missing Values
Let us specialize our framework to the ridge linear regression model. In the absence of missing data, ridge regression finds the optimal regressor parameter $\theta$ by solving

$$\min_{\theta} \; \|X\theta - y\|_2^2 + \lambda \|\theta\|_2^2,$$

or equivalently by solving:

$$\min_{\theta} \; \theta^T X^T X \theta - 2\theta^T X^T y + \lambda \|\theta\|_2^2. \tag{2.5}$$
Thus, having the second-order moments of the data, $C = X^T X$ and $b = X^T y$, is sufficient for finding the optimal solution. In other words, it suffices to compute the inner product of any two column vectors $a_i, a_j$ of $X$, and the inner product of any column $a_i$ of $X$ with the vector $y$. Since the matrix $X$ and the vector $y$ are not fully observed due to the existence of missing values, one can use the available data (see (10.1) for details) to compute the point estimators $C_0$ and $b_0$. These point estimators can be highly inaccurate, especially when the number of non-missing rows for two given columns is small. In addition, if the pattern of missing entries does not follow the MCAR assumption, the point estimators are not unbiased estimators of $C$ and $b$.
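To make this concrete, the following minimal NumPy sketch computes such entrywise point estimates from pairwise-available observations. It illustrates the idea rather than reproducing the exact routine (10.1) from the appendix; the function and variable names are ours, and missing entries are assumed to be encoded as NaN.

import numpy as np

def point_estimates(X, y):
    # Entrywise estimates of C0 ~ E[x x^T] and b0 ~ E[x y]: each (i, j) entry
    # is averaged only over the rows where both columns are observed.
    n, d = X.shape
    C0 = np.zeros((d, d))
    b0 = np.zeros(d)
    for i in range(d):
        for j in range(d):
            mask = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            if mask.any():
                C0[i, j] = np.mean(X[mask, i] * X[mask, j])
        mask = ~np.isnan(X[:, i]) & ~np.isnan(y)
        if mask.any():
            b0[i] = np.mean(X[mask, i] * y[mask])
    return C0, b0

Under the MCAR assumption, each such average is an unbiased estimate of the corresponding moment; the sum-based $C = X^T X$ differs only by a scaling factor.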
2.3.1 A Distributionally Robust Formulation of Linear Regression
As mentioned above, to solve the linear regression problem, we only need to estimate the second-order moments of the data ($X^T X$ and $X^T y$). Thus, the distributionally robust formulation described in (2.4) is equivalent to the following optimization problem for the linear regression model:

$$\begin{aligned}
\min_{\theta} \; \max_{C, b} \quad & \theta^T C \theta - 2 b^T \theta + \lambda \|\theta\|_2^2 \\
\text{s.t.} \quad & C_0 - c\Delta \leq C \leq C_0 + c\Delta, \\
& b_0 - c\delta \leq b \leq b_0 + c\delta, \\
& C \succeq 0,
\end{aligned} \tag{2.6}$$
where the last constraint guarantees that the covariance matrix is positive semi-definite. We discuss the procedure for estimating the confidence intervals ($b_0$, $C_0$, $\delta$, and $\Delta$) in Appendix 10.1.2.
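Since the exact interval construction is deferred to Appendix 10.1.2, we only sketch the bootstrapping idea here: resample the rows with replacement, recompute the entrywise point estimates on each resample, and use the spread of the resulting estimates as the interval radii $\Delta$ and $\delta$. Using the bootstrap standard deviation as the radius (to be scaled by the hyper-parameter $c$ in (2.6)) is our simplifying assumption, and the names are ours; point_estimates is the sketch from the previous section.

import numpy as np

def bootstrap_radii(X, y, n_boot=50, seed=0):
    # Bootstrap spread of the entrywise point estimates, used here as the
    # radii (Delta, delta) of the box constraints around (C0, b0).
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Cs, bs = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        C_b, b_b = point_estimates(X[idx], y[idx])
        Cs.append(C_b)
        bs.append(b_b)
    Delta = np.std(np.stack(Cs), axis=0)
    delta = np.std(np.stack(bs), axis=0)
    return Delta, delta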
2.3.2 RIFLE for Ridge Linear Regression
Since the objective function in (2.6) is convex in $\theta$ (ridge regression) and concave in $b$ and $C$ (linear), the minimization and maximization sub-problems are interchangeable [312]. Thus, we can equivalently rewrite Problem (2.6) as:

$$\begin{aligned}
\max_{C, b} \quad & g(C, b) \\
\text{s.t.} \quad & C_0 - c\Delta \leq C \leq C_0 + c\Delta, \\
& b_0 - c\delta \leq b \leq b_0 + c\delta, \\
& C \succeq 0,
\end{aligned} \tag{2.7}$$

where $g(C, b) = \min_{\theta} \theta^T C \theta - 2 b^T \theta + \lambda \|\theta\|_2^2$. The function $g$ can be computed in closed form for any given pair $(C, b)$ by setting $\theta = (C + \lambda I)^{-1} b$. Thus, using Danskin's Theorem [87], we can apply projected gradient ascent to $g$ to find an optimal solution of (2.7), as described in Algorithm 1. At each iteration of the algorithm, we first perform one step of projected gradient ascent on the matrix $C$ and the vector $b$; then we update $\theta$ in closed form for the obtained $C$ and $b$. We initialize $C$ and $b$ using entrywise point estimation on the available rows (see Equation (10.1) in Appendix 10.1.2).
Algorithm 1 RIFLE for Ridge Linear Regression in the Presence of Missing Values
1: Input: $C_0$, $b_0$, $\Delta$, $\delta$, $T$
2: Initialize: $C = C_0$, $b = b_0$, $\theta = (C_0 + \lambda I)^{-1} b_0$.
3: for $i = 1, \ldots, T$ do
4:   Update $C = \Pi_{\Delta^+}\big(C + \alpha \theta\theta^T\big)$
5:   Update $b = \Pi_{\delta}(b - 2\alpha\theta)$
6:   Set $\theta = (C + \lambda I)^{-1} b$
The projection of $b$ onto the box constraint $b_0 - c\delta \leq b \leq b_0 + c\delta$ can be done entrywise and has the following closed form:

$$\Pi_{\delta}(b_i) = \begin{cases} b_i, & \text{if } b_{0i} - c\delta_i \leq b_i \leq b_{0i} + c\delta_i, \\ b_{0i} - c\delta_i, & \text{if } b_i < b_{0i} - c\delta_i, \\ b_{0i} + c\delta_i, & \text{if } b_{0i} + c\delta_i < b_i. \end{cases}$$
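A minimal NumPy sketch of Algorithm 1 follows, with the PSD constraint relaxed as in our implementation; the function signature and default values are ours.

import numpy as np

def rifle_ridge(C0, b0, Delta, delta, lam=1.0, c=1.0, alpha=1e-3, T=1000):
    # Projected gradient ascent on g(C, b); theta is updated in closed form.
    # The PSD constraint on C is dropped, as discussed below.
    d = len(b0)
    C, b = C0.copy(), b0.copy()
    theta = np.linalg.solve(C + lam * np.eye(d), b)
    for _ in range(T):
        # ascent steps on C and b, each followed by a box projection
        C = np.clip(C + alpha * np.outer(theta, theta),
                    C0 - c * Delta, C0 + c * Delta)
        b = np.clip(b - 2.0 * alpha * theta, b0 - c * delta, b0 + c * delta)
        # closed-form inner minimization: theta = (C + lam I)^{-1} b
        theta = np.linalg.solve(C + lam * np.eye(d), b)
    return theta, C, b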
Theorem 1 Let $(\tilde{\theta}, \tilde{C}, \tilde{b})$ be the optimal solution of (2.6), $\theta^*(b, C) = \operatorname{argmin}_{\theta}\, \theta^T C \theta - 2 b^T \theta + \lambda\|\theta\|_2^2$, and $D = \|C_0 - \tilde{C}\|_F^2 + \|b_0 - \tilde{b}\|_2^2$. Assume that, for any $b$ and $C$ within the uncertainty (constraint) sets described in (2.6), $\|\theta^*(b, C)\| \leq \tau$. Then Algorithm 1 computes an $\epsilon$-optimal solution of the objective function in (2.7) in $\mathcal{O}\big(\frac{D(\tau+1)^2}{\lambda\epsilon}\big)$ iterations.
Proof: The proof is relegated to Appendix 10.1.9. □
In Appendix 10.1.3, we show how Nesterov's acceleration method can improve the convergence rate of Algorithm 1 to $\mathcal{O}\Big(\sqrt{\frac{D(\tau+1)^2}{\epsilon\lambda}}\Big)$. A technical issue of Algorithm 1 and its accelerated version presented in Appendix 10.1.3 is that the projection of $C$ onto the intersection of the box constraints and the set of positive semidefinite matrices ($\Pi_{\Delta^+}[C]$) is challenging and cannot be done in closed form. In the implementation of Algorithm 1, we relax the problem by removing the PSD constraint on $C$, which avoids this complexity and the time-consuming singular value decomposition at each iteration. This relaxation does not drastically change the algorithm's performance, as our experiments in Section 2.5 show. A more systematic approach is to write the dual of the maximization problem and handle the resulting constrained minimization problem with the Alternating Direction Method of Multipliers (ADMM); the detailed procedure of this approach can be found in Appendix 10.1.4. All these algorithms provably converge to the optimal points of Problem (2.6). In addition to the theoretical convergence, we numerically evaluate the convergence of the resulting algorithms in Appendix 10.1.12. Further, the proposed algorithms are consistent, as discussed in Appendix 10.1.11.
2.3.3 Performance Guarantees for RIFLE
Thus far, we have discussed how to efficiently solve the robust linear regression problem in the presence of missing values. A natural question in this context concerns the statistical performance of the optimal solution obtained in the previous section on unseen test data points. Theorem 2 answers this question from two perspectives: assuming that the missing values are distributed completely at random, our estimators are consistent. Moreover, for the finite-sample case, Theorem 2 part (b) states that, with a proper choice of the confidence intervals, the test loss of the obtained solution is bounded by its training loss with high probability. Note that the results regarding the performance of the robust estimator hold for the MCAR missing pattern in general; however, in Section 2.5 we perform several experiments on datasets with MNAR patterns to show how RIFLE works in practice on such data.
Theorem 2 Assume the data domain is bounded and that the missing pattern of the data follows MCAR. Let $X_{n\times d}$, $y$ be the training data drawn i.i.d. from the ground-truth distribution $P^*$ with low-order moments $C^*$ and $b^*$. Further, assume that each entry of $X$ and $y$ is missing with probability $p < 1$. Let $(\tilde{\theta}_n, \tilde{C}_n, \tilde{b}_n)$ be the solution of Problem (2.6).

(a) Consistency of the Covariance Estimator: As the number of data points goes to infinity, the estimated low-order marginals converge almost surely to the ground-truth values. More precisely,

$$\lim_{n\to\infty} \tilde{C}_n = \mathbb{E}_{P^*}[xx^T] \quad a.s., \tag{2.8}$$
$$\lim_{n\to\infty} \tilde{b}_n = \mathbb{E}_{P^*}[xy] \quad a.s. \tag{2.9}$$
(b) Defining

$$\mathcal{L}_{\text{train}}(\tilde{\theta}_n) = \tilde{\theta}_n^T \tilde{C}_n \tilde{\theta}_n - 2\tilde{b}_n^T \tilde{\theta}_n + \lambda\|\tilde{\theta}_n\|_2^2,$$
$$\mathcal{L}_{\text{test}}(\tilde{\theta}_n) = \tilde{\theta}_n^T C^* \tilde{\theta}_n - 2 b^{*T} \tilde{\theta}_n + \lambda\|\tilde{\theta}_n\|_2^2,$$

where $C^* = \mathbb{E}_{(x,y)\sim P^*}[xx^T]$ and $b^* = \mathbb{E}_{(x,y)\sim P^*}[xy]$ are the ground-truth second-order moments. Given $V = \max_{i,j} \operatorname{Var}(X_i X_j)$ (the maximum variance of pairwise feature products), with probability at least $1 - \frac{d^2 V}{2c^2 \Delta^2 n(1-p)}$, we have:

$$\mathcal{L}_{\text{test}}(\tilde{\theta}) \leq \mathcal{L}_{\text{train}}(\tilde{\theta}), \tag{2.10}$$

where $\Delta = \min_{i,j}\{\Delta_{ij}\}$ and $c$ is the hyper-parameter controlling the size of the confidence intervals, as presented in (2.6).
Proof: The proof is relegated to Appendix 10.1.9. □
2.3.4 Imputation of Missing Values and Going Beyond Linear Regression
RIFLE can be used for imputing missing data. To this end, we impute the different features of a given dataset independently. More precisely, to impute a feature containing missing values, we treat it as the target variable $y$ and the rest of the features as the input $X$ in our methodology. Then, we train a model to predict the feature $y$ given $X$ via Algorithm 1 (or its ADMM version, Algorithm 19, in the appendix). Let the obtained optimal solutions be $C^*$, $b^*$, and $\theta^*$. For a given missing entry, we can use $\theta^*$ only if all other features in the row of that missing entry are available. However, that is usually not the case in practice, as each row can contain more than one missing entry. Therefore, one can learn a separate model for each missing pattern in the dataset. Let us clarify this point through the example in Figure 2.1.
In this example, we have three different missing patterns (one per dataset). For missing entries in Dataset 1, the first forty features are available. Let $r_j$ denote the vector of the first 40 features in row $j$. Assume that we aim to impute entry $i \in \{41, \ldots, 100\}$ in row $j$, denoted by $x_{ji}$. To this end, we restrict $X$ to the first 40 features. Moreover, we consider $y = x_i$ as the target variable. Then, we run Algorithm 1 on $X$ and $y$ to obtain the optimal $C^*$, $b_i^*$, and $\theta_i^*$. Consequently, we impute $x_{ji}$ as follows:

$$x_{ji} = r_j^T \theta_i^*.$$
We can use the same methodology for imputing the missing entries of each feature under the missing patterns of Dataset 2 and Dataset 3. While this approach is reasonable for the missing pattern observed in Figure 2.1, in many practical problems different rows can have distinct missing patterns. Thus, in the worst case, Algorithm 1 must be executed once per missing entry. Such an approach is computationally expensive and might be infeasible for large-scale datasets containing large numbers of missing entries. Alternatively, one can run Algorithm 1 only once to obtain $C^*$ and $b^*$ (considered the "worst-case/pessimistic" estimates of the moments). Then, to impute each missing entry, $C^*$ and $b^*$ are restricted to the features available in that missing entry's row. Having the restricted $C^*$ and $b^*$, the regressor $\theta^*$ can be obtained in closed form (line 6 in Algorithm 1). In this approach, we run Algorithm 1 once and find the optimal $\theta^*$ for each missing entry based on the estimated $C^*$ and $b^*$. This approach can lead to sub-optimal solutions compared to the former approach, but it is much faster and more scalable.
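A minimal sketch of this faster variant is given below, under the assumption that the worst-case second-moment matrix $C^*$ of the full feature vector (including the feature being imputed) is stored from a single run; the helper name and indexing conventions are ours.

import numpy as np

def impute_entry(x_row, i, C_star, lam=1.0):
    # Impute feature i of a row using only its observed features: restrict
    # the worst-case moments to the observed coordinates and solve the
    # closed-form ridge system (line 6 of Algorithm 1).
    obs = np.where(~np.isnan(x_row))[0]          # observed feature indices
    C_oo = C_star[np.ix_(obs, obs)]              # restricted second moments
    b_oi = C_star[obs, i]                        # cross-moments with feature i
    theta = np.linalg.solve(C_oo + lam * np.eye(len(obs)), b_oi)
    return x_row[obs] @ theta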
Beyond Linear Regression: While the developed methods are primarily designed for ridge linear regression, one can apply non-linear transformations (kernels) to obtain nonlinear models. In Appendix 10.1.5, we show how to extend the developed algorithms to quadratic models. The RIFLE framework applied to the quadratically transformed data is called QRIFLE.
2.4 Robust Classification Framework
In this section, we study the proposed framework in (2.4) for classification tasks in the presence of missing values. Since the target variable $y \in \mathcal{Y} = \{1, \ldots, M\}$ takes discrete values in classification tasks, we consider the uncertainty sets over the data's first- and second-order marginals given each target value (label) separately. Therefore, the distributionally robust classification over low-order marginals can be described as:

$$\begin{aligned}
\min_{w} \; \max_{P} \quad & \mathbb{E}_{P}\big[\ell(x, y, w)\big] \\
\text{s.t.} \quad & \mu_{\min,y} \leq \mathbb{E}_{P}[x \mid y] \leq \mu_{\max,y} \quad \forall y \in \mathcal{Y}, \\
& \Sigma_{\min,y} \leq \mathbb{E}_{P}[xx^T \mid y] \leq \Sigma_{\max,y} \quad \forall y \in \mathcal{Y},
\end{aligned} \tag{2.11}$$

where $\mu_{\min}$, $\mu_{\max}$, $\Sigma_{\min}$, and $\Sigma_{\max}$ are the estimated confidence intervals for the first- and second-order moments of the data distribution. Unlike the robust linear regression task in Section 2.3, the evaluation of the objective function in (2.11) might depend on higher-order marginals (beyond second order) due to the nonlinearity of the loss function. As a result, Problem (2.11) is a non-convex non-concave min-max optimization problem that is intractable in general. For the sake of computational tractability, we restrict the distribution in the inner maximization problem to the set of normal distributions. In the following section, we specialize (2.11) to quadratic discriminant analysis as a case study. The methodology can be extended to other popular classification algorithms, such as support vector machines and multi-layer neural networks.
2.4.1 Robust Quadratic Discriminant Analysis
Learning a logistic regression model on datasets containing missing values has been studied extensively in the literature [121, 2]. Besides deleting missing values and imputation-based approaches, [121] models the logistic regression task in the presence of missing values as a linear discriminant analysis problem, where the underlying assumption is that the predictors follow a normal distribution conditional on the labels. Mathematically speaking, they assume that the data points assigned to a specific label follow a Gaussian distribution, i.e., $x \mid y = i \sim \mathcal{N}(\mu_i, \Sigma)$. They use the available data to estimate the parameters of each Gaussian distribution. Therefore, the parameters of the logistic regression model can be assigned based on the estimated parameters of the Gaussian distributions for the different classes. Similar to the linear regression case, the estimates of the means and covariances are unbiased only when the data satisfies the MCAR condition. Moreover, when the number of data points in the dataset is small, the variance of the estimates can be very high. Thus, to train a logistic regression model that is robust to the percentage and different types of missing values, we specialize the general robust classification framework formulated in Equation (2.11) to the logistic regression model. Instead of considering a common covariance matrix for the conditional distributions of $x$ given the labels $y$ (linear discriminant analysis), we assume the more general case where each conditional distribution has its own covariance matrix (quadratic discriminant analysis).
Assume that $x \mid y \sim \mathcal{N}(\mu_y, \Sigma_y)$ for $y = 0, 1$. We aim to find the optimal solution of the following problem:

$$\begin{aligned}
\min_{w} \; \max_{\mu_0, \mu_1, \Sigma_0, \Sigma_1} \quad & \mathbb{E}_{x|y=1 \sim \mathcal{N}(\mu_1, \Sigma_1)}\big[-\log\big(\sigma(w^T x)\big)\big]\, P(y = 1) \\
& + \mathbb{E}_{x|y=0 \sim \mathcal{N}(\mu_0, \Sigma_0)}\big[-\log\big(1 - \sigma(w^T x)\big)\big]\, P(y = 0) \\
\text{s.t.} \quad & \mu_{\min_0} \leq \mu_0 \leq \mu_{\max_0}, \quad \mu_{\min_1} \leq \mu_1 \leq \mu_{\max_1}, \\
& \Sigma_{\min_0} \leq \Sigma_0 \leq \Sigma_{\max_0}, \quad \Sigma_{\min_1} \leq \Sigma_1 \leq \Sigma_{\max_1},
\end{aligned} \tag{2.12}$$

where $\sigma(x) = 1/\big(1 + \exp(-x)\big)$ is the sigmoid function.
To solve Problem (2.12), we first focus on the scenario where the target variable has no missing values. In this case, each data point contributes to the estimation of either $(\mu_1, \Sigma_1)$ or $(\mu_0, \Sigma_0)$, depending on its label. Similar to the robust linear regression case, we can apply Algorithm 16 to estimate the confidence intervals for $\mu_i, \Sigma_i$ using the data points whose target variable equals $i$ ($y = i$).

The objective function is convex in $w$, since the logistic regression loss is convex and the expectation of the loss can be seen as a weighted summation, which preserves convexity. Thus, fixing $\mu, \Sigma$, the outer minimization problem can be solved with respect to $w$ using standard first-order methods such as gradient descent.
Although the robust reformulation of logistic regression stated in (2.12) is convex in $w$ and concave in $\mu_0$ and $\mu_1$, the inner maximization problem is intractable with respect to $\Sigma_0$ and $\Sigma_1$. We approximate Problem (2.12) in the following manner:

$$\begin{aligned}
\min_{w} \; \max_{\mu_0, \Sigma_0, \mu_1, \Sigma_1} \quad & \pi_1 \mathbb{E}_{x|y=1 \sim \mathcal{N}(\mu_1, \Sigma_1)}\big[-\log\big(\sigma(w^T x)\big)\big] + \pi_0 \mathbb{E}_{x|y=0 \sim \mathcal{N}(\mu_0, \Sigma_0)}\big[-\log\big(1 - \sigma(w^T x)\big)\big] \\
\text{s.t.} \quad & \mu_{\min_0} \leq \mu_0 \leq \mu_{\max_0}, \quad \mu_{\min_1} \leq \mu_1 \leq \mu_{\max_1}, \\
& \Sigma_0 \in \{\Sigma_{01}, \Sigma_{02}, \ldots, \Sigma_{0k}\}, \quad \Sigma_1 \in \{\Sigma_{11}, \Sigma_{12}, \ldots, \Sigma_{1k}\},
\end{aligned} \tag{2.13}$$
where $\pi_1 = P(y = 1)$ and $\pi_0 = P(y = 0)$. To compute the optimal $\mu_0$ and $\mu_1$, we have:

$$\max_{\mu_1} \; \mathbb{E}_{x \sim \mathcal{N}(\mu_1, \Sigma_1)}\big[-\log\big(\sigma(w^T x)\big)\big] \quad \text{s.t.} \quad \mu_{\min} \leq \mu_1 \leq \mu_{\max}. \tag{2.14}$$
Theorem 3 Let $a[i]$ be the $i$-th element of vector $a$. The optimal solution of Problem (2.14) has the following form:

$$\mu_1^*[i] = \begin{cases} \mu_{\max}[i], & \text{if } w[i] \leq 0, \\ \mu_{\min}[i], & \text{if } w[i] > 0. \end{cases} \tag{2.15}$$
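In code, (2.15) reduces to a single vectorized selection (a minimal sketch; the function name is ours):

import numpy as np

def optimal_mu1(w, mu_min, mu_max):
    # Worst-case mean from Theorem 3: mu_max[i] where w[i] <= 0,
    # mu_min[i] where w[i] > 0.
    return np.where(w <= 0, mu_max, mu_min)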
Note that we relaxed (2.12) by taking the maximization problem over a finite set of $\Sigma$ estimates. We estimate each $\Sigma$ by bootstrapping on the available data using Algorithm 16. Define $f_i(w)$ as:

$$f_i(w) = \pi_1 \mathbb{E}_{x \sim \mathcal{N}(\mu_1^*, \Sigma_{i1})}\big[-\log\big(\sigma(w^T x)\big)\big]. \tag{2.16}$$

Similarly, we can define:

$$g_i(w) = \pi_0 \mathbb{E}_{x \sim \mathcal{N}(\mu_0^*, \Sigma_{i0})}\big[-\log\big(1 - \sigma(w^T x)\big)\big]. \tag{2.17}$$
Since the maximization problem is over a finite set, we can rewrite Problem (2.13) as:

$$\begin{aligned}
\min_{w} \max_{i,j \in \{1,\ldots,k\}} f_i(w) + g_j(w) \; = \; \min_{w} \; \max_{p_1,\ldots,p_k, q_1,\ldots,q_k} \quad & \sum_{i=1}^{k} p_i f_i(w) + \sum_{j=1}^{k} q_j g_j(w) \\
\text{s.t.} \quad & \sum_{i=1}^{k} p_i = 1, \; p_i \geq 0, \\
& \sum_{j=1}^{k} q_j = 1, \; q_j \geq 0.
\end{aligned} \tag{2.18}$$
Since the maximum of several functions is not necessarily smooth (differentiable), we add a quadratic regularization term to the maximization problem, which accelerates the convergence rate [244]:

$$\begin{aligned}
\min_{w} \; \max_{p_1,\ldots,p_k, q_1,\ldots,q_k} \quad & \sum_{i=1}^{k} p_i f_i(w) - \delta \sum_{i=1}^{k} p_i^2 + \sum_{j=1}^{k} q_j g_j(w) - \delta \sum_{j=1}^{k} q_j^2 \\
\text{s.t.} \quad & \sum_{i=1}^{k} p_i = 1, \; p_i \geq 0, \\
& \sum_{j=1}^{k} q_j = 1, \; q_j \geq 0.
\end{aligned} \tag{2.19}$$
First, we show how to solve the inner maximization problem. Note that the $p_i$'s and $q_i$'s are independent. We show how to find the optimal $p_i$'s; optimizing with respect to the $q_i$'s is similar. Since the maximization problem is a constrained quadratic program, we can write the Lagrangian function as follows:

$$\begin{aligned}
\max_{p_1,\ldots,p_k} \quad & \sum_{i=1}^{k} p_i f_i(w) - \delta \sum_{i=1}^{k} p_i^2 - \lambda\Big(\sum_{i=1}^{k} p_i - 1\Big) \\
\text{s.t.} \quad & p_i \geq 0.
\end{aligned} \tag{2.20}$$
Given the optimal $\lambda$, the above problem has a closed-form solution with respect to each $p_i$:

$$p_i^* = \Big[\frac{-\lambda + f_i}{2\delta}\Big]_+.$$

Since $p_i^*$ is a non-increasing function of $\lambda$, we can find the optimal value of $\lambda$ using bisection. Algorithm 2 demonstrates how to find an $\epsilon$-optimal $\lambda$ and the corresponding $p_i^*$'s efficiently using the bisection idea.
Algorithm 2 Finding the optimal $\lambda$ and $p_i$'s using the bisection idea
1: Initialize: $\lambda_{\text{low}} = 0$, $\lambda_{\text{high}} = \max_i f_i$, $p_i = 0 \;\; \forall i \in \{1, 2, \ldots, k\}$.
2: while $\big|\sum_{i=1}^{k} p_i - 1\big| > \epsilon$ do
3:   $\lambda = \frac{\lambda_{\text{low}} + \lambda_{\text{high}}}{2}$
4:   Set $p_i = \big[\frac{-\lambda + f_i}{2\delta}\big]_+ \;\; \forall i \in \{1, 2, \ldots, k\}$
5:   if $\sum_{i=1}^{k} p_i < 1$ then
6:     $\lambda_{\text{high}} = \lambda$
7:   else
8:     $\lambda_{\text{low}} = \lambda$
9: return $\lambda, p_1, p_2, \ldots, p_k$.

Remark 4 An alternative method for finding the optimal $\lambda$ and $p_i$'s is to first sort the $f_i$ values in $\mathcal{O}(k \log k)$ time and then find the smallest $f_i$ such that setting $\lambda = f_i$ makes the sum of the $p_i$'s exceed 1 (let $j$ be the index of that value). Without loss of generality, assume that $f_1 \leq \cdots \leq f_k$. Then $\sum_{i=j}^{k} \frac{-\lambda + f_i}{2\delta} = 1$, which has a closed-form solution with respect to $\lambda$.
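A minimal NumPy sketch of Algorithm 2 follows (names and the iteration cap are ours); it assumes, as the initialization of Algorithm 2 does, that $\delta$ is small enough for the weights to sum to at least one at $\lambda = 0$.

import numpy as np

def optimal_p(f, delta, eps=1e-8, max_iter=200):
    # Bisection on the multiplier lambda (Algorithm 2) so that the weights
    # p_i = [(f_i - lambda) / (2*delta)]_+ sum to one.
    lam_low, lam_high = 0.0, float(np.max(f))
    lam = 0.0
    p = np.maximum(f / (2.0 * delta), 0.0)  # weights at lambda = 0
    for _ in range(max_iter):
        if abs(p.sum() - 1.0) <= eps:
            break
        lam = 0.5 * (lam_low + lam_high)
        p = np.maximum((f - lam) / (2.0 * delta), 0.0)
        if p.sum() < 1.0:
            lam_high = lam  # weights too small: lambda must decrease
        else:
            lam_low = lam
    return lam, p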
To update $w$, we need to solve the following optimization problem:

$$\min_{w} \; \sum_{i=1}^{k} p_i^* f_i(w) + \sum_{j=1}^{k} q_j^* g_j(w). \tag{2.21}$$
Similar to the standard statistical learning framework, we solve the following empirical risk minimization problem by applying gradient descent to $w$ on a finite data sample. Define $\hat{f}_i$ as follows:

$$\hat{f}_i(w) = \pi_1 \sum_{t=1}^{n} \big[-\log\big(\sigma(w^T x_t)\big)\big], \tag{2.22}$$

where $x_1, \ldots, x_n$ are generated from the distribution $\mathcal{N}(\mu_1^*, \Sigma_{1i})$. The empirical risk minimization problem can be written as follows:

$$\min_{w} \; \sum_{i=1}^{k} p_i^* \hat{f}_i(w) + \sum_{j=1}^{k} q_j^* \hat{g}_j(w). \tag{2.23}$$
Algorithm 3 summarizes the robust quadratic discriminant analysis method for the case where the labels of all data points are available. Theorem 5 demonstrates that the gradient descent algorithm applied to (2.23) converges to an $\epsilon$-optimal solution in $\mathcal{O}\big(\frac{k}{\epsilon}\log(\frac{M}{\epsilon})\big)$ iterations.
Algorithm 3 Robust Quadratic Discriminant Analysis in the Presence of Missing Values
1: Input: $X_0$, $X_1$: matrices of data points with labels 0 and 1, respectively; $T$: number of iterations; $\alpha$: step size.
2: Estimate $\mu_{\min_0}$ and $\mu_{\max_0}$ using the available entries of $X_0$.
3: Estimate $\mu_{\min_1}$ and $\mu_{\max_1}$ using the available entries of $X_1$.
4: Estimate $\Sigma_{01}, \ldots, \Sigma_{0k}$ using the bootstrap estimator on the available data of $X_0$.
5: Estimate $\Sigma_{11}, \ldots, \Sigma_{1k}$ using the bootstrap estimator on the available data of $X_1$.
6: for $i = 1, \ldots, T$ do
7:   Compute $\mu_1^*$ and $\mu_0^*$ by Equation (2.15).
8:   Find the optimal $p_1, \ldots, p_k$ and $q_1, \ldots, q_k$ using Algorithm 2.
9:   $w = w - \alpha\Big(\sum_{i=1}^{k} p_i^* \nabla\hat{f}_i(w) + \sum_{j=1}^{k} q_j^* \nabla\hat{g}_j(w)\Big)$
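Since $\hat{f}_i$ in (2.22) is an average over samples drawn from $\mathcal{N}(\mu_1^*, \Sigma_{1i})$, line 9 of Algorithm 3 can be implemented with Monte Carlo estimates of the gradients. The following is a minimal sketch of one update; the function name, the sample size, and drawing fresh samples at every iteration are our assumptions.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def qda_step(w, mu1, Sig1_list, mu0, Sig0_list, p, q, pi1, pi0,
             alpha=0.01, n_samples=200, rng=None):
    # One gradient step (line 9 of Algorithm 3): the expected logistic losses
    # under each candidate Gaussian are approximated by sample averages.
    rng = rng if rng is not None else np.random.default_rng(0)
    grad = np.zeros_like(w)
    for p_i, Sig in zip(p, Sig1_list):   # class-1 terms: -log(sigma(w^T x))
        x = rng.multivariate_normal(mu1, Sig, size=n_samples)
        grad += p_i * pi1 * (-(1.0 - sigmoid(x @ w)) @ x) / n_samples
    for q_j, Sig in zip(q, Sig0_list):   # class-0 terms: -log(1 - sigma(w^T x))
        x = rng.multivariate_normal(mu0, Sig, size=n_samples)
        grad += q_j * pi0 * (sigmoid(x @ w) @ x) / n_samples
    return w - alpha * grad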
Theorem 5 Assume that $M = \max_i f_i$. The gradient descent algorithm requires $\mathcal{O}\big(\frac{k}{\epsilon}\log(\frac{M}{\epsilon})\big)$ gradient evaluations to converge to an $\epsilon$-optimal saddle point of the optimization problem (2.23).
In Appendix 10.1.7, we extend the methodology to the case where y contains missing entries.
2.5 Experiments
In this section, we evaluate RIFLE's performance on a diverse set of inference tasks in the presence of missing values. We compare RIFLE to several state-of-the-art approaches for data imputation on synthetic and real-world datasets. The experiments are designed so that the sensitivity of the model to factors such as the number of samples, the data dimension, and the type and proportion of missing values can be evaluated. The description of all datasets used in the experiments can be found in Appendix 10.1.10.
2.5.1 Evaluation Metrics
We need access to the ground-truth values of the missing entries to evaluate RIFLE and other state-of-the-art imputation approaches. Hence, we artificially mask a proportion of the available data entries and predict them with the different imputation methods. A method performs better than others if its predicted missing entries are closer to the ground-truth values. To measure the performance of RIFLE and the existing approaches on a regression task for a given test dataset consisting of $N$ data points, we use the normalized root mean squared error (NRMSE), defined as:

$$\text{NRMSE} = \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}}{\sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - \bar{y})^2}},$$

where $y_i$, $\hat{y}_i$, and $\bar{y}$ represent the true value of the $i$-th data point, the predicted value of the $i$-th data point, and the average of the true values of the data points, respectively. In all experiments, the generated missing entries follow either a missing completely at random (MCAR) or a missing not at random (MNAR) pattern. A discussion of the procedure for generating these patterns can be found in Appendix 10.1.8.
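For reference, the metric is straightforward to compute (a minimal sketch):

import numpy as np

def nrmse(y_true, y_pred):
    # RMSE of the predictions divided by the RMSE of the constant
    # mean predictor.
    num = np.sqrt(np.mean((y_true - y_pred) ** 2))
    den = np.sqrt(np.mean((y_true - np.mean(y_true)) ** 2))
    return num / den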
2.5.2 Tuning Hyper-parameters of RIFLE
The hyper-parameter $c$ in (2.7) controls the robustness of the model by adjusting the size of the confidence intervals. This parameter is tuned by cross-validation over the set $\{0.1, 0.25, 0.5, 1, 2, 5, 10, 20, 50, 100\}$, and the value with the lowest NRMSE is chosen. The default value in the implementation is $c = 1$, since it consistently performs well across different experiments. Furthermore, $\lambda$, the hyper-parameter of the ridge regression regularizer, is tuned over the set $\{0.01, 0.1, 0.5, 1, 2, 5, 10, 20, 50\}$ using 20% of the data as a validation set. To tune $K$, the number of bootstrap samples for estimating the confidence intervals, we tried 10, 20, 50, and 100; no significant difference in test performance was observed for these values.
We tune the hyper-parameters of the competing packages as follows. For KNN-Imputer [333], we try $\{2, 10, 20, 50\}$ for the number of neighbors ($K$) and pick the one with the highest performance. For MICE [56] and Amelia [148], we generate five different imputed datasets and pick the one with the highest performance on the test data. MissForest has multiple hyper-parameters; we keep the criterion as "MSE", since our performance evaluation measure is NRMSE, and tune the number of iterations and the number of estimators (trees) by checking values from $\{5, 10, 20\}$ and $\{50, 100, 200\}$, respectively. We do not change the structure of the neural networks for MIDA [128] and GAIN [357], and the default versions are used to impute the datasets.
2.5.3 RIFLE Consistency
In Theorem 2 Part (a), we demonstrated that RIFLE is consistent.
Figure 2.3: Comparing the consistency of RIFLE, MissForest, KNN Imputer, MICE, Amelia, and Expectation Maximization methods on a synthetic dataset containing 40% missing values.
In Figure 2.3, we investigate the consistency of RIFLE on synthetic datasets with different proportions of missing values. The synthetic data has 50 input features following a jointly normal distribution whose mean entries are randomly chosen from the interval $(-100, 100)$. The covariance matrix equals $\Sigma = SS^T$, where the elements of $S$ are randomly picked from $(-1, 1)$ and the dimension of $S$ is $50 \times 20$. The target variable is a linear function of the input features plus zero-mean normal noise with a standard deviation of 0.01. As depicted in Figure 2.3, RIFLE requires fewer samples to recover the ground-truth parameters of the model compared to MissForest, KNN Imputer, Expectation Maximization [95], and MICE. Amelia performs notably well since the predictors follow a joint normal distribution and the underlying model is linear.
Dataset Name RIFLE QRIFLE MICE Amelia GAIN MissForest MIDA EM
Spam (30%) 0.87 ±0.009 0.82 ±0.009 1.23 ±0.012 1.26 ±0.007 0.91 ±0.005 0.90 ±0.013 0.97 ±0.008 0.94 ± 0.004
Spam (50%) 0.90 ±0.013 0.86 ±0.014 1.29 ±0.018 1.33 ±0.024 0.93 ±0.015 0.92 ±0.011 0.99 ±0.011 0.97 ± 0.008
Spam (70%) 0.92 ±0.017 0.91 ±0.019 1.32 ±0.028 1.37 ±0.032 0.97 ±0.014 0.95 ±0.016 0.99 ±0.018 0.98 ± 0.017
Housing (30%) 0.86 ±0.015 0.89 ±0.018 1.03 ±0.024 1.02 ±0.016 0.82 ±0.015 0.84 ±0.018 0.93 ±0.025 0.95 ± 0.011
Housing (50%) 0.88 ±0.021 0.90 ±0.024 1.14 ±0.029 1.09 ±0.027 0.88 ±0.019 0.88 ±0.018 0.98 ±0.029 0.96 ± 0.016
Housing (70%) 0.92 ±0.026 0.95 ±0.028 1.22 ±0.036 1.18 ±0.038 0.95 ±0.027 0.93 ±0.024 1.02 ±0.037 0.98 ± 0.017
Clouds (30%) 0.81 ±0.018 0.79 ±0.019 0.98 ±0.024 1.04 ±0.027 0.76 ±0.021 0.71 ±0.011 0.83 ±0.022 0.86 ± 0.013
Clouds (50%) 0.84 ±0.026 0.84 ±0.028 1.10 ±0.041 1.13 ±0.046 0.82 ±0.027 0.75 ±0.023 0.88 ±0.033 0.89 ± 0.018
Clouds (70%) 0.87 ±0.029 0.90 ±0.033 1.16 ±0.044 1.19 ±0.048 0.89 ±0.035 0.81 ±0.031 0.93 ±0.044 0.92 ± 0.023
Breast Cancer (30%) 0.52 ±0.023 0.54 ±0.027 0.74 ±0.031 0.81 ±0.032 0.58 ±0.024 0.55 ±0.016 0.70 ±0.026 0.67 ± 0.014
Breast Cancer (50%) 0.56 ±0.026 0.59 ±0.027 0.79 ±0.029 0.85 ±0.033 0.64 ±0.025 0.59 ±0.022 0.76 ±0.035 0.69 ± 0.022
Breast Cancer (70%) 0.59 ±0.031 0.65 ±0.034 0.86 ±0.042 0.92 ±0.044 0.70 ±0.037 0.63 ±0.028 0.82 ±0.035 0.67 ± 0.014
Parkinson (30%) 0.57 ±0.016 0.55 ±0.016 0.71 ±0.019 0.67 ±0.021 0.53 ±0.015 0.54 ±0.010 0.62 ±0.017 0.64 ± 0.011
Parkinson (50%) 0.62 ±0.022 0.64 ±0.025 0.77 ±0.029 0.74 ±0.034 0.61 ±0.022 0.65 ±0.014 0.71 ±0.027 0.69 ± 0.022
Parkinson (70%) 0.67 ±0.027 0.74 ±0.033 0.85 ±0.038 0.82 ±0.037 0.69 ±0.031 0.73 ±0.022 0.78 ±0.038 0.75 ± 0.029
Table 2.1: Performance comparison of RIFLE, QRIFLE (Quadratic RIFLE), and state-of-the-art methods on several UCI datasets. We applied the imputation methods to three different missing-value proportions for each dataset. The best imputer is highlighted in bold, and the second-best imputer is underlined. Each experiment is repeated 5 times, and the average and standard deviation of the performances are reported.
Note that, as the number of samples increases, the NRMSE of our framework converges to 0.01, which is the standard deviation of the zero-mean Gaussian noise added to each target value (the dashed line in Figure 2.3).
2.5.4 Data Imputation via RIFLE
As explained in Section 2.3, while the primary goal of RIFLE is to learn a robust regression model in the presence of missing values, it can also be used as an imputation tool. We run RIFLE and several state-of-the-art approaches on five datasets from the UCI repository [104] (Spam, Housing, Clouds, Breast Cancer, and Parkinson) with different proportions of MCAR missing values (descriptions of the datasets can be found in Appendix 10.1.10). Then, we compute the NRMSE of the imputed entries. Table 2.1 shows the performance of RIFLE compared to other approaches on datasets where the proportion of missing values is relatively high ($\frac{n(1-p)}{d} \approx O(1)$). RIFLE outperforms these methods in almost all cases and performs slightly better than MissForest, which uses a highly non-linear model (random forest) to impute missing values.
2.5.5 Sensitivity of RIFLE to the Number of Samples and Proportion of Missing Values
In this section, we analyze the sensitivity of RIFLE and other state-of-the-art approaches to the number of samples and the proportion of missing values.
Figure 2.4: Performance comparison of RIFLE, MICE, and MissForest on four UCI datasets: Parkinson, Spam, Wave Energy Converter, and Breast Cancer. For each dataset, we count the number of features on which each method outperforms the others.
In the experiment in Figure 2.4, we create five datasets containing 40%, 50%, 60%, 70%, and 80% MCAR missing values, respectively, for each of four real datasets (Spam, Parkinson, Wave Energy Converter, and Breast Cancer) from the UCI Repository [104] (descriptions of the datasets can be found in Appendix 10.1.10). Given a feature in a dataset containing missing values, we say an imputer wins that feature if its imputation error in terms of NRMSE is less than the error of the other imputers. Figure 2.4 reports the number of features won by each imputer on the datasets described above. As we observe, the number of wins for RIFLE increases as the proportion of missing values increases. This observation shows that RIFLE as an imputer is generally less sensitive to the proportion of missing values than MissForest and MICE.
Figure 2.5: Sensitivity of RIFLE, MissForest, Amelia, KNN Imputer, MIDA, and Mean Imputer to the percentage of missing values on the Drive dataset. Increasing the percentage of missing entries degrades the benchmarks' performance compared to RIFLE. The KNN-Imputer implementation cannot be executed on datasets containing 80% (or more) missing entries. Moreover, Amelia and MIDA do not converge to a solution when the percentage of missing entries is higher than 70%.
Figure 2.4 does not show how increasing the proportion of missing values changes the NRMSE of the imputers. Next, we analyze the sensitivity of RIFLE and several imputers to changes in the missing-value proportion. Fixing the proportion of missing values, we generate 10 random datasets containing missing values at random locations in the Drive dataset (the description of the datasets is available in Appendix 10.1.10). We impute the missing values in each dataset with RIFLE, MissForest, Mean Imputation, and MICE. Figure 2.5 shows the average and standard deviation of these four imputers' performances for different proportions of missing values (10% to 90%), depicting the sensitivity of MissForest and RIFLE to the proportion of missing values in the Drive dataset. We select 400 data points for each experiment with different proportions of missing values (from 10% to 90%) and report the average NRMSE of the imputed entries. Finally, in Figure 2.6, we evaluate RIFLE and other methods on the BlogFeedback dataset (see Appendix 10.1.10) containing 40% missing values. The results show that RIFLE's performance is less sensitive to decreasing the number of samples.
Figure 2.6: Sensitivity of RIFLE, MissForest, MICE, Amelia, Mean Imputer, KNN Imputer, and MIDA to the number of samples for the imputation of the Blog Feedback dataset containing 40% MCAR missing values. When the number of samples is limited, RIFLE outperforms the other methods, and its performance is very close to that of the non-linear imputer MissForest for larger sample sizes.
2.5.6 Performance Comparison on Real Datasets
In this section, we compare the performance of RIFLE to several state-of-the-art approaches, including
MICE [56], Amelia [148], MissForest [318], KNN Imputer [267], and MIDA [128]. There are two primary
ways to do this. One method to predict a continuous target variable in a dataset with many missing
values is first to impute the missing data with a state-of-the-art package, then run a linear regression. An
alternative approach is to directly learn the target variable, as we discussed in Section 2.3.
Table 2.2 compares the performance of mean imputation, MICE, MIDA, MissForest, and the KNN imputer to that of RIFLE on three datasets: NHANES, Blog Feedback, and Superconductivity. The Blog Feedback and Superconductivity datasets both contain 30% MNAR missing values generated by Algorithm 21, with 10000 and 20000 training samples, respectively. The description of the NHANES data and its distribution of missing values can be found in Appendix 10.1.10.
Efficiency of RIFLE: We run RIFLE for 1000 iterations with a step size of 0.01 in the above experiments. At each iteration, the main operation is to find the optimal $\theta$ for the given $b$ and $C$. The average runtime of each method on each dataset is reported in Table 10.1 in Appendix 10.1.13. The main reason for the time efficiency of RIFLE compared to MICE, MissForest, MIDA, and the KNN Imputer is that it directly predicts the target variable without imputing all missing entries.
Methods Super Conductivity Blog Feedback NHANES
Regression on Complete Data 0.4601 0.7432 0.6287
RIFLE 0.4873 ± 0.0036 0.8326 ± 0.0085 0.6304 ± 0.0027
Mean Imputer + Regression 0.6114 ± 0.0006 0.9235 ± 0.0003 0.6329 ± 0.0008
MICE + Regression 0.5078 ± 0.0124 0.8507 ± 0.0325 0.6612 ± 0.0282
EM + Regression 0.5172 ± 0.0162 0.8631 ± 0.0117 0.6392 ± 0.0122
MIDA Imputer + Regression 0.5213 ± 0.0274 0.8394 ± 0.0342 0.6542 ± 0.0164
MissForest 0.4925 ± 0.0073 0.8191 ± 0.0083 0.6365 ± 0.0094
KNN Imputer 0.5438 ± 0.0193 0.8828 ± 0.0124 0.6427 ± 0.0135
Table 2.2: Normalized RMSE of RIFLE and several state-of-the-art methods on the Superconductivity, Blog Feedback, and NHANES datasets. The first two datasets contain 30% Missing Not At Random (MNAR) missing values in the training phase, generated by Algorithm 21. Each method was applied 5 times to each dataset, and the result is reported as the average performance ± standard deviation in terms of NRMSE.
Since MICE and MIDA cannot predict values during the test phase without data imputation, we use them in a pre-processing stage to impute the data. Then, we apply linear regression to the imputed dataset. On the other hand, RIFLE, the KNN imputer, and MissForest can predict the target variable without imputing the training dataset. Table 2.2 shows that RIFLE outperforms all other state-of-the-art approaches executed on the three mentioned datasets. In particular, RIFLE outperforms MissForest, even though the underlying model RIFLE uses is simpler (linear) compared to the nonlinear random forest model utilized by MissForest.
2.5.6.1 Performance of RIFLE on Classification Tasks
In Section 2.4, we discussed how to specialize RIFLE to robust normal discriminant analysis in the presence of missing values. Since the maximization problem over the second moments of the data ($\Sigma$) is intractable, we solved the maximization problem over a set of $k$ covariance matrices estimated by bootstrap sampling. To investigate the effect of the choice of $k$ on the performance of the robust classifier, we train robust normal discriminant analysis models for different values of $k$ on two training datasets (Avila and Magic) containing 40% MCAR missing values; descriptions of the datasets can be found in Appendix 10.1.10. For $k = 1$, there is no maximization problem; thus, the model is equivalent to the classifier proposed in [121]. As shown in Figure 2.7, increasing the number of covariance estimates generally enhances the accuracy of the classifier in the test phase. However, as shown in Theorem 5, the time required to complete the training phase grows linearly with the number of covariance estimates.
Number of Training Data Points LDA Robust LDA Robust QDA
50 52.38% ± 3.91% 62.14% ± 1.78% 61.36% ± 1.62%
100 61.24% ± 1.89% 68.46% ± 1.04% 70.07% ± 0.95%
200 73.49% ± 0.97% 73.35% ± 0.67% 73.51% ± 0.52%
Table 2.3: Sensitivity of Linear Discriminant Analysis, Robust LDA (common covariance matrix), and Robust QDA (different covariance matrices for the two groups) to the number of training samples.
Figure 2.7: Effect of the number of covariance estimates on the performance (left) and run time (right) of robust LDA on the Avila and Magic datasets. Increasing the number of covariance estimates ($k$) improves the model's accuracy on the test data; however, it increases the training time.
Accuracy of Methods
Dataset Feature Type RIFLE Robust QDA MissForest MICE KNN Imputer EM
Glass Identification Continuous 67.12% ± 1.84% 69.54% ± 1.97% 65.76% ± 1.49% 62.48% ± 2.45% 60.37% ± 1.12% 68.21% ± 0.94%
Annealing Mixed 63.41% ± 2.44% 59.51% ± 2.21% 64.91% ± 1.35% 60.66% ± 1.59% 57.44% ± 1.44% 59.43% ± 1.29%
Abalone Mixed 68.41% ± 0.74% 63.27% ± 0.76% 69.40% ± 0.42% 63.12% ± 0.98% 62.43% ± 0.38% 62.91% ± 0.37%
Lymphography Discrete 66.32% ± 1.05% 58.15% ± 1.21% 66.11% ± 0.94% 55.73% ± 1.24% 57.39% ± 0.88% 59.55% ± 0.68%
Adult Discrete 72.42% ± 0.06% 60.36% ± 0.08% 70.34% ± 0.03% 63.30% ± 0.14% 60.14% ± 0.00% 60.69% ± 0.01%
Table 2.4: Accuracy of RIFLE, Robust QDA, MissForest, MICE, KNN-Imputer, and Expectation Maximization (EM) on different discrete, mixed, and continuous datasets. Robust QDA can perform better than the other methods when the input features are continuous and the target variable is discrete. However, RIFLE results in higher accuracy in the mixed and discrete settings.
2.5.6.2 Comparison of Robust Linear Regression and Robust QDA
An alternative approach to the robust QDA presented in Section 2.4 is to apply the robust linear regression algorithm (Section 2.3) and map the solutions to the classes by thresholding (positive values map to label 1 and negative values to label $-1$).
Table 2.4 compares the performance of the two classifiers on five different datasets. As demonstrated in the table, when all features are continuous, quadratic discriminant analysis performs better. This suggests that the QDA model relies heavily on the normality assumption, while robust linear regression handles categorical features better than robust QDA.
Limitations and Future Directions: The proposed framework for robust regression in the presence
of missing values is limited to linear models. While in Appendix 10.1.5, we use polynomial kernels to
apply non-linear transformations on the data, such an approach can potentially increase the number of
missing values in the kernel space generated by the composition of the original features. A future direction
is to develop efficient algorithms for non-linear regression models such as multi-layer neural networks,
decision tree regressors, gradient boosting regressors, and support vector regression models. In the case
of robust classification, the methodology is extendable to any loss beyond quadratic discriminant analysis.
Unlike the regression case, a limitation of the proposed method for robust classification is its reliance on
the Gaussianity assumption of data distribution (conditioned on each data label). A natural extension is to
assume the underlying data distribution follows a mixture of Gaussian distributions.
Conclusion: In this chapter, we proposed a distributionally robust optimization framework over distributions whose low-order marginals lie within estimated confidence intervals, for inference on and imputation of datasets in the presence of missing values. We developed algorithms for regression and classification with convergence guarantees. The method’s performance is evaluated on synthetic and real datasets with different numbers of samples, dimensions, missing-value proportions, and types of missing values. In most experiments, RIFLE consistently outperforms other existing methods.
Acknowledgments
This work was supported by the NIH/NSF Grant 1R01LM013315-01, the NSF CAREER Award CCF-2144985,
and the AFOSR Young Investigator Program Award FA9550-22-1-0192.
Chapter 3
Feature Selection in the Presence of Monotone Batch Effects
This chapter is based on a paper with the same title published in the Spurious Correlations, Invariance, and Stability Workshop at ICML 2023 [86].
We study the feature selection problem in the presence of monotone batch effects, which arise when merging datasets from disparate technologies and environments and can distort the underlying causal dependencies among data features. We propose two novel algorithms for this task: 1) joint feature selection and batch effect correction through non-linear transformations matching the distribution of data batches; 2) transforming data using a batch-invariant characteristic (i.e., feature rank) to append datasets. To match the distribution
of data batches during the feature selection procedure, we use the maximum mean discrepancy (MMD)
distance. We assess the performance of the feature selection methods used in conjunction with our batch
effect removal methods.
Our experiments on synthetic data show that the former method combined with Lasso improves the
F1 score significantly, even with few samples per dataset. This method outperforms popular batch effect
removal algorithms, including Combat-Seq, Limma, and PCA. Comparatively, while the ranking method
is computationally more efficient, its performance is worse due to the information loss resulting from
ignoring the magnitude of data.
3.1 Introduction
Merging datasets from disparate environments comes with challenges, as datasets from the same environment may be subject to similar biases. For example, differences in genome sequencing machines [252],
hybridization protocols [358], and transformation methods [283, 279] may lead to batch effects (i.e. systematic non-biological differences between batches of samples) in gene expression data. Batch effects
can harm the performance of statistical inference algorithms (in particular, feature selection for detecting
useful biomarkers) by imposing bias on the predictions, increasing the false discovery rate, and reducing prediction accuracy [307]. Thus, detection and removal of batch effects are crucial pre-processing
stages of the statistical analysis of various bioinformatics tasks such as (single-cell) RNA sequencing [68],
metabolomics analysis [206], and cancer classification [11, 189].
Prior literature has studied the problem of mitigating batch effects. These methods can be categorized
into several general categories. Clustering and KNN-based methods remove batch effects by finding the
common sources of variations among datasets based on the proximity of data points [55, 367, 196, 113, 182].
More specifically, each data batch is considered a cluster, and the batch effect is viewed as the between-cluster variance [113]. These methods are computationally expensive due to computing pairwise distances
of data points. Further, they might not perform well if the batch effect changes the original distances of
data points drastically and differently among different batches.
Another collection of methods is based on reducing the dimension of the original data by removing the batch effects as spurious dimensions in the data [140, 12, 55]. More specifically, they apply a pre-processing algorithm such as t-SNE [367], Principal Component Analysis [140], Singular Value Decomposition [12], UMAP [332], or Canonical Correlation Analysis [55] to project the data into a lower-dimensional subspace. Then, the noise components (results of the batch effect) are removed from the signal data (ground-truth data free of the batch effect). A common disadvantage of such methods is that by projecting different batches onto the same low-dimensional space, valuable information contributing to the inference phase may be lost. An alternative approach is to apply a clustering algorithm on the data and remove the batch effect iteratively through the clustering procedure [196, 113, 182].
Yet another class of approaches formulates the batch effect problem as a parametric model and estimates the unknown parameters by classification or regression techniques [161, 183, 188, 279, 282, 209, 373]. The most popular method in this category is arguably ComBat-seq [373], which considers a joint negative binomial regression model for gene counts and the batch effects. Their approach adjusts the model’s parameters by matching the cumulative distribution functions of the data generated by the model and the original data. The common problem of these approaches is the strict parametric assumptions on the data distribution (such as negative binomial) and the model (linearity) that limit their applicability. Finally, the aforementioned methods apply batch effect removal and inference on the data in separate phases; thus, the error in the batch effect removal phase can propagate to the inference phase.
A more recent class of methods removes the batch effect by matching the empirical distributions of different batches, minimizing a distributional distance measure such as the Kullback-Leibler (KL) divergence [182] or the maximum mean discrepancy (MMD) [300, 242]. [300] mitigates the batch effect by minimizing the MMD between the distributions of two sets of samples: they consider one dataset as the reference, and the other dataset is transformed to have a distribution similar to that of the reference. While their algorithm works for two input datasets, our methodology utilizes all input datasets to reduce the batch effect on all of them simultaneously. Another significant difference between our framework and theirs is that we perform batch effect removal and feature selection jointly instead of as independent steps. As discussed, removing batch effects jointly with feature selection can significantly enhance the feature selection accuracy, because many feature selection strategies (such as the popular Lasso algorithm) make certain assumptions on the dataset (e.g., generalized linear models). Such assumptions may be invalidated when the batch effect removal procedure does not account for the downstream feature selection procedure.

Figure 3.1: The batch effect can be conceptualized as transformations on each of the ground-truth datasets, changing the distribution of each dataset in a potentially different way.
Feature Selection Methods: Feature selection when no batch effect is involved is extensively studied
in the literature [288, 162]. A popular approach selects features with the highest correlation (e.g., Pearson
or Spearman [143]) with the target variable. To choose the most relevant features, one can then compute a p-value for each correlation and select the features that remain significant after adjusting for multiple testing with false discovery rate control mechanisms such as the Benjamini-Hochberg procedure [41]. Notice that
one can rely on exact statistical approaches such as permutation tests to avoid distributional assumptions.
A more popular approach for finding the most relevant features with the highest prediction power of the
target variable is to formulate the problem as a Lasso regression task. The method’s output is the features
corresponding to the non-zero elements of the regression parameter vector [338].
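As an illustration of the correlation-based screening just described, the following hedged sketch computes a Pearson-correlation p-value per feature and applies the Benjamini-Hochberg step-up procedure; the function names and the level alpha are illustrative and not taken from the cited works.

```python
import numpy as np
from scipy.stats import pearsonr

def bh_select(X, y, alpha=0.05):
    """Select feature indices whose correlation p-values pass BH at level alpha."""
    d = X.shape[1]
    pvals = np.array([pearsonr(X[:, j], y)[1] for j in range(d)])
    order = np.argsort(pvals)                   # p-values in ascending order
    thresh = alpha * np.arange(1, d + 1) / d    # BH step-up thresholds
    passed = np.nonzero(pvals[order] <= thresh)[0]
    k = passed.max() + 1 if passed.size else 0  # largest index passing the test
    return order[:k]

X = np.random.default_rng(0).normal(size=(100, 20))
y = X[:, 3] + 0.1 * np.random.default_rng(1).normal(size=100)
print(bh_select(X, y))                          # should include feature 3
```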
In this work, we jointly handle the feature selection task and batch effect removal through data transformation. In particular, we remove the batch effect by finding the optimal transformations that minimize
the maximum mean discrepancy (MMD) of different data batches.
3.2 Problem Formulation and Methodology
Let us first rigorously define the problem of distribution-free feature selection in the presence of batch effects: let D_1, . . . , D_m be a collection of datasets from m unique laboratories studying the interactions of d input features (e.g., gene expressions, proteomics data, etc.) and a target variable y. Ideally, all datasets in the collection follow an identical distribution P* describing the joint distribution of the input features and the target variable. However, due to the factors mentioned in the previous section, known as the batch effect, the datasets in the collection do not follow the ground-truth distribution P*. Formally, the batch effect can be described as m different transformations f_1, . . . , f_m applied to the ground-truth datasets D*_1, . . . , D*_m ∼ P* (see Figure 3.1). We assume that all functions f_i, i = 1, . . . , m, are monotonically increasing and applied pointwise to the corresponding dataset. Such an assumption on the monotonicity of batch effects holds for the widely used transformations on raw microarray and sequencing datasets, such as log transformations and dividing by the maximum values. Thus, each observed dataset can be viewed as a monotone transformation of the ground-truth data:

D_i = f_i(D*_i),  i = 1, . . . , m.  (3.1)

The goal of batch effect removal is to learn the underlying transformations f_1, . . . , f_m on the data batches such that the ground-truth datasets D*_1, . . . , D*_m are recovered up to bijective mappings (see Figure 3.1). In other words, the optimal transformations on the datasets make their distributions as close as possible to each other. Thus, one can quantify the quality of batch effect removal based on how close the resulting distributions of the datasets are after the transformations. Note that it is crucial for the transformations to be bijective: otherwise, one could transform all datasets to zero, and the difference between the transformed datasets would be zero, yet such transformations are clearly not desirable. Unfortunately, finding the optimal transformations over the set of bijective functions is challenging from the optimization point of view. We propose a novel formulation in the next section to avoid such spurious solutions.
3.2.1 MMD-based Approach
We utilize the maximum mean discrepancy [136] to measure the proximity of dataset distributions [19]. Maximum Mean Discrepancy (MMD) is a statistical measure that quantifies the dissimilarity between probability distributions. Given two datasets or distributions, MMD determines the dissimilarity by calculating the discrepancy between their respective means in a Reproducing Kernel Hilbert Space (RKHS). Mathematically, let z and z' be independent random variables following distribution p, and w and w' be independent random variables following distribution q. Then, the MMD between distributions p and q is

MMD^2[p, q] = E_{z,z'}[k(z, z')] − 2 E_{z,w}[k(z, w)] + E_{w,w'}[k(w, w')].  (3.2)

An empirical estimate of MMD is given by [135]

MMD_u^2[x, y] = 1/(m(m−1)) Σ_{i=1}^{m} Σ_{j≠i} k(x_i, x_j) + 1/(n(n−1)) Σ_{i=1}^{n} Σ_{j≠i} k(y_i, y_j) − 2/(mn) Σ_{i=1}^{m} Σ_{j=1}^{n} k(x_i, y_j),

where k(·, ·) is a non-linear kernel function. In our work, we apply the Gaussian kernel, defined as

k(z, w) = exp( −‖z − w‖^2 / (2σ^2) ).

Here, σ is a hyper-parameter controlling the bandwidth of the kernel. To accurately estimate MMD and avoid the vanishing gradient phenomenon, we use multiple kernels with different bandwidths taking values in σ ∈ {10^{−2}, 10^{−1}, . . . , 10^{11}}, which is appropriate even for large distributional differences.
We assume the input features x* without batch effect follow a multivariate Gaussian distribution N(0, Σ) with zero mean and an unknown covariance matrix. Therefore, we can formulate the batch effect removal problem as finding the optimal transformations and covariance matrix Σ such that the total difference between the transformed distributions and the underlying normal distribution is minimized:

min_{Φ, Σ}  Σ_{i=1}^{m} MMD( Φ_i(D_i), N(0, Σ) ),  (3.3)

where MMD(·, ·) measures the maximum mean discrepancy between the two input distributions/datasets. Each transformation Φ_i is modeled by a two-layer neural network. We add non-negativity constraints on the weights of the neural networks to ensure the corresponding transformations are monotone:

Φ(D_i) = U^T ReLU( W_2 ReLU(W_1 X_i + b_1) + b_2 ) + c,  (3.4)

where W_1, W_2, U, b_1, b_2, and c are the trainable weights of the neural network under the non-negativity constraints, and X_i represents the data matrix associated with dataset D_i. One could argue that, instead of transforming the data to a common normal distribution, we could minimize the pairwise MMD distances of the distributions. However, in that case, transforming all distributions to a constant value such as 0 is an optimal solution with 0 loss, which is clearly undesirable. An alternative approach, used by [300], takes one of the dataset distributions as the "reference" and then brings the distributions of the other datasets to the chosen reference. This approach can lead to models with poor performance if the batch effect strongly affects the reference data. Our experiments show that even when we randomly generate the covariance matrix Σ instead of treating it as an optimization parameter, this formulation still achieves better performance than taking one of the data distributions as the reference.
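A PyTorch sketch of the monotone transformation in (3.4) is given below: a two-layer ReLU network whose weights are projected onto the non-negative orthant after each update, which keeps every output coordinate non-decreasing in every input coordinate. The layer width is an illustrative choice, not the architecture used in our experiments.

```python
import torch
import torch.nn as nn

class MonotoneTransform(nn.Module):
    """Two-layer ReLU map as in (3.4); non-negative weights ensure monotonicity."""
    def __init__(self, dim, hidden=16):
        super().__init__()
        self.l1 = nn.Linear(dim, hidden)     # W1, b1
        self.l2 = nn.Linear(hidden, hidden)  # W2, b2
        self.out = nn.Linear(hidden, dim)    # U^T, c

    def forward(self, x):
        return self.out(torch.relu(self.l2(torch.relu(self.l1(x)))))

    @torch.no_grad()
    def project(self):
        # set negative weights to zero after each gradient step
        for layer in (self.l1, self.l2, self.out):
            layer.weight.clamp_(min=0.0)

phi = MonotoneTransform(dim=5)
phi.project()
out = phi(torch.randn(8, 5))
```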
One can optimize Problem (3.3) to remove batch effects in a pre-processing stage. Next, a feature selection method can be performed on the transformed data to find the set of input features that are most relevant for predicting the target variable. A more elegant approach is to remove batch effects and perform the feature selection as a joint optimization task. Note that performing these tasks separately and sequentially can lead to sub-optimal solutions compared to the joint formulation due to the data processing inequality [32]. The joint batch effect removal via MMD and Lasso regression can be formulated as follows:

min_{Φ, θ, Σ}  (1/n) Σ_{i=1}^{n} Σ_{(x,y)∈D_i} (θ^T Φ_i(x) − y)^2 + λ‖θ‖_1 + µ Σ_{i=1}^{m} MMD( Φ_i(D_i), N(0, Σ) ).  (3.5)

The objective function consists of three main components: the first term represents the linear regression loss on the transformed dataset. The second term is the non-smooth ℓ1 regularizer controlling the sparsity of the model parameter θ. Finally, the third component transforms the distributions of the datasets to a common normal distribution. Problem (3.5) can be optimized using a first-order iterative algorithm where, at each iteration, a step of projected gradient descent is applied to the parameters of the transformations. Then θ is updated by the proximal gradient method. The procedure for optimizing (3.5) is presented in Algorithm 4.
Algorithm 4 MMD-based feature selection and batch effect removal
1: Initialize θ with normal distribution and Φi randomly for all 1 ≤ i ≤ m.
2: for t = 1, . . . , T do
3: Update the parameters in Φi via Adam optimizer and set all negative weights to 0 for all 2 ≤ i ≤ m.
4: Update θ by applying one step of ISTA [36] on Problem (3.5).
5: Return features corresponding to non-zero elements in θ.
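Below is a hedged sketch of the alternating updates in Algorithm 4: an Adam step on the transformation parameters followed by the non-negativity projection, then one ISTA (soft-thresholding) step on θ. The callable smooth_loss, which stands in for the regression and MMD terms of (3.5), and all step sizes are illustrative.

```python
import torch

def soft_threshold(z, t):
    """Proximal operator of the l1 norm (the ISTA shrinkage step)."""
    return torch.sign(z) * torch.clamp(z.abs() - t, min=0.0)

def train(phis, theta, smooth_loss, steps=1000, lr=1e-3, lam=1e-2):
    # theta: 1-D tensor with requires_grad=True and used inside smooth_loss;
    # phis: list of MonotoneTransform modules with a project() method.
    opt = torch.optim.Adam([p for phi in phis for p in phi.parameters()], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        smooth_loss(phis, theta).backward()   # regression + MMD terms only
        opt.step()
        for phi in phis:
            phi.project()                     # keep transformation weights >= 0
        with torch.no_grad():
            theta.copy_(soft_threshold(theta - lr * theta.grad, lr * lam))
            theta.grad.zero_()
    return torch.nonzero(theta).flatten().tolist()  # selected feature indices
```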
3.2.2 Low-Rank MMD Method
In many biological datasets, especially microarray and sequencing data containing thousands of gene expressions of living organisms, the number of samples (e.g., humans or mice) is much smaller than the number of studied genes. As a result, Problem (3.5) contains many optimization parameters in the high-dimensional setting, where the number of data points is very limited compared to the dimension of the data (n ≪ d); in particular, the unknown covariance matrix Σ has O(d^2) parameters to optimize. For instance, when d = 10^4 (or even larger for genomic datasets), Σ contains 10^8 variables, while only a few samples (< 100) are available. As mentioned earlier, a practical approach to avoiding the challenging optimization procedure in (3.5) considers a multivariate Gaussian distribution with a fixed, randomly generated covariance matrix as the reference distribution. Although our simulation results show that even a fixed covariance matrix can beat the state-of-the-art approaches in the literature, a randomly generated covariance matrix can be arbitrarily far from the covariance matrix of the actual data distribution. Further, having a small number of samples compared to the dimension of Σ may cause overfitting of the trained model.
To overcome the aforementioned issues, we exploit the fact that in genomic datasets only a small proportion of the input features have strong correlations with the target variable. In other words, in high-dimensional biological datasets (e.g., microarrays, gene expressions, etc.), most pairs of features are almost independent, so the ground-truth covariance matrix is sparse, making a low-rank assumption on the covariance matrix of the data practically reasonable. We therefore assume that the ground-truth covariance matrix is low-rank, which reduces the number of optimization parameters significantly. Problem (3.5) can then be reformulated with an extra low-rank constraint on Σ as follows:

min_{Φ, θ, Σ}  (1/n) Σ_{i=1}^{n} Σ_{(x,y)∈D_i} (θ^T Φ_i(x) − y)^2 + λ‖θ‖_1 + µ Σ_{i=1}^{m} MMD( Φ_i(D_i), N(0, Σ) )
s.t. Rank(Σ) ≤ s.  (3.6)
Solving (3.6) is intractable due to the non-convexity of the rank constraint. Alternatively, since the covariance matrix must be symmetric and positive semi-definite, it can be represented as Σ = A^T A, where A is an s × d matrix. In this case, Rank(A^T A) ≤ s. Therefore, Problem (3.6) can alternatively be formulated as:

min_{Φ, θ, A}  (1/n) Σ_{i=1}^{n} Σ_{(x,y)∈D_i} ℓ( ⟨θ, Φ_i(x)⟩, Φ_i(y) ) + λ‖θ‖_1 + µ Σ_{i=1}^{m} MMD( Φ_i(D_i), N(0, A^T A) ).  (3.7)
The standard approach for generating zero-mean multivariate Gaussian samples X ∼ N^{n×d}(0, Σ) is to first generate a standard normal matrix Z ∼ N^{n×d}(0, I_{d×d}) and a random matrix A_{d×d}; the resulting samples X = ZA then follow a Gaussian distribution with covariance matrix A^T A. In our problem, the matrix A is the variable that needs to be trained. To obtain a low-rank covariance, A can be generated with dimension s × d, where s ≪ d; correspondingly, Z ∼ N^{n×s}(0, I_{s×s}) and X = ZA. Consequently, the optimization problem is modified to include the low-rank matrix as follows:
min_{Φ, θ, A}  (1/n) Σ_{i=1}^{n} Σ_{(x,y)∈D_i} ℓ( ⟨θ, Φ_i(x)⟩, Φ_i(y) ) + µ Σ_{i=1}^{m} MMD( Φ_i(D_i), ZA ) + λ‖θ‖_1.
This modified approach is referred to as the "Low-Rank MMD" method, which is implemented and evaluated alongside Problem (3.5). The results, shown in Table 3.1, indicate that the low-rank MMD method and the original MMD method each perform best in two of the four scenarios. However, the low-rank method offers additional advantages and potential for further exploration. One of the key benefits of the low-rank approach is its increased explainability: by incorporating a low-rank matrix, the model becomes more interpretable, allowing a better understanding of the underlying factors influencing the data. Furthermore, the low-rank method demonstrates greater adaptability as the number of samples changes. In particular, when we vary the low-rank parameter s according to the number of samples, the low-rank model exhibits enhanced performance. This flexibility allows the model to effectively capture the underlying patterns and dependencies in the data, resulting in improved predictive power. By adjusting the low-rank parameter dynamically, the low-rank method can leverage the available information and adapt to different dataset sizes.
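A minimal sketch of sampling from the low-rank Gaussian reference is shown below: draw a standard normal matrix Z and form X = ZA, whose rows have covariance A^T A of rank at most s. The shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 100, 1000, 10            # few samples, high dimension, low rank
A = rng.normal(size=(s, d))        # the trainable s-by-d factor
Z = rng.normal(size=(n, s))        # standard normal latent samples
X = Z @ A                          # reference samples with covariance A^T A

print(np.linalg.matrix_rank(A.T @ A))  # at most s
```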
3.2.3 Ranking Method
An alternative, less computationally intensive approach than the MMD-based method is to make predictions relying on features that are invariant under the batch effect transformations f_1, . . . , f_m. In particular, since these transformations are monotonically increasing, they do not change the order of the entries in any row of the data matrix. Thus, the order statistics in each row are unchanged, which means the row-wise order statistics are invariant under the batch effects. After applying the order statistics, one can perform Lasso regression or other feature selection algorithms on the transformed dataset. Despite its simplicity, using the order statistics instead of the original data can lead to information loss, as it only considers the orders, not the magnitudes, of the data. While the simple ranking method beats several state-of-the-art approaches from the literature in our experiments, the method based on the joint removal of batch effects and feature selection through Lasso regression dominates the ranking method in all synthetic and real data scenarios.
Algorithm 5 Feature Selection with Ranking-based Method
1: Convert the original data (X, Y) to ranking data R(X, Y) = (X′, Y′) per row;
2: Apply Lasso on R(X, Y) and select the corresponding features.
3.3 Numerical Experiments
In this section, we evaluate the performance of MMD-based and ranking-based methods on simulated
datasets. To do this, we first consider the feature selection tasks in the presence of distributional shifts.
For this task, we measure performance using the F1 score, which is defined as the harmonic mean of
recall and precision. Moreover, to evaluate the effectiveness of the methods in removing batch effects, we
visualize the data batches before and after the transformation.
3.3.1 Simulated Data
To evaluate the performance of our proposed methods against state-of-the-art baselines in the literature,
we generate datasets with different numbers of batches m and the number of data points n in each batch.
To this end, we generate mn data points with dimension d = 100. Each batch follows a normal distribution with a randomly assigned mean and covariance. The target variable y is a linear function of the input data plus zero-mean normal noise (y = x^T β* + ϵ). To induce sparsity, each entry of β* is set to zero with probability 90%. To add batch effects to batch k, we transform dimension j of data point i in batch k as follows:

x′_{ij} = a_k x_{ij}^{5/3} + b_k x_{ij} + c_k + ϵ_k,

where a_k, b_k, and c_k are randomly generated positive numbers for batch k and ϵ_k is Gaussian noise.
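For concreteness, the following is a minimal sketch of this generator; the constants, noise scales, and seed are illustrative, and the signed power keeps x^{5/3} monotone for negative entries.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 100, 100
beta = rng.normal(size=d) * (rng.random(d) > 0.9)   # roughly 90% zeros

batches = []
for k in range(m):
    X = rng.normal(size=(n, d))
    y = X @ beta + rng.normal(scale=0.1, size=n)
    a, b, c = rng.uniform(0.5, 2.0, size=3)          # positive batch constants
    X_obs = (a * np.sign(X) * np.abs(X) ** (5 / 3)   # monotone distortion
             + b * X + c + rng.normal(scale=0.05, size=(n, d)))
    batches.append((X_obs, y))
```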
We vary the number of data points and data batches to evaluate the effectiveness of each approach in different scenarios. Besides the MMD-based and ranking-based methods proposed in this chapter, we evaluate several state-of-the-art approaches, including Combat-Seq [373], Limma [313], zero-mean-unit-variance, and PCA [189]. We also compare outcomes when we apply Lasso on the datasets without removing batch effects. Table 3.1 reports the F1 score for the aforementioned approaches in four different scenarios.
Method                    S1 (5,10)   S2 (50,10)   S3 (5,100)   S4 (50,100)
Combat-Seq                0.424       0.313        0.444        0.759
Limma                     0.077       0.109        0.217        0.238
PCA                       0.143       0.089        0.228        0.238
Zero-Mean Unit-Variance   0.061       0.204        0.231        0.160
Original Data             0.381       0.145        0.289        0.289
Ranking                   0.444       0.095        0.214        0.214
Shaham                    0.326       0.143        0.289        0.297
MMD                       0.410       0.727        0.880        0.857
Low-Rank MMD              0.537       0.400        0.857        0.909

Table 3.1: F1 scores for different methods across four scenarios, where (m, n) denotes the number of batches and the number of data points per batch.
3.3.2 Hyper-Parameter Tuning
The selection of suitable hyper-parameters in Problem (3.5) significantly impacts the overall performance. However, determining optimal values for λ and µ poses a considerable challenge. In realistic datasets, the true coefficient vector β is unknown, making it difficult to assess the final results accurately. Consequently, an alternative metric is needed to replace the F1 score for conducting cross-validation.
Figure 3.2: F1 Score Versus Normalized Logarithmic Supervised Learning Loss
One potential indicator is the supervised learning loss. We calculate the F1 score at regular intervals, typically every 1000 iterations, while simultaneously recording the logarithmic supervised learning loss. To facilitate comparison, we normalize the loss by dividing it by the maximum logarithmic supervised learning loss. Figure 3.2 demonstrates that the F1 score tends to increase as the supervised learning loss decreases.
3.3.3 Interpreting the Results
From Table 3.1, it is evident that the MMD-based approach performs better than other methods by a large margin in Scenarios 2 to 4. In Scenario 1, Combat-Seq works slightly better. This can be attributed to the MMD method's requirement of an adequate number of samples to effectively learn the optimal transformations (modeled by two-layer neural networks). Conversely, other benchmarks perform even worse than applying Lasso on the original datasets. This suggests that these approaches may alter the underlying information within the datasets. Additionally, the ranking method does not effectively select the correct features, potentially leading to the loss of crucial information.
3.3.4 Convergence of the Training Approach
Based on Equation (3.5), the supervised learning loss, the MMD loss, and the ℓ1 norm are expected to decrease during the training process as we minimize the objective. Figure 3.3 plots the training process for Scenario 3: the upper-left panel shows the logarithmic prediction (supervised training) loss, the upper-right panel the ℓ1 norm of θ, the lower-left panel the sum of MMDs between the transformed data and the reference data, and the lower-right panel the objective value in (3.5). Overall, the loss shows a declining trend. A small MMD loss indicates that the batch effects are corrected and the distributions of the transformed data are close to that of the reference data. Figure 3.4 shows the MMD between each pair of datasets before and after transformation in Scenario 3. The diagonals are black (MMD = 0) because the statistical distance between a dataset and itself is 0.
Figure 3.3: Training Loss
Non-diagonal values in Figure 3.4 (left) are roughly between 1 and 2, whereas the post-transformation values (right) are smaller than 0.1, showing that the batch effects have been corrected. In Section 3.4, we further discuss the convergence behavior and the quality of the transformed datasets obtained by our MMD-based approach. All code and outcomes are publicly available at https://github.com/DesPeradoGoden/Feature-Selection-in-the-Presence-of-Monotone-Batch-Effects.
Drawing inspiration from the observation in Figure 3.2, we can utilize cross-validation to identify the values of µ and λ that minimize the converged supervised learning loss on test datasets. In addition, we have observed that the overall performance improves when each term in Equation (3.5), together with its corresponding coefficient, has a similar magnitude. This phenomenon has also been mentioned in [249].
Figure 3.4: Heat-map of MMD before and after transformation
The idea behind this approach involves training the model with random values for λ and µ. Once the training loss converges, the values of λ and µ are reset such that the supervised training loss

(1/n) Σ_{i=1}^{n} Σ_{(x,y)∈D_i} ℓ( ⟨θ, Φ_i(x)⟩, Φ_i(y) ),

the MMD loss

µ Σ_{i=1}^{m} MMD( Φ_i(D_i), N(0, Σ) ),

and the ℓ1 norm λ‖θ‖_1 are approximately equal. For instance, if the supervised training loss is 1000, the MMD loss is 0.1, and the ℓ1 norm is 10, we can set µ = 10^4 and λ = 100. This process is repeated until no further updates are needed for λ and µ. This method assists in determining suitable hyper-parameters and adjusting step sizes accordingly.
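A sketch of this rebalancing heuristic, with illustrative function and variable names, is given below; it reproduces the numerical example above (supervised loss 1000, MMD loss 0.1, and ℓ1 norm 10 give µ = 10^4 and λ = 100).

```python
def rebalance(sup_loss, mmd_loss, l1_norm, mu, lam, tol=2.0):
    """Rescale (mu, lam) so the three terms of (3.5) have similar magnitudes."""
    new_mu = mu if mmd_loss == 0 else sup_loss / mmd_loss
    new_lam = lam if l1_norm == 0 else sup_loss / l1_norm
    stable = (max(new_mu / mu, mu / new_mu) < tol
              and max(new_lam / lam, lam / new_lam) < tol)
    return new_mu, new_lam, stable   # repeat training until `stable` is True

print(rebalance(1000.0, 0.1, 10.0, mu=1.0, lam=1.0))  # -> (10000.0, 100.0, False)
```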
Subsequently, cross-validation is employed to search for the values of µ and λ that minimize the supervised learning loss on the test dataset. The empirical settings for λ and µ can also help narrow down the search range during the cross-validation implementation.
3.4 Validating the Multivariate Gaussian Distribution
In Figure 3.4 (right), we can observe that the MMD values of the transformed datasets are not equal to 0. We consider this a reasonable outcome because the reference dataset is generated randomly and is expected to differ from the true underlying distribution. The goal is for the transformed data to be close to a multivariate Gaussian distribution rather than an exact match to the reference data; if the MMD in Figure 3.4 were 0, it would indicate overfitting. Figure 3.5 displays the histograms and overlays the corresponding Gaussian curves for the original data and for the transformed data obtained through a linear mapping with the random vector a. Notably, the data exhibits a clear Gaussian distribution after the transformation, unlike the data prior to the transformation.
Figure 3.5: Histogram for Validating Gaussian Distribution
The transformed data does not fully converge to the reference data due to the design of the optimization problem (3.5). In Problem (3.5), we conceptualize minimizing the MMD as the main objective and the Lasso term as a regularizer. Since the underlying distribution follows a sparse linear relationship, forcing the transformed data to exactly match the reference data would lead to a large supervised loss and ℓ1 norm. Hence, appropriate tuning of the hyper-parameters λ and µ becomes crucial in achieving good performance.
3.4.1 Genomic Data Experiments
Batch effects also appear in datasets in biological applications, particularly in genomic data, where a small number of samples from different laboratories may be pooled together for analysis. We therefore demonstrate our approach on an analysis of Srebp-1c expression in liver cells, a gene implicated in liver disease. Identifying genes with expressions correlated to Srebp-1c may help us narrow down which other genes should be studied to elucidate gene pathways in liver disease more fully.
To identify genes potentially correlated with Srebp-1c, we gathered 15 publicly available GEO datasets
originating from diverse environments. Our dataset collection includes the following:
GSE149863, GSE94593, GSE162249, GSE138778, GSE137345, GSE162276, GSE163728, GSE163652,
GSE151358, GSE157798, GSE140994, GSE156918, GSE162322, GSE162324, and GSE166347. We observed
a significant batch effect within these datasets. For instance, the expression values in GSE137345 exhibit
magnitudes below 10, considerably smaller than those in GSE156918, where half of the counts were above
1,000. Our proposed approach, which removes the batch effect, is therefore well-suited for identifying
potential genes correlated with Srebp-1c.
To assess the performance of our method, we compute the F1 score by comparing the genes obtained through our approach with a list of known correlated genes. We do not have ground truth when conducting genomic data analysis; however, some prior biological experiments have explored potential genes correlated with Srebp-1c (as referenced in [299]). These experiments provide a list of genes known to be truly correlated, serving as a benchmark. We use the F1 score as it provides a robust evaluation metric that balances precision and recall.
The highest F1 score attained in our analysis was three times higher than the F1 score of the direct Lasso method. This process identified 395 potential genes out of about 10,000; 15 of them overlap with the 395 genes listed in the biological publication by [299], which provides genes with biological connections to Srebp-1c. This suggests that this process may be useful for identifying other genes of biological interest.
Another way to identify correlated genes is by using the frequency with which genes occur among experiments with different hyper-parameter settings: the higher the frequency, the more robust the relationship between a gene and the target gene is likely to be. In our analysis, several genes appeared with high frequency, including Ubtd2, CD36, and Defb1. CD36 appeared 91 times among 100 experiments and, as confirmed by existing biological literature, is known to correlate with Srebp-1c [365]. Additional biological experiments are needed for the other genes to explore their potential associations further.
3.5 Conclusion
We proposed a joint optimization framework for feature selection using Lasso while removing batch effects by matching the distributions of datasets via MMD. Aside from feature selection, the method can be used as an effective tool to remove batch effects as a pre-processing task. Numerical experiments on different scenarios demonstrate a significant improvement in the performance of this approach compared to other state-of-the-art methods. Moreover, the results on genomic datasets obtained by the MMD method are supported by existing biological literature, demonstrating our approach's practicality.
Chapter 4
Improving Adversarial Robustness via Joint Classification and Multiple
Explicit Detection Classes
This chapter is based on a paper published in AISTATS 2023 [28] with the same title.
This chapter concerns the development of deep networks that are certifiably robust to adversarial
attacks. Joint robust classification-detection was recently introduced as a certified defense mechanism,
where adversarial examples are either correctly classified or assigned to the “abstain” class. In this work,
we show that such a provable framework can benefit from an extension to networks with multiple explicit abstain classes, where adversarial examples are adaptively assigned to these classes. We show that naïvely adding multiple abstain classes can lead to "model degeneracy"; we then propose a regularization approach and a training method to counter this degeneracy by promoting full use of the multiple abstain classes. Our
experiments demonstrate that the proposed approach consistently achieves favorable standard vs. robust
verified accuracy tradeoffs, outperforming state-of-the-art algorithms for various choices of number of
abstain classes. Our code is available at https://github.com/sinaBaharlouei/MultipleAbstainDetection.
4.1 Introduction
Deep Neural Networks (DNNs) have revolutionized many machine learning tasks such as image processing [177, 377] and speech recognition [134, 235]. However, despite their superior performance, DNNs
are highly vulnerable to adversarial attacks and perform poorly on out-of-distribution samples [129, 199,
359]. To address the vulnerability of DNNs to adversarial attacks, the community has designed various
defense mechanisms against such attacks [251, 156, 127, 219, 152]. These mechanisms provide robustness
against certain types of attacks, such as the Fast Gradient Sign Method (FGSM) [323, 129]. However, the
overwhelming majority of these defense mechanisms are highly ineffective against more complex attacks
such as adaptive and brute-force methods [329, 63]. This ineffectiveness necessitates: 1) the design of rigorous verification approaches that can measure the robustness of a given network; 2) the development of
defense mechanisms that are verifiably robust against any attack strategy within the class of permissible
attack strategies.
To verify the robustness of a given network against any attack in a reasonable set of permissible attacks (e.g., ℓp-norm ball around the given input data), one needs to solve a hard non-convex optimization
problem (see, e.g., Problem (4.1) in this chapter). Consequently, exact verifiers, such as the ones developed in [327, 351], are not scalable to large networks. To develop scalable verifiers, the community turns to "inexact" verifiers, which can only verify a subset of perturbations to the input data that the network can
defend against successfully. These verifiers typically rely on tractable lower bounds for the verification
optimization problem. [130] finds such a lower bound by interval bound propagation (IBP), which is essentially an efficient convex relaxation of the constraint sets in the verification problem. Despite its simplicity,
this approach demonstrates relatively superior performance compared to prior works.
IBP-CROWN [368] combines IBP with novel linear relaxations to have a tighter approximation than
standalone IBP. β-Crown [344] utilizes a branch-and-bound technique combined with the linear bounds
in IBP-CROWN to tighten the relaxation gap further. While β-Crown demonstrates a tremendous performance gain over other verifiers such as [368, 114, 213], it cannot be used as a tool in large-scale training
procedures due to its computationally expensive branch-and-bound search. One can adopt a composition
of certified architectures to enhance the performance of the obtained model on both natural and adversarial
accuracy [231, 151].
Another line of work for enhancing the performance of certifiably robust neural networks relies on
the idea of learning a detector alongside the classifier to capture adversarial samples. Instead of trying to classify adversarial images correctly, these works design a detector to determine whether a given sample is natural/in-distribution or a crafted attack/out-of-distribution. [70] trains the detector on both in-distribution and out-of-distribution samples to learn a detector distinguishing these samples. [144] develops a method based on a simple observation that, for real samples, the output of the softmax layer is closer
to 0 or 1 compared to out-of-distribution and adversarial examples where the softmax output entries are
distributed more uniformly. [98, 302, 320] learn uncertainty regions around actual samples where the network prediction remains the same. Interestingly, this approach does not require out-of-distribution samples during training. Other approaches such as deep generative models [273], self-supervised and ensemble
methods [339, 72] are also used to learn out-of-distribution samples. However, typically, these methods
are vulnerable to adversarial attacks and can be easily fooled by carefully designed out-of-distribution
images [119] as discussed in [328]. A more resilient approach is to jointly learn the detector and the classifier [181, 303, 71] by adding an auxiliary abstain output class capturing adversarial samples.
Building on these prior works, this chapter develops a framework for detecting adversarial examples
using multiple abstain classes. We observe that naïvely adding multiple abstain classes (in the existing
framework of [303]) results in a model degeneracy phenomenon where all adversarial examples are assigned to a small fraction of abstain classes (while other abstain classes are not utilized). To resolve this
issue, we propose a novel regularizer and a training procedure to balance the assignment of adversarial
examples to abstain classes. Our experiments demonstrate that utilizing multiple abstain classes in conjunction with the proper regularization enhances the robust verified accuracy on adversarial examples
while maintaining the standard accuracy of the classifier.
Challenges and Contribution. We propose a framework for training and verifying robust neural networks with multiple detection classes. The resulting optimization problems for training and verifying such networks are constrained min-max problems over a probability simplex, which are more challenging from an optimization perspective than the problems associated with networks with no or a single detection class. We devise an efficient algorithm for this problem. Furthermore, having multiple detectors leads to the "model degeneracy" phenomenon, where not all detection classes are utilized. To prevent model degeneracy and to avoid tuning the number of network detectors, we introduce a regularization mechanism guaranteeing that all detectors contribute to detecting adversarial examples to the extent possible. We propose convergent algorithms for the verification (and training) problems using proximal gradient descent with Bregman divergence. Compared to networks with a single detection class, our experiments show that we enhance the robust verified accuracy by more than 5% and 2% on the CIFAR-10 and MNIST datasets, respectively, for various perturbation sizes.
Roadmap. In Section 4.2, we review interval bound propagation (IBP) and β-Crown as two existing efficient methods for verifying the performance of multi-layer neural networks against adversarial attacks. We discuss how to train and verify joint classifier and detector networks (with a single abstain class) based on these two approaches. Section 4.3 is dedicated to the motivation and procedure of joint verification and classification of neural networks with multiple abstain classes. In particular, we extend the IBP and β-Crown verification procedures to networks with multiple detection classes. In Section 4.4, we show how to train neural networks with multiple detection classes via the IBP procedure. However, we show that the performance of the trained network cannot be improved by only increasing the number of detection classes, due to "model degeneracy" (a phenomenon that occurs when multiple detectors behave very similarly and identify the same adversarial examples). To avoid model degeneracy and to automatically/implicitly tune the number of detection classes, we introduce a regularization mechanism such that all detection classes are used in balance.
4.2 Background
4.2.1 Verification of feedforward neural networks
Consider an L-layer feedforward neural network with {W_i, b_i} denoting the weight and bias parameters associated with layer i, and let σ_i(·) denote the activation function applied at layer i. Throughout the chapter, we assume the activation function is the same for all hidden layers, i.e., σ_i(·) = σ(·) = ReLU(·) for all i = 1, . . . , L − 1. Thus, our neural network can be described as

z_i = σ(W_i z_{i−1} + b_i)  ∀ i ∈ [L − 1],   z_L = W_L z_{L−1} + b_L,

where z_0 = x is the input to the neural network, z_i is the output of layer i, and [N] denotes the set {1, . . . , N}. Note that the activation function is not applied at the last layer. Further, we use [z]_i to denote the i-th element of the vector z. We consider a supervised classification task where z_L represents the logits. To explicitly show the dependence of z_L on the input data, we use the notation z_L(x) to denote the logit values when x is used as the input data point.

Given an input x_0 with ground-truth label y and a perturbation set C(x_0, ϵ) (e.g., C(x_0, ϵ) = {x | ‖x − x_0‖_∞ ≤ ϵ}), the network is provably robust against adversarial attacks on x_0 if

0 ≤ min_{x ∈ C(x_0, ϵ)} c_{yk}^T z_L(x),  ∀ k ≠ y,  (4.1)

where c_{yk} = e_y − e_k, with e_k (resp. e_y) denoting the standard unit vector whose k-th (resp. y-th) entry is 1 and whose other entries are zero. Condition (4.1) implies that the logit score of the network for the true label y is always greater than that of any other label k for all x ∈ C(x_0, ϵ). Thus, the network correctly classifies all points inside C(x_0, ϵ). The objective function in (4.1) is non-convex when L ≥ 2. It is customary in many works to move the non-convexity of the problem to the constraint set and reformulate (4.1) as

0 ≤ min_{z ∈ Z(x_0, ϵ)} c_{yk}^T z,  ∀ k ≠ y,  (4.2)
where Z(x_0, ϵ) = {z | z = z_L(x) for some x ∈ C(x_0, ϵ)}. This verification problem has a linear objective function and a non-convex constraint set. Since both problems (4.1) and (4.2) are non-convex, existing works have proposed efficiently computable lower bounds on their optimal objective values. For example, [130, 348] utilize convex relaxation, while [327, 344] rely on mixed-integer programming and branch-and-bound to find lower bounds for the optimal objective value of (4.2). In what follows, we explain two popular and relatively successful approaches for solving the verification problem (4.1) (or equivalently (4.2)) in detail.
4.2.2 Verification of neural networks via IBP
Interval Bound Propagation (IBP) of [130] tackles problem (4.2) by convexifying the constraint set Z(x_0, ϵ) to its convex hypercube super-set [z(x_0), z̄(x_0)], i.e., Z(x_0, ϵ) ⊆ [z(x_0), z̄(x_0)]. After this relaxation, problem (4.2) can be lower-bounded by the convex problem:

min_{z(x_0) ≤ z ≤ z̄(x_0)} c_{yk}^T z.  (4.3)

The upper and lower bounds z̄(x_0) and z(x_0) are obtained by recursively finding a convex relaxation of the image of the set C(x_0, ϵ) at each layer of the network. In particular, for the adversarial set C(x_0, ϵ) = {x | ‖x − x_0‖_∞ ≤ ϵ}, we start from z_0(x_0) = x_0 − ϵ1 and z̄_0(x_0) = x_0 + ϵ1. Then, the lower bound z_L(x_0) and upper bound z̄_L(x_0) are computed by the recursions, for all i ∈ [L]:

z̄_i(x_0) = σ( W_i^T (z̄_{i−1} + z_{i−1}) / 2 + |W_i^T| (z̄_{i−1} − z_{i−1}) / 2 ),
z_i(x_0) = σ( W_i^T (z̄_{i−1} + z_{i−1}) / 2 − |W_i^T| (z̄_{i−1} − z_{i−1}) / 2 ).  (4.4)

Note that |W| denotes the element-wise absolute value of the matrix W. One of the main advantages of IBP is its computational efficiency: verification of a given input only requires two forward passes to find the lower and upper bounds, followed by a linear program.
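A minimal NumPy sketch of this interval propagation is given below, using the convention z_i = σ(W_i z_{i−1} + b_i) with the last layer affine, as in the network definition; the weights are random placeholders.

```python
import numpy as np

def ibp_bounds(weights, biases, x0, eps):
    """Propagate element-wise lower/upper bounds through a ReLU network."""
    lb, ub = x0 - eps, x0 + eps
    for i, (W, b) in enumerate(zip(weights, biases)):
        mid, rad = (ub + lb) / 2.0, (ub - lb) / 2.0
        new_mid = W @ mid + b
        new_rad = np.abs(W) @ rad                 # |W| widens the interval
        lb, ub = new_mid - new_rad, new_mid + new_rad
        if i < len(weights) - 1:                  # ReLU on hidden layers only
            lb, ub = np.maximum(lb, 0.0), np.maximum(ub, 0.0)
    return lb, ub

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 4)), rng.normal(size=(3, 8))]
bs = [np.zeros(8), np.zeros(3)]
lb, ub = ibp_bounds(Ws, bs, x0=rng.normal(size=4), eps=0.1)
```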
4.2.3 Verification of neural networks via β-Crown
Despite its simplicity, IBP-based verification comes with a certain limitation, namely the looseness of its layer-by-layer bounds. To overcome this limitation, tighter verification methods have been proposed in the literature [308, 368, 90, 344]. Among these, β-Crown [344] utilizes the branch-and-bound technique to generalize and improve the IBP-CROWN approach proposed in [368]. Let z_i and z̄_i be the estimated element-wise lower and upper bounds on the pre-activation values of layer i, i.e., z_i ≤ z_i ≤ z̄_i, where these lower and upper bounds are obtained by the method in [368], and let ẑ_i be the value obtained by applying the ReLU function to z_i. A neuron is called unstable if its sign after applying the ReLU activation cannot be determined based only on the corresponding lower and upper bounds; that is, a neuron is unstable if z_i < 0 < z̄_i. For stable neurons, no relaxation is needed to enforce convexity of σ(z) (since the neuron operates in a linear regime). On the other hand, given an unstable neuron, β-Crown uses a branch-and-bound (BAB) approach to split the input range of the neuron into two sub-domains, C_il = {x ∈ C(x_0, ϵ) | ẑ_i ≤ 0} and C_iu = {x ∈ C(x_0, ϵ) | ẑ_i > 0}. The neuron operates linearly within each sub-domain, so we can verify each sub-domain separately. If there are N unstable nodes, the BAB algorithm requires the investigation of 2^N sub-domains in the worst case. β-Crown proposes a heuristic for traversing these sub-domains: the higher the absolute value of the corresponding lower bound of a node, the sooner the verifier visits it. For verifying each sub-problem, [344] proposed a lower bound that requires solving a maximization problem over two parameters α and β:

min_{z ∈ Z(x_0, ϵ)} c_{yk}^T z ≥ max_{α, β} g(x, α, β),  where g(x, α, β) = (a + P_α β)^T x + q_α^T β + d_α.  (4.5)
Here, the matrix P_α and the vectors q_α, a, and d_α are functions of the W_i, b_i, z_i, z̄_i, and α parameters; see Appendix 10.2.4 for the precise definition of g. Notice that any choice of (α, β) provides a valid lower bound for verification. However, optimizing over α and β in (4.5) leads to a tighter bound.
4.2.4 Training a joint robust classifier and detector
[303] improves the performance tradeoff on natural and adversarial examples by introducing an auxiliary class for detecting adversarial examples. If this auxiliary class is selected as the output, the network "abstains" from declaring any of the original K classes for the given input. Let a be the abstain class. The classification network performs correctly on an adversarial image if it is classified correctly as the class of the original (unperturbed) image (similar to robust networks without detectors) or it is classified as the abstain class (detected as an adversarial example). Hence, for an input image (x_0, y), the network is verified against a certain class k ≠ y if

0 ≤ min_{z ∈ Z(x_0, ϵ)} max( c_{yk}^T z, c_{ak}^T z ),  (4.6)

i.e., if the score of the true label y or the score of the abstain class a is larger than the score of class k.

To train a neural network that can jointly detect and classify a dataset of images, [303] relies on a loss function of the form

L_Total = L_Robust + λ_1 L_Robust^Abstain + λ_2 L_Natural,  (4.7)

where the term L_Natural denotes the natural loss when no adversarial examples are considered. More precisely, L_Natural = (1/n) Σ_{i=1}^{n} ℓ_xent( z_L(x_i), y_i ), where ℓ_xent is the standard cross-entropy loss. The term L_Robust in (4.7) represents the worst-case adversarial loss used in [219], without considering the abstain class. Precisely,

L_Robust = max_{δ_1, . . . , δ_n} (1/n) Σ_{i=1}^{n} ℓ_xent( z_L(x_i + δ_i), y_i )  s.t. ‖δ_i‖_∞ ≤ ϵ, ∀ i = 1, . . . , n.
Finally, the robust-abstain loss L_Robust^Abstain is the minimum of the detector and classifier losses:

L_Robust^Abstain = max_{δ_1, . . . , δ_n} (1/n) Σ_{i=1}^{n} min( ℓ_xent( z_L(x_i + δ_i), y_i ), ℓ_xent( z_L(x_i + δ_i), a ) )  s.t. ‖δ_i‖_∞ ≤ ϵ, ∀ i = 1, . . . , n.  (4.8)

In (4.7), tuning λ_1 and λ_2 controls the trade-off between standard and robust accuracy. Furthermore, to obtain non-trivial results, the IBP relaxation should be incorporated during training for the minimization sub-problems in L_Robust and L_Robust^Abstain [303, 130].
4.3 Verification of Neural Networks with Multiple Detection Classes
Motivation: The set of all adversarial images that can be generated within the ϵ-neighborhood of clean
images might not be detectable only by a single detection class. Hence, the robust verified accuracy of
the joint classifier and detector can be enhanced by introducing multiple abstain classes instead of a single abstain class to detect adversarial examples. This observation is illustrated in a simple example in
Appendix 10.2.6 where we theoretically show that 2 detection classes can drastically increase the performance of the detector compared to 1 detection class. Note that a network with multiple detection classes
can be equivalently modeled by another network with one more layer and a single abstain class. This added
layer, which can be a fully connected layer with a max activation function, can merge all abstain classes
and collapse them into a single class. Thus, any L-layer neural network with multiple abstain classes can
be equivalently modeled by an (L + 1)-layer neural network with a single abstain class. However, the performance of verifiers such as IBP degrades as we increase the number of layers, because increasing the number of layers leads to looser bounds in (4.4) for the last layer. To illustrate this fact, Figure 4.1 shows that the number of points verified for a 2-layer neural network is higher than the number of points verified for an equivalent network with 3 layers.

Figure 4.1: IBP verification for 400 input data points of 2-layer and 3-layer neural networks. Part (a) shows the four labels assigned to the data points. Part (b) demonstrates that IBP can verify 14 points using one of two abstain classes (black triangles), while it cannot verify 13 data points (red ×). Part (c) shows that when IBP is applied to a network with one more layer and one detection class, 8 points are verified by the detection class, while it fails to verify 21 points. The description of both networks can be found in Appendix 10.2.7.

Thus, it is beneficial to train/verify the original L-layer neural network with multiple abstain classes instead of an (L + 1)-layer network with a single abstain class. This fact will be illustrated further in the experiments on the MNIST and CIFAR-10 datasets depicted in Figure 4.2. Next, we present how one can verify a network with multiple abstain classes:
Let a_1, a_2, . . . , a_M be M abstain classes detecting adversarial samples. A sample is considered adversarial if the network's output is any of the M abstain classes. A neural network with K regular classes and M abstain classes outputs the label of a given sample as ŷ(x) = argmax_{i ∈ {1, . . . , K, a_1, . . . , a_M}} [z_L(x)]_i. An input (x, y) is verified if the network either correctly classifies it as class y or assigns it to any of the M explicit abstain classes. More formally, and following equation (4.6), the neural network is verified for input x_0 against a target class k if

0 ≤ min_{z_L ∈ Z(x_0, ϵ)} max( c_{yk}^T z_L, c_{a_1 k}^T z_L, . . . , c_{a_M k}^T z_L ).  (4.9)
Since the set Z(x0, ϵ) is nonconvex, verifying (4.9) is computationally expensive. The next subsections
present two verifiers for solving (4.9) based on IBP and β-Crown.
4.3.1 Verification with IBP
Using the IBP mechanism to relax the non-convex set Z(x_0, ϵ) leads to the following result for a given network with M detection classes:

Theorem 6 Condition (4.9) is satisfied if, for all k ≠ y, the quantity

min_{η ∈ P} max_{z_{L−1} ∈ [z_{L−1}, z̄_{L−1}]} −c_k(η)^T (W_L z_{L−1} + b_L)  (4.10)

is greater than or equal to zero, where P = {(η_0, . . . , η_M) | Σ_{i=0}^{M} η_i = 1, η_i ≥ 0, ∀ i = 0, 1, . . . , M} and c_k(η) = η_0 c_{yk} + η_1 c_{a_1 k} + · · · + η_M c_{a_M k}. Here, the bounds z_{L−1} and z̄_{L−1} are obtained according to (4.4).

Unlike (4.9), the condition in (4.10) is easy to verify computationally. To understand this, let us define

J_k(η) = max_{z_{L−1} ∈ [z_{L−1}, z̄_{L−1}]} −c_k(η)^T (W_L z_{L−1} + b_L).  (4.11)

Then, our aim in (4.10) is to minimize J_k(η) over P.
First, notice that the maximization problem (4.11) can be solved in closed form as described in Step 4 of Algorithm 6. Consequently, one can rely on Danskin's Theorem [87] to compute the subgradient of the function J_k(·). Thus, to minimize J_k(·) in (4.10), we can rely on the Bregman proximal (sub)gradient method (see [139] and the references therein). This algorithm is guaranteed to find an ϵ-accurate solution to (4.10) in T = O(1/√ϵ) iterations; see [139, Corollary 2].
Algorithm 6 IBP verification of the network with multiple detection classes against class k
1: Parameters: step size ν > 0, number of iterations T.
2: Initialize η_0 = 1 and η_1 = · · · = η_M = 0.
3: for t = 0, 1, . . . , T do
4:   [z*_{L−1}]_j = [z_{L−1}]_j if [W_L^T c_k(η^t)]_j ≥ 0, and [z̄_{L−1}]_j otherwise.
5:   η_m^{t+1} = η_m^t exp(−2ν (z*^t_{L−1})^T W_L^T c_{a_m k}) / ( Σ_{j=0}^{M} η_j^t exp(−2ν (z*^t_{L−1})^T W_L^T c_{a_j k}) ), for all m ∈ [M], where a_0 = y.
Remark 7 (Time Complexity Comparison) [303] finds J_k(η) using sorting and comparing a finite number of calculated values based on the weight matrices of the last two layers. While their algorithm finds the exact solution for η, it is limited to the one-dimensional case (M = 1). In this scenario, their algorithm requires O(n_{L−1} log(n_{L−1})) evaluations to find the optimal η, where n_{L−1} represents the number of nodes in the next-to-last layer. Alternatively, our algorithm gives an ϵ-optimal solution in O(1/ϵ) iterations. By choosing ϵ = O(1/n_{L−1}), Algorithm 6 has almost the same complexity as Algorithm 1 in [303] in the one-dimensional case. When M > 1, their algorithm cannot be utilized, while our algorithm finds the solution with the same order of complexity. Note that the order complexity, in this case, is linear with respect to M.
4.3.2 Verification with β-Crown
IBP verification is computationally efficient but less accurate than β–Crown. Hence, we focus on β–Crown
verification of networks with multiple abstain classes in this section to obtain a tighter verifier. To this
end, we will find a sufficient condition for (4.9) using the lower-bound technique of (4.5) in β–Crown.
70
In particular, by switching the minimization and maximization in (4.9) and using the β–Crown lower
bound (4.5), we can find a lower-bound of the form
    min_{z_L ∈ Z(x_0, ϵ)} max{ c_{yk}^T z_L, c_{a_1 k}^T z_L, . . . , c_{a_M k}^T z_L } ≥ max_{η ∈ P, α, β ≥ 0} G(x_0, α, β, η).    (4.12)
The details of this inequality and the exact definition of the function G(·) are provided in Appendix 10.2.5. Note that any feasible solution to the right-hand side of (4.12) is a valid lower bound for the original verification problem (left-hand side). Thus, for (4.9) to be satisfied, it suffices to find a feasible (α, β, η) such that G(x_0, α, β, η) ≥ 0. To optimize the right-hand side of (4.12) in Algorithm 7, we utilize the AutoLirpa library of [368] for updating α and use the Bregman proximal subgradient method to update β and η – see Appendix 10.2.2. We use the Euclidean-norm Bregman divergence for updating β and the Shannon-entropy Bregman divergence for η to obtain closed-form updates.
Algorithm 7 β–Crown verification of networks with multiple detection classes
1: Input: number of iterations T, number of inner-loop iterations T_0, step-size γ.
2: for t = 0, 1, . . . , T do
3:   Update α using AutoLirpa [368].
4:   for k = 0, 1, . . . , T_0 do
5:     β = [β + γ ∂G(x_0, α, β, η)/∂β]_+
6:     η^{new}_m = η^{old}_m exp(2γ ∂G(x_0, α, β, η)/∂η_m) / Σ_{j=0}^M η^{old}_j exp(2γ ∂G(x_0, α, β, η)/∂η_j), ∀m ∈ [M].
Here [w]_+ = max(w, 0) denotes the projection onto the non-negative orthant in Algorithm 7.
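As a minimal illustration of Steps 5 and 6, the snippet below sketches one inner iteration of the dual updates. The gradients of G are assumed to be supplied externally (e.g., by automatic differentiation), since G itself is defined only in Appendix 10.2.5:

    import numpy as np

    def dual_step(beta, eta, grad_beta, grad_eta, gamma):
        # One pass of Steps 5-6 in Algorithm 7 (a sketch, not the official code).
        # grad_beta and grad_eta stand for dG/dbeta and dG/deta at the current
        # point; how they are computed depends on the definition of G.
        beta = np.maximum(beta + gamma * grad_beta, 0.0)   # Euclidean step + [.]_+ projection
        logits = 2.0 * gamma * grad_eta
        logits -= logits.max()                             # stabilize the exponentials
        eta = eta * np.exp(logits)
        return beta, eta / eta.sum()                       # entropic step stays on the simplex P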
4.4 Training of Neural Networks with Multiple Detection Classes
We follow a combination of loss functions similar to (4.7) to train a neural network with multiple abstain classes. While the last term (L_Natural) can be computed efficiently, the first and second terms cannot, because even evaluating the functions L_Robust and L^{Abstain}_{Robust} requires maximizing nonconcave functions. Thus, we will minimize their upper bounds instead of minimizing these two terms. Particularly, following [302, Equation (17)], we use L̄_Robust as an upper bound on L_Robust. This upper bound is obtained by the IBP relaxation procedure described in Subsection 4.2.2. To obtain an upper bound for the robust-abstain loss term L^{Abstain}_{Robust} in (4.7), let us first clarify its definition in the multi-abstain-class scenario:
    L^{Abstain}_{Robust} = max_{δ_1, . . . , δ_n} (1/n) Σ_{i=1}^n min{ ℓ_xent(z_L(x_i + δ_i), y_i), min_{m ∈ [M]} ℓ_xent(z_L(x_i + δ_i), a_m) }.    (4.13)
This definition implies that the classification is considered “correct” for a given input if the predicted label
is the ground-truth label or if it is assigned to one of the abstain classes. Since the maximization problem
w.r.t. {δ_i} is nonconcave, it is hard even to evaluate L^{Abstain}_{Robust}. Thus, we minimize an efficiently computable upper bound of this loss function, as described in Theorem 8.
Theorem 8 Let ℓ^{Abstain}_{Robust}(x, y) be defined as

    max_{∥δ∥ ≤ ϵ} min{ ℓ_xent(z_L(x + δ), y), min_{m ∈ [M]} ℓ_xent(z_L(x + δ), a_m) }.

Then,

    ℓ^{Abstain}_{Robust}(x, y) ≤ ℓ̄^{Abstain}_{Robust}(x, y) = ℓ_{xent∖A_0}(J(x), y),    (4.14)

where J(x) is a vector whose k-th component equals J_k(x) as defined in (4.11), and

    ℓ_{xent∖A_0}(x_0, y) := −log( exp(e_y^T z_L(x_0)) / Σ_{i ∈ I∖A_0} exp(e_i^T z_L(x_0)) ).

Here, I = {1, . . . , K, a_1, . . . , a_M} is the set of all classes (true labels and abstain classes) and A_0 = {a_1, . . . , a_M} is the set of abstain classes.
Notice that the definition of ℓ_{xent∖A_0}(x_0, y) removes the terms corresponding to the abstain classes from the denominator. This definition is less restrictive toward abstain classes than toward incorrect classes: for a given sample, it is more advantageous for the network to assign it to an abstain class than to misclassify it. This mechanism enhances the network's ability to detect adversarial examples via the abstain classes, while it does not adversely affect the performance of the network on natural samples. Note that during the evaluation/test phase, this loss function does not change the final prediction of the network for a given input, since the winner (the entry with the highest score) remains the same. Overall, we upper-bound the loss in (4.7) by replacing L_Robust with the IBP relaxation utilized in [130, 303] and replacing L^{Abstain}_{Robust} with L̄^{Abstain}_{Robust} = (1/n) Σ_{i=1}^n ℓ̄^{Abstain}_{Robust}(x_i, y_i) from Theorem 8. Thus, our total training loss can be presented as

    L_Total = L̄_Robust + λ_1 L̄^{Abstain}_{Robust} + λ_2 L_Natural.    (4.15)
Algorithm 8 describes the procedure of optimizing (4.15) on a joint classifier and detector with multiple
abstain classes.
Algorithm 8 Robust Neural Network Training
1: Input: batches of data D_1, . . . , D_R, step-size ν, θ(L): optimization parameters for loss L.
2: for t = 1, . . . , R do
3:   Compute J_o(x) ∀x ∈ D_t, ∀o ∈ [K] by Algorithm 6.
4:   Compute L̄_Robust on batch D_t as in [130].
5:   Compute L̄^{Abstain}_{Robust} on batch D_t using Theorem 8.
6:   θ(L) = θ(L) − ν ∇_θ ( L̄_Robust + λ_1 L̄^{Abstain}_{Robust} + λ_2 L_Natural )
4.4.1 Addressing model degeneracy
Having multiple abstain classes can potentially increase the capacity of our classifier to detect adversarial examples. However, as we will see in Figure 4.4 (10 abstains, unregularized), several abstain classes collapse together and capture similar adversarial patterns. This phenomenon, referred to as "model degeneracy" and illustrated with an example in Appendix 10.2.6, prevents us from fully utilizing all abstain classes.
Figure 4.2: Performance of multiple-abstain shallow networks on the MNIST and CIFAR-10 datasets. We compare multiple-abstain neural networks (both regularized and unregularized versions) with single-abstain networks and with networks that have one more layer. The top and bottom rows show the trade-off between standard and robust verified accuracy on MNIST and CIFAR-10, respectively.
To address this issue, we impose a regularization term on the loss function so that the network utilizes all abstain classes in balance. We aim to ensure that the η values are distributed nearly uniformly and that there are no idle abstain classes. Let η^{ik}, z_{L−1}(x_i), and y_i denote the abstain vector for sample x_i verified against target class k, the output of layer L − 1, and the label assigned to data point x_i, respectively. The regularized verification problem over n given samples then takes the following form:

    min_{η^1, . . . , η^n ∈ P} Σ_{i=1}^n Σ_{k ≠ y_i} max_{z̲(x_i) ≤ z_{L−1} ≤ z̄(x_i)} −c_k(η^{ik})^T f(z_{L−1}) + µ ∥ [ γ·1/(M+1) − ( Σ_{j=1}^n Σ_{o ≠ y_j} η^{jo} ) / ( n(K−1) ) ]_+ ∥²,    (4.16)
where f(z_{L−1}) = W_L z_{L−1} + b_L. The above regularizer penalizes the objective if the average η coefficient of a given abstain class over the batch samples falls below a threshold determined by the hyper-parameter γ. In other words, if an abstain class does not contribute enough to detecting adversarial samples, it is penalized accordingly. Note that the larger γ is, the more an idle abstain class is penalized.
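A minimal sketch of this penalty, assuming the η^{ik} vectors of a batch have been collected into a single array, is:

    import numpy as np

    def degeneracy_penalty(etas, gamma, mu, M):
        # Sketch of the regularizer in (4.16). etas has shape (P, M + 1): one
        # simplex vector eta^{ik} per (sample i, target class k != y_i) pair in
        # the batch, so etas.mean(axis=0) is the batch-average eta vector.
        shortfall = np.maximum(gamma / (M + 1) - etas.mean(axis=0), 0.0)
        return mu * np.sum(shortfall ** 2)   # only under-used abstain classes are penalized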
Note that in the unregularized case, the parameters η^{ik} are optimized independently of each other. In contrast, adding the regularizer in (4.16) couples the η^{ik} of different samples and target classes, so they must be optimized jointly. Since optimizing (4.16) over all n data points is infeasible for datasets with many samples, we solve the problem over small data batches (of size ≤ 64). We utilize the same Bregman divergence procedure as in Algorithm 6, while the gradient with respect to η^{ik} also takes the regularization term into account.
Hyper-parameter tuning compared to the single-abstain scenario. In contrast to [303], our methodology has an additional hyper-parameter M (the number of abstain classes). Tuning this hyper-parameter can be costly since each run requires changing the architecture (and potentially the step-sizes of the algorithm), which can be viewed as additional computational overhead for training. However, as we observed in our experiments, setting M = K combined with our regularization mechanism always leads to significant performance gains over training with a single abstain class; thus, tuning M is unnecessary when the regularization mechanism is applied. In that case, we have two other hyper-parameters. The value of γ(K + M) is chosen from the set {0.1, 0.2, 0.5, 1, 2, 5, 10}; the experiments show that the value 1 consistently works best across different datasets and architectures. Further, µ in (4.16) is the regularization-mechanism hyper-parameter; we determine µ by cross-validation over a wide range of values in [0.1, 10]. The experiments show that the optimal µ lies in [0.8, 1.5], depending on the network, dataset, and ϵ.
4.5 Numerical results
We devise diverse experiments on shallow and deep networks to investigate the effectiveness of joint
classifiers and detectors with multiple abstain classes.
4.5.1 Training Setup
To train the networks on the MNIST and CIFAR-10 datasets, we use Algorithm 8 as part of an optimizer scheduler. Initially, we set λ_1 = λ_2 = 0, so the network is trained without considering any abstain classes. In the second phase, we optimize the objective function (4.15), linearly increasing ϵ from 0 to ϵ_train. Finally, we further tune the network at the fixed value ϵ = ϵ_train. On both MNIST and CIFAR-10, we use the Adam optimizer with learning rate 5 × 10^{−4}. The networks are trained on four NVIDIA V100 GPUs. The trade-off between standard accuracy on clean images and robust verified accuracy can be tuned by changing λ_2 from 0 to +∞, where larger values correspond to more robust networks. For the networks with the regularizer addressing the model-degeneracy issue, we tune γ in the interval [0.1/(K+M), 1.5/(K+M)]. Our observations on both MNIST and CIFAR-10 across different ϵ values show that the optimal γ is consistently close to 1/(K+M). Thus, we suggest choosing the hyper-parameter γ = 1/(K+M), where K is the number of labels and M is the number of detection classes. The optimal value of M is 4 for CIFAR-10 and 3 for MNIST. With the "model degeneracy" regularizer, the obtained network has nearly the same performance for M ∈ [4, 2K]. Overall, we suggest M = K and γ = 1/(K+M) as default values for the hyper-parameters M and γ.

The model's robust verified accuracy generally increases as λ_2 grows from 0 to large values; in exchange, the standard accuracy of the model decreases. Therefore, λ_2 determines the trade-off between standard and robust verified accuracy. The trade-off curves presented in the figures are obtained by varying λ_2 from 0 to 100.
Figure 4.3: Performance of shallow networks with multiple detection classes, without regularization, with regularization, and with one more layer, on the CIFAR-10 dataset. As demonstrated, the regularized networks work consistently well for 4 ≤ M ≤ 10, very close to the best performance obtained by unregularized networks with 4 detection classes.
4.5.2 Robust Verified Accuracy on Shallow Networks
In the first set of experiments, depicted in Figure 4.2, we compare the performance of shallow networks with the optimized number of abstain classes to the single-abstain network, the network with an additional layer, and the network with M = K regularized by Equation (4.16). The shallow networks have one convolutional layer of size 256 for MNIST and 1024 for CIFAR-10, connected to the second (last) layer consisting of K + M (20 for both MNIST and CIFAR-10) nodes. The optimal number of abstain classes is obtained by varying it from 1 to 20 on both datasets; the optimum is M = 3 for the network trained on MNIST and M = 4 for CIFAR-10. Moreover, we compare the optimal multi-abstain shallow network to two other baselines. One is the network whose number of abstain classes equals the number of regular classes (M = K), trained with the regularizer described in (4.16). The other is a network with one layer more than the shallow network, which has K + M nodes in its second-to-last layer and K + 1
Figure 4.4: Distribution of natural and adversarial images over different abstain classes on the CIFAR-10 dataset. When there are 10 abstain classes, model degeneracy leads to lower performance than the baseline. Adding the regularization term (rightmost column) utilizes all abstain classes and enhances standard and robust verified accuracy. Standard accuracy is the proportion of correctly classified natural images, while robust verified accuracy is the proportion of images that are robust against all adversarial attacks within the ϵ-neighborhood (ϵ = 16/255).
nodes in the last layer. Ideally, the set of models that can be supported by such a network is a superset of those of the original shallow network; therefore, it has more capacity to classify images and detect adversarial attacks. However, because the IBP training procedure is sensitive to the number of layers (the more layers, the looser the lower and upper bounds), we obtain better results with the original network with multiple abstain classes.
Next, in Figure 4.3, we investigate the effect of changing the number of abstain classes of the shallow network described above. We observe that the unregularized network and the network with one more layer are much more sensitive to changes in M than the regularized version. This means we can use the regularized network with the same performance, without tuning for the optimal M. In the unregularized version, increasing the number of abstain classes from M = 1 to M = 4 improves performance.
Figure 4.5: Performance of β-Crown on verification of neural networks with a single abstain class, 4 abstain classes, 10 abstain classes with regularization, and networks with one more layer (single abstain) on the CIFAR-10 dataset. M = 1 coincides with [303].
Beyond this threshold, however, performance drops gradually, such that for M = 10, where the numbers of labels and abstain classes are equal (M = K = 10), the network performs even worse than the single-abstain network due to the model degeneracy of the multi-abstain network. In contrast, the network trained with the regularized loss maintains its performance as M grows beyond the optimal value.
Figure 4.4 shows the percentage of adversarial examples captured by each abstain class (M = 10) on CIFAR-10 for both regularized and unregularized networks. The hyper-parameter γ is set to 1/(K+M) = 0.05. Next, Figure 4.5 illustrates the performance of the networks trained in the first set of experiments when verified by β-Crown. The networks verified by β-Crown show a 1% to 2% improvement in robust accuracy compared to the same networks verified by IBP.
4.5.3 Performance on Deep Neural Networks
Networks with multiple abstain classes also achieve a superior trade-off between standard and verified robust accuracy on deep networks. To demonstrate their performance compared to state-of-the-art approaches, we trained deep neural networks on the MNIST and CIFAR-10 datasets with different ϵ values. The results are reported in Table 10.3. The structure of the trained deep network is the same as in [303] (see Appendix 10.2.1).
Method                 | Standard Error | PGD Attack Success
Madry [219]            | 2.91%          | 11.35%
IBP                    | 2.27%          | 6.68%
IBP Crown              | 2.17%          | 12.06%
Balunovic [30]         | 2.77%          | 15.31%
Sheikholeslami [303]   | 4.74%          | 4.15%
Aquino [17]            | 4.81%          | 5.86%
Our Method             | 4.97%          | 3.91%

Table 4.1: Standard error and PGD attack success rate on the MNIST dataset for different state-of-the-art approaches. The chosen ϵ for the PGD attack equals 0.4.
Method                 | Standard Error | PGD Attack Success
Madry [219]            | 49.78%         | 68.48%
IBP                    | 50.51%         | 65.23%
IBP Crown              | 54.02%         | 65.42%
Balunovic [30]         | 48.3%          | 69.81%
Sheikholeslami [303]   | 55.60%         | 63.63%
Aquino [17]            | 50.25%         | 64.94%
Our Method             | 56.44%         | 60.29%

Table 4.2: Standard error and PGD attack success rate on the CIFAR-10 dataset for different state-of-the-art approaches. The chosen ϵ for the PGD attack equals 8/255.
While the verified robust accuracy guarantees the robustness of the networks against all attacks within
the ϵ-neighborhood of each given test data point, one can argue that in many practical situations, being
robust against certain adversarial attacks, such as the PGD attack [219], is sufficient. Table 4.1 and Table 4.2 demonstrate the performance of several state-of-the-art approaches for training robust neural networks against adversarial attacks on the MNIST and CIFAR-10 datasets, respectively. Performance is evaluated by standard accuracy and by robustness against the PGD attack on the test samples. The chosen ϵ is
0.4 for MNIST and 8/255 for CIFAR-10, and the attacks are applied to each test sample using 100 iterations
of the projected gradient descent (PGD). Our method achieves the best robustness against PGD attacks on
both MNIST and CIFAR-10 datasets.
4.6 Conclusion
We improved the trade-off between standard accuracy and robust verified accuracy of shallow and deep neural networks by introducing a training mechanism for networks with multiple abstain classes. We observed that increasing the number of abstain classes can result in the "model degeneracy" phenomenon, where not all abstain classes are utilized and regular training can lead to solutions with poor standard and robust verified accuracy. To avoid model degeneracy when the number of abstain classes is large, we proposed a regularization scheme that penalizes the network if it does not utilize all abstain classes in balance. Our experiments demonstrate the superiority of the trained shallow and deep networks over state-of-the-art approaches on the MNIST and CIFAR-10 datasets. We have used multiple detection classes to improve the performance of verifiable neural networks; however, their application can extend beyond such networks, e.g., to detecting out-of-distribution samples or specific types of adversarial attacks.
Chapter 5
Rényi Fair Inference
This chapter is based on a paper published in ICLR 2020 [22] with the same title.
Machine learning algorithms have been increasingly deployed in critical automated decision-making
systems that directly affect human lives. When these algorithms are solely trained to minimize the training/test error, they could suffer from systematic discrimination against individuals based on their sensitive
attributes, such as gender or race. Recently, there has been a surge in the machine learning community to develop
algorithms for fair machine learning. In particular, several adversarial learning procedures have been proposed to impose fairness. Unfortunately, these algorithms either can only impose fairness up to linear
dependence between the variables, or they lack computational convergence guarantees. In this chapter,
we use Rényi correlation as a measure of the fairness of machine learning models and develop a general
training framework to impose fairness. In particular, we propose a min-max formulation that balances the
accuracy and fairness when solved to optimality. For the case of discrete sensitive attributes, we suggest
an iterative algorithm with theoretical convergence guarantees for solving the proposed min-max problem. Our algorithm and analysis are then specialized to fair classification and fair clustering problems. To
demonstrate the performance of the proposed Rényi fair inference framework in practice, we compare it
with well-known existing methods on several benchmark datasets. Experiments indicate that the proposed
method has favorable empirical performance against state-of-the-art approaches.
5.1 Introduction
As we experience the widespread adoption of machine learning models in automated decision-making,
we have witnessed increased reports of instances in which the employed model results in discrimination
against certain groups of individuals – see [91, 321, 49, 15]. In this context, discrimination is defined as
the unwanted distinction against individuals based on their membership to a specific group. For instance,
[15] present an example of a computer-based risk assessment model for recidivism, which is biased against
certain ethnicities. In another example, [91] demonstrate gender discrimination in online advertisements
for web pages associated with employment. These observations motivated researchers to pay special attention to fairness in machine learning in recent years; see [60, 115, 141, 366, 354, 106, 116, 349, 362, 361,
254, 34, 200].
In addition to its ethical standpoint, equal treatment of different groups is legally required by many
countries [4]. Anti-discrimination laws imposed by many countries evaluate fairness by notions such
as disparate treatment and disparate impact. We say a decision-making process suffers from disparate
treatment if its decisions discriminate against individuals of a certain protected group based on their sensitive/protected attribute information. On the other hand, we say it suffers from disparate impact if the
decisions adversely affect a protected group of individuals with certain sensitive attributes – see [361]. In
simpler words, disparate treatment is intentional discrimination against a protected group, while disparate
impact is an unintentional disproportionate outcome that hurts a protected group. To quantify fairness,
several notions of fairness have been proposed in the recent decade [58, 141]. Examples of these notions
include demographic parity, equalized odds, and equalized opportunity.
Demographic parity condition requires that the model output (e.g., assigned label) be independent of
sensitive attributes. This definition might not be desirable when the base ground-truth outcome of the
two groups are completely different. This shortcoming motivated the use of equalized odds notion [141],
which requires that the model output be conditionally independent of sensitive attributes given the ground-truth label. Finally, equalized opportunity requires having equal false positive or false negative rates across
protected groups.
Machine learning approaches for imposing fairness can be broadly classified into three main categories:
pre-processing methods, post-processing methods, and in-processing methods. Pre-processing methods
modify the training data to remove discriminatory information before passing data to the decision-making
process [58, 115, 164, 165, 166, 105, 60, 287]. These methods map the training data to a transformed space in
which the dependencies between the class label and the sensitive attributes are removed [108, 141, 354, 294,
262, 218, 364, 208]. On the other hand, post-processing methods adjust the output of a trained classifier to
remove discrimination while maintaining high classification accuracy [116, 106, 349]. The third category is
the in-process approach that enforces fairness by either introducing constraints or adding a regularization
term to the training procedure [362, 361, 254, 34, 43, 5, 67, 100, 276, 168, 366, 34, 169, 225, 9]. The Rényi
fair inference framework proposed in this chapter also belongs to this in-process category.
Among in-processing methods, many add a regularization term or constraints to promote statistical
independence between the classifier output and the sensitive attributes. To do that, various independence
proxies such as mutual information [168], false positive/negative rates [34], equalized odds [100], Pearson correlation coefficient [361, 362], Hilbert Schmidt independence criterion (HSIC) [254] were used. As
will be discussed in Section 5.2, many of these methods cannot capture nonlinear dependence between
random variables and/or lead to computationally expensive algorithms. Motivated by these limitations,
we propose to use Rényi correlation to impose several known group fairness measures. Rényi correlation
captures nonlinear dependence between random variables. Moreover, Rényi correlation is a normalized
measure and can be computed efficiently in certain instances. Using Rényi correlation coefficient as a regularization term, we propose a min-max optimization framework for fair statistical inference. In particular,
we specialize our framework to both classification and clustering tasks. We show that when the sensitive
attribute(s) is discrete (e.g., gender and/or race), the learning task can be efficiently solved to optimality
using a simple gradient ascent-descent approach. We summarize our contributions next:
• We introduce Rényi correlation as a tool to impose several notions of group fairness. Unlike Pearson
correlation and HSIC, which only capture linear dependence, Rényi correlation captures any statistical dependence between random variables, as zero Rényi correlation implies independence. Moreover, it is computationally more efficient than mutual-information regularizers approximated by neural networks.
• Using Rényi correlation as a regularization term in training, we propose a min-max formulation for fair
statistical inference. Unlike methods that use an adversarial neural network to impose fairness, we show that in particular instances, such as binary classification or discrete sensitive variable(s), it suffices to use a simple quadratic function as the adversarial objective. This observation helped us develop a simple multi-step gradient ascent-descent algorithm for fair inference and guarantee its theoretical convergence to first-order stationarity.
• Our Rényi correlation framework leads to a natural fair classification method and a novel fair K-means clustering algorithm. For the K-means clustering problem, we show that a sufficiently large regularization coefficient yields perfect fairness under the disparate impact doctrine. Unlike the two-phase methods proposed in [73, 21, 286, 42, 296], our method does not require any pre-processing step, is scalable, and allows for regulating the trade-off between clustering quality and fairness.
5.2 Rényi Correlation
The most widely used notions for group fairness in machine learning are demographic parity, equalized
odds, and equalized opportunities. These notions require (conditional) independence between a certain
model output and a sensitive attribute. This independence is typically imposed by adding fairness constraints or regularization terms to the training objective function. For instance, [168] added a regularization
term based on mutual information. Since estimating mutual information between the model output and
sensitive variables during training is not computationally tractable, [168] approximates the probability
density functions using a logistic regression model. To have a tighter estimation, [315] used an adversarial
approach that estimates the joint probability density function using a parameterized neural network. Although these works start from a well-justified objective function, they end up solving approximations of
the objective function due to computational barriers. Thus, no fairness guarantee is provided even when
the resulting optimization problems are solved to global optimality in the large sample size limit.
A more tractable measure of dependence between two random variables is the Pearson correlation. The Pearson correlation coefficient between two random variables A and B is defined as ρ_P(A, B) = Cov(A, B) / ( √Var(A) √Var(B) ), where Cov(·, ·) denotes the covariance and Var(·) denotes the variance. The Pearson correlation coefficient is used in [361] to decorrelate the binary sensitive attribute and the decision boundary of the classifier. A major drawback of the Pearson correlation is that it only captures linear dependencies
between random variables. In fact, two random variables A and B may have strong dependence but have
zero Pearson correlation. This property raises concerns about the use of the Pearson correlation for imposing fairness. Similar to the Pearson correlation, the HSIC measure proposed in [254] may be zero even
if the two variables have strong dependencies. While universal kernels can be used to alleviate this issue, this comes at the expense of computational intractability. In addition, HSIC is not a normalized
dependence measure [138, 137] which raises concerns about the appropriateness of using it as a measure
of dependence.
In this chapter, we suggest using the Hirschfeld-Gebelein-Rényi correlation [274, 147, 124] as a dependence measure between random variables to impose fairness. The Rényi correlation, also known as maximal correlation, between two random variables A and B is defined as
    ρ_R(A, B) = sup_{f, g} E[f(A) g(B)]  s.t.  E[f(A)] = E[g(B)] = 0,  E[f²(A)] = E[g²(B)] = 1,    (5.1)
where the supremum is over the set of measurable functions f(·) and g(·) satisfying the constraints.
Unlike HSIC and Pearson correlation, Rényi correlation captures higher-order dependencies between random variables. Rényi correlation between two random variables is zero if and only if the random variables
are independent, and it is one if there is a strict dependence between the variables [274]. These favorable statistical properties of ρR do not come at the price of computational intractability as opposed to
other measures such as mutual information. In fact, as we will discuss in Section 5.3, ρR can be used in a
computationally tractable framework to impose several group fairness notions.
5.3 A General Min-Max Framework for Rényi Fair Inference
Consider a learning task over a given random variable Z. Our goal is to minimize the average inference loss L(·), where the loss function is parameterized by a parameter θ. To find the value of θ with the smallest average loss, we solve the following optimization problem
    min_θ E[L(θ, Z)],
where the expectation is taken over Z, and possible regularization terms are absorbed into the loss function L(·). Notice that this formulation is quite general and includes regression, classification, clustering, and dimensionality reduction tasks as special cases. For example, in linear regression, Z = (X, Y) and L(θ, Z) = (Y − θ^T X)², where X is a random feature vector and Y is the random target variable.
Assume that, in addition to minimizing the average loss, we are interested in bringing fairness to our learning task. Let S be a sensitive attribute (such as age or gender) and Ŷ_θ(Z) be a certain output of our inference task using parameter θ. Assume we are interested in removing/reducing the dependence between the random variable Ŷ_θ(Z) and the sensitive attribute S. To balance goodness-of-fit and fairness, one can solve the following optimization problem
    min_θ E[L(θ, Z)] + λ ρ_R²(Ŷ_θ(Z), S),    (5.2)
where λ is a positive scalar balancing fairness and goodness-of-fit. Notice that the above framework is
quite general. For example, Ŷ_θ may be the assigned label in a classification task, the assigned cluster in a clustering task, or the output of a regressor in a regression task.
Using the definition of Rényi correlation, we can rewrite the optimization problem (5.2) as

    min_θ sup_{f, g} E[L(θ, Z)] + λ ( E[f(Ŷ_θ(Z)) g(S)] )²
    s.t.  E[f(Ŷ_θ(Z))] = E[g(S)] = 0,  E[f²(Ŷ_θ(Z))] = E[g²(S)] = 1,    (5.3)
where the supremum is taken over the set of measurable functions. The next natural question to ask
is whether this optimization problem can be efficiently solved in practice. This question motivates the
discussions of the following subsection.
5.3.1 Computing Rényi Correlation
The objective function in (5.3) may be non-convex in θ in general. Several algorithms have recently been proposed for solving such non-convex min-max optimization problems [290, 244, 159]. Most of these methods require solving the inner maximization problem to (approximate) global optimality. More precisely, we need to be able to solve the optimization problem described in (5.1). While popular heuristic approaches such as parameterizing the functions f and g with neural networks can be used to solve (5.1), we focus on solving this problem in a more rigorous manner. In particular, we narrow our focus to the discrete random variable case, which holds for many practical sensitive attributes, such as gender and race. In what follows, we show that in this case, (5.1) can be solved "efficiently" to global optimality.
Theorem 9 ([347]) Let a ∈ {a_1, . . . , a_c} and b ∈ {b_1, . . . , b_d} be two discrete random variables. Then the Rényi coefficient ρ_R(a, b) equals the second-largest singular value of the matrix Q = [q_ij]_{i,j} ∈ R^{c×d}, where

    q_ij = P(a = a_i, b = b_j) / √( P(a = a_i) P(b = b_j) ).
The above theorem provides a computationally tractable approach for computing the Rényi coefficient.
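For illustration, a short Python sketch of this computation, assuming the joint probability table of a and b is given, is:

    import numpy as np

    def renyi_correlation_discrete(P_joint):
        # Theorem 9: rho_R(a, b) is the second-largest singular value of Q,
        # where P_joint[i, j] = P(a = a_i, b = b_j).
        pa = P_joint.sum(axis=1)                    # marginal of a
        pb = P_joint.sum(axis=0)                    # marginal of b
        Q = P_joint / np.sqrt(np.outer(pa, pb))     # q_ij = P(a_i, b_j)/sqrt(P(a_i) P(b_j))
        return np.linalg.svd(Q, compute_uv=False)[1]

On an independent pair (P_joint equal to the outer product of the marginals), the returned value is 0 up to floating-point error, while the largest singular value of Q is always 1.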
This computation could be further simplified when one of the random variables is binary.
Theorem 10 Suppose that a ∈ {1, . . . , c} is a discrete random variable and b ∈ {0, 1} is a binary random variable. Let ã be a one-hot encoding of a, i.e., ã = e_i if a = i, where e_i = (0, . . . , 0, 1, 0, . . . , 0) is the i-th standard unit vector, and let b̃ = b − 1/2. Then,

    ρ_R(a, b) = √( 1 − γ / ( P(b = 1) P(b = 0) ) ),

where γ ≜ min_{w ∈ R^c} E[(w^T ã − b̃)²]. Equivalently,

    γ ≜ min_{w ∈ R^c} Σ_{i=1}^c w_i² P(a = i) − Σ_{i=1}^c w_i ( P(a = i, b = 1) − P(a = i, b = 0) ) + 1/4.
Proof: The proof is relegated to the appendix. □
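The closed form of Theorem 10 can be evaluated directly, since the quadratic in w is separable and minimized at w_i = ( P(a = i, b = 1) − P(a = i, b = 0) ) / ( 2 P(a = i) ). A minimal sketch, again assuming the joint table is given, is:

    import numpy as np

    def renyi_correlation_binary(P_joint):
        # Theorem 10 closed form; P_joint has shape (c, 2), columns b = 0 and b = 1.
        pa = P_joint.sum(axis=1)
        p0, p1 = P_joint.sum(axis=0)
        diff = P_joint[:, 1] - P_joint[:, 0]        # P(a = i, b = 1) - P(a = i, b = 0)
        w = diff / (2.0 * pa)                       # separable quadratic minimizer
        gamma = np.sum(w ** 2 * pa) - np.sum(w * diff) + 0.25
        return np.sqrt(max(0.0, 1.0 - gamma / (p1 * p0)))

On any joint table with binary b, this agrees with the SVD computation of Theorem 9 above.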
Let us specialize our framework to classification and clustering problems in the next two sections.
5.4 Rényi Fair Classification
In a typical (multi-class) classification problem, we are given samples from a random variable Z ≜ (X, Y), and the goal is to predict Y from X. Here X ∈ R^d is the input feature vector, and Y ∈ Y ≜ {1, . . . , c} is the class label. Let Ŷ_θ be the output of our classifier, taking values in the set {1, . . . , c}. Assume further that
    P(Ŷ_θ = i | X) = F_i(θ, X), ∀i = 1, . . . , c.
Here θ is the parameter of the classifier that needs to be tuned. For example, F(θ, X) = (F_1(θ, X), . . . , F_c(θ, X)) could represent the output of a neural network after the softmax layer, the soft probability label assigned by a logistic regression model, or the 0-1 probability values obtained by a deterministic classifier.
In order to find the optimal parameter θ, we need to solve the optimization problem
    min_θ E[L(F(θ, X), Y)],    (5.4)
where L is the loss function and the expectation is taken over the random variable Z = (X, Y). Let S be the sensitive attribute. We say a model satisfies demographic parity if the assigned label Ŷ is independent of the sensitive attribute S; see [105]. Using our regularization framework, to find the optimal parameter θ balancing classification accuracy and fairness, we need to solve
    min_θ E[L(F(θ, X), Y)] + λ ρ_R²(Ŷ_θ, S).    (5.5)
5.4.1 General Discrete Case
When S ∈ {s_1, . . . , s_d} is discrete, Theorem 9 implies that (5.5) can be rewritten as

    min_θ max_{v ⊥ v_1, ∥v∥² ≤ 1} f_D(θ, v) ≜ E[L(F(θ, X), Y)] + λ v^T Q_θ^T Q_θ v.    (5.6)
Here v_1 = [ √P(S = s_1), . . . , √P(S = s_d) ] ∈ R^d is the right singular vector corresponding to the largest singular value of Q_θ = [q_ij]_{i,j} ∈ R^{c×d}, with

    q_ij ≜ P(Ŷ_θ = i | S = s_j) P(S = s_j) / √( P(Ŷ_θ = i) P(S = s_j) ).

Given training data (x_n, y_n)_{n=1}^N sampled from the random variable Z = (X, Y), we can estimate the entries of the matrix Q_θ using P(Ŷ_θ = i) = E[P(Ŷ_θ = i | X)] ≈ (1/N) Σ_{n=1}^N F_i(θ, x_n) and P(Ŷ_θ = i | S = s_j) ≈ (1/|X_j|) Σ_{x ∈ X_j} F_i(θ, x), where X_j is the set of samples with sensitive attribute s_j. Motivated by the algorithm proposed in [159], we present Algorithm 9 for solving (5.6).
Algorithm 9 Rényi Fair Classifier for Discrete Sensitive Attributes
1: Input: θ^0 ∈ Θ, step-size η.
   for t = 0, 1, . . . , T do
2:   Set v^{t+1} ← argmax_{v ⊥ v_1, ∥v∥ ≤ 1} f_D(θ^t, v) by finding the second singular vector of Q_{θ^t}
3:   Set θ^{t+1} ← θ^t − η ∇_θ f_D(θ^t, v^{t+1})
To understand the convergence behavior of Algorithm 9 for the nonconvex optimization problem (5.6), we first need to define an approximate stationary solution. Let g(θ) = max_{v ⊥ v_1, ∥v∥ ≤ 1} f(θ, v). Assume further that f(·, v) has L_1-Lipschitz gradient; then g(·) is L_1-weakly convex (for more details, see [265]). For such a weakly convex function, we say θ* is an ε-stationary solution if the gradient of its Moreau envelope is smaller than ε, i.e., ∥∇g_β(θ*)∥ ≤ ε, where g_β(θ) ≜ min_{θ'} g(θ') + (1/2β) ∥θ − θ'∥² and β < 1/(2L_1) is a given constant. The following theorem, a direct consequence of Theorem 27 in [159], demonstrates the convergence of Algorithm 9.
Theorem 11 Suppose that f is L_0-Lipschitz and L_1-gradient Lipschitz. Then Algorithm 9 computes an ε-stationary solution of the objective function in (5.6) in O(ε^{−4}) iterations.
5.4.2 Binary Case
When S is binary, we can obtain a more efficient algorithm than Algorithm 9 by exploiting Theorem 10. Particularly, after a simple rescaling of λ and ignoring constant terms, the optimization problem in (5.5) can be written as

    min_θ max_w f(θ, w) ≜ E[L(F(θ, X), Y)] − λ [ Σ_{i=1}^c w_i² P(Ŷ_θ = i) − Σ_{i=1}^c w_i ( P(Ŷ_θ = i, S = 1) − P(Ŷ_θ = i, S = 0) ) ].    (5.7)
Defining S̃ = 2S − 1, the above problem can be rewritten as

    min_θ max_w E[ L(F(θ, X), Y) − λ Σ_{i=1}^c w_i² F_i(θ, X) + λ Σ_{i=1}^c w_i S̃ F_i(θ, X) ].
Thus, given training data (x_n, y_n)_{n=1}^N sampled from the random variable Z = (X, Y), we solve

    min_θ max_w f_B(θ, w) ≜ (1/N) Σ_{n=1}^N [ L(F(θ, x_n), y_n) − λ Σ_{i=1}^c w_i² F_i(θ, x_n) + λ Σ_{i=1}^c w_i s̃_n F_i(θ, x_n) ].    (5.8)
Notice that the maximization problem in (5.8) is concave, separable, and has a closed-form solution. We
propose Algorithm 10 for solving (5.8).
Algorithm 10 Rényi Fair Classifier for Binary Sensitive Attributes
1: Input: θ^0 ∈ Θ, step-size η.
   for t = 0, 1, . . . , T do
2:   Set w^{t+1} ← argmax_w f_B(θ^t, w), i.e., set w_i^{t+1} ← ( Σ_{n=1}^N s̃_n F_i(θ^t, x_n) ) / ( 2 Σ_{n=1}^N F_i(θ^t, x_n) ), ∀i = 1, . . . , c
3:   Set θ^{t+1} ← θ^t − η ∇_θ f_B(θ^t, w^{t+1})
While the result in Theorem 11 applies to Algorithm 10, under the following assumption we can show a superior convergence rate.

Assumption 5.4.1 We assume that there exists a constant scalar µ > 0 such that Σ_{n=1}^N F_i(θ, x_n) ≥ µ, ∀i = 1, . . . , c.

This assumption is reasonable when softmax is used: we can always assume θ lies in a compact set in practice, and hence the output of the softmax layer cannot be arbitrarily small.

Theorem 12 Suppose that f is L_1-gradient Lipschitz. Then Algorithm 10 computes an ε-stationary solution of the objective function in (5.8) in O(ε^{−2}) iterations.
Proof: The proof is relegated to the appendix. □
Notice that this convergence rate is clearly faster than the one obtained in Theorem 11. Moreover, this rate matches the oracle lower bound for general non-convex optimization; see [65]. This observation shows that the computational overhead of imposing fairness is negligible compared to solving the original non-convex training problem without imposing fairness.
Remark 13 (Extension to multiple sensitive attributes) Our discrete Rényi classification framework naturally extends to multiple discrete sensitive attributes by concatenating all attributes into one. For instance, with two sensitive attributes S^1 ∈ {0, 1} and S^2 ∈ {0, 1}, we can consider them as a single attribute S ∈ {0, 1, 2, 3} corresponding to the four combinations {(S^1 = 0, S^2 = 0), (S^1 = 0, S^2 = 1), (S^1 = 1, S^2 = 0), (S^1 = 1, S^2 = 1)}.
Remark 14 (Extension to other notions of fairness) Our proposed framework imposes the demographic parity notion of group fairness. However, other notions of group fairness may also be represented by (conditional) independence conditions, in which case our framework again applies. For example, we say a predictor Ŷ_θ satisfies the equalized odds condition if Ŷ_θ is conditionally independent of the sensitive attribute S given the true label Y. Similar to formulation (5.5), the equalized odds fairness notion can be achieved by the following min-max problem

    min_θ E[L(F(θ, X), Y)] + λ Σ_{y ∈ Y} ρ_R²(Ŷ_θ, S | Y = y).    (5.9)
5.5 Rényi Fair Clustering
In this section, we apply the proposed fair Rényi framework to the widely used K-means clustering problem. Given a set of data points x_1, . . . , x_N ∈ R^d, the K-means problem seeks to partition them into K clusters such that the following objective function is minimized:

    min_{A, C} Σ_{n=1}^N Σ_{k=1}^K a_{kn} ∥x_n − c_k∥²  s.t.  Σ_{k=1}^K a_{kn} = 1 ∀n,  a_{kn} ∈ {0, 1} ∀k, n,    (5.10)
where c_k is the centroid of cluster k; the variable a_{kn} = 1 if data point x_n belongs to cluster k and 0 otherwise; and A = [a_{kn}]_{k,n} and C = [c_1, . . . , c_K] represent the assignment matrix and the cluster centroids, respectively. Now suppose we have an additional sensitive attribute S for each of the given data points. To obtain a fair clustering under the disparate impact doctrine, we need to make the random variable a_n = [a_{1n}, . . . , a_{Kn}] independent of S; in other words, the cluster assignment must be independent of the sensitive attribute. Using our framework in (5.2), we can easily add a regularizer to this problem to impose fairness under the disparate impact doctrine. In particular, for a binary sensitive attribute S, using Theorem 10 and absorbing constants into the hyper-parameter λ, we need to solve
    min_{A, C} max_{w ∈ R^K} Σ_{n=1}^N Σ_{k=1}^K a_{kn} ∥x_n − c_k∥² − λ Σ_{n=1}^N (a_n^T w − s_n)²
    s.t.  Σ_{k=1}^K a_{kn} = 1 ∀n,  a_{kn} ∈ {0, 1} ∀k, n,    (5.11)

where a_n = (a_{1n}, . . . , a_{Kn})^T encodes the clustering information of data point x_n and s_n is the sensitive attribute of data point n.
Fixing the assignment matrix A and the cluster centers C, the vector w can be updated in closed form: at each iteration, w_k equals the current proportion of the privileged group in the k-th cluster. Combining this idea with the assignment and cluster-center update rules of the standard K-means algorithm, we propose Algorithm 11, a fair K-means algorithm under the disparate impact doctrine. To illustrate the behavior of the algorithm, a toy example is presented in Appendix 10.3.3.
Algorithm 11 Rényi Fair K-means
1: Input: X = {x_1, . . . , x_N} and S = {s_1, . . . , s_N}
2: Initialize: random assignment A s.t. Σ_{k=1}^K a_{kn} = 1 ∀n and a_{kn} ∈ {0, 1}. Set A_prev = 0.
   while A_prev ≠ A do
3:   Set A_prev = A
     for n = 1, . . . , N do  {Update A}
4:     k* = argmin_k ∥x_n − c_k∥² − λ(w_k − s_n)²
5:     Set a_{k*n} = 1 and a_{kn} = 0 for all k ≠ k*
6:     Set w_k = ( Σ_{n=1}^N s_n a_{kn} ) / ( Σ_{n=1}^N a_{kn} ), ∀k = 1, . . . , K  {Update w}
7:   Set c_k = ( Σ_{n=1}^N a_{kn} x_n ) / ( Σ_{n=1}^N a_{kn} ), ∀k = 1, . . . , K  {Update c}
The main difference between this algorithm and the popular K-means algorithm is Step 4 of Algorithm 11, which results from optimizing (5.11) over A when both C and w are fixed. When λ = 0, this step is identical to the cluster-assignment update in K-means; when λ > 0, Step 4 accounts for fairness in the distance used to update the cluster assignments.
Remark 15 Note that in Algorithm 11, the parameter w is updated after each assignment of a point to a cluster; more specifically, in every iteration of the algorithm, w is updated N times. If we instead update w only after completely updating the matrix A, then a simple counterexample shows that the algorithm can get stuck; see Appendix 10.3.4.
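A compact Python sketch of Algorithm 11 for a binary sensitive attribute is given below; for simplicity (and at extra cost) it recomputes w from scratch after every single-point reassignment, as Remark 15 prescribes, and it handles empty clusters heuristically:

    import numpy as np

    def renyi_fair_kmeans(X, s, K, lam, n_iter=20, seed=0):
        # Sketch of Algorithm 11 for binary s in {0, 1}; not the thesis code.
        rng = np.random.default_rng(seed)
        N = X.shape[0]
        C = X[rng.choice(N, size=K, replace=False)].astype(float)  # initial centers
        assign = rng.integers(0, K, size=N)
        w = np.full(K, 0.5)                  # privileged-group proportion per cluster

        def refresh_w():
            for k in range(K):
                mask = assign == k
                w[k] = s[mask].mean() if mask.any() else 0.5

        refresh_w()
        for _ in range(n_iter):
            for n in range(N):               # Step 4: fairness-adjusted assignment
                cost = ((X[n] - C) ** 2).sum(axis=1) - lam * (w - s[n]) ** 2
                assign[n] = int(np.argmin(cost))
                refresh_w()                  # Step 6: update w after every point (Remark 15)
            for k in range(K):               # Step 7: recompute centers
                mask = assign == k
                if mask.any():
                    C[k] = X[mask].mean(axis=0)
        return assign, C, w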
5.6 Numerical Experiments
In this section, we evaluate the performance of the proposed Rényi fair classifier and Rényi fair K-means algorithm on three standard datasets: Bank, German Credit, and Adult. A detailed description of these datasets is available in the supplementary material.

We evaluate the performance of our proposed Rényi classifier under both the demographic parity and equality of opportunity notions. We have implemented a logistic regression classifier regularized by Rényi
correlation on the Adult dataset, considering gender as the sensitive feature. To measure equality of opportunity, we use the Equality of Opportunity (EO) violation, defined as

    EO Violation = | P(Ŷ = 1 | S = 1, Y = 1) − P(Ŷ = 1 | S = 0, Y = 1) |,

where Ŷ and Y represent the predicted and true labels, respectively. A smaller EO violation corresponds to a fairer solution. Figure 5.1, parts (a) and (b), demonstrates that as λ, the Rényi regularizer coefficient, increases, the EO violation decreases, implying a fairer classifier at the price of higher training and testing errors. Figure 5.1, part (c), compares the fair Rényi logistic regression model with several existing methods in the literature [141, 361, 276, 100]. As plot (c) shows, the Rényi classifier outperforms the other methods in terms of accuracy for a given level of EO violation.
The better performance of the Rényi fair classifier relative to the baselines can be attributed to the following. [141] is a post-processing approach in which the output of the classifier is modified to promote a fair prediction, without changing the classification process itself. Clearly, this approach limits the design space and cannot explore the possibilities reachable with "in-processing" methods, where the fairness and classification objectives are optimized jointly. [361] imposes fairness using linear covariance and thus can only capture linear dependence between the predictor and the sensitive attribute; consequently, nonlinear dependencies may remain that are revealed by fairness measures such as the DP or EO violation. [276, 100], on the other hand, propose nonlinear measures of dependence as regularizers, but due to computational barriers they approximate the regularizer and solve the approximate problem; the approximation step can adversely affect the performance of the resulting classifier. Notice that while these methods differ in how they impose fairness, they are all implemented for the logistic regression model (with the exception of the SVM model used in [100]); thus, the differences in performance are not due to the classification model used in the experiments.
To show the practical benefits of Rényi correlation over the Pearson correlation and HSIC regularizers under the demographic parity notion, we evaluate the logistic regression classifier regularized by these three measures on the Adult, Bank, and German Credit datasets. For the first two plots, we use

    p% = min( P(Ŷ = 1 | S = 1) / P(Ŷ = 1 | S = 0), P(Ŷ = 1 | S = 0) / P(Ŷ = 1 | S = 1) )

as a measure of fairness. Since p% is defined only for binary sensitive variables, for the last two plots in Figure 5.2 (German dataset with gender and marital status, and Adult dataset with gender and race as the sensitive features), we use the inverse of the demographic parity (DP) violation as the fairness measure. We define the DP violation as

    DP Violation = max_{a, b} | P(Ŷ = 1 | S = a) − P(Ŷ = 1 | S = b) |.
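Both measures are straightforward to compute from predictions; the following sketch shows plug-in estimates of the p% rule, the DP violation, and the EO violation used above:

    import numpy as np

    def p_percent(y_hat, s):
        # p% rule for a binary sensitive attribute s in {0, 1}.
        r1, r0 = y_hat[s == 1].mean(), y_hat[s == 0].mean()
        return min(r1 / r0, r0 / r1)

    def dp_violation(y_hat, s):
        # max_{a,b} |P(Yhat = 1 | S = a) - P(Yhat = 1 | S = b)|, any discrete s.
        rates = [y_hat[s == a].mean() for a in np.unique(s)]
        return max(rates) - min(rates)

    def eo_violation(y_hat, y, s):
        # |P(Yhat = 1 | S = 1, Y = 1) - P(Yhat = 1 | S = 0, Y = 1)| for binary s.
        return abs(y_hat[(s == 1) & (y == 1)].mean() - y_hat[(s == 0) & (y == 1)].mean())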
As evident from the figure, the Rényi classifier outperforms both the HSIC and Pearson classifiers, especially when targeting high levels of fairness. For the last two experiments in Figure 5.2, we could not further increase fairness by increasing the regularization coefficient for the Pearson and HSIC regularizers (the green and red curves cannot go beyond a certain point on the fairness axis). This can be explained by the nonlinear correlation between the predictor and the sensitive variables in these two scenarios, which cannot be fully captured using linear or quadratic independence measures. Interestingly, our experiments indicate that minimizing Rényi correlation also minimizes the Normalized Mutual Information (NMI) between the variables (see Supplementary Figure 10.8). Recall that, similar to Rényi correlation, NMI can capture any dependence between two given random variables.
Finally, to evaluate the performance of our fair K-means algorithm, we apply Algorithm 11 to cluster the Adult and Bank datasets. We use the deviation of the elements of the vector w as a measure of fairness: the element w_k represents the ratio of the number of data points belonging to the privileged group (S = 1) in cluster k to the number of data points in that cluster. This notion of fairness is closely related to the minimum balance introduced by [73]. The deviation of these elements measures the deviation of these ratios across different clusters; a clustering solution is exactly fair if all entries of w are equal. For K = 14, we plot in Figure 5.3 the minimum, maximum, average, and average ± standard deviation of the entries of w for different values of λ. For an exactly fair clustering solution, these values would coincide. As Figure 5.3 shows, increasing λ yields an exactly fair clustering at the price of a higher clustering loss.
Figure 5.1: Trade-off between classifier accuracy and fairness on the Adult dataset under the equality of opportunity notion. (a, b) As λ increases from 0 to 1000, the EO violation (blue curve, left axis) approaches 0; the fairer solution comes at the price of a slight increase in training/test error (red curve, right axis). (c) Comparison of existing approaches with the Rényi classifier under the equality of opportunity notion; the Rényi classifier achieves better accuracy for a given level of fairness as measured by EO violation.
5.7 Conclusion
In this chapter, we proposed Rényi fair inference as an in-process method to impose fairness in empirical risk minimization. Fairness is defined as (conditional) independence between a sensitive attribute and the inference output of the learning machine. As statistical independence is only measurable when the data distributions are fully known, we can only hope to promote independence through empirical surrogates in this framework. Our method imposes a regularizer in the form of the Rényi correlation (maximal correlation) between the sensitive attribute(s) and the inference output. The Rényi correlation between two random variables is zero if and only if they are independent, which is a desirable property for an independence surrogate. We pose Rényi fair inference as a min-max optimization problem. In the case where the sensitive attributes are discrete (e.g., race), we present an algorithm that finds a first-order optimal solution to the problem with convergence guarantees. In the special case where the sensitive attribute is binary (e.g., gender), we present an algorithm with optimal convergence guarantees. Our numerical experiments show that Rényi fair inference captures nonlinear correlations better than the Pearson correlation or HSIC. We also show that increasing the regularization hyper-parameter results in near statistical independence between the sensitive attribute and the inference output. Future work would naturally consider extensions to continuous sensitive attributes and to problems with missing or non-explicit sensitive labels, such as fair word embedding problems.
Figure 5.2: Trade-off between accuracy and fairness for a logistic regression classifier regularized with the Rényi, HSIC, and Pearson measures on the German Credit, Adult, and Bank datasets. (Top) The drop in accuracy of the model regularized by Rényi correlation is smaller than for the same model regularized by HSIC or Pearson correlation. Moreover, as observed for both the Bank and Adult datasets, the Pearson and HSIC regularizers usually cannot increase p% beyond a certain limit, since removing all linear correlations does not guarantee independence between the predictor and the sensitive attribute. (Bottom) When the sensitive attribute is not binary (or there is more than one sensitive attribute), obtaining a fair model with the HSIC or Pearson regularizer is even harder: the regularized model cannot reduce the DP violation (or increase its reciprocal) beyond a threshold.
Figure 5.3: Performance and fairness of the K-means algorithm as a function of the Rényi regularization hyper-parameter λ. As λ increases, the standard deviation of the components of w (each component represents the relative proportion of the privileged group in the corresponding cluster) decreases accordingly. Both plots show that the standard deviation of w drops quickly with λ and the increase in loss is small for λ ≤ 0.005; however, reaching a completely fair clustering requires λ ≥ 1, which can increase the loss (right axis, red curve) drastically.
Chapter 6
FERMI: A Stochastic Optimization Framework for Fair Empirical Risk
Minimization via Exponential Rényi Mutual Information
This chapter is based on a paper published in TMLR 2023 [210].
Despite the success of large-scale empirical risk minimization (ERM) at achieving high accuracy across
a variety of machine learning tasks, fair ERM is hindered by the incompatibility of fairness constraints with
stochastic optimization. We consider the problem of fair classification with discrete sensitive attributes and
potentially large models and data sets, requiring stochastic solvers. Existing in-processing fairness algorithms are either impractical in the large-scale setting because they require large batches of data at each iteration, or they are not guaranteed to converge. In this chapter, we develop the first stochastic in-processing fairness algorithm with guaranteed convergence. For the demographic parity, equalized odds, and equal opportunity notions of fairness, we provide slight variations of our algorithm, called FERMI, and prove that each
of these variations converges in stochastic optimization with any batch size. Empirically, we show that
FERMI is amenable to stochastic solvers with multiple (non-binary) sensitive attributes and non-binary targets, performing well even with minibatch sizes as small as one. Extensive experiments show that FERMI
achieves the most favorable tradeoffs between fairness violation and test accuracy across all tested setups
compared with state-of-the-art baselines for demographic parity, equalized odds, and equal opportunity.
These benefits are especially significant with small batch sizes and for non-binary classification with a
large number of sensitive attributes, making FERMI a practical, scalable fairness algorithm. The code for
all of the experiments in this chapter is available at:
https://github.com/optimization-for-data-driven-science/FERMI.
6.1 Introduction
Ensuring that decisions made using machine learning (ML) algorithms are fair to different subgroups is
of utmost importance. Without any mitigation strategy, learning algorithms may result in discrimination
against certain subgroups based on sensitive attributes, such as gender or race, even if such discrimination
is absent in the training data [224], and algorithmic fairness literature aims to remedy such discrimination
issues [321, 91, 115, 49, 15, 60, 141, 116, 349, 362, 34, 5, 170, 259, 180]. Modern ML problems often involve
large-scale models with hundreds of millions or even billions of parameters, e.g., BART [191], ViT [101],
GPT-2 [261]. In such cases, during fine-tuning, the available memory on a node constrains us to use
stochastic optimization with (small) minibatches in each training iteration. In this chapter, we address
the dual challenges of fair and stochastic ML, providing the first stochastic fairness algorithm that provably
converges with any batch size.
A machine learning algorithm satisfies the demographic parity fairness notion if the predicted target is independent of the sensitive attributes [105]. Promoting demographic parity can lead to poor performance, especially if the true outcome is not independent of the sensitive attributes. To remedy this, [141] proposed equalized odds, which requires the predicted target to be conditionally independent of the sensitive attributes given the true label. A further relaxed version of this notion is equal opportunity, which is satisfied if the predicted target is conditionally independent of the sensitive attributes given that the true label is in an advantaged class [141]. Equal opportunity ensures that false positive rates are equal across different demographics, where the negative outcome is considered the advantaged class, e.g., extending a loan. See Sec. 10.4.1 for formal definitions of these fairness notions.
Reference                  | NB target | NB attrib. | NB code | Beyond logistic | Stoch. alg. (unbiased∗∗) | Converg. (stoch.)
FERMI (this work)          | ✓         | ✓          | ✓       | ✓               | ✓ (✓)                    | ✓ (✓)
Cho 2020a [74]             | ✓         | ✓          | ✓       | ✓               | ✓ (✗)                    | ✗
Cho 2020b [75]             | ✓         | ✓          | ✗       | ✓               | ✓ (✓)                    | ✗
Rényi Fair Inference [23]  | ✓         | ✓          | ✓       | ✓               | ✗                        | ✓ (✗)
Rezaei 2020 [277]          | ✗         | ✗          | ✗       | ✗               | ✗                        | ✗
Jiang 2020 [157]∗          | ✗         | ✓          | ✗       | ✗               | ✗                        | ✗
Mary 2019 [223]            | ✓         | ✓          | ✓       | ✓               | ✓ (✗)                    | ✗
Prost 2019 [259]           | ✗         | ✗          | ✗       | ✓               | ✓ (✗)                    | ✗
Donini 2018 [100]          | ✗         | ✓          | ✗       | ✓               | ✗                        | ✗
Zhang 2018 [366]           | ✓         | ✓          | ✗       | ✓               | ✓ (✗)                    | ✗
Agrawal 2018 [5]           | ✗         | ✓          | ✗       | ✓               | ✗                        | ✓ (✗)

Table 6.1: Comparison of state-of-the-art in-processing methods (NB = non-binary) on whether they (a) handle non-binary targets (beyond binary classification), (b) handle non-binary sensitive attributes, (c) release code that applies to non-binary targets/attributes, (d) extend to arbitrary models, (e) provide code for stochastic optimization (and whether the gradients are unbiased), (f) provide convergence guarantees (for stochastic optimization). FERMI is the only method compatible with stochastic optimization and guaranteed convergence. The only existing baselines for non-binary classification with non-binary sensitive attributes are [223, 23, 74] (NB code). ∗We refer to the in-processing method of [157], not their post-processing method. ∗∗We use the term "unbiased" in the statistical estimation sense; not to be confused with bias in the fairness sense.
Measuring fairness violation. In practice, the learner only has access to finite samples and cannot verify
demographic parity, equalized odds, or equal opportunity. This has led the machine learning community
to define several fairness violation metrics that quantify the degree of (conditional) independence between
random variables, e.g., L∞ distance [105, 141], mutual information [168, 277, 317, 366, 75, 285], Pearson
correlation [362, 46], false positive/negative rate difference [34], Hilbert Schmidt independence criterion
(HSIC) [254], kernel-based minimum mean discrepancy (MMD) [259], Rényi correlation [223, 23, 133, 131],
and exponential Rényi mutual information (ERMI) [223]. In this chapter, we focus on three variants of
ERMI specialized to demographic parity, equalized odds, and equal opportunity. The motivation for the
use of ERMI is two-fold. First, we will see in Sec. 6.2 that ERMI is amenable to stochastic optimization.
Moreover, we observe (Sec. 10.4.3) that ERMI provides an upper bound on several of the above notions of
fairness violation. Consequently, a model trained to reduce ERMI will also provide guarantees on these
other fairness violations.∗
∗Nevertheless, we use L∞ distance for measuring fairness violation in our numerical experiments, since L∞ is broadly used.
Related work & contributions. Fairness-promoting machine learning algorithms can be categorized into three main classes: pre-processing, post-processing, and in-processing methods. Pre-processing algorithms [115, 364, 60] transform the biased data features to a new space in which the labels and sensitive attributes are statistically independent. This transform is oblivious to the training procedure. Post-processing approaches [141, 256] mitigate the discrimination of the classifier by altering its final decisions. In-processing approaches impose the notions of fairness as constraints or regularization terms during the training procedure. Several regularization-based methods have been proposed in the literature to promote fairness [280, 260] in decision trees [167, 263, 6], support vector machines [100], boosting [117], neural networks [131, 74, 259], and (logistic) regression models [362, 43, 325, 77, 23, 157, 133]. See the recent paper by [150] for a more comprehensive literature survey.
While in-processing approaches generally give rise to better tradeoffs between fairness violation and
performance, existing approaches are mostly incompatible with stochastic optimization. This chapter addresses this problem in the context of fair (non-binary) classification with discrete (non-binary) sensitive
attributes. See Table 6.1 for a summary of the main differences between FERMI and existing in-processing
methods.
Our main contributions are as follows:
1. For each given fairness notion (demographic parity, equalized odds, or equal opportunity), we formulate an objective that uses ERMI as a regularizer to balance fairness and accuracy (Eq. (FRMI obj.)),
and derive an empirical version of this objective (Eq. (FERMI obj.)). We propose an algorithm (Algorithm 12) for solving each of these objectives, which is the first stochastic in-processing fairness algorithm
with guaranteed convergence. The main property needed to obtain a convergent stochastic algorithm
is to derive a (stochastically) unbiased estimator of the gradient of the objective function. The existing
stochastic fairness algorithms by [366, 223, 259, 75, 74] are not guaranteed to converge since there is no
straightforward way to obtain such an unbiased estimator of the gradients for their fairness regularizers.†
For any minibatch size (even as small as 1), we prove (Theorem 19) that our algorithm converges to an
approximate solution of the empirical objective (Eq. (FERMI obj.)).
2. We show that if the number of training examples is large enough, then our algorithm (Algorithm 12)
converges to an approximate solution of the population-level objective (Theorem 21). The proofs of
these convergence theorems require the development of novel techniques (see e.g. Proposition 17 and
Proposition 22), and the resourceful application of many classical results from optimization, probability
theory, and statistics.
3. We demonstrate through extensive numerical experiments that our stochastic algorithm achieves superior fairness-accuracy tradeoff curves against all comparable baselines for demographic parity, equalized
odds, and equal opportunity. In particular, the performance gap is very large when minibatch size is small
(as is practically necessary for large-scale problems) and the number of sensitive attributes is large.
6.2 Fair Risk Minimization through ERMI Regularization
In this section, we propose a fair learning objective (Eq. (FRMI obj.)) and derive an empirical variation
(Eq. (FERMI obj.)) of this objective. We then develop a stochastic optimization algorithm (Algorithm 12)
that we use to solve these objectives, and prove that our algorithm converges to an approximate solution
of the two objectives.
Consider a learner who trains a model to make a prediction, $\widehat{Y}$ (e.g., whether or not to extend a loan), supported on $[m] := \{1, \ldots, m\}$. The prediction is made using a set of features, $X$ (e.g., financial history features). Assume that there is a set of discrete sensitive attributes, $S$ (e.g., race and sex), supported on $[k]$.
We now define the fairness violation notion that we will use to enforce fairness in our model.
†We suspect it might be possible to derive a provably convergent stochastic algorithm from the framework in [259] using
our techniques, but their approach is limited to binary classification with binary sensitive attributes. By contrast, we provide
(empirical and population-level) convergence guarantees for our algorithm with any number of sensitive attributes and any
number of classes.
Definition 16 (ERMI – exponential Rényi mutual information) We define the exponential Rényi mutual information between random variables $\widehat{Y}$ and $S$ with joint distribution $p_{\widehat{Y},S}$ and marginals $p_{\widehat{Y}}$, $p_S$ by:
$$D_R(\widehat{Y}; S) := \mathbb{E}\left\{ \frac{p_{\widehat{Y},S}(\widehat{Y}, S)}{p_{\widehat{Y}}(\widehat{Y})\, p_S(S)} \right\} - 1 = \sum_{j \in [m]} \sum_{r \in [k]} \frac{p_{\widehat{Y},S}(j,r)^2}{p_{\widehat{Y}}(j)\, p_S(r)} - 1. \qquad \text{(ERMI)}$$
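To make Definition 16 concrete, the following minimal Python sketch (function name and toy distributions are ours) computes ERMI directly from a joint probability table; it is exactly the χ²-divergence between the joint distribution and the product of its marginals (as discussed below), so it vanishes under independence.

import numpy as np

def ermi(p_joint):
    # ERMI of the joint distribution p_joint, where
    # p_joint[j, r] = P(yhat = j, s = r); entries sum to 1.
    p_y = p_joint.sum(axis=1)          # marginal of yhat, shape (m,)
    p_s = p_joint.sum(axis=0)          # marginal of s, shape (k,)
    return np.sum(p_joint**2 / np.outer(p_y, p_s)) - 1.0

# Independent joint distribution => ERMI = 0.
p_indep = np.outer([0.7, 0.3], [0.5, 0.5])
print(ermi(p_indep))   # ~0.0

# A correlated joint distribution => ERMI > 0.
p_corr = np.array([[0.45, 0.05],
                   [0.05, 0.45]])
print(ermi(p_corr))    # 0.64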
Definition 16 is what we would use if demographic parity were the fairness notion of interest. If instead one wanted to promote fairness with respect to equalized odds or equal opportunity, then it is straightforward to modify the definition by substituting appropriate conditional probabilities for $p_{\widehat{Y},S}$, $p_{\widehat{Y}}$, and $p_S$ in Eq. (ERMI); see Sec. 10.4.2. In Sec. 10.4.2, we also discuss that ERMI is the $\chi^2$-divergence (which is an $f$-divergence) between the joint distribution, $p_{\widehat{Y},S}$, and the Kronecker product of marginals, $p_{\widehat{Y}} \otimes p_S$ [59]. In particular, ERMI is non-negative, and zero if and only if demographic parity (or equalized odds or equal opportunity, for the conditional version of ERMI) is satisfied. Additionally, we show in Sec. 10.4.3 that ERMI provides an upper bound on other commonly used measures of fairness violation: Shannon mutual information [75], Rényi correlation [23], and $L_q$ fairness violation [170, 141]. Therefore, any algorithm that makes ERMI small will also have small fairness violation with respect to these other notions.
We can now define our fair risk minimization through exponential Rényi mutual information framework:‡
$$\min_\theta \Big\{ \text{FRMI}(\theta) := \mathcal{L}(\theta) + \lambda D_R\big(\widehat{Y}_\theta(X); S\big) \Big\}, \qquad \text{(FRMI obj.)}$$
where $\mathcal{L}(\theta) := \mathbb{E}_{(X,Y)}[\ell(X, Y; \theta)]$ for a given loss function $\ell$ (e.g., $L_2$ loss or cross-entropy loss); $\lambda > 0$ is a scalar balancing the accuracy versus fairness objectives; and $\widehat{Y}_\theta(X)$ is the output of the learned model (i.e., the predicted label in a classification task). While $\widehat{Y}_\theta(X) = \widehat{Y}(X; \theta)$ inherently depends on $X$ and $\theta$, in the rest of this chapter, we sometimes leave the dependence of $\widehat{Y}$ on $X$ and/or $\theta$ implicit for brevity of notation. Notice that we have also left the dependence of the loss on the predicted outcome $\widehat{Y} = \widehat{Y}_\theta(X)$ implicit.
‡In this section, we present all results in the context of demographic parity, leaving off all conditional expectations for clarity of presentation. The algorithm/results are readily extended to equalized odds and equal opportunity by using the conditional version of Eq. (ERMI) (which is described in Sec. 10.4.2); we use these resulting algorithms for numerical experiments.
As is standard, we assume that the prediction function satisfies $\mathbb{P}(\widehat{Y}(\theta, X) = j \mid X) = F_j(\theta, X)$, where $F(\theta, X) = (F_1(\theta, X), \ldots, F_m(\theta, X))^T \in [0,1]^m$ is differentiable in $\theta$ and $\sum_{j=1}^m F_j(\theta, X) = 1$. For example, $F(\theta, X)$ could represent the class probabilities given by a logistic regression model or the output of a neural network after the softmax layer. Indeed, this assumption is natural for most classifiers. Further, even classifiers, such as SVM, that are not typically expressed using probabilities can often be well approximated by a classifier of the form $\mathbb{P}(\widehat{Y}(\theta, X) = j \mid X) = F_j(\theta, X)$, e.g., by using Platt scaling [255, 241].
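As a concrete instance of this assumption, the short sketch below (with hypothetical names, for illustration) takes $F(\theta, X)$ to be the softmax output of a linear model, so that the entries are nonnegative and sum to one.

import numpy as np

def F(theta, x):
    # Class probabilities of a linear (multinomial logistic) model:
    # theta is a (d, m) weight matrix, x a (d,) feature vector.
    z = x @ theta
    z = z - z.max()        # numerical stability
    p = np.exp(z)
    return p / p.sum()     # entries in [0, 1], summing to 1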
The work of [223] considered the same objective Eq. (FRMI obj.), and tried to empirically solve it
through a kernel approximation. We propose a different approach to solving this problem, which we
shall describe below. Essentially, we express ERMI as a “max” function (Proposition 17), which enables us
to re-formulate Eq. (FRMI obj.) (and its empirical counterpart Eq. (FERMI obj.)) as a stochastic min-max
optimization problem. This allows us to use stochastic gradient descent ascent (SGDA) to solve Eq. (FRMI
obj.). Unlike the algorithm of [223], our algorithm provably converges. Our algorithm also empirically
outperforms the algorithm of [223], as we show in Sec. 6.3 and Sec. 10.4.6.2.
6.2.1 A Convergent Stochastic Algorithm for Fair Empirical Risk Minimization
In practice, the true joint distribution of $(X, S, Y, \widehat{Y})$ is unknown and we only have $N$ samples at our disposal. Let $\mathcal{D} = \{x_i, s_i, y_i, \hat{y}(x_i; \theta)\}_{i \in [N]}$ denote the features, sensitive attributes, targets, and the predictions of the model parameterized by $\theta$ for these given samples. For now, we consider the empirical risk minimization (ERM) problem and do not require any assumptions on the data set (e.g., we allow for different samples in $\mathcal{D}$ to be drawn from different, heterogeneous distributions). Consider the empirical objective
$$\min_\theta \Big\{ \widehat{\text{FERMI}}(\theta) := \widehat{\mathcal{L}}(\theta) + \lambda \widehat{D}_R(\widehat{Y}_\theta(X), S) \Big\}, \qquad \text{(FERMI obj.)}$$
where $\widehat{\mathcal{L}}(\theta) := \frac{1}{N}\sum_{i=1}^N \ell(x_i, y_i; \theta)$ is the empirical loss and§
$$\widehat{D}_R(\widehat{Y}, S) := \mathbb{E}\left\{ \frac{\hat{p}_{\widehat{Y},S}(\widehat{Y}, S)}{\hat{p}_{\widehat{Y}}(\widehat{Y})\, \hat{p}_S(S)} \right\} - 1 = \sum_{j \in [m]} \sum_{r \in [k]} \frac{\hat{p}_{\widehat{Y},S}(j,r)^2}{\hat{p}_{\widehat{Y}}(j)\, \hat{p}_S(r)} - 1$$
is empirical ERMI, with $\hat{p}$ denoting empirical probabilities with respect to $\mathcal{D}$: $\hat{p}_S(r) = \frac{1}{N}\sum_{i=1}^N \mathbf{1}_{\{s_i = r\}}$; $\hat{p}_{\widehat{Y}}(j) = \frac{1}{N}\sum_{i=1}^N F_j(\theta, x_i)$; and $\hat{p}_{\widehat{Y},S}(j,r) = \frac{1}{N}\sum_{i=1}^N F_j(\theta, x_i)\, s_{i,r}$ for $j \in [m]$, $r \in [k]$.
We shall see (Proposition 22) that empirical ERMI is a good approximation of ERMI when $N$ is large. Now, it is straightforward to derive an unbiased estimate of $\widehat{\mathcal{L}}(\theta)$ via $\frac{1}{|B|}\sum_{i \in B} \ell(x_i, y_i; \theta)$, where $B \subseteq [N]$ is a random minibatch of data points drawn from $\mathcal{D}$. However, unbiasedly estimating $\widehat{D}_R(\widehat{Y}, S)$ in the objective function Eq. (FERMI obj.) with $|B| < N$ samples is more difficult. In what follows, we present our approach to deriving statistically unbiased stochastic estimators of the gradients of $\widehat{D}_R(\widehat{Y}, S)$ given a random batch of data points $B$. This stochastic estimator is key to developing a convergent stochastic algorithm for solving Eq. (FERMI obj.). The key novel observation that allows us to derive this estimator is that Eq. (FERMI obj.) can be written as a min-max optimization problem (see Corollary 18). This observation, in turn, follows from the following result:
Proposition 17 For random variables $\widehat{Y}$ and $S$ with joint distribution $\hat{p}_{\widehat{Y},S}$, where $\widehat{Y} \in [m]$, $S \in [k]$, we have
$$\widehat{D}_R(\widehat{Y}; S) = \max_{W \in \mathbb{R}^{k \times m}} \left\{ -\operatorname{Tr}(W \widehat{P}_{\hat{y}} W^T) + 2\operatorname{Tr}(W \widehat{P}_{\hat{y},s} \widehat{P}_s^{-1/2}) - 1 \right\},$$
where $\widehat{P}_{\hat{y}} = \operatorname{diag}(\hat{p}_{\widehat{Y}}(1), \ldots, \hat{p}_{\widehat{Y}}(m))$, $\widehat{P}_s = \operatorname{diag}(\hat{p}_S(1), \ldots, \hat{p}_S(k))$, and $(\widehat{P}_{\hat{y},s})_{i,j} = \hat{p}_{\widehat{Y},S}(i,j)$, given that for any $i \in [m]$, $j \in [k]$ the inequality $\hat{p}_{\widehat{Y}}(i), \hat{p}_S(j) > 0$ holds.
§We overload notation slightly here and use E to denote expectation with respect to the empirical (joint) distribution.
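Proposition 17 can be checked numerically: the objective is an unconstrained concave quadratic in $W$, so its maximizer is available in closed form, and the attained maximum matches Eq. (ERMI). The following sketch (toy numbers are ours) performs this check.

import numpy as np

# Numerical check of Proposition 17 on a toy joint distribution (m = k = 2).
p_joint = np.array([[0.30, 0.10],    # p_joint[j, r] = phat(yhat = j, s = r)
                    [0.20, 0.40]])
p_y = p_joint.sum(axis=1)            # phat_Y
p_s = p_joint.sum(axis=0)            # phat_S

# Closed-form ERMI (Definition 16).
ermi = np.sum(p_joint**2 / np.outer(p_y, p_s)) - 1.0

# Maximizer of the concave quadratic: W* = P_s^{-1/2} P_{y,s}^T P_y^{-1},
# obtained by setting the gradient in W to zero.
P_y = np.diag(p_y)
P_s_inv_sqrt = np.diag(p_s ** -0.5)
W_star = P_s_inv_sqrt @ p_joint.T @ np.linalg.inv(P_y)
max_val = (-np.trace(W_star @ P_y @ W_star.T)
           + 2 * np.trace(W_star @ p_joint @ P_s_inv_sqrt) - 1.0)

print(np.isclose(ermi, max_val))     # True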
The proof is a direct calculation, given in Sec. 10.4.4. Let $\hat{y}(x_i, \theta) \in \{0, 1\}^m$ and $s_i = (s_{i,1}, \ldots, s_{i,k})^T \in \{0, 1\}^k$ be the one-hot encodings of the prediction and the sensitive attribute, respectively, for $i \in [N]$. Then, Proposition 17 provides a useful variational form of Eq. (FERMI obj.), which forms the backbone of our novel algorithmic approach:

Corollary 18 Let $(x_i, s_i, y_i)$ be a random draw from $\mathcal{D}$. Then, Eq. (FERMI obj.) is equivalent to
$$\min_\theta \max_{W \in \mathbb{R}^{k \times m}} \Big\{ \widehat{F}(\theta, W) := \widehat{\mathcal{L}}(\theta) + \lambda \widehat{\Psi}(\theta, W) \Big\}, \qquad (6.1)$$
where $\widehat{\Psi}(\theta, W) = -\operatorname{Tr}(W \widehat{P}_{\hat{y}} W^T) + 2\operatorname{Tr}(W \widehat{P}_{\hat{y},s} \widehat{P}_s^{-1/2}) - 1 = \frac{1}{N}\sum_{i=1}^N \widehat{\psi}_i(\theta, W)$ and
$$\begin{aligned} \widehat{\psi}_i(\theta, W) &:= -\operatorname{Tr}\big(W\,\mathbb{E}[\hat{y}(x_i, \theta)\hat{y}(x_i, \theta)^T \mid x_i]\,W^T\big) + 2\operatorname{Tr}\big(W\,\mathbb{E}[\hat{y}(x_i; \theta) s_i^T \mid x_i, s_i]\,\widehat{P}_s^{-1/2}\big) - 1 \\ &= -\operatorname{Tr}\big(W \operatorname{diag}(F_1(\theta, x_i), \ldots, F_m(\theta, x_i))\, W^T\big) + 2\operatorname{Tr}\big(W\,\mathbb{E}[\hat{y}(x_i; \theta) s_i^T \mid x_i, s_i]\,\widehat{P}_s^{-1/2}\big) - 1. \end{aligned}$$
Corollary 18 implies that for any given data set $\mathcal{D}$, the quantity $\ell(x_i, y_i; \theta) + \lambda \widehat{\psi}_i(\theta, W)$ is an unbiased estimator of $\widehat{F}(\theta, W)$ (with respect to the uniformly random draw of $i \in [N]$). Thus, we can use stochastic optimization (e.g., SGDA) to solve Eq. (FERMI obj.) with any batch size $1 \le |B| \le N$, and the resulting algorithm is guaranteed to converge since the stochastic gradients are unbiased. We present our proposed algorithm, which we call FERMI, for solving Eq. (FERMI obj.) in Algorithm 12.
Note that the matrix $\widehat{P}_s^{-1/2}$ depends only on the full data set of sensitive attributes $\{s_1, \cdots, s_N\}$ and has no dependence on $\theta$; it can therefore be computed just once, in line 2 of Algorithm 12. On the other hand, the quantities $\mathbb{E}[\hat{y}(x_i, \theta)\hat{y}(x_i, \theta)^T \mid x_i]$ and $\mathbb{E}[\hat{y}(x_i; \theta) s_i^T \mid x_i, s_i]$ depend on the sample $(x_i, s_i, \hat{y}_i)$ that is drawn in a given iteration and on the model parameters $\theta$, and are therefore computed at each iteration of the algorithm.
Algorithm 12 FERMI Algorithm
1: Input: $\theta^0 \in \mathbb{R}^{d_\theta}$, $W^0 = 0 \in \mathbb{R}^{k \times m}$, step-sizes $(\eta_\theta, \eta_w)$, fairness parameter $\lambda \ge 0$, iteration number $T$, minibatch sizes $|B_t|$, $t \in \{0, 1, \cdots, T\}$, $\mathcal{W} :=$ Frobenius-norm ball of radius $D$ around $0 \in \mathbb{R}^{k \times m}$ for $D$ given in Sec. 10.4.4.
2: Compute $\widehat{P}_s^{-1/2} = \operatorname{diag}(\hat{p}_S(1)^{-1/2}, \ldots, \hat{p}_S(k)^{-1/2})$.
3: for $t = 0, 1, \ldots, T$ do
4:   Draw a minibatch $B_t$ of data points $\{(x_i, s_i, y_i)\}_{i \in B_t}$.
5:   Update $\theta^t$ and $W^t$:
     $\theta^{t+1} \leftarrow \theta^t - \frac{\eta_\theta}{|B_t|} \sum_{i \in B_t} \big[\nabla_\theta \ell(x_i, y_i; \theta^t) + \lambda \nabla_\theta \widehat{\psi}_i(\theta^t, W^t)\big]$
     $W^{t+1} \leftarrow \Pi_{\mathcal{W}} \Big( W^t + \frac{2\lambda\eta_w}{|B_t|} \sum_{i \in B_t} \big[ -W^t\, \mathbb{E}[\hat{y}(x_i, \theta)\hat{y}(x_i, \theta)^T \mid x_i] + \widehat{P}_s^{-1/2}\, \mathbb{E}[s_i \hat{y}^T(x_i; \theta^t) \mid x_i, s_i] \big] \Big)$
6: end for
7: Pick $\hat{t}$ uniformly at random from $\{1, \ldots, T\}$.
8: Return: $\theta^{\hat{t}}$.

Although the min-max problem Eq. (FERMI obj.) that we aim to solve is unconstrained, we project the iterates $W^t$ (in line 5 of Algorithm 12) onto a bounded set $\mathcal{W}$ in order to satisfy a technical assumption that
is needed to prove convergence of Algorithm 12.¶ We choose $\mathcal{W}$ to be a sufficiently large ball that contains $W^*(\theta) := \operatorname{argmax}_W \widehat{F}(\theta, W)$ for every $\theta$ in some neighborhood of $\theta^* \in \operatorname{argmin}_\theta \max_W \widehat{F}(\theta, W)$, so that Eq. (FERMI obj.) is equivalent to
$$\min_\theta \max_{W \in \mathcal{W}} \Big\{ \widehat{F}(\theta, W) = \widehat{\mathcal{L}}(\theta) + \lambda \widehat{\Psi}(\theta, W) \Big\}.$$
See Sec. 10.4.4 for details. When applying Algorithm 12 in practice, it is not necessary to project the
iterates; e.g. in Sec. 6.3, we obtain strong empirical results without projection in Algorithm 12.
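For concreteness, the following self-contained Python/numpy sketch implements Algorithm 12 for a multinomial logistic regression model with the demographic parity version of the regularizer and no projection step; all names and hyperparameter values are illustrative rather than part of the method's specification.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fermi_train(X, y, s, m, k, lam=1.0, lr_theta=0.1, lr_w=0.1,
                batch_size=4, iters=5000, seed=0):
    # Minimal FERMI (Algorithm 12) sketch: multinomial logistic regression,
    # demographic parity regularizer, SGDA, no projection.
    # X: (N, d) features; y: (N,) labels in {0..m-1}; s: (N,) attributes in {0..k-1}.
    rng = np.random.default_rng(seed)
    N, d = X.shape
    theta, W = np.zeros((d, m)), np.zeros((k, m))
    # Line 2 of Algorithm 12: P_s^{-1/2} depends only on {s_i}; compute once.
    D = np.diag((np.bincount(s, minlength=k) / N) ** -0.5)
    for _ in range(iters):
        batch = rng.integers(0, N, size=batch_size)   # uniform draws from [N]
        g_theta, g_w = np.zeros_like(theta), np.zeros_like(W)
        for i in batch:
            x_i, s_1hot = X[i], np.eye(k)[s[i]]
            Fi = softmax(x_i @ theta)                 # F(theta, x_i), shape (m,)
            # Cross-entropy loss gradient w.r.t. theta.
            g_theta += np.outer(x_i, Fi - np.eye(m)[y[i]])
            # psi_i(theta, W) = c . F - 1 with c_j = 2 (s^T D W)_j - ||W[:, j]||^2,
            # so its theta-gradient follows from the softmax Jacobian diag(F) - F F^T.
            c = 2 * (s_1hot @ D @ W) - (W ** 2).sum(axis=0)
            g_theta += lam * np.outer(x_i, Fi * c - Fi * (Fi @ c))
            # Stochastic ascent direction in W (line 5 of Algorithm 12):
            # -W diag(F) + P_s^{-1/2} s F^T.
            g_w += -W * Fi + np.outer(D @ s_1hot, Fi)
        theta -= lr_theta * g_theta / batch_size
        W += 2 * lam * lr_w * g_w / batch_size
    return theta, W

Note how the sketch mirrors the structure of Algorithm 12: $\widehat{P}_s^{-1/2}$ is computed once before the loop, while the per-sample quantities involving $F(\theta, x_i)$ are recomputed at every iteration.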
Since Eq. (FERMI obj.) is potentially nonconvex in $\theta$, a global minimum might not exist, and even computing a local minimum is NP-hard in general [232]. Thus, as is standard in the nonconvex optimization literature, we aim for the milder goal of finding an approximate stationary point of Eq. (FERMI obj.). That is, given any $\epsilon > 0$, we aim to find a point $\theta^*$ such that $\mathbb{E}\|\nabla \widehat{\text{FERMI}}(\theta^*)\| \le \epsilon$, where the expectation is solely with respect to the randomness of the algorithm (minibatch sampling). The following theorem guarantees that Algorithm 12 will find such a point efficiently:
Theorem 19 (Informal statement) Let $\epsilon > 0$. Assume that $\ell(x, y; \cdot)$ and $F(\cdot, x)$ are Lipschitz continuous and differentiable with Lipschitz continuous gradient (see Appendix 10.4.4 for definitions), that $\hat{p}_S(j) > 0$ for all sensitive attributes $j \in [k]$, and that $\hat{p}_{\widehat{Y}}(l) \ge \mu > 0$ for all labels $l \in [m]$ at every iterate $\theta^t$. Then for any batch sizes $1 \le |B_t| \le N$, Algorithm 12 converges to an $\epsilon$-first-order stationary point of the Eq. (FERMI obj.) objective in $\mathcal{O}(\epsilon^{-5})$ stochastic gradient evaluations.
¶Namely, bounded $W^t$ ensures that the gradient of $\widehat{F}$ is Lipschitz continuous at every iterate and that the variance of the stochastic gradients is bounded.
The formal statement of Theorem 19 can be found in Theorem 65 in Appendix 10.4.4. Theorem 19 implies that Algorithm 12 can efficiently achieve any tradeoff between fairness (ERMI) violation and (empirical) accuracy, depending on the choice of $\lambda$.∥ However, if smaller fairness violation is desired (i.e., if larger $\lambda$ is chosen), then the algorithm needs to run for more iterations (see Appendix 10.4.4). The proof of Theorem 19 follows from Corollary 18 and the observation that $\widehat{\psi}_i$ is strongly concave in $W$ (see Theorem 68 in Sec. 10.4.4). This implies that Eq. (6.1) is a nonconvex-strongly-concave min-max problem, so the convergence guarantee of SGDA [204] yields Theorem 19.∗∗ The detailed proof of Theorem 19 is given in Appendix 10.4.4. Increasing the batch size to $\Theta(\epsilon^{-2})$ improves the stochastic gradient complexity to $\mathcal{O}(\epsilon^{-4})$. On the other hand, increasing the batch size further to $|B_t| = N$ results in a deterministic algorithm that is guaranteed to find a point $\theta^*$ such that $\|\nabla \widehat{\text{FERMI}}(\theta^*)\| \le \epsilon$ (no expectation) in $\mathcal{O}(\epsilon^{-2})$ iterations [204, Theorem 4.4], [247, Remark 4.2]; this iteration complexity has the optimal dependence on $\epsilon$ [64, 369]. However, like existing fairness algorithms in the literature, this full-batch variant is impractical for large-scale problems.
Remark 20 The condition $\hat{p}_{\widehat{Y}}(l) \ge \mu$ in Theorem 19 is assumed in order to ensure strong concavity of $\widehat{F}(\theta^t, \cdot)$ at every iterate $\theta^t$, which leads to the $\mathcal{O}(\epsilon^{-5})$ convergence rate. This assumption is typically satisfied in practice: for example, if the iterates $\theta^t$ remain in a compact region during the algorithm and the classifier uses softmax, then $\hat{p}_{\widehat{Y}}(l) \ge \mu > 0$. Having said that, it is worth noting that this condition is not absolutely necessary to ensure convergence of Algorithm 12. Even if this condition does not hold, Eq. (6.1) is still a nonconvex-concave min-max problem, so SGDA still converges to an $\epsilon$-stationary point, albeit at the slower rate of $\mathcal{O}(\epsilon^{-8})$ [204]. Alternatively, one can add a small $\ell_2$ regularization term to the objective to enforce strong concavity and recover the fast convergence rate of $\mathcal{O}(\epsilon^{-5})$.
∥This sentence is accurate to the degree that an approximate stationary point of the nonconvex objective Eq. (FERMI obj.) corresponds to an approximate risk minimizer.
∗∗A faster convergence rate of $\mathcal{O}(\epsilon^{-3})$ could be obtained by using the (more complicated) SREDA method of [217] instead of SGDA to solve the FERMI objective. We omit the details here.
6.2.2 Asymptotic Convergence of Algorithm 12 for Population-level FRMI Objective
So far, we have let $N \ge 1$ be arbitrary and have not made any assumptions on the underlying distribution(s) from which the data was drawn. Even so, we showed that Algorithm 12 always converges to a stationary point of Eq. (FERMI obj.). Now, we will show that if $\mathcal{D}$ contains i.i.d. samples from an unknown joint distribution $D$ and if $N \gg 1$, then Algorithm 12 converges to an approximate solution of the population-level risk minimization problem Eq. (FRMI obj.). Precisely, we will use a one-pass, sample-without-replacement (“online”) variant of Algorithm 12 to obtain this population loss guarantee. The one-pass variant is identical to Algorithm 12 except that: a) once we draw a batch of samples $B_t$, we remove these samples from the data set so that they are never re-used; and b) the for-loop terminates when we have used all $n$ samples.
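The one-pass schedule is simple to implement; a minimal sketch (names are ours):

import numpy as np

def one_pass_batches(n, batch_size, seed=0):
    # Sample-without-replacement ("online") minibatch schedule: each of the n
    # examples is used exactly once, and the loop ends when data is exhausted.
    perm = np.random.default_rng(seed).permutation(n)
    for start in range(0, n, batch_size):
        yield perm[start:start + batch_size]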
Theorem 21 Let $\epsilon > 0$. Assume that $\ell(x, y; \cdot)$ and $F(\cdot, x)$ are Lipschitz continuous and differentiable with Lipschitz continuous gradient, and that $\min_{r \in [k]} p_S(r) > 0$. Then, there exists $N \in \mathbb{N}$ such that if $n \ge N$ and $\mathcal{D} \sim D^n$, then the one-pass, sample-without-replacement variant of Algorithm 12 converges to an $\epsilon$-first-order stationary point of the Eq. (FRMI obj.) objective in $\mathcal{O}(\epsilon^{-5})$ stochastic gradient evaluations, for any batch sizes $|B_t|$.
Theorem 21 provides a guarantee on the fairness/accuracy loss that can be achieved on unseen “test data.” This is important because the main goal of (fair) machine learning is to (fairly) give accurate predictions on test data, rather than merely fitting the training data well. Specifically, Theorem 21 shows that with enough (i.i.d.) training examples at our disposal, (one-pass) Algorithm 12 finds an approximate stationary point of the population-level fairness objective Eq. (FRMI obj.). Furthermore, the gradient complexity is the same as it was in the empirical case. The proof of Theorem 21 is aided by the following result, which shows that $\widehat{\psi}_i$ is an asymptotically unbiased estimator of $\Psi$, where $\max_W \Psi(\theta, W)$ equals ERMI:
Proposition 22 Let $\{z_i\}_{i=1}^n = \{x_i, s_i, y_i\}_{i=1}^n$ be drawn i.i.d. from an unknown joint distribution $D$. Denote
$$\widehat{\psi}_i^{(n)}(\theta, W) = -\operatorname{Tr}\big(W\,\mathbb{E}[\hat{y}(x_i, \theta)\hat{y}(x_i, \theta)^T \mid x_i]\,W^T\big) + 2\operatorname{Tr}\Big(W\,\mathbb{E}[\hat{y}(x_i; \theta) s_i^T \mid x_i, s_i]\,\big(\widehat{P}_s^{(n)}\big)^{-1/2}\Big) - 1,$$
where $\widehat{P}_s^{(n)} = \frac{1}{n}\sum_{i=1}^n \operatorname{diag}(\mathbf{1}_{\{s_i = 1\}}, \cdots, \mathbf{1}_{\{s_i = k\}})$. Denote
$$\Psi(\theta, W) = -\operatorname{Tr}(W P_{\hat{y}} W^T) + 2\operatorname{Tr}(W P_{\hat{y},s} P_s^{-1/2}) - 1,$$
where $P_{\hat{y}} = \operatorname{diag}(\mathbb{E} F_1(\theta, x), \cdots, \mathbb{E} F_m(\theta, x))$, $(P_{\hat{y},s})_{j,r} = \mathbb{E}_{x_i, s_i}[F_j(\theta, x_i)\, s_{i,r}]$ for $j \in [m]$, $r \in [k]$, and $P_s = \operatorname{diag}(p_S(1), \cdots, p_S(k))$. Assume $p_S(r) > 0$ for all $r \in [k]$. Then,
$$\max_W \Psi(\theta, W) = D_R(\widehat{Y}(\theta); S) \quad \text{and} \quad \lim_{n \to \infty} \mathbb{E}[\widehat{\psi}_i^{(n)}(\theta, W)] = \Psi(\theta, W).$$
The proof of Proposition 22 is given in Sec. 10.4.5. The first claim is immediate from Proposition 17 and its proof, while the second claim is proved using the strong law of large numbers, the continuous mapping theorem, and Lebesgue's dominated convergence theorem.
Proposition 22 implies that the empirical stochastic gradients computed in Algorithm 12 are good approximations of the true gradients of Eq. (FRMI obj.). Intuitively, this suggests that when we use Algorithm 12 to solve the fair ERM problem Eq. (FERMI obj.), the output of Algorithm 12 will also be an approximate solution of Eq. (FRMI obj.). While Theorem 21 shows this intuition does indeed hold, its proof requires additional work. A reasonable first attempt at proving Theorem 21 might be to try to bound the expected distance between the gradient of FRMI and the gradient of FERMI (evaluated at the point $\hat{\theta}$ that is output by Algorithm 12) via Danskin's theorem [88] and strong concavity, and then leverage Theorem 19 to conclude that the gradient of FRMI must also be small. However, the dependence of $\hat{\theta}$ on the training data prevents us from obtaining a tight enough bound on the distance between the empirical and population gradients at $\hat{\theta}$. Thus, we take a different approach to proving Theorem 21, in which we consider the outputs of two different algorithms: one is the conceptual algorithm that runs one-pass Algorithm 12 as if we had access to the true sensitive attribute distribution $P_s$ (“Algorithm A”); the other is the realistic one-pass Algorithm 12 that only uses the training data (“Algorithm B”). We argue: 1) the output of the conceptual algorithm is a stationary point of the population-level objective; and 2) the distance between the gradients of the population-level objective at $\theta_A$ and $\theta_B$ is small. While 1) follows easily from the proof of Theorem 19 and the online-to-batch conversion, establishing 2) requires a careful argument. The main tools we use in the proof of Theorem 21 are Theorem 19, Proposition 22, Danskin's theorem, Lipschitz continuity of the argmax function for strongly concave objectives, the continuous mapping theorem, and Lebesgue's dominated convergence theorem; see Sec. 10.4.5 for the detailed proof.
Note that the online-to-batch conversion used to prove Theorem 21 requires a convergent stochastic optimization algorithm; this implies that our arguments could not be used to prove an analogue of Theorem 21
for existing fair learning algorithms since existing convergent fairness algorithms are not stochastic. An
alternate approach to bounding the “generalization error” of our algorithm would be to use a standard
covering/uniform convergence argument. However, this approach would not yield as tight a guarantee
as Theorem 21. Specifically, the accuracy and/or gradient complexity guarantee would depend on the
dimension of the space (i.e. the number of model parameters), since the covering number depends (exponentially) on the dimension. For large-scale problems with a huge number of model parameters, such
dimension dependence is prohibitive.
As previously mentioned, we can interpret Theorem 21 as providing a guarantee that Algorithm 12 generalizes well, achieving small fairness violation and test error even on unseen “test” examples, as long as the data is i.i.d. and N is sufficiently large. In the next section, we empirically corroborate Theorem 21,
by evaluating the fairness-accuracy tradeoffs of the FERMI algorithm (Algorithm 12) in several numerical
experiments.
6.3 Numerical Experiments
In this section, we evaluate the performance of FERMI in terms of the fairness violation vs. test error
for different notions of fairness (e.g. demographic parity, equalized odds, and equality of opportunity). To
this end, we perform diverse experiments comparing FERMI to other state-of-the-art approaches on several
benchmarks. In Section 6.3.1, we showcase the performance of FERMI applied to a logistic regression model
on binary classification tasks with binary sensitive attributes on Adult, German Credit, and COMPAS
datasets. In Section 6.3.2, we utilize FERMI with a convolutional neural network base model for fair (to
different religious groups) toxic comment detection. In Section 6.3.3, we explore fairness in non-binary
classification with non-binary sensitive attributes. Finally, Section 6.3.4 shows how FERMI may be used
beyond fair empirical risk minimization in domain generalization problems to learn a model independent
of spurious features.
6.3.1 Fair Classification with Binary Sensitive Attributes using Logistic Regression
6.3.1.1 Benchmarking full-batch performance
In the first set of experiments, we use FERMI to learn a fair logistic regression model on the Adult dataset.
With the Adult data set, the task is to predict whether or not a person earns over $50k annually without
discriminating based on the sensitive attribute, gender. We compare FERMI against state-of-the-art in-processing full-batch (|B| = N) baselines, including [362, 115, 168, 157, 141, 259, 23, 277, 100, 74]. Since
the majority of existing fair learning algorithms cannot be implemented with |B| < N, these experiments
allow us to benchmark the performance of FERMI against a wider range of baselines. To contextualize
the performance of these methods, we also include a Naïve Baseline that randomly replaces the model
output with the majority label (0 in Adult dataset), with probability p (independent of the data), and sweep
p in [0, 1]. At one end (p = 1), the output will be provably fair with performance reaching that of a
naive classifier that outputs the majority class. At the other end (p = 0), the algorithm has no fairness
mitigation and obtains the best performance (accuracy). By sweeping p, we obtain a tradeoff curve between
performance and fairness violation.
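A minimal sketch of this Naïve Baseline (names and toy data are ours):

import numpy as np

rng = np.random.default_rng(0)

def naive_baseline(y_hat, p, majority_label=0):
    # Replace each model output with the majority label w.p. p, independent of
    # the data. p = 0: unmitigated model; p = 1: provably fair naive predictor.
    out = np.asarray(y_hat).copy()
    out[rng.random(len(out)) < p] = majority_label
    return out

# Sweeping p in [0, 1] and recording (test error, fairness violation) at each
# point traces the baseline's tradeoff curve.
y_hat = rng.integers(0, 2, size=1000)
outputs = [naive_baseline(y_hat, p) for p in np.linspace(0, 1, 11)]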
Figure 6.1: Accuracy/Fairness trade-off of FERMI and several state-of-the-art in-processing approaches on Adult
dataset. FERMI offers the best fairness vs. accuracy tradeoff curve in all experiments against all baselines. [277] only
allow for a single output and do not yield a tradeoff curve. Further, the algorithms by [223] and [23] are equivalent
in this binary setting and shown by the red curve. In the binary/binary setting, FERMI, [223], and [23] all try to solve
the same objective Eq. (FRMI obj.). However, the empirical formulation Eq. (FERMI obj.) and FERMI algorithm that
we use results in better performance, even though we are using a full batch for all baselines in this experiment.
In Fig. 6.1, we report the fairness violation (demographic parity, equalized odds, and equality of opportunity violations) vs. test error of the aforementioned in-processing approaches on the Adult dataset.
The upper left corner of the tradeoff curves coincides with the unmitigated baseline, which only optimizes
for performance (smallest test error). As can be seen, FERMI offers a fairness-accuracy tradeoff curve
that dominates all state-of-the-art baselines in each experiment and with respect to each notion of fairness, even in the full batch setting. Aside from in-processing approaches, we compare FERMI with several
pre-processing and post-processing algorithms on Adult, German Credit, and COMPAS datasets in Appendix 10.4.6.5, where we show that the tradeoff curves obtained from FERMI dominate that of all other
baselines considered. See Appendix 10.4.6 for details on the data sets and experiments.
It is noteworthy that the empirical objectives of [223] and [23] are exactly the same in the binary/binary
setting, and their algorithms also coincide with the red curve in Fig. 6.1. This is because Exponential
Rényi mutual information is equal to Rényi correlation for binary targets and/or binary sensitive attributes (see Theorem 47), which is the setting of all experiments in Sec. 6.3.1. Additionally, like us, in
the binary/binary setting these works are trying to empirically solve Eq. (FRMI obj.), albeit using different
estimation techniques; i.e., their empirical objective is different from Eq. (FERMI obj.). This demonstrates
the effectiveness of our empirical formulation (FERMI obj.) and our solver (Algorithm 12), even though we
are using all baselines in full batch mode in this experiment. See Sec. 10.4.6.5 for the complete version of
Fig. 6.1 which also includes pre-processing and post-processing baselines.
Fig. 10.10 in Sec. 10.4.6 illustrates that FERMI outperforms baselines in the presence of noisy outliers
and class imbalance. Our theory did not consider the role of noisy outliers and class imbalance, so the
theoretical investigation of this phenomenon could be an interesting direction for future work.
6.3.1.2 The effect of batch size on fairness/accuracy tradeoffs
Next, we evaluate the performance of FERMI on smaller batch sizes ranging from 1 to 64. To this end,
we compare FERMI against several state-of-the-art in-processing algorithms that permit stochastic implementation for demographic parity: [223], [23], and [74]. Similarly to the full batch setting, for all methods,
we train a logistic regression model with the respective regularizer for each method. We use the L∞ metric (Theorem 53) to measure demographic parity violation. More details about the dataset
and experiments, and additional experimental results, can be found in Appendix 10.4.6.
Figure 6.2: Performance of FERMI, [75], [223], and [23] with different batch sizes on the Adult dataset. FERMI demonstrates the best accuracy/fairness tradeoff across all batch sizes.
Fig. 6.2 shows that FERMI offers a superior fairness-accuracy tradeoff curve against all baselines, for
each tested batch size, empirically confirming Theorem 19; FERMI is the only algorithm that is guaranteed to converge for small minibatches. It is also noteworthy that none of the other baselines can beat the Naïve Baseline when the batch size is very small, e.g., |B| = 1. Furthermore, FERMI with |B| = 4 almost achieves
the same fairness-accuracy tradeoff as the full batch variant.
6.3.1.3 The effect of missing sensitive attributes on fairness/accuracy tradeoffs
Sensitive attributes might be partially unavailable in many real-world applications due to legal issues,
privacy concerns, and data gathering limitations [374, 78]. Missing sensitive attributes make fair learning
tasks more challenging in practice.
Figure 6.3: Performance of FERMI and other state-of-the-art approaches on the Adult dataset where 90% of gender entries are missing. Full-sensitive FERMI is obtained by applying FERMI to the data without any missing entries.
The unbiased nature of the estimator used in the FERMI algorithm suggests that it may be able to handle cases where sensitive attributes are only partially available, with missing entries dropped uniformly at random. As a case study on the Adult dataset, we randomly masked 90% of the sensitive attribute (i.e., gender entries). To estimate the fairness regularization term, we rely on the remaining 10% of the training samples (≈ 3k) with sensitive attribute information. Figure 6.3 depicts the tradeoff between accuracy and fairness (demographic parity) violation for FERMI and other baselines. We suspect that the superior accuracy-fairness tradeoff of FERMI compared to other approaches is due to the fact that the gradient estimator remains unbiased when the entries are missing completely at random (MCAR). Note that the Naïve Baseline is the same as the one implemented in the previous section, and Full-sensitive FERMI is an oracle method that applies FERMI to the data with no missing attributes (for comparison purposes only). We observe that FERMI achieves only a slightly worse fairness-accuracy tradeoff compared to the Full-sensitive FERMI oracle, whereas the other baselines are hurt significantly and only narrowly outperform the Naïve Baseline.
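A minimal sketch of how a minibatch can be split in this setting (names are ours): under MCAR, the observed subset is itself a uniform random draw, so the stochastic fairness gradient computed from it remains unbiased.

import numpy as np

def split_batch(batch_idx, observed_mask):
    # All samples contribute to the loss term; only samples whose sensitive
    # attribute is observed contribute to the fairness regularizer.
    # batch_idx: (b,) int array; observed_mask: (N,) boolean array.
    loss_idx = batch_idx
    fair_idx = batch_idx[observed_mask[batch_idx]]
    return loss_idx, fair_idx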
6.3.2 Fair Binary Classification using Neural Models
Figure 6.4: Fair toxic comment detection with different batch sizes. For |B| = 128, the performance of [259] and FERMI are close to each other; however, when the batch size is reduced to 16, FERMI demonstrates a better fairness/performance tradeoff. Performance and fairness are measured by the test error and the false positive rate gap between different religious subgroups (Christians vs. Muslims-Jews), respectively.
In this experiment, our goal is to showcase the efficacy of FERMI in stochastic optimization with neural network function approximation. To this end, we apply FERMI, [259], [23], and [223] (which coincides with [23]) to the Toxic Comment Classification dataset, where the underlying task is to predict whether a given published comment in social media is toxic. The sensitive attribute is religion, binarized into two groups: Christians in one group; Muslims and Jews in the other group. Training a neural network without considering fairness leads to a higher false positive rate for the Muslim-Jewish group. Figure 6.4 demonstrates the performance of FERMI, MinDiff [259], [23], and the Naïve Baseline with two different batch sizes: 128 and 16. Performance is measured by the overall false positive rate of the trained network, and fairness violation is measured by the false positive rate gap between the two sensitive groups (Christians and Muslims-Jews). The network structure is exactly the same as the one used by MinDiff [259]. We observe that as the batch size decreases, FERMI maintains the best fairness-accuracy tradeoff compared to the other baselines.
Figure 6.5: Comparison between FERMI, [223], [23], and [74] on Communities dataset. [223] outperforms [23, 74],
which we believe could be attributed to the effectiveness of ERMI as a regularizer. FERMI outperforms [223]. This
can be attributed to our empirical formulation Eq. (FERMI obj.) and unbiased stochastic optimization algorithm.
6.3.3 Fair Non-binary Classification with Multiple Sensitive Attributes
In this section, we consider a non-binary classification problem with multiple binary sensitive attributes. In this case, we use the Communities and Crime dataset, which has 18 binary sensitive attributes in total. For our experiments, we pick subsets of 1, 2, 3, . . . , 18 sensitive attributes, which corresponds to $|S| \in \{2, 4, 8, \ldots, 2^{18}\}$. We discretize the target into three classes: {high, medium, low}. The only baselines that we are aware of that can handle non-binary classification with multiple sensitive attributes are [223], [23], [74], [75], and [366]. We used the publicly available implementations of [23] and [74] and extended their binary classification algorithms to the non-binary setting, combining the binary attributes into one categorical attribute as sketched below.
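A minimal sketch of this encoding (names are ours): j binary sensitive attributes are merged into a single categorical attribute taking one of 2^j values, which is what the non-binary regularizer expects.

import numpy as np

def merge_attributes(S_binary):
    # S_binary: (N, j) 0/1 array of binary sensitive attributes.
    # Returns an (N,) array of codes in {0, ..., 2**j - 1}.
    powers = 2 ** np.arange(S_binary.shape[1])
    return S_binary @ powers

S = np.array([[0, 1, 1],
              [1, 0, 0]])
print(merge_attributes(S))  # [6 1]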
The results are presented in Fig. 6.5, where we use conditional demographic parity L∞ violation (Theorem 53) and conditional equal opportunity L∞ violation (Theorem 56) as the fairness violation notions for
the two experiments. In each panel, we compare the test error for different numbers of sensitive attributes at a fixed value of fairness violation. It is expected that the test error increases with the number of sensitive attributes, as there is a more stringent fairness constraint to satisfy. As can be seen, compared to the
baselines, FERMI offers the most favorable test error vs. fairness violation tradeoffs, particularly as the
number of sensitive attributes increases and for the more stringent fairness violation levels, e.g., 0.02.††
6.3.4 Beyond Fairness: Domain Parity Regularization for Domain Generalization
In this section, we demonstrate that our approach may extend beyond fair empirical risk minimization to other problems, such as domain generalization. In fact, [197, 180, 80] have already established connections between fair ERM and domain generalization. We consider the Color MNIST dataset [197], in which all 60,000 training digits are colored with colors drawn from a class-conditional Gaussian distribution with variance $\sigma^2$ around a certain average color for each digit, while the test set remains black and white. [197] show that as $\sigma^2 \to 0$, a convolutional network model overfits significantly to each digit's color on the training set and achieves vanishing training error. However, the learned representation does not generalize to the black-and-white test set, due to the spurious correlation between digits and color.
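A minimal sketch of this coloring procedure (our own illustrative reconstruction, not the exact code of [197]):

import numpy as np

def colorize(images, labels, mean_colors, sigma, rng):
    # Tint each grayscale digit with a color drawn from a class-conditional
    # Gaussian around that digit's mean color.
    # images: (N, 28, 28) in [0, 1]; labels: (N,); mean_colors: (10, 3) RGB.
    colors = mean_colors[labels] + sigma * rng.standard_normal((len(images), 3))
    colors = np.clip(colors, 0.0, 1.0)
    return images[..., None] * colors[:, None, None, :]   # (N, 28, 28, 3)

As sigma approaches 0, the color becomes a deterministic function of the digit, creating the spurious correlation described above.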
Conceptually, the goal of the classifier in this problem is to achieve high classification accuracy with
predictions that are independent of the color of the digit. We view color as the sensitive attribute in this
experiment and apply fairness baselines for the demographic parity notion of fairness. One would expect
that by promoting such independence through a fairness regularizer, the generalization would improve
(i.e., lower test error on the black and white test set), at the cost of increased training error (on the colored
training set). We compare against [223], [23], and [74] as baselines in this experiment.
The results of this experiment are illustrated in Fig. 6.6. In the left panel, we see that with no regularization (λ = 0), the test error is around 80%. As λ increases, all methods achieve smaller test errors while the training error increases. We also observe that FERMI offers the best test error in this setup. In the right panel, we observe that decreasing the batch size results in significantly worse generalization for the three baselines considered (due to their biased estimators of the regularizer). However, the negative impact of small batch sizes is much less severe for FERMI, since FERMI uses unbiased stochastic gradients. In particular, the performance gap between FERMI and the other baselines is more than 20% for |B| = 64. Moreover, FERMI with minibatch size |B| = 64 still outperforms all other baselines with |B| > 1,000. Finally, notice that the test error achieved by FERMI when σ = 0 is ∼30%, as compared to more than 50% obtained using REPAIR [197] for σ ≤ 0.05.
††Sec. 6.3.4 demonstrates that using smaller batch sizes results in much more pronounced advantages of FERMI over these baselines.
Figure 6.6: Domain generalization on Color MNIST [197] using in-processing fair algorithms for demographic parity. Left panel: the dashed lines show training error and the solid lines show test error. As λ increases, fairness regularization results in a learned representation that is less dependent on color; hence training error increases while test error decreases (all algorithms reach a plateau around λ = 8). We use |B| = 512 for all baselines. Right panel: test error vs. batch size using an optimized value of λ for each algorithm, selected via a validation set. The performance of the baselines drops 10-20% as the batch size becomes small, whereas FERMI is less sensitive to batch size.
6.4 Discussion and Concluding Remarks
In this chapter, we tackled the challenge of developing a fairness-promoting algorithm that is amenable
to stochastic optimization. As discussed, algorithms for large-scale ML problems are constrained to use
stochastic optimization with (small) minibatches of data in each iteration. To this end, we formulated an
empirical objective (FERMI obj.) using ERMI as a regularizer and derived unbiased stochastic gradient
estimators. We proposed the stochastic FERMI algorithm (Algorithm 12) for solving this objective. We
then provided the first theoretical convergence guarantees for a stochastic in-processing fairness algorithm,
by showing that FERMI converges to stationary points of the empirical and population-level objectives
(Theorem 19, Theorem 21). Further, these convergence results hold even for non-binary sensitive attributes
and non-binary target variables, with any minibatch size.
From an experimental perspective, we showed that FERMI leads to better fairness-accuracy tradeoffs
than all of the state-of-the-art baselines on a wide variety of binary and non-binary classification tasks (for
demographic parity, equalized odds, and equal opportunity). We also showed that these benefits are particularly significant when the number of sensitive attributes grows or the batch size is small. In particular,
we observed that FERMI consistently outperforms [223] (which tries to solve the same objective Eq. (FRMI
obj.)) by up to 20% when the batch size is small. This is not surprising since FERMI is the only algorithm that is guaranteed to find an approximate solution of the fair learning objective with any batch size
|B| ≥ 1. Also, we show in Fig. 10.9 that the lack of convergence guarantee of [223] is not just due to more
limited analysis: in fact, their stochastic algorithm does not converge. Even in full batch mode, FERMI
outperforms all baselines, including [223] (Fig. 6.1, Fig. 6.5). In full batch mode, all baselines should be
expected to converge to an approximate solution of their respective empirical objectives, so this suggests
that our empirical objective Eq. (FERMI obj.) is fundamentally better, in some sense than the empirical
objectives proposed in prior works. In what sense is Eq. (FERMI obj.) a better empirical objective (apart
from permitting stochastic optimization)? For one, it is an asymptotically unbiased estimator of Eq. (FRMI
obj.) (by Theorem 22), and Theorem 21 suggests that FERMI algorithm outputs an approximate solution
of Eq. (FRMI obj.) for large enough N.
By contrast, the empirical objectives considered in prior works do not provably yield an approximate
solution to the corresponding population-level objective.
The superior fairness-accuracy tradeoffs of FERMI algorithm over the (full batch) baselines also suggest
that the underlying population-level objective Eq. (FRMI obj.) has benefits over other fairness objectives.
What might these benefits be? First, ERMI upper bounds all other fairness violations (e.g., Shannon mutual
information, Lq, L∞) used in the literature: see Sec. 10.4.3. This implies that ERMI-regularized training
yields a model that has a small fairness violation with respect to these other notions. Could this also
somehow help explain the superior fairness-accuracy tradeoffs achieved by FERMI? Second, the objective
function Eq. (FRMI obj.) is easier to optimize than the objectives of competing in-processing methods:
ERMI is smooth and is equal to the trace of a matrix (see Theorem 51 in the Appendix), which is easy to
compute. Contrast this with the larger computational overhead of Rényi correlation used by [23], for example, which requires finding the second singular value of a matrix. Perhaps these computational benefits
contribute to the observed performance gains.
We leave it as future work to rigorously understand the factors that are most responsible for the favorable fairness-accuracy tradeoffs observed from FERMI.
Broader Impact and Limitations
This chapter studied the important problem of developing practical machine learning (ML) algorithms that
are fair towards different demographic groups (e.g. race, gender, age). We hope that the societal impacts
of our work will be positive, as the deployment of our FERMI algorithm may enable/help companies,
government agencies, and other organizations to train large-scale ML models that are fair to all groups of
users. On the other hand, any technology has its limitations, and our algorithm is no exception.
One important limitation of our work is that we have (implicitly) assumed that the data set at hand is
labeled accurately and fairly. For example, if race is the sensitive attribute and “likelihood of default on
a loan” is the target, then we assume that the training data based on past observational data accurately
reflects the financial histories of all individuals (and in particular does not disproportionately inflate the
financial histories of racial minorities). If this assumption is not satisfied in practice, then the outcomes
promoted by our algorithm may not be as fair (in the philosophical sense) as the computed level of fairness
violation might suggest. For example, if racial minorities are identified as higher risk for default on loans,
they may be extended loans with higher interest rates and payments, which may in turn increase their
likelihood of a default. Hence, it might even be possible that our mitigation strategy could result in more
unfairness than unmitigated ERM in this case. More generally, conditional fairness notions like equalized
odds suffer from a potential amplification of the inherent discrimination that may exist in the training
data. Tackling such issues is beyond the scope of this work; c.f. [172] and [35].
Another consideration that was not addressed in this chapter is the interplay between fairness and
other socially consequential AI metrics, such as privacy and robustness (e.g. data poisoning). It is possible
that our algorithm could increase the impact of data from certain individuals to improve fairness at the risk
of leaking private information about individuals in the training data set (e.g. via membership inference
attacks or model inversion attacks), even if the data is anonymous [120, 305, 111, 234, 61]. Differential privacy [107] ensures that sensitive data cannot be leaked (with high probability), and the interplay between
fairness and privacy has been explored (see, e.g., [155, 353, 84, 229, 330, 331]). Developing and analyzing
a differentially private version of FERMI could be an interesting direction for future work. Another potential threat to FERMI-trained models is data poisoning attacks. While our experiments demonstrated
that FERMI is relatively effective with missing sensitive attributes, we did not investigate its performance
in the presence of label flipping or other poisoning attacks. Exploring and improving the robustness of
FERMI is another avenue for future research.
Chapter 7
Dr. FERMI: A Stochastic Distributionally Robust Fair Empirical Risk
Minimization Framework
This chapter is based on a paper in the Algorithmic Fairness Through the Lens of Time Workshop at NeurIPS 2023 [26].
While there has been extensive research on training fair machine learning models in recent years,
the majority of developed methods rely on the assumption that the training and test data exhibit similar
distributions. Consequently, models trained using these approaches may demonstrate unfair behavior on
test data when confronted with distributional shifts commonly encountered in real-world applications.
Efforts have been made to enhance the robustness of fair learning models against such shifts; however,
many of the proposed solutions necessitate access to a causal graph describing the interaction of various
features—a luxury rarely available in practical scenarios. Furthermore, existing algorithms typically require the entire dataset at every iteration, rendering them unsuitable for scenarios involving small batches
(stochastic/batch implementation). This chapter introduces Dr. FERMI, the first stochastic distributionally robust fair empirical risk minimization framework, offering convergence guarantees without the need
for knowledge of the causal graph. More specifically, Dr. FERMI formulates fair learning in the presence of
the distribution shift as a distributionally robust optimization problem considering the Exponential Rényi
Mutual Information (ERMI) as the measure of fairness violation. We then discuss how the proposed method
can be implemented in a stochastic fashion. To assess the effectiveness and efficiency of Dr. FERMI, we
conduct thorough experiments on real and synthetic datasets characterized by distribution shifts.
7.1 Introduction
Machine Learning models demonstrate remarkable results in automated decision-making tasks such as
image processing [178], object detection [324], natural language understanding [97, 292], speech recognition [1], and automated code generation [336, 236]. However, naïvely optimizing the accuracy (performance) may lead to biased models against protected groups such as racial minorities and women [16,
54]. A wide range of algorithms has been proposed to enhance the fairness of machine learning models. These algorithms can be divided into three main categories: pre-processing, post-processing, and
in-processing. Pre-processing algorithms remove the dependence/correlation of the training data with the
sensitive attributes by transforming it into a new “fair representation” in a separate stage prior to the training
procedure [364, 81]. Post-processing algorithms satisfy the fairness constraints by adjusting the decision
boundary of the learned model after the training stage [141, 207, 174]. Finally, in-processing approaches
optimize the parameters of the model to maximize accuracy while satisfying the fairness constraints during the training procedure; see, e.g., [362, 100, 223, 23]. While in-processing approaches can ideally achieve
better fairness-accuracy tradeoffs compared to pre- and post-processing approaches, they are computationally expensive and, in most cases, not amenable to stochastic optimization methods due to the complex
fairness constraints in the training procedure [211].
The underlying assumption of most aforementioned fair learning algorithms is that the train and test
domains have an identical distribution. Thus, establishing fairness in the training phase can guarantee
fairness in the test phase. However, this assumption does not hold in many real applications. As an
example, consider the new collection of datasets (ACS PUMS) released by [99] with the underlying task of
predicting the income level of individuals (similar to the Adult dataset [104]). The data points are divided based on different years and states. When several fairness algorithms are applied to the ACS PUMS dataset to learn fair logistic regression and XGBoost classifiers on the population of one US state, the performance of the learned models drops severely, in terms of both accuracy and fairness violation, when they are evaluated on other states [99]. We made a similar observation in our numerical experiments (see Figure 7.2). As another example, [297] reviews several cases from the field of healthcare and medicine where a model trained to be “fair” in one hospital does not behave fairly in other hospitals. Such examples demonstrate the
importance of developing efficient algorithms for training fair models against distribution shifts.
Another crucial requirement in modern machine learning algorithms is the amenability to stochastic optimization. In large-scale learning tasks, only a mini-batch of data is used at each algorithm step.
Thus, an algorithm implemented on minibatches of data must still decrease the loss function over its iterates and remain convergent. While this requirement is met when training with vanilla empirical risk minimization (ERM) loss functions [186, 163], stochastic algorithms (such as stochastic gradient descent) are not necessarily convergent in the presence of fairness regularizers [210].
In this work, we propose a distributionally robust optimization framework to maintain fairness across
different domains in the presence of distribution shifts. Our framework does not rely on having access to
the knowledge of the causal graph of features. Moreover, our framework is amenable to stochastic
optimization and comes with convergence guarantees.
7.1.1 Related Work
Machine learning models can suffer from severe performance drops when the distribution of the test data
differs from training data [40, 228, 271, 272]. Learning based on spurious features [54, 376], class imbalance [198, 160], non-ignorable missing values [355], and overfitting [334] are several factors contributing
to such poor performance on test data. To mitigate distribution shift, [39] formalizes the problem as a
domain adaptation task where the unlabeled test data is available during the training procedure. In this
case, the key idea is to regularize the empirical risk minimization over the given samples via a divergence
function measuring the distance between the source and target distributions. When data from the target
domain is not available, the most common strategy for handling distribution shift is distributionally robust
optimization [268, 179, 202]. This approach relies on minimizing the risk for the worst-case distribution
within an uncertainty set around the training data distribution. Such uncertainty sets can be defined as
ℓ1 [39], ℓ2 [123], ℓ∞ [24], optimal transport [48], Wasserstein distance [179], Sinkhorn divergences [245],
$\chi^2$-divergence [190], or Conditional Value at Risk (CVaR) [379, 190] balls around the empirical distribution
of the source domain.
A diverse set of methods and analyses has been introduced in the context of learning fair models in the presence of distribution shift. [184] shows that learning a fair representation using a fair pre-processing algorithm is almost impossible, even under the demographic parity notion, when the distribution of the test data shifts. [85] empirically demonstrates how bias in labels and changes in the distribution of labels in the target domain can lead to catastrophic performance drops and severely unfair behavior in the test phase.
[99] shows through extensive experiments that applying post-processing fairness techniques to learn fair predictors of income with respect to race, gender, and age fails to transfer from one US state (the training domain) to another state. Several in-processing methods have been proposed in the literature to mitigate specific types of distribution shifts. [309] finds a subset of features with minimum risk on the target domain, with fairness guarantees in the test phase, relying on the causal graph and conditional separability
of context variables and labels. [103] proposes a reweighting mechanism [112] for learning robust fair
predictors when the task involves sample selection bias with respect to different protected groups. [278]
generalizes the fair log loss classifier of [277] to a distributionally robust log-likelihood estimator under
the covariate shift. [13] finds sufficient conditions for transferring fairness from the source to the target
domain in the presence of sub-population and domain shifts and handles these two specific types of shifts
using a proposed consistency regularizer.
The above methods rely on the availability of unlabeled test samples in the training phase (domain
adaptation setting) or explicit assumptions on having access to the causal graph describing the causal interaction of features and/or knowing the specific type of the distribution shift a priori. As an alternative
approach, [325] learns a distributionally robust fair classifier over a Wasserstein ball around the empirical
distribution of the training data as the uncertainty set. Unfortunately, their formulation is challenging
to solve efficiently via scalable first-order methods, even for linear and convex classifiers such as logistic
regression, and there are no efficient algorithms converging to the optimal solutions of the distributionally robust problem. The proposed algorithm in that paper is a greedy approach whose time complexity grows
with the number of samples. In another effort, [341] optimizes the accuracy and maximum mean discrepancy (MMD) of the curvature distribution of the two sub-populations (minorities and majorities) jointly to
impose robust fairness. This formulation is based on the idea that flatter local optima with less sharpness
have nicer properties in terms of robustness to the distribution shift. However, no convergence guarantee
for the proposed optimization problem is provided. Compared to [341], our work is amenable to stochastic
optimization. In addition, we define the distributionally robust problem directly over the fairness violation measure, while [341] relies on matching the curvature distribution of different sub-populations as a
heuristic proxy for the measuring robustness of fairness.
7.1.2 Our Contribution
We propose stochastic and deterministic distributionally robust optimization algorithms under Lp-norm balls as the uncertainty sets for maintaining various group fairness criteria across different domains. We establish the convergence of the algorithms and show that the group fairness criteria can be maintained in the target domain with a proper choice of the uncertainty set size and the fairness regularizer coefficient, without relying on knowledge of the causal graph of the feature interactions or explicit knowledge of the distribution shift type (demographic, covariate, or label shift). The proposed stochastic algorithm is the
first provably convergent algorithm with any arbitrary batch size for distributionally robust fair empirical
risk minimization. Stochastic (mini-batch) updates are crucial to large-scale learning tasks consisting of a
huge number of data points.
7.2 Preliminaries and Problem Description
Consider a supervised learning task of predicting a label/target random variable y ∈ Y ≜ {1, 2, . . . , m} based on the input feature x. Assume our model (e.g., a neural network or logistic regression model) outputs the variable ŷ_θ(x), where θ represents the model parameters (e.g., the weights of the neural network). For simplicity, we sometimes use the notation ŷ instead of ŷ_θ(x). Let s ∈ S = {1, 2, . . . , k} denote the random variable modeling the sensitive attribute (e.g., gender or ethnicity) of data points. If there is more than one sensitive attribute, one can consider all possible combinations and merge them into one variable. In the fair supervised learning task, we aim to reach two (potentially) competing goals: Goal 1) maximize the prediction accuracy P*(ŷ_θ(x) = y); Goal 2) maximize fairness. Here, P*(·) is the ground-truth distribution during the deployment/test phase, which is typically unknown during training. The first goal is usually achieved by minimizing a certain loss function. To achieve the second goal, one needs to mathematically define “fairness,” which is described next.
7.2.1 Notions of Group Fairness
Different notions of group fairness have been introduced [360, 233]. Among them, demographic parity [3, 105], equalized odds [141], and equality of opportunity [141] have gained popularity. A classifier satisfies demographic parity [105] if the output of the classifier is independent of the sensitive attribute, i.e.,
\[
\mathbb{P}^*(\hat{y}, s) = \mathbb{P}^*(\hat{y})\, \mathbb{P}^*(s). \tag{7.1}
\]
On the other hand, a classifier is fair with respect to the equalized odds notion [141] if
\[
\mathbb{P}^*(\hat{y}, s \mid y = a) = \mathbb{P}^*(\hat{y} \mid y = a)\, \mathbb{P}^*(s \mid y = a) \quad \forall a \in \mathcal{Y}. \tag{7.2}
\]
Further, in the case of binary classification, equality of opportunity is defined as [141]:
\[
\mathbb{P}^*(\hat{y}, s \mid y = 1) = \mathbb{P}^*(\hat{y} \mid y = 1)\, \mathbb{P}^*(s \mid y = 1), \tag{7.3}
\]
where y = 1 is the advantaged group (i.e., the desired outcome from each individual’s viewpoint).
7.2.2 Fairness Through Regularized ERM
In the fairness notions defined above, ŷ = ŷ_θ(x) depends on the model parameters θ. Therefore, the above fairness notions depend on the parameters of the (classification) model θ. Moreover, they are all in the form of (conditional) statistical independence between the random variables y, ŷ, and s. Thus, a popular framework for reaching the two (potentially competing) goals of maximizing accuracy and fairness is the regularized empirical risk minimization framework [362, 23, 223, 133, 210]. In this framework, we add a fairness-imposing regularizer to the regular empirical risk minimization loss. More specifically, the model is trained by solving the optimization problem
\[
\min_{\theta} \; \mathbb{E}_{P}\big[\ell(\hat{y}_\theta(x), y)\big] + \lambda\, \rho\big(\hat{y}, y, s; P\big). \tag{7.4}
\]
Here, the first term in the objective function aims to improve the model's accuracy, with ℓ(·, ·) being the loss function, such as the cross-entropy loss. On the other hand, ρ(ŷ, y, s; P) is a group fairness violation measure, which quantifies the model's fairness violation. We will discuss examples of such measures in Section 7.2.3. The positive regularization coefficient λ controls the tradeoff between fairness and accuracy. In the above formulation, P denotes the data distribution. Ideally, we would like to define the expectation term E_P[ℓ(ŷ_θ(x), y)] and the fairness violation term ρ(ŷ, y, s; P) in (7.4) with respect to the
test distribution P*, i.e., P = P*. However, since this distribution is unknown, existing works typically use the training data instead. Next, we describe popular approaches in the literature for defining the fairness violation measure ρ(ŷ, y, s; P).
7.2.3 Measuring Fairness Violation
The fairness criteria defined in (7.1), (7.2), and (7.3) can all be viewed as a statistical (conditional) independence condition between the random variables y, ŷ, and s. For example, the equalized odds notion (7.2) means that ŷ is independent of s conditioned on the random variable y. Therefore, one can quantify the “violation” of these conditions through well-known statistical dependence measures, and such a quantification can be used as a regularizer in (7.4). In what follows, we briefly describe some of these measures. We describe these measures and our methodology only for promoting the demographic parity notion; the equalized odds and equality of opportunity notions can be tackled in a similar fashion. For instance, the equality of opportunity violation uses a similar regularization term in which only the points with actual label 1 (y = 1) are considered.

To quantify/measure the fairness violation based on the demographic parity notion, one needs to measure the statistical dependence between the sensitive attribute s and the output ŷ. To this end, [362] utilizes the empirical covariance of the two variables, i.e., ρ_c(ŷ, s, y; P) ≜ E[ŷ s]. However, it is known that zero covariance between two random variables does not necessarily imply their statistical independence. To address this issue, several in-processing approaches utilize nonlinear statistical dependence measures such as mutual information [285], maximum mean discrepancy (MMD) [245, 259], Rényi correlation [23, 133], and Exponential Rényi Mutual Information (ERMI) [223, 211]. Let us briefly review some of these notions. Given the joint distribution of sensitive attributes and predictions, the mutual information between the sensitive attributes and predictions is defined as
\[
\rho_{\mathrm{MI}}(\hat{y}, s, y; P) = \sum_{s, \hat{y}} P(s, \hat{y}) \log \frac{P(s, \hat{y})}{P(s)\, P(\hat{y})}.
\]
Some other
notions are based on the Hirschfeld–Gebelein–Rényi (HGR) correlation and the Exponential Rényi Mutual Information (ERMI), which are respectively defined as [147, 124, 274, 210]:
\[
\rho_{\mathrm{HGR}}(\hat{y}, s, y; P) = \sigma_2(Q) \quad \text{and} \quad \rho_{\mathrm{ERMI}}(\hat{y}, s, y; P) = \mathrm{Tr}(Q^{T} Q), \tag{7.5}
\]
where σ_2(Q) denotes the second largest singular value of the matrix Q defined as
\[
Q_{ij} = \frac{P_{\hat{y}, s}(\hat{y} = i, s = j)}{\sqrt{P_{\hat{y}}(\hat{y} = i)}\, \sqrt{P_{s}(s = j)}}. \tag{7.6}
\]
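To make these quantities concrete, the following Python sketch (our own minimal illustration, not the implementation used in our experiments) builds the matrix Q of (7.6) from an empirical joint distribution of predictions and sensitive attributes and evaluates the ρ_HGR and ρ_ERMI of (7.5):

import numpy as np

def q_matrix(p_joint):
    # Q_ij = P(yhat = i, s = j) / sqrt(P(yhat = i) * P(s = j)), as in (7.6)
    p_yhat = p_joint.sum(axis=1)   # marginal distribution of the predictions
    p_s = p_joint.sum(axis=0)      # marginal distribution of the sensitive attribute
    return p_joint / np.sqrt(np.outer(p_yhat, p_s))

# toy joint distribution of (prediction, sensitive attribute)
p_joint = np.array([[0.30, 0.10],
                    [0.20, 0.40]])
Q = q_matrix(p_joint)
sv = np.linalg.svd(Q, compute_uv=False)  # singular values in descending order
rho_hgr = sv[1]                          # second largest singular value, as in (7.5)
rho_ermi = np.trace(Q.T @ Q)             # sum of the squared singular values
print(rho_hgr, rho_ermi)                 # the top singular value of Q is always 1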
7.2.3.1 ERMI as Fairness Violation Measure
In this chapter, we utilize ρ_ERMI(ŷ, s, y; P) as the measure of independence between the target variable and the sensitive attribute(s) for the following reasons. First, as shown in [210], ERMI upper-bounds many well-known measures of independence, such as HGR, ℓ1, ℓ∞, and Shannon divergence. Thus, by minimizing ERMI, we also guarantee that other well-known fairness violation measures are minimized. Second, unlike mutual information, Rényi correlation [23, 223], and MMD [259], there exists a convergent stochastic algorithm to the stationary solutions of (7.4) for ERMI, which makes this measure suitable for large-scale datasets containing a large number of samples [210].
7.3 Efficient Robustification of ERMI
In most applications, one cannot access the test set during the training phase. Thus, the regularized ERM framework (7.4) is solved w.r.t. the training set distribution. That is, the correlation term ρ(·) in (7.4) is computed w.r.t. the training data distribution to learn θ. However, as mentioned earlier, the test distribution may differ from the training distribution, leading to a significant drop in the accuracy of the model and a significant increase in the fairness violation measure ρ(·) at the test phase.
7.3.1 Distributionally Robust Formulation of ERMI
A prominent approach to robustifying machine learning models against distribution shifts is Distributionally Robust Optimization (DRO) [93, 345, 202]. In this framework, the performance of the model is optimized with respect to the worst possible distribution within a certain uncertainty region. Assume that P_tr is the joint training distribution of (x, y), where x is the set of input features and y is the target variable (label). Distributionally robust optimization for fair risk minimization can be formulated as:
\[
\min_{\theta} \; \max_{P \in \mathcal{U}(P_{\mathrm{tr}})} \; \mathbb{E}_{(x, y) \sim P}\big[\ell(\hat{y}_\theta(x), y)\big] + \lambda\, \rho\big(\hat{y}_\theta(x), y, s; P\big), \tag{7.7}
\]
where P_tr denotes the empirical distribution over the training data points, and the set U(P_tr) is the uncertainty set for the test distribution, typically defined as a set of distributions that are close to P_tr. The min-max problem in (7.7) is non-convex non-concave in general; therefore, even finding a locally optimal solution to this problem is not guaranteed for efficient first-order methods [270, 89, 160, 171]. To simplify this problem, we upper-bound (7.7) by the following problem, in which the uncertainty sets for the accuracy and fairness terms are decoupled:
\[
\min_{\theta} \; \max_{P \in \mathcal{U}_1(P_{\mathrm{tr}})} \mathbb{E}_{(x, y) \sim P}\big[\ell(\hat{y}_\theta(x), y)\big] + \max_{P \in \mathcal{U}_2(P_{\mathrm{tr}})} \lambda\, \rho\big(\hat{y}_\theta(x), y, s; P\big). \tag{7.8}
\]
The first maximization term in the objective function is the distributionally robust version of empirical risk minimization, which has been extensively studied in the literature, and the ideas behind many existing algorithms can be utilized [179, 289, 190, 379]. However, the second term has not been studied before in the literature. Thus, from now on, and to simplify the presentation of the results, we focus on how to handle the second term. In other words, for now, we assume that U_1(P_tr) = {P_tr} is a singleton and we only robustify fairness. Later, we will utilize existing algorithms for CVaR and group DRO optimization to modify our framework for solving the general form of (7.8), robustifying both accuracy and fairness.
7.3.2 The Uncertainty Set
Different ways of defining/designing the uncertainty set in the DRO framework are discussed in Section 7.1.1. The uncertainty set can be defined based on knowledge of the potential type of distribution shift (which might not be available during training). Moreover, the uncertainty set should be defined so that the resulting DRO problem is efficiently solvable. As mentioned in Section 7.2.3.1, we utilize the correlation measure ρ_ERMI(·), which depends on the probability distribution P through the matrix Q; see equations (7.6) and (7.5). Consequently, we define the uncertainty set directly on the matrix Q. Thus, our robust fair learning problem can be expressed as
\[
\min_{\theta} \; \mathbb{E}_{P_{\mathrm{tr}}}\big[\ell(\hat{y}_\theta(x), y)\big] + \lambda \max_{Q_\theta \in \mathcal{B}(Q_\theta^{\mathrm{tr}}, \epsilon)} \mathrm{Tr}(Q_\theta^{T} Q_\theta), \tag{Dr. FERMI}
\]
where Q_θ^tr is obtained by plugging P_tr into (7.6), and B(Q_θ^tr, ϵ) is the uncertainty set/ball around Q_θ^tr.
7.3.3 Simplification of Dr. FERMI
The formulation (Dr. FERMI) is of min-max form and could be challenging to solve at first glance [270, 89]. The following theorem shows how (Dr. FERMI) can be simplified for various choices of the uncertainty set B(Q_θ^tr, ϵ), leading to the development of efficient optimization algorithms. The proof of this theorem is deferred to Appendix 10.5.1.

Theorem 23 Define B(Q_θ^tr, ϵ, p) = {Q : ‖σ(Q) − σ(Q_θ^tr)‖_p ≤ ϵ}, where σ(Q) is the vector of singular values of Q. Then: a) for p = 1, we have:
\[
\max_{Q_\theta \in \mathcal{B}(Q_\theta^{\mathrm{tr}}, \epsilon, p),\; \sigma_1(Q_\theta) = 1} \mathrm{Tr}(Q_\theta^{T} Q_\theta) = \mathrm{Tr}\big((Q_\theta^{\mathrm{tr}})^{T} Q_\theta^{\mathrm{tr}}\big) + 2\epsilon\, \sigma_2(Q_\theta^{\mathrm{tr}}) + \epsilon^2, \tag{7.9}
\]
which means the robust fair regularizer can be described as ρ_ERMI(ŷ, s) + 2ϵ ρ_HGR(ŷ, s) when p = 1.
b) For p = 2,
\[
\max_{Q_\theta \in \mathcal{B}(Q_\theta^{\mathrm{tr}}, \epsilon, p)} \mathrm{Tr}(Q_\theta^{T} Q_\theta) = \mathrm{Tr}\big((Q_\theta^{\mathrm{tr}})^{T} Q_\theta^{\mathrm{tr}}\big) + 2\epsilon \sqrt{\mathrm{Tr}\big((Q_\theta^{\mathrm{tr}})^{T} Q_\theta^{\mathrm{tr}}\big)} + \epsilon^2, \tag{7.10}
\]
which means that when p = 2, the regularizer is equivalent to ρ_ERMI(ŷ, s) + 2ϵ √(ρ_ERMI(ŷ, s)).
c) For p = ∞, assume that Q_θ^tr = U_θ Σ_θ V_θ^T is the singular value decomposition of Q_θ^tr. Then:
\[
\max_{Q_\theta \in \mathcal{B}(Q_\theta^{\mathrm{tr}}, \epsilon, p)} \mathrm{Tr}(Q_\theta^{T} Q_\theta) = \mathrm{Tr}\big((Q_\theta^{\mathrm{tr}})^{T} Q_\theta^{\mathrm{tr}}\big) + 2\epsilon\, \mathrm{Tr}(|\Sigma_\theta|) + \epsilon^2. \tag{7.11}
\]
Remark 24 In the case of p = 1, we enforce σ_1(Q) = 1. Without this condition, as we will see in the proof, the maximum is attained at a Q that does not correspond to a probability vector.
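To illustrate Theorem 23, the following Python sketch (a direct transcription of (7.9)-(7.11) with our own function name, not part of the training code) evaluates the worst-case regularizer for the three uncertainty balls given the training matrix Q_θ^tr:

import numpy as np

def robust_ermi(Q_tr, eps, p):
    # worst-case Tr(Q^T Q) over the L_p ball of Theorem 23 around Q_tr
    sv = np.linalg.svd(Q_tr, compute_uv=False)  # singular values of Q_tr
    ermi = np.sum(sv ** 2)                      # Tr(Q_tr^T Q_tr)
    if p == 1:          # (7.9): the budget is spent on the second singular value
        return ermi + 2 * eps * sv[1] + eps ** 2
    if p == 2:          # (7.10): an L2 shift of the whole spectrum
        return ermi + 2 * eps * np.sqrt(ermi) + eps ** 2
    if p == np.inf:     # (7.11): every singular value may shift by eps
        return ermi + 2 * eps * np.sum(np.abs(sv)) + eps ** 2
    raise ValueError("p must be 1, 2, or np.inf")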
7.3.4 Generalization of Dr. FERMI
A natural question about the effectiveness of the proposed DRO formulation for fair empirical risk minimization is whether we can guarantee the boundedness of the fairness violation by optimizing (Dr. FERMI). The following theorem shows that if λ and ϵ are chosen appropriately, then the fairness violation of the solution of (Dr. FERMI) can be reduced to any small positive value on the unseen test data under any distribution shift. This result is proven for the cross-entropy loss (covering logistic regression and neural networks with cross-entropy loss). Following the proof steps, the theorem can be extended to other loss functions that are bounded below (such as the hinge loss and the mean squared loss).
Theorem 25 Let ℓ(·, ·) in (Dr. FERMI) be the cross-entropy loss. For any shift in the distribution of data from
the source to the target domain, and for any given γ > 0, there exists a choice of λ, ϵ > 0 such that the solution
to (Dr. FERMI) has a demographic parity violation less than or equal to γ on the target domain (unseen test
data).
Proof: The proof is deferred to Appendix 10.5.2. □
Significance and Limitations of Theorem 25: The obtained result is based on two crucial properties of formulation (7.7). First, bounding the ERMI between sensitive attributes and predictions results in bounding the demographic parity violation. Second, the uncertainty sets characterized by Lp balls around the matrix Q_θ^tr can be chosen such that the actual distribution shift lies within the uncertainty ball. In that sense, the proposed framework is less restrictive than the aforementioned related works that only focus on a specific type of shift, such as demographic, covariate, or label shift. Note that Theorem 25 only provides a generalization guarantee on the unseen data for the model's fairness; it does not guarantee any level of accuracy on the test data. An open question is whether optimizing the loss function in (7.10) leads to a solution with both fairness and accuracy guarantees on the unseen data. Such a problem is challenging even for the generalizability of non-convex trained models when no fairness concerns are considered. Further, the satisfaction of Q_θ^test ∈ B(Q_θ^tr, ϵ) at each iteration, a necessary condition, depends on the choice of ϵ. A large distance between Q_θ^test and Q_θ^tr requires a larger ϵ, and therefore leads to poorer accuracy of the trained model as the price of promoting fairness.
7.4 Algorithms for Solving Dr. FERMI
This section utilizes Theorem 23 to first develop efficient deterministic algorithms for solving (Dr. FERMI).
Building on that, we will then develop stochastic (mini-batch) algorithms.
7.4.1 Deterministic (Full-Batch) Algorithm
Let us first start by developing an algorithm for the case of p = 1, as in (7.9). Notice that
\[
\sigma_2(Q_\theta) = \max_{v \perp v_1,\; \|v\|_2 \le 1} \sqrt{v^{T} Q_\theta^{T} Q_\theta\, v}, \tag{7.12}
\]
where v_1 = [√P(S = s_1), . . . , √P(S = s_k)]^T; see also [23, Equation (6)]. Further, as described in [211, Proposition 1], we have
\[
\hat{\rho}_{\mathrm{ERMI}}(\hat{y}, s, y; P) = \max_{W} \; \Big\{ -\mathrm{Tr}\big(W \hat{P}_{\hat{y}} W^{T}\big) + 2\, \mathrm{Tr}\big(W \hat{P}_{\hat{y}, s} \hat{P}_s^{-1/2}\big) - 1 \Big\}, \tag{7.13}
\]
where ρ̂_ERMI is the ERMI correlation measure (7.5) defined on the training distribution P_tr. Here, P̂_{ŷ,s} is a probability matrix whose (i, j)-th entry equals P_tr(ŷ = i, s = j). Similarly, we define P̂_ŷ.
Combining (7.9) with (7.12) and (7.13) leads to a min-max reformulation of (Dr. FERMI) in the variables (W, v, θ); see Appendix 10.5.4 for details. This reformulation gives birth to Algorithm 13. At each iteration of this algorithm, we maximize w.r.t. the v and W variables, followed by a step of gradient descent w.r.t. the θ variable. One can establish the convergence of this algorithm following standard approaches in the literature; see Theorem 70 in Appendix 10.5.5. Note that P̂_ŷ and P̂_{ŷ,s} are functions of θ (through ŷ); thus, it is crucial to recompute them after each update of θ. However, as P̂_s does not depend on θ, it remains the same throughout the training procedure. Details on the gradient computations in Algorithm 13 are deferred to Appendix 10.5.4. Following similar steps, we developed algorithms for L2 and L∞ uncertainty balls using (7.10) and (7.11); see Appendix 10.5.7.

Algorithm 13 Deterministic Distributionally Robust FERMI under L1 Ball Uncertainty
1: Input: θ^0 ∈ R^{d_θ}, W^0 = 0, step-size η, fairness parameter λ ≥ 0, number of iterations T.
2: for t = 1, . . . , T do
3:   θ^t = θ^{t−1} − η [ (1/n) Σ_{i=1}^n ∇_θ ℓ(ŷ_{θ^{t−1}}(x_i), y_i) + λϵ ∇_θ √( v^T Q_{θ^{t−1}}^T Q_{θ^{t−1}} v ) + λ Tr( ∇_θ (Q_{θ^{t−1}}^T Q_{θ^{t−1}}) ) ]
4:   Set W^t = P̂_s^{−1/2} P̂_{ŷ,s}^T P̂_ŷ^{−1}
5:   Set v^t to the singular vector corresponding to the second largest singular value of Q_{θ^t}, obtained by performing SVD on Q_{θ^t}
6:   Update P̂_ŷ and P̂_{ŷ,s} as functions of θ^t
7: Return: θ^T, W^T, v^T
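As a small illustration of the inner maximization steps, the following Python sketch (ours; a reading of steps 4-5 under our notation, not the released code) computes the closed-form W and the direction v from an empirical joint distribution of predictions and sensitive attributes:

import numpy as np

def inner_maximizers(p_joint):
    # p_joint is the (m x k) empirical joint P(yhat = i, s = j)
    p_yhat = p_joint.sum(axis=1)              # marginal of the predictions
    p_s = p_joint.sum(axis=0)                 # marginal of the sensitive attribute
    # step 4: W = P_s^{-1/2} P_{yhat,s}^T P_yhat^{-1}
    W = np.diag(p_s ** -0.5) @ p_joint.T @ np.diag(1.0 / p_yhat)
    # step 5: v is the right singular vector of Q for its second singular value
    Q = p_joint / np.sqrt(np.outer(p_yhat, p_s))
    _, _, Vt = np.linalg.svd(Q)
    return W, Vt[1]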
7.4.2 Stochastic (Mini-Batch) Algorithm
In large-scale learning problems, we can only use small batches of data points to update the parameters at each iteration. Thus, it is crucial to have stochastic/mini-batch algorithms. The convergence of such algorithms relies on the unbiasedness of the (mini-batch) gradient at each iteration. To develop a stochastic version of Dr. FERMI, one may naïvely follow the steps of Algorithm 13 (or a gradient descent-ascent version of it) using mini-batches of data. However, this heuristic does not converge (and also leads to statistically biased measures of fairness). It is known that for the stochastic gradient descent algorithm to converge, the update direction must be a statistically unbiased estimator of the actual gradient of the objective function [257, 237]; thus, the naïve heuristic fails to converge and may even lead to unfair trained models. The following lemma rewrites the objective in (7.10) so that it becomes a summation/expectation over the training data and hence admits a statistically unbiased estimator of the gradient for stochastic first-order algorithms.
Lemma 26 Let B(Q_θ^tr, ϵ) = {Q : ‖Q − Q_θ^tr‖_2 ≤ ϵ}. Then (Dr. FERMI) is equivalent to:
\[
\min_{\alpha > 0,\, \theta} \; \max_{W \in \mathbb{R}^{k \times d}} \; \frac{1}{n} \sum_{i=1}^{n} \Big[ \ell(z_i; \theta) + \lambda (1 + \epsilon \alpha)\, \psi(z_i; \theta, W) \Big] + \frac{\lambda \epsilon}{\alpha}, \tag{7.14}
\]
where ψ is a quadratic concave function in W defined as:
\[
\psi(z_i; \theta, W) := -\mathrm{Tr}\big(W\, \mathbb{E}[\hat{y}_\theta(x_i)\, \hat{y}_\theta(x_i)^{T} \mid x_i]\, W^{T}\big) + 2\, \mathrm{Tr}\big(W\, \mathbb{E}[\hat{y}_\theta(x_i)\, s_i^{T} \mid x_i, s_i]\, \hat{P}_s^{-1/2}\big). \tag{7.15}
\]
Proof: The proof is relegated to Appendix 10.5.3. □
Lemma 26 is key to the development of our convergent stochastic algorithms. It rewrites (Dr. FERMI) as an average over the n training data points by introducing the new optimization variables α and W. Consequently, the gradient of the new objective function evaluated on a randomly drawn batch of data points is an unbiased estimator of the full-batch gradient. Having an unbiased gradient estimator, we can apply the Stochastic Gradient Descent-Ascent (SGDA) algorithm to solve (7.14), yielding Algorithm 14. The convergence of Algorithm 14 and its proof can be found in Appendix 10.5.6.
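For classification models with a softmax output, the conditional expectations in (7.15) have simple closed forms: if ŷ_θ(x_i) is the randomized one-hot prediction with class probabilities F(x_i; θ), then E[ŷ_θ(x_i) ŷ_θ(x_i)^T | x_i] = diag(F(x_i; θ)) and E[ŷ_θ(x_i) s_i^T | x_i, s_i] = F(x_i; θ) e_{s_i}^T. Under this assumption, the following PyTorch sketch (ours; all names are hypothetical) computes an unbiased mini-batch estimate of the ψ term of (7.14):

import torch

def psi_batch(probs, s, p_s_invsqrt, W):
    # probs: (b, m) softmax outputs F(x_i; theta); s: (b,) integer sensitive labels
    # p_s_invsqrt: (k,) diagonal entries of P_s^{-1/2}; W: (k, m) dual variable
    # -Tr(W diag(F) W^T) = -sum_j F_j * ||W[:, j]||^2, per sample
    term1 = -(probs * (W ** 2).sum(dim=0)).sum(dim=1)
    # 2 Tr(W F e_s^T P_s^{-1/2}) = 2 * P_s^{-1/2}[s, s] * <W[s, :], F>, per sample
    term2 = 2 * p_s_invsqrt[s] * (W[s] * probs).sum(dim=1)
    return (term1 + term2).mean()   # unbiased estimate of (1/n) sum_i psi(z_i)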
7.5 Robustification of the Model Accuracy
In Section 7.4, we analyzed the DRO formulation where U_1(P_tr) is a singleton in (7.8). However, to make the accuracy of the model robust, we can consider various choices for U_1(P_tr). For example, we can consider the set of distributions with a bounded likelihood ratio to P_tr, which leads to the Conditional Value at Risk (CVaR) [284]. Using the dual reformulation of CVaR, (7.8) can be simplified to (see [190, Appendix A] and [301, Chapter 6]):
\[
\min_{\theta,\; \eta > 0} \; \frac{1}{\alpha}\, \mathbb{E}_{(x, y) \sim P_{\mathrm{tr}}}\Big[ \big[\ell(\hat{y}_\theta(x), y) - \eta\big]_{+} \Big] + \eta + \max_{P \in \mathcal{U}_2(P_{\mathrm{tr}})} \lambda\, \rho\big(\hat{y}_\theta(x), y, s; P\big), \tag{CVaR-Dr. FERMI}
\]
where [x]_+ = max(x, 0) is the projection onto the set of non-negative numbers and U_2(P_tr) is defined as in Section 7.3.2. All methods and algorithms in Section 7.4 can be applied to (CVaR-Dr. FERMI).
Compared to ERM, the CVaR formulation has one additional minimization parameter, η, which can be updated alongside θ by the (stochastic) gradient descent algorithm. Similarly, one can use the group DRO formulation for robustifying the accuracy [289]. Assume that G contains the groups of data (in the experiments, we consider each sensitive attribute level as one group). Then, (7.8) with group DRO as the measure of accuracy can be written as:
\[
\min_{\theta} \; \max_{g \in \mathcal{G}} \; \mathbb{E}_{(x, y) \sim P_g}\big[\ell(\hat{y}_\theta(x), y)\big] + \max_{P \in \mathcal{U}_2(P_{\mathrm{tr}})} \lambda\, \rho\big(\hat{y}_\theta(x), y, s; P\big). \tag{Group-Dr. FERMI}
\]
Different variants of Algorithm 13 for optimizing (CVaR-Dr. FERMI) and (Group-Dr. FERMI) are presented in Appendix 10.5.8. These algorithms are evaluated in the next section.
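As a small illustration of these two accuracy terms, the following PyTorch sketch (our own minimal rendering of the formulas above, not the full training loop) computes the CVaR and group-DRO objectives from per-sample losses:

import torch
import torch.nn.functional as F

def cvar_term(losses, eta, alpha):
    # (1/alpha) * E[loss - eta]_+ + eta, with eta an extra trainable scalar
    return F.relu(losses - eta).mean() / alpha + eta

def group_dro_term(losses, groups, num_groups):
    # worst average loss over the groups present in the batch
    per_group = [losses[groups == g].mean() for g in range(num_groups)
                 if (groups == g).any()]
    return torch.stack(per_group).max()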
Algorithm 14 Stochastic Distributionally Robust FERMI under L2 Ball Uncertainty
1: Input: θ^0, W^0, α^0, step-size η, fairness parameter λ ≥ 0, number of iterations T, uncertainty ball radius ϵ.
2: for t = 1, . . . , T do
3:   Take a mini-batch of data B_t ⊆ {1, . . . , N} uniformly at random
4:   θ^t = θ^{t−1} − η [ Σ_{z ∈ B_t} ∇_θ ℓ(z; θ^{t−1}) + Σ_{z ∈ B_t} ∇_θ ψ(z; θ^{t−1}, W^{t−1}) ]
5:   α^t = α^{t−1} − ηλϵ [ Σ_{z ∈ B_t} ψ(z; θ^{t−1}, W^{t−1}) − 1/(α^{t−1})^2 ]
6:   W^t = Π_W [ W^{t−1} + ηλ(1 + ϵα^{t−1}) Σ_{z ∈ B_t} ∇_W ψ(z; θ^{t−1}, W^{t−1}) ]
7: Return: θ^T
7.6 Numerical Experiments
We evaluate the effectiveness of our proposed methods through several experiments involving datasets with distribution shifts. We use two well-known group fairness measures, the demographic parity violation (DPV) and the equality of opportunity violation (EOV) [141], defined as
\[
\mathrm{DPV} = \big| \mathbb{P}(\hat{y} = 1 \mid s = 1) - \mathbb{P}(\hat{y} = 1 \mid s = 0) \big|
\quad \text{and} \quad
\mathrm{EOV} = \big| \mathbb{P}(\hat{y} = 1 \mid s = 1, y = 1) - \mathbb{P}(\hat{y} = 1 \mid s = 0, y = 1) \big|.
\]
The details of our hyperparameter tuning are provided in Appendix 10.5.9. All implementations for the experiments are available at https://github.com/optimization-for-data-driven-science/DR-FERMI.
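A small Python sketch of how these two measures can be computed from binary predictions (our own helper, not part of the released code):

import numpy as np

def dpv_eov(y_pred, s, y):
    # y_pred, s, y: 0/1 numpy arrays of predictions, sensitive attributes, labels
    dpv = abs(y_pred[s == 1].mean() - y_pred[s == 0].mean())
    pos = (y == 1)
    eov = abs(y_pred[(s == 1) & pos].mean() - y_pred[(s == 0) & pos].mean())
    return dpv, eov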
7.6.1 Modifying Benchmark Datasets to Contain Hybrid Distribution Shifts
Standard benchmark datasets in algorithmic fairness, such as Adult, German Credit, and COMPAS [104], include test and training data that follow the same distribution. To impose distribution shifts, we choose a subset of the test data to generate datasets containing demographic and label shifts. A demographic shift is a change in the population of the different sensitive sub-populations between the source and target distributions, i.e., P̂_s(s) ≠ P*_s(s). Similarly, a label shift means P̂_y(y) ≠ P*_y(y). The Adult training (and test) data has the following characteristics:
\[
\hat{P}_s(s = \text{`Woman'}) = 33.07\% \quad \text{and} \quad \hat{P}_s(s = \text{`Woman'} \mid y = \text{`High Income'}) = 15.03\%.
\]
We generate two different test datasets containing distribution shifts, in which the probability P̂_s(s = `Woman' | y = `High Income') is 10% and 20%, respectively (by undersampling and oversampling high-income women; see the sketch below). We train the model on the original dataset and evaluate the performance and fairness (in terms of demographic parity violation) on the two newly generated datasets in Figure 7.1.
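The following pandas sketch illustrates one way to build such a shifted test set (our own illustration; the column names "sex" and "income" are placeholders for the corresponding Adult fields):

import pandas as pd

def shift_conditional(df, target_rate):
    # resample so that P(s = 'Woman' | y = 'High Income') equals target_rate
    high = df[df["income"] == ">50K"]
    women = high[high["sex"] == "Female"]
    men = high[high["sex"] == "Male"]
    # number of high-income women needed for the target conditional rate
    n_women = int(target_rate / (1 - target_rate) * len(men))
    women = women.sample(n=n_women, replace=n_women > len(women), random_state=0)
    return pd.concat([df[df["income"] != ">50K"], men, women])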
Figure 7.1: Performance of the trained fair models on test datasets that have different distributions than the training
data. In the left figure, we undersampled the high-income minority group (women), while we oversampled in the
right figure. The proposed methods (Dr. FERMI and CVaR-Dr. FERMI) generally outperform other benchmarks.
In both figures, either Dr. FERMI or CVaR-Dr. FERMI reaches lower demographic parity violations at any given accuracy level. The red dashed line represents the naïve baseline in which the model outputs zero with probability p; increasing p makes the model fairer at the cost of lower accuracy.
7.6.2 Robustness Against Distribution Shifts in Real Datasets
While the original Adult dataset [104] has no considerable distribution shift, the relatively new ACS-Income dataset [99], designed based on US census records, contains natural distribution shifts, since one can choose different US states as source and target datasets. The underlying task is to predict whether a given person is high-income or low-income. The sensitive attributes are gender and race. Using this data, we perform experiments on binary and non-binary sensitive attribute cases. In the binary setting, we only consider gender as the sensitive attribute (the dataset records only two genders). In the non-binary case, we have four different sensitive groups: White-Male, White-Female, Black-Male, and Black-Female (a combination of race and gender). One can observe that P_s(s) and P_{y,s}(y, s) fluctuate considerably across different states. Thus, these datasets, as mentioned in [99], naturally contain distribution shifts with respect to different states.
Figure 7.2: Learning fair logistic regression models on four different states as the source (in-distribution) domain
and evaluating them on all 50 states of the US.
In Figure 7.2, we apply the Hardt post-processing approach [141], the Zemel pre-processing method [364], FERMI [211], the robust log-loss method under covariate shift [278], curvature matching with MMD [341], and our distributionally robust method under the L2 norm with three accuracy variations (Dr. FERMI, CVaR-Dr. FERMI, and Group-Dr. FERMI) to the new adult (ACS-Income) dataset [99]. The methods are trained on a single state (California, Texas, Utah, or Virginia) and evaluated/tested on all 50 states in terms of prediction
accuracy and fairness violation under the equality of opportunity notion. We chose California and Texas
in accordance with other papers in the literature as two datasets with a large number of samples. Further,
we chose Utah and Virginia as the two states with the highest and lowest initial equality of opportunity
violations.
Each method's horizontal and vertical extent in Figure 7.2 denote the 25th and 75th percentiles of accuracy and fairness violation across the 50 states, respectively. Thus, a wider line means less robustness to the distribution shift. Ideally, we prefer models whose corresponding curves lie on the upper-left side of the plot. Figure 7.2 shows that Dr. FERMI with the L2 uncertainty ball achieves a more robust fair performance than the other approaches across the 50 states when the training (in-distribution) dataset is any of the four aforementioned states. When maintaining accuracy is a priority, one can use CVaR or group DRO as the objective function for the accuracy part instead of ERM. As a note, we observe that learning the model on a dataset with a larger initial fairness gap (Utah) leads to better generalization in terms of fairness violation, while learning on a dataset with a smaller fairness gap (Virginia) leads to poorer fairness in the test phase.
7.6.3 Handling Distribution Shift in Non-Binary Case
We run different versions of Dr. FERMI alongside [223], [23], [75], and [211] as baselines supporting multiple sensitive attributes. The algorithms are executed on the ACS PUMS [99] dataset with gender and race as the sensitive attributes. The accuracy and equality of opportunity violation are calculated on 50 different datasets; their means and standard deviations are reported in Table 7.1. The training dataset (source domain) is California. Dr. FERMI under the L2 ball has the best performance in terms of average fairness violation across the 50 states. If a more balanced tradeoff between accuracy and fairness is required, one can use Group-Dr. FERMI or CVaR-Dr. FERMI. Note that, in all cases, the training equality of opportunity violation is set to 0.02 for all methods in the table.
Method | Train Accuracy | Test Accuracy | Test EO Violation
Mary et al., 2019 | 71.41 ± 1.61% | 68.35 ± 1.69% | 0.1132 ± 0.023
Cho et al., 2020 | 71.84 ± 1.55% | 68.91 ± 1.57% | 0.1347 ± 0.027
Baharlouei et al., 2020 | 72.77 ± 1.37% | 69.44 ± 1.39% | 0.0652 ± 0.013
Lowy et al., 2022 | 73.81 ± 0.98% | 70.22 ± 1.01% | 0.0512 ± 0.009
Dr. FERMI-L1 | 73.45 ± 1.02% | 70.09 ± 1.06% | 0.0392 ± 0.004
Dr. FERMI-L2 | 73.12 ± 1.04% | 69.71 ± 1.11% | 0.0346 ± 0.005
Dr. FERMI-L∞ | 73.57 ± 0.96% | 69.88 ± 0.99% | 0.0359 ± 0.008
CVaR-Dr. FERMI-L1 | 74.21 ± 0.24% | 70.94 ± 0.28% | 0.0471 ± 0.011
CVaR-Dr. FERMI-L2 | 73.84 ± 0.28% | 70.26 ± 0.31% | 0.0429 ± 0.008
CVaR-Dr. FERMI-L∞ | 73.92 ± 0.31% | 70.45 ± 0.37% | 0.0466 ± 0.013

Table 7.1: Train and test accuracy and fairness violation of different methods on the ACS PUMS dataset. The reported numbers are the average and standard deviation of the values across the 50 states.
Evaluation of Stochastic Dr. FERMI: We evaluate Algorithm 14 for different batch sizes. To this end, we learn fair models with different batch sizes on the Adult dataset, where the minority group is under-represented. We compare Algorithm 14 to [211], [23], [278], [223], and [75] as baselines supporting stochastic updates. Since the gradient estimator in Algorithm 14 is unbiased, its performance remains consistently high across batch sizes. In contrast, as the batch size is reduced, the other methods (except [211]) fail to keep their full-batch performance. Compared to [211], our approach achieves better generalization in terms of the fairness-accuracy tradeoff across batch sizes.
Figure 7.3: Performance of different methods regarding fairness and accuracy for small batch sizes of 2 and 8. Our
robust formulation works consistently well even for small batch sizes due to the unbiasedness of the gradient, while
other methods are more sensitive to smaller batch sizes.
An important advantage of Dr. FERMI is its compatibility with different batch sizes: even with a batch size as small as 2, it still performs impressively compared to the other baselines. For instance, in the experiment depicted in Figure 7.3, the full-batch implementation needs more than 8 GB of memory, whereas with a batch size of 8 the memory consumption drops below 1 GB. The exact running time and memory consumption of the methods in Figure 7.3 that support small-batch implementations are presented in Table 7.2.
Method | Memory Consumption | Training Time
CVaR-Dr. FERMI | < 800 MB | 783 (s)
Dr. FERMI | < 800 MB | 692 (s)
Lowy et al., 2022 | < 800 MB | 641 (s)
Cho et al., 2021 | 1.21 GB | 3641 (s)
Rezaei et al., 2021 | 1.59 GB | 3095 (s)
Baharlouei et al., 2020 | < 900 MB | 2150 (s)
Mary et al., 2019 | < 1.43 GB | 1702 (s)

Table 7.2: Memory consumption and computation time of the different approaches in Figure 7.3. The reported values are for a batch size of 8. The required memory for the full-batch algorithm is more than 4 GB, and the required time is nearly the same (since the number of epochs is the same). Note that while other approaches lose performance with smaller batch sizes, Dr. FERMI and CVaR-Dr. FERMI preserve their performance even for batch sizes as small as 1.
7.7 Limitations
While distributionally robust optimization (DRO) for vanilla ERM has been thoroughly analyzed in the literature, studies of DRO for fair empirical risk minimization are limited to linear classifiers [278, 325]. A major complication of the distributionally robust formulation for fair ERM is the non-convex non-concave structure of the resulting problem in the general case. Further, the available convergent stochastic algorithms for vanilla ERM do not converge when fairness regularizers are involved. This chapter addresses fair empirical risk minimization in the presence of distribution shift by handling the shift of the target domain for the fairness and accuracy terms separately. Our choice of fairness violation measure (ERMI) and uncertainty balls (Lp norms) leads to convergent, scalable algorithms with superior fairness-accuracy tradeoffs compared to existing methods in the literature.

As shown in Theorem 25, we can bound the fairness violation on the target domain with a proper choice of hyperparameters. However, the proposed framework has several limitations. First, the defined uncertainty set for the fairness term is limited to Lp-norm balls (in certain variables) and to the Exponential Rényi Mutual Information (ERMI) as the measure of fairness. Second, the presented guarantee covers only the fairness violation and offers no guarantee on the fairness-accuracy tradeoff. Moreover, the guarantees are for globally optimal solutions, which may not be computable due to the non-convexity of the problem.
7.8 Broader Impacts
The emergence of large-scale machine learning models and artificial intelligence tools has brought about a pressing need to ensure the deployment of safe and reliable models. Now, more than ever, addressing the challenges posed by distribution shifts and striving for fairness in machine learning is crucial. This chapter presents a comprehensive framework for deploying large-scale fair machine learning models, offering provably convergent stochastic algorithms that come with statistical guarantees on the fairness of the models in the target domain.

While our proposed algorithms may not provide an all-encompassing solution for every application, we believe that sharing our research findings with the broader community will positively impact society. By publishing and disseminating our work, we hope to contribute to the ongoing discourse on fairness in machine learning and to foster a collaborative environment for further advancements in this field. Moreover, the significance of our algorithms extends beyond the context of fairness: they can potentially be harnessed in a wide range of applications to enforce statistical independence between random variables in the presence of distribution shifts. This versatility underscores the potential impact of our research and its relevance across various domains.
Chapter 8
f-FERM: A Scalable Framework for Robust Fair Empirical Risk
Minimization
This chapter is based on a paper published in ICLR 2024 with the same title [25].
Training and deploying machine learning models that meet fairness criteria for protected groups is fundamental in modern artificial intelligence. While numerous constraints and regularization terms have been proposed in the literature to promote fairness in machine learning tasks, most of these methods are not amenable to stochastic optimization due to the complex and nonlinear structure of the constraints and regularizers. Here, the term “stochastic” refers to the ability of the algorithm to work with small mini-batches of data. Motivated by this limitation of the existing literature, this chapter presents a unified stochastic optimization framework for fair empirical risk minimization based on f-divergence measures (f-FERM). The proposed stochastic algorithm enjoys theoretical convergence guarantees. In addition, our experiments demonstrate the superiority of the fairness-accuracy tradeoffs offered by f-FERM for almost all batch sizes (ranging from full-batch to a batch size of one). Moreover, we show that our framework can be extended to the case where there is a distribution shift from the training to the test data. Our extension is based on a distributionally robust optimization reformulation of the f-FERM objective under ℓp norms as uncertainty sets. Again, in this distributionally robust setting, f-FERM not only enjoys theoretical convergence guarantees but also outperforms other baselines in the literature on tasks involving distribution shifts. An efficient stochastic implementation of f-FERM is publicly available at https://github.com/optimization-for-data-driven-science/f-FERM.
8.1 Introduction
Machine learning models are increasingly deployed in critical applications ranging from healthcare [7] to image processing [178], education to job recruitment [50], and social networking to cybersecurity [352]. Machine learning practitioners have adopted learning algorithms to tackle inherently difficult and crucial problems. However, naïve deployment of these models may lead to serious shortcomings, such as biased predictions against minority groups [16, 54], vulnerability to adversarial attacks [219, 62, 28], or lack of generalizability [18]. Consequently, it is of utmost importance to have reliable and trustworthy models that are, in particular, fair and comply with equality norms and provisions worldwide [3, 109].

With the increasing concern for the trustworthiness of unchecked machine learning algorithms, a broad class of paradigms has been proposed to counteract and mitigate both the causes and effects of model unreliability. Imposing statistical independence between the model output and particular input features is of interest in various domains, especially when the generalization of a trained model is based on a collection of spurious features present in the training dataset [105, 141, 356]. These could be sensitive features like gender, race, age, and/or income in the context of fairness, or confounding factors like environmental artifacts in the context of image classification [18]. The existing literature on imposing statistical independence between selected input features and model outputs is divided into three approaches: pre-processing, post-processing, and in-processing methods.

Pre-processing methods entail upstream changes made to datasets to mask sensitive features or reduce the dependency of output variables on sensitive features by transforming the data in a stage before the training phase [166, 364, 335]. Post-processing methods involve model-specific adjustments to the model's
output to ensure the independence of predictions and sensitive attributes [141, 10]. While pre-processing
and post-processing methods do not affect the training procedure, they fail to exploit underlying training
mechanisms for the best achievable accuracy-fairness tradeoffs. Unsurprisingly, optimizing accuracy and fairness jointly (in-processing) leads to better tradeoffs than sequentially optimizing fairness and
accuracy in a pre-processing or post-processing fashion.
In-processing methods alternatively add fairness constraints or regularizers, penalizing dependence between the sensitive attributes and the output variables. [362] utilizes covariance as the measure of independence between the sensitive attributes and the predictions. While such a measure is amenable to stochastic updates, it fails to capture correlations beyond linear ones. Alternatively, several non-linear measures, such as Rényi correlation [23], χ²-divergence [211], ℓ∞ distance [100], and maximum mean discrepancy (MMD) [259], have been proposed in the literature to establish the independence of the predictors and sensitive attributes. In-processing techniques can be model-specific [340, 6] or generalizable to different training algorithms [23, 211].
In the spirit of in-processing methods, input-data-driven constraints or regularization terms are used to modify the training objectives of problems such as learning models generalizable to new environments, invariant learning, and learning in the presence of distribution shifts [18, 222, 23]. Such reformulations are prevalent in learning robust classifiers against adversarial attacks [311], meta-learning [29], federated learning [96], and alternative learning paradigms such as distributionally robust optimization (DRO) [179, 190], tilted empirical risk minimization (TERM) [193], and the square-root Lasso [38].

While in-processing techniques outperform pre-processing and post-processing approaches, they are not scalable to large datasets because of a lack of adaptability to stochastic optimization [222, 211]. All aforementioned examples contain regularization terms in their objective functions whose gradients cannot be described as a linear combination of functions of individual data points. As a result, applying stochastic gradient descent or other stochastic first-order methods to the objective functions of such problems might not converge, especially for small batch sizes.
Motivated by this, [211] proposes a stochastic optimization framework with Exponential Rényi Mutual Information as the measure of independence. More recently, [375] uses f-divergences as regularization terms to establish the independence between sensitive attributes and predictions. They estimate the f-divergence regularizers offline through multi-layer neural networks to avoid the computational challenges of devising scalable stochastic methods for nonconvex min-max problems. Our approach, on the other hand, directly solves the variational formulation in both full-batch and stochastic settings, with convergence guarantees to non-spurious solutions. In Section 8.2, using the variational representation of f-divergences, we present a convergent stochastic optimization framework for fair learning via f-divergences. [211] is a special case of f-divergences where f(t) = t² − 1 (the χ² divergence). Aside from χ², to the best of our knowledge, none of the other divergences listed in Table 8.1 has been introduced in the literature for this purpose.
The need for convergent algorithms for fair empirical risk minimization does not end with designing methods amenable to stochastic approximation. Detection and mitigation of biases against protected groups in the presence of distribution shifts have been extensively studied in recent years. [184] theoretically shows that learning fair representations (pre-processing) is nearly impossible for popular notions of fairness, such as demographic parity, in the presence of distribution shift. [99], on the other hand, experimentally demonstrates that applying post-processing fairness techniques [141] to learn fair predictors of income with respect to race, gender, and age fails to transfer from one US state (training domain) to another. Overlooking distribution shifts can lead to catastrophic decisions threatening the well-being of human subjects, for example when a model trained on data from certain hospitals is deployed in other hospitals [297]. The current literature on handling distribution shifts with in-processing methods relies on certain assumptions on the type of distribution shift (demographic shift [112, 103, 220, 126], label shift [85], and/or covariate shift [278, 310]) or on explicit access to the causal graph [226, 297] of predictors, sensitive attributes, and target variables. As a result, these methods face practical limitations and cannot cope with most real-world problems involving complex shifts that do not fall into the categories assumed in those works.
Alternatively, [325] provides convex objective functions for imposing fairness on logistic regression using constrained optimization. [316] uses MMD to define uncertainty sets around the training distribution, whereas [153] uses the Integral Probability Metric (IPM) to mitigate the distribution shift. The main limitations of these approaches are their reliance on the convexity of the underlying learning model and their lack of scalability due to incompatibility with stochastic optimization algorithms.

[342] uses the Maximum Mean Discrepancy (MMD) distance between the spectral norms of the Hessian matrix at advantaged and disadvantaged data points. However, they do not provide convergence guarantees for their proposed algorithm to any notion of optimality. In addition, the method is not necessarily amenable to stochastic updates. While we naturally define the uncertainty set directly on the joint distribution of sensitive attributes and predictions, they use the curvature of the obtained solution, quantified by the norm of the Hessian matrix, as a heuristic for promoting the robustness of the fair solution.
Contributions: This chapter establishes a scalable (stochastic) fair empirical risk minimization framework through regularization via f-divergences (f-FERM), for both the standard setting and the distribution shift setting. f-FERM presents a unified methodology based on the Legendre-Fenchel transformation, enabling us to develop theoretically convergent first-order stochastic algorithms when only small batches of data are available at each iteration. Further, we present the first distributionally robust optimization framework under ℓp-norm uncertainty sets that covers nonconvex losses such as neural networks. The presented framework for fair inference in the presence of distribution shifts does not rely on a causal graph describing the causal interaction of input features, sensitive attributes, and target variables, which is rarely available in practical problems.
Chapter Organization: We structure our treatment of designing scalable, robust, and fair algorithms into two sections. Section 8.2 motivates the design of unbiased gradient estimators for objectives with information-theoretic f-divergence regularizers. In Section 8.3, we present our approach for fair inference in the presence of distribution shift in detail.

Our experiments provide an extensive examination of various f-divergences and their suitability as regularizers, and also show the consistency of our method across all batch sizes, in contrast to existing benchmarks. Similar experiments are carried out for robust training under varying amounts of distribution shift in the data.
8.2 Fair Empirical Risk Minimization via f-divergences
A widely studied problem in algorithmic fairness is promoting a notion of group fairness, such as demographic parity, equalized odds, equality of opportunity, or sufficiency, through an in-processing method. For these notions, we aim to establish a [conditional] statistical independence between the predictions (e.g., the creditworthiness of the individual) and the sensitive attributes (e.g., gender, race). For simplicity of presentation, we formulate all problems under the demographic parity notion, which requires statistical independence between the prediction and the sensitive attribute. Without loss of generality, all formulations and methods generalize to the other aforementioned notions of group fairness by considering conditional random variables (see Appendix 10.6.1). A popular in-processing approach for training fair (classification) models under the demographic parity notion is to regularize the empirical risk minimization:
\[
\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell(\hat{y}_\theta(x_i), y_i) + \lambda\, \mathcal{D}\big( \mathbb{P}(\hat{y}_\theta(x), s),\; \mathbb{P}(\hat{y}_\theta(x)) \otimes \mathbb{P}(s) \big), \tag{8.1}
\]
where θ denotes the learning parameters (e.g., the weights of the neural network); x_i ∈ R^d is the i-th input feature vector; y_i is the actual label/class of sample i; ŷ_θ(x_i) is the prediction of the model for sample i; and ℓ(ŷ_θ(x_i), y_i) is the loss function measuring the “goodness-of-fit” for sample i. Here, D is a divergence
between the joint probability distribution of the predictions and sensitive attributes and the Kronecker product of their marginal distributions. Recall that ŷ_θ and s are statistically independent iff P(ŷ_θ(x), s) equals P(ŷ_θ(x)) ⊗ P(s). Thus, the second term in (8.1) is zero iff ŷ_θ and s are statistically independent (complete fairness under the demographic parity notion).
This section studies fair empirical risk minimization regularized by a broad class of f-divergence measures. Let P and Q be two discrete probability measures taking values in P = {1, . . . , m}. The f-divergence between P and Q is defined as [258, Def 4.9] (see Appendix 10.6.2 for the general continuous case):
\[
D_f(P, Q) = \sum_{j=1}^{m} Q_j\, f\Big( \frac{P_j}{Q_j} \Big). \tag{8.2}
\]
The above definition, which (applied to a joint distribution and the product of its marginals) is also known as f-mutual information [214, 82], covers many known divergence measures used for imposing fairness, such as the KL-divergence for the choice f(t) = t log(t) [306], or the χ² divergence when f(t) = (t − 1)² [211]. As shown in Appendix 10.6.3, D_f in (8.1) is zero if and only if s and ŷ_θ are statistically independent for the choices of f listed in Table 8.1. In addition, we prove that these f-divergences either cover or upper-bound popular notions of fairness violation in the literature, such as ℓp distances, Rényi correlation [23], and demographic parity (equalized odds) violation. This means that by minimizing these regularizers, we minimize an upper bound on (other) popular fairness violation measures and thus control them implicitly. Further, unlike Rényi correlation [23, 132], we can utilize Legendre-Fenchel duality (and the variational representation) to develop (provably) convergent algorithms with stochastic (mini-batch) updates. This formulation and the resulting stochastic optimization algorithm are described in the next subsection.
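As a concrete illustration of (8.2) applied to a joint distribution and the product of its marginals, the following Python sketch (ours) evaluates two of the divergences from Table 8.1 on a toy joint distribution:

import numpy as np

def f_mutual_information(p_joint, f):
    # D_f( P(yhat, s) || P(yhat) x P(s) ), as in (8.2), over the m*k outcomes
    q = np.outer(p_joint.sum(axis=1), p_joint.sum(axis=0))  # product of marginals
    return np.sum(q * f(p_joint / q))

p = np.array([[0.30, 0.10],
              [0.20, 0.40]])
kl = f_mutual_information(p, lambda t: t * np.log(t))      # f(t) = t log t
chi2 = f_mutual_information(p, lambda t: (t - 1.0) ** 2)   # f(t) = (t - 1)^2
print(kl, chi2)   # both vanish iff yhat and s are independent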
8.2.1 A Convergent Stochastic Algorithm for fair ERM via f-Divergences
Let us start by rewriting (8.1) using f-divergences as the divergence measure:
\[
\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell(\hat{y}_\theta(x_i), y_i) + \lambda \sum_{\substack{j \in \mathcal{Y} \\ k \in \mathcal{S}}} \mathbb{P}_s(s = k)\, \mathbb{P}_{\hat{y}_\theta}(\hat{y}_\theta = j)\, f\bigg( \frac{\mathbb{P}_{\hat{y}_\theta, s}(\hat{y}_\theta = j, s = k)}{\mathbb{P}_{\hat{y}_\theta}(\hat{y}_\theta = j)\, \mathbb{P}_s(s = k)} \bigg) \tag{f-FERM}
\]
While the non-linearity of f-divergences in (f-FERM) empowers the underlying model to capture more complex dependencies between sensitive attributes and predictions compared to linear measures [362], the objective function can no longer be represented as a summation of functions over input data points. Consequently, one cannot directly apply the stochastic gradient descent method (or its variations, such as Adam) to the objective function in (f-FERM). In particular, directly evaluating the gradient of the objective function of (f-FERM) on a mini-batch of data leads to a statistically biased estimate of the gradient of the entire objective. Such statistical biases prevent the convergence of algorithms such as SGD (even in a strongly convex minimization landscape) [8, 69], let alone the more complex objectives arising in modern-day neural networks.

To derive stochastic algorithms, one can use the variational forms of f-divergences to delineate them as a pointwise supremum of affine transformations over probability densities. The most commonly used and well-behaved transform is the Legendre-Fenchel transform (often called the convex conjugate), which linearizes the dependence of the objective function on input data points using a variational reformulation. In particular, we can rewrite (f-FERM) using the following result:
Proposition 27 Let f(·) be a convex function. Then, (f-FERM) can be reformulated as:
\[
\min_{\theta} \; \max_{A} \; \sum_{i=1}^{n} \ell(\hat{y}_\theta(x_i), y_i) + \lambda \sum_{\substack{j \in \mathcal{Y} \\ k \in \mathcal{S}}} \Big[ A_{jk}\, \mathbb{P}_{\hat{y}, s}(\hat{y}_\theta = j, s = k) - f^{*}(A_{jk})\, \mathbb{P}_{\hat{y}}(\hat{y}_\theta = j)\, \mathbb{P}_s(s = k) \Big], \tag{8.3}
\]
where f^*(z) = sup_{w ∈ dom(f)} { w^T z − f(w) } is the Legendre-Fenchel transformation of the function f.

Proof: The proof is standard and appears in Appendix 10.6.4. □
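As a concrete example of this transformation, consider the KL divergence, f(t) = t log t. A short calculation (consistent with the KL row of Table 8.1) gives
\[
f^{*}(z) = \sup_{t > 0} \{ z t - t \log t \}, \qquad \frac{d}{dt}\big( z t - t \log t \big) = z - \log t - 1 = 0 \;\Longrightarrow\; t = e^{z - 1},
\]
so f*(z) = e^{z−1}(z − (z − 1)) = e^{z−1}, which is exactly the e^{A_{jk}−1} term appearing in the KL row of Table 8.1.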
In order to solve (8.3), we will use (stochastic) first-order methods. Notice that P_s(s = k) is constant throughout the optimization procedure and is computed once by counting the number of data points whose sensitive attribute takes the value k: π_k := P_s(s = k) = (1/n) Σ_{i=1}^n 1(s_i = k). Assume we use a softmax layer to compute the probabilities of the different classes in our classification task (as is standard in logistic regression or in neural networks for classification). Let F_j(x_i; θ) be the j-th entry of the softmax layer output for data point x_i, predicting the probability of class j. Then it is easy to show that we can obtain unbiased estimators of P_{ŷθ}(ŷθ = j) and P_{ŷθ,s}(ŷθ = j, s = k) using an i.i.d. mini-batch B of data points. More precisely, we have
\[
\mathbb{P}_{\hat{y}_\theta}(\hat{y}_\theta = j) = \frac{1}{n} \sum_{i=1}^{n} F_j(x_i; \theta) = \mathbb{E}\bigg[ \underbrace{\frac{1}{|B|} \sum_{i \in B} F_j(x_i; \theta)}_{\hat{P}_{\hat{y}_\theta}(j;\, B)} \bigg],
\qquad
\mathbb{P}_{\hat{y}_\theta, s}(\hat{y}_\theta = j, s = k) = \frac{1}{n} \sum_{i=1}^{n} F_j(x_i; \theta)\, \mathbb{1}(s_i = k) = \mathbb{E}\bigg[ \underbrace{\frac{1}{|B|} \sum_{i \in B} F_j(x_i; \theta)\, \mathbb{1}(s_i = k)}_{\hat{P}_{\hat{y}_\theta, s}(j, k;\, B)} \bigg]. \tag{8.4}
\]
As a result, Problem (8.3) can be written as a linearly separable function of the input data points x_i:
\[
\min_{\theta} \; \max_{A} \; \frac{1}{n} \sum_{i=1}^{n} \bigg\{ \ell(\hat{y}_\theta(x_i), y_i) + \lambda \sum_{\substack{j \in \mathcal{Y} \\ k \in \mathcal{S}}} \Big[ A_{jk}\, F_j(x_i; \theta)\, \mathbb{1}(s_i = k) - f^{*}(A_{jk})\, \pi_k\, F_j(x_i; \theta) \Big] \bigg\}. \tag{8.5}
\]
Thus, evaluating the gradient of the objective function w.r.t. the variables θ and A over a random
batch of data points leads to an unbiased estimator of the gradient of the objective function.
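In code, these estimators amount to averaging softmax outputs over the mini-batch. A PyTorch sketch (ours; names are hypothetical):

import torch
import torch.nn.functional as F

def batch_prob_estimates(probs, s, num_groups):
    # probs: (b, m) rows F(x_i; theta); s: (b,) integer sensitive labels
    p_yhat = probs.mean(dim=0)                     # P_hat_{yhat}(j; B), shape (m,)
    onehot = F.one_hot(s, num_groups).float()      # (b, k)
    p_joint = probs.T @ onehot / probs.shape[0]    # P_hat_{yhat,s}(j, k; B), (m, k)
    return p_yhat, p_joint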
In addition to providing an unbiased estimator of gradients, the reformulation (8.5) has another crucial property: the objective function is concave in A. Therefore, optimization problem (8.5) falls into the category of nonconvex-concave min-max optimization problems; that is, the objective is (possibly) nonconvex in θ and is concave in A. Thus, we can borrow tools from the (stochastic) nonconvex-concave min-max optimization literature [203, 269, 192] to derive a convergent first-order stochastic algorithm, presented in Algorithm 15. We list the closed forms of f(·) and f*(·) for several widely used f-divergence measures, including the KL-divergence, reverse KL-divergence, χ²-divergence, squared Hellinger distance, Jensen-Shannon divergence, and total variation distance, in Table 8.1. For the derivations, see Appendix 10.6.5.

Theorem 28 (Informal Statement) Assume that ℓ(·, ·) and F_j(·; θ) are Lipschitz continuous for any given j and θ, and that their gradients are L-Lipschitz. Further, assume that P(s = k) > 0 for all protected groups and that P(ŷ_θ = j) > 0 at every iteration for all labels j. Then, for any given batch size 1 ≤ |B| ≤ n, Algorithm 15 finds an ϵ-stationary solution of (f-FERM) in O(1/ϵ^8) iterations for any given ϵ > 0.
Algorithm 15 Stochastic Gradient Descent-Ascent (SGDA) for f-FERM
1: Input: θ^0 ∈ R^{d_θ}, step-sizes η_θ, η_α, fairness parameter λ ≥ 0, number of iterations T, batch size b
2: for t = 1, . . . , T do
3:   Sample a mini-batch of data B_t = {(x_{t1}, y_{t1}), . . . , (x_{tb}, y_{tb})}
4:   θ^t = θ^{t−1} − (η_θ / b) Σ ∇_θ ℓ(ŷ_θ(x), y) − η_θ λ ∇_θ Σ_{j,k} [ A_{jk}^{t−1} P̂_{ŷθ,s}(j, k; B_t) − π_k f*(A_{jk}^{t−1}) P̂_{ŷθ}(j; B_t) ]
5:   A_{jk}^t = A_{jk}^{t−1} + η_α ∇_A [ A_{jk}^{t−1} P̂_{ŷθ,s}(j, k; B_t) − π_k f*(A_{jk}^{t−1}) P̂_{ŷθ}(j; B_t) ]
6: Return: θ^T
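To make Algorithm 15 concrete, the following PyTorch sketch (our own minimal rendering, not the released implementation) performs one SGDA iteration for the χ² divergence, whose conjugate is f*(z) = z + z²/4 (consistent with the χ² row of Table 8.1 below):

import torch
import torch.nn.functional as F

def sgda_step(model, opt_theta, A, x, y, s, pi, lam, eta_a):
    # model maps inputs to logits; pi: (k,) tensor with pi_k = P(s = k);
    # A: (m, k) dual variable with requires_grad=False, updated manually
    def estimates(logits):
        probs = torch.softmax(logits, dim=1)              # F(x_i; theta), (b, m)
        onehot = F.one_hot(s, len(pi)).float()            # (b, k)
        return probs.mean(0), probs.T @ onehot / len(x)   # the (8.4) estimators

    logits = model(x)
    p_yhat, p_joint = estimates(logits)
    reg = (A * p_joint - pi * (A + A ** 2 / 4) * p_yhat[:, None]).sum()
    loss = F.cross_entropy(logits, y) + lam * reg
    opt_theta.zero_grad()
    loss.backward()
    opt_theta.step()                                      # descent step in theta
    with torch.no_grad():                                 # ascent step in A
        p_yhat, p_joint = estimates(model(x))
        A += eta_a * (p_joint - pi * (1 + A / 2) * p_yhat[:, None])
    return loss.item()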
Table 8.1: Unbiased Estimators for f-divergence Regularizers

Divergence | f(t) | Term r_jk inside the regularizer λ Σ_{j,k} r_jk in (8.5)
χ² | (t − 1)² | π_k [ A_jk P_{ŷθ|s_k} − (A_jk + A_jk²/4) P_ŷθ ]
Reverse KL | − ln t | π_k [ A_jk P_{ŷθ|s_k} + (1 + ln(−A_jk)) P_ŷθ ]
Total Variation | (1/2) |t − 1| | π_k A_jk [ P_{ŷθ|s_k} − P_ŷθ ] 1{|A_jk| < 1/2}
KL | t ln t | π_k [ A_jk P_{ŷθ|s_k} − e^{A_jk − 1} P_ŷθ ]
Jensen-Shannon | −(t + 1) ln((t + 1)/2) + t ln t | π_k [ A_jk P_{ŷθ|s_k} + ln(2 − e^{A_jk}) P_ŷθ ]
Squared Hellinger | (√t − 1)² | π_k [ A_jk P_{ŷθ|s_k} + (A_jk^{−1} + 2) P_ŷθ ]
Proof: The formal statement and proof are relegated to Appendix 10.6.6. □

Theorem 28 applies to all f-divergences listed in Table 8.1 for all batch sizes (even a batch size of 1). More sophisticated algorithms can be used to obtain an O(ϵ^{−6}) iteration complexity [264, 372]; however, such algorithms use nested loops and require more hyperparameter tuning. We provide an example of such an algorithm in Appendix 10.6.7. If the f-divergence leads to a strongly concave function in A or satisfies the Polyak-Łojasiewicz condition (e.g., the χ² divergence), a faster rate of O(ϵ^{−5}) can be obtained for this algorithm (Appendix 10.6.6). In addition, if a larger batch size of O(ϵ^{−2}) is used, we can further improve this rate to an O(ϵ^{−4}) iteration complexity (see Appendix 10.6.6). Finally, when the full batch is used, double/triple-loop algorithms can achieve iteration complexity bounds of O(ϵ^{−2}) in the nonconvex-strongly-concave setting and O(ϵ^{−3}) in the general nonconvex-concave setting; see [176, 244, 248, 326].
8.3 Robust f-FERM in the Presence of Distribution Shifts
In the previous section, we assumed that the training and test domains have the same distribution. However, this assumption does not necessarily hold in certain applications [112]. In particular, a model that behaves fairly on the training data distribution may perform unfairly in the test phase. To address this issue, this section develops stochastic algorithms for fair empirical risk minimization via f-divergences in the presence of distribution shifts.

Assume that P̂_{s,ŷ}(s, ŷ) is the joint distribution of sensitive attributes and predictions on the training data. Distributionally robust fair empirical risk minimization via f-divergences is formulated as:
\[
\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell(\hat{y}_\theta(x_i), y_i) \quad \text{s.t.} \quad \max_{\mathcal{P} \in \mathcal{B}} \; D_f\Big( \mathcal{P}(\hat{y}_\theta(x), s) \,\big\|\, \mathcal{P}(\hat{y}_\theta(x)) \otimes \mathcal{P}(s) \Big) \le \kappa, \tag{8.6}
\]
where B = B(P̂, δ) is the distributional uncertainty set, defined as a certain ball of radius δ around the training distribution P̂. This formulation guarantees that the model's fairness is preserved (up to an f-divergence violation of at most κ) even when the test distribution changes slightly. With a slight abuse of notation, P̂ refers to the training distribution, whereas P is the optimization variable.
One can define the uncertainty set through an ϵ-neighborhood around the joint distribution of the training data, characterized by a distance measure such as ℓ_p norms, the Wasserstein distance, or the MMD distance. While such distributionally robust uncertainty sets are thoroughly analyzed for empirical risk minimization (ERM) [179, 47, 190], the DRO formulations proposed for fair learning are limited to the Wasserstein distance for fair logistic regression [325] and the MMD distance [342] on the distribution curvature as a heuristic for robustness. Unfortunately, none of these approaches offers a convergent algorithm with stochastic updates. Further, some of these approaches are limited to special loss functions and heuristics. In contrast, we study imposing distributionally robust fairness via f-divergences for a general loss function, where the uncertainty set is characterized by ℓ_p norms (Section 8.3.1) or f-divergences (Section 8.3.2). Our results show that the former approach is more suitable when lower levels of robustness for fairness are required, and the latter works better for handling larger distribution shifts.
8.3.1 Robust f-FERM Under ℓp Norms and Small Distribution Shifts
This section focuses on the widely studied ℓ_p norms as the uncertainty set for the distributional distance between the training and test domains. In this case, Problem (8.6) can be written as:

min_θ (1/n) Σ_{i=1}^n ℓ(ŷ_θ(x_i), y_i)   s.t.   max_{∥P − P̂∥_p ≤ δ, ∥Q − Q̂∥_p ≤ δ} D_f(P || Q) ≤ κ,   (8.7)

where P̂ represents the joint distribution of the sensitive attributes and predictions, and Q̂ denotes the Kronecker product of the marginal distributions of the sensitive attributes and predictions.
Since handling non-convex constraints is challenging, as is standard in training machine learning models, we consider the Lagrangian relaxation of Problem (8.7) as follows:

min_θ (1/n) Σ_{i=1}^n ℓ(ŷ_θ(x_i), y_i) + λ max_{∥P − P̂∥_p ≤ δ, ∥Q − Q̂∥_p ≤ δ} D_f(P || Q).   (8.8)
This problem falls into the nonconvex-nonconcave min-max optimization category and is most likely computationally hard for general uncertainty sets [89]. However, such a min-max optimization problem can be solved to stationarity when the diameter of the set B is small (i.e., under a small domain shift); see [246]. The core idea is to approximate the inner maximization problem with its Taylor approximation, leading to a nonconvex-concave min-max optimization problem, which is easier to solve [89, 270]. This idea has been used successfully in machine learning (see [118] for its use in sharpness-aware minimization). Utilizing this idea, Problem (8.8) can be approximated as:

min_θ max_{∥U∥_p ≤ δ, ∥V∥_p ≤ δ} h(θ, U, V) := (1/n) Σ_{i=1}^n ℓ(ŷ_θ(x_i), y_i) + λ⟨U, ∇_P D_f(P̂ || Q̂)⟩ + λ⟨V, ∇_Q D_f(P̂ || Q̂)⟩,   (8.9)

where we used the change of variables U := P − P̂ and V := Q − Q̂. Equivalently,
min_θ (1/n) Σ_{i=1}^n ℓ(ŷ_θ(x_i), y_i) + λδ∥∇_P D_f(P̂ || Q̂)∥_q + λδ∥∇_Q D_f(P̂ || Q̂)∥_q,   (8.10)

where ∥ · ∥_q is the dual of the ℓ_p norm, with 1/p + 1/q = 1.
Proposition 29 Assume that the gradient of the loss function is L-Lipschitz and that the second-order derivative of the loss exists. Then, any given ϵ-approximate stationary solution of Problem (8.10) is an O(ϵ)-approximate stationary solution of Problem (8.8) whenever Lδ ≲ ϵ.
This proposition, which is an immediate application of [246, Theorem 3.1], states that if the desired training accuracy ϵ is comparable to the amount of distribution shift δ (i.e., the small distribution shift regime), then one can solve Problem (8.10) instead of (8.8). Thus, in this regime, we need to solve (8.10) or, equivalently, (8.9). To this end, we need the (sub)gradients of the objective function in (8.9) with respect to the θ, U, and V variables. First, notice that

∇_U h(θ, U, V) = ∇_P D_f(P̂ || Q̂) = α*(P̂, Q̂)   and   ∇_V h(θ, U, V) = ∇_Q D_f(P̂ || Q̂) = −f*(α*(P̂, Q̂)),

where α*(P̂, Q̂) ∈ argmax_α Σ_j α_j p̂_j(θ) − q̂_j(θ) f*(α_j). Here we invoked Danskin's theorem on the variational form of D_f; p̂_j(θ) and q̂_j(θ) are the j-th elements of P̂ and Q̂, respectively. Next, we need to compute ∇_θ h(θ, U, V). The derivative of the first term in h(·) with respect to θ is easy to compute. We next calculate the derivative of the second term of h(θ, U, V) with respect to θ; as the derivative of the third term can be computed similarly, we omit its derivation here:

∇_θ ⟨U, ∇_P D_f(P̂ || Q̂)⟩ = ∇_θ ⟨U, α*(P̂, Q̂)⟩ = Σ_j u_j [ q̂_j(θ) ∇_θ p̂_j(θ) − p̂_j(θ) ∇_θ q̂_j(θ) ] / [ q̂_j²(θ) (f*)''(α)|_{α = α*_j(P̂, Q̂)} ],   (8.11)

where in the last equation we used the implicit function theorem to compute the derivative of α* with respect to θ. Notice that an implicit assumption here is that f* is twice differentiable (which holds for the KL divergence, χ² divergence, reverse KL, Jensen-Shannon, and squared Hellinger distances). Having access to the gradients, we can apply the standard (sub)gradient descent-ascent algorithm to obtain a solution to Problem (8.10) (see Appendix 10.6.8 for the details).
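As a concrete instance of these quantities, for the KL divergence (f(t) = t ln t, f*(α) = e^{α−1}) the inner maximizer and the curvature term appearing in (8.11) have closed forms: stationarity of α_j p̂_j − q̂_j f*(α_j) gives (f*)'(α*_j) = p̂_j/q̂_j, hence α*_j = 1 + ln(p̂_j/q̂_j) and (f*)''(α*_j) = e^{α*_j − 1} = p̂_j/q̂_j. A minimal numpy sketch under this KL assumption (names are ours):

    import numpy as np

    def kl_inner_maximizer(p_hat, q_hat):
        """alpha* and (f*)'' at alpha* for the KL divergence, f*(a) = exp(a - 1)."""
        ratio = p_hat / q_hat
        alpha_star = 1.0 + np.log(ratio)   # solves q_j (f*)'(alpha_j) = p_j
        curvature = ratio                  # (f*)''(alpha*_j) = exp(alpha*_j - 1) = p_j / q_j
        return alpha_star, curvature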
A semi-stochastic memory-efficient first-order training algorithm. To apply the (stochastic) gradient descent-ascent algorithm [203] to Problem (8.9), we need unbiased estimators of the gradient of h(θ, U, V) with respect to the θ, U, and V variables. While it seems challenging to obtain unbiased estimators with respect to all variables, one can notice that p̂_j(θ) and q̂_j(θ) can be computed exactly with one forward pass over all data points (i.e., with an O(m × n) memory requirement). Consequently, the gradient of h(θ, U, V) with respect to U and V can be computed with one forward pass over all data points (without the need for backpropagation). On the other hand, one can easily obtain unbiased estimators of ∇_θ p̂_j(θ) and ∇_θ q̂_j(θ) in (8.11) using a small mini-batch of data. Such a task requires O(b × d) memory, with d being the number of parameters (i.e., θ ∈ R^d) and b being the batch size. Combining this unbiased estimation with the computed values of p̂_j(θ) and q̂_j(θ) leads to an unbiased estimator of the gradient of the objective of (8.9) with respect to the θ variable. To summarize, we need one forward propagation to obtain gradients with respect to U and V, and we only do backpropagation for computing gradients with respect to θ over the mini-batch of data. Such an algorithm requires O(mn + bd) memory and thus can be used for training large models (with d, n ≫ b, m). It is known that memory requirements are the major limiting factor in training large models such as LLMs [221].
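The sketch below illustrates this two-pass structure: a gradient-free full pass computes p̂(θ) and q̂(θ) exactly, and backpropagation is run only on a minibatch to estimate the ∇_θ-terms. The loader interface and helper names are assumptions made purely for illustration.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def full_pass_distributions(model, full_loader, num_labels, num_groups):
        """Exact p-hat(θ) and q-hat(θ): one forward pass, no backprop, O(m·n) extra memory."""
        joint = torch.zeros(num_labels, num_groups)
        total = 0
        for x, _, s in full_loader:
            probs = torch.softmax(model(x), dim=1)
            joint += probs.t() @ F.one_hot(s, num_groups).float()
            total += len(x)
        p_hat = joint / total                                            # joint P̂(ŷ, s)
        q_hat = p_hat.sum(1, keepdim=True) @ p_hat.sum(0, keepdim=True)  # product of marginals Q̂
        return p_hat, q_hat

    def minibatch_theta_grad(model, x, s, weights, num_groups):
        """Backprop only on a minibatch: unbiased estimator of ∇θ ⟨weights, P̂(θ)⟩."""
        probs = torch.softmax(model(x), dim=1)
        p_mb = probs.t() @ F.one_hot(s, num_groups).float() / len(x)  # minibatch estimate of P̂
        (weights * p_mb).sum().backward()   # gradients accumulate in model.parameters()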
8.3.2 Robust f-FERM Under ℓ∞ Norms and Potentially Large Distribution Shifts
The framework developed in the previous section assumes the distribution shift is small (the uncertainty set diameter is smaller than a certain threshold). When preserving fairness in the presence of large distribution shifts is a priority, that methodology might not work well. As discussed before, formulation (8.8) leads to a nonconvex-nonconcave min-max optimization problem, and this class of problems is computationally hard to solve in general (even to stationarity notions). Thus, we need to exploit the structure of the problem. In this section, we show that we can exploit this structure to develop a first-order algorithm under large distribution shifts. In particular, we focus on the case where the uncertainty set is an ℓ∞ ball and the divergence satisfies certain assumptions (i.e., f*(α*) > 0 and α* > 0, which is satisfied for the KL divergence).

Since the function D_f is convex in P and Q, under an ℓ∞ uncertainty set on P and Q, the optimal solution of the maximization problem in (8.8) lies at an extreme point. Moreover, under the assumption that f*(α*) > 0 and α* > 0 (which is satisfied for the KL divergence), one can easily see that the optimal solution is p_j = min{p̂_j + δ, 1} and q_j = max{q̂_j − δ, 0} (see Appendix 10.6.9 for the exact proof). Notice that to obtain this efficient closed-form optimal solution, we need to relax the probability simplex constraint. Thus, under this assumption, Problem (8.8) can be reformulated as

min_θ (1/n) Σ_{i=1}^n ℓ(ŷ_θ(x_i), y_i) + λ D_f( min{P̂ + δ, 1} || max{Q̂ − δ, 0} ),   (8.12)

which is a regular minimization problem, and (sub)gradient descent can be utilized to solve it.
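For instance, with the KL divergence the robustified penalty in (8.12) can be evaluated directly from the shifted (simplex-relaxed) distributions. A minimal numpy sketch under these assumptions (names are ours):

    import numpy as np

    def robust_kl_penalty(p_hat, q_hat, delta, eps=1e-12):
        """D_KL( min{P+delta, 1} || max{Q-delta, 0} ) over the l-infinity ball (Eq. 8.12)."""
        p = np.minimum(p_hat + delta, 1.0)   # simplex constraint relaxed, as in the text
        q = np.maximum(q_hat - delta, eps)   # small floor keeps the log-ratio finite
        return float(np.sum(p * np.log(p / q)))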
8.4 Experiments
We use three popular notions of group fairness: demographic parity, equalized odds, and equality of opportunity violations (see Appendix 10.6.1 for definitions) to measure the fairness of trained models. To run Algorithm 15, we set η_θ and η_α to 10⁻⁵ and 10⁻⁶, respectively, in all experiments. Further, by changing λ, we obtain different points on the trade-off curve between accuracy and fairness. The range of λ depends on the f-divergence (see Appendix 10.6.10 for more information on tuning hyper-parameters). In the inference phase of our experiments, we use standard maximum likelihood decoding based on the output of the softmax layer, i.e., the predicted label is the label with the highest logit value.

As we will see in this section, several f-divergence measures lead to reasonable fairness/accuracy tradeoffs and can outperform existing benchmarks. However, no single f-divergence measure uniformly outperforms the other measures in all the experiments. Thus, we believe that in applications, the choice of the f-divergence can be viewed as a hyperparameter to be tuned by cross-validation.
8.4.1 Fairness-Accuracy Tradeoffs on Benchmark Datasets
Figure 8.1: Performance of different f-divergences as regularizers. The experiment is on the adult dataset with gender and race as sensitive attributes. While the offered tradeoffs are close to each other for small demographic parity violations, KL-divergence shows an extraordinary performance in the low-fairness, high-accuracy regime. We do not display the performance for larger batch sizes or when only one sensitive attribute is available, due to the insignificant difference between the performance of different f-divergences.
In the first set of experiments, we compare different f-divergence formulations for (f-FERM) to each other and to several state-of-the-art approaches supporting multiple sensitive attributes. Figure 8.1 demonstrates the resulting tradeoff on the adult dataset [37] with gender and race as the sensitive attributes (black-female, black-male, white-female, white-male). To measure fairness, we use the demographic parity violation defined as:

DPV = max_{i,j ∈ S} |P(ŷ = 1 | s = i) − P(ŷ = 1 | s = j)|.

In the case of binary sensitive attributes (e.g., gender), there is no significant variation between different f-divergences. However, when we have two sensitive attributes and the batch size is small (8 in Figure 8.1), the results differ significantly across f-divergences. Interestingly, for smaller λ values, KL-divergence improves fairness violation and accuracy simultaneously. We do not observe such a phenomenon for the other f-divergences or the state-of-the-art approaches in the literature. Further, in Figure 8.2, we compare one of the f-divergences (reverse KL) to several SOTA methods including [222, 22, 75]. Other approaches, such as the pre-processing method of [364], the post-processing approach of [141], and several in-processing methods including [362, 100, 157], demonstrate lower performance than the ones depicted in Figure 8.2 and are removed from the figure. While our approach demonstrates consistently good performance across different batch sizes (full-batch, 64, 8, 2), the performance of the other methods drops significantly for smaller ones. For further experiments on other datasets (German and COMPAS) and other fairness measures (equality of opportunity and equalized odds violations), see Appendix 10.6.11.
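For reference, the demographic parity violation defined above can be computed directly from hard predictions; a small numpy sketch:

    import numpy as np

    def dp_violation(y_pred, s):
        """DPV = max_{i,j in S} |P(yhat=1 | s=i) - P(yhat=1 | s=j)| from binary predictions."""
        rates = [y_pred[s == g].mean() for g in np.unique(s)]
        return max(rates) - min(rates)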
Figure 8.2: Performance of the trained fair models on the Adult dataset with gender and race as two sensitive attributes and with different batch sizes. The red dashed line represents the naïve baseline where the model outputs zero with probability p; by increasing p, the model becomes fairer at the cost of a loss in accuracy.
8.4.2 Fairness-Accuracy Tradeoffs in the Presence of the Distribution Shift
We perform two experiments to evaluate the algorithms developed in Section 8.3. In the first experiment, we randomly flip the gender label for n% of the data points (n ranges from 1 to 20) in the Adult dataset. Then, we train models on the new datasets with a proportion of corrupted sensitive attributes and evaluate the performance on the test data. Figure 8.3 is obtained by training different models to achieve 80% accuracy on the test data and comparing their demographic parity violations. As the percentage of corrupted sensitive attributes increases, both f-DRO and f-infinity achieve a smaller DP violation than the SOTA approaches in the literature. In this specific experiment, f-DRO works better than f-infinity, and there is no significant difference between choosing the KL-divergence or χ² as the function f. Among the papers designed for handling distribution shifts, [278] and [343] were the only options with available implementations.
Figure 8.3: Performance of different state-of-the-art approaches and our two methods for handling distribution shift. The dataset is Adult, and the sensitive attribute is gender. We randomly flip a proportion of the gender entries (from 0 to 20%). As we observe, our approaches are more robust than the alternatives, exhibiting a smaller increase in DP violation.
In a more recently collected dataset known as new Adult [99], the users are separated based on their state of residence. We train different fair models on a single state and evaluate the fairness-accuracy tradeoff on the other states. Figure 8.4 depicts the performance of different methods. For each method, the center point is the average of accuracy and fairness over the 50 states, and the horizontal and vertical lines show the 25th-to-75th-percentile range of performance among the states. The training fairness violation is set to 0.02 for all methods. We observe that f-infinity preserves the fairness level better than the other approaches, while f-DRO has a better accuracy. Depending on the application, we suggest using f-infinity if preserving a high level of fairness is a priority and f-DRO when a better tradeoff between fairness and accuracy is expected. Note that both approaches offer better fairness-accuracy tradeoffs than the SOTA approaches in the literature.
Figure 8.4: Performance of the trained fair models on the new Adult dataset. The model is trained on one state (California or Texas) and evaluated on all 50 states. The distribution of each state's dataset is different from the others; thus, the IID assumption does not hold across the states' datasets.
8.5 Conclusion
This chapter presented a unified stochastic framework for fair risk minimization via f-divergences (f-FERM). The key idea is to reformulate the objective function as a min-max optimization problem using the Legendre-Fenchel duality of f-divergences. This enables us to develop an unbiased gradient estimator and a convergent stochastic first-order algorithm. Furthermore, we robustified f-FERM against distributional changes using ℓ_p norm balls as the uncertainty set. While our empirical investigation delves into the performance and fairness distinctions among various f-divergences, a more comprehensive analysis is warranted to determine the optimal f-divergence with respect to the performance-fairness tradeoff, faster convergence, and asymptotic behavior. Furthermore, the distributionally robust formulation of fair empirical risk minimization, and the advantages of each formulation, can be explored beyond f-divergences as the measure of fairness violation and beyond ℓ_p norm balls as uncertainty sets.
Chapter 9
Conclusions
In this thesis, we delve into optimization techniques aimed at enhancing the robustness and fairness of machine learning tasks. Chapter 1 highlights the limitations of empirical risk minimization, the prevailing approach in machine learning, particularly concerning robustness and fairness. To address these shortcomings, we introduce optimization frameworks tailored for robust machine learning tasks. These frameworks encompass strategies for facilitating inference in the presence of significant missing data (Chapter 2), optimizing feature selection amidst batch effects (Chapter 3), and robustifying neural networks against adversarial attacks (Chapter 4).

In our exploration of fair machine learning, Chapter 5 introduces a convergent optimization framework for fair empirical risk minimization employing the HGR correlation. Our analysis in Chapter 6 reveals that existing methodologies such as Rényi Fair Inference lack adaptability to stochastic optimization, where only small batches of data are available per iteration. To address this limitation, we propose a first-order stochastic framework utilizing Exponential Rényi Mutual Information (ERMI) as the fairness violation metric. Despite the superior performance of FERMI outlined in Chapter 6, it exhibits susceptibility to distribution shifts, as discussed in Chapter 7. To mitigate this vulnerability, Chapter 7 introduces a distributionally robust stochastic optimization framework, an extension of FERMI tailored for scenarios involving data distribution shifts. The convergent stochastic frameworks presented in Chapters 6 and 7 are limited to a very specific measure of fairness violation (ERMI). Therefore, in Chapter 8, we developed a stochastic optimization framework for robust and fair empirical risk minimization via the broad class of f-divergences.
From a broader perspective, the methodologies and formulations presented in this thesis fall within the domain of constrained optimization. A notable challenge in addressing these constraints lies in their data-driven nature: they are functions of the input data points. This stands in contrast to conventional constraints and regularizers, which depend solely on the model parameters. This data-driven structure poses unique challenges in devising optimization frameworks capable of effectively handling these constraints while maintaining robustness and fairness in machine learning tasks.

The optimization techniques and methods proposed in this thesis can be utilized in many different machine learning tasks, including meta-learning [127], contrastive learning [215], differentially private machine learning [212], domain generalization [298], distributionally robust optimization [190], tensor decomposition [291, 14], machine learning for biological and healthcare tasks [281, 27], and federated learning [194, 195].
Chapter 10
Appendix
10.1 Chapter 2 Appendix
10.1.1 A Review of Missing Value Imputation Methods in the Literature
The fundamental idea behind many data imputation approaches is that missing values can be predicted based on the available data of other data points and correlated features. One of the most straightforward imputation techniques is to replace the missing values of a feature by the mean (or median) of that feature, calculated from its available entries; see [205, Chapter 3]. However, this naïve approach ignores the correlation between features and does not preserve the variance of features. Another class of imputers has been developed based on least-squares methods [267, 173, 371, 57]. [267] learns a linear model with multivariate Gaussian noise for the feature with the fewest missing entries; it then repeats the same procedure on the updated data to impute the feature with the next fewest missing entries, until all features are completely imputed. One drawback of this approach is that the error from the imputation of previous features can propagate to subsequent features. To impute the entries of a given feature in a dataset, [173] learns several univariate regression models that consider that feature as the response, and then takes the average of these predictions as the final imputed value. This approach fails to learn correlations involving more than two features.
Many more complex algorithms have been developed for imputation, although many are sensitive to initial assumptions and may not converge. For instance, KNN-Imputer imputes a missing feature of a data point by taking the mean value of the K closest complete data points [333]. MissForest, on the other hand, imputes the missing values of each feature by learning a random forest classifier using the other features of the training data [318]. MissForest does not need to assume that all features are continuous [148] or categorical [295]. However, neither KNN-Imputer nor MissForest guarantees statistical or computational convergence. Moreover, when the proportion of missing values is high, both are likely to suffer a severe drop in performance, as demonstrated in Section 2.5. The Expectation Maximization (EM) algorithm is another popular approach that learns the parameters of a prior distribution on the data from the available values, based on the EM algorithm of [95]; see also [125] and [148]. The EM algorithm is also used in Amelia, which fits a jointly normal distribution to the data using EM and the bootstrap technique [148]. While Amelia demonstrates superior performance on datasets following a normal distribution, it is highly sensitive to violations of the normality assumption (as discussed in [45]). [125] adopt the EM algorithm to learn a joint Bernoulli distribution for the categorical data and a joint Gaussian distribution for the continuous variables independently. While those algorithms can be viewed as inference methods based on low-order estimates of moments, they do not consider the uncertainty in such low-order moment estimates. By contrast, our framework utilizes robust optimization to account for the uncertainty around the estimated moments. Moreover, unlike some of the algorithms mentioned above, our optimization procedure for imputation and prediction is guaranteed to converge.
Another popular method for data imputation is multiple imputation by chained equations (MICE). MICE learns a parametric distribution for each feature conditional on the remaining features; for instance, it may assume that the current target variable is a linear function of the other features with zero-mean Gaussian noise. Each feature can have its own distribution and parameters (e.g., Poisson regression, logistic regression). Based on the learned parameters of the conditional distributions, MICE can generate one or more imputed datasets [56]. More recently, several neural network-based imputers have been proposed. GAIN (Generative Adversarial Imputation Network) learns a generative adversarial network on the available data and then imputes the missing values using the trained generator [357]. One advantage of GAIN over other existing GAN imputers is that it does not need a complete dataset during the training phase. MIDA (Multiple Imputation using Denoising Autoencoders) is an auto-encoder-based approach that trains a denoising auto-encoder on the available data, treating the missing entries as noise. Like other neural network-based methods, these algorithms suffer from their black-box nature: they are challenging to interpret and explain, making them unpopular in mission-critical healthcare applications. In addition, no statistical or computational guarantees are provided for these algorithms.
[45] formulates the imputation task as a constrained optimization problem where the constraints are determined by the underlying classification model, such as KNN (k-nearest neighbors), SVM (support vector machines), or decision trees. Their general framework is non-convex, and the authors relax the optimization for each choice of the cost function using first-order methods. A block coordinate descent algorithm then optimizes the relaxed problem. They demonstrate the convergence and accuracy of their proposed algorithm numerically, but a theoretical analysis guaranteeing the algorithm's convergence is absent from their work.
10.1.2 Estimating Confidence Intervals of Low-order Moments
In this section, we explain the methodology for estimating confidence intervals for E[z_i] and E[z_i z_j]. Let X ∈ R̃^{n×d} and y be the data matrix and target variables for n given data points, respectively, whose entries are in R̃ = R ∪ {∗}, where the symbol ∗ represents a missing entry. Moreover, assume that a_i represents the i-th column (feature) of the matrix X. We define:

ã_i(k) = a_i(k) if a_i(k) ≠ ∗,   and   ã_i(k) = 0 if a_i(k) = ∗.

Thus, ã_i is obtained by replacing the missing values with 0.
We estimate the confidence intervals for the mean and covariance of features using multiple bootstrap samples on the available data. Let C_0[i][j] and ∆[i][j] be the center and the radius of the confidence interval for C[i][j], respectively. We compute the center of the confidence interval for C[i][j] as follows:

C_0[i][j] = (1/m_ij) ã_i^T ã_j,   (10.1)

where m_i = |{k : a_i(k) ≠ ∗}| and m_ij = |{k : a_i(k) ≠ ∗, a_j(k) ≠ ∗}|. This estimator is obtained from the rows where both features are available. More precisely, let M be the mask of the input data matrix X, defined as:

M_ij = 0 if X_ij is missing, and M_ij = 1 otherwise.

Note that m_ij = (M^T M)_ij, which is the number of rows in the dataset where both features i and j are available. To estimate the confidence interval for C_ij, we use Algorithm 16. First, we select multiple (K) samples of size N = m_ij from the rows where both features are available. Each one of these samples is obtained by applying a bootstrap sampler (sampling with replacement) to the m_ij rows where both features are available. Then, we compute the second-order moment of the two features for each sample; the standard deviation of these estimations determines the radius of the corresponding confidence interval. Algorithm 16 summarizes the required steps for computing the confidence interval radius ∆_ij for the ij-th entry of the covariance matrix. Note that the confidence intervals for µ can be computed similarly.
Algorithm 16 Estimating Confidence Interval Length ∆_ij for Feature i and Feature j
1: Input: K: number of bootstrap estimations
2: for t = 1, . . . , K do
3:   Pick n samples with replacement from the rows where both the i-th and j-th features are available.
4:   Let (X̂_{i1}, X̂_{j1}), . . . , (X̂_{in}, X̂_{jn}) be the i-th and j-th features of the selected samples
5:   C_t = (1/n) Σ_{r=1}^n X̂_{ir} X̂_{jr}
6: ∆_ij = std(C_1, C_2, . . . , C_K)
Having C_0 and ∆, the confidence interval for the matrix C is computed as follows:

C_min = C_0 − c∆,   C_max = C_0 + c∆.
Computing bmin and bmax can be done in the same manner. The hyper-parameter c is defined to
control the robustness of the model by tuning the length of confidence intervals. A larger c corresponds to
bigger confidence intervals and, thus, a more robust estimator. On the other hand, large values for c lead
to very large confidence intervals that can adversely affect the performance of the trained model.
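A minimal numpy sketch of the point estimate (10.1) together with the bootstrap radius of Algorithm 16; the NaN encoding of the ∗ symbol and all names are our assumptions:

    import numpy as np

    def moment_confidence_interval(X, i, j, K=100, rng=None):
        """Center C0[i][j] (Eq. 10.1) and radius Delta_ij (Algorithm 16) for E[X_i X_j]."""
        if rng is None:
            rng = np.random.default_rng(0)
        both = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])   # rows where both features exist
        xi, xj = X[both, i], X[both, j]
        center = np.mean(xi * xj)                        # (1/m_ij) a_i.T a_j on observed rows
        boots = [np.mean(xi[idx] * xj[idx])
                 for idx in (rng.integers(0, len(xi), len(xi)) for _ in range(K))]
        return center, np.std(boots)                     # Delta_ij = std of bootstrap estimates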
Remark 30 Since the computation of confidence intervals for different entries of the covariance matrix are
independent of each other, they can be computed in parallel. In particular, if γ cores are available, ⌈d/γ⌉
features (columns of the covariance matrix) can be assigned to each one of the available cores.
10.1.3 Robust Ridge Regression with the Optimal Convergence Rate
The convergence of Algorithm 1 to the optimal solution of Problem (2.6) can be slow in practice, since the algorithm requires a matrix inversion to update θ and applies the box constraint to C and b at each iteration. While the minimization problem with respect to θ is solved in closed form, we can speed up the convergence of the maximization problem by applying Nesterov's acceleration method to the function g(b, C) in (2.7). Since the function g is the minimum of convex functions, its gradient with respect to C and b can be computed using Danskin's theorem. Algorithm 17 describes the steps for optimizing Problem (2.7) using Nesterov's acceleration method.
Algorithm 17 Applying Nesterov's Acceleration Method to Robust Linear Regression
1: Input: C_0, b_0, ∆, δ, T
2: Initialize: C_1 = C_0, b_1 = b_0, γ_0 = 0, γ_1 = 1.
3: for i = 1, . . . , T do
4:   γ_{i+1} = (1 + √(1 + 4γ_i²)) / 2
5:   Y_{C_i} = C_i + ((γ_i − 1)/γ_{i+1}) (C_i − C_{i−1})
6:   C_{i+1} = Π_{∆+}( Y_{C_i} + (1/L) θθ^T )
7:   Y_{b_i} = b_i + ((γ_i − 1)/γ_{i+1}) (b_i − b_{i−1})
8:   b_{i+1} = Π_δ( Y_{b_i} − (2/L) θ )
9:   Set θ = (C_{i+1} + λI)^{−1} b_{i+1}
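A numpy sketch of the main loop of Algorithm 17, with the box projections written via clipping and the PSD part of Π_{∆+} omitted for brevity (consistent with the relaxation discussed elsewhere in this appendix); all helper names are ours:

    import numpy as np

    def accelerated_robust_regression(C0, b0, C_min, C_max, b_min, b_max, lam, L, T):
        """Nesterov-accelerated projected ascent on g(b, C) = min_theta t'Ct - 2b't + lam||t||^2."""
        d = C0.shape[0]
        C_prev, b_prev = C0.copy(), b0.copy()
        C, b = C0.copy(), b0.copy()
        gamma = 1.0
        for _ in range(T):
            theta = np.linalg.solve(C + lam * np.eye(d), b)    # theta*(b, C) in closed form
            gamma_next = (1.0 + np.sqrt(1.0 + 4.0 * gamma**2)) / 2.0
            w = (gamma - 1.0) / gamma_next
            YC, Yb = C + w * (C - C_prev), b + w * (b - b_prev)
            C_prev, b_prev = C.copy(), b.copy()
            C = np.clip(YC + np.outer(theta, theta) / L, C_min, C_max)  # box part of the projection
            b = np.clip(Yb - 2.0 * theta / L, b_min, b_max)             # Pi_delta as clipping
            gamma = gamma_next
        theta = np.linalg.solve(C + lam * np.eye(d), b)
        return theta, C, b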
Theorem 31 Let (θ̃, C̃, b̃) be the optimal solution of (2.6), and let D = ∥C_0 − C̃∥²_F + ∥b_0 − b̃∥²_2. Assume that for any given b and C within the uncertainty sets described in (2.6), ∥θ*(b, C)∥ ≤ τ. Then, Algorithm 17 computes an ϵ-optimal solution of the objective function in O(√(D(τ + 1)²/(λϵ))) iterations.
Proof: The proof is relegated to Appendix 10.1.9. □
10.1.4 Solving the Dual Problem of the Robust Ridge Linear Regression via ADMM
The Alternating Direction Method of Multipliers (ADMM) is a popular algorithm for efficiently solving linearly constrained optimization problems [122, 149]. It has been extensively applied to large-scale optimization problems in machine learning and statistical inference in recent years [20, 370]. Consider the following optimization problem, consisting of two blocks of variables w and z that are linearly coupled:

min_{w,z} f(w) + g(z)   s.t.   Aw + Bz = c.   (10.2)
The augmented Lagrangian of the above problem can be written as:

min_{w,z} f(w) + g(z) + ⟨Aw + Bz − c, λ⟩ + (ρ/2) ∥Aw + Bz − c∥².   (10.3)
The ADMM scheme updates the primal and dual variables iteratively, as presented in Algorithm 18.

Algorithm 18 General ADMM Algorithm
1: for t = 1, . . . , T do
2:   w^{t+1} = argmin_w f(w) + ⟨Aw + Bz^t − c, λ⟩ + (ρ/2) ∥Aw + Bz^t − c∥²
3:   z^{t+1} = argmin_z g(z) + ⟨Aw^{t+1} + Bz − c, λ⟩ + (ρ/2) ∥Aw^{t+1} + Bz − c∥²
4:   λ^{t+1} = λ^t + ρ (Aw^{t+1} + Bz^{t+1} − c)
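To fix ideas, the toy instance below runs Algorithm 18 on min_{w,z} ½∥w − a∥² + ½∥z − c∥² subject to w − z = 0 (so A = I, B = −I), where both sub-problems have closed forms. This is purely didactic and is not the solver used for (10.5) below:

    import numpy as np

    def admm_toy(a, c, rho=1.0, T=100):
        """ADMM for min (1/2)||w-a||^2 + (1/2)||z-c||^2  s.t.  w - z = 0."""
        z = np.zeros_like(a)
        lam = np.zeros_like(a)
        for _ in range(T):
            w = (a - lam + rho * z) / (1 + rho)   # argmin_w of the augmented Lagrangian
            z = (c + lam + rho * w) / (1 + rho)   # argmin_z of the augmented Lagrangian
            lam = lam + rho * (w - z)             # dual ascent step
        return w, z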
As we mentioned earlier, the simultaneous projection of C onto the set of positive semi-definite matrices and the box constraint C_min ≤ C ≤ C_max in Algorithm 1 and Algorithm 17 is computationally expensive. Moreover, careful step-size tuning is necessary to avoid inconsistency and to guarantee the convergence of those algorithms.

An alternative approach for solving Problem (2.6) that avoids removing the PSD constraint in the implementation of Algorithm 1 and Algorithm 17 is to solve the dual of the inner maximization problem. Since the maximization problem is concave with respect to C and b, and the relative interior of the feasible set is non-empty, the duality gap is zero. Hence, instead of solving the inner maximization problem, we can solve its dual, which is a minimization problem. Theorem 32 describes the dual of the inner maximization problem in (2.6). Thus, Problem (2.6) can be alternatively formulated as a minimization problem rather than a min-max problem. We can solve such a constrained minimization problem efficiently via the ADMM algorithm. As we will show, the ADMM algorithm applied to the dual problem does not need step-size tuning or simultaneous projections onto the box constraints and the positive semi-definite (PSD) constraint.
Theorem 32 (Dual Problem) The inner maximization problem described in (2.6) can be equivalently formulated as:

min_{A,B,d,e,H} −⟨b_min, d⟩ + ⟨b_max, e⟩ − ⟨C_min, A⟩ + ⟨C_max, B⟩ + λ∥θ∥²
s.t. −θθ^T − A + B − H = 0,
     2θ − d + e = 0,
     A, B, d, e ≥ 0,
     H ⪰ 0.

Therefore, Problem (2.6) can be alternatively written as:

min_{θ,A,B,d,e,H} −⟨b_min, d⟩ + ⟨b_max, e⟩ − ⟨C_min, A⟩ + ⟨C_max, B⟩ + λ∥θ∥²
s.t. −θθ^T − A + B − H = 0,
     2θ − d + e = 0,
     A, B, d, e ≥ 0,
     H ⪰ 0.   (10.4)
Proof: The proof is relegated to Appendix 10.1.9. □
To apply the ADMM method to the dual problem, we must divide the optimization variables into two blocks, as in (10.2), such that both sub-problems in Algorithm 18 can be solved efficiently. To do so, we first introduce the auxiliary variables d′, e′, θ′, A′, and B′ into the dual problem. Also, let G = H + θ′θ′^T. Therefore, Problem (10.4) is equivalent to:

min_{θ,A,B,d,e,G} −⟨b_min, d⟩ + ⟨b_max, e⟩ − ⟨C_min, A⟩ + ⟨C_max, B⟩ + λ∥θ∥²
s.t. B − A = G,
     2θ − d + e = 0,
     A = A′, B = B′,
     d = d′, e = e′, θ = θ′,
     A′, B′, d′, e′ ≥ 0,
     G ⪰ θ′θ′^T.   (10.5)
Since handling both constraints on θ in Problem (10.4) is difficult, we interchange θ with θ′ in the first constraint. Moreover, the non-negativity constraints on A, B, d, and e are exchanged with non-negativity constraints on A′, B′, d′, and e′. For simplicity of presentation, assume that c₁ᵗ = b_min − μ_dᵗ + ρd′ᵗ + ηᵗ, c₂ᵗ = −b_max − μ_eᵗ + ρe′ᵗ − ηᵗ, c₃ᵗ = −μ_θ + ρθ′ᵗ − 2ηᵗ, D₁ᵗ = ρA′ᵗ − ρGᵗ + Γᵗ − M_Aᵗ + C_min, and D₂ᵗ = ρB′ᵗ + ρGᵗ − Γᵗ − M_Bᵗ − C_max. Algorithm 19 describes the ADMM algorithm applied to Problem (10.5).
Corollary 33 If the feasible set of Problem (2.6) has a non-empty interior, then Algorithm 19 converges to an ϵ-optimal solution of Problem (10.5) in O(1/ϵ) iterations.

Proof: Since the inner maximization problem in (2.6) is convex and its feasible set has a non-empty interior, the duality gap is zero by Slater's condition. Thus, according to Theorem 6.1 in [142], Algorithm 19 converges to an optimal solution of the primal-dual problem at a linear rate. Moreover, the sequence of constraint residuals converges to zero at a linear rate as well. □
Remark 34 The optimal solution obtained from the ADMM algorithm can be different from the one given by Algorithm 1, because we remove the positive semi-definite constraint on C in the latter. We investigate the difference between the solutions of the two algorithms as follows. First, we generate a small positive semi-definite matrix C* and the matrix of confidence intervals ∆ as follows:

C* = [97 40 92; 40 17 38; 92 38 88],   ∆ = [0.2 0.3 0.2; 0.3 0.1 0.2; 0.1 0.3 0.1].

Moreover, let b* and δ be generated as follows:

b* = [6.65, 8.97, 5.40]^T,   δ = [0.1, 0.2, 0.2]^T.

Initializing both algorithms with a random matrix within C_min = C* − ∆ and C_max = C* + ∆, and a random vector within b_min = b* − δ and b_max = b* + δ, the ADMM algorithm returns a different solution from Algorithm 1. Moreover, the difference in the performance of the algorithms during the test phase can be observed in the experiments on synthetic datasets depicted in Figure 2.3 as well, especially when the number of samples is small.
Algorithm 19 Applying ADMM to the Dual Reformulation of Robust Linear Regression
1: Given: b_min, b_max, C_min, C_max, λ, ρ
2: Initialize: C_1 = C_0, b_1 = b_0, γ_0 = 0, γ_1 = 1.
3: for t = 0, . . . , T do
4:   θ^{t+1} = (1/(6λ + 7ρ)) (2c₁ᵗ − 2c₂ᵗ − 3c₃ᵗ)
5:   d^{t+1} = (1/(6λ + 7ρ)) ( ((6ρ + 4λ)/ρ) c₁ᵗ + ((4ρ + 4λ)/ρ) c₂ᵗ + 2c₃ᵗ )
6:   e^{t+1} = (2/(6λ + 7ρ)) ( ((ρ + 2λ)/ρ) c₁ᵗ + ((3ρ + 2λ)/ρ) c₂ᵗ − c₃ᵗ )
7:   A′^{t+1} = max(Aᵗ + M_Aᵗ/ρ, 0)
8:   B′^{t+1} = max(Bᵗ + M_Bᵗ/ρ, 0)
9:   G^{t+1} = [Bᵗ − Aᵗ + Γᵗ/ρ − θ′ᵗθ′ᵗ^T]₊ + θ′ᵗθ′ᵗ^T
10:  d′^{t+1} = max(dᵗ + μ_dᵗ/ρ, 0)
11:  e′^{t+1} = max(eᵗ + μ_eᵗ/ρ, 0)
12:  θ′^{t+1} = argmin_{θ′} ∥θ^{t+1} − θ′∥² + ⟨μ_θᵗ, θ^{t+1} − θ′⟩   s.t.   G^{t+1} ⪰ θ′θ′^T
13:  A^{t+1} = (1/(3ρ)) (2D₁ᵗ + D₂ᵗ)
14:  B^{t+1} = (1/(3ρ)) (D₁ᵗ + 2D₂ᵗ)
15:  M_A^{t+1} = M_Aᵗ + ρ (A^{t+1} − A′^{t+1})
16:  M_B^{t+1} = M_Bᵗ + ρ (B^{t+1} − B′^{t+1})
17:  μ_d^{t+1} = μ_dᵗ + ρ (d^{t+1} − d′^{t+1})
18:  μ_e^{t+1} = μ_eᵗ + ρ (e^{t+1} − e′^{t+1})
19:  μ_θ^{t+1} = μ_θᵗ + ρ (θ^{t+1} − θ′^{t+1})
20:  η^{t+1} = ηᵗ + ρ (2θ^{t+1} − d^{t+1} + e^{t+1})
21:  Γ^{t+1} = Γᵗ + ρ (B^{t+1} − A^{t+1} − G^{t+1})
We now show how to apply the ADMM scheme to Problem (10.5) to obtain Algorithm 19. As discussed earlier, we consider two separate blocks of variables, w = (θ, d, e, G, B′, A′) and z = (d′, e′, θ′, B, A).
Assigning Γ, η, M_A, M_B, μ_d, μ_e, and μ_θ to the constraints of Problem (10.5), in order, we can write the corresponding augmented Lagrangian function as:

min_{θ,θ′,A,A′,B,B′,d,d′,e,e′,G} −⟨b_min, d⟩ + ⟨b_max, e⟩ − ⟨C_min, A⟩ + ⟨C_max, B⟩ + λ∥θ∥²
   + ⟨A − A′, M_A⟩ + (ρ/2)∥A − A′∥²_F
   + ⟨B − B′, M_B⟩ + (ρ/2)∥B − B′∥²_F
   + ⟨d − d′, μ_d⟩ + (ρ/2)∥d − d′∥²
   + ⟨e − e′, μ_e⟩ + (ρ/2)∥e − e′∥²
   + ⟨θ − θ′, μ_θ⟩ + (ρ/2)∥θ − θ′∥²
   + ⟨2θ − d + e, η⟩ + (ρ/2)∥2θ − d + e∥²
   + ⟨B − A − G, Γ⟩ + (ρ/2)∥B − A − G∥²_F
s.t. A′, B′, d′, e′ ≥ 0,
     G ⪰ θ′θ′^T.   (10.6)
At each iteration of the ADMM algorithm, the parameters of one block are fixed, and the optimization problem is solved with respect to the parameters of the other block. For simplicity of presentation, let c₁ᵗ = ρθ′ᵗ − μ_θᵗ − 2ηᵗ, c₂ᵗ = ρd′ᵗ − μ_dᵗ − b_min + ηᵗ, c₃ᵗ = ρe′ᵗ − μ_eᵗ + b_max − ηᵗ, D₁ᵗ = ρA′ᵗ − ρGᵗ + Γᵗ − M_Aᵗ + C_min, and D₂ᵗ = ρB′ᵗ + ρGᵗ − Γᵗ − M_Bᵗ − C_max.
We have two non-trivial sub-problems containing positive semi-definite constraints. The sub-problem with respect to G can be written as:

min_G ⟨Bᵗ − Aᵗ − G, Γᵗ⟩ + (ρ/2)∥Bᵗ − Aᵗ − G∥²_F   s.t.   G ⪰ θ′ᵗθ′ᵗ^T.   (10.7)
By completing the square and changing the variable G′ = G − θ′ᵗθ′ᵗ^T, we equivalently need to solve the following problem:

min_{G′} (ρ/2) ∥G′ − (Bᵗ − Aᵗ − θ′ᵗθ′ᵗ^T + Γᵗ/ρ)∥²_F   s.t.   G′ ⪰ 0.   (10.8)
Thus, G′* = [Bᵗ − Aᵗ + Γᵗ/ρ − θ′ᵗθ′ᵗ^T]₊, where [A]₊ denotes the projection onto the set of PSD matrices, which can be done by setting the negative eigenvalues in the eigenvalue decomposition of A to zero.
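The projection [·]₊ admits the standard eigenvalue-clipping implementation; a short numpy sketch:

    import numpy as np

    def psd_projection(A):
        """[A]+: project a symmetric matrix onto the PSD cone by zeroing negative eigenvalues."""
        eigval, eigvec = np.linalg.eigh(A)   # eigh assumes A is symmetric
        return eigvec @ np.diag(np.maximum(eigval, 0.0)) @ eigvec.T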
The other non-trivial sub-problem in Algorithm 19 is the minimization with respect to θ′ (Line 12). By completing the square, it can be equivalently formulated as:

min_{θ′} ∥θ′ − (θ^{t+1} + μ_θᵗ/ρ)∥²₂   s.t.   G^{t+1} ⪰ θ′θ′^T.   (10.9)
Let G = UΛU^T be the eigenvalue decomposition of the matrix G, where Λ is a diagonal matrix containing the eigenvalues of G. Set α = θ^{t+1} + μ_θᵗ/ρ. Since U^T U = I, we have:

∥U^T θ′ − U^T α∥² = ∥θ′ − α∥²₂.
Set β = U^T θ′; then Problem (10.9) can be reformulated as:

min_β ∥β − U^T α∥²₂   s.t.   ββ^T ⪯ Λ.   (10.10)
Note that the constraint of the above optimization problem is equivalent to the following:

ββ^T ⪯ Λ ⇔ [1, β^T; β, Λ] ⪰ 0 ⇔ β^T Λ⁻¹ β ≤ 1 ⇔ Σ_{i=1}^n β_i²/λ_i ≤ 1,

where λ_i = Λ_ii. Since the block matrix is symmetric, by the Schur complement it is positive semi-definite if and only if Λ is positive semi-definite and 1 − β^T Λ⁻¹ β ≥ 0 (the third inequality above).
Set γ = U^T α; then we can write Problem (10.10) as:

min_β ∥β − γ∥²₂   s.t.   Σ_{i=1}^n β_i²/λ_i ≤ 1.   (10.11)
It can easily be shown that the optimal solution has the form β*_i = γ_i/(1 + μ*/λ_i), where μ* is the optimal Lagrange multiplier corresponding to the constraint of Problem (10.11). The optimal Lagrange multiplier can be obtained by a bisection procedure similar to Algorithm 2. Having β*, the optimal θ′ can be computed by solving the linear system U^T θ′* = β*.
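A numpy sketch of this bisection step for Problem (10.11) (all names are ours): when γ is infeasible, we search for the multiplier μ* ≥ 0 at which the constraint holds with equality.

    import numpy as np

    def project_to_ellipsoid(gamma, lam, tol=1e-10, iters=100):
        """Solve min ||beta - gamma||^2  s.t.  sum beta_i^2 / lam_i <= 1 by bisection on mu."""
        def constraint(mu):
            beta = gamma / (1.0 + mu / lam)
            return np.sum(beta**2 / lam) - 1.0
        if constraint(0.0) <= 0:          # gamma is already feasible
            return gamma
        lo, hi = 0.0, 1.0
        while constraint(hi) > 0:         # grow an upper bracket for mu*
            hi *= 2.0
        for _ in range(iters):            # constraint(mu) is decreasing in mu
            mid = (lo + hi) / 2.0
            if constraint(mid) > 0:
                lo = mid
            else:
                hi = mid
            if hi - lo < tol:
                break
        return gamma / (1.0 + hi / lam)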
10.1.5 Quadratic RIFLE: Using Kernels to Go Beyond Linearity
A natural extension of RIFLE to non-linear models is to transform the original data via multiple kernels and then apply RIFLE to the transformed data. To this end, we applied polynomial kernels to the original data, which consider polynomial transformations of the features and their interactions. A drawback of this approach is that if the original data contains d features and the order of the polynomial kernel is t, the number of features in the transformed data will be O(d^t), which increases the runtime of prediction/imputation drastically. Thus, we only consider t = 2, which leads to a dataset containing the interactions of the different features of the original data. We call the RIFLE algorithm applied to the data transformed by a quadratic kernel Quadratic RIFLE (QRIFLE). Table 2.1 demonstrates the performance of QRIFLE alongside RIFLE and other state-of-the-art approaches. Moreover, we applied QRIFLE to a regression task where the correlation between the predictors and the target variable is quadratic (Figure 10.1). We observe that QRIFLE works better than RIFLE when the percentage of missing values is not high.
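A sketch of the degree-2 feature expansion underlying QRIFLE, with missingness propagated through the products (so an interaction term is missing whenever either factor is; names are ours):

    import numpy as np

    def quadratic_expansion(X):
        """Append all pairwise products x_i * x_j (i <= j) to X; NaN encodes a missing entry,
        and NaN propagates through products, so missingness rates grow in the new features."""
        n, d = X.shape
        cols = [X]
        for i in range(d):
            for j in range(i, d):
                cols.append((X[:, i] * X[:, j]).reshape(n, 1))
        return np.hstack(cols)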
10.1.6 Performance of RIFLE and QRIFLE on Synthetic Non-linear Data
A natural question is how RIFLE performs when the underlying model is non-linear. To evaluate RIFLE and other methods, we generated jointly normal data similar to the experiment in Figure 10.2. Here, we have 5000 data points, and the data dimension is d = 5. The target variable has the following quadratic relationship with the input features:

y = x₁² + 3x₃² − 6x₅² − 0.9x₁x₄ + 9x₂x₃ + 3.2x₄x₅ − 1.7x₂x₅ − 5x₁ − 2x₃ + 7x₄ + 4.6.
We evaluated the performance of KNN-Imputer [333], MICE [56], Amelia [148], MissForest [318], and Expectation Maximization [95], alongside RIFLE and QRIFLE. QRIFLE is the application of RIFLE to the original data transformed by a polynomial kernel of degree 2. Although QRIFLE can learn quadratic models, the proportion of missing values in the new features (interaction terms) is higher than in the original data. For instance, if on average 50% of the entries are missing in the original features, 75% of the entries of the interaction terms will be missing. Moreover, the computational complexity increases, since with QRIFLE we have d² features instead of d. Figure 10.1 demonstrates the performance of the aforementioned methods on artificial data with 5000 samples containing different percentages of missing values. We generated 5 artificial datasets for each missing value percentage, and each method was run 5 times on these datasets. We report the average performance of each method in Figure 10.1. For small percentages of missing values, QRIFLE performs better than the other approaches. However, as the percentage of missing values increases, QRIFLE's performance drops, and RIFLE works much better than QRIFLE.

Figure 10.1: Performance of RIFLE, QRIFLE, MissForest, Amelia, KNN Imputer, MICE, and Expectation Maximization with respect to the percentage of missing values on quadratic artificial datasets.
10.1.7 Robust Quadratic Discriminant Analysis (Partially Missing Target Feature)
In Section 2.4, we formalized robust quadratic discriminant analysis assuming the target variable is fully available. In this appendix, we study Problem (2.12) when the target variable contains missing values. If the target feature contains missing values, the proposed algorithm for solving optimization problem (2.13) does not exploit the data points whose target feature is unavailable. However, such points can contain valuable statistical information about the underlying data distribution. Thus, we apply an Expectation Maximization (EM) procedure on the dataset as follows:

Assume a dataset consisting of n + m samples. Let (X₁, y₁), . . . , (Xₙ, yₙ) be n samples whose target variable is available, and let (X_{n+1}, z₁), . . . , (X_{n+m}, z_m) be m samples whose corresponding labels are missing. Similar to the previous case, we assume:

X_i | z_i = j ∼ N(μ_j, Σ_j),   j = 0, 1.

Thus, the probability of observing a data point x_i can be written as:

P(X_i = x_i) = π₀ P(X_i = x_i | z_i = 0) + π₁ P(X_i = x_i | z_i = 1) = π₀ N(x_i; μ₀, Σ₀) + π₁ N(x_i; μ₁, Σ₁).
The log-likelihood function can be formulated as follows:

ℓ(μ₀, Σ₀, μ₁, Σ₁) = Σ_{i=1}^{n+m} log( π₀ N(x_i; μ₀, Σ₀) + π₁ N(x_i; μ₁, Σ₁) ).
We apply the Expectation Maximization procedure to jointly update Σ₀, Σ₁, μ₀, μ₁, and the z_i's. Note that the posterior distribution of z_i can be written as:

P(Z_i = t | X_i = x_i) = P(X_i = x_i | Z_i = t) P(Z_i = t) / P(X_i = x_i) = π_t N(x_i; μ_t, Σ_t) / P(X_i = x_i).
We update the z_i values in the E-step by comparing the posterior probabilities of the two possible labels. Precisely, we assign label 1 to Z_i if and only if:

π₁ N(x_i; μ₁, Σ₁) > π₀ N(x_i; μ₀, Σ₀).
In the M-step, we estimate Σ₀, Σ₁, μ₀, μ₁, π₀, and π₁ while fixing the z_i values. Since in the M-step all labels (both the available y_i's and the z_i's estimated in the E-step) are assigned, the aforementioned parameters can be updated as follows:

μ₁[j] = (1/|S₁ ∩ T_j|) Σ_{i ∈ S₁ ∩ T_j} x_i[j]   (10.12)
μ₀[j] = (1/|S₀ ∩ T_j|) Σ_{i ∈ S₀ ∩ T_j} x_i[j]   (10.13)
Σ₁[i][j] = (1/|S₁ ∩ T_i ∩ T_j|) Σ_{t ∈ S₁ ∩ T_i ∩ T_j} x_t[i] x_t[j]   (10.14)
Σ₀[i][j] = (1/|S₀ ∩ T_i ∩ T_j|) Σ_{t ∈ S₀ ∩ T_i ∩ T_j} x_t[i] x_t[j]   (10.15)
π₁ = |S₁| / |S₀ ∪ S₁|   (10.16)
π₀ = |S₀| / |S₀ ∪ S₁|   (10.17)
We apply the M-step and E-step iteratively to obtain Σ₁ and Σ₀. Based on the random initialization of the z_i's, we can obtain different values for μ₀, μ₁, Σ₀, and Σ₁. Having these estimations, we apply Algorithm 3 to solve the robust normal discriminant analysis formulated in (2.13).
Algorithm 20 Expectation Maximization Procedure for Learning a Robust Normal Discriminant Analysis
1: Input: T: number of EM iterations, k: number of covariance estimations at each iteration.
2: Initialize: Set each missing label randomly to 0 or 1.
3: for i = 1, . . . , T do
4:   Estimate k covariance matrices by sampling with replacement from the available entries
5:   Find an optimal w for Problem (2.18)
6:   Update the missing labels using the new w obtained above.
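For concreteness, one E-step/M-step pair of the procedure above can be sketched as follows (complete-feature case, i.e., ignoring the availability sets T_j in (10.12)-(10.15); the −1 encoding of missing labels and all names are ours, and Σ_j follows the uncentered update (10.14)):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(X, y_obs, pi, mu, Sigma):
        """One E/M iteration for the missing-label model; y_obs holds observed labels,
        with -1 marking the missing ones."""
        # E-step: assign each unlabeled point the class with the larger posterior
        dens = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j]) for j in (0, 1)])
        z = np.where(y_obs >= 0, y_obs, dens.argmax(axis=0))
        # M-step: update priors, means, and (uncentered) second moments, Eqs. (10.12)-(10.17)
        for j in (0, 1):
            Xj = X[z == j]
            mu[j] = Xj.mean(axis=0)
            Sigma[j] = Xj.T @ Xj / len(Xj)
            pi[j] = len(Xj) / len(X)
        return z, pi, mu, Sigma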
10.1.8 Generating Missing Values Patterns in Numerical Experiments
In this appendix, we define MCAR and MNAR patterns and discuss how to generate them in a given dataset. Formally, the distribution of missing values in a dataset follows a missing completely at random (MCAR) pattern if the probability of having a missing value for a given entry is constant, independent of the other available and missing entries. On the other hand, a dataset follows a missing at random (MAR) pattern if the missingness of each entry depends only on the available data of other features. Finally, if the distribution of missing values follows neither an MCAR nor a MAR pattern, we call it missing not at random (MNAR).

To generate the MCAR pattern on a given dataset, we fix a constant probability 0 < p < 1 and make each data entry unavailable with probability p. The generation of the MNAR pattern, on the other hand, is based on the idea that the farther the value of an entry is from the mean of its corresponding feature, the larger the probability of missingness for that entry. Algorithm 21 describes the procedure for generating MNAR missing values for a given column of a dataset:
Algorithm 21 Generating MNAR Pattern for a Given Column of a Dataset
1: Input: x₁, x₂, . . . , xₙ: the entries of the current column in the dataset; a, b: hyper-parameters controlling the percentage of missing values
2: Initialize: Set μ = (1/n) Σ_{i=1}^n x_i and σ² = (1/n) Σ_{i=1}^n x_i² − μ².
3: for i = 1, . . . , n do
4:   x′_i = (x_i − μ)/σ
5:   p_i = F(a|x′_i| + b)
6:   Set x_i = ∗ with probability p_i
Note that F in the above algorithm is the cumulative distribution function of a standard Gaussian random variable; a and b control the percentage of missing values in the given column. As a and b increase, the probability of having more missing values grows. Since the availability of each data entry depends on its value, the generated missing pattern is missing not at random (MNAR).
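A numpy sketch of Algorithm 21 (with np.nan standing in for the ∗ symbol; names are ours):

    import numpy as np
    from scipy.stats import norm

    def apply_mnar(column, a, b, rng=None):
        """Algorithm 21: mask entries far from the column mean with higher probability."""
        if rng is None:
            rng = np.random.default_rng(0)
        mu, sigma = column.mean(), column.std()
        p = norm.cdf(a * np.abs((column - mu) / sigma) + b)   # F is the standard normal CDF
        out = column.astype(float).copy()
        out[rng.random(len(column)) < p] = np.nan             # np.nan plays the role of '*'
        return out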
10.1.9 Proof of Lemmas and Theorems
In this appendix, we prove all lemmas and theorems presented in the chapter. First, we prove the following lemma, which is useful in several of the convergence proofs.

Lemma 35 Let θ*(b, C) = argmin_θ θ^T Cθ − 2b^T θ + λ∥θ∥². Assume that for any given b and C, ∥θ*(b, C)∥ ≤ τ. Then, the Lipschitz constant of the gradient of the function g(b, C) = min_θ θ^T Cθ − 2b^T θ + λ∥θ∥², used in Problem (2.7), equals L = 2(τ + 1)²/λ.
Proof: Since the problem is convex in θ and concave in C and b, we have:

min_θ max_{C,b} θ^T Cθ − 2b^T θ + λ∥θ∥² = − min_{C,b} max_θ −θ^T Cθ + 2b^T θ − λ∥θ∥².

Let h(θ, C, b) ≜ −θ^T Cθ + 2b^T θ − λ∥θ∥². Define L₁₁ and L₁₂ via:

∥∇_{b,C} h(θ, b₁, C₁) − ∇_{b,C} h(θ, b₂, C₂)∥ ≤ L₁₁ ∥(C₁, b₁) − (C₂, b₂)∥,
∥∇_θ h(θ, b₁, C₁) − ∇_θ h(θ, b₂, C₂)∥ ≤ L₁₂ ∥(C₁, b₁) − (C₂, b₂)∥.

h(θ, b, C) is linear (hence convex) in C and b and strongly concave with respect to θ. According to Lemma 1 in [31], g′ = −g = max_θ h(θ, b, C) has a Lipschitz-continuous gradient with constant:

L_g = L_{g′} = L₁₁ + L₁₂²/σ,

where σ = 2λ is the strong-concavity modulus of −θ^T Cθ + 2b^T θ − λ∥θ∥². Note that

∇_{b,C} h(θ, b, C) = (−θθ^T, 2θ)   ⟹   ∇_{b,C} h(θ, b₁, C₁) − ∇_{b,C} h(θ, b₂, C₂) = 0.

Thus, L₁₁ = 0. On the other hand,

∇_θ h(θ, b, C) = −2Cθ + 2b − 2λθ   ⟹
∥∇_θ h(θ, b₁, C₁) − ∇_θ h(θ, b₂, C₂)∥ = ∥−2(C₁ − C₂)θ + 2(b₁ − b₂)∥
   ≤ 2∥C₁ − C₂∥₂ ∥θ∥₂ + 2∥b₁ − b₂∥₂
   ≤ 2∥(C₁, b₁) − (C₂, b₂)∥₂ ∥θ∥₂ + 2∥(C₁, b₁) − (C₂, b₂)∥₂
   = (2∥θ∥₂ + 2)∥(C₁, b₁) − (C₂, b₂)∥₂.

Therefore, L₁₂ = 2 max∥θ∥₂ + 2, which means L_g = 2(max∥θ∥ + 1)²/λ. Note that θ is computed exactly in Algorithm 1 and Algorithm 17 at each iteration. Thus, during the optimization procedure, the norm of θ is bounded by the maximum norm of θ*(b, C) over the uncertainty set:

max ∥θ∥₂ ≤ max_{b,C} ∥θ*(b, C)∥ ≤ τ.

As a result, L_g = 2(τ + 1)²/λ. □
Proof of Theorem 1: Since the set of feasible solutions for b and C is compact and the function g is concave with respect to b and C, the projected gradient ascent algorithm converges to the global maximizer of g in T = O(LD/ϵ) iterations [53, Theorem 3.3], where D = ∥C₀ − C*∥²_F + ∥b₀ − b*∥²₂ and L is the Lipschitz constant of the gradient of g, which equals 2(τ + 1)²/λ according to Lemma 35.
Proof of Theorem 31: Algorithm 17 applies the projected Nesterov acceleration method to the concave function g. As proved in [240], the convergence rate of this method matches the lower bound of first-order oracles for general convex minimization (concave maximization) problems, which is O(√(LD²/ϵ)). The Lipschitz constant L appearing in the iteration complexity bound is given by Lemma 35.
Proof of Theorem 32: First, note that if we multiply the objective function by −1, Problem (2.6) can be equivalently formulated as:

max_θ min_{C,b} −θ^T Cθ + 2b^T θ − λ∥θ∥²₂
s.t. −C + C_min ≤ 0,
     C − C_max ≤ 0,
     −b + b_min ≤ 0,
     b − b_max ≤ 0,
     −C ⪯ 0.   (10.18)

If we assign A, B, d, e, and H to the constraints, respectively, then the Lagrangian function can be written as:

L(C, b, A, B, d, e, H) = −θ^T Cθ + 2b^T θ + ⟨A, −C + C_min⟩ + ⟨B, C − C_max⟩ + ⟨d, −b + b_min⟩ + ⟨e, b − b_max⟩ − ⟨C, H⟩ − λ∥θ∥²₂.   (10.19)

The dual problem is defined as:

max_{A,B,d,e,H} min_{C,b} L(C, b, A, B, d, e, H).   (10.20)

The minimization of L takes the following form:

min_{C,b} ⟨C, −θθ^T − A + B − H⟩ + ⟨b, 2θ − d + e⟩ − λ∥θ∥²₂ − ⟨B, C_max⟩ + ⟨A, C_min⟩ − ⟨e, b_max⟩ + ⟨d, b_min⟩.   (10.21)

To avoid a value of −∞ for the above minimization problem, we must set −θθ^T − A + B − H and 2θ − d + e to zero. Thus, the dual problem of (10.18) is formulated as:

max_{A,B,d,e,H} b_min^T d − b_max^T e + ⟨C_min, A⟩ − ⟨C_max, B⟩ − λ∥θ∥²₂
s.t. −θθ^T − A + B − H = 0,
     2θ − d + e = 0,
     A, B, d, e ≥ 0,
     H ⪰ 0.   (10.22)

Since the duality gap is zero, Problem (2.6) can be equivalently formulated as:

max_{θ,A,B,d,e,H} b_min^T d − b_max^T e + ⟨C_min, A⟩ − ⟨C_max, B⟩ − λ∥θ∥²
s.t. −θθ^T − A + B − H = 0,
     2θ − d + e = 0,
     A, B, d, e ≥ 0,
     H ⪰ 0.   (10.23)

Multiplying the objective function by −1 and changing the maximization to a minimization gives the dual problem described in (10.4).
Proof of Theorem 2:
(a) Let ∆_n be the estimated confidence matrix obtained from n samples. The first part of the theorem holds if ∆_n converges to 0 as n, the number of samples, goes to infinity (the same argument works for b and δ). Assume that {(x_{i1}, x_{i2})}ⁿ_{i=1} is an i.i.d. bootstrap sample over the data points for which both features X₁ and X₂ are available. Since the distribution of missing values is completely at random (MCAR), we have E[x_{i1} x_{i2}] = E[X₁X₂]. Therefore, E[(1/n) Σ_{i=1}^n x_{i1} x_{i2}] = E[X₁X₂]. Moreover, since the samples are drawn independently, Var[(1/n) Σ_{i=1}^n x_{i1} x_{i2}] = (1/n²) Σ_{i=1}^n Var[x_{i1} x_{i2}] = (n/n²) Var[X₁X₂] = (1/n) Var[X₁X₂]. Since the variance of the product of every two features is bounded, by the weak law of large numbers:

lim_{n→∞} Pr( |(1/n) Σ_{i=1}^n x_{i1} x_{i2} − E[X₁X₂]| ≥ ϵ ) = 0.

Therefore, for any given bootstrap sample of features X₁ and X₂, the estimation converges in probability to the ground-truth value. This means the size of the confidence interval ∆₁₂ converges in probability to 0; hence the estimator of E[X₁X₂] is consistent by the definition of consistency. The same argument proves the consistency of the estimator for any given pair of features X_i and X_j.
(b) Fix two features i and j. Let (X̂_{i1}, X̂_{j1}), . . . , (X̂_{im}, X̂_{jm}) be m = ⌈n(1 − p)⌉ i.i.d. pairs sampled via bootstrap from the entries where both features i and j are available. Define Z_t = X̂_{it} X̂_{jt} (for simplicity, we suppress the dependence of Z_t on i and j in the notation). Assume that we initialize C_ij = (1/m) Σ_{t=1}^m Z_t. Note that E[Z_t] = E[X̂_{it} X̂_{jt}] = C*_{ij}. According to Chebyshev's inequality, we have:

Pr[ |(1/m) Σ_{t=1}^m Z_t − E[Z_t]| ≥ c∆_ij ] ≤ Var((1/m) Σ_{t=1}^m Z_t) / (c²∆²_ij).

Note that the Z_t's are i.i.d. samples; thus:

Var((1/m) Σ_{t=1}^m Z_t) = (1/m) Var(Z_t) ≤ (1/m) max_{i,j} Var(X̂_i X̂_j) = V/m = V/(n(1 − p)).

Let ∆ = min{∆_ij}. Then, based on the two inequalities above, we have:

Pr[ |C₀[i][j] − C*[i][j]| ≥ c∆ ] ≤ V / (c²∆²n(1 − p)).

Using a union bound argument, with probability at least 1 − Vd²/(2c²∆²n(1 − p)), we have C₀ − c∆ ≤ C* ≤ C₀ + c∆, which means the actual covariance matrix lies within the confidence intervals we have considered. In that case, for (θ̃, b̃, C̃), we have:

θ̃^T C̃ θ̃ − 2b̃^T θ̃ = max_{C,b} θ̃^T C θ̃ − 2b^T θ̃ ≥ θ̃^T C* θ̃ − 2b*^T θ̃,

which completes the proof.
Proof of Theorem 3: Since the objective function is convex with respect to μ₁ and the constraint set on μ₁ is closed and bounded (compact), the problem is a convex maximization, and an optimal solution exists on the boundary. Therefore, each entry of μ₁ takes either μ_min[i] or μ_max[i], which gives the solution provided in the theorem.
10.1.10 Dataset Descriptions
In this section, we introduce the datasets used in Section 2.5 to evaluate the performance of RIFLE on regression and classification tasks. Except for NHANES, none of the datasets contain missing values; for those datasets, we generate MCAR and MNAR missing values artificially (for MNAR patterns, we apply Algorithm 21 to the datasets).
Datasets for Evaluating RIFLE on Regression and Imputation Tasks
• NHANES: The percentage of missing values varies across the features of the NHANES dataset. There are two sources of missing values in NHANES data: entries missing during data collection, and entries missing as a result of merging different datasets in the NHANES collection. On average, approximately 20% of the data is missing.

• Super Conductivity∗: The Super Conductivity dataset contains 21263 samples describing superconductors via 81 relevant attributes. All features are continuous, and the assigned task is to predict the critical temperature from the given 81 features. We have used this dataset in the experiments summarized in Figure 10.3, Figure 10.4, and Table 2.2.

• BlogFeedback†: The BlogFeedback data is a collection of 280 features extracted from HTML documents of blog posts. The assigned task is to predict the number of comments in the upcoming 24 hours based on the features of more than 60K training data points. The test dataset is fixed and was originally separated from the training data. The dataset is used in the experiments described in Table 2.2.

∗ https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data
† https://archive.ics.uci.edu/ml/datasets/BlogFeedback
• Breast Cancer (Prognostic)‡: The dataset consists of 34 features and 198 instances. Each record represents follow-up data for one breast cancer case collected in 1984. We have run several experiments imputing artificially generated MCAR missing values with different proportions; the results are depicted in Table 2.1 and Figure 2.4.

• Parkinson§: The dataset describes a range of biomedical voice recordings from 31 people, 23 of whom have Parkinson's disease (PD). The assigned task is to discriminate healthy people from those with PD. There are 193 records and 23 features in the dataset. The dataset is processed similarly to the Breast Cancer dataset and used in the same experiments.

• Spam Base¶: The dataset consists of 4601 instances and 57 attributes. The assigned classification task is to predict whether an email is spam. To evaluate different imputation methods, we randomly mask a proportion of the data entries and impute them with the different approaches. The results are depicted in Table 2.1 and Figure 2.4.

• Boston Housing∥: The Boston Housing dataset contains 506 instances and 14 columns. We generate random missing entries with different proportions and impute them with RIFLE and several state-of-the-art approaches. The results are demonstrated in Table 2.1 and Figure 2.4.

• Cloud∗∗: The dataset has 1024 instances and 10 features extracted from cloud images. We use this dataset in the experiments depicted in Table 2.1, with 70% artificial MCAR missing values.

‡ https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Prognostic)
§ https://archive.ics.uci.edu/ml/datasets/parkinsons
¶ https://archive.ics.uci.edu/ml/datasets/spambase
∥ https://www.kaggle.com/c/boston-housing
∗∗ https://archive.ics.uci.edu/ml/datasets/Cloud
• Wave Energy Converters (https://archive.ics.uci.edu/ml/datasets/Wave+Energy+Converters): We sample a subset of 3000 instances with 49 features from the original Wave Energy Converters dataset. We have executed several imputation methods on the dataset, and the results are shown in Figure 2.4.
• Sensorless Drive Diagnosis (https://archive.ics.uci.edu/ml/datasets/dataset+for+sensorless+drive+diagnosis): The 49 continuous features in this dataset are extracted from electric current drive signals, and the associated classification task is to determine the condition of the device's motor. We choose different random samples of size 400 to run the imputation experiments in Figure 2.5.
Datasets for Evaluating Robust QDA on Classification Tasks
• Avila (https://archive.ics.uci.edu/ml/datasets/Avila): The Avila dataset consists of 10 attributes extracted from 800 images of the "Avila Bible". The associated classification task is to match each pattern (an instance of the dataset) to a copyist. We apply 40% MCAR missing values (on both input features and the target variable) to 10 different random samples of the dataset of size 1000. The average accuracy of the robust LDA method over the 10 datasets is demonstrated in Figure 2.7 for each value of k (the number of covariance estimations).
• Magic Gamma Telescope (https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope): The dataset consists of 11 continuous Monte-Carlo-generated features contributing to the prediction of the type of event (signal or background). We used the same procedure as for the above dataset for the results depicted in Figure 2.7 (randomly sampling subsets of 1000 data points out of more than 19,000).
• Glass Identification (https://archive.ics.uci.edu/ml/datasets/glass+identification): This dataset is composed of 10 continuous features and 214 instances. The assigned classification task is to predict the type of glass based on the materials used for making it. We have applied 40% MCAR missing values to the dataset for the experiments reported in Table 2.4.
• Annealing (https://archive.ics.uci.edu/ml/datasets/Annealing): This dataset is a mix of categorical and numerical features (37 in total), and the associated task is to predict the class (5 classes) of the instances (metals). The dataset contains 798 instances. We use 500 data points as the training data and the rest as the test data. We apply 40% MCAR missing values to both input features and the target variable. The accuracy of the different models is reported in Table 2.4.
• Abalone (https://archive.ics.uci.edu/ml/datasets/abalone): This dataset consists of 4177 instances and 8 categorical and continuous features. The goal is to predict the age of an abalone based on physical measurements. The first 1000 samples are used as the training data and the rest as the test data. We applied the same pre-processing procedure as for the above dataset to generate missing values on the training data. The accuracy of the different models is reported in Table 2.4.
• Lymphography (https://archive.ics.uci.edu/ml/datasets/Lymphography): Lymphography is a categorical dataset containing 18 features and 148 data points obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. 100 data points are used as the training data; the rest are test data points (with no missing values). We applied the same pre-processing described for the above dataset to generate MCAR missing values.
• Adult (https://archive.ics.uci.edu/ml/datasets/adult): The Adult dataset contains census information on individuals, including education, gender, and capital gain. The assigned classification task is to predict whether a person earns over 50K annually. The train and test sets are two separate files consisting of 32,000 and 16,000 samples, respectively. We consider gender and race as the sensitive attributes (for the experiments involving one sensitive attribute, we have chosen gender). Learning a logistic regression model on the training dataset (without imposing fairness) shows that only 3 out of 14 features have larger weights than the gender attribute.
10.1.11 Further Discussion on the Consistency of RIFLE
The three algorithms developed in Section 2.3 for solving robust ridge regression are all consistent. To show this, we generated a synthetic dataset with 50 input features following a jointly normal distribution. As observed in Figure 10.2, as the number of samples increases, the NRMSE of all three algorithms converges to 0.01, which is the standard deviation of the zero-mean Gaussian noise added to each target value (the dashed line). The same pattern can be observed for different percentages of missing values.
Figure 10.2: Consistency of ADMM (Algorithm 19) and Projected Gradient Ascent on function g (Algorithm 1) on the synthetic datasets with 40%, 60% and 80% missing values.
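A minimal NumPy sketch of how such a consistency experiment can be set up is given below; the covariance construction and the (commented-out) rifle_fit call are hypothetical placeholders, while the feature count, noise level, and MCAR masking follow the description above.

```python
import numpy as np

def make_synthetic(n, d=50, noise_std=0.01, missing_frac=0.4, seed=0):
    """Jointly normal features, linear targets with Gaussian noise, MCAR mask."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, d))
    cov = A @ A.T / d                          # a random positive-definite covariance (assumed)
    X = rng.multivariate_normal(np.zeros(d), cov, size=n)
    beta = rng.standard_normal(d)
    y = X @ beta + noise_std * rng.standard_normal(n)
    mask = rng.random(X.shape) < missing_frac  # True marks an entry missing completely at random
    return np.where(mask, np.nan, X), y

for n in [500, 2000, 8000, 32000]:
    X_obs, y = make_synthetic(n)
    print(n, np.isnan(X_obs).mean())           # fraction of missing entries, ~0.4
    # beta_hat = rifle_fit(X_obs, y)           # hypothetical call to the robust estimator;
    # as n grows, the NRMSE should approach the noise floor (0.01)
```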
10.1.12 Numerical Experiments for Convergence of RIFLE Algorithms
We presented three algorithms for solving the robust linear regression problem formulated in (2.6): projected gradient ascent (Algorithm 1), Nesterov's acceleration method (Algorithm 17), and the Alternating Direction Method of Multipliers (ADMM, Algorithm 19) applied to the dual problem.
Figure 10.3: Convergence of the ADMM algorithm to the optimal solution of Problem (10.4) for different values of ρ. The left plot measures the objective function of Problem (10.4) per iteration (without considering the constraints), while the right plot demonstrates the constraint violation of the algorithm per iteration. The constraint violation can be measured by summing the penalty terms of the augmented Lagrangian function formulated in Problem (10.6).
We established the convergence rates of the gradient ascent and Nesterov acceleration methods in Theorem 1 and Theorem 31, respectively. To investigate the convergence of the ADMM algorithm and its dependence on ρ, we run Algorithm 19 on the Super Conductivity dataset (described in Appendix 10.1.10) with 30% MCAR missing values. Figure 10.3 demonstrates the convergence of the ADMM algorithm for multiple values of ρ on this dataset. As can be observed, decreasing the value of ρ accelerates the convergence of ADMM to the optimal value. Note that for ρ = 0.2, the objective function is smaller than the final value in the first few iterations; the reason is that for those iterations the solution is not feasible (as observed in the right plot). The final solution is the optimal feasible solution.
In the next experiment, we compare the three proposed algorithms in terms of the number of iterations required to reach a certain level of test accuracy on the Super Conductivity dataset. The number of training samples is 1000, with 40% MCAR missing values on both input features and the target variable. The test dataset contains 2000 samples. As depicted in Figure 10.4, the ADMM and Nesterov methods require fewer iterations to reach an ϵ-optimal solution than Algorithm 1. However, the per-iteration cost of the ADMM algorithm (Algorithm 19) is higher than that of the Nesterov acceleration method and Algorithm 1.
Figure 10.4: The performance of the Nesterov acceleration method, projected gradient ascent, and ADMM
on the Super Conductivity dataset vs. the number of iterations.
10.1.13 Execution Time Comparison of RIFLE and Other State-of-the-art Approaches
This section reports the average execution time of RIFLE and the other approaches presented in Table 2.2.
Method                       | Super Conductivity | Blog Feedback | NHANES
-----------------------------|--------------------|---------------|--------
Regression on Complete Data  | 0.3 sec            | 0.7 sec       | 0.4 sec
RIFLE                        | 87 sec             | 471 sec       | 125 sec
Mean Imputer + Regression    | 0.4 sec            | 0.9 sec       | 0.5 sec
MICE + Regression            | 112 sec            | 573 sec       | 294 sec
EM + Regression              | 171 sec            | 612 sec       | 351 sec
MIDA Imputer + Regression    | 245 sec            | 726 sec       | 599 sec
MissForest                   | 94 sec             | 321 sec       | 132 sec
KNN Imputer                  | 66 sec             | 292 sec       | 144 sec

Table 10.1: Execution time of RIFLE and other SOTA methods on three datasets.
10.2 Chapter 4 Appendix
10.2.1 Implementation Details
In Table 10.2, we present the structure of the deep network used in the experiments of Table 10.3.
Layer           | Size
----------------|------------------
Conv            | 64 filters, 3×3
Conv            | 64 filters, 3×3
Conv            | 128 filters, 3×3
Conv            | 128 filters, 3×3
Fully Connected | 512 units
Linear          | 10 units

Table 10.2: Structure of the deep network used in the experiments of Table 10.3.
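The following PyTorch sketch shows one way to realize this architecture; the strides, padding, activation placement, and the flatten dimension (which assumes 32×32 CIFAR-10 inputs and no pooling) are assumptions, since the table only lists layer types and sizes.

```python
import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    """Sketch of the Table 10.2 architecture (stride/padding/ReLU placement assumed)."""
    def __init__(self, in_channels=3, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 32 * 32, 512), nn.ReLU(),  # assumes 32x32 inputs
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

net = SmallConvNet()
print(net(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 10])
```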
1. For MNIST, we train on a single Nvidia V100 GPU for 100 epochs with a batch size of 100. The total number of training steps is 60K. We decay the learning rate by 10× at steps 15K and 25K. We use warm-up and ramp-up durations of 2K and 10K steps, respectively. We do not use any data augmentation techniques and use the full 28 × 28 images without any normalization.
2. For CIFAR-10, we train for 3200 epochs with a batch size of 1600. The total number of training steps is 100K. We decay the learning rate by 10× at steps 60K and 90K. We use warm-up and ramp-up durations of 5K and 50K steps, respectively. During training, we add random translations and flips and normalize each image channel (using the channel statistics from the training set).
10.2.2 Bregman-Divergence Method for Optimizing a Convex Function Over a Probability Simplex
In this section, we use the Bregman-divergence method to minimize a convex function over a probability simplex. Let η be a vector with n entries. We aim to solve the following constrained optimization problem, where J is a convex function of η:
\[
\min_{\eta_1, \dots, \eta_n} J(\eta_1, \dots, \eta_n) \quad \text{subject to} \quad \sum_{i=1}^{n} \eta_i = 1, \;\; \eta_i \ge 0 \;\; \forall i = 1, \dots, n. \tag{10.24}
\]
To solve the above problem, we define the Bregman distance function as
\[
B(x, y) = \gamma(x) - \gamma(y) - \langle \nabla \gamma(y),\, x - y \rangle,
\]
where γ is a strictly convex function. For this specific problem, where the constraint set is a probability simplex, we choose $\gamma(x) = \sum_{i=1}^{n} x_i \log(x_i)$; on the simplex this yields
\[
B(x, y) = \sum_{i=1}^{n} x_i \log\Big(\frac{x_i}{y_i}\Big).
\]
One can rewrite problem (10.24) as
\[
\min_{\eta_1, \dots, \eta_n} J(\eta_1, \dots, \eta_n) + I_{\mathcal{P}}(\eta), \tag{10.25}
\]
where $\mathcal{P} = \{\eta : \sum_{i=1}^{n} \eta_i = 1,\ \eta_i \ge 0\ \forall i\}$ and $I_{\mathcal{P}}$ denotes its indicator function. Applying the proximal gradient method with the Bregman distance to the above problem, we have:
\[
\eta^{r+1} = \operatorname*{argmin}_{\eta} \; I_{\mathcal{P}}(\eta) + \langle \nabla J(\eta^{r}),\, \eta - \eta^{r} \rangle + \frac{1}{2\nu} B(\eta, \eta^{r}) \tag{10.26}
\]
\[
= \operatorname*{argmin}_{\eta \in \mathcal{P}} \; \sum_{i=1}^{n} \frac{\partial J(\eta^{r})}{\partial \eta_i}(\eta_i - \eta_i^{r}) + \frac{1}{2\nu}\left(\sum_{i=1}^{n} \eta_i \log(\eta_i) - \sum_{i=1}^{n} \frac{\partial \gamma(\eta^{r})}{\partial \eta_i}(\eta_i - \eta_i^{r})\right). \tag{10.27}
\]
Simplifying the above problem, it becomes:
\[
\eta^{r+1} = \operatorname*{argmin}_{\eta} \; \sum_{i=1}^{n} \eta_i \left(\frac{\partial J(\eta^{r})}{\partial \eta_i} - \frac{1}{2\nu}\log(\eta_i^{r}) - \frac{1}{2\nu}\right) + \frac{1}{2\nu}\sum_{i=1}^{n} \eta_i \log(\eta_i) \tag{10.28}
\]
\[
\text{subject to} \quad \sum_{i=1}^{n} \eta_i = 1, \;\; \eta_i \ge 0 \;\; \forall i = 1, \dots, n. \tag{10.29}
\]
Writing the Lagrangian of the above problem, we have:
\[
\eta^{r+1} = \operatorname*{argmin}_{\eta \ge 0} \; \sum_{i=1}^{n} \eta_i \left(\frac{\partial J(\eta^{r})}{\partial \eta_i} - \frac{1}{2\nu}\log(\eta_i^{r}) - \frac{1}{2\nu}\right) + \frac{1}{2\nu}\sum_{i=1}^{n} \eta_i \log(\eta_i) + \lambda^{*}\Big(\sum_{i=1}^{n} \eta_i - 1\Big). \tag{10.30}
\]
Taking the derivative with respect to $\eta_i$ and using the constraint $\sum_{i=1}^{n} \eta_i = 1$, it can be shown that:
\[
\eta_i^{r+1} = \frac{\eta_i^{r} \exp\big(-2\nu\, \nabla J(\eta^{r})_i\big)}{\sum_{j=1}^{n} \eta_j^{r} \exp\big(-2\nu\, \nabla J(\eta^{r})_j\big)}. \tag{10.31}
\]
We use the update rule (10.31) in Algorithm 6 and Algorithm 7 to obtain the optimal η at each iteration.
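Update (10.31) is an exponentiated-gradient (mirror-descent) step and is straightforward to implement; below is a minimal sketch assuming only a gradient oracle for J (the quadratic objective used in the usage example is purely illustrative).

```python
import numpy as np

def simplex_mirror_descent(grad_J, n, nu=0.05, iters=200):
    """Minimize a convex J over the probability simplex via update (10.31)."""
    eta = np.full(n, 1.0 / n)                 # start at the simplex center
    for _ in range(iters):
        w = eta * np.exp(-2.0 * nu * grad_J(eta))
        eta = w / w.sum()                     # renormalize onto the simplex
    return eta

# Illustrative J(eta) = ||eta - c||^2 with c outside the simplex.
c = np.array([0.7, 0.5, -0.2])
eta_star = simplex_mirror_descent(lambda e: 2.0 * (e - c), n=3)
print(eta_star, eta_star.sum())               # a point on the simplex
```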
10.2.3 Proof of Theorems
In this section, we prove Theorem 6 and Theorem 8.
Proof of Theorem 6: Starting from Equation (4.9), we can equivalently formulate it as:
\[
\min_{z \in Z(x_0, \epsilon)} \max\big(c_{yk}^{T} z,\, c_{a_1 k}^{T} z,\, \dots,\, c_{a_M k}^{T} z\big) = \min_{z \in Z(x_0, \epsilon)} \;\max_{\{\eta_0, \dots, \eta_M\} \in \mathcal{P}}\; c_k(\eta)^{T} z. \tag{10.32}
\]
Note that the maximum element of the left-hand side can be obtained by setting its corresponding η coefficient to 1 on the right-hand side. Conversely, any optimal solution of the right-hand side is exactly equal to the maximum element of the left-hand side. By the min-max inequality (weak duality), when the minimum and the maximum are interchanged, the following holds:
\[
\min_{z \in Z(x_0, \epsilon)} \;\max_{\{\eta_0, \dots, \eta_M\} \in \mathcal{P}}\; \eta_0 c_{yk}^{T} z + \eta_1 c_{a_1 k}^{T} z + \dots + \eta_M c_{a_M k}^{T} z \;\ge\; \max_{\{\eta_0, \dots, \eta_M\} \in \mathcal{P}} \;\min_{z \in Z(x_0, \epsilon)}\; \eta_0 c_{yk}^{T} z + \eta_1 c_{a_1 k}^{T} z + \dots + \eta_M c_{a_M k}^{T} z. \tag{10.33}
\]
Moreover, by the definition of the upper and lower bounds presented in [130], $Z(x_0, \epsilon)$ is a subset of $\{z : \underline{z}_L \le z \le \bar{z}_L\}$. Thus:
\[
\max_{\{\eta_0, \dots, \eta_M\} \in \mathcal{P}} \;\min_{z \in Z(x_0, \epsilon)}\; \eta_0 c_{yk}^{T} z + \dots + \eta_M c_{a_M k}^{T} z \;\ge\; \max_{\{\eta_0, \dots, \eta_M\} \in \mathcal{P}} \;\min_{\underline{z}_L \le z \le \bar{z}_L}\; \eta_0 c_{yk}^{T} z + \dots + \eta_M c_{a_M k}^{T} z. \tag{10.34}
\]
Combining Equality (10.32) with (10.33) and (10.34), we have:
\[
\min_{z \in Z(x_0, \epsilon)} \max\big(c_{yk}^{T} z,\, c_{a_1 k}^{T} z,\, \dots,\, c_{a_M k}^{T} z\big) \;\ge\; \max_{\{\eta_0, \dots, \eta_M\} \in \mathcal{P}} \;\min_{\underline{z}_L \le z \le \bar{z}_L}\; c_k(\eta)^{T} z. \tag{10.35}
\]
Since $z_L = W_L z_{L-1} + b_L$, the right-hand side of the above inequality can be rewritten as:
\[
\min_{z \in Z(x_0, \epsilon)} \max\big(c_{yk}^{T} z,\, c_{a_1 k}^{T} z,\, \dots,\, c_{a_M k}^{T} z\big) \;\ge\; \max_{\eta \in \mathcal{P}} \;\min_{\underline{z}_{L-1} \le z \le \bar{z}_{L-1}}\; c(\eta)^{T}\big(W_L z + b_L\big),
\]
which is exactly the claim of Theorem 6.
Proof of Theorem 8: For simplicity of presentation, assume that $a_0 = y$. Partition the set of possible values of $z_L$ into the following sets:
\[
\hat{Z}_{a_i} = \big\{ z_L \;\big|\; [z_L]_{a_i} \ge [z_L]_{a_j} \;\; \forall j \ne i \big\}.
\]
If $z_L \in \hat{Z}_{a_i}$, then
\[
[z_L]_{a_i} - [z_L]_{k} \ge [z_L]_{a_j} - [z_L]_{k} \;\; \forall j \ne i \;\Rightarrow\; [z_L]_{a_i} - [z_L]_{k} = \max_{i=0,\dots,M}\big\{[z_L]_{a_i} - [z_L]_{k}\big\} = \max_{i \in \{0,\dots,M\}}\big\{c_{a_i, k}^{T} z_L\big\}.
\]
Thus:
\[
\begin{aligned}
[z_L]_{a_i} - [z_L]_{k} = \max_{i=0,\dots,M}\{c_{a_i, k}^{T} z_L\}
&\ge \min_{z_L \in Z(x_0, \epsilon)} \;\max_{i=0,\dots,M}\{c_{a_i, k}^{T} z_L\} \\
&= \min_{z_{L-1} \in Z_{L-1}(x_0, \epsilon)} \;\max_{i=0,\dots,M}\{c_{a_i, k}^{T}(W_L z_{L-1} + b_L)\} \\
&\ge \min_{\underline{z} \le z_{L-1} \le \bar{z}} \;\max_{i=0,\dots,M}\{c_{a_i, k}^{T}(W_L z_{L-1} + b_L)\} \\
&= \min_{\underline{z} \le z_{L-1} \le \bar{z}} \;\max_{\eta \in \mathcal{P}}\; c(\eta)^{T}(W_L z_{L-1} + b_L). \tag{10.36}
\end{aligned}
\]
Note that the second inequality holds since the minimum on the right-hand side is taken over a larger set. Using the min-max inequality:
\[
\min_{\underline{z} \le z_{L-1} \le \bar{z}} \;\max_{\eta \in \mathcal{P}}\; c(\eta)^{T}(W_L z_{L-1} + b_L) \;\ge\; \max_{\eta \in \mathcal{P}} \;\min_{\underline{z} \le z_{L-1} \le \bar{z}}\; c(\eta)^{T}(W_L z_{L-1} + b_L) = -J_k(\eta). \tag{10.37}
\]
Combining (10.36) and (10.37) and multiplying both sides by −1, we obtain:
\[
[z_L]_{k} - [z_L]_{a_i} \le J_k(\eta). \tag{10.38}
\]
On the other hand:
\[
\max_{\|\delta\|_\infty \le \epsilon} \;\min_{m=0,\dots,M}\; \ell_{\text{xent}\backslash A_m}\big(z_L(x+\delta),\, a_m\big) \;\le\; \max_{\|\delta\|_\infty \le \epsilon}\; \ell_{\text{xent}\backslash A_i}\big(z_L(x+\delta),\, a_i\big) \;\le\; \max_{\underline{z}_{L-1} \le z \le \bar{z}_{L-1}}\; \ell_{\text{xent}\backslash A_i}(z_L) \;\; \text{s.t.} \;\; z_L = W_L z_{L-1} + b_L. \tag{10.39}
\]
Moreover, by the shift-invariance property of the cross-entropy loss, we have:
\[
\ell_{\text{xent}\backslash A_i}(z_L) = \ell_{\text{xent}\backslash A_i}\big(z_L - [z_L]_{a_i}\mathbf{1}\big). \tag{10.40}
\]
Combining (10.38), (10.39), and (10.40), we have:
\[
\begin{aligned}
\max_{\|\delta\|_\infty \le \epsilon} \;\min_{m=0,\dots,M}\; \ell_{\text{xent}\backslash A_m}\big(z_L(x+\delta),\, a_m\big)
&\le \max_{\underline{z}_{L-1} \le z_{L-1} \le \bar{z}_{L-1}}\; \ell_{\text{xent}\backslash A_i}(z_L) \;\; \text{s.t.} \;\; z_L = W_L z_{L-1} + b_L \\
&= \max_{\underline{z}_{L-1} \le z_{L-1} \le \bar{z}_{L-1}}\; \ell_{\text{xent}\backslash A_i}\big(z_L - [z_L]_{a_i}\mathbf{1}\big) \;\; \text{s.t.} \;\; z_L = W_L z_{L-1} + b_L \\
&\le \max_{\underline{z}_{L-1} \le z_{L-1} \le \bar{z}_{L-1}}\; \ell_{\text{xent}\backslash A_i}\big(J_k(\eta),\, a_i\big) \\
&= \max_{\underline{z}_{L-1} \le z_{L-1} \le \bar{z}_{L-1}}\; \ell_{\text{xent}\backslash A_0}\big(J_k(\eta),\, a_0\big).
\end{aligned}
\]
Summing over all data points proves the desired result.
10.2.4 Details of β-Crown
In this section, we show how β-crown sub-problems can be obtained for neural networks without abstain classes and with multiple abstain classes, respectively. Before proceeding, we state a few definitions and lemmas.
Lemma 36 [368, Theorem 15] Given two vectors u and v, the following inequality holds:
\[
v^{\top}\mathrm{ReLU}(u) \ge v^{\top} D(\alpha)\, u + b',
\]
where $b'$ is a constant vector and $D(\alpha)$ is a diagonal matrix containing the $\alpha_j$'s as free parameters:
\[
D_{j,j}(\alpha) =
\begin{cases}
1, & \text{if } \underline{z}_j \ge 0, \\
0, & \text{if } \bar{z}_j \le 0, \\
\alpha_j, & \text{if } \bar{z}_j > 0 > \underline{z}_j \text{ and } v_j \ge 0, \\
\frac{\bar{z}_j}{\bar{z}_j - \underline{z}_j}, & \text{if } \bar{z}_j > 0 > \underline{z}_j \text{ and } v_j < 0.
\end{cases} \tag{10.41}
\]
Definition 37 The recursive function Ω(i, j) is defined as follows [344]:
\[
\Omega(i, i) = I, \qquad \Omega(i, j) = W_i D_{i-1}(\alpha_{i-1})\, \Omega(i-1, j).
\]
β-crown defines a matrix S for handling splits throughout the branch-and-bound process; the multipliers β determine the branching rule:
\[
S_i[j][j] =
\begin{cases}
-1, & \text{if the split is } z_i[j] \ge 0, \\
1, & \text{if the split is } z_i[j] < 0, \\
0, & \text{if there is no split.}
\end{cases} \tag{10.42}
\]
Thus, the verification problem of β-crown is formulated as:
\[
\min_{z \in Z}\; c^{T}\big(W_L \mathrm{ReLU}(z_{L-1}) + b_{L-1}\big) \;\ge\; \min_{z \in Z}\;\max_{\beta_{L-1}}\; c^{T}\big(W_L D_{L-1} z_{L-1} + b_{L-1}\big) + \beta_{L-1}^{\top} S_{L-1} z_{L-1}. \tag{10.43}
\]
Having these definitions, we can write P, q, a, and d explicitly as functions of α and β. Here $P \in \mathbb{R}^{d_0 \times (\sum_{i=1}^{L-1} d_i)}$ is the block matrix $P := [P_1^{\top}\; P_2^{\top}\; \cdots\; P_{L-1}^{\top}]$, and $q \in \mathbb{R}^{\sum_{i=1}^{L-1} d_i}$ is the vector $q := [q_1^{\top}\; \cdots\; q_{L-1}^{\top}]^{\top}$. Moreover:
\[
\begin{aligned}
a &= [\Omega(L, 1) W_1]^{\top} \in \mathbb{R}^{d_0 \times 1}, \\
P_i &= S_i \Omega(i, 1) W_1 \in \mathbb{R}^{d_i \times d_0}, \quad \forall\, 1 \le i \le L-1, \\
q_i &= \sum_{k=1}^{i} S_i \Omega(i, k)\, b_k + \sum_{k=2}^{i} S_i \Omega(i, k)\, W_k b_{k-1} \in \mathbb{R}^{d_i}, \quad \forall\, 1 \le i \le L-1, \\
d &= \sum_{i=1}^{L} \Omega(L, i)\, b_i + \sum_{i=2}^{L} \Omega(L, i)\, W_i b_{i-1}.
\end{aligned}
\]
Now we extend the definition of g to a network with multiple abstain classes. Let $\hat{z}$ be the pre-activation value of the vector z before applying the ReLU function. We aim to solve the following verification problem:
\[
\min_{z_{L-1} \in Z_{L-1}(x_0, \epsilon)} \;\max_{\eta \in \mathcal{P}}\; c_k(\eta)^{T}\big(W_L z_{L-1} + b_L\big).
\]
Applying Lemma 36 to the above problem (with $z_{L-1} = \mathrm{ReLU}(\hat{z}_{L-1})$), we have:
\[
\min_{z_{L-1} \in Z_{L-1}(x_0, \epsilon)} \;\max_{\eta \in \mathcal{P}}\; c_k(\eta)^{T}\big(W_L z_{L-1} + b_L\big) \;\ge\; \min_{z_{L-1} \in Z_{L-1}(x_0, \epsilon)} \;\max_{\eta \in \mathcal{P}}\; c_k(\eta)^{T}\big(W_L D_{L-1}(\alpha_{L-1})\, \hat{z}_{L-1} + b_L\big).
\]
Adding the β-crown Lagrangian multiplier to the above problem, it becomes:
\[
\begin{aligned}
\min_{z_{L-1} \in Z_{L-1}(x_0, \epsilon)} \;\max_{\eta \in \mathcal{P}}\; c_k(\eta)^{T}\big(W_L D_{L-1}(\alpha_{L-1})\, \hat{z}_{L-1} + b_L\big)
&\ge \min_{z_{L-1} \in Z_{L-1}(x_0, \epsilon)} \;\max_{\eta \in \mathcal{P},\, \alpha_{L-1},\, \beta_{L-1}}\; c_k(\eta)^{T}\big(W_L D_{L-1}(\alpha_{L-1})\, \hat{z}_{L-1} + b_L\big) + \beta_{L-1}^{\top} S_{L-1}\, \hat{z}_{L-1} \\
&\ge \max_{\alpha_{L-1},\, \beta_{L-1}} \;\min_{z_{L-1} \in Z_{L-1}(x_0, \epsilon)} \;\max_{\eta \in \mathcal{P}}\; \Big(c_k(\eta)^{T} W_L D_{L-1}(\alpha_{L-1}) + \beta_{L-1}^{\top} S_{L-1}\Big)\hat{z}_{L-1} + c_k(\eta)^{T} b_L \\
&= \max_{\alpha_{L-1},\, \beta_{L-1}} \;\min_{z_{L-1} \in Z_{L-1}(x_0, \epsilon)} \;\max_{\eta \in \mathcal{P}}\; \Big(c_k(\eta)^{T} W_L D_{L-1}(\alpha_{L-1}) + \beta_{L-1}^{\top} S_{L-1}\Big)\big(W_{L-1} z_{L-2} + b_{L-1}\big) + c_k(\eta)^{T} b_L.
\end{aligned}
\]
Replacing the definition of $A^{(i)}$ in [344, Theorem 3.1] with the following matrix, the proof goes through unchanged:
\[
A^{(i)} =
\begin{cases}
c_k(\eta)^{T} W_L, & \text{if } i = L-1, \\
\big(A^{(i+1)} D_{i+1}(\alpha_{i+1}) + \beta_{i+1}^{\top} S_{i+1}\big) W_{i+1}, & \text{if } 0 \le i \le L-2.
\end{cases} \tag{10.44}
\]
Note that the definition of d changes in the following way:
\[
d = c_k(\eta)^{T} b_L + \sum_{i=1}^{L} \Omega(L, i)\, b_i + \sum_{i=2}^{L} \Omega(L, i)\, W_i b_{i-1}.
\]
Moreover, $\Omega(L, j) = c_k(\eta)^{T} W_L D_{L-1}(\alpha_{L-1})\, \Omega(L-1, j)$. The rest of the definitions remain the same.
10.2.5 Derivation of Equation (4.12)
In this section, we show how to derive Equation (4.12):
\[
\begin{aligned}
\min_{z_L \in Z(x, \epsilon)} \max\big\{c_{yk}^{T} z_L,\, c_{a_1 k}^{T} z_L,\, \dots,\, c_{a_M k}^{T} z_L\big\}
&= \min_{z_L \in Z(x, \epsilon)} \;\max_{\eta \in \mathcal{P}}\; \sum_{i=0}^{M} \eta_i\, c_{a_i k}^{T} z_L \\
&\ge \max_{\eta \in \mathcal{P}} \;\min_{z_L \in Z(x, \epsilon)}\; \sum_{i=0}^{M} \eta_i\, c_{a_i k}^{T} z_L \\
&\ge \max_{\eta \in \mathcal{P}} \;\max_{\alpha,\, \beta \ge 0}\; \sum_{i=0}^{M} \eta_i\, g_i(x_0, \alpha, \beta) \\
&= \max_{\alpha,\, \beta \ge 0,\, \eta \in \mathcal{P}}\; \sum_{i=0}^{M} \eta_i\, g_i(x_0, \alpha, \beta) \;\triangleq\; G(x_0, \alpha, \beta, \eta).
\end{aligned}
\]
10.2.6 A simple example on the benefits and pitfalls of having multiple abstain classes
In this example, we provide a simple toy example illustrating:
1. How adding multiple abstain classes can improve the detection of adversarial examples.
2. How detection with multiple abstain classes may suffer from a “model degeneracy" phenomenon.
Figure 10.5: Distribution of adversarial and real data described in the example. While one linear classifier cannot separate the adversarial (red sections) and real (green section) data points, two detection classes can detect the adversarial examples.
Example: Consider simple one-dimensional data where the real data comes from the Laplace distribution with probability density function $P_r(X = x) = \frac{1}{2}\exp(-|x|)$. Assume that the adversarial samples are distributed according to the probability density function $P_a(X = x) = \frac{1}{4}\big(\exp(-|x - 10|) + \exp(-|x + 10|)\big)$. Assume that $\frac{1}{3}$ of the data is real and $\frac{2}{3}$ comes from the adversary. The adversarial and real data are illustrated in Figure 10.5.
Consider a binary neural-network classifier with no hidden layer for detecting adversaries. Specifically, the network has two weights $w^r$ and $w^a$ and bias values $b^r$ and $b^a$. The network classifies a sample x as “real" if $w^r x + b^r > w^a x + b^a$; otherwise, it classifies the sample as out-of-distribution/abstain. The misclassification rate of this classifier is given by:
\[
\begin{aligned}
P(\text{error}) &= \frac{1}{3}\, P_{x \sim P_r}\big(w^a x + b^a > w^r x + b^r\big) + \frac{2}{3}\, P_{x \sim P_a}\big(w^a x + b^a < w^r x + b^r\big) \\
&= \frac{1}{3}\, P_{x \sim P_r}\Big(x > \frac{b^r - b^a}{w^a - w^r}\Big) + \frac{2}{3}\, P_{x \sim P_a}\Big(x < \frac{b^r - b^a}{w^a - w^r}\Big),
\end{aligned}
\]
where, due to symmetry and scaling invariance, we assumed without loss of generality that $w^a - w^r > 0$.
Let $t = \frac{b^r - b^a}{w^a - w^r}$. Therefore,
\[
P(\text{error}) = \frac{1}{3}\int_{t}^{+\infty} \frac{1}{2}\exp(-|x|)\, dx + \frac{2}{3}\int_{-\infty}^{t} \frac{1}{4}\big(\exp(-|x - 10|) + \exp(-|x + 10|)\big)\, dx. \tag{10.45}
\]
Thus, to find the optimal classifier, we need to determine the optimal t minimizing the above expression. One can numerically verify that the optimal t is given by $t^* = 5$, leading to a minimum misclassification rate of $\approx 0.34$. This is the optimal misclassification rate achievable by our single-abstain-class neural network.
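This numerical claim is easy to reproduce; the sketch below evaluates (10.45) in closed form on a grid (the grid resolution is an arbitrary choice):

```python
import numpy as np

def p_error(t):
    """Evaluate (10.45) for 0 <= t <= 10 using the closed-form integrals."""
    real_err = 0.5 * np.exp(-t)                                  # P_r(x > t)
    adv_err = 0.25 * (2.0 - np.exp(-(t + 10.0))) + 0.25 * np.exp(t - 10.0)  # P_a(x < t)
    return (1.0 / 3.0) * real_err + (2.0 / 3.0) * adv_err

ts = np.linspace(0.0, 10.0, 100001)
errs = p_error(ts)
print(ts[errs.argmin()], errs.min())   # ~5.0 and ~0.34
```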
Now consider a neural network with two abstain classes. Assume that the weights and biases corresponding to the abstain classes are $w_1^a, w_2^a, b_1^a, b_2^a$, and that the weight and bias for the real class are $w^r$ and $b^r$. A sample x is classified as a real example if and only if both of the following conditions hold:
\[
w^r x + b^r > w_1^a x + b_1^a \tag{10.46}
\]
\[
w^r x + b^r > w_2^a x + b_2^a; \tag{10.47}
\]
otherwise, it is classified as an adversarial (out-of-distribution) sample. The misclassification rate of such a classifier is given by:
\[
P(\text{error}) = \frac{1}{3}\, P_{x \sim P_r}\big(\text{conditions (10.46)-(10.47) do not hold}\big) + \frac{2}{3}\, P_{x \sim P_a}\big(\text{conditions (10.46)-(10.47) hold}\big). \tag{10.48}
\]
Claim 1: The point $w_1^a = -1$, $w_2^a = 1$, $b_1^a = b_2^a = 0$, $b^r = 5$, $w^r = 0$ is a global minimum of (10.48), with an optimal misclassification rate below 0.1.
Proof: Define $t_1 = -\frac{b_1^a - b^r}{w_1^a - w^r}$ and $t_2 = -\frac{b_2^a - b^r}{w_2^a - w^r}$. Considering all possible sign cases, it is not hard to see that at the optimal point, $w_1^a - w^r$ and $w_2^a - w^r$ have different signs. Without loss of generality, assume that $w_1^a - w^r < 0$ and $w_2^a - w^r > 0$. Then:
\[
P(\text{error}) = \frac{1}{3}\, P_{x \sim P_r}\big(x \le t_1 \,\vee\, x \ge t_2\big) + \frac{2}{3}\, P_{x \sim P_a}\big(x \ge t_1 \,\wedge\, x \le t_2\big). \tag{10.49}
\]
It is not hard to see that the optimal solution is given by $t_1^* = -5$, $t_2^* = 5$. Plugging these values into the above equation, one can check that the optimal loss is less than 0.1. ■
Claim 1 shows that by adding an abstain class, the misclassification rate of the classifier drops from 0.34 to below 0.1. This simple example illustrates the benefit of having multiple abstain classes. Next, we show that with multiple abstain classes, we are prone to the “model degeneracy" phenomenon.
Claim 2: Let $\bar{w}_1^a = \bar{w}_2^a = 1$, $\bar{b}_1^a = \bar{b}_2^a = 0$, $\bar{w}^r = 0$, $\bar{b}^r = 5$. Then there exists a point $(\tilde{w}, \tilde{b}) = (\tilde{w}_1^a, \tilde{w}_2^a, \tilde{b}_1^a, \tilde{b}_2^a, \tilde{w}^r, \tilde{b}^r)$ such that $(\tilde{w}, \tilde{b})$ is a local minimum of the loss function in (10.48) and $\|(\tilde{w}, \tilde{b}) - (\bar{w}, \bar{b})\|_2 \le 0.1$.
Proof: Let $t_1 = -\frac{b_1^a - b^r}{w_1^a - w^r}$ and $t_2 = -\frac{b_2^a - b^r}{w_2^a - w^r}$. Notice that in a neighborhood of the point $(\bar{w}, \bar{b})$, we have $w_1^a - w^r > 0$ and $w_2^a - w^r > 0$. Thus, in this neighborhood the loss function in (10.48) can be written as:
\[
\ell(t_1, t_2) = \frac{1}{3}\, P_{x \sim P_r}\big(x \ge \min(t_1, t_2)\big) + \frac{2}{3}\, P_{x \sim P_a}\big(x \le \min(t_1, t_2)\big) = \frac{1}{3}\, P_{x \sim P_r}(x \ge z) + \frac{2}{3}\, P_{x \sim P_a}(x \le z),
\]
where $z = \min(t_1, t_2)$. It suffices to show that the above function has a local minimum close to the point $\bar{z} = 5$ (see [243]). Writing $\ell(t_1, t_2)$ as a function of z, we have:
\[
\ell(t_1, t_2) = h(z) = \frac{1}{6}\exp(-z) + \frac{1}{3} - \frac{1}{6}\exp(-z - 10) + \frac{1}{6}\exp(z - 10).
\]
Plotting h(z) shows that it has a local minimum close to $\bar{z} = 5$. ■
This claim shows that by optimizing the loss we may converge to the local optimum $(\tilde{w}, \tilde{b})$, where both abstain classes become essentially the same and the two abstain classes are not fully utilized.
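The local minimum of h can likewise be exhibited numerically (a minimal sketch):

```python
import numpy as np

def h(z):
    """Loss of the degenerate two-abstain classifier as a function of z = min(t1, t2)."""
    return (np.exp(-z) - np.exp(-z - 10.0) + np.exp(z - 10.0)) / 6.0 + 1.0 / 3.0

zs = np.linspace(3.0, 7.0, 400001)
i = h(zs).argmin()
print(zs[i], h(zs[i]))   # an interior local minimum just below z = 5
```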
10.2.7 Structure of Neural Networks in Section 4.3
In Section 4.3 we introduced a toy example in the Motivation subsection to show how much looser IBP bounds can become when we go from a 2-layer network to an equivalent 3-layer network. The structure of the 2-layer neural network is as follows:
\[
z_2(x) = W_2\, \mathrm{ReLU}(W_1 x),
\]
where x is the 2-dimensional input,
\[
W_1 = \begin{pmatrix} 0.557 & -0.296 & -0.449 \\ -0.474 & -0.504 & 0.894 \\ -0.0208 & 0.0679 & 0.901 \end{pmatrix}, \qquad
W_2 = \begin{pmatrix} 0.817 & -0.376 & 0.36 \\ 0.524 & 0.530 & 0.0557 \\ 0.0753 & 0.191 & 0.744 \\ -0.547 & 0.660 & -0.718 \end{pmatrix}.
\]
Note that the input data is 2-dimensional, but we append an extra 1 to incorporate the biases into $W_1$ and $W_2$. The chosen ϵ for each data point equals 1.
10.2.8 Experiments on Deep Neural Networks
To compare the performance of our proposed approach to other state-of-the-art methods on deep neural
networks, we run the methods on the network with the structure described in Appendix 10.2.1. The results
are reported in Table 10.3.
ϵ_train = 8.8/255, ϵ_test = 8/255:
Method                                          | Standard Error (%) | Robust Verified Error (%)
------------------------------------------------|--------------------|--------------------------
Interval Bound Propagation [130]                | 50.51              | 68.44
IBP-CROWN [368]                                 | 54.02              | 66.94
[30]                                            | 48.3               | 72.5
Single Abstain [303]                            | 55.60              | 63.63
Multiple Abstain Classes (Current Work)         | 56.72              | 61.45
Multiple Abstain Classes (Verified by β-crown)  | 56.72              | 57.55

ϵ_train = 17.8/255, ϵ_test = 16/255:
Method                                          | Standard Error (%) | Robust Verified Error (%)
------------------------------------------------|--------------------|--------------------------
Interval Bound Propagation [130]                | 68.97              | 78.12
IBP-CROWN [368]                                 | 66.06              | 76.80
Single Abstain [303]                            | 66.37              | 67.92
Multiple Abstain Classes (verified by IBP)      | 66.25              | 64.57
Multiple Abstain Classes (Verified by β-crown)  | 66.25              | 62.81

Table 10.3: Standard and robust verified error of state-of-the-art approaches on the CIFAR-10 dataset.
10.2.9 Limitations
The proposed framework for training and verifying joint detector-classifier networks defines the uncertainty set on each sample as an $L_\infty$ norm ball. The results can be extended to other $L_p$ norm balls ($L_1$ or $L_2$) by adapting the Interval Bound Propagation [130] procedure to constraint sets defined by $L_p$ balls; however, the experiments in this work are performed on $L_\infty$ constraint sets. Furthermore, the networks in the numerical section are trained on the MNIST and CIFAR-10 datasets due to the expensive and complex training procedure of verifiable neural networks: training verifiably robust neural networks is still limited to these datasets, even for the fastest methods such as IBP. Besides, on large-scale datasets with millions of samples and many classes, the optimal M might be much larger than its optimal values of M = 3 and M = 4 on the MNIST and CIFAR-10 datasets, respectively.
Therefore, it is crucial to devise techniques for training large verifiable neural networks in general, and networks with multiple detection classes in particular.
10.2.10 Societal Impacts
Given the susceptibility of presently trained neural networks to adversarial examples and out-of-distribution samples, the deployment of such models in critical applications like self-driving cars has been the subject of debate. Ensuring the reliability and safety of neural networks in unpredictable and adversarial environments requires mechanisms that guarantee the models' robustness. Our study proposes a systematic approach for training and verifying neural networks against adversarial attacks. From a broader perspective, establishing verifiable guarantees on the performance of artificial intelligence (AI) models alleviates ethical and safety concerns associated with AI systems.
10.3 Chapter 5 Appendix
10.3.1 Proof of Theorem 10
Proof: First, notice that since $\tilde{a}$ is a one-hot encoding of a, any function $f : \{1, \dots, c\} \mapsto \mathbb{R}$ can be equivalently represented as $f(\tilde{a}) = u^T \tilde{a}$ for some $u \in \mathbb{R}^c$. Therefore, following the definition of Rényi correlation, we can write
\[
\rho_R(a, b) = \max_{u, g}\; \mathbb{E}\big[(u^T \tilde{a})\, g(b)\big] \quad \text{s.t.} \quad \mathbb{E}\big[(u^T \tilde{a})^2\big] \le 1,\; \mathbb{E}[u^T \tilde{a}] = 0,\; \mathbb{E}[g^2(b)] \le 1,\; \mathbb{E}[g(b)] = 0.
\]
Notice that since b is binary, there is a unique function $g(b) = \frac{b - q}{\sqrt{q(1-q)}}$ satisfying the constraints, where $q \triangleq P(b = 1)$. Therefore, the above optimization problem can be written as
\[
\rho_R(a, b) = \max_{u}\; u^T \mathbb{E}[\tilde{a}\, g(b)] \quad \text{s.t.} \quad u^T \mathbb{E}[\tilde{a}\tilde{a}^T]\, u \le 1,\; u^T \mathbb{E}[\tilde{a}] = 0.
\]
The last constraint simply implies that u should be orthogonal to $p \triangleq \mathbb{E}[\tilde{a}]$, which is a stochastic vector capturing the distribution of a. Equivalently, we can write $u = \big(I - \frac{pp^T}{\|p\|^2}\big) v$ for some $v \in \mathbb{R}^c$. Thus, we can simplify the above optimization problem to
\[
\rho_R(a, b) = \max_{v}\; v^T \Big(I - \frac{pp^T}{\|p\|^2}\Big) \mathbb{E}[\tilde{a}\, g(b)] \quad \text{s.t.} \quad v^T \Big(I - \frac{pp^T}{\|p\|^2}\Big) \mathrm{diag}(p) \Big(I - \frac{pp^T}{\|p\|^2}\Big) v \le 1,
\]
where in the constraint we used the equality $\mathbb{E}[\tilde{a}\tilde{a}^T] = \mathrm{diag}(p)$, which follows from the definition. Let us make the change of variable $\hat{v} = \mathrm{diag}(\sqrt{p})\big(I - \frac{pp^T}{\|p\|^2}\big) v$. Then the above optimization can be simplified to
\[
\rho_R(a, b) = \max_{\hat{v}}\; \hat{v}^T \mathrm{diag}(1/\sqrt{p})\, \mathbb{E}[\tilde{a}\, g(b)] \quad \text{s.t.} \quad \|\hat{v}\| \le 1.
\]
Clearly, this leads to
\[
\rho_R^2(a, b) = \Big\|\mathrm{diag}\Big(\frac{1}{\sqrt{p}}\Big)\, \mathbb{E}[\tilde{a}\, g(b)]\Big\|^2 = \sum_{i=1}^{c} \frac{1}{P(a = i)} \left(P(a = i, b = 1)\sqrt{\frac{1-q}{q}} - P(a = i, b = 0)\sqrt{\frac{q}{1-q}}\right)^{2}, \tag{10.50}
\]
where in the last equality we use the fact that $g(1) = \sqrt{\frac{1-q}{q}}$ and $g(0) = -\sqrt{\frac{q}{1-q}}$. Define $p_{i0} \triangleq P(a = i, b = 0)$, $p_{i1} \triangleq P(a = i, b = 1)$, and $p_i \triangleq P(a = i) = p_{i0} + p_{i1}$. Then, using simple algebraic manipulations, we have that
manipulations, we have that
ρ
2
R(a, b) = Xc
i=1
1
pi
pi1
r
1 − q
q
− pi0
r q
1 − q
2
=
Xc
i=1
(2pi1(1 − q) − 2pi0q)
2
4piq(1 − q)
−
Xc
i=1
(pi0 − pi1)
2
4piq(1 − q)
+
Xc
i=1
(pi0 − pi1)
2
4piq(1 − q)
=
Xc
i=1
((3 − 2q)pi1 − (1 + 2q)pi0)((1 − 2q)pi1 + (1 − 2q)pi0)
4piq(1 − q)
+
Xc
i=1
(pi0 − pi1)
2
4piq(1 − q)
=
1 − 2q
4q(1 − q)
((3 − 2q)q − (1 + 2q)(1 − q)) +Xc
i=1
(pi0 − pi1)
2
4piq(1 − q)
= 1 −
1 −
Pc
i=1(pi0 − pi1)
2/pi
4q(1 − q)
= 1 −
γ
q(1 − q)
,
where in the last equality we used the definition of γ and the optimal value of (10). □
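As a sanity check, the closed form above can be compared against a direct evaluation of (10.50) on a random joint distribution; in the sketch below, γ is computed as $(1 - \sum_{i}(p_{i0} - p_{i1})^2/p_i)/4$, matching the last equality above.

```python
import numpy as np

rng = np.random.default_rng(0)
c = 5
P = rng.random((c, 2)); P /= P.sum()              # joint p(a = i, b = j), columns j in {0, 1}
p_i, q = P.sum(axis=1), P[:, 1].sum()

rho2 = np.sum((P[:, 1] * np.sqrt((1 - q) / q)
               - P[:, 0] * np.sqrt(q / (1 - q))) ** 2 / p_i)   # direct (10.50)

gamma = (1.0 - np.sum((P[:, 0] - P[:, 1]) ** 2 / p_i)) / 4.0
print(rho2, 1.0 - gamma / (q * (1.0 - q)))         # the two values agree
```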
10.3.2 Proof of Theorem 12
Proof: Define $g_B(\theta) = \max_{w} f_B(\theta, w)$. Since the optimization problem $\max_w f_B(\theta, w)$ is strongly concave in w, using Danskin's theorem (see [87] and [44]) we conclude that the function $g_B(\cdot)$ is differentiable. Moreover,
\[
\nabla_\theta g_B(\bar{\theta}) = \nabla_\theta f_B(\bar{\theta}, \bar{w}),
\]
where $\bar{w} = \arg\max_w f_B(\bar{\theta}, w)$. Thus Algorithm 10 is in fact equivalent to the gradient descent algorithm applied to $g_B(\theta)$. Hence, according to [239, Chapter 1], the algorithm finds a point with $\|\nabla g_B(\theta)\| \le \epsilon$ in $O(\epsilon^{-2})$ iterations. □
10.3.3 Rényi Fair K-means
To illustrate the behavior of Algorithm 11, we use a simple two-dimensional toy example. In this example, we generated data by randomly selecting 5 center points and then randomly generating 500 data points around each center according to a normal distribution with small enough variance. The data is shown in Figure 10.6, with different colors corresponding to different clusters. Moreover, we assigned to each data point $x_i$ a binary value $s_i \in \{0, 1\}$ corresponding to its sensitive attribute. This assignment was performed randomly, except for points generated around center 2 (green points in Figure 10.6), which were assigned a value of 1, and points generated around center 4 (blue points in Figure 10.6), which were assigned a value of 0. Without imposing fairness, the traditional K-means algorithm groups the points generated around center 2 into one cluster, regardless of the fact that they all belong to the same protected group; similarly, the points generated around center 4 end up in one cluster. Hence, under the traditional K-means clustering shown in Figure 10.6, the proportions of the protected group in clusters 2 and 4 are 1 and 0, respectively. However, when imposing our fairness scheme, we expect these points to be distributed among various clusters to achieve balanced clustering. This is illustrated in Figure 10.7: as λ increases, the data points corresponding to centers 2 and 4 are distributed among different clusters.
Figure 10.6: Applying K-means algorithm without fairness on the synthetic dataset.
Figure 10.7: Applying fair K-means algorithm with different values of λ on the synthetic dataset.
10.3.4 Updating w after updating the assignment of each data point in Algorithm 11
To understand the reasoning behind updating the vector of proportions w after updating each $a_i$ (the assignment of data point i), we discuss a simple one-dimensional counterexample. Consider the following four data points $X_1 = -5$, $X_2 = -4$, $X_3 = 4$, and $X_4 = 5$, with corresponding sensitive attributes $S_1 = S_2 = 1$ and $S_3 = S_4 = 0$. Moreover, assume the following initial $A^0$ and $C^0$:
\[
A^0 = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}, \qquad C^0 = [-4.5,\; 4.5].
\]
Hence, $X_1$ and $X_2$, which both have a sensitive attribute of 1, are assigned to cluster 1 with center $C_1^0 = -4.5$, and $X_3$ and $X_4$, which both have a sensitive attribute of 0, are assigned to cluster 2 with center $C_2^0 = 4.5$. Then w, the current proportion of the privileged group in the clusters, is $w^0 = [1, 0]$.
Now, for sufficiently large λ, if we update A according to Step 4 of Algorithm 11, we get the following new assignment:
\[
A^1 = \begin{pmatrix} 0 & 1 \\ 0 & 1 \\ 1 & 0 \\ 1 & 0 \end{pmatrix}, \qquad C^1 = [4.5,\; -4.5], \qquad w^1 = [0, 1].
\]
Hence, the points simply switch their clusters. Performing another iteration brings us back to the initial setup, and the algorithm gets stuck oscillating between these two states, both of which are unfair. To overcome this issue, we update the proportions w after updating the assignment of each data point.
10.3.5 Datasets Description
In this section, we introduce the datasets used in the numerical experiments discussed in Section 5.6. All of these datasets are publicly available in the UCI repository.
• German Credit Dataset: The German Credit dataset consists of 20 features (13 categorical and 7 numerical) regarding the social and economic status of 1000 customers. The assigned task is to classify customers as good or bad credit risks. Without imposing fairness, the DP violation of the trained model is larger than 20%. We chose the first 800 customers as the training data and the last 200 customers as the test data. The sensitive attributes are gender and marital status.
• Bank Dataset: The Bank dataset contains the information of individuals contacted by a Portuguese bank institution. The assigned classification task is to predict whether a client will subscribe to a term deposit. For the classification task, we consider all 17 attributes (excluding marital status, which serves as the sensitive attribute). Removing the sensitive attribute and training a logistic regression model on the dataset yields a solution that is biased under the demographic parity notion (p% = 70.75%). To evaluate the performance of the classifier, we split the data into a training set (32,000 data points) and a test set (13,211 data points). For the clustering task, we sampled 3 continuous features: age, balance, and duration. The sensitive attribute is the marital status of the individuals.
• Adult Dataset: The Adult dataset contains the census information of individuals, including education, gender, and capital gain. The assigned classification task is to predict whether a person earns over 50K annually. The train and test sets are two separate files consisting of 32,000 and 16,000 samples, respectively. We consider gender and race as the sensitive attributes (for the experiments involving one sensitive attribute, we have chosen gender). Learning a logistic regression model on the training dataset (without imposing fairness) shows that only 3 out of 14 features have larger weights than the gender attribute. Note that removing the sensitive attribute (gender) and retraining the model does not eliminate the bias of the classifier: the optimal logistic regression classifier in this case is still highly biased (p% = 31.49%). For the clustering task, we have chosen 5 continuous features (capital-gain, age, fnlwgt, capital-loss, hours-per-week) and 10,000 samples to cluster. The sensitive attribute of each individual is gender.
10.3.6 Supplementary Figures
Figure 10.8: The relationship between Rényi correlation, Pearson correlation, and normalized mutual information. Direct optimization of normalized mutual information is intractable due to its non-convexity. However, as observed on the right-hand side, by driving the Rényi correlation to 0, the normalized mutual information converges to 0 accordingly.
10.3.7 Fair Neural Network
In this section, we train a 2-layer neural network on the Adult dataset, regularized by Rényi correlation. In this experiment, the sensitive attribute is gender. We set the number of nodes in the hidden layer, the batch size, and the number of epochs to 12, 128, and 50, respectively. The following table depicts the performance of the trained model.

p%      | Test Accuracy | Time (Seconds)
--------|---------------|---------------
31.49%  | 85.33         | 731
80.42%  | 83.34         | 915

Table 10.4: Performance and training time of a neural network trained on the Adult dataset. The first and second rows correspond to the networks trained without and with the Rényi-correlation regularizer, respectively.
As can be seen in the table, adding the Rényi regularizer makes the classifier considerably fairer while introducing only a small amount of additional computational overhead.
10.4 Chapter 6 Appendix
We provide a simple table of contents below for easier navigation of the Chapter 6 appendix.
CONTENTS
Section 10.4.1: Notions of fairness
Section 10.4.2: ERMI: General definition, properties, and special cases unraveled
Section 10.4.3: Relations between ERMI and other fairness violation notions
Section 10.4.4: Precise Statement and Proofs of Theorem 19 and Theorem 21
Section 10.4.6: Experiment details & additional results
Section 10.4.6.1: Model description
Section 10.4.6.2: More comparison to [223]
Section 10.4.6.3: Performance in the presence of outliers & class-imbalance
Section 10.4.6.4: Effect of hyperparameter λ on the accuracy-fairness tradeoffs
Section 10.4.6.5: Complete version of Figure 1 (with pre-processing and post-processing baselines)
Section 10.4.6.6: Description of datasets
10.4.1 Notions of Fairness
Let $(Y, \widehat{Y}, A, S)$ denote the true target, the predicted target, the advantaged outcome class, and the sensitive attribute, respectively. We review three major notions of fairness.
Definition 38 (demographic parity [105]) We say that a learning machine satisfies demographic parity if $\widehat{Y}$ is independent of S.
Definition 39 (equalized odds [141]) We say that a learning machine satisfies equalized odds if $\widehat{Y}$ is conditionally independent of S given Y.
Definition 40 (equal opportunity [141]) We say that a learning machine satisfies equal opportunity with respect to A if $\widehat{Y}$ is conditionally independent of S given Y = y for all $y \in A$.
Notice that equal opportunity as defined here generalizes the definition in [141]: it recovers equalized odds if $A = \mathcal{Y}$, and it recovers the equal opportunity of [141] for $A = \{1\}$ in binary classification.
10.4.2 ERMI: General Definition, Properties, and Special Cases Unraveled
We begin by stating a notion of fairness that generalizes the demographic parity, equalized odds, and equal opportunity fairness definitions (the three notions considered in this chapter). This will be convenient for defining ERMI in its general form and presenting the results in Sec. 10.4.3. Consider a learner who trains a model to make a prediction $\widehat{Y}$, e.g., whether or not to extend a loan, supported on a set $\mathcal{Y}$. Here we allow $\widehat{Y}$ to be either discrete or continuous. The prediction is made using a set of features, X, e.g., financial history features. We assume that there is a set of discrete sensitive attributes, S, e.g., race and sex, supported on $\mathcal{S}$, associated with each sample. Further, let $A \subseteq \mathcal{Y}$ denote an advantaged outcome class, e.g., the outcome where a loan is extended.
Definition 41 ((Z, $\mathcal{Z}$)-fairness) Given a random variable Z, let $\mathcal{Z}$ be a subset of values that Z can take. We say that a learning machine satisfies (Z, $\mathcal{Z}$)-fairness if for every $z \in \mathcal{Z}$, $\widehat{Y}$ is conditionally independent of S given Z = z, i.e.,
\[
\forall \hat{y} \in \mathcal{Y},\; s \in \mathcal{S},\; z \in \mathcal{Z}: \quad p_{\widehat{Y}, S|Z}(\hat{y}, s \,|\, z) = p_{\widehat{Y}|Z}(\hat{y} \,|\, z)\, p_{S|Z}(s \,|\, z).
\]
(Z, $\mathcal{Z}$)-fairness includes the popular demographic parity, equalized odds, and equal opportunity notions of fairness as special cases:
1. (Z, $\mathcal{Z}$)-fairness recovers demographic parity [105] if Z = 0 and $\mathcal{Z} = \{0\}$. In this case, conditioning on Z has no effect, and hence (0, {0})-fairness is equivalent to the independence of $\widehat{Y}$ and S (see Definition 38, Sec. 10.4.1).
2. (Z, $\mathcal{Z}$)-fairness recovers equalized odds [141] if Z = Y and $\mathcal{Z} = \mathcal{Y}$. In this case, $Z \in \mathcal{Z}$ is trivially satisfied. Hence, conditioning on Z is equivalent to conditioning on Y, which recovers the equalized odds notion of fairness, i.e., the conditional independence of $\widehat{Y}$ and S given Y (see Definition 39, Sec. 10.4.1).
3. (Z, $\mathcal{Z}$)-fairness recovers equal opportunity [141] if Z = Y and $\mathcal{Z} = A$. This is similar to the previous case with $\mathcal{Y}$ replaced by A (see Definition 40, Sec. 10.4.1).
Note that verifying (Z, $\mathcal{Z}$)-fairness requires access to the joint distribution of the random variables $(Z, \widehat{Y}, S)$. This joint distribution is unavailable to the learner in the context of machine learning, and hence the learner resorts to an empirical estimate of the amount of violation of independence, measured through some divergence. See [346] for a related discussion.
In this general context, here is the general definition of ERMI:
Definition 42 (ERMI – exponential Rényi mutual information) We define the exponential Rényi mutual information between $\widehat{Y}$ and S given $Z \in \mathcal{Z}$ as
\[
D_R(\widehat{Y}; S \,|\, Z \in \mathcal{Z}) := \mathbb{E}_{Z, \widehat{Y}, S}\left\{ \frac{p_{\widehat{Y}, S|Z}(\widehat{Y}, S \,|\, Z)}{p_{\widehat{Y}|Z}(\widehat{Y} \,|\, Z)\, p_{S|Z}(S \,|\, Z)} \,\middle|\, Z \in \mathcal{Z} \right\} - 1. \tag{ERMI}
\]
Notice that ERMI is in fact the $\chi^2$-divergence between the conditional joint distribution, $p_{\widehat{Y}, S}$, and the Kronecker product of the conditional marginals, $p_{\widehat{Y}} \otimes p_S$, where the conditioning is on $Z \in \mathcal{Z}$. Further, the $\chi^2$-divergence is an f-divergence with $f(t) = (t - 1)^2$. See [83, Section 4] for a discussion. As an immediate result of this observation and well-known properties of f-divergences, we can state the following property of ERMI:
Remark 43 $D_R(\widehat{Y}; S \,|\, Z \in \mathcal{Z}) \ge 0$, with equality if and only if, for all $z \in \mathcal{Z}$, $\widehat{Y}$ and S are conditionally independent given Z = z.
To further clarify the definition of ERMI, especially as it relates to demographic parity, equalized odds, and equal opportunity, we unravel the definition explicitly in a few special cases.
First, let Z = 0 and $\mathcal{Z} = \{0\}$. In this case, $Z \in \mathcal{Z}$ trivially holds, and conditioning on Z has no effect, resulting in:
\[
D_R(\widehat{Y}; S) := D_R(\widehat{Y}; S \,|\, Z \in \mathcal{Z})\Big|_{Z=0,\, \mathcal{Z}=\{0\}} = \mathbb{E}_{\widehat{Y}, S}\left\{ \frac{p_{\widehat{Y}, S}(\widehat{Y}, S)}{p_{\widehat{Y}}(\widehat{Y})\, p_S(S)} \right\} - 1 = \sum_{s \in \mathcal{S}} \int_{\hat{y} \in \mathcal{Y}} \frac{p_{\widehat{Y}, S}(\hat{y}, s) - p_{\widehat{Y}}(\hat{y})\, p_S(s)}{p_{\widehat{Y}}(\hat{y})\, p_S(s)}\; p_{\widehat{Y}, S}(\hat{y}, s)\, d\hat{y}. \tag{10.51}
\]
$D_R(\widehat{Y}; S)$ is the notion of ERMI that should be used when the desired notion of fairness is demographic parity. In particular, $D_R(\widehat{Y}; S) = 0$ implies that the $\chi^2$-divergence between $p_{\widehat{Y}, S}$ and the Kronecker product of the marginals, $p_{\widehat{Y}} \otimes p_S$, is zero. This in turn implies that $\widehat{Y}$ and S are independent, which is the definition of demographic parity. We note that when $\widehat{Y}$ and S are discrete, this special case (Z = 0 and $\mathcal{Z} = \{0\}$) of ERMI is referred to as $\chi^2$-information by [59].
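For discrete $\widehat{Y}$ and S, (10.51) reduces to $\sum_{\hat{y}, s} p(\hat{y}, s)^2 / (p(\hat{y})\, p(s)) - 1$, which can be estimated directly from an empirical joint distribution; a minimal sketch:

```python
import numpy as np

def ermi(joint):
    """Empirical ERMI (chi^2-information) from a joint pmf over (y_hat, s)."""
    p_y = joint.sum(axis=1, keepdims=True)   # marginal of the prediction
    p_s = joint.sum(axis=0, keepdims=True)   # marginal of the sensitive attribute
    return np.sum(joint ** 2 / (p_y * p_s)) - 1.0

rng = np.random.default_rng(1)
J = rng.random((3, 2)); J /= J.sum()
print(ermi(J))                               # > 0 unless y_hat and s are independent
print(ermi(J.sum(1, keepdims=True) * J.sum(0, keepdims=True)))  # ~0 for the product
```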
Next, we consider Z = Y and $\mathcal{Z} = \mathcal{Y}$. In this case, $Z \in \mathcal{Z}$ is trivially satisfied, and hence,
\[
\begin{aligned}
D_R(\widehat{Y}; S \,|\, Y) := D_R(\widehat{Y}; S \,|\, Z \in \mathcal{Z})\Big|_{Z=Y,\, \mathcal{Z}=\mathcal{Y}}
&= \mathbb{E}_{Y, \widehat{Y}, S}\left\{ \frac{p_{\widehat{Y}, S|Y}(\widehat{Y}, S \,|\, Y)}{p_{\widehat{Y}|Y}(\widehat{Y} \,|\, Y)\, p_{S|Y}(S \,|\, Y)} \right\} - 1 \\
&= \sum_{s \in \mathcal{S}} \int_{y \in \mathcal{Y}} \int_{\hat{y} \in \mathcal{Y}} \frac{p_{\widehat{Y}, S|Y}(\hat{y}, s|y) - p_{\widehat{Y}|Y}(\hat{y}|y)\, p_{S|Y}(s|y)}{p_{\widehat{Y}|Y}(\hat{y}|y)\, p_{S|Y}(s|y)}\; p_{Y, \widehat{Y}, S}(y, \hat{y}, s)\, dy\, d\hat{y} \\
&= \sum_{s \in \mathcal{S}} \int_{y \in \mathcal{Y}} \int_{\hat{y} \in \mathcal{Y}} \frac{p_{\widehat{Y}, S|Y}(\hat{y}, s|y)^2}{p_{\widehat{Y}|Y}(\hat{y}|y)\, p_{S|Y}(s|y)}\; p_Y(y)\, dy\, d\hat{y} - 1. \tag{10.52}
\end{aligned}
\]
$D_R(\widehat{Y}; S \,|\, Y)$ should be used when the desired notion of fairness is equalized odds. In particular, $D_R(\widehat{Y}; S \,|\, Y) = 0$ directly implies the conditional independence of $\widehat{Y}$ and S given Y.
Finally, we consider Z = Y and $\mathcal{Z} = A$. In this case, we have
\[
\begin{aligned}
D_R^A(\widehat{Y}; S \,|\, Y) := D_R(\widehat{Y}; S \,|\, Z \in \mathcal{Z})\Big|_{Z=Y,\, \mathcal{Z}=A}
&= \mathbb{E}_{Y, \widehat{Y}, S}\left\{ \frac{p_{\widehat{Y}, S|Y}(\widehat{Y}, S|Y)}{p_{\widehat{Y}|Y}(\widehat{Y}|Y)\, p_{S|Y}(S|Y)} \,\middle|\, Y \in A \right\} - 1 \\
&= \sum_{s \in \mathcal{S}} \int_{y \in A} \int_{\hat{y} \in \mathcal{Y}} \frac{p_{\widehat{Y}, S|Y}(\hat{y}, s|y) - p_{\widehat{Y}|Y}(\hat{y}|y)\, p_{S|Y}(s|y)}{p_{\widehat{Y}|Y}(\hat{y}|y)\, p_{S|Y}(s|y)}\; p_{\widehat{Y}, S|Y}(\hat{y}, s|y)\, p_Y^A(y)\, dy\, d\hat{y} \\
&= \sum_{s \in \mathcal{S}} \int_{y \in A} \int_{\hat{y} \in \mathcal{Y}} \frac{p_{\widehat{Y}, S|Y}(\hat{y}, s|y)^2}{p_{\widehat{Y}|Y}(\hat{y}|y)\, p_{S|Y}(s|y)}\; p_Y^A(y)\, dy\, d\hat{y} - 1, \tag{10.53}
\end{aligned}
\]
where
\[
p_Y^A(y) := \frac{p_Y(y)}{\int_{y' \in A} p_Y(y')\, dy'}. \tag{10.54}
\]
This notion should be used when the desired notion of fairness is equal opportunity. It can be simplified further when the advantaged class is a singleton (which is the case in binary classification). If Z = Y and $\mathcal{Z} = \{y\}$, then
\[
\begin{aligned}
D_R(\widehat{Y}; S \,|\, Y = y) := D_R^{\{y\}}(\widehat{Y}; S \,|\, Y)
&= \sum_{s \in \mathcal{S}} \int_{\hat{y} \in \mathcal{Y}} \frac{p_{\widehat{Y}, S|Y}(\hat{y}, s|y) - p_{\widehat{Y}|Y}(\hat{y}|y)\, p_{S|Y}(s|y)}{p_{\widehat{Y}|Y}(\hat{y}|y)\, p_{S|Y}(s|y)}\; p_{\widehat{Y}, S|Y}(\hat{y}, s|y)\, d\hat{y} \\
&= \sum_{s \in \mathcal{S}} \int_{\hat{y} \in \mathcal{Y}} \frac{p_{\widehat{Y}, S|Y}(\hat{y}, s|y)^2}{p_{\widehat{Y}|Y}(\hat{y}|y)\, p_{S|Y}(s|y)}\, d\hat{y} - 1. \tag{10.55}
\end{aligned}
\]
Finally, we note that we use the notation $D_R(\widehat{Y}; S|Y)$ and $D_R(\widehat{Y}; S|Y = y)$ to be consistent with the definition of conditional mutual information in [79].
10.4.3 Relations Between ERMI and Other Fairness Violation Notions
Recall that most existing in-processing methods use some notion of fairness violation as a regularizer to enforce fairness in the trained model. These notions of fairness violation typically take the form of some information divergence between the sensitive attributes and the predicted targets (e.g., [223, 23, 75]). In this section, we show that ERMI provides an upper bound on the existing measures of fairness violation for demographic parity, equal opportunity, and equalized odds. As mentioned in the main body, this insight may help explain the favorable empirical performance of our algorithm compared to the baselines, even when the full batch is used. In particular, the results in this section imply that the FERMI algorithm leads to small fairness violation with respect to ERMI and all of these other measures.
We should mention that many of these properties of f-divergences are well known or easily derived from existing results, so we do not claim great originality for any of these results. That said, we include proofs of all results for which we are not aware of references with proofs. The results in this section also hold for continuous (or discrete) $\widehat{Y}$. We now state and discuss these results before proving them.
Definition 44 (Rényi mutual information [275]) Let the Rényi mutual information of order α > 1 between random variables $\widehat{Y}$ and S given $Z \in \mathcal{Z}$ be defined as:
\[
I_\alpha(\widehat{Y}; S \,|\, Z \in \mathcal{Z}) := \frac{1}{\alpha - 1} \log\left( \mathbb{E}_{Z, \widehat{Y}, S}\left\{ \left( \frac{p_{\widehat{Y}, S|Z}(\widehat{Y}, S|Z)}{p_{\widehat{Y}|Z}(\widehat{Y}|Z)\, p_{S|Z}(S|Z)} \right)^{\!\alpha - 1} \,\middle|\, Z \in \mathcal{Z} \right\} \right), \tag{RMI}
\]
which generalizes the Shannon mutual information
\[
I_1(\widehat{Y}; S \,|\, Z \in \mathcal{Z}) := \mathbb{E}_{Z, \widehat{Y}, S}\left\{ \log\left( \frac{p_{\widehat{Y}, S|Z}(\widehat{Y}, S|Z)}{p_{\widehat{Y}|Z}(\widehat{Y}|Z)\, p_{S|Z}(S|Z)} \right) \,\middle|\, Z \in \mathcal{Z} \right\}, \tag{MI}
\]
and recovers it as $\lim_{\alpha \to 1^+} I_\alpha(\widehat{Y}; S|Z \in \mathcal{Z}) = I_1(\widehat{Y}; S|Z \in \mathcal{Z})$.
Note that $I_\alpha(\widehat{Y}; S|Z \in \mathcal{Z}) \ge 0$, with equality if and only if (Z, $\mathcal{Z}$)-fairness is satisfied.
The following is a minor variation of results in [293]:
Lemma 45 (ERMI provides an upper bound for Shannon mutual information) We have
\[
0 \le I_1(\widehat{Y}; S|Z \in \mathcal{Z}) \le I_2(\widehat{Y}; S|Z \in \mathcal{Z}) \le e^{I_2(\widehat{Y}; S|Z \in \mathcal{Z})} - 1 = D_R(\widehat{Y}; S|Z \in \mathcal{Z}). \tag{10.56}
\]
Lemma 45 also shows that ERMI is exponentially related to the Rényi mutual information of order 2. We include a proof below for completeness.
Definition 46 (Rényi correlation [147, 124, 274]) Let $\mathcal{F}$ and $\mathcal{G}$ be the sets of measurable functions such that, for random variables $\widehat{Y}$ and S, $\mathbb{E}_{\widehat{Y}}\{f(\widehat{Y}; z)\} = \mathbb{E}_S\{g(S; z)\} = 0$ and $\mathbb{E}_{\widehat{Y}}\{f(\widehat{Y}; z)^2\} = \mathbb{E}_S\{g(S; z)^2\} = 1$ for all $z \in \mathcal{Z}$. The Rényi correlation is:
\[
\rho_R(\widehat{Y}, S \,|\, Z \in \mathcal{Z}) := \sup_{f, g \in \mathcal{F} \times \mathcal{G}} \mathbb{E}_{Z, \widehat{Y}, S}\big\{ f(\widehat{Y}; Z)\, g(S; Z) \,\big|\, Z \in \mathcal{Z} \big\}. \tag{RC}
\]
Rényi correlation generalizes the Pearson correlation,
\[
\rho(\widehat{Y}, S \,|\, Z \in \mathcal{Z}) := \mathbb{E}_Z\left\{ \frac{\mathbb{E}_{\widehat{Y}, S}\{\widehat{Y} S \,|\, Z\}}{\sqrt{\mathbb{E}_{\widehat{Y}}\{\widehat{Y}^2 | Z\}\, \mathbb{E}_S\{S^2 | Z\}}} \,\middle|\, Z \in \mathcal{Z} \right\}, \tag{PC}
\]
to capture nonlinear dependencies between the random variables, by finding functions of the random variables that maximize the Pearson correlation coefficient between them. In fact, $\rho_R(\widehat{Y}, S|Z \in \mathcal{Z}) \ge 0$, with equality if and only if (Z, $\mathcal{Z}$)-fairness is satisfied. Rényi correlation has gained popularity as a measure of fairness violation [223, 23, 131]. Rényi correlation is also upper bounded by ERMI. The following result has already been shown by [223], and we present it for completeness.
Lemma 47 (ERMI provides an upper bound for Rényi correlation) We have
\[
0 \le |\rho(\widehat{Y}, S|Z \in \mathcal{Z})| \le \rho_R(\widehat{Y}, S|Z \in \mathcal{Z}) \le D_R(\widehat{Y}; S|Z \in \mathcal{Z}), \tag{10.57}
\]
and if $|\mathcal{S}| = 2$, then $D_R(\widehat{Y}; S|Z \in \mathcal{Z}) = \rho_R(\widehat{Y}, S|Z \in \mathcal{Z})$.
Definition 48 ($L_q$ fairness violation) We define the $L_q$ fairness violation for $q \ge 1$ by:
\[
L_q(\widehat{Y}, S \,|\, Z \in \mathcal{Z}) := \mathbb{E}_Z\left\{ \left( \int_{\hat{y} \in \mathcal{Y}} \sum_{s \in \mathcal{S}} \big| p_{\widehat{Y}, S|Z}(\hat{y}, s|Z) - p_{\widehat{Y}|Z}(\hat{y}|Z)\, p_{S|Z}(s|Z) \big|^{q}\, d\hat{y} \right)^{\!1/q} \,\middle|\, Z \in \mathcal{Z} \right\}. \tag{Lq}
\]
Note that $L_q(\widehat{Y}, S|Z \in \mathcal{Z}) = 0$ if and only if (Z, $\mathcal{Z}$)-fairness is satisfied. In particular, the $L_\infty$ fairness violation recovers the demographic parity violation [170, Definition 2.1] if we let Z = 0 and $\mathcal{Z} = \{0\}$. It also recovers the equal opportunity violation [141] if Z = Y and $\mathcal{Z} = A$.
Lemma 49 (ERMI provides an upper bound for $L_q$ fairness violation) Let $\widehat{Y}$ be a discrete or continuous random variable, and let S be a discrete random variable supported on a finite set. Then for any $q \ge 1$,
\[
0 \le L_q(\widehat{Y}, S \,|\, Z \in \mathcal{Z}) \le \sqrt{D_R(\widehat{Y}, S \,|\, Z \in \mathcal{Z})}. \tag{10.58}
\]
The above lemma says that if a method controls the ERMI value to impose fairness, then the $L_\infty$ violation is also controlled. In particular, the variant of ERMI specialized to demographic parity controls the $L_\infty$ demographic parity violation [170], and the variant of ERMI specialized to equal opportunity controls the $L_\infty$ equal opportunity violation [141]. While our algorithm uses ERMI as a regularizer, in our experiments we measure fairness violation through the more commonly used $L_\infty$ violation. Despite this, we show that our approach leads to better tradeoff curves between fairness violation and performance.
Remark. The bounds in Lemmas 1-3 are not tight in general, but this is not of practical concern. They show that bounding ERMI is sufficient, because any model that achieves small ERMI is guaranteed to have small values of the other fairness violations. This makes ERMI an effective regularizer for promoting fairness. In fact, in Sec. 6.3 we saw that our algorithm, FERMI, achieves the best tradeoffs between fairness violation and performance across state-of-the-art baselines.
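For discrete variables (and no conditioning variable Z), the inequalities above are easy to verify numerically; the sketch below evaluates $I_1$, $I_2$, ERMI, and the $L_1$ violation on a random joint distribution and asserts the orderings.

```python
import numpy as np

rng = np.random.default_rng(2)
J = rng.random((4, 3)); J /= J.sum()          # joint pmf of (y_hat, s)
prod = J.sum(1, keepdims=True) * J.sum(0, keepdims=True)

ratio = J / prod
I1 = np.sum(J * np.log(ratio))                 # Shannon mutual information
I2 = np.log(np.sum(J * ratio))                 # Renyi mutual information of order 2
ermi = np.sum(J * ratio) - 1.0                 # ERMI = exp(I2) - 1
L1 = np.abs(J - prod).sum()                    # L1 fairness violation

assert I1 <= I2 <= ermi + 1e-12                # the chain in (10.56)
assert L1 <= np.sqrt(ermi) + 1e-12             # the bound in (10.58)
print(I1, I2, ermi, L1)
```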
Proof: [Proof of Lemma 45] We prove the (in)equalities one by one:
• $0 \le I_1(\widehat{Y}; S|Z \in \mathcal{Z})$. This is well known, and the proof can be found in any information theory textbook [79].
• $I_1(\widehat{Y}; S|Z \in \mathcal{Z}) \le I_2(\widehat{Y}; S|Z \in \mathcal{Z})$. This is a known property of Rényi mutual information, but we provide a proof for completeness in Lemma 50 below.
• $I_2(\widehat{Y}; S|Z \in \mathcal{Z}) \le e^{I_2(\widehat{Y}; S|Z \in \mathcal{Z})} - 1$. This follows from the fact that $x \le e^{x} - 1$.
• $e^{I_2(\widehat{Y}; S|Z \in \mathcal{Z})} - 1 = D_R(\widehat{Y}; S|Z \in \mathcal{Z})$. This follows from simple algebraic manipulation.
□
Lemma 50 Let $\widehat{Y}$, S, Z be discrete or continuous random variables. Then:
(a) For any $\alpha, \beta \in [1, \infty]$ with $\beta > \alpha$, $I_\beta(\widehat{Y}; S|Z \in \mathcal{Z}) \ge I_\alpha(\widehat{Y}; S|Z \in \mathcal{Z})$.
(b) $\lim_{\alpha \to 1^+} I_\alpha(\widehat{Y}; S|Z \in \mathcal{Z}) = I_1(\widehat{Y}; S) := \mathbb{E}_Z\big\{ D_{KL}\big(p_{\widehat{Y}, S|Z} \,\|\, p_{\widehat{Y}|Z} \otimes p_{S|Z}\big) \,\big|\, Z \in \mathcal{Z} \big\}$, where $I_1(\cdot\,; \cdot)$ denotes the Shannon mutual information and $D_{KL}$ is the Kullback-Leibler divergence (relative entropy).
(c) For all $\alpha \in [1, \infty]$, $I_\alpha(\widehat{Y}; S|Z \in \mathcal{Z}) \ge 0$, with equality if and only if, for all $z \in \mathcal{Z}$, $\widehat{Y}$ and S are conditionally independent given z.
Proof: (a) First assume $0 < \alpha < \beta < \infty$ and $\alpha, \beta \ne 1$. Define $a = \alpha - 1$ and $b = \beta - 1$. Then the function $\phi(t) = t^{b/a}$ is convex for all $t \ge 0$, so by Jensen's inequality we have:
\[
\frac{1}{b} \log \mathbb{E}\left\{ \left( \frac{p(\widehat{Y}, S|Z)}{p(\widehat{Y}|Z)\, p(S|Z)} \right)^{\!b} \,\middle|\, Z \in \mathcal{Z} \right\} \ge \frac{1}{b} \log\left( \mathbb{E}\left\{ \left( \frac{p(\widehat{Y}, S|Z)}{p(\widehat{Y}|Z)\, p(S|Z)} \right)^{\!a} \,\middle|\, Z \in \mathcal{Z} \right\}^{b/a} \right) = \frac{1}{a} \log \mathbb{E}\left\{ \left( \frac{p(\widehat{Y}, S|Z)}{p(\widehat{Y}|Z)\, p(S|Z)} \right)^{\!a} \,\middle|\, Z \in \mathcal{Z} \right\}. \tag{10.59}
\]
Now suppose α = 1. Then, by the monotonicity for α ≠ 1 proved above, we have:
\[
I_1(\widehat{Y}; S) = \lim_{\alpha \to 1^-} I_\alpha(\widehat{Y}; S) = \sup_{\alpha \in (0, 1)} I_\alpha(\widehat{Y}; S) \le \inf_{\alpha > 1} I_\alpha(\widehat{Y}; S).
\]
Also, $I_\infty(\widehat{Y}; S) = \lim_{\alpha \to \infty} I_\alpha(\widehat{Y}; S) = \sup_{\alpha > 0} I_\alpha(\widehat{Y}; S)$.
(b) This is a standard property of the cumulant generating function (see [94]).
(c) It is straightforward to see that independence implies that the Rényi mutual information vanishes. Conversely, if the Rényi mutual information vanishes, then part (a) implies that the Shannon mutual information also vanishes, which implies the desired conditional independence. □
Proof: [Proof of Lemma 47] The proof consists of the following pieces.
• $0 \le |\rho(\widehat{Y}, S|Z \in \mathcal{Z})| \le \rho_R(\widehat{Y}, S|Z \in \mathcal{Z})$. This is immediate from the definition of $\rho_R(\widehat{Y}, S|Z \in \mathcal{Z})$.
• $\rho_R(\widehat{Y}, S|Z \in \mathcal{Z}) \le D_R(\widehat{Y}; S|Z \in \mathcal{Z})$. This follows from Lemma 51 below.
• Notice that if $|\mathcal{S}| = 2$, Lemma 51 implies that $D_R(\widehat{Y}; S|Z \in \mathcal{Z}) = \rho_R(\widehat{Y}, S|Z \in \mathcal{Z})$.
□
Next, we recall the following lemma, which is stated in [223] and derives from Witsenhausen's characterization of Rényi correlation:
Lemma 51 Suppose that $\mathcal{S} = [k]$. Let the $k \times k$ matrix P be defined as $P = \{P_{ij}\}_{i, j \in [k] \times [k]}$, where
\[
P_{ij} := \frac{1}{\sqrt{p_S(i)\, p_S(j)}} \int_{y \in \mathcal{Y}} \frac{p_{\widehat{Y}, S}(y, i)\, p_{\widehat{Y}, S}(y, j)}{p_{\widehat{Y}}(y)}\, dy. \tag{10.60}
\]
Let $1 = \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0$ be the eigenvalues of P. Then,
\[
\rho_R(\widehat{Y}, S) = \sigma_2, \tag{10.61}
\]
\[
D_R(\widehat{Y}; S) = \mathrm{Tr}(P) - 1 = \sum_{i=2}^{k} \sigma_i. \tag{10.62}
\]
Proof: Eq. (10.61) is proved in [347, Section 3]. To prove Eq. (10.62), notice that
\[
\mathrm{Tr}(P) = \sum_{i \in [k]} P_{ii} = \sum_{i \in [k]} \frac{1}{p_S(i)} \int_{y \in \mathcal{Y}} \frac{p_{\widehat{Y}, S}(y, i)^2}{p_{\widehat{Y}}(y)}\, dy = \mathbb{E}_{\widehat{Y}, S}\left\{ \frac{p_{\widehat{Y}, S}(\widehat{Y}, S)}{p_{\widehat{Y}}(\widehat{Y})\, p_S(S)} \right\} = 1 + D_R(\widehat{Y}; S),
\]
which completes the proof. □
Proof: [Proof of Lemma 49] It suffices to prove the inequality for $L_1$, as $L_q$ is bounded above by $L_1$ for all $q \ge 1$. The proof for the case Z = 0 and $\mathcal{Z} = \{0\}$ follows from the chain of (in)equalities:
\[
\begin{aligned}
L_1(\widehat{Y}, S) &= \sum_{s \in \mathcal{S}} \int_{y \in \mathcal{Y}} \big| p_{\widehat{Y}, S}(y, s) - p_{\widehat{Y}}(y)\, p_S(s) \big|\, dy \tag{10.63} \\
&= \sum_{s \in \mathcal{S}} \int_{y \in \mathcal{Y}} \sqrt{p_{\widehat{Y}}(y)\, p_S(s)}\; \frac{\big| p_{\widehat{Y}, S}(y, s) - p_{\widehat{Y}}(y)\, p_S(s) \big|}{\sqrt{p_{\widehat{Y}}(y)\, p_S(s)}}\, dy \tag{10.64} \\
&\le \sqrt{ \left( \sum_{s \in \mathcal{S}} \int_{y \in \mathcal{Y}} p_{\widehat{Y}}(y)\, p_S(s)\, dy \right) \left( \sum_{s \in \mathcal{S}} \int_{y \in \mathcal{Y}} \frac{\big(p_{\widehat{Y}, S}(y, s) - p_{\widehat{Y}}(y)\, p_S(s)\big)^2}{p_{\widehat{Y}}(y)\, p_S(s)}\, dy \right) } \tag{10.65} \\
&\le \sqrt{ \sum_{s \in \mathcal{S}} \int_{y \in \mathcal{Y}} \frac{\big(p_{\widehat{Y}, S}(y, s) - p_{\widehat{Y}}(y)\, p_S(s)\big)^2}{p_{\widehat{Y}}(y)\, p_S(s)}\, dy } \tag{10.66} \\
&= \sqrt{D_R(\widehat{Y}; S)}, \tag{10.67}
\end{aligned}
\]
where Eq. (10.65) follows from the Cauchy-Schwarz inequality and Eq. (10.67) follows from Lemma 52. The extension to general Z and $\mathcal{Z}$ is immediate, observing that $\rho(\widehat{Y}, S|Z \in \mathcal{Z}) = \mathbb{E}_Z\big[\rho(\widehat{Y}, S|Z) \,\big|\, Z \in \mathcal{Z}\big]$, $\rho_R(\widehat{Y}, S|Z \in \mathcal{Z}) = \mathbb{E}_Z\big[\rho_R(\widehat{Y}, S|Z) \,\big|\, Z \in \mathcal{Z}\big]$, and $D_R(\widehat{Y}, S|Z \in \mathcal{Z}) = \mathbb{E}_Z\big[D_R(\widehat{Y}, S|Z) \,\big|\, Z \in \mathcal{Z}\big]$. □
Lemma 52 We have
\[
D_R(\widehat{Y}; S) = \sum_{s \in \mathcal{S}} \int_{y \in \mathcal{Y}} \frac{\big(p_{\widehat{Y}, S}(y, s) - p_{\widehat{Y}}(y)\, p_S(s)\big)^2}{p_{\widehat{Y}}(y)\, p_S(s)}\, dy. \tag{10.68}
\]
Proof: The proof follows from the identities:
\[
\begin{aligned}
\sum_{s \in \mathcal{S}} \int_{y \in \mathcal{Y}} \frac{\big(p_{\widehat{Y}, S}(y, s) - p_{\widehat{Y}}(y)\, p_S(s)\big)^2}{p_{\widehat{Y}}(y)\, p_S(s)}\, dy
&= \sum_{s \in \mathcal{S}} \int_{y \in \mathcal{Y}} \frac{p_{\widehat{Y}, S}(y, s)^2}{p_{\widehat{Y}}(y)\, p_S(s)}\, dy - 2 \sum_{s \in \mathcal{S}} \int_{y \in \mathcal{Y}} p_{\widehat{Y}, S}(y, s)\, dy + \sum_{s \in \mathcal{S}} \int_{y \in \mathcal{Y}} p_{\widehat{Y}}(y)\, p_S(s)\, dy \tag{10.69} \\
&= \mathbb{E}\left\{ \frac{p_{\widehat{Y}, S}(\widehat{Y}, S)}{p_{\widehat{Y}}(\widehat{Y})\, p_S(S)} \right\} - 1 \tag{10.70} \\
&= D_R(\widehat{Y}; S). \tag{10.71}
\end{aligned}
\]
□
Next, we present some alternative fairness definitions and show that they are also upper bounded by ERMI.
Definition 53 (conditional demographic parity $L_\infty$ violation) Given a predictor $\widehat{Y}$ supported on $\mathcal{Y}$ and a discrete sensitive attribute S supported on a finite set $\mathcal{S}$, we define the conditional demographic parity violation by:
\[
\widetilde{dp}(\widehat{Y} | S) := \sup_{\hat{y} \in \mathcal{Y}} \max_{s \in \mathcal{S}} \big| p_{\widehat{Y}|S}(\hat{y} | s) - p_{\widehat{Y}}(\hat{y}) \big|. \tag{10.72}
\]
First, we show that $\widetilde{dp}(\widehat{Y} | S)$ is a reasonable notion of fairness violation.
Lemma 54 $\widetilde{dp}(\widehat{Y} | S) = 0$ if and only if $\widehat{Y}$ and S are independent.
Proof: By definition, $\widetilde{dp}(\widehat{Y} | S) = 0$ iff for all $\hat{y} \in \mathcal{Y}$, $s \in \mathcal{S}$, $p_{\widehat{Y}|S}(\hat{y} | s) = p_{\widehat{Y}}(\hat{y})$, iff $\widehat{Y}$ and S are independent (since we always assume $p_S(s) > 0$ for all $s \in \mathcal{S}$). □
Lemma 55 (ERMI upper bounds the conditional demographic parity $L_\infty$ violation) Let $\widehat{Y}$ be a discrete or continuous random variable supported on $\mathcal{Y}$, and let S be a discrete random variable supported on a finite set $\mathcal{S}$. Denote $p_S^{\min} := \min_{s \in \mathcal{S}} p_S(s) > 0$. Then,
\[
0 \le \widetilde{dp}(\widehat{Y} | S) \le \frac{1}{p_S^{\min}} \sqrt{D_R(\widehat{Y}; S)}. \tag{10.73}
\]
Proof: The proof follows from the chain of (in)equalities:
\[
\begin{aligned}
\widetilde{dp}(\widehat{Y} | S)^2 &= \sup_{\hat{y} \in \mathcal{Y}} \max_{s \in \mathcal{S}} \big| p_{\widehat{Y}|S}(\hat{y} | s) - p_{\widehat{Y}}(\hat{y}) \big|^2 \tag{10.74} \\
&\le \frac{1}{(p_S^{\min})^2} \sup_{\hat{y} \in \mathcal{Y}} \max_{s \in \mathcal{S}} \big( p_{\widehat{Y}, S}(\hat{y}, s) - p_{\widehat{Y}}(\hat{y})\, p_S(s) \big)^2 \tag{10.75} \\
&\le \frac{1}{(p_S^{\min})^2} \int_{\hat{y} \in \mathcal{Y}} \sum_{s \in \mathcal{S}} \frac{\big( p_{\widehat{Y}, S}(\hat{y}, s) - p_{\widehat{Y}}(\hat{y})\, p_S(s) \big)^2}{p_{\widehat{Y}}(\hat{y})\, p_S(s)}\, d\hat{y} \tag{10.76} \\
&= \frac{1}{(p_S^{\min})^2}\, D_R(\widehat{Y}; S), \tag{10.77}
\end{aligned}
\]
where Eq. (10.77) follows from Lemma 52. □
Definition 56 (conditional equal opportunity $L_\infty$ violation [141]) Let Y, $\widehat{Y}$ take values in $\mathcal{Y}$, and let $A \subseteq \mathcal{Y}$ be a compact subset denoting the advantaged outcomes (for example, the decision “to interview" an individual, or classifying an individual as “low risk" for financial purposes). We define the conditional equal opportunity $L_\infty$ violation of $\widehat{Y}$ with respect to the sensitive attribute S and the advantaged outcome A by
\[
\widetilde{eo}(\widehat{Y} | S, Y \in A) := \mathbb{E}_Y\left\{ \sup_{\hat{y} \in \mathcal{Y}} \max_{s \in \mathcal{S}} \big| p_{\widehat{Y}|S, Y}(\hat{y} | s, Y) - p_{\widehat{Y}|Y}(\hat{y} | Y) \big| \,\middle|\, Y \in A \right\}. \tag{10.78}
\]
Lemma 57 (ERMI upper bounds the conditional equal opportunity $L_\infty$ violation) Let Y, $\widehat{Y}$ be discrete or continuous random variables supported on $\mathcal{Y}$, and let S be a discrete random variable supported on a finite set $\mathcal{S}$. Let $A \subseteq \mathcal{Y}$ be a compact subset of $\mathcal{Y}$. Denote $p_{S|A}^{\min} = \min_{s \in \mathcal{S},\, y \in A} p_{S|Y}(s | y)$. Then,
\[
0 \le \widetilde{eo}(\widehat{Y} | S, Y \in A) \le \frac{1}{p_{S|A}^{\min}} \sqrt{D_R(\widehat{Y}; S \,|\, Y \in A)}. \tag{10.79}
\]
Proof: Notice that the same proof as for Lemma 55 gives, for all $y \in A$:
\[
0 \le \sup_{\hat{y} \in \mathcal{Y}} \max_{s \in \mathcal{S}} \big| p_{\widehat{Y}|S, Y}(\hat{y} | s, y) - p_{\widehat{Y}|Y}(\hat{y} | y) \big| := \widetilde{eo}(\widehat{Y} | S, Y = y) \le \frac{1}{p_{S|y}^{\min}(y)} \sqrt{D_R(\widehat{Y}; S | Y = y)} \le \frac{1}{p_{S|A}^{\min}} \sqrt{D_R(\widehat{Y}; S | Y = y)}.
\]
Hence,
\[
\begin{aligned}
\widetilde{eo}(\widehat{Y} | S, Y \in A) = \mathbb{E}_Y\big\{ \widetilde{eo}(\widehat{Y} | S, Y) \,\big|\, Y \in A \big\}
&\le \frac{1}{p_{S|A}^{\min}}\, \mathbb{E}_Y\left\{ \sqrt{D_R(\widehat{Y}; S | Y)} \,\middle|\, Y \in A \right\} \\
&\le \frac{1}{p_{S|A}^{\min}} \sqrt{ \mathbb{E}_Y\big\{ D_R(\widehat{Y}; S | Y) \,\big|\, Y \in A \big\} } = \frac{1}{p_{S|A}^{\min}} \sqrt{D_R(\widehat{Y}; S | Y \in A)},
\end{aligned}
\]
where the last inequality follows from Jensen's inequality. This completes the proof. □
10.4.4 Precise Statement and Proofs of Theorem 19 and Theorem 21
To begin, we provide the proof of Theorem 17.
Proposition 58 (Re-statement of Theorem 17) For random variables $\widehat{Y}$ and S with joint distribution $\hat{p}_{\widehat{Y}, S}$, where $\widehat{Y} \in [m]$ and $S \in [k]$, we have
\[
\widehat{D}_R(\widehat{Y}; S) = \max_{W \in \mathbb{R}^{k \times m}} \big\{ -\mathrm{Tr}(W \widehat{P}_{\hat{y}} W^T) + 2\, \mathrm{Tr}(W \widehat{P}_{\hat{y}, s} \widehat{P}_s^{-1/2}) - 1 \big\},
\]
where $\widehat{P}_{\hat{y}} = \mathrm{diag}(\hat{p}_{\widehat{Y}}(1), \dots, \hat{p}_{\widehat{Y}}(m))$, $\widehat{P}_s = \mathrm{diag}(\hat{p}_S(1), \dots, \hat{p}_S(k))$, and $(\widehat{P}_{\hat{y}, s})_{i, j} = \hat{p}_{\widehat{Y}, S}(i, j)$, and where $\hat{p}_{\widehat{Y}}(i), \hat{p}_S(j) > 0$ for all $i \in [m]$, $j \in [k]$.
Proof: Let $W^* \in \operatorname{argmax}_{W \in \mathbb{R}^{k \times m}} -\mathrm{Tr}(W \widehat{P}_{\hat{y}} W^T) + 2\, \mathrm{Tr}(W \widehat{P}_{\hat{y}, s} \widehat{P}_s^{-1/2})$. Setting the derivative of the expression on the right-hand side to zero leads to:
\[
-2 W \widehat{P}_{\hat{y}} + 2 \widehat{P}_s^{-1/2} \widehat{P}_{\hat{y}, s}^{T} = 0 \;\Longrightarrow\; W^* = \widehat{P}_s^{-1/2} \widehat{P}_{\hat{y}, s}^{T} \widehat{P}_{\hat{y}}^{-1}.
\]
Plugging in this expression for $W^*$, we have
\[
\begin{aligned}
\max_{W \in \mathbb{R}^{k \times m}} -\mathrm{Tr}(W \widehat{P}_{\hat{y}} W^T) + 2\, \mathrm{Tr}(W \widehat{P}_{\hat{y}, s} \widehat{P}_s^{-1/2})
&= -\mathrm{Tr}\big(\widehat{P}_s^{-1/2} \widehat{P}_{\hat{y}, s}^{T} \widehat{P}_{\hat{y}}^{-1} \widehat{P}_{\hat{y}, s} \widehat{P}_s^{-1/2}\big) + 2\, \mathrm{Tr}\big(\widehat{P}_s^{-1/2} \widehat{P}_{\hat{y}, s}^{T} \widehat{P}_{\hat{y}}^{-1} \widehat{P}_{\hat{y}, s} \widehat{P}_s^{-1/2}\big) \\
&= \mathrm{Tr}\big(\widehat{P}_s^{-1/2} \widehat{P}_{\hat{y}, s}^{T} \widehat{P}_{\hat{y}}^{-1} \widehat{P}_{\hat{y}, s} \widehat{P}_s^{-1/2}\big) = \mathrm{Tr}\big(\widehat{P}_s^{-1} \widehat{P}_{\hat{y}, s}^{T} \widehat{P}_{\hat{y}}^{-1} \widehat{P}_{\hat{y}, s}\big).
\end{aligned}
\]
Writing out the matrix multiplication explicitly in the last expression, we have $\widehat{P}_s^{-1} \widehat{P}_{\hat{y}, s}^{T} \widehat{P}_{\hat{y}}^{-1} \widehat{P}_{\hat{y}, s} = U V^{T}$, where $U_{i, j} = \hat{p}_S(i)^{-1}\, \hat{p}_{\widehat{Y}, S}(j, i)$ and $V_{i, j} = \hat{p}_{\widehat{Y}}(j)^{-1}\, \hat{p}_{\widehat{Y}, S}(j, i)$ for $i \in [k]$, $j \in [m]$. Hence,
\[
\max_{W \in \mathbb{R}^{k \times m}} -\mathrm{Tr}(W \widehat{P}_{\hat{y}} W^T) + 2\, \mathrm{Tr}(W \widehat{P}_{\hat{y}, s} \widehat{P}_s^{-1/2}) = \mathrm{Tr}(U V^{T}) = \sum_{i \in [k]} \sum_{j \in [m]} \frac{\hat{p}_{\widehat{Y}, S}(j, i)^2}{\hat{p}_S(i)\, \hat{p}_{\widehat{Y}}(j)} = \widehat{D}_R(\widehat{Y}; S) + 1,
\]
which completes the proof. □
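The variational characterization can be checked numerically against the direct $\chi^2$ computation, using the closed-form maximizer $W^*$ from the proof (a sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
m, k = 4, 3
J = rng.random((m, k)); J /= J.sum()           # joint \hat{p}(y_hat = i, s = j)
Py = np.diag(J.sum(axis=1))                    # \hat{P}_{y_hat}, m x m
Ps_inv_sqrt = np.diag(J.sum(axis=0) ** -0.5)   # \hat{P}_s^{-1/2}, k x k

W_star = Ps_inv_sqrt @ J.T @ np.linalg.inv(Py)            # k x m maximizer
val = (-np.trace(W_star @ Py @ W_star.T)
       + 2.0 * np.trace(W_star @ J @ Ps_inv_sqrt) - 1.0)

ermi = np.sum(J ** 2 / (J.sum(1, keepdims=True) * J.sum(0, keepdims=True))) - 1.0
print(val, ermi)                               # the two values agree
```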
Corollary 59 (Re-statement of Theorem 18) Let $(x_i, s_i, y_i)$ be a random draw from $\mathcal{D}$. Then, Eq. (FERMI obj.) is equivalent to
\[
\min_{\theta} \max_{W \in \mathbb{R}^{k \times m}} \Big\{ \widehat{F}(\theta, W) := \widehat{\mathcal{L}}(\theta) + \lambda\, \widehat{\Psi}(\theta, W) \Big\}, \tag{10.80}
\]
where $\widehat{\Psi}(\theta, W) = -\mathrm{Tr}(W \widehat{P}_{\hat{y}} W^T) + 2\, \mathrm{Tr}(W \widehat{P}_{\hat{y}, s} \widehat{P}_s^{-1/2}) - 1 = \frac{1}{N} \sum_{i=1}^{N} \hat{\psi}_i(\theta, W)$ and
\[
\begin{aligned}
\hat{\psi}_i(\theta, W) &:= -\mathrm{Tr}\big(W\, \mathbb{E}[\hat{y}(x_i, \theta)\, \hat{y}(x_i, \theta)^T \,|\, x_i]\, W^T\big) + 2\, \mathrm{Tr}\big(W\, \mathbb{E}[\hat{y}(x_i; \theta)\, s_i^T \,|\, x_i, s_i]\, \widehat{P}_s^{-1/2}\big) - 1 \\
&= -\mathrm{Tr}\big(W\, \mathrm{diag}(\mathcal{F}_1(\theta, x_i), \dots, \mathcal{F}_m(\theta, x_i))\, W^T\big) + 2\, \mathrm{Tr}\big(W\, \mathbb{E}[\hat{y}(x_i; \theta)\, s_i^T \,|\, x_i, s_i]\, \widehat{P}_s^{-1/2}\big) - 1.
\end{aligned}
\]
Proof: The proof simply follows from the fact that
\[
\max_{W \in \mathbb{R}^{k \times m}} \mathbb{E}\big[\hat{\psi}_i(\theta, W)\big] = \max_{W \in \mathbb{R}^{k \times m}} -\mathrm{Tr}(W \widehat{P}_{\hat{y}} W^T) + 2\, \mathrm{Tr}(W \widehat{P}_{\hat{y}, s} \widehat{P}_s^{-1/2}) - 1 = \widehat{D}_R(\widehat{Y}; S),
\]
where the last equality is due to Theorem 17. □
Next, we will state and prove the precise form of Theorem 19. We first recall some basic definitions:
Definition 60 A function $f$ is $L$-Lipschitz if for all $u, u' \in \operatorname{domain}(f)$ we have $\|f(u) - f(u')\| \le L\|u - u'\|$.
Definition 61 A differentiable function $f$ is $\beta$-smooth if for all $u, u' \in \operatorname{domain}(\nabla f)$ we have $\|\nabla f(u) - \nabla f(u')\| \le \beta\|u - u'\|$.
Definition 62 A differentiable function $f$ is $\mu$-strongly concave if for all $x, y \in \operatorname{domain}(f)$, we have $f(x) + \nabla f(x)^T (y - x) - \frac{\mu}{2}\|y - x\|^2 \ge f(y)$.
Definition 63 A point $\theta^* = \mathcal{A}(D)$ output by a randomized algorithm $\mathcal{A}$ is an $\epsilon$-stationary point of a differentiable function $\Phi$ if $\mathbb{E}\|\nabla \Phi(\theta^*)\| \le \epsilon$. We say $\theta^*$ is an $\epsilon$-stationary point of the nonconvex-strongly concave min-max problem $\min_\theta \max_W F(\theta, W)$ if it is an $\epsilon$-stationary point of the differentiable function $\Phi(\theta) := \max_W F(\theta, W)$.
Recall that Eq. (FERMI obj.) is equivalent to
\[
\min_\theta \max_{W \in \mathbb{R}^{k \times m}} \left\{ \widehat{F}(\theta, W) := \widehat{\mathcal{L}}(\theta) + \lambda \widehat{\Psi}(\theta, W) = \frac{1}{N} \sum_{i=1}^N \left[ \ell(x_i, y_i, \theta) + \lambda \hat{\psi}_i(\theta, W) \right] \right\}, \tag{10.81}
\]
where $\widehat{\Psi}(\theta, W) = -\operatorname{Tr}(W \widehat{P}_{\hat{y}} W^T) + 2 \operatorname{Tr}(W \widehat{P}_{\hat{y},s} \widehat{P}_s^{-1/2}) - 1 = \frac{1}{N} \sum_{i=1}^N \hat{\psi}_i(\theta, W)$ and
\[
\hat{\psi}_i(\theta, W) := -\operatorname{Tr}(W\, \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i]\, W^T) + 2 \operatorname{Tr}(W\, \mathbb{E}[\hat{y}(x_i; \theta) s_i^T | x_i, s_i]\, \widehat{P}_s^{-1/2}) - 1 = -\operatorname{Tr}(W \operatorname{diag}(F_1(\theta, x_i), \ldots, F_m(\theta, x_i)) W^T) + 2 \operatorname{Tr}(W\, \mathbb{E}[\hat{y}(x_i; \theta) s_i^T | x_i, s_i]\, \widehat{P}_s^{-1/2}) - 1,
\]
where, inside the expectations, $\hat{y}(x_i; \theta)$ and $s_i$ denote the one-hot encodings of the prediction $\hat{y}(x_i; \theta)$ and the sensitive attribute $s_i$, respectively.
Assumption 10.4.1
• $\ell(\cdot, x, y)$ is $G$-Lipschitz and $\beta_\ell$-smooth for all $x, y$.
• $F(\cdot, x)$ is $L$-Lipschitz and $b$-smooth for all $x$.
• $\hat{p}_{\hat{y}}^{\min} := \inf_{\{\theta^t, t \in [T]\}} \min_{j \in [m]} \frac{1}{N} \sum_{i=1}^N F_j(\theta, x_i) \ge \frac{\mu}{2} > 0$.
• $\hat{p}_S^{\min} := \min_{j \in [k]} \frac{1}{N} \sum_{i=1}^N \mathbf{1}_{\{s_i = j\}} > 0$.
Remark 64 As mentioned in Theorem 20, the third bullet in Assumption 10.4.1 is convenient and allows for a faster convergence rate, but it is not strictly necessary for the convergence of Algorithm 12.
Theorem 65 (Precise statement of Theorem 19) Let $\{x_i, y_i, s_i\}_{i \in [N]}$ be any given data set of features, labels, and sensitive attributes, and grant Assumption 10.4.1. Let $\mathcal{W} := B_F(0, D) \subset \mathbb{R}^{k \times m}$ (the Frobenius-norm ball of radius $D$), where $D := \frac{2}{\hat{p}_{\widehat{Y}}^{\min} \sqrt{\hat{p}_S^{\min}}}$ in Algorithm 12. Denote $\Delta_{\widehat{\Phi}} := \widehat{\Phi}(\theta_0) - \inf_\theta \widehat{\Phi}(\theta)$, where $\widehat{\Phi}(\theta) := \max_W \widehat{F}(\theta, W)$. In Algorithm 12, choose the step-sizes as $\eta_\theta = \Theta(1/(\kappa^2 \beta))$ and $\eta_W = \Theta(1/\beta)$ and the mini-batch size as $|B_t| = \Theta\left( \max\{1, \kappa \sigma^2 \epsilon^{-2}\} \right)$. Then under Assumption 10.4.1, the iteration complexity of Algorithm 12 to return an $\epsilon$-stationary point of $\widehat{\Phi}$ is bounded by
\[
O\left( \frac{\kappa^2 \beta \Delta_{\widehat{\Phi}} + \kappa \beta^2 D^2}{\epsilon^2} \right),
\]
which gives a total stochastic gradient complexity of
\[
O\left( \frac{\kappa^2 \beta \Delta_{\widehat{\Phi}} + \kappa \beta^2 D^2}{\epsilon^2} \, \max\{1, \kappa \sigma^2 \epsilon^{-2}\} \right),
\]
where
\[
\beta = 2\left[ \beta_\ell + 2\lambda D m b \left( D + \frac{1}{\sqrt{\hat{p}_S^{\min}}} \right) + 2 + 8L \left( D + \frac{1}{\sqrt{\hat{p}_S^{\min}}} \right) \right], \qquad \mu = 2\lambda \hat{p}_{\widehat{Y}}^{\min}, \qquad \kappa = \beta/\mu,
\]
\[
\sigma^2 = 16\lambda^2 (D^2 + 1) + 4G^2 + 32\lambda^2 D^2 L^2 \left( 1 + \frac{mk}{\hat{p}_S^{\min}} \right).
\]
Remark 66 The larger minibatch size is necessary to obtain the faster $O(\epsilon^{-4})$ convergence rate via two-timescale SGDA. However, as noted in [204, p.8], their proof readily extends to any batch size $|B_t| \ge 1$, showing that two-timescale SGDA still converges. But with $|B_t| = 1$, the iteration complexity becomes slower: $O(\kappa^3 \epsilon^{-5})$. This is the informal Theorem 19 that was stated in the main body.
In light of Corollary 18, Theorem 65 follows from [204, Theorem 4.5] combined with the following technical lemmas. We assume Assumption 10.4.1 holds for the remainder of the proof of Theorem 65:
Lemma 67 If $(x_i, y_i, s_i)$ are drawn uniformly at random from the data set $D$, then the gradients of $\ell(x_i, y_i, \theta) + \lambda \hat{\psi}_i(\theta, W)$ are unbiased estimators of the gradients of $\widehat{F}(\theta, W)$ for all $\theta, W, \lambda$:
\[
\mathbb{E}[\nabla_\theta \ell(x_i, y_i, \theta) + \lambda \nabla_\theta \hat{\psi}_i(\theta, W)] = \nabla_\theta \widehat{F}(\theta, W), \quad \text{and} \quad \mathbb{E}[\lambda \nabla_W \hat{\psi}_i(\theta, W)] = \nabla_W \widehat{F}(\theta, W).
\]
Furthermore, if $\|W\|_F \le D$, then the variance of the stochastic gradients is bounded as follows:
\[
\sup_{\theta, W} \mathbb{E}\|\nabla \ell(x_i, y_i, \theta) + \lambda \nabla \hat{\psi}_i(\theta, W) - \nabla \widehat{F}(\theta, W)\|^2 \le \sigma^2, \tag{10.82}
\]
where $\sigma^2 = 16\lambda^2 (D^2 + 1) + 4G^2 + 32\lambda^2 D^2 L^2 \left( 1 + \frac{mk}{\hat{p}_S^{\min}} \right)$.
Proof: Unbiasedness is obvious. For the variance bound, we will show that
\[
\sup_{\theta, W} \mathbb{E}\|\lambda \nabla_W \hat{\psi}_i(\theta, W) - \nabla_W \widehat{F}(\theta, W)\|^2 \le \sigma_w^2, \tag{10.83}
\]
and
\[
\sup_{\theta, W} \mathbb{E}\|\nabla_\theta \ell(x_i, y_i, \theta) + \lambda \nabla_\theta \hat{\psi}_i(\theta, W) - \nabla_\theta \widehat{F}(\theta, W)\|^2 \le \sigma_\theta^2, \tag{10.84}
\]
where $\sigma^2 = \sigma_\theta^2 + \sigma_w^2$. First,
\[
\nabla_W \hat{\psi}_i(\theta, W) = -2W\, \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i] + 2 \widehat{P}_s^{-1/2}\, \mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i]. \tag{10.85}
\]
Thus, for any $\theta, W, \lambda$, we have
\[
\begin{aligned}
&\mathbb{E}\|\lambda \nabla_W \hat{\psi}_i(\theta, W) - \nabla_W \widehat{F}(\theta, W)\|_F^2 \\
&= \frac{4\lambda^2}{N} \sum_{i=1}^N \Big\| W\, \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i] - \widehat{P}_s^{-1/2}\, \mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i] - \frac{1}{N} \sum_{i'=1}^N \left( W\, \mathbb{E}[\hat{y}(x_{i'}, \theta) \hat{y}(x_{i'}, \theta)^T | x_{i'}] - \widehat{P}_s^{-1/2}\, \mathbb{E}[s_{i'} \hat{y}(x_{i'}, \theta)^T | x_{i'}, s_{i'}] \right) \Big\|_F^2 \\
&\le \frac{4\lambda^2}{N} \sum_{i=1}^N 2\left[ \|W\|_F^2 \left\| \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i] - \widehat{P}_{\hat{y}} \right\|_F^2 + \left\| \widehat{P}_s^{-1/2} \left( \mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i] - \widehat{P}_{\hat{y},s}^T \right) \right\|_F^2 \right] \\
&\le \frac{4\lambda^2}{N} \sum_{i=1}^N 2\left( 2D^2 + \left\| \widehat{P}_s^{-1/2} \left( \mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i] - \widehat{P}_{\hat{y},s}^T \right) \right\|_F^2 \right) \le \frac{4\lambda^2}{N} \sum_{i=1}^N 2(2D^2 + 2) \le 16\lambda^2 (D^2 + 1),
\end{aligned}
\]
where we used Young's inequality, the Frobenius norm inequality $\|AB\|_F \le \|A\|_F \|B\|_F$, and the facts that $\|\mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i]\|_F^2 = \sum_{j=1}^m F_j(\theta, x_i)^2 \le 1$ and
\[
\|\widehat{P}_s^{-1/2}\, \mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i]\|_F^2 = \sum_{j=1}^k s_{i,j}^2 \sum_{l=1}^m F_l(x_i, \theta)^2 \le 1,
\]
for all $i \in [N]$ (since for every $i \in [N]$, only one of the $s_{i,j}$ is non-zero and equal to 1, and $\sum_{l=1}^m F_l(\theta, x_i) = 1$).
Next,
\[
\lambda \nabla_\theta \hat{\psi}_i(\theta, W) = \lambda \left[ -\nabla_\theta \operatorname{vec}\left( \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i] \right)^T \operatorname{vec}(W^T W) + 2 \nabla_\theta \operatorname{vec}\left( \mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i] \right) \operatorname{vec}(W^T \widehat{P}_s^{-1/2}) \right]. \tag{10.86}
\]
Hence, for any $\theta, W$, we have
\[
\begin{aligned}
&\mathbb{E}\|\nabla_\theta \ell(x_i, y_i, \theta) + \lambda \nabla_\theta \hat{\psi}_i(\theta, W) - \nabla_\theta \widehat{F}(\theta, W)\|^2 \\
&\le 2\left[ 2 \sup_{x_i, y_i} \|\nabla_\theta \ell(x_i, y_i, \theta)\|^2 + \lambda^2 \sup_{x_i, y_i, s_i} \left\| -\nabla_\theta \operatorname{vec}(\mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i])^T \operatorname{vec}(W^T W) + 2 \nabla_\theta \operatorname{vec}(\mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i]) \operatorname{vec}(W^T \widehat{P}_s^{-1/2}) \right\|^2 \right] \\
&\le 4\left[ G^2 + 2\lambda^2 \left( \sup_{x_i} \left\| \nabla_\theta \operatorname{vec}(\mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i])^T \operatorname{vec}(W^T W) \right\|^2 + 4 \sup_{x_i, s_i} \left\| \nabla_\theta \operatorname{vec}(\mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i]) \operatorname{vec}(W^T \widehat{P}_s^{-1/2}) \right\|^2 \right) \right],
\end{aligned}
\]
by Young's and Jensen's inequalities and the assumption that $\ell(x_i, y_i, \cdot)$ is $G$-Lipschitz. Now,
\[
\nabla_\theta \operatorname{vec}(\mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i])^T \operatorname{vec}(W^T W) = \sum_{l=1}^m \nabla F_l(x_i, \theta) \sum_{j=1}^k W_{j,1} W_{j,l},
\]
which implies
\[
\|\nabla_\theta \operatorname{vec}(\mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i])^T \operatorname{vec}(W^T W)\|^2 \le \sum_{j,l} W_{j,l}^2 \sup_{l \in [m], x, \theta} \|\nabla F_l(x, \theta)\|^2 \le D^2 L^2, \tag{10.87}
\]
by $L$-Lipschitzness of $F(\cdot, x)$. Also,
\[
\nabla_\theta \operatorname{vec}(\mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i]) \operatorname{vec}(W^T \widehat{P}_s^{-1/2}) = \sum_{r=1}^k \sum_{j=1}^m \nabla F_j(\theta, x_i) \frac{s_{i,r} W_{r,j}}{\sqrt{\hat{p}_S(r)}},
\]
which implies
\[
\left\| \nabla_\theta \operatorname{vec}(\mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i]) \operatorname{vec}(W^T \widehat{P}_s^{-1/2}) \right\|^2 \le mk \sum_{r=1}^k \sum_{j=1}^m \sup_{x_i, \theta} \|\nabla F_j(\theta, x_i)\|^2 \left( \frac{s_{i,r} W_{r,j}}{\sqrt{\hat{p}_S(r)}} \right)^2 \le \frac{mk}{\hat{p}_S^{\min}} L^2 D^2.
\]
Thus,
\[
\sigma_\theta^2 \le 4G^2 + 32\lambda^2 D^2 L^2 \left( 1 + \frac{mk}{\hat{p}_S^{\min}} \right).
\]
Combining the $\theta$- and $W$-variance bounds yields the lemma. □
Lemma 68 Let
\[
\widehat{F}(\theta, W) = \frac{1}{N} \sum_{i \in [N]} \ell(x_i, y_i; \theta) + \lambda \hat{\psi}_i(\theta, W),
\]
where
\[
\hat{\psi}_i(\theta, W) = -\operatorname{Tr}(W\, \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i]\, W^T) + 2 \operatorname{Tr}(W\, \mathbb{E}[\hat{y}(x_i; \theta) s_i^T | x_i, s_i]\, \widehat{P}_s^{-1/2}) - 1.
\]
Then:
1. $\widehat{F}$ is $\beta$-smooth, where $\beta = 2\left[ \beta_\ell + 2\lambda D m b \left( D + \frac{1}{\sqrt{\hat{p}_S^{\min}}} \right) + 2 + 8L \left( D + \frac{1}{\sqrt{\hat{p}_S^{\min}}} \right) \right]$.
2. $\widehat{F}(\theta, \cdot)$ is $2\lambda \hat{p}_{\widehat{Y}}^{\min}$-strongly concave for all $\theta^t$.
3. If $\mathcal{W} = B_F(0, D)$ with $D \ge \frac{2}{\hat{p}_{\widehat{Y}}^{\min} \sqrt{\hat{p}_S^{\min}}}$, then Eq. (6.1) $= \min_\theta \max_{W \in \mathcal{W}} \widehat{F}(\theta, W)$.
Proof: We shall freely use the expressions for the derivatives of $\hat{\psi}_i$ obtained in the proof of Theorem 67.
1. First,
\[
\|\nabla_w \widehat{F}(\theta, W) - \nabla_w \widehat{F}(\theta, W')\|_F \le 2 \sup_{x_i} \|W\, \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i] - W'\, \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i]\|_F \le 2\|W - W'\|_F,
\]
since $F_j(\theta, x_i) \le 1$ for all $j \in [m]$. Next,
\[
\begin{aligned}
\|\nabla_w \widehat{F}(\theta, W) - \nabla_w \widehat{F}(\theta', W)\|_F^2
&\le 8 \sup_{x_i, s_i, y_i} \left[ D^2 \left\| \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i] - \mathbb{E}[\hat{y}(x_i, \theta') \hat{y}(x_i, \theta')^T | x_i] \right\|_F^2 + \left\| \widehat{P}_s^{-1/2} \left( \mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i] - \mathbb{E}[s_i \hat{y}(x_i, \theta')^T | x_i, s_i] \right) \right\|_F^2 \right] \\
&\le 8 \sup_{x_i, s_i, y_i} \left[ D^2 \left\| F(\theta, x_i) - F(\theta', x_i) \right\|_F^2 + \sum_{j=1}^m \sum_{r=1}^k |F_j(\theta, x_i) - F_j(\theta', x_i)|^2\, \hat{p}_S(r)^{-1} s_{i,r}^2 \right] \\
&\le 8 \left[ D^2 L^2 \|\theta - \theta'\|^2 + \frac{L^2}{\hat{p}_S^{\min}} \|\theta - \theta'\|^2 \right],
\end{aligned}
\]
which implies
\[
\|\nabla_w \widehat{F}(\theta, W) - \nabla_w \widehat{F}(\theta', W)\|_F \le 8L \left( D + \frac{1}{\sqrt{\hat{p}_S^{\min}}} \right) \|\theta - \theta'\|.
\]
Lastly,
\[
\begin{aligned}
\|\nabla_\theta \widehat{F}(\theta, W) - \nabla_\theta \widehat{F}(\theta', W)\|
&\le \sup_{x_i, y_i, s_i} \Big[ \|\nabla \ell(x_i, y_i, \theta) - \nabla \ell(x_i, y_i, \theta')\| \\
&\quad + \lambda \left\| \left( -\nabla_\theta \operatorname{vec}(\mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i])^T + \nabla_\theta \operatorname{vec}(\mathbb{E}[\hat{y}(x_i, \theta') \hat{y}(x_i, \theta')^T | x_i])^T \right) \operatorname{vec}(W^T W) \right\| \\
&\quad + 2\lambda \left\| \left[ \nabla_\theta \operatorname{vec}(\mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i]) - \nabla_\theta \operatorname{vec}(\mathbb{E}[s_i \hat{y}(x_i, \theta')^T | x_i, s_i]) \right] \operatorname{vec}(W^T \widehat{P}_s^{-1/2}) \right\| \Big] \\
&\le \beta_\ell \|\theta - \theta'\| + \lambda D^2 \sup_x \sum_{l=1}^m \|\nabla F_l(\theta, x) - \nabla F_l(\theta', x)\| + 2\lambda \sup_{x, r \in [k]} \sum_{j=1}^m \left\| \left( \nabla F_j(\theta, x) - \nabla F_j(\theta', x) \right) \hat{p}_S(r)^{-1/2} W_{r,j} \right\| \\
&\le \left[ \beta_\ell + 2\lambda \left( D^2 b + \frac{D b}{\sqrt{\hat{p}_S^{\min}}} \right) \right] \|\theta - \theta'\|,
\end{aligned}
\]
by Assumption 10.4.1. Combining the above inequalities yields part 1.
2. We have $\nabla_{ww}^2 \widehat{F}(\theta, W) = -2\lambda \widehat{P}_{\hat{y}}$, which is a diagonal matrix with
\[
(\nabla_{ww}^2 \widehat{F}(\theta, W))_{j,j} = -2\lambda \frac{1}{N} \sum_{i=1}^N F_j(x_i, \theta) \le -2\lambda \hat{p}_{\widehat{Y}}^{\min},
\]
by Assumption 10.4.1. Thus, $\widehat{F}(\theta, \cdot)$ is $2\lambda \hat{p}_{\widehat{Y}}^{\min}$-strongly concave for all $\theta$.
3. Our choice of $D$ ensures that $W^*(\theta^*) \in \operatorname{int}(\mathcal{W})$, since
\[
\|W^*(\theta^*)\|_F = \|\widehat{P}_s^{-1/2} \widehat{P}_{\hat{y},s}(\theta^*)^T \widehat{P}_{\hat{y}}(\theta^*)^{-1}\|_F \tag{10.88}
\]
\[
\le \frac{1}{\hat{p}_{\widehat{Y}}^{\min} \sqrt{\hat{p}_S^{\min}}}. \tag{10.89}
\]
Therefore, $\max_{W \in \mathcal{W}} \widehat{F}(\theta, W) = \max_W \widehat{F}(\theta, W)$, which implies part 3 of the lemma. □
By Assumption 10.4.1 and Theorem 68, our choice of $\mathcal{W}$ implies that $W^*(\theta^*) \in \mathcal{W}$ and hence that the solution of Eq. (FERMI obj.) solves
\[
\min_\theta \max_{W \in \mathcal{W}} \left\{ \widehat{F}(\theta, W) := \widehat{\mathcal{L}}(\theta) + \lambda \widehat{\Psi}(\theta, W) \right\}.
\]
This enables us to establish the convergence of Algorithm 12 (which involves projection) to a stationary point of the unconstrained min-max optimization problem Eq. (6.1) that we consider. The projection step for $W^t$ in Algorithm 12 is necessary to ensure that the iterates $W^t$ remain bounded, and hence that the smoothness and bounded variance conditions on $\widehat{F}$ are satisfied at every iteration.
10.4.5 Proof of Theorem 21
Now we turn to the proof of Theorem 21. We first re-state and prove Theorem 22.
Proposition 69 (Restatement of Proposition 22) Let $\{z_i\}_{i=1}^n = \{x_i, s_i, y_i\}_{i=1}^n$ be drawn i.i.d. from an unknown joint distribution $\mathcal{D}$. Denote
\[
\hat{\psi}_i^{(n)}(\theta, W) = -\operatorname{Tr}(W\, \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i]\, W^T) + 2 \operatorname{Tr}\left( W\, \mathbb{E}[\hat{y}(x_i; \theta) s_i^T | x_i, s_i] \left( \widehat{P}_s^{(n)} \right)^{-1/2} \right) - 1,
\]
where $\widehat{P}_s^{(n)} = \frac{1}{n} \sum_{i=1}^n \operatorname{diag}(\mathbf{1}_{\{s_i = 1\}}, \cdots, \mathbf{1}_{\{s_i = k\}})$. Denote $\Psi(\theta, W) = -\operatorname{Tr}(W P_{\hat{y}} W^T) + 2 \operatorname{Tr}(W P_{\hat{y},s} P_s^{-1/2}) - 1$, where $P_{\hat{y}} = \operatorname{diag}(\mathbb{E} F_1(\theta, x), \cdots, \mathbb{E} F_m(\theta, x))$, $(P_{\hat{y},s})_{j,r} = \mathbb{E}_{x_i, s_i}[F_j(\theta, x_i) s_{i,r}]$ for $j \in [m]$, $r \in [k]$, and $P_s = \operatorname{diag}(p_S(1), \cdots, p_S(k))$. Assume $p_S(r) > 0$ for all $r \in [k]$. Then,
\[
\max_W \Psi(\theta, W) = D_R(\widehat{Y}(\theta); S)
\]
and
\[
\lim_{n \to \infty} \mathbb{E}[\hat{\psi}_i^{(n)}(\theta, W)] = \Psi(\theta, W).
\]
Proof: The first claim, that $\max_W \Psi(\theta, W) = D_R(\widehat{Y}(\theta); S)$, is immediate from Theorem 17 and its proof, by replacing the empirical probabilities with $\mathcal{D}$-probabilities everywhere. For the second claim, we clearly have
\[
\mathbb{E}[\hat{\psi}_i^{(n)}(\theta, W)] = \mathbb{E}\left[ -\operatorname{Tr}(W\, \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i]\, W^T) \right] + 2\, \mathbb{E}\left[ \operatorname{Tr}\left( W\, \mathbb{E}[\hat{y}(x_i; \theta) s_i^T | x_i, s_i] \left( \widehat{P}_s^{(n)} \right)^{-1/2} \right) \right] - 1 = -\operatorname{Tr}(W P_{\hat{y}} W^T) + 2\, \mathbb{E}\left[ \operatorname{Tr}\left( W\, \mathbb{E}[\hat{y}(x_i; \theta) s_i^T | x_i, s_i] \left( \widehat{P}_s^{(n)} \right)^{-1/2} \right) \right] - 1, \tag{10.90}
\]
for any $n \ge 1$. Now, $\widehat{P}_s^{(n)}(r)$ converges almost surely (and in probability) to $p_S(r)$ by the strong law of large numbers, and $\mathbb{E}[\widehat{P}_s^{(n)}(r)] = p_S(r)$. Thus, $\widehat{P}_s^{(n)}(r)$ is a consistent estimator of $p_S(r)$. Then by the continuous mapping theorem and the assumption that $p_S(r) \ge C$ for some $C > 0$, we have that $\widehat{P}_s^{(n)}(r)^{-1/2}$ converges almost surely (and in probability) to $p_S(r)^{-1/2}$. Moreover, we claim that there exists $N^* \in \mathbb{N}$ such that for any $n \ge N^*$, $\operatorname{Var}\left( \widehat{P}_s^{(n)}(r)^{-1/2} \right) \le \frac{2}{C} < \infty$. To see why this claim holds, note that the definition of almost sure convergence of $\widehat{P}_s^{(n)}(r)$ to $p_S(r)$ implies that, with probability 1, there exists $N^*$ such that for all $n \ge N^*$,
\[
\min_{r \in [k]} \widehat{P}_s^{(n)}(r) \ge \min_{r \in [k]} p_S(r) - C/2 \ge C/2.
\]
Thus, $\operatorname{Var}\left( \widehat{P}_s^{(n)}(r)^{-1/2} \right) \le \mathbb{E}\left[ \widehat{P}_s^{(n)}(r)^{-1} \right] \le \frac{2}{C}$. Therefore $\widehat{P}_s^{(n)}(r)^{-1/2}$ is a consistent estimator with uniformly bounded variance, hence it is asymptotically unbiased: $\lim_{n \to \infty} \mathbb{E}\left[ \widehat{P}_s^{(n)}(r)^{-1/2} \right] = p_S(r)^{-1/2}$. Furthermore, $\left( \mathbb{E}[\hat{y}(x_i; \theta) s_i^T | x_i, s_i] \left( \widehat{P}_s^{(n)} \right)^{-1/2} \right)_{j,r} \le \frac{2}{C}$ for all $n \ge N^*$, and $\left( \mathbb{E}[\hat{y}(x_i; \theta) s_i^T | x_i, s_i] \left( \widehat{P}_s^{(n)} \right)^{-1/2} \right)_{j,r}$ converges almost surely to $\left( \mathbb{E}[\hat{y}(x_i; \theta) s_i^T | x_i, s_i]\, P_S^{-1/2} \right)_{j,r}$ as $n \to \infty$ (for any $j \in [m]$, $r \in [k]$). Thus, by Lebesgue's dominated convergence theorem, we have
\[
\lim_{n \to \infty} \mathbb{E}\left[ \left( \mathbb{E}[\hat{y}(x_i; \theta) s_i^T | x_i, s_i] \left( \widehat{P}_s^{(n)} \right)^{-1/2} \right)_{j,r} \right] = \mathbb{E}\left[ \left( \mathbb{E}[\hat{y}(x_i; \theta) s_i^T | x_i, s_i] \right)_{j,r} \lim_{n \to \infty} \widehat{P}_s^{(n)}(r)^{-1/2} \right] = \mathbb{E}[s_{i,r} F_j(\theta, x_i)] \lim_{n \to \infty} \widehat{P}_s^{(n)}(r)^{-1/2} = (P_{\hat{y},s})_{j,r}\, p_S(r)^{-1/2} = (P_{\hat{y},s} P_s^{-1/2})_{j,r}, \tag{10.91}
\]
for all $j \in [m]$, $r \in [k]$. Combining Eq. (10.90) with Eq. (10.91) (and using linearity of the trace and of matrix multiplication) proves the second claim. □
We are now ready to prove Theorem 21.
Proof: [Proof of Theorem 21] Denote $\Phi(\theta) := \max_W F(\theta, W)$ for the population-level objective
\[
F(\theta, W) := \mathcal{L}(\theta) + \lambda \Psi(\theta, W)
\]
(using the notation in Theorem 22). Let $\theta^*$ denote the output of the one-pass/sample-without-replacement version of Algorithm 12, run on the modified empirical objective where $\left( \widehat{P}_s^{(n)} \right)^{-1/2}$ is replaced by the true sensitive attribute matrix $P_S^{-1/2}$. That is, $\theta^* \sim \operatorname{Unif}(\theta_1^*, \ldots, \theta_T^*)$, where $\theta_t^*$ denotes the $t$-th iterate of the modified FERMI algorithm just described. Then, given i.i.d. samples, the stochastic gradients are unbiased (with respect to the population distribution $\mathcal{D}$) for any minibatch size, by Corollary 18 and its proof. Further, the without-replacement sampling strategy ensures that the stochastic gradients are independent across iterations. Additionally, the proof of Theorem 22 showed that $\left( \widehat{P}_s^{(n)} \right)^{-1/2}$ converges almost surely to $P_S^{-1/2}$. Thus, there exists $N$ such that if $n \ge N \ge T = \Omega(\epsilon^{-5})$, then $\min_{r \in [k]} \widehat{P}_s^{(n)}(r) > 0$ (by almost sure convergence of $\widehat{P}_s$, see the proof of Theorem 22), and
\[
\mathbb{E}\|\nabla \Phi(\theta^*)\|^2 \le \frac{\epsilon}{4}, \tag{10.92}
\]
by Theorem 19 and its proof. Let $\hat{\theta}_t^{(n)}$ denote the $t$-th iterate of the one-pass version of Algorithm 12 run on the empirical objective (with $\left( \widehat{P}_s^{(n)} \right)^{-1/2}$). Now,
\[
\nabla_\theta \hat{\psi}_i(\theta, W) = -\nabla_\theta \operatorname{vec}(\mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T | x_i])^T \operatorname{vec}(W^T W) + 2 \nabla_\theta \operatorname{vec}(\mathbb{E}[s_i \hat{y}(x_i, \theta)^T | x_i, s_i]) \operatorname{vec}\left( W^T \left( \widehat{P}_s^{(n)} \right)^{-1/2} \right),
\]
which shows that $\hat{\theta}_t^{(n)}$ is a continuous (indeed, linear) function of $\left( \widehat{P}_s^{(n)} \right)^{-1/2}$ for every $t$. Thus, the continuous mapping theorem implies that $\hat{\theta}_t^{(n)}$ converges almost surely to $\theta_t^*$ as $n \to \infty$ for every $t \in [T]$. Hence, if $\hat{\theta}^{(n)} \sim \operatorname{Unif}\left( \hat{\theta}_1^{(n)}, \ldots, \hat{\theta}_T^{(n)} \right)$, then $\hat{\theta}^{(n)}$ converges almost surely to $\theta^*$. Now, for any $\theta$, let us denote $W(\theta) = \operatorname{argmax}_W F(\theta, W)$. Recall that by Danskin's theorem [88], we have $\nabla \Phi(\theta) = \nabla_\theta F(\theta, W(\theta))$. Then,
\[
\begin{aligned}
\|\nabla \Phi(\hat{\theta}^{(n)}) - \nabla \Phi(\theta^*)\|^2
&\le 2 \left\| \nabla_\theta F(\hat{\theta}^{(n)}, W(\hat{\theta}^{(n)})) - \nabla_\theta F(\theta^*, W(\hat{\theta}^{(n)})) \right\|^2 + 2 \left\| \nabla_\theta F(\theta^*, W(\hat{\theta}^{(n)})) - \nabla_\theta F(\theta^*, W(\theta^*)) \right\|^2 \\
&\le 2 \left[ \beta^2 \|\hat{\theta}^{(n)} - \theta^*\|^2 + \beta^2 \|W(\hat{\theta}^{(n)}) - W(\theta^*)\|^2 \right] \le 2 \left[ \beta^2 \|\hat{\theta}^{(n)} - \theta^*\|^2 + \frac{2\beta^2 L^2}{\mu^2} \|\hat{\theta}^{(n)} - \theta^*\|^2 \right],
\end{aligned}
\]
where $L$ denotes the Lipschitz parameter of $F$, $\beta$ is the Lipschitz parameter of $\nabla F$, and $\mu$ is the strong concavity parameter of $F(\theta, \cdot)$; see Theorem 68 and its proof (in Sec. 10.4.4) for the explicit $\beta$, $L$, and $\mu$. We used Danskin's theorem and Young's inequality in the first line, $\beta$-Lipschitz continuity of $\nabla F$ in the second line, and $\frac{2L}{\mu}$-Lipschitz continuity of the $\operatorname{argmax}_W F(\theta, W)$ map for $\mu$-strongly concave and $L$-Lipschitz $F(\theta, \cdot)$ (see e.g. [211, Lemma B.2]). Letting $n \to \infty$ makes $\|\hat{\theta}^{(n)} - \theta^*\|^2 \to 0$ almost surely, and hence $\|\nabla \Phi(\hat{\theta}^{(n)}) - \nabla \Phi(\theta^*)\|^2 \to 0$ almost surely. Furthermore, Danskin's theorem and Lipschitz continuity of $\nabla_\theta F$ imply that $\|\nabla \Phi(\hat{\theta}^{(n)}) - \nabla \Phi(\theta^*)\|^2 \le C$ almost surely for some absolute constant $C > 0$ and all $n$ sufficiently large. Therefore, we may apply Lebesgue's dominated convergence theorem to get $\lim_{n \to \infty} \mathbb{E}\|\nabla \Phi(\hat{\theta}^{(n)}) - \nabla \Phi(\theta^*)\|^2 = \mathbb{E}\left[ \lim_{n \to \infty} \|\nabla \Phi(\hat{\theta}^{(n)}) - \nabla \Phi(\theta^*)\|^2 \right] = 0$. In particular, there exists $N$ such that $n \ge N \implies \mathbb{E}\|\nabla \Phi(\hat{\theta}^{(n)}) - \nabla \Phi(\theta^*)\|^2 \le \frac{\epsilon}{4}$. Combining this with Eq. (10.92) and Young's inequality completes the proof. □
10.4.6 Experiment Details and Additional Results
10.4.6.1 Model description
For all the experiments, the model's output is of the form O = softmax(Wx + b). The model outputs are treated as conditional probabilities $p(\hat{y} = i \,|\, x) = O_i$, which are then used to estimate the ERMI regularizer. We encode the true class label Y and the sensitive attribute S using one-hot encoding. We define ℓ(·) as the cross-entropy loss between the one-hot encoded class label Y and the predicted output vector O.
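A minimal numpy rendering of this output layer and loss (our illustrative sketch, not the actual experiment scripts):

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def model_output(W, b, x):
    """O = softmax(Wx + b); O[i] is read as p(yhat = i | x)."""
    return softmax(x @ W.T + b)

def cross_entropy(O, y_onehot):
    return -(y_onehot * np.log(O + 1e-12)).sum(axis=-1).mean()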
We use logistic regression as the base classification model for all experiments in Fig. 6.1. The choice of logistic regression is due to the fact that all of the existing approaches demonstrated in Fig. 6.1 use the same classification model. The model parameters are estimated using the algorithm described in Algorithm 12. The trade-off curves for FERMI are generated by sweeping across different values of λ ∈ [0, 10000]. The learning rates ηθ and ηW are constant during the optimization process and are chosen from the interval [0.0005, 0.01] for all datasets. Moreover, the number of iterations T for the experiments in Fig. 6.1 is fixed to 2000. Since the training and test data for the Adult dataset are separated and fixed, we do not report confidence intervals for the test accuracy on Adult. We generate ten distinct train/test splits for each of the German and COMPAS datasets by randomly sampling 80% of the data points as training data and the remaining 20% as test data. For a given method in Fig. 6.1, the corresponding curve is generated by averaging the test accuracy over the 10 train/test splits. Furthermore, the confidence intervals are estimated from the standard deviation of the test accuracy across these 10 splits.
To perform the experiments in Sec. 6.3.3, we use a linear model with softmax activation. The model parameters are estimated using the algorithm described in Sec. 6.3. The dataset is cleaned and processed as described in [170]. The trade-off curves for FERMI are generated by sweeping across different values of λ in the [0, 100] interval, the learning rate η in [0.0005, 0.01], and the number of iterations T in [50, 200].
For the experiments in Sec. 6.3.4, we create the synthetic color MNIST dataset as described by [197]. We set the value σ = 0. In Fig. 6.6, we compare the performance of the stochastic solver (Algorithm 12) against the baselines. We use a mini-batch of size 512 for the stochastic solver. The color MNIST data has 60,000 training samples, so using the stochastic solver gives a speedup of around 100x per iteration and an overall speedup of around 40x. We present our results on two neural network architectures, namely LeNet-5 [185] and a multi-layer perceptron (MLP). The MLP has two hidden layers (with 300 and 100 nodes) and an output layer with ten nodes. A ReLU activation follows each hidden layer, and a softmax activation follows the output layer.
Some general advice for tuning λ: larger values of λ generally translate to better fairness, but one must be careful not to use a very large value of λ, as it can lead to poor generalization performance of the model. The optimal values of λ, η, and T largely depend on the data and the intended application. We recommend starting with λ ≈ 10. In Appendix 10.4.6.4, we observe the effect of changing λ on model accuracy and fairness for the COMPAS dataset.
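A sweep of the kind used for the trade-off curves can be sketched as follows (train_fermi, test_error, and dp_violation are hypothetical placeholders standing in for Algorithm 12 and the evaluation metrics):

# Hypothetical lambda sweep for generating accuracy-fairness trade-off curves.
lambdas = [0, 0.1, 1, 10, 100, 1000, 10000]
curve = []
for lam in lambdas:
    theta = train_fermi(train_data, lam=lam, lr=0.001, iters=2000)
    curve.append((test_error(theta, test_data),
                  dp_violation(theta, test_data)))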
10.4.6.2 More comparison to [223]
The algorithm proposed by [223] backpropagates through the batch estimate of ERMI, which is biased, especially for small minibatches. Our work uses a correct and unbiased implementation of a stochastic ERMI estimator. Furthermore, [223] does not establish any convergence guarantees, and in fact their algorithm does not converge. See Fig. 10.9 for the evolution of the training loss (i.e., the value of the objective function) and the test accuracy. For this experiment, we follow the same setup used in [223, Table 1]; the minibatch size is 128.
Figure 10.9: [223] fails to converge to a stationary point whereas our stochastic algorithm easily converges.
10.4.6.3 Performance in the presence of outliers & class-imbalance
We also performed an additional experiment on Adult (the setup of Fig. 6.1) with a random 10% of the sensitive attributes in training forced to 0. FERMI offers the most favorable tradeoffs on clean test data; however, all methods reach a higher plateau (see Fig. 10.10). The interplay between fairness, robustness, and generalization is an important future direction. With respect to imbalanced sensitive groups, the experiments in Fig. 6.5 are on a naturally imbalanced dataset, where max_{s∈S} p(s) / min_{s∈S} p(s) > 100 for 3-18 sensitive attributes, and FERMI offers the most favorable tradeoffs.
Figure 10.10: Comparing FERMI with other methods in the presence of outliers (a random 10% of sensitive attributes in training forced to 0). FERMI still achieves a better trade-off compared to all other baselines.
10.4.6.4 Effect of hyperparameter λ on the accuracy-fairness tradeoffs
We run the FERMI algorithm in the binary setting on the COMPAS dataset to investigate the effect of hyper-parameter tuning on the accuracy-fairness trade-off of the algorithm. As can be observed in Fig. 10.11, increasing λ from 0 to 1000 slightly increases the test error (left axis, red curves). On the other hand, the fairness violation (right axis, green curves) decreases as we increase λ to 1000. Moreover, for both notions of fairness (demographic parity, solid curves, and equality of opportunity, dashed curves), the trade-off between test error and fairness follows a similar pattern. To measure the fairness violation, we use the demographic parity violation and the equality of opportunity violation defined in Section 6.3 for the solid and dashed curves, respectively.
Figure 10.11: Tradeoff of fairness violation vs. test error for the FERMI algorithm on the COMPAS and Adult datasets. The solid and dashed curves correspond to the FERMI algorithm under the demographic parity and equality of opportunity notions, respectively. The left axis demonstrates the effect of changing λ on the test error (red curves), while the right axis shows how the fairness of the model (measured by the equality of opportunity or demographic parity violation) depends on λ.
10.4.6.5 Complete version of Figure 6.1
In Figure 6.1, we compared FERMI with several state-of-the-art in-processing approaches. In the following three figures, we compare the in-processing approaches depicted in Figure 6.1 with pre-processing and post-processing methods, including [141, 167, 115].
Figure 10.12: Tradeoff of demographic parity violation vs test error for FERMI algorithm on COMPAS, German,
and Adult datasets.
10.4.6.6 Description of datasets
All of the following datasets are publicly available at the UCI repository.
German Credit Dataset. The German Credit dataset consists of 20 features (13 categorical and 7 numerical) regarding the social and economic status of 1000 customers. The assigned task is to classify customers as good or bad credit risks. Without imposing fairness, the DP violation of the trained model is larger than 20%. We choose 80% of the customers as the train data and the remaining 20% as the test data. The sensitive attributes are gender and marital status.
Figure 10.13: Tradeoff of equalized odds violation vs test error for FERMI algorithm on COMPAS, German, and
Adult datasets.
Adult Dataset. The Adult dataset contains census information about individuals, including education, gender, and capital gain. The assigned classification task is to predict whether a person earns over 50k annually. The train and test sets are two separate files consisting of 32,000 and 16,000 samples, respectively. We consider gender and race as the sensitive attributes (for the experiments involving one sensitive attribute, we have chosen gender). Learning a logistic regression model on the training dataset (without imposing fairness) shows that only 3 out of 14 features have larger weights than the gender attribute. Note that removing the sensitive attribute (gender) and retraining the model does not eliminate the bias of the classifier; the optimal logistic regression classifier in this case is still highly biased. For the clustering task, we have chosen 5 continuous features (capital-gain, age, fnlwgt, capital-loss, hours-per-week) and 10,000 samples to cluster. The sensitive attribute of each individual is gender.
Figure 10.14: Tradeoff of equality of opportunity violation vs test error for FERMI algorithm on COMPAS, German, and Adult datasets.
Communities and Crime Dataset. The dataset is cleaned and processed as described in [170].
Briefly, each record in this dataset summarizes aggregate socioeconomic information about both the citizens and police force in a particular U.S. community, and the problem is to predict whether the community
has a high rate of violent crime.
COMPAS Dataset. Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) is a well-known algorithm widely used by judges to estimate the likelihood of reoffending. It has been observed that the algorithm is highly biased against black defendants. The dataset contains the features used by the COMPAS algorithm, along with the score assigned by the algorithm within two years of the decision.
https://www.kaggle.com/danofer/compass
Colored MNIST Dataset. We use the code by [197] to create a Colored MNIST dataset with σ = 0. We use the provided LeNet-5 model trained on the colored dataset for all baseline models of [23, 223, 74] and FERMI, where we further apply the corresponding regularizer in the training process.
https://github.com/JerryYLi/Dataset-REPAIR/
10.5 Chapter 7 Appendix
10.5.1 Proof of Theorem 23
Let $Q_\theta = U \Sigma V$ and $Q_\theta^{tr} = U_0 \Sigma_0 V_0$ be the singular value decompositions of the matrices $Q_\theta$ and $Q_\theta^{tr}$, respectively. We aim to solve the following optimization problem:
\[
\max_{Q} \operatorname{Tr}(Q^T Q) \quad \text{s.t.} \quad \|\Sigma - \Sigma_0\|_p \le \epsilon. \tag{10.93}
\]
Since $U$ and $V$ are unitary matrices ($U^T U = V^T V = I$), we have $\operatorname{Tr}(Q^T Q) = \operatorname{Tr}(\Sigma^T \Sigma)$. Therefore, Problem (10.93) is equivalent to:
\[
\max_{\Sigma} \operatorname{Tr}(\Sigma^T \Sigma) \quad \text{s.t.} \quad \|\Sigma - \Sigma_0\|_p \le \epsilon. \tag{10.94}
\]
a) If $p = 1$, we aim to solve the following maximization problem:
\[
\max_{\Sigma} \operatorname{Tr}(\Sigma^T \Sigma) \quad \text{s.t.} \quad \|\Sigma - \Sigma_0\|_1 \le \epsilon. \tag{10.95}
\]
Let $a$ and $b$ be the vectorized forms of $\Sigma$ and $\Sigma_0$, respectively. Therefore, the above problem is equivalent to the following:
\[
\max_a \sum_i a_i^2 \quad \text{s.t.} \quad \|a - b\|_1 \le \epsilon. \tag{10.96}
\]
Without loss of generality, assume that $|b_1| \ge \cdots \ge |b_m|$. We claim the solution to this problem is obtained by putting all of the budget on the largest entry of $a$ ($a_1^* = b_1 + \epsilon$, and $a_i^* = b_i$ for all $i > 1$). To arrive at a contradiction, assume that there exists an optimal solution such that $a_1 = b_1 + \epsilon_1$ and $a_2 = b_2 + \epsilon_2$ with $\epsilon_2 > 0$. We claim a solution with $a_1' = b_1 + \epsilon_1 + \epsilon_2$ and $a_2' = b_2$ has a higher objective value:
\[
\begin{aligned}
a_1'^2 + a_2'^2 &= (b_1 + \epsilon_1 + \epsilon_2)^2 + b_2^2 = b_1^2 + b_2^2 + \epsilon_1^2 + \epsilon_2^2 + 2\epsilon_1 \epsilon_2 + 2\epsilon_1 b_1 + 2\epsilon_2 b_1 \\
&> b_1^2 + \epsilon_1^2 + 2\epsilon_1 b_1 + b_2^2 + \epsilon_2^2 + 2\epsilon_2 b_1 \ge b_1^2 + \epsilon_1^2 + 2\epsilon_1 b_1 + b_2^2 + \epsilon_2^2 + 2\epsilon_2 b_2 \\
&= (b_1 + \epsilon_1)^2 + (b_2 + \epsilon_2)^2 = a_1^2 + a_2^2.
\end{aligned}
\]
Thus, by contradiction, $\epsilon_2 = 0$. Similarly, for the other $a_i$'s with $i > 2$, if the optimal $a_i = b_i + \epsilon_i$, then $\epsilon_i = 0$. Hence, the claim that the optimal solution to Problem (10.96) is obtained by putting the whole budget $\epsilon$ on the largest entry is true. Note that the largest singular value of $Q_\theta$ is always 1. We also impose the same constraint on the matrix $Q$ (this is a practical assumption, since the actual matrix $Q$ generated by $P^*$, the ground-truth joint distribution of data in the test domain, has largest singular value equal to 1). Thus, to maximize the sum of squared singular values, we put all of the available budget on the second-largest singular value. Therefore, the optimal value of Problem (10.95) is $\operatorname{Tr}(Q_\theta^T Q_\theta) + 2\epsilon \sigma_2(Q_\theta) + \epsilon^2$, which is the desired result in Theorem 23 part a.
b) If $p = 2$, we need to solve the following maximization problem:
\[
\max_{\Sigma} \operatorname{Tr}(\Sigma^T \Sigma) \quad \text{s.t.} \quad \|\Sigma - \Sigma_0\|_2 \le \epsilon. \tag{10.97}
\]
Let $a = \operatorname{diag}(\Sigma)$ and $b = \operatorname{diag}(\Sigma_0)$. Thus, Problem (10.97) is equivalent to:
\[
\max_a \sum_i a_i^2 \quad \text{s.t.} \quad \|a - b\|_2 \le \epsilon. \tag{10.98}
\]
Note that:
\[
\|a\|^2 = \|a - b + b\|^2 = \|a - b\|_2^2 + \|b\|^2 + 2b^T(a - b) \le \epsilon^2 + \|b\|_2^2 + 2\|b\|_2 \|a - b\| \le \epsilon^2 + \|b\|_2^2 + 2\epsilon \|b\|_2 = (\|b\|_2 + \epsilon)^2.
\]
Setting $a = b + \frac{\epsilon b}{\|b\|_2}$, we can achieve the maximum value $\|a\|_2^2 = (\|b\|_2 + \epsilon)^2$ without violating the constraint in (10.98). Therefore, the optimal value of Problem (10.97) is $(\|\Sigma_0\|_2 + \epsilon)^2$, which equals
\[
\operatorname{Tr}(Q_\theta^T Q_\theta) + 2\epsilon \sqrt{\operatorname{Tr}(Q_\theta^T Q_\theta)} + \epsilon^2, \tag{10.99}
\]
which gives us the desired result in Theorem 23 part b.
c) In the case of $p = \infty$, we seek to solve:
\[
\max_Q \operatorname{Tr}(Q^T Q) \quad \text{s.t.} \quad \|\sigma(Q) - \sigma(Q_\theta)\|_\infty \le \epsilon, \tag{10.100}
\]
where the uncertainty set is defined on the singular values of the matrix $Q_\theta$. We can add $\epsilon$ to each singular value of the matrix $Q_\theta$ independently to maximize the above objective. In that case, the optimal value is $\operatorname{Tr}(Q_\theta^T Q_\theta) + \epsilon^2 + 2\epsilon \operatorname{Tr}(|\Sigma_\theta|)$, which is the desired result presented in Theorem 23.
10.5.2 Proof of Theorem 25
First, note that there exists $\tilde{\theta}$ such that $\operatorname{Tr}(Q_{\tilde{\theta}}^T Q_{\tilde{\theta}}) = 0$. To see why, choose $\tilde{\theta}$ such that $\hat{y}_{\tilde{\theta}}(x)$ outputs zero or one uniformly at random, independently of $x$. Thus, $\hat{y}_{\tilde{\theta}}(x)$ is independent of the sensitive attribute $s$. By the definition of the Exponential Rényi Mutual Information (ERMI), $\operatorname{Tr}(Q_{\tilde{\theta}}^T Q_{\tilde{\theta}}) = 0$, since $\hat{y}_{\tilde{\theta}}(x)$ and $s$ are independent. Assume that $\theta_{\lambda,\epsilon}^*$ is the optimal solution of Problem (Dr. FERMI) for the given $\lambda$ and $\epsilon$. Since $\theta_{\lambda,\epsilon}^*$ is the optimizer of (Dr. FERMI), we have:
\[
\mathbb{E}_{P^{tr}}[\ell(\hat{y}_{\theta_{\lambda,\epsilon}^*}(x), y)] + \lambda \operatorname{Tr}(Q_{\theta_{\lambda,\epsilon}^*}^T Q_{\theta_{\lambda,\epsilon}^*}) \le \mathbb{E}_{P^{tr}}[\ell(\hat{y}_{\tilde{\theta}}(x), y)] + \lambda \operatorname{Tr}(Q_{\tilde{\theta}}^T Q_{\tilde{\theta}}) = \mathbb{E}_{P^{tr}}[\ell(\hat{y}_{\tilde{\theta}}(x), y)]. \tag{10.101}
\]
The equality holds since $\operatorname{Tr}(Q_{\tilde{\theta}}^T Q_{\tilde{\theta}})$ is zero. Therefore:
\[
\operatorname{Tr}(Q_{\theta_{\lambda,\epsilon}^*}^T Q_{\theta_{\lambda,\epsilon}^*}) \le \left( \mathbb{E}_{P^{tr}}[\ell(\hat{y}_{\tilde{\theta}}(x), y)] - \mathbb{E}_{P^{tr}}[\ell(\hat{y}_{\theta^*}(x), y)] \right) / \lambda. \tag{10.102}
\]
Since the cross-entropy loss is always greater than or equal to zero, we have $\mathbb{E}_{P^{tr}}[\ell(\hat{y}_{\theta^*}(x), y)] \ge 0$. Further, assume that the loss value at the point $\tilde{\theta}$ is $B$. Thus:
\[
\operatorname{Tr}(Q_{\theta_{\lambda,\epsilon}^*}^T Q_{\theta_{\lambda,\epsilon}^*}) \le \frac{\mathbb{E}_{P^{tr}}[\ell(\hat{y}_{\tilde{\theta}}(x), y)]}{\lambda} \le \frac{B}{\lambda}. \tag{10.103}
\]
Therefore, if we choose $\lambda \ge \frac{B}{\gamma}$, we have
\[
\operatorname{Tr}(Q_{\theta_{\lambda,\epsilon}^*}^T Q_{\theta_{\lambda,\epsilon}^*}) \le \gamma. \tag{10.104}
\]
We showed the above inequality for an arbitrary choice of $\epsilon$. Next, we choose $\epsilon$ such that $Q_{\theta_{\epsilon,\lambda}^*}^{test} \in \mathcal{B}(Q_{\theta_{\epsilon,\lambda}^*}^{train}, \epsilon)$. Note that according to [347], the largest singular value of the matrix $Q$ is one. As a result, all singular values of the matrices $Q_{\theta_{\epsilon,\lambda}^*}^{test}$ and $Q_{\theta_{\epsilon,\lambda}^*}^{train}$ are less than or equal to 1. Therefore, for a large enough choice of $\epsilon$, the $L_p$ distance between the vectors of singular values of $Q_{\theta_{\epsilon,\lambda}^*}^{test}$ and $Q_{\theta_{\epsilon,\lambda}^*}^{train}$ is within $\epsilon$. Hence:
\[
\operatorname{Tr}\left( (Q_\theta^{test})^T Q_\theta^{test} \right) \le \max_{Q \in \mathcal{B}(Q_\theta^{train}, \epsilon)} \operatorname{Tr}(Q^T Q) < \gamma.
\]
According to [211, Lemma 3], the Exponential Rényi Mutual Information is an upper bound on the $L_\infty$ fairness violation (demographic parity violation). Thus:
\[
\text{DPV} = |\mathbb{P}(\hat{y} = 1 \,|\, s = 1) - \mathbb{P}(\hat{y} = 1 \,|\, s = 0)| < \gamma,
\]
which completes the proof.
10.5.3 Proof of Lemma 26
First, we show that for any $z > 0$:
\[
\sqrt{z} = \min_{\alpha > 0} \frac{1}{2}\left( z\alpha + \frac{1}{\alpha} \right). \tag{10.105}
\]
Let $g_z(\alpha) = \frac{1}{2}(z\alpha + \frac{1}{\alpha})$. Since $\frac{\partial^2 g_z(\alpha)}{\partial \alpha^2} = \frac{1}{\alpha^3} > 0$ on $\alpha > 0$, the function $g_z$ is strictly convex with respect to $\alpha$. Thus, the minimizer is unique. The minimizer of $g_z(\alpha)$ can be obtained by setting the derivative with respect to $\alpha$ to zero:
\[
\frac{\partial g_z(\alpha)}{\partial \alpha} = \frac{1}{2}\left( z - \frac{1}{\alpha^2} \right) = 0 \implies \alpha^* = \frac{1}{\sqrt{z}}.
\]
Therefore, the optimal value can be obtained by plugging $\alpha^*$ into $g_z$:
\[
g_z(\alpha^*) = \frac{1}{2}\left( \frac{z}{\sqrt{z}} + \sqrt{z} \right) = \frac{1}{2}\left( \sqrt{z} + \sqrt{z} \right) = \sqrt{z},
\]
which proves (10.105). Based on [211, Lemma 1],
\[
\operatorname{Tr}(Q^T Q) = \Psi(\theta, W) = \frac{1}{N} \sum_{i=1}^N \psi(z_i; \theta, W), \tag{10.106}
\]
where $\psi$ is given by Equation (7.15). Thus, Problem (7.10) can be written as:
\[
\frac{1}{N} \sum_{i=1}^N \ell(\hat{y}_\theta(x_i), y_i) + \lambda \Psi(\theta, W) + 2\lambda\epsilon \sqrt{\Psi(\theta, W)}. \tag{10.107}
\]
Applying Equation (10.105) to $\sqrt{\Psi(\theta, W)}$, we have:
\[
\sqrt{\Psi(\theta, W)} = \frac{1}{2} \min_{\alpha > 0} \left( \alpha \Psi(\theta, W) + \frac{1}{\alpha} \right). \tag{10.108}
\]
Combining (10.107) and (10.108), Problem (7.10) can be reformulated as:
\[
\min_\theta \max_{W \in \mathcal{W}} \frac{1}{N} \sum_{i=1}^N \ell(\hat{y}_\theta(x_i), y_i) + \lambda \Psi(\theta, W) + \min_{\alpha > 0} \left( \lambda\epsilon\alpha \Psi(\theta, W) + \frac{\lambda\epsilon}{\alpha} \right). \tag{10.109}
\]
Note that the maximization problem with respect to $W$ is concave due to the concavity of $\Psi$, and the minimization problem with respect to $\alpha$ is convex on $\alpha > 0$. To switch the minimum and maximum, we need boundedness of the parameters of the minimization problem. Obviously, $\alpha$ is bounded below by 0. Further, note that the optimal $\alpha$ satisfies the following first-order stationarity condition:
\[
\Psi(\theta, W) - \frac{1}{\alpha^{*2}} = 0 \implies \alpha^* = \frac{1}{\sqrt{\Psi(\theta, W)}}.
\]
Since $\Psi(\theta, W)$ is the reformulation of $\operatorname{Tr}(Q_\theta^T Q_\theta)$, and the latter is the sum of the squared singular values, it is always greater than or equal to 1 (since the largest singular value equals 1). As a result, $\alpha^* \le 1$, so $\alpha$ is bounded from above as well. Thus, we can switch the maximization and minimization problems in (10.109) due to the minimax theorem [312]. Hence, Problem (7.10) is equivalent to
\[
\min_{\theta, \alpha > 0} \max_W \frac{1}{N} \sum_{i=1}^N \ell(\hat{y}_\theta(x_i), y_i) + \lambda \Psi(\theta, W) + \lambda\epsilon\alpha \Psi(\theta, W) + \frac{\lambda\epsilon}{\alpha},
\]
which is precisely Problem (7.14).
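Identity (10.105), which drives this reformulation, is easy to confirm numerically (a throwaway check of ours, not thesis code):

import numpy as np

z = 3.7
alpha = np.linspace(1e-3, 10, 200_000)
g = 0.5 * (z * alpha + 1 / alpha)
assert np.isclose(g.min(), np.sqrt(z), atol=1e-4)                 # min value = sqrt(z)
assert np.isclose(alpha[g.argmin()], 1 / np.sqrt(z), atol=1e-3)   # at alpha* = 1/sqrt(z)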
10.5.4 Derivation of Algorithm 13 and Details on Computing Gradients
Algorithm 13 requires updating the probability matrices $\widehat{P}_{\hat{y}}$ and $\widehat{P}_{\hat{y},s}$ after updating $\theta$. Assume that the output of the classification model is a probability vector (logits in neural networks or logistic regression), where the probability of assigning label $j$ to the data point $x$ is $F_j(x; \theta)$. Note that $\hat{y}_\theta(x) = \operatorname{argmax}_j F_j(x; \theta)$. One can compute the elements of the probability matrices as follows:
\[
\widehat{P}_{\hat{y}}[j][j] = \mathbb{P}(\hat{y}_\theta = j) = \frac{1}{n} \sum_{i=1}^n F_j(x_i; \theta), \tag{10.110}
\]
\[
\widehat{P}_{\hat{y},s}[j][k] = \mathbb{P}(\hat{y}_\theta = j, s = k) = \mathbb{P}(s = k)\, \mathbb{P}(\hat{y}_\theta = j \,|\, s = k) = \frac{1}{n} \sum_{i=1}^n F_j(x_i; \theta)\, \mathbf{1}(s_i = k). \tag{10.111}
\]
As a note, $\mathbb{P}(s = k) = \pi_k = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(s_i = k)$, and it is constant throughout the algorithm.
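With F an n x m matrix of per-sample probabilities $F_j(x_i; \theta)$ and s the integer-coded sensitive attributes, Equations (10.110) and (10.111) amount to a few lines of numpy (a minimal sketch of ours; the names are not from the thesis code):

import numpy as np

def probability_matrices(F, s, k):
    """F: (n, m) softmax outputs; s: (n,) integer sensitive attributes in [0, k)."""
    n, m = F.shape
    P_y = np.diag(F.mean(axis=0))        # Eq. (10.110): diagonal of P(yhat = j)
    S = np.eye(k)[s]                     # one-hot encoding of s, shape (n, k)
    P_ys = F.T @ S / n                   # Eq. (10.111): P(yhat = j, s = k)
    return P_y, P_ys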
Next, we show how to compute the gradient of $\operatorname{Tr}(Q_\theta^T Q_\theta)$ with respect to $\theta$. First, it is evident that:
\[
\operatorname{Tr}(Q_\theta^T Q_\theta) = \sum_j \sum_k \frac{\mathbb{P}^2(\hat{y}_\theta = j, s = k)}{\mathbb{P}(\hat{y}_\theta = j)\, \mathbb{P}(s = k)}.
\]
Thus, its gradient with respect to $\theta$, using the chain rule, equals:
\[
\sum_{j,k} \frac{2\, \mathbb{P}(\hat{y}_\theta = j, s = k)\, \mathbb{P}(\hat{y}_\theta = j)\, \nabla_\theta \mathbb{P}(\hat{y}_\theta = j, s = k) - \mathbb{P}^2(\hat{y}_\theta = j, s = k)\, \nabla_\theta \mathbb{P}(\hat{y}_\theta = j)}{\mathbb{P}^2(\hat{y}_\theta = j)\, \mathbb{P}(s = k)},
\]
where the gradients of $\mathbb{P}(\hat{y}_\theta = j)$ and $\mathbb{P}(\hat{y}_\theta = j, s = k)$ can be computed using Equations (10.110) and (10.111):
\[
\nabla_\theta \mathbb{P}(\hat{y}_\theta = j) = \frac{1}{n} \sum_{i=1}^n \nabla_\theta F_j(x_i; \theta), \tag{10.112}
\]
\[
\nabla_\theta \mathbb{P}(\hat{y}_\theta = j, s = k) = \frac{1}{n} \sum_{i=1}^n \nabla_\theta F_j(x_i; \theta)\, \mathbf{1}(s_i = k). \tag{10.113}
\]
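In an implementation, this chain rule is exactly what automatic differentiation produces, so one can simply express Tr(Q_theta^T Q_theta) as a differentiable function of the softmax outputs. A hypothetical torch sketch (our own illustration, assuming F_probs are differentiable model outputs):

import torch

def ermi_trace(F_probs, s_onehot, pi):
    """Tr(Q^T Q) = sum_{j,k} P(yhat=j, s=k)^2 / (P(yhat=j) P(s=k)).

    F_probs: (n, m) differentiable softmax outputs; s_onehot: (n, k); pi: (k,) = P(s).
    """
    P_y = F_probs.mean(dim=0)                           # P(yhat = j)
    P_ys = F_probs.T @ s_onehot / F_probs.shape[0]      # P(yhat = j, s = k)
    return (P_ys**2 / (P_y[:, None] * pi[None, :])).sum()

# loss = cross_entropy + lam * ermi_trace(...); calling loss.backward() then
# realizes the gradients of Eqs. (10.112)-(10.113) automatically.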
10.5.5 Convergence of Algorithm 13
Since we solve the maximization problem in (7.9) in closed form (see lines 5 and 6 in Algorithm 13), as an immediate result of Theorem 27 in [158], Algorithm 13 converges to a stationary solution of Problem (7.9) in $O(\frac{1}{\epsilon^4})$ iterations. Note that, although the problem is non-convex non-concave, we have a max oracle (we can compute the solution of the maximization problem exactly in closed form).
Theorem 70 Suppose that the objective function defined in (Dr. FERMI) is $L_0$-Lipschitz and $L_1$-gradient Lipschitz for the choice of $p = 1$ in Theorem 23. Then, Algorithm 13 computes an $\epsilon$-stationary solution of the objective function in (7.9) in $O(\epsilon^{-4})$ iterations.
10.5.6 Convergence of Algorithm 14 (Stochastic version of Dr. FERMI)
In this section, we show the convergence rate of Algorithm 14 for solving (Dr. FERMI) under the $L_2$ norm ball as the uncertainty set. As mentioned in Section 7.4.2, (Dr. FERMI) can be rewritten as:
\[
\min_{\alpha > 0, \theta} \max_{W \in \mathcal{W}} \frac{1}{n} \sum_{i=1}^n \left[ \ell(z_i; \theta) + \lambda(1 + \epsilon\alpha)\psi(z_i; \theta, W) \right] + \frac{\lambda\epsilon}{\alpha}.
\]
Since the objective function is represented as an average/expectation over $n$ data points, the gradient vector with respect to $\theta$ evaluated on a given batch of data is an unbiased estimator of the gradient with respect to all data points. Further, the problem is strongly concave with respect to the matrix $W$. Therefore, a standard stochastic gradient descent-ascent (SGDA) algorithm will find a stationary solution of the problem.
Proposition 71 Assume that $\ell(\cdot, \cdot)$ and $F(x, \theta)$, the probabilistic output of the model (logits), are Lipschitz continuous and differentiable with Lipschitz gradients. Further, assume that $P_s(s = i) > 0$ for all $i \in \mathcal{S}$, and $P_{\hat{y}}(\hat{y}_\theta = j) > 0$ for all $\theta$ and all $j \in \mathcal{Y}$. Then, for any batch size $1 \le b \le n$, Algorithm 14 finds an $\epsilon$-first-order stationary point of Problem (7.10) in $T = O(\frac{1}{\epsilon^5})$ iterations.
The proof is based on the following assumptions:
• $P_s(s = i) > 0$ for all $i \in \mathcal{S}$.
• $P_{\hat{y}}(\hat{y} = j) > 0$ for all $j \in \mathcal{Y}$ at every iteration of the algorithm.
• The loss function $\ell(x, \theta)$ and the probabilistic output $F(x, \theta)$ are Lipschitz continuous with Lipschitz constant $L$.
• The loss function $\ell(x, \theta)$ and the probabilistic output $F(x, \theta)$ are $\beta$-smooth, meaning that their gradients are Lipschitz continuous with Lipschitz constant $\beta$.
Remark 72 Note that the first two assumptions hold in practice: at least one sample from each sensitive group should be available in the training data, and the probability of predicting any given label should not be exactly zero at any iteration. In our experiments, this assumption always holds. If, in some extreme case, one of these probabilities goes to zero, one can add a small perturbation to $\theta$ to avoid exactly zero probability. The third and fourth assumptions are standard in convergence proofs for iterative methods, and they hold for loss functions such as cross-entropy and squared loss over bounded inputs.
Proof: Define $\theta' = [\theta, \alpha]$; that is, $\theta'$ is obtained by appending the parameter $\alpha$ to the vector $\theta$. In that case, Algorithm 14 can be seen as one step of gradient descent on $\theta'$ and one step of projected gradient ascent on $W$. Note that, although the maximization problem is originally unconstrained, we apply the projection onto the set $\mathcal{W} = \{W \in \mathbb{R}^{m \times k} : \|W\|_F \le \frac{2}{\widehat{P}_{\hat{y}}^{\min} \sqrt{\widehat{P}_s^{\min}}}\}$ to ensure Lipschitz continuity and boundedness of the variance of the gradient estimator. The convergence of Algorithm 14 to an $\epsilon$-stationary solution of Problem (7.10) in $O(\frac{1}{\epsilon^5})$ iterations is a direct result of Theorem 3 in [211]. □
10.5.7 Full-Batch Algorithms for Dr. FERMI under L2 and L∞ Norms
To solve Problem (7.10), we use the following gradient descent-ascent (GDA) approach:
Algorithm 22 Distributionally Robust FERMI under L2 Ball Uncertainty
1: Input: $\theta^0 \in \mathbb{R}^{d_\theta}$, $W^0 = 0$, step-size $\eta$, fairness parameter $\lambda \ge 0$, number of iterations $T$.
2: for $t = 1, \ldots, T$ do
3:   $\theta^t = \theta^{t-1} - \eta \nabla_\theta f_2(\theta^{t-1})$
4:   $\rho_W = -\operatorname{Tr}\left( W^{t-1} \widehat{P}_{\hat{y}} (W^{t-1})^T \right) + 2 \operatorname{Tr}\left( W^{t-1} \widehat{P}_{\hat{y},s} \widehat{P}_s^{-1/2} \right)$
5:   Set $R_W = \lambda \rho_W + 2\lambda\epsilon \sqrt{\rho_W}$
6:   Set $W^t = W^{t-1} + \eta \nabla_W R_W$
7:   Update $\widehat{P}_{\hat{y}}$ and $\widehat{P}_{\hat{y},s}$
8: Return: $\theta^T$, $W^T$.
Remark 73 The convergence of Algorithm 22 is obtained by setting the batch size in Theorem 71 to $|B| = N$.
To solve Problem (7.11), we use the following algorithm:
Algorithm 23 Distributionally Robust FERMI under L∞ Ball Uncertainty
1: Input: $\theta^0 \in \mathbb{R}^{d_\theta}$, $W^0 = 0$, step-size $\eta$, fairness parameter $\lambda \ge 0$, number of iterations $T$.
2: for $t = 1, \ldots, T$ do
3:   $\theta^t = \theta^{t-1} - \eta \left[ \frac{1}{n} \sum_{i=1}^n \nabla_\theta \ell(\hat{y}_{\theta^{t-1}}(x_i), y_i) + 2\lambda\epsilon \nabla_\theta \operatorname{Tr}(|\Sigma_{\theta^{t-1}}|) + \lambda \nabla_\theta \operatorname{Tr}(Q_{\theta^{t-1}}^T Q_{\theta^{t-1}}) \right]$
4: Return: $\theta^T$
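A compact Python rendering of Algorithm 22's loop (our own sketch; grad_theta_f2, F, probability_matrices, and Ps_inv_sqrt are hypothetical helpers standing in for the quantities defined above):

import numpy as np

def dr_fermi_l2_gda(theta, W, X, s, k, T, eta, lam, eps,
                    grad_theta_f2, F, probability_matrices, Ps_inv_sqrt):
    """Sketch of Algorithm 22 (GDA under the L2-ball uncertainty set)."""
    for t in range(T):
        theta = theta - eta * grad_theta_f2(theta)               # line 3
        P_y, P_ys = probability_matrices(F(X, theta), s, k)      # line 7
        rho = (-np.trace(W @ P_y @ W.T)
               + 2 * np.trace(W @ P_ys @ Ps_inv_sqrt))           # line 4
        grad_rho = -2 * W @ P_y + 2 * Ps_inv_sqrt @ P_ys.T       # gradient of rho in W
        W = W + eta * lam * (1 + eps / np.sqrt(rho)) * grad_rho  # lines 5-6
    return theta, W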
10.5.8 Algorithms for CVaR-Dr. FERMI and Group-Dr. FERMI
By a simple modification of Algorithm 13, one can derive algorithms that converge to stationary points of (CVaR-Dr. FERMI) and (Group-Dr. FERMI). Note that the maximization step at each iteration remains the same. The only change is in the minimization part: instead of minimizing with respect to $\theta$ alone, we also update the extra parameter $\eta$ using (sub-)gradient descent. Algorithm 24 describes the procedure for optimizing (CVaR-Dr. FERMI). At each iteration, we apply one step of (sub-)gradient descent to $\theta$ and $\eta$, and then update $v$ and $W$ in closed form.
Algorithm 24 CVaR-Dr. FERMI Algorithm
1: Input: $\theta^0 \in \mathbb{R}^{d_\theta}$, $W^0 = 0$, step-size $\alpha$, fairness parameter $\lambda \ge 0$, number of iterations $T$.
2: for $t = 1, \ldots, T$ do
3:   $\theta^t = \theta^{t-1} - \frac{\alpha}{n} \sum_{i=1}^n \partial_\theta [\ell(\hat{y}_{\theta^{t-1}}(x_i), y_i) - \eta^t]_+ + \frac{\lambda\epsilon}{\alpha} \partial_\theta \sqrt{v^T Q_{\theta^{t-1}}^T Q_{\theta^{t-1}} v} + \frac{\lambda}{\alpha} \operatorname{Tr}\left( \partial_\theta (Q_{\theta^{t-1}}^T Q_{\theta^{t-1}}) \right)$
4:   $\eta^t = \eta^{t-1} - \frac{\alpha}{n} \sum_{i=1}^n \partial_\eta [\ell(\hat{y}_{\theta^{t-1}}(x_i), y_i) - \eta^t]_+ - \alpha$
5:   Set $W^t = \widehat{P}_s^{-1/2} \widehat{P}_{\hat{y},s}^T \widehat{P}_{\hat{y}}^{-1}$
6:   Set $v$ to the second-largest singular vector of $Q_{\theta^t}$ by performing SVD on $Q_{\theta^t}$.
7:   Update $\widehat{P}_{\hat{y}}$ and $\widehat{P}_{\hat{y},s}$ as a function of $\theta^{t-1}$.
8: Return: $\theta$.
Similarly, one can optimize (Group-Dr. FERMI) by applying gradient descent to $\theta$ as in Algorithm 13 and to the group DRO parameters as in Algorithm 1 of [289], and then updating $v$ and $W$ in closed form at each iteration.
10.5.9 Hyper-Parameter Tuning
Dr. FERMI has two hyper-parameters, $\epsilon$ and $\lambda$. Further, the presented algorithms in this chapter have a step-size (learning rate) $\eta$ and a number of iterations $T$. We use $\eta = 10^{-5}$ and $T = 3000$ in the code. These values were obtained by varying $\eta$ and $T$ over 10 different values in different runs on ACS PUMS [99] data. We consider two scenarios for tuning $\lambda$ and $\epsilon$. If a sample of data from the target domain is available for validation, we reserve that data as the validation set to choose the optimal $\lambda \in \{0.1, 0.5, 1, 2, 5, 10, 20, 50\}$ and $\epsilon \in \{0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10\}$. In the second scenario, when no data from the target domain is provided, one can rely on $k$-fold cross-validation on the source data. A more resilient approach is to create the validation dataset by oversampling the minority groups. Based on this idea, we do weighted sampling based on the population of the sensitive groups (oversampling from minorities), and then we choose the optimal $\lambda$ and $\epsilon$ as in scenario 1.
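Scenario 1 is an ordinary grid search; a hypothetical sketch (train_dr_fermi and validate are placeholder helpers, not the thesis code):

import itertools

best = None
for lam, eps in itertools.product([0.1, 0.5, 1, 2, 5, 10, 20, 50],
                                  [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10]):
    theta = train_dr_fermi(train, lam=lam, eps=eps, lr=1e-5, iters=3000)
    score = validate(theta, val)          # joint accuracy-fairness score
    if best is None or score > best[0]:
        best = (score, lam, eps)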
10.6 Chapter 8 Appendix
10.6.1 f-FERM for other notions of group fairness
This section shows how we can use alternative notions of fairness, such as equality of opportunity or equalized odds [141], instead of the demographic parity violation in f-FERM.
Note that a trained binary classifier with a binary sensitive attribute satisfies the equality of opportunity notion if and only if:
\[
\mathbb{P}(\hat{y}_\theta(x) = 1, s = i \,|\, y = 1) = \mathbb{P}(\hat{y}_\theta(x) = 1, s = j \,|\, y = 1) \quad \forall\, i, j \in \mathcal{S}. \tag{10.114}
\]
Therefore, to obtain a framework for fair inference via f-divergences under the equality of opportunity notion, we optimize:
\[
\min_\theta \frac{1}{n} \sum_{i=1}^n \ell(\hat{y}_\theta(x_i), y_i) + \lambda D_f\left( \mathbb{P}(\hat{y}_\theta(x), s \,|\, y = 1) \,\big\|\, \mathbb{P}(\hat{y}_\theta(x) \,|\, y = 1) \otimes \mathbb{P}(s \,|\, y = 1) \right). \tag{10.115}
\]
Practically, this means that to evaluate the probability measures in the regularization term, we restrict attention to the data points whose target labels are 1 (see the sketch at the end of this subsection).
Further, one can similarly adopt equalized odds as the measure of fairness. Equalized odds is defined as:
\[
\mathbb{P}(\hat{y}_\theta(x) = 1, s = i \,|\, y = k) = \mathbb{P}(\hat{y}_\theta(x) = 1, s = j \,|\, y = k) \quad \forall\, i, j \in \mathcal{S},\; k \in \mathcal{Y}. \tag{10.116}
\]
Therefore, we must add one regularizer per class label to satisfy the equalized odds notion. Other notions of fairness can be used in this framework as long as they can be represented as conditional independence between sensitive attributes, predictions, and labels [66].
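Operationally, the restriction to y = 1 in (10.115) just masks the batch before the regularizer is evaluated; a hypothetical sketch (F, f_divergence_regularizer, and cross_entropy are placeholder helpers, not the thesis code):

# Equality of opportunity: evaluate the f-divergence regularizer only on the
# samples with label y = 1, mirroring Eq. (10.115).
mask = (y_batch == 1)
reg = f_divergence_regularizer(F(x_batch[mask], theta), s_batch[mask])
loss = cross_entropy(F(x_batch, theta), y_batch) + lam * reg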
10.6.2 f-divergence for continuous sensitive attributes and target variables
In Section 8.2, we developed a framework for promoting fairness in classification problems where target labels and sensitive attributes are discrete variables. Hence, we could efficiently solve the variational formulation that arises through the design of unbiased estimators. However, it is not uncommon in practice to find applications of f-divergence regularizers that require either the sensitive features or the output variable, or both, to be continuous. In such cases, the summation over the respective variable is replaced by an integral over the probability distribution. The challenging aspect is calculating the variational form's integral and the trailing supremum in the continuous domain.
Let $P$ and $Q$ be two continuous distributions over the space $\Omega$ such that $P$ is absolutely continuous with respect to $Q$ ($P \ll Q$). Then, the f-divergence between these two distributions for a given convex function $f$ is defined as:
\[
D_f(P, Q) = \int_\Omega f\left( \frac{dP}{dQ} \right) dQ. \tag{10.117}
\]
When the target variable is continuous (regression problems) but the sensitive attribute is discrete, (f-FERM) can be written as:
\[
\min_\theta \max_{A} \sum_{i=1}^n \ell(\hat{y}_\theta(x_i), y_i) + \lambda \sum_k \int_x \left[ A_k(x) P(x) - f^*(A_k(x)) Q_k \right] dx.
\]
With slight changes, the above problem can be reformulated as follows:
\[
\min_\theta \sum_{i=1}^n \ell(\hat{y}_\theta(x_i), y_i) + \lambda \max_{A_1, \ldots, A_m} \sum_k \mathbb{E}\left[ A_k(s) P_j(s) - f^*(A_k(s)) Q_k \right].
\]
When both the sensitive features and the target variables are continuous, the objective function becomes:
\[
\min_\theta \max_{A} \sum_{i=1}^n \ell(\hat{y}_\theta(x_i), y_i) + \lambda \int_x \int_y \left[ A(x, y) P(x) - f^*(A(x, y)) Q(y) \right] dx \, dy.
\]
Such a formulation is clearly intractable, since $A_k(x)$ or $A(x, y)$ must be solved for over a continuous domain. We would need to approximate the above integrals over discretized/quantized regions, or find another variational representation for designing unbiased estimators of continuous-domain f-divergences. We leave developing algorithms for continuous target variables and sensitive attributes as a future direction.
10.6.3 f-divergences cover well-known notions of fairness violation
In this section, we first show that driving an f-divergence regularizer to 0 guarantees the independence of the sensitive attributes and the predictions. In other words, optimizing f-divergences leads to a fair model under the demographic parity notion (or other group fairness notions discussed in Appendix 10.6.1).
Proposition 74 [258, Theorem 2.3] Let $f$ be a convex function from $\mathbb{R}^+$ to $\mathbb{R}$ such that $f(1) = 0$ and $f$ is strictly convex in a neighborhood of 1. Then $D_f(P \| Q) = 0$ if and only if $P = Q$.
As an immediate result, a trained model in (f-FERM) is fair under the demographic parity notion if and only if
\[
\mathbb{P}(\hat{y}_\theta(x), s) = \mathbb{P}(\hat{y}_\theta(x)) \otimes \mathbb{P}(s), \tag{10.118}
\]
which means the independence of $s$ and $\hat{y}_\theta(x)$.
Next, we show that f-divergences either include or provide an upper bound for well-known notions of fairness violation in the literature.
Proposition 75 Exponential Rényi Mutual Information (ERMI) [222, 210] is an f-divergence with $f(t) = (t-1)^2$.
Proof: Exponential Rényi Mutual Information is defined as [210]:
\[
\operatorname{ERMI}(\hat{y}, s) = \sum_{j \in \mathcal{Y}, k \in \mathcal{S}} \frac{\widehat{P}_{\hat{y},s}(j,k)^2}{\widehat{P}_{\hat{y}}(j)\, \widehat{P}_s(k)} - 1. \tag{10.119}
\]
For the case of $f(t) = (t-1)^2$, we have:
\[
\begin{aligned}
D_f\left( \widehat{P}_{\hat{y}} \otimes \widehat{P}_s \,\big\|\, \widehat{P}_{\hat{y},s} \right)
&= \sum_{j \in \mathcal{Y}} \sum_{k \in \mathcal{S}} \widehat{P}_{\hat{y}}(j)\, \widehat{P}_s(k)\, f\left( \frac{\widehat{P}_{\hat{y},s}(j,k)}{\widehat{P}_{\hat{y}}(j)\, \widehat{P}_s(k)} \right) = \sum_{j \in \mathcal{Y}} \sum_{k \in \mathcal{S}} \widehat{P}_{\hat{y}}(j)\, \widehat{P}_s(k) \left( \frac{\widehat{P}_{\hat{y},s}(j,k)}{\widehat{P}_{\hat{y}}(j)\, \widehat{P}_s(k)} - 1 \right)^2 \\
&= \sum_{j \in \mathcal{Y}} \sum_{k \in \mathcal{S}} \widehat{P}_{\hat{y}}(j)\, \widehat{P}_s(k) \left( \frac{\widehat{P}_{\hat{y},s}(j,k)^2}{\widehat{P}_{\hat{y}}(j)^2\, \widehat{P}_s(k)^2} - 2\frac{\widehat{P}_{\hat{y},s}(j,k)}{\widehat{P}_{\hat{y}}(j)\, \widehat{P}_s(k)} + 1 \right) \\
&= \sum_{j \in \mathcal{Y}} \sum_{k \in \mathcal{S}} \left( \frac{\widehat{P}_{\hat{y},s}(j,k)^2}{\widehat{P}_{\hat{y}}(j)\, \widehat{P}_s(k)} - 2\widehat{P}_{\hat{y},s}(j,k) + \widehat{P}_{\hat{y}}(j)\, \widehat{P}_s(k) \right) = \sum_{j \in \mathcal{Y}} \sum_{k \in \mathcal{S}} \frac{\widehat{P}_{\hat{y},s}(j,k)^2}{\widehat{P}_{\hat{y}}(j)\, \widehat{P}_s(k)} - 2 + 1 = \operatorname{ERMI}(\hat{y}, s).
\end{aligned}
\]
Note that, in the last equality, we use
\[
\sum_{j \in \mathcal{Y}} \sum_{k \in \mathcal{S}} \widehat{P}_{\hat{y},s}(j,k) = \sum_{j \in \mathcal{Y}} \widehat{P}_{\hat{y}}(j) = 1
\]
and
\[
\sum_{j \in \mathcal{Y}} \sum_{k \in \mathcal{S}} \widehat{P}_{\hat{y}}(j)\, \widehat{P}_s(k) = \sum_{j \in \mathcal{Y}} \widehat{P}_{\hat{y}}(j) \sum_{k \in \mathcal{S}} \widehat{P}_s(k) = \sum_{j \in \mathcal{Y}} \widehat{P}_{\hat{y}}(j) = 1.
\]
□
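The identity in Proposition 75 can also be checked numerically in a few lines (a throwaway check of ours, not thesis code):

import numpy as np

rng = np.random.default_rng(2)
P = rng.random((3, 4)); P /= P.sum()          # joint \hat P_{yhat,s}
Q = np.outer(P.sum(axis=1), P.sum(axis=0))    # product of the marginals

f = lambda t: (t - 1.0)**2
D_f = (Q * f(P / Q)).sum()                    # f-divergence with f(t) = (t-1)^2
ermi = (P**2 / Q).sum() - 1.0                 # Eq. (10.119)
assert np.isclose(D_f, ermi)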
Proposition 76 The demographic parity violation is upper-bounded by the f-divergence with $f(t) = (t-1)^2$.
Proof: Based on Proposition 75, ERMI is an f-divergence with $f(t) = (t-1)^2$. Therefore, the proposition is an immediate result of Lemma 3 in [210]. □
Proposition 77 Rényi correlation [22] can be upper-bounded by the f-divergence with the choice of $f(t) = (t-1)^2$.
Proof: Based on Proposition 75, ERMI is an f-divergence with $f(t) = (t-1)^2$. Therefore, the proposition is an immediate result of Lemma 2 in [210]. □
Remark 78 Mutual information as the measure of fairness violation [75] is a special case of f-divergences for the choice of KL-divergence, $f(t) = t\log(t)$, in (f-FERM).
10.6.4 Proof of Proposition 27
Lemma 79 Assume that $f(z)$ is a lower semi-continuous convex function. Then $f$ can be written as the following maximization problem:
\[
f(z) = \max_\alpha \; z^T \alpha - g(\alpha),
\]
where $g$ is the convex conjugate of $f$.
Proof: Let $g$ be the convex conjugate of the function $f$, defined as:
\[
g(\alpha) = \sup_z \; \alpha^T z - f(z).
\]
Since $f$ is a lower semi-continuous convex function, by the Fenchel-Moreau theorem [154] it is biconjugate, which means that taking the conjugate of $g$ transforms it back to $f$. Therefore,
\[
f(z) = \sup_\alpha \; \alpha^T z - g(\alpha),
\]
where $g$ is the convex conjugate of $f$. □
Based on the above lemma, we have:
\[
D_f(P, Q) = \sum_{i=1}^m Q_i\, f\left( \frac{P_i}{Q_i} \right) = \sum_{i=1}^m Q_i \sup_{\alpha_i \in \operatorname{dom} f^*} \left[ \alpha_i \frac{P_i}{Q_i} - f^*(\alpha_i) \right] = \sup_{\alpha_1, \ldots, \alpha_m \in \operatorname{dom} f^*} \sum_{i=1}^m \alpha_i P_i - f^*(\alpha_i) Q_i.
\]
Setting $P = \mathbb{P}(\hat{y}_\theta(x), s)$, $Q = \mathbb{P}(\hat{y}_\theta(x)) \otimes \mathbb{P}(s)$, and $\alpha_i = A_{jk}$, we obtain the formulation in (8.3).
10.6.5 Derivation of Closed-Form Expressions for Unbiased Gradient Estimators of f-Divergences
Proposition 80 For two functions $f(t), g(t)$ such that $g(t) = f(t) + c(t - 1)$, we have $D_f(\cdot \| \cdot) \equiv D_g(\cdot \| \cdot)$.
Proof: The proof follows naturally from [258, Proposition 7.2]. □
Theorem 81 Let $f(t) = (t-1)^2$ and $P(s = k) = \pi_k$ ($\chi^2$ divergence). Then, Equation (8.1) can be written as:
\[
\min_\theta \max_A \sum_{i=1}^n \ell(\hat{y}_\theta(x_i), y_i) + \lambda \sum_j \sum_k \pi_k \left[ A_{jk} \mathbb{P}(\hat{y}_\theta = j \,|\, s = k) - \left( A_{jk} + \frac{A_{jk}^2}{4} \right) \mathbb{P}(\hat{y}_\theta = j) \right]. \tag{10.120}
\]
The variational representation of $f(x) = (x-1)^2$ is given by
\[
f(x) = \sup_\alpha \left( \alpha x - f^*(\alpha) \right),
\]
where $f^*(\alpha)$ is the convex conjugate
\[
f^*(\alpha) = \sup_x \left( x\alpha - f(x) \right).
\]
Setting the derivative of $x\alpha - f(x)$ with respect to $x$ to zero gives $x^* = \alpha/2 + 1$. This results in $f^*(\alpha) = \alpha + \alpha^2/4$.
Theorem 82 Let $f(t) = -\ln(t)$ and $P(s = k) = \pi_k$ (Reverse KL). Then, Equation (8.1) can be written as:
\[
\min_\theta \max_A \sum_{i=1}^n \ell(\hat{y}_\theta(x_i), y_i) + \lambda \sum_j \sum_k \pi_k \left[ A_{jk} \mathbb{P}(\hat{y}_\theta = j \,|\, s = k) + (1 + \ln(-A_{jk})) \mathbb{P}(\hat{y}_\theta = j) \right]. \tag{10.121}
\]
Proceeding as above, the optimal $x^*$ in the supremum defining $f^*(\alpha)$ is $x^* = -1/\alpha$, resulting in $f^*(\alpha) = -1 - \ln(-\alpha)$.
Theorem 83 Let $f(t) = \frac{1}{2}|t - 1|$ and $P(s = k) = \pi_k$ (Total Variation Distance). Then, Equation (8.1) can be written as (where $|A_{jk}| \le \frac{1}{2}$):
\[
\min_\theta \max_A \sum_{i=1}^n \ell(\hat{y}_\theta(x_i), y_i) + \lambda \sum_j \sum_k \pi_k A_{jk} \left[ \mathbb{P}(\hat{y}_\theta = j \,|\, s = k) - \mathbb{P}(\hat{y}_\theta = j) \right]. \tag{10.122}
\]
For $f(t) = \frac{1}{2}|t - 1|$, the variational representation is $f(x) = \sup_\alpha \left( \alpha x - f^*(\alpha) \right)$. Through the convex conjugate $f^*(\alpha)$, we have that
\[
f^*(\alpha) = \sup_x \left( x\alpha - \frac{1}{2}|x - 1| \right) =
\begin{cases}
\infty & \text{for } |\alpha| > \frac{1}{2}, \\
\alpha & \text{for } |\alpha| \le \frac{1}{2}.
\end{cases}
\]
So the constraint $|\alpha| \le \frac{1}{2}$ is required for the supremum/maximum to exist (otherwise it tends to $\infty$).
Theorem 84 Let $f(t) = t\ln(t)$ and $P(s = k) = \pi_k$ (KL Divergence). Then, Equation (8.1) can be written as:
\[
\min_\theta \max_A \sum_{i=1}^n \ell(\hat{y}_\theta(x_i), y_i) + \lambda \sum_j \sum_k \pi_k \left[ A_{jk} \mathbb{P}(\hat{y}_\theta = j \,|\, s = k) - e^{A_{jk} - 1} \mathbb{P}(\hat{y}_\theta = j) \right]. \tag{10.123}
\]
For $f(t) = t\ln(t)$, the convex conjugate can be represented by:
\[
f^*(\alpha) = \sup_x \left( x\alpha - x\ln(x) \right).
\]
Differentiating with respect to $x$ to attain the supremum, we get $x = e^{\alpha - 1}$. Hence, the variational representation of $f(t) = t\ln(t)$ becomes:
\[
f(x) = \sup_\alpha \left( x\alpha - e^{\alpha - 1} \right).
\]
Note: We can also use the affine transformation $\alpha \leftarrow \alpha - 1$, which results in the more commonly studied version in the literature:
\[
D(P \| Q) = 1 + \sup_{g: \mathcal{X} \to \mathbb{R}} \mathbb{E}_P[g(X)] - \mathbb{E}_Q[e^{g(X)}].
\]
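Each conjugate derived in this appendix can be verified by checking that the variational form recovers $f$, i.e., $\sup_\alpha (t\alpha - f^*(\alpha)) = f(t)$. A grid-based numerical check of ours for the $\chi^2$, reverse-KL, and KL cases:

import numpy as np

t = 1.7
cases = [  # (f, conjugate f*, grid of alpha values over dom f*)
    (lambda x: (x - 1)**2,    lambda a: a + a**2 / 4,    np.linspace(-3, 3, 10**5)),
    (lambda x: -np.log(x),    lambda a: -1 - np.log(-a), np.linspace(-10, -1e-3, 10**5)),
    (lambda x: x * np.log(x), lambda a: np.exp(a - 1),   np.linspace(-5, 5, 10**5)),
]
for f, f_star, grid in cases:
    sup_val = (t * grid - f_star(grid)).max()
    assert np.isclose(sup_val, f(t), atol=1e-3)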
Theorem 85 Let $f(t) = -(t + 1)\ln\left(\frac{t+1}{2}\right) + t\ln(t)$ and $P(s = k) = \pi_k$ (Jensen-Shannon Divergence). Then, Equation (8.1) can be written as:
\[
\min_\theta \max_A \sum_{i=1}^n \ell(\hat{y}_\theta(x_i), y_i) + \lambda \sum_j \sum_k \pi_k \left[ A_{jk} \mathbb{P}(\hat{y}_\theta = j \,|\, s = k) + \ln(2 - e^{A_{jk}}) \mathbb{P}(\hat{y}_\theta = j) \right]. \tag{10.124}
\]
For the JS divergence, we have $f(t) = -(t + 1)\ln\left(\frac{t+1}{2}\right) + t\ln(t)$, whose convex conjugate can be represented as:
\[
f^*(\alpha) = \sup_x \left( \alpha x + (x + 1)\ln\frac{x + 1}{2} - x\ln(x) \right).
\]
Differentiating with respect to $x$ to obtain the supremum, we have
\[
\frac{2x}{x + 1} = e^\alpha \implies x = \frac{e^\alpha}{2 - e^\alpha}.
\]
Substituting $x$ into $f^*(\alpha)$,
\[
f^*(\alpha) = -\ln(2 - e^\alpha).
\]
Thus, in $f(x) = \sup_\alpha \left( x\alpha - f^*(\alpha) \right)$, we get the variational form as:
\[
f(x) = \sup_\alpha \left( x\alpha + \ln(2 - e^\alpha) \right).
\]
Theorem 86 Let
\[
f(t) =
\begin{cases}
\dfrac{t^\alpha - \alpha t - (1 - \alpha)}{\alpha(\alpha - 1)} & \text{if } \alpha \ne 0,\ \alpha \ne 1, \\
t\ln(t) - t + 1 & \text{if } \alpha = 1, \\
-\ln(t) + t - 1 & \text{if } \alpha = 0,
\end{cases}
\]
and $P(s = k) = \pi_k$ (general $\alpha$-divergence). Then, Equation (8.1) can be written as:
\[
\min_\theta \max_A \sum_{i=1}^n \ell(\hat{y}_\theta(x_i), y_i) + \lambda \sum_j \sum_k \pi_k \left[ A_{jk} \mathbb{P}(\hat{y}_\theta = j \,|\, s = k) - \frac{\mathbb{P}(\hat{y}_\theta = j)}{\alpha} \left( \left( (\alpha - 1) A_{jk} + 1 \right)^{\frac{\alpha}{\alpha - 1}} - 1 \right) \right]. \tag{10.125}
\]
Excluding the limiting cases $\alpha = 1$ and $\alpha = 0$, we can find the convex conjugate $f^*(y)$ as:
\[
f^*(y) = \sup_x \left( xy - f(x) \right) = \sup_x \left( xy - \frac{x^\alpha - \alpha x - (1 - \alpha)}{\alpha(\alpha - 1)} \right).
\]
Differentiating with respect to $x$, we obtain (here the variational parameter is $y$; do not confuse it with the constant $\alpha$)
\[
x^* = \left( (\alpha - 1)y + 1 \right)^{\frac{1}{\alpha - 1}}.
\]
Thus,
\[
f^*(y) = \frac{\left( (\alpha - 1)y + 1 \right)^{\frac{\alpha}{\alpha - 1}}}{\alpha} - \frac{1}{\alpha}.
\]
The KL divergence and reverse KL divergence are obtained by taking the limit as $\alpha$ tends to 1 and 0, respectively.
Note: The standard literature on divergences often parametrizes the $\alpha$-divergence as
\[
f(t) =
\begin{cases}
t\ln(t) & \text{if } \alpha = 1, \\
-\ln(t) & \text{if } \alpha = -1, \\
\dfrac{4}{1 - \alpha^2}\left( 1 - t^{(1 + \alpha)/2} \right) & \text{otherwise.}
\end{cases}
\]
This is tantamount to the substitution $\alpha \leftarrow \frac{1 + \alpha}{2}$ in the original definition of the generalized f-divergence.
Theorem 87 Let $f(t) = (\sqrt{t} - 1)^2$ (equivalently, by Proposition 80, $f(t) = 2(1 - \sqrt{t})$) and $P(s = k) = \pi_k$ (Squared Hellinger Distance). Then, Equation (8.1) can be written as:
\[
\min_\theta \max_A \sum_{i=1}^n \ell(\hat{y}_\theta(x_i), y_i) + \lambda \sum_j \sum_k \pi_k \left[ A_{jk} \mathbb{P}(\hat{y}_\theta = j \,|\, s = k) + \mathbb{P}(\hat{y}_\theta = j) \left( \frac{1}{A_{jk}} + 2 \right) \right]. \tag{10.126}
\]
For the squared Hellinger distance,
\[
f^*(\alpha) = \sup_x \left( x\alpha - f(x) \right) = \sup_x \left( x\alpha - 2(1 - \sqrt{x}) \right).
\]
Differentiating with respect to $x$, we get
\[
\alpha + \frac{1}{\sqrt{x}} = 0 \quad (\text{note } \alpha < 0) \implies x = \frac{1}{\alpha^2} \implies f^*(\alpha) = \frac{\alpha}{\alpha^2} - 2 + \frac{(-2)}{\alpha} = -\frac{1}{\alpha} - 2.
\]
Note that the first, second, and third terms are negative, negative, and positive, respectively; hence the appropriate choice of $\operatorname{sign}(\alpha)$ for the terms with odd powers of $\alpha$.
10.6.6 Formal Statement of Theorem 28 and Proof
Theorem 88 (Formal statement of Theorem 28) Let $(x_i, y_i, s_i)$, $1 \le i \le n$, be a collection of $n$ data points satisfying the following assumptions:
• $\ell(\cdot, x, y)$ is $G$-Lipschitz and $\beta_\ell$-smooth for all $x_i, y_i$.
• $F_j(\cdot, \theta)$ is $L$-Lipschitz and $b$-smooth for all $\theta$ and all label classes $j$.
• $\hat{p}_{\hat{y}}^{\min} := \inf_{\{\theta^t, t \in [T]\}} \min_{j \in [m]} \frac{1}{N} \sum_{i=1}^N \hat{y}_{\theta,j}(x_i) \ge \frac{\mu}{2} > 0$.
• $\hat{p}_S^{\min} := \min_{j \in [k]} \frac{1}{N} \sum_{i=1}^N \mathbf{1}_{\{s_i = j\}} > 0$.
Choose $\eta_\theta = \Theta\left( \frac{\epsilon^4}{\ell^3 L^2 D^2} \right)$ and $\eta_\alpha = \Theta\left( \frac{\epsilon^2}{\ell \sigma^2} \right)$ and a mini-batch size of 1. Then Algorithm 15 finds an $\epsilon$-stationary point of Problem (f-FERM) in $O(\frac{1}{\epsilon^8})$ iterations.
Remark 89 The first assumption listed in the theorem statement is true for popular losses such as the cross-entropy loss and the squared loss (assuming that the input data takes values in a bounded set, which holds for all real-world datasets).
Remark 90 The second assumption holds for popular classifiers generating probability vectors (e.g., logits in neural networks, logistic regression outputs). For classifiers with no probability output, one must first transform the output to a number between zero and one.
Remark 91 The third assumption states that, at each iteration, the probability of assigning any given label must not be zero across all data points.
Remark 92 Finally, the fourth assumption ensures that each sensitive class has non-zero probability. In other words, for every sensitive group, there should be at least one point in the dataset with that sensitive attribute. This holds for all benchmark datasets in practice. Simply put, any protected group appearing during the test phase must have at least one representative in the training data.
The following lemma is helpful for the proof of the theorem:
Lemma 93 Let $A_1, \ldots, A_n$ be $n$ random variables such that $\|A_i\|_2 \le c_i$. Then, we have:
\[
\mathbb{E}\left[ \left\| \sum_{i=1}^n A_i \right\|_2^2 \right] \le n \sum_{i=1}^n c_i^2. \tag{10.127}
\]
Proof:
\[
\left\| \sum_{i=1}^n A_i \right\|_2^2 = \sum_i \|A_i\|_2^2 + 2\sum_{i < j} \langle A_i, A_j \rangle \le \sum_i \|A_i\|_2^2 + \sum_{i < j} \left( \|A_i\|_2^2 + \|A_j\|_2^2 \right) = n \sum_{i=1}^n \|A_i\|_2^2,
\]
which is based on the fact that $2\langle A_i, A_j \rangle \le \|A_i\|_2^2 + \|A_j\|_2^2$. Therefore:
\[
\mathbb{E}\left[ \left\| \sum_{i=1}^n A_i \right\|_2^2 \right] \le n \sum_{i=1}^n \mathbb{E}[\|A_i\|_2^2] \le n \sum_{i=1}^n c_i^2. \qquad \Box
\]
Now, we are ready to prove Theorem 88.
Proof: The proof consists of three main steps. First, we show that the gradient estimator in Algorithm 15 is unbiased. Since the samples are IID, for any function $\psi(\cdot, \cdot)$ and an IID batch of data points $B$, we have:
\[
\mathbb{E}\left[ \frac{1}{|B|} \sum_{(x, y) \in B} \nabla \psi(x, y) \right] = \frac{1}{|B|} \sum_{(x, y)} \mathbb{E}[\nabla \psi(x, y)] = \mathbb{E}_{(x, y) \sim \mathbb{P}(x, y, s)}[\nabla \psi(x, y)].
\]
As an immediate result, if the objective function is written as a sum over $n$ functions, the gradient estimator over an IID batch of data will be unbiased. According to Equation (8.5), the objective function has the desired form:
\[
\min_\theta \max_A \frac{1}{n} \sum_{i=1}^n \ell(\hat{y}_\theta(x_i), y_i) + \lambda \sum_{j \in \mathcal{Y}, k \in \mathcal{S}} \left[ A_{jk} F_j(x_i; \theta) \mathbf{1}(s_i = k) - f^*(A_{jk}) \pi_k F_j(x_i; \theta) \right]. \tag{10.128}
\]
Next, we need to show the boundedness of the variance of the gradient estimator. Let
\[
G_B = \frac{1}{|B|} \sum_{(x_i, y_i) \in B} \nabla_\theta \ell(\hat{y}_\theta(x_i), y_i) + \lambda \sum_{j \in \mathcal{Y}, k \in \mathcal{S}} \left[ A_{jk} \nabla_\theta F_j(x_i; \theta) \mathbf{1}(s_i = k) - f^*(A_{jk}) \pi_k \nabla_\theta F_j(x_i; \theta) \right].
\]
We need to bound, for a given data batch,
\[
\mathbb{E}[\|G_B - G_n\|_2^2],
\]
where $G_n$ is the gradient with respect to all $n$ data points (i.e., when $B = \{1, \ldots, n\}$). Note that:
\[
\|G_B - G_n\|_2^2 \le 2\left( \|G_B\|_2^2 + \|G_n\|_2^2 \right).
\]
Thus, it suffices to show that the gradient is bounded for any given batch $B$. Since the samples are independent of each other and identically distributed from $P^{\text{train}}$ (IID samples), the second-order moment of the average over $|B|$ data points is $1/|B|$ times the variance of a single data point. Thus, we need to show the boundedness of the gradient for a given data point drawn from the training distribution:
\[
\nabla_\theta \ell(\hat{y}_\theta(x_i), y_i) + \lambda \sum_{j \in \mathcal{Y}, k \in \mathcal{S}} \left[ A_{jk} \nabla_\theta F_j(x_i; \theta) \mathbf{1}(s_i = k) - f^*(A_{jk}) \pi_k \nabla_\theta F_j(x_i; \theta) \right]. \tag{10.129}
\]
Based on the first assumption:
\[
\|\nabla_\theta \ell(\hat{y}_\theta(x_i), y_i)\|_2 \le G. \tag{10.130}
\]
Based on the second assumption:
\[
\|A_{jk} \nabla_\theta F_j(x_i; \theta) \mathbf{1}(s_i = k)\|_2 \le L A_{jk}, \tag{10.131}
\]
\[
\|\pi_k f^*(A_{jk}) \nabla_\theta F_j(x_i; \theta)\|_2 \le \pi_k L f^*(A_{jk}). \tag{10.132}
\]
These terms are bounded if $A_{jk}$ is bounded and $f^*(A_{jk})$ is bounded for any $A_{jk}$. This holds true for all f-divergences given assumptions 3 and 4. To see why, it suffices to find the optimal solution for each f-divergence by setting the gradient with respect to $A_{jk}$ to zero. In all cases, the solution is a combination of the $P_{s_k}$ and $P_{\hat{y}_j}$ terms, which are non-zero and bounded (by assumptions 3 and 4). Since each term in (10.129) is bounded, the expectation of the squared norm is also bounded, according to Lemma 93.
Finally, given that the estimator is unbiased and the variance is bounded (Assumption 4.1 in [203] holds), the two-time-scale stochastic gradient descent-ascent algorithm (which is Algorithm 15) finds an $\epsilon$-stationary solution of the problem in $O(\frac{1}{\epsilon^8})$ iterations, according to Theorem 4.9 in [203]. □
Remark 94 For the case of a strongly convex f-divergence (e.g., the $\chi^2$ divergence), a convergence rate of $O(\kappa^3 \epsilon^{-4})$ can be obtained (Theorem 4.5 in [203]). Such an iteration complexity holds for a batch size of $O(\frac{\kappa \sigma^2}{\epsilon^2})$. If the batch size is as small as one, the rate becomes $O(\kappa^3 \epsilon^{-5})$.
Remark 95 If the batch size is $n$ (deterministic), a rate of $O(\epsilon^{-6})$ can be obtained according to Theorem 4.8 in [203]. Note that this does not necessarily translate to a better runtime than the stochastic case, because the per-iteration cost of evaluating the gradient over $n$ data points can be much higher than evaluating it on just 1 (or a small number of) data points.
10.6.7 A Faster (But Double-Loop) Algorithm for Optimizing (f-FERM)
We apply SREDA [216] to find an $\epsilon$-stationary solution of Problem (f-FERM). Note that SREDA works for non-convex strongly-concave min-max problems. We can directly apply the algorithm when $f$ is the $\chi^2$-divergence. In cases where the function is concave but not strongly concave (e.g., KL divergence and reverse KL), we first consider the following approximation:
$$\min_{\theta}\max_{A}\;\frac{1}{n}\sum_{i=1}^{n}\Big\{\ell(\hat{y}_\theta(x_i), y_i) + \lambda\sum_{j\in\mathcal{Y},\,k\in\mathcal{S}}\Big[A_{jk}F_j(x_i;\theta)\mathbb{1}(s_i = k) - f^*(A_{jk})\,\pi_k F_j(x_i;\theta) - \frac{\epsilon}{2}\|A_{jk}\|^2\Big]\Big\} \qquad (10.133)$$
The maximization problem is now $\epsilon$-strongly concave. If we apply SREDA (see Algorithm 3 in [216]), we find an $\epsilon$-stationary solution of Problem (10.133) in $O(\kappa^3\epsilon^{-3})$ iterations, where $\kappa = \frac{L}{\mu}$ is the condition number. In our case, $\mu$, the strong-concavity modulus, can be set to the desired accuracy $\epsilon$ so that solving the approximate problem (10.133) leads to an approximate stationary point of the original problem. Therefore, the rate of convergence will be $O(\epsilon^{-6})$. Note that applying SREDA finds an $\epsilon$-stationary solution of Problem (10.133); similar to Theorem 29, since the added regularization term is small enough, the obtained solution is an $O(\epsilon)$-stationary solution of the original problem (f-FERM). An important note is that the SREDA algorithm (line 13 in Algorithm 3 of [216]) has a nested loop, in contrast to the single-loop SGDA algorithm proposed in Algorithm 15. Therefore, the $O(\epsilon^{-6})$ iteration complexity bound does not necessarily translate to an improved runtime in practice. Algorithm 25 describes SREDA applied to Problem (10.133). For simplicity of presentation, define the summation argument over the $n$ data points as $\phi(x_i, y_i, s_i, \theta, A)$. The ConcaveMaximizer module is described in Algorithm 4 in [216].
Algorithm 25 SREDA Algorithm For Solving (f-FERM)
1: Input: period q, m > 0, step-size η_θ, fairness parameter λ ≥ 0, iteration number T, batch sizes S and R.
2: for t = 1, . . . , T do
3:   if t mod q = 0 then
4:     Draw S samples (x′_1, y′_1), . . . , (x′_S, y′_S)
5:     v_t = (1/S) Σ_{i=1}^{S} ∇_θ φ(x′_i, y′_i, s′_i, θ, A)
6:     u_t = (1/S) Σ_{i=1}^{S} ∇_A φ(x′_i, y′_i, s′_i, θ, A)
7:   else
8:     v_t = v′_t
9:     u_t = u′_t
10:  θ_{t+1} = θ_t − η_θ v_t
11:  (A_{t+1}, v′_{t+1}, u′_{t+1}) = ConcaveMaximizer(t, m, R, θ_t, θ_{t+1}, A_t, u_t, v_t)
12: Return: θ_T
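A minimal Python sketch of the control flow of Algorithm 25 follows; the sampler, the gradient oracles, and the ConcaveMaximizer stub are all placeholders (in particular, the stub omits the recursive variance-reduced updates of v and u that SREDA actually uses):

import numpy as np

rng = np.random.default_rng(4)

def sampler(size):
    # placeholder data source
    return rng.normal(size=(size, 3))

def grad_theta(theta, A, s):
    # placeholder stochastic gradient of phi in theta
    return theta + 0.1 * A.sum() + 0.01 * s.mean()

def grad_A(theta, A, s):
    # placeholder stochastic gradient of phi in A
    return -A + 0.01 * s.mean()

def concave_maximizer(theta, theta_next, A, u, v, m=5, R=8, eta_A=1e-2):
    # stub for Algorithm 4 in [216]: a few ascent steps on A
    for _ in range(m):
        batch = sampler(R)
        A = A + eta_A * np.mean([grad_A(theta_next, A, s) for s in batch], axis=0)
    return A, v, u

theta, A = rng.normal(size=3), np.zeros(3)
T, q, S, eta_theta = 100, 10, 64, 1e-3
v = u = None
for t in range(T):
    if t % q == 0:  # refresh the gradient estimates from a large batch every q steps
        batch = sampler(S)
        v = np.mean([grad_theta(theta, A, s) for s in batch], axis=0)
        u = np.mean([grad_A(theta, A, s) for s in batch], axis=0)
    theta_next = theta - eta_theta * v
    A, v, u = concave_maximizer(theta, theta_next, A, u, v)
    theta = theta_next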
10.6.8 A First-order Optimization Algorithm for Optimizing (8.10)
This section presents a first-order optimization algorithm for optimizing (8.10). The details are presented in Algorithm 26. Further, we show the convergence of the algorithm to an $\epsilon$-stationary solution of Problem (8.10) in $O(\epsilon^{-8})$ iterations.
Theorem 96 Assume that $\ell(\cdot, \cdot)$ and $F_j(\cdot, \theta)$ are Lipschitz continuous for any given $j$ and $\theta$, and that their gradients are $L$-Lipschitz. Further, assume that $\mathbb{P}(s = k) > 0$ for all protected groups and $\mathbb{P}(\hat{y}_\theta = j) > 0$ at every iteration for all labels $j$. Then, for any given batch size $1 \leq |B| \leq n$, Algorithm 26 finds an $\epsilon$-stationary solution of (f-FERM) in $O(\epsilon^{-8})$ iterations for any given $\epsilon > 0$.
The proof of the theorem is similar to that of Theorem 28, as the objective function is nonconvex-concave. One can obtain faster algorithms under additional assumptions. For example, if the feasible set for $\theta$ is assumed to be compact (e.g., we restrict the norm of the weights), then the algorithm can be accelerated to $O(\epsilon^{-6})$; see [266]. Moreover, if we consider full batch sizes, we can utilize Algorithm 2 in [246], which yields a convergence rate of $O(\epsilon^{-2})$ (Theorem 5.2 therein).
Algorithm 26 Gradient-Regularization Robust Training Algorithm
1: Input: θ^0 ∈ R^{d_θ}, step-sizes η_θ and η_α, fairness parameter λ ≥ 0, iteration number T, batch size b
2: for t = 1, . . . , T do
3:   Sample a minibatch of data b_t = {(x_1, y_1), . . . , (x_b, y_b)}
4:   Estimate P(ŷ_{θ^t}) on the minibatch b_t
5:   repeat
6:     dA_{jk} = ∇_A ( A_{jk} P_{ŷ_θ,s} − f*(A_{jk}) P_{ŷ_θ} P_s )
7:     A_{jk} = A_{jk} + η_α dA_{jk}
8:   until convergence to A*_{jk}
9:   Obtain closed-form expressions for ∂/∂θ ‖∇_P D_f(P‖Q)‖_2^2 and ∂/∂θ ‖∇_Q D_f(P‖Q)‖_2^2 in terms of P_{ŷ_θ}
10:  dθ = ∇_θ [ ℓ(θ^{t−1}, x, y) + λ ( D_f(P̂‖Q̂) + ϵ ( ‖∇_P D_f(P̂‖Q̂)‖_2^2 + ‖∇_Q D_f(P̂‖Q̂)‖_2^2 ) ) ]
11:  θ^t = θ^{t−1} − η_θ dθ
12: Return: θ^T
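The inner loop of Algorithm 26 (lines 5–8) admits a simple concrete instance. Assuming the convention $f(t) = (t-1)^2$ for the $\chi^2$ divergence, the conjugate is $f^*(a) = a^2/4 + a$, so the ascent direction is $p - (A/2 + 1)q$ with fixed point $A^* = 2(p/q - 1)$. The following toy Python sketch (scalars p and q standing in for $P_{\hat{y},s}$ and $P_{\hat{y}}P_s$) verifies that the ascent converges to this closed form:

# inner maximization of Algorithm 26 for the chi^2 divergence:
# maximize A*p - f*(A)*q with f*(A) = A^2/4 + A, so dA = p - (A/2 + 1)*q
p, q = 0.3, 0.2   # toy values of P_{yhat,s} and P_yhat * P_s
A, eta = 0.0, 0.1
for _ in range(500):
    A += eta * (p - (A / 2 + 1) * q)
print(A, 2 * (p / q - 1))  # the ascent iterate matches the closed form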
10.6.9 Proof of Equation (8.12)
To show that Problem (8.8) is equivalent to (8.12) under the $\ell_p$ norm and the probability simplex constraint relaxation, note that the maximization problem in (8.8) is a constrained convex maximization with respect to $P$. Therefore, there is a global solution on the boundary. The maximization problem under $\ell_\infty$ can be written as:
$$\max_{\substack{\|P - \hat{P}\|_\infty \leq \delta \\ \|Q - \hat{Q}\|_\infty \leq \delta}} D_f(P\|Q), \qquad (10.134)$$
or equivalently:
$$\max_{\substack{\|P - \hat{P}\|_\infty \leq \delta \\ \|Q - \hat{Q}\|_\infty \leq \delta}} \;\sum_{j=1}^{m} Q_j f\Big(\frac{P_j}{Q_j}\Big), \qquad (10.135)$$
For the choices of KL-divergence ($f(t) = t\log(t)$) and $\chi^2$ divergence ($f(t) = (t-1)^2$), $f$ is a non-decreasing function. Fixing a $j \in \{1, \dots, m\}$, the maximum with respect to $P_j$ is attained when $P_j$ is maximized. The maximum of $P_j$ is obtained on the boundary, where $\delta$ is added to $\hat{P}_j$. Since $\hat{P}_j + \delta$ should be a probability value, if it is larger than 1 we project it back to 1. As a result, the maximum of $P_j$ is $\min(\hat{P}_j + \delta, 1)$. Further, $f$ in both choices is super-linear, meaning that $Q_j f(\frac{P_j}{Q_j})$ is a non-increasing function with respect to $Q_j$. Thus, its maximum with respect to $Q_j$ is attained when $Q_j$ is minimized. Therefore, the optimal solution is $\hat{Q}_j - \delta$, or, if it drops below 0, we project it back to 0. Applying the same argument to all $j$'s, we obtain Equation (8.12).
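This boundary solution is straightforward to implement. The following Python sketch (worst_case_PQ is a hypothetical helper name) computes the worst-case P and Q under the ℓ∞ relaxation exactly as derived above and checks, for the KL choice of f, that the perturbation does not decrease the divergence:

import numpy as np

def worst_case_PQ(P_hat, Q_hat, delta):
    # per Section 10.6.9: push each P_j up and each Q_j down by delta,
    # clipping back to valid probability values
    P = np.minimum(P_hat + delta, 1.0)
    Q = np.maximum(Q_hat - delta, 0.0)
    return P, Q

def kl(P, Q, eps=1e-12):
    # D_f for f(t) = t log t (a small eps guards against division by zero)
    return np.sum(P * np.log((P + eps) / (Q + eps)))

P_hat, Q_hat = np.array([0.6, 0.4]), np.array([0.5, 0.5])
P, Q = worst_case_PQ(P_hat, Q_hat, delta=0.05)
print(kl(P, Q), ">=", kl(P_hat, Q_hat))  # the worst case increases the divergence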
10.6.10 Details of Tuning Hyperparameters
In all experiments, we set $\eta_\theta = 10^{-5}$ and $\eta_\alpha = 10^{-6}$. Further, we train the model with $\lambda = 0$ for 300 epochs, and then we set $\lambda$ to the considered value. We continue the training until 2000 epochs. The range of $\lambda$ used to generate each point in the tradeoff figures varies across f-divergences. For KL-divergence, the $\lambda$ range is $[0, 150]$; for the $\chi^2$ divergence it is $[0, 300]$; and for reverse KL it is $[0, 50]$. Moreover, the $\lambda$ ranges for JS and squared Hellinger are $[0, 110]$ and $[0, 250]$, respectively. Note that larger values outside these ranges lead to models that output 0 predictions for all inputs.
In the DRO case, aside from $\lambda$ we must tune the robustness parameter $\delta$. To achieve the best result, we have two different strategies depending on the availability of data from the target domain. Suppose we have access to a collection of data points from the target domain. In that case, we use it as the validation set to choose the optimal combination of $\lambda \in \{0.1, 0.5, 1, 2, 5, 10, 20, 50\}$ and $\delta \in \{0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10\}$. In the second scenario, when we do not have any access to target-domain data, we perform a k-fold cross-validation on the source data. A more elegant way is to create the validation dataset by oversampling the minority groups; having access to the oversampled validation set, we choose the optimal $\lambda$ and $\delta$ as in the first scenario. In the experiment regarding Figure 8.4, we reserve 5% of the data from the target domain for validation (scenario 1). In Figure 8.2, we apply scenario 2 to tune the hyperparameters $\lambda$ and $\delta$.
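The selection procedure in scenario 1 amounts to a grid search over the two grids above. A minimal Python sketch is given below; train_and_score is a placeholder for training the DRO model with the given (λ, δ) and returning a validation score that combines accuracy and fairness violation:

from itertools import product

lambdas = [0.1, 0.5, 1, 2, 5, 10, 20, 50]
deltas = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10]

def train_and_score(lam, delta):
    # placeholder: train with (lam, delta) and score on the validation set
    return -(lam - 2) ** 2 - (delta - 0.1) ** 2  # dummy surrogate

best = max(product(lambdas, deltas), key=lambda pair: train_and_score(*pair))
print("selected (lambda, delta):", best)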
10.6.11 Further Experiments on Other Datasets and Notions of Fairness
In this section, we apply (f-FERM), [141], [222], and [22] to the COMPAS and German Credit datasets. In the experiment on COMPAS, we use equality of opportunity as the measure of fairness violation, while in the German Credit dataset experiment, we use equalized odds. The results show that (f-FERM) achieves a significantly better accuracy-fairness tradeoff than the other approaches. The batch size is equal to 64 for all methods.
Figure 10.15: Performance of the trained fair models on COMPAS and German Credit Datasets.
Bibliography
[1] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu.
“Convolutional neural networks for speech recognition”. In: IEEE/ACM Transactions on audio,
speech, and language processing 22.10 (2014), pp. 1533–1545.
[2] Mohamed Reda Abonazel and Mohamed Gamal Ibrahim. “On estimation methods for binary
logistic regression model with missing values”. In: International Journal of Mathematics and
Computational Science 4.3 (2018), pp. 79–85.
[3] Civil Rights Act. “Civil rights act of 1964”. In: Title VII, Equal Employment Opportunities (1964).
[4] Civil Rights Act. “Civil Rights Act of 1964, Title VII, Equal Employment Opportunities”. In: (1964).
[5] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. “A
reductions approach to fair classification”. In: arXiv preprint arXiv:1803.02453 (2018).
[6] Sina Aghaei, Mohammad Javad Azizi, and Phebe Vayanos. “Learning optimal and fair decision
trees for non-discriminative decision-making”. In: Proceedings of the AAAI Conference on Artificial
Intelligence. Vol. 33. 2019, pp. 1418–1426.
[7] Muhammad Aurangzeb Ahmad, Carly Eckert, and Ankur Teredesai. “Interpretable machine
learning in healthcare”. In: Proceedings of the 2018 ACM international conference on bioinformatics,
computational biology, and health informatics. 2018, pp. 559–560.
[8] Ahmad Ajalloeian and Sebastian U. Stich. “Analysis of SGD with Biased Gradient Estimators”. In:
CoRR abs/2008.00051 (2020). eprint: 2008.00051. url: https://arxiv.org/abs/2008.00051.
[9] Daniel Alabi, Nicole Immorlica, and Adam Kalai. “Unleashing linear optimizers for group-fair
learning and optimization”. In: Conference On Learning Theory. 2018, pp. 2043–2066.
[10] Wael Alghamdi, Hsiang Hsu, Haewon Jeong, Hao Wang, Peter Michalak, Shahab Asoodeh, and
Flavio Calmon. “Beyond Adult and COMPAS: Fair multi-class prediction via information
projection”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 38747–38760.
[11] Nada Almugren and Hala Alshamlan. “A survey on hybrid feature selection methods in
microarray gene expression data for cancer classification”. In: IEEE access 7 (2019),
pp. 78533–78548.
[12] Orly Alter, Patrick O Brown, and David Botstein. “Singular value decomposition for
genome-wide expression data processing and modeling”. In: Proceedings of the National Academy
of Sciences 97.18 (2000), pp. 10101–10106.
[13] Bang An, Zora Che, Mucong Ding, and Furong Huang. “Transferring fairness under distribution
shifts via fair consistency regularization”. In: arXiv preprint arXiv:2206.12796 (2022).
[14] Animashree Anandkumar, Rong Ge, Daniel J Hsu, Sham M Kakade, Matus Telgarsky, et al.
“Tensor decompositions for learning latent variable models.” In: J. Mach. Learn. Res. 15.1 (2014),
pp. 2773–2832.
[15] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. “Machine Bias”. In: ProPublica
(2016).
[16] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. “Machine bias”. In: Ethics of Data
and Analytics. Auerbach Publications, 2016, pp. 254–264.
[17] Bernardo Aquino, Arash Rahnama, Peter Seiler, Lizhen Lin, and Vijay Gupta. “Robustness against
adversarial attacks in neural networks using incremental dissipativity”. In: IEEE Control Systems
Letters 6 (2022), pp. 2341–2346.
[18] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. “Invariant risk
minimization”. In: arXiv preprint arXiv:1907.02893 (2019).
[19] Martin Arjovsky, Soumith Chintala, and Léon Bottou. “Wasserstein generative adversarial
networks”. In: International conference on machine learning. PMLR. 2017, pp. 214–223.
[20] Jakob Assländer, Martijn A Cloos, Florian Knoll, Daniel K Sodickson, Jürgen Hennig, and
Riccardo Lattanzi. “Low rank alternating direction method of multipliers reconstruction for MR
fingerprinting”. In: Magnetic resonance in medicine 79.1 (2018), pp. 83–96.
[21] Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and Tal Wagner.
“Scalable Fair Clustering”. In: arXiv preprint arXiv:1902.03519 (2019).
[22] Sina Baharlouei, Maher Nouiehed, Ahmad Beirami, and Meisam Razaviyayn. “Rényi Fair
Inference”. In: International Conference on Learning Representations. 2020.
[23] Sina Baharlouei, Maher Nouiehed, Ahmad Beirami, and Meisam Razaviyayn. “Rényi Fair
Inference”. In: ICLR. 2020.
[24] Sina Baharlouei, Kelechi Ogudu, Sze-chuan Suen, and Meisam Razaviyayn. “RIFLE: Imputation
and Robust Inference from Low Order Marginals”. In: Transactions on Machine Learning Research
(2023). issn: 2835-8856. url: https://openreview.net/forum?id=oud7Ny0KQy.
[25] Sina Baharlouei, Shivam Patel, and Meisam Razaviyayn. “f-FERM: A Scalable Framework for
Robust Fair Empirical Risk Minimization”. In: The Twelfth International Conference on Learning
Representations. 2024.
[26] Sina Baharlouei and Meisam Razaviyayn. “Dr. FERMI: A Stochastic Distributionally Robust Fair
Empirical Risk Minimization Framework”. In: arXiv preprint arXiv:2309.11682 (2023).
[27] Sina Baharlouei, Meisam Razaviyayn, Elizabeth Tseng, and David Tse. “I-CONVEX: Fast and
Accurate de Novo Transcriptome Recovery from Long Reads”. In: Joint European Conference on
Machine Learning and Knowledge Discovery in Databases. Springer. 2022, pp. 339–363.
[28] Sina Baharlouei, Fatemeh Sheikholeslami, Meisam Razaviyayn, and Zico Kolter. “Improving
Adversarial Robustness via Joint Classification and Multiple Explicit Detection Classes”. In:
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics. Ed. by
Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent. Vol. 206. Proceedings of Machine
Learning Research. PMLR, 2023, pp. 11059–11078.
[29] Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. “Metareg: Towards domain
generalization using meta-regularization”. In: Advances in neural information processing systems
31 (2018).
[30] Mislav Balunovic and Martin Vechev. “Adversarial training and provable defenses: Bridging the
gap”. In: International Conference on Learning Representations. 2019.
[31] Babak Barazandeh and Meisam Razaviyayn. “Solving Non-Convex Non-Differentiable Min-Max
Games using Proximal Gradient Method”. In: arXiv preprint arXiv:2003.08093 (2020).
[32] Normand J Beaudry and Renato Renner. “An intuitive proof of the data processing inequality”. In:
arXiv preprint arXiv:1107.0740 (2011).
[33] Brett K Beaulieu-Jones, Daniel R Lavage, John W Snyder, Jason H Moore, Sarah A Pendergrass,
and Christopher R Bauer. “Characterizing and managing missing structured data in electronic
health records: data analysis”. In: JMIR medical informatics 6.1 (2018), e11.
[34] Yahav Bechavod and Katrina Ligett. “Penalizing unfairness in binary classification”. In: arXiv
preprint arXiv:1707.00044 (2017).
[35] Yahav Bechavod, Katrina Ligett, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. “Equal
opportunity in online classification with partial feedback”. In: arXiv preprint arXiv:1902.02242
(2019).
[36] Amir Beck and Marc Teboulle. “A fast iterative shrinkage-thresholding algorithm for linear
inverse problems”. In: SIAM journal on imaging sciences 2.1 (2009), pp. 183–202.
[37] Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository. DOI:
https://doi.org/10.24432/C5XW20. 1996.
[38] Alexandre Belloni, Victor Chernozhukov, and Lie Wang. “Square-root lasso: pivotal recovery of
sparse signals via conic programming”. In: Biometrika 98.4 (2011), pp. 791–806.
[39] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and
Jennifer Wortman Vaughan. “A theory of learning from different domains”. In: Machine learning
79.1 (2010), pp. 151–175.
[40] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. “Analysis of representations
for domain adaptation”. In: Advances in neural information processing systems 19 (2006).
[41] Yoav Benjamini and Yosef Hochberg. “Controlling the false discovery rate: a practical and
powerful approach to multiple testing”. In: Journal of the Royal statistical society: series B
(Methodological) 57.1 (1995), pp. 289–300.
[42] Ioana O Bercea, Martin Groß, Samir Khuller, Aounon Kumar, Clemens Rösner, Daniel R Schmidt,
and Melanie Schmidt. “On the cost of essentially fair clusterings”. In: arXiv preprint
arXiv:1811.10319 (2018).
[43] Richard Berk, Hoda Heidari, Shahin Jabbari, Matthew Joseph, Michael Kearns,
Jamie Morgenstern, Seth Neel, and Aaron Roth. “A convex framework for fair regression”. In:
arXiv preprint arXiv:1706.02409 (2017).
[44] Dimitri P Bertsekas. “Control of uncertain systems with a set-membership description of the
uncertainty.” PhD thesis. Massachusetts Institute of Technology, 1971.
[45] Dimitris Bertsimas, Colin Pawlowski, and Ying Daisy Zhuo. “From predictive methods to missing
data imputation: an optimization approach”. In: The Journal of Machine Learning Research 18.1
(2017), pp. 7133–7171.
[46] Alex Beutel, Jilin Chen, Tulsee Doshi, Hai Qian, Allison Woodruff, Christine Luu,
Pierre Kreitmann, Jonathan Bischof, and Ed H Chi. “Putting fairness principles into practice:
Challenges, metrics, and improvements”. In: Proceedings of the 2019 AAAI/ACM Conference on AI,
Ethics, and Society. 2019, pp. 453–459.
[47] Jose Blanchet, Yang Kang, and Karthyek Murthy. “Robust Wasserstein profile inference and
applications to machine learning”. In: Journal of Applied Probability 56.3 (2019), pp. 830–857.
[48] Jose Blanchet, Yang Kang, Karthyek Murthy, and Fan Zhang. “Data-driven optimal transport cost
selection for distributionally robust optimization”. In: 2019 winter simulation conference (WSC).
IEEE. 2019, pp. 3740–3751.
[49] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. “Man is
to computer programmer as woman is to homemaker? debiasing word embeddings”. In: Advances
in neural information processing systems. 2016, pp. 4349–4357.
[50] Roberto Boselli, Mirko Cesarini, Fabio Mercorio, and Mario Mezzanzanica. “Classifying online job
advertisements through machine learning”. In: Future Generation Computer Systems 86 (2018),
pp. 319–328.
[51] Léon Bottou. “Stochastic gradient descent tricks”. In: Neural networks: Tricks of the trade.
Springer, 2012, pp. 421–436.
[52] Christian Brownlees, Emilien Joly, and Gábor Lugosi. “Empirical risk minimization for
heavy-tailed losses”. In: The Annals of Statistics 43.6 (2015), pp. 2507–2536.
[53] Sébastien Bubeck. “Convex optimization: Algorithms and complexity”. In: arXiv preprint
arXiv:1405.4980 (2014).
[54] Joy Buolamwini and Timnit Gebru. “Gender shades: Intersectional accuracy disparities in
commercial gender classification”. In: Conference on fairness, accountability and transparency.
PMLR. 2018, pp. 77–91.
[55] Andrew Butler, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. “Integrating
single-cell transcriptomic data across different conditions, technologies, and species”. In: Nature
biotechnology 36.5 (2018), pp. 411–420.
[56] S van Buuren and Karin Groothuis-Oudshoorn. “mice: Multivariate imputation by chained
equations in R”. In: Journal of statistical software (2010), pp. 1–68.
[57] Zhipeng Cai, Maysam Heydari, and Guohui Lin. “Iterated local least squares microarray missing
value imputation”. In: Journal of bioinformatics and computational biology 4.05 (2006), pp. 935–957.
[58] Toon Calders, Faisal Kamiran, and Mykola Pechenizkiy. “Building classifiers with independency
constraints”. In: 2009 IEEE International Conference on Data Mining Workshops. IEEE. 2009,
pp. 13–18.
[59] Flavio Calmon, Ali Makhdoumi, Muriel Médard, Mayank Varia, Mark Christiansen, and
Ken R Duffy. “Principal inertia components and applications”. In: IEEE Transactions on
Information Theory 63.8 (2017), pp. 5011–5038.
[60] Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and
Kush R Varshney. “Optimized pre-processing for discrimination prevention”. In: Advances in
Neural Information Processing Systems. 2017, pp. 3992–4001.
[61] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss,
Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. “Extracting
training data from large language models”. In: 30th USENIX Security Symposium (USENIX Security
21). 2021, pp. 2633–2650.
[62] Nicholas Carlini and David Wagner. “Towards Evaluating the Robustness of Neural Networks”.
In: 2017 IEEE Symposium on Security and Privacy (SP). 2017, pp. 39–57. doi: 10.1109/SP.2017.49.
[63] Nicholas Carlini and David Wagner. “Towards evaluating the robustness of neural networks”. In:
2017 ieee symposium on security and privacy (sp). IEEE. 2017, pp. 39–57.
[64] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. “Lower bounds for finding
stationary points I”. In: Mathematical Programming 184.1 (2020), pp. 71–120.
[65] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. “Lower bounds for finding
stationary points i”. In: Mathematical Programming (2019), pp. 1–50.
[66] Alessandro Castelnovo, Riccardo Crupi, Greta Greco, Daniele Regoli, Ilaria Giuseppina Penco,
and Andrea Claudio Cosentini. “A clarification of the nuances in the fairness metrics landscape”.
In: Scientific Reports 12.1 (2022), p. 4209.
[67] L Elisa Celis, Lingxiao Huang, Vijay Keswani, and Nisheeth K Vishnoi. “Classification with
fairness constraints: A meta-algorithm with provable guarantees”. In: Proceedings of the
Conference on Fairness, Accountability, and Transparency. ACM. 2019, pp. 319–328.
[68] Chao Chen, Kay Grennan, Judith Badner, Dandan Zhang, Elliot Gershon, Li Jin, and Chunyu Liu.
“Removing batch effects in analysis of expression microarray data: an evaluation of six batch
adjustment methods”. In: PloS one 6.2 (2011), e17238.
[69] Jie Chen and Ronny Luss. “Stochastic Gradient Descent with Biased but Consistent Gradient
Estimators”. In: CoRR abs/1807.11880 (2018). eprint: 1807.11880. url:
http://arxiv.org/abs/1807.11880.
[70] Jiefeng Chen, Yixuan Li, Xi Wu, Yingyu Liang, and Somesh Jha. “Robust out-of-distribution
detection for neural networks”. In: arXiv preprint arXiv:2003.09711 (2020).
[71] Jiefeng Chen, Jayaram Raghuram, Jihye Choi, Xi Wu, Yingyu Liang, and Somesh Jha. “Revisiting
adversarial robustness of classifiers with a reject option”. In: The AAAI-22 Workshop on
Adversarial Machine Learning and Beyond. 2021.
[72] Jiuhai Chen, Chen Zhu, and Bin Dai. “Understanding the Role of Self-Supervised Learning in
Out-of-Distribution Detection Task”. In: arXiv preprint arXiv:2110.13435 (2021).
[73] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. “Fair clustering through
fairlets”. In: Advances in Neural Information Processing Systems. 2017, pp. 5029–5037.
[74] Jaewoong Cho, Gyeongjo Hwang, and Changho Suh. “A Fair Classifier Using Kernel Density
Estimation”. In: Advances in Neural Information Processing Systems 33: Annual Conference on
Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. Ed. by
Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin.
2020.
[75] Jaewoong Cho, Gyeongjo Hwang, and Changho Suh. “A fair classifier using mutual information”.
In: 2020 IEEE International Symposium on Information Theory (ISIT). IEEE. 2020, pp. 2521–2526.
[76] Wang Chong, David Blei, and Fei-Fei Li. “Simultaneous image classification and annotation”. In:
2009 IEEE Conference on computer vision and pattern recognition. IEEE. 2009, pp. 1903–1910.
[77] Evgenii Chzhen and Nicolas Schreuder. “A minimax framework for quantifying risk-fairness
trade-off in regression”. In: arXiv preprint arXiv:2007.14265 (2020).
[78] Amanda Coston, Karthikeyan Natesan Ramamurthy, Dennis Wei, Kush R Varshney,
Skyler Speakman, Zairah Mustahsan, and Supriyo Chakraborty. “Fair transfer learning with
missing protected attributes”. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and
Society. 2019, pp. 91–98.
[79] Thomas M Cover and Joy A Thomas. “Information theory and statistics”. In: Elements of
Information Theory 1.1 (1991), pp. 279–335.
[80] Elliot Creager, Jörn-Henrik Jacobsen, and Richard Zemel. “Environment inference for invariant
learning”. In: International Conference on Machine Learning. PMLR. 2021, pp. 2189–2200.
[81] Elliot Creager, David Madras, Jörn-Henrik Jacobsen, Marissa Weis, Kevin Swersky,
Toniann Pitassi, and Richard Zemel. “Flexibly fair representation learning by disentanglement”.
In: International conference on machine learning. PMLR. 2019, pp. 1436–1445.
[82] Imre Csiszár. “Information-type measures of difference of probability distributions and indirect
observation”. In: studia scientiarum Mathematicarum Hungarica 2.4 (1967), pp. 229–318.
[83] Imre Csiszár and Paul C Shields. Information theory and statistics: A tutorial. Now Publishers Inc,
2004.
[84] Rachel Cummings, Varun Gupta, Dhamma Kimpara, and Jamie Morgenstern. “On the
compatibility of privacy and fairness”. In: Adjunct Publication of the 27th Conference on User
Modeling, Adaptation and Personalization. 2019, pp. 309–315.
[85] Jessica Dai and Sarah M Brown. “Label bias, label shift: Fair machine learning with unreliable
labels”. In: NeurIPS 2020 Workshop on Consequential Decision Making in Dynamic Environments.
Vol. 12. 2020.
[86] Peng Dai, Sina Baharlouei, Taojian Tu, Bangyan L Stiles, Meisam Razaviyayn, and
Sze-chuan Suen. “Feature Selection in the Presence of Monotone Batch Effects”. In: ().
[87] John M Danskin. The theory of max-min and its application to weapons allocation problems. Vol. 5.
Springer Science & Business Media, 2012.
[88] John M Danskin. “The theory of max-min, with applications”. In: SIAM Journal on Applied
Mathematics 14.4 (1966), pp. 641–664.
[89] Constantinos Daskalakis, Stratis Skoulakis, and Manolis Zampetakis. “The complexity of
constrained min-max optimization”. In: Proceedings of the 53rd Annual ACM SIGACT Symposium
on Theory of Computing. 2021, pp. 1466–1478.
[90] Sumanth Dathathri, Krishnamurthy Dvijotham, Alexey Kurakin, Aditi Raghunathan,
Jonathan Uesato, Rudy Bunel, Shreya Shankar, Jacob Steinhardt, Ian Goodfellow, Percy Liang,
et al. “Enabling certification of verification-agnostic networks via memory-efficient semidefinite
programming”. In: arXiv preprint arXiv:2010.11645 (2020).
[91] Amit Datta, Michael Carl Tschantz, and Anupam Datta. “Automated experiments on ad privacy
settings”. In: Proceedings on privacy enhancing technologies 2015.1 (2015), pp. 92–112.
[92] James Davidson, Benjamin Liebald, Junning Liu, Palash Nandy, Taylor Van Vleet, Ullas Gargi,
Sujoy Gupta, Yu He, Mike Lambert, Blake Livingston, et al. “The YouTube video recommendation
system”. In: Proceedings of the fourth ACM conference on Recommender systems. 2010, pp. 293–296.
[93] Erick Delage and Yinyu Ye. “Distributionally robust optimization under moment uncertainty with
application to data-driven problems”. In: Operations research 58.3 (2010), pp. 595–612.
[94] A. Dembo and O. Zeitouni. Large deviations techniques and applications. Springer Science &
Business Media, 2009.
[95] Arthur P Dempster, Nan M Laird, and Donald B Rubin. “Maximum likelihood from incomplete
data via the EM algorithm”. In: Journal of the Royal Statistical Society: Series B (Methodological)
39.1 (1977), pp. 1–22.
[96] Yuyang Deng, Mohammad Mahdi Kamani, Pouria Mahdavinia, and Mehrdad Mahdavi.
“Distributed Personalized Empirical Risk Minimization”. In: International Workshop on Federated
Learning for Distributed Data Mining. 2023.
[97] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “Bert: Pre-training of deep
bidirectional transformers for language understanding”. In: arXiv preprint arXiv:1810.04805 (2018).
[98] Terrance DeVries and Graham W Taylor. “Learning confidence for out-of-distribution detection
in neural networks”. In: arXiv preprint arXiv:1802.04865 (2018).
[99] Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. “Retiring adult: New datasets for
fair machine learning”. In: Advances in Neural Information Processing Systems 34 (2021),
pp. 6478–6490.
[100] Michele Donini, Luca Oneto, Shai Ben-David, John S Shawe-Taylor, and Massimiliano Pontil.
“Empirical risk minimization under fairness constraints”. In: Advances in Neural Information
Processing Systems. 2018, pp. 2791–2801.
[101] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al.
“An image is worth 16x16 words: Transformers for image recognition at scale”. In: arXiv preprint
arXiv:2010.11929 (2020).
[102] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. “Gradient descent finds global
minima of deep neural networks”. In: International conference on machine learning. PMLR. 2019,
pp. 1675–1685.
[103] Wei Du and Xintao Wu. “Fair and robust classification under sample selection bias”. In:
Proceedings of the 30th ACM International Conference on Information & Knowledge Management.
2021, pp. 2999–3003.
[104] Dheeru Dua and Casey Graff. UCI Machine Learning Repository. 2017. url:
http://archive.ics.uci.edu/ml.
[105] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. “Fairness
through awareness”. In: Proceedings of the 3rd innovations in theoretical computer science
conference. 2012, pp. 214–226.
[106] Cynthia Dwork, Nicole Immorlica, Adam Tauman Kalai, and Max Leiserson. “Decoupled
classifiers for group-fair and efficient machine learning”. In: Conference on Fairness,
Accountability and Transparency. 2018, pp. 119–133.
[107] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. “Calibrating noise to sensitivity
in private data analysis”. In: Theory of cryptography conference. Springer. 2006, pp. 265–284.
[108] Harrison Edwards and Amos Storkey. “Censoring representations with an adversary”. In: arXiv
preprint arXiv:1511.05897 (2015).
[109] Gideon Elford. “Equality of Opportunity”. In: (2023). Ed. by Edward N. Zalta and Uri Nodelman.
[110] Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao,
Atul Prakash, Tadayoshi Kohno, and Dawn Song. “Robust physical-world attacks on deep
learning visual classification”. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. 2018, pp. 1625–1634.
[111] Irfan Faizullabhoy and Aleksandra Korolova. “Facebook’s advertising platform: New attack
vectors and the need for interventions”. In: arXiv preprint arXiv:1803.10099 (2018).
[112] Tongtong Fang, Nan Lu, Gang Niu, and Masashi Sugiyama. “Rethinking importance weighting
for deep learning under distribution shift”. In: Advances in Neural Information Processing Systems
33 (2020), pp. 11996–12007.
[113] Zhao-Yu Fang, Cui-Xiang Lin, Yun-Pei Xu, Hong-Dong Li, and Qing-Song Xu. “REBET: a method
to determine the number of cell clusters based on batch effect removal”. In: Briefings in
Bioinformatics 22.6 (2021), bbab204.
[114] Mahyar Fazlyab, Alexander Robey, Hamed Hassani, Manfred Morari, and George J Pappas.
“Efficient and accurate estimation of lipschitz constants for deep neural networks”. In: arXiv
preprint arXiv:1906.04893 (2019).
[115] Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and
Suresh Venkatasubramanian. “Certifying and removing disparate impact”. In: Proceedings of the
21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. 2015,
pp. 259–268.
[116] Benjamin Fish, Jeremy Kun, and Ádám D Lelkes. “A confidence-based approach for balancing
fairness and accuracy”. In: Proceedings of the 2016 SIAM International Conference on Data Mining.
SIAM. 2016, pp. 144–152.
[117] Benjamin Fish, Jeremy Kun, and Adám D Lelkes. “Fair boosting: a case study”. In: Workshop on
Fairness, Accountability, and Transparency in Machine Learning. Citeseer. 2015.
[118] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. “Sharpness-aware
minimization for efficiently improving generalization”. In: arXiv preprint arXiv:2010.01412 (2020).
[119] Stanislav Fort. “Adversarial vulnerability of powerful near out-of-distribution detection”. In:
arXiv preprint arXiv:2201.07012 (2022).
[120] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. “Model inversion attacks that exploit
confidence information and basic countermeasures”. In: Proceedings of the 22nd ACM SIGSAC
Conference on Computer and Communications Security. 2015, pp. 1322–1333.
[121] Karen Yuen Fung and Barbara A Wrobel. “The treatment of missing values in logistic regression”.
In: Biometrical Journal 31.1 (1989), pp. 35–47.
[122] Daniel Gabay and Bertrand Mercier. “A dual algorithm for the solution of nonlinear variational
problems via finite element approximation”. In: Computers & mathematics with applications 2.1
(1976), pp. 17–40.
[123] Rui Gao and Anton J Kleywegt. “Distributionally robust stochastic optimization with dependence
structure”. In: arXiv preprint arXiv:1701.04200 (2017).
[124] Hans Gebelein. “Das statistische Problem der Korrelation als Variations-und Eigenwertproblem
und sein Zusammenhang mit der Ausgleichsrechnung”. In: ZAMM-Journal of Applied
Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik 21.6 (1941),
pp. 364–379.
[125] Zoubin Ghahramani and Michael I Jordan. “Supervised learning from incomplete data via an EM
approach”. In: Advances in neural information processing systems. 1994, pp. 120–127.
[126] Stephen Giguere, Blossom Metevier, Bruno Castro da Silva, Yuriy Brun, Philip S Thomas, and
Scott Niekum. “Fairness Guarantees under Demographic Shift”. In: International Conference on
Learning Representations. 2021.
[127] Micah Goldblum, Liam Fowl, Soheil Feizi, and Tom Goldstein. “Adversarially robust distillation”.
In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 2020, pp. 3996–4003.
[128] Lovedeep Gondara and Ke Wang. “MIDA: Multiple imputation using denoising autoencoders”. In:
Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer. 2018, pp. 260–272.
[129] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. “Explaining and harnessing
adversarial examples”. In: arXiv preprint arXiv:1412.6572 (2014).
[130] Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin,
Jonathan Uesato, Relja Arandjelovic, Timothy Mann, and Pushmeet Kohli. “On the effectiveness
of interval bound propagation for training verifiably robust models”. In: arXiv preprint
arXiv:1810.12715 (2018).
[131] Vincent Grari, Oualid El Hajouji, Sylvain Lamprier, and Marcin Detyniecki. “Learning Unbiased
Representations via Rényi Minimization”. In: arXiv preprint arXiv:2009.03183 (2020).
[132] Vincent Grari, Sylvain Lamprier, and Marcin Detyniecki. “Fairness-Aware Neural Rényi
Minimization for Continuous Features”. In: Twenty-Ninth International Joint Conference on
Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence
{IJCAI-PRICAI-20}. International Joint Conferences on Artificial Intelligence Organization. 2020,
pp. 2262–2268.
[133] Vincent Grari, Boris Ruf, Sylvain Lamprier, and Marcin Detyniecki. “Fairness-Aware Neural
Réyni Minimization for Continuous Features”. In: arXiv preprint arXiv:1911.04929 (2019).
[134] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. “Speech recognition with deep
recurrent neural networks”. In: 2013 IEEE international conference on acoustics, speech and signal
processing. Ieee. 2013, pp. 6645–6649.
[135] Arthur Gretton, Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex Smola. “A kernel
method for the two-sample-problem”. In: Advances in neural information processing systems 19
(2006).
[136] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola.
“A kernel two-sample test”. In: The Journal of Machine Learning Research 13.1 (2012), pp. 723–773.
[137] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. “Measuring statistical
dependence with Hilbert-Schmidt norms”. In: International conference on algorithmic learning
theory. Springer. 2005, pp. 63–77.
[138] Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, and Bernhard Schölkopf.
“Kernel methods for measuring independence”. In: Journal of Machine Learning Research 6.Dec
(2005), pp. 2075–2129.
[139] David H Gutman and Javier F Pena. “A unified framework for Bregman proximal methods:
subgradient, gradient, and accelerated gradient schemes”. In: arXiv preprint arXiv:1812.10198
(2018).
[140] Laleh Haghverdi, Aaron TL Lun, Michael D Morgan, and John C Marioni. “Batch effects in
single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors”. In: Nature
biotechnology 36.5 (2018), pp. 421–427.
[141] Moritz Hardt, Eric Price, Nati Srebro, et al. “Equality of opportunity in supervised learning”. In:
Advances in neural information processing systems. 2016, pp. 3315–3323.
[142] Bingsheng He and Xiaoming Yuan. “On non-ergodic convergence rate of Douglas–Rachford
alternating direction method of multipliers”. In: Numerische Mathematik 130.3 (2015), pp. 567–577.
[143] Zengyou He and Weichuan Yu. “Stable feature selection for biomarker discovery”. In:
Computational biology and chemistry 34.4 (2010), pp. 215–225.
[144] Dan Hendrycks and Kevin Gimpel. “A baseline for detecting misclassified and out-of-distribution
examples in neural networks”. In: arXiv preprint arXiv:1610.02136 (2016).
[145] Jonathan L Herlocker, Joseph A Konstan, Loren G Terveen, and John T Riedl. “Evaluating
collaborative filtering recommender systems”. In: ACM Transactions on Information Systems
(TOIS) 22.1 (2004), pp. 5–53.
[146] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. “Neural networks for machine learning
lecture 6a overview of mini-batch gradient descent”. In: Cited on 14.8 (2012), p. 2.
[147] Hermann O Hirschfeld. “A connection between correlation and contingency”. In: Mathematical
Proceedings of the Cambridge Philosophical Society. Vol. 31. 4. Cambridge University Press. 1935,
pp. 520–524.
[148] James Honaker, Gary King, and Matthew Blackwell. “Amelia II: A program for missing data”. In:
Journal of statistical software 45.7 (2011), pp. 1–47.
[149] Mingyi Hong, Zhi-Quan Luo, and Meisam Razaviyayn. “Convergence analysis of alternating
direction method of multipliers for a family of nonconvex problems”. In: SIAM Journal on
Optimization 26.1 (2016), pp. 337–364.
[150] Max Hort, Zhenpeng Chen, Jie M Zhang, Federica Sarro, and Mark Harman. “Bias Mitigation for Machine Learning Classifiers: A Comprehensive Survey”. In: arXiv preprint arXiv:2207.07068 (2022).
[151] Miklós Z Horváth, Mark Niklas Müller, Marc Fischer, and Martin Vechev. “Robust and
Accurate–Compositional Architectures for Randomized Smoothing”. In: arXiv preprint
arXiv:2204.00487 (2022).
[152] Tianjian Huang, Shaunak Halbe, Chinnadhurai Sankar, Pooyan Amini, Satwik Kottur,
Alborz Geramifard, Meisam Razaviyayn, and Ahmad Beirami. “Dair: Data augmented invariant
regularization”. In: arXiv preprint arXiv:2110.11205 (2021).
[153] Hisham Husain. “Distributional Robustness with IPMs and links to Regularization and GANs”. In:
Advances in Neural Information Processing Systems. Ed. by H. Larochelle, M. Ranzato, R. Hadsell,
M.F. Balcan, and H. Lin. Vol. 33. Curran Associates, Inc., 2020, pp. 11816–11827. url:
https://proceedings.neurips.cc/paper_files/paper/2020/file/8929c70f8d710e412d38da624b21c3c8-
Paper.pdf.
[154] Aleksandr Davidovich Ioffe and Vladimir Mihajlovič Tihomirov. Theory of extremal problems.
Elsevier, 2009.
[155] Matthew Jagielski, Michael Kearns, Jieming Mao, Alina Oprea, Aaron Roth,
Saeed Sharifi-Malvajerdi, and Jonathan Ullman. “Differentially private fair learning”. In:
International Conference on Machine Learning. PMLR. 2019, pp. 3000–3008.
[156] Yunseok Jang, Tianchen Zhao, Seunghoon Hong, and Honglak Lee. “Adversarial defense via
learning to generate diverse attacks”. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. 2019, pp. 2740–2749.
[157] Ray Jiang, Aldo Pacchiano, Tom Stepleton, Heinrich Jiang, and Silvia Chiappa. “Wasserstein fair
classification”. In: Uncertainty in Artificial Intelligence. PMLR. 2020, pp. 862–872.
[158] Chi Jin, Praneeth Netrapalli, and Michael Jordan. “What is local optimality in
nonconvex-nonconcave minimax optimization?” In: International conference on machine learning.
PMLR. 2020, pp. 4880–4889.
[159] Chi Jin, Praneeth Netrapalli, and Michael I Jordan. “Minmax Optimization: Stable Limit Points of
Gradient Descent Ascent are Locally Optimal”. In: arXiv preprint arXiv:1902.00618 (2019).
[160] Taotao Jing, Bingrong Xu, and Zhengming Ding. “Towards fair knowledge transfer for imbalanced
domain adaptation”. In: IEEE Transactions on Image Processing 30 (2021), pp. 8200–8211.
[161] W Evan Johnson, Cheng Li, and Ariel Rabinovic. “Adjusting batch effects in microarray
expression data using empirical Bayes methods”. In: Biostatistics 8.1 (2007), pp. 118–127.
[162] Alan Jović, Karla Brkić, and Nikola Bogunović. “A review of feature selection methods with
applications”. In: 2015 38th international convention on information and communication technology,
electronics and microelectronics (MIPRO). Ieee. 2015, pp. 1200–1205.
[163] Satyen Kale, Jason D Lee, Chris De Sa, Ayush Sekhari, and Karthik Sridharan. “From Gradient
Flow on Population Loss to Learning with Stochastic Gradient Descent”. In: arXiv preprint
arXiv:2210.06705 (2022).
[164] Faisal Kamiran and Toon Calders. “Classification with no discrimination by preferential
sampling”. In: Proc. 19th Machine Learning Conf. Belgium and The Netherlands. Citeseer. 2010,
pp. 1–6.
[165] Faisal Kamiran and Toon Calders. “Classifying without discriminating”. In: 2009 2nd International
Conference on Computer, Control and Communication. IEEE. 2009, pp. 1–6.
[166] Faisal Kamiran and Toon Calders. “Data preprocessing techniques for classification without
discrimination”. In: Knowledge and Information Systems 33.1 (2012), pp. 1–33.
[167] Faisal Kamiran, Toon Calders, and Mykola Pechenizkiy. “Discrimination aware decision tree
learning”. In: 2010 IEEE International Conference on Data Mining. IEEE. 2010, pp. 869–874.
[168] Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. “Fairness-aware learning through
regularization approach”. In: 2011 IEEE 11th International Conference on Data Mining Workshops.
IEEE. 2011, pp. 643–650.
[169] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. “Preventing fairness
gerrymandering: Auditing and learning for subgroup fairness”. In: arXiv preprint arXiv:1711.05144
(2017).
[170] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. “Preventing fairness
gerrymandering: Auditing and learning for subgroup fairness”. In: International Conference on
Machine Learning. 2018, pp. 2564–2572.
[171] Mohammad Khalafi and Digvijay Boob. “Accelerated Primal-Dual Methods for
Convex-Strongly-Concave Saddle Point Problems”. In: Proceedings of the 40th International
Conference on Machine Learning. Ed. by Andreas Krause, Emma Brunskill, Kyunghyun Cho,
Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett. Vol. 202. Proceedings of Machine
Learning Research. PMLR, 2023, pp. 16250–16270.
[172] Niki Kilbertus, Manuel Gomez Rodriguez, Bernhard Schölkopf, Krikamol Muandet, and
Isabel Valera. “Fair Decisions Despite Imperfect Predictions”. In: Proceedings of the Twenty Third
International Conference on Artificial Intelligence and Statistics. Ed. by Silvia Chiappa and
Roberto Calandra. Vol. 108. Proceedings of Machine Learning Research. PMLR, 2020, pp. 277–287.
[173] Hyunsoo Kim, Gene H Golub, and Haesun Park. “Missing value estimation for DNA microarray
gene expression data: local least squares imputation”. In: Bioinformatics 21.2 (2005), pp. 187–198.
[174] Michael P Kim, Amirata Ghorbani, and James Zou. “Multiaccuracy: Black-box post-processing for
fairness in classification”. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and
Society. 2019, pp. 247–254.
[175] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv
preprint arXiv:1412.6980 (2014).
[176] Weiwei Kong and Renato DC Monteiro. “An accelerated inexact proximal point method for
solving nonconvex-concave min-max problems”. In: SIAM Journal on Optimization 31.4 (2021),
pp. 2558–2585.
[177] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep
convolutional neural networks”. In: Advances in neural information processing systems 25 (2012).
[178] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep
convolutional neural networks”. In: Communications of the ACM 60.6 (2017), pp. 84–90.
[179] Daniel Kuhn, Peyman Mohajerin Esfahani, Viet Anh Nguyen, and Soroosh Shafieezadeh-Abadeh.
“Wasserstein distributionally robust optimization: Theory and applications in machine learning”.
In: Operations research & management science in the age of analytics. Informs, 2019, pp. 130–166.
[180] Preethi Lahoti, Alex Beutel, Jilin Chen, Kang Lee, Flavien Prost, Nithum Thain, Xuezhi Wang, and
Ed Chi. “Fairness without demographics through adversarially reweighted learning”. In: Advances
in neural information processing systems 33 (2020), pp. 728–740.
[181] Cassidy Laidlaw and Soheil Feizi. “Playing it safe: Adversarial robustness with an abstain option”.
In: arXiv preprint arXiv:1911.11253 (2019).
[182] Justin Lakkis, David Wang, Yuanchao Zhang, Gang Hu, Kui Wang, Huize Pan, Lyle Ungar,
Muredach P Reilly, Xiangjie Li, and Mingyao Li. “A joint deep learning model enables
simultaneous batch effect correction, denoising, and clustering in single-cell transcriptomics”. In:
Genome research 31.10 (2021), pp. 1753–1766.
[183] Cosmin Lazar, Stijn Meganck, Jonatan Taminau, David Steenhoff, Alain Coletta, Colin Molter,
David Y Weiss-Solís, Robin Duque, Hugues Bersini, and Ann Nowé. “Batch effect removal
methods for microarray gene expression data integration: a survey”. In: Briefings in bioinformatics
14.4 (2013), pp. 469–490.
[184] Tosca Lechner, Shai Ben-David, Sushant Agarwal, and Nivasini Ananthakrishnan. “Impossibility
results for fair representations”. In: arXiv preprint arXiv:2107.03483 (2021).
[185] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. “Gradient-based learning applied to document
recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324. doi: 10.1109/5.726791.
[186] Jason D Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I Jordan, and
Benjamin Recht. “First-order methods almost always avoid strict saddle points”. In: Mathematical
programming 176.1 (2019), pp. 311–337.
[187] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. “Gradient descent only
converges to minimizers”. In: Conference on learning theory. PMLR. 2016, pp. 1246–1257.
[188] Jeffrey T Leek. “Svaseq: removing batch effects and other unwanted noise from sequencing data”.
In: Nucleic acids research 42.21 (2014), e161–e161.
[189] Jeffrey T Leek, Robert B Scharpf, Héctor Corrada Bravo, David Simcha, Benjamin Langmead,
W Evan Johnson, Donald Geman, Keith Baggerly, and Rafael A Irizarry. “Tackling the widespread
and critical impact of batch effects in high-throughput data”. In: Nature Reviews Genetics 11.10
(2010), pp. 733–739.
[190] Daniel Levy, Yair Carmon, John C Duchi, and Aaron Sidford. “Large-scale methods for
distributionally robust optimization”. In: Advances in Neural Information Processing Systems 33
(2020), pp. 8847–8860.
[191] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed,
Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. “Bart: Denoising sequence-to-sequence
pre-training for natural language generation, translation, and comprehension”. In: arXiv preprint
arXiv:1910.13461 (2019).
[192] Jiajin Li, Linglingzhi Zhu, and Anthony Man-Cho So. Nonsmooth Nonconvex-Nonconcave Minimax
Optimization: Primal-Dual Balancing and Iteration Complexity Analysis. 2023. arXiv: 2209.10825
[math.OC].
[193] Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. “Tilted empirical risk minimization”.
In: arXiv preprint arXiv:2007.01162 (2020).
[194] Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. “Ditto: Fair and robust federated
learning through personalization”. In: International conference on machine learning. PMLR. 2021,
pp. 6357–6368.
[195] Tian Li, Maziar Sanjabi, Ahmad Beirami, and Virginia Smith. “Fair resource allocation in
federated learning”. In: arXiv preprint arXiv:1905.10497 (2019).
[196] Xiangjie Li, Kui Wang, Yafei Lyu, Huize Pan, Jingxiao Zhang, Dwight Stambolian,
Katalin Susztak, Muredach P Reilly, Gang Hu, and Mingyao Li. “Deep learning enables accurate
clustering with batch effect removal in single-cell RNA-seq analysis”. In: Nature communications
11.1 (2020), pp. 1–14.
[197] Yi Li and Nuno Vasconcelos. “REPAIR: Removing representation bias by dataset resampling”. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019,
pp. 9572–9581.
[198] Zeju Li, Konstantinos Kamnitsas, and Ben Glocker. “Overfitting of neural nets under class
imbalance: Analysis and improvements for segmentation”. In: International Conference on Medical
Image Computing and Computer-Assisted Intervention. Springer. 2019, pp. 402–410.
[199] Shiyu Liang, Yixuan Li, and Rayadurgam Srikant. “Enhancing the reliability of out-of-distribution
image detection in neural networks”. In: arXiv preprint arXiv:1706.02690 (2017).
[200] Jiachun Liao, Oliver Kosut, Lalitha Sankar, and Flavio du Pin Calmon. “Tunable Measures for
Information Leakage and Applications to Privacy-Utility Tradeoffs”. In: IEEE Transactions on
Information Theory (2019).
[201] Serena G Liao, Yan Lin, Dongwan D Kang, Divay Chandra, Jessica Bon, Naftali Kaminski,
Frank C Sciurba, and George C Tseng. “Missing value imputation in high-dimensional phenomic
data: imputable or not, and how?” In: BMC bioinformatics 15.1 (2014), pp. 1–12.
[202] Fengming Lin, Xiaolei Fang, and Zheming Gao. “Distributionally Robust Optimization: A review
on theory and applications”. In: Numerical Algebra, Control & Optimization 12.1 (2022), p. 159.
[203] Tianyi Lin, Chi Jin, and Michael Jordan. “On Gradient Descent Ascent for Nonconvex-Concave
Minimax Problems”. In: Proceedings of the 37th International Conference on Machine Learning.
Ed. by Hal Daumé III and Aarti Singh. Vol. 119. Proceedings of Machine Learning Research.
PMLR, 2020, pp. 6083–6093. url: https://proceedings.mlr.press/v119/lin20a.html.
[204] Tianyi Lin, Chi Jin, and Michael I. Jordan. “On Gradient Descent Ascent for Nonconvex-Concave
Minimax Problems”. In: arXiv: 1906.00331v6 (2020).
[205] Roderick JA Little and Donald B Rubin. Statistical analysis with missing data. Vol. 793. John Wiley
& Sons, 2019.
[206] Qin Liu, Douglas Walker, Karan Uppal, Zihe Liu, Chunyu Ma, ViLinh Tran, Shuzhao Li,
Dean P Jones, and Tianwei Yu. “Addressing the batch effect issue for LC/MS metabolomics data in
data preprocessing”. In: Scientific reports 10.1 (2020), pp. 1–13.
[207] Pranay K Lohia, Karthikeyan Natesan Ramamurthy, Manish Bhide, Diptikalyan Saha,
Kush R Varshney, and Ruchir Puri. “Bias mitigation post-processing for individual and group
fairness”. In: Icassp 2019-2019 ieee international conference on acoustics, speech and signal
processing (icassp). IEEE. 2019, pp. 2847–2851.
[208] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. “The variational fair
autoencoder”. In: arXiv preprint arXiv:1511.00830 (2015).
[209] Michael I Love, Wolfgang Huber, and Simon Anders. “Moderated estimation of fold change and
dispersion for RNA-seq data with DESeq2”. In: Genome biology 15.12 (2014), pp. 1–21.
[210] Andrew Lowy, Sina Baharlouei, Rakesh Pavan, Meisam Razaviyayn, and Ahmad Beirami. “A
stochastic optimization framework for fair risk minimization”. In: tmlr (2022).
[211] Andrew Lowy and Meisam Razaviyayn. “Output Perturbation for Differentially Private Convex
Optimization with Improved Population Loss Bounds, Runtimes and Applications to Private
Adversarial Training”. In: arXiv preprint arXiv:2102.04704 (2021).
[212] Andrew Lowy and Meisam Razaviyayn. “Private federated learning without a trusted server:
Optimal algorithms for convex losses”. In: arXiv preprint arXiv:2106.09779 (2021).
[213] Jingyue Lu and M Pawan Kumar. “Neural network branching for neural network verification”. In:
arXiv preprint arXiv:1912.01329 (2019).
[214] Yiwei Lu, Guojun Zhang, Sun Sun, Hongyu Guo, and Yaoliang Yu. “$f$-MICL: Understanding and
Generalizing InfoNCE-based Contrastive Learning”. In: Transactions on Machine Learning
Research (2023). issn: 2835-8856. url: https://openreview.net/forum?id=ZD03VUZmRx.
[215] Yiwei Lu, Guojun Zhang, Sun Sun, Hongyu Guo, and Yaoliang Yu. “f-MICL: Understanding and
Generalizing InfoNCE-based Contrastive Learning”. In: arXiv preprint arXiv:2402.10150 (2024).
[216] Luo Luo, Haishan Ye, Zhichao Huang, and Tong Zhang. “Stochastic recursive gradient descent
ascent for stochastic nonconvex-strongly-concave minimax problems”. In: Advances in Neural
Information Processing Systems 33 (2020), pp. 20566–20577.
[217] Luo Luo, Haishan Ye, and Tony Zhang. “Stochastic Recursive Gradient Descent Ascent for
Stochastic Nonconvex-Strongly-Concave Minimax Problems”. In: Advances in Neural Information
Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS
2020, December 6-12, 2020, virtual. 2020.
[218] David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. “Learning adversarially fair
and transferable representations”. In: arXiv preprint arXiv:1802.06309 (2018).
[219] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu.
“Towards deep learning models resistant to adversarial attacks”. In: arXiv preprint
arXiv:1706.06083 (2017).
[220] Subha Maity, Debarghya Mukherjee, Mikhail Yurochkin, and Yuekai Sun. “Does enforcing
fairness mitigate biases caused by subpopulation shift?” In: Advances in Neural Information
Processing Systems 34 (2021), pp. 25773–25784.
[221] Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and
Sanjeev Arora. “Fine-Tuning Language Models with Just Forward Passes”. In: arXiv preprint
arXiv:2305.17333 (2023).
[222] Jeremie Mary, Clément Calauzènes, and Noureddine El Karoui. “Fairness-Aware Learning for
Continuous Attributes and Treatments”. In: Proceedings of the 36th International Conference on
Machine Learning. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of
Machine Learning Research. PMLR, 2019, pp. 4382–4391. url:
https://proceedings.mlr.press/v97/mary19a.html.
[223] Jérémie Mary, Clément Calauzenes, and Noureddine El Karoui. “Fairness-aware learning for
continuous attributes and treatments”. In: International Conference on Machine Learning. PMLR.
2019, pp. 4382–4391.
[224] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. “A
survey on bias and fairness in machine learning”. In: ACM Computing Surveys (CSUR) 54.6 (2021),
pp. 1–35.
[225] Aditya Krishna Menon and Robert C Williamson. “The cost of fairness in binary classification”.
In: Conference on Fairness, Accountability and Transparency. 2018, pp. 107–118.
[226] Alan Mishler and Niccolò Dalmasso. “Fair When Trained, Unfair When Deployed: Observable
Fairness Measures are Unstable in Performative Prediction Settings”. In: arXiv preprint
arXiv:2202.05049 (2022).
[227] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. “Universal
adversarial perturbations”. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. 2017, pp. 1765–1773.
[228] Jose G Moreno-Torres, Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V Chawla, and
Francisco Herrera. “A unifying view on dataset shift in classification”. In: Pattern recognition 45.1
(2012), pp. 521–530.
[229] Hussein Mozannar, Mesrob Ohannessian, and Nathan Srebro. “Fair learning with private
demographic data”. In: International Conference on Machine Learning. PMLR. 2020, pp. 7066–7075.
[230] Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. “Learning theory: stability is
sufficient for generalization and necessary and sufficient for consistency of empirical risk
minimization”. In: Advances in Computational Mathematics 25.1 (2006), pp. 161–193.
[231] Mark Niklas Müller, Mislav Balunović, and Martin Vechev. “Certify or predict: Boosting certified
robustness with compositional architectures”. In: International Conference on Learning
Representations (ICLR 2021). OpenReview. 2021.
[232] Katta G Murty and Santosh N Kabadi. “Some NP-complete problems in quadratic and nonlinear
programming”. In: (1985).
[233] Arvind Narayanan. “Translation tutorial: 21 fairness definitions and their politics”. In: Proc. Conf.
Fairness Accountability Transp., New York, USA. Vol. 1170. 2018, p. 3.
[234] Milad Nasr, Reza Shokri, and Amir Houmansadr. “Comprehensive privacy analysis of deep
learning: Passive and active white-box inference attacks against centralized and federated
learning”. In: 2019 IEEE symposium on security and privacy (SP). IEEE. 2019, pp. 739–753.
[235] Ali Bou Nassif, Ismail Shahin, Imtinan Attili, Mohammad Azzeh, and Khaled Shaalan. “Speech
recognition using deep neural networks: A systematic review”. In: IEEE access 7 (2019),
pp. 19143–19165.
[236] Amirmohammad Nazari, Yifei Huang, Roopsha Samanta, Arjun Radhakrishna, and
Mukund Raghothaman. “Explainable Program Synthesis by Localizing Specifications”. In:
Proceedings of the ACM on Programming Languages (2023).
[237] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. “Robust stochastic
approximation approach to stochastic programming”. In: SIAM Journal on optimization 19.4
(2009), pp. 1574–1609.
[238] Yurii Nesterov. Introductory lectures on convex optimization: A basic course. Vol. 87. Springer
Science & Business Media, 2013.
[239] Yurii Nesterov. Lectures on convex optimization. Vol. 137. Springer, 2018.
[240] Yurii E Nesterov. “A method for solving the convex programming problem with convergence rate
O(1/k^2)”. In: Dokl. Akad. Nauk SSSR. Vol. 269. 1983, pp. 543–547.
[241] Alexandru Niculescu-Mizil and Rich Caruana. “Predicting good probabilities with supervised
learning”. In: Proceedings of the 22nd international conference on Machine learning. 2005,
pp. 625–632.
[242] Jingyang Niu, Jing Yang, Yuyu Guo, Kun Qian, and Qian Wang. “Joint deep learning for batch
effect removal and classification toward MALDI MS based metabolomics”. In: BMC bioinformatics
23.1 (2022), pp. 1–19.
[243] Maher Nouiehed and Meisam Razaviyayn. “Learning deep models: Critical points and local
openness”. In: INFORMS Journal on Optimization (2021).
[244] Maher Nouiehed, Maziar Sanjabi, Jason D Lee, and Meisam Razaviyayn. “Solving a Class of
Non-Convex Min-Max Games Using Iterative First Order Methods”. In: arXiv preprint
arXiv:1902.08297 (2019).
[245] Luca Oneto, Michele Donini, Giulia Luise, Carlo Ciliberto, Andreas Maurer, and
Massimiliano Pontil. “Exploiting mmd and sinkhorn divergences for fair and transferable
representation learning”. In: Advances in Neural Information Processing Systems 33 (2020),
pp. 15360–15370.
[246] Dmitrii M Ostrovskii, Babak Barazandeh, and Meisam Razaviyayn. “Nonconvex-nonconcave
min-max optimization with a small maximization domain”. In: arXiv preprint arXiv:2110.03950
(2021).
[247] Dmitrii M Ostrovskii, Andrew Lowy, and Meisam Razaviyayn. “Efficient search of first-order
nash equilibria in nonconvex-concave smooth min-max problems”. In: arXiv preprint
arXiv:2002.07919 (2020).
[248] Dmitrii M Ostrovskii, Andrew Lowy, and Meisam Razaviyayn. “Efficient search of first-order
nash equilibria in nonconvex-concave smooth min-max problems”. In: SIAM Journal on
Optimization 31.4 (2021), pp. 2508–2538.
[249] Liwen Ouyang and Aaron Key. “Maximum Mean Discrepancy for Generalization in the Presence
of Distribution and Missingness Shift”. In: arXiv preprint arXiv:2111.10344 (2021).
[250] Constantine Papageorgiou and Tomaso Poggio. “A trainable system for object detection”. In:
International journal of computer vision 38.1 (2000), pp. 15–33.
[251] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. “Distillation as
a defense to adversarial perturbations against deep neural networks”. In: 2016 IEEE symposium on
security and privacy (SP). IEEE. 2016, pp. 582–597.
[252] Chandra Shekhar Pareek, Rafal Smoczynski, and Andrzej Tretyn. “Sequencing technologies and
genome sequencing”. In: Journal of applied genetics 52.4 (2011), pp. 413–435.
[253] AB Pedersen, EM Mikkelsen, D Cronin-Fenton, NR Kristensen, TM Pham, L Pedersen, et al.
“Missing data and multiple imputation in clinical epidemiological research”. In: Clinical
Epidemiology 9 (2017), pp. 157–166.
[254] Adrián Pérez-Suay, Valero Laparra, Gonzalo Mateo-García, Jordi Muñoz-Marí,
Luis Gómez-Chova, and Gustau Camps-Valls. “Fair kernel learning”. In: Joint European Conference
on Machine Learning and Knowledge Discovery in Databases. Springer. 2017, pp. 339–355.
[255] John Platt et al. “Probabilistic outputs for support vector machines and comparisons to
regularized likelihood methods”. In: Advances in large margin classifiers 10.3 (1999), pp. 61–74.
[256] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. “On fairness
and calibration”. In: Advances in Neural Information Processing Systems. 2017, pp. 5680–5689.
[257] Boris T Polyak. “New stochastic approximation type procedures”. In: Automat. i Telemekh 7
(1990), pp. 98–107.
[258] Yury Polyanskiy and Yihong Wu. “Information theory: From coding to learning”. In: Book draft
(2022).
[259] Flavien Prost, Hai Qian, Qiuwen Chen, Ed H Chi, Jilin Chen, and Alex Beutel. “Toward a better
trade-off between performance and fairness with kernel-based distribution matching”. In: arXiv
preprint arXiv:1910.11779 (2019).
[260] Novi Quadrianto and Viktoriia Sharmanska. “Recycling privileged learning and distribution
matching for fairness”. In: Advances in Neural Information Processing Systems 30 (2017).
[261] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
“Language models are unsupervised multitask learners”. In: OpenAI blog 1.8 (2019), p. 9.
[262] Edward Raff and Jared Sylvester. “Gradient Reversal Against Discrimination: A Fair Neural
Network Learning Approach”. In: 2018 IEEE 5th International Conference on Data Science and
Advanced Analytics (DSAA). IEEE. 2018, pp. 189–198.
[263] Edward Raff, Jared Sylvester, and Steven Mills. “Fair forests: Regularized tree induction to
minimize model bias”. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society.
2018, pp. 243–250.
[264] H Rafique, M Liu, Q Lin, and T Yang. “Non-convex min–max optimization: provable algorithms
and applications in machine learning”. In: arXiv preprint arXiv:1810.02060 (2018).
[265] Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. “Non-convex min-max
optimization: Provable algorithms and applications in machine learning”. In: arXiv preprint
arXiv:1810.02060 (2018).
[266] Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. “Weakly-convex–concave min–max
optimization: provable algorithms and applications in machine learning”. In: Optimization
Methods and Software 37.3 (2022), pp. 1087–1121.
[267] Trivellore E Raghunathan, James M Lepkowski, John Van Hoewyk, Peter Solenberger, et al. “A
multivariate technique for multiply imputing missing values using a sequence of regression
models”. In: Survey methodology 27.1 (2001), pp. 85–96.
[268] Hamed Rahimian and Sanjay Mehrotra. “Distributionally robust optimization: A review”. In:
arXiv preprint arXiv:1908.05659 (2019).
[269] Meisam Razaviyayn, Tianjian Huang, Songtao Lu, Maher Nouiehed, Maziar Sanjabi, and
Mingyi Hong. “Nonconvex Min-Max Optimization: Applications, Challenges, and Recent
Theoretical Advances”. In: IEEE Signal Processing Magazine 37.5 (2020), pp. 55–66. doi:
10.1109/MSP.2020.3003851.
[270] Meisam Razaviyayn, Tianjian Huang, Songtao Lu, Maher Nouiehed, Maziar Sanjabi, and
Mingyi Hong. “Nonconvex min-max optimization: Applications, challenges, and recent
theoretical advances”. In: IEEE Signal Processing Magazine 37.5 (2020), pp. 55–66.
[271] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. “Do cifar-10 classifiers
generalize to cifar-10?” In: arXiv preprint arXiv:1806.00451 (2018).
[272] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. “Do imagenet
classifiers generalize to imagenet?” In: International Conference on Machine Learning. PMLR. 2019,
pp. 5389–5400.
[273] Jie Ren, Peter J Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark A DePristo, Joshua V Dillon,
and Balaji Lakshminarayanan. “Likelihood ratios for out-of-distribution detection”. In: arXiv
preprint arXiv:1906.02845 (2019).
[274] Alfréd Rényi. “On measures of dependence”. In: Acta mathematica hungarica 10.3-4 (1959),
pp. 441–451.
[275] Alfréd Rényi. “On measures of entropy and information”. In: Proceedings of the Fourth Berkeley
Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of
Statistics. The Regents of the University of California. 1961.
[276] Ashkan Rezaei, Rizal Fathony, Omid Memarrast, and Brian Ziebart. “Fair Logistic Regression: An
Adversarial Perspective”. In: arXiv preprint arXiv:1903.03910 (2019).
[277] Ashkan Rezaei, Rizal Fathony, Omid Memarrast, and Brian D Ziebart. “Fairness for Robust Log
Loss Classification.” In: AAAI. 2020, pp. 5511–5518.
[278] Ashkan Rezaei, Anqi Liu, Omid Memarrast, and Brian D Ziebart. “Robust fairness under covariate
shift”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. 2021, pp. 9419–9427.
[279] Davide Risso, John Ngai, Terence P Speed, and Sandrine Dudoit. “Normalization of RNA-seq data
using factor analysis of control genes or samples”. In: Nature biotechnology 32.9 (2014),
pp. 896–902.
[280] Goce Ristanoski, Wei Liu, and James Bailey. “Discrimination aware classification for imbalanced
datasets”. In: Proceedings of the 22nd ACM international conference on Information & Knowledge
Management. 2013, pp. 1529–1532.
[281] Adam Roberts and Lior Pachter. “Streaming fragment assignment for real-time analysis of
sequencing experiments”. In: Nature methods 10.1 (2013), pp. 71–73.
[282] Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. “edgeR: a Bioconductor package for
differential expression analysis of digital gene expression data”. In: bioinformatics 26.1 (2010),
pp. 139–140.
[283] Mark D Robinson and Alicia Oshlack. “A scaling normalization method for differential expression
analysis of RNA-seq data”. In: Genome biology 11.3 (2010), pp. 1–9.
[284] R Tyrrell Rockafellar, Stanislav Uryasev, et al. “Optimization of conditional value-at-risk”. In:
Journal of risk 2 (2000), pp. 21–42.
[285] Yuji Roh, Kangwook Lee, Steven Whang, and Changho Suh. “Fr-train: A mutual
information-based approach to fair and robust training”. In: International Conference on Machine
Learning. PMLR. 2020, pp. 8147–8157.
[286] Clemens Rösner and Melanie Schmidt. “Privacy preserving clustering with constraints”. In: arXiv
preprint arXiv:1802.02497 (2018).
[287] Salvatore Ruggieri. “Using t-closeness anonymity to control for non-discrimination.” In: Trans.
Data Privacy 7.2 (2014), pp. 99–129.
[288] Yvan Saeys, Inaki Inza, and Pedro Larranaga. “A review of feature selection techniques in
bioinformatics”. In: bioinformatics 23.19 (2007), pp. 2507–2517.
[289] Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. “Distributionally robust
neural networks for group shifts: On the importance of regularization for worst-case
generalization”. In: arXiv preprint arXiv:1911.08731 (2019).
[290] Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, and Jason D Lee. “On the convergence and
robustness of training GANs with regularized optimal transport”. In: Advances in Neural
Information Processing Systems. 2018, pp. 7091–7101.
[291] Maziar Sanjabi, Sina Baharlouei, Meisam Razaviyayn, and Jason D Lee. “When does
non-orthogonal tensor decomposition have no spurious local minima?” In: arXiv preprint
arXiv:1911.09815 (2019).
[292] Sina Mahdipour Saravani, Indrajit Ray, and Indrakshi Ray. “Automated identification of social
media bots using deepfake text detection”. In: International Conference on Information Systems
Security. Springer. 2021, pp. 111–123.
[293] Igal Sason and Sergio Verdú. “f-divergence Inequalities”. In: IEEE Transactions on Information
Theory 62.11 (2016), pp. 5973–6006.
[294] Prasanna Sattigeri, Samuel C Hoffman, Vijil Chenthamarakshan, and Kush R Varshney. “Fairness
gan”. In: arXiv preprint arXiv:1805.09910 (2018).
[295] Joseph L Schafer. Analysis of incomplete multivariate data. Chapman and Hall/CRC, 1997.
[296] Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. “Fair coresets and streaming
algorithms for fair k-means clustering”. In: arXiv preprint arXiv:1812.10854 (2018).
[297] Jessica Schrouff, Natalie Harris, Oluwasanmi Koyejo, Ibrahim Alabdulmohsin, Eva Schnider,
Krista Opsahl-Ong, Alex Brown, Subhrajit Roy, Diana Mincu, Christina Chen, et al. “Maintaining
fairness across distribution shift: do we have viable solutions for real-world applications?” In:
arXiv preprint arXiv:2202.01034 (2022).
[298] Seonguk Seo, Yumin Suh, Dongwan Kim, Geeho Kim, Jongwoo Han, and Bohyung Han.
“Learning to optimize domain specific normalization for domain generalization”. In: Computer
Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part
XXII 16. Springer. 2020, pp. 68–83.
[299] Young-Kyo Seo, Hansook Kim Chong, Aniello M Infante, Seung-Soon Im, Xiaohui Xie, and
Timothy F Osborne. “Genome-wide analysis of SREBP-1 binding in mouse liver chromatin reveals
a preference for promoter proximal binding to a new motif”. In: Proceedings of the National
Academy of Sciences 106.33 (2009), pp. 13765–13769.
[300] Uri Shaham, Kelly P Stanton, Jun Zhao, Huamin Li, Khadir Raddassi, Ruth Montgomery, and
Yuval Kluger. “Removal of batch effects using distribution-matching residual networks”. In:
Bioinformatics 33.16 (2017), pp. 2539–2546.
[301] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczynski. Lectures on stochastic
programming: modeling and theory. SIAM, 2021.
[302] Fatemeh Sheikholeslami, Swayambhoo Jain, and Georgios B Giannakis. “Minimum uncertainty
based detection of adversaries in deep neural networks”. In: 2020 Information Theory and
Applications Workshop (ITA). IEEE. 2020, pp. 1–16.
[303] Fatemeh Sheikholeslami, Ali Lotfi Rezaabad, and J Zico Kolter. “Provably Robust Classification
of Adversarial Examples with Detection”. In: International Conference on Learning
Representations (ICLR) (2021).
[304] Pannagadatta K Shivaswamy, Chiranjib Bhattacharyya, and Alexander J Smola. “Second order
cone programming approaches for handling missing and uncertain data”. In: Journal of Machine
Learning Research 7.Jul (2006), pp. 1283–1314.
[305] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. “Membership inference
attacks against machine learning models”. In: 2017 IEEE symposium on security and privacy (SP).
IEEE. 2017, pp. 3–18.
[306] Changjian Shui, Gezheng Xu, Qi Chen, Jiaqi Li, Charles X Ling, Tal Arbel, Boyu Wang, and
Christian Gagné. “On learning fairness and accuracy on multiple subgroups”. In: Advances in
Neural Information Processing Systems 35 (2022), pp. 34121–34135.
[307] Andrew H Sims, Graeme J Smethurst, Yvonne Hey, Michal J Okoniewski, Stuart D Pepper,
Anthony Howell, Crispin J Miller, and Robert B Clarke. “The removal of multiplicative,
systematic bias allows integration of breast cancer gene expression datasets–improving
meta-analysis and prediction of prognosis”. In: BMC medical genomics 1.1 (2008), pp. 1–14.
[308] Gagandeep Singh, Timon Gehr, Matthew Mirman, Markus Püschel, and Martin T Vechev. “Fast
and Effective Robustness Certification”. In: Advances in Neural Information Processing Systems 31 (2018).
[309] Harvineet Singh, Rina Singh, Vishwali Mhasawade, and Rumi Chunara. “Fair predictors under
distribution shift”. In: NeurIPS Workshop on Fair ML for Health. 2019.
[310] Harvineet Singh, Rina Singh, Vishwali Mhasawade, and Rumi Chunara. “Fairness violations and
mitigation under covariate shift”. In: Proceedings of the 2021 ACM Conference on Fairness,
Accountability, and Transparency. 2021, pp. 3–13.
[311] Aman Sinha, Hongseok Namkoong, and John Duchi. “Certifying Some Distributional Robustness
with Principled Adversarial Training”. In: International Conference on Learning Representations.
2018.
[312] Maurice Sion et al. “On general minimax theorems.” In: Pacific Journal of mathematics 8.1 (1958),
pp. 171–176.
[313] Gordon K Smyth. “Limma: linear models for microarray data”. In: Bioinformatics and
computational biology solutions using R and Bioconductor (2005), pp. 397–420.
[314] Charlotte Soneson, Sarah Gerster, and Mauro Delorenzi. “Batch effect confounding leads to strong
bias in performance estimates obtained by cross-validation”. In: PloS one 9.6 (2014), e100335.
[315] Jiaming Song, Pratyusha Kalluri, Aditya Grover, Shengjia Zhao, and Stefano Ermon. “Learning
Controllable Fair Representations”. In: arXiv preprint arXiv:1812.04218 (2019).
[316] Matthew Staib and Stefanie Jegelka. “Distributionally Robust Optimization and Generalization in
Kernel Methods”. In: Advances in Neural Information Processing Systems. Ed. by H. Wallach,
H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett. Vol. 32. Curran Associates,
Inc., 2019. url:
https://proceedings.neurips.cc/paper_files/paper/2019/file/1770ae9e1b6bc9f5fd2841f141557ffb-Paper.pdf.
[317] Daniel Steinberg, Alistair Reid, Simon O’Callaghan, Finnian Lattimore, Lachlan McCalman, and
Tiberio Caetano. “Fast Fair Regression via Efficient Approximations of Mutual Information”. In:
arXiv preprint arXiv:2002.06200 (2020).
[318] Daniel J Stekhoven and Peter Bühlmann. “MissForest—non-parametric missing value imputation
for mixed-type data”. In: Bioinformatics 28.1 (2012), pp. 112–118.
[319] Jonathan AC Sterne, Ian R White, John B Carlin, Michael Spratt, Patrick Royston,
Michael G Kenward, Angela M Wood, and James R Carpenter. “Multiple imputation for missing
data in epidemiological and clinical research: potential and pitfalls”. In: Bmj 338 (2009), b2393.
[320] David Stutz, Matthias Hein, and Bernt Schiele. “Confidence-calibrated adversarial training:
Generalizing to unseen attacks”. In: International Conference on Machine Learning. PMLR. 2020,
pp. 9155–9166.
[321] Latanya Sweeney. “Discrimination in online ad delivery”. In: arXiv preprint arXiv:1301.6822 (2013).
[322] Christian Szegedy, Alexander Toshev, and Dumitru Erhan. “Deep neural networks for object
detection”. In: Advances in neural information processing systems 26 (2013).
[323] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan,
Ian Goodfellow, and Rob Fergus. “Intriguing properties of neural networks”. In: arXiv preprint
arXiv:1312.6199 (2013).
[324] Mingxing Tan, Ruoming Pang, and Quoc V Le. “Efficientdet: Scalable and efficient object
detection”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
2020, pp. 10781–10790.
[325] Bahar Taskesen, Viet Anh Nguyen, Daniel Kuhn, and Jose Blanchet. “A Distributionally Robust
Approach to Fair Classification”. In: arXiv preprint arXiv:2007.09530 (2020).
[326] Kiran K Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. “Efficient
algorithms for smooth minimax optimization”. In: Advances in Neural Information Processing
Systems 32 (2019).
[327] Vincent Tjeng, Kai Xiao, and Russ Tedrake. “Evaluating robustness of neural networks with
mixed integer programming”. In: arXiv preprint arXiv:1711.07356 (2017).
[328] Florian Tramer. “Detecting adversarial examples is (nearly) as hard as classifying them”. In:
International Conference on Machine Learning. PMLR. 2022, pp. 21692–21702.
[329] Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. “On adaptive attacks
to adversarial example defenses”. In: arXiv preprint arXiv:2002.08347 (2020).
[330] Cuong Tran, My Dinh, and Ferdinando Fioretto. “Differentially private empirical risk
minimization under the fairness lens”. In: Advances in Neural Information Processing Systems 34
(2021), pp. 27555–27565.
[331] Cuong Tran, Ferdinando Fioretto, and Pascal Van Hentenryck. “Differentially private and fair
deep learning: A lagrangian dual approach”. In: Proceedings of the AAAI Conference on Artificial
Intelligence. Vol. 35. 11. 2021, pp. 9932–9939.
[332] Hoa Thi Nhu Tran, Kok Siong Ang, Marion Chevrier, Xiaomeng Zhang, Nicole Yee Shin Lee,
Michelle Goh, and Jinmiao Chen. “A benchmark of batch-effect correction methods for single-cell
RNA sequencing data”. In: Genome biology 21.1 (2020), pp. 1–32.
[333] Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani,
David Botstein, and Russ B Altman. “Missing value estimation methods for DNA microarrays”.
In: Bioinformatics 17.6 (2001), pp. 520–525.
[334] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. “Deep domain
confusion: Maximizing for domain invariance”. In: arXiv preprint arXiv:1412.3474 (2014).
[335] Berk Ustun, Yang Liu, and David Parkes. “Fairness without harm: Decoupled classifiers with
preference guarantees”. In: International Conference on Machine Learning. PMLR. 2019,
pp. 6373–6382.
[336] Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. “Expectation vs. experience:
Evaluating the usability of code generation tools powered by large language models”. In: Chi
conference on human factors in computing systems extended abstracts. 2022, pp. 1–7.
[337] Vladimir N Vapnik and A Ya Chervonenkis. “On the uniform convergence of relative frequencies
of events to their probabilities”. In: Measures of complexity. Springer, 2015, pp. 11–30.
[338] Susana Vinga. “Structured sparsity regularization for analyzing high-dimensional omics data”. In:
Briefings in Bioinformatics 22.1 (2021), pp. 77–87.
[339] Apoorv Vyas, Nataraj Jammalamadaka, Xia Zhu, Dipankar Das, Bharat Kaul, and
Theodore L Willke. “Out-of-distribution detection using an ensemble of self supervised leave-out
classifiers”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018,
pp. 550–564.
[340] Mingyang Wan, Daochen Zha, Ninghao Liu, and Na Zou. “Modeling Techniques for Machine
Learning Fairness: A Survey”. In: CoRR abs/2111.03015 (2021). arXiv: 2111.03015. url:
https://arxiv.org/abs/2111.03015.
[341] Haotao Wang, Junyuan Hong, Jiayu Zhou, and Zhangyang Wang. “How Robust is Your Fairness?
Evaluating and Sustaining Fairness under Unseen Distribution Shifts”. In: arXiv preprint
arXiv:2207.01168 (2022).
[342] Haotao Wang, Junyuan Hong, Jiayu Zhou, and Zhangyang Wang. “How Robust is Your Fairness?
Evaluating and Sustaining Fairness under Unseen Distribution Shifts”. In: Transactions on Machine
Learning Research (2023). issn: 2835-8856. url: https://openreview.net/forum?id=11pGlecTz2.
[343] Serena Wang, Wenshuo Guo, Harikrishna Narasimhan, Andrew Cotter, Maya Gupta, and
Michael Jordan. “Robust optimization for fairness with noisy protected groups”. In: Advances in
neural information processing systems 33 (2020), pp. 5190–5203.
[344] Shiqi Wang, Huan Zhang, Kaidi Xu, Xue Lin, Suman Jana, Cho-Jui Hsieh, and J Zico Kolter.
“Beta-crown: Efficient bound propagation with per-neuron split constraints for complete and
incomplete neural network verification”. In: arXiv preprint arXiv:2103.06624 (2021).
[345] Wolfram Wiesemann, Daniel Kuhn, and Melvyn Sim. “Distributionally robust convex
optimization”. In: Operations Research 62.6 (2014), pp. 1358–1376.
[346] Robert Williamson and Aditya Menon. “Fairness risk measures”. In: International Conference on
Machine Learning. PMLR. 2019, pp. 6786–6797.
[347] Hans S Witsenhausen. “On sequences of pairs of dependent random variables”. In: SIAM Journal
on Applied Mathematics 28.1 (1975), pp. 100–113.
[348] Eric Wong and Zico Kolter. “Provable defenses against adversarial examples via the convex outer
adversarial polytope”. In: International Conference on Machine Learning. 2018, pp. 5286–5295.
[349] Blake Woodworth, Suriya Gunasekar, Mesrob I Ohannessian, and Nathan Srebro. “Learning
non-discriminatory predictors”. In: arXiv preprint arXiv:1702.06081 (2017).
[350] Jing Xia, Shengyu Zhang, Guolong Cai, Li Li, Qing Pan, Jing Yan, and Gangmin Ning. “Adjusted
weight voting algorithm for random forests in handling missing values”. In: Pattern Recognition
69 (2017), pp. 52–60.
[351] Kai Y Xiao, Vincent Tjeng, Nur Muhammad Shafiullah, and Aleksander Madry. “Training for
faster adversarial robustness verification via inducing relu stability”. In: arXiv preprint
arXiv:1809.03008 (2018).
[352] Yang Xin, Lingshuang Kong, Zhi Liu, Yuling Chen, Yanmiao Li, Hongliang Zhu, Mingcheng Gao,
Haixia Hou, and Chunhua Wang. “Machine learning and deep learning methods for
cybersecurity”. In: Ieee access 6 (2018), pp. 35365–35381.
[353] Depeng Xu, Shuhan Yuan, and Xintao Wu. “Achieving differential privacy and fairness in logistic
regression”. In: Companion proceedings of The 2019 world wide web conference. 2019, pp. 594–599.
[354] Depeng Xu, Shuhan Yuan, Lu Zhang, and Xintao Wu. “Fairgan: Fairness-aware generative
adversarial networks”. In: 2018 IEEE International Conference on Big Data (Big Data). IEEE. 2018,
pp. 570–575.
[355] Huan Xu, Constantine Caramanis, and Shie Mannor. “Robustness and Regularization of Support
Vector Machines.” In: Journal of machine learning research 10.7 (2009).
[356] Ke Yan, Lu Kou, and David Zhang. “Learning domain-invariant subspace using domain features
and independence maximization”. In: IEEE transactions on cybernetics 48.1 (2017), pp. 288–299.
[357] Jinsung Yoon, James Jordon, and Mihaela Van Der Schaar. “Gain: Missing data imputation using
generative adversarial nets”. In: arXiv preprint arXiv:1806.02920 (2018).
[358] Alexander P Young, Daniel J Jackson, and Russell C Wyeth. “A technical review and guide to RNA
fluorescence in situ hybridization”. In: PeerJ 8 (2020), e8806.
[359] Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li. “Adversarial examples: Attacks and defenses
for deep learning”. In: IEEE transactions on neural networks and learning systems 30.9 (2019),
pp. 2805–2824.
[360] Muhammad Bilal Zafar, Isabel Valera, Manuel Rodriguez, Krishna Gummadi, and Adrian Weller.
“From parity to preference-based notions of fairness in classification”. In: Advances in Neural
Information Processing Systems 30 (2017).
[361] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi.
“Fairness constraints: Mechanisms for fair classification”. In: arXiv preprint arXiv:1507.05259
(2015).
[362] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rogriguez, and Krishna P Gummadi.
“Fairness constraints: Mechanisms for fair classification”. In: Artificial Intelligence and Statistics.
PMLR. 2017, pp. 962–970.
[363] Lihi Zelnik-Manor and Pietro Perona. “Self-tuning spectral clustering”. In: Advances in neural
information processing systems 17 (2004).
[364] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. “Learning fair
representations”. In: International Conference on Machine Learning. 2013, pp. 325–333.
[365] Han Zeng, Hong Qin, Meng Liao, Enze Zheng, Xiaoqing Luo, Anhua Xiao, Yiyu Li, Lin Chen,
Li Wei, Lei Zhao, et al. “CD36 promotes de novo lipogenesis in hepatocytes through
INSIG2-dependent SREBP1 processing”. In: Molecular Metabolism 57 (2022), p. 101428.
[366] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. “Mitigating unwanted biases with
adversarial learning”. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society.
ACM. 2018, pp. 335–340.
[367] Feng Zhang, Yu Wu, and Weidong Tian. “A novel approach to remove the batch effect of
single-cell data”. In: Cell discovery 5.1 (2019), pp. 1–4.
[368] Huan Zhang, Hongge Chen, Chaowei Xiao, Sven Gowal, Robert Stanforth, Bo Li, Duane Boning,
and Cho-Jui Hsieh. “Towards stable and efficient training of verifiably robust neural networks”.
In: arXiv preprint arXiv:1906.06316 (2019).
[369] Siqi Zhang, Junchi Yang, Cristóbal Guzmán, Negar Kiyavash, and Niao He. “The complexity of
nonconvex-strongly-concave minimax optimization”. In: arXiv preprint arXiv:2103.15888 (2021).
[370] Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and
Yanzhi Wang. “A systematic dnn weight pruning framework using alternating direction method
of multipliers”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018,
pp. 184–199.
[371] Xiaobai Zhang, Xiaofeng Song, Huinan Wang, and Huanping Zhang. “Sequential local least
squares imputation estimating missing value of microarray data”. In: Computers in biology and
medicine 38.10 (2008), pp. 1112–1120.
[372] Xuan Zhang, Necdet Serhat Aybat, and Mert Gurbuzbalaban. “Sapd+: An accelerated stochastic
method for nonconvex-concave minimax problems”. In: Advances in Neural Information
Processing Systems 35 (2022), pp. 21668–21681.
[373] Yuqing Zhang, Giovanni Parmigiani, and W Evan Johnson. “ComBat-seq: batch effect adjustment
for RNA-seq count data”. In: NAR genomics and bioinformatics 2.3 (2020), lqaa078.
[374] Tianxiang Zhao, Enyan Dai, Kai Shu, and Suhang Wang. “Towards Fair Classifiers Without
Sensitive Attributes: Exploring Biases in Related Features”. In: Proceedings of the Fifteenth ACM
International Conference on Web Search and Data Mining. 2022, pp. 1433–1442.
[375] Meiyu Zhong and Ravi Tandon. Learning Fair Classifiers via Min-Max F-divergence Regularization.
2023. arXiv: 2306.16552 [cs.LG].
[376] Chunting Zhou, Xuezhe Ma, Paul Michel, and Graham Neubig. “Examining and combating
spurious features under distribution shift”. In: International Conference on Machine Learning.
PMLR. 2021, pp. 12857–12867.
[377] Zheng Zhu, Guan Huang, Jiankang Deng, Yun Ye, Junjie Huang, Xinze Chen, Jiagang Zhu,
Tian Yang, Jiwen Lu, Dalong Du, et al. “WebFace260M: A Benchmark Unveiling the Power of
Million-Scale Deep Face Recognition”. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2021, pp. 10492–10502.
[378] Zhengxia Zou, Zhenwei Shi, Yuhong Guo, and Jieping Ye. “Object detection in 20 years: A
survey”. In: arXiv preprint arXiv:1905.05055 (2019).
[379] Steve Zymler, Daniel Kuhn, and Berç Rustem. “Distributionally robust joint chance constraints
with second-order moment information”. In: Mathematical Programming 137.1 (2013),
pp. 167–198.