Differentially Private and Fair Optimization for Machine Learning:
Tight Error Bounds and Efficient Algorithms
by
Andrew Lowy
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(APPLIED MATHEMATICS)
August 2023
Copyright 2023 Andrew Lowy
Acknowledgements
I am immensely grateful to the people who have helped me throughout my PhD journey. First, I would
like to thank my wife Jocelyne for always supporting, encouraging, and loving me. I thank my family,
especially my parents Adam and Susan, for their love and kindness.
I am greatly indebted to my wonderful collaborators: Hilal Asi, Sina Baharlouei, Ahmad Beirami, Ali
Ghafelebashi, Devansh Gupta, Tianjian Huang, Zeman Li, Dmitrii Ostrovskii, Rakesh Pavan, and Meisam
Razaviyayn. This thesis would not have been possible without their great contributions. I learned an
invaluable amount from working with Dmitrii in my first research project.
I am extremely thankful to several established members of the privacy community for their willingness
to help me, despite (mostly) having never met me in person: John Duchi, Gautam Kamath, Adam Smith,
and Thomas Steinke. Their help was indispensable. I am especially grateful to Adam Smith for his detailed
and prompt answers to the many questions I emailed him over the last few years.
I would also like to thank the esteemed members of my qualifying exam committee and thesis com-
mittee: Larry Goldstein, Aleksandra Korolova, Jinchi Lv, Stanislav Minsker, Jong-Shi Pang, Meisam Raza-
viyayn, and Jon Ullman. Their generosity, interest, and feedback mean so much to me.
I thank Larry Goldstein and Stas Minsker—my semi-official co-advisors—for their gracious guidance
and support over the last six years.
Last, but certainly not least, I owe tremendous gratitude to my amazing advisor, Meisam Razaviyayn.
Of course, none of this would have been possible without him. I learned a ton from Meisam
about how to conduct research at each stage of the process and how to present my work effectively. He
has also been extraordinarily helpful in offering professional guidance and support. I could not have asked
for a better mentor.
Dedication
To Jocelyne.
Table of Contents
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Differential Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.2.1 Optimal Differentially Private Learning with Public Data (Chapter 2) . . 7
1.1.2.2 Differentially Private Federated Learning Without a Trusted Server
(Chapter 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.2.3 Differentially Private Stochastic Optimization with Large Worst-Case
Lipschitz Parameter (Chapter 4) . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Fair (and Private) Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.2.1 A Stochastic Optimization Framework for Fair Risk Minimization
(Chapter 5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.2.2 Stochastic Differentially Private and Fair Learning (Chapter 6) . . . . . . 12
1.3 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Chapter 2: Optimal Differentially Private Learning With Public Data . . . . . . . . . . . . . . . . . 14
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Semi-Differential Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.2 Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Optimal Centrally Private Learning with Public Data . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 Optimal Semi-DP Mean Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1.1 An “Even More Optimal” Semi-DP Algorithm for Mean Estimation . . . 20
2.2.2 Optimal Semi-DP Empirical Risk Minimization . . . . . . . . . . . . . . . . . . . . 23
2.2.3 Optimal Semi-DP Stochastic Convex Optimization . . . . . . . . . . . . . . . . . . 24
2.3 Optimal Locally Private Learning with Public Data . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1 Optimal Semi-LDP Mean Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1.1 An “Even More Optimal” Semi-LDP Estimator . . . . . . . . . . . . . . . 27
2.3.2 Optimal Semi-LDP Stochastic Convex Optimization . . . . . . . . . . . . . . . . . . 28
2.3.2.1 An “Even More Optimal” Semi-LDP Algorithm for SCO . . . . . . . . . . 29
2.4 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Chapter 3: Private Federated Learning Without a Trusted Server . . . . . . . . . . . . . . . . . . . 32
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Private Convex Federated Learning Without a Trusted Server . . . . . . . . . . . . . . . . 43
3.2.1 Inter-Silo Record-Level DP FL with Homogeneous Silo Data . . . . . . . . . . . . . 43
3.2.1.1 Upper Bounds via Noisy Distributed Minibatch SGD . . . . . . . . . . . 43
3.2.1.2 Lower Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.2 Inter-Silo Record-Level DP FL with Heterogeneous Silo Data . . . . . . . . . . . . 46
3.2.3 Inter-Silo Record-Level DP Federated ERM . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.4 Shuffle DP Federated Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2.5 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3 Private Non-Convex Federated Learning Without a Trusted Server . . . . . . . . . . . . . . 53
3.3.1 Algorithms for Proximal-PL Losses . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3.1.1 Noisy Distributed Proximal SGD for Heterogeneous FL (SO) . . . . . . . 54
3.3.1.2 Noisy Distributed Prox-PL-SVRG for Federated ERM . . . . . . . . . . . 56
3.3.2 Algorithms for Non-Convex/Non-Smooth Composite Losses . . . . . . . . . . . . . 58
3.3.2.1 Lower Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3.3 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.4 Concluding Remarks and Open Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Chapter 4: Private Stochastic Optimization with Large Worst-Case Lipschitz Parameter: Optimal
Rates for (Non-Smooth) Convex Losses and Extension to Non-Convex Losses . . . . . . 67
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.1.2 Contributions and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2 Private Heavy-Tailed Mean Estimation Building Blocks . . . . . . . . . . . . . . . . . . . . 73
4.3 Optimal Rates for Non-Smooth (Strongly) Convex Losses . . . . . . . . . . . . . . . . . . . 74
4.3.1 Localized Noisy Clipped Subgradient Method for Convex Losses . . . . . . . . . . 75
4.3.2 The Strongly Convex Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.3 Lower Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4 Linear-Time Algorithms for Smooth (Strongly) Convex Losses . . . . . . . . . . . . . . . . 82
4.4.1 Noisy Clipped Accelerated SGD for Smooth Convex Losses . . . . . . . . . . . . . 82
4.4.2 Noisy Clipped SGD for Strongly Convex Losses . . . . . . . . . . . . . . . . . . . . 83
4.5 Algorithm for Non-Convex Proximal-PL Loss Functions . . . . . . . . . . . . . . . . . . . . 84
4.6 Concluding Remarks and Open Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Chapter 5: A Stochastic Optimization Framework for Fair Risk Minimization . . . . . . . . . . . . 87
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2 Fair Risk Minimization through ERMI Regularization . . . . . . . . . . . . . . . . . . . . . 91
5.2.1 A Convergent Stochastic Algorithm for Fair Empirical Risk Minimization . . . . . 94
5.2.2 Asymptotic Convergence of Algorithm 9 for Population-level FRMI Objective . . . 98
5.3 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3.1 Fair Binary Classification with Binary Sensitive Attributes using Logistic Regression 101
5.3.1.1 Benchmarking full-batch performance . . . . . . . . . . . . . . . . . . . 101
5.3.1.2 The effect of batch size on fairness/accuracy tradeoffs . . . . . . . . . . . 103
5.3.1.3 The effect of missing sensitive attributes on fairness/accuracy tradeoffs . 104
5.3.2 Fair Binary Classification using Neural Models . . . . . . . . . . . . . . . . . . . . 105
5.3.3 Fair Non-binary Classification with Multiple Sensitive Attributes . . . . . . . . . . 106
5.3.4 Beyond Fairness: Domain Parity Regularization for Domain Generalization . . . . 107
5.4 Discussion and Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Chapter 6: Stochastic Differentially Private and Fair Learning . . . . . . . . . . . . . . . . . . . . . 112
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.2 Problem Setting and Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.3 Private Fair ERM via Exponential Rényi Mutual Information . . . . . . . . . . . . . . . . . 118
6.3.1 Noisy DP-FERMI for Stochastic Private Fair ERM . . . . . . . . . . . . . . . . . . . 120
6.4 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.4.1 Standard Benchmark Experiments: Logistic Regression on Tabular Datasets . . . . 123
6.4.1.1 Demographic Parity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.4.1.2 Equalized Odds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.4.2 Large-Scale Experiment: Deep Convolutional Neural Network on Image Dataset . 125
6.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Chapter 7: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Appendix to Chapter 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
A.1 Techniques and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
A.1.1 Overview of Our Techniques and Challenges . . . . . . . . . . . . . . . . . . . . . 153
A.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
A.1.2.1 Sample complexity bounds for semi-DP learning and estimation . . . . . 155
A.1.2.2 DP model training (ERM and SCO) with public data . . . . . . . . . . . . 155
A.1.2.3 Personalized DP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
A.2 Optimal Centrally Private Learning with Public Data . . . . . . . . . . . . . . . . . . . . . 156
A.2.1 Optimal Semi-DP Mean Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
A.2.1.1 Estimating the Empirical Mean . . . . . . . . . . . . . . . . . . . . . . . 156
A.2.1.2 Estimating the Population Mean . . . . . . . . . . . . . . . . . . . . . . . 159
A.2.2 An “Even More Optimal” Semi-DP Algorithm for Mean Estimation . . . . . . . . . 171
A.2.3 Optimal Semi-DP Empirical Risk Minimization . . . . . . . . . . . . . . . . . . . . 175
A.2.4 Optimal Semi-DP Stochastic Convex Optimization . . . . . . . . . . . . . . . . . . 179
A.2.4.1 Semi-DP SCO with an “Even More Optimal” Gradient Estimator . . . . . 184
A.3 Optimal Locally Private Learning with Public Data . . . . . . . . . . . . . . . . . . . . . . . 187
A.3.1 Optimal Semi-LDP Mean Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 187
A.3.1.1 An “Even More Optimal” Semi-LDP Estimator . . . . . . . . . . . . . . . 191
A.3.2 Optimal Semi-LDP Stochastic Convex Optimization . . . . . . . . . . . . . . . . . . 195
A.3.2.1 An “Even More Optimal” Semi-LDP Algorithm for SCO . . . . . . . . . . 198
A.4 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
A.4.1 Central Semi-DP Linear Regression with Gaussian Data . . . . . . . . . . . . . . . 200
A.4.1.1 Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
A.4.1.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
A.4.1.3 Details on Implementations of Algorithms . . . . . . . . . . . . . . . . . 201
A.4.1.4 Hyperparameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
A.4.2 Central Semi-DP Logistic Regression with CIFAR-10 . . . . . . . . . . . . . . . . . 203
A.4.3 Semi-Local DP Linear Regression with Gaussian Data . . . . . . . . . . . . . . . . 205
A.4.4 Hyperparameters Search Grids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Appendix B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Appendix to Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
B.1 Private Convex Federated Learning Without a Trusted Server . . . . . . . . . . . . . . . . 209
B.1.1 Thorough Discussion of Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 209
B.1.2 Rigorous Definition of Inter-Silo Record-Level DP . . . . . . . . . . . . . . . . . . . 212
B.1.3 Relationships between notions of DP . . . . . . . . . . . . . . . . . . . . . . . . . . 214
B.1.3.1 ISRL-DP is stronger than CDP: . . . . . . . . . . . . . . . . . . . . . . . . 214
B.1.3.2 ISRL-DP Implies User-Level DP for small (ε₀, δ₀): . . . . . . . . . . . . . 215
B.1.4 Supplementary Material for Upper Bounds in Section 3.2.1 . . . . . . . . . . . . . . 215
B.1.4.1 Pseudocode of Noisy MB-SGD . . . . . . . . . . . . . . . . . . . . . . . . 215
B.1.4.2 Proof of Theorem 34 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
B.1.5 Supplementary Material for ISRL-DP Lower Bounds in Section 3.2.1 . . . . . . . . 228
B.1.5.1 C-Compositionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
B.1.5.2 Algorithms whose Privacy follows from Advanced Composition
Theorem are 1-Compositional . . . . . . . . . . . . . . . . . . . . . . 229
B.1.5.3 Example of ISRL-DP Algorithm that is Not Compositional . . . . . . . . 229
B.1.5.4 Proof of Theorem 35 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
B.1.5.5 Tight Excess Risk Bounds for Cross-Device FL Without a Trusted Server 240
B.1.6 Supplemental Material for Section 3.2.2 . . . . . . . . . . . . . . . . . . . . . . . . . 241
B.1.6.1 Multi-Stage Implementation of Accelerated Noisy MB-SGD . . . . . . . . 241
B.1.6.2 Complete version and proof of Theorem 36 . . . . . . . . . . . . . . . . . 241
B.1.7 Supplementary Material for Section 3.2.3 . . . . . . . . . . . . . . . . . . . . . . . . 247
B.1.7.1 Proof of Theorem 37 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
B.1.7.2 Lower bounds for ISRL-DP Federated ERM . . . . . . . . . . . . . . . . . 250
B.1.8 Proofs and Supplementary Materials for Section 3.2.4 . . . . . . . . . . . . . . . . . 253
B.1.8.1 Proof of Theorem 38 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
B.1.8.2 SDP One-Pass Accelerated Noisy MB-SGD and the Proof of Theorem 39 256
B.1.9 ISRL-DP Upper Bounds with Unbalanced Data Set Sizes and Differing Privacy
Needs Across Silos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
B.1.9.1 Additional Notation and Assumptions . . . . . . . . . . . . . . . . . . . 259
B.1.9.2 Pseudocode for Noisy ISRL-DP MB-SGD in the Unbalanced Case . . . . 261
B.1.9.3 General Unbalanced Version of Theorem 34 . . . . . . . . . . . . . . . . 261
B.1.10 Numerical Experiments: Details and Additional Results . . . . . . . . . . . . . . . 267
B.1.10.1 Logistic Regression with MNIST . . . . . . . . . . . . . . . . . . . . . . . 267
B.1.10.2 Linear Regression with Health Insurance Data: . . . . . . . . . . . . . . . 269
B.1.10.3 Softmax Regression with Obesity Dataset . . . . . . . . . . . . . . . . . . 271
B.2 Private Non-Convex Federated Learning Without a Trusted Server . . . . . . . . . . . . . . 273
B.2.1 Further Discussion of Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
B.2.2 Supplemental Material for Section 3.3.1: Proximal PL Loss Functions . . . . . . . . 275
B.2.2.1 Noisy Proximal Gradient Methods for Proximal PL FL (SO) - Pseudocodes 275
B.2.2.2 Proof of Theorem 40: Heterogeneous FL (SO) . . . . . . . . . . . . . . . . 276
B.2.2.3 SDP Noisy Distributed Prox-PL SVRG Pseudocode . . . . . . . . . . . . . 281
B.2.2.4 Proof of Theorem 42: Federated ERM . . . . . . . . . . . . . . . . . . . . 282
B.2.3 Supplemental Material for Section 3.3.2: Non-Convex/Non-Smooth Losses . . . . . 293
B.2.3.1 ISRL-DP Non-Convex Lower Bound . . . . . . . . . . . . . . . . . . . . . 301
B.2.3.2 Upper and Lower Bounds for Cross-Device FL Without a Trusted Server 303
B.2.4 Experimental Details and Additional Results . . . . . . . . . . . . . . . . . . . . . . 307
B.2.4.1 ISRL-DP Fed-SPIDER: Alternate implementation of ISRL-DP FedProx-
SPIDER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
B.2.4.2 MNIST experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
B.2.4.3 CIFAR-10 experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
B.2.4.4 Breast cancer experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Appendix C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
Appendix to Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
C.1 Additional Discussion of Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
C.2 Other Bounded Moment Conditions Besides Assumption 5 . . . . . . . . . . . . . . . . . . 316
C.3 Correcting the Errors in the Strongly Convex Upper Bounds Claimed in [137, 253] . . . . . 318
C.4 Details and Proofs for Section 4.3: Optimal Rates for (Strongly) Convex Losses . . . . . . . 320
C.4.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
C.4.2 Localized Noisy Clipped Subgradient Method (Section 4.3.1) . . . . . . . . . . . . . 322
C.4.3 The Strongly Convex Case (Section 4.3.2) . . . . . . . . . . . . . . . . . . . . . . . 333
C.4.4 Asymptotic Upper Bounds Under Assumptions 9 and 10 . . . . . . . . . . . . . . . 335
C.4.5 Lower Bounds (Section 4.3.3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
C.5 Details and Proofs for Section 4.4: Linear Time Algorithms . . . . . . . . . . . . . . . . . . 350
C.5.1 Noisy Clipped Accelerated SGD for Smooth Convex Losses (Section 4.4.1) . . . . . 350
C.5.2 Noisy Clipped SGD for Strongly Convex Losses (Section 4.4.2) . . . . . . . . . . . . 356
C.6 Details and Proofs for Section 4.5: Non-Convex Proximal PL Losses . . . . . . . . . . . . . 362
C.7 Shuffle Differentially Private Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
C.7.1 ℓ₂ Clip Shuffle Private Mean Estimator . . . . . . . . . . . . . . . . . . . . . . 367
C.7.2 Coordinate-wise Shuffle Private Mean Estimation Oracle . . . . . . . . . . . . . . . 369
Appendix D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
Appendix to Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
D.1 Notions of Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
D.2 ERMI: General Definition, Properties, and Special Cases Unraveled . . . . . . . . . . . . . 381
D.3 Relations Between ERMI and Other Fairness Violation Notions . . . . . . . . . . . . . . . . 385
D.4 Precise Statement and Proofs of Theorem 67 and Theorem 69 . . . . . . . . . . . . . . . . 395
D.4.1 Proof of Theorem 69 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
D.5 Experiment Details and Additional Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
D.5.1 Model description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
D.5.2 More comparison to Mary et al., 2019 [186] . . . . . . . . . . . . . . . . . . . . . . 410
D.5.3 Performance in the presence of outliers & class-imbalance . . . . . . . . . . . . . . 411
D.5.4 Effect of hyperparameter λ on the accuracy-fairness tradeoffs . . . . . . . . . . . . 412
D.5.5 Complete version of Figure 1 (with pre-processing and post-processing baselines) . 412
D.5.6 Description of datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
Appendix E . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
Appendix to Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
E.1 Additional Discussion of Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
E.2 Equalized Odds Version of ERMI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
E.3 Complete Version of Theorem 74 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
E.4 DP-FERMI Algorithm: Privacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
E.5 DP-FERMI Algorithm: Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
E.5.1 Noisy DP-SGDA for Nonconvex-Strongly Concave Min-Max Problems . . . . . . . 424
E.5.2 Proof of Theorem 76 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
E.6 Numerical Experiments: Additional Details and Results . . . . . . . . . . . . . . . . . . . . 436
E.6.1 Measuring Demographic Parity and Equalized Odds Violation . . . . . . . . . . . . 436
E.6.2 Tabular Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
E.6.2.1 Model Description and Experimental Details . . . . . . . . . . . . . . . . 436
E.6.2.2 Description of Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
E.6.2.3 Demographic Parity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
E.6.2.4 Equalized Odds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
E.6.3 Image Dataset (UTK-Face) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
E.7 Societal Impacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
List of Tables
5.1 Comparison of in-processing methods for fair learning . . . . . . . . . . . . . . . . . . . . 89
A.1 Semi-DP Hyperparameter Search Grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
A.2 Semi-DP Hyperparameter Search Grid (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
A.3 Semi-DP Hyperparameter Search Grid (3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
A.4 Semi-LDP Hyperparameter Search Grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
List of Figures
2.1 Minimax optimal error rates for ε-semi-DP and ε-semi-LDP. . . . . . . . . . . . . . . . . 17
2.2 Semi-DP Loss vs. n_pub/n. ε = 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.3 Semi-DP Loss vs. n_pub/n. ε = 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Semi-DP Loss vs. ε. n_pub/n = 0.04. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5 Semi-DP Loss vs. ε. n_pub/n = 0.03. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 Inter-silo record-level DP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Comparison of DP FL notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 Summary of convex DP FL results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Summary of non-convex DP FL results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 ISRL-DP FL: Logistic regression on MNIST data . . . . . . . . . . . . . . . . . . . . . . . . 52
3.6 ISRL-DP FL: Linear regression on health data . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.7 ISRL-DP FL: Logistic regression on obesity data . . . . . . . . . . . . . . . . . . . . . . . . 52
3.8 ISRL-DP FL: Neural net classification on MNIST data . . . . . . . . . . . . . . . . . . . . . 62
3.9 ISRL-DP FL: Neural net classification on MNIST data (2) . . . . . . . . . . . . . . . . . . . . 63
3.10 ISRL-DP FL: CNN classification on CIFAR-10 data . . . . . . . . . . . . . . . . . . . . . . . 63
3.11 ISRL-DP FL: CNN classification on CIFAR-10 data (2) . . . . . . . . . . . . . . . . . . . . . 64
3.12 ISRL-DP FL: Neural Net classification on WBCD data . . . . . . . . . . . . . . . . . . . . . 64
4.1 DP Stochastic Optimization with Large Worst-Case Lipschitz Parameter: Summary of
Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1 FERMI: Adult experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.2 FERMI: Effect of batch size experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3 FERMI: Missing sensitive attributes experiment . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4 FERMI: Toxic comment detection experiment . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.5 FERMI: Communities experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.6 FERMI: Domain generalization experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.1 Comparison of DP fair learning methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.2 DP FERMI: Adult demographic parity experiment . . . . . . . . . . . . . . . . . . . . . . . 125
6.3 DP FERMI: Parkinsons demographic parity experiment . . . . . . . . . . . . . . . . . . . . 126
6.4 DP FERMI: Credit Card equalized odds experiment . . . . . . . . . . . . . . . . . . . . . . 126
6.5 DP FERMI: UTK-Face experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
A.1 Semi-DP Loss vs. PDA-MD Regularization Parameter . . . . . . . . . . . . . . . . . . . . . 203
A.2 Semi-DP Error vs. n_pub/n on CIFAR-10. ε ∈ {0.1, 0.5, 1} . . . . . . . . . . . . . . . . 204
A.3 Semi-DP Error vs. n_pub/n on CIFAR-10. ε ∈ {2, 4, 8} . . . . . . . . . . . . . . . . . . 204
A.4 Semi-DP Error vs. ε on CIFAR-10. n_pub/n ∈ {0.01, 0.02, 0.04} . . . . . . . . . . . . . 204
A.5 Semi-DP Error vs. ε on CIFAR-10. n_pub/n ∈ {0.1, 0.2, 0.3} . . . . . . . . . . . . . . . 205
A.6 Semi-LDP Linear Regression: Loss vs. n_pub/n . . . . . . . . . . . . . . . . . . . . . . . 206
A.7 Semi-LDP Linear Regression: Loss vs. n_pub/n (2) . . . . . . . . . . . . . . . . . . . . . 206
A.8 Semi-LDP Linear Regression: Loss vs. ε . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
A.9 Semi-LDP Linear Regression: Loss vs. ε (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
B.1 ISRL-DP FL: Error vs. ε for logistic regression on MNIST . . . . . . . . . . . . . . . . . . . 269
B.2 ISRL-DP FL: Error vs. ε for linear regression on health data . . . . . . . . . . . . . . . . . . 305
B.3 ISRL-DP FL: Error vs. ε for linear regression on health data (2) . . . . . . . . . . . . . . . . 306
B.4 ISRL-DP non-convex FL: Error vs. ε on MNIST . . . . . . . . . . . . . . . . . . . . . . . . . 310
B.5 ISRL-DP non-convex FL: Error vs. ε on MNIST (2) . . . . . . . . . . . . . . . . . . . . . . . 310
D.1 Experiment comparing FERMI to Mary et al., 2019 . . . . . . . . . . . . . . . . . . . . . . . 411
D.2 FERMI: Effect of outliers experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
D.3 FERMI: Effect of λ experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
D.4 FERMI: COMPAS, German, and Adult demographic parity experiments . . . . . . . . . . . 413
D.5 FERMI: COMPAS, German, and Adult equalized odds experiments . . . . . . . . . . . . . . 414
D.6 FERMI: COMPAS, German, and Adult equality of opportunity experiments . . . . . . . . . 415
E.1 DP FERMI: Retired Adult experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
E.2 DP FERMI: Credit Card demographic parity experiment . . . . . . . . . . . . . . . . . . . . 440
E.3 DP FERMI: Parkinsons demographic parity experiment . . . . . . . . . . . . . . . . . . . . 441
E.4 DP FERMI: Adult equalized odds experiment . . . . . . . . . . . . . . . . . . . . . . . . . . 441
E.5 DP FERMI: Retired Adult equalized odds experiment . . . . . . . . . . . . . . . . . . . . . 442
Abstract
In recent years, machine learning (ML) systems have increasingly been deployed in industry, government,
and society. Although ML models can be extremely useful and capable of making very accurate predictions,
they can also have dangerous social, ethical, and legal consequences. For example, the output of
an ML algorithm can leak confidential information about individuals whose data was used to train the
model, violating their privacy. Another major concern is that ML models may discriminate against certain
demographic groups, such as individuals of a particular race, gender, or age. In this thesis, we study what
performance is possible in ML under privacy and/or fairness constraints. Additionally, we develop efficient
algorithms for training private and/or fair ML models; in most instances, our algorithms achieve the optimal
performance.
First, we focus on private ML. In particular, we study differentially private algorithms. Differential
privacy (DP) offers a strong guarantee that no adversary can learn too much about any person’s private
data. We study three problems in DP ML:
• DP learning with side access to public data: Consider learning a DP model, given a certain amount
of auxiliary “public data” that is free of privacy concerns. We characterize the minimax optimal
error (up to constant factors) of DP learning with side access to public data for three fundamental
problems: mean estimation, empirical risk minimization, and stochastic convex optimization.
• DP federated learning without a trusted server: Federated learning is a distributed ML paradigm in
which “silos” (e.g. hospitals, banks, cell-phones) store their data locally and collaborate to achieve a
global learning objective. We study a notion of DP—called inter-silo record-level DP—that protects
the privacy of each silo’s data, even if an attacker has access to the central server. We provide tight
error bounds (up to logarithms) for optimal inter-silo record-level DP FL algorithms.
• DP stochastic optimization with large worst-case Lipschitz parameter: In DP stochastic optimization
(SO), we are given a loss function and an i.i.d. training set drawn from an unknown distribution,
and aim to privately minimize the expected test loss. Prior works have considered DP SO with loss
functions that are uniformly Lipschitz continuous over all data points, and gave optimal error bounds
that scale with the worst-case (over all data points) Lipschitz parameter. In practice, this worst-case
Lipschitz parameter can be huge (e.g. due to outliers or heavy-tailed data), resulting in pessimistic
error bounds. We prove optimal error bounds (up to logarithms) for DP stochastic optimization that
do not depend on the worst-case Lipschitz parameter of the loss.
Second, we turn to fair ML:
• A stochastic optimization framework for fair learning: We consider the problem of fair classification
with discrete sensitive attributes and potentially large models and data sets, requiring stochastic
solvers. Here, the term “stochastic” refers to algorithms that only require small batches of data
in each iteration of training. Previous fair learning methods were impractical in the large-scale
setting because they either required computations with large batches of data at each iteration or
were not guaranteed to converge. We develop the first stochastic fairness algorithm with guaranteed
convergence for any batch size.
• Stochastic differentially private and fair learning: We devise the first stochastic differentially private
algorithm for fair learning that is guaranteed to converge. Our DP algorithm is built directly on the
stochastic fairness algorithm described in the above bullet.
Beyond our theoretical results, we also provide empirical evidence that our algorithms perform well
in practice. For many of the above problems, our algorithms outperform the state-of-the-art methods in a
range of numerical experiments.
Chapter 1
Introduction
With the increasing adoption of machine learning (ML) systems in industry, government, and society,
apprehensions about the potential risks associated with these systems have surged. One major risk is that
ML models may leak individuals’ data, violating their privacy. In certain applications, such as healthcare
and finance, the data that is used to train ML models may contain sensitive information, such as disease
history and bank account details. Thus, privacy violations can have severe adverse ramifications. A second
danger associated with the deployment of ML systems is that these systems may discriminate against
certain demographic groups, such as individuals of a certain race, gender, or age. When ML systems are
used to help make high-stakes decisions, e.g. whether to release someone from prison or grant them a
loan, such discrimination can be extremely detrimental.
This thesis studies the problem of training ML models that preserve the privacy of individuals and/or avoid making discriminatory predictions. Training a ML model amounts to finding a parameter $w^\star \in \mathcal{W}$ that achieves small prediction error. Here, $\mathcal{W} \subset \mathbb{R}^d$ is a constraint set. The prediction error of a parameter $w$ is measured with respect to a loss function $f(w,x)$ (e.g. cross-entropy loss for logistic regression, or squared loss for linear regression) and a data set or distribution. (Here, $x$ is a data point.) Thus, model training amounts to solving an optimization problem. If a learner aims to minimize the error on the training data $X = (x_1, \ldots, x_n)$, then the optimization problem is called empirical risk minimization (ERM):
$$\min_{w \in \mathcal{W}} \left\{ \frac{1}{n} \sum_{i=1}^{n} f(w, x_i) \right\}.$$
If the learner aims to minimize the error on unseen test data (drawn from some distribution $P$, which is typically unknown to the learner), then the optimization problem is called stochastic optimization:
$$\min_{w \in \mathcal{W}} \mathbb{E}_{x \sim P}\left[ f(w, x) \right].$$
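For concreteness, the two objectives can be written out in code. The following Python sketch is purely illustrative (not tied to any experiment in this thesis): it uses the squared loss and a synthetic linear-regression distribution $P$, and approximates the population risk by Monte Carlo sampling.

```python
import numpy as np

def loss(w, x):
    """Squared loss f(w, x) for a data point x = (feature, label)."""
    feature, label = x
    return 0.5 * (np.dot(w, feature) - label) ** 2

def empirical_risk(w, X):
    """ERM objective: (1/n) * sum_i f(w, x_i) over the training data."""
    return float(np.mean([loss(w, x) for x in X]))

def population_risk(w, sample, num_samples=20_000, seed=0):
    """Monte Carlo estimate of the SCO objective E_{x ~ P}[f(w, x)]."""
    rng = np.random.default_rng(seed)
    return float(np.mean([loss(w, sample(rng)) for _ in range(num_samples)]))

# Illustrative distribution P: labels follow a linear model plus noise.
w_star = np.array([1.0, -2.0])
def sample(rng):
    feature = rng.normal(size=2)
    label = feature @ w_star + 0.1 * rng.normal()
    return feature, label

rng = np.random.default_rng(1)
X = [sample(rng) for _ in range(50)]         # training set of n = 50 points
train_loss = empirical_risk(w_star, X)       # small: w* fits the training data
test_loss = population_risk(w_star, sample)  # near the noise floor 0.5 * 0.1**2
print(train_loss, test_loss)
```

The gap between the two quantities is exactly the generalization error studied throughout the thesis.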
Most of this thesis is devoted to understanding the above two optimization problems—and the algorithms
used to solve them—in different contexts, when privacy and/or fairness constraints are also present. In the
rest of this introductory chapter, we provide more background on data privacy and algorithmic fairness,
and describe the problems that we will study and our contributions.
1.1 Differential Privacy
The practice of collecting, curating, and utilizing detailed personal data is ubiquitous in the digital era
that we live in. Attempts have been made to “anonymize” this data, by suppressing or removing “per-
sonally identifiable” portions of it. However, the richness of the data leaves individuals vulnerable to re-
identification attacks, in which non-anonymized features are used to deduce the identity of someone in the
“anonymous” data set. Moreover, even when the raw data is kept private, the output of machine learning
and optimization algorithms may leak personal information, e.g. via membership inference attacks [228],
model inversion attacks [103], or extraction attacks [50].
These dangers prompted the development and deployment of differential privacy [87]. Differential
privacy is a rigorous and quantitative notion of data privacy that thwarts privacy attacks.
1.1.1 Background
In this subsection, we define differential privacy (DP) and state some facts about DP that will be used throughout this thesis. Let $\mathcal{X}$ denote a data universe, which may be any set. Two data sets $X$ and $X'$ are adjacent if they differ in a single entry (i.e. $|X \Delta X'| = |X \cup X' - X \cap X'| = 2$). If $X$ and $X'$ are adjacent, we write $X \sim X'$.
Definition 1 (Differential Privacy [87]). Let $\varepsilon \ge 0$, $\delta \in [0,1)$. A randomized algorithm $\mathcal{A}: \mathcal{X}^n \to \mathcal{Z}$ is $(\varepsilon,\delta)$-differentially private (DP) if for all pairs of adjacent data sets $X, X' \in \mathcal{X}^n$ and all measurable subsets $S \subseteq \mathcal{Z}$, we have
$$\mathbb{P}(\mathcal{A}(X) \in S) \le e^{\varepsilon}\, \mathbb{P}(\mathcal{A}(X') \in S) + \delta.$$
In the above display, the probability $\mathbb{P}$ is solely over the randomness of $\mathcal{A}$. If $\delta = 0$, we say that $\mathcal{A}$ satisfies "pure DP" and write $\varepsilon$-DP. If $\delta > 0$, we say "approximate DP" and write $(\varepsilon,\delta)$-DP.
Definition 1 requires that $\mathcal{A}(X)$ does not reveal too much information about any individual data point. The amount of information that $\mathcal{A}$ may leak is quantified by the privacy parameters $\varepsilon$ and $\delta$, with smaller $\varepsilon, \delta$ providing more privacy. With an appropriate choice of $(\varepsilon, \delta)$, DP ensures that no attacker can infer much more about any person's data than they could have learned had that person's data never been used for training. One drawback of Definition 1 is that the practical privacy implications of a specific choice of parameters $(\varepsilon, \delta)$ are not very apparent. However, empirical evidence suggests that choosing $\varepsilon$ to be a small constant (e.g. $\varepsilon = 1$) and $\delta \ll 1/n$ effectively prevents practical privacy attacks.
Definition 1 is sometimes referred to as central differential privacy (CDP). An alternative, more stringent, notion of differential privacy is local differential privacy (LDP):
Definition 2 (Local Differential Privacy [144, 85]). Let $\varepsilon \ge 0$, $\delta \in [0,1)$. A randomized algorithm $\mathcal{A}: \mathcal{X} \to \mathcal{Z}$ is $(\varepsilon,\delta)$-locally differentially private (LDP) if for all pairs of data points $x, x' \in \mathcal{X}$ and all measurable subsets $S \subseteq \mathcal{Z}$, we have
$$\mathbb{P}(\mathcal{A}(x) \in S) \le e^{\varepsilon}\, \mathbb{P}(\mathcal{A}(x') \in S) + \delta.$$
In the above display, the probability $\mathbb{P}$ is solely over the randomness of $\mathcal{A}$.
Local DP is useful in settings where individuals do not trust others (e.g. the data scientist/curator or server) with their sensitive data: instead of contributing their raw data $x$, an individual can provide the data scientist with the privatized version of their data, $\mathcal{A}(x)$. Local DP is stronger than central DP in the following sense: if $\mathcal{A}$ is $(\varepsilon,\delta)$-LDP and we run $\mathcal{A}$ independently on $x_1, \ldots, x_n$, then $\mathcal{A}'(X) := (\mathcal{A}(x_1), \ldots, \mathcal{A}(x_n))$ is $(\varepsilon,\delta)$-DP.
We will also define another notion of DP that is sometimes more convenient to work with. Notice that Definition 1 is equivalent to requiring that a certain divergence—the so-called $\delta$-approximate max divergence—between $\mathcal{A}(X)$ and $\mathcal{A}(X')$ be bounded by $\varepsilon$ [88]. We can instead use the $\alpha$-Rényi divergence:
Definition 3 ([221]). For distributions $P$ and $Q$ with probability density/mass functions $p$ and $q$, the $\alpha$-Rényi divergence between $P$ and $Q$ is defined as
$$D_{\alpha}(P \,\|\, Q) = \frac{1}{\alpha - 1} \ln\left( \int p(x)^{\alpha} q(x)^{1-\alpha} \, dx \right).$$
If we replace the $\delta$-approximate max divergence by the $\alpha$-Rényi divergence in the definition of DP, then we arrive at the following:
Definition 4 ([47]). A randomized algorithm $\mathcal{A}: \mathcal{X}^n \to \mathcal{Z}$ satisfies $\rho$-zero-concentrated differential privacy ($\rho$-zCDP) if for all adjacent data sets $X, X' \in \mathcal{X}^n$ and all $\alpha \in (1, \infty)$, we have
$$D_{\alpha}(\mathcal{A}(X) \,\|\, \mathcal{A}(X')) \le \rho \alpha,$$
where $D_{\alpha}(\mathcal{A}(X) \,\|\, \mathcal{A}(X'))$ is the $\alpha$-Rényi divergence between the distributions of $\mathcal{A}(X)$ and $\mathcal{A}(X')$.
zCDP is weaker than pure DP, but stronger than approximate DP in the following sense:
Proposition 5 ([47]). If $\mathcal{A}$ is $\varepsilon$-DP, then $\mathcal{A}$ is $\frac{\varepsilon^2}{2}$-zCDP. If $\mathcal{A}$ is $\rho$-zCDP, then $\mathcal{A}$ is $\left(\rho + 2\sqrt{\rho \log(1/\delta)},\, \delta\right)$-DP for any $\delta > 0$. In particular, if $\varepsilon \le 2\ln(1/\delta)$, then any $\frac{\varepsilon^2}{8 \ln(1/\delta)}$-zCDP algorithm is $(\varepsilon, \delta)$-DP.
Given some (potentially non-private) function $q$, a simple way to achieve differential privacy is to add random noise to $q(X)$. The level of privacy depends on the distribution of the additive noise, as well as the sensitivity of the function $q$:
Definition 6. Let $q: \mathcal{X}^n \to \mathbb{R}^d$ be a function. Define the $\ell_2$-sensitivity of $q$ by $\Delta_2 = \sup_{X \sim X'} \|q(X) - q(X')\|_2$. Define the $\ell_1$-sensitivity of $q$ by $\Delta_1 = \sup_{X \sim X'} \|q(X) - q(X')\|_1$. The suprema are over all pairs of adjacent data sets in $\mathcal{X}^n$.
If $q$ has bounded $\ell_1$-sensitivity, then adding Laplace noise to $q(X)$ gives rise to a pure DP algorithm called the Laplace Mechanism:
Proposition 7 ([88]). Assume $q: \mathcal{X}^n \to \mathbb{R}^d$ has $\ell_1$-sensitivity that is upper bounded by $\Delta_1$. Let $y = (y_1, \ldots, y_d)$ be a vector of i.i.d. Laplace random variables with variance $\frac{2\Delta_1^2}{\varepsilon^2}$: i.e. $y_i \sim \mathrm{Lap}\left(\frac{\Delta_1}{\varepsilon}\right)$ has probability density function $f_{y_i}(t) = \frac{\varepsilon}{2\Delta_1} \exp\left(-\frac{\varepsilon |t|}{\Delta_1}\right)$ for $t \in \mathbb{R}$. Then, the Laplace Mechanism $\mathcal{A}(X) := q(X) + y$ is $\varepsilon$-DP.
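As an illustration, the following sketch instantiates the Laplace Mechanism for the mean of $n$ scalars in $[0,1]$; under the swap notion of adjacency used here, swapping one entry changes the sum by at most $1$, so the $\ell_1$-sensitivity of the mean is $1/n$. The parameter values are illustrative.

```python
import numpy as np

def laplace_mechanism(q_of_X, l1_sensitivity, eps, rng):
    """Release q(X) + y, where y has i.i.d. Lap(Delta_1 / eps) coordinates
    (Proposition 7); the output is eps-DP."""
    q_of_X = np.atleast_1d(np.asarray(q_of_X, dtype=float))
    return q_of_X + rng.laplace(scale=l1_sensitivity / eps, size=q_of_X.shape)

# Query: the mean of n scalars in [0, 1]. Swapping one entry changes
# the sum by at most 1, so the l1-sensitivity of the mean is 1/n.
rng = np.random.default_rng(0)
X = rng.uniform(size=1000)
private_mean = laplace_mechanism(np.mean(X), l1_sensitivity=1 / len(X),
                                 eps=1.0, rng=rng)
print(float(private_mean[0]), float(np.mean(X)))  # close for this n and eps
```

Note that the noise scale $1/(n\varepsilon)$ shrinks with $n$: privacy is cheap for aggregate statistics over large data sets.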
If $q$ has bounded $\ell_2$-sensitivity, then adding Gaussian noise to $q(X)$ gives rise to a zCDP (and approximate DP) algorithm called the Gaussian Mechanism:
Proposition 8 ([47]). Assume $q: \mathcal{X}^n \to \mathbb{R}^d$ has $\ell_2$-sensitivity that is upper bounded by $\Delta_2$. Define the Gaussian Mechanism by $\mathcal{A}(X) := q(X) + u$ for $u \sim \mathcal{N}_d(0, \sigma^2 I_d)$. Then, $\mathcal{A}$ is $\rho$-zCDP if $\sigma^2 \ge \frac{\Delta_2^2}{2\rho}$. Moreover, if $\varepsilon \le 2\ln(1/\delta)$ and $\sigma^2 \ge \frac{4\Delta_2^2 \ln(1/\delta)}{\varepsilon^2}$, then $\mathcal{A}$ is $(\varepsilon, \delta)$-DP.
Three fundamental properties of differentially private mechanisms are post-processing, composition, and group privacy. Post-processing guarantees that if we are given a DP algorithm $\mathcal{A}$ and we use $\mathcal{A}(X)$ to do some further computations, then these computations are still DP. More precisely:
Proposition 9 (Post-processing [88, 47]). Let $\mathcal{A}: \mathcal{X}^n \to \mathcal{Z}$ be $(\varepsilon,\delta)$-DP (respectively, $\rho$-zCDP) and let $g: \mathcal{Z} \to \mathcal{Z}'$ be an arbitrary (potentially randomized) function. Then, $g \circ \mathcal{A}: \mathcal{X}^n \to \mathcal{Z}'$ is $(\varepsilon,\delta)$-DP (respectively, $\rho$-zCDP).
Combining multiple DP algorithms results in a DP algorithm, but the privacy level of the combined algorithm degrades. For example, if $\mathcal{A}_1, \ldots, \mathcal{A}_R$ are $\varepsilon$-DP and we run each of these algorithms independently on the same data set (or overlapping data sets) to obtain $\mathcal{A}(X) = (\mathcal{A}_1(X), \ldots, \mathcal{A}_R(X))$, then $\mathcal{A}$ is $R\varepsilon$-DP. For zCDP, we have the following composition property:
Lemma 10 (zCDP Composition [47]). Suppose $\mathcal{A}: \mathcal{X}^n \to \mathcal{Z}$ satisfies $\rho$-zCDP and $\mathcal{A}': \mathcal{X}^n \times \mathcal{Z} \to \mathcal{Z}'$ satisfies $\rho'$-zCDP (as a function of its first argument). Define the adaptive composition of $\mathcal{A}$ and $\mathcal{A}'$, $\mathcal{A}'': \mathcal{X}^n \to \mathcal{Z}'$, by $\mathcal{A}''(X) = \mathcal{A}'(X, \mathcal{A}(X))$. Then $\mathcal{A}''$ satisfies $(\rho + \rho')$-zCDP. Moreover, the adaptive composition of $R$ $\rho$-zCDP mechanisms is an $R\rho$-zCDP mechanism.
The analogue of Lemma 10 for approximate DP is the so-called "Advanced Composition Theorem":

Theorem 11 (Advanced Composition [88]). Let $\varepsilon \ge 0$, $\delta, \delta' \in [0,1)$. The adaptive composition of $R$ $(\varepsilon, \delta)$-DP mechanisms is $(\varepsilon', R\delta + \delta')$-DP for
$$\varepsilon' = \sqrt{2R \ln(1/\delta')}\, \varepsilon + R\varepsilon\left(e^{\varepsilon} - 1\right).$$

The moments accountant [1, Theorem 1] provides a slightly tighter composition bound and is sometimes used instead of Theorem 11.
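The two composition bounds are easy to compare numerically; the sketch below transcribes basic composition and Theorem 11, and shows that for many compositions of a small-$\varepsilon$ mechanism the advanced bound is far smaller:

```python
import math

def basic_composition(eps, R):
    """Composing R eps-DP mechanisms yields an (R * eps)-DP mechanism."""
    return R * eps

def advanced_composition(eps, R, delta_prime):
    """Theorem 11: R adaptive (eps, delta)-DP mechanisms compose to
    (eps_prime, R*delta + delta_prime)-DP with eps_prime as below."""
    return (math.sqrt(2.0 * R * math.log(1.0 / delta_prime)) * eps
            + R * eps * (math.exp(eps) - 1.0))

# Composing R = 1000 mechanisms with eps = 0.01 each:
basic = basic_composition(0.01, 1000)                          # 10.0
advanced = advanced_composition(0.01, 1000, delta_prime=1e-5)  # about 1.62
print(basic, advanced)
```

The $\sqrt{R}$ (rather than $R$) scaling of the first term is what makes iterative DP algorithms with many rounds feasible.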
Lastly, differential privacy facilitates the accounting of privacy risks experienced by groups of people, such as households:
Proposition 12 (Group Privacy [88, 47]). Assume $\mathcal{A}: \mathcal{X}^n \to \mathcal{Z}$ is $(\varepsilon,\delta)$-DP. Let $X$ and $X'$ be any data sets that differ in at most $k$ rows (i.e. $|X \Delta X'| \le 2k$). Then,
$$\mathbb{P}(\mathcal{A}(X) \in S) \le e^{k\varepsilon}\, \mathbb{P}(\mathcal{A}(X') \in S) + k e^{(k-1)\varepsilon} \delta,$$
for all measurable subsets $S \subseteq \mathcal{Z}$. If $\mathcal{A}$ is $\rho$-zCDP, then
$$D_{\alpha}(\mathcal{A}(X) \,\|\, \mathcal{A}(X')) \le (k^2 \rho)\alpha \quad \text{for all } \alpha \in (1, \infty).$$
In the next subsection, we give high-level descriptions of our contributions in areas related to differ-
ential privacy.
1.1.2 Contributions
1.1.2.1 Optimal Differentially Private Learning with Public Data (Chapter 2)
Differential privacy usually comes at the expense of reduced model accuracy or increased sample com-
plexity. In practice, we often have access to publicly available data that is devoid of privacy concerns. A
recent line of work has studied whether and how this public data might allow for training more accurate
DP models. In Chapter 2, we assume access to a given amount of public data and resolve the following
fundamental open questions:
1. What is the optimal (worst-case) error of a DP model trained over a private data set while having access to side public data? What algorithms are theoretically optimal?
2. How can we harness public data to improve DP model training in practice?
We investigate these questions in both the local and central models of DP. To answer the first question,
we present tight lower and upper bounds that characterize the optimal error rates (up to constant factors)
for three fundamental problems: mean estimation, empirical risk minimization, and stochastic convex
optimization. Our analysis reveals that incorporating public data reduces the sample complexity of training
DP models. Intriguingly, we demonstrate that the optimal error rates can be attained (up to constants) by
either discarding private data and training a public model or treating public data as if it were private and
employing an optimal DP algorithm.
To address the second question, we develop novel algorithms that are “even more optimal” than the
asymptotically optimal approaches described above (i.e. our algorithms have better constants). For the case
of local DP mean estimation with public data, our algorithm is optimal including constants. Empirically,
our algorithms show advantages over state-of-the-art techniques for DP model training with side access
to public data.
1.1.2.2 Differentially Private Federated Learning Without a Trusted Server (Chapter 3)
Machine learning tasks often involve data from different “silos” (e.g. cell-phone users or organizations such
as hospitals) containing sensitive information (e.g. location or health records). In federated learning (FL),
each silo stores its data locally and a central server coordinates updates among different silos to achieve a
global learning objective. Individuals (e.g. cell-phone users or patients) participating in FL might not trust
the central server or other silos with their personal data. In this case, each silo (e.g. hospital) must ensure
that the data of each person (e.g. patient) in that silo is private. This requirement motivates the exploration
of Inter-Silo Record-Level Differential Privacy (ISRL-DP), which necessitates that the communications of silo $i$ adhere to record/item-level differential privacy. By enforcing ISRL-DP, the data of each person within silo $i$ cannot be leaked—even if an attacker has access to the central server or the messages that are sent between silos during the FL process.
ISRL-DP differs from well-studied privacy notions. Central and user-level DP assume that individuals
trust the server and other silos, and fail to protect against an attacker with server access. On the other
hand, local DP assumes that individuals have no trust in anyone, including their own silo. Positioned
between central and local DP, ISRL-DP makes the realistic assumption (in the context of cross-silo FL) that
individuals trust their own silo but not the server or other silos.
In Chapter 3, we provide upper and lower bounds on the error of ISRL-DP FL algorithms. Our bounds
are tight up to logarithmic factors for ISRL-DP FL with convex/strongly convex loss functions and homo-
geneous (i.i.d.) silo data. Remarkably, we demonstrate that similar (nearly optimal) bounds can be achieved
for smooth losses with heterogeneous (non-i.i.d.) silo data distributions by utilizing a novel accelerated
ISRL-DP algorithm. Additionally, we establish tight upper and lower bounds for ISRL-DP federated em-
pirical risk minimization, and leverage acceleration to attain the optimal bounds in fewer communica-
tion rounds compared to the state-of-the-art approaches. Further, by incorporating a secure “shuffler” to
anonymize silo messages (but without a trusted server), our algorithm achieves the optimal central DP
rates under more practical trust assumptions. Moreover, we extend our analysis to non-convex loss func-
tions. We give upper and lower bounds for ISRL-DP FL with several classes of non-convex loss functions.
Numerical experiments show favorable privacy-accuracy tradeoffs for our algorithms in classification and
regression tasks with both convex and non-convex loss functions.
1.1.2.3 Differentially Private Stochastic Optimization with Large Worst-Case Lipschitz Parameter (Chapter 4)
In Chapter 4, we consider DP stochastic optimization (SO) with loss functions whose worst-case Lipschitz
parameter across all data points can be extremely large. Most research on DP SO has assumed that the loss
functions exhibit uniform Lipschitz continuity over the data, which guarantees that stochastic gradients
remain uniformly bounded for all data points. While this assumption is convenient, it often leads to pes-
simistic error bounds. In practical learning scenarios, the worst-case (uniform) Lipschitz parameter of the
loss over all data points may be excessively large (e.g. due to outliers). In such cases, the error bounds for
DP SO, which scale with the worst-case Lipschitz parameter of the loss, become vacuous.
To address these limitations, we provide near-optimal error bounds that are independent of the worst-
case Lipschitz parameter of the loss. Rather than assuming a bound on the worst-case Lipschitz parameter,
we adopt a milder assumption that stochastic gradients possess bounded $k$-th order moments, where $k \ge 2$. In contrast to works on uniformly Lipschitz DP SO, our error bounds scale with the $k$-th moment bound
instead of the uniform Lipschitz parameter of the loss. Thus, the error guarantee of our algorithm is signif-
icantly better than the uniformly Lipschitz DP SO guarantees when dealing with outliers or heavy-tailed
data. Prior works [253, 137] gave suboptimal error bounds for DP SO with smooth convex loss functions
possessing stochastic gradients with bounded k-th moment. By comparison, we obtain the first error
bounds that are optimal up to a logarithmic factor. Moreover, our bounds do not require the loss function
to be differentiable or smooth. Finally, we initiate the study of non-convex non-uniformly Lipschitz loss
functions.
1.2 Fair (and Private) Learning
Machine learning algorithms are increasingly used to inform high-stakes decisions—such as whether to
release someone from prison or let them borrow money. These algorithms can introduce or exacerbate
unfairness in decision-making processes, even in cases where the training data itself does not exhibit
discriminatory patterns. The algorithmic fairness literature aims to remedy such discrimination.
1.2.1 Background
We briefly discuss some algorithmic fairness preliminaries. In the context of fairness, we will use $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ to denote the data universe, to distinguish between feature vectors $x \in \mathcal{X}$ and labels $y \in \mathcal{Y}$. We use the terminology sensitive attribute to refer to a feature—such as race, gender, or age—that we want to protect from discrimination. $Z \in \mathcal{Z}^n$ denotes a data set that is randomly drawn from some (joint) distribution.

Definition 13 (Fairness Notions). Let $\mathcal{A}: \mathcal{Z} \to \mathcal{Y}$ be a classifier.

• $\mathcal{A}$ satisfies demographic parity [86] if the predictions $\mathcal{A}(Z)$ are statistically independent of the sensitive attributes.

• $\mathcal{A}$ satisfies equalized odds [115] if the predictions $\mathcal{A}(Z)$ are conditionally independent of the sensitive attributes given $Y = y$ for all $y$.

• Let $y^\star$ denote an advantaged outcome (e.g. not defaulting on a loan). $\mathcal{A}$ satisfies equality of opportunity [115] with respect to $y^\star$ if the predictions $\mathcal{A}(Z)$ are conditionally independent of the sensitive attributes given $Y = y^\star$.
In practice, it is generally impossible to achieve or verify exact fairness—i.e. exact (conditional) independence of $\mathcal{A}(Z)$ and the sensitive attributes. Instead, one uses certain fairness violation metrics that quantify the degree of (conditional) independence between random variables; e.g., $L_\infty$ distance [86, 115], mutual information [140, 222, 233, 275, 58, 224], or Pearson correlation [273, 38].
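As an illustration of such a metric, the following sketch computes an $L_\infty$-style demographic parity violation for a classifier with discrete predictions and a discrete sensitive attribute; the particular estimator below (comparing group-conditional prediction rates to the overall rate) is one common choice, not the only one used in the literature.

```python
import numpy as np

def demographic_parity_violation(predictions, sensitive):
    """Max over classes y and groups s of |P(Yhat = y | S = s) - P(Yhat = y)|,
    an L-infinity-style measure of dependence between predictions and S.
    It equals 0 iff the empirical prediction distribution is identical
    in every group."""
    predictions = np.asarray(predictions)
    sensitive = np.asarray(sensitive)
    violation = 0.0
    for y in np.unique(predictions):
        overall_rate = np.mean(predictions == y)
        for s in np.unique(sensitive):
            group_rate = np.mean(predictions[sensitive == s] == y)
            violation = max(violation, abs(group_rate - overall_rate))
    return float(violation)

rng = np.random.default_rng(0)
sensitive = rng.integers(0, 2, size=100_000)
fair_preds = rng.integers(0, 2, size=100_000)  # independent of the attribute
unfair_preds = sensitive.copy()                # fully determined by the attribute
print(demographic_parity_violation(fair_preds, sensitive))    # near 0
print(demographic_parity_violation(unfair_preds, sensitive))  # near 0.5
```

Analogous estimators for equalized odds and equality of opportunity condition on $Y = y$ (or $Y = y^\star$) before comparing rates.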
1.2.2 Contributions
1.2.2.1 A Stochastic Optimization Framework for Fair Risk Minimization (Chapter 5)
While large-scale empirical risk minimization (ERM) has proven successful in achieving high accuracy
across diverse machine learning tasks, large-scale fair ERM poses a significant challenge due to the incompatibility between these fairness constraints and stochastic optimization methods. Chapter 5 studies
the problem of fair classification with large models and data sets, requiring stochastic solvers. Here, the
term “stochastic” refers to training algorithms that use minibatches of data in each iteration (i.e. stochastic
optimization). Previous fair learning algorithms were impractical in the large setting because they either
required computations on large batches of data in each iteration or were not guaranteed to converge. We
devise the first stochastic in-processing fairness algorithm—which we call FERMI—with guaranteed con-
vergence. We prove that FERMI converges to an approximate solution of both the fair ERM objective
function and the fair population loss function. Empirically, we show that FERMI is amenable to stochas-
tic solvers with multiple (non-binary) sensitive attributes and non-binary targets, performing well even
with minibatch size as small as one. Extensive experiments show that FERMI achieves the most favorable
tradeoffs between fairness violation and test accuracy compared with state-of-the-art baselines. These benefits are especially significant with small batch sizes and for non-binary classification with a large number of sensitive attributes, making FERMI a practical, scalable fairness algorithm.
1.2.2.2 Stochastic Differentially Private and Fair Learning (Chapter 6)
Despite the promise of fair learning algorithms in mitigating discrimination in ML systems, these algo-
rithms can still leak sensitive information, such as individuals’ health or financial records. Previously
proposed algorithms for DP fair learning are either not guaranteed to converge or require a full batch of
data in each iteration of the algorithm to converge. In Chapter 6, we provide the first stochastic differen-
tially private algorithm—DP FERMI—for fair learning that is guaranteed to converge. A byproduct of our
analysis is the first convergence guarantee for a DP algorithm for solving nonconvex-strongly concave
min-max optimization problems. Numerical experiments demonstrate that DP FERMI consistently out-
performs state-of-the-art baselines by a significant margin, and is applicable to larger scale ML problems.
1.3 Notation
In this section, we collect some notation and definitions that will be used throughout the thesis.
Let $\|\cdot\|$ be the $\ell_2$ norm. $\mathcal{W}$ denotes a convex subset of $\mathbb{R}^d$, which we will usually assume to be compact with $\ell_2$ diameter $D$. $\Pi_{\mathcal{W}}(z) := \operatorname{argmin}_{w \in \mathcal{W}} \|w - z\|^2$ denotes the Euclidean projection operator. Define $B = \{x \in \mathbb{R}^d \mid \|x\| \le 1\}$, the unit $\ell_2$-ball in $\mathbb{R}^d$.

Function $g: \mathcal{W} \to \mathbb{R}$ is $\mu$-strongly convex if $g(\alpha w + (1-\alpha)w') \le \alpha g(w) + (1-\alpha) g(w') - \frac{\alpha(1-\alpha)\mu}{2}\|w - w'\|^2$ for all $\alpha \in [0,1]$ and all $w, w' \in \mathcal{W}$. If $\mu = 0$, we say $g$ is convex. For convex $f(\cdot, x)$, denote any subgradient of $f(w,x)$ w.r.t. $w$ by $\nabla f(w,x) \in \partial_w f(w,x)$: i.e. $f(w', x) \ge f(w,x) + \langle \nabla f(w,x),\, w' - w \rangle$ for all $w' \in \mathcal{W}$. If $f(\cdot, x)$ is differentiable, then $\nabla f(w,x)$ is the gradient of $f(w,x)$ with respect to $w$.

Function $h: \mathcal{W} \to \mathbb{R}^m$ is $L$-Lipschitz if $\|h(w) - h(w')\| \le L \|w - w'\|$ for all $w, w' \in \mathcal{W}$. For a convex or differentiable function $h$, we can equivalently define $h$ to be $L$-Lipschitz if $\sup_{w \in \mathcal{W}} \|\nabla h(w)\| \le L$ for all subgradients $\nabla h(w) \in \partial h(w)$. Function $f: \mathcal{W} \times \mathcal{X} \to \mathbb{R}$ is $L$-uniformly-Lipschitz (w.r.t. $w$) if $\sup_{w \in \mathcal{W}, x \in \mathcal{X}} \|\nabla f(w,x)\| \le L$. A differentiable function $h(\cdot)$ is $\beta$-smooth if its derivative $\nabla h$ is $\beta$-Lipschitz. If $h$ is $\beta$-smooth and $\mu$-strongly convex, denote its condition number by $\kappa := \beta / \mu$.

For functions $a = a(\theta)$ and $b = b(\phi)$ of input parameter vectors $\theta$ and $\phi$, we write $a \lesssim b$ or $a = O(b)$ if there is an absolute constant $C > 0$ such that $a \le C b$ for all values of the input parameter vectors $\theta$ and $\phi$. Equivalently, we may write this condition as $b \gtrsim a$ or $b = \Omega(a)$. Write $a = \widetilde{O}(b)$ if $a \lesssim \log^2(\psi)\, b$ for some parameters $\psi$. Similarly, write $b = \widetilde{\Omega}(a)$ if $b \log^2(\psi) \gtrsim a$.
Chapter 2

Optimal Differentially Private Learning With Public Data
This chapter is based on [179].
Summary
Differential Privacy (DP) ensures that training a machine learning model does not leak private data. How-
ever, the cost of DP is lower model accuracy or higher sample complexity. In practice, we may have access
to auxiliary public data that is free of privacy concerns. This has motivated the recent study of what role
public data might play in improving the accuracy of DP models. In this work, we assume access to a given
amount of public data and settle the following fundamental open questions:
1. What is the optimal (worst-case) error of a DP model trained over a private data set while having access
to side public data? What algorithms are optimal?
2. How can we harness public data to improve DP model training in practice?
We consider these questions in both the local and central models of DP. To answer the first question, we
prove tight (up to constant factors) lower and upper bounds that characterize the optimal error rates of
three fundamental problems: mean estimation, empirical risk minimization, and stochastic convex opti-
mization. We prove that public data reduces the sample complexity of DP model training. Perhaps surpris-
ingly, we show that the optimal error rates can be attained (up to constants) by either discarding private
data and training a public model, or treating public data like it’s private data and using an optimal DP
algorithm. To address the second question, we develop novel algorithms which are “even more optimal”
(i.e. better constants) than the asymptotically optimal approaches described above. For local DP mean
estimation, our algorithm is optimal including constants. Empirically, our algorithms show benefits over
the state-of-the-art for DP model training with side access to public data.
2.1 Introduction
Training machine learning (ML) models on people’s data can leak sensitive information, violating their
privacy [103, 228, 50]. Differential Privacy (DP) prevents such leaks by providing a rigorous guarantee that
no attacker can learn too much about any individual’s data [87]. DP has been successfully deployed by
various companies [13, 238, 91, 74], and by government agencies such as the U.S. Census Bureau [247].
However, a major hindrance to more widespread adoption of DP in training ML models is that DP models
are less accurate than their non-private counterparts.
Leveraging public data—that is free of privacy concerns—is a promising and practically important av-
enue for closing the accuracy gap between DP and non-private models [207, 21, 97, 9]. Public data may be
provided by people who volunteer their data (e.g. product developers or early testers) [60, 97] or by people
who receive incentives in exchange for their data. Data that is generated synthetically [239, 252, 42, 119]
or released through a legal process [152] may serve as additional sources of public data.
A recent line of work has sought to understand the role of public data in DP learning [33, 97, 134,
256, 285, 168, 173, 207, 208, 271, 8, 40, 99, 9]. Despite progress on both the theoretical and empirical
fronts, important fundamental questions about DP learning with public data remain open. In this work,
we consider the following fundamental questions:
Question 1. What is the optimal (minimax) error of DP learning with public data?
Question 2. How can we effectively harness public data in practice? Can we provide
improved performance (e.g. smaller error including constants) over the asymptotically
optimal algorithms?
Contributions: To address Question 1, we characterize the optimal minimax error (up to constants) of
semi-DP [36, 8] learning algorithms—algorithms that are DP with respect to private data, but not neces-
sarily DP w.r.t. public data (Definition 14). We provide tight (up to constants) lower and upper bounds for
three fundamental problems: mean estimation, empirical risk minimization (ERM), and stochastic convex
optimization (SCO). The ERM problem corresponds to minimizing the training loss of a ML model, while
SCO corresponds to minimizing test loss. We consider both the local [144, 85] and central [87] models of
(pure and approximate) DP. For population mean estimation and SCO, we assume access to in-distribution
public data. For ERM, the public data may be out-of-distribution.
Our results reveal both the limitations and power of using public data for DP model training. On the negative side, our lower bounds show that obtaining asymptotic improvements beyond the optimal DP rates (without public data) is impossible if the number of public samples $n_{\text{pub}}$ is small (e.g. $n_{\text{pub}} = O(1)$), or if $d/\varepsilon$ is small relative to the total number of samples $n = n_{\text{pub}} + n_{\text{priv}}$.* On the positive side, we show that with enough public samples or in the high privacy regime (small $\varepsilon$), asymptotic improvements over the optimal DP rates are indeed possible. For example, without public data, if $d = n$ and $\varepsilon = 1$, then no $\varepsilon$-DP algorithm can estimate the mean of a bounded random variable with $\ell_2$-accuracy better than $\Theta(1)$. On the other hand, we show that with $n_{\text{pub}} = \sqrt{n}$ or $n_{\text{pub}} = 0.01n$ public samples at our disposal, the minimax error of DP estimation improves significantly to $\Theta(1/\sqrt{n})$ or $\Theta(1/n)$. Note that $\Theta(1/n)$ is the best achievable rate even without any privacy constraints. More generally—for population mean estimation and SCO, with either central or local DP—we get "privacy for free" (up to constants) whenever $n_{\text{pub}} \ge cn$ for some absolute constant $c$. See Figure 2.1 for some of the minimax optimal error rates for semi-DP and semi-LDP learning in this paper.

*Here $d$ denotes the number of model parameters and $\varepsilon$ is the DP parameter.
[Figure 2.1: a table of minimax optimal error rates for mean estimation (population MSE, Theorem 15), ERM (excess empirical risk, Theorems 3 and 21), and SCO (excess population risk, Theorem 23) under semi-DP, and for mean estimation (Theorem 25) and SCO (Theorem 29) under semi-LDP, along with the regimes in which the semi-DP (resp. semi-LDP) error is smaller than the optimal DP (resp. LDP) error. The rate expressions themselves are omitted here.]

Figure 2.1: Minimax optimal error rates for $\varepsilon$-semi-DP and $\varepsilon$-semi-LDP. *Semi-DP mean estimation and SCO lower bounds are only tight if $n_{\text{pub}} = O(n\varepsilon/d)$ or $d = O(1)$. Dependence on range and Lipschitz parameters, constraint set diameter omitted. See main text/Appendix for $\delta > 0$ and strongly convex SCO results.
In all of the cases that we study (mean estimation, ERM, and SCO under both local and central semi-DP), tight upper bounds on the error can be obtained by taking the minimum of two errors: 1) the error of the optimal DP algorithm that privatizes all of the data including public data (e.g. DP-SGD for DP SCO [29]); and 2) the error of an appropriate "throw-away" algorithm that discards the private data and uses an optimal non-private algorithm (e.g. SGD for SCO) on the public data. Neither of these algorithms seems to leverage the full data set in the most effective way: throw-away completely ignores valuable private data, while standard DP algorithms (intuitively) seem to be adding more noise than necessary to privatize public data. This observation leads us to Question 2.
To address Question 2, we develop novel semi-DP algorithms that are (central or local) DP w.r.t. the private samples, but not DP in the traditional sense. Our algorithms add less noise than would be necessary to privatize the full data set (including public data). By doing so, we are able to harness the public data and achieve optimal error bounds with significantly improved constants over the asymptotically optimal algorithms discussed above. In the case of locally DP mean estimation, our algorithm is truly optimal, including constants. We also support this finding empirically, showing that our algorithms outperform the asymptotically optimal algorithms discussed above, even when the optimal DP algorithm is pre-trained on the public data.
In Appendix A.1, we give an overview of our techniques and further discussion of related work.
2.1.1 Semi-Differential Privacy

We define semi-DP [36, 8], a relaxation of DP that only requires A to preserve the privacy of a pre-specified subset of people's data X_priv ⊂ X:

Definition 14 (Semi-Differential Privacy [36, 8]). For x_i ∈ X, say two data sets X, X^(i) ∈ X^n are x_i-adjacent if they differ only in the data point x_i: X Δ X^(i) = {x_i, x'_i} for some x'_i ∈ X. Algorithm A : X^n → W is (centrally) (ε, δ)-semi-DP if for all private samples x_i ∈ X_priv, all pairs of x_i-adjacent data sets X, X^(i) ∈ X^n, and all measurable S ⊂ W, we have

  P(A(X) ∈ S) ≤ e^ε P(A(X^(i)) ∈ S) + δ.

We can define ρ-semi-zCDP analogously. We define semi-LDP in Section 2.3, which is a similar relaxation of local DP (LDP) [144, 85]. Local DP is a stronger notion of DP that does not allow data to be leaked even to the data scientist or central server. To distinguish Definition 14 from semi-LDP, we sometimes refer to algorithms that satisfy Definition 14 as centrally semi-DP.
2.1.2 Roadmap
We begin by studying the central model of semi-DP in Section 2.2. We give tight error bounds (up to
constants) for mean estimation in Section 2.2.1 and provide a novel algorithm with better constants than
the asymptotically optimal algorithms. In Section 2.2.2 and Section 2.2.3, we characterize the optimal
excess risk of semi-DP ERM and SCO respectively. We give an improved semi-DP algorithm for SCO
in Appendix A.2.4.1. In Section 2.3, we turn to the local model of semi-DP. We characterize the optimal
error rates (up to constants) for mean estimation in Section 2.3.1 and SCO in Section 2.3.2. Along the way,
we develop novel semi-LDP algorithms with smaller error than the asymptotically optimal algorithms. We
present experimental evaluations of our algorithms in Sections 2.4 and A.4.
2.2 Optimal Centrally Private Learning with Public Data

2.2.1 Optimal Semi-DP Mean Estimation
In this section, we determine the minimax optimal semi-DP error rates for estimating the mean of an unknown bounded distribution.† Proofs are deferred to Appendix A.2.1.2.

Consider the following problem: given n_priv private samples X_priv ⊆ B = {x ∈ R^d : ‖x‖ ≤ 1} and n_pub public samples X_pub ⊆ B, drawn from an unknown distribution P on B, estimate the population mean A(X) ≈ E_{x∼P}[x] subject to the constraint that A satisfies semi-DP. Defining n = n_priv + n_pub, we will characterize the minimax squared error of population mean estimation under (ε, δ)-semi-DP:

  M_pop(ε, δ, n_priv, n, d) := inf_{A ∈ A_{ε,δ}(B)} sup_P E_{A, X∼P^n}[ ‖A(X) − E_{x∼P}[x]‖² ],    (2.1)

where A_{ε,δ}(B) denotes the set of all (ε, δ)-semi-DP estimators A : B^n → B, and |X_priv| = n_priv.
Theorem 15. Let ε ≤ d/8 and either n_pub ≲ nε/d or d ≲ 1. Then, there exist absolute constants c and C, with 0 < c ≤ C, such that

  c · min{ 1/n_pub , d²/(n²ε²) + 1/n } ≤ M_pop(ε, δ = 0, n_priv, n, d) ≤ C · min{ 1/n_pub , d²/(n²ε²) + 1/n }.    (2.2)

The upper bound in (2.2) holds for any n_pub, d. Further, for any n_pub, d and any δ ∈ [0, ε], we have

  c · min{ 1/n_pub , 1/(n²ε²) + 1/n } ≤ M_pop(ε, δ, n_priv, n, d) ≤ C · min{ 1/n_pub , d·ln(1/δ)/(n²ε²) + 1/n }.    (2.3)

†Our analysis also extends to unbounded distributions with bounded k-th order moment for some k ≥ 2.
If n_pub > n²ε²/d², then the ε-semi-DP upper bound in (2.2) is O(1/n_pub), which improves over the optimal ε-DP MSE of Θ(min(1, (d/(nε))²)). If n_pub = Θ(n) (e.g. n_pub = 0.01n), then we get "privacy for free": the semi-DP error rates match the optimal non-private rate Θ(1/n) (up to constants). This is also the case for semi-DP SCO in Section 2.2.3, semi-LDP mean estimation in Section 2.3.1, and semi-LDP SCO in Section 2.3.2. We conjecture that the upper bound in (2.3) is tight for all d, but proving a tight lower bound for the δ > 0, d ≫ 1 case is challenging.
To prove our lower bounds, we develop a new technique: Theorem 79 in Appendix A.2.1.2, which is a semi-DP analogue of Fano's inequality [270, 82]. For the upper bounds, the first term in each minimum is attained by the throw-away estimator A(X) = (1/n_pub) Σ_{x ∈ X_pub} x. The second terms in the minima (2.2) and (2.3) are attained by the DP (hence semi-DP) Laplace and Gaussian mechanisms, respectively [88].

In the next subsection, we show that the asymptotically optimal algorithms discussed above have suboptimal constants. In particular, we provide improved semi-DP estimators with smaller error.
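To make these two baselines concrete, here is a minimal Python sketch of the throw-away estimator and the Gaussian-mechanism estimator for data in the unit ball. The noise calibration shown is the standard (ε, δ)-DP Gaussian-mechanism choice and is an assumption for illustration, not the exact constants used in our proofs:

```python
import numpy as np

def throw_away_mean(X_pub):
    """Average only the public samples; trivially semi-DP, since the
    private samples are never touched."""
    return X_pub.mean(axis=0)

def gaussian_mech_mean(X_priv, X_pub, eps, delta, rng):
    """(eps, delta)-DP (hence semi-DP) mean of ALL samples in the unit ball.

    Replacing one sample changes the mean by at most 2/n in l2 norm, so we
    add Gaussian noise with the standard calibration
    sigma = (2/n) * sqrt(2 * ln(1.25/delta)) / eps.
    """
    X = np.concatenate([X_priv, X_pub])
    n, d = X.shape
    sigma = (2.0 / n) * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return X.mean(axis=0) + rng.normal(0.0, sigma, size=d)
```

Returning whichever of the two estimates has smaller worst-case error matches the upper bound in (2.3) up to constants.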
2.2.1.1 An "Even More Optimal" Semi-DP Algorithm for Mean Estimation
Before presenting our improved semi-DP algorithms, we precisely describe the worst-case error of the optimal algorithms discussed in the preceding subsection. We will consider ρ-semi-zCDP, which facilitates a sharp characterization of the privacy of the Gaussian mechanism. Note that the lower bound in (2.3) also holds for semi-zCDP, since ε²/ln(1/δ)-zCDP implies (O(ε), δ)-DP, by Proposition 5.

Definition 16. Let P(B, V) be the collection of all distributions P on R^d such that for any x ∼ P, we have Var(x) = V² and ‖x‖ ≤ B, P-almost surely.
Lemma 17. The error of the ρ-semi-zCDP throw-away algorithm A(X) = (1/n_pub) Σ_{x ∈ X_pub} x is

  sup_{P ∈ P(B,V)} E_{X∼P^n}[ ‖A(X) − E_{x∼P}[x]‖² ] = V²/n_pub.

Further, let X̄ be the average of the public and private samples. The minimax error of the ρ-zCDP Gaussian mechanism G(X) = X̄ + N(0, σ² I_d) is

  inf_{ρ-zCDP G} sup_{P ∈ P(B,V)} E_{G, X∼P^n}[ ‖G(X) − E_{x∼P}[x]‖² ] = 2dB²/(ρn²) + V²/n.
Intuitively, it seems like the estimators in Lemma 17 do not harness the public and private data in the most effective way possible, despite being optimal up to constants. Throw-away fails to utilize the private data at all, while the Gaussian mechanism gives equal weight to X_priv and X_pub (regardless of ρ, d, n_priv), and provides unnecessary privacy for X_pub. We now present a ρ-semi-zCDP estimator that is "even more optimal" than the estimators in Lemma 17, meaning our estimator has smaller worst-case error (accounting for constants). We define the family of Weighted-Gaussian estimators:

  A_r(X) := Σ_{x ∈ X_priv} r·x + Σ_{x ∈ X_pub} ((1 − n_priv·r)/n_pub)·x + N(0, σ_r² I_d),    (2.4)

for r ∈ [0, 1/n_priv]. This estimator can recover both the throw-away and standard Gaussian mechanisms by choosing r = 0 or r = 1/n, respectively. Intuitively, as ρ/d shrinks, the accuracy cost of adding privacy noise grows, so we should choose smaller r to reduce the sensitivity of A_r. On the other hand, smaller r increases the variance of A_r on X_pub. By choosing r optimally (depending on ρ, d, n_priv, B, V), A_r achieves smaller MSE than both throw-away and the Gaussian mechanism:
Proposition 18. There exists r > 0 such that, with proper choice of σ_r, A_r is ρ-semi-zCDP, and

  sup_{P ∈ P(B,V)} E_{X∼P^n}[ ‖A_r(X) − E_{x∼P}[x]‖² ] < min{ V²/n_pub , 2dB²/(ρn²) + V²/n }.    (2.5)

Further, if V²/n_pub ≤ 2dB²/(ρn²), then the quantitative advantage of A_r is

  sup_{P ∈ P(B,V)} E_{X∼P^n}[ ‖A_r(X) − E_{x∼P}[x]‖² ] ≤ ( (q² + q·s²)/(q² + 2q·s² + s⁴) ) · min{ V²/n_pub , 2dB²/(ρn²) + V²/n },    (2.6)

where q = 2 + n_priv·ρ·V²/(dB²) and s = V·n_priv·√ρ/(B·√d·n_pub).
Let us consider some numerical examples to understand the advantage of our Weighted-Gaussian over the asymptotically optimal baselines (throw-away and the Gaussian mechanism).

Example 19. Consider (2.6) and assume n = 10,000 in the following examples:

• The role of d/ρ: Fix n_pub = 0.01n and B = 25V. If ρ = 1 and d = n, then the MSE of our Weighted-Gaussian A_r is smaller than the MSE of the throw-away and standard Gaussian baselines by a factor of ≈ 1.08. If d changes to d = n/10 or ρ changes to ρ = 10, then the MSE of A_r is ≈ 1.78 times smaller than that of the baselines.

• The role of n_pub and d/ρ: Fix d = √n = n/100 and B = 25V. If ρ = 0.1 and n_pub = 0.01n, then the MSE of our Weighted-Gaussian A_r is smaller than the MSE of throw-away and standard Gaussian by a factor of ≈ 1.78. If n_pub changes to n_pub = 0.008n, then the MSE of A_r is ≈ 1.98 times smaller than that of the baselines. If n_pub = n^{1/4} = 0.001n and ρ = 0.01, then the MSE of A_r is ≈ 1.80 times smaller than that of the baselines.

• The role of B, n_pub, and d/ρ: Fix B = 500V and ρ = 1. If n_pub = n^{1/4} and d = √n, then the MSE of our Weighted-Gaussian A_r is ≈ 1.20 times smaller than the MSE of throw-away and standard Gaussian. If n_pub shrinks to n_pub = n^{1/10}, then the MSE of A_r is ≈ 1.80 times smaller than the baselines. If B shrinks to B = 1 and n_pub, d increase to n_pub = n/2 and d = n, then the MSE of our algorithm is 1.2 times smaller than the MSE of the baselines.
For ε-semi-DP, using Laplace noise instead of Gaussian noise in (2.4) yields an estimator with smaller error than the ε-DP Laplace mechanism and throw-away.
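The Weighted-Gaussian family (2.4) can be sketched in Python as follows. The noise scale below uses the standard ρ-zCDP Gaussian-mechanism calibration for a map whose ℓ2-sensitivity to one private sample is 2rB; the exact σ_r in our analysis may differ in constants:

```python
import numpy as np

def weighted_gaussian(X_priv, X_pub, r, rho, B, rng):
    """Sketch of the Weighted-Gaussian estimator A_r in (2.4).

    Each private sample gets weight r in [0, 1/n_priv]; the public samples
    share the remaining weight (1 - n_priv * r) / n_pub.  Changing one
    private sample in the ball of radius B moves the output by at most
    2*r*B, so sigma_r = 2*r*B / sqrt(2*rho) is the standard rho-zCDP
    Gaussian-mechanism calibration (an assumption for this sketch).
    """
    n_priv, d = X_priv.shape
    n_pub = X_pub.shape[0]
    w_pub = (1.0 - n_priv * r) / n_pub
    sigma_r = 2.0 * r * B / np.sqrt(2.0 * rho)
    est = r * X_priv.sum(axis=0) + w_pub * X_pub.sum(axis=0)
    return est + rng.normal(0.0, sigma_r, size=d)
```

As in the text, r = 0 recovers throw-away exactly (zero noise, public mean only), while r = 1/n recovers the equally weighted Gaussian mechanism.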
2.2.2 Optimal Semi-DP Empirical Risk Minimization

For a given X = (X_priv, X_pub) ∈ X_priv^{n_priv} × X_pub^{n_pub} and parameter domain W, consider the ERM problem:

  min_{w ∈ W} ( F̂_X(w) := (1/n) Σ_{j=1}^n f(w, x_j) ),

where f(·, x) is a loss function (e.g. squared loss for linear regression). Here, we do not assume that X_priv and X_pub are independent or drawn from the same distribution. We measure the (in-sample) performance of a training algorithm A : X^n → W by its excess empirical risk

  E_A F̂_X(A(X)) − F̂*_X = E_A F̂_X(A(X)) − min_{w′ ∈ W} F̂_X(w′).
Definition 20. Let F_{μ,L,D} be the set of all functions f : W × X → R that are uniformly L-Lipschitz and μ-strongly convex (μ ≥ 0) in w, for some convex compact W ⊂ R^d with ℓ2-diameter bounded by D > 0 and some X = X_priv ∪ X_pub. Let F_{L,D} = F_{μ=0,L,D}.

Also, define the minimax excess empirical risk of ε-semi-DP (strongly) convex ERM as

  R_ERM(ε, n_priv, n, d, L, D, μ) := inf_{A ∈ A_ε} sup_{f ∈ F_{μ,L,D}} sup_{X_priv ∈ X_priv^{n_priv}, X_pub ∈ X_pub^{n_pub}} E_A F̂_X(A(X)) − F̂*_X,    (2.7)

where A_ε is the set of all ε-semi-DP algorithms A : X^n → W for some X, W.
Theorem 21. There exist absolute constants c and C, with 0 < c ≤ C, such that

  c·L·D·min{ n_priv/n , d/(nε) } ≤ R_ERM(ε, n_priv, n, d, L, D, μ = 0) ≤ C·L·D·min{ n_priv/n , d/(nε) }.    (2.8)
See Appendix A.2.3 for the strongly convex result. We prove the lower bound in Theorem 21 by reducing ERM to empirical mean estimation [32] and then lower bounding the semi-DP MSE of empirical mean estimation with a packing argument. The upper bound is achieved by two algorithms: the first term in the minimum is achieved by the throw-away algorithm A(X) ∈ argmin_{w ∈ W} Σ_{x ∈ X_pub} f(w, x); the second term is achieved by the ε-DP exponential mechanism [32].
In the absence of public data, the minimax optimal excess risk of convex DP ERM is Θ(L·D·min(1, d/(nε))) [32]; the "trivial algorithm" that returns A(X) = w₀ ∈ W independent of X has excess risk ≤ L·D. Thus, public data reduces the error of DP ERM whenever n_priv < d/ε, or equivalently n_pub > n − d/ε. This improvement is possible because the throw-away algorithm outperforms the trivial algorithm (and, if n_pub or d/ε is large enough, the exponential mechanism). On the other hand, if n_pub ≤ n − d/ε, then public data does not allow for improved accuracy (asymptotically) over the optimal ε-DP exponential mechanism.
Remark 22. The same minimax risk bound (2.8) holds up to a logarithmic factor if we replace F_{μ=0,L,D} by the set of all Lipschitz non-convex loss functions in the definition (2.7). However, the optimal semi-DP algorithms are inefficient for non-convex loss functions. See Appendix A.2.3 for details.
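For intuition, the throw-away term of the upper bound only requires minimizing the public empirical risk, which never touches the private samples and is therefore semi-DP for free. The projected gradient descent sketch below uses squared loss as a stand-in example (the theorem itself assumes only Lipschitz convex losses and an exact minimizer):

```python
import numpy as np

def throw_away_erm(X_pub, y_pub, D, steps=500, eta=0.1):
    """Approximately minimize the public empirical risk with projected
    gradient descent.  Squared loss is a stand-in example; iterates are
    projected onto the l2 ball of radius D/2 (playing the role of W)."""
    n_pub, d = X_pub.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = X_pub.T @ (X_pub @ w - y_pub) / n_pub
        w = w - eta * grad
        norm = np.linalg.norm(w)
        if norm > D / 2:          # project back onto W
            w *= (D / 2) / norm
    return w
```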
2.2.3 Optimal Semi-DP Stochastic Convex Optimization

In stochastic convex optimization (SCO), we are given n i.i.d. samples from an unknown distribution, X ∼ P^n (with n_priv of them being private), and aim to approximately minimize the expected population loss F(w) := E_{x∼P}[f(w, x)]. We measure the quality of a learner A by its excess population risk

  E_{A, X∼P^n} F(A(X)) − F* := E_{A, X∼P^n} E_{x∼P}[f(A(X), x)] − min_{w′ ∈ W} E_{x∼P} f(w′, x).
Denote the minimax optimal excess population risk of semi-DP SCO by

  R_SCO(ε, δ, n_priv, n, d, L, D, μ) := inf_{A ∈ A_{ε,δ}} sup_{f ∈ F_{μ,L,D}} sup_P E_{A, X∼P^n} F(A(X)) − F*,    (2.9)

where A_{ε,δ} contains all (ε, δ)-semi-DP algorithms A : X^n → W for some X, W, and |X_priv| = n_priv.
Theorem 23. Let ε ≤ d/8 and either n_pub ≲ nε/d or d ≲ 1. Then, there exist absolute constants 0 < c ≤ C such that

  c·L·D·min{ 1/√n_pub , d/(nε) + 1/√n } ≤ R_SCO(ε, δ = 0, n_priv, n, d, L, D, μ = 0) ≤ C·L·D·min{ 1/√n_pub , d/(nε) + 1/√n }.

The above upper bound holds for any n_pub, d.
We provide the μ-strongly convex and δ > 0 results in Appendix A.2.4. The main tool that we use to prove the lower bound is our semi-DP Fano's method (Theorem 79).

Let us compare the ε-semi-DP bound for SCO in Theorem 23 with the ERM bound in Theorem 21. Fix d = 1 = L = D for simplicity. Depending on the values of ε and n_priv, the minimax excess population risk ("test loss") of SCO may be either larger or smaller than the excess empirical risk ("training loss") of ERM. For example, if ε ≈ 1, then the semi-DP excess empirical risk Θ(1/n) is smaller than the excess population risk Θ(1/√n). On the other hand, suppose ε ≈ 1/n and n_priv ≈ n^{2/3}: then the semi-DP excess empirical risk Θ(1/n^{1/3}) is larger than the excess population risk Θ(1/√n). This is surprising: for both non-private learning and DP learning (with n_pub = 0), the optimal error of ERM is never larger than that of SCO. While it may seem counter-intuitive that minimizing the training loss can be harder than minimizing the test loss, there is a natural explanation: for SCO, a small amount of public data gives us free information about the private data, since X ∼ P^n is i.i.d. by assumption. By contrast, for ERM, the public data does not give us any information about the private data, since X is arbitrary (not necessarily i.i.d.). Thus, the (appropriate) throw-away algorithm is more effective at minimizing F than F̂_X when n_pub is small but positive.
In Appendix A.2.4.1, we give a noisy stochastic gradient method (Algorithm 11) for semi-DP SCO that uses (2.4) to estimate ∇F(w) in each iteration. Our algorithm has smaller excess risk (accounting for constants) than the state-of-the-art linear-time algorithm for DP SCO whose privacy analysis does not require convexity [181]. We empirically evaluate our algorithm in Section 2.4 and Appendix A.4.
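The idea can be sketched as a simple noisy gradient method (this is not the exact Algorithm 11): each gradient estimate is a (2.4)-style weighted combination of private and public per-sample gradients plus Gaussian noise, with the weight r and the noise scale sigma assumed to be pre-calibrated for the desired semi-DP guarantee. Full batches are used for simplicity:

```python
import numpy as np

def semi_dp_noisy_gd(grad_f, X_priv, X_pub, w0, r, sigma, eta, steps, rng):
    """Noisy gradient descent with a weighted private/public gradient
    estimate in the style of (2.4).  grad_f(w, x) is the per-sample
    gradient; r and sigma are assumed pre-calibrated for semi-DP."""
    n_priv, n_pub = len(X_priv), len(X_pub)
    w = w0.copy()
    for _ in range(steps):
        g_priv = np.stack([grad_f(w, x) for x in X_priv]).sum(axis=0)
        g_pub = np.stack([grad_f(w, x) for x in X_pub]).sum(axis=0)
        g = r * g_priv + ((1.0 - n_priv * r) / n_pub) * g_pub
        g = g + rng.normal(0.0, sigma, size=w.shape)
        w = w - eta * g
    return w
```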
2.3 Optimal Locally Private Learning with Public Data

We now turn to a stronger privacy notion that we refer to as semi-local DP (semi-LDP, Definition 24). Semi-LDP guarantees privacy for each private x_i, without requiring person i to trust anyone else (e.g. the data scientist). Semi-LDP generalizes LDP [144, 85], which has been deployed in industry [13, 91, 74].

Following [83], we permit algorithms to be fully interactive: algorithms may adaptively query the same person i multiple times over the course of T "communication rounds." We denote i's message in round t by Z_{i,t} ∈ Z. Full interactivity is the most general notion of interactivity. If T = 1, then the algorithm is sequentially interactive [85]. Essentially, semi-LDP requires the messages {Z_{i,t}}_{t ∈ [T]} to be DP for all x_i ∈ X_priv:
Definition 24 (Semi-Local Differential Privacy). The T-round algorithm A : X^n → W is (ε, δ)-semi-LDP if the full transcript of communications Z = {Z_{i,t}}_{i ≤ n, t ≤ T} is (ε, δ)-semi-DP: i.e. for all private x_i ∈ X_priv, all pairs of x_i-adjacent data sets X, X^(i) ∈ X^n, and all S ⊂ Z^{nT}, we have

  P(Z ∈ S | X) ≤ e^ε P(Z ∈ S | X^(i)) + δ.
2.3.1 Optimal Semi-LDP Mean Estimation

We will characterize the minimax squared error of ε-semi-LDP d-dimensional mean estimation:

  M^loc_pop(ε, n_priv, n, d) := inf_{A ∈ A^loc_ε(B)} sup_P E_{A, X∼P^n}[ ‖A(X) − E_{x∼P}[x]‖² ],    (2.10)

where A^loc_ε(B) contains all (fully interactive) ε-semi-LDP estimators A : B^n → B, and |X_priv| = n_priv.
Theorem 25. There are absolute constants 0 < c ≤ C such that

  c·min{ 1/n_pub , d/(n·min(ε, ε²)) + 1/n } ≤ M^loc_pop(ε, n_priv, n, d) ≤ C·min{ 1/n_pub , d/(n·min(ε, ε²)) + 1/n }.
Compared with LDP mean estimation, semi-LDP allows for improved error if n_pub > n·min(ε, ε²)/d. If n_pub ≥ n²ε²/d² as well, then the semi-LDP error rate matches the (central) semi-DP rate in Theorem 15.
Remark 26. Theorem 25 still holds if we replace A^loc_ε in the definition of M^loc_pop(ε, n_priv, n, d) by the set of all (ε, δ)-semi-LDP estimators A for which either δ < 1/2 and A is "compositional" [83, Assumption A2] (e.g. sequentially interactive), or δ < 1/2^d.
Our lower bound proof constructs a Bernoulli product distribution and combines the techniques of [83] with an upper bound on the mutual information between X and A(X) for semi-LDP A. The upper bound is the minimum of the error of the throw-away estimator and the optimal ε-LDP (hence semi-LDP) estimator PrivUnit [39]. We recall PrivUnit in Algorithm 12 in Appendix A.3.
2.3.1.1 An "Even More Optimal" Semi-LDP Estimator

By applying PrivUnit only to the private samples, we can obtain an ε-semi-LDP algorithm with smaller error than the asymptotically optimal PrivUnit. Define the Semi-LDP PrivUnit estimator by

  A_semi-PrivU(X) = (1/n) [ Σ_{x ∈ X_priv} PrivUnit(x) + Σ_{x ∈ X_pub} x ].
Lemma 27. Let C > 0 be a constant such that E_{x∼P} ‖PrivUnit(x) − x‖² = C²·d/min(ε, ε²). Then,

  E_{A, X∼P^n}[ ‖A_semi-PrivU(X) − E_{x∼P}[x]‖² ] ≤ 2/n + 2C²·(n_priv/n)·d/(n·min(ε, ε²)).
Thus, the MSE of Semi-LDP PrivUnit is smaller than that of PrivUnit, roughly by a factor of n_priv/n. We can say even more about the optimality of Semi-LDP PrivUnit: it is truly optimal, i.e. it has the smallest MSE (including constants) among a large class of unbiased semi-LDP estimators of X̄.
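Structurally, A_semi-PrivU is an average that routes only the private samples through the local randomizer. In the sketch below, `randomize` is a stand-in argument for PrivUnit (Algorithm 12): any unbiased randomizer preserves the structure, but only the PrivUnit calibration delivers the ε-LDP guarantee, so the simple noisy stand-in shown here is for illustration and is not ε-LDP:

```python
import numpy as np

def semi_privunit_mean(X_priv, X_pub, randomize):
    """A_semi-PrivU: average the randomized private samples together with
    the raw public samples.  `randomize` stands in for PrivUnit."""
    n = len(X_priv) + len(X_pub)
    total = sum(randomize(x) for x in X_priv) + X_pub.sum(axis=0)
    return total / n

def noisy_identity(x, scale=1.0, rng=np.random.default_rng(0)):
    """Unbiased stand-in randomizer (illustrative only; NOT eps-LDP)."""
    return x + rng.normal(0.0, scale, size=x.shape)
```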
Proposition 28. Let A(X) = (1/n)[ M_priv(R(x_1), …, R(x_{n_priv})) + M_pub(X_pub) ] be an ε-semi-LDP algorithm, where R : S^{d−1} → Z is an ε-LDP randomizer,‡ and M_priv : Z^{n_priv} → R^d and M_pub : Z^{n_pub} → R^d are aggregation protocols such that E_{M_priv, R}[ M_priv(R(x_1), …, R(x_{n_priv})) ] = Σ_{x ∈ X_priv} x and E_{M_pub}[ M_pub(X_pub) ] = Σ_{x ∈ X_pub} x for all X = (X_priv, X_pub) ∈ (S^{d−1})^n. Then,

  sup_{X ∈ (S^{d−1})^n} E_{A_semi-PrivU}[ ‖A_semi-PrivU(X) − X̄‖² ] ≤ sup_{X ∈ (S^{d−1})^n} E_A[ ‖A(X) − X̄‖² ].
Proposition 28 is proved by extending the analysis of Asi et al. [20] to the semi-LDP setting.
2.3.2 Optimal Semi-LDP Stochastic Convex Optimization

We will characterize the minimax optimal excess population risk of semi-LDP SCO:

  R^loc_SCO(ε, n_priv, n, d, L, D, μ) := inf_{A ∈ A^loc_ε} sup_{f ∈ F_{μ,L,D}} sup_P E_{A, X∼P^n} F(A(X)) − F*,    (2.11)

where A^loc_ε denotes the set of all algorithms A : X^n → W that are ε-semi-LDP for some X and W and for which exactly n_priv samples in X are private. If μ = 0 (convex case), we write R^loc_SCO(ε, n_priv, n, d, L, D) := R^loc_SCO(ε, n_priv, n, d, L, D, μ = 0). We present the result for convex loss functions in Theorem 29:
Theorem 29. Let h(ε, n_priv, n, d, L, D) := L·D·min{ 1/√n_pub , √(d/(n·min(ε, ε²))) + 1/√n }. There exist absolute constants c and C with 0 < c ≤ C, such that

  c·h(ε, n_priv, n, d, L, D) ≤ R^loc_SCO(ε, n_priv, n, d, L, D) ≤ C·h(ε, n_priv, n, d, L, D).

‡Here S^{d−1} denotes the unit sphere. If ‖x‖ ≠ 1, then PrivUnit can be combined with an LDP scaling mechanism [39]. This scaled PrivUnit is still optimal up to constants [83].
See Appendix A.3.2 for the strongly convex result. The lower bound follows from Theorem 25 and a reduction from SCO to mean estimation. The first term in the upper bound is achieved by throwing away X_priv and running SGD on X_pub [200]. The second term in the upper bound is achieved by the one-pass LDP-SGD of [85]. Remark 26 also applies to Theorem 29.
2.3.2.1 An "Even More Optimal" Semi-LDP Algorithm for SCO

We give a semi-LDP algorithm, called Semi-LDP-SGD, with smaller excess risk than the optimal LDP-SGD of [85]. Essentially, Semi-LDP-SGD runs as follows: in each iteration t ∈ [n], we draw a random sample x_t ∈ X without replacement. If x_t ∈ X_priv, update w_{t+1} = Π_W[ w_t − η·PrivUnit(∇f(w_t, x_t)) ]; if x_t ∈ X_pub, instead update w_{t+1} = Π_W[ w_t − η·∇f(w_t, x_t) ]. See Algorithm 13 in Appendix A.3.2.1 for pseudocode.
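This update rule can be sketched as follows, with `randomize` again standing in for PrivUnit and with W taken (for concreteness) to be the ℓ2 ball of diameter D:

```python
import numpy as np

def semi_ldp_sgd(grad_f, priv_idx, X, w0, eta, D, randomize, rng):
    """One-pass Semi-LDP-SGD sketch: visit the n samples in a random order
    (sampling without replacement); gradients of private samples pass
    through `randomize` (a stand-in for PrivUnit), public gradients are
    used as-is.  Iterates are projected onto the l2 ball of radius D/2 and
    the average iterate is returned."""
    n = X.shape[0]
    w = w0.copy()
    iterates = []
    for t in rng.permutation(n):
        g = grad_f(w, X[t])
        if t in priv_idx:
            g = randomize(g)
        w = w - eta * g
        norm = np.linalg.norm(w)
        if norm > D / 2:          # projection onto W
            w *= (D / 2) / norm
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)
```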
Proposition 30. Let f ∈ F_{L,D}, let P be any distribution, and let ε ≤ d. Algorithm 13 is ε-semi-LDP. Further, there is an absolute constant c such that the output A(X) = w̄_n of Algorithm 13 satisfies

  E_{A, X∼P^n} F(w̄_n) − F* ≤ c·(L·D/√n)·max{ √(d/min(ε, ε²))·√(n_priv/n) , √(n_pub/n) }.
Thus, Algorithm 13 has smaller excess risk than LDP-SGD, roughly by a factor of √(n_priv/n).
2.4 Numerical Experiments

In this section, we empirically assess the performance of semi-DP Algorithm 11 (see Appendix A.2.4.1). We compare four different semi-DP algorithms: 1. Throw-away (i.e. finding a minimizer of the public loss). 2. DP-SGD [1, 71] on the entire data set. 3. PDA-MD [9, Algorithm 1], which is the state-of-the-art semi-DP algorithm for training convex models. 4. Our Algorithm 11. We evaluate all algorithms with a "warm start," which means finding a minimizer w_pub of the public loss and initializing training at w_pub. The hyperparameters of each algorithm were carefully tuned.
The figures below show results for (ε, δ = 10⁻⁵)-semi-DP linear regression with synthetic Gaussian data. Our Algorithm 11 achieves the smallest test loss across different levels of ε (privacy) and n_pub, except for ε = 0.1. The advantage of Algorithm 11 over the baselines is largest when n_pub is small. See Appendix A.4 for details on the experimental setup and further discussion of results. We also performed additional numerical experiments evaluating our algorithms in Appendix A.4.
[Plots omitted: test-loss curves comparing Algorithm 11, DP-SGD, PDA-MD, and Throw-away.]

Figure 2.2: Test loss vs. n_pub/n, ε = 1.
Figure 2.3: Test loss vs. n_pub/n, ε = 2.
Figure 2.4: Test loss vs. ε, n_pub/n = 0.04.
Figure 2.5: Test loss vs. ε, n_pub/n = 0.03.
2.5 Conclusion
We considered learning models with DP for a certain subset of people in the data set. Theoretically, we
characterized the optimal error bounds (up to constants) for three fundamental learning problems: mean
estimation, empirical risk minimization (ERM), and stochastic convex optimization (SCO). In each case,
we identified the optimal algorithms and the parameter regimes in which public data reduces the minimax
optimal error of DP learning. Algorithmically, we developed new optimal methods for semi-DP learning
that have smaller error than the asymptotically optimal algorithms. Empirically, we demonstrated that our
algorithms are effective in training semi-DP models.
Author Contributions
This work was conducted in collaboration with Zeman Li, Tianjian Huang, and Meisam Razaviyayn. AL led
the project, executed the theoretical analyses, wrote the paper, and managed the numerical experiments.
ZL and TH executed the numerical experiments and helped draft the “numerical experiments” subsection.
MR advised and provided feedback.
Acknowledgements
AL would like to thank Thomas Steinke, Gautam Kamath, and Adam Smith for very helpful pointers and
explanations related to existing lower bound proof techniques.
Chapter 3

Private Federated Learning Without a Trusted Server
This chapter is based on [181, 177].
Summary
This chapter studies federated learning (FL), especially cross-silo FL, with data from people who do not trust the server or other silos. In this setting, each silo (e.g. hospital) has data from different people (e.g. patients) and must maintain the privacy of each person's data (e.g. medical record), even if the server or other silos act as adversarial eavesdroppers. This requirement motivates the study of Inter-Silo Record-Level Differential Privacy (ISRL-DP), which requires silo i's communications to satisfy record/item-level differential privacy (DP). ISRL-DP ensures that the data of each person (e.g. patient) in silo i (e.g. hospital i) cannot be leaked. ISRL-DP is different from well-studied privacy notions. Central and user-level DP assume that people trust the server/other silos. On the other end of the spectrum, local DP assumes that people do not trust anyone at all (even their own silo). Sitting between central and local DP, ISRL-DP makes the realistic assumption (in cross-silo FL) that people trust their own silo, but not the server or other silos. In this work, we provide tight (up to logarithms) upper and lower bounds for ISRL-DP FL with convex/strongly convex loss functions and homogeneous (i.i.d.) silo data. Remarkably, we show that similar bounds are attainable for smooth losses with arbitrary heterogeneous silo data distributions, via an accelerated ISRL-DP algorithm. We also provide tight upper and lower bounds for ISRL-DP federated empirical risk minimization, and use acceleration to attain the optimal bounds in fewer rounds of communication than the state-of-the-art. Further, with a secure "shuffler" to anonymize silo messages (but without a trusted server), our algorithm attains the optimal central DP rates under more practical trust assumptions.

We also extend our analysis to non-convex loss functions. We propose novel ISRL-DP algorithms for FL with heterogeneous (non-i.i.d.) silo data and two classes of Lipschitz continuous (non-convex) loss functions: First, we consider losses satisfying the Proximal Polyak-Łojasiewicz (PL) inequality, which is an extension of the classical PL condition to the constrained setting. In contrast to our result, prior works only considered unconstrained private optimization with Lipschitz PL loss, which rules out every PL loss function that we are aware of. Our proximal PL algorithm nearly attains the optimal strongly convex, homogeneous (i.i.d.) rate for ISRL-DP FL without assuming convexity or i.i.d. data. Second, we give the first DP algorithms for non-convex non-smooth loss functions. Our utility bounds even improve on the state-of-the-art DP FL bounds for smooth losses. We complement our upper bounds with lower bounds. Numerical experiments show favorable privacy-accuracy tradeoffs for our algorithms in classification and regression tasks with both convex and non-convex loss functions.
3.1 Introduction
Machine learning tasks often involve data from different “silos” (e.g. cell-phone users or organizations such
as hospitals) containing sensitive information (e.g. location or health records). In federated learning (FL),
each silo (a.k.a. “client”) stores its data locally and a central server coordinates updates among different
silos to achieve a global learning objective [135]. One of the primary reasons for the introduction of FL
was to offer greater privacy [188]. However, storing data locally is not sufficient to prevent data leakage.
Model parameters or updates can still reveal sensitive information (e.g. via model inversion attacks or
membership inference attacks) [103, 120, 231, 286].
33
Differential privacy (DP) [87] protects against privacy attacks. Different notions of DP have been proposed for FL. The works of [130, 245, 203] considered central DP (CDP) FL, which protects the privacy of silos' aggregated data against an external adversary who observes the final trained model.∗ There are two major issues with CDP FL: 1) it does not guarantee privacy for each specific silo; and 2) it does not guarantee data privacy when an adversarial eavesdropper has access to other silos or the server. To address the first issue, [189, 107, 130, 104, 259, 284, 164, 110] considered user-level DP (a.k.a. client-level DP). User-level DP guarantees privacy of each silo's full local data set. This is a practical notion for cross-device FL, where each silo/client corresponds to a single person (e.g. cell-phone user) with many records (e.g. text messages). However, user-level DP still suffers from the second critical shortcoming of CDP: it allows silo data to be leaked to an untrusted server or to other silos. Furthermore, user-level DP is less suitable for cross-silo FL, where silos are typically organizations (e.g. hospitals, banks, or schools) that contain data from many different people (e.g. patients, customers, or students). In cross-silo FL, each person has a record (a.k.a. "item") that may contain sensitive data. Thus, an appropriate notion of DP for cross-silo FL should protect the privacy of each individual record ("item-level DP") in silo i, rather than silo i's full aggregate data.

Another notion of DP is local DP (LDP) [144, 85]. While central and user-level DP assume that people trust all of the silos and the server, LDP assumes that individuals (e.g. patients) do not trust anyone else with their sensitive data, not even their own silo (e.g. hospital). Thus, LDP would require each person (e.g. patient) to randomize her report (e.g. medical test results) before releasing it (e.g. to their own doctor/hospital). Since patients/customers/students usually trust their own hospital/bank/school, LDP may be unnecessarily stringent, hindering performance/accuracy.
In this work, we consider a privacy notion called inter-silo record-level differential privacy (ISRL-DP), which requires that all of the communications of each silo satisfy (item-level) DP; see Figure 3.1. By the post-processing property of DP, this also ensures that the broadcasts by the server and the global model are DP. Privacy notions similar or identical to ISRL-DP have been considered in [246, 127, 126, 267, 260, 78, 283, 14, 226, 174]. We provide a rigorous definition of ISRL-DP in Definition 32 and Appendix B.1.2.

∗We abbreviate central differential privacy by CDP. This is different from the concentrated differential privacy notion in [47], for which the same abbreviation is sometimes used in other works.

[Figure 3.1 omitted: diagram of a central server coordinating a global model across hospital silos holding patients' data, with adversaries eavesdropping on the server and on silo communications.]

Figure 3.1: ISRL-DP protects the privacy of each patient's record regardless of whether the server/other silos are trustworthy, as long as the patient's own hospital is trusted. By contrast, user-level DP protects aggregate data of patients in hospital i and does not protect against adversarial server/other silos.
Why ISRL-DP? ISRL-DP is the natural notion of DP for cross-silo FL, where each silo contains data from many individuals who trust their own silo but may not trust the server or other silos (e.g., hospitals in Figure 3.1). The item-level privacy guarantee that ISRL-DP provides for each silo (e.g., hospital) ensures that no person's record can be leaked. In contrast to central DP and user-level DP, the protection of ISRL-DP is guaranteed even against an adversary with access to the server and/or other silos (e.g., hospitals). This is because each silo's communications are DP with respect to its own data records and cannot leak information to any adversarial eavesdropper. On the other hand, since individuals (e.g., patients) trust their own silo (e.g., hospital), ISRL-DP does not require individuals to randomize their own data reports (e.g., health records). Thus, ISRL-DP leads to better performance/accuracy than local DP by relaxing the strict local DP requirement. Another benefit of ISRL-DP is that each silo $i$ can set its own $(\varepsilon_i, \delta_i)$ item-level DP budget depending on its privacy needs; see Appendix B.1.9 and also [174, 7].
In addition, ISRL-DP can be useful in cross-device FL without a trusted server: if the ISRL privacy parameters are chosen sufficiently small, then ISRL-DP implies user-level DP (see Appendix B.1.3). Unlike user-level DP, ISRL-DP does not allow data to be leaked to the untrusted server/other users.

Another intermediate trust model between the low-trust local model and the high-trust central/user-level models is the shuffle model of DP [41, 56]. In this model, a secure shuffler receives noisy reports from the silos and randomly permutes them before the reports are sent to the untrusted server.† An algorithm is Shuffle Differentially Private (SDP) if silos' shuffled messages are CDP; see Definition 33. Figure 3.2 summarizes which parties are assumed to be trustworthy (from the perspective of a person contributing data to a silo) in each of the described privacy notions.

†Assume that the reports can be decrypted by the server, but not by the shuffler [90, 96].
Problem setup: Consider an FL setting with $N$ silos, each containing a local data set with $n$ samples:‡ $X_i = (x_{i,1}, \cdots, x_{i,n})$ for $i \in [N] := \{1, \ldots, N\}$. In each round of communication $r$, silos download the global model $w_r$ and use their local data to improve the model. Then, silos send local updates to the server (or other silos, in peer-to-peer FL), who updates the global model to $w_{r+1}$. For each silo $i$, let $P_i$ be an unknown probability distribution on a data universe $\mathcal{X}_i$ (i.e., $X_i \in \mathcal{X}_i^n$). Let $\mathcal{X} := \bigcup_{i=1}^N \mathcal{X}_i$. Given a loss function $f : \mathcal{W} \times \mathcal{X} \to \mathbb{R}$, define silo $i$'s local objective as

$$F_i(w) := \mathbb{E}_{x_i \sim P_i}[f(w, x_i)], \tag{3.1}$$

where $\mathcal{W} \subset \mathbb{R}^d$ is a parameter domain. Our goal is to find a model parameter that performs well for all silos, by solving the FL problem

$$\min_{w \in \mathcal{W}} \left\{ F(w) := \frac{1}{N} \sum_{i=1}^N F_i(w) \right\}, \tag{3.2}$$

‡In Appendix B.1.9, we consider the more general setting where data set sizes $n_i$ and ISRL-DP parameters $(\varepsilon_i, \delta_i)$ may vary across silos, and the weights $p_i$ on each silo's loss $F_i$ in Equation 3.2 may differ (i.e., $p_i \neq 1/N$).
[Figure 3.2 arranges the DP notions on a trust spectrum, from low trust to high trust, according to four questions: trust one's own silo (e.g., hospital)? "Trust" a shuffler? Trust the server? Trust other silos (e.g., hospitals)? Local DP requires trusting no other party; Inter-Silo Record-Level DP additionally trusts one's own silo; Shuffle DP additionally "trusts" a shuffler; Central DP and user-level DP additionally trust the server and other silos.]

Figure 3.2: Trust assumptions of DP FL notions: We put "trust" in quotes because the shuffler is assumed to be secure and silo messages must already satisfy (at least a weak level of) ISRL-DP in order to realize SDP: anonymization alone cannot "create" DP [88].
while maintaining the privacy of each silo's local data. At times, we will focus on empirical risk minimization (ERM), where $\widehat{F}_i(w) := \frac{1}{n} \sum_{j=1}^n f(w, x_{i,j})$ is silo $i$'s local objective. Thus, in the ERM case, our goal is to solve $\min_{w \in \mathcal{W}} \{ \widehat{F}_X(w) := \frac{1}{N} \sum_{i=1}^N \widehat{F}_i(w) \}$, while maintaining privacy. When $F_i$ takes the form of Equation 3.1 (not necessarily ERM), we may refer to the problem as stochastic optimization (SO) for emphasis. For ERM, we make no assumptions on the data; for SO, we assume the samples $\{x_{i,j}\}_{i \in [N], j \in [n]}$ are drawn independently. For SO, we say Problem 3.2 is "i.i.d." or "homogeneous" if $\mathcal{X}_i = \mathcal{X}$ and $P_i = P$ for all $i$.

The excess risk of an algorithm $\mathcal{A}$ for solving Problem 3.2 is $\mathbb{E} F(\mathcal{A}(X)) - F^*$, where $F^* = \inf_{w \in \mathcal{W}} F(w)$ and the expectation is taken over both the random draw of $X = (X_1, \ldots, X_N)$ and the randomness of $\mathcal{A}$. For ERM, the excess empirical risk of $\mathcal{A}$ is $\mathbb{E} \widehat{F}_X(\mathcal{A}(X)) - \widehat{F}_X^*$, where the expectation is taken solely over the randomness of $\mathcal{A}$. For general non-convex loss functions, meaningful excess risk guarantees are not tractable in polynomial time. Thus, for non-convex FL, we measure the utility of $\mathcal{A}$ by its expected (squared) stationarity gap $\mathbb{E} \|\nabla F(\mathcal{A}(X))\|^2$.§ A fundamental question in FL is about the minimum achievable excess risk (or stationarity gap) while maintaining privacy. In this chapter, we specifically study the following questions:

Question 1. What is the minimum achievable excess risk for solving Problem 3.2 with inter-silo record-level DP and (strongly) convex loss functions?

Question 2. With a secure shuffler (but no trusted server), can the optimal central DP rates be attained?

Question 3. What is the best performance (excess risk or stationarity gap) that any inter-silo record-level DP algorithm can achieve for solving Problem 3.2 with non-convex loss functions? With a secure shuffler (but no trusted server), what performance is attainable?

§In the non-smooth case, we instead use the norm of the proximal gradient mapping, defined in Section 3.3.2.

Contributions: Our first contribution is a complete answer to Question 1 when silo data is i.i.d.: we give tight upper and lower bounds in Section 3.2.1. The ISRL-DP rates sit between the local DP and central
DP rates: higher trust allows for higher accuracy. Further, we show that the ISRL-DP rates nearly match the optimal non-private rates if $d \lesssim n\varepsilon_0^2$, where $\varepsilon_0$ is the ISRL-DP parameter ("privacy for free"). As a corollary of our analysis, we also derive tight upper and lower bounds for FL algorithms that satisfy both ISRL-DP and user-level DP simultaneously, which could be useful in cross-device settings where (e.g., cell phone) users don't trust the server or other users with their sensitive data (e.g., text messages): see Appendix B.1.5.5.

Second, we give a complete answer to Question 1 when $F = \widehat{F}_X$ is an empirical loss in Section 3.2.3.¶ While [111] provided a tight upper bound for the (non-strongly) convex case, we use a novel accelerated algorithm to achieve this upper bound in fewer communication rounds. Further, we obtain matching lower bounds. We also cover the strongly convex case.

¶ERM is a special case of the FL problem Equation 3.2: if $P_i$ is the empirical distribution on $X_i$, then $F = \widehat{F}_X$.

Third, we give a partial answer to Question 1 when silo data is heterogeneous (non-i.i.d.), providing algorithms for smooth $f(\cdot, x)$ that nearly achieve the optimal i.i.d. rates in Section 3.2.2. For example, if $f(\cdot, x)$ is $\mu$-strongly convex and $\beta$-smooth, then the excess risk of our algorithm nearly matches the i.i.d. lower bound up to a multiplicative factor of $\widetilde{O}(\beta/\mu)$. Our algorithm is significantly more effective (in terms of excess risk) than existing ISRL-DP FL algorithms (e.g., [14, 78, 283]): see Appendix B.1.1 for a thorough discussion of related work.

Fourth, we address Question 2 in Section 3.2.4: We give a positive answer to Question 2 when silo data is i.i.d. Further, with heterogeneous silo data, the optimal central DP rates are nearly achieved without a trusted server, if the loss function is smooth. We summarize our results for (strongly) convex loss functions in Figure 3.3.

In Section 3.3, we turn to Question 3. In Section 3.3.1.1, we give a nearly complete answer to Question 3 for the subclass of non-convex loss functions that satisfy the Proximal Polyak-Łojasiewicz (PL) inequality [142]. The Proximal PL (PPL) condition is a generalization of the classical PL inequality [213] and covers many important ML models: e.g., some classes of neural nets such as wide neural nets [171, 163],
linear/logistic regression, LASSO, and strongly convex losses [142]. For heterogeneous FL with (potentially non-convex) proximal PL losses, our ISRL-DP algorithm attains excess risk that nearly matches the strongly convex, i.i.d. lower bound [181]. Additionally, the excess risk of our SDP algorithm nearly matches the strongly convex, i.i.d., central DP lower bound [29] and is attained without convexity, without i.i.d. data, and without a trusted server. Our excess risk bounds nearly match the optimal statistical rates in certain practical parameter regimes, resulting in "privacy almost for free."

To obtain our proximal-PL results, we invent a new method of analyzing noisy proximal gradient algorithms that does not require convexity, applying tools from the analysis of objective perturbation [52, 149]. Our novel analysis is necessary because privacy noise cannot be easily separated from the non-private optimization terms in the presence of the proximal operator and non-convexity.

Next, in Section 3.3.1.2, we provide a nearly complete answer to Question 3 for federated ERM with proximal PL losses. We provide novel, communication-efficient, proximal variance-reduced ISRL-DP and SDP algorithms for non-convex ERM. Our algorithms have near-optimal excess empirical risk that almost matches the strongly convex ISRL-DP and CDP lower bounds [181, 32], without requiring convexity.

Finally, we address Question 3 for general non-convex/non-smooth (non-PL) loss functions in Section 3.3.2. We develop the first DP optimization (in particular, FL) algorithms for non-convex/non-smooth loss functions. Our ISRL-DP and SDP algorithms have significantly better utility (stationarity gap) than all previous ISRL-DP and CDP FL algorithms for smooth losses [257, 76, 125, 203]. We complement our upper bound with the first non-trivial ISRL-DP lower bound for non-convex FL in Section 3.3.2.1. See Figure 3.4 for a summary of our results and Appendix B.2.1 for a thorough discussion of related work.
3.1.1 Preliminaries

Before giving our precise definition of inter-silo record-level DP in Definition 32, it will be useful to introduce a slight generalization of classical DP, along with some more notation. To that end, let $\mathcal{X} =$
[Figure 3.3 is a table of excess risk upper and lower bounds for six FL problems (ISRL-DP i.i.d. SCO, ISRL-DP non-i.i.d. SCO, ISRL-DP ERM, SDP i.i.d. SCO, SDP non-i.i.d. SCO) under convex and strongly convex losses; the upper bounds are from Theorems 34, 36, 37, 38, 39, and [111], and the lower bounds from Theorems 35, 133, and [29].]

Figure 3.3: We fix $M = N$, omit logs, and $C^2 := (\varepsilon_0 n \sqrt{N}/\sqrt{d})^{2/5}$. Round complexity in Theorem 37 improves on [111]. *For our non-i.i.d. algorithm, Theorem 35 only applies if $\varepsilon_0 = O(1/n)$ or $N = O(1)$: see Appendix B.1.5.
$\mathcal{X}_1^n \times \cdots \times \mathcal{X}_N^n$ and $\rho : \mathcal{X}^2 \to [0, \infty)$ be a distance between databases. Two databases $X, X' \in \mathcal{X}$ are $\rho$-adjacent if $\rho(X, X') \leq 1$. DP ensures that (with high probability) an adversary cannot distinguish between the outputs of algorithm $\mathcal{A}$ when it is run on adjacent databases:

Definition 31 (Differential Privacy). Let $\varepsilon \geq 0$, $\delta \in [0, 1)$. A randomized algorithm $\mathcal{A} : \mathcal{X} \to \mathcal{W}$ is $(\varepsilon, \delta)$-differentially private (DP) (with respect to $\rho$) if for all $\rho$-adjacent data sets $X, X' \in \mathcal{X}$ and all measurable subsets $S \subseteq \mathcal{W}$, we have

$$\mathbb{P}(\mathcal{A}(X) \in S) \leq e^{\varepsilon}\, \mathbb{P}(\mathcal{A}(X') \in S) + \delta. \tag{3.3}$$
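The Gaussian mechanism [88, Theorem A.2], used later for the privacy of Noisy MB-SGD, is a basic primitive realizing Definition 31. The following minimal sketch (ours, not from the thesis) releases a bounded-norm mean with Gaussian noise calibrated to its replace-one $\ell_2$-sensitivity; the function names and the toy query are our own, and the $\sigma$ formula is the standard sufficient condition for $\varepsilon \leq 1$.

```python
import math
import random

def gaussian_mechanism(data, query, l2_sensitivity, eps, delta):
    """Release query(data) plus Gaussian noise calibrated to the query's
    l2-sensitivity; this gives (eps, delta)-DP for eps <= 1 [88]."""
    sigma = math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / eps
    true_answer = query(data)
    return [a + random.gauss(0.0, sigma) for a in true_answer], sigma

# Example: a d-dimensional mean over n records, each with l2-norm <= 1,
# has l2-sensitivity 2/n under the replace-one adjacency of Definition 31.
n, d = 1000, 5
data = [[random.uniform(-1, 1) / math.sqrt(d) for _ in range(d)]
        for _ in range(n)]
mean_query = lambda db: [sum(row[k] for row in db) / len(db) for k in range(d)]
noisy_mean, sigma = gaussian_mechanism(data, mean_query, 2.0 / n,
                                       eps=1.0, delta=1e-5)
```

The noise scale here ($\sigma \approx 0.01$ for these parameters) is tiny relative to the data range, which is the same "averaging helps privacy" effect that drives the $1/n$ factors in the bounds below.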
Definition 32 (Inter-Silo Record-Level Differential Privacy). Let $\rho_i : \mathcal{X}_i^2 \to [0, \infty)$, $\rho_i(X_i, X_i') := \sum_{j=1}^n \mathbb{1}_{\{x_{i,j} \neq x_{i,j}'\}}$, $i \in [N]$. A randomized algorithm $\mathcal{A}$ is $(\varepsilon_0, \delta_0)$-ISRL-DP if for all $i \in [N]$ and all $\rho_i$-adjacent silo data sets $X_i, X_i'$, the full transcript of silo $i$'s sent messages satisfies Equation 3.3 for any fixed settings of other silos' messages and data.
[Figure 3.4 is a table of upper bounds, lower bounds, and prior state of the art for two function classes: proximal-PL losses (excess risk) and non-convex/non-smooth losses (squared gradient norm), each under ISRL-DP and SDP, for non-i.i.d. SO and ERM. The bounds are from Theorems 40, 42, 44, 45, 46, 35, and 133 and from [16, 29, 32]; prior work covers only narrower settings: [277] for CDP with i.i.d. data and unconstrained PL, Lipschitz losses; [254, 141] for CDP ERM with unconstrained PL, Lipschitz losses; and [202, 256] for smooth CDP losses.]

Figure 3.4: We fix $M = N$ and omit logs. $\kappa = \beta/\mu$, where $\beta$ is the smoothness parameter and $\mu$ is the proximal-PL parameter of the loss. *[203] mostly analyzes CDP FL, but we observe that an ISRL-DP bound can also be obtained with a small modification of their algorithm and analysis.
Definition 33 (Shuffle Differential Privacy [41, 56]). A randomized algorithm $\mathcal{A}$ is $(\varepsilon, \delta)$-shuffle DP (SDP) if for all $\rho$-adjacent databases $X, X' \in \mathcal{X}$ and all measurable subsets $S$, the collection of all uniformly randomly permuted messages that are sent by the shuffler satisfies Equation 3.3, with $\rho(X, X') := \sum_{i=1}^N \sum_{j=1}^n \mathbb{1}_{\{x_{i,j} \neq x_{i,j}'\}}$.

Assume the following throughout this chapter:

Assumption 1. 1. $\mathcal{W} \subset \mathbb{R}^d$ is closed, convex, and $\|w - w'\| \leq D$ for all $w, w' \in \mathcal{W}$.

2. $f(\cdot, x)$ is (uniformly) $L$-Lipschitz and convex for all $x \in \mathcal{X}$. In some parts of the chapter, we assume $f(\cdot, x)$ is $\mu$-strongly convex.

3. $\sup_{w \in \mathcal{W}} \mathbb{E}_{x_i \sim P_i} \|\nabla f(w, x_i) - \nabla F_i(w)\|^2 \leq \phi^2$ for all $i \in [N]$.

4. In each round $r$, a uniformly random subset $S_r$ of $M_r \in [N]$ silos is available to communicate with the server, where $\{M_r\}_{r \geq 0}$ are independent random variables with $\frac{1}{M} := \mathbb{E}\big(\frac{1}{M_r}\big)$.

In Assumption 1 part 4, the network determines $M_r$: it is not a design parameter. This assumption is more general (and realistic for cross-device FL [135]) than most (DP) FL works, which usually assume $M = N$ or that $M_r = M$ is deterministic. On the other hand, in cross-silo FL, typically all silos can reliably communicate in each round, i.e., $M = N$ [135].
3.2 Private Convex Federated Learning Without a Trusted Server

3.2.1 Inter-Silo Record-Level DP FL with Homogeneous Silo Data

In this section, we provide tight (up to logarithms) upper and lower bounds on the excess risk of ISRL-DP algorithms for FL with i.i.d. silo data. For consistency of presentation, we assume that there is an untrusted server. However, our algorithms readily extend to peer-to-peer FL (no server), by having silos send private messages directly to each other and perform model updates themselves.

3.2.1.1 Upper Bounds via Noisy Distributed Minibatch SGD

We begin with our upper bounds, obtained via Noisy Distributed Minibatch SGD (MB-SGD): In each round $r$, all $M_r$ available silos receive $w_r$ from the server and send noisy stochastic gradients to the server: $\widetilde{g}_r^i := \frac{1}{K} \sum_{j=1}^K \nabla f(w_r, x_{i,j}^r) + u_i$, where $u_i \sim N(0, \sigma^2 I_d)$ and the $x_{i,j}^r$ are drawn uniformly from $X_i$ (and then replaced). The server averages these $M_r$ reports and updates $w_{r+1} := \Pi_{\mathcal{W}}\big[w_r - \frac{\eta_r}{M_r} \sum_{i \in S_r} \widetilde{g}_r^i\big]$. After $R$ rounds, a weighted average of the iterates is returned: $\widehat{w}_R = \frac{1}{\Gamma_R} \sum_{r=0}^{R-1} \gamma_r w_r$ with $\Gamma_R := \sum_{r=0}^{R-1} \gamma_r$. With proper choices of $\{\eta_r, \gamma_r\}_{r=0}^{R-1}$, $\sigma^2$, $K$, and $R$, we have:
Theorem 34 (Informal). Let $\varepsilon_0 \leq 2\ln(2/\delta_0)$, $\delta_0 \in (0, 1)$. Then Noisy MB-SGD is $(\varepsilon_0, \delta_0)$-ISRL-DP. Moreover:

1. If $f(\cdot, x)$ is convex, then

$$\mathbb{E} F(\widehat{w}_R) - F^* = \widetilde{O}\left( \frac{LD}{\sqrt{M}} \left( \frac{1}{\sqrt{n}} + \frac{\sqrt{d \ln(1/\delta_0)}}{\varepsilon_0 n} \right) \right). \tag{3.4}$$

2. If $f(\cdot, x)$ is $\mu$-strongly convex, then

$$\mathbb{E} F(\widehat{w}_R) - F^* = \widetilde{O}\left( \frac{L^2}{\mu M} \left( \frac{1}{n} + \frac{d \ln(1/\delta_0)}{\varepsilon_0^2 n^2} \right) \right). \tag{3.5}$$

The first terms in each of Equations 3.4 and 3.5 ($LD/\sqrt{Mn}$ for convex and $L^2/\mu M n$ for strongly convex) are bounds on the uniform stability [45] of Noisy MB-SGD. We use these uniform stability bounds to bound the generalization error of our algorithm. Our stability bound for $\mu > 0$ in Lemma 99 is novel even for $N = 1$. The second terms in Equation 3.4 and Equation 3.5 are bounds on the empirical risk of the algorithm. We use Nesterov smoothing [201] to extend our bounds to non-smooth losses. This requires us to choose a different stepsize and $R$ from [29] (for $N = 1$, $\mu = 0$), which eliminates the restriction on the smoothness parameter that appears in [29, Theorem 3.2].∥ Privacy of Noisy MB-SGD follows from the Gaussian mechanism [88, Theorem A.2], privacy amplification by subsampling [248], and advanced composition [88, Theorem 3.20] or the moments accountant [1, Theorem 1]. See Appendix B.1.4.2 for the detailed proof.

∥In [29], the number of iterations is denoted by $T$, rather than $R$.
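One round of Noisy MB-SGD can be sketched as follows (a single-machine simulation in our own notation, not the thesis's implementation); we instantiate $f$ with a least-squares loss for concreteness and omit the tuned parameter choices from Theorem 34.

```python
import random

def project_ball(w, radius):
    """Euclidean projection of w onto the ball of given radius (the set W)."""
    norm = sum(x * x for x in w) ** 0.5
    return w if norm <= radius else [x * radius / norm for x in w]

def noisy_mb_sgd_round(w, silos, K, sigma, eta, radius):
    """One round of Noisy Distributed MB-SGD for f(w,(a,b)) = (a.w - b)^2 / 2.
    Each silo averages K per-sample gradients (sampled uniformly with
    replacement, as in the i.i.d. SO setting) and adds Gaussian noise; the
    server averages the noisy reports and takes a projected gradient step."""
    d = len(w)
    reports = []
    for X_i in silos:
        batch = [random.choice(X_i) for _ in range(K)]
        g = [0.0] * d
        for a, b in batch:
            resid = sum(aj * wj for aj, wj in zip(a, w)) - b
            g = [gj + resid * aj / K for gj, aj in zip(g, a)]
        # ISRL-DP step: the report leaving the silo is already noisy.
        reports.append([gj + random.gauss(0.0, sigma) for gj in g])
    avg = [sum(r[j] for r in reports) / len(reports) for j in range(d)]
    return project_ball([wj - eta * aj for wj, aj in zip(w, avg)], radius)
```

Iterating this round on synthetic linear data drives the iterate close to the data-generating parameter while every message leaving a silo is privatized.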
3.2.1.2 Lower Bounds

We provide excess risk lower bounds for the case $M = N$, establishing the optimality of Noisy MB-SGD for two function classes: $\mathcal{F}_{L,D} = \{f \mid f(\cdot, x)$ is convex and $L$-Lipschitz for all $x \in \mathcal{X}$, and $\|w - w'\| \leq D$ for all $w, w' \in \mathcal{W}\}$; and $\mathcal{F}_{\mu,L,D} = \{f \in \mathcal{F}_{L,D} \mid f(\cdot, x)$ is $\mu$-strongly convex for all $x \in \mathcal{X}\}$. We restrict attention to distributions $P$ satisfying Assumption 1, part 3. The $(\varepsilon_0, \delta_0)$-ISRL-DP algorithm class $\mathcal{A}_{(\varepsilon_0, \delta_0), C}$ contains all sequentially interactive algorithms and all fully interactive, $C$-compositional (defined in Appendix B.1.5) algorithms.** If $\mathcal{A}$ is sequentially interactive or $O(1)$-compositional, write $\mathcal{A} \in \mathcal{A}_{(\varepsilon_0, \delta_0)}$. Most DP algorithms in the literature are $C$-compositional. For example, any algorithm that uses the strong composition theorems of [88, 136, 1] for its privacy analysis is 1-compositional: see Appendix B.1.5.2. In particular, Noisy MB-SGD is 1-compositional, hence it is in $\mathcal{A}_{(\varepsilon_0, \delta_0)}$.

**Sequentially interactive algorithms can query silos adaptively in sequence, but cannot query any one silo more than once. Fully interactive algorithms can query each silo any number of times. See [133] for further discussion.
Theorem 35 (Informal). Let $\varepsilon_0 \in (0, \sqrt{N}]$, $\delta_0 = o(1/nN)$, $M = N$, and $\mathcal{A} \in \mathcal{A}_{(\varepsilon_0, \delta_0), C}$. Then:

1. There exists an $f \in \mathcal{F}_{L,D}$ and a distribution $P$ such that for $X \sim P^{nN}$, we have:

$$\mathbb{E} F(\mathcal{A}(X)) - F^* = \widetilde{\Omega}\left( \frac{\phi D}{\sqrt{Nn}} + LD \min\left\{ 1, \frac{\sqrt{d}}{\varepsilon_0 n \sqrt{N} C^2} \right\} \right).$$

2. There exists a $\mu$-smooth $f \in \mathcal{F}_{\mu,L,D}$ and distribution $P$ such that for $X \sim P^{nN}$, we have

$$\mathbb{E} F(\mathcal{A}(X)) - F^* = \widetilde{\Omega}\left( \frac{\phi^2}{\mu n N} + LD \min\left\{ 1, \frac{d}{\varepsilon_0^2 n^2 N C^4} \right\} \right).$$

Further, if $\mathcal{A} \in \mathcal{A}$, then the above lower bounds hold with $C = 1$.
The lower bounds for $\mathcal{A}_{(\varepsilon_0, \delta_0)}$ are nearly tight†† by Theorem 34. The first term in each of the lower bounds is the optimal non-private rate [200]; the second terms are proved in Appendix B.1.5.4. In particular, if $d \lesssim \varepsilon_0^2 n$, then the non-private term in each bound is dominant, so the ISRL-DP rates match the respective non-private rates, resulting in "privacy for free." The ISRL-DP rates sit between the rates for LDP and CDP: higher trust allows for higher accuracy. For example, for $\mathcal{F}_{L,D}$, the LDP rate is $\Theta(LD \min\{1, \sqrt{d}/\varepsilon_0 \sqrt{N}\})$ [85], and the CDP rate is $\Theta(\phi D/\sqrt{Nn} + LD \min\{1, \sqrt{d}/\varepsilon_0 n N\})$ [29].

Theorem 35 is more generally applicable than existing LDP and CDP lower bounds. When $n = 1$, ISRL-DP is equivalent to LDP and Theorem 35 recovers the LDP lower bounds [85, 230]. When $N = 1$, ISRL-DP is equivalent to CDP and Theorem 35 recovers the CDP lower bounds [29].

Obtaining our more general lower bounds under the more complex (relative to LDP and CDP) notion of ISRL-DP is challenging. The lower bound approaches of [85, 230, 83] are heavily tailored to the LDP setting. Further, the applicability of the standard CDP lower bound framework (e.g., [32, 29]) to ISRL-DP FL is unclear.

††Up to logarithms, a factor of $\phi/L$, and, for the strongly convex case, a factor of $\mu D/L$. If $d > \varepsilon_0^2 n^2 N$, then the ISRL-DP algorithm that outputs any $w_0 \in \mathcal{W}$ attains the matching upper bound $O(LD)$.
In light of these challenges, we take a different approach to proving Theorem 35: We first analyze the central DP guarantees of $\mathcal{A}$ when silo data sets $X_1, \cdots, X_N$ are shuffled in each round, showing that CDP amplifies to $\varepsilon = \widetilde{O}(\varepsilon_0/\sqrt{N})$. We could not have concluded this from existing amplification results [89, 96, 24, 56, 25], since these results are all limited to sequentially interactive LDP algorithms and $n = 1$. Thus, we prove the first privacy amplification by shuffling bound for fully interactive ISRL-DP FL algorithms. Then, we apply the CDP lower bounds of [29] to $\mathcal{A}_s$, the "shuffled" version of $\mathcal{A}$. This implies that the shuffled algorithm $\mathcal{A}_s$ has excess population loss that is lower bounded as in Theorem 35. Finally, we observe that the i.i.d. assumption implies that $\mathcal{A}_s$ and $\mathcal{A}$ have the same expected population loss. Note that our proof techniques can also be used to obtain ISRL-DP lower bounds for other problems in which a CDP lower bound is known.
3.2.2 Inter-Silo Record-Level DP FL with Heterogeneous Silo Data

Consider the non-i.i.d. FL problem, where $F_i(w)$ takes the form of Equation 3.1 for some unknown distributions $P_i$ on $\mathcal{X}_i$, for all $i$. The uniform stability approach that we used to obtain our i.i.d. upper bounds does not work in this setting.‡‡ Instead, we directly minimize $F$ by modifying Noisy MB-SGD as follows:

1. We draw disjoint batches of $K$ local samples without replacement from each silo and set $R = \lfloor n/K \rfloor$. Thus, stochastic gradients are independent across iterations, so our bounds apply to $F$.

2. We use acceleration [108] to increase the convergence rate.

3. To provide ISRL-DP, we re-calibrate $\sigma^2$ and apply parallel composition [191].

‡‡Specifically, Lemma 98 in Appendix B.1.4 does not apply without the i.i.d. assumption.

After these modifications, we call the resulting algorithm One-Pass Accelerated Noisy MB-SGD. It is described in Algorithm 1. In the strongly convex case, we use a multi-stage implementation of One-Pass
Accelerated Noisy MB-SGD (inspired by [109]) to further expedite convergence: see Appendix B.1.6.1 for details. Carefully tuning step sizes, $\sigma^2$, and $K$ yields:

Theorem 36 ($M = N$ case). Let $f(\cdot, x)$ be $\beta$-smooth for all $x \in \mathcal{X}$. Assume $\varepsilon_0 \leq 8\ln(1/\delta_0)$, $\delta_0 \in (0, 1)$. Then One-Pass Accelerated Noisy MB-SGD is $(\varepsilon_0, \delta_0)$-ISRL-DP. Moreover, if $M = N$, then:

1. For convex $f(\cdot, x)$, we have

$$\mathbb{E} F(w_R^{ag}) - F^* \lesssim \frac{\phi D}{\sqrt{Nn}} + \left( \frac{\beta^{1/4} L D^{3/2} \sqrt{d \ln(1/\delta_0)}}{\varepsilon_0 n \sqrt{N}} \right)^{4/5}. \tag{3.6}$$

2. For $\mu$-strongly convex $f(\cdot, x)$ with $\kappa = \frac{\beta}{\mu}$, we have

$$\mathbb{E} F(w_R^{ag}) - F^* = \widetilde{O}\left( \frac{\phi^2}{\mu n N} + \frac{\sqrt{\kappa} L^2}{\mu} \frac{d \ln(1/\delta_0)}{\varepsilon_0^2 n^2 N} \right). \tag{3.7}$$
Remarkably, the bound in 3.7 nearly matches the optimal i.i.d. bound 3.5 up to the factor $\widetilde{O}(\sqrt{\kappa})$. In particular, for well-conditioned loss functions, our algorithm achieves the optimal i.i.d. rates even when silo data is arbitrarily heterogeneous. The gap between the bound Equation 3.6 and the i.i.d. bound Equation 3.4 is $O((\sqrt{d}/\varepsilon_0 n \sqrt{N})^{1/5})$. Closing the gaps between the non-i.i.d. upper bounds in Theorem 36 and the i.i.d. lower bounds in Theorem 35 is left as an open problem. Compared to previous upper bounds, Theorem 36 is a major improvement: see Appendix B.1.1 for details.

In Appendix B.1.6.2, we provide a general version (and proof) of Theorem 36 for $M \leq N$. If $M < N$ but $M$ is sufficiently large or silo heterogeneity is sufficiently small, then the same bounds in Theorem 36 hold with $N$ replaced by $M$. Intuitively, the $M < N$ case is harder when data is highly heterogeneous, since stochastic estimates of $\nabla F$ will have larger variance. In Lemma 128 (Appendix B.1.6.2), we use a combinatorial argument to bound the variance of stochastic gradients.
Algorithm 1 Accelerated Noisy MB-SGD

1: Input: Data $X_i \in \mathcal{X}_i^n$, $i \in [N]$, strong convexity modulus $\mu \geq 0$, noise parameter $\sigma^2$, iteration number $R \in \mathbb{N}$, batch size $K \in [n]$, step size parameters $\{\eta_r\}_{r \in [R]}$, $\{\alpha_r\}_{r \in [R]}$.
2: Initialize $w_0^{ag} = w_0 \in \mathcal{W}$ and $r = 1$.
3: for $r \in [R]$ do
4: &nbsp;&nbsp; Server updates and broadcasts $w_r^{md} = \frac{(1-\alpha_r)(\mu + \eta_r)}{\eta_r + (1-\alpha_r^2)\mu} w_{r-1}^{ag} + \frac{\alpha_r[(1-\alpha_r)\mu + \eta_r]}{\eta_r + (1-\alpha_r^2)\mu} w_{r-1}$.
5: &nbsp;&nbsp; for $i \in S_r$ in parallel do
6: &nbsp;&nbsp;&nbsp;&nbsp; Silo $i$ draws $\{x_{i,j}^r\}_{j=1}^K$ from $X_i$ (replace samples for ERM) and noise $u_i \sim N(0, \sigma^2 I_d)$.
7: &nbsp;&nbsp;&nbsp;&nbsp; Silo $i$ computes $\widetilde{g}_r^i := \frac{1}{K} \sum_{j=1}^K \nabla f(w_r^{md}, x_{i,j}^r) + u_i$.
8: &nbsp;&nbsp; end for
9: &nbsp;&nbsp; Server aggregates $\widetilde{g}_r := \frac{1}{M_r} \sum_{i \in S_r} \widetilde{g}_r^i$ and updates $w_r := \operatorname{argmin}_{w \in \mathcal{W}} \big\{ \alpha_r \big[ \langle \widetilde{g}_r, w \rangle + \frac{\mu}{2}\|w_r^{md} - w\|^2 \big] + \big[ (1-\alpha_r)\frac{\mu}{2} + \frac{\eta_r}{2} \big] \|w_{r-1} - w\|^2 \big\}$.
10: &nbsp;&nbsp; Server updates and broadcasts $w_r^{ag} = \alpha_r w_r + (1-\alpha_r) w_{r-1}^{ag}$.
11: end for
12: Output: $w_R^{ag}$.
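A minimal sketch (ours, for illustration only) of the server-side bookkeeping in one round of Algorithm 1: we assume $\mathcal{W}$ is a Euclidean ball, so the argmin in line 9 — a quadratic with Hessian $(\mu + \eta_r)I$ — is solved exactly by computing the unconstrained minimizer and projecting it onto $\mathcal{W}$; lines 5–8 are abstracted into a `noisy_grad` callback standing in for the silos' noisy minibatch gradients.

```python
import random

def algorithm1_round(w_prev, w_ag_prev, noisy_grad, mu, eta, alpha, radius):
    """One round of Accelerated Noisy MB-SGD (Algorithm 1), server side.
    noisy_grad(w_md) plays the role of lines 5-9: the average of the silos'
    noisy minibatch gradients evaluated at the broadcast point w_md."""
    denom = eta + (1.0 - alpha ** 2) * mu
    c_ag = (1.0 - alpha) * (mu + eta) / denom
    c_w = alpha * ((1.0 - alpha) * mu + eta) / denom
    w_md = [c_ag * a + c_w * b for a, b in zip(w_ag_prev, w_prev)]  # line 4
    g = noisy_grad(w_md)                                            # lines 5-9
    # Line 9: the objective is (mu+eta)/2 * ||w - u||^2 + const, so the
    # constrained minimizer over the ball W is the projection of u.
    w_new = [(alpha * mu * m + ((1 - alpha) * mu + eta) * p - alpha * gi)
             / (mu + eta)
             for m, p, gi in zip(w_md, w_prev, g)]
    norm = sum(x * x for x in w_new) ** 0.5
    if norm > radius:
        w_new = [x * radius / norm for x in w_new]
    w_ag = [alpha * a + (1 - alpha) * b
            for a, b in zip(w_new, w_ag_prev)]                      # line 10
    return w_new, w_ag
```

With constant $\alpha_r, \eta_r$ and a noisy quadratic gradient oracle, iterating this round drives $w^{ag}$ to the minimizer up to noise-level error.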
3.2.3 Inter-Silo Record-Level DP Federated ERM

In this section, we provide an ISRL-DP FL algorithm, Accelerated Noisy MB-SGD, with optimal excess empirical risk. The difference between our proposed algorithm and One-Pass Accelerated Noisy MB-SGD is that silo $i$ now samples from $X_i$ with replacement in each round: see line 6 in Algorithm 1. This allows us to a) amplify privacy via local subsampling and advanced composition/moments accountant, allowing for smaller $\sigma^2$; and b) run more iterations of our algorithm to better optimize $\widehat{F}_X$. These modifications are necessary for obtaining the optimal rates for federated ERM. We again employ Nesterov smoothing [201] to extend our results to non-smooth $f$.

Theorem 37 (Informal). Let $\varepsilon_0 \leq 2\ln(2/\delta_0)$, $\delta_0 \in (0, 1)$. Then, there exist algorithmic parameters such that Algorithm 1 is $(\varepsilon_0, \delta_0)$-ISRL-DP and:

1. If $f(\cdot, x)$ is convex, then

$$\mathbb{E} \widehat{F}_X(w_R^{ag}) - \widehat{F}_X^* = \widetilde{O}\left( \frac{LD \sqrt{d \ln(1/\delta_0)}}{\varepsilon_0 n \sqrt{M}} \right). \tag{3.8}$$

2. If $f(\cdot, x)$ is $\mu$-strongly convex, then

$$\mathbb{E} \widehat{F}_X(w_R^{ag}) - \widehat{F}_X^* = \widetilde{O}\left( \frac{L^2}{\mu} \frac{d \ln(1/\delta_0)}{\varepsilon_0^2 n^2 M} \right). \tag{3.9}$$

See Appendix B.1.7.1 for the complete statement and proof. With non-random $M_r = M$, [111] provides an upper bound for (non-strongly) convex ISRL-DP ERM that nearly matches the one we provide in Equation 3.8. However, Algorithm 1 achieves the upper bounds for convex and strongly convex loss in fewer rounds of communication than [111]: see Appendix B.1.7.1 for details. In Appendix B.1.7.2, we obtain matching lower bounds, establishing the optimality of Algorithm 1 for ERM.
3.2.4 Shuffle DP Federated Learning

Assume access to a secure shuffler and fix $M_r = M$. In each round $r$, the shuffler receives reports $(Z_r^{(1)}, \cdots, Z_r^{(M)})$ from active silos (we assume $S_r = [M]$ here for concreteness), draws a uniformly random permutation $\pi$ of $[M]$, and then sends $(Z_r^{(\pi(1))}, \cdots, Z_r^{(\pi(M))})$ to the server for aggregation. When this shuffling procedure is combined with ISRL-DP Noisy Distributed MB-SGD, we obtain:

Theorem 38 (i.i.d.). Let $\varepsilon \leq \ln(2/\delta)$, $\delta \in (0, 1)$. Then there are choices of algorithmic parameters such that Shuffled Noisy MB-SGD is $(\varepsilon, \delta)$-SDP. Moreover:

1. If $f(\cdot, x)$ is convex, then

$$\mathbb{E} F(\widehat{w}_R) - F^* = \widetilde{O}\left( LD \left( \frac{1}{\sqrt{nM}} + \frac{\sqrt{d \ln(1/\delta)}}{\varepsilon n N} \right) \right).$$

2. If $f(\cdot, x)$ is $\mu$-strongly convex, then

$$\mathbb{E} F(\widehat{w}_R) - F^* = \widetilde{O}\left( \frac{L^2}{\mu} \left( \frac{1}{nM} + \frac{d \ln(1/\delta)}{\varepsilon^2 n^2 N^2} \right) \right).$$

See Appendix B.1.8.1 for details and proof. When $M = N$, the rates in Theorem 38 match the optimal central DP rates [29], and are attained without a trusted server. Thus, with shuffling, Noisy MB-SGD is simultaneously optimal for i.i.d. FL in the inter-silo and central DP models.
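The shuffler's role is nothing more than applying a uniformly random permutation to the silos' reports, which detaches each report from the silo that sent it; a minimal sketch in our own notation (not the thesis's pseudocode):

```python
import random

def shuffle_reports(reports, rng=random):
    """Secure-shuffler step: apply a uniformly random permutation pi of [M]
    to the silos' reports before they reach the server, so the server cannot
    attribute any report to the silo that produced it."""
    pi = list(range(len(reports)))
    rng.shuffle(pi)
    return [reports[i] for i in pi]

def server_average(shuffled):
    """The server only ever sees the multiset of reports, e.g. for averaging."""
    d = len(shuffled[0])
    return [sum(r[j] for r in shuffled) / len(shuffled) for j in range(d)]
```

Because averaging is permutation-invariant, shuffling changes nothing about the server's update, only the privacy analysis.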
If silo data is heterogeneous, then we use a shuffle DP variation of One-Pass Accelerated Noisy MB-SGD, described in Appendix B.1.8.2, to get:

Theorem 39 (Non-i.i.d.). Assume $f(\cdot, x)$ is $\beta$-smooth for all $x \in \mathcal{X}$. Let $\varepsilon \leq 15$, $\delta \in (0, \frac{1}{2})$. Then, there is an $(\varepsilon, \delta)$-SDP variation of One-Pass Accelerated Noisy MB-SGD such that for $M = N$, we have:

1. If $f(\cdot, x)$ is convex, then

$$\mathbb{E} F(w_R^{ag}) - F^* \lesssim \frac{\phi D}{\sqrt{Nn}} + \left( \frac{\beta^{1/4} L D^{3/2} \sqrt{d} \ln(d/\delta)}{\varepsilon n N} \right)^{4/5}. \tag{3.10}$$

2. If $f(\cdot, x)$ is $\mu$-strongly convex with $\kappa = \frac{\beta}{\mu}$, then

$$\mathbb{E} F(w_R^{ag}) - F^* = \widetilde{O}\left( \frac{\phi^2}{\mu n N} + \frac{\sqrt{\kappa} L^2}{\mu} \frac{d \ln(1/\delta)}{\varepsilon^2 n^2 N^2} \right). \tag{3.11}$$

The bound in 3.11 matches the optimal i.i.d., central DP bound [29] up to $\widetilde{O}(\sqrt{\kappa})$. Hence, if $f$ is not ill-conditioned, then 3.11 shows that it is not necessary to have either i.i.d. data or a trusted server to attain the optimal CDP rates. In both the convex and strongly convex cases, even in the special case of $N = 1$ silo with i.i.d. data, the bounds in Theorem 39 are state-of-the-art among linear-time DP algorithms whose privacy analysis does not require convexity. See Appendix B.1.8.2 for the proof and the $M < N$ case.
3.2.5 Numerical Experiments

We validate our theoretical findings with three sets of experiments. Our results indicate that ISRL-DP MB-SGD yields accurate, private models in practice. Our method performs well even relative to non-private Local SGD, a.k.a. FedAvg [188], and outperforms ISRL-DP Local SGD for most privacy levels. Appendix B.1.10 contains details of the experiments and additional results.§§

Binary Logistic Regression with MNIST: Following [265], we consider binary logistic regression on MNIST [160]. The task is to classify digits as odd or even. Each of 25 odd/even digit pairings is assigned to a silo ($N = 25$). Figure 3.5 shows that ISRL-DP MB-SGD outperforms (non-private) Local SGD for $\varepsilon \geq 12$ and consistently outperforms ISRL-DP Local SGD.

Linear Regression with Health Insurance Data: We divide the data set [59] ($d = 7$, $n \times N = 1338$) into $N$ silos based on the level of the target: patient medical costs. Figure 3.6 shows that ISRL-DP MB-SGD outperforms ISRL-DP Local SGD, especially in the high-privacy regime $\varepsilon \leq 2$. For $\varepsilon \geq 2$, ISRL-DP MB-SGD is in line with (non-private) Local SGD.

Multiclass Logistic Regression with Obesity Data: We train a softmax regression model for a 7-way classification task with an obesity data set [206]. We divide the data into $N = 7$ heterogeneous silos based on the value of the target variable, obesity level, which takes 7 values: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III. As shown in Figure 3.7, our algorithm significantly outperforms ISRL-DP Local SGD by 10–30% across all privacy levels.

§§We also describe the ISRL-DP Local SGD baseline in Appendix B.1.10.
[Figure 3.5 plots test error against $\varepsilon$ for ISRL-DP MB-SGD, ISRL-DP Local SGD, non-private MB-SGD, and non-private Local SGD after 35 rounds, for $N = 25$ with $M = 25$ ($K = 443$) and $M = 12$.]

Figure 3.5: Binary logistic regression on MNIST. $\delta = 1/n^2$. 90% error bars are shown.
[Figure 3.6 plots test error against $\varepsilon$ for the same four methods after 35 rounds, for $(N = 3, M = 2, K = 65)$ and $(N = 5, M = 5, K = 39)$.]

Figure 3.6: Linear regression on health insurance data. $\delta = 1/n^2$. 90% error bars are shown.
[Figure 3.7 plots test error against $\varepsilon$ for the same four methods after 35 rounds, for $N = 7$ with $M = 3$ and $M = 7$.]

Figure 3.7: Softmax regression on obesity data. $\delta = 1/n^2$.
3.3 Private Non-Convex Federated Learning Without a Trusted Server

In this section, we study the more general DP FL problem setting in which the loss function $f(\cdot, x)$ may be non-convex. Specifically, we will tackle Question 3 posed in Section 3.1. We will require some additional notation and assumptions:

Notation: A proper function has range $\mathbb{R} \cup \{+\infty\}$ and is not identically equal to $+\infty$. A function $g$ is closed if, for all $\alpha \in \mathbb{R}$, the set $\{w \in \operatorname{dom}(g) \mid g(w) \leq \alpha\}$ is closed. The indicator function of $\mathcal{W}$ is

$$\iota_{\mathcal{W}}(w) := \begin{cases} 0 & \text{if } w \in \mathcal{W}, \\ +\infty & \text{otherwise.} \end{cases}$$

The proximal operator of a function $f^1$ is $\operatorname{prox}_{\eta f^1}(z) := \operatorname{argmin}_{y \in \mathbb{R}^d} \left( \eta f^1(y) + \frac{1}{2}\|y - z\|^2 \right)$, for $\eta > 0$. Let $\widehat{\Delta}_X := \widehat{F}_X(w_0) - \widehat{F}_X^*$, with $\widehat{F}_X^* = \inf_w \widehat{F}_X(w)$.
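The proximal operator has simple closed forms for the choices of $f^1$ appearing below: soft-thresholding for $\lambda\|\cdot\|_1$, and Euclidean projection for $\iota_{\mathcal{W}}$ with $\mathcal{W}$ a ball. A small sketch in our own notation:

```python
def prox_l1(z, eta, lam):
    """prox_{eta * lam * ||.||_1}(z): coordinatewise soft-thresholding,
    shrinking each coordinate toward zero by eta*lam and clipping at zero."""
    t = eta * lam
    return [max(abs(zj) - t, 0.0) * (1.0 if zj >= 0 else -1.0) for zj in z]

def prox_indicator_ball(z, radius):
    """prox of iota_W for W = {w : ||w|| <= radius}, i.e., the Euclidean
    projection onto the ball (the prox of an indicator is a projection)."""
    norm = sum(x * x for x in z) ** 0.5
    return list(z) if norm <= radius else [x * radius / norm for x in z]
```

These two cases cover the examples of Assumption 3 (indicator functions of convex sets and $\ell_p$-norm regularizers with $p = 1$).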
We assume the loss function $f(w, x) = f^0(w, x) + f^1(w)$ is non-convex/non-smooth composite, where $f^0$ is bounded below, and:

Assumption 2. $f^0(\cdot, x)$ is $L$-Lipschitz (on $\mathcal{W}$ if $f^1 = \iota_{\mathcal{W}} + g$ for some convex $g \ge 0$; on $\mathbb{R}^d$ otherwise) and $\beta$-smooth for all $x \in \mathcal{X}$.

Assumption 3. $f^1$ is a proper, closed, convex function.

Examples of functions satisfying Assumption 3 include indicator functions of convex sets $\iota_{\mathcal{W}}$ and $\ell_p$-norm regularizers $\lambda \|w\|_p$.
3.3.1 Algorithms for Proximal-PL Losses

3.3.1.1 Noisy Distributed Proximal SGD for Heterogeneous FL (SO)
We propose a simple distributed Noisy Proximal SGD (Prox-SGD) method: in each round $r \in [R]$, available silos $i \in S_r$ draw local samples $\{x^r_{i,j}\}_{j=1}^{K := \lfloor n/R \rfloor}$ from $X_i$ (without replacement) and compute $\tilde{g}^r_i := \frac{1}{K}\sum_{j=1}^K \nabla f^0(w_r, x^r_{i,j}) + u_i$, where $u_i \sim N(0, \sigma^2 I_d)$ for ISRL-DP. For SDP, $u_i$ has a binomial distribution [55]. The server (or the silos) aggregates $\tilde{g}_r := \frac{1}{M}\sum_{i \in S_r} \tilde{g}^r_i$ and then updates the global model $w_{r+1} := \mathrm{prox}_{\frac{1}{2\beta} f^1}\big(w_r - \frac{1}{2\beta}\tilde{g}_r\big)$. The proximal step is needed to handle the potential non-smoothness of $f$. Pseudocode is in Appendix B.2.2.1.
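For intuition, one communication round of the Noisy Prox-SGD update can be simulated as below. This is a hypothetical sketch, not the thesis implementation: the least-squares loss, the $\ell_1$ choice of $f^1$, the synthetic silo data, and the noise scale $\sigma$ are all placeholder assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def prox_l1(z, eta, lam):
    # closed-form prox of eta*lam*||.||_1 (soft-thresholding)
    return np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)

def grad_f0(w, X, y):
    # average least-squares gradient over a silo's local minibatch
    return X.T @ (X @ w - y) / len(y)

def prox_sgd_round(w, silo_batches, beta, sigma, lam):
    """One round: each silo adds Gaussian noise locally (ISRL-DP),
    the server averages and takes a proximal step with stepsize 1/(2*beta)."""
    d = w.shape[0]
    g_tilde = [grad_f0(w, X, y) + rng.normal(0.0, sigma, size=d)
               for (X, y) in silo_batches]          # noisy local gradients
    g_bar = np.mean(g_tilde, axis=0)                # server aggregation
    eta = 1.0 / (2.0 * beta)
    return prox_l1(w - eta * g_bar, eta, lam)       # w_{r+1}

# toy run: M = 3 silos, K = 8 local samples each, d = 5 (all placeholders)
d, M, K = 5, 3, 8
w = np.zeros(d)
batches = [(rng.normal(size=(K, d)), rng.normal(size=K)) for _ in range(M)]
for _ in range(20):
    w = prox_sgd_round(w, batches, beta=5.0, sigma=0.1, lam=0.01)
```

A real ISRL-DP implementation would calibrate $\sigma$ to the gradient sensitivity and privacy budget; here it is an arbitrary constant.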
Assumption 4. The loss is $\mu$-PPL in expectation: for all $w \in \mathbb{R}^d$,
$$\mu\, \mathbb{E}\big[\hat{F}_S(w) - \inf_{w'} \hat{F}_S(w')\big] \le -\beta\, \mathbb{E}\left[\min_y \Big[\big\langle \nabla \hat{F}^0_S(w), y - w\big\rangle + \frac{\beta}{2}\|y - w\|^2 + f^1(y) - f^1(w)\Big]\right],$$
where $\hat{F}_S(w) := \frac{1}{MK}\sum_{i \in \mathcal{S}}\sum_{j=1}^K f(w, x_{i,j})$, $\mathcal{S} \subseteq [N]$ is a uniformly random subset of size $M$, $S = \{x_{i,j}\}_{i \in \mathcal{S},\, j \in [K]}$, and $x_{i,j} \sim P_i$. Denote $\kappa = \beta/\mu$.
As discussed earlier, many interesting losses (e.g., wide neural nets, linear regression) satisfy Assumption 4.
Theorem 40 (Noisy Prox-SGD: Heterogeneous PPL FL). Grant Assumption 4. Let $\varepsilon \le \min\{8\ln(1/\delta), 15\}$, $\delta \in (0, 1/2)$, and $n \ge \tilde{\Omega}(\kappa)$. Then, there exist $\sigma^2$ and $K$ such that:

1. ISRL-DP Prox-SGD is $(\varepsilon, \delta)$-ISRL-DP, and in $R = \tilde{O}(\kappa)$ communications, we have:
$$\mathbb{E} F(w_R) - F^* = \tilde{O}\left(\frac{L^2}{\mu}\left(\frac{\kappa^2 d \ln(1/\delta)}{\varepsilon^2 n^2 M} + \frac{\kappa}{Mn}\right)\right). \tag{3.12}$$
2. SDP Prox-SGD is $(\varepsilon, \delta)$-SDP for $M \ge N\min(\varepsilon/2, 1)$, and if $R = \tilde{O}(\kappa)$, then:
$$\mathbb{E} F(w_R) - F^* = \tilde{O}\left(\frac{L^2}{\mu}\left(\frac{\kappa^2 d \ln^2(d/\delta)}{\varepsilon^2 n^2 N^2} + \frac{\kappa}{Mn}\right)\right). \tag{3.13}$$
Remark 41 (Near-optimality and "privacy almost for free"). Let $M = N$. Then, the bound in Equation 3.13 nearly matches the central DP strongly convex, i.i.d. lower bound of [29] up to the factor $\tilde{O}(\kappa^2)$, without a trusted server, without convexity, and without i.i.d. silos. Further, if $\kappa d \log^2(d/\delta)/\varepsilon^2 \lesssim nN$, then Equation 3.13 matches the non-private strongly convex, i.i.d. lower bound of [3] up to a $\tilde{O}(\kappa)$ factor, providing privacy nearly for free, without convexity/homogeneity. The bound in Equation 3.12 is larger than the i.i.d., strongly convex, ISRL-DP lower bound of [181] by a factor of $\tilde{O}(\kappa^4)$.¶¶ Moreover, if $\kappa d \ln(1/\delta)/\varepsilon^2 \lesssim n$, then the ISRL-DP rate in Equation 3.12 matches the non-private, strongly convex, i.i.d. lower bound [3] up to $\tilde{O}(\kappa)$.
Theorem 40 is proved in Appendix B.2.2.2. Privacy follows from parallel composition [191] and the guarantees of the Gaussian mechanism [88] and the binomial-noised shuffle vector summation protocol [55]. The main idea of the excess loss proofs is to view each noisy proximal evaluation as an execution of objective perturbation [52]. Using techniques from the analysis of objective perturbation, we bound the key term arising from the descent lemma by the corresponding noiseless minimum plus an error term that scales with $\big\|\frac{1}{M}\sum_{i \in S_r} u_i\big\|^2$.
Our novel excess loss proof approach is necessary because, with the proximal operator and without convexity, the privacy noise cannot easily be separated from the non-private terms. By comparison, in the convex case, the proof of Theorem 34 uses non-expansiveness of projection and independence of the noise and data to bound $\mathbb{E}\|w_{r+1} - w_r\|^2$, which yields a bound on $\mathbb{E}F(w_r) - F^*$ by convexity. On the other hand, in the unconstrained PL case considered in [255, 141, 278], the excess loss proof is easy, but the result is essentially vacuous, since Lipschitzness on $\mathbb{R}^d$ is incompatible with all PL losses that we are aware of. The results mentioned above considered the simpler i.i.d. or ERM cases: balancing divergent silo distributions and privately reaching consensus on a single parameter $w_R$ that optimizes the global objective $F$ poses an additional difficulty.

¶¶ In the terminology of [181], Noisy Prox-SGD is $C$-compositional with $C = \sqrt{R} = \tilde{O}(\sqrt{\kappa})$.
3.3.1.2 Noisy Distributed Prox-PL-SVRG for Federated ERM
In this subsection, we assume $\hat{F}_X$ satisfies the proximal-PL inequality with parameter $\mu > 0$; i.e., for all $w \in \mathbb{R}^d$:
$$\mu\big[\hat{F}_X(w) - \inf_{w'} \hat{F}_X(w')\big] \le -\beta \min_y \Big[\big\langle \nabla \hat{F}^0_X(w), y - w\big\rangle + \frac{\beta}{2}\|y - w\|^2 + \hat{F}^1_X(y) - \hat{F}^1_X(w)\Big].$$
We propose new variance-reduced, accelerated ISRL-DP/SDP algorithms in order to achieve near-optimal rates in fewer communication rounds than would be possible with Noisy Prox-SGD. Our ISRL-DP Algorithm 3 for proximal-PL ERM, which builds on [128], iteratively runs ISRL-DP Prox-SVRG (Algorithm 2) with re-starts. See Appendix B.2.2.3 for our SDP algorithm, which is nearly identical to Algorithm 3 except that we use the binomial-noise-based shuffle protocol of [55] instead of Gaussian noise.
Algorithm 2 ISRL-DP FedProx-SVRG($w_0$)
1: Input: $E \in \mathbb{N}$, $K \in [n]$, $Q := \lfloor n/K \rfloor$, data $X$, $\eta > 0$, $\sigma_1, \sigma_2 \ge 0$, $\bar{w}_0 = w^Q_0 = w_0 \in \mathbb{R}^d$.
2: for $r \in \{0, 1, \cdots, E-1\}$ do
3:   Server updates $w^0_{r+1} = w^Q_r$.
4:   for $i \in S_r$ in parallel do
5:     Server sends the global model $w_r$ to silo $i$.
6:     Silo $i$ draws noise $u^i_1 \sim N(0, \sigma_1^2 I_d)$ and computes $\tilde{g}^i_{r+1} := \nabla \hat{F}^0_i(\bar{w}_r) + u^i_1$.
7:     for $t \in \{0, 1, \cdots, Q-1\}$ do
8:       Silo $i$ draws $K$ samples $x^{r+1,t}_{i,j}$ uniformly from $X_i$ with replacement (for $j \in [K]$) and noise $u^i_2 \sim N(0, \sigma_2^2 I_d)$.
9:       Silo $i$ computes $\tilde{v}^{t,i}_{r+1} = \frac{1}{K}\sum_{j=1}^K \big[\nabla f^0(w^t_{r+1}, x^{r+1,t}_{i,j}) - \nabla f^0(\bar{w}_r, x^{r+1,t}_{i,j})\big] + \tilde{g}^i_{r+1} + u^i_2$.
10:      Server aggregates $\tilde{v}^t_{r+1} = \frac{1}{M}\sum_{i \in S_{r+1}} \tilde{v}^{t,i}_{r+1}$ and updates $w^{t+1}_{r+1} := \mathrm{prox}_{\eta f^1}\big(w^t_{r+1} - \eta \tilde{v}^t_{r+1}\big)$.
11:    end for
12:    Server updates $\bar{w}_{r+1} = w^Q_{r+1}$.
13:  end for
14: end for
15: Output: $w_{\mathrm{priv}} \sim \mathrm{Unif}\big(\{w^t_{r+1}\}_{r=0,\cdots,E-1;\; t=0,\cdots,Q-1}\big)$.
Algorithm 3 ISRL-DP FedProx-PL-SVRG
1: Input: number of restarts $S \in \mathbb{N}$, and the inputs of Algorithm 2: $E \in \mathbb{N}$, $K \in [n]$, $Q := \lfloor n/K \rfloor$, data $X$, $\eta > 0$, $\sigma_1, \sigma_2 \ge 0$, $\bar{w}_0 = w^Q_0 = w_0 \in \mathbb{R}^d$.
2: for $s \in [S]$ do
3:   $w_s =$ ISRL-DP FedProx-SVRG($w_{s-1}$)
4: end for
5: Output: $w_{\mathrm{priv}} := w_S$.
The key component of ISRL-DP Prox-SVRG is line 9 of Algorithm 2: instead of using standard noisy stochastic gradients, silo $i$ computes the difference between the stochastic gradients at the current iterate $w^t_{r+1}$ and at the iterate $\bar{w}_r$ from the previous epoch, thereby reducing the variance of the noisy gradient estimator (which is still unbiased) and facilitating faster convergence (i.e., better communication complexity). Notice that the $\ell_2$-sensitivity of the variance-reduced gradient in line 9 is larger than the sensitivity of standard stochastic gradients (e.g., in line 6), so a larger $\sigma_2^2 > \sigma_1^2$ is required for ISRL-DP. However, the sensitivity only increases by a constant factor, which does not significantly affect utility. Algorithm 3 runs Algorithm 2 $S$ times with re-starts. For a suitable choice of algorithmic parameters, we have:
Theorem 42 (Noisy Prox-PL-SVRG: ERM). Let $\varepsilon \le \min\{15, 2\ln(2/\delta)\}$, $\delta \in (0, 1/2)$, $M = N$, $X \in \mathcal{X}^{n \times N}$, and $\kappa = \beta/\mu$. Then, in $\tilde{O}(\kappa)$ communication rounds, we have:

1. ISRL-DP Prox-PL-SVRG is $(\varepsilon, \delta)$-ISRL-DP and $\mathbb{E}\hat{F}_X(w_{\mathrm{priv}}) - \hat{F}^*_X = \tilde{O}\Big(\frac{\kappa L^2 d \ln(1/\delta)}{\mu \varepsilon^2 n^2 N}\Big)$.

2. SDP Prox-PL-SVRG is $(\varepsilon, \delta)$-SDP and $\mathbb{E}\hat{F}_X(w_{\mathrm{priv}}) - \hat{F}^*_X = \tilde{O}\Big(\frac{\kappa L^2 d \ln(1/\delta)}{\mu \varepsilon^2 n^2 N^2}\Big)$.

Expectations are solely over the randomness of $\mathcal{A}$. A similar result holds for $M < N$, provided silo data is not too heterogeneous. See Appendix B.2.2.4 for details and the proof.
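To make line 9 of Algorithm 2 concrete, the variance-reduced estimator can be sketched in isolation as below. The least-squares loss, synthetic data, and noise scales are placeholder assumptions of ours; the sketch illustrates that at the anchor point $\bar{w}_r$ the gradient-difference term vanishes exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_f0(w, x):
    # placeholder smooth loss gradient: least squares on a single pair (a, b)
    a, b = x
    return (a @ w - b) * a

def svrg_direction(w_t, w_bar, g_anchor, batch, sigma2):
    """Line-9 estimator: minibatch gradient difference between w_t and the
    epoch anchor w_bar, plus the (noisy) anchor gradient, plus fresh noise."""
    diff = np.mean([grad_f0(w_t, x) - grad_f0(w_bar, x) for x in batch], axis=0)
    return diff + g_anchor + rng.normal(0.0, sigma2, size=w_t.shape)

d, n = 4, 50
data = [(rng.normal(size=d), rng.normal()) for _ in range(n)]
w_bar = rng.normal(size=d)
# anchor: full local gradient at w_bar, noised once per epoch (line 6)
g_anchor = np.mean([grad_f0(w_bar, x) for x in data], axis=0) \
           + rng.normal(0.0, 0.05, size=d)
# at w_t = w_bar the difference term vanishes, so (with sigma2 = 0) the
# estimator equals the noisy anchor: variance reduction is exact there
v = svrg_direction(w_bar, w_bar, g_anchor, data[:8], sigma2=0.0)
assert np.allclose(v, g_anchor)
```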
Remark 43 (Near-optimality). The ISRL-DP and SDP bounds in Theorem 42 nearly match (respectively) the ISRL-DP and CDP strongly convex ERM lower bounds [181, 32] (for $f^1 = \iota_{\mathcal{W}}$) up to the factor $\tilde{O}(\kappa)$, and are attained without convexity.
3.3.2 Algorithms for Non-Convex/Non-Smooth Composite Losses

In this section, we consider private FL with general non-convex/non-smooth composite losses; i.e., we make no additional assumptions on $f$ beyond Assumption 2 and Assumption 3. In particular, we do not assume the proximal-PL condition, allowing for a range of constrained/unconstrained non-convex and non-smooth FL problems. For such a function class, finding global optima is not possible in polynomial time; optimization algorithms may only find stationary points in polynomial time. Thus, we measure the utility of our algorithms in terms of the norm of the proximal gradient mapping:
$$\hat{G}_\eta(w, X) := \frac{1}{\eta}\Big[w - \mathrm{prox}_{\eta f^1}\big(w - \eta \nabla \hat{F}^0_X(w)\big)\Big].$$
For proximal algorithms, this is a natural measure of stationarity [128, 258] which generalizes the standard notion of first-order stationarity (for smooth, unconstrained problems). In the smooth unconstrained case, $\|\hat{G}_\eta(w, X)\|$ reduces to $\|\nabla \hat{F}_X(w)\|$, which is often used to measure convergence in non-convex optimization. Building on [93, 258, 16], we propose Algorithm 4 for ISRL-DP FL with non-convex/non-smooth composite losses. Algorithm 4 is inspired by the optimality of non-private SPIDER for non-convex ERM [15].
posite losses. Algorithm 4 is inspired by the optimality of non-private SPIDER for non-convex ERM [15].
The essential elements of the algorithms are: the variance-reduced Stochastic Path-Integrated Differ-
ential EstimatoR of the gradient in line 11; and the non-standard choice of privacy noise in line 10 (in-
spired by [16]), in which we chooseσ 2
2
“
16β 2
Rlnp1{δ q
ε
2
n
2
. With careful choices ofη,σ 2
1
,ˆ σ 2
2
,q,R in ISRL-DP
FedProx-SPIDER, our algorithm achieves state-of-the-art utility:
Theorem 44 (ISRL-DP FedProx-SPIDER, $M = N$ version). Let $\varepsilon \le 2\ln(1/\delta)$. Then, ISRL-DP FedProx-SPIDER is $(\varepsilon, \delta)$-ISRL-DP. Moreover, if $M = N$, then
$$\mathbb{E}\big\|\hat{G}_\eta(w_{\mathrm{priv}}, X)\big\|^2 \lesssim \left(\frac{\sqrt{L\beta \hat{\Delta}_X\, d \ln(1/\delta)}}{\varepsilon n \sqrt{N}}\right)^{4/3} + \frac{L^2 d \ln(1/\delta)}{\varepsilon^2 n^2 N}.$$
Algorithm 4 ISRL-DP FedProx-SPIDER
1: Input: $R \in \mathbb{N}$, $K_1, K_2 \in [n]$, data $X$, $\eta > 0$, $\sigma_1^2, \sigma_2^2, \hat{\sigma}_2^2 \ge 0$, $q \in \mathbb{N}$, $w_0 \in \mathcal{W}$.
2: for $r \in \{0, 1, \cdots, R\}$ do
3:   for $i \in S_r$ in parallel do
4:     Server sends the global model $w_r$ to silo $i$.
5:     if $r \equiv 0 \pmod{q}$ then
6:       Silo $i$ draws $K_1$ samples $\{x^r_{i,j}\}_{j=1}^{K_1}$ uniformly at random from $X_i$ (with replacement) and $u^i_1 \sim N(0, \sigma_1^2 I_d)$.
7:       Silo $i$ computes $h^i_r = \frac{1}{K_1}\sum_{j=1}^{K_1} \nabla f^0(w_r, x^r_{i,j}) + u^i_1$.
8:       Server aggregates $h_r = \frac{1}{M_r}\sum_{i \in S_r} h^i_r$.
9:     else
10:      Silo $i$ draws $K_2$ samples $\{x^r_{i,j}\}_{j=1}^{K_2}$ uniformly at random from $X_i$ (with replacement) and $u^i_2 \sim N\big(0, I_d \min\{\sigma_2^2 \|w_r - w_{r-1}\|^2, \hat{\sigma}_2^2\}\big)$.
11:      Silo $i$ computes $H^i_r = \frac{1}{K_2}\sum_{j=1}^{K_2} \big[\nabla f^0(w_r, x^r_{i,j}) - \nabla f^0(w_{r-1}, x^r_{i,j})\big] + u^i_2$.
12:      Server aggregates $H_r = \frac{1}{M_r}\sum_{i \in S_r} H^i_r$ and $h_r = h_{r-1} + H_r$.
13:    end if
14:  end for
15:  Server updates $w_{r+1} = \mathrm{prox}_{\eta f^1}(w_r - \eta h_r)$.
16: end for
17: Output: $w_{\mathrm{priv}} \sim \mathrm{Unif}(\{w_r\}_{r=1,\cdots,R})$.
See Appendix B.2.3 for the general statement for $M \le N$ and the detailed proof. Theorem 44 provides the first utility bound for any kind of DP optimization problem (even centralized) with non-convex/non-smooth losses. In fact, the only work we are aware of that addresses DP non-convex optimization with $f^1 \ne 0$ is [30], which considers CDP constrained smooth non-convex SO with $N = 1$ (i.i.d.) and $f^1 = \iota_{\mathcal{W}}$. However, their noisy Frank-Wolfe method could not handle general non-smooth $f^1$. Further, handling $N > 1$ heterogeneous silos requires additional work.
The improved utility that our algorithm offers compared with existing DP FL works (discussed in Section 3.1) stems from the variance reduction that we get from: a) using smaller privacy noise that scales with $\beta\|w_t - w_{t-1}\|$ and shrinks as $t$ increases (in expectation); and b) using SPIDER updates. By $\beta$-smoothness of $f^0$, we can bound the sensitivity of the local updates and use standard DP arguments to prove ISRL-DP of Algorithm 4. A key step in the proof of the utility bound in Theorem 44 involves extending classical ideas from [46, pp. 269-271] for constrained convex optimization to the noisy, distributed, non-convex setting and leveraging non-expansiveness of the proximal operator in the right way.
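The SPIDER increment of lines 10 and 11, with the non-standard noise whose standard deviation scales with $\|w_r - w_{r-1}\|$ (capped at $\hat{\sigma}_2$), can be sketched as follows. The loss and all constants are placeholder assumptions of ours:

```python
import numpy as np

rng = np.random.default_rng(2)

def grad_f0(w, x):
    # placeholder smooth loss gradient: least squares on a single pair (a, b)
    a, b = x
    return (a @ w - b) * a

def spider_increment(w_r, w_prev, batch, sigma2_sq, sigma2_hat_sq):
    """H_r: minibatch gradient difference plus Gaussian noise whose variance
    scales with ||w_r - w_{r-1}||^2, capped at sigma2_hat_sq (line 10)."""
    var = min(sigma2_sq * np.sum((w_r - w_prev) ** 2), sigma2_hat_sq)
    diff = np.mean([grad_f0(w_r, x) - grad_f0(w_prev, x) for x in batch], axis=0)
    return diff + rng.normal(0.0, np.sqrt(var), size=w_r.shape)

# between full-gradient refreshes (r = 0 mod q), the server accumulates
# h_r = h_{r-1} + H_r, the path-integrated estimate
d = 3
data = [(rng.normal(size=d), rng.normal()) for _ in range(32)]
w_prev = rng.normal(size=d)
w_r = w_prev.copy()          # no movement => zero variance, zero increment
H = spider_increment(w_r, w_prev, data, sigma2_sq=4.0, sigma2_hat_sq=1.0)
assert np.allclose(H, 0.0)   # noise std is 0 when w_r == w_{r-1}
```

The shrinking noise is what drives the improved utility described above: as the iterates stabilize, less noise is injected while the same privacy guarantee holds via the smoothness-based sensitivity bound.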
Our SDP FedProx-SPIDER (Algorithm 24) is described in Appendix B.2.3. SDP FedProx-SPIDER is similar to Algorithm 4, except that the Gaussian noises are replaced by appropriately re-calibrated binomial noises plus shuffling. Its privacy and utility guarantees are provided in Theorem 45:
Theorem 45 (SDP FedProx-SPIDER, $M = N$ version). Let $\varepsilon \le \ln(1/\delta)$ and $\delta \in (0, \frac{1}{2})$. Then, there exist algorithmic parameters such that SDP FedProx-SPIDER is $(\varepsilon, \delta)$-SDP and
$$\mathbb{E}\big\|\hat{G}_\eta(w_{\mathrm{priv}}, X)\big\|^2 = \tilde{O}\left(\left[\frac{\sqrt{L\beta \hat{\Delta}_X\, d \ln(1/\delta)}}{\varepsilon n N}\right]^{4/3} + \frac{d L^2 \ln(1/\delta)}{\varepsilon^2 n^2 N^2}\right).$$
Our non-smooth, SDP federated ERM bound in Theorem 45 improves over the state-of-the-art CDP, smooth federated ERM bound of [257], which is $O(\sqrt{d}/(\varepsilon n N))$. We obtain this improved utility even under the weaker assumptions of non-smooth loss and no trusted server.
3.3.2.1 Lower Bounds

We complement our upper bounds with lower bounds:
Theorem 46 (Smooth Convex Lower Bounds, Informal). Let $\varepsilon \lesssim 1$ and $2^{-\Omega(nN)} \le \delta \le 1/(nN)^{1+\Omega(1)}$. Then, there is an $L$-Lipschitz, $\beta$-smooth, convex loss $f : \mathbb{R}^d \times \mathcal{X} \to \mathbb{R}$ and a database $X \in \mathcal{X}^{n \times N}$ such that any compositional, symmetric*** $(\varepsilon, \delta)$-ISRL-DP algorithm $\mathcal{A}$ satisfies
$$\mathbb{E}\big\|\nabla \hat{F}_X(\mathcal{A}(X))\big\|^2 = \Omega\left(L^2 \min\left\{1, \frac{d \ln(1/\delta)}{\varepsilon^2 n^2 N}\right\}\right).$$
Further, any $(\varepsilon, \delta)$-SDP algorithm satisfies
$$\mathbb{E}\big\|\nabla \hat{F}_X(\mathcal{A}(X))\big\|^2 = \Omega\left(L^2 \min\left\{1, \frac{d \ln(1/\delta)}{\varepsilon^2 n^2 N^2}\right\}\right).$$

*** See Appendix B.1.5 for precise definitions. To the best of our knowledge, every DP ERM algorithm proposed in the literature is compositional and symmetric.
The proof (and formal statement) of the ISRL-DP lower bound is relegated to Appendix B.2.3.1; the SDP lower bound follows directly from the CDP lower bound of [16]. Intuitively, it is not surprising that there is a gap between the non-convex/non-smooth upper bounds in Theorem 44 and the smooth, convex lower bounds, since smooth convex optimization is easier than non-convex/non-smooth optimization.††† As discussed in [16, Appendix B.2], the non-private lower bound of [15] provides some evidence that their CDP ERM bound (which our SDP bound matches when $M = N$) is tight for noisy gradient methods.‡‡‡ By Theorem 46, this would imply that our ISRL-DP ERM bound is also tight. Rigorously proving tight bounds is left as an interesting open problem.
3.3.3 Numerical Experiments
To evaluate the performance of ISRL-DP FedProx-SPIDER, we compare it against standard FL baselines for privacy levels ranging from $\varepsilon = 0.75$ to $\varepsilon = 18$: Minibatch SGD (MB-SGD), Local SGD (a.k.a. Federated Averaging) [188], ISRL-DP MB-SGD [181], and ISRL-DP Local SGD. We fix $\delta = 1/n^2$. Note that FedProx-SPIDER generalizes MB-SGD (take $q = 1$). Therefore, ISRL-DP FedProx-SPIDER always performs at least as well as ISRL-DP MB-SGD, with performance being identical when the optimal phase-length hyperparameter is $q = 1$.

The main takeaway from our numerical experiments is that ISRL-DP FedProx-SPIDER outperforms the other ISRL-DP baselines for most privacy levels. To quantify the advantage of our algorithm, we computed the percentage improvement in test error over the baselines in each experiment and privacy ($\varepsilon$) level, and averaged the results: our algorithm improves over ISRL-DP Local SGD by 6.06% on average and improves over ISRL-DP MB-SGD by 1.72%. Although the advantage over MB-SGD may not seem
††† For example, the non-private sample complexity of smooth convex SO is significantly smaller than the sample complexity of non-private non-convex SO [200, 102, 15].
‡‡‡ Note that the non-private first-order oracle complexity lower bound of [15] requires a very high-dimensional construction, restricting its applicability to the private setting.
substantial, it is promising that our algorithm dominated MB-SGD in every experiment: ISRL-DP MB-
SGD never outperformed ISRL-DP SPIDER for any value of ε. More details about the experiments and
additional results are provided in Appendix B.2.4.
Neural Net Classification with MNIST Data: Following [265, 181], we partition the MNIST [160] data set into $N = 25$ heterogeneous silos, each containing one odd/even digit pairing. The task is to classify digits as even or odd. We use a two-layer perceptron with a hidden layer of 64 neurons. As Figure 3.8 and Figure 3.9 show, ISRL-DP FedProx-SPIDER tends to outperform both ISRL-DP baselines.
Figure 3.8: MNIST data. $M = 25$, $R = 25$.
Convolutional Neural Net Classification with CIFAR-10 Data: The CIFAR-10 dataset [156] includes 10 image classes; we partition it into $N = 10$ heterogeneous silos, each containing one class. Following [215], we use a 5-layer CNN with two 5x5 convolutional layers (the first with 6 channels, the second with 16, each followed by a ReLU activation and 2x2 max pooling) and three fully connected layers with 120, 84, and 10 neurons, respectively. The first and second fully connected layers are
Figure 3.9: MNIST data. $M = 12$, $R = 50$.
followed by a ReLU activation. As Figure 3.10 and Figure 3.11 show, ISRL-DP FedProx-SPIDER outperforms both ISRL-DP baselines for most tested privacy levels.

Figure 3.10: CIFAR-10 data. $M = 10$, $R = 50$.
Neural Net Classification with Breast Cancer Data: With the Wisconsin Breast Cancer (Diagnosis) (WBCD) dataset [81], our task is to diagnose breast cancers as malignant vs. benign. We partition the data set into $N = 2$ heterogeneous silos, one containing malignant labels and the other benign labels. We use a 2-layer perceptron with 5 neurons in the hidden layer. As Figure 3.12 shows, ISRL-DP FedProx-SPIDER outperforms both ISRL-DP baselines for most tested privacy levels.
Figure 3.11: CIFAR-10 data. $M = 10$, $R = 100$.

Figure 3.12: WBCD data. $M = 2$, $R = 25$.
3.4 Concluding Remarks and Open Questions
This chapter considered FL without a trusted server and advocated for inter-silo record-level DP (ISRL-DP) as a practical privacy notion in this setting, particularly in cross-silo applications. We provided optimal ISRL-DP algorithms for convex/strongly convex FL in both the i.i.d. and ERM settings when all $M = N$ clients are able to communicate. The i.i.d. rates sit between the rates for the stringent "no trust" local DP and the relaxed "high trust" central DP notions, and allow for "privacy for free" when $d \lesssim \varepsilon_0^2 n$. As a side result, in Appendix B.1.5.5, we established the optimal rates for cross-device FL with user-level DP in the absence of a trusted server. Additionally, we devised an accelerated ISRL-DP algorithm to obtain state-of-the-art upper bounds for heterogeneous FL. We also gave a shuffle DP algorithm that (nearly) attains the optimal central DP rates for (non-)i.i.d. FL. A challenging open problem is to close the gap between our i.i.d. lower bounds and non-i.i.d. upper bounds: e.g., can the $\sqrt{\kappa}$ in Equation 3.7 be removed? Also, when $M < N$, are our upper bounds tight?
For two broad classes of non-convex loss functions, we provided novel ISRL-DP/SDP FL algorithms and utility bounds that advance the state of the art. For proximal-PL losses, our algorithms are nearly optimal and show that neither convexity nor i.i.d. data is required to obtain fast rates. Numerical experiments demonstrated the practicality of our algorithm at providing both high accuracy and privacy on several learning tasks and data sets. An interesting open problem is proving tight bounds on the gradient norm for private non-convex FL.
Author Contributions
Section 3.2 was conducted in collaboration with Meisam Razaviyayn. AL led the project and executed the
analysis and the experiments. MR initially proposed the problem of private FL, advised the project, and
provided feedback on multiple drafts.
Section 3.3 was a joint work with Ali Ghafelebashi and Meisam Razaviyayn. AL led the project, executed the theoretical analyses, wrote the paper, and guided the numerical experiments. AG executed the numerical experiments. MR advised and provided feedback on the initial drafts.
Acknowledgements
AL thanks Adam Smith for an enlightening discussion around one-pass algorithms and Xingyu Zhou for
carefully checking and providing feedback on our lower bound proofs.
Chapter 4

Private Stochastic Optimization with Large Worst-Case Lipschitz Parameter: Optimal Rates for (Non-Smooth) Convex Losses and Extension to Non-Convex Losses
This chapter is based on [182].
Summary
We study differentially private (DP) stochastic optimization (SO) with loss functions whose worst-case Lipschitz parameter over all data points may be extremely large. To date, the vast majority of work on DP SO (including the preceding two chapters of this thesis) assumes that the loss is uniformly Lipschitz continuous over data (i.e., stochastic gradients are uniformly bounded over all data points). While this assumption is convenient, it often leads to pessimistic excess risk bounds. In many practical problems, the worst-case (uniform) Lipschitz parameter of the loss over all data points may be extremely large due to outliers. In such cases, the error bounds for DP SO, which scale with the worst-case Lipschitz parameter of the loss, are vacuous. To address these limitations, this work provides near-optimal excess risk bounds that do not depend on the uniform Lipschitz parameter of the loss. Building on a recent line of work [253, 137], we assume that stochastic gradients have bounded $k$-th order moments for some $k \ge 2$. Compared with works on uniformly Lipschitz DP SO, our excess risk scales with the $k$-th moment bound instead of the uniform Lipschitz parameter of the loss, allowing for significantly faster rates in the presence of outliers and/or heavy-tailed data. For convex and strongly convex loss functions, we provide the first asymptotically optimal excess risk bounds (up to a logarithmic factor). In contrast to [253, 137], our bounds do not require the loss function to be differentiable/smooth. We also devise an accelerated algorithm for smooth losses that runs in linear time and has excess risk that is tight in certain practical parameter regimes. Additionally, our work is the first to address non-convex, non-uniformly Lipschitz loss functions satisfying the Proximal-PL inequality; this covers some practical machine learning models. Our Proximal-PL algorithm has near-optimal excess risk.
4.1 Introduction
This chapter focuses on the fundamental problem of DP stochastic (convex) optimization (S(C)O): given $n$ i.i.d. samples $X = (x_1, \ldots, x_n) \in \mathcal{X}^n$ from an unknown distribution $P$, we aim to privately solve
$$\min_{w \in \mathcal{W}} \big\{F(w) := \mathbb{E}_{x \sim P}[f(w, x)]\big\}, \tag{4.1}$$
where $f : \mathcal{W} \times \mathcal{X} \to \mathbb{R}$ is the loss function and $\mathcal{W} \subset \mathbb{R}^d$ is the parameter domain. Since finding the exact solution to (4.1) is not generally possible, we measure the quality of the obtained solution via excess risk (a.k.a. excess population loss): the excess risk of a (randomized) algorithm $\mathcal{A}$ for solving (4.1) is defined as $\mathbb{E} F(\mathcal{A}(X)) - \min_{w \in \mathcal{W}} F(w)$, where the expectation is taken over both the random draw of the data $X$ and the algorithm $\mathcal{A}$.

A large body of literature is devoted to characterizing the optimal achievable differentially private excess risk of (4.1) when the function $f(\cdot, x)$ is uniformly $L$-Lipschitz for all $x \in \mathcal{X}$; see, e.g., [29, 95, 19, 31, 181]. In these works, the gradient of $f$ is assumed to be uniformly bounded with $\sup_{w \in \mathcal{W}, x \in \mathcal{X}} \|\nabla_w f(w, x)\| \le L$,
and excess risk bounds scale with $L$. While this assumption is convenient for bounding the sensitivity [87] of the steps of the algorithm, it is often unrealistic in practice or leads to pessimistic excess risk bounds. In many practical applications, data contains outliers, or is unbounded or heavy-tailed (see, e.g., [65, 185, 266] and the references therein for such applications). Consequently, $L$ may be prohibitively large. For example, even the linear regression loss $f(w, x) = \frac{1}{2}(\langle w, x^{(1)}\rangle - x^{(2)})^2$ with compact $\mathcal{W}$ and data from $\mathcal{X} = \mathcal{X}^{(1)} \times \mathcal{X}^{(2)}$ leads to $L \ge \mathrm{diameter}(\mathcal{X}^{(1)})^2$, which could be huge or even infinite. Similar observations can be made for other useful ML models, such as deep neural nets [163], and the situation becomes even grimmer in the presence of heavy-tailed data. In these cases, existing excess risk bounds, which scale with $L$, become vacuous.
While $L$ can be very large in practice (due to outliers), the $k$-th moment of the stochastic gradients is often reasonably small for some $k \ge 2$ (see, e.g., Example 47). This is because the $k$-th moment $\tilde{r}_k := \mathbb{E}\big[\sup_{w \in \mathcal{W}} \|\nabla_w f(w, x)\|_2^k\big]^{1/k}$ depends on the average behavior of the stochastic gradients, while $L$ depends on the worst-case behavior over all data points. Motivated by this observation and building on the prior results [253, 137], this work characterizes the optimal differentially private excess risk bounds for the class of problems with a given parameter $\tilde{r}_k$. Specifically, for the class of problems with parameter $\tilde{r}_k$, we answer the following questions (up to a logarithmic factor):

• Question I: What are the minimax optimal rates for (strongly) convex DP SO?

• Question II: What utility guarantees are achievable for non-convex DP SO?

Prior works have made progress in addressing the first question above.* The work of [253] provided the first excess risk upper bounds for smooth DP (strongly) convex SO. [137] gave improved, yet suboptimal, upper bounds for smooth (strongly) convex $f(\cdot, x)$, and lower bounds for (strongly) convex SO. In this work, we provide optimal algorithms for convex and strongly convex losses, resolving Question I up to logarithmic factors. Our bounds hold even for non-differentiable/non-smooth $F$. Regarding Question II, we give the first algorithm for DP SO with non-convex loss functions satisfying the Proximal Polyak-Łojasiewicz (PL) condition [213, 142]. We provide a summary of our results for the case $k = 2$ in Figure 4.1, and a thorough discussion of related work in Appendix C.1.

* [253, 137] consider a slightly different problem class than the class $\tilde{r}_k$ that we consider: see Appendix C.2. However, our results imply asymptotically optimal rates for the problem class considered in [253, 137] under mild assumptions: see Appendix C.4.4.
4.1.1 Preliminaries
For a $\beta$-smooth, $\mu$-strongly convex function $g$, denote its condition number by $\kappa = \beta/\mu$. We assume that the stochastic gradient distributions have bounded $k$-th moment for some $k \ge 2$:

Assumption 5. There exist $k \ge 2$ and $\tilde{r}^{(k)} > 0$ such that $\mathbb{E}\big[\sup_{w \in \mathcal{W}} \|\nabla f(w, x)\|_2^k\big] \le \tilde{r}^{(k)}$ for all $\nabla f(w, x_i) \in \partial_w f(w, x_i)$. Denote $\tilde{r}_k := (\tilde{r}^{(k)})^{1/k}$.
Clearly, $\tilde{r}_k \le L = \sup_{\{\nabla f(w,x) \in \partial_w f(w,x)\}} \sup_{w,x} \|\nabla f(w, x)\|$, but this inequality is often very loose:
Example 47. For linear regression on a unit ball $\mathcal{W}$ with 1-dimensional data $x^{(1)}, x^{(2)} \in [-10^{10}, 10^{10}]$ having truncated normal distributions and $\mathrm{Var}(x^{(1)}) = \mathrm{Var}(x^{(2)}) \le 1$, we have $L \ge 10^{20}$. On the other hand, $\tilde{r}_k$ is much smaller than $L$ for $k < \infty$: e.g., $\tilde{r}_2 \le 5$, $\tilde{r}_4 \le 8$, and $\tilde{r}_8 \le 14$.
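The moment bounds in Example 47 can be checked numerically. A Monte Carlo sketch of our own, with standard normal data standing in for the truncated normals (the truncation at $\pm 10^{10}$ is numerically irrelevant when the variance is 1) and a scalar parameter $w \in [-1, 1]$:

```python
import numpy as np

rng = np.random.default_rng(3)

# linear regression loss f(w, x) = 0.5 * (w * x1 - x2)^2 on the unit ball
# |w| <= 1 with scalar data; sup_{|w|<=1} |grad| = (|x1| + |x2|) * |x1|
n = 200_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
sup_grad = (np.abs(x1) + np.abs(x2)) * np.abs(x1)

r2 = np.mean(sup_grad ** 2) ** 0.5      # estimate of r_2 (k = 2)
r4 = np.mean(sup_grad ** 4) ** 0.25     # estimate of r_4 (k = 4)

# the moments are modest even though the worst-case Lipschitz constant over
# the full data range [-1e10, 1e10] is astronomically large
assert r2 <= 5.0 and r4 <= 8.0
```

The estimates come out near 2.6 and 4.4, comfortably below the stated bounds of 5 and 8, while $L$ for the full data range is of order $10^{20}$.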
4.1.2 Contributions and Related Work

We discuss our contributions in the context of related work. See Figure 4.1 for a summary of our results when $k = 2$, and Appendix C.1 for a more thorough discussion of related work.
Optimal Rates for Non-Smooth (Strongly) Convex Losses (Section 4.3): We establish asymptotically optimal (up to logarithms) excess risk bounds for DP SCO under Assumption 5, without requiring differentiability of $f(\cdot, x)$:

Theorem 48 (Informal; see Theorem 50, Theorem 56, Theorem 57, Theorem 58). Let $f(\cdot, x)$ be convex and $L < \infty$. Grant Assumption 5. Then, there is a polynomial-time $\frac{\varepsilon^2}{2}$-zCDP algorithm $\mathcal{A}$ such that the excess risk of $\mathcal{A}$ is
$$\tilde{r}_{2k} D\left(\frac{1}{\sqrt{n}} + \left(\frac{\sqrt{d}}{\varepsilon n}\right)^{(k-1)/k}\right).$$
If $f(\cdot, x)$ is $\mu$-strongly convex, then the excess risk is
$$\frac{\tilde{r}_{2k}^2}{\mu}\left(\frac{1}{n} + \left(\frac{\sqrt{d}}{\varepsilon n}\right)^{(2k-2)/k}\right).$$
Further, these bounds are minimax optimal up to factors of $\tilde{O}(\tilde{r}_{2k}/\tilde{r}_k)$ and $\tilde{O}(\tilde{r}_{2k}^2/\tilde{r}_k^2)$, respectively.
As $k \to \infty$, $\tilde{r}_k, \tilde{r}_{2k} \to L$, and Theorem 48 recovers the known rates for uniformly $L$-Lipschitz DP SCO [29, 95]. However, when $k < \infty$ and $\tilde{r}_{2k} \ll L$, the excess risk bounds in Theorem 48 may be much smaller than the uniformly Lipschitz excess risk bounds, which increase with $L$.

The works [253, 137] make a slightly different assumption than Assumption 5: they instead assume that the $k$-th order central moment of each coordinate $\nabla_j f(w, x)$ is bounded by $\gamma^{1/k}$ for all $j \in [d]$, $w \in \mathcal{W}$. We also provide asymptotically optimal excess risk bounds for the class of problems satisfying the coordinate-wise moment assumption of [253, 137] and having subexponential stochastic subgradients: see Appendix C.4.4.
The previous state-of-the-art convex upper bound was suboptimal: $O\big(\tilde{r}_k D\sqrt{d/n}\big)$ for $\varepsilon \approx 1$ [137, Theorem 5.4].† Their result also required $f(\cdot, x)$ to be $\beta_f$-smooth for all $x \in \mathcal{X}$ with $\beta_f \le 10$, which can be restrictive with outlier data: e.g., this implies that $f(\cdot, x)$ is uniformly $L$-Lipschitz with $L \le \beta_f D \le 10D$ if $\nabla f(w^*(x), x) = 0$ for some $w^*(x) \in \mathcal{W}$. By comparison, our near-optimal bounds hold even for non-differentiable $f$ with $L \gg 1$.

Our optimal $\mu$-strongly convex bound also improves over the best previous upper bound of [137, Theorem 5.6], which required uniform $\beta_f$-smoothness of $f(\cdot, x)$. In fact, [137, Theorem 5.6] was incorrect, as we explain in Appendix C.3.‡ After we communicated with the authors of [137], they updated and corrected the result and proof in the arXiv version of their paper. The arXiv version of [137, Theorem 5.6] (which we derive in Appendix C.3 for completeness) is suboptimal by a factor of $\tilde{\Omega}((\beta_f/\mu)^3)$. In practice, the worst-case condition number $\beta_f/\mu$ can be very large, especially in the presence of outliers or heavy-tailed
† We write the bound in [137, Theorem 5.4] in terms of Assumption 5, replacing their $\gamma^{1/k} d$ by $\tilde{r}\sqrt{d}$.
‡ In short, the mistake is that Jensen's inequality is used in the wrong direction to claim that the $T$-th iterate $w_T$ of their algorithm satisfies $\mathbb{E}[\|w_T - w^*\|^2] \le (\mathbb{E}\|w_T - w^*\|)^2$, which is false.
[Figure 4.1 is a table: for each function class (non-smooth convex, smooth convex, non-smooth strongly convex, smooth strongly convex, proximal-PL), it lists the lower bound, upper bound, linear-time upper bound, and prior state of the art, with pointers to Theorems 50, 56, 57, 58, 59, 61, 63, and 167 and to [137, Theorems 5.4, 5.6, 6.1, and 6.4].]

Figure 4.1: Excess risk for $k = 2$, $\tilde{r} = \sqrt{d}$; we omit logarithms. $\kappa = \beta/\mu$ is the condition number of $F$; $\kappa_f = \beta_f/\mu$ is the worst-case condition number of $f(\cdot, x)$.
data. Our near-optimal excess risk bound removes this dependence on $\beta_f/\mu$ and holds even for non-differentiable $f$.
Our Algorithm 7 combines the iterative localization technique of [95, 19] with a noisy clipped subgradient method. With clipped (hence biased) stochastic subgradients and non-Lipschitz/non-smooth $f(\cdot, x)$, the excess risk analysis of our algorithm is harder than in the uniformly Lipschitz setting. Instead of the uniform convergence analysis used in [253, 137], we derive new results about the stability [146, 45] and generalization error of learning with loss functions that are not uniformly Lipschitz or differentiable; prior results (e.g., [227, 162]) were limited to $\beta_f$-smooth and/or $L$-Lipschitz $f(\cdot, x)$. Specifically, we show the following for non-Lipschitz/non-smooth $f(\cdot, x)$: a) on-average model stability [162] implies generalization (Proposition 53); and b) regularized empirical risk minimization is on-average model stable, hence it generalizes (Proposition 54). We combine these results with an empirical error bound for a biased, noisy subgradient method to bound the excess risk of our algorithm (Theorem 50). We obtain our strongly convex bound (Theorem 56) by a reduction to the convex case, à la [118, 95].

We also refine (to describe the dependence on $\tilde{r}_k, D, \mu$), extend (to $k \gg 1$), and tighten (for $\mu = 0$) the lower bounds of [137]: see Theorems 57 and 58.
Linear-Time Algorithms for Smooth (Strongly) Convex Losses (Section 4.4): For convex, $\beta$-smooth $F$, we provide a novel accelerated DP algorithm (Algorithm 8), building on the work of [108].§ Our algorithm is linear-time and attains excess risk that improves over the previous state-of-the-art (not linear-time) algorithm [137, Theorem 5.4] in practical parameter regimes (e.g., $d \gtrsim n^{1/6}$). The excess risk of our algorithm is tight in certain cases: e.g., $d \gtrsim (\varepsilon n)^{2/3}$ or "sufficiently smooth" $F$ (see Remark 60). Our excess risk bound holds even if $L = \infty$, which is the case, for instance, for linear regression with unbounded (e.g., Gaussian) data. To prove our bound, we give the first analysis of accelerated SGD with biased stochastic gradients.

For $\mu$-strongly convex, $\beta$-smooth losses, acceleration results in excessive bias accumulation, so we propose a simple noisy clipped SGD. Our algorithm builds on [137], but uses a lower-bias clipping mechanism from [26] and a new, tighter analysis. We attain excess risk that is near-optimal up to a
First Algorithm for Non-Convex (Proximal-PL) Losses (Section 4.5): We consider losses satisfying the Proximal Polyak-Łojasiewicz (PPL) inequality [213, 142] (Definition 62), an extension of the classical PL inequality to the proximal setting. This covers important models like (some) neural nets, linear/logistic regression, and LASSO [142, 163]. We propose a DP proximal clipped SGD to attain near-optimal excess risk that almost matches the strongly convex rate: see Theorem 63.

We also provide (in Appendix C.7) the first shuffle differentially private (SDP) [41, 56] algorithms for SO with large worst-case Lipschitz parameter. Our SDP algorithms achieve the same risk bounds as their zCDP counterparts without requiring a trusted curator.
4.2 Private Heavy-Tailed Mean Estimation Building Blocks
In each iteration of our SO algorithms, we need a way to privately estimate the mean ∇F(w_t) = E_{x∼P}[∇f(w_t, x)]. If f(·,x) is uniformly Lipschitz, then one can simply draw a random sample x_t from X and add noise to the

§ In contrast to [253, 137], we do not require f(·,x) to be β_f-smooth for all x.
stochastic gradient ∇f(w_t, x_t) to obtain a DP estimator of ∇F(w_t): the ℓ₂-sensitivity of the stochastic gradients is bounded by sup_{x,x'∈X} ‖∇f(w_t, x) − ∇f(w_t, x')‖ ≤ 2L, so the Gaussian mechanism guarantees DP (by Proposition 8). However, in the setting that we consider, L (and hence the sensitivity) may be huge, leading the privacy noise variance to also be huge. Thus, we clip the stochastic gradients (to force the sensitivity to be bounded) before adding noise. Specifically, we invoke Algorithm 5 on a batch of s stochastic gradients at each iteration of our algorithms. In Algorithm 5, Π_C(z) := argmin_{y∈B₂(0,C)} ‖y − z‖₂ denotes the projection onto the centered ℓ₂ ball of radius C in R^d. Lemma 49 bounds the bias and variance of Algorithm 5.
Algorithm 5 MeanOracle1({z_i}_{i=1}^s; s; C; ε²/2) [26]
1: Input: Z = {z_i}_{i=1}^s, C > 0, ε > 0. Set σ² = 4C²/(s²ε²) for ε²/2-zCDP.
2: Draw N ∼ N(0, σ²I_d) and compute ν̃ := (1/s) ∑_{i=1}^s Π_C(z_i) + N.
3: Output: ν̃.
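For concreteness, the clipped-mean oracle can be sketched in a few lines of NumPy (our own illustration; the function and variable names are not from the thesis):

```python
import numpy as np

def clip_l2(z, C):
    """Pi_C(z): projection of z onto the centered l2 ball of radius C."""
    norm = np.linalg.norm(z)
    return z if norm <= C else (C / norm) * z

def mean_oracle1(Z, C, eps, rng=None):
    """MeanOracle1 (Algorithm 5): average the l2-clipped vectors, then add
    Gaussian noise with sigma^2 = 4 C^2 / (s^2 eps^2), calibrated for
    (eps^2/2)-zCDP via the bounded l2-sensitivity 2C/s of the clipped mean."""
    rng = np.random.default_rng() if rng is None else rng
    Z = np.asarray(Z, dtype=float)
    s, d = Z.shape
    sigma = 2.0 * C / (s * eps)
    clipped_mean = np.mean([clip_l2(z, C) for z in Z], axis=0)
    return clipped_mean + rng.normal(0.0, sigma, size=d)
```

Per Lemma 49 below, shrinking C reduces the privacy noise dσ² but increases the clipping bias, so the algorithms in this chapter tune C to balance the two.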
Lemma 49 ([26]). Let {z_i}_{i=1}^s ∼ P^s be R^d-valued random vectors with E z_i = ν and E‖z_i‖^k ≤ r^{(k)} for some k ≥ 2. Denote the noiseless average of clipped samples by ν̂ := (1/s) ∑_{i=1}^s Π_C(z_i) and ν̃ := ν̂ + N. Then,

‖E ν̃ − ν‖ = ‖E ν̂ − ν‖ ≤ E‖ν̂ − ν‖ ≤ r^{(k)}/((k−1)C^{k−1}),  and  E‖ν̃ − E ν̃‖² = E‖ν̃ − E ν̂‖² ≤ dσ² + r^{(2)}/s.
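As a quick sanity check of the bias bound (our own snippet, not from the thesis), consider a simple one-dimensional two-point distribution with k = 2, where everything can be computed exactly; in one dimension the projection Π_C is just clipping to [−C, C]:

```python
import numpy as np

# two-point distribution: z = 0 w.p. 0.9, z = 10 w.p. 0.1, so nu = E z = 1.0
vals, probs = np.array([0.0, 10.0]), np.array([0.9, 0.1])
nu = vals @ probs                                    # true mean
k, C = 2, 2.0
r_k = probs @ np.abs(vals) ** k                      # E|z|^k = 10
clipped_mean_exact = probs @ np.clip(vals, -C, C)    # E Pi_C(z) = 0.2
bias = abs(clipped_mean_exact - nu)                  # = 0.8
assert bias <= r_k / ((k - 1) * C ** (k - 1))        # Lemma 49 bound: 5.0
```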
4.3 Optimal Rates for Non-Smooth (Strongly) Convex Losses
In this section, we establish the optimal rates (up to logarithms) for the class of DP SCO problems satis-
fying Assumption 5. We present our result for convex losses in Section 4.3.1, and our result for strongly
convex losses in Section 4.3.2. In Section 4.3.3, we provide lower bounds, which show that our upper
bounds are tight (up to logarithms).
4.3.1 Localized Noisy Clipped Subgradient Method for Convex Losses

Our algorithm (Algorithm 7) uses iterative localization [95, 19] with clipping (in Algorithm 6) to handle stochastic subgradients with large norm.¶
Algorithm 6 Noisy Clipped Regularized Subgradient Method for DP ERM
1: Input: Data X ∈ X^n, T ∈ N, stepsize η, clip threshold C, regularization λ ≥ 0, constraint set W, and initialization w_0 ∈ W.
2: for t ∈ {0, 1, ..., T − 1} do
3:   ∇̃F_t(w_t) := MeanOracle1({∇f(w_t, x_i)}_{i=1}^n; n; C; ε²/(2T)) for subgradients ∇f(w_t, x_i).
4:   w_{t+1} = Π_W[w_t − η(∇̃F_t(w_t) + λ(w_t − w_0))]
5: end for
6: Output: w_T.
Algorithm 7 Localized Noisy Clipped Subgradient Method for DP SCO
1: Input: Data X ∈ X^n, stepsize η, clip thresholds {C_i}_{i=1}^{log₂(n)}, iteration numbers {T_i}_{i=1}^{log₂(n)}, hyperparameter p ≥ 1.
2: Initialize w_0 ∈ W. Let l := log₂(n).
3: for i ∈ [l] do
4:   Set n_i = 2^{−i}n, η_i = 4^{−i}η, λ_i = 1/(η_i n_i^p), T_i = Θ̃(1/(λ_i η_i)), and D_i = 2L/λ_i.
5:   Draw new batch B_i of n_i = |B_i| samples from X without replacement.
6:   Let F̂_i(w) := (1/n_i) ∑_{j∈B_i} f(w, x_j) + (λ_i/2)‖w − w_{i−1}‖².
7:   Use Algorithm 6 with initialization w_{i−1} to minimize F̂_i over W_i := {w ∈ W : ‖w − w_{i−1}‖ ≤ D_i}, for T_i iterations with clip threshold C_i and noise σ_i² = 4C_i²T_i/(n_i²ε²). Let w_i be the output of Algorithm 6.
8: end for
9: Output: w_l.

¶ We assume WLOG that n = 2^l for some l ∈ N. If this is not the case, then throw out samples until it is; since the number of remaining samples is at least n/2, our bounds still hold up to a constant factor.
The main ideas of Algorithm 7 are:

1. Clipping only the non-regularized component of the subgradient to control sensitivity and bias: Notice that when we call Algorithm 6 in phase i of Algorithm 7, we only clip the subgradients of f(w, x_j) and not of the regularized loss f(w, x_j) + (λ/2)‖w − w_{i−1}‖². Compared to clipping the full gradient of the regularized loss, our selective clipping approach significantly reduces the bias of our subgradient estimator. This is essential for obtaining our near-optimal excess risk. Further, this reduction in bias comes at no cost to the variance of our subgradient estimator: the ℓ₂-sensitivity of our estimator is unaffected by the regularization term.

2. Solving the regularized ERM subproblem with a stable DP algorithm: We run a multi-pass zCDP solver on a regularized empirical loss: multiple passes let us reduce the noise variance in phase i by a factor of T_i (via strong composition for zCDP) and get a more accurate solution to the ERM subproblem. Regularization makes the empirical loss strongly convex, which improves on-average model stability and hence generalization of the obtained solution (see Propositions 53 and 161).

3. Localization [95, 18] (i.e. iteratively "zooming in" on a solution): In early phases (small i), when we are far away from the optimum w*, we use more samples (larger n_i) and a large learning rate η_i to make progress quickly. As i increases, w_i is closer to w*, so fewer samples and a slower learning rate suffice. Since the step size η_i shrinks (geometrically) faster than n_i, the effective variance of the privacy noise η_i²σ_i² decreases as i increases. This prevents w_{i+1} from moving too far away from w_i (and hence from w*). We further enforce this localization behavior by increasing the regularization parameter λ_i and shrinking D_i over time. We choose D_i as small as possible subject to the constraint that argmin_{w∈W} F̂_i(w) ∈ W_i. This constraint ensures that Algorithm 6 can find w_i with small excess risk.
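To see the localization effect of point 3 quantitatively, the following small script (ours; the Θ̃-constant in T_i is set to 1 for illustration) tabulates the phase schedule of line 4 in Algorithm 7 and checks that the effective privacy-noise variance η_i²σ_i² indeed shrinks each phase:

```python
import math

def phase_schedule(n, eta, p, C, eps):
    """Per-phase parameters of Algorithm 7: (n_i, eta_i, lam_i, T_i, eta_i^2 * sigma_i^2)."""
    phases = []
    for i in range(1, int(math.log2(n)) + 1):
        n_i = n / 2**i
        eta_i = eta / 4**i
        lam_i = 1.0 / (eta_i * n_i**p)
        T_i = 1.0 / (lam_i * eta_i)               # Theta-tilde(1/(lam_i eta_i)) = n_i^p
        sigma2_i = 4 * C**2 * T_i / (n_i**2 * eps**2)
        phases.append((n_i, eta_i, lam_i, T_i, eta_i**2 * sigma2_i))
    return phases

phases = phase_schedule(n=2**10, eta=1.0, p=1, C=1.0, eps=1.0)
eff_noise = [ph[-1] for ph in phases]
assert all(a > b for a, b in zip(eff_noise, eff_noise[1:]))   # decreases each phase
```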
Next, we provide privacy and excess risk guarantees for Algorithm 7:
Theorem 50. Grant Assumption 5. Let ε ≤ √d and let f(·,x) be convex and L-Lipschitz for all x, for some L < ∞. Then, there are algorithmic parameters such that Algorithm 7 is ε²/2-zCDP, and has excess risk

E F(w_l) − F* ≲ r̃_{2k} D (1/√n + (√(d ln(n))/(εn))^{(k−1)/k}).

Moreover, this excess risk is attained in Õ(n²L²/r̃_{2k}²) subgradient evaluations.
The excess risk bound in Theorem 50 is optimal up to a logarithmic factor.∥ A key feature of this bound is that it does not depend on L. The only cost of larger L is higher subgradient computational complexity. Further, Theorem 50 holds for any L < ∞ and even non-differentiable/non-smooth f. By contrast, prior works [253, 137] required uniform β_f-smoothness of f(·,x), which implies the restriction L ≤ β_f D for loss functions that have a vanishing gradient at some point.** In Section 4.4, we develop algorithms that can even handle the case L = ∞ and still give excess risk bounds that are tight in certain practical parameter regimes. The algorithms in Section 4.4 have runtime O(nd) (independent of L), which may make them more practical than Algorithm 7.
The proof of Theorem 50 consists of three main steps: i) We bound the empirical error of the noisy clipped subgradient subroutine (Lemma 51). ii) We prove that if an algorithm is on-average model stable (Definition 52), then it generalizes (Proposition 53). iii) We bound the on-average model stability of regularized ERM with non-smooth/non-Lipschitz f(·,x) (Proposition 161), leading to an excess population loss bound for Algorithm 6 run on the regularized empirical objective (c.f. line 7 of Algorithm 7). By using these results with the proof technique of [95], we can obtain Theorem 50.

First, we bound the empirical error of the step in line 7 of Algorithm 7, by extending the analysis of the noisy subgradient method to biased subgradient oracles:

∥ The lower bound construction in Theorem 57 is L-Lipschitz for L ≤ n²r̃_k < ∞.
** Additionally, [137] assumes β_f ≤ 10.
Lemma 51. Fix X ∈ X^n and let F̂_λ(w) = (1/n) ∑_{i=1}^n f(w, x_i) + (λ/2)‖w − w_0‖² for w_0 ∈ W, where W is a closed convex domain with diameter D. Assume f(·,x) is convex and r̂_n(X)^{(k)} ≥ sup_{w∈W} {(1/n) ∑_{i=1}^n ‖∇f(w, x_i)‖^k} for all ∇f(w, x_i) ∈ ∂_w f(w, x_i). Denote r̂_n(X) = [r̂_n(X)^{(k)}]^{1/k} and ŵ = argmin_{w∈W} F̂_λ(w). Let η ≤ 2/λ. Then, the output of Algorithm 6 satisfies

E‖w_T − ŵ‖² ≤ exp(−ληT/2)‖w_0 − ŵ‖² + (8η/λ)(r̂_n(X)² + λ²D² + dσ²) + (20/λ²)(r̂_n(X)^{(k)}/((k−1)C^{k−1}))²,

where σ² = 4C²T/(n²ε²).
Detailed proofs for this subsection are deferred to Appendix C.4.2.
Our next goal is to bound the generalization error of regularized ERM with convex loss functions that
are not differentiable or uniformly Lipschitz. We will use a stability argument to obtain such a bound.
Recall the notion of on-average model stability [162]:
Definition 52. Let X = (x_1, ..., x_n) and X' = (x'_1, ..., x'_n) be drawn independently from P. For i ∈ [n], let X^i := (x_1, ..., x_{i−1}, x'_i, x_{i+1}, ..., x_n). We say a randomized algorithm A has on-average model stability α (i.e. A is α-on-average model stable) if E[(1/n) ∑_{i=1}^n ‖A(X) − A(X^i)‖²] ≤ α². The expectation is over the randomness of A and the draws of X and X'.
On-average model stability is weaker than the notion of uniform stability [45], which has been used in DP Lipschitz SCO (e.g. by [29]); this weaker notion is necessary for obtaining our learnability guarantees without uniform Lipschitz continuity.

The main result in [162] showed that on-average model stable algorithms generalize well if f(·,x) is β_f-smooth for all x, which leads to a restriction on L. We show that neither smoothness nor Lipschitz continuity of f is needed to ensure generalization:
Proposition 53. Let f(·,x) be convex for all x. Suppose A : X^n → W is α-on-average model stable. Let F̂_X(w) := (1/n) ∑_{i=1}^n f(w, x_i) be an empirical loss. Then for any ζ > 0,

E[F(A(X)) − F̂_X(A(X))] ≤ r̃^{(2)}/(2ζ) + (ζ/2)α².
Using Proposition 53, we can bound the generalization error and excess (population) risk of regularized ERM:

Proposition 54. Let f(·,x) be convex, w_{i−1}, y ∈ W, and ŵ_i := argmin_{w∈W} F̂_i(w), where F̂_i(w) := (1/n_i) ∑_{j∈B_i} f(w, x_j) + (λ_i/2)‖w − w_{i−1}‖² (c.f. line 6 of Algorithm 7). Then,

E[F(ŵ_i)] − F(y) ≤ 2r̃^{(2)}/(λ_i n_i) + (λ_i/2)‖y − w_{i−1}‖²,

where the expectation is over both the random draws of X from P and of B_i from X.
With the pieces developed above, we can now sketch the proof of Theorem 50:

Sketch of the Proof of Theorem 50. Privacy: Since the batches {B_i}_{i=1}^l are disjoint, it suffices to show that w_i (produced by T_i iterations of Algorithm 6 in line 7 of Algorithm 7) is ε²/2-zCDP for all i ∈ [l]. The ℓ₂-sensitivity of the clipped subgradient update, where the sup is over w and adjacent datasets X ∼ X', is Δ = sup_{w, X∼X'} ‖(1/n_i) ∑_{j=1}^{n_i} Π_{C_i}(∇f(w, x_j)) − Π_{C_i}(∇f(w, x'_j))‖ ≤ 2C_i/n_i. (Note that the regularization term does not contribute to sensitivity.) Thus, the privacy guarantees of the Gaussian mechanism (Proposition 8) and the composition theorem for zCDP (Lemma 10) imply that Algorithm 7 is ε²/2-zCDP.
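This sensitivity calculation can be checked numerically (our own snippet, not part of the thesis): swapping one data point changes the clipped mean by at most 2C/n, no matter how large the raw gradients are:

```python
import numpy as np

def clipped_mean(Z, C):
    """Average of l2-clipped rows of Z (the noiseless part of the update)."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return np.mean(Z * np.minimum(1.0, C / norms), axis=0)

rng = np.random.default_rng(0)
n, d, C = 64, 5, 1.0
Z = 10.0 * rng.normal(size=(n, d))          # large, un-clipped "gradients"
Z_swapped = Z.copy()
Z_swapped[0] = 100.0 * rng.normal(size=d)   # replace a single data point
diff = np.linalg.norm(clipped_mean(Z, C) - clipped_mean(Z_swapped, C))
assert diff <= 2 * C / n + 1e-12            # Delta <= 2C/n
```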
Excess risk: First, our choice of D_i ensures that ŵ_i ∈ W_i. Second, since L < ∞, there exists p ≥ 1 such that L = Õ(n^{p/2} r̃_{2k}(1/√n + (√(d ln(n))/(εn))^{(k−1)/k})). Combining Lemma 51 with Lemma 49 and proper choices of η and T_i, we get:

E‖w_i − ŵ_i‖² ≲ η_i²n_i^p(L² + dσ_i²) + η_i²n_i^{2p} r̃^{(2k)}/C_i^{2k−2} ≲ (η²n^p/32^i)(L² + dC_i²T_i/(ε²n_i²) + n^p r̃^{(2k)}/(C_i^{2k−2} 2^{pi})).   (4.2)
Now, following the strategy used in the proof of [95, Theorem 4.4], we write E F(w_l) − F(w*) = E[F(w_l) − F(ŵ_l)] + ∑_{i=1}^l E[F(ŵ_i) − F(ŵ_{i−1})], where ŵ_0 := w*. Using (4.2) and the r̃_k-Lipschitz continuity of F (which is implied by Assumption 5), we can bound E[F(w_l) − F(ŵ_l)] for the right η and C_l. To bound the sum (second term), we use Proposition 54 to obtain

∑_{i=1}^l E[F(ŵ_i) − F(ŵ_{i−1})] ≲ r²ηn^{p−1} + D²/(ηn^p) + ηL² + η ∑_{i=2}^l 4^{−i}n_i^p r̃_{2k}² (d ln(n)/(ε²n_i²))^{(k−1)/k} ≲ η[r²n^{p−1} + L² + r̃_{2k}² n^p (d ln(n)/(ε²n²))^{(k−1)/k}] + D²/(ηn^p),

for the right choice of C_i. Then properly choosing η completes the excess risk proof.
Computational complexity: The choice T_i = Õ(n_i^p) implies that the number of subgradient evaluations is bounded by ∑_{i=1}^l n_i T_i = Õ(n^{p+1}). Further, n^p ≤ n(L/r̃_{2k})², which gives the claimed Õ(n²L²/r̃_{2k}²) bound on subgradient evaluations.
Remark 55 (Improved Computational Complexity for Approximate or Shuffle DP). If one desires (ε, δ)-DP or (ε, δ)-SDP instead of zCDP, then the computational complexity of Algorithm 7 can be improved: see Appendix C.4.2.
4.3.2 The Strongly Convex Case

Following [95], we use a folklore reduction to the convex case (detailed in Appendix C.4.3) in order to obtain the following upper bound via Theorem 50:
Theorem 56. Grant Assumption 5. Let ε ≤ √d and let f(·,x) be µ-strongly convex and L-Lipschitz for all x ∈ X, for some L < ∞. Then, there is a polynomial-time ε²/2-zCDP algorithm A based on Algorithm 7 with excess risk

E F(A(X)) − F* ≲ (r̃_{2k}²/µ)(1/n + (√(d ln(n))/(εn))^{(2k−2)/k}).
4.3.3 Lower Bounds

The work of [137] proved lower bounds that are tight (by our upper bounds in Section 4.3) in most parameter regimes for D = µ = 1, r̃_k = √d, and k = O(1).†† Our (relatively modest) contribution in this subsection is: refining these lower bounds to display the correct dependence on r̃_k, D, and µ; tightening the convex lower bound [137, Theorem 6.4] in the regime d > n; and extending [137, Theorems 6.1 and 6.4] to k ≫ 1. Our lower bound constructions satisfy the condition on L in the statements of Theorems 50 and 56.

Our first lower bound holds even for affine functions:
Theorem 57 (Smooth Convex, Informal). Let ρ ≤ d. For any ρ-zCDP algorithm A, there exist closed convex sets W, X ⊂ R^d such that ‖w − w'‖ ≤ 2D for all w, w' ∈ W, a β_f-smooth, L-Lipschitz, linear, convex (in w) loss f : W × X → R, and distributions P and D₁ on X such that Assumption 5 holds and if X ∼ P^n, then

E F(A(X)) − F* = Ω(r̃_k D (1/√n + min{1, (√d/(√ρ n))^{(k−1)/k}})).
Remark 169 (in Appendix C.4.5) discusses parameter regimes in which Theorem 57 is strictly tighter than [137, Theorem 6.4], as well as differences between our proof and theirs.

Next, we provide lower bounds for smooth, strongly convex loss functions:

†† The lower bounds asserted in [137] only hold if k ≲ 1, since the moments of the Gaussian distribution that they construct grow exponentially/factorially with k.
Theorem 58 (Smooth Strongly Convex, Informal). Let ρ ≤ d. For any ρ-zCDP algorithm A, there exist compact convex sets W, X ⊂ R^d, an L-Lipschitz, µ-smooth, µ-strongly convex (in w) loss f : W × X → R, and distributions P and D₁ on X such that Assumption 5 holds, and if X ∼ P^n, then

E F(A(X)) − F* = Ω((r̃_k²/µ)(1/n + min{1, (√d/(√ρ n))^{(2k−2)/k}})).
Thus, our upper bounds are indeed tight (up to logarithms). Having resolved Question I, next we will develop more computationally efficient, linear-time algorithms for smooth F(·).
4.4 Linear-Time Algorithms for Smooth (Strongly) Convex Losses

4.4.1 Noisy Clipped Accelerated SGD for Smooth Convex Losses
Algorithm 8 is a one-pass accelerated algorithm, which builds on (non-private) AC-SA of [108]; its privacy
and excess risk guarantees are given in Theorem 59.
Algorithm 8 Noisy Clipped Accelerated SGD (AC-SA) for Heavy-Tailed DP SCO
1: Input: Data X ∈ X^n, iteration number T ≤ n, stepsize parameters {η_t}_{t∈[T]}, {α_t}_{t∈[T]} with α_1 = 1 and α_t ∈ (0,1) for all t ≥ 2.
2: Initialize w_0^{ag} = w_0 ∈ W and t = 1.
3: for t ∈ [T] do
4:   w_t^{md} := (1 − α_t)w_{t−1}^{ag} + α_t w_{t−1}.
5:   Draw new batch B_t (without replacement) of n/T samples from X.
6:   ∇̃F_t(w_t^{md}) := MeanOracle1({∇f(w_t^{md}, x)}_{x∈B_t}; n/T; ε²/2)
7:   w_t := argmin_{w∈W} {α_t⟨∇̃F_t(w_t^{md}), w⟩ + (η_t/2)‖w_{t−1} − w‖²}.
8:   w_t^{ag} := α_t w_t + (1 − α_t)w_{t−1}^{ag}.
9: end for
10: Output: w_T^{ag}.
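A minimal unconstrained sketch of this update rule (our own illustration, not the thesis's implementation): we assume the standard AC-SA schedule α_t = 2/(t+1) of [108] and a simple stepsize rule keeping α_t/η_t = 1/(2β), rather than the thesis's tuned parameters, and replace MeanOracle1 by clipping plus optional Gaussian noise:

```python
import numpy as np

def clip_l2(z, C):
    norm = np.linalg.norm(z)
    return z if norm <= C else (C / norm) * z

def noisy_clipped_acsa(grad, w0, T, beta, C, noise_sd=0.0, rng=None):
    """Accelerated SGD (Algorithm 8 skeleton) with clipped, optionally noisy gradients."""
    rng = np.random.default_rng() if rng is None else rng
    w = w_ag = np.asarray(w0, dtype=float)
    for t in range(1, T + 1):
        alpha = 2.0 / (t + 1)                  # alpha_1 = 1, alpha_t in (0,1) for t >= 2
        eta = 2.0 * beta * alpha               # so that alpha/eta = 1/(2*beta), an assumed choice
        w_md = (1 - alpha) * w_ag + alpha * w  # line 4
        g = clip_l2(grad(w_md), C) + noise_sd * rng.normal(size=w.shape)  # lines 5-6
        w = w - (alpha / eta) * g              # line 7, unconstrained argmin
        w_ag = alpha * w + (1 - alpha) * w_ag  # line 8
    return w_ag

# e.g. on the smooth convex quadratic F(w) = ||w||^2 / 2 (beta = 1), noiseless:
w_final = noisy_clipped_acsa(lambda w: w, w0=[4.0, 3.0], T=200, beta=1.0, C=10.0)
assert np.linalg.norm(w_final) < 0.5
```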
Theorem 59 (Informal). Grant Assumption 5. Let F be convex and β-smooth. Then, there are parameters such that Algorithm 8 is ε²/2-zCDP and:

E F(w_T^{ag}) − F* ≲ r̃_k D [1/√n + max{((βD/r̃_k)^{1/4} √d/(εn))^{4(k−1)/(5k−1)}, (√d/(εn))^{(k−1)/k}}].   (4.3)

Moreover, Algorithm 8 uses at most n gradient evaluations.
Besides being linear-time, another advantage of Theorem 59 is that it holds even for problems with L = ∞. The key ingredient used to prove (4.3) is a novel convergence guarantee for AC-SA with biased, noisy stochastic gradients: see Proposition 172 in Appendix C.5.1. Combining Proposition 172 with Lemma 49 and a careful choice of stepsizes, clip threshold, and T yields Theorem 59.
Remark 60 (Optimal rate for "sufficiently smooth" functions). Note that the upper bound (4.3) scales with the smoothness parameter β. Thus, for sufficiently small β, the optimal rates are attained. For example, if k = 2, the upper bound in (4.3) matches the lower bound in Theorem 57 when

Having discussed the dependence on β, let us focus on understanding how the bound in Theorem 59 scales with n, d, and ε. Our bound (4.3) improves over [137, Theorem 5.4] if d ≳ n^{1/6}, which is typical in practical ML applications. As k → ∞, (4.3) becomes O(√(d/n) + (d/n)^{4/5}) for ε ≈ 1, which is strictly better than the bound in [137, Theorem 5.4].
4.4.2 Noisy Clipped SGD for Strongly Convex Losses

Our algorithm for strongly convex losses (Algorithm 27 in Appendix C.5.2) is a simple one-pass noisy clipped SGD. Compared to the algorithm of [137], our approach differs in the choice of MeanOracle, step size, and iterate averaging weights, and in our analysis.
Theorem 61 (Informal). Grant Assumption 5. Let F be µ-strongly convex and β-smooth with β/µ ≤ n/ln(n). Then, there are algorithmic parameters such that Algorithm 27 is ε²/2-zCDP and:

E F(ŵ_T) − F* ≲ (r̃_k²/µ)(1/n + (√(d(β/µ)ln(n))/(εn))^{(2k−2)/k}).   (4.4)

Moreover, Algorithm 27 uses at most n gradient evaluations.
The bound (4.4) is optimal up to a Õ((β/µ)^{(k−1)/k}) factor and improves over the best previous bound in [137, Theorem 5.6] by removing the dependence on β_f (which can be much larger than β in the presence of outliers). The proof of Theorem 61 (in Appendix C.5.2) relies on a novel convergence guarantee for projected SGD with biased, noisy stochastic gradients: Proposition 174. Compared to results in [18] for convex ERM and [6] for PL SO, Proposition 174 is tighter, which is needed to obtain near-optimal excess risk: we leverage smoothness and strong convexity. Our new analysis also avoids the issue in the proofs of (the ICML versions of) [253, 137].
4.5 Algorithm for Non-Convex Proximal-PL Loss Functions

Assume: f(w,x) = f_0(w,x) + f_1(w); f_0(·,x) is differentiable (maybe non-convex) for all x ∈ X; f_1 is proper, closed, and convex (maybe non-differentiable); and F(w) = F_0(w) + f_1(w) = E_{x∼P}[f_0(w,x)] + f_1(w) satisfies the Proximal-PL condition [142]:
Definition 62 (µ-PPL). Let F(w) = F_0(w) + f_1(w) be bounded below, where F_0 is β-smooth and f_1 is convex. F satisfies the Proximal Polyak-Łojasiewicz inequality with parameter µ > 0 if

µ[F(w) − inf_{w'} F(w')] ≤ −β min_y [⟨∇F_0(w), y − w⟩ + (β/2)‖y − w‖² + f_1(y) − f_1(w)], for all w ∈ R^d.
Definition 62 generalizes the classical PL condition (f_1 ≡ 0), allowing for constrained optimization or a non-smooth regularizer through the choice of f_1 [213, 142].

Recall that the proximal operator of a convex function g is defined as prox_{ηg}(z) := argmin_{y∈R^d}(ηg(y) + (1/2)‖y − z‖²) for η > 0.
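For example, for the LASSO regularizer f_1 = λ‖·‖₁ the prox map has the familiar soft-thresholding closed form, so each iteration of a DP proximal clipped SGD would take a (noisy, clipped) gradient step on f_0 followed by this map. A sketch of ours, not the thesis's Algorithm 29:

```python
import numpy as np

def prox_l1(z, t):
    """prox_{t * ||.||_1}(z) = argmin_y ( t*||y||_1 + ||y - z||^2 / 2 ): soft-thresholding."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_step(w, noisy_clipped_grad, eta, lam):
    """One proximal-SGD step: w <- prox_{eta*lam*||.||_1}( w - eta * g )."""
    return prox_l1(np.asarray(w) - eta * np.asarray(noisy_clipped_grad), eta * lam)
```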
Theorem 63 (Informal). Grant Assumption 5. Let F be µ-PPL for β-smooth F_0, with β/µ ≤ n/ln(n). Then, there are parameters such that Algorithm 29 is ε²/2-zCDP, and:

E F(w_R) − F* ≲ (r̃_k²/µ)(((√d/(εn))(β/µ)ln(n))^{(2k−2)/k} + (β/µ)ln(n)/n).

Moreover, Algorithm 29 uses at most n gradient evaluations.
The bound in Theorem 63 nearly matches the smooth strongly convex (hence PPL) lower bound in Theorem 58, up to a factor of Õ((β/µ)^{(2k−2)/k}), and is attained without convexity.

To prove Theorem 63, we derive a convergence guarantee for proximal SGD with generic biased, noisy stochastic gradients, in terms of the bias and variance of the oracle (see Proposition 177). We then apply this guarantee to MeanOracle1 (Algorithm 5) with carefully chosen stepsizes, clip threshold, and T, using Lemma 49. Proposition 177 generalizes [6, Theorem 6] (which covered the unconstrained classical PL problem) to the proximal setting. However, the proof of Proposition 177 is very different from the proof of [6, Theorem 6], since the prox operator makes it hard to bound excess risk without convexity when the stochastic gradients are biased/noisy. Instead, our proof builds on the proof of Theorem 40, using techniques from the analysis of objective perturbation [52, 149]. See Appendix C.6 for details.
4.6 Concluding Remarks and Open Questions

This chapter was motivated by practical problems in which data contains outliers and potentially heavy tails, causing the worst-case Lipschitz parameter of the loss over all data points to be prohibitively large. In
such cases, existing bounds for DP SO that scale with the worst-case Lipschitz parameter become vacuous.
Thus, we operated under the more relaxed assumption of stochastic gradient distributions having bounded
k-th moments. The k-th moment bound can be much smaller than the worst-case Lipschitz parameter
in practice. For (strongly) convex loss functions, we established the asymptotically optimal rates (up to
logarithms), even with non-differentiable losses. We also provided linear-time algorithms for smooth losses
that are optimal in certain practical parameter regimes, but suboptimal in general. An interesting open
question is: does there exist a linear-time algorithm with optimal excess risk? We also initiated the study
of non-convex DP SO without uniform Lipschitz continuity, showing that the optimal strongly convex
rates can nearly be attained without convexity, via the proximal-PL condition. We leave the treatment of
general non-convex losses for future work.
Author Contributions
This work was done in collaboration with Meisam Razaviyayn. AL led the project, conducted the analysis,
and wrote the paper. MR advised and provided useful feedback.
Acknowledgements
We would like to thank John Duchi, Larry Goldstein, and Stas Minsker for very helpful conversations and
pointers related to our lower bounds and the proof of Lemma 188. We also thank Huanyu Zhang, Xingtu
Liu, and Gautam Kamath for clarifying some steps in the proof of their Theorem 4.1 [137], and providing
useful feedback on the first draft of this manuscript. Finally, we thank an anonymous ALT reviewer for
identifying an issue in a previous version of this paper.
Chapter 5

A Stochastic Optimization Framework for Fair Risk Minimization
This chapter is based on [176].
Summary
Despite the success of large-scale empirical risk minimization (ERM) at achieving high accuracy across a variety of machine learning tasks, fair ERM is hindered by the incompatibility of fairness constraints with stochastic optimization. We consider the problem of fair classification with discrete sensitive attributes and potentially large models and data sets, requiring stochastic solvers. Existing in-processing fairness algorithms are either impractical in the large-scale setting because they require large batches of data at each iteration, or they are not guaranteed to converge. In this chapter, we develop the first stochastic in-processing fairness algorithm with guaranteed convergence. For demographic parity, equalized odds, and equal opportunity notions of fairness, we provide slight variations of our algorithm, called FERMI, and prove that each of these variations converges in stochastic optimization with any batch size. Empirically, we show that FERMI is amenable to stochastic solvers with multiple (non-binary) sensitive attributes and non-binary targets, performing well even with minibatch size as small as one. Extensive experiments show that FERMI achieves the most favorable tradeoffs between fairness violation and test accuracy across all tested setups compared with state-of-the-art baselines for demographic parity, equalized odds, and equal opportunity. These benefits are especially significant with small batch sizes and for non-binary classification with a large number of sensitive attributes, making FERMI a practical, scalable fairness algorithm. The code for all of the experiments in this paper is available at: https://github.com/optimization-for-data-driven-science/FERMI.
5.1 Introduction
Ensuring that decisions made using machine learning (ML) algorithms are fair to different subgroups is
of utmost importance. Without any mitigation strategy, learning algorithms may result in discrimination
against certain subgroups based on sensitive attributes, such as gender or race, even if such discrimination
is absent in the training data [192], and algorithmic fairness literature aims to remedy such discrimination
issues [235, 70, 94, 43, 12, 49, 115, 100, 263, 273, 34, 4, 145, 214, 158]. Modern ML problems often involve
large-scale models with hundreds of millions or even billions of parameters, e.g., BART [165], ViT [80],
GPT-2 [217]. In such cases, during fine-tuning, the available memory on a node constrains us to use
stochastic optimization with (small) minibatches in each training iteration. In this paper, we address the
dual challenges of fair and stochastic ML, providing the first stochastic fairness algorithm that provably
converges with any batch size.
A machine learning algorithm satisfies the demographic parity fairness notion if the predicted target is independent of the sensitive attributes [86]. Promoting demographic parity can lead to poor performance, especially if the true outcome is not independent of the sensitive attributes. To remedy this, [115] proposed equalized odds to ensure that the predicted target is conditionally independent of the sensitive attributes given the true label. A further relaxed version of this notion is equal opportunity, which is satisfied if the predicted target is conditionally independent of the sensitive attributes given that the true label is in an advantaged class [115]. Equal opportunity ensures that false positive rates are equal across different demographics, where the negative outcome is considered an advantaged class, e.g., extending a loan. See Appendix D.1 for formal definitions of these fairness notions.
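To make these definitions concrete, the empirical demographic-parity gap (an L∞-type violation, as measured in this chapter's experiments) can be computed as follows (our own sketch; equalized odds would additionally condition on the true label):

```python
import numpy as np

def demographic_parity_gap(y_pred, s):
    """Max over predicted classes y and attribute pairs (a, b) of
    |P_hat(Y_hat = y | S = a) - P_hat(Y_hat = y | S = b)|."""
    y_pred, s = np.asarray(y_pred), np.asarray(s)
    gaps = []
    for y in np.unique(y_pred):
        rates = [np.mean(y_pred[s == a] == y) for a in np.unique(s)]
        gaps.append(max(rates) - min(rates))
    return max(gaps)

# a maximally unfair predictor: predictions determined by the sensitive attribute
assert demographic_parity_gap([1, 1, 0, 0], [0, 0, 1, 1]) == 1.0
# a predictor with identical prediction rates across groups
assert demographic_parity_gap([1, 0, 1, 0], [0, 0, 1, 1]) == 0.0
```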
Reference | NB target | NB attrib. | NB code | Beyond logistic | Stoch. alg. (unbiased**) | Converg. (stoch.)
FERMI (this work) | ✓ | ✓ | ✓ | ✓ | ✓ (✓) | ✓ (✓)
[57] | ✓ | ✓ | ✓ | ✓ | ✓ (✗) | ✗
[58] | ✓ | ✓ | ✗ | ✓ | ✓ (✓) | ✗
[23] | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ (✗)
[222] | ✗ | ✗ | ✗ | ✗ | ✗ | ✗
[131]* | ✗ | ✓ | ✗ | ✗ | ✗ | ✗
[186] | ✓ | ✓ | ✓ | ✓ | ✓ (✗) | ✗
[214] | ✗ | ✗ | ✗ | ✓ | ✓ (✗) | ✗
[79] | ✗ | ✓ | ✗ | ✓ | ✗ | ✗
[275] | ✓ | ✓ | ✗ | ✓ | ✓ (✗) | ✗
[4] | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ (✗)

Table 5.1: Comparison of state-of-the-art in-processing methods (NB = non-binary) on whether they (a) handle non-binary targets (beyond binary classification), (b) handle non-binary sensitive attributes, (c) release code that applies to non-binary targets/attributes, (d) extend to arbitrary models, (e) provide code for stochastic optimization (and whether the gradients are unbiased), (f) provide convergence guarantees (for stochastic optimization). FERMI is the only method compatible with stochastic optimization and guaranteed convergence. The only existing baselines for non-binary classification with non-binary sensitive attributes are [186, 23, 57] (NB code). * We refer to the in-processing method of [131], not their post-processing method. ** We use the term "unbiased" in the statistical estimation sense; not to be confused with bias in the fairness sense.
Measuring fairness violation. In practice, the learner only has access to finite samples and cannot verify demographic parity, equalized odds, or equal opportunity. This has led the machine learning community to define several fairness violation metrics that quantify the degree of (conditional) independence between random variables, e.g., L∞ distance [86, 115], mutual information [140, 222, 233, 275, 58, 224], Pearson correlation [273, 38], false positive/negative rate difference [34], Hilbert-Schmidt independence criterion (HSIC) [210], kernel-based minimum mean discrepancy (MMD) [214], Rényi correlation [186, 23, 114, 113], and exponential Rényi mutual information (ERMI) [186]. In this paper, we focus on three variants of ERMI specialized to demographic parity, equalized odds, and equal opportunity. The motivation for the use of ERMI is two-fold. First, we will see in Section 5.2 that ERMI is amenable to stochastic optimization. Moreover, we observe (Appendix D.3) that ERMI provides an upper bound on several of the above notions of fairness violation. Consequently, a model trained to reduce ERMI will also provide guarantees on these other fairness violations.*

* Nevertheless, we use the L∞ distance for measuring fairness violation in our numerical experiments, since L∞ is broadly used.
Related work & contributions. Fairness-promoting machine learning algorithms can be categorized into three main classes: pre-processing, post-processing, and in-processing methods. Pre-processing algorithms [94, 274, 49] transform the biased data features to a new space in which the labels and sensitive attributes are statistically independent. This transform is oblivious to the training procedure. Post-processing approaches [115, 212] mitigate the discrimination of the classifier by altering the final decision. In-processing approaches focus on the training procedure and impose the notions of fairness as constraints or regularization terms in the training procedure. Several regularization-based methods have been proposed in the literature to promote fairness [223, 216] in decision trees [139, 218, 5], support vector machines [79], boosting [101], neural networks [113, 57, 214], or (logistic) regression models [273, 37, 237, 61, 23, 131, 114]. See the recent paper by [123] for a more comprehensive literature survey.

While in-processing approaches generally give rise to better tradeoffs between fairness violation and performance, existing approaches are mostly incompatible with stochastic optimization. This paper addresses this problem in the context of fair (non-binary) classification with discrete (non-binary) sensitive attributes. See Table 5.1 for a summary of the main differences between FERMI and existing in-processing methods.
Our main contributions are as follows:
1. For each given fairness notion (demographic parity, equalized odds, or equal opportunity), we formu-
late an objective that uses ERMI as a regularizer to balance fairness and accuracy (FRMI Objective),
and derive an empirical version of this objective (FERMI Objective). We propose an algorithm (Algo-
rithm 9) for solving each of these objectives, which isthefirststochasticin-processingfairnessalgorithm
with guaranteed convergence. The main property needed to obtain a convergent stochastic algorithm
is to derive a (stochastically) unbiased estimator of the gradient of the objective function. The existing
stochastic fairness algorithms by [275, 186, 214, 58, 57] are not guaranteed to converge since there is no
90
straightforward way to obtain such unbiased estimator of the gradients for their fairness regularizers.
†
For any minibatch size (even as small as1), we prove (Theorem 67) that our algorithm converges to an
approximate solution of the empirical objective (FERMI Objective).
2. We show that if the number of training examples is large enough, then our algorithm (Algorithm 9)
converges to an approximate solution of the population-level objective (Theorem 69). The proofs of
these convergence theorems require the development of novel techniques (see e.g. Proposition 65 and
Proposition 70), and the resourceful application of many classical results from optimization, probability
theory, and statistics.
3. We demonstrate through extensive numerical experiments that our stochastic algorithm achieves superior fairness-accuracy tradeoff curves against all comparable baselines for demographic parity, equalized odds, and equal opportunity. In particular, the performance gap is very large when the minibatch size is small (as is practically necessary for large-scale problems) and the number of sensitive attributes is large.
5.2 Fair Risk Minimization through ERMI Regularization
In this section, we propose a fair learning objective (FRMI Objective) and derive an empirical variation
(FERMI Objective) of this objective. We then develop a stochastic optimization algorithm (Algorithm 9)
that we use to solve these objectives, and prove that our algorithm converges to an approximate solution
of the two objectives.
Consider a learner who trains a model to make a prediction, $\widehat{Y}$, e.g., whether or not to extend a loan, supported on $[m] := \{1, \dots, m\}$. The prediction is made using a set of features, $X$, e.g., financial history features. Assume that there is a set of discrete sensitive attributes, $S$, e.g., race and sex, supported on $[k]$.
† We suspect it might be possible to derive a provably convergent stochastic algorithm from the framework in [214] using our techniques, but their approach is limited to binary classification with binary sensitive attributes. By contrast, we provide (empirical and population-level) convergence guarantees for our algorithm with any number of sensitive attributes and any number of classes.
We emphasize that, in contrast to the previous chapters on privacy, here we will denote the full data—features, sensitive attributes, and labels—by $(X, S, Y)$.
We now define the fairness violation notion that we will use to enforce fairness in our model.
Definition 64 (ERMI – exponential Rényi mutual information). We define the exponential Rényi mutual information between random variables $\widehat{Y}$ and $S$ with joint distribution $p_{\widehat{Y},S}$ and marginals $p_{\widehat{Y}}$, $p_S$ by:
$$D_R(\widehat{Y}; S) := \mathbb{E}\left\{ \frac{p_{\widehat{Y},S}(\widehat{Y}, S)}{p_{\widehat{Y}}(\widehat{Y})\, p_S(S)} \right\} - 1 = \sum_{j \in [m]} \sum_{r \in [k]} \frac{p_{\widehat{Y},S}(j, r)^2}{p_{\widehat{Y}}(j)\, p_S(r)} - 1. \qquad \text{(ERMI)}$$
Definition 64 is what we would use if demographic parity were the fairness notion of interest. If instead one wanted to promote fairness with respect to equalized odds or equal opportunity, then it is straightforward to modify the definition by substituting appropriate conditional probabilities for $p_{\widehat{Y},S}$, $p_{\widehat{Y}}$, and $p_S$ in ERMI: see Appendix D.2. In Appendix D.2, we also discuss that ERMI is the $\chi^2$-divergence (which is an $f$-divergence) between the joint distribution, $p_{\widehat{Y},S}$, and the Kronecker product of marginals, $p_{\widehat{Y}} \otimes p_S$ [48]. In particular, ERMI is non-negative, and zero if and only if demographic parity (or equalized odds or equal opportunity, for the conditional version of ERMI) is satisfied. Additionally, we show in Appendix D.3 that ERMI provides an upper bound on other commonly used measures of fairness violation: Shannon mutual information [58], Rényi correlation [23], and $L_q$ fairness violation [145, 115]. Therefore, any algorithm that makes ERMI small will also have small fairness violation with respect to these other notions.
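To make Definition 64 concrete, the ERMI sum can be evaluated directly from a joint pmf of $(\widehat{Y}, S)$. The sketch below (the function name `ermi` and the example distributions are ours, purely illustrative) also exhibits the two properties just stated: non-negativity, and equality to zero under independence (demographic parity):

```python
import numpy as np

def ermi(p_joint):
    """Exponential Renyi mutual information: the chi^2-divergence between
    the joint pmf of (Yhat, S) and the product of its marginals."""
    p_joint = np.asarray(p_joint, dtype=float)
    p_y = p_joint.sum(axis=1)   # marginal of Yhat, shape (m,)
    p_s = p_joint.sum(axis=0)   # marginal of S, shape (k,)
    # sum_{j,r} p(j,r)^2 / (p_y(j) * p_s(r)) - 1
    return (p_joint ** 2 / np.outer(p_y, p_s)).sum() - 1.0

# Zero iff Yhat and S are independent (demographic parity holds):
p_indep = np.outer([0.3, 0.7], [0.4, 0.6])
print(ermi(p_indep))            # approx 0.0

# Strictly positive under dependence:
p_dep = np.array([[0.30, 0.05],
                  [0.10, 0.55]])
print(ermi(p_dep))              # approx 0.469
```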
We can now define our fair risk minimization through exponential Rényi mutual information framework:‡
$$\min_\theta \left\{ \mathrm{FRMI}(\theta) := \mathcal{L}(\theta) + \lambda D_R\big(\widehat{Y}_\theta(X); S\big) \right\}, \qquad \text{(FRMI Objective)}$$

‡ In this section, we present all results in the context of demographic parity, leaving off all conditional expectations for clarity of presentation. The algorithm/results are readily extended to equalized odds and equal opportunity by using the conditional version of ERMI (which is described in Appendix D.2); we use these resulting algorithms for numerical experiments.
where $\mathcal{L}(\theta) := \mathbb{E}_{(X,Y)}[\ell(X, Y; \theta)]$ for a given loss function $\ell$ (e.g., $L_2$ loss or cross-entropy loss); $X$ is the random vector of features; $Y$ is the random label; $\lambda > 0$ is a scalar balancing the accuracy versus fairness objectives; and $\widehat{Y}_\theta(X)$ is the output of the learned model (i.e., the predicted label in a classification task). While $\widehat{Y}_\theta(X) = \widehat{Y}(X; \theta)$ inherently depends on $X$ and $\theta$, in the rest of this chapter, we sometimes leave the dependence of $\widehat{Y}$ on $X$ and/or $\theta$ implicit for brevity of notation. Notice that we have also left the dependence of the loss on the predicted outcome $\widehat{Y} = \widehat{Y}_\theta(X)$ implicit.
As is standard, we assume that the prediction function satisfies $\mathbb{P}(\widehat{Y}(\theta, X) = j \mid X) = \mathcal{F}_j(\theta, X)$, where $\mathcal{F}(\theta, X) = (\mathcal{F}_1(\theta, X), \dots, \mathcal{F}_m(\theta, X))^T \in [0,1]^m$ is differentiable in $\theta$ and $\sum_{j=1}^m \mathcal{F}_j(\theta, X) = 1$. For example, $\mathcal{F}(\theta, X)$ could represent the probability label given by a logistic regression model or the output of a neural network after the softmax layer. Indeed, this assumption is natural for most classifiers. Further, even classifiers, such as SVM, that are not typically expressed using probabilities can often be well approximated by a classifier of the form $\mathbb{P}(\widehat{Y}(\theta, X) = j \mid X) = \mathcal{F}_j(\theta, X)$, e.g., by using Platt scaling [211, 202].
The work of [186] considered the same objective (FRMI Objective), and tried to solve it empirically through a kernel approximation. We propose a different approach to solving this problem, which we shall describe below. Essentially, we express ERMI as a "max" function (Proposition 65), which enables us to re-formulate (FRMI Objective) (and its empirical counterpart, (FERMI Objective)) as a stochastic min-max optimization problem. This allows us to use stochastic gradient descent-ascent (SGDA) to solve (FRMI Objective). Unlike the algorithm of [186], our algorithm provably converges. Our algorithm also empirically outperforms the algorithm of [186], as we show in Section 5.3 and Appendix D.5.2.
5.2.1 A Convergent Stochastic Algorithm for Fair Empirical Risk Minimization
In practice, the true joint distribution of $(X, S, Y, \widehat{Y})$ is unknown and we only have $N$ samples at our disposal. Let $Z = \{x_i, s_i, y_i, \hat{y}(x_i; \theta)\}_{i \in [N]}$ denote the features, sensitive attributes, targets, and the predictions of the model parameterized by $\theta$ for these given samples. For now, we consider the empirical risk minimization (ERM) problem and do not require any assumptions on the data set (e.g., we allow for different samples in $Z$ to be drawn from different, heterogeneous distributions). Consider the empirical objective
$$\min_\theta \left\{ \mathrm{FERMI}(\theta) := \widehat{\mathcal{L}}(\theta) + \lambda \widehat{D}_R\big(\widehat{Y}_\theta(X), S\big) \right\}, \qquad \text{(FERMI Objective)}$$
where $\widehat{\mathcal{L}}(\theta) := \frac{1}{N} \sum_{i=1}^N \ell(x_i, y_i; \theta)$ is the empirical loss and§
$$\widehat{D}_R(\widehat{Y}, S) := \mathbb{E}\left\{ \frac{\hat{p}_{\widehat{Y},S}(\widehat{Y}, S)}{\hat{p}_{\widehat{Y}}(\widehat{Y})\, \hat{p}_S(S)} \right\} - 1 = \sum_{j \in [m]} \sum_{r \in [k]} \frac{\hat{p}_{\widehat{Y},S}(j, r)^2}{\hat{p}_{\widehat{Y}}(j)\, \hat{p}_S(r)} - 1$$
is empirical ERMI, with $\hat{p}$ denoting empirical probabilities with respect to $Z$: $\hat{p}_S(r) = \frac{1}{N} \sum_{i=1}^N \mathbf{1}_{\{s_i = r\}}$; $\hat{p}_{\widehat{Y}}(j) = \frac{1}{N} \sum_{i=1}^N \mathcal{F}_j(\theta, x_i)$; and $\hat{p}_{\widehat{Y},S}(j, r) = \frac{1}{N} \sum_{i=1}^N \mathcal{F}_j(\theta, x_i)\, s_{i,r}$ for $j \in [m]$, $r \in [k]$. We shall see (Proposition 70) that empirical ERMI is a good approximation of ERMI when $N$ is large. Now, it is straightforward to derive an unbiased estimate for $\widehat{\mathcal{L}}(\theta)$ via $\frac{1}{|B|} \sum_{i \in B} \ell(x_i, y_i; \theta)$, where $B \subseteq [N]$ is a random minibatch of data points drawn from $Z$. However, unbiasedly estimating $\widehat{D}_R(\widehat{Y}, S)$ in the objective function (FERMI Objective) with $|B| < N$ samples is more difficult. In what follows, we present our approach to deriving statistically unbiased stochastic estimators of the gradients of $\widehat{D}_R(\widehat{Y}, S)$ given a random batch of data points $B$. This stochastic estimator is key to developing a convergent stochastic algorithm for solving (FERMI Objective). The key novel observation that allows us to derive this estimator is that (FERMI Objective) can be written as a min-max optimization problem (see Corollary 66). This observation, in turn, follows from the following result:

§ We overload notation slightly here and use $\mathbb{E}$ to denote expectation with respect to the empirical (joint) distribution.
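The empirical probabilities defined above can be assembled directly from the soft classifier outputs $\mathcal{F}_j(\theta, x_i)$ and one-hot sensitive attributes. A minimal numpy sketch (random Dirichlet vectors stand in for a trained classifier's softmax outputs; all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, k = 500, 3, 2

# Stand-ins: soft predictions F_j(theta, x_i) with rows summing to 1,
# and one-hot sensitive attributes s_i.
F = rng.dirichlet(np.ones(m), size=N)          # shape (N, m)
S = np.eye(k)[rng.integers(0, k, size=N)]      # shape (N, k)

p_s = S.mean(axis=0)       # \hat{p}_S(r)        = (1/N) sum_i 1{s_i = r}
p_y = F.mean(axis=0)       # \hat{p}_{Yhat}(j)   = (1/N) sum_i F_j(theta, x_i)
p_ys = F.T @ S / N         # \hat{p}_{Yhat,S}(j,r) = (1/N) sum_i F_j(theta, x_i) s_{i,r}

# Plug-in empirical ERMI (chi^2 form of the display above):
ermi_hat = (p_ys ** 2 / np.outer(p_y, p_s)).sum() - 1.0
print(ermi_hat)   # non-negative, since p_ys is a valid joint pmf
```

Note that the marginals of `p_ys` recover `p_y` and `p_s` exactly (because each row of `F` and of `S` sums to 1), which is why the plug-in quantity is a genuine $\chi^2$-divergence and hence non-negative.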
Proposition 65. For random variables $\widehat{Y}$ and $S$ with joint distribution $\hat{p}_{\widehat{Y},S}$, where $\widehat{Y} \in [m]$, $S \in [k]$, we have
$$\widehat{D}_R(\widehat{Y}; S) = \max_{W \in \mathbb{R}^{k \times m}} \left\{ -\mathrm{Tr}\big(W \widehat{P}_{\hat{y}} W^T\big) + 2\,\mathrm{Tr}\big(W \widehat{P}_{\hat{y},s} \widehat{P}_s^{-1/2}\big) - 1 \right\},$$
if $\widehat{P}_{\hat{y}} = \mathrm{diag}\big(\hat{p}_{\widehat{Y}}(1), \dots, \hat{p}_{\widehat{Y}}(m)\big)$, $\widehat{P}_s = \mathrm{diag}\big(\hat{p}_S(1), \dots, \hat{p}_S(k)\big)$, and $\big(\widehat{P}_{\hat{y},s}\big)_{i,j} = \hat{p}_{\widehat{Y},S}(i, j)$ with $\hat{p}_{\widehat{Y}}(i), \hat{p}_S(j) > 0$ for $i \in [m]$, $j \in [k]$.
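Proposition 65 is easy to check numerically. Setting the $W$-gradient of the variational objective to zero gives the maximizer $W^* = \widehat{P}_s^{-1/2} \widehat{P}_{\hat{y},s}^T \widehat{P}_{\hat{y}}^{-1}$ (our derivation; the thesis defers the proof to Section D.4), and plugging it back in recovers empirical ERMI:

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 3, 2

# A strictly positive joint pmf for (Yhat, S), shape (m, k).
p_ys = rng.dirichlet(np.ones(m * k)).reshape(m, k) + 0.01
p_ys /= p_ys.sum()
p_y, p_s = p_ys.sum(axis=1), p_ys.sum(axis=0)

P_y = np.diag(p_y)
P_s_inv_half = np.diag(p_s ** -0.5)

def variational_obj(W):
    # -Tr(W P_y W^T) + 2 Tr(W P_{y,s} P_s^{-1/2}) - 1
    return -np.trace(W @ P_y @ W.T) + 2 * np.trace(W @ p_ys @ P_s_inv_half) - 1

# Closed-form maximizer obtained by setting the W-gradient to zero:
W_star = P_s_inv_half @ p_ys.T @ np.linalg.inv(P_y)

ermi_direct = (p_ys ** 2 / np.outer(p_y, p_s)).sum() - 1.0
print(np.isclose(variational_obj(W_star), ermi_direct))   # the max equals ERMI

# Any other W gives a smaller value (the objective is strictly concave in W):
W_rand = rng.standard_normal((k, m))
print(variational_obj(W_rand) <= ermi_direct)
```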
The proof is a direct calculation, given in Section D.4. Let $\hat{y}(x_i, \theta) \in \{0,1\}^m$ and $s_i = (s_{i,1}, \dots, s_{i,k})^T \in \{0,1\}^k$ be the one-hot encodings of the prediction and the sensitive attribute, respectively, for $i \in [N]$. Then, Proposition 65 provides a useful variational form of (FERMI Objective), which forms the backbone of our novel algorithmic approach:
Corollary 66. Let $(x_i, s_i, y_i)$ be a random draw from $Z$. Then, (FERMI Objective) is equivalent to
$$\min_\theta \max_{W \in \mathbb{R}^{k \times m}} \left\{ \widehat{F}(\theta, W) := \widehat{\mathcal{L}}(\theta) + \lambda \widehat{\Psi}(\theta, W) \right\}, \qquad (5.1)$$
where $\widehat{\Psi}(\theta, W) = -\mathrm{Tr}\big(W \widehat{P}_{\hat{y}} W^T\big) + 2\,\mathrm{Tr}\big(W \widehat{P}_{\hat{y},s} \widehat{P}_s^{-1/2}\big) - 1 = \frac{1}{N} \sum_{i=1}^N \widehat{\psi}_i(\theta, W)$ and
$$\begin{aligned}
\widehat{\psi}_i(\theta, W) &:= -\mathrm{Tr}\big(W\, \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T \mid x_i]\, W^T\big) + 2\,\mathrm{Tr}\big(W\, \mathbb{E}[\hat{y}(x_i; \theta) s_i^T \mid x_i, s_i]\, \widehat{P}_s^{-1/2}\big) - 1 \\
&= -\mathrm{Tr}\big(W\, \mathrm{diag}(\mathcal{F}_1(\theta, x_i), \dots, \mathcal{F}_m(\theta, x_i))\, W^T\big) + 2\,\mathrm{Tr}\big(W\, \mathbb{E}[\hat{y}(x_i; \theta) s_i^T \mid x_i, s_i]\, \widehat{P}_s^{-1/2}\big) - 1.
\end{aligned}$$
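As a numerical sanity check on Corollary 66, averaging the per-sample terms $\widehat{\psi}_i(\theta, W)$ over $i \in [N]$ should recover $\widehat{\Psi}(\theta, W)$ exactly, for any fixed $W$. A sketch (random soft predictions stand in for $\mathcal{F}(\theta, x_i)$; names ours):

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, k = 200, 3, 2
F = rng.dirichlet(np.ones(m), size=N)        # F_j(theta, x_i), rows sum to 1
S = np.eye(k)[rng.integers(0, k, size=N)]    # one-hot s_i
P_s_inv_half = np.diag(S.mean(axis=0) ** -0.5)
W = rng.standard_normal((k, m))              # any fixed dual variable

def psi_i(i):
    # Per-sample estimator from Corollary 66:
    E_yyT = np.diag(F[i])                    # E[yhat yhat^T | x_i]
    E_ysT = np.outer(F[i], S[i])             # E[yhat s_i^T | x_i, s_i]
    return (-np.trace(W @ E_yyT @ W.T)
            + 2 * np.trace(W @ E_ysT @ P_s_inv_half) - 1)

# Full-batch Psi_hat(theta, W):
Psi_hat = (-np.trace(W @ np.diag(F.mean(axis=0)) @ W.T)
           + 2 * np.trace(W @ (F.T @ S / N) @ P_s_inv_half) - 1)

avg = np.mean([psi_i(i) for i in range(N)])
print(np.isclose(avg, Psi_hat))   # the psi_i are unbiased for Psi_hat
```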
Corollary 66 implies that for any given data set $Z$, the quantity $\ell(x_i, y_i; \theta) + \lambda \widehat{\psi}_i(\theta, W)$ is an unbiased estimator of $\widehat{F}(\theta, W)$ (with respect to the uniformly random draw of $i \in [N]$). Thus, we can use stochastic optimization (e.g., stochastic gradient descent-ascent) to solve (FERMI Objective) with any batch size $1 \le |B| \le N$, and the resulting algorithm will be guaranteed to converge since the stochastic gradients are unbiased. We present our proposed algorithm, which we call FERMI, for solving (FERMI Objective) in Algorithm 9. Essentially, the FERMI algorithm applies the two-timescale stochastic gradient descent-ascent (SGDA) of [170] to solve the min-max problem (5.1).
Algorithm 9 FERMI Algorithm
Input: $\theta^0 \in \mathbb{R}^{d_\theta}$, $W_0 = 0 \in \mathbb{R}^{k \times m}$, step-sizes $(\eta_\theta, \eta_w)$, fairness parameter $\lambda \ge 0$, iteration number $T$, minibatch sizes $|B_t|$, $t \in \{0, 1, \dots, T\}$, $\mathcal{W} :=$ Frobenius norm ball of radius $D$ around $0 \in \mathbb{R}^{k \times m}$ for $D$ given in Section D.4.
Compute $\widehat{P}_s^{-1/2} = \mathrm{diag}\big(\hat{p}_S(1)^{-1/2}, \dots, \hat{p}_S(k)^{-1/2}\big)$.
for $t = 0, 1, \dots, T$ do
    Draw a mini-batch $B_t$ of data points $\{(x_i, s_i, y_i)\}_{i \in B_t}$.
    Set $\theta_{t+1} \leftarrow \theta_t - \frac{\eta_\theta}{|B_t|} \sum_{i \in B_t} \big[\nabla_\theta \ell(x_i, y_i; \theta_t) + \lambda \nabla_\theta \widehat{\psi}_i(\theta_t, W_t)\big]$.
    Set $W_{t+1} \leftarrow \Pi_{\mathcal{W}}\Big( W_t + \frac{2\lambda \eta_w}{|B_t|} \sum_{i \in B_t} \big[ -W_t\, \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T \mid x_i] + \widehat{P}_s^{-1/2}\, \mathbb{E}[s_i\, \hat{y}^T(x_i; \theta_t) \mid x_i, s_i] \big] \Big)$.
end for
Pick $\hat{t}$ uniformly at random from $\{1, \dots, T\}$.
Return: $\theta_{\hat{t}}$.
Note that the matrix $\widehat{P}_s^{-1/2}$ depends only on the full data set of sensitive attributes $\{s_1, \dots, s_N\}$ and has no dependence on $\theta$, and can therefore be computed just once, before the loop of Algorithm 9. On the other hand, the quantities $\mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T \mid x_i]$ and $\mathbb{E}[\hat{y}(x_i; \theta) s_i^T \mid x_i, s_i]$ depend on the sample $(x_i, s_i, \hat{y}_i)$ that is drawn in a given iteration and on the model parameters $\theta$, and are therefore computed at each iteration of the algorithm.
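Putting the pieces together, the two-timescale updates of Algorithm 9 can be sketched end-to-end for a softmax (logistic) model. Everything below is illustrative: the synthetic data, step sizes, and the closed-form gradient of $\widehat{\psi}_i$ through the softmax are our choices, not prescribed by the thesis, and the projection step is omitted (as in the experiments of Section 5.3):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, m, k = 2000, 5, 2, 2

# Synthetic data (illustrative only): the first feature drives both the
# sensitive attribute s and the label y, so an unmitigated model is
# unfair by construction.
X = rng.standard_normal((N, d))
s = (X[:, 0] + 0.5 * rng.standard_normal(N) > 0).astype(int)
y = ((X @ rng.standard_normal(d)) + 1.5 * (2 * s - 1) > 0).astype(int)
S1h = np.eye(k)[s]                                 # one-hot sensitive attrs

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

lam, eta_th, eta_w, T, B = 2.0, 0.05, 0.02, 3000, 8
theta = np.zeros((d, m))                           # softmax-regression weights
W = np.zeros((k, m))                               # dual variable of (5.1)
Ps_ih = np.diag(S1h.mean(axis=0) ** -0.5)          # P_s^{-1/2}, computed once

for t in range(T):
    idx = rng.choice(N, size=B, replace=False)
    Xb, Sb, Yb = X[idx], S1h[idx], np.eye(m)[y[idx]]
    F = softmax(Xb @ theta)                        # F(theta, x_i), (B, m)

    # theta-step (descent): psi_i = (2a - c) . F - 1, with
    # a = W^T Ps^{-1/2} s_i and c_j = ||W[:, j]||^2; chain rule through softmax.
    g = 2 * (Sb @ Ps_ih @ W) - np.sum(W * W, axis=0)[None, :]
    dz_psi = F * (g - np.sum(F * g, axis=1, keepdims=True))
    theta -= (eta_th / B) * Xb.T @ ((F - Yb) + lam * dz_psi)

    # W-step (ascent): grad_W psi_i = -2 W diag(F_i) + 2 Ps^{-1/2} s_i F_i^T.
    gW = -2 * W * F.mean(axis=0)[None, :] + 2 * Ps_ih @ Sb.T @ F / B
    W += eta_w * lam * gW                          # projection omitted

pred = softmax(X @ theta).argmax(axis=1)
gap = abs(pred[s == 1].mean() - pred[s == 0].mean())
print("demographic parity gap:", round(float(gap), 3))
```

The $\theta$-step descends on $\ell + \lambda \widehat{\psi}_i$ and the $W$-step ascends on $\widehat{\psi}_i$, matching the two-timescale SGDA structure; with $\lambda = 0$ the loop reduces to plain minibatch softmax regression.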
Although the min-max problem (FERMI Objective) that we aim to solve is unconstrained, we project the iterates $W_t$ (in the $W$-update of Algorithm 9) onto a bounded set $\mathcal{W}$ in order to satisfy a technical assumption that is needed to prove convergence of Algorithm 9.¶ We choose $\mathcal{W}$ to be a sufficiently large ball that contains $W^*(\theta) := \operatorname{argmax}_W \widehat{F}(\theta, W)$ for every $\theta$ in some neighborhood of $\theta^* \in \operatorname{argmin}_\theta \max_W \widehat{F}(\theta, W)$, so that (FERMI Objective) is equivalent to
$$\min_\theta \max_{W \in \mathcal{W}} \left\{ \widehat{F}(\theta, W) = \widehat{\mathcal{L}}(\theta) + \lambda \widehat{\Psi}(\theta, W) \right\}.$$
See Appendix D.4 for details. When applying Algorithm 9 in practice, it is not necessary to project the iterates; e.g., in Section 5.3, we obtain strong empirical results without projection in Algorithm 9.

¶ Namely, bounded $W_t$ ensures that the gradient of $\widehat{F}$ is Lipschitz continuous at every iterate and that the variance of the stochastic gradients is bounded.
Since (FERMI Objective) is potentially nonconvex in $\theta$, a global minimum might not exist, and even computing a local minimum is NP-hard in general [198]. Thus, as is standard in the nonconvex optimization literature, we aim for the milder goal of finding an approximate stationary point of (FERMI Objective). That is, given any $\epsilon > 0$, we aim to find a point $\theta^*$ such that $\mathbb{E}\|\nabla \mathrm{FERMI}(\theta^*)\| \le \epsilon$, where the expectation is solely with respect to the randomness of the algorithm (minibatch sampling). The following theorem guarantees that Algorithm 9 will find such a point efficiently:

Theorem 67 (Informal statement). Let $\epsilon > 0$. Assume that $\ell(x, y; \cdot)$ and $\mathcal{F}(\cdot, x)$ are Lipschitz continuous and differentiable with Lipschitz continuous gradient (see Appendix D.4 for definitions), that $\hat{p}_S(j) > 0$ for all sensitive attributes $j \in [k]$, and that $\hat{p}_{\widehat{Y}}(l) \ge \mu > 0$ for all labels $l \in [m]$ at every iterate $\theta_t$. Then for any batch sizes $1 \le |B_t| \le N$, Algorithm 9 converges to an $\epsilon$-first order stationary point of (FERMI Objective) in $O\big(\frac{1}{\epsilon^5}\big)$ stochastic gradient evaluations.
The formal statement of Theorem 67 can be found in Theorem 218 in Appendix D.4. Theorem 67 implies that Algorithm 9 can efficiently achieve any tradeoff between fairness (ERMI) violation and (empirical) accuracy, depending on the choice of $\lambda$.∥ However, if smaller fairness violation is desired (i.e., if larger $\lambda$ is chosen), then the algorithm needs to run for more iterations (see Appendix D.4). The proof of Theorem 67 follows from Corollary 66 and the observation that $\widehat{\psi}_i$ is strongly concave in $W$ (see Lemma 221 in Section D.4). This implies that (5.1) is a nonconvex-strongly concave min-max problem, so the convergence guarantee of SGDA [170] yields Theorem 67.∗∗ The detailed proof of Theorem 67 is given in Appendix D.4. Increasing the batch size to $\Theta(\epsilon^{-2})$ improves the stochastic gradient complexity to $O(\epsilon^{-4})$. On the other hand, increasing the batch size further to $|B_t| = N$ results in a deterministic algorithm which is guaranteed to find a point $\theta^*$ such that $\|\nabla \mathrm{FERMI}(\theta^*)\| \le \epsilon$ (no expectation) in $O(\epsilon^{-2})$ iterations [170, Theorem 4.4], [205, Remark 4.2]; this iteration complexity has the optimal dependence on $\epsilon$ [51, 279]. However, like existing fairness algorithms in the literature, this full-batch variant is impractical for large-scale problems.

∥ This sentence is accurate to the degree that an approximate stationary point of the nonconvex objective (FERMI Objective) corresponds to an approximate risk minimizer.
∗∗ A faster convergence rate of $O(\epsilon^{-3})$ could be obtained by using the (more complicated) SREDA method of [183] instead of SGDA to solve the FERMI objective. We omit the details here.
Remark 68. The condition $\hat{p}_{\widehat{Y}}(l) \ge \mu$ in Theorem 67 is assumed in order to ensure strong concavity of $\widehat{F}(\theta_t, \cdot)$ at every iterate $\theta_t$, which leads to the $O(\epsilon^{-5})$ convergence rate. This assumption is typically satisfied in practice: for example, if the iterates $\theta_t$ remain in a compact region during the algorithm and the classifier uses softmax, then $\hat{p}_{\widehat{Y}}(l) \ge \mu > 0$. Having said that, it is worth noting that this condition is not absolutely necessary to ensure convergence of Algorithm 9. Even if this condition doesn't hold, (5.1) is still a nonconvex-concave min-max problem. Hence SGDA still converges to an $\epsilon$-stationary point, albeit at the slower rate of $O(\epsilon^{-8})$ [170]. Alternatively, one can add a small $\ell_2$ regularization term to the objective to enforce strong concavity and get the fast convergence rate of $O(\epsilon^{-5})$.
5.2.2 Asymptotic Convergence of Algorithm 9 for Population-level FRMI Objective
So far, we have let $N \ge 1$ be arbitrary and have not made any assumptions on the underlying distribution(s) from which the data was drawn. Even so, we showed that Algorithm 9 always converges to a stationary point of (FERMI Objective). Now, we will show that if $Z$ contains i.i.d. samples from an unknown joint distribution $\mathcal{D}$ and if $N \gg 1$, then Algorithm 9 converges to an approximate solution of the population risk minimization problem (FRMI Objective). Precisely, we will use a one-pass, sample-without-replacement ("online") variant of Algorithm 9 to obtain this population loss guarantee. The one-pass variant is identical to Algorithm 9 except that: a) once we draw a batch of samples $B_t$, we remove these samples from the data set so that they are never re-used; and b) the for-loop terminates when we have used all $n$ samples.

Theorem 69. Let $\epsilon > 0$. Assume that $\ell(x, y; \cdot)$ and $\mathcal{F}(\cdot, x)$ are Lipschitz continuous and differentiable with Lipschitz continuous gradient, and that $\min_{r \in [k]} p_S(r) > 0$. Then, there exists $N \in \mathbb{N}$ such that if $n \ge N$ and $Z \sim \mathcal{D}^n$, then a one-pass sample-without-replacement variant of Algorithm 9 converges to an $\epsilon$-first order stationary point of (FRMI Objective) in $O\big(\frac{1}{\epsilon^5}\big)$ stochastic gradient evaluations, for any batch sizes $|B_t|$.
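The one-pass modification only changes how minibatches are drawn: each sample is visited exactly once. A minimal sketch of such a sampling-without-replacement pass (function name ours):

```python
import numpy as np

def one_pass_batches(n, batch_size, rng):
    """Yield disjoint minibatch index sets covering each sample once:
    samples are drawn without replacement, never re-used, and the loop
    ends when all n samples are consumed."""
    perm = rng.permutation(n)
    for start in range(0, n, batch_size):
        yield perm[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(one_pass_batches(10, 4, rng))
print([len(b) for b in batches])           # [4, 4, 2]
used = np.concatenate(batches)
print(sorted(used.tolist()) == list(range(10)))   # every sample used exactly once
```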
Theorem 69 provides a guarantee on the fairness/accuracy loss that can be achieved on unseen "test data." This is important because the main goal of (fair) machine learning is to (fairly) give accurate predictions on test data, rather than merely fitting the training data well. Specifically, Theorem 69 shows that with enough (i.i.d.) training examples at our disposal, (one-pass) Algorithm 9 finds an approximate stationary point of the population-level fairness objective (FRMI Objective). Furthermore, the gradient complexity is the same as it was in the empirical case. The proof of Theorem 69 will be aided by the following result, which shows that $\widehat{\psi}_i$ is an asymptotically unbiased estimator of $\Psi$, where $\max_W \Psi(\theta, W)$ equals ERMI:
Proposition 70. Let $\{z_i\}_{i=1}^n = \{x_i, s_i, y_i\}_{i=1}^n$ be drawn i.i.d. from an unknown joint distribution $\mathcal{D}$. Denote
$$\widehat{\psi}_i^{(n)}(\theta, W) = -\mathrm{Tr}\big(W\, \mathbb{E}[\hat{y}(x_i, \theta) \hat{y}(x_i, \theta)^T \mid x_i]\, W^T\big) + 2\,\mathrm{Tr}\Big(W\, \mathbb{E}[\hat{y}(x_i; \theta) s_i^T \mid x_i, s_i]\, \big(\widehat{P}_s^{(n)}\big)^{-1/2}\Big) - 1,$$
where $\widehat{P}_s^{(n)} = \frac{1}{n} \sum_{i=1}^n \mathrm{diag}\big(\mathbf{1}_{\{s_i = 1\}}, \cdots, \mathbf{1}_{\{s_i = k\}}\big)$. Denote $\Psi(\theta, W) = -\mathrm{Tr}(W P_{\hat{y}} W^T) + 2\,\mathrm{Tr}(W P_{\hat{y},s} P_s^{-1/2}) - 1$, where $P_{\hat{y}} = \mathrm{diag}\big(\mathbb{E}\mathcal{F}_1(\theta, x), \cdots, \mathbb{E}\mathcal{F}_m(\theta, x)\big)$, $(P_{\hat{y},s})_{j,r} = \mathbb{E}_{x_i, s_i}[\mathcal{F}_j(\theta, x_i)\, s_{i,r}]$ for $j \in [m]$, $r \in [k]$, and $P_s = \mathrm{diag}\big(\mathbb{P}_S(1), \cdots, \mathbb{P}_S(k)\big)$. Assume $p_S(r) > 0$ for all $r \in [k]$. Then,
$$\max_W \Psi(\theta, W) = D_R\big(\widehat{Y}(\theta); S\big) \qquad \text{and} \qquad \lim_{n \to \infty} \mathbb{E}\big[\widehat{\psi}_i^{(n)}(\theta, W)\big] = \Psi(\theta, W).$$
The proof of Proposition 70 is given in Appendix D.4.1. The first claim is immediate from Proposition 65 and its proof, while the second claim is proved using the strong law of large numbers, the continuous mapping theorem, and Lebesgue's dominated convergence theorem.
Proposition 70 implies that the empirical stochastic gradients computed in Algorithm 9 are good approximations of the true gradients of (FRMI Objective). Intuitively, this suggests that when we use Algorithm 9 to solve the fair ERM problem (FERMI Objective), the output of Algorithm 9 will also be an approximate solution of (FRMI Objective). While Theorem 69 shows this intuition does indeed hold, the proof of Theorem 69 requires additional work. A reasonable first attempt at proving Theorem 69 might be to try to bound the expected distance between the gradient of FRMI and the gradient of FERMI (evaluated at the point $\hat{\theta}$ that is output by Algorithm 9) via Danskin's theorem [68] and strong concavity, and then leverage Theorem 67 to conclude that the gradient of FRMI must also be small. However, the dependence of $\hat{\theta}$ on the training data prevents us from obtaining a tight enough bound on the distance between the empirical and population gradients at $\hat{\theta}$. Thus, we take a different approach to proving Theorem 69, in which we consider the output of two different algorithms: one is a conceptual algorithm that runs one-pass Algorithm 9 as if we had access to the true sensitive attribute distribution $P_s$ ("Algorithm A"); the other is the realistic one-pass Algorithm 9 that only uses the training data ("Algorithm B"). We argue: 1) the output of the conceptual algorithm is a stationary point of the population-level objective; and 2) the distance between the gradients of the population-level objective at $\theta_A$ and $\theta_B$ is small. While 1) follows easily from the proof of Theorem 67 and the online-to-batch conversion, establishing 2) requires a careful argument. The main tools we use in the proof of Theorem 69 are Theorem 67, Proposition 70, Danskin's theorem, Lipschitz continuity of the argmax function for strongly concave objectives, the continuous mapping theorem, and Lebesgue's dominated convergence theorem: see Appendix D.4.1 for the detailed proof.
Note that the online-to-batch conversion used to prove Theorem 69 requires a convergent stochastic optimization algorithm; this implies that our arguments could not be used to prove an analogue of Theorem 69 for existing fair learning algorithms, since existing convergent fairness algorithms are not stochastic. An alternate approach to bounding the "generalization error" of our algorithm would be to use a standard covering/uniform convergence argument. However, this approach would not yield as tight a guarantee as Theorem 69. Specifically, the accuracy and/or gradient complexity guarantee would depend on the dimension of the space (i.e., the number of model parameters), since the covering number depends (exponentially) on the dimension. For large-scale problems with a huge number of model parameters, such dimension dependence is prohibitive.
As previously mentioned, we can interpret Theorem 69 as providing a guarantee that Algorithm 9 generalizes well, achieving small fairness violation and test error even on unseen "test" examples, as long as the data is i.i.d. and $N$ is sufficiently large. In the next section, we empirically corroborate Theorem 69 by evaluating the fairness-accuracy tradeoffs of the FERMI algorithm (Algorithm 9) in several numerical experiments.
5.3 Numerical Experiments
In this section, we evaluate the performance of FERMI in terms of the fairness violation vs. test error
for different notions of fairness (e.g. demographic parity, equalized odds, and equality of opportunity). To
this end, we perform diverse experiments comparing FERMI to other state-of-the-art approaches on several
benchmarks. In Section 5.3.1, we showcase the performance of FERMI applied to a logistic regression model
on binary classification tasks with binary sensitive attributes on Adult, German Credit, and COMPAS
datasets. In Section 5.3.2, we utilize FERMI with a convolutional neural network base model for fair (to
different religious groups) toxic comment detection. In Section 5.3.3, we explore fairness in non-binary
classification with non-binary sensitive attributes. Finally, Section 5.3.4 shows how FERMI may be used
beyond fair empirical risk minimization in domain generalization problems to learn a model independent
of spurious features.
5.3.1 Fair Binary Classification with Binary Sensitive Attributes using Logistic Regression
5.3.1.1 Benchmarking full-batch performance
In the first set of experiments, we use FERMI to learn a fair logistic regression model on the Adult dataset. With the Adult data set, the task is to predict whether or not a person earns over $50k annually without discriminating based on the sensitive attribute, gender. We compare FERMI against state-of-the-art in-processing full-batch ($|B| = N$) baselines, including [273, 94, 140, 131, 115, 214, 23, 222, 79, 57]. Since the majority of existing fair learning algorithms cannot be implemented with $|B| < N$, these experiments allow us to benchmark the performance of FERMI against a wider range of baselines. To contextualize the performance of these methods, we also include a Naïve Baseline that randomly replaces the model output with the majority label (0 in the Adult dataset) with probability $p$ (independent of the data), and sweep $p$ in $[0, 1]$. At one end ($p = 1$), the output will be provably fair, with performance reaching that of a naive classifier that outputs the majority class. At the other end ($p = 0$), the algorithm has no fairness mitigation and obtains the best performance (accuracy). By sweeping $p$, we obtain a tradeoff curve between performance and fairness violation.
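The Naïve Baseline sweep is simple to reproduce: with probability $p$ the model's output is replaced by the majority label, and sweeping $p$ traces out a tradeoff curve. A sketch on synthetic predictions (the data and the deliberately unfair "model" below are ours, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
y = rng.integers(0, 2, size=n)                 # true labels
s = rng.integers(0, 2, size=n)                 # binary sensitive attribute
# An (unfair) stand-in model whose predictions depend on s:
model_pred = np.where(rng.random(n) < 0.8, y, s)
majority = 0                                   # majority label

for p in np.linspace(0, 1, 5):
    # Keep the model output w.p. 1-p, else output the majority label.
    mask = rng.random(n) < p
    pred = np.where(mask, majority, model_pred)
    err = (pred != y).mean()
    dp_gap = abs(pred[s == 1].mean() - pred[s == 0].mean())
    print(f"p={p:.2f}  error={err:.3f}  DP gap={dp_gap:.3f}")
```

At $p = 1$ every prediction is the (data-independent) majority label, so the demographic parity gap is exactly zero while the error degrades to that of the constant classifier, matching the two endpoints described above.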
Figure 5.1: Accuracy/Fairness trade-off of FERMI and several state-of-the-art in-processing approaches on Adult
dataset. FERMI offers the best fairness vs. accuracy tradeoff curve in all experiments against all baselines. Rezaei et
al., 2020 [222] only allow for a single output and do not yield a tradeoff curve. Further, the algorithms by Mary et
al., 2019 [186] and Baharlouei et al., 2020 [23] are equivalent in this binary setting and shown by the red curve. In
the binary/binary setting, FERMI, [186], and [23] all try to solve the same objective (FRMI Objective). However, the empirical formulation (FERMI Objective) and the FERMI algorithm that we use result in better performance, even though we are using a full batch for all baselines in this experiment.
In Figure 5.1, we report the fairness violation (demographic parity, equalized odds, and equality of op-
portunity violations) vs. test error of the aforementioned in-processing approaches on the Adult dataset.
The upper left corner of the tradeoff curves coincides with the unmitigated baseline, which only opti-
mizes for performance (smallest test error). As can be seen, FERMI offers a fairness-accuracy tradeoff
curve that dominates all state-of-the-art baselines in each experiment and with respect to each notion of
fairness, even in the full batch setting. Aside from in-processing approaches, we compare FERMI with
several pre-processing and post-processing algorithms on Adult, German Credit, and COMPAS datasets in
Appendix D.5.5, where we show that the tradeoff curves obtained from FERMI dominate those of all other baselines considered. See Appendix D.5 for details on the data sets and experiments.
It is noteworthy that the empirical objectives of Mary et al., 2019 [186] and Baharlouei et al., 2020 [23] are exactly the same in the binary/binary setting, and their algorithms also coincide (the red curve in Figure 5.1). This is because exponential Rényi mutual information is equal to Rényi correlation for binary targets and/or binary sensitive attributes (see Theorem 200), which is the setting of all experiments in Section 5.3.1. Additionally, like us, in the binary/binary setting these works are trying to empirically solve (FRMI Objective), albeit using different estimation techniques; i.e., their empirical objective is different from (FERMI Objective). This demonstrates the effectiveness of our empirical formulation (FERMI Objective) and our solver (Algorithm 9), even though we are using all baselines in full-batch mode in this experiment. See Appendix D.5.5 for the complete version of Figure 5.1, which also includes pre-processing and post-processing baselines.
Figure D.2 in Appendix D.5 illustrates that FERMI outperforms baselines in the presence of noisy outliers and class imbalance. Our theory did not consider the role of noisy outliers and class imbalance, so the theoretical investigation of this phenomenon could be an interesting direction for future work.
5.3.1.2 The effect of batch size on fairness/accuracy tradeoffs
Next, we evaluate the performance of FERMI on smaller batch sizes ranging from 1 to 64. To this end, we compare FERMI against several state-of-the-art in-processing algorithms that permit stochastic implementation for demographic parity: Mary et al., 2019 [186], Baharlouei et al., 2020 [23], and Cho et al., 2020 [57]. Similarly to the full-batch setting, for all methods, we train a logistic regression model with a respective regularizer for each method. We use demographic parity $L_\infty$ violation (Definition 206) to measure demographic parity violation. More details about the dataset and experiments, and additional experimental results, can be found in Appendix D.5.
Figure 5.2: Performance of FERMI, Mary et al., 2019 [186], Baharlouei et al., 2020 [23], and Cho et al., 2020 [57] with different batch sizes on the Adult dataset. FERMI demonstrates the best accuracy/fairness tradeoff across different batch sizes.
Figure 5.3: Performance of FERMI and other state-of-the-art approaches on the Adult dataset where90% of gender
entries are missing. Full-sensitive FERMI is obtained by applying FERMI on the data without any missing entries.
Figure 5.2 shows that FERMI offers a superior fairness-accuracy tradeoff curve against all baselines for each tested batch size, empirically confirming Theorem 67, as FERMI is the only algorithm that is guaranteed to converge for small minibatches. It is also noteworthy that all other baselines cannot beat the Naïve Baseline when the batch size is very small, e.g., $|B| = 1$. Furthermore, FERMI with $|B| = 4$ almost achieves the same fairness-accuracy tradeoff as the full-batch variant.
5.3.1.3 The effect of missing sensitive attributes on fairness/accuracy tradeoffs
Sensitive attributes might be partially unavailable in many real-world applications due to legal issues,
privacy concerns, and data gathering limitations [282, 62]. Missing sensitive attributes make fair learning
tasks more challenging in practice.
The unbiased gradient estimator used in the FERMI algorithm suggests that it may be able to handle cases where sensitive attributes are partially available and are dropped uniformly at random. As a case study on the Adult dataset, we randomly masked 90% of the sensitive attribute (i.e., gender entries). To estimate the fairness regularization term, we rely on the remaining 10% of the training samples ($\approx$ 3k) with sensitive attribute information. Figure 5.3 depicts the tradeoff between accuracy and fairness (demographic parity) violation for FERMI and other baselines. We suspect that the superior accuracy-fairness tradeoff of FERMI compared to other approaches is due to the fact that the estimator of the gradient remains unbiased, since the missing entries are missing completely at random (MCAR). Note that the Naïve Baseline is similar to the one implemented in the previous section, and Full-sensitive FERMI is an oracle method that applies FERMI to the data with no missing attributes (for comparison purposes only). We observe that FERMI achieves a slightly worse fairness-accuracy tradeoff compared to the Full-sensitive FERMI oracle, whereas the other baselines are hurt significantly and only narrowly outperform the Naïve Baseline.
5.3.2 Fair Binary Classification using Neural Models
Figure 5.4: Fair toxic comment detection with different batch sizes. For $|B| = 128$, the performance of Prost et al., 2019 [214] and FERMI are close to each other. However, when the batch size is reduced to 16, FERMI demonstrates a better fairness/performance tradeoff. Performance and fairness are measured by the test error and the false positive gap between different religious sub-groups (Christians vs. Muslims-Jews), respectively.
In this experiment, our goal is to showcase the efficacy of FERMI in stochastic optimization with neural network function approximation. To this end, we apply FERMI, [214], [23], and [186] (which coincides with [23]) to the Toxic Comment Classification dataset, where the underlying task is to predict whether a given published comment on social media is toxic. The sensitive attribute is religion, binarized into two groups: Christians in one group; Muslims and Jews in the other group. Training a neural network without considering fairness leads to a higher false positive rate for the Jewish-Muslim group. Figure 5.4
Figure 5.5: Comparison between FERMI, Mary et al., 2019 [186], Baharlouei et al., 2020 [23], and Cho et al., 2020 [57]
on Communities dataset. [186] outperforms [23, 57], which we believe could be attributed to the effectiveness of
ERMI as a regularizer. FERMI outperforms [186]. This can be attributed to our empirical formulation FERMI Objective
and unbiased stochastic optimization algorithm.
demonstrates the performance of FERMI, MinDiff [214], [23], and the naïve baseline on two different batch
sizes: 128 and 16. Performance is measured by the overall false positive rate of the trained network, and
fairness violation is measured by the false positive gap between the two sensitive groups (Christians and Jews-
Muslims). The network structure is exactly the same as the one used by MinDiff [214]. We observe that,
as the batch size decreases, FERMI maintains the best fairness-accuracy tradeoff compared to the other
baselines.
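The fairness metric used in this section, the false positive gap between two sensitive groups, is simple to compute from predictions. A minimal NumPy sketch (the helper name and toy data are ours, for illustration):

```python
import numpy as np

def false_positive_gap(y_true, y_pred, group):
    """Gap between the false positive rates of two sensitive groups
    (e.g., Christians vs. Muslims-and-Jews in the toxicity experiment)."""
    fprs = []
    for g in np.unique(group):
        negatives = (group == g) & (y_true == 0)       # ground-truth negatives in group g
        fprs.append(np.mean(y_pred[negatives] == 1))   # FPR within group g
    return abs(fprs[0] - fprs[1])

# Toy example: group 0 has FPR 1/2, group 1 has FPR 0.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([1, 0, 0, 0, 1, 0])
group  = np.array([0, 0, 1, 1, 0, 1])
print(false_positive_gap(y_true, y_pred, group))  # 0.5
```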
5.3.3 Fair Non-binary Classification with Multiple Sensitive Attributes
In this section, we consider a non-binary classification problem with multiple binary sensitive attributes.
In this case, we consider the Communities and Crime dataset, which has 18 binary sensitive attributes in
total. For our experiments, we pick a subset of 1, 2, 3, ..., 18 sensitive attributes, which corresponds to
|S| ∈ {2, 4, 8, ..., 2^18}. We discretize the target into three classes {high, medium, low}. The only baselines
that we are aware of that can handle non-binary classification with multiple sensitive attributes are [186],
[23], [57], [58], and [275]. We used the publicly available implementations of [23] and [57] and extended
their binary classification algorithms to the non-binary setting.
The results are presented in Figure 5.5, where we use conditional demographic parity L∞ violation
(Definition 206) and conditional equal opportunity L∞ violation (Definition 209) as the fairness violation
notions for the two experiments. In each panel, we compare the test error for different numbers of sensitive
attributes at a fixed value of DP violation. It is expected that test error increases with the number of
sensitive attributes, as we will have a more stringent fairness constraint to satisfy. As can be seen, compared
to the baselines, FERMI offers the most favorable test error vs. fairness violation tradeoffs, particularly as
the number of sensitive attributes increases and for the more stringent fairness violation levels, e.g., 0.02.††
5.3.4 Beyond Fairness: Domain Parity Regularization for Domain Generalization
In this section, we demonstrate that our approach may extend beyond fair empirical risk minimization to
other problems, such as domain generalization. In fact, [169, 158, 64] have already established connections
between fair ERM and domain generalization. We consider the Color MNIST dataset [169], where all 60,000
training digits are colored with different colors drawn from a class-conditional Gaussian distribution with
variance σ² around a certain average color for each digit, while the test set remains black and white.
[169] show that as σ² → 0, a convolutional network model overfits significantly to each digit's color
on the training set, and achieves vanishing training error. However, the learned representation does not
generalize to the black and white test set, due to the spurious correlation between digits and color.
Conceptually, the goal of the classifier in this problem is to achieve high classification accuracy with
predictions that are independent of the color of the digit. We view color as the sensitive attribute in this
experiment and apply fairness baselines for the demographic parity notion of fairness. One would expect
that by promoting such independence through a fairness regularizer, the generalization would improve
†† Section 5.3.4 demonstrates that using smaller batch sizes results in much more pronounced advantages of FERMI over these
baselines.
Figure 5.6: Domain generalization on Color MNIST [169] using in-process fair algorithms for demographic parity.
Left panel: The dashed line is the training error and the solid line is the test error. As λ increases, fairness regularization
results in a learned representation that is less dependent on color; hence training error increases while test
error decreases (all algorithms reach a plateau around λ = 8). We use |B| = 512 for all baselines. Right panel:
We plot test error vs. batch size using an optimized value of λ for each algorithm, selected via a validation set. The
performance of the baselines drops 10-20% as the batch size becomes small, whereas FERMI is less sensitive to batch size.
(i.e., lower test error on the black and white test set), at the cost of increased training error (on the colored
training set). We compare against Mary et al., 2019 [186], Baharlouei et al., 2020 [23], and Cho et al.,
2020 [57] as baselines in this experiment.
The results of this experiment are illustrated in Figure 5.6. In the left panel, we see that with no
regularization (λ = 0), the test error is around 80%. As λ increases, all methods achieve smaller test
errors while training error increases. We also observe that FERMI offers the best test error in this setup. In
the right panel, we observe that decreasing the batch size results in significantly worse generalization for
the three baselines considered (due to their biased estimators of the regularizer). However, the negative
impact of small batch size is much less severe for FERMI, since FERMI uses unbiased stochastic gradients. In
particular, the performance gap between FERMI and the other baselines is more than 20% for |B| = 64. Moreover,
FERMI with minibatch size |B| = 64 still outperforms all other baselines with |B| > 1,000. Finally, notice
that the test error achieved by FERMI when σ = 0 is ≈ 30%, as compared to more than 50% obtained
using REPAIR [169] for σ ≤ 0.05.
5.4 Discussion and Concluding Remarks
In this paper, we tackled the challenge of developing a fairness-promoting algorithm that is amenable
to stochastic optimization. As discussed, algorithms for large-scale ML problems are constrained to use
stochastic optimization with (small) minibatches of data in each iteration. To this end, we formulated an
empirical objective (FERMI Objective) using ERMI as a regularizer and derived unbiased stochastic gradient
estimators. We proposed the stochastic FERMI algorithm (Algorithm 9) for solving this objective. We
then provided the first theoretical convergence guarantees for a stochastic in-processing fairness algorithm,
by showing that FERMI converges to stationary points of the empirical and population-level objectives
(Theorem 67, Theorem 69). Further, these convergence results hold even for non-binary sensitive attributes
and non-binary target variables, with any minibatch size.
From an experimental perspective, we showed that FERMI leads to better fairness-accuracy tradeoffs
than all of the state-of-the-art baselines on a wide variety of binary and non-binary classification tasks (for
demographic parity, equalized odds, and equal opportunity). We also showed that these benefits are
particularly significant when the number of sensitive attributes grows or the batch size is small. In particular,
we observed that FERMI consistently outperforms [186] (which tries to solve the same objective, FRMI
Objective) by up to 20% when the batch size is small. This is not surprising, since FERMI is the only algorithm
that is guaranteed to find an approximate solution of the fair learning objective with any batch size
|B| ≥ 1. Also, we show in Fig. D.1 that the lack of a convergence guarantee for [186] is not just due to a more
limited analysis: in fact, their stochastic algorithm does not converge. Even in full batch mode, FERMI
outperforms all baselines, including [186] (Fig. 5.1, Fig. 5.5). In full batch mode, all baselines should be
expected to converge to an approximate solution of their respective empirical objectives, so this suggests
that our empirical objective, FERMI Objective, is fundamentally better, in some sense, than the empirical
objectives proposed in prior works. In what sense is FERMI Objective a better empirical objective (apart
from permitting stochastic optimization)? For one, it is an asymptotically unbiased estimator of FRMI
Objective (by Proposition 70), and Theorem 69 suggests that the FERMI algorithm outputs an approximate
solution of FRMI Objective for large enough N.
By contrast, the empirical objectives considered in prior works do not provably yield an approximate
solution to the corresponding population-level objective.
The superior fairness-accuracy tradeoffs of the FERMI algorithm over the (full batch) baselines also suggest
that the underlying population-level objective, FRMI Objective, has benefits over other fairness objectives.
What might these benefits be? First, ERMI upper bounds all other fairness violations (e.g., Shannon mutual
information, L_q, L∞) used in the literature: see Appendix D.3. This implies that ERMI-regularized training
yields a model that has a small fairness violation with respect to these other notions. Could this also
ing yields a model that has a small fairness violation with respect to these other notions. Could this also
somehow help explain the superior fairness-accuracy tradeoffs achieved by FERMI? Second, the objective
function FRMI Objective is easier to optimize than the objectives of competing in-processing methods:
ERMI is smooth and is equal to the trace of a matrix (see Lemma 204 in the Appendix), which is easy to
compute. Contrast this with the larger computational overhead of the Rényi correlation used by [23], for
example, which requires finding the second singular value of a matrix. Perhaps these computational benefits
contribute to the observed performance gains.
We leave it as future work to rigorously understand the factors that are most responsible for the
favorable fairness-accuracy tradeoffs observed from FERMI.
Author Contributions
This work was done in collaboration with Sina Baharlouei, Ahmad Beirami, Rakesh Pavan, and Meisam
Razaviyayn. AL led the theoretical portion of the project and the writing of the paper. SB led the exper-
imental portion of the paper. RP also contributed to the experiments. AB provided useful guidance and
feedback, especially in helping to shape up the experimental portion of the paper. AB and MR advised on
the project, pointed to [48] (which inspired our use of ERMI), and helped with the min-max formulation
of the FERMI objective function.
Acknowledgements
The authors are thankful to James Atwood (Google Research), Alex Beutel (Google Research), Jilin Chen
(Google Research), and Flavien Prost (Google Research) for constructive discussions and feedback that
helped shape up this paper.
Chapter 6
Stochastic Differentially Private and Fair Learning
This chapter is based on [178].
Summary
Machine learning models are increasingly used in high-stakes decision-making systems. In such applications,
a major concern is that these models sometimes discriminate against certain demographic groups,
such as individuals of a certain race, gender, or age. Another major concern in these applications is the
violation of the privacy of users. While fair learning algorithms have been developed to mitigate discrimination
issues, these algorithms can still leak sensitive information, such as individuals' health or financial
records. Utilizing the notion of differential privacy (DP), prior works aimed at developing learning algorithms
that are both private and fair. However, existing algorithms for DP fair learning are either not
guaranteed to converge or require a full batch of data in each iteration of the algorithm to converge. In this
paper, we provide the first stochastic differentially private algorithm for fair learning that is guaranteed
to converge. Here, the term "stochastic" refers to the fact that our proposed algorithm converges even
when minibatches of data are used at each iteration (i.e. stochastic optimization). Our framework is flexible
enough to permit different fairness notions, including demographic parity and equalized odds. In addition,
our algorithm can be applied to non-binary classification tasks with multiple (non-binary) sensitive
attributes. As a byproduct of our convergence analysis, we provide the first utility guarantee for a DP
algorithm for solving nonconvex-strongly concave min-max problems. Our numerical experiments show that
the proposed algorithm consistently offers significant performance gains over the state-of-the-art baselines,
and can be applied to larger-scale problems with non-binary target/sensitive attributes.
6.1 Introduction
In recent years, machine learning algorithms have been increasingly used to inform decisions with far-
reaching consequences (e.g. whether to release someone from prison or grant them a loan), raising con-
cerns about their compliance with laws, regulations, societal norms, and ethical values. Specifically, ma-
chine learning algorithms have been found to discriminate against certain “sensitive” demographic groups
(e.g. racial minorities), prompting a profusion of algorithmic fairness research [86, 235, 70, 94, 43, 12, 49,
115, 100, 263, 273, 34, 145, 214, 23, 176]. Algorithmic fairness literature aims to develop fair machine
learning algorithms that output non-discriminatory predictions.
Fair learning algorithms typically need access to the sensitive data in order to ensure that the trained
model is non-discriminatory. However, consumer privacy laws (such as the E.U. General Data Protection
Regulation) restrict the use of sensitive demographic data in algorithmic decision-making. These two
requirements—fair algorithms trained with private data—present a quandary: how can we train a model
to be fair to a certain demographic group if we don't even know which of our training examples belong to that
group?
The works of [250, 150] proposed a solution to this quandary using secure multi-party computation
(MPC), which allows the learner to train a fair model without directly accessing the sensitive attributes.
Unfortunately, as [129] observed, MPC does not prevent the trained model from leaking sensitive data. For
example, with MPC, the output of the trained model could be used to infer the race of an individual in
the training data set [103, 120, 231, 50]. To prevent such leaks, [129] argued for the use of differential
privacy [87] in fair learning. Differential privacy (DP) provides a strong guarantee that no company (or
adversary) can learn much more about any individual than they could have learned had that individual’s
data never been used.
Since [129], several follow-up works have proposed alternate approaches to DP fair learning [268, 77,
195, 242, 241, 244]. As shown in Figure 6.1, each of these approaches suffers from at least two critical
shortcomings. In particular, none of these methods have convergence guarantees when mini-batches of data
are used in training. In training large-scale models, memory and efficiency constraints require the use of
small minibatches in each iteration of training (i.e. stochastic optimization). Thus, existing DP fair learning
methods cannot be used in such settings since they require computations on the full training data set in
every iteration. See Appendix E.1 for a more comprehensive discussion of related work.
Our Contributions: In this chapter, we propose a novel algorithmic framework for DP fair learning.
Our approach builds on the non-private fair learning method of [176], which was the subject of Chapter 5.
We consider a regularized empirical risk minimization (ERM) problem where the regularizer penalizes fair-
ness violations, as measured by the Exponential Rényi Mutual Information. Using a result from Chapter 5,
we reformulate this fair ERM problem as a min-max optimization problem. Then, we use an efficient differ-
entially private variation of stochastic gradient descent-ascent (DP-SGDA) to solve this fair ERM min-max
objective. The main features of our algorithm are:
1. Guaranteed convergence for any privacy and fairness level, even when mini-batches of data are used in
each iteration of training (i.e. the stochastic optimization setting). As discussed, stochastic optimization
is essential in large-scale machine learning scenarios. Our algorithm is the first stochastic DP fair
learning method with provable convergence.
2. Flexibility to handle non-binary classification with multiple (non-binary) sensitive attributes (e.g. race
and gender) under different fairness notions such as demographic parity or equalized odds. In each
of these cases, our algorithm is guaranteed to converge.
Empirically, we show that our method outperforms the previous state-of-the-art methods in terms of
fairness vs. accuracy trade-off across all privacy levels. Moreover, our algorithm is capable of training
with mini-batch updates and can handle non-binary target and non-binary sensitive attributes. By contrast,
existing DP fairness algorithms could not converge in our stochastic/non-binary experiment.
A byproduct of our algorithmic developments and analyses is the first DP convergent algorithm for
nonconvex min-max optimization: namely, we provide an upper bound on the stationarity gap of DP-SGDA
for solving problems of the form min_θ max_W F(θ, W), where F(·, W) is non-convex. We expect
this result to be of independent interest to the DP optimization community. Prior works that provide
convergence results for DP min-max problems have assumed that F(·, W) is either (strongly) convex [44,
277] or satisfies a generalization of strong convexity known as the Polyak-Łojasiewicz (PL) condition [269].
6.2 Problem Setting and Preliminaries
Let Z = {z_i = (x_i, s_i, y_i)}_{i=1}^n be a data set with non-sensitive features x_i ∈ X, discrete sensitive attributes
(e.g. race, gender) s_i ∈ [k] := {1, ..., k}, and labels y_i ∈ [l]. Let ŷ_θ(x) = ŷ(x; θ) denote the model
predictions parameterized by θ, and ℓ(θ, x, y) = ℓ(ŷ_θ(x), y) be a loss function (e.g. cross-entropy loss).
Our goal is to (approximately) solve the empirical risk minimization (ERM) problem

min_θ { L̂(θ) := (1/n) Σ_{i=1}^n ℓ(θ, x_i, y_i) }   (6.1)

in a fair manner, while maintaining the differential privacy of the sensitive data {s_i}_{i=1}^n. We consider two
different notions of fairness in this chapter, which are recalled below:∗
Definition 71 (Fairness Notions). Let A : Z → Y be a classifier.
∗ Our method can also handle any other fairness notion that can be defined in terms of statistical (conditional) independence,
such as equal opportunity. However, our method cannot handle all fairness notions: for example, false discovery rate and
calibration error are not covered by our framework.
• A satisfies demographic parity [86] if the predictions A(Z) are statistically independent of the sensitive
attributes.
• A satisfies equalized odds [115] if the predictions A(Z) are conditionally independent of the sensitive
attributes given Y = y for all y.
Figure 6.1: Comparison with existing works. [Table comparing this work with Jagielski et
al., 2019 [129] (post-processing and in-processing), Xu et al., 2019 [267], Ding et al., 2020 [77],
Mozannar et al., 2020 [194], Tran et al., 2021 [240], Tran et al., 2021 [241], and Tran et al.,
2022 [243] along four axes: non-binary target, multiple fairness notions, convergence guarantee
(in polynomial time), and guarantees with mini-batches.] N/A: the post-processing method of
[129] is not an iterative algorithm. *Method requires access to the sensitive data at test time.
The in-processing method of [129] is inefficient. The work of [195] specializes to equalized
odds, but most of their analysis seems to be extendable to other fairness notions.
Depending on the specific problem at hand, one
fairness notion may be more desirable than the
other [86, 115].
In practical applications, achieving exact fairness, i.e. (conditional) independence of Ŷ and S,
is unrealistic. In fact, achieving exact fairness can
be impossible for a differentially private algorithm
that achieves non-trivial accuracy [67]. Thus, we
instead aim to design an algorithm that achieves
small fairness violation on the given data set Z.
Fairness violation can be measured in different
ways: see e.g. [176] for a thorough survey. For
example, if demographic parity is the desired fairness
notion, then one can measure (empirical) demographic
parity violation by

max_{ŷ ∈ Y} max_{s ∈ S} | p̂_{Ŷ|S}(ŷ|s) − p̂_{Ŷ}(ŷ) |,   (6.2)

where p̂ denotes an empirical probability calculated directly from (Z, {ŷ_i}_{i=1}^n). Here, ŷ_i := ŷ_θ(x_i) is the
model's prediction on the i-th training example x_i.
Following the DP fair learning literature (e.g. [129, 242, 244]), we consider a relaxation of DP, in which
only the sensitive attributes require privacy. Say Z and Z′ are adjacent with respect to sensitive data if
Z = {(x_i, y_i, s_i)}_{i=1}^n, Z′ = {(x_i, y_i, s′_i)}_{i=1}^n, and there is a unique i ∈ [n] such that s_i ≠ s′_i.
Definition 72 (Differential Privacy w.r.t. Sensitive Attributes). Let ε ≥ 0, δ ∈ [0, 1). A randomized
algorithm A is (ε, δ)-differentially private w.r.t. the sensitive attributes S (DP) if, for all pairs of data sets Z, Z′
that are adjacent w.r.t. sensitive attributes, we have

P(A(Z) ∈ O) ≤ e^ε P(A(Z′) ∈ O) + δ,

for all measurable O ⊆ Y.
As discussed in Section 6.1, Definition 72 is useful if a company wants to train a fair model, but is
unable to use the sensitive attributes (which are needed to train a fair model) due to privacy concerns and
laws (e.g., the E.U. GDPR). Definition 72 enables the company to privately use the sensitive attributes to
train a fair model, while satisfying legal and ethical constraints. That being said, Definition 72 still may
not prevent leakage of non-sensitive data. Thus, if the company is concerned with privacy of user data
beyond the sensitive demographic attributes, then it should impose DP for all the features. Our algorithm
and analysis readily extend to DP for all features: see Section 6.3.
Throughout the paper, we shall restrict attention to data sets that contain at least a ρ-fraction of every
sensitive attribute for some ρ ∈ (0, 1): i.e. (1/|Z|) Σ_{i=1}^{|Z|} 1{s_i = r} ≥ ρ for all r ∈ [k]. This is a reasonable
assumption in practice: for example, if sex is the sensitive attribute and a data set contains all men, then
training a model that is fair with respect to sex and has non-trivial performance (better than random)
seems almost impossible. Understanding what performance is (im)possible for DP fair learning in the
absence of sample diversity is an important direction for future work.
6.3 Private Fair ERM via Exponential Rényi Mutual Information
A standard in-processing strategy in the literature for enforcing fairness is to add a regularization term
to the empirical objective that penalizes fairness violations [275, 79, 186, 23, 57, 176]. We can then jointly
optimize for fairness and accuracy by solving

min_θ { L̂(θ) + λ D(Ŷ, S, Y) },

where D is some measure of statistical (conditional) dependence between the sensitive attributes and the
predictions (given Y), and λ ≥ 0 is a scalar balancing fairness and accuracy considerations. The choice of D
is crucial and can lead to different fairness-accuracy profiles. Inspired by the strong empirical performance
and amenability to stochastic optimization of the FERMI Algorithm 9, we choose D to be the Exponential
Rényi Mutual Information (ERMI). Recall:
Definition 73 (ERMI – Exponential Rényi Mutual Information). We define the exponential Rényi mutual
information between random variables Ŷ and S, with empirical joint distribution p̂_{Ŷ,S} and marginals p̂_Ŷ, p̂_S, by:

D̂_R(Ŷ, S) := E[ p̂_{Ŷ,S}(Ŷ, S) / (p̂_Ŷ(Ŷ) p̂_S(S)) ] − 1 = Σ_{j∈[l]} Σ_{r∈[k]} p̂_{Ŷ,S}(j, r)² / (p̂_Ŷ(j) p̂_S(r)) − 1.   (ERMI)
Definition 73 is what we would use if demographic parity were the desired fairness notion. If instead one
wanted to encourage equalized odds, then Definition 73 can be readily adapted to these fairness notions
by substituting appropriate conditional probabilities for p̂_{Ŷ,S}, p̂_Ŷ, and p̂_S in ERMI: see Appendix E.2 for
details.† It can be shown that ERMI ≥ 0, and is zero if and only if demographic parity (or equalized
odds, for the conditional version of ERMI) is satisfied [176]. Further, ERMI provides an upper bound on
other commonly used measures of fairness violation [176]: e.g. (6.2), Shannon mutual information [58],
Rényi correlation [23], and L_q fairness violation [145, 115]. This implies that any algorithm that makes ERMI
small will also have a small fairness violation with respect to these other notions. Lastly, Proposition 70 in
Chapter 5 shows that empirical ERMI (Definition 73) is an asymptotically unbiased estimator of "population
ERMI"—which can be defined as in Definition 73, except that empirical distributions are replaced by their
population counterparts (c.f. Definition 64).

† To simplify the presentation, we will assume that demographic parity is the fairness notion of interest in the remainder of
this section. However, we consider both fairness notions in our numerical experiments.
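For intuition, the double sum in Definition 73 can be estimated directly from predictions and sensitive attributes. Below is our own minimal plug-in sketch (this is not the minibatch estimator used inside the FERMI algorithm, just the definition computed verbatim):

```python
import numpy as np

def empirical_ermi(y_hat, s):
    """Plug-in empirical ERMI (Definition 73):
    sum_{j,r} p̂_{Ŷ,S}(j,r)² / (p̂_Ŷ(j) p̂_S(r)) − 1."""
    ermi = -1.0
    for j in np.unique(y_hat):
        p_j = np.mean(y_hat == j)
        for r in np.unique(s):
            p_r = np.mean(s == r)
            p_jr = np.mean((y_hat == j) & (s == r))
            ermi += p_jr ** 2 / (p_j * p_r)
    return ermi

# Independent predictions give ERMI = 0 ...
print(empirical_ermi(np.array([0, 1, 0, 1]), np.array([0, 0, 1, 1])))  # 0.0
# ... while perfectly correlated balanced binary predictions give ERMI = 1.
print(empirical_ermi(np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1])))  # 1.0
```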
Building on Chapter 5, our approach to enforcing fairness is to augment (6.1) with an ERMI regularizer
and privately solve:

min_θ { FERMI(θ) := L̂(θ) + λ D̂_R(Ŷ_θ(X), S) }.   (FERMI Objective)

Since empirical ERMI is an asymptotically unbiased estimator of population ERMI, a solution to FERMI
Objective is likely to generalize to the corresponding fair population risk minimization problem [176].
There are numerous ways to privately solve FERMI Objective. For example, one could use the exponential
mechanism [190], or run noisy gradient descent (GD) [32]. The problem with these approaches is that
they are inefficient or require computing n gradients at every iteration, which is prohibitive for large-scale
problems, as discussed earlier.
Our next goal is to derive a stochastic, differentially private fair learning algorithm. For feature input x,
let the predicted class labels be given by ŷ(x, θ) = j ∈ [l] with probability F_j(x, θ), where F(x, θ) is
differentiable in θ, has range [0, 1]^l, and Σ_{j=1}^l F_j(x, θ) = 1. For instance, F(x, θ) = (F_1(x, θ), ..., F_l(x, θ))
could represent the output of a neural net after the softmax layer or the probability label assigned by a logistic
regression model. Then, recall from Chapter 5 that we have the following min-max re-formulation
of FERMI Objective:
Theorem 74 ([176]). There are differentiable functions ψ̂_i such that FERMI Objective is equivalent to

min_θ max_{W ∈ R^{k×l}} { F̂(θ, W) := L̂(θ) + λ (1/n) Σ_{i=1}^n ψ̂_i(θ, W) }.   (6.3)

Further, ψ̂_i(θ, ·) is strongly concave for all θ.
The functions ψ̂_i were given explicitly in Corollary 66 and are recalled in Appendix E.3. Theorem 74 is
useful because it permits us to use stochastic optimization to solve FERMI Objective: for any batch size m ∈
[n], the gradients (with respect to θ and W) of (1/m) Σ_{i∈B} [ℓ(x_i, y_i; θ) + λ ψ̂_i(θ, W)] are statistically unbiased
estimators of the gradients of F̂(θ, W), if B is drawn uniformly from Z. However, when differential privacy
of the sensitive attributes is also desired, the formulation (6.3) presents some challenges, due to the non-
convexity of F̂(·, W). Indeed, there is no known DP algorithm for solving non-convex min-max problems
that is proven to converge. Next, we provide the first such convergence guarantee.
6.3.1 Noisy DP-FERMI for Stochastic Private Fair ERM
Our proposed stochastic DP algorithm for solving FERMI Objective is given in Algorithm 10. It is a noisy
DP variation of the FERMI Algorithm 9.
Algorithm 10 DP-FERMI Algorithm for Private Fair ERM
1: Input: θ_0 ∈ R^{d_θ}, W_0 = 0 ∈ R^{k×l}, step-sizes (η_θ, η_w), fairness parameter λ ≥ 0, iteration number T,
minibatch size |B_t| = m ∈ [n], set W ⊂ R^{k×l}, noise parameters σ²_w, σ²_θ.
2: Compute P̂_S^{−1/2}.
3: for t = 0, 1, ..., T do
4:   Draw a mini-batch B_t of data points {(x_i, s_i, y_i)}_{i∈B_t}.
5:   Set θ_{t+1} ← θ_t − (η_θ/|B_t|) Σ_{i∈B_t} [∇_θ ℓ(x_i, y_i; θ_t) + λ(∇_θ ψ̂_i(θ_t, W_t) + u_t)], where u_t ∼ N(0, σ²_θ I_{d_θ}).
6:   Set W_{t+1} ← Π_W( W_t + η_w [ (λ/|B_t|) Σ_{i∈B_t} ∇_w ψ̂_i(θ_t, W_t) + V_t ] ), where V_t is a k×l matrix with
independent random Gaussian entries (V_t)_{r,j} ∼ N(0, σ²_w).
7: end for
8: Pick t̂ uniformly at random from {1, ..., T}.
9: Return: θ̂_T := θ_{t̂}.
Explicit formulae for ∇_θ ψ̂_i(θ_t, W_t) and ∇_w ψ̂_i(θ_t, W_t) are given in Lemma 224 (Appendix E.4). We
provide the privacy guarantee of Algorithm 10 in Theorem 75:
Theorem 75. Let ε ≤ 2 ln(1/δ), δ ∈ (0, 1), and T ≥ (n√ε / (2m))². Assume F(x, ·) is L_θ-Lipschitz for all x,
and |(W_t)_{r,j}| ≤ D for all t ∈ [T], r ∈ [k], j ∈ [l]. Then, for

σ²_w ≥ 16 T ln(1/δ) / (ε² n² ρ)   and   σ²_θ ≥ 16 L_θ² D² ln(1/δ) T / (ε² n² ρ),

Algorithm 10 is (ε, δ)-DP with respect to the sensitive attributes for all data sets containing at least a ρ-fraction
of minority attributes. Further, if

σ²_w ≥ (32 T ln(1/δ) / (ε² n²)) · (1/ρ + D²)   and   σ²_θ ≥ 64 L_θ² D² ln(1/δ) T / (ε² n² ρ) + 32 D⁴ L_θ² l² T ln(1/δ) / (ε² n²),

then Algorithm 10 is (ε, δ)-DP (with respect to all features) for all data sets containing at least a ρ-fraction of
minority attributes.
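The noise levels prescribed by Theorem 75 are explicit, so they can be computed directly from the problem constants. A small helper of our own, assuming all constants (L_θ, D, ρ, etc.) are known:

```python
import math

def theorem_75_noise(eps, delta, T, n, m, rho, L_theta, D, l, all_features=False):
    """Return (σ²_w, σ²_θ) satisfying the bounds of Theorem 75."""
    # Preconditions of the theorem.
    assert eps <= 2 * math.log(1 / delta)
    assert T >= (n * math.sqrt(eps) / (2 * m)) ** 2
    c = T * math.log(1 / delta) / (eps ** 2 * n ** 2)
    if not all_features:   # DP w.r.t. the sensitive attributes only
        sigma2_w = 16 * c / rho
        sigma2_theta = 16 * L_theta ** 2 * D ** 2 * c / rho
    else:                  # DP w.r.t. all of the features
        sigma2_w = 32 * c * (1 / rho + D ** 2)
        sigma2_theta = (64 * L_theta ** 2 * D ** 2 * c / rho
                        + 32 * D ** 4 * L_theta ** 2 * l ** 2 * c)
    return sigma2_w, sigma2_theta
```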
See Appendix E.4 for the proof. Next, we give a convergence guarantee for Algorithm 10:
Theorem 76. Assume the loss function ℓ(·, x, y) and F(x, ·) are Lipschitz continuous with Lipschitz gradient
for all (x, y), and P̂_S(r) ≥ ρ > 0 for all r ∈ [k]. In Algorithm 10, choose W to be a sufficiently large ball that
contains W*(θ) := argmax_W F̂(θ, W) for every θ in some neighborhood of θ* ∈ argmin_θ max_W F̂(θ, W).
Then there exist algorithmic parameters such that the (ε, δ)-DP Algorithm 10 returns θ̂_T with

E‖∇FERMI(θ̂_T)‖² = O( √(max(d_θ, kl) ln(1/δ)) / (ε n) ),

treating D = diameter(W), λ, ρ, l, and the Lipschitz and smoothness parameters of ℓ and F as constants.
Theorem 76 shows that Algorithm 10 finds an approximate stationary point of FERMI Objective. Finding
approximate stationary points is generally the best one can hope to do in polynomial time for non-convex
optimization [198]. The stationarity gap in Theorem 76 depends on the number of samples n and
model parameters d_θ, the desired level of privacy (ε, δ), and the number of labels l and sensitive attributes
k. For large-scale models (e.g. deep neural nets), we typically have d_θ ≫ 1 and k, l = O(1), so that the
convergence rate of Algorithm 10 is essentially immune to the number of labels and sensitive attributes. In
contrast, no existing works with convergence guarantees are able to handle non-binary classification (l > 2),
even with full batches and a single binary sensitive attribute.
A few more remarks are in order. First, the utility bound in Theorem 76 corresponds to DP for all of
the features. If DP is only required for the sensitive attributes, then using the smaller σ²_θ, σ²_w in Theorem 75
would improve the dependence on the constants D, l, L_θ in the utility bound. Second, the choice of
W in Theorem 76 implies that (6.3) is equivalent to min_θ max_{W∈W} F̂(θ, W), which is what our algorithm
directly solves (c.f. Problem E.3). Lastly, note that while we return a uniformly random iterate in Algorithm 10
for our theoretical convergence analysis, we recommend returning the last iterate θ_T in practice:
our numerical experiments show strong performance of the last iterate.
In Theorem 226 of Appendix E.5, we prove a result which is more general than Theorem 76. Theorem 226
shows that noisy DP-SGDA converges to an approximate stationary point of any smooth nonconvex-
strongly concave min-max optimization problem (not just (6.3)). We expect Theorem 226 to be of general
interest to the DP optimization community beyond its applications to DP fair learning, since it is the first
DP convergence guarantee for nonconvex min-max optimization. We also give a bound on the iteration
complexity T in Appendix E.5.

The proof of Theorem 226 involves a careful analysis of how the Gaussian noises propagate through
the optimization trajectories of θ_t and w_t. Compared with DP non-convex minimization analyses (e.g.
[257, 125, 76, 177]), the two noises required to privatize the solution of the min-max problem we consider
complicate the analysis and require careful tuning of η_θ and η_W. Compared to existing analyses of DP
min-max games in [44, 269, 277], which assume that f(·, w) is convex or PL, dealing with non-convexity is
a challenge that requires different optimization techniques.
6.4 Numerical Experiments
In this section, we evaluate the performance of our proposed approach (DP-FERMI) in terms of the fairness
violation vs. test error for different privacy levels. We present our results in two parts: In Section 6.4.1,
we assess the performance of our method in training logistic regression models on several benchmark
tabular datasets. Since this is a standard setup that existing DP fairness algorithms can handle, we are able
to compare our method against the state-of-the-art baselines. We carefully tuned the hyperparameters
of all baselines for fair comparison. We find that DP-FERMI consistently outperforms all state-of-the-art
baselines across all data sets and all privacy levels. These observations hold for both demographic parity
and equalized odds fairness notions. To quantify the improvement of our results over the state-of-the-art
baselines, we calculated the performance gain with respect to fairness violation (for fixed accuracy level)
that our model yields over all the datasets. For demographic parity, our fairness violation was 79.648% lower than that of [242] on average, and 65.89% lower in the median. For equalized odds, the average performance gain was 96.65% and the median gain was 90.02%. In Section 6.4.2, we showcase the
scalability of DP-FERMI by using it to train a deep convolutional neural network for classification on a
large image dataset. In Appendix E.6, we give detailed descriptions of the data sets, experimental setups
and training procedure, along with additional results.
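For concreteness, the aggregated percentages above correspond to computing, per dataset, the relative reduction in fairness violation at a matched accuracy level, then taking the mean and median across datasets. A minimal sketch (the function name and argument convention are illustrative, not taken from the paper's code):

```python
import statistics

def fairness_gain(baseline_violations, our_violations):
    """Percentage reduction in fairness violation relative to a baseline,
    at a matched accuracy level, aggregated across datasets."""
    gains = [100.0 * (base - ours) / base
             for base, ours in zip(baseline_violations, our_violations)]
    return statistics.mean(gains), statistics.median(gains)
```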
6.4.1 Standard Benchmark Experiments: Logistic Regression on Tabular Datasets
In the first set of experiments we train a logistic regression model using DP-FERMI (Algorithm 10) for
demographic parity and a modified version of DP-FERMI (described in Appendix E.6) for equalized odds.
We compare DP-FERMI against all applicable publicly available baselines in each experiment.
6.4.1.1 DemographicParity
We use four benchmark tabular datasets from the UCI machine learning repository [81]: Adult Income, Retired Adult, Parkinsons, and Credit Card. The predicted variables and sensitive attributes are both binary in these datasets. We analyze fairness-accuracy trade-offs with four different values of ε ∈ {0.5, 1, 3, 9} for each dataset. We compare against state-of-the-art algorithms proposed in [241] and (the demographic parity objective of) [242]. The results displayed are averages over 15 trials (random seeds) for each value of ε.
For the Adult dataset, the task is to predict whether income is greater than $50K, with gender as the sensitive attribute. The Retired Adult dataset is the same as the Adult dataset, but with
updated data. We use the same output and sensitive attributes for both experiments. The results for Adult
and Retired Adult are shown in Figures 6.2 and E.1 (in Appendix E.6). Compared to [241, 242], DP-FERMI
offers superior fairness-accuracy tradeoffs at every privacy (ε) level.
In the Parkinsons dataset, the task is to predict whether the total UPDRS score of the patient is greater than the median, with gender as the sensitive attribute. Results for ε ∈ {1, 3} are shown in Figure 6.3. See Figure E.3 in Appendix E.6 for ε ∈ {0.5, 9}. Our algorithm again outperforms the baselines [241, 242]
for all tested privacy levels.
In the Credit Card dataset, the task is to predict whether the user will default on payment in the next month, with gender as the sensitive attribute. Results are shown in Figure E.2 in Appendix E.6. Once again,
DP-FERMI provides the most favorable privacy-fairness-accuracy profile.
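For reference, the demographic parity violation plotted in these experiments can be measured, for binary predictions and a binary sensitive attribute, as the gap between the two groups' empirical positive-prediction rates. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def demographic_parity_violation(y_pred, s):
    """Empirical |P(Yhat = 1 | S = 0) - P(Yhat = 1 | S = 1)| for binary
    predictions y_pred and a binary sensitive attribute s."""
    y_pred, s = np.asarray(y_pred), np.asarray(s)
    return abs(y_pred[s == 0].mean() - y_pred[s == 1].mean())
```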
6.4.1.2 Equalized Odds
Next, we consider a slightly modified version of Algorithm 10, which is designed to minimize the equalized odds violation by replacing the absolute probabilities in the objective with class-conditional probabilities: see Appendix E.6.2.4 for details.
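A standard empirical measure of the equalized odds violation conditions the positive-prediction rates on the true label; the sketch below is for illustration only (the exact objective DP-FERMI minimizes is the one in Appendix E.6.2.4, not this metric, and the function name is hypothetical):

```python
import numpy as np

def equalized_odds_violation(y_true, y_pred, s):
    """Largest gap, over true labels y, between the groups' empirical
    positive-prediction rates P(Yhat = 1 | Y = y, S = group)."""
    y_true, y_pred, s = map(np.asarray, (y_true, y_pred, s))
    gaps = []
    for y in np.unique(y_true):
        rates = [y_pred[(y_true == y) & (s == grp)].mean()
                 for grp in np.unique(s)]
        gaps.append(max(rates) - min(rates))
    return max(gaps)
```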
Figure 6.2: Private, Fair (Demographic Parity) logistic regression on the Adult dataset; panels (a)–(d) show ε = 0.5, 1, 3, 9.
We considered the Credit Card and Adult datasets for these experiments, using the same sensitive
attributes as mentioned above. Results for Credit Card are shown in Figure 6.4. Adult results are given
in Figure E.4 in Appendix E.6.2.4. Our equalized odds variant of DP-FERMI outperforms the state-of-the-art baselines, [129] and the equalized odds objective in [242], at every privacy level.
6.4.2 Large-Scale Experiment: Deep Convolutional Neural Network on Image Dataset
In our second set of experiments, we train a deep 9-layer VGG-like classifier [229] with d ≈ 1.6 million parameters on the UTK-Face dataset [281] using Algorithm 10. We classify the facial images into 9 age groups, similar to the setup in [244], while keeping race (containing 5 classes) as the sensitive attribute. See Appendix E.6.3 for more details. We consider four different privacy levels ε ∈ {10, 25, 50, 100}. Compared to the tabular datasets, larger ε is needed to obtain stable results in the large-scale setting since
Figure 6.3: Private, Fair (Demographic Parity) logistic regression on the Parkinsons dataset; panels (a)–(b) show ε = 1, 3.
Figure 6.4: Private, Fair (Equalized Odds) logistic regression on the Credit Card dataset; panels (a)–(c) show ε = 0.5, 1, 3.
the number of parameters d is much larger and the cost of privacy increases with d (see Theorem 76).
Larger values of ε > 100 were used in the baseline [129] for smaller-scale experiments.
The results in Figure 6.5 empirically verify our main theoretical result: DP-FERMI converges even for
non-binary classification with small batch size and non-binary sensitive attributes. We took [241, 242] as
our baselines and attempted to adapt them to this non-binary large-scale task. We observed that the
baselines were very unstable while training and mostly gave degenerate results (predicting a single output
irrespective of the input). By contrast, our method was able to obtain stable and meaningful tradeoff curves.
Also, while [244] reported results on UTK-Face, their code is not publicly available and we were unable to
reproduce their results.
Figure 6.5: DP-FERMI on a deep CNN for image classification on UTK-Face; panels (a)–(d) show ε = 10, 25, 50, 100.
6.5 Concluding Remarks
Motivated by pressing legal, ethical, and social considerations, we studied the challenging problem of learn-
ing fair models with differentially private demographic data. We observed that existing works suffer from
a few crucial limitations that render their approaches impractical for large-scale problems. Specifically,
existing approaches require full batches of data in each iteration (and/or exponential runtime) in order
to provide convergence/accuracy guarantees. We addressed these limitations by deriving a DP stochastic
optimization algorithm for fair learning, and rigorously proved the convergence of the proposed method.
Our convergence guarantee holds even for non-binary classification (with any hypothesis class, even infinite VC dimension, cf. [129]) with multiple sensitive attributes and access to random minibatches of data
in each iteration. Finally, we evaluated our method in extensive numerical experiments and found that it
significantly outperforms the previous state-of-the-art models, in terms of fairness-accuracy tradeoff. The
potential societal impacts of our work are discussed in Appendix E.7.
Author Contributions
This work was done in collaboration with Devansh Gupta and Meisam Razaviyayn. AL led the project,
conducted the theoretical analyses, and primarily wrote the paper. DG executed the numerical experiments
and helped write the “numerical experiments” portion of the paper. MR advised and provided feedback.
Chapter 7
Conclusion
This thesis focused on training machine learning models subject to privacy and/or fairness considerations.
We studied three problems in differentially private ML: learning with public data (Chapter 2), federated
learning without a trusted server (Chapter 3), and stochastic optimization with large worst-case Lipschitz
parameter (Chapter 4). For each of these problems, we proved tight upper and lower bounds (up to constant
or logarithmic factors) on the worst-case error of DP algorithms. We also developed efficient first-order
optimization algorithms that are suitable for large-scale ML and that have (near) optimal error. Next, we
turned to fair learning. We developed the first provably convergent stochastic algorithm (called FERMI)
for training large-scale fair ML models (Chapter 5). We then devised a differentially private variation of
FERMI and proved that this algorithm also converges (Chapter 6). Numerical experiments demonstrated
that our algorithms outperform state-of-the-art approaches in many cases. Our work also raised several
interesting open questions. We conclude by recalling a few of these questions:
• Are the lower bounds for i.i.d. convex ISRL-DP FL achievable when silo data is heterogeneous?
• What are the optimal rates for finding approximate stationary points in DP non-convex (non-PL)
ERM and stochastic optimization?
• Can the lower bounds for DP stochastic optimization with bounded k-th moments be achieved in
polynomial time if L_f = ∞? Can they be achieved in linear time (even if L_f < ∞)?
Bibliography
[1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar,
and Li Zhang. “Deep Learning with Differential Privacy”. In: Proceedings of the 2016 ACM SIGSAC
Conference on Computer and Communications Security (Oct. 2016).doi: 10.1145/2976749.2978318.
[2] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. “Differentially private assouad, fano, and le
cam”. In: Algorithmic Learning Theory. PMLR. 2021, pp. 48–78.
[3] Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright.
“Information-Theoretic Lower Bounds on the Oracle Complexity of Stochastic Convex
Optimization”. In: IEEE Transactions on Information Theory 58.5 (2012), pp. 3235–3249.doi:
10.1109/TIT.2011.2182178.
[4] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. “A
reductions approach to fair classification”. In: International Conference on Machine Learning.
PMLR. 2018, pp. 60–69.
[5] Sina Aghaei, Mohammad Javad Azizi, and Phebe Vayanos. “Learning optimal and fair decision
trees for non-discriminative decision-making”. In: Proceedings of the AAAI Conference on Artificial
Intelligence. Vol. 33. 2019, pp. 1418–1426.
[6] Ahmad Ajalloeian and Sebastian U Stich. “On the convergence of SGD with biased gradients”. In:
arXiv preprint arXiv:2008.00051 (2020).
[7] Nasser Aldaghri, Hessam Mahdavifar, and Ahmad Beirami. “FeO2: Federated Learning with
Opt-Out Differential Privacy”. In: arXiv preprint arXiv:2110.15252 (2021).
[8] Noga Alon, Raef Bassily, and Shay Moran. “Limits of private learning with access to public data”.
In: Advances in neural information processing systems 32 (2019).
[9] Ehsan Amid, Arun Ganesh, Rajiv Mathews, Swaroop Ramaswamy, Shuang Song,
Thomas Steinke, Vinith M Suriyakumar, Om Thakkar, and Abhradeep Thakurta. “Public
data-assisted mirror descent for private model training”. In: International Conference on Machine
Learning. PMLR. 2022, pp. 517–535.
[10] Alex Andoni. COMS E6998-9: Algorithmic Techniques for Massive Data. 2015.url:
https://www.mit.edu/~andoni/F15_AlgoTechMassiveData/files/scribe2.pdf.
[11] Galen Andrew, Om Thakkar, Brendan McMahan, and Swaroop Ramaswamy. “Differentially
private learning with adaptive clipping”. In: Advances in Neural Information Processing Systems 34
(2021).
[12] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. “Machine Bias”. In: ProPublica
(2016).
[13] Apple. Differential Privacy Overview. 2016. url: https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf.
[14] Pathum Chamikara Mahawaga Arachchige, Peter Bertok, Ibrahim Khalil, Dongxi Liu,
Seyit Camtepe, and Mohammed Atiquzzaman. “Local differential privacy for deep learning”. In:
IEEE Internet of Things Journal 7.7 (2019), pp. 5827–5842.
[15] Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and
Blake Woodworth. “Lower bounds for non-convex stochastic optimization”. In: arXiv preprint
arXiv:1912.02365 (2019).
[16] Raman Arora, Raef Bassily, Tomás González, Cristóbal Guzmán, Michael Menart, and
Enayat Ullah. “Faster Rates of Convergence to Stationary Points in Differentially Private
Optimization”. In: arXiv preprint arXiv:2206.00846 (2022).
[17] Raman Arora, Raef Bassily, Cristóbal Guzmán, Michael Menart, and Enayat Ullah. “Differentially
Private Generalized Linear Models Revisited”. In: arXiv preprint arXiv:2205.03014 (2022).
[18] Hilal Asi, John Duchi, Alireza Fallah, Omid Javidbakht, and Kunal Talwar. “Private adaptive
gradient methods for convex optimization”. In: International Conference on Machine Learning.
PMLR. 2021, pp. 383–392.
[19] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. “Private stochastic convex
optimization: Optimal rates in l1 geometry”. In: International Conference on Machine Learning.
Ed. by Marina Meila and Tong Zhang. Vol. 139. Proceedings of Machine Learning Research.
PMLR. PMLR, 18–24 Jul 2021, pp. 393–403.url: https://proceedings.mlr.press/v139/asi21b.html.
[20] Hilal Asi, Vitaly Feldman, and Kunal Talwar. “Optimal algorithms for mean estimation under
local differential privacy”. In: International Conference on Machine Learning. PMLR. 2022,
pp. 1046–1056.
[21] Brendan Avent, Aleksandra Korolova, David Zeber, Torgeir Hovden, and Benjamin Livshits.
“BLENDER: Enabling Local Search with a Hybrid Differential Privacy Model.” In: USENIX Security Symposium. 2017, pp. 747–764.
[22] Eugene Bagdasaryan, Omid Poursaeed, and Vitaly Shmatikov. “Differential privacy has disparate
impact on model accuracy”. In: Advances in neural information processing systems 32 (2019).
[23] Sina Baharlouei, Maher Nouiehed, Ahmad Beirami, and Meisam Razaviyayn. “Rényi Fair
Inference”. In: ICLR. 2020.
[24] Borja Balle, James Bell, Adria Gascón, and Kobbi Nissim. “The privacy blanket of the shuffle
model”. In: Annual International Cryptology Conference. Springer. 2019, pp. 638–667.
[25] Borja Balle, Peter Kairouz, Brendan McMahan, Om Dipakbhai Thakkar, and Abhradeep Thakurta.
“Privacy Amplification via Random Check-Ins”. In: 33 (2020).
[26] Rina Foygel Barber and John C Duchi. “Privacy and statistical risk: Formalisms and minimax
bounds”. In: arXiv preprint arXiv:1412.4451 (2014).
[27] Raef Bassily, Albert Cheu, Shay Moran, Aleksandar Nikolov, Jonathan Ullman, and Steven Wu.
“Private query release assisted by public data”. In: International Conference on Machine Learning.
PMLR. 2020, pp. 695–703.
[28] Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. “Stability of stochastic
gradient descent on nonsmooth convex losses”. In: Advances in Neural Information Processing
Systems 33 (2020), pp. 4381–4391.
[29] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Thakurta. “Private Stochastic Convex
Optimization with Optimal Rates”. In: Advances in Neural Information Processing Systems. Vol. 32.
2019.
[30] Raef Bassily, Cristóbal Guzmán, and Michael Menart. “Differentially Private Stochastic
Optimization: New Results in Convex and Non-Convex Settings”. In: arXiv preprint
arXiv:2107.05585 (2021).
[31] Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. “Non-euclidean differentially private
stochastic convex optimization”. In: Conference on Learning Theory. PMLR. 2021, pp. 474–499.
[32] Raef Bassily, Adam Smith, and Abhradeep Thakurta. “Private empirical risk minimization:
Efficient algorithms and tight error bounds”. In: 2014 IEEE 55th Annual Symposium on Foundations
of Computer Science. IEEE. 2014, pp. 464–473.
[33] Raef Bassily, Om Thakkar, and Abhradeep Guha Thakurta. “Model-agnostic private learning”. In:
Advances in Neural Information Processing Systems 31 (2018).
[34] Yahav Bechavod and Katrina Ligett. “Penalizing unfairness in binary classification”. In: arXiv
preprint arXiv:1707.00044 (2017).
[35] Yahav Bechavod, Katrina Ligett, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. “Equal
opportunity in online classification with partial feedback”. In: arXiv preprint arXiv:1902.02242
(2019).
[36] Amos Beimel, Kobbi Nissim, and Uri Stemmer. “Private learning and sanitization: Pure vs.
approximate differential privacy”. In: Approximation, Randomization, and Combinatorial
Optimization. Algorithms and Techniques: 16th International Workshop, APPROX 2013, and 17th
International Workshop, RANDOM 2013, Berkeley, CA, USA, August 21-23, 2013. Proceedings.
Springer. 2013, pp. 363–378.
[37] Richard Berk, Hoda Heidari, Shahin Jabbari, Matthew Joseph, Michael Kearns,
Jamie Morgenstern, Seth Neel, and Aaron Roth. “A convex framework for fair regression”. In:
arXiv preprint arXiv:1706.02409 (2017).
[38] Alex Beutel, Jilin Chen, Tulsee Doshi, Hai Qian, Allison Woodruff, Christine Luu,
Pierre Kreitmann, Jonathan Bischof, and Ed H Chi. “Putting fairness principles into practice:
Challenges, metrics, and improvements”. In: Proceedings of the 2019 AAAI/ACM Conference on AI,
Ethics, and Society. 2019, pp. 453–459.
[39] Abhishek Bhowmick, John Duchi, Julien Freudiger, Gaurav Kapoor, and Ryan Rogers. “Protection
against reconstruction and its applications in private federated learning”. In: arXiv preprint
arXiv:1812.00984 (2018).
[40] Alex Bie, Gautam Kamath, and Vikrant Singhal. “Private Estimation with Public Data”. In: arXiv
preprint arXiv:2208.07984 35 (2022). Ed. by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave,
K. Cho, and A. Oh, pp. 18653–18666.url:
https://proceedings.neurips.cc/paper_files/paper/2022/file/765ec49952dd0140ac754d6d3f9bc899-Paper-Conference.pdf.
[41] Andrea Bittau, Ulfar Erlingsson, Petros Maniatis, Ilya Mironov, Ananth Raghunathan, David Lie,
Mitch Rudominer, Ushasree Kode, Julien Tinnes, and Bernhard Seefeld. “Prochlo: Strong Privacy
for Analytics in the Crowd”. In: Proceedings of the Symposium on Operating Systems Principles
(SOSP). 2017, pp. 441–459.
[42] March Boedihardjo, Thomas Strohmer, and Roman Vershynin. “Covariance’s loss is privacy’s
gain: Computationally efficient, private and accurate synthetic data”. In: Foundations of
Computational Mathematics (2022), pp. 1–48.
[43] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. “Man is
to computer programmer as woman is to homemaker? debiasing word embeddings”. In: Advances
in neural information processing systems. 2016, pp. 4349–4357.
[44] Digvijay Boob and Cristóbal Guzmán. “Optimal algorithms for differentially private stochastic
monotone variational inequalities and saddle-point problems”. In: arXiv preprint arXiv:2104.02988
(2021).
[45] Olivier Bousquet and André Elisseeff. “Stability and generalization”. In: The Journal of Machine
Learning Research 2 (2002), pp. 499–526.
[46] Sébastien Bubeck et al. “Convex optimization: Algorithms and complexity”. In: Foundations and
Trends® in Machine Learning 8.3-4 (2015), pp. 231–357.
[47] Mark Bun and Thomas Steinke. “Concentrated Differential Privacy: Simplifications, Extensions,
and Lower Bounds”. In: Proceedings, Part I, of the 14th International Conference on Theory of
Cryptography - Volume 9985. Berlin, Heidelberg: Springer-Verlag, 2016, pp. 635–658.isbn:
9783662536407.doi: 10.1007/978-3-662-53641-4_24.
[48] Flavio Calmon, Ali Makhdoumi, Muriel Médard, Mayank Varia, Mark Christiansen, and
Ken R Duffy. “Principal inertia components and applications”. In: IEEE Transactions on
Information Theory 63.8 (2017), pp. 5011–5038.
[49] Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and
Kush R Varshney. “Optimized pre-processing for discrimination prevention”. In: Advances in
Neural Information Processing Systems. 2017, pp. 3992–4001.
[50] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss,
Katherine Lee, Adam Roberts, Tom B Brown, Dawn Song, Ulfar Erlingsson, et al. “Extracting
Training Data from Large Language Models”. In: USENIX Security Symposium. Vol. 6. 2021,
pp. 2633–2650.
[51] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. “Lower bounds for finding
stationary points I”. In: Mathematical Programming 184.1 (2020), pp. 71–120.
[52] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. “Differentially private empirical
risk minimization.” In: Journal of Machine Learning Research 12.3 (2011).
[53] Yi-Ruei Chen, Amir Rezapour, and Wen-Guey Tzeng. “Privacy-preserving ridge regression on
distributed data”. In: Information Sciences 451 (2018), pp. 34–49.
[54] Xiangyi Chen, Steven Z Wu, and Mingyi Hong. “Understanding gradient clipping in private SGD:
A geometric perspective”. In: Advances in Neural Information Processing Systems 33 (2020),
pp. 13773–13782.
[55] Albert Cheu, Matthew Joseph, Jieming Mao, and Binghui Peng. “Shuffle Private Stochastic
Convex Optimization”. In: International Conference on Learning Representations. 2022.url:
https://openreview.net/forum?id=DrZXuTGg2A-.
[56] Albert Cheu, Adam Smith, Jonathan Ullman, David Zeber, and Maxim Zhilyaev. “Distributed
differential privacy via shuffling”. In: Annual International Conference on the Theory and
Applications of Cryptographic Techniques. Springer. 2019, pp. 375–403.
[57] Jaewoong Cho, Gyeongjo Hwang, and Changho Suh. “A Fair Classifier Using Kernel Density
Estimation”. In: Advances in Neural Information Processing Systems 33: Annual Conference on
Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. Ed. by
Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin.
2020.
[58] Jaewoong Cho, Gyeongjo Hwang, and Changho Suh. “A fair classifier using mutual information”.
In: 2020 IEEE International Symposium on Information Theory (ISIT). IEEE. 2020, pp. 2521–2526.
[59] Miri Choi. “Medical Insurance Charges Data”. In: (2018).url:
https://www.kaggle.com/mirichoi0218/insurance.
[60] George M Church. The personal genome project. 2005.
[61] Evgenii Chzhen and Nicolas Schreuder. “A minimax framework for quantifying risk-fairness
trade-off in regression”. In: arXiv preprint arXiv:2007.14265 (2020).
[62] Amanda Coston, Karthikeyan Natesan Ramamurthy, Dennis Wei, Kush R Varshney,
Skyler Speakman, Zairah Mustahsan, and Supriyo Chakraborty. “Fair transfer learning with
missing protected attributes”. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and
Society. 2019, pp. 91–98.
[63] Thomas M Cover and Joy A Thomas. “Information theory and statistics”. In: Elements of
Information Theory 1.1 (1991), pp. 279–335.
[64] Elliot Creager, Jörn-Henrik Jacobsen, and Richard Zemel. “Environment inference for invariant
learning”. In: International Conference on Machine Learning. PMLR. 2021, pp. 2189–2200.
[65] Mark E Crovella, Murad S Taqqu, and Azer Bestavros. “Heavy-tailed probability distributions”. In:
A Practical Guide to Heavy Tails: Statistical Techniques and Applications (1998).
[66] Imre Csiszár and Paul C Shields. Information theory and statistics: A tutorial. Now Publishers Inc,
2004.
[67] Rachel Cummings, Varun Gupta, Dhamma Kimpara, and Jamie Morgenstern. “On the
compatibility of privacy and fairness”. In: Adjunct Publication of the 27th Conference on User
Modeling, Adaptation and Personalization. 2019, pp. 309–315.
[68] John M Danskin. “The theory of max-min, with applications”. In: SIAM Journal on Applied
Mathematics 14.4 (1966), pp. 641–664.
[69] Rudrajit Das, Satyen Kale, Zheng Xu, Tong Zhang, and Sujay Sanghavi. “Beyond Uniform
Lipschitz Condition in Differentially Private Optimization”. In: arXiv preprint arXiv:2206.10713
(2022).
[70] Amit Datta, Michael Carl Tschantz, and Anupam Datta. “Automated experiments on ad privacy
settings”. In: Proceedings on privacy enhancing technologies 2015.1 (2015), pp. 92–112.
[71] Soham De, Leonard Berrada, Jamie Hayes, Samuel L Smith, and Borja Balle. “Unlocking
high-accuracy differentially private image classification through scale”. In: arXiv preprint
arXiv:2204.13650 (2022).
[72] A. Dembo and O. Zeitouni. Large deviations techniques and applications. Springer Science &
Business Media, 2009.
[73] Frank Den Hollander. “Probability theory: The coupling method”. In: (2012).
[74] Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. “Collecting telemetry data privately”. In:
Advances in Neural Information Processing Systems 30 (2017).
[75] Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. “Retiring Adult: New Datasets for
Fair Machine Learning”. In: Advances in Neural Information Processing Systems. Ed. by M. Ranzato,
A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan. Vol. 34. Curran Associates, Inc.,
2021, pp. 6478–6490.url:
https://proceedings.neurips.cc/paper/2021/file/32e54441e6382a7fbacbbbaf3c450059-Paper.pdf.
[76] Jiahao Ding, Guannan Liang, Jinbo Bi, and Miao Pan. “Differentially Private and Communication
Efficient Collaborative Learning”. In: Proceedings of the AAAI Conference on Artificial Intelligence.
Vol. 35. 8. 2021, pp. 7219–7227.
[77] Jiahao Ding, Xinyue Zhang, Xiaohuan Li, Junyi Wang, Rong Yu, and Miao Pan. “Differentially
private and fair classification via calibrated functional mechanism”. In: Proceedings of the AAAI
Conference on Artificial Intelligence. Vol. 34. 01. 2020, pp. 622–629.
[78] Roel Dobbe, Ye Pu, Jingge Zhu, Kannan Ramchandran, and Claire Tomlin. Customized Local
Differential Privacy for Multi-Agent Distributed Optimization. 2020. arXiv: 1806.06035 [math.OC].
[79] Michele Donini, Luca Oneto, Shai Ben-David, John S Shawe-Taylor, and Massimiliano Pontil.
“Empirical risk minimization under fairness constraints”. In: Advances in Neural Information
Processing Systems. 2018, pp. 2791–2801.
[80] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al.
“An image is worth 16x16 words: Transformers for image recognition at scale”. In: arXiv preprint
arXiv:2010.11929 (2020).
[81] Dheeru Dua and Casey Graff. UCI Machine Learning Repository. 2017.url:
http://archive.ics.uci.edu/ml.
[82] John Duchi. “Lecture notes for statistics 311/electrical engineering 377”. In: URL: https://stanford.edu/class/stats311/Lectures/full_notes.pdf (2021).
[83] John Duchi and Ryan Rogers. “Lower bounds for locally private estimation via communication
complexity”. In: Conference on Learning Theory. PMLR. 2019, pp. 1161–1191.
[84] John C Duchi, Michael I Jordan, and Martin J Wainwright. “Minimax optimal procedures for
locally private estimation”. In: Journal of the American Statistical Association 113.521 (2018),
pp. 182–201.
[85] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. “Local Privacy and Statistical
Minimax Rates”. In: 2013 IEEE 54th Annual Symposium on Foundations of Computer Science. 2013,
pp. 429–438.doi: 10.1109/FOCS.2013.53.
[86] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. “Fairness
through awareness”. In: Proceedings of the 3rd innovations in theoretical computer science
conference. 2012, pp. 214–226.
[87] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. “Calibrating noise to sensitivity
in private data analysis”. In: Theory of cryptography conference. Springer. 2006, pp. 265–284.
[88] Cynthia Dwork and Aaron Roth. The Algorithmic Foundations of Differential Privacy. Vol. 9. 3–4.
Now Publishers, Inc., 2014, pp. 211–407.
[89] Ulfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and
Abhradeep Thakurta. Amplification by Shuffling: From Local to Central Differential Privacy via
Anonymity. 2020. arXiv: 1811.12469 [cs.LG].
[90] Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Shuang Song,
Kunal Talwar, and Abhradeep Thakurta. “Encode, shuffle, analyze privacy revisited:
Formalizations and empirical evaluation”. In: arXiv preprint arXiv:2001.03618 (2020).
[91] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. “RAPPOR: Randomized Aggregatable
Privacy-Preserving Ordinal Response”. In: Proceedings of the 21st ACM Conference on Computer
and Communications Security. 2014.
[92] Alireza Fallah, Ali Makhdoumi, Azarakhsh Malekian, and Asuman Ozdaglar. “Optimal and
Differentially Private Data Acquisition: Central and Local Mechanisms”. In: Proceedings of the
23rd ACM Conference on Economics and Computation. EC ’22. Boulder, CO, USA: Association for
Computing Machinery, 2022, p. 1141.isbn: 9781450391504.doi: 10.1145/3490486.3538329.
[93] C Fang, CJ Li, Z Lin, and T Zhang. “Near-optimal non-convex optimization via stochastic path
integrated differential estimator”. In: Advances in Neural Information Processing Systems 31 (2018),
p. 689.
[94] Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and
Suresh Venkatasubramanian. “Certifying and removing disparate impact”. In: proceedings of the
21th ACM SIGKDD international conference on knowledge discovery and data mining. 2015,
pp. 259–268.
[95] Vitaly Feldman, Tomer Koren, and Kunal Talwar. “Private stochastic convex optimization:
optimal rates in linear time”. In: Proceedings of the 52nd Annual ACM SIGACT Symposium on
Theory of Computing. 2020, pp. 439–449.
[96] Vitaly Feldman, Audra McMillan, and Kunal Talwar. Hiding Among the Clones: A Simple and
Nearly Optimal Analysis of Privacy Amplification by Shuffling. 2020. arXiv: 2012.12803 [cs.LG].
[97] Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. “Privacy amplification by
iteration”. In: 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS). IEEE.
2018, pp. 521–532.
[98] Vitaly Feldman and Jan Vondrak. “High probability generalization bounds for uniformly stable
algorithms with nearly optimal rate”. In: Proceedings of the Thirty-Second Conference on Learning
Theory. Ed. by Alina Beygelzimer and Daniel Hsu. Vol. 99. Proceedings of Machine Learning
Research. Phoenix, USA: PMLR, 25–28 Jun 2019, pp. 1270–1279.url:
http://proceedings.mlr.press/v99/feldman19a.html.
[99] Cecilia Ferrando, Jennifer Gillenwater, and Alex Kulesza. “Combining public and private data”.
In: arXiv preprint arXiv:2111.00115 (2021).
[100] Benjamin Fish, Jeremy Kun, and Ádám D Lelkes. “A confidence-based approach for balancing
fairness and accuracy”. In: Proceedings of the 2016 SIAM International Conference on Data Mining.
SIAM. 2016, pp. 144–152.
[101] Benjamin Fish, Jeremy Kun, and Adám D Lelkes. “Fair boosting: a case study”. In: Workshop on
Fairness, Accountability, and Transparency in Machine Learning. Citeseer. 2015.
[102] Dylan J Foster, Ayush Sekhari, Ohad Shamir, Nathan Srebro, Karthik Sridharan, and
Blake Woodworth. “The complexity of making the gradient small in stochastic convex
optimization”. In: Conference on Learning Theory. PMLR. 2019, pp. 1319–1345.
[103] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. “Model inversion attacks that exploit
confidence information and basic countermeasures”. In: Proceedings of the 22nd ACM SIGSAC
Conference on Computer and Communications Security. 2015, pp. 1322–1333.
[104] Shripad Gade and Nitin H Vaidya. “Privacy-preserving distributed learning via obfuscated
stochastic gradients”. In: 2018 IEEE Conference on Decision and Control (CDC). IEEE. 2018,
pp. 184–191.
[105] Arun Ganesh, Abhradeep Thakurta, and Jalaj Upadhyay. “Langevin diffusion: An almost universal
algorithm for private euclidean (convex) optimization”. In: arXiv preprint arXiv:2204.01585 (2022).
[106] Hans Gebelein. “Das statistische Problem der Korrelation als Variations-und Eigenwertproblem
und sein Zusammenhang mit der Ausgleichsrechnung”. In: ZAMM-Journal of Applied
Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik 21.6 (1941),
pp. 364–379.
[107] Robin C. Geyer, Tassilo Klein, and Moin Nabi. “Differentially Private Federated Learning: A
Client Level Perspective”. In: CoRR abs/1712.07557 (2017). arXiv: 1712.07557.url:
http://arxiv.org/abs/1712.07557.
[108] Saeed Ghadimi and Guanghui Lan. “Optimal Stochastic Approximation Algorithms for Strongly
Convex Stochastic Composite Optimization I: A Generic Algorithmic Framework”. In: SIAM
Journal on Optimization 22.4 (2012), pp. 1469–1492.doi: 10.1137/110848864. eprint:
https://doi.org/10.1137/110848864.
[109] Saeed Ghadimi and Guanghui Lan. “Optimal Stochastic Approximation Algorithms for Strongly
Convex Stochastic Composite Optimization, II: Shrinking Procedures and Optimal Algorithms”.
In: SIAM Journal on Optimization 23.4 (2013), pp. 2061–2089. doi: 10.1137/110848876. eprint:
https://doi.org/10.1137/110848876.
[110] Badih Ghazi, Ravi Kumar, and Pasin Manurangsi. “User-level private learning via correlated
sampling”. In: arXiv preprint arXiv:2110.11208 (2021).
[111] Antonious Girgis, Deepesh Data, Suhas Diggavi, Peter Kairouz, and Ananda Theertha Suresh.
“Shuffled Model of Differential Privacy in Federated Learning”. In: Proceedings of The 24th
International Conference on Artificial Intelligence and Statistics. Ed. by Arindam Banerjee and
Kenji Fukumizu. Vol. 130. Proceedings of Machine Learning Research. PMLR, 13–15 Apr 2021,
pp. 2521–2529. url: https://proceedings.mlr.press/v130/girgis21a.html.
[112] Aditya Golatkar, Alessandro Achille, Yu-Xiang Wang, Aaron Roth, Michael Kearns, and
Stefano Soatto. “Mixed differential privacy in computer vision”. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 2022, pp. 8376–8386.
[113] Vincent Grari, Oualid El Hajouji, Sylvain Lamprier, and Marcin Detyniecki. “Learning Unbiased
Representations via Rényi Minimization”. In: arXiv preprint arXiv:2009.03183 (2020).
[114] Vincent Grari, Boris Ruf, Sylvain Lamprier, and Marcin Detyniecki. “Fairness-Aware Neural
Rényi Minimization for Continuous Features”. In: arXiv preprint arXiv:1911.04929 (2019).
[115] Moritz Hardt, Eric Price, and Nati Srebro. “Equality of opportunity in supervised learning”. In:
Advances in neural information processing systems. 2016, pp. 3315–3323.
[116] Moritz Hardt, Ben Recht, and Yoram Singer. “Train faster, generalize better: Stability of stochastic
gradient descent”. In: Proceedings of The 33rd International Conference on Machine Learning. Ed. by
Maria Florina Balcan and Kilian Q. Weinberger. Vol. 48. Proceedings of Machine Learning
Research. New York, New York, USA: PMLR, 20–22 Jun 2016, pp. 1225–1234. url:
http://proceedings.mlr.press/v48/hardt16.html.
[117] Moritz Hardt and Kunal Talwar. “On the geometry of differential privacy”. In: Proceedings of the
forty-second ACM symposium on Theory of computing. 2010, pp. 705–714.
[118] Elad Hazan and Satyen Kale. “Beyond the regret minimization barrier: optimal algorithms for
stochastic strongly-convex optimization”. In: The Journal of Machine Learning Research 15.1
(2014), pp. 2489–2512.
[119] Yiyun He, Roman Vershynin, and Yizhe Zhu. “Algorithmically Effective Differentially Private
Synthetic Data”. In: arXiv preprint arXiv:2302.05552 (2023).
[120] Zecheng He, Tianwei Zhang, and Ruby B Lee. “Model inversion attacks against collaborative
inference”. In: Proceedings of the 35th Annual Computer Security Applications Conference. 2019,
pp. 148–162.
[121] Hermann O Hirschfeld. “A connection between correlation and contingency”. In: Proceedings of
the Cambridge Philosophical Society. Vol. 31. 1935, pp. 520–524.
[122] Matthew J Holland. “Robust descent using smoothed multiplicative noise”. In: The 22nd
International Conference on Artificial Intelligence and Statistics. PMLR. 2019, pp. 703–711.
[123] Max Hort, Zhenpeng Chen, Jie M Zhang, Federica Sarro, and Mark Harman. “Bias Mitigation for
Machine Learning Classifiers: A Comprehensive Survey”. In: arXiv preprint arXiv:2207.07068
(2022).
[124] Lijie Hu, Shuo Ni, Hanshen Xiao, and Di Wang. “High Dimensional Differentially Private
Stochastic Optimization with Heavy-tailed Data”. In: arXiv preprint arXiv:2107.11136 (2021).
[125] Rui Hu, Yuanxiong Guo, and Yanmin Gong. “Concentrated Differentially Private Federated
Learning With Performance Analysis”. In: IEEE Open Journal of the Computer Society 2 (2021),
pp. 276–289. doi: 10.1109/OJCS.2021.3099108.
[126] Zonghao Huang and Yanmin Gong. “Differentially Private ADMM for Convex Distributed
Learning: Improved Accuracy via Multi-Step Approximation”. In: arXiv preprint arXiv:2005.07890 (2020).
[127] Zonghao Huang, Rui Hu, Yuanxiong Guo, Eric Chan-Tin, and Yanmin Gong. “DP-ADMM:
ADMM-Based Distributed Learning With Differential Privacy”. In: IEEE Transactions on
Information Forensics and Security 15 (2020), pp. 1002–1012.
[128] Sashank J Reddi, Suvrit Sra, Barnabas Poczos, and Alexander J Smola. “Proximal stochastic
methods for nonsmooth nonconvex finite-sum optimization”. In: Advances in neural information
processing systems 29 (2016), pp. 1145–1153.
[129] Matthew Jagielski, Michael Kearns, Jieming Mao, Alina Oprea, Aaron Roth,
Saeed Sharifi-Malvajerdi, and Jonathan Ullman. “Differentially private fair learning”. In:
International Conference on Machine Learning. PMLR. 2019, pp. 3000–3008.
[130] Bargav Jayaraman and Lingxiao Wang. “Distributed learning without distress: Privacy-preserving
empirical risk minimization”. In: Advances in Neural Information Processing Systems (2018).
[131] Ray Jiang, Aldo Pacchiano, Tom Stepleton, Heinrich Jiang, and Silvia Chiappa. “Wasserstein fair
classification”. In: Uncertainty in Artificial Intelligence. PMLR. 2020, pp. 862–872.
[132] Zach Jorgensen, Ting Yu, and Graham Cormode. “Conservative or liberal? Personalized
differential privacy”. In: 2015 IEEE 31st International Conference on Data Engineering. IEEE. 2015,
pp. 1023–1034.
[133] Matthew Joseph, Jieming Mao, Seth Neel, and Aaron Roth. “The role of interactivity in local
differential privacy”. In: 2019 IEEE 60th Annual Symposium on Foundations of Computer Science
(FOCS). IEEE. 2019, pp. 94–105.
[134] Peter Kairouz, Monica Ribero Diaz, Keith Rush, and Abhradeep Thakurta. “(Nearly) Dimension
Independent Private ERM with AdaGrad Rates via Publicly Estimated Subspaces”. In: Conference
on Learning Theory. PMLR. 2021, pp. 2717–2746.
[135] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis,
Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings,
Rafael G. L. D’Oliveira, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett,
Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He,
Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi,
Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo,
Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür,
Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song,
Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr,
Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and
Sen Zhao. “Advances and Open Problems in Federated Learning”. In: arXiv preprint arXiv:1912.04977
(2019).
[136] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The Composition Theorem for Differential
Privacy. 2015. arXiv: 1311.0776 [cs.DS].
[137] Gautam Kamath, Xingtu Liu, and Huanyu Zhang. “Improved rates for differentially private
stochastic convex optimization with heavy-tailed data”. In: International Conference on Machine
Learning. Ed. by Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and
Sivan Sabato. Vol. 162. Proceedings of Machine Learning Research. PMLR, 17–23 Jul 2022,
pp. 10633–10660. url: https://proceedings.mlr.press/v162/kamath22a.html.
[138] Gautam Kamath, Vikrant Singhal, and Jonathan Ullman. “Private Mean Estimation of
Heavy-Tailed Distributions”. In: Proceedings of Thirty Third Conference on Learning Theory. Ed. by
Jacob Abernethy and Shivani Agarwal. Vol. 125. Proceedings of Machine Learning Research.
PMLR, Sept. 2020, pp. 2204–2235. url: https://proceedings.mlr.press/v125/kamath20a.html.
[139] Faisal Kamiran, Toon Calders, and Mykola Pechenizkiy. “Discrimination aware decision tree
learning”. In: 2010 IEEE International Conference on Data Mining. IEEE. 2010, pp. 869–874.
[140] Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. “Fairness-aware learning through
regularization approach”. In: 2011 IEEE 11th International Conference on Data Mining Workshops.
IEEE. 2011, pp. 643–650.
[141] Yilin Kang, Yong Liu, Ben Niu, and Weiping Wang. “Weighted distributed differential privacy
ERM: Convex and non-convex”. In: Computers & Security 106 (2021), p. 102275.
[142] Hamed Karimi, Julie Nutini, and Mark Schmidt. “Linear convergence of gradient and
proximal-gradient methods under the Polyak-Łojasiewicz condition”. In: Joint European Conference
on Machine Learning and Knowledge Discovery in Databases. Springer. 2016, pp. 795–811.
[143] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and
Ananda Theertha Suresh. “SCAFFOLD: Stochastic Controlled Averaging for Federated Learning”.
In: Proceedings of the 37th International Conference on Machine Learning. Ed. by Hal Daumé III and
Aarti Singh. Vol. 119. Proceedings of Machine Learning Research. PMLR, 13–18 Jul 2020,
pp. 5132–5143.
[144] Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and
Adam Smith. “What can we learn privately?” In: SIAM Journal on Computing 40.3 (2011),
pp. 793–826.
[145] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. “Preventing fairness
gerrymandering: Auditing and learning for subgroup fairness”. In: International Conference on
Machine Learning. 2018, pp. 2564–2572.
[146] Michael Kearns and Dana Ron. “Algorithmic stability and sanity-check bounds for leave-one-out
cross-validation”. In: Neural computation 11.6 (1999), pp. 1427–1453.
[147] Gavin Kerrigan, Dylan Slack, and Jens Tuyls. “Differentially private language models benefit from
public pre-training”. In: arXiv preprint arXiv:2009.05886 (2020).
[148] Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. “Better communication complexity
for local SGD”. In: arXiv preprint (2019).
[149] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. “Private convex empirical risk minimization
and high-dimensional regression”. In: Conference on Learning Theory. JMLR Workshop and
Conference Proceedings. 2012, pp. 25–1.
[150] Niki Kilbertus, Adrià Gascón, Matt Kusner, Michael Veale, Krishna Gummadi, and Adrian Weller.
“Blind justice: Fairness with encrypted sensitive attributes”. In: International Conference on
Machine Learning. PMLR. 2018, pp. 2630–2639.
[151] Niki Kilbertus, Manuel Gomez Rodriguez, Bernhard Schölkopf, Krikamol Muandet, and
Isabel Valera. “Fair Decisions Despite Imperfect Predictions”. In: Proceedings of the Twenty Third
International Conference on Artificial Intelligence and Statistics. Ed. by Silvia Chiappa and
Roberto Calandra. Vol. 108. Proceedings of Machine Learning Research. PMLR, 26–28 Aug 2020,
pp. 277–287.
[152] Bryan Klimt and Yiming Yang. “The enron corpus: A new dataset for email classification
research”. In: Machine Learning: ECML 2004: 15th European Conference on Machine Learning, Pisa,
Italy, September 20-24, 2004. Proceedings 15. Springer. 2004, pp. 217–226.
[153] Andrei Nikolaevich Kolmogorov and Vladimir Mikhailovich Tikhomirov. “ε-entropy and
ε-capacity of sets in function spaces”. In: Uspekhi Matematicheskikh Nauk 14.2 (1959), pp. 3–86.
[154] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. “A Unified
Theory of Decentralized SGD with Changing Topology and Local Updates”. In: Proceedings of the
37th International Conference on Machine Learning. Ed. by Hal Daumé III and Aarti Singh.
Vol. 119. Proceedings of Machine Learning Research. PMLR, 13–18 Jul 2020, pp. 5381–5393.
[155] W. Kong and R. D. C. Monteiro. “An accelerated inexact proximal point method for solving
nonconvex-concave min-max problems”. In: arXiv:1905.13433 (2019).
[156] Alex Krizhevsky, Geoffrey Hinton, et al. “Learning multiple layers of features from tiny images”.
In: (2009).
[157] Janardhan Kulkarni, Yin Tat Lee, and Daogao Liu. “Private non-smooth empirical risk
minimization and stochastic convex optimization in subquadratic steps”. In: arXiv preprint
arXiv:2103.15352 (2021).
[158] Preethi Lahoti, Alex Beutel, Jilin Chen, Kang Lee, Flavien Prost, Nithum Thain, Xuezhi Wang, and
Ed Chi. “Fairness without demographics through adversarially reweighted learning”. In:Advances
in neural information processing systems 33 (2020), pp. 728–740.
[159] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. “Gradient-based learning applied to document
recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324. doi: 10.1109/5.726791.
[160] Yann LeCun and Corinna Cortes. “MNIST handwritten digit database”. In: (2010). url:
http://yann.lecun.com/exdb/mnist/.
[161] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. “Non-convex finite-sum optimization via
SCSG methods”. In: Proceedings of the 31st International Conference on Neural Information
Processing Systems. 2017, pp. 2345–2355.
[162] Yunwen Lei and Yiming Ying. “Fine-grained analysis of stability and generalization for stochastic
gradient descent”. In: International Conference on Machine Learning. PMLR. 2020, pp. 5809–5819.
[163] Yunwen Lei and Yiming Ying. “Sharper generalization bounds for learning with
gradient-dominated objective functions”. In: International Conference on Learning Representations.
2021.
[164] Daniel Levy, Ziteng Sun, Kareem Amin, Satyen Kale, Alex Kulesza, Mehryar Mohri, and
Ananda Theertha Suresh. “Learning with User-Level Privacy”. In: arXiv preprint arXiv:2102.11845
(2021).
[165] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed,
Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. “Bart: Denoising sequence-to-sequence
pre-training for natural language generation, translation, and comprehension”. In: arXiv preprint
arXiv:1910.13461 (2019).
[166] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith.
“Federated optimization in heterogeneous networks”. In: Proceedings of Machine Learning and
Systems 2 (2020), pp. 429–450.
[167] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. “On the Convergence
of FedAvg on Non-IID Data”. In: 8th International Conference on Learning Representations, ICLR
2020, Addis Ababa, Ethiopia, April 26-30, 2020. 2020. url:
https://openreview.net/forum?id=HJxNAnVtDS.
[168] Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. “Large language models can
be strong differentially private learners”. In: arXiv preprint arXiv:2110.05679 (2021).
[169] Yi Li and Nuno Vasconcelos. “REPAIR: Removing representation bias by dataset resampling”. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019,
pp. 9572–9581.
[170] Tianyi Lin, Chi Jin, and Michael I. Jordan. “On Gradient Descent Ascent for Nonconvex-Concave
Minimax Problems”. In: International Conference on Machine Learning. PMLR. 2020, pp. 6083–6093.
[171] Chaoyue Liu, Libin Zhu, and Mikhail Belkin. “Loss landscapes and optimization in
over-parameterized non-linear systems and neural networks”. In: Applied and Computational
Harmonic Analysis 59 (2022), pp. 85–116.
[172] Jingcheng Liu and Kunal Talwar. “Private selection from private candidates”. In: Proceedings of
the 51st Annual ACM SIGACT Symposium on Theory of Computing. 2019, pp. 298–309.
[173] Terrance Liu, Giuseppe Vietri, Thomas Steinke, Jonathan Ullman, and Steven Wu. “Leveraging
public data for practical private query release”. In: International Conference on Machine Learning.
PMLR. 2021, pp. 6968–6977.
[174] Ziyu Liu, Shengyuan Hu, Zhiwei Steven Wu, and Virginia Smith. “On Privacy and
Personalization in Cross-Silo Federated Learning”. In: arXiv preprint arXiv:2206.07902 (2022).
[175] Ilan Lobel and Asuman Ozdaglar. “Distributed subgradient methods for convex optimization over
random networks”. In: IEEE Transactions on Automatic Control 56.6 (2010), pp. 1291–1306.
[176] Andrew Lowy, Sina Baharlouei, Rakesh Pavan, Meisam Razaviyayn, and Ahmad Beirami. “A
Stochastic Optimization Framework for Fair Risk Minimization”. In: Transactions on Machine
Learning Research (2022).
[177] Andrew Lowy, Ali Ghafelebashi, and Meisam Razaviyayn. “Private Non-Convex Federated
Learning Without a Trusted Server”. In: Proceedings of the 26th International Conference on
Artificial Intelligence and Statistics (AISTATS). PMLR. 2023, pp. 5749–5786.
[178] Andrew Lowy, Devansh Gupta, and Meisam Razaviyayn. “Stochastic Differentially Private and
Fair Learning”. In: The Eleventh International Conference on Learning Representations. 2023. url:
https://openreview.net/forum?id=3nM5uhPlfv6.
[179] Andrew Lowy, Zeman Li, Tianjian Huang, and Meisam Razaviyayn. “Optimal Differentially
Private Learning with Public Data”. In: arXiv preprint arXiv:2306.15056 (2023). arXiv: 2306.15056
[cs.LG].
[180] Andrew Lowy and Meisam Razaviyayn. “Output Perturbation for Differentially Private Convex
Optimization with Improved Population Loss Bounds, Runtimes and Applications to Private
Adversarial Training”. In: arXiv preprint arXiv:2102.04704 (2021). arXiv: 2102.04704 [cs.LG].
[181] Andrew Lowy and Meisam Razaviyayn. “Private Federated Learning Without a Trusted Server:
Optimal Algorithms for Convex Losses”. In: The Eleventh International Conference on Learning
Representations. 2023. url: https://openreview.net/forum?id=TVY6GoURrw.
[182] Andrew Lowy and Meisam Razaviyayn. “Private Stochastic Optimization with Large Worst-Case
Lipschitz Parameter: Optimal Rates for (Non-Smooth) Convex Losses and Extension to
Non-Convex Losses”. In: International Conference on Algorithmic Learning Theory. PMLR. 2023,
pp. 986–1054.
[183] Luo Luo, Haishan Ye, and Tony Zhang. “Stochastic Recursive Gradient Descent Ascent for
Stochastic Nonconvex-Strongly-Concave Minimax Problems”. In: Advances in Neural Information
Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS
2020, December 6-12, 2020, virtual. 2020.
[184] Xu Ma, Fangguo Zhang, Xiaofeng Chen, and Jian Shen. “Privacy preserving multi-party
computation delegation for deep learning in cloud computing”. In: Information Sciences 459
(2018), pp. 103–116.
[185] Natalia Markovich. Nonparametric analysis of univariate heavy-tailed data: research and practice.
John Wiley & Sons, 2008.
[186] Jérémie Mary, Clément Calauzenes, and Noureddine El Karoui. “Fairness-aware learning for
continuous attributes and treatments”. In: International Conference on Machine Learning. PMLR.
2019, pp. 4382–4391.
[187] Cain Mckay. Probability and Statistics. Scientific e-Resources, 2019.
[188] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas.
“Communication-efficient learning of deep networks from decentralized data”. In: Artificial
Intelligence and Statistics. PMLR. 2017, pp. 1273–1282.
[189] Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. “Learning Differentially Private
Recurrent Language Models”. In: International Conference on Learning Representations (ICLR).
2018. url: https://openreview.net/pdf?id=BJ0hF1Z0b.
[190] Frank McSherry and Kunal Talwar. “Mechanism design via differential privacy”. In: 48th Annual
IEEE Symposium on Foundations of Computer Science (FOCS’07). IEEE. 2007, pp. 94–103.
[191] Frank D McSherry. “Privacy integrated queries: an extensible platform for privacy-preserving
data analysis”. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management
of data. 2009, pp. 19–30.
[192] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. “A
survey on bias and fairness in machine learning”. In: ACM Computing Surveys (CSUR) 54.6 (2021),
pp. 1–35.
[193] Harsh Mehta, Abhradeep Thakurta, Alexey Kurakin, and Ashok Cutkosky. “Large scale transfer
learning for differentially private image classification”. In: arXiv preprint arXiv:2205.02973 (2022).
[194] Stanislav Minsker. “U-statistics of growing order and sub-Gaussian mean estimators with sharp
constants”. In: arXiv preprint arXiv:2202.11842 (2022).
[195] Hussein Mozannar, Mesrob Ohannessian, and Nathan Srebro. “Fair learning with private
demographic data”. In: International Conference on Machine Learning. PMLR. 2020, pp. 7066–7075.
[196] Christopher Mühl and Franziska Boenisch. “Personalized PATE: Differential privacy for machine
learning with individual privacy guarantees”. In: arXiv preprint arXiv:2202.10517 (2022).
[197] Jack Murtagh and Salil Vadhan. “The complexity of computing the optimal composition of
differential privacy”. In: Theory of Cryptography Conference. Springer. 2016, pp. 157–175.
[198] Katta G Murty and Santosh N Kabadi. “Some NP-complete problems in quadratic and nonlinear
programming”. In: (1985).
[199] Angelia Nedic, Alex Olshevsky, and Wei Shi. “Achieving geometric convergence for distributed
optimization over time-varying graphs”. In: SIAM Journal on Optimization 27.4 (2017),
pp. 2597–2633.
[200] Arkadii Semenovich Nemirovskii and David Borisovich Yudin. Problem Complexity and Method
Efficiency in Optimization. Chichester, 1983.
[201] Yurii Nesterov. “Smooth minimization of non-smooth functions”. In: Mathematical programming
103.1 (2005), pp. 127–152.
[202] Alexandru Niculescu-Mizil and Rich Caruana. “Predicting good probabilities with supervised
learning”. In: Proceedings of the 22nd international conference on Machine learning. 2005,
pp. 625–632.
[203] Maxence Noble, Aurélien Bellet, and Aymeric Dieuleveut. “Differentially private federated
learning on heterogeneous data”. In: International Conference on Artificial Intelligence and
Statistics. PMLR. 2022, pp. 10110–10145.
[204] M. Nouiehed, M. Sanjabi, J. D. Lee, and M. Razaviyayn. “Solving a class of non-convex min-max
games using iterative first order methods”. In: arXiv preprint arXiv:1902.08297 (2019).
[205] Dmitrii M Ostrovskii, Andrew Lowy, and Meisam Razaviyayn. “Efficient search of first-order
nash equilibria in nonconvex-concave smooth min-max problems”. In: arXiv preprint
arXiv:2002.07919 (2020).
[206] Fabio Mendoza Palechor and Alexis de la Hoz Manotas. “Dataset for estimation of obesity levels
based on eating habits and physical condition in individuals from Colombia, Peru and Mexico”.
In: Data in brief 25 (2019), p. 104344.
[207] Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar.
“Semi-supervised knowledge transfer for deep learning from private training data”. In: arXiv preprint arXiv:1610.05755 (2016).
[208] Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and
Úlfar Erlingsson. “Scalable private learning with pate”. In: arXiv preprint arXiv:1802.08908 (2018).
[209] Nicolas Papernot and Thomas Steinke. Hyperparameter Tuning with Renyi Differential Privacy.
2021. arXiv: 2110.03620 [cs.LG]. url: https://openreview.net/forum?id=-70L8lpp9DF.
[210] Adrián Pérez-Suay, Valero Laparra, Gonzalo Mateo-García, Jordi Muñoz-Marí,
Luis Gómez-Chova, and Gustau Camps-Valls. “Fair kernel learning”. In: Joint European Conference
on Machine Learning and Knowledge Discovery in Databases. Springer. 2017, pp. 339–355.
[211] John Platt et al. “Probabilistic outputs for support vector machines and comparisons to
regularized likelihood methods”. In: Advances in large margin classifiers 10.3 (1999), pp. 61–74.
[212] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. “On fairness
and calibration”. In: Advances in Neural Information Processing Systems. 2017, pp. 5680–5689.
[213] Boris T Polyak. “Gradient methods for the minimisation of functionals”. In: USSR Computational
Mathematics and Mathematical Physics 3.4 (1963), pp. 864–878.
[214] Flavien Prost, Hai Qian, Qiuwen Chen, Ed H Chi, Jilin Chen, and Alex Beutel. “Toward a better
trade-off between performance and fairness with kernel-based distribution matching”. In: arXiv
preprint arXiv:1910.11779 (2019).
[215] PyTorch team. Training a Classifier. url:
https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#training-a-classifier.
[216] Novi Quadrianto and Viktoriia Sharmanska. “Recycling privileged learning and distribution
matching for fairness”. In: Advances in Neural Information Processing Systems 30 (2017).
[217] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.
“Language models are unsupervised multitask learners”. In: OpenAI blog 1.8 (2019), p. 9.
[218] Edward Raff, Jared Sylvester, and Steven Mills. “Fair forests: Regularized tree induction to
minimize model bias”. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society.
2018, pp. 243–250.
[219] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. “Making gradient descent optimal for
strongly convex stochastic optimization”. In: Proceedings of the 29th International Coference on
International Conference on Machine Learning. 2012, pp. 1571–1578.
[220] Alfréd Rényi. “On measures of dependence”. In: Acta Mathematica Academiae Scientiarum
Hungarica 10.3-4 (1959), pp. 441–451.
[221] Alfréd Rényi. “On measures of entropy and information”. In: Proceedings of the Fourth Berkeley
Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of
Statistics. Vol. 4. University of California Press. 1961, pp. 547–562.
[222] Ashkan Rezaei, Rizal Fathony, Omid Memarrast, and Brian D Ziebart. “Fairness for Robust Log
Loss Classification.” In: AAAI. 2020, pp. 5511–5518.
[223] Goce Ristanoski, Wei Liu, and James Bailey. “Discrimination aware classification for imbalanced
datasets”. In: Proceedings of the 22nd ACM international conference on Information & Knowledge
Management. 2013, pp. 1529–1532.
[224] Yuji Roh, Kangwook Lee, Steven Whang, and Changho Suh. “Fr-train: A mutual
information-based approach to fair and robust training”. In: International Conference on Machine
Learning. PMLR. 2020, pp. 8147–8157.
[225] Igal Sason and Sergio Verdú. “f-divergence Inequalities”. In: IEEE Transactions on Information
Theory 62.11 (2016), pp. 5973–6006.
[226] Mohamed Seif, Ravi Tandon, and Ming Li. “Wireless Federated Learning with Local Differential
Privacy”. In: 2020 IEEE International Symposium on Information Theory (ISIT). 2020, pp. 2604–2609.
doi: 10.1109/ISIT44484.2020.9174426.
[227] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. “Stochastic Convex
Optimization.” In: COLT. Vol. 2. 4. 2009, p. 5.
[228] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. “Membership inference
attacks against machine learning models”. In: 2017 IEEE symposium on security and privacy (SP).
IEEE. 2017, pp. 3–18.
[229] Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale
Image Recognition”. In: International Conference on Learning Representations. 2015.
[230] Adam Smith, Abhradeep Thakurta, and Jalaj Upadhyay. “Is Interaction Necessary for Distributed
Private Learning?” In: 2017 IEEE Symposium on Security and Privacy (SP). 2017, pp. 58–77. doi:
10.1109/SP.2017.35.
[231] Mengkai Song, Zhibo Wang, Zhifei Zhang, Yang Song, Qian Wang, Ju Ren, and Hairong Qi.
“Analyzing User-Level Privacy Attack Against Federated Learning”. In: IEEE Journal on Selected
Areas in Communications 38.10 (2020), pp. 2430–2444. doi: 10.1109/JSAC.2020.3000372.
[232] Shuang Song, Thomas Steinke, Om Thakkar, and Abhradeep Thakurta. “Evading the curse of
dimensionality in unconstrained private glms”. In: International Conference on Artificial
Intelligence and Statistics. PMLR. 2021, pp. 2638–2646.
[233] Daniel Steinberg, Alistair Reid, Simon O’Callaghan, Finnian Lattimore, Lachlan McCalman, and
Tiberio Caetano. “Fast Fair Regression via Efficient Approximations of Mutual Information”. In:
arXiv preprint arXiv:2002.06200 (2020).
[234] Sebastian U. Stich. “Unified Optimal Analysis of the (Stochastic) Gradient Method”. In: arXiv
preprint arXiv:1907.04232 (2019). arXiv: 1907.04232 [cs.LG].
[235] Latanya Sweeney. “Discrimination in online ad delivery”. In: arXiv preprint arXiv:1301.6822 (2013).
[236] Youming Tao, Yulian Wu, Peng Zhao, and Di Wang. “Optimal rates of (locally) differentially
private heavy-tailed multi-armed bandits”. In: arXiv preprint arXiv:2106.02575 (2021).
[237] Bahar Taskesen, Viet Anh Nguyen, Daniel Kuhn, and Jose Blanchet. “A Distributionally Robust
Approach to Fair Classification”. In: arXiv preprint arXiv:2007.09530 (2020).
[238] Abhradeep Guha Thakurta, Andrew H Vyrros, Umesh S Vaishampayan, Gaurav Kapoor,
Julien Freudiger, Vivek Rangarajan Sridhar, and Doug Davidson. Learning new words. US Patent
9,594,741. Mar. 2017.
[239] Reihaneh Torkzadehmahani, Peter Kairouz, and Benedict Paten. “Dp-cgan: Differentially private
synthetic data and label generation”. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition Workshops. 2019, pp. 0–0.
[240] Behrouz Touri and Bahman Gharesifard. “Continuous-time distributed convex optimization on
time-varying directed networks”. In: 2015 54th IEEE Conference on Decision and Control (CDC).
2015, pp. 724–729. doi: 10.1109/CDC.2015.7402315.
[241] Cuong Tran, My Dinh, and Ferdinando Fioretto. “Differentially private empirical risk
minimization under the fairness lens”. In: Advances in Neural Information Processing Systems 34
(2021), pp. 27555–27565.
[242] Cuong Tran, Ferdinando Fioretto, and Pascal Van Hentenryck. “Differentially private and fair
deep learning: A lagrangian dual approach”. In: Proceedings of the AAAI Conference on Artificial
Intelligence. Vol. 35. 11. 2021, pp. 9932–9939.
[243] Cuong Tran, Ferdinando Fioretto, Pascal Van Hentenryck, and Zhiyan Yao. “Decision Making
with Differential Privacy under a Fairness Lens.” In: IJCAI. 2021, pp. 560–566.
[244] Cuong Tran, Keyu Zhu, Ferdinando Fioretto, and Pascal Van Hentenryck. “SF-PATE: Scalable,
Fair, and Private Aggregation of Teacher Ensembles”. In: arXiv preprint arXiv:2204.05157 (2022).
[245] Stacey Truex, Nathalie Baracaldo, Ali Anwar, Thomas Steinke, Heiko Ludwig, Rui Zhang, and
Yi Zhou. “A hybrid approach to privacy-preserving federated learning”. In: Proceedings of the 12th
ACM Workshop on Artificial Intelligence and Security. 2019, pp. 1–11.
[246] Stacey Truex, Ling Liu, Ka-Ho Chow, Mehmet Emre Gursoy, and Wenqi Wei. “LDP-Fed:
Federated Learning with Local Differential Privacy”. In: Proceedings of the Third ACM
International Workshop on Edge Systems, Analytics and Networking. Association for Computing
Machinery, 2020, pp. 61–66. isbn: 9781450371322.
[247] U.S. Census Bureau. “Differential Privacy and the 2020 Census”. In: (2020). url:
https://www2.census.gov/library/working-papers/2020/demo/2020-Census-Differential-Privacy-
Algorithm.pdf.
[248] Jonathan Ullman. CS7880: Rigorous Approaches to Data Privacy. 2017. url:
http://www.ccs.neu.edu/home/jullman/cs7880s17/HW1sol.pdf.
[249] Vladimir N Vapnik. “An overview of statistical learning theory”. In: IEEE transactions on neural
networks 10.5 (1999), pp. 988–999.
[250] Michael Veale and Reuben Binns. “Fairer machine learning in the real world: Mitigating
discrimination without collecting sensitive data”. In: Big Data & Society 4.2 (2017),
p. 2053951717743530.
[251] Roman Vershynin. High-dimensional probability: An introduction with applications in data science.
Vol. 47. Cambridge university press, 2018.
[252] Giuseppe Vietri, Grace Tian, Mark Bun, Thomas Steinke, and Steven Wu. “New oracle-efficient
algorithms for private synthetic data release”. In: International Conference on Machine Learning.
PMLR. 2020, pp. 9765–9774.
[253] Di Wang, Hanshen Xiao, Srinivas Devadas, and Jinhui Xu. “On differentially private stochastic
convex optimization with heavy-tailed data”. In: International Conference on Machine Learning.
PMLR. 2020, pp. 10081–10091.
[254] Di Wang and Jinhui Xu. “Differentially Private ℓ1-norm Linear Regression with Heavy-tailed
Data”. In: arXiv preprint arXiv:2201.03204 (2022).
[255] Di Wang, Minwei Ye, and Jinhui Xu. “Differentially Private Empirical Risk Minimization Revisited:
Faster and More General”. In: Advances in Neural Information Processing Systems 30 (2017).
[256] Jun Wang and Zhi-Hua Zhou. “Differentially private learning with small public data”. In:
Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 04. 2020, pp. 6219–6226.
[257] Lingxiao Wang, Bargav Jayaraman, David Evans, and Quanquan Gu. “Efficient Privacy-Preserving
Stochastic Nonconvex Optimization”. In: arXiv preprint arXiv:1910.13659 (2019).
[258] Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. “Spiderboost and momentum:
Faster variance reduction algorithms”. In: Advances in Neural Information Processing Systems 32
(2019).
[259] Kang Wei, Jun Li, Ming Ding, Chuan Ma, Hang Su, Bo Zhang, and H Vincent Poor. “User-Level
Privacy-Preserving Federated Learning: Analysis and Performance Optimization”. In: arXiv
preprint arXiv:2003.00229 (2020).
[260] Kang Wei, Jun Li, Ming Ding, Chuan Ma, Howard H Yang, Farhad Farokhi, Shi Jin, Tony QS Quek,
and H Vincent Poor. “Federated learning with differential privacy: Algorithms and performance
analysis”. In: IEEE Transactions on Information Forensics and Security 15 (2020), pp. 3454–3469.
[261] Robert Williamson and Aditya Menon. “Fairness risk measures”. In: International Conference on
Machine Learning. PMLR. 2019, pp. 6786–6797.
[262] Hans S Witsenhausen. “On sequences of pairs of dependent random variables”. In: SIAM Journal
on Applied Mathematics 28.1 (1975), pp. 100–113.
[263] Blake Woodworth, Suriya Gunasekar, Mesrob I Ohannessian, and Nathan Srebro. “Learning
non-discriminatory predictors”. In: arXiv preprint arXiv:1702.06081 (2017).
[264] Blake Woodworth, Kumar Kshitij Patel, Sebastian Stich, Zhen Dai, Brian Bullins,
Brendan Mcmahan, Ohad Shamir, and Nathan Srebro. “Is local SGD better than minibatch SGD?”
In: International Conference on Machine Learning. PMLR. 2020, pp. 10334–10343.
[265] Blake E Woodworth, Kumar Kshitij Patel, and Nati Srebro. “Minibatch vs Local SGD for
Heterogeneous Distributed Learning”. In: Advances in Neural Information Processing Systems.
Ed. by H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin. Vol. 33. Curran Associates,
Inc., 2020, pp. 6281–6292. url:
https://proceedings.neurips.cc/paper/2020/file/45713f6ff2041d3fdfae927b82488db8-Paper.pdf.
[266] Robert F Woolson and William R Clarke. Statistical methods for the analysis of biomedical data.
John Wiley & Sons, 2011.
[267] Nan Wu, Farhad Farokhi, David Smith, and Mohamed Ali Kaafar. The Value of Collaboration in
Convex Machine Learning with Differential Privacy. 2019. arXiv: 1906.09679 [cs.CR].
[268] Depeng Xu, Shuhan Yuan, and Xintao Wu. “Achieving differential privacy and fairness in logistic
regression”. In: Companion proceedings of The 2019 world wide web conference. 2019, pp. 594–599.
[269] Zhenhuan Yang, Shu Hu, Yunwen Lei, Kush R Varshney, Siwei Lyu, and Yiming Ying.
“Differentially Private SGDA for Minimax Problems”. In: arXiv preprint arXiv:2201.09046 (2022).
[270] Bin Yu. “Assouad, fano, and le cam”. In: Festschrift for Lucien Le Cam: research papers in
probability and statistics (1997), pp. 423–435.
[271] Da Yu, Huishuai Zhang, Wei Chen, and Tie-Yan Liu. “Do not let privacy overbill utility: Gradient
embedding perturbation for private learning”. In: arXiv preprint arXiv:2102.12677 (2021).
[272] Honglin Yuan and Tengyu Ma. “Federated Accelerated Stochastic Gradient Descent”. In: Advances
in Neural Information Processing Systems. Ed. by H. Larochelle, M. Ranzato, R. Hadsell,
M. F. Balcan, and H. Lin. Vol. 33. Curran Associates, Inc., 2020, pp. 5332–5344. url:
https://proceedings.neurips.cc/paper/2020/file/39d0a8908fbe6c18039ea8227f827023-Paper.pdf.
[273] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rogriguez, and Krishna P Gummadi.
“Fairness constraints: Mechanisms for fair classification”. In: Artificial Intelligence and Statistics.
PMLR. 2017, pp. 962–970.
[274] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. “Learning fair
representations”. In: International Conference on Machine Learning. 2013, pp. 325–333.
[275] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. “Mitigating unwanted biases with
adversarial learning”. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society.
2018, pp. 335–340.
[276] Jiaqi Zhang, Kai Zheng, Wenlong Mou, and Liwei Wang. Efficient Private ERM for Smooth
Objectives. 2017. arXiv: 1703.09947 [cs.LG].
[277] Liang Zhang, Kiran Koshy Thekumparampil, Sewoong Oh, and Niao He. “Bring Your Own
Algorithm for Optimal Differentially Private Stochastic Minimax Optimization”. In: arXiv preprint
arXiv:2206.00363 (2022).
[278] Qiuchen Zhang, Jing Ma, Jian Lou, and Li Xiong. “Private Stochastic Non-convex Optimization
with Improved Utility Rates”. In: Proceedings of the Thirtieth International Joint Conference on
Artificial Intelligence (IJCAI-21). 2021.
[279] Siqi Zhang, Junchi Yang, Cristóbal Guzmán, Negar Kiyavash, and Niao He. “The complexity of
nonconvex-strongly-concave minimax optimization”. In: arXiv preprint arXiv:2103.15888 (2021).
[280] Xinwei Zhang, Mingyi Hong, Sairaj Dhople, Wotao Yin, and Yang Liu. “Fedpd: A federated
learning framework with optimal rates and adaptivity to non-iid data”. In: arXiv preprint
arXiv:2005.11418 (2020).
[281] Zhifei Zhang, Yang Song, and Hairong Qi. “Age progression/regression by conditional adversarial
autoencoder”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2017, pp. 5810–5818.
[282] Tianxiang Zhao, Enyan Dai, Kai Shu, and Suhang Wang. “Towards Fair Classifiers Without
Sensitive Attributes: Exploring Biases in Related Features”. In: Proceedings of the Fifteenth ACM
International Conference on Web Search and Data Mining. 2022, pp. 1433–1442.
[283] Yang Zhao, Jun Zhao, Mengmeng Yang, Teng Wang, Ning Wang, Lingjuan Lyu, Dusit Niyato, and
Kwok-Yan Lam. “Local differential privacy based federated learning for Internet of Things”. In:
IEEE Internet of Things Journal (2020).
[284] Yaqin Zhou and Shaojie Tang. “Differentially Private Distributed Learning”. In: INFORMS Journal
on Computing 32.3 (2020), pp. 779–789.
[285] Yingxue Zhou, Zhiwei Steven Wu, and Arindam Banerjee. “Bypassing the ambient dimension:
Private sgd with gradient subspace identification”. In: arXiv preprint arXiv:2007.03813 (2020).
[286] Ligeng Zhu and Song Han. “Deep leakage from gradients”. In: Federated learning. Springer, 2020,
pp. 17–31.
Appendix A
Appendix to Chapter 2
A.1 Techniques and Related Work
A.1.1 Overview of Our Techniques and Challenges
We use several techniques to prove our lower bounds. At a high level, our population mean estimation lower bounds rely on classical reductions from estimation to testing and powerful information-theoretic inequalities such as Le Cam's, Fano's, and Assouad's methods [82], along with novel semi-DP strengthenings of these inequalities. Several of our arguments draw inspiration from the lower bound proofs in [117, 32, 26, 83, 92].

The key tool that we develop to prove our central DP population (mean estimation and SCO) lower bounds is a novel "Semi-DP Fano's method" (Theorem 79). In combination with the reduction from estimation to testing, Theorem 79 generalizes DP Fano's method [26, 2] and strengthens classical Fano's method [270]. Compared with DP Fano's inequality, the main difficulty in proving Semi-DP Fano is that the behavior of a semi-DP algorithm $\mathcal{A}$ cannot be controlled if a public sample $x$ is exchanged with another public sample $x'$. It seems challenging to get a tight semi-DP bound using the approach of [2]: the resulting bound becomes arbitrarily loose as the expected Hamming distance between $X_{\mathrm{pub}}$ and $X'_{\mathrm{pub}}$ grows, and $X_{\mathrm{pub}}$ and $X_{\mathrm{priv}}$ need to be drawn from the same distribution. Instead, we build on the tools of [26, Proof of Theorem 3] and account for the probability that $X_{\mathrm{pub}} \neq X'_{\mathrm{pub}}$ in our proof of Theorem 79.
Our semi-DP ERM lower bound uses a packing argument [117, 32]: we construct $2^{d/2}$ "hard" data sets with well-separated sample means, and use the group privacy property of DP [88] to show that any semi-DP algorithm must make large error on at least one of these data sets. The main challenge is in constructing the private and public data sets so as to force a semi-DP $\mathcal{A}$ to make large error, even when $\mathcal{A}$ has access to public data.

For our semi-LDP lower bounds, we build directly on the sophisticated techniques of [83]. The main idea of our proofs is to combine Assouad's method [82] with bounds on the mutual information between the input and output of semi-DP algorithms.
We develop novel algorithms that obtain smaller error (including constants) than the asymptotically optimal algorithms (throw-away and the optimal DP algorithm). Our algorithms are simple. For central $(\varepsilon, \delta)$-semi-DP, we propose an estimator that puts different weights on the private and public samples and adds (Gaussian) noise that is calibrated to the private weight. Our estimator is inspired by the linear Laplace estimator of [92], which was used for 1-dimensional personalized DP estimation. We choose the weights depending on the privacy level, dimension, and number of public samples to trade off smaller sensitivity (less noise) against larger variance on the public data. An optimal choice of weights minimizes the variance of our unbiased estimator, leading to smaller error than the optimal DP estimator. Our semi-local DP algorithm simply applies the optimal local randomizer of [39] only to the private samples, and averages the noisy private data with the raw public data. For model training (SCO), we develop semi-(L)DP stochastic gradient methods that use our mean estimation algorithms to estimate the gradient of the loss function in each iteration.
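To make the weighting scheme concrete, here is a minimal NumPy sketch of such a weighted estimator (ours, for illustration only; the function name and the particular choice of `r` are hypothetical stand-ins, and the optimal calibrated weights are derived in Appendix A.2.2):

```python
import numpy as np

def weighted_semi_dp_mean(X_priv, X_pub, r, B, rho, rng):
    """Estimate the mean of the combined sample under rho-semi-zCDP.

    Private samples get weight r; public samples share the remaining
    weight (1 - n_priv * r) / n_pub, so the estimator stays unbiased.
    Gaussian noise is calibrated only to the private weight: changing
    one private sample (with ||x|| <= B) moves the weighted sum by at
    most 2*B*r in l2-norm.
    """
    n_priv, d = X_priv.shape
    n_pub = X_pub.shape[0]
    w_pub = (1.0 - n_priv * r) / n_pub
    weighted_sum = r * X_priv.sum(axis=0) + w_pub * X_pub.sum(axis=0)
    sigma2 = 2.0 * (B * r) ** 2 / rho   # zCDP Gaussian calibration
    noise = rng.normal(0.0, np.sqrt(sigma2), size=d)
    return weighted_sum + noise

# r = 1/n recovers the (noisy) full-sample mean; r = 0 is throw-away.
rng = np.random.default_rng(0)
X_priv = rng.uniform(-0.1, 0.1, size=(50, 5))
X_pub = rng.uniform(-0.1, 0.1, size=(10, 5))
est = weighted_semi_dp_mean(X_priv, X_pub, r=1.0 / 60, B=1.0, rho=1.0, rng=rng)
```

Setting `r = 0` makes the noise vanish and recovers the public-data-only (throw-away) average, which is the sense in which the weight interpolates between the two asymptotically optimal baselines.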
A.1.2 Related Work
A.1.2.1 Sample complexity bounds for semi-DP learning and estimation
The works of [33, 8, 27] give sample complexity bounds for DP PAC learning and query release with access to public data in terms of the VC dimension of the hypothesis class. For DP $d$-dimensional Gaussian mean estimation, [40] gave sample complexity bounds that do not depend on the range parameters of the distribution if $n_{\mathrm{pub}} \ge d + 1$; this is known to be impossible without public data.
A.1.2.2 DP model training (ERM and SCO) with public data
The works of [134, 285] considered DP ERM with public data under additional assumptions on the gradients lying in a low-rank subspace. Under these assumptions, [134, 285] show that nearly dimension-independent excess empirical risk bounds are possible. Our lower bounds show that such additional assumptions are strictly necessary. The work of [256] used public data to adjust the parameters of DP-SGD. Empirically, pre-training on public data sets and privately fine-tuning the model [168, 147, 193] has shown great promise for large-scale ML.

The work of [9] developed a public-data-assisted DP mirror descent (PDA-MD) algorithm that sometimes outperforms DP-SGD empirically in training ML models, and theoretically in terms of excess risk for linear regression. We use the PDA-MD of [9] as a baseline in our experiments. [9] also gave an "efficient approximation" of their PDA-MD in [9, Equation 1], which they used for training non-convex models. While finalizing this manuscript, we became aware that this "efficient approximation" is equivalent to our Algorithm 11. Our Algorithm 11 was derived in a different fashion from the algorithm in [9, Equation 1]: we derived our algorithm as an application of our "even more optimal" mean estimator, while theirs was derived as an approximation of their mirror descent method. Moreover, no theoretical analysis was provided for the "efficient approximation" in [9].
A.1.2.3 Personalized DP
A related line of work is that of personalized DP (PDP) [132, 112, 196, 92], a generalization of DP in which each person may have different privacy parameters $(\varepsilon_i, \delta_i)$. By letting $\varepsilon_i = \infty$ for some person $i$, PDP also generalizes semi-DP. We leverage this connection to borrow techniques from the work of [92], which considers pure ($\delta_i = 0$) PDP. We also note that the 1-dimensional pure (central) PDP mean estimation bound of [92] extends easily to a 1-dimensional $\varepsilon$-semi-DP bound. However, our $d$-dimensional semi-DP lower bounds require a different set of techniques. Additionally, the personalized LDP bound of [92] relies on the assumption $\varepsilon_i \le 1$ and does not seem to extend easily to the $\varepsilon$-semi-LDP setting (even if $d = 1$).
A.2 Optimal Centrally Private Learning with Public Data
A.2.1 Optimal Semi-DP Mean Estimation
We begin in Appendix A.2.1.1 with empirical mean estimation. This subsection was omitted from the main body due to space constraints. Theorem 77 will be useful for proving our ERM bounds (Theorem 21). Then, in Appendix A.2.1.2, we turn to population mean estimation (i.e. the proof of Theorem 15). Appendix A.2.2 contains proofs for Section 2.2.1.1.

A.2.1.1 Estimating the Empirical Mean
For a given data set $X \subset B := \{x \in \mathbb{R}^d : \|x\| \le 1\}$, consider the problem of estimating $\bar{X} = \frac{1}{n}\sum_{i=1}^n x_i$ subject to the constraint that the estimator satisfies semi-DP. We will characterize the minimax squared error of $d$-dimensional empirical mean estimation under $\varepsilon$-semi-DP:
$$\mathcal{M}_{\mathrm{emp}}(\varepsilon, n_{\mathrm{priv}}, n, d) := \inf_{\mathcal{A} \in \mathbb{A}_\varepsilon(B)} \ \sup_{X \in B^n,\, |X_{\mathrm{priv}}| = n_{\mathrm{priv}}} \mathbb{E}_{\mathcal{A}}\big[\|\mathcal{A}(X) - \bar{X}\|^2\big], \tag{A.1}$$
where $\mathbb{A}_\varepsilon(B)$ denotes the set of all $\varepsilon$-semi-DP estimators $\mathcal{A} : B^n \to B$.
Theorem 77. Let $\varepsilon > 0$, $n, d \in \mathbb{N}$, and $n_{\mathrm{priv}} \in [n]$. There exist absolute constants $c$ and $C$ with $0 < c \le C$ such that
$$c\min\Big\{\frac{n_{\mathrm{priv}}}{n},\ \frac{d}{n\varepsilon}\Big\}^2 \le \mathcal{M}_{\mathrm{emp}}(\varepsilon, n_{\mathrm{priv}}, n, d) \le C\min\Big\{\frac{n_{\mathrm{priv}}}{n},\ \frac{d}{n\varepsilon}\Big\}^2.$$
Proof. Lower bound: We use a packing argument to prove our lower bound. Variations of such arguments have been used to prove DP lower bounds in previous works (see e.g. [117, 32]). Denote $\mathcal{X} = \big[-\tfrac{1}{\sqrt{d}}, \tfrac{1}{\sqrt{d}}\big]^d$. Let $\mathcal{A} : B^n \to B$ be $\varepsilon$-semi-DP. Choose $K = 2^{d/2}$ private data points $\{x_i\}_{i=1}^K \subset \mathcal{X}$ such that $\|x_i - x_j\| \ge 1/8$ for all $i \neq j$. The existence of such a set of points is well known (see e.g. the Gilbert–Varshamov construction). Let $n^* = \frac{nd}{3\varepsilon n_{\mathrm{priv}}}$.
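For intuition, such a packing is easy to exhibit concretely in small dimensions, where any collection of distinct sign vectors scaled into $\mathcal{X}$ is already $2/\sqrt{d}$-separated. The brute-force sketch below is ours and is not part of the proof (for large $d$ one needs the Gilbert–Varshamov bound rather than enumeration):

```python
import itertools
import math
import numpy as np

d = 8
K = 2 ** (d // 2)  # K = 2^{d/2} = 16 points

# Take K distinct sign vectors in {-1, 1}^d and scale them into
# X = [-1/sqrt(d), 1/sqrt(d)]^d. Two distinct sign vectors differ in
# at least one coordinate, so their scaled distance is >= 2/sqrt(d),
# which exceeds the required separation 1/8 for this small d.
signs = np.array(list(itertools.product([-1.0, 1.0], repeat=d))[:K])
points = signs / math.sqrt(d)

pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
min_sep = pairwise[~np.eye(K, dtype=bool)].min()
```

Here the minimum separation is exactly $2/\sqrt{8} \approx 0.71$, comfortably above the $1/8$ required by the argument.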
Case 1: $n \le n^*$ (i.e. $d \ge 3\varepsilon n_{\mathrm{priv}}$). In this case, we will show that $\mathbb{E}\|\mathcal{A}(X) - \bar{X}\|^2 \gtrsim \big(\tfrac{n_{\mathrm{priv}}}{n}\big)^2$ for some $X \in \mathcal{X}^n$.

For $i \in [K]$, let $X_i = (x_i, \cdots, x_i, 0^{n - n_{\mathrm{priv}}})$ consist of $n_{\mathrm{priv}}$ copies of $x_i$ followed by $n - n_{\mathrm{priv}} = n_{\mathrm{pub}}$ copies of $0 \in \mathbb{R}^d$. Suppose for the sake of contradiction that for every $i \in [K]$, with probability $\ge 1/3$ we have $\|\mathcal{A}(X) - \bar{X}\| < \frac{1}{32}\frac{n_{\mathrm{priv}}}{n}$. That is, we are supposing $\mathbb{P}(\mathcal{A}(X_i) \in B_i) \ge 1/3$ for all $i$, where $B_i = \big\{x \in B : \|x - \bar{X}_i\| \le \frac{1}{32}\frac{n_{\mathrm{priv}}}{n}\big\}$. Note that the sets $\{B_i\}_{i=1}^K$ are disjoint by construction. Since $\mathcal{A}$ is $\varepsilon$-semi-DP, group privacy implies that $\mathbb{P}(\mathcal{A}(X_1) \in B_i) \ge \frac{1}{3}e^{-\varepsilon n_{\mathrm{priv}}}$ for all $i \in [K]$. Thus,
$$K\,\frac{1}{3}e^{-\varepsilon n_{\mathrm{priv}}} \le \sum_{i=1}^K \mathbb{P}\big(\mathcal{A}(X_1) \in B_i\big) \le 1,$$
where the last inequality follows from disjointness of the balls $B_i$. Thus, we obtain $\ln(K/3) \le \varepsilon n_{\mathrm{priv}}$. Assume for now that $d \ge 8$. (A 1-dimensional lower bound that is tight up to constant factors can be shown easily by following the proof of the $d$-dimensional case but choosing $K = 16$ instead of $K = 2^{d/2}$.) Then $d/2 \le d - 4 \le \varepsilon n_{\mathrm{priv}}$ implies $d \le 2\varepsilon n_{\mathrm{priv}}$, contradicting the assumption made in Case 1. Thus, we conclude that there exists a data set $X_i$ such that with probability at least $2/3$, $\|\mathcal{A}(X_i) - \bar{X}_i\| > \frac{1}{32}\frac{n_{\mathrm{priv}}}{n}$. Squaring both sides of this inequality and applying Markov's inequality yields the desired lower bound.
Case 2: $n > n^*$.

Additionally, suppose for now that $n^* \le n_{\mathrm{priv}}$. In this case, we will show that $\mathbb{E}\|\mathcal{A}(X) - \bar{X}\|^2 \gtrsim \big(\tfrac{d}{n\varepsilon}\big)^2$ for some $X \in \mathcal{X}^n$. Let $\tilde{X}_i = (x_i, \cdots, x_i, 0^{n - n^*})$ consist of $n^*$ copies of $x_i$ followed by $n - n^*$ copies of $0 \in \mathbb{R}^d$.* Denoting the mean of a data set $X$ by $q(X) := \frac{1}{n}\sum_{x \in X} x$ for convenience, we see that $q(\tilde{X}_i) = \frac{n^*}{n}x_i = \frac{d}{3 n_{\mathrm{priv}}\varepsilon}x_i$. Define the algorithm $\hat{\mathcal{A}} : B^{n^*} \to B$ by $\hat{\mathcal{A}}(X) = \frac{n}{n^*}\mathcal{A}(X, 0^{n - n^*})$. Since $\mathcal{A}$ is $\varepsilon$-semi-DP, we see that $\hat{\mathcal{A}}$ is $\varepsilon$-semi-DP by post-processing. Also, the domain of $\hat{\mathcal{A}}$ is $\mathcal{X}^{n^*}$ and $n^* \le n^*$, so the argument in Case 1 applies to $\hat{\mathcal{A}}$. Thus, by applying the result in Case 1, there exists $i \in [K]$ such that with probability at least $2/3$, $\|\hat{\mathcal{A}}(x_i, \cdots, x_i) - q(x_i, \cdots, x_i)\| \ge \frac{1}{32}\frac{n_{\mathrm{priv}}}{n}$. (Here $(x_i, \cdots, x_i) \in \mathcal{X}^{n^*}$.) But this implies $\|\mathcal{A}(\tilde{X}_i) - q(\tilde{X}_i)\| \ge \frac{1}{32}\frac{n_{\mathrm{priv}}}{n}\frac{n^*}{n} = \frac{1}{96}\frac{d}{n\varepsilon}$ with probability at least $2/3$. Again, squaring both sides and applying Markov yields the desired lower bound.

Next, consider the complementary subcase where $n^* > n_{\mathrm{priv}}$. Define the algorithm $\hat{\mathcal{A}} : B^{n^*} \to B$ by $\hat{\mathcal{A}}(X) = \mathcal{A}(X, 0^{n - n^*})$. Since $\mathcal{A}$ is $\varepsilon$-semi-DP, we see that $\hat{\mathcal{A}}$ is $\varepsilon$-semi-DP by post-processing. Also, the domain of $\hat{\mathcal{A}}$ is $\mathcal{X}^{n^*}$ and $n^* \le n^*$, so the argument in Case 1 applies to $\hat{\mathcal{A}}$. Thus, by applying the result in Case 1, there exists $i \in [K]$ such that with probability at least $2/3$, $\|\hat{\mathcal{A}}(x_i, \cdots, x_i) - q(x_i, \cdots, x_i)\| \ge \frac{1}{32}\frac{n_{\mathrm{priv}}}{n}$. (Here $(x_i, \cdots, x_i) \in \mathcal{X}^{n^*}$.) But this implies that $\|\mathcal{A}(X_i) - \bar{X}_i\| \ge \frac{1}{32}\frac{n_{\mathrm{priv}}}{n}$ with probability at least $2/3$. Again, squaring both sides and applying Markov yields the desired lower bound. Thus, the lower bound holds in all cases.
Upper bound: For the first term in the minimum, consider the algorithm which throws away the private data and outputs $\mathcal{A}(X) = \frac{1}{n}\sum_{x \in X_{\mathrm{pub}}} x$. Clearly, $\mathcal{A}$ is semi-DP. Moreover,
$$\|\mathcal{A}(X) - \bar{X}\|^2 = \frac{1}{n^2}\bigg\|\sum_{x \in X_{\mathrm{priv}}} x\bigg\|^2 \le \frac{n_{\mathrm{priv}}}{n^2}\sum_{x \in X_{\mathrm{priv}}}\|x\|^2 \le \Big(\frac{n_{\mathrm{priv}}}{n}\Big)^2.$$
*We assume without loss of generality that $n^* \in \mathbb{N}$. If $n^*$ is not an integer, then choosing $\lceil n^* \rceil$ instead yields the same bound up to constant factors.
For the second term, consider the Laplace mechanism $\mathcal{A}(X) = \bar{X} + (L_1, \cdots, L_d)$, where $L_i \sim \mathrm{Lap}(2\sqrt{d}/n\varepsilon)$ are i.i.d. mean-zero Laplace random variables. We know $\mathcal{A}$ is $\varepsilon$-DP by [88], since the $\ell_1$-sensitivity is $\sup_{X \sim X'}\|\bar{X} - \bar{X}'\|_1 = \frac{1}{n}\sup_{x, x'}\|x - x'\|_1 \le \frac{2\sqrt{d}}{n}$. Hence $\mathcal{A}$ is $\varepsilon$-semi-DP. Moreover, $\mathcal{A}$ has error
$$\mathbb{E}\|\mathcal{A}(X) - \bar{X}\|^2 = d\,\mathrm{Var}\big(\mathrm{Lap}(2\sqrt{d}/n\varepsilon)\big) = \frac{8d^2}{n^2\varepsilon^2}.$$
Combining the two upper bounds completes the proof.
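Both algorithms from the upper bound are a few lines each. The following sketch is our illustration (function names are hypothetical), implementing the throw-away estimator and the Laplace mechanism with the sensitivity calibration used above:

```python
import numpy as np

def throw_away_mean(X_priv, X_pub, n):
    """(0,0)-semi-DP: output (1/n) times the sum of the public samples."""
    return X_pub.sum(axis=0) / n

def laplace_mean(X, eps, rng):
    """eps-DP empirical mean for data with ||x|| <= 1.

    The l1-sensitivity of the empirical mean is at most 2*sqrt(d)/n,
    so adding i.i.d. Lap(2*sqrt(d)/(n*eps)) noise per coordinate gives
    eps-DP; the resulting error is d * Var(Lap) = 8*d^2/(n*eps)^2.
    """
    n, d = X.shape
    scale = 2.0 * np.sqrt(d) / (n * eps)
    return X.mean(axis=0) + rng.laplace(0.0, scale, size=d)

rng = np.random.default_rng(1)
X = rng.uniform(-0.3, 0.3, size=(100, 4))  # each row has norm below 1
noisy_mean = laplace_mean(X, eps=1.0, rng=rng)
```

Whichever of the two has smaller error depends on the relative size of $n_{\mathrm{priv}}/n$ and $d/(n\varepsilon)$, mirroring the minimum in Theorem 77.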
A.2.1.2 Estimating the Population Mean
Theorem 78 (Re-statement of Theorem 15). Let $\varepsilon \le d/8$ and either $n_{\mathrm{pub}} \lesssim \frac{n\varepsilon}{d}$ or $d \lesssim 1$. Then, there exist absolute constants $c$ and $C$, with $0 < c \le C$, such that
$$c\min\Big\{\frac{1}{n_{\mathrm{pub}}},\ \frac{d^2}{n^2\varepsilon^2} + \frac{1}{n}\Big\} \le \mathcal{M}_{\mathrm{pop}}(\varepsilon, \delta = 0, n_{\mathrm{priv}}, n, d) \le C\min\Big\{\frac{1}{n_{\mathrm{pub}}},\ \frac{d^2}{n^2\varepsilon^2} + \frac{1}{n}\Big\}. \tag{A.2}$$
The upper bound in (A.2) holds for any $n_{\mathrm{pub}}, d$. Further, for any $n_{\mathrm{pub}}, d$ and any $\delta \in [0, \varepsilon]$, we have
$$c\min\Big\{\frac{1}{n_{\mathrm{pub}}},\ \frac{1}{n^2\varepsilon^2} + \frac{1}{n}\Big\} \le \mathcal{M}_{\mathrm{pop}}(\varepsilon, \delta, n_{\mathrm{priv}}, n, d) \le C\min\Big\{\frac{1}{n_{\mathrm{pub}}},\ \frac{d\ln(1/\delta)}{n^2\varepsilon^2} + \frac{1}{n}\Big\}. \tag{A.3}$$
The proof of (A.2) will require the following intermediate result, Theorem 79, which can be viewed both as a "semi-DP Fano's inequality" and as an extension of [26, Theorem 3] to the semi-DP setting.

Theorem 79. Let $\{P_v\}_{v \in \mathcal{V}} \subset \mathcal{P}$ be a family of distributions on $\mathcal{X}$. Let $P_0$ be a distribution and $p \in [0, 1]$ such that $P_{\theta_v} := (1 - p)P_0 + pP_v \in \mathcal{P}$ for all $v \in \mathcal{V}$. Denote $\theta_v := \mathbb{E}_{x \sim P_{\theta_v}}[x]$ and $\rho^*(\mathcal{V}) := \min\{\|\theta_v - \theta_{v'}\| : v, v' \in \mathcal{V}, v \neq v'\}$. Let $\hat{\theta} : \mathcal{X}^n \to \mathcal{X}$ be any $\varepsilon$-semi-DP estimator. Draw $V \sim \mathrm{Unif}(\mathcal{V})$; then, conditional on $V = v$, draw an i.i.d. sample $X \,|\, V = v \sim P^n_{\theta_v}$ containing $n_{\mathrm{priv}}$ private samples and $n_{\mathrm{pub}}$ public samples. Then,
$$\frac{1}{|\mathcal{V}|}\sum_{v \in \mathcal{V}} P_{\theta_v}\Big(\big\|\hat{\theta}(X) - \theta_v\big\| \ge \rho^*(\mathcal{V})\Big) \ge \frac{(|\mathcal{V}| - 1)\,e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}\,\frac{(1 - p)^{n_{\mathrm{pub}}}}{2}}{1 + (|\mathcal{V}| - 1)\,e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}}. \tag{A.4}$$
Remark 80. Note that the right-hand side of (A.4) is similar to the second term on the right-hand side of DP Fano's inequality [2, Equation 5], after aligning notation. The main differences are that (A.4) has an extra factor of $(1 - p)^{n_{\mathrm{pub}}}$, and $n_{\mathrm{priv}}$ in place of $n$.
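The right-hand side of (A.4) is simple to evaluate numerically, which helps build intuition for how the bound degrades with weaker privacy or more public data. The following sketch is our illustration only (the parameter values are arbitrary):

```python
import math

def semi_dp_fano_rhs(V_size, eps, n_priv, n_pub, p):
    """Evaluate the right-hand side of (A.4): a lower bound on the
    average failure probability of any eps-semi-DP estimator over
    the mixture family {P_theta_v}."""
    decay = math.exp(-eps * math.ceil(n_priv * p))
    numerator = (V_size - 1) * decay * (1.0 - p) ** n_pub / 2.0
    return numerator / (1.0 + (V_size - 1) * decay)

# The bound weakens as eps grows (less privacy is demanded) and as
# n_pub grows, via the extra (1 - p)^{n_pub} factor from Remark 80.
lb = semi_dp_fano_rhs(V_size=2 ** 10, eps=0.5, n_priv=100, n_pub=20, p=0.05)
```

With these illustrative values, the bound is a nontrivial constant; sending $\varepsilon \to \infty$ or $n_{\mathrm{pub}} \to \infty$ drives it toward zero, consistent with the remark.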
Proof of Theorem 79. The key novel element of the proof is (A.5), which is a semi-DP variation of [26, Lemma 2]: if $A$ is a measurable set and $v, v' \in \mathcal{V}$, then
$$P_{\theta_v}\big(\hat{\theta}(X) \in A\big) \ge e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}\bigg[P_{\theta_{v'}}\big(\hat{\theta} \in A\big) - 1 + \frac{(1 - p)^{n_{\mathrm{pub}}}}{2}\bigg]. \tag{A.5}$$
Once we prove (A.5), we can obtain Theorem 79 by following the proof approach of [26, Theorem 3]. Let us now prove (A.5). We will use upper-case letters to denote random variables and lower-case letters to denote the values that the random variables take. Let $B = \{B_i\}_{i=1}^n$ be i.i.d. $\mathrm{Bernoulli}(p)$ random variables. Assume that the random variables $X = \{X_i\}_{i=1}^n$ are generated in the following way: first draw $W^0_1, \ldots, W^0_n \sim P_0$ i.i.d. and draw $W^v_1, \ldots, W^v_n \sim P_v$ i.i.d. For each $i$, if $B_i = 0$, set $X_i = W^0_i$; if $B_i = 1$, set $X_i = W^v_i$. Thus, conditional on $V = v$, the random variables $X_i$ are each distributed according to $P_{\theta_v} = (1 - p)P_0 + pP_v$. For fixed $v' \in \mathcal{V}$, generate a different sample $X' = \{X'_i\}_{i=1}^n$ by drawing $W^{v'}_i \sim P_{v'}$ i.i.d. and setting $X'_i = W^0_i(1 - B_i) + W^{v'}_i B_i$. Note that if $B_i = 0$, then $X_i = X'_i$. Thus, the Hamming distance between $X$ and $X'$ is
$$d_{\mathrm{ham}}(X, X') \le B^T\mathbf{1} = \sum_{i=1}^n B_i.$$
Now let $Q$ denote the conditional distribution of the $\varepsilon$-semi-DP estimator $\hat{\theta}$ given input data ($X$ or $X'$). For notational convenience, assume without loss of generality that $X = (X_1, \ldots, X_{n_{\mathrm{priv}}}, X_{\mathrm{pub}})$ and $X' = (X'_1, \ldots, X'_{n_{\mathrm{priv}}}, X'_{\mathrm{pub}})$. Then, for any fixed sequence $b = (b_1, \ldots, b_{n_{\mathrm{priv}}}, 0^{n_{\mathrm{pub}}}) \in \{0, 1\}^{n_{\mathrm{priv}}} \times \{0\}^{n_{\mathrm{pub}}}$, we have
$$Q\big(\hat{\theta} \in A \,\big|\, X_i = W^0_i(1 - b_i) + W^v_i b_i \ \forall i \in [n]\big) \ge e^{-\varepsilon b^T\mathbf{1}}\, Q\big(\hat{\theta} \in A \,\big|\, X'_i = W^0_i(1 - b_i) + W^{v'}_i b_i \ \forall i \in [n]\big) \tag{A.6}$$
by group privacy, since $b_i = 0$ for all $i > n_{\mathrm{priv}}$ implies $X_{\mathrm{pub}} = X'_{\mathrm{pub}}$. Thus,
$$\begin{aligned}
P_{\theta_v}\big(\hat{\theta} \in A\big) &= \sum_{b \in \{0,1\}^n} \mathbb{P}(B = b)\int Q\big(\hat{\theta} \in A \,\big|\, X_i = w^0_i(1 - b_i) + w^v_i b_i \ \forall i \in [n]\big)\, dP^n_0(w^0_{1:n})\, dP^n_v(w^v_{1:n}) \\
&\ge \sum_{\substack{b \in \{0,1\}^{n_{\mathrm{priv}}} \times \{0\}^{n_{\mathrm{pub}}},\\ b^T\mathbf{1} \le \lceil n_{\mathrm{priv}} p \rceil}} \mathbb{P}(B = b)\int Q\big(\hat{\theta} \in A \,\big|\, X_i = w^0_i(1 - b_i) + w^v_i b_i \ \forall i \in [n]\big)\, dP^n_0(w^0_{1:n})\, dP^n_v(w^v_{1:n}) \\
&= \sum_{\substack{b \in \{0,1\}^{n_{\mathrm{priv}}} \times \{0\}^{n_{\mathrm{pub}}},\\ b^T\mathbf{1} \le \lceil n_{\mathrm{priv}} p \rceil}} \mathbb{P}(B = b)\int Q\big(\hat{\theta} \in A \,\big|\, X_i = w^0_i(1 - b_i) + w^v_i b_i \ \forall i \in [n]\big)\, dP^n_0(w^0_{1:n})\, dP^n_v(w^v_{1:n})\, dP^n_{v'}(w^{v'}_{1:n}) \\
&\ge \sum_{\substack{b \in \{0,1\}^{n_{\mathrm{priv}}} \times \{0\}^{n_{\mathrm{pub}}},\\ b^T\mathbf{1} \le \lceil n_{\mathrm{priv}} p \rceil}} \mathbb{P}(B = b)\int e^{-\varepsilon b^T\mathbf{1}}\, Q\big(\hat{\theta} \in A \,\big|\, X'_i = w^0_i(1 - b_i) + w^{v'}_i b_i \ \forall i \in [n]\big)\, dP^n_0(w^0_{1:n})\, dP^n_v(w^v_{1:n})\, dP^n_{v'}(w^{v'}_{1:n}) \\
&= \sum_{\substack{b \in \{0,1\}^{n_{\mathrm{priv}}} \times \{0\}^{n_{\mathrm{pub}}},\\ b^T\mathbf{1} \le \lceil n_{\mathrm{priv}} p \rceil}} \mathbb{P}(B = b)\, e^{-\varepsilon b^T\mathbf{1}}\, P_{\theta_{v'}}\big(\hat{\theta} \in A \,\big|\, B = b\big) \\
&\ge \sum_{\substack{b \in \{0,1\}^{n_{\mathrm{priv}}} \times \{0\}^{n_{\mathrm{pub}}},\\ b^T\mathbf{1} \le \lceil n_{\mathrm{priv}} p \rceil}} \mathbb{P}(B = b)\, e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}\, P_{\theta_{v'}}\big(\hat{\theta} \in A \,\big|\, B = b\big) \\
&\ge e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}\, P_{\theta_{v'}}\Big(\hat{\theta} \in A,\ B_{(i > n_{\mathrm{priv}})} = 0^{n_{\mathrm{pub}}},\ B^T\mathbf{1} \le \lceil n_{\mathrm{priv}} p \rceil\Big) \\
&\ge e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}\Big[P_{\theta_{v'}}\big(\hat{\theta} \in A\big) - \mathbb{P}\Big(B_{(i > n_{\mathrm{priv}})} \neq 0^{n_{\mathrm{pub}}} \cup B^T\mathbf{1} > \lceil n_{\mathrm{priv}} p \rceil\Big)\Big],
\end{aligned}$$
where the second inequality used (A.6) and the last inequality used a union bound. Now, by independence of $\{B_i\}_{i=1}^n$, we have
$$\begin{aligned}
\mathbb{P}\Big(B_{(i > n_{\mathrm{priv}})} \neq 0^{n_{\mathrm{pub}}} \cup B^T\mathbf{1} > \lceil n_{\mathrm{priv}} p \rceil\Big) &= 1 - \mathbb{P}\Big(B_{(i > n_{\mathrm{priv}})} = 0^{n_{\mathrm{pub}}} \cap B^T\mathbf{1} \le \lceil n_{\mathrm{priv}} p \rceil\Big) \\
&= 1 - \mathbb{P}\Big(B_{(i > n_{\mathrm{priv}})} = 0^{n_{\mathrm{pub}}}\Big)\, \mathbb{P}\Big(B^T\mathbf{1} \le \lceil n_{\mathrm{priv}} p \rceil \,\Big|\, B_{(i > n_{\mathrm{priv}})} = 0^{n_{\mathrm{pub}}}\Big) \\
&= 1 - \mathbb{P}\Big(B_{(i > n_{\mathrm{priv}})} = 0^{n_{\mathrm{pub}}}\Big)\, \mathbb{P}\Big(B^T_{(i \le n_{\mathrm{priv}})}\mathbf{1} \le \lceil n_{\mathrm{priv}} p \rceil\Big) \\
&= 1 - (1 - p)^{n_{\mathrm{pub}}}\, \mathbb{P}\Big(B^T_{(i \le n_{\mathrm{priv}})}\mathbf{1} \le \lceil n_{\mathrm{priv}} p \rceil\Big) \\
&\le 1 - (1 - p)^{n_{\mathrm{pub}}} \cdot \frac{1}{2},
\end{aligned}$$
since the median of $B^T_{(i \le n_{\mathrm{priv}})}\mathbf{1} \sim \mathrm{Binomial}(n_{\mathrm{priv}}, p)$ is no larger than $\lceil n_{\mathrm{priv}} p \rceil$. Putting together the pieces, we get
$$P_{\theta_v}\big(\hat{\theta} \in A\big) \ge e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}\bigg[P_{\theta_{v'}}\big(\hat{\theta} \in A\big) - 1 + (1 - p)^{n_{\mathrm{pub}}} \cdot \frac{1}{2}\bigg],$$
establishing (A.5).
Now, with (A.5) in hand, we proceed as in the proof of [26, Theorem 3]. Let $B_\alpha(\theta) := \{\theta' \in \mathcal{X} : \|\theta - \theta'\| \le \alpha\}$. Note that the balls $B_{\rho^*(\mathcal{V})}(\theta_v)$ are disjoint for $v \in \mathcal{V}$ by definition of $\rho^*(\mathcal{V})$. Denote the average probability of success for the estimator $\hat{\theta}$ by
$$P_{\mathrm{succ}} := \frac{1}{|\mathcal{V}|}\sum_{v \in \mathcal{V}} P_{\theta_v}\Big(\hat{\theta} \in B_{\rho^*(\mathcal{V})}(\theta_v)\Big).$$
Then, by a union bound and disjointness of the balls $\{B_{\rho^*(\mathcal{V})}(\theta_v)\}_{v \in \mathcal{V}}$, we have
$$P_{\mathrm{succ}} \le 1 - \frac{1}{|\mathcal{V}|}\sum_{v \in \mathcal{V}}\ \sum_{v' \in \mathcal{V},\, v' \neq v} P_{\theta_v}\Big(\hat{\theta} \in B_{\rho^*(\mathcal{V})}(\theta_{v'})\Big).$$
An application of (A.5) yields
$$\begin{aligned}
P_{\mathrm{succ}} &\le 1 - \frac{1}{|\mathcal{V}|}\sum_{v \in \mathcal{V}}\ \sum_{v' \in \mathcal{V},\, v' \neq v} e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}\bigg[P_{\theta_{v'}}\Big(\hat{\theta} \in B_{\rho^*(\mathcal{V})}(\theta_{v'})\Big) - \Big(1 - \frac{(1 - p)^{n_{\mathrm{pub}}}}{2}\Big)\bigg] \\
&\le 1 - e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}\big(|\mathcal{V}| - 1\big)P_{\mathrm{succ}} + e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}\big(|\mathcal{V}| - 1\big)\Big(1 - \frac{(1 - p)^{n_{\mathrm{pub}}}}{2}\Big).
\end{aligned}$$
Re-arranging this inequality leads to
$$P_{\mathrm{succ}} \le \frac{1 + (|\mathcal{V}| - 1)\Big(1 - \frac{(1 - p)^{n_{\mathrm{pub}}}}{2}\Big)e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}}{1 + (|\mathcal{V}| - 1)e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}}$$
and hence
$$1 - P_{\mathrm{succ}} \ge \frac{(|\mathcal{V}| - 1)e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil} \cdot \frac{(1 - p)^{n_{\mathrm{pub}}}}{2}}{1 + (|\mathcal{V}| - 1)e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}}.$$
This last inequality is equivalent to the inequality stated in Theorem 79.
While we state Theorem 79 for mean estimation, it holds more generally for estimating any population statistic $\theta : \mathcal{P} \to \Theta$. However, this additional generality will not be necessary for our purposes. With Theorem 79 in hand, we now turn to the proof of Theorem 78.
Proof of Theorem 78. Lower bounds: We begin by proving (A.2). First suppose $n_{\mathrm{pub}} \lesssim \frac{n\varepsilon}{d}$ and $d \ge 8$. Choose a finite subset $\mathcal{V} \subset \mathbb{R}^d$ such that $|\mathcal{V}| \ge 2^{d/2}$, $\|v\| = 1$, and $\|v - v'\| \ge \frac{1}{8}$ for all $v, v' \in \mathcal{V}$, $v \neq v'$. The existence of such a set of points is well known (see e.g. the Gilbert–Varshamov construction). Define $P_0$ to be the point mass distribution on $\{X = 0\}$ and $P_v$ to be the point mass on $\{X = v\}$ for $v \in \mathcal{V}$. For $v \in \mathcal{V}$, let $P_{\theta_v} := (1 - p)P_0 + pP_v$ for some $p \in [0, 1]$ to be specified later. Note that if $X \sim P_{\theta_v}$, then $\|X\| \le 1$ with probability 1. Thus, $P_{\theta_v}$ is a valid distribution in the class $\mathcal{P}$ of bounded (by 1) distributions on $B$ that we are considering. Also, note that $\theta_v := \mathbb{E}_{P_{\theta_v}}[X] = pv$. Further,
$$\rho^*(\mathcal{V}) := \min\big\{\|\theta_v - \theta_{v'}\| : v, v' \in \mathcal{V}, v \neq v'\big\} \ge \frac{p}{8}$$
by construction.
Now we use the classical reduction from estimation to testing (see [26] for details) to lower bound the MSE of any $\varepsilon$-semi-DP estimator $\hat{\theta}$ by
$$\begin{aligned}
\sup_{P \in \mathcal{P}} \mathbb{E}_{X \sim P^n,\, \hat{\theta}}\big\|\hat{\theta}(X) - \mathbb{E}_{x \sim P}[x]\big\|^2 &\ge \rho^*(\mathcal{V})^2\, \frac{1}{|\mathcal{V}|}\sum_{v \in \mathcal{V}} P_{\theta_v}\Big(\|\hat{\theta}(X) - \theta_v\| \ge \rho^*(\mathcal{V})\Big) \\
&\ge \Big(\frac{p}{8}\Big)^2\, \frac{(|\mathcal{V}| - 1)e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}\,\frac{(1 - p)^{n_{\mathrm{pub}}}}{2}}{1 + (|\mathcal{V}| - 1)e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}} \\
&\ge \frac{p^2}{64}\, \frac{(2^{d/2} - 1)e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}\,\frac{(1 - p)^{n_{\mathrm{pub}}}}{2}}{1 + (2^{d/2} - 1)e^{-\varepsilon\lceil n_{\mathrm{priv}} p \rceil}},
\end{aligned}$$
where we used Theorem 79 in the second inequality. Since we assumed $d \ge 8$, we have $2^{d/2} - 1 \ge e^{d/4}$ and hence
$$\sup_{P \in \mathcal{P}} \mathbb{E}_{X \sim P^n,\, \hat{\theta}}\big\|\hat{\theta}(X) - \mathbb{E}_{x \sim P}[x]\big\|^2 \ge \frac{p^2}{64} \cdot \frac{(1 - p)^{n_{\mathrm{pub}}}}{2} \cdot \frac{e^{d/4 - \varepsilon\lceil n_{\mathrm{priv}} p \rceil}}{1 + e^{d/4 - \varepsilon\lceil n_{\mathrm{priv}} p \rceil}} \ge \frac{p^2}{64} \cdot \frac{(1 - p)^{n_{\mathrm{pub}}}}{4}$$
for any $p \le \frac{d}{4n\varepsilon} - \frac{1}{n}$. Now choose
$$p = \min\bigg(\frac{d}{4n\varepsilon} - \frac{1}{n},\ \frac{1}{2\sqrt{n_{\mathrm{pub}}}}\bigg).$$
By assumption, there exists an absolute constant $k$ such that $n_{\mathrm{pub}} \le k\frac{n\varepsilon}{d}$. Thus, if $n\varepsilon \ge 2$, then
$$(1 - p)^{n_{\mathrm{pub}}} \ge \Big(1 - \frac{d}{\varepsilon n}\Big)^{kn\varepsilon/d} = \bigg[\Big(1 - \frac{d}{\varepsilon n}\Big)^{n\varepsilon/d}\bigg]^k \ge \bigg[\Big(1 - \frac{1}{2}\Big)^2\bigg]^k = \frac{1}{4^k}.$$
On the other hand, if $n\varepsilon < 2$, then $n_{\mathrm{pub}} \le 2k \implies (1 - p)^{n_{\mathrm{pub}}} \ge \big(1 - \frac{1}{2}\big)^{2k} = \frac{1}{4^k}$. Also, note that $p \ge \min\big(\frac{d}{8\varepsilon n}, \frac{1}{2\sqrt{n_{\mathrm{pub}}}}\big)$ by the assumption that $d \ge 8\varepsilon$. Therefore,
$$\sup_{P \in \mathcal{P}} \mathbb{E}_{X \sim P^n,\, \hat{\theta}}\big\|\hat{\theta}(X) - \mathbb{E}_{x \sim P}[x]\big\|^2 \ge \frac{1}{256 \cdot 4^k}\min\bigg(\frac{d}{8\varepsilon n},\ \frac{1}{2\sqrt{n_{\mathrm{pub}}}}\bigg)^2.$$
Combining the above inequality with the non-private mean estimation lower bound of $\Omega(1/n)$ proves the lower bound in (A.2).
Now consider the alternative case in which $d \lesssim 1$ (i.e. $d \le k$ for some absolute constant $k \in \mathbb{N}$), but $n_{\mathrm{pub}} \in [n]$ is arbitrary. Then we will prove that the lower bound in (A.3) holds for $d = 1$, for any $\delta \in [0, \varepsilon]$ and $\varepsilon \le 1$. By taking a $k$-fold product distribution, this will suffice to complete the proof of the lower bounds in (A.2) and (A.3). To that end, we will use Le Cam's method and the techniques in [26, 92]. The key novel ingredient is the following extension of [92, Lemma 3] to the $(\varepsilon, \delta)$-semi-DP setting:
Lemma 81. Let $\hat{\theta} : \mathcal{X}^n \to \mathcal{X}$ be $(\varepsilon, \delta)$-semi-DP and let $P_1, P_2$ be distributions on $\mathcal{X}$ such that $P_1$ is absolutely continuous w.r.t. $P_2$. Denote the conditional distribution of $\hat{\theta}$ given $X$ by $Q$ and let $Q_j(A) = \int Q(\hat{\theta} \in A \,|\, x_{1:n})\, dP^n_j(x_{1:n})$ for any measurable set $A$. Then
$$\|Q_1 - Q_2\|_{\mathrm{TV}} \le \min\bigg\{\sqrt{\frac{n}{2}D_{\mathrm{KL}}(P_1, P_2)},\ 2\|P_1 - P_2\|_{\mathrm{TV}}\, n_{\mathrm{priv}}\big(e^\varepsilon - 1 + \delta\big) + \sqrt{\frac{n_{\mathrm{pub}}}{2}D_{\mathrm{KL}}(P_1, P_2)}\bigg\}.$$
Let us defer the proof of Lemma 81 for now. We will now use Lemma 81 to prove the lower bound in (A.3) for $d = 1$ and $\varepsilon \le 1$. Define distributions $P_1, P_2$ on $\{-1, 1\}$ as follows:
$$P_1(-1) = P_2(1) = \frac{1 + \gamma}{2}, \qquad P_1(1) = P_2(-1) = \frac{1 - \gamma}{2}$$
for some $\gamma \in [0, 1/2]$ to be chosen later. Clearly $P_1, P_2 \in \mathcal{P}$ (i.e. they are bounded by 1 with probability 1). Also, $\mathbb{E}_{P_1}[x] = -\gamma$ and $\mathbb{E}_{P_2}[x] = \gamma$, so $\{P_1, P_2\}$ is a $\gamma$-packing of $\{-1, 1\}$. Thus, by Le Cam's method (see [26] for details), for any $(\varepsilon, \delta)$-semi-DP $\hat{\theta}$, we have
$$\sup_{P \in \mathcal{P}} \mathbb{E}_{X \sim P^n,\, \hat{\theta}}\big\|\hat{\theta}(X) - \mathbb{E}_{x \sim P}[x]\big\|^2 \ge \frac{\gamma^2}{8}\big(1 - \|Q_1 - Q_2\|_{\mathrm{TV}}\big).$$
Now, applying Lemma 81 and the assumption $\delta \le \varepsilon \le 1$ yields
$$\begin{aligned}
\sup_{P \in \mathcal{P}} \mathbb{E}_{X \sim P^n,\, \hat{\theta}}\big\|\hat{\theta}(X) - \mathbb{E}_{x \sim P}[x]\big\|^2 &\ge \frac{\gamma^2}{8}\bigg[1 - \min\bigg\{\sqrt{\frac{n}{2}D_{\mathrm{KL}}(P_1, P_2)},\ 6\|P_1 - P_2\|_{\mathrm{TV}}\, n_{\mathrm{priv}}\varepsilon + \sqrt{\frac{n_{\mathrm{pub}}}{2}D_{\mathrm{KL}}(P_1, P_2)}\bigg\}\bigg] \quad \text{(A.7)} \\
&\ge \frac{\gamma^2}{8}\bigg[1 - \min\bigg\{\sqrt{\frac{n}{2}\cdot 3\gamma^2},\ 6\gamma n_{\mathrm{priv}}\varepsilon + \sqrt{\frac{n_{\mathrm{pub}}}{2}\cdot 3\gamma^2}\bigg\}\bigg] \\
&= \frac{\gamma^2}{8}\bigg[1 - \gamma\min\bigg\{\sqrt{\frac{3n}{2}},\ 6n_{\mathrm{priv}}\varepsilon + \sqrt{\frac{3n_{\mathrm{pub}}}{2}}\bigg\}\bigg]. \quad \text{(A.8)}
\end{aligned}$$
In the second inequality we used the facts that $\|P_1 - P_2\|_{\mathrm{TV}} = \frac{1}{2}\Big(\Big|\frac{1 + \gamma}{2} - \frac{1 - \gamma}{2}\Big| \cdot 2\Big) = \gamma$ and $D_{\mathrm{KL}}(P_1, P_2) \le 3\gamma^2$ for $\gamma \le 1/2$.
min
#
c
3n
2
,6n
priv
ε`
c
3n
pub
2
+
“
c
3n
2
.
Then choosingγ “
1
3
b
2
3n
yields
sup
PPP
E
X„P
n
,
p
θ ›
›
›
p
θ pXq´E
x„P
rxs
›
›
›
2
ě
k
n
for some absolute constantk ą 0. Our assumption thatmin
"
b
3n
2
,6n
priv
ε`
b
3n
pub
2
*
“
b
3n
2
implies
that there existsk
1
ą 0 such thatnď k
1
max
´
n
2
priv
ε
2
,n
pub
¯
. Thus,k{ně
k
k
1
min
ˆ
1
n
2
priv
ε
2
,
1
n
pub
˙
, which
gives the desired lower bound in A.3.
Suppose instead that $\min\big\{\sqrt{\frac{3n}{2}},\ 6n_{\mathrm{priv}}\varepsilon + \sqrt{\frac{3n_{\mathrm{pub}}}{2}}\big\} = 6n_{\mathrm{priv}}\varepsilon + \sqrt{\frac{3n_{\mathrm{pub}}}{2}}$. Then choose $\gamma = \frac{2}{3}\Big(6n_{\mathrm{priv}}\varepsilon + \sqrt{\frac{3n_{\mathrm{pub}}}{2}}\Big)^{-1}$. Then, there are constants $k, c > 0$ such that
$$\sup_{P \in \mathcal{P}} \mathbb{E}_{X \sim P^n,\, \hat{\theta}}\big\|\hat{\theta}(X) - \mathbb{E}_{x \sim P}[x]\big\|^2 \ge \frac{\gamma^2}{8}\bigg[1 - \gamma\min\bigg\{\sqrt{\frac{3n}{2}},\ 6n_{\mathrm{priv}}\varepsilon + \sqrt{\frac{3n_{\mathrm{pub}}}{2}}\bigg\}\bigg] \ge \frac{k}{n^2_{\mathrm{priv}}\varepsilon^2 + n_{\mathrm{pub}}} \ge c\min\bigg(\frac{1}{n^2_{\mathrm{priv}}\varepsilon^2},\ \frac{1}{n_{\mathrm{pub}}}\bigg).$$
Combining the above inequality with the non-private lower bound $\Omega(1/n)$ completes the proof of the lower bound in (A.3).
It remains to prove Lemma 81. To that end, fix $k \in \{0, n_{\mathrm{priv}}\}$ and denote by $\tilde{Q}$ the marginal distribution of $\hat{\theta}$ given $X_1, \ldots, X_k \sim P_1$ (i.i.d.) and $X_{k+1}, \ldots, X_n \sim P_2$ (i.i.d.); i.e. for measurable $A$,
$$\tilde{Q}(A) := \int Q\big(\hat{\theta} \in A \,\big|\, X_{1:n} = x_{1:n}\big)\, dP^k_1(x_{1:k})\, dP^{n-k}_2(x_{k+1:n}).$$
Note that if $k = 0$, then $\tilde{Q} = Q_2$. We have
$$\|Q_1 - Q_2\|_{\mathrm{TV}} \le \underbrace{\|Q_1 - \tilde{Q}\|_{\mathrm{TV}}}_{\text{(a)}} + \underbrace{\|\tilde{Q} - Q_2\|_{\mathrm{TV}}}_{\text{(b)}}.$$
Also,
$$\min_{k \in \{0,\, n_{\mathrm{priv}}\}}\bigg\{\sqrt{\frac{n - k}{2}D_{\mathrm{KL}}(P_1, P_2)} + 2\|P_1 - P_2\|_{\mathrm{TV}}\, k\big(e^\varepsilon - 1 + \delta\big)\bigg\} \le \min\bigg\{\sqrt{\frac{n}{2}D_{\mathrm{KL}}(P_1, P_2)},\ 2\|P_1 - P_2\|_{\mathrm{TV}}\, n_{\mathrm{priv}}\big(e^\varepsilon - 1 + \delta\big) + \sqrt{\frac{n_{\mathrm{pub}}}{2}D_{\mathrm{KL}}(P_1, P_2)}\bigg\}, \tag{A.9}$$
so it suffices to upper bound the sum of the terms (a) + (b) by the left-hand side of (A.9). First, we deal with (a): for any $k$, we have
$$\|Q_1 - \tilde{Q}\|^2_{\mathrm{TV}} \le \|P^n_1 - P^k_1 P^{n-k}_2\|^2_{\mathrm{TV}} \tag{A.10}$$
$$\le \frac{1}{2}D_{\mathrm{KL}}\big(P^n_1,\ P^k_1 P^{n-k}_2\big) \tag{A.11}$$
$$\le \frac{n - k}{2}D_{\mathrm{KL}}(P_1, P_2), \tag{A.12}$$
by the data processing inequality for $f$-divergences, Pinsker's inequality, and the chain rule for KL divergences (see e.g. [82] for a reference on these facts). Thus, it remains to show
$$\|\tilde{Q} - Q_2\|_{\mathrm{TV}} \le 2\|P_1 - P_2\|_{\mathrm{TV}}\, k\big(e^\varepsilon - 1 + \delta\big) \tag{A.13}$$
for $k \in \{0, n_{\mathrm{priv}}\}$. If $k = 0$, (A.13) is trivial. Assume $k = n_{\mathrm{priv}}$. Now, for any measurable $A$, we may write
$$\tilde{Q}(A) - Q_2(A) = \int_{\mathbb{R}^{n_{\mathrm{pub}}}} \Delta(x_{n_{\mathrm{priv}}+1:n})\, dP^{n_{\mathrm{pub}}}_2(x_{n_{\mathrm{priv}}+1:n}), \tag{A.14}$$
where
$$\Delta(x_{n_{\mathrm{priv}}+1:n}) := \int_{\mathbb{R}^{n_{\mathrm{priv}}}} Q\big(\hat{\theta} \in A \,\big|\, X_{1:n} = x_{1:n}\big)\Big(dP^{n_{\mathrm{priv}}}_1(x_{1:n_{\mathrm{priv}}}) - dP^{n_{\mathrm{priv}}}_2(x_{1:n_{\mathrm{priv}}})\Big).$$
By (A.14), it suffices to show that $|\Delta(x_{n_{\mathrm{priv}}+1:n})| \le 2\|P_1 - P_2\|_{\mathrm{TV}}\, n_{\mathrm{priv}}(e^\varepsilon - 1 + \delta)$ for all $x_{n_{\mathrm{priv}}+1:n} \in \mathcal{X}^{n_{\mathrm{pub}}}$. To do so, let $x^i_{1:n} := (x_1, \ldots, x_{i-1}, x'_i, x_{i+1}, \ldots, x_n)$ for some $i \le n_{\mathrm{priv}}$. Then
$$\big|Q\big(\hat{\theta} \in A \,\big|\, X_{1:n} = x_{1:n}\big) - Q\big(\hat{\theta} \in A \,\big|\, X_{1:n} = x^i_{1:n}\big)\big| \le (e^\varepsilon - 1)\, Q\big(\hat{\theta} \in A \,\big|\, X_{1:n} = x^i_{1:n}\big) + \delta \tag{A.15}$$
since $\hat{\theta}$ is $(\varepsilon, \delta)$-semi-DP. Moreover, by the proof of [92, Lemma 3], we have
$$\Delta(x_{n_{\mathrm{priv}}+1:n}) = \sum_{i=1}^{n_{\mathrm{priv}}} \int_{\mathbb{R}^{n_{\mathrm{priv}}}} \Big[Q\big(\hat{\theta} \in A \,\big|\, X_{1:n} = x_{1:n}\big) - Q\big(\hat{\theta} \in A \,\big|\, X_{1:n} = x^i_{1:n}\big)\Big]\, dP^{i-1}_2(x_{1:i-1})\,\big(dP_1(x_i) - dP_2(x_i)\big)\, dP^{n_{\mathrm{priv}}-i}_1(x_{i+1:n_{\mathrm{priv}}}).$$
q|
ď
n
priv
ÿ
i“1
ż
R
n
priv
”
pe
ε
´1qQp
p
θ PA|X
1:n
“x
i
1:n
q`δ ı
dP
i´1
2
px
1:i´1
q|dP
1
px
i
q´dP
2
px
i
q|dP
n
priv
´i
1
px
i`1:n
priv
q
ď2}P
1
´P
2
}
TV
n
priv
pe
ε
´1`δ q,
as desired. This completes the proof of Lemma 81 and hence the proof of the lower bounds in Theorem 78.
169
Upper bounds: First, the throw-away estimator $\mathcal{A}(X) = \frac{1}{n_{\mathrm{pub}}}\sum_{x \in X_{\mathrm{pub}}} x$ is clearly $(0, 0)$-semi-DP and has MSE
$$\mathbb{E}_{X \sim P^n}\|\mathcal{A}(X) - \mathbb{E}_{x \sim P}[x]\|^2 = \mathbb{E}_{X \sim P^n}\bigg\|\frac{1}{n_{\mathrm{pub}}}\sum_{x \in X_{\mathrm{pub}}}\big(x - \mathbb{E}_{x \sim P}[x]\big)\bigg\|^2 = \frac{1}{n^2_{\mathrm{pub}}}\sum_{x \in X_{\mathrm{pub}}} \mathbb{E}_{x \sim P}\|x - \mathbb{E}_{x \sim P}[x]\|^2 \le \frac{1}{n_{\mathrm{pub}}}.$$
To get the second term in the minimum in (A.2), consider the Laplace mechanism $\mathcal{A}(X) = \bar{X} + (L_1, \ldots, L_d)$, where $L_i \sim \mathrm{Lap}(2\sqrt{d}/n\varepsilon)$ are i.i.d. mean-zero Laplace random variables. We know $\mathcal{A}$ is $\varepsilon$-DP by [88], since the $\ell_1$-sensitivity is $\sup_{X \sim X'}\|\bar{X} - \bar{X}'\|_1 = \frac{1}{n}\sup_{x, x'}\|x - x'\|_1 \le \frac{2\sqrt{d}}{n}$. Hence $\mathcal{A}$ is $\varepsilon$-semi-DP. Moreover, $\mathcal{A}$ has MSE
$$\mathbb{E}_{X \sim P^n}\|\mathcal{A}(X) - \mathbb{E}_{x \sim P}[x]\|^2 \le 2\mathbb{E}\|\mathcal{A}(X) - \bar{X}\|^2 + 2\mathbb{E}\big\|\bar{X} - \mathbb{E}_{x \sim P}[x]\big\|^2 \le 2d\,\mathrm{Var}\big(\mathrm{Lap}(2\sqrt{d}/n\varepsilon)\big) + \frac{2}{n} = \frac{16d^2}{n^2\varepsilon^2} + \frac{2}{n}.$$
This completes the proof of (A.2).
To get the second term in the minimum in (A.3), consider the Gaussian mechanism $\mathcal{A}(X) = \bar{X} + N(0, \sigma^2 I_d)$, where $\sigma^2 = \frac{8\ln(2/\delta)}{\varepsilon^2 n^2}$. We know $\mathcal{A}$ is $(\varepsilon, \delta)$-DP (by e.g. [88]) since the $\ell_2$-sensitivity is $\sup_{X \sim X'}\|\bar{X} - \bar{X}'\|_2 \le \frac{2}{n}$. Hence $\mathcal{A}$ is $(\varepsilon, \delta)$-semi-DP. Moreover, the MSE of $\mathcal{A}$ is
$$\mathbb{E}_{X \sim P^n}\|\mathcal{A}(X) - \mathbb{E}_{x \sim P}[x]\|^2 \le 2\mathbb{E}\|\mathcal{A}(X) - \bar{X}\|^2 + 2\mathbb{E}\big\|\bar{X} - \mathbb{E}_{x \sim P}[x]\big\|^2 \le 2d\,\frac{8\ln(2/\delta)}{\varepsilon^2 n^2} + \frac{2}{n} = \frac{16d\ln(2/\delta)}{n^2\varepsilon^2} + \frac{2}{n}.$$
This completes the proof of (A.3).
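As with the Laplace mechanism, the Gaussian mechanism above is a one-liner once the noise variance is set; a minimal sketch (our illustration, with a hypothetical function name):

```python
import numpy as np

def gaussian_mean(X, eps, delta, rng):
    """(eps, delta)-DP mean for data with ||x|| <= 1.

    The l2-sensitivity of the empirical mean is at most 2/n, and the
    noise variance sigma^2 = 8*ln(2/delta)/(eps*n)^2 matches the
    calibration used in the proof above.
    """
    n, d = X.shape
    sigma2 = 8.0 * np.log(2.0 / delta) / (eps ** 2 * n ** 2)
    return X.mean(axis=0) + rng.normal(0.0, np.sqrt(sigma2), size=d)

rng = np.random.default_rng(2)
X = rng.uniform(-0.2, 0.2, size=(500, 3))  # rows have norm well below 1
private_mean = gaussian_mean(X, eps=1.0, delta=1e-5, rng=rng)
```

The $\sqrt{d}$ (rather than $d$) dependence of the noise cost relative to the Laplace mechanism is what produces the $d\ln(1/\delta)$ versus $d^2$ gap between the upper bounds in (A.3) and (A.2).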
A.2.2 An "Even More Optimal" Semi-DP Algorithm for Mean Estimation
Lemma 82 (Re-statement of Lemma 17). Recall the definition of $\mathcal{P}(B, V)$ (Definition 16): $\mathcal{P}(B, V)$ denotes the collection of all distributions $P$ on $\mathbb{R}^d$ such that for any $x \sim P$, we have $\|x\| \le B$ $P$-almost surely and $\mathrm{Var}(x) = V^2$. Then, the error of the $\rho$-semi-zCDP throw-away algorithm $\mathcal{A}(X) = \frac{1}{n_{\mathrm{pub}}}\sum_{x \in X_{\mathrm{pub}}} x$ is
$$\sup_{P \in \mathcal{P}(B, V)} \mathbb{E}_{X \sim P^n}\big[\|\mathcal{A}(X) - \mathbb{E}_{x \sim P}[x]\|^2\big] = \frac{V^2}{n_{\mathrm{pub}}}.$$
The minimax error of the $\rho$-semi-zCDP Gaussian mechanism $\mathcal{G}(X) = \bar{X} + N\big(0, \sigma^2 I_d\big)$ is
$$\inf_{\rho\text{-zCDP}\ \mathcal{G}}\ \sup_{P \in \mathcal{P}(B, V)} \mathbb{E}_{\mathcal{G},\, X \sim P^n}\big[\|\mathcal{G}(X) - \mathbb{E}_{x \sim P}[x]\|^2\big] = \frac{2dB^2}{\rho n^2} + \frac{V^2}{n}. \tag{A.16}$$
Proof. For throw-away, the i.i.d. data assumption implies
E}ApXq´Ex}
2
“
1
n
2
pub
ÿ
xPX
pub
E}x´Ex}
2
“
V
2
n
pub
.
171
The Gaussian mechanism $G^*(X) := \bar{X} + N\big(0, \frac{2B^2}{\rho n^2} I_d\big)$ is $\rho$-zCDP by [47, Proposition 1.6] since the $\ell_2$-sensitivity is bounded by $\Delta_2 = \sup_{X\sim X'}\|\bar{X}-\bar{X}'\|_2 \le \frac{2B}{n}$. Moreover, this sensitivity bound is tight: consider any $P$ such that $x = (B, 0_{d-1})$ and $x' = (-B, 0_{d-1})$ are in the support of $P$. Then fix any $y$ in the support of $P$ and consider the adjacent data sets $X = (x,y,\dots,y)$ and $X' = (x',y,\dots,y)$. We have $\|\bar{X}-\bar{X}'\|_2 = \frac{1}{n}\|x-x'\|_2 = \frac{2B}{n}$. Additionally, if the variance of the additive isotropic Gaussian noise $\sigma^2$ is smaller than $\frac{\Delta_2^2}{2\rho} = \frac{2B^2}{\rho n^2}$, then the Gaussian mechanism is not $\rho$-zCDP [47]. Thus, $G^*(X)$ is the $\rho$-zCDP Gaussian mechanism with the smallest noise variance $\sigma^2$. Hence the infimum in A.16 is attained by $G^*$. Finally, for any $P\in\mathcal{P}(B,V)$, the MSE of $G^*$ is
\[
\mathbb{E}_{G,\,X\sim P^n}\big[\|G^*(X)-\mathbb{E}_{x\sim P}[x]\|^2\big] = \mathbb{E}\|G^*(X)-\bar{X}\|^2 + \mathbb{E}\Big\|\frac{1}{n}\sum_{i=1}^{n} x_i - \mathbb{E}x_i\Big\|^2 = \frac{2dB^2}{\rho n^2} + \frac{V^2}{n}.
\]
Proposition 83 (Re-statement of Proposition 18). Recall the definition of $A_r$ from 2.4. There exists $r > 0$ such that $A_r$ is $\rho$-semi-zCDP, and
\[
\sup_{P\in\mathcal{P}(B,V)} \mathbb{E}_{X\sim P^n}\big[\|A_r(X)-\mathbb{E}_{x\sim P}[x]\|^2\big] < \min\Big(\frac{V^2}{n_{\mathrm{pub}}},\ \frac{2dB^2}{\rho n^2}+\frac{V^2}{n}\Big). \quad\text{(A.17)}
\]
Further, if $\frac{V^2}{n_{\mathrm{pub}}} \le \frac{2dB^2}{\rho n^2}$, then the quantitative advantage of $A_r$ is
\[
\sup_{P\in\mathcal{P}(B,V)} \mathbb{E}_{X\sim P^n}\big[\|A_r(X)-\mathbb{E}_{x\sim P}[x]\|^2\big] \le \Big(\frac{q^2+qs^2}{q^2+2qs^2+s^4}\Big)\min\Big(\frac{V^2}{n_{\mathrm{pub}}},\ \frac{2dB^2}{\rho n^2}+\frac{V^2}{n}\Big), \quad\text{(A.18)}
\]
where $q = 2 + \frac{n_{\mathrm{priv}}\rho V^2}{dB^2}$ and $s = \frac{V n_{\mathrm{priv}}\sqrt{\rho}}{B\sqrt{d\, n_{\mathrm{pub}}}}$.
Proof. Privacy: Choose $\sigma_r^2 = \frac{2B^2 r^2}{\rho}$. Note that the $\ell_2$-sensitivity of $M(X) = \sum_{x\in X_{\mathrm{priv}}} rx + \sum_{x\in X_{\mathrm{pub}}} \big(\frac{1-n_{\mathrm{priv}} r}{n_{\mathrm{pub}}}\big)x$ is
\begin{align*}
\Delta_2 &= \sup_{X_{\mathrm{priv}}\sim X'_{\mathrm{priv}}} \Big\|\sum_{x\in X_{\mathrm{priv}}} rx + \sum_{x\in X_{\mathrm{pub}}}\Big(\frac{1-n_{\mathrm{priv}} r}{n_{\mathrm{pub}}}\Big)x - \sum_{x\in X'_{\mathrm{priv}}} rx - \sum_{x\in X_{\mathrm{pub}}}\Big(\frac{1-n_{\mathrm{priv}} r}{n_{\mathrm{pub}}}\Big)x\Big\| \\
&= \sup_{x,x'}\|rx - rx'\| \le 2rB.
\end{align*}
Recall that the Gaussian mechanism guarantees $\rho$-zCDP whenever $\sigma^2 \ge \frac{\Delta_2^2}{2\rho}$ [47, Proposition 1.6]. Thus, $A_r$ is $\rho$-semi-zCDP for $\sigma_r^2 \ge \frac{(2rB)^2}{2\rho} = \frac{2B^2 r^2}{\rho}$.
Error bounds: Let $P\in\mathcal{P}(B,V)$. We have
\begin{align*}
\mathbb{E}_{X\sim P^n}\big[\|A_r(X)-\mathbb{E}_{x\sim P}[x]\|^2\big] &= d\sigma_r^2 + \mathbb{E}\Big\|\sum_{x\in X_{\mathrm{priv}}} r(x-\mathbb{E}x) + \sum_{x\in X_{\mathrm{pub}}}\Big(\frac{1-n_{\mathrm{priv}} r}{n_{\mathrm{pub}}}\Big)(x-\mathbb{E}x)\Big\|^2 \\
&= \frac{2dB^2 r^2}{\rho} + \sum_{x\in X_{\mathrm{priv}}} r^2 V^2 + \sum_{x\in X_{\mathrm{pub}}}\Big(\frac{1-n_{\mathrm{priv}} r}{n_{\mathrm{pub}}}\Big)^2 V^2 \\
&= \frac{2dB^2 r^2}{\rho} + n_{\mathrm{priv}} r^2 V^2 + \frac{(1-n_{\mathrm{priv}} r)^2}{n_{\mathrm{pub}}} V^2, \quad\text{(A.19)}
\end{align*}
using independence of the Gaussian noise and the data, basic properties of variance, and the fact that the data is i.i.d.
To prove 2.5, let
\[
J(r) := \frac{2dB^2 r^2}{\rho} + n_{\mathrm{priv}} r^2 V^2 + \frac{(1-n_{\mathrm{priv}} r)^2}{n_{\mathrm{pub}}} V^2.
\]
We compute first and second derivatives of $J$:
\[
\frac{d}{dr}J(r) = 2r\Big(\frac{2dB^2}{\rho} + n_{\mathrm{priv}} V^2\Big) - \frac{2 n_{\mathrm{priv}} V^2}{n_{\mathrm{pub}}}(1-n_{\mathrm{priv}} r)
\]
and
\[
\frac{d^2}{dr^2}J(r) = 2\Big(\frac{2dB^2}{\rho} + n_{\mathrm{priv}} V^2\Big) + \frac{2 n_{\mathrm{priv}}^2 V^2}{n_{\mathrm{pub}}}.
\]
Since $J$ is strongly convex, it has a unique minimizer $r^*$ which satisfies $\frac{d}{dr}J(r^*) = 0$. We find
\[
r^* = \frac{n_{\mathrm{priv}} V^2}{n_{\mathrm{pub}}}\Big(\frac{2dB^2}{\rho} + n_{\mathrm{priv}} V^2 + \frac{n_{\mathrm{priv}}^2 V^2}{n_{\mathrm{pub}}}\Big)^{-1}.
\]
One can verify that $r^* \ne 0$ and $r^* \ne \frac{1}{n}$, since $1 \le n_{\mathrm{priv}} < n$ and $\rho < \infty$ by assumption. Thus, $J(r^*) < \min(J(0), J(1/n))$, which yields 2.5 by Lemma 17.
To prove A.18, we will choose a different $r$: $r := \frac{K\sqrt{\rho}V}{B\sqrt{d\, n_{\mathrm{pub}}}}$ for $K > 0$ to be determined. Then by A.19, we have
\begin{align*}
\mathbb{E}_{X\sim P^n}\big[\|A_r(X)-\mathbb{E}_{x\sim P}[x]\|^2\big] &= \frac{2dB^2 r^2}{\rho} + n_{\mathrm{priv}} r^2 V^2 + \frac{(1-n_{\mathrm{priv}} r)^2}{n_{\mathrm{pub}}} V^2 \\
&= \frac{K^2 V^2}{n_{\mathrm{pub}}}\Big(2 + \frac{n_{\mathrm{priv}}\rho V^2}{dB^2}\Big) + \frac{V^2}{n_{\mathrm{pub}}}\Big(1 - \frac{K V\sqrt{\rho}\, n_{\mathrm{priv}}}{B\sqrt{d\, n_{\mathrm{pub}}}}\Big)^2 \\
&= \frac{V^2}{n_{\mathrm{pub}}}\Bigg(qK^2 + \Big(1 - \frac{K V\sqrt{\rho}\, n_{\mathrm{priv}}}{B\sqrt{d\, n_{\mathrm{pub}}}}\Big)^2\Bigg),
\end{align*}
where $q = 2 + \frac{n_{\mathrm{priv}}\rho V^2}{dB^2}$. Now, letting $s = \frac{V n_{\mathrm{priv}}\sqrt{\rho}}{B\sqrt{d\, n_{\mathrm{pub}}}}$ and choosing $K = \frac{s}{q+s^2}$ gives
\[
\mathbb{E}_{X\sim P^n}\big[\|A_r(X)-\mathbb{E}_{x\sim P}[x]\|^2\big] \le \frac{V^2}{n_{\mathrm{pub}}}\Bigg(qK^2 + \Big(1 - \frac{K V\sqrt{\rho}\, n_{\mathrm{priv}}}{B\sqrt{d\, n_{\mathrm{pub}}}}\Big)^2\Bigg) \le \frac{V^2}{n_{\mathrm{pub}}}\Big(\frac{q^2+qs^2}{q^2+2qs^2+s^4}\Big).
\]
Finally, the assumption $\frac{V^2}{n_{\mathrm{pub}}} \le \frac{2dB^2}{\rho n^2}$ implies $\frac{V^2}{n_{\mathrm{pub}}} = \min\big(\frac{V^2}{n_{\mathrm{pub}}},\ \frac{2dB^2}{\rho n^2}+\frac{V^2}{n}\big)$, completing the proof.
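The closed form for $r^*$ and the strict improvement $J(r^*) < \min(J(0), J(1/n))$ are easy to check numerically. The sketch below uses toy parameter values; the function names and constants are illustrative assumptions, not from the thesis.

```python
import numpy as np

def J(r, d, B, rho, V, n_priv, n_pub):
    # MSE A.19 of the weighted estimator A_r (with sigma_r^2 = 2*B^2*r^2/rho).
    return (2 * d * B**2 * r**2 / rho + n_priv * r**2 * V**2
            + (1 - n_priv * r)**2 * V**2 / n_pub)

def r_star(d, B, rho, V, n_priv, n_pub):
    # Unique stationary point of the strongly convex J.
    num = n_priv * V**2 / n_pub
    den = 2 * d * B**2 / rho + n_priv * V**2 + n_priv**2 * V**2 / n_pub
    return num / den

d, B, rho, V = 10, 1.0, 0.5, 1.0
n_priv, n_pub = 100, 900
n = n_priv + n_pub
rs = r_star(d, B, rho, V, n_priv, n_pub)
best = J(rs, d, B, rho, V, n_priv, n_pub)
# J(0) is the throw-away error V^2/n_pub; J(1/n) is the Gaussian-mechanism
# error 2*d*B^2/(rho*n^2) + V^2/n from Lemma 82.
```

With these values $J(0) \approx 1.11\times10^{-3}$ and $J(1/n) = 1.04\times10^{-3}$, while $J(r^*) \approx 1.03\times10^{-3}$: the weighted estimator strictly beats both baselines, as Proposition 83 predicts.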
A.2.3 Optimal Semi-DP Empirical Risk Minimization
Theorem 84 (Complete statement of Theorem 21). There exist absolute constants $c_0$ and $C_0$, with $0 < c_0 \le C_0$, such that
\[
c_0 LD\min\Big\{\frac{n_{\mathrm{priv}}}{n},\ \frac{d}{n\varepsilon}\Big\} \le \mathcal{R}_{\mathrm{ERM}}(\varepsilon, n_{\mathrm{priv}}, n, d, L, D, \mu=0) \le C_0 LD\min\Big\{\frac{n_{\mathrm{priv}}}{n},\ \frac{d}{n\varepsilon}\Big\}.
\]
Further, if $\mu > 0$, then there exist absolute constants $0 < c_1 \le C_1$ such that
\[
c_1 LD\min\Big\{\frac{n_{\mathrm{priv}}}{n},\ \frac{d}{n\varepsilon}\Big\}^2 \le \mathcal{R}_{\mathrm{ERM}}(\varepsilon, n_{\mathrm{priv}}, n, d, L, D, \mu) \le C_1 \frac{L^2}{\mu}\min\Big\{\frac{n_{\mathrm{priv}}}{n},\ \frac{d\sqrt{\ln(n)}}{n\varepsilon}\Big\}^2.
\]
Proof. Lower Bounds: Given a lower bound for empirical mean estimation, Bassily et al. [32] show how to prove excess risk lower bounds for convex and strongly convex ERM by reducing these problems to mean estimation. Thus, our lower bounds follow immediately by combining the lower bound in Theorem 77 with the reduction in [32]. Roughly, the reduction works as follows:
In the strongly convex case, we simply take $f(w,x) = \frac{1}{2}\|w-x\|^2$ on $\mathcal{W} = \mathcal{X} = \mathbb{B}$; $f(\cdot,x)$ is $1$-uniformly-Lipschitz and $1$-strongly convex. Moreover, for any $\varepsilon$-semi-DP $A$ with output $w_{\mathrm{priv}} = A(X)$, we have
\[
\mathbb{E}\,\hat{F}_X(w_{\mathrm{priv}}) - \hat{F}^*_X = \frac{1}{2}\mathbb{E}\|A(X)-\bar{X}\|^2.
\]
Applying Theorem 77 and then scaling $f \to \frac{L}{D}f$ and $\mathcal{W}\to D\mathcal{W}$ and $\mathcal{X}\to D\mathcal{X}$ completes the proof.
For the convex case, we take $f(w,x) = -\langle w,x\rangle$ on $\mathcal{W} = \mathcal{X} = \mathbb{B}$. Then $w^* := \operatorname{argmin}_{w\in\mathcal{W}} \hat{F}_X(w) = \frac{\bar{X}}{\|\bar{X}\|}$ and
\[
\hat{F}_X(w_{\mathrm{priv}}) - \hat{F}^*_X \ge \frac{1}{2}\|\bar{X}\|\,\|w_{\mathrm{priv}} - w^*\|^2.
\]
Also, the proof of Theorem 77 shows that there exists a dataset $X\in\mathcal{X}^n$ such that $\|\bar{X}\| = M/n := \min\big(\frac{n_{\mathrm{priv}}}{n}, \frac{d}{3n\varepsilon}\big)$ and $\mathbb{E}\|A'(X)-\bar{X}\|^2 \gtrsim \min\big(\frac{n_{\mathrm{priv}}}{n}, \frac{d}{n\varepsilon}\big)^2$ for any $\varepsilon$-semi-DP $A'$. Note that $A' := \frac{M}{n}w_{\mathrm{priv}}$ is $\varepsilon$-semi-DP by post-processing. Thus,
\[
\mathbb{E}\,\hat{F}_X(w_{\mathrm{priv}}) - \hat{F}^*_X \ge \frac{M}{2n}\mathbb{E}\big[\|w_{\mathrm{priv}}-w^*\|^2\big] = \frac{M}{2n}\Big(\frac{n}{M}\Big)^2\mathbb{E}\big[\|A'(X)-\bar{X}\|^2\big] \gtrsim \frac{n}{M}\min\Big(\frac{n_{\mathrm{priv}}}{n}, \frac{d}{n\varepsilon}\Big)^2 \gtrsim \min\Big(\frac{n_{\mathrm{priv}}}{n}, \frac{d}{n\varepsilon}\Big).
\]
A standard scaling argument (see [32] for details) completes the proof.
Upper Bounds: The second terms in each minimum follow by running the $\varepsilon$-DP algorithms in [32]: these achieve the desired excess empirical risk bounds and are automatically $\varepsilon$-semi-DP.
We now prove the first term in each respective minimum. Denote $\hat{F}_{\mathrm{pub}}(w) = \frac{1}{n}\sum_{x\in X_{\mathrm{pub}}} f(w,x)$ and $\hat{F}_{\mathrm{priv}}(w) = \frac{1}{n}\sum_{x\in X_{\mathrm{priv}}} f(w,x)$, so that $\hat{F}_X = \hat{F}_{\mathrm{pub}} + \hat{F}_{\mathrm{priv}}$. The algorithm we will use simply returns any minimizer of the public empirical loss: $A(X) = w^*_{\mathrm{pub}} \in \operatorname{argmin}_{w\in\mathcal{W}} \hat{F}_{\mathrm{pub}}(w)$. (It will be easy to see from the proof that any approximate minimizer would also suffice.) $A$ is clearly $\varepsilon$-semi-DP. Next, we bound the excess risk of $A$. Let $w^* \in \operatorname{argmin}_{w\in\mathcal{W}} \hat{F}_X(w)$.
Convex Upper Bound: We have
\begin{align*}
\hat{F}_X(w^*_{\mathrm{pub}}) - \hat{F}_X(w^*) &= \hat{F}_X(w^*_{\mathrm{pub}}) - \hat{F}_{\mathrm{pub}}(w^*_{\mathrm{pub}}) + \hat{F}_{\mathrm{pub}}(w^*_{\mathrm{pub}}) - \hat{F}_{\mathrm{pub}}(w^*) + \hat{F}_{\mathrm{pub}}(w^*) - \hat{F}_X(w^*) \\
&\le \frac{1}{n}\sum_{x\in X_{\mathrm{priv}}} f(w^*_{\mathrm{pub}},x) + 0 - \frac{1}{n}\sum_{x\in X_{\mathrm{priv}}} f(w^*,x) \\
&\le \frac{1}{n}\sum_{x\in X_{\mathrm{priv}}} L\|w^*_{\mathrm{pub}}-w^*\| = L\|w^*_{\mathrm{pub}}-w^*\|\frac{n_{\mathrm{priv}}}{n} \le LD\frac{n_{\mathrm{priv}}}{n}.
\end{align*}
Strongly Convex Upper Bound: By the above, $\hat{F}_X(w^*_{\mathrm{pub}}) - \hat{F}_X(w^*) \le L\|w^*_{\mathrm{pub}}-w^*\|\frac{n_{\mathrm{priv}}}{n}$. Now we will use strong convexity to bound $\|w^*_{\mathrm{pub}}-w^*\|$. To do so, we use the following lemma, versions of which have appeared, e.g. in [180, 52]:
Lemma 85 ([180]). Let $H(w), h(w)$ be convex functions on some convex closed set $\mathcal{W}\subseteq\mathbb{R}^d$ and suppose that $H(w)$ is $\mu_H$-strongly convex. Assume further that $h$ is $L_h$-Lipschitz. Define $w_1 = \operatorname{argmin}_{w\in\mathcal{W}} H(w)$ and $w_2 = \operatorname{argmin}_{w\in\mathcal{W}}[H(w)+h(w)]$. Then $\|w_1 - w_2\| \le \frac{L_h}{\mu_H}$.
We apply the lemma with $h(w) = \hat{F}_{\mathrm{priv}}(w)$ and $H(w) = \hat{F}_{\mathrm{pub}}(w)$. Then the conditions of the lemma are satisfied with $L_h \le \frac{n_{\mathrm{priv}}}{n}L$ and $\mu_H = \frac{n_{\mathrm{pub}}}{n}\mu$. Thus,
\[
\|w^* - w^*_{\mathrm{pub}}\| \le \frac{L_h}{\mu_H} \le \frac{L}{\mu}\frac{n_{\mathrm{priv}}}{n_{\mathrm{pub}}}.
\]
This leads to
\[
\hat{F}_X(w^*_{\mathrm{pub}}) - \hat{F}_X(w^*) \le \frac{L^2}{\mu}\frac{n_{\mathrm{priv}}^2}{n\, n_{\mathrm{pub}}}.
\]
Combining the two strongly convex upper bounds with the upper bound $LD\frac{n_{\mathrm{priv}}}{n} \le \frac{L^2}{\mu}\frac{n_{\mathrm{priv}}}{n}$ (which holds for any convex function), we have an algorithm $A$ with the following excess risk:
\[
\mathbb{E}\,\hat{F}_X(A(X)) - \hat{F}^*_X \lesssim \frac{L^2}{\mu}\min\Big(\frac{n_{\mathrm{priv}}}{n},\ \frac{n_{\mathrm{priv}}^2}{n\, n_{\mathrm{pub}}},\ \frac{d^2\ln(n)}{n^2\varepsilon^2}\Big). \quad\text{(A.20)}
\]
We will show that A.20 is equal to the strongly convex upper bound stated in Theorem 21 up to constant factors. First, suppose $n_{\mathrm{pub}} \gtrsim n$: i.e. there is a constant $k > 0$ such that $n_{\mathrm{pub}} \ge kn$ for all $n \ge 1$. Then, clearly A.20 and the strongly convex upper bound stated in Theorem 21 are both equal to $\Theta\Big(\frac{L^2}{\mu}\min\Big(\frac{n_{\mathrm{priv}}^2}{n^2},\ \big(\frac{d}{n\varepsilon}\big)^2\ln(n)\Big)\Big)$.
Next, suppose $n_{\mathrm{pub}} \ll n$: i.e. for any $k > 0$, there exists $n \ge 1$ such that $n_{\mathrm{pub}} < kn$. Then we claim that $\min\Big\{\frac{n_{\mathrm{priv}}}{n}, \frac{d\sqrt{\ln(n)}}{n\varepsilon}\Big\}^2 \gtrsim \min\Big\{1, \frac{d\sqrt{\ln(n)}}{n\varepsilon}\Big\}^2$. If we prove this claim, then we are done. There are two subcases to consider: A) $n_{\mathrm{priv}}/n < \frac{d\sqrt{\ln(n)}}{\varepsilon n}$; and B) $n_{\mathrm{priv}}/n \ge \frac{d\sqrt{\ln(n)}}{\varepsilon n}$. In subcase B), the claim is immediate. Consider subcase A): if $n_{\mathrm{priv}} \gtrsim n$, then we're done. If not, then we have $n_{\mathrm{priv}} \ll n$ and $n_{\mathrm{pub}} \ll n$, so $n = n_{\mathrm{priv}} + n_{\mathrm{pub}} \ll n$, a contradiction. This completes the proof.
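Lemma 85, the stability step used in the strongly convex upper bound, is easy to sanity-check numerically on a toy one-dimensional instance. The instance and names below are illustrative assumptions, not from the thesis.

```python
import numpy as np

# H is mu_H-strongly convex; h is L_h-Lipschitz (but not smooth at 0).
mu_H, L_h = 2.0, 0.5
H = lambda w: 0.5 * mu_H * (w - 1.0) ** 2
h = lambda w: L_h * np.abs(w)

# Minimize on a fine grid over the compact set W = [-3, 3].
grid = np.linspace(-3.0, 3.0, 600001)
w1 = grid[np.argmin(H(grid))]            # argmin of H        (= 1.0)
w2 = grid[np.argmin(H(grid) + h(grid))]  # argmin of H + h    (= 0.75)
# Lemma 85: |w1 - w2| <= L_h / mu_H = 0.25; here the bound is attained exactly.
```

The perturbation $h$ shifts the minimizer by exactly $L_h/\mu_H$ in this instance, showing the lemma's bound is tight.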
Remark 86 (Details of Remark 22). The same minimax risk bound 2.8 holds up to a logarithmic factor if we replace $\mathcal{F}_{\mu=0,L,D}$ by the larger class of all Lipschitz non-convex (or convex) loss functions in the definition 2.7: First, the lower bound in Theorem 21 clearly still holds for non-convex loss functions. For the upper bound, the $\varepsilon$-DP (hence semi-DP) exponential mechanism achieves error $O\big(LD\frac{d}{n\varepsilon}\ln(n)\big)$ [32, 105]. Further, the proof of Theorem 21 reveals that convexity is not necessary for the throw-away algorithm to achieve error $O(LD\, n_{\mathrm{priv}}/n)$. However, the optimal algorithms are inefficient for non-convex loss functions: to the best of our knowledge, all existing polynomial-time implementations of the exponential mechanism require convexity for their runtime guarantees to hold. Further, computing $\approx \operatorname{argmin}_{w\in\mathcal{W}} \hat{F}_{\mathrm{pub}}(w)$ in the implementation of throw-away may not be tractable in polynomial time for non-convex $\hat{F}_{\mathrm{pub}}$.
A.2.4 Optimal Semi-DP Stochastic Convex Optimization
Theorem 87 (Complete Version of Theorem 23). Suppose $\varepsilon \le d/8$, and either $n_{\mathrm{pub}} \lesssim \frac{n\varepsilon}{d}$ or $d \lesssim 1$. If $\mu = 0$ (convex case), then there exist absolute constants $0 < c_0 \le C_0$ such that
\[
c_0 LD\min\Big\{\frac{1}{\sqrt{n_{\mathrm{pub}}}},\ \frac{d}{n\varepsilon}+\frac{1}{\sqrt{n}}\Big\} \le \mathcal{R}_{\mathrm{SCO}}(\varepsilon, \delta=0, n_{\mathrm{priv}}, n, d, L, D, \mu=0) \le C_0 LD\min\Big\{\frac{1}{\sqrt{n_{\mathrm{pub}}}},\ \frac{d}{n\varepsilon}+\frac{1}{\sqrt{n}}\Big\}.
\]
If $\mu > 0$, there are constants $0 < c_1 \le C_1$ such that
\[
c_1 LD\min\Big\{\frac{1}{n_{\mathrm{pub}}},\ \frac{d^2}{n^2\varepsilon^2}+\frac{1}{n}\Big\} \le \mathcal{R}_{\mathrm{SCO}}(\varepsilon, \delta=0, n_{\mathrm{priv}}, n, d, L, D, \mu) \le C_1\frac{L^2}{\mu}\min\Big\{\frac{1}{n_{\mathrm{pub}}},\ \frac{d^2\ln(n)}{n^2\varepsilon^2}+\frac{1}{n}\Big\}.
\]
The above upper bounds hold for any $n_{\mathrm{pub}}, d$.
Further, for any $n_{\mathrm{pub}}, d$, and any $0 \le \delta \le \varepsilon \le 1$, the following holds for some absolute constants $0 < c'_0 \le C'_0$ and $0 < c'_1 \le C'_1$:
\[
c'_0 LD\min\Big\{\frac{1}{\sqrt{n_{\mathrm{pub}}}},\ \frac{1}{n\varepsilon}+\frac{1}{\sqrt{n}}\Big\} \le \mathcal{R}_{\mathrm{SCO}}(\varepsilon, \delta, n_{\mathrm{priv}}, n, d, L, D, \mu=0) \le C'_0 LD\min\Big\{\frac{1}{\sqrt{n_{\mathrm{pub}}}},\ \frac{\sqrt{d\ln(1/\delta)}}{n\varepsilon}+\frac{1}{\sqrt{n}}\Big\},
\]
and
\[
c'_1 LD\min\Big\{\frac{1}{n_{\mathrm{pub}}},\ \frac{1}{n^2\varepsilon^2}+\frac{1}{n}\Big\} \le \mathcal{R}_{\mathrm{SCO}}(\varepsilon, \delta, n_{\mathrm{priv}}, n, d, L, D, \mu) \le C'_1\frac{L^2}{\mu}\min\Big\{\frac{1}{n_{\mathrm{pub}}},\ \frac{d\ln(1/\delta)}{n^2\varepsilon^2}+\frac{1}{n}\Big\}.
\]
Proof. Lower bounds: Let $A$ be $(\varepsilon,\delta)$-semi-DP (for $\delta \ge 0$) and denote $w_{\mathrm{priv}} = A(X)$.
Strongly convex lower bounds: We begin with the strongly convex lower bounds, which can be proved straightforwardly by reducing strongly convex SCO to mean estimation and applying Theorem 15. In a bit more detail, let $f:\mathcal{W}\times\mathcal{X}\to\mathbb{R}$ be given by
\[
f(w,x) = \frac{L}{2D}\|w-x\|^2,
\]
where $\mathcal{W} = \mathcal{X} = D\mathbb{B}$. Note that $f$ is $L$-uniformly Lipschitz and $\frac{L}{D}$-strongly convex in $w$ for all $x$. Further, $w^* := \operatorname{argmin}_{w\in\mathcal{W}}\{F(w) = \mathbb{E}_{x\sim P}[f(w,x)]\} = \mathbb{E}_{x\sim P}[x]$. By a direct calculation (see e.g. [137, Lemma 6.2]), we have
\[
\mathbb{E}F(w_{\mathrm{priv}}) - F(w^*) = \frac{L}{2D}\mathbb{E}\|w_{\mathrm{priv}}-w^*\|^2. \quad\text{(A.21)}
\]
We can lower bound $\mathbb{E}\|w_{\mathrm{priv}}-w^*\|^2 = \mathbb{E}\|A(X)-\mathbb{E}_{x\sim P}[x]\|^2$ via Theorem 15 (and its proof, to account for the re-scaling). Specifically, if $\delta = 0$, $\varepsilon \le \max(1, d/8)$, and either $d = O(1)$ or $n_{\mathrm{pub}} = O(n\varepsilon/d)$, then Theorem 15 and its proof imply
\[
\mathbb{E}\|w_{\mathrm{priv}}-w^*\|^2 \ge cD^2\min\Big(\frac{1}{n_{\mathrm{pub}}},\ \frac{d^2}{n^2\varepsilon^2}+\frac{1}{n}\Big).
\]
Combining this with A.21 leads to the desired excess risk lower bound for $\delta = 0$. On the other hand, if $\delta\in(0,\varepsilon]$, then
\[
\mathbb{E}\|w_{\mathrm{priv}}-w^*\|^2 \ge cD^2\min\Big\{\frac{1}{n_{\mathrm{pub}}},\ \frac{1}{n^2\varepsilon^2}+\frac{1}{n}\Big\},
\]
by Theorem 15. Again, applying A.21 yields the desired excess risk lower bound for $\delta > 0$.
Convex lower bounds: We will begin by proving the lower bounds for the case in which $L = D = 1$, and then scale our construction to get the lower bounds for arbitrary $L, D$.
First consider the case in which $\delta = 0$, $\varepsilon \le d/8$, and either $n_{\mathrm{pub}} \lesssim n\varepsilon/d$ or $d \lesssim 1$. Let $\mathcal{X} = \{0\}\cup\big\{\pm\frac{1}{\sqrt{d}}\big\}^d \subset \mathbb{R}^d$ and $\mathcal{W} = \mathbb{B}$. Define
\[
f(w,x) = -\langle w,x\rangle,
\]
which is convex and $1$-uniformly-Lipschitz in $w$ on $\mathcal{X}$. Choose $\mathcal{V}$ to be a finite subset of $\mathbb{R}^d$ such that $|\mathcal{V}| \ge 2^{d/2}$, $\|v\| = 1$ for all $v$, and $\|v-v'\| \ge 1/8$ for all $v \ne v'$ (see e.g. the Gilbert-Varshamov construction). Following the proof of Theorem 15, we define $P_{\theta_v} = (1-p)P_0 + pP_v$ for all $v\in\mathcal{V}$, where $p\in[0,1]$ will be chosen later, $P_0$ is point mass on $\{X=0\}$ and $P_v$ is point mass on $\{X=v\}$. Denote the mean $\theta_v := \mathbb{E}_{x\sim P_{\theta_v}}[x] = pv$. Note that $\|\theta_v\| = p$ for all $v$. Let $F_v(w) = \mathbb{E}_{x\sim P_{\theta_v}}[f(w,x)]$ and $w^*_v \in \operatorname{argmin}_{w\in\mathcal{W}} F_v(w) = \frac{\theta_v}{\|\theta_v\|} = v$. A direct calculation (see e.g. [137, Equation 14]) shows
\[
\mathbb{E}F_v(w) - F^*_v \ge \frac{1}{2}\mathbb{E}\big[\|\theta_v\|\,\|w-w^*_v\|^2\big] \quad\text{(A.22)}
\]
for any $w\in\mathcal{W}, v\in\mathcal{V}$. Also,
\[
\rho^*(\mathcal{V}) = \min\big\{\|w^*_v - w^*_{v'}\| : v,v'\in\mathcal{V}, v\ne v'\big\} = \min\big\{\|v-v'\| : v,v'\in\mathcal{V}, v\ne v'\big\} \ge \frac{1}{8}.
\]
Thus, by combining A.22 with the reduction from estimation to testing and Theorem 79 (see the proof of Theorem 15 for details), we have
\begin{align*}
\sup_{v\in\mathcal{V}}\mathbb{E}\big[F_v(w_{\mathrm{priv}})-F^*_v\big] &\ge \frac{1}{2}\sup_{v\in\mathcal{V}}\mathbb{E}\big[\|\theta_v\|\,\|w_{\mathrm{priv}}-w^*_v\|^2\big] = \frac{p}{2}\sup_{v\in\mathcal{V}}\mathbb{E}\big[\|w_{\mathrm{priv}}-w^*_v\|^2\big] \\
&\ge \frac{p}{2}\rho^*(\mathcal{V})^2\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}} P_{\theta_v}\big(\|\hat{\theta}(X)-\theta_v\| \ge \rho^*(\mathcal{V})\big) \\
&\ge \frac{p}{128}\,\frac{(|\mathcal{V}|-1)e^{-\varepsilon\lceil n_{\mathrm{priv}} p\rceil}(1-p)^{n_{\mathrm{pub}}}}{2\big(1+(|\mathcal{V}|-1)e^{-\varepsilon\lceil n_{\mathrm{priv}} p\rceil}\big)} \\
&\ge \frac{p}{128}\,\frac{(2^{d/2}-1)e^{-\varepsilon\lceil n_{\mathrm{priv}} p\rceil}(1-p)^{n_{\mathrm{pub}}}}{2\big(1+(2^{d/2}-1)e^{-\varepsilon\lceil n_{\mathrm{priv}} p\rceil}\big)} \\
&\ge \frac{p}{512}(1-p)^{n_{\mathrm{pub}}}\min\Big\{1,\ \frac{2^{d/2}-1}{e^{\varepsilon(n_{\mathrm{priv}} p+1)}}\Big\}.
\end{align*}
Now, assume $d \ge 4$ so that $2^{d/2}-1 \ge e^{d/4}$. Then, as detailed in the proof of Theorem 15, choosing
\[
p = \min\Big(\frac{d}{4n\varepsilon}-\frac{1}{n},\ \frac{1}{2\sqrt{n_{\mathrm{pub}}}}\Big)
\]
and assuming $n_{\mathrm{pub}} \le kn\varepsilon/d$ for some absolute constant $k$ implies
\[
\sup_{v\in\mathcal{V}}\mathbb{E}\big[F_v(w_{\mathrm{priv}})-F^*_v\big] \ge c\min\Big(\frac{1}{\sqrt{n_{\mathrm{pub}}}},\ \frac{d}{n\varepsilon}\Big)
\]
for some absolute constant $c > 0$. Combining this with the non-private SCO lower bound [200] yields
\[
\sup_P \mathbb{E}\big[F(w_{\mathrm{priv}})-F^*\big] \ge c'\min\Big(\frac{1}{\sqrt{n_{\mathrm{pub}}}},\ \frac{d}{n\varepsilon}+\frac{1}{\sqrt{n}}\Big),
\]
where $F(w) := \mathbb{E}_{x\sim P}[f(w,x)]$.
Suppose instead that $0 \le \delta \le \varepsilon$ and $d \lesssim 1$ (i.e. $d \le k$ for some absolute constant $k \ge 1$), but $n_{\mathrm{pub}}\in[n]$ is arbitrary. We will prove the lower bound for $d = 1$: by taking the $k$-fold product distribution, this is sufficient to complete the proof of the unscaled lower bounds. Define distributions $P_1, P_2$ on $\{-1,1\}$ as follows:
\[
P_1(-1) = P_2(1) = \frac{1+\gamma}{2}, \qquad P_1(1) = P_2(-1) = \frac{1-\gamma}{2}
\]
for some $\gamma\in(0,1/2]$ to be chosen later. Note $\theta_1 := \mathbb{E}_{P_1}[x] = -\gamma$ and $\theta_2 := \mathbb{E}_{P_2}[x] = \gamma$, so $|\theta_j| = \gamma$ for $j = 1,2$. Let $F_j(w) = \mathbb{E}_{x\sim P_j} f(w,x)$ and $w^*_j = \frac{\theta_j}{|\theta_j|} = \frac{\theta_j}{\gamma} \in \operatorname{argmin}_{w\in\mathcal{W}} F_j(w)$. Then by A.22, we have
\begin{align*}
\max_{j\in\{1,2\}}\mathbb{E}\big[F_j(w_{\mathrm{priv}})-F^*_j\big] &\ge \frac{1}{2}\max_{j\in\{1,2\}}\mathbb{E}\big[|\theta_j|\,|w_{\mathrm{priv}}-w^*_j|^2\big] = \frac{\gamma}{2}\max_{j\in\{1,2\}}\mathbb{E}\big[|w_{\mathrm{priv}}-w^*_j|^2\big] \\
&= \frac{1}{2\gamma}\max_{j\in\{1,2\}}\mathbb{E}\big[|w'_{\mathrm{priv}}-\theta_j|^2\big],
\end{align*}
where $w'_{\mathrm{priv}} := \gamma w_{\mathrm{priv}}$ is semi-DP iff $w_{\mathrm{priv}}$ is semi-DP (by post-processing). Thus, by applying Le Cam's method and Lemma 81 (see the proof of Theorem 15 for details), we get
\[
\max_{j\in\{1,2\}}\mathbb{E}\big[F_j(w_{\mathrm{priv}})-F^*_j\big] \ge \frac{1}{2\gamma}\Bigg[\frac{\gamma^2}{8}\Bigg(1-\gamma\min\Bigg(\sqrt{\frac{3n}{2}},\ 6n_{\mathrm{priv}}\varepsilon+\sqrt{\frac{3n_{\mathrm{pub}}}{2}}\Bigg)\Bigg)\Bigg].
\]
Now we will choose $\gamma$ to (approximately) maximize the right-hand side of the above inequality. If
\[
\min\Bigg\{\sqrt{\frac{3n}{2}},\ 6n_{\mathrm{priv}}\varepsilon+\sqrt{\frac{3n_{\mathrm{pub}}}{2}}\Bigg\} = \sqrt{\frac{3n}{2}},
\]
then choosing $\gamma = \frac{1}{3}\sqrt{\frac{2}{3n}}$ yields
\[
\max_{j\in\{1,2\}}\mathbb{E}\big[F_j(w_{\mathrm{priv}})-F^*_j\big] \ge \frac{k}{\sqrt{n}}
\]
for some absolute constant $k > 0$. If instead $\min\Big\{\sqrt{\frac{3n}{2}},\ 6n_{\mathrm{priv}}\varepsilon+\sqrt{\frac{3n_{\mathrm{pub}}}{2}}\Big\} = 6n_{\mathrm{priv}}\varepsilon+\sqrt{\frac{3n_{\mathrm{pub}}}{2}}$, then we choose $\gamma = \frac{2}{3}\Big(6n_{\mathrm{priv}}\varepsilon+\sqrt{\frac{3n_{\mathrm{pub}}}{2}}\Big)^{-1}$. This choice implies
\[
\max_{j\in\{1,2\}}\mathbb{E}\big[F_j(w_{\mathrm{priv}})-F^*_j\big] \ge k'\min\Big(\frac{1}{n_{\mathrm{priv}}\varepsilon},\ \frac{1}{\sqrt{n_{\mathrm{pub}}}}\Big)
\]
for some absolute constant $k' > 0$. Combining the pieces above with the non-private SCO lower bound [200] yields
\[
\sup_P \mathbb{E}\big[F(w_{\mathrm{priv}})-F^*\big] \ge c\min\Big(\frac{1}{\sqrt{n_{\mathrm{pub}}}},\ \frac{1}{n\varepsilon}+\frac{1}{\sqrt{n}}\Big),
\]
where $F(w) := \mathbb{E}_{x\sim P}[f(w,x)]$.
It remains to properly scale the above constructions. For each respective construction $(\mathcal{W},\mathcal{X},P)$, we let $\widetilde{\mathcal{W}} = D\mathcal{W}$, $\widetilde{\mathcal{X}} = L\mathcal{X}$, and $\tilde{x}\sim\widetilde{P} \iff \tilde{x} = Lx$ for $x\sim P$. Define $\tilde{f}:\widetilde{\mathcal{W}}\times\widetilde{\mathcal{X}}\to\mathbb{R}$ by $\tilde{f}(\tilde{w},\tilde{x}) := f(\tilde{w},\tilde{x}) = -\langle\tilde{w},\tilde{x}\rangle$. Then $\tilde{f}(\cdot,\tilde{x})$ is $L$-Lipschitz and convex. Moreover, if $F(w) = \mathbb{E}_{x\sim P}[f(w,x)]$, $\widetilde{F}(\tilde{w}) = \mathbb{E}_{\tilde{x}\sim\widetilde{P}}[f(\tilde{w},\tilde{x})]$, $\tilde{w} = Dw$, and $\tilde{\mu} = \mathbb{E}_{\tilde{x}\sim\widetilde{P}}[\tilde{x}]$, then $Dw^*\in\operatorname{argmin}_{\tilde{w}\in\widetilde{\mathcal{W}}}\widetilde{F}(\tilde{w})$ and
\[
\widetilde{F}(\tilde{w}) - \widetilde{F}^* = -\langle\tilde{w},\tilde{\mu}\rangle + \langle Dw^*,\tilde{\mu}\rangle = D\langle\tilde{\mu},\, w^*-w\rangle = LD\langle\mu,\, w^*-w\rangle = LD\big[F(w)-F^*\big].
\]
This shows that excess risk scales by $LD$, completing the lower bound proofs.
Upper bounds:
Convex upper bounds: Consider the $0$-semi-DP throw-away algorithm that discards $X_{\mathrm{priv}}$ and runs $n_{\mathrm{pub}}$ steps of one-pass SGD (stochastic approximation) using $X_{\mathrm{pub}}$. This algorithm has excess risk $O\big(LD/\sqrt{n_{\mathrm{pub}}}\big)$ [200].
To obtain the second term in the convex $\varepsilon$-semi-DP upper bound, use the $\varepsilon$-DP (hence semi-DP) regularized exponential mechanism of [105]. The second term in the convex $(\varepsilon,\delta)$-semi-DP upper bound can be attained, e.g. by $(\varepsilon,\delta)$-DP-SGD [29].
Strongly convex upper bounds: Consider the $0$-semi-DP throw-away algorithm that discards $X_{\mathrm{priv}}$ and runs $n_{\mathrm{pub}}$ steps of one-pass SGD (stochastic approximation) using $X_{\mathrm{pub}}$. This algorithm has excess risk $O\big(\frac{L^2}{\mu n_{\mathrm{pub}}}\big)$ [200]. To obtain the second term in the strongly convex $\varepsilon$-semi-DP upper bound, one can use, e.g. the $\varepsilon$-DP (hence semi-DP) iterated exponential mechanism of [105]. The second term in the strongly convex $(\varepsilon,\delta)$-semi-DP upper bound can be attained, e.g. by $(\varepsilon,\delta)$-DP-SGD [181].
A.2.4.1 Semi-DP SCO with an "Even More Optimal" Gradient Estimator
In this section, we provide an $(\varepsilon,\delta)$-semi-DP algorithm for SCO. Our Algorithm 11 is a noisy stochastic gradient method that uses the "even more optimal" Weighted-Gaussian estimator 2.4 to estimate $\nabla F(w_t)$ in iteration $t$.† In Algorithm 11, $\mathrm{clip}_C(x) := \operatorname{argmin}_{y\in\mathbb{R}^d,\,\|y\|\le C}\|x-y\|$ is the Euclidean projection onto the centered $\ell_2$-ball of radius $C$.
Algorithm 11 Semi-DP-SGD via Weighted-Gaussian Gradient Estimators
1: Input: $T\in\mathbb{N}$, clip threshold $C>0$, stepsizes $\{\eta_t\}_{t\in[T]}$, batch sizes $K_{\mathrm{priv}}\in[n_{\mathrm{priv}}]$, $K_{\mathrm{pub}}\in[n_{\mathrm{pub}}]$, weight parameter $\alpha\in[0,1]$, noise parameter $\sigma^2>0$.
2: Initialize $w_0\in\mathcal{W}$.
3: for $t\in\{0,1,\dots,T-1\}$ do
4:   Draw random batch of $K_{\mathrm{priv}}$ private samples $B^{\mathrm{priv}}_t$ uniformly from $X_{\mathrm{priv}}$.
5:   Draw random batch of $K_{\mathrm{pub}}$ public samples $B^{\mathrm{pub}}_t$ uniformly from $X_{\mathrm{pub}}$.
6:   Draw privacy noise $v_t\sim N(0,\sigma^2 I_d)$.
7:   $\tilde{g}_t \leftarrow \alpha\Big[\frac{1}{K_{\mathrm{priv}}}\sum_{x\in B^{\mathrm{priv}}_t}\mathrm{clip}_C(\nabla f(w_t,x)) + v_t\Big] + \frac{1-\alpha}{K_{\mathrm{pub}}}\sum_{x\in B^{\mathrm{pub}}_t}\nabla f(w_t,x)$.
8:   Update $w_{t+1} := \Pi_{\mathcal{W}}\big[w_t - \eta_t \tilde{g}_t\big]$.
9: end for
10: Output: $w_T$ or a weighted average of the iterates $\{w_t\}_{t\in[T]}$.
Proposition 88. 1. Suppose we sample with replacement in line 4 of Algorithm 11. Then, there exist constants $c_1, c_2$ such that for any $\varepsilon < c_1\big(\frac{K_{\mathrm{priv}}}{n_{\mathrm{priv}}}\big)^2 T$, Algorithm 11 is $(\varepsilon,\delta)$-semi-DP for any $\delta > 0$ if we choose $\sigma^2 \ge c_2\frac{C^2\ln(1/\delta)T}{\varepsilon^2 n_{\mathrm{priv}}^2}$.
2. Suppose we sample without replacement in line 4 and choose $T \le \frac{n}{K_{\mathrm{priv}}}$. Then Algorithm 11 is $\rho$-semi-zCDP if $\sigma^2 \ge \frac{2C^2}{\rho K_{\mathrm{priv}}^2}$.
Proof. Note that the $\ell_2$-sensitivity of the private stochastic gradient query is
\[
\Delta = \sup_{X_{\mathrm{priv}}\sim X'_{\mathrm{priv}}}\Big\|\frac{\alpha}{K_{\mathrm{priv}}}\sum_{x\in B^{\mathrm{priv}}_t}\mathrm{clip}_C(\nabla f(w_t,x)) - \frac{\alpha}{K_{\mathrm{priv}}}\sum_{x'\in B'^{\mathrm{priv}}_t}\mathrm{clip}_C(\nabla f(w_t,x'))\Big\| \le \frac{2\alpha C}{K_{\mathrm{priv}}}.
\]
1. Consider sampling with replacement. Then we are randomly subsampling from the private data uniformly with sampling ratio $K_{\mathrm{priv}}/n_{\mathrm{priv}}$. Thus, the theorem follows from [1, Theorem 1].
† In Algorithm 11, we re-parameterize by setting $r = \frac{\alpha}{K_{\mathrm{priv}}}$ for $\alpha\in[0,1]$.
2. Consider sampling without replacement. Then by the $\rho$-zCDP guarantee of the Gaussian mechanism [47, Proposition 1.6] and the sensitivity bound above, $\tilde{g}_t$ is $\rho$-semi-zCDP for every $t$. Moreover, since we are sampling without replacement, the privacy of every $x\in X_{\mathrm{priv}}$ is only affected by $\tilde{g}_t$ for a single $t\in[T]$. Thus, semi-zCDP of Algorithm 11 follows by parallel composition [191].
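One update of Algorithm 11 can be sketched as follows. This is a minimal illustration under assumptions: a toy loss $f(w,x)=\frac{1}{2}\|w-x\|^2$, unconstrained $\mathcal{W}=\mathbb{R}^d$ (so the projection $\Pi_{\mathcal{W}}$ is the identity), and $\sigma^2$ left as a free parameter rather than calibrated via Proposition 88.

```python
import numpy as np

def clip(g, C):
    # Euclidean projection onto the centered l2-ball of radius C (clip_C).
    norm = np.linalg.norm(g)
    return g if norm <= C else g * (C / norm)

def semi_dp_sgd_step(w, B_priv, B_pub, grad, C, alpha, sigma2, eta, rng):
    d = w.shape[0]
    g_priv = np.mean([clip(grad(w, x), C) for x in B_priv], axis=0)
    g_priv = g_priv + rng.normal(0.0, np.sqrt(sigma2), size=d)  # lines 6-7: add privacy noise
    g_pub = np.mean([grad(w, x) for x in B_pub], axis=0)
    g = alpha * g_priv + (1.0 - alpha) * g_pub                  # line 7: weighted estimator
    return w - eta * g                                          # line 8, with W = R^d

# Toy instance (an assumption, not from the thesis): f(w, x) = 0.5*||w - x||^2.
grad = lambda w, x: w - x
rng = np.random.default_rng(1)
w0 = np.ones(3)
B_priv = rng.normal(size=(32, 3))
B_pub = rng.normal(size=(32, 3))
w_next = semi_dp_sgd_step(w0, B_priv, B_pub, grad, C=5.0, alpha=0.5,
                          sigma2=1e-4, eta=0.1, rng=rng)
```

With $\alpha = K_{\mathrm{priv}} r$ chosen as in Proposition 18, this weighted combination has smaller variance than either the pure-public or the pure-DP-SGD gradient estimator.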
By Proposition 18, there exists a choice of $\alpha$ such that the variance of our unbiased estimator in line 7 is always less than the variance of both the throw-away gradient estimator $\frac{1}{K_{\mathrm{pub}}}\sum_x \nabla f(w_t,x)$ and the DP-SGD estimator $\frac{1}{K_{\mathrm{priv}}+K_{\mathrm{pub}}}\sum_{x\in B_t}\nabla f(w_t,x) + u_t$, where $u_t$ is appropriately scaled (to ensure DP) Gaussian noise. Consequently, if we choose $T$ and $K = K_{\mathrm{priv}} + K_{\mathrm{pub}}$ such that $n = TK$ and sample without replacement (i.e. one pass), then Algorithm 11 always has smaller excess risk than both the throw-away SCO algorithm and one-pass DP-SGD. Moreover, if the loss function has Lipschitz continuous gradient, then one can combine the stochastic gradient estimator of line 7 with acceleration [108] to obtain a linear-time semi-DP algorithm that always outperforms the accelerated DP algorithm of [181]. The algorithm of [181] has the smallest excess risk among linear-time (one-pass) DP algorithms whose privacy analysis does not require convexity. Thus, our algorithm can be used for deep learning. In our numerical experiments, we implement the with-replacement sampling version of Algorithm 11.
We also note that near-optimal excess risk bounds for non-convex loss functions that satisfy the (Proximal) PL inequality [213, 142] can be derived by combining a proximal variation of Algorithm 11 with the techniques of [177]. Further, if $f(\cdot,x)$ is not uniformly Lipschitz, but has stochastic gradients with bounded $k$-th order moment for some $k \ge 2$, then excess risk bounds can still be derived for Algorithm 11 via techniques in [182]. Our algorithm can also be extended to a variation of noisy stochastic gradient descent ascent, which could be used, e.g. for fair semi-DP model training [178]. We leave it as future work to explore these and other potential applications of our gradient estimator in efficiently training private ML models with public data.
Finally, we note that Algorithm 11 is essentially equivalent to the "efficient approximation" of public-data-assisted DP-mirror-descent (PDA-MD) proposed in [9, Equation 1]. The work [9] uses this "efficient approximation" to train semi-DP non-convex models, but does not analyze the algorithm theoretically. We became aware of this overlap while we were finalizing our manuscript. Interestingly, our work and [9] derive Algorithm 11 in two orthogonal ways: we derive the algorithm as an application of our "even more optimal" mean estimator, while [9] derives it as an approximation of their PDA-MD algorithm.
A.3 Optimal Locally Private Learning with Public Data
Notation and Setup: Following [83], we permit algorithms to be fully interactive. That is, algorithms may adaptively query the same individual $i$ multiple times over the course of $T$ "communication rounds." We denote $i$'s message in round $t$ by $Z_{i,t}\in\mathcal{Z}$. Person $i$'s message $Z_{i,t}\in\mathcal{Z}$ in round $t$ may depend on all previous communications $B^{(t)} := (Z_{\le n,t}, B^{(t-1)})$ and on $i$'s own data: $Z_{i,t}\sim Q_{i,t}(\cdot\,|\,x_i, Z_{<i,t}, B^{(t-1)})$. If $i$'s data is private, then $Z_{i,t}$ is a randomized view of $x_i$ distributed (conditionally) according to $Q_{i,t}$. If $i$'s data is public, then $Z_{i,t}$ may be deterministic. Full interactivity is the most general notion of interactivity. If $T = 1$, then we say the algorithm is sequentially interactive. If, in addition, each person's message $Z_{i,1}$ depends only on $x_i$ and not on $x_{j\ne i}$, then we say the algorithm is non-interactive. Semi-LDP (Definition 24) essentially requires that the messages $\{Z_{i,t}\}_{t\in[T]}$ be DP for all private $x_i\in X_{\mathrm{priv}}$.
A.3.1 Optimal Semi-LDP Mean Estimation
Theorem 89 (Re-statement of Theorem 25). There exist absolute constants $c$ and $C$, with $0 < c \le C$, such that
\[
c\min\Big\{\frac{1}{n_{\mathrm{pub}}},\ \frac{d}{n\min(\varepsilon,\varepsilon^2)}+\frac{1}{n}\Big\} \le \mathcal{M}^{\mathrm{loc}}_{\mathrm{pop}}(\varepsilon, n_{\mathrm{priv}}, n, d) \le C\min\Big\{\frac{1}{n_{\mathrm{pub}}},\ \frac{d}{n\min(\varepsilon,\varepsilon^2)}+\frac{1}{n}\Big\}.
\]
Proof. Lower bound: We will actually prove a more general lower bound than the one in Theorem 25; namely, we will show a lower bound on the minimax $\ell_1$-error for estimation of distributions on $\mathcal{X}_r = \{\pm r\}^d$ for $r > 0$. To that end, let $\gamma\in(0,1)$ and
\[
P_1 := \begin{cases} r & \text{with probability } \frac{1+\gamma}{2} \\ -r & \text{with probability } \frac{1-\gamma}{2} \end{cases}
\qquad\text{and}\qquad
P_{-1} := \begin{cases} r & \text{with probability } \frac{1-\gamma}{2} \\ -r & \text{with probability } \frac{1+\gamma}{2}. \end{cases}
\]
We define our hard distribution on $\mathcal{X}_r$ by first drawing $V\sim\mathrm{Unif}(\{\pm 1\}^d)$ and then, conditional on $V = v$, drawing $X_{i,j}\sim P_v = \Pi_{j=1}^d P_{v_j}$ for $i\in[n], j\in[d]$, where $P_v$ denotes the product distribution. We have Markov chains $V_j \to X_{i,j} \to Z$ for all $j\in[d], i\in[n]$, where $Z$ is the semi-LDP transcript. Note that $\big|\ln\big(\frac{dP_1}{dP_{-1}}\big)\big| \le \ln\big(\frac{1+\gamma}{1-\gamma}\big) =: b$ and $e^b \le 3$ for any $\gamma\in(0,1/2]$. Now we will use the following lemma from Duchi & Rogers (2019):
Lemma 90. [83, Lemma 24] Let $V \to X \to Z$ be a Markov chain, where $X\sim P_v$ conditional on $V = v$. If $\big|\ln\frac{dP_v}{dP_{v'}}\big| \le \alpha$ for all $v, v'$, then
\[
I(V;Z) \le 2(e^\alpha - 1)^2 I(X;Z).
\]
Thus, for $V_j\sim\mathrm{Unif}(\{\pm 1\})$, Lemma 90 implies $I(V_j;Z) \le \frac{8\gamma^2}{(1-\gamma)^2} I(X_{i,j};Z)$. Hence the strong data processing constant [83, Definition 9] is $\beta := \beta(P_1, P_{-1}) \le \frac{8\gamma^2}{(1-\gamma)^2}$.
Now, $\theta_{v_j} := \mathbb{E}_{x\sim P_{v_j}}[x] = \gamma r v_j$ for any $v_j\in\{\pm 1\}$. Moreover, letting $\theta_v = (\theta_{v_1},\dots,\theta_{v_d})$ for $v\in\{\pm 1\}^d$ and $\theta\in\mathbb{R}^d$, we have
\[
\|\theta-\theta_v\|_1 = \sum_{j=1}^d |\theta_j - r\gamma v_j| = r\gamma\sum_{j=1}^d\Big|\frac{\theta_j}{r\gamma}-v_j\Big| \ge r\gamma\sum_{j=1}^d \mathbb{1}\{\mathrm{sign}(\theta_j)\ne v_j\}.
\]
Thus, $\{\pm 1\}^d$ induces an $r\gamma$-Hamming separation, so Assouad's lemma [84, Lemma 1] yields
\[
\inf_{A\in\mathcal{A}^{\mathrm{loc}}_\varepsilon}\sup_{P\in\mathcal{P}_r}\mathbb{E}\|A(X)-\theta(P)\|_1 \ge r\gamma\sum_{j=1}^d \inf_{\hat{V}}\mathbb{P}\big(\hat{V}_j(Z)\ne V_j\big),
\]
where $Z$ is the communication transcript of $A$, the infimum on the RHS is over all estimators of $V$, $\theta(P) = \mathbb{E}_{x\sim P}[x]$, and $\mathcal{P}_r$ is the set of distributions on $\mathcal{X}_r$.
Assume WLOG that the private samples are the first $n_{\mathrm{priv}}$ samples of $X$: $X_{\mathrm{priv}} = (x_1,\dots,x_{n_{\mathrm{priv}}})$. To lower bound $\sum_{j=1}^d\inf_{\hat{V}}\mathbb{P}(\hat{V}_j(Z)\ne V_j)$, we use a slight extension of [83, Theorem 10]:
\[
\sum_{j=1}^d \inf_{\hat{V}}\mathbb{P}\big(\hat{V}_j(Z)\ne V_j\big) \ge \frac{d}{2}\Bigg[1-\sqrt{\frac{7(e^b+1)}{d}\beta\big(I(X_{\mathrm{priv}};Z\,|\,V)+I(X_{\mathrm{pub}};Z\,|\,V)\big)}\Bigg].
\]
This follows since $V \to X_{\mathrm{priv}} \to Z$ and $V \to X_{\mathrm{pub}} \to Z$ are both Markov chains and the other assumptions in [83, Theorem 10] all hold. Combining this bound with Assouad's lemma [84, Lemma 1] and substituting the definitions of $b$ and $\beta$ given above gives us
\[
\inf_{A\in\mathcal{A}^{\mathrm{loc}}_\varepsilon}\sup_{P\in\mathcal{P}_r}\mathbb{E}\|A(X)-\theta(P)\|_1 \ge \frac{r\gamma d}{2}\Bigg[1-\sqrt{\frac{896}{d}\gamma^2\big(I(X_{\mathrm{priv}};Z|V)+I(X_{\mathrm{pub}};Z|V)\big)}\Bigg]
\]
for any $\gamma\in(0,1/2]$. It remains to upper bound the conditional mutual information $I(X_{\mathrm{priv}};Z|V)$ and $I(X_{\mathrm{pub}};Z|V)$.
Now for any $\varepsilon$-semi-LDP algorithm with communication transcript $Z$, we have $I(X_{\mathrm{priv}};Z|V) \le n_{\mathrm{priv}}\min(\varepsilon, 4\varepsilon^2)$, by an easy extension of [83, Lemma 12]. Also, $I(X_{\mathrm{pub}};Z|V) \le H(X_{\mathrm{pub}}|V) \le \log(|\mathcal{X}^{n_{\mathrm{pub}}}|) = d\, n_{\mathrm{pub}}$, where $H(\cdot|\cdot)$ denotes conditional entropy. Thus,
\[
\inf_{A\in\mathcal{A}^{\mathrm{loc}}_\varepsilon}\sup_{P\in\mathcal{P}_r}\mathbb{E}\|A(X)-\theta(P)\|_1 \ge \frac{r\gamma d}{2}\Bigg[1-\sqrt{\frac{4000}{d}\gamma^2\big(n_{\mathrm{priv}}\min(\varepsilon,\varepsilon^2)+d\, n_{\mathrm{pub}}\big)}\Bigg].
\]
Choosing $\gamma^2 = c\min\Big(\frac{1}{n_{\mathrm{pub}}},\ \frac{d}{n_{\mathrm{priv}}\min(\varepsilon,\varepsilon^2)}\Big)$ for some small constant $c > 0$ yields
\[
\inf_{A\in\mathcal{A}^{\mathrm{loc}}_\varepsilon}\sup_{P\in\mathcal{P}_r}\mathbb{E}\|A(X)-\theta(P)\|_1 \gtrsim rd\min\Bigg(\frac{1}{\sqrt{n_{\mathrm{pub}}}},\ \sqrt{\frac{d}{n_{\mathrm{priv}}\min(\varepsilon,\varepsilon^2)}}\Bigg),
\]
whence
\[
\inf_{A\in\mathcal{A}^{\mathrm{loc}}_\varepsilon}\sup_{P\in\mathcal{P}_r}\mathbb{E}\|A(X)-\theta(P)\|_2 \gtrsim r\sqrt{d}\min\Bigg(\frac{1}{\sqrt{n_{\mathrm{pub}}}},\ \sqrt{\frac{d}{n_{\mathrm{priv}}\min(\varepsilon,\varepsilon^2)}}\Bigg).
\]
By applying the non-private mean estimation lower bound, we get
\[
\inf_{A\in\mathcal{A}^{\mathrm{loc}}_\varepsilon}\sup_{P\in\mathcal{P}_r}\mathbb{E}\|A(X)-\theta(P)\|_2 \gtrsim r\sqrt{d}\min\Bigg(\frac{1}{\sqrt{n_{\mathrm{pub}}}},\ \sqrt{\frac{d}{n_{\mathrm{priv}}\min(\varepsilon,\varepsilon^2)}}+\frac{1}{\sqrt{n}}\Bigg).
\]
Choosing $r = 1/\sqrt{d}$ ensures that $\mathcal{P}_r\subset\mathcal{P}(\mathbb{B})$ and yields
\[
\inf_{A\in\mathcal{A}^{\mathrm{loc}}_\varepsilon}\sup_{P\in\mathcal{P}(\mathbb{B})}\mathbb{E}\|A(X)-\theta(P)\|_2 \gtrsim \min\Bigg(\frac{1}{\sqrt{n_{\mathrm{pub}}}},\ \sqrt{\frac{d}{n\min(\varepsilon,\varepsilon^2)}}+\frac{1}{\sqrt{n}}\Bigg).
\]
Applying Jensen's inequality completes the proof of the lower bound in Theorem 25.
Upper bound: The first term in the minimum can be realized by the algorithm that throws away the private data and returns $A(X) = \frac{1}{n_{\mathrm{pub}}}\sum_{x\in X_{\mathrm{pub}}} x$, which is $0$-semi-LDP. Also,
\[
\mathbb{E}\|A(X)-\mathbb{E}x\|^2 = \frac{1}{n_{\mathrm{pub}}^2}\sum_{x\in X_{\mathrm{pub}}}\mathbb{E}\|x-\mathbb{E}x\|^2 \le \frac{1}{n_{\mathrm{pub}}}.
\]
The second term in the upper bound can be realized by an appropriate choice of $(p,\gamma)$ in PrivUnit [39] (Algorithm 12). The work [39] shows that PrivUnit is $\varepsilon$-LDP (hence $\varepsilon$-semi-LDP) and has worst-case MSE that is upper bounded by $O\Big(\frac{d}{n\min(\varepsilon,\varepsilon^2)}+\frac{1}{n}\Big)$.
Algorithm 12 PrivUnit$(p,\gamma)$ [39]
1: Input: $v\in\mathbb{S}^{d-1}$, $\gamma\in[0,1]$, $p\in[0,1]$. $B(\cdot;\cdot,\cdot)$ below is the incomplete Beta function $B(x;a,b) = \int_0^x t^{a-1}(1-t)^{b-1}\,dt$ and $B(a,b) = B(1;a,b)$.
2: Draw $z\sim\mathrm{Bernoulli}(p)$
3: if $z = 1$ then
4:   Draw $V\sim\mathrm{Unif}\{u\in\mathbb{S}^{d-1} : \langle u,v\rangle \ge \gamma\}$
5: else
6:   Draw $V\sim\mathrm{Unif}\{u\in\mathbb{S}^{d-1} : \langle u,v\rangle < \gamma\}$
7: end if
8: Set $\alpha = \frac{d-1}{2}$ and $\tau = \frac{1+\gamma}{2}$
9: Calculate normalization constant
\[
m = \frac{(1-\gamma^2)^\alpha}{2^{d-2}(d-1)}\Big(\frac{p}{B(\alpha,\alpha)-B(\tau;\alpha,\alpha)} + \frac{1-p}{B(\tau;\alpha,\alpha)}\Big)
\]
10: Return $\frac{1}{m}\cdot V$
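Lines 2-7 of PrivUnit can be implemented with simple rejection sampling, and the incomplete Beta integrals in line 9 approximated numerically. The sketch below is an illustration only: the numerical-integration helper is an assumption, and the calibration of $(p,\gamma)$ to a target $\varepsilon$ from [39] is omitted.

```python
import numpy as np

def _inc_beta(x, a, b, k=20000):
    # Midpoint-rule approximation of the (unregularized) incomplete Beta
    # integral B(x; a, b) = int_0^x t^(a-1) (1-t)^(b-1) dt.
    t = (np.arange(k) + 0.5) * (x / k)
    return float(np.sum(t ** (a - 1) * (1 - t) ** (b - 1)) * (x / k))

def priv_unit(v, gamma, p, rng):
    d = v.shape[0]
    z = rng.random() < p                        # line 2
    while True:                                 # lines 3-7 via rejection sampling
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                  # uniform on the sphere S^{d-1}
        if (u @ v >= gamma) == z:
            V = u
            break
    a, tau = (d - 1) / 2.0, (1 + gamma) / 2.0   # line 8
    m = ((1 - gamma ** 2) ** a / (2 ** (d - 2) * (d - 1))) * (
        p / (_inc_beta(1.0, a, a) - _inc_beta(tau, a, a))
        + (1 - p) / _inc_beta(tau, a, a))       # line 9
    return V / m                                # line 10

rng = np.random.default_rng(2)
v = np.zeros(5); v[0] = 1.0
out1 = priv_unit(v, gamma=0.3, p=0.9, rng=rng)
out2 = priv_unit(v, gamma=0.3, p=0.9, rng=rng)
```

Since line 10 rescales a unit vector by the constant $1/m$, every output has the same norm; the debiasing constant $m$ is what makes the randomizer unbiased.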
A.3.1.1 An "Even More Optimal" Semi-LDP Estimator
Lemma 91 (Re-statement of Lemma 27). Let $C > 0$ be an absolute constant such that $\mathbb{E}_{x\sim P}\|\mathrm{PrivUnit}(x)-x\|^2 = C^2\frac{d}{\min(\varepsilon,\varepsilon^2)}$. Then
\[
\mathbb{E}_{A,\,X\sim P^n}\|A_{\text{semi-PrivU}}(X)-\mathbb{E}_{x\sim P}[x]\|^2 \le \frac{2}{n} + 2C^2\frac{n_{\mathrm{priv}}}{n}\frac{d}{n\min(\varepsilon,\varepsilon^2)}.
\]
Proof. We have
\begin{align*}
\mathbb{E}_{A,\,X\sim P^n}\|A_{\text{semi-PrivU}}(X)-\mathbb{E}x\|^2 &\le 2\,\mathbb{E}\|\bar{X}-\mathbb{E}x\|^2 + 2\,\mathbb{E}\|A_{\text{semi-PrivU}}(X)-\bar{X}\|^2 \\
&\le \frac{2}{n} + \frac{2}{n^2}\sum_{x\in X_{\mathrm{priv}}}\mathbb{E}\|\mathrm{PrivUnit}(x)-x\|^2 \\
&\le \frac{2}{n} + 2C^2\frac{n_{\mathrm{priv}}}{n}\frac{d}{n\min(\varepsilon,\varepsilon^2)}.
\end{align*}
Proposition 92 (Re-statement of Proposition 28). Let $A(X) = \frac{1}{n}\big[M_{\mathrm{priv}}(\mathcal{R}(x_1),\dots,\mathcal{R}(x_{n_{\mathrm{priv}}})) + M_{\mathrm{pub}}(X_{\mathrm{pub}})\big]$ be an $\varepsilon$-semi-LDP algorithm, where $\mathcal{R}:\mathbb{S}^{d-1}\to\mathcal{Z}$ is an $\varepsilon$-LDP randomizer and $M_{\mathrm{priv}}:\mathcal{Z}^{n_{\mathrm{priv}}}\to\mathbb{R}^d$ and $M_{\mathrm{pub}}:\mathcal{Z}^{n_{\mathrm{pub}}}\to\mathbb{R}^d$ are aggregation protocols such that $\mathbb{E}_{M_{\mathrm{priv}},\mathcal{R}}\big[M_{\mathrm{priv}}(\mathcal{R}(x_1),\dots,\mathcal{R}(x_{n_{\mathrm{priv}}}))\big] = \sum_{x\in X_{\mathrm{priv}}} x$ and $\mathbb{E}_{M_{\mathrm{pub}}}\big[M_{\mathrm{pub}}(X_{\mathrm{pub}})\big] = \sum_{x\in X_{\mathrm{pub}}} x$ for all $X = (X_{\mathrm{priv}}, X_{\mathrm{pub}})\in\big(\mathbb{S}^{d-1}\big)^n$. Then,
\[
\sup_{X\in(\mathbb{S}^{d-1})^n}\mathbb{E}_{A_{\text{semi-PrivU}}}\|A_{\text{semi-PrivU}}(X)-\bar{X}\|^2 \le \sup_{X\in(\mathbb{S}^{d-1})^n}\mathbb{E}_A\|A(X)-\bar{X}\|^2.
\]
Proof. First, Asi et al. [20, Proposition 3.4] showed that PrivUnit (with a proper choice of $(p,\gamma)$) has the smallest worst-case variance among all unbiased $\varepsilon$-LDP randomizers:
\[
\sup_{x\in\mathbb{S}^{d-1}}\mathbb{E}\|\mathcal{R}(x)-x\|^2 \ge \sup_{x\in\mathbb{S}^{d-1}}\mathbb{E}\|\mathrm{PrivUnit}(x)-x\|^2 \quad\text{(A.23)}
\]
for all $\varepsilon$-LDP randomizers $\mathcal{R}$ such that $\mathbb{E}[\mathcal{R}(x)] = x$ for all $x\in\mathbb{S}^{d-1}$.
Now, let $\mathcal{R}$ be an $\varepsilon$-LDP randomizer and $M_{\mathrm{priv}}$ and $M_{\mathrm{pub}}$ be aggregation protocols such that the assumptions in Proposition 28 are satisfied. We claim that there exists an unbiased $\varepsilon$-LDP randomizer $\mathcal{R}':\mathbb{S}^{d-1}\to\mathcal{Z}$ such that
\[
\sup_{X\in(\mathbb{S}^{d-1})^n}\mathbb{E}_A\Big\|nA(X)-\sum_{x\in X}x\Big\|^2 \ge \sup_{X\in(\mathbb{S}^{d-1})^n}\mathbb{E}_{\mathcal{R}'}\Big\|\sum_{x\in X_{\mathrm{priv}}}\mathcal{R}'(x)-\sum_{x\in X_{\mathrm{priv}}}x\Big\|^2. \quad\text{(A.24)}
\]
To prove A.24, we follow the idea in the proof of [20, Proposition 3.3]. Let $P$ denote the uniform distribution on $\mathbb{S}^{d-1}$. We have
\begin{align*}
\sup_{X\in(\mathbb{S}^{d-1})^n}\mathbb{E}_A\Big\|nA(X)-\sum_{x\in X}x\Big\|^2 &\ge \mathbb{E}_{X\sim P^n,\,A}\Big\|nA(X)-\sum_{x\in X}x\Big\|^2 \\
&\ge \mathbb{E}\Big\|M_{\mathrm{priv}}(\mathcal{R}(x_1),\dots,\mathcal{R}(x_{n_{\mathrm{priv}}}))-\sum_{x\in X_{\mathrm{priv}}}x\Big\|^2 + \mathbb{E}\Big\|M_{\mathrm{pub}}(X_{\mathrm{pub}})-\sum_{x\in X_{\mathrm{pub}}}x\Big\|^2,
\end{align*}
since the cross-term (inner product) vanishes by independence of $X_{\mathrm{pub}}$ and $X_{\mathrm{priv}}$, and unbiasedness of $M_{\mathrm{pub}}$. Now, [20, Lemma A.1] shows that there exist $\varepsilon$-LDP randomizers $\{\hat{\mathcal{R}}_x\}_{x\in X_{\mathrm{priv}}}$ such that $\mathbb{E}[\hat{\mathcal{R}}_x(v)] = v$ for all $v\in\mathbb{S}^{d-1}$ and
\[
\mathbb{E}_{X_{\mathrm{priv}}\sim P^{n_{\mathrm{priv}}}}\Big\|M_{\mathrm{priv}}(\mathcal{R}(x_1),\dots,\mathcal{R}(x_{n_{\mathrm{priv}}}))-\sum_{x\in X_{\mathrm{priv}}}x\Big\|^2 \ge \sum_{x\in X_{\mathrm{priv}}}\mathbb{E}_{v\sim P}\|\hat{\mathcal{R}}_x(v)-v\|^2.
\]
Hence
\[
\sup_{X\in(\mathbb{S}^{d-1})^n}\mathbb{E}_A\Big\|nA(X)-\sum_{x\in X}x\Big\|^2 \ge \sum_{x\in X_{\mathrm{priv}}}\mathbb{E}_{v\sim P}\|\hat{\mathcal{R}}_x(v)-v\|^2.
\]
Define $\mathcal{R}'_x(v) := U^T\hat{\mathcal{R}}_x(Uv)$ for $v\in\mathbb{S}^{d-1}$, where $U$ is a uniformly random rotation matrix such that $U^T U = I_d$. Note that $\mathcal{R}'_x$ is an $\varepsilon$-LDP randomizer such that $\mathbb{E}[\mathcal{R}'_x(v)] = v$ for all $v\in\mathbb{S}^{d-1}, x\in X_{\mathrm{priv}}$. Moreover, for any fixed $v\in\mathbb{S}^{d-1}, x\in X_{\mathrm{priv}}$, we have
\[
\mathbb{E}\|\mathcal{R}'_x(v)-v\|^2 = \mathbb{E}_U\|\hat{\mathcal{R}}_x(Uv)-Uv\|^2 = \mathbb{E}_{v'\sim P}\|\hat{\mathcal{R}}_x(v')-v'\|^2.
\]
Let $x^* := \operatorname{argmin}_{x\in X_{\mathrm{priv}}}\sup_{v\in\mathbb{S}^{d-1}}\mathbb{E}\|\mathcal{R}'_x(v)-v\|^2$ and $\mathcal{R}'(v) := \mathcal{R}'_{x^*}(v)$. Then putting the pieces together, we have
\begin{align*}
\sup_{X\in(\mathbb{S}^{d-1})^n}\mathbb{E}_A\Big\|nA(X)-\sum_{x\in X}x\Big\|^2 &\ge \sum_{x\in X_{\mathrm{priv}}}\sup_{v\in\mathbb{S}^{d-1}}\mathbb{E}\|\mathcal{R}'_x(v)-v\|^2 \ge n_{\mathrm{priv}}\sup_{v\in\mathbb{S}^{d-1}}\mathbb{E}\|\mathcal{R}'(v)-v\|^2 \\
&= \sup_{X_{\mathrm{priv}}\in(\mathbb{S}^{d-1})^{n_{\mathrm{priv}}}}\mathbb{E}_{\mathcal{R}'}\Big\|\sum_{x\in X_{\mathrm{priv}}}\mathcal{R}'(x)-\sum_{x\in X_{\mathrm{priv}}}x\Big\|^2,
\end{align*}
by conditional independence of $\{\mathcal{R}'(x)\}_{x\in X_{\mathrm{priv}}}$ given $X$. This establishes A.24. Thus,
\begin{align*}
n^2\sup_{X\in(\mathbb{S}^{d-1})^n}\mathbb{E}_A\big\|A(X)-\bar{X}\big\|^2 &= \sup_{X\in(\mathbb{S}^{d-1})^n}\mathbb{E}_A\Big\|nA(X)-\sum_{x\in X}x\Big\|^2 \\
&\ge \sup_{X\in(\mathbb{S}^{d-1})^n}\mathbb{E}_{\mathcal{R}'}\Big\|\sum_{x\in X_{\mathrm{priv}}}\mathcal{R}'(x)-\sum_{x\in X_{\mathrm{priv}}}x\Big\|^2 \\
&\ge \sup_{X\in(\mathbb{S}^{d-1})^n}\mathbb{E}\Big\|\sum_{x\in X_{\mathrm{priv}}}\mathrm{PrivUnit}(x)-\sum_{x\in X_{\mathrm{priv}}}x\Big\|^2 \\
&= n^2\sup_{X\in(\mathbb{S}^{d-1})^n}\mathbb{E}_{A_{\text{semi-PrivU}}}\|A_{\text{semi-PrivU}}(X)-\bar{X}\|^2,
\end{align*}
where we used A.23 in the last inequality. Dividing both sides of the above inequality by $n^2$ completes the proof.
A.3.2 Optimal Semi-LDP Stochastic Convex Optimization
Theorem 93 (Complete statement of Theorem 29). There exist absolute constants $c$ and $C$, with $0 < c \le C$, such that
\[
cLD\min\Bigg\{\frac{1}{\sqrt{n_{\mathrm{pub}}}},\ \sqrt{\frac{d}{n\min(\varepsilon,\varepsilon^2)}}+\frac{1}{\sqrt{n}}\Bigg\} \le \mathcal{R}^{\mathrm{loc}}_{\mathrm{SCO}}(\varepsilon, n_{\mathrm{priv}}, n, d, L, D) \le CLD\min\Bigg\{\frac{1}{\sqrt{n_{\mathrm{pub}}}},\ \sqrt{\frac{d}{n\min(\varepsilon,\varepsilon^2)}}+\frac{1}{\sqrt{n}}\Bigg\},
\]
and
\[
cLD\min\Big\{\frac{1}{n_{\mathrm{pub}}},\ \frac{d}{n\min(\varepsilon,\varepsilon^2)}+\frac{1}{n}\Big\} \le \mathcal{R}^{\mathrm{loc}}_{\mathrm{SCO}}(\varepsilon, n_{\mathrm{priv}}, n, d, L, D, \mu) \le C\frac{L^2}{\mu}\min\Big\{\frac{1}{n_{\mathrm{pub}}},\ \frac{d}{n\min(\varepsilon,\varepsilon^2)}+\frac{1}{n}\Big\}.
\]
Proof. Lower bounds: Let $\mathcal{A}$ be $\varepsilon$-semi-LDP and denote $w_{\mathrm{priv}} = \mathcal{A}(X)$.

Strongly convex lower bound: We begin with the strongly convex lower bounds, which can be proved straightforwardly by reducing strongly convex SCO to mean estimation and applying Theorem 25. In a bit more detail, let $f: \mathcal{W} \times \mathcal{X} \to \mathbb{R}$ be given by
\[
f(w, x) = \frac{L}{2D} \|w - x\|^2,
\]
where $\mathcal{W} = \mathcal{X} = D\mathbb{B}$. Note that $f$ is $L$-uniformly Lipschitz and $\frac{L}{D}$-strongly convex in $w$ for all $x$. Further, $w^* := \operatorname{argmin}_{w \in \mathcal{W}} \{F(w) = \mathbb{E}_{x \sim P}[f(w, x)]\} = \mathbb{E}_{x \sim P}[x]$. By a direct calculation (see e.g. [137, Lemma 6.2]), we have
\[
\mathbb{E} F(w_{\mathrm{priv}}) - F(w^*) = \frac{L}{2D} \mathbb{E}\|w_{\mathrm{priv}} - w^*\|^2. \tag{A.25}
\]
We can lower bound $\mathbb{E}\|w_{\mathrm{priv}} - w^*\|^2 = \mathbb{E}\|\mathcal{A}(X) - \mathbb{E}_{x \sim P}[x]\|^2$ via Theorem 25 (and its proof, to account for the re-scaling). Specifically, there is a distribution $P$ on $\mathcal{X}$ such that
\[
\mathbb{E}_{X \sim P^n, w_{\mathrm{priv}}} \|w_{\mathrm{priv}} - w^*\|^2 \ge c D^2 \min\bigg( \frac{1}{n_{\mathrm{pub}}},\; \frac{d}{n \min(\varepsilon, \varepsilon^2)} + \frac{1}{n} \bigg).
\]
Combining this with (A.25) leads to the desired excess risk lower bound.
Convex lower bound: We will begin by proving the lower bounds for the case in which $L = D = 1$, and then scale our construction to get the lower bounds for arbitrary $L, D$.

Let $\mathcal{W} = \mathbb{B}$, $\mathcal{X} = \big\{\pm \frac{1}{\sqrt{d}}\big\}^d$, and
\[
f(w, x) = -\langle w, x \rangle,
\]
which is convex and $1$-uniformly-Lipschitz in $w$ on $\mathcal{X}$. For any $\varepsilon$-semi-LDP $\mathcal{A}'(X) = w'_{\mathrm{priv}}$ and any $\gamma \in (0, 1)$, the proof of Theorem 25 constructs a distribution $P_\gamma$ on $\mathcal{X}$ with mean $\mathbb{E}_{x \sim P_\gamma}[x] = \theta \in \big\{\pm \frac{\gamma}{\sqrt{d}}\big\}^d$ such that
\[
\mathbb{E}_{w'_{\mathrm{priv}}, X \sim P_\gamma^n} \Big[ \big\|w'_{\mathrm{priv}} - \theta\big\|^2 \Big] \ge \frac{\gamma^2}{4} \Bigg[ 1 - \sqrt{4000\gamma^2 \bigg( \frac{n_{\mathrm{priv}} \min(\varepsilon, \varepsilon^2)}{d} + n_{\mathrm{pub}} \bigg)} \Bigg]. \tag{A.26}
\]
Now, let $F_\gamma(w) = \mathbb{E}_{x \sim P_\gamma}[f(w, x)]$ and $w^*_\gamma = \operatorname{argmin}_{w \in \mathcal{W}} F_\gamma(w) = \frac{\theta}{\|\theta\|}$. A direct calculation (see e.g. [137, Equation 14]) shows
\[
\mathbb{E} F_\gamma(w) - F^*_\gamma \ge \frac{1}{2} \mathbb{E}\big[ \|\theta\| \|w - w^*_\gamma\|^2 \big] \ge \frac{1}{2\|\theta\|} \mathbb{E}\big[ \|w' - \theta\|^2 \big] = \frac{1}{2\gamma} \mathbb{E}\big[ \|w' - \theta\|^2 \big] \tag{A.27}
\]
for any $w \in \mathcal{W}$, $w' := w\|\theta\|$. Note that $\mathcal{A}(X) = w_{\mathrm{priv}}$ is $\varepsilon$-semi-DP if and only if $\mathcal{A}'(X) = w'_{\mathrm{priv}} := \|\theta\| w_{\mathrm{priv}} = \gamma w_{\mathrm{priv}}$ is $\varepsilon$-semi-DP, by post-processing. Thus, (A.26) and (A.27) together imply that any $\varepsilon$-semi-DP $w'_{\mathrm{priv}}$ has worst-case excess risk that is lower bounded by
\[
\mathbb{E}[F_\gamma(w'_{\mathrm{priv}}) - F^*_\gamma] \ge \frac{\gamma}{8} \Bigg[ 1 - \sqrt{4000\gamma^2 \bigg( \frac{n_{\mathrm{priv}} \min(\varepsilon, \varepsilon^2)}{d} + n_{\mathrm{pub}} \bigg)} \Bigg].
\]
Choosing $\gamma^2 = c \min\Big( \frac{d}{n_{\mathrm{priv}} \min(\varepsilon, \varepsilon^2)}, \frac{1}{n_{\mathrm{pub}}} \Big)$ for some small $c \in (0, 1/16000)$ implies
\[
\mathbb{E}[F_\gamma(w'_{\mathrm{priv}}) - F^*_\gamma] \ge c' \min\Bigg( \sqrt{\frac{d}{n_{\mathrm{priv}} \min(\varepsilon, \varepsilon^2)}},\; \frac{1}{\sqrt{n_{\mathrm{pub}}}} \Bigg)
\]
for some $c' > 0$. This proves the desired lower bound for the case when $L = D = 1$. In the general case, we scale our hard instance, as in the proof of Theorem 15: Let
$\widetilde{\mathcal{W}} = D\mathcal{W}$, $\widetilde{\mathcal{X}} = L\mathcal{X}$, and $\tilde{x} \sim \widetilde{P} \iff \tilde{x} = Lx$ for $x \sim P$. Define $\tilde{f}: \widetilde{\mathcal{W}} \times \widetilde{\mathcal{X}} \to \mathbb{R}$ by $\tilde{f}(\tilde{w}, \tilde{x}) := f(\tilde{w}, \tilde{x}) = -\langle \tilde{w}, \tilde{x} \rangle$. Then $\tilde{f}(\cdot, \tilde{x})$ is $L$-Lipschitz and convex. Moreover, if $F(w) = \mathbb{E}_{x \sim P}[f(w, x)]$, $\theta := \mathbb{E}_{x \sim P}[x]$, $w^* := \operatorname{argmin}_{w \in \mathcal{W}} F(w) = \frac{\theta}{\|\theta\|}$, $\widetilde{F}(\tilde{w}) = \mathbb{E}_{\tilde{x} \sim \widetilde{P}}[f(\tilde{w}, \tilde{x})]$, $\tilde{w} = Dw$, and $\tilde{\theta} := \mathbb{E}_{\tilde{x} \sim \widetilde{P}}[\tilde{x}] = L\theta$, then $\tilde{w}^* = Dw^* \in \operatorname{argmin}_{\tilde{w} \in \widetilde{\mathcal{W}}} \widetilde{F}(\tilde{w})$ and
\begin{align*}
\widetilde{F}(\tilde{w}) - \widetilde{F}^* &= -\langle \tilde{w}, \tilde{\theta} \rangle + \langle \tilde{w}^*, \tilde{\theta} \rangle \\
&= D\langle \tilde{\theta}, w^* - w \rangle \\
&= LD\langle \theta, w^* - w \rangle \\
&= LD[F(w) - F^*].
\end{align*}
This shows that the excess risk of the scaled instance scales by $LD$, completing the lower bound proofs.
Upper bounds: The first term in each of the upper bounds ( LD{
?
n
pub
for convex, andL
2
{pµn
pub
q
for strongly convex) is attained by the throw-away algorithm that runs n
pub
steps of (one-pass) SGD on
X
pub
[200].
197
The second term in the convex upper bound follows from the $\varepsilon$-LDP (hence semi-LDP) upper bound of [85, Proposition 3]: although their upper bound only holds for $\varepsilon \le 1$, if we replace their LDP randomizer with PrivUnit, then we get the upper bound stated in Theorem 29.
For the second term in the strongly convex upper bound, we run $\varepsilon$-LDP-SGD as in [85] but with their LDP randomizer replaced by PrivUnit. We also return a non-uniform weighted average $\hat{w}_n$ of the iterates $w_1, \ldots, w_n$, as in [219], to obtain
\[
\mathbb{E} F(\hat{w}_n) - F^* \le C \frac{G^2}{\mu n},
\]
where $G^2 = \sup_{t \in [n]} \mathbb{E}\|\mathrm{PrivUnit}(\nabla f(w_t, x_t))\|^2 \lesssim L^2 \Big( 1 + \frac{d}{\varepsilon} + \frac{d}{(e^\varepsilon - 1)^2} \Big)$ [39, Proposition 4]. Thus,
\[
\mathbb{E} F(\hat{w}_n) - F^* \le C' \frac{L^2}{\mu} \bigg( \frac{d}{n \min(\varepsilon, \varepsilon^2)} + \frac{1}{n} \bigg).
\]
This completes the proof.
A.3.2.1 An "Even More Optimal" Semi-LDP Algorithm for SCO
Algorithm 13 Semi-LDP-SGD
1: Input: clip threshold $C > 0$, stepsize $\eta$, PrivUnit parameters $(p, \gamma)$.
2: Initialize $w_0 \in \mathcal{W}$.
3: for $t \in \{0, 1, \cdots, n-1\}$ do
4:   Draw random sample $x_t$ from $X$ without replacement.
5:   if $x_t \in X_{\mathrm{priv}}$ then
6:     $\tilde{g}_t \leftarrow \mathrm{PrivUnit}(\mathrm{clip}_C(\nabla f(w_t, x_t)))$, where PrivUnit is defined in Algorithm 12.
7:   else
8:     $\tilde{g}_t \leftarrow \nabla f(w_t, x_t)$
9:   end if
10:  Update $w_{t+1} := \Pi_{\mathcal{W}}[w_t - \eta \tilde{g}_t]$.
11: end for
12: Output: $\bar{w}_n = \frac{1}{n} \sum_{i=1}^n w_i$.
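As a concrete illustration, the control flow of the semi-LDP-SGD scheme above can be sketched in Python. This is a minimal sketch, not the thesis implementation: spherical Gaussian noise stands in for the PrivUnit randomizer (so the sketch is not $\varepsilon$-semi-LDP as written), the projection onto $\mathcal{W}$ is omitted (assume $\mathcal{W} = \mathbb{R}^d$), and all function and parameter names are ours.

```python
import numpy as np

def clip(g, C):
    """Project a gradient onto the l2-ball of radius C."""
    norm = np.linalg.norm(g)
    return g if norm <= C else g * (C / norm)

def semi_ldp_sgd(X, is_private, grad_f, w0, eta, C, noise_scale, seed=0):
    """One pass of SGD over X without replacement; private samples get a
    clipped, randomized gradient, public samples an exact gradient.

    `noise_scale` controls a Gaussian stand-in for PrivUnit: it mimics an
    unbiased private randomizer for illustration only.
    """
    rng = np.random.default_rng(seed)
    n, w = len(X), np.array(w0, dtype=float)
    order = rng.permutation(n)        # sampling without replacement
    iterates = []
    for t in order:
        g = grad_f(w, X[t])
        if is_private[t]:
            g = clip(g, C) + rng.normal(0.0, noise_scale, size=g.shape)
        w = w - eta * g               # no projection: assume W = R^d
        iterates.append(w.copy())
    return np.mean(iterates, axis=0)  # uniform average of the iterates
```

A usage example with the least-squares gradient $\nabla f(w, x) = w - x$ would call `semi_ldp_sgd(X, is_private, lambda w, x: w - x, np.zeros(d), eta, C, noise_scale)`.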
Proposition 94 (Re-statement of Proposition 30). Let $f \in \mathcal{F}_{L,D}$, let $P$ be any distribution, and let $\varepsilon \le d$. Algorithm 13 is $\varepsilon$-semi-LDP. Further, there is an absolute constant $c$ such that the output $\mathcal{A}(X) = \bar{w}_n$ of Algorithm 13 satisfies
\[
\mathbb{E}_{\mathcal{A}, X \sim P^n} F(\bar{w}_n) - F^* \le c \frac{LD}{\sqrt{n}} \max\Bigg\{ \sqrt{\frac{d}{\min(\varepsilon, \varepsilon^2)}} \sqrt{\frac{n_{\mathrm{priv}}}{n}},\; \sqrt{\frac{n_{\mathrm{pub}}}{n}} \Bigg\}.
\]
Proof. Privacy: Since PrivUnit is an $\varepsilon$-LDP randomizer and we apply PrivUnit to all the private samples $x \in X_{\mathrm{priv}}$, Algorithm 13 is $\varepsilon$-semi-LDP.

Excess risk: Choose $C > L$; i.e., we do not clip, since stochastic subgradients are already uniformly bounded by the $L$-Lipschitz assumption. By the classical analysis of the stochastic subgradient method (see e.g. [46]), we obtain
\[
\mathbb{E} F(\bar{w}_n) - F^* \le \frac{1}{n} \sum_{t=1}^n \mathbb{E}[\tilde{g}_t^T (w_t - w^*)] \le \frac{D^2}{\eta n} + \frac{\eta}{n} \big( n_{\mathrm{priv}} G_a^2 + n_{\mathrm{pub}} G_b^2 \big),
\]
where $G_a^2 := \sup_t \mathbb{E}\|\mathrm{PrivUnit}(\nabla f(w_t, x_t))\|^2$ and $G_b^2 := \sup_t \mathbb{E}\|\nabla f(w_t, x_t)\|^2$. By the uniform Lipschitz assumption, we have $G_b^2 \le L^2$. By [39, Proposition 4], we have $G_a^2 \le c^2 L^2 \frac{d}{\min(\varepsilon, \varepsilon^2)}$ for some absolute constant $c > 0$. Thus, choosing $\eta = \frac{D}{L} \min\Big( \sqrt{\frac{\min(\varepsilon, \varepsilon^2)}{n_{\mathrm{priv}} c^2 d}},\; \frac{1}{\sqrt{n_{\mathrm{pub}}}} \Big)$ yields
\[
\mathbb{E} F(\bar{w}_n) - F^* \le 3 \frac{LD}{\sqrt{n}} \max\Bigg( \sqrt{\frac{c^2 d}{\min(\varepsilon, \varepsilon^2)}} \sqrt{\frac{n_{\mathrm{priv}}}{n}},\; \sqrt{\frac{n_{\mathrm{pub}}}{n}} \Bigg).
\]
This completes the proof.
A.4 Numerical Experiments
Code for all of the experiments is available here: https://github.com/optimization-for-data-driven-science/DP-with-public-data.
A.4.1 Central Semi-DP Linear Regression with Gaussian Data
A.4.1.1 Data Generation
We implement Algorithm 11 on synthetic data designed for a linear regression problem of dimension 2,000, using the squared error loss: $f(w, x) = (\langle w, x^{(1)} \rangle - x^{(2)})^2$, where $x^{(1)} \in \mathbb{R}^d$ denotes the feature vector and $x^{(2)} \in \mathbb{R}$ is the target. Here, $d = 2{,}000$. Our synthetic dataset consists of $n = 75{,}000$ training samples, 18,750 validation samples, and 10,000 test samples. The feature vectors $x_i := x_i^{(1)}$ and the optimal parameter vector $w^*$ are drawn i.i.d. from a multivariate Gaussian distribution $\mathcal{N}(0, I_{2000})$. We generate target values $x_i^{(2)}$ from a Gaussian distribution $\mathcal{N}(\langle w^*, x_i \rangle, 0.5)$. Thus, the optimal linear regression model achieves an optimal mean squared error of 0.25.
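The data-generation process above can be sketched as follows. This is our own sketch (the function name and ridge-free formulation are ours), assuming the 0.5 in $\mathcal{N}(\langle w^*, x_i \rangle, 0.5)$ denotes the standard deviation, consistent with the stated optimal MSE of $0.25$.

```python
import numpy as np

def make_synthetic_linear_data(n_train=75_000, n_val=18_750, n_test=10_000,
                               d=2_000, noise_std=0.5, seed=0):
    """Synthetic linear-regression data as in A.4.1.1: features and w* drawn
    from N(0, I_d), targets from N(<w*, x>, noise_std^2), so the Bayes MSE
    is noise_std^2 = 0.25 with the defaults above.
    """
    rng = np.random.default_rng(seed)
    n = n_train + n_val + n_test
    X = rng.normal(size=(n, d))
    w_star = rng.normal(size=d)
    y = X @ w_star + noise_std * rng.normal(size=n)
    idx_splits = np.split(np.arange(n), [n_train, n_train + n_val])
    return [(X[idx], y[idx]) for idx in idx_splits], w_star
```

With the default arguments this reproduces the stated 75,000/18,750/10,000 split in dimension 2,000; smaller arguments are convenient for quick tests.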
A.4.1.2 Experimental Setup
Our experiments investigate two phenomena: 1) the effect of the ratio $n_{\mathrm{pub}}/n$ on test loss when $\varepsilon \in \{1, 2\}$ is fixed, for values of $n_{\mathrm{pub}}/n$ ranging from 0.03 to 0.95; and 2) the effect of privacy (quantified by $\varepsilon$) on test loss, for fixed $n_{\mathrm{pub}}/n \in \{0.03, 0.04\}$ and varying $\varepsilon \in \{0.1, 0.5, 1.0, 2.0, 4.0, 8.0\}$. We maintain the privacy parameter $\delta$ at a constant value of $10^{-5}$ throughout our experiments. All algorithms undergo extensive hyperparameter tuning using the validation dataset, and the performance of each tuned algorithm is subsequently assessed on the test dataset. (See the "Hyperparameter Tuning" paragraph below for details on the tuning process.) The plots in Section 3.3.3 show the test loss for each algorithm with standard error bars computed over 5 independent trials.
A.4.1.3 Details on Implementations of Algorithms
We compare four different semi-DP algorithms: 1. Throw-away. 2. DP-SGD [1, 71]. 3. PDA-MD [9, Algorithm 1]. 4. Our Algorithm 11, specifically the sample-with-replacement version of our algorithm. The Throw-away algorithm simply returns a minimizer $w_{\mathrm{pub}}$ of the public loss: $w_{\mathrm{pub}} = (X_{\mathrm{pub}}^T X_{\mathrm{pub}})^{-1} X_{\mathrm{pub}}^T y_{\mathrm{pub}}$. DP-SGD adds noise to all (public or private) gradients. We use the state-of-the-art (for image classification) implementation of DP-SGD from [71], adopting the re-parameterization of DP-SGD in [71, Equation 3] to ease hyperparameter tuning. For PDA-MD, we implement [9, Algorithm 1]. We use their exact mirror descent form by multiplying the private gradient by the inverse of the Hessian $X_{\mathrm{pub}}^T X_{\mathrm{pub}}$ [9]. In their original implementation, they added a small constant (the Hessian regularization parameter) times the identity matrix to the Hessian before calculating the inverse, for numerical stability. We set the Hessian regularization constant to 0.01, the same value as in their original implementation.
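A single preconditioned update of the kind described above can be sketched as follows. This is our own minimal sketch of the regularized-Hessian step, not the authors' implementation; `lam` plays the role of the Hessian regularization constant (0.01 in our experiments), and all names are ours.

```python
import numpy as np

def pda_md_step(w, private_grad, X_pub, lam=0.01, eta=1.0):
    """One preconditioned update in the spirit of PDA-MD for linear
    regression: the private gradient is multiplied by the inverse of the
    regularized public Hessian X_pub^T X_pub + lam * I.
    """
    d = X_pub.shape[1]
    H = X_pub.T @ X_pub + lam * np.eye(d)
    # Solve H p = g rather than forming the inverse explicitly.
    p = np.linalg.solve(H, private_grad)
    return w - eta * p
```

Solving the linear system instead of inverting the matrix is numerically preferable; the `lam * I` term is exactly the stabilizer discussed in the text.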
Effect of the Hessian regularization parameter on PDA-MD: Upon reproducing the results of PDA-MD, we discovered a high sensitivity to the choice of Hessian regularization, especially when the ratio of the largest to the smallest eigenvalue of the data matrix is large. We test PDA-MD on the dataset proposed in the original study as well as on our own dataset. The results of these tests are displayed in Figure A.1. We see that PDA-MD is sensitive to the Hessian regularization parameter, which requires extra tuning on complicated tasks. We implement PDA-MD with the optimal regularization value of 0.01.
Details on warm-start: In our evaluation, we adopt a "warm start" strategy for all algorithms: we first find a minimizer $w_{\mathrm{pub}}$ of the public loss, and then initialize the training process at $w_{\mathrm{pub}}$. Note that the minimizer $w_{\mathrm{pub}}$ of the public loss can be obtained via $w_{\mathrm{pub}} = \big( X_{\mathrm{pub}}^T X_{\mathrm{pub}} \big)^{-1} X_{\mathrm{pub}}^T y_{\mathrm{pub}}$.
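The closed-form warm start above can be computed as follows. A minimal sketch: the optional `ridge` argument is our addition (not part of the formula in the text) to guard against a singular Gram matrix.

```python
import numpy as np

def public_warm_start(X_pub, y_pub, ridge=0.0):
    """Closed-form minimizer of the public least-squares loss,
    w_pub = (X_pub^T X_pub)^{-1} X_pub^T y_pub, computed via a linear
    solve rather than an explicit matrix inverse.
    """
    d = X_pub.shape[1]
    gram = X_pub.T @ X_pub + ridge * np.eye(d)
    return np.linalg.solve(gram, X_pub.T @ y_pub)
```

On noiseless data this recovers the generating parameter vector exactly, which makes it a convenient sanity check.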
Clipping public gradients improves performance: We have empirically found that a slight modification of Algorithm 11 offers performance benefits. Namely, it is beneficial to project the public gradients onto the $\ell_2$-sphere of radius $C$; that is, we re-scale the public gradients to have $\ell_2$-norm equal to $C$. Note that semi-DP still holds regardless of whether or not the public gradients are re-scaled. Re-scaling the public gradients helps balance the effects of the public and private gradients on the optimization trajectory. In the original method stated in Algorithm 11, if the unclipped public gradients and the private gradients are of very different magnitudes, then one gradient direction might dominate the optimization procedure, leading to a sub-optimal model. Thus, our public gradient re-scaling technique promotes a more balanced update, which gracefully combines the public and private data in each iteration. To the best of our knowledge, this technique is novel.
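The re-scaling step described above is a one-liner; the sketch below (our names, with an `eps` guard against a zero gradient that the text does not discuss) makes the distinction from clipping explicit.

```python
import numpy as np

def rescale_to_sphere(g, C, eps=1e-12):
    """Re-scale a public gradient to have l2-norm exactly C, i.e. project
    it onto the l2-SPHERE of radius C. Unlike clipping (projection onto
    the l2-ball), gradients smaller than C are scaled UP as well as down.
    """
    norm = np.linalg.norm(g)
    return g * (C / max(norm, eps))
```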
Privacy accounting: We compute the privacy loss of each algorithm using the moments accountant of Abadi et al. [1]. For a fixed clip threshold $C$, the privacy level $(\varepsilon, \delta)$ is determined by three parameters: the variance $\sigma^2$ of the privacy noise, the private sampling ratio $q := K_{\mathrm{priv}}/n_{\mathrm{priv}}$, and the total number of iterations $T$. In our setting, the privacy parameters $(\varepsilon, \delta)$ are given, and we use the moments accountant to compute an approximation of $\sigma^2$ for any choice of hyperparameters $T$ and $q$. We utilize the implementation of the privacy accountant provided by the PyTorch privacy framework, Opacus.
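For intuition about what the accountant computes, the sketch below calibrates noise for a single release of a sensitivity-$C$ statistic with the classical Gaussian-mechanism bound $\sigma = C\sqrt{2\ln(1.25/\delta)}/\varepsilon$ (valid for $\varepsilon \le 1$). This is a simplified stand-in of our own, not the moments accountant used in the experiments, which additionally accounts for subsampling with ratio $q$ and composition over $T$ iterations and yields much smaller $\sigma$.

```python
import math

def gaussian_sigma_single_release(eps, delta, C=1.0):
    """Classical Gaussian-mechanism calibration for ONE release of an
    l2-sensitivity-C statistic: sigma = C * sqrt(2 ln(1.25/delta)) / eps.
    Only a pedagogical stand-in for the moments accountant.
    """
    if not (0 < eps <= 1):
        raise ValueError("classical bound requires 0 < eps <= 1")
    return C * math.sqrt(2 * math.log(1.25 / delta)) / eps
```

As expected, a smaller $\varepsilon$ (more privacy) requires a larger $\sigma$.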
A.4.1.4 Hyperparameter Tuning
The results reported are for each algorithm with the hyperparameters (step size, number of epochs, and $\alpha$ in Semi-DP) that attain the best performance for a given experiment. For simplicity and computational efficiency, we keep the clip threshold at $C = 1$ for all of our experiments; preliminary experiments found that a clip threshold of $C = 1$ worked well for all algorithms. To tune Algorithm 11, we use Bayesian hyperparameter optimization: we build a probability model of the objective function and use it to select the most promising hyperparameters to evaluate on the true objective function. We find that both DP-SGD and PDA-MD are relatively insensitive to their hyperparameters; therefore, to save time, we tune them with grid search. We also randomly selected some setups and applied both tuning methods; it turns out that Bayesian hyperparameter optimization and grid search yield very similar results. See Tables A.1 and A.2 in Section A.4.4 for detailed descriptions of the hyperparameter search grids.
Figure A.1: MSE loss (train, validation, and test) vs. PDA-MD regularization parameter. Left: results on the dataset proposed in [9]. Right: results on our dataset. PDA-MD is sensitive to the Hessian regularization parameter, which requires extra tuning on complicated tasks.
A.4.2 Central Semi-DP Logistic Regression with CIFAR-10
We evaluate the performance of Algorithm 11 in training a logistic regression model to classify images in the CIFAR-10 dataset [156]. We compare Algorithm 11 against DP-SGD and throw-away. Note that PDA-MD does not have an efficient implementation for logistic regression, because there is no closed-form update rule for the mirror descent step; thus, we do not compare against PDA-MD.

We flatten the images and feed them to the logistic (softmax) model. Cross-entropy loss is used here; therefore, the model is convex. Implementations of the algorithms are similar to the linear regression case. However, in this case, throw-away consists of running non-private SGD on $X_{\mathrm{pub}}$ to find an approximate minimizer $w_{\mathrm{pub}}$. For all three algorithms, we fix the batch size to 256 and the privacy parameter to $\delta = 10^{-6}$. The remaining hyperparameters are tuned by grid search, using the same grid for each algorithm. Also, in contrast to the linear regression experiments, we do not use warm-start for any of the algorithms, and we use GD instead of SGD for all algorithms in these experiments. See Table A.3 in Section A.4.4 for detailed descriptions of the hyperparameter search grids.
Results are reported in Figures A.2, A.3, A.4, and A.5. Our Algorithm 11 always outperforms the baselines in terms of test error. The advantage of Algorithm 11 over these baselines is even more pronounced than it was for linear regression, which might be partially due to the fact that we did not use warm-start.
Figure A.2: Test error rate vs. $n_{\mathrm{pub}}/n$ for Algorithm 11, DP-SGD, and Throw-away. Left: $\varepsilon = 0.1$. Middle: $\varepsilon = 0.5$. Right: $\varepsilon = 1.0$.
Figure A.3: Test error rate vs. $n_{\mathrm{pub}}/n$. Left: $\varepsilon = 2.0$. Middle: $\varepsilon = 4.0$. Right: $\varepsilon = 8.0$.
Figure A.4: Test error rate vs. $\varepsilon$. Left: $n_{\mathrm{pub}}/n = 0.01$. Middle: $n_{\mathrm{pub}}/n = 0.02$. Right: $n_{\mathrm{pub}}/n = 0.04$.
Figure A.5: Test error rate vs. $\varepsilon$. Left: $n_{\mathrm{pub}}/n = 0.1$. Middle: $n_{\mathrm{pub}}/n = 0.2$. Right: $n_{\mathrm{pub}}/n = 0.3$.
A.4.3 Semi-Local DP Linear Regression with Gaussian Data
In this section, we empirically evaluate our semi-LDP Algorithm 13 on linear regression. While running the experiments, we observed that the noisy clipped private gradients $\mathrm{PrivUnit}(\mathrm{clip}_C(\nabla f(w_t, x_t)))$ and the public gradients $\nabla f(w_t, x_t)$ had very different magnitudes. This made it challenging for our algorithm to utilize both the public and private data in the most effective way possible. To overcome this difficulty, we scale $\nabla f(w_t, x_t)$ so that it is of a similar magnitude as $\mathrm{PrivUnit}(\mathrm{clip}_C(\nabla f(w_t, x_t)))$. In more detail, at the beginning of the algorithm, we randomly take the first private sample $x_0 \in X_{\mathrm{priv}}$, apply the update rule, and record the norm of the noisy gradient. That is, at Line 6 of Algorithm 13, we additionally perform once $L_{\mathrm{pub}} \leftarrow \|\mathrm{PrivUnit}(\mathrm{clip}_C(\nabla f(w_0, x_0)))\|_2$. Then, for all the public samples later on, we scale their gradients to have norm $L_{\mathrm{pub}}$. That is, at Line 8 of Algorithm 13, we set $\tilde{g}_t \leftarrow L_{\mathrm{pub}} \frac{\nabla f(w_t, x_t)}{\|\nabla f(w_t, x_t)\|_2}$.
We use the same experimental setup and data generation process as described above in Appendix A.4.1, but with $n_{\mathrm{pub}}/n \in \{0.03, 0.05, 0.1, 0.2\}$ and $\varepsilon \in \{16.0, 32.0, 64.0, 128.0\}$. We compare three linear-time semi-LDP algorithms: 1) LDP-SGD [85], which is optimal up to constants for semi-LDP SCO (by Theorem 29); 2) Throw-away, which runs SGD on the public data; 3) our Algorithm 13, which has smaller worst-case error than LDP-SGD in theory, according to Proposition 30. For fairness of comparison, we run each of the three algorithms for the same total number of iterations (stochastic gradient evaluations) $T$, with $T = n = n_{\mathrm{priv}} + n_{\mathrm{pub}}$ (i.e., one epoch). Due to the small number of iterations, warm-start is used; hence, all three algorithms start from $w^*_{\mathrm{pub}}$, a minimizer of the public loss. The range of the hyperparameter search is detailed in Table A.4 in Section A.4.4.
We use PrivUnitG [20] instead of PrivUnit in Line 6 of our implementation of Algorithm 13. PrivUnitG is an approximation of PrivUnit that can be easier to implement and tune. We also use PrivUnitG in our implementation of LDP-SGD, even though [85] proposes using a different, suboptimal (in terms of constants) LDP randomizer. We make this improvement to LDP-SGD so that the only difference between LDP-SGD and Semi-LDP-SGD occurs in Lines 7-8 of Algorithm 13. This allows us to isolate the role of semi-LDP vs. LDP in determining the accuracy of the trained model.

Results are shown in Figures A.6, A.7, A.8, and A.9. As we can see, Semi-LDP-SGD consistently achieves the smallest test error across the different experimental settings.
Figure A.6: Test loss vs. $n_{\mathrm{pub}}/n$ for Algorithm 13, LDP-SGD, and Throw-away. Left: $\varepsilon = 16$. Right: $\varepsilon = 32$.
Figure A.7: Test loss vs. $n_{\mathrm{pub}}/n$. Left: $\varepsilon = 64$. Right: $\varepsilon = 128$.
Figure A.8: Test loss vs. $\varepsilon$. Left: $n_{\mathrm{pub}}/n = 0.03$. Right: $n_{\mathrm{pub}}/n = 0.05$.
Figure A.9: Test loss vs. $\varepsilon$. Left: $n_{\mathrm{pub}}/n = 0.1$. Right: $n_{\mathrm{pub}}/n = 0.2$.
A.4.4 Hyperparameter Search Grids
hyperparameter   private_epochs        learning-rate
value            10, 15, 20, ..., 55   0, 1, 2, ..., 160

Table A.1: Grid used for hyperparameter search for PDA-MD and DP-SGD in Appendix A.4.1.
hyperparameter   private_epochs   learning-rate   α
value            [10, 55]         [0, 160]        [0, 1]

Table A.2: Range used for Bayesian hyperparameter search for Semi-DP in Appendix A.4.1. We only allow integer values for private_epochs and learning-rate.
hyperparameter   value
iterations       2000, 4000, 6000, ..., 20000
learning-rate    0.005, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 5
α                0, 0.1, 0.2, ..., 1

Table A.3: Grid used for hyperparameter search for DP-SGD and Semi-DP in Appendix A.4.2.
hyperparameter   learning-rate     p in PrivUnitG
value            [10^{-5}, 1]      [0, 1]

Table A.4: Range used for Bayesian hyperparameter search for LDP and Semi-LDP in Appendix A.4.3.
Appendix B
Appendix to Chapter 3
B.1 Private Convex Federated Learning Without a Trusted Server
B.1.1 Thorough Discussion of Related Work
Federated Learning: In the absence of differential privacy constraints, federated learning (FL) has received a lot of attention from researchers in recent years. Among these, the most relevant works to us are [154, 167, 143, 264, 265, 272], which prove bounds on the convergence rates of FL algorithms. From an algorithmic standpoint, all of these works propose and analyze either Minibatch SGD (MB-SGD), FedAvg/Local SGD [188], or an extension or accelerated/variance-reduced variation of one of these (e.g., SCAFFOLD [143]). Notably, [265] proves tight upper and lower bounds that establish the near-optimality of accelerated MB-SGD for the heterogeneous SCO problem with non-random $M_r = M = N$ in a fairly wide parameter regime. [175, 240, 199] provide convergence results with random connectivity graphs. Our upper bounds describe the effect of the mean of $1/M_r$ on DP FL.
DP Optimization: In the centralized setting, DP ERM and SCO are well understood for convex and strongly convex loss functions [32, 255, 29, 95, 180, 19]. Tight excess risk bounds for local DP SCO were provided in [85, 230, 83].
A few works have also considered shuffle DP ERM and SCO. [111, 90] showed that the optimal CDP convex ERM rate can be attained in the lower-trust (relative to the central model) shuffle model of DP. The main difference between our treatment of shuffle DP and [55] is that our results are much more general than [55]. For example, [55] does not consider FL: instead, they consider the simpler problem of stochastic convex optimization (SCO). SCO is a simple special case of FL in which each silo has only $n = 1$ sample. Additionally, [55] only considers the i.i.d. case, but not the more challenging non-i.i.d. case. Further, [55] assumes perfect communication ($M = N$), while we also analyze the case when $M < N$ and some silos are unavailable in certain rounds (e.g., due to internet issues). Note that our bounds in Theorem 38 recover the results in [55] in the special case considered in their work.
DP Federated Learning: More recently, there have been many proposed attempts to ensure the privacy of individuals' data during and after the FL process. Some of these have used secure multi-party computation (MPC) [53, 184], but this approach leaves users vulnerable to inference attacks on the trained model and does not provide the rigorous guarantee of DP. Others [189, 107, 130, 104, 259, 284, 164, 110, 203] have used user-level DP or central DP (CDP), which rely on a trusted third party, or hybrid DP/MPC approaches [130, 245]. The work of [130] proves CDP empirical risk bounds and high-probability guarantees on the population loss when the data is i.i.d. across silos. However, ISRL-DP and SDP are not considered, nor is heterogeneous (non-i.i.d.) FL. It is also worth mentioning that [107] considers random $M_r$ but does not prove any bounds.
Despite this progress, prior to our present work, very little was known about the excess risk potential of ISRL-DP FL algorithms, except in the two extreme corner cases of $N = 1$ and $n = 1$. When $N = 1$, ISRL-DP and CDP are essentially equivalent; tight ERM [32] and i.i.d. SCO [29, 95] bounds are known for this case. In addition, for LDP i.i.d. SCO when $n = 1$ and $M_r = N$, [85] establishes the minimax optimal rate for the class of sequentially interactive algorithms and non-strongly convex loss functions. To the best of our knowledge, all previous works examining the general ISRL-DP FL problem with arbitrary $n, M, N \ge 1$ either focus on ERM and/or do not provide excess risk bounds that scale with both $M$ and $n$, making the upper bounds provided in the present work significantly tighter. Furthermore, none of the existing works on ISRL-DP FL provide lower bounds, or upper bounds for the case of random $M_r$. We discuss each of these works in turn below:
[246] gives an ISRL-DP FL algorithm, but no risk bounds are provided in their work.
The works of [127] and [126] use ISRL-DP ADMM algorithms for smooth convex federated ERM. However, the utility bounds in their works are stated in terms of an average of the silo functions evaluated at different points, so it is not clear how to relate their results to the standard performance measure for learning (which we consider in this paper): expected excess risk at the point $\hat{w}$ output by the algorithm. Also, no lower bounds are provided for their performance measure. Therefore, the sub-optimality gap of their result is unclear.
[267, Theorem 2] provides an $(\varepsilon, 0)$-ISRL-DP ERM bound for fixed $M_r = M = N$ of $O\big( \frac{\kappa L^2}{\mu} \frac{d}{nN\varepsilon^2} + \varepsilon \big)$ for $\mu$-strongly convex, $\beta$-smooth $f$ with condition number $\kappa = \beta/\mu$, where $1/\varepsilon^2$ is an average of the $1/\varepsilon_i^2$. The additive $\varepsilon$ term is clearly problematic: e.g., if $\varepsilon = \Theta(1)$, then the bound becomes trivial. Ignoring this term, the first term in their bound is still looser than the bound that we provide in Theorem 37. Namely, for $\varepsilon = \varepsilon_0$, our bound in part 2 of Theorem 37 is tighter by a factor of $O\big( \frac{\ln(1/\delta_0)}{\kappa n} \big)$ and does not require $\beta$-smoothness of the loss. Additionally, the bounds in [267] require $R$ "large enough" and do not come with communication complexity guarantees. In the convex case, the ISRL-DP ERM bound reported in [267, Theorem 3] is difficult to interpret because the unspecified "constants" in the upper bound on $\mathbb{E}\,\hat{F}_X(\hat{w}_T) - \hat{F}^*_X$ are said to be allowed to depend on $R$.
[260, Theorems 2-3] provide convergence rates for smooth convex Polyak-Łojasiewicz (a generalization of strong convexity) ISRL-DP ERM, which are complicated non-monotonic functions of $R$. Since they do not prescribe a choice of $R$, it is unclear what excess loss and communication complexity bounds are attainable with their algorithm.
[78] proposes an ISRL-DP Inexact Alternating Minimization Algorithm (IAMA) with Laplace noise; their result [78, Theorem 3.11] gives a convergence rate for smooth, strongly convex ISRL-DP FL of order $O\Big( \frac{\Theta \sum_{i \in [M]} \sigma_i^2}{R} \Big)$, ignoring smoothness and strong convexity factors, where $\Theta$ is a parameter that is only upper bounded in special cases (e.g., quadratic objective). Thus, the bounds given in [78] are not complete for general strongly convex loss functions. Even in the special cases where a bound for $\Theta$ is provided, our bounds are tighter. Assuming that $\Theta = 1$ and (for simplicity of exposition) that parameters are the same across silos, [78, Theorem 3.11] implies taking $\sigma^2 = 1/\varepsilon^2$ to ensure $(\varepsilon, 0)$-ISRL-DP. The resulting convergence rate is then $O(M/\varepsilon^2)$, which does not scale with $n$ and is increasing with $M$. Also, the dependence of their rate on the dimension $d$ is unclear, as it does not appear explicitly in their theorem.∗ Ignoring this issue, the dependence on $M$ and $n$ in the bound of [78] is still looser than all of the excess loss bounds that we provide in the present work.
[283] and [14] apply the ISRL-DP FL framework to the Internet of Things, and [226] uses noisy (full-batch) GD for ISRL-DP wireless channels in the FL (smooth, strongly convex) ERM setting. The bounds in these works do not scale with the number of data points $n$, however (only with the number of silos $N$). Therefore, the bounds that we provide in the present work are tighter, and apply to general convex FL problems beyond wireless channels and the Internet of Things.
B.1.2 Rigorous Definition of Inter-Silo Record-Level DP
Recall:

Definition 95 (Differential Privacy). Let $\varepsilon \ge 0$, $\delta \in [0, 1)$. A randomized algorithm $\mathcal{A}: \mathcal{X} \to \mathcal{W}$ is $(\varepsilon, \delta)$-DP if for all $\rho$-adjacent data sets $X, X' \in \mathcal{X}$ and all measurable subsets $S \subset \mathcal{W}$, we have
\[
\mathbb{P}(\mathcal{A}(X) \in S) \le e^\varepsilon \mathbb{P}(\mathcal{A}(X') \in S) + \delta. \tag{B.1}
\]
∗ Note that in order for their result to be correct, by [32] when $N = M = 1$, their bound must scale at least as $d^2/(\varepsilon^2 n^2)$, unless their bound is trivial ($\ge LD$).
If (B.1) holds for all measurable subsets $S$, then we denote this property by $\mathcal{A}(X) \simeq_{(\varepsilon, \delta)} \mathcal{A}(X')$. An $R$-round fully interactive randomized algorithm $\mathcal{A}$ for FL is characterized in every round $r \in [R]$ by $N$ local silo functions called randomizers $\mathcal{R}^{(i)}_r: \mathcal{Z}^{(r-1) \times N} \times \mathcal{X}^{n_i} \to \mathcal{Z}$ ($i \in [N]$) and an aggregation mechanism. (See below for an example that further clarifies the terminology used in this paragraph.) The randomizers send messages $Z^{(i)}_r := \mathcal{R}^{(i)}_r(Z_{1:r-1}, X_i)$ to the server or other silos. The messages $Z^{(i)}_r$ may depend on silo data $X_i$ and the outputs $Z_{1:r-1} := \{Z^{(j)}_t\}_{j \in [N], t \in [r-1]}$ of silos' randomizers in prior rounds.† Then, the server (or silos, for peer-to-peer FL) updates the global model. We consider the output of $\mathcal{A}: \mathcal{X} \to \mathcal{Z}^{R \times N}$ to be the transcript of all silos' communications: i.e., the collection of all $N \times R$ messages $\{Z^{(i)}_r\}$. Algorithm $\mathcal{A}$ is $(\varepsilon_0, \delta_0)$-ISRL-DP if for all silos $i$, the full transcript $\{Z^{(i)}_r\}_{r \in [R]}$ is $(\varepsilon_0, \delta_0)$-DP, for any fixed settings of the other silos' messages and data. More precisely:

Definition 96 (Inter-Silo Record-Level Differential Privacy). Let $\rho_i(X_i, X'_i) := \sum_{j=1}^n \mathbb{1}_{\{x_{i,j} \ne x'_{i,j}\}}$, $i \in [N]$. A randomized algorithm $\mathcal{A}$ is $(\varepsilon_0, \delta_0)$-ISRL-DP if for all silos $i$ and all $\rho_i$-adjacent $X_i, X'_i$,
\[
\big(\mathcal{R}^{(i)}_1(X_i), \mathcal{R}^{(i)}_2(Z_1, X_i), \cdots, \mathcal{R}^{(i)}_R(Z_{1:R-1}, X_i)\big) \simeq_{(\varepsilon_0, \delta_0)} \big(\mathcal{R}^{(i)}_1(X'_i), \mathcal{R}^{(i)}_2(Z'_1, X'_i), \cdots, \mathcal{R}^{(i)}_R(Z'_{1:R-1}, X'_i)\big),
\]
where $Z_r := \{\mathcal{R}^{(i)}_r(Z_{1:r-1}, X_i)\}_{i=1}^N$ and $Z'_r := \{\mathcal{R}^{(i)}_r(Z'_{1:r-1}, X'_i)\}_{i=1}^N$.
Example clarifying the terminology used in the definition of ISRL-DP given above: Assume all $M_r = N$ silos are available in every round, and consider $\mathcal{A}$ to be the minibatch SGD algorithm $w_{r+1} := w_r - \eta g_r$, where $g_r = \frac{1}{NK} \sum_{i=1}^N \sum_{j=1}^K \nabla f(w_r, x^r_{i,j})$ for $\{x^r_{i,j}\}_{j=1}^K$ drawn randomly from $X_i$. Then the randomizers $\mathcal{R}^{(i)}_r: \mathcal{Z}^{(r-1) \times N} \times \mathcal{X}^{n_i} \to \mathcal{Z}$ of silo $i$ are its stochastic gradients: $Z^{(i)}_r = \mathcal{R}^{(i)}_r(Z_{1:r-1}, X_i) = \frac{1}{K} \sum_{j=1}^K \nabla f(w_r, x^r_{i,j})$ for $\{x^r_{i,j}\}_{j=1}^K$ drawn randomly from $X_i$. Note that the output of these randomizers depends on $w_r$, which is a function of the previous stochastic gradients of all silos, $Z_{1:r-1} = \{Z^{(i)}_t\}_{i \in [N], t \in [r-1]}$. The aggregation mechanism outputs $g_r$ by simply averaging the outputs of the silos' randomizers: $g_r = \frac{1}{N} \sum_{i=1}^N Z^{(i)}_r$.

† We assume that $\mathcal{R}^{(i)}_r(Z_{1:r-1}, X_i)$ is not dependent on $X_j$ ($j \ne i$) given $Z_{1:r-1}$ and $X_i$; that is, the distribution of $\mathcal{R}^{(i)}_r$ is completely characterized by $Z_{1:r-1}$ and $X_i$. Therefore, the randomizers of silo $i$ cannot "eavesdrop" on another silo's data, which aligns with the local data principle of FL. We allow $Z^{(i)}_t$ to be empty/zero if silo $i$ does not output anything to the server in round $t$.
We view the output of $\mathcal{A}: \mathcal{X} \to \mathcal{Z}^{R \times N}$ to be the transcript of all silos' communications, which in this case is the collection of all $N \times R$ stochastic minibatch gradients $\{Z^{(i)}_r\}_{i \in [N], r \in [R]}$. Note that in practice, the algorithm $\mathcal{A}$ does not truly output a list of gradients, but rather outputs $\hat{w} \in \mathcal{W}$, some convex combination of the iterates $\{w_r\}_{r \in [R]}$, which themselves are functions of $\{Z^{(i)}_r\}_{i \in [N], r \in [R]}$. However, by the post-processing property of DP [88, Proposition 2.1], the privacy of $\hat{w}$ is guaranteed if the silo transcripts are DP. Thus, here we simply consider the output of $\mathcal{A}$ to be the silo transcripts. Clearly, minibatch SGD is not ISRL-DP. To make it ISRL-DP, it is necessary to introduce additional randomness to ensure that each silo's collection of stochastic gradients is DP, conditional on the messages and data of all other silos. For example, Noisy MB-SGD is an ISRL-DP variation of (projected) minibatch SGD.
B.1.3 Relationships between Notions of DP
B.1.3.1 ISRL-DP is stronger than CDP
Assume A is pε
0
,δ 0
q-ISRL-DP. Let X,X
1
be adjacent databases in the CDP sense; i.e. there exists a
unique i P rNs, j P rns such that x
i,j
‰ x
1
i,j
. Then for all r P rRs, l ‰ i, X
l
“ X
1
l
, so the condi-
tional distributions ofR
plq
r
pZ
1:r´1
,X
l
q andR
plq
r
pZ
1
1:r´1
,X
1
l
q given Z
pl
1
‰lq
1:r´1
“ z
pl
1
‰lq
1:r´1
are identical for all
z
pl
1
‰lq
1:r´1
PZ
pr´1qˆpN´1q
. Integrating both sides of this equality with respect to the joint density ofZ
pl
1
‰lq
1:r´1
shows thatR
plq
r
pZ
1:r´1
,X
l
q“R
plq
r
pZ
1
1:r´1
,X
1
l
q (unconditional equality of distributions). Hence the full
transcript of silol is (unconditionally)p0,0q-CDP for alll‰ i. A similar argument (using the inequality
3.3 instead of equality) shows that siloi’s full transcript is unconditionallypε
0
,δ 0
q-CDP. Therefore, by the
basic composition theorem for DP [88], the full combined transcript of allN silos ispε
0
,δ 0
q-CDP, which
implies thatA ispε
0
,δ 0
q-CDP.
Conversely,pε,δ q-CDP does not implypε
1
,δ 1
q-ISRL-DP for any ε
1
ą 0,δ 1
Pp0,1q. This is because a
CDP algorithm may send non-private updates to the server and rely on the server to randomize, completely
violating the requirement of LDP.
214
B.1.3.2 ISRL-DP Implies User-Level DP for small $(\varepsilon_0, \delta_0)$:
Precisely, we claim that if $\mathcal{A}$ is $(\varepsilon_0, \delta_0)$-ISRL-DP, then $\mathcal{A}$ is $(n\varepsilon_0, n e^{(n-1)\varepsilon_0}\delta_0)$ user-level DP; but conversely, $(\varepsilon, \delta)$-user-level DP does not imply $(\varepsilon', \delta')$-ISRL-DP for any $\varepsilon' > 0$, $\delta' \in (0,1)$. The first part of the claim is due to group privacy (Proposition 12) and the argument used above in Appendix B.1.3.1 to get rid of the "conditional". The second part of the claim is true because a user-level DP algorithm may send non-private updates to the server and rely on the server to randomize, completely violating the requirement of ISRL-DP.

Therefore, if $\varepsilon_0 = O(1/n)$ and $\delta_0 \ll 1/n$, then any $(\varepsilon_0, \delta_0)$-ISRL-DP algorithm also provides a strong user-level privacy guarantee.
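The group-privacy conversion above is a direct computation; the following sketch (the function name and the numeric values are illustrative, not from the text) evaluates the user-level parameters $(n\varepsilon_0, n e^{(n-1)\varepsilon_0}\delta_0)$:

```python
import math

def isrl_to_user_level(eps0: float, delta0: float, n: int):
    """Convert a per-record (eps0, delta0)-ISRL-DP guarantee into the
    user-level guarantee (n*eps0, n*exp((n-1)*eps0)*delta0) via group privacy."""
    return n * eps0, n * math.exp((n - 1) * eps0) * delta0

# With eps0 = O(1/n), the user-level epsilon stays O(1):
n = 1000
eps0, delta0 = 1.0 / n, 1e-9
eps_user, delta_user = isrl_to_user_level(eps0, delta0, n)
# eps_user = n * (1/n) = 1.0; delta_user = n * e^{(n-1)/n} * delta0, still << 1/n
```

Note how the exponential factor $e^{(n-1)\varepsilon_0}$ stays bounded precisely because $\varepsilon_0 = O(1/n)$; for constant $\varepsilon_0$ it would blow up.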
B.1.4 Supplementary Material for Upper Bounds in Section 3.2.1
B.1.4.1 Pseudocode of Noisy MB-SGD
We present pseudocode for Noisy MB-SGD in Algorithm 14:
Algorithm 14 Noisy ISRL-DP MB-SGD
1: Input: $N, d, R \in \mathbb{N}$, $\sigma^2 \geq 0$, $X_i \in \mathcal{X}^{n_i}$ for $i \in [N]$, loss function $f(w, x)$, $K \in [n]$, $\{\eta_r\}_{r \in [R]}$, and $\{\gamma_r\}_{r \in [R]}$.
2: Initialize $w_0 \in \mathcal{W}$.
3: for $r \in \{0, 1, \cdots, R-1\}$ do
4:   for $i \in S_r$ in parallel do
5:     Server sends global model $w_r$ to silo $i$.
6:     Silo $i$ draws $K$ samples $x^r_{i,j}$ uniformly from $X_i$ (for $j \in [K]$) and noise $u_i \sim N(0, \sigma^2 \mathbf{I}_d)$.
7:     Silo $i$ computes $\widetilde{g}^r_i := \frac{1}{K}\sum_{j=1}^{K}\nabla f(w_r, x^r_{i,j}) + u_i$ and sends it to the server.
8:   end for
9:   Server aggregates $\widetilde{g}_r := \frac{1}{M_r}\sum_{i \in S_r}\widetilde{g}^r_i$.
10:  Server updates $w_{r+1} := \Pi_{\mathcal{W}}\left[w_r - \eta_r\widetilde{g}_r\right]$.
11: end for
12: Output: $\widehat{w}_R = \frac{1}{\Gamma_R}\sum_{r=0}^{R-1}\gamma_r w_r$, where $\Gamma_R := \sum_{r=0}^{R-1}\gamma_r$.
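To make the data flow of Algorithm 14 concrete, here is a minimal one-dimensional simulation. It is an illustrative sketch only: the quadratic loss, full silo participation ($M_r = N$), and all numeric values are my own choices, not part of the analysis.

```python
import random

def noisy_mb_sgd(silos, R, K, sigma, eta, radius, rng):
    """Minimal 1-D sketch of Algorithm 14 (Noisy ISRL-DP MB-SGD) for the toy
    loss f(w, x) = 0.5 * (w - x)**2, so that grad_w f(w, x) = w - x.
    Every silo participates in every round (M_r = N), and the projection
    Pi_W maps onto the interval [-radius, radius]."""
    w, iterates = 0.0, []
    for _ in range(R):
        noisy_grads = []
        for X_i in silos:
            batch = [rng.choice(X_i) for _ in range(K)]       # uniform, with replacement
            grad = sum(w - x for x in batch) / K              # local minibatch gradient
            noisy_grads.append(grad + rng.gauss(0.0, sigma))  # silo-level Gaussian noise
        w -= eta * sum(noisy_grads) / len(silos)              # server aggregates and steps
        w = max(-radius, min(radius, w))                      # projection Pi_W
        iterates.append(w)
    return sum(iterates) / len(iterates)                      # gamma_r = 1/R averaging

rng = random.Random(0)
silos = [[rng.gauss(1.0, 0.1) for _ in range(50)] for _ in range(10)]
w_hat = noisy_mb_sgd(silos, R=300, K=10, sigma=0.05, eta=0.1, radius=5.0, rng=rng)
# w_hat lands near the empirical minimizer (the data mean, about 1.0),
# since averaging the noisy silo gradients cancels most of the added noise
```

The key structural point the sketch illustrates is that the noise is added by each silo before transmission (the ISRL-DP requirement), not by the server.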
B.1.4.2 Proof of Theorem 34
We begin by proving Theorem 34 for $\beta$-smooth $f(\cdot, x)$ and then extend our result to the non-smooth case via Nesterov smoothing [201]. We will require some preliminaries. We begin with the following definition from [45]:

Definition 97. (Uniform Stability) A randomized algorithm $\mathcal{A}: \mathcal{X}^{\widetilde{N}} \to \mathcal{W}$ is said to be $\alpha$-uniformly stable (w.r.t. loss function $f: \mathcal{W}\times\mathcal{X} \to \mathbb{R}$) if for any pair of adjacent data sets $X, X' \in \mathcal{X}^{\widetilde{N}}$, $|X \Delta X'| \leq 2$, we have
$$\sup_{x \in \mathcal{X}}\,\mathbb{E}_{\mathcal{A}}\left[f(\mathcal{A}(X), x) - f(\mathcal{A}(X'), x)\right] \leq \alpha.$$
In our context, $\widetilde{N} = N \times n$. The following well-known lemma allows us to easily pass from empirical risk to population loss when the algorithm in question is uniformly stable:
Lemma 98. Let $\mathcal{A}: \mathcal{X}^{\widetilde{N}} \to \mathcal{W}$ be $\alpha$-uniformly stable w.r.t. convex loss function $f: \mathcal{W}\times\mathcal{X} \to \mathbb{R}$. Let $\mathcal{P}$ be any distribution over $\mathcal{X}$ and let $X \sim \mathcal{P}^{\widetilde{N}}$. Then the excess population loss is upper bounded by the excess expected empirical loss plus $\alpha$:
$$\mathbb{E}\left[F(\mathcal{A}(X), \mathcal{P}) - F^*\right] \leq \alpha + \mathbb{E}\Big[\widehat{F}_X(\mathcal{A}(X)) - \min_{w\in\mathcal{W}}\widehat{F}_X(w)\Big],$$
where the expectations are over both the randomness in $\mathcal{A}$ and the sampling of $X \sim \mathcal{P}^{\widetilde{N}}$. Here we denote the empirical loss by $\widehat{F}_X(w)$ and the population loss by $F(w, \mathcal{P})$ for additional clarity, and $F^* := \min_{w\in\mathcal{W}}F(w, \mathcal{P}) = \min_{w\in\mathcal{W}}F(w)$.

Proof. By Theorem 2.2 in [116],
$$\mathbb{E}\left[F(\mathcal{A}(X), \mathcal{P}) - \widehat{F}_X(\mathcal{A}(X))\right] \leq \alpha.$$
Hence
$$\mathbb{E}\left[F(\mathcal{A}(X), \mathcal{P}) - F^*\right] = \mathbb{E}\Big[F(\mathcal{A}(X), \mathcal{P}) - \widehat{F}_X(\mathcal{A}(X)) + \widehat{F}_X(\mathcal{A}(X)) - \min_{w\in\mathcal{W}}\widehat{F}_X(w) + \min_{w\in\mathcal{W}}\widehat{F}_X(w) - F^*\Big] \leq \alpha + \mathbb{E}\Big[\widehat{F}_X(\mathcal{A}(X)) - \min_{w\in\mathcal{W}}\widehat{F}_X(w)\Big],$$
since $\mathbb{E}\min_{w\in\mathcal{W}}\widehat{F}_X(w) \leq \min_{w\in\mathcal{W}}\mathbb{E}\big[\widehat{F}_X(w)\big] = \min_{w\in\mathcal{W}}F(w, \mathcal{P}) = F^*$. $\square$
The next step is to bound the uniform stability of Algorithm 14.

Lemma 99. Let $f(\cdot, x)$ be a convex, $L$-Lipschitz, and $\beta$-smooth loss for all $x \in \mathcal{X}$. Let $n := \min\{n_i\}_{i=1}^{N}$. Then under Assumption 3, Noisy MB-SGD with constant stepsize $\eta \leq \frac{1}{\beta}$ and averaging weights $\gamma_r = \frac{1}{R}$ is $\alpha$-uniformly stable with respect to $f$ for $\alpha = \frac{2L^2 R\eta}{nM}$. If, in addition, $f(\cdot, x)$ is $\mu$-strongly convex for all $x \in \mathcal{X}$, then Noisy MB-SGD with constant step size $\eta_r = \eta \leq \frac{1}{\beta}$ and any averaging weights $\{\gamma_r\}_{r=1}^{R}$ is $\alpha$-uniformly stable with respect to $f$ for $\alpha = \frac{4L^2}{\mu(Mn - 1)}$ (assuming $\min\{M, n\} > 1$).
Proof of Lemma 99. The proof of the convex case extends techniques and arguments used in the proofs of [116, Theorem 3.8], [98, Lemma 4.3], and [29, Lemma 3.4] to the ISRL-DP FL setting; the strongly convex bound requires additional work to get a tight bound. For now, fix the randomness of $\{M_r\}_{r\geq 0}$. Let $X, X' \in \mathcal{X}^{\widetilde{N}}$ be two data sets, denoted $X = (X_1, \cdots, X_N)$ for $X_i \in \mathcal{X}^n$ for all $i \in [N]$ and similarly for $X'$, and assume $|X \Delta X'| = 2$. Then there is a unique $a \in [N]$ and $b \in [n]$ such that $x_{a,b} \neq x'_{a,b}$. For $t \in \{0, 1, \cdots, R\}$, denote the $t$-th iterates of Algorithm 14 on these two data sets by $w_t = w_t(X)$ and $w'_t = w_t(X')$ respectively. We claim that
$$\mathbb{E}\big[\|w_t - w'_t\| \,\big|\, \{M_r\}_{0\leq r\leq t}\big] \leq \frac{2L\eta}{n}\sum_{r=0}^{t}\frac{1}{M_r} \tag{B.2}$$
for all $t$. We prove the claim by induction. It is trivially true when $t = 0$. Suppose B.2 holds for all $t \leq \tau$. Denote the samples in each local mini-batch at iteration $\tau$ by $\{x_{i,j}\}_{i\in[N], j\in[K]}$ (dropping the $\tau$ for brevity). Assume WLOG that $S_\tau = [M_\tau]$. First condition on the randomness due to minibatch sampling and due to the Gaussian noise, whose average we denote by $\bar{u} = \frac{1}{M_\tau}\sum_{i\in S_\tau}u_i$. Also, denote (for $t \geq 0$) $\widetilde{g}_t := \frac{1}{M_t K}\sum_{i\in S_t, j\in[K]}\nabla f(w_t, x_{i,j}) + \bar{u}$ and $\widetilde{g}'_t := \frac{1}{M_t K}\sum_{i\in S_t, j\in[K]}\nabla f(w'_t, x'_{i,j}) + \bar{u}$. Then, by the same argument used in Lemma 3.4 of [29], we can effectively ignore the noise in our analysis of step $\tau + 1$ of the algorithm, since the same (conditionally non-random) noise addition is performed on $X$ and $X'$, implying that the noises cancel out. More precisely, by non-expansiveness of projection and of the gradient descent step for $\eta \leq \frac{2}{\beta}$ (see Lemma 3.7 in [116]), we have
$$\|w_{\tau+1} - w'_{\tau+1}\| = \left\|\Pi_{\mathcal{W}}\big(w_\tau - \eta_\tau\widetilde{g}_\tau\big) - \Pi_{\mathcal{W}}\big(w'_\tau - \eta_\tau\widetilde{g}'_\tau\big)\right\| \leq \left\|\big(w_\tau - \eta_\tau\widetilde{g}_\tau\big) - \big(w'_\tau - \eta_\tau\widetilde{g}'_\tau\big)\right\|$$
$$\leq \left\|\left(w_\tau - \eta_\tau\left(\frac{1}{M_\tau K}\sum_{(i,j)\neq(a,b)}\nabla f(w_\tau, x_{i,j}) + \bar{u}\right)\right) - \left(w'_\tau - \eta_\tau\left(\frac{1}{M_\tau K}\sum_{(i,j)\neq(a,b)}\nabla f(w'_\tau, x_{i,j}) + \bar{u}\right)\right)\right\| + \frac{q_\tau\eta_\tau}{M_\tau K}\left\|\nabla f(w_\tau, x_{a,b}) - \nabla f(w'_\tau, x'_{a,b})\right\|$$
$$\leq \|w_\tau - w'_\tau\| + \frac{q_\tau\eta_\tau}{M_\tau K}\left\|\nabla f(w_\tau, x_{a,b}) - \nabla f(w'_\tau, x'_{a,b})\right\|,$$
where $q_\tau \in \{0, 1, \cdots, K\}$ is a realization of the random variable $Q_\tau$ that counts the number of times index $b$ occurs in worker $a$'s local minibatch at iteration $\tau$. (Recall that we sample uniformly with replacement.) Now $Q_\tau$ is a sum of $K$ independent Bernoulli$(\frac{1}{n_a})$ random variables, hence $\mathbb{E}Q_\tau = \frac{K}{n_a}$. Then using the inductive hypothesis and taking expected value over the randomness of the Gaussian noise and the minibatch sampling proves the claim. Next, taking expectation with respect to the randomness of $\{M_r\}_{r\in[t]}$ implies
$$\mathbb{E}\|w_t - w'_t\| \leq \frac{2L\eta(t+1)}{nM},$$
since the $M_r$ are i.i.d. with $\mathbb{E}\big(\frac{1}{M_r}\big) = \frac{1}{M}$. Then Jensen's inequality and Lipschitz continuity of $f(\cdot, x)$ imply that for any $x \in \mathcal{X}$,
$$\mathbb{E}\left[f(\bar{w}_R, x) - f(\bar{w}'_R, x)\right] \leq L\,\mathbb{E}\|\bar{w}_R - \bar{w}'_R\| \leq \frac{L}{R}\sum_{t=0}^{R-1}\mathbb{E}\|w_t - w'_t\| \leq \frac{2L^2\eta}{RMn}\cdot\frac{R(R+1)}{2} = \frac{L^2\eta(R+1)}{Mn},$$
completing the proof of the convex case.
Next suppose $f$ is $\mu$-strongly convex. The proof begins identically to the convex case. We condition on $M_r$, $u_i$, and $S_r$ as before and (keeping the same notation used there) get, for any $r \geq 0$,
$$\|w_{r+1} - w'_{r+1}\| \leq \left\|\left(w_r - \eta_r\left(\frac{1}{M_r K}\sum_{(i,j)\neq(a,b)}\nabla f(w_r, x_{i,j}) + \bar{u}\right)\right) - \left(w'_r - \eta_r\left(\frac{1}{M_r K}\sum_{(i,j)\neq(a,b)}\nabla f(w'_r, x_{i,j}) + \bar{u}\right)\right)\right\| + \frac{q_r\eta_r}{M_r K}\left\|\nabla f(w_r, x_{a,b}) - \nabla f(w'_r, x'_{a,b})\right\|.$$
We will need the following tighter estimate of the non-expansiveness of the gradient updates to bound the first term on the right-hand side of the inequality above:

Lemma 100 ([116]). Let $G: \mathcal{W} \to \mathbb{R}$ be $\mu$-strongly convex and $\beta$-smooth. Assume $\eta \leq \frac{2}{\beta + \mu}$. Then for any $w, v \in \mathcal{W}$, we have
$$\left\|\big(w - \eta\nabla G(w)\big) - \big(v - \eta\nabla G(v)\big)\right\| \leq \left(1 - \frac{\eta\beta\mu}{\beta + \mu}\right)\|v - w\| \leq \left(1 - \frac{\eta\mu}{2}\right)\|v - w\|.$$
Note that $G_r(w) := \frac{1}{M_r K}\sum_{(i,j)\neq(a,b),\,(i,j)\in[M_r]\times[K]}f(w, x^r_{i,j})$ is $(1 - \frac{q_r}{M_r K})\beta$-smooth and $(1 - \frac{q_r}{M_r K})\mu$-strongly convex, and hence so is $G_r(w) + \langle\bar{u}, w\rangle$. Therefore, invoking Lemma 100 and the assumption $\eta_r = \eta \leq \frac{1}{\beta}$, as well as Lipschitzness of $f(\cdot, x)$ for all $x \in \mathcal{X}$, yields
$$\|w_{r+1} - w'_{r+1}\| \leq \left(1 - \frac{\eta\mu\big(1 - \frac{q_r}{M_r K}\big)}{2}\right)\|w_r - w'_r\| + \frac{2q_r\eta L}{M_r K}.$$
Next, taking expectations over the $M_r$ (with $\mathbb{E}\big(\frac{1}{M_r}\big) = \frac{1}{M}$), the minibatch sampling (recall $\mathbb{E}q_r = \frac{K}{n_a}$), and the Gaussian noise implies
$$\mathbb{E}\|w_{r+1} - w'_{r+1}\| \leq \left(1 - \frac{\eta\mu\big(1 - \frac{1}{n_a M}\big)}{2}\right)\mathbb{E}\|w_r - w'_r\| + \frac{2\eta L}{n_a M}.$$
One can then prove the following claim by an inductive argument very similar to the one used in the proof of the convex part of Lemma 99: for all $t \geq 0$,
$$\mathbb{E}\|w_t - w'_t\| \leq \frac{2\eta L}{n_a M}\sum_{r=0}^{t}(1 - b)^r,$$
where $b := \frac{\mu\eta}{2}\left(\frac{n_a M - 1}{n_a M}\right) < 1$. The above claim implies that
$$\mathbb{E}\|w_t - w'_t\| \leq \frac{2\eta L}{n_a M}\left(\frac{1 - (1-b)^{t+1}}{b}\right) \leq \frac{4L}{\mu(n_a M - 1)} \leq \frac{4L}{\mu(nM - 1)}.$$
Finally, using the above bound together with Lipschitz continuity of $f$ and Jensen's inequality, we obtain that for any $x \in \mathcal{X}$,
$$\mathbb{E}\left[f(\widehat{w}_R, x) - f(\widehat{w}'_R, x)\right] \leq L\,\mathbb{E}\|\widehat{w}_R - \widehat{w}'_R\| = L\,\mathbb{E}\left\|\frac{1}{\Gamma_R}\sum_{r=0}^{R-1}\gamma_r(w_r - w'_r)\right\| \leq L\,\mathbb{E}\left[\frac{1}{\Gamma_R}\sum_{r=0}^{R-1}\gamma_r\|w_r - w'_r\|\right] \leq L\left[\frac{1}{\Gamma_R}\sum_{r=0}^{R-1}\gamma_r\left(\frac{4L}{\mu(nM-1)}\right)\right] = \frac{4L^2}{\mu(nM-1)}.$$
This completes the proof of Lemma 99. $\square$
Next, we will bound the empirical loss of Noisy MB-SGD (Algorithm 14). We will need the following two lemmas for the proof of Lemma 103 (and hence Theorem 34):

Lemma 101 (Projection lemma). Let $\mathcal{W} \subset \mathbb{R}^d$ be a closed convex set. Then $\|\Pi_{\mathcal{W}}(a) - b\|^2 \leq \|a - b\|^2$ for any $a \in \mathbb{R}^d$, $b \in \mathcal{W}$.

Lemma 102 ([234]). Let $b > 0$, let $a, c \geq 0$, and let $\{\eta_t\}_{t\geq 0}$ be non-negative step-sizes such that $\eta_t \leq \frac{1}{g}$ for all $t \geq 0$, for some parameter $g \geq a$. Let $\{r_t\}_{t\geq 0}$ and $\{s_t\}_{t\geq 0}$ be two non-negative sequences of real numbers which satisfy
$$r_{t+1} \leq (1 - a\eta_t)r_t - b\eta_t s_t + c\eta_t^2$$
for all $t \geq 0$. Then there exist particular choices of step-sizes $\eta_t \leq \frac{1}{g}$ and averaging weights $\gamma_t \geq 0$ such that
$$\frac{b}{\Gamma_T}\sum_{t=0}^{T}s_t\gamma_t + a\,r_{T+1} = \widetilde{O}\left(g\,r_0\exp\left(-\frac{aT}{g}\right) + \frac{c}{aT}\right),$$
where $\Gamma_T := \sum_{t=0}^{T}\gamma_t$. In fact, we can choose $\eta_t$ and $\gamma_t$ as follows:
$$\eta_t = \eta = \min\left\{\frac{1}{g},\; \frac{\ln\big(\max\{2,\, a^2 r_0 T^2/c\}\big)}{aT}\right\}, \qquad \gamma_t = (1 - a\eta)^{-(t+1)}.$$
We give the excess empirical risk guarantee of ISRL-DP MB-SGD below:

Lemma 103. Let $f: \mathcal{W}\times\mathcal{X} \to \mathbb{R}$ be $\mu$-strongly convex (with $\mu = 0$ for the convex case), $L$-Lipschitz, and $\beta$-smooth in $w$ for all $x \in \mathcal{X}$, where $\mathcal{W}$ is a closed convex set in $\mathbb{R}^d$ such that $\|w - w'\| \leq D$ for all $w, w' \in \mathcal{W}$. Let $X \in \mathcal{X}^{N\times n}$. Then Noisy MB-SGD (Algorithm 14) with $\sigma^2 = \frac{256L^2 R\ln(\frac{2.5R}{\delta_0})\ln(2/\delta_0)}{n^2\varepsilon_0^2}$ attains the following empirical loss bounds as a function of step size and the number of rounds:

1. (Convex) For any $\eta \leq 1/\beta$ and $R \in \mathbb{N}$, $\gamma_r := 1/R$, we have
$$\mathbb{E}\widehat{F}_X(\widehat{w}_T) - \widehat{F}^*_X \leq \frac{D^2}{2\eta R} + \frac{\eta L^2}{2}\left(\frac{256\,dR\ln(\frac{2.5R}{\delta_0})\ln(2/\delta_0)}{Mn^2\varepsilon_0^2} + 1\right).$$

2. (Strongly Convex) There exists a constant stepsize $\eta_r = \eta \leq 1/\beta$ such that if $R \geq 2\kappa\ln\left(\frac{\beta D^2\mu M\varepsilon_0^2 n^2}{L^2 d}\right)$, then
$$\mathbb{E}\widehat{F}_X(\widehat{w}_T) - \widehat{F}^*_X = \widetilde{O}\left(\frac{L^2}{\mu}\left(\frac{d\ln(1/\delta_0)}{M\varepsilon_0^2 n^2} + \frac{1}{R}\right)\right). \tag{B.3}$$
Proof. First, condition on the random $M_r$ and consider $M_r$ as fixed. Let $w^* \in \operatorname{argmin}_{w\in\mathcal{W}}\widehat{F}_X(w)$ be any minimizer of $\widehat{F}_X$, and denote the average of the i.i.d. Gaussian noises across all silos in one round by $\bar{u}_r := \frac{1}{M_r}\sum_{i\in S_r}u_i$. Note that $\bar{u}_r \sim N\big(0, \frac{\sigma^2}{M_r}\mathbf{I}_d\big)$ by independence of the $\{u_i\}_{i\in[N]}$, and hence $\mathbb{E}\|\bar{u}_r\|^2 = \frac{d\sigma^2}{M_r}$. Then for any $r \geq 0$, conditional on $M_r$, we have that
$$\mathbb{E}\left[\|w_{r+1} - w^*\|^2 \,\middle|\, M_r\right] = \mathbb{E}\left[\left\|\Pi_{\mathcal{W}}\left[w_r - \eta_r\left(\frac{1}{M_r}\sum_{i\in S_r}\left(\frac{1}{K}\sum_{j=1}^{K}\nabla f(w_r, x^r_{i,j}) + u_i\right)\right)\right] - w^*\right\|^2 \,\middle|\, M_r\right]$$
$$\leq \mathbb{E}\left[\left\|w_r - \eta_r\left(\frac{1}{M_r}\sum_{i\in S_r}\left(\frac{1}{K}\sum_{j=1}^{K}\nabla f(w_r, x^r_{i,j}) + u_i\right)\right) - w^*\right\|^2 \,\middle|\, M_r\right]$$
$$= \mathbb{E}\left[\|w_r - w^*\|^2 \,\middle|\, M_r\right] - 2\eta_r\,\mathbb{E}\left[\left\langle\nabla\widehat{F}_X(w_r) + \bar{u}_r,\; w_r - w^*\right\rangle \,\middle|\, M_r\right] + \eta_r^2\,\mathbb{E}\left[\left\|\bar{u}_r + \frac{1}{M_r}\sum_{i\in S_r}\frac{1}{K}\sum_{j=1}^{K}\nabla f(w_r, x^r_{i,j})\right\|^2 \,\middle|\, M_r\right]$$
$$\leq (1 - \mu\eta_r)\,\mathbb{E}\left[\|w_r - w^*\|^2 \,\middle|\, M_r\right] - 2\eta_r\,\mathbb{E}\left[\widehat{F}_X(w_r) - \widehat{F}^*_X \,\middle|\, M_r\right] + \eta_r^2\,\mathbb{E}\left[\left\|\bar{u}_r + \frac{1}{M_r}\sum_{i\in S_r}\frac{1}{K}\sum_{j=1}^{K}\nabla f(w_r, x^r_{i,j})\right\|^2 \,\middle|\, M_r\right]$$
$$\leq (1 - \mu\eta_r)\,\mathbb{E}\left[\|w_r - w^*\|^2 \,\middle|\, M_r\right] - 2\eta_r\,\mathbb{E}\left[\widehat{F}_X(w_r) - \widehat{F}^*_X \,\middle|\, M_r\right] + \eta_r^2\left(\frac{d\sigma^2}{M_r} + L^2\right), \tag{B.4}$$
where we used Lemma 101 in the first inequality; $\mu$-strong convexity of $\widehat{F}_X$ (for $\mu \geq 0$) and the fact that $\bar{u}_r$ is independent of the gradient estimate and mean zero in the next inequality; and the fact that $f(\cdot, x)$ is $L$-Lipschitz for all $x$ in the last inequality (together with independence of the noise and the data again). Now we consider the convex ($\mu = 0$) and strongly convex ($\mu > 0$) cases separately.

Convex ($\mu = 0$) case: Re-arranging B.4, we get
$$\mathbb{E}\left[\widehat{F}_X(w_r) - \widehat{F}^*_X \,\middle|\, M_r\right] \leq \frac{1}{2\eta_r}\left(\mathbb{E}\left[\|w_r - w^*\|^2 - \|w_{r+1} - w^*\|^2\right]\right) + \frac{\eta_r}{2}\left(\frac{d\sigma^2}{M_r} + L^2\right)$$
and hence
$$\mathbb{E}\left[\widehat{F}_X(w_r) - \widehat{F}^*_X\right] \leq \frac{1}{2\eta_r}\left(\mathbb{E}\left[\|w_r - w^*\|^2 - \|w_{r+1} - w^*\|^2\right]\right) + \frac{\eta_r}{2}\left(\frac{d\sigma^2}{M} + L^2\right)$$
by taking total expectation and using $\mathbb{E}[1/M_r] = 1/M$. Then for $\eta_r = \eta$, the average iterate $\bar{w}_R$ satisfies:
$$\mathbb{E}\left[\widehat{F}_X(\bar{w}_R) - \widehat{F}^*_X\right] \leq \frac{1}{R}\sum_{r=0}^{R-1}\mathbb{E}\left[\widehat{F}_X(w_r) - \widehat{F}^*_X\right] \leq \frac{1}{R}\sum_{r=0}^{R-1}\frac{1}{2\eta}\left(\mathbb{E}\left[\|w_r - w^*\|^2 - \|w_{r+1} - w^*\|^2\right]\right) + \frac{\eta}{2}\left(\frac{d\sigma^2}{M} + L^2\right) \leq \frac{\|w_0 - w^*\|^2}{2\eta R} + \frac{\eta}{2}\left(\frac{d\sigma^2}{M} + L^2\right).$$
Plugging in $\sigma^2$ finishes the proof of the convex case.
Strongly convex ($\mu > 0$) case: Recall from B.4 that
$$\mathbb{E}\left[\|w_{t+1} - w^*\|^2\right] \leq (1 - \mu\eta_t)\,\mathbb{E}\left[\|w_t - w^*\|^2\right] - 2\eta_t\,\mathbb{E}\left[\widehat{F}_X(w_t) - \widehat{F}^*_X\right] + \eta_t^2\left(\frac{d\sigma^2}{M} + L^2\right) \tag{B.5}$$
for all $t \geq 0$ (upon taking expectation over $M_t$). Now, B.5 satisfies the conditions of Lemma 102, with sequences
$$r_t = \mathbb{E}\|w_t - w^*\|^2, \qquad s_t = \mathbb{E}\left[\widehat{F}_X(w_t) - \widehat{F}^*_X\right]$$
and parameters
$$a = \mu, \quad b = 2, \quad c = \frac{d\sigma^2}{M} + L^2, \quad g = 2\beta, \quad T = R.$$
Then Lemma 102 and Jensen's inequality imply
$$\mathbb{E}\widehat{F}_X(\widehat{w}_T) - \widehat{F}^*_X = \widetilde{O}\left(\beta D^2\exp\left(-\frac{R}{2\kappa}\right) + \frac{L^2}{\mu}\left(\frac{1}{R} + \frac{d}{M\varepsilon_0^2 n^2}\right)\right).$$
Finally, plugging in $R$ and $\sigma^2$ completes the proof. $\square$
We are now prepared to prove Theorem 34 in the $\beta$-smooth case:

Theorem 104. Let $f(w, x)$ be $\beta$-smooth in $w$ for all $x \in \mathcal{X}$. Assume $\varepsilon_0 \leq 2\ln(2/\delta_0)$, choose $\sigma^2 = \frac{256L^2 R\ln(\frac{2.5R}{\delta_0})\ln(2/\delta_0)}{n^2\varepsilon_0^2}$ and $K \geq \frac{\varepsilon_0 n}{4\sqrt{2R\ln(2/\delta_0)}}$. Then Algorithm 14 is $(\varepsilon_0, \delta_0)$-ISRL-DP. Moreover, there are choices of $\{\eta_r\}_{r=1}^{R}$ such that Algorithm 14 achieves the following excess loss bounds:

1. If $f(\cdot, x)$ is convex, then setting $R = \frac{\beta D\sqrt{M}}{L}\min\left\{\sqrt{n},\, \frac{\varepsilon_0 n}{\sqrt{d}}\right\} + \min\left\{nM,\, \frac{\varepsilon_0^2 n^2 M}{d}\right\}$ yields
$$\mathbb{E}F(\widehat{w}_T) - F^* = \widetilde{O}\left(\frac{LD}{\sqrt{M}}\left(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\ln(1/\delta_0)}}{\varepsilon_0 n}\right)\right). \tag{B.6}$$

2. If $f(\cdot, x)$ is $\mu$-strongly convex, then setting $R = \max\left(\frac{2\beta}{\mu}\ln\left(\frac{\beta D^2\mu M\varepsilon_0^2 n^2}{dL^2}\right),\, \min\left\{Mn,\, \frac{M\varepsilon_0^2 n^2}{d}\right\}\right)$ yields
$$\mathbb{E}F(\widehat{w}_T) - F^* = \widetilde{O}\left(\frac{L^2}{\mu M}\left(\frac{1}{n} + \frac{d\ln(1/\delta_0)}{\varepsilon_0^2 n^2}\right)\right). \tag{B.7}$$
Proof. Privacy: By independence of the Gaussian noise across silos, it suffices to show that the transcript of silo $i$'s interactions with the server is DP for all $i \in [N]$ (conditional on the transcripts of all other silos). WLOG consider $i = 1$. By the advanced composition theorem (Theorem 3.20 in [88]), it suffices to show that each of the $R$ rounds of the algorithm is $(\widetilde{\varepsilon}, \widetilde{\delta})$-ISRL-DP, where $\widetilde{\varepsilon} = \frac{\varepsilon_0}{2\sqrt{2R\ln(2/\delta_0)}}$ (we used the assumption $\varepsilon_0 \leq 2\ln(2/\delta_0)$ here) and $\widetilde{\delta} = \frac{\delta_0}{2R}$. First, condition on the randomness due to local sampling of the local data points $x^r_{1,j}$ (line 6 of Algorithm 14). Now, the $L_2$ sensitivity of each local step of SGD is bounded by $\Delta := \sup_{|X_1\Delta X'_1|\leq 2,\, w\in\mathcal{W}}\big\|\frac{1}{K}\sum_{j=1}^{K}\nabla f(w, x_{1,j}) - \nabla f(w, x'_{1,j})\big\| \leq 2L/K$, by $L$-Lipschitzness of $f$. Thus, the standard privacy guarantee of the Gaussian mechanism (see [88, Theorem A.1]) implies that (conditional on the randomness due to sampling) taking $\sigma_1^2 \geq \frac{8L^2\ln(1.25/\widetilde{\delta})}{\widetilde{\varepsilon}^2 K^2}$ suffices to ensure that round $r$ (in isolation) is $(\widetilde{\varepsilon}, \widetilde{\delta})$-ISRL-DP. Now we invoke the randomness due to sampling: [248] implies that round $r$ (in isolation) is $(\frac{2\widetilde{\varepsilon}K}{n}, \widetilde{\delta})$-ISRL-DP. The assumption on $K$ ensures that $\varepsilon' := \frac{n}{2K}\cdot\frac{\varepsilon_0}{2\sqrt{2R\ln(2/\delta_0)}} \leq 1$, so that the privacy guarantees of the Gaussian mechanism and amplification by subsampling stated above indeed hold. Therefore, with sampling, it suffices to take $\sigma^2 \geq \frac{32L^2\ln(1.25/\widetilde{\delta})}{n^2\widetilde{\varepsilon}^2} = \frac{256L^2 R\ln(2.5R/\delta_0)\ln(2/\delta_0)}{n^2\varepsilon_0^2}$ to ensure that round $r$ (in isolation) is $(\widetilde{\varepsilon}, \widetilde{\delta})$-ISRL-DP for all $r$, and hence that the full algorithm ($R$ rounds) is $(\varepsilon_0, \delta_0)$-ISRL-DP.
Excess loss: 1. First suppose $f$ is merely convex ($\mu = 0$). By Lemma 99, Lemma 98, and Lemma 103, we have:
$$\mathbb{E}F(\widehat{w}_T) - F^* \leq \alpha + \mathbb{E}\widehat{F}_X(\widehat{w}_T) - \widehat{F}^*_X \leq \frac{2L^2 R\eta}{nM} + \frac{D^2}{2\eta R} + \frac{\eta L^2}{2}\left(\frac{256\,dR\ln(\frac{2.5R}{\delta_0})\ln(2/\delta_0)}{Mn^2\varepsilon_0^2} + 1\right)$$
for any $\eta \leq 1/\beta$. Choosing $\eta = \min\left(\frac{1}{\beta},\, \frac{D}{L\sqrt{R}}\min\left\{1,\, \frac{\varepsilon_0 n\sqrt{M}}{\sqrt{Rd}},\, \sqrt{\frac{nM}{R}}\right\}\right)$ implies
$$\mathbb{E}F(\widehat{w}_T) - F^* \lesssim \frac{\beta D^2}{R} + \frac{LD}{\sqrt{R}}\left(1 + \frac{\sqrt{dR\ln(R/\delta_0)}}{\varepsilon_0 n\sqrt{M}} + \sqrt{\frac{R}{nM}}\right).$$
Finally, one verifies that the prescribed choice of $R$ yields B.6.

2. Now suppose $f$ is $\mu$-strongly convex. Then for the $\eta \leq 1/\beta$ used in the proof of Lemma 103 and $R \geq \frac{2\beta}{\mu}\ln\left(\frac{\beta D^2\mu M\varepsilon_0^2 n^2}{dL^2}\right)$, we have
$$\mathbb{E}\widehat{F}_X(\widehat{w}_T) - \widehat{F}^*_X = \widetilde{O}\left(\frac{L^2}{\mu}\left(\frac{d\ln(1/\delta_0)}{M\varepsilon_0^2 n^2} + \frac{1}{R} + \frac{1}{Mn}\right)\right),$$
by Lemma 103, Lemma 99, and Lemma 98. Hence B.7 follows by our choice of $R \geq \min\left(Mn,\, \frac{M\varepsilon_0^2 n^2}{d}\right)$. $\square$
We use Theorem 104 to prove Theorem 34 via Nesterov smoothing [201], similar to how [29] proceeded for CDP SCO with non-strongly convex loss and $N = 1$. That is, for non-smooth $f$, we run ISRL-DP Noisy MB-SGD on the smoothed objective (a.k.a. the $\beta$-Moreau envelope) $f_\beta(w) := \min_{v\in\mathcal{W}}\left(f(v) + \frac{\beta}{2}\|w - v\|^2\right)$, where $\beta > 0$ is a design parameter that we will optimize over. The following key lemma allows us to easily extend Theorem 104 to non-smooth $f$:

Lemma 105 ([201]). Let $f: \mathcal{W} \to \mathbb{R}$ be convex and $L$-Lipschitz, and let $\beta > 0$. Then the $\beta$-Moreau envelope $f_\beta(w) := \min_{v\in\mathcal{W}}\left(f(v) + \frac{\beta}{2}\|w - v\|^2\right)$ satisfies:
1. $f_\beta$ is convex, $2L$-Lipschitz, and $\beta$-smooth.
2. For all $w$, $f_\beta(w) \leq f(w) \leq f_\beta(w) + \frac{L^2}{2\beta}$.
Now let us re-state the precise version of Theorem 34 before providing its proof:

Theorem 106 (Precise version of Theorem 34). Let $\varepsilon_0 \leq 2\ln(2/\delta_0)$ and choose $\sigma^2 = \frac{256L^2 R\ln(\frac{2.5R}{\delta_0})\ln(2/\delta_0)}{n^2\varepsilon_0^2}$, $K \geq \frac{\varepsilon_0 n}{4\sqrt{2R\ln(2/\delta_0)}}$. Then Algorithm 14 is $(\varepsilon_0, \delta_0)$-ISRL-DP. Further, there exist choices of $\beta > 0$ such that running Algorithm 14 on $f_\beta(w, x) := \min_{v\in\mathcal{W}}\left(f(v, x) + \frac{\beta}{2}\|w - v\|^2\right)$ yields:

1. If $f(\cdot, x)$ is convex, then setting $R = \frac{\beta D\sqrt{M}}{L}\min\left\{\sqrt{n},\, \frac{\varepsilon_0 n}{\sqrt{d}}\right\} + \min\left\{nM,\, \frac{\varepsilon_0^2 n^2 M}{d}\right\}$ yields
$$\mathbb{E}F(\widehat{w}_T) - F^* = \widetilde{O}\left(\frac{LD}{\sqrt{M}}\left(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\ln(1/\delta_0)}}{\varepsilon_0 n}\right)\right). \tag{B.8}$$

2. If $f(\cdot, x)$ is $\mu$-strongly convex, then setting $R = \max\left(\frac{2\beta}{\mu}\ln\left(\frac{\beta D^2\mu M\varepsilon_0^2 n^2}{dL^2}\right),\, \min\left\{Mn,\, \frac{M\varepsilon_0^2 n^2}{d}\right\}\right)$ yields
$$\mathbb{E}F(\widehat{w}_T) - F^* = \widetilde{O}\left(\frac{L^2}{\mu M}\left(\frac{1}{n} + \frac{d\ln(1/\delta_0)}{\varepsilon_0^2 n^2}\right)\right). \tag{B.9}$$

Proof. Privacy: ISRL-DP is immediate from post-processing [88, Proposition 2.1], since we already showed that Algorithm 14 (applied to $f_\beta$) is ISRL-DP.

Excess risk: We have $\mathbb{E}F(\widehat{w}_T) - F^* \leq \mathbb{E}F_\beta(\widehat{w}_R) - F^*_\beta + \frac{L^2}{2\beta}$, by part 2 of Lemma 105. Moreover, by part 1 of Lemma 105 and Theorem 104, we have:
1. $\mathbb{E}F_\beta(\widehat{w}_R) - F^*_\beta = \widetilde{O}\left(\frac{LD}{\sqrt{M}}\left(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\ln(1/\delta_0)}}{\varepsilon_0 n}\right)\right)$ for convex $f$, and
2. $\mathbb{E}F_\beta(\widehat{w}_R) - F^*_\beta = \widetilde{O}\left(\frac{L^2}{\mu M}\left(\frac{1}{n} + \frac{d\ln(1/\delta_0)}{\varepsilon_0^2 n^2}\right)\right)$ for $\mu$-strongly convex $f$.
Thus, choosing $\beta_1$ such that $L^2/\beta_1 \leq \frac{LD}{\sqrt{M}}\left(\frac{1}{\sqrt{n}} + \frac{\sqrt{d\ln(1/\delta_0)}}{\varepsilon_0 n}\right)$ and $\beta_2$ such that $L^2/\beta_2 \leq \frac{L^2}{\mu M}\left(\frac{1}{n} + \frac{d\ln(1/\delta_0)}{\varepsilon_0^2 n^2}\right)$ completes the proof. $\square$
B.1.5 Supplementary Material for ISRL-DP Lower Bounds in Section 3.2.1
This section requires familiarity with the notation introduced in the rigorous definition of ISRL-DP in Appendix B.1.2.

B.1.5.1 C-Compositionality
The $(\varepsilon_0, \delta_0)$-ISRL-DP algorithm class $\mathcal{A}^{(\varepsilon_0,\delta_0),C}$ contains all sequentially interactive and all fully interactive, $C$-compositional algorithms.

Definition 107 (Compositionality). Let $\mathcal{A}$ be an $R$-round $(\varepsilon_0, \delta_0)$-ISRL-DP FL algorithm with data domain $\mathcal{X}$. Let $\{(\varepsilon_0^r, \delta_0^r)\}_{r=1}^{R}$ denote the minimal (non-negative) parameters of the local randomizers $\mathcal{R}^{(i)}_r$ selected at round $r$ such that $\mathcal{R}^{(i)}_r(Z^{(1:r-1)}, \cdot)$ is $(\varepsilon_0^r, \delta_0^r)$-DP for all $i \in [N]$ and all $Z^{(1:r-1)}$. For $C > 0$, we say that $\mathcal{A}$ is $C$-compositional if $\sqrt{\sum_{r\in[R]}(\varepsilon_0^r)^2} \leq C\varepsilon_0$. If such a $C$ is an absolute constant, we simply say $\mathcal{A}$ is compositional.

Definition 107 is an extension of the definition in [133] to $\delta_0 > 0$.
B.1.5.2 Algorithms whose Privacy follows from the Advanced Composition Theorem are 1-Compositional
Suppose $\mathcal{A}$ is $(\varepsilon_0, \delta_0)$-ISRL-DP by the advanced composition theorem [88, Theorem 3.20]. Then
$$\varepsilon_0 = \sqrt{2\sum_{r=1}^{R}(\varepsilon_0^r)^2\ln(1/\delta')} + \sum_{r=1}^{R}\varepsilon_0^r\left(e^{\varepsilon_0^r} - 1\right),$$
and $\delta_0 = \sum_{r=1}^{R}\delta_0^r + \delta'$ for any $\delta' \in (0,1)$. Assume $\delta' \in (0, 1/3)$ without loss of generality: otherwise the privacy guarantee of $\mathcal{A}$ is essentially meaningless (see Remark 112). Then $\varepsilon_0 \geq \sqrt{2\sum_{r=1}^{R}(\varepsilon_0^r)^2\ln(1/\delta')} \geq \sqrt{2}\sqrt{\sum_{r=1}^{R}(\varepsilon_0^r)^2} \geq \sqrt{\sum_{r=1}^{R}(\varepsilon_0^r)^2}$, so that $\mathcal{A}$ is $1$-compositional. Note that even if $\delta' > 1/3$, $\mathcal{A}$ would still be compositional, but the constant $C$ may be larger than $1$.
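The computation above can be reproduced directly. The sketch below (with illustrative round budgets of my own choosing) evaluates the advanced-composition bound and confirms that the resulting $\varepsilon_0$ dominates $\sqrt{\sum_r(\varepsilon_0^r)^2}$:

```python
import math

def advanced_composition(eps_rounds, delta_rounds, delta_prime):
    """Total (eps0, delta0) given by the advanced composition theorem
    [88, Theorem 3.20] for per-round budgets (eps_0^r, delta_0^r)."""
    eps0 = (math.sqrt(2 * sum(e * e for e in eps_rounds) * math.log(1 / delta_prime))
            + sum(e * (math.exp(e) - 1) for e in eps_rounds))
    delta0 = sum(delta_rounds) + delta_prime
    return eps0, delta0

R, eps_r = 100, 0.05
eps0, delta0 = advanced_composition([eps_r] * R, [1e-8] * R, delta_prime=1e-6)
l2_budget = math.sqrt(R * eps_r**2)   # sqrt(sum_r (eps_0^r)^2)
# 1-compositionality: eps0 >= l2_budget whenever delta_prime < 1/3
```

With $\delta' = 10^{-6}$ the first (square-root) term alone already exceeds the $\ell_2$ aggregate of the per-round budgets, which is exactly the inequality used in the text.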
B.1.5.3 Example of an ISRL-DP Algorithm that is Not Compositional
This example is a simple modification of Example 2.2 in [133] (adapted to our definition of compositionality for $\delta_0 > 0$). Given any $C > 0$, set $d := 2C^2$ and let $\mathcal{X} = \{e_1, \cdots, e_d\} \subset \{0,1\}^d$ be the standard basis for $\mathbb{R}^d$. Let $n = 1$ and $X = (x_1, \cdots, x_N) \in \mathcal{X}^N$. For all $i \in [N]$, let $Q^{(i)}: \mathcal{X} \to \mathcal{X}$ be the randomized response mechanism that outputs $Q^{(i)}(x_i) = x_i$ with probability $\frac{e^{\varepsilon_0}}{e^{\varepsilon_0} + d - 1}$ and otherwise outputs a uniformly random element of $\mathcal{X}\setminus\{x_i\}$. Note that $Q^{(i)}$ is $\varepsilon_0$-DP, hence $(\varepsilon_0, \delta_0)$-DP for any $\delta_0 > 0$. Consider the $d$-round algorithm $\mathcal{A}: \mathcal{X}^N \to \mathcal{Z}^{d\times N}$ in Algorithm 15, where $\mathcal{Z} = \mathbb{R}^d$.

Algorithm 15 ISRL-DP Algorithm that is not $C$-compositional
1: for $r \in [d]$ do
2:   for $i \in [N]$ do
3:     if $x_i = e_r$ then
4:       $\mathcal{R}^{(i)}_r(x_i) := Q^{(i)}(x_i)$.
5:     else
6:       $\mathcal{R}^{(i)}_r(x_i) := 0 \in \mathbb{R}^d$.
7:     end if
8:   end for
9: end for
10: Output: $\{\mathcal{R}^{(i)}_r(x_i)\}_{i\in[N], r\in[d]}$.

Since each silo's data is only referenced once and $Q^{(i)}$ is $\varepsilon_0$-DP, we have $\varepsilon_0^r = \varepsilon_0$ and $\mathcal{A}$ is $(\varepsilon_0, \delta_0)$-DP. However, $\sqrt{\sum_{r=1}^{d}(\varepsilon_0^r)^2} = \sqrt{d\varepsilon_0^2} = \sqrt{2}C\varepsilon_0 > C\varepsilon_0$, so that $\mathcal{A}$ is not $C$-compositional.

Also, note that our One-Pass Accelerated Noisy MB-SGD is only $C$-compositional for $C \geq \sqrt{R}$, since $\varepsilon_0 = \varepsilon_0^r$ for this algorithm. Thus, substituting the $R$ that is used to prove the upper bounds in Theorem 36 (see Appendix B.1.6.2) and plugging $C = \sqrt{R}$ into Theorem 35 explains where the non-i.i.d. lower bounds in Figure 3.3 come from.
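The randomized-response mechanism $Q^{(i)}$ used in this example is easy to simulate. The sketch below uses a small integer domain in place of the basis vectors $\{e_1, \cdots, e_d\}$; the domain size and $\varepsilon_0$ are illustrative choices:

```python
import math
import random

def randomized_response(x, domain, eps0, rng):
    """eps0-DP randomized response: report the true value x with probability
    e^eps0 / (e^eps0 + d - 1), else a uniform element of domain \ {x}."""
    d = len(domain)
    if rng.random() < math.exp(eps0) / (math.exp(eps0) + d - 1):
        return x
    return rng.choice([y for y in domain if y != x])

rng, eps0 = random.Random(1), 1.0
domain = list(range(8))   # stands in for the basis {e_1, ..., e_d} with d = 8
reports = [randomized_response(3, domain, eps0, rng) for _ in range(20_000)]
p_true = reports.count(3) / len(reports)
# p_true concentrates near e / (e + 7), roughly 0.28
```

The per-value likelihood ratio of this mechanism is exactly $e^{\varepsilon_0}$, which is what makes each round of Algorithm 15 $\varepsilon_0$-DP even though the $\ell_2$ aggregate over rounds is $\sqrt{d}\,\varepsilon_0$.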
B.1.5.4 Proof of Theorem 35
First, let us state the complete, formal version of Theorem 35, which uses notation from Appendix B.1.2:

Theorem 108 (Complete Version of Theorem 35). Let $\varepsilon_0 \in (0, \sqrt{N}]$, $\delta_0 = o(1/nN)$, and $\mathcal{A} \in \mathcal{A}^{(\varepsilon_0,\delta_0),C}$. Suppose that in each round $r$, the local randomizers are all $(\varepsilon_0^r, \delta_0^r)$-DP, for $\varepsilon_0^r \lesssim \frac{1}{n}$, $\delta_0^r = o(1/nNR)$, and $N \geq 16\ln(2/(\delta_0^r n))$. Then:

1. There exists an $f \in \mathcal{F}_{L,D}$ and a distribution $\mathcal{P}$ such that for $X \sim \mathcal{P}^{nN}$, we have:
$$\mathbb{E}F(\mathcal{A}(X)) - F^* = \widetilde{\Omega}\left(\frac{\phi D}{\sqrt{Nn}} + LD\min\left\{1,\, \frac{\sqrt{d}}{\varepsilon_0 n\sqrt{N}C^2}\right\}\right).$$

2. There exists a $\mu$-smooth $f \in \mathcal{G}_{\mu,L,D}$ and a distribution $\mathcal{P}$ such that for $X \sim \mathcal{P}^{nN}$, we have
$$\mathbb{E}F(\mathcal{A}(X)) - F^* = \widetilde{\Omega}\left(\frac{\phi^2}{\mu nN} + LD\min\left\{1,\, \frac{d}{\varepsilon_0^2 n^2 NC^4}\right\}\right).$$

Further, if $\mathcal{A} \in \mathcal{A}$, then the above lower bounds hold with $C = 1$.

Before we proceed to the proof of Theorem 35, we recall the simpler characterization of ISRL-DP for sequentially interactive algorithms. A sequentially interactive algorithm $\mathcal{A}$ with randomizers $\{\mathcal{R}^{(i)}\}_{i=1}^{N}$ is $(\varepsilon_0, \delta_0)$-ISRL-DP if and only if for all $i \in [N]$, $\mathcal{R}^{(i)}(\cdot, Z^{(1:i-1)}): \mathcal{X}^{n_i} \to \mathcal{Z}$ is $(\varepsilon_0, \delta_0)$-DP for all $Z^{(1:i-1)} \in \mathcal{Z}^{i-1}$. In what follows, we will fix $\mathcal{X}_i = \mathcal{X}$ for all $i$. We now turn to the proof.

Step 1: Privacy amplification by shuffling. We begin by stating and proving the amplification-by-shuffling result that we will leverage to obtain Theorem 35:
Theorem 109. Let $\mathcal{A} \in \mathcal{A}^{(\varepsilon_0,\delta_0),C}$ such that $\varepsilon_0 \in (0, \sqrt{N}]$ and $\delta_0 \in (0,1)$. Assume that in each round, the local randomizers $\mathcal{R}^{(i)}_r(Z^{(1:r-1)}, \cdot): \mathcal{X}^n \to \mathcal{Z}$ are $(\varepsilon_0^r, \delta_0^r)$-DP for all $i \in [N]$, $r \in [R]$, $Z^{(1:r-1)} \in \mathcal{Z}^{(r-1)\times N}$, with $\varepsilon_0^r \leq \frac{1}{n}$. Assume $N \geq 16\ln(2/(\delta_0^r n))$. If $\mathcal{A}$ is $C$-compositional, then assume $\delta_0^r \leq \frac{1}{14nNR}$ and denote $\delta := 14Nn\sum_{r=1}^{R}\delta_0^r$; if instead $\mathcal{A}$ is sequentially interactive, then assume $\delta_0 = \delta_0^r \leq \frac{1}{7Nn}$ and denote $\delta := 7Nn\delta_0$. Let $\mathcal{A}_s: \mathcal{X} \to \mathcal{W}$ be the same algorithm as $\mathcal{A}$ except that in each round $r$, $\mathcal{A}_s$ draws a random permutation $\pi_r$ of $[N]$ and applies $\mathcal{R}^{(i)}_r$ to $X_{\pi_r(i)}$ instead of $X_i$. Then $\mathcal{A}_s$ is $(\varepsilon, \delta)$-CDP, where $\varepsilon = O\left(\frac{\varepsilon_0\ln(1/nN\delta_0^{\min})C^2}{\sqrt{N}}\right)$, and $\delta_0^{\min} := \min_{r\in[R]}\delta_0^r$. In particular, if $\mathcal{A} \in \mathcal{A}$, then $\varepsilon = O\left(\frac{\varepsilon_0\ln(1/nN\delta_0^{\min})}{\sqrt{N}}\right)$. Note that for sequentially interactive $\mathcal{A}$, $\delta_0^{\min} = \delta_0$.

To the best of our knowledge, the restriction on $\varepsilon_0^r$ is needed to obtain $\varepsilon = \widetilde{O}(\varepsilon_0/\sqrt{N})$ in all works that have analyzed privacy amplification by shuffling [89, 96, 24, 56, 25], but these works focus on the sequentially interactive case with $n = 1$, so the restriction amounts to $\varepsilon_0 \lesssim 1$ (or $\varepsilon_0 = \widetilde{O}(1)$). The non-sequential, $C$-compositional part of Theorem 109 will follow as a corollary (Corollary 111) of the following result, which analyzes the privacy amplification in each round:
Theorem 110 (Single round privacy amplification by shuffling). Let $\varepsilon_0^r \leq \frac{1}{n}\ln\left(\frac{N}{16\ln(2/\delta^r)}\right)$, $r \in \mathbb{N}$, and let $\mathcal{R}^{(i)}_r(Z, \cdot): \mathcal{X}^n \to \mathcal{Z}$ be an $(\varepsilon_0^r, \delta_0^r)$-DP local randomizer for all $Z = Z^{(1:N)}_{(1:r-1)} \in \mathcal{Z}^{(r-1)\times N}$ and $i \in [N]$, where $\mathcal{X}$ is an arbitrary set. Given a distributed data set $X = (X_1, \cdots, X_N) \in \mathcal{X}^{N\times n}$ and $Z = Z^{(1:N)}_{(1:r-1)}$, consider the shuffled algorithm $\mathcal{A}_s^r: \mathcal{X}^{n\times N}\times\mathcal{Z}^{(r-1)\times N} \to \mathcal{Z}^N$ that first samples a random permutation $\pi$ of $[N]$ and then computes $Z_r = (Z^{(1)}_r, \cdots, Z^{(N)}_r)$, where $Z^{(i)}_r := \mathcal{R}^{(i)}_r(Z, X_{\pi(i)})$. Then $\mathcal{A}_s^r$ is $(\varepsilon^r, \widetilde{\delta}^r)$-CDP, where
$$\varepsilon^r := \ln\left[1 + \left(\frac{e^{\varepsilon_0^r} - 1}{e^{\varepsilon_0^r} + 1}\right)\left(\frac{8\sqrt{e^{n\varepsilon_0^r}\ln(4/\delta^r)}}{\sqrt{N}} + \frac{8e^{n\varepsilon_0^r}}{N}\right)\right] \tag{B.10}$$
and $\widetilde{\delta}^r := \delta^r + 2Nne^{(n-1)\varepsilon_0^r}\delta_0^r$. In particular, if $\varepsilon_0^r = O\left(\frac{1}{n}\right)$, then
$$\varepsilon^r = O\left(\frac{\varepsilon_0^r\sqrt{\ln(1/\delta^r)}}{\sqrt{N}}\right). \tag{B.11}$$
Further, if $\varepsilon_0^r \leq 1/n$, then setting $\delta^r := Nn\delta_0^r$ implies that
$$\varepsilon^r = O\left(\frac{\varepsilon_0^r\sqrt{\ln(1/nN\delta_0^r)}}{\sqrt{N}}\right) \tag{B.12}$$
and $\widetilde{\delta}^r \leq 7Nn\delta_0^r$, which is in $(0,1)$ if we assume $\delta_0^r \in (0, \frac{1}{7Nn}]$.

We sometimes refer to the algorithm $\mathcal{A}_s^r$ as the "shuffled algorithm derived from the randomizers $\{\mathcal{R}^{(i)}_r\}$." From Theorem 110, we obtain:
Corollary 111 ($R$-round privacy amplification for $C$-compositional algorithms). Let $\mathcal{A}: \mathcal{X}^{n\times N} \to \mathcal{Z}^{R\times N}$ be an $R$-round $(\varepsilon_0, \delta_0)$-ISRL-DP and $C$-compositional algorithm such that $\varepsilon_0 \in (0, \sqrt{N}]$ and $\delta_0 \in (0,1)$, where $\mathcal{X}$ is an arbitrary set. Assume that in each round, the local randomizers $\mathcal{R}^{(i)}_r(Z^{(1:r-1)}, \cdot): \mathcal{X}^n \to \mathcal{Z}$ are $(\varepsilon_0^r, \delta_0^r)$-DP for $i \in [N]$, $r \in [R]$, where $N \geq 16\ln(2/(\delta_0^r n))$, $\varepsilon_0^r \leq \frac{1}{n}$, and $\delta_0^r \leq \frac{1}{14nNR}$. Then the shuffled algorithm $\mathcal{A}_s: \mathcal{X}^{n\times N} \to \mathcal{Z}^{R\times N}$ derived from $\{\mathcal{R}^{(i)}_r(Z^{(1:r-1)}, \cdot)\}_{i\in[N], r\in[R]}$ (i.e. $\mathcal{A}_s$ is the composition of the $R$ shuffled algorithms $\mathcal{A}_s^r$ defined in Theorem 110) is $(\varepsilon, \delta)$-CDP, where $\delta \leq 14Nn\sum_{r=1}^{R}\delta_0^r$ and $\varepsilon = O\left(\frac{\varepsilon_0\ln(1/nN\delta_0^{\min})C^2}{\sqrt{N}}\right)$, where $\delta_0^{\min} := \min_{r\in[R]}\delta_0^r$. In particular, if $\mathcal{A} \in \mathcal{A}$ is compositional, then $\varepsilon = O\left(\frac{\varepsilon_0\ln(1/nN\delta_0^{\min})}{\sqrt{N}}\right)$.

Proof. Let $\delta' := \sum_r Nn\delta_0^r$ and $\delta^r := Nn\delta_0^r$. Then the (central) privacy loss of the full $R$-round shuffled algorithm is bounded as
$$\varepsilon \leq 2\sum_r(\varepsilon^r)^2 + \sqrt{2\sum_r(\varepsilon^r)^2\ln(1/\delta')} = O\left(\sum_r\left(\frac{(\varepsilon_0^r)^2\ln(1/\delta^r)}{N}\right) + \sqrt{\sum_r\frac{(\varepsilon_0^r)^2\ln(1/\delta^r)\ln(1/\delta')}{N}}\right) = O\left(\frac{C^2\varepsilon_0\ln(1/nN\delta_0^{\min})}{\sqrt{N}}\right),$$
where the three (in)equalities follow in order from the Advanced Composition Theorem [88], B.12 in Theorem 110, and $C$-compositionality of $\mathcal{A}$ combined with the assumption $\varepsilon_0 \lesssim \sqrt{N}$. Also, $\delta = \delta' + \sum_r\widetilde{\delta}^r$ by the Advanced Composition Theorem, where $\widetilde{\delta}^r \leq 7Nn\delta_0^r$ by Theorem 110. Hence $\delta \leq 14Nn\sum_r\delta_0^r$. In particular, if $\mathcal{A}$ is compositional, then $C = O(1)$, so $\varepsilon = O\left(\frac{\varepsilon_0\ln(1/nN\delta_0^{\min})}{\sqrt{N}}\right)$. $\square$
Remark 112. The upper bounds assumed on $\delta_0^r$ and $\delta^r$ in Theorem 35 ensure that $\delta \in (0,1)$ and that the lower bounds of [32] apply (see Theorem 123). These assumptions are not very restrictive in practice, since $\delta_0^r, \delta_0 \ll 1/n$ is needed for meaningful privacy guarantees (see e.g. Chapter 2 in [88]) and $R$ must be polynomial for the algorithm to run. To quote [88], "typically we are interested in values of $\delta$ that are less than the inverse of any polynomial in the size of the database" (page 18). For larger $\delta$ (e.g. $\delta = \Omega(1/n)$), there are examples of algorithms that satisfy the definition of DP but clearly violate any reasonable notion of privacy. For instance, an algorithm that outputs $\delta_0 n$ random samples from each silo's data set is $(0, \delta_0)$-ISRL-DP, but completely violates the privacy of at least one person in each silo if $\delta_0 \geq 1/n$. Also, since $N \gg 1$ is the regime of interest (otherwise, if $N = \widetilde{O}(1)$, the CDP lower bounds of [29] already match our upper bounds up to logarithms), the requirement that $N$ be larger than $16\ln(2/(\delta_0^{\min}n))$ is unimportant.‡
The sequentially interactive part of Theorem 109 will be clear directly from the proof of Theorem 110. We now turn to the proof of Theorem 110, which uses the techniques of [96]. First, we will need some more notation. The privacy relation in 3.3 between random variables $P$ and $Q$ can be characterized by the hockey-stick divergence: $D_{e^\varepsilon}(P\|Q) := \int\max\{0,\, p(x) - e^\varepsilon q(x)\}\,dx$, where $p$ and $q$ denote the probability density or mass functions of $P$ and $Q$ respectively. Then $P \approx_{(\varepsilon,\delta)} Q$ iff $\max\{D_{e^\varepsilon}(P\|Q),\, D_{e^\varepsilon}(Q\|P)\} \leq \delta$. Second, recall that the total variation distance between $P$ and $Q$ is given by $TV(P, Q) = \frac{1}{2}\int_{\mathbb{R}}|p(x) - q(x)|\,dx$.
Third, we recall the notion of group privacy:

Definition 113 (Group DP). A randomized algorithm $\mathcal{A}: \mathcal{X}^N \to \mathcal{Z}$ is $(\varepsilon, \delta)$ group DP for groups of size $N$ if $\mathcal{A}(X) \approx_{(\varepsilon,\delta)} \mathcal{A}(X')$ for all $X, X' \in \mathcal{X}^N$.
We will also need the following stronger version of a decomposition from [136] and Lemma 3.2 of [197].

‡ Technically, this assumption on $N$ is needed to ensure that the condition on $\varepsilon_0^r$ in Theorem 110 is satisfied. A similar restriction appears in Theorem 3.8 of [96].
Lemma 114 ([136]). Let $\mathcal{R}_0, \mathcal{R}_1: \mathcal{X}^n \to \mathcal{Z}$ be local randomizers such that $\mathcal{R}_0(X_0)$ and $\mathcal{R}_1(X_1)$ are $(\varepsilon, 0)$-indistinguishable. Then there exists a randomized algorithm $U: \{X_0, X_1\} \to \mathcal{Z}$ such that
$$\mathcal{R}_0(X_0) = \frac{e^\varepsilon}{e^\varepsilon + 1}U(X_0) + \frac{1}{e^\varepsilon + 1}U(X_1) \quad \text{and} \quad \mathcal{R}_1(X_1) = \frac{1}{e^\varepsilon + 1}U(X_0) + \frac{e^\varepsilon}{e^\varepsilon + 1}U(X_1).$$
Lemma 114 follows from the proof of Lemma 3.2 in [197], noting that the weaker hypothesis assumed in Lemma 114 is sufficient for all steps to go through.

Definition 115 (Deletion Group DP). An algorithm $\mathcal{R}: \mathcal{X}^n \to \mathcal{Z}$ is $(\varepsilon, \delta)$ deletion group DP for groups of size $n$ if there exists a reference distribution $\rho$ such that $\mathcal{R}(X) \approx_{(\varepsilon,\delta)} \rho$ for all $X \in \mathcal{X}^n$.

It is easy to show that if $\mathcal{R}$ is $(\varepsilon, \delta)$-deletion group DP for groups of size $n$, then $\mathcal{R}$ is $(2\varepsilon, (1 + e^\varepsilon)\delta)$ group DP for groups of size $n$. In addition, we have the following result:

Lemma 116. Let $X_0 \in \mathcal{X}^n$. If $\mathcal{R}: \mathcal{X}^n \to \mathcal{Z}$ is an $(\varepsilon, \delta)$-DP local randomizer, then $\mathcal{R}$ is $(n\varepsilon, ne^{(n-1)\varepsilon}\delta)$ deletion group DP for groups of size $n$ with reference distribution $\mathcal{R}(X_0)$ (i.e. $\mathcal{R}(X) \approx_{(\widetilde{\varepsilon}, \widetilde{\delta})} \mathcal{R}(X_0)$ for all $X \in \mathcal{X}^n$, where $\widetilde{\varepsilon} = n\varepsilon$ and $\widetilde{\delta} = ne^{(n-1)\varepsilon}\delta$).

Proof. By group privacy (Proposition 12) and the assumption that $\mathcal{R}$ is $(\varepsilon, \delta)$-DP, it follows that $\mathcal{R}(X)$ and $\mathcal{R}(X')$ are $(n\varepsilon, ne^{(n-1)\varepsilon}\delta)$-indistinguishable for all $X, X' \in \mathcal{X}^n$. In particular, taking $X' := X_0$ completes the proof. $\square$
Lemma 117. Let $\mathcal{R}^{(i)}: \mathcal{X}^n \to \mathcal{Z}$ be randomized algorithms ($i \in [N]$) and let $\mathcal{A}_s: \mathcal{X}^{n\times N} \to \mathcal{Z}^N$ be the shuffled algorithm $\mathcal{A}_s(X) := (\mathcal{R}^{(1)}(X_{\pi(1)}), \cdots, \mathcal{R}^{(N)}(X_{\pi(N)}))$ derived from $\{\mathcal{R}^{(i)}\}_{i\in[N]}$ for $X = (X_1, \cdots, X_N)$, where $\pi$ is a uniformly random permutation of $[N]$. Let $X^0 = (X^0_1, X_2, \cdots, X_N)$ and $X^1 = (X^1_1, X_2, \cdots, X_N)$, $\delta \in (0,1)$, and $p \in \left[\frac{16\ln(2/\delta)}{N}, 1\right]$. Suppose that for all $i \in [N]$ and $X \in \mathcal{X}^n\setminus\{X^1_1, X^0_1\}$, there exists a distribution $\mathcal{LO}^{(i)}(X)$ such that
$$\mathcal{R}^{(i)}(X) = \frac{p}{2}\mathcal{R}^{(i)}(X^0_1) + \frac{p}{2}\mathcal{R}^{(i)}(X^1_1) + (1 - p)\mathcal{LO}^{(i)}(X).$$
Then $\mathcal{A}_s(X^0) \approx_{(\varepsilon,\delta)} \mathcal{A}_s(X^1)$, where
$$\varepsilon \leq \ln\left(1 + \frac{8\sqrt{\ln(4/\delta)}}{\sqrt{pN}} + \frac{8}{pN}\right).$$

Proof. The proof closely mirrors the proof of Lemma 3.3 in [96], replacing their notation with ours. Observe that the DP assumption in Lemma 3.3 of [96] is not actually needed in the proof. $\square$
Lemma 118. Let $\mathcal{R}: \mathcal{X}^n \to \mathcal{Z}$ be $(\varepsilon, \delta)$ deletion group DP for groups of size $n$ with reference distribution $\rho$. Then there exists a randomizer $\mathcal{R}': \mathcal{X}^n \to \mathcal{Z}$ such that:
(i) $\mathcal{R}'$ is $(\varepsilon, 0)$ deletion group DP for groups of size $n$ with reference distribution $\rho$; and
(ii) $TV(\mathcal{R}(X), \mathcal{R}'(X)) \leq \delta$.
In particular, $\mathcal{R}'$ is $(2\varepsilon, 0)$ group DP for groups of size $n$ (by (i)).

Proof. The proof is nearly identical to the proof of Lemma 3.7 in [96]. $\square$

We also need the following stronger version of Lemma 3.7 from [96]:

Lemma 119. If $\mathcal{R}(X^0_1) \approx_{(\varepsilon_0,\delta_0)} \mathcal{R}(X^1_1)$, then there exists a randomizer $\mathcal{R}': \mathcal{X}^n \to \mathcal{Z}$ such that $\mathcal{R}'(X^1_1) \approx_{(\varepsilon_0, 0)} \mathcal{R}(X^0_1)$ and $TV(\mathcal{R}'(X^1_1), \mathcal{R}(X^1_1)) \leq \delta_0$.

Proof. The proof follows the same techniques as Lemma 3.7 in [96], noting that the weaker hypothesis in Lemma 119 is sufficient for all the steps to go through and that the assumption of $n = 1$ in [96] is not needed in the proof. $\square$

Lemma 120 ([88], Lemma 3.17). Given random variables $P, Q, P'$ and $Q'$, if $D_{e^\varepsilon}(P', Q') \leq \delta$, $TV(P, P') \leq \delta'$, and $TV(Q, Q') \leq \delta'$, then $D_{e^\varepsilon}(P, Q) \leq \delta + (e^\varepsilon + 1)\delta'$.
Lemma 121 ([96], Lemma 2.3). Let $P$ and $Q$ be distributions satisfying $P = (1-q)P_0 + qP_1$ and $Q = (1-q)P_0 + qQ_1$ for some $q \in [0,1]$. Then for any $\varepsilon > 0$, if $\varepsilon' = \log(1 + q(e^\varepsilon - 1))$, then
$$D_{e^{\varepsilon'}}(P\|Q) \leq q\max\{D_{e^\varepsilon}(P_1\|P_0),\, D_{e^\varepsilon}(P_1\|Q_1)\}.$$

We are now ready to prove Theorem 110:
Proof of Theorem 110. LetX
0
,X
1
PX
nˆN
be adjacent (in the CDP sense) distributed data sets (i.e.|X
0
∆ X
1
|ď
1). Assume WLOG thatX
0
“pX
0
1
,X
2
,¨¨¨ ,X
N
q andX
1
“pX
1
1
,X
2
,¨¨¨ ,X
N
q, whereX
0
1
“px
1,0
,x
1,2
,¨¨¨ ,x
1,n
q‰
px
1,1
,x
1,2
,¨¨¨ ,x
1,n
q. We can also assume WLOG that X
j
R tX
0
1
,X
1
1
u for all j P t2,¨¨¨ ,Nu by re-
defining X andR
piq
r
if necessary.
FixiPrNs,rPrRs,Z“ Z
1:r´1
“ Z
p1:Nq
p1:r´1q
PZ
pr´1qˆN
, denoteRpXq :“R
piq
r
pZ,Xq forX PX
n
,
andA
s
pXq :“ A
r
s
pZ
1:r´1
,Xq. Draw π uniformly from the set of permutations ofrNs. Now, sinceR
ispε
0
,δ 0
q-DP, RpX
1
1
q »
pε
r
0
,δ
r
0
q
RpX
0
1
q, so by Lemma 119, there exists a local randomizer R
1
such that
R
1
pX
1
1
q »
pε
r
0
,0q
RpX
0
1
q andTVpR
1
pX
1
1
q,RpX
1
1
qqďδ r
0
.
Hence, by Lemma 114, there exist distributionsUpX
0
1
q andUpX
1
1
q such that
RpX
0
1
q“
e
ε
r
0
e
ε
r
0
`1
UpX
0
1
q`
1
e
ε
r
0
`1
UpX
1
1
q (B.13)
and
R
1
pX
1
1
q“
1
e
ε
r
0
`1
UpX
0
1
q`
e
ε
r
0
e
ε
r
0
`1
UpX
1
1
q. (B.14)
Denote $\tilde{\varepsilon}_0 := n\varepsilon_0^r$ and $\tilde{\delta}_0 := n e^{(n-1)\varepsilon_0^r}\delta_0^r$. By convexity of the hockey-stick divergence and the hypothesis that $\mathcal{R}$ is $(\varepsilon_0^r, \delta_0^r)$-DP (hence $\mathcal{R}(X) \simeq_{(\tilde{\varepsilon}_0, \tilde{\delta}_0)} \mathcal{R}(X_1^0), \mathcal{R}(X_1^1)$ for all $X$ by Lemma 116), we have $\mathcal{R}(X) \simeq_{(\tilde{\varepsilon}_0, \tilde{\delta}_0)} \frac{1}{2}(\mathcal{R}(X_1^0) + \mathcal{R}(X_1^1)) := \rho$ for all $X \in \mathcal{X}^n$. That is, $\mathcal{R}$ is $(\tilde{\varepsilon}_0, \tilde{\delta}_0)$ deletion group DP for groups of size $n$ with reference distribution $\rho$. Thus, Lemma 118 implies that there exists a local randomizer $\mathcal{R}''$ such that $\mathcal{R}''(X)$ and $\rho$ are $(\tilde{\varepsilon}_0, 0)$ indistinguishable and $\mathrm{TV}(\mathcal{R}''(X), \mathcal{R}(X)) \le \tilde{\delta}_0$ for all $X$. Then by the definition of $(\tilde{\varepsilon}_0, 0)$ indistinguishability, for all $X$ there exists a "left-over" distribution $LO(X)$ such that
$$\mathcal{R}''(X) = \frac{1}{e^{\tilde{\varepsilon}_0}}\rho + (1 - 1/e^{\tilde{\varepsilon}_0})LO(X) = \frac{1}{2e^{\tilde{\varepsilon}_0}}\big(\mathcal{R}(X_1^0) + \mathcal{R}(X_1^1)\big) + (1 - 1/e^{\tilde{\varepsilon}_0})LO(X).$$
Now, define a randomizer $\mathcal{L}$ by $\mathcal{L}(X_1^0) := \mathcal{R}(X_1^0)$, $\mathcal{L}(X_1^1) := \mathcal{R}'(X_1^1)$, and
$$\mathcal{L}(X) := \frac{1}{2e^{\tilde{\varepsilon}_0}}\mathcal{R}(X_1^0) + \frac{1}{2e^{\tilde{\varepsilon}_0}}\mathcal{R}'(X_1^1) + (1 - 1/e^{\tilde{\varepsilon}_0})LO(X) = \frac{1}{2e^{\tilde{\varepsilon}_0}}U(X_1^0) + \frac{1}{2e^{\tilde{\varepsilon}_0}}U(X_1^1) + (1 - 1/e^{\tilde{\varepsilon}_0})LO(X) \tag{B.15}$$
for all $X \in \mathcal{X}^n \setminus \{X_1^0, X_1^1\}$. (The equality follows from (B.13) and (B.14).) Note that $\mathrm{TV}(\mathcal{R}(X_1^0), \mathcal{L}(X_1^0)) = 0$, $\mathrm{TV}(\mathcal{R}(X_1^1), \mathcal{L}(X_1^1)) \le \delta_0^r$, and for all $X \in \mathcal{X}^n \setminus \{X_1^0, X_1^1\}$,
$$\mathrm{TV}(\mathcal{R}(X), \mathcal{L}(X)) \le \mathrm{TV}(\mathcal{R}(X), \mathcal{R}''(X)) + \mathrm{TV}(\mathcal{R}''(X), \mathcal{L}(X)) \le \tilde{\delta}_0 + \frac{1}{2e^{\tilde{\varepsilon}_0}}\mathrm{TV}(\mathcal{R}'(X_1^1), \mathcal{R}(X_1^1)) = \Big(ne^{(n-1)\varepsilon_0^r} + \frac{1}{2e^{n\varepsilon_0^r}}\Big)\delta_0^r \le \big(2ne^{(n-1)\varepsilon_0^r}\big)\delta_0^r = 2\tilde{\delta}_0.$$
Keeping $r$ fixed (omitting $r$ scripts everywhere), for any $i \in [N]$ and $Z := Z_{1:r-1} \in \mathcal{Z}^{(r-1)\times N}$, let $\mathcal{L}^{(i)}(Z, \cdot)$, $U^{(i)}(Z, \cdot)$, and $LO^{(i)}(Z, \cdot)$ denote the randomizers resulting from the process described above. Let $\mathcal{A}_L : \mathcal{X}^{n\times N} \to \mathcal{Z}^N$ be defined exactly the same way as $\mathcal{A}_s^r := \mathcal{A}_s$ (same $\pi$) but with the randomizers $\mathcal{R}^{(i)}$ replaced by $\mathcal{L}^{(i)}$. Since $\mathcal{A}_s$ applies each randomizer $\mathcal{R}^{(i)}$ exactly once and $\mathcal{R}^{(1)}(Z, X_{\pi(1)}), \cdots, \mathcal{R}^{(N)}(Z, X_{\pi(N)})$ are independent (conditional on $Z = Z_{1:r-1}$),§ we have $\mathrm{TV}(\mathcal{A}_s(X^0), \mathcal{A}_L(X^0)) \le N(2ne^{(n-1)\varepsilon_0^r})\delta_0^r$ and $\mathrm{TV}(\mathcal{A}_s(X^1), \mathcal{A}_L(X^1)) \le N(2ne^{(n-1)\varepsilon_0^r})\delta_0^r$ (see [73]). Now we claim that $\mathcal{A}_L(X^0)$ and $\mathcal{A}_L(X^1)$ are $(\varepsilon^r, \delta^r)$ indistinguishable for any $\delta^r \ge 2e^{-Ne^{-n\varepsilon_0^r}/16}$. Observe that this claim implies that $\mathcal{A}_s(X^0)$ and $\mathcal{A}_s(X^1)$ are $(\varepsilon^r, \tilde{\delta}^r)$ indistinguishable by Lemma 120 (with $P' := \mathcal{A}_L(X^0)$, $Q' := \mathcal{A}_L(X^1)$, $P := \mathcal{A}_s(X^0)$, $Q := \mathcal{A}_s(X^1)$). Therefore, it remains to prove the claim, i.e. to show that $D_{e^{\varepsilon^r}}(\mathcal{A}_L(X^0), \mathcal{A}_L(X^1)) \le \delta^r$ for any $\delta^r \ge 2e^{-Ne^{-n\varepsilon_0^r}/16}$.

§This follows from the assumption given in the lead-up to Definition 96 that $\mathcal{R}^{(i)}(Z_{1:r-1}, X)$ is conditionally independent of $X'$ given $Z_{1:r-1}$ for all $Z_{1:r-1}$ and $X \ne X'$.
Now, define
$$\mathcal{L}_U^{(i)}(Z, X) := \begin{cases} U^{(i)}(Z, X_1^0) & \text{if } X = X_1^0, \\ U^{(i)}(Z, X_1^1) & \text{if } X = X_1^1, \\ \mathcal{L}^{(i)}(Z, X) & \text{otherwise.} \end{cases}$$
For any inputs $Z, X$, let $\mathcal{A}_U(Z, X)$ be defined exactly the same as $\mathcal{A}_s(Z, X)$ (same $\pi$) but with the randomizers $\mathcal{R}^{(i)}$ replaced by $\mathcal{L}_U^{(i)}$. Then by (B.13) and (B.14),
$$\mathcal{A}_L(X^0) = \frac{e^{\varepsilon_0^r}}{e^{\varepsilon_0^r}+1}\mathcal{A}_U(X^0) + \frac{1}{e^{\varepsilon_0^r}+1}\mathcal{A}_U(X^1) \quad \text{and} \quad \mathcal{A}_L(X^1) = \frac{1}{e^{\varepsilon_0^r}+1}\mathcal{A}_U(X^0) + \frac{e^{\varepsilon_0^r}}{e^{\varepsilon_0^r}+1}\mathcal{A}_U(X^1). \tag{B.16}$$
Then by (B.15), for any $X \in \mathcal{X}^n \setminus \{X_1^0, X_1^1\}$ and any $Z = Z_{1:r-1} \in \mathcal{Z}^{(r-1)\times N}$, we have $\mathcal{L}_U^{(i)}(Z, X) = \frac{1}{2e^{\tilde{\varepsilon}_0}}\mathcal{L}_U^{(i)}(Z, X_1^0) + \frac{1}{2e^{\tilde{\varepsilon}_0}}\mathcal{L}_U^{(i)}(Z, X_1^1) + (1 - e^{-\tilde{\varepsilon}_0})LO^{(i)}(Z, X)$. Hence, Lemma 117 (with $p := e^{-\tilde{\varepsilon}_0} = e^{-n\varepsilon_0^r}$) implies that $\mathcal{A}_U(X^0)$ and $\mathcal{A}_U(X^1)$ are
$$\left( \log\left( 1 + \frac{8\sqrt{e^{\tilde{\varepsilon}_0}\ln(4/\delta^r)}}{\sqrt{N}} + \frac{8e^{\tilde{\varepsilon}_0}}{N} \right),\ \delta^r \right)$$
indistinguishable for any $\delta^r \ge 2e^{-Ne^{-n\varepsilon_0^r}/16}$. Applying Lemma 121 with $P := \mathcal{A}_L(X^0)$, $Q = \mathcal{A}_L(X^1)$, $q = \frac{e^{\varepsilon_0^r}-1}{e^{\varepsilon_0^r}+1}$, $P' = \mathcal{A}_U(X^0)$, $Q' = \mathcal{A}_U(X^1)$, and $P_0 = \frac{1}{2}(P' + Q')$, together with convexity of the hockey-stick divergence, yields that $\mathcal{A}_L(X^0)$ and $\mathcal{A}_L(X^1)$ are $(\varepsilon^r, \delta^r)$ indistinguishable, as desired. This proves the claim, and hence (by Lemma 120, as described earlier) the theorem.
Remark 122. Notice that if $\mathcal{A}$ is sequentially interactive, then the proof of Theorem 110 above almost immediately implies the sequentially interactive part of Theorem 109. Essentially, just change notation: replace $Z_{1:r-1}$ by $Z^{(1:i-1)}$, the collection of (single) reports sent by the first $i-1$ silos; note that $\varepsilon_0^r = \varepsilon_0$, $\delta_0^r = \delta_0$; and view the $N$ reports as being sent in order instead of simultaneously. Alternatively, plug our techniques for $n > 1$ into the proof of Theorem 3.8 in [96], which is for sequentially interactive algorithms.
Step 2: Combine Theorem 109 with the following CDP SCO lower bounds, which follow from [32, 29] and the non-private SCO lower bounds [200, 3]:
Theorem 123 ([29, 32]). Let $\mu, D, \varepsilon > 0$, $L \ge \mu D$, and $\delta = o(1/nN)$. Consider $\mathcal{X} := \left\{-\frac{D}{\sqrt{d}}, \frac{D}{\sqrt{d}}\right\}^d \subset \mathbb{R}^d$ and $\mathcal{W} := B_2(0, D) \subset \mathbb{R}^d$. Let $\mathcal{A} : \mathcal{X}^{nN} \to \mathcal{W}$ be any $(\varepsilon, \delta)$-CDP algorithm. Then:

1. There exists a ($\mu = 0$) convex, linear ($\beta$-smooth for any $\beta$), $L$-Lipschitz loss $f : \mathcal{W}\times\mathcal{X} \to \mathbb{R}$ and a distribution $\mathcal{P}$ on $\mathcal{X}$ such that if $X \sim \mathcal{P}^{nN}$, then the expected excess loss of $\mathcal{A}$ is lower bounded as
$$\mathbb{E}F(\mathcal{A}(X)) - F^* = \widetilde{\Omega}\left( \frac{\phi D}{\sqrt{Nn}} + LD\min\left\{ 1, \frac{\sqrt{d}}{\varepsilon nN} \right\} \right).$$

2. For $L \approx \mu D$, there exists a $\mu$-strongly convex and smooth, $L$-Lipschitz loss $f : \mathcal{W}\times\mathcal{X} \to \mathbb{R}$ and a distribution $\mathcal{P}$ on $\mathcal{X}$ such that if $X \sim \mathcal{P}^{nN}$, then the expected excess loss of $\mathcal{A}$ is lower bounded as
$$\mathbb{E}F(\mathcal{A}(X)) - F^* = \widetilde{\Omega}\left( \frac{\phi^2}{\mu Nn} + \frac{L^2}{\mu}\min\left\{ 1, \frac{d}{\varepsilon^2 n^2 N^2} \right\} \right).$$

For general $L, \mu, D$, the above strongly convex lower bound holds with the factor $\frac{L^2}{\mu}$ replaced by $LD$.
Namely, if $\mathcal{A}$ is $(\varepsilon_0, \delta_0)$-ISRL-DP, then (under the hypotheses of Theorem 35) $\mathcal{A}_s$ is $(\varepsilon, \delta)$-CDP for $\varepsilon = \widetilde{O}(\varepsilon_0/\sqrt{N})$, so Theorem 123 implies that the excess loss of $\mathcal{A}_s$ is lower bounded as in Theorem 123 with $\varepsilon$ replaced by $\varepsilon_0/\sqrt{N}$.
Step 3: We simply observe that when the expectation is taken over the randomness in sampling $X \sim \mathcal{P}^{n\times N}$, the expected excess population loss of $\mathcal{A}_s$ is identical to that of $\mathcal{A}$, since $X_i$ and $X_{\pi(i)}$ have the same distribution for all $i, \pi$ by the i.i.d. assumption. This completes the proof of Theorem 35.
B.1.5.5 Tight Excess Risk Bounds for Cross-Device FL Without a Trusted Server

We explain how our ISRL-DP excess risk bounds also imply tight excess risk bounds for algorithms that satisfy both ISRL-DP and user-level DP simultaneously, which may be desirable for cross-device FL without a trusted server. Assume $M = N$ for simplicity and consider i.i.d. FL for concreteness.¶ Given $(\varepsilon, \delta)$ with $\varepsilon \le 1$, let $\varepsilon_0 = \varepsilon/n$, $\delta_0 = \delta/4n \le \delta/(ne^{(n-1)\varepsilon_0}) = \delta/(ne^{(n-1)\varepsilon/n})$ and run Noisy (ISRL-DP) MB-SGD with noise calibrated to $(\varepsilon_0, \delta_0)$. Then the algorithm also satisfies $(\varepsilon, \delta)$-user-level DP by Appendix B.1.3.
Thus, by Theorem 34, we obtain hybrid ISRL-DP/user-level DP excess risk upper bounds. For example, in the convex case, we get
$$\mathbb{E}F(\hat{w}_T) - F^* = \widetilde{O}\left( LD\left( \frac{\sqrt{d\ln(1/\delta_0)}}{\varepsilon_0 n\sqrt{N}} + \frac{1}{\sqrt{Nn}} \right) \right) = \widetilde{O}\left( LD\left( \frac{\sqrt{d\ln(n/\delta)}}{\varepsilon\sqrt{N}} + \frac{1}{\sqrt{Nn}} \right) \right) = \widetilde{O}\left( \frac{LD\sqrt{d\ln(n/\delta)}}{\varepsilon\sqrt{N}} \right). \tag{B.17}$$
Regarding lower bounds: note that the semantics of the hybrid ISRL-DP/user-level DP notion are essentially identical to LDP, except that individual "records/items" are now thought of as datasets of size $n$. Thus, letting $n = 1$ in our ISRL-DP lower bounds (we think of each silo as having just one "record," even though that record is really a dataset) yields lower bounds matching (up to logarithms) the upper bounds attained above. For example, putting $n = 1$ in the convex lower bound in Theorem 35 yields a bound that matches (B.17), establishing the optimal FL rates (up to logarithmic factors) for this algorithm class. Note that the minimax risk bounds for ISRL-DP/user-level DP hybrid algorithms resemble the bounds for LDP algorithms [85], scaling with $N$ but not with $n$.
¶It is easy to see that the same arguments we use in this subsection for i.i.d. SCO can also be used to establish tight federated ERM bounds.
B.1.6 Supplemental Material for Section 3.2.2

B.1.6.1 Multi-Stage Implementation of Accelerated Noisy MB-SGD

Here we describe the multi-stage implementation of Accelerated Noisy MB-SGD that we will use to further expedite convergence for strongly convex loss, which builds on [109]. As before, for SCO, silos sample locally without replacement in each round and set $R = \lfloor n/K \rfloor$; for ERM, silos instead sample locally with replacement.

Multi-stage implementation of Algorithm 1: Inputs: $U \in [R]$ such that $\sum_{k=1}^U R_k \le R$ for $R_k$ defined below; $w_0 \in \mathcal{W}$, $\Delta \ge F(w_0) - F^*$, $V > 0$, and $q_0 = 0$.
For $k \in [U]$, do the following:

1. Run $R_k$ rounds of Algorithm 1 using $w_0 = q_{k-1}$, $\{\alpha_r\}_{r\ge 1}$ and $\{\eta_r\}_{r\ge 1}$, where
$$R_k = \left\lceil \max\left\{ 4\sqrt{\frac{2\beta}{\mu}},\ \frac{128V^2}{3\mu\Delta 2^{-(k+1)}} \right\} \right\rceil, \qquad \alpha_r = \frac{2}{r+1}, \qquad \eta_r = \frac{4\upsilon_k}{r(r+1)},$$
$$\upsilon_k = \max\left\{ 2\beta,\ \left[ \frac{\mu V^2}{3\Delta 2^{-(k-1)}}R_k(R_k+1)(R_k+2) \right]^{1/2} \right\}.$$

2. Set $q_k = w^{ag}_{R_k}$, where $w^{ag}_{R_k}$ is the output of Step 1 above. Then update $k \leftarrow k+1$ and return to Step 1.
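For concreteness, the stage schedule above can be tabulated numerically. The following sketch uses hypothetical parameter values ($\beta$, $\mu$, $V$, $\Delta$ below are placeholders, not values from the thesis) and computes the stage lengths $R_k$ and the scalars $\upsilon_k$ that determine $\eta_r = 4\upsilon_k/(r(r+1))$ within stage $k$; $\alpha_r = 2/(r+1)$ is stage-independent.

```python
import math

def stage_params(beta, mu, V, Delta, U):
    """Compute the multi-stage schedule: stage lengths R_k and the scalars
    v_k defining the within-stage stepsizes eta_r = 4 v_k / (r (r + 1))."""
    stages = []
    for k in range(1, U + 1):
        R_k = math.ceil(max(4 * math.sqrt(2 * beta / mu),
                            128 * V ** 2 / (3 * mu * Delta * 2 ** -(k + 1))))
        v_k = max(2 * beta,
                  math.sqrt(mu * V ** 2 / (3 * Delta * 2 ** -(k - 1))
                            * R_k * (R_k + 1) * (R_k + 2)))
        stages.append((R_k, v_k))
    return stages

# Hypothetical inputs: beta = 10, mu = 0.1, V = 1, Delta = 5, U = 4 stages.
sched = stage_params(beta=10.0, mu=0.1, V=1.0, Delta=5.0, U=4)
```

Note how the stage lengths $R_k$ grow geometrically with $k$ (the $2^{-(k+1)}$ factor halves the target error each stage), while the first term $4\sqrt{2\beta/\mu}$ enforces a condition-number-dependent floor.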
B.1.6.2 Complete Version and Proof of Theorem 36

We will state and prove the theorem for general $M \in [N]$ under Assumption 3. We first require some additional notation. Define the heterogeneity parameter
$$\upsilon^2 := \sup_{w\in\mathcal{W}} \frac{1}{N}\sum_{i=1}^N \|\nabla F_i(w) - \nabla F(w)\|^2, \tag{B.18}$$
which has appeared in [148, 154, 143, 265]. Note $\upsilon^2 = 0$ iff $F_i = F + a_i$ for constants $a_i \in \mathbb{R}$, $i \in [N]$ ("homogeneous up to translation").
Theorem 124 (Complete version of Theorem 36). Let $f(\cdot, x)$ be $\beta$-smooth for all $x \in \mathcal{X}$. Assume $\varepsilon_0 \le 8\ln(1/\delta_0)$ and $\delta_0 \in (0,1)$. Then, with $\sigma^2 = \frac{32L^2\ln(1.25/\delta_0)}{\varepsilon_0^2 K^2}$, One-Pass Accelerated Noisy MB-SGD is $(\varepsilon_0, \delta_0)$-ISRL-DP. Moreover, there are choices of stepsize, batch size, and $\lambda > 0$ such that:

1. Running One-Pass Accelerated Noisy MB-SGD on $\tilde{f}(w, x) := f(w, x) + \frac{\lambda}{2}\|w - w_0\|^2$ (where $w_0 \in \mathcal{W}$) yields
$$\mathbb{E}F(w^{ag}_R) - F^* = \widetilde{O}\left( \frac{\phi D}{\sqrt{nM}} + \left( \frac{\beta^{1/4}LD^{3/2}\sqrt{d\ln(1/\delta_0)}}{\varepsilon_0 n\sqrt{M}} \right)^{4/5} + \sqrt{\frac{N-M}{N-1}}\,\mathbb{1}_{\{N>1\}}\,\upsilon L^{1/5}D^{4/5}\beta^{1/5}\left( \frac{\sqrt{d\ln(1/\delta_0)}}{\varepsilon_0 nM^3} \right)^{1/5} \right). \tag{B.19}$$
(B.19)
2. If fp¨,xq is µ -strongly convex@x P X and κ “
β µ , then running the Multi-Stage Implementation of
One-Pass Accelerated Noisy MB-SGD directly onfyields with batch sizeK yields
EFpw
ag
R
q´F
˚
“
r
O
ˆ
ϕ 2
nM
`
L
2
µ ?
κd lnp1{δ 0
q
ε
2
0
n
2
M
`
υ 2
µ ?
κM
ˆ
1´
M´1
N´1
˙
1
tNą1u
˙
. (B.20)
Remark 125. 1. For convex $f$, if $M = N$ or
$$\upsilon \lesssim \sqrt{\frac{N-1}{N-M}}\left[ \frac{(L^3D^2\beta^2)^{1/5}M^{1/5}}{(d\ln(1/\delta_0))^{3/10}n^{3/5}\varepsilon_0^{3/5}} + \phi\left( \frac{\beta D}{L} \right)^{1/5}\left( \frac{\varepsilon_0^2 M}{n^3 d\ln(1/\delta_0)} \right)^{1/10} \right],$$
then (B.19) recovers the bound (3.6) in Theorem 36 ($M = N$ version), with $N$ replaced by $M$.

2. For $\mu$-strongly convex $f$, if $M = N$ or
$$\upsilon^2 \lesssim \left( \frac{N-1}{N-M} \right)\sqrt{\kappa}\left( \frac{\phi^2}{n} + \frac{\sqrt{\kappa}L^2 d\ln(1/\delta_0)}{\varepsilon_0^2 n^2} \right),$$
then (B.20) recovers the bound (3.7) in Theorem 36, with $N$ replaced by $M$.
To prove Theorem 124, we will need some preliminaries:

Lemma 126 ([265], Lemma 4). Let $F : \mathcal{W} \to \mathbb{R}$ be convex and $\beta$-smooth, and suppose that the unbiased stochastic gradients $\tilde{g}(w_t)$ at each iteration have bounded variance $\mathbb{E}\|\tilde{g}(w) - \nabla F(w)\|^2 \le V^2$. If $\hat{w}^{ag}_R$ is computed by $R$ steps of Accelerated MB-SGD on the regularized objective $\tilde{F}(w) = F(w) + \frac{V}{2\|w_0 - w^*\|\sqrt{R}}\|w - w_0\|^2$, then
$$\mathbb{E}F(\hat{w}^{ag}_R) - F^* \lesssim \frac{\beta\|w_0 - w^*\|^2}{R^2} + \frac{V\|w_0 - w^*\|}{\sqrt{R}}.$$
We then have the following bound for the multi-stage protocol:

Lemma 127 ([109], Proposition 7). Let $F : \mathcal{W} \to \mathbb{R}$ be $\mu$-strongly convex and $\beta$-smooth, and suppose that the unbiased stochastic gradients $\tilde{g}(w_r)$ at each iteration $r$ have bounded variance $\mathbb{E}\|\tilde{g}(w_r) - \nabla F(w_r)\|^2 \le V^2$. If $\hat{w}^{ag}_R$ is computed by $R$ steps of the Multi-Stage Accelerated MB-SGD, then
$$\mathbb{E}F(\hat{w}^{ag}_R) - F^* \lesssim \Delta\exp\left( -\sqrt{\frac{\mu}{\beta}}R \right) + \frac{V^2}{\mu R},$$
where $\Delta = F(w_0) - F^*$.
Of course, [265, Lemma 4] and [109, Proposition 7] are stated for the non-private Accelerated MB-SGD (AC-SA). However, we observe that the bounds in [265, Lemma 4] and [109, Proposition 7] depend only on the stochastic gradient oracle via its bias and variance. Hence these results also apply to our (multi-stage implementation of) Accelerated Noisy MB-SGD. Next, we bound the variance of our noisy stochastic gradient estimators:
Lemma 128. Let $X_i \sim \mathcal{P}_i^n$ and $\tilde{g}_r := \frac{1}{M_r}\sum_{i\in S_r}\frac{1}{K}\sum_{j\in[K]}\big(\nabla f(w_r, x^r_{i,j}) + u_i\big)$, where $(x^r_{i,j})_{j\in[K]}$ are sampled from $X_i$ and $u_i \sim N(0, \sigma^2\mathbf{I}_d)$ is independent of $\nabla f(w_r, x^r_{i,j})$ for all $i \in [N]$, $j \in [K]$. Then
$$\mathbb{E}\|\tilde{g}_r - \nabla F(w_r)\|^2 \le \frac{\phi^2}{MK} + \left( 1 - \frac{M-1}{N-1} \right)\frac{\upsilon^2}{M}\mathbb{1}_{\{N>1\}} + \frac{d\sigma^2}{M}.$$
The three terms on the right-hand side correspond (from left to right) to the variances of: local minibatch sampling within each silo, the draw of the silo set $S_r$ of size $M_r$ under Assumption 3, and the Gaussian noise. We now turn to the proof of Lemma 128.
Proof of Lemma 128. First, fix the randomness due to the size of the silo set $M_r$. Now $\tilde{g}_r = g_r + \bar{u}_r$, where $\bar{u}_r = \frac{1}{M_r}\sum_{i=1}^{M_r}u_i \sim N(0, \frac{\sigma^2}{M_r}\mathbf{I}_d)$ and $\bar{u}_r$ is independent of $g_r := \frac{1}{M_r}\sum_{i\in S_r}\frac{1}{K_i}\sum_{j\in[K]}\nabla f(w_r, x^r_{i,j})$. Hence,
$$\mathbb{E}\big[\|\tilde{g}_r - \nabla F(w_r)\|^2 \,\big|\, M_r\big] = \mathbb{E}\big[\|g_r - \nabla F(w_r)\|^2 \,\big|\, M_r\big] + \mathbb{E}\big[\|\bar{u}\|^2 \,\big|\, M_r\big] = \mathbb{E}\big[\|g_r - \nabla F(w_r)\|^2 \,\big|\, M_r\big] + d\frac{\sigma^2}{M_r}.$$
Let us drop the $r$ subscripts for brevity (denoting $g = g_r$, $w = w_r$, $S = S_r$, and $M_r = M_1$, since they have the same distribution) and denote $h_i := \frac{1}{K_i}\sum_{j=1}^{K_i}\nabla f(w, x_{i,j})$. Now, we have (conditionally on $M_1$)
$$\mathbb{E}\big[\|g - \nabla F(w)\|^2 \,\big|\, M_1\big] = \mathbb{E}\left[ \left\| \frac{1}{M_1}\sum_{i\in S}\frac{1}{K_i}\sum_{j=1}^{K_i}\nabla f(w, x_{i,j}) - \nabla F(w) \right\|^2 \,\Bigg|\, M_1 \right]$$
$$= \mathbb{E}\left[ \left\| \frac{1}{M_1}\sum_{i\in S}\big(h_i - \nabla F_i(w)\big) + \frac{1}{M_1}\sum_{i\in S}\nabla F_i(w) - \nabla F(w) \right\|^2 \,\Bigg|\, M_1 \right]$$
$$= \underbrace{\frac{1}{M_1^2}\mathbb{E}\left[ \Big\|\sum_{i\in S}h_i(w) - \nabla F_i(w)\Big\|^2 \,\Bigg|\, M_1 \right]}_{ⓐ} + \underbrace{\frac{1}{M_1^2}\mathbb{E}\left[ \Big\|\sum_{i\in S}\nabla F_i(w) - \nabla F(w)\Big\|^2 \,\Bigg|\, M_1 \right]}_{ⓑ},$$
since, conditional on $S$, the cross-terms vanish by (conditional) independence of $h_i$ and the non-random $\sum_{i'\in S}\nabla F_{i'}(w) - \nabla F(w)$ for all $i \in S$. Now we bound ⓐ:
$$ⓐ = \mathbb{E}_S\left[ \mathbb{E}_{h_i}\Big\|\sum_{i\in S}h_i(w) - \nabla F_i(w)\Big\|^2 \,\Bigg|\, S, M_1 \right] = \mathbb{E}_S\left[ \sum_{i\in S}\mathbb{E}_{h_i}\|h_i(w) - \nabla F_i(w)\|^2 \,\Bigg|\, S, M_1 \right] \le \mathbb{E}_S\left[ \sum_{i\in S}\frac{\phi^2}{K} \right] \le \mathbb{E}_S\left[ \frac{M_1\phi^2}{K} \right],$$
by conditional independence of $h_i - \nabla F_i$ and $h_{i'} - \nabla F_{i'}$ given $S$. Hence
$$\frac{1}{M_1^2}\mathbb{E}\left[ \Big\|\sum_{i\in S}h_i(w) - \nabla F_i(w)\Big\|^2 \,\Bigg|\, M_1 \right] \le \frac{\phi^2}{M_1 K}.$$
Next we bound ⓑ. Denote $y_i := \nabla F_i(w)$ and $\bar{y} := \frac{1}{N}\sum_{i=1}^N y_i = \nabla F(w)$. We claim
$$ⓑ = \mathbb{E}\left[ \Big\|\sum_{i\in S}y_i - \bar{y}\Big\|^2 \,\Bigg|\, M_1 \right] \le M_1\left( \frac{N-M_1}{N-1} \right)\upsilon^2.$$
Assume WLOG that $\bar{y} = 0$ (otherwise, consider $y_i' = y_i - \bar{y}$, which has mean $0$). In what follows, we shall omit the "conditional on $M_1$" notation (but continue to condition on $M_1$) and denote by $\Omega$ the collection of all $\binom{N}{M_1}$ subsets of $[N]$ of size $M_1$. Now,
$$ⓑ = \frac{1}{\binom{N}{M_1}}\sum_{S\in\Omega}\Big\|\sum_{i\in S}y_i\Big\|^2 = \frac{1}{\binom{N}{M_1}}\sum_{S\in\Omega}\left( \sum_{i\in S}\|y_i\|^2 + 2\sum_{i,i'\in S,\, i<i'}\langle y_i, y_{i'}\rangle \right)$$
$$= \frac{1}{\binom{N}{M_1}}\left( \binom{N-1}{M_1-1}\sum_{i=1}^N\|y_i\|^2 + 2\binom{N-2}{M_1-2}\sum_{1\le i<i'\le N}\langle y_i, y_{i'}\rangle \right)$$
$$= \frac{M_1}{N}\sum_{i=1}^N\|y_i\|^2 + 2\frac{M_1(M_1-1)}{N(N-1)}\sum_{1\le i<i'\le N}\langle y_i, y_{i'}\rangle$$
$$= \frac{M_1}{N}\left( \frac{M_1-1}{N-1} + \frac{N-M_1}{N-1} \right)\sum_{i=1}^N\|y_i\|^2 + \frac{2M_1(M_1-1)}{N(N-1)}\sum_{1\le i<i'\le N}\langle y_i, y_{i'}\rangle$$
$$= \frac{M_1}{N}\frac{M_1-1}{N-1}\Big\|\sum_{i=1}^N y_i\Big\|^2 + \frac{M_1}{N}\frac{N-M_1}{N-1}\sum_{i=1}^N\|y_i\|^2 = \frac{M_1}{N}\frac{N-M_1}{N-1}\sum_{i=1}^N\|y_i\|^2 \le M_1\left( \frac{N-M_1}{N-1} \right)\upsilon^2.$$
Hence $\frac{1}{M_1^2}\mathbb{E}\big[\|\sum_{i\in S}\nabla F_i(w) - \nabla F(w)\|^2 \,\big|\, M_1\big] \le \frac{N-M_1}{N-1}\frac{\upsilon^2}{M_1}$. Finally, we take expectation over the randomness in $M_1$ and use $\mathbb{E}[1/M_1] = 1/M$ to arrive at the lemma. Also, the result clearly holds when $N = 1$, since the ⓑ term is zero when there is no variance in silo sampling (which is the case when $N = 1$).
Proof of Theorem 124. Privacy: By post-processing [88, Proposition 2.1], it suffices to show that the $R = n/K$ noisy stochastic gradients computed in line 7 of Algorithm 1 are $(\varepsilon_0, \delta_0)$-ISRL-DP. Further, since the batches sampled locally are disjoint (because we sample locally without replacement), parallel composition [191] implies that if each update in line 7 is $(\varepsilon_0, \delta_0)$-ISRL-DP, then the full algorithm is $(\varepsilon_0, \delta_0)$-ISRL-DP. Now recall that the Gaussian mechanism provides $(\varepsilon_0, \delta_0)$-DP if $\sigma^2 \ge \frac{8\Delta_2^2\ln(1.25/\delta_0)}{\varepsilon_0^2}$, where
$$\Delta_2 = \sup_{w,\, X_i \sim X_i'}\Big\|\frac{1}{K}\sum_{j=1}^K\nabla f(w, x_{i,j}) - \nabla f(w, x_{i,j}')\Big\| = \sup_{w,x,x'}\frac{1}{K}\|\nabla f(w, x) - \nabla f(w, x')\| \le \frac{2L}{K}$$
is the $L_2$ sensitivity of the non-private gradient update in line 7 of Algorithm 1: this follows from [47, Propositions 1.3 and 1.6] and our assumption $\varepsilon_0 \le 8\ln(1/\delta_0)$. Therefore, conditional on the private transcript of all other silos, our choice of $\sigma^2$ implies that silo $i$'s transcript is $(\varepsilon_0, \delta_0)$-DP for all $i \in [N]$, which means that One-Pass Noisy Accelerated Distributed MB-SGD is $(\varepsilon_0, \delta_0)$-ISRL-DP.
Excess loss: 1. For the convex case, we choose $\lambda = \frac{V}{2D\sqrt{R}}$, where $V^2 = \frac{\phi^2}{MK} + \frac{\upsilon^2}{M}\mathbb{1}_{\{N>1\}}\left(\frac{N-M}{N-1}\right) + \frac{d\sigma^2}{M}$ is the variance of the noisy stochastic minibatch gradients, by Lemma 128 for our noise with variance $\sigma^2 = \frac{32L^2\ln(1.25/\delta_0)}{K^2\varepsilon_0^2}$. Now plugging $V^2$ into Lemma 126, setting $R = n/K$, and $\lambda := \frac{V}{2D\sqrt{R}}$ yields
$$\mathbb{E}F(w^{ag}_R) - F^* \lesssim \frac{\beta D^2 K^2}{n^2} + \frac{\phi D}{\sqrt{nM}} + \frac{LD\sqrt{d\ln(1/\delta_0)}}{\varepsilon_0\sqrt{KnM}} + \frac{\sqrt{K}\upsilon D}{\sqrt{nM}}\sqrt{\frac{N-M}{N-1}}\mathbb{1}_{\{N>1\}}.$$
Choosing $K = \left(\frac{L}{\beta D}\right)^{2/5}\frac{n^{3/5}(d\ln(1/\delta_0))^{1/5}}{\varepsilon_0^{2/5}M^{1/5}}$ implies (B.19).
2. For strongly convex loss, we plug the same estimate for $V^2$ used above into Lemma 127 with $R = n/K$ to obtain
$$\mathbb{E}F(w^{ag}_R) - F^* \lesssim \Delta\exp\left( -\frac{n}{K\sqrt{\kappa}} \right) + \frac{\phi^2}{\mu nM} + \frac{\upsilon^2 K}{\mu nM}\left( 1 - \frac{M-1}{N-1} \right)\mathbb{1}_{\{N>1\}} + \frac{L^2 d\ln(1/\delta_0)}{\mu Kn\varepsilon_0^2 M}.$$
Choosing $K = \frac{n}{\sqrt{\kappa}}\Big/\ln\left( \mu\Delta\min\left\{ \frac{\varepsilon_0^2 n^2 M}{L^2 d\ln(1/\delta_0)},\ \frac{nM}{\phi^2} \right\} \right)$ yields (B.20).
B.1.7 Supplementary Material for Section 3.2.3

B.1.7.1 Proof of Theorem 37

We begin by considering $\beta$-smooth $f$.

Theorem 129 (Smooth ERM Upper Bound). Assume $f(\cdot, x)$ is $\beta$-smooth for all $x$. Let $\varepsilon_0 \le 2\ln(2/\delta_0)$, $\delta_0 \in (0,1)$, choose $K \ge \frac{\varepsilon_0 n}{4\sqrt{2R\ln(2/\delta_0)}}$, and $\sigma^2 = \frac{256L^2R\ln(\frac{2.5R}{\delta_0})\ln(2/\delta_0)}{n^2\varepsilon_0^2}$. Then Algorithm 1 is $(\varepsilon_0, \delta_0)$-ISRL-DP. Further:
1. If $f_\lambda(\cdot, x)$ is convex, then running Algorithm 1 on the regularized objective $\tilde{f}(w, x) = f(w, x) + \frac{\lambda}{2}\|w - w_0\|^2$ with $R = \max\left( \left( \sqrt{\frac{\beta D}{L}}\frac{\sqrt{M}\varepsilon_0 n}{\sqrt{d\ln(1/\delta_0)}} \right)^{1/2},\ \mathbb{1}_{\{MK<Nn\}}\frac{\varepsilon_0^2 n^2}{Kd\ln(1/\delta_0)} \right)$ (and $\lambda$ specified in the proof) yields
$$\mathbb{E}\hat{F}_X(w^{ag}_R) - \hat{F}_X^* = \widetilde{O}\left( \frac{LD\sqrt{d\ln(1/\delta_0)}}{\varepsilon_0 n\sqrt{M}} \right). \tag{B.21}$$
2. If $f(\cdot, x)$ is $\mu$-strongly convex, then running the multi-stage implementation (Appendix B.1.6.1) of Algorithm 1 with $R = \max\left\{ \sqrt{\frac{\beta}{\mu}}\ln\left( \frac{\Delta_X\mu M\varepsilon_0^2 n^2}{L^2 d} \right),\ \mathbb{1}_{\{MK<Nn\}}\frac{\varepsilon_0^2 n^2}{Kd\ln(1/\delta_0)} \right\}$ and $\Delta_X \ge \hat{F}_X(w_0) - \hat{F}_X^*$ yields
$$\mathbb{E}\hat{F}_X(w^{ag}_R) - \hat{F}_X^* = \widetilde{O}\left( \frac{L^2 d\ln(1/\delta_0)}{\mu\varepsilon_0^2 n^2 M} \right). \tag{B.22}$$
To prove Theorem 129, we will require the following lemma:

Lemma 130 ([161]). Let $\{a_l\}_{l\in[\widetilde{N}]}$ be an arbitrary collection of vectors such that $\sum_{l=1}^{\widetilde{N}}a_l = 0$. Further, let $S$ be a uniformly random subset of $[\widetilde{N}]$ of size $\widetilde{M}$. Then,
$$\mathbb{E}\left\| \frac{1}{\widetilde{M}}\sum_{l\in S}a_l \right\|^2 = \frac{\widetilde{N}-\widetilde{M}}{(\widetilde{N}-1)\widetilde{M}}\frac{1}{\widetilde{N}}\sum_{l=1}^{\widetilde{N}}\|a_l\|^2 \le \frac{\mathbb{1}_{\{\widetilde{M}<\widetilde{N}\}}}{\widetilde{M}\widetilde{N}}\sum_{l=1}^{\widetilde{N}}\|a_l\|^2.$$
Proof of Theorem 129. Privacy: ISRL-DP of Algorithm 14 follows from ISRL-DP of the stochastic gradients $\{\tilde{g}_r^i\}_{r=1}^R$ in line 7 of the algorithm (which was established in Theorem 104) and the post-processing property of DP [88, Proposition 2.1]. Namely, since the choices of $\sigma^2, K$ given in Theorem 129 ensure that silos' local stochastic minibatch gradients are ISRL-DP, and the iterates in Algorithm 1 are functions of these private noisy gradients (which do not involve additional data queries), it follows that the iterates themselves are ISRL-DP.
Excess risk: 1. For $\lambda = \frac{V_X}{\|w_0 - w^*\|\sqrt{R}}$, Lemma 126 implies
$$\mathbb{E}\hat{F}_X(\hat{w}_T) - \hat{F}_X^* \lesssim \frac{\beta D^2}{R^2} + \frac{VD}{\sqrt{R}}, \tag{B.23}$$
where $V_X^2 := \sup_{r\in[R]}\mathbb{E}\|\frac{1}{M_r}\sum_{i\in S_r}\tilde{g}_r^i - \nabla\hat{F}_X(w^{md}_r)\|^2$ for $\tilde{g}_r^i = \frac{1}{K}\sum_{j=1}^K\nabla f(w^{md}_r, x^r_{i,j}) + u_i$ defined in line 7 of Algorithm 1. Now Lemma 130 and $L$-Lipschitzness of $f(\cdot, x)$ imply that
$$V_X^2 \le \frac{\mathbb{1}_{\{M_rK<nN\}}}{M_rKnN}\sum_{i=1}^N\sum_{j=1}^n\sup_{w\in\mathcal{W}}\|\nabla f(w, x_{i,j}) - \nabla\hat{F}_X(w)\|^2 \le \mathbb{1}_{\{M_rK<nN\}}\frac{4L^2}{M_rK},$$
conditional on $M_r$. Hence, taking total expectation with respect to $M_r$ and plugging this bound into (B.23) yields:
$$\mathbb{E}\hat{F}_X(\hat{w}_T) - \hat{F}_X^* \lesssim \frac{\beta D^2}{R^2} + LD\left( \frac{\mathbb{1}_{\{MK<nN\}}}{\sqrt{MKR}} + \frac{\sqrt{d\ln^2(R/\delta_0)}}{\sqrt{M}\varepsilon_0 n} \right),$$
by our choice of $\sigma^2$ and independence of the noises $\{u_i\}_{i\in S_r}$ across silos. Then one verifies that the prescribed choice of $R$ yields (B.21).
E
p
F
X
pp w
T
q´
p
F
˚
X
À∆ X
exp
ˆ
´
c
µ β R
˙
`
L
2
µ ˆ
1
tMKănNu
MKR
`
dln
2
pRδ 0
q
Mε
2
0
n
2
˙
,
and plugging in the prescribedR completes the proof.
We now re-state a precise form of Theorem 37 before providing its proof:

Theorem 131 (Precise Re-statement of Theorem 37). Let $\varepsilon_0 \le 2\ln(2/\delta_0)$, $\delta_0 \in (0,1)$, choose $\sigma^2 = \frac{256L^2R\ln(\frac{2.5R}{\delta_0})\ln(2/\delta_0)}{n^2\varepsilon_0^2}$ and $K \ge \frac{\varepsilon_0 n}{4\sqrt{2R\ln(2/\delta_0)}}$. Then, Algorithm 1 is $(\varepsilon_0, \delta_0)$-ISRL-DP. Further, there exist choices of $\beta > 0$ such that for Algorithm 14 run on $f_\beta(w, x) := \min_{v\in\mathcal{W}}\left( f(v, x) + \frac{\beta}{2}\|w - v\|^2 \right)$, we have:
1. If $f(\cdot, x)$ is convex, then setting $R = \max\left( \frac{\sqrt{M}\varepsilon_0 n}{\sqrt{d\ln(1/\delta_0)}},\ \mathbb{1}_{\{MK<Nn\}}\frac{\varepsilon_0^2 n^2}{Kd\ln(1/\delta_0)} \right)$ yields
$$\mathbb{E}\hat{F}_X(w^{ag}_R) - \hat{F}_X^* = \widetilde{O}\left( \frac{LD\sqrt{d\ln(1/\delta_0)}}{\varepsilon_0 n\sqrt{M}} \right). \tag{B.24}$$
2. If $f(\cdot, x)$ is $\mu$-strongly convex and $R = \widetilde{O}\left( \max\left\{ \frac{\sqrt{M}\varepsilon_0 n}{\sqrt{d\ln(1/\delta_0)}},\ \mathbb{1}_{\{MK<Nn\}}\frac{\varepsilon_0^2 n^2}{Kd\ln(1/\delta_0)} \right\} \right)$ (in the multi-stage implementation in Appendix B.1.6.1), then
$$\mathbb{E}\hat{F}_X(w^{ag}_R) - \hat{F}_X^* = \widetilde{O}\left( \frac{L^2 d\ln(1/\delta_0)}{\mu\varepsilon_0^2 n^2 M} \right). \tag{B.25}$$
Proof. We established ISRL-DP in Theorem 129.

Excess risk: 1. By Lemma 105 and Theorem 129, we have
$$\mathbb{E}\hat{F}_X(\hat{w}_T) - \hat{F}_X^* \lesssim LD\frac{\sqrt{d\ln^2(R/\delta_0)}}{\sqrt{M}\varepsilon_0 n} + \frac{L^2}{\beta},$$
if $R = \max\left( \left( \sqrt{\frac{\beta D}{L}}\frac{\sqrt{M}\varepsilon_0 n}{\sqrt{d\ln(1/\delta_0)}} \right)^{1/2},\ \mathbb{1}_{\{MK<Nn\}}\frac{\varepsilon_0^2 n^2}{Kd\ln(1/\delta_0)} \right)$. Now choosing $\beta := \frac{\varepsilon_0 n\sqrt{M}L}{\sqrt{d}D}$ yields both
2. By Lemma 105 and Theorem 129, we have
$$\mathbb{E}\hat{F}_X(\hat{w}_T) - \hat{F}_X^* \lesssim \frac{L^2 d\ln^2(R/\delta_0)}{\mu M\varepsilon_0^2 n^2} + \frac{L^2}{\beta},$$
if $R = \max\left\{ \sqrt{\frac{\beta}{\mu}}\ln\left( \frac{\Delta_X\mu M\varepsilon_0^2 n^2}{L^2 d} \right),\ \mathbb{1}_{\{MK<Nn\}}\frac{\varepsilon_0^2 n^2}{Kd\ln(1/\delta_0)} \right\}$. Now choosing $\beta := \frac{\mu M\varepsilon_0^2 n^2}{d}$ yields both the desired excess risk and communication complexity bound.
Remark 132. The algorithm of [111] requires $R = \widetilde{\Omega}(\varepsilon_0^2 n^2 M/d)$ communications, making Algorithm 1 faster by a factor of $\min(\sqrt{M}\varepsilon_0 n/\sqrt{d},\, MK)$. If $M = N$ and full batches are used, the advantage of our algorithm over that of [111] is even more significant.
B.1.7.2 Lower Bounds for ISRL-DP Federated ERM

Formally, define the algorithm class $\mathcal{B}_{(\varepsilon_0,\delta_0),C}$ to consist of those (sequentially interactive or $C$-compositional, ISRL-DP) algorithms $\mathcal{A} \in \mathcal{A}_{(\varepsilon_0,\delta_0),C}$ such that for any $X \in \mathcal{X}$, $f \in \mathcal{F}_{L,D}$, the expected empirical loss of the shuffled algorithm $\mathcal{A}_s$ derived from $\mathcal{A}$ is upper bounded by the expected loss of $\mathcal{A}$: $\mathbb{E}_{\mathcal{A},\{\pi_r\}_r}\hat{F}_X(\mathcal{A}_s(X)) \lesssim \mathbb{E}_{\mathcal{A}}\hat{F}_X(\mathcal{A}(X))$. Here $\mathcal{A}_s$ denotes the algorithm that applies the randomizer $\mathcal{R}_r^{(i)}$ to $X_{\pi_r(i)}$ for all $i, r$, but otherwise behaves exactly like $\mathcal{A}$. This is not a very constructive definition, but we will describe examples of algorithms in $\mathcal{B}_{(\varepsilon_0,\delta_0),C}$. $\mathcal{B}_{(\varepsilon_0,\delta_0),C}$ includes all $C$-compositional or sequentially interactive ISRL-DP algorithms that are symmetric with respect to each of the $N$ silos, meaning that the aggregation functions $g_r$ are symmetric (i.e. $g_r(Z_1, \cdots, Z_N) = g_r(Z_{\pi(1)}, \cdots, Z_{\pi(N)})$ for all permutations $\pi$) and in each round $r$ the randomizers $\mathcal{R}_r^{(i)} = \mathcal{R}_r$ are the same for all silos $i \in [N]$. ($\mathcal{R}_r^{(i)}$ can still change with $r$, though.) For example, Algorithm 14 and Algorithm 1 are both in $\mathcal{B}_{(\varepsilon_0,\delta_0)} := \mathcal{B}_{(\varepsilon_0,\delta_0),1}$. This is because the aggregation functions used in each round are simple averages of the $M = N$ noisy gradients received from all silos (and they are compositional), and the randomizers used by every silo in round $r$ are identical: each adds the same Gaussian noise to the stochastic gradients. $\mathcal{B}$ also includes sequentially interactive algorithms that choose the order in which silos are processed uniformly at random. This is because the distributions of the updates of $\mathcal{A}$ and $\mathcal{A}_s$ are both averages over all permutations of $[N]$ of the conditional (on $\pi$) distributions of the randomizers applied to the $\pi$-permuted database. If $\mathcal{A} \in \mathcal{B}_{(\varepsilon_0,\delta_0),C}$ is sequentially interactive or compositional, we write $\mathcal{A} \in \mathcal{B}$. We now state and prove our ERM lower bounds:
Theorem 133. Let $\varepsilon_0 \in (0, \sqrt{N}]$, $\delta_0 = o(1/nN)$, and $\mathcal{A} \in \mathcal{B}_{(\varepsilon_0,\delta_0),C}$ such that in every round $r \in [R]$, the local randomizers $\mathcal{R}_r^{(i)}(Z^{(1:r-1)}, \cdot) : \mathcal{X}^n \to \mathcal{Z}$ are $(\varepsilon_0^r, \delta_0^r)$-DP for all $i \in [N]$, $Z^{(1:r-1)} \in \mathcal{Z}^{(r-1)\times N}$, with $\varepsilon_0^r \le \frac{1}{n}$, $\delta_0^r = o(1/nNR)$, and $N \ge 16\ln(2/(\delta_0^r n))$. Then:

1. There exists a (linear, hence $\beta$-smooth for all $\beta \ge 0$) loss function $f \in \mathcal{F}_{L,D}$ and a database $X \in \mathcal{X}^{nN}$ (for some $\mathcal{X}$) such that:
$$\mathbb{E}\hat{F}_X(\mathcal{A}(X)) - \hat{F}_X^* = \widetilde{\Omega}\left( LD\min\left\{ 1, \frac{\sqrt{d}}{\varepsilon_0 n\sqrt{N}C^2} \right\} \right).$$

2. There exists a ($\mu$-smooth) $f \in \mathcal{G}_{\mu,L,D}$ and database $X \in \mathcal{X}^{nN}$ such that
$$\mathbb{E}\hat{F}_X(\mathcal{A}(X)) - \hat{F}_X^* = \widetilde{\Omega}\left( LD\min\left\{ 1, \frac{d}{\varepsilon_0^2 n^2 NC^4} \right\} \right).$$

Further, if $\mathcal{A} \in \mathcal{B}$, then the above lower bounds hold with $C = 1$.

Proof. Step 1 is identical to Step 1 of the proof of Theorem 35. Step 2 is very similar to Step 2 in the proof of Theorem 35, but now we use Theorem 134 (below) instead of Theorem 123 to lower bound the excess empirical loss of $\mathcal{A}_s$. Step 3: Finally, the definition of $\mathcal{B}$ implies that the excess risk of $\mathcal{A}$ is the same as that of $\mathcal{A}_s$, hence the lower bound also applies to $\mathcal{A}$.
Theorem 134 ([32]). Let $\mu, D, \varepsilon > 0$, $L \ge \mu D$, and $\delta = o(1/nN)$. Consider $\mathcal{X} := \left\{-\frac{D}{\sqrt{d}}, \frac{D}{\sqrt{d}}\right\}^d \subset \mathbb{R}^d$ and $\mathcal{W} := B_2(0, D) \subset \mathbb{R}^d$. Let $\mathcal{A} : \mathcal{X}^{nN} \to \mathcal{W}$ be any $(\varepsilon, \delta)$-CDP algorithm. Then:

1. There exists a ($\mu = 0$) convex, linear ($\beta$-smooth for any $\beta$), $L$-Lipschitz loss $f : \mathcal{W}\times\mathcal{X} \to \mathbb{R}$ and a database $X \in \mathcal{X}^{nN}$ such that the expected empirical loss of $\mathcal{A}$ is lower bounded as
$$\mathbb{E}\hat{F}_X(\mathcal{A}(X)) - \hat{F}_X^* = \widetilde{\Omega}\left( LD\min\left\{ 1, \frac{\sqrt{d}}{\varepsilon nN} \right\} \right).$$

2. There exists a $\mu$-strongly convex, $\mu$-smooth, $L$-Lipschitz loss $f : \mathcal{W}\times\mathcal{X} \to \mathbb{R}$ and a database $X \in \mathcal{X}^{nN}$ such that the expected empirical loss of $\mathcal{A}$ is lower bounded as
$$\mathbb{E}\hat{F}_X(\mathcal{A}(X)) - \hat{F}_X^* = \widetilde{\Omega}\left( LD\min\left\{ 1, \frac{d}{\varepsilon^2 n^2 N^2} \right\} \right).$$
B.1.8 Proofs and Supplementary Materials for Section 3.2.4

B.1.8.1 Proof of Theorem 38

Theorem 135 (Precise version of Theorem 38). Let $\varepsilon \le \ln(2/\delta)$, $\delta \in (0,1)$, and $M \ge 16\ln(18RM^2/N\delta)$ for (polynomial) $R$ specified in the proof. Then, there is a constant $C > 0$ such that setting
$$\sigma^2 := \frac{CL^2RM\ln(RM^2/N\delta)\ln(R/\delta)\ln(1/\delta)}{n^2N^2\varepsilon^2}$$
ensures that the shuffled version of Algorithm 14 is $(\varepsilon, \delta)$-CDP. Moreover, there exist $\eta_r = \eta$ and $\{\gamma_r\}_{r=0}^{R-1}$ such that the shuffled version of Algorithm 14 achieves the following upper bounds:

1. If $f(\cdot, x)$ is convex, then
$$\mathbb{E}F(\hat{w}_T) - F^* = \widetilde{O}\left( LD\left( \frac{1}{\sqrt{nM}} + \frac{\sqrt{d\ln(1/\delta)}}{\varepsilon nN} \right) \right). \tag{B.26}$$

2. If $f(\cdot, x)$ is $\mu$-strongly convex, then
$$\mathbb{E}F(\hat{w}_T) - F^* = \widetilde{O}\left( \frac{L^2}{\mu}\left( \frac{1}{nM} + \frac{d\ln(1/\delta)}{\varepsilon^2 n^2 N^2} \right) \right). \tag{B.27}$$
Proof. We fix $K = 1$ for simplicity, but note that $K > 1$ can also be used (see Lemma 3 in [111]). We shall also assume WLOG that $f(\cdot, x)$ is $\beta$-smooth: the reduction to non-smooth $f$ follows by Nesterov smoothing, as in the proofs of Theorems 34 and 37. Our choices of $R$ shall be:
$$R := \max\left( \frac{n^2N^2\varepsilon^2}{M},\ \frac{N}{M},\ \min\left\{ n, \frac{\varepsilon^2 n^2 N^2}{dM} \right\},\ \frac{\beta D}{L}\min\left\{ \sqrt{nM}, \frac{\varepsilon nN}{\sqrt{d}} \right\} \right)$$
for convex $f$; and $R := \max\left( \frac{n^2N^2\varepsilon^2}{M},\ \frac{N}{M},\ \frac{8\beta}{\mu}\ln\left( \frac{\beta D^2\mu\varepsilon^2 n^2 N^2}{dL^2} \right),\ \min\left\{ n, \frac{\varepsilon^2 n^2 N^2}{dM} \right\} \right)$ for strongly convex $f$.
Privacy: Observe that in each round $r$, the model updates of the shuffled algorithm $\mathcal{A}_s^r$ can be viewed as post-processing of the composition $\mathcal{M}_r(X) = \mathcal{S}_M \circ \mathrm{samp}_{M,N}(Z_r^{(1)}, \cdots, Z_r^{(N)})$, where $\mathcal{S}_M$ uniformly randomly shuffles the $M$ received reports, $\mathrm{samp}_{M,N}$ is the mechanism that chooses $M$ reports uniformly at random from $N$, and $Z_r^{(i)} = \mathrm{samp}_{1,n}(\hat{\mathcal{R}}_r(x_{i,1}), \cdots, \hat{\mathcal{R}}_r(x_{i,n}))$, where $\hat{\mathcal{R}}_r(x) := \nabla f(w_r, x) + u$ and $u \sim N(0, \sigma^2\mathbf{I}_d)$. Recall (Theorem A.1 in [88]) that $\sigma^2 = \frac{8L^2\ln(2/\hat{\delta}_0)}{\hat{\varepsilon}_0^2}$ suffices to ensure that $\hat{\mathcal{R}}_r$ is $(\hat{\varepsilon}_0, \hat{\delta}_0)$-DP if $\hat{\varepsilon}_0 \le 1$. Now note that $\mathcal{M}_r(X) = \tilde{\mathcal{R}}^M(\mathcal{S}_M \circ \mathrm{samp}_{M,N}(X_1, \cdots, X_N))$, where $\tilde{\mathcal{R}} : \mathcal{X}^n \to \mathcal{Z}$ is given by $X \mapsto \mathrm{samp}_{1,n}(\hat{\mathcal{R}}(x_1), \cdots, \hat{\mathcal{R}}(x_n))$ and $\tilde{\mathcal{R}}^M : \mathcal{X}^{nM} \to \mathcal{Z}^M$ is given by $X \mapsto (\tilde{\mathcal{R}}(X_1), \cdots, \tilde{\mathcal{R}}(X_M))$ for any $X = (X_1, \cdots, X_M) \in \mathcal{X}^{nM}$. This is because we are applying the same randomizer (same additive Gaussian noise) across silos, and the operators $\mathcal{S}_M$ and $\tilde{\mathcal{R}}^M$ commute. (Also, applying a randomizer to all $N$ silos and then randomly choosing $M$ reports is equivalent to randomly choosing $M$ silos and then applying the same randomizer to all $M$ of these silos.) Therefore, conditional on the random subsampling of $M$ out of $N$ silos (denoted $(X_1, \cdots, X_M)$ for convenience), Theorem 3.8 in [96] implies that $(\hat{\mathcal{R}}(x_{\pi(1),1}), \cdots, \hat{\mathcal{R}}(x_{\pi(1),n}), \cdots, \hat{\mathcal{R}}(x_{\pi(M),1}), \cdots, \hat{\mathcal{R}}(x_{\pi(M),n}))$ is $(\hat{\varepsilon}, \hat{\delta})$-CDP, where $\hat{\varepsilon} = O\left( \frac{\hat{\varepsilon}_0\sqrt{\ln(1/M\hat{\delta}_0)}}{\sqrt{M}} \right)$ and $\hat{\delta} = 9M\hat{\delta}_0$, provided $\hat{\varepsilon}_0 \le 1$ and $M \ge 16\ln(2/\hat{\delta}_0)$ (which we will see is satisfied by our assumption on $M$). Next, privacy amplification by subsampling (see [248] and Lemma 3 in [111]) of silos and local samples implies that $\mathcal{M}_r$ is $(\varepsilon_r, \delta_r)$-CDP, where $\varepsilon_r = \frac{2\hat{\varepsilon}M}{nN} = O\left( \frac{\hat{\varepsilon}_0\sqrt{M\ln(1/M\hat{\delta}_0)}}{nN} \right)$ and $\delta_r = \frac{M}{nN}\hat{\delta} = \frac{9M^2}{nN}\hat{\delta}_0$. Finally, by the advanced composition theorem (Theorem 3.20 in [88]), to ensure $\mathcal{A}_s$ is $(\varepsilon, \delta)$-CDP, it suffices to make each round $(\varepsilon_r := \frac{\varepsilon}{2\sqrt{2R\ln(1/\delta)}}, \delta_r := \delta/2R)$-CDP. Using the two equations to solve for $\hat{\varepsilon}_0 = \frac{CnN\varepsilon}{\sqrt{R\ln(1/\delta)\ln(RM/nN\delta)}\,M}$ for some $C > 0$ and $\hat{\delta}_0 = \frac{nN\delta}{18RM^2}$, we see that $\sigma^2 = O\left( \frac{L^2\ln(RM^2/N\delta)\ln(R/\delta)\ln(1/\delta)RM}{n^2N^2\varepsilon^2} \right)$ ensures that $\mathcal{A}_s$ is $(\varepsilon, \delta)$-CDP, i.e. that $\mathcal{A}$ is $(\varepsilon, \delta)$-SDP. Note that our choices of $R$ in the theorem (specifically $R \ge N/M$ and $R \ge \frac{n^2N^2\varepsilon^2}{M}$) ensure that $\hat{\delta}_0, \delta \le 1$ and $\hat{\varepsilon}_0 \lesssim 1$, so that Theorem 3.8 in [96] indeed gives us the amplification-by-shuffling result used above.
Excess risk: Note that shuffling does not affect the uniform stability of the algorithm, so we proceed similarly to the proof of Theorem 34, except that $\sigma^2$ is now smaller.
1. Convex case: Set $\gamma_r = \gamma = 1/R$ for all $r$. Now Lemma 99, Lemma 98, and Lemma 103 (with $\sigma^2$ in the lemma replaced by the $\sigma^2$ prescribed here) together imply for any $\eta \le 1/\beta$ that
$$\mathbb{E}F(\hat{w}_T) - F^* \lesssim \frac{L^2R\eta}{nM} + \frac{D^2}{\eta R} + \eta\left( L^2/M + \frac{d\sigma^2}{M} \right).$$
Now plugging in $\eta := \min\left\{ 1/\beta,\ \frac{D\sqrt{M}}{LR}\min\left\{ \sqrt{n},\ \frac{\varepsilon nN}{\sqrt{dM\ln(RM^2/N\delta)\ln(R/\delta)\ln(1/\delta)}} \right\} \right\}$ yields
$$\mathbb{E}F(\hat{w}_T) - F^* \lesssim LD\max\left\{ \frac{1}{\sqrt{nM}},\ \frac{\sqrt{d\ln(RM^2/N\delta)\ln(R/\delta)\ln(1/\delta)}}{\varepsilon nN} \right\} + \frac{LD}{R\sqrt{M}}\min\left\{ \sqrt{n},\ \frac{\varepsilon nN}{\sqrt{dM\ln(RM^2/N\delta)\ln(R/\delta)\ln(1/\delta)}} \right\} + \frac{\beta D^2}{R}.$$
Then one can verify that plugging in the prescribed $R$ yields the stated excess population loss bound.
2. $\mu$-strongly convex case: By Lemma 103 (and its proof), we know there exist $\eta \le 1/\beta$ and $\{\gamma_r\}_{r=1}^R$ such that
$$\mathbb{E}\hat{F}_X(\hat{w}_T) - \hat{F}_X^* = \widetilde{O}\left( \beta D^2\exp\left( -\frac{R}{2\kappa} \right) + \frac{L^2}{\mu R} + \frac{d\sigma^2}{\mu MR} \right).$$
Hence Lemma 98 and Lemma 99 imply
$$\mathbb{E}F(\hat{w}_T) - F^* = \widetilde{O}\left( \beta D^2\exp\left( -\frac{R}{2\kappa} \right) + \frac{L^2}{\mu R} + \frac{L^2 d\ln(1/\delta)}{\mu\varepsilon^2 n^2 N^2} + \frac{L^2}{\mu Mn} \right).$$
Then one verifies that the prescribed $R$ is large enough to achieve the stated excess population loss bound.
B.1.8.2 SDP One-Pass Accelerated Noisy MB-SGD and the Proof of Theorem 39

To develop an SDP variation of One-Pass Accelerated Noisy MB-SGD, we will use the binomial noise-based protocol of [55] (described in Algorithm 16) instead of using the Gaussian mechanism and amplification by shuffling. This is because, for our one-pass algorithm, amplification by shuffling would result in an impractical restriction on $\varepsilon$. Algorithm 16 invokes the SDP scalar summation subroutine Algorithm 17.
Algorithm 16 $\mathcal{P}_{vec}$, a shuffle protocol for vector summation [55]
1: Input: database of $d$-dimensional vectors $X = (x_1, \cdots, x_N)$; privacy parameters $\varepsilon, \delta$; $L$.
2: procedure: Local Randomizer $\mathcal{R}_{vec}(x_i)$
3:   for $j \in [d]$ do
4:     Shift component to enforce non-negativity: $w_{i,j} \leftarrow x_{i,j} + L$
5:     $m_j \leftarrow \mathcal{R}_{1D}(w_{i,j})$
6:   end for
7:   Output labeled messages $\{(j, m_j)\}_{j\in[d]}$
8: end procedure
9: procedure: Analyzer $\mathcal{A}_{vec}(y)$
10:   for $j \in [d]$ do
11:     Run analyzer on coordinate $j$'s messages: $z_j \leftarrow \mathcal{A}_{1D}(y_j)$
12:     Re-center: $o_j \leftarrow z_j - L$
13:   end for
14:   Output the vector of estimates $o = (o_1, \cdots, o_d)$
15: end procedure
Algorithm 17 $\mathcal{P}_{1D}$, a shuffle protocol for summing scalars [55]
1: Input: Scalar database $X = (x_1, \cdots, x_N) \in [0, L]^N$; $g, b \in \mathbb{N}$; $p \in (0, \frac{1}{2})$.
2: procedure: Local Randomizer $\mathcal{R}_{1D}(x_i)$
3:   $\bar{x}_i \leftarrow \lfloor x_i g/L \rfloor$.
4:   Sample rounding value $\eta_1 \sim \mathrm{Ber}(x_i g/L - \bar{x}_i)$.
5:   Set $\hat{x}_i \leftarrow \bar{x}_i + \eta_1$.
6:   Sample privacy noise value $\eta_2 \sim \mathrm{Bin}(b, p)$.
7:   Report $y_i \in \{0,1\}^{g+b}$ containing $\hat{x}_i + \eta_2$ copies of $1$ and $g + b - (\hat{x}_i + \eta_2)$ copies of $0$.
8: end procedure
9: procedure: Analyzer $\mathcal{A}_{1D}(\mathcal{S}(y))$
10:   Output estimator $\frac{L}{g}\left( \left( \sum_{i=1}^N\sum_{j=1}^{b+g}(y_i)_j \right) - pbN \right)$.
11: end procedure
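To make the mechanics of $\mathcal{P}_{1D}$ concrete, the sketch below is a simplified simulation of the randomizer and analyzer (our own illustration with hypothetical parameter values, not a faithful implementation of [55], which would transmit the $g+b$ one-bit messages through a shuffler). Since the analyzer only uses the total number of 1s, we track that sum directly.

```python
import math
import random

def randomizer_1d(x, L, g, b, p, rng):
    """Local randomizer R_1D: fixed-point encode x in [0, L] on a grid of
    g levels with unbiased randomized rounding, then add Bin(b, p) noise.
    Returns the number of 1s among the g + b one-bit messages."""
    xbar = math.floor(x * g / L)
    eta1 = 1 if rng.random() < (x * g / L - xbar) else 0
    xhat = xbar + eta1
    eta2 = sum(1 for _ in range(b) if rng.random() < p)  # Bin(b, p)
    return xhat + eta2

def analyzer_1d(bit_sums, L, g, b, p):
    """Analyzer A_1D: subtract the expected binomial noise p*b*N and
    rescale by L/g to recover an unbiased estimate of sum_i x_i."""
    N = len(bit_sums)
    return (L / g) * (sum(bit_sums) - p * b * N)

rng = random.Random(0)
data = [0.2, 1.4, 0.7, 1.0]       # values in [0, L] with L = 2
L, g, b, p = 2.0, 1000, 200, 0.25  # hypothetical protocol parameters
est = analyzer_1d([randomizer_1d(x, L, g, b, p, rng) for x in data],
                  L, g, b, p)
```

The estimator is unbiased: the randomized rounding contributes mean $x_i g/L$ per user, and the binomial noise mean $pb$ per user is subtracted by the analyzer, so `est` concentrates around $\sum_i x_i = 3.3$.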
We recall the privacy and accuracy guarantees of Algorithm 16 below:

Lemma 136 ([55]). For any $0 < \varepsilon \le 15$, $0 < \delta < 1/2$, $d, N \in \mathbb{N}$, and $L > 0$, there are choices of parameters $b, g \in \mathbb{N}$ and $p \in (0, 1/2)$ for $\mathcal{P}_{1D}$ (Algorithm 17) such that, for $X = (x_1, \cdots, x_N)$ containing vectors of maximum norm $\max_{i\in[N]}\|x_i\| \le L$, the following holds: 1) $\mathcal{P}_{vec}$ is $(\varepsilon, \delta)$-SDP; and 2) $\mathcal{P}_{vec}(X)$ is an unbiased estimate of $\sum_{i=1}^N x_i$ with bounded variance
$$\mathbb{E}\left[ \left\| \mathcal{P}_{vec}(X; \varepsilon, \delta; L) - \sum_{i=1}^N x_i \right\|^2 \right] = O\left( \frac{dL^2\log^2\left(\frac{d}{\delta}\right)}{\varepsilon^2} \right).$$
With these building blocks in hand, we provide our SDP One-Pass Accelerated Noisy MB-SGD algorithm in Algorithm 18. We now provide the general version of Theorem 39 for $M \le N$:
Algorithm 18 SDP Accelerated Noisy MB-SGD
1: Input: Data $X_i \in \mathcal{X}^{n_i}$, $i \in [N]$, strong convexity modulus $\mu \ge 0$, privacy parameters $(\varepsilon, \delta)$, iteration number $R \in \mathbb{N}$, batch size $K \in [n]$, step size parameters $\{\eta_r\}_{r \in [R]}$, $\{\alpha_r\}_{r \in [R]}$.
2: Initialize $w_0^{ag} = w_0 \in \mathcal{W}$ and $r = 1$.
3: for $r \in [R]$ do
4: Server updates and broadcasts $w_r^{md} = \frac{(1-\alpha_r)(\mu + \eta_r)}{\eta_r + (1 - \alpha_r^2)\mu} w_{r-1}^{ag} + \frac{\alpha_r[(1-\alpha_r)\mu + \eta_r]}{\eta_r + (1-\alpha_r^2)\mu} w_{r-1}$
5: for $i \in S_r$ in parallel do
6: Silo $i$ draws $\{x_{i,j}^r\}_{j=1}^K$ from $X_i$ (without replacement) and computes $Z_i^r := \{\nabla f(w_r^{md}, x_{i,j}^r)\}_{j=1}^K$.
7: end for
8: Server receives $\tilde{g}_r := \frac{1}{M_r K} \mathcal{P}_{\text{vec}}(\{Z_i^r\}_{i \in S_r}; \varepsilon, \delta; L)$.
9: Server computes $w_r := \operatorname{argmin}_{w \in \mathcal{W}} \left\{ \alpha_r \left[ \langle \tilde{g}_r, w \rangle + \frac{\mu}{2}\|w_r^{md} - w\|^2 \right] + \left[ (1-\alpha_r)\frac{\mu}{2} + \frac{\eta_r}{2} \right] \|w_{r-1} - w\|^2 \right\}$.
10: Server updates and broadcasts $w_r^{ag} = \alpha_r w_r + (1 - \alpha_r) w_{r-1}^{ag}$.
11: end for
12: Output: $w_R^{ag}$.
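For intuition, when $\mathcal{W} = \mathbb{R}^d$ (unconstrained) the argmin in line 9 of Algorithm 18 is a strongly convex quadratic with a closed-form, coordinate-wise solution. The following sketch is our own unconstrained simplification (all names are ours); the constrained case additionally requires a projection onto $\mathcal{W}$.

```python
def prox_step(g, w_md, w_prev, alpha, mu, eta):
    """Closed-form minimizer of line 9 of Algorithm 18 when W = R^d.

    Minimizes  alpha * (<g, w> + (mu/2) * |w_md - w|^2)
             + ((1 - alpha) * mu / 2 + eta / 2) * |w_prev - w|^2.
    Setting the gradient to zero gives, coordinate-wise,
      w = (alpha*mu*w_md + ((1-alpha)*mu + eta)*w_prev - alpha*g) / (mu + eta).
    """
    denom = mu + eta
    return [
        (alpha * mu * wm + ((1 - alpha) * mu + eta) * wp - alpha * gi) / denom
        for gi, wm, wp in zip(g, w_md, w_prev)
    ]
```

One can verify the solution by checking that the gradient of the objective vanishes at the returned point.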
Theorem 137 (Complete version of Theorem 39). Let $f(\cdot, x)$ be $\beta$-smooth $\forall x \in \mathcal{X}$. Assume $\varepsilon \le 15$, $\delta \in (0, \frac{1}{2})$. Then, Algorithm 18 is $(\varepsilon, \delta)$-SDP. Moreover, there are choices of stepsize, batch size, and $\lambda > 0$ such that (for $\upsilon$ defined in B.18):

1. Running Algorithm 18 on $\tilde{f}(w, x) := f(w, x) + \frac{\lambda}{2}\|w - w_0\|^2$ (where $w_0 \in \mathcal{W}$) yields
\[
\mathbb{E} F(w_R^{ag}) - F^* = O\left( \frac{\phi D}{\sqrt{nM}} + \left( \frac{\beta^{1/4} L D^{3/2} \sqrt{d \ln(d/\delta)}}{\varepsilon n M} \right)^{4/5} + \sqrt{\frac{N-M}{N-1}}\, \mathbb{1}_{\{N > 1\}}\, \upsilon\, L^{1/5} D^{4/5} \beta^{1/5} \left( \frac{\sqrt{d \ln(d/\delta)}}{\varepsilon n M^{3.5}} \right)^{1/5} \right). \tag{B.28}
\]

2. If $f(\cdot, x)$ is $\mu$-strongly convex $\forall x \in \mathcal{X}$ and $\kappa = \frac{\beta}{\mu}$, then running the Multi-Stage Implementation of Algorithm 18 (recall Appendix B.1.6.1) directly on $f$ yields
\[
\mathbb{E} F(w_R^{ag}) - F^* = \tilde{O}\left( \frac{\phi^2}{nM} + \frac{L^2 \sqrt{\kappa}\, d \ln(1/\delta)}{\mu\, \varepsilon^2 n^2 M^2} + \frac{\upsilon^2 \sqrt{\kappa}}{\mu M}\left(1 - \frac{M-1}{N-1}\right) \mathbb{1}_{\{N>1\}} \right). \tag{B.29}
\]
Remark 138. 1. For convex $f$, if $M = N$ or
\[
\upsilon \lesssim \sqrt{\frac{N-1}{N-M}} \left[ \frac{(L^3 D^2 \beta^2)^{1/5} (d \ln(d/\delta))^{3/10} M^{1/10}}{n^{3/5} \varepsilon^{3/5}} + \phi \left( \frac{\beta D}{L} \right)^{1/5} \left( \frac{\varepsilon^2 M^2}{n^3\, d \ln^2(d/\delta)} \right)^{1/10} \right],
\]
then B.28 recovers the bound 3.10 in Theorem 36 ($M = N$ version), with $N$ replaced by $M$.

2. For $\mu$-strongly convex $f$, if $M = N$ or
\[
\upsilon^2 \lesssim \left( \frac{N-1}{N-M} \right) \sqrt{\kappa} \left( \frac{\phi^2}{n} + \frac{\sqrt{\kappa}\, L^2 d \ln(1/\delta)}{M \varepsilon^2 n^2} \right),
\]
then B.29 recovers the bound 3.11 in Theorem 36, with $N$ replaced by $M$.

Also, note that appealing to privacy amplification by subsampling would result in tighter excess risk bounds than those stated in Theorem 137 when $M < N$, but would require a restriction $\varepsilon \lesssim M/N$. To avoid this restriction, we do not invoke privacy amplification by subsampling in our analysis.
Proof of Theorem 137. Privacy: By post-processing [88, Proposition 2.1], it suffices to show that the $R = n/K$ noisy stochastic gradients computed in line 8 of Algorithm 18 are $(\varepsilon, \delta)$-SDP. Further, since the batches sampled locally are disjoint (because we sample locally without replacement), parallel composition [191] implies that if each update in line 8 is $(\varepsilon, \delta)$-SDP, then the full algorithm is $(\varepsilon, \delta)$-SDP. Since $f(\cdot, x)$ is $L$-Lipschitz, it follows directly from Lemma 136 that each update in line 8 is $(\varepsilon, \delta)$-SDP.

Excess loss: 1. For the convex case, we choose $\lambda = \frac{V}{2D\sqrt{R}}$, where $V^2 = \frac{\phi^2}{MK} + \frac{\upsilon^2}{M} \mathbb{1}_{\{N>1\}}\left(\frac{N-M}{N-1}\right) + \mathrm{Var}(\tilde{g}_r)$ is the conditional variance of the noisy stochastic minibatch gradients given $M_r$, by Lemma 128. Also, conditional on $M_r$, we have $\mathrm{Var}(\tilde{g}_r) \lesssim \frac{d L^2 \ln^2(d/\delta)}{\varepsilon^2 M_r^2 K^2}$ by Lemma 136 and independence of the data. Hence, taking total expectation over $M_r$, we get $V^2 = \frac{\phi^2}{MK} + \frac{\upsilon^2}{M}\mathbb{1}_{\{N>1\}}\left(\frac{N-M}{N-1}\right) + \frac{d L^2 \ln^2(d/\delta)}{\varepsilon^2 M^2 K^2}$. Now plugging $V^2$ into Lemma 126, setting $R = n/K$, and $\lambda := \frac{V}{2D\sqrt{R}}$ yields
\[
\mathbb{E} F(w_R^{ag}) - F^* \lesssim \frac{\beta D^2 K^2}{n^2} + \frac{\phi D}{\sqrt{nM}} + \frac{L D \sqrt{d \ln(d/\delta)}}{\varepsilon M \sqrt{Kn}} + \frac{\sqrt{K}\, \upsilon D}{\sqrt{nM}} \sqrt{\frac{N-M}{N-1}}\, \mathbb{1}_{\{N>1\}}.
\]
Choosing $K = \left( \frac{L}{\beta D} \right)^{2/5} \frac{n^{3/5} (d \ln(1/\delta_0))^{1/5}}{\varepsilon_0^{2/5} M^{2/5}}$ implies B.28.

2. For strongly convex loss, we plug the same estimate for $V^2$ used above into Lemma 127 with $R = n/K$ to obtain
\[
\mathbb{E} F(w_R^{ag}) - F^* \lesssim \Delta \exp\left( \frac{-n}{K \sqrt{\kappa}} \right) + \frac{\phi^2}{\mu n M} + \frac{\upsilon^2 K}{\mu n M}\left( 1 - \frac{M-1}{N-1} \right) \mathbb{1}_{\{N>1\}} + \frac{L^2 d \ln^2(d/\delta)}{\mu K n \varepsilon^2 M^2},
\]
where $\Delta \ge F(w_0) - F^*$. Choosing $K = \frac{n}{\sqrt{\kappa}\, \ln\left( \mu \Delta \min\left\{ \frac{\varepsilon^2 n^2 M^2}{L^2 d \ln(1/\delta_0)}, \frac{nM}{\phi^2} \right\} \right)}$ yields B.29.
B.1.9 ISRL-DP Upper Bounds with Unbalanced Data Set Sizes and Differing Privacy Needs Across Silos

In order to state the generalized versions of our upper bounds (for arbitrary $n_i, \varepsilon_i, \delta_i$, $i \in [N]$), we will require some additional notation and assumptions.
B.1.9.1 Additional Notation and Assumptions

First, we define a generalization of $(\varepsilon_0, \delta_0)$-ISRL-DP (as it was formally defined in Appendix B.1.2) that allows for differing privacy parameters across silos:

Definition 139 (Generalized Inter-Silo Record-Level Differential Privacy). Let $\rho_i(X_i, X_i') := \sum_{j=1}^{n_i} \mathbb{1}_{\{x_{i,j} \ne x_{i,j}'\}}$, $i \in [N]$. A randomized algorithm $\mathcal{A}$ is $\{(\varepsilon_i, \delta_i)\}_{i=1}^N$-ISRL-DP if for all silos $i$ and all $\rho_i$-adjacent $X_i, X_i'$,
\[
(\mathcal{R}_1^{(i)}(X_i), \mathcal{R}_2^{(i)}(Z_1, X_i), \cdots, \mathcal{R}_R^{(i)}(Z_{1:R-1}, X_i)) \simeq_{(\epsilon_i, \delta_i)} (\mathcal{R}_1^{(i)}(X_i'), \mathcal{R}_2^{(i)}(Z_1', X_i'), \cdots, \mathcal{R}_R^{(i)}(Z_{1:R-1}', X_i')),
\]
where $Z_r := \{\mathcal{R}_r^{(i)}(Z_{1:r-1}, X_i)\}_{i=1}^N$ and $Z_r' := \{\mathcal{R}_r^{(i)}(Z_{1:r-1}', X_i')\}_{i=1}^N$.
We also allow for the weights put on each silo in the FL objective to differ and consider:
\[
\min_{w \in \mathcal{W}} \left\{ F(w) := \sum_{i=1}^N p_i F_i(w) \right\},
\]
where $p_i \in [0,1]$ and $\sum_{i=1}^N p_i = 1$. However, we will present our results for the case where $p_i = \frac{1}{N}$ for all $i \in [N]$. This is without loss of generality: given any $\tilde{F}(w) = \sum_{i=1}^N p_i F_i(w)$, we have $\tilde{F}(w) = \sum_{i=1}^N p_i F_i(w) = \frac{1}{N} \sum_{i=1}^N \tilde{F}_i(w)$, where $\tilde{F}_i(w) = N p_i F_i(w) = \mathbb{E}_{x_i \sim \mathcal{P}_i}[N p_i f(w, x_i)] := \mathbb{E}_{x_i \sim \mathcal{P}_i}[\tilde{f}_i(w, x_i)]$. Thus, our results for the case of $p_i = 1/N$ apply for general $p_i$, but $L$ gets replaced with $\tilde{L} = \max_{i \in [N]} p_i N L$, $\mu$ gets replaced with $\tilde{\mu} = \min_{i \in [N]} p_i N \mu$, and $\beta$ gets replaced with $\tilde{\beta} = \max_{i \in [N]} p_i N \beta$.
We will choose batch sizes $K_i$ such that $K_i / n_i = K_l / n_l$ for all $i, l \in [N]$ in each round, and denote $K = \min_{i \in [N]} K_i$. In addition to the assumptions we made in the main body, we also refine the assumption on $\{M_r\}$ to include a description of the second moment of $1/M_r$:

Assumption 6. In each round $r$, a uniformly random subset $S_r$ of $M_r \in [N]$ distinct silos is available to communicate with the server, where $\{M_r\}_{r \ge 0}$ are independent random variables with $\frac{1}{M} := \mathbb{E}\left(\frac{1}{M_r}\right)$ and $\frac{1}{M'} := \sqrt{\mathbb{E}\left(\frac{1}{M_r^2}\right)}$.
For $M \in [N]$, denote
\[
\bar{\sigma}_M^2 := \frac{1}{M} \sum_{i=1}^M \sigma_{(i)}^2,
\]
where $\sigma_{(1)}^2 := \sigma_{\max}^2 := \max_{i \in [N]} \sigma_i^2 \ge \sigma_{(2)}^2 \ge \cdots \ge \sigma_{(N)}^2 := \sigma_{\min}^2 := \min_{i \in [N]} \sigma_i^2$. (More generally, whenever a bar and $M$ subscript are appended to a parameter, it denotes the average of the $M$ largest values.) Also, define
\[
\Sigma^2 := \sqrt{\mathbb{E}\left( \bar{\sigma}_{M_1}^2 \right)^2}
\]
for any $\{\sigma_i^2\}_{i=1}^N \subseteq [0, \infty)$.
Next, recall the heterogeneity parameter from B.18:
\[
\upsilon^2 := \sup_{w \in \mathcal{W}} \frac{1}{N} \sum_{i=1}^N \|\nabla F_i(w) - \nabla F(w)\|^2.
\]
Lastly, for given parameters, denote
\[
\xi_i := \left( \frac{1}{n_i \varepsilon_i} \right)^2 \ln(2.5R/\delta_i) \ln(2/\delta_i) \quad \text{for } i \in [N], \qquad \xi_{\max} = \max(\xi_1, \cdots, \xi_N),
\]
and
\[
\Xi := \sqrt{\mathbb{E}_{M_1}\left( \frac{1}{M_1} \sum_{i=1}^{M_1} \xi_{(i)} \right)^2}.
\]
In the case of balanced data and same parameters across silos, we have $\xi_i = \xi = \Xi$ for all $i$. In the general case, we have $\xi_{\min} \le \Xi \le \xi_{\max}$.
B.1.9.2 Pseudocode for Noisy ISRL-DP MB-SGD in the Unbalanced Case

The generalized version of Noisy ISRL-DP MB-SGD is described in Algorithm 19.
B.1.9.3 General Unbalanced Version of Theorem 34

We first state the formal version of Theorem 34 for arbitrary $n_i, \varepsilon_i, \delta_i$, using the notation defined in Section B.1.9.1.
Algorithm 19 Noisy ISRL-DP MB-SGD
1: Input: $N, d, R \in \mathbb{N}$, $\{\sigma_i\}_{i \in [N]} \subset [0, \infty)$, $X_i \in \mathcal{X}_i^{n_i}$ for $i \in [N]$, loss function $f(w, x)$, $\{K_i\}_{i=1}^N \subset \mathbb{N}$, $\{\eta_r\}_{r \in [R]}$ and $\{\gamma_r\}_{r \in [R]}$.
2: Initialize $w_0 \in \mathcal{W}$.
3: for $r \in \{0, 1, \cdots, R-1\}$ do
4: for $i \in S_r$ in parallel do
5: Server sends global model $w_r$ to silo $i$.
6: Silo $i$ draws $K_i$ samples $x_{i,j}^r$ uniformly from $X_i$ (for $j \in [K_i]$) and noise $u_i \sim \mathcal{N}(0, \sigma_i^2 \mathbf{I}_d)$.
7: Silo $i$ computes $\tilde{g}_i^r := \frac{1}{K_i} \sum_{j=1}^{K_i} \nabla f(w_r, x_{i,j}^r) + u_i$ and sends to server.
8: end for
9: Server aggregates $\tilde{g}_r := \frac{1}{M_r} \sum_{i \in S_r} \tilde{g}_i^r$.
10: Server updates $w_{r+1} := \Pi_{\mathcal{W}}[w_r - \eta_r \tilde{g}_r]$.
11: end for
12: Output: $\hat{w}_R = \frac{1}{\Gamma_R} \sum_{r=0}^{R-1} \gamma_r w_r$, where $\Gamma_R := \sum_{r=0}^{R-1} \gamma_r$.
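A minimal sketch of one silo's step in Algorithm 19 (lines 6-7: minibatch gradient plus Gaussian noise), using plain Python lists in place of a numerical library; the function and variable names are ours.

```python
import random

def silo_update(w, local_data, grad_f, K_i, sigma_i):
    """One silo's update: average K_i sample gradients, then add N(0, sigma_i^2 I) noise."""
    batch = random.sample(local_data, K_i)          # draw K_i local samples
    d = len(w)
    g = [0.0] * d
    for x in batch:
        gx = grad_f(w, x)                           # per-sample gradient
        g = [gj + gxj / K_i for gj, gxj in zip(g, gx)]
    noise = [random.gauss(0.0, sigma_i) for _ in range(d)]   # u_i ~ N(0, sigma_i^2 I_d)
    return [gj + nj for gj, nj in zip(g, noise)]    # noisy minibatch gradient
```

The server would then average these noisy gradients over the available silos (line 9) and take a projected gradient step (line 10).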
Theorem 140 (Generalized Version of Theorem 34). Let $\varepsilon_i \le 2\ln(2/\delta_i)$, $\delta_i \in (0,1)$. Then, Algorithm 14 is $\{(\varepsilon_i, \delta_i)\}_{i=1}^N$-ISRL-DP if $\sigma_i^2 = \frac{256 L^2 R \ln\left(\frac{2.5R}{\delta_i}\right) \ln(2/\delta_i)}{n_i^2 \varepsilon_i^2}$ and $K_i \ge \frac{\varepsilon_i n_i}{4\sqrt{2R \ln(2/\delta_i)}}$. Moreover, with notation as in Section B.1.9.1, there are choices of algorithmic parameters such that:

1. If $f(\cdot, x)$ is convex, then
\[
\mathbb{E} F(\hat{w}_T) - F^* \lesssim L D \left( \frac{1}{\sqrt{n_{\min} M}} + \sqrt{d \min\left\{ \frac{\Xi}{M'}, \frac{\xi_{\max}}{M} \right\}} \right).
\]
2. If $f(\cdot, x)$ is $\mu$-strongly convex, then
\[
\mathbb{E} F(\hat{w}_T) - F^* = \tilde{O}\left( \frac{L^2}{\mu} \left( \frac{1}{M n_{\min}} + d \min\left\{ \frac{\Xi}{M'}, \frac{\xi_{\max}}{M} \right\} \right) \right). \tag{B.30}
\]
Remark 141. Note that $1/M' \ge 1/M$ by the Cauchy-Schwartz inequality. Both of the upper bounds in Theorem 140 involve minima of the terms $\Xi/M'$ and $\xi_{\max}/M$, which trade off the unbalancedness of silo data and privacy needs with the variance of $1/M_r$. In particular, if the variance of $1/M_r$ is small enough that $\frac{\Xi}{M'} \le \frac{\xi_{\max}}{M}$, then the excess risk bounds in Theorem 140 depend on averages of the parameters across silos, rather than maximums. In FL problems with unbalanced data and disparate privacy needs across a large number of silos, the difference between "average" and "max" can be substantial. On the other hand, if data is balanced and privacy needs are the same across silos, then $\xi_i = \xi_{\max} = \Xi = \ln(2.5R/\delta_0)\ln(2/\delta_0)/n^2\varepsilon_0^2$ for all $i$ and $\frac{\Xi}{M'} \ge \frac{\xi_{\max}}{M}$, so we recover Theorem 34, with dependence only on the mean $1/M$ of $1/M_r$ and not the square root of the second moment $1/M'$.
To prove Theorem 140, we need the following empirical loss bound for Algorithm 14, which generalizes Lemma 103 to the unbalanced setting:
Lemma 142. Let $f : \mathcal{W} \times \mathcal{X} \to \mathbb{R}$ be $\mu$-strongly convex (with $\mu = 0$ for the convex case), $L$-Lipschitz, and $\beta$-smooth in $w$ for all $x \in \mathcal{X}$, where $\mathcal{W}$ is a closed convex set in $\mathbb{R}^d$ s.t. $\|w - w'\| \le D$ for all $w, w' \in \mathcal{W}$. Let $X \in \mathcal{X}$. Then Algorithm 14 with $\sigma_i^2 = \frac{256 L^2 R \ln\left(\frac{2.5R}{\delta_i}\right)\ln(2/\delta_i)}{n_i^2 \varepsilon_i^2}$ attains the following empirical loss bounds as a function of step size and the number of rounds:

1. (Convex) For any $\eta \le 1/\beta$ and $R \in \mathbb{N}$, $\gamma_r := 1/R$, we have
\[
\mathbb{E} \hat{F}_X(\hat{w}_T) - \hat{F}_X^* \le \frac{D^2}{\eta R} + \frac{\eta}{2}\left( d \min\left\{ \frac{\Sigma^2}{M'}, \frac{\sigma_{\max}^2}{M} \right\} + L^2 \right).
\]
2. (Strongly Convex) There exists a constant stepsize $\eta_r = \eta \le 1/\beta$ such that if $R \ge 2\kappa \ln\left( \frac{\beta \mu D^2}{L^2 d \min(\Xi/M', \xi_{\max}/M)} \right)$, then
\[
\mathbb{E} \hat{F}_X(\hat{w}_T) - \hat{F}_X^* = \tilde{O}\left( \frac{L^2}{\mu}\left( \frac{1}{R} + d \min\left\{ \frac{\Xi}{M'}, \frac{\xi_{\max}}{M} \right\} \right) \right). \tag{B.31}
\]
Proof. By the proof of Lemma 103, we have:
\[
\mathbb{E}\left[ \|w_{r+1} - w^*\|^2 \,\Big|\, M_r \right] \le (1 - \mu \eta_r)\, \mathbb{E}\left[ \|w_r - w^*\|^2 \,\Big|\, M_r \right] - 2\eta_r\, \mathbb{E}\left[\hat{F}_X(w_r) - \hat{F}_X^* \,\big|\, M_r\right] + \eta_r^2\, \mathbb{E}\left[ \left\| \bar{u}_r + \frac{1}{M_r} \sum_{i \in S_r} \frac{1}{K_i} \sum_{j=1}^{K_i} \nabla f(w_r, x_{i,j}^r) \right\|^2 \,\Big|\, M_r \right],
\]
which implies
\[
\mathbb{E}\left[ \|w_{r+1} - w^*\|^2 \right] \le (1 - \mu \eta_r)\, \mathbb{E}\left[ \|w_r - w^*\|^2 \right] - 2\eta_r\, \mathbb{E}[\hat{F}_X(w_r) - \hat{F}_X^*] + \eta_r^2 \left( d \min\left\{ \frac{\Sigma^2}{M'}, \frac{\sigma_{\max}^2}{M} \right\} + L^2 \right), \tag{B.32}
\]
since $\mathbb{E}[\|\bar{u}_r\|^2 \,|\, M_r] = \frac{1}{M_r^2} \sum_{i \in S_r} d \sigma_i^2 \le \frac{d\, \bar{\sigma}_{M_r}^2}{M_r}$ and hence
\[
\mathbb{E}[\|\bar{u}_r\|^2] \le d\, \mathbb{E}\left[ \frac{\bar{\sigma}_{M_r}^2}{M_r} \right] \le d \min\left( \frac{\sigma_{\max}^2}{M}, \frac{\Sigma^2}{M'} \right),
\]
using linearity of expectation for the first term in the minimum and the Cauchy-Schwartz inequality for the second term in the minimum. Now we consider the convex ($\mu = 0$) and strongly convex ($\mu > 0$) cases separately.
Convex ($\mu = 0$) case: Re-arranging B.32, we get
\[
\mathbb{E}[\hat{F}_X(w_r) - \hat{F}_X^*] \le \frac{1}{2\eta_r}\left( \mathbb{E}[\|w_r - w^*\|^2 - \|w_{r+1} - w^*\|^2] \right) + \frac{\eta_r}{2}\left( d \min\left\{ \frac{\Sigma^2}{M'}, \frac{\sigma_{\max}^2}{M} \right\} + L^2 \right).
\]
Then for $\eta_r = \eta$, the average iterate $\bar{w}_R$ satisfies:
\[
\begin{aligned}
\mathbb{E}[\hat{F}_X(\bar{w}_R) - \hat{F}_X^*] &\le \frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}[\hat{F}_X(w_r) - \hat{F}_X^*] \\
&\le \frac{1}{R} \sum_{r=0}^{R-1} \frac{1}{2\eta}\left( \mathbb{E}[\|w_r - w^*\|^2 - \|w_{r+1} - w^*\|^2] \right) + \frac{\eta}{2}\left( d \min\left\{ \frac{\Sigma^2}{M'}, \frac{\sigma_{\max}^2}{M} \right\} + L^2 \right) \\
&\le \frac{\|w_0 - w^*\|^2}{\eta R} + \frac{\eta}{2}\left( d \min\left\{ \frac{\Sigma^2}{M'}, \frac{\sigma_{\max}^2}{M} \right\} + L^2 \right),
\end{aligned}
\]
which proves part 1 of the lemma.
Strongly convex ($\mu > 0$) case: Note that B.32 satisfies the conditions for Lemma 102, with sequences
\[
r_t = \mathbb{E}\|w_t - w^*\|^2, \qquad s_t = \mathbb{E}[\hat{F}_X(w_t) - \hat{F}_X^*]
\]
and parameters
\[
a = \mu, \quad b = 2, \quad c = d \min\left\{ \frac{\Sigma^2}{M'}, \frac{\sigma_{\max}^2}{M} \right\} + L^2, \quad g = 2\beta, \quad T = R.
\]
Then Lemma 102 and Jensen's inequality imply
\[
\mathbb{E} \hat{F}_X(\hat{w}_T) - \hat{F}_X^* = \tilde{O}\left( \beta D^2 \exp\left( \frac{-R}{2\kappa} \right) + \frac{L^2}{\mu}\left( \frac{1}{R} + d \min\left\{ \frac{\Xi}{M'}, \frac{\xi_{\max}}{M} \right\} \right) \right),
\]
where $\kappa = \beta/\mu$. Finally, plugging in $R$ completes the proof.
We are prepared to prove Theorem 140.
Proof of Theorem 140. Privacy: The proof follows exactly as in the balanced case, since $\sigma_i^2$ is now calibrated to $(\varepsilon_i, \delta_i)$ for all $i \in [N]$.

Excess loss: We shall prove the results for the case when $f(\cdot, x)$ is $\beta$-smooth. The non-smooth case follows by Nesterov smoothing, as in the proof of Theorem 34.
1. Convex case: By Lemma 99 (and its proof), Lemma 98, and Lemma 142, we have:
\[
\begin{aligned}
\mathbb{E} F(\hat{w}_T) - F^* &\le \alpha + \mathbb{E} \hat{F}_X(\hat{w}_T) - \hat{F}_X^* \\
&\le \frac{2 L^2 R \eta}{n_{\min} M} + \frac{D^2}{\eta R} + \frac{\eta}{2}\left( d \min\left\{ \frac{\Sigma^2}{M'}, \frac{\sigma_{\max}^2}{M} \right\} + L^2 \right) \\
&\le 2 \eta L^2 \left( \frac{R}{n_{\min} M} + d R \min\left\{ \frac{\Xi}{M'}, \frac{\xi_{\max}}{M} \right\} + 1 \right) + \frac{D^2}{\eta R},
\end{aligned}
\]
for any $\eta \le \frac{1}{\beta}$. Choosing $\eta = \min\left( \frac{1}{\beta},\; \frac{D}{L \sqrt{R}} \min\left( \frac{\sqrt{n_{\min} M}}{\sqrt{R}},\; 1,\; \sqrt{\frac{1}{dR} \max\left\{ \frac{M'}{\Xi}, \frac{M}{\xi_{\max}} \right\}} \right) \right)$ yields
\[
\mathbb{E} F(\hat{w}_T) - F^* \lesssim \frac{\beta D^2}{R} + L D \left( \frac{1}{\sqrt{n_{\min} M}} + \frac{1}{\sqrt{R}} + \sqrt{d \min\left\{ \frac{\Xi}{M'}, \frac{\xi_{\max}}{M} \right\}} \right).
\]
Choosing $R \ge \frac{\beta D}{L} \min\left( \sqrt{M n_{\min}},\; \frac{1}{\sqrt{d \min\{\Xi/M', \xi_{\max}/M\}}} \right) + \min\left( n_{\min} M,\; \frac{1}{d} \max\left\{ \frac{M'}{\Xi}, \frac{M}{\xi_{\max}} \right\} \right)$ completes the proof of the convex case.
2. $\mu$-strongly convex case: By Lemma 99 (and its proof), Lemma 98, and Lemma 142, we have:
\[
\mathbb{E} F(\hat{w}_T) - F^* \le \alpha + \mathbb{E} \hat{F}_X(\hat{w}_T) - \hat{F}_X^* \le \frac{4 L^2}{\mu (M n_{\min} - 1)} + \tilde{O}\left( \frac{L^2}{\mu}\left( \frac{1}{R} + d \min\left\{ \frac{\Xi}{M'}, \frac{\xi_{\max}}{M} \right\} \right) \right)
\]
for the $\eta \le 1/\beta$ prescribed in the proof of Lemma 142 and any $R \ge 2\kappa \ln\left( \frac{\beta \mu D^2}{L^2 d \min(\Xi/M', \xi_{\max}/M)} \right)$. Thus, choosing $R = 2\kappa \ln\left( \frac{\beta \mu D^2}{L^2 d \min(\Xi/M', \xi_{\max}/M)} \right) + M n_{\min}$ completes the proof.
Remark 143. Generalized versions of the other upper bounds in this paper can also be easily derived with the techniques used above. The key takeaways are: a) the excess empirical risk (and the private term in the SCO bounds) involves a minimum of two terms that trade off the degree of unbalancedness with the variance of $1/M_r$. In particular, if the variance of $1/M_r$ is sufficiently small (e.g. if $M_r \equiv M$, which is what most existing works on FL assume), then the refined excess risk bounds depend on averages of the parameters across silos, rather than worst-case maximums. b) the generalization error scales with $\min_{i \in [N]} n_i$.
B.1.10 Numerical Experiments: Details and Additional Results

In some plots in this section, we include a parameter describing the heterogeneity of the FL problem:
\[
\upsilon_*^2 := \frac{1}{N} \sum_{i=1}^N \|\nabla F_i(w^*)\|^2,
\]
which has appeared in [148, 154, 143, 265]. If the data is homogeneous, then all $F_i$ share the same minimizers, so $\upsilon_*^2 = 0$, but the converse is false.
ISRL-DP Local SGD runs as follows: in round $r$, each silo $i \in S_r$ receives the global model $w_r$ and takes $K$ steps of noisy SGD (with one sample per step) with their local data: $w_r^{i,0} = w_r$, $w_r^{i,t} = w_r^{i,t-1} - \eta(\nabla f(w_r^{i,t-1}, x_{i,t}^r) + u_i^t)$ for $t \in [K]$, where $x_{i,t}^r$ is drawn uniformly at random from $X_i$ and $u_i^t \sim \mathcal{N}(0, \sigma^2 \mathbf{I}_d)$ for $\sigma^2 = \frac{8 L^2 R K \log(1/\delta)}{\varepsilon_0^2 n^2}$. Then silo $i$ sends its $K$-th iterate $w_r^{i,K}$ to the server; the server averages the iterates across all silos and updates the global model to $w_{r+1} = \frac{1}{M_r} \sum_{i \in S_r} w_r^{i,K}$.
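The client update just described can be sketched as follows (a simplified version with our own function names; the server then averages the returned $K$-th iterates over the available silos).

```python
import random

def local_sgd_client(w_global, local_data, grad_f, K, eta, sigma):
    """K steps of noisy SGD on one silo's data, one sample per step."""
    w = list(w_global)
    for _ in range(K):
        x = random.choice(local_data)                      # uniform local sample
        g = grad_f(w, x)                                   # single-sample gradient
        u = [random.gauss(0.0, sigma) for _ in w]          # u ~ N(0, sigma^2 I_d)
        w = [wj - eta * (gj + uj) for wj, gj, uj in zip(w, g, u)]
    return w                                               # K-th local iterate
```

Note that, unlike MB-SGD, the noise here is injected at every local step, which is why the corresponding $\sigma^2$ above carries the extra factor of $K$.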
B.1.10.1 Logistic Regression with MNIST

The data set can be downloaded from http://yann.lecun.com/exdb/mnist. Our code does this for you automatically.
Experimental setup: To divide the data into $N = 25$ silos and for preprocessing, we borrow code from [265], which can be downloaded from: https://papers.nips.cc/paper/2020/hash/45713f6ff2041d3fdfae927b82488db8-Abstract.html. It is available under a Creative Commons Attribution-Share Alike 3.0 license. There are $n_i = n = 8673$ training and $2168$ test examples per silo; to expedite training, we use only $1/7$ of the MNIST samples ($n = 1238$ training examples per silo). We fix $\delta_i = \delta = 1/n^2$ and test $\varepsilon \in \{0.75, 1.5, 3, 6, 12, 18\}$. The maximum $\upsilon_*^2$ is about $0.17$ for this problem (corresponding to each silo having disjoint local data sets/pairs of digits).
Preprocessing: We used PCA to reduce the dimensionality to $d = 50$. We used an $80/20$ train/test split for all silos. To improve numerical stability, we clipped the input $\langle w, x \rangle$ (i.e. projected it onto $[-15, 15]$) before feeding it into the logistic loss.
Hyperparameter tuning: For each algorithm and each setting of $\varepsilon, R, K, \upsilon_*^2$, we swept through a range of constant stepsizes and ran 3 trials to find the (approximately) optimal stepsize for that particular algorithm and experiment. We then used the corresponding $w_R$ (averaged over the 3 runs) to compute test error. For (ISRL-DP) MB-SGD, the stepsize grid consisted of 10 evenly spaced points between $e^{-6}$ and $1$. For (ISRL-DP) Local SGD, the stepsizes were between $e^{-8}$ and $e^{-1}$. We repeated this entire process $20$ times for fresh train/test splits of the data and reported the average test error in our plots.
Choice of $\sigma^2$ and $K$: We used smaller noise (compared to the theoretical portion of the paper) to get better utility (at the cost of larger $K$/larger computational cost, which is needed for privacy): $\sigma^2 = \frac{8 L^2 \ln(1/\delta) R}{n^2 \varepsilon^2}$, which provides ISRL-DP by Theorem 1 of [1] if $K = \frac{n \sqrt{\varepsilon}}{2\sqrt{R}}$ (c.f. Theorem 3.1 in [29]). Here $L = 2 \max_{x \in \mathcal{X}} \|x\|$ is an upper bound on the Lipschitz parameter of the logistic loss and was computed directly from the training data.
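For concreteness, these choices of $\sigma^2$ and $K$ amount to the following small helper (a sketch; the function name and the rounding of $K$ to a positive integer are ours).

```python
import math

def mnist_privacy_params(n, R, eps, delta, L):
    """Noise variance and batch size used in the MNIST experiments:
    sigma^2 = 8 L^2 ln(1/delta) R / (n^2 eps^2),  K = n sqrt(eps) / (2 sqrt(R))."""
    sigma2 = 8 * L**2 * math.log(1 / delta) * R / (n**2 * eps**2)
    K = max(1, round(n * math.sqrt(eps) / (2 * math.sqrt(R))))
    return sigma2, K
```

Note how $\sigma^2$ grows linearly in the number of rounds $R$: more communication rounds means more noise per round at a fixed privacy level.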
To estimate $\upsilon_*^2$, we followed the procedure used in [265], using Newton's method to compute $w^*$ and then averaging $\|\nabla \hat{F}_i(w^*)\|^2$ over all $i \in [N]$.
Additional experimental result: In Figure B.1, we show an additional experiment with $M = 18$ silos available to communicate in each round. The results are qualitatively similar to the results presented in the main body for MNIST with $M = 25, 12$.
Limitations of Experiments: It is important to note that pre-processing and hyperparameter tuning (and estimation of $L$) were not done in an ISRL-DP manner, since we did not want to detract focus from evaluation of ISRL-DP FL algorithms.∥ As a consequence, the overall privacy loss for the entire experimental process is higher than the $\varepsilon$ indicated in the plots, which solely reflects the privacy loss from running the FL algorithms with fixed hyperparameters and (pre-processed) data. Similar remarks apply for the linear regression experiments (see Appendix B.1.10.2).

∥See [1, 172, 209] and the references therein for discussion of DP PCA and DP hyperparameter tuning.

Figure B.1: Test error vs. $\varepsilon$ for binary logistic regression on MNIST. $\delta = 1/n^2$. We show 90% error bars over the 20 trials (train/test splits).
B.1.10.2 Linear Regression with Health Insurance Data

Dataset: The data set can be downloaded from: https://www.kaggle.com/mirichoi0218/insurance. It is available under an Open Database license. The data set consists of $\tilde{N} = 1338$ observations. The target variable $y$ is medical charges. There are $d - 1 = 6$ features: age, sex, BMI, number of children, smoker, and geographic region.
Experimental setup: For a given N, we grouped data into N (almost balanced) silos by sorting y in
ascending order and then dividing into N groups, the first N ´ 1 of size r1338{Ns and the remaining
points in the last silo. For eachN, we ran experiments withR“ 35. We ran20 trials, each with a fresh
random train/test (80{20) split. We testedεPt.125,.25,.5,1,2,3u and fixed δ i
“1{n
2
i
for all experiments.
269
To estimate $\upsilon_*^2$, we followed the procedure used in [265], using Newton's method to compute $w^*$ and then averaging $\|\nabla \hat{F}_i(w^*)\|^2$ over all $i \in [N]$.
Preprocessing: We first numerically encoded the categorical variables and then standardized the numerical features age and BMI to have zero mean and unit variance.
Gradient clipping: In the absence of a reasonable a priori bound on the Lipschitz parameter of the squared loss (as is typical for unconstrained linear regression problems with potentially unbounded data), we incorporated gradient clipping [1] into the algorithms. We then calibrated the noise to the clip threshold $L$ to ensure LDP. For fairness of comparison, we also allowed for clipping for the non-private algorithms (if it helped their performance).
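The clipping operation referred to here is standard per-gradient $\ell_2$ clipping in the style of [1]; a minimal sketch (function names are ours).

```python
import math

def clip_gradient(g, clip_threshold):
    """Rescale g so that its l2 norm is at most clip_threshold."""
    norm = math.sqrt(sum(gj * gj for gj in g))
    if norm <= clip_threshold:
        return list(g)                       # already within the threshold
    scale = clip_threshold / norm
    return [gj * scale for gj in g]          # shrink onto the l2 ball
```

With clipping, the threshold plays the role of the Lipschitz constant $L$ in the noise calibration, which is exactly how the experiments set $\sigma^2$.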
Hyperparameter tuning: For each trial and each algorithm, we swept through a log-scale grid of 10 stepsizes and 5 clip thresholds 3 times, selected the parameter $w$ that minimized average (over 3 repetitions) training error (among all $10 \times 5 = 50$), and computed the corresponding average test error. The stepsize grids we used ranged from $e^{-8}$ to $e^{1}$ for (ISRL-DP) MB-SGD and from $e^{-10}$ to $1$ for (ISRL-DP) Local SGD. The excess risk (train and test) we computed was for the normalized objective function $F(w, X, Y) = \|Y - wX\|^2 / 2N_0$ where $N_0 \in \{1070, 268\}$ ($1070$ for train, $268$ for test), $X$ is $N_0 \times d$ with $d = 7$ (including a column of all $1$s), and $Y \in \mathbb{R}^{N_0}$. The clip threshold grids were $\{100, 10^4, 10^6, 10^8, 10^{32}\}$, with the last element corresponding to effectively no clipping.
Choice of $\sigma^2$ and $K$: We used the same $\sigma^2$ and $K = \frac{n\sqrt{\varepsilon}}{2\sqrt{R}}$ as in the logistic regression experiments described in Appendix B.1.10.1. However, here $L$ is the clip threshold instead of the Lipschitz parameter.
Relative Root Mean Square Error (RMSE): We scale our reported errors (in the plots) to make them more interpretable. We define the Relative (test) RMSE of an algorithm to be
\[
\sqrt{\mathrm{MSE}/\mathrm{NMSE}} = \sqrt{ \frac{\sum_{i=1}^{N_{\text{test}}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N_{\text{test}}} (y_i - \bar{y}_{\text{train}})^2} },
\]
where NMSE ("Naive Mean Square Error") is the (test) MSE incurred by the non-private naive predictor that predicts $y_i$ to be $\bar{y}_{\text{train}}$, the average of the training labels, for all test data $i \in [N_{\text{test}}]$. Here $\hat{y}_i$ is the predicted label of the algorithm. Relative RMSE can be thought of as the Coefficient of Variation or Relative Standard Deviation of the predictions of the algorithm. Note that even though the naive predictor is completely non-private and allowed to access the entire training data even when $M < N$ (violating the decentralized data principle of FL), the ISRL-DP FL algorithms still outperform this predictor for most values of $\varepsilon$ (except for some experiments when $\varepsilon \approx 0$), as evidenced by values of Relative RMSE being below $1$. For $\varepsilon \approx 1$, ISRL-DP MB-SGD tends to outperform the naive predictor by 30-40%.
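The Relative RMSE defined above can be computed as follows (a sketch; names are ours).

```python
import math

def relative_rmse(y_test, y_pred, y_train_mean):
    """Relative RMSE: sqrt(MSE / NMSE), where NMSE is the MSE of the
    constant predictor that always outputs the training-label mean."""
    mse = sum((y - yp) ** 2 for y, yp in zip(y_test, y_pred))
    nmse = sum((y - y_train_mean) ** 2 for y in y_test)
    return math.sqrt(mse / nmse)
```

A value below $1$ means the algorithm beats the naive mean predictor; a perfect predictor attains $0$.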
Additional experimental results: In Figure B.2 and Figure B.3, we present results for experiments with additional settings of $N$ and $M$. We observe qualitatively similar behavior as in the plots presented in the main body of the paper. In particular, ISRL-DP MB-SGD continues to outperform ISRL-DP Local SGD in most tested setups/privacy levels (especially the high-privacy regime). On the other hand, for some settings of $M, N, K$, we observe that ISRL-DP Local SGD outperforms ISRL-DP MB-SGD as $\varepsilon \to 3$ (e.g. Figure B.3, and $N = M = 3$). In general, we see that the utility impact of ISRL-DP is relatively insignificant for this problem when $\varepsilon \approx 3$.
B.1.10.3 Softmax Regression with Obesity Dataset

The data set can be freely downloaded from https://archive.ics.uci.edu/ml/datasets/Estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition+. The data contains 17 attributes (such as age, gender, physical activity, eating habits, etc.) and 2111 records.

Experimental setup: We divide the data into $N = 7$ heterogeneous silos based on the value of the target variable, obesity level, which is categorical and takes 7 values: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. We fix $\delta_i = \delta = 1/n^2$ and test $\varepsilon \in \{0.5, 1, 3, 6, 9\}$. We ran three trials with a new train/test split in each trial and reported the average test error.
Preprocessing: We numerically encode the categorical variables and standardize the continuous numerical features to have mean zero and unit variance. We used an $80/20$ train/test split. We discarded a small number of (randomly selected) training samples from some silos in order to obtain balanced silo sets, to ease implementation of the noisy methods.
Hyperparameter tuning: For each algorithm and each setting of $\varepsilon, R, K$, we swept through a range of constant stepsizes to find the (approximately) optimal stepsize for that particular algorithm and experiment. We then used the corresponding $w_R$ to compute test error for that trial. For (ISRL-DP) MB-SGD, the stepsize grid consisted of 8 evenly spaced points between $e^{-7}$ and $e^{-1}$. For (ISRL-DP) Local SGD, we started with the same stepsize grid, but sometimes required additional tuning with smaller stepsizes (especially for small $\varepsilon$) to find the optimal one.
Choice of $\sigma^2$ and $K$: We used smaller noise (compared to the theoretical portion of the paper) to get better utility (at the cost of larger $K$/larger computational cost, which is needed for privacy): $\sigma^2 = \frac{8 L^2 \ln(1/\delta) R}{n^2 \varepsilon^2}$ for ISRL-DP MB-SGD, which provides ISRL-DP by Theorem 1 of [1] if $K = \frac{n\sqrt{\varepsilon}}{2\sqrt{R}}$ (c.f. Theorem 3.1 in [29]). For ISRL-DP Local SGD, one needs $\sigma^2 = \frac{8 L^2 \ln(1/\delta) R K}{n^2 \varepsilon^2}$ since the sensitivity and number of gradient evaluations are both larger by a factor of $K$. Here $L = 2\max_{x \in \mathcal{X}} \|x\| = 20$ is an upper bound on the Lipschitz parameter of the softmax loss and was estimated directly from the pre-processed training data.
B.2 Private Non-Convex Federated Learning Without a Trusted Server

B.2.1 Further Discussion of Related Work

DP Optimization with the Polyak-Łojasiewicz (PL) Condition: For unconstrained central DP optimization, [255, 141, 278] provide bounds for Lipschitz losses satisfying the classical PL inequality. However, the combined assumptions of Lipschitzness and PL on $\mathbb{R}^d$ (unconstrained) are very strong and rule out most interesting PL losses, such as strongly convex, least squares, and neural nets, since the Lipschitz parameter $L$ of such losses is infinite or prohibitively large.∗∗ We address this gap by considering the Proximal PL condition, which admits such interesting loss functions. There was no prior work on DP optimization (in particular, FL) with the Proximal PL condition.
DP Smooth Non-convex Distributed ERM: Non-convex federated ERM has been considered in previous works under stricter assumptions of smooth loss and (usually) a trusted server. [257] provide state-of-the-art CDP upper bounds for distributed ERM of order $\mathbb{E}\|\nabla \hat{F}(w_{\text{priv}}, X)\|^2 \lesssim \left( \frac{\sqrt{d}}{\varepsilon n N} \right)$ with perfect communication ($M = N$), relying on a trusted server (in conjunction with secure multi-party computation) to perturb the aggregated gradients. Similar bounds were attained by [203] for $M < N$ with a DP variation of SCAFFOLD [143]. In Theorem 45, we improve on this utility bound under the weaker trust model of shuffle DP (no trusted server) and with unreliable communication (i.e. arbitrary $M \in [N]$). We also improve over the state-of-the-art ISRL-DP bound of [203], in Theorem 44. A number of other works have also addressed private non-convex federated ERM (under various notions of DP), but have fallen short of the state-of-the-art utility and communication complexity bounds:
• The noisy FedAvg algorithm of [125] is not ISRL-DP for any $N > n$ since the variance of the Gaussian noise $\sigma^2 \approx T K L^2 \log(1/\delta) / n N \epsilon^2$ decreases as $N$ increases; moreover, for their prescribed stepsize $\eta = \frac{\sqrt{N}}{\sqrt{T}}$, the resulting rate (with $T = RK$) from [125, Theorem 2] is
\[
\mathbb{E}\|\nabla \hat{F}(\hat{w}_R)\|^2 = \tilde{O}\left( \frac{d \sqrt{NTK}}{\epsilon^2 n N} + \frac{N K^2}{T} + \frac{\sqrt{N}}{\sqrt{T}} + \frac{d K^2}{\epsilon^2 n} \right),
\]
which grows unbounded with $T$. Moreover, $T$ and $K$ are not specified in their work, so it is not clear what bound their algorithm is able to attain, or how many communication rounds are needed to attain it.

∗∗In particular, the DP ERM/SCO strongly convex, Lipschitz lower bounds of [32, 29] do not imply lower bounds for the unconstrained Lipschitz, PL function class considered in these works, since the quadratic hard instance of [32] is not $L$-Lipschitz on all of $\mathbb{R}^d$ for any $L < +\infty$.
• Theorems 3 and 7 of [76] provide ISRL-DP upper bounds on the empirical gradient norm which hold for sufficiently large $R \ge T_{\min}^{nc}$ for some unspecified $T_{\min}^{nc}$. The resulting upper bounds are bigger than $\frac{d \sigma^2}{R^{1/3}} \approx \frac{d R^{2/3}}{\epsilon^2 n^2}$. In particular, the bounds become trivial for large $R$ (diverge) and no utility bound expressed in terms of problem parameters (rather than unspecified design parameters $R$ or $T$) is provided. Also, no communication complexity bound is provided.
DP Smooth Non-convex Centralized ERM ($N = 1$): In the centralized setting with a single client and smooth loss function, several works [276, 255, 257, 16] have considered CDP (unconstrained) non-convex ERM (with gradient norm as the utility measure): the state-of-the-art bound is $\mathbb{E}\|\nabla \hat{F}_X(w_{\text{priv}})\|^2 = O\left( \frac{\sqrt{d \ln(1/\delta)}}{\varepsilon n} \right)^{4/3}$ [16]. Our private FedProx-SPIDER algorithms build on the DP SPIDER-Boost of [16], by parallelizing their updates for FL and incorporating proximal updates to cover non-smooth losses.
Non-private FL: In the absence of privacy constraints, there is a plethora of works studying the convergence of FL algorithms in both the convex [154, 167, 143, 264, 265, 272] and non-convex [166, 280, 143] settings. We do not attempt to provide a comprehensive survey of these works here; see [135] for such a survey. However, we briefly discuss some of the well-known non-convex FL works:

• The "FedProx" algorithm of [166] augments FedAvg [188] with a regularization term in order to decrease "client drift" in heterogeneous FL problems with smooth non-convex loss functions. (By comparison, we use a proximal term in our private algorithms to deal with non-smooth non-convex loss functions, and show that our algorithms effectively handle heterogeneous client data via careful analysis.)

• [280] provides primal-dual FL algorithms for non-convex loss functions that have optimal communication complexity (in a certain sense).

• The SCAFFOLD algorithm of [143] can be viewed as a hybrid between Local SGD (FedAvg) and MB-SGD, as [265] observed. Convergence guarantees for their algorithm with non-convex loss functions are provided.
B.2.2 Supplemental Material for Section 3.3.1: Proximal PL Loss Functions

B.2.2.1 Noisy Proximal Gradient Methods for Proximal PL FL (SO) - Pseudocodes
Algorithm 20 ISRL-DP Noisy Distributed Proximal Gradient Method
1: Input: $R \in \mathbb{N}$, $X_i \in \mathcal{X}^n$ ($i \in [N]$), $\sigma^2 \ge 0$, $K \le \frac{n}{R}$, $w_0 \in \mathbb{R}^d$.
2: for $r \in \{0, 1, \cdots, R-1\}$ do
3: for $i \in S_r$ in parallel do
4: Server sends global model $w_r$ to silo $i$.
5: Silo $i$ draws $\{x_{i,j}^r\}_{j=1}^K$ uniformly from $X_i$ (without replacement) and noise $u_i \sim \mathcal{N}(0, \sigma^2 \mathbf{I}_d)$.
6: Silo $i$ sends $\tilde{g}_i^r := \frac{1}{K} \sum_{j=1}^K \nabla f_0(w_r, x_{i,j}^r) + u_i$ to server.
7: end for
8: Server aggregates $\tilde{g}_r := \frac{1}{M_r} \sum_{i \in S_r} \tilde{g}_i^r$.
9: Server updates $w_{r+1} := \mathrm{prox}_{\frac{1}{2\beta} f_1}\left( w_r - \frac{1}{2\beta} \tilde{g}_r \right)$
10: end for
11: Output: $w_R$.
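For intuition about the proximal step in line 9: if, for example, the non-smooth part is $f_1(w) = \lambda \|w\|_1$, the prox reduces to coordinate-wise soft-thresholding. The following is a hypothetical sketch under that example choice of $f_1$ (this particular $f_1$, and all names, are ours).

```python
def soft_threshold(v, tau):
    """prox of tau * ||.||_1: shrink each coordinate toward 0 by tau."""
    return [max(abs(vj) - tau, 0.0) * (1.0 if vj > 0 else -1.0) for vj in v]

def prox_grad_step(w, g_noisy, beta, lam):
    """Server update in line 9 with f_1 = lam * ||.||_1:
    w_{r+1} = prox_{(1/(2 beta)) f_1}( w_r - (1/(2 beta)) g_r )."""
    step = 1.0 / (2.0 * beta)
    v = [wj - step * gj for wj, gj in zip(w, g_noisy)]  # noisy gradient step
    return soft_threshold(v, step * lam)                # proximal shrinkage
```

Other choices of $f_1$ (e.g. the indicator of a convex set, giving a projection) plug into the same template.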
Algorithm 21 SDP Noisy Distributed Proximal Gradient Method
1: Input: Number of rounds $R \in \mathbb{N}$, data sets $X_i \in \mathcal{X}^{n_i}$ for $i \in [N]$, loss function $f(w, x) = f_0(w, x) + f_1(w, x)$, privacy parameters $\epsilon, \delta$, local batch size $K \le \frac{n}{R}$, $w_0 \in \mathbb{R}^d$.
2: for $r \in \{0, 1, \cdots, R-1\}$ do
3: for $i \in S_r$ in parallel do
4: Server sends global model $w_r$ to silo $i$.
5: Silo $i$ draws $K$ samples $\{x_{i,j}^r\}_{j=1}^K$ uniformly from $X_i$ (without replacement) and computes $\{\nabla f_0(w_r, x_{i,j}^r)\}_{j \in [K]}$.
6: end for
7: Server updates $\tilde{g}_r := \frac{1}{M_r K} \mathcal{P}_{\text{vec}}\left( \{\nabla f_0(w_r, x_{i,j}^r)\}_{i \in S_r, j \in [K]}; \frac{N}{2M}\epsilon, \delta; L \right)$ and $w_{r+1} := \mathrm{prox}_{\frac{1}{2\beta} f_1}\left( w_r - \frac{1}{\beta} \tilde{g}_r \right)$
8: end for
9: Output: $w_R$.
B.2.2.2 Proof of Theorem 40: Heterogeneous FL (SO)

First we provide a precise re-statement of the result, in which we assume $M_r = M$ is fixed, for convenience:
Theorem 144 (Re-statement of Theorem 40). Grant Assumption 2, Assumption 3, and Assumption 4 for $K = \lfloor \frac{n}{R} \rfloor$ and $R$ specified below. Let $\epsilon \le \min\{8\ln(1/\delta), 15\}$, $\delta \in (0, 1/2)$, $n \ge \tilde{\Omega}(\kappa)$. There is a choice of $\sigma^2$ such that:

1. ISRL-DP Prox-SGD is $(\epsilon, \delta)$-ISRL-DP and
\[
\mathbb{E} F(w_R) - F^* = \tilde{O}\left( \frac{L^2}{\mu}\left( \frac{\kappa^2 d \ln(1/\delta)}{\epsilon^2 n^2 M} + \frac{\kappa}{Mn} \right) \right) \tag{B.33}
\]
in $R = \left\lceil 2\kappa \ln\left( \frac{\mu \Delta}{L^2} \min\left\{ Mn, \frac{\epsilon^2 n^2 M}{d \ln(1/\delta)} \right\} \right) \right\rceil$ communication rounds, where $\Delta \ge F(w_0) - F^*$.

2. SDP Prox-SGD is $(\epsilon, \delta)$-SDP for $M \ge N \min(\epsilon/2, 1)$, and
\[
\mathbb{E} F(w_R) - F^* = \tilde{O}\left( \frac{L^2}{\mu}\left( \frac{\kappa^2 d \ln^2(d/\delta)}{\epsilon^2 n^2 N^2} + \frac{\kappa}{Mn} \right) \right) \tag{B.34}
\]
in $R = \left\lceil 2\kappa \ln\left( \frac{\mu \Delta}{L^2} \min\left\{ Mn, \frac{\epsilon^2 n^2 N^2}{d \ln(1/\delta)} \right\} \right) \right\rceil$ communication rounds.
Proof. We prove part 1 first. Privacy: First, by independence of the Gaussian noise across silos, it is enough to show that the transcript of silo $i$'s interactions with the server is DP for all $i \in [N]$ (conditional on the transcripts of all other silos). Since the batches sampled by silo $i$ in each round are disjoint (as we sample without replacement), the parallel composition theorem of DP [191] implies that it suffices to show that each round is $(\epsilon, \delta)$-ISRL-DP. Then by post-processing [88], we just need to show that the noisy stochastic gradient $\tilde{g}_i^r$ in line 6 of the algorithm is $(\epsilon, \delta)$-DP. Now, the $\ell_2$ sensitivity of this stochastic gradient is bounded by $\Delta_2 := \sup_{|X_i \triangle X_i'| \le 2,\, w \in \mathcal{W}} \left\| \frac{1}{K} \sum_{j=1}^K \nabla f(w, x_{i,j}) - \nabla f(w, x_{i,j}') \right\| \le 2L/K$, by $L$-Lipschitzness of $f$. Hence Proposition 8 implies that $\tilde{g}_i^r$ in line 6 of the algorithm is $(\epsilon, \delta)$-DP for $\sigma^2 \ge \frac{8 L^2 \ln(1/\delta)}{\epsilon^2 K^2}$. Therefore, ISRL-DP Prox-SGD is $(\epsilon, \delta)$-ISRL-DP.
Excess loss: Denote the stochastic approximation of $F$ in round $r$ by $\hat{F}_r(w) := \frac{1}{MK} \sum_{i \in S_r} \sum_{j=1}^K f(w, x^r_{i,j})$, and $\bar{u}_r := \frac{1}{M} \sum_{i \in S_r} u_i \sim N\big(0, \frac{\sigma^2}{M} I_d\big)$. By $\beta$-smoothness, we have
$$\mathbb{E} F(w_{r+1}) = \mathbb{E}\big[ F^0(w_{r+1}) + f^1(w_r) + f^1(w_{r+1}) - f^1(w_r) \big]$$
$$\le \mathbb{E} F(w_r) + \mathbb{E}\Big[ \langle \nabla \hat{F}^0_r(w_r), w_{r+1} - w_r \rangle + \frac{\beta}{2}\|w_{r+1} - w_r\|^2 + f^1(w_{r+1}) - f^1(w_r) + \langle \bar{u}_r, w_{r+1} \rangle \Big] + \mathbb{E}\langle \nabla F^0(w_r) - \nabla \hat{F}^0_r(w_r),\, w_{r+1} - w_r \rangle - \mathbb{E}\langle \bar{u}_r, w_{r+1} \rangle \qquad (B.35)$$
$$\le \mathbb{E} F(w_r) + \mathbb{E}\Big[ \langle \nabla \hat{F}^0_r(w_r), w_{r+1} - w_r \rangle + \beta\|w_{r+1} - w_r\|^2 + f^1(w_{r+1}) - f^1(w_r) + \langle \bar{u}_r, w_{r+1} \rangle \Big] - \mathbb{E}\langle \bar{u}_r, w_{r+1} \rangle + \mathbb{E}\Big[ \frac{1}{2\beta}\big\| \nabla F^0(w_r) - \nabla \hat{F}^0_r(w_r) \big\|^2 \Big], \qquad (B.36)$$
where we used Young's inequality to bound
$$\mathbb{E}\langle \nabla F^0(w_r) - \nabla \hat{F}^0_r(w_r),\, w_{r+1} - w_r \rangle \le \underbrace{\mathbb{E}\Big[ \frac{1}{2\beta}\big\| \nabla F^0(w_r) - \nabla \hat{F}^0_r(w_r) \big\|^2 \Big]}_{(a)} + \mathbb{E}\Big[ \frac{\beta}{2}\|w_{r+1} - w_r\|^2 \Big] \qquad (B.37)$$
in the last line above. We bound (a) as follows:
$$\mathbb{E}\Big[ \frac{1}{2\beta}\big\| \nabla F^0(w_r) - \nabla \hat{F}^0_r(w_r) \big\|^2 \Big] = \frac{1}{2\beta}\mathbb{E}\Big\| \frac{1}{MK}\sum_{i \in S_r}\sum_{j=1}^K \nabla F^0(w_r) - \nabla f^0(w_r, x^r_{i,j}) \Big\|^2 \qquad (B.38)$$
$$= \frac{1}{2\beta M^2 K^2}\sum_{i \in S_r}\sum_{j=1}^K \mathbb{E}\big\| \nabla F^0(w_r) - \nabla f^0(w_r, x^r_{i,j}) \big\|^2 \qquad (B.39)$$
$$\le \frac{L^2}{\beta M K}, \qquad (B.40)$$
by independence of the data and $L$-Lipschitzness of $f^0$.
Next, we will bound $\mathbb{E}\big[ \langle \nabla \hat{F}^0_r(w_r), w_{r+1} - w_r \rangle + \beta\|w_{r+1} - w_r\|^2 + f^1(w_{r+1}) - f^1(w_r) + \langle \bar{u}_r, w_{r+1} \rangle \big]$. Denote
$$H^{\mathrm{priv}}_r(y) := \langle \nabla \hat{F}^0_r(w_r), y - w_r \rangle + \beta\|y - w_r\|^2 + f^1(y) - f^1(w_r) + \langle \bar{u}_r, y \rangle$$
and
$$H_r(y) := \langle \nabla \hat{F}^0_r(w_r), y - w_r \rangle + \beta\|y - w_r\|^2 + f^1(y) - f^1(w_r).$$
Note that $H_r$ and $H^{\mathrm{priv}}_r$ are $2\beta$-strongly convex. Denote the minimizers of these two functions by $y^*$ and $y^{\mathrm{priv}}_*$ respectively. Now, conditional on $w_r$, $S_r$, and $\bar{u}_r$, we claim that
$$H_r(y^{\mathrm{priv}}_*) - H_r(y^*) \le \frac{\|\bar{u}_r\|^2}{2\beta}. \qquad (B.41)$$
To prove B.41, we will need the following lemma:

Lemma 145 ([180]). Let $H(y), h(y)$ be convex functions on some convex closed set $\mathcal{Y} \subseteq \mathbb{R}^d$ and suppose that $H$ is $2\beta$-strongly convex. Assume further that $h$ is $L_h$-Lipschitz. Define $y_1 = \mathrm{argmin}_{y \in \mathcal{Y}} H(y)$ and $y_2 = \mathrm{argmin}_{y \in \mathcal{Y}}\,[H(y) + h(y)]$. Then $\|y_1 - y_2\| \le \frac{L_h}{2\beta}$.

We apply Lemma 145 with $H(y) := H_r(y)$, $h(y) := \langle \bar{u}_r, y \rangle$, $L_h = \|\bar{u}_r\|$, $y_1 = y^*$, and $y_2 = y^{\mathrm{priv}}_*$ to get
$$\|y^* - y^{\mathrm{priv}}_*\| \le \frac{\|\bar{u}_r\|}{2\beta}.$$
On the other hand,
$$H^{\mathrm{priv}}_r(y^{\mathrm{priv}}_*) = H_r(y^{\mathrm{priv}}_*) + \langle \bar{u}_r, y^{\mathrm{priv}}_* \rangle \le H^{\mathrm{priv}}_r(y^*) = H_r(y^*) + \langle \bar{u}_r, y^* \rangle.$$
Combining these two inequalities yields
$$H_r(y^{\mathrm{priv}}_*) - H_r(y^*) \le \langle \bar{u}_r,\, y^* - y^{\mathrm{priv}}_* \rangle \le \|\bar{u}_r\|\,\|y^* - y^{\mathrm{priv}}_*\| \le \frac{\|\bar{u}_r\|^2}{2\beta}, \qquad (B.42)$$
as claimed. Also, note that $w_{r+1} = y^{\mathrm{priv}}_*$. Further, by Assumption 4, we know
$$\mathbb{E} H_r(y^*) = \mathbb{E}\min_y \Big[ \langle \nabla \hat{F}^0_r(w_r), y - w_r \rangle + \beta\|y - w_r\|^2 + f^1(y) - f^1(w_r) \Big] \qquad (B.43)$$
$$\le -\frac{\mu}{2\beta}\mathbb{E}\big[ \hat{F}_r(w_r) - \min_w \hat{F}_r(w) \big] \le -\frac{\mu}{2\beta}\mathbb{E}\big[ F(w_r) - F^* \big]. \qquad (B.44)$$
Combining this with B.41, we get:
$$\mathbb{E}\Big[ \langle \nabla \hat{F}^0_r(w_r), w_{r+1} - w_r \rangle + \beta\|w_{r+1} - w_r\|^2 + f^1(w_{r+1}) - f^1(w_r) + \langle \bar{u}_r, w_{r+1} \rangle \Big] = \mathbb{E} H^{\mathrm{priv}}_r(y^{\mathrm{priv}}_*)$$
$$= \mathbb{E} H_r(y^{\mathrm{priv}}_*) + \mathbb{E}\langle \bar{u}_r, w_{r+1} \rangle \le \mathbb{E} H_r(y^*) + \frac{d\sigma^2}{\beta M} + \mathbb{E}\langle \bar{u}_r, w_{r+1} \rangle \le -\frac{\mu}{2\beta}\mathbb{E}\big[ F(w_r) - F^* \big] + \frac{d\sigma^2}{\beta M} + \mathbb{E}\langle \bar{u}_r, w_{r+1} \rangle.$$
Plugging the above bounds back into B.36, we obtain
$$\mathbb{E} F(w_{r+1}) \le \mathbb{E} F(w_r) - \frac{\mu}{2\beta}\mathbb{E}\big[ F(w_r) - F^* \big] + \frac{2d\sigma^2}{\beta M} + \frac{2L^2}{\beta M K}, \qquad (B.45)$$
whence
$$\mathbb{E}\big[ F(w_{r+1}) - F^* \big] \le \mathbb{E}\big[ F(w_r) - F^* \big]\Big( 1 - \frac{\mu}{2\beta} \Big) + \frac{2d\sigma^2}{\beta M} + \frac{2L^2}{\beta M K}. \qquad (B.46)$$
Using B.46 recursively and plugging in $\sigma^2$, we get
$$\mathbb{E}\big[ F(w_R) - F^* \big] \le \Delta\Big( 1 - \frac{\mu}{2\beta} \Big)^R + \frac{L^2}{\mu}\Big[ \frac{16 d \ln(1/\delta)}{\epsilon^2 K^2 M} + \frac{1}{MK} \Big]. \qquad (B.47)$$
Finally, plugging in $K$ and $R$, and observing that $\frac{1}{\ln\left( \frac{\beta}{\beta - \mu} \right)} \le \kappa$, we conclude
$$\mathbb{E} F(w_R) - F^* \lesssim \frac{L^2}{\mu}\Big[ \ln^2\Big( \frac{\mu\Delta}{L^2}\min\Big\{ Mn,\ \frac{\epsilon^2 n^2 M}{d} \Big\} \Big)\Big( \frac{\kappa^2 d \ln(1/\delta)}{\epsilon^2 n^2 M} + \frac{\kappa}{Mn} \Big) \Big].$$
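Recursions of the form $a_{r+1} \le (1-\rho)a_r + b$, as in B.46 (with $\rho = \mu/2\beta$), unroll to $a_R \le (1-\rho)^R a_0 + b/\rho$, which is exactly what B.47 exploits. A quick numeric sanity check of that unrolled bound, with illustrative values of our own choosing:

```python
def contract(a0, rho, b, R):
    # Unroll the recursion a_{r+1} = (1 - rho) * a_r + b for R rounds
    a = a0
    for _ in range(R):
        a = (1.0 - rho) * a + b
    return a

a_R = contract(a0=1.0, rho=0.1, b=0.01, R=200)
bound = (1.0 - 0.1) ** 200 * 1.0 + 0.01 / 0.1  # (1-rho)^R a_0 + b/rho
```

After enough rounds the iterate settles near the fixed point $b/\rho$, mirroring how the noise terms in B.47 are amplified by a factor of order $\beta/\mu = \kappa$.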
Next, we move to part 2.

Privacy: Since the batches used in each iteration are disjoint by our sampling (without replacement) strategy, the parallel composition theorem [191] implies that it is enough to show that each of the $R$ rounds is $(\epsilon,\delta)$-SDP. This follows immediately from Theorem 136 and privacy amplification by subsampling [248] (silos only): in each round, the network "selects" a uniformly random subset of $M_r = M$ silos out of $N$, and the shuffler executes a $(\frac{N}{2M}\epsilon, \delta)$-DP (by $L$-Lipschitzness of $f^0(\cdot, x)$ $\forall x \in \mathcal{X}$) algorithm $\mathcal{P}_{\mathrm{vec}}$ on the data of these $M$ silos (line 8), implying that each round is $(\epsilon,\delta)$-SDP.

Utility: The proof is very similar to the proof of part 1, except that the variance of the Gaussian noise $\frac{d\sigma^2}{M}$ is replaced by the variance of $\mathcal{P}_{\mathrm{vec}}$. Denoting
$$Z := \frac{1}{MK}\mathcal{P}_{\mathrm{vec}}\Big( \{\nabla f^0(w_r, x^r_{i,j})\}_{i \in S_r,\, j \in [K]};\ \tfrac{N}{2M}\epsilon,\ \delta \Big) - \frac{1}{MK}\sum_{i \in S_{r+1}}\sum_{j=1}^K \nabla f^0(w_r, x^r_{i,j}),$$
we have (by Theorem 136)
$$\mathbb{E}\|Z\|^2 = O\left( \frac{d L^2 \ln^2(d/\delta)}{M^2 K^2 \big( \frac{N}{2M}\epsilon \big)^2} \right) = O\left( \frac{d L^2 \ln^2(d/\delta)}{\epsilon^2 K^2 N^2} \right).$$
Also, $Z$ is independent of the data and gradients. Hence we can simply replace $\frac{d\sigma^2}{M}$ by $O\big( \frac{dL^2\ln^2(d/\delta)}{\epsilon^2 K^2 N^2} \big)$ and follow the same steps as the proof of Theorem 40. This yields (c.f. B.46)
$$\mathbb{E}\big[ F(w_{r+1}) - F^* \big] \le \mathbb{E}\big[ F(w_r) - F^* \big]\Big( 1 - \frac{\mu}{2\beta} \Big) + O\left( \frac{dL^2\ln^2(d/\delta)}{\epsilon^2 K^2 N^2} \right) + \frac{2L^2}{\beta M K}. \qquad (B.48)$$
Using B.48 recursively, we get
$$\mathbb{E}\big[ F(w_R) - F^* \big] \le \Delta\Big( 1 - \frac{\mu}{2\beta} \Big)^R + \frac{L^2}{\mu}\Big[ O\left( \frac{dL^2\ln^2(d/\delta)}{\epsilon^2 K^2 N^2} \right) + \frac{1}{MK} \Big]. \qquad (B.49)$$
Finally, plugging in $R$ and $K = n/R$, and observing that $\frac{1}{\ln\left( \frac{\beta}{\beta - \mu} \right)} \le \kappa$, we conclude
$$\mathbb{E} F(w_R) - F^* \lesssim \frac{L^2}{\mu}\Big[ \ln^2\Big( \frac{\mu\Delta}{L^2}\min\Big\{ Mn,\ \frac{\epsilon^2 n^2 N^2}{d} \Big\} \Big)\Big( \frac{\kappa^2 d \ln^2(d/\delta)}{\epsilon^2 n^2 N^2} + \frac{\kappa}{Mn} \Big) \Big].$$
B.2.2.3 SDP Noisy Distributed Prox-PL-SVRG Pseudocode

Our SDP Prox-SVRG algorithm is described in Algorithm 22.

Algorithm 22 SDP Prox-SVRG$(w_0, E, K, \eta, \epsilon, \delta)$
1: Input: Number of epochs $E \in \mathbb{N}$, local batch size $K \in [n]$, epoch length $Q = \lfloor n/K \rfloor$, data sets $X_i \in \mathcal{X}^n$, loss function $f(w,x) = f^0(w,x) + f^1(w)$, step size $\eta$, privacy parameters $\epsilon, \delta$, initial parameters $\bar{w}_0 = w^Q_0 = w_0 \in \mathbb{R}^d$; $\mathcal{P}_{\mathrm{vec}}$ privacy parameters $\tilde{\epsilon} := \frac{\epsilon n N}{8MK\sqrt{4EQ\ln(2/\delta)}}$ and $\tilde{\delta} := \frac{\delta}{2EQ}$.
2: for $r \in \{0, 1, \cdots, E-1\}$ do
3:   Server updates $w^0_{r+1} = w^Q_r$.
4:   for $i \in S_r$ in parallel do
5:     Server sends global model $w_r$ to silo $i$.
6:     silo $i$ computes $\{\nabla f^0(\bar{w}_r, x_{i,j})\}_{j=1}^n$.
7:     Server updates $\tilde{g}_{r+1} := \frac{1}{M_{r+1} n}\mathcal{P}_{\mathrm{vec}}\big( \{\nabla f^0(\bar{w}_r, x_{i,j})\}_{i \in S_{r+1},\, j \in [n]};\ \tilde{\epsilon}, \tilde{\delta};\ L \big)$.
8:     for $t \in \{0, 1, \cdots, Q-1\}$ do
9:       silo $i$ draws $\{x^{r+1,t}_{i,j}\}_{j=1}^K$ uniformly from $X_i$ with replacement, and computes $\{\nabla f^0(w^t_{r+1}, x^{r+1,t}_{i,j})\}_{j=1}^K$.
10:      Server updates $\tilde{p}^t_{r+1} := \frac{1}{M_{r+1} K}\mathcal{P}_{\mathrm{vec}}\big( \{\nabla f^0(w^t_{r+1}, x^{r+1,t}_{i,j}) - \nabla f^0(\bar{w}_r, x^{r+1,t}_{i,j})\}_{i \in S_{r+1},\, j \in [K]};\ \tilde{\epsilon}, \tilde{\delta};\ 2L \big)$
11:      Server updates $\tilde{v}^t_{r+1} := \tilde{p}^t_{r+1} + \tilde{g}_{r+1}$ and $w^{t+1}_{r+1} := \mathrm{prox}_{\eta f^1}\big(w^t_{r+1} - \eta\tilde{v}^t_{r+1}\big)$.
12:    end for
13:    Server updates $\bar{w}_{r+1} := w^Q_{r+1}$.
14:  end for
15: end for
16: Output: $w_{\mathrm{priv}} \sim \mathrm{Unif}\big( \{w^t_{r+1}\}_{r=0,1,\cdots,E-1;\ t=0,1,\cdots,Q-1} \big)$.

Algorithm 23 SDP Prox-PL-SVRG
1: for $s \in [S]$ do
2:   $w_s = $ SDP Prox-SVRG$\big( w_{s-1}, E, K, \eta, \frac{\epsilon}{2\sqrt{2S}}, \frac{\delta}{2S} \big)$.
3: end for
4: Output: $w_{\mathrm{priv}} := w_S$.
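The variance-reduced direction in line 11 combines the per-epoch anchor gradient $\tilde{g}_{r+1}$ with a minibatch correction, in the standard SVRG form $v = \nabla F(w; \mathrm{batch}) - \nabla F(\bar{w}; \mathrm{batch}) + \nabla F(\bar{w})$. A minimal non-private sketch on a toy quadratic loss (our own example; the shuffler and privacy noise are omitted):

```python
import numpy as np

def svrg_direction(w, w_anchor, X, batch_idx, grad_fn):
    # v = grad(w; batch) - grad(anchor; batch) + full-batch grad(anchor)
    gb = grad_fn(w, X[batch_idx]).mean(axis=0)
    ga = grad_fn(w_anchor, X[batch_idx]).mean(axis=0)
    g_full = grad_fn(w_anchor, X).mean(axis=0)
    return gb - ga + g_full

def sq_grad(w, X):
    # Per-sample gradient of 0.5 * ||w - x||^2, i.e. (w - x) for each row x
    return w[None, :] - X

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
w = np.ones(2)
v = svrg_direction(w, w_anchor=w, X=X, batch_idx=np.arange(5), grad_fn=sq_grad)
```

The key property (used implicitly in Lemma 147) is that the correction cancels at the anchor: when $w = \bar{w}$, the estimator equals the exact full-batch gradient, and its variance grows only with $\|w - \bar{w}\|^2$.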
B.2.2.4 Proof of Theorem 42: Federated ERM

For the precise/formal version of Theorem 42, we will need an additional notation: the heterogeneity parameter $\hat{\upsilon}^2_X$, which has appeared in other works on FL (e.g. [265]). Assume:

Assumption 7. $\frac{1}{N}\sum_{i=1}^N \big\| \nabla\hat{F}_i(w) - \nabla\hat{F}_X(w) \big\|^2 \le \hat{\upsilon}^2_X$ for all $w \in \mathcal{W}$.

Additionally, motivated by practical FL considerations (especially cross-device FL [135]), we shall actually prove a more general result, which holds even when the number of active silos in each round is random:

Assumption 8. In each round $r$, a uniformly random subset $S_r$ of $M_r \in [N]$ silos can communicate with the server, where $\{M_r\}_{r \ge 0}$ are independent with $\frac{1}{M} := \mathbb{E}\big( \frac{1}{M_r} \big)$.

Recall that we considered the same assumption in our analysis of convex FL.

We will require the following two lemmas for the proofs in this Appendix section:

Lemma 146 ([128]). Let $\hat{F}_X(w) = \hat{F}^0_X(w) + f^1(w)$, where $\hat{F}^0_X$ is $\beta$-smooth and $f^1$ is proper, closed, and convex. Let $y := \mathrm{prox}_{\eta f^1}(w - \eta d^1)$ for some $d^1 \in \mathbb{R}^d$. Then for all $z \in \mathbb{R}^d$, we have:
$$\hat{F}_X(y) \le \hat{F}_X(z) + \langle y - z,\, \nabla\hat{F}^0_X(w) - d^1 \rangle + \Big[ \frac{\beta}{2} - \frac{1}{2\eta} \Big]\|y - w\|^2 + \Big[ \frac{\beta}{2} + \frac{1}{2\eta} \Big]\|z - w\|^2 - \frac{1}{2\eta}\|y - z\|^2.$$
Lemma 147. For all $t \in \{0,1,\cdots,Q-1\}$ and $r \in \{0,1,\cdots,E-1\}$, the iterates of Algorithm 2 satisfy:
$$\mathbb{E}\|\nabla\hat{F}_X(w^t_{r+1}) - \tilde{v}^t_{r+1}\|^2 \le \frac{8\cdot 1_{\{MK < Nn\}}}{MK}\beta^2\,\mathbb{E}\|w^t_{r+1} - \bar{w}_r\|^2 + \frac{2(N-M)\hat{\upsilon}^2_X}{M(N-1)}1_{\{N>1\}} + \frac{d(\sigma^2_1 + \sigma^2_2)}{M}.$$
Moreover, the iterates of Algorithm 22 satisfy
$$\mathbb{E}\|\nabla\hat{F}_X(w^t_{r+1}) - \tilde{v}^t_{r+1}\|^2 \le \frac{8\cdot 1_{\{MK < Nn\}}}{MK}\beta^2\,\mathbb{E}\|w^t_{r+1} - \bar{w}_r\|^2 + \frac{2(N-M)\hat{\upsilon}^2_X}{M(N-1)}1_{\{N>1\}} + O\left( \frac{dL^2 R\ln^2(dR/\delta)\ln(1/\delta)}{\epsilon^2 n^2 N^2} \right),$$
where $R = EQ$.
Proof. We begin with the first claim (Algorithm 2). Denote
$$\zeta^t_{r+1} := \frac{1}{M_{r+1}K}\sum_{i \in S_{r+1}}\sum_{j=1}^K \underbrace{\big[ \nabla f^0(w^t_{r+1}, x^{r+1,t}_{i,j}) - \nabla f^0(\bar{w}_r, x^{r+1,t}_{i,j}) \big]}_{:= \zeta^{t,i,j}_{r+1}} = \tilde{v}^t_{r+1} - \tilde{g}_{r+1} - \bar{u}_2,$$
where $\tilde{g}_{r+1} := \frac{1}{M_{r+1}}\sum_{i \in S_{r+1}} \tilde{g}^i_{r+1} = \frac{1}{M_{r+1}}\sum_{i \in S_{r+1}} \nabla\hat{F}_i(\bar{w}_r) + \bar{u}_1$, and $\bar{u}_j = \frac{1}{M_{r+1}}\sum_{i \in S_{r+1}} u^i_j$ for $j = 1, 2$.
Note $\mathbb{E}\zeta^{t,i,j}_{r+1} = \nabla\hat{F}_i(w^t_{r+1}) - \nabla\hat{F}_i(\bar{w}_r)$. Then, conditional on all iterates through $w^t_{r+1}$ and $\bar{w}_r$, we have:
$$\mathbb{E}\big\| \nabla\hat{F}_X(w^t_{r+1}) - \tilde{v}^t_{r+1} \big\|^2 = \mathbb{E}\Big\| \frac{1}{M_{r+1}K}\sum_{i \in S_{r+1}}\sum_{j=1}^K \big[ \zeta^{t,i,j}_{r+1} + \tilde{g}^i_{r+1} - \nabla\hat{F}_X(w^t_{r+1}) \big] + \bar{u}_2 \Big\|^2 \qquad (B.50)$$
$$= \mathbb{E}\Big\| \frac{1}{M_{r+1}K}\sum_{i \in S_{r+1}}\sum_{j=1}^K \big[ \zeta^{t,i,j}_{r+1} + \nabla\hat{F}_i(\bar{w}_r) + u^i_1 - \nabla\hat{F}_X(w^t_{r+1}) \big] + \bar{u}_2 \Big\|^2 \qquad (B.51)$$
$$= \frac{d(\sigma^2_1 + \sigma^2_2)}{M} + \underbrace{\mathbb{E}\Big\| \frac{1}{M_{r+1}K}\sum_{i \in S_{r+1}}\sum_{j=1}^K \big[ \zeta^{t,i,j}_{r+1} + \nabla\hat{F}_i(\bar{w}_r) - \nabla\hat{F}_X(w^t_{r+1}) \big] \Big\|^2}_{:= (a)}, \qquad (B.52)$$
by independence of the Gaussian noise and the gradients. Now,
$$(a) = \mathbb{E}\Big\| \frac{1}{M_{r+1}K}\sum_{i \in S_{r+1}}\sum_{j=1}^K \Big\{ \big[ \zeta^{t,i,j}_{r+1} - \mathbb{E}\zeta^{t,i,j}_{r+1} \big] + \nabla\hat{F}_i(w^t_{r+1}) - \nabla\hat{F}_X(w^t_{r+1}) \Big\} \Big\|^2 \qquad (B.53)$$
$$\le 2\,\mathbb{E}\Big\| \frac{1}{M_{r+1}K}\sum_{i \in S_{r+1}}\sum_{j=1}^K \zeta^{t,i,j}_{r+1} - \mathbb{E}\zeta^{t,i,j}_{r+1} \Big\|^2 + 2\,\mathbb{E}\Big\| \frac{1}{M_{r+1}K}\sum_{i \in S_{r+1}}\sum_{j=1}^K \nabla\hat{F}_i(w^t_{r+1}) - \nabla\hat{F}_X(w^t_{r+1}) \Big\|^2. \qquad (B.54)$$
We bound the first term (conditional on $M_{r+1}$ and all iterates through round $r$) in B.54 using Lemma 130:
$$\mathbb{E}\Big\| \frac{1}{M_{r+1}K}\sum_{i \in S_{r+1}}\sum_{j=1}^K \zeta^{t,i,j}_{r+1} - \mathbb{E}\zeta^{t,i,j}_{r+1} \Big\|^2 \le \frac{1_{\{M_{r+1}K < Nn\}}}{M_{r+1}KNn}\sum_{i=1}^N\sum_{j=1}^n \mathbb{E}\big\| \zeta^{t,i,j}_{r+1} - \mathbb{E}\zeta^{t,i,j}_{r+1} \big\|^2$$
$$\le \frac{1_{\{M_{r+1}K < Nn\}}}{M_{r+1}KNn}\sum_{i=1}^N\sum_{j=1}^n 2\,\mathbb{E}\Big[ \big\| \nabla f^0(w^t_{r+1}, x^{r+1,t}_{i,j}) - \nabla f^0(\bar{w}_r, x^{r+1,t}_{i,j}) \big\|^2 \Big] + \mathbb{E}\big\| \nabla\hat{F}_X(w^t_{r+1}) - \nabla\hat{F}_X(\bar{w}_r) \big\|^2$$
$$\le \frac{4\cdot 1_{\{M_{r+1}K < Nn\}}}{M_{r+1}K}\beta^2\|w^t_{r+1} - \bar{w}_r\|^2,$$
where we used Cauchy–Schwarz and $\beta$-smoothness in the second and third inequalities. Now if $M = N$, then $M_{r+1} = N$ (with probability 1) and taking expectation with respect to $M_{r+1}$ (conditional on the $w$'s) bounds the left-hand side by $\frac{4\cdot 1_{\{K < n\}}}{MK}\beta^2\|w^t_{r+1} - \bar{w}_r\|^2 = \frac{4\cdot 1_{\{MK < Nn\}}}{MK}\beta^2\|w^t_{r+1} - \bar{w}_r\|^2$, via Assumption 8. On the other hand, if $M < N$, then taking expectation with respect to $M_{r+1}$ (conditional on the $w$'s) bounds the left-hand side by $\frac{4}{MK}\beta^2\|w^t_{r+1} - \bar{w}_r\|^2 = \frac{4\cdot 1_{\{MK < Nn\}}}{MK}\beta^2\|w^t_{r+1} - \bar{w}_r\|^2$ (since the indicator is always equal to 1 if $M < N$). In either case, taking total expectation with respect to $\bar{w}_r, w^t_{r+1}$ yields
$$\mathbb{E}\Big\| \frac{1}{M_{r+1}K}\sum_{i \in S_{r+1}}\sum_{j=1}^K \zeta^{t,i,j}_{r+1} - \mathbb{E}\zeta^{t,i,j}_{r+1} \Big\|^2 \le \frac{4\cdot 1_{\{MK < Nn\}}}{MK}\beta^2\,\mathbb{E}\|w^t_{r+1} - \bar{w}_r\|^2.$$
We can again invoke Lemma 130 to bound (conditional on $M_{r+1}$ and $w^t_{r+1}$) the second term in B.54:
$$\mathbb{E}\Big\| \frac{1}{M_{r+1}K}\sum_{i \in S_{r+1}}\sum_{j=1}^K \nabla\hat{F}_i(w^t_{r+1}) - \nabla\hat{F}_X(w^t_{r+1}) \Big\|^2 \le 1_{\{N>1\}}\frac{N - M_{r+1}}{(N-1)M_{r+1}}\cdot\frac{1}{N}\sum_{i=1}^N \big\| \nabla\hat{F}_i(w^t_{r+1}) - \nabla\hat{F}_X(w^t_{r+1}) \big\|^2 \le 1_{\{N>1\}}\frac{N - M_{r+1}}{(N-1)M_{r+1}}\hat{\upsilon}^2_X.$$
Taking total expectation and combining the above pieces completes the proof of the first claim.
The second claim is very similar, except that the Gaussian noise terms $\bar{u}_1$ and $\bar{u}_2$ get replaced by the respective noises due to $\mathcal{P}_{\mathrm{vec}}$:
$$Z_1 := \frac{1}{Mn}\mathcal{P}_{\mathrm{vec}}\big( \{\nabla f^0(\bar{w}_r, x_{i,j})\}_{i \in S_{r+1},\, j \in [n]};\ \tilde{\epsilon}, \tilde{\delta} \big) - \frac{1}{M}\sum_{i \in S_{r+1}} \nabla\hat{F}^0_i(\bar{w}_r)$$
and
$$Z_2 := \frac{1}{MK}\Big[ \mathcal{P}_{\mathrm{vec}}\big( \{\nabla f^0(w^t_{r+1}, x^{r+1,t}_{i,j}) - \nabla f^0(\bar{w}_r, x^{r+1,t}_{i,j})\}_{i \in S_{r+1},\, j \in [K]};\ \tilde{\epsilon}, \tilde{\delta} \big) - \sum_{i \in S_{r+1}}\sum_{j=1}^K \big( \nabla f^0(w^t_{r+1}, x^{r+1,t}_{i,j}) - \nabla f^0(\bar{w}_r, x^{r+1,t}_{i,j}) \big) \Big].$$
By Theorem 136, we have
$$\mathbb{E}\|Z_1\|^2 = O\left( \frac{dL^2\ln^2(d/\tilde{\delta})}{M^2 n^2 \tilde{\epsilon}^2} \right) = O\left( \frac{dL^2 R\ln^2(dR/\delta)\ln(1/\delta)}{\epsilon^2 n^2 N^2} \right)$$
and
$$\mathbb{E}\|Z_2\|^2 = O\left( \frac{dL^2\ln^2(d/\tilde{\delta})}{M^2 K^2 \tilde{\epsilon}^2} \right) = O\left( \frac{dL^2 R\ln^2(dR/\delta)\ln(1/\delta)}{\epsilon^2 n^2 N^2} \right).$$
Below we provide a precise re-statement of Theorem 42 for $M < N$, including choices of algorithmic parameters.

Theorem 148 (Complete Statement of Theorem 42). Assume $\epsilon \le \min\{2\ln(2/\delta), 15\}$ and let $R := EQ$. Suppose $\hat{F}_X$ is $\mu$-PPL and grant Assumption 2, Assumption 3, Assumption 8, and Assumption 7. Then:

1. Algorithm 3 is $(\epsilon,\delta)$-ISRL-DP if $\sigma^2_1 = \frac{256L^2 SE\log(2/\delta)\log(5E/\delta)}{\epsilon^2 n^2}$, $\sigma^2_2 = \frac{1024L^2 SR\log(2/\delta)\log(2.5R/\delta)}{\epsilon^2 n^2}$, and $K \ge \frac{\epsilon n}{4\sqrt{2SR\ln(2/\delta)}}$. Further, if $K \ge \big( \frac{n^2}{M} \big)^{1/3}$, $R = 12\kappa$, and $S \ge \log_2\big( \frac{\hat{\Delta}_X \mu M \epsilon^2 n^2}{\kappa d L^2} \big)$, then there is $\eta$ such that $\forall X \in \mathcal{X}$,
$$\mathbb{E}\hat{F}_X(w_S) - \hat{F}^*_X = \tilde{O}\left( \frac{\kappa L^2 d\ln(1/\delta)}{\mu\epsilon^2 n^2 M} + \frac{(N-M)\hat{\upsilon}^2_X}{M(N-1)}1_{\{N>1\}} \right)$$
in $\tilde{O}(\kappa)$ communications.

2. Algorithm 22 is $(\epsilon,\delta)$-SDP, provided $M_{r+1} = M \ge \min\Big\{ \frac{(\epsilon n L)^{3/4}}{(d\ln^3(d/\delta))^{3/8}\, n^{1/4}\, (\beta\hat{\Delta}_X)^{3/8}},\ N \Big\}$ for all $r$. Further, if $K \ge \big( \frac{n^2}{M} \big)^{1/3}$, $R = 12\kappa$, and $S \ge \log_2\big( \frac{\hat{\Delta}_X \mu\epsilon^2 N^2 n^2}{\kappa d L^2} \big)$, then there is $\eta$ such that $\forall X \in \mathcal{X}$,
$$\mathbb{E}\hat{F}_X(w_S) - \hat{F}^*_X = \tilde{O}\left( \frac{\kappa L^2 d\ln(1/\delta)}{\mu\epsilon^2 n^2 N^2} + \frac{(N-M)\hat{\upsilon}^2_X}{\mu M(N-1)}1_{\{N>1\}} \right).$$
Proof. 1. Privacy: For simplicity, assume $S = 1$. It will be clear from the proof (and advanced composition of DP [88] or the moments accountant [1]) that the privacy guarantee holds for all $S$, due to the factor of $S$ appearing in the numerators of $\sigma^2_1$ and $\sigma^2_2$. Then by independence of the Gaussian noise across silos, it is enough to show that the transcript of silo $i$'s interactions with the server is DP for all $i \in [N]$ (conditional on the transcripts of all other silos). Further, by the post-processing property of DP, it suffices to show that all $E-1$ computations of $\tilde{g}^i_{r+1}$ (line 7) are $(\epsilon/2, \delta/2)$-ISRL-DP and all $R = EQ$ computations of $\tilde{v}^{t,i}_{r+1}$ (line 10) by silo $i$ (for $r \in \{0,1,\cdots,E-1\}$, $t \in \{0,1,\cdots,Q-1\}$) are $(\epsilon/2, \delta/2)$-ISRL-DP. Now, by the advanced composition theorem (see Theorem 3.20 in [88]), it suffices to show that: 1) each of the $E$ computations of $\tilde{g}^i_{r+1}$ (line 7) is $(\tilde{\epsilon}_1/2, \tilde{\delta}_1/2)$-ISRL-DP, where $\tilde{\epsilon}_1 = \frac{\epsilon}{2\sqrt{2E\ln(2/\delta)}}$ and $\tilde{\delta}_1 = \frac{\delta}{2E}$; and 2) each of the $R = EQ$ computations of $\tilde{v}^{t,i}_{r+1}$ (line 10) is $(\tilde{\epsilon}_2/2, \tilde{\delta}_2/2)$-ISRL-DP, where $\tilde{\epsilon}_2 = \frac{\epsilon}{2\sqrt{2R\ln(2/\delta)}}$ and $\tilde{\delta}_2 = \frac{\delta}{2R}$.

We first show 1): The $\ell_2$ sensitivity of the (noiseless versions of the) gradient evaluations in line 7 is bounded by
$$\Delta^{(1)}_2 := \sup_{|X_i \triangle X'_i| \le 2,\ w \in \mathcal{W}} \Big\| \frac{1}{n}\sum_{j=1}^n \nabla f^0(w, x_{i,j}) - \nabla f^0(w, x'_{i,j}) \Big\| \le 2L/n,$$
by $L$-Lipschitzness of $f^0$. Here $\mathcal{W}$ denotes the constraint set if the problem is constrained (i.e. $f^1 = \iota_{\mathcal{W}} + h$ for closed convex $h$); and $\mathcal{W} = \mathbb{R}^d$ if the problem is unconstrained. Hence the privacy guarantee of the Gaussian mechanism implies that taking $\sigma^2_1 \ge \frac{8L^2\ln(1.25/(\tilde{\delta}_1/2))}{(\tilde{\epsilon}_1/2)^2 n^2} = \frac{256L^2 E\ln(2/\delta)\ln(5E/\delta)}{\epsilon^2 n^2}$ suffices to ensure that each update in line 7 is $(\tilde{\epsilon}_1/2, \tilde{\delta}_1/2)$-ISRL-DP.

Now we establish 2): First, condition on the randomness due to local sampling of each local data point $x^{r+1,t}_{i,j}$ (line 9). Now, the $\ell_2$ sensitivity of the (noiseless version of the) stochastic minibatch gradient (ignoring the already-private $\tilde{g}^i_{r+1}$) in line 10 is bounded by
$$\Delta^{(2)}_2 := \sup_{|X_i \triangle X'_i| \le 2,\ w,w' \in \mathcal{W}} \Big\| \frac{1}{K}\sum_{j=1}^K \nabla f^0(w, x_{i,j}) - \nabla f^0(w, x'_{i,j}) - \nabla f^0(w', x_{i,j}) + \nabla f^0(w', x'_{i,j}) \Big\| \le 2\sup_{|X_i \triangle X'_i| \le 2,\ w \in \mathcal{W}} \Big\| \frac{1}{K}\sum_{j=1}^K \nabla f^0(w, x_{i,j}) - \nabla f^0(w, x'_{i,j}) \Big\| \le 4L/K,$$
by $L$-Lipschitzness of $f^0$; $\mathcal{W}$ is as defined above. Thus, the standard privacy guarantee of the Gaussian mechanism (Theorem A.1 in [88]) implies that (conditional on the randomness due to sampling) taking $\sigma^2_2 \ge \frac{8L^2\ln(1.25/(\tilde{\delta}_2/2))}{(\tilde{\epsilon}_2/2)^2 K^2} = \frac{32L^2\ln(2.5/\tilde{\delta}_2)}{\tilde{\epsilon}_2^2 K^2}$ suffices to ensure that each such update is $(\tilde{\epsilon}_2/2, \tilde{\delta}_2/2)$-ISRL-DP.

Now we invoke the randomness due to sampling: [248] implies that round $r$ (in isolation) is $(\frac{2\tilde{\epsilon}_2 K}{n}, \tilde{\delta}_2)$-ISRL-DP. The assumption on $K$ ensures that $\epsilon' := \frac{n}{2K}\cdot\frac{\epsilon}{2\sqrt{2R\ln(2/\delta)}} \le 1$, so that the privacy guarantees of the Gaussian mechanism and amplification by subsampling stated above indeed hold. Therefore, with sampling, it suffices to take $\sigma^2_2 \ge \frac{128L^2\ln(2.5/\tilde{\delta}_2)}{n^2\tilde{\epsilon}_2^2} = \frac{1024L^2 R\ln(5R/\delta)\ln(2/\delta)}{n^2\epsilon^2}$ to ensure that all of the $R$ updates made in line 10 are $(\epsilon/2, \delta/2)$-ISRL-DP (for every client). Combining this with the above implies that the full algorithm is $(\epsilon,\delta)$-ISRL-DP.
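The per-round budgets $\tilde{\epsilon}_1, \tilde{\epsilon}_2$ above come from inverting the advanced composition bound: $k$ mechanisms, each $(\epsilon_0, \delta_0)$-DP, compose to roughly $(\epsilon_0\sqrt{2k\ln(1/\delta')},\, k\delta_0 + \delta')$-DP, so a total budget $\epsilon$ is split as $\epsilon_0 \approx \epsilon/\sqrt{2k\ln(2/\delta)}$. A toy numeric round-trip of that arithmetic (helper functions are ours, not thesis code):

```python
import math

def per_round_eps(eps_total, k, delta):
    # Invert advanced composition: eps_0 = eps_total / sqrt(2 k ln(2/delta))
    return eps_total / math.sqrt(2 * k * math.log(2.0 / delta))

def composed_eps(eps0, k, delta_prime):
    # Leading term of the advanced composition theorem (small-eps0 regime)
    return eps0 * math.sqrt(2 * k * math.log(1.0 / delta_prime))

e0 = per_round_eps(eps_total=1.0, k=100, delta=1e-5)
```

Composing the per-round budget back over $k = 100$ rounds (with $\delta' = \delta/2$) recovers the total budget, which is exactly how $\tilde{\epsilon}_1$ and $\tilde{\epsilon}_2$ are chosen in the proof.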
Utility: For our analysis, it will be useful to denote the full-batch gradient update $\hat{w}^{t+1}_{r+1} := \mathrm{prox}_{\eta f^1}\big[ w^t_{r+1} - \eta\nabla\hat{F}^0_X(w^t_{r+1}) \big]$. Fix any database $X \in \mathcal{X}$. Also, for $\alpha > 0$ and $w \in \mathbb{R}^d$ denote
$$\mathcal{D}_{f^1}(w, \alpha) := -2\alpha\min_{y \in \mathbb{R}^d}\Big[ \langle \nabla\hat{F}^0_X(w), y - w \rangle + \frac{\alpha}{2}\|y - w\|^2 + f^1(y) - f^1(w) \Big].$$
Set $\eta := \frac{1}{8\beta}\min\Big( 1,\ \frac{K^{3/2}\sqrt{M}}{n} \Big)$. Then we claim
$$\frac{\beta}{2} + c_{t+1}\Big( 1 + \frac{n}{K} \Big) \le \frac{1}{2\eta} \qquad (B.55)$$
for all $t \in \{0,1,\cdots,Q-1\}$, where the constants $c_t$ are defined below. First, if $MK = Nn$, then $c_t = c_{t+1}(1+q) \le 2c_{t+1} \le \cdots \le 2^{Q-t}c_Q = 0$ since $c_Q = 0$. Next, suppose $MK < Nn$. Denote $q := \frac{K}{n}$. Then by unraveling the recursion, we get for all $t \in \{0,\cdots,Q-1\}$ that
$$c_t = c_{t+1}(1+q) + \frac{4\eta\beta^2}{MK} = \frac{4\eta\beta^2}{MK}\big[ (1+q)^{Q-t-1} + \cdots + (1+q)^2 + (1+q) + 1 \big] = \frac{4\eta\beta^2}{MK}\Big( \frac{(1+q)^{Q-t} - 1}{q} \Big) \le \frac{4\eta\beta^2 n}{MK^2}\Big( \Big( 1 + \frac{K}{n} \Big)^{n/K} - 1 \Big) \le \frac{8\eta\beta^2 n}{MK^2}.$$
Then it is easy to check that, with the prescribed choice of $\eta$, B.55 holds.
Now, by Lemma 146 (with $w = z = w^t_{r+1}$ and $d^1 = \nabla\hat{F}^0_X(w)$), we have
$$\hat{F}_X(\hat{w}^{t+1}_{r+1}) \le \hat{F}_X(w^t_{r+1}) + \Big( \frac{\beta}{2} - \frac{1}{2\eta} \Big)\|\hat{w}^{t+1}_{r+1} - w^t_{r+1}\|^2 - \frac{1}{2\eta}\|\hat{w}^{t+1}_{r+1} - w^t_{r+1}\|^2,$$
which implies
$$\mathbb{E}\hat{F}_X(\hat{w}^{t+1}_{r+1}) \le \mathbb{E}\hat{F}_X(w^t_{r+1}) + \Big( \frac{\beta}{2} - \frac{1}{\eta} \Big)\mathbb{E}\|\hat{w}^{t+1}_{r+1} - w^t_{r+1}\|^2. \qquad (B.56)$$
Recall $w^{t+1}_{r+1} = \mathrm{prox}_{\eta f^1}(w^t_{r+1} - \eta\tilde{v}^t_{r+1})$. Applying Lemma 146 again (with $y = w^{t+1}_{r+1}$, $z = \hat{w}^{t+1}_{r+1}$, $d^1 = \tilde{v}^t_{r+1}$, $w = w^t_{r+1}$) yields
$$\hat{F}_X(w^{t+1}_{r+1}) \le \hat{F}_X(\hat{w}^{t+1}_{r+1}) + \langle w^{t+1}_{r+1} - \hat{w}^{t+1}_{r+1},\, \nabla\hat{F}^0_X(w^t_{r+1}) - \tilde{v}^t_{r+1} \rangle \qquad (B.57)$$
$$+ \Big( \frac{\beta}{2} - \frac{1}{2\eta} \Big)\|w^{t+1}_{r+1} - w^t_{r+1}\|^2 + \Big( \frac{\beta}{2} + \frac{1}{2\eta} \Big)\|\hat{w}^{t+1}_{r+1} - w^t_{r+1}\|^2 - \frac{1}{2\eta}\|w^{t+1}_{r+1} - \hat{w}^{t+1}_{r+1}\|^2. \qquad (B.58)$$
Further, by $\beta$-smoothness of $\hat{F}^0_X$, we have:
$$\hat{F}_X(\hat{w}^{t+1}_{r+1}) \le \hat{F}^0_X(w^t_{r+1}) + f^1(w^t_{r+1}) + \langle \nabla\hat{F}^0_X(w^t_{r+1}),\, \hat{w}^{t+1}_{r+1} - w^t_{r+1} \rangle + \frac{\beta}{2}\|\hat{w}^{t+1}_{r+1} - w^t_{r+1}\|^2 + f^1(\hat{w}^{t+1}_{r+1}) - f^1(w^t_{r+1})$$
$$\le \hat{F}_X(w^t_{r+1}) + \langle \nabla\hat{F}^0_X(w^t_{r+1}),\, \hat{w}^{t+1}_{r+1} - w^t_{r+1} \rangle + \frac{1}{2\eta}\|\hat{w}^{t+1}_{r+1} - w^t_{r+1}\|^2 + f^1(\hat{w}^{t+1}_{r+1}) - f^1(w^t_{r+1})$$
$$= \hat{F}_X(w^t_{r+1}) - \frac{\eta}{2}\mathcal{D}_{f^1}\Big( w^t_{r+1},\, \frac{1}{\eta} \Big)$$
$$\le \hat{F}_X(w^t_{r+1}) - \frac{\eta}{2}\mathcal{D}_{f^1}(w^t_{r+1}, \beta)$$
$$\le \hat{F}_X(w^t_{r+1}) - \eta\mu\big[ \hat{F}_X(w^t_{r+1}) - \hat{F}^*_X \big], \qquad (B.59)$$
where the second inequality used $\eta \le 1/\beta$, the third inequality used the Proximal-PL lemma (Lemma 1 in [142]), and the last inequality used the assumption that $\hat{F}_X$ satisfies the Proximal-PL inequality.
Now adding $\frac{2}{3}\times$ B.56 to $\frac{1}{3}\times$ B.59 and taking expectation gives
$$\mathbb{E}\hat{F}_X(\hat{w}^{t+1}_{r+1}) \le \mathbb{E}\Big[ \hat{F}_X(w^t_{r+1}) + \frac{2}{3}\Big( \frac{\beta}{2} - \frac{1}{\eta} \Big)\|\hat{w}^{t+1}_{r+1} - w^t_{r+1}\|^2 - \frac{\eta\mu}{3}\big( \hat{F}_X(w^t_{r+1}) - \hat{F}^*_X \big) \Big]. \qquad (B.60)$$
Adding B.60 to B.57 yields
$$\mathbb{E}\hat{F}_X(w^{t+1}_{r+1}) \le \mathbb{E}\Big[ \hat{F}_X(w^t_{r+1}) + \Big( \frac{5\beta}{6} - \frac{1}{6\eta} \Big)\|\hat{w}^{t+1}_{r+1} - w^t_{r+1}\|^2 - \frac{\eta\mu}{3}\big( \hat{F}_X(w^t_{r+1}) - \hat{F}^*_X \big) + \langle w^{t+1}_{r+1} - \hat{w}^{t+1}_{r+1},\, \nabla\hat{F}^0_X(w^t_{r+1}) - \tilde{v}^t_{r+1} \rangle + \Big( \frac{\beta}{2} - \frac{1}{2\eta} \Big)\|w^{t+1}_{r+1} - w^t_{r+1}\|^2 - \frac{1}{2\eta}\|w^{t+1}_{r+1} - \hat{w}^{t+1}_{r+1}\|^2 \Big]. \qquad (B.61)$$
Since $\eta \le \frac{1}{5\beta}$, Young's inequality implies
$$\mathbb{E}\hat{F}_X(w^{t+1}_{r+1}) \le \mathbb{E}\Big[ \hat{F}_X(w^t_{r+1}) + \Big( \frac{\beta}{2} - \frac{1}{2\eta} \Big)\|w^{t+1}_{r+1} - w^t_{r+1}\|^2 - \frac{\eta\mu}{3}\big( \hat{F}_X(w^t_{r+1}) - \hat{F}^*_X \big) + \frac{\eta}{2}\big\| \nabla\hat{F}^0_X(w^t_{r+1}) - \tilde{v}^t_{r+1} \big\|^2 \Big]$$
$$\le \mathbb{E}\Big[ \hat{F}_X(w^t_{r+1}) + \Big( \frac{\beta}{2} - \frac{1}{2\eta} \Big)\|w^{t+1}_{r+1} - w^t_{r+1}\|^2 - \frac{\eta\mu}{3}\big( \hat{F}_X(w^t_{r+1}) - \hat{F}^*_X \big) + \frac{4\eta\cdot 1_{\{MK<Nn\}}}{MK}\beta^2\|w^t_{r+1} - \bar{w}_r\|^2 + \frac{\eta(N-M)\hat{\upsilon}^2_X}{M(N-1)}1_{\{N>1\}} + \frac{\eta d(\sigma^2_1 + \sigma^2_2)}{2M} \Big], \qquad (B.62)$$
where we used Lemma 147 to get the second inequality. Now, denote $\gamma^t_{r+1} := \mathbb{E}\big[ \hat{F}_X(w^t_{r+1}) + c_t\|w^t_{r+1} - \bar{w}_r\|^2 \big]$, $c_t := c_{t+1}\big( 1 + \frac{K}{n} \big) + \frac{4\eta\cdot 1_{\{MK<Nn\}}}{MK}\beta^2$ for $t = 0, \cdots, Q-1$, and $c_Q := 0$. Then B.62 is equivalent to
$$\gamma^{t+1}_{r+1} \le \mathbb{E}\Big[ \hat{F}_X(w^t_{r+1}) + \Big( \frac{\beta}{2} - \frac{1}{2\eta} \Big)\|w^{t+1}_{r+1} - w^t_{r+1}\|^2 - \frac{\eta\mu}{3}\big( \hat{F}_X(w^t_{r+1}) - \hat{F}^*_X \big) + \frac{4\eta\cdot 1_{\{MK<Nn\}}}{MK}\beta^2\|w^t_{r+1} - \bar{w}_r\|^2 + \frac{\eta(N-M)\hat{\upsilon}^2_X}{M(N-1)}1_{\{N>1\}} + \frac{\eta d(\sigma^2_1 + \sigma^2_2)}{2M} + c_{t+1}\|w^{t+1}_{r+1} - \bar{w}_r\|^2 \Big]$$
$$\le \mathbb{E}\Big[ \hat{F}_X(w^t_{r+1}) + \Big( \frac{\beta}{2} - \frac{1}{2\eta} + c_{t+1}\Big( 1 + \frac{1}{q} \Big) \Big)\|w^{t+1}_{r+1} - w^t_{r+1}\|^2 - \frac{\eta\mu}{3}\big( \hat{F}_X(w^t_{r+1}) - \hat{F}^*_X \big) + \Big( \frac{4\eta\cdot 1_{\{MK<Nn\}}}{MK}\beta^2 + c_{t+1}(1+q) \Big)\|w^t_{r+1} - \bar{w}_r\|^2 + \frac{\eta(N-M)\hat{\upsilon}^2_X}{M(N-1)}1_{\{N>1\}} + \frac{\eta d(\sigma^2_1 + \sigma^2_2)}{2M} \Big], \qquad (B.63)$$
where $q := \frac{K}{n}$ and we used Young's inequality (after expanding the square, to bound $\|w^{t+1}_{r+1} - \bar{w}_r\|^2$) in the second inequality above. Now, applying B.55 yields
$$\gamma^{t+1}_{r+1} \le \mathbb{E}\Big[ \hat{F}_X(w^t_{r+1}) - \frac{\eta\mu}{3}\big( \hat{F}_X(w^t_{r+1}) - \hat{F}^*_X \big) + \Big( \frac{4\eta\cdot 1_{\{MK<Nn\}}}{MK}\beta^2 + c_{t+1}(1+q) \Big)\|w^t_{r+1} - \bar{w}_r\|^2 + \frac{\eta(N-M)\hat{\upsilon}^2_X}{M(N-1)}1_{\{N>1\}} + \frac{\eta d(\sigma^2_1 + \sigma^2_2)}{2M} \Big]$$
$$= \gamma^t_{r+1} - \frac{\eta\mu}{3}\mathbb{E}\big( \hat{F}_X(w^t_{r+1}) - \hat{F}^*_X \big) + \frac{\eta(N-M)\hat{\upsilon}^2_X}{M(N-1)}1_{\{N>1\}} + \frac{\eta d(\sigma^2_1 + \sigma^2_2)}{2M}. \qquad (B.64)$$
Summing up, we get
$$\mathbb{E}\big[ \hat{F}_X(\bar{w}_{r+1}) - \hat{F}_X(\bar{w}_r) \big] = \sum_{t=0}^{Q-1}\big( \gamma^{t+1}_{r+1} - \gamma^t_{r+1} \big) \le -\frac{\eta\mu}{3}\sum_{t=0}^{Q-1}\mathbb{E}\big[ \hat{F}_X(w^t_{r+1}) - \hat{F}^*_X \big] + \frac{\eta Q(N-M)\hat{\upsilon}^2_X}{M(N-1)}1_{\{N>1\}} + \frac{\eta Qd(\sigma^2_1 + \sigma^2_2)}{2M}$$
$$\implies \frac{\eta\mu}{3}\sum_{r=0}^{E-1}\sum_{t=0}^{Q-1}\mathbb{E}\big[ \hat{F}_X(w^t_{r+1}) - \hat{F}^*_X \big] \le \hat{\Delta} + R\eta\Big( \frac{(N-M)\hat{\upsilon}^2_X}{M(N-1)}1_{\{N>1\}} + \frac{d(\sigma^2_1 + \sigma^2_2)}{2M} \Big),$$
where $\hat{\Delta} := \hat{F}_X(\bar{w}_0) - \hat{F}^*_X = \hat{\Delta}_X$ and $R = EQ$. Recall $w_s := $ ISRL-DP Prox-SVRG$(w_{s-1}, E, K, \eta, \sigma_1, \sigma_2)$ for $s \in [S]$. Plugging in the prescribed $\eta$ and $\sigma^2_1, \sigma^2_2$, we get
$$\mathbb{E}\big[ \hat{F}_X(w_1) - \hat{F}^*_X \big] \le \frac{3\hat{\Delta}\beta}{\mu R}\Big( 1 + \frac{n}{K^{3/2}\sqrt{M}} \Big) + \frac{3\hat{\upsilon}^2_X(N-M)}{\mu M(N-1)} + \tilde{O}\left( \frac{RdL^2\ln(1/\delta)}{\epsilon^2 n^2 M} \right). \qquad (B.65)$$
Our choice of $K \ge \big( \frac{n}{\sqrt{M}} \big)^{2/3}$ implies
$$\mathbb{E}\big[ \hat{F}_X(w_1) - \hat{F}^*_X \big] \le \frac{6\hat{\Delta}\kappa}{R} + \frac{3\hat{\upsilon}^2_X(N-M)}{\mu M(N-1)} + \tilde{O}\left( \frac{RdL^2\ln(1/\delta)}{\epsilon^2 n^2 M} \right). \qquad (B.66)$$
Our choice of $R = 12\kappa$ implies
$$\mathbb{E}\big[ \hat{F}_X(w_1) - \hat{F}^*_X \big] \le \frac{\hat{\Delta}}{2} + \frac{3\hat{\upsilon}^2_X(N-M)}{\mu M(N-1)} + \tilde{O}\left( \frac{\kappa dL^2\ln(1/\delta)}{\epsilon^2 n^2 M} \right). \qquad (B.67)$$
Iterating B.67 $S \ge \log_2\big( \frac{\hat{\Delta}_X \mu M\epsilon^2 n^2}{\kappa dL^2} \big)$ times proves the desired excess loss bound. Note that the total number of communications is $SR = \tilde{O}(\kappa)$.
2. Privacy: As in Part 1, we shall first consider the case $S = 1$. It suffices to show that: 1) the collection of all $E$ computations of $\tilde{g}_{r+1}$ (line 7 of Algorithm 22) (for $r \in \{0,1,\cdots,E-1\}$) is $(\epsilon/2, \delta/2)$-DP; and 2) the collection of all $R = EQ$ computations of $\tilde{p}^t_{r+1}$ (line 10) (for $r \in \{0,1,\cdots,E-1\}$, $t \in \{0,1,\cdots,Q-1\}$) is $(\epsilon/2, \delta/2)$-DP. Further, by the advanced composition theorem (see Theorem 3.20 in [88]) and the assumption on $\epsilon$, it suffices to show that: 1) each of the $E$ computations of $\tilde{g}_{r+1}$ (line 7) is $(\epsilon_1/2, \delta_1/2)$-DP; and 2) each of the $R = EQ$ computations of $\tilde{p}^t_{r+1}$ (line 10) is $(\epsilon_1/2, \delta_1/2)$-DP, where $\epsilon_1 := \frac{\epsilon}{2\sqrt{2R\ln(2/\delta)}}$ and $\delta_1 := \frac{\delta}{2R}$. Now, condition on the randomness due to subsampling of silos (line 4) and local data (line 9). Then Theorem 136 implies that each computation in line 7 and line 10 is $(\tilde{\epsilon}, \tilde{\delta})$-DP (with notation as defined in Algorithm 22), since the norm of each stochastic gradient (and gradient difference) is bounded by $2L$ by $L$-Lipschitzness of $f^0$. Now, invoking privacy amplification by subsampling [248] and using the assumption on $M$ (and choices of $K$ and $R$) to ensure that $\tilde{\epsilon} \le 1$, we get that each computation in line 7 and line 10 is $(\frac{2MK}{Nn}\tilde{\epsilon}, \tilde{\delta})$-DP. Recalling $\tilde{\epsilon} := \frac{\epsilon n N}{8MK\sqrt{4EQ\ln(2/\delta)}}$ and $\tilde{\delta} := \frac{\delta}{2EQ}$, we conclude that Algorithm 22 is $(\epsilon,\delta)$-SDP. Finally, SDP of Algorithm 23 follows by the advanced composition theorem (Theorem 11), since Algorithm 23 calls Algorithm 22 $S$ times.
Excess Loss: The proof is very similar to the proof of Theorem 42, except that the variance of the Gaussian noises $\frac{d(\sigma^2_1 + \sigma^2_2)}{M}$ is replaced by the variance of $\mathcal{P}_{\mathrm{vec}}$. Denoting
$$Z_1 := \frac{1}{Mn}\mathcal{P}_{\mathrm{vec}}\big( \{\nabla f^0(\bar{w}_r, x_{i,j})\}_{i \in S_{r+1},\, j \in [n]};\ \tilde{\epsilon}, \tilde{\delta} \big) - \frac{1}{M}\sum_{i \in S_{r+1}} \nabla\hat{F}^0_i(\bar{w}_r)$$
and
$$Z_2 := \frac{1}{MK}\Big[ \mathcal{P}_{\mathrm{vec}}\big( \{\nabla f^0(w^t_{r+1}, x^{r+1,t}_{i,j}) - \nabla f^0(\bar{w}_r, x^{r+1,t}_{i,j})\}_{i \in S_{r+1},\, j \in [K]};\ \tilde{\epsilon}, \tilde{\delta} \big) - \sum_{i \in S_{r+1}}\sum_{j=1}^K \big( \nabla f^0(w^t_{r+1}, x^{r+1,t}_{i,j}) - \nabla f^0(\bar{w}_r, x^{r+1,t}_{i,j}) \big) \Big],$$
we have (by Theorem 136)
$$\mathbb{E}\|Z_1\|^2 = O\left( \frac{dL^2\ln^2(d/\tilde{\delta})}{M^2 n^2 \tilde{\epsilon}^2} \right) = O\left( \frac{dL^2 R\ln^2(dR/\delta)\ln(1/\delta)}{\epsilon^2 n^2 N^2} \right)$$
and
$$\mathbb{E}\|Z_2\|^2 = O\left( \frac{dL^2\ln^2(d/\tilde{\delta})}{M^2 K^2 \tilde{\epsilon}^2} \right) = O\left( \frac{dL^2 R\ln^2(dR/\delta)\ln(1/\delta)}{\epsilon^2 n^2 N^2} \right).$$
Hence we can simply replace $\frac{d(\sigma^2_1 + \sigma^2_2)}{M}$ by $O\big( \frac{dL^2 R\ln^2(dR/\delta)\ln(1/\delta)}{\epsilon^2 n^2 N^2} \big)$ and follow the same steps as the proof of Theorem 42. This yields (c.f. B.65)
$$\mathbb{E}\big[ \hat{F}_X(w_1) - \hat{F}^*_X \big] \le \frac{3\hat{\Delta}_X\beta}{\mu R}\Big( 1 + \frac{n}{K^{3/2}\sqrt{M}} \Big) + \frac{3\hat{\upsilon}^2_X(N-M)}{\mu M(N-1)} + O\left( \frac{dL^2 R\ln^2(dR/\delta)\ln(1/\delta)}{\epsilon^2 n^2 N^2} \right). \qquad (B.68)$$
Our choice of $K \ge \big( \frac{n}{\sqrt{M}} \big)^{2/3}$ implies
$$\mathbb{E}\big[ \hat{F}_X(w_1) - \hat{F}^*_X \big] \le \frac{6\hat{\Delta}_X\kappa}{R} + \frac{3\hat{\upsilon}^2_X(N-M)}{\mu M(N-1)} + O\left( \frac{dL^2 R\ln^2(dR/\delta)\ln(1/\delta)}{\epsilon^2 n^2 N^2} \right). \qquad (B.69)$$
Our choice of $R = 12\kappa$ implies
$$\mathbb{E}\big[ \hat{F}_X(w_1) - \hat{F}^*_X \big] \le \frac{\hat{\Delta}_X}{2} + \frac{3\hat{\upsilon}^2_X(N-M)}{\mu M(N-1)} + O\left( \frac{\kappa dL^2\ln^2(d\kappa/\delta)\ln(1/\delta)}{\epsilon^2 n^2 N^2} \right). \qquad (B.70)$$
Iterating B.70 $S \ge \log_2\big( \frac{\hat{\Delta}_X \mu\epsilon^2 N^2 n^2}{\kappa dL^2} \big)$ times proves the desired excess loss bound. Note that the total number of communications is $SR = \tilde{O}(\kappa)$.
B.2.3 Supplemental Material for Section 3.3.2: Non-Convex/Non-Smooth Losses

Theorem 149 (Complete Statement of Theorem 44). Let $\epsilon \le 2\ln(1/\delta)$. Then, there are choices of algorithmic parameters such that ISRL-DP FedProx-SPIDER is $(\epsilon,\delta)$-ISRL-DP. Moreover, we have
$$\mathbb{E}\big\| \hat{G}_\eta(w_{\mathrm{priv}}, X) \big\|^2 \lesssim \left[ \left( \frac{\sqrt{L\beta\hat{\Delta}_X d\ln(1/\delta)}}{\epsilon n\sqrt{M}} \right)^{4/3} + \frac{L^2 d\ln(1/\delta)}{\epsilon^2 n^2 M} + 1_{\{M<N\}}\left( \frac{L\sqrt{\beta\hat{\Delta}_X d\ln(1/\delta)}}{\epsilon n^{3/2} M} + \frac{L^2}{Mn} \right) \right]. \qquad (B.71)$$
Proof. Choose $\eta = \frac{1}{2\beta}$, $\sigma^2_1 = \frac{16L^2\ln(1/\delta)}{\epsilon^2 n^2}\max\big( \frac{R}{q}, 1 \big)$, $\sigma^2_2 = \frac{16\beta^2 R\ln(1/\delta)}{\epsilon^2 n^2}$, $\hat{\sigma}^2_2 = \frac{64L^2 R\ln(1/\delta)}{\epsilon^2 n^2}$, and $K_1 = K_2 = n$ (full batch).

Privacy: First, by independence of the Gaussian noise across silos, it is enough to show that the transcript of silo $i$'s interactions with the server is DP for all $i \in [N]$ (conditional on the transcripts of all other silos). Since $\epsilon \le 2\ln(1/\delta)$, it suffices (by Proposition 5) to show that silo $i$'s transcript is $\frac{\epsilon^2}{8\ln(1/\delta)}$-zCDP. Then, by Proposition 8 and Lemma 10, it suffices to bound the sensitivity of the update in line 7 of Algorithm 4 by $2L/n$ and the update in line 11 by $\frac{1}{n}\min\{2\beta\|w_r - w_{r-1}\|, 4L\}$. The line 7 sensitivity bound holds because
$$\sup_{X_i \sim X'_i} \Big\| \frac{1}{n}\sum_{j=1}^n \nabla f^0(w, x_{i,j}) - \nabla f^0(w, x'_{i,j}) \Big\| = \frac{1}{n}\sup_{x,x'}\big\| \nabla f^0(w, x) - \nabla f^0(w, x') \big\| \le 2L/n$$
for any $w$, since $f^0$ is $L$-Lipschitz. The line 11 sensitivity bound holds because
$$\sup_{X_i \sim X'_i} \Big\| \frac{1}{n}\sum_{j=1}^n \nabla f^0(w_r, x_{i,j}) - \nabla f^0(w_{r-1}, x_{i,j}) - \big( \nabla f^0(w_r, x'_{i,j}) - \nabla f^0(w_{r-1}, x'_{i,j}) \big) \Big\|$$
$$= \frac{1}{n}\sup_{x,x'}\big\| \nabla f^0(w_r, x) - \nabla f^0(w_{r-1}, x) - \big( \nabla f^0(w_r, x') - \nabla f^0(w_{r-1}, x') \big) \big\| \le \frac{1}{n}\min\{2\beta\|w_r - w_{r-1}\|, 4L\},$$
since $f^0$ is $L$-Lipschitz and $\beta$-smooth. Note that if $R < q$, then only one update in line 7 is made, and the privacy of this update follows simply from the guarantee of the Gaussian mechanism and the sensitivity bound, without needing to appeal to the composition theorem.
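The two sensitivity bounds can be checked empirically: replacing one sample changes the full-batch gradient by at most $2L/n$, and the SPIDER gradient difference by at most $\min\{2\beta\|w_r - w_{r-1}\|, 4L\}/n$. A small numeric check for a scalar surrogate loss (the gradient $L\tanh(w-x)$ is our own toy choice: it is bounded by $L$ and $\beta$-Lipschitz with $\beta = L$):

```python
import numpy as np

L, beta = 1.0, 1.0  # grad(w, x) = L * tanh(w - x): norm <= L, and beta-smooth with beta = L

def grad(w, x):
    return L * np.tanh(w - x)

rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
x_swap = x.copy()
x_swap[0] = 10.0  # neighboring dataset: one sample replaced
w_r, w_prev = 0.3, -0.2

# Line-7-style sensitivity: change in the full-batch mean gradient
s1 = abs(grad(w_r, x).mean() - grad(w_r, x_swap).mean())

# Line-11-style sensitivity: change in the mean gradient *difference*
d = grad(w_r, x) - grad(w_prev, x)
d_swap = grad(w_r, x_swap) - grad(w_prev, x_swap)
s2 = abs(d.mean() - d_swap.mean())
```

Both empirical sensitivities fall within the stated $2L/n$ and $\min\{2\beta\|w_r-w_{r-1}\|,4L\}/n$ envelopes; the smoothness-based bound is the one that makes the SPIDER noise shrink as the iterates converge.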
Utility: Fix any $X \in \mathcal{X}$ and denote $\hat{G}_\eta(w) := \hat{G}_\eta(w, X)$ for brevity of notation. Recall the notation of Algorithm 4. Note that Lemma 150 holds with
$$\tau^2_1 = \sup_{r \equiv 0\ (\mathrm{mod}\ q)}\mathbb{E}\big\| h_r - \nabla\hat{F}^0_X(w_r) \big\|^2 = \sup_{r \equiv 0\ (\mathrm{mod}\ q)}\mathbb{E}\Big\| \frac{1}{M_r n}\sum_{i \in S_r}\sum_{j=1}^n \big[ \nabla f^0(w_r, x_{i,j}) - \nabla\hat{F}^0_X(w_r) \big] \Big\|^2 + \frac{d\sigma^2_1}{M} \le \frac{2L^2}{Mn}1_{\{M<N\}} + \frac{d\sigma^2_1}{M},$$
using independence of the noises across silos and Lemma 130. Further, for any $r$, we have (conditional on $w_r, w_{r-1}$)
$$\mathbb{E}\big\| H_r - \big( \nabla\hat{F}^0_X(w_r) - \nabla\hat{F}^0_X(w_{r-1}) \big) \big\|^2 \le 2\Big[ \frac{d\sigma^2_2}{M}\|w_r - w_{r-1}\|^2 + \mathbb{E}\Big\| \frac{1}{M_r n}\sum_{i \in S_r}\sum_{j=1}^n \big[ \nabla f^0(w_r, x_{i,j}) - \nabla f^0(w_{r-1}, x_{i,j}) - \big( \nabla\hat{F}^0_X(w_r) - \nabla\hat{F}^0_X(w_{r-1}) \big) \big] \Big\|^2 \Big]$$
$$\le \frac{2d\sigma^2_2}{M}\|w_r - w_{r-1}\|^2 + \frac{8\beta^2}{Mn}\|w_r - w_{r-1}\|^2\,1_{\{M<N\}},$$
using Young’s inequality, independence of the noises across silos, and Lemma 130. Therefore, Lemma 150
holds withτ 2
2
“8
´
β 2
Mn
1
tMăN
u`
dσ
2
2
M
¯
. Next, we claim that ifη “1{2β andqď
1
η 2
τ 2
2
, then
E}G
η pw
priv
q}
2
ď16
˜
ˆ
∆ X
ηR
`τ 2
1
¸
. (B.72)
We prove B.72 as follows. Let $g(w_r) := -\frac{1}{\eta}(w_{r+1} - w_r)$. By Lemma 146 (with $y = w_{r+1}$, $z = w = w_r$, $d^1 = h_r$), we have
$$\mathbb{E}\hat{F}_X(w_{r+1}) \le \mathbb{E}\hat{F}_X(w_r) + \mathbb{E}\big\langle w_{r+1} - w_r,\, \nabla\hat{F}^0_X(w_r) - h_r \big\rangle + \Big( \frac{\beta}{2} - \frac{1}{2\eta} \Big)\mathbb{E}\|w_{r+1} - w_r\|^2 - \frac{1}{2\eta}\mathbb{E}\|w_{r+1} - w_r\|^2$$
$$\le \mathbb{E}\hat{F}_X(w_r) + \frac{\eta}{2}\mathbb{E}\big\| \nabla\hat{F}^0_X(w_r) - h_r \big\|^2 + \Big( \frac{\beta}{2} - \frac{1}{2\eta} \Big)\mathbb{E}\|w_{r+1} - w_r\|^2$$
$$= \mathbb{E}\hat{F}_X(w_r) + \frac{\eta}{2}\mathbb{E}\big\| \nabla\hat{F}^0_X(w_r) - h_r \big\|^2 + \Big( \frac{\beta}{2} - \frac{1}{2\eta} \Big)\eta^2\mathbb{E}\|g(w_r)\|^2.$$
Thus, by Lemma 150, we have
$$\mathbb{E}\big[ \hat{F}_X(w_{r+1}) - \hat{F}_X(w_r) \big] \le \frac{\eta}{2}\mathbb{E}\big\| \nabla\hat{F}^0_X(w_r) - h_r \big\|^2 + \Big( \frac{\beta}{2} - \frac{1}{2\eta} \Big)\eta^2\mathbb{E}\|g(w_r)\|^2$$
$$\le \frac{\eta}{2}\tau^2_2\sum_{t=s_r+1}^r \mathbb{E}\big[ \|w_t - w_{t-1}\|^2 \big] + \frac{\eta}{2}\tau^2_1 + \Big( \frac{\beta}{2} - \frac{1}{2\eta} \Big)\eta^2\mathbb{E}\|g(w_r)\|^2$$
$$= \frac{\eta^3}{2}\tau^2_2\sum_{t=s_r+1}^r \mathbb{E}\big[ \|g(w_t)\|^2 \big] + \frac{\eta}{2}\tau^2_1 + \Big( \frac{\beta}{2} - \frac{1}{2\eta} \Big)\eta^2\mathbb{E}\|g(w_r)\|^2,$$
where $s_r = \lfloor \frac{r}{q} \rfloor q$. Now we sum over a given phase (from $s_r$ to $r$), noting that $r - q \le s_r \le r$:
$$\mathbb{E}\big[ \hat{F}_X(w_{r+1}) - \hat{F}_X(w_{s_r}) \big] \le \frac{\eta^3\tau^2_2}{2}\sum_{k=s_r}^r\sum_{j=s_r+1}^k \mathbb{E}\big[ \|g(w_j)\|^2 \big] + \sum_{k=s_r}^r\Big[ \frac{\eta}{2}\tau^2_1 + \Big( \frac{\beta}{2} - \frac{1}{2\eta} \Big)\eta^2\mathbb{E}\|g(w_k)\|^2 \Big]$$
$$\le \frac{q\eta^3\tau^2_2}{2}\sum_{k=s_r}^r \mathbb{E}\big[ \|g(w_k)\|^2 \big] + \sum_{k=s_r}^r\Big[ \frac{\eta}{2}\tau^2_1 + \Big( \frac{\beta}{2} - \frac{1}{2\eta} \Big)\eta^2\mathbb{E}\|g(w_k)\|^2 \Big]$$
$$= -\sum_{k=s_r}^r\Big\{ \mathbb{E}\big[ \|g(w_k)\|^2 \big]\Big( \frac{\eta}{2} - \frac{\beta\eta^2}{2} - \frac{\eta^3\tau^2_2 q}{2} \Big) - \frac{\eta\tau^2_1}{2} \Big\}.$$
Denoting $A = \frac{\eta}{2} - \frac{\beta\eta^2}{2} - \frac{\eta^3\tau^2_2 q}{2}$ and summing over all phases $\mathcal{P} = \{p_0, p_1, \ldots\} = \big\{ 0, q, \ldots, \lfloor\frac{R-1}{q}\rfloor q, R \big\}$, we get
$$\mathbb{E}\big[ \hat{F}_X(w_R) - \hat{F}_X(w_0) \big] \le \sum_{j=1}^{|\mathcal{P}|}\mathbb{E}\big[ \hat{F}_X(w_{p_j}) - \hat{F}_X(w_{p_{j-1}}) \big] \le \frac{\eta R\tau^2_1}{2} - A\sum_{r=0}^R \mathbb{E}\big[ \|g(w_r)\|^2 \big],$$
which implies
$$\frac{1}{R}\sum_{r=0}^R \mathbb{E}\big[ \|g(w_r)\|^2 \big] \le \frac{\hat{\Delta}_X}{RA} + \frac{\eta\tau^2_1}{2A}. \qquad (B.73)$$
Now, for any $r \ge 0$,
$$\big\| \hat{G}_\eta(w_r) - g(w_r) \big\|^2 = \frac{1}{\eta^2}\big\| w_{r+1} - \mathrm{prox}_{\eta f^1}\big( w_r - \eta\nabla\hat{F}^0_X(w_r) \big) \big\|^2 = \frac{1}{\eta^2}\big\| \mathrm{prox}_{\eta f^1}(w_r - \eta h_r) - \mathrm{prox}_{\eta f^1}\big( w_r - \eta\nabla\hat{F}^0_X(w_r) \big) \big\|^2 \le \frac{1}{\eta^2}\big\| -\eta h_r + \eta\nabla\hat{F}^0_X(w_r) \big\|^2 = \big\| h_r - \nabla\hat{F}^0_X(w_r) \big\|^2,$$
by non-expansiveness of the proximal operator. Furthermore, conditional on the uniformly drawn $r = r^* \in \{0, 1, \ldots, R\}$, we have
$$\mathbb{E}\big\| \hat{G}_\eta(w_{r^*}) - g(w_{r^*}) \big\|^2 \le \mathbb{E}\big\| h_{r^*} - \nabla\hat{F}^0_X(w_{r^*}) \big\|^2 \le \tau^2_2\sum_{k=s_{r^*}+1}^{r^*}\mathbb{E}\|w_k - w_{k-1}\|^2 + \tau^2_1 = \eta^2\tau^2_2\sum_{k=s_{r^*}+1}^{r^*}\mathbb{E}\|g(w_{k-1})\|^2 + \tau^2_1,$$
by Lemma 150, and taking total expectation yields
$$\mathbb{E}\big\| \hat{G}_\eta(w_{\mathrm{priv}}) - g(w_{\mathrm{priv}}) \big\|^2 \le \frac{\eta^2\tau^2_2}{R}\sum_{r=1}^R\sum_{k=s_r+1}^r \mathbb{E}\|g(w_{r-1})\|^2 + \tau^2_1 \le \frac{q\eta^2\tau^2_2}{R}\sum_{r=1}^R \mathbb{E}\|g(w_{r-1})\|^2 + \tau^2_1 \le q\eta^2\tau^2_2\Big[ \frac{\hat{\Delta}_X}{RA} + \frac{\eta\tau^2_1}{2A} \Big] + \tau^2_1,$$
where the last inequality follows from B.73. Hence
$$\mathbb{E}\|\hat{G}_\eta(w_{\mathrm{priv}})\|^2 \le 2\Big[ q\eta^2\tau^2_2\Big( \frac{\hat{\Delta}_X}{RA} + \frac{\eta\tau^2_1}{2A} \Big) + \tau^2_1 \Big] + 2\mathbb{E}\|g(w_{\mathrm{priv}})\|^2 \le 2\Big[ q\eta^2\tau^2_2\Big( \frac{\hat{\Delta}_X}{RA} + \frac{\eta\tau^2_1}{2A} \Big) + \tau^2_1 \Big] + \frac{2\hat{\Delta}_X}{RA} + \frac{\eta\tau^2_1}{A},$$
by Young's inequality and B.73. Now, our choices of $\eta = 1/2\beta$ and $q \le \frac{1}{\tau^2_2\eta^2}$ imply $A = \frac{\eta}{2} - \frac{\beta\eta^2}{2} - \frac{\eta^3\tau^2_2 q}{2} \ge \frac{\eta}{4}$ and
$$\mathbb{E}\|\hat{G}_\eta(w_{\mathrm{priv}})\|^2 \le 8\Big[ \Big( \frac{\hat{\Delta}_X}{R\eta} + \frac{\tau^2_1}{2} \Big) + \tau^2_1 \Big] + \frac{8\hat{\Delta}_X}{R\eta} + 4\tau^2_1 = \frac{16\hat{\Delta}_X}{R\eta} + 16\tau^2_1,$$
proving B.72. The rest of the proof follows from plugging in $\tau^2_1$ and setting algorithmic parameters. Plugging
$$\tau^2_1 = \frac{2L^2}{Mn}1_{\{M<N\}} + \frac{d\sigma^2_1}{M} \le \frac{2L^2}{Mn}1_{\{M<N\}} + \frac{16dL^2 R\ln(1/\delta)}{q\epsilon^2 n^2 M} + \frac{16dL^2\ln(1/\delta)}{\epsilon^2 n^2 M}$$
into B.72 yields
$$\mathbb{E}\|\hat{G}_\eta(w_{\mathrm{priv}})\|^2 \le 16\left( \frac{\hat{\Delta}_X}{\eta R} + \frac{2L^2}{Mn}1_{\{M<N\}} + \frac{16dL^2 R\ln(1/\delta)}{q\epsilon^2 n^2 M} + \frac{16dL^2\ln(1/\delta)}{\epsilon^2 n^2 M} \right).$$
Choosing $R = \frac{\epsilon n\sqrt{Mq}\sqrt{\hat{\Delta}_X\beta}}{L\sqrt{d\ln(1/\delta)}}$ equalizes the two terms in the above display involving $R$ (up to constants), and we get
$$\mathbb{E}\|\hat{G}_\eta(w_{\mathrm{priv}})\|^2 \le C\left( \frac{L\sqrt{\hat{\Delta}_X\beta}\sqrt{d\ln(1/\delta)}}{\epsilon n\sqrt{Mq}} + \frac{dL^2\ln(1/\delta)}{\epsilon^2 n^2 M} + \frac{L^2}{Mn}1_{\{M<N\}} \right) \qquad (B.74)$$
for some absolute constant $C > 0$. Further, with this choice of $R$, it suffices to choose
$$q = \left\lfloor \min\left\{ \left( \frac{\epsilon n L\sqrt{M}}{\sqrt{d\ln(1/\delta)\hat{\Delta}_X\beta}} \right)^{2/3},\ \frac{nM}{1_{\{M<N\}}} \right\} \right\rfloor$$
to ensure that $q \le \frac{1}{\tau^2_2\eta^2}$, so that B.72 holds. Assume $q \ge 1$. Then plugging this $q$ into B.74 yields
$$\mathbb{E}\|\hat{G}_\eta(w_{\mathrm{priv}})\|^2 \le C'\left[ \left( \frac{\sqrt{L\beta\hat{\Delta}_X d\ln(1/\delta)}}{\epsilon n\sqrt{M}} \right)^{4/3} + \frac{dL^2\ln(1/\delta)}{\epsilon^2 n^2 M} + \left( \frac{L\sqrt{\beta\hat{\Delta}_X d\ln(1/\delta)}}{\epsilon n^{3/2}M} + \frac{L^2}{Mn} \right)1_{\{M<N\}} \right]$$
for some absolute constant $C' > 0$, as desired. In case $q < 1$, we must have $L < \frac{\sqrt{\beta\hat{\Delta}_X d\ln(1/\delta)}}{\epsilon n\sqrt{M}}$; hence, we can simply output $w_0$ (which is clearly ISRL-DP) instead of running Algorithm 4 and get
$$\mathbb{E}\big\| \hat{G}_\eta(w_{\mathrm{priv}}, X) \big\|^2 \le L^2 < \left( \frac{\sqrt{L\beta\hat{\Delta}_X d\ln(1/\delta)}}{\epsilon n\sqrt{M}} \right)^{4/3}.$$
The lemmas used in the above proof are stated below. The following lemma is an immediate consequence of the martingale variance bound for SPIDER, given in [93, Proposition 1]:

Lemma 150 ([93]). Let $r \in \{0, 1, \ldots, R\}$ and $s_r = \lfloor\frac{r}{q}\rfloor q$. With the notation of Algorithm 4, assume that $\mathbb{E}\|h_{s_r} - \nabla\hat{F}^0_X(w_{s_r})\|^2 \le \tau^2_1$ and $\mathbb{E}\big\| H_r - \big( \nabla\hat{F}^0_X(w_r) - \nabla\hat{F}^0_X(w_{r-1}) \big) \big\|^2 \le \tau^2_2\|w_r - w_{r-1}\|^2$. Then for all $r \ge s_r + 1$, the iterates of Algorithm 4 satisfy:
$$\mathbb{E}\|h_r - \nabla\hat{F}^0_X(w_r)\|^2 \le \tau^2_2\sum_{t=s_r+1}^r \mathbb{E}\|w_t - w_{t-1}\|^2 + \tau^2_1.$$
We present SDP FedProx-SPIDER in Algorithm 24.
Algorithm24 SDP FedProx-SPIDER
1: Input: $R\in\mathbb{N}$, $K_1,K_2\in[n]$, $X\in\mathcal{X}$, $\eta>0$, $\varepsilon>0$, $\delta\in(0,1/2)$, $q\in\mathbb{N}$, $w_0\in\mathcal{W}$.
2: for $r\in\{0,1,\cdots,R\}$ do
3: for $i\in S_r$ in parallel do
4: Server sends global model $w_r$ to silo $i$.
5: if $r\equiv 0\ (\mathrm{mod}\ q)$ then
6: Silo $i$ draws $K_1$ samples $\{x^r_{i,j}\}_{j=1}^{K_1}$ u.a.r. from $X_i$ (with replacement).
7: Silo $i$ computes $\{\nabla f^0(w_r,x^r_{i,j})\}_{j=1}^{K_1}$.
8: Server updates $h_r=\frac{1}{MK_1}\mathcal{P}_{\mathrm{vec}}\Big(\{\nabla f^0(w_r,x^r_{i,j})\}_{i\in S_r,\,j\in[K_1]};\ \frac{\varepsilon nN}{4K_1 M\sqrt{2\ln(1/\delta)}\max\left(1,\sqrt{q}/\sqrt{R}\right)};\ \frac{\delta q}{2R};\ L\Big)$.
9: else
10: Silo $i$ draws $K_2$ samples $\{x^r_{i,j}\}_{j=1}^{K_2}$ u.a.r. from $X_i$ (with replacement).
11: Silo $i$ computes $J_i=\{\nabla f^0(w_r,x^r_{i,j})-\nabla f^0(w_{r-1},x^r_{i,j})\}_{j=1}^{K_2}$.
12: Server receives $H_r=\frac{1}{MK_2}\mathcal{P}_{\mathrm{vec}}\Big(\{J_i\}_{i\in S_r};\ \frac{\varepsilon nN}{4MK_2\sqrt{2R\ln(1/\delta)}};\ \frac{\delta}{2R};\ \min\{2L,\beta\|w_r-w_{r-1}\|\}\Big)$, and updates $h_r=h_{r-1}+H_r$.
13: end if
14: end for
15: Server updates $w_{r+1}=\mathrm{prox}_{\eta f^1}(w_r-\eta h_r)$.
16: end for
17: Output: $w_{\mathrm{priv}}\sim\mathrm{Unif}(\{w_r\}_{r=1,\cdots,R})$.
Theorem 151 (Complete Statement of Theorem 45). Let $\varepsilon\le\ln(1/\delta)$, $\delta\in(0,\frac{1}{2})$, and

$$M_r=M\ge \left(\frac{\varepsilon n d^2}{n^2}\right)^{1/3}\left(L\sqrt{\beta\widehat{\Delta}_X}\right)^{1/3}\left[1+\left(L\sqrt{\beta\widehat{\Delta}_X}\right)^{1/3}\right].$$
Then, there exist algorithmic parameters such that SDP FedProx-SPIDER is $(\varepsilon,\delta)$-SDP. Further,
$$\mathbb{E}\left\|\widehat{G}_\eta(w_{\mathrm{priv}},X)\right\|^2 \lesssim \left[\left(\frac{\sqrt{L\beta\widehat{\Delta}_X\, d\ln^3(dnN/\delta)}}{\varepsilon nN}\right)^{4/3} + \frac{dL^2\ln^3(Rd/q\delta)}{\varepsilon^2 n^2 N^2} + \mathbf{1}_{\{M<N\}}\left(\frac{L\sqrt{\beta\widehat{\Delta}_X\, d\ln^3(dnN/\delta)}}{\varepsilon n^{3/2} N\sqrt{M}} + \frac{L^2}{Mn}\right)\right].$$
Proof. We will choose

$$R=\left\lceil \frac{\varepsilon nN}{L}\sqrt{\frac{\widehat{\Delta}_X}{\beta d\ln^3(dnN/\delta)}}\ \min\left\{\sqrt{Mn}\,\mathbf{1}_{\{M<N\}},\ \frac{\varepsilon nNL}{\sqrt{\widehat{\Delta}_X\,\beta d\ln^3(dnN/\delta)}}\right\}\right\rceil,$$
$\eta=\frac{1}{2\beta}$, and $K_1=K_2=n$.
Privacy: By Theorem 11, it suffices to show that the message received by the server in each update in line 12 of Algorithm 24 (in isolation) is $\big(\frac{\varepsilon}{2\sqrt{2R\ln(1/\delta)}},\frac{\delta}{2R}\big)$-DP, and that each update in line 8 is $\big(\frac{\varepsilon\sqrt{q}}{2\sqrt{2R\ln(1/\delta)}},\frac{\delta q}{2R}\big)$-DP.
Conditional on the random subsampling of silos, Theorem 136 (together with the sensitivity estimates established in the proof of Theorem 44) implies that each update in line 12 is $(\varepsilon_1,\delta_1)$-SDP, where $\varepsilon_1\le\frac{\varepsilon n}{4M\sqrt{2R\ln(1/\delta)}}$ and $\delta_1=\frac{\delta}{2R}$; each update in line 8 is $(\varepsilon_2,\delta_2)$-SDP, where $\varepsilon_2=\varepsilon_1\sqrt{q}$ and $\delta_2=\delta_1 q$. By our choice of $R$ and our assumption on $M$, we have $M\ge\frac{\varepsilon n}{4\sqrt{2R\ln(1/\delta)}}$ and hence $\varepsilon_1\le 1$. Thus, privacy amplification by subsampling (silos only) (see e.g. [248, Problem 1]) implies that the privacy loss of each round is bounded as desired, establishing that Algorithm 24 is $(\varepsilon,\delta)$-SDP, as long as $q\le R$. If instead $q>R$, then the update in line 8 is only executed once (at iteration $r=0$), so our choice of $\sigma_1^2$ ensures SDP simply by Theorem 136 and privacy amplification by subsampling.
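As a quick numerical sanity check of this per-round budgeting (not part of the proof), the sketch below composes $R$ rounds of the line-12 budget via the advanced composition bound of [88, Theorem 3.20], whose total privacy loss is $\sqrt{2R\ln(1/\delta')}\,\varepsilon_r + R\varepsilon_r(e^{\varepsilon_r}-1)$; the concrete values $\varepsilon=1$, $\delta=10^{-5}$, $R=100$ are illustrative only:

```python
import math

def advanced_composition(eps_r, R, delta_prime):
    """Total epsilon after R-fold adaptive composition of eps_r-DP steps
    (advanced composition theorem, [88, Theorem 3.20])."""
    return math.sqrt(2 * R * math.log(1 / delta_prime)) * eps_r \
        + R * eps_r * (math.exp(eps_r) - 1)

eps, delta, R = 1.0, 1e-5, 100          # illustrative target budget and round count
eps_r = eps / (2 * math.sqrt(2 * R * math.log(1 / delta)))  # per-round budget (line 12)
total = advanced_composition(eps_r, R, delta)
assert total <= eps  # composed loss stays within the target budget
```

The first (square-root) term of the composition bound lands at exactly $\varepsilon/2$ by construction, and the second-order term is negligible since $\varepsilon_r\ll 1$, which is why the per-round budget carries the $2\sqrt{2R\ln(1/\delta)}$ denominator.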
Utility: Denote the (normalized) privacy noises induced by $\mathcal{P}_{\mathrm{vec}}$ in lines 8 and 12 of the algorithm by $Z_1$ and $Z_2$ respectively. By Theorem 136, $Z_i$ is an unbiased estimator of its respective mean and we have

$$\mathbb{E}\|Z_1\|^2 \lesssim \frac{dL^2\ln^3(Rd/q\delta)}{\varepsilon^2 n^2 N^2}\max\left(\frac{R}{q},1\right),$$

and

$$\mathbb{E}\|Z_2\|^2 \lesssim \frac{dR\ln^3(dR/\delta)}{\varepsilon^2 n^2 N^2}\,\beta^2\|w_r-w_{r-1}\|^2$$
for the $r$-th round. Also, note that Lemma 150 is satisfied with

$$\tau_1^2 = \frac{2L^2}{Mn}\mathbf{1}_{\{M<N\}} + \frac{dL^2\ln^3(Rd/q\delta)}{\varepsilon^2 n^2 N^2}\max\left(\frac{R}{q},1\right),$$

and

$$\tau_2^2 = 8\beta^2\left(\frac{\mathbf{1}_{\{M<N\}}}{Mn} + \frac{dR\ln^3(Rd/\delta)}{\varepsilon^2 n^2 N^2}\right).$$
Then by the proof of Theorem 44, we have

$$\mathbb{E}\|G_\eta(w_{\mathrm{priv}})\|^2 \le 16\left(\frac{\widehat{\Delta}_X}{\eta R}+\tau_1^2\right) \tag{B.75}$$

if $\eta=\frac{1}{2\beta}$ and $q\le\frac{1}{\eta^2\tau_2^2}$. Thus,
$$\mathbb{E}\left\|\widehat{G}_\eta(w_{\mathrm{priv}},X)\right\|^2 \lesssim \frac{\widehat{\Delta}_X}{\eta R} + \frac{L^2}{Mn}\mathbf{1}_{\{M<N\}} + \frac{dL^2\ln^3(Rd/q\delta)}{\varepsilon^2 n^2 N^2}\max\left(\frac{R}{q},1\right).$$
Our choice of $R$ together with the choice of

$$q=\left\lfloor \frac{1}{2}\min\left(Mn\,\mathbf{1}_{\{M<N\}},\ \frac{\varepsilon nNL}{\sqrt{\widehat{\Delta}_X\,\beta d\ln^3(Rd/\delta)}}\right)\right\rfloor$$
equalizes the two terms involving $R$ (up to constants), and we obtain the desired ERM bound (upon noting that $q\le 1/(\eta^2\tau_2^2)$ is satisfied).
B.2.3.1 ISRL-DP Non-Convex Lower Bound
Like our excess risk lower bounds for convex loss functions, the lower bounds proved in this section will hold for all non-interactive and sequentially interactive algorithms, as well as a broad subclass of fully interactive ISRL-DP algorithms that are compositional (recall Definition 107). Recall that any algorithm that uses the composition theorems of [88, 136] for its privacy analysis is 1-compositional; this includes Algorithm 4 and most (but not all [181]) ISRL-DP algorithms in the literature. Define the $(\varepsilon,\delta)$-ISRL-DP algorithm class $\mathcal{A}_{(\varepsilon,\delta),C}$ to contain all sequentially interactive algorithms and all fully interactive, $C$-compositional algorithms. If $\mathcal{A}$ is sequentially interactive or $O(1)$-compositional, denote $\mathcal{A}\in\mathbb{A}$.
Next we re-state the precise form of our lower bound (using notation from Appendix B.1.2) and then
provide the proof.
Theorem 152 (Precise Statement of Theorem 46). Let $\varepsilon\in(0,\sqrt{N}]$ and $2^{-\Omega(nN)}\le\delta\le 1/(nN)^{1+\Omega(1)}$. Suppose that in each round $r$, the local randomizers are all $(\varepsilon_0^r,\delta_0^r)$-DP, for $\varepsilon_0^r\lesssim\frac{1}{n}$, $\delta_0^r=o(1/nNR)$, $M=N\ge 16\ln(2/\delta_0^r n)$. Then, there exists an $L$-Lipschitz, $\beta$-smooth, convex loss $f:\mathbb{R}^d\times\mathcal{X}\to\mathbb{R}$ and a database $X\in\mathcal{X}^{n\times N}$ such that any compositional and symmetric $(\varepsilon_0,\delta_0)$-ISRL-DP algorithm $\mathcal{A}$ run on $X$ with output $w_{\mathrm{priv}}$ satisfies

$$\mathbb{E}\|\nabla\widehat{F}_X(w_{\mathrm{priv}})\|^2 = \Omega\left(L^2\min\left\{1,\frac{d\ln(1/\delta_0)}{\varepsilon_0^2 n^2 N}\right\}\right).$$
Proof. Recall Theorem 109, which showed that a compositional $(\varepsilon_0,\delta_0)$-ISRL-DP algorithm can become an $\big(O\big(\frac{\varepsilon_0}{\sqrt{N}}\big),\delta\big)$-SDP algorithm when a shuffler is introduced:
Theorem 153 (Re-statement of Theorem 109). Let $\mathcal{A}\in\mathcal{A}_{(\varepsilon_0,\delta_0),C}$ such that $\varepsilon_0\in(0,\sqrt{N}]$ and $\delta_0\in(0,1)$. Assume that in each round, the local randomizers $\mathcal{R}^{(i)}_r(Z^{(1:r-1)},\cdot):\mathcal{X}^n\to\mathcal{Z}$ are $(\varepsilon_0^r,\delta_0^r)$-DP for all $i\in[N]$, $r\in[R]$, $Z^{(1:r-1)}\in\mathcal{Z}^{r-1\times N}$ with $\varepsilon_0^r\le\frac{1}{n}$. Assume $N\ge 16\ln(2/\delta_0^r n)$. If $\mathcal{A}$ is $C$-compositional, then assume $\delta_0^r\le\frac{1}{14nNR}$ and denote $\delta:=14Nn\sum_{r=1}^R\delta_0^r$; if instead $\mathcal{A}$ is sequentially interactive, then assume $\delta_0=\delta_0^r\le\frac{1}{7Nn}$ and denote $\delta:=7Nn\delta_0$. Let $\mathcal{A}_s:\mathcal{X}\to\mathcal{W}$ be the same algorithm as $\mathcal{A}$ except that in each round $r$, $\mathcal{A}_s$ draws a random permutation $\pi_r$ of $[N]$ and applies $\mathcal{R}^{(i)}_r$ to $X_{\pi_r(i)}$ instead of $X_i$. Then, $\mathcal{A}_s$ is $(\varepsilon,\delta)$-CDP, where $\varepsilon=O\Big(\frac{\varepsilon_0\ln(1/nN\delta_0^{\min})C^2}{\sqrt{N}}\Big)$, and $\delta_0^{\min}:=\min_{r\in[R]}\delta_0^r$. In particular, if $\mathcal{A}\in\mathbb{A}$, then $\varepsilon=O\Big(\frac{\varepsilon_0\ln(1/nN\delta_0^{\min})}{\sqrt{N}}\Big)$. Note that for sequentially interactive $\mathcal{A}$, $\delta_0^{\min}=\delta_0$.
Next, we will observe that the expected (squared) gradient norm of the output of $\mathcal{A}_s$ is the same as the expected (squared) gradient norm of the output of $\mathcal{A}$ for symmetric FL algorithms. The precise definition of a “symmetric” (fully interactive) ISRL-DP algorithm is that the aggregation functions $g_r$ (used to aggregate silo updates/messages and update the global model) are symmetric (i.e. $g_r(Z_1,\cdots,Z_N)=g_r(Z_{\pi(1)},\cdots,Z_{\pi(N)})$ for all permutations $\pi$) and in each round $r$ the randomizers $\mathcal{R}^{(i)}_r=\mathcal{R}_r$ are the same for all silos $i\in[N]$. ($\mathcal{R}^{(i)}_r$ can still change with $r$ though.) For example, all of the algorithms presented in this paper (and essentially all algorithms that we’ve come across in the literature, for that matter) are symmetric. This is because the aggregation functions used in each round are simple averages of the $M_r$ noisy gradients received from all silos and the randomizers used by every silo in round $r$ are identical: each adds the same Gaussian noise to the stochastic gradients. Note that for any symmetric algorithm, the distributions of the updates of $\mathcal{A}$ and $\mathcal{A}_s$ are both averages over all permutations of $[N]$ of the conditional (on $\pi$) distributions of the randomizers applied to the $\pi$-permuted database.
Now, for a given $(\varepsilon_0,\delta_0)$-ISRL-DP algorithm $\mathcal{A}$, denote the shuffled algorithm derived from $\mathcal{A}$ by $\mathcal{A}_s$. Then apply Theorem 154 to $\mathcal{A}_s$ to obtain lower bounds on its expected squared gradient norm:
Theorem 154 ([16]). Let $\varepsilon\in(0,\sqrt{N}]$ and $2^{-\Omega(nN)}\le\delta\le 1/(nN)^{1+\Omega(1)}$. Then, there exists an $L$-Lipschitz, $\beta$-smooth, convex loss $f:\mathbb{R}^d\times\mathcal{X}\to\mathbb{R}$ and a database $X\in\mathcal{X}^{n\times N}$ such that any $(\varepsilon,\delta)$-CDP algorithm $\mathcal{A}$ run on $X$ with output $w_{\mathrm{priv}}$ satisfies

$$\mathbb{E}\|\nabla\widehat{F}_X(w_{\mathrm{priv}})\|^2 = \Omega\left(L^2\min\left\{1,\frac{d\ln(1/\delta)}{\varepsilon^2 n^2 N^2}\right\}\right).$$
Applying Theorem 154 with $\varepsilon=\varepsilon_0/\sqrt{N}$ yields the desired lower bound for $\mathcal{A}_s$. Further, by the observations above about symmetric algorithms, this lower bound also applies to $\mathcal{A}$.
B.2.3.2 Upper and Lower Bounds for Cross-Device FL Without a Trusted Server
In this section, we use our results to derive upper and lower bounds for FL algorithms that satisfy both
ISRL-DP and user-level DP. Our approach nearly mirrors the approach taken for convex loss functions.
Assume $M=N$ for simplicity.†† Given ISRL-DP parameters $(\varepsilon,\delta)$ with $\varepsilon\le 1$, let $\varepsilon_0=\varepsilon/n$ and $\delta_0=\delta/4n\le\delta/(ne^{(n-1)\varepsilon_0})=\delta/(ne^{(n-1)\varepsilon/n})$. Consider the Proximal PL case for now. Run Noisy $(\varepsilon_0,\delta_0)$-ISRL-DP Prox-SGD, which also satisfies $(\varepsilon,\delta)$-user-level DP by Appendix B.1.3. Thus, Theorem 40 yields an ISRL-DP/user-level DP excess risk upper bound for heterogeneous FL with Proximal-PL losses:
$$\mathbb{E}F(\widehat{w}_T)-F^* = \widetilde{O}\left(\frac{L^2}{\mu}\left(\frac{\kappa^2\sqrt{d\ln(1/\delta_0)}}{\varepsilon_0^2 n^2\sqrt{N}}+\frac{\kappa}{\sqrt{N}n}\right)\right) = \widetilde{O}\left(\frac{L^2}{\mu}\left(\frac{\kappa^2\sqrt{d\ln(1/\delta_0)}}{\varepsilon^2\sqrt{N}}+\frac{\kappa}{\sqrt{N}n}\right)\right) \tag{B.76}$$

$$= \widetilde{O}\left(\frac{L^2}{\mu}\cdot\frac{\kappa^2\sqrt{d\ln(1/\delta_0)}}{\varepsilon^2\sqrt{N}}\right).$$
Regarding lower bounds: note that the semantics of the hybrid ISRL-DP/user-level DP notion are essentially identical to local DP, except that individual “records/items” are now thought of as datasets of size $n$. Thus, letting $n=1$ in the strongly convex ISRL-DP lower bound from the section on ISRL-DP convex FL (where we think of each silo as having just one “record,” even though that record is really a dataset) yields a lower bound that matches the upper bound attained above up to a factor of $\widetilde{O}(\kappa^2)$. Note that the minimax risk bounds for ISRL-DP/user-level DP hybrid algorithms resemble the bounds for LDP algorithms [85], scaling with $N$ but not with $n$. A similar procedure can be used to derive upper and lower bounds for Proximal PL ERM and non-convex/non-smooth ERM, using our upper bounds in Theorems 42 and 44, and the lower bound in Theorem 46.
†† The extension to $M<N$ will be clear.
Figure B.2: Test error vs. $\varepsilon$ for linear regression on heterogeneous health insurance data. $\delta=1/n^2$. 90% error bars are shown.
Figure B.3: Test error vs. $\varepsilon$ for linear regression on heterogeneous health insurance data. $\delta=1/n^2$. 90% error bars are shown.
B.2.4 Experimental Details and Additional Results
B.2.4.1 ISRL-DP Fed-SPIDER: Alternate implementation of ISRL-DP FedProx-SPIDER
We also evaluated an alternative implementation of ISRL-DP FedProx-SPIDER, given in Algorithm 25. We found that this variation of ISRL-DP FedProx-SPIDER sometimes performed better in practice. For each $\varepsilon\in\{0.75,1,1.5,3,6,12,18\}$, we chose the algorithm with smaller training loss and reported the test error for the corresponding algorithm as SPIDER in the plots.
Algorithm 25 ISRL-DP Fed-SPIDER: Alternate Implementation
1: Input: Number of silos $N\in\mathbb{N}$, dimension $d\in\mathbb{N}$ of data, noise parameters $\sigma_1^2$ and $\sigma_2^2$, data sets $X_i\in\mathcal{X}^{n_i}$ for $i\in[N]$, loss function $f(w,x)$, number of rounds $E-1\in\mathbb{N}$, local batch size parameters $K_1$ and $K_2$, step size $\eta$.
2: Server initializes $w^2_0:=0$ and broadcasts.
3: Silos sync $w^{i,2}_0:=w^2_0$ ($i\in[N]$).
4: Network determines random subset $S_0$ of $M_0\in[N]$ available silos.
5: for $i\in S_0$ in parallel do
6: Silo $i$ draws $K_2$ random samples $\{x^{0,2}_{i,j}\}_{j\in[K_2]}$ (with replacement) from $X_i$ and noise $u^{(i)}_2\sim N(0,\sigma_2^2 I_d)$.
7: Silo $i$ computes noisy stochastic gradient $\widetilde{v}^{i,2}_0:=\frac{1}{K_2}\sum_{j=1}^{K_2}\nabla f(w^2_0,x^{0,2}_{i,j})+u^{(i)}_2$ and sends to server.
8: end for
9: Server aggregates $\widetilde{v}^2_0:=\frac{1}{M_0}\sum_{i\in S_0}\widetilde{v}^{i,2}_0$ and broadcasts.
10: for $r\in\{0,1,\cdots,E-2\}$ do
11: Network determines random subset $S_{r+1}$ of $M_{r+1}\in[N]$ available silos.
12: for $i\in S_{r+1}$ in parallel do
13: Server updates $w^0_{r+1}:=w^2_r$, $w^1_{r+1}:=w^2_r-\eta\widetilde{v}^2_r$ and broadcasts to silos.
14: Silos sync $w^{i,0}_{r+1}:=w^0_{r+1}$, $\widetilde{v}^{i,0}_{r+1}:=\widetilde{v}^2_r$, and $w^{i,1}_{r+1}:=w^1_{r+1}$ ($i\in[N]$).
15: Silo $i$ draws $K_1$ random samples $\{x^{r+1,1}_{i,j}\}_{j\in[K_1]}$ (with replacement) from $X_i$ and noise $u^{(i)}_1\sim N(0,\sigma_1^2 I_d)$.
16: Silo $i$ computes $\widetilde{v}^{i,1}_{r+1}:=\frac{1}{K_1}\sum_{j=1}^{K_1}\big[\nabla f(w^1_{r+1},x^{r+1,1}_{i,j})-\nabla f(w^0_{r+1},x^{r+1,1}_{i,j})\big]+\widetilde{v}^{i,0}_{r+1}+u^{(i)}_1$ and sends to server.
17: Server aggregates $\widetilde{v}^1_{r+1}:=\frac{1}{M_{r+1}}\sum_{i\in S_{r+1}}\widetilde{v}^{i,1}_{r+1}$, updates $w^2_{r+1}:=w^1_{r+1}-\eta\widetilde{v}^1_{r+1}$, and broadcasts.
18: Silos sync $w^{i,2}_{r+1}:=w^2_{r+1}$.
19: Silo $i$ draws $K_2$ random samples $\{x^{r+1,2}_{i,j}\}_{j\in[K_2]}$ (with replacement) from $X_i$ and noise $u^{(i)}_2\sim N(0,\sigma_2^2 I_d)$.
20: Silo $i$ computes $\widetilde{v}^{i,2}_{r+1}:=\frac{1}{K_2}\sum_{j=1}^{K_2}\nabla f(w^2_{r+1},x^{r+1,2}_{i,j})+u^{(i)}_2$ and sends to server.
21: Server updates $\widetilde{v}^2_{r+1}:=\frac{1}{M_{r+1}}\sum_{i\in S_{r+1}}\widetilde{v}^{i,2}_{r+1}$ and broadcasts.
22: end for
23: end for
24: Output: $w_{\mathrm{priv}}\sim\mathrm{Unif}(\{w^t_r\}_{r=1,\cdots,E-1;\,t=1,2})$.
B.2.4.2 MNIST experiment
The MNIST data is available at http://yann.lecun.com/exdb/mnist/. In our implementation, we use torchvision.datasets.MNIST to download the MNIST data. All experiments are conducted on a device with a 6-core Intel Core i7-8700.
Experimental setup: To divide the data into $N=25$ silos and pre-process it, we rely on the code provided by [265]. The code is shared under a Creative Commons Attribution-Share Alike 3.0 license. We fix $\delta=1/n^2$ (where $n$, the number of training samples per silo, is given in “Preprocessing”) and test $\varepsilon\in\{0.75,1,1.5,3,6,12,18\}$.
Preprocessing: First, we standardize the numerical data to have mean zero and unit variance, and flatten the images. Then, we utilize PCA to reduce the dimension of the flattened images from $d=784$ to $d=50$. To expedite training, we used 1/7 of the 5,421 samples per digit, which is 774 samples per digit. As each silo is assigned data of two digits, each silo has $n=1{,}543$ samples. We employ an 80/20 train/test split for the data of each silo.
Gradient clipping: Since the Lipschitz parameter of the loss is unknown for this problem, we incorporated gradient clipping [1] into the algorithms. Noise was calibrated to the clip threshold $L$ to guarantee ISRL-DP (see below for more details). We also allowed the non-private algorithms to employ clipping if it was beneficial.
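A minimal sketch of the clip-then-add-noise step described above (the function names and the toy Gaussian gradients are illustrative only; the actual experiments use the implementation of [181]):

```python
import math, random

def clip(g, L):
    """Rescale gradient g so its l2 norm is at most the clip threshold L [1]."""
    norm = math.sqrt(sum(v * v for v in g))
    scale = min(1.0, L / norm) if norm > 0 else 1.0
    return [scale * v for v in g]

def noisy_clipped_avg(grads, L, sigma, rng):
    """Average per-sample clipped gradients, then add Gaussian noise with
    std sigma*L/K per coordinate to the average (equivalently, std sigma*L
    on the clipped sum, whose l2 sensitivity is L)."""
    d, K = len(grads[0]), len(grads)
    clipped = [clip(g, L) for g in grads]
    avg = [sum(g[j] for g in clipped) / K for j in range(d)]
    return [a + sigma * L / K * rng.gauss(0, 1) for a in avg]

rng = random.Random(0)  # illustrative toy gradients, d = 3, K = 4
grads = [[rng.gauss(0, 5) for _ in range(3)] for _ in range(4)]
out = noisy_clipped_avg(grads, L=1.0, sigma=1.0, rng=rng)
```

Because clipping bounds each sample's contribution by $L$, the noise scale depends only on $L$ and the privacy budget, not on the (unknown) Lipschitz constant of the loss.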
Hyperparameter tuning: For each algorithm, each $\varepsilon\in\{0.75,1,1.5,3,6,12,18\}$, and each $(M,R)\in\{(12,25),(12,50),(25,25),(25,50)\}$, we swept through a range of constant stepsizes and clipping thresholds to find the (approximately) optimal stepsize and clipping threshold for each algorithm and setting. The stepsize grid consists of 5 evenly spaced points between $e^{-9}$ and 1. The clipping threshold grid consists of the 5 values 1, 5, 10, 100, and 10000. For ISRL-DP FedProx-SPIDER, we use $q\in\{1,2,3,4\}$ for $R=50$ and $q\in\{1,2\}$ for $R=25$. Due to memory limitations, we did not check larger $q$ values, because they result in a large batch size via $K$ in ISRL-DP FedProx-SPIDER (see below for more details).
Choice of $\sigma^2$ and $K$: We used noise with smaller constants/log terms (compared to the theoretical portion of the paper) to get better utility (at the expense of needing larger $K$ to ensure privacy), by appealing to the moments accountant [1, Theorem 1] instead of the advanced composition theorem [88, Theorem 3.20]. For ISRL-DP FedProx-SPIDER, we used $\sigma_1^2=\frac{16L^2\ln(1/\delta)}{\varepsilon^2 n^2}\max\left(\frac{R}{q},1\right)$, $\sigma^2=\infty$, and $\widehat{\sigma}_2^2=\frac{64L^2 R\ln(1/\delta)}{\varepsilon^2 n^2}$. We chose $\sigma^2=\infty$ because we do not have an a priori bound on the smoothness parameter $\beta$. Therefore, only the variance-reduction benefits of SPIDER are illustrated in the experiments and not the smaller privacy noise.
For ISRL-DP Fed-SPIDER: Alternate Implementation, we used $\sigma_1^2=\frac{32L^2\ln(2/\delta)R}{n^2\varepsilon^2}$ and $\sigma_2^2=\frac{8L^2\ln(2/\delta)R}{n^2\varepsilon^2}$ with $K_1=K_2=\frac{n\sqrt{\varepsilon}}{2\sqrt{R}}$ given above, which guarantees ISRL-DP by [1, Theorem 1]. Note that the larger constant 32 is needed for ISRL-DP in $\sigma_1^2$ because the $\ell_2$ sensitivity of the updates in line 16 of Algorithm 25 is larger than that of simple SGD updates (which are used in MB-SGD, Local SGD, and line 20 of Algorithm 25) by a factor of 2.
For ISRL-DP MB-SGD and ISRL-DP Local SGD, we use the same implementation as [181].
Generating Noise: Due to the low speed of the NumPy package in generating multivariate random normal vectors, we use an alternative approach to generate noise. For the ISRL-DP SPIDER and ISRL-DP MB-SGD algorithms, we generate the noise in MATLAB and save it. Then, we load it into Python when we run the algorithms. Since the number of required noise vectors for ISRL-DP Local SGD is much larger ($K$ times larger) than for the two other ISRL-DP algorithms, saving the noise beforehand requires a lot of memory. Hence, we generate the noise for ISRL-DP Local SGD in Python by importing a MATLAB engine.
Plots and additional experimental results: See Figure B.4 and Figure B.5 for results of the two remaining experiments: $(M=12,R=25)$ and $(M=25,R=50)$. The results are qualitatively similar to those presented in the main body. In particular, ISRL-DP SPIDER continues to outperform both ISRL-DP baselines in most tested privacy levels. Also, ISRL-DP MB-SGD continues to show strong performance in the high privacy regime ($\varepsilon\le 1.5$).
Figure B.4: MNIST. $M=12$, $R=25$.
Figure B.5: MNIST. $M=25$, $R=50$.
B.2.4.3 CIFAR-10 experiment
We run an experiment on CIFAR-10 data to further evaluate the performance of ISRL-DP SPIDER in image classification. We partition the data set into 10 heterogeneous silos, each containing one class out of the 10 classes of data. We use a 5-layer CNN with two 5x5 convolutional layers (the first with 6 channels, the second with 16 channels, each followed by a ReLU activation and a 2x2 max pooling) and three fully connected layers with 120, 84, and 10 neurons respectively (the first and second fully connected layers followed by a ReLU activation). For 7 privacy levels ranging from $\varepsilon=0.75$ to $\varepsilon=18$, we compare ISRL-DP SPIDER against standard FL baselines: MB-SGD, Local SGD (a.k.a. Federated Averaging) [188], ISRL-DP MB-SGD [181], and ISRL-DP Local SGD. We fix $\delta=1/n^2$. As Figure 3.10 shows, ISRL-DP SPIDER outperforms both ISRL-DP baselines for most tested privacy levels. The results are based on the average error over 10 random assignments of the train/test split of the data for each algorithm/epsilon pair. CIFAR-10 data is available at https://www.cs.toronto.edu/~kriz/cifar.html. In our implementation, we directly download the data from torchvision.datasets.CIFAR10.
Experimental setup: To divide the CIFAR-10 data into $N=10$ heterogeneous silos, we use labels. That is, we assign one unique image class to each of the 10 heterogeneous silos.
Preprocessing: We standardize the numerical data to have mean zero and unit variance. We utilize an 80/20 train/test split for the data of each client.
Gradient clipping: Since the Lipschitz parameter of the loss is unknown for this problem, we incorporated gradient clipping [1] into the algorithms. Noise was calibrated to the clip threshold $L$ to guarantee ISRL-DP (see below for more details). We also allowed the non-private algorithms to employ clipping if it was beneficial.
Hyperparameter tuning: It is similar to the hyperparameter tuning for the MNIST data. However, we check $(M,R)=(10,50)$ and $(M,R)=(10,100)$ here. Also, the stepsize grid of Local SGD consists of 20 evenly spaced points between $e^{-5}$ and $e^{1}$. The stepsize grid of MB-SGD and SPIDER with $R=50$ consists of 12 evenly spaced points between $e^{-5}$ and $e^{0}$, and with $R=100$ it consists of 8 evenly spaced points between $e^{-5}$ and $e^{1}$. The clipping threshold grid of all algorithms consists of the 6 values 0.001, 0.01, 0.1, 1, 5, and 10. For ISRL-DP FedProx-SPIDER, we use $q\in\{1,2,3,4\}$ for $R=50$ and $q\in\{1,2,\ldots,8\}$ for $R=100$. Due to memory limitations, we did not check larger $q$ values, because they result in a large batch size via $K$ in ISRL-DP FedProx-SPIDER (see below for more details).
Choice of $\sigma^2$ and $K$: Same as in MNIST; see Appendix B.2.4.2.
B.2.4.4 Breast cancer experiment
We run an experiment on Wisconsin Breast Cancer (Diagnosis) data (WBCD) to further evaluate the performance of ISRL-DP SPIDER in binary (malignant vs. benign) classification. We partition the data set into 2 heterogeneous silos, one containing the malignant labels and the other the benign labels. We use a 2-layer perceptron with 5 neurons in the hidden layer. For 7 privacy levels ranging from $\varepsilon=0.75$ to $\varepsilon=18$, we compare ISRL-DP SPIDER against standard FL baselines: MB-SGD, Local SGD (a.k.a. Federated Averaging) [188], ISRL-DP MB-SGD [181], and ISRL-DP Local SGD. We fix $\delta=1/n^2$. As Figure 3.12 shows, ISRL-DP SPIDER outperforms both ISRL-DP baselines for most tested privacy levels. The results are based on the average error over 10 random assignments of the train/test split of the data for each algorithm/epsilon pair. WBCD data is available at https://archive.ics.uci.edu/ml/datasets and we directly download the data from the UCI repository website. The experiment is conducted on a device with a 6-core Intel Core i7-8700.
Experimental setup: To divide the WBCD data into $N=2$ silos, we use labels (malignant vs. benign). We split the data into 2 parts, one of which has only malignant labels and the other only benign labels. Then, we assign each part to a client to obtain fully heterogeneous silos. In all experiments, we fix $\delta=1/n^2$ (where $n$, the number of training samples per client, is given in “Preprocessing”) and test $\varepsilon\in\{0.75,1,1.5,3,6,12,18\}$.
Preprocessing: We standardize the numerical data to have mean zero and unit variance. We utilize an 80/20 train/test split for the data of each client.
Gradient clipping: Since the Lipschitz parameter of the loss is unknown for this problem, we incorporated gradient clipping [1] into the algorithms. Noise was calibrated to the clip threshold $L$ to guarantee ISRL-DP (see below for more details). We also allowed the non-private algorithms to employ clipping if it was beneficial.
Hyperparameter tuning: It is similar to the hyperparameter tuning for the MNIST data. However, we check $(M,R)=(4,25)$ here. Also, the stepsize grid consists of 15 evenly spaced points between $e^{-9}$ and 1. The clipping threshold grid consists of the 4 values 0.1, 1, 5, and 10. For ISRL-DP FedProx-SPIDER, we use $q\in\{1,2,\ldots,10\}$. Due to memory limitations, we did not check larger $q$ values, because they result in a large batch size via $K$ in ISRL-DP FedProx-SPIDER (see below for more details).
Choice of $\sigma^2$ and $K$: Same as in MNIST; see Appendix B.2.4.2.
Generating Noise: Due to the low speed of the NumPy package in generating multivariate random normal vectors, we use an alternative approach to generate noise. For the ISRL-DP SPIDER and ISRL-DP MB-SGD algorithms, we generate the noise in MATLAB and use it in Python when we run the algorithms. Since the number of required noise vectors for ISRL-DP Local SGD is much larger ($K$ times larger) than for the two other ISRL-DP algorithms, saving the noise beforehand requires a lot of memory. Hence, we generate the noise for ISRL-DP Local SGD in Python by importing a MATLAB engine.
Appendix C
Appendix to Chapter 4
C.1 Additional Discussion of Related Work
DP SCO Without Uniform Lipschitz Continuity: The study of DP SCO without uniformly Lipschitz continuous loss functions was initiated by [253], who provided upper bounds for smooth convex/strongly convex losses. The work of [137] provided lower bounds and improved, yet suboptimal, upper bounds for the convex case. Both of the works [253, 137] require $f$ to be $\beta_f$-smooth. It is also worth mentioning that [253, 137] restricted attention to losses satisfying $\nabla F(w^*)=0$ for $w^*\in\mathcal{W}$, i.e. $\mathcal{W}$ is a compact set containing the unconstrained optimum $w^*=\mathrm{argmin}_{w\in\mathbb{R}^d}F(w)\in\mathcal{W}$. By comparison, we consider the more general constrained optimization problem $\min_{w\in\mathcal{W}}F(w)$, where $\mathcal{W}$ need not contain the global unconstrained optimum.
Here we provide a brief discussion of the techniques used in [253, 137]. The work of [253] used a full batch (clipped, noisy) gradient descent based algorithm, building on the heavy-tailed mean estimator of [122]. They bounded the excess risk of their algorithm by using a uniform convergence [249] argument, resulting in a suboptimal dependence on the dimension $d$. The work of [137] used essentially the same
approach as [253], but obtained an improved rate with a more careful analysis.∗ However, as discussed, the bound in [137] is $O\big(\widetilde{r}D\sqrt{\frac{d}{n}}\big)$ when $\varepsilon\approx 1$, which is still suboptimal.†
More recently, DP optimization with outliers was studied in special cases of sparse learning [124], multi-arm bandits [236], and $\ell_1$-norm linear regression [254].
DP ERM and DP GLMs without Uniform Lipschitz Continuity: The work of [18] provides bounds for constrained DP ERM with arbitrary convex loss functions using a Noisy Clipped SGD algorithm that is similar to our Algorithm 27, except that their algorithm is multi-pass and ours is one-pass. In a concurrent work, [69] considered DP ERM in the unconstrained setting with convex and non-convex loss functions. Their algorithm, noisy clipped SGD, is also similar to Algorithm 27 and the algorithm of [18]. The results in [69] are not directly comparable to [18], since [69] considers the unconstrained setting while [18] considers the constrained setting, but the rates in [18] are faster. [69] also analyzes the convergence of noisy clipped SGD with smooth non-convex loss functions.
The works of [232, 17] consider generalized linear models (GLMs), a particular subclass of convex loss functions, and provide empirical and population risk bounds for the unconstrained DP optimization problem. The unconstrained setting is not comparable to the constrained setting that we consider here: in the unconstrained case, a dimension-independent upper bound is achievable, whereas our lower bounds (which apply to GLMs) imply that a dependence on the dimension $d$ is necessary in the constrained case.
Other works on gradient clipping: The gradient clipping technique (and adaptive variants of it) has been studied empirically in works such as [1, 54, 11], to name a few. The work of [54] shows that gradient clipping can prevent SGD from converging, and describes the clipping bias with a disparity measure between the gradient distribution and a geometrically symmetric distribution.
∗ Additionally, [137, Theorem 5.2] provided a bound via noisy gradient descent with the clipping mechanism of [138], but this bound is inferior (in the practical privacy regime $\varepsilon\approx 1$) to their bound in [137, Theorem 5.4] that used the estimator of [122].
† The bound in [137, Theorem 5.4] for $k=2$ is stated in the notation of Assumption 10 and thus has an extra factor of $\sqrt{d}$, compared to the bound written here. We write their bound in terms of our Assumption 5, replacing their $\gamma d$ term by $\widetilde{r}\sqrt{d}$.
Optimization with biased gradient oracles: The works [6, 18] analyze SGD with biased gradient oracles. Our work provides a tighter bound for smooth, strongly convex functions and analyzes accelerated SGD and proximal SGD with biased gradient oracles.
DP SO with Uniformly Lipschitz loss functions: In the absence of outlier data, there are a multitude of works studying uniformly Lipschitz DP SO, mostly in the convex/strongly convex case. We do not attempt to provide a comprehensive list of these here, but will name the most notable ones, which provide optimal or state-of-the-art utility guarantees. The first suboptimal bounds for DP SCO were provided in [32]. The work of [29] established the optimal rate for non-strongly convex DP SCO, by bounding the uniform stability of Noisy DP SGD (without clipping). The strongly convex case was addressed by [95], who also provided optimal rates in linear time for sufficiently smooth, convex losses. Since then, other works have provided faster and simpler (optimal) algorithms for the non-smooth DP SCO problem [28, 19, 157, 30] and considered DP SCO with different geometries [19, 31]. State-of-the-art rates for DP SO with the proximal PL condition are due to [177].
C.2 Other Bounded Moment Conditions Besides Assumption 5
In this section, we give the alternate bounded moment assumption made in [253, 137] and a third bounded moment condition, and discuss the relationships between these assumptions. The notation presented here will be necessary in order to state the sharper versions of our linear-time excess risk bounds and the asymptotically optimal excess risk bounds under the coordinate-wise assumption of [253, 137] (which our Algorithm 7 also attains). First, we introduce a relaxation of Assumption 5:
Assumption 9. There exist $k\ge 2$ and $r^{(k)}>0$ such that $\sup_{w\in\mathcal{W}}\mathbb{E}\big[\|\nabla f(w,x)\|_2^k\big]\le r^{(k)}$, for all subgradients $\nabla f(w,x_i)\in\partial_w f(w,x_i)$. Denote $r_k:=\big(r^{(k)}\big)^{1/k}$.
Assumption 5 implies Assumption 9 for $r\le\widetilde{r}$. Next, we precisely state the coordinate-wise moment bound assumption that is used in [253, 137] for differentiable $f$:
Assumption 10 (Used by [253, 137]‡, but not in this work). There exist $k\ge 2$ and $\gamma>0$ such that $\sup_{w\in\mathcal{W}}\mathbb{E}|\langle\nabla f(w,x)-\nabla F(w),e_j\rangle|^k\le\gamma$ for all $j\in[d]$, where $e_j$ denotes the $j$-th standard basis vector in $\mathbb{R}^d$. Also, $L_F:=\sup_{w\in\mathcal{W}}\|\nabla F(w)\|\le\sqrt{d}\gamma^{1/k}$.
Lemma 155 allows us to compare our results in Section 4.4, obtained under Assumption 9, to the results in [253, 137], which require Assumption 10.
Lemma 155. Suppose Assumption 10 holds. Then, Assumption 9 holds for $r_k\le 4\sqrt{d}\gamma^{1/k}$.
Proof. We use the following inequality, which can easily be verified inductively using the Cauchy-Schwartz and Young inequalities: for any vectors $u,v\in\mathbb{R}^d$, we have

$$\|u\|^k \le 2^{k-1}\left(\|u-v\|^k+\|v\|^k\right). \tag{C.1}$$
Therefore,

$$r^{(k)} = \sup_{w\in\mathcal{W}}\mathbb{E}\|\nabla f(w,x)\|^k \le 2^{k-1}\left(\sup_{w\in\mathcal{W}}\mathbb{E}\|\nabla f(w,x)-\nabla F(w)\|^k + L_F^k\right)$$
$$= 2^{k-1}\left(\sup_{w\in\mathcal{W}}\mathbb{E}\left[\left\{\sum_{j=1}^d|\langle\nabla f(w,x)-\nabla F(w),e_j\rangle|^2\right\}^{k/2}\right] + L_F^k\right)$$
$$\le (2L_F)^k + 2^k d^{k/2}\sup_{w\in\mathcal{W}}\mathbb{E}\left[\frac{1}{d}\sum_{j=1}^d|\langle\nabla f(w,x)-\nabla F(w),e_j\rangle|^k\right],$$
‡ The work of [137] assumes that $L\lesssim\gamma^{1/k}=1$. On the other hand, [253] assumes that $F$ is $\beta$-smooth and $\nabla F(w^*)=0$ for some $w^*\in\mathcal{W}$, which implies $L\le 2\beta D$.
where we used convexity of the function $\phi(y)=y^{k/2}$ for all $y\ge 0$, $k\ge 2$, and Jensen’s inequality in the last inequality. Now using linearity of expectation and Assumption 10 gives us

$$r^{(k)} \le 2^k\left(L_F^k + d^{k/2}\gamma\right) \le 2^{k+1}d^{k/2}\gamma,$$

since $L_F^k\le d^{k/2}\gamma$ by hypothesis.
Remark 156. Since Assumption 9 is implied by Assumption 10, the upper bounds that we obtain under Assumption 9 also hold (up to constants) if we grant Assumption 10 instead, with $r\leftrightarrow\sqrt{d}\gamma^{1/k}$. Also, in Appendix C.4.4, we will use Lemma 155 to show that our optimal excess risk bounds under Assumption 5 imply asymptotically optimal excess risk bounds under Assumption 10.
C.3 Correcting the Errors in the Strongly Convex Upper Bounds Claimed in [137, 253]
While the ICML 2022 paper [137, Theorem 5.6] claimed an upper bound for smooth strongly convex losses that is tight up to a factor of $\widetilde{O}(\kappa_f^2)$, where $\kappa_f=\beta_f/\mu$ is the uniform condition number of $f(\cdot,x)$ over all $x\in\mathcal{X}$, we identify an issue with their proof that invalidates their result. A similar issue appears in the proof of [253, Theorems 5 and 7], which [137] built upon. We then show how to salvage a correct upper bound within the framework of [137], albeit at the cost of an additional factor of $\kappa_f$.§
The proof of [137, Theorem 5.6] relies on [137, Theorem 3.2]. The proof of [137, Theorem 3.2], in turn, bounds $\mathbb{E}\|w_T-w^*\|\le\frac{(\lambda+L)(M+1)G}{\lambda L}$ in the notation of [137], where $L$ is the smoothness parameter, $\lambda$ is the strong convexity parameter (so $L\ge\lambda$), and $M$ is the diameter of $\mathcal{W}$. Then, it is incorrectly deduced that $\mathbb{E}[\|w_T-w^*\|^2]\le\big(\frac{(\lambda+L)(M+1)G}{\lambda L}\big)^2$ (final line of the proof). Notice that $\mathbb{E}[\|w_T-w^*\|^2]$ can be much larger than $(\mathbb{E}\|w_T-w^*\|)^2$ in general: for example, if $\|w_T-w^*\|$ has the Pareto distribution with shape parameter $\alpha\in(1,2]$ and scale parameter $1$, then $(\mathbb{E}\|w_T-w^*\|)^2=\big(\frac{\alpha}{\alpha-1}\big)^2\ll\mathbb{E}(\|w_T-w^*\|^2)=\infty$.
§ The corrected result of [137], derived here, is also included in the latest arXiv version of their paper. We communicated with the authors of [137] to obtain this correct version.
As a first attempt to correct this issue, one could use Young’s inequality to instead bound

$$\mathbb{E}[\|w_T-w^*\|^2] \le 2\left(1-\frac{2\lambda L}{(\lambda+L)^2}\right)\mathbb{E}[\|w_{T-1}-w^*\|^2] + \frac{2G^2}{(\lambda+L)^2}$$
$$\le \left[2\left(1-\frac{2\lambda L}{(\lambda+L)^2}\right)\right]^T\|w_0-w^*\|^2 + \frac{2G^2}{(\lambda+L)^2}\sum_{t=0}^{T-1}\left[2\left(1-\frac{2\lambda L}{(\lambda+L)^2}\right)\right]^t,$$

but the geometric series above diverges to $+\infty$ as $T\to\infty$, since $2\big(1-\frac{2\lambda L}{(\lambda+L)^2}\big)\ge 1 \iff (\lambda-L)^2\ge 0$.
Next, we show how to modify the proof of [137, Theorem 5.6] in order to obtain a correct excess risk upper bound of

$$\widetilde{O}\left(\frac{\gamma^{2/k}}{\mu}\, d\left[\frac{(\beta_f/\mu)^3}{n} + \left(\frac{\sqrt{d(\beta_f/\mu)^3}}{\varepsilon n}\right)^{(2k-2)/k}\right]\right) \tag{C.2}$$

(in our notation). This correction was derived in collaboration with the authors of [137], who have also updated the arXiv version of their paper accordingly. By waiting until the very end of the proof of [137, Theorem 3.2] to take expectation, we can derive
$$\|w_t-w^*\| \le \left(1-\frac{\lambda L}{(\lambda+L)^2}\right)\|w_{t-1}-w^*\| + \frac{\|\widetilde{\nabla}F(w_{t-1})-\nabla F(w_{t-1})\|}{\lambda+L} \tag{C.3}$$

for all $t$, where we use their $L=\beta_f$ and $\lambda=\mu$ notation, but our notation $F$ and $\widetilde{\nabla}F$ for the population loss and its biased noisy gradient estimate (instead of their $L_D$ notation). By iterating (C.3), we can get

$$\|w_T-w^*\| \le \left(1-\frac{2\lambda L}{(\lambda+L)^2}\right)^T\|w_0-w^*\| + \sum_{t=0}^{T-1}\left(1-\frac{2\lambda L}{(\lambda+L)^2}\right)^t\left[\frac{\|\widetilde{\nabla}F(w_{T-t})-\nabla F(w_{T-t})\|}{\lambda+L}\right].$$
Squaring both sides and using Cauchy-Schwartz, we get

$$\|w_T-w^*\|^2 \le 2\left(1-\frac{2\lambda L}{(\lambda+L)^2}\right)^{2T}\|w_0-w^*\|^2 + T\sum_{t=0}^{T-1}\left(1-\frac{2\lambda L}{(\lambda+L)^2}\right)^{2t}\left[\frac{\|\widetilde{\nabla}F(w_{T-t})-\nabla F(w_{T-t})\|}{\lambda+L}\right]^2.$$
Using $L$-smoothness of $F$ and the assumption made in [137] that $\nabla F(w^\ast) = 0$, and then taking expectation yields
\[
\mathbb{E}F(w_T) - F^\ast \lesssim L\|w_0 - w^\ast\|^2\left(1 - \frac{2\lambda L}{(\lambda + L)^2}\right)^{2T} + \frac{TG^2L}{\lambda}, \tag{C.4}
\]
where $G^2 \ge \mathbb{E}\big[\|\widetilde{\nabla}F(w_{T-t}) - \nabla F(w_{T-t})\|^2\big]$ for all $t$. It is necessary and sufficient to choose $T = \widetilde{\Omega}(L/\lambda)$ to make the first term on the right-hand side of (C.4) less than the second term (up to logarithms). With this choice of $T$, we get
\[
\mathbb{E}F(w_T) - F^\ast = \widetilde{O}\left(G^2\kappa_f^2\right), \tag{C.5}
\]
where $\kappa_f = L/\lambda$. Next, we apply the bound on $G^2$ for the MeanOracle that is used in [137]; this bound is stated in the version of [137, Lemma B.5] that appears in the updated (November 1, 2022) arXiv version of their paper. The bound (for general $\gamma$) is $G^2 = \widetilde{O}\left(\gamma^{2/k}\left[\frac{Td}{n} + d\left(\frac{\sqrt{d}\,T^{3/2}}{\varepsilon n}\right)^{(2k-2)/k}\right]\right)$. Plugging this bound on $G^2$ into (C.5) yields (C.2).
C.4 Details and Proofs for Section 4.3: Optimal Rates for (Strongly) Convex Losses

In order to precisely state (sharper forms of) Theorems 50 and 56, we will need to introduce some notation.
C.4.1 Notation

For a batch of data $X \in \mathcal{X}^m$, we define the $k$-th empirical moment of $f(w,\cdot)$ by
\[
\widehat{r}_m(X)^{(k)} = \sup_{w\in\mathcal{W}}\ \sup_{\{\nabla f(w,x_i)\in\partial_w f(w,x_i)\}}\ \frac{1}{m}\sum_{i=1}^m \|\nabla f(w,x_i)\|^k,
\]
where the supremum is also over all subgradients $\nabla f(w,x_i)\in\partial_w f(w,x_i)$ in case $f$ is not differentiable. For $X\sim\mathcal{P}^m$, we denote the $k$-th expected empirical moment by
\[
\widetilde{e}^{(k)}_m := \mathbb{E}\big[\widehat{r}_m(X)^{(k)}\big]
\]
and let
\[
\widetilde{r}_{k,m} := \big(\widetilde{e}^{(k)}_m\big)^{1/k}.
\]
Note that $\widetilde{r}_{k,1} = \widetilde{r}_k$. Our excess risk upper bounds will depend on a weighted average of the expected empirical moments for different batch sizes $m\in\{1,2,4,8,\cdots,n\}$, with more weight being given to $\widetilde{r}_m$ for large $m$ (which are smaller, by Lemma 157 below): for $n = 2^l$, define
\[
\widetilde{R}_{k,n} := \sqrt{\sum_{i=1}^{l} 2^{-i}\,\widetilde{r}^{\,2}_{k,n_i}},
\]
where $n_i = 2^{-i}n$.
Lemma 157. Under Assumptions 9 and 5, we have $\widetilde{r}^{(k)} = \widetilde{e}^{(k)}_1 \ge \widetilde{e}^{(k)}_2 \ge \widetilde{e}^{(k)}_4 \ge \widetilde{e}^{(k)}_8 \ge \cdots \ge r^{(k)}$. In particular, $\widetilde{R}_{k,n} \le \widetilde{r}_k$.
Proof. Let $l\in\mathbb{N}$, $n = 2^l$ and consider
\[
\widehat{r}_n(X)^{(k)} = \frac{1}{n}\sup_w\left(\sum_{i=1}^{n/2}\|\nabla f(w,x_i)\|^k + \sum_{i=n/2+1}^{n}\|\nabla f(w,x_i)\|^k\right) \le \frac{1}{n}\left(\sup_w\sum_{i=1}^{n/2}\|\nabla f(w,x_i)\|^k + \sup_w\sum_{i=n/2+1}^{n}\|\nabla f(w,x_i)\|^k\right).
\]
Taking expectations over the random draw of $X\sim\mathcal{P}^n$ yields $\widetilde{e}^{(k)}_n \le \widetilde{e}^{(k)}_{n/2}$. Thus, $\widetilde{R}_{k,n}\le\widetilde{r}_k$ by the definition of $\widetilde{R}_{k,n}$.
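The averaging argument in this proof can be checked on a toy instance. The following sketch (an assumed example, not part of the thesis) verifies, draw by draw, that the sup of a batch average is dominated by the average of per-half sups, which is exactly what drives the monotonicity of the expected empirical moments.

```python
import random

# Toy numeric check (assumed example) of the averaging step behind Lemma 157:
# for f(w, x) = (w - x)**2 / 2 on W = [-1, 1], the empirical moment
# r_hat(X) = sup_w (1/m) sum_i |w - x_i|**K of a batch is dominated by the
# average of the per-half empirical moments, so expected empirical moments
# can only decrease as the batch size doubles.
random.seed(0)
K = 4
GRID = [i / 100.0 - 1.0 for i in range(201)]   # grid over W = [-1, 1]

def r_hat(xs):
    return max(sum(abs(w - x) ** K for x in xs) / len(xs) for w in GRID)

for _ in range(100):
    xs = [random.gauss(0.0, 1.0) for _ in range(4)]
    a1 = sum(r_hat([x]) for x in xs) / 4       # avg of batch-size-1 moments
    a2 = (r_hat(xs[:2]) + r_hat(xs[2:])) / 2   # avg of batch-size-2 moments
    a4 = r_hat(xs)                             # batch-size-4 moment
    assert a1 + 1e-12 >= a2 >= a4 - 1e-12      # sup of avg <= avg of sups
print("monotonicity of empirical moments verified on 100 draws")
```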
C.4.2 Localized Noisy Clipped Subgradient Method (Section 4.3.1)

We begin by proving the technical ingredients that will be used in the proof of Theorem 50. First, we will prove a variant of Lemma 49 that bounds the bias and variance of the subgradient estimator in Algorithm 6.
Lemma 158. Let $\widehat{F}_\lambda(w) = \frac{1}{n}\sum_{i=1}^n f(w,x_i) + \frac{\lambda}{2}\|w - w_0\|^2$ be a regularized empirical loss on a closed convex domain $\mathcal{W}$ with $\ell_2$-diameter $D$. Let $\widetilde{\nabla}F_\lambda(w_t) = \nabla\widehat{F}_\lambda(w_t) + b_t + N_t = \frac{1}{n}\sum_{i=1}^n \Pi_C(\nabla f(w,x_i)) + \lambda(w - w_0) + N_t$ be the biased, noisy subgradients of the regularized empirical loss in Algorithm 6, with $N_t\sim N(0,\sigma^2 I_d)$ and $b_t = \frac{1}{n}\sum_{i=1}^n \Pi_C(\nabla f(w_t,x_i)) - \frac{1}{n}\sum_{i=1}^n \nabla f(w_t,x_i)$. Assume $\widehat{r}_n(X)^{(k)} \ge \sup_{w\in\mathcal{W}}\left\{\frac{1}{n}\sum_{i=1}^n\|\nabla f(w,x_i)\|^k\right\}$ for all $\nabla f(w,x_i)\in\partial_w f(w,x_i)$. Then, for any $T\ge 1$, we have:
\[
\widehat{B} := \sup_{t\in[T]}\|b_t\| \le \frac{\widehat{r}_n(X)^{(k)}}{(k-1)C^{k-1}} \qquad\text{and}\qquad \widehat{\Sigma}^2 := \sup_{t\in[T]}\mathbb{E}\|N_t\|^2 = d\sigma^2.
\]
Proof. Fix any $t$. We have
\[
\|b_t\| = \left\|\frac{1}{n}\sum_{i=1}^n \Pi_C(\nabla f(w_t,x_i)) - \frac{1}{n}\sum_{i=1}^n \nabla f(w_t,x_i)\right\| \le \frac{1}{(k-1)C^{k-1}}\left[\frac{1}{n}\sum_{i=1}^n \|\nabla f(w_t,x_i)\|^k\right], \tag{C.6}
\]
by Lemma 49 applied with $P$ as the empirical distribution on $X$, and $z_i$ in Lemma 49 corresponding to $\nabla f(w_t,x_i)$ in (C.6). Taking the supremum over $t$ of both sides of (C.6) and recalling the definition of $\widehat{r}_n(X)^{(k)}$ proves the bias bound. The noise variance bound is immediate from the distribution of $N_t$.
Using Lemma 158, we can obtain the following convergence guarantee for Algorithm 6:

Lemma 159 (Re-statement of Lemma 51). Fix $X\in\mathcal{X}^n$ and let $\widehat{F}_\lambda(w) = \frac{1}{n}\sum_{i=1}^n f(w,x_i) + \frac{\lambda}{2}\|w - w_0\|^2$ for $w_0\in\mathcal{W}$, where $\mathcal{W}$ is a closed convex domain with diameter $D$. Assume $f(\cdot,x)$ is convex and $\widehat{r}_n(X)^{(k)} \ge \sup_{w\in\mathcal{W}}\left\{\frac{1}{n}\sum_{i=1}^n\|\nabla f(w,x_i)\|^k\right\}$ for all $\nabla f(w,x_i)\in\partial_w f(w,x_i)$. Denote $\widehat{r}_n(X) = \big[\widehat{r}_n(X)^{(k)}\big]^{1/k}$ and $\hat{w} = \operatorname{argmin}_{w\in\mathcal{W}}\widehat{F}_\lambda(w)$. Let $\eta \le \frac{2}{\lambda}$. Then, the output of Algorithm 6 satisfies
\[
\mathbb{E}\|w_T - \hat{w}\|^2 \le \exp\left(-\frac{\lambda\eta T}{2}\right)\|w_0 - \hat{w}\|^2 + \frac{8\eta}{\lambda}\left(\widehat{r}_n(X)^2 + \lambda^2D^2 + d\sigma^2\right) + \frac{20}{\lambda^2}\left(\frac{\widehat{r}_n(X)^{(k)}}{(k-1)C^{k-1}}\right)^2,
\]
where $\sigma^2 = \frac{4C^2T}{n^2\varepsilon^2}$.
Proof. We use the notation of Lemma 158 and write $\widetilde{\nabla}F_\lambda(w_t) = \nabla\widehat{F}_\lambda(w_t) + b_t + N_t = \frac{1}{n}\sum_{i=1}^n \Pi_C(\nabla f(w,x_i)) + \lambda(w - w_0) + N_t$ as the biased, noisy subgradients of the regularized empirical loss in Algorithm 6, with $N_t\sim N(0,\sigma^2 I_d)$ and $b_t = \frac{1}{n}\sum_{i=1}^n \Pi_C(\nabla f(w_t,x_i)) - \frac{1}{n}\sum_{i=1}^n \nabla f(w_t,x_i)$. Denote $y_{t+1} = w_t - \eta\widetilde{\nabla}F_\lambda(w_t)$, so that $w_{t+1} = \Pi_{\mathcal{W}}(y_{t+1})$. For now, condition on the randomness of the algorithm (noise). By strong convexity, we have
\[
\begin{aligned}
\widehat{F}_\lambda(w_t) - \widehat{F}_\lambda(\hat{w}) &\le \langle\nabla\widehat{F}_\lambda(w_t), w_t - \hat{w}\rangle - \frac{\lambda}{2}\|w_t - \hat{w}\|^2 \\
&= \langle\widetilde{\nabla}F_\lambda(w_t), w_t - \hat{w}\rangle - \frac{\lambda}{2}\|w_t - \hat{w}\|^2 + \langle\nabla\widehat{F}_\lambda(w_t) - \widetilde{\nabla}F_\lambda(w_t), w_t - \hat{w}\rangle \\
&= \frac{1}{2\eta}\left(\|w_t - \hat{w}\|^2 + \|w_t - y_{t+1}\|^2 - \|y_{t+1} - \hat{w}\|^2\right) - \frac{\lambda}{2}\|w_t - \hat{w}\|^2 + \langle\nabla\widehat{F}_\lambda(w_t) - \widetilde{\nabla}F_\lambda(w_t), w_t - \hat{w}\rangle \\
&= \frac{1}{2\eta}\left(\|w_t - \hat{w}\|^2(1 - \lambda\eta) - \|y_{t+1} - \hat{w}\|^2\right) + \frac{\eta}{2}\|\widetilde{\nabla}F_\lambda(w_t)\|^2 + \langle\nabla\widehat{F}_\lambda(w_t) - \widetilde{\nabla}F_\lambda(w_t), w_t - \hat{w}\rangle \\
&\le \frac{1}{2\eta}\left(\|w_t - \hat{w}\|^2(1 - \lambda\eta) - \|w_{t+1} - \hat{w}\|^2\right) + \frac{\eta}{2}\|\widetilde{\nabla}F_\lambda(w_t)\|^2 - \langle b_t + N_t, w_t - \hat{w}\rangle,
\end{aligned}
\]
where we used non-expansiveness of projection and the definition of $\widetilde{\nabla}F_\lambda(w_t)$ in the last line. Now, re-arranging this inequality and taking expectation, we get
\[
\begin{aligned}
\mathbb{E}[\|w_{t+1} - \hat{w}\|^2] &\le -2\eta\,\mathbb{E}[\widehat{F}_\lambda(w_t) - \widehat{F}_\lambda(\hat{w})] + \mathbb{E}\|w_t - \hat{w}\|^2(1 - \lambda\eta) + \eta^2\,\mathbb{E}\|\widetilde{\nabla}F_\lambda(w_t)\|^2 - 2\eta\,\mathbb{E}\langle b_t + N_t, w_t - \hat{w}\rangle \\
&\le \mathbb{E}\|w_t - \hat{w}\|^2(1 - \lambda\eta) + \eta^2\,\mathbb{E}\|\widetilde{\nabla}F_\lambda(w_t)\|^2 - 2\eta\,\mathbb{E}\langle b_t, w_t - \hat{w}\rangle,
\end{aligned}
\]
by optimality of $\hat{w}$ and the assumption that the noise $N_t$ is independent of $w_t - \hat{w}$ and zero mean. Also,
\[
\mathbb{E}\|\widetilde{\nabla}F_\lambda(w_t)\|^2 \le 2\left(\mathbb{E}\|\nabla\widehat{F}_\lambda(w_t)\|^2 + \|b_t\|^2 + \mathbb{E}\|N_t\|^2\right) \le 2\left(2\widehat{r}_n(X)^2 + 2\lambda^2D^2 + \widehat{B}^2 + \widehat{\Sigma}^2\right),
\]
where $\widehat{B} := \sup_{t\in[T]}\|b_t\| \le \frac{\widehat{r}_n(X)^{(k)}}{(k-1)C^{k-1}}$ and $\widehat{\Sigma}^2 := \sup_{t\in[T]}\mathbb{E}\|N_t\|^2 = d\sigma^2$, by Lemma 158. We also used Young's and Jensen's inequalities and the fact that $\mathbb{E}N_t = 0$. Further,
\[
|\mathbb{E}\langle b_t, w_t - \hat{w}\rangle| \le \frac{\widehat{B}^2}{\lambda} + \frac{\lambda}{4}\mathbb{E}\|w_t - \hat{w}\|^2,
\]
by Young's inequality. Combining these pieces yields
\[
\mathbb{E}\|w_{t+1} - \hat{w}\|^2 \le \left(1 - \frac{\lambda\eta}{2}\right)\mathbb{E}\|w_t - \hat{w}\|^2 + 4\eta^2\left(\widehat{r}_n(X)^2 + \lambda^2D^2 + \widehat{B}^2 + \widehat{\Sigma}^2\right) + \frac{2\eta\widehat{B}^2}{\lambda}. \tag{C.7}
\]
Iterating (C.7) gives us
\[
\begin{aligned}
\mathbb{E}\|w_T - \hat{w}\|^2 &\le \left(1 - \frac{\lambda\eta}{2}\right)^T\|w_0 - \hat{w}\|^2 + \left[4\eta^2\left(\widehat{r}_n(X)^2 + \lambda^2D^2 + \widehat{B}^2 + \widehat{\Sigma}^2\right) + \frac{2\eta\widehat{B}^2}{\lambda}\right]\sum_{t=0}^{T-1}\left(1 - \frac{\lambda\eta}{2}\right)^t \\
&\le \left(1 - \frac{\lambda\eta}{2}\right)^T\|w_0 - \hat{w}\|^2 + \left[4\eta^2\left(\widehat{r}_n(X)^2 + \lambda^2D^2 + \widehat{B}^2 + \widehat{\Sigma}^2\right) + \frac{2\eta\widehat{B}^2}{\lambda}\right]\left(\frac{2}{\lambda\eta}\right) \\
&= \left(1 - \frac{\lambda\eta}{2}\right)^T\|w_0 - \hat{w}\|^2 + \frac{8\eta}{\lambda}\left(\widehat{r}_n(X)^2 + \lambda^2D^2 + \widehat{B}^2 + \widehat{\Sigma}^2\right) + \frac{4\widehat{B}^2}{\lambda^2} \\
&\le \exp\left(-\frac{\lambda\eta T}{2}\right)\|w_0 - \hat{w}\|^2 + \frac{8\eta}{\lambda}\left(\widehat{r}_n(X)^2 + \lambda^2D^2 + \widehat{\Sigma}^2\right) + \frac{20\widehat{B}^2}{\lambda^2},
\end{aligned}
\]
since $\eta\le\frac{2}{\lambda}$. Plugging in the bounds on $\widehat{B}$ and $\widehat{\Sigma}$ from Lemma 158 completes the proof.
Proposition 160 (Precise statement of Proposition 53). Let $f(\cdot,x)$ be convex for all $x$ and grant Assumption 9 for $k=2$. Suppose $\mathcal{A}:\mathcal{X}^n\to\mathcal{W}$ is $\alpha$-on-average model stable. Then for any $\zeta>0$, we have
\[
\mathbb{E}[F(\mathcal{A}(X)) - \widehat{F}_X(\mathcal{A}(X))] \le \frac{r^{(2)}}{2\zeta} + \frac{\zeta}{2}\alpha^2.
\]

Proof. Let $X, X', X^i$ be constructed as in Definition 52. We may write $\mathbb{E}[F(\mathcal{A}(X)) - \widehat{F}_X(\mathcal{A}(X))] = \mathbb{E}\big[\frac{1}{n}\sum_{i=1}^n f(\mathcal{A}(X^i),x_i) - f(\mathcal{A}(X),x_i)\big]$, by symmetry and independence of $x_i$ and $\mathcal{A}(X^i)$ (c.f. [162, Equation B.2]). Then by convexity, we have
\[
\begin{aligned}
\mathbb{E}[F(\mathcal{A}(X)) - \widehat{F}_X(\mathcal{A}(X))] &\le \frac{1}{n}\sum_{i=1}^n \mathbb{E}[\langle\mathcal{A}(X^i) - \mathcal{A}(X), \nabla f(\mathcal{A}(X^i),x_i)\rangle] \\
&\le \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\frac{\zeta}{2}\|\mathcal{A}(X^i) - \mathcal{A}(X)\|^2 + \frac{1}{2\zeta}\|\nabla f(\mathcal{A}(X^i),x_i)\|^2\right].
\end{aligned}
\]
Now, since $\mathcal{A}(X^i)$ is independent of $x_i$, we have:
\[
\begin{aligned}
\mathbb{E}\|\nabla f(\mathcal{A}(X^i),x_i)\|^2 &= \mathbb{E}\big[\mathbb{E}[\|\nabla f(\mathcal{A}(X^i),x_i)\|^2 \mid \mathcal{A}(X^i)]\big] \\
&\le \sup_{w\in\mathcal{W}}\mathbb{E}[\|\nabla f(\mathcal{A}(X^i),x_i)\|^2 \mid \mathcal{A}(X^i) = w] \\
&= \sup_{w\in\mathcal{W}}\mathbb{E}[\|\nabla f(w,x_i)\|^2] \\
&\le r^{(2)}.
\end{aligned}
\]
Combining the above inequalities and recalling Definition 52 yields the result.
To prove our excess risk bound for regularized ERM (i.e. Proposition 54), we require the following bound on the generalization error of ERM with strongly convex loss:

Proposition 161. Let $f(\cdot,x)$ be $\lambda$-strongly convex, and grant Assumption 9. Let $\mathcal{A}(X) := \operatorname{argmin}_{w\in\mathcal{W}}\widehat{F}_X(w)$ be the ERM algorithm. Then,
\[
\mathbb{E}[F(\mathcal{A}(X)) - \widehat{F}_X(\mathcal{A}(X))] \le \frac{2r^{(2)}}{\lambda n}.
\]

Proof. We first bound the stability of ERM and then use Proposition 53 to get a bound on the generalization error. The beginning of the proof is similar to the proof of [162, Proposition D.6]: Let $X, X', X^i$ be constructed as in Definition 52. By strong convexity of $\widehat{F}_{X^i}$ and optimality of $\mathcal{A}(X^i)$, we have
\[
\frac{\lambda}{2}\|\mathcal{A}(X) - \mathcal{A}(X^i)\|^2 \le \widehat{F}_{X^i}(\mathcal{A}(X)) - \widehat{F}_{X^i}(\mathcal{A}(X^i)),
\]
which implies
\[
\frac{1}{n}\sum_{i=1}^n\|\mathcal{A}(X) - \mathcal{A}(X^i)\|^2 \le \frac{2}{\lambda n}\sum_{i=1}^n\left[\widehat{F}_{X^i}(\mathcal{A}(X)) - \widehat{F}_{X^i}(\mathcal{A}(X^i))\right]. \tag{C.8}
\]
Now, for any $w\in\mathcal{W}$,
\[
n\sum_{i=1}^n \widehat{F}_{X^i}(w) = \sum_{i=1}^n\Big[f(w,x'_i) + \sum_{j\ne i}f(w,x_j)\Big] = (n-1)n\,\widehat{F}_X(w) + n\,\widehat{F}_{X'}(w).
\]
Hence
\[
\frac{1}{n}\mathbb{E}\left[\sum_{i=1}^n\widehat{F}_{X^i}(\mathcal{A}(X))\right] = \left(\frac{n-1}{n}\right)\mathbb{E}\widehat{F}_X(\mathcal{A}(X)) + \frac{1}{n}\mathbb{E}\widehat{F}_{X'}(\mathcal{A}(X)) = \left(\frac{n-1}{n}\right)\frac{1}{n}\mathbb{E}\left[\sum_{i=1}^n\widehat{F}_{X^i}(\mathcal{A}(X^i))\right] + \frac{1}{n}\mathbb{E}F(\mathcal{A}(X)),
\]
by symmetry and independence of $\mathcal{A}(X)$ and $X'$. Re-arranging the above equality and using symmetry yields
\[
\frac{1}{n}\mathbb{E}\left[\sum_{i=1}^n \widehat{F}_{X^i}(\mathcal{A}(X)) - \widehat{F}_{X^i}(\mathcal{A}(X^i))\right] = \frac{1}{n}\mathbb{E}\left[F(\mathcal{A}(X)) - \widehat{F}_X(\mathcal{A}(X))\right]. \tag{C.9}
\]
Combining (C.8) with (C.9) shows that ERM is $\alpha$-on-average model stable for
\[
\alpha^2 = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n\|\mathcal{A}(X) - \mathcal{A}(X^i)\|^2\right] \le \frac{2}{\lambda n}\mathbb{E}\left[F(\mathcal{A}(X)) - \widehat{F}_X(\mathcal{A}(X))\right]. \tag{C.10}
\]
The rest of the proof is where we depart from the analysis of [162] (which required smoothness of $f(\cdot,x)$): Bounding the right-hand side of (C.10) by Proposition 53 yields
\[
\alpha^2 \le \frac{2}{\lambda n}\left(\frac{r^{(2)}}{2\zeta} + \frac{\zeta}{2}\alpha^2\right)
\]
for any $\zeta>0$. Choosing $\zeta = \frac{\lambda n}{2}$, we obtain
\[
\frac{\alpha^2}{2} \le \frac{r^{(2)}}{\lambda n\zeta} = \frac{2r^{(2)}}{\lambda^2n^2},
\]
and $\alpha^2 \le \frac{4r^{(2)}}{\lambda^2n^2}$. Applying Proposition 53 again yields (for any $\zeta'>0$)
\[
\mathbb{E}[F(\mathcal{A}(X)) - \widehat{F}_X(\mathcal{A}(X))] \le \frac{r^{(2)}}{2\zeta'} + \frac{\zeta'}{2}\left(\frac{4r^{(2)}}{\lambda^2n^2}\right) \le \frac{2r^{(2)}}{\lambda n},
\]
by the choice $\zeta' = \frac{\lambda n}{2}$.
Proposition 162 (Precise statement of Proposition 54). Let $f(\cdot,x)$ be convex, $w_{i-1}, y\in\mathcal{W}$, and $\hat{w}_i := \operatorname{argmin}_{w\in\mathcal{W}}\widehat{F}_i(w)$, where $\widehat{F}_i(w) := \frac{1}{n_i}\sum_{j\in\mathcal{B}_i} f(w,x_j) + \frac{\lambda_i}{2}\|w - w_{i-1}\|^2$ (c.f. line 6 of Algorithm 7). Then,
\[
\mathbb{E}[F(\hat{w}_i)] - F(y) \le \frac{2r^{(2)}}{\lambda_in_i} + \frac{\lambda_i}{2}\|y - w_{i-1}\|^2,
\]
where the expectation is over both the random draws of $X$ from $\mathcal{P}$ and $\mathcal{B}_i$ from $X$.

Proof. Denote the regularized population loss by $G_i(w) := \mathbb{E}[\widehat{F}_i(w)] = F(w) + \frac{\lambda_i}{2}\|w - w_{i-1}\|^2$. By Proposition 161, we have
\[
\mathbb{E}[G_i(\hat{w}_i) - \widehat{F}_i(\hat{w}_i)] \le \frac{2r^{(2)}}{\lambda_in_i}.
\]
Thus,
\[
\frac{\lambda_i}{2}\mathbb{E}\|\hat{w}_i - w_{i-1}\|^2 + \mathbb{E}F(\hat{w}_i) = \mathbb{E}G_i(\hat{w}_i) \le \frac{2r^{(2)}}{\lambda_in_i} + \mathbb{E}[\widehat{F}_i(\hat{w}_i)] \le \frac{2r^{(2)}}{\lambda_in_i} + \frac{\lambda_i}{2}\|y - w_{i-1}\|^2 + F(y), \tag{C.11}
\]
since $\mathbb{E}[\widehat{F}_i(\hat{w}_i)] = \mathbb{E}[\min_{w\in\mathcal{W}}\widehat{F}_i(w)] \le \min_{w\in\mathcal{W}}\mathbb{E}[\widehat{F}_i(w)] = \min_{w\in\mathcal{W}}G_i(w) \le \frac{\lambda_i}{2}\|y - w_{i-1}\|^2 + F(y)$. Subtracting $F(y)$ from both sides of (C.11) completes the proof.
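The on-average model stability driving Propositions 161 and 162 can be illustrated on a scalar example. The following sketch (an assumed instance, not from the thesis) uses the $1$-strongly convex loss $f(w,x) = (w-x)^2/2$, for which ERM is the sample mean, and checks the stability bound numerically.

```python
import random

# Concrete instance (assumed example) of the stability bound (C.10): for
# f(w, x) = (w - x)**2 / 2 on the real line (lambda = 1), ERM is the sample
# mean, so swapping sample i moves the minimizer by (x_i' - x_i)/n, and
# alpha^2 = (1/n) sum_i |A(X) - A(X^i)|^2 = (1/n**3) sum_i (x_i - x_i')**2.
random.seed(1)
n = 500
X = [random.gauss(0, 1) for _ in range(n)]
Xp = [random.gauss(0, 1) for _ in range(n)]   # replacement samples

alpha_sq = sum(((Xp[i] - X[i]) / n) ** 2 for i in range(n)) / n
# With the second moment r2 >= E(w - x)^2 >= 1 here, the bound
# alpha^2 <= 4 * r2 / n**2 derived in the proof should hold comfortably:
r2 = 1.0
assert alpha_sq <= 4 * r2 / n ** 2
print(alpha_sq, 4 * r2 / n ** 2)
```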
We are ready to state and prove the precise form of Theorem 50, using the notation of Section C.4.1:

Theorem 163 (Precise Statement of Theorem 50). Grant Assumption 5. Let $f(\cdot,x)$ be convex and $L$-Lipschitz for all $x$, with $L<\infty$, and let $\varepsilon\le\sqrt{d}$. Then, there are algorithmic parameters such that Algorithm 7 is $\frac{\varepsilon^2}{2}$-zCDP, and has excess risk
\[
\mathbb{E}F(w_l) - F^\ast \lesssim \widetilde{R}_{2k,n}D\left(\frac{1}{\sqrt{n}} + \left(\frac{\sqrt{d\ln(n)}}{\varepsilon n}\right)^{\frac{k-1}{k}}\right).
\]
Moreover, this excess risk is attained in $\widetilde{O}\big(n^2L^2/\widetilde{R}^2_{2k,n}\big)$ subgradient evaluations.
Proof. We choose $\sigma_i^2 = \frac{4C_i^2T_i}{n_i^2\varepsilon^2}$ for $C_i$ and $T_i$ to be determined exactly later. Note that for $\lambda_i$ and $\eta_i$ defined in Algorithm 7, we have $\eta_i \le \frac{2}{\lambda_i}$ for all $i\in[l]$.

Privacy: Since the batches $\{\mathcal{B}_i\}_{i=1}^l$ are disjoint, it suffices (by parallel composition [191]) to show that $w_i$ (produced by $T_i$ iterations of Algorithm 6 in line 7 of Algorithm 7) is $\frac{\varepsilon^2}{2}$-zCDP for all $i\in[l]$. With clip threshold $C_i$ and batch size $n_i$, the $\ell_2$ sensitivity of the clipped subgradient update is bounded by
\[
\Delta = \sup_{w,\,X\sim X'}\frac{1}{n_i}\Big\|\sum_{j=1}^{n_i}\Pi_{C_i}(\nabla f(w,x_j)) - \Pi_{C_i}(\nabla f(w,x'_j))\Big\| = \frac{1}{n_i}\sup_{w,x,x'}\|\Pi_{C_i}(\nabla f(w,x)) - \Pi_{C_i}(\nabla f(w,x'))\| \le \frac{2C_i}{n_i}.
\]
(Note that the terms arising from regularization cancel out.) Thus, by Proposition 8, conditional on the previous updates $w_{1:i}$, the $(i+1)$-st update in line 5 of Algorithm 6 satisfies $\frac{\varepsilon^2}{2T_i}$-zCDP. Hence, Lemma 10 implies that $w_i$ (in line 7 of Algorithm 7) is $\frac{\varepsilon^2}{2}$-zCDP.
Excess risk: First, our choice of $D_i$ ensures that $\hat{w}_i\in\mathcal{W}_i$, since
\[
\widehat{F}_i(\hat{w}_i) = \frac{1}{n_i}\sum_{j\in\mathcal{B}_i} f(\hat{w}_i,x_j) + \frac{\lambda_i}{2}\|\hat{w}_i - w_{i-1}\|^2 \le \widehat{F}_i(w_{i-1}) = \frac{1}{n_i}\sum_{j\in\mathcal{B}_i} f(w_{i-1},x_j)
\]
\[
\implies \frac{\lambda_i}{2}\|\hat{w}_i - w_{i-1}\|^2 \le L\|\hat{w}_i - w_{i-1}\| \implies \|\hat{w}_i - w_{i-1}\| \le \frac{2L}{\lambda_i} = D_i,
\]
by definition of $\hat{w}_i$ and $L$-Lipschitz continuity of $f(\cdot,x_j)$ for all $j$. Then by Lemma 51, we have
\[
\mathbb{E}\|w_i - \hat{w}_i\|^2 \le \exp\left(-\frac{\lambda_i\eta_iT_i}{2}\right)\|w_{i-1} - \hat{w}_i\|^2 + \frac{8\eta_i}{\lambda_i}\left(\widehat{r}_{n_i}(\mathcal{B}_i)^{(2)} + \lambda_i^2D_i^2 + d\sigma_i^2\right) + \frac{20}{\lambda_i^2}\left(\frac{\widehat{r}_{n_i}(\mathcal{B}_i)^{(k)}}{(k-1)C_i^{k-1}}\right)^2,
\]
conditional on $w_{i-1}$ and the draws of $X\sim\mathcal{P}^n$ and $\mathcal{B}_i\sim X^{n_i}$. Taking expectation over the random sampling yields
\[
\mathbb{E}\|w_i - \hat{w}_i\|^2 \le \exp\left(-\frac{\lambda_i\eta_iT_i}{2}\right)\|w_{i-1} - \hat{w}_i\|^2 + \frac{8\eta_i}{\lambda_i}\left(\widetilde{e}^{(2)}_{n_i} + \lambda_i^2D_i^2 + d\sigma_i^2\right) + \frac{20}{\lambda_i^2}\,\frac{\widetilde{e}^{(2k)}_{n_i}}{C_i^{2k-2}(k-1)^2},
\]
where $d\sigma_i^2 \le \frac{4dC_i^2T_i}{n_i^2\varepsilon^2}$. Now, since $L<\infty$, there exists $p\ge 1$ such that $L \lesssim n^{p/2}\widetilde{R}_{2k,n}\left(\frac{1}{\sqrt{n}} + \left(\frac{\sqrt{d\ln(n)}}{\varepsilon n}\right)^{\frac{k-1}{k}}\right)$. Choosing $T_i = \frac{1}{\lambda_i\eta_i}\ln\left(\frac{D^2\lambda_i}{d\sigma_i^2\eta_i}\right) \lesssim n_i^p\ln(n)$ and $\eta$ to be determined later (polynomial in $n$), we get
\[
\mathbb{E}\|w_i - \hat{w}_i\|^2 \lesssim \frac{\eta_i}{\lambda_i}\left(L^2 + d\sigma_i^2\right) + \frac{\widetilde{e}^{(2k)}_{n_i}}{\lambda_i^2C_i^{2k-2}} \lesssim \eta_i^2n_i^p(L^2 + d\sigma_i^2) + \frac{\eta_i^2n_i^{2p}\,\widetilde{e}^{(2k)}_{n_i}}{C_i^{2k-2}} \lesssim \frac{\eta^2n^p}{16^i\,2^{ip}}\left(L^2 + \frac{dC_i^2T_i}{\varepsilon^2n_i^2} + \frac{n^p\,\widetilde{e}^{(2k)}_{n_i}}{C_i^{2k-2}\,2^{pi}}\right). \tag{C.12}
\]
Note that under Assumption 5, $F$ is $L_F$-Lipschitz, where $L_F = \sup_{w\in\mathcal{W}}\|\nabla F(w)\| \le r$ by Jensen's inequality. Now, following the strategy used in the proof of [95, Theorem 4.4], we write
\[
\mathbb{E}F(w_l) - F(w^\ast) = \mathbb{E}[F(w_l) - F(\hat{w}_l)] + \sum_{i=1}^l \mathbb{E}[F(\hat{w}_i) - F(\hat{w}_{i-1})],
\]
where $\hat{w}_0 := w^\ast$. Using (C.12), the first term can be bounded as follows:
\[
\mathbb{E}[F(w_l) - F(\hat{w}_l)] \le L_F\sqrt{\mathbb{E}\|w_l - \hat{w}_l\|^2} \lesssim L_F\sqrt{\eta_l^2\left(L^2 + \frac{C_l^2d}{\varepsilon^2} + \frac{\widetilde{e}^{(2k)}_{n_l}}{C_l^{2k-2}}\right)} \lesssim L_F\left[\frac{\eta}{n^2}\left(L + \frac{\sqrt{d}\,C_l}{\varepsilon} + \frac{\widetilde{r}^{\,k}_{2k}}{C_l^{k-1}}\right)\right] \lesssim L_F\left[\frac{\eta}{n^2}\left(L + \widetilde{r}_{2k}\left(\frac{\sqrt{d}}{\varepsilon}\right)^{(k-1)/k}\right)\right]
\]
if we choose $C_l = \widetilde{r}_{2k}\left(\frac{\varepsilon}{\sqrt{d}}\right)^{1/k}$. Therefore,
\[
\mathbb{E}[F(w_l) - F(\hat{w}_l)] \lesssim \widetilde{R}_{2k,n}D\left(\frac{1}{\sqrt{n}} + \left(\frac{\sqrt{d\ln(n)}}{\varepsilon n}\right)^{\frac{k-1}{k}}\right), \tag{C.13}
\]
if we choose
\[
\eta \lesssim \frac{\widetilde{R}_{2k,n}Dn^2}{L_F}\min\left(\frac{1}{L},\ \frac{1}{\widetilde{r}_{2k}}\left(\frac{\varepsilon}{\sqrt{d}}\right)^{(k-1)/k}\right)\left(\frac{1}{\sqrt{n}} + \left(\frac{\sqrt{d\ln(n)}}{\varepsilon n}\right)^{\frac{k-1}{k}}\right) =: \eta_A.
\]
Next, Proposition 54 implies
\[
\mathbb{E}[F(\hat{w}_i) - F(\hat{w}_{i-1})] \le \frac{2r^2}{\lambda_in_i} + \frac{\lambda_i}{2}\mathbb{E}\|\hat{w}_{i-1} - w_{i-1}\|^2
\]
for all $i\in[l]$. Hence
\[
\begin{aligned}
\sum_{i=1}^l \mathbb{E}[F(\hat{w}_i) - F(\hat{w}_{i-1})] &\lesssim \frac{r^2}{\lambda_1n_1} + \lambda_1D^2 + \sum_{i=2}^l\left[\frac{r^2}{\lambda_in_i} + \lambda_i\eta_i^2\left(n_i^p(L^2 + d\sigma_i^2) + \frac{n_i^{2p}\,\widetilde{e}^{(2k)}_{n_i}}{C_i^{2k-2}}\right)\right] \\
&\lesssim r^2\eta n^{p-1} + \frac{D^2}{\eta n^p} + \sum_{i=2}^l r^2\eta_in_i^{p-1} + \sum_{i=2}^l \frac{\eta_i}{n_i^p}\left(n_i^p\left(L^2 + \frac{dC_i^2T_i}{\varepsilon^2n_i^2}\right) + \frac{n_i^{2p}\,\widetilde{e}^{(2k)}_{n_i}}{C_i^{2k-2}}\right) \\
&\lesssim r^2\eta n^{p-1} + \frac{D^2}{\eta n^p} + \sum_{i=2}^l \eta_i\left(L^2 + \frac{dC_i^2n_i^p\ln(n)}{\varepsilon^2n_i^2} + \frac{n_i^p\,\widetilde{e}^{(2k)}_{n_i}}{C_i^{2k-2}}\right).
\end{aligned}
\]
Choosing $C_i = \widetilde{r}_{2k,n_i}\left(\frac{\varepsilon n_i}{\sqrt{d\ln(n)}}\right)^{1/k}$ approximately equalizes the two terms above involving $C_i$, and we get
\[
\sum_{i=1}^l \mathbb{E}[F(\hat{w}_i) - F(\hat{w}_{i-1})] \lesssim r^2\eta n^{p-1} + \frac{D^2}{\eta n^p} + \eta L^2 + \eta\sum_{i=2}^l 4^{-i}n_i^p\,\widetilde{r}^{\,2}_{2k,n_i}\left(\frac{d\ln(n)}{\varepsilon^2n_i^2}\right)^{\frac{k-1}{k}} \lesssim \eta\left[r^2n^{p-1} + L^2 + \widetilde{R}^2_{2k,n}\,n^p\left(\frac{d\ln(n)}{\varepsilon^2n^2}\right)^{\frac{k-1}{k}}\right] + \frac{D^2}{\eta n^p}.
\]
Now, choosing
\[
\eta = \min\left(\eta_A,\ \frac{D}{n^{p/2}}\min\left\{\frac{1}{r\,n^{(p-1)/2}},\ \frac{1}{L},\ \frac{1}{\widetilde{R}_{2k,n}\,n^{p/2}}\left(\frac{\varepsilon n}{\sqrt{d\ln(n)}}\right)^{(k-1)/k}\right\}\right)
\]
yields
\[
\sum_{i=1}^l \mathbb{E}[F(\hat{w}_i) - F(\hat{w}_{i-1})] \lesssim \widetilde{R}_{2k,n}D\left(\frac{1}{\sqrt{n}} + \left(\frac{\sqrt{d\ln(n)}}{\varepsilon n}\right)^{\frac{k-1}{k}}\right) + \frac{LD}{n^{p/2}} + \frac{D^2}{\eta_An^p} \lesssim \widetilde{R}_{2k,n}D\left(\frac{1}{\sqrt{n}} + \left(\frac{\sqrt{d\ln(n)}}{\varepsilon n}\right)^{\frac{k-1}{k}}\right),
\]
since $L \lesssim n^{p/2}\widetilde{R}_{2k,n}\left(\frac{1}{\sqrt{n}} + \left(\frac{\sqrt{d\ln(n)}}{\varepsilon n}\right)^{\frac{k-1}{k}}\right)$. Combining the above pieces completes the excess risk proof.
Subgradient complexity: Our choice of T
i
“
r
Θ ´
1
λ i
η i
¯
À n
p
i
lnpnq implies that Algorithm 7 uses
ř
l
i“1
n
i
T
i
Àlnpnqn
p`1
subgradient evaluations. Further,n
p
ďn
ˆ
L
r
R
2k,n
˙
2
.
Remark 164 (Details of Remark 55). If one desires $(\varepsilon,\delta)$-DP or $(\varepsilon,\delta)$-SDP instead of zCDP, then the gradient complexity of Algorithm 7 can be improved to $O(n^{p+\frac{1}{2}}\ln(n))$ by using the Clipped Noisy Stochastic Subgradient Method instead of Algorithm 6 as the subroutine in line 7 of Algorithm 7. Choosing batch sizes $m_i \approx \sqrt{n_i} < n_i$ in this subroutine (and increasing $\sigma_i^2$ by a factor of $O(\log(1/\delta))$) ensures $(\varepsilon,\delta)$-DP by [1, Theorem 1] via privacy amplification by subsampling. The same excess risk bounds hold for any minibatch size $m_i\in[n_i]$, as the proof of Theorem 50 shows.
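A single clipped, noisy subgradient step of the kind analyzed above can be sketched as follows. This is an illustrative stand-in, not the thesis's pseudocode: the function names and toy data are assumed, but the clipping threshold $C$, sensitivity $2C/n$, and noise calibration $\sigma^2 = 4C^2T/(n^2\varepsilon^2)$ mirror the quantities used in the privacy argument.

```python
import math, random

# Illustrative sketch (assumed, not the thesis's Algorithm 6): one clipped,
# noisy subgradient step. Per-example subgradients are clipped to l2 norm C,
# so replacing one of the n examples changes the average by at most 2C/n;
# Gaussian noise with per-coordinate variance 4*C**2*T/(n**2 * eps**2) then
# makes each of the T steps (eps**2/(2T))-zCDP, composing to (eps**2/2)-zCDP.
def clip(g, C):
    nrm = math.sqrt(sum(v * v for v in g))
    return list(g) if nrm <= C else [v * (C / nrm) for v in g]

def noisy_clipped_step(w, grads, C, eta, eps, T, rng):
    n, d = len(grads), len(w)
    avg = [sum(clip(g, C)[j] for g in grads) / n for j in range(d)]
    sigma = math.sqrt(4.0 * C * C * T / (n * n * eps * eps))
    return [w[j] - eta * (avg[j] + rng.gauss(0.0, sigma)) for j in range(d)]

rng = random.Random(0)
grads = [[rng.gauss(0.0, 5.0) for _ in range(3)] for _ in range(100)]
w1 = noisy_clipped_step([0.0, 0.0, 0.0], grads, C=1.0, eta=0.1, eps=1.0, T=50, rng=rng)
print(w1)
```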
C.4.3 The Strongly Convex Case (Section 4.3.2)

Our algorithm is an instantiation of the meta-algorithm described in [95]: Initialize $w_0\in\mathcal{W}$. For $j\in[M] := [\log_2(\log_2(n))]$, let $N_j = 2^{j-2}n/\log_2(n)$, $\mathcal{C}_j = \left\{\sum_{h<j}N_h + 1, \ldots, \sum_{h\le j}N_h\right\}$, and let $w_j$ be the output of Algorithm 7 run with input data $X_j = (x_s)_{s\in\mathcal{C}_j}$, initialized at $w_{j-1}$. Output $w_M$. Assume without loss of generality that $N_j = 2^p$ for some $p\in\mathbb{N}$. Then, with the notation of Section C.4.1, we have the following guarantees:
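The sample-allocation bookkeeping of this meta-algorithm can be checked numerically; the following helper (an assumed sketch, not the thesis's code) verifies that the phase batches are disjoint, fit inside the dataset, and that the final phase uses $n/4$ samples.

```python
import math

# Bookkeeping check (assumed helper) for the meta-algorithm's sample
# allocation: with M = log2(log2(n)) phases and N_j = 2**(j-2) * n / log2(n)
# samples in phase j, the index sets C_j are disjoint, total usage is at most
# n, and the final phase uses N_M = n/4 samples.
def phase_schedule(n):
    L2 = int(math.log2(n))
    M = int(math.log2(L2))
    # N_j = 2^(j-2) * n / log2(n), computed in integer arithmetic
    N = [(2 ** j) * n // (4 * L2) for j in range(1, M + 1)]
    C, start = [], 0
    for Nj in N:
        C.append(range(start, start + Nj))
        start += Nj
    return N, C

n = 2 ** 16
N, C = phase_schedule(n)
assert sum(N) <= n        # the disjoint batches fit inside the dataset
assert N[-1] == n // 4    # last phase uses N_M = n/4 samples
print(N)
```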
Theorem 165 (Precise Statement of Theorem 56). Grant Assumption 5. Let $\varepsilon\le\sqrt{d}$ and $f(\cdot,x)$ be $\mu$-strongly convex and $L$-Lipschitz for all $x\in\mathcal{X}$, with $L<\infty$. Then, there is a polynomial-time $\frac{\varepsilon^2}{2}$-zCDP algorithm $\mathcal{A}$ based on Algorithm 7 with excess risk
\[
\mathbb{E}F(\mathcal{A}(X)) - F^\ast \lesssim \frac{\widetilde{R}^2_{2k,n/4}}{\mu}\left(\frac{1}{n} + \left(\frac{\sqrt{d\ln(n)}}{\varepsilon n}\right)^{\frac{2k-2}{k}}\right).
\]
Proof. Privacy: Since the batches $X_j$ used in each phase of the algorithm are disjoint and Algorithm 7 is $\frac{\varepsilon^2}{2}$-zCDP, privacy of the algorithm follows from parallel composition of DP [191].

Excess risk: Note that $N_j$ samples are used in phase $j$ of the algorithm. For $j\ge 0$, let $D_j^2 = \mathbb{E}[\|w_j - w^\ast\|^2]$ and $\Delta_j = \mathbb{E}[F(w_j) - F^\ast]$. By strong convexity, we have $D_j^2 \le \frac{2\Delta_j}{\mu}$. Also,
\[
\Delta_{j+1} \le a\widetilde{R}_{2k,N_j}D_j\left(\frac{1}{\sqrt{N_j}} + \left(\frac{\sqrt{d\ln(N_j)}}{\varepsilon N_j}\right)^{\frac{k-1}{k}}\right) \le a\widetilde{R}_{2k,N_j}\sqrt{\frac{2\Delta_j}{\mu}}\left(\frac{1}{\sqrt{N_j}} + \left(\frac{\sqrt{d\ln(N_j)}}{\varepsilon N_j}\right)^{\frac{k-1}{k}}\right) \tag{C.14}
\]
for an absolute constant $a\ge 1$, by Theorem 50. Denote $E_j = \left[a\widetilde{R}_{2k,N_j}\sqrt{\frac{2}{\mu}}\left(\frac{1}{\sqrt{N_j}} + \left(\frac{\sqrt{d\ln(N_j)}}{\varepsilon N_j}\right)^{\frac{k-1}{k}}\right)\right]^2$. Then since $N_j = 2N_{j+1}$, we have
\[
\frac{E_j}{E_{j+1}} \le 4\left(\frac{\widetilde{R}_{2k,N_j}}{\widetilde{R}_{2k,N_{j+1}}}\right)^2 \le 8, \tag{C.15}
\]
where the second inequality holds because for any $m = 2^q$, we have:
\[
\widetilde{R}^2_{2k,m/2} = \sum_{i=1}^{\log_2(m)-1}2^{-i}\,\widetilde{r}^{\,2}_{2k,2^{-(i+1)}m} = \sum_{i=2}^{\log_2(m)}2^{-(i-1)}\,\widetilde{r}^{\,2}_{2k,2^{-i}m} = 2\sum_{i=2}^{\log_2(m)}2^{-i}\,\widetilde{r}^{\,2}_{2k,2^{-i}m} \le 2\widetilde{R}^2_{2k,m}.
\]
Now, (C.15) implies that (C.14) can be re-arranged as
\[
\frac{\Delta_{j+1}}{64E_{j+1}} \le \sqrt{\frac{\Delta_j}{64E_j}} \le \left(\frac{\Delta_0}{64E_0}\right)^{1/2^{j+1}}. \tag{C.16}
\]
Further, if $M \ge \log\log\left(\frac{\Delta_0}{E_0}\right)$, then
\[
\frac{\Delta_M}{64E_M} \le \left(\frac{\Delta_0}{64E_0}\right)^{1/2^M} \le \left(\frac{\Delta_0}{64E_0}\right)^{1/\log(\Delta_0/E_0)} \le 2^A\left(\frac{1}{64}\right)^{1/\log(\Delta_0/E_0)} \le 2^A,
\]
for an absolute constant $A>0$, since $\Delta_0 \le \frac{2L^2}{\mu}$ and $E_0 \ge \frac{2L^2}{\mu n}$ implies $\Delta_0/E_0 = \frac{n}{a^2} \le n$ and $\frac{1}{\log(\Delta_0/E_0)} = \frac{1}{\log(n) - 2\log(a)} \le \frac{A}{\log(n)}$ for some $A>0$, so that $\left(\frac{\Delta_0}{E_0}\right)^{1/\log(\Delta_0/E_0)} \le n^{A/\log(n)} \le 2^A$. Therefore,
\[
\Delta_M \le 2^A\cdot 64E_M = O\left(\frac{\widetilde{R}^2_{2k,n/4}}{\mu}\left(\frac{1}{n} + \left(\frac{\sqrt{d\ln(n)}}{\varepsilon n}\right)^{\frac{2k-2}{k}}\right)\right),
\]
since $N_M = n/4$.
C.4.4 Asymptotic Upper Bounds Under Assumptions 9 and 10

We first recall the notion of subexponential distribution:

Definition 166 (Subexponential Distribution). A random variable $Y$ is subexponential if there is an absolute constant $s>0$ such that $\mathbb{P}(|Y|\ge t) \le 2\exp\left(-\frac{t}{s}\right)$ for all $t\ge 0$. For subexponential $Y$, we define $\|Y\|_{\psi_1} := \inf\left\{s>0 : \mathbb{P}(|Y|\ge t) \le 2\exp\left(-\frac{t}{s}\right)\ \forall t\ge 0\right\}$.

Essentially all (heavy-tailed) distributions that arise in practice are subexponential [187].
Now, we establish asymptotically optimal upper bounds for a broad subclass of the problem class considered in [253, 137]: namely, subexponential stochastic subgradient distributions satisfying Assumption 9 or Assumption 10. In Theorem 167 below (which uses the notation of Section C.4.1), we give upper bounds under Assumption 9:

Theorem 167. Let $f(\cdot,x)$ be convex. Assume $\widetilde{r}_{2k} < \infty$ and that $Y_i = \|\nabla f(w,x_i)\|^{2k}$ is subexponential with $E_n \ge \max_{i\in[n]}(\|Y_i\|_{\psi_1})$ for all $w\in\mathcal{W}$ and $\nabla f(w,x_i)\in\partial_wf(w,x_i)$. Assume that for sufficiently large $n$, we have $\sup_{w,x}\|\nabla f(w,x)\|^{2k} \le n^q\,r^{(2k)}$ for some $q\ge 1$ and $\max\left(\frac{E_n}{r^{(2k)}}, \frac{E_n^2}{(r^{(2k)})^2}\right)\ln\left(\frac{3nD\beta}{4r_{2k}}\right) \le \frac{n}{dq}$, where $\|\nabla f(w,x) - \nabla f(w',x)\| \le \beta\|w - w'\|$ for all $w,w'\in\mathcal{W}$, $x\in\mathcal{X}$, and subgradients $\nabla f(w,x)\in\partial_wf(w,x)$. Then, $\lim_{n\to\infty}\widetilde{R}_{2k,n} \le 4r_{2k}$. Further, there exists $N\in\mathbb{N}$ such that for all $n\ge N$, the output of Algorithm 7 satisfies
\[
\mathbb{E}F(w_l) - F^\ast = O\left(r_{2k}D\left(\frac{1}{\sqrt{n}} + \left(\frac{\sqrt{d\ln(n)}}{\varepsilon n}\right)^{\frac{k-1}{k}}\right)\right).
\]
If $f(\cdot,x)$ is $\mu$-strongly convex, then the output of algorithm $\mathcal{A}$ (in Section 4.3.2) satisfies
\[
\mathbb{E}F(\mathcal{A}(X)) - F^\ast = O\left(\frac{r^2_{2k}}{\mu}\left(\frac{1}{n} + \left(\frac{\sqrt{d\ln(n)}}{\varepsilon n}\right)^{\frac{2k-2}{k}}\right)\right).
\]

While a bound on $\sup_{w,x}\|\nabla f(w,x)\|$ is needed in Theorem 167, it can grow as fast as any polynomial in $n$ and only needs to hold for sufficiently large $n$. As $n\to\infty$, this assumption is easily satisfied. Likewise, Theorem 167 depends only logarithmically on the Lipschitz parameter $\beta$ of the subgradients, so the result still holds up to constant factors if, say, $\beta \le n^p(r/D)$ as $n\to\infty$ for some $p\ge 1$. Crucially, our excess risk bounds do not depend on $L$ or $\beta$.

Asymptotically optimal upper bounds for Assumption 10 are an immediate consequence of Lemma 155 combined with Theorem 167. Namely, under Assumption 10, the upper bounds in Theorem 167 hold with $r$ replaced by $\sqrt{d}\gamma^{1/k}$ (by Lemma 155). These upper bounds, and the ones in Theorem 167, are tight up to logarithms for their respective problem classes, by the lower bounds in Appendix C.4.5.
Proof of Theorem 167. Step One: There exists $N\in\mathbb{N}$ such that $\widetilde{r}^{\,2}_{2k,n} \le 16r^2_{2k}$ for all $n\ge N$.

We will first use a covering argument to show that $\widehat{r}_n(X)^{(2k)}$ is upper bounded by $2^{2k+1}r^{(2k)}$ with high probability. For any $\alpha>0$, we may choose an $\alpha$-net with $N_\alpha \le \left(\frac{3D}{2\alpha}\right)^d$ balls centered around points in $\mathcal{W}_\alpha = \{w_1,w_2,\cdots,w_{N_\alpha}\}\subset\mathcal{W}$ such that for any $w\in\mathcal{W}$ there exists $i\in[N_\alpha]$ with $\|w - w_i\|\le\alpha$ (see e.g. [153] for the existence of such $\mathcal{W}_\alpha$). For $w\in\mathcal{W}$, let $\widetilde{w}$ denote the element of $\mathcal{W}_\alpha$ that is closest to $w$, so that $\|w - \widetilde{w}\|\le\alpha$. Now, for any $X\in\mathcal{X}^n$, we have
\[
\begin{aligned}
\widehat{r}_n(X)^{(2k)} &= \sup_w\left\{\frac{1}{n}\sum_{i=1}^n\|\nabla f(w,x_i) - \nabla f(\widetilde{w},x_i) + \nabla f(\widetilde{w},x_i)\|^{2k}\right\} \\
&\le 2^{2k}\sup_w\left\{\frac{1}{n}\sum_{i=1}^n\|\nabla f(w,x_i) - \nabla f(\widetilde{w},x_i)\|^{2k} + \|\nabla f(\widetilde{w},x_i)\|^{2k}\right\} \\
&\le 2^{2k}\left[\beta^{2k}\alpha^{2k} + \frac{1}{n}\max_{j\in[N_\alpha]}\sum_{i=1}^n\|\nabla f(w_j,x_i)\|^{2k}\right],
\end{aligned}
\]
where we used Cauchy–Schwarz and Young's inequality for the first inequality, and the assumption of $\beta$-Lipschitz subgradients plus the definition of $\mathcal{W}_\alpha$ for the second inequality. Further,
\[
\mathbb{P}\left(\frac{2^{2k}}{n}\max_{j\in[N_\alpha]}\sum_{i=1}^n\|\nabla f(w_j,x_i)\|^{2k} \ge 2^{2k+1}r^{(2k)}\right) \le N_\alpha\max_{j\in[N_\alpha]}\mathbb{P}\left(\sum_{i=1}^n\|\nabla f(w_j,x_i)\|^{2k} \ge 2^{2k+1}r^{(2k)}\right) \le N_\alpha\exp\left(-n\min\left(\frac{r^{(2k)}}{E_n}, \frac{(r^{(2k)})^2}{E_n^2}\right)\right),
\]
by a union bound and Bernstein's inequality (see e.g. [251, Corollary 2.8.3]). Choosing $\alpha = \frac{2r_{2k}}{\beta}$ ensures that $\mathbb{P}(2^{2k}\beta^{2k}\alpha^{2k} > 2^{2k+1}r^{(2k)}) = 0$ and hence (by union bound)
\[
\mathbb{P}\left(\widehat{r}_n(X)^{(2k)} \ge 2^{2k+1}r^{(2k)}\right) \le N_\alpha\exp\left(-n\min\left(\frac{r^{(2k)}}{E_n}, \frac{(r^{(2k)})^2}{E_n^2}\right)\right) \le \left(\frac{3D\beta}{4r_{2k}}\right)^d\exp\left(-n\min\left(\frac{r^{(2k)}}{E_n}, \frac{(r^{(2k)})^2}{E_n^2}\right)\right) \le \frac{1}{n^q},
\]
by the assumption on $n$. Next, we use this concentration inequality to derive a bound on $\widetilde{e}^{(2k)}_n$:
\[
\widetilde{e}^{(2k)}_n = \mathbb{E}\left[\widehat{r}_n(X)^{(2k)}\right] \le \mathbb{E}\left[\widehat{r}_n(X)^{(2k)} \,\Big|\, \widehat{r}_n(X)^{(2k)} \ge 2^{2k+1}r^{(2k)}\right]\frac{1}{n^q} + 2^{2k+1}r^{(2k)} \le \frac{\sup_{w,x}\|\nabla f(w,x)\|^{2k}}{n^q} + 2^{2k+1}r^{(2k)} \le (1 + 2^{2k+1})r^{(2k)},
\]
for sufficiently large $n$. Thus, $\widetilde{r}^{\,2}_{2k,n} \le 16r^2_{2k}$ for all sufficiently large $n$. This establishes Step One.
Step Two: $\lim_{n\to\infty}\widetilde{R}_{2k,n} \le 4r_{2k}$.

For all $n = 2^l$, $l,i\in\mathbb{N}$, define $h_n(i) = 2^{-i}\,\widetilde{r}^{\,2}_{2k,2^{-i}n}\,\mathbf{1}_{\{i\in[\log_2(n)]\}}$. Note that $0 \le h_n(i) \le g(i) := 2^{-i}\,\widetilde{r}^{\,2}_{2k}$ for all $n,i$, and that $\sum_{i=1}^\infty g(i) = \widetilde{r}^{\,2}_{2k} < \infty$ (i.e. $g$ is integrable with respect to the counting measure). Furthermore, the limit $\lim_{n\to\infty}h_n(i) = 2^{-i}\lim_{n\to\infty}\widetilde{r}^{\,2}_{2k,2^{-i}n}$ exists since Lemma 157 implies that the sequence $\{\widetilde{r}^{\,2}_{2k,2^{-i}n}\}_{n=1}^\infty$ is monotonic and bounded for every $i\in\mathbb{N}$. Thus, by Lebesgue's dominated convergence theorem, we have
\[
\lim_{n\to\infty}\widetilde{R}^2_{2k,n} = \lim_{n\to\infty}\sum_{i=1}^\infty h_n(i) = \sum_{i=1}^\infty\lim_{n\to\infty}h_n(i) \le \sum_{i=1}^\infty 2^{-i}\lim_{n\to\infty}\widetilde{r}^{\,2}_{2k,2^{-i}n} \le 16\sum_{i=1}^\infty 2^{-i}r^2_{2k} = 16r^2_{2k},
\]
where the last inequality follows from Step One. Therefore, $\lim_{n\to\infty}\widetilde{R}_{2k,n} \le 4r_{2k}$. By Theorem 50 and Theorem 56, this also implies the last two claims in Theorem 167.
C.4.5 Lower Bounds (Section 4.3.3)

In this section, we prove the lower bounds stated in Section 4.3.3, and also provide tight lower bounds under Assumptions 9 and 10.
Theorem 168 (Precise Statement of Theorem 57). Let $k\ge 2$, $D,\gamma,r^{(k)},\widetilde{r}^{(k)} > 0$, $\beta_f\ge 0$, $d\ge 40$, $n > 720^2$, and $\rho\le d$. Then, for any $\rho$-zCDP algorithm $\mathcal{A}$, there exist $\mathcal{W},\mathcal{X}\subset\mathbb{R}^d$ such that $\|w - w'\|\le 2D$ for all $w,w'\in\mathcal{W}$, a $\beta_f$-smooth, linear, convex (in $w$) loss $f:\mathcal{W}\times\mathcal{X}\to\mathbb{R}$, and distributions $\mathcal{P}$ and $\mathcal{D}'$ on $\mathcal{X}$ such that:

1. Assumption 5 holds and if $X'\sim\mathcal{D}'^n$, then
\[
\mathbb{E}F(\mathcal{A}(X')) - F^\ast = \Omega\left(\widetilde{r}_kD\left(\frac{1}{\sqrt{n}} + \min\left\{1, \left(\frac{\sqrt{d}}{\sqrt{\rho}\,n}\right)^{\frac{k-1}{k}}\right\}\right)\right). \tag{C.17}
\]
2. Assumption 9 holds and if $X'\sim\mathcal{D}'^n$, then
\[
\mathbb{E}F(\mathcal{A}(X')) - F^\ast = \Omega\left(r_kD\left(\frac{1}{\sqrt{n}} + \min\left\{1, \left(\frac{\sqrt{d}}{\sqrt{\rho}\,n}\right)^{\frac{k-1}{k}}\right\}\right)\right). \tag{C.18}
\]
3. Assumption 10 holds and if $X\sim\mathcal{P}^n$, then
\[
\mathbb{E}F(\mathcal{A}(X)) - F^\ast = \Omega\left(\gamma^{1/k}D\left(\sqrt{\frac{d}{n}} + \sqrt{d}\min\left\{1, \left(\frac{\sqrt{d}}{\sqrt{\rho}\,n}\right)^{\frac{k-1}{k}}\right\}\right)\right).
\]
Proof. We will prove part 3 first.

3. We begin by proving the result for $\gamma = D = 1$. In this case, it is proved in [137] that
\[
\mathbb{E}F(\mathcal{A}(X)) - F^\ast = \Omega\left(\sqrt{d}\min\left\{1, \left(\frac{\sqrt{d}}{\sqrt{\rho}\,n}\right)^{\frac{k-1}{k}}\right\}\right)
\]
for $f(w,x) = -\langle w,x\rangle$ with $\mathcal{W} = B_2^d(0,1)$ and $\mathcal{X} = \{\pm 1\}^d$, and a distribution satisfying Assumption 10 with $\gamma = 1$. Then $f(\cdot,x)$ is linear, convex, and $\beta$-smooth for all $\beta\ge 0$. We prove the first (non-private) term in the lower bound. By the Gilbert–Varshamov bound (see e.g. [2, Lemma 6]) and the assumption $d\ge 40$, there exists a set $\mathcal{V}\subseteq\{\pm 1\}^d$ with $|\mathcal{V}|\ge 2^{d/20}$ and $d_{\mathrm{Ham}}(\nu,\nu')\ge\frac{d}{8}$ for all $\nu,\nu'\in\mathcal{V}$, $\nu\ne\nu'$. For $\nu\in\mathcal{V}$, define the product distribution $Q_\nu = (Q_{\nu_1},\cdots,Q_{\nu_d})$, where for all $j\in[d]$,
\[
Q_{\nu_j} = \begin{cases} 1 & \text{with probability } \frac{1+\delta_{\nu_j}}{2} \\ -1 & \text{with probability } \frac{1-\delta_{\nu_j}}{2} \end{cases}
\]
for $\delta_{\nu_j}\in(0,1)$ to be chosen later. Then $\mathbb{E}Q_{\nu_j} := \mu_{\nu_j} = \delta_{\nu_j}$, and for any $w\in\mathcal{W}$, $x\sim Q_\nu$, we have
\[
\begin{aligned}
\mathbb{E}|\langle\nabla f(w,x) - \nabla F(w), e_j\rangle|^k &= \mathbb{E}|\langle -x + \mathbb{E}x, e_j\rangle|^k && \text{(C.19)} \\
&= \mathbb{E}|x - \mu_{\nu_j}|^k && \text{(C.20)} \\
&= \frac{1+\delta_{\nu_j}}{2}|1-\delta_{\nu_j}|^k + \frac{1-\delta_{\nu_j}}{2}|1+\delta_{\nu_j}|^k && \text{(C.21)} \\
&\le 1 - \delta^2_{\nu_j} \le 1 && \text{(C.22)}
\end{aligned}
\]
for $\delta_{\nu_j}\in(0,1)$. Now, let $p := \sqrt{d/n}$ and $\delta_{\nu_j} := \frac{p\nu_j}{\sqrt{d}}$. Note that $\mathbb{E}Q_\nu := \mu_\nu = \frac{p\nu}{\sqrt{d}}$ and $w_\nu := \frac{\mu_\nu}{\|\mu_\nu\|} = \frac{\nu}{\|\nu\|}$. Also, $\|\mu_\nu\| = p := \|\mu\|$ for all $\nu\in\mathcal{V}$. Now, denoting $F_{Q_\nu}(w) := \mathbb{E}_{x\sim Q_\nu}f(w,x)$, we have for any $w\in\mathcal{W}$ (possibly depending on $X\sim Q_\nu^n$) that
\[
\begin{aligned}
\max_{\nu\in\mathcal{V}}\mathbb{E}\left[F_{Q_\nu}(w) - \min_{w'\in\mathcal{W}}F_{Q_\nu}(w')\right] &= \max_{\nu\in\mathcal{V}}\mathbb{E}\left[\left\langle\frac{\mu_\nu}{\|\mu\|}, \mu_\nu\right\rangle - \langle w, \mu_\nu\rangle\right] && \text{(C.23)} \\
&= \max_{\nu\in\mathcal{V}}\mathbb{E}[\|\mu\| - \langle w, \mu_\nu\rangle] && \text{(C.24)} \\
&= \max_{\nu\in\mathcal{V}}\mathbb{E}\big(\|\mu\|[1 - \langle w, w_\nu\rangle]\big) && \text{(C.25)} \\
&\ge \max_{\nu\in\mathcal{V}}\mathbb{E}\left[\frac{1}{2}\|\mu\|\,\|w - w_\nu\|^2\right], && \text{(C.26)}
\end{aligned}
\]
since $\|w\|, \|w_\nu\| \le 1$. Further, denoting $\hat{w} := \operatorname{argmin}_{\nu\in\mathcal{V}}\|w_\nu - w\|$, we have $\|\hat{w} - w_\nu\|^2 \le 4\|w_\nu - w\|^2$ for all $\nu\in\mathcal{V}$ (via Young's inequality). Hence
\[
\max_{\nu\in\mathcal{V}}\mathbb{E}\left[F_{Q_\nu}(w) - \min_{w'\in\mathcal{W}}F_{Q_\nu}(w')\right] \ge \frac{\|\mu\|}{8}\max_{\nu\in\mathcal{V}}\mathbb{E}\|\hat{w} - w_\nu\|^2. \tag{C.27}
\]
Now we apply Fano's method (see e.g. [270, Lemma 3]) to lower bound $\max_{\nu\in\mathcal{V}}\mathbb{E}\|\hat{w} - w_\nu\|^2$. For all $\nu\ne\nu'$, we have $\|w_\nu - w_{\nu'}\|^2 = \frac{\|\nu - \nu'\|^2}{\|\nu\|^2} \ge \frac{1}{2}$, since $d_{\mathrm{Ham}}(\nu,\nu')\ge\frac{d}{8}$ and $\nu\in\{\pm 1\}^d$ implies $\|\nu - \nu'\|^2\ge\frac{d}{2}$ and $\|\nu\|^2 = d$. Also, a straightforward computation shows that for any $j\in[d]$ and $\nu,\nu'\in\mathcal{V}$,
\[
\begin{aligned}
D_{\mathrm{KL}}(Q_{\nu_j}\,\|\,Q_{\nu'_j}) &\le \frac{1 + \frac{p}{\sqrt{d}}}{2}\left[\log\left(\frac{\sqrt{d} + p}{\sqrt{d}}\right) + \log\left(\frac{\sqrt{d}}{\sqrt{d} - p}\right)\right] && \text{(C.28)} \\
&\le \log\left(\frac{1 + \frac{p}{\sqrt{d}}}{1 - \frac{p}{\sqrt{d}}}\right) && \text{(C.29)} \\
&\le \frac{3p}{\sqrt{d}}, && \text{(C.30)}
\end{aligned}
\]
for our choice of $p$, provided $\frac{p}{\sqrt{d}} = \frac{1}{\sqrt{n}}\in(0,\frac{1}{2})$, which holds if $n > 4$. Hence by the chain rule for KL-divergence, $D_{\mathrm{KL}}(Q_\nu\,\|\,Q_{\nu'}) \le 3p\sqrt{d} = \frac{3d}{\sqrt{n}}$ for all $\nu,\nu'\in\mathcal{V}$. Thus, for any $w\in\mathcal{W}$, Fano's method yields
\[
\max_{\nu\in\mathcal{V}}\mathbb{E}\|w - w_\nu\|^2 \ge \frac{1}{2}\left(1 - \frac{3p\sqrt{d} + \log(2)}{d/20}\right) = \frac{1}{2}\left(1 - \frac{60}{\sqrt{n}} - \frac{20\log(2)}{d}\right),
\]
which is $\Omega(1)$ for $d\ge 40 > 20\log(2)$ and $n > 720^2$. Combining this with (C.27) and plugging in $\|\mu\| = \sqrt{\frac{d}{n}}$ shows that
\[
\mathbb{E}F_{Q_\nu}(\mathcal{A}(X)) - F^\ast_{Q_\nu} = \Omega\left(\sqrt{\frac{d}{n}}\right)
\]
for some $\nu\in\mathcal{V}$ (for any algorithm $\mathcal{A}$), where $X\sim Q_\nu^n$.
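The elementary two-point KL bound used in (C.28)–(C.30) can be checked numerically; the following is an illustrative script with assumed grid values (the exact KL between the two $\pm 1$ coins has the closed form $\delta\log\frac{1+\delta}{1-\delta}$).

```python
import math

# Numeric check (illustrative, values assumed) of the bound used in
# (C.28)-(C.30): for two +-1 coins with means delta and -delta (delta = p/sqrt(d)),
# the exact KL divergence is at most log((1+delta)/(1-delta)), which is itself
# at most 3*delta whenever delta in (0, 1/2).
def kl_pm1(d1, d2):
    # KL divergence between +-1 coins with means d1 and d2
    p1, p2 = (1 + d1) / 2, (1 + d2) / 2
    return p1 * math.log(p1 / p2) + (1 - p1) * math.log((1 - p1) / (1 - p2))

for delta in (0.01, 0.1, 0.3, 0.49):
    kl = kl_pm1(delta, -delta)
    assert kl <= math.log((1 + delta) / (1 - delta)) + 1e-12
    assert math.log((1 + delta) / (1 - delta)) <= 3 * delta
print("KL bound log((1+delta)/(1-delta)) <= 3*delta verified on the grid")
```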
Next, we scale our hard instance for arbitrary $\gamma, D > 0$. First, we scale the distribution $Q_\nu \to \widetilde{Q}_\nu = \gamma^{1/k}Q_\nu$, which is supported on $\widetilde{\mathcal{X}} = \{\pm\gamma^{1/k}\}^d$. Denote its mean by $\mathbb{E}\widetilde{Q}_\nu := \widetilde{\mu}_\nu = \gamma^{1/k}\mu_\nu$. Also we scale $\mathcal{W} \to \widetilde{\mathcal{W}} = D\mathcal{W} = B_2^d(0,D)$. So our final (linear, convex, smooth) hard instance is $f:\widetilde{\mathcal{W}}\times\widetilde{\mathcal{X}}\to\mathbb{R}$, $f(\widetilde{w},\widetilde{x}) = -\langle\widetilde{w},\widetilde{x}\rangle$, $\widetilde{F}(\widetilde{w}) := \mathbb{E}_{\widetilde{x}\sim\widetilde{Q}_\nu}f(\widetilde{w},\widetilde{x})$. Denote $F(w) := \mathbb{E}_{x\sim Q_\nu}f(w,x)$. Note that
\[
\mathbb{E}|\langle\nabla f(\widetilde{w},\widetilde{x}) - \nabla\widetilde{F}(\widetilde{w}), e_j\rangle|^k = \mathbb{E}|\langle -\widetilde{x} + \mathbb{E}\widetilde{x}, e_j\rangle|^k = \mathbb{E}|\widetilde{x} - \widetilde{\mu}_{\nu_j}|^k = \mathbb{E}|\gamma^{1/k}(x - \mu_{\nu_j})|^k \le \gamma.
\]
Further, we have $w^\ast := \operatorname{argmin}_{w\in\mathcal{W}}F(w) = \frac{\mu_\nu}{\|\mu_\nu\|}$ and $\widetilde{w}^\ast := \operatorname{argmin}_{\widetilde{w}\in\widetilde{\mathcal{W}}}\widetilde{F}(\widetilde{w}) = Dw^\ast$. Therefore, for any $w\in\mathcal{W}$, $\widetilde{w} = Dw\in\widetilde{\mathcal{W}}$, we have
\[
\begin{aligned}
\widetilde{F}(\widetilde{w}) - \widetilde{F}(\widetilde{w}^\ast) &= -\langle\widetilde{w},\widetilde{\mu}_\nu\rangle + \langle\widetilde{w}^\ast,\widetilde{\mu}_\nu\rangle && \text{(C.31)} \\
&= \langle D(w^\ast - w), \gamma^{1/k}\mu_\nu\rangle && \text{(C.32)} \\
&= D\gamma^{1/k}\,[F(w) - F(w^\ast)]. && \text{(C.33)}
\end{aligned}
\]
Thus,
\[
\mathbb{E}\widetilde{F}(\mathcal{A}(\widetilde{X})) - \widetilde{F}^\ast = \gamma^{1/k}D\,[\mathbb{E}F(\mathcal{A}(X)) - F^\ast],
\]
so applying the lower bound for the case $D = \gamma = 1$ (i.e. for the unscaled $F$) yields the desired lower bound via $\widetilde{F}$.
1. We will use nearly the same unscaled hard instances used to prove the private and non-private terms of the lower bound in part 3, but the scaling will differ. Starting with the non-private term, we scale the distribution $Q_\nu \to \widetilde{Q}_\nu = \frac{\widetilde{r}_k}{\sqrt{d}}Q_\nu$ and $\mathcal{X} \to \widetilde{\mathcal{X}} = \frac{\widetilde{r}_k}{\sqrt{d}}\mathcal{X}$. Also, scale $\mathcal{W} \to \widetilde{\mathcal{W}} = D\mathcal{W} = B_2^d(0,D)$. Let $f(\widetilde{w},\widetilde{x}) := -\langle\widetilde{w},\widetilde{x}\rangle$, which satisfies all the hypotheses of the theorem. Also,
\[
\mathbb{E}_{\widetilde{x}\sim\widetilde{Q}_\nu}\left[\sup_{\widetilde{w}}\|\nabla f(\widetilde{w},\widetilde{x})\|^k\right] = \left(\frac{\widetilde{r}_k}{\sqrt{d}}\right)^k\mathbb{E}_{x\sim Q_\nu}\|x\|^k \le \left(\frac{\widetilde{r}_k}{\sqrt{d}}\right)^k d^{k/2} = \widetilde{r}^{(k)}.
\]
Now $\widetilde{w}^\ast = Dw^\ast$ as before and, letting $\widetilde{F}(\cdot) := \mathbb{E}_{\widetilde{x}\sim\widetilde{Q}_\nu}f(\cdot,\widetilde{x})$, we have
\[
\widetilde{F}(\widetilde{w}) - \widetilde{F}(\widetilde{w}^\ast) = \frac{\widetilde{r}_kD}{\sqrt{d}}\,[F(w) - F^\ast]
\]
for any $\widetilde{w} = Dw$. Thus, applying the unscaled non-private lower bound established above yields a lower bound of $\Omega\left(\frac{\widetilde{r}D}{\sqrt{n}}\right)$ on the non-private excess risk of our scaled instance.

Next, we turn to the scaled private lower bound. The unscaled hard distribution $Q'_\nu$ given by
\[
Q'_\nu = \begin{cases} 0 & \text{with probability } 1-p \\ p^{-1/k}\nu & \text{with probability } p \end{cases}
\]
(with the same linear $f$ and same $\mathcal{W}$) provides the unscaled lower bound
\[
\mathbb{E}F(\mathcal{A}(X)) - F^\ast = \Omega\left(\sqrt{d}\min\left\{1, \left(\frac{\sqrt{d}}{\sqrt{\rho}\,n}\right)^{\frac{k-1}{k}}\right\}\right),
\]
by the proof of [137, Theorem 6.4]. We scale $Q'_\nu \to \widetilde{Q}'_\nu = \frac{\widetilde{r}}{\sqrt{d}}Q'_\nu$, $\mathcal{X} \to \widetilde{\mathcal{X}} = \frac{\widetilde{r}}{\sqrt{d}}\mathcal{X}$, and $\mathcal{W} \to \widetilde{\mathcal{W}} = D\mathcal{W}$. Then for any $\widetilde{w}\in\widetilde{\mathcal{W}}$,
\[
\mathbb{E}_{\widetilde{x}\sim\widetilde{Q}'}\left[\sup_{\widetilde{w}}\|\nabla f(\widetilde{w},\widetilde{x})\|^k\right] = \left(\frac{\widetilde{r}_k}{\sqrt{d}}\right)^k\mathbb{E}_{x\sim Q'}\|x\|^k = p\|p^{-1/k}\nu\|^k = \widetilde{r}^{(k)}.
\]
Moreover, excess risk scales by a factor of $\frac{\widetilde{r}D}{\sqrt{d}}$, as we saw above. Thus, applying the unscaled lower bound completes the proof of part 1.
2. We use an identical construction to that used above in part 1, except that the scaling factor $\widetilde{r}_k$ gets replaced by $r_k$. It is easy to see that $\mathbb{E}\left[\sup_{w\in\mathcal{W}}\|\nabla f(w,x)\|^k\right] = \sup_{w\in\mathcal{W}}\mathbb{E}\left[\|\nabla f(w,x)\|^k\right]$ for our construction, hence the result follows.
Remark 169. The main differences in our proof of part 3 of Theorem 57 from the proof of [137, Theorem 6.4] (for $\gamma = D = 1$) are: 1) we construct a Bernoulli product distribution (built on [82, Example 7.7]) instead of a Gaussian, which establishes a lower bound that holds for all $k\ge 2$ instead of just $k = O(1)$; and 2) we choose a different parameter value (larger $p$ in the notation of the proof) in our application of Fano's method, which results in a tighter lower bound: the term $\min\{1,\sqrt{d/n}\}$ in [137, Theorem 6.4] gets replaced with $\sqrt{d/n}$.¶ Also, there exist parameter settings for which our lower bound is indeed strictly greater than the lower bound in [137, Theorem 6.4]: for instance, if $d > n > d/\rho$ and $k\to\infty$, then our lower bound simplifies to $\Omega\big(\sqrt{\frac{d}{n}}\big)$. On the other hand, the lower bound in [137, Theorem 6.4] breaks as $k\to\infty$ (since the $k$-th moment of their Gaussian goes to infinity); however, even if it were extended to $k\to\infty$ (e.g. by replacing their Gaussian with our Bernoulli distribution), then the resulting lower bound $\Omega\big(1 + \frac{d}{\sqrt{\rho}\,n}\big)$ would still be smaller than the one we prove above.∥
Theorem 170 (Precise Statement of Theorem 58). Let $k\ge 2$, $\mu,\gamma,\widetilde{r}_k,r_k > 0$, $n\in\mathbb{N}$, $d\ge 40$, and $\rho\le d$. Then, for any $\rho$-zCDP algorithm $\mathcal{A}$, there exist convex, compact sets $\mathcal{W},\mathcal{X}\subset\mathbb{R}^d$ of diameter $D$, a $\mu$-smooth, $\mu$-strongly convex (in $w$) loss $f:\mathcal{W}\times\mathcal{X}\to\mathbb{R}$, and distributions $\mathcal{P}$ and $\mathcal{D}'$ on $\mathcal{X}$ such that:

1. Assumption 5 holds with $D\approx\frac{\widetilde{r}_k}{\mu}$, and if $X'\sim\mathcal{D}'^n$, then
\[
\mathbb{E}F(\mathcal{A}(X')) - F^\ast = \Omega\left(\frac{\widetilde{r}^{\,2}_k}{\mu}\left(\frac{1}{n} + \min\left\{1, \left(\frac{\sqrt{d}}{\sqrt{\rho}\,n}\right)^{\frac{2k-2}{k}}\right\}\right)\right).
\]
2. Assumption 9 holds with $D\approx\frac{r_k}{\mu}$, and if $X'\sim\mathcal{D}'^n$, then
\[
\mathbb{E}F(\mathcal{A}(X')) - F^\ast = \Omega\left(\frac{r^2_k}{\mu}\left(\frac{1}{n} + \min\left\{1, \left(\frac{\sqrt{d}}{\sqrt{\rho}\,n}\right)^{\frac{2k-2}{k}}\right\}\right)\right).
\]
3. Assumption 10 holds, $D\approx\frac{\gamma^{1/k}\sqrt{d}}{\mu}$, and if $X\sim\mathcal{P}^n$, then
\[
\mathbb{E}F(\mathcal{A}(X)) - F^\ast = \Omega\left(\frac{\gamma^{2/k}}{\mu}\left(\frac{d}{n} + d\min\left\{1, \left(\frac{\sqrt{d}}{\sqrt{\rho}\,n}\right)^{\frac{2k-2}{k}}\right\}\right)\right).
\]

¶Note that [137, Theorem 6.4] writes $\sqrt{d/n}$ for the first term. However, the proof (see Equation 16 in their paper) only establishes the bound $\min\{1,\sqrt{d/n}\}$.
∥By Lemma 155, lower bounds under Assumption 10 imply lower bounds under Assumption 9 with $\gamma^{1/k}$ replaced by $r/\sqrt{d}$. Nevertheless, we provide direct proofs under both assumptions for additional clarity.
Proof. We will prove part 3 first.

3. We first consider $\gamma = \mu = 1$ and then scale our hard instance. For $f(w,x) := \frac{1}{2}\|w - x\|^2$, [137] construct a convex, compact domain $\mathcal{W}\times\mathcal{X}\subset\mathbb{R}^d\times\mathbb{R}^d$ and distribution $\mathcal{P}$ on $\mathcal{X}$ such that
\[
\mathbb{E}F(\mathcal{A}(X)) - F^\ast = \Omega\left(d\min\left\{1, \left(\frac{\sqrt{d}}{\sqrt{\rho}\,n}\right)^{\frac{2k-2}{k}}\right\}\right)
\]
for any $k$ and any $\rho$-zCDP algorithm $\mathcal{A}:\mathcal{X}^n\to\mathcal{W}$ if $X\sim\mathcal{P}^n$.* So, it remains to a) prove the first term ($d/n$) in the lower bound, and then b) show that the scaled instance satisfies the exact hypotheses in the theorem and has excess loss that scales by a factor of $\gamma^{2/k}/\mu$. We start with task a). Observe that for $f$ defined above and any distribution $\mathcal{P}$ such that $\mathbb{E}\mathcal{P}\in\mathcal{W}$, we have
\[
\mathbb{E}F(\mathcal{A}(X)) - F^\ast = \frac{1}{2}\mathbb{E}\|\mathcal{A}(X) - \mathbb{E}\mathcal{P}\|^2 \tag{C.34}
\]
(see [137, Lemma 6.2]), and
\[
\mathbb{E}|\langle\nabla f(w,x) - \nabla F(w), e_j\rangle|^k = \mathbb{E}|\langle x - \mathbb{E}x, e_j\rangle|^k.
\]
Thus, it suffices to prove that $\mathbb{E}\|\mathcal{A}(X) - \mathbb{E}\mathcal{P}\|^2 \gtrsim \frac{d}{n}$ for some $\mathcal{P}$ such that $\mathbb{E}|\langle x - \mathbb{E}x, e_j\rangle|^k \le 1$. This is a known result for products of Bernoulli distributions; nevertheless, we provide a detailed proof below. First consider the case $d = 1$. Then the proof follows along the lines of [82, Example 7.7]. Define the following pair of distributions on $\{\pm 1\}$:
\[
P_0 := \begin{cases} 1 & \text{with probability } \frac{1}{2} \\ -1 & \text{with probability } \frac{1}{2} \end{cases}
\qquad\text{and}\qquad
P_1 := \begin{cases} 1 & \text{with probability } \frac{1+\delta}{2} \\ -1 & \text{with probability } \frac{1-\delta}{2} \end{cases}
\]
for $\delta\in(0,1)$ to be chosen later. Notice that if $X$ is a random variable with distribution $P_\nu$ ($\nu\in\{0,1\}$), then $\mathbb{E}|X - \mu|^k \le \mathbb{E}|X|^k \le 1$. Also, $\mathbb{E}P_\nu = \delta\nu$ for $\nu\in\{0,1\}$ and $|\mathbb{E}P_1 - \mathbb{E}P_0| = \delta$ (i.e. the two distributions are $\delta$-separated with respect to the metric $\rho(a,b) = |a - b|$). Then by Le Cam's method (see e.g. [82, Eq. 7.33] and take $\Phi(\cdot) = (\cdot)^2$),
\[
\max_{\nu\in\{0,1\}}\mathbb{E}_{X\sim P_\nu^n}|\mathcal{A}(X) - \mathbb{E}P_\nu|^2 \ge \frac{\delta^2}{8}\left[1 - \|P_0^n - P_1^n\|_{TV}\right].
\]
Now, by Pinsker's inequality and the chain rule for KL-divergence, we have
\[
\|P_0^n - P_1^n\|^2_{TV} \le \frac{1}{2}D_{\mathrm{KL}}(P_0^n\,\|\,P_1^n) = \frac{n}{2}D_{\mathrm{KL}}(P_0\,\|\,P_1) = \frac{n}{2}\log\left(\frac{1}{1-\delta^2}\right).
\]

*In fact, $\mathcal{W}$ and $\mathcal{X}$ can be chosen to be Euclidean balls of radius $\sqrt{d}\,p^{-1/k}$ for $p$ defined in the proof of [137, Lemma 6.3], which ensures that $\mathbb{E}\mathcal{P}\in\mathcal{W} = \mathcal{X}$.
Choosingδ “
1
?
2n
ă
1
?
2
implies}P
n
0
´P
n
1
}
2
TV
ďnδ
2
“
1
2
. Hence there exists a distribution
ˆ
DPtP
0
,P
1
u
onR such that
E
X„
ˆ
D
n
|ApXq´E
ˆ
D|
2
ě
δ 2
8
„
1´
1
?
2
ȷ
ě
1
64n
For generaldě1, we take the product distributionP :“
ˆ
D
d
onX “t˘1u
d
and chooseW“B
d
2
p0,
?
dq
to ensureEP PW (so that C.34 holds). Clearly,E|xP´EP,e
j
y|
k
ď 1 for allj Prds. Further, the mean
squared error of any algorithm for estimating the mean ofP is
E
X„P
n}ApXq´EP}
2
“
d
ÿ
j“1
E|ApXq
j
´EP
j
|
2
ě
d
64n
, (C.35)
by applying thed“1 result to each coordinate.
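The d/n rate in (C.35) is easy to sanity-check numerically: for the product of d symmetric Bernoulli (Rademacher) coordinates, even the empirical mean incurs mean squared error d/n, matching the lower bound up to the constant. The following Monte Carlo sketch is purely illustrative (the dimensions, sample size, and trial count below are arbitrary choices, not values from the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 5, 100, 2000

# P = product of d symmetric Bernoulli (Rademacher) coordinates: E P = 0 and
# E|<x - Ex, e_j>|^k <= 1 for every k, so P lies in the class used in the proof.
errs = []
for _ in range(trials):
    X = rng.choice([-1.0, 1.0], size=(n, d))
    mean_est = X.mean(axis=0)                   # A(X) = empirical mean
    errs.append(float(np.sum(mean_est ** 2)))   # squared error against E P = 0
mse = float(np.mean(errs))

assert mse >= d / (64 * n)      # consistent with the Omega(d/n) bound in (C.35)
assert abs(mse - d / n) < 0.01  # the empirical mean attains roughly d/n
```

Each coordinate of the empirical mean has variance 1/n, so the total squared error concentrates near d/n = 0.05 here, comfortably above the d/(64n) threshold from the proof.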
Next, we move to task b). For this, we re-scale each of our hard distributions (the non-private one given above, and the private one given in the proof of [137, Lemma 6.3] and below in our proof of part 2 of the theorem; see C.39):

   P → (γ^{1/k}/µ)·P =: P̃,  X → (γ^{1/k}/µ)·X =: X̃,  W → (γ^{1/k}/µ)·W =: W̃,  and  f → µf =: f̃ : W̃ × X̃ → ℝ.

Then f̃(·, x̃) is µ-strongly convex and µ-smooth for all x̃ ∈ X̃, and

   E|⟨∇f̃(w̃, x̃) − ∇F̃(w̃), e_j⟩|^k = µ^k · E|⟨x̃ − E x̃, e_j⟩|^k = µ^k · E| (γ^{1/k}/µ) · ⟨x − E x, e_j⟩ |^k = γ · E|⟨x − E x, e_j⟩|^k ≤ γ

for any j ∈ [d], x ~ P, x̃ ~ P̃, w̃ ∈ W̃. Thus, the scaled hard instance is in the required class of functions/distributions. Further, denote F̃(w̃) := E f̃(w̃, x̃) and w̃* := argmin_{w̃ ∈ W̃} F̃(w̃) = E P̃ = (γ^{1/k}/µ)·E P. Then, for any w ∈ W and w̃ := (γ^{1/k}/µ)·w, we have:

   F̃(w̃) − F̃(w̃*) = (µ/2) · E[ ‖w̃ − x̃‖² − ‖w̃* − x̃‖² ]   (C.36)
                  = (µ/2) · (γ^{2/k}/µ²) · E[ ‖w − x‖² − ‖w* − x‖² ]   (C.37)
                  = (γ^{2/k}/µ) · [ F(w) − F(w*) ].   (C.38)

In particular, for w := A(X) and w̃ := (γ^{1/k}/µ)·A(X), we get

   E_{A, X ~ P^n}[ F̃( (γ^{1/k}/µ)·A(X) ) − F̃* ] = (γ^{2/k}/µ) · E_{A, X ~ P^n}[ F(A(X)) − F* ]

for any algorithm A : X^n → W. Writing Ã(X̃) := (γ^{1/k}/µ)·A(X) and X̃ := (γ^{1/k}/µ)·X for X ∈ X^n, we conclude

   E_{Ã, X̃ ~ P̃^n}[ F̃(Ã(X̃)) − F̃* ] = (γ^{2/k}/µ) · E_{A, X ~ P^n}[ F(A(X)) − F* ]

for any Ã : X̃^n → W̃. Therefore, an application of the unscaled lower bound

   E_{A, X ~ P^n}[ F(A(X)) − F* ] = Ω( d/n + d · min{ 1, (√d/(n√ρ))^{(2k−2)/k} } ),

which follows by combining part 3a) above with [137, Lemma 6.3], completes the proof of part 3.
1. We begin by proving the first (non-private) term in the lower bound. For our unscaled hard instance, we take the same distribution P = P_ν^d (for some ν ∈ {0, 1}) on X = {±1}^d and the quadratic f described above in part 3a), with W := B_2^d(0, √d). The choice of W ensures E P ∈ W, so that C.34 holds. Further,

   E[ sup_{w∈W} ‖∇f(w, x)‖^k ] = E[ sup_{w∈W} ‖w − x‖^k ] ≤ E[ ‖3x‖^k ] ≤ (9d)^{k/2}.

Thus, if we scale f → f̃ = µf, W → W̃ := (r̃_k/(µ√(9d)))·W, X → X̃ := (r̃_k/(µ√(9d)))·X, and P → P̃ := (r̃_k/(µ√(9d)))·P, then f̃(·, x̃) is µ-strongly convex and µ-smooth, and

   E[ sup_{w̃∈W̃} ‖∇f̃(w̃, x̃)‖^k ] = µ^k · E[ sup_{w̃∈W̃} ‖w̃ − x̃‖^k ] = µ^k · (r̃_k/(µ√(9d)))^k · E[ sup_{w∈W} ‖∇f(w, x)‖^k ] ≤ r̃^{(k)}.

Moreover, if (r̃_k/(3µ√d))·A =: Ã : X̃^n → W̃ is any algorithm and X̃ ~ P̃^n, then by C.35 and C.34, we have

   E F̃(Ã(X̃)) − F̃* = (µ/2) · E‖Ã(X̃) − E P̃‖² = (µ/2) · (r̃_k/(µ√(9d)))² · E‖A(X) − E P‖² ≳ r̃_k²/(µn).

Next, we prove the second (private) term in the lower bound. Let f be as defined above. For our unscaled hard distribution, we follow [26, 137] and define a family of distributions {Q_ν}_{ν∈V} on ℝ^d, where V ⊂ {±1}^d will be defined later. For any given ν ∈ V, we define the distribution Q_ν as follows: X_ν ~ Q_ν iff

   X_ν = { 0 with probability 1 − p;  p^{−1/k}·ν with probability p },   (C.39)

where p := min( 1, √d/(n√ρ) ). Now, we select a set V ⊂ {±1}^d such that |V| ≥ 2^{d/20} and d_Ham(ν, ν') ≥ d/8 for all ν, ν' ∈ V with ν ≠ ν': such a V exists by the standard Gilbert–Varshamov bound (see e.g. [2, Lemma 6]). For any ν ∈ V, if x ~ Q_ν and w ∈ W := B_2^d(0, √d·p^{−1/k}), then

   E[ sup_{w∈W} ‖∇f(w, x)‖^k ] = E[ sup_{w∈W} ‖w − x‖^k ] ≤ E[ ‖2x‖^k ] = 2^k · p · ‖p^{−1/k}ν‖^k = 2^k ‖ν‖^k = 2^k d^{k/2}.

Note also that our choice of W and p ≤ 1 ensures that E[Q_ν] ∈ W. Moreover, as in the proof of [137, Lemma 6.3], the zCDP Fano's inequality (see [137, Theorem 1.4]) implies that for any ρ-zCDP algorithm A,

   sup_{ν∈V} E_{X ~ Q_ν^n, A} ‖A(X) − E Q_ν‖² = Ω( d · min{ 1, (√d/(n√ρ))^{(2k−2)/k} } ).   (C.40)

Thus,

   E_{X ~ Q_ν^n, A} F(A(X)) − F* = Ω( d · min{ 1, (√d/(n√ρ))^{(2k−2)/k} } )

for some ν ∈ V, by C.34. Now we scale our hard instance: f → f̃ = µf, W → W̃ := (r̃_k/(2µ√d))·W, X → X̃ := (r̃_k/(2µ√d))·X, and P → P̃ := (r̃_k/(2µ√d))·P. Then f̃(·, x̃) is µ-strongly convex and µ-smooth, and

   E[ sup_{w̃∈W̃} ‖∇f̃(w̃, x̃)‖^k ] = µ^k · E[ sup_{w̃∈W̃} ‖w̃ − x̃‖^k ] = µ^k · (r̃_k/(2µ√d))^k · E[ sup_{w∈W} ‖∇f(w, x)‖^k ] ≤ r̃^{(k)}.

Moreover, if (r̃_k/(2µ√d))·A =: Ã : X̃^n → W̃ is any ρ-zCDP algorithm and X̃ ~ P̃^n, then

   E F̃(Ã(X̃)) − F̃* = (µ/2) · E‖Ã(X̃) − E P̃‖² = (µ/2) · (r̃_k/(2µ√d))² · E‖A(X) − E P‖² ≥ (r̃_k²/(16µd)) · Ω( d · min{ 1, (√d/(n√ρ))^{(2k−2)/k} } ),

by C.40.

2. We use an identical construction to that used above in part 1, except that the scaling factor r̃_k gets replaced by r_k. It is easy to see that E[ sup_{w∈W} ‖∇f(w, x)‖^k ] ≈ sup_{w∈W} E[ ‖∇f(w, x)‖^k ] for our construction, and the lower bound in part 2 follows just as it did in part 1. This completes the proof.

Remark 171. Note that the lower bound proofs construct bounded (hence subexponential) distributions and uniformly L-Lipschitz, β_f-smooth losses that easily satisfy the conditions in our upper bound theorems.
C.5 Details and Proofs for Section 4.4: Linear-Time Algorithms

C.5.1 Noisy Clipped Accelerated SGD for Smooth Convex Losses (Section 4.4.1)

We first present Algorithm 26, which is a generalized version of Algorithm 8 that allows for any MeanOracle. This will be useful for our analysis. Proposition 172 provides excess risk guarantees for Algorithm 26 in terms of the bias and variance of the MeanOracle.
Algorithm 26 Generic Framework for DP Accelerated Stochastic Approximation (AC-SA)
1: Input: Data X ∈ X^n, iteration number T ≤ n, stepsize parameters {η_t}_{t∈[T]}, {α_t}_{t∈[T]} with α_1 = 1, α_t ∈ (0, 1) ∀t ≥ 2, DP MeanOracle.
2: Initialize w_0^{ag} = w_0 ∈ W and t = 1.
3: for t ∈ [T] do
4:   w_t^{md} := (1 − α_t)·w_{t−1}^{ag} + α_t·w_{t−1}.
5:   Draw new batch B_t (without replacement) of n/T samples from X.
6:   ∇̃F_t(w_t^{md}) := MeanOracle({∇f(w_t^{md}, x)}_{x∈B_t}; n/T; ε²/2)
7:   w_t := argmin_{w∈W} { α_t·⟨∇̃F_t(w_t^{md}), w⟩ + (η_t/2)·‖w_{t−1} − w‖² }.
8:   w_t^{ag} := α_t·w_t + (1 − α_t)·w_{t−1}^{ag}.
9: end for
10: Output: w_T^{ag}.
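To make the updates concrete, the following minimal sketch runs the AC-SA loop on a toy quadratic over a Euclidean ball W, where the argmin in line 7 reduces to the projected step w_t = Π_W(w_{t−1} − (α_t/η_t)·∇̃F_t). The quadratic loss, the ball radius, and the noiseless batch-mean oracle are illustrative assumptions standing in for the DP MeanOracle, not part of the algorithm's specification:

```python
import numpy as np

def project_ball(w, radius):
    """Euclidean projection onto W = B(0, radius)."""
    nrm = np.linalg.norm(w)
    return w if nrm <= radius else w * (radius / nrm)

def ac_sa(grad_batch, X, T, eta, radius, rng):
    """The loop of Algorithm 26; grad_batch stands in for the MeanOracle."""
    n, d = X.shape
    w = np.zeros(d)
    w_ag = np.zeros(d)
    batches = np.array_split(rng.permutation(n), T)  # disjoint batches (line 5)
    for t, idx in enumerate(batches, start=1):
        alpha_t = 2.0 / (t + 1)                      # alpha_1 = 1
        eta_t = 4.0 * eta / (t * (t + 1))
        w_md = (1 - alpha_t) * w_ag + alpha_t * w    # line 4
        g = grad_batch(w_md, X[idx])                 # line 6
        # Line 7: over a ball, the argmin is a projected gradient step.
        w = project_ball(w - (alpha_t / eta_t) * g, radius)
        w_ag = alpha_t * w + (1 - alpha_t) * w_ag    # line 8
    return w_ag

# Toy instance: f(w, x) = 0.5*||w - x||^2, so beta = 1 and w* is the mean of x.
rng = np.random.default_rng(1)
X = rng.normal(loc=1.0, scale=0.1, size=(4000, 3))
grad_batch = lambda w, xs: np.mean(w - xs, axis=0)   # clipping/noise omitted here
w_ag = ac_sa(grad_batch, X, T=40, eta=2.0, radius=10.0, rng=rng)
assert np.linalg.norm(w_ag - X.mean(axis=0)) < 0.5
```

The parameter choices α_t = 2/(t+1) and η_t = 4η/(t(t+1)) with η = 2β match the particular schedule analyzed in Proposition 172 below.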
Proposition 172. Consider Algorithm 26 run with a MeanOracle satisfying ∇̃F_t(w_t^{md}) = ∇F(w_t^{md}) + b_t + N_t, where ‖b_t‖ ≤ B (with probability 1), E N_t = 0, and E‖N_t‖² ≤ Σ² for all t ∈ [T−1], and {N_t}_{t=1}^T are independent. Assume that F : W → ℝ is convex and β-smooth, F(w_0) − F* ≤ Δ, and ‖w_0 − w*‖ ≤ D. Suppose the parameters in Algorithm 26 are chosen so that for all t ∈ [T], η_t > βα_t² and η_t/Γ_t = η_1/Γ_1, where

   Γ_t := { 1, t = 1;  (1 − α_t)·Γ_{t−1}, t ≥ 2 }.

Then,

   E F(w_T^{ag}) − F* ≤ Γ_T·η_1·D²/2 + Γ_T · ∑_{t=1}^T [ 2α_t²(Σ² + B²)/(Γ_t·(η_t − βα_t²)) + (α_t/Γ_t)·B·D ].

In particular, choosing α_t = 2/(t+1) and η_t = 4η/(t(t+1)) for all t ≥ 1, where η ≥ 2β, implies

   E F(w_T^{ag}) − F* ≤ 4ηD²/(T(T+1)) + 4(Σ² + B²)(T+2)/(3η) + B·D.

Further, setting η = max{ 2β, T^{3/2}·√(Σ² + B²)/D } implies

   E F(w_T^{ag}) − F* ≲ βD²/T² + D(Σ + B)/√T + B·D.   (C.41)
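As a quick numeric sanity check on the particular parameter schedule (with an arbitrary illustrative β): α_t = 2/(t+1) and η_t = 4η/(t(t+1)) give the closed form Γ_t = 2/(t(t+1)), so the ratio η_t/Γ_t ≡ 2η is constant, and η_t > βα_t² holds whenever η ≥ 2β:

```python
beta = 3.0          # illustrative smoothness constant (assumption for the demo)
eta = 2 * beta      # eta >= 2*beta, as required by the proposition
T = 1000

gamma_t = 1.0       # Gamma_1 = 1 (since alpha_1 = 1)
for t in range(1, T + 1):
    alpha_t = 2.0 / (t + 1)
    eta_t = 4.0 * eta / (t * (t + 1))
    if t >= 2:
        gamma_t *= 1 - alpha_t                        # Gamma_t = (1 - alpha_t)*Gamma_{t-1}
    assert abs(gamma_t - 2.0 / (t * (t + 1))) < 1e-12 # closed form 2/(t(t+1))
    assert eta_t > beta * alpha_t ** 2                # eta_t > beta * alpha_t^2
    assert abs(eta_t / gamma_t - 2 * eta) < 1e-9      # eta_t / Gamma_t is constant
```

These are exactly the two structural conditions (η_t > βα_t² and η_t/Γ_t constant) under which the first display of the proposition applies.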
Proof. We begin by extending [108, Proposition 4] to biased/noisy stochastic gradients. Fix any w_{t−1}, w_{t−1}^{ag} ∈ W. By [108, Lemma 3], we have

   F(w_t^{ag}) ≤ (1 − α_t)·F(w_{t−1}^{ag}) + α_t·[ F(z) + ⟨∇F(z), w_t − z⟩ ] + (β/2)·‖w_t^{ag} − z‖²,   (C.42)

for any z ∈ W. Denote

   Υ_t(w) := α_t·⟨N_t + b_t, w − w_{t−1}⟩ + α_t²·‖N_t + b_t‖²/(η_t − βα_t²)

and d_t := w_t^{ag} − w_t^{md} = α_t·(w_t − w_{t−1}). Then, using C.42 with z = w_t^{md}, we have

   F(w_t^{ag}) ≤ (1 − α_t)·F(w_{t−1}^{ag}) + α_t·[ F(w_t^{md}) + ⟨∇F(w_t^{md}), w_t − w_t^{md}⟩ ] + (β/2)·‖d_t‖²
              = (1 − α_t)·F(w_{t−1}^{ag}) + α_t·[ F(w_t^{md}) + ⟨∇F(w_t^{md}), w_t − w_t^{md}⟩ ] + (η_t/2)·‖w_{t−1} − w_t‖² − ((η_t − βα_t²)/(2α_t²))·‖d_t‖²,   (C.43)

by the expression for d_t. Now we apply [108, Lemma 2] with p(u) = α_t·⟨∇̃F_t(w_t^{md}), u⟩, µ_1 = 0, µ_2 = η_t, x̃ = w_t^{md}, and ỹ = w_{t−1} to obtain (conditional on all randomness), for any w ∈ W:

   α_t·[ F(w_t^{md}) + ⟨∇̃F_t(w_t^{md}), w_t − w_t^{md}⟩ ] + (η_t/2)·‖w_{t−1} − w_t‖²
   ≤ α_t·[ F(w_t^{md}) + ⟨∇F(w_t^{md}), w − w_t^{md}⟩ ] + α_t·⟨N_t + b_t, w − w_t^{md}⟩ + (η_t/2)·‖w_{t−1} − w‖² − (η_t/2)·‖w_t − w‖².

Next, we combine the above inequality with C.43 to get

   F(w_t^{ag}) ≤ (1 − α_t)·F(w_{t−1}^{ag}) + α_t·[ F(w_t^{md}) + ⟨∇F(w_t^{md}), w − w_t^{md}⟩ ] + (η_t/2)·[ ‖w_{t−1} − w‖² − ‖w_t − w‖² ] + U_t,   (C.44)

where U_t := −((η_t − βα_t²)/(2α_t²))·‖d_t‖² + α_t·⟨N_t + b_t, w − w_t⟩, for all w ∈ W. By Cauchy–Schwarz, we can bound

   U_t ≤ −((η_t − βα_t²)/(2α_t²))·‖d_t‖² + ‖N_t + b_t‖·‖d_t‖ + α_t·⟨N_t + b_t, w − w_{t−1}⟩ ≤ Υ_t(w),   (C.45)

where the last inequality follows from maximizing the concave quadratic function q(‖d_t‖) := −[(η_t − βα_t²)/(2α_t²)]·‖d_t‖² + ‖N_t + b_t‖·‖d_t‖ with respect to ‖d_t‖. Plugging the bound C.45 back into C.44 shows that

   F(w_t^{ag}) ≤ (1 − α_t)·F(w_{t−1}^{ag}) + α_t·[ F(w_t^{md}) + ⟨∇F(w_t^{md}), w − w_t^{md}⟩ ] + (η_t/2)·[ ‖w_{t−1} − w‖² − ‖w_t − w‖² ] + Υ_t(w).   (C.46)

Then it can be shown (see [108, Proposition 5]) that the assumptions on η_t and α_t imply that

   F(w_T^{ag}) − Γ_T · ∑_{t=1}^T (α_t/Γ_t)·[ F(w_t^{md}) + ⟨∇F(w_t^{md}), w − w_t^{md}⟩ ] ≤ Γ_T · ∑_{t=1}^T (η_t/(2Γ_t))·[ ‖w_{t−1} − w‖² − ‖w_t − w‖² ]   (C.47)
      + Γ_T · ∑_{t=1}^T Υ_t(w)/Γ_t,   (C.48)

for any w ∈ W and any T ≥ 1. Now, ∑_{t=1}^T α_t/Γ_t = 1/Γ_T by definition. Hence, by convexity of F,

   ∑_{t=1}^T (α_t/Γ_t)·[ F(w_t^{md}) + ⟨∇F(w_t^{md}), w − w_t^{md}⟩ ] ≤ (1/Γ_T)·F(w),  ∀w ∈ W.

Also, since Γ_t/η_t = Γ_1/η_1 for all t ≥ 1, we have

   Γ_T · ∑_{t=1}^T (η_t/(2Γ_t))·[ ‖w_{t−1} − w‖² − ‖w_t − w‖² ] = Γ_T·(η_1/(2Γ_1))·[ ‖w_0 − w‖² − ‖w_T − w‖² ] ≤ Γ_T·η_1·‖w_0 − w‖²/2,

since Γ_1 = 1. Substituting the above bounds into C.47, we get

   F(w_T^{ag}) − F(w) ≤ Γ_T·η_1·‖w_0 − w‖²/2 + Γ_T · ∑_{t=1}^T Υ_t(w)/Γ_t,  ∀w ∈ W.   (C.49)

Now, setting w = w* and taking expectations yields

   E[F(w_T^{ag}) − F*] ≤ Γ_T·η_1·D²/2 + Γ_T · ∑_{t=1}^T E Υ_t(w*)/Γ_t   (C.50)
   ≤ Γ_T·η_1·D²/2 + Γ_T · ∑_{t=1}^T (1/Γ_t)·[ α_t·E⟨b_t, w* − w_{t−1}⟩ + 2α_t²(Σ² + B²)/(η_t − βα_t²) ]   (C.51)
   ≤ Γ_T·η_1·D²/2 + Γ_T · ∑_{t=1}^T (1/Γ_t)·[ α_t·B·D + 2α_t²(Σ² + B²)/(η_t − βα_t²) ],   (C.52)

where we used conditional independence of N_t and w* − w_{t−1} given w_{t−1}, Young's inequality, Cauchy–Schwarz, and the definitions of B² and Σ². This establishes the first claim of the proposition. The second and third claims are simple corollaries, which can be verified as in [108, Proposition 7] and the ensuing discussion.
Theorem 173 (Complete Version of Theorem 59). Grant Assumption 9. Let ε > 0 and assume F is convex and β-smooth. Then, there are algorithmic parameters such that Algorithm 8 is ε²/2-zCDP. Further, if

   n ≥ T := ⌈ min{ (βD/r)^{2k/(5k−1)} · (εn/√d)^{(2k−2)/(5k−1)},  √(βD/r)·n^{1/4} } ⌉,

then

   E F(w_T^{ag}) − F* ≲ r_k·D · [ 1/√n + max{ ( (βD/r_k)^{1/4} · √d/(εn) )^{4(k−1)/(5k−1)},  ( √d/(εn) )^{(k−1)/k} } ].

Proof. Privacy: Choose σ² = 4C²T²/(ε²n²). First, the collection of all ∇̃F_t(w_t^{md}), t ∈ [T], is ε²/2-zCDP: since the batches of data drawn in each iteration are disjoint, it suffices (by parallel composition [191]) to show that ∇̃F_t(w_t^{md}) is ε²/2-zCDP for each t. Now, the ℓ_2 sensitivity of each clipped gradient update is bounded by

   Δ = sup_{w, X ~ X'} ‖ (T/n) · ( ∑_{x∈B_t} Π_C(∇f(w, x)) − ∑_{x'∈B_t'} Π_C(∇f(w, x')) ) ‖ = sup_{w, x, x'} (T/n)·‖ Π_C(∇f(w, x)) − Π_C(∇f(w, x')) ‖ ≤ 2CT/n.

Thus, ∇̃F_t(w_t^{md}) is ε²/2-zCDP by Proposition 8. Second, the iterates w_t^{ag} are deterministic functions of ∇̃F_t(w_t^{md}), so the post-processing property of differential privacy [88, 47] ensures that Algorithm 8 is ε²/2-zCDP.

Excess risk: Consider round t ∈ [T] of Algorithm 8, where Algorithm 5 is run on input data {∇f(w_t, x_i^t)}_{i=1}^{n/T}. Denote the bias of Algorithm 5 by b_t := E ∇̃F_t(w_t) − ∇F(w_t), where ∇̃F_t(w_t) = ν̃ in the notation of Algorithm 5. Also let ∇̂F_t(w_t) := µ̂ (in the notation of Lemma 49) and denote the noise by N_t = ∇̃F_t(w_t) − ∇F(w_t) − b_t = ∇̃F_t(w_t) − E ∇̃F_t(w_t). Then we have B := sup_{t∈[T]} ‖b_t‖ ≤ r^{(k)}/((k−1)·C^{k−1}) and Σ² := sup_{t∈[T]} E‖N_t‖² ≤ dσ² + r²T/n ≲ dC²T²/(ε²n²) + r²T/n, by Lemma 49. Plugging these estimates for B and Σ² into Proposition 172 and setting C = r·(εn/√(dT))^{1/k}, we get

   E F(w_T^{ag}) − F* ≲ βD²/T² + D(Σ + B)/√T + B·D
                     ≲ βD²/T² + C·D·√(dT)/(εn) + rD/√n + r^{(k)}·D/C^{k−1}
                     ≲ βD²/T² + rD · [ 1/√n + (√(dT)/(εn))^{(k−1)/k} ].   (C.53)

Now, our choice of T implies that βD²/T² ≤ rD·[ 1/√n + (√(dT)/(εn))^{(k−1)/k} ], and we get the result upon plugging in T.
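The sensitivity bound 2CT/n that drives the privacy argument is easy to check numerically. The sketch below is an illustrative stand-in for the clipped-and-noised gradient oracle (per-sample clipping to norm C, batch averaging, Gaussian noise with σ = 2CT/(εn)); the heavy-tailed "gradients" and all numeric values are assumptions for the demo. Without the noise, swapping one sample changes the batch average by at most (T/n)·2C:

```python
import numpy as np

def clip(v, C):
    """Project v onto the ball of radius C (per-sample gradient clipping)."""
    nrm = np.linalg.norm(v)
    return v if nrm <= C else v * (C / nrm)

def mean_oracle(grads, C, sigma, rng):
    """Average of clipped vectors plus Gaussian noise (the DP MeanOracle role)."""
    clipped = np.array([clip(g, C) for g in grads])
    return clipped.mean(axis=0) + rng.normal(scale=sigma, size=grads.shape[1])

rng = np.random.default_rng(0)
n, T, d, C, eps = 1000, 10, 5, 1.5, 1.0
batch = rng.standard_t(df=3, size=(n // T, d))  # heavy-tailed "gradients"
sigma = 2 * C * T / (eps * n)                   # sigma^2 = 4 C^2 T^2 / (eps^2 n^2)

# Sensitivity check: replace one sample by an adversarial one; the *noiseless*
# clipped means differ by at most (T/n) * 2C.
avg = lambda G: np.array([clip(g, C) for g in G]).mean(axis=0)
neighbor = batch.copy()
neighbor[0] = 1e6 * np.ones(d)                  # arbitrary replacement
diff = np.linalg.norm(avg(batch) - avg(neighbor))
assert diff <= 2 * C * T / n + 1e-12

noisy = mean_oracle(batch, C, sigma, rng)       # private estimate of the mean
assert noisy.shape == (d,)
```

The clipping makes the ℓ_2 sensitivity finite even though the underlying gradients are unbounded, which is exactly what lets the Gaussian mechanism be calibrated in the proof above.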
C.5.2 Noisy Clipped SGD for Strongly Convex Losses (Section 4.4.2)

We begin by presenting the pseudocode for our noisy clipped SGD in Algorithm 27.

Algorithm 27 Noisy Clipped SGD for Heavy-Tailed DP SCO
1: Input: Data X ∈ X^n, T ≤ n, stepsizes {η_t}_{t=0}^T, averaging weights {ζ_t}_{t=0}^T, w_0 ∈ W.
2: for t ∈ {0, 1, …, T} do
3:   Draw new batch B_t (without replacement) of n/T samples from X.
4:   ∇̃F_t(w_t) := MeanOracle1({∇f(w_t, x)}_{x∈B_t}; n/T; ε²/2)
5:   w_{t+1} = Π_W[ w_t − η_t·∇̃F_t(w_t) ]
6: end for
7: Output: ŵ_T := (1/Z_T) · ∑_{t=0}^T ζ_t·w_{t+1}, where Z_T = ∑_{t=0}^T ζ_t.
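For concreteness, here is a minimal, self-contained sketch of the Algorithm 27 loop on a toy strongly convex quadratic, with a clipped, noisy batch-gradient oracle and uniform averaging weights. The loss, clip level, noise scale, stepsizes, and weights are illustrative assumptions rather than the tuned choices from the analysis below:

```python
import numpy as np

def project_ball(w, R):
    nrm = np.linalg.norm(w)
    return w if nrm <= R else w * (R / nrm)

def clip(v, C):
    nrm = np.linalg.norm(v)
    return v if nrm <= C else v * (C / nrm)

rng = np.random.default_rng(0)
n, T, d = 5000, 50, 3
C, sigma, R = 5.0, 0.01, 10.0
X = rng.normal(loc=2.0, scale=1.0, size=(n, d))   # f(w, x) = 0.5*||w - x||^2
w_star = X.mean(axis=0)                           # minimizer of F over the ball

w = np.zeros(d)
iterates = []
batches = np.array_split(rng.permutation(n), T + 1)  # disjoint batches (line 3)
for t, idx in enumerate(batches):
    grads = np.array([clip(w - x, C) for x in X[idx]])
    g = grads.mean(axis=0) + rng.normal(scale=sigma, size=d)  # MeanOracle1 role
    eta_t = 1.0 / (t + 2)                          # illustrative decaying stepsize
    w = project_ball(w - eta_t * g, R)             # line 5
    iterates.append(w)

w_hat = np.mean(iterates, axis=0)                  # uniform zeta_t weights
assert np.linalg.norm(w_hat - w_star) < 0.5
```

The averaged iterate ŵ_T, rather than the last iterate, is what Proposition 174 below analyzes; the particular weights ζ_t used in the proof come from Lemma 175.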
Algorithm 28 is a generalized version of Algorithm 27 that allows for any MeanOracle and will be useful in our analysis. In Proposition 174, we provide the convergence guarantees for Algorithm 28 in terms of the bias and variance of the MeanOracle.

Algorithm 28 Generic Noisy Clipped SGD Framework for Heavy-Tailed SCO
1: Input: Data X ∈ X^n, T ≤ n, MeanOracle, stepsizes {η_t}_{t=0}^T, averaging weights {ζ_t}_{t=0}^T.
2: Initialize w_0 ∈ W.
3: for t ∈ {0, 1, …, T} do
4:   Draw new batch B_t (without replacement) of n/T samples from X.
5:   ∇̃F_t(w_t) := MeanOracle({∇f(w_t, x)}_{x∈B_t}; n/T; ε²/2)
6:   w_{t+1} = Π_W[ w_t − η_t·∇̃F_t(w_t) ]
7: end for
8: Output: ŵ_T := (1/Z_T) · ∑_{t=0}^T ζ_t·w_{t+1}, where Z_T = ∑_{t=0}^T ζ_t.

Proposition 174. Let F : W → ℝ be µ-strongly convex and β-smooth with condition number κ := β/µ. Let w_{t+1} := Π_W[ w_t − η_t·∇̃F_t(w_t) ], where ∇̃F_t(w_t) = ∇F(w_t) + b_t + N_t, such that the bias and noise (which can depend on w_t and the samples drawn) satisfy ‖b_t‖ ≤ B (with probability 1), E N_t = 0, and E‖N_t‖² ≤ Σ² for all t ∈ [T−1], and that {N_t}_{t=1}^T are independent. Then, there exist stepsizes {η_t}_{t=1}^T and weights {ζ_t}_{t=0}^T such that the average iterate ŵ_T := (1/∑_{t=0}^T ζ_t) · ∑_{t=0}^T ζ_t·w_{t+1} satisfies

   E F(ŵ_T) − F* ≤ 32βD² · exp( −T/(4κ) ) + 72Σ²/(µT) + 2B²/µ.
Proof. Define g(w_t) := −(1/η_t)·(w_{t+1} − w_t). Then

   E‖w_{t+1} − w*‖² = E‖w_t − η_t·g(w_t) − w*‖² = E‖w_t − w*‖² − 2η_t·E⟨g(w_t), w_t − w*⟩ + η_t²·E‖g(w_t)‖².   (C.54)

Now, conditional on all randomness, we use smoothness and strong convexity to write:

   F(w_{t+1}) − F(w*) = F(w_{t+1}) − F(w_t) + F(w_t) − F(w*)
   ≤ ⟨∇F(w_t), w_{t+1} − w_t⟩ + (β/2)·‖w_{t+1} − w_t‖² + ⟨∇F(w_t), w_t − w*⟩ − (µ/2)·‖w_t − w*‖²
   = ⟨∇̃F_t(w_t), w_{t+1} − w*⟩ + ⟨∇F(w_t) − ∇̃F_t(w_t), w_{t+1} − w*⟩ + (βη_t²/2)·‖g(w_t)‖² − (µ/2)·‖w_t − w*‖²
   ≤ ⟨g(w_t), w_{t+1} − w*⟩ + ⟨∇F(w_t) − ∇̃F_t(w_t), w_{t+1} − w*⟩ + (βη_t²/2)·‖g(w_t)‖² − (µ/2)·‖w_t − w*‖²
   = ⟨g(w_t), w_{t+1} − w_t⟩ + ⟨g(w_t), w_t − w*⟩ − ⟨b_t + N_t, w_{t+1} − w*⟩ + (βη_t²/2)·‖g(w_t)‖² − (µ/2)·‖w_t − w*‖²
   = ⟨g(w_t), w_t − w*⟩ − ⟨b_t + N_t, w_{t+1} − w*⟩ + (βη_t²/2 − η_t)·‖g(w_t)‖² − (µ/2)·‖w_t − w*‖²,

where we used the fact that ⟨Π_W(y) − x, Π_W(y) − y⟩ ≤ 0 for all x ∈ W, y ∈ ℝ^d (cf. [46, Lemma 3.1]) to obtain the last inequality. Thus,

   −2η_t·E⟨g(w_t), w_t − w*⟩ ≤ −2η_t·E[F(w_{t+1}) − F*] + 2η_t·E[ −⟨b_t + N_t, w_{t+1} − w*⟩ + (βη_t²/2 − η_t)·‖g(w_t)‖² − (µ/2)·‖w_t − w*‖² ].

Combining the above inequality with C.54, we get

   E‖w_{t+1} − w*‖² ≤ (1 − µη_t)·E‖w_t − w*‖² − 2η_t·E[F(w_{t+1}) − F*] − 2η_t·E⟨b_t + N_t, w_{t+1} − w*⟩ + 2η_t·(βη_t²/2 − η_t)·E‖g(w_t)‖².   (C.55)

Next, consider

   |E⟨b_t + N_t, w_{t+1} − w*⟩| ≤ |E⟨b_t + N_t, w_{t+1} − w_t⟩| + |E⟨b_t + N_t, w_t − w*⟩|
   = |E⟨b_t + N_t, w_{t+1} − w_t⟩| + |E⟨b_t, w_t − w*⟩|
   ≤ |E⟨b_t + N_t, w_{t+1} − w_t⟩| + B²/µ + (µ/4)·E‖w_t − w*‖²,

by independence of N_t (which has zero mean) and w_t − w*, and Young's inequality. Next, note that v := w_t − η_t·(∇F(w_t) + b_t) is independent of N_t, so E⟨N_t, Π_W(v)⟩ = 0. Thus,

   |E⟨N_t, w_{t+1} − w_t⟩| = |E⟨N_t, w_{t+1}⟩|
   = |E⟨N_t, Π_W[ w_t − η_t·(∇F(w_t) + b_t + N_t) ]⟩|
   = |E⟨N_t, Π_W[ v − η_t·N_t ]⟩|
   = |E⟨N_t, Π_W[v] − Π_W[ v − η_t·N_t ]⟩|
   ≤ E[ ‖N_t‖ · ‖Π_W[v] − Π_W[ v − η_t·N_t ]‖ ]
   ≤ E[ ‖N_t‖ · ‖η_t·N_t‖ ]
   ≤ η_t·Σ²,

by Cauchy–Schwarz and non-expansiveness of projection. Further,

   |E⟨b_t, w_{t+1} − w_t⟩| = |E⟨b_t, −η_t·g(w_t)⟩| ≤ B²/µ + (η_t²·µ/4)·E‖g(w_t)‖²,

by Young's inequality. Therefore,

   −2η_t·E⟨b_t + N_t, w_{t+1} − w*⟩ ≤ 2η_t·[ 2B²/µ + (η_t²·µ/4)·E‖g(w_t)‖² + η_t·Σ² + (µ/4)·E‖w_t − w*‖² ].

Plugging this bound back into C.55 and choosing η_t ≤ 1/β ≤ 1/µ yields:

   E‖w_{t+1} − w*‖² ≤ (1 − µη_t/2)·E‖w_t − w*‖² − 2η_t·E[F(w_{t+1}) − F*] + 4η_t·B²/µ + 2η_t²·Σ² + 2η_t·(βη_t²/2 − η_t + η_t²·µ/4)·E‖g(w_t)‖²
   ≤ (1 − µη_t/2)·E‖w_t − w*‖² − 2η_t·E[F(w_{t+1}) − F*] + 4η_t·B²/µ + 2η_t²·Σ².

Next, we apply Lemma 175 (see below) with r_t := E‖w_t − w*‖², s_t := E F(w_{t+1}) − F* − 2B²/µ, a := µ/2, b := 2, c := 2Σ², and g := β. We may assume s_t ≥ 0 for all t: if this inequality breaks for some t, then we can simply return w_{t+1} instead of ŵ_T to obtain E F(w_{t+1}) − F* < 2B²/µ. Thus,

   (1/Γ_T) · ∑_{t=0}^T ζ_t·E[F(w_{t+1}) − F*] ≤ (1/2)·[ 32βD² · exp( −µT/(4β) ) + 144Σ²/(µT) ] + 2B²/µ.

Finally, Jensen's inequality yields the theorem.
Lemma 175 ([234, Lemma 3]). Let b > 0, let a, c ≥ 0, and let {η_t}_{t≥0} be non-negative step-sizes such that η_t ≤ 1/g for all t ≥ 0, for some parameter g ≥ a. Let {r_t}_{t≥0} and {s_t}_{t≥0} be two non-negative sequences of real numbers which satisfy

   r_{t+1} ≤ (1 − aη_t)·r_t − bη_t·s_t + cη_t²

for all t ≥ 0. Then there exist particular choices of step-sizes η_t ≤ 1/g and averaging weights ζ_t ≥ 0 such that

   (b/Γ_T) · ∑_{t=0}^T s_t·ζ_t + a·r_{T+1} ≤ 32g·r_0 · exp( −aT/(2g) ) + 36c/(aT),

where Γ_T := ∑_{t=0}^T ζ_t.

We are now prepared to prove Theorem 61.
Theorem 176 (Precise statement of Theorem 61). Grant Assumption 9. Let ε > 0, and assume F is µ-strongly convex and β-smooth with κ = β/µ ≤ n/ln(n). Then, there are parameters such that Algorithm 27 is ε²/2-zCDP, and

   E F(ŵ_T) − F* ≲ (r_k²/µ) · ( 1/n + ( √(dκ·ln(n))/(εn) )^{(2k−2)/k} ).   (C.56)

Proof. Privacy: Choose σ² = 4C²T²/(ε²n²). Since the batches of data drawn in each iteration are disjoint, it suffices (by parallel composition [191]) to show that ∇̃F_t(w_t) is ε²/2-zCDP for each t. Now, the ℓ_2 sensitivity of each clipped gradient update is bounded by

   Δ = sup_{w, X ~ X'} ‖ (T/n) · ( ∑_{x∈B_t} Π_C(∇f(w, x)) − ∑_{x'∈B_t'} Π_C(∇f(w, x')) ) ‖ = sup_{w, x, x'} (T/n)·‖ Π_C(∇f(w, x)) − Π_C(∇f(w, x')) ‖ ≤ 2CT/n.

Hence Proposition 8 implies that the algorithm is ε²/2-zCDP.

Excess risk: For any iteration t ∈ [T], denote the bias of Algorithm 5 by b_t := E ∇̃F_t(w_t) − ∇F(w_t), where ∇̃F_t(w_t) = ν̃ in the notation of Algorithm 5. Also let ∇̂F_t(w_t) := ν̂ (in the notation of Lemma 49) and denote the noise by N_t = ∇̃F_t(w_t) − ∇F(w_t) − b_t = ∇̃F_t(w_t) − E ∇̃F_t(w_t). Then we have B := sup_{t∈[T]} ‖b_t‖ ≤ r^{(k)}/((k−1)·C^{k−1}) and Σ² := sup_{t∈[T]} E‖N_t‖² ≤ dσ² + r²T/n ≲ dC²T²/(ε²n²) + r²T/n, by Lemma 49. Plugging these bias and variance estimates into Proposition 174, we get

   E F(ŵ_T) − F* ≲ βD² · exp( −T/(4κ) ) + (1/(µT)) · ( dC²T²/(ε²n²) + r²T/n ) + (r^{(k)})²/(C^{2k−2}·µ).

Choosing C = r · ( ε²n²/(dT) )^{1/(2k)} implies

   E F(ŵ_T) − F* ≲ βD² · exp( −T/(4κ) ) + (r²/µ) · ( 1/n + ( dT/(ε²n²) )^{(k−1)/k} ).

Finally, choosing T = ⌈ 4κ · ln( (µβD²/r²) · ( n + (ε²n²/d)^{(k−1)/k} ) ) ⌉ ≲ κ·ln(n) yields the result.
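The choice of C above exactly balances the privacy-noise term against the squared clipping-bias term; this is a one-line algebraic identity that can be verified numerically (all numeric values below are arbitrary illustrative assumptions, writing r^{(k)} = r^k):

```python
# With C = r * (eps^2 n^2 / (d T))^(1/(2k)), the variance term
# d C^2 T / (eps^2 n^2 mu) and the squared-bias term r^(2k) / (C^(2k-2) mu)
# coincide, both equal to (r^2/mu) * (d T / (eps^2 n^2))^((k-1)/k).
r, d, T, eps, n, mu, k = 2.0, 50, 30, 1.0, 10_000, 0.5, 3

C = r * (eps**2 * n**2 / (d * T)) ** (1 / (2 * k))
variance_term = d * C**2 * T / (eps**2 * n**2 * mu)
bias_sq_term = r ** (2 * k) / (C ** (2 * k - 2) * mu)
closed_form = (r**2 / mu) * (d * T / (eps**2 * n**2)) ** ((k - 1) / k)

assert abs(variance_term - bias_sq_term) < 1e-9 * variance_term
assert abs(variance_term - closed_form) < 1e-9 * closed_form
```

Equating the two error sources is the standard way such clipping thresholds are tuned: a larger C shrinks the bias but inflates the sensitivity-calibrated noise, and C as chosen sits at the crossover.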
C.6 Details and Proofs for Section 4.5: Non-Convex Proximal PL Losses

Pseudocode for our algorithm for PPL losses is given in Algorithm 29.

Algorithm 29 Noisy Clipped Proximal SGD for Heavy-Tailed DP SO
1: Input: Data X ∈ X^n, T ≤ n, stepsizes {η_t}_{t=0}^{T−1}.
2: Initialize w_0 ∈ W.
3: for t ∈ {0, 1, …, T−1} do
4:   Draw new batch B_t (without replacement) of n/T samples from X.
5:   ∇̃F_t^0(w_t) := MeanOracle1({∇f^0(w_t, x)}_{x∈B_t}; n/T; ε²/2)
6:   w_{t+1} = prox_{η_t f^1}( w_t − η_t·∇̃F_t^0(w_t) )
7: end for
8: Output: w_T.

Algorithm 30 is a generalization of Algorithm 29 which allows for an arbitrary MeanOracle. This will be useful for our analysis.

Algorithm 30 Generic Noisy Proximal SGD Framework for Heavy-Tailed DP SO
1: Input: Data X ∈ X^n, T ≤ n, MeanOracle (and truncation/minibatch parameters), privacy parameter ρ = ε²/2, stepsizes {η_t}_{t=0}^{T−1}.
2: Initialize w_0 ∈ W.
3: for t ∈ {0, 1, …, T−1} do
4:   Draw new batch B_t (without replacement) of n/T samples from X.
5:   ∇̃F_t^0(w_t) := MeanOracle({∇f^0(w_t, x)}_{x∈B_t}; n/T; ε²/2)
6:   w_{t+1} = prox_{η_t f^1}( w_t − η_t·∇̃F_t^0(w_t) )
7: end for
8: Output: w_T.

Proposition 177 provides a convergence guarantee for Algorithm 30 in terms of the bias and variance of the MeanOracle.
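As an illustration of the proximal step in line 6, here is a minimal sketch for the common composite case f^1(w) = λ‖w‖_1, whose prox is coordinate-wise soft-thresholding; the ℓ_1 regularizer and the numeric values are assumptions chosen for the example, not choices made in this section:

```python
import numpy as np

def prox_l1(v, step_lambda):
    """prox_{eta * f^1}(v) for f^1 = lambda * ||.||_1, with step_lambda = eta*lambda:
    soft-thresholding, i.e. argmin_w { step_lambda*||w||_1 + 0.5*||w - v||^2 }."""
    return np.sign(v) * np.maximum(np.abs(v) - step_lambda, 0.0)

eta, lam = 0.1, 2.0
v = np.array([3.0, -0.1, 0.5, -1.0])   # v = w_t - eta * noisy gradient, say
w_next = prox_l1(v, eta * lam)         # threshold at eta*lam = 0.2
assert np.allclose(w_next, [2.8, 0.0, 0.3, -0.8])
```

Any other closed-form prox (e.g. projection onto W when f^1 is the indicator of a convex set, which recovers the projected update of Algorithm 28) can be substituted without changing the analysis, since Proposition 177 only uses properties of the prox operator.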
Proposition 177. Consider Algorithm 30 with biased, noisy stochastic gradients ∇̃F_t^0(w_t) = ∇F^0(w_t) + b_t + N_t, and stepsize η = 1/(2β). Assume that the bias and noise (which can depend on w_t and the samples drawn) satisfy ‖b_t‖ ≤ B (with probability 1), E N_t = 0, and E‖N_t‖² ≤ Σ² for all t ∈ [T−1], and that {N_t}_{t=1}^T are independent. Assume further that F is µ-Proximal-PL, F^0 is β-smooth, and F(w_0) − F* ≤ Δ. Then,

   E F(w_T) − F* ≤ (1 − µ/(2β))^T · Δ + 4(B² + Σ²)/µ.

Proof. Our proof extends the ideas in [177] to generic biased and noisy gradients, without using Lipschitz continuity of f. By β-smoothness, for any r ∈ [T−1], we have

   E F(w_{r+1}) = E[ F^0(w_{r+1}) + f^1(w_r) + f^1(w_{r+1}) − f^1(w_r) ]
   ≤ E{ F(w_r) + [ ⟨∇̃F_r^0(w_r), w_{r+1} − w_r⟩ + (β/2)·‖w_{r+1} − w_r‖² + f^1(w_{r+1}) − f^1(w_r) ] } + E⟨∇F^0(w_r) − ∇̃F_r^0(w_r), w_{r+1} − w_r⟩
   = E F(w_r) + E[ ⟨∇F^0(w_r), w_{r+1} − w_r⟩ + (β/2)·‖w_{r+1} − w_r‖² + f^1(w_{r+1}) − f^1(w_r) + ⟨b_r + N_r, w_{r+1} − w_r⟩ ] − E⟨b_r + N_r, w_{r+1} − w_r⟩
   ≤ E F(w_r) + E[ ⟨∇F^0(w_r), w_{r+1} − w_r⟩ + β·‖w_{r+1} − w_r‖² + f^1(w_{r+1}) − f^1(w_r) + ⟨b_r + N