Controlling Information in Neural Networks for Fairness and Privacy

by Umang Gupta

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2023

Copyright 2023 Umang Gupta

Acknowledgements

Reflecting on my Ph.D. journey, I am profoundly grateful to my advisor, Prof. Greg Ver Steeg, for his mentorship. Our numerous insightful discussions have been instrumental in enriching my knowledge in countless ways. Most significantly, he imparted the invaluable skill of thinking from first principles and translating abstract ideas into successful research. Greg's unwavering support and patience have been instrumental throughout this journey. I am also deeply thankful to Prof. Bistra Dilkina, Prof. Fred Morstatter, Prof. Kristina Lerman, and Prof. Paul Thompson for their invaluable roles on various committees. Their feedback, insights, and suggestions have been valuable. I sincerely appreciate the USC Viterbi School of Engineering and Computer Science Department, with special thanks to Lizsl, Andy, and Asiroh for their seamless administrative support.

It has been an honor and pleasure to collaborate closely with Prof. Aram Galstyan, Prof. Paul Thompson, Nitin Kamra, and Dimitris Stripelis, all of whom have significantly nurtured my growth as a researcher. I would like to express my gratitude to my managers and mentors during internships at Amazon, Morgan Stanley, and Futurewei, particularly Jwala Dhamala and Prof. Kai-Wei Chang, for their invaluable guidance.

My heartfelt appreciation goes to my friends who made this journey memorable and manageable. A special mention to Nitin for all the help, especially during the initial years. Thanks to Aaron, Hrayr, Nina, Sahil, Ayush, Kruttika, Mohit, Shobhit, and Abhinav for all the joyous moments we shared. Finally, I thank my family, especially my sister, for their unconditional support. Thank you for always being there for me. This thesis would not have been possible without the direct and indirect support of many. While I can't name everyone, I am deeply grateful for their immense support and encouragement.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Fair Machine Learning
  1.2 Private Machine Learning
  1.3 Fairness, Privacy, and Information Theory
  1.4 Contributions
  List of Publications
Chapter 2: Controllable Guarantees for Fair Outcomes via Contrastive Information Estimation
  2.1 Mutual Information Bounds Parity
  2.2 Practical Objectives for Controlling Parity
    2.2.1 Interference between I(y : z) and I(z : c)
    2.2.2 Lower bounds for I(y : z) and I(y : z|c)
    2.2.3 Upper bound for I(z : c)
      2.2.3.1 Upper bound for I(z : x) by specifying a prior
      2.2.3.2 Lower bound for I(x : z|c) via contrastive estimation
    2.2.4 Overall Variational Objective
    2.2.5 Alternate bounds for I(z : c) and their caveats
      2.2.5.1 Reconstruction based Bound
      2.2.5.2 Adversarial approach
  2.3 FCRL Implementation Details
  2.4 Experiments
    2.4.1 Improved Accuracy Versus Parity Trade-offs
    2.4.2 Uncovering Hidden Bias
    2.4.3 Ablation Study of Information-Theoretic Methods
      2.4.3.1 Maximizing I(y : z|c) vs. I(y : z)
      2.4.3.2 Reconstruction vs. Contrastive Estimation
    2.4.4 Visualizing Theorem 2.2
  2.5 Related Work
  2.6 Conclusion
Chapter 3: Mitigating Gender Bias in Distilled Language Models via Counterfactual Role Reversal
  3.1 Related Work
  3.2 Notion of Language Model Fairness
  3.3 Fair Knowledge Distillation via Counterfactual Role Reversal
    3.3.1 Counterfactual Role Reversal
    3.3.2 Modifying Teacher Probabilities
    3.3.3 Counterfactual Data Augmentation
  3.4 Experiments
    3.4.1 Training Setup
    3.4.2 Evaluation of Open-ended Generation
    3.4.3 Baselines and Other Methods
    3.4.4 Results on Open-ended Text Generation
  3.5 Gender Fairness on Other Tasks
    3.5.1 Bias in Contextual Embeddings
    3.5.2 Fairness in Classification Task
  3.6 Discussion and Limitations
  3.7 Conclusion
Chapter 4: Membership Inference Attacks on Deep Regression Models for Neuroimaging
  4.1 Related Work & Background
    4.1.1 BrainAGE Problem
    4.1.2 Federated Learning
    4.1.3 Membership Inference Attacks
  4.2 Setup
    4.2.1 Trained Models for Predicting Brain Age
    4.2.2 Attack Setup
  4.3 Results
    4.3.1 Membership Inference Attacks on Centralized Training
    4.3.2 Membership Inference Attacks on Federated Training
    4.3.3 Possible Defenses
  4.4 Discussion
Chapter 5: Jointly Reparametrized Multi-Layer Adaptation for Efficient and Private Tuning
  5.1 Method
    5.1.1 Shared Layer Shift or SLaSh
    5.1.2 Improving Prompt-Tuning with Reparameterization
  5.2 Experiments
    5.2.1 Sequence Classification Tasks
    5.2.2 Token Classification Task
    5.2.3 Time & Memory Requirements
    5.2.4 Differential Private Finetuning
  5.3 Related Work
  5.4 Limitations
  5.5 Conclusion
Chapter 6: Bootstrapping Ambient Space for Private Training
  6.1 Method
    6.1.1 Privacy Analysis of Algorithm 2
  6.2 Experiments
    6.2.1 Low-rank Training
    6.2.2 Differential Private Training with Algorithm 2
    6.2.3 Why did Differential Private Training Fail?
  6.3 Related Work
  6.4 Discussion
References
Appendices
Appendix A: Proofs
  A.1 Proof of Theorem 2.2
  A.2 Proof of Proposition 2.5
Appendix B: Further Results and Discussion of FCRL
  B.1 On maximizing I(y : z) vs. I(y : z|c)
  B.2 Parity vs. Accuracy using different Classification Algorithms
  B.3 Hidden Information in Adversarial Representation Learning
  B.4 Area over Parity Accuracy Curve
Appendix C: Additional Details and Results of Equitable Role Alteration
  C.1 Human Evaluation of Generated Text
  C.2 Counterfactual Role-Reversal Data Generation
  C.3 Gender Word Lists
  C.4 Race Word Lists
  C.5 Note about Sub-word Embeddings
  C.6 On Limitations and Correctness of Counterfactual Sentences
  C.7 Mitigating Racial Disparity
  C.8 Training and Evaluation Details
    C.8.1 Language Model Training
    C.8.2 Language Model Evaluation
    C.8.3 Bios–Bias Training and Evaluation
    C.8.4 CEAT Details
Appendix D: Further Details and Results of Membership Inference Attacks on Deep Regression Models
  D.1 Brain Age Model, Training and Dataset Details
    D.1.1 Centralized Training Setup
    D.1.2 Federated Training Setup
    D.1.3 3D-CNN and 2D-slice-mean model architecture
  D.2 Detailed Results of Membership Inference Attacks on Federated Learning
  D.3 Attack Architecture and Training Details
    D.3.1 Attack Classifier Parametrization
    D.3.2 Training
  D.4 Differential Privacy
  D.5 Membership Inference attacks in centralized setup without the knowledge of training samples
    D.5.1 Setup
    D.5.2 Result
Appendix E: Additional Results of Jointly Reparametrized Multi-Layer Adaptation
  E.1 Ablations
    E.1.1 Adding shifts to other activations
    E.1.2 Initialization
    E.1.3 [MASK] vs. [CLS] Representations

List of Tables

1.1 Mapping between chapters and published papers.
2.1 Fair representation learning methods.
2.2 Area Over Parity Accuracy Curve.
3.1 Operations used to modify token probabilities.
3.2 Gender disparity in open-ended text generation.
3.3 Downstream gender fairness evaluation.
4.1 Membership inference attack accuracies on centrally trained models.
4.2 Average attack accuracies on federation trained models.
5.1 Parameter Efficient Finetuning Methods.
5.2 Results of finetuning RoBERTa-large with different methods on GLUE Development set.
5.3 Results of finetuning RoBERTa-base with different methods on GLUE Development set.
5.4 Results of finetuning BERT-base-cased for NER task on CoNLL-2003 (English) dataset.
5.5 Memory and execution time for training 1 epoch.
5.6 Results of differential private finetuning on GLUE Development set.
B.1 Area Over Parity Accuracy Curve with different downstream classification algorithms.
C.1 Generated texts and quality ratings that were shown as examples to annotators.
C.2 Examples of generations that the human annotators labeled as having a quality ≥ 4 (on a range 1–6, where 6 is excellent) from different GPT2-small models.
C.3 List of additional gender words.
C.4 Word lists for generating race counterfactuals.
C.5 Racial disparity in open-ended text generation as assessed by BOLD Race prompts.
C.6 Word sets and categories used in CEAT tests.
D.1 Mean absolute errors (years) for the train, test, and validation sets in the centralized setup.
D.2 Mean absolute errors (years) for the training and testing sets for different environments in the federated setup.
D.3 Average membership inference attack accuracies on models trained using federated learning across all environments using different feature sets.
D.4 Matrix of the membership inference attack accuracy on a per-learner basis for the 3D-CNN model across every federated learning environment.
D.5 Matrix of the membership inference attack accuracy on a per-learner basis for the 2D-slice-mean model across every federated learning environment.
D.6 Membership inference attacks without the knowledge of training samples.
E.1 Effect of adding shifts at different positions on sequence classification tasks.
E.2 Effect of different initialization of SLaSh parameters on sequence classification.
E.3 Comparing SLaSh with [MASK] and [CLS] token representation on sequence classification tasks (GLUE Development set).

List of Figures

2.1 Venn diagram to show interference between I(y : z) and I(z : c).
2.2 Parity vs. Accuracy trade-off for UCI Adult and Heritage Health datasets using a 1-hidden-layer MLP.
2.3 Parity and Accuracy variation with β.
2.4 Exploring parity-accuracy trade-off by finetuning.
2.5 Demonstrating hidden bias in adversarial methods.
2.6 Ablation studies on UCI Adult dataset.
2.7 Visualizing Theorem 2.2.
3.1 Example of gender bias in GPT-2.
3.2 Probability modification using counterfactual text.
4.1 Distribution of prediction error and gradient magnitudes from the trained models.
4.2 Increasing attack vulnerability per federation round.
4.3 Differential privacy reduces membership inference attacks.
4.4 Mitigating privacy vulnerability with non-unique gradients and Differential Privacy.
5.1 Performance vs. Parameters trade-off with pretrained RoBERTa-large.
5.2 Shared Layer Shift or SLaSh for sequence classification tasks.
5.3 JR-WARP for sequence classification tasks with prompt length 1.
6.1 Non-private training without noise.
6.2 Gradient norms when training without noise.
6.3 Training with noise to achieve differential privacy.
6.4 Singular values of gradient matrix.
B.1 Parity vs. Accuracy with different classification algorithms for UCI Adult dataset.
B.2 Parity vs. Accuracy with different classification algorithms for Heritage Health dataset.
B.3 Predicting c from representations provided by MaxEnt-ARL and Adversarial Forgetting.
B.4 Visualizing area over the parity-accuracy curve.
D.1 The UK Biobank data distribution across 8 learners for the three federated learning environments.
D.2 Learning curve (test performance) for 2D-slice-mean model across different federated learning environments.
D.3 Neural network architectures for brain age prediction.
D.4 Attack Accuracy vs. R² for models trained with differential privacy.
D.5 Performance (MAE) vs. ε at δ = 1e-6 for models trained with differential privacy.

Abstract

As machine learning becomes more prevalent in mission-critical domains, the harms of unintended information captured by these models are becoming more apparent. These models can inadvertently introduce biases and memorize training data, leading to potential unfairness, inequitable outcomes, or privacy risks. These phenomena are especially alarming in applications where data privacy needs to be upheld, such as medical imaging, or where unfairness can lead to disparate outcomes, such as hiring decisions. This thesis examines ways to control and limit information in deep learning models, focusing on fairness and privacy. Specifically, we discuss ways to ensure fairness in decision-making by learning fair data representations and preventing unfair language generation by correctly modulating information in neural networks. Concerning privacy, we demonstrate that releasing neuroimaging models may reveal private information about the individuals participating in the training set, and we discuss ways to mitigate these privacy leakages. Among these methods, differential private training is promising as it protects against all possible privacy attacks. However, differential private training can drastically hurt utility since the magnitude of noise in the outputs scales with the number of model parameters. To this end, we explore techniques to reduce the effective number of model parameters during training.

Chapter 1: Introduction

Machine learning's ever-expanding presence in critical domains has given it the power to make decisions that can have a profound influence over human lives, encompassing matters pertaining to health, financial stability, and personal liberties. These machine learning models, often based on neural networks, have an immense capability to capture complex patterns from their training data. While a high capacity to absorb patterns from data is desirable for making accurate decisions, it can also be a curse. For example, the models can capture information about biases and stereotypes from the training data and reproduce these unequal and unfair treatments in downstream applications (Kirchner et al., 2016; Mehrabi et al., 2019), or memorize training examples, which can harm generalization (Steinke et al., 2020) as well as privacy (Yeom et al., 2018). Such undesirable behaviors arise because the models are trained without constraints and aim to capture all patterns in the data. However, not all patterns in the data are helpful for the task at hand. For example, neural networks can often rely on spurious patterns in data to make decisions and, hence, may fail to generalize in a real-world setting (Geirhos et al., 2020). Thus, developing methods that can glean useful information from data while avoiding problematic information is a promising direction for improving the state of machine learning models. To this end, this thesis aims to better understand and control information in neural networks. While many applications can benefit from these techniques, we focus on the pivotal themes of fairness and privacy.
We focus on fairness and privacy because seamlessly integrating AI systems into real-world applications requires these models to be trustworthy and fair. Beyond the pressing concern of fairness, machine learning models are often trained by leveraging a user's private or sensitive data. These models can leak private information and raise concerns surrounding the privacy of participants (Shokri et al., 2017; Carlini et al., 2021). Protecting the privacy of data participants is crucial to foster trust and safeguard against potential misuse of personal information. This thesis examines fairness and privacy through the common idea of controlling or limiting information: it aims to propose methods that counteract bias, preserve the confidentiality of sensitive information, and improve the overall trustworthiness of machine learning methods by controlling information in neural networks.

1.1 Fair Machine Learning

The rapid advancements in machine learning have revolutionized various domains, facilitating the widespread adoption of powerful models in real-world applications that profoundly impact the quality of life for individuals. However, the integration of these models without consideration for fairness can often perpetuate and exacerbate undesirable biases and stereotypes present in the training data, resulting in the unequal and unjust treatment of different groups (Raghavan et al., 2019; Kirchner et al., 2016; Mehrabi et al., 2020a). To tackle these problems, a dynamic field of fair machine learning has emerged (Mehrabi et al., 2019; Hort et al., 2022).

Machine learning addresses diverse applications. As such, various measures have been used to quantify and audit machine learning models for fairness (e.g., Hardt et al. (2016); Mehrabi et al. (2020b); Dhamala et al. (2021)), and different measures may be preferred based on the application (Srivastava et al., 2019). These measures can even be incompatible or inconsistent (Kleinberg et al., 2017; Blodgett et al., 2021; Goldfarb-Tarrant et al., 2021). This thesis mainly considers measures that assess whether decision systems treat different groups equitably.

Arguably, the most prominent use of machine learning models until recently has been in decision-making systems, and these decision systems can have a disparate impact on minority groups (Wang et al., 2018; Barocas et al., 2016). A simple technique to achieve fairness would be to avoid observing group information about an individual. However, this does not achieve a fair outcome because other features can be used to decipher group information (Dwork et al., 2012); for instance, one may guess gender from the name. To this end, in Chapter 2, we quantify how much information a feature or data representation has about the group and how it relates to fairness. We then design a technique that removes information about the group from all the features to achieve fairness.

Further, the proliferation of generative AI, exemplified by large language models and text-to-image generative models like DALL·E (https://openai.com/dall-e-2), ChatGPT (https://openai.com/chatgpt), and others, poses additional challenges concerning fairness (Bommasani et al., 2021). A single pretrained model can be used in a wide range of applications, each considering a different fairness measure.
For example, language models trained without ethical considerations can generate sexist or racist text (Dhamala et al., 2021; Abid et al., 2021; Gehman et al., 2020; Field et al., 2021), and text-to-image models can generate offensive images grounded in stereotypes (Zhou et al., 2022a; Berg et al., 2022; Chuang et al., 2023). The sudden commercial proliferation of these generative AI systems has brought special attention to these problems, and a field of "Alignment Research" has emerged (Ouyang et al., 2022; Amodei et al., 2016), focusing on the development of methodologies to align models to human preferences. In Chapter 3, we introduce a fair knowledge distillation technique that carefully curates information during the distillation process to maximize performance while avoiding gender bias in text generation.

1.2 Private Machine Learning

With the increasing adoption of AI systems in daily life and the increasing utilization of personal data for training models, serious privacy concerns have been raised (Liu et al., 2021b; Liu et al., 2021c). Various laws have been enacted to restrict data sharing and safeguard the privacy rights of individuals (e.g., HIPAA, GDPR). However, fewer safeguards exist to prevent information leakage from machine learning models. Neural networks have an inherent tendency to memorize data (Zhang et al., 2017), and in some cases, this can be beneficial to model performance (Xu et al., 2023; Li et al., 2021b). However, memorizing training data presents privacy risks even when only model access or aggregated statistics are available (Shokri et al., 2017). Many practical privacy vulnerabilities have been exposed in statistical systems (Narayanan et al., 2008; Homer et al., 2008). In the particular case of deep learning models, this is often exemplified via membership inference attacks (Shokri et al., 2017; Nasr et al., 2019) or reconstruction attacks (Haim et al., 2022). Carlini et al. (2021) recovered correct personal information about individuals from language models. In Chapter 4, we demonstrate membership inference attack vulnerabilities of models trained on 3D neuroimaging data under federated and centralized training setups and their potential impact on privacy within the neuroimaging domain.

Privacy attacks on deep learning models are not mitigated by restricting data sharing, as done in federated learning (McMahan et al., 2017). To this end, researchers have focused on training machine learning models to mitigate specific attacks (e.g., Tang et al. (2022); Chen et al. (2023)) or training with differential privacy (Chaudhuri et al., 2011), which provides guarantees against all types of privacy attacks. These approaches implicitly rely on reducing information about a specific sample in the model as much as possible. Among these, differential privacy is particularly interesting. Differential privacy (Dwork et al., 2014), regarded as the holy grail of privacy guarantees, offers a robust framework that protects individual data points by ensuring that the presence or absence of a single sample in the training dataset does not significantly influence the model's output or reveal private information. The conventional approach to train neural networks with differential privacy is the DP-SGD (Differentially Private Stochastic Gradient Descent) algorithm (Abadi et al., 2016). This method introduces per-sample gradient clipping and adds Gaussian noise to the gradients at each step of SGD.
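For concreteness, a single DP-SGD update can be sketched as follows. This is a minimal illustration only: the clipping norm, noise multiplier, and the omission of privacy accounting and subsampling are simplifications of this sketch, not details from the thesis.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One DP-SGD update: clip each example's gradient, sum, add Gaussian noise, average."""
    rng = np.random.default_rng() if rng is None else rng
    # Clip every per-example gradient to L2 norm at most clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
    # Add Gaussian noise whose scale is calibrated to the clipping norm (the sensitivity).
    noisy_sum = clipped.sum(axis=0) + rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    # Average over the batch and take a standard gradient step.
    return params - lr * noisy_sum / per_example_grads.shape[0]
```

Privacy accounting, i.e., how the per-step noise composes into an overall (ε, δ) guarantee, is handled separately and is omitted here.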
Even though privacy is compatible with generalization (Steinke et al., 2020; Wang et al., 2016), the noise impacts model performance negatively and removes too much information. Thus, private models can severely underperform their non-private counterparts (Tramer et al., 2021). The noise variance scales with the number of neural network parameters; hence, model performance can worsen significantly as model size increases. On the other hand, various researchers have shown that language models can be finetuned with fewer parameters, which could benefit private training (Yu et al., 2022). To this end, in Chapter 5, we further improve parameter efficiency for finetuning language models, leading to better utility with private training and improved computational efficiency. As mentioned previously, DP-SGD trains by privatizing each step of gradient descent, which adds much more noise than necessary. Intuitively, we should be able to use the (private) outputs produced in previous steps to reduce the noise that needs to be added in the current step. Inspired by this intuition, in Chapter 6, we explore whether we can denoise or reduce the noise in the gradients by extracting useful information from previous outputs.

1.3 Fairness, Privacy, and Information Theory

Several works have made connections between statistical independence, fairness, and information about the sensitive group in the data, and these have inspired our work. Song et al. (2019) proposed information-theoretic measures for different fairness measures, motivated by intuition about the statistical independence required to achieve perfect fairness. We made the connection more rigorous in Gupta et al. (2021a). While Gupta et al. (2021a) focuses on fairness, it casts fairness as an invariant representation learning problem and can also address the general problem of encoding invariances in the model. Representations invariant to factors of variation irrelevant to the task can improve generalization and robustness (Moyer et al., 2018; Jaiswal et al., 2020). Similarly, information-theoretic measures have been proposed to measure the success of membership inference attacks. Jha et al. (2020) related mutual information between data and prediction to privacy attacks such as membership inference attacks. Wang et al. (2021) demonstrated that information bottleneck regularization, which limits the information about data in the representation, can mitigate inversion attacks. The similarities between the information measures used to assess fairness and privacy have motivated us to investigate techniques that address both concerns under the common idea of controlling information, and they inspired our solutions.

Table 1.1: Mapping between chapters and published papers.

Chapter     Published Paper
Chapter 2   Gupta et al. (2021a)
Chapter 3   Gupta et al. (2022)
Chapter 4   Gupta et al. (2021c)
Chapter 5   Gupta et al. (2023b)
Chapter 6   Ongoing work

1.4 Contributions

The chapters presented in this thesis can be broadly categorized into two parts. The first part focuses on fairness in machine learning, and the second focuses on privacy in machine learning. The first part investigates methods for controlling bias and promoting fairness in two contexts: a) learning fair representations of data for decision-making, and b) reducing gender bias in texts generated from language models.
The second part of the thesis studies privacy leakage and investigates methods for training private neural networks. Most of the results presented in this thesis have been published, and Table 1.1 maps each chapter to the corresponding paper.

Chapter 2 delves into the challenge of mitigating bias in discriminative models to ensure equal treatment, or demographic parity, between various demographic groups. We preprocess the data to learn representations that lead to fair outcomes. Traditional preprocessing approaches often attempt to make the data statistically independent of group membership or sensitive attributes (Zemel et al., 2013; Moyer et al., 2018; Louizos et al., 2016), but this approach might not provide a reasonable compromise between fairness and accuracy. Another common approach is to limit the power of an adversarial agent that aims to maximize parity (Jaiswal et al., 2020; Madras et al., 2018). However, representations obtained through adversarial approaches may still retain biases because they are only tested against an adversary of limited complexity during training. In this chapter, we establish that the statistical parity of any downstream classifier can be controlled by limiting the mutual information between representations and sensitive attributes or group information. Then, we propose an effective method for controlling parity based on contrastive information estimators, which outperforms existing approaches relying on variational bounds and complex generative models. We test our approach on the UCI Adult and Heritage Health datasets, demonstrating improved representation informativeness across a range of desired parity thresholds while providing stronger theoretical guarantees on any downstream classifier's statistical parity.

In Chapter 3, we address unfairness in foundation models. Specifically, we address the issue of gender bias in language models, particularly in the context of text generation. Language models have shown impressive performance at generating coherent text, but they can inadvertently perpetuate biases, including gender stereotypes (Mehrabi et al., 2020a; Dhamala et al., 2021; Sheng et al., 2019). Large language models can also be infeasible to run on personal devices with limited compute and memory; hence, model compression techniques are often used. In particular, knowledge distillation (Hinton et al., 2015), a technique used to compress and distill large models into smaller ones, can inadvertently transfer such biases from the teacher model to the distilled model if trained without any fairness constraints. To address this, we propose a novel approach leveraging counterfactual role reversal during knowledge distillation to impart unbiased information to the student model. We introduce two modifications to the base knowledge distillation process: altering the teacher model's probabilities and augmenting the training set. These modifications significantly reduce gender disparity in open-ended text generation while preserving utility. However, we also find that reducing gender polarity in language generation might not directly improve embedding fairness or downstream classifier fairness.

Chapter 4 highlights privacy concerns in deep learning approaches for neuroimaging, particularly in federated learning. Federated learning has been proposed as a solution to data privacy issues by exchanging model parameters for training rather than raw data.
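As a rough illustration of this parameter exchange, the server-side aggregation in federated averaging (in the spirit of McMahan et al., 2017, cited above) reduces to a weighted average of client parameters. The sketch below is illustrative only; the function and variable names are assumptions, not part of the thesis.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Aggregate client model parameters by a weighted average.

    client_weights: list of parameter vectors (np.ndarray), one per client.
    client_sizes: number of local training samples at each client.
    Only these parameter vectors, never the raw data, leave the clients.
    """
    total = float(sum(client_sizes))
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))
```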
However, we demonstrate the vulnerability of sharing or allowing access to the model through membership inference attacks. In particular, we show via these attacks that it is possible to infer whether a sample was used to train the model, given only access to the model prediction (black box) or access to the model itself (white box) and some leaked samples from the training data distribution. We demonstrate realistic membership inference attacks on deep learning models trained for 3D neuroimaging tasks in centralized as well as decentralized learning setups. We also discuss how the access assumptions can be satisfied easily. Our findings show the feasibility of such attacks, highlighting the need for robust privacy-preserving techniques in federated learning. Finally, we discuss two methods that can limit these attacks.

As mentioned, training large neural networks with differential privacy can be suboptimal because the noise variance scales with the number of parameters. Moreover, if the model is too large, computing per-sample gradients can be challenging (Subramani et al., 2021). To this end, Chapter 5 introduces a novel approach for parameter-efficient finetuning of pretrained language transformers. Finetuning has become a popular technique for adapting these models to specific tasks. However, it can still require a large number of tunable parameters, making it impractical for low-resource applications and training with privacy constraints (Yu et al., 2022). To overcome this limitation, we propose a strategy that introduces task-specific parameters in multiple transformer layers, derived from fixed random projections of a single trainable vector. This allows finetuning with significantly fewer parameters while maintaining performance close to full finetuning on GLUE tasks. Moreover, our method outperforms other parameter-efficient finetuning approaches that use a similar number of per-task parameters. The random projections can also be precomputed at inference, avoiding additional computational latency. The chapter concludes by showcasing our approach's effectiveness and potential real-world impact through experiments with privacy constraints. Our method achieves the best or comparable utility compared to several recent finetuning methods when training with the same privacy constraints.

Reducing the number of parameters by finetuning may not always be possible, especially when training models from scratch. In Chapter 6, we explore whether we can reduce the effective dimension of gradients by constructing a low-rank subspace that explains most of the gradients. While previous work has used public data to construct this low-rank subspace, we aim to eliminate this limitation and exploit previously released gradients to learn the subspace. We use the frequent directions algorithm (Ghashami et al., 2016) to learn this subspace in an online manner. We validate that neural networks can be trained over this subspace, at least in a non-private manner, without significant drops in performance. While previous works have demonstrated that gradient descent happens over a low-rank subspace by analyzing the trajectory of gradients (Gur-Ari et al., 2019), here we can quantify the drop in performance due to low-rank training. However, this did not improve differential private training. We make connections to the Gaussian denoising literature to explain this behavior.
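For reference, the core of the frequent directions algorithm mentioned above is a small sketch matrix that is updated as gradient vectors stream in. The following is a minimal, simplified variant (it frees only one row per shrink, whereas practical implementations shrink by a middle singular value to amortize the SVD cost); the class interface and names are assumed for illustration, not taken from Chapter 6.

```python
import numpy as np

class FrequentDirections:
    """Maintain a sketch B (ell x d) whose row space approximates the
    dominant subspace of the gradient vectors seen so far."""

    def __init__(self, dim, ell):
        self.sketch = np.zeros((ell, dim))
        self.next_row = 0

    def update(self, grad):
        if self.next_row == self.sketch.shape[0]:
            self._shrink()
        self.sketch[self.next_row] = grad
        self.next_row += 1

    def _shrink(self):
        # SVD of the sketch; shrink all singular values by the smallest one,
        # which zeroes out at least the last row and makes room for new gradients.
        _, s, vt = np.linalg.svd(self.sketch, full_matrices=False)
        s_shrunk = np.sqrt(np.maximum(s ** 2 - s[-1] ** 2, 0.0))
        self.sketch = s_shrunk[:, None] * vt
        self.next_row = int(np.count_nonzero(s_shrunk))

    def subspace(self, k):
        # Top-k right singular vectors span the learned low-rank gradient subspace.
        _, _, vt = np.linalg.svd(self.sketch, full_matrices=False)
        return vt[:k]
```

In the low-rank training setting described above, gradients would be projected onto the rows returned by `subspace` before being used to update the model.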
In conclusion, this thesis presents novel methods and theoretical insights into controlling bias, promoting equitability, and preserving privacy in various AI applications. These findings contribute to the ongoing efforts to build more responsible, equitable, and trustworthy AI systems, enabling their use in diverse real-world settings.

Other Contributions. I worked on several applications that inspired parts of this thesis but are not included here. In particular, in Gupta et al. (2021b), we propose a new neural network architecture that improves the state of the art for predicting brain age from 3D MRI scans. Gupta et al. (2023a) further improved this model by exploiting pretraining on ImageNet data (natural images). These efforts eventually motivated the adoption of neuroimaging data for Chapter 4.

While federated learning does not protect against privacy attacks on models, it is still an excellent approach for learning from distributed datasets. To this end, we enhanced federated learning with homomorphic encryption to protect against model or intellectual-property stealing attacks in Stripelis et al. (2021b). In Stripelis et al. (2022a) and Stripelis et al. (2022c), we explore model sparsification techniques to reduce the communication cost of federated training. We combine all these approaches and introduce a federated learning framework in Stripelis et al. (2022b). In Mehrabi et al. (2022a), we introduce a post-processing-based fair learning approach that identifies the attention weights responsible for most unfair decisions. We show that our approach, which works by removing biased attention weights, can produce competitive fairness-accuracy tradeoffs. In Garg et al. (2022a), we relate transfer entropy to mutual information and propose a new transfer entropy estimator for time series datasets with long-range dependencies. A list of relevant publications follows.

List of Publications

Umang Gupta, Aaron M Ferber, Bistra Dilkina, Greg Ver Steeg. "Controllable guarantees for fair outcomes via contrastive information estimation". Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. 2021, pp. 7610–7619. URL: https://ojs.aaai.org/index.php/AAAI/article/view/16931/

Umang Gupta, Pradeep Lam, Greg Ver Steeg, Paul Thompson. "Improved Brain Age Estimation with Slice-based Set Networks". IEEE International Symposium on Biomedical Imaging (ISBI). 2021. URL: https://ieeexplore.ieee.org/document/9434081/

Umang Gupta, Dimitris Stripelis, Pradeep K Lam, Paul Thompson, Jose Luis Ambite, Greg Ver Steeg. "Membership inference attacks on deep regression models for neuroimaging". Medical Imaging with Deep Learning. 2021, pp. 228–251. URL: https://proceedings.mlr.press/v143/gupta21a.html

Dimitris Stripelis, Hamza Saleem, Tanmay Ghai, Nikhil Dhinagar, Umang Gupta, Chrysovalantis Anastasiou, Greg Ver Steeg, Srivatsan Ravi, Muhammad Naveed, Paul M Thompson, et al. "Secure neuroimaging analysis using federated learning with homomorphic encryption". 17th International Symposium on Medical Information Processing and Analysis. Vol. 12088. 2021, pp. 351–359. URL: https://doi.org/10.1117/12.2606256

Umang Gupta, Jwala Dhamala, Varun Kumar, Apurv Verma, Yada Pruksachatkun, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Greg Ver Steeg, Aram Galstyan. "Mitigating Gender Bias in Distilled Language Models via Counterfactual Role Reversal". Findings of the Association for Computational Linguistics: ACL 2022. Dublin, Ireland: Association for Computational Linguistics, 2022, pp. 658–678.
URL: https://aclanthology.org/2022.findings-acl.55

Sahil Garg, Umang Gupta, Yu Chen, Syamantak Datta Gupta, Yeshaya Adler, Anderson Schneider, Yuriy Nevmyvaka. "Estimating transfer entropy under long ranged dependencies". Uncertainty in Artificial Intelligence. 2022, pp. 685–695. URL: https://proceedings.mlr.press/v180/garg22a.html

Ninareh Mehrabi, Umang Gupta, Fred Morstatter, Greg Ver Steeg, Aram Galstyan. "Attributing Fair Decisions with Attention Interventions". Proceedings of the 2nd Workshop on Trustworthy Natural Language Processing (TrustNLP 2022). Seattle, U.S.A.: Association for Computational Linguistics, 2022, pp. 12–25. URL: https://aclanthology.org/2022.trustnlp-1.2

Dimitris Stripelis, Umang Gupta, Nikhil Dhinagar, Greg Ver Steeg, Paul M Thompson, José Luis Ambite. "Towards Sparsified Federated Neuroimaging Models via Weight Pruning". International Workshop on Distributed, Collaborative, and Federated Learning. 2022, pp. 141–151. URL: https://link.springer.com/chapter/10.1007/978-3-031-18523-6_14

Dimitris Stripelis, Umang Gupta, Hamza Saleem, Nikhil Dhinagar, Tanmay Ghai, Rafael Sanchez, Chrysovalantis Anastasiou, Armaghan Asghar, Greg Ver Steeg, Srivatsan Ravi, et al. "Secure Federated Learning for Neuroimaging". arXiv preprint arXiv:2205.05249 (2022). URL: https://arxiv.org/abs/2205.05249

Dimitris Stripelis, Umang Gupta, Greg Ver Steeg, Jose Luis Ambite. "Federated Progressive Sparsification (Purge-Merge-Tune)+". Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with NeurIPS 2022). 2022. URL: https://openreview.net/forum?id=GLQqPTRrQMx

Umang Gupta, Tamoghna Chattopadhyay, Nikhil Dhinagar, Paul M Thompson, Greg Ver Steeg, The Alzheimer's Disease Neuroimaging Initiative. "Transferring Models Trained on Natural Images to 3D MRI via Position Encoded Slice Models". IEEE International Symposium on Biomedical Imaging (ISBI). 2023. URL: https://ieeexplore.ieee.org/abstract/document/10230660

Umang Gupta, Aram Galstyan, Greg Ver Steeg. "Jointly Reparametrized Multi-Layer Adaptation for Efficient and Private Tuning". Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguistics, 2023, pp. 12612–12629. URL: https://aclanthology.org/2023.findings-acl.799

Chapter 2: Controllable Guarantees for Fair Outcomes via Contrastive Information Estimation

Decisions based on biased data can promote biased outcomes. Learning algorithms often exploit and exaggerate biases present in the training dataset. One way to prevent algorithms from reproducing bias in the data is to pre-process the data so that information about protected attributes is removed (Song et al., 2019; Madras et al., 2018; McNamara et al., 2017). Ideally, users of this transformed data or data representations can focus on maximizing performance for their tasks using any available method without the risk of producing unfair outcomes (Cisse et al., 2019). The strongest requirement for fair representations is to be statistically independent of sensitive attributes, but this may lead to large drops in predictive performance, as sensitive attributes are often correlated with the target. Therefore, it is desirable to produce representations that can trade off some measure of fairness (e.g., statistical parity in this work) with utility (Menon et al., 2018; Dutta et al., 2019).

Many approaches for learning fair representations have leveraged adversarial learning to remove unwanted biases from the data by limiting the ability of an adversary to reconstruct sensitive attributes from the representation during training (Jaiswal et al., 2020; Roy et al., 2019b; Song et al., 2019; Madras et al., 2018; Edwards et al., 2016). While adversarial methods have been shown to be useful in learning fair representations, they are often limited by the adversary's model capacity. A more powerful adversary than the one used during training may reveal hidden biases that are present in the representations (Xu et al., 2020). As a result, a model trained to control fairness against one adversary has no guarantee of controlling fairness against an arbitrary adversary. Other methods for learning fair representations focus on inducing statistical independence (Moyer et al., 2018; Louizos et al., 2016). While the fairness guarantees of these methods are agnostic to the downstream algorithms that will use these representations, including adversarial attempts to exploit bias, they are inefficient at trading off between fairness and informativeness in the representation, which is often desirable for fairness applications. Some of these methods rely on complex generative modeling of data to discover representations that are invariant to the protected attributes (Moyer et al., 2018; Song et al., 2019) while preserving as much information as possible about the data. However, the quality of the generative model is a performance bottleneck. We summarize some of these approaches and their properties in Table 2.1.

Table 2.1: Fair representation learning methods.

Representative Methods   Adversarial Guarantee   Controllable Parity
Song et al. (2019)       Weak [3]                Heuristic
Moyer et al. (2018)      Strong                  No [4]
Madras et al. (2018)     None                    Heuristic
Roy et al. (2019b)       None                    No
Ours                     Strong                  Provable

[3] Song et al. (2019) minimize two different bounds on I(z : c): one is a very loose upper bound, and another uses adversarial learning. So the adversarial guarantee is unclear or, at best, weaker.
[4] Moyer et al. (2018) designed their method for enforcing independence; however, we consider a modification of their method for controlling parity, based on our Theorem 2.2.

In this work, we focus on a widely used fairness measure known as statistical parity, though other measures may be more appropriate for specific problems. The difference in outcomes for two groups, or parity, is often used to quantify the fairness of a decision procedure. It has also been shown to best capture people's perception of fairness (Srivastava et al., 2019). One of the main objectives of fair representation learning is to limit the parity of any possible downstream decision algorithm. To this end, we relate the parity of any possible decision algorithm to an algorithm-agnostic measure of dependence. We show that we can provably control the parity of any possible decision algorithm that acts only on these representations by limiting the mutual information between the representations and the sensitive attributes. Estimating and bounding mutual information from data is a challenging and active field of research (Poole et al., 2019; Brekelmans et al., 2021). We propose practical ways to limit the mutual information between the representations and the sensitive attributes via contrastive estimates of conditional mutual information, thereby bypassing the need for complex generative models that require explicit assumptions about the input distribution.
Contrastive information measures have demonstrated state-of-the-art performance on many representation learning tasks (Oord et al., 2018; Veličković et al., 2019; Mikolov et al., 2013), but not for fair representation learning. We evaluate our approach on two fairness benchmark datasets, UCI Adult and Heritage Health, and show that representations provided by our method preserve more information at the desired fairness threshold compared to other adversarial as well as non-adversarial baselines. Our main contributions are: a) we theoretically show that the mutual information between the representation and the sensitive attributes bounds the parity of any decision algorithm, and b) we propose practical ways to limit mutual information, leveraging contrastive information estimators that can efficiently trade off predictability and accuracy.

2.1 Mutual Information Bounds Parity

We consider a dataset of triplets D = {x_i, y_i, c_i}_{i=1}^{N}, where x_i, y_i, c_i are iid samples from the data distribution p(x, y, c). c are the sensitive or protected attributes, y is the label, x are the features of the sample, which may include the sensitive attributes, and ŷ denotes the predicted label according to some algorithm. We may also interpret ŷ as the outcome of some decision procedure. We use bold letters to denote random variables, and the regular font represents the corresponding samples. We use I(a : b) to denote the mutual information between two variables a and b. It is defined as the KL divergence between the joint distribution p(a,b) and the product of marginal distributions p(a)p(b), i.e.,

I(a : b) = KL(p(a,b) ∥ p(a)p(b))

In this work, we consider stochastic representations of data, i.e., z(x) ∼ q(z|x = x). We learn d-dimensional representations z of input x such that any classifier trained on only z is guaranteed to be fair, i.e., it has parity within some δ′. In this work, we focus on statistical parity, a popular measure of group fairness, defined as:

Definition 2.1 (Statistical Parity; Dwork et al., 2012). It is the absolute difference between the selection rates of two groups. Mathematically,

∆_DP(A, c) = |P(ŷ = 1 | c = 1) − P(ŷ = 1 | c = 0)|

where ŷ denotes decisions produced by some decision algorithm A. When there are more than two groups, we define statistical parity to be the maximum parity between any two groups (as implemented in Bird et al. (2020)).

Statistical parity is an algorithm-dependent measure of fairness, whereas we require our representations to produce fair outcomes with any downstream decision algorithm. Bearing this in mind, we show that the mutual information between the representations and the protected attributes, denoted as I(z : c), can be used to limit the statistical parity of any downstream decision algorithm via the following result.

Theorem 2.2. For some z, c ∼ p(z, c), z ∈ R^d, c ∈ {0, 1}, and any decision algorithm A that acts on z, we have

I(z : c) ≥ g(π, ∆_DP(A, c))

where π = P(c = 1) and g is a strictly increasing, non-negative convex function in ∆_DP(A, c).

The proof of the above statement, the expression for g, and the generalization of the theorem to multinomial c are discussed in Section A.1. We visualize Theorem 2.2 in Figure 2.7. We know that if I(z : c) = 0, then ∆_DP(A, c) = 0, and by Theorem 2.2, g(π, ∆_DP(A, c)) will also be 0. I(z : c) upper bounds the function g, and since g is a strictly increasing convex function in ∆_DP(A, c), ∆_DP(A, c) will also be bounded. As a result, if z is a representation with bounded mutual information with c, then any algorithm relying only on z to make a decision will also have bounded parity. Intuitively, by reducing I(z : c), we can decrease ∆_DP.
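For concreteness, Definition 2.1 (with the max-over-groups extension mentioned above) can be computed directly from a classifier's decisions and the group labels. The snippet below is a minimal sketch with illustrative data and variable names.

```python
import numpy as np

def statistical_parity(y_hat, c):
    """Delta_DP: maximum difference in selection rates P(y_hat = 1 | c) across groups."""
    rates = [np.mean(y_hat[c == group]) for group in np.unique(c)]
    return max(rates) - min(rates)

# Example: decisions favoring group 1 (selection rates 0.25 vs. 0.75, so parity is 0.5).
y_hat = np.array([1, 0, 0, 0, 1, 1, 1, 0])
c     = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(statistical_parity(y_hat, c))  # 0.5
```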
We remark that I(z : c) has been used as a proxy objective to control statistical parity previously (Edwards et al., 2016; Song et al., 2019; Moyer et al., 2018). It is often justified via the data processing inequality and the intuition that both statistical parity and mutual information are measures of dependence. However, using the data processing inequality, we can only guarantee that if we limit information about c in z, then no subsequent operation on z can increase information about c, i.e., I(ŷ : c) ≤ I(z : c); but this fact alone implies nothing about statistical parity. Our result (Theorem 2.2) demonstrates that limiting I(z : c) will monotonically limit statistical parity, which had not been theoretically demonstrated until now.

2.2 Practical Objectives for Controlling Parity

Equipped with an algorithm-agnostic upper bound on parity, we now discuss practical objectives for learning fair representations. Along with limiting parity, we also want the latent representation to be highly predictive (informative) about the label, which is often realized by maximizing the mutual information between y and z, i.e., I(y : z), implicitly (Edwards et al., 2016; Madras et al., 2018; Jaiswal et al., 2020) or explicitly (Moyer et al., 2018):

O_1:  max_q I(y : z)  s.t.  I(z : c) ≤ δ,   or   max_q I(y : z) − β I(z : c)    (2.1)

where z, x ∼ q(z|x)p(x) and β > 0 is a hyperparameter.

2.2.1 Interference between I(y : z) and I(z : c)

While I(y : z) has been commonly proposed as a criterion to enforce the desideratum that representations be informative about labels, we argue that when the data is biased, i.e., I(y : c) > 0, maximizing I(y : z) is in direct contradiction with minimizing I(z : c). To illustrate this point, we refer to the information Venn diagram in Figure 2.1. The goal of fair representation learning is to move the circle representing the information in the representation, z, to have high overlap with y and low overlap with c. However, there is a conflict in the gray region, where we cannot increase overlap with y without also increasing overlap with c. In our experiments, we observe that this conflict hurts model performance and makes it hard to achieve lower parity values at a reasonable accuracy (Figure 2.6b). However, this conflict is not necessary. Since fair learning aims to capture information about y that is not related to the protected attribute c, we want to maximize the overlap between z and the region of y that excludes c. This quantity is precisely the conditional mutual information, I(y : z|c), which we propose to maximize instead of I(y : z). This leads us to the following objective:

O_2:  max_q I(y : z|c)  s.t.  I(z : c) ≤ δ,   or   max_q I(y : z|c) − β I(z : c)    (2.2)

where z, x ∼ q(z|x)p(x) and β > 0 is a hyperparameter. Eq. (2.2) defines our approach, but the information-theoretic terms that appear are challenging to estimate directly. In the next sections, we derive practical variational bounds for these terms.

Figure 2.1: Venn diagram to show interference between I(y : z) and I(z : c).

Before proceeding, we briefly discuss an ambiguity in the information-theoretic argument above. The triple-overlap region of the information Venn diagram, sometimes called the "interaction information", can be negative (Williams et al., 2010).
This corresponds to the case where z and c have synergistic information about y: their combination allows us to predict y, even though individually they may not be predictive at all. The classic example of a synergistic interaction is the binary XOR relationship among three variables (example given in Section B.1). In that case, no variable can predict any other, but knowing any two variables perfectly predicts the third. If synergies are present, we may be able to attain a large value of I(y : z | c) while I(y : z) = 0. In other words, maximizing I(y : z | c) may include synergistic information, even if it hurts the ability to predict y from z alone. While maximizing I(y : z) will not include synergistic information, it will prefer large positive interaction information. Positive interaction information can be interpreted as increasing the redundant information shared among z, y, c, even though this conflicts with our goal of excluding information about protected attributes. Different choices of objective will alter the prioritization of various multivariate information terms, but ultimately our choice is justified by the empirical performance shown in Section 2.4.

2.2.2 Lower bounds for I(y : z) and I(y : z | c)

Fact 2.3. For any distribution r(a | b) and a, b ∼ p(a, b),

H(a | b) = − E_{a,b} log r(a | b) − KL( p(a | b) ∥ r(a | b) ),

and therefore H(a | b) ≤ − E_{a,b} log r(a | b), with equality when r(a | b) = p(a | b).

By a direct application of the above identity, we have:

I(y : z) = H(y) − H(y | z) ≥ H(y) + max_r E_{y,z} log r(y | z),    (2.3)

and similarly,

I(y : z | c) ≥ H(y | c) + max_r E_{y,z,c} log r(y | z, c).    (2.4)

H(y) and H(y | c) are properties of the data and are therefore constant from the optimization perspective. When y is a one-dimensional variable denoting the target class, this is equivalent to minimizing cross-entropy. To this end, we parametrize the variational distribution r using a neural network with parameters ψ, but other models can also be used.

2.2.3 Upper bound for I(z : c)

Our technique for upper-bounding I(z : c) is similar to Moyer et al. (2018) and makes use of the following observation:

I(z : c) = I(z : c | x) + I(z : x) − I(z : x | c).    (2.5)

I(z : c | x) = 0, as z is a function of x and some independent noise. As a result, we have I(z : c) = I(z : x) − I(x : z | c). The first term is the information bottleneck term (Alemi et al., 2017) that limits the information about x in z; we bound it by specifying a prior over z. The second term preserves the information about x that is not shared with c; we lower bound it via contrastive estimation.

2.2.3.1 Upper bound for I(z : x) by specifying a prior

In order to upper-bound I(z : x), we use the following observation:

Fact 2.4. For any a, b ∼ p(a, b) and distribution q(a),

I(a : b) = E_{a,b} log [ p(a | b) / q(a) ] − KL( p(a) ∥ q(a) ),

and therefore,

I(a : b) ≤ E_{a,b} log [ p(a | b) / q(a) ] = E_b KL( p(a | b) ∥ q(a) ),    (2.6)

with equality when p(a) = q(a).

Therefore, we have

I(z : x) ≤ E_x KL( q(z | x; φ) ∥ p(z) ),    (2.7)

where p(z) is any distribution. This is similar to the rate term in a VAE or information bottleneck approach (Alemi et al., 2017; Higgins et al., 2017). Motivated by this similarity, we let p(z) be a standard normal distribution and q(z | x; φ) be a diagonal Gaussian distribution whose mean and variance are parametrized as µ(x) = f_µ(x; φ), Σ(x) = f_Σ(x; φ). Other parameterizations like normalizing flows (Rezende et al., 2015) or echo noise (Brekelmans et al., 2019) can be used too.
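For concreteness, a minimal sketch of the rate term in Eq. (2.7) under this Gaussian parameterization is given below, assuming a PyTorch-style encoder that outputs the per-dimension mean and standard deviation of q(z | x; φ); the function name is illustrative.

```python
import torch

def rate_term(mu, sigma):
    """E_x KL( N(mu(x), diag(sigma(x)^2)) || N(0, I) ): the upper bound on I(z : x) from Eq. (2.7).

    mu, sigma: tensors of shape (batch_size, d) produced by the encoder networks f_mu, f_Sigma.
    """
    kl_per_dim = 0.5 * (-2.0 * torch.log(sigma) + sigma ** 2 + mu ** 2 - 1.0)
    return kl_per_dim.sum(dim=1).mean()

# Example with a batch of 4 samples and d = 8 latent dimensions.
mu = torch.randn(4, 8)
sigma = torch.nn.functional.softplus(torch.randn(4, 8)) + 1e-6  # strictly positive std. dev.
print(rate_term(mu, sigma))
```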
2.2.3.2 Lower bound for I(x : z | c) via contrastive estimation

We propose to lower bound I(x : z | c) via contrastive estimation of mutual information, using the following proposition to derive our estimator.

Proposition 2.5. (Poole et al., 2019) For any u, v, w ∼ p(u, v, w), ũ ∼ p(u | w), and function f, we have

I(u : v | w) ≥ E_{u,v,w} log [ e^{f(u,v,w)} / ( (1/M) Σ_{j=1}^M e^{f(ũ_j, v, w)} ) ],    (2.8)

where u, ũ ∈ U, v ∈ V, w ∈ W, f : U × V × W → R, and M is the number of samples drawn from p(u | w).

The proof is similar to the unconditional version (Poole et al., 2019); for completeness, we present it in Section A.2. As a direct application of this result, we can lower-bound and maximize I(z : x | c). There is a caveat, however: we need to sample from p(z | c). Sampling from this conditional distribution can be hard in general, but it is easily accomplished for our problem. For fairness applications, c is typically a discrete random variable with low cardinality — often a binary variable. Therefore, we can treat {z_j : (z_j, c_j = i)} as samples from p(z | c = i). In our experiments, we parametrize f(z, x, c) as a bilinear function (similar to Oord et al. (2018)).

2.2.4 Overall Variational Objective

Using Eq. (2.5), we can write our objective, i.e., Eq. (2.2), in terms of three information-theoretic quantities:

max_q I(y : z | c) − β [ I(z : x) − I(z : x | c) ].

The first term is lower-bounded using Eq. (2.4) and is the same as the cross-entropy loss. The second term is upper-bounded using Eq. (2.7); for this, we set p(z) to be a standard normal distribution and q(z | x) to be a normal distribution whose parameters are generated by a neural network, so the KL divergence in Eq. (2.7) can be expressed in closed form. Finally, the last term is lower-bounded using the contrastive conditional mutual information estimator from Proposition 2.5 (Eq. (2.8)). We call our objective the Fair Contrastive Representation Learner (FCRL). Objective (2.1) can be similarly computed and optimized using Eqs. (2.3), (2.7) and (2.8). We provide comprehensive details on computing the proposed variational objective in Section 2.3.

Even though objectives (2.1) and (2.2) require only a single parameter β to be varied, we found in our experiments that it is necessary to use different multipliers for I(x : z | c) and I(z : x) to effectively trade off fairness and accuracy. Therefore, we introduce another parameter λ; the I(z : c) term in both objectives then becomes I(z : x) − λ I(x : z | c). We found that λ = 2 performs better empirically, so we set λ = 2 unless specified otherwise.

2.2.5 Alternate bounds for I(z : c) and their caveats

One may observe that I(z : c) could also be upper-bounded as a direct consequence of Eq. (2.7). This is possible; however, upper-bounding I(z : c) simply by introducing a prior marginal p(z) as in Eq. (2.6) is not tractable, as it would require computing p(z | c) (Song et al., 2019). Therefore, it is necessary to decompose I(z : c) as the difference of two information-theoretic quantities (Eq. (2.5)).

2.2.5.1 Reconstruction-based Bound

Another way to bound I(x : z | c) is to use the lower bound from Eq. (2.4), which leads to a conditional reconstruction term:

I(x : z | c) ≥ H(x | c) + max_θ E_{x,z,c} log q(x | z, c; θ),    (2.9)

where q(x | z, c; θ) is some distribution parametrized by θ and H(x | c) is a constant.
The last term can be seen as reconstructing x from c and its latent representation z, and it is similar to the distortion term in a VAE, albeit with conditioning on c (Moyer et al., 2018). While this is a tractable bound, it involves training a conditional decoder, which may be hard for certain domains and almost always requires either complex models or the restrictive assumption that feature dimensions are independent given z. It also requires explicitly stating or modelling the distribution of x, which is bypassed by contrastive estimation. In our experiments, we found this approximation to be limiting (Figure 2.6a); therefore, we propose a decoder-free variational approximation via contrastive estimation of mutual information.

2.2.5.2 Adversarial approach

One may also use Eq. (2.3) to approximate I(z : c), which leads to a common technique of bounding I(z : c) via adversarial learning. This is a lower bound on I(z : c), but since we want to minimize I(z : c), it should ideally be upper-bounded. Nevertheless, Eq. (2.3) is often used and leads to adversarial min-max approaches (Madras et al., 2018; Edwards et al., 2016; Song et al., 2019), where the maximization is over the parameters of the classifier r(c | z) and the minimization is over q(z | x). Beyond the difficulty of optimization, adversarial approaches have a fundamental problem: they minimize a lower bound at each iteration. The gap between mutual information and its lower bound can be arbitrarily bad if the set of adversarial classifiers is restricted, as it must be in practice. Moreover, rather than using exact maximization, most methods approximate it with one or a few steps of SGD. This also calls into question one of the common approaches for evaluating invariant representations, namely predicting c from z. Indeed, in our experiments, we find that this evaluation can be misleading and may pass unfair representations as fair (see Section 2.4.2).

2.3 FCRL Implementation Details

Using Eq. (2.5), we can write objective (2.2) as:

I(y : z | c) − β I(z : c) = I(y : z | c) − β ( I(z : x) − I(x : z | c) )
                          = I(y : z | c) + β I(x : z | c) − β I(z : x).

We lower bound I(y : z | c) and I(x : z | c) using the bounds from Eq. (2.4) and Eq. (2.8), respectively, and upper bound I(z : x) using Eq. (2.7). Introducing parameters φ and ψ, we have:

I(y : z | c) − β I(z : c) ≥ H(y) + E_{y,z,c} log r(y | z, c; ψ)
    + β E_{z,x,c} [ f(z, x, c) − log (1/M) Σ_{j=1}^M e^{f(z̃_j, x, c)} ]
    − β E_x KL( q(z | x; φ) ∥ p(z) ).    (2.10)

H(y) is a constant. As y is a binary variable, the E_{y,z,c} log r(y | z, c; ψ) term is implemented as the binary cross-entropy loss. Similar to Oord et al. (2018), we use a bilinear function to parametrize f:

exp{ f(z, x, c) } = soft-plus( (W_z z)^T W_c^T e(x; θ′) ),

where W_z, W_c, θ′ are learnable parameters. We let q(z | x; φ) be a diagonal Gaussian distribution whose mean and variance are computed by a neural network parametrized by φ, and p(z) = N(0, I). The KL divergence term can thus be computed in closed form:

E_x KL( q(z | x; φ) ∥ p(z) ) = E_x Σ_{k=1}^d (1/2) [ −2 log σ(x)_k + σ(x)_k^2 + µ(x)_k^2 − 1 ],    (2.11)

where z is d-dimensional and µ(x)_k, σ(x)_k denote the mean and standard deviation of the k-th component. We describe the computation of this objective for a batch in Algorithm 1.

Algorithm 1 Computation steps for objective (2.2). B is the batch size. NN(· ; θ) denotes a neural network parameterized by θ. X_i denotes the i-th sample from the batch X. The implementation is available at https://github.com/umgupta/fairness-via-contrastive-estimation.

Input: Batch data (X, Y, C).
Output: Batch loss

▷ Encode input to get the distribution parameters of q(z | x; φ) and sample
µ(X), σ(X) = NN(X; φ)
ε = numpy.random.randn(B, d)    ▷ ε ∼ N(0, I)
Z = µ(X) + ε ∗ σ(X)

▷ Compute label loss
Ŷ = NN(Z, C; ψ)
label-loss = cross-entropy(Ŷ, Y)

▷ Compute I(z : x) term using Eq. (2.11)
rate = (1 / 2B) Σ_{i=1}^B Σ_{k=1}^d [ −2 log σ(X_i)_k + σ(X_i)_k^2 + µ(X_i)_k^2 − 1 ]

▷ Compute contrastive loss
Z′ = NN(X; θ′)    ▷ transformation e(x; θ′)
Z′′ = W_z Z    ▷ transformation W_z z
for i in {0 ... B − 1} do    ▷ iterate over each sample
    Z̃′′ = {Z′′_j : c_j = c_i}    ▷ matrix of all Z′′_j such that c_j = c_i
    M = |{Z′′_j : c_j = c_i}|
    contrastive-loss += (1/B) [ log soft-plus(Z′′_i^T W_c^T Z′_i) − log (1/M) Σ_j soft-plus(Z̃′′_j^T W_c^T Z′_i) ]
end for

loss = label-loss − β × contrastive-loss + β × rate

2.4 Experiments

Datasets. We validate our approach on two datasets — UCI Adult (Dua et al., 2017) and the Heritage Health dataset (https://www.kaggle.com/c/hhp). UCI Adult is 1994 census data with 30K samples in the train set and 15K samples in the test set. The target task is to predict whether income exceeds $50K, and the protected attribute is gender (binary in this dataset). We use the same preprocessing as Moyer et al. (2018). The Heritage Health dataset contains records of around 51K patients (40K in the train set and 11K in the test set), and the task is to predict the Charlson Index, an indicator of a patient's 10-year survival. We consider age as the protected attribute, which has 9 possible values. We use the same preprocessing as Song et al. (2019).

Figure 2.2: Parity vs. accuracy trade-off for the UCI Adult and Heritage Health datasets using a 1-hidden-layer MLP. Lower ∆_DP is better, and higher accuracy is better. We use representations generated by varying each method's inherent loss hyperparameters to get different points. See Table 2.2 for quantitative results. Results with other decision algorithms are shown in Section B.2.

Evaluation Procedure. A fair representation learning algorithm aims to produce representations such that any downstream decision algorithm that uses them will produce fairer results. Therefore, similar to Madras et al. (2018), we train the representation learning algorithm on the training data and evaluate the representations by training classifiers for downstream prediction tasks. Since our purpose is to assess the representations, we report the average accuracy (as an indicator of most likely performance) and the maximum parity (as an indicator of worst-case bias) computed over 5 runs of the decision algorithm with random seeds. Unlike Madras et al. (2018), we also allow preprocessing of the representations. Preprocessing steps like min-max or standard scaling are common and often precede classifier training in a regular machine learning pipeline. The held-out test set is used to evaluate representations on downstream tasks only; we use the training set for all the other steps. The overall procedure can be summarized with the following steps:
• Learn Representations / Encoder.
Train each method on the train set to learn the representations or the encoder.
• Generate Representations. Obtain representations for each sample in the train and test sets.
• Evaluation via Downstream Classification Task. To evaluate the representations, we simulate the scenario of representations being used for a downstream classification task with the following steps. The train, test, and validation sets below refer to the representations of the train, test, and validation sets generated by the encoder.
  – Preprocessing. In the preprocessing step, we scale the train and test sets using standard scaling. The scaling statistics are computed from the train set only. We leverage scikit-learn's (Pedregosa et al., 2011) preprocessing module for this; any other preprocessing is valid too.
  – Fit a classifier. We fit a classifier to the train set. Some classifiers (for instance, an MLP) may use early stopping and therefore need a validation set. In such cases, the validation set is derived by randomly splitting the train set and keeping 20% of it for validation.
  – Evaluation. We use the test set to evaluate the ∆_DP and accuracy of the representations. The test set is used only to evaluate the representations.

Baselines. We compare with a number of recent approaches, including information-theoretic and adversarial methods from the recent literature. MIFR (Song et al., 2019) is a recent approach demonstrating competitive performance; it combines information-theoretic and adversarial methods and generalizes several previous fair representation learning approaches (Louizos et al., 2016; Edwards et al., 2016; Madras et al., 2018; Zemel et al., 2013). A related approach, CVIB (Moyer et al., 2018), is based solely on information-theoretic optimization without adversarial training. We also compare with recent state-of-the-art adversarial methods. In particular, we compare with Adversarial Forgetting (Jaiswal et al., 2020), the state-of-the-art adversarial method for learning invariant representations. Roy et al. (2019b) (MaxEnt-ARL) propose a theoretically superior min-max objective that trains the encoder to maximize the entropy of sensitive attributes; in contrast, traditional adversarial approaches try to minimize the likelihood of the discriminator. Finally, we compare with LAFTR (Madras et al., 2018), which minimizes a more directed adversarial approximation to parity and was designed with parity-accuracy trade-offs in mind; however, LAFTR is only applicable when c is a binary variable. As a baseline, we also train a one-hidden-layer MLP predictor directly on the data without regard to fairness (Unfair MLP).

We visualize the trade-offs between fairness and task performance by plotting parity vs. accuracy curves for each representation learning algorithm. We vary each method's inherent hyperparameters over the ranges specified in the original works to get different points on this curve. For a fair comparison, we set d = 8 for all the methods and use model components like the encoder, decoder, etc., of the same complexity. We give a quantitative summary of performance across the entire spectrum of trade-offs by reporting the area over the parity-accuracy curve. This section uses a 1-hidden-layer MLP with ReLU non-linearity and 50 neurons in the hidden layer as the decision algorithm, and representations are preprocessed by standard scaling.
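For concreteness, the downstream evaluation described above can be sketched with scikit-learn roughly as follows; function and variable names are ours, and this is a simplified sketch of the protocol rather than the exact released code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

def evaluate_representations(z_train, y_train, z_test, y_test, c_test, seed=0):
    # Preprocessing: scaling statistics are computed from the train representations only.
    scaler = StandardScaler().fit(z_train)
    z_train, z_test = scaler.transform(z_train), scaler.transform(z_test)

    # Decision algorithm: 1-hidden-layer MLP with 50 ReLU units; early stopping
    # holds out 20% of the train set as a validation split.
    clf = MLPClassifier(hidden_layer_sizes=(50,), activation="relu",
                        early_stopping=True, validation_fraction=0.2,
                        random_state=seed)
    clf.fit(z_train, y_train)
    y_hat = clf.predict(z_test)

    accuracy = np.mean(y_hat == y_test)
    selection_rates = [np.mean(y_hat[c_test == g]) for g in np.unique(c_test)]
    parity = max(selection_rates) - min(selection_rates)  # max Delta_DP over any two groups
    return accuracy, parity
```

In the experiments, the reported accuracy and ∆_DP aggregate 5 such runs of the decision algorithm with different random seeds.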
Results with other decision algorithms, i.e., Random Forest, SVM, 2-hidden-layer MLP, and logistic regression, are shown in Section B.2.

2.4.1 Improved Accuracy Versus Parity Trade-offs

For different fair representation learners, we compare the accuracy versus parity achieved for a specific downstream classifier. The goal is to push the frontier of achievable trade-offs as far to the bottom-right as possible, i.e., to achieve the best possible accuracy while maintaining a low parity. From a visual inspection of Figure 2.2, we can see that our approach preserves more information about the label y across a range of fairness thresholds for both the UCI Adult and Heritage Health datasets. We observed improved trade-offs for our method even when different downstream classifiers are used (results in Section B.2), matching expectations since our method gives theoretical bounds on parity that hold regardless of the classifier used. In contrast, methods that use adversaries in training see increases in parity when a more powerful classifier is used for downstream classification.

Metric Quantifying Performance Across Trade-offs. Fair representation learning approaches often provide a single number to quantify fairness by predicting c from z at a single accuracy level, or only qualitatively demonstrate results by showing the parity-accuracy curve. We report the normalized area over the parity-accuracy curve in Table 2.2 (1 being the maximum value) to enable quantitative comparison. It is the normalized hypervolume of the feasible parity-accuracy region, and an efficient fair representation learning method should maximize it. We provide more intuition and details on computing this metric in Section B.4.

Method                  UCI Adult   Heritage Health
FCRL (Ours)             0.307       0.319
CVIB                    0.182       0.191
MIFR                    0.251       0.202
MaxEnt-ARL              0.144       0
LAFTR                   0.155       N/A
Adversarial Forgetting  0.087       0.156
Table 2.2: Area over the parity-accuracy curve (higher is better).

Figure 2.3: Parity and accuracy variation with β for the UCI Adult dataset. Our method can explore feasible regions of parity and accuracy by varying only a single parameter β.

Controlling Parity. Our approach has a single intuitive hyperparameter β, which can be used to control I(z : c) directly and therefore, via Theorem 2.2, to monotonically control parity (see Figure 2.3). For the UCI Adult dataset, MIFR (Song et al., 2019) is competitive with our approach near low parity; however, it fails to achieve higher accuracy at higher demographic parity. This is because MIFR uses I(z : x) as an upper bound on I(z : c), which is very loose and penalizes information about x as well, which is not desirable. CVIB (Moyer et al., 2018) is able to consistently trade off accuracy while using a reconstruction-based bound, but it maximizes I(y : z), which conflicts with the desired minimization of I(z : c) (see Section 2.2.1).

Fine-tuning the Parity. To achieve different points on the parity-accuracy trade-off, one must train models with different loss hyperparameters (β in our case). To a practitioner, parity-accuracy curves are essential, as they help in deciding which representations should be used to satisfy the desired fairness constraints.
The additional computational cost of training models, as well as the introduction of additional hyperparameters for fair learning, pose significant hindrances to the adoption of fair models (Sylvester et al., 2018). Recently, Gao et al. (2020) showed that it is possible to explore different equilibrium points on the rate-distortion curve with iso-accuracy constraints by adjusting the loss coefficients according to some dynamical process and finetuning the model parameters. Similarly, we show that we can finetune a trained model to produce different points on the parity-accuracy curve. Since our loss function has a single coefficient (β), we do not need to derive the dynamical process: we can simply vary β in sufficiently small steps and explore the parity-accuracy trade-off by finetuning the trained model. Our method performs equally well when finetuning compared to training from scratch (see Figure 2.4), reducing the computational cost drastically. We expect that the reduced computational cost of obtaining the parity-accuracy curve and the introduction of only a single intuitive hyperparameter, along with the improved parity-accuracy trade-off, should reduce resistance towards the adoption of fair models.

Figure 2.4: Exploring the parity-accuracy trade-off for the UCI Adult dataset by varying β and fine-tuning the model for 20 epochs. When training from scratch, we need 6000 epochs to generate 30 points (200 epochs each); with fine-tuning, we get the same result with only 780 epochs (200 + 29 × 20).

2.4.2 Uncovering Hidden Bias

In our experiments, we find that adversarial approaches that use deterministic representations (e.g., Roy et al. (2019b); Jaiswal et al. (2020)) are particularly susceptible to hidden bias, which can be revealed by slightly modified downstream task classifiers that consistently result in large parity values. Roy et al. (2019b) demonstrated that their representations were fair w.r.t. c by showing that an MLP trained to predict c from z performs poorly. This evaluation approach of predicting c from z is often used to demonstrate fairness or invariance of the representation w.r.t. c (Moyer et al., 2018; Jaiswal et al., 2020; Xie et al., 2017). However, we emphasize that, due to Eq. (2.3), this is only a lower bound on I(z : c), and therefore any claims relying solely on this metric are weak. Further, we found that these methods do not remove information about sensitive attributes and, at best, obfuscate it so that an adversary cannot easily recover the protected attribute during training. Even if a classifier trained directly on z cannot predict c, after preprocessing z by standard scaling it becomes very easy to predict the sensitive attribute, even with a model of similar or lower complexity than the original adversary used during training; one such example is shown in Figure 2.5. As a result, some of these approaches could not achieve low parities and even exhibited higher parity than the baseline Unfair MLP trained without fairness constraints (see Figure 2.2). We investigate this further in Section B.3.
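The auditing procedure used above can be sketched as follows: train a probe of complexity comparable to the original adversary to predict c from z, with and without standard-scaling preprocessing. Names and defaults in this sketch are illustrative, not the exact audit code.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

def sensitive_attribute_probe(z_train, c_train, z_test, c_test, preprocess=True):
    """Accuracy of predicting the protected attribute c from representations z.

    A large jump in accuracy when preprocess=True suggests that information about c
    was obfuscated by the representation learner rather than removed.
    """
    if preprocess:
        scaler = StandardScaler().fit(z_train)
        z_train, z_test = scaler.transform(z_train), scaler.transform(z_test)
    probe = MLPClassifier(hidden_layer_sizes=(50,), random_state=0).fit(z_train, c_train)
    return probe.score(z_test, c_test)
```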
Figure 2.5: Accuracy of predicting c from z with and without pre-processing as a function of the loss parameter α for the MaxEnt-ARL method.

2.4.3 Ablation Study of Information-Theoretic Methods

Next, we show that our proposed objective of maximizing I(y : z | c) is more efficient at exploring parity-accuracy trade-offs than maximizing I(y : z) (i.e., objective (2.1) vs. (2.2)). We also compare the effect of optimizing the reconstruction-based information bound (Eq. (2.9)) with our proposed contrastive information estimator (Eq. (2.8)).

2.4.3.1 Maximizing I(y : z | c) vs. I(y : z)

We see in Figure 2.6b that when maximizing I(y : z), we can still reduce the parity, but this comes at the cost of predictive performance. When using I(y : z | c), the parity drops sharply with only a slight drop in accuracy, and we see much better trade-offs between predictability and fairness. By conditioning the mutual information term on c, z can easily retain information about y without conflicting with the I(z : c) objective, as explained in Section 2.2.1.

2.4.3.2 Reconstruction vs. Contrastive Estimation

Figure 2.6a compares optimizing with reconstruction- and contrastive-estimation-based bounds for I(x : z | c). Reconstruction-based bounds on information rely on explicitly modeling the data distribution through a decoder, q(x | z, c), that reconstructs the data. For these methods to be effective, they must give reasonable reconstructions and therefore preserve most of the information in the data. We see in the results that preserving information leads to high accuracy but also high parity. Contrastive estimation is better at achieving lower parity than reconstruction-based bounds because it can directly model high-level features that are predictive and fair without requiring a model to reconstruct data.

Figure 2.6: Ablation studies on the UCI Adult dataset: reconstruction (Eq. (2.9)) and contrastive (Eq. (2.8)) mutual information estimation bounds are compared in Figure 2.6a. Figure 2.6b compares objective (2.2) and objective (2.1). λ was set to 1 for these experiments.

2.4.4 Visualizing Theorem 2.2

Figure 2.7 visualizes the bound from Theorem 2.2 for the UCI Adult dataset. We plot I(z : c) and parity for the representations learned by varying the loss coefficient β in objective (2.2) (shown in blue). This is contrasted with the bound on I(z : c) computed from Theorem 2.2, i.e., g(π, ∆_DP) (shown in red). We can see that: a) as predicted by Theorem 2.2, I(z : c) is always higher than g(π, ∆_DP), the minimum mutual information implied by the observed statistical parity value; and b) parity decreases with decreasing I(z : c).

2.5 Related Work

As machine learning becomes more enmeshed in everyday life, a vibrant field of fair machine learning research has emerged to question potential risks (Mehrabi et al., 2019). Different applications may prescribe different notions of fairness, and these notions may even conflict with each other (Kleinberg et al., 2017). In this work, we focus on fairness between groups concerning decisions that may be cast as classification problems.
In particular, our work is focused on controlling statistical parity or disparate impact, which was proposed in Dwork et al. (2012) and is widely adopted in the literature (Edwards et al., 2016; McNamara et al., 2017). Statistical parity is preferred as a fairness measure when the goal is to ensure equal selections from different groups. One such example is the adoption of the 80% rule in the US government's Uniform Guidelines on Employee Selection Procedures (United States Equal Opportunity Employment Commission, 1978) to ensure fairness in selection procedures. Statistical parity as a fairness measure also captures a lay person's perception of fairness (Srivastava et al., 2019). Incorporating other popular notions of group fairness like Equalized Opportunity and Equalized Odds (Hardt et al., 2016), as well as notions of individual fairness (e.g., Sharifi-Malvajerdi et al. (2019)), into our approach is left as future work.

Fair classification methods are often categorized based on which stage of the machine learning pipeline they target. Post-processing methods work by modifying the trained classifier (e.g., Hardt et al. (2016)), in-processing methods work by regularizing the classifier during training to ensure fairness (e.g., Zafar et al. (2017)), and pre-processing methods transform the original dataset to reduce the bias present in the data. Pre-processing methods are beneficial when the onus of fairness is on a third party or the data controller, and the end-user may be oblivious to fairness constraints (McNamara et al., 2017; Cisse et al., 2019). Our work also belongs to this category, and we specifically focus on controlling parity.

Figure 2.7: Visualizing Theorem 2.2 for the UCI Adult dataset. I(z : c) vs. ∆_DP for the representations learned by varying the loss coefficient β in the objective of Eq. (2.2) is shown in blue; the corresponding g(π, ∆_DP) is shown in red.

While post-processing and in-processing methods have to ensure fairness for a specific classifier, pre-processing methods must ensure fairness with respect to any downstream classification algorithm. Many pre-processing methods have discussed the desiderata of ensuring strong guarantees on fairness so that any downstream classifier may be used freely (McNamara et al., 2017; Song et al., 2019; Edwards et al., 2016; Madras et al., 2018). However, their operationalization often leads to approaches that may not ensure guarantees (due to the limits of adversarial methods, for instance). Our approach to these problems bounds parity in terms of mutual information, and we then optimize tight upper bounds on this information. Other works have explored information-theoretic objectives for learning invariant representations (Moyer et al., 2018) and fair representations (Song et al., 2019). Dutta et al. (2019) use tools from information theory to analyze the trade-off between fairness and accuracy. He et al. (2020) minimize the correlation between the sensitive attributes and the representations to achieve fairness. Correlation is a linear measure of dependence, so their method provides no guarantees when the downstream classifier is non-linear. In contrast, our approach controls the mutual information between the sensitive attribute and the representation, a more robust measure of dependence.
Contrastive learning and its variants have shown promising results for learning representations for many applications, e.g., images and speech (Oord et al., 2018), text (Mikolov et al., 2013) and graphs (Veliˇ ckovi´ c et al., 2019). We are the first to explore its application for learning fair representations. Contrastive learning has been most actively explored in self-supervised learning, where the information to optimize is chosen by hand to be similar to some target task (Chen et al., 2020; Oord et al., 2018), while in our work, we demonstrated a natural connection between parity and mutual information. Other variational bounds on information (Poole et al., 2019) and estimators like MINE (Belghazi et al., 2018) and NWJ (Nguyen et al., 2010) could also be leveraged for parity control using our results. 22 2.6 Conclusion As the negative repercussions of biased data become increasingly apparent, governments and other organi- zations will require tools capable of controlling this bias. In many cases, compromises between fairness and task objectives will be desired and sometimes even legally enshrined in terms of required bounds on fairness measures like statistical parity. Many methods do not provide a way to control parity, and even if they do, often, it is only in a heuristic way. Adversarial classifiers are popular for checking bounds on parity, but these methods are not guaranteed against the possibility of a more powerful classifier either intentionally or accidentally exploiting bias still hidden in the data. By proving a one-to-one relationship between information-theoretic quantities and statistical parity of arbitrary classifiers, we can finally see how varying a single hyper-parameter controlling information can explore the entire fairness versus accuracy spectrum. This information-theoretic characterization is algorithm-independent so that our control of parity can be guaranteed regardless of downstream applications. 23 Chapter 3 Mitigating Gender Bias in Distilled Language Models via Counterfactual Role Reversal He works in a hospital as a Prompt ... doctor, treating the elderly with a variety, and by all accounts does an ex- cellent work of medicine. GPT–2 ... physician and helps a lot of the pa- tients. Fair DistilGPT–2 (ours) She works in a hospital as a Prompt ... nurse and was in love with her mother and her big brother, a small, shy, overweight woman. GPT–2 ... pediatric dermatologist who gets stitches but also helps hospitals under- stand newborns... Fair DistilGPT–2 (ours) Figure 3.1: Example texts generated by LMs under different gender contexts (identified by the words ‘ He’ and ‘She’). GPT–2 continues the prompt with the occupation word historically associated with the spe- cific gender. Our approach aims to treat both genders equally. The ever-increasing size of language models (LMs) have increased their energy and compute require- ments, making them impractical for many real-time resource-constrained applications such as personal assistants deployed on edge devices. To address this issue, various approaches have been proposed to compress or distill these large models (e.g., Sanh et al. (2019); Jiao et al. (2020); Hinton et al. (2015)). However, distillation techniques are designed to mimic the uncompressed LM (i.e., teacher model). Thus, the societal biases encoded in the teacher mod- els (Bender et al., 2021; Bommasani et al., 2021; Sheng et al., 2021) will propagate to the distilled models. 
In fact, our experiments show that distilled models are adjudged to be more unfair than their teacher model counterparts. In this work, we de- vise techniques to train models that mitigate societal biases during knowledge distillation. One way to demonstrate this manifestation of societal biases is by looking at text generated by LMs, as illustrated in Figure 3.1. As such, the output text focuses on different characteristics of the person, solely based on which gender is mentioned in the context. To this end, we focus on reducing the dispar- ity between groups during the language generation, considering the fairness definition for open-ended text generations as proposed in Dhamala et al. (2021) and Sheng et al. (2019). We propose an approach that uses counterfactual role-reversed sentences during knowledge distillation. In other words, our approach uses counterfactual texts that are generated by substituting mentions of one demographic group with the other. We employ an automated way to generate these counterfactuals, requiring only a paired list of words from each demographic group. Typical knowledge distillation training loss has two components: (a) the LM training loss such as cross- entropy to learn information from the training data, and (b) a loss that enforces similarity between outcomes 24 of teacher and student models 6 . The counterfactual knowledge is used to correct these loss components in the following ways: (a) augmenting the training set itself, which alters the training loss to learn from more equitable data; and (b) modifying the teacher’s output toward more equitability so that the student learns from a more equitable output distribution. We first demonstrate our method using English GPT2–small (Radford et al., 2019) as the teacher and a 6-layer GPT–2 (called DistilGPT–2) as the student model. We focus on binary gender disparities (male vs. female) and use the gender polarity metric for profession prompts from the BOLD dataset (Dhamala et al., 2021) as the primary fairness definition. We show that our approach lowers the gender disparity in the generated text. Next, we demonstrate the applicability of our approach for finetuning English GPT2–small itself, i.e., using the same architecture for teacher and student models in the distillation framework. Finally, we evaluated the resultant model’s gender fairness on downstream tasks such as Contextual Embedding Association Tests (CEAT, Caliskan et al. (2017) and finetuning on Bios–Bias classification task (De-Arteaga et al., 2019). We find that reduced disparity in open-ended text generation does not necessarily lead to fairness on other tasks. 3.1 Related Work Large LMs embody societal biases that could result in harms such as misinformation, stereotype propagation, and disparate resource allocation (Bender et al., 2021; Sheng et al., 2021). Multiple studies have shown that LMs are biased in producing outputs with negative connotations such as toxicity (Gehman et al., 2020; Zhou et al., 2021a; Xu et al., 2021) and negative regard (Sheng et al., 2020; Sheng et al., 2021) towards minority populations. Others have shown that LMs encode prevalent gender biases, such as one gender being more associated with a particular class of professions. Such biases can be revealed via contextual embedding tests (Guo et al., 2021), stereotype tests (Sap et al., 2020; Nangia et al., 2020), and evaluation of generated texts (Dhamala et al., 2021; Sheng et al., 2019). 
Some works have also shown that LM can be biased towards ideologies, e.g., Islam (Brown et al., 2020). Approaches to mitigate bias in LMs can be broadly summarized as: (a) training or finetuning on a balanced dataset (Solaiman et al., 2021; Dinan et al., 2020)), (b) attaching prefix at inference or training time (Sheng et al., 2020), and (c) using a bias or attribute classifier ( e.g., toxicity classifier) to control fairness in text generation (Dathathri et al., 2020; Liang et al., 2021; Liu et al., 2021a; Krause et al., 2021). While all these debiasing approaches can be used to mitigate bias in an LM after it is distilled, no prior work aims to directly debias and distill in a single step. Furthermore, the majority of existing approaches focus on reducing toxic text generation (Solaiman et al., 2021; Dathathri et al., 2020; Liang et al., 2021; Liu et al., 2021a; Krause et al., 2021). Different from existing works, we present an approach for fair knowledge distillation that aims to mitigate gender bias in text generated from the distilled models. Our approach is inspired by the counterfactual notion of fairness (Kusner et al., 2017) and introduces two modifications to the standard distillation: (a) counterfactual data augmentation, and (b) using modified teacher probabilities. Counterfactual fairness and related notions have been previously used for bias mitigation in hate speech detection (Mostafazadeh Davani et al., 2021), word embeddings (Maudslay et al., 2019; Lu et al., 2020; Zhao et al., 2018b), and coreference resolution (Zhao et al., 2018a) tasks. Ours is the first work that uses counterfactual knowledge to achieve equitability in text generation during distillation. Our method is also applicable when the student model or architecture is the same as the teacher model, and we have demonstrated it via experiments (Section 3.4.4). 6 The teacher model refers to the original LM, and the student model refers to the LM being trained. The latter usually has fewer parameters. 25 3.2 Notion of Language Model Fairness We focus on mitigating gender bias in open-ended language generation from an LM. The bias is measured by assessing the tendency of the LM to associate a specific set of professions to a specific gender, e.g., healthcare professions to female and engineering professions to male. As discussed in Sheng et al. (2021), such societal biases may cause a negative representational impact by propagating stereotypes, misrepresentations, or denigrations of social groups. We consider only binary gender in this paper as LMs often do not encode sufficient representation of non-binary gender context, restricting a meaningful analysis (Dev et al., 2021). We use a related counterfactual notion of fairness, commonly studied in the NLP fairness literature, to motivate our fair distillation approach in Section 3.3. The counterfactual notion of fairness (Kusner et al., 2017) adjudges a model fair if it generates similar predictions before and after swapping the sensitive features in the input. 3.3 Fair Knowledge Distillation via Counterfactual Role Reversal In typical knowledge distillation, a smaller student model, imitating the behavior of the large teacher model, is obtained by using additional training signals from the target probabilities output by the teacher model. Let{x 1 ...x m } denote sequence of text tokens in a training sample, x <t or{x 1 ...x t− 1 } denotes sequence of tokens prior to t and boldface denote random variables. 
LMs such as GPT–2 model the probability distribution of the next token, P(x_t | x_{<t}), over the vocabulary V, i.e., x_t ∈ V. The distillation loss is then defined as follows:

min_θ Σ_t CE( P_θ(x_t | x_{<t}), x_t ) + KL( P_θ(x_t | x_{<t}) ∥ P_teacher(x_t | x_{<t}) ).    (3.1)

This loss consists of two terms: (a) the cross-entropy (CE) between the predicted next-token probability and the observed token, and (b) the KL-divergence between the output probabilities from the teacher (P_teacher) and the student (P_θ) models. The KL-divergence term provides a stronger training signal to the student compared to using training data only, leading to more accurate and faster learning (Hinton et al., 2015).

Knowledge distillation (Eq. (3.1)) will also transfer societal biases while transferring information from the teacher model. To address this problem, we propose to infuse the bias mitigation strategy into knowledge distillation to obtain a less biased and compact model. Our bias mitigation strategy is based on the intuition that given a sequence such as 'She works as a' and its counterfactual 'He works as a', a fair LM should generate similar texts. We materialize this intuition by encouraging the student LM to learn similar probability distributions for a sequence of tokens and its counterfactual. To this end, we propose two modifications to the base distillation strategy: (a) using counterfactual role reversal to modify the token probabilities of the teacher model, and (b) using counterfactual role-reversed data for model distillation. We study these two modifications independently and in various combinations. (Our approach may use the same student model as the teacher, as we demonstrate in Section 3.4.)

3.3.1 Counterfactual Role Reversal

Given a sequence of tokens referring to a particular demographic group, we want to generate a counterfactual sequence of tokens referring to another related demographic group. For example, if the original text referring to the female group was 'She is a mother of two kids and works as a software engineer,' we want to generate a counterfactual referring to the male group: 'He is a father of two kids and works as a software engineer.' Inspired by existing works on counterfactual data augmentation for binary gender (Lu et al., 2020; Maudslay et al., 2019), we use word-swapping operations on the sequence of tokens to generate counterfactual sequences. Specifically, we use a curated dictionary of gender words with a male ⇌ female mapping, for instance, father → mother, she → he, him → her, etc. We generate a counterfactual sequence of tokens from the original sequence by substituting each gendered word in the original sequence with the matching gendered word referring to the opposite gender from this dictionary. See Section C.2 for the curated dictionary sources and other implementation details.

Figure 3.2: Probability modification using counterfactual text. Probability distributions are computed for the original text (left) and its counterfactual text (right). The modified probability distribution is computed using one of the functions from Table 3.1; for this figure, the expMean operation is used.
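A minimal sketch of this dictionary-based swap is shown below. The word list is a tiny illustrative subset of the curated dictionary described in Section C.2, which also resolves ambiguous mappings (e.g., him/his → her) more carefully than this sketch does.

```python
import re

# Illustrative subset of the curated male <-> female word pairs (see Section C.2).
PAIRS = {"he": "she", "him": "her", "his": "her",
         "father": "mother", "son": "daughter", "man": "woman"}
SWAP = {**PAIRS, **{v: k for k, v in PAIRS.items()}}  # naive reverse map; "her" resolves to one target here

def counterfactual(text):
    """Swap gendered words to produce the role-reversed sequence, preserving capitalization."""
    def swap_word(match):
        word = match.group(0)
        swapped = SWAP.get(word.lower(), word)
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"\b[A-Za-z]+\b", swap_word, text)

print(counterfactual("She is a mother of two kids and works as a software engineer."))
# -> "He is a father of two kids and works as a software engineer."
```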
3.3.2 Modifying Teacher Probabilities

Next, we discuss how to use counterfactual sequences to modify the knowledge distillation loss. In an open-ended language generation task, the LM produces a natural continuation of text given some context or prompt (x_{<t}). To this end, auto-regressive LMs such as GPT–2 predict the probability distribution of the next token given the context and previously generated tokens. The next token is sampled from the predicted distribution and added to the context to generate text, and this process continues until a stopping criterion is met. Depending on the gender present in the context, the teacher model may produce different probability distributions over the vocabulary. If these predicted distributions are used directly for student model training, they could transmit gender bias to the student model. To mitigate this unchecked transference of gender disparity, we modify the teacher probability of each token by using the next-token probabilities from both the original and the counterfactual context (i.e., both genders) during student model training. We combine them to boost the probability of tokens that are likely under both genders, while the probability of tokens that are unlikely under one or both genders is suppressed or relatively unaffected (see Figure 3.2 for a visual illustration). We experiment with different functions to combine these distributions; the operations we consider are listed in Table 3.1. Let z_t = log P(x_t | x_{<t}) and z′_s = log P(x_s | x

Function   Operation
max        z″_t = max{ z_t, z′_s }
mean       z″_t = ( z_t + z′_s ) / 2
expMean    z″_t = log( ( e^{z_t} + e^{z′_s} ) / 2 )
swap       z″_t = z′_s
Table 3.1: Operations used to modify token probabilities.

d > 0.5 and d > 0.8 are medium and large effect sizes, respectively. However, the absolute effect size is often used as the magnitude of bias (Goldfarb-Tarrant et al., 2021); p-values are not reported as they do not indicate the magnitude of the bias, and all models were most certainly biased.

Method                   Mod fn.   Aug.   Test 6   Test 7    Test 8    Accuracy (↑)   TPRD (↓)
GPT2–small (Teacher)     N/A       N/A    0.326    −0.139    −0.040    0.818          0.1060
DistilGPT–2 (HF)         N/A       N/A    0.584    0.114     −0.078    0.813          0.0982
DistilGPT–2 (Baseline)   N/A       N/A    0.314    0.311     −0.065    0.815          0.1003
DistilGPT–2 (ERA)        max       no     0.245    0.223     −0.113    0.817          0.0981
DistilGPT–2 (ERA)        none      yes    0.366    0.274     0.016     0.816          0.1041
DistilGPT–2 (ERA)        max       yes    0.532    0.352     0.260     0.817          0.1020
GPT2–small (ERA)         max       no     0.212    0.182     −0.036    0.817          0.1085
GPT2–small (ERA)         none      yes    0.218    0.162     0.752     0.817          0.1031
GPT2–small (ERA)         max       yes    0.293    0.325     0.268     0.818          0.1070
Table 3.3: Downstream gender fairness evaluation. Tests 6–8 report CEAT effect sizes; Accuracy and TPRD are for Bios–Bias classification. See Sections 3.5.1 and 3.5.2 for discussion of the CEAT and Bios–Bias tasks and results, respectively.

As shown in Table 3.3, baseline models have a larger effect size in tests 6 (male/female names and career/family) and 7 (math/arts and male/female terms). In test 8 (male/female terms and science/arts), there was not a strong bias in the embeddings of the baseline models. Overall, we observe that the fairness demonstrated in Section 3.4 for open-ended language generation is not always reflected in the embeddings. For example, the model trained using modified logits based on the max operation has a smaller absolute effect size for tests 6 and 7 but a larger one for test 8 compared to the baseline. Effect sizes on tests 7 and 8 are reduced when using the counterfactual data augmentation method, but the effect size on test 6 increased.
Hence, the LM embedding fairness metric CEAT did not correlate with the fairness of LM in open-ended text generation tasks. This finding agrees with Goldfarb-Tarrant et al. (2021), but for contextual embeddings. They observed that downstream task and static embeddings fairness measures are not correlated. 3.5.2 Fairness in Classification Task We evaluate the hypothesis that an LM that is less biased in text generation should be less biased on downstream tasks by finetuning various baselines and fairer versions of LM obtained in Section 3.4.4 on the Bios–Bias classification task (De-Arteaga et al., 2019) and evaluating the classifier’s fairness. The objective is to predict one of the 28 profession classes from a person’s biography. We use a weighted combination of all token embeddings with a linear layer for classification. Pre-trained weights are not updated. For training details, see Section C.8. Similar to De-Arteaga et al. (2019), we take the average true positive rate difference (TPRD) between males and females across all professions as the fairness measure. Results. A fair model should have a similar true positive rate for both genders, i.e., TPRD∼ 0. However, we observe from Table 3.3 that TPRD is around 0.1 for all the models, indicating that all models lead to equally unfair outcomes. De-Arteaga et al. (2019) presented a simple debiasing technique of removing a set of predefined gendered words (such as he, she, mrs.) from the biographies before training, which resulted in an accuracy of 0.815 and TPRD of 0.0658 with DistilGPT–2 as the pre-trained model. Overall, this suggests that our method, even though effective in reducing disparity for open-ended text generation, is not adequate for this downstream task. 32 3.6 Discussion and Limitations Mitigating disparity across races. We conducted preliminary experiments to test if the proposed approach can be extended to different race groups. Similar to Dhamala et al. (2021), we consider race bias manifested via people’s names and race-specific tokens across four races common in the US: African, European or White, Hispanic & Latino, and Asian. We construct a many-to-many mapping that maps words referring to a given race to words referring to the other races for the counterfactual generation. The rest of the method remains the same as Section 3.3. For fairness evaluation, we use race prompts from BOLD and regard classifier from Sheng et al. (2019), which evaluates whether the person in the text is portrayed as being ‘highly thought of.’ Results show that the LMs obtained with the proposed approach were less biased in treating different races similarly, indicating that the proposed approach can be extended to other non-binary groups. However, the improvements were not as significant as gender bias mitigation, leaving plenty of scope for improvement left for future work. We describe the results and experiments in more detail in Section C.7. Counterfactual data generation. Dictionary-based word-swapping is a simple and effective method for counterfactual generation (Lu, 2020; Zhao et al., 2018a). However, blind word swapping can also result in factually and/or grammatically incorrect texts. To quantify these errors, we manually evaluated 500 randomly sampled counterfactual texts for gender category. We found that 22(4.4%) of these sentences were incorrect (See Section C.6). In this paper, we demonstrate that despite counterfactual data generation not being perfect, it can effectively reduce the gender biases in the model. 
We expect our bias mitigation approach to benefit from further research in counterfactual data generation, especially for reducing race disparity. 3.7 Conclusion We proposed techniques to use counterfactual information during knowledge distillation to mitigate gender bias in LMs. In experiments, we show that this approach improves fairness in text generation, but it does not simultaneously enhance fairness on LM embedding and downstream classification task. LMs have become the Swiss army knife of NLP because modeling next word probabilities can learn versatile models that are effective on many tasks. It was surprising that reducing gender disparity in text generation had little effect on other downstream tasks. This finding underscores the importance of evaluating LM fairness along multiple metrics and tasks. 33 Chapter 4 Membership Inference Attacks on Deep Regression Models for Neuroimaging Machine learning’s endless appetite for data is increasingly in tension with the desire for data privacy. Privacy is a highly significant concern in medical research fields such as neuroimaging, where information leakage may have legal implications or severe consequences on individuals’ quality of life. The Health Insurance Portability and Accountability Act 1996 (HIPAA) (Centers for Medicare & Medicaid Services, 1996) protects the health records of an individual subject, as well as data collected for medical research. Privacy laws have spurred research into anonymization algorithms. One such example is algorithms that remove facial information from MRI scans (Bischoff-Grethe et al., 2007; Schimke et al., 2011; Milchenko et al., 2013). While there are laws and guidelines to control private data sharing, model sharing or using models learned from private data may also leak information. The risk to participants’ privacy, even when only summary statistics are released, has been demonstrated and widely discussed in the field of genome-wide association studies (Homer et al., 2008; Craig et al., 2011). In a similar spirit, a neural network model learned from private data can be seen as a summary statistic of the data, and private information may be extracted from it. To demonstrate the feasibility of information leakage, we study the problem of extracting information about individuals from a model trained on the ‘brain age prediction’ regression task using neuroimaging data. Brain age is the estimate of a person’s age from their brain MRI scan, and it is a commonly used task for benchmarking machine learning algorithms. In particular, we study attacks to infer which samples or records were used to train the model. These are called Membership Inference attacks (Shokri et al., 2017; Nasr et al., 2019). An adversary may infer if an individual’s data was used to train the model, thus violating privacy through these attacks. Consider a hypothetical example, where some researchers released a neural network trained with scans of participants in a depression study. An adversary with access to the individual’s scan and the model may identify if they participated in the study, revealing information about their mental health, which can have undesirable consequences. Previous work on membership inference attacks focus on supervised classification problems, often exploiting the model’s over-confidence on the training set and the high dimensionality of the probability vector (Shokri et al., 2017; Salem et al., 2019; Pyrgelis et al., 2017). 
Our work demonstrates membership inference attacks on regression models trained to predict a person’s age from their brain MRI scan (brain age) under both white-box and black-box setups. We simulate attacks on the models trained under centralized as well as distributed, federated setups. We also demonstrate a strong empirical connection between overfitting and vulnerability of the model to membership inference attacks. 34 4.1 Related Work & Background 4.1.1 BrainAGE Problem Brain age is an estimate of a person’s age from a structural MRI scan of their brain. The difference between a person’s true chronological age and the predicted age is a useful biomarker for early detection of various neurological diseases (Franke et al., 2019) and the problem of estimating this difference is defined as the Brain Age Gap Estimation (BrainAGE) problem. Brain age prediction models are trained on brain MRIs of healthy subjects to predict the chronological age. A higher gap between predicted and chronological age is often considered an indicator of accelerated aging in the subject, which may be a prodrome for neurological diseases. To predict age from raw 3D-MRI scans, many recent papers have proposed using deep learning (Feng et al., 2020; Gupta et al., 2021b; Stripelis et al., 2021a; Peng et al., 2021; Bashyam et al., 2020; Lam et al., 2020). To simulate attacks on models trained centrally and distributively, we consider trained neural networks that were recently proposed in Gupta et al. (2021b) and Stripelis et al. (2021a), respectively to predict Brain age from MRI Scans. Although there is some controversy over the interpretation of BrainAGE (Butler et al., 2020; Vidal-Pineiro et al., 2021), we emphasize that we are only using BrainAGE as a representative problem in neuroimaging that benefits from deep learning. 4.1.2 Federated Learning In traditional machine learning pipelines, data originating from multiple data sources must be aggregated at a central repository for further processing and analysis. Such an aggregation step may incur privacy vulner- abilities or violate regulatory constraints and data sharing laws, making data sharing across organizations prohibitive. To address this limitation, Federated Learning was recently proposed as a distributed machine learning paradigm that allows institutions to collaboratively train machine learning models by relaxing the need to share private data and instead push the model training locally at each data source (McMahan et al., 2017; Yang et al., 2019; Kairouz et al., 2021). Even though Federated Learning was originally developed for mobile and edge devices, it is increasingly applied in biomedical and healthcare domains due to its inherent privacy preserving property of avoiding data sharing (Lee et al., 2018; Sheller et al., 2018; Silva et al., 2019; Rieke et al., 2020; Silva et al., 2020). Depending on the communication characteristics between the participating sources, different federated learning topologies can be discerned (Yang et al., 2019; Bonawitz et al., 2019; Rieke et al., 2020; Bellavista et al., 2021) — star and peer-to-peer being the most prominent. In a star topology (Sheller et al., 2018; Li et al., 2019; Li et al., 2020; Stripelis et al., 2021a), the execution and training coordination across sources is realized by a trusted centralized entity, the federation controller, which is responsible for shipping the global or community model to participating sites and aggregating the local models. 
In peer-to-peer (Roy et al., 2019a) topologies, the participating sites communicate directly with each other without requiring a centralized controller. We focus on the star federated learning topology. In principle, at the beginning of the federation training, every participating data source or learner receives the community model from the federation controller, trains the model independently on its local data for an assigned number of iterations, and sends the locally trained parameters to the controller. The controller computes the new community model by aggregating the learners’ parameters and sends it back to the learners to continue training. We refer to this synchronization point as a federation round. After repeating multiple federation rounds, the jointly learned community model is produced as the final output. 4.1.3 Membership Inference Attacks Membership inference attacks are one of the most popular attacks to evaluate privacy leakage in practice (Ja- yaraman et al., 2019). The malicious use of trained models to infer which subjects participated in the training 35 set by having access to some or all attributes of the subject is termed as membership inference attack (Shokri et al., 2017; Nasr et al., 2019). These attacks aim to infer if a record (a person’s MRI scan in our case) was used to train the model, revealing information about the subject’s participation in the study, which could have legal implications. These attacks are often distinguished by the access to the information that the adversary has (Nasr et al., 2019). Most successful membership inference attacks in the deep neural network literature require access to some parts of the training data or at least some samples from the training data distribution (Salem et al., 2019; Pyrgelis et al., 2017; Truex et al., 2018). White-box attacks assume that the attacker is also aware of the training procedure and has access to the trained model parameters, whereas Black-box attacks only assume unlimited access to an API that provides the output of the model (Leino et al., 2020; Nasr et al., 2019). Creating efficient membership inference attacks with minimal assumptions and information is an active area of research (Choo et al., 2020; Jayaraman et al., 2020; Song et al., 2020). However, our work is focused on demonstrating the vulnerability of deep neural networks to membership inference attacks in the federated as well as non-federated setup. Therefore, we make straightforward assumptions and assume somewhat lenient access to information. Our attack models are inspired by Nasr et al. (2019); Shokri et al. (2017), and we use similar features such as gradients of parameters, activations, predictions, and labels to simulate membership inference attacks. In particular, we learn deep binary classifiers to distinguish training samples from unseen samples using these features. In the case of federated learning, each learner receives model parameters and has some private training data. Thus, any learner is capable of launching white-box attacks. Moreover, in this scenario, the learner has access to the community models received at each federation round. When simulating membership attacks on federated models, we simulate attacks from the learners’ perspective by training the membership inference attack model on learners’ private data and the task is to identify other learners’ subjects. In the case of models trained via centralized training, we assume that the adversary can access some public training and test samples. 
We simulate both white-box and black-box attacks in this case.

4.2 Setup

4.2.1 Trained Models for Predicting Brain Age

We use models trained to predict brain age from structural MRIs to demonstrate vulnerability to membership inference attacks. We perform attacks on two neural network architectures — 3D-CNN (Peng et al., 2021) and 2D-slice-mean (Gupta et al., 2021b) models. The neural network architectures are summarized in Section D.1.3. For centralized training, we use the same dataset and training setup as Gupta et al. (2021b), and for federated training, we use the same training setup and dataset as Stripelis et al. (2021a) (see Sections D.1.1 and D.1.2). In the latter, the authors simulate different federated training environments by considering diverse amounts of records (i.e., Uniform and Skewed) and varying subject age distributions across learners (i.e., IID and non-IID). All models are trained on T1 structural MRI scans of healthy subjects from the UK Biobank dataset (Miller et al., 2016) with the same pre-processing as Lam et al. (2020). See Section D.1 for more details regarding the dataset, data distribution, and training setup.

4.2.2 Attack Setup

As discussed in Section 4.1.3, attackers may have access to some part of the training set and additional MRI samples that were not used for training, referred to hereafter as the unseen set. We train a binary classifier to distinguish whether a sample was part of the training set (see Section D.3 for classifier architecture details). We study the effectiveness of different features for the attacks in Section 4.3.1.

Features                                           3D-CNN         2D-slice-mean
activation                                         56.63          -
error                                              59.90 ± 0.01   74.06 ± 0.00
gradient magnitude                                 72.60 ± 0.45   78.34 ± 0.17
gradient (conv 1 layer)                            71.01 ± 0.64   80.52 ± 0.40
gradient (output layer)                            76.65 ± 0.44   82.16 ± 0.29
gradient (conv 6 layer)                            76.96 ± 0.57   82.89 ± 0.83
prediction + label                                 76.45 ± 0.20   81.70 ± 0.29
prediction + label + gradient (conv 6 + output)    78.05 ± 0.47   83.04 ± 0.50

Table 4.1: Membership inference attack accuracies on centrally trained models (averaged over 5 attacks). Details about the conv 1, output, and conv 6 layers are provided in Section D.1.3.

In the case of brain age models trained via centralized training, the attack models are trained on a balanced training set using 1500 samples from the training set and 1500 from the unseen set (in the implementation, the unseen set is the same as the test dataset used to evaluate the brain age model; the unseen set and the training set are IID samples from the same distribution). For testing, we create a balanced set from the remaining training and unseen samples — 694 samples each — and report accuracy as the vulnerability measure. To attack models trained via federated learning, we consider each learner as the attacker. Thus, the attack model is trained on the learner's private dataset and some samples from the unseen set that it may have. This way, we created a balanced training set of up to 1000 samples each from the training and unseen sets (in the Skewed & non-IID environment, some learners had fewer than 1000 training samples, so the attack model had to be trained with fewer samples). Unlike the centralized setup, the distributions of the unseen set and the training set that the attack model is trained on could be different, particularly in the non-IID environments. In this scenario, the attacks are made on the private data of other learners. Thus, we report the classifier's accuracy on a test set created from the training samples of the learner being attacked and new examples from the unseen set.

4.3 Results

We simulate membership inference attacks on both centrally and federation-trained models for the BrainAGE problem. We report results on models trained centrally in Section 4.3.1 and distributively in Section 4.3.2.
Conventional deep learning models are trained using gradient descent; thus, the gradients of the parameters w.r.t. the loss computed from a trained model are likely to be smaller for the training set than for the unseen set. Therefore, we evaluate features derived from gradients, activations, errors, and predictions of the trained model to train the binary classifier, i.e., to simulate membership inference attacks, and study their effectiveness in Section 4.3.1. The main task is to identify whether a sample belonged to the training set. We report the accuracy of correct identification on a test set created from the training and unseen sample sets that were not used to train the attack model but were used for training and evaluating the brain age models.

4.3.1 Membership Inference Attacks on Centralized Training

Table 4.1 summarizes the results of simulating membership attacks with various features. As apparent from Figure 4.1a, the train and unseen (test) samples have different error distributions due to the inherent tendency of deep neural networks to overfit on the training set (Zhang et al., 2017). Consequently, the error is a useful feature for membership inference attacks. Error is the difference between prediction and label, and using prediction and label as two separate features produced even stronger attacks, as indicated by higher membership attack accuracies. One reason for this could be that the model overfits more for some age groups; using the true age information (label) would enable the attack model to find these age groups, resulting in higher attack accuracy.

Figure 4.1: Distribution of (a) prediction errors and (b) gradient magnitudes of the conv 1 layer for train and test samples from the trained 2D-slice-mean and 3D-CNN models.

Attacks made using error, or prediction and label, are black-box attacks. A white-box attacker may also utilize more information about the model's internal workings, like the gradients, knowledge about the loss function, training algorithm, etc. Deep learning models are usually trained until convergence using some variant of gradient descent. Convergence is achieved when the gradient of the loss w.r.t. parameters on the training set is close to 0. As a result, gradient magnitudes are higher or similar for unseen samples compared to training samples (see Figure 4.1b). Therefore, we used the gradient magnitude of each layer as a feature, resulting in attack accuracies of 72.6 and 78.34 for the 3D-CNN and 2D-slice-mean models, respectively. Finally, we simulated attacks using gradients of parameters at different layers (we consider layers close to the input or output, as these have fewer parameters and attack models are easily trained; intermediate layers had more parameters, making it hard to learn the attack model). We find that parameter gradients of layers closer to the output layer (i.e., the conv 6 and output layers) are more effective compared to the gradients of layers closer to the input (conv 1). Preliminary results hinted that activations do not provide much information to attack the models.
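For illustration, the following sketch shows how such per-sample features could be assembled for the attack classifier. It is a simplified example assuming a PyTorch regression model; the layer-name tags are placeholders, and the actual attacks also use the raw parameter gradients of the selected layers rather than only their norms.

import torch

def membership_features(model, loss_fn, scan, age, layer_tags=("conv6", "output")):
    """Per-sample attack features: prediction, label, error, and gradient norms
    of selected layers of the trained brain-age model."""
    model.zero_grad()
    pred = model(scan.unsqueeze(0)).squeeze()      # scalar brain-age prediction
    loss = loss_fn(pred, age)
    loss.backward()                                # populates param.grad

    feats = [pred.item(), age.item(), (pred - age).item()]
    for name, param in model.named_parameters():
        if any(tag in name for tag in layer_tags): # placeholder layer names
            feats.append(param.grad.norm().item())
    return torch.tensor(feats)

# A small binary classifier (the attack model) is then trained on a balanced set
# of such feature vectors computed from training and unseen scans.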
We therefore did not simulate attacks on the 2D-slice-mean models with activations as features. The best attack accuracies of 78.05 and 83.04 for attacking the 3D-CNN and 2D-slice-mean models were achieved by using predictions, labels, and gradients of parameters close to the output layer. The successful membership inference attacks demonstrated in this section accessed samples from the actual training set, which is limiting. In Section D.5, we relax this assumption and discuss attacks accessing only samples from the training set distribution and not the training samples themselves.

4.3.2 Membership Inference Attacks on Federated Training

We consider three different federated learning environments consisting of 8 learners and investigate cases where malicious learners attack the community model. The community model is the aggregated result of the learners' local models, and a malicious learner may use it to extract information about other learners' training samples. In this scenario, a malicious learner can learn an attack model by leveraging its access to the community models of all federation rounds and its local training dataset; we simulate attacks using this information (see also Section 4.2.2). The model's vulnerability is likely to increase with more training iterations, and hence we used features derived from the community models received during the last five federation rounds, with each learner using its private samples to learn the attack model. Each learner may attempt membership inference attacks on any of the other seven learners, resulting in 56 possible attack combinations. An attack is considered successful if its accuracy is more than 50%, which is the random prediction baseline.

Figure 4.2: Attack accuracy vs. federation round for the Uniform & IID distribution; attack vulnerability increases with each federation round.

Table 4.2 shows the average accuracy of the successful attacks and, in parentheses, the number of successful attack instances across all possible learner-attacker pairs, out of 56 in total. For a more detailed analysis on a per-learner basis, see Section D.2. We empirically observed that the success rate of the attacks is susceptible to data distribution shifts. In particular, distribution-shift-agnostic features like gradient magnitudes can lead to more successful attacks (count-wise) when the data distribution across learners differs. For the results shown in Table 4.2 and Figure 4.2, we used all available features (i.e., gradient magnitudes, predictions, labels, and gradients of the last layers). We also observe that the overall attack accuracies are lower than for the centralized counterpart discussed in Section 4.3.1. This drop can be attributed to the following: a) as we show in Section 4.3.3, attack accuracies are highly correlated with overfitting, and federated learning provides more regularization than centralized training and reduces overfitting, but it does not eliminate the possibility of an attack; b) federated models are slow to train, but as the model is trained for more federation rounds, the vulnerability increases (see Figure 4.2).
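The bookkeeping for these learner-vs-learner attacks can be sketched as follows; the helper functions and the learner objects are hypothetical stand-ins for the attack-model training and evaluation described above.

import itertools

def summarize_federated_attacks(learners, train_attack, evaluate_attack, baseline=0.5):
    """Enumerate all 56 attacker-victim pairs among 8 learners, keep attacks whose
    accuracy beats the 50% random-prediction baseline, and report their mean accuracy."""
    successful = []
    for attacker, victim in itertools.permutations(learners, 2):
        attack_model = train_attack(attacker)             # attacker's private data +
                                                          # community models of last rounds
        accuracy = evaluate_attack(attack_model, victim)  # tested on victim's train samples
        if accuracy > baseline:
            successful.append(accuracy)
    mean_acc = sum(successful) / len(successful) if successful else float("nan")
    return mean_acc, len(successful)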
Moreover, Table 4.2 only presents an average-case view of the attacks, and we observe that the attack performance depends on the data distribution of the learner-attacker pair. When the local data distribution across learners is highly diverse, i.e., Skewed & non-IID, attack accuracies can be as high as 80% for specific learner-attacker pairs (see Section D.2).

Data distribution     3D-CNN       2D-slice-mean
Uniform & IID         60.06 (56)   58.11 (56)
Uniform & non-IID     61.00 (28)   60.28 (29)
Skewed & non-IID      64.12 (25)   63.81 (24)

Table 4.2: Average attack accuracies on federation-trained models. Numbers in parentheses indicate the median number of successful attacks over 5 runs.

4.3.3 Possible Defenses

Various approaches have been proposed to mitigate membership inference attacks directly. These approaches are based on controlling overfitting (Truex et al., 2018; Salem et al., 2019) and training data memorization (Jha et al., 2020), or on adversarial training (Nasr et al., 2018). In this section, we first evaluate differentially private machine learning as one of the defenses and use it to understand the relation between generalization/overfitting and membership inference attacks. Then we introduce a technique to train neural networks with non-unique gradients to prevent membership inference attacks.

Figure 4.3: Differential privacy reduces membership inference attacks: (a) attack accuracy vs. model performance and (b) attack accuracy vs. overfitting. Figure 4.3b shows that the effectiveness of membership inference attacks is correlated with overfitting. Error bars are generated by bootstrapping the test set 5 times using 1000 samples. Results with R² as the measure of model performance are shown in Section D.4.

Differential Private Training to Defend against Membership Inference Attacks. Differential privacy (DP) (Dwork et al., 2014) is a formal framework to reason about privacy. A differentially private training mechanism ensures that the outcomes (i.e., the final model weights in this case) do not change much between two training sets that differ by one example. We evaluate the effect of training models with these privacy guarantees on membership inference attacks and model performance, measured as mean absolute error in the centralized setup. To train the models with differential privacy, we used the DP-SGD algorithm of Abadi et al. (2016); briefly, the principal modifications to SGD are clipping each sample's gradient so that it does not exceed a maximum norm and adding spherical Gaussian noise (see Section D.4 for a brief description of differential privacy and the training details). We found that achieving non-vacuous differential privacy guarantees requires adding significant Gaussian noise to the gradients, which annihilates learning performance. However, we observed that practical privacy attacks, such as membership inference attacks, can be thwarted by clipping the gradients and adding Gaussian noise of much smaller magnitude. Therefore, we evaluate training with gradients perturbed by small additive Gaussian noise, varying the noise magnitude to obtain the different points in Figure 4.3. Models trained with differential privacy significantly reduce attack accuracy, but this is achieved at the cost of a significant drop in model performance (Figure 4.3a). To understand the reason for the success of membership inference attacks, we visualize in Figure 4.3b the overfitting, measured by the train-test performance difference, and the attack vulnerability of the models trained with differential privacy. We see that overfitting is highly correlated with attack accuracy, indicating that these attacks may, to some extent, be prevented by avoiding overfitting.

Learning with Non-unique Gradients. To learn good machine learning models, we would like to extract patterns while ignoring information about specific samples. Training models using gradient descent leaks an individual's information during training because there is no restriction on what information a sample may contribute. As a result, the model can memorize information unique to each individual, leaking privacy. Differential privacy adds the same noise to all gradients to limit the information or influence of a single sample on the neural network, but that may also destroy useful information in an attempt to reduce memorization. To this end, we investigate removing the unique information from each sample's gradient and training with only the non-unique part. To achieve this, we compute the gradient of the loss L w.r.t. the parameters θ for each sample (x_i, y_i) in a batch of size B, i.e., g_i = ∇_θ L(f(x_i; θ), y_i) for all i ∈ {1, …, B}. To compute the non-unique part, we project each gradient vector onto the subspace spanned by the rest of the gradient vectors, obtaining g_i^span, and we consider the residual as the unique information about each sample, i.e., g_i^unique = g_i − g_i^span. Ideally, we would like to train with only the non-unique part. However, we observe that this may harm performance too much, and therefore we downweigh the effect of the unique part and use ĝ_i = g_i^span + α · g_i^unique with α < 1 to update the model at the local learners. α is a hyperparameter that we tune to trade off privacy and performance.
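As a concrete illustration of this projection step, the sketch below operates on a batch of per-sample gradients flattened into vectors. It is a minimal, unoptimized version (in practice, the computation can be applied per layer), and the use of a pseudo-inverse to form the projection is one possible implementation choice rather than the exact one used in our experiments.

import torch

def non_unique_update(per_sample_grads, alpha=0.1):
    """per_sample_grads: (B, d) tensor of flattened per-sample gradients g_i.
    Returns the averaged update built from g_i^span + alpha * g_i^unique."""
    B, _ = per_sample_grads.shape
    mixed = torch.empty_like(per_sample_grads)
    for i in range(B):
        others = torch.cat([per_sample_grads[:i], per_sample_grads[i + 1:]])  # (B-1, d)
        A = others.T                                  # columns span the other gradients
        # Orthogonal projection of g_i onto span{g_j : j != i}
        g_span = A @ (torch.linalg.pinv(A) @ per_sample_grads[i])
        g_unique = per_sample_grads[i] - g_span       # sample-specific residual
        mixed[i] = g_span + alpha * g_unique          # downweigh the unique part
    return mixed.mean(dim=0)                          # gradient used for the update step

Setting alpha = 0 trains with only the non-unique part, while alpha = 1 recovers the standard averaged gradient.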
Figure 4.4: Attack accuracy vs. performance trade-off when training learners with differential privacy (Gaussian noise) and the non-unique gradients approach. Lower attack accuracy and lower MAE are desired. These results use the 2D-slice-mean model.

The results of learning with non-unique gradients or by adding Gaussian noise to the gradients are summarized in Figure 4.4. Learning with non-unique gradients provides a better trade-off between vulnerability to membership inference attacks and model performance (MAE). Further comparisons of the non-unique gradient approach and differential privacy in federated training settings are presented in Stripelis et al. (2022b).

4.4 Discussion

While deep learning presents great promise for solving neuroimaging problems, it also brings new challenges. Deep learning is intrinsically data-hungry, but the bulk of neuroimaging data is distributed around the world in private repositories. With classic machine learning approaches like linear regression, model sharing and meta-analysis could be used to pool insights without sharing data. Unfortunately, neural networks are capable of completely memorizing training data, so sharing a model may be just as bad as sharing the private data itself. In this paper, we demonstrated a practical proof-of-concept attack for extracting private information from neural networks trained on neuroimaging data.
We showed that attacks with a high success rate persist under various settings, including a realistic, distributed, federated learning scheme explicitly designed to protect private information. Although concerning, our preliminary study of attacks and defenses suggests benefits to solving this problem that go beyond data privacy. Because attacks exploit differences in model performance on training data and unseen test data, a successful defense must also lead to more robust neuroimaging models whose out-of-sample performance does not significantly differ from in-sample performance. Hence, even if data privacy were not a concern, further study of protection against membership attacks may inspire neuroimaging models that generalize better to new patients.

Chapter 5 Jointly Reparametrized Multi-Layer Adaptation for Efficient and Private Tuning

Transformer-based bidirectional language models (LMs), pretrained on a sizeable text corpus and finetuned on task-specific objectives, outperform models trained from scratch by large margins (Devlin et al., 2019; Liu et al., 2019). The straightforward approach to finetune a language model is to initialize with the pretrained parameters and train the model on the downstream task. However, it is inefficient to finetune language models for each task, as it requires training and storing a massive number of parameters per task (roughly the same as the size of the language model) (Radford et al., 2019; Devlin et al., 2019). These inefficiencies are exacerbated in resource-constrained settings, such as personal devices with limited computation and memory, or federated learning scenarios where the costs of communicating parameter updates may limit the scope of applications (Xu et al., 2022; Ro et al., 2022). Moreover, it is not optimal to tune a large number of parameters for differential private training: the total variance of the Gaussian noise added at each step of DP-SGD to achieve privacy scales linearly with the number of trainable parameters (Abadi et al., 2016). As a result, differential private deep networks may underperform even against handcrafted baselines (Tramer et al., 2021). However, differential private training's utility can be improved by exploiting additional public information. In particular, Yu et al. (2022); Li et al. (2022) showed that parameter-efficient finetuning methods achieve better utility for downstream NLP tasks than full private finetuning.

Method    Parameter Sharing   Efficient Inference   Multi-layer
Adapter   ✗                   ✗                     ✓
LoRA      ✗                   ✓                     ✓
BitFit    ✗                   ✓                     ✓
WARP      ✗                   ✗                     ✗
Ours      ✓                   ✓                     ✓

Table 5.1: Parameter-efficient finetuning methods.

The shortcomings of naive finetuning methods have motivated research into approaches that identify and train fewer task-specific parameters (Treviso et al., 2022). Those parameter-efficient finetuning methods work by introducing task-specific trainable layers while freezing most of the pretrained language model parameters (e.g., Adapter (Houlsby et al., 2019; Pfeiffer et al., 2021), LoRA (Hu et al., 2022)) or by introducing task-specific trainable prompts or inputs (e.g., prompt-tuning based WARP (Hambardzumyan et al., 2021), prefix-tuning (Li et al., 2021a)). We summarize the key properties of prominent efficient finetuning methods in Table 5.1. Among these methods, WARP is particularly interesting: it demonstrated performance comparable to full finetuning with as few as 25K trainable parameters on natural language understanding (NLU) tasks.
WARP inserts trainable token embeddings around the input, i.e., task-specific parameters are inserted only in the input layer. Due to this, WARP is limited compared to other methods that insert trainable parameters in different layers (i.e., multi-layer methods), as the information may not propagate correctly to the deeper layers (Liu et al., 2022b). Therefore, our proposed method inserts task-specific information in each transformer block. We introduce two variants of our approach — a) add a bias or shift vector to the output feed-forward layer's activation in each transformer block, or b) insert a prompt vector in each transformer block. These task-specific information vectors are derived from a single trainable vector, keeping the total trainable parameter count similar to WARP. The first variant is in contrast to BitFit (Ben Zaken et al., 2022), which updates all the bias parameters independently without sharing.

Figure 5.1: Performance vs. parameters trade-off on the GLUE benchmark with pretrained RoBERTa-large.

Our proposed parameter sharing, or joint reparametrization, of task parameters drastically reduces the number of trainable parameters without significant performance degradation. On average, our method is within two points of BitFit on NLU tasks but uses 20x fewer parameters. Specifically, we achieve within 5% of full finetuning performance with only 4.1K parameters (see Figure 5.1), outperforming WARP, which uses a similar number of parameters. Lastly, we show that parameter sharing and multi-layer tuning can also improve WARP. WARP increases the effective sequence length, and Adapter inserts task-specific layers, incurring additional computational overhead. In contrast, our method is efficient in memory usage and run-time during training. Further, the task-specific parameters learned by our approach can be fused with the LM during inference, leading to no additional latency and making it especially appealing for resource-constrained applications. Besides computational efficiency, our approach's parameter efficiency makes it an excellent private learner. Our approach's utility is competitive with or outperforms the best differential private finetuning results (Yu et al., 2022) when training for similar levels of privacy.

5.1 Method

The key component of our approach is to introduce task-specific information or tunable weights across different transformer layers. However, these weights are derived from random projections of a single trainable parameter vector. This reparametrization is general and can be introduced in other parameter-efficient finetuning methods as well. Here we demonstrate two cases — a) introducing a shift in each transformer block (SLaSh), and b) adding a prompt/embedding in each transformer block (JR-WARP). We first describe SLaSh.

Figure 5.2: Shared Layer Shift or SLaSh for sequence classification tasks. The shifts are obtained by the projections W_l z and added to the corresponding transformer block's output activation. z is shared across all the blocks. Red font indicates trainable modules, i.e., parameters that are updated during finetuning; other parameters remain unchanged during finetuning. [t_i]_{i=1}^n denotes the sequence of tokens.
5.1.1 Shared Layer Shift or SLaSh

Model. Figure 5.2 summarizes our model, highlighting task-specific parameters with colored fonts. Specifically, we consider a trainable vector z ∈ R^d to incorporate task-specific information in each transformer block. We do so by projecting z with random but fixed matrices W_l to obtain shift vectors z_l for the l-th transformer block (z_l ∈ R^{d'_l}, W_l ∈ R^{d'_l × d}, and l ∈ {1, …, L}). z_l is added to the output activations of the respective transformer block, as shown in Figure 5.2. z_l has the same dimensionality as the activations of the output feed-forward layer in the l-th transformer block (d'_l), and z is shared between all the blocks. Hence, we call our approach Shared Layer Shift, or SLaSh. The random projection matrices W_l are not trainable and are fixed throughout the training. We initialize W_l and z with a zero-centered Gaussian or Uniform distribution for our experiments (see Section E.1.2 for ablations on initialization choices). SLaSh is akin to training only the bias parameters of the output feed-forward layers. However, the projection step decouples the dimensions of z and the activations, providing the flexibility to change the number of trainable parameters and control the complexity of the model by varying d irrespective of the activation dimensions. Our choice of adding z_l to only the output activations is inspired by Subramani et al. (2020), who use a similar setup to learn sentence representations. We also consider adding the shifts to other activations, such as the intermediate activations or the activations after the self-attention layer, in Section E.1.1. In particular, adding shifts to the output activations performs similarly or better than the other choices. Adding shifts to intermediate layers performs similarly to adding shifts to the output layer; however, the dimensionality of intermediate activations is usually larger than that of output activations, which would increase the size of the projection matrices, making it an inferior choice.

Classification Head. We experiment with token classification and sequence classification tasks with BERT-like models. To this end, we remove the decoder layer of the pretrained LM and attach a task-specific linear layer (Classifier) to predict the output from text representations. Verbalizers (Schick et al., 2021) can also be used.

Number of Parameters. SLaSh only trains the task-specific vector (z) and the prediction head (Classifier), usually a classification or regression layer. Suppose the number of class labels is C. SLaSh will only use d + C × (d'_L + 1) trainable parameters per task, where d'_L is the activation dimension of the last transformer block. In our implementation, we maintain an additional ∑_{l=1}^{L} d'_l × d parameters for the W_l matrices during training. However, these matrices can also be generated on the fly from the random seed or the state of the random number generator for both the forward and backward pass computations. More concretely, RoBERTa-large has L = 24 and d'_l = 1024 for all l ∈ {1, …, L}, and for GLUE tasks the number of classes C is at most 3. If d is set to 1,024, only 4,099 trainable parameters are required per task. In contrast, RoBERTa-large has 355M parameters.
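To make the parametrization concrete, the following is a minimal PyTorch-style sketch of SLaSh, not the released implementation: the submodule path used to hook the output feed-forward layer and the initialization scales are illustrative assumptions.

import torch
import torch.nn as nn

class SLaSh(nn.Module):
    """A single trainable vector z is projected by fixed random matrices W_l into
    per-block shift vectors z_l = W_l z, which are added to each transformer
    block's output-FF activations. Only z (and the task classifier) is trained."""

    def __init__(self, block_dims, d=1024, seed=0):
        super().__init__()
        self.z = nn.Parameter(0.01 * torch.randn(d))   # init scale is illustrative
        gen = torch.Generator().manual_seed(seed)
        # Fixed random projections; reproducible from the seed, so they need not
        # be stored with the task checkpoint. Assumes all blocks share one dim.
        self.register_buffer(
            "W", torch.stack([0.02 * torch.randn(dim, d, generator=gen)
                              for dim in block_dims]))

    def shift(self, layer_idx):
        return self.W[layer_idx] @ self.z              # z_l = W_l z

def add_shift_hooks(transformer_blocks, slash):
    """Attach forward hooks so that block l's output activation h becomes h + z_l."""
    for l, block in enumerate(transformer_blocks):
        # 'output.dense' is a hypothetical path to the output feed-forward layer.
        block.output.dense.register_forward_hook(
            lambda module, inputs, output, l=l: output + slash.shift(l))

The same joint reparametrization — fixed random projections of one shared vector — is what JR-WARP, described next, uses to produce per-block prompts instead of shifts.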
The maximum size of z could be the sum of the dimensions of all the shift vectors, i.e., ∑_{l=1}^{L} d'_l. Increasing the size beyond that is similar to training the respective bias parameters independently, without any sharing or reparametrization.

Inference. Pretrained LM parameters are shared across all the tasks. The projection weights remain unchanged during training and can be reproduced from the random seed or the random number generator's state. Hence, once the model is trained, only z and the classifier parameters need to be preserved. Our approach maintains computational efficiency during inference, as it does not require additional computations apart from the language model inference. Indeed, once the shift vectors z_l are computed, they can be combined with the biases of the output feed-forward layers.

5.1.2 Improving Prompt-Tuning with Reparameterization

Figure 5.3: JR-WARP for sequence classification tasks with prompt length 1. Similar to Figure 5.2, trainable modules are indicated in red, and [t_i]_{i=1}^n denotes the sequence of tokens. The prompts are obtained by the projections W_l z. For the first layer, the prompt is appended to the token embeddings; for the rest of the layers, we add the prompt to the transformed prompt from the previous block. z is shared across all the blocks.

This joint reparametrization of task parameters can also improve prompt-tuning methods such as WARP. We make two modifications — a) we insert prompts in different layers, and b) the prompts are derived from a single vector. We refer to this as JR-WARP (Jointly Reparametrized WARP). Figure 5.3 summarizes JR-WARP with prompt length 1. We introduce prompts or embeddings in each transformer block, similar to Liu et al. (2022b). However, in our case, the prompts are reparametrized as random projections of a single vector z ∈ R^d (this differs from the generally suggested reparametrization of using an MLP encoder to transform the prompts). The prompt is appended to the token embeddings for the first layer, i.e., the embedding layer. Previous multi-layer prompt-tuning approaches discard the transformed prompt from the previous layers and insert a new prompt at each layer (Lester et al., 2021; Liu et al., 2022b). Instead, from the second transformer block onwards, we do not discard the previous representations and add the prompt to the resulting representation (or the transformed prompt) from the previous layer. W_l and z are initialized similarly to SLaSh. WARP appends the prompt only to the token embeddings; in Figure 5.3, this can be achieved by keeping only the lower arm emitting from the z block and setting W_0 as the identity matrix. Figure 5.3 shows prompt length 1, but the approach can be extended to prompts longer than length 1. However, our main aim is to evaluate performance while using a number of parameters similar to WARP. Therefore, we keep the prompt length at 1, and d is 10K and 5K in our experiments. When extending the prompt length beyond one, there are multiple ways to reparametrize the prompts — for example, reparametrize prompts within the same layer from a single z, or reparametrize prompts within the same index or time step from a single z, as we have done in this work. Multi-layer or deep prompts have already been shown to improve performance (Liu et al., 2022b; Li et al., 2021a); here, we improve parameter efficiency while maintaining performance.

5.2 Experiments

We first evaluate our approach for non-private training. This helps to benchmark our approach against other language model finetuning techniques.
We evaluate on sequence classification tasks in Section 5.2.1 with the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019) and token classification tasks with named entity recognition (NER) on CoNLL 2003 dataset (Tjong Kim Sang et al., 2003) in Section 5.2.2. We report memory and training time requirements to quantify the computational efficiency in Section 5.2.3. Finally, we demonstrate the utility of our approach for differential private finetuning of LMs in Section 5.2.4. 19 Baselines. We compare against full-finetuning and several prominent parameter-efficient finetuning tech- niques. Specifically, we compare with Adapter (Houlsby et al., 2019), Low-Rank Adaptation (LoRA, Hu et al. (2022)), BitFit (Ben Zaken et al., 2022), and Word Adversarial Reprogramming (WARP, Hambardzumyan et al. (2021)). Adapter introduces task-specific feed-forward layers in each transformer block. Adapter typically trains down-project and up-project feed-forward layers in pairs for each transformer block. The dimensions of the down-projection (denoted as m) govern the per-task trainable parameters. Low-rank adaptation, or LoRA learns the change in the pretrained weights, i.e.,∆W, for the downstream tasks.∆W is parameterized as the product of low-rank matrices, which requires much fewer parameters than full-finetuning. The rank of the matrices determines per-task parameters. WARP n introduces n learnable input tokens by adding trainable embeddings to the input. It is the continuous version of prompt-tuning and a special case of PrefixTuning (Li et al., 2021a), with prefixes introduced only in the embedding layer. The learned tokens do not necessarily correspond to an existing token from the vocabulary. Finally, we compare with BitFit, which finetunes only all the bias parameters. Indeed, BitFit finetunes a superset of parameters considered by our approach. Further, SLaSh shares trainable parameters across all the blocks, which is more efficient. 5.2.1 Sequence Classification Tasks Datasets. We use the GLUE benchmark for sequence classification. We consider 2 single-sentence tasks and 6 sentence pair tasks from the GLUE benchmark. Corpus of Linguistic Acceptability (CoLA) and Stanford Sentiment Treebank (SST-2) are the single sentence tasks, and the task is to predict grammatical 19 See Gupta et al. (2023b) for details about training, hyperparameter search, and best hyperparameters. The code is available at https://github.com/umgupta/jointly-reparametrized-finetuning. 47 acceptability and sentiment. Microsoft Research Paraphrase Corpus (MRPC), Semantic Textual Similarity Benchmark (STS-B), and Quora Question Pairs (QQP) are the sentence similarity tasks. Multi-genre Natural Language Inference (MNLI), Question-Answering NLI (QNLI), and Recognizing textual entailments (RTE) are textual entailment prediction tasks. Similar to Devlin et al. (2019); Houlsby et al. (2019), we omit results on Winograd Schema Challenge (WNLI) as LMs do not outperform random prediction baselines. All the tasks except STS-B are considered supervised classification tasks. Labels for STS-B are similarity scores from 1-5, and thus it is considered a regression task. We report accuracy on matched validation set for MNLI, Matthew’s correlation and Pearson correlation on CoLA and STS-B, F1-score for MRPC and QQP, and accuracy for the rest of the tasks on the development set. Model selection is also performed based on these metrics. [CLS] vs.[MASK] Representations. 
We consider two sentence-level representations for sequence classifi- cation tasks —[CLS] and[MASK] token representations. Masked language models (MLMs) such asBERT andRoBERTa are pretrained by attaching a[CLS] token to the beginning of the input text. The[CLS] token representation is trained with the next sentence prediction loss and thus touted as the sentence-level represen- tation. To this end, most previous works use[CLS] token representations. However, Hambardzumyan et al. (2021) suggested that[MASK] tokens representations, i.e., inserting the[MASK] token at the end of input for single-sentence or between the sentences for tasks involving sentence pairs, produces better results than using [CLS] token representation. We also find that the [MASK] representations are better than[CLS] representations generally and report results with[MASK] representations in the paper. We compare the two in Section E.1.3. Training. We useRoBERTa (Liu et al., 2019) as the pretrained model to compare with previous works. For SLaSh, we vary the number of parameters by varying the size of the z vector. The output activation and embedding dimensions are 1,024 inRoBERTa-large. So, we train with d = 1,024 and 2,048 to compare head- to-head with WARP. We report results with d = 5K and 10K forRoBERTa-base andRoBERTa-large, which improves the results further. To demonstrate the capabilities of tuning only output activation’s biases, we train with the maximum possible d, i.e., the total number of activations, 9,216 and 24,576 forRoBERTa-base andRoBERTa-large. We also train LoRA and Adapter with minimum parameter configurations (rank = 1 and m = 1) as the results reported in their papers use a larger number of parameters than those concerning this work. We demonstrate that parameter sharing can also improve WARP by introducing JR-WARP and training it with d = 5K and 10K for respectiveRoBERTa models. Results. Tables 5.2 and 5.3 summarize the results of finetuning with different methods using pretrained RoBERTa models. Parameter-efficient finetuning approaches aim to achieve performance at par with full finetuning while using fewer parameters. To this end, Figure 5.1 provides a visual summary of the parameter vs. performance trade-offs. SLaSh’s average performance is already within 4 points of full finetuning for both RoBERTa-base and -large models with d = 1,024. This gap is further reduced by increasing the dimension of the z vector. Even though the best models from our approach do not match the full-finetuning performance overall, for smaller datasets such as STS-B, MRPC, and RTE, SLaSh is competitive with full-finetuning. In the case of RoBERTa-large, we have 92.4 vs. 91.5 for STS-B, 90.9 vs. 91.3 for MRPC, and 86.6 vs. 84.1 for RTE with finetuning and SLaSh, respectively. 22 The parameter sharing reduces the per-task parameters considerably (4 20 WARP 0 feeds[MASK] representations to the classifier head, whereas the linear classifier uses [CLS]representations. 21 Since they report different metrics, we evaluated LoRA from the provided checkpoints on MNLI, STS-B, and QQP. 22 Note that we consider the average performance of SLaSh across different training runs, whereas, for baselines, performance from a single training run with fixed seed is reported. This can slightly exaggerate baseline numbers. 48 Method # Params MNLI (392,702) QQP (363,846) QNLI (104,743) SST-2 (67,349) CoLA (8,551) STS-B (5,749) MRPC (3,668) RTE (2,490) Avg. 
Finetuning 355M 90.2 92.2 94.7 96.4 68.0 92.4 90.9 86.6 88.9 Adapter 3M 90.4 88.5 94.7 96.3 67.4 92.5 92.9 83.4 88.3 Linear Classifier 3.1K 70.9 77.1 78.8 89.8 48.9 73.8 83.8 72.2 74.4 LoRA 800K 90.8 88.8 94.9 96.2 68.2 92.6 93.6 87.4 89.0 WARP 1 4.1K 83.9 81.6 87.6 93.8 46.1 80.4 84.7 72.6 78.8 WARP 8 11K 87.6 83.8 93.0 95.4 57.4 81.0 85.6 72.9 82.1 WARP 20 25K 88.2 84.5 93.5 96.0 60.6 88.6 90.8 75.8 84.8 WARP MNLI 25K - - - - - 91.0 91.2 86.3 86.4 LoRA [rank = 1] 101K 90.0 87.1 94.3 95.9 63.3 91.9 92.9 85.6 87.6 Adapter [m = 1] 150K 90.4 88.0 94.7 95.9 68.0 92.1 92.6 85.6 88.4 BitFit 276K 90.4 87.3 94.5 95.4 66.0 92.1 93.3 83.4 87.8 Ours [d = 1,024] 4.1K 85.8 ± 0.23 83.2 ± 0.15 92.2 ± 0.24 94.7 ± 0.57 59.6 ± 2.43 90.4 ± 0.41 91.1 ± 0.56 81.5 ± 2.18 84.8 Ours [d = 2,048] 5.1K 87.4 ± 0.08 84.1 ± 0.09 92.9 ± 0.28 94.9 ± 0.34 60.7 ± 2.11 90.7 ± 0.30 91.3 ± 0.84 83.5 ± 1.67 85.7 Ours [d = 10K] 13.1K 89.0 ± 0.14 85.5 ± 0.10 93.4 ± 0.19 95.2 ± 0.36 62.8 ± 1.43 91.5 ± 0.24 89.5 ± 4.17 84.1 ± 1.10 86.4 JR-WARP 1 [d = 10K] 13.1K 86.8 ± 1.26 84.2 ± 0.52 93.2 ± 0.20 95.3 ± 0.37 57.3 ± 2.61 89.1 ± 0.69 89.7 ± 1.41 79.6 ± 1.32 84.4 Ours [d = 24,576] (max) 27.7K 89.5 86.5 93.4 95.6 64.0 91.5 92.1 87.7 87.5 Table 5.2: Results of finetuning RoBERTa-large with different methods on GLUE Development set. The bracketed numbers in the heading are training set sizes. # Params are per-task trainable parameters. Rows with very few (< 10K) parameters are highlighted in gray to facilitate comparison. Finetuning results are from Liu et al. (2019), and Adapter (3M) and WARP results are from Hambardzumyan et al. (2021). Linear results are the best of linear classifier and WARP 0 performance from Hambardzumyan et al. (2021). 20 WARP MNLI used an additional intermediate step of supervised training on the MNLI dataset. LoRA (800K) results are adapted from Hu et al. (2022). 21 The standard deviations are computed over 5 training runs with different seeds. Due to computational limitations, we report error bars for our methods only. orders of magnitude less) and is faster and more efficient to train (Section 5.2.3). All these make our approach suitable for low-resource, low-data applications such as training on edge devices or learning personalized models. Most efficient tuning techniques tune a few hundred thousand parameters, except for WARP. It adds trainable parameters around input embeddings, which facilitates training with a few thousand parameters and is most comparable to our approach in per-task parameters. Our approach with d = 2,048 (i.e., 5.1K parameters) outperforms WARP with 25K parameters on all datasets with less than 10K training samples. Further, SLaSh outperforms the best results of WARP while using less than 60% of parameters (13K vs. 25K). These observations do not change even with WARP pretraining on the MNLI task to improve the performance on smaller datasets (WARP MNLI ). We do not require this supervised pretraining trick. These results validate the intuition that instead of introducing task parameters closer to the input layer as in WARP, it may be more effective to introduce the parameters throughout the layers as in SLaSh. Armed with this intuition, we improve WARP’s performance by introducing prompts in all transformer blocks derived from a single vector (JR-WARP). On average, it underperforms SLaSh, and the variance among different training runs is higher. Nevertheless, JR-WARP performs comparably to WARP 20 (84.4 vs. 84.8) while using fewer parameters (13K vs. 
25K), suggesting that reusing parameters across layers improves parameter efficiency but does not deteriorate performance. Next, we compare with LoRA and Adapter, arguably the most prominent language transformer finetuning approaches. We note that the Adapter (rank = 1) has a slightly better average performance than LoRA (m = 1) (Tables 5.2 and 5.3). SLaSh performs comparably to these methods for smaller datasets, using 5x fewer 49 Method % params MNLI QQP QNLI SST-2 CoLA STS-B MRPC RTE Avg. Finetuning 100% 86.4 88.0 92.3 94.2 61.1 90.6 92.5 77.4 85.3 BitFit 0.09% 85.8 85.2 91.9 93.7 60.1 90.6 91.9 71.8 83.9 LoRA [rank = 1] 0.04% 86.3 85.6 92.7 94.3 60.1 90.1 91.3 76.2 84.6 Adapter [m = 1] 0.05% 86.7 86.1 92.0 94.3 61.4 91.0 92.3 78.3 85.3 Ours [d = 1,024] 0.003% 80.6 ± 0.26 80.9 ± 0.09 89.1 ± 0.53 92.6 ± 0.27 55.5 ± 1.99 89.4 ± 0.19 90.4 ± 0.76 76.9 ± 1.87 81.9 Ours [d = 5K] 0.007% 83.6 ± 0.16 83.2 ± 0.11 90.6 ± 0.21 93.1 ± 0.45 59.1 ± 1.74 89.9 ± 0.28 90.7 ± 0.88 76.7 ± 1.84 83.4 JR-WARP 1 [d = 5K] 0.007% 81.9 ± 0.78 81.6 ± 0.66 88.2 ± 1.24 92.5 ± 0.60 43.4 ± 9.12 86.3 ± 1.75 82.5 ± 3.45 69.5 ± 1.36 78.2 Ours [d = 9,216] (max) 0.011% 84.4 83.9 90.5 93.7 58.8 90.1 90.8 79.4 83.9 Table 5.3: Results of finetuning RoBERTa-base with different methods on GLUE Development set. Finetun- ing results are taken from Ben Zaken et al. (2022). RoBERTa-base has 108 million parameters. The standard deviations are computed over 5 training runs with different seeds. Due to computational limitations, we report error bars for our methods only. parameters and being roughly 2x faster to train for RoBERTa-base and 7x fewer parameters and roughly 1.25x faster to train forRoBERTa-large (Tables 5.2, 5.3 and 5.5). For example, in the case ofRoBERTa-base, we have 91.0 vs. 89.9 for STS-B, 92.3 vs. 90.7 for MRPC, and 78.3 vs. 76.7 for RTE with Adapter and SLaSh, respectively. Finally, SLaSh performs comparably to BitFit while tuning much fewer parameters. As with the other baselines, it is only for the larger datasets that BitFit considerably outperforms SLaSh. Further, we observe that tuning only output activation’s biases, which used fewer than 15% of BitFit’s parameters, performs comparably to BitFit on average (last row of Tables 5.2 and 5.3). Another interesting result is the performance of BitFit vs. Adapter and LoRA with a similar number of trainable parameters. We observe that Adapter and LoRA outperform BitFit on most tasks with fewer trainable parameters. For instance, BitFit outperforms LoRA on QNLI, CoLA, STS-B, MRPC withRoBERTa-large, and only STS-B and MRPC with RoBERTa-base. Adapter outperforms BitFit on all the tasks with both pretrained models except MRPC with RoBERTa-large. These results contradict Ben Zaken et al. (2022), suggesting that while tuning bias parameters may achieve close to finetuning, LoRA or Adapter may yield better performance with fewer parameters. 5.2.2 Token Classification Task Next, we evaluate our method on more complex token classification tasks such as NER. We consider the CoNLL-2003 (English) dataset. We useBERT-base-cased as the pretrained-LM and finetune it to predict the 9 entity classes. We use the validation set for model selection and report micro-F1 on the test and validation sets. Results. Table 5.4 reports the results of finetuning with BERT-base-cased for the NER task. We see similar trends in performance as the sequence classification task. 
However, owing to the complexity of the NER task, all the methods underperform full-finetuning significantly (91.35 F1 score). SLaSh with 8K parameters underperforms full-finetuning by more than 4 points (86.49). The performance is improved to 88.30 by increasing the number of trainable parameters. However, LoRA, Adapter, and BitFit outperform the best results from SLaSh by roughly 1.5 points but use more than 3.5x parameters compared to SLaSh. Among the parameter-efficient techniques, Adapter performed the best while using fewer parameters than BitFit. Similar to Section 5.2.1, SLaSh and JR-WARP outperform WARP. Hyperparameter tuning (e.g., increasing the sequence length) can improve JR-WARP results further. Overall, SLaSh is suitable for extremely low-parameter applications, even the token classification tasks, but it may degrade performance. 50 Method # params Test Validation Finetuning 108M 91.35 94.97 Linear Classifier 7K 82.02 85.94 LoRA [rank = 1] 44K 89.50 93.38 Adapter [m = 1] 63K 90.09 93.55 BitFit 109K 89.83 93.62 WARP 20 22.3K 86.03 89.89 Ours [d = 1,024] 8K 86.49 89.37 Ours [d = 5K] 12K 88.30 91.38 JR-WARP 1 [d = 5K] 12K 87.08 90.93 Table 5.4: Results of finetuning BERT-base-cased for NER task on CoNLL-2003 (English) dataset. 5.2.3 Time & Memory Requirements One of the goals of parameter-efficient tuning is to achieve as much utility as possible while being efficient with memory and computing. To this end, we report memory and time for training 1 epoch on the QNLI dataset in Table 5.5. Full finetuning requires longer execution time and more memory than any other approach, making a clear case for parameter-efficient approaches. SLaSh requires considerably less time and memory than LoRA and Adapter — 40% less time and 33% less memory for RoBERTa-base and 12% less time and 30% less memory for RoBERTa-large. The gains are less pronounced for large models than base because relatively more resources are utilized for transformer computations than tuning-specific computations. Compared to BitFit, SLaSh trains faster, but the memory requirements are similar due to SLaSh maintaining projection matrices during training. We maintained projection matrices in memory instead of generating them on the fly for our experiments, and Table 5.5 uses this implementation. However, matrices can be generated on the fly for both forward and backward passes from the state of the random number generator, leading to a further reduction in memory usage. With this improvement, the memory usage comes down to 8.3 GB and 3.1 GB for the large and base model without significantly impacting training time. Finally, WARP’s memory utilization is identical to SLaSh, but has slightly higher training time due to increased sequence length. SLaSh is much more resource-efficient during training than other methods without too much compromise on performance. Method Time (s) Memory (GB) Finetuning 3291 15.6 BitFit 2083 8.6 LoRA [rank = 1] 2019 13.0 Adapter [m = 1] 2289 13.1 WARP 20 1869 9.0 Ours [d = 10K] 1764 9.3 (a)RoBERTa-large Method Time (s) Memory (GB) Finetuning 1227 5.8 BitFit 819 3.3 LoRA [rank = 1] 1026 4.9 Adapter [m = 1] 1385 4.8 WARP 20 635 3.5 Ours [d = 5K] 558 3.3 (b)RoBERTa-base Table 5.5: Memory and execution time for training 1 epoch on QNLI dataset (104,743 samples) with batch size 8. We report the maximum memory allocated during the training on a Quadro RTX 8000 GPU. 
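For reference, measurements of this kind can be collected with a small profiling wrapper like the sketch below (an illustrative harness, not the exact script used for Table 5.5).

import time
import torch

def profile_one_epoch(train_one_epoch):
    """Run one training epoch and report wall-clock time and peak GPU memory."""
    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    train_one_epoch()                      # any of the tuning methods in Table 5.5
    elapsed_s = time.time() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return elapsed_s, peak_gb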
51 MNLI QQP QNLI SST-2 Non-Private Training Finetuning 90.2 92.2 94.7 96.4 Ours [d = 10K] 89.1 89.1 93.5 95.9 JR-WARP 1 [d = 10K] 89.0 88.9 93.5 95.5 Private Training Ours [d = 10K] 88.0 86.9 91.2 94.5 JR-WARP 1 [d = 10K] 87.7 86.3 91.1 94.4 RGP 86.1 86.7 90.0 93.0 Adapter 87.7 86.3 90.7 93.9 Compacter 87.5 86.2 90.2 94.2 LoRA 87.8 87.4 90.8 95.3 (a) Finetuning withRoBERTa-large MNLI QQP QNLI SST-2 Non-Private Training Finetuning 87.6 91.9 92.8 94.8 Ours [d = 5K] 83.6 87.4 90.8 93.7 JR-WARP 1 [d = 5K] 83.4 87.2 90.7 93.3 Private Training Ours [d = 5K] 83.0 84.9 87.6 92.4 JR-WARP 1 [d = 5K] 81.3 84.7 87.9 92.0 RGP 80.1 85.5 87.2 91.6 Adapter 83.4 85.6 87.5 92.5 Compacter 82.6 84.7 85.1 92.3 LoRA 83.5 85.7 87.3 92.2 (b) Finetuning withRoBERTa-base Table 5.6: Results of differential private finetuning on GLUE Development set. Non-private finetuning and Private training results for RGP, Compacter, Adapter, and LoRA are from Yu et al. (2022). Private models were trained to achieveε = 6.7 for all datasets andδ = 10 − 6 for MNLI, QQP, and QNLI andδ = 10 − 5 for SST-2. For our method, privacy parameters areε = 6.7 andδ = 10 − 6 for all datasets (i.e., identical or stricter than the baselines). Inference times for all the methods were similar. The time to perform inference over the QNLI validation set (5,463 examples) varied between 13.9-14.5 seconds for RoBERTa-base and 39.7-40.8 seconds for RoBERTa-large. 5.2.4 Differential Private Finetuning As machine learning is beginning to be applied in commercial settings and on user data, ensuring the privacy of training data is becoming crucial. Neural networks trained without safeguards can easily leak information about their private training data (Carlini et al., 2021; Carlini et al., 2022). To mitigate these issues, neural networks can be trained with a strong notion of privacy, Differential Privacy (DP), which limits the influence of a single training example on the result (Dwork et al., 2014). Differential privacy is formally characterized by ε and δ and denoted as(ε,δ)− DP. Lower ε and δ imply more privacy. The standard procedure to train neural networks with DP is Differential Private SGD (DPSGD, Abadi et al. (2016)). DPSGD is a private variant of SGD in which per-sample parameter gradients are clipped, and Gaussian noise is added before the update step. The noise magnitude depends onε,δ, and model size and drastically impacts utility (Tramer et al., 2021). Recently, Yu et al. (2022); Li et al. (2022) demonstrated that the utility of differential private finetuning is at par with non-private training. One of the key insights is that the parameter-efficient methods are better private learners than full finetuning. Intuitively, the amount of noise scales with parameters and fewer parameters implies less noise is added during training. Naturally, this encouraged us to evaluate SLaSh and JR- WARP for private learning. To this end, we use the same setup as Yu et al. (2022). In particular, we consider the tasks with more than 10K samples in the GLUE benchmark and train to achieve(ε= 6.7,δ = 10 − 6 )− DP. Different from Section 5.2.1, we report accuracy for all the tasks here. We compare against the methods reported by Yu et al. (2022), which include LoRA, Adapter, and Compacter (Karimi Mahabadi et al., 2021). Compacter is an improved and efficient version of the Adapter. RGP updates all the parameters, i.e., it is similar to full-finetuning but uses a different parametrization. 52 Results. 
Table 5.6 reports the results of private finetuning RoBERTa under a fixed privacy budget ( ε = 6.7,δ = 10 − 6 ). Due to using only a tiny number of parameters, the gap in the non-private and private utility of SLaSh and JR-WARP is small. Further, SLaSh outperforms all the other methods on MNLI and QNLI tasks and is only second to the best (LoRA) on QQP and SST-2 withRoBERTa-large. Similarly, JR-WARP and SLaSh outperform all the other methods on the QNLI task withRoBERTa-base; however, JR-WARP’s utility is lower on MNLI. SLaSh’s utility is generally comparable to other methods for all the tasks. Our approaches (SLaSh and JR-WARP) may be more effective for larger models as those are easier to tune with fewer parameters (Lester et al., 2021). 5.3 Related Work Prompt tuning and task-specific finetuning are standard ways to prime LMs for downstream tasks (Liu et al., 2022a; Treviso et al., 2022). Prompt tuning inserts task-specific information or parameters around the input. Various versions exist, such as manual prompt-tuning, discrete prompt search (Shin et al., 2020), and continuous search (Hambardzumyan et al., 2021). Prompt tuning is highly parameter efficient but is generally only effective for larger LMs (Lester et al., 2021; Yang et al., 2022). Due to joint reparametrization, our method uses a similar number of parameters as prompt-tuning methods but outperforms them. Several parameter-efficient LM finetuning methods have been proposed, such as Adapter (Houlsby et al., 2019), LoRA (Hu et al., 2022), Prefix-Tuning (Li et al., 2021a), and Parallel Adapters (He et al., 2022). Further improvements try to maintain the utility while reducing the parameters such as Compacter (Karimi Mahabadi et al., 2021) that parameterizes weight matrices via the sum of Kronecker products, pruning adapter layers (Rücklé et al., 2021; Pfeiffer et al., 2021) and gating mechanisms to choose the best modules (Mao et al., 2022). These methods outperform prompt tuning but use more parameters. In contrast, we outperform prompt tuning while using similar number of parameters and are competitive with other finetuning approaches. Our approach could be of independent interest for understanding intriguing properties of pretrained language models, the role of different parameters, and sharing parameters across layers. Ben Zaken et al. (2022); Cai et al. (2020) have shown that pretrained models can be finetuned by only updating the bias parameters, but unlike us, they do not share parameters. Gheini et al. (2021) finetune only cross attention layers for machine translation. Zhou et al. (2022c) share only output layers across tasks, but parameters across different layers are not shared. Zhou et al. (2022b) have shown that task embeddings can be derived from task-specific finetuned parameters. The z in our approach can also be helpful as a task-embedding. Parameters derived by fixed random transformations a few parameters have previously been used to study the task’s intrinsic dimensionality (Li et al., 2018; Aghajanyan et al., 2021). Those works focus on weight matrices. While insightful, these are cumbersome to train for real-world deployment. Instead, we focus on bias or embeddings, providing a tractable operationalization for regular training and finetuning while using similar order of parameter count. For example, Aghajanyan et al. (2021) show that the intrinsic dimension of the QQP dataset withRoBERTa-large is 774, i.e., at least 774 parameters are required to achieve within 90% of full finetuning performance. 
5.4 Limitations

Experiments. In this work, we propose new methods for finetuning language models. We acknowledge that, similar to previous approaches, our experiments are limited to English datasets and specific supervised tasks. However, our method does not rely on language- or task-specific tricks and should apply to other languages and tasks.

Method. As demonstrated in Section 5.2, SLaSh is computationally efficient and performs comparably to full finetuning for small datasets. Moreover, its parameter and memory efficiency makes it an excellent private learner. However, because it uses very few parameters, it may underperform full finetuning by a few points on larger datasets with higher intrinsic dimensionality. For example, SLaSh struggles with generative tasks such as text summarization, since generative tasks are more complex and involve making predictions over the whole vocabulary, whereas classification tasks have relatively few output labels. In our initial experiments, SLaSh reached a ROUGE-2 score of 12.93 on the XSum summarization task (Narayan et al., 2018) with pretrained BART, whereas full finetuning achieves a score of 21.94 (He et al., 2022).

The limitations of SLaSh stem from the small number of parameters it updates. Since the shift is applied only to certain biases, the number of parameters cannot be increased beyond a limit. However, we show that SLaSh is a more efficient and performant alternative to methods that use a similar number of per-task parameters. Moreover, we showed that joint reparametrization improves the parameter efficiency of other methods, so this principle can be extended to methods that are not restricted by a maximum number of parameters. For example, JR-WARP's parameter count can be increased naturally by increasing the prompt length, which should improve the results further.

5.5 Conclusion

We introduce a multilayer LM finetuning technique in which task-specific parameters are derived from a single vector. We show two instantiations of this technique, SLaSh and JR-WARP: SLaSh introduces shifts in the output activations of each transformer block, whereas JR-WARP inserts prompts in each transformer block. These methods require only a tiny fraction of the original language model parameters (similar to prompt tuning) and outperform previous methods that use a similar number of per-task parameters. Despite the drastic reduction in the number of parameters, we demonstrate that they perform nearly as well as full finetuning for sentence and token classification tasks (at most a 5% difference in performance). The high parameter efficiency leads to better training speed and resource utilization and improves private training.

Chapter 6

Bootstrapping Ambient Space for Private Training

Ensuring privacy is becoming a critical requirement for many machine-learning applications. Differential privacy (DP) is a mathematically rigorous framework for reasoning about privacy: it ensures that no single sample can influence the outcome much, thereby limiting how much an individual's information can leak. However, differential private training of neural networks (via DP-SGD (Abadi et al., 2016)) with any reasonable guarantee degrades utility considerably compared to non-private training.
One reason for this is that it is difficult to bound the influence of a single sample on the outcome of training, i.e., the final neural network parameters. DP-SGD therefore privatizes each individual update step of SGD, adding noise to the gradients in every update. The noise added in each update accumulates over time and degrades utility. In this work, we explore whether we can use previous gradients (which are already private) to reduce the noise in the current update step.

Various approaches have been proposed to improve the utility of DP-SGD, for example, finetuning a model pretrained on a public dataset (Yu et al., 2022; Li et al., 2022), knowledge distillation (Papernot et al., 2018), and low-rank training (Yu et al., 2021b). Among these, an interesting approach proposed by Yu et al. (2021a) and Zhou et al. (2021b) is based on the observation that the ambient space of a neural network's gradients is low-dimensional compared to the parameter space (Gur-Ari et al., 2019). They use a public dataset to compute this ambient space, project the gradients onto it, and privatize the coefficients of the projections. Since the ambient space is public, the gradients in the original space can be recovered from the private coefficients and the ambient space without incurring additional privacy cost. The ambient space dimension is orders of magnitude lower than the parameter dimension, which reduces the noise added to the gradients; equivalently, it can be seen as adding correlated noise. However, this approach requires a public dataset to find the ambient space, which is not feasible in all cases.

To this end, we explore whether we can exploit private gradients from previous iterations to learn the ambient space. In particular, we maintain an online low-rank approximation of the matrix containing all the past gradients with the frequent directions algorithm (Ghashami et al., 2016). This approximation can be used to compute the eigenvectors, i.e., the low-rank subspace on which the past gradients approximately lie. We show that the subspaces obtained this way are a good approximation of the ambient space and capture most of the magnitude of the current gradient. Moreover, we train models with gradients restricted to the ambient directions to quantify the degradation in performance due to this dimensionality reduction; these models perform close to models trained with the full gradients. We then evaluate whether this procedure can replace the public ambient space for private training. Unfortunately, using private gradients from previous iterations to obtain the ambient space does not yield a good model. Even though the private gradients no longer lie in a low-rank subspace after the addition of Gaussian noise, we hypothesized that the low-rank approximation could act as a denoising operation (Shabalin et al., 2013). Viewing the computation of the low-rank approximation as a denoising problem, we find that the magnitude of the noise in the private gradients may be too high, leaving very little information about the correct subspace.
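Before stating the algorithm, the following is a minimal NumPy sketch of a frequent-directions estimator with the project / get_basis / update interface assumed by Algorithm 2 below. The buffer size, the shrinkage rule, and the optional decay factor (one possible way to realize the recency bias discussed later in Section 6.2.1) are simplifications for illustration, not the exact implementation used in our experiments.

import numpy as np

class FrequentDirections:
    # Simplified frequent-directions sketch (after Ghashami et al., 2016).

    def __init__(self, k, d, gamma=1.0):
        self.k, self.d = k, d
        self.gamma = gamma                     # optional decay < 1 to downweight old gradients
        self.sketch = np.zeros((2 * k, d))
        self._basis = np.eye(k, d)             # placeholder basis until the first update

    def _shrink(self):
        # SVD of the buffer; subtracting the k-th squared singular value
        # zeroes out at least half of the rows and keeps the dominant directions.
        _, s, vt = np.linalg.svd(self.sketch, full_matrices=False)
        s2 = np.maximum(s**2 - s[self.k - 1] ** 2, 0.0)
        self.sketch = np.sqrt(s2)[:, None] * vt
        self._basis = vt[: self.k]             # top-k right singular vectors, shape (k, d)

    def update(self, rows):
        self.sketch *= self.gamma
        for r in np.atleast_2d(rows):
            empty = np.where(~self.sketch.any(axis=1))[0]
            if len(empty) == 0:
                self._shrink()
                empty = np.where(~self.sketch.any(axis=1))[0]
            self.sketch[empty[0]] = r
        self._shrink()

    def get_basis(self):
        return self._basis                     # (k, d), rows are orthonormal

    def project(self, grads):
        return np.atleast_2d(grads) @ self.get_basis().T   # (B, k) coefficients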
Algorithm 2: Differential Private Training of a Neural Network with Bootstrapped Ambient Directions. C_1 and C_2 are the maximum magnitudes of the projection coefficients and of the gradient vector, respectively; σ_1 and σ_2 are the corresponding noise scales; η is the learning rate; d is the number of parameters; k is the dimension of the subspace; B is the batch size.

Input: Batch data (X, Y), parameters θ_t, and a frequent directions estimator FD.
Output: Updated parameters θ_{t+1}.

  Δ = ∇L(X, Y; θ_t)                                            ▷ Compute per-sample gradients (B × d)
  Coeff = FD.project(Δ)                                        ▷ Project per-sample gradients onto the k-dim subspace
  Private-Coeff = Clip(Coeff, C_1) + C_1 σ_1 numpy.random.randn(B, k)    ▷ Privatize the projection coefficients
  Private-Δ = Private-Coeff × FD.get_basis()                   ▷ Map private coefficients back to parameter space
  θ_{t+1} = θ_t − η ∑_{i=1}^{B} Private-Δ_i                     ▷ Update step
  FD.update(Δ + C_2 σ_2 numpy.random.randn(B, d))              ▷ Privatize gradients and update the basis space

6.1 Method

Algorithm 2 describes our approach. We use the gradients released in each iteration to update the frequent directions buffer, which is used to compute the ambient space, i.e., a low-rank subspace that explains most of the variance of the past gradients. Suppose we receive vectors in a streaming fashion. Frequent directions (Ghashami et al., 2016) is an online algorithm that maintains a low-rank matrix whose outer product approximates the matrix obtained by stacking these vectors; the singular vectors of this approximation are the eigenvectors of the correlation matrix. Since gradient descent has a low-rank bias (Gur-Ari et al., 2019), we expect this space to explain most of the current gradient vector's magnitude. Thus, we project the current gradient vector onto this space and privatize the projection coefficients. We then use the private coefficients and the ambient space to compute the private gradient vector for the SGD update. Finally, we also update the ambient space with a privatized version of the original gradient vector for the next iteration. Similar to Yu et al. (2021a), we could also use the residual gradients; however, we avoid that here to keep the algorithm simple. Next, we demonstrate that Algorithm 2 is differentially private.

6.1.1 Privacy Analysis of Algorithm 2

First, we recall the notions of (ε, δ)-differential privacy and Rényi differential privacy (RDP) and the relation between the two; this will be useful for our analysis.

Definition 6.1 ((ε, δ)-DP; Dwork et al., 2014). A mechanism M : D → O is (ε, δ)-DP if for any two datasets D and D′ that differ by one example and any outcome O ∈ O, the following is satisfied:

  Pr[M(D) ∈ O] ≤ e^ε Pr[M(D′) ∈ O] + δ.

Definition 6.2 ((α, ε)-RDP; Mironov, 2017). A mechanism M : D → O is (α, ε)-RDP if for any two datasets D and D′ that differ by one example, the following is satisfied:

  D_α(M(D) ∥ M(D′)) ≤ ε,

where D_α is the Rényi divergence of order α, i.e., D_α(P ∥ Q) = (1 / (α − 1)) log E_Q[(P / Q)^α].

Next, we state some useful results from Mironov (2017).

Lemma 6.3 (RDP to DP). If M is (α, ε)-RDP, then it is (ε + log(1/δ) / (α − 1), δ)-DP for any 0 < δ < 1.

Lemma 6.4 (Composition of RDP mechanisms). If M_1 is (α, ε_1)-RDP and M_2 is (α, ε_2)-RDP, then the composition (M_1, M_2) is (α, ε_1 + ε_2)-RDP.

Lemma 6.5 (Gaussian mechanism). The Gaussian mechanism with sensitivity 1 and noise parameter σ satisfies (α, α / (2σ^2))-RDP.

With the above results, it is easy to see the DP guarantees of Algorithm 2. We can state the following theorem.

Theorem 6.6. Algorithm 2 is (α, (α / 2)(1 / σ_1^2 + 1 / σ_2^2))-RDP.

Proof. Algorithm 2 produces two outputs. First, we compute the projection coefficients, clipped to C_1, with noise scale C_1 σ_1; after rescaling by C_1 this is a Gaussian mechanism with sensitivity 1 and noise σ_1, so this step is (α, α / (2σ_1^2))-RDP by Lemma 6.5. Second, we use privatized gradients to update the ambient space; that step is (α, α / (2σ_2^2))-RDP. Consequently, the ambient space is always private, as it is built only from privatized gradients. The computation of the private gradient vector for the SGD update multiplies the basis with the private projection coefficients; this step uses only private outputs from previous steps and hence incurs no additional privacy loss. Composing the two mechanisms with Lemma 6.4 gives the stated guarantee.
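As a quick numeric illustration of Theorem 6.6 and Lemma 6.3, the sketch below converts the per-step RDP guarantee into an (ε, δ) statement by optimizing over the Rényi order α. The noise scales and δ are illustrative values; in practice, the overall privacy loss over many steps is computed with the accountant of Gopi et al. (2021), as noted in the remarks below.

import numpy as np

def step_rdp(alpha, sigma1, sigma2):
    # Theorem 6.6: one step of Algorithm 2 is (alpha, alpha/2 * (1/sigma1^2 + 1/sigma2^2))-RDP.
    return 0.5 * alpha * (1.0 / sigma1**2 + 1.0 / sigma2**2)

def rdp_to_dp(alpha, rdp_eps, delta):
    # Lemma 6.3: (alpha, eps)-RDP implies (eps + log(1/delta) / (alpha - 1), delta)-DP.
    return rdp_eps + np.log(1.0 / delta) / (alpha - 1)

delta, sigma1, sigma2 = 1e-5, 4.0, 8.0          # illustrative values
eps = min(rdp_to_dp(a, step_rdp(a, sigma1, sigma2), delta)
          for a in range(2, 512))
print(f"one step of Algorithm 2 is ({eps:.2f}, {delta})-DP")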
Remarks. Unlike Algorithm 2, DP-SGD (Abadi et al., 2016) only adds noise to the gradient vector, and one step of DP-SGD is (α, α / (2σ^2))-RDP if the noise variance is σ^2. Therefore, one step of Algorithm 2 is equivalent, in terms of privacy loss, to adding Gaussian noise with variance σ^2 = σ_1^2 σ_2^2 / (σ_1^2 + σ_2^2) in DP-SGD, since 1/σ^2 = 1/σ_1^2 + 1/σ_2^2. So, for the same privacy cost, DP-SGD would add noise of smaller magnitude per coordinate. However, Algorithm 2 has the advantage of adding noise only to the projection coefficients, whose dimension is much smaller than the number of parameters; hence, the overall noise magnitude can be lower. We also use this equivalence between Algorithm 2 and DP-SGD to compute the overall privacy loss of Algorithm 2 with a privacy accountant; we use the accountant from Gopi et al. (2021).

6.2 Experiments

Setup. We consider the MNIST dataset and the usual digit-classification task with 10 classes. We use a feedforward network with one hidden layer of 100 units, which has d = 79,510 parameters in total (784 × 100 + 100 weights and biases in the first layer plus 100 × 10 + 10 in the output layer). First, we evaluate and validate the low-rank subspace bias of gradient descent and the efficacy of our procedure for constructing the subspace with the frequent directions algorithm by varying the subspace dimension k and training without noise (Section 6.2.1). Then, we evaluate whether the proposed algorithm is effective for differential private training, comparing it with the basic DP-SGD algorithm. We use the Adam optimizer for training all models.

6.2.1 Low-rank Training

Figure 6.1 shows the loss and accuracy progression under different settings when training without noise. We show results for two values of k and compare our approach with the usual training, which uses the full gradient vector; our models are updated only with gradients projected onto the subspace. With a subspace of only 500 dimensions, the performance is within 4 points of full training (see Figure 6.1b).

[Figure 6.1: Non-private training without noise. (a) Training loss vs. steps and (b) validation accuracy vs. steps for full SGD, k = 100, k = 500, and k = 500 with recency bias.]

[Figure 6.2: Gradient norms when training without noise: projection and residual norms for k = 100 and k = 500, each scaled by the norm of the full gradient at that step.]

[Figure 6.3: Training with noise to achieve differential privacy. (a) Training loss vs. steps for DP-SGD and k = 500. (b) Gradient norms (projection and residual for k = 500), scaled by the norm of the full gradient at that step.]

Previous works have observed that gradient descent happens in a small subspace of the parameter space, but they do not quantify the effect of using only this small subspace for training. With our procedure, we can quantify this effect by assessing the drop in performance. Figure 6.2 shows how much of the gradient lies in the subspace (projection) and how much lies outside it (residual). The values are normalized by the magnitude of the full gradient vector at that iteration, so the maximum value is 1.
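Concretely, the two normalized quantities plotted in Figure 6.2 can be computed from the current gradient and the estimator's basis as in the sketch below (assuming the basis rows are orthonormal; the function and variable names are illustrative).

import numpy as np

def subspace_split(g, basis):
    # basis: (k, d) with orthonormal rows, e.g., FD.get_basis(); g: full gradient, shape (d,).
    proj = basis.T @ (basis @ g)          # component of g inside the k-dim subspace
    residual = g - proj
    scale = np.linalg.norm(g)
    return np.linalg.norm(proj) / scale, np.linalg.norm(residual) / scale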
The subspace created using the frequent directions algorithm captures most of the gradient vector, and the norm of the residual stays small throughout training. It is intuitive to expect the residual norm to be smaller for larger k. However, this is not observed throughout training and only holds in the earlier steps. This may be because once the model (with k = 500) is close to a minimum within the subspace, the gradient magnitude in the residual space becomes relatively larger. Ideally, the subspace should start capturing the residual directions, but it can lag because earlier gradients may have larger magnitudes and dominate the subspace. To counteract this, we introduce a recency bias by discounting the gradients from earlier steps (see the recency-bias curves in Figure 6.1). This improves performance to some extent, and the validation error is within two points of training with the full gradient vectors.

6.2.2 Differential Private Training with Algorithm 2

Figure 6.3 shows the training loss and gradient norms when training with noise to achieve differential privacy. Unlike in non-private training, the model fails to train in this case. We also see that, unlike the non-noisy case, the bootstrapped ambient subspace does not capture the gradient: the projected gradient's norm is minimal compared to the residual, whose normalized magnitude is almost 1. One reason could be that, due to the Gaussian noise, the gradients no longer lie in a low-rank subspace. We elaborate on this next.

[Figure 6.4: Singular values of the gradient matrix, unnoised vs. noised, with the thresholds 1 + √c and c^(1/4) marked.]

6.2.3 Why did Differential Private Training Fail?

Recall that we add Gaussian noise to the gradients before updating the frequent directions buffer. Adding Gaussian noise makes the matrix full rank; hence, low-rank approximations of it will be poor. However, since the underlying signal is low-rank, we can view the approximation as a denoising operation. To this end, we refer to the literature on recovering low-rank matrices under Gaussian noise. Shabalin et al. (2013) show that the low-rank matrix can be recovered by performing an SVD and manipulating the resulting singular values to remove the effect of the noise. However, they also demonstrate that components whose singular values are less than (m/n)^(1/4) = c^(1/4) in the original matrix, or less than 1 + √c in the noisy matrix, are indistinguishable from noise. In our case, m = k and n = d. Figure 6.4 shows the singular values of the gradient matrix, and only very few of them can be recovered correctly. The problem is indeed that the magnitude of the noise is too high to recover the space reliably. Furthermore, in this work we are concerned not with accurate matrix recovery but with obtaining the subspace faithfully, and we find that singular-vector recovery is even poorer: empirically, the singular vectors are not recovered correctly, and the singular vectors obtained from the non-noisy and noisy matrices are nearly orthogonal.
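This effect can be reproduced on synthetic data in the spirit of the above analysis: build a low-rank matrix, add Gaussian noise, compare the observed singular values against the noise edge σ√n(1 + √c), and check how well the singular vectors align. The sizes, rank, and noise scale below are illustrative choices, not the values from our gradient matrices.

import numpy as np

rng = np.random.default_rng(0)
m, n, rank, sigma = 500, 8000, 20, 1.0     # illustrative sizes and noise scale
c = m / n

# Low-rank "gradient history" plus i.i.d. Gaussian noise of per-entry std sigma.
signal = rng.normal(size=(m, rank)) @ rng.normal(size=(rank, n))
noisy = signal + sigma * rng.normal(size=(m, n))

# Observed singular values below the noise edge sigma*sqrt(n)*(1 + sqrt(c)) sit inside
# the noise bulk, so only components above it can be recovered.
noise_edge = sigma * np.sqrt(n) * (1.0 + np.sqrt(c))
s_noisy = np.linalg.svd(noisy, compute_uv=False)
print("components above the noise edge:", int((s_noisy > noise_edge).sum()), "of", rank)

# Crude check of subspace recovery: alignment of matching top right singular vectors
# (values near 1 mean recovered; values near 0 mean close to orthogonal).
_, _, vt_clean = np.linalg.svd(signal, full_matrices=False)
_, _, vt_noisy = np.linalg.svd(noisy, full_matrices=False)
alignment = np.abs(np.diag(vt_clean[:rank] @ vt_noisy[:rank].T))
print("mean alignment of the top singular vectors:", alignment.mean().round(3))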
Overall, while the idea of bootstrapping the ambient subspace of gradients is promising, as demonstrated by the non-noisy training, a more refined approach is needed to obtain a reliable subspace in the presence of noise.

6.3 Related Work

Since the seminal work of Abadi et al. (2016), DP-SGD has become the standard for training deep learning models with differential privacy. To improve the utility of DP-SGD, various techniques have been proposed, such as using specific architectures (Papernot et al., 2021; Tramer et al., 2021), using pretrained models (Yu et al., 2022; Li et al., 2022), reducing dimensionality using public data (Zhou et al., 2021b; Yu et al., 2021a), or using larger batch sizes and learning rates (Ghazi et al., 2022; De et al., 2022) to reduce the number of iterations and hence the privacy cost. Our work is most closely related to Yu et al. (2021a) and Zhou et al. (2021b), which use public data to compute the ambient space of gradients and use it to train the model. Our work differs from theirs in that we use private gradients from previous iterations to compute the ambient space, which eliminates the need for public data. However, we find that the ambient space obtained this way is not reliable due to the high noise magnitude.

6.4 Discussion

We aimed to improve differential private training of neural networks by bootstrapping the ambient space of gradients, thereby eliminating the public-data requirement. To this end, we proposed to use the frequent directions algorithm to obtain the ambient space of gradients from past gradients. We demonstrated that training only in the ambient subspace recovers most of the performance of training with full gradients when training without privacy constraints; this quantifies how much information we lose by not using the residual space. However, the same procedure did not work for private training, as the high noise magnitude ruins the low-rank structure and the ambient space is not captured correctly. We draw a connection to the low-rank matrix denoising literature to explain this behavior. Future work will explore better ways to create the subspace in the presence of noise or use better denoising algorithms.

References

Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, Li Zhang. “Deep Learning with Differential Privacy”. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 2016, pp. 308–318 (cit. on pp. 3, 40, 43, 52, 55, 57, 60, 114).

Abubakar Abid, Maheen Farooqi, James Zou. “Persistent anti-muslim bias in large language models”. Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. 2021, pp. 298–306 (cit. on p. 2).

Armen Aghajanyan, Sonal Gupta, Luke Zettlemoyer. “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning”. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, 2021, pp. 7319–7328. DOI: 10.18653/v1/2021.acl-long.568. URL: https://aclanthology.org/2021.acl-long.568 (cit. on p. 53).

Alexander A Alemi, Ian Fischer, Joshua V Dillon, Kevin Murphy. “Deep Variational Information Bottleneck”. International Conference on Learning Representations. 2017 (cit. on p. 12).

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané. “Concrete problems in AI safety”. arXiv preprint arXiv:1606.06565 (2016) (cit. on p. 2).

Solon Barocas, Andrew D Selbst. “Big data’s disparate impact”. California Law Review (2016), pp. 671–732 (cit. on p. 2).

Vishnu M Bashyam, Guray Erus, Jimit Doshi, Mohamad Habes, Ilya M Nasrallah, Monica Truelove-Hill, Dhivya Srinivasan, Liz Mamourian, Raymond Pomponio, Yong Fan, et al. “MRI signatures of brain age and disease over the lifespan based on a deep brain network and 14 468 individuals worldwide”.
Brain 143.7 (2020), pp. 2312–2324 (cit. on p. 35). Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, Devon Hjelm. “Mutual Information Neural Estimation”. Proceedings of the 35th International Conference on Machine Learning. 2018, pp. 531–540 (cit. on p. 22). Paolo Bellavista, Luca Foschini, Alessio Mora. “Decentralised Learning in Federated Deployment Envi- ronments: A System-Level Survey”. ACM Computing Surveys (CSUR) 54.1 (2021), pp. 1–38 (cit. on p. 35). 62 Elad Ben Zaken, Yoav Goldberg, Shauli Ravfogel. “BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models”. Proceedings of the 60th Annual Meeting of the Associa- tion for Computational Linguistics (Volume 2: Short Papers). Dublin, Ireland: Association for Computa- tional Linguistics, 2022, pp. 1–9. DOI: 10.18653/v1/2022.acl-short.1. URL: https://aclanthology.org/ 2022.acl-short.1 (cit. on pp. 44, 47, 50, 53). Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. FAccT ’21. Virtual Event, Canada: Association for Computing Machinery, 2021, 610–623. DOI: 10.1145/3442188.3445922. URL: https://doi.org/10.1145/ 3442188.3445922 (cit. on pp. 24, 25). Hugo Berg, Siobhan Hall, Yash Bhalgat, Hannah Kirk, Aleksandar Shtedritski, Max Bain. “A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning”. Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online only: Association for Computational Linguistics, 2022, pp. 806–822. URL: https://aclanthology.org/ 2022.aacl-main.61 (cit. on p. 2). Sarah Bird, Miro Dudík, Richard Edgar, Brandon Horn, Roman Lutz, Vanessa Milan, Mehrnoosh Sameki, Hanna Wallach, Kathleen Walker. Fairlearn: A toolkit for assessing and improving fairness in AI. Tech. rep. MSR-TR-2020-32. Microsoft, 2020. URL: https://www.microsoft.com/en-us/research/publication/ fairlearn-a-toolkit-for-assessing-and-improving-fairness-in-ai/ (cit. on p. 9). Amanda Bischoff-Grethe, I Burak Ozyurt, Evelina Busa, Brian T Quinn, Christine Fennema-Notestine, Camellia P Clark, Shaunna Morris, Mark W Bondi, Terry L Jernigan, Anders M Dale, et al. “A technique for the deidentification of structural brain MR images”. Human brain mapping 28.9 (2007), pp. 892–903 (cit. on p. 34). Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, Hanna Wallach. “Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets”. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, 2021, pp. 1004–1015. DOI: 10.18653/v1/2021.acl-long.81. URL: https://aclanthology.org/ 2021.acl-long.81 (cit. on p. 2). Tolga Bolukbasi, Kai-Wei Chang, James Y . Zou, Venkatesh Saligrama, Adam Tauman Kalai. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings”. Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. Ed. by Daniel D. 
Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, Roman Garnett. 2016, pp. 4349–4357. URL: https://proceedings.neurips.cc/paper/ 2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html (cit. on p. 29). Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. “On the Opportunities and Risks of Foundation Models”. ArXiv preprint abs/2108.07258 (2021). URL: https://arxiv.org/abs/2108.07258 (cit. on pp. 2, 24). 63 Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Koneˇ cný, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, Jason Roselander. “Towards Federated Learning at Scale: System Design”. Proceedings of Machine Learning and Systems. Ed. by A. Talwalkar, V . Smith, M. Zaharia. V ol. 1. 2019, pp. 374–388 (cit. on p. 35). Rob Brekelmans, Sicong Huang, Marzyeh Ghassemi, Greg Ver Steeg, Roger Baker Grosse, Alireza Makhzani. “Improving Mutual Information Estimation with Annealed and Energy-Based Bounds”. International Conference on Learning Representations. 2021 (cit. on p. 9). Rob Brekelmans, Daniel Moyer, Aram Galstyan, Greg Ver Steeg. “Exact Rate-Distortion in Autoencoders via Echo Noise”. Advances in Neural Information Processing Systems. V ol. 32. 2019, pp. 3889–3900 (cit. on p. 12). Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. “Language Models are Few-Shot Learners”. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. Ed. by Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, Hsuan-Tien Lin. 2020. URL: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html (cit. on p. 25). Ellyn R Butler, Andrew Chen, Rabie Ramadan, Trang T Le, Kosha Ruparel, Tyler M Moore, Theodore D Satterthwaite, Fengqing Zhang, Haochang Shou, Ruben C Gur, et al. “Statistical Pitfalls in Brain Age Analyses”. bioRxiv (2020) (cit. on p. 35). Han Cai, Chuang Gan, Ligeng Zhu, Song Han. “TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning”. Advances in Neural Information Processing Systems. V ol. 33. 2020. URL: https: //proceedings.neurips.cc/paper/2020/hash/81f7acabd411274fcf65ce2070ed568a-Abstract.html (cit. on p. 53). Aylin Caliskan, Joanna J. Bryson, Arvind Narayanan. “Semantics derived automatically from language corpora contain human-like biases”. Science 356.6334 (2017), pp. 183–186. DOI: 10.1126/science. aal4230. URL: https://www.science.org/doi/abs/10.1126/science.aal4230 (cit. on pp. 25, 102, 105). Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, Chiyuan Zhang. “Quantifying memorization across neural language models”. arXiv preprint arXiv:2202.07646 (2022). URL: https://arxiv.org/abs/2202.07646 (cit. on p. 52). Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Katherine Lee, Adam Roberts, Tom B. 
Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, Colin Raffel. “Extracting Training Data from Large Language Models”. USENIX Security Symposium. 2021, pp. 2633–2650. URL: https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting (cit. on pp. 1, 2, 52). 64 Centers for Medicare & Medicaid Services. The Health Insurance Portability and Accountability Act of 1996 (HIPAA). Online at http://www.cms.hhs.gov/hipaa/. 1996 (cit. on p. 34). Kamalika Chaudhuri, Claire Monteleoni, Anand D Sarwate. “Differentially private empirical risk minimiza- tion.” Journal of Machine Learning Research 12.3 (2011) (cit. on p. 3). Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton. “A Simple Framework for Contrastive Learning of Visual Representations”. arXiv preprint arXiv:2002.05709 (2020) (cit. on p. 22). Zitao Chen, Karthik Pattabiraman. “Overconfidence is a Dangerous Thing: Mitigating Membership Inference Attacks by Enforcing Less Confident Prediction”. arXiv preprint arXiv:2307.01610 (2023) (cit. on p. 3). Christopher A Choquette Choo, Florian Tramer, Nicholas Carlini, Nicolas Papernot. “Label-Only Membership Inference Attacks”. arXiv preprint arXiv:2007.14321 (2020) (cit. on p. 36). Ching-Yao Chuang, Varun Jampani, Yuanzhen Li, Antonio Torralba, Stefanie Jegelka. “Debiasing Vision- Language Models via Biased Prompts”. arXiv e-prints (2023), arXiv–2302 (cit. on p. 2). Moustapha Cisse, Sanmi Koyejo. “Fairness and representation learning”. NeurIPS Invited Talk 2019 (2019) (cit. on pp. 8, 22). James H Cole, Rudra PK Poudel, Dimosthenis Tsagkrasoulis, Matthan WA Caan, Claire Steves, Tim D Spector, Giovanni Montana. “Predicting brain age with deep learning from raw imaging data results in a reliable and heritable biomarker”. NeuroImage 163 (2017), pp. 115–124 (cit. on p. 107). Joshua Comenetz. “Frequently occurring surnames in the 2010 census”. United States Census Bureau (2016). URL: https://www.census.gov/topics/population/genealogy/data/2010_surnames.html (cit. on p. 102). David W Craig, Robert M Goor, Zhenyuan Wang, Justin Paschall, Jim Ostell, Michael Feolo, Stephen T Sherry, Teri A Manolio. “Assessing and managing risk when sharing aggregate genetic variant data”. Nature Reviews Genetics 12.10 (2011), pp. 730–736 (cit. on p. 34). Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, Rosanne Liu. “Plug and Play Language Models: A Simple Approach to Controlled Text Generation”. 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL: https://openreview.net/forum?id=H1edEyBKDS (cit. on p. 25). Soham De, Leonard Berrada, Jamie Hayes, Samuel L Smith, Borja Balle. “Unlocking high-accuracy dif- ferentially private image classification through scale”. arXiv preprint arXiv:2204.13650 (2022) (cit. on p. 61). Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Choulde- chova, Sahin Geyik, Krishnaram Kenthapadi, Adam Tauman Kalai. “Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting”. Proceedings of the Conference on Fairness, Accountability, and Transparency. FAT* ’19. Atlanta, GA, USA: Association for Computing Machinery, 2019, 120–128. DOI: 10.1145/3287560.3287572. URL: https://doi.org/10.1145/3287560.3287572 (cit. on pp. 25, 32, 105). 65 Sunipa Dev, Masoud Monajatipoor, Anaelia Ovalle, Arjun Subramonian, Jeff Phillips, Kai-Wei Chang. 
“Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies”. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, pp. 1968–1994. DOI: 10.18653/v1/2021.emnlp-main.150. URL: https://aclanthology.org/2021.emnlp-main.150 (cit. on p. 26). Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 4171–4186. DOI: 10.18653/v1/N19-1423. URL: https://aclanthology.org/N19-1423 (cit. on pp. 43, 48). Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, Rahul Gupta. “BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation”. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. FAccT ’21. Virtual Event, Canada: Association for Computing Machinery, 2021, 862–872. DOI: 10.1145/3442188. 3445924. URL: https://doi.org/10.1145/3442188.3445924 (cit. on pp. 2, 4, 24, 25, 29, 33). Emily Dinan, Angela Fan, Adina Williams, Jack Urbanek, Douwe Kiela, Jason Weston. “Queens are Powerful too: Mitigating Gender Bias in Dialogue Generation”. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, 2020, pp. 8173–8188. DOI: 10.18653/v1/2020.emnlp-main.656. URL: https://aclanthology.org/2020. emnlp-main.656 (cit. on p. 25). Dheeru Dua, Casey Graff. UCI Machine Learning Repository. 2017. URL: http://archive.ics.uci.edu/ml (cit. on p. 15). Sanghamitra Dutta, Dennis Wei, Hazar Yueksel, Pin-Yu Chen, Sijia Liu, Kush R Varshney. “An Information- Theoretic Perspective on the Relationship Between Fairness and Accuracy”. arXiv preprint arXiv:1910.07870 (2019) (cit. on pp. 8, 22). Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, Richard Zemel. “Fairness through Awareness”. Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. ITCS ’12. 2012 (cit. on pp. 2, 9, 21). Cynthia Dwork, Aaron Roth. “The Algorithmic Foundations of Differential Privacy”. Foundations and Trends ® in Theoretical Computer Science 9.3–4 (2014), pp. 211–407. DOI: 10.1561/0400000042 (cit. on pp. 3, 40, 52, 56, 114). Harrison Edwards, Amos Storkey. “Censoring Representations with an Adversary”. International Conference on Learning Representations. 2016 (cit. on pp. 8, 10, 14, 17, 21, 22). Xinyang Feng, Zachary C. Lipton, Jie Yang, Scott A. Small, Frank A. Provenzano. “Estimating brain age based on a uniform healthy population with deep learning and structural magnetic resonance imaging”. Neurobiology of Aging 91 (2020), pp. 15–25 (cit. on p. 35). 66 Anjalie Field, Su Lin Blodgett, Zeerak Waseem, Yulia Tsvetkov. “A Survey of Race, Racism, and Anti-Racism in NLP”. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, 2021, pp. 1905–1925. DOI: 10.18653/v1/2021.acl-long.149. URL: https://aclanthology.org/2021.acl-long.149 (cit. on p. 2). Katja Franke, Christian Gaser. 
“Ten Years of BrainAGE as a Neuroimaging Biomarker of Brain Aging: What Insights Have We Gained?” Frontiers in Neurology 10 (2019), p. 789 (cit. on p. 35). Yansong Gao, Pratik Chaudhari. “A Free-Energy Principle for Representation Learning”. arXiv preprint arXiv:2002.12406 (2020) (cit. on p. 18). Sahil Garg, Umang Gupta, Yu Chen, Syamantak Datta Gupta, Yeshaya Adler, Anderson Schneider, Yuriy Nevmyvaka. “Estimating transfer entropy under long ranged dependencies”. Uncertainty in Artificial Intelligence. 2022, pp. 685–695 (cit. on p. 6). Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith. “RealToxicityPrompts: Eval- uating Neural Toxic Degeneration in Language Models”. Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, 2020, pp. 3356–3369. DOI: 10.18653/v1/2020.findings-emnlp.301. URL: https://aclanthology.org/2020.findings-emnlp.301 (cit. on pp. 2, 25). Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, Felix A Wichmann. “Shortcut learning in deep neural networks”. Nature Machine Intelligence 2.11 (2020), pp. 665–673 (cit. on p. 1). Mina Ghashami, Edo Liberty, Jeff M Phillips, David P Woodruff. “Frequent directions: Simple and deter- ministic matrix sketching”. SIAM Journal on Computing 45.5 (2016), pp. 1762–1792 (cit. on pp. 5, 55, 56). Badih Ghazi, Pasin Manurangsi, Ravi Kumar, Rohan Anil, Vineet Gupta. “Large-Scale Differentially Private BERT”. Findings of EMNLP 2022. 2022 (cit. on p. 61). Mozhdeh Gheini, Xiang Ren, Jonathan May. “Cross-Attention is All You Need: Adapting Pretrained Trans- formers for Machine Translation”. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, pp. 1754–1765. DOI: 10.18653/v1/2021.emnlp-main.132. URL: https://aclanthology. org/2021.emnlp-main.132 (cit. on p. 53). Seraphina Goldfarb-Tarrant, Rebecca Marchant, Ricardo Muñoz Sánchez, Mugdha Pandya, Adam Lopez. “Intrinsic Bias Metrics Do Not Correlate with Application Bias”. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, 2021, pp. 1926–1940. DOI: 10.18653/v1/2021.acl-long.150. URL: https://aclanthology.org/2021.acl- long.150 (cit. on pp. 2, 31, 32). Hila Gonen, Yoav Goldberg. “Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them”. Proceedings of the 2019 Conference of the North 67 American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 609–614. DOI: 10.18653/v1/N19-1061. URL: https://aclanthology.org/N19-1061 (cit. on p. 29). Sivakanth Gopi, Yin Tat Lee, Lukas Wutschitz. “Numerical Composition of Differential Privacy”. Advances in Neural Information Processing Systems. Ed. by M. Ranzato, A. Beygelzimer, Y . Dauphin, P.S. Liang, J. Wortman Vaughan. V ol. 34. Curran Associates, Inc., 2021, pp. 11631–11642. URL: https://proceedings. neurips.cc/paper/2021/file/6097d8f3714205740f30debe1166744e-Paper.pdf (cit. on p. 57). Wei Guo, Aylin Caliskan. 
“Detecting Emergent Intersectional Biases: Contextualized Word Embeddings Contain a Distribution of Human-like Biases”. Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. AIES ’21. Virtual Event, USA: Association for Computing Machinery, 2021, 122–133. DOI: 10.1145/3461702.3462536. URL: https://doi.org/10.1145/3461702.3462536 (cit. on pp. 25, 31). Umang Gupta, Tamoghna Chattopadhyay, Nikhil Dhinagar, Paul M Thompson, Greg Ver Steeg, The Alzheimer’s Disease Neuroimaging Initiative. “Transferring Models Trained on Natural Images to 3D MRI via Position Encoded Slice Models” (2023) (cit. on p. 5). Umang Gupta, Jwala Dhamala, Varun Kumar, Apurv Verma, Yada Pruksachatkun, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Greg Ver Steeg, Aram Galstyan. “Mitigating Gender Bias in Distilled Language Models via Counterfactual Role Reversal”. Findings of the Association for Computational Linguistics: ACL 2022. Dublin, Ireland: Association for Computational Linguistics, 2022, pp. 658–678. DOI: 10.18653/v1/2022.findings-acl.55. URL: https://aclanthology.org/2022.findings-acl.55 (cit. on p. 4). Umang Gupta, Aaron M Ferber, Bistra Dilkina, Greg Ver Steeg. “Controllable guarantees for fair outcomes via contrastive information estimation”. Proceedings of the AAAI Conference on Artificial Intelligence . V ol. 35. 2021, pp. 7610–7619 (cit. on pp. 3, 4). Umang Gupta, Aram Galstyan, Greg Ver Steeg. “Jointly Reparametrized Multi-Layer Adaptation for Efficient and Private Tuning”. Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguistics, 2023, pp. 12612–12629. URL: https://aclanthology. org/2023.findings-acl.799 (cit. on pp. 4, 47). Umang Gupta, Pradeep Lam, Greg Ver Steeg, Paul Thompson. “Improved Brain Age Estimation with Slice- based Set Networks”. IEEE International Symposium on Biomedical Imaging (ISBI). 2021 (cit. on pp. 5, 35, 36, 107, 114). Umang Gupta, Dimitris Stripelis, Pradeep K Lam, Paul Thompson, Jose Luis Ambite, Greg Ver Steeg. “Membership inference attacks on deep regression models for neuroimaging”. Medical Imaging with Deep Learning. PMLR. 2021, pp. 228–251 (cit. on p. 4). Guy Gur-Ari, Daniel A. Roberts, Ethan Dyer. Gradient Descent Happens in a Tiny Subspace. 2019. URL: https://openreview.net/forum?id=ByeTHsAqtX (cit. on pp. 5, 55, 56). Suchin Gururangan, Ana Marasovi´ c, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Noah A. Smith. “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks”. Proceedings of 68 the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020, pp. 8342–8360. DOI: 10.18653/v1/2020.acl-main.740. URL: https: //aclanthology.org/2020.acl-main.740 (cit. on p. 29). Niv Haim, Gal Vardi, Gilad Yehudai, Ohad Shamir, Michal Irani. “Reconstructing training data from trained neural networks”. Advances in Neural Information Processing Systems 35 (2022), pp. 22911–22924 (cit. on p. 2). Karen Hambardzumyan, Hrant Khachatrian, Jonathan May. “WARP: Word-level Adversarial ReProgram- ming”. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, 2021, pp. 4921–4933. DOI: 10.18653/v1/2021.acl-long.381. URL: https://aclanthology.org/2021.acl-long.381 (cit. on pp. 43, 47–49, 53, 120, 121). 
Moritz Hardt, Eric Price, Eric Price, Nati Srebro. “Equality of Opportunity in Supervised Learning”. Advances in Neural Information Processing Systems. V ol. 29. 2016, pp. 3315–3323 (cit. on pp. 2, 21). Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, Graham Neubig. “Towards a Unified View of Parameter-Efficient Transfer Learning”. International Conference on Learning Representations. 2022. URL: https://openreview.net/forum?id=0RDcd5Axok (cit. on pp. 53, 54). Yuzi He, Keith Burghardt, Kristina Lerman. “A Geometric Solution to Fair Representations”. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. AIES ’20. New York, NY , USA, 2020, 279–285 (cit. on p. 22). Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, Alexander Lerchner. “beta-vae: Learning Basic Visual Concepts with a Constrained Variational Framework”. International Conference on Learning Representations. 2017 (cit. on p. 12). Geoffrey Hinton, Oriol Vinyals, Jeff Dean. “Distilling the knowledge in a neural network”. ArXiv preprint abs/1503.02531 (2015). URL: https://arxiv.org/abs/1503.02531 (cit. on pp. 4, 24, 26). Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, Yejin Choi. “The Curious Case of Neural Text Degeneration”. 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL: https://openreview.net/forum?id=rygGQyrFvH (cit. on p. 105). Nils Homer, Szabolcs Szelinger, Margot Redman, David Duggan, Waibhav Tembe, Jill Muehling, John V Pearson, Dietrich A Stephan, Stanley F Nelson, David W Craig. “Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays”. PLoS Genet 4.8 (2008), e1000167 (cit. on pp. 2, 34). Max Hort, Zhenpeng Chen, Jie M Zhang, Federica Sarro, Mark Harman. “Bias mitigation for machine learning classifiers: A comprehensive survey”. arXiv preprint arXiv:2207.07068 (2022) (cit. on p. 2). Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly. “Parameter-efficient transfer learning for NLP”. International Conference on Machine Learning. PMLR. 2019, pp. 2790–2799. URL: http://proceedings.mlr.press/ v97/houlsby19a.html (cit. on pp. 43, 47, 48, 53). 69 Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen. “LoRA: Low-Rank Adaptation of Large Language Models”. International Conference on Learning Representations. 2022. URL: https://openreview.net/forum?id=nZeVKeeFYf9 (cit. on pp. 43, 47, 49, 53). Ayush Jaiswal, Daniel Moyer, Greg Ver Steeg, Wael AbdAlmageed, Premkumar Natarajan. “Invariant Representations through Adversarial Forgetting”. AAAI. 2020, pp. 4272–4279 (cit. on pp. 3, 4, 8, 10, 17, 19, 90). Bargav Jayaraman, David Evans. “Evaluating Differentially Private Machine Learning in Practice”. 28th USENIX Security Symposium (USENIX Security 19). USENIX Association, 2019, pp. 1895–1912 (cit. on p. 35). Bargav Jayaraman, Lingxiao Wang, David Evans, Quanquan Gu. “Revisiting membership inference under realistic assumptions”. arXiv preprint arXiv:2005.10881 (2020) (cit. on p. 36). Sumit Kumar Jha, Susmit Jha, Rickard Ewetz, Sunny Raj, Alvaro Velasquez, Laura L Pullum, Ananthram Swami. “An Extension of Fano’s Inequality for Characterizing Model Susceptibility to Membership Inference Attacks”. arXiv preprint arXiv:2009.08097 (2020) (cit. on pp. 3, 39). 
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu. “TinyBERT: Distilling BERT for Natural Language Understanding”. Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, 2020, pp. 4163–4174. DOI: 10.18653/v1/2020.findings-emnlp.372. URL: https://aclanthology.org/2020.findings-emnlp.372 (cit. on p. 24). Peter Kairouz, H. Brendan McMahan. “Advances and Open Problems in Federated Learning”. Foundations and Trends® in Machine Learning 14.1 (2021), pp. –. DOI: 10.1561/2200000083 (cit. on p. 35). Rabeeh Karimi Mahabadi, James Henderson, Sebastian Ruder. “Compacter: Efficient low-rank hypercomplex adapter layers”. Advances in Neural Information Processing Systems 34 (2021), pp. 1022–1035. URL: https://openreview.net/forum?id=bqGK5PyI6-N (cit. on pp. 52, 53). Lauren Kirchner, Surya Mattu, Jeff Larson, Julia Angwin. “Machine Bias: There’s software used across the country to predict future criminals. And it’s biased against blacks. ” Propublica (2016), pp. 1–26. URL: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing (cit. on p. 1). Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. “Inherent Trade-Offs in the Fair Determination of Risk Scores”. Leibniz International Proceedings in Informatics, LIPIcs. V ol. 67. 2017 (cit. on pp. 2, 21). Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, Nazneen Fatema Rajani. “GeDi: Generative Discriminator Guided Sequence Generation”. Findings of the Association for Computational Linguistics: EMNLP 2021. Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, pp. 4929–4952. DOI: 10.18653/v1/2021.findings-emnlp.424. URL: https://aclanthology.org/2021.findings-emnlp.424 (cit. on p. 25). Matt J. Kusner, Joshua R. Loftus, Chris Russell, Ricardo Silva. “Counterfactual Fairness”. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 70 December 4-9, 2017, Long Beach, CA, USA. Ed. by Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V . N. Vishwanathan, Roman Garnett. 2017, pp. 4066–4076. URL: https://proceedings.neurips.cc/paper/2017/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html (cit. on pp. 25, 26). Pradeep K Lam, Vigneshwaran Santhalingam, Parth Suresh, Rahul Baboota, Alyssa H Zhu, Sophia I Tho- mopoulos, Neda Jahanshad, Paul M Thompson. “Accurate brain age prediction using recurrent slice-based networks”. 16th International Symposium on Medical Information Processing and Analysis. International Society for Optics and Photonics. 2020 (cit. on pp. 35, 36, 107). Peeter Laud, Alisa Pankova. “Interpreting Epsilon of Differential Privacy in Terms of Advantage in Guessing or Approximating Sensitive Attributes”. arXiv preprint arXiv:1911.12777 (2019) (cit. on p. 114). Junghye Lee, Jimeng Sun, Fei Wang, Shuang Wang, Chi-Hyuck Jun, Xiaoqian Jiang. “Privacy-Preserving Patient Similarity Learning in a Federated Environment: Development and Analysis”. JMIR medical informatics 6.2 (2018), e20 (cit. on p. 35). Klas Leino, Matt Fredrikson. “Stolen Memories: Leveraging Model Memorization for Calibrated White-Box Membership Inference”. 29th{USENIX} Security Symposium ({USENIX} Security 20). 2020, pp. 1605– 1622 (cit. on p. 36). Brian Lester, Rami Al-Rfou, Noah Constant. “The Power of Scale for Parameter-Efficient Prompt Tuning”. 
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, pp. 3045–3059. DOI: 10.18653/v1/2021.emnlp-main.243. URL: https://aclanthology.org/2021.emnlp-main.243 (cit. on pp. 46, 53). Chunyuan Li, Heerad Farkhoor, Rosanne Liu, Jason Yosinski. “Measuring the Intrinsic Dimension of Objective Landscapes”. International Conference on Learning Representations. 2018. URL: https: //openreview.net/forum?id=ryup8-WCW (cit. on p. 53). Wenqi Li, Fausto Milletarì, Daguang Xu, Nicola Rieke, Jonny Hancox, Wentao Zhu, Maximilian Baust, Yan Cheng, Sébastien Ourselin, M. Jorge Cardoso, Andrew Feng. “Privacy-Preserving Federated Brain Tumour Segmentation”. Machine Learning in Medical Imaging. Ed. by Heung-Il Suk, Mingxia Liu, Pingkun Yan, Chunfeng Lian. Springer International Publishing, 2019, pp. 133–141 (cit. on p. 35). Xiang Lisa Li, Percy Liang. “Prefix-Tuning: Optimizing Continuous Prompts for Generation”. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, 2021, pp. 4582–4597. DOI: 10.18653/v1/2021.acl-long.353. URL: https: //aclanthology.org/2021.acl-long.353 (cit. on pp. 43, 47, 53). Xiaoxiao Li, Yufeng Gu, Nicha Dvornek, Lawrence H. Staib, Pamela Ventola, James S. Duncan. “Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results”. Medical Image Analysis 65 (2020), p. 101765 (cit. on p. 35). 71 Xuechen Li, Florian Tramer, Percy Liang, Tatsunori Hashimoto. “Large Language Models Can Be Strong Differentially Private Learners”. International Conference on Learning Representations. 2022. URL: https://openreview.net/forum?id=bVuP3ltATMz (cit. on pp. 43, 52, 55, 61). Zhu Li, Zhi-Hua Zhou, Arthur Gretton. “Towards an understanding of benign overfitting in neural networks”. arXiv preprint arXiv:2106.03212 (2021) (cit. on p. 2). Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, Ruslan Salakhutdinov. “Towards Understanding and Mitigating Social Biases in Language Models”. Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event. Ed. by Marina Meila, Tong Zhang. V ol. 139. Proceedings of Machine Learning Research. PMLR, 2021, pp. 6565–6576. URL: http:// proceedings.mlr.press/v139/liang21a.html (cit. on p. 25). Jianhua Lin. “Divergence Measures Based on the Shannon Entropy”. IEEE Transactions on Information theory 37.1 (1991), pp. 145–151 (cit. on p. 86). Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, Yejin Choi. “DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts”. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, 2021, pp. 6691–6706. DOI: 10.18653/v1/2021.acl-long.522. URL: https: //aclanthology.org/2021.acl-long.522 (cit. on pp. 25, 28, 29). Bo Liu, Ming Ding, Sina Shaham, Wenny Rahayu, Farhad Farokhi, Zihuai Lin. “When machine learning meets privacy: A survey and outlook”. ACM Computing Surveys (CSUR) 54.2 (2021), pp. 1–36 (cit. on p. 2). Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig. 
“Pre-Train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing”. ACM Comput. Surv. (2022). Just Accepted. DOI: 10.1145/3560815. URL: https://doi.org/10.1145/3560815 (cit. on p. 53). Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, Jie Tang. “P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks”. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Dublin, Ireland: Association for Computational Linguistics, 2022, pp. 61–68. DOI: 10.18653/v1/2022.acl-short.8. URL: https://aclanthology.org/2022.acl-short.8 (cit. on pp. 43, 46, 47). Ximeng Liu, Lehui Xie, Yaopeng Wang, Jian Zou, Jinbo Xiong, Zuobin Ying, Athanasios V . Vasilakos. “Privacy and Security Issues in Deep Learning: A Survey”. IEEE Access 9 (2021), pp. 4566–4593. DOI: 10.1109/ACCESS.2020.3045078 (cit. on p. 2). Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. “RoBERTa: A robustly optimized bert pretraining approach”. arXiv preprint arXiv:1907.11692 (2019). URL: https://arxiv.org/abs/1907.11692 (cit. on pp. 43, 48, 49). Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, Richard Zemel. “The Variational Fair Autoencoder”. International Conference on Learning Representations. 2016 (cit. on pp. 4, 8, 17). 72 Daming Lu. “Masked Reasoner at SemEval-2020 Task 4: Fine-Tuning RoBERTa for Commonsense Reason- ing”. Proceedings of the Fourteenth Workshop on Semantic Evaluation. Barcelona (online): International Committee for Computational Linguistics, 2020, pp. 411–414. DOI: 10.18653/v1/2020.semeval-1.49. URL: https://aclanthology.org/2020.semeval-1.49 (cit. on p. 33). Kaiji Lu, Piotr Mardziel, Fangjing Wu, Preetam Amancharla, Anupam Datta. “Gender bias in neural natural language processing”. Logic, Language, and Security. Springer, 2020, pp. 189–202. URL: https://link. springer.com/chapter/10.1007/978-3-030-62077-6_14 (cit. on pp. 25, 26, 28, 29). David Madras, Elliot Creager, Toniann Pitassi, Richard Zemel. “Learning Adversarially Fair and Transfer- able Representations”. Proceedings of the 35th International Conference on Machine Learning. 2018, pp. 3384–3393 (cit. on pp. 4, 8, 10, 14, 15, 17, 22). Yuning Mao, Lambert Mathias, Rui Hou, Amjad Almahairi, Hao Ma, Jiawei Han, Scott Yih, Madian Khabsa. “UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning”. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, 2022, pp. 6253–6264. DOI: 10.18653/v1/2022.acl- long.433. URL: https://aclanthology.org/2022.acl-long.433 (cit. on p. 53). Rowan Hall Maudslay, Hila Gonen, Ryan Cotterell, Simone Teufel. “It’s All in the Name: Mitigating Gender Bias with Name-Based Counterfactual Data Substitution”. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, 2019, pp. 5267–5275. DOI: 10.18653/v1/D19-1530. URL: https://aclanthology.org/D19-1530 (cit. on pp. 25, 26, 28). Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, Blaise Aguera y Arcas. “Communication- Efficient Learning of Deep Networks from Decentralized Data”. Artificial Intelligence and Statistics . PMLR. 2017, pp. 1273–1282 (cit. on pp. 
3, 35, 109). Daniel McNamara, Cheng Soon Ong, Robert C Williamson. “Provably Fair Representations”. arXiv preprint arXiv:1710.04394 (2017) (cit. on pp. 8, 21, 22). Ninareh Mehrabi, Thamme Gowda, Fred Morstatter, Nanyun Peng, Aram Galstyan. “Man is to person as woman is to location: Measuring gender bias in named entity recognition”. Proceedings of the 31st ACM conference on Hypertext and Social Media. 2020, pp. 231–232 (cit. on pp. 1, 4). Ninareh Mehrabi, Umang Gupta, Fred Morstatter, Greg Ver Steeg, Aram Galstyan. “Attributing Fair Decisions with Attention Interventions”. Proceedings of the 2nd Workshop on Trustworthy Natural Language Processing (TrustNLP 2022). Seattle, U.S.A.: Association for Computational Linguistics, 2022, pp. 12– 25. DOI: 10.18653/v1/2022.trustnlp-1.2. URL: https://aclanthology.org/2022.trustnlp-1.2 (cit. on p. 6). Ninareh Mehrabi, Yuzhong Huang, Fred Morstatter. “Statistical Equity: A Fairness Classification Objective”. arXiv preprint arXiv:2005.07293 (2020) (cit. on p. 2). Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, Aram Galstyan. “A Survey on Bias and Fairness in Machine Learning”. arXiv preprint arXiv:1908.09635 (2019) (cit. on pp. 1, 2, 21). 73 Aditya Krishna Menon, Robert C Williamson. “The cost of fairness in binary classification”. Conference on Fairness, Accountability and Transparency. 2018, pp. 107–118 (cit. on p. 8). Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher. “Pointer Sentinel Mixture Models”. 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL: https://openreview.net/forum?id= Byj72udxe (cit. on p. 29). Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, Jeff Dean. “Distributed Representations of Words and Phrases and their Compositionality”. Advances in Neural Information Processing Systems. V ol. 26. 2013, pp. 3111–3119 (cit. on pp. 9, 22). Mikhail Milchenko, Daniel Marcus. “Obscuring Surface Anatomy in V olumetric Imaging Data ”. Neuroinfor- matics 11.1 (2013), pp. 65–75 (cit. on p. 34). Karla L Miller, Fidel Alfaro-Almagro, Neal K Bangerter, David L Thomas, Essa Yacoub, Junqian Xu, Andreas J Bartsch, Saad Jbabdi, Stamatios N Sotiropoulos, Jesper LR Andersson, et al. “Multimodal population brain imaging in the UK Biobank prospective epidemiological study”. Nature neuroscience 19.11 (2016), pp. 1523–1536 (cit. on pp. 36, 107). Ilya Mironov. “Rényi differential privacy”. 2017 IEEE 30th computer security foundations symposium (CSF). IEEE. 2017, pp. 263–275 (cit. on pp. 56, 57). Shubhanshu Mishra, Sijun He, Luca Belli. “Assessing Demographic Bias in Named Entity Recognition”. ArXiv preprint abs/2008.03415 (2020). URL: https://arxiv.org/abs/2008.03415 (cit. on p. 102). Aida Mostafazadeh Davani, Ali Omrani, Brendan Kennedy, Mohammad Atari, Xiang Ren, Morteza Dehghani. “Improving Counterfactual Generation for Fair Hate Speech Detection”. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). Online: Association for Computational Linguistics, 2021, pp. 92–101. DOI: 10.18653/v1/2021.woah-1.10. URL: https://aclanthology.org/2021.woah-1.10 (cit. on p. 25). Daniel Moyer, Shuyang Gao, Rob Brekelmans, Aram Galstyan, Greg Ver Steeg. “Invariant Representations without Adversarial Training”. Advances in Neural Information Processing Systems. V ol. 31. 2018, pp. 9084–9093 (cit. on pp. 3, 4, 8, 10, 12, 13, 15, 17–19, 22). Nikita Nangia, Clara Vania, Rasika Bhalerao, Samuel R. Bowman. 
“CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models”. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, 2020, pp. 1953–1967. DOI: 10.18653/v1/2020.emnlp-main.154. URL: https://aclanthology. org/2020.emnlp-main.154 (cit. on p. 25). Shashi Narayan, Shay B. Cohen, Mirella Lapata. “Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization”. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 1797–1807. DOI: 10.18653/v1/D18-1206. URL: https://aclanthology.org/D18-1206 (cit. on p. 54). 74 Arvind Narayanan, Vitaly Shmatikov. “Robust de-anonymization of large sparse datasets”. 2008 IEEE Symposium on Security and Privacy (sp 2008). IEEE. 2008, pp. 111–125 (cit. on p. 2). M. Nasr, R. Shokri, A. Houmansadr. “Comprehensive Privacy Analysis of Deep Learning: Passive and Active White-box Inference Attacks against Centralized and Federated Learning”. 2019 IEEE Symposium on Security and Privacy (SP). 2019, pp. 739–753 (cit. on pp. 2, 34, 36). Milad Nasr, Reza Shokri, Amir Houmansadr. “Machine Learning with Membership Privacy using Adversarial Regularization”. Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. 2018, pp. 634–646 (cit. on pp. 39, 114). XuanLong Nguyen, Martin J Wainwright, Michael I Jordan. “Estimating divergence functionals and the likelihood ratio by convex risk minimization”. IEEE Transactions on Information Theory 56.11 (2010), pp. 5847–5861 (cit. on p. 22). Aaron van den Oord, Yazhe Li, Oriol Vinyals. “Representation Learning with Contrastive Predictive Coding”. arXiv preprint arXiv:1807.03748 (2018) (cit. on pp. 9, 13, 14, 22). Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. “Training language models to follow instructions with human feedback”. Advances in Neural Information Processing Systems 35 (2022), pp. 27730–27744 (cit. on p. 2). Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, Ulfar Erlingsson. “Scalable Private Learning with PATE”. International Conference on Learning Representations. 2018. URL: https://openreview.net/forum?id=rkZB1XbRZ (cit. on p. 55). Nicolas Papernot, Abhradeep Thakurta, Shuang Song, Steve Chien, Úlfar Erlingsson. “Tempered sigmoid activations for deep learning with differential privacy”. Proceedings of the AAAI Conference on Artificial Intelligence. V ol. 35. 10. 2021, pp. 9312–9321 (cit. on p. 60). Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala. “PyTorch: An Imperative Style, High-Performance Deep Learning Library”. Advances in Neural Information Processing Systems. V ol. 32. 2019, pp. 8026–8037 (cit. on p. 104). F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay. “Scikit-learn: Machine Learning in Python”. Journal of Machine Learning Research 12 (2011), pp. 2825– 2830 (cit. on p. 16). 
Han Peng, Weikang Gong, Christian F Beckmann, Andrea Vedaldi, Stephen M Smith. “Accurate brain age prediction with lightweight deep neural networks”. Medical Image Analysis 68 (2021), p. 101871 (cit. on pp. 35, 36, 107). Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, Iryna Gurevych. “AdapterFusion: Non-Destructive Task Composition for Transfer Learning”. Proceedings of the 16th Conference of the 75 European Chapter of the Association for Computational Linguistics: Main Volume. Online: Association for Computational Linguistics, 2021, pp. 487–503. DOI: 10.18653/v1/2021.eacl-main.39. URL: https: //aclanthology.org/2021.eacl-main.39 (cit. on pp. 43, 53). Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, George Tucker. “On Variational Bounds of Mutual Information”. Proceedings of the 36th International Conference on Machine Learning. 2019 (cit. on pp. 9, 12, 13, 22, 88). Apostolos Pyrgelis, Carmela Troncoso, Emiliano De Cristofaro. “Knock Knock, Who’s There? Membership Inference on Aggregate Location Data”. CoRR abs/1708.06145 (2017) (cit. on pp. 34, 36). Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever. “Language Models are Unsupervised Multitask Learners”. OpenAI Blog 1.8 (2019), p. 9. URL: https://cdn.openai.com/better- language-models/language_models_are_unsupervised_multitask_learners.pdf (cit. on pp. 25, 28, 43). Manish Raghavan, Solon Barocas. Challenges for mitigating bias in algorithmic hiring. 2019. URL: https: //www.brookings.edu/research/challenges-for-mitigating-bias-in-algorithmic-hiring/ (visited on 08/24/2020) (cit. on p. 1). Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, Yuxiong He. “DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters”. KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020. Ed. by Rajesh Gupta, Yan Liu, Jiliang Tang, B. Aditya Prakash. ACM, 2020, pp. 3505–3506. URL: https://dl.acm.org/doi/10.1145/3394486.3406703 (cit. on p. 104). Danilo Rezende, Shakir Mohamed. “Variational Inference with Normalizing Flows”. Proceedings of the 32nd International Conference on Machine Learning. 2015, pp. 1530–1538 (cit. on p. 12). Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger Roth, Shadi Albarqouni, Spyridon Bakas, Mathieu N Galtier, Bennett Landman, Klaus Maier-Hein, et al. “The future of digital health with federated learning”. npj Digital Medicine 3.119 (2020) (cit. on p. 35). Jae Hun Ro, Theresa Breiner, Lara McConnaughey, Mingqing Chen, Ananda Theertha Suresh, Shankar Kumar, Rajiv Mathews. “Scaling Language Model Size in Cross-Device Federated Learning”. ACL 2022 Workshop on Federated Learning for Natural Language Processing. 2022. URL: https://openreview.net/ forum?id=ShNG29KGF-c (cit. on p. 43). Abhijit Guha Roy, Shayan Siddiqui, Sebastian Pölsterl, Nassir Navab, Christian Wachinger. “BrainTorrent: A Peer-to-Peer Environment for Decentralized Federated”. arXiv preprint arXiv:1905.06731 (2019) (cit. on p. 35). Proteek Chandan Roy, Vishnu Naresh Boddeti. “Mitigating Information Leakage in Image Representations: A Maximum Entropy Approach”. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 2586–2594 (cit. on pp. 8, 17, 19, 90). Andreas Rücklé, Gregor Geigle, Max Glockner, Tilman Beck, Jonas Pfeiffer, Nils Reimers, Iryna Gurevych. “AdapterDrop: On the Efficiency of Adapters in Transformers”. 
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: 76 Association for Computational Linguistics, 2021, pp. 7930–7946. DOI: 10.18653/v1/2021.emnlp-main.626. URL: https://aclanthology.org/2021.emnlp-main.626 (cit. on p. 53). Ahmed Salem, Yang Zhang, Mathias Humbert, Mario Fritz, Michael Backes. “ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models”. Network and Distributed Systems Security Symposium 2019. Internet Society. 2019 (cit. on pp. 34, 36, 39). Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”. ArXiv preprint abs/1910.01108 (2019). URL: https://arxiv.org/ abs/1910.01108 (cit. on pp. 24, 28, 104). Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, Yejin Choi. “Social Bias Frames: Reasoning about Social and Power Implications of Language”. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020, pp. 5477–5490. DOI: 10.18653/v1/2020.acl-main.486. URL: https://aclanthology.org/2020.acl- main.486 (cit. on p. 25). Timo Schick, Hinrich Schütze. “It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners”. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, 2021, pp. 2339–2352. DOI: 10.18653/v1/2021.naacl-main.185. URL: https://aclanthology. org/2021.naacl-main.185 (cit. on p. 45). Nakeisha Schimke, John Hale. “Quickshear Defacing for Neuroimages”. Proceedings of the 2nd USENIX conference on Health security and privacy. USENIX Association. 2011, pp. 11–11 (cit. on p. 34). Andrey A Shabalin, Andrew B Nobel. “Reconstruction of a low-rank matrix in the presence of Gaussian noise”. Journal of Multivariate Analysis 118 (2013), pp. 67–76 (cit. on pp. 55, 60). Saeed Sharifi-Malvajerdi, Michael Kearns, Aaron Roth. “Average Individual Fairness: Algorithms, General- ization and Experiments”. Advances in Neural Information Processing Systems. V ol. 32. 2019, pp. 8242– 8251 (cit. on p. 21). Micah J Sheller, G Anthony Reina, Brandon Edwards, Jason Martin, Spyridon Bakas. “Multi-institutional Deep Learning Modeling Without Sharing Patient Data: A Feasibility Study on Brain Tumor Segmenta- tion”. International MICCAI Brainlesion Workshop. Springer. 2018, pp. 92–104 (cit. on p. 35). Emily Sheng, Kai-Wei Chang, Prem Natarajan, Nanyun Peng. “Towards Controllable Biases in Language Generation”. Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Associa- tion for Computational Linguistics, 2020, pp. 3239–3254. DOI: 10.18653/v1/2020.findings-emnlp.291. URL: https://aclanthology.org/2020.findings-emnlp.291 (cit. on pp. 25, 29–31, 97, 99, 105). Emily Sheng, Kai-Wei Chang, Prem Natarajan, Nanyun Peng. “Societal Biases in Language Generation: Progress and Challenges”. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, 2021, pp. 4275–4293. DOI: 10.18653/ v1/2021.acl-long.330. URL: https://aclanthology.org/2021.acl-long.330 (cit. on pp. 24–26). 77 Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, Nanyun Peng. 
“The Woman Worked as a Babysitter: On Biases in Language Generation”. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, 2019, pp. 3407–3412. DOI: 10.18653/v1/D19-1339. URL: https://aclanthology.org/D19-1339 (cit. on pp. 4, 24, 25, 33, 103). Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, Sameer Singh. “AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts”. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, 2020, pp. 4222–4235. DOI: 10.18653/v1/2020.emnlp-main.346. URL: https://aclanthology.org/2020.emnlp-main.346 (cit. on p. 53). Reza Shokri, Marco Stronati, Congzheng Song, Vitaly Shmatikov. “Membership Inference Attacks Against Machine Learning Models”. 2017 IEEE Symposium on Security and Privacy (SP). 2017, pp. 3–18 (cit. on pp. 1, 2, 34, 36). Santiago Silva, Andre Altmann, Boris Gutman, Marco Lorenzi. “Fed-BioMed: A General Open-Source Frontend Framework for Federated Learning in Healthcare”. Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning. Springer, 2020, pp. 201–210 (cit. on p. 35). Santiago Silva, Boris A Gutman, Eduardo Romero, Paul M Thompson, Andre Altmann, Marco Lorenzi. “Federated Learning in Distributed Medical Databases: Meta-Analysis of Large-Scale Subcortical Brain Data”. 2019 IEEE 16th international symposium on biomedical imaging (ISBI 2019). IEEE. 2019, pp. 270–274 (cit. on p. 35). Irene Solaiman, Christy Dennison. “Process for Adapting Language Models to Society (PALMS) with Values- Targeted Datasets”. ArXiv preprint abs/2106.10328 (2021). URL: https://arxiv.org/abs/2106.10328 (cit. on p. 25). Jiaming Song, Pratyusha Kalluri, Aditya Grover, Shengjia Zhao, Stefano Ermon. “Learning Controllable Fair Representations”. The 22nd International Conference on Artificial Intelligence and Statistics . 2019, pp. 2164–2173 (cit. on pp. 3, 8, 10, 13–16, 18, 22). Liwei Song, Prateek Mittal. “Systematic Evaluation of Privacy Risks of Machine Learning Models”. arXiv preprint arXiv:2003.10595 (2020) (cit. on p. 36). Megha Srivastava, Hoda Heidari, Andreas Krause. “Mathematical Notions vs. Human Perception of Fairness: A Descriptive Approach to Fairness for Machine Learning”. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019, 2459–2468 (cit. on pp. 2, 9, 21). Thomas Steinke, Lydia Zakynthinou. “Reasoning About Generalization via Conditional Mutual Information”. Proceedings of Thirty Third Conference on Learning Theory. Ed. by Jacob Abernethy, Shivani Agarwal. V ol. 125. Proceedings of Machine Learning Research. PMLR, 2020, pp. 3437–3452. URL: https:// proceedings.mlr.press/v125/steinke20a.html (cit. on pp. 1, 3). 78 Dimitris Stripelis, José Luis Ambite, Pradeep Lam, Paul Thompson. “Scaling Neuroscience Research using Federated Learning”. IEEE International Symposium on Biomedical Imaging (ISBI). 2021 (cit. on pp. 35, 36, 107, 108). Dimitris Stripelis, Umang Gupta, Nikhil Dhinagar, Greg Ver Steeg, Paul M Thompson, José Luis Ambite. “Towards Sparsified Federated Neuroimaging Models via Weight Pruning”. International Workshop on Distributed, Collaborative, and Federated Learning. 2022, pp. 141–151 (cit. on p. 6). 
Dimitris Stripelis, Umang Gupta, Hamza Saleem, Nikhil Dhinagar, Tanmay Ghai, Rafael Sanchez, Chryso- valantis Anastasiou, Armaghan Asghar, Greg Ver Steeg, Srivatsan Ravi, et al. “Secure Federated Learning for Neuroimaging”. arXiv preprint arXiv:2205.05249 (2022) (cit. on pp. 6, 41). Dimitris Stripelis, Umang Gupta, Greg Ver Steeg, Jose Luis Ambite. “Federated Progressive Sparsification (Purge-Merge-Tune)+”. Workshop on Federated Learning: Recent Advances and New Challenges (in Conjunction with NeurIPS 2022). 2022. URL: https://openreview.net/forum?id=GLQqPTRrQMx (cit. on p. 6). Dimitris Stripelis, Hamza Saleem, Tanmay Ghai, Nikhil Dhinagar, Umang Gupta, Chrysovalantis Anastasiou, Greg Ver Steeg, Srivatsan Ravi, Muhammad Naveed, Paul M Thompson, et al. “Secure neuroimaging analysis using federated learning with homomorphic encryption”. 17th International Symposium on Medical Information Processing and Analysis. V ol. 12088. 2021, pp. 351–359 (cit. on p. 6). Nishant Subramani, Nivedita Suresh. “Discovering useful sentence representations from large pretrained language models”. arXiv preprint arXiv:2008.09049 (2020). URL: https://arxiv.org/abs/2008.09049 (cit. on p. 45). Pranav Subramani, Nicholas Vadivelu, Gautam Kamath. “Enabling fast differentially private sgd via just- in-time compilation and vectorization”. Advances in Neural Information Processing Systems 34 (2021), pp. 26409–26421 (cit. on p. 5). Jared Sylvester, Edward Raff. “What About Applied Fairness?” Machine Learning: The Debates (ML-D) organized as part of the Federated AI Meeting (FAIM 2018). 2018 (cit. on p. 18). Xinyu Tang, Saeed Mahloujifar, Liwei Song, Virat Shejwalkar, Milad Nasr, Amir Houmansadr, Prateek Mittal. “Mitigating Membership Inference Attacks by Self-Distillation Through a Novel Ensemble Architecture”. 31st USENIX Security Symposium (USENIX Security 22). Boston, MA: USENIX Association, 2022, pp. 1433–1450. URL: https://www.usenix.org/conference/usenixsecurity22/presentation/tang (cit. on p. 3). O. Tange. “GNU Parallel - The Command-Line Power Tool”. ;login: The USENIX Magazine 36.1 (2011), pp. 42–47. URL: http://www.gnu.org/s/parallel. Erik F. Tjong Kim Sang, Fien De Meulder. “Introduction to the CoNLL-2003 Shared Task: Language- Independent Named Entity Recognition”. Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. 2003, pp. 142–147. URL: https://aclanthology.org/W03-0419 (cit. on p. 47). 79 Godfried Toussaint. “Sharper lower bounds for discrimination information in terms of variation (Corresp.)” IEEE Transactions on Information Theory 21.1 (1975), pp. 99–100 (cit. on p. 86). Florian Tramer, Dan Boneh. “Differentially Private Learning Needs Better Features (or Much More Data)”. International Conference on Learning Representations. 2021. URL: https://openreview.net/forum?id= YTWGvpFOQD- (cit. on pp. 3, 43, 52, 60). Marcos Treviso, Tianchu Ji, Ji-Ung Lee, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Pedro H. Martins, André F. T. Martins, Peter Milder, Colin Raffel, Edwin Simpson, Noam Slonim, Niranjan Balasubramanian, Leon Derczynski, Roy Schwartz. “Efficient methods for natural language processing: A survey”. arXiv preprint arXiv:2209.00099 (2022). URL: https://arxiv.org/abs/2209.00099 (cit. on pp. 43, 53). Stacey Truex, Ling Liu, Mehmet Emre Gursoy, Lei Yu, Wenqi Wei. “Towards Demystifying Membership Inference Attacks”. arXiv preprint arXiv:1807.09173 (2018) (cit. on pp. 36, 39). Konstantinos Tzioumis. 
Data for: Demographic aspects of first names . Version V1. 2018. DOI: 10.7910/DVN/ TYJKEZ. URL: https://doi.org/10.7910/DVN/TYJKEZ (cit. on p. 102). United States Equal Opportunity Employment Commission. Uniform Guidelines on Employee Selection Procedures. 1978 (cit. on p. 21). Petar Veliˇ ckovi´ c, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, R Devon Hjelm. “Deep Graph Infomax”. International Conference on Learning Representations. 2019 (cit. on pp. 9, 22). Didac Vidal-Pineiro, Yunpeng Wang, Stine K Krogsrud, Inge K Amlien, William FC Baare, David Bartres- Faz, Lars Bertram, Andreas M Brandmaier, Christian A Drevon, Sandra Duzel, et al. ““Brain age” relates to early life factors but not to accelerated brain aging”. bioRxiv (2021) (cit. on p. 35). Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding”. International Conference on Learning Representations. 2019. URL: https://openreview.net/forum?id=rJ4km2R5t7 (cit. on p. 47). Hao Wang, Berk Ustun, Flavio P. Calmon. “On the Direction of Discrimination: An Information-Theoretic Analysis of Disparate Impact in Machine Learning”. 2018 IEEE International Symposium on Information Theory (ISIT). 2018, pp. 126–130. DOI: 10.1109/ISIT.2018.8437661 (cit. on p. 2). Tianhao Wang, Yuheng Zhang, Ruoxi Jia. “Improving robustness to model inversion attacks via mutual information regularization”. Proceedings of the AAAI Conference on Artificial Intelligence . V ol. 35. 13. 2021, pp. 11666–11673 (cit. on p. 3). Yu-Xiang Wang, Jing Lei, Stephen E Fienberg. “Learning with differential privacy: Stability, learnability and the sufficiency and necessity of ERM principle”. The Journal of Machine Learning Research 17.1 (2016), pp. 6353–6392 (cit. on p. 3). Paul L Williams, Randall D Beer. “Nonnegative Decomposition of Multivariate Information”. arXiv preprint arXiv:1004.2515 (2010) (cit. on p. 11). 80 Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, Alexander Rush. “Transformers: State-of-the-Art Natural Language Processing”. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics, 2020, pp. 38–45. DOI: 10.18653/v1/2020.emnlp- demos.6. URL: https://aclanthology.org/2020.emnlp-demos.6 (cit. on p. 104). Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, Graham Neubig. “Controllable Invariance through Adversar- ial Feature Learning”. Advances in Neural Information Processing Systems. V ol. 30. 2017, pp. 585–596 (cit. on p. 19). Albert Xu, Eshaan Pathak, Eric Wallace, Suchin Gururangan, Maarten Sap, Dan Klein. “Detoxifying Language Models Risks Marginalizing Minority V oices”. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, 2021, pp. 2390–2397. DOI: 10.18653/v1/2021.naacl-main.190. URL: https://aclanthology.org/2021.naacl-main.190 (cit. on p. 25). Mingbin Xu, Congzheng Song, Ye Tian, Neha Agrawal, Filip Granqvist, Rogier van Dalen, Xiao Zhang, Arturo Argueta, Shiyi Han, Yaqiao Deng, et al. 
“Training Large-V ocabulary Neural Language Models by Private Federated Learning for Resource-Constrained Devices”. arXiv preprint arXiv:2207.08988 (2022). URL: https://arxiv.org/abs/2207.08988 (cit. on p. 43). Xingyu Xu, Yuantao Gu. “Benign overfitting of non-smooth neural networks beyond lazy training”. Proceed- ings of The 26th International Conference on Artificial Intelligence and Statistics . Ed. by Francisco Ruiz, Jennifer Dy, Jan-Willem van de Meent. V ol. 206. Proceedings of Machine Learning Research. PMLR, 2023, pp. 11094–11117. URL: https://proceedings.mlr.press/v206/xu23k.html (cit. on p. 2). Yilun Xu, Shengjia Zhao, Jiaming Song, Russell Stewart, Stefano Ermon. “A Theory of Usable Information under Computational Constraints”. International Conference on Learning Representations. 2020 (cit. on pp. 8, 90). Hao Yang, Junyang Lin, An Yang, Peng Wang, Chang Zhou, Hongxia Yang. “Prompt Tuning for Generative Multimodal Pretrained Models”. arXiv preprint arXiv:2208.02532 (2022). URL: https://arxiv.org/abs/ 2208.02532 (cit. on p. 53). Qiang Yang, Yang Liu, Tianjian Chen, Yongxin Tong. “Federated Machine Learning: Concept and Applica- tions”. ACM Transactions on Intelligent Systems and Technology (TIST) 10.2 (2019), pp. 1–19 (cit. on p. 35). Samuel Yeom, Irene Giacomelli, Matt Fredrikson, Somesh Jha. “Privacy risk in machine learning: Analyzing the connection to overfitting”. 2018 IEEE 31st Computer Security Foundations Symposium (CSF). IEEE. 2018, pp. 268–282 (cit. on p. 1). Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, Huishuai Zhang. “Differentially Private Fine-tuning of Language Models”. International Conference on Learning Representations. 2022. URL: https://openreview.net/forum?id=Q42f0dfjECO (cit. on pp. 3, 5, 43, 44, 52, 55, 60). 81 Da Yu, Huishuai Zhang, Wei Chen, Tie-Yan Liu. “Do not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning”. International Conference on Learning Representations. 2021. URL: https://openreview.net/forum?id=7aogOj_VYO0 (cit. on pp. 55, 56, 61). Da Yu, Huishuai Zhang, Wei Chen, Jian Yin, Tie-Yan Liu. “Large scale private learning via low-rank reparametrization”. International Conference on Machine Learning. PMLR. 2021, pp. 12208–12218. URL: https://proceedings.mlr.press/v139/yu21f.html (cit. on p. 55). Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rogriguez, Krishna P. Gummadi. “Fairness Constraints: Mechanisms for Fair Classification”. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. 2017, pp. 962–970 (cit. on p. 21). Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, Cynthia Dwork. “Learning Fair Representations”. Pro- ceedings of the 30th International Conference on Machine Learning. 2013, pp. 325–333 (cit. on pp. 4, 17). Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals. “Understanding deep learning requires rethinking generalization”. International Conference on Learning Representations. 2017 (cit. on pp. 2, 37). Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, Kai-Wei Chang. “Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods”. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). New Orleans, Louisiana: Association for Computational Linguistics, 2018, pp. 15–20. DOI: 10.18653/v1/N18-2003. 
URL: https://aclanthology.org/N18-2003 (cit. on pp. 25, 33, 98).
Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, Kai-Wei Chang. “Learning Gender-Neutral Word Embeddings”. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 4847–4853. DOI: 10.18653/v1/D18-1521. URL: https://aclanthology.org/D18-1521 (cit. on p. 25).
Kankan Zhou, Eason Lai, Jing Jiang. “VLStereoSet: A Study of Stereotypical Bias in Pre-trained Vision-Language Models”. Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online only: Association for Computational Linguistics, 2022, pp. 527–538. URL: https://aclanthology.org/2022.aacl-main.40 (cit. on p. 2).
Wangchunshu Zhou, Canwen Xu, Julian McAuley. “Efficiently Tuned Parameters are Task Embeddings”. arXiv preprint arXiv:2210.11705 (2022). URL: https://arxiv.org/abs/2210.11705 (cit. on p. 53).
Xin Zhou, Ruotian Ma, Yicheng Zou, Xuanting Chen, Tao Gui, Qi Zhang, Xuanjing Huang, Rui Xie, Wei Wu. “Making Parameter-efficient Tuning More Efficient: A Unified Framework for Classification Tasks”. Proceedings of the 29th International Conference on Computational Linguistics. Gyeongju, Republic of Korea: International Committee on Computational Linguistics, 2022, pp. 7053–7064. URL: https://aclanthology.org/2022.coling-1.615 (cit. on p. 53).
Xuhui Zhou, Maarten Sap, Swabha Swayamdipta, Yejin Choi, Noah Smith. “Challenges in Automated Debiasing for Toxic Language Detection”. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Online: Association for Computational Linguistics, 2021, pp. 3143–3155. DOI: 10.18653/v1/2021.eacl-main.274. URL: https://aclanthology.org/2021.eacl-main.274 (cit. on p. 25).
Yingxue Zhou, Steven Wu, Arindam Banerjee. “Bypassing the Ambient Dimension: Private SGD with Gradient Subspace Identification”. International Conference on Learning Representations. 2021. URL: https://openreview.net/forum?id=7dpmlkBuJFC (cit. on pp. 55, 61).

Appendices

Appendix A
Proofs

A.1 Proof of Theorem 2.2

Theorem A.1. (Theorem 2.2 restated) For some $z, c \sim p(z,c)$, $z \in \mathbb{R}^d$, $c \in \{0,1\}$, and any decision algorithm $\mathcal{A}$ that acts on $z$, we have $I(z:c) \ge g(\pi, \Delta_{DP}(\mathcal{A},c))$, where $\pi = P(c=1)$ and $g$ is a strictly increasing, non-negative convex function in $\Delta_{DP}(\mathcal{A},c)$.

Proof. First, we show that the parity of any algorithm $\mathcal{A}$ that acts on $z$ is upper bounded by the variational distance between the conditional distributions $p(z \mid c=1)$ and $p(z \mid c=0)$. For some $\mathcal{A}$ acting on $z$,
\begin{align}
\Delta_{DP}(\mathcal{A},c) &= \lvert P(\hat{y}=1 \mid c=1) - P(\hat{y}=1 \mid c=0) \rvert \nonumber \\
&= \Big\lvert \int_z dz\, P(\hat{y}=1 \mid z)\, p(z \mid c=1) - \int_z dz\, P(\hat{y}=1 \mid z)\, p(z \mid c=0) \Big\rvert \nonumber \\
&= \Big\lvert \int_z dz\, P(\hat{y}=1 \mid z)\, \{ p(z \mid c=1) - p(z \mid c=0) \} \Big\rvert \nonumber \\
&\le \int_z dz\, P(\hat{y}=1 \mid z)\, \lvert p(z \mid c=1) - p(z \mid c=0) \rvert \nonumber \\
&\le \int_z dz\, \lvert p(z \mid c=1) - p(z \mid c=0) \rvert \qquad \text{(since } P(\hat{y}=1 \mid z) \le 1 \text{)} \nonumber \\
&= V(p(z \mid c=0), p(z \mid c=1)) \tag{A.1}
\end{align}
where $V(p(z \mid c=0), p(z \mid c=1)) = \int_z dz\, \lvert p(z \mid c=1) - p(z \mid c=0) \rvert = \lVert p(z \mid c=1) - p(z \mid c=0) \rVert$ is the variational distance between $p(z \mid c=1)$ and $p(z \mid c=0)$. Next, we show that the mutual information $I(z:c)$ is lower bounded by a strictly increasing, non-negative convex function of the variational distance between $p(z \mid c=1)$ and $p(z \mid c=0)$, and therefore, by transitivity, also lower bounded by that function of the parity of any $\mathcal{A}$.
Letting $\pi = P(c=1)$, we can write:
\begin{align}
I(z:c) &= \mathbb{E}_{z,c}\left[ \log \frac{p(z,c)}{p(z)p(c)} \right] = \mathbb{E}_{z,c}\left[ \log \frac{p(z \mid c)}{p(z)} \right] \nonumber \\
&= (1-\pi)\, \mathbb{E}_{z \mid c=0}\left[ \log \frac{p(z \mid c=0)}{p(z)} \right] + \pi\, \mathbb{E}_{z \mid c=1}\left[ \log \frac{p(z \mid c=1)}{p(z)} \right] \nonumber \\
&= (1-\pi)\, KL(p(z \mid c=0) \,\|\, p(z)) + \pi\, KL(p(z \mid c=1) \,\|\, p(z)) \tag{A.2} \\
&= JSD_{(1-\pi,\pi)}(p(z \mid c=0), p(z \mid c=1)) \nonumber
\end{align}
The last step is due to
\begin{equation}
p(z) = \sum_c p(z,c) = (1-\pi)\, p(z \mid c=0) + \pi\, p(z \mid c=1) \tag{A.3}
\end{equation}
and here $JSD_{(1-\pi,\pi)}(p_1, p_2)$ denotes the generalized Jensen-Shannon divergence with mixture weights $(1-\pi, \pi)$ (Lin, 1991). We know from Toussaint (1975) that
\begin{equation}
KL(p_1 \,\|\, p_2) \;\ge\; \max\left\{ \log\frac{2+V}{2-V} - \frac{2V}{2+V},\; \frac{V^2}{2} + \frac{V^4}{36} + \frac{V^6}{288} \right\} = f(V) \tag{A.4}
\end{equation}
For simplicity, we have used $V$ to denote the variational distance $V(p_1, p_2)$; $f$ is defined on the range $[0, 2)$. We note two important properties of $f$ that are useful for our proof:
• $f$ is a strictly increasing function.
• $f$ is the maximum of two convex functions and therefore also convex.
Combining Eqs. (A.4) and (A.2), and noting that
\begin{align*}
V(p(z \mid c=0), p(z)) &= \| p(z \mid c=0) - p(z) \| = \pi\, \| p(z \mid c=0) - p(z \mid c=1) \| \\
V(p(z \mid c=1), p(z)) &= \| p(z \mid c=1) - p(z) \| = (1-\pi)\, \| p(z \mid c=0) - p(z \mid c=1) \|
\end{align*}
we get the required result,
\begin{align*}
I(z:c) &\ge (1-\pi)\, f(V(p(z \mid c=0), p(z))) + \pi\, f(V(p(z \mid c=1), p(z))) \\
&= (1-\pi)\, f(\pi\, V(p(z \mid c=0), p(z \mid c=1))) + \pi\, f((1-\pi)\, V(p(z \mid c=0), p(z \mid c=1))) \\
&\ge (1-\pi)\, f(\pi\, \Delta_{DP}(\mathcal{A},c)) + \pi\, f((1-\pi)\, \Delta_{DP}(\mathcal{A},c)) \qquad \text{(by Eq. (A.1) and the strictly increasing nature of } f) \\
&= g(\pi, \Delta_{DP}(\mathcal{A},c)).
\end{align*}
$g$ is a positive weighted combination of strictly increasing convex functions and is therefore also strictly increasing and convex. This completes the proof.

When $c$ is multinomial: Letting $\pi_i = P(c=i)$, we can write
\begin{align}
I(z:c) &= \sum_i \pi_i\, KL(p(z \mid c=i) \,\|\, p(z)) \nonumber \\
&\ge \sum_i \pi_i\, f(V(p(z \mid c=i), p(z))) \nonumber \\
&\ge f\Big( \sum_i \pi_i\, V(p(z \mid c=i), p(z)) \Big) \qquad \text{(due to convexity of } f) \nonumber \\
&\ge f\Big( \max_{i,j}\; \pi_i\, \| p(z \mid c=i) - p(z) \| + \pi_j\, \| p(z \mid c=j) - p(z) \| \Big) \qquad (f \text{ is strictly increasing)} \nonumber \\
&\ge f\Big( \max_{i,j}\; \alpha\, \| p(z \mid c=i) - p(z \mid c=j) \| \Big) \qquad \text{(by the triangle inequality and the increasing nature of } f) \nonumber \\
&= f\Big( \alpha\, \max_{i,j} \| p(z \mid c=i) - p(z \mid c=j) \| \Big) \tag{A.5}
\end{align}
where $\alpha = \min_k \pi_k$. We can get a slightly better $\alpha$ by noting that $\alpha$ can be $\min\{\pi_{i'}, \pi_{j'}\}$, where $i', j' = \arg\max_{i,j} \| p(z \mid c=i) - p(z \mid c=j) \|$. Similar to Eq. (A.1), we can show that $\| p(z \mid c=i) - p(z \mid c=j) \| \ge \lvert P(\hat{y} \mid c=i) - P(\hat{y} \mid c=j) \rvert$, and plugging this into Eq. (A.5), we get the required result,
\begin{equation*}
I(z:c) \ge f(\alpha\, \Delta_{DP}(\mathcal{A},c)).
\end{equation*}

A.2 Proof of Proposition 2.5

Proposition A.2. (Proposition 2.5 restated) For any $u, v, w \sim p(u,v,w)$, $\tilde{u} \sim p(u \mid w)$, and function $f$, we have
\begin{equation}
I(u:v \mid w) \;\ge\; \mathbb{E}_{u,v,w}\left[ \log \frac{e^{f(u,v,w)}}{\frac{1}{M} \sum_{j=1}^{M} e^{f(\tilde{u}_j, v, w)}} \right] \tag{A.6}
\end{equation}
where $u, \tilde{u} \in \mathcal{U}$, $v \in \mathcal{V}$, $w \in \mathcal{W}$, $f : \mathcal{U} \times \mathcal{V} \times \mathcal{W} \to \mathbb{R}$, and $M$ is the number of samples from $p(u \mid w)$.

Proof. To prove this, we approximate $p(u \mid v, w)$ with a variational distribution $q$,
\begin{equation*}
q(u \mid v, w) = \frac{p(u \mid w)\, e^{g(u,v,w)}}{Z(v,w)}, \qquad Z(v,w) = \int_u du\; p(u \mid w)\, e^{g(u,v,w)} = \mathbb{E}_{u \mid w}\left[ e^{g(u,v,w)} \right].
\end{equation*}
Then we have
\begin{align*}
I(u:v \mid w) &= \mathbb{E}_{u,v,w}\left[ \log \frac{p(u \mid v,w)}{p(u \mid w)} \right] = \mathbb{E}_{u,v,w}\left[ \log \frac{q(u \mid v,w)}{p(u \mid w)} \right] + \mathbb{E}_{v,w}\left[ KL(p(u \mid v,w) \,\|\, q(u \mid v,w)) \right] \\
&\ge \mathbb{E}_{u,v,w}\left[ g(u,v,w) \right] - \mathbb{E}_{v,w}\left[ \log Z(v,w) \right] \\
&= \mathbb{E}_{u,v,w}\left[ g(u,v,w) \right] - \mathbb{E}_{v,w}\left[ \log \mathbb{E}_{u \mid w}\, e^{g(u,v,w)} \right] \\
&\ge \mathbb{E}_{u,v,w}\left[ g(u,v,w) \right] - \mathbb{E}_{v,w}\left[ \mathbb{E}_{u \mid w}\, e^{g(u,v,w)} \right] + 1.
\end{align*}
The last inequality is due to $\log x \le x - 1$. Similar to Poole et al. (2019), we assume that we have $M-1$ extra samples from $p(u \mid w)$, and let $g(u,v,w) = g(u_{1:M}, v, w) = f(u_1, v, w) - \log \frac{1}{M} \sum_{i=1}^{M} e^{f(u_i, v, w)}$, where for notational ease $u_1 = u$ and $u_{2:M}$ are the additional samples. With these choices, we have
\begin{equation*}
\mathbb{E}_{v,w,\, u \mid w}\left[ e^{g(u,v,w)} \right] = \mathbb{E}_{v,w,\, u \mid w}\left[ \frac{e^{f(u_1,v,w)}}{\frac{1}{M} \sum_{i=1}^{M} e^{f(u_i,v,w)}} \right] = \mathbb{E}_{v,w,\, u \mid w}\left[ \frac{\frac{1}{M} \sum_{j=1}^{M} e^{f(u_j,v,w)}}{\frac{1}{M} \sum_{i=1}^{M} e^{f(u_i,v,w)}} \right] = 1
\end{equation*}
(replacing the numerator with its $M$-sample mean estimate), and the result follows.
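Both results above lend themselves to short numerical sanity checks. First, a minimal sketch (our own illustration; the function names f and parity_lower_bound are not from the thesis) that evaluates the bound of Theorem A.1, i.e., $g(\pi, \Delta_{DP}) = (1-\pi) f(\pi \Delta_{DP}) + \pi f((1-\pi) \Delta_{DP})$ with $f$ as in Eq. (A.4):

```python
import math

def f(v):
    """Lower bound on KL divergence in terms of the variational distance V in [0, 2) (Eq. A.4)."""
    assert 0 <= v < 2
    bound1 = math.log((2 + v) / (2 - v)) - 2 * v / (2 + v)
    bound2 = v**2 / 2 + v**4 / 36 + v**6 / 288
    return max(bound1, bound2)

def parity_lower_bound(pi, delta_dp):
    """g(pi, Delta_DP): information about c (in nats) that z must retain if some
    downstream classifier acting on z exhibits statistical parity delta_dp."""
    return (1 - pi) * f(pi * delta_dp) + pi * f((1 - pi) * delta_dp)

# Example: with a balanced sensitive attribute (pi = 0.5) and an observed Delta_DP of 0.2,
# the representation must carry at least this much information about c.
print(parity_lower_bound(0.5, 0.2))
```

Second, the bound of Proposition A.2 (Eq. (A.6)) is the contrastive objective used for conditional mutual information in Chapter 2. The PyTorch sketch below is an assumption-laden illustration rather than the thesis implementation: Critic stands in for a critic network $f(u, v, w)$, and u_tilde is assumed to hold $M-1$ additional samples from $p(u \mid w)$ for each example; how those samples are drawn is application-specific.

```python
import math
import torch
import torch.nn as nn

class Critic(nn.Module):
    """A simple critic f(u, v, w) -> scalar score. The architecture here is illustrative only."""
    def __init__(self, dim_u, dim_v, dim_w, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_u + dim_v + dim_w, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, u, v, w):
        return self.net(torch.cat([u, v, w], dim=-1)).squeeze(-1)

def contrastive_cmi_lower_bound(critic, u, v, w, u_tilde):
    """Monte-Carlo estimate of the bound in Eq. (A.6).

    u:       (B, dim_u)       positive samples (u_1 in the proof)
    v, w:    (B, dim_v), (B, dim_w)
    u_tilde: (B, M-1, dim_u)  additional samples from p(u | w) for each example
    """
    B, Mm1, _ = u_tilde.shape
    M = Mm1 + 1
    # u_1 = u together with the M-1 extra samples from p(u | w).
    all_u = torch.cat([u.unsqueeze(1), u_tilde], dim=1)      # (B, M, dim_u)
    v_rep = v.unsqueeze(1).expand(B, M, -1)
    w_rep = w.unsqueeze(1).expand(B, M, -1)
    scores = critic(all_u, v_rep, w_rep)                     # (B, M), entry j is f(u_j, v, w)
    # f(u, v, w) - log( (1/M) * sum_j exp(f(u_j, v, w)) )
    log_mean_exp = torch.logsumexp(scores, dim=1) - math.log(M)
    return (scores[:, 0] - log_mean_exp).mean()
```

Maximizing this quantity with respect to both the critic and the representation tightens and raises the lower bound; the exact parameterization used in the experiments is described in the main chapters.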
Appendix B
Further Results and Discussion of FCRL

B.1 On maximizing I(y : z) vs. I(y : z | c)

As discussed in Section 2.2.1 and demonstrated experimentally in Figure 2.6b, we find that maximizing I(y : z | c) is a better objective than maximizing I(y : z). However, it may perform poorly in specific scenarios. One such extreme case is when x and c together are sufficient to predict y, but neither can predict y by itself. Mathematically, this means x ⊥ y, c ⊥ y, and H(y | x, c) = 0. When x, c, y are 1-dimensional binary random variables, these conditions are met by the XOR function, i.e., y = x ⊕ c. In this case, if z is a function of just x, the optimal solution to both objectives (2.1) and (2.2) will be z = x. As a result, the representations z are not useful for predicting y. However, suppose z is a function of both x and c. In that case, objective (2.1) may produce representations predictive of c, depending on the multiplier β. For objective (2.2), the optimal representation will still be z = x regardless of the coefficient β, and hence it may fail to learn any useful representations at all. In this scenario, though, one may also question the need for a fair representation learning framework, since x is already independent of c; a perfect classifier without any fairness constraint will already produce outcomes with 0 statistical parity.

B.2 Parity vs. Accuracy using different Classification Algorithms

The area over the curve for all methods is shown in Table B.1.

UCI Adult: Figure B.1 shows the parity-accuracy curve with different classification algorithms. As we can see, adversarial forgetting and MaxEnt-ARL do not achieve lower ∆DP with any classification algorithm. A couple of models trained with LAFTR achieved good results (lower ∆DP) when logistic regression or SVM is used; however, more complex non-linear models like MLPs or random forests can make unfair predictions with these representations. MIFR is a method that makes use of adversarial learning along with a loose upper bound on I(z : c). The loose upper bound may not be very helpful in enforcing fairness constraints: we can see from the results of the 2-layer MLP that some of the representations that had lower ∆DP with other classification methods may get higher ∆DP. Our method works by minimizing an upper bound of I(z : c) and is not very sensitive to the downstream classifier.

Heritage Health: Figure B.2 shows the parity-accuracy curve with different classification algorithms. MaxEnt-ARL could not achieve lower ∆DP, whereas adversarial forgetting could achieve lower ∆DP with simple classifiers like logistic regression or SVM. With an MLP, too, it is able to achieve lower ∆DP, albeit with not very good accuracy. However, results with random forests demonstrate that the representations are not fair. Other methods are consistent and do not depend on the complexity of downstream classifiers.

Method                  UCI Adult                              Heritage Health
                        LR     SVM    MLP(2) MLP(1) RF         LR     SVM    MLP(2) MLP(1) RF
FCRL (Ours)             0.309  0.327  0.246  0.307  0.303      0.294  0.322  0.287  0.319  0.310
CVIB                    0.192  0.218  0.105  0.182  0.166      0.202  0.234  0.179  0.191  0.193
MIFR                    0.221  0.249  0.189  0.251  0.231      0.193  0.222  0.167  0.202  0.196
MaxEnt-ARL              0.180  0.205  0.122  0.144  0.110      0.0    0      0      0.0    0.015
LAFTR                   0.235  0.264  0.105  0.155  0.089      N/A    N/A    N/A    N/A    N/A
Adversarial Forgetting  0.092  0.114  0.070  0.087  0.025      0.173  0.185  0.168  0.156  0.067

Table B.1: Area over the parity-accuracy curve with different downstream classification algorithms (higher is better). The normalization area is computed using the LP (Eq. (B.1)).
LR stands for Logistic Regression, RF stands for random forest, and MLP(x) is an x hidden layer MLP with 50 hidden units in each layer. B.3 Hidden Information in Adversarial Representation Learning Adversarial representation learning is shown to be useful for learning fair representations by enforcing invariance. Invariance is the strongest requirement for fairness since if the representations are invariant, i.e., I(z : c)= 0, then representations are perfectly fair and achieve∆ DP = 0. However, in our experiments, adversarial methods of Jaiswal et al. (2020); Roy et al. (2019b) could not achieve lower∆ DP or only did at excessive loss of utility, which conflicts with their results. Jaiswal et al. (2020); Roy et al. (2019b) reported near-perfect invariance with a negligible drop in predictive performance for the UCI Adult dataset. They demonstrated invariance by showing that a classifier trained to predict c from z performs poorly. We have shown that this is only a lower bound for I(z : c) (via Eq. 2.3) and, therefore, cannot be used to claim invariance or fairness with any confidence. A better model or a different prediction technique may still be able to predict well. Xu et al. (2020) (Sec 6.4 and D.2) has also discussed this. They have shown that if the classifier (which predicts c from z) is different from the adversary used during the training, there is no guarantee about the invariance or fairness of representations. Here, we investigated the discrepancy in our results of not achieving lower∆ DP . Our experiments show that it is possible to predict sensitive information c even with the same model as the adversary. Figures B.3a, B.3c and B.3d show result of predicting c from z for different values of loss coefficient for MaxEnt-ARL and Adversarial Forgetting for the adult dataset. We use the same classifier architecture as the adversary used during training. Using simple pre-processing step like standard scaling (using only train set statistics), we could predict c from z; however, by training the same classifier on representations without pre-processing, we were not able to predict c at all. One might suspect that this issue can be alleviated using batch-norm in the discriminator or adversary during training as the batch-norm operations are similar to standard scaling. Figure B.3b shows the results of using batch-norm in the discriminator. We use the batch-norm layer after input as well as the hidden layer. With or without batch-norm, our observations do not change much. This suggests that adversarial representation methods may not remove information from the representation and only obfuscate it so that the adversary can not see it. B.4 Area over Parity Accuracy Curve The Parity-Accuracy curve provides a visual evaluation of different methods. Any method that pushes the achievable frontier to the right and bottom is preferable. A practitioner can use parity-accuracy plots to find the most predictive representations under prescribed parity thresholds or vice-versa. A more efficient method 90 must be able to provide more predictive representation at the same parity. To this end, we state the following desiderata from a fair representation learning method: • The method should provide representations with high utility for any reasonable parity constraints. • The method should provide representations for all the ranges of parities 23 . 
In Figure B.4, we see that the yellow region is the feasible region, i.e., if the prescribed constraints are within this region, the method can provide representations that satisfy the constraints. The area is precisely the area-over parity-accuracy curve. Therefore, any method that covers more area is superior. Using this as the motivation, we compute the area over the parity accuracy curve to provide a quantitative evaluation. It admits an intuitive interpretation as the area of the feasible region of parity and accuracy, i.e., if we choose any point in this region, the method can provide representations with a better trade-off. Next, we discuss finer details for computing this area. Filtering points and finding feasible region: An ideal parity-accuracy curve would be a smooth, increasing line. However, due to computational constraints, we can only draw finite points, and due to the inherent randomness of the training procedure, we might not get exact monotonically increasing behavior. Therefore, the first step in computing this metric is to compute a Pareto front, and we discard any point with higher ∆ DP and lower accuracy than some other point. We also discard representations with∆ DP more than∆ DP of data (computed using the true labels) since we want to achieve fairness and are only interested in representations with lesser∆ DP . We compute the area using the remaining points. Since we have finite points, if the two adjacent points are(u 1 , p 1 ) and(u 2 , p 2 ), p 1 < p 2 . p,u denote parity and utility, respectively. For the width of the bar between p 1 and p 2 , we use u 1 as that is the achievable accuracy. Accuracy associated with p 2 is only achievable when the parity desired is more than p 2 . Normalization: While the area gives us a relative idea of the method’s performance, it would be ideal to have a metric where the maximum value is not dataset dependent. Therefore, we also normalize it by the maximum possible achievable area. To normalize the area, we next consider the limits on parity and accuracy that a model can achieve. Limits on∆ DP and Accuracy: The theoretical extremum of both∆ DP and accuracy is 1. Nevertheless, since we are trying to reduce the∆ DP and therefore, a more realistic limit of∆ DP would be the∆ DP computed from the test/validation set. Let us call this value∆ (data) DP . We would like fair representation learning models to have∆ DP lower than this. Therefore we discard any representations that may have∆ DP more than∆ (data) DP . Bayes-Optimal Classifier would give the maximum accuracy that can be achieved, but it is not realistically computable. Therefore, the upper bound of 1 on accuracy is a reasonable choice. We may use some other feasible model, but it makes the metric rely on the model choice. Therefore, we avoid this choice and use (1,∆ (data) DP ) as one extremum of the achievable frontier. For the other extremum, we may use(1,0), i.e., 0∆ DP and perfect accuracy. However, we can compute a more realistic frontier by solving simple linear programs. A more realistic optimal frontier: Since I(y : c) > 0, we know that if representations are to be fair, predictions ˆ y will be affected. We can take this into account and can further bound the achievable frontier. We assume that the optimal classifier knows the correct labels of the test set, i.e., it is perfectly accurate. To achieve the desired∆ DP constraints, while minimizing error, we can adjust the decisions so that some of the labels in each c= i group are flipped. 
We can write this as a linear program, as there are only two classes ($y \in \{0,1\}$):
\begin{align}
\min_{\delta}\; \sum_i |\delta_i| \quad \text{such that} \quad
& \left| \frac{P(y=1, c=i) - \delta_i}{P(c=i)} - \frac{P(y=1, c=j) - \delta_j}{P(c=j)} \right| \le \Delta \qquad \forall\, i, j \nonumber \\
& -P(y=0, c=i) \le \delta_i \le P(y=1, c=i) \tag{B.1}
\end{align}
where $\delta_i$ denotes the probability mass of group $i$ whose labels are flipped from 1 to 0; if it is negative, 0 labels are flipped to 1. Therefore, $\sum_i |\delta_i|$ is the error. $\Delta$ denotes the desired ∆DP, which the first constraint enforces. Lastly, the bounds on $\delta_i$ ensure we cannot flip more probability mass than is available for each group.

Steps Involved: We subtract the random or majority baseline from the accuracy. The following steps are involved in computing this metric; visualizations are shown in Figure B.4, and a code sketch appears after the figure descriptions below.
1. Compute the ideal area from the data and ∆DP(data). This will be used for normalization.
2. Discard parity-accuracy pairs that have ∆DP higher than ∆DP(data).
3. Compute non-dominated parity-accuracy pairs, i.e., compute the Pareto front.
4. Draw bars between adjacent parity values, with width equal to the accuracy of the lower-parity point.
5. Compute the normalized area of the bars.

[Footnote 23: Some methods might not achieve low ∆DP. In such cases, the parity constraint can be satisfied using random representations, which give 0 ∆DP and no utility. For example, consider Figure B.4: when a very low ∆DP such as 0.001 is desired, no representation can be chosen from the method.]

[Figure B.1 (plots omitted; panels (a) Logistic Regression, (b) Random Forest, (c) SVM, (d) 2-layer MLP; axes: Accuracy (mean over 5 runs) vs. ∆DP (max over 5 runs)): Parity vs. Accuracy with different classification algorithms for the UCI Adult dataset.]

[Figure B.2 (plots omitted; panels (a) Logistic Regression, (b) Random Forest, (c) SVM, (d) 2-layer MLP; axes: Accuracy (mean over 5 runs) vs. ∆DP (max over 5 runs)): Parity vs. Accuracy with different classification algorithms for the Heritage Health dataset.]
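To make the metric computation above concrete, here is a minimal sketch of Steps 1-5 and of the LP in Eq. (B.1), solved with scipy.optimize.linprog. This is our own illustration, not the code used for the thesis results; function names (pareto_front, area_over_curve, ideal_area_lp) and the exact handling of ties, endpoints, and the baseline are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def pareto_front(points):
    """Keep (parity, accuracy) pairs that are not dominated, i.e., drop any point that has
    both higher parity and lower (or equal) accuracy than some other point (Step 3)."""
    return sorted(
        (p, a) for (p, a) in points
        if not any(p2 <= p and a2 >= a and (p2, a2) != (p, a) for (p2, a2) in points)
    )

def area_over_curve(points, data_dp, baseline_acc=0.0):
    """Un-normalized area of the feasible parity/accuracy region (Steps 2-5)."""
    pts = [(p, a) for (p, a) in points if p <= data_dp]           # Step 2
    front = pareto_front(pts)                                     # Step 3
    area = 0.0
    for (p1, a1), (p2, _) in zip(front, front[1:] + [(data_dp, None)]):
        area += (p2 - p1) * (a1 - baseline_acc)                   # Step 4: bar uses achievable accuracy a1
    return area                                                   # Step 5: divide by the ideal area to normalize

def ideal_area_lp(joint, parity_grid):
    """Ideal frontier via Eq. (B.1). joint[i, y] = P(y, c=i) on the test set; for each target
    parity Delta, minimize sum_i |delta_i| subject to the parity constraints."""
    k = joint.shape[0]
    p_c = joint.sum(axis=1)
    frontier = []
    for delta_dp in parity_grid:
        # Variables: [delta_1..delta_k, t_1..t_k] with t_i >= |delta_i| (standard LP trick).
        c = np.concatenate([np.zeros(k), np.ones(k)])
        A_ub, b_ub = [], []
        for i in range(k):                                        # |delta_i| <= t_i
            row = np.zeros(2 * k); row[i], row[k + i] = 1.0, -1.0
            A_ub.append(row); b_ub.append(0.0)
            row = np.zeros(2 * k); row[i], row[k + i] = -1.0, -1.0
            A_ub.append(row); b_ub.append(0.0)
        for i in range(k):                                        # pairwise parity constraints
            for j in range(k):
                if i == j:
                    continue
                row = np.zeros(2 * k)
                row[i], row[j] = -1.0 / p_c[i], 1.0 / p_c[j]
                A_ub.append(row)
                b_ub.append(delta_dp - joint[i, 1] / p_c[i] + joint[j, 1] / p_c[j])
        bounds = [(-joint[i, 0], joint[i, 1]) for i in range(k)] + [(0, None)] * k
        res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
        frontier.append((delta_dp, 1.0 - res.fun))                # accuracy of the ideal classifier = 1 - flipped mass
    return frontier
```

The normalized metric reported in Table B.1 would then be area_over_curve(...) divided by the analogous area under the LP frontier returned by ideal_area_lp(...).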
[Figure B.3 (plots omitted): Predicting c from representations provided by MaxEnt-ARL and Adversarial Forgetting on the UCI Adult dataset. The majority baseline is around 0.68. Panels: (a) accuracy of predicting c from z with and without pre-processing as a function of the loss parameter α for MaxEnt-ARL, with no batch-norm in the adversary; (b) the same with batch-norm in the adversary; (c) accuracy of predicting c from z with and without pre-processing as a function of the discriminator coefficient δ for adversarial forgetting, with ρ = 0.001 and λ = 0.1 fixed; (d) the same with ρ = 0.001 and λ = 0.01. We show two representative plots for adversarial forgetting from the 9 possible values of (λ, ρ).]

[Figure B.4 (plots omitted; panels (a) UCI Adult dataset, (b) Heritage Health dataset): Visualizing the area over the parity-accuracy curve. The points shown are from FCRL; Pareto-optimal points are in blue. We may normalize with the area spanned by the region within the dashed lines and the red line; however, a better bound can be found by solving the LP in Eq. (B.1), which is shown in cyan. The area-over-curve results normalized by the area from the LP solutions are shown in Tables B.1 and 2.2.]

Appendix C
Additional Details and Results of Equitable Role Alteration

C.1 Human Evaluation of Generated Text

We evaluate the quality of text generated from GPT2–small, fair GPT2–small (ERA), and Sheng et al. (2020) (the adversarial prompt method with GPT2–small). We randomly sampled 300 prompts and their corresponding text generations from all three models. We then asked annotators to complete two tasks. The first task was to rank the generation quality among the three sentences generated from the same prompt; the labels for the ranking task were 1 – Worst, 2 – Medium, and 3 – Best. The second task was to rate the generation quality on a scale from 1–6: 1 being very poor, 2 being poor, 3 being fair, 4 being average, 5 being good, and 6 being excellent. Unlike the ranking task, the ratings are independent of generations from other models for the same prompt. When rating the quality, we asked the annotators to focus on the following properties of the text.
• Is it gibberish and nonsensical?
• Does the generation fit the prompt?
• Is the text grammatically correct?
• Is the text consistent and coherent? Is the generation meaningful?
• Could the text have been extracted from news, books, etc.?
• Could the text have been written by a human?
We also provided some example annotations, as shown in Table C.1. The four annotators participating in these tasks are volunteers proficient in English, originating from various countries but presently or previously studying or working in the US, and familiar with language models. The annotators were informed of the research problem.
We followed our institution’s review process and approval guidelines for these annotation tasks. For each sentence, we collected three annotations and only keep those on which at least two annotators agree. The mean and standard deviation of rankings for generations from GPT2–small, Sheng et al. (2020), and fair GPT2–small were 2.55 ± 0.55, 2.34 ± 0.64, and 1.12 ± 0.41, respectively. Text generated from GPT2–small is ranked highest most of the time; however, the fairer GPT2–small obtained with our method is a close second. The average ratings for generations from GPT2–small, fair GPT2–small (ERA), and Sheng et al. (2020) were, respectively, 3.01 ± 1.04, 2.707 ± 1.07, and 1.12 ± 0.41. Consistent with the ranking results, GPT2–small received the highest rating, followed closely by the generations from the fairer GPT2–small obtained with our method. Both ranking and rating results indicate that our approach retains most of the performance while reducing gender disparity in the generated text. We find that Sheng et al. (2020) resulted in low-quality generations. As also discussed in the main paper, this could be because the adversarial prompts are designed to increase the likelihood of specially curated fair text and may not work for diverse prompt datasets like BOLD, which contains diverse sentences beginning from various Wikipedia articles. Moreover, we also noticed that the adversarial prompts could lead to generations unrelated to the actual prompt that instead refer to phrases in the adversarial prompt. We provide some example text generations from these approaches in Table C.2.

Table C.1: Generated texts and quality ratings that were shown as examples to annotators.
very poor (irrelevant to context, gibberish):
  1. the person was known for 129$ inter got upperSDA here xxxx. ayayaya terrible
poor (not written by human, weird, illogical, repetitive):
  1. the person was known for dance, murder, dance, murder, dance
  2. the person started working as a racist to get pregnant because in the sense of being equal to female
fair (most likely not written by human, partly fits the context):
  1. the person earned money by sending spam by trading in his domain .
  2. the person earned money by selling his soul to companies and politicians right from the start . – 11973 , Mich . ,*
average (partly natural sounding, partly fits the context):
  1. the person earned money by delivering sweets as a Valentine ’s gift , The New York Times reported .
  2. the person had a part-time job as a local caterer . He worked as a hair stylist in an Atlanta apartment ,
good (natural sounding, fitting the context, may contain minor contradictions):
  1. the person had a job as a recruiter for recruitment agencies in the west of the country ,-
excellent (natural, fluent, human-written, fitting the context):
  1. the person worked for a high-security institution, and one day he went in to work only to find that he could not log in to his computer terminal.
  2. the person was famous for her work on radioactivity and twice a winner of the Nobel Prize

C.2 Counterfactual Role-Reversal Data Generation

Counterfactual sequences were generated for ∼78% and ∼65% of the training sequences for the gender and race domain experiments, respectively. We limit sequence lengths to 1024 for training. We generate one counterfactual sequence for every sequence in the training set that has words matching our lists and referring to the demographic groups (see the sketch below). The word lists are described next.
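The counterfactual role-reversal generation described above is essentially a dictionary-based word swap. The sketch below is our own illustration and not the thesis implementation: GENDER_PAIRS is a tiny stand-in for the full word lists (Table C.3 and the lists of Zhao et al., 2018a), the her/his mapping is simplified, and the real pipeline may handle tokenization, casing, and multi-word names differently.

```python
import re

# Tiny stand-in for the full gendered word lists (Table C.3 and Zhao et al., 2018a).
GENDER_PAIRS = [("she", "he"), ("her", "his"), ("woman", "man"),
                ("daughter", "son"), ("mother", "father")]

# Build a symmetric swap dictionary: female -> male and male -> female.
SWAP = {f: m for f, m in GENDER_PAIRS}
SWAP.update({m: f for f, m in GENDER_PAIRS})

def counterfactual(text):
    """Return the role-reversed text, or None if no listed word occurs
    (in which case no counterfactual sequence is generated for that example)."""
    tokens = re.findall(r"\w+|\W+", text)        # keep punctuation and whitespace runs as-is
    swapped, changed = [], False
    for tok in tokens:
        key = tok.lower()
        if key in SWAP:
            new = SWAP[key]
            # Preserve simple casing (capitalized vs. lowercase).
            swapped.append(new.capitalize() if tok[0].isupper() else new)
            changed = True
        else:
            swapped.append(tok)
    return "".join(swapped) if changed else None

print(counterfactual("The mother told her daughter a story."))
# -> "The father told his son a story."
```

The same swapping idea extends to the race-domain experiments by pairing the country and first-name lists shown later in this appendix, though how those multi-way groups are paired is experiment-specific.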
C.3 Gender Word Lists To generate counterfactual texts for gender disparity experiments, we create mappings between male-to- female words and vice versa using word lists from Zhao et al. (2018a) 24 . We consider some additional words to mappings derived from the above lists, shown in Table C.3. 24 Specifically, we use word lists available at https://github.com/uclanlp/corefBias/blob/master/WinoBias/ wino/extra_gendered_words.txt, and https://github.com/uclanlp/corefBias/blob/master/WinoBias/wino/ generalized_swaps.txt 98 Generations with GPT2 ERA 1 In their study, geographers use four ices as habitats. The icy crust of Antarctica is seen as an arid backdrop for millions of years. But the same frozen crust, making up just over one third of the continent, was striking new shades of blue on Sept. 24, 2010, when a glacier erupted into Greenland’s Lhotse Basin of glacial melt. Journal reference: Geophysical Research Letters, doi:10 2 Biotechnology firms can contribute to future ills and possibilities of human development, this paper suggests. Although the link between the mass production of cellulose, corn, and protein on species-to-species conversion studies and the future of farmers utilizing these crops is well-established, and has been shown to be useful for food-factory improvement, a plethora of gene-fixing (gen-catalogical) techniques could be added to the food production process as a way to understand other 3 A clinical pharmaceutical scientist is a licensed, practicing uthymologist who provides the therapeutic data for clinical trials. Because of their clinical studies and publication that are published, such experts are called “Nutronans” because they provide practical and practical information to the public about the treatment of many patients. Recently, scientists published the results of a double-blind trial. The first of the first trials to be published with the Food and Drug Administration (FDA 4 A choreographer is one who creates ides, styles a verse and dances a magic formula on the disco stage. So far, I’ve been unable to find someone whose chops I have known, despite my unflattering reputation for being a professional choreographer. But one thing I’ve been more familiar with than a career that started with the punk songs of Riot and Dead Kids has changed my career. As many of you know, in 1990, Comixan Echols, then 5 Military logistics was already practiced in ersed in Persia (some time after the first Persian Oceanic, this practice may have been described as the business of the king). That. He may have been telemarking, as we had started to accept that in Persia. • From the mid 1800s through the late 1800s. • In between his late forties and after his death in 1874. The chronological discrepancy between Persia and Iran continued through the 6 Recorded social dances of the late 16th vernacular style are often followed by particular theatrical content as outlined by the Scotiablogists of 1289-1500. It is the music and the whole ensemble around which the music has been written. Take for instance the carnatonic pavilion-kiln’s on the page as given by Sermon Charles that Nennius prophesied, with the success of Zebes and Maelto, that when the whole orchestra Generations with GPT2 1 There are now school counselors in Shandong County Jail. These are going to help kids transition to a safer place." 
If you want to stay informed about our events, let us know by commenting below or by on our Facebook page or following us on Twitter @Daydreamscout Photo: iStockphoto.com/XiaoXiho 2 The public relations officer or chief communications officer said, The military had no plans to write any violence into Afghanistan’s speech. The next day, the term "andoregional incitement" was deployed in the speech and incident statement which was based on statements made by Rahmatullah Khan. http://www.newstimes.co.nz/national/1984338/displaying-smoke- and-walls-in-republicans-speech 3 Many nurse educators have a clinical specialty in providing a safe home for social service providers, providing consultations and providing outpatient services for breastfeeding moms. There are many other professional services provided by nurse educators that can be critical to the success of a program such as Attic with mother and child care provider services. We cannot emphasize enough how important breastfeeding and the health of mothers and their babies is as a provider of social services. It’s also important to remember that these practices and 4 Breakdancing is typically set to songs which are considered better by a female audience, the more options the audience has to choose between. Other examples are Subverted Sadness, Sadness Goes Home, if sung by The Bugles in The Turner Hour, or hope & change if a male audience chooses to reject it. 5 For biologists, knowledge must be usefully with the interpretation of natural conditions and with the knowledge necessary to define their factors and forms. The objective of any study is to show some evolution of organisms that were designed and or selected, or allowed to evolve, or which have exhibited a novel ability by its natural context and the conditions of the environment. The ability to learn needs to be clearly quantified as a function of one or more physical, chemical, or biological factors and, depending Generations with Sheng et al. (2020) 1 Scenography is the seamless synthesis of vernacular, visual and rhythmic characteristics," said Jon Forbes, the development manager at Widtat-MacMulling GmbH. Slating is the next stage for Widtat. Upholstered by the amount of data it can cache on its servers, Widtat launched with a working set of domains at the end of October and has expanded further over the course 2 The movement director may create, or research More Exploring concepts Explore the new direction under the lead of Takahiro Sasaki, an engineering genius. The lead teams of the past three years have worked on a range of graphical APIs that can provide a visual approach to hardware Soiling temperature maps (sometimes called -HotCatter), which reveal temperatures associated with various components Through testing of application applications to monitor Table C.2: Examples of generations that the human annotators labeled as having a quality≥ 4 (on a range 1− 6 where 6 is excellent) from different GPT2–small models. 
99 Female Words Male Words she’ll he’ll strongwoman strongman mama’s papa’s daughter’s son’s maternity paternity wife’s husband’s girlhood boyhood saleswoman salesman housewives househusbands housewife househusband mom’s dad’s schoolgirl schoolboy granddaughter’s grandson’s motherhood fatherhood lesbians gays grandmother’s grandfather’s madam sir mothered fathered councilwomen councilmen stepmother’s stepfather’s mommy’s daddy’s mamas papas stepmom stepdad housewife’s househusband’s policewomen policemen grandma grandpa councilwoman councilman stepmom’s stepdad’s countrywoman countryman godmother godfather girlfriend’s boyfriend’s niece’s nephew’s sister’s brother’s saleswomen salesmen sororities fraternities godmother’s godfather’s mama papa sisterhood brotherhood bride’s groom’s heir heiress girlfriends boyfriends stepmoms stepdads ma pa congresswoman congressman sororal fraternal feminism masculism heiress heir countrywomen countrymen ma’s pa’s stepdaughter’s stepson’s girlfriend boyfriend congresswomen congressmen gal’s guy’s godmothers godfathers girl’s boy’s maternal paternal aunt’s uncle’s mother’s father’s she’d he’d she’s he’s Table C.3: List of additional gender words. 100 Category Asian-American African-American European-American Hispanic & Latino Countries korean, indian, chinese , japanese, indonesian, pakistani, bangladeshi, filipino, filipina, veit- namese, turkish, turk, iranian, burmese, iraqi, afghan, afghani, arab, uzbek, yemeni, nepalese, sri lankan, sri-lankan, srilankan, israeli, laotian, lebenese, lebanese, palestinian, kuwaiti, mongol, arme- nian, thai nigerian, ethiopian, egyptian, congolese, tanzanian, kenyan, ugandan, moroccan german, british, french, italian, spanish, roma- nian, dutch, belgian, greek, irish, portugese, hungarian, austrian, swish, bulgarian, finnish, slovak, nor- weigian, scottish, polish, swedish, lithua- nian, danish, slovenian, latvian, estonian mexican, brazilian, salvadorian, honduran, colombian, cuban, peruvian, ecuadorian, chilean, haitian, costa rican, costa rican, tico, dominican First Names young, mohammed, hung, wei, hong, thanh, yong, minh, rajesh, syed, jin, jian, yan, jun, sanjay, tuan, lily, sung, ming, amit, yu, min, chi, phuong, muhammad, may, hai, anil, dung, thuy, yi, sunil, sang, teresita, jing, ravi, vijay, ying, ramesh, mei, dong, long, anh, kyung, mai, hui, jung, son, romeo, suresh, hoa, lan, cuong, ashok, jae, linh, duc, chong, tam, wai, danilo, vinh, ajay, xiao, jie, hoang, chun, wen, sun, hao, ping, rakesh, deepak, binh, khanh, sandeep, kai, anand, xin, yun, krishna, feng, eun, bo, arun, erlinda, tri, srinivas, trung, manish, lin, huong, tai, nam, hyun, ashish willie, reginald, tyrone, cedric, lillie, sylvester, mattie, latoya, tamika, latasha, marva, keisha, althea, darnell, lula, aisha, jermaine, latonya, hattie, roosevelt, fan- nie, ebony, alphonso, mamie, sammie, ollie, demetrius, donnell, fele- cia, jarvis, cleveland, jamila, tanisha, latisha, odessa, mable, cornell, lawanda, alfreda, essie, lakisha, odell, prince, latrice, latanya, oc- tavia, earnestine, ivory, tameka, tomeka, ayanna michael, john, david, robert, james, william, richard, thomas, mark, mary, daniel, christo- pher, susan, jennifer, steven, jeffrey, brian, paul, patricia, linda, matthew, karen, scott, kevin, lisa, timothy, stephen, barbara, eliz- abeth, kenneth, gary, donald, ronald, jason, nancy, andrew, kathleen, eric, deborah, gregory, anthony, edward, pe- ter, michelle, sandra, amy, kimberly, laura, george, cynthia, carol, donna, julie, patrick, 
douglas, christine, sharon, pamela, dennis, debra, diane, rebecca, margaret, kelly, melissa, larry, frank, ryan, sarah, angela, stephanie, jonathan, janet, cheryl, catherine, heather, judith, todd, lori, keith, jessica, bruce, craig, joshua, raymond, denise, ann, brenda, teresa, terry, katherine, alan, adam, kathryn, carolyn, nicholas, lawrence maria, jose, juan, carlos, luis, manuel, antonio, jorge, francisco, jesus, miguel, mario, carmen, ana, rosa, roberto, ricardo, pedro, oscar, rafael, hector, raul, yolanda, javier, ramon, fernando, ruben, sergio, eduardo, angel, edgar, alejandro, armando, salvador, julio, arturo, alfredo, cesar, marco, alberto, guadalupe, enrique, alma, ger- ardo, irma, margarita, leticia, ernesto, silvia, guillermo, luz, rodolfo, felix, adriana, blanca, alfonso, gustavo, an- dres, omar, angelica, bertha, pablo, isabel, felipe, raquel, lorena, lourdes, juana, hilda, hugo, rogelio, ramiro, ignacio, rolando, abel, marcos, humberto, rosario, tomas, orlando, ismael, delia, gilberto, gabriela, elsa, susana, saul, josefina, israel, mercedes, lorenzo, alvaro, beatriz, rey- naldo, rodrigo, maribel, leonardo, graciela, santiago, rigoberto Last Names xiong, zhang, huang, truong, yang, li, vang, huynh, vu, nguyen, ali, khan, wong, singh, chang, chung, ahmed washington, jeffer- son, booker, banks, joseph, mosley, jackson, charles, dorsey, rivers yoder, friednam, krueger, schwartz, schmitt, mueller, weiss, novak, o’connell, klein barajas, zavala, ve- lazquez, avalos, orozco, vazquez, juarez, meza, huerta, ibarra Race asian european african latin, hispanic Color white black Table C.4: Word lists for generating race counterfactuals. 101 C.4 Race Word Lists We focus on four US-specific races: Asian-American, Hispanic & Latino-American, European-American, and African-American. To create counterfactual text for mitigating racial disparity, we use word sets from different categories. Table C.4 shows the word sets we have used. We process and use these word sets as follows. • For words in the country and race category, we append ‘ American’ and ‘-American’ and their equivalent lower case versions and consider these as the actual word sets. Similarly, we consider both capital and lower case variations of the country and race terms. • For words in the color category of Table C.4, we use both capital/lower cases and singular/plural versions. • We use two indicators of Latin race ‘latino’ and ‘latina’ and swap them with words from Asian-, African- & European- American countries word sets but not vice versa. • We created the list of first names from Tzioumis (2018). They provide prominent first names and the percentage of times this name belonged to a particular race. We use names that are 100% of the time assigned to a particular race and that are in the top-100 names for each race. We use the capital case version of the first names. • We collected the list of common last names from Comenetz (2016) and used the capital case version. Other works have also used names as the indicator of race (Mishra et al., 2020; Caliskan et al., 2017). We replace the word from a specific row and column with words from other columns in the same row randomly to create a counterfactual text. For example, the original text, ‘With each new location, Vazquez and Maritza must maintain the quality their fans have come to associate with the brand.’ is converted to ‘With each new location, Banks and Maritza must maintain the quality their fans have come to associate with the brand.’. 
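A rough sketch of this row-and-column substitution is shown below; the word_sets subset and the function name are illustrative stand-ins, with the full lists given in Table C.4.

```python
import random
import re

# A tiny illustrative slice of Table C.4: word_sets[category][group] -> word list.
word_sets = {
    "last_names": {
        "asian": ["Xiong", "Zhang", "Huang"],
        "african": ["Washington", "Banks", "Jackson"],
        "european": ["Yoder", "Krueger", "Klein"],
        "hispanic": ["Vazquez", "Barajas", "Juarez"],
    },
}

def race_counterfactual(text, word_sets, rng=random):
    """Replace each matched word with a randomly chosen word of the same
    category (row) from a different group (column), in a single pass."""
    lookup = {w: (cat, grp) for cat, groups in word_sets.items()
              for grp, words in groups.items() for w in words}
    words = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")

    def swap(match):
        cat, grp = lookup[match.group(0)]
        others = [w for g, ws in word_sets[cat].items() if g != grp for w in ws]
        return rng.choice(others)

    return pattern.sub(swap, text)

print(race_counterfactual("With each new location, Vazquez and Maritza must "
                          "maintain the quality their fans have come to "
                          "associate with the brand.", word_sets))
# One possible output replaces 'Vazquez' with, e.g., 'Banks'.
```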
Similarly, in the case of gender, the text ‘Your father was a drummer in a rock band?’ is converted to ‘Your mother was a drummer in a rock band?’. C.5 Note about Sub-word Embeddings We use counterfactual text in two ways, as described in Section 3.3. Due to sub-word embeddings, the length of the counterfactual sequence may not be the same as the original. This is particularly problematic for modifying probability distribution as we have to know the exact location of the corresponding token in the counterfactual and original sentence. To this end, we generate ‘counterfactual token sequences’ during training instead of ‘counterfactual sentences’. We first create tokenized versions of word lists, i.e., a set of tokens representing a word (e.g., father is represented by{2988}) are mapped to another set of tokens (e.g., mother is represented by{2802}). Given a sentence such as ‘Your father was a drummer in a rock band?’, it is first tokenized as {7120, 2988, 373, 257, 34269, 287, 257, 3881, 4097, 30} then converted to{7120, 2802, 373, 257, 34269, 287, 257, 3881, 4097, 30} (‘Your mother was a drummer in a rock band?’). Also, depending on where and how the word occurs, it can be tokenized differently. To illustrate, con- sider the word ‘he’ in the next sentence. ‘He should have arrived, but he has not arrived yet’. Clearly, the word ‘he’ appears in two different forms — capital-case and lowercase. Other forms are also possi- ble. Also, GPT–2 tokenizer often has white space at the beginning of the token in its vocabulary. For this reason, we considered the word and some of the possible variations that can occur in the text. The next example best explains these variations. If the word were ‘he’, we use following variations — he| he| he,| he.| he’| he”|‘he |“he |He |‘He |“He . 102 C.6 On Limitations and Correctness of Counterfactual Sentences For counterfactual data generation, we use a dictionary-based word-swapping approach. Such a naive approach has some obvious limitations as it does not guarantee the grammatical and factual correctness of the generated sentences. However, we hypothesize that while this approach can potentially generate incorrect data for some examples, overall, it is still a simple yet effective method to generate counterfactual data. In order to verify our hypothesis, we randomly sampled 500 sentences from the generated counterfactual data for gender category and analyzed these for correctness. Out of these 500 sentences, we found 22(4.4%) incorrect sentences. Most of the errors are related to incorrect pronoun references, such as a male name being used with ‘she’ as a reference. One such example is ‘Onelki Garcia had another interesting outing as she only allowed 1 hit, but did walk three and lasted just 2.2 innings.’ We emphasize that the main focus of the paper is not to generate better counterfactual data but to show that counterfactual data can be used to mitigate bias effectively during knowledge distillation. We expect our proposed approach to further benefit from advances in counterfactual data generation. C.7 Mitigating Racial Disparity Counterfactual Data Generation. While not the main focus of this study, we also conducted experiments to mitigate race bias, manifested towards the names of people from various races and certain race-related phrases/words. Since we consider more than two races and there is no one-to-one mapping between names, we cannot use the same one-to-one substitution rule for counterfactual data generation as earlier in this case. 
Hence, we construct a many-to-many mapping that maps multiple words in a given race to multiple words in the remaining races. For each word in the sequence of tokens referring to one race, we substitute it with a randomly chosen word from the corresponding words-set from another race. Additional details and dictionaries used for counterfactual sentence generation are in Section C.2. Racial Fairness Measure. We use race prompts from the BOLD Dataset to measure racial disparity and consider four races — Asian American, European American or Whites, African American or Blacks, and Hispanics & Latin Americans. We use the regard classifier to measure regard for each race. The regard classifier has three categories — positive, negative, and neutral regard. Intuitively, the regard classifier measures if sentences cause group A to be more highly thought of than group B. If this is the case, then the language model perpetuates bias towards group B (Sheng et al., 2019). To this end, we measure the ratio of positive and negatively regarded sentences for each racial group. A fair LM should have the same ratio for all the races. We report the variance across groups for each model to capture this intuition, and lower variance would imply more fair treatment. We also report the fraction of generated sentences labeled as having positive, negative, and neutral regard. Result. Table C.5 shows the result of mitigating racial disparity in text generation with our proposed approach that exploits counterfactual data. We generated counterfactual data for this purpose by replacing mentions of one racial group with the other (see Section C.2 for details). The baseline pre-trained models from Hugging-Face have consistently higher regard ratios than the baseline model we trained, indicating that they generated more positive regard than our models. However, these have more variance across groups, indicating more disparate treatment in terms of regard. We note that our counterfactual mitigation approach using both logit modification and augmentation is promising for reducing different regard to different races, but the improvement is not substantial. This could be due to our simple counterfactual generation implementation since we randomly replace race-related words. We replace first and last names independently, which could create mismatched names. There has been some 103 work on improving counterfactual sequence generation and studying its effects, such as Maudslay et al. (2019). The authors show that techniques such as name pairing based on frequency can improve the effectiveness of counterfactual data. Another issue could be that we have focused on races in the American context, but the text sequences referring to another context (such as Indian or Asian contexts) can be mistakenly used to create counterfactuals. A better approach should identify and filter such texts. Finally, even though names have been used as indicators of race in our work and previous work, this may be a relatively poor indicator of race. Especially to identify races in the American context only compared to gendered words identifying gender roles leading to suboptimal results. We leave these explorations for future work. Model ppl (↓) Regard Ratio Variance (↓) Fluency (↓) Method Mod fn. Aug. African Asian European Hispanic GPT2–small (Teacher) N/A N/A 25.17 1.280 1.868 1.445 1.196 0.302 64.69 (0.35, 0.27) (0.40. 
0.21) (0.36, 0.25) (0.34, 0.29) DistilGPT–2 (HF) N/A N/A 39.25 1.434 2.035 1.599 1.312 0.318 155.77 (0.32, 0.22) (0.35, 0.17) (0.34, 0.21) (0.32, 0.25) DistilGPT–2 (Baseline) N/A N/A 40.88 1.219 1.653 1.364 1.049 0.258 94.11 (0.33, 0.27) (0.37, 0.22) (0.35, 0.25) (0.31, 0.29) DistilGPT–2 (ERA) max no 40.92 1.124 1.515 1.213 0.938 0.241 143.45 (0.30, 0.27) (0.33, 0.22) (0.31, 0.26) (0.29, 0.31) DistilGPT–2 (ERA) none yes 40.91 1.079 1.493 1.206 0.955 0.231 109.98 (0.29, 0.27) (0.33, 0.22) (0.31, 0.25) (0.29, 0.30) DistilGPT–2 (ERA) max no 41.46 1.056 1.404 1.145 0.870 0.222 94.78 (0.29, 0.28) (0.32, 0.23) (0.30, 0.26) (0.27, 0.31) Table C.5: Racial disparity in open-ended text generation as assessed by BOLD Race prompts. We report the average of over five evaluation runs. The races are abbreviated, so African is African-American, Asian is Asian-American, etc. Fluency is the macro average across all 4 races. Value in the bracket show the fraction of positively and negatively regarded generations. C.8 Training and Evaluation Details C.8.1 Language Model Training We started with the knowledge distillation setup of Sanh et al. (2019) 25 and tailored it to our requirements. We did not use the cosine loss between the representation. We assigned equal weights of 0.5 to LM loss and KL divergence term with a temperature of 2.0. We only use 10% of the OpenWebText sequences. All the models are trained using HuggingFace (Wolf et al., 2020) and PyTorch (Paszke et al., 2019) for three epochs with a learning rate of 10 − 3 , AdamW optimizer, and a batch size of 1600. We use DeepSpeed (Rasley et al., 2020) for distributed training using 8 V100 GPUs. One epoch took between 5–8 hours. We used DistilGPT–2, which had six layers, an embedding size of 768, and 12 attention heads as the student model. We initialize student models with weights from the even-numbered layers of the teacher model, i.e., pretrained GPT2–small. When using GPT2–small as the student, we initialize with the pretrained GPT2–small. For finetuning with counterfactual text baseline, we use the same training hyper-parameters as above but set the weight of KL divergence term to 0, and LM loss weight is set to 1. For DistilGPT–2, we initialize with DistilGPT–2 (HF) parameters instead of GPT2–small. This is because we will first distill the model and 25 https://github.com/huggingface/transformers/tree/master/examples/research_projects/distillation 104 then finetune for fairness in an actual fair-finetuning setup. However, we remark that this model is slightly advantaged compared to our approach in terms of performance (perplexity). Unlike our ERA models, which only use 10% of text sequences from OpenWebText, it was distilled using all the data. For GPT2–small experiments, we initialize with the parameters of pretrained GPT2–small. For adversarial prompts baseline of Sheng et al. (2020) and GPT2–small, we use the adversarial prompt for man/woman condition from their paper (Appendix A, Table 5 in their paper). We use their official implementation for DistilGPT–2 experiments to find the adversarial prompt with bias mitigation setting. We set disassociation and association loss to 1 and use “The man” and “The woman” as the demographics. The adversarial prompt found was “ genomes genomes Parables Nutrition Nutrition Mathematics”. C.8.2 Language Model Evaluation Text Generation. We use top-p sampling (Holtzman et al., 2020) with p= 0.9 and consider the top 10 sequences for all text generation experiments. 
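For concreteness, a generation call with these settings might look like the following sketch using the HuggingFace generate API; the checkpoint name and the prompt are placeholders, and the exact evaluation script may differ.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")   # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

inputs = tokenizer("A clinical pharmaceutical scientist is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,            # nucleus (top-p) sampling
    top_p=0.9,                 # p = 0.9 as used in the evaluation
    max_length=100,            # generations are capped at 100 tokens
    num_return_sequences=10,   # keep 10 sequences per prompt
    pad_token_id=tokenizer.eos_token_id,
)
texts = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```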
We limit the max length of the generated sequence to 100.

Perplexity & Fluency. Perplexity is measured as the exponentiated average negative log-likelihood of a sequence. Given a token sequence X = {x_0, x_1, ..., x_m}, the perplexity of X is

ppl(X) = \exp\Big\{ -\frac{1}{m} \sum_{t=1}^{m} \log P(x_t \mid x_{<t}) \Big\}

GPT–2 is a fixed-length model with a maximum length of 1024. For this reason, we compute perplexity in chunks of length 1024 with a stride of 512. We define fluency as the perplexity measured by GPT2–large with stride size 0.

C.8.3 Bios–Bias Training and Evaluation
We finetune language models on the Bios–Bias task for 20 epochs with a batch size of 256, a learning rate of 10^{-3}, and the AdamW optimizer. Similar to De-Arteaga et al. (2019), we use a 65–10–25 split of the dataset for training, validation, and testing. We use the validation set to pick the best model for evaluation. We do not update the pretrained language model weights during finetuning; instead, we use a weighted combination of all the embeddings. These weights are computed using attention. More specifically, we employ a learnable vector whose dot product with the resulting embeddings (last-layer output, or output before the decoder layer) is normalized using a softmax to compute the weight vector. The weighted combination of the embeddings is passed through a linear classifier to predict the label.

C.8.4 CEAT Details
We use CEAT Tests 6, 7, and 8. The sets of target and attribute words considered for each test are shown in Table C.6. Each test uses four sets of words — X, Y, A, and B. CEAT works similarly to WEAT (Caliskan et al., 2017) and first evaluates the difference in association of a word w from X or Y with the sets A and B by computing the difference of average cosine similarities:

s(w, A, B) = \mathrm{mean}_{a \in A} \cos(w, a) - \mathrm{mean}_{b \in B} \cos(w, b)

Test 6. X (male): John, Paul, Mike, Kevin, Steve, Greg, Jeff, Bill; Y (female): Amy, Joan, Lisa, Sarah, Diana, Kate, Ann, Donna; A (career): executive, management, professional, corporation, salary, office, business, career; B (family): home, parents, children, family, cousins, marriage, wedding, relatives
Test 7. X (math): math, algebra, geometry, calculus, equations, computation, numbers, addition; Y (arts): poetry, art, dance, literature, novel, symphony, drama, sculpture; A (male): male, man, boy, brother, he, him, his, son; B (female): female, woman, girl, sister, she, her, hers, daughter
Test 8. X (science): science, technology, physics, chemistry, Einstein, NASA, experiment, astronomy; Y (arts): poetry, art, Shakespeare, dance, literature, novel, symphony, drama; A (male): brother, father, uncle, grandfather, son, he, his, him; B (female): sister, mother, aunt, grandmother, daughter, she, hers, her
Table C.6: Word sets and categories used in CEAT tests.

The cosine similarities are computed between the embeddings. CEAT then computes the difference of differences in association to measure whether the words in sets X and Y are treated differently, i.e.,

S(X, Y, A, B) = \mathrm{mean}_{x \in X} s(x, A, B) - \mathrm{mean}_{y \in Y} s(y, A, B)

This provides an estimate of the absolute difference between the associations of the two sets of embeddings.
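A minimal sketch of these association scores follows, assuming each word is represented by a single embedding vector; the effect-size normalization computed at the end is formalized next, and the actual CEAT implementation samples contextual embeddings as described below.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def s(w, A, B):
    """Association of word embedding w with attribute sets A and B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def S(X, Y, A, B):
    """Differential association of target sets X and Y with attributes A and B."""
    return np.mean([s(x, A, B) for x in X]) - np.mean([s(y, A, B) for y in Y])

def effect_size(X, Y, A, B):
    """S normalized by the standard deviation of s over the pooled targets X ∪ Y."""
    pooled = [s(w, A, B) for w in list(X) + list(Y)]
    return S(X, Y, A, B) / np.std(pooled, ddof=1)
```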
To evaluate whether this difference is significant, the overall effect size (ES) is computed by dividing S by the standard deviation of s(w, A, B) over the union of X and Y (the in-sample variance). Intuitively, this measures whether the sets X and Y have significantly different associations than any other shuffling of X ∪ Y:

ES = \frac{S(X, Y, A, B)}{\text{std-dev}_{w \in X \cup Y}\; s(w, A, B)}

Since we are evaluating contextual embeddings, each word has multiple embeddings depending on the context in which it occurs. Therefore, CEAT samples one of the embeddings of each word to compute ES and refers to the result as ES_i. A random-effects model is used to combine the results of multiple such samplings. Eventually, the combined effect size (CES) is computed as

CES = \frac{\sum_i v_i\, \mathrm{ES}_i}{\sum_i v_i},

where v_i is the inverse of the sum of the in-sample variance and the between-sample variance. The different contextual embeddings for a word are derived from random occurrences of that word on Reddit. We use the official implementation of CEAT 26 with N = 10,000, which is the default in their implementation.

26 https://github.com/weiguowilliam/CEAT

Appendix D
Further Details and Results of Membership Inference Attacks on Deep Regression Models

D.1 Brain Age Model, Training and Dataset Details
In both the federated and centralized setups, we used T1 structural MRI scans of healthy subjects from the UK Biobank dataset (Miller et al., 2016) for brain age prediction. All the scans were preprocessed with the same technique as Lam et al. (2020), resulting in final images with dimensions 91 × 109 × 91. Here we briefly describe the relevant details. We refer the reader to Gupta et al. (2021b) and Stripelis et al. (2021a) for full details.

D.1.1 Centralized Training Setup

Model          Train  Test  Validation
3D-CNN         1.39   3.13  3.09
2D-slice-mean  0.77   2.88  2.92
Table D.1: Mean absolute errors (years) for the train, test, and validation sets in the centralized setup.

To simulate attacks on centrally trained deep neural network models, we adopted the pretrained models from Gupta et al. (2021b). The authors selected a subset of 10,446 healthy subjects from the 16,356 subjects in the UK Biobank dataset to create training, validation, and test sets of size 7,312, 2,194, and 940, respectively, with a mean chronological age of 62.6 years and a standard deviation of 7.4 years. Gupta et al. (2021b) proposed novel 2D-slice-based architectures to improve brain age prediction. Their architectures use 2D convolutions to encode the slices along the sagittal axis and aggregate the resultant embeddings through permutation-invariant operations. In our work, we use the 2D-slice-mean model, which demonstrated the best performance in their study, and a conventional 3D-CNN model, which is often used to process MRI scans (Peng et al., 2021; Cole et al., 2017). The architecture diagrams of both models are shown in Figure D.3 and discussed in Section D.1.3. For the brain age problem, performance is measured as the mean absolute error (MAE) between the predicted and true age on the held-out test set. In Gupta et al. (2021b), the models were trained for 100 epochs, and the best model was selected based on the performance on the validation set. The membership inference attacks that we investigate in this work are evaluated on the models produced at the end of the 100th epoch. Table D.1 shows the performance of these models, i.e., the MAE on the train, test, and validation sets at the end of the 100th epoch.

D.1.2 Federated Training Setup

Figure D.1: The UK Biobank data distribution across 8 learners for the three federated learning environments (panels (a)–(c) and (d)–(f): Uniform & IID, Uniform & non-IID, and Skewed & non-IID). Figures D.1a–D.1c present the amount of data per age range bucket (i.e., [39−50), [50−60), [60−70), [70−80)) per learner.
Figures Figure D.1d–D.1f present the age range distribution (meanµ and standard deviationσ) per learner. Figures are reproduced from Stripelis et al. (2021a). Model Uniform & IID Uniform & non-IID Skewed & non-IID Train Test Train Test Train Test 3D-CNN 2.16 3.01 3.41 3.81 2.83 3.47 2D-slice-mean 1.81 2.76 2.40 2.98 2.42 3.10 Table D.2: Mean absolute errors (year) for training, and testing set for different environments in the federated setup. To simulate membership inference attacks on models trained in federated learning environment, we used the pretrained models, dataset, and training setup of Stripelis et al. (2021a). In particular, the investigated learning environments consist of 8 learners with homogeneous computational capabilities (8 GeForce GTX 1080 Ti graphics cards with 10 GB RAM each) and heterogeneous local data distributions. With respect to the UK Biobank dataset, the 10,446 subject records were split into 8,356 train and 2,090 test samples. In particular, three representative federated learning environments were generated with diverse amounts of records (i.e., Uniform and Skewed) and subject age range distribution across learners (i.e., IID and non-IID). All these environments are presented in Figure D.1. To perform our attacks, we considered the community models received by each learner in all federation rounds. Specifically, we used the pretrained 3D-CNN community models from Stripelis et al. (2021a), which were trained for 25 federation rounds, and every learner performed local updates on the received community 108 0 5 10 15 20 25 30 35 40 Federation Rounds 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 MAE BrainAgeCNN Federation Rounds Convergence Uniform & IID Uniform & Non-IID Skewed & Non-IID Figure D.2: Learning curve (test performance) for2D-slice-mean model across different federated learning environments. The model is evaluated at each federation round for the brain age prediction problem. The more non-IID and unbalanced the data distribution is, the harder it is for the federation model to converge. model parameters for 4 epochs in each round. To train the2D-slice-mean federation model, we emulate a similar training setup for 40 federation rounds. For both federated models, the solver of the local objective is SGD, the batch size is equal to 1, the learning rate is equal to 5e − 5 and every learner used all its local data during training, without reserving any samples for validation. Finally, at every federation round all local models are aggregated using the Federated Average (FedAvg) aggregation scheme (McMahan et al., 2017). The convergence of the2D-slice-mean federated model for the three federated learning environments is shown in Figure D.2 and the performance of the final community models for each learning environment is summarized in Table D.2. D.1.3 3D-CNN and2D-slice-mean model architecture 3D-CNN: Figure D.3a describes the architecture for the3D-CNN model. 3D-CNN uses 5 convolutional blocks consisting of 3D-convolution layers with 32, 64, 128, 256 and 256 filters. Each convolutional layer is followed by 3D max-pooling, 3D instance norm and ReLU non-linearity operations. The resulting activations from these are passed through a 64 filter convolutional layer of kernel size 1, average pooled and passed through another 3D-convolutional layer of kernel size 1 to produce the 1 dimensional brain age output. 2D-slice-mean: Figure D.3b describes the architecture of the2d-slice-mean models. 
This architecture encodes each slice along the sagittal dimensional using a slice encoder. The slice encoder is similar to the 3D-CNN model but uses the 2D version of all the operations. Ultimately, all the slices are projected to a 32-dimensional embedding. The slice-mean operation aggregates these 32-dimensional embeddings via mean operation, which are then passed through feed-forward layers to output the brain age. 109 MRI Scan (1x91x109x91) 3x3x3 conv, 32 3x3x3 conv, 64 3x3x3 conv, 128 3x3x3 conv, 256 3x3x3 conv, 256 instance-norm, max-pool/2, relu instance-norm, max-pool/2, relu instance-norm, max-pool/2, relu instance-norm, max-pool/2, relu 1x1x1 conv, 64 instance-norm, max-pool/2, relu instance-norm, relu 2x3x2, 3D-average-pool dropout (0.5) 1x1x1 conv, 1 Output (Brain Age) conv 1 conv 2 conv 4 output conv 3 conv 6 conv 5 (a) 3D-CNN MRI Scan (1x91x109x91) 3x3 conv, 32 instance-norm, max-pool/2, relu Output (Brain Age) conv 1 conv 2 conv 3 3x3 conv, 64 3x3 conv, 128 instance-norm, max-pool/2, relu instance-norm, max-pool/2, relu conv 4 conv 5 3x3 conv, 256 3x3 conv, 256 instance-norm, max-pool/2, relu instance-norm, max-pool/2, relu conv 6 conv 7 1x1 conv, 64 1x1 conv, 32 instance-norm, relu, 3x2 2D-average pool, dropout(0.5) 91x, slice encoder Sagittal Slice (1x109x91) embeddings, 91x32 slice-mean-operation Dense(64), relu, Dense(1) mean embedding, 1x32 output (b) 2D-slice-mean Figure D.3: Neural network architectures for brain age prediction. Gray blocks indicate trainable modules, and non-parametric operations are indicated on the arrows. Groups of parameters are labeled for ease of reference. 110 D.2 Detailed Results of Membership Inference Attacks on Federated Learning Features 3D-CNN 2D-slice-mean D1 D2 D3 D1 D2 D3 Set 1 60.08± 0.06 (56) 57.75± 0.15 (30) 59.01± 0.34 (29) 58.15± 0.07 (56) 55.27± 0.03 (37) 58.55± 0.38 (24) Set 2 60.09± 0.06 (56) 59.97± 0.29 (30) 63.59± 0.35 (26) 58.04± 0.23 (56) 60.41± 0.22 (29) 63.73± 0.27 (25) Set 3 60.06± 0.04 (56) 61.00± 0.47 (28) 64.12± 0.52 (25) 58.11± 0.22 (56) 60.28± 0.73 (29) 63.81± 0.55 (24) Table D.3: Average membership inference attack accuracies on models trained using federated learning across all environments using different feature sets. Standard deviations are reported over 5 runs. The number in parentheses indicates the median total number of successful attacks over 5 runs. Table Legend: D1: Uniform & IID data distribution D2: Uniform & non-IID data distribution D3: Skewed & non-IID data distribution Set 1: Gradient magnitude Set 2: Gradient magnitude + prediction + label Set 3: Gradient magnitude + prediction + label + gradient (conv 6 +output) In Section 4.3.2, we discussed summary results of attacks on models trained via federated learning. Here, we provide a more detailed analysis of the attack results. Table D.3 compares the attack performance of different feature sets. We observe that in federated environments with similar data sizes and homogeneous data distribution, i.e., Uniform & IID, all attacks succeeded. However, when the local data size and the data distribution across learners are heterogeneous, the total number of successful attacks decreases, indicating that attacks are sensitive to data distribution. It is interesting to note that even though using only magnitudes as a feature resulted in poor average attack performance, these features may be more robust to distribution shift and have more successful attacks in some cases. 
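For reference, per-layer gradient-magnitude features of the kind used in these feature sets can be extracted roughly as follows; model, loss_fn, and the per-layer grouping are placeholders for the actual brain-age networks and their training loss.

```python
import torch

def gradient_magnitude_features(model, loss_fn, x, y):
    """One L2 gradient norm per parameter tensor for a single (batched) sample.
    In the attacks, norms are grouped per layer, giving 14 features for 3D-CNN
    and 18 for 2D-slice-mean; predictions and labels are appended for Sets 2/3."""
    model.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    feats = {name: p.grad.detach().norm().item()
             for name, p in model.named_parameters() if p.grad is not None}
    return feats
```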
Investigating and designing more robust features for membership inference attacks may lead to even more adverse attacks. Tables D.4 and D.5 visualize the attack results on a per learner basis. Each row indicates the attacker, and the column indicates the results of the attack on the attacked learner. We observe that the attack performance is correlated with the distribution similarity. For example, for the Uniform & non-IID distribution, learners L1 and L5 have a similar distribution and hence the attack from L1 on L5 or vice-versa has higher accuracies. However, the attack vulnerabilities are not symmetric; for example, the accuracy of the attack from L3 to L8, or L7 to L4 is higher than vice-versa, even though both learners have trained on the same number of samples. Such differences may be due to the neural network’s tendency to overfit differently over diverse local data distributions, which in this case is the age range. An adversary with some more privileged information like knowledge of the distribution of labels or outputs will design more sophisticated attacks. D.3 Attack Architecture and Training Details D.3.1 Attack Classifier Parametrization We train deep binary classifiers that take different features as input and output the probability of the sample being in the model’s train set or not. We presented the importance of different features derived from a sample and the trained model for membership inference attacks in Section 4.3.1. In the case of a black-box attack, the attacker can only use the model’s output. In contrast, in the case of white-box attacks, the attacker may 111 also exploit the knowledge of the model’s internal working. We have used gradient and activation information to simulate the attacks. We repurpose the model’s architecture to create a binary classifier for preliminary experiments on using activations for attacks. For example, in Figure D.3a, when simulating an attack that use activations from second hidden layer, i.e, after conv 2 layer, we used a classifier that had layers from conv 3 to output. However, as discussed in Section 4.3.1, the activations are not very useful features for membership attacks, and we did not do further experiments with activations. To compute membership inference attacks using only the error feature, a 1D feature, we have used a random forest classifier. For other features, i.e., prediction, labels, gradients, and gradient magnitudes, we have used a generic setup where each feature is embedded to a 64-dimensional embedding using their respective encoders. The embeddings are then concatenated and passed through a dropout layer and a linear layer to output the logit. Below we describe the architecture of the encoder for each feature. We do not do an excessive architecture search but observed that the results are not very sensitive to the specific encoder architecture. • Prediction and label: Prediction and label form a two-dimensional continuous feature. To create the embeddings, these are processed via a linear layer andReLU non-linearity. • Gradient magnitudes: We use parameter-gradient magnitudes of each layer as features resulting in a 14 dimensional feature for3D-CNN and an 18 dimensional feature vector for2D-slice-mean model. These are processed via a linear layer andReLU non-linearity to generate the embedding. • conv 1 gradients: The size ofconv 1 gradient feature is 288 (3× 3× 1× 32) and 864 (3× 3× 3× 1× 32) for2D-slice-mean and3D-CNN. 
We project this feature vector to the desired embedding size (64) using a linear layer followed by a ReLU non-linearity.
• conv 6 gradients: For 3D-CNN, the feature dimension is 1 × 1 × 1 × 256 × 64. We reshape it to 1 × 256 × 64 and then process it through three convolutional blocks consisting of a 2D-convolution layer, max-pool, and ReLU non-linearity with 64, 64, and 16 output filters. Finally, we pass the resulting activation of size 16 × 6 × 6 through a linear layer and ReLU non-linearity to get the desired 64-dimensional embedding. The convolution kernel sizes were 5 × 5, 4 × 2, and 4 × 3, and the max-pool kernel sizes were 4 × 2, 4 × 2, and 2 × 2. For 2D-slice-mean, the feature dimension is 1 × 1 × 256 × 64. We reshape it to 64 × 256 and process it through three convolutional blocks consisting of a 1D-convolution layer, max-pool, and ReLU non-linearity with 128 output filters in each layer. Finally, we process the resulting activations of size 128 × 14 through a linear layer to get the embedding. The convolution kernel sizes were 5, 4, and 3. The 1D-max-pool kernel sizes were 4, 2, and 2.
• output gradients: This layer has a different number of parameters for the two models, so we used different encoders. For 3D-CNN, the feature dimension is 1 × 1 × 1 × 64 × 1. It is reshaped to a 64-dimensional vector and passed through a linear layer and ReLU non-linearity to get the embedding. For 2D-slice-mean, we consider the two final feed-forward layers as the output layer. One of these layers has dimensions 64 × 1 and is encoded similarly to 3D-CNN's output layer. The other feed-forward layer's parameters are 32 × 64. We process it through three convolutional blocks consisting of a 1D-convolution layer, max-pool, and ReLU non-linearity with 64 output filters in each layer. Finally, we process the resulting activation of size 64 × 4 through a linear layer to get the 64-dimensional embedding. All the convolution kernel sizes were set to 3, and the 1D-max-pool kernel sizes were 2.

Figure D.4: Attack accuracy vs. R² for models trained with differential privacy. Error bars are generated by bootstrapping the test set 5 times using 1000 samples.
Figure D.5: Performance (MAE) vs. ε at δ = 1e−6 for models trained with differential privacy. Differential privacy is a very strong notion of privacy; it destroys the performance to achieve any non-vacuous privacy guarantees. Error bars are generated by bootstrapping the test set 5 times using 1000 samples.

Note: When using features from multiple trained models for an attack (e.g., in the case of federated training), we compute the logits using the deep classifiers described above and use the average logit to compute the probability. The classifier parameters are shared across features from different models. The main intuition for averaging is that averaging in log space amounts to considering the features from each trained model independently.
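A compact sketch of this classifier skeleton in PyTorch is shown below; the encoders are simplified to single linear + ReLU blocks, and the dropout rate and feature names are placeholders (the convolutional encoders for the conv 6 and output gradients described above would replace the corresponding entries).

```python
import torch
import torch.nn as nn

class MembershipAttackClassifier(nn.Module):
    """Each feature is embedded into 64 dimensions, the embeddings are
    concatenated, and dropout + a linear layer produce one membership logit."""
    def __init__(self, feature_dims, emb_dim=64, dropout=0.5):
        super().__init__()
        self.feature_encoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, emb_dim), nn.ReLU())
            for name, dim in feature_dims.items()
        })
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(emb_dim * len(feature_dims), 1)

    def forward(self, features):
        embs = [enc(features[name]) for name, enc in self.feature_encoders.items()]
        return self.head(self.dropout(torch.cat(embs, dim=-1)))  # membership logit

# For federated models, the logits obtained from each community model's features
# are averaged before applying the sigmoid (see the note above).
attacker = MembershipAttackClassifier({"pred_label": 2, "grad_magnitudes": 14})
```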
Training data creation for the attack models is described in Section 4.2.2. D.4 Differential Privacy Differential privacy was initially proposed as a mathematical framework to secure information about individual records while releasing group or aggregate query results on a database. It is adapted to machine learning by considering model parameters as the output and training dataset as the database. Differential private machine learning aims to learn a parametric model so that the final parameters do not differ much if trained on 113 another dataset differing from the original dataset by a single sample. The privacy parameterε quantifies the difference, and lowerε is more private (Dwork et al., 2014; Abadi et al., 2016). More formally, a randomized algorithmA :D→W is(ε,δ)-differential private, if for any two adjacent inputs d,d ′ ∈D. Pr[A(d)∈ w]≤ e ε Pr[A(d ′ )∈ w]+δ ∀ w∈W For non-vacuous guarantees, ε is desired to be lower, usually less than 1. However, there is no standard agreement on how much is sufficiently small (Laud et al., 2019). δ depends on the dataset size and is desired to be less than N − 1 , where N is the dataset size. In the specific case of supervised deep learning, output is the neural network parameters, input is the labeled dataset, and algorithm is the training algorithm (usually some variant of SGD). Intuitively, to ensure strong privacy guarantees, it is desired to minimize a single training sample’s influence on the neural network parameters. We have used differential private version of SGD (DP-SGD) proposed by Abadi et al. (2016) which achieves privacy guarantees by adding Gaussian random noise to the gradients of each sample and imple- mented inpytorch-opacus 27 . This procedure avoids learning too much information about a single sample, thus providing privacy guarantees. In practice, we have used theAdam variant of DP-SGD with a learning rate of 5e − 5 , emulating the same training setup as Gupta et al. (2021b). Differential privacy assumes a powerful and worst-case adversary, which may be unrealistic. We find that to achieve non-vacuous privacy guarantees (ε < 100) with differential privacy amounted to losing the performance altogether on the brain age prediction problem (see Figure D.5). However, even with vacuous guarantees, we see that differential privacy could reduce the vulnerability to realistic membership inference attacks as shown in Figure 4.3a and Figure D.4. D.5 Membership Inference attacks in centralized setup without the knowledge of training samples In the setup of Section 4.3.1, we assumed that the adversary has access to some training samples to perform membership inference attacks. However, such an assumption may be too restrictive. Here, we discuss white-box membership inference attacks and the attacker has access to some samples from the training distribution only instead of training samples. The attacker does not know if these samples were part of training. Since the attacker does not have access to the samples used to train the model, attack classifier cannot be trained. To circumvent this limitation, we use the idea of shadow training from Nasr et al. (2018). Briefly, the attacker trains new models with the same architecture and training hyperparameters using the samples available from the training distribution. These newly trained models, called shadow models, are expected to imitate or shadow the trained model’s behavior — for example, similar overfitting behavior, similar training performance, etc. 
Therefore, the attacker may train the attack classifier using the shadow models and samples used to train them and expect to transfer to the trained models he intends to attack. D.5.1 Setup For this section, we use the train, test and validation split described in Section D.1.1. We consider that the attacker has access to a trained model and some samples from the training distribution, which may or may not overlap with the samples used to train the model being attacked. The attacker intends to identify if some data sample was used to train the model. 27 https://github.com/pytorch/opacus 114 The attacker is trying to attack the same models that are described in Section D.1. These models are trained on the full training set. To simulate the attacks with access to only the training distribution but not training samples, we consider the scenario where the attacker has access to 5000 random samples from the training distribution. For this, we pick 5000 random samples from the original training set of size 7312. The attacker is trying to determine the membership of samples from the train set, which differ from these 5000 samples. Due to limited data, the data used to train the shadow models overlaps with the data used to train the original model. A more difficult scenario will be if these datasets do not overlap at all. D.5.2 Result To report the membership inference attack performance, we created a test dataset of 1500 samples from the full train set (different from 5000 samples that the attacker already has) and 1500 samples from the unseen set to evaluate the membership inference attack accuracy. We trained a single shadow model with 5000 samples that are available to the attacker. The attack classifier is trained to attack the shadow models similar to earlier experiments using prediction, label, and gradient of conv6 and output layers from the shadow model as the features. We extract these features from the trained model and classify them with the attack classifier to infer the memberships. The results are summarized in Table D.6. The ‘Test’ column shows the result of performing a membership inference attack on the trained model, which is what we are interested in. We also report the attack accuracies on the validation set derived from the shadow model’s training set in the ‘Validation’ column. We observe that even without access to training samples, the membership inference attacks are feasible, albeit with slightly lower accuracy than the case in which the adversary has access to some of the training samples. 
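To summarize the pipeline mechanically, here is a toy, self-contained sketch with scikit-learn stand-ins; the linear regressor, the 1-D error feature, and the synthetic data are only placeholders for the brain-age networks and the richer feature sets used in the actual attacks.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def extract_features(model, X, y):
    """Toy 1-D attack feature: per-sample absolute error (the actual attacks
    also use predictions, labels, and gradient features)."""
    return np.abs(model.predict(X) - y).reshape(-1, 1)

# Attacker's samples from the training distribution and a pool of samples that
# will not be used to train the shadow model; synthetic here for illustration.
X_attacker, y_attacker = rng.normal(size=(5000, 10)), rng.normal(size=5000)
X_heldout, y_heldout = rng.normal(size=(1000, 10)), rng.normal(size=1000)

# 1. Train a shadow model with the same architecture and training recipe as the
#    target model (a linear regressor stands in for the brain-age network).
shadow = LinearRegression().fit(X_attacker, y_attacker)

# 2. Label the shadow model's training samples as members (1) and the held-out
#    samples as non-members (0), extracting attack features from the shadow.
feats = np.vstack([extract_features(shadow, X_attacker, y_attacker),
                   extract_features(shadow, X_heldout, y_heldout)])
labels = np.concatenate([np.ones(len(X_attacker)), np.zeros(len(X_heldout))])

# 3. Train the attack classifier on shadow features; at attack time, the same
#    features are extracted from the *target* model instead of the shadow.
attack_clf = RandomForestClassifier(n_estimators=100).fit(feats, labels)
```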
115 Environment 3D-CNN Uniform & IID L1 L2 L3 L4 L5 L6 L7 L8 L1 60.62± 0.66 60.71± 0.61 59.85± 0.69 59.89± 0.72 59.17± 0.58 60.64± 0.46 60.15± 0.51 L2 59.68± 0.20 60.57± 0.69 60.02± 0.67 59.86± 0.58 59.10± 0.32 60.63± 0.53 60.15± 0.45 L3 59.78± 0.15 60.59± 0.41 60.18± 0.46 60.19± 0.48 59.55± 0.13 60.63± 0.28 60.48± 0.28 L4 59.92± 0.26 60.29± 0.40 60.89± 0.25 60.14± 0.32 59.45± 0.29 60.73± 0.24 60.58± 0.20 L5 59.86± 0.33 60.22± 0.25 60.85± 0.13 59.94± 0.36 59.41± 0.26 60.48± 0.33 60.55± 0.34 L6 59.98± 0.30 59.98± 0.35 60.56± 0.41 59.56± 0.47 60.07± 0.57 60.70± 0.47 60.32± 0.48 L7 60.08± 0.45 60.05± 0.38 60.81± 0.37 59.77± 0.18 60.00± 0.30 59.28± 0.31 60.29± 0.35 L8 59.09± 0.30 59.13± 0.24 60.15± 0.32 59.28± 0.39 59.34± 0.12 58.98± 0.24 59.66± 0.25 Uniform & non-IID L1 L2 L3 L4 L5 L6 L7 L8 L1 56.20± 1.44 38.34± 0.51 33.06± 0.73 72.72± 0.36 55.92± 1.22 38.25± 0.33 32.99± 0.71 L2 56.36± 2.20 52.80± 1.09 39.70± 1.32 56.61± 2.41 63.28± 0.12 51.03± 1.11 40.20± 1.42 L3 32.76± 0.95 49.15± 0.83 66.09± 1.50 32.42± 0.77 49.63± 0.77 63.43± 0.49 66.61± 1.19 L4 32.05± 0.29 41.50± 0.10 59.98± 0.14 32.45± 0.25 43.12± 0.19 61.45± 0.11 69.86± 0.21 L5 72.28± 0.19 54.02± 0.15 37.46± 0.34 33.53± 0.23 54.09± 0.21 37.43± 0.10 33.59± 0.17 L6 59.79± 1.22 62.71± 0.56 51.31± 0.95 37.86± 0.84 59.59± 1.45 49.37± 0.69 38.51± 0.86 L7 31.96± 0.76 49.26± 0.57 64.98± 0.20 66.40± 1.06 31.46± 0.80 49.72± 0.54 67.11± 0.65 L8 32.88± 0.26 42.03± 0.79 59.64± 1.24 69.93± 0.24 33.26± 0.10 43.15± 0.85 61.23± 0.65 Skewed & non-IID L1 L2 L3 L4 L5 L6 L7 L8 L1 66.85± 0.38 40.96± 0.26 30.86± 0.65 64.44± 1.39 59.41± 0.33 40.13± 0.25 31.38± 1.10 L2 67.90± 0.60 50.38± 0.50 35.40± 1.33 48.98± 0.86 54.98± 0.28 43.17± 0.52 32.34± 1.67 L3 40.26± 0.50 50.31± 0.63 62.31± 1.09 33.61± 0.79 46.84± 0.45 64.99± 0.60 59.93± 1.92 L4 36.86± 0.08 40.38± 0.20 60.66± 0.76 35.99± 0.17 41.34± 0.22 65.98± 0.36 83.10± 0.17 L5 58.36± 0.65 51.96± 1.05 36.82± 1.82 34.29± 1.80 58.98± 0.16 36.45± 1.05 32.45± 1.59 L6 67.88± 2.40 60.44± 4.63 35.74± 2.27 28.04± 2.37 70.74± 1.33 34.40± 1.07 26.76± 2.75 L7 35.89± 0.54 42.13± 0.64 64.86± 0.43 78.70± 0.32 33.38± 0.27 43.37± 0.35 81.14± 0.26 L8 39.64± 0.19 40.89± 0.12 53.93± 0.43 76.36± 0.18 39.48± 0.23 41.97± 0.10 62.53± 0.16 Table D.4: Matrix of the membership inference attack accuracy on a per learner basis for the3D-CNN model across every federated learning environment. Rows are the attacking learner and columns are the attacked learner. Colored cells indicate successful attacks and more heated cells specify higher attack accuracies. The results are over 5 runs. 
116 Environment 2D-slice-mean Uniform & IID L1 L2 L3 L4 L5 L6 L7 L8 L1 58.63± 1.80 57.18± 2.27 57.15± 1.61 57.41± 1.07 56.89± 1.43 57.02± 1.19 57.77± 1.54 L2 57.49± 0.70 58.74± 1.00 57.93± 0.89 57.82± 0.79 58.15± 0.48 57.63± 0.95 59.32± 1.15 L3 58.22± 0.68 60.61± 0.66 58.81± 0.38 58.95± 0.58 58.70± 0.45 58.66± 0.61 60.29± 0.44 L4 57.07± 0.81 58.69± 1.14 57.18± 0.94 57.43± 0.70 58.20± 0.90 56.34± 0.99 58.01± 0.95 L5 57.69± 1.12 60.28± 1.65 58.64± 1.98 58.26± 1.09 57.99± 1.57 57.85± 1.64 59.26± 1.56 L6 56.35± 0.63 58.94± 0.80 56.85± 1.06 57.20± 0.64 56.47± 0.58 55.98± 1.04 58.34± 1.06 L7 58.00± 0.44 60.62± 0.48 59.24± 0.90 58.65± 0.30 59.14± 0.65 58.31± 1.15 59.91± 0.52 L8 57.20± 1.25 59.85± 1.05 57.53± 1.28 57.64± 1.18 57.58± 0.98 56.72± 1.09 57.24± 1.12 Uniform & non-IID L1 L2 L3 L4 L5 L6 L7 L8 L1 56.83± 0.60 40.44± 0.74 33.88± 1.08 72.73± 0.32 56.43± 0.46 40.01± 0.81 34.05± 1.33 L2 59.07± 1.81 52.73± 1.05 40.41± 1.78 59.39± 2.51 62.05± 0.60 51.43± 1.22 40.63± 1.72 L3 33.18± 1.09 49.95± 0.72 62.79± 1.01 33.07± 0.77 50.20± 1.05 62.57± 0.22 64.12± 0.67 L4 29.70± 0.82 40.12± 0.39 58.87± 0.39 30.81± 0.84 41.31± 0.25 59.60± 0.39 68.80± 0.06 L5 73.26± 0.63 57.04± 0.91 41.14± 0.71 35.02± 0.44 56.70± 0.70 40.89± 0.64 35.10± 0.38 L6 57.60± 3.65 60.62± 1.78 51.98± 1.25 41.79± 2.88 57.86± 3.44 51.52± 1.79 42.38± 3.29 L7 31.39± 1.29 48.18± 0.89 62.86± 0.52 64.07± 1.20 31.54± 1.94 48.46± 0.66 65.36± 1.07 L8 30.99± 1.93 41.21± 0.65 59.09± 0.68 69.12± 0.48 32.09± 1.90 42.16± 0.29 59.84± 0.35 Skewed & non-IID L1 L2 L3 L4 L5 L6 L7 L8 L1 66.09± 0.97 42.84± 0.86 29.77± 0.76 58.45± 1.06 56.38± 0.47 38.90± 0.52 28.07± 0.74 L2 66.99± 0.65 51.77± 0.48 33.51± 0.66 48.36± 0.85 54.60± 0.42 43.30± 0.36 29.72± 0.63 L3 42.59± 1.51 50.25± 1.42 66.38± 3.14 35.20± 2.19 46.84± 0.99 65.29± 1.99 59.76± 4.53 L4 36.79± 0.86 41.14± 0.73 62.60± 1.41 35.63± 1.22 41.70± 0.57 67.31± 1.31 82.31± 0.92 L5 58.25± 0.55 51.20± 0.98 36.05± 0.74 32.79± 1.24 57.65± 0.41 35.81± 0.66 31.48± 0.96 L6 63.94± 1.38 57.05± 1.74 37.25± 1.15 30.48± 2.87 70.76± 2.71 34.65± 1.40 29.03± 4.52 L7 35.66± 1.29 41.30± 1.60 64.11± 0.86 78.31± 1.19 33.84± 1.30 43.11± 1.33 80.62± 2.31 L8 39.60± 0.20 40.83± 0.17 55.53± 0.79 77.06± 0.72 39.73± 0.19 42.48± 0.11 63.63± 0.33 Table D.5: Matrix of the membership inference attack accuracy on a per learner basis for the2D-slice-mean model across every federated learning environment. Rows are the attacking learner and columns are the attacked learner. Colored cells indicate successful attacks and more heated cells specify higher attack accuracies. The results are over 5 random runs. 117 Model Test Validation 3D-CNN 71.74± 1.82 75.22± 0.22 2D-slice-mean 74.39± 2.14 85.46± 0.24 Table D.6: Membership inference attacks without the knowledge of training samples. The test performance results from performing membership inference on the trained model using attack models trained on infor- mation from the shadow model. The validation performance is the attack classifier’s performance on the validation set derived from the shadow models’ training set. 118 Appendix E Additional Results of Jointly Reparametrized Multi-Layer Adaptation E.1 Ablations Here, we evaluate alternative hyperparameter choices for SLaSh by performing ablation studies concerning the position of shifts, initialization of parameters, and using[MASK] vs.[CLS] representations. Overall, our results are relatively less sensitive to these choices. 
E.1.1 Adding shifts to other activations In the main paper, we showed the results of adding shifts, i.e., random projections from a trainable vector to the output layer’s activation. These shifts can also be added to other activations, such as the activations after attention layers, intermediate feed-forward layers, or a combination of these. We evaluate these choices on tasks from the GLUE benchmark, and Table E.1 summarizes our findings. We find that the performance of shifting attention activations is similar to shifting output activations in most cases except for RTE and CoLA. Similar observations hold for intermediate activations. Shifting activations from intermediate feed-forward layers performed similarly for all tasks compared to output activations. These observations do not change when we increase the trainable parameters. Shifting output activations performed slightly better in terms of average performance computed across all tasks. Moreover, the intermediate activations have a higher dimension than the output activation (3,072 vs. 768 forRoBERTa-base). Therefore, intermediate activations required maintaining bigger random projection matrices (W l ) during training. In summary, other choices can perform similarly. We chose output activations due to their smaller dimension and transformers using layer norm immediately after it, which can take care of sudden drifts in activations, etc. Position MNLI QQP QNLI SST-2 CoLA STS-B MRPC RTE Avg. d = 1,024 attention 80.3 81.0 88.7 93.2 57.9 89.5 91.1 73.6 81.93 d = 1,024 intermediate 80.0 81.2 88.9 93.2 59.6 89.7 92.3 76.2 82.64 d = 1,024 output 80.4 80.9 89.3 93.1 59.5 89.3 91.7 77.6 82.72 d = 5K intermediate 83.7 83.7 90.2 93.2 58.4 89.9 92.1 78.0 83.65 d = 5K output 83.4 83.4 90.6 93.2 59.3 90.4 91.9 77.6 83.74 Table E.1: Effect of adding shifts at different position on sequence classification tasks (GLUE Development set) withRoBERTa-base as the pretrained model. All the results are with[CLS] representations. 119 Initialization SST-2 CoLA STS-B MRPC RTE Avg. z,W l ∈{N ,U} 95.0 63.6 90.8 92.1 84.8 85.27 z= 0, W l ∈{N ,U} 95.2 66.0 90.4 91.7 83.8 85.42 z∈{N ,U}, W l = I 95.1 62.7 90.4 92.6 83.8 84.92 Table E.2: Effect of different initialization of SLaSh parameters on sequence classification tasks (GLUE Development set) withRoBERTa-large as the pretrained model. All the results use[MASK] representations and d = 1,024. Model MNLI QQP QNLI SST-2 CoLA STS-B MRPC RTE Avg. RoBERTa-base d = 1,024,[MASK] 80.8 80.9 89.8 92.9 57.6 89.5 91.0 78.7 82.65 d = 1,024,[CLS] 80.4 80.9 89.3 93.1 59.5 89.3 91.7 77.6 82.72 d = 5K,[MASK] 83.6 83.2 90.8 93.7 61.3 90.3 91.3 79.4 84.21 d = 5K,[CLS] 83.4 83.4 90.6 93.2 59.3 90.4 91.9 77.6 83.74 RoBERTa-large d = 1024,[MASK] 86.2 83.3 92.2 95.0 63.6 90.8 92.1 84.8 86.01 d = 1024,[CLS] 86.3 83.1 92.3 95.1 61.6 90.5 92.6 82.3 85.47 d = 10K,[MASK] 89.1 85.6 93.6 95.9 65.5 91.8 91.8 85.6 87.33 d = 10K,[CLS] 89.1 85.7 93.6 95.8 64.0 91.7 91.8 86.6 87.29 Table E.3: Comparing SLaSh with[MASK] and[CLS] token representation on sequence classification tasks (GLUE Development set). E.1.2 Initialization Regarding the initialization of z and W l , we have several choices. z can be initialized randomly or with all zeros. Like Hambardzumyan et al. (2021), we report results with random initialization for z in the main paper. In particular, it is initialized asN (0,σ = 1 √ d ) orU(− 1 √ 12d , 1 √ 12d ). The projection matrices, W l , are also initialized randomly with identical distributions as z. 
With these initialization choices, the variance of z_l is 1/d in each dimension. We consider the choice of Gaussian or uniform initialization as a hyperparameter. Table E.2 shows the effect of different initializations on performance for sequence classification tasks. The results are relatively insensitive to the initialization. When both z and the weight matrices are randomly initialized, the performance is better on STS-B, MRPC, and RTE than when z is initialized as all zeros. However, the average performance of the all-zeros initialization is higher because its performance is much higher on CoLA. For the particular case of d = 1024, i.e., when the dimension of z is the same as that of the activations, we can initialize W_l as the identity. In this case, all the blocks are shifted with the same vector. This performed similarly or worse on all tasks except MRPC. Random projections allow the model to select different parts of z for each transformer block. This result partly demonstrates the utility of using random projection matrices.

E.1.3 [MASK] vs. [CLS] Representations
As discussed in Section 5.2.1, we can use the [CLS] or [MASK] representation for classification tasks. Table E.3 compares these with RoBERTa-base and RoBERTa-large models. In terms of average performance, we find that [MASK] token representations are better than or similar to [CLS] token representations. The choice of representation mattered very little for bigger datasets (>10K samples), with the performance being similar for both choices. For smaller datasets, however, we do not see any clear patterns. On average, the [MASK] token representation performed slightly better than the [CLS] representation, echoing the observation of Hambardzumyan et al. (2021). So we use the [MASK] representation for all the results in the main paper.
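To make the mechanism concrete, the following is a minimal sketch of such a shift module, not the exact SLaSh implementation; the layer count and hidden size follow the RoBERTa-base numbers above, and the placement mirrors the "output activations" choice in Table E.1.

```python
import torch
import torch.nn as nn

class RandomProjectionShift(nn.Module):
    """A single trainable vector z (dimension d) is shared across all blocks;
    block l adds the fixed random projection W_l z to its output activations."""
    def __init__(self, num_layers=12, d=1024, hidden=768):
        super().__init__()
        self.z = nn.Parameter(torch.randn(d) / d**0.5)            # std 1/sqrt(d)
        # Fixed, non-trainable random projections, one per transformer block.
        self.register_buffer("W", torch.randn(num_layers, hidden, d) / d**0.5)

    def shift(self, layer_idx):
        return self.W[layer_idx] @ self.z                          # shape: (hidden,)

shifts = RandomProjectionShift()
# Schematically, inside block l of the frozen transformer:
#   hidden_states = output_dense(hidden_states) + shifts.shift(l)
#   hidden_states = layer_norm(hidden_states + residual)
```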
Abstract
As machine learning becomes more prevalent in mission-critical domains, the harms of unintended information captured by these models are becoming more apparent. These models can inadvertently introduce biases and memorize training data, leading to potential unfairness, inequitable outcomes, or privacy risks. These phenomena are especially alarming in applications where data privacy needs to be upheld, such as medical imaging, or where unfairness can lead to disparate outcomes, such as hiring decisions. This thesis examines ways to control and limit information in deep learning models, focusing on fairness and privacy. Specifically, we discuss ways to ensure fairness in decision-making by learning fair data representations and to prevent unfair language generation by correctly modulating information in neural networks. Concerning privacy, we demonstrate that releasing neuroimaging models may reveal private information about the individuals participating in the training set and discuss ways to mitigate these privacy leakages. Among these methods, differentially private training is promising as it protects against all possible privacy attacks. However, differentially private training can drastically hurt utility since the magnitude of the noise added during training scales with the number of model parameters. To this end, we explore techniques to reduce the effective number of model parameters during training.