Enhancing Privacy, Security, and Efficiency in Federated Learning: Theoretical Advances and Algorithmic Developments

by

Ahmed R. Elkordy

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2024

Copyright 2024 Ahmed R. Elkordy

Dedicated to my beloved wife, Shereen Doma, my dear children, Retal and Omar, and my lovely mother and brothers.

Acknowledgements

With nearly 24 years in academia, from primary school to my PhD, this journey is now complete. I am deeply grateful for every stage of this path, with the most profound gratitude reserved for my PhD experience.

First and foremost, I would like to express my deepest respect and gratitude to my advisor, Professor Salman Avestimehr. Before joining his lab, my master's degree focused primarily on the theoretical aspects of information theory. While this provided a valuable foundation, I aspired to pursue my PhD on more practical problems with impactful applications. Joining Professor Avestimehr's lab was a pivotal moment in my career, marking my transition from purely theoretical research to applied research. To facilitate this shift, Professor Avestimehr guided me on how to identify relevant problems that are both challenging and impactful, define the key questions we aim to answer, develop effective solutions, and present meaningful results. During the pandemic and the shift to online meetings, Professor Avestimehr also helped me improve my communication and presentation skills. His ability to present complex ideas in simple terms greatly inspired me and enhanced my own ability to convey complex concepts effectively.

I am deeply grateful to my collaborators, mentors, and friends, Yahya H. Ezzeldin and Saurav Prakash, for their invaluable support and guidance throughout my research. Yahya, a postdoctoral researcher in our lab, was always available to brainstorm ideas and provide insightful feedback. Saurav, a senior PhD candidate in our lab, greatly assisted me in my first machine learning project with his expert advice and strong research background.

I would also like to thank the members of my qualifying exam and dissertation committees, Prof. Murali Annavaram, Prof. Mahdi Soltanolkotabi, Prof. Konstantinos Psounis, Prof. Meisam Razaviyayn, and Prof. Sai Praneeth Karimireddy, whose insightful feedback has helped significantly improve the quality of this dissertation.

I cherished the time spent with my lab mates, Amir Ziashahabi, Asal Mehradfar, Duygu Nur Yaldiz, Emir Ceyani, Erum Mushtaq, Yavuz Faruk Bakman, Sara Babakniya, and Tuo Zhang. They were always there when I wanted to share good and bad news.

I am deeply grateful to my lovely wife, Shereen Doma, who has always been there to support me. Without her, this achievement would not have been possible. She believed in me and encouraged me to apply for my Master's degree, marking the beginning of my postgraduate journey. At the start of my PhD, I often felt uncertainty and ambiguity about the path I had chosen, but she was always there, comforting me and helping me see the big picture and long-term goals. I am incredibly grateful to her and always consider this achievement to be hers as much as it is mine.

I am also grateful to my daughter Retal and my son Omar. Retal was just 8 months old when I began my PhD, and I can't imagine going through that challenging time without her.
She was a source of joy during some of the darkest moments. My lovely Omar was born in my second year of the PhD. Some might think having two kids is hard, but in fact, Omar brought even more happiness and joy to our growing family and made me more dedicated and motivated. My children gave meaning to what I was doing, inspiring me to push forward and achieve my goals. I hope they will always be proud of me.

Lastly, I am incredibly grateful to my mother, who raised me after my father's passing when I was just 5 years old. She always encouraged me and ensured that everything I needed for my success was available. She dedicated a significant portion of her income to my education. I will never be able to repay her for all that she has done for me. I would also like to thank my brothers, Mohamed and Mahmoud, who have been incredibly supportive throughout my journey.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Privacy in Federated Learning
    1.1.1 Incomplete Privacy Guarantee for Secure Aggregation in FL
    1.1.2 Integrating Robust Aggregation Methods with SA
    1.1.3 Adapting to Devices' Resource Heterogeneity in SA
  1.2 Limited Training Resources in Federated Learning
  1.3 Robustness in Federated Learning

Chapter 2: How Much Privacy Does Federated Learning with Secure Aggregation Guarantee?
  2.1 Introduction
  2.2 Preliminaries
    2.2.1 Basic Setting of Federated Learning
    2.2.2 Secure Aggregation Protocols for Federated Learning
      2.2.2.1 Threat Model in Secure Aggregation for Federated Learning
      2.2.2.2 Secure Aggregation Guarantees
      2.2.2.3 Our Contribution: Guarantees on Privacy Leakage from the Aggregated Model
  2.3 Theoretical Privacy Guarantees of FL with Secure Aggregation
    2.3.1 Main Results
    2.3.2 Impact of System Parameters
      2.3.2.1 Impact of Number of Users (N)
      2.3.2.2 Impact of Batch Size (B)
      2.3.2.3 Impact of Model Size (d)
      2.3.2.4 Impact of Global Training Rounds (T)
    2.3.3 Impact of User Dropout, Collusion, and User Sampling
      2.3.3.1 Impact of User Dropout and Collusion with the Server
      2.3.3.2 Impact of User Sampling
  2.4 Experimental Setup
    2.4.1 MI Estimation
    2.4.2 Datasets and Models
  2.5 Empirical Evaluation
    2.5.1 Impact of Number of Users (N)
    2.5.2 Impact of Model Size (d)
    2.5.3 Impact of Batch Size (B)
    2.5.4 Accumulative MI leakage
    2.5.5 Impact of Local Training Epochs (E)
    2.5.6 Impact of Data Heterogeneity
    2.5.7 Practical Privacy Implications
  2.6 Related work
  2.7 Further Discussion and Conclusions
  2.8 Appendix
    2.8.1 Proof of Theorem 2.1
      2.8.1.1 Upper bound on $H_M$
      2.8.1.2 Lower bound on $H_M$ for Case 1 in Theorem 2.1
      2.8.1.3 Lower bound on $H_M$ for Case 2 in Theorem 2.1
    2.8.2 Proof of Corollary 2.1
    2.8.3 Proof of Lemma 2.1
    2.8.4 Overview of MINE

Chapter 3: HeteroSAg: Secure Aggregation with Heterogeneous Quantization in Federated Learning
  3.1 Introduction
    3.1.1 Main contributions
    3.1.2 Related works
  3.2 Background
  3.3 Problem Formulation
    3.3.1 System Model
    3.3.2 Challenges
    3.3.3 Performance metric
  3.4 The proposed HeteroSAg
    3.4.1 HeteroSAg for heterogeneous quantization
    3.4.2 HeteroSAg for Byzantine-Resilience
    3.4.3 Theoretical guarantees of HeteroSAg
  3.5 Numerical Experiments
    3.5.1 Experiment 1 (Heterogeneous quantization)
    3.5.2 Experiment 2 (Byzantine robustness)
  3.6 Conclusion
  3.7 Appendix
    3.7.1 Illustrative example for SecAg
    3.7.2 Proof of Theorem 3.1
    3.7.3 Main properties for the SS matrix of HeteroSAg
    3.7.4 Proof of Theorem 3.1
    3.7.5 Proof of Lemma 3.1
    3.7.6 Proof of Theorem 3.2
    3.7.7 Proof of Theorem 3.3
    3.7.8 Proof of Proposition 3.2
    3.7.9 HeteroSAg for Heterogeneous group Size
      3.7.9.1 Execution of HeteroSAg for heterogeneous group size
    3.7.10 Proof of Proposition 3.1
    3.7.11 Byzantine robustness of HeteroSAg
    3.7.12 Complete Experimental Results for Section 3.5
      3.7.12.1 Evaluating the results in Figure 3.4b
      3.7.12.2 Additional experiment (Heterogeneous quantization)
    3.7.13 Additional experiment (Byzantine robustness)
    3.7.14 Models

Chapter 4: Federated Parameter Efficient Fine-Tuning of Language Models
  4.1 Introduction
  4.2 Related Work
  4.3 Preliminaries
    4.3.1 PEFT Baselines in Centralized Learning
    4.3.2 Observation: PEFT is challenged when data distribution gets non-IID
    4.3.3 Low-Rank Adaption: LoRA
  4.4 Our approach (Primed-LoRA)
    4.4.1 Full fine-tuning for priming LoRA
    4.4.2 Primed-LoRA with Domain Adaption
    4.4.3 Primed-LoRA with Sparse Fine-tuning
    4.4.4 Primed-LoRA in Stage 2
  4.5 Experiments
  4.6 Evaluation
  4.7 Conclusion
  4.8 Appendix
    4.8.1 Priming LoRA from FFT using SVD
    4.8.2 Details of Dataset
    4.8.3 LoRA Blocks Details
    4.8.4 Training time computation
  4.9 Performance of 20News group on Albert
  4.10 Performance and cost of 20News group on DistilBERT
  4.11 Performance of News Category on Albert and DistilBERT
  4.12 Impact of Data Heterogeneity on PEFT Methods
  4.13 Impact of Update size

Chapter 5: Basil: A Fast and Byzantine-Resilient Approach for Decentralized Training
  5.1 Introduction
    5.1.1 Contributions
    5.1.2 Related Works
  5.2 Problem Statement
    5.2.1 Decentralized System Model
    5.2.2 Model Training
  5.3 The Proposed Basil Algorithm
    5.3.1 Basil for IID Setting
    5.3.2 Theoretical Guarantees
  5.4 Generalizing Basil to Non-IID Setting via Anonymous Cyclic Data Sharing
    5.4.1 ACDS Algorithm
    5.4.2 Anonymity Guarantees of ACDS
  5.5 Basil+: Parallelization of Basil
    5.5.1 Basil+ Algorithm
    5.5.2 Random Clustering Agreement
    5.5.3 The Success of Basil+
  5.6 Numerical Experiments
    5.6.1 Numerical Experiments for Basil
    5.6.2 Numerical Experiments for Basil+
  5.7 Conclusion and Future Directions
  5.8 Appendix
    5.8.1 Proof of Proposition 5.1 and Proposition 5.2
    5.8.2 Convergence Analysis
      5.8.2.1 Proof of Theorem 5.1
    5.8.3 Proof of Theorem 2
    5.8.4 Joining and Leaving of Nodes
      5.8.4.1 Nodes Dropout
      5.8.4.2 Nodes Rejoining
    5.8.5 Proof of Proposition 5.3
    5.8.6 Proof of Proposition 5.4
    5.8.7 Proof of Proposition 5.5
    5.8.8 UBAR
      5.8.8.1 Algorithm
      5.8.8.2 Time Analysis for UBAR
    5.8.9 Setting
      5.8.9.1 Models
      5.8.9.2 Implementing UBAR
      5.8.9.3 Performance of Basil on MNIST
      5.8.9.4 Wall-Clock Time Performance of Basil
      5.8.9.5 Performance of Basil for Non-IID Data Distribution using CIFAR100
      5.8.9.6 Performance Comparison Between Basil and Basil+

Chapter 6: Conclusion

References

List of Tables

2.1 Models used for the MNIST and CIFAR10 datasets. Note that SLP, MLP, and CNN represent Single Layer Perceptron, Multiple Layer Perceptron, and Convolutional Neural Network, respectively.
3.1 Tuples of masks that users could use in Example 2.
3.2 The aggregated model update $x_{S_0}$ from group 0 after fully unmasking the model updates of its users.
3.3 A comparison between SecAg [18] and our proposed HeteroSAg. Here, N is the total number of nodes, m is the model size, G is the number of groups, n is the number of users in each group, and $K_g$ is the number of quantization levels.
3.4 User communication cost and the total communication time in Experiment 1.
3.5 Details of the parameters in the architecture of the neural network used in our MNIST experiments.
3.6 Details of the parameters in the architecture of the neural network used in our CIFAR10 experiments.
4.1 The performance of the proposed methods on the 20News dataset on the Albert model.
4.2 Parameter r, which we use to calculate the density of the updates.
4.3 Impact of the density on the performance for the 20News group dataset on Albert. The model is trained for 10 epochs in the centralized setting and for 100 rounds in the other configurations.
4.4 The configurations of LoRA modules (e.g., r) for the Multi-Head (MH), FeedForward (FFN), and Pre-Classification (PC) layers of Albert and DistilBERT.
4.5 Performance of different algorithms for the 20News group dataset on the Albert model under different data heterogeneity. We fix the total number of communication rounds to 150 for the LDA distribution and 450 rounds for the pathological distribution. The number of trainable parameters of the baselines is 0.18 M.
4.6 Performance and training cost of different algorithms for the News Category dataset on the Albert model.
4.7 Performance and training cost of different algorithms for the News Category dataset on the DistilBERT model. We train SLoRA in Stage 2 for only 50 rounds.
4.8 Performance and training cost of different algorithms for the 20News group dataset on the DistilBERT model.
4.9 Impact of update density of PEFT methods for the 20News group dataset on DistilBERT.
4.10 Impact of update density of PEFT methods for the News Category dataset on Albert.
4.11 Impact of update density of PEFT methods for the News Category dataset on DistilBERT.
5.1 Details of the parameters in the architecture of the neural network used in our MNIST experiments.
5.2 Details of the parameters in the architecture of the neural network used in our CIFAR10 experiments.
5.3 The breakdown of the average computation time per node for Basil and UBAR.

List of Figures

1.1 The training process of federated learning.
2.1 Figure (a) illustrates the current formal privacy guarantee of FL with SA protocols and sheds light on the missing privacy guarantee on the aggregated model information leakage, which is studied in this work. Figure (b) gives a preview of the behavior of the privacy leakage through the global aggregated model for a CNN model as a function of the number of users in FL. The privacy leakage follows an O(1/N) decay, as proved in our theoretical bounds.
2.2 The training process in federated learning.
2.3 Impact of the number of users (N) when using FedSGD. Note that we set B = 32 for all users on both the MNIST and CIFAR10 datasets. We normalize the MI by the entropy of a single data batch (i.e., 32 × 567 for MNIST and 32 × 1403 for CIFAR10).
2.4 Impact of the number of users (N) when using FedAvg. Note that we set E = 1 and B = 32 for all users on both the MNIST and CIFAR10 datasets. We normalize the MI by the entropy of the whole local training dataset (i.e., 1200 × 567 for MNIST and 1000 × 1403 for CIFAR10).
2.5 Impact of the number of users (N) when using FedProx. Note that we set E = 1 and B = 32 for all users on both the MNIST and CIFAR10 datasets. We normalize the MI by the entropy of the whole local training dataset (i.e., 1200 × 567 for MNIST and 1000 × 1403 for CIFAR10).
2.6 Impact of batch size (B) when using FedSGD. The MI is normalized by the entropy of a data batch, which is proportional to the batch size B (i.e., B × 567 for MNIST and B × 1403 for CIFAR10).
2.7 Impact of batch size (B) when using FedAvg. The MI is normalized by the entropy of a user's local dataset, which is a constant (i.e., 1200 × 567 for MNIST and 1000 × 1403 for CIFAR10).
2.8 Impact of batch size (B) when using FedProx. The MI is normalized by the entropy of a user's local dataset, which is a constant (i.e., 1200 × 567 for MNIST and 1000 × 1403 for CIFAR10).
2.9 Accumulative MI privacy leakage on the MNIST and CIFAR10 datasets. Note that we normalize the MI by the entropy of each user's local dataset, which will not change with T. We use the linear model for both the MNIST and CIFAR10 datasets.
2.10 Accumulative MI privacy leakage vs. model accuracy of different FL algorithms. Note that we use a linear model as a case study and normalize the MI by the entropy of each user's local dataset.
2.11 Impact of the local training rounds (E) when using FedAvg. We normalize the MI by the entropy of each user's local dataset, and we consider N ∈ {10, 20}.
2.12 Impact of the local training rounds (E) when using FedProx. We normalize the MI by the entropy of each user's local dataset, and we consider N ∈ {10, 20}.
2.13 Impact of user heterogeneity when using FedAvg on non-IID CIFAR10. Note that α = ∞ means that the user data distributions are identical (IID users), and the MI is normalized by the entropy of a user's local dataset.
2.14 Impact of user heterogeneity when using FedAvg on FEMNIST. Note that the MI is normalized by the entropy of the target user's local dataset, which is 678 × 176.
2.15 Impact of varying the number of users N on the reconstructed image quality (PSNR) of the DLG attack and on the MI privacy leakage.
2.16 Effects of using DP noise together with SA on MI privacy leakage and model accuracy. Note that we add DP noise to the aggregated model updates after SA.
2.17 Heatmap of the absolute values of sampled updates from clients 1, 2, and 3 in the counterexample. $x_4$ and $x'_4$ can be distinguished even after adding the aggregated noise from $\sum_{i=1}^{3} x_i$.
3.1 Segment selection matrix B for G = 5 groups.
3.2 A system with N users partitioned into G = 5 groups, with n users in each group. Each user holds a quantized local model update $\bar{x}_i$, $i \in [N]$. The segment selection and grouping is completed by using the SS matrix B.
3.3 A system with N users partitioned into G = 5 groups, with n users in each group. Each user holds a local model update $\bar{x}_i$, $i \in [N]$. The segment selection and grouping is completed by using the SS matrix B. Segments with the same color will be encoded and decoded together, independently of all other segments.
3.4 The performance of HeteroSAg under different quantization schemes for the non-IID setting.
3.5 The performance of HeteroSAg and FedAvg under three different attacks for the IID setting.
3.6 Matrix B for G = 6.
3.7 Matrix Be for G = 3 groups with the number of subgroups in each group $L_0 = 1$, $L_1 = 2$, and $L_2 = 2$, respectively.
3.8 The performance of HeteroSAg under different quantization schemes for the non-IID setting (N = 100).
3.9 The label distribution over the N = 100 users using the Dirichlet distribution with three different α parameters.
3.10 The performance of HeteroSAg and FedAvg under the Gaussian attack and three different data distributions.
3.11 The label distribution among N = 300 users using the Dirichlet distribution with different α parameters.
3.12 The performance of HeteroSAg and FedAvg under Gaussian and sign-flip attacks with three different data distributions.
4.1 The effect of client data distribution on the performance of FFT vs. PEFT. While heterogeneity adversely affects both, parameter-efficient methods are more vulnerable and experience larger accuracy drops in more heterogeneous settings.
4.2 LoRA block.
4.3 Impact of the number of FL rounds in Stage 1 on the final performance of FLoRA for the 20News group dataset on Albert.
4.4 Performance of PEFT methods for different data distributions for the 20News group dataset. The model is trained for 10 epochs in the centralized setting and for 100 rounds in the other configurations.
4.5 Overview of SLoRA: First, the server initializes a mask, and clients only update the parameters in the mask. Then, the updates are decomposed into LoRA blocks for the initialization in Stage 2.
4.6 Performance of SLoRA on the 20News dataset for fixed training rounds on the Albert and DistilBERT models.
4.7 Performance of PEFT methods for different data distributions for the News Category dataset.
5.1 A highlight of the performance benefits of Basil, compared with the state of the art (UBAR) [45], for CIFAR10 under different settings: In Figure 5.1(a), we can see the superior performance of Basil over UBAR, with a ∼16% improvement of the test accuracy under the Gaussian attack in the IID setting. Figure 5.1(b) demonstrates that the test accuracy in the non-IID setting, when using sequential training over the ring topology, can be increased by up to ∼10% in the absence of Byzantine nodes, when each node shares only 5% of its local data anonymously with other nodes. Figure 5.1(c) shows that ACDS on top of Basil not only provides Byzantine robustness to the Gaussian attack in the non-IID setting, but also gives higher performance than UBAR in the IID setting. Furthermore, UBAR for the non-IID setting completely fails in the presence of this attack. For further details, please refer to Section 5.6.
5.2 Basil with N = 6 nodes, where node 3 and node 6 are Byzantine nodes. Node 1, the current active benign node in the k-th round, selects the one model, out of its 3 stored models, that gives the lowest loss when evaluated on a mini-batch from its local dataset $\mathcal{Z}_1$. After that, node 1 updates the selected model by using the same mini-batch according to equation 5.2 before multicasting it to the next 3 clockwise neighbors.
5.3 The ACDS algorithm within group g with n = 4 users, where each node $i_g \in \mathcal{N}_g$ has two batches $\{c^1_{i_g}, c^2_{i_g}\}$. Each round starts from node $1_g$ and continues in a clockwise order. The dummy round is introduced to make sure that node $2_g$ and node $3_g$ get their missing batches $\{c^2_{3_g}, c^2_{4_g}\}$ and $c^2_{4_g}$, respectively. Here, $\hat{c}_{i_g}$ represents the dummy batch, with the same size as the other batches, used by node $i_g \in \mathcal{N}_g$. This dummy batch could be a batch of public data that shares the same features used in the learning task.
5.4 Illustrating the performance of Basil using the CIFAR10 dataset under the IID data distribution setting.
5.5 Illustrating the performance of Basil using the CIFAR10 dataset under the non-IID data distribution setting.
5.6 Illustrating the performance of Basil compared with UBAR for CIFAR10 under the non-IID data distribution setting with α = 5% data sharing.
5.7 Ablation studies. Here, N, b, S, and α are the total nodes, Byzantine nodes, connectivity, and fraction of shared data. For non-IID data, the Gaussian attack is considered, while for the others, the IID hidden attack is used, with the same NNs as in Section 5.8.9.1.
5.8 The scalability gain of Basil+ in the presence of Byzantine nodes as the number of nodes increases. Here, G is the number of groups, where each group has n = 25 nodes.
5.9 Illustrating the performance gains of Basil+ over UBAR and R-plain+ for the CIFAR10 dataset under different numbers of nodes N.
5.10 Illustrating the results for MNIST under the IID data distribution setting.
5.11 Illustrating the results for MNIST under the non-IID data distribution setting.
5.12 Illustrating the performance of Basil compared with UBAR for MNIST under the non-IID data distribution setting with α = 5% data sharing.
5.13 Illustrating the performance of Basil using the CIFAR10 dataset under the IID data distribution setting with respect to the training time.
5.14 The performance of Basil under different data sharing scenarios in the presence of the Gaussian attack when the data distribution at the nodes is non-IID. Here, γ represents the fraction of a node's 5 available subclasses whose data the node considers non-sensitive.
5.15 Illustrating the performance of Basil and Basil+ for the CIFAR10 dataset.

Abstract

Machine learning (ML) has become a foundational element of modern intelligent systems, fueling breakthroughs in fields like computer vision and natural language processing. The remarkable performance of contemporary models, such as transformers and convolutional neural networks (CNNs), has significantly narrowed the gap between human and machine capabilities, enabling machines to perform tasks that were once considered exclusively human. Traditionally, training these models has required the centralization of data, often involving sensitive user information. However, increasing privacy concerns and stringent regulations have made it increasingly difficult to centralize users' datasets for ML training. In response, Federated Learning (FL) has emerged as a compelling alternative. FL aims to develop a global model by leveraging data distributed across edge devices, all without requiring data centralization. While promising, FL still faces challenges related to privacy and raises new issues due to its distributed nature, such as robustness and the limited computational resources of edge devices. This thesis explores these challenges and proposes innovative solutions to enhance the effectiveness and scalability of FL.

First, the thesis addresses the privacy limitations of Secure Aggregation (SA), a protocol integrated with FL to protect users' data against gradient inversion attacks. While SA prevents the server from accessing individual users' data by encrypting model updates, it remains unclear how much information about an individual user's dataset might still be leaked through the aggregated model. Additionally, SA protocols face challenges in integrating state-of-the-art defense strategies against malicious users, such as the coordinate-wise median, and in enabling heterogeneous quantization when the communication resources at the edge users are different. To address these issues, this thesis advances the theoretical understanding of SA by deriving information-theoretic upper bounds on the potential leakage of individual user data from aggregated model updates. Furthermore, it proposes an algorithm that integrates SA with distance-based defense mechanisms, thereby enhancing both privacy and robustness. Additionally, it introduces an algorithm that allows secure model aggregation with heterogeneous quantization, accommodating the heterogeneous resources at the edge users and thereby optimizing the trade-off between training accuracy and communication time.

A second critical challenge in FL is the limited computational, memory, and communication resources of edge devices.
These constraints become particularly problematic when training large models or fine-tuning foundation models, such as those used in natural language processing. We explore the efficiency of parameter-efficient fine-tuning (PEFT) methods within the FL setting, focusing on the impact of data heterogeneity on training performance. To mitigate the performance gap between Full Fine-Tuning (FFT) and PEFT under heterogeneous data distributions, we introduce novel variants of LoRA that employ a data-driven initialization technique. This approach reduces training and communication costs while maintaining high model performance.

The third challenge addressed in this thesis is robustness, particularly in the face of Byzantine faults, where some users may send arbitrary or malicious updates that can degrade the model's performance. While several Byzantine-robust algorithms exist for server-based FL, there has been limited work on achieving Byzantine resiliency in server-less FL environments. This thesis proposes Basil, a novel algorithm designed for server-less FL that is both fast and computationally efficient. Basil operates over a logical ring, using a sequential, memory-assisted, and performance-based approach to filter out Byzantine users, ensuring robust and reliable training.

Overall, this thesis provides comprehensive solutions to key challenges in FL, advancing the state of the art in privacy, robustness, and resource efficiency in distributed machine learning.

Chapter 1
Introduction

With the popularity of smartphones and personal gadgets, valuable user data is distributed more than ever. This valuable data can help different companies and service providers improve their products and make them more efficient and personalized [25, 30]. However, privacy is a growing and critical concern that cannot be ignored. Although users care about the performance of their applications, they do not want their private data, which often contains sensitive information (e.g., personal images/texts), to be accessed by others. Also, regulations [40] make it extremely difficult to pool users' datasets for a centralized ML training procedure. As a result, Federated Learning (FL) [79, 62] has emerged as a promising framework for privacy-preserving distributed machine learning that does not require any data collection; instead, users locally perform model updates using their private data.

The training procedure in FL, as depicted in Figure 1.1, is typically coordinated through a central server that maintains a global model, which is frequently updated locally by the users over a number of iterations.

[Figure 1.1: The training process of federated learning.]

In each training iteration, the server first sends the current global model $\theta^t$ to the users. Next, the N users update the global model by training it on their private datasets and then push their local model updates $x_i^t$, for $i \in [N]$, back to the server. Finally, the server updates the global model by aggregating the received local model updates from the users.

Although FL has already proven beneficial in various domains [56, 114], such as next-word prediction and healthcare, it still faces critical challenges related to privacy, robustness, and the limited resources of edge devices. In the following sections, we discuss each of these challenges in detail.

1.1 Privacy in Federated Learning

In the training process of FL, users can achieve the simplest notion of privacy by keeping their data on-device and never sharing it with the server.
Instead, they only share their local model updates. However, recent research (e.g., [126, 42, 117]) has demonstrated that simply keeping data on-device is not sufficient to ensure privacy, as the shared model updates can still reveal substantial information about the local datasets. Specifically, these works have empirically demonstrated that the private training data of the users can be reconstructed from the local model updates through what is known as the model inversion attack [126].

To prevent information leakage from the individual models shared during the training process of federated learning (FL), Secure Aggregation (SA) protocols [12, 102, 54, 123] have been developed. SA enables the server to aggregate local model updates from multiple users without observing any of their updates in clear form. In each training round, users encrypt their local model updates before sending them to the server for aggregation. SA protocols formally guarantee that: 1) neither the server nor other users can gain any information about an individual user's clear model update from the encrypted update, in the information-theoretic sense; and 2) the server can correctly decode the aggregated model even in the presence of user dropout. In other words, secure aggregation ensures that only the aggregated model update is revealed to the server.

Despite the promise of federated learning with secure aggregation to protect users' data from privacy attacks, such as model inversion attacks, it still faces significant limitations. These limitations exist both in the algorithmic aspects and in the current theoretical guarantees, and they are outlined in the following three subsections.

1.1.1 Incomplete Privacy Guarantee for Secure Aggregation in FL

SA is designed to ensure that the server only learns the aggregated model in each round. However, even with these SA guarantees on individual updates, it is not yet fully understood how much privacy is truly guaranteed in FL using SA. The aggregated model update from all users may still leak information about an individual user's local dataset. In this thesis, we derive information-theoretic upper bounds on the amount of information that the aggregated model update leaks about any single user's dataset [38]. Additionally, we analyze the impact of system parameters, such as the number of users, on the extent of this leakage.

1.1.2 Integrating Robust Aggregation Methods with SA

Many robust training strategies have been proposed for FL [91, 111, 112, 14, 116, 122] to mitigate Byzantine users [67], who may send arbitrary or malicious updates. These strategies use robust aggregation rules, such as the coordinate-wise median [116], to combine model updates from all users, ensuring that malicious users do not affect the training process. However, secure aggregation protocols pose a challenge in integrating existing state-of-the-art defense strategies [14, 116, 91, 111, 112] against Byzantine users. This is because in SA the server receives only masked model updates from each user, while the success of these strategies relies on having access to individual clear model updates.

In this thesis, we propose an algorithm [37] that integrates Secure Aggregation (SA) with distance-based defense mechanisms, such as the coordinate-wise median, to achieve both privacy and robustness. The key innovation of our algorithm lies in partitioning the users into groups and dividing their local model updates into segments. Instead of applying the SA protocol to the entire local model update vector, it is applied to these segments with specific coordination among users. At the server side, the coordinate-wise median is then applied to the aggregated segments.
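To make the robust-aggregation building block concrete, the following is a minimal NumPy sketch of the coordinate-wise median rule [116], applied here directly to clear model updates; it deliberately omits the SA masking and segment-level coordination of our algorithm, and all sizes and values are illustrative placeholders.

```python
import numpy as np

def coordinate_wise_median(updates: np.ndarray) -> np.ndarray:
    """Robust aggregation: take the median of every model coordinate
    across users, so a minority of arbitrary (Byzantine) updates cannot
    pull any coordinate arbitrarily far from the honest majority."""
    # updates has shape (num_users, model_dim)
    return np.median(updates, axis=0)

rng = np.random.default_rng(0)
honest = rng.normal(0.0, 0.1, size=(8, 5))   # 8 honest updates near zero
byzantine = np.full((2, 5), 1e6)             # 2 malicious outlier updates
updates = np.vstack([honest, byzantine])

print(updates.mean(axis=0))                  # plain averaging is destroyed
print(coordinate_wise_median(updates))       # median stays near the honest updates
```

In our algorithm, the same rule is applied not to individual clear updates (which SA hides from the server) but to the securely aggregated segments.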
1.1.3 Adapting to Devices' Resource Heterogeneity in SA

State-of-the-art secure aggregation protocols with additive masking [12, 102, 54, 123] require all users to quantize their model updates to the same level of quantization to ensure correct decoding [37]. However, this uniform quantization requirement, imposed regardless of the users' transmission rates or network capabilities (e.g., 3G, 4G, 5G, Wi-Fi), can significantly degrade training performance due to a lack of adaptation to network speed and quality fluctuations over time. Specifically, enforcing a low-level quantizer across all users reduces communication time but compromises test accuracy. Conversely, a high-level quantizer enhances test accuracy but increases communication time.

In this thesis, we propose HeteroSAg [37], an algorithm that allows secure model aggregation while using heterogeneous quantization. HeteroSAg enables the edge users to adjust their quantization in proportion to their available communication resources, thereby achieving a significantly better trade-off between training accuracy and communication time. The core idea behind HeteroSAg is similar to that of SA with robust aggregation: it involves partitioning users and applying secure aggregation at the segment level.

1.2 Limited Training Resources in Federated Learning

The limited computation, memory, and communication resources of edge devices (e.g., IoT devices) are another significant consideration in the FL training setting. These resource constraints create critical bottlenecks for performing distributed training of large models [57, 25], which requires training and communicating the models over multiple rounds with the server. Unlike centralized training, where powerful servers can handle extensive computations, edge devices must operate within their limited capacities while contributing to the training process. This limitation affects the efficiency and effectiveness of model training, requiring specialized strategies to optimize resource usage and ensure scalability.

This challenge is further compounded when fine-tuning foundation models, such as language models [19], which are designed to be versatile across various applications and support a wide range of tasks [108]. As these models can be fine-tuned for different tasks, the memory requirements increase linearly with the number of tasks. This growing demand for memory can exacerbate the difficulties posed by limited edge device resources, potentially rendering the training process infeasible.

In this thesis, we study the problem of efficient fine-tuning of language models in the FL setting with limited communication, computation, and storage capabilities. We first explore the opportunities and challenges of applying existing parameter-efficient fine-tuning (PEFT) [6] methods in FL under different settings of data heterogeneity for various models and datasets. Specifically, our investigations reveal that the gap between Full Fine-Tuning (FFT) and PEFT increases with users' data heterogeneity. To resolve this problem, we propose two simple and effective variants of LoRA designed for FL, which use a novel data-driven initialization technique to achieve parameter efficiency, reduce training and communication costs, and close the gap between PEFT and FFT.
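For reference, the sketch below shows a generic LoRA block of the kind these variants build on: a frozen pretrained linear layer augmented with a trainable low-rank update, so that only the small matrices A and B are trained and communicated. This is a minimal PyTorch illustration of standard LoRA, not the proposed variants; the class name, layer sizes, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    output = W x + (alpha / r) * B (A x), where A has shape (r, d_in)
    and B has shape (d_out, r), i.e., r * (d_in + d_out) trainable
    parameters instead of d_in * d_out."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze pretrained W
        self.A = nn.Parameter(0.01 * torch.randn(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r                         # B = 0 => no change at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)           # drop-in replacement
out = layer(torch.randn(2, 768))
```

The variants proposed in Chapter 4 keep this block structure but replace the random/zero initialization of A and B with a data-driven one (e.g., obtained by decomposing brief full or sparse fine-tuning updates via SVD, as in Section 4.8.1), which is what closes the gap under data heterogeneity.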
1.3 Robustness in Federated Learning

Another critical challenge for federated learning (FL) arises from its distributed nature, where the server does not control the training process at the edge users. This decentralization makes FL vulnerable to performance degradation due to system failures and malicious nodes [57]. Specifically, some users can become faulty due to software bugs or hardware components that behave arbitrarily, or may even be hacked during training, sending arbitrary or malicious updates and thus severely degrading the overall convergence performance. Such faults, where client nodes arbitrarily deviate from the agreed-upon protocol, are called Byzantine faults [67].

Many Byzantine-robust strategies have been proposed for FL [91, 111, 112, 14, 116, 37, 99, 87, 122, 88, 119]. However, despite the abundance of Byzantine-robust algorithms for server-based FL, there has been limited work on achieving Byzantine resiliency in the server-less FL setup [45, 115]. Server-less training relies solely on on-device computations on edge nodes and peer-to-peer communications, eliminating the need for a central server (e.g., a central aggregator), which can be a single point of failure and vulnerable to hacking [76, 57]. The few robust training algorithms [45, 115] proposed for server-less FL face challenges related to efficiency and robustness. Specifically, these algorithms rely on parallel training over a random graph, requiring all nodes to be continuously active and engaged in training throughout the entire process. This requirement is unsuitable for resource-constrained edge devices, as the perpetual activity could drain their limited resources. Furthermore, these algorithms are only effective in mitigating Byzantine nodes under homogeneous data distribution across users and fail under heterogeneous distributions. In this thesis, we propose Basil [36], a fast and computationally efficient Byzantine-robust algorithm for server-less FL, which leverages a novel sequential, memory-assisted, and performance-based criterion for training over a logical ring while filtering out the Byzantine users.

Thesis structure. In the remainder of this thesis, I will delve into the work focused on enhancing the privacy, robustness, and efficiency of Federated Learning (FL). In Chapter 2 and Chapter 3, I will propose improvements in secure aggregation (SA) for FL, addressing both algorithmic and theoretical perspectives. Specifically, in Chapter 2, I will quantify the information leakage from the aggregated model in secure aggregation scenarios. Chapter 3 will concentrate on advancing the algorithmic aspects of SA by introducing methods that integrate SA with robust aggregation techniques and heterogeneous quantization settings. In Chapter 4, I will present an efficient training algorithm for large models in the FL context, designed to operate within the constraints of limited communication, computational, and memory resources. Chapter 5 will explore the robustness of FL against Byzantine users. Finally, I will conclude the thesis in Chapter 6.

Chapter 2
How Much Privacy Does Federated Learning with Secure Aggregation Guarantee?

Federated learning (FL) has attracted growing interest for enabling privacy-preserving machine learning on data stored at multiple users while avoiding moving the data off-device. However, while data never leaves users' devices, privacy still cannot be guaranteed, since significant computations on users' training data are shared in the form of trained local models. These local models have recently been shown to pose a substantial privacy threat through different privacy attacks, such as model inversion attacks. As a remedy, Secure Aggregation (SA) has been developed as a framework to preserve privacy in FL by guaranteeing that the server can only learn the global aggregated model update but not the individual model updates. While SA ensures no additional information is leaked about the individual model update beyond the aggregated model update, there are no formal guarantees on how much privacy FL with SA can actually offer, as information about the individual dataset can still potentially leak through the aggregated model computed at the server. In this work [38], we perform a first analysis of the formal privacy guarantees for FL with SA. Specifically, we use Mutual Information (MI) as a quantification metric and derive upper bounds on how much information about each user's dataset can leak through the aggregated model update. When using the FedSGD aggregation algorithm, our theoretical bounds show that the amount of privacy leakage reduces linearly with the number of users participating in FL with SA. To validate our theoretical bounds, we use an MI Neural Estimator to empirically evaluate the privacy leakage under different FL setups on both the MNIST and CIFAR10 datasets. Our experiments verify our theoretical bounds for FedSGD, showing a reduction in privacy leakage as the number of users and the local batch size grow, and an increase in privacy leakage as the number of training rounds increases. We also observe similar dependencies for the FedAvg and FedProx protocols.

2.1 Introduction

Federated learning (FL) has recently gained significant interest as it enables collaboratively training machine learning models over locally private data across multiple users without requiring the users to share their private local data with a central server [18, 57, 78]. The training procedure in FL is typically coordinated through a central server that maintains a global model, which is frequently updated locally by the users over a number of iterations. In each training iteration, the server first sends the current global model to the users. Next, the users update the global model by training it on their private datasets and then push their local model updates back to the server. Finally, the server updates the global model by aggregating the received local model updates from the users.

In the training process of FL, users can achieve the simplest notion of privacy by keeping their data on-device and never sharing it with the server; instead, they only share their local model updates. However, it has been shown recently in different works (e.g., [126, 42, 117]) that this alone is not sufficient to ensure privacy, as the shared model updates can still reveal substantial information about the local datasets. Specifically, these works have empirically demonstrated that the private training data of the users can be reconstructed from the local model updates through what is known as the model inversion attack.

To prevent information leakage from individual models, Secure Aggregation (SA) protocols have been employed to enable the server to aggregate local model updates from multiple users without having access to any clear model updates.

[Figure 2.1: (a) Current and missing privacy guarantees for FL with secure aggregation; (b) privacy leakage vs. number of users in FL with secure aggregation. Figure (a) illustrates the current formal privacy guarantee of FL with SA protocols and sheds light on the missing privacy guarantee on the aggregated model information leakage, which is studied in this work. Figure (b) gives a preview of the behavior of the privacy leakage through the global aggregated model for a CNN model as a function of the number of users in FL. The privacy leakage follows an O(1/N) decay, as proved in our theoretical bounds.]

Thus, SA protocols formally guarantee that: 1) both the server and other users have no information about any user's clear model update from the encrypted update, in the information-theoretic sense; and 2) the server only learns the aggregated model. As shown in Fig. 2.1a, in each training round, users encrypt their local model updates before sending them to the server for aggregation. Note that these SA guarantees allow SA to be used as a supporting protocol for other privacy-preserving approaches, such as differential privacy [32]. In particular, these approaches can benefit from SA by reducing the amount of noise needed to achieve a target privacy level (hence improving the model accuracy), as demonstrated in different works (e.g., [107, 55]). However, even with these SA guarantees on individual updates, it is not yet fully understood how much privacy is guaranteed in FL using SA, since the aggregated model update may still leak information about an individual user's local dataset. This observation leads us to the central question that this work addresses: How much information does the aggregated model leak about the local dataset of an individual user?

In this chapter, we tackle this question by studying how much privacy can be guaranteed by using FL with SA protocols. We highlight that this work does not propose any new approaches to tackle privacy leakage but instead analyzes the privacy guarantees offered by state-of-the-art SA protocols, where updates from other users can be used to hide the contribution of any individual user. An understanding of this privacy guarantee may potentially assist other approaches such as differential privacy, such that instead of introducing novel noise to protect a user's model update, the randomized algorithm can add noise only to supplement the noise from other users' updates to the target privacy level. We summarize the contributions of this work as follows.

Contributions. We provide information-theoretic upper bounds on the amount of information that the aggregated model update (using FedSGD [18]) leaks about any single user's dataset under an honest-but-curious threat model, where the server and all users follow the protocol honestly, but can collude to learn information about a user outside their collusion set [38]. Our derived upper bounds show that SA protocols exhibit a more favorable behavior as we increase the number of honest users participating in the protocol at each round. We also show that the information leakage from the aggregated model decreases with increasing batch size, which has been empirically demonstrated in different recent works on model inversion attacks (e.g., [126, 42, 117]), where increasing the batch size limits the attack's success rate.
Another interesting conclusion from our theoretical bounds is that increasing the model size does not have a linear impact on the privacy leakage; instead, the leakage depends linearly on the rank of the covariance matrix of the gradient vector at each user. In our empirical evaluation, we conduct extensive experiments on the CIFAR10 [64] and MNIST [70] datasets in different FL settings. In these experiments, we estimate the privacy leakage using a mutual information neural estimator [10] and evaluate the dependency of the leakage on different FL system parameters: number of users, local batch size and model size. Our experiments show that the privacy leakage empirically follows similar dependencies to what is proven in our theoretical analysis. Notably, as the number of users in the FL system increases to 20, the privacy leakage (normalized by the entropy of a data batch) drops below 5% when training a CNN network on the CIFAR10 dataset (see Fig. 2.1b). We also show empirically that the dependencies observed theoretically and empirically for FedSGD extend to the FedAvg [18] FL protocol, which performs multiple local training epochs at the users.

2.2 Preliminaries

We start by discussing the basic federated learning model, before introducing the secure aggregation protocol and its state-of-the-art guarantees.

2.2.1 Basic Setting of Federated Learning

Federated learning is a distributed training framework [78] for machine learning, in which a set of users N = [N] (|N| = N), each with its own local dataset D_i (for all i ∈ [N]), collaboratively train a d-dimensional machine learning model parameterized by θ ∈ R^d, based on all their training data samples. For simplicity, we assume that users have equal-sized datasets, i.e., |D_i| = D for all i ∈ [N]. The typical training goal in FL can be formally represented by the following optimization problem:

\theta^* = \arg\min_{\theta \in \mathbb{R}^d} \left[ C(\theta) := \frac{1}{N} \sum_{i=1}^{N} C_i(\theta) \right], \quad (2.1)

where θ is the optimization variable, C(θ) is the global objective function, and C_i(θ) is the local loss function of user i, given by

C_i(\theta) = \frac{1}{D} \sum_{(x,y) \in D_i} \ell_i(\theta, (x, y)), \quad (2.2)

where ℓ_i(θ, (x, y)) ∈ R denotes the loss function at a given data point (x, y) ∈ D_i. The dataset D_i at user i ∈ [N] is sampled from a distribution P_i.

Figure 2.2: The training process in federated learning.

To solve the optimization problem in equation 2.1, an iterative training procedure is performed between the server and distributed users, as illustrated in Fig. 2.2. Specifically, at iteration t, the server first sends the current global model parameters, θ^{(t)}, to the users. User i ∈ [N] then computes its model update x_i^{(t)} and sends it to the server. After that, the model updates of the N users are aggregated by the server to update the global model parameters into θ^{(t+1)} for the next round according to

\theta^{(t+1)} = \theta^{(t)} - \eta^{(t)} \frac{1}{N} \sum_{i=1}^{N} x_i^{(t)}. \quad (2.3)

There are two common protocols for computing the model update x_i: FedSGD and FedAvg [78]. Specifically, in FedSGD, each user uses a data batch B_i^{(t)} of size B, sampled uniformly at random from its local dataset D_i, to compute the model update as follows:

x_i^{(t)} = \frac{1}{B} \sum_{b \in B_i^{(t)}} g_i(\theta^{(t)}, b), \quad (2.4)

where g_i(θ^{(t)}, b) is the stochastic estimate of the gradient ∇C_i(θ^{(t)}) of the local loss function C_i of user i, computed based on a random sample b (corresponding to (x_b, y_b)) drawn uniformly from D_i without replacement.
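To make equations 2.3 and 2.4 concrete, the following minimal NumPy sketch implements one FedSGD round. The per-sample gradient oracle per_sample_grad and the data layout are illustrative assumptions, not part of the protocol specification.

```python
import numpy as np

def local_update(theta, batch, per_sample_grad):
    """Equation 2.4: x_i = (1/B) * sum over the batch of g_i(theta, b)."""
    return np.mean([per_sample_grad(theta, b) for b in batch], axis=0)

def fedsgd_round(theta, user_batches, per_sample_grad, lr):
    """Equation 2.3: the server averages the N local updates and takes a step."""
    updates = [local_update(theta, batch, per_sample_grad) for batch in user_batches]
    return theta - lr * np.mean(updates, axis=0)
```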
In FedAvg, each user runs E complete local training epochs over its local dataset D_i to get its model update x_i^{(t)}. Specifically, during each local epoch, each user uses all of its mini-batches sampled from D_i to perform multiple stochastic gradient descent steps.

2.2.2 Secure Aggregation Protocols for Federated Learning

Recent works (e.g., [126, 42, 117]) have empirically shown that some of the local training data of user i can be reconstructed from the local model update x_i, for i ∈ [N]. To prevent such data leakage, different SA protocols [5, 107, 29, 113, 11, 98, 53, 123, 103, 37, 81, 100] have been proposed to provide a privacy-preserving FL setting without sacrificing the training performance. In the following, we discuss the threat model used in these SA protocols.

2.2.2.1 Threat Model in Secure Aggregation for Federated Learning

Most SA protocols consider the honest-but-curious model [18], where the goal is to uncover users' data. In this threat model, the server and users honestly follow the SA protocol as specified. In particular, they will not modify their model architectures to better suit their attack, nor send malicious model updates that do not represent the actually learned model. However, the server and the participating users are assumed to be curious and try to extract any useful information about the training data of any particular user. The extraction of the information is done by storing and analyzing the different data received during the execution of the protocol. On the other hand, the threat model in these SA protocols assumes that the server can collude with any subset of users T ⊂ [N] by jointly sharing any data that was used during the execution of the protocol (including their clear model updates x_i, for all i ∈ T) that could help in breaching the data privacy of any target user i ∈ [N] \ T. Similarly, this threat model also assumes that users can collude with each other to get information about the training data of other users.

2.2.2.2 Secure Aggregation Guarantees

In general, SA protocols that rely on different encryption techniques, such as homomorphic encryption [5, 107, 29, 113] and secure multi-party computing (MPC) [11, 98, 53, 123, 103, 37, 81, 100], are all similar in the encryption procedure, in which each user encrypts its own model update y_i^{(t)} = Enc(x_i^{(t)}) before sending it to the server. This encryption is done such that these protocols achieve: 1) correct decoding of the aggregated model under users' dropout; 2) privacy for the local model updates of the users from the encrypted updates. In the following, we formally describe each of these guarantees.

Correct decoding. The encryption guarantees correct decoding of the aggregated model of the surviving users even if a subset U ⊂ [N] of the users dropped out during the protocol execution. In other words, the server should be able to decode

\mathrm{Dec}\left( \sum_{i \in V} y_i^{(t)} \right) = \sum_{i \in V} x_i^{(t)}, \quad (2.5)

where V is the set of surviving users (i.e., U ∪ V = [N] and U ∩ V = ∅).

Privacy guarantee. Under the collusion between the server and any strict subset of users T ⊂ [N], we have the following:

I\left( \{y_i^{(t)}\}_{i \in [N]} ; \{x_i^{(t)}\}_{i \in [N]} \,\Big|\, \sum_{i=1}^{N} x_i^{(t)}, z_T \right) = 0, \quad (2.6)

where z_T is the collection of information at the users in T. In other words, equation 2.6 guarantees that, under a given subset of users T colluding with the server, the encrypted model updates {y_i^{(t)}}_{i∈[N]} leak no information about the model updates {x_i^{(t)}}_{i∈[N]} beyond the aggregated model \sum_{i=1}^{N} x_i^{(t)}.
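As an illustration of how an SA protocol can realize a guarantee of this flavor, the toy sketch below uses the classic pairwise additive-masking idea employed by many SA protocols. It works over the reals and ignores finite-field arithmetic and dropout recovery, so it is only a demonstration that the masks cancel in the sum; it is not any specific protocol from the works cited above.

```python
import numpy as np

def pairwise_masked_updates(updates, pair_seeds):
    """Each user i sends x_i + sum_{j>i} PRG(s_ij) - sum_{j<i} PRG(s_ji);
    pair_seeds[i][j] (for i < j) is a seed shared by users i and j.
    The masks cancel pairwise, so summing the masked vectors yields sum_i x_i."""
    n, d = len(updates), len(updates[0])
    prg = lambda seed: np.random.default_rng(seed).standard_normal(d)
    masked = []
    for i in range(n):
        y = np.array(updates[i], dtype=float)
        for j in range(n):
            if j > i:
                y += prg(pair_seeds[i][j])   # add mask shared with a later user
            elif j < i:
                y -= prg(pair_seeds[j][i])   # subtract mask shared with an earlier user
        masked.append(y)
    return masked

# The server only sees the masked vectors, yet their sum equals the true aggregate:
# np.allclose(sum(pairwise_masked_updates(xs, seeds)), sum(xs))  -> True
```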
We note that the upper bound on the size of the colluding set T such that equation 2.6 is always guaranteed has been analyzed in the different SA protocols. Assuming that |T| ≤ N/2 is widely used in most of the works (e.g., [103, 100]).

Remark 2.1. Recently, there have also been some works that enable secure model aggregation by using Trusted Execution Environments (TEE) such as Intel SGX (e.g., [66, 120]). SGX is a hardware-based security mechanism to protect applications running on a remote server. These TEE-based works are also designed to give the same guarantee as in equation 2.6.

In the following, we formally highlight the weakness of the current privacy guarantee discussed in equation 2.6.

2.2.2.3 Our Contribution: Guarantees on Privacy Leakage from the Aggregated Model

Different SA protocols guarantee that the server does not learn any information about the local model update x_i^{(t)} of any user i from the received encrypted updates {y_i^{(t)}}_{i∈N}, beyond the aggregated model, as formally shown in equation 2.6. However, it is not clear how much information the aggregated model update itself leaks about a single user's local dataset D_i. In this work, we fill this gap by theoretically analyzing the following term:

I_{\text{priv/data}} = \max_{i \in [N]} I\left( D_i ; \left\{ \frac{1}{N} \sum_{i=1}^{N} x_i^{(t)} \right\}_{t \in [T]} \right). \quad (2.7)

The term in equation 2.7 represents how much information the aggregated model over T global training rounds could leak about the private data D_i of any user i ∈ [N]. In the following section, we theoretically study this term and discuss how it is impacted by the different FL system parameters such as model size, number of users, etc. In Section 2.5, we support our theoretical findings by empirically evaluating I_priv/data on real-world datasets and different neural network architectures.

2.3 Theoretical Privacy Guarantees of FL with Secure Aggregation

In this section, we theoretically quantify the privacy leakage in FL when using secure aggregation with the FedSGD protocol.

2.3.1 Main Results

For clarity, we first state our main results under the honest-but-curious threat model discussed in Section 2.2.2.1, while assuming that there is no collusion between the server and users. We also assume that there is no user dropout. Later, in Section 2.3.3, we discuss the general result with user dropout and collusion with the server. Our central result in this section characterizes the privacy leakage in terms of mutual information for a single round of FedSGD, which for round t is defined as

I_{\text{priv}}^{(t)} = \max_{i \in [N]} I\left( x_i^{(t)} ; \sum_{i=1}^{N} x_i^{(t)} \,\Big|\, \left\{ \sum_{i=1}^{N} x_i^{(k)} \right\}_{k \in [t-1]} \right), \quad (2.8)

and then extends the privacy leakage bound to multiple rounds. Before stating our main result in Theorem 2.1 below, we first define two key properties of random vectors that will be used in stating our theorem, and formally state our operational assumptions.

Definition 1 (Independent under whitening). We say that a random vector v with mean μ_v and non-singular covariance matrix K_v is independent under whitening if the whitened vector \hat{v} = K_v^{-1/2}(v - μ_v) is composed of independent random variables.

Definition 2 (Uniformly σ-log concave). A random vector v with covariance K_v is uniformly σ-log concave if it has a probability density function e^{-φ(v)} satisfying ∇²φ(v) ⪰ I, and there exists σ > 0 such that K_v ⪰ σI.
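For concreteness, the whitening transform of Definition 1 can be applied to empirical gradient samples with a short routine such as the NumPy sketch below, assuming the sample covariance is non-singular.

```python
import numpy as np

def whiten(samples):
    """Map each row v to K^{-1/2} (v - mu), with K and mu estimated from the rows."""
    mu = samples.mean(axis=0)
    K = np.cov(samples, rowvar=False)
    w, U = np.linalg.eigh(K)                     # K = U diag(w) U^T, assumed non-singular
    K_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T    # symmetric inverse square root
    return (samples - mu) @ K_inv_sqrt
```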
Assumption 1 (IID data distribution). Throughout this section, we consider the case where the local datasets are sampled IID from a common distribution, i.e., the local dataset of user i consists of IID data samples from a distribution P_i, where P_i = P for all i ∈ [N]. This implies that the distribution of the gradients g_i(θ^{(t)}, b), for i ∈ [N], conditioned on the last global model θ^{(t)}, is also IID. For this common conditional distribution, we denote its mean by μ_G^{(t)} and its covariance matrix by K_G^{(t)} in the t-th round.

With the above definitions and using Assumption 1, we can now state our main result below, which is proved in Appendix 2.8.1.

Theorem 2.1 (Single Round Leakage). Let d* ≤ d be the rank of the gradient covariance matrix K_G^{(t)}, and let S_g denote the set of subvectors of dimension d* of g(θ^{(t-1)}, b) that have non-singular covariance matrices. Under Assumption 1, we can upper bound I_priv^{(t)} for FedSGD in the following two cases:

Case 1. If there exists ḡ ∈ S_g such that ḡ is independent under whitening (see Def. 1) and E|ḡ_i|^4 < ∞ for all i ∈ [d*], then there exists C_{0,ḡ} > 0 such that

I_{\text{priv}}^{(t)} \leq \frac{C_{0,\bar{g}} \, d^*}{(N-1)B} + \frac{d^*}{2} \log\frac{N}{N-1}. \quad (2.9)

Case 2. If there exists ḡ ∈ S_g such that ḡ is σ-log concave under whitening (see Def. 2), then we have that

I_{\text{priv}}^{(t)} \leq \frac{d^* C_{1,\bar{g}} - C_{2,\bar{g}}}{(N-1)B\sigma^4} + \frac{d^*}{2} \log\frac{N}{N-1}, \quad (2.10)

where the constants are C_{1,ḡ} = 2(1 + σ + log(2π) − log(σ)) and C_{2,ḡ} = 4(h(ḡ) − ½ log|Σ_ḡ|), with Σ_ḡ being the covariance matrix of the vector ḡ.

Remark 2.2 (Simplified bound). Note that each ḡ ∈ S_g^{(t)} satisfying Case 1 or Case 2 gives an upper bound on I_priv^{(t)}. Let S_{g,c}^{(t)} be the set of ḡ ∈ S_g^{(t)} satisfying either Case 1 or Case 2. Then, we can combine these different bounds in Theorem 2.1 as follows:

I_{\text{priv}}^{(t)} \leq \frac{d^*}{2} \log\frac{N}{N-1} + \frac{\min_{\bar{g} \in S_{g,c}^{(t)}} \left\{ d^* \widehat{C}_{1,\bar{g}} - \widehat{C}_{2,\bar{g}} \right\}}{(N-1)B}, \quad (2.11)

where (Ĉ_{1,ḡ}, Ĉ_{2,ḡ}) = (C_{0,ḡ}, 0) if ḡ satisfies Case 1, and (C_{1,ḡ}/σ^4, C_{2,ḡ}/σ^4) if ḡ satisfies Case 2, with C_{0,ḡ}, C_{1,ḡ} and C_{2,ḡ} defined as in Theorem 2.1.

Remark 2.3 (Why the IID assumption?). Our main result in Theorem 2.1 relies on recent results on the entropic central limit theorem [34, 15] for the sum of independent and identically distributed random variables/vectors. Note that the IID assumption in the entropic central limit theorem can be relaxed to independent (but not necessarily identical) distributions; however, in this case, the upper bound will have a complex dependency on the moments of the N distributions in the system. In order to highlight how the privacy guarantee depends on the different system parameters (discussed in the next subsection), we opted to consider the IID setting in our theoretical analysis.

Remark 2.4 (Independence under whitening). One of our key assumptions in Theorem 2.1 is the independence under whitening assumption for stochastic gradient descent (SGD). This assumption is satisfied if the SGD vector can be approximated by a distribution with independent components or by a multivariate Gaussian vector. Our adoption of this assumption is motivated by recent theoretical results for analyzing the behavior of SGD. These results have demonstrated great success in approximating the practical behavior of SGD, in the context of image classification problems, by modeling the SGD noise with (i) a non-isotropic Gaussian vector [127], or (ii) α-stable random vectors with independent components [97]. For both these noise models, the independence under whitening assumption in Theorem 2.1 is valid.
However, a key practical limitation of the aforementioned SGD models (and thus of the independence under whitening assumption) is the assumption of a smooth loss function for learning. This excludes deep neural networks that make use of non-smooth activation and pooling functions (e.g., ReLU and max-pooling).

Now, using the bounds in Theorem 2.1, in the following corollary we characterize the privacy leakage of the local training data D_i of user i after T global training rounds of FedSGD, which is defined as

I_{\text{priv/data}} = \max_{i \in [N]} I\left( D_i ; \left\{ \frac{1}{N} \sum_{i \in [N]} x_i^{(t)} \right\}_{t \in [T]} \right). \quad (2.12)

Corollary 2.1. Assuming that users follow the FedSGD training protocol and under the same assumptions as in Theorem 2.1, we can upper bound the privacy leakage I_priv/data after T global training rounds of FedSGD in the following two cases:

Case 1. Following the assumptions used in Case 1 of Theorem 2.1, we get

I_{\text{priv/data}} \leq T \left( \frac{C_{0,\bar{g}} \, d^*}{(N-1)B} + \frac{d^*}{2} \log\frac{N}{N-1} \right). \quad (2.13)

Case 2. Following the assumptions used in Case 2 of Theorem 2.1, we get

I_{\text{priv/data}} \leq T \left( \frac{d^* C_{1,\bar{g}} - C_{2,\bar{g}}}{(N-1)B\sigma^4} + \frac{d^*}{2} \log\frac{N}{N-1} \right). \quad (2.14)

We prove Corollary 2.1 in Appendix 2.8.2. Note that we can combine the bounds in Corollary 2.1 similarly to the simplification in equation 2.11 from Theorem 2.1.

2.3.2 Impact of System Parameters

2.3.2.1 Impact of Number of Users (N)

As shown in Theorem 2.1 and Corollary 2.1, the upper bounds on information leakage from the aggregated model update decrease in the number of users N. Specifically, the leakage dependency on N is at a rate of O(1/N).

2.3.2.2 Impact of Batch Size (B)

Theorem 2.1 and Corollary 2.1 show that the information leakage from the aggregated model update decreases when increasing the batch size that is used in updating the local model of each user.

2.3.2.3 Impact of Model Size (d)

Given our definition of d* in Theorem 2.1, where d* represents the rank of the covariance matrix K_G^{(t)} and d* ≤ d (d is the model size), the leakage given in Theorem 2.1 and Corollary 2.1 only increases with the rank of the covariance matrix of the gradient. This increase happens at a rate of O(d*). In other words, increasing the model size d (especially when the model is overparameterized) does not have a linear impact on the leakage. The experimental observations in Section 2.5 support these theoretical findings.

2.3.2.4 Impact of Global Training Rounds (T)

Corollary 2.1 demonstrates that the information leakage from the aggregated model update about the private training data of the users increases with the number of global training rounds. This result reflects the fact that, as the training proceeds, the model at the server starts to memorize the training data of the users, and the data of the users is exposed multiple times by the server as T increases; hence the leakage increases. The increase of the leakage happens at a rate of O(T).
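As a quick numeric illustration of these dependencies, the sketch below evaluates the Case 1 bound of equation 2.13 for a few system sizes. The constant C_{0,ḡ} is distribution-dependent and unknown in practice, so it is set to 1 purely for illustration.

```python
import numpy as np

def fedsgd_leakage_bound(N, B, d_star, T=1, C0=1.0):
    """Case 1 upper bound (equations 2.9 and 2.13), in nats:
    T * ( C0*d*/((N-1)*B) + (d*/2)*log(N/(N-1)) )."""
    return T * (C0 * d_star / ((N - 1) * B) + 0.5 * d_star * np.log(N / (N - 1)))

for N in (2, 5, 10, 20, 50):
    # The bound decays roughly as O(1/N) and grows linearly in T.
    print(N, fedsgd_leakage_bound(N, B=32, d_star=100))
```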
2.3.3 Impact of User Dropout, Collusion, and User Sampling

In this section, we extend the results given in Theorem 2.1 and Corollary 2.1 to cover more practical FL scenarios that consider user dropout, collusion between the server and the users, and user sampling. We start by discussing the impact of user dropout and collusion.

2.3.3.1 Impact of User Dropout and Collusion with the Server

Note that user dropout is equivalent to a situation where the non-surviving users send a deterministic update of zero. As a result, their contribution can be removed from the aggregated model, and we can, without loss of generality, consider an FL system where only the surviving subset N_s ⊂ [N] of users participates in the system. Similarly, when a subset of users colludes with the server, the server can subtract away their contribution to the aggregated model in order to unmask information about its target user i. As a result, we can again study this by considering only the subset of non-colluding (and surviving, if we also consider dropout) users in our analysis. This observation gives us the following derivative of the result in Theorem 2.1, summarized by the corollary below.

Corollary 2.2. In FedSGD, under the assumptions used in Theorem 2.1, if there is only a subset N_s^{(t)} ⊂ [N] of non-colluding and surviving users in the global training round t, then we have the following bound on I_priv^{(t)}:

I_{\text{priv}}^{(t)} \leq \frac{d^*}{2} \log\frac{|N_s|}{|N_s|-1} + \frac{\min_{\bar{g} \in S_{g,c}^{(t)}} \left\{ d^* \widehat{C}_{1,\bar{g}} - \widehat{C}_{2,\bar{g}} \right\}}{(|N_s|-1)B}, \quad (2.15)

where the maximization in I_priv^{(t)} (given in equation 2.8) is only over the set of surviving and non-colluding users, and the constants Ĉ_{1,ḡ} and Ĉ_{2,ḡ} are given in Remark 2.2.

This implies that the per-round leakage increases when we have a smaller number of surviving and non-colluding users. Similarly, we can modify the bound in Corollary 2.1 to take into account user dropout and user collusion by replacing N with |N_s|.

2.3.3.2 Impact of User Sampling

In Theorem 2.1 and Corollary 2.1, we assume that all N users in the FL system participate in each training round. If instead K users are chosen each round, then all leakage upper bounds will be in terms of K, the number of users in each round, instead of N. Furthermore, through Corollary 2.1, we can develop upper bounds for each user i that depend on the number of rounds T_i in which the user participated. For example, taking into account selecting K users in each round, denoted by K^{(t)}, the upper bound in equation 2.13 is modified to give the following information leakage for user i:

I_{\text{priv/data}}(i) = I\left( D_i ; \left\{ \frac{1}{K} \sum_{i \in K^{(t)}} x_i^{(t)} \right\}_{t \in [T]} \right) \leq T_i \left( \frac{C_{0,\bar{g}} \, d^*}{(K-1)B} + \frac{d^*}{2} \log\frac{K}{K-1} \right), \quad (2.16)

where E[T_i] = TK/N if the set of K users is chosen independently and uniformly at random in each round. Thus, user sampling improves the linear dependence of the leakage on T (Section 2.3.2.4), but increases the per-round leakage due to a smaller number of users in each round (Section 2.3.2.1).

2.4 Experimental Setup

2.4.1 MI Estimation

In order to estimate the mutual information in our experiments, we use the Mutual Information Neural Estimator (MINE), the state-of-the-art method [10] for estimating the mutual information between two random vectors (see Appendix 2.8.4 for more details). In our experiments, at the t-th global training round, we use MINE to estimate I(x_i^{(t)} ; \sum_{i=1}^{N} x_i^{(t)} | θ^{(t-1)}), i.e., the mutual information between the model update of the i-th user, x_i^{(t)}, and the aggregated model update from all users, \sum_{i=1}^{N} x_i^{(t)}. Our sampling procedure is described as follows: 1) at the beginning of global training round t, each user first sets its local model parameters to the global model parameters θ^{(t-1)}; 2) next, each user shuffles its local dataset; 3) then, each user picks a single data batch from its local dataset (if using FedSGD) or uses all local data batches (if using FedAvg) to update its local model; 4) lastly, secure aggregation is used to calculate the aggregated model update.
We repeat the above process K times to get K samples {(x_{i,k}^{(t)} ; \sum_{i=1}^{N} x_{i,k}^{(t)})}_{k=1}^{K}, where x_{i,k}^{(t)} represents the model update from the i-th user in the k-th sampling and \sum_{i=1}^{N} x_{i,k}^{(t)} represents the corresponding aggregated model update from all users. Note that we use the K-th (last) sample \sum_{i=1}^{N} x_{i,K}^{(t)} to update the global model. We repeat the end-to-end training and MI estimation multiple times in order to get multiple MI estimates for each training round t. We use the estimates for each round to report the average MI estimate and derive the 95% confidence interval for the MI estimation. (During our experiments, we observe that the estimated MI does not change significantly across training rounds. Hence, we average the estimated MI across training rounds when reporting our results.) Lastly, when using MINE to estimate MI, we use a fully-connected neural network with two hidden layers of 100 neurons each as T_θ (see Appendix 2.8.4 for more details), and we perform gradient ascent for 1000 iterations to train the MINE network.
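A minimal PyTorch sketch of this MINE setup (two hidden layers of 100 neurons, trained by gradient ascent on the Donsker-Varadhan objective given in Appendix 2.8.4) is shown below; hyperparameters such as the learning rate are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class MineNet(nn.Module):
    """T_theta: fully connected, two hidden layers of 100 neurons each."""
    def __init__(self, x_dim, z_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=1)).squeeze(-1)

def mine_estimate(x, z, iters=1000, lr=1e-3):
    """Maximize the empirical Donsker-Varadhan bound over K paired samples.
    x, z: (K, dim) tensors; returns the MI estimate in nats (divide by log 2 for bits)."""
    T = MineNet(x.shape[1], z.shape[1])
    opt = torch.optim.Adam(T.parameters(), lr=lr)
    K = x.shape[0]
    for _ in range(iters):
        z_marg = z[torch.randperm(K)]              # shuffle z to sample P_X (x) P_Z
        joint = T(x, z).mean()                     # empirical E_{P_XZ}[T_theta]
        marg = torch.logsumexp(T(x, z_marg), 0) - math.log(K)
        loss = -(joint - marg)                     # gradient ascent on the lower bound
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        z_marg = z[torch.randperm(K)]
        return (T(x, z).mean()
                - (torch.logsumexp(T(x, z_marg), 0) - math.log(K))).item()
```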
2.4.2 Datasets and Models

Datasets. We use the MNIST and CIFAR10 datasets in our experiments. Specifically, the MNIST dataset contains 60,000 training images and 10,000 testing images, with 10 classes of labels. The CIFAR10 dataset contains 50,000 training images and 10,000 testing images, with 10 classes of labels. For each dataset, we randomly split the training data into 50 equal-sized local datasets to simulate a total of 50 users with identical data distribution. (We describe how to generate users with non-identical data distributions when we evaluate the impact of user heterogeneity in Section 2.5.6.) Moreover, we use MINE to measure the entropy of an individual image in each of these datasets, as an estimate of the maximal potential MI privacy leakage per image. We report that the entropy of an MNIST image is 567 bits and the entropy of a CIFAR10 image is 1403 bits. We use the entropy of the training data to normalize the measured MI privacy leakage in Section 2.5.

Models for MNIST:    Linear (d = 7850),  SLP (d = 7850),  MLP (d = 89610)
Models for CIFAR10:  Linear (d = 30730), SLP (d = 30730), CNN (d = 82554)
Table 2.1: Models used for the MNIST and CIFAR10 datasets. Note that SLP, MLP, and CNN represent Single Layer Perceptron, Multiple Layer Perceptron, and Convolutional Neural Network, respectively.

Models. Table 2.1 reports the models and their numbers of parameters used in our evaluation. For the MNIST dataset, we consider three different models for federated learning. Each of these models takes as input a 28×28 image and outputs probabilities over 10 image classes. We start with a simple linear model, with a dimension of 7850. Next, we consider a non-linear model with the same number of parameters as the linear model: a single layer perceptron (SLP), which consists of a linear layer and a (non-linear) ReLU activation function. Finally, we choose a multiple layer perceptron (MLP) with two hidden layers, each of which contains 100 neurons; in total, it has 89610 parameters. Since the MLP model we use can already achieve more than 95% testing accuracy on the MNIST dataset, we do not consider more complicated models for MNIST. For the CIFAR10 dataset, we also evaluate three different models for FL. Each of these models takes as input a 32×32×3 image and outputs probabilities over 10 image classes. Similar to MNIST, the first two models we consider are a linear model and a single layer perceptron (SLP), both of which contain 30730 parameters. The third model we consider is a Convolutional Neural Network (CNN) modified from AlexNet [65], which contains a total of 82554 parameters and is able to achieve a testing accuracy larger than 60% on CIFAR10. We do not consider larger CNN models due to limited computation resources.

2.5 Empirical Evaluation

In this section, we empirically evaluate how different FL system parameters affect the MI privacy leakage in SA. Our experiments explore the effect of the system parameters on FedSGD, FedAvg and FedProx [92]. Note that our evaluation results on FedSGD are backed by our theoretical results in Section 2.3, while our evaluation results on FedAvg and FedProx are purely empirical. We start by evaluating the impact of the number of users N on the MI privacy leakage for FedSGD, FedAvg and FedProx (Section 2.5.1). Then, we evaluate the impact of the batch size B on the MI privacy leakage for FedSGD, FedAvg and FedProx (Section 2.5.3). Next, in Section 2.5.4, we measure the accumulative MI privacy leakage across all global training rounds. We evaluate how the number of local training epochs E at each user affects the MI privacy leakage for FedAvg and FedProx in Section 2.5.5. Finally, the impact of user heterogeneity on the MI privacy leakage for FedAvg is evaluated in Section 2.5.6. We would like to preface by noting that FedProx differs from FedAvg by adding a strongly-convex proximal term to the loss used in FedAvg. Thus, we expect similar dependencies on the number of users N, batch size B and local epochs E when using FedAvg and FedProx.

2.5.1 Impact of Number of Users (N)

FedSGD. Fig. 2.3 shows the impact of varying N on the MI privacy leakage in FedSGD, where the number of users is chosen from {2, 5, 10, 20, 50}, and we measure the MI privacy leakage of different models on both the MNIST and CIFAR10 datasets. We observe that increasing the number of users participating in FL using FedSGD decreases the MI privacy leakage in each global training round (see Fig. 2.3a and 2.3b), which is consistent with our theoretical analysis in Section 2.3.2.1. Notably, as demonstrated in Fig. 2.3c and 2.3d, the percentile of MI privacy leakage (i.e., normalized by the entropy of a data batch) can drop below 2% for MNIST and 5% for CIFAR10 when there are more than 20 users.

(a) Unnormalized MI, MNIST. (b) Unnormalized MI, CIFAR10. (c) Normalized MI, MNIST. (d) Normalized MI, CIFAR10.
Figure 2.3: Impact of the number of users (N) when using FedSGD. Note that we set B = 32 for all users on both the MNIST and CIFAR10 datasets. We normalize the MI by the entropy of a single data batch (i.e., 32 ∗ 567 for MNIST and 32 ∗ 1403 for CIFAR10).

FedAvg. Fig. 2.4 shows the impact of varying N on the MI privacy leakage in FedAvg. Similar to the results in FedSGD, as the number of users participating in FedAvg increases, the MI privacy leakage in each global training round decreases (see Fig. 2.4a and 2.4b), at a rate of approximately O(1/N). Moreover, as shown in Fig. 2.4c and 2.4d, the percentile of MI privacy leakage drops below 0.1% on both MNIST and CIFAR10 when there are more than 20 users participating in FL. It is worth noting that we normalize the MI by the entropy of the whole training dataset in FedAvg instead of the entropy of a single batch, since users iterate over all their data batches to calculate their local model updates in FedAvg.
Therefore, although we observe that the unnormalized MI is comparable for FedSGD and FedAvg, the percentile of MI privacy leakage in FedAvg is significantly smaller than that in FedSGD.

(a) Unnormalized MI, MNIST. (b) Unnormalized MI, CIFAR10. (c) Normalized MI, MNIST. (d) Normalized MI, CIFAR10.
Figure 2.4: Impact of the number of users (N) when using FedAvg. Note that we set E = 1 and B = 32 for all users on both the MNIST and CIFAR10 datasets. We normalize the MI by the entropy of the whole local training dataset (i.e., 1200 ∗ 567 for MNIST and 1000 ∗ 1403 for CIFAR10).

FedProx. Similar to FedAvg, Fig. 2.5 shows how the MI privacy leakage with FedProx varies with the number of users N. As the number of users increases, the MI privacy leakage decreases in each training round at an approximate rate of O(1/N). With more than 20 participating users, the percentile of MI leakage drops below 0.12% on both MNIST and CIFAR10. As with FedAvg, we normalize the MI privacy leakage by the entropy of the whole training dataset of a single user.

(a) Unnormalized MI, MNIST. (b) Unnormalized MI, CIFAR10. (c) Normalized MI, MNIST. (d) Normalized MI, CIFAR10.
Figure 2.5: Impact of the number of users (N) when using FedProx. Note that we set E = 1 and B = 32 for all users on both the MNIST and CIFAR10 datasets. We normalize the MI by the entropy of the whole local training dataset (i.e., 1200 ∗ 567 for MNIST and 1000 ∗ 1403 for CIFAR10).

In conclusion, while our theoretical analysis on the impact of N in Section 2.3.2.1 is based on the assumption that the FedSGD protocol is used, our empirical study shows that it holds not only in FedSGD but also in FedAvg and FedProx.

2.5.2 Impact of Model Size (d)

FedSGD. From Fig. 2.3, we observe that increasing the model size d increases the MI leakage during each global training round. However, the increase rate of the MI leakage is smaller than the increase rate of d. This is expected since the upper bound on the MI privacy leakage is proportional to d* (i.e., the rank of the covariance matrix, as proved in Theorem 2.1), which does not increase linearly with d, especially for overparameterized neural networks (see Section 2.3.2.3). Finally, we observe that the MI privacy leakage on CIFAR10 is generally higher than that on MNIST. Since the input images of CIFAR10 have higher dimension than the images of MNIST, larger model sizes are required during training; therefore, we expect the MI privacy leakage on CIFAR10 to be higher than that on MNIST.

FedAvg and FedProx. As shown in Fig. 2.4 and Fig. 2.5, increasing the model size also has a sub-linear impact on the increase of the MI privacy leakage in FedAvg and FedProx, which is consistent with our results for FedSGD.

2.5.3 Impact of Batch Size (B)

FedSGD. Fig. 2.6 shows the impact of varying B on the normalized MI privacy leakage in FedSGD, where the batch size is chosen from {16, 32, 64, 128, 256}, and we use the MLP model on MNIST and the CNN model on
CIFAR10 during the experiments. Note that we normalize the MI by the entropy of a single data batch used in each training round, which is proportional to the batch size B. On both the MNIST and CIFAR10 datasets, we consistently observe that increasing B decreases the MI privacy leakage in FedSGD, and the decay rate of the MI is inversely proportional to the batch size B. As demonstrated in Fig. 2.6, when there are more than 20 users, the percentile of MI privacy leakage for a single training round can be around 4% on MNIST and 12% on CIFAR10 with batch size 16. However, such leakage can drop to less than 1% on both MNIST and CIFAR10 with batch size 256, which is a significant reduction.

(a) Normalized MI, MNIST. (b) Normalized MI, CIFAR10.
Figure 2.6: Impact of batch size (B) when using FedSGD. The MI is normalized by the entropy of a data batch, which is proportional to the batch size B (i.e., B ∗ 567 for MNIST and B ∗ 1403 for CIFAR10).

FedAvg and FedProx. Fig. 2.7 and Fig. 2.8 show the impact of varying the batch size B on the MI privacy leakage in FedAvg and FedProx, respectively, following the same experimental setup as in Fig. 2.6. Since in both FedAvg and FedProx each user traverses its whole local dataset in each local training epoch, we normalize the MI by the entropy of the target user's local training dataset. As shown in Fig. 2.7 and Fig. 2.8, the impact of B in FedAvg and FedProx is relatively smaller than that in FedSGD. However, we can still observe that increasing B decreases the MI privacy leakage in both FedAvg and FedProx. For example, with 20 users participating in FedAvg, the percentile of MI privacy leakage at each training round drops from 0.8% to 0.3% when the batch size increases from 16 to 256, a reduction in privacy leakage by a factor of more than 2×. Similarly, in FedProx, the MI privacy leakage decreases from 0.09% to 0.04% when the batch size increases from 16 to 256. In conclusion, we observe that increasing the batch size B can decrease the MI privacy leakage from the aggregated model update in FedSGD, FedAvg and FedProx, which verifies our theoretical analysis in Section 2.3.2.2.

(a) Normalized MI, MNIST. (b) Normalized MI, CIFAR10.
Figure 2.7: Impact of batch size (B) when using FedAvg. The MI is normalized by the entropy of a user's local dataset, which is a constant (i.e., 1200 ∗ 567 for MNIST and 1000 ∗ 1403 for CIFAR10).

(a) Normalized MI, MNIST. (b) Normalized MI, CIFAR10.
Figure 2.8: Impact of batch size (B) when using FedProx. The MI is normalized by the entropy of a user's local dataset, which is a constant (i.e., 1200 ∗ 567 for MNIST and 1000 ∗ 1403 for CIFAR10).

2.5.4 Accumulative MI Leakage

To evaluate how the MI privacy leakage accumulates with the number of training rounds T, we measure the MI between the training data and the aggregated model updates across training rounds. Specifically, given a local training dataset sample D_i, we concatenate the aggregated model updates {\frac{1}{N} \sum_{i \in N} x_i^{(t)}}_{t∈[T]} across T training rounds into a single vector of dimension d·T. By randomly generating D_i for the target user K times, we obtain K concatenated aggregated model update vectors. Then, we use MINE to estimate I(D_i ; {\frac{1}{N} \sum_{i \in N} x_i^{(t)}}_{t∈[T]}) with these K dataset and concatenated model update samples. As illustrated in Fig. 2.9, the MI privacy leakage accumulates linearly as we increase the number of global training rounds T on both the MNIST and CIFAR10 datasets, which is consistent with our theoretical results in Section 2.3.2.4. In other words, reducing the number of model aggregations reduces the MI privacy leakage of secure aggregation. In practice, we can consider using client sampling to reduce the number of times each client participates in FL, such that the accumulative MI leakage of individual users can be reduced. Moreover, we can also consider increasing the amount of local training between aggregations to reduce the number of aggregations of local model updates.
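A sketch of how such trajectory samples can be assembled as MINE inputs is shown below; the helper flatten_dataset is hypothetical and stands in for whatever fixed-length encoding of D_i is used.

```python
import numpy as np

def trajectory_sample(agg_updates_per_round):
    """Concatenate {(1/N) sum_i x_i^(t)}_{t in [T]} into one d*T vector (one MINE sample)."""
    return np.concatenate([np.asarray(u).ravel() for u in agg_updates_per_round])

# With K independent re-draws of the target user's dataset D_i, pair each dataset
# encoding with its trajectory and feed both matrices to the MINE sketch above:
#   X = np.stack([flatten_dataset(D_k) for D_k in dataset_draws])   # hypothetical helper
#   Z = np.stack([trajectory_sample(traj_k) for traj_k in trajectories])
```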
Although the three aggregation algorithms exhibit a similar trend with T, they can result in different convergence speeds to a target accuracy. To highlight the effect of the convergence rate on the accumulative MI privacy leakage, we show in Fig. 2.10 how the accuracy changes with the amount of MI leakage incurred for the three algorithms during the training process, up to a maximum of 30 training rounds for FedSGD. We observe that, although FedSGD achieves lower MI leakage for a fixed number of rounds (see Fig. 2.9), its slow convergence rate makes it suffer more leakage before reaching a target accuracy. For example, given a target accuracy of 85% on the MNIST dataset, both FedAvg and FedProx achieve the target accuracy with 0.058% and 0.057% leakage, respectively, while FedSGD reaches 85% accuracy in later rounds, resulting in an accumulative MI leakage of 0.11% (even with smaller leakage per round).

(a) Normalized accumulative MI, MNIST. (b) Normalized accumulative MI, CIFAR10.
Figure 2.9: Accumulative MI privacy leakage on the MNIST and CIFAR10 datasets. Note that we normalize the MI by the entropy of each user's local dataset, which does not change with T. We use the linear model for both the MNIST and CIFAR10 datasets.

(a) MNIST. (b) CIFAR.
Figure 2.10: Accumulative MI privacy leakage vs. model accuracy of different FL algorithms. Note that we use a linear model as a case study and normalize the MI by the entropy of each user's local dataset.

2.5.5 Impact of Local Training Epochs (E)

Fig. 2.11 shows the impact of varying the number of local training epochs E on the MI privacy leakage in FedAvg on both the MNIST and CIFAR10 datasets. We select E from {1, 2, 5, 10} and N from {10, 20}, and we consider the MLP model for MNIST and the CNN model for CIFAR10. We observe that increasing the number of local training epochs E increases the MI privacy leakage in FedAvg. An intuitive explanation is that with more local epochs, the local model updates become more biased towards the user's local dataset; hence they potentially leak more private information about the users and make it easier for the server to infer the individual model update from the aggregated update. However, as shown in Fig. 2.11, increasing the local epochs E does not have a linear impact on the increase of the MI privacy leakage: as E increases, the increase rate of the MI privacy leakage becomes smaller.

(a) Normalized MI, MNIST. (b) Normalized MI, CIFAR10.
Figure 2.11: Impact of the number of local training epochs (E) when using FedAvg. We normalize the MI by the entropy of each user's local dataset, and we consider N ∈ {10, 20}.

Similar to FedAvg, we observe from Fig. 2.12 that the number of local training epochs E has a sub-linear impact on the MI privacy leakage when using FedProx. As aforementioned, this can be attributed to the fact that FedProx represents an application of FedAvg with the original loss function in addition to a convex regularization term.

(a) Normalized MI, MNIST. (b) Normalized MI, CIFAR10.
Figure 2.12: Impact of the number of local training epochs (E) when using FedProx. We normalize the MI by the entropy of each user's local dataset, and we consider N ∈ {10, 20}.
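For reference, a minimal sketch of a FedProx-style local step is shown below. The proximal coefficient mu is an illustrative choice, and the sketch approximates FedProx's inexact proximal solver with a single gradient step on the regularized loss C_i(θ) + (μ/2)||θ − θ_global||².

```python
import numpy as np

def fedprox_local_step(theta_local, theta_global, grad, lr, mu=0.01):
    """One local SGD step on the proximally regularized objective:
    plain gradient plus the proximal-term gradient mu * (theta - theta_global)."""
    return theta_local - lr * (grad + mu * (theta_local - theta_global))
```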
2.5.6 Impact of Data Heterogeneity

As discussed in Remark 2.3 of Section 2.3, our theoretical analysis considered an IID data distribution across users in Theorem 2.1 in order to make use of entropic central limit theorem results in developing our upper bounds on privacy leakage. In practice, however, the data distribution at the users can be heterogeneous. Hence, in this subsection, we analyze the impact of a non-IID (heterogeneous) data distribution across the users on the privacy leakage. To measure how user heterogeneity can potentially impact the MI privacy leakage in FedAvg, we consider two different data settings. In the first setting, we create synthetic users with non-IID data distributions following the methodology in [51]. For the second setting, we consider FEMNIST [20], a benchmark non-IID FL dataset extended from MNIST, which consists of 62 different classes of 28×28 images (10 digits, 26 lowercase letters, 26 uppercase letters) written by 3500 users.

(a) Normalized MI when E = 1. (b) Normalized MI when E = 5.
Figure 2.13: Impact of user heterogeneity when using FedAvg on non-IID CIFAR10. Note that α = ∞ means that the user data distributions are identical (IID users), and the MI is normalized by the entropy of a user's local dataset.

Figure 2.14: Impact of user heterogeneity when using FedAvg on FEMNIST. Note that the MI is normalized by the entropy of the target user's local dataset, which is 678 ∗ 176.

In the first, synthetic non-IID data setting, we use a Dirichlet distribution parameterized by α to split the dataset into multiple non-IID local datasets. A smaller α (i.e., α → 0) means that the users' datasets are more non-identical to each other, while a larger α (i.e., α → ∞) means that the user datasets are closer to identical. We choose CIFAR10 as the dataset, the CNN as the model, and use FedAvg for a case study with a batch size of B = 32. Note that we do not consider FedSGD since it is not affected by user heterogeneity. During the experiments, we choose the α value from {1, 10, 100, ∞} to create different levels of non-IID user datasets, and we consider N ∈ {2, 5, 10, 20} and E ∈ {1, 5}. Fig. 2.13 shows how the MI privacy leakage varies with the number of users under different α, where the MI privacy leakage is normalized by the entropy of each user's local dataset. We notice that the MI privacy leakage decreases with the number of users consistently under different α, which empirically shows that our theoretical results in Section 2.3 also hold in the case where users are heterogeneous.

For the second, FEMNIST data setting, we split the dataset by users into 3500 non-overlapping subsets, each of which contains character images written by a specific user. Considering that the size of each subset is small, in order to have enough training data, we choose to sample N users at each training round instead of using a fixed set of N users, which simulates the user sampling scenario in FL. Specifically, at the beginning of each FL training round with N participating users, we use the same target user and randomly pick the other N − 1 out of the 3500 users. Note that we consider N ∈ {2, 5, 10, 20, 50} and E ∈ {1, 5}, and use the same model (CNN), batch size (B = 32), and FedAvg algorithm in our evaluation. Fig. 2.14 shows how the MI privacy leakage varies with the number of users. Similar to the synthetic non-IID data setting in Fig. 2.13, the privacy leakage decreases with an increasing number of users N.
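A common way to implement such a synthetic split is a label-wise Dirichlet allocation, sketched below; the exact partitioning procedure of [51] may differ in details, so this is an illustrative assumption.

```python
import numpy as np

def dirichlet_split(labels, n_users, alpha, seed=0):
    """Assign sample indices to n_users shards; per-class user proportions are
    drawn from Dir(alpha), so alpha -> infinity approaches an IID split."""
    rng = np.random.default_rng(seed)
    shards = [[] for _ in range(n_users)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        proportions = rng.dirichlet(alpha * np.ones(n_users))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for user, chunk in enumerate(np.split(idx, cuts)):
            shards[user].extend(chunk.tolist())
    return shards
```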
2.5.7 Practical Privacy Implications

Success of privacy attacks. To provide insights on how MI translates to practical privacy implications, we conduct experiments using one of the state-of-the-art data reconstruction attacks, the Deep Leakage from Gradients (DLG) attack from [126], to show how the MI metric reflects the reconstructed image quality of the attack as we vary system parameters. Specifically, we choose MNIST as the dataset, the same SLP used in Section 2.4.2 as the model, and FedSGD with a batch size of 32 as the training algorithm. For the data distribution across the users, we consider the IID setting. At the end of each training round, each user uses a batch of 32 images to calculate its local gradients, which are securely aggregated by the server. The DLG attack then reconstructs a batch of 32 images from the aggregated gradient, making them as similar as possible to the batch of images used by the target user. After that, we apply the same PSNR (Peak Signal-to-Noise Ratio) metric used in [126] to measure the quality of the reconstructed images compared with the images used by the target user during training. Note that, without loss of generality, we report the PSNR value of the images reconstructed by the DLG attack for the first training round.

Figure 2.15: Impact of varying the number of users N on the reconstructed image quality (PSNR) of the DLG attack and on the MI privacy leakage.

Fig. 2.15 shows the impact of the number of users N on the privacy leakage metric (MI) and the reconstructed image quality of the DLG attack (PSNR). We pick the image of digit 3 out of the target 32 images as an example of the reconstructed images. We observe that increasing the number of users N decreases the MI metric as well as the PSNR at almost the same rate. This demonstrates that the MI metric used in this work translates well to practical privacy implications.

MI privacy leakage under the joint use of DP and SA. To highlight the joint effect of differential privacy with secure aggregation, we conduct experiments on the MNIST dataset with a linear model to measure the MI privacy leakage in the presence of centralized DP noise added at the server after SA. Specifically, following [1], we first clip the aggregated model update so that its norm is bounded by C, and then add Gaussian noise with variance σ² to achieve (ε, δ)-DP. We set C = 1, δ = 1/1200, and σ = \sqrt{2 \log(1.25/δ)}/ε.

(a) Normalized MI, MNIST. (b) Model accuracy, MNIST.
Figure 2.16: Effects of using DP noise together with SA on MI privacy leakage and model accuracy. Note that we add DP noise to the aggregated model updates after SA.
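A minimal sketch of this clip-then-noise step applied after SA is shown below, using the σ formula stated above; treating each round as a single standalone Gaussian mechanism is an illustrative simplification of the full accounting in [1].

```python
import numpy as np

def dp_after_sa(agg_update, eps, delta=1/1200, C=1.0, rng=None):
    """Centralized Gaussian mechanism on the securely aggregated update:
    clip to norm C, then add N(0, sigma^2) noise with
    sigma = sqrt(2 * log(1.25/delta)) / eps."""
    rng = rng or np.random.default_rng()
    clipped = agg_update / max(1.0, np.linalg.norm(agg_update) / C)
    sigma = np.sqrt(2 * np.log(1.25 / delta)) / eps
    return clipped + rng.normal(0.0, sigma, size=agg_update.shape)
```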
Fig. 2.16a shows the MI privacy leakage for different (ε, δ)-DP levels with SA (δ is fixed at 1/1200). As the number of users increases, SA improves the privacy level (measured in terms of MI leakage) for different levels of DP noise, with the effect being most pronounced for the weak DP noise level (ε = 5000 in Fig. 2.16a). Our experiments also show that as the number of users increases, the gain from using higher DP noise levels diminishes. In particular, with N = 1000 users, the MI leakage levels for ε = 5, 10 and 5000 are almost the same; the MI leakage is only reduced from 0.046% to 0.034% when using ε = 5 instead of ε = 5000. In contrast, we get a reduction from 0.234% to 0.056% when there are N = 2 users. Importantly, the reduction in privacy leakage observed when applying additional DP noise results in a severe degradation in accuracy, as seen in Fig. 2.16b, whereas the privacy improvement gained by having more users has a negligible effect on the performance of the trained model. For example, consider the case of 1000 users. One may achieve the same level of privacy in terms of MI leakage (lower than 0.05% MI) either (i) with (ε, δ)-DP at ε = 10, which, however, results in unusable model accuracy (less than 50%), or (ii) by aggregating the 1000 users and using a tiny amount of DP noise (equivalent to ε = 5000), which achieves a model accuracy higher than 90%.

2.6 Related Work

Secure Aggregation in FL. As mentioned, secure aggregation has been developed for FL [18] to provide protection against model inversion attacks and robustness to user dropouts (due to poor connections or unavailability). There has been a series of works that aim at improving the efficiency of the aggregation protocol [11, 98, 53, 123, 103, 100, 37]. This general family of works using secure aggregation disallows learning information about each client's individual model update beyond the global aggregate of updates; however, there has not been a characterization of how much information the global aggregate can leak about an individual client's model and dataset. To the best of our knowledge, in this work, we provide the first characterization of the privacy leakage due to the aggregated model, through mutual information, for FL using secure aggregation.

Differential Privacy. One way to protect a client's contributions is to use differential privacy (DP). DP provides a rigorous, worst-case mathematical guarantee that the contribution of a single client does not significantly impact the result of the query. The central application of differential privacy was studied in [9, 22, 1]. This form of central application of DP in FL requires trusting the server with individual model updates before applying the differentially private mechanism. An alternative approach studied in FL for an untrusted server entity is the local differential privacy (LDP) model [59, 2, 8], where clients apply a differentially private mechanism (e.g., the Gaussian mechanism) locally on their updates before sending them to the central server. LDP constraints imply central DP constraints; however, due to the local privacy constraints, LDP mechanisms significantly perturb the input and reduce global utility due to the compounded effect of adding noise at different clients. In this work, we use a mutual information metric to study the privacy guarantees for the client's dataset provided through the secure aggregation protocol without adding differential privacy noise at the clients. In this case, secure aggregation uses contributions from other clients to mask the contribution of a single client. We discuss in Section 2.7 situations where relying only on SA can clearly fail to provide differential privacy guarantees, and comment on the prevalence of such situations in practical training scenarios.

Privacy Attacks. There have been some works that empirically show that it is possible to recover some training data from the gradient information [86, 5, 110, 117]. Recently, the authors in [42] showed that it is possible to recover a batch of images that were used in the training of a non-smooth deep neural network. In particular, their proposed reconstruction attack was successful in reconstructing different images from the average gradient computed over a mini-batch of data. Their empirical results have shown that the success rate of the inversion attack decreases with increasing batch size.
Similar observations have been demonstrated in subsequent works [117]. In contrast to these works, we are the first, to the best of our knowledge, to theoretically quantify the amount of information that the aggregated gradient could leak about the private training data of the users, and to understand how the training parameters (e.g., number of users) affect the leakage. Additionally, our empirical results differ from those in [86, 5, 110, 117] in the way of quantifying the leakage: we use the MINE tool to abstractly quantify the amount of information leakage in bits instead of counting the number of reconstructed images. We have also empirically studied the effect of the system parameters extensively, using different real-world datasets and different neural network architectures.

2.7 Further Discussion and Conclusions

In this chapter, we derived the first formal privacy guarantees for FL with SA, using MI as a metric to measure how much information the aggregated model update can leak about the local dataset of each user. We proved theoretical bounds on the MI privacy leakage and showed through an empirical study that similar dependencies hold in practical FL settings. Our concluding observations are that by using FL with SA: 1) the MI privacy leakage decreases at a rate of O(1/N), where N is the number of users participating in FL with SA; 2) increasing the model size does not have a linear impact on the MI privacy leakage; rather, the MI privacy leakage increases linearly with the rank of the covariance matrix of the individual model update; 3) a larger batch size during local training helps to reduce the MI privacy leakage. We hope that our findings can shed light on how to select FL system parameters with SA in practice to reduce privacy leakage, and provide an understanding of the baseline protection provided by SA in settings where it is combined with other privacy-preserving approaches such as differential privacy.

Figure 2.17: Heatmap of the absolute values of sampled updates from clients 1, 2 and 3 in the counter-example. x_4 and x'_4 can be distinguished even after adding the aggregated noise from \sum_{i=1}^{3} x_i.

Can we provide differential privacy guarantees using SA? Note that when using FL with SA, from the point of view of an adversary interested in the data of the i-th user, the aggregated model of the users in i^- = [N] \ {i} can be viewed as noise that is independent of the gradient x_i given the last global model, which is very similar to an LDP mechanism that adds noise to the update x_i^{(t)} of user i. This leads to an intriguing question: can we get LDP-like guarantees from the securely aggregated updates? Since DP is interested in a worst-case guarantee, it turns out that there exist model update distributions where it is impossible to achieve an ε < ∞ DP guarantee by using other model updates as noise, as illustrated in Fig. 2.17. In this case, the alignment of the sparsity patterns of x_1, x_2 and x_3 allows an adversary to design a perfect detector to distinguish between x_4 and x'_4.

Why can our MI privacy guarantee avoid this? Although the previous example illustrates that DP-flavored guarantees are not always possible, in practical scenarios the worst-case distributions for x_1, x_2 and x_3 that enable distinguishing between x_4 and x'_4 in Fig. 2.17 are an unlikely occurrence during training.
For instance, in our theoretical analysis, since users have IID datasets, having the distributions of x_1, x_2 and x_3 restricted to a subspace S_{x_{i^-}} implies that points generated from x_4 would also belong to S_{x_{i^-}} almost surely. This is a key reason why we can get a mutual information guarantee in Theorem 2.1: an aggregated gradient \sum_{i=1}^{N} x_i in which each component is restricted to a common subspace S_x protects the contribution of each individual component x_i as N increases. In the worst case, where one component is not restricted to the subspace S_x spanned by the remaining components, we get the privacy leakage discussed in the example above. We highlight that, through our experiments and other studies in the literature [39], we observe that such sparsity alignment happens with very low probability. This motivates studying a probabilistic notion of DP that satisfies (ε, δ)-DP with probability at least γ, instead of the worst-case treatment in current DP notions, but this is beyond the scope of the present work. Another interesting future direction is to use the results from this work to provide "privacy metrics" to users to estimate/quantify their potential leakage from participating in a federated learning cohort. Such metrics can be embedded in platforms, such as FedML [47], to guide users to make informed decisions about their participation in federated learning. Finally, it would also be important to extend the results to model aggregation protocols that go beyond weighted averaging (e.g., in federated knowledge transfer [46]).

2.8 Appendix

2.8.1 Proof of Theorem 2.1

Without loss of generality, by permuting client indices, we will prove the upper bound for the following term:

I\left( x_N^{(t)} ; \frac{1}{N} \sum_{i=1}^{N} x_i^{(t)} \,\Big|\, \left\{ \frac{1}{N} \sum_{i=1}^{N} x_i^{(k)} \right\}_{k \in [t-1]} \right), \quad (2.17)

where x_N^{(t)} is the mini-batch gradient of node N, given by

x_N^{(t)} = \frac{1}{B} \sum_{b \in B_N^{(t)}} g_N(\theta^{(t)}, b). \quad (2.18)

We will use the following property of vectors with singular covariance matrices in the proof of this theorem.

Property 1. Given a random vector q with a singular covariance matrix K_q of rank d*, there exists a subvector q̄ of q with a non-singular covariance matrix K_{q̄} such that q = A q̄, where A ∈ R^{d × d*} is a deterministic linear transformation matrix.

Let us define S_N^{(t)} = \frac{1}{N} \sum_{i=1}^{N} x_i^{(t)}. We also use the definition of ḡ_i(θ^{(t)}, b) ∈ R^{d*}, for d* ≤ d where d is the model size, which is the largest subvector of the stochastic gradient g_i(θ^{(t)}, b) such that ḡ_i(θ^{(t)}, b) has a non-singular covariance matrix K_{Ḡ}^{(t)} for all i ∈ N. According to the definition of ḡ_i(θ^{(t)}, b), we define

\bar{x}_i^{(t)} = \frac{1}{B} \sum_{b \in B_i^{(t)}} \bar{g}_i(\theta^{(t)}, b), \qquad \bar{S}_N^{(t)} = \frac{1}{N} \sum_{i \in \mathcal{N}} \bar{x}_i^{(t)}. \quad (2.19)

Let us also define F_N^{(t)} = \sqrt{N} \bar{S}_N^{(t)}.
We can decompose the expression in equation 2.17 as follows:

I\left( x_N^{(t)} ; S_N^{(t)} \,\Big|\, \{S_N^{(k)}\}_{k \in [t-1]} \right)
\overset{(a)}{=} I\left( \sqrt{B}\, x_N^{(t)} ; \sqrt{N}\, S_N^{(t)} \,\Big|\, \{S_N^{(k)}\}_{k \in [t-1]} \right)
\overset{(b)}{=} I\left( \sqrt{B}\, \bar{x}_N^{(t)} ; F_N^{(t)} \,\Big|\, \{S_N^{(k)}\}_{k \in [t-1]} \right)
= h\left( \sqrt{B}\, \bar{x}_N^{(t)} \,\Big|\, \{S_N^{(k)}\} \right) + h\left( F_N^{(t)} \,\Big|\, \{S_N^{(k)}\} \right) - h\left( \sqrt{B}\, \bar{x}_N^{(t)}, F_N^{(t)} \,\Big|\, \{S_N^{(k)}\} \right).

Writing the joint vector as a linear transformation of (√B x̄_N^{(t)}, F_{N−1}^{(t)}),

\begin{pmatrix} \sqrt{B}\,\bar{x}_N^{(t)} \\ F_N^{(t)} \end{pmatrix} = \begin{pmatrix} I_{d^*} & 0 \\ \frac{1}{\sqrt{N}} I_{d^*} & \sqrt{\frac{N-1}{N}}\, I_{d^*} \end{pmatrix} \begin{pmatrix} \sqrt{B}\,\bar{x}_N^{(t)} \\ F_{N-1}^{(t)} \end{pmatrix},

we obtain

I\left( x_N^{(t)} ; S_N^{(t)} \,\Big|\, \{S_N^{(k)}\} \right) \overset{(c)}{=} h\left( F_N^{(t)} \,\Big|\, \{S_N^{(k)}\} \right) - h\left( F_{N-1}^{(t)} \,\Big|\, \{S_N^{(k)}\} \right) - \log\left| \det\left( \sqrt{\tfrac{N-1}{N}}\, I_{d^*} \right) \right| \overset{(d)}{=} h\left( F_N^{(t)} \,\Big|\, \{S_N^{(k)}\} \right) - h\left( F_{N-1}^{(t)} \,\Big|\, \{S_N^{(k)}\} \right) + \frac{d^*}{2} \log\frac{N}{N-1}, \quad (2.20)

where: (a) follows from the fact that mutual information is invariant under deterministic scaling; (b) follows from Property 1; (c) follows from the entropy of a linear transformation of random vectors [24] and the fact that x̄_N^{(t)} and F_{N−1}^{(t)} are conditionally independent given {S_N^{(k)}}_{k∈[t−1]} (e.g., the last global model at time t); (d) follows from the Schur complement of the matrix.

We now turn our attention to characterizing the entropy term h(F_M^{(t)} | {S_N^{(k)}}_{k∈[t−1]}) for any M. Note that

h\left( F_M^{(t)} \,\Big|\, \{S_N^{(k)}\} \right) = h\left( \frac{1}{\sqrt{MB}} \sum_{i=1}^{M} \sum_{b \in B_i^{(t)}} \bar{g}_i(b, \theta^{(t)}) \,\Big|\, \{S_N^{(k)}\} \right) \overset{(i)}{=} h\left( \frac{K_{\bar{G}^{(t)}}^{1/2}}{\sqrt{MB}} \sum_{i=1}^{M} \sum_{b \in B_i^{(t)}} \widehat{g}_i(b, \theta^{(t)}) \,\Big|\, \{S_N^{(k)}\} \right) \overset{(ii)}{=} \log\left| \det K_{\bar{G}^{(t)}}^{1/2} \right| + \underbrace{h\left( \frac{1}{\sqrt{MB}} \sum_{i=1}^{M} \sum_{b \in B_i^{(t)}} \widehat{g}_i(b, \theta^{(t)}) \,\Big|\, \{S_N^{(k)}\} \right)}_{H_M}, \quad (2.21)

where: (i) makes use of the fact that the covariance matrix is the same across clients and applies the whitening definition (Definition 1) to the vector ḡ_i(b, θ^{(t)}); (ii) again uses the entropy of a linear transformation of random vectors. Note that h(F_M^{(t)} | {S_N^{(k)}}) depends on M only through the second term H_M. As a result, by substituting equation 2.21 in equation 2.20, we get

I\left( x_N^{(t)} ; S_N^{(t)} \,\Big|\, \{S_N^{(k)}\}_{k \in [t-1]} \right) = H_N - H_{N-1} + \frac{d^*}{2} \log\frac{N}{N-1}. \quad (2.22)

Our final step is to find suitable upper and lower bounds for H_M to use in equation 2.22. Recall for the following arguments that, due to whitening, the vector ĝ_b^{(t)} = ĝ(b, θ^{(t)}) has zero mean and identity covariance.

2.8.1.1 Upper bound on H_M

The upper bound is the simplest, due to basic entropy properties. In particular, the sum \frac{1}{\sqrt{MB}} \sum_{i=1}^{M} \sum_{b \in B_i^{(t)}} \widehat{g}_b^{(t)} has zero mean and I_{d*} covariance. Thus,

H_M \overset{(a)}{\leq} \frac{d^*}{2} \log(2\pi e), \quad (2.23)

where (a) follows from the fact that, for fixed first and second moments, the Gaussian distribution maximizes the entropy. The distinction between the proofs of the bounds in Case 1 and Case 2 of Theorem 2.1 lies in the lower bound on the term H_M. We start by providing the lower bound that is used for proving Case 1.

2.8.1.2 Lower bound on H_M for Case 1 in Theorem 2.1

For the lower bound, we rely heavily on the assumption that the elements of ĝ_b^{(t)} are independent, and on the result that gives Berry-Esseen style bounds for the entropic central limit theorem [15].
In particular, in its simplest form, the result states that for IID zero-mean random variables $X_i$, the entropy of the normalized sum $T_M = \frac{1}{\sqrt{M}}\sum_{i=1}^{M} X_i$ approaches the entropy of a Gaussian random variable $\Phi_{\sigma^2}$ with the same variance $\sigma^2$ as $X_i$, such that the following is always satisfied:
\[
h(\Phi_{\sigma^2}) - h(T_M) \leq \frac{\tilde{C}\,\mathbb{E}|X_i|^4}{M}. \tag{2.24}
\]
Using equation 2.24, we can find a lower bound for $H_M$ as follows:
\[
\begin{aligned}
H_M &= h\Big(\frac{1}{\sqrt{MB}}\sum_{i=1}^{M}\sum_{b\in\mathcal{B}_i^{(t)}} \hat{g}_b^{(t)} \,\Big|\, \big\{S_N^{(k)}\big\}_{k\in[t-1]}\Big)
= \sum_{j=1}^{d^*} h\Big(\underbrace{\frac{1}{\sqrt{MB}}\sum_{i=1}^{M}\sum_{b\in\mathcal{B}_i^{(t)}} \hat{g}_b^{(t)}[j]}_{\text{variance } = 1} \,\Big|\, \big\{S_N^{(k)}\big\}_{k\in[t-1]}\Big) \\
&\overset{\text{eq. 2.24}}{\geq} \sum_{j=1}^{d^*}\Big(h(\Phi_1) - \frac{C_{0,\bar{g}}}{MB}\Big) = \frac{d^*}{2}\log(2\pi e) - \frac{d^* C_{0,\bar{g}}}{MB}.
\end{aligned} \tag{2.25}
\]
In other words, we have the following bound on $H_M$:
\[
\frac{d^*}{2}\log(2\pi e) - \frac{d^* C_{0,\bar{g}}}{MB} \leq H_M \leq \frac{d^*}{2}\log(2\pi e). \tag{2.26}
\]
By substituting equation 2.26 into equation 2.22 (the lower bound for $M = N-1$ and the upper bound for $M = N$), we get
\[
I\Big(x_N^{(t)};\, S_N^{(t)} \,\Big|\, \big\{S_N^{(k)}\big\}_{k\in[t-1]}\Big) = H_N - H_{N-1} + \frac{d^*}{2}\log\frac{N}{N-1} \leq \frac{d^*}{2}\log\frac{N}{N-1} + \frac{d^* C_{0,\bar{g}}}{(N-1)B}. \tag{2.27}
\]
This concludes the proof of Case 1.

2.8.1.3 Lower bound on $H_M$ for Case 2 in Theorem 2.1

The proof of this lower bound relies on the entropic central limit theorem for the vector case [34] and Lemma 2.1 below. We start by stating the entropic central limit theorem for IID random vectors [34].

Theorem 2.2. (Entropic central limit theorem [34]) Let $q$ be a $\sigma$-uniformly log-concave $d$-dimensional random vector with $\mathbb{E}[q] = 0$ and non-singular covariance matrix $\Sigma$. Additionally, let $z \sim \mathcal{N}(0, \Sigma)$ be a Gaussian vector with the same covariance as $q$, and let $\gamma \sim \mathcal{N}(0, I_d)$ be a standard Gaussian. The entropy of the normalized sum $T_M = \frac{1}{\sqrt{M}}\sum_{i=1}^{M} q_i$, where the $q_i$'s are IID samples of $q$, approaches the entropy of $z$, such that the following is always satisfied:
\[
\mathrm{Ent}(T_M \,\|\, z) \leq \frac{2\big(d + 2\,\mathrm{Ent}(\sqrt{\sigma}\,q \,\|\, \gamma)\big)}{M\sigma^4}, \tag{2.28}
\]
where $\mathrm{Ent}(T_M \,\|\, z)$ is the relative entropy.

Lemma 2.1. Given a random vector $q \in \mathbb{R}^d$ with distribution $f_q(y)$ and $\mathrm{Cov}(q) = \Sigma$, and defining $z \sim \mathcal{N}(0, \Sigma)$ to be a Gaussian vector with the same covariance as $q$, for $\sigma > 0$ we have
\[
\mathrm{Ent}(\sqrt{\sigma}\,q \,\|\, z) = -h(q) - \frac{d}{2}\log\sigma + \frac{d}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma| + \frac{\sigma d}{2}, \tag{2.29}
\]
and
\[
\mathrm{Ent}(q \,\|\, z) = h(z) - h(q). \tag{2.30}
\]
Given the assumption that $\hat{g}_b^{(t)}$ has a $\sigma$-log-concave distribution, and that both $\frac{1}{\sqrt{MB}}\sum_{i=1}^{M}\sum_{b\in\mathcal{B}_i^{(t)}} \hat{g}_b^{(t)}$ and $\hat{g}_b^{(t)}$ have identity covariance $\Sigma = I_{d^*}$ given $\{S_N^{(k)}\}_{k\in[t-1]}$, we can use equation 2.28 with $z \sim \mathcal{N}(0, I_{d^*})$. Furthermore, by using Lemma 2.1, we get
\[
h(z) - H_M \leq \frac{d^* C_{1,\bar{g}} - C_{2,\bar{g}}}{MB}, \tag{2.31}
\]
where $C_{1,\bar{g}} = \frac{2(1+\sigma+\log(2\pi)-\log\sigma)}{\sigma^4}$ and $C_{2,\bar{g}} = \frac{4\,h(\hat{g}(b,\theta^{(t)}))}{\sigma^4}$, and $h(\hat{g}(b,\theta^{(t)}))$ is the entropy of the random vector $\bar{g}_i(b,\theta^{(t)})$ after whitening. Finally, using the fact that the entropy of the Gaussian random vector $z$ with covariance $I_{d^*}$ is $h(z) = \frac{d^*}{2}\log(2\pi e)$, we get the following bound on $H_M$:
\[
\frac{d^*}{2}\log(2\pi e) - \frac{d^* C_{1,\bar{g}} - C_{2,\bar{g}}}{MB} \leq H_M \leq \frac{d^*}{2}\log(2\pi e). \tag{2.32}
\]
By substituting equation 2.32 into equation 2.22 (the lower bound for $M = N-1$ and the upper bound for $M = N$), we can now upper bound the mutual information term as follows:
\[
I\Big(x_N^{(t)};\, S_N^{(t)} \,\Big|\, \big\{S_N^{(k)}\big\}_{k\in[t-1]}\Big) = H_N - H_{N-1} + \frac{d^*}{2}\log\frac{N}{N-1} \leq \frac{d^*}{2}\log\frac{N}{N-1} + \frac{d^* C_{1,\bar{g}} - C_{2,\bar{g}}}{(N-1)B}. \tag{2.33}
\]
This concludes the proof of Theorem 2.1.

2.8.2 Proof of Corollary 2.1

In the following, we define $S_N^{(t)} = \frac{1}{N}\sum_{i=1}^{N} x_i^{(t)}$.
Using this notation, we can upper bound $I_{\mathrm{priv/data}}$ as follows:
\[
\begin{aligned}
I_{\mathrm{priv/data}} &= I\Big(\mathcal{D}_i ;\, \big\{S_N^{(k)}\big\}_{k\in[T]}\Big)
\overset{(a)}{=} \sum_{t=1}^{T} I\Big(\mathcal{D}_i ;\, S_N^{(t)} \,\Big|\, \big\{S_N^{(k)}\big\}_{k\in[t-1]}\Big) \\
&\overset{(b)}{\leq} \sum_{t=1}^{T} I\Big(\mathcal{B}_i^{(t)} ;\, S_N^{(t)} \,\Big|\, \big\{S_N^{(k)}\big\}_{k\in[t-1]}\Big)
\overset{(c)}{\leq} \sum_{t=1}^{T} \underbrace{I\Big(x_i^{(t)} ;\, S_N^{(t)} \,\Big|\, \big\{S_N^{(k)}\big\}_{k\in[t-1]}\Big)}_{\text{bounded by the result in Theorem 2.1}},
\end{aligned} \tag{2.34}
\]
where: (a) comes from the chain rule; (b) from the data processing inequality applied to $\mathcal{D}_i \to \mathcal{B}_i^{(t)} \to x_i^{(t)}$, where $\mathcal{B}_i^{(t)}$ is the mini-batch sampled from the dataset of node $i$; (c) from the data processing inequality applied to $\mathcal{B}_i^{(t)} \to x_i^{(t)} \to \frac{1}{N}\sum_{i\in\mathcal{N}} x_i^{(t)}$. Combining the results given in the two cases of Theorem 2.1 with equation 2.34 concludes the proof of Corollary 2.1.

2.8.3 Proof of Lemma 2.1

\[
\begin{aligned}
\mathrm{Ent}(\sqrt{\sigma}\,q \,\|\, z) &= \mathrm{Ent}(q' \,\|\, z) = \int f_{q'}(y)\,\log\frac{f_{q'}(y)}{f_z(y)}\,dy
= \int f_{q'}(y)\log f_{q'}(y)\,dy - \int f_{q'}(y)\log f_z(y)\,dy \\
&\overset{(a)}{=} -h(q') + \frac{d}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma| + \frac{1}{2}\int f_{q'}(y)\, y^{T}\Sigma^{-1}y\,dy \\
&\overset{(b)}{=} -h(q) - \frac{d}{2}\log\sigma + \frac{d}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma| + \frac{1}{2}\int f_{q'}(y)\,\mathrm{Tr}\big(\Sigma^{-1}\, y\, y^{T}\big)\,dy \\
&\overset{(c)}{=} -h(q) - \frac{d}{2}\log\sigma + \frac{d}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma| + \frac{1}{2}\,\mathrm{Tr}\Big(\Sigma^{-1}\int f_{q'}(y)\, y\, y^{T}\,dy\Big) \\
&= -h(q) - \frac{d}{2}\log\sigma + \frac{d}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma| + \frac{1}{2}\,\mathrm{Tr}\big(\Sigma^{-1}\,\mathbb{E}_{q'}[q'\, q'^{T}]\big) \\
&\overset{(d)}{=} -h(q) + \frac{d}{2}\log\Big(\frac{2\pi}{\sigma}\Big) + \frac{1}{2}\log|\Sigma| + \frac{\sigma}{2}\,\mathrm{Tr}\big(\Sigma^{-1}\Sigma\big)
= -h(q) + \frac{d}{2}\log\Big(\frac{2\pi}{\sigma}\Big) + \frac{1}{2}\log|\Sigma| + \frac{\sigma d}{2},
\end{aligned} \tag{2.35}
\]
where $\mathrm{Tr}$ denotes the trace; (a) follows from using the multivariate density of the Gaussian vector $z$; (b) from the scaling property of entropy with $q' = \sqrt{\sigma}\,q$ and the identity $y^{T}\Sigma^{-1}y = \mathrm{Tr}(\Sigma^{-1} y\, y^{T})$; (c) from the linearity of the trace; and (d) from the linear transformation of the random vector $q' = \sqrt{\sigma}\,q$ and the fact that $q$ has the same covariance matrix $\Sigma$ as $z$, so $\mathbb{E}_{q'}[q' q'^{T}] = \sigma\Sigma$. The proof of equation 2.30 follows directly by substituting $\sigma = 1$ into equation 2.35 and using the entropy of a Gaussian vector with covariance $\Sigma$.

2.8.4 Overview of MINE

In our empirical evaluation in Section 2.5, we estimate mutual information with the Mutual Information Neural Estimator (MINE) [10], the state-of-the-art method for mutual information estimation. Specifically, given random vectors $X$ and $Z$, and a function family parameterized by a neural network, $\mathcal{F} = \{T_\theta : \mathcal{X}\times\mathcal{Z}\to\mathbb{R}\}_{\theta\in\Theta}$, the following bound holds:
\[
I(X;Z) \geq I_\Theta(X;Z), \tag{2.36}
\]
where $I_\Theta(X;Z)$ is the neural mutual information measure defined as
\[
I_\Theta(X;Z) = \sup_{\theta\in\Theta}\ \mathbb{E}_{P_{XZ}}[T_\theta] - \log\big(\mathbb{E}_{P_X\otimes P_Z}[e^{T_\theta}]\big), \tag{2.37}
\]
$P_X$ and $P_Z$ are the marginal distributions of $X$ and $Z$, respectively, $P_{XZ}$ is the joint distribution of $X$ and $Z$, and $P_X \otimes P_Z$ is the product of the marginals $P_X$ and $P_Z$. As an empirical estimate of $I_\Theta(X;Z)$, MINE is implemented as
\[
\widehat{I(X;Z)}_K = \sup_{\theta\in\Theta}\ \mathbb{E}_{P_{XZ}^{(K)}}[T_\theta] - \log\big(\mathbb{E}_{P_X^{(K)}\otimes P_Z^{(K)}}[e^{T_\theta}]\big), \tag{2.38}
\]
where $P_{(\cdot)}^{(K)}$ is the empirical distribution of $P_{(\cdot)}$ with $K$ IID samples. Finally, solving equation 2.38 (i.e., obtaining the MI estimate) is achieved by solving the following optimization problem via gradient ascent:
\[
\widehat{I(X;Z)}_K = \max_{\theta\in\Theta}\Bigg\{\frac{1}{K}\sum_{k=1}^{K} T_\theta(x_k, z_k) - \log\Bigg(\frac{1}{K}\sum_{k=1}^{K} e^{T_\theta(x_k, \bar{z}_k)}\Bigg)\Bigg\},
\]
where $(x_k, z_k)$ is the $k$-th sample from $P_{XZ}$ and $\bar{z}_k$ is the $k$-th sample from $P_Z$.
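For concreteness, the following is a minimal sketch of this estimator (assuming PyTorch; the statistics network $T_\theta$, its width, and the training step are illustrative choices, not the exact configuration used in our evaluation in Section 2.5):

import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """An illustrative statistics network T_theta: (x, z) -> R."""
    def __init__(self, dim_x, dim_z, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_x + dim_z, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def mine_step(T, optimizer, x, z):
    """One gradient-ascent step on the empirical bound in equation 2.38."""
    z_perm = z[torch.randperm(z.size(0))]     # samples from P_X (x) P_Z
    joint = T(x, z).mean()                    # (1/K) sum_k T(x_k, z_k)
    marginal = torch.logsumexp(T(x, z_perm), dim=0) - math.log(z.size(0))
    loss = -(joint - marginal)                # ascend the lower bound
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return (joint - marginal).item()          # current MI estimate (nats)

Iterating mine_step over mini-batches of paired samples drives the estimate toward the supremum in equation 2.38.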
Chapter 3

HeteroSAg: Secure Aggregation with Heterogeneous Quantization in Federated Learning

Secure model aggregation across many users is a key component of federated learning (FL) systems. The state-of-the-art protocols for secure model aggregation, which are based on additive masking, require all users to quantize their model updates to the same level of quantization. This severely degrades their performance due to lack of adaptation to the available communication resources, e.g., bandwidth, at different users. As the main contribution of this chapter, we propose HeteroSAg, an algorithm that allows secure model aggregation while using heterogeneous quantization. HeteroSAg enables the edge users to adjust their quantization in proportion to their available communication resources, which can provide a substantially better trade-off between the accuracy of training and the communication time. Our proposed algorithm is based on a grouping strategy: we partition the users into groups and partition the local model updates of the users into segments. Instead of applying the secure aggregation protocol to the entire local model update vector, it is applied on segments with specific coordination between users. We further demonstrate how HeteroSAg can enable Byzantine robustness while simultaneously achieving secure aggregation. Finally, we prove the convergence guarantees of HeteroSAg under heterogeneous quantization in the non-Byzantine scenario.

3.1 Introduction

The state-of-the-art secure aggregation protocols with additive masking in FL [12, 102, 54, 123] have limitations associated with:

• (System heterogeneity) They require all users to quantize their model updates to the same level of quantization (to guarantee correct decoding; see Section 3.3.2), even if the users have different communication resources, such as transmission rates. The lack of adaptation to the speed of the available network (3G, 4G, 5G, Wi-Fi) and to the fluctuation of the network quality over time severely degrades the performance of these protocols. More specifically, making all users use a low-level quantizer keeps the communication time small but decreases the test accuracy, while using a high-level quantizer increases the test accuracy at the expense of increased communication time.

• (Robustness) Secure aggregation protocols make the adaptation of the existing state-of-the-art defense strategies [14, 116, 91, 111, 112] against Byzantine users difficult to implement, since the server only receives a masked model update from each user, while the success of these strategies relies on having each user's individual clear model update.

• (Communication efficiency) The bandwidth expansion, which measures the ratio between the size of the encoded model in bits and the size of the clear model. This expansion results from the additional O(log N) bits that must be communicated for each scalar in the model update vector, where N is the total number of users, to guarantee correct decoding. Hence, this expansion makes these protocols ineffective with aggressive quantization, especially for large N [18, 57, 16].

Overcoming the aforementioned limitations, specifically the one associated with system heterogeneity, is a challenging problem, as illustrated in detail in Section 3.3.2. Towards solving these limitations, we propose HeteroSAg.

3.1.1 Main contributions

HeteroSAg has the following four salient features:

1. HeteroSAg protects the privacy of the local model update of each individual user in the strong information-theoretic sense, by masking the model update of each user such that the mutual information between the masked model and the clear model is zero.

2. HeteroSAg allows heterogeneous quantization. This enables the edge users to adjust their quantization in proportion to their available communication resources, which can result in a substantially better trade-off between the accuracy of training and the communication time.
3. HeteroSAg further enables robustness against Byzantine users, by incorporating distance-based defense mechanisms such as coordinate-wise median [116].

4. HeteroSAg reduces the bandwidth expansion. For instance, we demonstrate that for the case of $N = 2^{10}$ users using single-bit quantization, the bandwidth expansion factor of HeteroSAg is 4×, as opposed to 11× for SecAg.

We provide theoretical convergence guarantees for HeteroSAg under heterogeneous quantization for convex loss functions in the non-Byzantine setting. Furthermore, using a neural network on a real-world dataset, we demonstrate the efficiency of the heterogeneous quantization given by HeteroSAg. Specifically, we show that we can achieve accuracy close to the baseline case (no quantization) with the same communication time as the case when all users use a 1-bit quantizer. We also show that we can achieve ∼15% higher test accuracy compared with the setting of homogeneous quantization with a 1-bit quantizer, while the communication time is the same for both settings. We then experimentally demonstrate the resiliency of HeteroSAg in the presence of Byzantine users under three different attacks, comparing it to the conventional federated averaging scheme [78] on two different datasets.

3.1.2 Related works

The authors in [35] provide two heterogeneous quantization algorithms for distributed ML in the absence of a central server to reduce the communication cost, but without any privacy guarantee for the model updates of the users. Our work differs from [35], since our objective is to provide a scheme that not only allows heterogeneous quantization, but also guarantees the privacy of the users' models through secure model aggregation. We also consider a network topology with a parameter server; therefore, the two setups are not comparable. We further highlight that our objective in this work is not to design a new quantization scheme, but to provide a general method that enables heterogeneous quantization while performing secure aggregation.

In recent work, Byzantine-robust secure aggregation algorithms have been proposed [101, 94]. The scheme in [94] is designed for two honest (non-colluding) servers that both interact with the mobile users and communicate with each other to carry out a secure two-party protocol. Unlike that work, the authors in [101] develop BREA, a single-server Byzantine-resilient secure training framework, to facilitate robust and privacy-preserving training architectures for FL. Our work also achieves Byzantine-resilient secure aggregation with a single server, by simply incorporating state-of-the-art Byzantine-robust algorithms that have provable convergence guarantees, such as the coordinate-wise median of [116], without either extra computation cost for the users or extra communication cost beyond the original cost of SecAg. The per-user communication cost of HeteroSAg is O(d + N), as opposed to O(N² + Nd) for BREA and O(N + Nd) for the generalized BREA, where d and N are the model size and the number of users, respectively. Furthermore, the per-user computation cost of HeteroSAg is O(N² + d log N), as opposed to O(dN² + Nd log² N) for BREA and the generalized BREA. We note that the upper bound on the number of Byzantine nodes tolerated by HeteroSAg, B ≤ ⌈0.25G⌉ − 1, where G is the number of groups given by HeteroSAg, is smaller than the upper bound of the coordinate-wise median scheme in [116].
However, our proposal is primarily developed to enable secure aggregation with heterogeneous quantization at different users, while incorporating a defense technique against Byzantine nodes comes as an extra feature. Additionally, the scheme in [116] on its own does not provide privacy for the local models of the users. Therefore, we conclude that HeteroSAg is the first scheme that achieves secure aggregation with heterogeneous quantization while providing Byzantine resiliency.

3.2 Background

Secure aggregation (SecAg), e.g., [18], is a key component of FL that enables the distributed training process while preserving the privacy of the users. We summarize SecAg in the following five steps, considering S ≜ N ≜ {1, . . . , N}, where N is the total number of nodes. We provide this summary because the protocol is a key building block of HeteroSAg.

Step 1 (Sharing keys and masks): Users first establish secure communication channels between them using pairwise keys obtained through a key exchange protocol such as Diffie-Hellman key agreement [27]; all communication is forwarded through the server. Each pair of users $i, j \in \mathcal{N}$ agrees on a pairwise random seed $s_{i,j}$ using Diffie-Hellman key agreement, such that $s_{i,j}$ is a function of the public key $s_j^{PK}$ of user $j$ and the private key $s_i^{SK}$ of user $i$. At the end, each node $i \in \mathcal{N}$ holds the set of agreed keys $\{s_{i,j}\}_{j\in\mathcal{N}\setminus i}$. Moreover, by the key generation of Diffie-Hellman key agreement [27], the pairwise key is symmetric, i.e., $s_{i,j} = s_{j,i}$. Furthermore, the server holds the public keys $s_j^{PK}$ of all $j \in \mathcal{N}$. In addition, user $i$ creates a private random seed $b_i$. The role of $b_i$ is to prevent the privacy breaches that may occur if user $i$ is only delayed instead of dropped∗ (or declared dropped by a malicious server), in which case the pairwise masks alone are not sufficient for privacy protection. Further discussion of the role of $b_i$ is given in Step 5.

∗Dropped users are those who fail to send their masked model to the server; the server does not receive the model updates of these users for the current round. Delayed users, on the other hand, do send their model updates to the server, but the updates experience high delay before being received. Although these users have already sent their models, the server treats them as dropped because of the high delay, and therefore does not include their masked models in the model aggregation step.

Step 2 (Secret sharing): Each user $i \in \mathcal{N}$ secret-shares its private key $s_i^{SK}$ as well as $b_i$ with the other users in the system via Shamir's secret sharing [95]. To ensure that the local model is private against an adversarial server that tries to learn information about the local models of the honest users, while the mobile users are honest and do not collude with the server, the threshold of the secret sharing scheme should be ⌈N/2⌉ + 1. For the case where users are adversaries, no matter how the threshold is set, users on their own learn nothing about other users.
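For intuition, the following toy sketch shows the pairwise seed agreement of Step 1 (plain Python; the small prime modulus and generator are illustrative placeholders, whereas a real deployment would use standardized Diffie-Hellman parameters):

import random

P, G = 2_147_483_647, 5                  # toy public modulus and generator

def keypair():
    sk = random.randrange(2, P - 1)      # private key s^SK
    return sk, pow(G, sk, P)             # public key s^PK = G^sk mod P

sk_i, pk_i = keypair()
sk_j, pk_j = keypair()
s_ij = pow(pk_j, sk_i, P)                # user i computes (s_j^PK)^(s_i^SK)
s_ji = pow(pk_i, sk_j, P)                # user j computes (s_i^PK)^(s_j^SK)
assert s_ij == s_ji                      # the symmetric pairwise seed s_{i,j}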
Step 3 (Quantization): SecAg and cryptographic protocols require the input vector elements to be integers and use modular arithmetic to transmit these vectors. Since the model update of each user takes real values, quantization is needed first so that SecAg can be applied. Without loss of generality, and for ease of analysis, we use the $K$-level quantizer of [106] to quantize the model update $x_i$, for $i \in S$. We assume that the elements of each model $x_i$, for $i = 1, \ldots, N$, fall in the range $[r_1, r_2]$. Let $0 \leq l < K_g$, where $K_g$ is the number of quantization levels, be the integer such that $x_i(k) \in [T(l), T(l+1)]$, where $T(l) = r_1 + l\Delta_{K_g}$ and $\Delta_{K_g} = \frac{r_2 - r_1}{K_g - 1}$ is the quantization interval. Then
\[
Q_{K_g}(x_i(k)) = \begin{cases} T(l+1) & \text{with probability } \dfrac{x_i(k) - T(l)}{T(l+1) - T(l)}, \\[4pt] T(l) & \text{otherwise.} \end{cases} \tag{3.1}
\]
The output of the quantizer, $\bar{x}_i(k) = Q_{K_g}(x_i(k))$, takes a discrete value from the set $\{r_1, r_1 + \Delta_{K_g}, r_1 + 2\Delta_{K_g}, \ldots, r_2 - \Delta_{K_g}, r_2\}$.

Step 4 (Encoding): Following the quantization step, the set of users $S$ starts the encoding process on $\{\bar{x}_i(k)\}_{i\in S}$, for $k = 1, \ldots, |\bar{\mathbf{x}}_i|$, by first mapping the outputs of the quantizer from the $K_g$ real values in the discrete set $\{r_1, r_1 + \Delta_{K_g}, \ldots, r_2\}$ to integer values in the range $[0, K_g - 1]$; the mapping sends $r_1$ to $0$, $r_2$ to $K_g - 1$, and so on. The encoding process is completed by letting each pair of users in $S$ use their pairwise random seeds to generate random 0-sum pairs of mask vectors that provide privacy for the individual models. The output vector of the encoder is given by
\[
\mathbf{y}_{S,i} = \bar{\mathbf{x}}_i + \mathrm{PRG}(b_i) + \sum_{j: i<j} \mathrm{PRG}(s_{i,j}) - \sum_{j: i>j} \mathrm{PRG}(s_{j,i}) \mod R, \tag{3.2}
\]
where $\mathbf{y}_{S,i}$ is a vector of $|\bar{\mathbf{x}}_i|$ elements, and $R = |S|(K_g - 1) + 1$ ensures that all possible aggregate vectors from the $|S|$ users are representable without overflow at the server. PRG is a pseudo-random generator used to expand the different seeds into vectors in $\mathbb{Z}_R$ that mask the users' local models.

Step 5 (Decoding): From the subset of surviving users, the server collects either the shares of the private keys belonging to the dropped users, or the shares of the private seeds belonging to the surviving users (but not both). The server then reconstructs the private seed of each surviving user, and the pairwise seeds $s_{i,j}$ of each dropped user $i$; it reconstructs $s_{i,j}$ by combining the reconstructed private key $s_i^{SK}$ with the corresponding public key $s_j^{PK}$, which is available at the server (recall that the server holds the public keys of all users). The server removes the masks of the dropped users from the aggregate of the masked models. Finally, the server computes the aggregated model
\[
\mathbf{x}_{\mathcal{U}} = \sum_{i\in\mathcal{U}} \big(\mathbf{y}_{S,i} - \mathrm{PRG}(b_i)\big) - \sum_{i\in\mathcal{D}} \Big(\sum_{j: i<j} \mathrm{PRG}(s_{i,j}) - \sum_{j: i>j} \mathrm{PRG}(s_{j,i})\Big) \mod R = \sum_{i\in\mathcal{U}} \bar{\mathbf{x}}_i \mod R, \tag{3.3}
\]
where $\mathcal{U}$ and $\mathcal{D}$ denote the sets of surviving and dropped users, respectively. The decoding process is completed by mapping the global model from $\mathbb{Z}_R$ back to the corresponding values in the discrete set of real numbers $\{|\mathcal{U}|r_1, |\mathcal{U}|r_1 + \Delta_{K_g}, |\mathcal{U}|r_1 + 2\Delta_{K_g}, \ldots, |\mathcal{U}|r_2 - \Delta_{K_g}, |\mathcal{U}|r_2\}$.
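The following toy sketch walks through equations 3.2 and 3.3 for a dropout-free round (assuming Python with numpy; the PRG is modelled by seeding numpy's generator with the shared seeds, and the seed values and sizes are toy placeholders):

import numpy as np

N, d, K = 3, 4, 2
R = N * (K - 1) + 1                       # R = |S|(K_g - 1) + 1

def prg(seed):                            # expand a seed to a vector in Z_R
    return np.random.default_rng(seed).integers(0, R, size=d)

s = {(i, j): 100 * i + j for i in range(N) for j in range(N) if i < j}
b = [7, 8, 9]                             # private seeds b_i
x = np.random.default_rng(0).integers(0, K, size=(N, d))  # quantized updates

# Equation 3.2: each user adds its private mask and the 0-sum pairwise masks.
y = [(x[i] + prg(b[i])
      + sum(prg(s[(i, j)]) for j in range(N) if i < j)
      - sum(prg(s[(j, i)]) for j in range(N) if j < i)) % R
     for i in range(N)]

# Equation 3.3 with no dropouts: the server removes each b_i mask;
# the pairwise masks cancel in the sum, recovering the aggregate.
agg = (sum(y) - sum(prg(b[i]) for i in range(N))) % R
assert np.array_equal(agg, x.sum(axis=0) % R)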
In the following, we discuss the importance of the private mask $b_i$ in equation 3.2 for preserving the privacy of the delayed model of user $i$. According to the SecAg protocol, the server treats any user $i$ whose masked model $\mathbf{y}_{S,i}$ is delayed as a dropped user. Hence, following the decoding step, the server asks the set of surviving users for the shares of the private key $s_i^{SK}$ of the delayed user $i$. Obtaining $s_i^{SK}$ allows the server to reconstruct the set of agreed keys $\{s_{i,j}\}_{j\in\mathcal{N}\setminus i}$, and hence remove the corresponding masks from the aggregate of the masked models. Although the server then knows $\{s_{i,j}\}_{j\in\mathcal{N}\setminus i}$, the privacy of the local model of node $i$ is still preserved thanks to the private mask $\mathrm{PRG}(b_i)$, as shown in equation 3.2. We provide a simple illustrative example for SecAg in Appendix 3.7.1.

3.3 Problem Formulation

We first describe the problem of secure aggregation with heterogeneous quantization. We then explain why the conventional SecAg protocol cannot be applied directly to this problem.

3.3.1 System Model

We consider an FL system that consists of a central server and a set $\mathcal{N} = \{1, \ldots, N\}$ of $N$ mobile users with heterogeneous communication resources. These $N$ users train an ML model locally on their local datasets, as described in the introduction. We also consider a set $Q = \{Q_{K_0}, Q_{K_1}, \ldots, Q_{K_{G-1}}\}$ of element-wise stochastic quantizers, e.g., [106], with $G$ different levels, where $K_g$ is the number of quantization levels of the quantizer $Q_{K_g}$ and $K_0 < K_1 < \cdots < K_{G-1}$, instead of the single quantizer used in SecAg. In this problem, we assume that users are already clustered into $G$ groups based on their communication resources. Each user $i$ in group $g$ can quantize its model update $x_i$ using quantizers from a pre-assigned set with levels $\mathcal{K}_g = \{K_0, K_1, \ldots, K_g\}$, where $K_g$ is the highest quantization level that can be used by the users in group $g$ and is suitable for their transmission rates.†

†The problem of optimally clustering the users based on their transmission rates, or of optimally assigning the quantizers to the users, is not the main scope of this chapter. Instead, our focus is to provide an approach that allows secure aggregation when different quantizers are used at different users, which is a challenging problem, as we show in Section 3.3.2.

Threat model: The server is honest, in that it follows the protocol as specified, but it can be curious and try to extract useful information about the training data of the users from their received models. The users are curious and can only collude with each other, such that any colluding set of users only knows the models of the users in that set. Furthermore, $B$ of the $N$ available users are malicious and may share false information during protocol execution, or send malicious updates to the server.

At a high level, we want to design a scheme that achieves (1) secure model aggregation, where the server can only decode the aggregate model from all users, while users are allowed to use different quantizers, and (2) Byzantine resilience and secure aggregation simultaneously. We formalize the objective in Section 3.3.3. We now discuss why SecAg cannot be applied directly in our setting, where users use different quantizers.‡

3.3.2 Challenges

To describe the main challenge of applying secure aggregation protocols with additive masking (including SecAg) to the case where users use heterogeneous quantization, we consider the following simple example, starting with the case of homogeneous quantization.

Example 3.1. We consider two users, where user $i$ has an input $x_i \in \mathbb{R}$, and a central server, which should only decode the sum $x = x_1 + x_2$. User 1 is assigned the quantization levels $\mathcal{K}_0 = \{2\}$, while user 2 is assigned $\mathcal{K}_1 = \{2, 4\}$.
The encoding processes for the two users are as follows:

a) Homogeneous quantization: As a first step, each user quantizes its input $x_i$ using the same $K = 2$ quantization levels, where we assume without loss of generality that the output of the quantizer is $\bar{x}_i \in \{0, 1\}$, for $i = 1, 2$. The encoded messages $y_1$ and $y_2$ from the two users, and the decoded message $x$ at the server, are given by
\[
y_1 = \bar{x}_1 + Z_{12} \bmod R, \qquad y_2 = \bar{x}_2 - Z_{12} \bmod R, \qquad x = y_1 + y_2 \bmod R = \bar{x}_1 + \bar{x}_2 \bmod R, \tag{3.4}
\]
where the mask $Z_{12}$ is drawn uniformly at random from $[0, R)$. Working in the space of integers mod $R$ and sampling the masks uniformly over $[0, R)$ guarantees that each user's encoded message is indistinguishable from its own input (mutual information $I(\bar{x}_i; y_i) = 0$). Correct decoding in equation 3.4 is guaranteed for two reasons: first, the two users and the server work in the same space of integers mod $R$; second, summation and the mod operation commute, so the pair of masks cancels out. We note that choosing $R = 3$ ensures that all possible outputs are representable without any overflow.

‡The difficulty of applying SecAg in the presence of Byzantine users is described in the second bullet of the introduction.

In our problem, users are supposed to use different quantization levels in order to compress their models to the size (in bits) that suits their available communication resources. For instance, user 2 might want to quantize its model using four quantization levels, matching its channel bandwidth. In fact, allowing users to transmit different numbers of bits under SecAg requires the size of the integer space used for encoding, e.g., $R$ in equation 3.4, to differ across users (the size of the masked message $y_i$ in equation 3.4 is $\lceil \log R \rceil$ bits). Hence, with different moduli at the users, we end up with incorrect decoding, as shown in the following case.

b) Heterogeneous quantization: User 1 and user 2 (assuming user 2 has higher bandwidth than user 1) quantize their inputs $x_1$ and $x_2$ using $K = 2$ and $K = 4$ quantization levels, respectively. We assume without loss of generality that the outputs of the two quantizers are $\bar{x}_1 \in \{0, 1\}$ and $\bar{x}_2 \in \{0, \ldots, 3\}$. Assuming further, without loss of generality, that user 1 and user 2 use mod 2 and mod 4 in equation 3.4, respectively, instead of the same mod $R$, and taking $\bar{x}_1 = 1$, $\bar{x}_2 = 3$, and $Z_{12} = 1$, the decoded output is $x = y_1 + y_2 = 2$ instead of the true output $x = 4$. This confirms that using different moduli at the users results in incorrect decoding. Another issue with different moduli is that the integer spaces from which the users draw their masks uniformly at random also differ, so SecAg's approach to generating the pairwise masks no longer guarantees 0-sum pairs of masks. For instance, generating $Z_{12} = 3$ at both user 1 and user 2 in Case (b) is not possible, because $Z_{12} = 3$ belongs only to the integer space that user 2 uses to generate its masks.

One method to cancel out masks that belong to different integer spaces is to use modulus $D$ at the server, where $D$ is an arbitrary integer, and let all users jointly choose a tuple of masks, whose sum mod $D$ equals 0, uniformly at random from a set of possible tuples, where each user's mask belongs to its own integer space.
The main issue with this approach is that whenever the transmitted masked model of any user wraps around, the masks are not cancelled out at the server side and the aggregated model is distorted. We consider the following example for illustration (a minimal code sketch of both cases is given at the end of this subsection).

Example 3.2. Assume three users with quantized model update vectors $\bar{\mathbf{x}}_1 \in [0, 1]^m$, $\bar{\mathbf{x}}_2 \in [0, 2]^m$, and $\bar{\mathbf{x}}_3 \in [0, 2]^m$, of dimension $m$. Without loss of generality, we assume the masks of user 1, user 2, and user 3 take values uniformly at random over $[0, 1]^m$, $[0, 5]^m$, and $[0, 5]^m$, respectively. The transmitted masked models and the sum of the masked models at the server are given by
\[
\mathbf{y}_1 = \bar{\mathbf{x}}_1 + \mathbf{Z}_1 \bmod 2, \quad \mathbf{y}_2 = \bar{\mathbf{x}}_2 + \mathbf{Z}_2 \bmod 6, \quad \mathbf{y}_3 = \bar{\mathbf{x}}_3 + \mathbf{Z}_3 \bmod 6, \quad \mathbf{x}_{\{1,2,3\}} = \mathbf{y}_1 + \mathbf{y}_2 + \mathbf{y}_3 \bmod 6. \tag{3.5}
\]
If the users choose their tuple of masks uniformly at random from the set of tuples in Table 3.1, the mask of each user is uniformly distributed over its mask range. This guarantees each user's model privacy in the strong information-theoretic sense, i.e., $I(\mathbf{y}_i, \bar{\mathbf{x}}_i) = 0$, for $i = 1, 2, 3$. However, the main limitation of this approach is that there is no guarantee of correct decoding. In particular, once the masked models are added together, the masks cancel out only when no overflow occurs in the transmitted masked model of any user; otherwise, the sum of the users' models is distorted, which happens with non-negligible probability. For example, with the tuple of masks $(Z_1(k), Z_2(k), Z_3(k)) = (1, 1, 4)$ and the $k$-th elements of the model updates $(\bar{x}_1(k), \bar{x}_2(k), \bar{x}_3(k)) = (1, 0, 0)$, an overflow occurs at user 1 ($y_1 = 0$), and the sum becomes $x_{\{1,2,3\}}(k) = 5$ instead of $x_{\{1,2,3\}}(k) = 1$.

Table 3.1: Tuples of masks that the users could use in Example 3.2.
Z_1: 0 0 0 0 0 0 1 1 1 1 1 1
Z_2: 1 2 3 4 5 0 1 2 3 4 5 0
Z_3: 5 4 3 2 1 0 4 3 2 1 0 5

To overcome the aforementioned issues of incorrect decoding and 0-sum pairwise mask generation in SecAg under heterogeneous quantization, SecAg can leverage a multi-group structure. In the multi-group structure, the set of users in group $S_g$, for $g \in [G]$, where $G$ is the number of groups, uses the same quantizer $Q_{K_g}$, whose $K_g$ levels are proportional to their communication resources. Each group then applies the SecAg protocol independently of all other groups. However, in this strategy the server decodes the aggregate of the model updates from each group, which implies knowing the average gradient/model of each group. Hence, this strategy is not robust against attacks such as the membership-inference attack [80, 96, 82] and the gradient inversion attack [126, 42, 117], especially when the group size is small. In a gradient inversion attack, the server can reconstruct multiple images used in training by a user (or subset of users) from the averaged gradients corresponding to these images, as shown in [42, 117]. In a membership-inference attack, the server can breach the users' privacy by inferring whether a specific data point was used in training by a certain subset of users, using the average model from this targeted set of users. By letting the server observe the average model of a small group, the attack becomes much stronger, and the inferred information may directly reveal the identity of the users to whom the data belongs.
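To make the two cases above concrete, the following minimal sketch (plain Python; the values follow Examples 3.1 and 3.2) shows that a common modulus decodes correctly, while per-user moduli do not:

# Homogeneous quantization (Example 3.1a): one modulus R for everyone.
R = 3
x1, x2, Z12 = 1, 1, 2
y1 = (x1 + Z12) % R
y2 = (x2 - Z12) % R
assert (y1 + y2) % R == (x1 + x2) % R   # masks cancel; the sum is recovered

# Heterogeneous moduli (Example 3.2): masks live in different integer spaces.
x = [1, 0, 0]                # k-th elements of the three quantized updates
Z = [1, 1, 4]                # a valid 0-sum tuple (mod 6) from Table 3.1
mods = [2, 6, 6]
y = [(xi + zi) % m for xi, zi, m in zip(x, Z, mods)]
print(sum(y) % 6)            # prints 5, not the true sum 1: user 1 overflowed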
Our goal is to leverage the benefits of the grouping strategy while limiting the threat of membership/inversion attacks. In particular, we want to address the following question: "Can we design a grouping strategy that allows secure model aggregation with heterogeneous quantization such that the server cannot unmask (decode) the entire average gradient/model of any subset of users?"

In this chapter, we propose HeteroSAg, a scheme based on a specific segment grouping strategy. This segment grouping strategy leverages the multi-group structure to perform secure model aggregation with heterogeneous quantization while preventing the server from unmasking (decoding) the entire average model of any subset of users. At a high level, our proposed segment grouping strategy partitions the edge users into groups and divides the local model updates of these users into segments. Instead of applying the secure aggregation protocol to the entire local model update vectors, it is applied on segments with specific coordination between users, allowing different segments to be quantized by different quantizers. More specifically, segments from different sets of users are grouped such that they are quantized by the same quantizer while being encoded and decoded together at the server, independently of all other segments. This is different from SecAg, where the entire local model updates from all users (or from a subset of users in the multi-group structure) are quantized by the same quantizer while being encoded and decoded together. Furthermore, unlike SecAg with the multi-group structure, where the server can decode the entire average model of each group, the key objective of our segment grouping strategy is to limit the ability of a curious server to launch inference/inversion attacks on an arbitrary subset of groups. HeteroSAg achieves this by allowing the server to decode only a fraction of at most $\frac{2}{G}$ of the segments of the average gradient/model of any set of users, which approaches 0 for a sufficiently large number of segments $G$. The remaining segments of this average model interfere with segments from the average models of some other groups. We quantify the smallest fraction of segments that the server cannot successfully decode from the aggregated model of any set of users $S \subsetneq \mathcal{N}$ by the inference robustness $\delta$. In the following subsection, we formally define the inference robustness and discuss its implications.

3.3.3 Performance metric

Let $\theta_i^p = [\theta_i^0, \ldots, \theta_i^{G-1}]$ denote the segmentation of the local model $\theta_i$ of user $i$, and $\bar{\theta}_S^p = [\bar{\theta}_S^0, \ldots, \bar{\theta}_S^{G-1}]$ denote the segmentation of the average model $\bar{\theta}_S$ of the set of users $S$, where $\theta_i^l, \bar{\theta}_S^l \in \mathbb{R}^{d/G}$ for $l = 0, \ldots, G-1$, $i \in [N]$, and $S \subsetneq \mathcal{N}$. We define $\mathcal{A}(\{\theta_1^p, \ldots, \theta_N^p\})$ to be an arbitrary segment grouping strategy that leverages the multi-group structure for performing secure model aggregation at the segment level. Strategy $\mathcal{A}$ groups sets of segments from $\{\theta_1^p, \ldots, \theta_N^p\}$ such that each set is encoded and decoded together, independently of all other segments. We define $\mathcal{A}(\{\theta_1^p, \ldots, \theta_N^p\})$ to be feasible if it satisfies the following three conditions: 1) The server receives only a masked model $\tilde{\theta}_i$ from user $i \in \mathcal{N}$, where the mutual information $I(\theta_i; \tilde{\theta}_i) = 0$.
2) The server can decode at most a fraction $\alpha_S$ of the segments of the average model $\bar{\theta}_S^p$, where $S \subsetneq \mathcal{N}$, while each segment in the remaining $1 - \alpha_S$ fraction of segments interferes with segments from the average models $\bar{\theta}_{S^*}^p$ of other sets of users $S^*$, where $S^* \subseteq S^c$ such that $S \cup S^c = \mathcal{N}$. 3) The server can decode the average model $\theta$ from all users.

Now we define our inference robustness metric.

Definition 1. (Inference robustness $\delta$) For a feasible segment grouping strategy $\mathcal{A}(\{\theta_1^p, \ldots, \theta_N^p\})$, the inference robustness $\delta(\mathcal{A}) \in [0, 1]$ is given by
\[
\delta(\mathcal{A}) = \min\big\{1 - \alpha_S : S \subsetneq \{1, \ldots, N\}\big\}, \tag{3.6}
\]
where $1 - \alpha_S$ is the fraction of segments of $\bar{\theta}_S^p$ that interfere with segments from the average models $\bar{\theta}_{S^*}^p$, where $S^* \subseteq S^c$ such that $S \cup S^c = \mathcal{N}$.

Remark 3.1. The underlying objective of a good segment grouping strategy $\mathcal{A}(\{\theta_1^p, \ldots, \theta_N^p\})$ is to limit the ability of a curious server to launch inference/inversion attacks on an arbitrary subset of users $S \subsetneq \mathcal{N}$, by allowing the server to decode only a fraction $\alpha_S \in [0, 1)$ of the segments of the average model $\bar{\theta}_S^p$. This is different from the worst-case scenario, where the server can decode the entire target model $\bar{\theta}_S$. The segments of the average model $\bar{\theta}_S^p$ that interfere with segments from users outside the set $S$ can be viewed as clear segments plus random noise, where the number of noisy segments is determined by the inference robustness $\delta(\mathcal{A})$. The worst-case inference robustness of HeteroSAg is $\delta = \frac{G-2}{G}$, which approaches one for a sufficiently large number of segments.

3.4 The proposed HeteroSAg

We first present HeteroSAg, and then state its theoretical performance guarantees.

3.4.1 HeteroSAg for heterogeneous quantization

HeteroSAg starts by letting the set of $N$ users share their keys and masks according to Step 1 in Section 3.2. Each user $i \in \mathcal{N}$ then uses Step 2 to secret-share its masks with all other users. For clarity and ease of analysis of the proposed scheme, we first consider the case where users are already clustered into $G$ groups based on their communication resources, each with the same number of users, $|S_g| = \frac{N}{G} = n$, for $g \in [G]$, where $[G] := \{0, \ldots, G-1\}$. The case of a different number of users in each group is presented in Appendix 3.7.9. Without loss of generality, we consider that users in higher groups have more communication resources than users in lower groups. Following the secret sharing step, each local model update vector $\{x_i\}_{i\in\mathcal{N}}$ is partitioned equally into $G$ segments, such that the segmented model update of user $i$ is given by $\mathbf{x}_i^p = [\mathbf{x}_i^0, \mathbf{x}_i^1, \ldots, \mathbf{x}_i^{G-1}]^T$, where $\mathbf{x}_i^l \in \mathbb{R}^{d/G}$ for $l \in [G]$. Likewise, the aggregated model update $\mathbf{x}_S$ at the server from any set of users $S \subsetneq \mathcal{N}$ can be viewed as a set of $G$ segments $\mathbf{x}_S^p = [\mathbf{x}_S^0, \mathbf{x}_S^1, \ldots, \mathbf{x}_S^{G-1}]^T$. Finally, instead of the direct implementation of the SecAg protocol, where (1) all the unsegmented vectors $\{x_i\}_{i\in\mathcal{N}}$ are quantized by the same quantizer, (2) all $N$ users jointly encode (mask) the $N$ quantized vectors together, and (3) all $N$ encoded model updates are decoded together at the server, we apply the segment grouping strategy
$\mathcal{A}_{\mathrm{HeteroSAg}}(\{\mathbf{x}_1^p, \ldots, \mathbf{x}_N^p\})$, under which the SecAg protocol is applied at the segment level: (1) different sets of segments are quantized using different quantizers, (2) different sets of users jointly encode their quantized segments together, independently of all other users, and (3) the jointly encoded segments are also jointly decoded at the server.

Algorithm 1: The SS matrix B for HeteroSAg
for g = 0, . . . , G − 2 do
    for r = 0, . . . , G − g − 2 do
        l = 2g + r
        B(l mod G, g) = B(l mod G, g + r + 1) = g
    end
end
The remaining entries of B hold ∗.

Each set of segments, together with the corresponding set of users that jointly executes the SecAg protocol on it according to the segment grouping strategy $\mathcal{A}_{\mathrm{HeteroSAg}}$, is given by the $G \times G$ Segment Selection (SS) matrix $\mathbf{B}$ produced by Algorithm 1 (a minimal code sketch of this construction appears at the end of this subsection). In this matrix, illustrated in the example of Figure 3.1,

Figure 3.1: Segment selection matrix B for G = 5 groups.
         g=0  g=1  g=2  g=3  g=4
l = 0:    0    0    2    ∗    2
l = 1:    0    ∗    0    3    3
l = 2:    0    1    1    0    ∗
l = 3:    0    1    ∗    1    0
l = 4:    ∗    1    2    2    1

the label of each column represents the index of the group (index $g$ of the set of users $S_g$, for $g \in [G]$), where the communication resources of the set of users $S_g$ in group $g$ are smaller than those of the set of users $S_{g'}$ for $g < g'$. The label of each row represents the index $l$ of the segment $\mathbf{x}_i^l$. In this matrix, an entry $\mathbf{B}(l, g) = ∗$ means that the set of users $S = S_g$ executes the SecAg protocol on the set of segments $\{\mathbf{x}_i^l\}_{i\in S_g}$ alone. In other words, $\mathbf{B}(l, g) = ∗$ means that the users in $S_g$ quantize the set of segments $\{\mathbf{x}_i^l\}_{i\in S_g}$ according to Step 3 in Section 3.2 using the quantizer $Q_{K_g}$, and jointly encode the resulting quantized segments together according to Step 4; at the server, these segments are decoded together. Similarly, $\mathbf{B}(l, g) = \mathbf{B}(l, g') = g$ means that the set of users $S = S_g \cup S_{g'}$ corresponding to columns $g$ and $g'$, where $g < g'$, quantizes the set of segments $\{\mathbf{x}_i^l\}_{i\in S_g\cup S_{g'}}$ using the quantizer $Q_{K_g}$ and jointly encodes the resulting quantized segments; at the server, these segments are decoded together. Finally, the server aggregates each set of decoded segments $\{\mathbf{x}_S^l\}$ that results from different sets of users and belongs to the same segment level $l$, and concatenates the aggregated segments over the levels $l \in [G]$ to obtain the global update $\mathbf{x}$.

To illustrate HeteroSAg and understand how its inference robustness is measured, we consider the following example.

Example 3.3. We consider a system with $N$ users and a set of $G = 5$ quantizers $Q = \{Q_{K_0}, Q_{K_1}, Q_{K_2}, Q_{K_3}, Q_{K_4}\}$, where $K_0 < \cdots < K_4$. HeteroSAg execution starts with the $N$ users sharing their keys and masks with each other, after which each user secret-shares its masks with the other users in the system. We consider $G = 5$ groups with $n$ users in each group, where the groups are arranged in ascending order of the communication resources of their users. The local model of each user, $\mathbf{x}_i$ for $i \in \mathcal{N}$, is partitioned equally into $G = 5$ segments $\mathbf{x}_i = [\mathbf{x}_i^0, \mathbf{x}_i^1, \mathbf{x}_i^2, \mathbf{x}_i^3, \mathbf{x}_i^4]^T$, where $\mathbf{x}_i^l \in \mathbb{R}^{m/5}$ for $l \in [5]$. The SS matrix $\mathbf{B}$ used to manage the execution of HeteroSAg is given in Figure 3.1. To further formalize the execution of HeteroSAg for this example, we consider Figure 3.2, in which each set of segments that executes SecAg together is given the same color.
In particular, the set of segments $\{\mathbf{x}_i^0\}_{i\in S_0\cup S_1}$ is quantized by the quantizer $Q_{K_0}$ according to the third step in Section 3.2. Following the encoding step in Section 3.2, the quantizer outputs $\{\bar{\mathbf{x}}_i^0\}_{i\in S}$, where $S = S_0 \cup S_1$, are first mapped from their discrete real range to integer values in the range $[0, K_0 - 1]$. After generating the random 0-sum pairs of masks and the individual masks, the encoded segment of each user $i \in S$ is given by
\[
\mathbf{y}_{S,i}^0 = \bar{\mathbf{x}}_i^0 + \mathrm{PRG}(b_i) + \sum_{j: i<j} \mathrm{PRG}(s_{i,j}) - \sum_{j: i>j} \mathrm{PRG}(s_{j,i}) \mod R, \tag{3.7}
\]
where $R = |S|(K_0 - 1) + 1$, $|S| = 2n$, and $j \in S$, while the PRG is used to expand the different seeds into segments in $\mathbb{Z}_R$.

Figure 3.2: A system with N users partitioned into G = 5 groups, with n users in each group. Each user holds a quantized local model update $\bar{\mathbf{x}}_i$, i ∈ [N]. The segment selection and grouping is completed using the SS matrix B.

The server collects the shares and reconstructs the private seed of each surviving user, and the pairwise seeds of each dropped user. It then uses the PRG together with the reconstructed seeds to expand them into segments in $\mathbb{Z}_R$, where $R = |S|(K_0 - 1) + 1$ and $|S| = 2n$, to be removed from the aggregate of the masked segments. The server then computes the segment
\[
\mathbf{x}_{\mathcal{U}}^0 = \sum_{i\in\mathcal{U}} \big(\mathbf{y}_{S,i}^0 - \mathrm{PRG}(b_i)\big) - \sum_{i\in\mathcal{D}} \Big(\sum_{j: i<j} \mathrm{PRG}(s_{i,j}) - \sum_{j: i>j} \mathrm{PRG}(s_{j,i})\Big) \mod R = \sum_{i\in\mathcal{U}_0} \bar{\mathbf{x}}_i^0 + \sum_{i\in\mathcal{U}_1} \bar{\mathbf{x}}_i^0 \mod R, \tag{3.8}
\]
where $\mathcal{U}_i$ denotes the set of surviving users from $S_i$, for $i = 0, 1$.

The aggregate model update from group 0, after its users' models are fully unmasked, is given in Table 3.2, where $\mathcal{U}_g \subseteq S_g$, for $g \in [4]$, denotes the set of surviving users from group $g$. According to Table 3.2, the server decodes only the last segment of the aggregated model update from group 0, while the other segments of that group interfere with segments from some other groups.§ In particular, the first segment from group 0 (first row of Table 3.2) results from the sum of the first segments of the surviving users in group 0 and group 1, as these segments were encoded together and hence must be decoded together. More generally, the server decodes only one clear segment from each individual group, corresponding to the index marked ∗ in the SS matrix $\mathbf{B}$. It can also be easily seen that the server cannot decode more than a fraction 0.2 of clear segments from the average model of any set of users $S \subsetneq \mathcal{N}$, and hence the inference robustness is $\delta(\mathcal{A}_{\mathrm{HeteroSAg}}) = \frac{4}{5} = 0.8$.

§Decoding a segment $\mathbf{x}_S^l$ of the model update $\mathbf{x}_S$ implies decoding the segment $\theta_S^l$ of the average model of that set $S$. We note that $\theta_S^l = \theta^l + \frac{1}{|S|}\mathbf{x}_S^l$, where $\theta^l$ is the $l$-th segment of the global model $\theta$.

Table 3.2: The aggregated model update $\mathbf{x}_{S_0}$ from group 0 after fully unmasking the model updates of its users.
$\mathbf{x}^0_{\mathcal{U}_0\cup\mathcal{U}_1} = \sum_{i\in\mathcal{U}_0} \bar{\mathbf{x}}_i^0 + \sum_{i\in\mathcal{U}_1} \bar{\mathbf{x}}_i^0$
$\mathbf{x}^1_{\mathcal{U}_0\cup\mathcal{U}_2} = \sum_{i\in\mathcal{U}_0} \bar{\mathbf{x}}_i^1 + \sum_{i\in\mathcal{U}_2} \bar{\mathbf{x}}_i^1$
$\mathbf{x}^2_{\mathcal{U}_0\cup\mathcal{U}_3} = \sum_{i\in\mathcal{U}_0} \bar{\mathbf{x}}_i^2 + \sum_{i\in\mathcal{U}_3} \bar{\mathbf{x}}_i^2$
$\mathbf{x}^3_{\mathcal{U}_0\cup\mathcal{U}_4} = \sum_{i\in\mathcal{U}_0} \bar{\mathbf{x}}_i^3 + \sum_{i\in\mathcal{U}_4} \bar{\mathbf{x}}_i^3$
$\mathbf{x}^4_{\mathcal{U}_0} = \sum_{i\in\mathcal{U}_0} \bar{\mathbf{x}}_i^4$
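The following is a minimal Python sketch of Algorithm 1 (the function name is ours); for G = 5 it reproduces the SS matrix of Figure 3.1:

def ss_matrix(G):
    """Generate the G x G segment selection matrix of Algorithm 1.

    B[l][g] = g' means the users of group g quantize segment l with the
    quantizer Q_{K_{g'}}; '*' marks a segment encoded by one group alone.
    """
    B = [['*'] * G for _ in range(G)]
    for g in range(G - 1):                # g = 0, ..., G - 2
        for r in range(G - g - 1):        # r = 0, ..., G - g - 2
            l = (2 * g + r) % G
            B[l][g] = B[l][g + r + 1] = g
    return B

for row in ss_matrix(5):
    print(row)
# Column g lists, per segment, the quantizer index used by group g;
# e.g. column 0 reads [0, 0, 0, 0, '*'], matching Figure 3.1.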
3.4.2 HeteroSAg for Byzantine-Resilience

We now extend the segment grouping strategy $\mathcal{A}_{\mathrm{HeteroSAg}}$, given by the SS matrix $\mathbf{B}$ generated by Algorithm 1, to further provide Byzantine robustness while simultaneously achieving secure model aggregation. This can be done by integrating $\mathcal{A}_{\mathrm{HeteroSAg}}$ with coordinate-wise defense techniques, such as coordinate-wise median (Median) [116], which have provable convergence guarantees. Integrating Median with HeteroSAg is possible thanks to the design of the SS matrix. In particular, for a given row $l$ out of the available $G$ rows (segment index $l$ out of the $G$ segment indices), the server observes a set of unmasked segments, each of which results from averaging the segments that were encoded together. This differs from the design of other secure aggregation protocols, where the server only receives the masked model of each user and the aggregate of all users' model updates.

Figure 3.3: A system with N users partitioned into G = 5 groups, with n users in each group. Each user holds a local model update $\bar{\mathbf{x}}_i$, i ∈ [N]. The segment selection and grouping is completed using the SS matrix B. Segments with the same color are encoded and decoded together, independently of all other segments.

Observing only the masked model of each user makes the adaptation of the state-of-the-art Byzantine-robust techniques difficult to implement, as these defense techniques rely on observing the individual clear model update of each user in order to compare the updates of different users with each other and remove the outliers. To further illustrate how the design of the SS matrix resolves this limitation of conventional secure aggregation protocols, we consider the following example.

Example 3.4. We consider a system with $N$ users divided equally among $G = 5$ groups of $n$ users each, as illustrated in Figure 3.3. We assume that node 1 in the first group is a Byzantine node. The local model update of each node is divided equally into $G = 5$ segments. Each set of segments that is encoded and decoded together is given the same color in Figure 3.3, where the segments are grouped and colored according to the SS matrix $\mathbf{B}$ of Figure 3.1. As can be seen from Figure 3.3, the server decodes the segments that were encoded together, e.g., the set $\{\mathbf{y}^0_{0,1} = \frac{1}{2n}\sum_{i\in\mathcal{N}_0\cup\mathcal{N}_1} \bar{\mathbf{x}}_i^0,\ \mathbf{y}^0_{2,4} = \frac{1}{2n}\sum_{i\in\mathcal{N}_2\cup\mathcal{N}_4} \bar{\mathbf{x}}_i^0,\ \mathbf{y}^0_3 = \frac{1}{n}\sum_{i\in\mathcal{N}_3} \bar{\mathbf{x}}_i^0\}$. The segments in this set come from the sets of users in the tuple of sets $(S_0 \cup S_1, S_2 \cup S_4, S_3)$, respectively, and belong to the same segment level $l = 0$. Since there is more than one segment in this set, coordinate-wise median can be applied, as demonstrated in Figure 3.3 (a minimal sketch follows).
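As a minimal sketch of this segment-level defense (assuming numpy; the three decoded level-0 segments below are illustrative stand-ins for $\mathbf{y}^0_{0,1}$, $\mathbf{y}^0_{2,4}$, and $\mathbf{y}^0_3$):

import numpy as np

seg_dim = 4
rng = np.random.default_rng(1)
benign = rng.normal(0.0, 0.1, size=(2, seg_dim))   # two benign averages
faulty = np.full((1, seg_dim), 50.0)               # group hit by a Byzantine user
decoded_segments = np.vstack([faulty, benign])

# The server takes the median of each coordinate across the decoded
# segments of the same level; the extreme faulty entries are filtered out.
filtered = np.median(decoded_segments, axis=0)
print(filtered)   # close to the benign values, unaffected by the 50.0 entries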
The coordinate-wise median scheme in [116] is presented for the case where the number of Byzantine users is less than half of the total number of users, i.e., the number of benign models exceeds the number of faulty models. To find the number of Byzantine users that can be tolerated in our setting while using the segment grouping strategy $\mathcal{A}_{\mathrm{HeteroSAg}}$, we consider the worst-case scenario where the Byzantine users are distributed uniformly among the groups. To ensure that each set of unmasked segments belonging to the same level $l$, e.g., the set $\{\mathbf{y}^0_{0,1}, \mathbf{y}^0_{2,4}, \mathbf{y}^0_3\}$ above, contains more benign segments than faulty segments, the number of Byzantine users $B$ should satisfy $B \leq \lceil 0.25G \rceil - 1$. This result comes from the fact that having one Byzantine user in a group makes all segments of the average model of that group faulty. Moreover, in HeteroSAg some segments belong to the aggregated model of two groups. In particular, the faulty model from user 1 in group 0 results in the following faulty segments: $\{\mathbf{y}^0_{0,1}, \mathbf{y}^1_{0,2}, \mathbf{y}^2_{0,3}, \mathbf{y}^3_{0,4}, \mathbf{y}^4_0\}$. Taking this extreme case, where one Byzantine node affects segments from two groups, and requiring the benign segments to outnumber the faulty segments within each set of decoded segments, the number of Byzantine users $B$ should satisfy $B \leq \lceil 0.25G \rceil - 1$. Further discussion of the Byzantine robustness of HeteroSAg is included in Appendix 3.7.11.

3.4.3 Theoretical guarantees of HeteroSAg

We now state our main theoretical results. The proofs of the theorems, propositions, and lemmas are presented in the Appendix.

Theorem 3.1. (Inference robustness) For an FL system with $N$ users clustered into $G$ groups, where the model update of each user is divided equally into $G$ segments, the segment grouping strategy of HeteroSAg achieves
When each local gradient vector is partitioned equally into G segments such that segments are quantized by using the set Q of G quantizers according to the SS matrix B, and by using constant step size η = 1/L, HeteroSAg guarantees E " F 1 J X J t=1 θ (t) !# − F(θ ∗ ) ≤ ||θ 0 − θ ∗ ||2 2ηJ + ησ2 HeteroSAg, (3.10) where σHeteroSAg = (r2−r1) 2 4N2 d G n PG−1 g=0 2(G−g)−1 (Kg−1)2 , and θ 0 is the initial model. Remark 3.3. HeteroSAg has a convergence rate of O(1/J). The term ησHeteroSAg is a residual error in the training which can be reduced by using an adaptive (decreasing) learning rate and by using a set of high level quantizers. Remark 3.4. According to Theorem 1 and the bound on the number of Byzantine nodes given in Section 3.4.2, increasing the number of groups by further partitioning each group out of the G available groups equally into L subgroups results in increasing 1) The number of Byzantine to be tolerated B ≤ ⌈0.25LG⌉ − 1, 2) The inference robustness δ = LG−2 LG . On the other hand, the residual error in Theorem 2 will not increase as stated in the next proposition. Proposition 3.1. (Quantization error) Let G be the number of quantizers to be used in the system, and users are partitioned equally into G groups, by extra partitioning each group equally into L subgroups while using the segment grouping strategy of HeteroSAg, the total quantization error will not be changed and will also be given by σ 2 HeteroSAg. 75 Theorem 3.3. (Privacy leakage and dropout) For HeteroSAg when the number of users in each subgroup is given by n¯ = N LG , where LG is the total number of subgroups, and the dropout probability of each user is p, the probability of privacy leakage, i.e., having only one survived user in any subgroup, is given by P[Privacy leakage] = P(X = 1) = (1 − p)p n¯−1 , (3.11) where having one survived user in any group implies that the server will be able to decode one clear segment from the model of that user, i.e., I(x l i ; y l i ) ̸= 0, for a given segment l, where y l i is the l-th encoded segment from user i. Remark 3.5. The probability in equation 3.11 approaches zero by either having a small probability of dropout p, or by increasing the number of users in each subgroup. Therefore, the number of users in each subgroup makes a trade-off between the benefits of extra partitioning discussed in Remark 4 and the privacy of users’ models. To further illustrate the impact of the subgroup size n¯ on the probability in equation 3.11, we consider the following example. Let the total number of users in each subgroup to be n¯ = 8, and by considering p = 0.1, a typical number for the probability of dropout [17], the probability in equation 3.11 turns out to be 7.2×10−7 , which is negligible. We further note that by using HeteroSAg, the dropout rate of users becomes smaller. The reason for that in HeteroSAg users consider their transmission rates when they choose their quantizers. This decreases their probabilities of being delayed and hence being considered dropped out by the server. Proposition 3.2. [HeteroSAg communication and computation costs] Each user has a computation cost of O(N2 + dn¯) as opposed to O(N2 + dN) for SecAg, where d, N, and n¯ are the model size, total number of users, and number of users in each group, respectively, and the same communication cost O(N + d) as SecAg. However, the bandwidth expansion, which measures the ratio between the size of the encoded model in bits to the size of the clear model, is much lower for HeteroSAg. 
The communication and the computation complexities at the server are the same as in SecAg. 76 Table 3.3: A Comparison between SecAg [18] and our proposed HeteroSAg. Here, N is the total number of nodes, m is the model size, G is the number of groups, n is the number of users in each group, and Kg is the number of quantization levels. SecAg HeteroSAg Adaptive quantizers No Yes Communication complexity O(N + d) O(N + d) Computation complexity O(N2 + dN) O(N2 + dn) Inference robustness 1 G−2 G Byzantine Robustness No Yes Quantization error bound (r2−r1) 2 4N2 m(nG) 1 (K−1)2 (r2−r1) 2 4N2 d G n PG−1 g=0 2(G−g)−1 (Kg−1)2 Probability of local model breach 0 → 0 In Table 3.3, we give a comparison between HeteroSAg and SecAg [18]. As we can observe from this table that HeteroSAg is an adaptive algorithm that allows users to use different quantizers to balance their communication load to their channel quality, while achieving secure model aggregation. This is different from SecAg which requires all users to use the same quantizer to guarantee correct decoding as discussed in Section 3.3.2. Additionally, HeteroSAg achieves Byzantine robustness, while SecAg fails in the presence of Byzantine nodes, as we will demonstrate in Section 3.5. Furthermore, the communication complexity of HeteroSAg is lower than SecAg, while both algorithms have the same communication complexity. Regarding the inference robustness, SecAg achieves inference robustness of 1, meaning that the server will not decode any segment from the average model of any subset of users. On the other hand, HeteroSAg achieves lower inference robustness. However, the inference robustness of HeteroSAg approaches 1 for sufficiently large number of groups. The probability of local model breach when users dropped out in SecAg is 0, while in HeteroSAg this probability approaches 0 when increasing the number of users in each group (Theorem 3). 3.5 Numerical Experiments We run two different experiments to show the performance gains achieved by HeteroSAg. Experiment 1 highlights the benefits of using heterogeneous quantization. The second experiment is to demonstrate 77 how the secure aggregation strategy of HeteroSAg can be effective along with coordinate-wise median against Byzantine users. 3.5.1 Experiment 1 (Heterogeneous quantization) We consider the setup of N = 25 users, where users are equally partitioned into G = 5 groups and each model update vector is equally partitioned into 5 segments. We consider MNIST dataset [70] and use a neural network with two fully connected layers. The details of the neural network is presented in Appendix 3.7.14. For the data distribution, we sort the training data as per class, partition the sorted data into N subsets, and assign each node one partition. We set the number of epochs to be 5, use a batch size of 240, and constant learning rate 0.03. We consider three different scenarios for the performance comparison based on the quantization scheme. The three scenarios apply the same segment grouping strategy given by the SS matrix B in Figure. 3.1 in terms of the encoding and decoding strategy (e.g., the first segment from group 0 and group 1 will be encoded together and decoded together at the server in the same way for the three scenarios), while they are different in the quantization scheme. Quantization. 
Heterogeneous quantization: We consider a set $Q$ of $G = 5$ quantizers with quantization levels $(K_0, K_1, K_2, K_3, K_4) = (2, 6, 8, 10, 12)$, where the assignment of these quantizers follows the pattern given in Figure 3.1. Homogeneous quantization: All segments from all users are quantized using a $K = 2$ level quantizer. No quantization: All segments are represented as 32-bit floating-point numbers.¶

We consider group 0 as a straggler group, which includes users with limited communication resources and low transmission rates. In particular, we let each user in group 0 have a 1 Mb/s transmission rate, while users in higher groups have more than 2 Mb/s, in order to compare the three cases.

¶Using HeteroSAg with no quantization is the same as using FedAvg with no quantization with respect to the test accuracy. The difference between the two schemes is that in HeteroSAg models are encoded, unlike FedAvg, where clear models are sent to the server; this only results in a model of larger size, as described in detail in Appendix 3.7.8. We consider HeteroSAg under the different quantization schemes for a fair comparison of the communication cost and time given in Figure 3.4b.

Figure 3.4: The performance of HeteroSAg under different quantization schemes for the non-IID setting: (a) test accuracy for the MNIST dataset; (b) total communication time.

We have run the same experiment with $N = 100$ users; the results lead to the same conclusion and can be found in Appendix 3.7.12.2. The details of the results in Figure 3.4b are given in Appendix 3.7.12.1. Additionally, we have further evaluated the performance of HeteroSAg on the CIFAR10 dataset in Appendix 3.7.12.2.

Figure 3.4a illustrates that HeteroSAg with heterogeneous quantization achieves accuracy close to the baseline (no quantization). Additionally, after $t = 200$ rounds of communication with the server, the total communication time with heterogeneous quantization is lower than with no quantization by a factor of 5.2×, according to Figure 3.4b. Furthermore, HeteroSAg with heterogeneous quantization maintains superior performance over homogeneous quantization with $K = 2$ levels, with more than 15% improvement in test accuracy, while the communication time is the same for both settings. This confirms our motivation: by adapting the quantization levels to the transmission rates of the users, we can achieve high accuracy with a small training time.
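For concreteness, the following is a minimal sketch (ours) of a standard unbiased stochastic $K$-level quantizer on $[r_1, r_2]$, consistent with the description of equation 3.1 used throughout this section; the randomized rounding rule below is the usual one and should be treated as an illustration rather than the exact implementation:

```python
import torch

def stochastic_quantize(x, K, r1, r2):
    # Map each coordinate of x (assumed in [r1, r2]) to one of the K levels
    # r1, r1 + dK, ..., r2, where dK = (r2 - r1) / (K - 1), rounding up with
    # probability proportional to the residual so that E[Q(x)] = x.
    dK = (r2 - r1) / (K - 1)
    low = torch.floor((x - r1) / dK)
    prob_up = (x - r1) / dK - low
    level = low + (torch.rand_like(x) < prob_up).float()
    return r1 + level * dK
```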
3.5.2 Experiment 2 (Byzantine robustness)

We show how the secure aggregation strategy of HeteroSAg can be effective along with coordinate-wise median against Byzantine users. For this experiment, we consider a setup of $N = 300$ users, of which $B = 18$ are Byzantine. We consider the IID setting, where we randomly split the training data samples into $N = 300$ disjoint subsets and assign each subset to a distinct user. We use a learning rate of 0.06, set the number of epochs to 1, and use a batch size of 40 (the model details are presented in Appendix 3.7.14). The performance of HeteroSAg in the non-IID setting is presented in Appendix 3.7.13.

Scheme. We consider two schemes: HeteroSAg with $G = 75$ groups and $n = 4$ users in each group, combined with coordinate-wise median; and FedAvg implemented with secure aggregation [78].

Attack model. We assume that the Byzantine users are distributed over 18 groups. We note that since the focus here is the presence of Byzantine users during secure aggregation, where users send masked models to the server, Byzantine users can send arbitrary faulty models with extreme values without being individually decoded and hence filtered out. Gaussian attack: each Byzantine user replaces its model parameters with entries drawn from a Gaussian distribution with mean 0 and standard deviation $\sigma = 5$. Sign-flip: each Byzantine user multiplies its model update by $-5$. Label-flip: each Byzantine user subtracts 9 from its labels and then multiplies its resulting model update by 30.

Figure 3.5: The performance of HeteroSAg and FedAvg under three different attacks for the IID setting: (a) results for MNIST; (b) results for CIFAR10.

As we can see in Figure 3.5, HeteroSAg with coordinate-wise median is robust to the three attacks and gives performance almost the same as the case with no Byzantine users. On the other hand, in the presence of these attacks, the FedAvg scheme gives very low performance. As a final remark, HeteroSAg achieves both privacy for the users' local models and Byzantine robustness simultaneously. This is different from the SecAg protocol, which only achieves model privacy, and from the naive coordinate-wise median [67], which solely achieves Byzantine robustness.
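As an illustration of the attack models and the robust aggregation used in this experiment, here is a minimal PyTorch-style sketch (ours; updates are represented as dictionaries of tensors, HeteroSAg's segment-level grouping is omitted for brevity, and the label-flip attack acts on the training data and is therefore not shown):

```python
import torch

def gaussian_attack(update):
    # Replace every parameter of the Byzantine update with N(0, 5^2) noise.
    return {k: 5.0 * torch.randn_like(v) for k, v in update.items()}

def sign_flip_attack(update):
    # Multiply the model update by -5.
    return {k: -5.0 * v for k, v in update.items()}

def coordinate_wise_median(updates):
    # Median over each coordinate across the (segment-level) aggregates.
    return {k: torch.stack([u[k] for u in updates]).median(dim=0).values
            for k in updates[0]}
```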
3.6 Conclusion

We propose HeteroSAg, a scheme that enables secure aggregation with heterogeneous quantization. This allows users to adjust their quantization in proportion to their communication resources, which can provide a substantially better trade-off between training accuracy and communication time. We show that the segment grouping strategy used in HeteroSAg can also be utilized to mitigate Byzantine users. Finally, we demonstrate that HeteroSAg can significantly reduce the bandwidth expansion of the state-of-the-art secure aggregation protocol.

3.7 Appendix

3.7.1 Illustrative example for SecAg

In this simple example, we illustrate the SecAg protocol. We consider a secure aggregation problem in FL with $N = 3$ nodes and dropout resiliency $D = 1$. Node $i \in \{1, 2, 3\}$ holds a local model update vector $x_i \in \mathbb{R}^m$. In the following, we present the steps for executing the SecAg protocol.

Step 1: Sharing keys and masks. User 1 and user 2 agree on the pairwise random seed $s_{1,2}$; user 1 and user 3 agree on $s_{1,3}$; user 2 and user 3 agree on $s_{2,3}$. Each of these pairwise seeds is a function of the public key and the private keys of the paired users (more details are given in Step 1 in Section 3.2). In addition, user $i \in \{1, 2, 3\}$ creates a private random seed $b_i$.

Step 2: Secret sharing. Each user $i \in \{1, 2, 3\}$ secret-shares $b_i$ and its private key $s_i^{SK}$ with the other users via Shamir's secret sharing. The threshold for secret sharing is set to 2.

Step 3: Quantizing. User $i \in \{1, 2, 3\}$ quantizes its model $x_i$ using $K$ levels of quantization according to equation 3.1. The output of the quantizer, $\bar{x}_i(k) = Q_K(x_i(k))$, takes a discrete value from the range $\{r_1, r_1 + \Delta_K, r_1 + 2\Delta_K, \ldots, r_2 - \Delta_K, r_2\}$, where $\Delta_K = \frac{r_2 - r_1}{K-1}$ is the quantization interval. The quantized model of each node is then mapped from $\{r_1, r_1 + \Delta_K, \ldots, r_2\}$ to the integer range $[0, K-1]$.

Step 4: Masking. To provide privacy for each individual model, user $i \in \{1, 2, 3\}$ masks its model $\bar{x}_i$ as follows:

$y_1 = \bar{x}_1 + n_1 + z_{1,2} + z_{1,3} \bmod R,$
$y_2 = \bar{x}_2 + n_2 + z_{2,3} - z_{1,2} \bmod R,$
$y_3 = \bar{x}_3 + n_3 - z_{1,3} - z_{2,3} \bmod R,$   (3.12)

where $n_i = \mathrm{PRG}(b_i)$ and $z_{i,j} = \mathrm{PRG}(s_{i,j})$ are random masks generated by a pseudo-random number generator. Here, $R = 3(K-1) + 1$ ensures that every possible aggregate of the three users' vectors is representable without overflow. After that, user $i \in \{1, 2, 3\}$ sends its masked model $y_i$ to the server.

Step 5: Decoding. This phase is for aggregate-model recovery. Suppose that user 1 drops out in the previous phase; the goal of the server is then to compute the aggregate $\bar{x}_2 + \bar{x}_3$. The aggregate at the server from the surviving users (user 2 and user 3) is

$x_{2,3} = \bar{x}_2 + \bar{x}_3 + (n_2 + n_3) - z_{1,2} - z_{1,3} \bmod R.$   (3.13)

Hence, the server needs to reconstruct the masks $n_2$, $n_3$, $z_{1,2}$, and $z_{1,3}$ to recover $\bar{x}_2 + \bar{x}_3$. To do that, the server collects two shares of each of $b_2$, $b_3$, and $s_1^{SK}$ from the two surviving users; it can then reconstruct the missing masks and remove them from equation 3.13. Note that if node 1 is merely delayed while the server has already reconstructed $z_{1,2}$ and $z_{1,3}$, the local model of node 1 is still protected by the private mask $n_1 = \mathrm{PRG}(b_1)$ in equation 3.12.
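The five steps above can be traced numerically. A minimal sketch (ours) for this $N = 3$ example, with the seed-derived PRG outputs simplified to direct random draws:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 16
R = 3 * (K - 1) + 1  # modulus: any sum of 3 values in [0, K-1] fits without overflow

# Quantized models already mapped to integers in [0, K-1] (Step 3).
x = {i: rng.integers(0, K, size=5) for i in (1, 2, 3)}

# Pairwise masks z_{i,j} and private masks n_i (Steps 1, 2, and 4); in SecAg
# these come from PRG(s_{i,j}) and PRG(b_i).
z = {(i, j): rng.integers(0, R, size=5) for (i, j) in [(1, 2), (1, 3), (2, 3)]}
n = {i: rng.integers(0, R, size=5) for i in (1, 2, 3)}

y2 = (x[2] + n[2] + z[2, 3] - z[1, 2]) % R
y3 = (x[3] + n[3] - z[1, 3] - z[2, 3]) % R

# Step 5: user 1 drops out; the server removes n_2, n_3 (reconstructed from
# shares of b_2, b_3) and z_{1,2}, z_{1,3} (from shares of user 1's key).
agg = (y2 + y3 - n[2] - n[3] + z[1, 2] + z[1, 3]) % R
assert np.array_equal(agg, x[2] + x[3])
```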
3.7.2 Proof of Theorem 3.1

First, we state four main properties of the SS matrix $\mathbf{B}$ for HeteroSAg. These properties will be used to prove the inference robustness of HeteroSAg.

3.7.3 Main properties of the SS matrix of HeteroSAg

Property 1. Each column in the SS matrix contains only one $*$. This implies that each group of users, independently of the other groups, executes the secure aggregation on only one segment of its model update.

Property 2. Any two distinct columns $g$ and $g'$ in the SS matrix have at most one row with the same number (we do not consider the symbol $*$ to be a number). This implies that each pair of groups corresponding to these columns, independently of the other groups, executes the secure aggregation on only one segment of their model updates.

Property 3. When the number of groups $G$ is odd, each row in the SS matrix contains only one $*$. On the other hand, when the number of groups is even, a pair of $*$'s can only be found in the odd rows (row indices in the SS matrix start from 0). A pair of $*$'s in a given row belongs to one pair of groups from the set $\{(g, g + \frac{G}{2}) : g = 0, \ldots, \frac{G}{2} - 1\}$, as shown in Figure 3.6.

Figure 3.6: Matrix $\mathbf{B}$ for $G = 6$.

            group:  0  1  2  3  4  5
    segment 0:      0  0  2  3  3  2
    segment 1:      0  *  0  3  *  3
    segment 2:      0  1  1  0  4  4
    segment 3:      0  1  *  1  0  *
    segment 4:      0  1  2  2  1  0
    segment 5:      *  1  2  *  2  1

Property 4. We say that we have a pair when two entries are equal. In the SS matrix, if row $i$ contains only pairs of numbers over the set of columns $S \subset \{0, \ldots, G-1\}$ with $|S| = 2n$, for $n = 2, \ldots, \frac{G-i}{2}$, where $i = 2$ for an even number of groups and $i = 1$ for an odd number of groups, then each row in the remaining set of rows $\{0, \ldots, G-1\} \setminus \{i\}$, restricted to the set of columns $S$, has at least two unpaired numbers. For example, in the first row of the SS matrix $\mathbf{B}$ given in Figure 3.1, we have the two pairs $(\mathbf{B}(0,0), \mathbf{B}(0,1))$ and $(\mathbf{B}(0,2), \mathbf{B}(0,4))$, where $\mathbf{B}(0,0) = \mathbf{B}(0,1) = 0$ and $\mathbf{B}(0,2) = \mathbf{B}(0,4) = 2$. On the other hand, for the column indices $S = \{0, 1, 2, 4\}$, each row in the set of remaining rows $\{1, \ldots, 3\}$ does not contain numbers that can be paired.

3.7.4 Proof of Theorem 3.1

According to the SS matrix $\mathbf{B}$, we have the following four cases.

Case 1: A single group. For a single group $S \in \{0, \ldots, G-1\}$, the server can only decode the segment denoted by $*$ in the SS matrix from the model update $x_S$, by Property 1.

Case 2: A pair of groups. From any pair of groups $S \subset \{0, \ldots, G-1\}$ with $|S| = 2$, the server can successfully decode at most one segment of the local model update $x_S$ when the number of groups is odd, and at most two segments when the number of groups is even. This follows because: first, Properties 1 and 2 show that for any pair of groups $S$, the server can decode the one segment that is jointly encoded by the groups in $S$; second, Property 3 states that when $G$ is odd, the segments denoted by $*$ in any pair of groups do not align in the same row but interfere with segments from other groups, whereas when $G$ is even, the segments denoted by $*$ may align in the same row, allowing the server to decode one additional segment of $x_S$.

Case 3: An even number of groups. From any even number of groups $S \subset \{0, \ldots, G-1\}$, where $|S| = 2n$ for $n = 2, \ldots, \frac{G-i}{2}$ with $i = 2$ when $G$ is even and $i = 1$ when $G$ is odd, the server can only decode one segment of the model update $x_S$, by Property 4. This segment results from the sum of the decoded segments from each pair of groups in $S$, while the segment denoted by $*$ from each group in $S$ interferes with a segment from another group according to Property 3.

Case 4: An odd number of groups. For an odd number of groups $S \subsetneq \{0, \ldots, G-1\}$, where $|S| = 2n + 1$ for $n = 1, \ldots, \frac{G-i}{2}$ with $i = 4$ when $G$ is even and $i = 2$ when $G$ is odd, the server cannot decode any segment of the model update $x_S$. The reason is that, in any given row of the SS matrix, there are at least two segments that are not paired. These four cases complete the proof.

3.7.5 Proof of Lemma 3.1

i) (Unbiasedness) One can easily verify that $\mathbb{E}[\bar{x}_i] = x_i$.

ii) (Variance bound)

$\mathbb{E}\|\bar{x}_i - x_i\|_2^2 \overset{(a)}{=} \sum_{l=0}^{G-1}\sum_{k=1}^{d/G} \mathbb{E}\big(\bar{x}_i^l(k) - x_i^l(k)\big)^2 \overset{(b)}{=} \sum_{l=0}^{G-1}\sum_{k=1}^{d/G} \big(T(l+1) - x_i^l(k)\big)\big(x_i^l(k) - T(l)\big) \overset{(c)}{\leq} \sum_{l=0}^{G-1}\sum_{k=1}^{d/G} \frac{\big(T(l+1) - T(l)\big)^2}{4} = \frac{d}{G}\sum_{l=0}^{G-1}\frac{(\Delta_i^l)^2}{4},$   (3.14)

where (a) follows from the fact that the random quantization is IID over the elements of the vector $x_i$, (b) from the variance of the quantizer in equation 3.1, and (c) from the bound in [106], which states that $a \leq x \leq b$ implies $(b-x)(x-a) \leq \frac{(b-a)^2}{4}$.

iii) (Total quantization error)

$\mathbb{E}\|\bar{p} - p\|_2^2 = \mathbb{E}\Big\|\frac{1}{N}\sum_{i=1}^{N}\bar{x}_i - \frac{1}{N}\sum_{i=1}^{N}x_i\Big\|_2^2 = \frac{1}{N^2}\,\mathbb{E}\Big\|\sum_{i=1}^{N}(\bar{x}_i - x_i)\Big\|_2^2 \overset{(d)}{=} \frac{1}{N^2}\sum_{i=1}^{N}\mathbb{E}\|\bar{x}_i - x_i\|_2^2 \overset{(e)}{\leq} \frac{(r_2 - r_1)^2}{4N^2}\,\frac{d}{G}\sum_{i=1}^{N}\sum_{l=0}^{G-1}\frac{1}{(K_i^l - 1)^2} = \sigma^2,$   (3.15)

where (d) follows from the fact that the random quantization is IID across the $N$ local gradients $\{x_i\}_{i=1}^N$, and (e) from equation 3.14.
We note that $\sigma^2_{\mathrm{HeteroSAg}}$ (the total quantization error when using HeteroSAg) in Theorem 3.2 can be derived from equation 3.15 by counting the number of segments quantized by each quantizer. According to the SS matrix $\mathbf{B}$, each user $i \in S_g$ in group $g$, for $0 \leq g \leq G-1$, uses the quantizer $Q_{K_g}$ to quantize $G - g$ segments, while the remaining $g$ segments are quantized by the set of quantizers $\{Q_{K_0}, Q_{K_1}, \ldots, Q_{K_{g-1}}\}$, one segment per quantizer. Hence, the total number of segments quantized by $Q_{K_g}$, for $0 \leq g \leq G-1$, is $(2(G-g)-1)n$.

3.7.6 Proof of Theorem 3.2

From the $L$-Lipschitz continuity of $\nabla F(\theta)$, we have

$F(\theta^{(t+1)}) \leq F(\theta^{(t)}) + \langle \nabla F(\theta^{(t)}), \theta^{(t+1)} - \theta^{(t)} \rangle + \frac{L}{2}\|\theta^{(t+1)} - \theta^{(t)}\|^2 \overset{(a)}{=} F(\theta^{(t)}) - \eta\langle \nabla F(\theta^{(t)}), \bar{p}^{(t)} \rangle + \frac{L\eta^2}{2}\|\bar{p}^{(t)}\|^2,$   (3.16)

where $\bar{p}^{(t)} = \frac{1}{N}\sum_{i=1}^{N}\bar{x}_i$ and $\bar{x}_i = \bar{g}_i(\theta^{(t)})$ is the quantized local gradient at node $i$; we used the relation $\theta^{(t+1)} = \theta^{(t)} - \eta\bar{p}^{(t)}$ to get (a). Taking the expectation with respect to the quantization noise and the data-sampling randomness,

$\mathbb{E}\big[F(\theta^{(t+1)})\big] \overset{(a)}{\leq} F(\theta^{(t)}) - \eta\|\nabla F(\theta^{(t)})\|^2 + \frac{L\eta^2}{2}\big(\|\nabla F(\theta^{(t)})\|^2 + \sigma^2_{\mathrm{HeteroSAg}}\big)$
$\overset{(b)}{\leq} F(\theta^{(t)}) - \frac{\eta}{2}\|\nabla F(\theta^{(t)})\|^2 + \frac{\eta}{2}\sigma^2_{\mathrm{HeteroSAg}}$
$\overset{(c)}{\leq} F(\theta^{*}) + \langle \nabla F(\theta^{(t)}), \theta^{(t)} - \theta^{*} \rangle - \frac{\eta}{2}\|\nabla F(\theta^{(t)})\|^2 + \frac{\eta}{2}\sigma^2_{\mathrm{HeteroSAg}}$
$= F(\theta^{*}) + \langle \mathbb{E}[\bar{p}^{(t)}], \theta^{(t)} - \theta^{*} \rangle - \frac{\eta}{2}\|\mathbb{E}[\bar{p}^{(t)}]\|^2 + \frac{\eta}{2}\sigma^2_{\mathrm{HeteroSAg}}$
$\leq F(\theta^{*}) + \langle \mathbb{E}[\bar{p}^{(t)}], \theta^{(t)} - \theta^{*} \rangle - \frac{\eta}{2}\mathbb{E}\|\bar{p}^{(t)}\|^2 + \eta\sigma^2_{\mathrm{HeteroSAg}}$
$= F(\theta^{*}) + \mathbb{E}\Big[\langle \bar{p}^{(t)}, \theta^{(t)} - \theta^{*} \rangle - \frac{\eta}{2}\|\bar{p}^{(t)}\|^2\Big] + \eta\sigma^2_{\mathrm{HeteroSAg}}$
$= F(\theta^{*}) + \frac{1}{2\eta}\Big(\mathbb{E}\|\theta^{(t)} - \theta^{*}\|^2 - \mathbb{E}\|\theta^{(t+1)} - \theta^{*}\|^2\Big) + \eta\sigma^2_{\mathrm{HeteroSAg}},$   (3.17)

where (a) follows from $\mathbb{E}\|\bar{p}^{(t)}\|^2 = \mathbb{E}\|\bar{p}^{(t)} - \mathbb{E}[\bar{p}^{(t)}]\|^2 + \|\mathbb{E}[\bar{p}^{(t)}]\|^2$, with $\mathbb{E}[\bar{p}^{(t)}] = \mathbb{E}[x_i^{(t)}] = \nabla F(\theta^{(t)})$ according to Lemma 3.1(i) and Assumption 1, and $\mathbb{E}\|\bar{p}^{(t)} - \mathbb{E}[\bar{p}^{(t)}]\|^2 \leq \sigma^2_{\mathrm{HeteroSAg}}$ according to Lemma 3.1(iii), where $\mathbb{E}[\bar{p}^{(t)}] = p^{(t)}$ with respect to the quantization error. Furthermore, (b) follows from using $\eta \leq \frac{1}{L}$, and (c) from the convexity of $F(\cdot)$. Summing the above over $t = 0, \ldots, J-1$,

$\sum_{t=0}^{J-1}\Big(\mathbb{E}\big[F(\theta^{(t+1)})\big] - F(\theta^{*})\Big) \leq \frac{1}{2\eta}\Big(\mathbb{E}\|\theta^{0} - \theta^{*}\|^2 - \mathbb{E}\|\theta^{J} - \theta^{*}\|^2\Big) + \eta J\sigma^2_{\mathrm{HeteroSAg}} \leq \frac{\|\theta^{0} - \theta^{*}\|^2}{2\eta} + \eta J\sigma^2_{\mathrm{HeteroSAg}}.$   (3.18)

By the convexity of $F(\cdot)$,

$\mathbb{E}\Big[F\Big(\frac{1}{J}\sum_{t=1}^{J}\theta^{(t)}\Big)\Big] - F(\theta^{*}) \leq \frac{1}{J}\sum_{t=0}^{J-1}\Big(\mathbb{E}\big[F(\theta^{(t+1)})\big] - F(\theta^{*})\Big) \leq \frac{\|\theta^{0} - \theta^{*}\|^2}{2\eta J} + \eta\sigma^2_{\mathrm{HeteroSAg}}.$   (3.19)

3.7.7 Proof of Theorem 3.3

We recall that when using HeteroSAg, the model update $x_i$ of user $i$ is partitioned into $Z$ segments, where $Z$ is the total number of subgroups. The partitioned model is denoted by $x_i = [x_i^0, x_i^1, \ldots, x_i^{Z-1}]^T$, where $x_i^l \in \mathbb{R}^{d/Z}$. To guarantee information-theoretic privacy for the model update $x_i$, we must have $I(x_i^l; y_i^l) = 0$ for $l = 0, \ldots, Z-1$, where $y_i^l$ is the $l$-th encoded segment from user $i$. To achieve this for each segment of the model update, the server must not be able to decode any individual segment $x_i^l$, even after recovering all pairwise keys of dropped users and the private keys of the surviving users. Each segment from user $i$ in a subgroup is jointly encoded with either the remaining $\bar{n} - 1$ users from its subgroup, or with $2\bar{n} - 1$ users from its own and one additional subgroup, where $\bar{n}$ is the number of users in each subgroup. Therefore, the number of surviving users in each subgroup must not be exactly one. We assume that each user has a dropout probability $p \in [0, 1]$.
Using the fact that the number of surviving users $X$ in each subgroup follows a binomial distribution with parameters $\bar{n}$ (the number of users in each subgroup) and $1 - p$, Theorem 3.3 follows.

3.7.8 Proof of Proposition 3.2

Computation cost: $O(N^2 + \bar{n}d)$, where $d$, $N$, and $\bar{n}$ are the model size, the total number of users, and the number of users in each group, respectively. Each user's computation can be broken up as: (1) performing the $2N$ key agreements, which takes $O(N)$ time; (2) creating $t$-out-of-$N$ Shamir secret shares of the private key $s_i^{SK}$ and of $b_i$, which is of order $O(N^2)$; and (3) generating the model masks according to equation 3.2 for all neighbors, which takes $O(\bar{n}d)$ time in total. The last result comes from the fact that each element in the model update of any user in HeteroSAg is masked by either $\bar{n} - 1$ or $2\bar{n} - 1$ masks, unlike SecAg, where the whole vector is masked by $N - 1$ zero-sum pairwise masks. Therefore, when the number of users in each group is $\bar{n} = \log N$, the computation cost becomes $O(N^2 + d\log N)$, as opposed to $O(N^2 + dN)$.

User communication complexity: $O(N + d)$. The communication complexity is the same as in the secure aggregation protocol; however, the actual number of transmitted bits per user in HeteroSAg is lower. Specifically, when a set of $|S|$ users executes the secure aggregation protocol together on the set of segments $\{x_i^l\}_{i \in S}$, the number of bits transmitted by each user $i \in S$ is $R = |x_i^l|\,\lceil \log_2(|S|(K_g - 1) + 1)\rceil$ according to equation 3.2, where $|x_i^l|$ is the number of elements in the segment and $K_g$ is the number of quantizer levels. On the other hand, sending the quantized segments in the clear without any encoding requires only $|x_i^l|\,\lceil \log_2 K_g\rceil$ bits. This gives an expression for the bandwidth expansion factor with respect to segments, $\frac{\lceil \log_2(|S|(K_g-1)+1)\rceil}{\lceil \log_2 K_g\rceil}$, ignoring the cost of sharing keys and masks and other cryptographic aspects of the protocol.∥ In fact, the majority of the bandwidth expansion of the additive masking in SecAg comes from the number of users that execute the protocol together. In our proposed segment grouping strategy, every segment is executed by either $\bar{n}$ or $2\bar{n}$ users. SecAg, on the other hand — besides the fact that the local models of all users are quantized with the same quantizer even when the users have different communication resources — has all $N$ users execute the secure aggregation protocol together, which implies a much larger bandwidth expansion factor than HeteroSAg. To further illustrate how HeteroSAg reduces the bandwidth expansion of the SecAg protocol, we give the following numerical examples. Assume $N = 2^{10}$ users, a number used to evaluate this metric in [18, 16], and assume without loss of generality that a single quantizer is used by all users. When the partitioning step results in $\bar{n} = 8$ users in each subgroup and $K = 2^{16}$ quantization levels are used, the bandwidth expansion factor becomes 1.25× instead of 1.625× for SecAg. For single-bit quantization, the expansion factor is significantly reduced from 11× to 4×.

∥The costs of sharing keys and masks in HeteroSAg are the same as in SecAg, so we do not consider them in this evaluation.
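The expansion factors quoted above can be reproduced directly. A small sketch (ours), where $s$ is the number of users jointly executing the protocol on a segment ($N$ for SecAg; $\bar{n}$ or $2\bar{n}$ for HeteroSAg):

```python
import math

def expansion_factor(s, K):
    # ceil(log2(s*(K-1)+1)) encoded bits per element vs. ceil(log2(K)) clear bits.
    return math.ceil(math.log2(s * (K - 1) + 1)) / math.ceil(math.log2(K))

print(expansion_factor(2 ** 10, 2 ** 16))  # SecAg, K = 2^16: 1.625
print(expansion_factor(16, 2 ** 16))       # HeteroSAg, 2*n_bar = 16: 1.25
print(expansion_factor(2 ** 10, 2))        # SecAg, single bit: 11.0
print(expansion_factor(8, 2))              # HeteroSAg, n_bar = 8: 4.0
```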
3.7.9 HeteroSAg for Heterogeneous Group Sizes

In Section 3.4, we considered the case of uniform group sizes for HeteroSAg, where clustering the users results in the same number of users in each group. In this section, we consider a more general scenario: instead of assuming that the set of $N$ users is divided equally among the $G$ groups, with $n$ users in each, we allow the number of users in each group to differ.

3.7.9.1 Execution of HeteroSAg for heterogeneous group sizes

Similar to HeteroSAg in Section 3.4.1, key agreement and secret sharing are executed according to Step 1 and Step 2 in Section 3.2. Here, the size of the set of users in group $g$ is denoted by $|S_g| = n_g$, for $g \in [G]$, where $\sum_{g=0}^{G-1} n_g = N$ and $G$ is the number of available quantizers in the set $Q$. The second extension of HeteroSAg is that we allow further partitioning of the groups into smaller subgroups when the number of users in a group is large. Extra partitioning yields the benefits given in Remark 3.4 and decreases the expansion factor discussed in Appendix 3.7.8, which measures the ratio between the size of the masked model in bits and the size of the clear model. Extra partitioning is achieved by dividing each set of users $S_g$, for $g \in [G]$, into $L_g$ subsets (subgroups) $S_g^d$, for $d = 0, \ldots, L_g - 1$, such that each subgroup has the same number of users $\bar{n}$. Following the clustering step and the extra partitioning of the groups, each model update vector $\{x_i\}_{i \in [N]}$ is equally partitioned into $Z$ segments, $x_i = [x_i^0, x_i^1, \ldots, x_i^{Z-1}]^T$, where $x_i^l \in \mathbb{R}^{d/Z}$ and $Z = \sum_{g=0}^{G-1} L_g$. We must also have $Z \leq d$, and for sufficiently large $N > d$ we may restrict the number of subgroups to equal the model dimension, $Z = d$, in which case each segment of the local model update is a single element. The segment grouping strategy $A_{\mathrm{HeteroSAg}}$ is given by the SS matrix $\mathbf{B}^e$ of dimensions $Z \times Z$, constructed according to Algorithm 2; an example is shown in Figure 3.7.

Figure 3.7: Matrix $\mathbf{B}^e$ for $G = 3$ groups with $L_0 = 1$, $L_1 = 2$, and $L_2 = 2$ subgroups per group, respectively.

            subgroup: (0,0)  (1,0)  (1,1)  (2,0)  (2,1)
    segment 0:        (0,0)  (0,0)  (1,1)    *    (1,1)
    segment 1:        (0,0)    *    (0,0)  (2,0)  (2,0)
    segment 2:        (0,0)  (1,0)  (1,0)  (0,0)    *
    segment 3:        (0,0)  (1,0)    *    (1,0)  (0,0)
    segment 4:          *    (1,0)  (1,1)  (1,1)  (1,0)

Each column is indexed by a pair $(g, d)$ representing the set of users $S_g^d$, for $g \in [G]$ and $d = 0, \ldots, L_g - 1$, while each row $l$, for $l = 0, \ldots, Z-1$, represents the index of a segment. Similar to the description in Section 3.4.1, an entry $\mathbf{B}^e(l, (g, d)) = *$ means that the set of users $S = S_g^d$ quantizes the set of segments $\{x_i^l\}_{i \in S_g^d}$ with the quantizer $Q_{K_g}$ and encodes them together, while at the server side this set of segments is decoded together. When $\mathbf{B}^e(l, (g, d)) = \mathbf{B}^e(l, (g', d')) = (g, d)$ with $g \leq g'$, the set of users $S = S_g^d \cup S_{g'}^{d'}$ corresponding to the columns $(g, d)$ and $(g', d')$ quantizes the set of segments $\{x_i^l\}_{i \in S_g^d \cup S_{g'}^{d'}}$ with the quantizer $Q_{K_g}$ and encodes the quantizer outputs together, while at the server side this set of segments is decoded together.
Algorithm 2: Segment Selection matrix $\mathbf{B}^e$ for HeteroSAg

Define: $Z_{g-1} = \sum_{l=0}^{g-1} L_l$, with $Z_{g-1} = 0$ when $g = 0$; $\mathbb{1}$ is the indicator function.
for $g = 0, \ldots, G-1$ do
    for $d = 0, \ldots, L_g - 1 - \mathbb{1}_{g=G-1}$ do
        $i = 0$ and $s = 0$
        for $r = 0, \ldots, Z - Z_{g-1} - d - 2$ do
            $m = 2(Z_{g-1} + d) + r$
            if $(d + r + 1) \bmod \sum_{l=0}^{i} L_{g+l} = 0$ then
                $i = i + 1$ and $s = 0$
            else
                $s = s + 1$
            end if
            $\mathbf{B}^e(m \bmod Z, (g, d)) = \mathbf{B}^e(m \bmod Z, (g+i, s)) = (g, d)$
        end for
    end for
end for
The remaining entries of the matrix $\mathbf{B}^e$ are set to $*$.

Now we give the theoretical guarantees of HeteroSAg for heterogeneous group sizes.

Theorem 3.4. (Inference robustness) For an FL system with $N$ users clustered into $Z$ subgroups, where the model update of each node is divided equally into $Z$ segments, HeteroSAg achieves an inference robustness of $\delta(A_{\mathrm{HeteroSAg}}) = \frac{Z-2}{Z}$ when the number of subgroups is even, and $\delta(A_{\mathrm{HeteroSAg}}) = \frac{Z-1}{Z}$ when the number of subgroups is odd, where $Z = \sum_{g=0}^{G-1} L_g$ and $L_g$ is the number of subgroups in group $g$.

Lemma 3.2. (Quantization error bound) Let $L_g$ be the number of subgroups in group $g$, and let $Z_{g-1} = \sum_{l=0}^{g-1} L_l$ denote the total number of subgroups in groups $0$ through $g-1$, with $Z_{g-1} = 0$ when $g = 0$. Additionally, let $Z = \sum_{g=0}^{G-1} L_g$ be the total number of subgroups. For a set of vectors $\{x_i \in \mathbb{R}^d\}_{i=1}^N$ such that the elements of each vector $x_i$ take values in the interval $[r_1, r_2]$, and each vector is partitioned into $Z$ equal segments, the quantization error bound when using the quantizers in $Q$ along with the SS matrix $\mathbf{B}^e$ is given by

$\sigma^2_{\mathrm{HeteroSAg}+} = \frac{(r_2 - r_1)^2}{4N^2}\,\frac{d}{Z}\,\bar{n}\sum_{g=0}^{G-1}\sum_{j=0}^{L_g-1}\frac{2(Z - Z_{g-1} - j) - 1}{(K_g - 1)^2}.$

The proofs of Theorem 3.4 and Lemma 3.2 can be derived similarly to the proofs of Theorem 3.1 and Lemma 3.1, respectively. The convergence rate is the same as in Theorem 3.2, with $\sigma^2_{\mathrm{HeteroSAg}}$ replaced by $\sigma^2_{\mathrm{HeteroSAg}+}$.

3.7.10 Proof of Proposition 3.1

From Lemma 3.2, when there are $G$ groups, each partitioned equally into $L$ subgroups of size $\bar{n} = \frac{N}{LG}$, substituting $Z_{g-1} = gL$ and $Z = GL$ shows that the quantization error bound equals $\sigma^2_{\mathrm{HeteroSAg}}$ as given in Theorem 3.2. This means that extra partitioning of each group does not change the quantization error.

3.7.11 Byzantine robustness of HeteroSAg

We further discuss the intuition behind the success of HeteroSAg in mitigating Byzantine nodes in the following remark.

Remark 3.6 (Byzantine robustness of HeteroSAg). In this remark, we further motivate why HeteroSAg succeeds in mitigating Byzantine nodes. The reason behind the success of the coordinate-wise median (Median) algorithm [116] in mitigating Byzantine nodes in the IID setting is the same reason behind the success of HeteroSAg when it is integrated with coordinate-wise median. In particular, the success of Median is guaranteed because the model updates from all benign users are similar to each other [116], and this similarity increases as the users' data becomes more IID. Therefore, taking the median of each coordinate across the model updates of all users yields a representative model of the benign models while ignoring the outliers in each coordinate. For the same reason, integrating median with HeteroSAg provides Byzantine robustness. In particular, in the naive coordinate-wise median algorithm, each coordinate represents one element of the local model of each user (e.g., the $k$-th element $\bar{x}_i(k)$ of the local model of node $i$).
In HeteroSAg, by contrast, each coordinate represents the average of a set of elements from the local models of several users (e.g., the $k$-th element $y_{0,1}^0(k)$ of the segment $y_{0,1}^0$ given in Example 3, where $y_{0,1}^0(k) = \frac{1}{2n}\sum_{i \in \mathcal{N}_0 \cup \mathcal{N}_1} \bar{x}_i^0(k)$). The average of a set of elements from the local models of benign nodes is likewise a reasonably good representative of those elements. Therefore, applying the median along these new coordinates guarantees that we obtain a representative model of the benign models while ignoring the outliers in each coordinate. An outlier appears in a coordinate when at least one faulty model contributes to the average element in that coordinate; for instance, $y_{0,1}^0(k)$ is faulty if at least one of the elements $\{\bar{x}_i^0(k),\ i \in \mathcal{N}_0 \cup \mathcal{N}_1\}$ is faulty.

3.7.12 Complete Experimental Results for Section 3.5

3.7.12.1 Evaluating the results in Figure 3.4b

Table 3.4: User communication cost and total communication time in Experiment 1.

  Quantization                                        Group   User communication cost (Mb)   Communication time (s)
  Heterogeneous, $(K_0,\ldots,K_4) = (2,6,8,10,12)$     0              53                          53
                                                        1              87
                                                        2              90
                                                        3              97
                                                        4             101
  Homogeneous, $K = 2$                                  -              53                          53
  No quantization                                       -             279                         279

The total communication times of the three quantization schemes in Figure 3.4b can be derived from the results in Table 3.4. The transmission rate of the users in group 0 is 1 Mb/s, while users in higher groups have transmission rates above 2 Mb/s, as given in Section 3.5. The communication cost (in Mb) for user $i$ in group $g$ is given by summing the size of the masked model sent by node $i$ and the size of the global model received from the server. The model size of the fully connected neural network considered in Experiment 1 is 79,510 parameters. In HeteroSAg, the encoding is done at the segment level, and the size of the encoded segment $x_i^l$ from node $i$ is $|x_i^l|\,\lceil \log_2(|S|(K_g^l - 1) + 1)\rceil$ bits, where $|S|$ is the number of users who jointly encode this segment and $K_g^l$ is the number of quantization levels used for $x_i^l$; this follows from the encoding step in equation 3.2. Using this formula along with the segment grouping given in Figure 3.1 and the quantizers of the three scenarios under consideration (heterogeneous quantization with levels $(K_0, \ldots, K_4) = (2, 6, 8, 10, 12)$, homogeneous quantization with $K = 2$, and no quantization, i.e., $K = 2^{32}$), the per-user communication cost after $t = 200$ rounds can be evaluated. The cost of sharing keys and masks is the same in all three scenarios, so we do not include it in the communication-cost calculations. The communication time is computed by dividing the per-user communication cost by the corresponding transmission rate.
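As an illustration of how the per-user upload sizes behind Table 3.4 are assembled from this per-segment formula, here is a minimal sketch (ours; the per-segment group sizes and quantization levels must be read off the SS matrix in Figure 3.1, and the download and key-sharing costs are not included):

```python
import math

def masked_model_bits(d, G, group_sizes, levels):
    # Upload size (in bits) when the d-dimensional update is split into G
    # equal segments and segment l is jointly encoded by group_sizes[l]
    # users with levels[l] quantization levels (encoding step, equation 3.2).
    seg = d // G
    return sum(seg * math.ceil(math.log2(s * (K - 1) + 1))
               for s, K in zip(group_sizes, levels))
```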
3.7.12.2 Additional experiment (Heterogeneous quantization)

MNIST dataset: We consider the same setup as Experiment 1 in Section 3.5.1, but with $N = 100$ users and $n = 20$ users in each of the $G = 5$ groups for the experiment in Figure 3.8. In this experiment, we use a batch size of 60.

Figure 3.8: The performance of HeteroSAg under different quantization schemes for the non-IID setting ($N = 100$): (a) test accuracy for the MNIST dataset; (b) total communication time.

CIFAR10 dataset: Data distribution and hyperparameters. We set the total number of users to $N = 100$. We use a fixed learning rate of 0.02 for the first 150 rounds ($t \leq 150$) and then gradually decrease the learning rate according to $\frac{0.02}{1 + 0.02t}$ for $t > 150$. We set the batch size of each user to 20% of its local data. We consider epoch-based training with 5 epochs. We use the CIFAR10 dataset with a non-IID data distribution. In particular, we use the generic non-IID synthesis method based on the Dirichlet distribution with parameter $\alpha$ proposed in [51]: increasing $\alpha$ makes the data more IID across users, while decreasing $\alpha$ makes each user hold very few samples from a few random classes. We implement this method using the FedML library [47]; a minimal sketch of this partitioning is given at the end of this subsection.

Figure 3.9: The label distribution over the $N = 100$ users using the Dirichlet distribution with three different $\alpha$ parameters ($\alpha = 0.7$, $1$, and $10$).

Results. Figure 3.9 illustrates the label distribution for $N = 100$ users under the Dirichlet distribution with different $\alpha$ parameters. Using this set of distributions, we evaluate the performance of HeteroSAg with the same quantization schemes given in Section 3.5.1. Similar to the performance of HeteroSAg presented in Figures 3.4 and 3.8 for the MNIST dataset, Figure 3.10 demonstrates that HeteroSAg on CIFAR10 still achieves higher accuracy than homogeneous quantization with $K = 2$ across the three data distribution settings of Figure 3.9. Additionally, HeteroSAg gives test accuracy comparable to the baseline (no quantization). The higher test accuracy of HeteroSAg relative to homogeneous quantization with $K = 2$ is achieved at no extra communication time, as illustrated in Figure 3.10d. Moreover, after $t = 250$ rounds of communication with the server, the total communication time with heterogeneous quantization is lower than the no-quantization baseline by a factor of 5.4×.

Figure 3.10: The performance of HeteroSAg under different quantization schemes for three data distributions ($\alpha = 0.7$, $1$, and $10$), together with the total communication time.
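The experiments above use the FedML implementation of this partitioning; the following standalone sketch (ours) captures the same Dirichlet-based synthesis:

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    # Assign sample indices to clients so that each class is split across
    # clients with proportions drawn from Dirichlet(alpha); smaller alpha
    # yields a more heterogeneous (non-IID) label distribution.
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for cid, part in enumerate(np.split(idx, cuts)):
            client_idx[cid].extend(part.tolist())
    return client_idx
```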
3.7.13 Additional experiment (Byzantine robustness)

In the following set of experiments, we further demonstrate the performance of HeteroSAg under Byzantine attacks in the non-IID data setting.

Dataset and hyperparameters. Similar to the setting given in Section 3.5.2, we set the total number of users to $N = 300$, of which $B = 18$ are Byzantine. We use a fixed learning rate of 0.02 and set the batch size of each user to 20% of its local data. We consider epoch-based training with 4 epochs. We use the CIFAR10 dataset with a non-IID data distribution, again generated from the Dirichlet distribution with parameter $\alpha$.

Results. Figure 3.11 gives the label distribution for $N = 300$ users under the Dirichlet distribution with three different $\alpha$ parameters.

Figure 3.11: The label distribution among $N = 300$ users using the Dirichlet distribution with different $\alpha$ parameters (shown in groups of 100 users for $\alpha = 0.7$, $1$, and $10$).

Using this set of distributions, we evaluate the performance of HeteroSAg. As we can see in Figure 3.12, HeteroSAg with coordinate-wise median is robust to the Gaussian and sign-flip attacks, giving performance almost the same as the case with no Byzantine users. On the other hand, in the presence of these attacks, the FedAvg scheme gives very low performance.

Figure 3.12: The performance of HeteroSAg and FedAvg under Gaussian and sign-flip attacks with three different data distributions ($\alpha = 0.7$, $1$, and $10$).

3.7.14 Models

We provide the details of the neural network architectures used in our experiments. For MNIST, we use a model with two fully connected layers; the details are provided in Table 3.5. The first fully connected layer is followed by ReLU, while softmax is used at the output of the last layer. For CIFAR10, we consider a neural network with two convolutional layers and three fully connected layers; the specific details of these layers are provided in Table 3.6.
ReLU and max-pooling are applied after the convolutional layers: the first max-pooling layer has a 3 × 3 kernel with a stride of 3, and the second has a 4 × 4 kernel with a stride of 4. Each of the first two fully connected layers is followed by ReLU, while softmax is used at the output of the third fully connected layer. We initialize all biases to 0. Furthermore, for the weights of the convolutional layers we use the Glorot uniform initializer, while for the weights of the fully connected layers we use the default PyTorch initialization.

Table 3.5: Details of the parameters in the architecture of the neural network used in our MNIST experiments.

  Parameter   Shape
  fc1         784 × 100
  fc2         100 × 10

Table 3.6: Details of the parameters in the architecture of the neural network used in our CIFAR10 experiments.

  Parameter   Shape
  conv1       3 × 16 × 3 × 3
  conv2       16 × 64 × 4 × 4
  fc1         64 × 384
  fc2         384 × 192
  fc3         192 × 10
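For reference, here is a PyTorch sketch (ours) of the two architectures, consistent with Tables 3.5 and 3.6; the layer shapes come from the tables and the pooling parameters from the description above, while everything else (e.g., the absence of padding) is an assumption:

```python
import torch.nn as nn
import torch.nn.functional as F

class MNISTNet(nn.Module):
    # Two fully connected layers (Table 3.5); ReLU after fc1, softmax at the output.
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 100)
        self.fc2 = nn.Linear(100, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x.view(-1, 784)))
        return F.log_softmax(self.fc2(x), dim=1)

class CIFARNet(nn.Module):
    # Two conv + three fc layers (Table 3.6); pooling strides as described above.
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3)
        self.conv2 = nn.Conv2d(16, 64, 4)
        self.fc1 = nn.Linear(64, 384)
        self.fc2 = nn.Linear(384, 192)
        self.fc3 = nn.Linear(192, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 3, stride=3)  # 32x32 -> 30x30 -> 10x10
        x = F.max_pool2d(F.relu(self.conv2(x)), 4, stride=4)  # 10x10 -> 7x7 -> 1x1
        x = x.view(-1, 64)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.log_softmax(self.fc3(x), dim=1)
```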
Chapter 4

Federated Parameter Efficient Fine-Tuning of Language Models

Fine-tuning pre-trained models has successfully delivered state-of-the-art results across various NLP tasks. A crucial aspect of fine-tuning is task-related data, and since this data can be private and sensitive, it is not always available. Without such centralized data, Federated Learning (FL) helps fine-tune the model using clients' private data. However, efficient fine-tuning becomes critical, especially for edge devices, due to their limited communication, computation, and storage capabilities and the vast sizes of popular pre-trained models. This work explores the opportunities and challenges of applying parameter-efficient fine-tuning (PEFT) methods in FL for language tasks. Specifically, our investigations reveal that increasing data heterogeneity across users widens the gap between fully fine-tuning the model and employing PEFT methods. To bridge this performance gap, we propose a method that overcomes the critical limitations of LoRA in highly heterogeneous data scenarios through a novel data-driven initialization technique. Our experimental results demonstrate that our method achieves performance comparable to full fine-tuning, with highly sparse updates of ∼1% density, while reducing training time by up to 90%.

Figure 4.1: The effect of client data distribution on the performance of FFT vs. PEFT. While heterogeneity adversely affects both, parameter-efficient methods are more vulnerable and experience larger accuracy drops in more heterogeneous settings.

4.1 Introduction

Although FL has already proven beneficial in various domains [56, 114], such as next-word prediction and healthcare, it still faces critical challenges when deployed in large-scale settings that involve training state-of-the-art large models such as LLMs [19, 25]. Large pre-trained language models perform well even in the zero-shot setting [19]. However, as tasks get more specialized, these models require fine-tuning on task-specific data to enhance their performance [52]. FL is a promising solution for providing privacy during fine-tuning, but requiring the clients to fine-tune the model and communicate the updates comes with costs and downsides. One concern with shifting fine-tuning to the client side is its overhead. In particular, edge devices usually have limited bandwidth (especially on the up-link), as multiple users might share the same network resources. Also, fine-tuning large models can be computationally expensive. Finally, both communicating and fine-tuning the model can be highly energy-consuming. Therefore, the direct use of FL for NLP tasks may limit its applicability. Additionally, language models are considered foundation models [19] that can be used in various applications while supporting a wide range of tasks [108]. As a result, supporting multiple tasks on memory-constrained edge devices poses another critical challenge, especially in the FL setting. In particular, the
required memory for the fine-tuned models of different tasks grows linearly with the number of tasks, which might make the training infeasible. Some of the mentioned problems also exist in the centralized (single-node) setting. Parameter-Efficient Fine-Tuning (PEFT) [52, 75, 72] has recently emerged as an alternative efficient training strategy. PEFT methods do not fine-tune all parameters of the pre-trained model; they update only a small portion of the parameters (the task-specific parameters) while freezing most of the pre-trained weights at their initial values. In the centralized setting, this approach has been shown to maintain task performance while reducing the training and storage budget. In this work, we first explore the performance of existing PEFT methods in FL under different degrees of data heterogeneity for various models and datasets. We observe that the gap between Full Fine-Tuning (FFT) and PEFT increases with clients' data heterogeneity. To resolve this problem, we propose two simple and effective variants of LoRA designed for FL that achieve parameter efficiency, reduce training and communication costs, and close the gap between PEFT and FFT. The first approach is called domain-adaptation LoRA, DALoRA, where the server first fine-tunes the model using LoRA modules on public datasets; after that, the server and users further fine-tune the model in a federated manner using LoRA modules on the target private dataset. For situations where the server does not have access to public datasets, we propose SLoRA, which alleviates the need for public datasets and achieves performance comparable to FFT and DALoRA. SLoRA combines uniform sparsity and LoRA modules in two stages of training to mitigate the limitations of LoRA in the FL setting. Our experimental results demonstrate that SLoRA achieves performance comparable to full fine-tuning, with highly sparse updates of ∼1% density, while reducing training time by up to 90%.

4.2 Related Work

Parameter-Efficient Fine-Tuning (PEFT). In general, PEFT methods can be broadly classified into two main categories based on the nature of the tuned parameters. The first category fine-tunes a subset of the existing parameters, including the classification head, the bias terms [118], and sparse subnetworks within the original pre-trained model for each task [44]. The second category is module-based fine-tuning, where additional task-specific modules are added [50, 84, 48, 52]. These modules are fine-tuned while the entire pre-trained model is frozen. Different methods have been proposed depending on the position of the PEFT modules in the model. Adapters and their variants [50, 84, 48] add trainable bottleneck modules serially to the model components. Other methods add modules in parallel to the model, such as LoRA [52] and prefix or prompt tuning, which are added in parallel to the attention heads [75] or embeddings [72]. Recently, new approaches propose data-driven PEFT configuration selection for adding adapter modules [48, 125, 109].

PEFT in FL. Recent studies [104, 121] investigate the performance of PEFT methods within the context of FL for vision tasks, considering different aspects such as client stability, data distribution, and differential privacy. The findings indicate that PEFT can replace FFT without compromising performance while significantly reducing communication costs. While these previous works focused on vision and vision-language models, our study differs in several key aspects. First, we specifically study PEFT for language models and additionally examine the effect of data heterogeneity across clients on the performance of PEFT for NLP tasks. Second, our work extends beyond benchmarking different PEFT methods in the federated setting to propose an approach that yields performance comparable to FFT even in extreme non-IID settings.

Efficient Training in FL. Efficient training in FL has been extensively studied in the literature [26, 49, 3, 83, 58, 74, 73] and employs sparse training at different levels. Some approaches apply sparse training only at the client side to update a full-size model retained at the server [26, 49, 3, 83]. Other approaches utilize sparse training to optimize a sparse model at both the client and server sides [13, 7, 89, 73]. Since efficient learning and PEFT share the common goal of reducing training complexity for the clients, they may seem very similar at first glance. However, PEFT also focuses on the unique aspect of the storage load for multiple tasks. Applying efficient sparse training methods to (pre-trained) large language models typically retains good performance only at moderate levels of sparsity [41, 43, 23], which can result in a huge storage penalty, as sparsity patterns can differ across tasks. On the other hand, PEFT retains the full pre-trained model but applies extremely sparse (∼1% density) updates [52, 50, 118, 84, 48] to the pre-trained model per task, which allows for substantial storage savings and a significant reduction in communication cost. Importantly, PEFT can achieve performance closely comparable to the fully fine-tuned model.
4.3 Preliminaries

4.3.1 PEFT Baselines in Centralized Learning

We investigate Pfeiffer, LoRA, Houlsby, and BitFit as state-of-the-art PEFT methods. The first three methods add a separate bottleneck module (a down-projection dense layer followed by an up-projection) with dimension r to the model. The main difference between these methods is where the module is added. Houlsby [50] places a bottleneck module after each multi-head attention and feed-forward block. Pfeiffer [84] places a bottleneck module only after each feed-forward block. In LoRA [52], the bottleneck module can be placed in parallel to any dense layer in the model. Finally, BitFit [118] is a simple method that only allows fine-tuning the bias terms.

Figure 4.2: LoRA block: the pre-trained weight $W \in \mathbb{R}^{d \times d}$ is frozen, while the parallel low-rank path consists of a down-projection $A$ (initialized from $\mathcal{N}(0, \sigma^2)$) and an up-projection $B$ (initialized to 0) with inner dimension r.

4.3.2 Observation: PEFT is challenged when the data distribution gets non-IID

One of the challenges in FL is the performance degradation caused by heterogeneous data distributions [56]. While this is a well-documented phenomenon for FFT in FL, we observe that the performance penalty is even more substantial when using PEFT. In particular, after benchmarking different models and datasets, we observe that the higher the level of heterogeneity, the larger the performance gap between FFT and PEFT methods. Thus, naively applying PEFT methods locally in FL can lead to a potentially huge performance loss. Our focus in the remainder of this chapter is on developing approaches that reduce the gap between FFT and PEFT while efficiently using clients' resources in terms of communication and storage loads. Before proposing our approach in Section 4.4, we summarize the approach that shows the most significant promise: the state-of-the-art PEFT approach, LoRA.

4.3.3 Low-Rank Adaptation: LoRA

In the following, we explain Low-Rank Adaptation (LoRA) [52], since it is the state-of-the-art method for PEFT of large pre-trained language models and our proposed algorithm adopts it. The key idea of LoRA is that instead of fully fine-tuning the pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times d}$, its update is constrained to a low-rank decomposition $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$, and $r \ll d$; only $B$ and $A$ are trained ($W_0$ is frozen). Figure 4.2 shows the LoRA implementation: a module consisting of a down-projection matrix $A$ followed by an up-projection matrix $B$ is placed in parallel to the original pre-trained weight matrix. A random Gaussian initialization for $A$ and zero for $B$ are used to ensure that the modified model and the original model are equivalent at the beginning of training (i.e., $\Delta W = BA$ is zero). The modified forward pass after adding the LoRA module is

$h = W_0 x + \frac{\beta}{r} BAx,$   (4.1)

where r is the LoRA rank and β is a constant. According to equation 4.1, the output of the LoRA module is added coordinate-wise to the output of the original layer. The scaling $\frac{\beta}{r}$ reduces the need to re-tune hyper-parameters when varying r. One key advantage of LoRA compared to other adapter methods such as [50, 84, 48] is that it does not increase the inference cost of the original model: the implementation of LoRA allows re-parametrizing the weight as $W = W_0 + \frac{\beta}{r}BA$.
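As a concrete reference for equation 4.1, here is a minimal PyTorch sketch (ours) of a LoRA-augmented linear layer; the standard deviation used to initialize A is an assumed small constant:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Sketch of equation 4.1: h = W0 x + (beta / r) B A x,
    # with W0 frozen, A ~ N(0, sigma^2), and B = 0 at initialization.
    def __init__(self, base: nn.Linear, r: int, beta: float):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weight W0
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # random Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # zero init -> Delta W = 0
        self.scale = beta / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```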
4.4 Our approach (Primed-LoRA)

In centralized learning, LoRA consistently shows promising performance across different tasks and closely follows the FFT accuracy. This still holds in FL with a homogeneous data distribution (larger α), as shown in Figure 4.4. However, under highly non-IID data distributions, LoRA fails to get close to the FFT performance and suffers from a slower convergence rate than FFT. We hypothesize that the initialization of the LoRA blocks (in Figure 4.2, A and B are initialized with random noise and 0, respectively) is one of the reasons behind this problem. While this initialization works in centralized settings, where the data is ample and concentrated, it can potentially slow down the fine-tuning process in FL.

Figure 4.3: Impact of the number of FL rounds in Stage 1 (10 to 100 rounds) on the final performance of FLoRA for the 20News group dataset on Albert.

Data-driven priming of LoRA. Based on our hypothesis, a better starting point for LoRA might improve its performance. Therefore, we propose a two-stage PEFT method, Primed-LoRA, based on the LoRA algorithm. In Stage 1, clients collaboratively find a mature starting point to prime the LoRA blocks. Then, in Stage 2, they run the LoRA algorithm with the learned initializers from Stage 1. In the remainder of this section, we discuss different ways of priming LoRA and their properties.

4.4.1 Full fine-tuning for priming LoRA

A simple approach is to perform FFT for a few rounds in Stage 1 of Primed-LoRA and then use SVD matrix decomposition to extract a good initialization for Stage 2. We call this variant of Primed-LoRA FFT-LoRA, or FLoRA. We use ∆W to denote the accumulated change in the model parameters after Stage 1, and we use SVD to derive a low-rank approximation $\Delta W = B \times A$, with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$, that is used in Stage 2. A description of how we create A and B using SVD is delegated to Appendix 4.8.1; a sketch of one standard construction is given at the end of this subsection. After converting ∆W into A and B, the training moves to Stage 2, where clients only update and share the LoRA blocks with the server. As shown in Figure 4.3, FLoRA can improve the global model's performance, especially with more training rounds in Stage 1. However, as we discuss next, using full fine-tuning to prime LoRA comes at a cost.
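The exact construction of A and B is described in Appendix 4.8.1; the following sketch (ours) shows one standard way to obtain a rank-r factorization of ∆W via truncated SVD, which is the operation Stage 2 relies on, with the split of the singular values between the two factors being an assumption:

```python
import torch

def lora_init_from_delta(delta_w: torch.Tensor, r: int):
    # Low-rank factorization of the accumulated Stage-1 update:
    # Delta W ~= B A, with B (d_out x r) and A (r x d_in).
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    B = U[:, :r] * S[:r]  # fold the top-r singular values into B
    A = Vh[:r, :]
    return B, A
```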
Cost of FLoRA. In FLoRA, adding Stage 1 successfully enhances performance while achieving parameter efficiency; this improvement shows the importance of initialization for the LoRA blocks. However, the communication and computation cost of training at this stage is the same as full fine-tuning. Therefore, while FLoRA shows that Primed-LoRA can meet our targets in terms of parameter efficiency, one important question is yet to be answered: how do we preserve this performance while reducing the training costs? This is especially important because, in cross-device federated learning, clients have a limited budget. One can observe that two parameters are involved in the cost of Stage 1: the number of rounds and the communication/computation cost of each round.

Figure 4.4: Performance of PEFT methods for different data distributions on the 20News group dataset, for (a) Albert and (b) DistilBERT. The model is trained for 10 epochs in the centralized setting and for 100 rounds in the other configurations.

So, to achieve efficiency, one solution is to reduce the number of FFT rounds and thereby decrease the cost of Stage 1. However, as depicted in Figure 4.3, the model's performance at the end of Stage 1 directly impacts the performance of the final model. Hence, we focus on solutions that reduce the clients' update size. Toward this goal, we propose two algorithms, Primed-LoRA with domain adaptation (DALoRA) and Primed-LoRA with sparse fine-tuning (SLoRA), which we explain below.

4.4.2 Primed-LoRA with Domain Adaptation

One solution to reduce the communication and computation costs associated with Stage 1 is to find the LoRA block initialization in a non-federated manner. In other words, while users' data is private and unavailable, the server can benefit from existing public datasets in similar application domains. To this aim, the server pre-trains the LoRA blocks locally using the public data and, after reaching an acceptable starting point, transmits the updated blocks to the clients. In this way, Stage 1 has no cost for the clients, and the clients only need to train the small LoRA blocks. We call this method Domain Adaptation LoRA (DALoRA) and show that, when public data is available, it can effectively and efficiently enhance the performance of the global model.

Figure 4.5: Overview of SLoRA: the server first initializes a mask, and clients only update the parameters in the mask; the updates are then decomposed into LoRA blocks for the initialization of Stage 2.

4.4.3 Primed-LoRA with Sparse Fine-tuning

Public data, unfortunately, is not always available for all tasks. Sometimes the nature of the data makes it too sensitive to publish, or gathering a high-quality dataset is expensive and time-consuming. Therefore, we may still need the clients for Stage 1. In this section, we propose the SLoRA algorithm, designed to improve the performance of PEFT in FL without public data.

Sparse Fine-Tuning (SFT). Various works in the centralized setting [4, 118] have attempted to achieve parameter efficiency by sparsifying the updates. In other words, the update (∆W) from the pre-trained weights to the final ones is a set of highly sparse matrices that can be stored and transmitted efficiently. [4] propose using a calibration dataset to generate a sparse mask (a binary mask in which 1's indicate the positions of the trainable weights in each round) that selects the top-K most important weights based on their contribution during FFT. The weights that change the most from their original pre-trained values during FFT are considered the most important ones for SFT. A new warm-up step is required, fine-tuning the weights for several rounds, to detect such weights.

Sparse fine-tuning in Stage 1. The method of [4] is designed for the centralized setting, where the data can be used to find the mask. However, in our federated setting, the server cannot fine-tune the model because it cannot access the training data. Clients can individually train the model and come up with personalized masks, but differences in the clients' masks increase the density of the aggregated update and reduce parameter efficiency. Alternatively, for the clients to find the important weights together, they would need to do FFT, which is against our goal of reducing the cost and the number of FFT communications.
To solve this problem, we propose a data-independent sparse fine-tuning method in which the server generates a random, data-independent binary mask with uniform density across all layers at the beginning of training to decide which weights are frozen. The clients then train and communicate only the unfrozen weights to obtain the final model update $\Delta W$. A random mask is typically known to underperform a data-driven mask [4]; however, we choose to adopt it in our algorithm for the following reasons.

• A random mask requires no extra cost. Specifically, users do not pay any extra cost to find the mask, which reduces the strain on the resources of edge users. Moreover, users only need to communicate the non-zero updates to the server, which are very small (e.g., 10%). Although this method may perform worse than FFT, the fine-tuning procedure in Stage 2 can compensate for the performance drop with an order-of-magnitude smaller update size (e.g., 1%).

• Random SFT results in low SVD reconstruction error. In particular, the update $\Delta W$ of each fully connected layer from Stage 1 is mostly sparse (e.g., 90%) with a random pattern. We have observed that applying a low-rank SVD decomposition (e.g., maintaining a 1% update size in Stage 2 vs. 10% in Stage 1) incurs low reconstruction error. Our results across different experiments show that this minor error still allows the model to reach good performance in Stage 2.

4.4.4 Primed-LoRA in Stage 2

Stage 2 in SLoRA and DALoRA follows the same procedure discussed in Section 4.4.1 and is summarized in Algorithm 3. Fig. 4.5 further illustrates the two stages of SLoRA.

Algorithm 3: Overview of Primed-LoRA
  Notation: $R_i$: FL rounds in stage $i$; $N$: total number of clients; $K$: participants per round; $E$: local epochs; $W_R$: model weights in round $R$; $r$: LoRA rank; $d_i$: update density in stage $i$.
  # Stage 1
  if Algorithm == DALoRA then
      $[A, B]^0$ = trainLoRA($[\mathcal{N}(0, 1), 0]$, $E$)   # server-side pre-training on public data
  else
      if Algorithm == SLoRA then
          Mask = generateMask($W_0$, $d_1$)
      else
          Mask = 1
      end if
      for $R = 1, \dots, R_1$ do
          for each participating client $k = 1, \dots, K$ in parallel do
              $W_R^k$ = train($W_{R-1}$, Mask, $E$)
          end for
          $W_R$ = aggregate($W_R^{1, \dots, K}$)
      end for
      $\Delta W = W_{R_1} - W_0$
      $[A, B]^0$ = SVD($\Delta W$, $r$)
  end if
  # Stage 2
  for $R = 1, \dots, R_2$ do
      for each participating client $k = 1, \dots, K$ in parallel do
          $[A, B]_R^k$ = trainLoRA($[A, B]_{R-1}$, $E$)
      end for
      $[A, B]_R$ = aggregate($[A, B]_R^{1, \dots, K}$)
  end for

4.5 Experiments

Models. In our experiments we show results for two models, Albert [68] and DistilBERT [93], which have relatively compact sizes and are potentially suitable for training on mobile devices.

Table 4.1: The performance of the proposed methods on the 20News group dataset on the Albert model.

Method | # Trainable Parameters | Computation Time (sec/epoch) | Total FL Rounds | Training Time (min) | Comm. (Gbits) | Accuracy (%)
FFT    | 11.7M | 0.39 | 250 | 596.7 | 174 | 58.17 ± 4
DALoRA | 0.18M | Stage 1: 0, Stage 2: 0.43 | Stage 1: 0, Stage 2: 1250 | Server: 72, Clients: 49.5 | 10.7 | 60.78 ± 1.8
LoRA   | 0.18M | 0.43 | 1250 | 49.5 | 10.7 | 56.5 ± 1.2
SFT    | 0.18M | 2 | 1250 | 78.4 | 10.7 | 57.6 ± 0.3
SLoRA  | 0.18M | Stage 1: 2.1, Stage 2: 0.43 | Stage 1: 100, Stage 2: 350 | 40.4 | 10.7 | 58.6 ± 1
SVD decomposition (one-time cost) | — | 15.4 | — | — | — | —

Datasets. Since we rely on non-IID label distributions, we focus on classification tasks over two datasets, News Category and 20News group [69]. Details are in Appendix 4.8.2. The total number of clients is 100 in all the experiments.
The number of participants in every round is 20 clients for News Category with the pathological non-IID split, and 10 otherwise. For aggregation, we use FedAvg [79]. For data heterogeneity, we use LDA [90] with parameter α (smaller α means more heterogeneity). We also add a pathological non-IID split similar to [79]. Clients use the AdamW optimizer [77], and we grid-search the learning rate for each setting over [1e−5, 5e−5, 9e−5, 1e−4, 2e−4, . . . , 9e−4, 1e−3, 2e−3, 3e−3]. We mainly focus on the accuracy of the final global model. All experiments are run with 5 seeds, and the average of the top 3 results is reported. Moreover, to compare the costs of each algorithm, we report communication cost and training time measured on our local GPU, an NVIDIA A100 (details in Section 4.8.4).

Baselines. For our baselines, we investigate LoRA, Houlsby, Pfeiffer, and BitFit to understand the impact of data heterogeneity and update size. The update size in BitFit is fixed and equal to the total size of the bias layers. For the other algorithms, the update size is controlled by the parameter r, which indicates the size of the appended module and is summarized in Table 4.2. Here, we consider two different sizes, where the density of the smaller update is approximately equal to BitFit's; to explore the impact of update size, we also include a larger update with higher density. Also, while LoRA blocks can be added to all dense layers, here we only select the 4 dense layers in the multi-head attention layer, similar to the original paper. LoRA has an extra α parameter, which we set equal to r. The implementation of the baselines is adapted from AdapterHub [85].

Table 4.2: Parameter r, which we use to set the density of the updates.

Size  | LoRA | Houlsby | Pfeiffer
Small | 20   | 38      | 76
Large | 190  | 384     | 768

Table 4.3: Impact of the density on the performance for the 20News group dataset on Albert. The model is trained for 10 epochs in the centralized setting and for 100 rounds in the other configurations.

Data Distribution | Density (%) | LoRA (%) | Houlsby (%) | Pfeiffer (%)
Centralized  | 10 | 65.9 | 66    | 66.5
             | 1  | 65.5 | 66.3  | 66.2
α = 0.1      | 10 | 61.4 | 53.1  | 58
             | 1  | 54.9 | 51.6  | 56.89
Pathological | 10 | 40.2 | 38.8  | 21.1
             | 1  | 9    | 13.22 | 18.6
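For readers unfamiliar with LDA-based partitioning, the following is a minimal sketch (our illustration; [90] and the dissertation's exact splitting code may differ) of drawing per-client label proportions from a Dirichlet(α) distribution, where smaller α yields more heterogeneous clients.

```python
import numpy as np

def lda_partition(labels: np.ndarray, num_clients: int, alpha: float,
                  seed: int = 0) -> list:
    """Split sample indices across clients with Dirichlet(alpha) label skew."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # proportions of class c assigned to each client
        p = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for client, shard in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return client_indices

# Example: 100 clients, alpha = 0.1 (highly non-IID), 20News-like labels
labels = np.random.randint(0, 20, size=19000)
parts = lda_partition(labels, num_clients=100, alpha=0.1)
```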
4.6 Evaluation

Data Heterogeneity. Figure 4.4 shows the impact of data heterogeneity on performance for the 20News group dataset. The gap between PEFT and FFT grows as the data becomes more non-IID, which indicates the necessity of a new PEFT algorithm tailored to FL. (More results in Appendix 4.12.)

Update Size. Table 4.3 shows the impact of the update size on the final performance of the global model for the 20News group dataset on the Albert model. As expected, larger update sizes improve performance at the cost of an increased communication budget. (More results in Appendix 4.13.)

Primed-LoRA. Table 4.1 examines different aspects of the proposed algorithms, DALoRA and SLoRA, and compares them with two existing baselines, SFT and LoRA, for the 20News group dataset on the Albert model. Here, we fix the communication budget and the final update density (1.5%) and select the other design parameters accordingly, as discussed later. We only consider the pathological non-IID setting since, as discussed earlier, this is where PEFT experiences a significant drop in its performance.

Figure 4.6: Performance of SLoRA on the 20News dataset for fixed training rounds on (a) the Albert model and (b) the DistilBERT model. (Panels annotate the Stage 1 and Stage 2 phases of training.)

Other experiments for other models, datasets, and data distributions (Table 4.5) can be found in the Appendix. For DALoRA, the server pre-trains the LoRA blocks on the News Category dataset in a centralized manner for 20 epochs and then transmits the new initializations for these blocks to the clients. Note that Stage 1 does not guarantee a better initialization, as the server is not aware of the exact content of the local data. For SLoRA, we train the model in Stage 1 with an update density of 10%, so as to obtain a good Stage-1 model that can be utilized in Stage 2, yet at minimal cost. In Stage 2 of SLoRA, we add the LoRA modules by utilizing the SVD-decomposed model update from Stage 1, with a lower target update density. Specifically, the LoRA modules are added to each dense layer in the model, except for the embedding and classification layers. The configurations and model update densities of LoRA for Albert and DistilBERT are given in Table 4.4 in the Appendix. Furthermore, to show the importance of the proposed stages, we add two baselines: Stage 1 only (SFT) and Stage 2 only (LoRA). The target density for both methods is the same (1.5%), and the mask in SFT is selected at random with uniform density across all layers. The details of the LoRA blocks are the same as in SLoRA and are given in Table 4.4.

Regarding the model training time (i.e., computation time), we report the duration based on a single GPU. Table 4.1 summarizes the average training time per epoch over 10 distinct runs. We note that the time for SLoRA can be computed from the times for Stages 1 and 2; the server executes the SVD decomposition only once, at the start of Stage 2. Also, Stage 1 of DALoRA has no computational overhead for clients.

Table 4.1 presents the performance of the different methods under a fixed communication budget. In particular, SLoRA communicates larger models in Stage 1 than the other methods do; to ensure a fair comparison, we increase the training rounds of the other baselines to match the communication costs. As depicted in Table 4.1, DALoRA attains the best accuracy among all methods. In settings where no public dataset is available for training DALoRA, SLoRA can be used instead while still achieving good performance. In particular, SLoRA achieves accuracy comparable to FFT, with a slight improvement, even though the baselines are given more communication rounds. In contrast, the maximum accuracies of LoRA and SFT are 1.67 and 0.57 points lower than FFT, respectively. It is worth noting that the baselines reach their performance in a longer time than SLoRA.

Further Analysis of SLoRA. Given SLoRA's good performance without relying on any public dataset (unlike DALoRA), we analyze it further. The number of FL rounds is another crucial aspect, especially when considering client availability. Therefore, we compare the performance of SLoRA to the baselines for a fixed number of training rounds. As depicted in Figure 4.6, the performance gaps between SLoRA and the baselines increase to 4% and 14.2%, respectively. We further study the performance of SLoRA on a different dataset in Appendix 4.11.

Comparison with a Concurrent Work. A concurrent work [60] addresses a similar problem. However, our study of PEFT under data heterogeneity is more comprehensive, covering two LLMs and two datasets with different data distributions (pathological and LDA) as well as the impact of update size, compared to only LDA and a single model in [60].
For the proposed methods, we adopt LoRA rather than the Houlsby adapter. The benefit of using LoRA is reparametrization: once the model is trained with PEFT, we can merge the LoRA modules into the original parameters without increasing the model size at inference. This is not possible for Houlsby adapters, as a nonlinear activation is used between the up and down projections. Also, [60] assumes that each user trains a hypernetwork locally in each training round, which adds cost to training. In contrast, our method requires no such extra computation, and the final model has the same size as the original without performance degradation. Finally, we could not reproduce their results, since the code and key aspects of the algorithms were missing.

4.7 Conclusion

We investigate employing PEFT methods for fine-tuning language models in FL to reduce communication and storage costs. After observing the poor performance of PEFT compared to FFT for heterogeneous clients, we propose Primed-LoRA to reduce the costs while maintaining FFT-level performance.

4.8 Appendix

4.8.1 Priming LoRA from FFT using SVD

Singular Value Decomposition is a matrix decomposition method that rewrites an $m \times n$ matrix $M$ as a product of three matrices, $M = U \Sigma V^T$, where $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$, and $\Sigma$ is a rectangular diagonal matrix with descending diagonal entries.* The exact decomposition is not parameter efficient and generates matrices with large dimensions. Therefore, instead of the exact decomposition, we use an approximation that preserves most of the information in $M$ (our $\Delta W$ of interest throughout this chapter). A common way to approximate $M$ as $M \approx \tilde{U} \tilde{V}^T$ (where $\tilde{U} \in \mathbb{R}^{m \times r}$, $\tilde{V} \in \mathbb{R}^{n \times r}$, and $r \ll m$) is to take the first $r$ columns of $U$ and $V$, which are associated with the largest singular values in $\Sigma$, and construct $\tilde{U} = U_{[1:m, 1:r]} \Sigma_{[1:r, 1:r]}$ and $\tilde{V} = V_{[1:n, 1:r]}$. Note that increasing $r$ makes this approximation more accurate, as it approaches the exact SVD, but at the same time decreases the savings in parameters.

* The SVD is not unique; we are concerned with the decomposition in which the singular values are organized in descending order.

4.8.2 Details of Datasets

The 20News group dataset includes 20 news topics with about 19K data points. We use a 60%-40% split for train and test, respectively. News Category is another news dataset, with 15 different labels. It includes about 330K training data points, but we use only a randomly sampled 10% for training. Clients train for one epoch every round, and the batch size is always 32.

4.8.3 LoRA Block Details

Table 4.4 shows the r parameter for the different LoRA blocks in DistilBERT and Albert.

Table 4.4: The configurations of the LoRA modules (i.e., r) for the Multi-Head (MH), FeedForward (FFN), and PreClassification (PC) layers of Albert and DistilBERT.

Model      | MH | FFN | PC | # Trainable Parameters
Albert     | 10 | 10  | 18 | 0.18M (1.5%)
DistilBERT | 8  | 10  | 18 | 0.93M (1.4%)

4.8.4 Training Time Computation

We measure the total training time of each baseline and of DALoRA using the following formula:

$T = K \, (T_{\text{computation}} + T_{\text{communication}})$,    (4.2)

where $K$ is the number of federated rounds (given in Table 4.1 and Table 4.8), $T_{\text{computation}}$ is the computation time per round (given in Table 4.1 and Table 4.8), and $T_{\text{communication}} = 2 \times (\text{model update size}) / \text{BW}$. In our experiments, we assume that each user has a communication bandwidth of 5 Mbits/s for both the uplink and downlink channels.
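As a sanity check on equation 4.2, the following sketch (our illustration; the 32-bit parameter size is an assumption, and the reported tables may count update sizes slightly differently) estimates total training time from a parameter count, per-epoch compute time, and the 5 Mbit/s bandwidth above.

```python
def training_time_minutes(num_params: float, t_comp_sec: float,
                          rounds: int, bw_bps: float = 5e6,
                          bits_per_param: int = 32) -> float:
    """Total training time per equation 4.2: K * (T_comp + T_comm),
    with T_comm = 2 * update size / bandwidth (upload plus download)."""
    update_bits = num_params * bits_per_param
    t_comm_sec = 2 * update_bits / bw_bps
    return rounds * (t_comp_sec + t_comm_sec) / 60

# Example with the FFT row of Table 4.1 (11.7M parameters, 0.39 s/epoch,
# 250 rounds); the result lands in the same ballpark as the ~597 min
# reported there, with small differences from how update size is counted.
print(training_time_minutes(11.7e6, 0.39, 250))
```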
For SLoRA, we use equation 4.2 to compute the times for Stage 1 and Stage 2 separately and then sum them.

4.9 Performance of 20News group on Albert

Table 4.5 shows the impact of data heterogeneity on the different algorithms. Here we train for 150 rounds with α = 0.1 and α = 1000, and for 450 rounds with the pathological non-IID split. The update size for all methods is 0.18M parameters.

Table 4.5: Performance of the different algorithms for the 20News group dataset on the Albert model under different levels of data heterogeneity. The total number of communication rounds is 150 for the LDA distributions and 450 for the pathological split. The number of trainable parameters for all methods is 0.18M.

Method  | α = 1000 | α = 0.1 | Pathological
BitFit  | 62.8  | 42    | 31
Houlsby | 63.4  | 56.6  | 45
SFT     | 64.25 | 59.9  | 54.6
LoRA    | 65.5  | 59.3  | 44.6
SLoRA   | 65.2  | 60.3  | 58.6
DALoRA  | 61.28 | 59.28 | 57.28

4.10 Performance and Cost of 20News group on DistilBERT

Table 4.8 shows different aspects of the baseline methods for the 20News group dataset on the DistilBERT model. As expected, the cost of Stage 1 is similar to SFT, and the cost of Stage 2 is equal to that of LoRA. As mentioned earlier, to have a fair comparison we train the different methods for different numbers of federated rounds and fix only the communication cost. The table shows that SLoRA performs better than FFT and still has a shorter training time.

Table 4.6: Performance and training cost of the different algorithms for the News Category dataset on the Albert model.

Method | # Trainable Parameters | Training Time (min) | Accuracy (%) | Communication (Gbits)
FFT   | 11.7M | 559  | 65.2 ± 0.6 | 174
LoRA  | 0.18M | 39.5 | 56.8 ± 5   | 10.2
SFT   | 0.18M | 140  | 55.1 ± 0.6 | 10.2
SLoRA | 0.18M | 41.5 | 62.8 ± 3   | 10.2

Table 4.7: Performance and training cost of the different algorithms for the News Category dataset on the DistilBERT model. We train SLoRA in Stage 2 for only 50 rounds.

Method | # Trainable Parameters | Training Time (min) | Accuracy (%) | Communication (Gbits)
FFT   | 67.0M | 3406 | 61.6 ± 1.5 | 997
LoRA  | 0.9M  | 161  | 50.2 ± 5   | 45.5
SFT   | 0.9M  | 177  | 55.8 ± 0.1 | 45.5
SLoRA | 0.9M  | 160  | 56.1 ± 1   | 45.5

4.11 Performance of News Category on Albert and DistilBERT

The results for the News Category dataset on Albert and DistilBERT are summarized in Table 4.6 and Table 4.7, respectively. As depicted in the tables, the same observations also hold for this dataset.

4.12 Impact of Data Heterogeneity on PEFT Methods

Here we repeat the experiments from the evaluation section for the News Category dataset on the Albert and DistilBERT models. A trend similar to Figure 4.4 can also be observed in Figure 4.7.

4.13 Impact of Update Size

In Tables 4.9, 4.10, and 4.11, we show the impact of the update size in different settings. As expected, higher-density updates perform better in all settings, and performance drops considerably as data heterogeneity increases.

Figure 4.7: Performance of PEFT methods (FFT, LoRA, Pfeiffer, Houlsby, BitFit) under different data distributions (centralized, LDA(α=1000), LDA(α=0.1), pathological) for the News Category dataset: (a) on Albert; (b) on DistilBERT.
Table 4.8: Performance and training cost of the different algorithms for the 20News group dataset on the DistilBERT model.

Method | # Trainable Parameters | Computation Time (sec/epoch) | Total FL Rounds | Training Time (min) | Comm. (Gbits) | Accuracy (%)
FFT   | 67.0M | 0.19 | 250  | 3407 | 997  | 64.4 ± 2
LoRA  | 0.9M  | 0.20 | 1250 | 203  | 57.5 | 63.1 ± 0.5
SFT   | 0.9M  | 1.02 | 1250 | 220  | 57.5 | 62.3 ± 0.9
SLoRA | 0.9M  | Stage 1: 1.1, Stage 2: 0.20 | 300 | 186 | 57.5 | 64.8 ± 0.4
SVD decomposition (one-time cost) | — | 15.4 | — | — | — | —

Table 4.9: Impact of the update density of PEFT methods for the 20News group dataset on DistilBERT.

Data Distribution | Density (%) | LoRA (%) | Houlsby (%) | Pfeiffer (%)
Centralized  | 10 | 70.0 | 70.0 | 69.5
             | 1  | 69.8 | 69.6 | 68.7
α = 0.1      | 10 | 66.3 | 65.3 | 65.3
             | 1  | 65.8 | 65.0 | 64.8
Pathological | 10 | 58.6 | 55.1 | 55.7
             | 1  | 57.2 | 54.7 | 54.2

Table 4.10: Impact of the update density of PEFT methods for the News Category dataset on Albert.

Data Distribution | Density (%) | LoRA (%) | Houlsby (%) | Pfeiffer (%)
Centralized  | 10 | 76.0  | 76.0  | 76.0
             | 1  | 76.0  | 75.9  | 75.9
α = 0.1      | 10 | 72.3  | 68.54 | 70.6
             | 1  | 71    | 68.17 | 70.27
Pathological | 10 | 60.7  | 33.6  | 57.0
             | 1  | 41.96 | 27.78 | 50.8

Table 4.11: Impact of the update density of PEFT methods for the News Category dataset on DistilBERT.

Data Distribution | Density (%) | LoRA (%) | Houlsby (%) | Pfeiffer (%)
Centralized  | 10 | 78.4  | 78.0  | 78.15
             | 1  | 77.99 | 77.94 | 77.85
α = 0.1      | 10 | 73.55 | 72.2  | 71.6
             | 1  | 71.6  | 71.6  | 71.362
Pathological | 10 | 49.39 | 38.03 | 37.4
             | 1  | 33.9  | 33.9  | 28.2

Chapter 5
Basil: A Fast and Byzantine-Resilient Approach for Decentralized Training

Decentralized (i.e., serverless) training across edge nodes can suffer substantially from potential Byzantine nodes that degrade the training performance. However, detection and mitigation of Byzantine behaviors in a decentralized learning setting is a daunting task, especially when the data distribution at the users is heterogeneous. As our main contribution, we propose Basil, a fast and computationally efficient Byzantine-robust algorithm for decentralized training systems, which leverages a novel sequential, memory-assisted, and performance-based criterion for training over a logical ring while filtering out Byzantine users. In the IID dataset setting, we provide theoretical convergence guarantees for Basil, demonstrating its linear convergence rate. Furthermore, for the IID setting, we experimentally demonstrate that Basil is robust to various Byzantine attacks, including the strong Hidden attack, while providing up to ∼16% absolute higher test accuracy over the state-of-the-art Byzantine-resilient decentralized learning approach. Additionally, we generalize Basil to the non-IID setting by proposing Anonymous Cyclic Data Sharing (ACDS), a technique that allows each node to anonymously share a random fraction of its local non-sensitive dataset (e.g., landmark images) with all other nodes. Finally, to reduce the overall latency of Basil resulting from its sequential implementation over the logical ring, we propose Basil+, which enables Byzantine-robust parallel training across groups of logical rings while retaining the performance gains of Basil due to sequential training within each group. Furthermore, we experimentally demonstrate the scalability gains of Basil+ through different sets of experiments.

5.1 Introduction

Thanks to the large amounts of data generated on and held by edge devices, machine learning (ML) applications can achieve significant performance [25, 30]. However, privacy concerns and regulations [40] make it extremely difficult to pool clients' datasets for a centralized ML training procedure. As a result, distributed machine learning methods are attracting a surge of recent interest.
The key underlying goal in distributed machine learning at the edge is to learn a global model using the data stored across the edge devices. Federated learning (FL) has emerged as a promising framework [57] for distributed machine learning. In federated learning, the training process is facilitated by a central server, while the task of training is federated among the clients themselves. Specifically, each participating client trains a local model based on its own (private) training dataset and shares only the trained local model with the central entity, which appropriately aggregates the clients' local models. While the existence of the parameter server in FL is advantageous for orchestrating the training process, it brings new security and efficiency drawbacks [76, 57]. In particular, the parameter server in FL is a single point of failure, as the entire learning process could fail when the server crashes or gets hacked. Additionally, the parameter server can itself become a performance bottleneck due to the large number of mobile devices that need to be handled simultaneously.

Training in a decentralized setup is another approach to distributed machine learning that does not rely on a central coordinator (e.g., a parameter server), thus avoiding the aforementioned limitations of FL. Instead, it requires only on-device computations at the edge nodes and peer-to-peer communications. In fact, many decentralized training algorithms have been proposed for this setup. In particular, a class of gossip-based algorithms over random graphs has been proposed, e.g., [61, 28, 105], in which all nodes participate in each training round. During training, each node maintains a local model and communicates with others over a graph-based decentralized network. More specifically, every node updates its local model using its local dataset, as well as the models received from the nodes in its neighborhood. For example, a simple aggregation rule at each node is to average the locally updated model with the models from the neighboring nodes. Thus, each node performs both model training and model aggregation.

Although decentralized training provides many benefits, its decentralized nature makes it vulnerable to performance degradation due to system failures, malicious nodes, and data heterogeneity [57]. Specifically, one of the key challenges in decentralized training is the presence of different threats that can alter the learning process, such as software/hardware errors and adversarial attacks. In particular, some clients can become faulty due to software bugs or hardware components that behave arbitrarily, or can even get hacked during training, sending arbitrary or malicious values to other clients and thus severely degrading the overall convergence performance. Such faults, where client nodes arbitrarily deviate from the agreed-upon protocol, are called Byzantine faults [67]. To mitigate Byzantine nodes in a graph-based decentralized setup where nodes are randomly connected to each other, some Byzantine-robust optimization algorithms have been introduced recently, e.g., [45, 115]. In these algorithms, each node combines the set of models received from its neighbors using robust aggregation rules, to ensure that the training is not impacted by the Byzantine nodes. However, to the best of our knowledge, none of these algorithms has considered the scenario where the data distribution at the nodes is heterogeneous.
Data heterogeneity makes the detection of Byzantine nodes a daunting task, since it becomes unclear whether a model drift should be attributed to a Byzantine node or to the very heterogeneous nature of the data. Even in the absence of Byzantine nodes, data heterogeneity can degrade the convergence rate [57].

The limited computation and communication resources of edge devices (e.g., IoT devices) are another important consideration in the decentralized training setting. In fact, the resources at the edge devices are considered a critical bottleneck for performing distributed training of large models [57, 25]. In prior Byzantine-robust decentralized algorithms (e.g., [45, 115]), which are based on parallel training over a random graph, all nodes need to be active and performing training during the entire training process. Therefore, they might not be suitable for resource-constrained edge devices, as the parallel nature of these algorithms requires nodes to be perpetually active, which could drain their resources. In contrast to parallel training over a random graph, our work takes the view that sequential training over a logical ring is better suited for decentralized training in resource-constrained edge settings. Specifically, sequential training over a logical ring allows each node to become active and perform model training only when it receives the model from its counterclockwise neighbor. Since nodes need not be active during the whole training time, the sequential nature of the training makes it suitable for IoT devices with limited computational resources.

5.1.1 Contributions

To overcome the aforementioned limitations of prior graph-based Byzantine-robust algorithms, we propose Basil, an efficient Byzantine mitigation algorithm that achieves Byzantine robustness in a decentralized setting by leveraging sequential training over a logical ring. To highlight some of the benefits of Basil, Figure 5.1(a) illustrates a sample result demonstrating the performance gains and cost reduction compared to the state-of-the-art Byzantine-robust algorithm UBAR, which leverages parallel training in a graph-based setting. We observe that Basil attains an absolute test accuracy improvement of ∼16% over UBAR. Additionally, we note that while UBAR achieves its highest accuracy in ∼500 rounds, Basil achieves UBAR's highest accuracy in just ∼100 rounds. This implies that, to reach UBAR's highest accuracy, each client in Basil uses 5× less computation and communication than in UBAR, confirming the gains attained from the sequential training nature of Basil. In the following, we further highlight the key aspects and performance gains of Basil:

• In Basil, the defense technique to filter out Byzantine nodes is a performance-based strategy, wherein each node evaluates a received set of models from its counterclockwise neighbors using its own local dataset to select the best candidate.

• We theoretically show that, for convex loss functions in the IID data setting, Basil has a linear convergence rate with respect to the product of the number of benign nodes and the total number of training rounds over the ring. Thus, our theoretical result demonstrates scalable performance for Basil with respect to the number of nodes.

• We empirically demonstrate the superior performance of Basil compared to UBAR, the state-of-the-art Byzantine-resilient decentralized learning algorithm over graphs, under different Byzantine attacks in the IID setting.
Additionally, we study the performance of Basil and UBAR with respect to wall-clock time in Appendix 5.8.9, showing that the training time of Basil is comparable to UBAR's.

• To extend the superior benefits of Basil to the scenario where the data distribution is non-IID across devices, we propose Anonymous Cyclic Data Sharing (ACDS), to be applied on top of Basil. To the best of our knowledge, no prior decentralized Byzantine-robust algorithm has considered the scenario where the data distribution at the nodes is non-IID. ACDS allows each node to share a random fraction of its local non-sensitive dataset (e.g., landmark images captured during tours) with all other nodes, while guaranteeing the anonymity of the node identity. As highlighted in Section 5.1.2, there are multiple real-world use cases where anonymous data sharing is sufficient to meet users' privacy concerns.

Figure 5.1: A highlight of the performance benefits of Basil, compared with the state-of-the-art (UBAR) [45], for CIFAR10 under different settings (plots of test accuracy vs. round k over 500 rounds; panels: (a) Gaussian attack (IID setting), (b) no attack (non-IID setting), (c) Gaussian attack). Figure 5.1(a) shows the superior performance of Basil over UBAR, with a ∼16% improvement in test accuracy under the Gaussian attack in the IID setting. Figure 5.1(b) demonstrates that, in the non-IID setting, the test accuracy of sequential training over the ring topology can be increased by up to ∼10% in the absence of Byzantine nodes when each node shares only 5% of its local data anonymously with other nodes. Figure 5.1(c) shows that ACDS on top of Basil not only provides Byzantine robustness to the Gaussian attack in the non-IID setting, but also yields higher performance than UBAR in the IID setting; furthermore, UBAR in the non-IID setting completely fails in the presence of this attack. For further details, please refer to Section 5.6.

• We experimentally demonstrate that using ACDS with only 5% data sharing on top of Basil provides resiliency to Byzantine behaviors, unlike UBAR, which diverges in the non-IID setting (Figure 5.1(c)).

• As the number of participating nodes in Basil scales, the increase in the overall latency of sequential training over the logical ring topology may limit the practicality of implementing Basil. Therefore, we propose a parallel extension of Basil, named Basil+, that provides further scalability by enabling Byzantine-robust parallel training across groups of logical rings, while retaining the performance gains of Basil through sequential training within each group.

5.1.2 Related Works

Many Byzantine-robust strategies have been proposed recently for the distributed training setup (federated learning), in which a central server orchestrates the training process [91, 111, 112, 14, 116, 37, 99, 87, 122, 88, 119]. These Byzantine-robust optimization algorithms combine the gradients received from all workers using robust aggregation rules, to ensure that training is not impacted by malicious nodes. Some of these strategies [14, 116, 87, 37, 99] are based on distance-based approaches, while others are based on performance-based criteria [91, 111, 112, 88]. The key idea in distance-based defense solutions is to filter out updates that are far from the average of the updates from the benign nodes.
It has been shown that distance-based solutions are vulnerable to the sophisticated Hidden attack proposed in [33]. In this attack, Byzantine nodes craft gradients that are malicious yet indistinguishable in distance from benign gradients. On the other hand, performance-based filtering strategies rely on having some auxiliary dataset at the server to evaluate the model received from each node.

Compared to the large number of Byzantine-robust training algorithms for distributed training in the presence of a central server, there have been only a few recent works on Byzantine resiliency in the decentralized training setup with no central coordinator. In particular, to address Byzantine failures in decentralized training over a random graph in the scenario where the data distribution at the nodes is IID, the authors in [115] propose a trimmed-mean distance-based approach called BRIDGE to mitigate Byzantine nodes. However, the authors in [45] demonstrate that BRIDGE is defeated by the Hidden attack proposed in [33]. To address the limitations of distance-based approaches in the decentralized setup, [45] proposes an algorithm called UBAR, in which a combination of performance-based and distance-based stages is used to mitigate Byzantine nodes, where the performance-based stage at a particular node leverages only its local dataset. As demonstrated numerically in [45], the combination of these two strategies allows UBAR to defeat the Byzantine attack proposed in [33]. However, UBAR is not suitable for training over resource-constrained edge devices, as the training is carried out in parallel and nodes remain active all the time. In contrast, Basil is a fast and computationally efficient Byzantine-robust algorithm, which leverages a novel sequential, memory-assisted, and performance-based criterion for training over a logical ring while filtering out Byzantine users.

Data heterogeneity in the decentralized setting has been studied in some recent works (e.g., [71]) in the absence of Byzantine nodes. In particular, the authors of TornadoAggregate [71] propose clustering users into groups using algorithms called Group-BY-IID and CLUSTER, both of which use the EMD (earth mover's distance), which can approximately model the learning divergences between the models, to complete the grouping. However, the EMD function relies on having publicly shared data at each node, which can be collected as in [124]. In particular, to improve training on non-IID data in federated learning, [124] proposes sharing small portions of users' data with the server. The parameter server pools the received subsets of data, thus creating a small dataset representative of the data distributed at the clients, which is then globally shared between all the nodes to bring the data distribution closer to IID. However, this data sharing approach is insecure in scenarios where users are fine with sharing some of their data with each other but want to keep their identities anonymous, i.e., the shared data should not reveal who its owners are. There are multiple real-world use cases where anonymous data sharing is sufficient for privacy concerns. For example, mobile users may be fine with sharing some of their own text data, which does not contain any personal or sensitive information, with others, as long as their personal identities remain anonymous. Another example is the sharing of some non-private data (such as landmark images) collected by a person with others.
In this scenario, although the data itself is not privacy sensitive, revealing the identity of the users can potentially leak private information such as personal interests, location, or travel history. Our proposed ACDS strategy is suitable for such scenarios, as it guarantees that the identity of the owner of the shared data points is kept hidden.

As a final remark, we point out that for anonymous data sharing, [31] proposed an approach based on a secure sum operation along with anonymous ID assignment (AIDA). This involves computational operations at the nodes, such as polynomial evaluations and arithmetic operations like modular reductions. Thus, this algorithm may fail in the presence of Byzantine faults arising during these computations. In particular, computation errors or software bugs can occur during the AIDA algorithm, leading to a failure of the anonymous ID assignment, or during the secure sum algorithm, leading to distortion of the shared data.

5.2 Problem Statement

We formally define the decentralized learning system in the presence of Byzantine faults.

5.2.1 Decentralized System Model

We consider a decentralized learning setting in which a set $\mathcal{N} = \{1, \dots, N\}$ of $|\mathcal{N}| = N$ nodes collaboratively trains a machine learning (ML) model $x \in \mathbb{R}^d$, where $d$ is the model size, based on all the training data samples $\cup_{n \in \mathcal{N}} \mathcal{Z}_n$ that are generated and stored at these distributed nodes, where the size of each local dataset is $|\mathcal{Z}_i| = D_i$ data points.

In this work, we are motivated by the edge IoT setting, where users want to collaboratively train an ML model in the absence of a centralized parameter server. The communication in this setting leverages the underlying communication fabric of the internet, which connects any pair of nodes directly via overlay communication protocols. Specifically, we assume that there is no central parameter server, and we consider a setup where the training process is carried out sequentially over a clockwise directed ring. Each node in this ring topology takes part in the training process when it receives the model from the previous counterclockwise node. In Section 5.3.1, we propose a method by which nodes can consensually agree on a random ordering over a logical ring at the beginning of the training process, so that each node knows the logical ring positions of the other nodes. Therefore, without loss of generality and to simplify notation, we assume that the indices of the nodes on the ring are arranged in ascending order starting from node 1. In this setup, each node can send its model update to any set of users in the network.

In this decentralized setting, an unknown β-proportion of nodes can be Byzantine, where β ∈ (0, 1), meaning they can send arbitrary and possibly malicious results to the other nodes. We denote by $\mathcal{R}$ (with cardinality $|\mathcal{R}| = r$) and $\mathcal{B}$ (with cardinality $|\mathcal{B}| = b$) the sets of benign and Byzantine nodes, respectively. Furthermore, Byzantine nodes are uniformly distributed over the ring due to the consensus-based random order agreement. Finally, we assume nodes can authenticate the source of a message, so no Byzantine node can forge its identity or create multiple fake ones [21].
5.2.2 Model Training

Each node in the set $\mathcal{R}$ of benign nodes uses its own dataset to collaboratively train a shared model by solving the following optimization problem in the presence of Byzantine nodes:

$x^* = \arg\min_{x \in \mathbb{R}^d} \Big[ f(x) := \frac{1}{r} \sum_{i=1}^{r} f_i(x) \Big]$,    (5.1)

where $x$ is the optimization variable and $f_i(x)$ is the expected loss function of node $i$, i.e., $f_i(x) = \mathbb{E}_{\zeta_i \sim \mathcal{P}_i}[l_i(x, \zeta_i)]$. Here, $l_i(x, \zeta_i) \in \mathbb{R}$ denotes the loss function for model parameter $x \in \mathbb{R}^d$ and a given realization $\zeta_i$, which is generated from a distribution $\mathcal{P}_i$. The general update rule in this decentralized setting is as follows. In the $k$-th round, the current active node $i$ updates the global model according to

$x_k^{(i)} = \bar{x}_k^{(i)} - \eta_k^{(i)} g_i(\bar{x}_k^{(i)})$,    (5.2)

where $\bar{x}_k^{(i)} = \mathcal{A}(x_v^{(j)}, j \in \mathcal{N}, v = 1, \dots, k)$ is the model selected by node $i$ according to the underlying aggregation rule $\mathcal{A}$, $g_i(\bar{x}_k^{(i)})$ is the stochastic gradient computed by node $i$ using a random sample from its local dataset $\mathcal{Z}_i$, and $\eta_k^{(i)}$ is the learning rate used by node $i$ in round $k$.

Threat model: A Byzantine node $i \in \mathcal{B}$ could send a faulty or malicious update $x_k^{(i)} = *$, where "*" denotes that $x_k^{(i)}$ can be an arbitrary vector in $\mathbb{R}^d$. Furthermore, Byzantine nodes cannot forge their identities or create multiple fake ones. This assumption has been used in several prior works (e.g., [21]).

Our goal is to design an algorithm for the decentralized training setup discussed earlier while mitigating the impact of the Byzantine nodes. Toward this goal, we propose Basil, which is described next.

5.3 The Proposed Basil Algorithm

We now describe Basil, our proposed approach for mitigating both malicious and faulty updates in the IID setting, where the local dataset $\mathcal{Z}_i$ at node $i$ consists of IID samples from a distribution $\mathcal{P}_i$ with $\mathcal{P}_i = \mathcal{P}$ for all $i \in \mathcal{N}$, and we characterize the complexity of Basil. In Section 5.4, we extend Basil to the non-IID setting by integrating it with our proposed Anonymous Cyclic Data Sharing scheme.

5.3.1 Basil for IID Setting

Our proposed Basil algorithm, given in Algorithm 4, consists of two phases, an initialization phase and a training phase, which are described below.

Phase 1: Order Agreement over the Logical Ring. Before the training starts, nodes consensually agree on their order on the ring using the following simple steps. 1) All users first share their IDs with each other; we assume WLOG that nodes' IDs can be arranged in ascending order. 2) Each node locally generates the order permutation of the users' IDs using a pseudo-random number generator (PRNG) initialized with a common seed (e.g., N). This ensures that all nodes generate the same ring order.
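A minimal sketch of this order-agreement step (our illustration; the dissertation does not prescribe a specific PRNG) is shown below: every node runs the same seeded shuffle locally and therefore derives an identical ring order without any extra communication.

```python
import random

def agree_on_ring_order(node_ids: list, common_seed: int) -> list:
    """Every node calls this locally with the same inputs; the shared seed
    guarantees all nodes compute the identical random ring order."""
    order = sorted(node_ids)                 # canonical starting arrangement
    random.Random(common_seed).shuffle(order)
    return order

# Example: N = 6 nodes, seed chosen as N per the text
ring = agree_on_ring_order([11, 42, 7, 19, 23, 35], common_seed=6)
# each node's clockwise successor is ring[(pos + 1) % len(ring)]
```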
Figure 5.2: Basil with N = 6 nodes, where node 3 and node 6 are Byzantine. Node 1, the currently active benign node in the k-th round, selects the one of its 3 stored models that yields the lowest loss when evaluated on a mini-batch from its local dataset $\mathcal{Z}_1$. Node 1 then updates the selected model using the same mini-batch according to equation 5.2 before multicasting it to its next 3 clockwise neighbors.

Phase 2: Robust Training. As illustrated in Figure 5.2, Basil leverages sequential training over the logical ring to mitigate the effect of Byzantine nodes. At a high level, in the $k$-th round, the current active node carries out the model update step in equation 5.2 and then multicasts its updated model to the next $S = b + 1$ clockwise nodes, where $b$ is the worst-case number of Byzantine nodes. We note that multicasting each model to the next $b + 1$ neighbors is crucial to ensure that the benign subgraph, generated by excluding the Byzantine nodes, is connected. Connectivity of the benign subgraph is important, as it ensures that each benign node can still receive information from a few other non-faulty nodes, i.e., the good updates can successfully propagate between the benign nodes. Even in the scenario where all Byzantine nodes come in a row, multicasting each updated model to the next $S$ clockwise neighbors preserves the connectivity of the benign nodes.

We now describe the aggregation rule $\mathcal{A}_{\text{Basil}}$ that node $i$ implements to obtain the model $\bar{x}_k^{(i)}$ for the update in equation 5.2. Node $i$ stores the $S$ latest models from its previous $S$ counterclockwise neighbors. As highlighted above, the reason for storing $S$ models is to ensure that each stored set of models at node $i$ contains at least one good model. When node $i$ is active, it implements our proposed performance-based criterion to pick the best model out of its $S$ stored models. In the following, we formally define our model selection criterion:

Definition 5.1. (Basil Aggregation Rule) In the $k$-th round over the ring, let $\mathcal{N}_k^i = \{y_1, \dots, y_S\}$ be the set of the $S$ latest models from the $S$ counterclockwise neighbors of node $i$. Let $\zeta_i$ be a random sample from the local dataset $\mathcal{Z}_i$, and let $l_i(y_j) = l_i(y_j, \zeta_i)$ be the loss of node $i$ evaluated on the model $y_j \in \mathcal{N}_k^i$ using the random sample $\zeta_i$. The proposed Basil aggregation rule is

$\bar{x}_k^{(i)} = \mathcal{A}_{\text{Basil}}(\mathcal{N}_k^i) = \arg\min_{y \in \mathcal{N}_k^i} \mathbb{E}\left[ l_i(y, \zeta_i) \right]$.    (5.3)

In practice, node $i$ can sample a mini-batch from its dataset and use it as validation data to test the performance (i.e., the loss value) of each of the $S$ neighboring models, setting $\bar{x}_k^{(i)}$ to the model with the lowest loss among the $S$ stored models. As demonstrated in our experiments in Section 5.6, this practical mini-batch implementation of the Basil criterion in Definition 5.1 is sufficient to mitigate Byzantine nodes in the network while achieving performance superior to the state of the art. In the following, we characterize the communication, computation, and storage costs of Basil.

Proposition 5.1. The communication, computation, and storage complexities of the Basil algorithm are all $O(Sd)$ for each node in each iteration, where $d$ is the model size.

The complexity of the prior graph-based Byzantine-robust decentralized algorithms UBAR [45] and BRIDGE [115] is $O(H_i d)$, where $H_i$ is the number of neighbors (i.e., the connectivity parameter) of node $i$ in the graph. We therefore conclude that Basil maintains the same per-round complexity as BRIDGE and UBAR, but with higher performance, as we show in Section 5.6.
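To make the mini-batch implementation of Definition 5.1 concrete before presenting the full algorithm, the following is a minimal PyTorch-style sketch (our illustration; the model and data interfaces are assumptions) of selecting the lowest-loss model among the S stored candidates.

```python
import copy
import torch
import torch.nn.functional as F

def basil_select(stored_states, model, x_val, y_val):
    """A_Basil (Definition 5.1, mini-batch form): evaluate each of the S
    stored model states on one local validation mini-batch and return the
    state with the lowest loss."""
    best_state, best_loss = None, float("inf")
    with torch.no_grad():
        for state in stored_states:
            model.load_state_dict(state)
            loss = F.cross_entropy(model(x_val), y_val).item()
            if loss < best_loss:
                best_state, best_loss = copy.deepcopy(state), loss
    return best_state

# The active node would then load best_state and take one SGD step on the
# same mini-batch (equation 5.2) before multicasting to its S successors.
```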
Algorithm 4: Basil
  Input: $\mathcal{N}$ (nodes); $S$ (connectivity); $\{\mathcal{Z}_i\}_{i \in \mathcal{N}}$ (local datasets); $x^0$ (initial model); $K$ (number of rounds)
  Initialization:
  for each node $i \in \mathcal{N}$ do
      StoredModels[$i$].insert($x^0$)   # the queue "StoredModels[i]" keeps the last S models inserted at node i, denoted $\mathcal{N}^i$; it is initialized with $x^0$
  end for
  Order ← RandomOrderAgreement($\mathcal{N}$)   # ring order generation according to Section 5.3.1
  Robust Training:
  for each round $k = 1, \dots, K$ do
      for each node $i = 1, \dots, N$ in sequence do
          if node $i \in$ benign set $\mathcal{R}$ then
              $\bar{x}_k^{(i)}$ ← $\mathcal{A}_{\text{Basil}}(\mathcal{N}^i)$   # performance-based selection of one model from $\mathcal{N}^i$ via equation 5.3
              $x_k^{(i)}$ ← Update($\bar{x}_k^{(i)}, \mathcal{Z}_i$)   # model update via equation 5.2
          else
              $x_k^{(i)}$ ← ∗   # Byzantine node sends a faulty model
          end if
          multicast $x_k^{(i)}$ to the next $S$ clockwise neighbors
          for neighbor $s = 1, \dots, S$ do
              StoredModels[$(i + s) \bmod N$].insert($x_k^{(i)}$)   # insert $x_k^{(i)}$ into the queue of the s-th neighbor of node i
          end for
      end for
  end for
  Return $\{x_K^{(i)}\}_{i \in \mathcal{N}}$

The costs in Proposition 5.1 can be reduced by relaxing the connectivity parameter to $S < b + 1$ while still guaranteeing the success of Basil (benign subgraph connectivity) with high probability, as formalized next.

Proposition 5.2. The number of models that each benign node needs to receive, store, and evaluate from its counterclockwise neighbors can be relaxed to $S < b + 1$ while guaranteeing the success of Basil (benign subgraph connectivity) with high probability. In particular, the failure probability of Basil is bounded as

$P(\text{Failure}) \leq \dfrac{b! \, (N - S)!}{(b - S)! \, (N - 1)!}$,    (5.4)

where $N$ and $b$ are the total number of nodes and the number of Byzantine nodes, respectively.

The proofs of Propositions 5.1 and 5.2 are given in Appendix 5.8.1.

Remark 5.1. To further illustrate the impact of choosing $S$ on the failure probability in equation 5.4, we consider the following numerical examples. Let the total number of nodes in the system be $N = 100$, where $b = 33$ of them are Byzantine, and let the storage parameter be $S = 15$. The failure probability in equation 5.4 turns out to be $\sim 4 \times 10^{-7}$, which is negligible. For $S = 10$, the failure probability becomes $\sim 5.34 \times 10^{-4}$, which remains reasonably small.

5.3.2 Theoretical Guarantees

We derive the convergence guarantees of Basil under the following standard assumptions.

Assumption 5.1. (IID data distribution) The local dataset $\mathcal{Z}_i$ at node $i$ consists of IID samples from a distribution $\mathcal{P}_i$, where $\mathcal{P}_i = \mathcal{P}$ for $i \in \mathcal{R}$. In other words, $f_i(x) = \mathbb{E}_{\zeta_i \sim \mathcal{P}_i}[l(x, \zeta_i)] = \mathbb{E}_{\zeta_j \sim \mathcal{P}_j}[l(x, \zeta_j)] = f_j(x)$ for all $i, j \in \mathcal{R}$. Hence, the global loss function is $f(x) = \mathbb{E}_{\zeta_i \sim \mathcal{P}_i}[l(x, \zeta_i)]$.

Assumption 5.2. (Bounded variance) The stochastic gradient $g_i(x)$ is unbiased and has bounded variance, i.e., $\mathbb{E}_{\mathcal{P}_i}[g_i(x)] = \nabla f_i(x) = \nabla f(x)$ and $\mathbb{E}_{\mathcal{P}_i} \|g_i(x) - \nabla f_i(x)\|^2 \leq \sigma^2$, where $g_i(x)$ is the stochastic gradient computed by node $i$ using a random sample $\zeta_i$ from its local dataset $\mathcal{Z}_i$.

Assumption 5.3. (Smoothness) The loss functions $f_i$ are $L$-smooth and twice differentiable, i.e., for any $x \in \mathbb{R}^d$, we have $\|\nabla^2 f_i(x)\|_2 \leq L$.

Assumption 5.4. The batch size at each node is large enough that

$B \geq \dfrac{\sigma^2}{\min_{i,k} \|\nabla f(x_k^{(i)})\|^2}$,    (5.5)

where $k$ and $i$ are the round index and node index, respectively.

Let $b^i$ be the number of counterclockwise Byzantine neighbors of node $i$. We divide the set of stored models $\mathcal{N}_k^i$ at node $i$ in the $k$-th round into two sets. The first set, $\mathcal{G}_k^i = \{y_1, \dots, y_{r^i}\}$, contains the benign models, where $r^i = S - b^i$. We consider scenarios with $S = b + 1$, where $b$ is the total number of Byzantine nodes in the network. Without loss of generality, we assume the models in this set are arranged so that the first model comes from the closest benign counterclockwise neighbor of node $i$, while the last comes from the farthest.
Similarly, we define the second set $\mathcal{B}_k^i$ as the set of models from the counterclockwise Byzantine neighbors of node $i$, so that $\mathcal{B}_k^i \cup \mathcal{G}_k^i = \mathcal{N}_k^i$.

Theorem 5.1. When the learning rate $\eta_k^{(i)}$ of node $i \in \mathcal{R}$ in round $k$ satisfies $\eta_k^{(i)} \leq \frac{1}{L}$, the expected loss $\mathbb{E}[l_i(\cdot)]$ of node $i$ evaluated on the models in $\mathcal{N}_k^i$ can be ordered as

$\mathbb{E}[l_i(y_1)] \leq \mathbb{E}[l_i(y_2)] \leq \dots \leq \mathbb{E}[l_i(y_{r^i})] < \mathbb{E}[l_i(x)] \quad \forall x \in \mathcal{B}_k^i$,    (5.6)

where $\mathcal{G}_k^i = \{y_1, \dots, y_{r^i}\}$ is the set of benign models stored at node $i$. Hence, the Basil aggregation rule in Definition 5.1 reduces to $\bar{x}_k^{(i)} = \mathcal{A}_{\text{Basil}}(\mathcal{N}_k^i) = y_1$, and the model update step in equation 5.2 simplifies to

$x_k^{(i)} = y_1 - \eta_k^{(i)} g_i(y_1)$.    (5.7)

Remark 5.2. For the Basil aggregation rule in Definition 5.1, equation 5.6 in Theorem 5.1 implies that for the convergence analysis we can consider only the benign subgraph, generated by removing the Byzantine nodes. As described in Section 5.3.1, the benign subgraph is connected. Furthermore, by equation 5.7 in Theorem 5.1, training via Basil reduces to sequential training over a logical ring containing only the set $\mathcal{R}$ of benign nodes with connectivity parameter $S = 1$.

Leveraging the results of Theorem 5.1 and the discussion in Remark 5.2, we prove a linear convergence rate for Basil under the additional assumption that the loss functions are convex.

Theorem 5.2. Assume that $f(x)$ is convex. Under Assumptions 5.1-5.3, Basil with a fixed learning rate $\eta = \frac{1}{L}$ at all users achieves linear convergence with a constant error term:

$\mathbb{E}\left[ f\left( \frac{1}{T} \sum_{s=1}^{T} x^s \right) \right] - f(x^*) \leq \frac{\|x^0 - x^*\|^2 L}{2T} + \frac{\sigma'^2}{L}$,    (5.8)

where $T = Kr$, $K$ is the total number of rounds over the ring, and $r$ is the number of benign nodes. Here $x^s$ is the model after $s$ update steps starting from the initial model $x^0$, where $s = rk + i$ with $i = 1, \dots, r$ and $k = 0, \dots, K - 1$. Furthermore, $x^*$ is the optimal solution of equation 5.1, and $\sigma'^2 = \frac{\sigma^2}{B}$ is the mini-batch variance, where $\sigma^2$ is defined in Assumption 5.2 and $B$ is the batch size.

Remark 5.3. The error bound for Basil decreases as the total number of benign nodes $r = \beta N$ increases, where $\beta \in (0, 1)$.

To extend Basil to be robust against software/hardware faults in the non-IID setting, i.e., when the local dataset $\mathcal{Z}_i$ at node $i$ consists of samples from a distribution $\mathcal{P}_i$ with $\mathcal{P}_i \neq \mathcal{P}_j$ for $i, j \in \mathcal{N}$ and $i \neq j$, we present our Anonymous Cyclic Data Sharing algorithm (ACDS) in the following section.

5.4 Generalizing Basil to the Non-IID Setting via Anonymous Cyclic Data Sharing

We propose Anonymous Cyclic Data Sharing (ACDS), an algorithm that can be integrated on top of Basil to guarantee robustness against software/hardware faults in the non-IID setting. This algorithm allows each node to anonymously share a fraction of its local non-sensitive dataset with other nodes. In other words, ACDS guarantees that the identity of the owner of the shared data is kept hidden from all other nodes, assuming no collusion between nodes.
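Before the phase-by-phase description in Section 5.4.1 below, the following minimal sketch (our illustration; data types and function names are assumptions) shows the core step each node performs in the within-group sharing phase: remove its previously contributed batch from the circulating pool, store the rest, add a fresh batch, and shuffle before forwarding.

```python
import random

def acds_node_step(pool: list, my_old_batch: list, my_new_batch: list,
                   my_store: list, rng: random.Random) -> list:
    """One shuffle-and-forward step of ACDS at a node. `pool` is the set of
    data points received from the counterclockwise neighbor."""
    old = set(my_old_batch)                 # assumes hashable data points
    remaining = [p for p in pool if p not in old]
    my_store.extend(remaining)              # stored points are unattributable
    out = remaining + list(my_new_batch)    # contribute this round's batch
    rng.shuffle(out)
    return out                              # send to the clockwise neighbor

# Example: node 2 receives node 1's first batch, stores it, mixes in its own
rng = random.Random(0)
store_2 = []
forwarded = acds_node_step(pool=["a1", "b1"], my_old_batch=[],
                           my_new_batch=["a2", "b2"],
                           my_store=store_2, rng=rng)
```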
Figure 5.3: The ACDS algorithm within a group g with n = 4 users, where each node $i_g \in \mathcal{N}_g$ has two batches $\{c_{i_g}^1, c_{i_g}^2\}$. Each round starts from node $1_g$ and proceeds in clockwise order. The dummy round is introduced to make sure that node $2_g$ and node $3_g$ receive their missing batches $\{c_{3_g}^2, c_{4_g}^2\}$ and $c_{4_g}^2$, respectively. Here, $\hat{c}_{i_g}$ represents a dummy batch of the same size as the other batches used by node $i_g \in \mathcal{N}_g$; this dummy batch could be a batch of public data sharing the same feature space used in the learning task.

5.4.1 ACDS Algorithm

The ACDS procedure has three phases, which are formally described next, summarized in Algorithm 5, and illustrated in Figure 5.3.

Phase 1: Initialization. ACDS starts by clustering the $N$ nodes into $G$ groups of rings, where the set of nodes in each group $g \in [G]$ is denoted by $\mathcal{N}_g = \{1_g, \dots, n_g\}$. Here, node $1_g$ is the starting node in ring $g$, and without loss of generality we assume all groups have the same size $n = \frac{N}{G}$. Node $i_g \in \mathcal{N}_g$, for $g \in [G]$, divides its dataset $\mathcal{Z}_{i_g}$ into sensitive ($\mathcal{Z}_{i_g}^s$) and non-sensitive ($\mathcal{Z}_{i_g}^{ns}$) portions, which can be done by each node during the data processing phase. Then, for a given hyperparameter $\alpha \in (0, 1)$, each node randomly selects $\alpha D$ points from its local non-sensitive dataset, where $|\mathcal{Z}_{i_g}| = D$, and partitions these points into $H$ batches denoted by $\{c_{i_g}^1, \dots, c_{i_g}^H\}$, where each batch has $M = \frac{\alpha D}{H}$ data points.

Phase 2: Within-Group Anonymous Data Sharing. In this phase, for each $g \in [G]$, the set of nodes $\mathcal{N}_g$ in group $g$ anonymously share their data batches $\{c_{1_g}^j, \dots, c_{n_g}^j\}_{j \in [H]}$ with each other. The anonymous data sharing within group $g$ takes $H + 1$ rounds. In particular, as shown in Figure 5.3, within group $g$ during the first round $h = 1$, node $1_g$ sends its first batch $c_{1_g}^1$ to its clockwise neighbor, node $2_g$, which stores the received batch. After that, node $2_g$ augments the received batch with its own first batch $c_{2_g}^1$ and shuffles the two together before sending them to node $3_g$. More generally, in this intra-group cyclic sharing over the ring, node $i_g$ stores the received set of shuffled data points from the batches $\{c_{1_g}^1, \dots, c_{(i-1)_g}^1\}$ sent by its counterclockwise neighbor, node $(i-1)_g$. It then adds its own batch $c_{i_g}^1$ to the received set, shuffles the combined set, and sends it to node $(i+1)_g$.

For rounds $1 < h \leq H$, as shown in Figure 5.3 (round 2), node $1_g$ holds the data points of the batches $\{c_{1_g}^{h-1}, \dots, c_{n_g}^{h-1}\}$ received from node $n_g$ at the end of round $h - 1$. It first removes its own old batch $c_{1_g}^{h-1}$ and stores the remaining data points. It then adds its $h$-th batch $c_{1_g}^h$ to this remaining set and shuffles the entire set of data points in the new set of batches $\{c_{1_g}^h, c_{2_g}^{h-1}, \dots, c_{n_g}^{h-1}\}$ before sending it to node $2_g$. More generally, in the $h$-th round with $1 < h \leq H$,
node $i_g$ first removes its batch $c_{i_g}^{h-1}$ from the received set of batches and stores the remaining data points. Thereafter, node $i_g$ adds $c_{i_g}^h$ to the set $\{c_{1_g}^h, \dots, c_{(i-1)_g}^h, c_{i_g}^{h-1}, \dots, c_{n_g}^{h-1}\} \setminus \{c_{i_g}^{h-1}\}$ and shuffles the data points in the resulting set of batches $\{c_{1_g}^h, \dots, c_{i_g}^h, c_{(i+1)_g}^{h-1}, \dots, c_{n_g}^{h-1}\}$ before sending them to node $(i+1)_g$.

After the $H$ intra-group communication rounds described above, all nodes within each group have completely shared their first $H - 1$ batches with each other. However, due to the sequential nature of the unicast communications, some nodes are still missing batches shared by some clients in the $H$-th round. For instance, in Figure 5.3, after the completion of the second round, node $2_g$ is missing the last batches $c_{3_g}^2$ and $c_{4_g}^2$. Therefore, we propose a final dummy round in which we repeat the same procedure as in rounds $1 < h \leq H$, with the following slight modification: node $i_g$ replaces its batch $c_{i_g}^H$ with a dummy batch $\hat{c}_{i_g}$. This dummy batch could be a batch of public data points sharing the same feature space used in the learning task. This completes the anonymous cyclic data sharing within group $g \in [G]$.

Phase 3: Global Sharing. For each $g \in [G]$, node $1_g$ shares the dataset $\{c_{1_g}^j, \dots, c_{n_g}^j\}_{j \in [H]}$, which it has gathered in Phase 2, with all nodes in the other groups.

Algorithm 5: ACDS
  Input: $\mathcal{N}$ (nodes); $\{\mathcal{Z}_i\}_{i \in \mathcal{N}}$ (local datasets); $\alpha$ (data fraction); $H$ (number of batches); $G$ (number of groups)
  Phase 1: Initialization
  $\{\mathcal{N}_g\}_{g \in [G]}$ ← Clustering($\mathcal{N}, G$)   # cluster the nodes into G groups, each of size n
  for each node $i_g \in \mathcal{N}_g$ in parallel do
      $\mathcal{Z}_{i_g}^s \cup \mathcal{Z}_{i_g}^{ns}$ ← Partition($\mathcal{Z}_{i_g}$)   # split local data into sensitive and non-sensitive parts
      $\{c_{i_g}^1, \dots, c_{i_g}^H\}$ ← RandomSelection($\mathcal{Z}_{i_g}^{ns}, \alpha, H$)   # randomly select H batches, each of size αD/H, from $\mathcal{Z}_{i_g}^{ns}$
      DStored[$i_g$] = list()   # list used by node $i_g$ to store the data shared by other nodes
  end for
  DShared[$g$] = list(), ∀$g \in [G]$   # list used to circulate the data within group g
  Phase 2: Within-Group Anonymous Data Sharing
  for each group $g = 1, \dots, G$ in parallel do
      for each batch $h = 1, \dots, H$ do
          for each node $i_g \in \mathcal{N}_g$ in sequence do
              DShared[$g$], DStored[$i_g$] ← RobustShare(DShared[$g$], DStored[$i_g$], $g$, $i_g$, $h$, $c_{i_g}^{h-1}$, $c_{i_g}^h$)
              Send DShared[$g$] to the next clockwise neighbor
          end for
      end for
      # start the dummy round
      for each node $i_g \in \mathcal{N}_g \setminus \{1_g, n_g\}$ do
          DShared[$g$], DStored[$i_g$] ← RobustShare(DShared[$g$], DStored[$i_g$], $g$, $i_g$, $H$, $c_{i_g}^H$, $\hat{c}_{i_g}$)
      end for
  end for
  Function RobustShare(DShared[$g$], DStored[$i_g$], $g$, $i_g$, $h$, $c_{i_g}^{h-1}$, $c_{i_g}^h$):
      if $h > 1$ then
          DShared[$g$].remove($c_{i_g}^{h-1}$)   # remove this node's previous batch from the circulating pool
      end if
      DStored[$i_g$].add(DShared[$g$])   # copy the circulating data into local storage
      DShared[$g$].add($c_{i_g}^h$)   # add this node's h-th batch to the pool
      DShared[$g$].shuffle()   # shuffle the data in the pool
      return DShared[$g$], DStored[$i_g$]
  Phase 3: Global Sharing
  for $g = 1, \dots, G$ in parallel do
      node $1_g$ multicasts DStored[$1_g$] $\cup \{c_{1_g}^1, \dots, c_{1_g}^H\}$ to all nodes in $\cup_{g' \in [G] \setminus \{g\}} \mathcal{N}_{g'}$
  end for

We note that ACDS needs to be run only once, before training. As we demonstrate later in Section 5.6, this one-time overhead dramatically improves convergence performance when the data is non-IID. In the following proposition, we characterize the communication cost/time of ACDS.

Proposition 5.3. (Communication cost/time of ACDS) Consider the ACDS algorithm with a set of $N$ nodes divided equally into $G$ groups of $n$ nodes each. Each node $i \in [N]$ has $H$ batches, each of size $M = \frac{\alpha D}{H}$ data points, where $\alpha D$ is the amount of shared local data, $\alpha \in (0, 1)$, and $D$ is the local dataset size. Letting $I$ be the size of each data point in bits, we have the following.

(1) Worst-case communication cost per node:

$C_{\text{ACDS}} = \alpha D I \left( \frac{1}{H} + n(G + 1) \right)$.    (5.9)

(2) Total communication time for completing ACDS, when the upload/download bandwidth of each node is $R$ b/s:

$T_{\text{ACDS}} = \frac{\alpha D I}{H R} \left( n^2 (H + 0.5) + n \left( H(G - 1) - 1.5 \right) \right)$.    (5.10)
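As a quick numerical check of equation 5.9 (a sketch under our reading of the transcribed formula; the function name is our own), the following evaluates the per-node cost with the parameters of Remark 5.5 below and lands near the ~75 Mbits quoted there.

```python
def acds_cost_bits(alpha: float, D: int, I_bits: float,
                   H: int, n: int, G: int) -> float:
    """Worst-case per-node communication cost of ACDS, equation 5.9:
    C = alpha*D*I * (1/H + n*(G+1))."""
    return alpha * D * I_bits * (1.0 / H + n * (G + 1))

# Remark 5.5 parameters: N=100 nodes, G=4 groups (n=25), D=500 CIFAR10
# images of roughly I = 24.5 Kbits each, alpha = 5%, H = 5 batches
c = acds_cost_bits(alpha=0.05, D=500, I_bits=24.5e3, H=5, n=25, G=4)
print(c / 1e6, "Mbits")  # ~76.7 Mbits, close to the ~75 Mbits in Remark 5.5
```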
In the following proposition, we characterize the communication cost and time of ACDS.

Proposition 5.3. (Communication cost/time of ACDS) Consider the ACDS algorithm with a set of $N$ nodes divided equally into $G$ groups of $n$ nodes each. Each node $i \in [N]$ has $H$ batches, each of $M = \frac{\alpha D}{H}$ data points, where $\alpha D$ is the amount of shared local data, $\alpha \in (0,1)$, and $D$ is the local dataset size. Letting $I$ be the size of each data point in bits, we have the following:
(1) Worst-case communication cost per node ($C_{\text{ACDS}}$):

$C_{\text{ACDS}} = \alpha D I \left( \frac{1}{H} + n(G+1) \right)$.  (5.9)

(2) Total communication time for completing ACDS ($T_{\text{ACDS}}$): when the upload/download bandwidth of each node is $R$ b/s,

$T_{\text{ACDS}} = \frac{\alpha D I}{H R}\left( n^2(H + 0.5) + n\big(H(G-1) - 1.5\big) \right)$.  (5.10)

Remark 5.4. The worst-case communication cost in equation 5.9 is that of the first node $1_g$, for $g \in [G]$, which incurs a higher communication cost than the other nodes in group $g$ due to its participation in the global sharing phase of ACDS.

Remark 5.5. To illustrate the communication overhead resulting from using ACDS, we consider the following numerical example. Let the total number of nodes in the system be $N = 100$, and let each node have $D = 500$ images from the CIFAR10 dataset, where each image has size $I = 24.5$ Kbits*. When the communication bandwidth at each node is $R = 100$ Mb/s (e.g., 4G speed), and each node shares only $\alpha = 5\%$ of its dataset in the form of $H = 5$ batches, each of $M = 5$ images, the latency and communication cost of ACDS with $G = 4$ groups are 11 seconds and 75 Mbits, respectively. We note that the communication cost and completion time of ACDS are small compared to the training process, which requires sharing a large model over a large number of iterations, as demonstrated in Section 5.6.

*Each image in the CIFAR10 dataset has 3 channels, each of size 32 × 32 pixels, and each pixel takes a value from 0 to 255.

The proof of Proposition 5.3 is presented in Appendix 5.8.5. In the following, we discuss the anonymity guarantees of ACDS.

5.4.2 Anonymity Guarantees of ACDS

In the first round of the scheme, node $2_g$ knows that the source of the received batch $c^1_{1_g}$ is node $1_g$. Similarly, and more generally, node $i_g$ knows that each data point in the received set of batches $\{c^1_{1_g}, \dots, c^1_{(i-1)_g}\}$ was generated by one of its previous $i-1$ counterclockwise neighbors. However, in the subsequent $H-1$ rounds, each data point received by any node is equally likely to have been generated by any one of the remaining $n-1$ nodes in the group. Hence, it is only in the first round that the candidate pool from which a curious node could guess the owner of a given data point is small, and this risk is most pronounced for the first few nodes in the ring. In order to provide anonymity for the entire dataset and decrease this first-round risk, the batch size can be reduced to a single data point, so that in the first round node $2_g$ learns only one data point from node $1_g$. This comes at the expense of increasing the number of rounds. Another key consideration is that the number of nodes in each group trades the level of anonymity against the communication cost. In particular, the communication cost per node in ACDS is $O(n)$, while the anonymity level, which we measure by the number of possible candidates for a given data point, is $n-1$. Therefore, increasing $n$, i.e., decreasing the number of groups $G$, increases the communication cost but also increases the anonymity level.
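As a quick sanity check on the cost side of this tradeoff, the following snippet (our own, hypothetical helper) evaluates the worst-case per-node cost of equation 5.9 for the numbers used in Remark 5.5:

```python
def acds_worst_case_cost_bits(alpha, D, I, H, n, G):
    # Equation 5.9: C_ACDS = alpha * D * I * (1/H + n * (G + 1)).
    return alpha * D * I * (1.0 / H + n * (G + 1))

# Remark 5.5: alpha = 5%, D = 500 CIFAR10 images of I = 24.5 Kbits each,
# H = 5 batches, G = 4 groups of n = N / G = 25 nodes.
cost = acds_worst_case_cost_bits(0.05, 500, 24.5e3, 5, 25, 4)
print(f"{cost / 1e6:.1f} Mbits")  # close to the ~75 Mbits quoted above
```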
5.5 Basil+: Parallelization of Basil

In this section, we describe our modified protocol Basil+, which allows for parallel training across multiple rings together with sequential training over each ring. This decreases the training time needed to complete one global epoch (visiting all nodes) compared with Basil, which only considers sequential training over a single ring.

5.5.1 Basil+ Algorithm

At a high level, Basil+ divides the nodes into $G$ groups such that each group, in parallel, performs sequential training over a logical ring using the Basil algorithm. After $\tau$ rounds of sequential training within each group, a robust circular aggregation strategy is implemented to obtain a robust average model across all groups. Following the robust circular aggregation stage, a final multicasting step is implemented so that each group can use the resulting robust average model. This entire process is repeated for $K$ global rounds. We now formalize the execution of Basil+ through the following four stages.

Stage 1: Basil+ Initialization. The protocol starts by clustering the set of $N$ nodes equally into $G$ groups of rings with $n = N/G$ nodes in each group. The set of nodes in group $g$ is denoted by $\mathcal{N}_g = \{u^g_1, \dots, u^g_n\}$, where node $u^g_1$ is the starting node in ring $g$, for $g = 1, \dots, G$. The clustering of nodes follows a random splitting agreement protocol similar to the one in Section 5.3.1 (details are presented in Section 5.5.2). The connectivity parameter within each group is set to $S = \min(n-1, b+1)$, where $b$ is the worst-case number of Byzantine nodes. This choice of $S$ guarantees that each benign subgraph within each group is connected with high probability, as described in Proposition 5.2.

Stage 2: Within Group Parallel Implementation of Basil. Each group $g \in [G]$, in parallel, trains across its nodes for $\tau$ rounds using the Basil algorithm.

Stage 3: Robust Circular Aggregation Strategy. We denote by

$\mathcal{S}_g = \{u^g_{n}, u^g_{n-1}, \dots, u^g_{n-S+1}\}$  (5.11)

the set of the $S$ counterclockwise neighbors of node $u^g_1$. The robust circular aggregation strategy consists of $G-1$ steps performed sequentially over the $G$ groups, where the $G$ groups form a global ring. At step $g \in \{1, \dots, G-1\}$, the set of nodes $\mathcal{S}_g$ send their aggregated models to each node in the set $\mathcal{S}_{g+1}$. The reason for sending $S$ models from one group to another is to ensure the connectivity of the global ring when the Byzantine nodes are removed. The average aggregated model at node $u^{g+1}_i \in \mathcal{S}_{g+1}$ is given by

$z^{g+1}_i = \frac{1}{g+1}\left( x^{(i,g+1)}_\tau + g\,\bar{z}^{g+1}_i \right)$,  (5.12)

where $x^{(i,g+1)}_\tau$ is the local updated model at node $u^{g+1}_i$ in ring $g+1$ after $\tau$ rounds of updates according to the Basil algorithm. Here, $\bar{z}^{g+1}_i$ is the model selected by node $u^{g+1}_i$ from the set of models received from $\mathcal{S}_g$. More specifically, letting

$\mathcal{L}_g = \{z^g_{n}, z^g_{n-1}, \dots, z^g_{n-S+1}\}$  (5.13)

be the set of average aggregated models sent from the nodes in $\mathcal{S}_g$ to each node in $\mathcal{S}_{g+1}$, we define $\bar{z}^{g+1}_i$ to be the model selected from $\mathcal{L}_g$ by node $u^{g+1}_i \in \mathcal{S}_{g+1}$ using the Basil aggregation rule:

$\bar{z}^{g+1}_i = \mathcal{A}_{\text{Basil}}(\mathcal{L}_g) = \arg\min_{y \in \mathcal{L}_g} \mathbb{E}\left[ l^{g+1}_i(y, \zeta^{g+1}_i) \right]$.  (5.14)

Stage 4: Robust Multicasting. The final stage is the multicasting stage. The set of nodes in $\mathcal{S}_G$ send the final set of robust aggregated models $\mathcal{L}_G$ to $\mathcal{S}_1$. Each node in $\mathcal{S}_1$ applies the aggregation rule in equation 5.14 to the set of received models $\mathcal{L}_G$. Then, each benign node in $\mathcal{S}_1$ sends its filtered model $z^1_i$ to all nodes in the set $\cup_{g=1}^{G} \mathcal{U}_g$, where

$\mathcal{U}_g = \{u^g_1, u^g_2, \dots, u^g_S\}$.  (5.15)

Finally, all nodes in $\cup_{g=1}^{G} \mathcal{U}_g$ use the aggregation rule in equation 5.14 to select the best model out of the received set $\mathcal{L}_1$ before updating it according to equation 5.2. These four stages are repeated for $K$ rounds. We compare the training times of Basil and Basil+ in the following proposition.
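As an illustration of Stage 3, the following PyTorch-style sketch combines equations 5.12 and 5.14. It is our own condensed rendering (basil_select and circular_average are hypothetical helper names), and a single local batch stands in for the expectation in equation 5.14:

```python
import torch

def basil_select(candidate_models, net, loss_fn, eval_batch):
    # A_Basil (equation 5.14): return the model with the smallest loss
    # on a local batch of data.
    x, y = eval_batch
    best_model, best_loss = None, float("inf")
    for state in candidate_models:
        net.load_state_dict(state)
        with torch.no_grad():
            loss = loss_fn(net(x), y).item()
        if loss < best_loss:
            best_model, best_loss = state, loss
    return best_model

def circular_average(x_tau, z_bar, g):
    # Equation 5.12: running average that weights the aggregate of the
    # first g groups by g and the local post-training model by 1.
    return {k: (x_tau[k] + g * z_bar[k]) / (g + 1) for k in x_tau}
```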
Proposition 5.4. Let $T_{\text{comm}}$, $T_{\text{perf-based}}$, and $T_{\text{SGD}}$ denote, respectively, the time needed to multicast/receive one model, the time to evaluate $S$ models according to the Basil performance-based criterion, and the time to take one step of model update. The total training time for completing one global round of the Basil algorithm, where one global round consists of $\tau$ sequential laps over the ring, satisfies

$T_{\text{Basil}} \le (\tau n G)\,T_{\text{perf-based}} + (\tau n G)\,T_{\text{comm}} + (\tau n G)\,T_{\text{SGD}}$,  (5.16)

while the training time for the Basil+ algorithm satisfies

$T_{\text{Basil+}} \le (\tau n + G + 1)\,T_{\text{perf-based}} + (SG + \tau n - 1)\,T_{\text{comm}} + (\tau n)\,T_{\text{SGD}}$.  (5.17)

Remark 5.6. According to this proposition, the training time of Basil grows with the product $nG = N$, while the training time of Basil+ is linear in $n$ and $G$ separately.

The proof of Proposition 5.4 is given in Appendix 5.8.6. In the following section, we discuss the random clustering method used in Stage 1 of Basil+.

5.5.2 Random Clustering Agreement

In practice, nodes can agree on a random clustering by using an approach similar to that in Section 5.3.1, via the following simple steps. 1) All nodes first share their IDs with each other; we assume WLOG that the node IDs can be arranged in ascending order, and that Byzantine nodes cannot forge their identities or create multiple fake ones [21]. 2) Each node locally splits the nodes at random into $G$ subsets using a pseudo-random number generator (PRNG) initialized with a common seed (e.g., $N$). This ensures that all nodes generate the same grouping. To agree on the order of nodes within each group, the method in Section 5.3.1 can be used. A sketch of this agreement step is given after Algorithm 6.

Algorithm 6: Basil+
Input: $\mathcal{N}$; $S$; $\{\mathcal{Z}_i\}_{i\in\mathcal{N}}$; $x^0$; $\tau$; $K$
Stage 1: Initialization
    for each node $i \in \mathcal{N}$ do
        $\{\mathcal{N}_g\}_{g\in[G]}$ ← RandomClusteringAgreement($\mathcal{N}$, $G$)   // cluster the nodes into G groups, each of size n, according to Section 5.5.2
        $x^{(i,g)} \leftarrow x^0$, ∀$i \in \mathcal{N}_g$, $g \in [G]$
    end
    for each global round $k = 1, \dots, K$ do
        Stage 2: Within Groups Robust Training
        for each group $g = 1, \dots, G$ in parallel do
            $\{x^{(i,g)}_\tau\}_{i\in\mathcal{N}_g}$ ← Basil($\mathcal{N}_g$, $S$, $\{\mathcal{Z}_i\}_{i\in\mathcal{N}_g}$, $\{x^{(i,g)}\}_{i\in\mathcal{N}_g}$, $\tau$)   // apply Basil within each group for τ rounds
            $z^g_i \leftarrow x^{(i,g)}_\tau$, ∀$i \in \mathcal{N}_g$
        end
        Stage 3: Robust Circular Aggregation
        for each group $g = 1, \dots, G-1$ in sequence do
            for each node $u^{g+1}_i \in \mathcal{S}_{g+1}$ in parallel do   // the set $\mathcal{S}_{g+1}$ is defined in equation 5.11
                $\mathcal{L}_g \leftarrow \{z^g_i\}_{u^g_i \in \mathcal{S}_g}$   // each node $u^{g+1}_i \in \mathcal{S}_{g+1}$ receives the set of models $\mathcal{L}_g$ from the nodes in $\mathcal{S}_g$
                if node $u^{g+1}_i$ is benign then
                    $\bar{z}^{g+1}_i \leftarrow \mathcal{A}_{\text{Basil}}(\mathcal{L}_g)$   // performance-based selection of one model from $\mathcal{L}_g$ using equation 5.14
                    $z^{g+1}_i \leftarrow \frac{1}{g+1}\big(x^{(i,g+1)}_\tau + g\,\bar{z}^{g+1}_i\big)$   // average model over the first g+1 groups
                else
                    $z^{g+1}_i \leftarrow *$   // Byzantine node sends a faulty model
                end
            end
        end
        Stage 4: Robust Multicasting
        for each node $u^1_i \in \mathcal{S}_1$ in parallel do
            $\mathcal{L}_G \leftarrow \{z^G_i\}_{u^G_i \in \mathcal{S}_G}$
            $\bar{z}^1_i \leftarrow \mathcal{A}_{\text{Basil}}(\mathcal{L}_G)$
            $z^1_i \leftarrow \bar{z}^1_i$
        end
        for each node $u^g_i \in \cup_{g=1}^{G}\mathcal{U}_g$ in parallel do   // the set $\mathcal{U}_g$ is defined in equation 5.15
            $\mathcal{L}_1 \leftarrow \{z^1_i\}_{u^1_i \in \mathcal{S}_1}$   // each node receives the set of models $\mathcal{L}_1$ from the nodes in $\mathcal{S}_1$
            $\bar{z}^g_i \leftarrow \mathcal{A}_{\text{Basil}}(\mathcal{L}_1)$
            $x^{(i,g)} \leftarrow \bar{z}^g_i$
        end
    end
    Return $\{x^{(i,g)}_K\}_{i\in\mathcal{N}_g, g\in[G]}$
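The agreement step of Section 5.5.2 can be sketched in a few lines of Python; the common PRNG seed guarantees that every benign node computes the identical grouping (the function name is ours):

```python
import random

def random_clustering_agreement(node_ids, num_groups, seed):
    ids = sorted(node_ids)          # step 1: IDs exchanged, sorted ascending
    rng = random.Random(seed)       # step 2: PRNG with a common seed (e.g., N)
    rng.shuffle(ids)
    n = len(ids) // num_groups      # equal group size n = N / G
    return [ids[g * n:(g + 1) * n] for g in range(num_groups)]
```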
5.5.3 The Success of Basil+

We consider different scenarios for the connectivity parameter $S$ while evaluating the success of Basil+.

Case 1: $S = \min(n-1, b+1)$. We set the connectivity parameter for Basil+ to $S = \min(n-1, b+1)$. Setting $S = b+1$ when $n > b+1$ ensures the connectivity of each ring (after removing the Byzantine nodes) along with the global ring. On the other hand, when $S = n-1$ (i.e., $n \le b+1$), Basil+ fails only if at least one group contains $n$ or $n-1$ Byzantine nodes. We define $B_j$ to be the failure event in which the number of Byzantine nodes in group $j$ is $n$ or $n-1$. The number of Byzantine nodes in a given group follows a hypergeometric distribution with parameters $(N, b, n)$, where $N$, $b$, and $n$ are the total number of nodes, the total number of Byzantine nodes, and the number of nodes in each group, respectively. The probability of failure is bounded as

$P(\text{Failure}) = P\big(\bigcup_{j=1}^{G} B_j\big) \overset{(a)}{\le} \sum_{j=1}^{G} P(B_j) = \frac{\binom{b}{n} + \binom{b}{n-1}\binom{N-b}{1}}{\binom{N}{n}}\, G$,  (5.18)

where (a) follows from the union bound. In order to further illustrate the impact of the group size $n$ on the probability of failure in equation 5.18 when $S = n-1$, we consider the following numerical examples. Let the total number of nodes in the system be $N = 100$, where $b = 33$ of them are Byzantine. With $n = 20$ nodes in each group, the probability in equation 5.18 turns out to be $\sim 5 \times 10^{-10}$, which is negligible. For the case when $n = 10$, the probability of failure becomes $\sim 1.2 \times 10^{-4}$, which remains reasonably small.

Case 2: $S < n-1$. Similar to the failure analysis of Basil given in Proposition 5.2, we can relax the connectivity parameter to $S < n-1$, as stated in the following proposition.

Proposition 5.5. The connectivity parameter $S$ in Basil+ can be relaxed to $S < n-1$ while guaranteeing the success of the algorithm (benign local/global subgraph connectivity) with high probability. The failure probability of Basil+ is bounded by

$P(F) \le G \sum_{i=0}^{\min(b,n)} \left( n \prod_{s=0}^{S-1} \frac{\max(i-s,\, 0)}{N-s} \right) \frac{\binom{b}{i}\binom{N-b}{n-i}}{\binom{N}{n}}$,  (5.19)

where $N$, $n$, $G$, $S$, and $b$ are the total number of nodes, the number of nodes in each group, the number of groups, the connectivity parameter, and the number of Byzantine nodes, respectively.

The proof of Proposition 5.5 is presented in Appendix 5.8.7.

Remark 5.7. In order to further illustrate the impact of the choice of $S$ on the probability of failure in equation 5.19, we consider the following numerical examples. Let the total number of nodes in the system be $N = 400$, where $b = 60$ of them are Byzantine, with $n = 100$ and connectivity parameter $S = 10$. The probability of the failure event in equation 5.19 turns out to be $\sim 10^{-6}$, which is negligible. For the case when $S = 7$, the probability of failure becomes $\sim 10^{-4}$, which remains reasonably small.
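The bound in equation 5.18 is straightforward to evaluate numerically; a small sketch (our own) using Python's exact binomial coefficients:

```python
from math import comb

def basil_plus_case1_failure_bound(N, b, n):
    # Equation 5.18 with S = n - 1: a group fails only if it contains
    # n or n - 1 Byzantine nodes; union bound over the G = N // n groups.
    G = N // n
    return G * (comb(b, n) + comb(b, n - 1) * comb(N - b, 1)) / comb(N, n)

# The setting above: N = 100 nodes, b = 33 Byzantine, n = 20 per group.
print(basil_plus_case1_failure_bound(100, 33, 20))  # vanishingly small
```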
5.6 Numerical Experiments

We start by evaluating the performance gains of Basil; after that, we present the experiments for Basil+. We note that Appendix 5.8.9 includes additional experiments for Basil, including its wall-clock time performance compared to UBAR, the performance of Basil and ACDS on the CIFAR100 dataset, and a performance comparison between Basil and Basil+.

5.6.1 Numerical Experiments for Basil

Schemes: We consider four schemes, as described next.
• G-plain: This is for the graph-based topology. At the start of each round, nodes exchange their models with their neighbors. Each node then averages its model with the received neighboring models and uses the average to carry out an SGD step over its local dataset.
• R-plain: This is for the ring-based topology with S = 1. The current active node carries out an SGD step over its local dataset using the model received from its counterclockwise neighbor.
• UBAR: This is the prior state-of-the-art for mitigating Byzantine nodes in decentralized training over a graph, and is described in Appendix 5.8.8.
• Basil: This is our proposal.

Datasets and Hyperparameters: There are a total of 100 nodes, of which 67 are benign. For the decentralized network setting used to simulate the UBAR and G-plain schemes, we follow an approach similar to the experiments in [45] (we provide the details in Appendix 5.8.9.2). For Basil and R-plain, the nodes are arranged in a logical ring, and 33 of them are randomly set as Byzantine nodes. Furthermore, we set S = 10 for Basil, which gives the connectivity guarantees discussed in Proposition 5.2. We use a decreasing learning rate of $0.03/(1 + 0.03k)$. We consider CIFAR10 [63] and use a neural network with 2 convolutional layers and 3 fully connected layers; the details are included in Appendix 5.8.9.1. The training dataset is partitioned equally among all nodes. Furthermore, we report the worst test accuracy among the benign clients in our results. We also conduct similar evaluations on the MNIST dataset; the experimental results lead to the same conclusions and can be found in Appendix 5.8.9. Additionally, we emphasize that Basil is based on sequential training over a logical ring, while UBAR is based on parallel training over a graph topology. Hence, for consistency of the experimental evaluations, we adopt the following common definition of a training round.

Definition 5.2 (Training round). With respect to the total number of SGD computations, we define a round over a logical ring to be equivalent to one parallel iteration over a graph.

This definition aligns with our motivation of training with resource-constrained edge devices, where users' computation power is limited.

Byzantine Attacks: We consider a variety of attacks, described as follows (code sketches of the first two are given after this list). Gaussian Attack: Each Byzantine node replaces its model parameters with entries drawn from a Gaussian distribution with mean 0 and standard deviation σ = 1. Random Sign Flip: We observed in our experiments that the naive sign-flip attack, in which Byzantine nodes flip the sign of each parameter before exchanging their models with their neighbors, is not strong in the R-plain scheme. To make the sign-flip attack stronger, we propose a layer-wise sign flip, in which Byzantine nodes randomly choose to flip or keep the sign of all the elements in each neural network layer. Hidden Attack: This is the attack that degrades the performance of distance-based defense approaches, as proposed in [33]. Essentially, the Byzantine nodes are assumed to be omniscient, i.e., they can collect the models uploaded by all the benign clients. The Byzantine nodes then design their models such that they are undetectable from the benign ones in terms of the distance metric, while still degrading the training process. Since the key idea of the hidden attack is to exploit the similarity of the models from benign nodes, the Byzantine nodes launch this attack after 20 rounds of training to make it more effective.
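Minimal sketches of the Gaussian and layer-wise sign-flip attacks on a PyTorch state dict (our own renderings of the descriptions above):

```python
import torch

def gaussian_attack(state_dict):
    # Replace every parameter with entries drawn from N(0, 1).
    return {name: torch.randn_like(param) for name, param in state_dict.items()}

def layerwise_sign_flip(state_dict):
    # For each layer, flip or keep the sign of the *entire* layer's
    # elements, each with probability 1/2 (stronger than per-element flips).
    return {name: (-param if torch.rand(()).item() < 0.5 else param)
            for name, param in state_dict.items()}
```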
[Figure 5.4: Illustrating the performance of Basil using the CIFAR10 dataset under the IID data distribution setting. Panels (a) No Attack, (b) Gaussian Attack, (c) Random Sign Flip, and (d) Hidden Attack plot test accuracy versus round k for R-plain, G-plain, UBAR, and Basil.]

Results (IID Setting): We first present the results for the IID data setting, where the training dataset is shuffled randomly and then partitioned among the nodes. As can be seen from Figure 5.4(a), Basil converges much faster than both UBAR and G-plain even in the absence of any Byzantine attacks, illustrating the benefit of ring-topology-based learning over graph-based topology. We note that the total number of gradient updates after k rounds is almost the same in the two setups. We can also see that R-plain achieves higher performance than Basil. This is because Basil uses a small mini-batch for performance evaluation; hence, in contrast to R-plain, the latest neighborhood model may not be chosen in each round, resulting in the loss of some update steps. Nevertheless, Figures 5.4(b), (c), and (d) illustrate that Basil is not only Byzantine-resilient, it also maintains its superior performance over UBAR, with a ∼16% improvement in test accuracy, as opposed to R-plain, which suffers significantly. Furthermore, we highlight that since Basil uses a performance-based criterion for mitigating Byzantine nodes, it is robust to the hidden attack as well. Finally, considering the poor convergence of R-plain under the different Byzantine attacks, we conclude that Basil is a good solution with fast convergence, strong Byzantine resiliency, and acceptable computation overhead.

[Figure 5.5: Illustrating the performance of Basil using the CIFAR10 dataset under the non-IID data distribution setting. Panels: (a) No Attack 1, (b) No Attack 2, (c) Gaussian Attack, (d) Random Sign Flip; each plots test accuracy versus round k for different values of α.]

Results (non-IID Setting): To simulate the non-IID setting, we sort the training data by class, partition the sorted data into N subsets, and assign each node one partition (a code sketch of this partitioning is given at the end of this discussion). Applying ACDS in the absence of Byzantine nodes while trying different values of α, we found that α = 5% gives good performance while requiring only a small amount of shared data from each node. Figure 5.5(a) illustrates that the test accuracy of R-plain in the non-IID setting can be increased by up to ∼10% when each node shares only α = 5% of its local data with other nodes. Figures 5.5(c) and 5.5(d) illustrate that Basil on top of ACDS with α = 5% is robust to software/hardware faults represented by the Gaussian model and the random sign flip. Furthermore, both Basil without ACDS and UBAR completely fail in the presence of these faults. This is because both defenses use a performance-based criterion, which is not meaningful in the non-IID setting: each node has data from only one class, so it becomes unclear whether a high loss value for a given model should be attributed to Byzantine nodes or to the highly heterogeneous nature of the data. Additionally, R-plain with α = 0% and 5% completely fails in the presence of these faults. Furthermore, we observe in Figure 5.5(b) that Basil with α = 0 gives low performance, confirming that the non-IID data distribution degrades the convergence behavior. For UBAR, the performance is completely degraded: in UBAR, each node selects the set of models that achieve a lower loss than its own local model before using them in the update rule. Since the performance-based criterion is not meaningful in this setting, each node may end up with only its own model. Hence, the model of each node does not fully propagate over the graph, as demonstrated in Figure 5.5(b), where UBAR fails completely. This is different from the ring setting, where the model is propagated over the ring.
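The non-IID partitioning used above can be reproduced with a short NumPy sketch (the function name is ours):

```python
import numpy as np

def sort_and_partition(labels, num_nodes):
    # Sort sample indices by class label, then split into num_nodes
    # contiguous shards so each node holds data from (roughly) one class.
    order = np.argsort(labels, kind="stable")
    return np.array_split(order, num_nodes)

# e.g., CIFAR10 with N = 100 nodes: ten nodes end up holding each class.
```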
In Figure 5.5, we showed that UBAR performs quite poorly in the non-IID data setting when no data is shared among the clients. We note that achieving anonymity in data sharing in graph-based decentralized learning in general, and in UBAR in particular, is an open problem. Nevertheless, in Figure 5.6, we further show that even when 5% data sharing is done in UBAR, its performance remains quite low in comparison to Basil+ACDS.

[Figure 5.6: Illustrating the performance of Basil compared with UBAR for CIFAR10 under the non-IID data distribution setting with α = 5% data sharing. Panels: (a) No Attack, (b) Gaussian Attack, (c) Random Sign Flip Attack.]

Now, we compute the communication cost overhead due to leveraging ACDS for the experiments associated with Figure 5.5. Considering the setting discussed in Remark 5.5 for ACDS, with G = 4 groups for data sharing and each node sharing a fraction α = 5% of its local dataset, we can see from Figure 5.5 that Basil takes 500 rounds to reach ∼50% test accuracy. Hence, given that the model used in this section is of size 3.6 Mbits (117706 trainable parameters, each represented by 32 bits), the communication cost overhead resulting from using ACDS for data sharing is only 4%.

Further ablation studies: We perform ablation studies to show the effect of different parameters on the performance of Basil: the number of nodes N, the number of Byzantine nodes b, the connectivity parameter S, and the fraction of data sharing α. For the ablation studies corresponding to N, b, and S, we consider the IID setting described previously, while for α we consider the non-IID setting.

[Figure 5.7: Ablation studies. Here N, b, S, and α are the total number of nodes, the number of Byzantine nodes, the connectivity parameter, and the fraction of shared data. Panels: (a) varying N (b = 0.33N, S = 10), (b) varying b (N = 100, S = 10), (c) varying S (N = 100, b = 33), (d) Basil with ACDS in the non-IID setting (N = 100, b = 33, S = 10). For non-IID data, the Gaussian attack is considered, while for the others, the IID hidden attack is used; the neural networks are the same as in Section 5.8.9.1.]

Figure 5.7(a) demonstrates that, unlike UBAR, the performance of Basil scales with the number of nodes N. This is because, in any given round, the sequential training over the logical ring topology accumulates the SGD updates of the clients along the ring, as opposed to parallel training over the graph topology, in which an update from any given node only propagates to its neighbors. Hence, Basil achieves better accuracy than UBAR. Additionally, as described in Section 5.5, one can also leverage Basil+ to achieve further scalability for large N by parallelizing Basil. We provide the experimental evaluations corresponding to Basil+ in Section 5.6.2.
To study the effect of different numbers of Byzantine nodes in the system, we conduct experiments with different b. Figure 5.7(b) demonstrates that Basil is quite robust to different numbers of Byzantine nodes. Figure 5.7(c) demonstrates the impact of the connectivity parameter S. Interestingly, the convergence rate decreases as S increases. We posit that due to the noisy SGD-based training process, the closest benign model is not always selected, resulting in the loss of some intermediate update steps. However, decreasing S too much results in a large increase in the connectivity failure probability of Basil. For example, the upper bound on the failure probability when S = 6 is less than 0.09. However, for an extremely low value of S = 2, we observed consistent failure across all simulation trials, as also illustrated in Figure 5.7(c). Hence, a careful choice of S is important. Finally, to study the relationship between privacy and accuracy when Basil is used alongside ACDS, we carry out a numerical analysis by varying α in the non-IID setting described previously. Figure 5.7(d) demonstrates that as α is increased, i.e., as the amount of shared data is increased, the convergence rate increases as well. Furthermore, we emphasize that even α = 0.5% gives reasonable performance when the data is non-IID, unlike UBAR, which fails completely.

5.6.2 Numerical Experiments for Basil+

In this section, we demonstrate the achievable gains of Basil+ in terms of its scalability, Byzantine robustness, and superior performance over UBAR.

Schemes: We consider three schemes, as described next.
• Basil+: Our proposed scheme.
• R-plain+: We implement a parallel extension of R-plain. In particular, nodes are divided into G groups. Within each group, a sequential R-plain training process is carried out, wherein the current active node performs local training using the model received from its counterclockwise neighbor. After τ rounds of sequential R-plain training within each group, a circular aggregation is carried out along the G groups: the models from the last node in each group are averaged, and the average model is used by the first node in each group in the next global round. This entire process is repeated for K global rounds.
• UBAR: The graph-based Byzantine-robust baseline described in Appendix 5.8.8.

Setting: We use the CIFAR10 dataset [63] and the neural network with 2 convolutional layers and 3 fully connected layers described in Section 5.6. The training dataset is partitioned uniformly among the set $\mathcal{N}$ of all nodes, where $|\mathcal{N}| = 400$. We set the batch size to 80 for local training and performance evaluation in Basil+ as well as in UBAR. Furthermore, we consider epoch-based local training, where we set the number of epochs to 3. We use a decreasing learning rate of $0.03/(1 + 0.03k)$, where k denotes the global round. For all schemes, we report the average test accuracy among the benign clients. For Basil+, we set the connectivity parameter to S = 6 and the number of intra-group rounds to τ = 1. The implementation of UBAR is given in Section 5.8.9.2.

[Figure 5.8: The scalability gain of Basil+ in the presence of Byzantine nodes as the number of nodes increases. Here G is the number of groups, where each group has n = 25 nodes; the plot shows test accuracy versus round k for G = 1, 2, 4, 16.]
Results: To study how the three schemes perform as more nodes participate in the training process, we consider different numbers of participating nodes $\mathcal{N}_a \subseteq \mathcal{N}$, where $|\mathcal{N}_a| = N_a$. Furthermore, for the three schemes, we set the total number of Byzantine nodes to $\lfloor \beta N_a \rfloor$ with $\beta = 0.2$. Figure 5.8 shows the performance of Basil+ in the presence of the Gaussian attack as the number of groups, i.e., the number of participating nodes $N_a$, increases. Here, for all four scenarios in Figure 5.8, we fix the number of nodes in each group to n = 25. As we can see, Basil+ is able to mitigate the Byzantine behavior while achieving scalable performance as the number of nodes increases. In particular, Basil+ with G = 16 groups (i.e., $N_a = 400$ nodes) achieves a test accuracy which is higher by an absolute margin of 20% compared to the case of G = 1 (i.e., $N_a = 25$ nodes). Additionally, while Basil+ provides scalable model performance when the number of groups increases, the overall increase in training time scales gracefully thanks to the parallelization of training within groups. In particular, the key difference in training time between the two cases of G = 1 and G = 16 lies in the stages of robust aggregation and global multicast. To get further insight, we set $T_{\text{comm}}$, $T_{\text{perf-based}}$, and $T_{\text{SGD}}$ in equation 5.17 to one unit of time. Hence, using equation 5.17, one can show that the ratio of the training time for G = 16 with $N_a = 400$ nodes to the training time for G = 1 with $N_a = 25$ nodes is just 1.5. This further demonstrates the scalability of Basil+.

[Figure 5.9: Illustrating the performance gains of Basil+ over UBAR and R-plain+ for the CIFAR10 dataset under different numbers of active nodes. Panels: (a) $N_a = 25$, (b) $N_a = 50$, (c) $N_a = 100$, (d) $N_a = 400$; each plots test accuracy versus round k.]

In Figure 5.9, we compare the performance of the three schemes for different numbers of participating nodes $N_a$. In particular, in both Basil+ and R-plain+, we have $N_a = 25G$, where G denotes the number of groups, while for UBAR we consider a parallel graph with $N_a$ nodes. As can be observed from Figure 5.9, Basil+ is not only robust to the Byzantine nodes but also gives superior performance over UBAR in all four cases shown in Figure 5.9. The key reason Basil+ achieves higher performance than UBAR is that training in Basil+ includes sequential training over logical rings within groups, which outperforms graph-based decentralized training.

5.7 Conclusion and Future Directions

We propose Basil, a fast and computationally efficient Byzantine-robust algorithm for decentralized training over a logical ring. We provide theoretical convergence guarantees for Basil, demonstrating its linear convergence rate. Our experimental results in the IID setting show the superiority of Basil over the state-of-the-art algorithm for decentralized training. Furthermore, we generalize Basil to the non-IID setting by integrating it with our proposed Anonymous Cyclic Data Sharing (ACDS) scheme. Finally, we propose Basil+, which enables a parallel implementation of Basil, achieving further scalability.
One interesting future direction is to explore techniques such as data compression, or data placement and coded shuffling, to reduce the communication cost of ACDS. Additionally, it would be interesting to see how differential privacy (DP) methods could be adopted by adding noise to the data shared in ACDS to provide further privacy, while studying the impact of the added noise on the overall training performance.

5.8 Appendix

In the following, we summarize the content of the appendix.
• In Appendix 5.8.1, we prove Propositions 5.1 and 5.2.
• In Appendix 5.8.2, we prove the convergence guarantees of Basil.
• In Appendix 5.8.4, we describe how Basil can be made robust to node dropouts.
• In Appendix 5.8.5, Appendix 5.8.6, and Appendix 5.8.7, the proofs of Propositions 5.3, 5.4, and 5.5 are presented.
• In Appendix 5.8.8, we present UBAR [45], the recent Byzantine-robust decentralized algorithm.
• In Appendix 5.8.9, we provide additional experiments.

5.8.1 Proof of Proposition 5.1 and Proposition 5.2

Proposition 5.1. The communication, computation, and storage complexities of the Basil algorithm are all O(Sd) for each node in each iteration, where d is the model size.

Proof. Each node receives and stores the latest S models, calculates the loss using each of the S stored models, and multicasts its updated model to the next S clockwise neighbors. This results in O(Sd) communication, computation, and storage costs. □

Proposition 5.2. The number of models that each benign node needs to receive, store, and evaluate from its counterclockwise neighbors for ensuring the connectivity and success of Basil can be relaxed to S < b + 1 while guaranteeing the success of Basil (benign subgraph connectivity) with high probability.

Proof. This can be proven by showing that the benign subgraph, which is generated by removing the Byzantine nodes, is connected with high probability when each node multicasts its updated model to the next S < b + 1 clockwise neighbors instead of b + 1 neighbors. Connectivity of the benign subgraph is important as it ensures that each benign node can still receive information from a few other non-faulty nodes. Hence, by letting each node store and evaluate the latest S model updates, each benign node has the chance to select one of the benign updates. More formally, when each node multicasts its model to the next S clockwise neighbors, we define $A_j$ to be the failure event in which S Byzantine nodes come in a row, where j is the starting node of these S nodes. When $A_j$ occurs, there is at least one pair of benign nodes with no link between them. The probability of $A_j$ is given by

$P(A_j) = \prod_{i=0}^{S-1} \frac{b-i}{N-i} = \frac{b!\,(N-S)!}{(b-S)!\,N!}$,  (5.20)

where the second equality follows from the definition of the factorial, and b and N are the number of Byzantine nodes and the total number of nodes in the system, respectively. Thus, the probability of having a disconnected benign subgraph in Basil, i.e., of S Byzantine nodes coming in a row, is

$P(\text{Failure}) = P\big(\bigcup_{j=1}^{N} A_j\big) \overset{(a)}{\le} \sum_{j=1}^{N} P(A_j) \overset{(b)}{=} \frac{b!\,(N-S)!}{(b-S)!\,(N-1)!}$,  (5.21)

where (a) follows from the union bound and (b) follows from equation 5.20. □
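For intuition about how small this bound is, here is a direct evaluation (our own snippet) of equation 5.21:

```python
from math import factorial

def basil_failure_bound(N, b, S):
    # Equation 5.21: union bound on the probability that S Byzantine nodes
    # occupy consecutive positions somewhere on the ring of N nodes.
    if b < S:
        return 0.0  # fewer than S Byzantine nodes can never fill S slots
    return factorial(b) * factorial(N - S) / (factorial(b - S) * factorial(N - 1))

# The setting of Section 5.6: N = 100 nodes, b = 33 Byzantine, S = 10.
print(basil_failure_bound(100, 33, 10))
```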
5.8.2 Convergence Analysis

Let $b^i$ be the number of Byzantine nodes among the S counterclockwise neighbors of node i. We divide the set of stored models $\mathcal{N}^i_k$ at node i in the k-th round into two sets. The first set $\mathcal{G}^i_k = \{y_1, \dots, y_{r^i}\}$ contains the benign models, where $r^i = S - b^i$. We consider scenarios with $S = b + 1$, where b is the total number of Byzantine nodes in the network. Without loss of generality, we assume the models in this set are arranged such that the first model is from the closest benign counterclockwise neighbor of node i, while the last model is from the farthest one. Similarly, we define the second set $\mathcal{B}^i_k$ to be the set of models from the counterclockwise Byzantine neighbors of node i, such that $\mathcal{B}^i_k \cup \mathcal{G}^i_k = \mathcal{N}^i_k$. The general update rule in Basil is given as follows. At the k-th round, the current active node i updates the global model according to

$x^{(i)}_k = \bar{x}^{(i)}_k - \eta^{(i)}_k g_i(\bar{x}^{(i)}_k)$,  (5.22)

where $\bar{x}^{(i)}_k$ is given by

$\bar{x}^{(i)}_k = \arg\min_{y \in \mathcal{N}^i_k} \mathbb{E}\left[ l_i(y, \zeta_i) \right]$.  (5.23)

5.8.2.1 Proof of Theorem 5.1

We first show that if node i applies the performance-based criterion in equation 5.23, selects the model $y_1 \in \mathcal{G}^i_k$, and updates its model as

$x^{(i)}_k = y_1 - \eta^{(i)}_k g_i(y_1)$,  (5.24)

then we have

$\mathbb{E}\big[\ell_{i+1}(x^{(i)}_k)\big] \le \mathbb{E}\left[\ell_{i+1}(y_1)\right]$,  (5.25)

where $\ell_{i+1}(y_1) = \ell_{i+1}(y_1, \zeta_{i+1})$ is the loss function of node i+1 evaluated on a random sample $\zeta_{i+1}$ using the model $y_1$. The proof of equation 5.25 is as follows. By Taylor's theorem, there exists a $\gamma$ such that

$\ell_{i+1}(x^{(i)}_k) = \ell_{i+1}\big(y_1 - \eta^{(i)}_k g_i(y_1)\big)$  (5.26)
$\quad = \ell_{i+1}(y_1) - \eta^{(i)}_k g_i(y_1)^T g_{i+1}(y_1) + \frac{1}{2}\, \eta^{(i)}_k g_i(y_1)^T \nabla^2 \ell_{i+1}(\gamma)\, \eta^{(i)}_k g_i(y_1)$,  (5.27)

where $\nabla^2 \ell_{i+1}$ is the stochastic Hessian matrix. Using the assumption

$\|\nabla^2 \ell_{i+1}(x)\|_2 \le L$  (5.28)

for all random samples $\zeta_{i+1}$ and any model $x \in \mathbb{R}^d$, where L is the Lipschitz constant, we get

$\ell_{i+1}(x^{(i)}_k) \le \ell_{i+1}(y_1) - \eta^{(i)}_k g_i(y_1)^T g_{i+1}(y_1) + \frac{(\eta^{(i)}_k)^2 L}{2} \|g_i(y_1)\|^2$.  (5.29)

Taking the expectation of both sides of this expression (where the expectation is taken over the randomness in the sample selection), we get

$\mathbb{E}\big[\ell_{i+1}(x^{(i)}_k)\big] \le \mathbb{E}[\ell_{i+1}(y_1)] - \eta^{(i)}_k \mathbb{E}\big[g_i(y_1)^T g_{i+1}(y_1)\big] + \frac{(\eta^{(i)}_k)^2 L}{2} \mathbb{E}\|g_i(y_1)\|^2$
$\overset{(a)}{=} \mathbb{E}[\ell_{i+1}(y_1)] - \eta^{(i)}_k \mathbb{E}[g_i(y_1)]^T \mathbb{E}[g_{i+1}(y_1)] + \frac{(\eta^{(i)}_k)^2 L}{2} \mathbb{E}\|g_i(y_1)\|^2$
$\overset{(b)}{\le} \mathbb{E}[\ell_{i+1}(y_1)] - \eta^{(i)}_k \|\nabla f(y_1)\|^2 + \frac{(\eta^{(i)}_k)^2 L}{2} \|\nabla f(y_1)\|^2 + \frac{(\eta^{(i)}_k)^2 L}{2} \sigma^2$
$= \mathbb{E}[\ell_{i+1}(y_1)] - \|\nabla f(y_1)\|^2 \left( \eta^{(i)}_k - \frac{(\eta^{(i)}_k)^2 L}{2} \right) + \frac{(\eta^{(i)}_k)^2 L}{2} \sigma^2$,  (5.30)

where (a) follows since the samples are drawn from independent data distributions, and (b) follows from Assumption 5.1 together with

$\mathbb{E}\|g_i(y_1)\|^2 = \mathbb{E}\|g_i(y_1) - \mathbb{E}[g_i(y_1)]\|^2 + \|\mathbb{E}[g_i(y_1)]\|^2 \le \sigma^2 + \|\nabla f(y_1)\|^2$.  (5.31)

When each node uses a batch of size B, the variance of the stochastic gradient becomes $\frac{\sigma^2}{B}$. Therefore, assuming each node uses a batch size large enough that Assumption 5.4 holds, and setting the learning rate $\eta^{(i)}_k \le \frac{1}{L}$, we get

$\mathbb{E}\big[l_{i+1}(x^{(i)}_k)\big] \le \mathbb{E}\left[l_{i+1}(y_1)\right]$.  (5.32)

This completes the first part of the proof.
By using equation 5.32, it can be seen by induction that the update rule in equation 5.22 reduces to the case where each node updates its model based on the model received from its closest benign counterclockwise neighbor, as in equation 5.24. Consider, for example, a ring with N nodes and S = 3, ignoring the Byzantine nodes for the moment (i.e., assume all nodes are benign), and consider the first round k = 1. With a slight abuse of notation, the updated model of node 1 (the first node in the ring) is $x_1 = h(x_0)$, a function of the initial model $x_0$. Node 2 then has to select one model from the set of two models $\mathcal{N}^2_k = \{x_1 = h(x_0), x_0\}$. The selection is performed by evaluating the expected loss of node 2, using the criterion in equation 5.23, on the models in $\mathcal{N}^2_k$. According to equation 5.32, node 2 selects the model $x_1$, which yields the lower expected loss, and updates its model based on $x_1$, i.e., $x_2 = h(x_1)$. Next, node 3 applies the aggregation rule in equation 5.23 to select one model from the set $\mathcal{N}^3_k = \{x_2 = h(x_1), x_1 = h(x_0), x_0\}$. By equation 5.32 and Assumption 5.1, we get

$\mathbb{E}[l_3(x_2)] \le \mathbb{E}[l_3(x_1)] \le \mathbb{E}[l_3(x_0)]$,  (5.33)

and node 3 updates its model based on $x_2$, i.e., $x_3 = h(x_2)$. More generally, the set of stored benign models at node i is given by $\mathcal{N}^i_k = \{y_1 = h(y_2), y_2 = h(y_3), \dots, y_{r^i - 1} = h(y_{r^i})\}$, where $r^i$ is the number of benign models in the set $\mathcal{N}^i_k$. According to equation 5.32, we have

$\mathbb{E}[l_i(y_1)] \le \mathbb{E}[l_i(y_2)] \le \dots \le \mathbb{E}[l_i(y_{r^i})] \le \mathbb{E}[l_i(x)] \quad \forall x \in \mathcal{B}^i_k$,  (5.34)

where the last inequality in equation 5.34 follows from the fact that the Byzantine nodes send faulty models, whose expected loss is expected to be higher than the expected loss of the benign nodes.

Given this discussion, and since equation 5.34 filters out the Byzantine models, we can restrict attention to the benign subgraph obtained by removing the Byzantine nodes, per the discussion in Section 5.3-A in the main chapter. Note that by letting each active node send its updated model to the next b+1 nodes, where b is the total number of Byzantine nodes, the benign subgraph is always connected. Considering the benign subgraph (the logical ring without Byzantine nodes), we assume without loss of generality that the indices of the benign nodes in the ring are arranged in ascending order from node 1 to node r. In this benign subgraph, the update rule is

$x^{(i)}_k = x^{(i-1)}_k - \eta^{(i)}_k g_i(x^{(i-1)}_k)$.  (5.35)

5.8.3 Proof of Theorem 5.2

By Taylor's theorem, there exists a $\gamma$ such that

$f(x^{(i+1)}_k) \overset{(a)}{=} f\big(x^{(i)}_k - \eta^{(i+1)}_k g_{i+1}(x^{(i)}_k)\big)$
$= f(x^{(i)}_k) - \eta^{(i+1)}_k g_{i+1}(x^{(i)}_k)^T \nabla f(x^{(i)}_k) + \frac{1}{2}\, \eta^{(i+1)}_k g_{i+1}(x^{(i)}_k)^T \nabla^2 f(\gamma)\, \eta^{(i+1)}_k g_{i+1}(x^{(i)}_k)$
$\overset{(b)}{\le} f(x^{(i)}_k) - \eta^{(i+1)}_k g_{i+1}(x^{(i)}_k)^T \nabla f(x^{(i)}_k) + \frac{L}{2}\, \big(\eta^{(i+1)}_k\big)^2 \|g_{i+1}(x^{(i)}_k)\|^2$,  (5.36)

where (a) follows from the update rule in equation 5.35, f is the global loss function in equation 5.1, and (b) follows from Assumption 5.3, where $\|\nabla^2 f(\gamma)\| \le L$. We take $g_{i+1}(\cdot)$ to be the stochastic gradient computed on a mini-batch of size B. Given the model $x^{(i)}_k$, we take the expectation over the randomness in the selection of the samples used to obtain $x^{(i+1)}_k$. We recall that each sample $\zeta_{i+1}$ in the sampled batch is drawn according to the distribution $\mathcal{P}_{i+1}$ and is independent of the model $x^{(i)}_k$. Therefore, we get the following set of inequalities:

$\mathbb{E}\big[f(x^{(i+1)}_k)\big] \le f(x^{(i)}_k) - \eta^{(i)}_k \mathbb{E}\big[g_{i+1}(x^{(i)}_k)\big]^T \nabla f(x^{(i)}_k) + \frac{(\eta^{(i)}_k)^2 L}{2} \mathbb{E}\|g_{i+1}(x^{(i)}_k)\|^2$
$\overset{(a)}{\le} f(x^{(i)}_k) - \eta^{(i)}_k \|\nabla f(x^{(i)}_k)\|^2 + \frac{(\eta^{(i)}_k)^2 L}{2} \|\nabla f(x^{(i)}_k)\|^2 + \frac{(\eta^{(i)}_k)^2 L}{2} \sigma'^2$
$= f(x^{(i)}_k) - \|\nabla f(x^{(i)}_k)\|^2 \left( \eta^{(i)}_k - \frac{(\eta^{(i)}_k)^2 L}{2} \right) + \frac{(\eta^{(i)}_k)^2 L}{2} \sigma'^2$
$\overset{(b)}{\le} f(x^{(i)}_k) - \frac{\eta^{(i)}_k}{2} \|\nabla f(x^{(i)}_k)\|^2 + \frac{\eta^{(i)}_k}{2} \sigma'^2$,  (5.37)

where (a) follows from equation 5.31, and (b) follows by selecting $\eta^{(i)}_k \le \frac{1}{L}$. Here $\sigma'^2 = \frac{\sigma^2}{B}$ is the variance when using a batch of size B.
By the convexity of the loss function f, equation 5.37 yields

$\mathbb{E}\big[f(x^{(i+1)}_k)\big] \le f(x^*) + \big\langle \nabla f(x^{(i)}_k),\, x^{(i)}_k - x^* \big\rangle - \frac{\eta^{(i)}_k}{2} \|\nabla f(x^{(i)}_k)\|^2 + \frac{\eta^{(i)}_k}{2} \sigma'^2$.  (5.38)

We now back-substitute $g_{i+1}(x^{(i)}_k)$ into equation 5.38 using $\mathbb{E}[g_{i+1}(x^{(i)}_k)] = \nabla f(x^{(i)}_k)$ and $\|\nabla f(x^{(i)}_k)\|^2 \ge \mathbb{E}\|g_{i+1}(x^{(i)}_k)\|^2 - \sigma'^2$:

$\mathbb{E}\big[f(x^{(i+1)}_k)\big] \le f(x^*) + \big\langle \mathbb{E}[g_{i+1}(x^{(i)}_k)],\, x^{(i)}_k - x^* \big\rangle - \frac{\eta^{(i)}_k}{2} \mathbb{E}\|g_{i+1}(x^{(i)}_k)\|^2 + \eta^{(i)}_k \sigma'^2$
$= f(x^*) + \mathbb{E}\Big[ \big\langle g_{i+1}(x^{(i)}_k),\, x^{(i)}_k - x^* \big\rangle - \frac{\eta^{(i)}_k}{2} \|g_{i+1}(x^{(i)}_k)\|^2 \Big] + \eta^{(i)}_k \sigma'^2$.  (5.39)

Completing the square in the middle two terms gives

$\mathbb{E}\big[f(x^{(i+1)}_k)\big] \le f(x^*) + \mathbb{E}\left[ \frac{1}{2\eta^{(i)}_k} \Big( \|x^{(i)}_k - x^*\|^2 - \|x^{(i)}_k - x^* - \eta^{(i)}_k g_{i+1}(x^{(i)}_k)\|^2 \Big) \right] + \eta^{(i)}_k \sigma'^2$
$= f(x^*) + \mathbb{E}\left[ \frac{1}{2\eta^{(i)}_k} \Big( \|x^{(i)}_k - x^*\|^2 - \|x^{(i+1)}_k - x^*\|^2 \Big) \right] + \eta^{(i)}_k \sigma'^2$.  (5.40)

For K rounds and r benign nodes, the total number of SGD steps is T = Kr. Let $s = kr + i$ index the updates to the initial model $x^0$, where $i = 1, \dots, r$ and $k = 0, \dots, K-1$, so that $x^{(i)}_k$ can be written as $x^s$. With this notation, we can take the expectation in the above expression over the entire sampling process during training, sum the resulting inequalities for $s = 0, \dots, T-1$, and take $\eta = \frac{1}{L}$, obtaining

$\sum_{s=0}^{T-1} \mathbb{E}[f(x^{s+1})] - f(x^*) \le \frac{L}{2}\Big( \|x^0 - x^*\|^2 - \mathbb{E}\big[\|x^{(T)} - x^*\|^2\big] \Big) + \frac{1}{L}\, T \sigma'^2$.  (5.41)

By the convexity of $f(\cdot)$, we get

$\mathbb{E}\left[ f\Big( \frac{1}{T} \sum_{s=1}^{T} x^s \Big) \right] - f(x^*) \le \frac{1}{T} \sum_{s=0}^{T-1} \mathbb{E}\big[ f(x^{s+1}) - f(x^*) \big] \le \frac{\|x^0 - x^*\|^2 L}{2T} + \frac{1}{L} \sigma'^2$.  (5.42)

5.8.4 Joining and Leaving of Nodes

Basil can handle the scenarios of 1) node dropouts among the N available nodes, and 2) nodes rejoining the system.

5.8.4.1 Node Dropouts

For handling node dropouts, we allow for extra communication between nodes. In particular, each active node multicasts its model to its S = b+d+1 clockwise neighbors, where b and d are, respectively, the number of Byzantine nodes and the worst-case number of dropped nodes, and each node stores only the latest b+1 model updates. By doing so, each benign node will have at least one benign update even in the worst case, where all Byzantine nodes appear in a row and d (out of S) counterclockwise nodes drop out.

5.8.4.2 Nodes Rejoining

To address a node rejoining the system, the rejoining node can re-multicast its ID to all other nodes. Since benign nodes know the correct order of the node IDs in the ring according to Section 5.3.1, each active node among the L = b+d+1 counterclockwise neighbors of the rejoining node sends its model to it, and the rejoining node stores the latest b+1 models. We note that handling the participation of fresh new nodes during training is out of the scope of this work, as we consider mitigating Byzantine nodes in decentralized training with a fixed number N of nodes.
5.8.5 Proof of Proposition 5.3

We first prove the communication cost given in Proposition 5.3, which corresponds to node $1_g$ for $g \in [G]$. We recall from Section 5.4 that in ACDS each node $i \in \mathcal{N}$ has H batches, each of $\frac{\alpha D}{H}$ data points. Furthermore, for each group $g \in [G]$, the anonymous cyclic data sharing phase (phase 2) consists of H+1 rounds. The communication cost of node $1_g$ in the first round is $\frac{\alpha D}{H} I$ bits, where $\frac{\alpha D}{H}$ is the size of one batch and I is the size of one data point in bits. The cost of each round $h \in [2, H+1]$ is $n \frac{\alpha D}{H} I$ bits, since node $1_g$ sends the set of shuffled data from the n batches $\{c^h_{1_g}, c^{h-1}_{2_g}, \dots, c^{h-1}_{n_g}\}$ to node $2_g$. Hence, the total communication cost of node $1_g$ in this phase is $C^{\text{phase-2}}_{\text{ACDS}} = \alpha D I \big( \frac{1}{H} + n \big)$. In phase 3, node $1_g$ multicasts its set of shuffled data from the batches $\{c^h_{1_g}, \dots, c^h_{n_g}\}_{h \in [H]}$ to all nodes in the other groups at a cost of $n \alpha D I$ bits. Finally, node $1_g$ receives G−1 sets of batches $\{c^h_{1_{g'}}, \dots, c^h_{n_{g'}}\}_{h \in [H],\, g' \in [G]\setminus\{g\}}$ at a cost of $(G-1) n \alpha D I$ bits. Hence, the communication cost of the third phase of ACDS is $C^{\text{phase-3}}_{\text{ACDS}} = \alpha D n G I$. Adding the costs of phase 2 and phase 3 gives the first result in Proposition 5.3.

Now, we prove the communication time of ACDS by first computing the time needed to complete the anonymous data sharing phase (phase 2), and then the time for the multicasting phase. The second phase of ACDS consists of H+1 rounds. The communication time of the first round is $T_{R_1} = \sum_{i=1}^{n} i\,T$, where n is the number of nodes in each group. Here, $T = \frac{\alpha D I}{H R}$ is the time needed to send one batch of $\frac{\alpha D}{H}$ data points, R is the communication bandwidth in b/s, and I is the size of one data point in bits. The time for each round $h \in [2, H]$ is $T_{R_h} = n^2 T$, since each node sends n batches. Finally, the time for completing the dummy round, i.e., the (H+1)-th round, is $T_{R_{H+1}} = n(n-2)T$, since only the first n−2 nodes in the ring participate in this round, as discussed in Section 5.4. Therefore, since this phase happens in parallel across all G groups, the total time for completing the anonymous cyclic data sharing phase (phase 2 of ACDS) is $T_{\text{phase-2}} = T_{R_1} + (H-1)T_{R_h} + T_{R_{H+1}} = T\big( n^2(H+0.5) - 1.5n \big)$. The time for completing the multicasting phase is $T_{\text{phase-3}} = (G-1) n H T$, since each node in group g receives nH batches from node $1_{g'}$ of each group $g' \in [G]\setminus\{g\}$. Adding $T_{\text{phase-2}}$ and $T_{\text{phase-3}}$ gives the communication time of ACDS in Proposition 5.3.
5.8.6 Proof of Proposition 5.4

We recall from Section 5.3 that the per-round training time of Basil consists of four parts. In particular, each active node (1) receives the models from its S counterclockwise neighbors; (2) evaluates the S models using the Basil aggregation rule; (3) updates the model by taking one step of SGD; and (4) multicasts the model to its next S clockwise neighbors. Assuming training begins at time 0, we define $E^{(k)}_i$ to be the wall-clock time at which node i finishes training in round k. We also define $T_{\text{com}} = \frac{32d}{R}$ to be the time to receive one model of d elements, each of 32 bits; each node receives only one new model at each step in the ring, as the training is sequential. Furthermore, we let $T_{\text{comp}} = T_{\text{perf-based}} + T_{\text{SGD}}$ be the time needed to evaluate S models and perform one step of SGD.

We assume that each node $i \in \mathcal{N}$ becomes active and starts evaluating the S models (using equation 5.3) and taking the SGD model update step (using equation 5.2) only when it receives the model from its counterclockwise neighbor. Therefore, for the first round, we have the following time recursion:

$E^{(1)}_1 = T_{\text{SGD}}$  (5.43)
$E^{(1)}_n = E^{(1)}_{n-1} + T_{\text{com}} + (n-1)\,T_{\text{perf-based}} + T_{\text{SGD}}$ for $n \in [2, S]$  (5.44)
$E^{(1)}_n = E^{(1)}_{n-1} + T_{\text{com}} + T_{\text{comp}}$ for $n \in [S+1, N]$,  (5.45)

where equation 5.43 follows since node 1 just takes one step of model update using the initial model $x^0$. Each node $i \in [2, S]$ receives the model from its counterclockwise neighbor, evaluates the i−1 received models, and then takes one step of model update. Each node $i \in [S+1, N]$ has S models to evaluate, and the time recursion follows equation 5.45. The time recursions for the remaining rounds, where the training is assumed to run for τ rounds, are

$E^{(k+1)}_1 = E^{(k)}_N + T_{\text{com}} + T_{\text{comp}}$ for $k \in [\tau - 1]$  (5.46)
$E^{(k)}_n = E^{(k)}_{n-1} + T_{\text{com}} + T_{\text{comp}}$ for $n \in [N]\setminus\{1\}$, $k \in [\tau]$  (5.47)
$E^{(\tau+1)}_1 = E^{(\tau)}_N + T_{\text{com}} + S\,T_{\text{perf-based}}$.  (5.48)

By telescoping equations 5.43–5.48, we get the upper bound in equation 5.16.

The training time of Basil+ in equation 5.17 can be proven by computing the time of each stage of the algorithm. In the within-group training stage, all groups in parallel apply the Basil algorithm, where the training is carried out sequentially; this results in a training time of $T_{\text{train}} \le n\tau\,T_{\text{perf-based}} + n\tau\,T_{\text{comm}} + n\tau\,T_{\text{SGD}}$. The time of the robust circular aggregation stage is $T_{\text{agg}} = G\,T_{\text{perf-based}} + SG\,T_{\text{comm}}$. Here, the $T_{\text{perf-based}}$ in the first term comes from the fact that each node in the set $\mathcal{S}_g$ in parallel evaluates the S models received from the nodes in $\mathcal{S}_{g-1}$, the second term comes from the fact that each node in $\mathcal{S}_g$ receives S models from the nodes in $\mathcal{S}_{g-1}$, and the factor G results from the sequential aggregation over the G groups. The time of the final (multicasting) stage is $T_{\text{multicast}} = T_{\text{perf-based}} + (G-1)\,T_{\text{comm}}$, where the first term follows from the fact that all nodes in the sets $\{\mathcal{U}_g\}_{g \in [G]}$ evaluate the S robust average models in parallel, while the second term follows from the time needed for each corresponding node in the remaining groups to receive the S robust average models. Combining the times of the three stages gives the training time in equation 5.17.

5.8.7 Proof of Proposition 5.5

At a high level, Basil+ fails if at least one of the G groups fails, i.e., if the set of models $\mathcal{L}_g$ sent from the set $\mathcal{S}_g$ of some group g to group g+1 are all faulty. According to the discussion in the proof of Proposition 5.2, group g fails when S Byzantine nodes come in a row. We now formally derive the failure probability of Basil+. We start by defining the failure event of Basil+ as

$F = \bigcup_{g=1}^{G} F_g$,  (5.49)

where $F_g$ is the failure event of group g and G is the number of groups. By the law of total probability, the failure probability of group g is

$P(F_g) = \sum_{i=0}^{\min(b,n)} P(F_g \mid b_g = i)\, P(b_g = i)$,  (5.50)

where $b_g$ is the number of Byzantine nodes in group g. The conditional probability in equation 5.50 represents the failure probability of group g given that i nodes in that group are Byzantine. This conditional group failure probability can be derived similarly to the failure probability in equation 5.21 in Proposition 5.2. In particular, the conditional probability is bounded by

$P(F_g \mid b_g = i) \le \sum_{j=1}^{n} P(A_j \mid b_g = i)$,  (5.51)

where $A_j$ is the failure event in which S Byzantine nodes come in a row, with j the starting node of these S nodes, given that there are i Byzantine nodes in that group. The probability $P(A_j \mid b_g = i)$ is given by

$P(A_j \mid b_g = i) = \prod_{s=0}^{S-1} \frac{\max(i-s,\, 0)}{N-s}$,  (5.52)

where i is the number of Byzantine nodes in group g and S is the connectivity parameter in that group. Combining equation 5.52 with equation 5.51, the conditional probability in the first term of equation 5.50 is bounded as

$P(F_g \mid b_g = i) \le \sum_{j=1}^{n} P(A_j \mid b_g = i) = \left( \prod_{s=0}^{S-1} \frac{\max(i-s,\, 0)}{N-s} \right) n$.  (5.53)
The probability $P(b_g = i)$ in the second term of equation 5.50 follows a hypergeometric distribution with parameters $(N, b, n)$, where N, b, and n are the total number of nodes, the total number of Byzantine nodes, and the number of nodes in each group, respectively. This probability is given by

$P(b_g = i) = \frac{\binom{b}{i}\binom{N-b}{n-i}}{\binom{N}{n}}$.  (5.54)

Substituting equation 5.54 and equation 5.53 into equation 5.50, the failure probability of one group in Basil+ is bounded as

$P(F_g) \le \sum_{i=0}^{\min(b,n)} \left( n \prod_{s=0}^{S-1} \frac{\max(i-s,\, 0)}{N-s} \right) \frac{\binom{b}{i}\binom{N-b}{n-i}}{\binom{N}{n}}$.  (5.55)

Finally, the failure probability of Basil+ is bounded as

$P(F) = P\big(\bigcup_{g=1}^{G} F_g\big) \overset{(a)}{\le} \sum_{g=1}^{G} P(F_g) = G \sum_{i=0}^{\min(b,n)} \left( n \prod_{s=0}^{S-1} \frac{\max(i-s,\, 0)}{N-s} \right) \frac{\binom{b}{i}\binom{N-b}{n-i}}{\binom{N}{n}}$,  (5.56)

where (a) follows from the union bound. □

5.8.8 UBAR

In this section, we describe UBAR [45], the state-of-the-art Byzantine-resilient approach for parallel decentralized training.

5.8.8.1 Algorithm

The decentralized training setup is defined over an undirected graph $G = (V, E)$, where V denotes the set of N nodes and E denotes the set of edges representing communication links. Filtering Byzantine nodes is done in two stages in each training iteration. In the first stage, each benign node performs a distance-based strategy to select a candidate pool of potentially benign nodes from its neighbors; this selection is performed by comparing the Euclidean distance of its own model to the models of its neighbors. In the second stage, each benign node performs a performance-based strategy to pick the final nodes from the candidate pool produced by stage 1: it reuses a training sample as validation data to compute the loss value of each model, selects the models whose loss values are smaller than that of its own model, and calculates the average of those models as the final updated value. Formally, the update rule in UBAR is given by

$x^{(i)}_{k+1} = \alpha x^{(i)}_k + (1-\alpha)\,\mathcal{R}_{\text{UBAR}}\big(x^{(j)}_k,\, j \in \mathcal{N}_i\big) - \eta \nabla f_i(x^{(i)}_k)$,  (5.57)

where $\mathcal{N}_i$ is the set of neighbors of node i, $\nabla f_i(x^{(i)}_k)$ is the local gradient of node i evaluated on a random sample from its local dataset using its own model, k is the training round, and $\mathcal{R}_{\text{UBAR}}$ is given by

$\mathcal{R}_{\text{UBAR}} = \begin{cases} \dfrac{1}{|\mathcal{N}^r_{i,k}|} \sum_{j \in \mathcal{N}^r_{i,k}} x^{(j)}_k & \text{if } \mathcal{N}^r_{i,k} \ne \emptyset \\[4pt] x^{(j^*)}_k & \text{otherwise}, \end{cases}$  (5.58)

where the two stages of filtering are

(1) $\mathcal{N}^s_{i,k} = \arg\min_{\mathcal{N}^* \subset \mathcal{N}_i,\, |\mathcal{N}^*| = \rho_i |\mathcal{N}_i|} \sum_{j \in \mathcal{N}^*} \|x^{(j)}_k - x^{(i)}_k\|$,
(2) $\mathcal{N}^r_{i,k} = \big\{ j \in \mathcal{N}^s_{i,k} : \ell_i(x^{(j)}_k) \le \ell_i(x^{(i)}_k) \big\}$, and $j^* = \arg\min_{j \in \mathcal{N}^s_{i,k}} \ell_i(x^{(j)}_k)$.
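A condensed Python sketch of the two-stage filter $\mathcal{R}_{\text{UBAR}}$ in equation 5.58 is given below. It is our own reading of the rule (not the authors' released code), with models represented as name-to-tensor dicts and rho the candidate-pool ratio:

```python
import torch

def r_ubar(own_state, own_loss, neighbor_states, rho, net, loss_fn, eval_batch):
    flatten = lambda s: torch.cat([p.flatten() for p in s.values()])
    # Stage 1 (distance-based): candidate pool of the rho-fraction of
    # neighbors whose models are closest to our own in Euclidean distance.
    ranked = sorted(neighbor_states,
                    key=lambda s: torch.norm(flatten(s) - flatten(own_state)).item())
    pool = ranked[:max(1, int(rho * len(neighbor_states)))]
    # Stage 2 (performance-based): keep pool models whose loss on a local
    # batch is no worse than our own; average them if any survive.
    x, y = eval_batch
    losses = []
    for s in pool:
        net.load_state_dict(s)
        with torch.no_grad():
            losses.append(loss_fn(net(x), y).item())
    good = [s for s, l in zip(pool, losses) if l <= own_loss]
    if good:
        return {k: sum(s[k] for s in good) / len(good) for k in own_state}
    return pool[min(range(len(pool)), key=losses.__getitem__)]  # best-loss model
```

The filtered model is then blended with the node's own model and a local gradient step according to equation 5.57.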
5.8.8.2 Time Analysis for UBAR

The training time of UBAR is divided into two parts: computation time and communication time. We start with the communication time. For modeling it, we assume that each node can, in parallel, multicast its model to its neighbors and receive the models from its neighbor nodes, where each node is assumed to be connected to S neighbors. The time to multicast a model is $\frac{32d}{R}$, where d is the model size, each element of the model is represented by 32 bits, and R is the communication bandwidth in b/s. On the other hand, the time to receive S different models from the S neighbor nodes is $\frac{32Sd}{R}$. We assume that in UBAR each node starts the model evaluations and the model update only after receiving all S models (the same assumption is used when computing the training time for Basil). Therefore, given that each node starts the training procedure only when it has received the S models, while all communications happen in parallel over the graph, the communication time of one parallel round of UBAR is

$T_{\text{UBAR-communication}} = \frac{32Sd}{R}$.  (5.59)

The computation time of UBAR is

$T_{\text{UBAR-computation}} = T_{\text{dist-based}} + T_{\text{perf-based}} + T_{\text{agg}} + T_{\text{SGD}}$,  (5.60)

where $T_{\text{dist-based}}$, $T_{\text{perf-based}}$, $T_{\text{agg}}$, and $T_{\text{SGD}}$ are, respectively, the times to apply the distance-based strategy, the performance-based strategy, the aggregation, and one step of SGD model update. Hence, the total training time when using UBAR for K training rounds is

$T_{\text{UBAR}} = K\big( T_{\text{dist-based}} + T_{\text{perf-based}} + T_{\text{agg}} + T_{\text{SGD}} + S\,T_{\text{comm}} \big)$,  (5.61)

where $T_{\text{comm}} = \frac{32d}{R}$.

5.8.9 Setting

In this section, we provide the details of the neural networks used in our experiments, key details of the UBAR implementation, and multiple additional experiments that further demonstrate the superiority of our proposed algorithms. We start in Section 5.8.9.1 by describing the models used in our experiments in Section 5.6 and in the additional experiments given in this section. In Section 5.8.9.2, we discuss the implementation of UBAR. After that, we run additional experiments using the MNIST dataset [70] in Section 5.8.9.3. In Section 5.8.9.4, we study the computation time of Basil compared to UBAR, and the training performance of Basil and UBAR with respect to the wall-clock training time. Finally, we study the performance of Basil and ACDS on the CIFAR100 dataset with a non-IID data distribution in Section 5.8.9.5, and compare Basil with Basil+ in Section 5.8.9.6.

5.8.9.1 Models

We provide the details of the neural network architectures used in our experiments. For MNIST, we use a model with three fully connected layers; the details are provided in Table 5.1. Each of the first two fully connected layers is followed by ReLU, while softmax is used at the output of the third fully connected layer.

Table 5.1: Details of the parameters in the architecture of the neural network used in our MNIST experiments.

Parameter   Shape
fc1         784 × 100
fc2         100 × 100
fc3         100 × 10

For the CIFAR10 experiments in the main chapter, we consider a neural network with two convolutional layers and three fully connected layers; the details of these layers are provided in Table 5.2. ReLU and max-pooling are applied after the convolutional layers. The first maxpool has a kernel size of 3 × 3 and a stride of 3 (the stride value follows from the input size of fc1 in Table 5.2), and the second maxpool has a kernel size of 4 × 4 and a stride of 4. Each of the first two fully connected layers is followed by ReLU, while softmax is used at the output of the third fully connected layer. We initialize all biases to 0. Furthermore, for the weights in the convolutional layers we use the Glorot uniform initializer, while for the weights in the fully connected layers we use the default PyTorch initialization.

Table 5.2: Details of the parameters in the architecture of the neural network used in our CIFAR10 experiments.

Parameter   Shape
conv1       3 × 16 × 3 × 3
conv2       16 × 64 × 4 × 4
fc1         64 × 384
fc2         384 × 192
fc3         192 × 10
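For reference, the following PyTorch sketch instantiates Table 5.2 (the class name is ours; the pooling strides follow the text above). Its parameter count is exactly 117,706, matching the model size quoted in Section 5.6:

```python
import torch.nn as nn

class CIFAR10Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3)       # Table 5.2: conv1
        self.pool1 = nn.MaxPool2d(kernel_size=3, stride=3)
        self.conv2 = nn.Conv2d(16, 64, kernel_size=4)      # Table 5.2: conv2
        self.pool2 = nn.MaxPool2d(kernel_size=4, stride=4)
        self.fc1 = nn.Linear(64, 384)
        self.fc2 = nn.Linear(384, 192)
        self.fc3 = nn.Linear(192, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool1(self.relu(self.conv1(x)))  # 32x32 -> 30x30 -> 10x10
        x = self.pool2(self.relu(self.conv2(x)))  # 10x10 -> 7x7 -> 1x1
        x = x.flatten(1)                          # 64 features per image
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)                        # softmax folded into the loss
```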
5.8.8.2 Time Analysis for UBAR

The training time of UBAR is divided into two parts: computation time and communication time. We start by discussing the communication time. For modeling the communication time, we assume that each node can, in parallel, multicast its model to its neighbors and receive the models from its neighbor nodes, where each node is assumed to be connected to S neighbors. Hence, the time to multicast a model to the S neighbors can be calculated as 32d/R, where d is the model size, each element of the model is represented by 32 bits, and R is the communication bandwidth in b/s. On the other hand, the time to receive S different models from the S neighbor nodes is given by 32Sd/R. We assume that in UBAR, each node starts the model evaluations and model update only after receiving the S models (the same assumption is used when computing the training time for Basil). Therefore, given that each node starts the training procedure only when it receives the S models while all the communications happen in parallel in the graph, the communication time in one parallel round of UBAR is given as follows:

T_{\text{UBAR-communication}} = \frac{32Sd}{R}. \qquad (5.59)

The computation time of UBAR is given by

T_{\text{UBAR-computation}} = T_{\text{dist-based}} + T_{\text{perf-based}} + T_{\text{agg}} + T_{\text{SGD}}, \qquad (5.60)

where T_{\text{dist-based}}, T_{\text{perf-based}}, T_{\text{agg}}, and T_{\text{SGD}} are, respectively, the times to apply the distance-based strategy, the performance-based strategy, the aggregation, and one step of SGD model update. Hence, the total training time when using UBAR for K training rounds is given by

T_{\text{UBAR}} = K\big(T_{\text{dist-based}} + T_{\text{perf-based}} + T_{\text{agg}} + T_{\text{SGD}} + S\, T_{\text{comm}}\big), \qquad (5.61)

where T_{\text{comm}} = 32d/R.

5.8.9 Setting

In this section, we provide the details of the neural networks used in our experiments, some key details regarding the UBAR implementation, and multiple additional experiments to further demonstrate the superiority of our proposed algorithms. We start in Section 5.8.9.1 by describing the models used in our experiments in Section 5.6 and in the additional experiments given in this section. In Section 5.8.9.2, we discuss the implementation of UBAR. After that, we run additional experiments using the MNIST dataset [70] in Section 5.8.9.3. In Section 5.8.9.4, we study the computation time of Basil compared to UBAR, and the training performance of Basil and UBAR with respect to the training time. Finally, we study the performance of Basil and ACDS on the CIFAR100 dataset with non-IID data distribution in Section 5.8.9.5, and the performance comparison between Basil and Basil+ in Section 5.8.9.6.

Figure 5.10: Illustrating the results for MNIST under IID data distribution setting. Panels: (a) No Attack, (b) Gaussian Attack, (c) Random Sign Flip Attack, (d) Hidden Attack.

5.8.9.1 Models

We provide the details of the neural network architectures used in our experiments. For MNIST, we use a model with three fully connected layers; the details are provided in Table 5.1. Each of the first two fully connected layers is followed by ReLU, while softmax is used at the output of the third fully connected layer.

Table 5.1: Details of the parameters in the architecture of the neural network used in our MNIST experiments.

Parameter | Shape
fc1 | 784 × 100
fc2 | 100 × 100
fc3 | 100 × 10

For the CIFAR10 experiments in the main chapter, we consider a neural network with two convolutional layers and three fully connected layers; the specific details of these layers are provided in Table 5.2. ReLU and maxpool are applied after each convolutional layer. The first maxpool has a kernel size of 3 × 3, and the second maxpool has a kernel size of 4 × 4 and a stride of 4. Each of the first two fully connected layers is followed by ReLU, while softmax is used at the output of the third fully connected layer. We initialize all biases to 0. Furthermore, for the weights in the convolutional layers, we use the Glorot uniform initializer, while for the weights in the fully connected layers, we use the default PyTorch initialization.

Table 5.2: Details of the parameters in the architecture of the neural network used in our CIFAR10 experiments.

Parameter | Shape
conv1 | 3 × 16 × 3 × 3
conv2 | 16 × 64 × 4 × 4
fc1 | 64 × 384
fc2 | 384 × 192
fc3 | 192 × 10
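For reference, the two architectures in Tables 5.1 and 5.2 can be written as the following PyTorch sketch. The stride of the first maxpool is truncated in the text above, so the value of 3 used below is inferred from the fc1 input dimension in Table 5.2 and should be treated as an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class MNISTNet(nn.Module):
    """Three fully connected layers (Table 5.1); softmax at the output."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 100)
        self.fc2 = nn.Linear(100, 100)
        self.fc3 = nn.Linear(100, 10)

    def forward(self, x):
        x = x.view(-1, 784)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x), dim=1)

class CIFAR10Net(nn.Module):
    """Two convolutional and three fully connected layers (Table 5.2)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3)   # 3 x 16 x 3 x 3
        self.conv2 = nn.Conv2d(16, 64, kernel_size=4)  # 16 x 64 x 4 x 4
        self.fc1 = nn.Linear(64, 384)
        self.fc2 = nn.Linear(384, 192)
        self.fc3 = nn.Linear(192, 10)

    def forward(self, x):
        # 32x32 -> 30x30 -> 10x10 (maxpool 3x3; stride 3 assumed, since it
        # is the only value consistent with fc1's 64-dimensional input)
        x = F.max_pool2d(F.relu(self.conv1(x)), kernel_size=3, stride=3)
        # 10x10 -> 7x7 -> 1x1 (maxpool 4x4, stride 4): 64 features for fc1
        x = F.max_pool2d(F.relu(self.conv2(x)), kernel_size=4, stride=4)
        x = x.view(-1, 64)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x), dim=1)
```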
Figure 5.11: Illustrating the results for MNIST under non-IID data distribution setting. Panels: (a) No Attack, (b) No Attack, (c) Gaussian Attack, (d) Random Sign Flip Attack; each panel plots test accuracy versus training round k, comparing R-plain, Basil, and UBAR at data-sharing levels α = 0% and α = 5%.

Figure 5.12: Illustrating the performance of Basil compared with UBAR for MNIST under non-IID data distribution setting with α = 5% data sharing. Panels: (a) No Attack, (b) Gaussian Attack, (c) Random Sign Flip Attack.

5.8.9.2 Implementing UBAR

We follow a similar approach to the one described in the experiments in [45]. Specifically, we first assign connections randomly among benign nodes with probability 0.4 unless otherwise specified, and then randomly assign connections from the benign nodes to the Byzantine nodes, also with probability 0.4 unless otherwise specified. Furthermore, we set the Byzantine ratio for benign nodes to ρ = 0.33.

5.8.9.3 Performance of Basil on MNIST

We present the results for MNIST in Figure 5.10 and Figure 5.11 under the IID and non-IID data distribution settings, respectively. As can be seen from Figure 5.10 and Figure 5.11, using Basil leads to the same conclusions drawn for the CIFAR10 dataset in the main chapter in terms of its fast convergence, high test accuracy, and Byzantine robustness compared to the different schemes. In particular, Figure 5.10 under the IID data setting demonstrates that Basil is not only resilient to Byzantine attacks but also maintains its superior convergence performance over UBAR. Furthermore, Figure 5.11(a) and Figure 5.11(b) illustrate that the test accuracy when using Basil and R-plain under the non-IID data setting increases when each node shares 5% of its local data with other nodes in the absence of Byzantine nodes. It can also be seen from Figure 5.11(c) and Figure 5.11(d) that ACDS with α = 5% on top of Basil provides the same robustness to software/hardware faults, represented by the Gaussian model and random sign flip, as concluded in the main chapter. Additionally, we observe that both Basil without ACDS and UBAR completely fail in the presence of these faults. Similar to the results in Figure 5.6 in the main chapter, Figure 5.12 shows that even when 5% data sharing is done in UBAR, its performance remains quite low in comparison to Basil+ACDS.

5.8.9.4 Wall-Clock Time Performance of Basil

In this section, we show the training performance of Basil and UBAR with respect to the training time instead of the number of rounds. To do so, we consider the following setting.

Figure 5.13: Illustrating the performance of Basil using the CIFAR10 dataset under IID data distribution setting with respect to the training time. Panels: (a) using a communication bandwidth of 100 Mb/s, (b) using a communication bandwidth of 10 Mb/s.

Experimental setting. We consider the same setting discussed in Section 5.6, where there exists a total of 100 nodes, of which 67 are benign. For the dataset, we use CIFAR10. We also consider the Gaussian attack. We set the connectivity parameter for the two algorithms to S = 10. We now give the computation and communication times of Basil and UBAR.

Computation time. We measured the computation time of Basil and UBAR on a server with an AMD EPYC 7502 32-core CPU. In particular, in Table 5.3, we report the average running time of each main component (function) of UBAR and Basil. To do so, we take the average computation time over 10^3 runs of each component of the mitigation strategy in each training round over 100 rounds. These components are the performance-based evaluation for Basil given in Section 5.3, while for UBAR they are the performance-based evaluation and the distance-based evaluation along with the model aggregation given in Appendix 5.8.8. We can see from Table 5.3 that the average time each benign node in UBAR takes to evaluate the received set of models and take one step of model update using SGD is ∼2× that of Basil. The reason is that each benign node in UBAR performs two extra stages before taking the model update step: (1) distance-based evaluation and (2) model aggregation. The distance-based stage includes a comparison between the model of each benign node and the received set of models, which is a time-consuming operation compared to the performance-based stage, as shown in Table 5.3.

Table 5.3: The breakdown of the average computation time per node for Basil and UBAR.

Algorithm | Performance-based T_perf-based (s) | Distance-based T_dist-based (s) | Aggregation T_agg (s) | SGD step T_SGD (s) | Total computation time per node (s)
Basil | 0.019 | – | – | 0.006 | 0.025
UBAR | 0.012 | 0.027 | 0.002 | 0.006 | 0.047

Communication time. We consider an idealistic simulation in which the communication time of the trained model is proportional to the number of elements of the model. In particular, we simulate the time taken to send the model used in Section 5.6 and described in Appendix 5.8.9.1 as T_comm = 32d/R, where the model size is d = 117706 parameters, each represented by 32 bits, and R is the bandwidth in Mb/s. To get the training time after K rounds, we use the per-round time results for Basil and UBAR given in Proposition 5.4 in Section 5.6 and in Appendix 5.8.8, respectively, with K being the number of training rounds.
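As a rough illustration of how these quantities combine, the following sketch evaluates UBAR's total training time from equation 5.61 using the measured constants in Table 5.3; the function name and example values are ours, and Basil's per-round time (Proposition 5.4) is not reproduced here.

```python
def ubar_training_time(K, S, d=117706, R_mbps=100.0,
                       t_dist=0.027, t_perf=0.012, t_agg=0.002, t_sgd=0.006):
    """Total UBAR training time per equation 5.61.

    K: training rounds, S: neighbors per node, d: model size in parameters,
    R_mbps: bandwidth in Mb/s; per-component times (seconds) from Table 5.3.
    """
    t_comm = 32 * d / (R_mbps * 1e6)  # time to send one 32-bit model (T_comm)
    per_round = t_dist + t_perf + t_agg + t_sgd + S * t_comm
    return K * per_round

# Example: 150 rounds with S = 10 at 10 Mb/s vs. 100 Mb/s; the gap shows
# communication dominating computation at the lower bandwidth.
print(ubar_training_time(K=150, S=10, R_mbps=10.0))
print(ubar_training_time(K=150, S=10, R_mbps=100.0))
```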
Results. Figure 5.13 demonstrates the performance of Basil and UBAR with respect to the training time. As we can observe in Figure 5.13, the time it takes for UBAR to reach its maximum accuracy is almost the same as the time Basil needs to reach UBAR's maximum achievable accuracy. We recall from Section 5.6 that UBAR needs 5× more computation/communication resources than Basil to reach 41% test accuracy. This behavior with respect to the training time is not surprising, since Basil takes far fewer training rounds than UBAR to reach the same accuracy, as shown in Figure 5.4. As a result, the latency resulting from the sequential training does not have a high impact in comparison to UBAR. Finally, we can see that the communication time is the bottleneck in this setting, as the training time decreases significantly when the bandwidth increases from 10 Mb/s to 100 Mb/s.

5.8.9.5 Performance of Basil for Non-IID Data Distribution using CIFAR100

To demonstrate the practicality of ACDS and simulate the scenario where each node shares data only from the non-sensitive portion of its dataset, we consider the following experiment.

Dataset and hyperparameters. We run an image classification task on the CIFAR100 dataset [64]. This dataset is similar to CIFAR10 in having the same input dimension (32 × 32 × 3), except that it has 100 classes containing 600 images each, where each class has its own feature distribution. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR100 are grouped into 20 superclasses. For instance, the superclass Fish includes the five subclasses Aquarium fish, Flatfish, Ray, Shark, and Trout. In this experiment, we consider a system with a total of 100 nodes, of which 80 are benign. We set the connectivity parameter of Basil to S = 5. We use a decreasing learning rate of 0.03/(1 + 0.03k), where k denotes the training round. For the classification task, we only consider the superclasses as the target labels, i.e., we have 20 labels for our classification task.

Model architecture. We use the same neural network used for CIFAR10 in the main chapter, which consists of 2 convolutional layers and 3 fully connected layers, with the modification that the output of the last layer has a dimension of 20. The details of this network are included in Appendix 5.8.9.1.

Data distribution. To simulate the non-IID setting, we first shuffle the data within each superclass, then partition each superclass into 5 portions, and assign each node one partition. Hence, each node has data from only one superclass, including data from each of the corresponding 5 subclasses.

Data sharing. To simulate the case where each node shares data only from the non-sensitive portion of its local data, we take advantage of the variation of the feature distribution across subclasses and treat sensitivity at the subclass level. Towards this partition goal, we define γ ∈ (0, 1) to represent the fraction of its available 5 subclasses whose data a node considers non-sensitive. For instance, γ = 1 implies that all 5 subclasses are considered non-sensitive and nodes can share data from any of them (i.e., nodes can share data from their entire local dataset). On the other hand, γ = 0.4 means that all nodes consider only the first two subclasses of their data as non-sensitive and share data only from them. We note that for nodes holding the same superclass, the order of the subclasses is the same among them (e.g., if node 1 and node 2 both have data from the superclass Fish, then the subclass Aquarium fish is the first subclass at both of them). This ensures that, for γ = 0.4, data from three subclasses per superclass is never shared by any node. Finally, we allow each node to share αD data points from its γ subclasses, where D = 500 is the local dataset size at each node and α = 5%.
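The following minimal sketch simulates this partitioning and γ-based sharing for a single superclass, assuming numpy index arrays per subclass; the function name and its defaults are illustrative, not part of ACDS itself.

```python
import numpy as np

def partition_and_share(subclass_indices, gamma=0.4, alpha=0.05, D=500,
                        nodes_per_superclass=5, seed=0):
    """Sketch of the non-IID CIFAR100 partition and gamma-based sharing.

    subclass_indices: for one superclass, an ordered mapping
    {subclass_name: array of sample indices}; the first ceil(gamma * 5)
    subclasses are treated as the non-sensitive ("open") ones.
    Returns the 5 node partitions and the alpha*D points each node
    contributes to the shared dataset.
    """
    rng = np.random.default_rng(seed)
    # Open (shareable) subclasses come first, in a fixed order shared by
    # all nodes holding this superclass.
    num_open = int(np.ceil(gamma * len(subclass_indices)))
    open_idx = np.concatenate(
        [np.asarray(v) for v in list(subclass_indices.values())[:num_open]])

    # Shuffle the whole superclass, then split it into 5 node partitions.
    all_idx = np.concatenate([np.asarray(v) for v in subclass_indices.values()])
    rng.shuffle(all_idx)
    partitions = np.array_split(all_idx, nodes_per_superclass)

    shared = []
    for part in partitions:
        # A node may only share points that fall in its open subclasses.
        candidates = np.intersect1d(part, open_idx)
        shared.append(rng.choice(candidates, size=int(alpha * D), replace=False))
    return partitions, shared
```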
Figure 5.14: The performance of Basil under different data sharing scenarios in the presence of the Gaussian attack when the data distribution at the nodes is non-IID. Here, γ represents the fraction of its 5 available subclasses whose data a node considers non-sensitive. (Test accuracy versus training round k.)

Results. Figure 5.14 shows the performance of Basil in the presence of the Gaussian attack under different values of γ. We can see that even when each node shares data from only two subclasses (γ = 0.4) out of the five available subclasses, Basil delivers performance comparable to the case where each node shares data from its entire dataset (γ = 1). The intuition behind this good performance in the presence of Byzantine nodes is that, although data from three subclasses in each superclass is never shared, there are nodes in the system that originally hold data from these sensitive subclasses; when they train the model on their local datasets, the side information augmented from the shared dataset helps maintain the model's performance and resiliency to Byzantine faults. Furthermore, we can see that R-plain fails in the presence of the attack, demonstrating that ACDS data sharing is crucial for good performance when data is non-IID.

5.8.9.6 Performance Comparison Between Basil and Basil+

We compare the performance of Basil and Basil+ under the following setting.

Figure 5.15: Illustrating the performance of Basil and Basil+ for the CIFAR10 dataset. Panels: (a) the average test accuracy among the benign nodes in each round, (b) the worst-case test accuracy among the benign nodes in each round; each panel compares Basil, Basil+ with G ∈ {2, 4, 8}, and UBAR.

Setting. We consider a setting of 400 nodes, of which 80 are Byzantine. For the dataset, we use CIFAR10, and we consider the inverse attack for the Byzantine nodes. We use a connectivity parameter of S = 6 for both Basil and Basil+, and consider epoch-based local training, where we set the number of epochs to 3.

Results. As we can see from Figure 5.15, Basil and Basil+ with different numbers of groups retain higher test accuracy than UBAR in the presence of the inverse attack. We can also see from Figure 5.15(b) that when there is a large number of nodes in the system, increasing the number of groups (e.g., G = 8) makes the worst-case training performance of the benign nodes more stable across the training rounds compared to Basil, which exhibits high fluctuation in large settings (e.g., 400 nodes).

Chapter 6

Conclusion

In conclusion, this thesis presents significant advancements in the privacy, security, and efficiency of Federated Learning (FL).

Privacy Considerations: This thesis established the first formal privacy guarantees for FL with Secure Aggregation (SA), using Mutual Information (MI) as a metric to assess privacy leakage. The findings reveal that FL with SA can substantially reduce MI privacy leakage as the number of participating users increases, while the impact of model size on privacy leakage is constrained by the rank of the covariance matrix of individual model updates. Additionally, larger batch sizes during local training contribute to further reducing privacy leakage. To enhance the algorithmic aspect of SA, HeteroSAg was introduced, an innovative algorithm that supports secure aggregation with heterogeneous quantization. HeteroSAg accommodates users with varying communication resources, offering an effective trade-off between training accuracy and communication time. By employing a segment-based aggregation strategy, HeteroSAg also mitigates the impact of Byzantine users, significantly reducing bandwidth expansion compared to state-of-the-art protocols.
Training Efficiency: The application of Parameter Efficient Fine-Tuning (PEFT) methods was explored to reduce computation, communication, and storage costs in FL. An analysis of existing PEFT methods under various data heterogeneity settings highlighted the performance gap between Full Fine-Tuning (FFT) and PEFT. To address this, two variants of LoRA tailored for FL were proposed: DALoRA, which fine-tunes models using public datasets before federated fine-tuning on private data, and SLoRA, which does not require public datasets and employs uniform sparsity and LoRA modules in a two-stage training process. Both approaches demonstrated their ability to reduce training and communication costs while closing the performance gap between PEFT and FFT.

Security: While several Byzantine-robust algorithms exist for server-based FL, there has been limited work on achieving Byzantine resiliency in server-less FL environments. This thesis proposes Basil, a novel algorithm designed for server-less FL that is both fast and computationally efficient. Basil operates over a logical ring, using a sequential, memory-assisted, and performance-based approach to filter out Byzantine users, ensuring robust and reliable training.

This thesis contributes to the advancement of FL by addressing critical challenges in privacy, efficiency, and security, offering practical solutions applicable in real-world scenarios.

References

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. “Deep learning with differential privacy”. In: Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. 2016, pp. 308–318.
[2] Naman Agarwal, Ananda Theertha Suresh, Felix Xinnan X Yu, Sanjiv Kumar, and Brendan McMahan. “cpSGD: Communication-efficient and differentially-private distributed SGD”. In: Advances in Neural Information Processing Systems 31 (2018).
[3] Samiul Alam, Luyang Liu, Ming Yan, and Mi Zhang. “FedRolex: Model-Heterogeneous Federated Learning with Rolling Sub-Model Extraction”. In: arXiv preprint arXiv:2212.01548 (2022).
[4] Alan Ansell, Edoardo Maria Ponti, Anna Korhonen, and Ivan Vulić. “Composable sparse fine-tuning for cross-lingual transfer”. In: arXiv preprint arXiv:2110.07560 (2021).
[5] Yoshinori Aono, Takuya Hayashi, Lihua Wang, Shiho Moriai, et al. “Privacy-preserving deep learning via additively homomorphic encryption”. In: IEEE Transactions on Information Forensics and Security 13.5 (2017), pp. 1333–1345.
[6] Sara Babakniya, Ahmed Roushdy Elkordy, Yahya H. Ezzeldin, Qingfeng Liu, Kee-Bong Song, Mostafa El-Khamy, and Salman Avestimehr. SLoRA: Federated Parameter Efficient Fine-Tuning of Language Models. 2023. arXiv: 2308.06522 [cs.LG]. url: https://arxiv.org/abs/2308.06522.
[7] Sara Babakniya, Souvik Kundu, Saurav Prakash, Yue Niu, and Salman Avestimehr. “Federated sparse training: Lottery aware model compression for resource constrained edge”. In: arXiv preprint arXiv:2208.13092 (2022).
[8] Borja Balle, James Bell, Adrià Gascón, and Kobbi Nissim. “The privacy blanket of the shuffle model”. In: Annual International Cryptology Conference. Springer. 2019, pp. 638–667.
[9] Raef Bassily, Adam Smith, and Abhradeep Thakurta. “Private empirical risk minimization: Efficient algorithms and tight error bounds”. In: 2014 IEEE 55th Annual Symposium on Foundations of Computer Science. IEEE. 2014, pp. 464–473.
[10] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm.
“Mutual Information Neural Estimation”. In: Proceedings of the 35th International Conference on Machine Learning. Ed. by Jennifer Dy and Andreas Krause. Vol. 80. Proceedings of Machine Learning Research. PMLR, Oct. 2018, pp. 531–540. [11] James Henry Bell, Kallista A Bonawitz, Adrià Gascón, Tancrède Lepoint, and Mariana Raykova. “Secure single-server aggregation with (poly) logarithmic overhead”. In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. 2020, pp. 1253–1269. [12] James Henry Bell, Kallista A. Bonawitz, Adrià Gascón, Tancrède Lepoint, and Mariana Raykova. “Secure Single-Server Aggregation with (Poly)Logarithmic Overhead”. In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. 2020, pp. 1253–1269. [13] Sameer Bibikar, Haris Vikalo, Zhangyang Wang, and Xiaohan Chen. “Federated dynamic sparse training: Computing less, communicating less, yet learning better”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. 2022, pp. 6080–6088. [14] Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. “Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent”. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17. Curran Associates Inc., 2017, pp. 118–128. [15] Sergey G Bobkov, Gennadiy P Chistyakov, and Friedrich Götze. “Berry–Esseen bounds in the entropic central limit theorem”. In: Probability Theory and Related Fields 159.3-4 (2014), pp. 435–478. [16] K. Bonawitz, F. Salehi, J. Konečný, B. McMahan, and M. Gruteser. “Federated Learning with Autotuned Communication-Efficient Secure Aggregation”. In: 2019 53rd Asilomar Conference on Signals, Systems, and Computers. 2019, pp. 1222–1226. doi: 10.1109/IEEECONF44664.2019.9049066. [17] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, and et al. “Towards Federated Learning at Scale: System Design”. In: Proceedings of Machine Learning and Systems (MLSys 2019) (). [18] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. “Practical secure aggregation for privacy-preserving machine learning”. In: proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 2017, pp. 1175–1191. [19] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are few-shot learners”. In: Advances in neural information processing systems 33 (2020), pp. 1877–1901. [20] Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečny, ` H Brendan McMahan, Virginia Smith, and Ameet Talwalkar. “Leaf: A benchmark for federated settings”. In: arXiv preprint arXiv:1812.01097 (2018). [21] Miguel Castro, Barbara Liskov, and et al. “Practical Byzantine fault tolerance”. In: OSDI 173–186 (1999). 189 [22] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. “Differentially private empirical risk minimization.” In: Journal of Machine Learning Research 12.3 (2011). [23] Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. “The lottery ticket hypothesis for pre-trained bert networks”. In: Advances in neural information processing systems 33 (2020), pp. 15834–15846. [24] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). 
USA: Wiley-Interscience, 2006. isbn: 0471241954. [25] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of deep bidirectional transformers for language understanding”. In: North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2019), pp. 4171–4186. [26] Enmao Diao, Jie Ding, and Vahid Tarokh. “HeteroFL: Computation and communication efficient federated learning for heterogeneous clients”. In: arXiv preprint arXiv:2010.01264 (2020). [27] Whitfield Diffie and Martin E. Hellman. “New Directions in Cryptography”. In: IEEE Transactions on Information Theory 22.6 (Nov. 1976), pp. 644–654. [28] Roel Dobbe, David Fridovich-Keil, and Claire Tomlin. “Fully Decentralized Policies for Multi-Agent Systems: An Information Theoretic Approach”. In: NIPS’17. Long Beach, California, USA: Curran Associates Inc., 2017, pp. 2945–2954. isbn: 9781510860964. [29] Ye Dong, Xiaojun Chen, Liyan Shen, and Dakui Wang. “EaSTFLy: Efficient and secure ternary federated learning”. In: Computers & Security 94 (2020), p. 101824. [30] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. “An image is worth 16×16 words: Transformers for image recognition at scale”. In: International Conference on Learning Representations (ICLR) (2021). [31] L. A. Dunning and R. Kresman. “Privacy Preserving Data Sharing With Anonymous ID Assignment”. In: IEEE Transactions on Information Forensics and Security 8.2 (2013), pp. 402–413. doi: 10.1109/TIFS.2012.2235831. [32] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. “Calibrating noise to sensitivity in private data analysis”. In: Theory of cryptography conference. Springer. 2006, pp. 265–284. [33] El Mahdi El Mhamdi, Rachid Guerraoui, and Sébastien Rouault. “The Hidden Vulnerability of Distributed Learning in Byzantium”. In: Proceedings of the 35th International Conference on Machine Learning. Ed. by Jennifer Dy and Andreas Krause. Vol. 80. Proceedings of Machine Learning Research. Stockholmsmässan, Stockholm Sweden: PMLR, Oct. 2018, pp. 3521–3530. [34] Ronen Eldan, Dan Mikulincer, and Alex Zhai. “The CLT in high dimensions: quantitative bounds via martingale embedding”. In: The Annals of Probability 48.5 (2020), pp. 2494–2524. 190 [35] Anis Elgabli, Jihong Park, Amrit S. Bedi, Mehdi Bennis, and Vaneet Aggarwal. “Q-GADMM: Quantized Group ADMM for Communication Efficient Decentralized Machine Learning”. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020. [36] Ahmed Roushdy Elkordy, Saurav Prakash, and Salman Avestimehr. “Basil: A Fast and Byzantine-Resilient Approach for Decentralized Training”. In: IEEE Journal on Selected Areas in Communications 40.9 (2022), pp. 2694–2716. doi: 10.1109/JSAC.2022.3191347. [37] Ahmed Roushdy Elkordy and A. Salman Avestimehr. “HeteroSAg: Secure Aggregation with Heterogeneous Quantization in Federated Learning”. In: IEEE Transactions on Communications (2022), pp. 1–1. doi: 10.1109/TCOMM.2022.3151126. [38] Ahmed Roushdy Elkordy, Jiang Zhang, Yahya H. Ezzeldin, Konstantinos Psounis, and Amir Salman Avestimehr. “How Much Privacy Does Federated Learning with Secure Aggregation Guarantee?” In: Proc. Priv. Enhancing Technol. 2023 (2022), pp. 510–526. url: https://api.semanticscholar.org/CorpusID:251320376. [39] Irem Ergun, Hasin Us Sami, and Basak Guler. 
“Sparsified Secure Aggregation for Privacy-Preserving Federated Learning”. In: arXiv preprint arXiv:2112.12872 (2021). [40] Elizabeth L Feld. “United States Data Privacy Law: The Domino Effect After the GDPR”. In: N.C. Banking Inst. Vol. 24. HeinOnline, 2020, p. 481. [41] Elias Frantar and Dan Alistarh. “SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot”. In: (2023). [42] Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, and Michael Moeller. “Inverting Gradients – How easy is it to break privacy in federated learning?” In: Advances in Neural Information Processing Systems. 2020. [43] Mitchell A Gordon, Kevin Duh, and Nicholas Andrews. “Compressing bert: Studying the effects of weight pruning on transfer learning”. In: arXiv preprint arXiv:2002.08307 (2020). [44] Demi Guo, Alexander M Rush, and Yoon Kim. “Parameter-efficient transfer learning with diff pruning”. In: arXiv preprint arXiv:2012.07463 (2020). [45] Shangwei Guo, Tianwei Zhang, Xiaofei Xie, Lei Ma, Tao Xiang, and Yang Liu. “Towards Byzantine-resilient Learning in Decentralized Systems”. In: preprint arXiv:2002.08569 (2020). arXiv: 2002.08569. [46] Chaoyang He, Murali Annavaram, and Salman Avestimehr. “Group knowledge transfer: Federated learning of large cnns at the edge”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 14068–14080. [47] Chaoyang He, Songze Li, Jinhyun So, Xiao Zeng, Mi Zhang, Hongyi Wang, Xiaoyang Wang, Praneeth Vepakomma, Abhishek Singh, Hang Qiu, et al. “Fedml: A research library and benchmark for federated machine learning”. In: arXiv preprint arXiv:2007.13518 (2020). 191 [48] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. “Towards a unified view of parameter-efficient transfer learning”. In: arXiv preprint arXiv:2110.04366 (2021). [49] Samuel Horvath, Stefanos Laskaridis, Mario Almeida, Ilias Leontiadis, Stylianos Venieris, and Nicholas Lane. “Fjord: Fair and accurate federated learning under heterogeneous targets with ordered dropout”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 12876–12889. [50] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. “Parameter-efficient transfer learning for NLP”. In: International Conference on Machine Learning. PMLR. 2019, pp. 2790–2799. [51] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. “Measuring the effects of non-identical data distribution for federated visual classification”. In: arXiv preprint arXiv:1909.06335 (2019). [52] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. “Lora: Low-rank adaptation of large language models”. In: arXiv preprint arXiv:2106.09685 (2021). [53] Swanand Kadhe, Nived Rajaraman, O Ozan Koyluoglu, and Kannan Ramchandran. “Fastsecagg: Scalable secure aggregation for privacy-preserving federated learning”. In: arXiv preprint arXiv:2009.11248 (2020). [54] Swanand Kadhe, Nived Rajaraman, O. Ozan Koyluoglu, and Kannan Ramchandran. “FastSecAgg: Scalable Secure Aggregation for Privacy-Preserving Federated Learning”. In: arXiv,2009.11248 (2020). [55] Peter Kairouz, Ziyu Liu, and Thomas Steinke. “The distributed discrete gaussian mechanism for federated learning with secure aggregation”. In: arXiv preprint arXiv:2102.06387 (2021). 
[56] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. “Advances and open problems in federated learning”. In: Foundations and Trends® in Machine Learning 14.1–2 (2021), pp. 1–210. [57] Peter Kairouz, H. Brendan McMahan, Brendan, and et al. “Advances and Open Problems in Federated Learning”. In: preprint arXiv:1912.04977 (2019). arXiv: 1912.04977. [58] Jiawen Kang, Zehui Xiong, Dusit Niyato, Yuze Zou, Yang Zhang, and Mohsen Guizani. “Reliable federated learning for mobile networks”. In: IEEE Wireless Communications 27.2 (2020), pp. 72–80. [59] Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. “What can we learn privately?” In: SIAM Journal on Computing 40.3 (2011), pp. 793–826. [60] Yeachan Kim, Junho Kim, Wing-Lam Mok, Jun-Hyung Park, and SangKeun Lee. “Client-customized adaptation for parameter-efficient federated learning”. In: Findings of the Association for Computational Linguistics: ACL 2023. 2023, pp. 1159–1172. 192 [61] Anastasia Koloskova, Sebastian Stich, and Martin Jaggi. “Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication”. In: Proceedings of the 36th International Conference on Machine Learning. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning Research. PMLR, Sept. 2019, pp. 3478–3487. [62] Jakub Konečny, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and ` Dave Bacon. “Federated learning: Strategies for improving communication efficiency”. In: arXiv preprint arXiv:1610.05492 (2016). [63] A. Krizhevsky. “Learning Multiple Layers of Features from Tiny Images”. In: 2009. [64] Alex Krizhevsky. Learning multiple layers of features from tiny images. Tech. rep. Citeseer, 2009. [65] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep convolutional neural networks”. In: Advances in neural information processing systems 25 (2012). [66] Eugene Kuznetsov, Yitao Chen, and Ming Zhao. “SecureFL: Privacy Preserving Federated Learning with SGX and TrustZone”. In: 2021 IEEE/ACM Symposium on Edge Computing (SEC). 2021, pp. 55–67. doi: 10.1145/3453142.3491287. [67] Leslie Lamport, Robert Shostak, and Marshall Pease. “The Byzantine Generals Problem”. In: ACM Trans. Program. Lang. Syst. (1982). [68] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. “Albert: A lite bert for self-supervised learning of language representations”. In: arXiv preprint arXiv:1909.11942 (2019). [69] Ken Lang. “Newsweeder: Learning to filter netnews”. In: Machine learning proceedings 1995. Elsevier, 1995, pp. 331–339. [70] Yann LeCun, Corinna Cortes, and CJ Burges. “MNIST handwritten digit database”. In: ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2 (2010). [71] Jin-woo Lee, Jaehoon Oh, Sungsu Lim, Se-Young Yun, and Jae-Gil Lee. “TornadoAggregate: Accurate and Scalable Federated Learning via the Ring-Based Architecture”. In: preprint arXiv:1806.00582 (2020). arXiv: 2012.03214. [72] Brian Lester, Rami Al-Rfou, and Noah Constant. “The power of scale for parameter-efficient prompt tuning”. In: arXiv preprint arXiv:2104.08691 (2021). [73] Ang Li, Jingwei Sun, Xiao Zeng, Mi Zhang, Hai Li, and Yiran Chen. “Fedmask: Joint computation and communication-efficient personalized federated learning via heterogeneous masking”. 
In: Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems. 2021, pp. 42–55. [74] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. “Federated learning: Challenges, methods, and future directions”. In: IEEE signal processing magazine 37.3 (2020), pp. 50–60. 193 [75] Xiang Lisa Li and Percy Liang. “Prefix-tuning: Optimizing continuous prompts for generation”. In: arXiv preprint arXiv:2101.00190 (2021). [76] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent. 2017. arXiv: 1705.09056 [math.OC]. [77] Ilya Loshchilov and Frank Hutter. “Decoupled weight decay regularization”. In: arXiv preprint arXiv:1711.05101 (2017). [78] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. “Communication-Efficient Learning of Deep Networks from Decentralized Data”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. Ed. by Aarti Singh and Jerry Zhu. Vol. 54. Proceedings of Machine Learning Research. 2017, pp. 1273–1282. [79] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. “Communication-efficient learning of deep networks from decentralized data”. In: Artificial intelligence and statistics. PMLR. 2017, pp. 1273–1282. [80] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. “Inference Attacks Against Collaborative Learning”. In: CoRR abs/1805.04049 (2018). [81] Vaikkunth Mugunthan, Antigoni Polychroniadou, David Byrd, and Tucker Hybinette Balch. “Smpai: Secure multi-party computation for federated learning”. In: Proceedings of the NeurIPS 2019 Workshop on Robust AI in Financial Services. 2019. [82] M. Nasr, R. Shokri, and A. Houmansadr. “Comprehensive Privacy Analysis of Deep Learning: Passive and Active White-box Inference Attacks against Centralized and Federated Learning”. In: 2019 IEEE Symposium on Security and Privacy (SP). 2019, pp. 739–753. doi: 10.1109/SP.2019.00065. [83] Yue Niu, Saurav Prakash, Souvik Kundu, Sunwoo Lee, and Salman Avestimehr. “Federated Learning of Large Models at the Edge via Principal Sub-Model Training”. In: arXiv preprint arXiv:2208.13141 (2022). [84] Jonas Pfeiffer, Naman Goyal, Xi Victoria Lin, Xian Li, James Cross, Sebastian Riedel, and Mikel Artetxe. “Lifting the curse of multilinguality by pre-training modular transformers”. In: arXiv preprint arXiv:2205.06266 (2022). [85] Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. AdapterHub: A Framework for Adapting Transformers. 2020. arXiv: 2007.07779 [cs.CL]. [86] Le Trieu Phong, Yoshinori Aono, Takuya Hayashi, Lihua Wang, and Shiho Moriai. “Privacy-preserving deep learning: Revisited and enhanced”. In: International Conference on Applications and Techniques in Information Security. Springer. 2017, pp. 100–110. [87] Krishna Pillutla, Sham M. Kakade, and Zaid Harchaoui. “Robust Aggregation for Federated Learning”. In: preprint arXiv:1912.13445 (2019). arXiv: 1912.13445. 194 [88] Saurav Prakash and Amir Salman Avestimehr. “Mitigating Byzantine Attacks in Federated Learning”. In: arXiv preprint arXiv:2010.07541 (2020). [89] Xinchi Qiu, Javier Fernandez-Marques, Pedro PB Gusmao, Yan Gao, Titouan Parcollet, and Nicholas Donald Lane. “ZeroFL: Efficient on-device training for federated learning with local sparsity”. 
In: arXiv preprint arXiv:2208.02507 (2022). [90] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečny, ` Sanjiv Kumar, and H Brendan McMahan. “Adaptive federated optimization”. In: arXiv preprint arXiv:2003.00295 (2020). [91] Jayanth Regatti, Hao Chen, and Abhishek Gupta. “ByGARS: Byzantine SGD with Arbitrary Number of Attackers”. In: preprint arXiv:2006.13421 (2020). arXiv: 2006.13421 [cs.LG]. [92] Anit Kumar Sahu, Tian Li, Maziar Sanjabi, Manzil Zaheer, Ameet Talwalkar, and Virginia Smith. “On the convergence of federated optimization in heterogeneous networks”. In: arXiv preprint arXiv:1812.06127 3 (2018), p. 3. [93] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”. In: arXiv preprint arXiv:1910.01108 (2019). [94] Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. “Robust and Communication-Efficient Federated Learning From Non-i.i.d. Data”. In: IEEE Transactions on Neural Networks and Learning Systems 31.9 (2020), pp. 3400–3413. [95] Adi Shamir. “How to share a secret”. In: Communications of the ACM (1979), pp. 612–613. doi: 10.1145/28395.28420. [96] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. “Membership Inference Attacks Against Machine Learning Models”. In: 2017 IEEE Symposium on Security and Privacy (SP). 2017, pp. 3–18. doi: 10.1109/SP.2017.41. [97] Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. “A tail-index analysis of stochastic gradient noise in deep neural networks”. In: International Conference on Machine Learning. PMLR. 2019, pp. 5827–5837. [98] Jinhyun So, Ramy E Ali, Basak Guler, Jiantao Jiao, and Salman Avestimehr. “Securing secure aggregation: Mitigating multi-round privacy leakage in federated learning”. In: arXiv preprint arXiv:2106.03328 (2021). [99] Jinhyun So, Başak Güler, and A Salman Avestimehr. “Byzantine-resilient secure federated learning”. In: IEEE Journal on Selected Areas in Communications (2020). [100] Jinhyun So, Başak Güler, and A Salman Avestimehr. “Turbo-aggregate: Breaking the quadratic aggregation barrier in secure federated learning”. In: IEEE Journal on Selected Areas in Information Theory 2.1 (2021), pp. 479–489. 195 [101] Jinhyun So, Başak Güler, and A. Salman Avestimehr. “Byzantine-Resilient Secure Federated Learning”. In: IEEE Journal on Selected Areas in Communications (2020), pp. 1–1. doi: 10.1109/JSAC.2020.3041404. [102] Jinhyun So, Başak Güler, and A. Salman Avestimehr. “Turbo-Aggregate: Breaking the Quadratic Aggregation Barrier in Secure Federated Learning”. In: IEEE Journal on Selected Areas in Information Theory 2.1 (2021), pp. 479–489. doi: 10.1109/JSAIT.2021.3054610. [103] Jinhyun So, Corey J Nolet, Chien-Sheng Yang, Songze Li, Qian Yu, Ramy E Ali, Basak Guler, and Salman Avestimehr. “Lightsecagg: a lightweight and versatile design for secure aggregation in federated learning”. In: Proceedings of Machine Learning and Systems 4 (2022), pp. 694–720. [104] Guangyu Sun, Matias Mendieta, Taojiannan Yang, and Chen Chen. “Exploring Parameter-Efficient Fine-tuning for Improving Communication Efficiency in Federated Learning”. In: arXiv preprint arXiv:2210.01708 (2022). [105] S. Sundhar Ram, A. Nedić, and V. V. Veeravalli. “Asynchronous gossip algorithms for stochastic optimization”. In: Proceedings of the 48h IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference. 2009, pp. 3581–3586. doi: 10.1109/CDC.2009.5399485. 
[106] Ananda Theertha Suresh, Felix X. Yu, Sanjiv Kumar, and H. Brendan McMahan. “Distributed Mean Estimation with Limited Communication”. In: Proceedings of the 34th International Conference on Machine Learning. Vol. 70. Proceedings of Machine Learning Research. PMLR, June 2017, pp. 3329–3337. [107] Stacey Truex, Nathalie Baracaldo, Ali Anwar, Thomas Steinke, Heiko Ludwig, Rui Zhang, and Yi Zhou. “A hybrid approach to privacy-preserving federated learning”. In: Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security. 2019, pp. 1–11. [108] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. “GLUE: A multi-task benchmark and analysis platform for natural language understanding”. In: arXiv preprint arXiv:1804.07461 (2018). [109] Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, and Jianfeng Gao. “AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning”. In: arXiv preprint arXiv:2210.17451 (2022). [110] Zhibo Wang, Mengkai Song, Zhifei Zhang, Yang Song, Qian Wang, and Hairong Qi. “Beyond inferring class representatives: User-level privacy leakage from federated learning”. In: IEEE INFOCOM 2019-IEEE Conference on Computer Communications. IEEE. 2019, pp. 2512–2520. [111] Cong Xie, Sanmi Koyejo, and Indranil Gupta. “Zeno: Distributed Stochastic Gradient Descent with Suspicion-based Fault-tolerance”. In: Proceedings of the 36th International Conference on Machine Learning. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning Research. PMLR, Sept. 2019, pp. 6893–6901. 196 [112] Cong Xie, Sanmi Koyejo, and Indranil Gupta. “Zeno++: Robust Fully Asynchronous SGD”. In: Proceedings of the 37th International Conference on Machine Learning. Ed. by Hal Daumé III and Aarti Singh. Vol. 119. Proceedings of Machine Learning Research. PMLR, 13–18 Jul 2020, pp. 10495–10503. [113] Runhua Xu, Nathalie Baracaldo, Yi Zhou, Ali Anwar, and Heiko Ludwig. “Hybridalpha: An efficient approach for privacy-preserving federated learning”. In: Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security. 2019, pp. 13–23. [114] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. “Federated machine learning: Concept and applications”. In: ACM Transactions on Intelligent Systems and Technology (TIST) 10.2 (2019), pp. 1–19. [115] Zhixiong Yang and Waheed U. Bajwa. “BRIDGE: Byzantine-resilient Decentralized Gradient Descent”. In: preprint arXiv:1908.08098 (2019). arXiv: 1908.08098. [116] Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. “Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates”. In: Proceedings of the 35th International Conference on Machine Learning. Ed. by Jennifer Dy and Andreas Krause. Vol. 80. Proceedings of Machine Learning Research. Stockholmsmässan, Stockholm Sweden: PMLR, Oct. 2018, pp. 5650–5659. [117] Hongxu Yin, Arun Mallya, Arash Vahdat, Jose M. Alvarez, Jan Kautz, and Pavlo Molchanov. “See through Gradients: Image Batch Recovery via GradInversion”. In: arXiv,2104.07586 (2021). [118] Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. “Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models”. In: arXiv preprint arXiv:2106.10199 (2021). [119] Tuo Zhang, Chaoyang He, Tian-Shya Ma, Mark Ma, and S. Avestimehr. “Federated Learning for Internet of Things: A Federated Learning Framework for On-device Anomaly Data Detection”. In: ArXiv abs/2106.07976 (2021). 
[120] Yuhui Zhang, Zhiwei Wang, Jiangfeng Cao, Rui Hou, and Dan Meng. “ShuffleFL: gradient-preserving federated learning using trusted execution environment”. In: Proceedings of the 18th ACM International Conference on Computing Frontiers. 2021, pp. 161–168. [121] Zhuo Zhang, Yuanhang Yang, Yong Dai, Lizhen Qu, and Zenglin Xu. “When Federated Learning Meets Pre-trained Language Models’ Parameter-Efficient Tuning Methods”. In: arXiv preprint arXiv:2212.10025 (2022). [122] L. Zhao, S. Hu, Q. Wang, J. Jiang, S. Chao, X. Luo, and P. Hu. “Shielding Collaborative Learning: Mitigating Poisoning Attacks through Client-Side Detection”. In: IEEE Transactions on Dependable and Secure Computing (2020), pp. 1–1. doi: 10.1109/TDSC.2020.2986205. [123] Yizhou Zhao and Hua Sun. “Information theoretic secure aggregation with user dropouts”. In: 2021 IEEE International Symposium on Information Theory (ISIT). IEEE. 2021, pp. 1124–1129. 197 [124] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. “Federated Learning with Non-IID Data”. In: preprint arXiv:1806.00582 (2018). [125] Han Zhou, Xingchen Wan, Ivan Vulić, and Anna Korhonen. “AutoPEFT: Automatic Configuration Search for Parameter-Efficient Fine-Tuning”. In: arXiv preprint arXiv:2301.12132 (2023). [126] Ligeng Zhu, Zhijian Liu, and Song Han. “Deep Leakage from Gradients”. In: Advances in Neural Information Processing Systems. Vol. 32. 2019. [127] Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. “The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects”. In: International Conference on Machine Learning. PMLR. 2019, pp. 7654–7663. 198
Abstract
Machine learning (ML) has become integral to modern intelligent systems, driving advancements in fields like computer vision and natural language processing. Traditionally, ML model training requires centralized data, raising privacy concerns and facing regulatory challenges. Federated Learning (FL) offers an alternative by training models on decentralized data across edge devices. However, FL introduces challenges in privacy, robustness, and resource efficiency.
This thesis addresses these challenges with innovative solutions. First, it addresses the privacy limitations of Secure Aggregation (SA), a protocol in FL that protects user data from gradient inversion attacks. The thesis provides theoretical bounds on the potential data leakage from aggregated models. It also enhances the algorithmic framework of SA by introducing HeteroSAg, a novel approach that simultaneously achieves privacy and robustness. HeteroSAg supports heterogeneous quantization, thereby improving privacy, robustness, and efficiency in Federated Learning.
Second, the thesis tackles the computational, memory, and communication constraints of edge devices in FL, particularly when training large models. By exploring parameter-efficient fine-tuning (PEFT) methods, it introduces novel LoRA variants that use data-driven initialization to reduce costs while maintaining model performance, even under heterogeneous data distributions.
Finally, the thesis addresses robustness against Byzantine faults in server-less FL environments. It introduces Basil, an algorithm that ensures robust training by filtering out malicious updates through a memory-assisted, performance-based approach.