Coding Centric Approaches for Efficient, Scalable, and Privacy-Preserving Machine Learning in Large-scale Distributed Systems

by

Jinhyun So

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2022

Copyright 2022 Jinhyun So

Acknowledgements

First and foremost, I would like to express my deepest respect and gratitude to my advisor, Prof. Salman Avestimehr. Prof. Avestimehr is a perfect example for me as a supervisor, as a researcher, and as a person. As a supervisor, he always emphasized working on the "good" problems, from which I learned how to identify relevant problems, define the big questions we want to answer, find solutions, and present impactful results. With his outstanding vision, knowledge, and patience, he spent hours every single week discussing technical details and guiding me toward good research directions. As a researcher, he demonstrated by his own example that it is essential to stay precise and rigorous. He also always emphasized the importance of presenting results to other researchers, from which I learned that communicating results is as important as obtaining them. As a person, he always cared about and understood my personal situation, and encouraged me to keep the big picture in mind, not only as a good researcher but also as a good person. I also thank Prof. Avestimehr again for all the invaluable advice and support he provided for my future career after the PhD.

I would like to thank the members of my qualifying exam and dissertation committees, Prof. Murali Annavaram, Prof. Mahdi Soltanolkotabi, Prof. Leana Golubchik, and Prof. Meisam Razaviyayn, whose insightful feedback has helped to significantly improve the quality of this dissertation.
It has been an amazing experience to work with the brilliant group members at USC, including Songze Li, Qian Yu, Saurav Prakash, Chien-Sheng Yang, Ramy Ali, Chaoyang He, Ahmed Elkordy, Julien Niu, Sunwoo Lee, Tuo Zhang, Emir Ceyani, Yayha Essa, and Baturalp Buyukates, from whom I learned how to collaborate with other researchers and how to create synergy. I would also like to thank the staff of our department, Susan Wiedem, Gerrienlyn Ramos, and Corine Wong, who provided me great help and support during my years at USC.

Last but not least, I would like to express my deepest thanks to my beloved family: my wife Soohyun Lee, my parents Hongseok So and Insook Lee, and my brother Yonghwan So, for always caring for, understanding, and supporting me. This dissertation is dedicated to them.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: CodedPrivateML: A Fast and Privacy-Preserving Framework for Distributed Machine Learning
  2.1 Introduction
    2.1.1 Other related works
  2.2 System Model
  2.3 The CodedPrivateML Protocol
    2.3.1 Quantization
    2.3.2 Encoding the Dataset and the Model
    2.3.3 Polynomial Approximation and Gradient Computation
    2.3.4 Decoding the Gradient and Model Update
  2.4 Theoretical Results
    2.4.1 Complexity Analysis
  2.5 Experiments
  2.6 Conclusion and Discussion
Chapter 3: A Scalable Approach for Privacy-Preserving Collaborative Machine Learning
  3.1 Introduction
  3.2 Problem Setting
  3.3 The COPML Framework
  3.4 Convergence and Privacy Guarantees
  3.5 Experiments
    3.5.1 Experiment setup
    3.5.2 Performance evaluation
    3.5.3 Complexity Analysis
  3.6 Conclusions
Chapter 4: Turbo-Aggregate: Breaking the Quadratic Aggregation Barrier in Secure Federated Learning
  4.1 Introduction
  4.2 Related Work
  4.3 System Model
    4.3.1 Basic Federated Learning Model
    4.3.2 Secure Aggregation Protocol for Federated Learning and Key Parameters
    4.3.3 State-of-the-art for Secure Aggregation
  4.4 The Turbo-Aggregate Protocol
    4.4.1 Multi-group circular aggregation
    4.4.2 Masking with additive secret sharing
    4.4.3 Adding redundancies to recover the data of dropped or delayed users
    4.4.4 Final aggregation and the overall Turbo-Aggregate protocol
  4.5 Illustrative Example
  4.6 Theoretical Guarantees of Turbo-Aggregate
    4.6.1 Generalized Turbo-Aggregate
  4.7 Experiments
    4.7.1 Experiment setup
    4.7.2 Performance evaluation
    4.7.3 Impact of bandwidth and stragglers
    4.7.4 Additional experiments with FedAvg
  4.8 Conclusion
Chapter 5: Byzantine-Resilient Secure Federated Learning
  5.1 Introduction
    5.1.1 Related Work
  5.2 Background
  5.3 Problem Formulation
  5.4 The BREA Framework
    5.4.1 Stochastic Quantization
    5.4.2 Verifiable Secret Sharing of the User Models
    5.4.3 Secure Distance Computation
    5.4.4 User Selection at the Server
    5.4.5 Secure Model Aggregation
  5.5 Theoretical Analysis
  5.6 Experiments
  5.7 Conclusion
Chapter 6: LightSecAgg: a Lightweight and Versatile Design for Secure Aggregation in Federated Learning
  6.1 Introduction
  6.2 Problem Setting
  6.3 Overview of Baseline Protocols: SecAgg and SecAgg+
  6.4 LightSecAgg Protocol
    6.4.1 General Description of LightSecAgg for Synchronous FL
    6.4.2 Extension to Asynchronous FL
  6.5 Theoretical Analysis
    6.5.1 Theoretical Guarantees
    6.5.2 Complexity Analysis of LightSecAgg
  6.6 System Design and Optimization
  6.7 Experimental Results
    6.7.1 Setup
    6.7.2 Overall Evaluation and Performance Analysis
    6.7.3 Performance Breakdown
    6.7.4 Convergence Performance in Asynchronous FL
  6.8 Conclusion and Future Works
Bibliography
Appendices
Appendix A: Appendix of Chapter 2
  A.1 Proof of Lemma 2.1
  A.2 Proof of Lemma 2.2
  A.3 Proof of Theorem 2.1
  A.4 Details of the Multi-Party Computation (MPC) Implementation
Appendix B: Appendix of Chapter 3
  B.1 Details of the Quantization Phase
  B.2 Proof of Theorem 3.1
  B.3 Details of the Multi-Party Computation (MPC) Implementation
    B.3.1 Details of the Optimized Baseline Protocols
  B.4 Algorithms
Appendix C: Appendix of Chapter 5
  C.1 Pseudo Code of LightSecAgg
  C.2 Proof of Theorem 6.1
    C.2.1 Discussion
  C.3 Experimental Details
  C.4 Proof of Lemma C.1
  C.5 Application of LightSecAgg to Asynchronous FL
    C.5.1 General Description of Asynchronous FL
    C.5.2 Incompatibility of SecAgg and SecAgg+ with Asynchronous FL
    C.5.3 Asynchronous LightSecAgg
    C.5.4 Offline Encoding and Sharing of Local Masks
    C.5.5 Training, Quantizing, Masking, and Uploading of Local Updates
    C.5.6 One-shot Aggregate-update Recovery and Global Model Update
    C.5.7 Convergence Analysis of Asynchronous LightSecAgg
    C.5.8 Experiments for Asynchronous LightSecAgg

List of Tables

2.1 Complexity summary of CPML.
2.2 (CIFAR-10) Breakdown of total runtime for N = 50.
3.1 Breakdown of the running time with N = 50 clients.
3.2 Complexity summary of COPML.
4.1 Summary of simulation parameters.
4.2 Breakdown of the running time (ms) of Turbo-Aggregate with N = 200 users.
4.3 Breakdown of the running time (ms) of Turbo-Aggregate+ with N = 200 users.
4.4 Breakdown of the running time (ms) of the benchmark protocol [17] with N = 200 users.
6.1 Complexity comparison between SecAgg, SecAgg+, and LightSecAgg. Here N is the total number of users, d is the model size, and s is the length of the secret keys used as seeds for the PRG (s ≪ d). In the table, U stands for User and S stands for Server.
6.2 Summary of the four implemented machine learning tasks and the performance gain of LightSecAgg with respect to SecAgg and SecAgg+. All learning tasks are image classification. MNIST, FEMNIST, and CIFAR-10 are low-resolution datasets, while the images in GLD-23K are high resolution and hence require much longer training time; LR and CNN are shallow models, whereas MobileNetV3 and EfficientNet-B0 are much larger models, tailored for efficient edge training and inference.
6.3 Performance gain in different bandwidth settings.
6.4 Breakdown of the running time (sec) of LightSecAgg and the state-of-the-art protocols (SecAgg [17] and SecAgg+ [8]) to train CNN [98] on the FEMNIST dataset [23] with N = 200 users, for dropout rates p = 10%, 30%, 50%.
C.1 Complexity comparison between SecAgg [17], SecAgg+ [8], and LightSecAgg. Here N is the total number of users. The parameters d and s respectively represent the model size and the length of the secret keys used as seeds for the PRG, where s ≪ d.
LightSecAgg and SecAgg provide worst-case privacy guarantee T and dropout-resiliency guarantee D for any T and D as long as T + D < N. SecAgg+ provides probabilistic privacy guarantee T and dropout-resiliency guarantee D. LightSecAgg selects three design parameters T, D, and U.

∇C(w) = (1/m) X^⊤ ( s(X × w) − y ). Accordingly, model parameters are updated as

w^{(t+1)} = w^{(t)} − (η/m) X^⊤ ( s(X × w^{(t)}) − y ),   (2.2)

where w^{(t)} holds the estimated parameters from iteration t, η is the learning rate, and the function s(·) operates element-wise over the vector X × w^{(t)}.

We consider the master-worker distributed computing architecture shown in Figure 2.1, in which the master offloads the computationally-intensive operations to N workers. For the training problem, these operations correspond to the gradient computations in (2.2). In doing so, the master wishes to protect the privacy of the dataset X against any potential collusion between up to T workers, where T is the privacy parameter of the system.

Figure 2.1: The distributed training setup consisting of a master and N worker nodes.

Remark 2.1. Although our presentation is based on logistic regression, CodedPrivateML can also be applied to the simpler linear regression model with minor modifications.

In this work, we consider strong information-theoretic privacy, where any subset of T colluding workers cannot learn any information about the original dataset X. Formally, for every subset of workers 𝒯 ⊆ [N] of size at most T, we require I(X; Z_𝒯) = 0 for any distribution on X, where I denotes the mutual information and Z_𝒯 represents the collection of all the information received by the workers in 𝒯 during training. The distribution of X may be known to the workers. We refer to a protocol that guarantees privacy against T colluding workers as a T-private protocol.
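As a point of reference for the protocol that follows, the plaintext update (2.2) can be sketched in a few lines of NumPy. The synthetic dataset, learning rate, and iteration count below are illustrative choices, not values from the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_step(w, X, y, eta):
    """One plaintext update of (2.2): w <- w - (eta/m) X^T (s(Xw) - y)."""
    m = X.shape[0]
    return w - (eta / m) * X.T @ (sigmoid(X @ w) - y)

# Synthetic, linearly separable data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (sigmoid(X @ true_w) > 0.5).astype(float)

w = np.zeros(3)
for _ in range(500):
    w = gd_step(w, X, y, eta=1.0)

# Training accuracy on this separable synthetic set should be high.
acc = np.mean((sigmoid(X @ w) > 0.5) == y)
```

CodedPrivateML reproduces exactly this iteration, but with the gradient computation offloaded to workers over a finite field.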
In the sequel, we present a novel protocol, CodedPrivateML, to solve (2.1) while preserving the information-theoretic privacy of the dataset against up to T colluding workers.

2.3 The CodedPrivateML Protocol

CodedPrivateML consists of four main components: 1) quantization, 2) encoding, 3) polynomial approximation and gradient computation, and 4) decoding the gradient and model update. Figure 2.2 shows the flowchart of CodedPrivateML. In the first component, the master quantizes the dataset from the real domain to the domain of integers, and then embeds it in a finite field. In the second component, the master encodes the quantized dataset and sends the encoded submatrices to the workers. At each iteration, the master also quantizes and encodes the model parameters. In the third component, given the encoded dataset and model parameters, each worker performs its gradient computation, using a polynomial approximation to substitute for the sigmoid function. In the last component, the master decodes the gradient computations, converts them from the finite field back to the real domain, and updates the model parameters in the real domain. This process is iterated until the model parameters converge. We now provide the details of each component.

Figure 2.2: Flowchart of CodedPrivateML.

2.3.1 Quantization

In order to guarantee information-theoretic privacy, one has to mask the dataset and weights in a finite field* F using uniformly random matrices, so that the added randomness makes each data point appear equally likely. In contrast, the dataset and weights for the training task are defined in the domain of real numbers. Our solution to handling the conversion between the real and finite domains is stochastic quantization.

*We need a finite field instead of a ring because our encoding and decoding schemes, which are based on Lagrange coding and explained in Sections 2.3.2 and 2.3.4, require division (multiplicative inverses), which a ring does not have in general.
Accordingly, in the first component of our system, the master quantizes the dataset and weights from the real domain to the domain of integers, and then embeds them in a field F_p of integers modulo a prime p. The quantized version of the dataset X is denoted by X̄. The quantization of the weight vector w^{(t)}, on the other hand, is represented by a matrix W^{(t)}, where each column holds an independent stochastic quantization of w^{(t)}. This structure will be important for the convergence of the model.

We consider an element-wise lossy quantization scheme for the dataset and weights. For quantizing the dataset X ∈ R^{m×d}, we use a simple deterministic rounding technique:

Round(x) = ⌊x⌋ if x − ⌊x⌋ < 0.5, and ⌊x⌋ + 1 otherwise,   (2.3)

where ⌊x⌋ is the largest integer less than or equal to x. We define the quantized dataset as

X̄ ≜ φ( Round(2^{l_x} · X) ),   (2.4)

where the rounding function from (2.3) is applied element-wise to the elements of the matrix X and l_x is an integer parameter that controls the quantization loss. The function φ : R → F_p is a mapping defined to represent a negative integer in the finite field using a two's complement representation:

φ(x) = x if x ≥ 0, and p + x if x < 0.   (2.5)

Note that the domain of (2.4) is [−(p−1)/2^{l_x+1}, (p−1)/2^{l_x+1}]. To avoid a wrap-around, which may lead to an overflow error, the prime p should be large enough, i.e., p ≥ 2^{l_x+1} max{|X_{i,j}|} + 1. The value of p also depends on the bitwidth of the machine as well as the number of features d. For instance, in our experiments presented in Section 2.5, we select p = 2^{25} − 37 in a 64-bit implementation with the GISETTE dataset, whose number of features is d = 5000. This is the largest prime that avoids an overflow in intermediate multiplications.
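The deterministic rounding (2.3), the embedding φ from (2.5), and the corresponding inverse map (used later in the decoding step) can be sketched as follows. The field size p = 2^13 − 1 and l_x = 3 are toy parameters for illustration only, far smaller than the values used in the thesis.

```python
import math

P = 2**13 - 1  # 8191, a small prime (toy choice, not the thesis value)
LX = 3         # quantization parameter l_x (illustrative)

def round_det(x):
    """Round(x) of (2.3): floor(x) if the fractional part is < 0.5, else floor(x)+1."""
    f = math.floor(x)
    return f if x - f < 0.5 else f + 1

def phi(x, p=P):
    """Embedding (2.5): x if x >= 0, p + x if x < 0 (two's-complement style)."""
    return x % p

def quantize(x, lx=LX, p=P):
    """Element-wise quantization (2.4): phi(Round(2^lx * x))."""
    return phi(round_det((2**lx) * x), p)

def dequantize(q, lx=LX, p=P):
    """Inverse map: lift to the signed representative, then scale back by 2^-lx."""
    signed = q if q < (p - 1) // 2 else q - p
    return signed / 2**lx
```

For example, quantize(1.3) gives Round(10.4) = 10, while quantize(-1.3) maps −10 to p − 10 in the field; dequantizing either value recovers ±1.25, i.e., the original up to the 2^{−l_x} rounding resolution.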
More specifically, we perform a modular reduction after the inner product of vectors, instead of a modular reduction per element-wise product, in order to speed up matrix-matrix multiplication. To avoid an overflow here, p should satisfy d(p − 1)^2 ≤ 2^{64} − 1.

At each iteration t, the master also quantizes the weight vector w^{(t)} from the real domain to the finite field.† This proves to be a challenging task, as it should be performed in a way that ensures the convergence of the model. Our solution is a quantization technique inspired by [160, 159]. Initially, we define a stochastic quantization function:

Q(x; l_w) ≜ φ( Round_stoc(2^{l_w} · x) ),   (2.6)

where l_w is an integer parameter to control the quantization loss. Round_stoc : R → R is a stochastic rounding function:

Round_stoc(x) = ⌊x⌋ with probability 1 − (x − ⌊x⌋), and ⌊x⌋ + 1 with probability x − ⌊x⌋.

The probability of rounding x to ⌊x⌋ is proportional to the proximity of x to ⌊x⌋, so that stochastic rounding is unbiased (i.e., E[Round_stoc(x)] = x). For quantizing the weight vector w^{(t)}, the master creates r independent quantized vectors:

w^{(t),j} ≜ Q_j(w^{(t)}; l_w) ∈ F_p^{d×1} for j ∈ [r],   (2.7)

where the quantization function (2.6) is applied element-wise to the vector w^{(t)} and each Q_j(·;·) denotes an independent realization of (2.6). To avoid a wrap-around, which may lead to an overflow error, the prime p should be large enough, i.e., p ≥ 2^{l_w+1} max{|w_i^{(t)}|} + 1. The number of quantized vectors r is equal to the degree of the polynomial approximation of the sigmoid function, which we describe later in Section 2.3.3. The intuition behind creating r independent quantizations is to ensure that the gradient computations performed using the quantized weights are unbiased estimators of the true gradients. As detailed in Section 2.4, this property is fundamental for the convergence analysis of our model.

†The weight vector w^{(0)} is initialized as a Gaussian random vector with zero mean and unit variance.
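A minimal sketch of the stochastic rounding underlying (2.6), together with an empirical check of its unbiasedness; the sample count and seed below are arbitrary choices for illustration.

```python
import math
import random

def round_stoc(x, rng):
    """Round_stoc (used in 2.6): floor(x) w.p. 1 - (x - floor(x)), else floor(x) + 1.
    Unbiased by construction: E[Round_stoc(x)] = x."""
    f = math.floor(x)
    return f + (1 if rng.random() < x - f else 0)

# Empirical unbiasedness check: the mean of many stochastic roundings of 2.3
# should be close to 2.3, while deterministic rounding would always give 2.
rng = random.Random(42)
samples = [round_stoc(2.3, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Every sample is either 2 or 3, yet their average concentrates around 2.3; this is exactly the property that makes the gradient estimates with quantized weights unbiased.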
The specific values of the parameters l_x and l_w provide a trade-off between the rounding error and the overflow error. In particular, a larger value reduces the rounding error while increasing the chance of an overflow. We denote the quantization of the weight vector w^{(t)} as

W^{(t)} = [w^{(t),1} ··· w^{(t),r}],   (2.8)

by arranging the quantized vectors from (2.7) in matrix form.

2.3.2 Encoding the Dataset and the Model

In the second component, the master partitions the quantized dataset X̄ ∈ F_p^{m×d} into K submatrices and encodes them using Lagrange coding [156]. It then sends to worker i ∈ [N] a coded submatrix X̃_i ∈ F_p^{(m/K)×d}. This encoding enables two salient features of CPML: parallelization and information-theoretic privacy guarantees. First, the parameter K is related to the computation load at each worker (i.e., what fraction of the dataset is processed at each worker), because the size of the encoded dataset is 1/K-th of the size of the original dataset X̄. As we will show later, we can increase the parameter K as N increases, which reduces the computation overhead at each worker and the communication overhead between the master and the workers. This property enables our approach to scale to a significantly larger number of workers than state-of-the-art privacy-preserving machine learning approaches. Second, this encoding ensures that the coded matrices leak no information about the original dataset X even if T workers collude, as will be shown in Section 2.4.

In addition, the master has to ensure that the weight estimations sent to the workers at each iteration do not leak information about the dataset. This is because the weights updated via (2.2) carry information about the whole training set, and sending them directly to the workers may breach privacy. To prevent this, at iteration t, the master also quantizes the current weight vector w^{(t)} to the finite field and encodes it, again using Lagrange coding. We now state the details of our second component.
The master first partitions the quantized dataset X̄ into K submatrices X̄ = [X̄_1^⊤ ... X̄_K^⊤]^⊤, where X̄_i ∈ F_p^{(m/K)×d} for i ∈ [K]. We assume that m is divisible by K. Next, the master selects K + T distinct elements β_1, ..., β_{K+T} from F_p and employs Lagrange coding [156] to encode the dataset. To do so, the master forms a polynomial u : F_p → F_p^{(m/K)×d} of degree at most K + T − 1 such that u(β_i) = X̄_i for i ∈ [K], and u(β_i) = R_i for i ∈ {K+1, ..., K+T}, where the R_i's are chosen uniformly at random from F_p^{(m/K)×d} (the role of the R_i's is to mask the dataset and provide privacy against up to T colluding workers). This can be accomplished by letting u be the respective Lagrange interpolation polynomial,

u(z) ≜ Σ_{j∈[K]} X̄_j · Π_{k∈[K+T]\{j}} (z − β_k)/(β_j − β_k) + Σ_{j=K+1}^{K+T} R_j · Π_{k∈[K+T]\{j}} (z − β_k)/(β_j − β_k).   (2.9)

The master then selects N distinct elements {α_i}_{i∈[N]} from F_p such that {α_i}_{i∈[N]} ∩ {β_j}_{j∈[K]} = ∅, and encodes the dataset by letting X̃_i = u(α_i) for i ∈ [N]. By defining an encoding matrix U = [u_1 ... u_N] ∈ F_p^{(K+T)×N} whose (i,j)-th element is given by u_{ij} = Π_{ℓ∈[K+T]\{i}} (α_j − β_ℓ)/(β_i − β_ℓ), one can also represent the encoding of the dataset as

X̃_i = u(α_i) = (X̄_1, ..., X̄_K, R_{K+1}, ..., R_{K+T}) · u_i.   (2.10)

At iteration t, the quantized weights W^{(t)} are also encoded using a Lagrange interpolation polynomial,

v(z) ≜ Σ_{j∈[K]} W^{(t)} · Π_{k∈[K+T]\{j}} (z − β_k)/(β_j − β_k) + Σ_{j=K+1}^{K+T} V_j · Π_{k∈[K+T]\{j}} (z − β_k)/(β_j − β_k),   (2.11)

where the V_j for j ∈ {K+1, ..., K+T} are chosen uniformly at random from F_p^{d×r}. The coefficients β_1, ..., β_{K+T} are the same as in (2.9), and we have the property v(β_i) = W^{(t)} for i ∈ [K]. The master then encodes the quantized weights using the same evaluation points {α_i}_{i∈[N]}. Accordingly, the weights are encoded as

W̃_i^{(t)} = v(α_i) = (W^{(t)}, ..., W^{(t)}, V_{K+1}, ..., V_{K+T}) · u_i,   (2.12)

for i ∈ [N], using the encoding matrix U from (2.10). The degrees of the polynomials u(z) and v(z) are both at most K + T − 1.
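The Lagrange encoding of (2.9)-(2.10) can be illustrated end-to-end on a toy field. The parameters (p, K, T, N) and the evaluation points below are illustrative, and scalars stand in for the (m/K) × d matrix blocks used in the thesis.

```python
# Toy Lagrange encoding over F_p with K data blocks and T random masks.
import random

P = 127
K, T, N = 2, 1, 7
rng = random.Random(1)

betas = [1, 2, 3]                       # K + T distinct evaluation points
alphas = [10, 11, 12, 13, 14, 15, 16]   # N points, disjoint from the betas

def lagrange_eval(points, values, z, p=P):
    """Evaluate at z the unique degree-(len(points)-1) interpolant over F_p."""
    total = 0
    for j, (bj, vj) in enumerate(zip(points, values)):
        num, den = 1, 1
        for k, bk in enumerate(points):
            if k != j:
                num = num * (z - bk) % p
                den = den * (bj - bk) % p
        # Division in F_p via Fermat's little theorem: den^(p-2) = den^-1 mod p.
        total = (total + vj * num * pow(den, p - 2, p)) % p
    return total

# Secret dataset split into K = 2 blocks (scalars here for simplicity),
# padded with T = 1 uniformly random mask as in (2.9).
X_blocks = [42, 99]
R = [rng.randrange(P)]
shares = [lagrange_eval(betas, X_blocks + R, a) for a in alphas]  # X~_i = u(alpha_i)

# Sanity check: any K + T shares determine u(z), hence the data blocks.
recovered = [lagrange_eval(alphas[:K + T], shares[:K + T], b) for b in betas[:K]]
```

Because u(z) has degree at most K + T − 1 = 2, any 3 shares pin it down; conversely, any T = 1 share alone is a one-time-pad-style mix with the uniformly random R, which is the source of the T-privacy guarantee.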
2.3.3 Polynomial Approximation and Gradient Computation

Upon receiving the encoded (and quantized) dataset and weights, the workers should proceed with their gradient computations. However, a major challenge is that Lagrange coding is originally designed for polynomial computations, while the gradient computations are not polynomials due to the sigmoid function. Our solution is to use a polynomial approximation of the sigmoid function,

ŝ(z) = Σ_{i=0}^{r} c_i z^i,   (2.13)

where r and c_i denote the degree and coefficients of the polynomial, respectively. The coefficients are obtained by fitting the sigmoid function via least squares estimation. Using this polynomial approximation, we can rewrite (2.2) as

w^{(t+1)} = w^{(t)} − (η/m) X̄^⊤ ( ŝ(X̄ × w^{(t)}) − y ),   (2.14)

where X̄ is the quantized version of X, and ŝ(·) operates element-wise over the vector X̄ × w^{(t)}.

Another challenge is to ensure the convergence of the weights. As we detail in Section 2.4, this necessitates that the gradient estimations be unbiased when using the polynomial approximation with quantized weights. We solve this by utilizing the computation technique from Section 4.1 of [159] with the quantized weights formed in Section 2.3.1. Specifically, given the degree-r polynomial from (2.13) and the r independent quantizations from (2.8), we define a function

s̄(X̄, W^{(t)}) ≜ Σ_{i=0}^{r} c_i Π_{j≤i} (X̄ × w^{(t),j}),   (2.15)

where the product Π_{j≤i} operates element-wise over the vectors X̄ × w^{(t),j} for j ≤ i. Lastly, we note that (2.15) is an unbiased estimator of ŝ(X̄ × w^{(t)}),

E[ s̄(X̄, W^{(t)}) ] = ŝ(X̄ × w^{(t)}),   (2.16)

where ŝ(·) acts element-wise over the vector X̄ × w^{(t)}, and the result follows from the independence of the quantizations. Using (2.15), we rewrite the update equation from (2.14) with quantized weights,

w^{(t+1)} = w^{(t)} − (η/m) X̄^⊤ ( s̄(X̄, W^{(t)}) − y ).
(2.17)

CodedPrivateML guarantees convergence to the optimal loss C(w*), where C is the cross-entropy function defined in (2.1), even though the polynomial approximation substitutes for the sigmoid function in the update equation (2.2); this is demonstrated by our theoretical results in Section 2.4.

Computations are then performed locally at each worker. At each iteration, worker i ∈ [N] locally computes f : F_p^{(m/K)×d} × F_p^{d×r} → F_p^d,

f(X̃_i, W̃_i^{(t)}) = X̃_i^⊤ s̄(X̃_i, W̃_i^{(t)}),   (2.18)

using X̃_i and W̃_i^{(t)}, and sends the result back to the master. This computation is a polynomial function evaluation in finite-field arithmetic, and the degree of f is deg(f) = 2r + 1.

2.3.4 Decoding the Gradient and Model Update

After receiving the evaluation results of (2.18) from a sufficient number of workers, the master decodes {f(X̄_k, W^{(t)})}_{k∈[K]} over the finite field. The minimum number of workers needed for the decoding operation to be successful, which we call the recovery threshold of the protocol, is equal to (2r + 1)(K + T − 1) + 1, as we demonstrate in Section 2.4.

We now proceed to the details of decoding. By construction of the Lagrange polynomials in (2.9) and (2.11), one can define a univariate polynomial h(z) = f(u(z), v(z)) such that

h(β_i) = f(u(β_i), v(β_i)) = f(X̄_i, W^{(t)}) = X̄_i^⊤ s̄(X̄_i, W^{(t)}),   (2.19)

for i ∈ [K]. On the other hand, from (2.18), the computation result from worker i equals

h(α_i) = f(u(α_i), v(α_i)) = f(X̃_i, W̃_i^{(t)}) = X̃_i^⊤ s̄(X̃_i, W̃_i^{(t)}).   (2.20)

The main intuition behind the decoding process is to use the computations from (2.20) as evaluation points h(α_i) to interpolate the polynomial h(z). Specifically, the master can obtain all coefficients of h(z) from (2r + 1)(K + T − 1) + 1 evaluation results, since the degree of the polynomial h(z) is at most (2r + 1)(K + T − 1). After h(z) is recovered, the master can recover (2.19) by computing h(β_i) for i ∈ [K].
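The decoding identities (2.19)-(2.20) and the recovery threshold can be checked numerically in a toy setting. Here f(a, b) = a·b² is a stand-in degree-(2r+1) polynomial with r = 1, scalars replace the matrix blocks, and the field and evaluation points are illustrative choices.

```python
# Toy check of (2.19)-(2.21): with deg(f) = 2r + 1 = 3 and encoding polynomials
# of degree K + T - 1 = 2, h(z) = f(u(z), v(z)) has degree at most 6, so
# (2r + 1)(K + T - 1) + 1 = 7 worker results suffice to interpolate it.
P = 127
K, T, r = 2, 1, 1
betas = [1, 2, 3]
alphas = [10, 11, 12, 13, 14, 15, 16, 17]   # N = 8 workers

def interp_eval(xs, ys, z, p=P):
    """Lagrange-interpolate the points (xs, ys) over F_p and evaluate at z."""
    total = 0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for k, xk in enumerate(xs):
            if k != j:
                num = num * (z - xk) % p
                den = den * (xj - xk) % p
        total = (total + yj * num * pow(den, p - 2, p)) % p
    return total

# Values of u and v at the betas: data/weights in the first K slots, masks after.
u_vals = [5, 9, 77]    # X_1, X_2, random mask R
v_vals = [8, 8, 31]    # W, W, random mask V  (v(beta_i) = W for i in [K])

f = lambda a, b: a * b * b % P               # stand-in for (2.18), degree 2r + 1 = 3
# Worker i evaluates f on its encoded shares u(alpha_i), v(alpha_i), as in (2.20).
results = [f(interp_eval(betas, u_vals, a), interp_eval(betas, v_vals, a))
           for a in alphas]

# Master interpolates h from the first threshold results and reads off h(beta_i).
threshold = (2 * r + 1) * (K + T - 1) + 1    # = 7
decoded = [interp_eval(alphas[:threshold], results[:threshold], b)
           for b in betas[:K]]
expected = [f(u_vals[0], v_vals[0]), f(u_vals[1], v_vals[1])]
```

The decoded values equal f evaluated directly on the uncoded data, confirming that the threshold-many worker responses fully determine h(z) and hence the desired computations (2.19).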
To do so, the master performs polynomial interpolation in the finite field. Upon receiving the local computations f(X̃_i, W̃_i^{(t)}) in (2.20) from at least (2r + 1)(K + T − 1) + 1 workers, the master computes

f(X̄_k, W^{(t)}) = Σ_{i∈I} f(X̃_i, W̃_i^{(t)}) · Π_{j∈I\{i}} (β_k − α_j)/(α_i − α_j),   (2.21)

for k ∈ [K], where I ⊆ [N] denotes the set of the first (2r + 1)(K + T − 1) + 1 workers who send their local computations f(X̃_i, W̃_i^{(t)}) to the master. The master then aggregates the decoded computations f(X̄_k, W^{(t)}) to compute the desired gradient as

Σ_{k=1}^{K} f(X̄_k, W^{(t)}) = Σ_{k=1}^{K} X̄_k^⊤ s̄(X̄_k, W^{(t)}) = X̄^⊤ s̄(X̄, W^{(t)}).   (2.22)

Lastly, the master converts (2.22) from the finite field to the real domain and updates the weights according to (2.17) in the real domain. This conversion is attained by the function

Q_p^{-1}(x; l) = 2^{-l} · φ^{-1}(x),   (2.23)

where we let l = l_x + r(l_x + l_w), and φ^{-1} : F_p → R is defined as

φ^{-1}(x) = x if 0 ≤ x < (p−1)/2, and x − p if (p−1)/2 ≤ x < p.

2.4 Theoretical Results

w^{(t+1)} = w^{(t)} − (η/m) X̄^⊤ ( s̄(X̄, W^{(t)}) − y ).   (2.25)

We first state a lemma which shows that the gradient estimation of CodedPrivateML is unbiased and variance-bounded.

Lemma 2.1. Let p^{(t)} ≜ (1/m) X̄^⊤ ( s̄(X̄, W^{(t)}) − y ) denote the gradient computation using the quantized weights W^{(t)} in CodedPrivateML. Then, we have:

• (Unbiasedness) The vector p^{(t)} is an asymptotically unbiased estimator of the true gradient: E[p^{(t)}] = ∇C(w^{(t)}) + ε(r), with ε(r) → 0 as r → ∞, where r is the degree of the polynomial in (2.13) and the expectation is taken with respect to the quantization errors;

• (Variance bound) E[ ‖p^{(t)} − E[p^{(t)}]‖_2^2 ] ≤ (2^{−2 l_w}/m^2) ‖X̄‖_F^2 ≜ σ^2, where ‖·‖_2 and ‖·‖_F are the l_2 and Frobenius norms, respectively.

Proof. The proof of Lemma 2.1 is presented in Appendix A.1.

We also need the following basic lemma, which describes the L-Lipschitz property of the gradient of the cost function.

Lemma 2.2. The gradient of the cost function from (2.1), evaluated on the quantized dataset X̄, is L-Lipschitz with L ≜ (1/4)‖X̄‖_2^2; that is, ‖∇C(w) − ∇C(w′)‖ ≤ L‖w − w′‖ for all w, w′ ∈ R^d.

Proof.
The proof of Lemma 2.2 is presented in Appendix A.2.

We now state our main result on the theoretical performance guarantees of CodedPrivateML.

Theorem 2.1. Consider the training of a logistic regression model in a distributed system with $N$ workers using CodedPrivateML with the dataset $X = (X_1, \ldots, X_K)$, initial weight vector $w^{(0)}$, and constant step size $\eta = 1/L$ (where $L$ is defined in Lemma 2.2). Then, CodedPrivateML guarantees:

• (Convergence) $\mathbb{E}\big[C\big(\frac{1}{J}\sum_{t=0}^{J} w^{(t)}\big)\big] - C(w^*) \le \frac{\|w^{(0)} - w^*\|^2}{2\eta J} + \eta\sigma^2$ in $J$ iterations, where $\sigma^2$ is given in Lemma 2.1;

• (Privacy) $X$ remains information-theoretically private against any $T$ colluding workers, i.e., $I\big(X;\, \tilde{X}_\mathcal{T}, \{\tilde{W}_\mathcal{T}^{(t)}\}_{t \in [J]}\big) = 0$ for any distribution on $X$ and any set $\mathcal{T} \subset [N]$ with $|\mathcal{T}| \le T$;

for any $N \ge (2r+1)(K+T-1)+1$, where $r$ is the degree of the polynomial from (2.13).

Remark 2.2. Theorem 2.1 reveals an important trade-off between privacy and parallelization in CodedPrivateML. Parameter $K$ reflects the amount of parallelization, since the computation load at each worker node is proportional to $1/K$-th of the dataset. Parameter $T$ reflects the privacy threshold. Theorem 2.1 shows that, in a cluster with $N$ workers, we can achieve any $K$ and $T$ as long as $N \ge (2r+1)(K+T-1)+1$. This condition further implies that, as the number of workers $N$ increases, the parallelization ($K$) and privacy threshold ($T$) of CodedPrivateML can also increase linearly, leading to a scalable solution.

Remark 2.3. There are two terms in the bound on the distance between the loss function and the optimum in the first claim of Theorem 2.1, i.e., $\mathbb{E}\big[C\big(\frac{1}{J}\sum_{t=0}^{J} w^{(t)}\big)\big] - C(w^*) \le \frac{\|w^{(0)} - w^*\|^2}{2\eta J} + \eta\sigma^2$. When we use a constant learning rate $\eta = 1/L$, the first term $\frac{\|w^{(0)} - w^*\|^2}{2\eta J} = \frac{L\|w^{(0)} - w^*\|^2}{2J}$ goes to zero as the number of iterations $J$ increases; hence CodedPrivateML has a convergence rate of $O(1/J)$.
The second term $\eta\sigma^2 = \frac{\sigma^2}{L}$ is a residual error of the training, as it does not go to zero as $J$ increases. By using an adaptive (decreasing) learning rate, this term can be made arbitrarily small.

Remark 2.4. The convergence rate of CodedPrivateML is the same as that of conventional logistic regression. This follows from Theorem 2.1, where the convergence rate of CodedPrivateML is found to be $O(1/J)$, with $J$ the iteration index; this matches the convergence rate of conventional logistic regression, which follows from [20, Section 9.3] and [20, Section 7.1.1].

Remark 2.5. Theorem 2.1 also applies to (simpler) linear regression; the proof follows the same steps.

Proof. The proof of Theorem 2.1 is presented in Appendix A.3.

Table 2.1: Complexity summary of CodedPrivateML.

          Computation                          Communication
Master    O(mdN(K+T)/K + drJN(K+T))            O(mdN/K + drNJ)
Worker    O(md^2/K)                            O(md/K + drJ)

2.4.1 Complexity Analysis

In this section, we analyze the asymptotic complexity of CodedPrivateML with respect to the number of workers $N$, parallelization parameter $K$, privacy parameter $T$, number of samples $m$, number of features $d$, and number of iterations $J$.

Complexity Analysis of the Master Node: The computation cost of the master node can be broken into three parts: 1) encoding the dataset via $\tilde{X}_i = u(\alpha_i)$ from (2.9) for $i \in [N]$; 2) encoding the weight vector via $\tilde{W}_i^{(t)} = v(\alpha_i)$ from (2.11) for $i \in [N]$, $t \in [J]$; and 3) decoding the gradient by recovering $h(\beta_i)$ in (2.19) for $i \in [K]$. For the first part, each encoded dataset $\tilde{X}_i$ ($i \in [N]$) from (2.9) is a weighted sum of $K+T$ matrices, each of size $\frac{m}{K} \times d$. The Lagrange coefficients can be calculated offline, since the sets $\{\alpha_i\}_{i\in[N]}$ and $\{\beta_j\}_{j\in[K]}$ are public. Each encoding requires $O\big(\frac{md(K+T)}{K}\big)$ multiplications and $N$ encodings must be performed, resulting in a total computational cost of $O\big(\frac{mdN(K+T)}{K}\big)$.
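The Lagrange encoding that this cost accounts for can be sketched concretely. The snippet below is a toy illustration with scalar "blocks" and a small prime, not the thesis implementation: $K$ data partitions and $T$ random masks are interpolated through points $\beta_1, \ldots, \beta_{K+T}$, and worker $i$ receives the evaluation $u(\alpha_i)$.

```python
import random

# Toy Lagrange encoding in the spirit of (2.9): K data blocks plus T random
# masks define u(z); each of N workers gets one evaluation u(alpha_i).
p = 8191                              # small illustrative prime
K, T, N = 2, 1, 6
betas = [1, 2, 3]                     # K + T interpolation points
alphas = [10, 11, 12, 13, 14, 15]     # N evaluation points, disjoint from betas
blocks = [123, 456]                   # the K dataset partitions (scalars here)
masks = [random.randrange(p) for _ in range(T)]  # randomness for privacy

def u(z):
    """Degree-(K+T-1) Lagrange polynomial through the blocks and masks."""
    vals = blocks + masks
    out = 0
    for k, b_k in enumerate(betas):
        num, den = 1, 1
        for l, b_l in enumerate(betas):
            if l != k:
                num = num * ((z - b_l) % p) % p
                den = den * ((b_k - b_l) % p) % p
        out = (out + vals[k] * num * pow(den, p - 2, p)) % p
    return out

encoded = [u(a) for a in alphas]          # one encoded share per worker
assert [u(b) for b in betas[:K]] == blocks  # u(beta_k) recovers block k
```

Each encoded share is a weighted sum of the $K+T$ inputs with precomputable Lagrange coefficients, which is exactly why the per-encoding cost above scales as $O\big(\frac{md(K+T)}{K}\big)$ for matrix blocks.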
Decoding the gradient computations from (2.21) can be performed via a weighted sum of $(2r+1)(K+T-1)+1 = O(N)$ vectors, each of size $d$. Each decoding requires $O(dN)$ multiplications and $K$ decoded gradients are required per iteration, resulting in a total computational cost of $O(dJNK)$ over $J$ iterations. The communication cost of the master node to send the encoded dataset $\tilde{X}_i$ and the encoded weight vectors $\tilde{W}_i^{(t)}$ to worker $i \in [N]$ is $O(\frac{mdN}{K})$ and $O(drNJ)$, respectively. The communication cost of the master to receive the local computations $f(\tilde{X}_i, \tilde{W}_i^{(t)})$ from workers $i \in [N]$ for $t \in [J]$ is $O(dJN)$.

Complexity Analysis of the Workers: The computation cost of worker $i$ to compute $\tilde{X}_i^\top \tilde{X}_i$, the dominant part of the local computation $f(\tilde{X}_i, \tilde{w}_i^{(t)})$ in (2.18), is $O(\frac{md^2}{K})$. This corresponds to $O(\frac{1}{K})$-th of the computation cost of conventional logistic regression, which requires the computation of $X^\top s(X \times w^{(t)})$ in (2.2); this is because the sizes of the encoded dataset $\tilde{X}_i$ and the original dataset $X$ are $\frac{m}{K} \times d$ and $m \times d$, respectively. The communication cost of worker $i$ to receive the encoded dataset $\tilde{X}_i$ and the encoded weight vectors $\tilde{W}_i^{(t)}$ for $t \in [J]$ is $O(\frac{md}{K})$ and $O(drJ)$, respectively. The communication cost of worker $i$ to send the local computations $f(\tilde{X}_i, \tilde{W}_i^{(t)})$ to the master for $t \in [J]$ is $O(dJ)$.

We summarize the asymptotic complexity of CodedPrivateML in Table 2.1.

2.5 Experiments

We now experimentally demonstrate the performance of CodedPrivateML compared to conventional MPC baselines. Our focus is on training a logistic regression model for image classification, while the computation load is distributed to multiple machines on the Amazon EC2 cloud platform.

Experiment setup. We train the logistic regression model from (2.1) for binary image classification on the CIFAR-10 [81] and GISETTE [61] datasets to experimentally examine two things: the accuracy of CodedPrivateML and the performance gain in terms of training time.
The sizes of the CIFAR-10 and GISETTE datasets are $(m, d) = (9019, 3073)$‡ and $(6000, 5000)$, respectively. We implement the communication phase using the MPI4Py [36] message passing interface for Python. Computations are performed in a distributed manner on Amazon EC2 clusters using m3.xlarge machine instances.

We then compare CodedPrivateML with two MPC-based benchmarks that we apply to our problem. In particular, we implement two MPC constructions. The first one is based on the well-known BGW protocol [10], whereas the second one is a more recent protocol from [6, 37] that trades off offline calculations for a more efficient implementation.

‡ We select images with the labels "plane" and "car"; the number of such images among the 50000 training samples is 9019. For the number of features, we add a bias term; hence, we have 3072 + 1 = 3073 features.

Our choice of these MPC benchmarks is due to their ability to be applied to a large number of workers. While several more recent works have developed MPC-based training protocols with information-theoretic privacy guarantees, their constructions are limited to three or four parties [105, 144, 104]. For instance, [105] is a two-party protocol that requires two non-colluding workers. Both baselines utilize Shamir's secret sharing scheme [123], where the dataset is secret shared among the $N$ workers. For the (quantized) dataset $X$, this is achieved by creating a random polynomial $P(z) = X + zZ_1 + \ldots + z^T Z_T$, where the $Z_j$ for $j \in [T]$ are i.i.d. uniformly distributed random matrices. This guarantees privacy against $\lfloor \frac{N-1}{2} \rfloor$ colluding workers [10, 6, 37], but requires a computation load at each worker that is as large as processing the whole dataset at a single worker, leading to slow training. Hence, in order to provide a fair comparison with CodedPrivateML, we optimize (speed up) the benchmark protocols by partitioning the users into subgroups of size $2T+1$.
Then, we let each group compute the gradient over a partition of the dataset $X'_i \in \mathbb{F}_p^{\frac{m}{G} \times d}$, where $X = [X_1'^\top \ldots X_G'^\top]^\top$ and $G$ is the number of subgroups. For group $i \in [G]$, each worker receives a share of the partitioned dataset via a random polynomial $P_i(z) = X'_i + zZ_{i1} + \ldots + z^T Z_{iT}$, where the $Z_{ij}$ for $j \in [T]$ and $i \in [G]$ are i.i.d. uniformly distributed random matrices. Workers then proceed with a multiround protocol to compute the sub-gradient. We further incorporate our quantization and approximation techniques in our benchmark implementations, as conventional MPC protocols are also bound to arithmetic operations over a finite field. In our experiments, we set $G = 3$; hence the total amount of data stored at each worker is equal to one third of the size of the dataset $X$, which significantly reduces the total training time of the two benchmarks while providing a privacy threshold of $T = \lfloor \frac{N-3}{6} \rfloor$. The implementation details of the MPC operations are provided in Appendix A.4.

CodedPrivateML parameters. There are several system parameters in CodedPrivateML that should be set. Given that we have a 64-bit implementation, we select the field size to be $p = 2^{25} - 37$, which is the largest 25-bit prime, to avoid overflow in intermediate multiplications. We then optimize the quantization parameters, $l_x$ in (2.4) and $l_w$ in (2.7), by taking into account the trade-off between the rounding and overflow errors. In particular, we choose $(l_x, l_w) = (2, 6)$ and $(2, 5)$ for the CIFAR-10 and GISETTE datasets, respectively. We also need to set the parameter $r$, the degree of the polynomial approximating the sigmoid function. We consider both $r = 1$ and $r = 2$; as shown later, we empirically observe that the degree-one approximation achieves good accuracy. Finally, we need to select $T$ (privacy threshold) and $K$ (amount of parallelization) in CodedPrivateML. As stated in Theorem 2.1, these parameters should satisfy $N \ge (2r+1)(K+T-1)+1$.
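This feasibility condition is easy to check programmatically. The helper below is hypothetical (not part of the thesis code); it encodes the constraint together with the two resource-allocation splits used in the experiments for $r = 1$.

```python
# Hypothetical helper illustrating N >= (2r+1)(K+T-1)+1 and the two
# experimental allocations of workers between parallelization (K) and privacy (T).
def recovery_threshold(K, T, r=1):
    """Minimum number of workers needed to decode, per Theorem 2.1."""
    return (2 * r + 1) * (K + T - 1) + 1

def case1(N):
    """Maximum parallelization (r = 1): K = floor((N-1)/3), T = 1."""
    return (N - 1) // 3, 1

def case2(N):
    """Equal split (r = 1): T = floor((N-3)/6), K = floor((N+2)/3) - T."""
    T = (N - 3) // 6
    return (N + 2) // 3 - T, T

# Both cases satisfy the recovery-threshold constraint for any cluster size.
for N in (10, 25, 50):
    for K, T in (case1(N), case2(N)):
        assert recovery_threshold(K, T, r=1) <= N
```

For instance, with $N = 50$ this gives $(K, T) = (16, 1)$ in the first case and $(10, 7)$ in the second, both with a recovery threshold of 49 workers.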
Given our choice of $r = 1$, we consider two cases:

• Case 1 (maximum parallelization). All resources are allocated to parallelization (faster training) by setting $K = \lfloor \frac{N-1}{3} \rfloor$, $T = 1$.

• Case 2 (equal parallelization and privacy). Resources are split almost equally between parallelization and privacy, i.e., $T = \lfloor \frac{N-3}{6} \rfloor$, $K = \lfloor \frac{N+2}{3} \rfloor - T$.

Training time. Initially, we measure the training time while gradually increasing the number of workers $N$. Our results are demonstrated in Figure 2.3, which compares CodedPrivateML with the [BH08] protocol from [6], which we have found to be the faster of the two benchmarks. In particular, we make the following observations.§

§ For $N = 10$, all schemes have similar performance because the total amount of data stored at each worker is one third of the size of the whole dataset ($K = 3$ for CodedPrivateML and $G = 3$ for the benchmark).

• CodedPrivateML provides substantial speedup over the MPC baselines: up to 4.4× and 5.2× with the CIFAR-10 and GISETTE datasets, respectively, while providing the same privacy threshold as the benchmarks ($T = \lfloor \frac{N-3}{6} \rfloor$ for Case 2). Table 2.2 demonstrates

Figure 2.3: Performance gain of CodedPrivateML over the MPC baseline ([BH08] from [6]). The plot shows the total training time for different numbers of workers $N$: (a) CIFAR-10 (accuracy 81.35% with 50 iterations); (b) GISETTE (accuracy 97.50% with 50 iterations).

Table 2.2: (CIFAR-10) Breakdown of total runtime for $N = 50$.

Protocol                   Enc. time (s)   Comm. time (s)   Comp. time (s)   Total time (s)
MPC using [BGW88]          202.78          31.02            7892.42          8127.07
MPC using [BH08]           201.08          30.25            1326.03          1572.34
CodedPrivateML (Case 1)    59.93           4.76             141.72           229.07
CodedPrivateML (Case 2)    91.53           8.30             235.18           361.08

the breakdown of the total runtime with the CIFAR-10 dataset for $N = 50$ workers.
In this scenario, CodedPrivateML provides significant improvement in all three categories: dataset encoding and secret sharing; communication time between the workers and the master; and computation time. The main reason is that, in the MPC baselines, the size of the data processed at each worker is one third of the original dataset, while in CodedPrivateML it is $1/K$-th of the dataset. This reduces the computational overhead of each worker while computing matrix multiplications, as well as the communication overhead between the master and the workers. We also observe that a higher speedup is achieved as the dimension of the dataset becomes larger (CIFAR-10 vs. GISETTE), suggesting that CodedPrivateML is well-suited for data-intensive training tasks where parallelization is essential.

• The total runtime of CodedPrivateML decreases as the number of workers increases. This is again due to the parallelization gain of CodedPrivateML (i.e., increasing $K$ as $N$ increases). This is not achievable in the conventional MPC baselines, since the size of the data processed at each worker is constant for all $N$.

• Increasing $N$ in CodedPrivateML has two major impacts on the total training time. The first is reducing the computation load per worker, as each new worker can be used to increase the parameter $K$; the amount of work done by each worker scales as $1/K$. The second is that increasing the number of workers increases the encoding time at the master node. Hence, the gain from increasing the number of workers beyond a certain point may be minimal and the system may saturate. In those cases, increasing the number of workers cannot further reduce the training time, as the computation becomes dominated by the encoding overhead.

• CodedPrivateML provides up to 22.5× speedup over the BGW protocol [10], as shown in Table 2.2 for the CIFAR-10 dataset with $N = 50$ workers.
This is due to the fact that BGW requires additional communication between the workers to execute a degree-reduction phase for every multiplication operation.

Accuracy. We also examine the accuracy and convergence of CodedPrivateML. Figure 2.4(a) illustrates the test accuracy of the binary classification problem between plane and car images for the CIFAR-10 dataset. With 50 iterations, the accuracies of CodedPrivateML with degree-one polynomial approximation and of conventional logistic regression are 81.35% and 81.75%, respectively. Figure 2.4(b) shows the test accuracy for binary classification between digits 4 and 9 for the GISETTE dataset. With 50 iterations, CodedPrivateML with degree-one polynomial approximation and conventional logistic regression achieve the same accuracy of 97.5%. Hence, CodedPrivateML has comparable accuracy to conventional logistic regression while being privacy-preserving.

Figure 2.5 presents the cross entropy loss for CodedPrivateML versus the conventional logistic regression model for the GISETTE dataset. The latter setup uses the sigmoid function without polynomial approximation; in addition, no quantization is applied to the dataset or the weight vectors. We observe that CodedPrivateML converges at a rate comparable to conventional logistic regression, while being privacy-preserving.

2.6 Conclusion and Discussion

In this chapter, we considered a distributed training scenario in which a data-owner wants to train a logistic regression model by off-loading the computationally intensive gradient computations to multiple workers, while preserving the privacy of the dataset. We proposed a privacy-preserving training framework, CodedPrivateML, that distributes the computation load effectively across multiple workers, and reduces the per-worker computation load as more and more workers become available.
We demonstrated the theoretical convergence guarantees and the fundamental trade-offs of our framework in terms of the number of workers, privacy protection, and scalability. Our experimental results demonstrate significant speedups in training time compared to conventional baseline protocols.

This work focuses on a logistic regression model, mainly with the goal of demonstrating how CodedPrivateML can be utilized to scale and speed up logistic regression training under privacy and convergence guarantees, which is a first step towards more complex models. To the best of our knowledge, even for this setup, no other system has been able to efficiently scale beyond 3-4 workers while achieving information-theoretic privacy. Our work is the first privacy-preserving machine learning approach that reduces the communication and computation load per worker as the number of workers increases, which we hope will open up further research. Future directions include extending CodedPrivateML to deeper neural networks by leveraging an MPC-friendly (i.e., polynomial) activation function, such as the one proposed in [105].

In this chapter, in order to provide information-theoretic privacy, we utilize quantization to convert the dataset and model to the finite field $\mathbb{F}_p$. Doing so has two inherent challenges: 1) determining a proper value for $p$, and 2) potential performance degradation caused by quantization or overflow errors. This has inspired a new line of work, such as analog coded computing [135], which uses floating-point numbers instead of a fixed-point finite-field representation and provides a fundamental trade-off between the accuracy and the privacy level. Leveraging such techniques to address these challenges is another interesting future direction.

(a) CIFAR-10 dataset, binary classification between car and plane images (using 9019 samples for the training set and 2000 samples for the test set).
(b) GISETTE dataset, binary classification between the images of digits 4 and 9 (using 6000 samples for the training set and 1000 samples for the test set).

Figure 2.4: Comparison of the accuracy of CodedPrivateML (demonstrated for Case 2 and N = 50 workers) vs. conventional logistic regression using the sigmoid function without quantization.

Figure 2.5: Convergence of CodedPrivateML (demonstrated for Case 2 and N = 50 workers) vs. conventional logistic regression (using the sigmoid function without polynomial approximation or quantization).

Chapter 3

A Scalable Approach for Privacy-Preserving Collaborative Machine Learning

3.1 Introduction

Machine learning applications can achieve significant performance gains by training on large volumes of data. In many applications, the training data is distributed across multiple data-owners, such as patient records at multiple medical institutions, and furthermore contains sensitive information, e.g., genetic information, financial transactions, and geolocation information. Such settings give rise to the following key problem, which is the focus of this chapter: How can multiple data-owners jointly train a machine learning model while keeping their individual datasets private from the other parties?

More specifically, we consider a distributed learning scenario in which $N$ data-owners (clients) wish to train a logistic regression model jointly without revealing information about their individual datasets to the other parties, even if up to $T$ out of the $N$ clients collude. Our focus is on the semi-honest adversary setup, where the corrupted parties follow the protocol but may leak information in an attempt to learn the training dataset. To address this challenge, we propose a novel framework, COPML∗, that enables fast and privacy-preserving training by leveraging information and coding theory principles.
COPML has three salient features:

• it speeds up the training time significantly, by distributing the computation load effectively across a large number of parties;

• it advances the state-of-the-art privacy-preserving training setups by scaling to a large number of parties, as it can distribute the computation load effectively as more parties are added to the system;

• it utilizes coding theory principles to secret share the dataset and model parameters, which can significantly reduce the communication overhead and the complexity of distributed training.

At a high level, COPML can be described as follows. Initially, the clients secret share their individual datasets with the other parties, after which they carry out a secure multi-party computing (MPC) protocol to encode the dataset. This encoding operation transforms the dataset into a coded form that enables faster training and simultaneously guarantees privacy (in an information-theoretic sense). Training is performed over the encoded data via gradient descent. The parties perform the computations over the encoded data as if they were computing over the uncoded dataset; that is, the structure of the computations is the same whether one computes over the uncoded or the encoded dataset. At the end of training, each client should only learn the final model, and no information should be leaked (in an information-theoretic sense) about the individual datasets or the intermediate model parameters, beyond the final model.

We characterize the theoretical performance guarantees of COPML in terms of convergence, scalability, and privacy protection. Our analysis identifies a trade-off between privacy and parallelization: each additional client can be utilized either for more privacy, by protecting against a larger number of collusions $T$, or for more parallelization, by reducing the computation load at each client.

∗ COPML stands for collaborative privacy-preserving machine learning.
Furthermore, we empirically demonstrate the performance of COPML by comparing it with cryptographic benchmarks based on secure multi-party computing (MPC) [154, 10, 6, 37], which can also be applied to enable privacy-preserving machine learning tasks (e.g., see [110, 53, 105, 95, 35, 28, 143, 104]). Given our focus on information-theoretic privacy, the most relevant MPC-based schemes for empirical comparison are the protocols from [10] and [6, 37] based on Shamir's secret sharing [123]. While several more recent works have considered MPC-based learning setups with information-theoretic privacy [143, 104], their constructions are limited to three or four parties. We run extensive experiments over the Amazon EC2 cloud platform to empirically demonstrate the performance of COPML. We train a logistic regression model for image classification over the CIFAR-10 [81] and GISETTE [61] datasets. The training computations are distributed to up to $N = 50$ parties. We demonstrate that COPML can provide significant speedup in training time over the state-of-the-art MPC baseline (up to 16.4×), while providing comparable accuracy to conventional logistic regression. This is primarily due to the parallelization gain provided by our system, which can distribute the workload effectively across many parties.

Other related works. Beyond MPC-based setups, one can consider two notable approaches. The first is homomorphic encryption (HE) [56], which enables computations on encrypted data and has been applied to privacy-preserving machine learning [58, 70, 59, 158, 86, 78, 146, 63]. The privacy protection of HE depends on the size of the encrypted data, and computing in the encrypted domain is computationally intensive. The second approach is differential privacy (DP), which is a noisy release mechanism that protects the privacy of personally identifiable information.
The main application of DP in machine learning is when the model is to be released publicly after training, so that individual data points cannot be backtracked from the released model [27, 125, 1, 112, 102, 115, 73]. On the other hand, our focus is on ensuring privacy during training, while preserving the accuracy of the model.

3.2 Problem Setting

We consider a collaborative learning scenario in which the training dataset is distributed across $N$ clients. Client $j \in [N]$ holds an individual dataset denoted by a matrix $X_j \in \mathbb{R}^{m_j \times d}$ consisting of $m_j$ data points with $d$ features, and the corresponding labels are given by a vector $y_j \in \{0,1\}^{m_j}$. The overall dataset is denoted by $X = [X_1^\top, \ldots, X_N^\top]^\top$, consisting of $m \triangleq \sum_{j \in [N]} m_j$ data points with $d$ features, with corresponding labels $y = [y_1^\top, \ldots, y_N^\top]^\top$; it consists of $N$ individual datasets, each belonging to a different client. The clients wish to jointly train a logistic regression model $w$ over the training set $X$ with labels $y$, by minimizing the cross entropy loss function

$$C(w) = \frac{1}{m} \sum_{i=1}^{m} \big( -y_i \log \hat{y}_i - (1 - y_i) \log(1 - \hat{y}_i) \big), \qquad (3.1)$$

where $\hat{y}_i = g(x_i \cdot w) \in (0,1)$ is the probability of label $i$ being equal to 1, $x_i$ is the $i$-th row of matrix $X$, and $g(\cdot)$ denotes the sigmoid function $g(z) = 1/(1 + e^{-z})$. The training is performed through gradient descent, by updating the model parameters in the opposite direction of the gradient,

$$w^{(t+1)} = w^{(t)} - \frac{\eta}{m} X^\top \big( g(X \times w^{(t)}) - y \big), \qquad (3.2)$$

where $\nabla C(w) = \frac{1}{m} X^\top (g(X \times w) - y)$ is the gradient for (3.1), $w^{(t)}$ holds the estimated parameters from iteration $t$, $\eta$ is the learning rate, and the function $g(\cdot)$ acts element-wise over the vector $X \times w^{(t)}$.

During training, the clients wish to protect the privacy of their individual datasets from the other clients, even if up to $T$ of them collude, where $T$ is the privacy parameter of the system. There is no trusted party who can collect the datasets in the clear and perform the training.
Hence, the training protocol should preserve the privacy of the individual datasets against any collusion of up to $T$ adversarial clients. More specifically, this condition states that the adversarial clients should not learn any information about the datasets of the benign clients beyond what can already be inferred from the adversaries' own datasets.

To do so, client $j \in [N]$ initially secret shares its individual dataset $X_j$ and $y_j$ with the other parties. Next, the clients carry out a secure MPC protocol to encode the dataset by using the received secret shares. In this phase, the dataset $X$ is first partitioned into $K$ submatrices $X = [X_1^\top, \cdots, X_K^\top]^\top$ for some $K \in \mathbb{N}$. Parameter $K$ characterizes the computation load at each client. Specifically, our system ensures that the computation load (in terms of gradient computations) at each client is equal to processing only $(1/K)$-th of the entire dataset $X$. The clients then encode the dataset by combining the $K$ submatrices together with some randomness to preserve privacy. At the end of this phase, client $i \in [N]$ learns an encoded dataset $\tilde{X}_i$, whose size is equal to $(1/K)$-th of the dataset $X$. This process is only performed once for the dataset $X$.

Figure 3.1: The multi-client distributed training setup with $N$ clients. Client $j \in [N]$ holds a dataset $X_j$ with labels $y_j$. At the beginning of training, client $j$ secret shares $X_j$ and $y_j$ to guarantee their information-theoretic privacy against any collusion of up to $T$ clients. The secret shares of $X_j$ and $y_j$ assigned from client $j$ to client $i$ are represented by $[X_j]_i$ and $[y_j]_i$, respectively.

At each iteration of training, the clients also encode the current estimate of the model parameters $w^{(t)}$ using a secure MPC protocol, after which client $i \in [N]$ obtains the encoded model $\tilde{w}_i^{(t)}$.
Client $i \in [N]$ then computes a local gradient $\tilde{X}_i^\top g(\tilde{X}_i \times \tilde{w}_i^{(t)})$ over the encoded dataset $\tilde{X}_i$ and the encoded model $\tilde{w}_i^{(t)}$. After this step, the clients carry out another secure MPC protocol to decode the gradient $X^\top g(X \times w^{(t)})$ and update the model according to (3.2). As the decoding and model updates are performed using a secure MPC protocol, the clients do not learn any information about the actual gradients or the updated model. In particular, client $i \in [N]$ only learns a secret share of the updated model, denoted by $[w^{(t+1)}]_i$. Using the secret shares $[w^{(t+1)}]_i$, clients $i \in [N]$ encode the model $w^{(t+1)}$ for the next iteration, after which client $i$ learns the encoded model $\tilde{w}_i^{(t+1)}$. Figure 3.1 demonstrates our system architecture.

3.3 The COPML Framework

COPML consists of four main phases: quantization; encoding and secret sharing; polynomial approximation; and decoding and model update, as demonstrated in Figure 3.2. In the first phase, quantization, each client converts its own dataset from the real domain to a finite field. In the second phase, the clients create secret shares of their quantized datasets and carry out a secure MPC protocol to encode the datasets. At each iteration, the clients also encode and create secret shares of the model parameters. In the third phase, the clients perform local gradient computations over the encoded datasets and encoded model parameters, approximating the sigmoid function with a polynomial. Then, in the last phase, the clients decode the local computations and update the model parameters using a secure MPC protocol. This process is repeated until the convergence of the model parameters.

Phase 1: Quantization. Computations involving secure MPC protocols are bound to finite field operations, which requires the representation of real-valued data points in a finite field $\mathbb{F}$.
To do so, each client initially quantizes its dataset from the real domain to the domain of integers, and then embeds it in a field $\mathbb{F}_p$ of integers modulo a prime $p$. The parameter $p$ is selected to be sufficiently large to avoid wrap-around in computations; for example, in a 64-bit implementation with the CIFAR-10 dataset, we select $p = 2^{26} - 5$. The details of the quantization phase are provided in Appendix B.1.

Figure 3.2: Flowchart of COPML.

Phase 2: Encoding and secret sharing. In this phase, client $j \in [N]$ creates a secret share of its quantized dataset $X_j$ designated for each client $i \in [N]$ (including client $j$ itself). The secret shares are constructed via Shamir's secret sharing with threshold $T$ [123], to protect the privacy of the individual datasets against any collusion of up to $T$ clients. To do so, client $j$ creates a random polynomial $h_j(z) = X_j + zR_{j1} + \ldots + z^T R_{jT}$, where the $R_{ji}$ for $i \in [T]$ are i.i.d. uniformly distributed random matrices, and selects $N$ distinct evaluation points $\lambda_1, \ldots, \lambda_N$ from $\mathbb{F}_p$. Then, client $j$ sends client $i \in [N]$ a secret share $[X_j]_i \triangleq h_j(\lambda_i)$ of its dataset $X_j$. Client $j$ also sends a secret share of its labels $y_j$ to client $i \in [N]$, denoted by $[y_j]_i$. Finally, the model is initialized randomly within a secure MPC protocol between the clients, at the end of which client $i \in [N]$ obtains a secret share $[w^{(0)}]_i$ of the initial model $w^{(0)}$.

After obtaining the secret shares $[X_j]_i$ for $j \in [N]$, clients $i \in [N]$ encode the dataset using a secure MPC protocol and transform it into a coded form, which speeds up training by distributing the computation load of the gradient evaluations across the clients. Our encoding strategy utilizes Lagrange coding from [156]†, which has been applied to other problems such as privacy-preserving offloading of a training task [130] and secure federated learning [129]. However, we encode (and later decode) the secret shares of the datasets and not their true values.
Therefore, the clients do not learn any information about the true value of the dataset $X$ during the encoding-decoding process.

† The encoding of Lagrange coded computing is the same as packed secret sharing [51].

The individual steps of the encoding process are as follows. Initially, the dataset $X$ is partitioned into $K$ submatrices $X = [X_1^\top, \ldots, X_K^\top]^\top$, where $X_k \in \mathbb{F}_p^{\frac{m}{K} \times d}$ for $k \in [K]$. To do so, client $i \in [N]$ locally concatenates $[X_j]_i$ for $j \in [N]$ and partitions it into $K$ parts, $[X_k]_i$ for $k \in [K]$. Since this operation is done over the secret shares, the clients do not learn any information about the original dataset $X$. Parameter $K$ quantifies the computation load at each client, as will be discussed in Section 3.4. The clients agree on $K+T$ distinct elements $\{\beta_k\}_{k \in [K+T]}$ and $N$ distinct elements $\{\alpha_i\}_{i \in [N]}$ from $\mathbb{F}_p$ such that $\{\alpha_i\}_{i \in [N]} \cap \{\beta_k\}_{k \in [K+T]} = \emptyset$. Client $i \in [N]$ then encodes the dataset using a Lagrange interpolation polynomial $u: \mathbb{F}_p \to \mathbb{F}_p^{\frac{m}{K} \times d}$ with degree at most $K+T-1$,

$$[u(z)]_i \triangleq \sum_{k \in [K]} [X_k]_i \cdot \prod_{l \in [K+T] \setminus \{k\}} \frac{z - \beta_l}{\beta_k - \beta_l} + \sum_{k=K+1}^{K+T} [Z_k]_i \cdot \prod_{l \in [K+T] \setminus \{k\}} \frac{z - \beta_l}{\beta_k - \beta_l}, \qquad (3.3)$$

where $[u(\beta_k)]_i = [X_k]_i$ for $k \in [K]$ and $i \in [N]$. The matrices $Z_k$ are generated uniformly at random‡ from $\mathbb{F}_p^{\frac{m}{K} \times d}$, and $[Z_k]_i$ is the secret share of $Z_k$ at client $i$. Client $i \in [N]$ then computes and sends $[\tilde{X}_j]_i \triangleq [u(\alpha_j)]_i$ to client $j \in [N]$. Upon receiving $\{[\tilde{X}_j]_i\}_{i \in [N]}$, client $j \in [N]$ can recover the encoded matrix $\tilde{X}_j$.§ The role of the $Z_k$'s is to mask the dataset so that the encoded matrices $\tilde{X}_j$ reveal no information about the dataset $X$, even if up to $T$ clients collude, as detailed in Section 3.4.

Using the secret shares $[X_j]_i$ and $[y_j]_i$, clients $i \in [N]$ also compute $X^\top y = \sum_{j \in [N]} X_j^\top y_j$ using a secure multiplication protocol (see Appendix B.3 for details). At the end of this step, the clients learn a secret share of $X^\top y$, which we denote by $[X^\top y]_i$ for client $i \in [N]$.
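The Shamir sharing that underlies Phase 2 can be sketched in a few lines. The snippet below is a toy illustration (scalar secret, small prime, fixed evaluation points, all illustrative assumptions): a secret is hidden in the constant term of a random degree-$T$ polynomial, and any $T+1$ shares reconstruct it while any $T$ shares reveal nothing.

```python
import random

# Minimal Shamir sketch: h_j(z) = X_j + z R_1 + ... + z^T R_T over F_p,
# share i is h_j(lambda_i); the secret is recovered by interpolating at z = 0.
p = 8191                 # small illustrative prime
T = 2                    # privacy threshold: any T shares are uniform noise
secret = 1234
coeffs = [secret] + [random.randrange(p) for _ in range(T)]
lambdas = [1, 2, 3, 4, 5]  # evaluation points for N = 5 clients
shares = [sum(c * pow(l, i, p) for i, c in enumerate(coeffs)) % p
          for l in lambdas]

def reconstruct(points, values, p):
    """Lagrange interpolation at z = 0 over F_p (needs T+1 points)."""
    s = 0
    for i, (x_i, y_i) in enumerate(zip(points, values)):
        num, den = 1, 1
        for j, x_j in enumerate(points):
            if j != i:
                num = num * (-x_j % p) % p
                den = den * ((x_i - x_j) % p) % p
        s = (s + y_i * num * pow(den, p - 2, p)) % p
    return s

assert reconstruct(lambdas[:T + 1], shares[:T + 1], p) == secret
```

In COPML the same construction is applied entry-wise to the matrices $X_j$, and the Lagrange encoding in (3.3) is then carried out directly on these shares.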
‡ The random parameters can be generated by a crypto-service provider in an offline manner, or by using pseudo-random secret sharing [34].
§ In fact, gathering only T+1 secret shares is sufficient to recover X̃_i, due to the construction of Shamir's secret sharing [123]. Using this fact, one can speed up the execution by dividing the N clients into subgroups of size T+1 and performing the encoding locally within each subgroup. We utilize this property in our experiments.

At iteration t, client i initially holds a secret share of the current model, [w^(t)]_i, and then encodes the model via a Lagrange interpolation polynomial v : F_p → F_p^d with degree at most K+T−1,

[v(z)]_i ≜ Σ_{k∈[K]} [w^(t)]_i · Π_{l∈[K+T]\{k}} (z−β_l)/(β_k−β_l) + Σ_{k=K+1}^{K+T} [v_k^(t)]_i · Π_{l∈[K+T]\{k}} (z−β_l)/(β_k−β_l),   (3.4)

where [v(β_k)]_i = [w^(t)]_i for k ∈ [K] and i ∈ [N]. The vectors v_k^(t) are generated uniformly at random from F_p^d. Client i ∈ [N] then sends [w̃_j^(t)]_i ≜ [v(α_j)]_i to client j ∈ [N]. Upon receiving {[w̃_j^(t)]_i}_{i∈[N]}, client j ∈ [N] recovers the encoded model w̃_j^(t).

Phase 3: Polynomial Approximation and Local Computations. Lagrange encoding can be used to compute polynomial functions only, whereas the gradient computations in (3.2) are not polynomial operations due to the sigmoid function. To this end, we approximate the sigmoid with a polynomial,

ĝ(z) = Σ_{i=0}^{r} c_i z^i,   (3.5)

where r and c_i denote the degree and coefficients of the polynomial, respectively. The coefficients are obtained by fitting the sigmoid to the polynomial via least squares estimation. Using this polynomial approximation, we rewrite the model update from (3.2) as,

w^(t+1) = w^(t) − (η/m) X^T (ĝ(X × w^(t)) − y).   (3.6)

Client i ∈ [N] then locally computes the gradient over the encoded dataset, by evaluating the function

f(X̃_i, w̃_i^(t)) = X̃_i^T ĝ(X̃_i × w̃_i^(t))   (3.7)

and secret shares the result with the other clients, by sending a secret share of (3.7), [f(X̃_i, w̃_i^(t))]_j, to client j ∈ [N].
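For instance, the coefficients c_i in (3.5) can be obtained with an ordinary least-squares fit over a sampling interval; the interval [−5, 5], the sample count, and the use of numpy.polyfit are our illustration choices, not specified by the text.

```python
import numpy as np

def fit_sigmoid(r, lo=-5.0, hi=5.0, num=1000):
    """Least-squares fit of a degree-r polynomial to the sigmoid on [lo, hi],
    returning coefficients (c_0, ..., c_r) as in Eq. (3.5)."""
    z = np.linspace(lo, hi, num)
    sigmoid = 1.0 / (1.0 + np.exp(-z))
    return np.polyfit(z, sigmoid, deg=r)[::-1]   # polyfit returns highest degree first

c = fit_sigmoid(r=1)
ghat = lambda z: sum(ci * z**i for i, ci in enumerate(c))

# By symmetry of the sigmoid around (0, 1/2), the fitted line passes through 0.5 at z = 0.
assert abs(ghat(0.0) - 0.5) < 1e-3
assert c[1] > 0   # positive slope, as expected for an increasing function
```

In the protocol these real coefficients are then quantized into F_p along with the data, so that ĝ becomes a polynomial over the finite field.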
At the end of this step, client j holds the secret shares [f(X̃_i, w̃_i^(t))]_j corresponding to the local computations of clients i ∈ [N]. Note that (3.7) is a polynomial function evaluation in finite field arithmetic, and the degree of the function f is deg(f) = 2r+1.

Phase 4: Decoding and Model Update. In this phase, clients perform the decoding of the gradient using a secure MPC protocol, through polynomial interpolation over the secret shares [f(X̃_i, w̃_i^(t))]_j. The minimum number of clients needed for the decoding operation to be successful, which we call the recovery threshold of the protocol, is equal to (2r+1)(K+T−1)+1. To show this, we first note that, from the definition of the Lagrange polynomials in (3.3) and (3.4), one can define a univariate polynomial h(z) = f(u(z), v(z)) such that

h(β_i) = f(u(β_i), v(β_i)) = f(X_i, w^(t)) = X_i^T ĝ(X_i × w^(t))   (3.8)

for i ∈ [K]. Moreover, from (3.7), we know that client i performs the following computation,

h(α_i) = f(u(α_i), v(α_i)) = f(X̃_i, w̃_i^(t)) = X̃_i^T ĝ(X̃_i × w̃_i^(t)).   (3.9)

The decoding process is based on the intuition that the computations from (3.9) can be used as evaluation points h(α_i) to interpolate the polynomial h(z). Since the degree of the polynomial h(z) is deg(h(z)) ≤ (2r+1)(K+T−1), all of its coefficients can be determined as long as at least (2r+1)(K+T−1)+1 evaluation points are available. After h(z) is recovered, the computation results in (3.8) correspond to h(β_i) for i ∈ [K].

Our decoding operation corresponds to a finite-field polynomial interpolation problem.
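The recovery threshold can be verified numerically with scalar stand-ins for X_k and w^(t) (our own simplification; with r = 1, f(x, w) = x·ĝ(x·w) has degree 2r+1 = 3, so h(z) = f(u(z), v(z)) has degree at most 3(K+T−1)):

```python
import random

p = 67108859   # p = 2^26 - 5

def interpolate_at(z, xs, ys):
    """Interpolate through (xs, ys) over F_p and evaluate at z."""
    val = 0
    for j, xj in enumerate(xs):
        num, den = 1, 1
        for l, xl in enumerate(xs):
            if l != j:
                num = num * (z - xl) % p
                den = den * (xj - xl) % p
        val = (val + ys[j] * num * pow(den, p - 2, p)) % p
    return val

# Degree r = 1 polynomial ghat(z) = c0 + c1 z, so f(x, w) = x * ghat(x w) has degree 3.
c0, c1 = 3, 5
f = lambda x, w: x * (c0 + c1 * (x * w)) % p

K, T, r = 2, 1, 1
betas = [1, 2, 3]                                  # K + T points
threshold = (2 * r + 1) * (K + T - 1) + 1          # recovery threshold = 7
alphas = list(range(10, 10 + threshold))
X = [42, 99]                                       # data partitions X_1, X_2
w = 7                                              # current model
Z, V = random.randrange(p), random.randrange(p)    # random masks

u = lambda z: interpolate_at(z, betas, X + [Z])    # u(beta_k) = X_k
v = lambda z: interpolate_at(z, betas, [w, w, V])  # v(beta_k) = w

# Each client i contributes one evaluation h(alpha_i) = f(u(alpha_i), v(alpha_i));
# since deg(h) <= 6, the 7 evaluations determine h, and h(beta_k) = f(X_k, w):
h_vals = [f(u(a), v(a)) for a in alphas]
for k, bk in enumerate(betas[:K]):
    assert interpolate_at(bk, alphas, h_vals) == f(X[k], w)
```

In COPML this interpolation is carried out over the secret shares of the computations, so the decoded gradient itself remains secret shared.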
More specifically, upon receiving the secret shares of the local computations [f(X̃_j, w̃_j^(t))]_i from at least (2r+1)(K+T−1)+1 clients, client i locally computes

[f(X_k, w^(t))]_i ≜ Σ_{j∈I_i} [f(X̃_j, w̃_j^(t))]_i · Π_{l∈I_i\{j}} (β_k−α_l)/(α_j−α_l)   (3.10)

for k ∈ [K], where I_i ⊆ [N] denotes the set of the (2r+1)(K+T−1)+1 fastest clients who send their secret share [f(X̃_j, w̃_j^(t))]_i to client i.

After this step, client i locally aggregates its secret shares [f(X_k, w^(t))]_i to compute Σ_{k=1}^{K} [f(X_k, w^(t))]_i, which in turn is a secret share of X^T ĝ(X × w^(t)), since

Σ_{k=1}^{K} f(X_k, w^(t)) = Σ_{k=1}^{K} X_k^T ĝ(X_k × w^(t)) = X^T ĝ(X × w^(t)).   (3.11)

Let [X^T ĝ(X × w^(t))]_i ≜ Σ_{k=1}^{K} [f(X_k, w^(t))]_i denote the secret share of (3.11) at client i. Client i then computes [X^T ĝ(X × w^(t))]_i − [X^T y]_i, which in turn is a secret share of the gradient X^T (ĝ(X × w^(t)) − y). Since the decoding operations are carried out using the secret shares, at the end of the decoding process the clients only learn a secret share of the gradient and not its true value.

Next, clients update the model according to (3.6) using a secure MPC protocol, using the secret shared model [w^(t)]_i and the secret share of the gradient [X^T ĝ(X × w^(t))]_i − [X^T y]_i. A major challenge in performing the model update (3.6) in the finite field is the multiplication with the parameter η/m, where η/m < 1. One potential approach to perform this operation in the finite field is to treat it as a computation on integers and preserve full accuracy of the results. This in turn requires a very large field size, as the range of the results grows exponentially with the number of multiplications, which quickly becomes impractical as the number of iterations increases [105]. Instead, we address this problem by leveraging the secure truncation technique from
[24]. This protocol takes as input secret shares [a]_i of a variable a, as well as two public integer parameters k_1 and k_2 such that a ∈ F_{2^{k_2}} and 0 < k_1 < k_2. The protocol then returns the secret shares [z]_i for i ∈ [N] such that z = ⌊a/2^{k_1}⌋ + s, where s is a random bit with probability P(s = 1) = (a mod 2^{k_1})/2^{k_1}. Accordingly, the protocol rounds a/2^{k_1} to the closest integer with probability 1−τ, where τ is the distance between a/2^{k_1} and that integer. The truncation operation ensures that the range of the updated model always stays within the range of the finite field.

Since the model update is carried out using a secure MPC protocol, at the end of this step client i ∈ [N] learns only a secret share [w^(t+1)]_i of the updated model w^(t+1), and not its actual value. In the next iteration, using [w^(t+1)]_i, client i ∈ [N] locally computes [w̃_j^(t+1)]_i from (3.4) and sends it to client j ∈ [N]. Client j then recovers the encoded model w̃_j^(t+1), which is used to compute (3.7). The implementation details of the MPC protocols are provided in Appendix B.3. The overall algorithm of COPML is presented in Appendix B.4.

3.4 Convergence and Privacy Guarantees

Consider the cost function in (3.1) with the quantized dataset, and denote by w* the optimal model parameters that minimize (3.1). In this subsection, we prove that COPML guarantees convergence to the optimal model parameters (i.e., w*) while maintaining the privacy of the dataset against colluding clients. This result is stated in the following theorem.

Theorem 3.1.
For training a logistic regression model in a distributed system with N clients using the quantized dataset X = [X_1^T, ..., X_N^T]^T, initial model parameters w^(0), and constant step size η ≤ 1/L (where L = (1/4)‖X‖_2^2), COPML guarantees convergence,

E[ C( (1/J) Σ_{t=0}^{J} w^(t) ) ] − C(w*) ≤ ‖w^(0) − w*‖² / (2ηJ) + ησ²,   (3.12)

in J iterations, for any N ≥ (2r+1)(K+T−1)+1, where r is the degree of the polynomial in (3.5) and σ² is the variance of the quantization error of the secure truncation protocol.

Proof. The proof of Theorem 3.1 is presented in Appendix B.2.

As for the privacy guarantees, COPML protects the statistical privacy of the individual dataset of each client against up to T colluding adversarial clients, even if the adversaries have unbounded computational power. The privacy protection of COPML follows from the fact that all building blocks of the algorithm guarantee either (strong) information-theoretic privacy or statistical privacy of the individual datasets against any collusion between up to T clients. Information-theoretic privacy of Lagrange coding against T colluding clients follows from [156]. Moreover, the encoding, decoding, and model update operations are carried out in a secure MPC protocol that protects the information-theoretic privacy of the corresponding computations against T colluding clients [10, 6, 37]. Finally, the (statistical) privacy guarantee of the truncation protocol follows from [24].

Remark 3.1. (Privacy-parallelization trade-off) Theorem 3.1 reveals an important trade-off between privacy and parallelization in COPML. The parameter K reflects the amount of parallelization. In particular, the size of the encoded matrix at each client is equal to (1/K)-th of the size of X. Since each client computes the gradient over the encoded dataset, the computation load at each client is proportional to processing (1/K)-th of the entire dataset. As K increases, the computation load at each client decreases.
The parameter T reflects the privacy threshold of COPML. In a distributed system with N clients, COPML can achieve any K and T as long as N ≥ (2r+1)(K+T−1)+1. Moreover, as the number of clients N increases, the parallelization (K) and privacy (T) thresholds of COPML can also increase linearly, providing a scalable solution. The motivation behind the encoding process is to distribute the load of the computationally-intensive gradient evaluations across multiple clients (enabling parallelization), and to protect the privacy of the dataset.

Remark 3.2. Theorem 3.1 also holds for the simpler linear regression problem.

Figure 3.3: Performance gain of COPML over the MPC baseline ([BH08] from [6]): total training time for different numbers of clients N with 50 iterations. (a) CIFAR-10 (for accuracy 80.45%); (b) GISETTE (for accuracy 97.50%).

3.5 Experiments

We demonstrate the performance of COPML compared to conventional MPC baselines by examining two properties, accuracy and performance gain in terms of training time, on the Amazon EC2 cloud platform.

3.5.1 Experiment setup

Setup. We train a logistic regression model for binary image classification on the CIFAR-10 [81] and GISETTE [61] datasets, whose sizes are (m, d) = (9019, 3073) and (6000, 5000), respectively. The dataset is distributed evenly across the clients. The clients initially secret share their individual datasets with the other clients.¶ Computations are carried out on Amazon EC2 m3.xlarge machine instances. We run the experiments in a WAN setting with an average bandwidth of 40 Mbps. Communication between clients is implemented using the MPI4Py [36] interface on Python.

Implemented schemes. We implement four schemes for performance evaluation. For COPML, we consider two sets of key parameters (K, T) to investigate the trade-off between parallelization and privacy. For the baselines, we apply two conventional MPC protocols (based on [10] and [6]) to our multi-client problem setting.‖
¶ This can be done offline as it is an identical one-time operation for both the MPC baselines and COPML.
‖ As described in Section 3.1, there is no prior work at our scale (beyond 3-4 parties), hence we implement two baselines based on well-known MPC protocols, which are also the first implementations at our scale.

1. COPML. In COPML, MPC is utilized to enable secure encoding and decoding for Lagrange coding. The gradient computations are then carried out using the Lagrange encoded data. We determine T (privacy threshold) and K (amount of parallelization) in COPML as follows. Initially, we have from Theorem 3.1 that these parameters must satisfy N ≥ (2r+1)(K+T−1)+1 in our framework. Next, we have considered both r = 1 and r = 3 for the degree of the polynomial approximation of the sigmoid function, and observed that the degree one approximation achieves good accuracy, as we demonstrate later. Given our choice of r = 1, we then consider two setups:

Case 1: (Maximum parallelization gain) Allocate all resources to parallelization (fastest training), by letting K = ⌊(N−1)/3⌋ and T = 1;
Case 2: (Equal parallelization and privacy gain) Split resources almost equally between parallelization and privacy, i.e., T = ⌊(N−3)/6⌋ and K = ⌊(N+2)/3⌋ − T.

2. Baseline protocols. We implement two conventional MPC protocols (based on [10] and [6]). In a naive implementation of these protocols, each client would secret share its local dataset with the entire set of clients, and the gradient computations would be performed over secret shared data whose size is as large as the entire dataset, which leads to a significant computational overhead. For a fair comparison with COPML, we speed up the baseline protocols by partitioning the clients into three groups, and assigning each group one third of the entire dataset.
Hence, the total amount of data processed at each client is equal to one third of the size of the entire dataset, which significantly reduces the total training time while providing a privacy threshold of T = ⌊(N−3)/6⌋, the same privacy threshold as Case 2 of COPML. The details of these implementations are presented in Appendix B.3.1.

Figure 3.4: Comparison of the accuracy of COPML (demonstrated for Case 2 and N = 50 clients) vs. conventional logistic regression that uses the sigmoid function without quantization. (a) CIFAR-10 dataset for binary classification between plane and car images (using 9019 samples for the training set and 2000 samples for the test set); (b) GISETTE dataset for binary classification between digits 4 and 9 (using 6000 samples for the training set and 1000 samples for the test set).

Table 3.1: Breakdown of the running time with N = 50 clients.

Protocol | Comp. time (s) | Comm. time (s) | Enc/Dec time (s) | Total run time (s)
MPC using [BGW88] | 918 | 21142 | 324 | 22384
MPC using [BH08] | 914 | 6812 | 189 | 7915
COPML (Case 1) | 141 | 284 | 15 | 440
COPML (Case 2) | 240 | 654 | 22 | 916

In all schemes, we apply the MPC truncation protocol from Section 3.3 to carry out the multiplication with η/m during model updates, by choosing (k_1, k_2) = (21, 24) and (22, 24) for the CIFAR-10 and GISETTE datasets, respectively.

3.5.2 Performance evaluation

Training time. In the first set of experiments, we measure the training time. Our results are demonstrated in Figure 3.3, which shows the comparison of COPML with the protocol from [6], as we have found it to be the faster of the two baselines. Figures 3.3(a) and 3.3(b) demonstrate that COPML provides substantial speedup over the MPC baseline, in particular up to 8.6× and 16.4× with the CIFAR-10 and GISETTE datasets, respectively, while providing the same privacy threshold T. We observe that a higher speedup is achieved as the dimension of the dataset becomes larger (CIFAR-10 vs.
GISETTE datasets), suggesting that COPML is well-suited for data-intensive distributed training tasks where parallelization is essential.

Table 3.2: Complexity summary of COPML.

Communication: O(mdN/K + dNJ)
Computation: O(md²/K)
Encoding: O(mdN(K+T)/K + dN(K+T)J)

To further investigate the gain of COPML, in Table 3.1 we present the breakdown of the total running time with the CIFAR-10 dataset for N = 50 clients. We observe that COPML provides a K/3-fold speedup in the computation time of the matrix multiplication in (3.7), which is given in the first column. This is due to the fact that, in the baseline protocols, the size of the data processed at each client is one third of the entire dataset, while in COPML it is (1/K)-th of the entire dataset. This reduces the computational overhead of each client while computing matrix multiplications. Moreover, COPML provides significant improvements in the communication, encoding, and decoding times. This is because the two baseline protocols require intensive communication and computation to carry out a degree reduction step for secure multiplication (encoding and decoding for additional secret shares), which is detailed in Appendix B.3. In contrast, COPML only requires secure addition and multiplication-by-a-constant operations for encoding and decoding. These operations require no communication. In addition, the communication, encoding, and decoding overheads of each client are also reduced due to the fact that the size of the data processed at each client is only (1/K)-th of the entire dataset.

Accuracy. We finally examine the accuracy of COPML. Figures 3.4(a) and 3.4(b) demonstrate that COPML with degree one polynomial approximation provides comparable test accuracy to conventional logistic regression. For the CIFAR-10 dataset in Figure 3.4(a), the accuracies of COPML and conventional logistic regression are 80.45% and 81.75%, respectively, in 50 iterations.
For the GISETTE dataset in Figure 3.4(b), the accuracies of COPML and conventional logistic regression have the same value of 97.5% in 50 iterations. Hence, COPML has comparable accuracy to conventional logistic regression while also being privacy preserving.

3.5.3 Complexity Analysis

In this section, we analyze the asymptotic complexity of each client in COPML with respect to the number of clients N, model dimension d, number of data points m, parallelization parameter K, privacy parameter T, and total number of iterations J.

Client i's communication cost can be broken into three parts: 1) sending the secret shares [X̃_j]_i = [u(α_j)]_i in (3.3) to client j ∈ [N]; 2) sending the secret shares [w̃_j^(t)]_i = [v(α_j)]_i in (3.4) to client j ∈ [N] for t ∈ {0, ..., J−1}; and 3) sending the secret share of the local computation [f(X̃_i, w̃_i^(t))]_j in (3.7) to client j ∈ [N] for t ∈ {0, ..., J−1}. The communication costs of the three parts are O(mdN/K), O(dNJ), and O(dNJ), respectively. Therefore, the overall communication cost of each client is O(mdN/K + dNJ).

Client i's computation cost of encoding can be broken into two parts: encoding the dataset using (3.3), and encoding the model using (3.4). The encoded dataset [X̃_j]_i = [u(α_j)]_i from (3.3) is a weighted sum of K+T matrices, where each matrix belongs to F_p^{(m/K)×d}. As there are N encoded datasets and each requires a computation cost of O(md(K+T)/K), the computation cost of encoding the dataset is O(mdN(K+T)/K) in total. Similarly, the computation cost of encoding [w̃_j^(t)]_i = [v(α_j)]_i from (3.4) is O(dN(K+T)J). The computation cost of client i to compute X̃_i^T X̃_i, the dominant part of the local computation f(X̃_i, w̃_i^(t)) in (3.7), is O(md²/K). We summarize the asymptotic complexity of each client in Table 3.2.
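As a concrete check of the experimental parameter choices (Case 1: K = ⌊(N−1)/3⌋, T = 1; Case 2: T = ⌊(N−3)/6⌋, K = ⌊(N+2)/3⌋ − T, with r = 1), a small helper can verify the recovery-threshold constraint from Theorem 3.1; the helper names are ours.

```python
def case1(N):                       # maximum parallelization
    return (N - 1) // 3, 1          # (K, T)

def case2(N):                       # equal parallelization and privacy
    T = (N - 3) // 6
    return (N + 2) // 3 - T, T      # (K, T)

def feasible(N, K, T, r=1):
    """Check the recovery-threshold constraint N >= (2r+1)(K+T-1)+1."""
    return N >= (2 * r + 1) * (K + T - 1) + 1

for N in (10, 25, 50, 100, 200):
    for K, T in (case1(N), case2(N)):
        assert feasible(N, K, T)

print(case1(50), case2(50))   # with N = 50: (K, T) = (16, 1) and (10, 7)
```

Both cases saturate the constraint, trading the same total budget K+T−1 between parallelization and privacy.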
When we set N = 3(K+T−1)+1 and K = O(N) (Case 2), increasing N has two major impacts on the training time: 1) reducing the computation per client by choosing a larger K, and 2) increasing the encoding time. In this case, as m is typically much larger than the other parameters, the dominant terms in the communication, computation, and encoding costs are O(md), O(md²/N), and O(mdN), respectively. For small datasets, i.e., when the computation load at each worker is very small, the gain from increasing the number of workers beyond a certain point may be minimal and the system may saturate, as encoding may dominate the computation. This is the reason that a higher speedup in training time is achieved as the dimension of the dataset becomes larger.

3.6 Conclusions

We considered a collaborative learning scenario in which multiple data-owners jointly train a logistic regression model without revealing their individual datasets to the other parties. To the best of our knowledge, even for simple logistic regression, COPML is the first fully-decentralized training framework to scale beyond 3-4 parties while achieving information-theoretic privacy. Extending COPML to more complicated (deeper) models is a very interesting future direction. An MPC-friendly (i.e., polynomial) activation function is proposed in [105], which approximates the softmax and shows that the accuracy of the resulting models is very close to that of models trained using the original functions. We expect to achieve a similar performance gain in those setups as well, since COPML can similarly be leveraged to efficiently parallelize the MPC computations.

Chapter 4

Turbo-Aggregate: Breaking the Quadratic Aggregation Barrier in Secure Federated Learning

4.1 Introduction

Federated learning is an emerging approach that enables model training over a large volume of decentralized data residing in mobile devices, while protecting the privacy of the individual users [100, 16, 17, 75]. This is achieved by two key design principles.
First, the training data is kept on the user device rather than being sent to a central server, and users locally perform model updates using their individual data. Second, the local models are aggregated in a privacy-preserving manner, either at a central server or in a distributed fashion across the users, to update the global model. The global model is then pushed back to the mobile devices for inference. This process is demonstrated in Figure 4.1.

The privacy of individual models in federated learning is protected through what is known as a secure aggregation protocol [16, 17]. In this protocol, each user locally masks its own model using pairwise random masks and sends the masked model to the server. The pairwise masks have the unique property that once the masked models from all users are summed up at the server, the pairwise masks cancel out. As a result, the server learns the aggregate of all models, but no individual model is revealed to the server during the process. This is a key property for ensuring user privacy in secure federated learning. In contrast, conventional distributed training setups that do not employ secure aggregation may reveal extensive information about the private datasets of the users, as has recently been shown in [166, 147, 55]. To prevent such information leakage, secure aggregation protocols ensure that the individual update of each user is kept private, both from the other users and from the central server [16, 17]. A recent promising implementation of federated learning, as well as its application to Google keyboard query suggestions, is demonstrated in [153]. Several other works have also demonstrated that leveraging the information that is distributed over many mobile users can increase the training performance dramatically, while ensuring data privacy and locality [101, 14, 89].

The overhead of secure model aggregation, however, creates a major bottleneck in scaling secure federated learning to a large number of users.
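The cancellation of pairwise masks can be illustrated numerically. The sketch below is our own simplification (scalar models, integer masks modulo a public prime, and no key exchange or dropout handling); it shows only why the server's sum equals the true aggregate.

```python
import random

p = 2**31 - 1                                     # illustrative public modulus
N = 5
x = [random.randrange(1000) for _ in range(N)]    # local models (scalars)

# One shared random mask s[i][j] per pair (i, j) with i < j.
s = [[random.randrange(p) for _ in range(N)] for _ in range(N)]

def masked_update(i):
    """User i adds the masks shared with higher-indexed users and
    subtracts those shared with lower-indexed users."""
    y = x[i]
    for j in range(N):
        if i < j:
            y += s[i][j]
        elif i > j:
            y -= s[j][i]
    return y % p

# Every mask appears once with + and once with -, so the sum is unmasked:
aggregate = sum(masked_update(i) for i in range(N)) % p
assert aggregate == sum(x) % p
```

Generating (and, after dropouts, reconstructing) these N(N−1)/2 pairwise masks is precisely what drives the quadratic overhead discussed next.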
More specifically, in a network with N users, the state-of-the-art protocols for secure aggregation require pairwise random masks to be generated between each pair of users (for hiding the local model updates), and therefore the overhead of secure aggregation grows quadratically in the number of users, i.e., O(N²) [16, 17]. This quadratic growth of the secure aggregation overhead limits its practical applications to hundreds of users, while the scale of current mobile systems is in the order of tens of millions [14].

Another key challenge in model aggregation is the dropout or unavailability of users. Device availability and connection quality in mobile networks change rapidly, and users may drop from federated learning systems at any time due to various reasons, such as poor connectivity, making a phone call, or low battery. The protocol design hence needs to be robust enough to operate in such environments, where users can drop at any stage of the protocol execution. Furthermore, dropped or delayed users can lead to privacy breaches [17], and privacy guarantees should hold even when users are dropped or delayed.

Figure 4.1: Federated learning framework. At iteration t, the central server sends the current version of the global model, x(t), to the mobile users. User i ∈ [N] updates the global model using its local data, and computes a local model x_i(t). The local models are then aggregated in a privacy-preserving manner. Using the aggregated models, the central server updates the global model x(t+1) for the next round, and pushes it back to the mobile users.

In this paper, we introduce a novel secure aggregation framework for federated learning, named Turbo-Aggregate, with four salient features: 1.
Turbo-Aggregate reduces the overhead of secure aggregation to O(N log N) from O(N²); 2. Turbo-Aggregate has provable robustness guarantees against a user dropout rate of up to 50%; 3. Turbo-Aggregate protects the privacy of the local model update of each individual user, in the strong information-theoretic sense; 4. Turbo-Aggregate experimentally achieves a total running time that grows almost linearly in the number of users, and provides up to a 40× speedup over the state-of-the-art with N = 200 users, in a distributed implementation over the Amazon EC2 cloud.

At a high level, Turbo-Aggregate is composed of three main components. First, Turbo-Aggregate employs a multi-group circular strategy for model aggregation. In particular, the users are partitioned into several groups, and at each aggregation stage, the users in one group pass the aggregated models of all the users in the previous groups and the current group to the users in the next group. We show that this structure enables the reduction of the aggregation overhead to O(N log N) (from O(N²)). However, there are two key challenges that need to be addressed in the proposed multi-group circular strategy for model aggregation. The first is to protect the privacy of the individual users, i.e., the aggregation protocol should not allow the identification of individual model updates. The second is handling user dropouts. For instance, a user dropped at a higher group of the protocol may lead to the loss of the aggregated model information from all the previous groups, and collecting this information again from the lower groups may incur a large communication overhead.

The second key component is to leverage additive secret sharing [7, 47] to enable privacy and security of the users. In particular, additive sharing masks each local model by adding randomness in a way that can be cancelled out once the models are aggregated.
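Additive secret sharing can be sketched in a few lines (scalars over an illustrative prime modulus; the function names are ours): a value is split into shares that are each uniformly random, yet sum back to the value, so sums of shares aggregate the underlying models.

```python
import random

p = 2**31 - 1        # illustrative public prime modulus

def additive_share(x, n):
    """Split x into n additive shares over Z_p: the first n-1 shares are
    uniformly random, and the last one makes the total sum to x mod p."""
    shares = [random.randrange(p) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % p)
    return shares

models = [123, 456, 789]                     # three users' local models (scalars)
shared = [additive_share(m, 3) for m in models]

# Summing all shares of all models yields the aggregate of the models,
# while any strict subset of a model's shares is uniformly random.
aggregate = sum(sum(s) for s in shared) % p
assert aggregate == sum(models) % p
```

This cancellation is what lets each group in the circular strategy forward masked partial aggregates without ever exposing an individual update.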
Finally, the third component is to add aggregation redundancy via Lagrange coding [156] to enable robustness against delayed or dropped users. In particular, Turbo-Aggregate injects redundancy via a Lagrange polynomial, so that the added redundancy can be exploited to reconstruct the aggregated model amidst potential dropouts.

Turbo-Aggregate allows the use of both centralized and decentralized communication architectures. The centralized architecture refers to the communication model used in the conventional federated learning setup, where all communication goes through a central server, i.e., the server acts as an access point [100, 17, 75]. The decentralized architecture, on the other hand, refers to the setup where mobile devices communicate directly with each other via an underlay communication network (e.g., a peer-to-peer network) [82, 66], without requiring a central server for secure model aggregation. Turbo-Aggregate also allows additional parallelization opportunities for communication, such as broadcasting and multi-casting.

We theoretically analyze the performance guarantees of Turbo-Aggregate in terms of the aggregation overhead, privacy protection, and robustness to dropped or delayed users. In particular, we show that Turbo-Aggregate achieves an aggregation overhead of O(N log N) and can tolerate a user dropout rate of 50%. We then quantify the privacy guarantees of our system. An important implication of dropped or delayed users is that they may lead to privacy breaches [16]. Accordingly, we show that the privacy protection of our algorithm is preserved in such scenarios, i.e., when users are dropped or delayed.

We also provide extensive experiments to numerically evaluate the performance of Turbo-Aggregate. To do so, we implement Turbo-Aggregate for up to 200 users on the Amazon EC2 cloud, and compare its performance with the state-of-the-art secure aggregation protocol from [17].
We demonstrate that Turbo-Aggregate can achieve an overall execution time that grows almost linearly in the number of users, and provides up to a 40× speedup over the state-of-the-art with 200 users. Furthermore, the overall execution time of Turbo-Aggregate remains stable as the user dropout rate increases, while for the benchmark protocol the overall execution time increases significantly with the user dropout rate. We further study the impact of communication bandwidth on the performance of Turbo-Aggregate, by measuring the total running time under various bandwidth constraints. Our experimental results demonstrate that Turbo-Aggregate still provides substantial gains in environments with more severe bandwidth constraints.

4.2 Related Work

A potential solution for secure aggregation is to leverage cryptographic approaches, such as multi-party computation (MPC), homomorphic encryption, or differential privacy. MPC-based techniques mainly utilize Yao's garbled circuits or secret sharing (e.g., [154, 123, 10, 6]). Their main bottleneck is the high communication cost, and communication-efficient implementations require an extensive
As such, the computation outcomes cannot be used to infer much about any single individual element [43]. In the context of federated learning, differential privacy is mainly used to ensure that in- dividual data points from the local datasets cannot be identified from the local updates sent to the server, by adding artificial noise to the local updates at the clients’ side [57, 101, 149]. This approach entails a trade-off between convergence performance and privacy protection, i.e., stronger privacy guarantees lead to a degradation in the convergence performance. On the other hand, our focus is on ensuring that the server or a group of colluding users can learn nothing beyond the aggregate of all local updates, while preserving the accuracy of the model. This approach, also known as secure aggregation [16, 17], does not sacrifice the convergence performance. A recent line of work has focused on secure aggregation by additive masking [2], [17]. In [2], users agree on pairwise secret keys using a Diffie-Hellman type key exchange protocol and then each user sends the server a masked version of their data, which contains the pairwise masks as well as an individual mask. The server can then sum up the masked data received from the users to obtain the aggregated value, as the summation of additive masks cancel out. If a user fails and drops out, the server asks the remaining users to send the sum of their pairwise keys with the dropped users added to their individual masks, and subtracts them from the aggregated value. The main 58 limitation of this protocol is the communication overhead of this recovery phase, as it requires the entire sum of the missing masks to be sent to the server. Moreover, the protocol terminates if additional users drop during this phase. A novel technique is proposed in [17] to ensure that the protocol is robust if additional users drop during the recovery phase. 
It also ensures that the additional information sent to the server does not breach privacy. To do so, the protocol utilizes pairwise random masks between users to hide the individual models. The cost of reconstructing these masks, which takes the majority of the execution time, scales as $O(N^2)$, where $N$ is the number of users. The execution time of [17] increases as more users drop, as the protocol requires additional information corresponding to the dropped users. The recovery phase of our protocol does not require any additional information to be shared between the users, which is achieved by a coding technique applied to the additively secret shared data. Hence, the execution time of our algorithm stays almost the same as more and more users drop; the only overhead comes from the decoding phase, whose contribution is very small compared to the overall communication cost.

Notable approaches to reduce the communication cost in federated learning include reducing the model size via quantization, or learning in a smaller parameter space [80]. In [18], a framework has been proposed for autotuning the parameters in secure federated learning, to achieve communication efficiency. Another line of work has focused on approaches based on decentralized learning [68, 94] or edge-assisted hierarchical physical layer topologies [96]. Specifically, [96] utilizes edge servers to act as intermediate aggregators for the local updates from edge devices. The global model is then computed at the central server by aggregating the intermediate computations available at the edge servers. These setups perform the aggregation using the clear (unmasked) model updates, i.e., the aggregation is not required to preserve the privacy of individual model updates. Our focus is different, as we study the secure aggregation problem, which requires the server to learn no information about an individual update beyond the aggregated values.
Finally, approaches that aim at alleviating the aggregation overhead by reducing the model size (e.g., quantization [80]) can also be leveraged in Turbo-Aggregate, which can be an interesting future direction.

Circular communication and training architectures have been considered previously in the context of distributed stochastic gradient descent on clear (unmasked) gradient updates, to reduce the communication load [93] or to model data heterogeneity [44]. Different from these setups, our key challenge in this work is handling user dropouts while ensuring user privacy, i.e., secure aggregation.

Conventional federated learning frameworks consider a centralized communication architecture in which all communication between the mobile devices goes through a central server [17, 100, 75]. More recently, decentralized federated learning architectures without a central server have been considered for peer-to-peer learning on graph topologies [82] and in the context of social networks [66]. Model poisoning attacks on federated learning architectures have been analyzed in [11, 49]. Differentially-private federated learning frameworks have been studied in [57, 136]. A multi-task learning framework for federated learning has been proposed in [126], for learning several models simultaneously. [106, 91] have explored federated learning frameworks to address fairness challenges and to avoid biasing the trained model towards certain users. Convergence properties of trained models are studied in [92].

4.3 System Model

In this section, we first discuss the basic federated learning model. Next, we introduce the secure aggregation protocol for federated learning and discuss the key parameters for performance evaluation. Finally, we present the state-of-the-art for secure aggregation.
4.3.1 Basic Federated Learning Model

Federated learning is a distributed learning framework that allows training machine learning models directly on the data held at distributed devices, such as mobile phones. The goal is to learn a single global model $x$ with dimension $d$, using data that is generated, stored, and processed locally at millions of remote devices. This can be represented by minimizing a global objective function,

$$\min_{x} L(x) \quad \text{such that} \quad L(x) = \sum_{i=1}^{N} w_i L_i(x), \qquad (4.1)$$

where $N$ is the total number of mobile users, $L_i$ is the local objective function of user $i$, and $w_i \geq 0$ is a weight parameter assigned to user $i$ to specify the relative impact of each user, such that $\sum_i w_i = 1$. One natural setting of the weight parameter is $w_i = m_i/m$, where $m_i$ is the number of samples of user $i$ and $m = \sum_{i=1}^{N} m_i$.∗

To solve (4.1), conventional federated learning architectures consider a centralized communication topology in which all communication between the individual devices goes through a central server [100, 17, 75], and no direct links are allowed between the mobile users. The learning setup is demonstrated in Figure 4.1. At iteration $t$, the central server shares the current version of the global model, $x(t)$, with the mobile users. Each user then updates the model using its local data; user $i \in [N]$ computes a local model $x_i(t)$. To increase communication efficiency, each user can update the local model over multiple local epochs before sending it to the server [100]. The local models of the $N$ users are sent to the server and then aggregated by the server. Using the aggregated models, the server updates the global model $x(t+1)$ for the next iteration. This update equation is given by

$$x(t+1) = \sum_{i \in \mathcal{U}(t)} x_i(t), \qquad (4.2)$$

where $\mathcal{U}(t)$ denotes the set of participating users at iteration $t$.

∗ For simplicity, we assume that all users have equal-sized datasets, i.e., the weight parameter assigned to user $i$ satisfies $w_i = \frac{1}{N}$ for all $i \in [N]$.
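The server-side update in (4.2) can be sketched in a few lines. This is a minimal illustration, not the dissertation's implementation; the function name `aggregate_round` and the equal-weight default (matching the footnote's assumption $w_i = 1/N$) are illustrative choices.

```python
import numpy as np

def aggregate_round(local_models, weights=None):
    """Server-side update of Eq. (4.2): combine the local models x_i(t)
    of the participating users into the next global model x(t+1).

    local_models: list of d-dimensional numpy arrays (one per user in U(t)).
    weights: optional per-user weights w_i summing to 1; defaults to the
    equal-weight setting w_i = 1/N assumed in the text.
    """
    models = np.stack(local_models)
    if weights is None:
        weights = np.full(len(local_models), 1.0 / len(local_models))
    return weights @ models  # computes sum_i w_i * x_i(t)

# Two users with 2-dimensional models: the equal-weight aggregate is
# the coordinate-wise average of the local models.
x_next = aggregate_round([np.array([1.0, 2.0]), np.array([3.0, 4.0])])
```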
Then, the server pushes the updated global model $x(t+1)$ to the mobile users.

4.3.2 Secure Aggregation Protocol for Federated Learning and Key Parameters

The basic federated learning model from Section 4.3.1 aims at addressing the privacy concerns over transmitting raw data to the server, by letting the training data remain on the user device and instead requiring only the local models to be sent to the server. However, as the local models still carry extensive information about the local datasets stored at the users, the server can reconstruct the private data from the local models by using a model inversion attack, as recently demonstrated in [166, 147, 55].

Secure aggregation has been introduced in [17] to address such privacy leakage from the local models. A secure aggregation protocol enables the computation of the aggregation operation in (4.2) while ensuring that the server learns no information about the local models $x_i(t)$ beyond their aggregated value $\sum_{i=1}^{N} x_i(t)$. In this paper, our focus is on the aggregation phase in (4.2) and how to make this aggregation phase secure and efficient. In particular, our goal is to evaluate the aggregate of the local models

$$z = \sum_{i \in \mathcal{U}} x_i, \qquad (4.3)$$

where we omit the iteration index $t$ for simplicity. As we discuss in Section 4.3.3 in detail, secure aggregation protocols build on cryptographic primitives that require all operations to be carried out over a finite field. Accordingly, similar to prior works [16, 17], we assume that the elements of $x_i^{(l)}$ and $z$ are from a finite field $\mathbb{F}_q$ for some field size $q$.

We evaluate the performance of a secure aggregation protocol for federated learning through the following key parameters.

1. Robustness guarantee: We consider a network model in which each user can drop from the network with a probability $p \in [0, 1]$, called the user dropout rate. In a real-world setting, the dropout rate varies between 0.06 and 0.1 [14].
The robustness guarantee quantifies the maximum user dropout rate that a protocol can tolerate, with probability approaching 1 as $N \to \infty$, while correctly evaluating the aggregate of the surviving user models.

2. Privacy guarantee: We consider a security model where the users and the server are honest but curious. We assume that up to $T$ users can collude with each other, as well as with the server, to learn the models of other users. The privacy guarantee quantifies the maximum number of colluding entities that the protocol can tolerate while keeping the individual user models private.

3. Aggregation overhead: The aggregation overhead, denoted by $C$, quantifies the asymptotic time complexity (i.e., runtime) with respect to the number of mobile users, $N$, for aggregating the models of all users in the network. Note that this includes both the computation and communication time complexities.

4.3.3 State-of-the-art for Secure Aggregation

The state-of-the-art for secure aggregation is the protocol proposed in [17]. In this protocol, the privacy of individual models is protected by pairwise random masking, where users create pairwise keys through a key exchange protocol, such as [39]. To do so, each pair of users $u, v \in [N]$ first agree on a pairwise random seed $s_{u,v}$. In addition, user $u \in [N]$ also creates a private random seed $b_u$. The role of $b_u$ is to prevent the privacy breaches that may occur if user $u$ is only delayed instead of dropped, in which case the pairwise masks alone are not sufficient for privacy protection. User $u \in [N]$ then creates a masked version of its model $x_u$, given by

$$y_u = x_u + \mathrm{PRG}(b_u) + \sum_{v:\, u<v} \mathrm{PRG}(s_{u,v}) - \sum_{v:\, u>v} \mathrm{PRG}(s_{v,u}), \qquad (4.4)$$

where PRG is a pseudo random generator, and sends it to the server. Finally, user $u$ secret shares $b_u$ as well as $\{s_{u,v}\}_{v \in [N]}$ with the other users, via Shamir's secret sharing.
Using the collected shares, the server reconstructs the private seed of each surviving user, and the pairwise seeds of each dropped user, to be removed from the aggregate of the masked models. The server then computes the aggregated model,

$$z = \sum_{u \in \mathcal{U}} \big(y_u - \mathrm{PRG}(b_u)\big) - \sum_{u \in \mathcal{D}} \Big( \sum_{v:\, u<v} \mathrm{PRG}(s_{u,v}) - \sum_{v:\, u>v} \mathrm{PRG}(s_{v,u}) \Big) = \sum_{u \in \mathcal{U}} x_u, \qquad (4.5)$$

where $\mathcal{D}$ and $\mathcal{U}$ denote the sets of dropped and surviving users, respectively.

This state-of-the-art protocol achieves a robustness guarantee to a user dropout rate of up to $p = 0.5$, while providing a privacy guarantee against up to $T = \frac{N}{2}$ colluding users.† However, its aggregation overhead is quadratic in the number of users (i.e., $C = O(N^2)$). This quadratic aggregation overhead results from the fact that the server has to reconstruct and remove the pairwise random masks corresponding to dropped users. More specifically, in order to recover the random masks of dropped users, the server has to execute a pseudo random generator based on the recovered seeds $s_{u,v}$, which has a quadratic computation overhead as the number of pairwise masks is quadratic in the number of users; this dominates the overall time consumed in the protocol. This quadratic aggregation overhead severely limits the network size for real-world applications [14].

† This is achieved by setting the parameter $t = \frac{N}{2} + 1$ in Shamir's $t$-out-of-$N$ secret sharing protocol. This allows each user to split its own random seeds into $N$ shares such that any $t$ shares can be used to reconstruct the seeds, but any set of at most $t-1$ shares reveals no information about the seeds.

Our goal in this paper is to develop a secure aggregation protocol that can provide comparable robustness and privacy guarantees as the state-of-the-art, while achieving a significantly lower (almost linear) aggregation overhead.
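To make the mask cancellation in (4.4)–(4.5) concrete, the following is a minimal sketch of the dropout-free case. The toy field size `Q`, the stand-in `prg` function, and the network dimensions are illustrative assumptions, not the construction of [17].

```python
import numpy as np

Q = 2**31 - 1   # illustrative prime field size q (not the value used in [17])
N, d = 4, 3     # toy network: 4 users, model dimension 3

def prg(seed):
    """Stand-in PRG: deterministically expand an integer seed into F_q^d."""
    return np.random.default_rng(int(seed)).integers(0, Q, size=d)

rng = np.random.default_rng(2022)
x = rng.integers(0, Q, size=(N, d))   # local models x_u
b = rng.integers(0, Q, size=N)        # private seeds b_u
s = rng.integers(0, Q, size=(N, N))   # pairwise seeds s_{u,v} (only u < v used)

def masked_update(u):
    """Eq. (4.4): x_u + PRG(b_u) + sum_{v>u} PRG(s_{u,v}) - sum_{v<u} PRG(s_{v,u})."""
    y = (x[u] + prg(b[u])) % Q
    for v in range(N):
        if u < v:
            y = (y + prg(s[u, v])) % Q
        elif u > v:
            y = (y - prg(s[v, u])) % Q
    return y

# With no dropouts, the pairwise masks cancel in the sum, and the server
# only needs to remove the reconstructed private masks PRG(b_u), as in (4.5).
agg = sum(masked_update(u) for u in range(N)) % Q
z = (agg - sum(prg(b[u]) for u in range(N))) % Q
assert np.array_equal(z, x.sum(axis=0) % Q)
```

The quadratic overhead discussed above is visible even in this sketch: handling a dropped user would require re-deriving one PRG expansion per (dropped, surviving) pair.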
4.4 The Turbo-Aggregate Protocol

We now introduce the Turbo-Aggregate protocol for secure federated learning, which can simultaneously achieve a robustness guarantee to a user dropout rate of up to $p = 0.5$, a privacy guarantee against up to $T = \frac{N}{2}$ colluding users, and an aggregation overhead of $C = O(N \log N)$. Turbo-Aggregate is composed of three main components. First, it creates a multi-group circular aggregation structure for fast model aggregation. Second, it leverages additive secret sharing by adding randomness in a way that can be cancelled out once the models are aggregated, in order to guarantee the privacy of the users. Third, it adds aggregation redundancy via Lagrange polynomials in the model updates that are passed from one group to the next, so that the added redundancy can be exploited to reconstruct the aggregated model amidst potential user dropouts. We now describe each of these components in detail, present the overall Turbo-Aggregate protocol, and finally provide an illustrative example of Turbo-Aggregate.

4.4.1 Multi-group circular aggregation

Turbo-Aggregate computes the aggregate of the individual user models by utilizing a circular aggregation strategy. Given a mobile network with $N$ users, this is done by first partitioning the users into $L$ groups as shown in Figure 4.2, with $N_l$ users in group $l \in [L]$, such that $\sum_{l \in [L]} N_l = N$. We consider a random partitioning strategy in which each user is assigned to one of the available groups uniformly at random, by using a bias-resistant public randomness generation protocol such as in [137]. We use $\mathcal{U}_l \subseteq [N_l]$ to represent the set of users that complete their part in the protocol (surviving users), and $\mathcal{D}_l = [N_l] \setminus \mathcal{U}_l$ to denote the set of dropped users.‡

Figure 4.2: Network topology with $N$ users partitioned into $L$ groups, with $N_l$ users in group $l \in [L]$.
We use $x_i^{(l)}$ to denote the local model of user $i$ in group $l \in [L]$, which is a vector of dimension $d$ that corresponds to the parameters of their locally trained model. Then, we can rewrite (4.3) as

$$z = \sum_{l \in [L]} \sum_{i \in \mathcal{U}_l} x_i^{(l)}. \qquad (4.6)$$

The elements of $x_i^{(l)}$ and $z$ are from a finite field $\mathbb{F}_q$ for some field size $q$. All operations are carried out over the finite field, and we omit the modulo-$q$ operation for simplicity.

The dashed links in Figure 4.2 represent the communication links between the server and mobile users. In our general description, we assume that all communication takes place through a central server, via creating pairwise secure keys using a Diffie-Hellman type key exchange protocol [39] as in [17]. Turbo-Aggregate can also use decentralized communication architectures with direct links between devices, such as peer-to-peer communication, where users can communicate directly through an underlay communication network [82, 66]. Then, the aggregation steps are the same as in the centralized setting, except that messages are now communicated via direct links between the users, and a random election algorithm should be carried out to select one user (or multiple users, depending on the application) to aggregate the final sum at the final stage instead of the server. The detailed process of the final stage will be explained in Section 4.4.4.

‡ For modeling the user dropouts, we focus on the worst-case scenario, which is the case when a user drops during the execution of the corresponding group, i.e., when a user receives messages from the previous group but fails to propagate them to the next group.

Turbo-Aggregate consists of $L$ execution stages performed sequentially. At stage $l \in [L]$, users in group $l$ encode their inputs, including their trained models and the partial summation of the models from lower stages, and send them to users in group $l+1$.
Next, users in group $l+1$ recover (decode) the missing information due to potentially dropped users, and then aggregate the received messages. At the end of the protocol, the models of all surviving users will have been aggregated. The proposed coding and aggregation mechanism guarantees that no party (mobile users or the server) can learn an individual model, or a partial aggregate of a subset of models. The server learns nothing but the final aggregated model of all surviving users. This is achieved by leveraging additive secret sharing to mask the individual models, which we describe in the following.

4.4.2 Masking with additive secret sharing

Turbo-Aggregate hides the individual user models using additive masks to protect their privacy against potential collusions between the interacting parties. This is done by a two-step procedure. In the first step, the server sends a random mask to each user, denoted by a random vector $u_i^{(l)}$ for user $i \in [N_l]$ in group $l \in [L]$. Each user then masks its local model $x_i^{(l)}$ as $x_i^{(l)} + u_i^{(l)}$. Since this random mask is known only by the server and the corresponding user, it protects the privacy of each user against potential collusions between any subset of the remaining users, as long as the server is honest. On the other hand, privacy may be breached if the server is adversarial and colludes with a subset of users. The second step of Turbo-Aggregate aims at protecting user privacy against such scenarios.

In this second step, users generate additive secret shares of the individual models for privacy protection against potential collusions between the server and the users. To do so, user $i$ in group $l$ sends a masked version of its local model to each user $j$ in group $l+1$, given by

$$\tilde{x}_{i,j}^{(l)} = x_i^{(l)} + u_i^{(l)} + r_{i,j}^{(l)}, \qquad (4.7)$$

for $j \in [N_{l+1}]$, where $r_{i,j}^{(l)}$ is a random vector such that $\sum_{j \in [N_{l+1}]} r_{i,j}^{(l)} = 0$ for all $i \in [N_l]$.
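The zero-sum randomness in (4.7) can be sketched as follows. This is a minimal illustration under assumed toy parameters (field size `Q`, group sizes, and the helper name `zero_sum_shares` are all illustrative); averaging over the next group is shown as multiplication by a modular inverse, since all operations are in $\mathbb{F}_q$.

```python
import numpy as np

Q = 2**31 - 1                      # illustrative prime field size q
rng = np.random.default_rng(0)

def zero_sum_shares(n_next, d):
    """Zero-sum randomness for Eq. (4.7): vectors r_{i,j}, j = 1..n_next,
    in F_q^d with sum_j r_{i,j} = 0 (mod q), so they cancel on aggregation."""
    r = rng.integers(0, Q, size=(n_next, d))
    r[-1] = (-r[:-1].sum(axis=0)) % Q      # last share forces the zero sum
    return r

d, n_next = 3, 5
x_i = rng.integers(0, Q, size=d)           # local model x_i^(l)
u_i = rng.integers(0, Q, size=d)           # server-generated mask u_i^(l)
r = zero_sum_shares(n_next, d)
shares = (x_i + u_i + r) % Q               # the shares tilde-x_{i,j}^(l)

# Averaging the shares over the next group (division = modular inverse in
# F_q) recovers the singly masked model x_i + u_i, as the r's cancel:
inv_n = pow(n_next, -1, Q)
total = shares.sum(axis=0).astype(object)  # Python ints: avoid int64 overflow
recovered = (total * inv_n) % Q
assert np.array_equal(recovered, (x_i + u_i) % Q)
```

Each individual share is uniformly random, so any proper subset of the next group learns nothing about $x_i^{(l)}$; only the full collection cancels down to the singly masked model.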
The role of additive secret sharing is not only to mask the model to provide privacy against collusions between the server and the users, but also to maintain the accuracy of aggregation, by making the sum of the received data over the users in each group equal to the original data, as the vectors $r_{i,j}^{(l)}$ cancel out.

In addition, each user holds a variable corresponding to the aggregated masked models from the previous group. For user $i$ in group $l$, this variable is represented by $\tilde{s}_i^{(l)}$. At each stage of Turbo-Aggregate, users in the active group update and propagate these variables to the next group. Aggregation of these masked models is defined via the recursive relation

$$\tilde{s}_i^{(l)} = \frac{1}{N_{l-1}} \sum_{j \in [N_{l-1}]} \tilde{s}_j^{(l-1)} + \sum_{j \in \mathcal{U}_{l-1}} \tilde{x}_{j,i}^{(l-1)} \qquad (4.8)$$

at user $i$ in group $l > 1$, whereas the initial aggregation at group $l = 1$ is set as $\tilde{s}_i^{(1)} = 0$ for $i \in [N_1]$. While computing (4.8), any missing values in $\{\tilde{s}_j^{(l-1)}\}_{j \in [N_{l-1}]}$ (due to the users dropped in group $l-1$) are reconstructed via the recovery technique presented in Section 4.4.3. User $i$ in group $l$ then sends the aggregated value in (4.8) to each user in group $l+1$.

The average of the aggregated values from the users in group $l$ consists of the models of the users up to group $l-1$, masked by the randomness sent from the server. This can be observed by defining the following partial summation, which can be computed by each user in group $l+1$:

$$s^{(l+1)} = \frac{1}{N_l} \sum_{i \in [N_l]} \tilde{s}_i^{(l)} = \frac{1}{N_{l-1}} \sum_{j \in [N_{l-1}]} \tilde{s}_j^{(l-1)} + \sum_{j \in \mathcal{U}_{l-1}} x_j^{(l-1)} + \sum_{j \in \mathcal{U}_{l-1}} u_j^{(l-1)} \qquad (4.9)$$

$$= s^{(l)} + \sum_{j \in \mathcal{U}_{l-1}} x_j^{(l-1)} + \sum_{j \in \mathcal{U}_{l-1}} u_j^{(l-1)}, \qquad (4.10)$$

where (4.9) follows from $\sum_{j \in [N_l]} r_{i,j}^{(l-1)} = 0$. With the initial partial summation $s^{(2)} = \frac{1}{N_1} \sum_{i \in [N_1]} \tilde{s}_i^{(1)} = 0$, one can show that $s^{(l+1)}$ is equal to the aggregate of the models of all surviving users up to group $l-1$, masked by the randomness sent from the server,
$$s^{(l+1)} = \sum_{m \in [l-1]} \sum_{j \in \mathcal{U}_m} x_j^{(m)} + \sum_{m \in [l-1]} \sum_{j \in \mathcal{U}_m} u_j^{(m)}. \qquad (4.11)$$

At the final stage, the server obtains the final aggregate value from (4.11) and removes the random masks $\sum_{m \in [L]} \sum_{j \in \mathcal{U}_m} u_j^{(m)}$. This approach works well if no user drops out during the execution of the protocol. On the other hand, if any user in group $l+1$ drops out, the random vectors masking the models of the $l$-th group in the summation (4.9) cannot be cancelled out. In the following, we propose a recovery technique that is robust to dropped or delayed users, based on coding-theoretic principles.

4.4.3 Adding redundancies to recover the data of dropped or delayed users

The main intuition behind our recovery strategy is to encode the additive secret shares (masked models) in a way that guarantees secure aggregation when users are dropped or delayed. To do so, we leverage Lagrange coding [156], which has been applied to other problems such as offloading or collaborative machine learning in a privacy-preserving manner [130, 128]. The primary benefit of Lagrange coding over alternative codes that may also be used for introducing redundancy, such as other error-correcting codes, is that Lagrange coding enables us to perform the aggregation operation on the encoded models, and that the final result can be decoded from the computations performed on the encoded models. This is not necessarily true for other error-correcting codes, as they do not guarantee the recovery of the original computation results (i.e., the computations performed on the true values of the model parameters) from the computations performed on the encoded models. Lagrange coding encodes a given set of $K$ vectors $(v_1, \ldots, v_K)$ by using a Lagrange interpolation polynomial. One can view this as embedding the given set of vectors on a Lagrange polynomial, such that each encoded value represents a point on the polynomial.
The resulting encoding enables a set of users to compute a given polynomial function $h$ on the encoded data in a way that any individual computation $\{h(v_i)\}_{i \in [K]}$ can be reconstructed using any subset of $\deg(h)(K-1)+1$ other computations. The reconstruction is done through polynomial interpolation. Therefore, one can reconstruct any missing value as long as a sufficient number of other computations are available, i.e., enough points are available to interpolate the polynomial. In our problem of gradient aggregation, the function of interest, $h$, is linear and accordingly has degree 1, since it corresponds to the summation of all individual gradient vectors.

Turbo-Aggregate utilizes Lagrange coding for recovery against user dropouts, via a novel strategy that encodes the secret shared values to compute the secure aggregate. More specifically, in Turbo-Aggregate, the encoding is performed as follows. Initially, user $i$ in group $l$ forms a Lagrange interpolation polynomial $f_i^{(l)}: \mathbb{F}_q \to \mathbb{F}_q^d$ of degree $N_{l+1}-1$ such that $f_i^{(l)}(\alpha_j^{(l+1)}) = \tilde{x}_{i,j}^{(l)}$ for $j \in [N_{l+1}]$, where $\alpha_j^{(l+1)}$ is an evaluation point allocated to user $j$ in group $l+1$. This is accomplished by letting

$$f_i^{(l)}(z) = \sum_{j \in [N_{l+1}]} \tilde{x}_{i,j}^{(l)} \cdot \prod_{k \in [N_{l+1}] \setminus \{j\}} \frac{z - \alpha_k^{(l+1)}}{\alpha_j^{(l+1)} - \alpha_k^{(l+1)}}.$$

Then, another set of $N_{l+1}$ distinct evaluation points $\{\beta_j^{(l+1)}\}_{j \in [N_{l+1}]}$ is allocated from $\mathbb{F}_q$ such that $\{\beta_j^{(l+1)}\}_{j \in [N_{l+1}]} \cap \{\alpha_j^{(l+1)}\}_{j \in [N_{l+1}]} = \emptyset$. Next, user $i \in [N_l]$ in group $l$ generates the encoded model

$$\bar{x}_{i,j}^{(l)} = f_i^{(l)}(\beta_j^{(l+1)}), \qquad (4.12)$$

and sends $\bar{x}_{i,j}^{(l)}$ to user $j$ in group $l+1$. In addition, user $i \in [N_l]$ in group $l$ aggregates the encoded models $\{\bar{x}_{j,i}^{(l-1)}\}_{j \in \mathcal{U}_{l-1}}$ received from the previous stage, with the partial summation $s^{(l)}$ from (4.9), as

$$\bar{s}_i^{(l)} = s^{(l)} + \sum_{j \in \mathcal{U}_{l-1}} \bar{x}_{j,i}^{(l-1)}. \qquad (4.13)$$
The summation of the masked models in (4.8) and the summation of the coded models in (4.13) can be viewed as evaluations of a polynomial $g^{(l)}$ such that

$$\tilde{s}_i^{(l)} = g^{(l)}(\alpha_i^{(l)}), \qquad (4.14)$$
$$\bar{s}_i^{(l)} = g^{(l)}(\beta_i^{(l)}), \qquad (4.15)$$

for $i \in [N_l]$, where $g^{(l)}(z) = s^{(l)} + \sum_{j \in \mathcal{U}_{l-1}} f_j^{(l-1)}(z)$ is a polynomial of degree at most $N_l - 1$. Then, user $i \in [N_l]$ sends the set of messages $\{\tilde{x}_{i,j}^{(l)}, \bar{x}_{i,j}^{(l)}, \tilde{s}_i^{(l)}, \bar{s}_i^{(l)}\}$ to user $j$ in group $l+1$.

Upon receiving the messages, user $j$ in group $l+1$ reconstructs the missing terms in $\{\tilde{s}_i^{(l)}\}_{i \in [N_l]}$ (caused by the dropped users in group $l$), computes the partial sum $s^{(l+1)}$ from (4.9), and updates the terms $\{\tilde{s}_j^{(l+1)}, \bar{s}_j^{(l+1)}\}$ as in (4.8) and (4.13). Users in group $l+1$ can reconstruct each term in $\{\tilde{s}_i^{(l)}\}_{i \in [N_l]}$ as long as they receive at least $N_l$ evaluations out of the $2N_l$ evaluations from the users in group $l$. This is because $\{\tilde{s}_i^{(l)}, \bar{s}_i^{(l)}\}_{i \in [N_l]}$ are evaluations of the polynomial $g^{(l)}$, whose degree is at most $N_l - 1$. As a result, the model can be aggregated at each stage as long as at least half of the users at that stage have not dropped. As we will demonstrate in the proof of Theorem 4.1, as long as the dropout rate of the users is below 50%, the fraction of dropped users at all stages will be below half with high probability; hence Turbo-Aggregate can proceed with model aggregation at each stage.

4.4.4 Final aggregation and the overall Turbo-Aggregate protocol

For the final aggregation, we need a dummy stage to securely compute the aggregation of all user models, especially for the privacy of the local models of users in group $L$. To do so, we arbitrarily select a set of users who will receive and aggregate the models sent from the users in group $L$. They can be any surviving users who have participated in the protocol, and will be called user $j \in [N_{\text{final}}]$ in the final stage, where $N_{\text{final}}$ is the number of users selected.
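As a concrete illustration of the recovery step from Section 4.4.3, the following sketch performs the interpolation of (4.14)–(4.15) in the scalar case ($d = 1$). The toy prime field, the polynomial $g$, and the $\alpha$/$\beta$ points are illustrative assumptions; the protocol's actual evaluation points are simply required to be distinct and disjoint.

```python
# Scalar (d = 1) sketch over a toy prime field. With N_l = 3 survivorship
# evaluations of a degree-2 polynomial, any 3 of the 2*N_l = 6 points
# determine g, so a dropped user's value g(alpha) can be reconstructed.
Q = 2**31 - 1   # Mersenne prime, so pow(x, -1, Q) gives field inverses

def interpolate(points, z):
    """Evaluate at z the unique polynomial of degree len(points)-1 passing
    through the given (x, y) pairs, over F_Q (Lagrange interpolation)."""
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (z - xj) % Q
                den = den * (xi - xj) % Q
        total = (total + yi * num * pow(den, -1, Q)) % Q
    return total

# Toy group-level polynomial g^(l)(z) = 7 + 5z + 3z^2 (degree N_l - 1 = 2).
g = lambda z: (7 + 5 * z + 3 * z * z) % Q
alphas, betas = [1, 2, 3], [4, 5, 6]          # disjoint evaluation points

# User 3 drops: its masked sum g(alpha_3) is missing, but the two survivors
# still forwarded {g(alpha_1), g(beta_1), g(alpha_2), g(beta_2)}.
received = [(alphas[0], g(alphas[0])), (betas[0], g(betas[0])),
            (alphas[1], g(alphas[1])), (betas[1], g(betas[1]))]
recovered = interpolate(received[:3], alphas[2])   # any 3 points suffice
assert recovered == g(alphas[2])
```

This mirrors the condition in the text: with $2(N_l - D_l) \ge N_l$ surviving evaluations, polynomial interpolation restores every missing $\tilde{s}_i^{(l)}$.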
During this phase, users in group $L$ mask their own models with additive secret sharing by using (4.7), generate the encoded data by using (4.12), and aggregate the models received from the users in group $L-1$ by using (4.8) and (4.13). Then, user $i$ from group $L$ sends $\{\tilde{x}_{i,j}^{(L)}, \bar{x}_{i,j}^{(L)}, \tilde{s}_i^{(L)}, \bar{s}_i^{(L)}\}$ to user $j$ in the final stage. Upon receiving the set of messages, user $j \in [N_{\text{final}}]$ in the final stage recovers the missing terms in $\{\tilde{s}_i^{(L)}\}_{i \in [N_L]}$, and aggregates them with the masked models,

$$\tilde{s}_j^{(\text{final})} = \frac{1}{N_L} \sum_{i \in [N_L]} \tilde{s}_i^{(L)} + \sum_{i \in \mathcal{U}_L} \tilde{x}_{i,j}^{(L)}, \qquad (4.16)$$
$$\bar{s}_j^{(\text{final})} = \frac{1}{N_L} \sum_{i \in [N_L]} \tilde{s}_i^{(L)} + \sum_{i \in \mathcal{U}_L} \bar{x}_{i,j}^{(L)}, \qquad (4.17)$$

and sends the resulting $\{\tilde{s}_j^{(\text{final})}, \bar{s}_j^{(\text{final})}\}$ to the server. The server then recovers the summations $\{\tilde{s}_j^{(\text{final})}\}_{j \in [N_{\text{final}}]}$, by reconstructing any missing terms in (4.16) using the set of received values (4.16) and (4.17). Finally, the server computes

Algorithm 2 Turbo-Aggregate
input: Local models $x_i^{(l)}$ of users $i \in [N_l]$ in group $l \in [L]$.
output: Aggregated model $\sum_{l \in [L]} \sum_{i \in \mathcal{U}_l} x_i^{(l)}$.
1: for group $l = 1, \ldots, L$ do
2:   for user $i = 1, \ldots, N_l$ do
3:     Compute the masked models $\{\tilde{x}_{i,j}^{(l)}\}_{j \in [N_{l+1}]}$ from (4.7).
4:     Generate the encoded models $\{\bar{x}_{i,j}^{(l)}\}_{j \in [N_{l+1}]}$ from (4.12).
5:     if $l = 1$ then
6:       Initialize $\tilde{s}_i^{(1)} = \bar{s}_i^{(1)} = 0$.
7:     else
8:       Reconstruct the missing values in $\{\tilde{s}_k^{(l-1)}\}_{k \in [N_{l-1}]}$ due to the dropped users in group $l-1$.
9:       Update the aggregate value $\tilde{s}_i^{(l)}$ from (4.8).
10:      Compute the coded aggregate value $\bar{s}_i^{(l)}$ from (4.13).
11:    end if
12:    Send $\{\tilde{x}_{i,j}^{(l)}, \bar{x}_{i,j}^{(l)}, \tilde{s}_i^{(l)}, \bar{s}_i^{(l)}\}$ to user $j \in [N_{l+1}]$ in group $l+1$ ($j \in [N_{\text{final}}]$ if $l = L$).
13:  end for
14: end for
15: for user $i = 1, \ldots, N_{\text{final}}$ do
16:   Reconstruct the missing values in $\{\tilde{s}_k^{(L)}\}_{k \in [N_L]}$ due to the dropped users in group $L$.
17:   Compute $\tilde{s}_i^{(\text{final})}$ from (4.16) and $\bar{s}_i^{(\text{final})}$ from (4.17).
18:   Send $\{\tilde{s}_i^{(\text{final})}, \bar{s}_i^{(\text{final})}\}$ to the server.
19: end for
20: The server computes the final aggregated model from (4.18).

the average of the summations from (4.16), and removes the random masks $\sum_{m \in [L]} \sum_{j \in \mathcal{U}_m} u_j^{(m)}$ from the aggregate, which, as can be observed from (4.9)–(4.11), is equal to the aggregate of the individual models of all surviving users:

$$\frac{1}{N_{\text{final}}} \sum_{j \in [N_{\text{final}}]} \tilde{s}_j^{(\text{final})} - \sum_{m \in [L]} \sum_{j \in \mathcal{U}_m} u_j^{(m)} = \sum_{m \in [L]} \sum_{j \in \mathcal{U}_m} x_j^{(m)}. \qquad (4.18)$$

Having all the above steps, the overall Turbo-Aggregate protocol is presented in Algorithm 2.

4.5 Illustrative Example

We next demonstrate the execution of Turbo-Aggregate through an illustrative example. Consider the network in Figure 4.3 with $N = 9$ users partitioned into $L = 3$ groups with $N_l = 3$ ($l \in [3]$) users per group, and assume that user 3 in group 2 drops during protocol execution.

Figure 4.3: Example with $N = 9$ users and $L = 3$ groups, with 3 users per group. User 3 in group 2 drops during protocol execution.

In the first stage, user $i \in [3]$ in group 1 masks its model $x_i^{(1)}$ using additive masking as in (4.7) and computes $\{\tilde{x}_{i,j}^{(1)}\}_{j \in [3]}$. Then, the user generates the encoded models $\{\bar{x}_{i,j}^{(1)}\}_{j \in [3]}$ by using the Lagrange polynomial from (4.12). Finally, the user initializes $\tilde{s}_i^{(1)} = \bar{s}_i^{(1)} = 0$, and sends $\{\tilde{x}_{i,j}^{(1)}, \bar{x}_{i,j}^{(1)}, \tilde{s}_i^{(1)}, \bar{s}_i^{(1)}\}$ to user $j \in [3]$ in group 2. Figure 4.4 demonstrates this stage for one user.

In the second stage, user $j \in [3]$ in group 2 generates the masked models $\{\tilde{x}_{j,k}^{(2)}\}_{k \in [3]}$ and the coded models $\{\bar{x}_{j,k}^{(2)}\}_{k \in [3]}$, by using (4.7) and (4.12), respectively. Next, the user aggregates the messages received from group 1, by computing $\tilde{s}_j^{(2)} = \frac{1}{3} \sum_{i \in [3]} \tilde{s}_i^{(1)} + \sum_{i \in [3]} \tilde{x}_{i,j}^{(1)}$ and $\bar{s}_j^{(2)} = \frac{1}{3} \sum_{i \in [3]} \tilde{s}_i^{(1)} + \sum_{i \in [3]} \bar{x}_{i,j}^{(1)}$. Figure 4.5 shows this aggregation phase for one user. Finally, user $j$ sends $\{\tilde{x}_{j,k}^{(2)}, \bar{x}_{j,k}^{(2)}, \tilde{s}_j^{(2)}, \bar{s}_j^{(2)}\}$ to user $k \in [3]$ in group 3.
However, user 3 (in group 2) drops out during the execution of this stage and fails to complete its part.

In the third stage, user $k \in [3]$ in group 3 generates the masked models $\{\tilde{x}_{k,t}^{(3)}\}_{t \in [3]}$ and the coded models $\{\bar{x}_{k,t}^{(3)}\}_{t \in [3]}$. Then, the user runs a recovery phase due to the dropped user in group 2. This is facilitated by the Lagrange coding technique. Specifically, user $k$ can decode the missing value $\tilde{s}_3^{(2)} = g^{(2)}(\alpha_3^{(2)})$ of the dropped user, by using the four evaluations $\{\tilde{s}_1^{(2)}, \bar{s}_1^{(2)}, \tilde{s}_2^{(2)}, \bar{s}_2^{(2)}\} = \{g^{(2)}(\alpha_1^{(2)}), g^{(2)}(\beta_1^{(2)}), g^{(2)}(\alpha_2^{(2)}), g^{(2)}(\beta_2^{(2)})\}$ received from the remaining users in group 2.

Figure 4.4: Illustration of the computations performed by user 1 in group 1, who then sends $\{\tilde{x}_{1,j}^{(1)}, \bar{x}_{1,j}^{(1)}, \tilde{s}_1^{(1)}, \bar{s}_1^{(1)}\}$ to user $j \in [3]$ in group 2 (using pairwise keys through the server).

Then, user $k$ aggregates the received and reconstructed values by computing $\tilde{s}_k^{(3)} = \frac{1}{3} \sum_{j \in [3]} \tilde{s}_j^{(2)} + \sum_{j \in [2]} \tilde{x}_{j,k}^{(2)}$ and $\bar{s}_k^{(3)} = \frac{1}{3} \sum_{j \in [3]} \tilde{s}_j^{(2)} + \sum_{j \in [2]} \bar{x}_{j,k}^{(2)}$.

In the final stage, Turbo-Aggregate selects a set of surviving users to aggregate the models of group 3. Without loss of generality, we assume these are the users in group 1. Next, user $k \in [3]$ from group 3 sends $\{\tilde{x}_{k,t}^{(3)}, \bar{x}_{k,t}^{(3)}, \tilde{s}_k^{(3)}, \bar{s}_k^{(3)}\}$ to user $t \in [3]$ in the final group. Then, user $t$ computes the aggregates $\tilde{s}_t^{(\text{final})} = \frac{1}{3} \sum_{k \in [3]} \tilde{s}_k^{(3)} + \sum_{k \in [3]} \tilde{x}_{k,t}^{(3)}$ and $\bar{s}_t^{(\text{final})} = \frac{1}{3} \sum_{k \in [3]} \tilde{s}_k^{(3)} + \sum_{k \in [3]} \bar{x}_{k,t}^{(3)}$, and sends $\{\tilde{s}_t^{(\text{final})}, \bar{s}_t^{(\text{final})}\}$ to the server. Finally, the server computes the average of the summations received from the final group and removes the added randomness, which is equal to the aggregate of the original models of the surviving users:
$$\frac{1}{3} \sum_{t \in [3]} \tilde{s}_t^{(\text{final})} - \sum_{i \in [3]} u_i^{(1)} - \sum_{j \in [2]} u_j^{(2)} - \sum_{k \in [3]} u_k^{(3)} = \sum_{i \in [3]} x_i^{(1)} + \sum_{j \in [2]} x_j^{(2)} + \sum_{k \in [3]} x_k^{(3)}.$$

4.6 Theoretical Guarantees of Turbo-Aggregate

In this section, we formally state our main theoretical result.

Figure 4.5: The aggregation phase for user 1 in group 2. After receiving the set of messages $\{\tilde{x}_{i,1}^{(1)}, \bar{x}_{i,1}^{(1)}, \tilde{s}_i^{(1)}, \bar{s}_i^{(1)}\}_{i \in [3]}$ from the previous stage, the user computes the aggregated values $\tilde{s}_1^{(2)} = \frac{1}{3} \sum_{i \in [3]} \tilde{s}_i^{(1)} + \sum_{i \in [3]} \tilde{x}_{i,1}^{(1)}$ and $\bar{s}_1^{(2)} = \frac{1}{3} \sum_{i \in [3]} \tilde{s}_i^{(1)} + \sum_{i \in [3]} \bar{x}_{i,1}^{(1)}$ (note that this is an aggregation of the masked values).

Theorem 4.1. Turbo-Aggregate can simultaneously achieve:

1. a robustness guarantee to any user dropout rate $p < 0.5$, with probability approaching 1 as the number of users $N \to \infty$,
2. a privacy guarantee against up to $T = (0.5 - \epsilon)N$ colluding users, with probability approaching 1 as the number of users $N \to \infty$, for any $\epsilon > 0$, and
3. an aggregation overhead of $C = O(N \log N)$.

Remark 4.1. Theorem 4.1 states that Turbo-Aggregate can tolerate up to a 50% user dropout rate and $\frac{N}{2}$ collusions between the users, simultaneously. Turbo-Aggregate can guarantee robustness against an even higher number of user dropouts by sacrificing the privacy guarantee as a trade-off. Specifically, when we generate and communicate $k$ sets of evaluation points during Lagrange coding, we can recover the partial aggregations by decoding the polynomial in (4.14) as long as each user receives $N_l$ evaluations, i.e., $(1+k)(N_l - pN_l) \geq N_l$. As a result, Turbo-Aggregate can tolerate up to a $p < \frac{k}{1+k}$ user dropout rate. On the other hand, the individual models will be revealed whenever
In this case, one can guarantee privacy against up to $(\frac{1}{k+1}-\epsilon)N$ colluding users for any $\epsilon > 0$. This demonstrates a trade-off between the robustness and privacy guarantees achieved by Turbo-Aggregate: one can increase the robustness guarantee by reducing the privacy guarantee, and vice versa.

Proof. We now state the proof of Theorem 4.1. As described in Section 4.4.1, Turbo-Aggregate first employs an unbiased random partitioning strategy to allocate the $N$ users into $L = \frac{N}{N_l}$ groups with a group size of $N_l$ for all $l \in [L]$. To prove the theorem, we choose $N_l = \frac{1}{c}\log N$, where $c \triangleq \min\{D(0.5\|p), D(0.5\|\frac{T}{N})\}$ and $D(a\|b)$ is the Kullback-Leibler (KL) distance between two Bernoulli distributions with parameters $a$ and $b$ (see, e.g., page 19 of [33]). We now prove each part of the theorem separately.

(Robustness guarantee) In order to compute the aggregate of the user models, Turbo-Aggregate requires the partial summation in (4.10) to be recovered by the users in group $l \in [L]$. This requires users in each group $l \in [L]$ to be able to reconstruct the term $\tilde{s}^{(l-1)}_i$ of all the dropped users in group $l-1$. This is facilitated by our encoding procedure as follows. At stage $l$, each user in group $l+1$ receives $2(N_l - D_l)$ evaluations of the polynomial $g^{(l)}$, where $D_l$ is the number of users dropped in group $l$. Since the degree of $g^{(l)}$ is at most $N_l - 1$, one can reconstruct $\tilde{s}^{(l)}_i$ for all $i \in [N_l]$ using polynomial interpolation, as long as $2(N_l - D_l) \geq N_l$. If the condition $2(N_l - D_l) \geq N_l$ is satisfied for all $l \in [L]$, Turbo-Aggregate can compute the aggregate of the user models. Therefore, the probability that Turbo-Aggregate provides a robustness guarantee against user dropouts is given by

$$P[\text{robustness}] = P\Big[\bigcap_{l\in[L]} \{D_l \leq \tfrac{N_l}{2}\}\Big] = 1 - P\Big[\bigcup_{l\in[L]} \{D_l \geq \lfloor\tfrac{N_l}{2}\rfloor + 1\}\Big]. \quad (4.19)$$

We now show that this probability goes to 1 as $N \to \infty$. $D_l$ follows a binomial distribution with parameters $N_l$ and $p$.
When $p < 0.5$, its tail probability is bounded as

$$P\big[D_l \geq \lfloor\tfrac{N_l}{2}\rfloor + 1\big] \leq \exp(-c_p N_l), \quad (4.20)$$

where $c_p = D\big(\frac{\lfloor N_l/2\rfloor + 1}{N_l}\big\|p\big)$ [71]. When $N_l$ is sufficiently large, $c_p \approx D(0.5\|p)$. Note that $c_p > 0$ is a positive constant for any $p < 0.5$, since by definition the KL distance is non-negative and equal to 0 if and only if $p = 0.5$ (Theorem 2.6.3 of [33]). Then, using a union bound, the probability of failure can be bounded by

$$P[\text{failure}] = P\Big[\bigcup_{l\in[L]} \{D_l \geq \lfloor\tfrac{N_l}{2}\rfloor + 1\}\Big] \leq \sum_{l\in[L]} P\big[D_l \geq \lfloor\tfrac{N_l}{2}\rfloor + 1\big] \leq \frac{N}{N_l}\exp(-c_p N_l) \quad (4.21)$$
$$\triangleq B_{\text{failure}}, \quad (4.22)$$

where (4.21) follows from (4.20). The asymptotic behavior of this upper bound is given by

$$\lim_{N\to\infty} B_{\text{failure}} = \exp\Big\{\lim_{N\to\infty} \log B_{\text{failure}}\Big\} = \exp\Big\{\lim_{N\to\infty} \big(\log N - \log N_l - c_p N_l\big)\Big\} = \exp(-\infty) \quad (4.23)$$
$$= 0, \quad (4.24)$$

where (4.23) holds because $N_l = \frac{1}{c}\log N \geq \frac{1}{c_p}\log N$. From (4.22) and (4.24), $\lim_{N\to\infty} P[\text{failure}] = 0$.

(Privacy guarantee) Let $A_l$ be the event that a collusion between users and the server can reveal the local model of an honest user in group $l-1$, and let $X_l$ be a random variable corresponding to the number of colluding users in group $l \in [L]$. First, note that the colluding users in groups $l' \leq l-1$ cannot learn any information about a user in group $l-1$, because communication in Turbo-Aggregate is directed from users in lower groups to users in upper groups. Moreover, colluding users in groups $l' \geq l+1$ can learn nothing beyond the partial summation $s^{(l')}$. Hence, an information breach of a local model in group $l-1$ occurs only when the number of colluding users in group $l$ is greater than or equal to half of the number of users in group $l$. In this case, the colluding users can obtain a sufficient number of evaluation points to recover all of the masked models belonging to a certain user, i.e., $\tilde{x}^{(l-1)}_{i,j}$ in (4.7) for all $j \in [N_l]$. Then, they can recover the original model $x^{(l-1)}_i$ by adding all of the masked models and removing the randomness $u^{(l-1)}_i$. Therefore, $P[A_l] = P[X_l \geq \frac{N_l}{2}]$.
To calculate $P[A_l]$, we again consider the random partitioning strategy described in Section 4.4.1, which allocates the $N$ users to $L = \frac{N}{N_l}$ groups of size $N_l$, while $T$ out of the $N$ users are colluding. Then $X_l$ follows a hypergeometric distribution with parameters $(N, T, N_l)$, and its tail probability is bounded as

$$P\big[X_l \geq \tfrac{N_l}{2}\big] \leq \exp(-c_T N_l), \quad (4.25)$$

where $c_T = D(0.5\|\frac{T}{N})$ [71]. Note that $c_T > 0$ is a positive constant when $T \leq (0.5-\epsilon)N$ for any $\epsilon > 0$. Then the probability of privacy leakage of any individual model is given by

$$P[\text{privacy leakage}] = P\Big[\bigcup_{l\in[L]} A_l\Big] \leq \sum_{l\in[L]} P[A_l] \quad (4.26)$$
$$\leq \frac{N}{N_l}\exp(-c_T N_l) \quad (4.27)$$
$$\triangleq B_{\text{privacy}},$$

where (4.26) follows from a union bound and (4.27) follows from (4.25). Note that (4.27) can be bounded similarly to (4.21) by replacing $c_p$ with $c_T$, from which we find that $\lim_{N\to\infty} B_{\text{privacy}} = 0$. As a result, $\lim_{N\to\infty} P[\text{privacy leakage}] = 0$.

(Aggregation overhead) As described in Section 4.3, the aggregation overhead consists of two main components, computation and communication. The computation overhead quantifies the processing time for: 1) masking the models with additive secret sharing, 2) adding redundancies to the models through Lagrange coding, and 3) reconstructing the missing information due to user dropouts. First, masking the model with additive secret sharing has a computation overhead of $O(\log N)$. Second, encoding the masked models with Lagrange coding has a computation overhead of $O(\log N \log^2\log N \log\log\log N)$, because evaluating a polynomial of degree $i$ at any $j$ points has a computation overhead of $O(j\log^2 i \log\log i)$ [77], and both $i$ and $j$ are $\log N$ for the encoding operation. Third, the decoding operation to recover the missing information due to dropped users has a computation overhead of $O(p\log N \log^2\log N \log\log\log N)$, because it requires evaluating a polynomial of degree $\log N$ at $p\log N$ points. Within each execution stage, the computations are carried out in parallel by the users.
Therefore, the computations per execution stage have a computation overhead of $O(\log N \log^2\log N \log\log\log N)$, which is the same as the computation overhead of a single user. Since there are $L = \frac{cN}{\log N}$ execution stages, the overall computation overhead is $O(N\log^2\log N \log\log\log N)$.

The communication overhead is evaluated by measuring the total amount of communication over the entire network, to quantify the communication time in the worst-case communication architecture, which corresponds to the centralized architecture where all communication goes through a central server. At execution stage $l \in [L]$, each of the $N_l$ users in group $l$ sends a message to each of the $N_{l+1}$ users in group $l+1$, which has a communication overhead of $O(\log^2 N)$. Since there are $L = \frac{cN}{\log N}$ stages, the overall communication overhead is $O(N\log N)$.

As the aggregation overhead consists of the overheads incurred by communication and computation, the aggregation overhead of Turbo-Aggregate is $C = O(N\log N)$.

As we showed in the proof of Theorem 4.1, Turbo-Aggregate achieves its robustness and privacy guarantees by choosing a group size of $N_l = \frac{1}{c}\log N$ for all $l \in [L]$, where $c \triangleq \min\{D(0.5\|p), D(0.5\|\frac{T}{N})\}$ and $D(a\|b)$ is the Kullback-Leibler (KL) distance between two Bernoulli distributions with parameters $a$ and $b$ [33]. We could further reduce the aggregation overhead by choosing a smaller group size $N_l$. However, we cannot reduce the group size beyond $O(\log N)$, because when $0 < D(0.5\|p) < 1$ (i.e., $\frac{1}{c} > 1$) and $N_l = \log N$, the probability that Turbo-Aggregate guarantees the accuracy of full model aggregation goes to 0 for a sufficiently large number of users, as stated in Theorem 4.2.

Theorem 4.2. (Converse) When $0 < D(0.5\|p) < 1$ and $N_l = \log N$ for all $l \in [L]$, the probability that Turbo-Aggregate achieves the robustness guarantee against any user dropout rate $p < 0.5$ goes to 0 as the number of users $N \to \infty$.

Proof.
We first note that $D_l$, the number of users dropped in group $l$, follows a binomial distribution with parameters $N_l$ and $p$. The probability that the information of the users in group $l$ cannot be reconstructed is given by $P[D_l \geq \lfloor\frac{N_l}{2}\rfloor + 1]$. From [33], this probability can be bounded as

$$P\big[D_l \geq \lfloor\tfrac{N_l}{2}\rfloor + 1\big] \geq \frac{1}{N_l+1}\exp(-c'N_l), \quad (4.28)$$

where $c' = D(0.5\|p)$. The probability that Turbo-Aggregate can provide robustness against user dropouts is given by

$$P[\text{robustness}] = P\Big[\bigcap_{l\in[L]}\{D_l \leq \tfrac{N_l}{2}\}\Big] = \Big(P\big[D_l \leq \tfrac{N_l}{2}\big]\Big)^L \quad (4.29)$$
$$= \Big(1 - P\big[D_l \geq \lfloor\tfrac{N_l}{2}\rfloor + 1\big]\Big)^L \leq \Big(1 - \frac{1}{N_l+1}\exp(-c'N_l)\Big)^L \quad (4.30)$$
$$= \Big(1 - \frac{1}{\log N + 1}\exp(-c'\log N)\Big)^{\frac{N}{\log N}} = \Big(1 - \frac{N^{-c'}}{\log N + 1}\Big)^{\frac{N}{\log N}} \triangleq B_{\text{robustness}}, \quad (4.31)$$

where (4.29) holds since the events are independent across the groups and (4.30) follows from (4.28). The asymptotic behavior of the upper bound is then given by

$$\lim_{N\to\infty} B_{\text{robustness}} = \exp\Big\{\lim_{N\to\infty} \log B_{\text{robustness}}\Big\} = \exp\Big\{\lim_{N\to\infty} \frac{N}{\log N}\log\Big(1 - \frac{N^{-c'}}{\log N + 1}\Big)\Big\}.$$

Since $\log(1-x) \leq -x$ for $x \in [0,1)$, the exponent is at most $-\frac{N^{1-c'}}{\log N(\log N + 1)}$, which diverges to $-\infty$ (this can also be verified by repeated application of L'Hospital's rule), because $N^{1-c'}$ grows polynomially while $\log^2 N$ grows only logarithmically when $0 < c' = D(0.5\|p) < 1$. Therefore,

$$\lim_{N\to\infty} B_{\text{robustness}} = \exp(-\infty) = 0. \quad (4.38)$$

From (4.31) and (4.38), $\lim_{N\to\infty} P[\text{robustness}] = 0$, which completes the proof.

4.6.1 Generalized Turbo-Aggregate

Theorem 4.1 states that the privacy of each individual model is guaranteed against any collusion between the server and up to $\frac{N}{2}$ users.
On the other hand, a collusion between the server and a subset of users can reveal the partial aggregation of a group of honest users. For instance, a collusion between the server and a user in group $l$ can reveal the partial aggregation of the models of all users up to group $l-2$, as the colluding server can remove the random masks in (4.11). However, the privacy protection can be strengthened to guarantee the privacy of any partial aggregation, i.e., the aggregate of any subset of user models, with a simple modification. The modified protocol follows the same steps as Algorithm 2, except that the random mask $u^{(l)}_i$ in (4.7) is generated by each user individually, instead of being generated by the server. At the end of the aggregation phase, the server learns $\sum_{m\in[L]}\sum_{j\in\mathcal{U}_m}(x^{(m)}_j + u^{(m)}_j)$. Simultaneously, the protocol executes an additional random partitioning strategy to aggregate the random masks $u^{(m)}_j$, at the end of which the server obtains $\sum_{m\in[L]}\sum_{j\in\mathcal{U}_m} u^{(m)}_j$ and recovers $\sum_{m\in[L]}\sum_{j\in\mathcal{U}_m} x^{(m)}_j$. In this second partitioning, the $N$ users are randomly allocated into $L$ groups with a group size of $N_l$. User $i$ in group $l' \in [L]$ then secret shares $u^{(l')}_i$ with the users in group $l'+1$, by generating and sending a secret share, denoted by $[u^{(l')}_i]_j$, to user $j$ in group $l'+1$. For secret sharing, we utilize Shamir's $\frac{N_l}{2}$-out-of-$N_l$ secret sharing protocol [123]. Let $\mathcal{U}'_{l'}$ denote the surviving users in group $l'$ in the second partitioning. User $i$ in group $l'$ then aggregates the received secret shares, $\sum_{j\in\mathcal{U}'_{l'-1}}[u^{(l'-1)}_j]_i$, which in turn is a secret share of $\sum_{j\in\mathcal{U}'_{l'-1}} u^{(l'-1)}_j$, and sends the sum to the server. Finally, the server reconstructs $\sum_{j\in\mathcal{U}'_{l'}} u^{(l')}_j$ for all $l' \in [L]$ and recovers the aggregate of the individual models of all surviving users by subtracting $\{\sum_{j\in\mathcal{U}'_{l'}} u^{(l')}_j\}_{l'\in[L]}$ from the aggregate $\sum_{m\in[L]}\sum_{j\in\mathcal{U}_m}(x^{(m)}_j + u^{(m)}_j)$.
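The key property used here, that a sum of Shamir shares is itself a share of the sum of the secrets, can be sketched as follows (a minimal toy example with assumed group size $n = 4$ and threshold $t = 2$; variable names are ours, not the protocol's):

```python
import random

q = 2**32 - 5  # prime field size from the experiments

def shamir_share(secret, t, n):
    """t-out-of-n Shamir shares: points on a random degree-(t-1) polynomial."""
    coeffs = [secret] + [random.randrange(q) for _ in range(t - 1)]
    return [(x, sum(c * pow(x, k, q) for k, c in enumerate(coeffs)) % q)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret from any t shares."""
    s = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % q
                den = den * (xi - xj) % q
        s = (s + yi * num * pow(den, q - 2, q)) % q
    return s

n, t = 4, 2                                      # toy group size and threshold
masks = [random.randrange(q) for _ in range(n)]  # one random mask u_i per user

# Each user shares its mask; user j then sums the j-th share of every mask.
shares = [shamir_share(u, t, n) for u in masks]
aggregated = [(j + 1, sum(sh[j][1] for sh in shares) % q) for j in range(n)]

# The sums are themselves shares of sum(masks): any t of them reconstruct it,
# without ever revealing an individual mask to the server.
assert reconstruct(aggregated[:t]) == sum(masks) % q
```

This is exactly why the server can remove the total randomness at the end of the protocol while learning nothing about any individual mask from fewer than $t$ shares.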
In this generalized version of Turbo-Aggregate, the privacy of any partial aggregation, i.e., the aggregate of any subset of user models, can be protected as long as a collusion between the server and the users does not reveal the aggregation of the random masks, $\sum_{j\in\mathcal{U}_l} u^{(l)}_j$ in (4.11), for any $l \in [L]$. Since there are at least $\frac{N}{2}$ unknown random masks generated by honest users and the server only knows $L = \frac{N}{N_l}$ equations, i.e., $\{\sum_{j\in\mathcal{U}'_l} u^{(l)}_j\}_{l\in[L]}$, the server cannot calculate $\sum_{j\in\mathcal{U}_l} u^{(l)}_j$ for any $l \in [L]$. Therefore, a collusion between the server and users cannot reveal the partial aggregate, as they cannot remove the random masks in (4.11). We now formally state the privacy guarantee, robustness guarantee, and aggregation overhead of the generalized Turbo-Aggregate protocol in Theorem 4.3.

Theorem 4.3. Generalized Turbo-Aggregate simultaneously achieves 1), 2), and 3) from Theorem 4.1. In addition, it provides a privacy guarantee for the partial aggregate of any subset of user models, against any collusion between the server and up to $T = (0.5-\epsilon)N$ users for any $\epsilon > 0$, with probability approaching 1 as the number of users $N \to \infty$.

Proof. The generalized version of Turbo-Aggregate creates two independent random partitionings of the users, where each partitioning randomly assigns the $N$ users into $L$ groups with a group size of $N_l$ for all $l \in [L]$. To prove the theorem, we choose $N_l = O(\log N)$ as in the proof of Theorem 4.1 for both the first and second partitioning. We now prove the privacy guarantee for each individual model as well as for any aggregate of a subset of models, the robustness guarantee, and the aggregation overhead.

(Privacy guarantee) As the generalized protocol follows Algorithm 2, the privacy guarantee of individual models follows from the proof of the privacy guarantee in Theorem 4.1. We now prove the privacy guarantee of any partial aggregation, i.e., the aggregate of the models from any subset of users.
A collusion between the server and users can reveal a partial aggregation in two events: 1) for some $l, l' \in [L]$, $\mathcal{U}_l$ and $\mathcal{U}'_{l'}$ are exactly the same, where $\mathcal{U}_l$ and $\mathcal{U}'_{l'}$ denote the set of surviving users in group $l$ of the first partitioning and group $l'$ of the second partitioning, respectively, or 2) the number of colluding users in group $l'$ of the second partitioning is larger than half of the group size, and the number of such groups is large enough that the colluding users can reconstruct the individual random masks $\{u^{(l)}_j\}_{j\in\mathcal{U}_l}$ in (4.7) for some group $l$ of the first partitioning. In the first event, the server can reconstruct the aggregate of the random masks, $\sum_{j\in\mathcal{U}_l} u^{(l)}_j$, and then, if the server colludes with any user in group $l+1$ and any user in group $l+2$, they can reveal the partial aggregation $\sum_{j\in\mathcal{U}_l} x^{(l)}_j$ by subtracting $s^{(l+1)} + \sum_{j\in\mathcal{U}_l} u^{(l)}_j$ from $s^{(l+2)}$ in (4.11). The probability that a given group from the second partitioning is the same as any group $l \in [L]$ from the first partitioning is $\frac{N}{N_l}\frac{1}{\binom{N}{N_l}}$. As there are $L = \frac{N}{N_l}$ groups in the second partitioning, the probability of the first event is bounded by $\frac{N^2}{N_l^2}\frac{1}{\binom{N}{N_l}}$ from a union bound, which goes to zero as $N \to \infty$ when $N_l = O(\log N)$. In the second event, the colluding users in a group $l'$ where the number of colluding users is larger than half of the group size can reconstruct the individual random masks of all users in group $l'-1$. If these colluding users can reconstruct the individual random masks of all users in group $l$ for some $l \in [L]$ and collude with any user in group $l+1$ and any user in group $l+2$, they can reveal the partial aggregation $\sum_{j\in\mathcal{U}_l} x^{(l)}_j$. As the second event requires that the number of colluding users be larger than half of the group size in multiple groups, the probability of the second event is less than the probability in (4.25), which goes to zero as $N \to \infty$.
Therefore, as the upper bounds on the probabilities of these two events go to zero, generalized Turbo-Aggregate provides a privacy guarantee for any partial aggregation, i.e., the aggregate of any subset of user models, with probability approaching 1 as $N \to \infty$.

(Robustness guarantee) Generalized Turbo-Aggregate provides the same level of robustness guarantee as Theorem 4.1. This is because the probability that all groups in the first partitioning have enough surviving users for the protocol to correctly compute the aggregate $\sum_{j\in\mathcal{U}_l}(x^{(l)}_j + u^{(l)}_j)$ is the same as the probability in (4.19), and the probability that all groups in the second partitioning have enough surviving users is also the same as the probability in (4.19). Therefore, from a union bound, the upper bound on the failure probability of generalized Turbo-Aggregate is twice the bound in (4.22), which goes to zero as $N \to \infty$.

(Aggregation overhead) Generalized Turbo-Aggregate follows Algorithm 2, which achieves an aggregation overhead of $O(N\log N)$, and the additional operations to secret share the individual random masks $u^{(l)}_i$ and decode the aggregate of the random masks also have an aggregation overhead of $O(N\log N)$. The aggregation overhead of the additional operations consists of four parts: 1) user $i$ in group $l$ generates the secret shares of $u^{(l)}_i$ via Shamir's secret sharing with a polynomial of degree $\frac{N_l}{2}$, 2) user $i$ in group $l$ sends the secret shares to the $N_l$ users in group $l+1$, 3) user $i$ in group $l$ aggregates the received secret shares, $\sum_{j\in\mathcal{U}'_{l-1}}[u^{(l-1)}_j]_i$, and sends the sum to the server, and 4) the server reconstructs the secret $\sum_{j\in\mathcal{U}'_l} u^{(l)}_j$ for all $l \in [L]$. The computation overhead of Shamir's secret sharing is $O(N_l^2)$ [123], and these computations are carried out in parallel by the users in one group. As there are $L = \frac{N}{N_l}$ groups and $N_l = O(\log N)$, the computation overhead of the first part is $O(N\log N)$.
The communication overhead of the second part is $O(N\log N)$, as the total number of secret shares is $O(N\log N)$. The communication overhead of the third part is $O(N)$, as each user sends a single message to the server. The computation overhead of the last part is $O(N\log N)$, as the computation overhead of decoding the secret from the $N_l = O(\log N)$ secret shares is $O(\log^2 N)$ [123], and the server must decode $L = O(\frac{N}{\log N})$ secrets. Therefore, generalized Turbo-Aggregate also achieves an aggregation overhead of $O(N\log N)$.

4.7 Experiments

In this section, we evaluate the performance of Turbo-Aggregate by experiments over up to $N = 200$ users, for various user dropout rates and bandwidth conditions.

4.7.1 Experiment setup

Platform. In our experiments, we implement Turbo-Aggregate on a distributed platform by using the FedML library [65], and examine its total running time with respect to the state-of-the-art [17]. Computation is performed in a distributed network over the Amazon EC2 cloud using m3.medium machine instances. Communication is implemented using the MPI4Py [36] message passing interface on Python. The default setting for the maximum bandwidth constraint of m3.medium machine instances is 1 Gbps. The model size, $d$, is fixed to 100,000 with 32-bit entries, and the field size, $q$, is set as the largest prime within 32 bits.

Table 4.1: Summary of simulation parameters.

  Variable | Definition                       | Value
  N        | number of users                  | 4 to 200
  d        | model size (with 32-bit entries) | 100,000
  p        | dropout rate                     | 10%, 30%, 50%
  q        | field size                       | 2^32 - 5
  --       | maximum bandwidth constraint     | 100 Mbps to 1 Gbps (default: 1 Gbps)

Figure 4.6: Example networks with $N = 24$, $N_l = 3$, and $L = 8$. An arrow represents that users in one group generate and send messages to the users in the next group. (a) Turbo-Aggregate: each arrow is carried out sequentially, requiring 7 stages. (b) Turbo-Aggregate+: arrows for the same execution stage are carried out simultaneously, requiring only 3 stages.
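The field setup above can be illustrated with a short sketch ($q = 2^{32}-5$ is stated in the text; the mapping of negative model entries into the upper half of the field is an assumed but standard convention, not spelled out here, and the entry and mask values are purely illustrative):

```python
q = 2**32 - 5  # largest prime that fits in 32 bits

def to_field(v):
    """Map a signed integer model entry into F_q."""
    return v % q

def from_field(v):
    """Inverse map: values above q // 2 are interpreted as negative."""
    return v - q if v > q // 2 else v

# Masked aggregation in the field: each user contributes (x_i + u_i) mod q;
# the server later removes the total randomness sum(u_i).
models = [123, -456, 789]                      # illustrative quantized entries
masks = [3141592653, 2718281828, 1414213562]   # illustrative random masks in F_q
masked = [(to_field(x) + u) % q for x, u in zip(models, masks)]
aggregate = (sum(masked) - sum(masks)) % q
assert from_field(aggregate) == sum(models)  # -> 456
```

Working modulo a prime slightly below $2^{32}$ lets each entry occupy a single 32-bit word while still supporting the modular inverses needed for interpolation.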
We summarize the simulation parameters in Table 4.1.

Modeling user dropouts. To model the dropped users in Turbo-Aggregate, we randomly select $pN_l$ users out of the $N_l$ users in group $l \in [L]$, where $p$ is the dropout rate. We consider the worst-case scenario where the selected users drop after receiving the messages sent from the previous group (users in group $l-1$) and do not send their messages to users in group $l+1$. To model the dropped users in the benchmark protocol, we follow the scenario in [17]. We randomly select $pN$ users out of the $N$ users, which artificially drop after sending their masked model $y_u$ in (4.4). In this case, the server has to reconstruct the pairwise seeds $\{s_{u,v}\}_{v\in[N]\setminus\{u\}}$ corresponding to each dropped user $u$, and execute the pseudorandom generator using the reconstructed seeds to remove the corresponding random masks.

Figure 4.7: Total running time of Turbo-Aggregate versus the benchmark protocol [17] as the number of users increases, for various user dropout rates.

Implemented Schemes. We implement the following schemes for performance evaluation. For the schemes with Turbo-Aggregate, we use $N_l = \log N$.

1. Turbo-Aggregate: For our first implementation, we directly implement Turbo-Aggregate as described in Section 4.4, where the $L$ execution stages are performed sequentially.

2. Turbo-Aggregate+: We can speed up Turbo-Aggregate by parallelizing the $L$ execution stages. To do so, we again utilize the circular aggregation topology but leverage a tree structure for flooding the information between different groups across the network, which reduces the required number of execution stages from $L-1$ to $\log L$. We refer to this implementation as Turbo-Aggregate+. Figure 4.6 demonstrates the difference between Turbo-Aggregate+ and Turbo-Aggregate through an example network of $N = 24$ users and $L = 8$ groups.
Turbo-Aggregate+ requires only 3 stages to complete the protocol, while Turbo-Aggregate carries out each execution stage sequentially and requires 7 stages.

3. Benchmark: We implement the benchmark protocol [17], where a server mediates the communication between users to exchange the information required for key agreements (rounds of advertising and sharing keys) and users send their masked models to the server (masked input collection). One can also speed up the rounds of advertising and sharing keys by allowing users to communicate in parallel. However, this has minimal effect on the total running time of the protocol, as the total running time is dominated by the overhead when the server generates the pairwise masks [17].

4.7.2 Performance evaluation

For performance analysis, we measure the total running time for a single round of secure aggregation with each protocol while gradually increasing the number of users $N$, for different user dropout rates. We use synthesized vectors for the locally trained models and do not include the local training time in the total running time. One can also consider the entire learning process, and since all other steps remain the same for the three schemes, we expect the same speedup in the aggregation phase. Our results are demonstrated in Figure 4.7. We make the following key observations.

• The total running times of Turbo-Aggregate and Turbo-Aggregate+ are almost linear in the number of users, while for the benchmark protocol, the total running time is quadratic in the number of users.

• Turbo-Aggregate and Turbo-Aggregate+ provide a stable total running time as the user dropout rate increases. This is because the encoding and decoding times of Turbo-Aggregate do not change significantly when the dropout rate increases, and we do not require additional information to be transmitted from the remaining users when some users are dropped or delayed.
On the other hand, for the benchmark protocol, the running time significantly increases as the dropout rate increases. This is because the total running time is dominated by the reconstruction of pairwise masks at the server, which grows substantially as the number of dropped users increases.

Table 4.2: Breakdown of the running time (ms) of Turbo-Aggregate with N = 200 users.

  Drop rate | Encoding | Decoding | Communication | Total
  10%       | 2070     | 2333     | 22422         | 26825
  30%       | 2047     | 2572     | 22484         | 27103
  50%       | 2051     | 3073     | 22406         | 27530

Table 4.3: Breakdown of the running time (ms) of Turbo-Aggregate+ with N = 200 users.

  Drop rate | Encoding | Decoding | Communication | Total
  10%       | 93       | 356      | 3353          | 3802
  30%       | 94       | 460      | 3282          | 3836
  50%       | 94       | 559      | 3355          | 4009

Table 4.4: Breakdown of the running time (ms) of the benchmark protocol [17] with N = 200 users.

  Drop rate | Communication of the models | Reconstruction at server | Other | Total
  10%       | 8670                        | 53781                    | 832   | 63284
  30%       | 8470                        | 101256                   | 742   | 110468
  50%       | 8332                        | 151183                   | 800   | 160315

• Turbo-Aggregate and Turbo-Aggregate+ provide speedups of up to 5.8× and 40× over the benchmark, respectively, for a user dropout rate of up to 50% with $N = 200$ users. This gain is expected to increase further as the number of users increases.

Breakdown of the total running time: To illustrate the impact of user dropouts, we present the breakdown of the total running time of Turbo-Aggregate, Turbo-Aggregate+, and the benchmark protocol in Tables 4.2, 4.3, and 4.4, respectively, for the case of $N = 200$ users. Tables 4.2 and 4.3 demonstrate that, for Turbo-Aggregate and Turbo-Aggregate+, the encoding time stays constant with respect to the user dropout rate, and the decoding time is linear in the number of dropped users, taking only a small portion of the total running time.
Table 4.4 shows that, for the benchmark protocol, the total running time is dominated by the reconstruction of pairwise masks (using a pseudorandom generator) at the server, which has a computation overhead of $O\big((N-D) + D(N-D)\big)$, where $D$ is the number of dropped users [17]. This leads to an increased running time as the number of user dropouts increases. The running time of the two Turbo-Aggregate schemes, on the other hand, is relatively stable against varying user dropout rates, as the communication time is independent of the user dropout rate and the only additional overhead comes from the decoding phase, whose overall contribution is minimal. Table 4.3 shows that the encoding time of Turbo-Aggregate+ is reduced to an $L$-th of the encoding time of the original Turbo-Aggregate, because encoding is performed in parallel across all groups. Turbo-Aggregate+ also speeds up the decoding and communication phases by reducing the number of execution stages from $L-1$ to $\log L$.

It is also useful to comment on the selection of a user dropout rate of up to $p = 0.5$. From a practical perspective, our selection of $p = 0.5$ is to demonstrate our results with the same parameter setting as the state-of-the-art [17]. From a privacy perspective, as most secure systems, e.g., blockchain systems, are designed to tolerate at most 50% adversaries, Turbo-Aggregate is also designed to achieve a privacy guarantee against up to $T = N/2$ colluding users, where $N$ is the total number of users. This limits the maximum value of $p$ to 0.5, as Turbo-Aggregate provides a trade-off between the maximum user dropout rate $p$ and the privacy guarantee $T$, as detailed in Remark 4.1.

4.7.3 Impact of bandwidth and stragglers

We further study the impact of bandwidth and stragglers on the performance of our protocol, by measuring the total running time under various communication bandwidth constraints.
Our results are demonstrated in Figure 4.8, from which we observe that Turbo-Aggregate still provides substantial gains in environments with more severe bandwidth constraints. The details of these experiments are as follows.

Figure 4.8: Total running time of Turbo-Aggregate versus the benchmark protocol from [17] as the maximum bandwidth increases, for various user dropout rates. The number of users is fixed to N = 200.

Bandwidth. To simulate various bandwidth conditions in mobile network environments, we measure the total running time while decreasing the maximum bandwidth constraint for the communication links between the Amazon EC2 machine instances from 1 Gbps to 100 Mbps. In Figure 4.8, we observe that the gains of Turbo-Aggregate and Turbo-Aggregate+ over the benchmark decrease as the maximum bandwidth constraint decreases. This is because, for the benchmark, the major bottleneck is the running time for the reconstruction of pairwise masks, which remains constant over various bandwidth conditions. On the other hand, for Turbo-Aggregate, the total running time is dominated by the communication time, which is inversely proportional to the bandwidth. This leads to a significantly decreased gain of Turbo-Aggregate over the benchmark, 1.9× with a maximum bandwidth constraint of 100 Mbps. However, the total running time of Turbo-Aggregate+ increases only moderately as the maximum bandwidth constraint decreases, because the communication time of Turbo-Aggregate+ is even less than that of the benchmark. Turbo-Aggregate+ still provides a speedup of 12.1× over the benchmark with a maximum bandwidth constraint of 100 Mbps. In real-world settings, this gain is expected to increase further as the number of users increases.

Stragglers or delayed users. Beyond user dropouts, in a federated learning environment, stragglers or slow users can also significantly impact the total running time.
Turbo-Aggregate can effectively handle these straggling users by simply treating them as user dropouts. At each stage of the aggregation, if some users send their messages later than a certain threshold, the users in the next group can start to decode those messages without waiting for the stragglers. This has negligible impact on the total running time, because Turbo-Aggregate provides a stable total running time as the number of dropped users increases. For the benchmark, however, stragglers can significantly delay the total running time, even though it can also handle stragglers as dropped users. This is because the running time for the reconstruction of the pairwise masks corresponding to the dropped users, which is the dominant time-consuming part of the benchmark protocol, increases significantly as the number of dropped users increases.

4.7.4 Additional experiments with FedAvg

In our previous experiments, we have primarily focused on the aggregation phase and measured a single round of secure aggregation with synthesized vectors for the locally trained models. This is because these vectors can be replaced with any trained model in real-world federated learning setups. To further investigate the performance of Turbo-Aggregate in real-world federated learning setups, we implement the FedAvg scheme from [100] with a convolutional neural network (CNN) architecture on the CIFAR-10 dataset, as considered in [100], and apply two secure aggregation protocols, Turbo-Aggregate and the benchmark protocol from [17], in the aggregation phase. This architecture has 100,000 parameters, and with the setting of $N = 100$, $E = 5$ (number of local epochs), and $B = 50$ (mini-batch size), it requires 280 rounds to achieve a test accuracy of 80% [100].

Figure 4.9: Running time of Turbo-Aggregate versus the benchmark protocol from [17]. Both protocols are applied to the aggregation phase in FedAvg with the CIFAR-10 dataset and CNN architecture [100]. We measure the running time of the training phase and aggregation phase to achieve a test accuracy of 80% with N = 100, E = 5 (number of local epochs), and B = 50 (mini-batch size).

Figure 4.9 shows the local training time, aggregation time, and total running time of Turbo-Aggregate and the benchmark protocol [17], from which we observe that Turbo-Aggregate provides a 10.8× speedup over the benchmark to achieve 80% test accuracy. We note that this gain can be much larger (almost 40×) when the number of users is larger (N = 200).

4.8 Conclusion

This chapter presented the first secure aggregation framework that theoretically achieves an aggregation overhead of $O(N\log N)$ in a network with $N$ users, as opposed to the prior $O(N^2)$ overhead, while tolerating a user dropout rate of up to 50%. Furthermore, via experiments over Amazon EC2, we demonstrated that Turbo-Aggregate achieves a total running time that grows almost linearly in the number of users, and provides up to 40× speedup over the state-of-the-art scheme with $N = 200$ users. Turbo-Aggregate is particularly suitable for wireless topologies, in which network conditions and user availability can vary rapidly, as it provides a resilient framework to handle such unreliable network conditions. Specifically, if some users cause unexpected delays due to unstable connections, Turbo-Aggregate can simply treat them as user dropouts and can reconstruct the information of dropped or delayed users in the previous groups as long as half of the users remain. One may also leverage the geographic heterogeneity of wireless networks to better form the communication groups in Turbo-Aggregate. An interesting future direction would be to explore how to optimize the multi-group communication structure of Turbo-Aggregate based on the specific topology of the users, as well as the network conditions.
In this work, we have focused on protecting the privacy of individual models against an honest-but-curious server and up to T colluding users, so that no information is revealed about the individual models beyond their aggregated value. If one would like to further limit the information that may be revealed from the aggregated model, differential privacy can be utilized to ensure that individual data points cannot be identified from the aggregated model. The benefits of differential privacy could be carried over to our approach by adding noise to the local models before the aggregation phase in Turbo-Aggregate. Combining these two techniques is another interesting future direction. Finally, the implementation of Turbo-Aggregate in a real-world large-scale distributed system would be another interesting future direction. This would require addressing the following three challenges. First, the computational complexity of implementing the random grouping strategy may increase as the number of users increases. Second, Turbo-Aggregate currently focuses on protecting the privacy against honest-but-curious adversaries. In settings with malicious (Byzantine) adversaries who wish to manipulate the global model by poisoning their local datasets, one may require additional strategies to protect the resilience of the trained model. One approach is combining secure aggregation with an outlier detection algorithm as proposed in [131], which has a communication cost of O(N²) that limits its scalability to large federated learning systems. It would be an interesting direction to leverage Turbo-Aggregate to address this challenge, i.e., to develop a communication-efficient secure aggregation strategy against Byzantine adversaries. Third, communication may still be a bottleneck in severely resource-constrained systems, since users need to exchange the masked models with each other, whose size is as large as that of the global model.
To overcome this bottleneck, one may leverage model compression techniques or group knowledge transfer [64].

Chapter 5: Byzantine-Resilient Secure Federated Learning

5.1 Introduction

Federated learning is a distributed training framework that has received significant interest in recent years, as it allows machine learning models to be trained over the vast amount of data collected by mobile devices [97, 17]. In this framework, training is coordinated by a central server that maintains a global model, which is updated by the mobile users through an iterative process. At each iteration, the server sends the current version of the global model to the mobile devices, which update it using their local data and create a local model. The server then aggregates the local updates of the users and updates the global model for the next iteration [97, 17, 100, 14, 75, 129, 88, 153]. Security and privacy considerations of distributed learning are mainly focused on two seemingly separate directions: 1) ensuring the robustness of the global model against adversarial manipulations, and 2) protecting the privacy of individual users. The first direction aims at ensuring that the trained model is robust against Byzantine faults that may occur in the training data or during protocol execution. These faults may result either from an adversarial user who can manipulate the training data or the information exchanged during the protocol, or from device malfunctioning. Notably, it has been shown that even a single Byzantine fault can significantly alter the trained model [12]. The primary approach for defending against Byzantine faults is comparing the local updates received from different users and removing the outliers at the server [12, 29, 155, 3]. Doing so, however, requires the server to learn the true values of the local updates of each individual user.
The second direction aims at protecting the privacy of the individual users, by keeping each local update private from the server and the other users participating in the protocol [17, 100, 14, 75, 129, 88]. This is achieved through what is known as a secure aggregation protocol [17]. In this protocol, each user masks its local update through additive secret sharing, using private and pairwise random keys, before sending it to the server. Once the masked models are aggregated at the server, the additional randomness cancels out and the server learns the aggregate of all user models. At the end of the protocol, the server learns no information about the individual models beyond the aggregated model, as they are masked by random keys unknown to the server. In contrast, conventional distributed training frameworks that perform gradient aggregation and model updates using the true values of the gradients may reveal extensive information about the local datasets of the users, as shown in [166, 147, 55]. This presents a major challenge in developing a Byzantine-resilient and, at the same time, privacy-preserving federated learning framework. On the one hand, robustness against Byzantine faults requires the server to obtain the individual model updates in the clear, to be able to compare the updates from different users with each other and remove the outliers. On the other hand, protecting user privacy requires each individual model to be masked with random keys; as a result, the server only observes the masked model, which appears as a uniformly random vector that could correspond to any point in the parameter space. Our goal is to reconcile these two critical directions. In particular, we want to address the following question: “How can one make federated learning protocols robust against Byzantine adversaries while preserving the privacy of individual users?”
In this paper, we propose the first single-server Byzantine-resilient secure aggregation framework, BREA, towards addressing this problem. Our framework is built on the following main principles. Given a network of N mobile users with up to A adversaries, each user initially secret shares its local model with the other users through a verifiable secret sharing protocol [50]. However, doing so requires the local models to be masked by uniformly random vectors in a finite field [123], whereas the model updates during training are performed in the domain of real numbers. In order to handle this problem, BREA utilizes stochastic quantization to transfer the local models from the real domain into a finite field. Verifiable secret sharing allows the users to perform consistency checks to validate the secret shares and ensure that every user follows the protocol. However, a malicious user can still manipulate the global model by modifying its local model or private dataset. BREA handles such attacks through a robust gradient descent approach, enabled by secure computations over the secret shares of the local models. To do so, each user locally computes the pairwise distances between the secret shares of the local models belonging to other users, and sends the computation results to the server. Since these computations are carried out using the secret shares, users do not learn the true values of the local models belonging to other users. In the final phase, the server collects the computation results from a sufficient number of users, recovers the pairwise distances between the local models, and performs user selection for model aggregation. The user selection protocol is based on a distance-based outlier removal mechanism [12], to remove the effect of potential adversaries and to ensure that the selected models are sufficiently close to an unbiased gradient estimator.
After the user selection phase, the secret shares of the models belonging to the selected users are aggregated locally by the mobile users. The server then gathers the secure computation results from the users, reconstructs the true value of the aggregate of the selected user models, and updates the global model. Our framework guarantees the privacy of individual user models; in particular, the server learns no information about the local models beyond their aggregated value and the pairwise distances. In our theoretical analysis, we demonstrate provable convergence guarantees for the model and robustness guarantees against Byzantine adversaries. We then identify the theoretical performance limits in terms of the fundamental trade-offs between the network size, user dropouts, number of adversaries, and privacy protection. Our results demonstrate that, in a network with N mobile users, BREA can theoretically guarantee: i) robustness of the trained model against up to A Byzantine adversaries, ii) tolerance against up to D user dropouts, and iii) privacy of each local model against the server and up to T colluding users, as long as N ≥ 2A + 1 + max{m + 2, D + 2T}, where m is the number of models selected for aggregation. We then numerically evaluate the performance of BREA and compare it to the conventional federated learning protocol, the federated averaging scheme of [97]. To do so, we implement BREA in a distributed network of N = 40 users with up to A = 12 Byzantine users who can send arbitrary vectors to the server or to the honest users. We demonstrate that BREA guarantees convergence against Byzantine users and that its convergence rate is comparable to that of federated averaging. BREA also achieves test accuracy comparable to the federated averaging scheme, while incurring a quantization loss in order to preserve the privacy of individual users.
5.1.1 Related Work

In the non-Byzantine federated learning setting, secure aggregation is performed through a procedure known as additive masking [17], [2]. In this setup, users first agree on pairwise secret keys using a Diffie-Hellman type key exchange protocol [39]. After this step, users send a masked version of their local model to the server, where the masking is done using pairwise and private secret keys. Additive masking has a unique property that, when the masked models are aggregated at the server, the additive masks cancel out, allowing the server to learn the aggregate of the local models. On the other hand, no information is revealed to the server about the local models beyond their aggregated value, which protects the privacy of individual users. This process works well if no users drop during the execution of the protocol. In wireless environments, however, users may drop from the protocol at any time due to variations in channel conditions or user preferences. Such user dropouts are handled by letting each user secret share their private and pairwise keys through Shamir's secret sharing [123]. Then, the server can remove the additive masks by collecting the secret shares from the surviving users. This approach leads to a quadratic communication overhead in the number of users. More recent approaches have focused on reducing the communication overhead by training in a smaller parameter space [80], autotuning the parameters [18], or utilizing coding techniques [129]. Another line of work has focused on differentially private federated learning approaches [57, 136], to protect the privacy of personally identifiable information against inference attacks that may be initiated from the trained model. Although our focus is not on differential privacy, we believe our approach may in principle be combined with differential privacy techniques [42], which is an interesting future direction.
Another important direction in federated learning is the study of fairness and how to avoid biasing the model towards specific users [106, 91]. The convergence properties of federated learning models are investigated in [92]. Distributed training protocols have been extensively studied in the Byzantine setting using clear (unmasked) model updates [12, 29, 155, 3]. The main defense mechanism to protect the trained model against Byzantine users is comparing the model updates received from different users and removing the outliers. Doing so ensures that the selected model updates are close to each other, as long as the network has a sufficiently large number of honest users. A related line of work is model poisoning attacks, which are studied in [11, 49]. In concurrent work, a Byzantine-robust secure gradient descent algorithm has been proposed for a two-server model in [69]. However, unlike federated learning (which is based on a single-server architecture) [97, 17], this work requires two honest (non-colluding) servers who both interact with the mobile users and communicate with each other to carry out a secure two-party protocol, but do not share any sensitive information with each other in an attempt to breach user privacy. In contrast, our goal is to develop a single-server Byzantine-resilient secure training framework, to facilitate robust and privacy-preserving training architectures for federated learning. Compared to two-server models, single-server models carry the additional challenge that all information has to be collected at a single server, while still keeping the individual models of the mobile users private from the server. The remainder of the paper is organized as follows. In Section 5.2, we provide background on federated learning. Section 5.3 presents our system model along with the key parameters that are used to evaluate the system performance.
Section 5.4 introduces our framework and the details of its specific components. Section 5.5 presents our theoretical results, whereas our numerical evaluations are provided in Section 5.6, to demonstrate the convergence and Byzantine-resilience of the proposed framework. The paper is concluded in Section 5.7. The following notation is used throughout the paper. We represent a scalar variable with x, whereas x represents a vector. A set is denoted by X, whereas [N] refers to the set {1, ..., N}. The term i.i.d. refers to independent identically distributed random variables.

5.2 Background

Federated learning is a distributed training framework for machine learning in mobile networks that preserves the privacy of mobile users. Training is coordinated by a central server who maintains a global model w ∈ R^d with dimension d. The goal is to train the global model using the data held at the mobile devices, by minimizing a global objective function C(w):

    min_w C(w).    (5.1)

The global model is updated locally by mobile users on sensitive private datasets, by letting

    C(w) = Σ_{i=1}^{N} (B_i / B) C_i(w)    (5.2)

where N is the total number of mobile users, C_i(w) denotes the local objective function of user i, B_i is the number of data points in user i's private dataset D_i, and B := Σ_{i=1}^{N} B_i. For simplicity, we assume that users have equal-sized datasets, i.e., B_i = B/N for all i ∈ [N].

Figure 5.1: Secure aggregation in federated learning. At iteration t, the server sends the current state of the global model, denoted by w^{(t)}, to the mobile users. User i ∈ [N] forms a local model w_i^{(t)} by updating the global model using its local dataset. The local models are aggregated in a privacy-preserving protocol at the server, who then updates the global model and sends the new model, w^{(t+1)}, to the mobile users.
Training is performed through an iterative process in which mobile users interact through the central server to update the global model. At each iteration, the server shares the current state of the global model, denoted by w^{(t)}, with the mobile users. Each user i creates a local model,

    w_i^{(t)} = g(w^{(t)}, ξ_i^{(t)})    (5.3)

where g is an estimate of the gradient ∇C(w^{(t)}) of the cost function C, and ξ_i^{(t)} is a random variable representing the random sample (or a mini-batch of samples) drawn from D_i. We assume that the private datasets {D_i}_{i∈[N]} have the same distribution and that {ξ_i^{(t)}}_{i∈[N]} are i.i.d., ξ_i^{(t)} ∼ ξ, where ξ is a uniform random variable such that each w_i^{(t)} is an unbiased estimator of the true gradient ∇C(w^{(t)}), i.e.,

    E_ξ[g(w^{(t)}, ξ_i^{(t)})] = ∇C(w^{(t)}).    (5.4)

The local models are aggregated at the server in a privacy-preserving protocol, such that the server only learns the aggregate of a large fraction of the local models, ideally the sum of all user models Σ_{i∈[N]} w_i^{(t)}, but no further information is revealed about the individual models beyond their aggregated value. Using the aggregate of the local models, the server updates the global model for the next iteration,

    w^{(t+1)} = w^{(t)} − γ^{(t)} Σ_{i∈[N]} w_i^{(t)}    (5.5)

where γ^{(t)} is the learning rate at iteration t, and sends the updated model w^{(t+1)} to the users. This process is illustrated in Figure 5.1. Conventional secure aggregation protocols require each user to mask its local model using random keys before aggregation [17, 129, 9]. This is typically done by creating pairwise keys between the users through a key exchange protocol [39]. Using the pairwise keys, each pair of users i, j ∈ [N] agree on a pairwise random seed a_{ij}^{(t)}. User i also creates a private random seed b_i^{(t)}, which protects the privacy of the local model in case the user is delayed instead of being dropped, in which case the pairwise keys are not sufficient for privacy, as shown in [17].
User i ∈ [N] then sends a masked version of its local model w_i^{(t)}, given by

    y_i^{(t)} := w_i^{(t)} + PRG(b_i^{(t)}) + Σ_{j: i<j} PRG(a_{ij}^{(t)}) − Σ_{j: i>j} PRG(a_{ji}^{(t)})    (5.6)

to the server, where PRG is a pseudo-random generator. User i then secret shares b_i^{(t)} as well as {a_{ij}^{(t)}}_{j∈[N]} with the other users, via Shamir's secret sharing [123]. For computing the aggregate of the user models, the server collects either the secret shares of the pairwise seeds belonging to a dropped user, or the shares of the private seed belonging to a surviving user (but not both). The server then recovers the private seeds of the surviving users and the pairwise seeds of the dropped users, and removes them from the aggregate of the masked models,

    y^{(t)} = Σ_{i∈U} ( y_i^{(t)} − PRG(b_i^{(t)}) ) − Σ_{i∈D} ( Σ_{j: i<j} PRG(a_{ij}^{(t)}) − Σ_{j: i>j} PRG(a_{ji}^{(t)}) ) = Σ_{i∈U} w_i^{(t)}    (5.7)

and obtains the aggregate of the local models as shown in (5.7), where U ⊆ [N] and D ⊆ [N] denote the sets of surviving and dropped users, respectively.

5.3 Problem Formulation

In this section, we describe the Byzantine-resilient secure aggregation problem, by extending the conventional secure aggregation scenario from Section 5.2 to the case where some users, known as Byzantine adversaries, can manipulate the trained model by modifying their local datasets or by sharing false information during the protocol. We consider a distributed network with N mobile users and a single server. User i ∈ [N] holds a local model* w_i of dimension d. The goal is to aggregate the local models at the server, while protecting the privacy of individual users. However, unlike the non-Byzantine setting of Section 5.2, the aggregation operation in the Byzantine setting should be robust against potentially malicious users. To this end, we represent the aggregation operation by a function,

    f(w_1, ..., w_N) = Σ_{i∈S} w_i    (5.8)

where S is a set of users selected by the server for aggregation.
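Stepping back to the secure aggregation recipe of Section 5.2, the mask cancellation in (5.6)-(5.7) can be illustrated with a toy Python sketch. The parameters here are hypothetical, a seeded `random.Random` instance stands in for a cryptographic PRG, and the no-dropout case is shown, so only the private masks need to be removed:

```python
import random

P = 2**31 - 1   # toy prime field (illustrative choice, not a production parameter)
D = 4           # model dimension

def prg(seed, d=D):
    """Toy stand-in for a PRG: expands a seed into a length-d mask vector."""
    rng = random.Random(seed)
    return [rng.randrange(P) for _ in range(d)]

def vadd(a, b): return [(x + y) % P for x, y in zip(a, b)]
def vsub(a, b): return [(x - y) % P for x, y in zip(a, b)]

N = 3
models = {i: [random.randrange(1000) for _ in range(D)] for i in range(N)}
a = {(i, j): random.randrange(P) for i in range(N) for j in range(i + 1, N)}  # pairwise seeds
b = {i: random.randrange(P) for i in range(N)}                               # private seeds

# Each user masks its model as in (5.6): +PRG(a_ij) for j > i, -PRG(a_ji) for j < i.
y = {}
for i in range(N):
    yi = vadd(models[i], prg(b[i]))
    for j in range(N):
        if i < j:
            yi = vadd(yi, prg(a[(i, j)]))
        elif i > j:
            yi = vsub(yi, prg(a[(j, i)]))
    y[i] = yi

# No dropouts: pairwise masks cancel in the sum, so the server removes
# only the private masks, as in (5.7).
agg = [0] * D
for i in range(N):
    agg = vadd(agg, vsub(y[i], prg(b[i])))

expected = [sum(models[i][k] for i in range(N)) % P for k in range(D)]
assert agg == expected
```

Each individual y_i is uniformly random to the server, yet the pairwise masks appear once with a plus sign and once with a minus sign across the sum, so they vanish in aggregate.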
The role of S is to remove the effect of potentially Byzantine adversaries on the trained model, by removing the outliers. Similar to prior works on federated learning, our focus is on computationally bounded parties, whose strategies can be described by a probabilistic polynomial-time algorithm [17]. We evaluate the performance of a Byzantine-resilient secure aggregation protocol according to the following key parameters:

• Robustness against Byzantine users: We assume that up to A users are Byzantine (malicious), who manipulate the protocol by modifying their local datasets or by sharing false information during protocol execution. The protocol should be robust against such Byzantine adversaries.

• Privacy of local models: The aggregation protocol should protect the privacy of any individual user from the server and any collusions between up to T users. Specifically, the local model of any user should not be revealed to the server or the remaining users, even if up to T users cooperate with each other by sharing information.†

• Tolerance to user dropouts: Due to potentially poor wireless channel conditions, we assume that up to D users may get dropped or delayed at any time during protocol execution. The protocol should be able to tolerate such dropouts, i.e., the privacy and convergence guarantees should hold even if up to D users drop or get delayed.

∗ For notational clarity, throughout Sections 5.3 and 5.4, we omit the iteration number (t) from the models w_i^{(t)}.
† Collusions that may occur between the server and the users are beyond the scope of our paper.

In this paper, we present a single-server Byzantine-resilient secure aggregation framework (BREA) for the computation of (5.8). BREA consists of the following key components:

1. Stochastic quantization: Users initially quantize their local models from the real domain to the domain of integers, and then embed them in a field F_p of integers modulo a prime p.
To do so, our framework utilizes stochastic quantization, which is instrumental in our theoretical convergence guarantees.

2. Verifiable secret sharing of the user models: Users then secret share their quantized models using a verifiable secret sharing protocol. This ensures that the secret shares created by the mobile users are valid, i.e., Byzantine users cannot cheat by sending invalid secret shares.

3. Secure distance computation: In this phase, users compute the pairwise distances between the secret shares of the local models, and send the results to the server. Since this computation is performed using the secret shares of the models instead of their true values, users do not learn any information about the actual model parameters.

4. User selection at the server: Upon receiving the computation results from the users, the server recovers the pairwise distances between the local models and selects the set of users whose models will be included in the aggregation, by removing the outliers. This ensures that the aggregated model is robust against potential manipulations from Byzantine users. The server then announces the list of the selected users.

5. Secure model aggregation: In the final phase, each user locally aggregates the secret shares of the models selected by the server, and sends the computation result to the server. Using the computation results, the server recovers the aggregate of the models of the selected users, and updates the model.

In the following, we describe the details of each phase.

5.4 The BREA Framework

In this section, we present the details of the BREA framework for Byzantine-resilient secure federated learning.

5.4.1 Stochastic Quantization

In BREA, the operations for verifiable secret sharing and secure distance computation are carried out over a finite field F_p for some large prime p. To this end, user i ∈ [N] initially quantizes its local model w_i from the domain of real numbers to the finite field.
We assume that the field size p is large enough to avoid any wrap-around during secure distance computation and secure model aggregation, which will be described in Sections 5.4.3 and 5.4.5, respectively. Quantization is a challenging task, as it should be performed in a way that ensures the convergence of the model. Moreover, the quantization function should allow the representation of negative integers in the finite field, and facilitate computations to be performed in the quantized domain. Therefore, we cannot utilize well-known gradient quantization techniques such as in [alistarh2017qsgd], which represents the sign of a negative number separately from its magnitude. BREA addresses this challenge with a simple stochastic quantization strategy as follows. For any integer q ≥ 1, we first define a stochastic rounding function:

    Q_q(x) = ⌊qx⌋/q        with probability 1 − (qx − ⌊qx⌋)
             (⌊qx⌋ + 1)/q  with probability qx − ⌊qx⌋    (5.9)

where ⌊x⌋ is the largest integer less than or equal to x, and note that this function is unbiased, i.e., E_Q[Q_q(x)] = x. The parameter q is a tuning parameter that corresponds to the number of quantization levels. The variance of Q_q(x) decreases as the value of q increases, which will be detailed in Lemma 5.1 in Section 5.5. We then define the quantized model,

    w̄_i := φ(q · Q_q(w_i))    (5.10)

where the function Q_q from (5.9) is applied element-wise, and the mapping function φ : R → F_p, defined as

    φ(x) = x      if x ≥ 0
           p + x  if x < 0,    (5.11)

represents a negative integer in the finite field using the two's complement representation.

5.4.2 Verifiable Secret Sharing of the User Models

BREA protects the privacy of individual user models through verifiable secret sharing. This is to ensure that the individual user models are kept private while preventing the Byzantine users from breaking the integrity of the protocol by sending invalid secret shares to the other users.
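Before moving on, the quantization pipeline of (5.9)-(5.11) can be sketched in a few lines of Python. The field prime and number of levels below are toy values chosen only for illustration:

```python
import math
import random

p = 2**13 - 1   # toy field prime (assumed large enough for this example)
q = 1024        # number of quantization levels

def Q(x):
    """Stochastic rounding (5.9): rounds x to a multiple of 1/q, unbiased so E[Q(x)] = x."""
    lo = math.floor(q * x)
    frac = q * x - lo                       # probability of rounding up
    return (lo + 1) / q if random.random() < frac else lo / q

def phi(z):
    """Two's-complement style embedding (5.11) of a (possibly negative) integer into F_p."""
    return z % p if z >= 0 else p + z

def quantize(x):
    """Quantized field element (5.10): phi(q * Q(x))."""
    return phi(round(q * Q(x)))
```

For example, `phi(-3)` maps to `p - 3`, and `quantize(0.5)` is deterministic (since 0.5 is an exact multiple of 1/q) and returns 512; averaging many draws of `Q(0.3)` recovers 0.3, reflecting the unbiasedness used in the convergence analysis.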
To do so, user i ∈ [N] secret shares its quantized model w̄_i with the other users through a non-interactive verifiable secret sharing protocol [50]. Our framework leverages Feldman's verifiable secret sharing protocol from [50], which combines Shamir's secret sharing with homomorphic encryption. In this setup, each party creates the secret shares using Shamir's secret sharing [123], then broadcasts commitments to the coefficients of the polynomial used for Shamir's secret sharing, so that the other parties can verify that the secret shares are constructed correctly. To verify the secret shares from the given commitments, the protocol leverages the homomorphic property of exponentiation, i.e., exp(a + b) = exp(a) exp(b), whereas the privacy protection is based on the assumption that computing the discrete logarithm in the finite field is intractable. The individual steps carried out for verifiable secret sharing in our framework are as follows. Initially, the server and users agree on N distinct elements {θ_i}_{i∈[N]} from F_p. This can be done offline by using a conventional majority-based consensus protocol [41, 31]. User i ∈ [N] then generates secret shares of the quantized model w̄_i by forming a random polynomial f_i : F_p → F_p^d of degree T,

    f_i(θ) = w̄_i + Σ_{j=1}^{T} r_{ij} θ^j    (5.12)

in which the vectors r_{ij} are generated uniformly at random from F_p^d by user i. User i then sends a secret share of w̄_i to user j, denoted by

    s_{ij} = f_i(θ_j).    (5.13)

To make these shares verifiable, user i also broadcasts commitments to the coefficients of f_i, given by

    c_{ij} := ψ^{w̄_i}    for j = 0
              ψ^{r_{ij}}  for j = 1, ..., T    (5.14)

where ψ denotes a generator of F_p, and all arithmetic is taken modulo λ for some large prime λ such that p divides λ − 1. Upon receiving the commitments in (5.14), each user j ∈ [N] can verify the validity of the secret share s_{ij} = f_i(θ_j) by checking the equality

    ψ^{s_{ij}} = Π_{k=0}^{T} c_{ik}^{θ_j^k}    (5.15)

where all arithmetic is taken modulo λ.
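The share-commit-verify cycle of (5.12)-(5.15) can be traced with deliberately tiny toy parameters (a single scalar secret, T = 1, and p = 11, λ = 23, so that p divides λ − 1 and ψ = 2 generates the order-11 subgroup of Z_λ^*; real deployments would use cryptographically large primes):

```python
import random

# Toy parameters (illustrative only): shares live in F_p, commitments in the
# order-p subgroup of Z_lambda^*, with p | lambda - 1.
p, lam, psi = 11, 23, 2   # 2^11 = 2048 = 1 (mod 23), so psi has order p = 11

random.seed(1)
w = 7                               # secret: a single quantized model coordinate
r1 = random.randrange(p)            # random degree-1 coefficient (T = 1)
thetas = [1, 2, 3, 4]               # public evaluation points, one per user

# Shamir shares (5.12)-(5.13): s_j = f(theta_j) = w + r1 * theta_j (mod p).
shares = {j: (w + r1 * t) % p for j, t in enumerate(thetas)}

# Feldman commitments (5.14) to the polynomial coefficients.
c0 = pow(psi, w, lam)
c1 = pow(psi, r1, lam)

# Each user j verifies its share against the broadcast commitments (5.15):
# psi^{s_j} should equal c0 * c1^{theta_j} (mod lambda).
for j, t in enumerate(thetas):
    assert pow(psi, shares[j], lam) == (c0 * pow(c1, t, lam)) % lam
```

The check passes because ψ has order p, so exponent arithmetic modulo p inside the shares matches the group arithmetic modulo λ; a share tampered with by a Byzantine dealer would fail the same equality.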
This commitment scheme ensures that the secret shares are created correctly from the polynomial in (5.12), hence they are valid. On the other hand, as we assume the intractability of computing the discrete logarithm [50], neither the server nor the users can compute the discrete logarithm log_ψ(c_{ij}) and reveal the quantized model w̄_i from c_{i0} in (5.14).

5.4.3 Secure Distance Computation

Verifiable secret sharing of the model parameters, as described in Section 5.4.2, ensures that the users follow the protocol correctly by creating valid secret shares. However, malicious users can still try to manipulate the trained model by modifying their local models instead. In this case, the secret shares will be created correctly but according to a false model. In order to ensure that the trained model is robust against such adversarial manipulations, BREA leverages a distance-based outlier detection mechanism, such as in [13, 12]. The main principle behind these mechanisms is to compute the pairwise distances between the local models and select a set of models that are sufficiently close to each other. On the other hand, the outlier detection mechanism in BREA has to protect the privacy of local models, and performing the distance computations on the true values of the model parameters would breach the privacy of individual users. We address this by a privacy-preserving distance computation approach, in which the pairwise distances are computed locally by each user, using the secret shares of the model parameters received
Finally, we note that the computation results from (5.16) are scalar values. 5.4.4 User Selection at the Server Upon receiving the computation results in (5.16) from a sufficient number of users, the server reconstructs the true values of the pairwise distances. During this phase, Byzantine users may send incorrect computation results to the server, hence the reconstruction process should be able to correct the potential errors that may occur in the computation results due to malicious users. Our decoding procedure is based on the decoding of Reed-Solomon codes. The main intuition of the decoding process is that the computations from (5.16) correspond to evaluation points of a univariate polynomial h jk :F p →F d p of degree at most 2T, where h jk (θ) :=kf j (θ)−f k (θ)k 2 (5.17) for θ ∈ {θ i } i∈[N] and j,k ∈ [N]. Accordingly, h jk can be viewed as the encoding polynomial of a Reed-Solomon code with degree at most 2T, such that the missing computations due to the dropped users correspond to the erasures in the code, and manipulated computations from Byzantine users refer to the errors in the code. Therefore, the decoding process of the server 113 corresponds to decoding an [N, 2T + 1,N− 2T ] p Reed-Solomon code with at most D erasures and at most A errors. By utilizing well-known Reed-Solomon decoding algorithms [52], the server can recover the polynomialh jk and obtain the true value of the pairwise distances by using the relation h jk (0) =kf j (0)−f k (0)k 2 =kw j −w k k 2 . At the end, the server learns the pairwise distances d jk :=kw j −w k k 2 (5.18) between the models of each pair of users j,k∈ [N]. 
Then the server converts (5.18) from the finite field to the real domain as follows,

    d̄_{jk} = φ^{−1}(d_{jk}) / q²    (5.19)

for j, k ∈ [N], where q is the integer parameter in (5.9) and the demapping function φ^{−1} : F_p → R is defined as

    φ^{−1}(x) = x      if 0 ≤ x < (p−1)/2
                x − p  if (p−1)/2 ≤ x < p.

Following [13], the aggregation function f is (α, A)-Byzantine resilient if: i) ⟨E[f], ∇C(w)⟩ ≥ (1 − sin α)||∇C(w)||², and ii) for r ∈ {2, 3, 4}, E||f||^r is bounded above by

    E||f||^r ≤ K Σ_{r_1 + ··· + r_{N−A} = r} E||w||^{r_1} ··· E||w||^{r_{N−A}}

where K denotes a generic constant. Lemma 5.2 below states that if the standard deviation caused by random sample selection and quantization is smaller than the norm of the true gradient, and 2A + 2 < N − m, then the aggregation function f from (5.32) is (α, A)-Byzantine resilient, where α depends on the ratio of the standard deviation to the norm of the gradient [13].

Lemma 5.2. Assume that 2A + 2 < N − m and η(N, A)·√d·σ₀ < ||∇C(w)||, where

    η(N, A) := sqrt( 2 ( N − A + (A(N − A − 2) + A²(N − A − 1)) / (N − 2A − 2) ) ).    (5.42)

Let w_1, ..., w_N be i.i.d. random vectors in R^d such that w_i ∼ w with E_ξ[g(w, ξ)] = ∇C(w) and E_ξ ||g(w, ξ) − ∇C(w)||² = d σ²(w). Then, the aggregation function f from (5.32) is (α, A)-Byzantine resilient, where 0 ≤ α < π/2 is defined by

    sin α = η(N, A)·√d·σ₀ / ||∇C(w)||.

Proof. From Lemma 5.1, E_{Q,ξ}[Q_q(g(w, ξ))] = ∇C(w) and E_{Q,ξ} ||Q_q(g(w, ξ)) − ∇C(w)||² ≤ d σ₀²(w). Then, the quantized multi-Krum algorithm described in Section 5.4.4, i.e., the multi-Krum algorithm applied to the quantized vectors Q_q(w_i), is (α, A)-Byzantine resilient by Proposition 3 of [13]. Hence, the function f in (5.32) is (α, A)-Byzantine resilient.

We now state our main result on the theoretical performance guarantees of BREA.

Theorem 5.1.
We assume that: 1) the cost function $C$ is three times differentiable with continuous derivatives, and is bounded from below, i.e., $C(x) \ge 0$; 2) the learning rates satisfy $\sum_{t=1}^{\infty} \gamma^{(t)} = \infty$ and $\sum_{t=1}^{\infty} (\gamma^{(t)})^2 < \infty$; 3) the second, third, and fourth moments of the quantized gradient estimator do not grow too fast with the norm of the model, i.e., for all $r \in \{2,3,4\}$, $\mathbb{E}_{Q,\xi}\|Q_q(g(w,\xi))\|^r \le A_r + B_r\|w\|^r$ for some constants $A_r$ and $B_r$; 4) there exists a constant $0 \le \alpha < \pi/2$ such that for all $w \in \mathbb{R}^d$, $\eta(N,A)\sqrt{d}\,\sigma_0(w) \le \|\nabla C(w)\| \sin\alpha$; 5) the gradient of the cost function $C$ satisfies that for $\|w\|^2 \ge R$, there exist constants $\epsilon > 0$ and $0 \le \beta < \pi/2 - \alpha$ such that

$$\|\nabla C(w)\| \ge \epsilon > 0, \quad (5.43)$$

$$\frac{w^\top \nabla C(w)}{\|w\| \cdot \|\nabla C(w)\|} \ge \cos\beta. \quad (5.44)$$

Then, BREA guarantees:

• (Robustness against Byzantine users) The protocol executes correctly against up to $A$ Byzantine users and the trained model is $(\alpha,A)$-Byzantine resilient.

• (Convergence) The sequence of gradients $\nabla C(w^{(t)})$ converges almost surely to zero,

$$\nabla C(w^{(t)}) \xrightarrow[t \to \infty]{\text{a.s.}} 0. \quad (5.45)$$

• (Privacy) The server or any group of up to $T$ users cannot compute an unknown local model. For any set of users $\mathcal{T} \subset [N]$ of size at most $T$,

$$P[\text{User } i \text{ has secret } w_i \mid \text{view}_{\mathcal{T}}] = P[\text{User } i \text{ has secret } w_i] \quad (5.46)$$

for all $i \in [N] \setminus \mathcal{T}$, where $\text{view}_{\mathcal{T}}$ denotes the messages that the members of $\mathcal{T}$ receive.

These guarantees hold for any $N \ge 2A + 1 + \max\{m+2, D+2T\}$, where $m$ is the number of selected models for aggregation.

Remark 5.1. The two conditions $\sum_{t=1}^{\infty}\gamma^{(t)} = \infty$ and $\sum_{t=1}^{\infty}(\gamma^{(t)})^2 < \infty$ are instrumental in the convergence of stochastic gradient descent algorithms [19]. The condition $\sum_{t=1}^{\infty}(\gamma^{(t)})^2 < \infty$ states that the learning rates decrease fast enough, whereas the condition $\sum_{t=1}^{\infty}\gamma^{(t)} = \infty$ bounds the rate of their decrease, ensuring that the learning rates do not decrease too fast.

Remark 5.2. We consider a general (possibly non-convex) objective function $C$. In such scenarios, proving the convergence of the model directly is challenging, and various approaches have been proposed instead.
Our approach follows [19] and [12], where we instead prove the convergence of the gradient to a flat region. We note, however, that such a region may correspond to any stationary point, including local minima as well as saddle and extremal points.

Proof. (Robustness against Byzantine users) The $(\alpha,A)$-Byzantine resilience of the trained model follows from Lemma 5.2. We next provide sufficient conditions for BREA to correctly evaluate the update function (5.32) in the presence of up to $A$ Byzantine users. Byzantine users may send any arbitrary random vector to the server or other users in every step of the protocol in Section 5.4. In particular, Byzantine users can create and send incorrect computations in three attack scenarios: i) sending invalid secret shares $s_{ij}$ in (5.13), ii) sending incorrect secure distance computations $d^{(i)}_{jk}$ in (5.16), and iii) sending an incorrect aggregate of the secret shares $s_i$ in (5.26).

The first attack scenario occurs when the secret shares $s_{ij}$ in (5.13) do not correspond to the same polynomial from (5.12). BREA utilizes verifiable secret sharing to prevent such attempts. The correctness (validity) of the secret shares can be verified by testing (5.15) whenever the majority of the surviving users are honest, i.e., $N > 2A + D$ [50, 41].

The second attack scenario can be detected and corrected by the Reed-Solomon decoding algorithm. In particular, as described in Section 5.4.4, for given $j,k \in [N]$, $\{d^{(i)}_{jk}\}_{i\in[N]}$ can be viewed as $N$ evaluation points of the polynomial $h_{jk}$ given in (5.17), whose degree is at most $2T$. The decoding process at the server then corresponds to the decoding of an $[N, 2T+1, N-2T]_p$ Reed-Solomon code with at most $D$ erasures and at most $A$ errors. Since an $[n, k, n-k+1]_p$ Reed-Solomon code with $e$ erasures can tolerate at most $\lfloor\frac{n-k-e}{2}\rfloor$ errors [52], the server can recover the correct pairwise distances as long as $A \le \lfloor\frac{N-(2T+1)-D}{2}\rfloor$, i.e., $N \ge D + 2A + 2T + 1$.
The third attack scenario can also be detected and corrected by the Reed-Solomon decoding algorithm. As described in Section 5.4.5, $\{s_i\}_{i\in[N]}$ are evaluation points of the polynomial $h$ in (5.27) of degree at most $T$. This decoding process corresponds to the decoding of an $[N, T+1, N-T]_p$ Reed-Solomon code with at most $D$ erasures and at most $A$ errors. As such, the server can recover the desired aggregate model $h(0) = \sum_{j\in\mathcal{S}} w_j$ as long as $N \ge D + 2A + T + 1$. Therefore, combining with the condition of Lemma 5.2, the sufficient condition under which BREA guarantees robustness against Byzantine users is given by

$$N \ge 2A + 1 + \max\{m + 2, D + 2T\}. \quad (5.47)$$

(Convergence) We now consider the update equation in (5.32) and prove the convergence of the random sequence $\nabla C(w^{(t)})$. From Lemma 5.2, the quantized multi-Krum function $f$ in (5.32) is $(\alpha,A)$-Byzantine resilient. Hence, from Proposition 2 of [13], the random sequence $\nabla C(w^{(t)})$ converges almost surely to zero,

$$\nabla C(w^{(t)}) \xrightarrow[t \to \infty]{\text{a.s.}} 0. \quad (5.48)$$

(Privacy) As described in Section 5.4.2, we assume the intractability of computing discrete logarithms; hence the server or any user cannot compute $w_i$ from $c_{i0}$ in (5.14). It is therefore sufficient to prove the privacy of each individual model against a group of $T$ colluding users, in the case where $\mathcal{T}$ has size exactly $T$: if $T$ users cannot get any information about $w_i$, then neither can fewer than $T$ users. Without loss of generality, let $\mathcal{T} = \{1,\dots,T\} = [T]$ and

$$\text{view}_{\mathcal{T}} = \left(\{s_{kj}\}_{k\in[N], j\in[T]},\ \{d^{(j)}_{kl}\}_{k,l\in[N], j\in[T]},\ \{s_j\}_{j\in[T]}\right),$$

where $s_{kj}$ in (5.13) is the secret share of $w_k$ sent from user $k$ to user $j$, $d^{(j)}_{kl}$ in (5.16) is the pairwise distance of the secret shares sent from users $k$ and $l$ to user $j$, and $s_j$ in (5.26) is the aggregate of the secret shares. As $\{d^{(j)}_{kl}\}_{k,l\in[N], j\in[T]}$ and $\{s_j\}_{j\in[T]}$ are determined by $\{s_{kj}\}_{k\in[N], j\in[T]}$, we can simplify the left-hand side of (5.46) as

$$P[\text{User } i \text{ has secret } w_i \mid \text{view}_{\mathcal{T}}] = P[\text{User } i \text{ has secret } w_i \mid \{s_{kj}\}_{k\in[N], j\in[T]}]. \quad (5.49)$$

For any $k \ne i$, $s_{kj}$ is independent of $w_i$.
Hence, we have

$$P[\text{User } i \text{ has secret } w_i \mid \text{view}_{\mathcal{T}}] = P[\text{User } i \text{ has secret } w_i \mid \{s_{ij}\}_{j\in[T]}]. \quad (5.50)$$

Then, for any realization of vectors $\rho_0, \dots, \rho_T \in \mathbb{F}_p^d$, we obtain

$$P[\text{User } i \text{ has secret } w_i \mid \text{view}_{\mathcal{T}}] = P[w_i = \rho_0 \mid s_{i1} = \rho_1, \dots, s_{iT} = \rho_T] = \frac{P[w_i = \rho_0, s_{i1} = \rho_1, \dots, s_{iT} = \rho_T]}{P[s_{i1} = \rho_1, \dots, s_{iT} = \rho_T]} = \frac{1/|\mathbb{F}_p^d|^{T+1}}{1/|\mathbb{F}_p^d|^{T}} = \frac{1}{|\mathbb{F}_p^d|} = P[w_i = \rho_0], \quad (5.51)$$

where (5.51) follows from the fact that any $T+1$ evaluation points define a unique polynomial of degree $T$, which completes the proof of privacy.

5.6 Experiments

In this section, we demonstrate the convergence and resilience properties of BREA compared to conventional federated learning, i.e., the federated averaging scheme from [97], which is termed FedAvg throughout this section. We measure the performance in terms of the cross-entropy loss evaluated over the training samples and the model accuracy evaluated over the test samples, with respect to the iteration index $t$.

Network architecture: We consider an image classification task with 10 classes on the MNIST dataset [84] and train a convolutional neural network with 6 layers [97], including two 5×5 convolutional layers with stride 1, where the first and second layers have 32 and 64 channels, respectively, and each is followed by ReLU activation and a 2×2 max pooling layer. It also includes a fully connected layer with 1024 units and ReLU activation, followed by a final softmax output layer.

Figure 5.2: Test accuracy of BREA and FedAvg [97] for different numbers of Byzantine users.

Experiment setup: We assume a network of $N = 40$ users, where $T = 7$ users may collude and $A$ users are malicious. We consider two cases for the number of Byzantine users: i) 0% Byzantine users ($A = 0$) and ii) 30% Byzantine users ($A = 12$). Honest users utilize the ADAM optimizer [79] to update the local model, setting the size of the local mini-batch sample to $|\xi_i^{(t)}| = 500$ for all $i \in [N]$ and $t \in [J]$, where $J$ is the total number of iterations.
Byzantine users generate vectors uniformly at random from $\mathbb{F}_p^d$, where we set the field size to $p = 2^{32} - 5$, the largest prime within 32 bits. For both schemes, BREA and FedAvg, the number of models to be aggregated is set to $m = |\mathcal{S}| = 13 < N - 2A - 2$. FedAvg randomly selects $m$ models at each iteration, while BREA selects $m$ users via (5.24).

Convergence and robustness against Byzantine users: Figure 5.2 shows the test accuracy of BREA and FedAvg for different numbers of Byzantine users. We observe that BREA with 0% and 30% Byzantine users is as efficient as FedAvg with 0% Byzantine users, while FedAvg does not tolerate Byzantine users. Figure 5.3 presents the cross-entropy loss of BREA versus FedAvg for different numbers of Byzantine users. We omit FedAvg with 30% Byzantine users as it diverges. We observe that BREA with 30% Byzantine users achieves convergence at a rate comparable to FedAvg with 0% Byzantine users, while providing robustness against Byzantine users and preserving privacy.

Figure 5.3: Convergence of BREA and FedAvg [97] for different numbers of Byzantine users.

For all cases of BREA in Figures 5.2 and 5.3, we set the quantization value in (5.9) to $q = 1024$. Figure 5.4 further illustrates the cross-entropy loss of BREA for different values of the quantization parameter $q$. We observe that BREA with a larger value of $q$ has better performance because the variance caused by the quantization function $Q_q$ defined in (5.9) decreases as $q$ increases. On the other hand, given the field size $p$, the quantization parameter $q$ should remain below a certain threshold to ensure that (5.23) holds.

5.7 Conclusion

This chapter presents the first single-server solution for Byzantine-resilient secure federated learning. Our framework is based on a verifiable secure outlier detection strategy to guarantee robustness of the trained model against Byzantine faults, while protecting the privacy of the individual users.
We provide the theoretical convergence guarantees and the fundamental performance trade-offs of our framework, in terms of the number of Byzantine adversaries and user dropouts the system can tolerate. In our experiments, we have implemented our system in a distributed network of $N = 40$ users and numerically demonstrated its convergence behaviour while providing robustness against Byzantine users and preserving privacy.

Figure 5.4: Convergence of BREA for different values of the quantization parameter $q$ in (5.9) with 30% Byzantine users.

The main limitation of our framework is that its communication overhead is $O(N^3)$, since each user sends $N^2$ distance values to the server. Although the distances are scalar valued, this cubic overhead can become a limitation for very large-scale networks. Future directions include developing single-server Byzantine-resilient secure learning architectures with efficient communication structures.

Chapter 6

LightSecAgg: a Lightweight and Versatile Design for Secure Aggregation in Federated Learning

Secure model aggregation is a key component of federated learning (FL) that aims at protecting the privacy of each user's individual model while allowing for their global aggregation. It can be applied to any aggregation-based FL approach for training a global or personalized model. Model aggregation also needs to be resilient against likely user dropouts in FL systems, making its design substantially more complex. State-of-the-art secure aggregation protocols rely on secret sharing of the random seeds used for mask generation at the users to enable the reconstruction and cancellation of the masks belonging to dropped users. The complexity of such approaches, however, grows substantially with the number of dropped users.
We propose a new approach, named LightSecAgg, to overcome this bottleneck by changing the design from "random-seed reconstruction of the dropped users" to "one-shot aggregate-mask reconstruction of the active users via mask encoding/decoding". We show that LightSecAgg achieves the same privacy and dropout-resiliency guarantees as the state-of-the-art protocols while significantly reducing the overhead for resiliency against dropped users. We also demonstrate that, unlike existing schemes, LightSecAgg can be applied to secure aggregation in the asynchronous FL setting. Furthermore, we provide a modular system design and optimized on-device parallelization for scalable implementation, by enabling computational overlapping between model training and on-device encoding, as well as improving the speed of concurrent receiving and sending of chunked masks. We evaluate LightSecAgg via extensive experiments for training diverse models (logistic regression, shallow CNNs, MobileNetV3, and EfficientNet-B0) on various datasets (MNIST, FEMNIST, CIFAR-10, GLD-23K) in a realistic FL system with a large number of users and demonstrate that LightSecAgg significantly reduces the total training time.

6.1 Introduction

Federated learning (FL) has emerged as a promising approach to enable distributed training over a large number of users while protecting the privacy of each user [98, 99]. The key idea of FL is to keep users' data on their devices and instead train local models at each user. The locally trained models are then aggregated via a server to update a global model, which is then pushed back to the users. Due to model inversion attacks (e.g., [54, 148, 165]), a critical consideration in FL design is to also ensure that the server does not learn the locally trained model of each user during model aggregation. Furthermore, model aggregation should be robust against likely user dropouts (due to poor connectivity, low battery, unavailability, etc.) in FL systems.
As such, there has been a series of works aiming at developing secure aggregation protocols for FL that protect the privacy of each user's individual model while allowing their global aggregation amidst possible user dropouts [17, 74, 134]. The state-of-the-art secure aggregation protocols essentially rely on two main principles: (1) pairwise random-seed agreement between users to generate masks that hide users' models while having an additive structure that allows their cancellation when added at the server, and (2) secret sharing of the random seeds to enable the reconstruction and cancellation of masks belonging to dropped users. The main drawback of such approaches is that the number of mask reconstructions at the server substantially grows as more users are dropped, causing a major computational bottleneck. For instance, the execution time of the SecAgg protocol proposed in [17] is observed to be significantly limited by mask reconstructions at the server [15]. SecAgg+ [8], an improved version of SecAgg, reduces the overhead at the server by replacing the complete communication graph of SecAgg with a sparse random graph, such that secret sharing is needed only within a subset of users rather than all user pairs. However, the number of mask reconstructions in SecAgg+ still increases as more users drop, eventually limiting the scalability of FL systems.

Figure 6.1: Illustration of our proposed LightSecAgg protocol. (1) Sharing encoded mask: users encode and share their generated local masks. (2) Masking model: each user masks its model by random masks and uploads its masked model to the server. (3) Reconstructing aggregate-mask: the surviving users upload the aggregate of encoded masks to reconstruct the desired aggregate-mask. The server recovers the aggregate-model by canceling out the reconstructed aggregate-mask.
There have also been several other approaches, such as [134, 74], to alleviate this bottleneck; however, they either increase round/communication complexity or compromise the dropout and privacy guarantees.

Contributions. We propose a new perspective for secure model aggregation in FL by turning the design focus from "pairwise random-seed reconstruction of the dropped users" to "one-shot aggregate-mask reconstruction of the surviving users". Using this viewpoint, we develop a new protocol named LightSecAgg that provides the same level of privacy and dropout-resiliency guarantees as the state-of-the-art while substantially reducing the aggregation (hence runtime) complexity. As illustrated in Figure 6.1, the main idea of LightSecAgg is that each user protects its local model using a locally generated random mask. This mask is then encoded and shared with other users in such a way that the aggregate-mask of any sufficiently large set of surviving users can be directly reconstructed at the server. In sharp contrast to prior schemes, in this approach the server only needs to reconstruct a single mask in the recovery phase, independent of the number of dropped users. Moreover, we provide a modular federated training system design and optimize on-device parallelization to improve the efficiency when secure aggregation and model training interact at the edge devices. This enables computational overlapping between model training and on-device encoding, as well as improving the speed of concurrent receiving and sending of chunked masks. To the best of our knowledge, this provides the first open-sourced, secure aggregation-enabled FL system that is built on a modern deep learning framework (PyTorch) and neural architectures (e.g., ResNet) with system and security co-design. We further propose system-level optimization methods to improve the run-time.
In particular, we design a federated training system and take advantage of the fact that the generation of random masks is independent of the computation of the local model; hence each user can parallelize these two operations via multi-thread processing, which benefits all evaluated secure aggregation protocols by reducing the total running time. In addition to the synchronous FL setting, where all users train local models based on the same global model and the server performs a synchronized aggregation at each round, we also demonstrate that LightSecAgg enables secure aggregation when no synchrony between users' local updates is imposed. This is unlike prior secure aggregation protocols, such as SecAgg and SecAgg+, which are not compatible with asynchronous FL. To the best of our knowledge, in the asynchronous FL setting, this is the first work to protect the privacy of the individual updates without relying on differential privacy [141] or trusted execution environments (TEEs) [109]. We run extensive experiments to empirically demonstrate the performance of LightSecAgg in a real-world cross-device FL setting with up to 200 users and compare it with two state-of-the-art protocols, SecAgg and SecAgg+. To provide comprehensive coverage of realistic FL settings, we train various machine learning models, including logistic regression, a convolutional neural network (CNN) [98], MobileNetV3 [72], and EfficientNet-B0 [139], for image classification over datasets of different image sizes: low-resolution images (FEMNIST [23], CIFAR-10 [81]) and high-resolution images (Google Landmarks Dataset 23k [150]). The empirical results show that LightSecAgg provides significant speedup for all considered FL training tasks, achieving a performance gain of 8.5×-12.7× over SecAgg and 2.9×-4.4× over SecAgg+ in realistic bandwidth settings at the users.
Hence, compared to the baselines, LightSecAgg can even survive and speed up the training of large deep neural network models on high-resolution image datasets. Breakdowns of the total running time further confirm that the primary gain lies in the complexity reduction at the server provided by LightSecAgg, especially when the number of users is large.

Related works. Beyond the secure aggregation protocols proposed in [17, 8], there have also been other works that aim at making secure aggregation more efficient. TurboAgg [134] utilizes a circular communication topology to reduce the communication overhead, but it incurs an additional round complexity and provides a weaker privacy guarantee than SecAgg, as it guarantees model privacy in the average sense rather than in the worst-case scenario. FastSecAgg [74] reduces the per-user overhead by using Fast Fourier Transform multi-secret sharing, but it provides weaker privacy and dropout guarantees compared to the other state-of-the-art protocols. The idea of one-shot reconstruction of the aggregate-mask was also employed in [163], where the aggregated masks corresponding to each user dropout pattern are prepared by a trusted third party, encoded, and distributed to the users prior to model aggregation. The major advantages of LightSecAgg over the scheme in [163] are that 1) it does not require a trusted third party, and 2) it requires significantly less randomness generation and a much smaller storage cost at each user. Furthermore, there is also a lack of system-level performance evaluations of [163] in FL experiments. Finally, we emphasize that our LightSecAgg protocol can be applied to any aggregation-based FL approach (e.g., FedNova [145], FedProx [90], FedOpt [4]), personalized FL frameworks [138, 87, 48, 107, 67], communication-efficient FL [124, 119, 45], and asynchronous FL.
6.2 Problem Setting

FL is a distributed training framework for machine learning, where the goal is to learn a global model $x$ with dimension $d$ using data held at edge devices. This can be represented by minimizing a global objective function $F$: $F(x) = \sum_{i=1}^{N} p_i F_i(x)$, where $N$ is the total number of users, $F_i$ is the local objective function of user $i$, and $p_i \ge 0$ is a weight parameter assigned to user $i$ to specify the relative impact of each user, such that $\sum_{i=1}^{N} p_i = 1$.∗

∗ For simplicity, we assume that all users have equal-sized datasets, i.e., $p_i = \frac{1}{N}$ for all $i \in [N]$.

Training in FL is performed through an iterative process, where the users interact through a server to update the global model. At each iteration, the server shares the current global model, denoted by $x(t)$, with the edge users. Each user $i$ creates a local update, $x_i(t)$. The local models are sent to the server and then aggregated by the server. Using the aggregated models, the server updates the global model $x(t+1)$ for the next iteration. In FL, some users may potentially drop from the learning procedure for various reasons, such as having unreliable communication connections. The goal of the server is to obtain the sum of the surviving users' local models. This update equation is given by $x(t+1) = \frac{1}{|\mathcal{U}(t)|}\sum_{i\in\mathcal{U}(t)} x_i(t)$, where $\mathcal{U}(t)$ denotes the set of surviving users at iteration $t$. Then, the server pushes the updated global model $x(t+1)$ to the edge users.

Local models carry extensive information about the users' datasets, and in fact their private data can be reconstructed from the local models by using a model inversion attack [54, 148, 165]. To address this privacy leakage from local models, secure aggregation has been introduced in [17]. A secure aggregation protocol enables the computation of the aggregated global model while ensuring that the server (and other users) learn no information about the local models beyond their aggregated model.
In particular, the goal is to securely recover the aggregate of the local models $y = \sum_{i\in\mathcal{U}} x_i$, where the iteration index $t$ is omitted for simplicity. Since secure aggregation protocols build on cryptographic primitives that require all operations to be carried out over a finite field, we assume that the elements of $x_i$ and $y$ are from a finite field $\mathbb{F}_q$ for some field size $q$. We require a secure aggregation protocol for FL to have the following key features.

• Threat model and privacy guarantee. We consider a threat model where the users and the server are honest but curious. We assume that up to $T$ (out of $N$) users can collude with each other as well as with the server to infer the local models of other users. The secure aggregation protocol has to guarantee that nothing can be learned beyond the aggregate-model, even if up to $T$ users cooperate with each other. We consider privacy leakage in the strong information-theoretic sense. This requires that for every subset of users $\mathcal{T} \subseteq [N]$ of size at most $T$, we must have the mutual information $I(\{x_i\}_{i\in[N]}; Y \mid \sum_{i\in\mathcal{U}} x_i, Z_{\mathcal{T}}) = 0$, where $Y$ is the collection of information at the server and $Z_{\mathcal{T}}$ is the collection of information at the users in $\mathcal{T}$.

• Dropout-resiliency guarantee. In the FL setting, it is common for users to be dropped or delayed at any time during protocol execution for various reasons, e.g., delayed/interrupted processing, poor wireless channel conditions, low battery, etc. We assume that there are at most $D$ dropped users during the execution of the protocol, i.e., there are at least $N - D$ surviving users after potential dropouts. The protocol must guarantee that the server can correctly recover the aggregated models of the surviving users, even if up to $D$ users drop.

• Applicability to asynchronous FL. Synchronizing all users for training at each round of FL can be slow and costly, especially when the number of users is large.
Asynchronous FL handles this challenge by incorporating the updates of the users in an asynchronous fashion [152, 40, 26, 30]. This asynchrony, however, creates a mismatch of staleness among the users, which causes the incompatibility of the existing secure aggregation protocols (such as [17, 8]). More specifically, since it is not known a priori which local models will be aggregated together, the current secure aggregation protocols that are based on pairwise random masking among the users fail to work. We aim at designing a versatile secure aggregation protocol that is applicable to both synchronous and asynchronous FL.

Goal. We aim to design an efficient and scalable secure aggregation protocol that simultaneously achieves strong privacy and dropout-resiliency guarantees, scaling linearly with the number of users $N$, e.g., simultaneously achieving privacy guarantee $T = \frac{N}{2}$ and dropout-resiliency guarantee $D = \frac{N}{2} - 1$. Moreover, the protocol should be compatible with both synchronous and asynchronous FL.

6.3 Overview of Baseline Protocols: SecAgg and SecAgg+

We first review the state-of-the-art secure aggregation protocols SecAgg [17] and SecAgg+ [8] as our baselines. Essentially, SecAgg and SecAgg+ require each user to mask its local model using random keys before aggregation. In SecAgg, the privacy of the individual models is protected by pairwise random masking. Through a key agreement protocol (e.g., Diffie-Hellman [39]), each pair of users $i,j \in [N]$ agree on a pairwise random seed $a_{i,j} = \text{Key.Agree}(sk_i, pk_j) = \text{Key.Agree}(sk_j, pk_i)$, where $sk_i$ and $pk_i$ are the private and public keys of user $i$, respectively. In addition, user $i$ creates a private random seed $b_i$ to prevent the privacy breaches that may occur if user $i$ is only delayed rather than dropped, in which case the pairwise masks alone are not sufficient for privacy protection.
User $i \in [N]$ then masks its model $x_i$ as

$$\tilde{x}_i = x_i + \text{PRG}(b_i) + \sum_{j: i<j} \text{PRG}(a_{i,j}) - \sum_{j: i>j} \text{PRG}(a_{j,i}),$$

where PRG is a pseudo-random generator, and sends it to the server. Finally, user $i$ secret-shares its private seed $b_i$ as well as its private key $sk_i$ with the other users via Shamir's secret sharing [154]. From the subset of users who survived the previous stage, the server collects either the shares of the private key belonging to a dropped user, or the shares of the private seed belonging to a surviving user (but not both). Using the collected shares, the server reconstructs the private seed of each surviving user, and the pairwise seeds of each dropped user. The server then computes the aggregated model as follows:

$$\sum_{i\in\mathcal{U}} x_i = \sum_{i\in\mathcal{U}} \left(\tilde{x}_i - \text{PRG}(b_i)\right) + \sum_{i\in\mathcal{D}} \left(\sum_{j: i<j} \text{PRG}(a_{i,j}) - \sum_{j: i>j} \text{PRG}(a_{j,i})\right), \quad (6.1)$$

where $\mathcal{U}$ and $\mathcal{D}$ represent the set of surviving and dropped users, respectively. SecAgg protects model privacy against $T$ colluding users and is robust to $D$ user dropouts as long as $N - D > T$.

We now illustrate SecAgg through a simple example. Consider a secure aggregation problem in FL where there are $N = 3$ users with privacy guarantee $T = 1$ and dropout-resiliency guarantee $D = 1$. Each user $i \in \{1,2,3\}$ holds a local model $x_i \in \mathbb{F}_q^d$, where $d$ is the model size and $q$ is the size of the finite field. As shown in Figure 6.2, SecAgg is composed of the following three phases.

Offline pairwise agreement. User 1 and user 2 agree on pairwise random seed $a_{1,2}$. User 1 and user 3 agree on pairwise random seed $a_{1,3}$. User 2 and user 3 agree on pairwise random seed $a_{2,3}$. In addition, user $i \in \{1,2,3\}$ creates a private random seed $b_i$. Then, user $i$ secret-shares $b_i$ and its private key $sk_i$ with the other users via Shamir's secret sharing. In this example, a 2-out-of-3 secret sharing is used to tolerate 1 curious user.

Figure 6.2: An illustration of SecAgg in the example of 3 users is depicted.
The users first agree on pairwise random seeds and secret-share their private random seeds and private keys. The local models are protected by the pairwise random masking. Suppose that user 1 drops. To recover the aggregate-mask, the server first reconstructs the private random seeds of the surviving users and the private key of user 1 by collecting the secret shares of each. Then, the server recovers $z_{1,2}$, $z_{1,3}$, $n_2$, and $n_3$, which incurs a computational cost of $4d$ at the server.

Masking and uploading of local models. To protect the privacy of each individual model, user $i \in \{1,2,3\}$ masks its model $x_i$ as follows:

$$\tilde{x}_1 = x_1 + n_1 + z_{1,2} + z_{1,3}, \quad \tilde{x}_2 = x_2 + n_2 + z_{2,3} - z_{1,2}, \quad \tilde{x}_3 = x_3 + n_3 - z_{1,3} - z_{2,3},$$

where $n_i = \text{PRG}(b_i)$ and $z_{i,j} = \text{PRG}(a_{i,j})$ are the random masks generated by a pseudo-random generator. Then user $i \in \{1,2,3\}$ sends its masked local model $\tilde{x}_i$ to the server.

Aggregate-model recovery. Suppose that user 1 drops in the previous phase. The goal of the server is to compute the aggregate of models $x_2 + x_3$. Note that

$$x_2 + x_3 = \tilde{x}_2 + \tilde{x}_3 + (z_{1,2} + z_{1,3} - n_2 - n_3). \quad (6.2)$$

Hence, the server needs to reconstruct the masks $n_2$, $n_3$, $z_{1,2}$, $z_{1,3}$ to recover $x_2 + x_3$. To do so, the server has to collect two shares for each of $b_2$, $b_3$, and $sk_1$, and then compute the aggregate model by (6.2). Since the complexity of evaluating a PRG scales linearly with its output size, the computational cost of the server for mask reconstruction is $4d$.

We note that SecAgg requires the server to compute a PRG function on each of the reconstructed seeds to recover the aggregated masks, which incurs an overhead of $O(N^2)$ (see more details in Section 6.5) and dominates the overall execution time of the protocol [17, 15]. SecAgg+ reduces the overhead of mask reconstructions from $O(N^2)$ to $O(N \log N)$ by replacing the complete communication graph of SecAgg with a sparse random graph of degree $O(\log N)$, reducing both communication and computational loads.
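The 3-user SecAgg example can be checked numerically. In this toy sketch (not a real implementation), key agreement and PRG outputs are replaced by fixed random vectors, and the field size and model dimension are hypothetical; the point is that when user 1 drops, the server must reconstruct four masks ($n_2$, $n_3$, $z_{1,2}$, $z_{1,3}$) to cancel the masking in the sum of the surviving masked models, as in (6.2).

```python
# Toy numeric sketch of the 3-user SecAgg example: seeds, key agreement, and
# PRGs are replaced by fixed random vectors over a small field (illustrative).
import numpy as np

q, d = 2**15, 4  # field size and model dimension (hypothetical)
rng = np.random.default_rng(0)

x = {i: rng.integers(0, q, d) for i in (1, 2, 3)}   # local models
n = {i: rng.integers(0, q, d) for i in (1, 2, 3)}   # n_i = PRG(b_i)
z = {(i, j): rng.integers(0, q, d)                  # z_{i,j} = PRG(a_{i,j})
     for (i, j) in [(1, 2), (1, 3), (2, 3)]}

# Masked models uploaded to the server (all arithmetic mod q):
tx = {
    1: (x[1] + n[1] + z[1, 2] + z[1, 3]) % q,
    2: (x[2] + n[2] + z[2, 3] - z[1, 2]) % q,
    3: (x[3] + n[3] - z[1, 3] - z[2, 3]) % q,
}

# User 1 drops. Per (6.2), the server reconstructs n2, n3 (surviving users'
# private seeds) and z12, z13 (dropped user's pairwise seeds), then recovers:
recovered = (tx[2] + tx[3] + z[1, 2] + z[1, 3] - n[2] - n[3]) % q
assert np.array_equal(recovered, (x[2] + x[3]) % q)
```

Note that four mask vectors of length $d$ are regenerated here, matching the $4d$ server-side cost stated in the text.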
Reconstructing pairwise random masks in SecAgg and SecAgg+ poses a major bottleneck in scaling to a large number of users.

Remark 6.1. (Incompatibility of SecAgg and SecAgg+ with Asynchronous FL). It is important to note that SecAgg and SecAgg+ cannot be applied to asynchronous FL, as the cancellation of the pairwise random masks based on the key agreement protocol is not guaranteed. This is because the users do not know a priori which local models will be aggregated together; hence the masks cannot be designed to cancel out in these protocols. We explain this in more detail in Appendix C.5.2. It is also worth noting that a recently proposed protocol known as FedBuff [109] enables secure aggregation in asynchronous FL through a trusted execution environment (TEE)-enabled buffer, where the server stores the local models that it receives in this private buffer. The reliance of FedBuff on TEEs, however, limits the buffer size in this approach, as TEEs have limited memory. It also limits its application to FL settings where TEEs are available.

6.4 LightSecAgg Protocol

Before providing a general description of LightSecAgg, we first illustrate its key ideas through the previous 3-user example in the synchronous setting. As shown in Figure 6.3, LightSecAgg has the following three phases.

Figure 6.3: An illustration of LightSecAgg in the example of 3 users is depicted. Each user first generates a single mask. Each mask of a user is encoded and shared with the other users. Each user's local model is protected by its generated mask. Suppose that user 1 drops during the execution of the protocol. The server directly recovers the aggregate-mask in one shot. In this example, LightSecAgg reduces the computational cost at the server from $4d$ to $d$.

Offline encoding and sharing of local masks. User $i \in \{1,2,3\}$ randomly picks $z_i$ and $n_i$ from $\mathbb{F}_q^d$.
User $i \in \{1,2,3\}$ creates the masked versions of $z_i$ as
$$\tilde{z}_{1,1} = -z_1 + n_1, \qquad \tilde{z}_{1,2} = 2z_1 + n_1, \qquad \tilde{z}_{1,3} = z_1 + n_1;$$
$$\tilde{z}_{2,1} = -z_2 + n_2, \qquad \tilde{z}_{2,2} = 2z_2 + n_2, \qquad \tilde{z}_{2,3} = z_2 + n_2;$$
$$\tilde{z}_{3,1} = -z_3 + n_3, \qquad \tilde{z}_{3,2} = 2z_3 + n_3, \qquad \tilde{z}_{3,3} = z_3 + n_3;$$
and user $i \in \{1,2,3\}$ sends $\tilde{z}_{i,j}$ to each user $j \in \{1,2,3\}$. Thus, user $i \in \{1,2,3\}$ receives $\tilde{z}_{j,i}$ for $j \in \{1,2,3\}$. This procedure provides robustness against 1 dropped user and privacy against 1 curious user.

Masking and uploading of local models. To make each individual model private, each user $i \in \{1,2,3\}$ masks its local model as
$$\tilde{x}_1 = x_1 + z_1, \qquad \tilde{x}_2 = x_2 + z_2, \qquad \tilde{x}_3 = x_3 + z_3, \tag{6.3}$$
and sends its masked model to the server.

One-shot aggregate-model recovery. Suppose that user 1 drops in the previous phase. To recover the aggregate of models $x_2 + x_3$, the server only needs to know the aggregated mask $z_2 + z_3$. To this end, the surviving users 2 and 3 send
$$\tilde{z}_{2,2} + \tilde{z}_{3,2} = 2(z_2 + z_3) + n_2 + n_3 \qquad \text{and} \qquad \tilde{z}_{2,3} + \tilde{z}_{3,3} = (z_2 + z_3) + n_2 + n_3$$
to the server, respectively. After receiving these messages, the server directly recovers the aggregated mask via a one-shot computation:
$$z_2 + z_3 = (\tilde{z}_{2,2} + \tilde{z}_{3,2}) - (\tilde{z}_{2,3} + \tilde{z}_{3,3}). \tag{6.4}$$
Then, the server recovers the aggregate-model $x_2 + x_3$ by subtracting $z_2 + z_3$ from $\tilde{x}_2 + \tilde{x}_3$. As opposed to SecAgg, which has to reconstruct the random seeds of the dropped users, LightSecAgg enables the server to reconstruct the desired aggregate of masks via a one-shot recovery. Compared with SecAgg, LightSecAgg reduces the server's computational cost from $4d$ to $d$ in this simple example.
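The one-shot recovery (6.4) in this example can be verified numerically. The following is a toy sketch with arbitrary field size and seeds, not the production protocol:

```python
# Toy sketch of the 3-user LightSecAgg example: each user encodes a single
# mask z_i with noise n_i, and the server recovers z_2 + z_3 in one shot.
import numpy as np

d, q = 8, 2**15  # model length and field size (toy values)
rng = np.random.default_rng(0)
x = {i: rng.integers(0, q, size=d) for i in (1, 2, 3)}
z = {i: rng.integers(0, q, size=d) for i in (1, 2, 3)}
n = {i: rng.integers(0, q, size=d) for i in (1, 2, 3)}

# Encoded shares z~_{i,j}, i.e., columns W_j = (-1, 1), (2, 1), (1, 1).
tz = {}
for i in (1, 2, 3):
    tz[i, 1] = (-z[i] + n[i]) % q
    tz[i, 2] = (2 * z[i] + n[i]) % q
    tz[i, 3] = (z[i] + n[i]) % q

tx = {i: (x[i] + z[i]) % q for i in (1, 2, 3)}  # masked uploads

# User 1 drops. Surviving users 2 and 3 send their aggregated shares:
msg2 = (tz[2, 2] + tz[3, 2]) % q   # 2(z2+z3) + n2 + n3
msg3 = (tz[2, 3] + tz[3, 3]) % q   # (z2+z3) + n2 + n3

z23 = (msg2 - msg3) % q            # one-shot recovery of z2 + z3, eq. (6.4)
agg = (tx[2] + tx[3] - z23) % q
assert np.array_equal(agg, (x[2] + x[3]) % q)
```

Note that the server performs a single subtraction of aggregated shares, i.e., a cost of $d$, rather than the four PRG expansions ($4d$) required by SecAgg.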
6.4.1 General Description of LightSecAgg for Synchronous FL

We now formally present LightSecAgg, whose idea is to encode the locally generated random masks in such a way that the server can recover the aggregate of masks from the encoded masks via a one-shot computation with a cost that does not scale with $N$. LightSecAgg has three design parameters: (1) $0 \le T \le N-1$, representing the privacy guarantee; (2) $0 \le D \le N-1$, representing the dropout-resiliency guarantee; and (3) $1 \le U \le N$, representing the targeted number of surviving users. In particular, the parameters $T$, $D$, and $U$ are selected such that $N - D \ge U > T \ge 0$.

LightSecAgg is composed of three main phases. First, each user partitions its local random mask into $U - T$ pieces and creates encoded masks via a Maximum Distance Separable (MDS) code [121, 157, 140] to provide robustness against $D$ dropped users and privacy against $T$ colluding users. Each user sends one of the encoded masks to each of the other users for the purpose of one-shot recovery. Second, each user uploads its masked local model to the server. Third, the server reconstructs the aggregated mask of the surviving users to recover their aggregate of models: each surviving user sends its aggregated encoded masks to the server, and after receiving $U$ aggregated encoded masks from the surviving users, the server recovers the aggregate-mask and the desired aggregate-model. The pseudo-code of LightSecAgg is provided in Appendix C.1. We now describe each of these phases in detail.

Offline encoding and sharing of local masks. User $i \in [N]$ picks $z_i$ uniformly at random from $\mathbb{F}_q^d$ and partitions it into $U - T$ sub-masks $[z_i]_k \in \mathbb{F}_q^{d/(U-T)}$, $k \in [U-T]$. With randomly picked $[n_i]_k \in \mathbb{F}_q^{d/(U-T)}$ for $k \in \{U-T+1, \ldots, U\}$, user $i \in [N]$ encodes the sub-masks $[z_i]_k$ as
$$[\tilde{z}_i]_j = ([z_i]_1, \ldots, [z_i]_{U-T}, [n_i]_{U-T+1}, \ldots, [n_i]_U) \cdot W_j, \tag{6.5}$$
where $W_j$ is the $j$-th column of a $T$-private MDS matrix $W \in \mathbb{F}_q^{U \times N}$.
In particular, we say an MDS matrix† is $T$-private iff the submatrix consisting of its $\{U-T+1, \ldots, U\}$-th rows is also MDS. A $T$-private MDS matrix guarantees that $I(z_i; \{[\tilde{z}_i]_j\}_{j \in \mathcal{T}}) = 0$ for any $i \in [N]$ and any $\mathcal{T} \subseteq [N]$ of size $T$, provided that the $[n_i]_k$'s are jointly uniformly random. We can always find $T$-private MDS matrices for any $U$, $N$, and $T$ (e.g., [123, 157, 121]). Each user $i \in [N]$ sends $[\tilde{z}_i]_j$ to user $j \in [N] \setminus \{i\}$. At the end of the offline encoding and sharing of local masks, each user $i \in [N]$ holds $[\tilde{z}_j]_i$ for all $j \in [N]$.‡

† A matrix $W \in \mathbb{F}_q^{U \times N}$ ($U < N$) is an MDS matrix if any $U \times U$ sub-matrix of $W$ is non-singular.
‡ All users communicate through secure (private and authenticated) channels, i.e., the server only receives the encrypted versions of the $[\tilde{z}_i]_j$'s. Such secure communication is also used in prior works on secure aggregation, e.g., SecAgg and SecAgg+.

Masking and uploading of local models. To protect the local models, each user $i$ masks its local model as $\tilde{x}_i = x_i + z_i$ and sends it to the server. Since some users may drop in this phase, the server identifies the set of surviving users, denoted by $\mathcal{U}_1 \subseteq [N]$. The server intends to recover $\sum_{i \in \mathcal{U}_1} x_i$. We note that before masking the model, each user quantizes the local model to convert it from the domain of real numbers to the finite field (Appendix C.5.5).

One-shot aggregate-model recovery. After identifying the surviving users in the previous phase, each user $j \in \mathcal{U}_1$ is notified to send its aggregated encoded sub-masks $\sum_{i \in \mathcal{U}_1} [\tilde{z}_i]_j$ to the server for the purpose of one-shot recovery. We note that each $\sum_{i \in \mathcal{U}_1} [\tilde{z}_i]_j$ is an encoded version of the $\sum_{i \in \mathcal{U}_1} [z_i]_k$ for $k \in [U-T]$ using the MDS matrix $W$ (see more details in Appendix C.2). Thus, the server is able to recover $\sum_{i \in \mathcal{U}_1} [z_i]_k$ for $k \in [U-T]$ via MDS decoding after receiving any $U$ messages from the participating users. The server obtains the aggregated mask $\sum_{i \in \mathcal{U}_1} z_i$ by concatenating the $\sum_{i \in \mathcal{U}_1} [z_i]_k$'s.
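The general encoding (6.5) and one-shot MDS decoding can be sketched as follows. This is an illustrative toy, not the actual implementation: it assumes a Vandermonde matrix over a small prime field as the $T$-private MDS matrix (with distinct nonzero evaluation points, its last $T$ rows again have non-singular square submatrices, which is one standard construction), it uses sympy's modular matrix inverse in place of a fast MDS decoder, and all parameter values are arbitrary.

```python
# Toy sketch of general LightSecAgg mask encoding and one-shot decoding.
import numpy as np
from sympy import Matrix

p = 65537                         # prime field size (toy)
N, T, D = 5, 2, 1                 # users, privacy, dropout tolerance
U = 4                             # chosen so that N - D >= U > T
seg, dseg = U - T, 8 // (U - T)   # d = 8 split into U-T sub-masks of length dseg

rng = np.random.default_rng(0)
alpha = list(range(1, N + 1))     # distinct nonzero evaluation points

# Each user i: sub-masks [z_i]_1..[z_i]_{U-T} plus T noise rows, encoded per
# (6.5) as [z~_i]_j = row-stack([z_i]; [n_i]) . W_j with W[u][j] = alpha_j^u.
z = rng.integers(0, p, size=(N, seg, dseg))
noise = rng.integers(0, p, size=(N, T, dseg))
rows = np.concatenate([z, noise], axis=1)          # N x U x dseg
W = np.array([[pow(a, u, p) for a in alpha] for u in range(U)])
shares = np.einsum('iud,uj->ijd', rows, W) % p     # shares[i][j] goes to user j

# User 0 drops; the U survivors each send the sum of the shares they hold.
U1 = [1, 2, 3, 4]
agg_shares = (shares[U1].sum(axis=0) % p)[U1]      # U x dseg
S = Matrix([[int(v) for v in row] for row in agg_shares])

# One-shot decoding: invert the U x U sub-matrix of W (transposed) mod p.
Wsub = Matrix([[pow(alpha[j], u, p) for u in range(U)] for j in U1])
agg = np.array([[int(v) % p for v in row]
                for row in (Wsub.inv_mod(p) * S).tolist()], dtype=np.int64)

# The first U-T decoded rows concatenate to sum_{i in U1} z_i.
assert np.array_equal(agg[:seg], z[U1].sum(axis=0) % p)
```

Here the decoding is written as a dense $U \times U$ modular inverse for clarity; the $O(U \log U)$ cost discussed in Section 6.5 would instead come from a fast Reed-Solomon-style decoder.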
Lastly, the server recovers the desired aggregate of models for the set of participating users $\mathcal{U}_1$ by subtracting $\sum_{i \in \mathcal{U}_1} z_i$ from $\sum_{i \in \mathcal{U}_1} \tilde{x}_i$.

Remark 6.2. Note that it is not necessary to have a stable communication link between every pair of users in LightSecAgg. Specifically, given the design parameter $U$, LightSecAgg only requires at least $U$ surviving users at any time during the execution. That is, even if up to $N - U$ users drop or are delayed due to unstable communication links, the server can still reconstruct the aggregate-mask.

Remark 6.3. We note that LightSecAgg directly applies to secure aggregation of weighted local models. The sharing of the masking keys among the clients does not require knowledge of the weight coefficients. For example, LightSecAgg can work in the case where the users do not have equal-sized datasets. Suppose that user $i$ holds a dataset with $s_i$ samples. Rather than directly masking the local model $x_i$, user $i$ first computes $x'_i = s_i x_i$. Then, user $i$ uploads $x'_i + z_i$ to the server. Through the LightSecAgg protocol, the server can securely recover $\sum_{i \in \mathcal{U}} x'_i = \sum_{i \in \mathcal{U}} s_i x_i$. By dividing by $\sum_{i \in \mathcal{U}} s_i$, the server obtains the desired aggregate of weighted models $\sum_{i \in \mathcal{U}} p_i x_i$, where $p_i = s_i / \sum_{i \in \mathcal{U}} s_i$.

6.4.2 Extension to Asynchronous FL

We now describe how LightSecAgg can be applied to asynchronous FL. We consider the asynchronous FL setting with bounded staleness as considered in [109], where the updates of the users are not synchronized and the staleness of each user is bounded by $\tau_{\max}$. In this setting, the server stores the models that it receives in a buffer of size $K$ and updates the global model once the buffer is full. More generally, LightSecAgg may apply to any asynchronous FL setting where a group of local models is aggregated at each round; the group size does not need to be fixed across rounds.
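The buffered aggregation rule described above (store incoming local updates in a buffer of size $K$ and update the global model once the buffer is full) can be sketched as follows. The class and averaging rule are illustrative stand-ins, not the FedBuff or LightSecAgg implementation:

```python
# Toy sketch of buffered asynchronous aggregation at the server.
class BufferedAggregator:
    def __init__(self, K, d):
        self.K = K                     # buffer size
        self.buffer = []               # pending (possibly stale) local updates
        self.global_model = [0.0] * d

    def receive(self, update):
        """Store one local update; aggregate once the buffer is full."""
        self.buffer.append(update)
        if len(self.buffer) == self.K:
            avg = [sum(coords) / self.K for coords in zip(*self.buffer)]
            self.global_model = [g + a for g, a in zip(self.global_model, avg)]
            self.buffer.clear()

agg = BufferedAggregator(K=2, d=3)
agg.receive([1.0, 1.0, 1.0])
agg.receive([3.0, 3.0, 3.0])           # buffer full -> global model updated
assert agg.global_model == [2.0, 2.0, 2.0]
```

In the secure version, each stored update would be masked, and the users would aggregate their encoded masks according to the timestamps of the buffered updates, as described next.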
While the baselines are not compatible with this setting, LightSecAgg can be applied by encoding the local masks in a way that the server can recover the aggregate of masks from the encoded masks via a one-shot computation, even though the masks are generated in different training rounds. Specifically, the users share the encoded masks together with a timestamp, which determines which encoded masks should be aggregated for the reconstruction of the aggregate of masks. As the users aggregate the encoded masks after the server stores the local updates in the buffer, the users can aggregate the encoded masks according to the timestamps of the stored updates. Due to the commutativity of MDS encoding and addition, the server can reconstruct the aggregate of masks even though the masks were generated in different training rounds. We postpone the detailed description of the LightSecAgg protocol for the asynchronous setting to Appendix C.5.

6.5 Theoretical Analysis

6.5.1 Theoretical Guarantees

We now state our main result for the theoretical guarantees of the LightSecAgg protocol.

Theorem 6.1. Consider a secure aggregation problem in federated learning with $N$ users. Then, the proposed LightSecAgg protocol can simultaneously achieve (1) a privacy guarantee against up to any $T$ colluding users, and (2) a dropout-resiliency guarantee against up to any $D$ dropped users, for any pair of privacy guarantee $T$ and dropout-resiliency guarantee $D$ such that $T + D < N$.

The proof of Theorem 6.1, which applies to both the synchronous and asynchronous FL settings, is presented in Appendix C.2.

Remark 6.4. Theorem 6.1 provides a trade-off between the privacy and dropout-resiliency guarantees, i.e., LightSecAgg can increase the privacy guarantee by reducing the dropout-resiliency guarantee and vice versa. Like SecAgg [17], LightSecAgg achieves a worst-case dropout-resiliency guarantee.
That is, for any privacy guarantee $T$ and number of dropped users $D < N - T$, LightSecAgg ensures that any set of dropped users of size $D$ can be tolerated in secure aggregation. In contrast, SecAgg+ [8], FastSecAgg [74], and TurboAgg [134] relax the worst-case constraint to random dropouts and provide a probabilistic dropout-resiliency guarantee, i.e., the desired aggregate-model can be correctly recovered with high probability.

Remark 6.5. From the training convergence perspective, LightSecAgg only adds a quantization step to the local model updates of the users. The impact of this model quantization on FL convergence is well studied in synchronous FL [119, 45]. In asynchronous FL, however, we need to analyze the convergence of LightSecAgg. We provide this analysis in the smooth and non-convex setting in Appendix C.5.7.

6.5.2 Complexity Analysis of LightSecAgg

We measure the storage cost, communication load, and computational load of LightSecAgg in units of elements or operations in $\mathbb{F}_q$ for a single training round. Recall that $U$ is a design parameter chosen such that $N - D \ge U > T$.

Offline storage cost. Each user $i$ independently generates a random mask $z_i$ of length $d$. Additionally, each user $i$ stores a coded mask $[\tilde{z}_j]_i$ of size $\frac{d}{U-T}$ for each $j \in [N]$. Hence, the total offline storage cost at each user is $(1 + \frac{N}{U-T})d$.

Offline communication and computation loads. In each round of secure aggregation, before the local model is computed, each user prepares the offline coded random masks and distributes them to the other users. Specifically, each user encodes $U$ local data segments, each of size $\frac{d}{U-T}$, into $N$ coded segments and distributes each of them to one of the $N$ users. Hence, the offline computational and communication loads of LightSecAgg at each user are $O(\frac{dN \log N}{U-T})$ and $O(\frac{dN}{U-T})$, respectively.

Communication load during aggregation.
While each user uploads a masked model of length $d$, in the phase of aggregate-model recovery, no matter how many users drop, each surviving user in $\mathcal{U}_1$ sends a coded mask of size $\frac{d}{U-T}$. The server is guaranteed to recover the aggregate-model of the surviving users in $\mathcal{U}_1$ after receiving messages from any $U$ users. The total communication load at the server in the phase of mask recovery is therefore $\frac{U}{U-T}d$.

Computation load during aggregation. The major computational bottleneck of LightSecAgg is the decoding process to recover $\sum_{j \in \mathcal{U}_1} z_j$ at the server. This involves decoding a $U$-dimensional MDS code from $U$ coded symbols, which can be performed with $O(U \log U)$ operations on elements of $\mathbb{F}_q$, hence a total computational load of $\frac{U \log U}{U-T}d$.

We compare the communication and computational complexities of LightSecAgg with the baseline protocols. In particular, we consider the case where the secure aggregation protocols aim at simultaneously providing a privacy guarantee $T = \frac{N}{2}$ and a dropout-resiliency guarantee $D = pN$ for some $0 \le p < \frac{1}{2}$. As shown in Table 6.1, by choosing $U = (1-p)N$, LightSecAgg significantly improves the computational efficiency at the server during aggregation. SecAgg and SecAgg+ incur total computational loads of $O(dN^2)$ and $O(dN \log N)$, respectively, at the server, while the server complexity of LightSecAgg remains nearly constant with respect to $N$. This is expected to substantially reduce the overall aggregation time for a large number of users, which is bottlenecked by the server's computation in SecAgg [17, 15].

Table 6.1: Complexity comparison between SecAgg, SecAgg+, and LightSecAgg. Here $N$ is the total number of users, $d$ is the model size, and $s$ is the length of the secret keys used as seeds for the PRG ($s \ll d$). In the table, U stands for user and S stands for server.

                       SecAgg               SecAgg+                       LightSecAgg
offline comm. (U)      $O(sN)$              $O(s \log N)$                 $O(d)$
offline comp. (U)      $O(dN + sN^2)$       $O(d \log N + s \log^2 N)$    $O(d \log N)$
online comm. (U)       $O(d + sN)$          $O(d + s \log N)$             $O(d)$
online comm. (S)       $O(dN + sN^2)$       $O(dN + sN \log N)$           $O(dN)$
online comp. (U)       $O(d)$               $O(d)$                        $O(d)$
reconstruction (S)     $O(dN^2)$            $O(dN \log N)$                $O(d \log N)$

More detailed discussions, as well as a comparison with another recently proposed secure aggregation protocol [163], which achieves a server complexity similar to that of LightSecAgg, are carried out in Appendix C.2.1.

6.6 System Design and Optimization

Apart from the theoretical design and analysis, we have further designed an FL training system to reduce the overhead of secure model aggregation and to enable realistic evaluation of LightSecAgg in cross-device FL. The software architecture is shown in Figure 6.4. To keep the software architecture lightweight and maintainable, we do not over-design, and only modularize the system into a foundation layer and an algorithm layer. The foundation layer (blocks below the dashed line in Figure 6.4) contains the communicator and the training engine. The communicator supports multiple communication protocols (PyTorch RPC [114] and gRPC [60]), but it provides a unified communication interface to the algorithm layer.
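The unified communicator abstraction can be sketched as follows. All class and method names here are hypothetical stand-ins for illustration, not the released system's APIs: concrete backends (PyTorch RPC, gRPC) would each implement one common interface so that the algorithm layer stays backend-agnostic.

```python
# Illustrative sketch of a backend-agnostic communicator interface.
from abc import ABC, abstractmethod

class Communicator(ABC):
    """Unified interface exposed to the algorithm layer."""
    @abstractmethod
    def send(self, dst: int, msg) -> None: ...
    @abstractmethod
    def recv(self, me: int): ...

class InMemoryComm(Communicator):
    """Toy loopback backend standing in for an RPC/gRPC implementation."""
    def __init__(self):
        self.mailboxes = {}
    def send(self, dst, msg):
        self.mailboxes.setdefault(dst, []).append(msg)
    def recv(self, me):
        return self.mailboxes[me].pop(0)

comm = InMemoryComm()
comm.send(dst=1, msg=[1.0, 2.0])   # e.g., an encoded mask segment
received = comm.recv(me=1)
assert received == [1.0, 2.0]
```

Swapping the backend then requires no change in the secure aggregation logic, which is the design goal of the foundation layer.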
In the training engine, in addition to standard PyTorch for GPU, we also compile ARM-based PyTorch for embedded edge devices (e.g., Raspberry Pi).

Figure 6.4: Overview of the system design (foundation layer: tensor-aware communicator and training engine; algorithm layer: Client Manager with Client Encoder and Server Manager with Secure Aggregator, supported by security primitive APIs such as key agreement, secret sharing, MDS encoding/decoding, and pseudo-random generators).

In the algorithm layer, Client Manager calls Trainer in the foundation layer to perform on-device training. Client Manager also integrates Client Encoder to complete the secure aggregation protocol, which is supported by the security primitive APIs. In Server Manager, Secure Aggregator maintains the cache for masked models; once the cache is full, it starts reconstruction based on the aggregated masks uploaded by the clients. The server then synchronizes the updated global model to the clients for the next round of training. In Figure 6.4, we mark the 7 sequential steps in a single FL round as circled numbers to clearly show the interplay between federated training and the secure aggregation protocol. This software architecture has two special designs that further reduce the computational and communication overhead of the secure aggregation protocol.

Parallelization of offline phase and model training.
We note that for all considered protocols (LightSecAgg, SecAgg, and SecAgg+), the communication and computation time needed to generate and exchange the random masks in the offline phase can be overlapped with model training. Hence, in our design, we reduce the offline computation and communication overhead by allowing each user to train the model and carry out the offline phase simultaneously, by running two parallel processes (multi-threading performs relatively worse due to Python's GIL, the Global Interpreter Lock), shown in purple and red in Figure 6.4. We also demonstrate the timing diagram of the overlapped implementation in a single FL training round in Figure 6.5, and we analyze its impact on overall acceleration in Section 6.7.2.

Figure 6.5: Timing diagrams of the (a) non-overlapped and (b) overlapped implementations in LightSecAgg and SecAgg+ [8] for a single FL training round to train MobileNetV3 [72] on the CIFAR-100 dataset [81]. SecAgg [17] is not included, as it takes much longer than the other two protocols.

Optimized federated training system and communication APIs via tensor-aware RPC (Remote Procedure Call). As the yellow blocks in Figure 6.4 show, we specially design the sending and receiving queues to accelerate the scenario in which a device has to be sender and receiver simultaneously. As such, the offline phase of LightSecAgg can be further accelerated by parallelizing the transmission and reception of the $[\tilde{z}_i]_j$'s. This design also speeds up the offline pairwise agreement in SecAgg and SecAgg+. Moreover, we choose PyTorch RPC [114] as the communication backend rather than gRPC [60] or MPI [111], because its tensor-aware communication API reduces the latency in scenarios where the communicator is launched frequently, i.e., each client in the offline mask-exchanging phase needs to distribute $N$ coded segments to $N$ users. With the above design, we can deploy LightSecAgg on both embedded IoT devices and AWS EC2 instances. AWS EC2 instances can also represent a realistic cross-device setting because, in our experiments, we use AWS EC2 m3.medium instances, which are CPU-based and have the same hardware configuration as modern smartphones such as iOS and Android devices. Furthermore, we package our system as a Docker image to simplify system deployment to hundreds of edge devices. To promote reproducible research in the MLSys community, we publicly release the source code at https://github.com/LightSecAgg/MLSys2022_anonymous.

Table 6.2: Summary of the four implemented machine learning tasks and the performance gain of LightSecAgg with respect to SecAgg and SecAgg+ (each gain cell lists the speedup over SecAgg, followed by the speedup over SecAgg+). All learning tasks are for image classification. MNIST, FEMNIST, and CIFAR-10 are low-resolution datasets, while the images in GLD-23K are high-resolution and hence cost much longer training time; LR and CNN are shallow models, while MobileNetV3 and EfficientNet-B0 are much larger models, tailored for efficient edge training and inference.

No.  Dataset         Model                   Model size ($d$)   Non-overlapped   Overlapped    Aggregation-only
1    MNIST [83]      Logistic Regression     7,850              6.7x, 2.5x       8.0x, 2.9x    13.2x, 4.2x
2    FEMNIST [23]    CNN [98]                1,206,590          11.3x, 3.7x      12.7x, 4.1x   13.0x, 4.1x
3    CIFAR-10 [81]   MobileNetV3 [72]        3,111,462          7.6x, 2.8x       9.5x, 3.3x    13.1x, 3.9x
4    GLD-23K [150]   EfficientNet-B0 [139]   5,288,548          3.3x, 1.6x       3.4x, 1.7x    13.0x, 4.1x

6.7 Experimental Results

6.7.1 Setup

Dataset and models. To provide comprehensive coverage of realistic FL settings, we train four models over computer vision datasets of different sizes, summarized in Table 6.2. The hyper-parameter settings are provided in Appendix C.3.

Dropout rate.
To model the dropped users, we randomly select $pN$ users, where $p$ is the dropout rate. We consider the worst-case scenario [17], in which the selected $pN$ users artificially drop after uploading their masked models. All three protocols provide a privacy guarantee of $T = \frac{N}{2}$ and resiliency for three different dropout rates, $p = 0.1$, $p = 0.3$, and $p = 0.5$, which are realistic values according to industrial observations of real FL systems [14]: even when carefully selecting devices that are likely to stay online during the training period, the dropout rate is as high as 10%; when considering intermittently connected devices, only up to 10K devices can participate simultaneously when there are 10M daily active devices (1:1000).

Number of users and communication bandwidth. In our experiments, we train with up to $N = 200$ users. The measured real bandwidth is 320 Mb/s. We also consider two other bandwidth settings, corresponding to 4G (LTE-A) and 5G cellular networks, as discussed later.

Baselines. We analyze and compare the performance of LightSecAgg with two baseline schemes: SecAgg and SecAgg+, described in Section 6.3. While there are also other secure aggregation protocols (e.g., TurboAgg [134] and FastSecAgg [74]), we use SecAgg and SecAgg+ as our baselines, since the other schemes weaken the privacy guarantees, as discussed in the Related Works part of Section 3.1.

Figure 6.6: Total running time of LightSecAgg versus the state-of-the-art protocols (SecAgg and SecAgg+) to train CNN [98] on the FEMNIST dataset [23], as the number of users increases, for various dropout rates: (a) non-overlapped and (b) overlapped implementations.

6.7.2 Overall Evaluation and Performance Analysis

For the performance analysis, we measure the total running time of a single round of global iteration, which includes model training and secure aggregation with each protocol, while gradually increasing the number of users $N$ for different user dropout rates.
Our results from training CNN [98] on the FEMNIST dataset [23] are demonstrated in Figure 6.6. The performance gain of LightSecAgg with respect to SecAgg and SecAgg+ for the other models is also provided in Table 6.2. More detailed experimental results are provided in Appendix C.3. We make the following key observations.

Impact of dropout rate: The total running time of SecAgg and SecAgg+ increases monotonically with the dropout rate. This is because their total running time is dominated by the mask recovery at the server, which increases quadratically with the number of users.

Non-overlapped vs. overlapped: In the non-overlapped implementation, LightSecAgg provides a speedup of up to 11.3x and 3.7x over SecAgg and SecAgg+, respectively, by significantly reducing the server's execution time; in the overlapped implementation, LightSecAgg provides a further speedup of up to 12.7x and 4.1x over SecAgg and SecAgg+, respectively. This is due to the fact that LightSecAgg requires more communication and a higher computational cost in the offline phase than the baseline protocols, and the overlapped implementation helps to mitigate this extra cost.

Impact of model size: LightSecAgg provides a significant speedup of the aggregate-model recovery phase at the server over the baseline protocols for all considered model sizes. When training EfficientNet-B0 on the GLD-23K dataset, LightSecAgg provides the smallest speedup, as this is the most training-intensive task: training time is dominant, and training takes almost the same time under LightSecAgg and the baseline protocols.

Aggregation-only: When comparing the aggregation time only, the speedup remains the same across model sizes, as shown in Table 6.2. We note that speeding up the aggregation phase by itself is still very important, because the local training and aggregation phases do not necessarily happen one immediately after the other.
For example, local training may be done sporadically and opportunistically throughout the day (whenever resources are available), while global aggregation may be postponed to a later time when a large fraction of the users are done with local training and are available for aggregation (e.g., 2 am).

Impact of $U$: LightSecAgg incurs the smallest running time for the case of $p = 0.3$, which is almost identical to the case of $p = 0.1$. Recall that LightSecAgg can select the design parameter $U$ between $T = 0.5N$ and $N - D = (1-p)N$. Within this range, while increasing $U$ reduces the size of the symbols to be decoded, it also increases the complexity of decoding each symbol. The experimental results suggest that the optimal choice for both $p = 0.1$ and $p = 0.3$ is $U = \lfloor 0.7N \rfloor$, which leads to a faster execution than for $p = 0.5$, where $U$ can only be chosen as $U = 0.5N + 1$.

Table 6.3: Performance gain of LightSecAgg in different bandwidth settings.

Protocols   4G (98 Mbps)   320 Mbps   5G (802 Mbps)
SecAgg      8.5x           12.7x      13.5x
SecAgg+     2.9x           4.1x       4.4x

Table 6.4: Breakdown of the running time (sec) of LightSecAgg and the state-of-the-art protocols (SecAgg [17] and SecAgg+ [8]) to train CNN [98] on the FEMNIST dataset [23] with $N = 200$ users, for dropout rates $p = 10\%$, $30\%$, $50\%$.
Protocol      Phase        Non-overlapped ($p$ = 10% / 30% / 50%)   Overlapped ($p$ = 10% / 30% / 50%)
LightSecAgg   Offline      69.3 / 69.0 / 191.2                      75.1 / 74.9 / 196.9
              Training     22.8 / 22.8 / 22.8
              Uploading    12.4 / 12.2 / 21.6                       12.6 / 12.0 / 21.4
              Recovery     40.9 / 40.7 / 64.5                       40.7 / 41.0 / 64.9
              Total        145.4 / 144.7 / 300.1                    123.4 / 127.3 / 283.2
SecAgg        Offline      95.6 / 98.6 / 102.6                      101.2 / 102.3 / 101.3
              Training     22.8 / 22.8 / 22.8
              Uploading    10.7 / 10.9 / 11.0                       10.9 / 10.8 / 11.2
              Recovery     911.4 / 1499.2 / 2087.0                  911.2 / 1501.3 / 2086.8
              Total        1047.5 / 1631.5 / 2216.4                 1030.3 / 1614.4 / 2198.9
SecAgg+       Offline      67.9 / 68.1 / 69.2                       73.9 / 73.8 / 74.2
              Training     22.8 / 22.8 / 22.8
              Uploading    10.7 / 10.8 / 10.7                       10.7 / 10.8 / 10.9
              Recovery     379.1 / 436.7 / 495.5                    378.9 / 436.7 / 497.3
              Total        470.5 / 538.4 / 608.2                    463.6 / 521.3 / 582.4

Impact of bandwidth: We have also analyzed the impact of the communication bandwidth at the users. In addition to the default bandwidth setting used in this section, we have considered two other edge scenarios, 4G (LTE-A) and 5G cellular networks, using realistic bandwidth settings of 98 and 802 Mbps, respectively [103, 122]. The results are reported in Table 6.3 for a single FL round to train CNN over FEMNIST.

6.7.3 Performance Breakdown

To further investigate the primary gain of LightSecAgg, we provide the breakdown of the total running time for training CNN [98] on the FEMNIST dataset [23] in Table 6.4. The breakdown of the running time confirms that the primary gain lies in the complexity reduction at the server provided by LightSecAgg, especially for a large number of users.

Figure 6.7: Accuracy of asynchronous LightSecAgg and FedBuff on the CIFAR-10 dataset [81] with two strategies for mitigating staleness: a constant function $s(\tau) = 1$, named Constant, and a polynomial function $s_\alpha(\tau) = (1+\tau)^{-\alpha}$, named Poly, with $\alpha = 1$. The accuracy is reasonable since we use a variant of LeNet-5 [152].
6.7.4 Convergence Performance in Asynchronous FL

As described in Remark 6.1, SecAgg and SecAgg+ are not applicable to asynchronous FL, and hence we cannot compare the total running time of LightSecAgg with these baseline secure aggregation protocols. As such, in our experiments here we instead focus on the convergence performance of LightSecAgg compared to FedBuff [109], to investigate the impact of asynchrony and quantization on performance. In Figure 6.7, we demonstrate that LightSecAgg has almost the same performance as FedBuff on the CIFAR-10 dataset, even though LightSecAgg includes quantization noise to protect the privacy of the individual local updates of the users. The details of the experimental setting and additional experiments for asynchronous FL are provided in Appendix C.5.8.

6.8 Conclusion and Future Works

This paper proposed LightSecAgg, a new approach for secure aggregation in synchronous and asynchronous FL. Compared with the state-of-the-art protocols, LightSecAgg reduces the overhead of model aggregation in FL by leveraging one-shot aggregate-mask reconstruction of the surviving users, while providing the same privacy and dropout-resiliency guarantees. In a realistic FL framework, extensive empirical results show that LightSecAgg can provide substantial speedup over the baseline protocols for training diverse machine learning models. While we focused on privacy in this work (under the honest-but-curious threat model), an interesting direction for future research is to combine LightSecAgg with state-of-the-art Byzantine-robust aggregation protocols (e.g., [69, 132, 46, 76]) to also mitigate Byzantine users while ensuring privacy.

Bibliography

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. "Deep learning with differential privacy". In: ACM Conference on Computer and Communications Security. 2016, pp. 308–318.

[2] Gergely Ács and Claude Castelluccia. "I Have a DREAM! (DiffeRentially privatE smArt Metering)".
In: International Workshop on Information Hiding. Springer, 2011, pp. 118–132.

[3] Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. "Byzantine stochastic gradient descent". In: Advances in Neural Information Processing Systems. 2018, pp. 4613–4623.

[4] Muhammad Asad, Ahmed Moustafa, and Takayuki Ito. "FedOpt: Towards communication efficiency and privacy preservation in federated learning". In: Applied Sciences 10.8 (2020), p. 2864.

[5] Zuzana Beerliová-Trubíniová. "Efficient multi-party computation with information-theoretic security". PhD thesis. ETH Zurich, 2008.

[6] Zuzana Beerliová-Trubíniová and Martin Hirt. "Perfectly-secure MPC with linear communication complexity". In: Theory of Cryptography Conference. Springer, 2008, pp. 213–230.

[7] Amos Beimel. "Secret-sharing schemes: a survey". In: International Conference on Coding and Cryptology. Springer, 2011, pp. 11–46.

[8] James Bell, K. A. Bonawitz, Adrià Gascón, Tancrède Lepoint, and Mariana Raykova. "Secure Single-Server Vector Aggregation with (Poly)Logarithmic Overhead". 2020.

[9] James Bell, Keith Bonawitz, Adrià Gascón, Tancrède Lepoint, and Mariana Raykova. "Secure Single-Server Aggregation with (Poly)Logarithmic Overhead". In: IACR Cryptology ePrint Archive (2020). url: https://eprint.iacr.org/2020/704.

[10] Michael Ben-Or, Shafi Goldwasser, and Avi Wigderson. "Completeness theorems for non-cryptographic fault-tolerant distributed computation". In: ACM Symposium on Theory of Computing. 1988, pp. 1–10.

[11] Arjun Nitin Bhagoji, Supriyo Chakraborty, Prateek Mittal, and Seraphin Calo. "Analyzing federated learning through an adversarial lens". In: arXiv preprint arXiv:1811.12470 (2018).

[12] Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. "Machine learning with adversaries: Byzantine tolerant gradient descent". In: Advances in Neural Information Processing Systems. 2017, pp. 119–129.

[13] Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. "Byzantine-tolerant machine learning".
In: arXiv preprint arXiv:1703.02757 (2017).

[14] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečný, Stefano Mazzocchi, H. Brendan McMahan, et al. "Towards federated learning at scale: System design". In: 2nd SysML Conference. 2019.

[15] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloé Kiddon, Jakub Konečný, Stefano Mazzocchi, Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. "Towards Federated Learning at Scale: System Design". In: Proceedings of Machine Learning and Systems. Vol. 1. 2019, pp. 374–388. url: https://proceedings.mlsys.org/paper/2019/file/bd686fd640be98efaae0091fa301e613-Paper.pdf.

[16] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. "Practical secure aggregation for federated learning on user-held data". In: Conference on Neural Information Processing Systems (2016).

[17] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. "Practical secure aggregation for privacy-preserving machine learning". In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 2017, pp. 1175–1191.

[18] Keith Bonawitz, Fariborz Salehi, Jakub Konečný, Brendan McMahan, and Marco Gruteser. "Federated learning with autotuned communication-efficient secure aggregation". In: arXiv preprint arXiv:1912.00131 (2019).

[19] Léon Bottou. "Online learning and stochastic approximations". In: Online Learning in Neural Networks 17.9 (1998), p. 142.

[20] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[21] J. Brinkhuis and V. Tikhomirov. Optimization: Insights and Applications. Princeton Series in Applied Mathematics. Princeton University Press, 2011.
[22] Martin Burkhart, Mario Strasser, Dilip Many, and Xenofontas Dimitropoulos. “SEPIA: Privacy-preserving aggregation of multi-domain network events and statistics”. In: Network 1.101101 (2010). [23] Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H Brendan McMahan, Virginia Smith, and Ameet Talwalkar. “Leaf: A benchmark for federated settings”. In: arXiv preprint arXiv:1812.01097 (2018). [24] Octavian Catrina and Amitabh Saxena. “Secure computation with fixed-point numbers”. In: International Conference on Financial Cryptography and Data Security. Springer. 2010, pp. 35–50. [25] Hervé Chabanne, Amaury de Wargny, Jonathan Milgram, Constance Morel, and Emmanuel Prouff. “Privacy-Preserving Classification on Deep Neural Network.” In: IACR Cryptol. ePrint Arch. 2017 (2017), p. 35. [26] Zheng Chai, Yujing Chen, Liang Zhao, Yue Cheng, and Huzefa Rangwala. “FedAT: A communication-efficient federated learning method with asynchronous tiers under non-iid data”. In: arXiv preprint arXiv:2010.05958 (2020). [27] Kamalika Chaudhuri and Claire Monteleoni. “Privacy-preserving logistic regression”. In: Advances in Neural Information Processing Systems. 2009, pp. 289–296. [28] Valerie Chen, Valerio Pastro, and Mariana Raykova. “Secure Computation for Machine Learning With SPDZ”. In: arXiv:1901.00329 (2019). [29] Yudong Chen, Lili Su, and Jiaming Xu. “Distributed statistical machine learning in adversarial settings: Byzantine gradient descent”. In: Proceedings of the ACM on Measurement and Analysis of Computing Systems 1.2 (2017), pp. 1–25. [30] Yujing Chen, Yue Ning, Martin Slawski, and Huzefa Rangwala. “Asynchronous online federated learning for edge devices with non-iid data”. In: 2020 IEEE International Conference on Big Data (Big Data). IEEE. 2020, pp. 15–24. [31] George F Coulouris, Jean Dollimore, and Tim Kindberg. Distributed systems: concepts and design. pearson education, 2005. [32] Thomas M Cover and Joy A Thomas.
Elements of Information Theory. John Wiley & Sons, 2012. [33] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). USA: Wiley-Interscience, 2006. isbn: 0471241954. [34] Ronald Cramer, Ivan Damgård, and Yuval Ishai. “Share conversion, pseudorandom secret-sharing and applications to secure computation”. In: Theory of Cryptography Conference. Springer. 2005, pp. 342–362. [35] Morten Dahl, Jason Mancuso, Yann Dupis, Ben Decoste, Morgan Giraud, Ian Livingstone, Justin Patriquin, and Gavin Uhma. “Private Machine Learning in TensorFlow using Secure Computation”. In: arXiv:1810.08130 (2018). [36] Lisandro Dalcín, Rodrigo Paz, and Mario Storti. “MPI for Python”. In: Journal of Parallel and Distributed Computing 65.9 (2005), pp. 1108–1115. [37] Ivan Damgård and Jesper Buus Nielsen. “Scalable and unconditionally secure multiparty computation”. In: International Cryptology Conf. Springer. 2007, pp. 572–590. [38] Ivan Damgård, Valerio Pastro, Nigel Smart, and Sarah Zakarias. “Multiparty computation from somewhat homomorphic encryption”. In: Annual Cryptology Conf. Springer. 2012, pp. 643–662. [39] Whitfield Diffie and Martin Hellman. “New directions in cryptography”. In: IEEE Trans. on Inf. Theory 22.6 (1976), pp. 644–654. [40] Marten van Dijk, Nhuong V Nguyen, Toan N Nguyen, Lam M Nguyen, Quoc Tran-Dinh, and Phuong Ha Nguyen. “Asynchronous Federated Learning with Reduced Number of Rounds and with Differential Privacy from Less Aggregated Gaussian Noise”. In: arXiv preprint arXiv:2007.09208 (2020). [41] Danny Dolev and H. Raymond Strong. “Authenticated algorithms for Byzantine agreement”. In: SIAM Journal on Computing 12.4 (1983), pp. 656–666. [42] Cynthia Dwork. “Differential privacy: A survey of results”. In: International conference on theory and applications of models of computation. Springer. 2008, pp. 1–19. [43] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith.
“Calibrating noise to sensitivity in private data analysis”. In: Theory of Crypto. Conf. Springer. 2006, pp. 265–284. [44] Hubert Eichner, Tomer Koren, H Brendan McMahan, Nathan Srebro, and Kunal Talwar. “Semi-Cyclic Stochastic Gradient Descent”. In: arXiv preprint arXiv:1904.10120 (2019). [45] Ahmed Roushdy Elkordy and A Salman Avestimehr. “Secure aggregation with heterogeneous quantization in federated learning”. In: arXiv preprint arXiv:2009.14388 (2020). [46] Ahmed Roushdy Elkordy, Saurav Prakash, and A Salman Avestimehr. “Basil: A Fast and Byzantine-Resilient Approach for Decentralized Training”. In: arXiv preprint arXiv:2109.07706 (2021). [47] David Evans, Vladimir Kolesnikov, Mike Rosulek, et al. “A pragmatic introduction to secure multi-party computation”. In: Foundations and Trends in Priv. and Sec. 2.2-3 (2018), pp. 70–246. [48] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. “Personalized Federated Learning with Theoretical Guarantees: A Model-Agnostic Meta-Learning Approach”. In: Advances in Neural Information Processing Systems 33 (2020). [49] Minghong Fang, Xiaoyu Cao, Jinyuan Jia, and Neil Zhenqiang Gong. “Local model poisoning attacks to Byzantine-robust federated learning”. In: arXiv preprint arXiv:1911.11815 (2019). [50] Paul Feldman. “A practical scheme for non-interactive verifiable secret sharing”. In: 28th Annual Symposium on Foundations of Computer Science. IEEE. 1987, pp. 427–438. [51] Matthew Franklin and Moti Yung. “Communication complexity of secure computation”. In: Proceedings of the twenty-fourth annual ACM symposium on Theory of computing. ACM. 1992, pp. 699–710. [52] Shuhong Gao. “A new algorithm for decoding Reed-Solomon codes”. In: Communications, information and network security. Springer, 2003, pp. 55–68. [53] Adrià Gascón, Phillipp Schoppmann, Borja Balle, Mariana Raykova, Jack Doerner, Samee Zahur, and David Evans. “Privacy-preserving distributed linear regression on high-dimensional data”. In: Priv.
Enhancing Technologies. 4. 2017, pp. 345–364. [54] Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, and Michael Moeller. “Inverting Gradients - How easy is it to break privacy in federated learning?” In: Advances in Neural Information Processing Systems. Ed. by H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin. Vol. 33. Curran Associates, Inc., 2020, pp. 16937–16947. url: https://proceedings.neurips.cc/paper/2020/file/c4ede56bbd98819ae6112b20ac6bf145-Paper.pdf. [55] Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, and Michael Moeller. “Inverting Gradients – How easy is it to break privacy in federated learning?” In: arXiv preprint arXiv:2003.14053 (2020). [56] Craig Gentry and Dan Boneh. A fully homomorphic encryption scheme. Vol. 20. 09. Stanford University, Stanford, 2009. [57] Robin C Geyer, Tassilo Klein, and Moin Nabi. “Differentially private federated learning: A client level perspective”. In: arXiv preprint arXiv:1712.07557 (2017). [58] Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. “Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy”. In: Int. Conf. on Mach. Learning. 2016, pp. 201–210. [59] Thore Graepel, Kristin Lauter, and Michael Naehrig. “ML confidential: Machine learning on encrypted data”. In: Int. Conf. on Inf. Sec. and Crypto. 2012, pp. 1–21. [60] gRPC: A high performance, open source universal RPC framework. https://grpc.io/. 2021. [61] Isabelle Guyon, Steve Gunn, Asa Ben-Hur, and Gideon Dror. “Result Analysis of the NIPS 2003 Feature Selection Challenge”. In: Advances in Neural Inf. Processing Systems. 2005, pp. 545–552. [62] Shai Halevi, Yehuda Lindell, and Benny Pinkas. “Secure computation on the web: Computing without simultaneous interaction”. In: Annual Cryptology Conf. Springer. 2011, pp. 132–150. [63] Kyoohyung Han, Seungwan Hong, Jung Hee Cheon, and Daejun Park. “Logistic Regression on Homomorphic Encrypted Data at Scale”.
In: Annual Conf. on Innovative Applications of Artificial Intelligence. 2019. [64] Chaoyang He, Salman Avestimehr, and Murali Annavaram. “Group knowledge transfer: Collaborative training of large cnns on the edge”. In: Advances in Neural Information Processing Systems (2020). [65] Chaoyang He, Songze Li, Jinhyun So, Mi Zhang, Hongyi Wang, Xiaoyang Wang, Praneeth Vepakomma, Abhishek Singh, Hang Qiu, Li Shen, et al. “Fedml: A research library and benchmark for federated machine learning”. In: arXiv preprint arXiv:2007.13518 (2020). [66] Chaoyang He, Conghui Tan, Hanlin Tang, Shuang Qiu, and Ji Liu. “Central server free federated learning over single-sided trust social networks”. In: arXiv preprint arXiv:1910.04956 (2019). [67] Chaoyang He, Zhengyu Yang, Erum Mushtaq, Sunwoo Lee, Mahdi Soltanolkotabi, and Salman Avestimehr. “SSFL: Tackling Label Deficiency in Federated Learning via Personalized Self-Supervision”. In: arXiv preprint arXiv:2110.02470 (2021). [68] Lie He, An Bian, and Martin Jaggi. “Cola: Decentralized linear learning”. In: Advances in Neural Information Processing Systems. 2018, pp. 4536–4546. [69] Lie He, Sai Praneeth Karimireddy, and Martin Jaggi. “Secure Byzantine-Robust Machine Learning”. In: arXiv preprint arXiv:2006.04747 (2020). [70] Ehsan Hesamifard, Hassan Takabi, and Mehdi Ghasemi. “CryptoDL: Deep Neural Networks over Encrypted Data”. In: arXiv:1711.05189 (2017). [71] Wassily Hoeffding. “Probability Inequalities for Sums of Bounded Random Variables”. In: Journal of the American Statistical Association 58.301 (1963), pp. 13–30. issn: 01621459. [72] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. “Searching for mobilenetv3”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 1314–1324. [73] Bargav Jayaraman, Lingxiao Wang, David Evans, and Quanquan Gu. 
“Distributed Learning without Distress: Privacy-Preserving Empirical Risk Minimization”. In: Advances in Neural Information Processing Systems. 2018, pp. 6346–6357. [74] Swanand Kadhe, Nived Rajaraman, O Ozan Koyluoglu, and Kannan Ramchandran. “FastSecAgg: Scalable Secure Aggregation for Privacy-Preserving Federated Learning”. In: arXiv preprint arXiv:2009.11248 (2020). [75] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. “Advances and open problems in federated learning”. In: arXiv preprint arXiv:1912.04977 (2019). [76] Sai Praneeth Karimireddy, Lie He, and Martin Jaggi. “Learning from history for byzantine robust optimization”. In: International Conference on Machine Learning. PMLR. 2021, pp. 5311–5319. [77] Kiran S Kedlaya and Christopher Umans. “Fast polynomial factorization and modular composition”. In: SIAM Journal on Computing 40.6 (2011), pp. 1767–1802. [78] Andrey Kim, Yongsoo Song, Miran Kim, Keewoo Lee, and Jung Hee Cheon. “Logistic regression model training based on the approximate homomorphic encryption”. In: BMC Medical Genomics 11.4 (Oct. 2018), pp. 23–55. [79] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9. 2015. url: http://arxiv.org/abs/1412.6980. [80] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtarik, Ananda Theertha Suresh, and Dave Bacon. “Federated Learning: Strategies for Improving Communication Efficiency”. In: Conference on Neural Information Processing Systems: Workshop on Private Multi-Party Machine Learning. 2016. [81] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Tech. rep. Citeseer, 2009. [82] Anusha Lalitha, Osman Cihan Kilinc, Tara Javidi, and Farinaz Koushanfar. “Peer-to-peer federated learning on graphs”.
In: arXiv preprint arXiv:1901.11173 (2019). [83] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. “Gradient-based learning applied to document recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324. [84] Yann LeCun, Corinna Cortes, and CJ Burges. “MNIST handwritten digit database”. In: [Online]. Available: http://yann. lecun. com/exdb/mnist 2 (2010). [85] Iraklis Leontiadis, Kaoutar Elkhiyaoui, and Refik Molva. “Private and dynamic time-series data aggregation with trust relaxation”. In: International Conference on Cryptology and Network Security. Springer. 2014, pp. 305–320. [86] Ping Li, Jin Li, Zhengan Huang, Chong-Zhi Gao, Wen-Bin Chen, and Kai Chen. “Privacy-preserving outsourced classification in cloud computing”. In: Cluster Computing (2017), pp. 1–10. [87] Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. “Ditto: Fair and robust federated learning through personalization”. In: arXiv:2012.04221 (2020). [88] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. “Federated Learning: Challenges, Methods, and Future Directions”. In: IEEE Signal Processing Magazine 37.3 (2020), pp. 50–60. [89] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. “Federated learning: Challenges, methods, and future directions”. In: IEEE Signal Processing Magazine 37.3 (2020), pp. 50–60. [90] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. “Federated optimization in heterogeneous networks”. In: arXiv preprint arXiv:1812.06127 (2018). [91] Tian Li, Maziar Sanjabi, and Virginia Smith. “Fair resource allocation in federated learning”. In: arXiv preprint arXiv:1905.10497 (2019). [92] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. “On the convergence of fedavg on non-iid data”. In: arXiv preprint arXiv:1907.02189 (2019). [93] Youjie Li, Mingchao Yu, Songze Li, Salman Avestimehr, Nam Sung Kim, and Alexander Schwing.
“Pipe-sgd: A decentralized pipelined sgd framework for distributed deep net training”. In: Advances in Neural Information Processing Systems. 2018, pp. 8045–8056. [94] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent”. In: Advances in Neural Information Processing Systems. 2017, pp. 5330–5340. [95] Yehuda Lindell and Benny Pinkas. “Privacy preserving data mining”. In: Annual International Cryptology Conference. Springer. 2000, pp. 36–54. [96] Lumin Liu, Jun Zhang, SH Song, and Khaled B Letaief. “Edge-Assisted Hierarchical Federated Learning with Non-IID Data”. In: arXiv preprint arXiv:1905.06641 (2019). [97] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. “Communication-Efficient Learning of Deep Networks from Decentralized Data”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. Vol. 54. Proceedings of Machine Learning Research. Fort Lauderdale, FL, USA, Apr. 2017, pp. 1273–1282. [98] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. “Communication-efficient learning of deep networks from decentralized data”. In: Artificial Intelligence and Statistics. PMLR. 2017, pp. 1273–1282. [99] H Brendan McMahan et al. “Advances and open problems in federated learning”. In: Foundations and Trends® in Machine Learning 14.1 (2021). [100] H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. “Communication-efficient learning of deep networks from decentralized data”. In: Int. Conf. on Artificial Int. and Stat. (AISTATS). 2017, pp. 1273–1282. [101] H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. “Learning differentially private recurrent language models”. In: Int. Conf. on Learning Representations (ICLR) (2018). [102] H.
Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. “Learning Differentially Private Recurrent Language Models”. In: Int. Conf. on Lear. Repr. 2018. [103] Dimitar Minovski, Niclas Ogren, Christer Ahlund, and Karan Mitra. “Throughput prediction using machine learning in lte and 5g networks”. In: IEEE Transactions on Mobile Computing (2021). [104] Payman Mohassel and Peter Rindal. “ABY 3: A mixed protocol framework for machine learning”. In: ACM Conf. on Comp. and Comm. Security. 2018, pp. 35–52. [105] Payman Mohassel and Yupeng Zhang. “SecureML: A system for scalable privacy-preserving machine learning”. In: IEEE Symp. on Sec. and Privacy. 2017, pp. 19–38. [106] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. “Agnostic federated learning”. In: arXiv preprint arXiv:1902.00146 (2019). [107] Erum Mushtaq, Chaoyang He, Jie Ding, and Salman Avestimehr. “SPIDER: Searching Personalized Neural Architecture for Federated Learning”. In: arXiv preprint arXiv:2112.13939 (2021). [108] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. 1st ed. Springer Publishing Company, Incorporated, 2014. isbn: 1461346916. [109] John Nguyen, Kshitiz Malik, Hongyuan Zhan, Ashkan Yousefpour, Michael Rabbat, Mani Malek Esmaeili, and Dzmitry Huba. “Federated Learning with Buffered Asynchronous Aggregation”. In: arXiv preprint arXiv:2106.06639 (2021). [110] Valeria Nikolaenko, Udi Weinsberg, Stratis Ioannidis, Marc Joye, Dan Boneh, and Nina Taft. “Privacy-preserving ridge regression on hundreds of millions of records”. In: IEEE Symposium on Security and Privacy. 2013, pp. 334–348. [111] Open MPI: Open Source High Performance Computing. https://www.open-mpi.org/. [112] Manas Pathak, Shantanu Rane, and Bhiksha Raj. “Multiparty differential privacy via aggregation of locally trained classifiers”. In: Advances in Neural Information Processing Systems. 2010, pp. 1876–1884. [113] Allan Pinkus. “Weierstrass and approximation theory”.
In: Journal of Approximation Theory 107.1 (2000), pp. 1–66. [114] PyTorch RPC: Distributed Deep Learning Built on Tensor-Optimized Remote Procedure Calls. https://pytorch.org/docs/stable/rpc.html. 2021. [115] Arun Rajkumar and Shivani Agarwal. “A Differentially Private Stochastic Gradient Descent Algorithm for Multiparty Classification”. In: Int. Conf. on Artificial Intelligence and Statistics (AISTATS’12). 2012, pp. 933–941. [116] Vibhor Rastogi and Suman Nath. “Differentially private aggregation of distributed time-series with transformation and encryption”. In: ACM SIGMOD Int. Conf. on Management of data. 2010, pp. 735–746. [117] Kaveh Razavi, Ben Gras, Erik Bosman, Bart Preneel, Cristiano Giuffrida, and Herbert Bos. “Flip feng shui: Hammering a needle in the software stack”. In: 25th USENIX Security Symposium. 2016, pp. 1–18. [118] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný, Sanjiv Kumar, and H Brendan McMahan. “Adaptive federated optimization”. In: arXiv preprint arXiv:2003.00295 (2020). [119] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, and Ramtin Pedarsani. “Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization”. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2020, pp. 2021–2031. [120] Thomas Ristenpart, Eran Tromer, Hovav Shacham, and Stefan Savage. “Hey, you, get off of my cloud: exploring information leakage in third-party compute clouds”. In: ACM Conf. on Comp. and Comm. Security. 2009, pp. 199–212. [121] Ron M Roth and Abraham Lempel. “On MDS codes via Cauchy matrices”. In: IEEE transactions on information theory 35.6 (1989), pp. 1314–1319. [122] Joel Scheuner and Philipp Leitner. “A cloud benchmark suite combining micro and applications benchmarks”. In: Companion of the 2018 ACM/SPEC International Conference on Performance Engineering. 2018, pp. 161–166. [123] Adi Shamir. “How to share a secret”.
In: Communications of the ACM 22.11 (1979), pp. 612–613. [124] Nir Shlezinger, Mingzhe Chen, Yonina C Eldar, H Vincent Poor, and Shuguang Cui. “UVeQFed: Universal vector quantization for federated learning”. In: IEEE Transactions on Signal Processing 69 (2020), pp. 500–514. [125] Reza Shokri and Vitaly Shmatikov. “Privacy-preserving deep learning”. In: ACM Conf. on Comp. and Comm. Security. 2015, pp. 1310–1321. [126] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. “Federated multi-task learning”. In: Advances in Neural Information Processing Systems. 2017, pp. 4424–4434. [127] Jinhyun So, Ramy E Ali, Basak Guler, Jiantao Jiao, and Salman Avestimehr. “Securing Secure Aggregation: Mitigating Multi-Round Privacy Leakage in Federated Learning”. In: arXiv preprint arXiv:2106.03328 (2021). [128] Jinhyun So, Basak Guler, and A Salman Avestimehr. “A Scalable Approach for Privacy-Preserving Collaborative Machine Learning”. In: Advances in Neural Information Processing Systems. 2020. [129] Jinhyun So, Basak Guler, and A Salman Avestimehr. “Turbo-Aggregate: Breaking the Quadratic Aggregation Barrier in Secure Federated Learning”. In: arXiv preprint arXiv:2002.04156 (2020). [130] Jinhyun So, Basak Guler, A Salman Avestimehr, and Payman Mohassel. “Codedprivateml: A fast and privacy-preserving framework for distributed machine learning”. In: arXiv preprint arXiv:1902.00641 (2019). [131] Jinhyun So, Başak Güler, and A Salman Avestimehr. “Byzantine-resilient secure federated learning”. In: IEEE Journal on Selected Areas in Communications (2020). [132] Jinhyun So, Başak Güler, and A Salman Avestimehr. “Byzantine-resilient secure federated learning”. In: IEEE Journal on Selected Areas in Communications 39.7 (2021), pp. 2168–2181. [133] Jinhyun So, Başak Güler, and A Salman Avestimehr. “Codedprivateml: A fast and privacy-preserving framework for distributed machine learning”. In: IEEE Journal on Selected Areas in Information Theory 2.1 (2021), pp. 441–451.
[134] Jinhyun So, Başak Güler, and A Salman Avestimehr. “Turbo-aggregate: Breaking the quadratic aggregation barrier in secure federated learning”. In: IEEE Journal on Selected Areas in Information Theory 2.1 (2021), pp. 479–489. [135] Mahdi Soleymani, Hessam Mahdavifar, and A Salman Avestimehr. “Analog lagrange coded computing”. In: arXiv preprint arXiv:2008.08565 (2020). [136] Ziteng Sun, Peter Kairouz, Ananda Theertha Suresh, and H Brendan McMahan. “Can You Really Backdoor Federated Learning?” In: arXiv preprint arXiv:1911.07963 (2019). [137] Ewa Syta, Philipp Jovanovic, Eleftherios Kokoris Kogias, Nicolas Gailly, Linus Gasser, Ismail Khoffi, Michael J Fischer, and Bryan Ford. “Scalable bias-resistant distributed randomness”. In: 2017 IEEE Symposium on Security and Privacy (SP). 2017, pp. 444–460. [138] Canh T. Dinh, Nguyen Tran, and Josh Nguyen. “Personalized Federated Learning with Moreau Envelopes”. In: Advances in Neural Information Processing Systems. Ed. by H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin. Vol. 33. Curran Associates, Inc., 2020, pp. 21394–21405. url: https://proceedings.neurips.cc/paper/2020/file/f4f1f13c8289ac1b1ee0ff176b56fc60-Paper.pdf. [139] Mingxing Tan and Quoc Le. “Efficientnet: Rethinking model scaling for convolutional neural networks”. In: International Conference on Machine Learning. PMLR. 2019, pp. 6105–6114. [140] Tingting Tang, Ramy E Ali, Hanieh Hashemi, Tynan Gangwani, Salman Avestimehr, and Murali Annavaram. “Verifiable coded computing: Towards fast, secure and private distributed machine learning”. In: arXiv preprint arXiv:2107.12958 (2021). [141] Stacey Truex, Ling Liu, Ka-Ho Chow, Mehmet Emre Gursoy, and Wenqi Wei. “LDP-Fed: Federated learning with local differential privacy”. In: Proceedings of the Third ACM International Workshop on Edge Systems, Analytics and Networking. 2020, pp. 61–66. [142] Venkatanathan Varadarajan, Yinqian Zhang, Thomas Ristenpart, and Michael Swift.
“A placement vulnerability study in multi-tenant public clouds”. In: 24th USENIX Security Symposium. 2015, pp. 913–928. [143] Sameer Wagh, Divya Gupta, and Nishanth Chandran. “SecureNN: 3-Party Secure Computation for Neural Network Training”. In: Proceedings on Privacy Enhancing Technologies 2019.3 (2019), pp. 26–49. [144] Sameer Wagh, Divya Gupta, and Nishanth Chandran. SecureNN: Efficient and Private Neural Network Training. Cryptology ePrint Archive, Report 2018/442. https://eprint.iacr.org/2018/442. 2018. [145] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. “Tackling the objective inconsistency problem in heterogeneous federated optimization”. In: arXiv preprint arXiv:2007.07481 (2020). [146] Q. Wang, M. Du, X. Chen, Y. Chen, P. Zhou, X. Chen, and X. Huang. “Privacy-Preserving Collaborative Model Learning: The Case of Word Vector Training”. In: IEEE Trans. on Knowledge and Data Engineering 30.12 (2018), pp. 2381–2393. [147] Z. Wang, M. Song, Z. Zhang, Y. Song, Q. Wang, and H. Qi. “Beyond Inferring Class Representatives: User-Level Privacy Leakage From Federated Learning”. In: IEEE INFOCOM 2019 - IEEE Conference on Computer Communications. 2019, pp. 2512–2520. [148] Zhibo Wang, Mengkai Song, Zhifei Zhang, Yang Song, Qian Wang, and Hairong Qi. “Beyond inferring class representatives: User-level privacy leakage from federated learning”. In: IEEE INFOCOM 2019-IEEE Conference on Computer Communications. IEEE. 2019, pp. 2512–2520. [149] Kang Wei, Jun Li, Ming Ding, Chuan Ma, Howard H Yang, Farhad Farokhi, Shi Jin, Tony QS Quek, and H Vincent Poor. “Federated learning with differential privacy: Algorithms and performance analysis”. In: IEEE Transactions on Information Forensics and Security (2020). [150] Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. “Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 
2020, pp. 2575–2584. [151] Zhenyu Wu, Zhang Xu, and Haining Wang. “Whispers in the hyper-space: high-bandwidth and reliable covert channel attacks inside the cloud”. In: IEEE/ACM Transactions on Networking 23.2 (2014), pp. 603–615. [152] Cong Xie, Sanmi Koyejo, and Indranil Gupta. “Asynchronous federated optimization”. In: arXiv preprint arXiv:1903.03934 (2019). [153] Timothy Yang, Galen Andrew, Hubert Eichner, Haicheng Sun, Wei Li, Nicholas Kong, Daniel Ramage, and Françoise Beaufays. “Applied federated learning: Improving google keyboard query suggestions”. In: arXiv preprint arXiv:1812.02903 (2018). [154] Andrew C Yao. “Protocols for secure computations”. In: IEEE Annual Symposium on Foundations of Computer Science. 1982, pp. 160–164. [155] Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. “Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates”. In: Proceedings of the 35th International Conference on Machine Learning. Vol. 80. Stockholm Sweden, Oct. 2018, pp. 5650–5659. [156] Qian Yu, Songze Li, Netanel Raviv, Seyed Mohammadreza Mousavi Kalan, Mahdi Soltanolkotabi, and A Salman Avestimehr. “Lagrange Coded Computing: Optimal Design for Resiliency, Security and Privacy”. In: Int. Conf. on Artificial Intelligence and Statistics (AISTATS). 2019. [157] Qian Yu, Songze Li, Netanel Raviv, Seyed Mohammadreza Mousavi Kalan, Mahdi Soltanolkotabi, and Salman A Avestimehr. “Lagrange coded computing: Optimal design for resiliency, security, and privacy”. In: The 22nd International Conference on Artificial Intelligence and Statistics. PMLR. 2019, pp. 1215–1225. [158] Jiawei Yuan and Shucheng Yu. “Privacy preserving back-propagation neural network learning made practical with cloud computing”. In: IEEE Transactions on Parallel and Distributed Systems 25.1 (2014), pp. 212–221. [159] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang.
“The ZipML Framework for Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning”. In: arXiv:1611.05402 (2016). [160] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. “ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning”. In: Int. Conf. on Machine Learning. Sydney, Australia, Aug. 2017, pp. 4035–4043. [161] Yinqian Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. “Cross-tenant side-channel attacks in PaaS clouds”. In: ACM Conf. on Comp. and Comm. Security. 2014, pp. 990–1003. [162] Yinqian Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. “Cross-VM side channels and their use to extract private keys”. In: ACM Conf. on Comp. and Comm. Security. ACM. 2012, pp. 305–316. [163] Yizhou Zhao and Hua Sun. “Information Theoretic Secure Aggregation with User Dropouts”. In: arXiv preprint arXiv:2101.07750 (2021). [164] Pan Zhou and Jiashi Feng. “Empirical risk landscape analysis for understanding deep neural networks”. In: International Conference on Learning Representations. 2018. [165] Ligeng Zhu and Song Han. “Deep leakage from gradients”. In: Federated Learning. Springer, 2020, pp. 17–31. [166] Ligeng Zhu, Zhijian Liu, and Song Han. “Deep Leakage from Gradients”. In: Advances in Neural Information Processing Systems 32. 2019, pp. 14774–14784.

Appendix A

Appendix of Chapter 2

A.1 Proof of Lemma 2.1

(Unbiasedness) Given \(X\), we have
\[
\mathbb{E}\big[p^{(t)}\big] = \frac{1}{m} X^\top \Big( \mathbb{E}\big[\bar{s}(X, W^{(t)})\big] - y \Big) = \frac{1}{m} X^\top \Big( \hat{s}\big(X \times w^{(t)}\big) - y \Big), \tag{A.1}
\]
where (A.1) follows from the unbiasedness of the quantization strategy, \(\mathbb{E}[\bar{s}(X, W^{(t)})] = \hat{s}(X \times w^{(t)})\), and the expectation is taken with respect to the quantization noise at iteration \(t\). Then, we obtain
\[
\mathbb{E}\big[p^{(t)}\big] - \nabla C\big(w^{(t)}\big) = \frac{1}{m} X^\top \Big( \hat{s}\big(X \times w^{(t)}\big) - s\big(X \times w^{(t)}\big) \Big). \tag{A.2}
\]
Assume \(w^{(t)}\) is constrained such that \(\|w^{(t)}\| \le B\) for some real value \(B \in \mathbb{R}\) [159, 164].
Then, from the Weierstrass approximation theorem [21, 113], for every \(\epsilon > 0\) there exists a polynomial that approximates the sigmoid arbitrarily well, i.e., \(|\hat{s}(x) - s(x)| \le \epsilon\) for all \(x\) in the constrained interval. Therefore, given \(X\), there exists a polynomial making the norm of (A.2) arbitrarily small.

(Variance bound) The variance of \(p^{(t)}\) satisfies
\[
\mathbb{E}\big\|p^{(t)} - \mathbb{E}[p^{(t)}]\big\|_2^2 = \frac{1}{m^2}\,\mathbb{E}\Big\|X^\top\big(\bar{s}(X, W^{(t)}) - \hat{s}(X \times w^{(t)})\big)\Big\|_2^2 = \frac{1}{m^2}\,\mathbb{E}\Big[\mathrm{Tr}\big(X^\top q^{(t)} q^{(t)\top} X\big)\Big] = \frac{1}{m^2}\,\mathrm{Tr}\big(X^\top \mathbb{E}\big[q^{(t)} q^{(t)\top}\big] X\big), \tag{A.3}
\]
where \(\mathrm{Tr}(\cdot)\) denotes the trace of a matrix, and we let \(q^{(t)} \triangleq \bar{s}(X, W^{(t)}) - \hat{s}(X \times w^{(t)})\). From Lemma 4 of [159], we have that
\[
\mathbb{E}\big[q_i^{(t)} q_j^{(t)}\big] \begin{cases} \le\; 2^{-2 l_w} \Big(\sum_{k=0}^{r} c_k \big(x_i w^{(t)}\big)^k\Big)^2 & \text{if } i = j, \\ =\; 0 & \text{otherwise,} \end{cases} \tag{A.4}
\]
where \(q_i^{(t)}\) denotes the \(i\)-th element of \(q^{(t)}\). Combining (A.3) and (A.4) with the fact that \(\big(\sum_{k=0}^{r} c_k (x_i w^{(t)})^k\big)^2 \approx s\big(x_i \cdot w^{(t)}\big)^2 \le 1\) for all \(i \in [m]\), we obtain
\[
\mathbb{E}\big\|p^{(t)} - \mathbb{E}[p^{(t)}]\big\|_2^2 \le \frac{2^{-2 l_w}}{m^2}\,\mathrm{Tr}\big(X^\top X\big) = \frac{2^{-2 l_w}}{m^2}\,\|X\|_F^2.
\]

A.2 Proof of Lemma 2.2

For the logistic regression cost function \(C(w)\), the Lipschitz constant \(L\) is less than or equal to the largest eigenvalue of the Hessian \(\nabla^2 C(w)\) for all \(w\), and is given by
\[
L = \frac{1}{4}\,\max \mathrm{eig}\big(X^\top X\big). \tag{A.5}
\]

A.3 Proof of Theorem 2.1

(Convergence) First, we show that the master can decode \(X^\top \bar{s}(X, W^{(t)})\) over the finite field as long as \(N \ge (2r+1)(K+T-1)+1\). As described in Section 2.3, given the polynomial approximation of the sigmoid function in (2.13), the degree of \(h(z)\) in (2.19) is at most \((2r+1)(K+T-1)\). The decoding process uses the computations from the workers as evaluation points \(h(\alpha_i)\) to interpolate the polynomial \(h(z)\). The master can obtain all of the coefficients of \(h(z)\) as long as it collects at least \(\deg h(z) + 1 \le (2r+1)(K+T-1)+1\) evaluation results \(h(\alpha_i)\). After \(h(z)\) is recovered, the master can decode the sub-gradient \(X_i^\top \bar{s}(X_i, W^{(t)})\) by computing \(h(\beta_i)\) for \(i \in [K]\). Hence, the recovery threshold is given by \((2r+1)(K+T-1)+1\) to decode \(X^\top \bar{s}(X, W^{(t)})\).
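The recovery-threshold argument above rests on a basic fact that is easy to check numerically: a polynomial of degree \(D\) over a prime field is uniquely determined by any \(D+1\) evaluations. The following is a minimal sketch of that fact, not the construction of Section 2.3; the field size, the evaluation points, and the stand-in polynomial `h` are illustrative toy choices.

```python
# Minimal sketch of the recovery-threshold argument: a polynomial h(z) of degree
# D over a prime field F_p is uniquely determined by any D + 1 evaluations, so
# with D = (2r+1)(K+T-1) the master needs N >= D + 1 worker results.
# All parameter names (K, T, r, alphas, betas) follow the text; the specific
# values and the stand-in polynomial below are illustrative only.

p = 2**13 - 1  # a small Mersenne prime defining F_p

def lagrange_eval(points, values, z):
    """Interpolate the unique polynomial through (points, values) and evaluate at z (mod p)."""
    total = 0
    for i, (xi, yi) in enumerate(zip(points, values)):
        num, den = 1, 1
        for j, xj in enumerate(points):
            if i != j:
                num = num * (z - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, -1, p)) % p
    return total

K, T, r = 3, 1, 1
D = (2 * r + 1) * (K + T - 1)   # maximum degree of h(z)
N = D + 1                       # recovery threshold

# A stand-in h(z) of degree D with arbitrary coefficients.
coeffs = [(3 * k + 5) % p for k in range(D + 1)]
h = lambda z: sum(c * pow(z, k, p) for k, c in enumerate(coeffs)) % p

alphas = list(range(1, N + 1))      # worker evaluation points
betas = list(range(100, 100 + K))   # decoding points, disjoint from alphas

results = [h(a) for a in alphas]                          # what N workers return
decoded = [lagrange_eval(alphas, results, b) for b in betas]
assert decoded == [h(b) for b in betas]                   # master recovers h(beta_i) exactly
print("recovery threshold:", N)
```

With fewer than \(N\) evaluations the interpolation is underdetermined, which is exactly why stragglers beyond \(N - \deg h(z) - 1\) cannot be tolerated.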
Next, we consider the update equation in CodedPrivateML (see (2.25)) and prove its convergence to \(w^*\). From the \(L\)-Lipschitz continuity of \(\nabla C(w)\) stated in Lemma 2.2, we have
\[
C\big(w^{(t+1)}\big) \le C\big(w^{(t)}\big) + \big\langle \nabla C(w^{(t)}),\, w^{(t+1)} - w^{(t)} \big\rangle + \frac{L}{2}\big\|w^{(t+1)} - w^{(t)}\big\|^2 = C\big(w^{(t)}\big) - \eta \big\langle \nabla C(w^{(t)}),\, p^{(t)} \big\rangle + \frac{L\eta^2}{2}\big\|p^{(t)}\big\|^2,
\]
where \(\langle \cdot, \cdot \rangle\) is the inner product [20]. By taking the expectation with respect to the quantization noise on both sides,
\begin{align*}
\mathbb{E}\big[C(w^{(t+1)})\big] &\le C(w^{(t)}) - \eta \|\nabla C(w^{(t)})\|^2 + \frac{L\eta^2}{2}\Big(\|\nabla C(w^{(t)})\|^2 + \sigma^2\Big) \\
&\le C(w^{(t)}) - \eta\Big(1 - \frac{L\eta}{2}\Big)\|\nabla C(w^{(t)})\|^2 + \frac{L\eta^2\sigma^2}{2} \\
&\le C(w^{(t)}) - \frac{\eta}{2}\|\nabla C(w^{(t)})\|^2 + \frac{\eta\sigma^2}{2} \tag{A.6}\\
&\le C(w^*) + \big\langle \nabla C(w^{(t)}),\, w^{(t)} - w^* \big\rangle - \frac{\eta}{2}\|\nabla C(w^{(t)})\|^2 + \frac{\eta\sigma^2}{2} \tag{A.7}\\
&\le C(w^*) + \big\langle \mathbb{E}[p^{(t)}],\, w^{(t)} - w^* \big\rangle - \frac{\eta}{2}\,\mathbb{E}\big\|p^{(t)}\big\|^2 + \eta\sigma^2 \tag{A.8}\\
&= C(w^*) + \eta\sigma^2 + \mathbb{E}\Big[\big\langle p^{(t)},\, w^{(t)} - w^* \big\rangle - \frac{\eta}{2}\big\|p^{(t)}\big\|^2\Big] \\
&= C(w^*) + \eta\sigma^2 + \frac{1}{2\eta}\Big(\mathbb{E}\big\|w^{(t)} - w^*\big\|^2 - \mathbb{E}\big\|w^{(t+1)} - w^*\big\|^2\Big),
\end{align*}
where (A.6) follows from \(L\eta \le 1\), (A.7) from the convexity of \(C\), and (A.8) holds since \(\mathbb{E}[p^{(t)}] = \nabla C(w^{(t)})\) and \(\mathbb{E}\|p^{(t)}\|^2 - \|\nabla C(w^{(t)})\|^2 \le \sigma^2\) from Lemma 2.1, by assuming an arbitrarily large \(r\). Summing the above equations for \(t = 0, \ldots, J-1\), we have
\[
\sum_{t=0}^{J-1} \mathbb{E}\big[C(w^{(t+1)})\big] - C(w^*) \le \frac{1}{2\eta}\Big(\mathbb{E}\big\|w^{(0)} - w^*\big\|^2 - \mathbb{E}\big\|w^{(J)} - w^*\big\|^2\Big) + J\eta\sigma^2 \le \frac{\|w^{(0)} - w^*\|^2}{2\eta} + J\eta\sigma^2.
\]
Finally, since \(C\) is convex, we observe that
\[
\mathbb{E}\Big[C\Big(\frac{1}{J}\sum_{t=1}^{J} w^{(t)}\Big)\Big] - C(w^*) \le \frac{1}{J}\sum_{t=0}^{J-1} \mathbb{E}\big[C(w^{(t+1)})\big] - C(w^*) \le \frac{\|w^{(0)} - w^*\|^2}{2\eta J} + \eta\sigma^2,
\]
which completes the proof of convergence.

(Privacy) Let \(U^{\mathrm{top}} \in \mathbb{F}_p^{K \times N}\) and \(U^{\mathrm{bottom}} \in \mathbb{F}_p^{T \times N}\) be the top and bottom submatrices, respectively, of the encoding matrix \(U\) constructed in Section 2.3. From Lemma 2 of [156], \(U^{\mathrm{bottom}}\) is an MDS matrix; therefore, every \(T \times T\) submatrix of \(U^{\mathrm{bottom}}\) is invertible. For a colluding set of workers \(\mathcal{T} \subset [N]\) of size \(T\), their received dataset satisfies
\[
\widetilde{X}_{\mathcal{T}} = X \times U^{\mathrm{top}}_{\mathcal{T}} + R \times U^{\mathrm{bottom}}_{\mathcal{T}}, \tag{A.9}
\]
where \(R = (R_{K+1}, \ldots, R_{K+T})\), and \(U^{\mathrm{top}}_{\mathcal{T}} \in \mathbb{F}_p^{K \times T}\) and \(U^{\mathrm{bottom}}_{\mathcal{T}} \in \mathbb{F}_p^{T \times T}\) are the top and bottom submatrices corresponding to the columns of \(U\) that are indexed by \(\mathcal{T}\).
All elements of $R$ are independent and uniformly distributed over the finite field $\mathbb{F}_p$. Similarly, $\widetilde{W}^{(t)}_\mathcal{T}$ can be represented as

$\widetilde{W}^{(t)}_\mathcal{T} = W^{(t)} \times U^{\mathrm{top}}_\mathcal{T} + V^{(t)} \times U^{\mathrm{bottom}}_\mathcal{T}$,  (A.10)

where $V^{(t)} = (V^{(t)}_{K+1}, \ldots, V^{(t)}_{K+T})$ and all elements of $V^{(t)}$ are independent and uniformly distributed over the finite field $\mathbb{F}_p$, for all $t \in [J]$. We first show that $I(\widetilde{X}_\mathcal{T}; X) = 0$ as follows:

$I(\widetilde{X}_\mathcal{T}; X) = H(\widetilde{X}_\mathcal{T}) - H(\widetilde{X}_\mathcal{T} \mid X)$
$= H(\widetilde{X}_\mathcal{T}) - H(X U^{\mathrm{top}}_\mathcal{T} + R U^{\mathrm{bottom}}_\mathcal{T} \mid X)$  (A.11)
$= H(\widetilde{X}_\mathcal{T}) - H(R U^{\mathrm{bottom}}_\mathcal{T} \mid X)$  (A.12)
$= H(\widetilde{X}_\mathcal{T}) - H(R)$  (A.13)
$= 0$,  (A.14)

where (A.11) follows from (A.9), and (A.12) holds since we can drop $X U^{\mathrm{top}}_\mathcal{T}$ given $X$, as the former is a deterministic function of the latter. Equation (A.13) holds since $R$ and $X$ are independent and $U^{\mathrm{bottom}}_\mathcal{T}$ is invertible. Equation (A.14) holds from the observation that $R$ is a uniformly distributed random matrix, hence it has the maximum entropy in the finite field $\mathbb{F}_p$, combined with the fact that mutual information is always non-negative.

Next, we prove $I\big(X; \widetilde{X}_\mathcal{T}, \{\widetilde{W}^{(t)}_\mathcal{T}\}_{t \in [J]}\big) = 0$. We first obtain

$I\big(X; \widetilde{X}_\mathcal{T}, \{\widetilde{W}^{(t)}_\mathcal{T}\}_{t \in [J]}\big) = I(X; \widetilde{X}_\mathcal{T}) + I\big(X; \{\widetilde{W}^{(t)}_\mathcal{T}\}_{t \in [J]} \mid \widetilde{X}_\mathcal{T}\big)$
$= I\big(X; \{\widetilde{W}^{(t)}_\mathcal{T}\}_{t \in [J]} \mid \widetilde{X}_\mathcal{T}\big)$  (A.15)
$= H\big(\{\widetilde{W}^{(t)}_\mathcal{T}\}_{t \in [J]} \mid \widetilde{X}_\mathcal{T}\big) - H\big(\{\widetilde{W}^{(t)}_\mathcal{T}\}_{t \in [J]} \mid X, \widetilde{X}_\mathcal{T}\big)$
$= \sum_{j=1}^{J}\Big(H\big(\widetilde{W}^{(j)}_\mathcal{T} \mid \{\widetilde{W}^{(t)}_\mathcal{T}\}_{t \in [j-1]}, \widetilde{X}_\mathcal{T}\big) - H\big(\widetilde{W}^{(j)}_\mathcal{T} \mid \{\widetilde{W}^{(t)}_\mathcal{T}\}_{t \in [j-1]}, X, \widetilde{X}_\mathcal{T}\big)\Big)$,  (A.16)

where (A.15) follows from (A.14), and (A.16) from the chain rule of entropy. For the second term of (A.16), we derive

$H\big(\widetilde{W}^{(j)}_\mathcal{T} \mid \{\widetilde{W}^{(t)}_\mathcal{T}\}_{t \in [j-1]}, X, \widetilde{X}_\mathcal{T}\big) \ge H\big(\widetilde{W}^{(j)}_\mathcal{T} \mid W^{(j)}, \{\widetilde{W}^{(t)}_\mathcal{T}\}_{t \in [j-1]}, X, \widetilde{X}_\mathcal{T}\big)$  (A.17)
$= H\big(\widetilde{W}^{(j)}_\mathcal{T} \mid W^{(j)}\big)$  (A.18)
$= H\big(W^{(j)} U^{\mathrm{top}}_\mathcal{T} + V^{(j)} U^{\mathrm{bottom}}_\mathcal{T} \mid W^{(j)}\big)$  (A.19)
$= H(V^{(j)})$,  (A.20)

where (A.17) holds since conditioning cannot increase entropy. Equation (A.18) holds since $\widetilde{W}^{(j)}_\mathcal{T}$ and $\big(\{\widetilde{W}^{(t)}_\mathcal{T}\}_{t \in [j-1]}, X, \widetilde{X}_\mathcal{T}\big)$ are conditionally independent given $W^{(j)}$. Equation (A.19) follows from (A.10), and (A.20) follows by the same steps as (A.11)-(A.13).
From (A.16) and (A.20), we obtain

$I\big(X; \widetilde{X}_\mathcal{T}, \{\widetilde{W}^{(t)}_\mathcal{T}\}_{t \in [J]}\big) \le \sum_{j=1}^{J}\Big(H\big(\widetilde{W}^{(j)}_\mathcal{T} \mid \{\widetilde{W}^{(t)}_\mathcal{T}\}_{t \in [j-1]}, \widetilde{X}_\mathcal{T}\big) - H(V^{(j)})\Big) = 0$,  (A.21)

where (A.21) holds since $V^{(j)}$ is a uniformly distributed random matrix and therefore has the maximum entropy in the finite field $\mathbb{F}_p$, i.e., $H(V^{(j)}) \ge H\big(\widetilde{W}^{(j)}_\mathcal{T} \mid \{\widetilde{W}^{(t)}_\mathcal{T}\}_{t \in [j-1]}, \widetilde{X}_\mathcal{T}\big)$. Combined with the data-processing inequality [32] and the non-negativity of mutual information,

$I\big(X; \widetilde{X}_\mathcal{T}, \{\widetilde{W}^{(t)}_\mathcal{T}\}_{t \in [J]}\big) \ge 0$.  (A.22)

Therefore, $I\big(X; \widetilde{X}_\mathcal{T}, \{\widetilde{W}^{(t)}_\mathcal{T}\}_{t \in [J]}\big) = 0$ and the original dataset remains information-theoretically private against $T$ colluding workers.

A.4 Details of the Multi-Party Computation (MPC) Implementation

Our benchmarks are based on two well-known MPC protocols: the notable BGW protocol from [10], and the more recent MPC protocol from [6, 37]. Both protocols allow untrusted workers to compute polynomial functions, which consist of addition and multiplication operations, in a privacy-preserving manner. At the end of the computation, any collusion between $T$ out of $N$ workers reveals no information (in an information-theoretic sense) about the input variables, while workers only learn a secret share of the actual result. The former protocol is more communication-intensive than the latter, as it incurs a communication cost that is quadratic in the number of workers. The latter protocol's communication cost scales linearly with the number of workers; as a trade-off, it requires a significant amount of offline computation as well as storage at each worker. For constructing the secret shares, we use Shamir's secret sharing protocol [123], which protects the privacy of secret variables against any collusion between up to $T$ workers.
This is done by embedding a given secret $a$ in a degree-$T$ polynomial $h(\xi) = a + \xi r_1 + \cdots + \xi^T r_T$, where $r_i$, $i \in [T]$, are uniformly random variables. The secret share of $a$ at worker $i \in [N]$ is represented by $h(i) = [a]_i$. Then, addition and multiplication operations are computed as follows.

Addition. To compute a secure addition $a + b$, workers locally add their secret shares $[a]_i + [b]_i$ and perform a modulo operation.

Multiplication. To compute a secure multiplication $ab$, the two protocols differ in their approaches. In the BGW protocol from [10], each worker first multiplies its secret shares $[a]_i$ and $[b]_i$ locally. The resulting value $[a]_i [b]_i$ is a secret share of $ab$; however, the corresponding polynomial has degree $2T$, twice the degree of the original polynomial. This causes the degree to grow excessively as more multiplication gates are executed. To alleviate this problem, workers perform a degree-reduction step by creating new shares corresponding to a polynomial of degree $T$, reducing the degree from $2T$. The communication overhead of this protocol is $O(N^2)$.

The protocol from [6, 37] utilizes offline computations to reduce the communication overhead. In the offline phase, this protocol creates a random variable $\rho$ and two secret shares corresponding to random polynomials of degree $T$ and $2T$, denoted by $[\rho]_{T,i}$ and $[\rho]_{2T,i}$, respectively, for worker $i \in [N]$. In the online phase, worker $i \in [N]$ locally multiplies $[a]_i$ with $[b]_i$. Each worker now holds a secret share of $ab$; however, the corresponding polynomial has degree $2T$. Worker $i \in [N]$ then locally computes $[a]_i [b]_i - [\rho]_{2T,i}$. Next, workers broadcast their local computations to the others, after which each worker decodes $ab - \rho$. Note that the privacy of $ab$ is still protected, since it is masked by the random value $\rho$. Finally, each worker locally computes $ab - \rho + [\rho]_{T,i}$. As a result, $\rho$ cancels out and workers obtain a secret share of $ab$ embedded in a degree-$T$ polynomial.
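The addition and $\rho$-based multiplication steps above can be sketched in a few lines. This is an illustration rather than the benchmark implementation: the field size, $N$, $T$, and the secrets are toy values.

```python
# Illustrative sketch of Shamir secret sharing over a prime field, with the
# local-addition and rho-based multiplication steps described above.
import random

P = 2**31 - 1  # prime field (toy choice)

def share(secret, t, n):
    """Embed `secret` in a degree-t polynomial h and return shares h(1..n)."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t)]
    return [sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P
            for x in range(1, n + 1)]

def reconstruct(points):
    """Lagrange-interpolate h(0) from (x, h(x)) pairs."""
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

T, N = 1, 5
a, b = 12, 34
sa, sb = share(a, T, N), share(b, T, N)

# Addition: purely local.
s_sum = [(x + y) % P for x, y in zip(sa, sb)]
assert reconstruct(list(enumerate(s_sum, 1))[: T + 1]) == a + b

# Multiplication with an offline random rho, shared at degrees T and 2T.
rho = random.randrange(P)
r_T, r_2T = share(rho, T, N), share(rho, 2 * T, N)
masked = [(x * y - r) % P for x, y, r in zip(sa, sb, r_2T)]          # degree 2T
ab_minus_rho = reconstruct(list(enumerate(masked, 1))[: 2 * T + 1])  # broadcast + decode
s_ab = [(ab_minus_rho + r) % P for r in r_T]                         # degree T again
assert reconstruct(list(enumerate(s_ab, 1))[: T + 1]) == (a * b) % P
```

Note how the degree-$2T$ product shares require $2T + 1$ points to decode, which is why the degree must be brought back to $T$ before further multiplications.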
The communication overhead of this protocol is $O(N)$. For the details, we refer to [5].

Appendix B

Appendix of Chapter 3

B.1 Details of the Quantization Phase

For quantizing its dataset $X_j$, client $j \in [N]$ employs a scalar quantization function $\phi\big(\mathrm{Round}(2^{l_x} \cdot X_j)\big)$, where the rounding operation

$\mathrm{Round}(x) = \lfloor x \rfloor$ if $x - \lfloor x \rfloor < 0.5$, and $\lfloor x \rfloor + 1$ otherwise,  (B.1)

is applied element-wise to the elements $x$ of matrix $X_j$, and $l_x$ is an integer parameter to control the quantization loss. $\lfloor x \rfloor$ is the largest integer less than or equal to $x$, and the function $\phi : \mathbb{Z} \to \mathbb{F}_p$ is a mapping defined to represent a negative integer in the finite field by using the two's complement representation,

$\phi(x) = x$ if $x \ge 0$, and $p + x$ if $x < 0$.  (B.2)

To avoid a wrap-around, which may lead to an overflow error, the prime $p$ should be large enough: $p \ge 2^{l_x + 1}\max\{|x|\} + 1$. Its value also depends on the bitwidth of the machine as well as the dimension of the dataset. For example, in a 64-bit implementation with the CIFAR-10 dataset, whose dimension is $d = 3072$, we select $p = 2^{26} - 5$, which is the largest prime that avoids an overflow in the intermediate multiplications. In particular, to speed up matrix-matrix multiplication, we apply the modular reduction once after each inner product of vectors, instead of after every element-wise product. To avoid an overflow here, $p$ should satisfy $d(p-1)^2 \le 2^{64} - 1$. For ease of exposition, throughout the paper, $X = [X_1^\top, \ldots, X_N^\top]^\top$ refers to the quantized dataset.

B.2 Proof of Theorem 3.1

First, we show that the minimum number of clients needed for our decoding operation to be successful, i.e., the recovery threshold of COPML, is equal to $(2r+1)(K+T-1)+1$. To do so, we demonstrate in the following that the decoding process is successful as long as $N \ge (2r+1)(K+T-1)+1$.
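The quantization map $\phi(\mathrm{Round}(2^{l_x} \cdot X))$ in (B.1)-(B.2) and the inner-product overflow condition can be sketched as follows. The choice $l_x = 8$ is illustrative, while $p$ and $d$ follow the CIFAR-10 example above; this is not the authors' implementation.

```python
# Minimal sketch of the quantization map phi(Round(2^lx * X)) from (B.1)-(B.2).
import numpy as np

p = 2**26 - 5   # field prime from the CIFAR-10 example
lx = 8          # quantization parameter (illustrative choice)

def quantize(X):
    scaled = np.floor(2**lx * X + 0.5).astype(np.int64)  # Round(.) of (B.1)
    return np.mod(scaled, p)                             # phi(.) of (B.2)

def dequantize(Xq):
    signed = np.where(Xq > p // 2, Xq - p, Xq)           # undo two's complement
    return signed / 2**lx

X = np.array([[0.5, -0.25], [1.0, -1.0]])
assert np.allclose(dequantize(quantize(X)), X)

# Overflow condition for deferring the modulo to after a length-d inner product:
d = 3072
assert d * (p - 1)**2 <= 2**64 - 1
```

The `np.mod` reduction implements $\phi$ directly, since for a negative integer $x$ with $|x| < p$ it returns $p + x$.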
As described in Section 3.3, given the polynomial approximation of the sigmoid function in (3.5), the degree of $h(z)$ in (3.8) is at most $(2r+1)(K+T-1)$. The decoding process uses the computations from the clients as evaluation points $h(\alpha_i)$ to interpolate the polynomial $h(z)$. If at least $\deg(h(z)) + 1$ evaluation results $h(\alpha_i)$ are available, then all of the coefficients of $h(z)$ can be recovered. After $h(z)$ is recovered, the sub-gradient $X_i^\top \hat{g}(X_i \times w^{(t)})$ can be decoded by computing $h(\beta_i)$ for $i \in [K]$, from which the gradient $X^\top \hat{g}(X \times w^{(t)})$ in (3.11) can be computed. Hence, the recovery threshold of COPML is $(2r+1)(K+T-1)+1$: as long as $N \ge (2r+1)(K+T-1)+1$, the protocol can correctly decode the gradient using the local evaluations of the clients, and the decoding process is successful. Since the decoding operations are performed using a secure MPC protocol, throughout the decoding process the clients only learn a secret share of the gradient and not its actual value. Next, we consider the update equation in (3.6) and prove its convergence to $w^*$. As described in Section 3.3, after decoding the gradient, the clients carry out a secure truncation protocol to multiply $X^\top(\hat{g}(X \times w^{(t)}) - y)$ with the parameter $\frac{\eta}{m}$ to update the model as in (3.6). The update equation from (3.6) can then be represented by

$w^{(t+1)} = w^{(t)} - \eta\big(\frac{1}{m} X^\top(\hat{g}(X \times w^{(t)}) - y) + n^{(t)}\big)$  (B.3)
$= w^{(t)} - \eta p^{(t)}$,  (B.4)

where $n^{(t)}$ represents the quantization noise introduced by the secure multi-party truncation protocol [24], and $p^{(t)} \triangleq \frac{1}{m} X^\top(\hat{g}(X \times w^{(t)}) - y) + n^{(t)}$. From [24], $n^{(t)}$ has zero mean and bounded variance, i.e., $\mathbb{E}_{n^{(t)}}[n^{(t)}] = 0$ and $\mathbb{E}_{n^{(t)}}\|n^{(t)}\|_2^2 \le \frac{d \, 2^{2(k_1 - 1)}}{m^2} \triangleq \sigma^2$, where $\|\cdot\|_2$ is the $l_2$ norm and $k_1$ is the truncation parameter described in Section 3.3. Next, we show that $p^{(t)}$ is an unbiased estimator of the true gradient $\nabla C(w^{(t)}) = \frac{1}{m} X^\top(g(X \times w^{(t)}) - y)$, and that its variance is bounded by $\sigma^2$ for sufficiently large $r$.
From $\mathbb{E}_{n^{(t)}}[n^{(t)}] = 0$, we obtain

$\mathbb{E}_{n^{(t)}}[p^{(t)}] - \nabla C(w^{(t)}) = \frac{1}{m} X^\top\big(\hat{g}(X \times w^{(t)}) - g(X \times w^{(t)})\big)$.  (B.5)

From the Weierstrass approximation theorem [21], for any $\epsilon > 0$ there exists a polynomial that approximates the sigmoid arbitrarily well, i.e., $|\hat{g}(x) - g(x)| \le \epsilon$ for all $x$ in the constrained interval. Hence, as there exists a polynomial making the norm of (B.5) arbitrarily small, $\mathbb{E}_{n^{(t)}}[p^{(t)}] = \nabla C(w^{(t)})$ and $\mathbb{E}_{n^{(t)}}\|p^{(t)} - \mathbb{E}_{n^{(t)}}[p^{(t)}]\|_2^2 = \mathbb{E}_{n^{(t)}}\|n^{(t)}\|_2^2 \le \sigma^2$.

Next, we consider the update equation in (B.4) and prove its convergence to $w^*$. From the $L$-Lipschitz continuity of $\nabla C(w)$ (Theorem 2.1.5 of [108]), we have

$C(w^{(t+1)}) \le C(w^{(t)}) + \langle \nabla C(w^{(t)}), w^{(t+1)} - w^{(t)} \rangle + \frac{L}{2}\|w^{(t+1)} - w^{(t)}\|^2 \le C(w^{(t)}) - \eta \langle \nabla C(w^{(t)}), p^{(t)} \rangle + \frac{L\eta^2}{2}\|p^{(t)}\|^2$,  (B.6)

where $\langle \cdot, \cdot \rangle$ is the inner product. For a cross-entropy loss $C(w)$, the Lipschitz constant $L$ is equal to the largest eigenvalue of the Hessian $\nabla^2 C(w)$ for all $w$, and is given by $L = \frac{1}{4}\|X\|_2^2$. By taking the expectation with respect to the quantization noise $n^{(t)}$ on both sides of (B.6), we have

$\mathbb{E}_{n^{(t)}}[C(w^{(t+1)})] \le C(w^{(t)}) - \eta\|\nabla C(w^{(t)})\|^2 + \frac{L\eta^2}{2}\big(\|\nabla C(w^{(t)})\|^2 + \sigma^2\big)$  (B.7)
$\le C(w^{(t)}) - \eta\big(1 - \frac{L\eta}{2}\big)\|\nabla C(w^{(t)})\|^2 + \frac{L\eta^2\sigma^2}{2}$
$\le C(w^{(t)}) - \frac{\eta}{2}\|\nabla C(w^{(t)})\|^2 + \frac{\eta\sigma^2}{2}$  (B.8)
$\le C(w^*) + \langle \nabla C(w^{(t)}), w^{(t)} - w^* \rangle - \frac{\eta}{2}\|\nabla C(w^{(t)})\|^2 + \frac{\eta\sigma^2}{2}$  (B.9)
$\le C(w^*) + \langle \mathbb{E}_{n^{(t)}}[p^{(t)}], w^{(t)} - w^* \rangle - \frac{\eta}{2}\mathbb{E}_{n^{(t)}}\|p^{(t)}\|^2 + \eta\sigma^2$  (B.10)
$= C(w^*) + \eta\sigma^2 + \mathbb{E}_{n^{(t)}}\big[\langle p^{(t)}, w^{(t)} - w^* \rangle - \frac{\eta}{2}\|p^{(t)}\|^2\big]$
$= C(w^*) + \eta\sigma^2 + \frac{1}{2\eta}\big(\|w^{(t)} - w^*\|^2 - \mathbb{E}_{n^{(t)}}\|w^{(t+1)} - w^*\|^2\big)$,  (B.11)

where (B.7) and (B.10) hold since $\mathbb{E}_{n^{(t)}}[p^{(t)}] = \nabla C(w^{(t)})$ and $\mathbb{E}_{n^{(t)}}\|p^{(t)} - \nabla C(w^{(t)})\|_2^2 \le \sigma^2$, (B.8) follows from $L\eta \le 1$, (B.9) follows from the convexity of $C$, and (B.11) follows from $p^{(t)} = -\frac{1}{\eta}(w^{(t+1)} - w^{(t)})$.
By taking the expectation on both sides of (B.11) with respect to the joint distribution of all random variables $n^{(0)}, \ldots, n^{(J-1)}$, where $J$ denotes the total number of iterations, we have

$\mathbb{E}[C(w^{(t+1)})] - C(w^*) \le \frac{1}{2\eta}\big(\mathbb{E}\|w^{(t)} - w^*\|^2 - \mathbb{E}\|w^{(t+1)} - w^*\|^2\big) + \eta\sigma^2$.  (B.12)

Summing both sides of the inequality in (B.12) for $t = 0, \ldots, J-1$, we find that

$\sum_{t=0}^{J-1}\big(\mathbb{E}[C(w^{(t+1)})] - C(w^*)\big) \le \frac{\|w^{(0)} - w^*\|^2}{2\eta} + J\eta\sigma^2$.

Finally, since $C$ is convex, we observe that

$\mathbb{E}\Big[C\Big(\frac{1}{J}\sum_{t=1}^{J} w^{(t)}\Big)\Big] - C(w^*) \le \frac{1}{J}\sum_{t=0}^{J-1}\big(\mathbb{E}[C(w^{(t+1)})] - C(w^*)\big) \le \frac{\|w^{(0)} - w^*\|^2}{2\eta J} + \eta\sigma^2$,

which completes the proof of convergence.

B.3 Details of the Multi-Party Computation (MPC) Implementation

We consider two well-known MPC protocols: the notable BGW protocol from [10], and the more recent, efficient MPC protocol from [6, 37]. Both protocols allow the computation of any polynomial function in a privacy-preserving manner by untrusted parties. Computations are carried out over the secret shares, and at the end, parties only learn a secret share of the actual result. Any collusion between up to $T = \lfloor \frac{N-1}{2} \rfloor$ out of $N$ parties does not reveal information (in an information-theoretic sense) about the input variables. The latter protocol is more efficient in terms of the communication cost between the parties, which scales linearly with the number of parties, whereas for the former protocol this cost is quadratic. As a trade-off, the latter requires a considerable amount of offline computation and a higher storage cost for creating and secret sharing the random variables used in the protocol. For creating secret shares, we utilize Shamir's $T$-out-of-$N$ secret sharing [123]. This scheme embeds a secret $a$ in a degree-$T$ polynomial $h(\xi) = a + \xi v_1 + \cdots + \xi^T v_T$, where $v_i$, $i \in [T]$, are uniformly random variables. Client $i \in [N]$ then receives a secret share of $a$, denoted by $h(i) = [a]_i$. This keeps $a$ private against any collusion between up to $T$ parties. The specific computations are then carried out as follows.

Addition. In order to perform a secure addition $a + b$, clients locally add their secret shares $[a]_i + [b]_i$. The resulting value is a secret share of the summation $a + b$. This step requires no communication.

Multiplication-by-a-constant. To perform a secure multiplication $ac$, where $c$ is a publicly known constant, clients locally multiply their secret share $[a]_i$ with $c$. The resulting value is a secret share of the desired product $ac$. This step requires no communication.

Multiplication. To perform a secure multiplication $ab$, the two protocols differ in their execution. In the BGW protocol, each client initially multiplies its secret shares $[a]_i$ and $[b]_i$ locally to obtain $[a]_i [b]_i$. The clients then hold a secret share of $ab$; however, the corresponding polynomial now has degree $2T$. This may in turn cause the degree of the polynomial to increase excessively as more multiplication operations are evaluated. To alleviate this problem, in the next phase, clients carry out a degree-reduction step to create new shares corresponding to a polynomial of degree $T$. The communication overhead of this protocol is $O(N^2)$.

The protocol from [6], on the other hand, leverages offline computations to speed up the communication phase. In particular, a random variable $\rho$ is created offline and secret shared with the clients twice, using two random polynomials of degrees $T$ and $2T$, respectively. The secret shares corresponding to the degree-$T$ polynomial are denoted by $[\rho]_{T,i}$, whereas the secret shares for the degree-$2T$ polynomial are denoted by $[\rho]_{2T,i}$, for clients $i \in [N]$. In the online phase, client $i \in [N]$ locally computes the multiplication $[a]_i [b]_i$, after which each client holds a secret share of the multiplication $ab$. The resulting polynomial has degree $2T$. Then, each client locally computes $[a]_i [b]_i - [\rho]_{2T,i}$, which corresponds to a secret share of $ab - \rho$ embedded in a degree-$2T$ polynomial.
Clients then broadcast their individual computations to others, after which each client computes ab−ρ. Note that the privacy of the computation ab is still protected since clients do not know the actual value of ab, but instead its masked version ab−ρ. Then, each client locally computes 186 ab−ρ + [ρ] T,i . As a result, variable ρ cancels out, and clients obtain a secret share of the multipli- cation ab embedded in a degree T polynomial. This protocol requires only O(N) broadcasts and therefore is more efficient than the previous algorithm. On the other hand, it requires an offline computation phase and higher storage overhead. For the details, we refer to [6, 5]. Remark B.1. The secure MPC computations during the encoding, decoding, and model update phases of COPML only use addition and multiplication-by-a-constant operations, instead of the expensive multiplication operation, as{α i } i∈[N] and{β k } k∈[K+T] are publicly known constants for all clients. B.3.1 Details of the Optimized Baseline Protocols In a naive implementation of our multi-client problem setting, both baseline protocols would utilize Shamir’s secret sharing scheme where the quantized dataset X = [X > 1 ,...,X > N ] > is secret shared withN clients. To do so, both baselines would follow the same secret sharing process as in COPML, where clientj∈ [N] creates a degreeT random polynomialh j (z) =X j +zR j1 +...+z T R jT where R ji for i ∈ [T ] are i.i.d. uniformly distributed random matrices while selecting T = b N−1 2 c. By selecting N distinct evaluation points λ 1 ,...,λ N from F p , client j would generate and send [X j ] i = h j (λ i ) to client i∈ [N]. As a result, client i∈ [N] would be assigned a secret share of the entire dataset X, i.e, [X] i = [X 1 ] > i ,..., [X N ] > i > . Client i would also obtain a secret share of the labels, [y] i , and a secret share of the initial model, [w (0) ] i , where y = [y > 1 ,...,y > N ] > and w (0) is a randomly initialized model. 
Then, the clients would compute the gradient and update the model from (3.7) within a secure MPC protocol. This guarantees privacy againstb N−1 2 c colluding workers, but requires a computation load at each worker that is as large as processing the whole dataset at a single worker, leading to slow training. 187 Hence, in order to provide a fair comparison with COPML, we optimize (speed up) the baseline protocols by partitioning the clients into subgroups of size 2T + 1. Clients communicate a secret share of their own datasets with the other clients in the same subgroup, instead of secret sharing it withtheentiresetofclients. Eachclientinsubgroupireceivesasecretshareofapartitioneddataset X i ∈F m G ×d p where X = [X > 1 ···X > G ] > and G is the number of subgroups. In other words, client j in subgroupi obtains a secret share [X i ] j . Then, subgroup i∈ [G] computes the sub-gradient over the partitioned dataset, X i , within a secure MPC protocol. To provide the same privacy threshold T =b N−3 6 c as Case 2 of COPML in Section 3.5, we set G = 3. This significantly reduces the total training time of the two baseline protocols (compared to the naive MPC implementation where the computation load at each client would be as high as training centrally), as the total amount of data processed at each client is equal to one third of the size of the entire dataset X. B.4 Algorithms The overall procedure of COPML protocol is given in Algorithm 4. 188 Algorithm 4 COPML input Dataset (X,y) = ((X1,y1),...,(XN,yN)) distributed over N clients. output Model parameters w (J) . 1: for client j = 1,...,N do 2: Secret share the individual dataset (Xj,yj) with clients i∈ [N]. 3: end for 4: Within a secure MPC protocol, initialize the model w (0) randomly and secret share with clients i∈ [N]. // Client i receives a secret share [w (0) ]i of w (0) . 5: Encode the dataset within a secure MPC protocol, using the secret shares [Xj]i for j∈ [N], i∈ [N]. 
// After this step, client i holds a secret share [ e Xj]i of each encoded dataset e Xj for j∈ [N]. 6: for client i = 1,...,N do 7: Gather the secret shares [ e Xi]j from clients j∈ [N]. 8: Recover the encoded dataset e Xi from the secret shares{[ e Xi]j} j∈[N] . // At the end of this step, client i obtains the encoded dataset e Xi. 9: end for 10: Compute X T y within a secure MPC protocol using the secret shares [Xj]i and [yj]i for j∈ [N], i∈ [N]. // At the end of this step, client i holds a secret share [X T y]i of X T y. 11: for iteration t = 0,...,J−1 do 12: Encode the model w (t) in a secure MPC protocol using the secret shares [w (t) ]i. // After this step, client i holds a secret share [e w (t) j ]i of the encoded model e w (t) j for j∈ [N]. 13: for client i = 1,...,N do 14: Gather the secret shares [e w (t) i ]j from clients j∈ [N]. 15: Recover the encoded model e w (t) i from the secret shares{[e w (t) i ]j} j∈[N] . // At the end of this step, client i obtains the encoded model e w (t) i . 16: Locally compute f( e Xi,e w (t) i ) from (3.7) and secret share the result with clients j∈ [N]. // Client i sends a secret share [f( e Xi,e w (t) i )]j of f( e Xi,e w (t) i ) to client j. 17: end for 18: for client i = 1,...,N do 19: Locally computes [f(X k ,w (t) )]i for k∈ [K] from (3.10). // After this step, client i knows a secret share [f(X k ,w (t) )]i of f(X k ,w (t) ) for k∈ [K]. 20: Locally aggregate the secret shares {[f(X k ,w (t) )]i} k∈K to compute [X T ˆ g(X×w (t) )]i , P k∈[K] [f(X k ,w (t) )]i. // At the end of this step, client i now has a secret share [X T ˆ g(X×w (t) )]i of X T ˆ g(X×w (t) ) = P k∈[K] f(X k ,w (t) ). 21: Locally compute [X > (ˆ g(X×w (t) )−y)]i, [X T ˆ g(X×w (t) )]i−[X T y]i. // Each client now has a secret share [X > (ˆ g(X×w (t) )−y)]i of X > (ˆ g(X×w (t) )−y). 
22: end for
23: Update the model according to (3.6) within a secure MPC protocol using the secret shares [X^T(ĝ(X×w^(t))−y)]i and [w^(t)]i for i∈ [N], and by carrying out the secure truncation operation. // At the end of this step, client i holds a secret share of the updated model [w^(t+1)]i. // Secure truncation is carried out jointly as it requires communication between the clients.
24: end for
25: for client j = 1,...,N do
26: Collect the secret shares [w^(J)]i from clients i∈ [N] and recover the final model w^(J).
27: end for

Appendix C

Appendix of Chapter 5

C.1 Pseudo Code of LightSecAgg

C.2 Proof of Theorem 6.1

We prove the dropout-resiliency guarantee and the privacy guarantee for a single FL training round. As all randomness is generated independently across rounds, the dropout-resiliency and privacy guarantees extend to all training rounds, in both the synchronous and asynchronous FL settings. For simplicity, the round index $t$ is omitted in this proof. For any pair of privacy guarantee $T$ and dropout-resiliency guarantee $D$ such that $T + D < N$, we select an arbitrary $U$ such that $N - D \ge U > T$. In the following, we show that LightSecAgg with the chosen design parameters $T$, $D$, and $U$ can simultaneously achieve (1) the privacy guarantee against up to any $T$ colluding users, and (2) the dropout-resiliency guarantee against up to any $D$ dropped users. We denote the concatenation of $\{[n_i]_k\}_{k \in \{U-T+1, \ldots, U\}}$ by $n_i$ for $i \in [N]$.
190 Algorithm 5 The LightSecAgg protocol Input: T (privacy guarantee), D (dropout-resiliency guarantee), U (target number of surviving users) 1: Server Executes: 2: // phase: offline encoding and sharing of local masks 3: for each user i = 1, 2,...,N in parallel do 4: z i ← randomly picks fromF d q 5: [z i ] 1 ,..., [z i ] U−T ← obtained by partitioning z i to U−T pieces 6: [n i ] U−T+1 ,..., [n i ] U ← randomly picks fromF d U−T q 7: {[˜ z i ] j } j∈[N] ← obtained by encoding [z i ] k ’s and [n i ] k ’s using (6.5) 8: sends encoded mask [˜ z i ] j to user j∈ [N]\{i} 9: receives encoded mask [˜ z j ] i from user j∈ [N]\{i} 10: end for 11: // phase: masking and uploading of local models 12: for each user i = 1, 2,...,N in parallel do 13: // user i obtains x i after the local update 14: ˜ x i ←x i +z i // masks the local model 15: uploads masked model ˜ x i to the server 16: end for 17: identifies set of surviving usersU 1 ⊆ [N] 18: gathers masked models ˜ x i from user i∈U 1 19: // phase: one-shot aggregate-model recovery 20: for each user i∈U 1 in parallel do 21: computes aggregated encoded masks P j∈U 1 [˜ z j ] i 22: uploads aggregated encoded masks P j∈U 1 [˜ z j ] i to the server 23: end for 24: collects U messages of aggregated encoded masks P j∈U 1 [˜ z j ] i from user i∈U 1 25: // recovers the aggregated-mask 26: P i∈U 1 z i ← obtained by decoding the received U messages 27: // recovers the aggregate-model for the surviving users 28: P i∈U 1 x i ← P i∈U 1 ˜ x i − P i∈U 1 z i 191 (Dropout-resiliency guarantee) We now focus on the phase of one-shot aggregate-model recov- ery. Since each user encodes its sub-masks by the same MDS matrix W, each P i∈U 1 [˜ z i ] j is an encoded version of P i∈U 1 [z i ] k fork∈ [U−T ] and P i∈U 1 [n i ] k fork∈{U−T + 1,...,U} as follows: X i∈U 1 [˜ z i ] j = ( X i∈U 1 [z i ] 1 ,..., X i∈U 1 [z i ] U−T , X i∈U 1 [n i ] U−T+1 ,..., X i∈U 1 [n i ] U )·W j , (C.1) where W j is the j’th column of W. 
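Equation (C.1) says that the sum of encoded masks is itself an encoding of the summed sub-masks, which is what enables the one-shot recovery in Algorithm 5. The following toy sketch checks this numerically; the field size, the parameters $(N, U, T)$, the evaluation points, and the Gaussian-elimination decoder are all illustrative choices, not the dissertation's implementation.

```python
# Toy sketch of LightSecAgg's one-shot aggregate-mask recovery (eq. (C.1)):
# every user encodes its mask partitions with the SAME Vandermonde matrix,
# so sums of encoded shares decode directly to the summed sub-masks.
import random

P = 2**13 - 1              # prime field (toy size)
N, U, T = 6, 4, 1          # so up to D = N - U = 2 users may drop
PIECES = U - T             # sub-masks per user

def encode(pieces, alphas):
    """[z~]_j = sum_k pieces[k] * alpha_j^k  (Vandermonde column W_j)."""
    return [sum(c * pow(a, k, P) for k, c in enumerate(pieces)) % P
            for a in alphas]

def solve_mod(A, b):
    """Gaussian elimination over F_P to invert the U x U Vandermonde block."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col])
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], P - 2, P)
        M[col] = [v * inv % P for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(v - f * w) % P for v, w in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

alphas = list(range(1, N + 1))  # distinct evaluation points
# Each user: U-T mask pieces plus T uniformly random padding pieces [n_i]_k.
users = [[random.randrange(P) for _ in range(U)] for _ in range(N)]
shares = [encode(u, alphas) for u in users]

survivors = [0, 2, 3, 5]  # any U surviving users suffice (2 users dropped)
# Each survivor j uploads the single aggregate sum_i [z~_i]_j.
agg_shares = [sum(shares[i][j] for i in survivors) % P for j in survivors]
A = [[pow(alphas[j], k, P) for k in range(U)] for j in survivors]
decoded = solve_mod(A, agg_shares)

expected = [sum(users[i][k] for i in survivors) % P for k in range(PIECES)]
assert decoded[:PIECES] == expected  # aggregate sub-masks recovered in one shot
```

The key point mirrored here is that the server performs a single decoding on $U$ aggregated messages, regardless of how many users contributed masks.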
Since N−D≥U, there are at least U surviving users after user dropouts. Thus, the server is able to recover P i∈U 1 [z i ] k fork∈ [U−T ] via MDS decoding after receiving a set of anyU messages from the surviving users. Recall that [z i ] k ’s are sub-masks of z i , so the server can successfully recover P i∈U 1 z i . Lastly, the server recovers the aggregate-model for the set of surviving usersU 1 by P i∈U 1 x i = P i∈U 1 ˜ x i − P i∈U 1 z i = P i∈U 1 (x i +z i )− P i∈U 1 z i . (Privacy guarantee) We first present Lemma C.1, whose proof is provided in Appendix C.4. Lemma C.1. For anyT ⊆ [N] of sizeT and anyU 1 ⊆ [N],|U 1 |≥U such thatU >T, if the random masks [n i ] k ’s are jointly uniformly random, we have I({z i } i∈[N]\T ;{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T ) = 0. (C.2) We consider the worst-case scenario in which all the messages sent from the users are received by the server during the execution of LightSecAgg, i.e., the users identified as dropped are delayed. Thus, the server receives x i +z i from user i∈ [N] and P j∈U 1 [˜ z j ] i from user i∈U 1 . We now show 192 that LightSecAgg provides privacy guarantee T, i.e., for an arbitrary set of colluding usersT of size T, the following holds, I {x i } i∈[N] ;{x i +z i } i∈[N] ,{ X j∈U 1 [˜ z j ] i } i∈U 1 X i∈U 1 x i ,{x i } i∈T ,{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T = 0. 
(C.3) 193 We prove it as follows: I {x i } i∈[N] ;{x i +z i } i∈[N] ,{ X j∈U 1 [˜ z j ] i } i∈U 1 X i∈U 1 x i ,{x i } i∈T ,{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T (C.4) =H {x i +z i } i∈[N] ,{ X j∈U 1 [˜ z j ] i } i∈U 1 X i∈U 1 x i ,{x i } i∈T ,{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T −H {x i +z i } i∈[N] ,{ X j∈U 1 [˜ z j ] i } i∈U 1 {x i } i∈[N] ,{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T (C.5) =H {x i +z i } i∈[N] , X i∈U 1 z i , X i∈U 1 n i X i∈U 1 x i ,{x i } i∈T ,{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T −H {z i } i∈[N] , X i∈U 1 z i , X i∈U 1 n i {x i } i∈[N] ,{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T (C.6) =H {x i +z i } i∈[N]\T , X i∈U 1 z i , X i∈U 1 n i X i∈U 1 x i ,{x i } i∈T ,{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T −H {z i } i∈[N] , X i∈U 1 z i , X i∈U 1 n i {x i } i∈[N] ,{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T (C.7) =H {x i +z i } i∈[N]\T X i∈U 1 x i ,{x i } i∈T ,{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T +H X i∈U 1 z i , X i∈U 1 n i {x i +z i } i∈[N]\T , X i∈U 1 x i ,{x i } i∈T ,{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T −H {z i } i∈[N] {x i } i∈[N] ,{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T ! −H X i∈U 1 z i , X i∈U 1 n i {z i } i∈[N] ,{x i } i∈[N] ,{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T (C.8) =H {x i +z i } i∈[N]\T X i∈U 1 x i ,{x i } i∈T ,{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T +H X i∈U 1 n i {x i +z i } i∈[N]\T , X i∈U 1 x i ,{x i } i∈T ,{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T −H {z i } i∈[N]\T {z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T ! −H X i∈U 1 n i {z i } i∈[N] ,{[˜ z j ] i } j∈[N],i∈T (C.9) =H {x i +z i } i∈[N]\T X i∈U 1 x i ,{x i } i∈T ,{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T +H X i∈U 1 n i {x i +z i } i∈[N]\T , X i∈U 1 x i ,{x i } i∈T ,{z i } i∈T ,{[˜ z j ] i } j∈[N],i∈T −H {z i } i∈[N]\T −H X i∈U 1 n i {z i } i∈[N] ,{[˜ z j ] i } j∈[N],i∈T (C.10) =0, (C.11) 194 where (C.6) follows from the fact that{ P j∈U 1 [˜ z j ] i } i∈U 1 is invertible to P i∈U 1 z i and P i∈U 1 n i . Equation (C.7) holds since{x i +z i } i∈T isadeterministicfunctionof{z i } i∈T and{x i } i∈T . Equation (C.8) follows from the chain rule. 
In equation (C.9), the second term follows from the fact that $\sum_{i \in \mathcal{U}_1} z_i$ is a deterministic function of $\{x_i + z_i\}_{i \in [N] \setminus \mathcal{T}}$, $\sum_{i \in \mathcal{U}_1} x_i$, $\{x_i\}_{i \in \mathcal{T}}$, and $\{z_i\}_{i \in \mathcal{T}}$; the third term follows from the independence of the $x_i$'s and $z_i$'s; and the last term follows from the fact that $\sum_{i \in \mathcal{U}_1} z_i$ is a deterministic function of $\{z_i\}_{i \in [N]}$, together with the independence of the $n_i$'s and $x_i$'s. In equation (C.10), the third term follows from Lemma C.1. Equation (C.11) follows from: 1) $\sum_{i \in \mathcal{U}_1} n_i$ is a function of $\{x_i + z_i\}_{i \in [N] \setminus \mathcal{T}}$, $\sum_{i \in \mathcal{U}_1} x_i$, $\{x_i\}_{i \in \mathcal{T}}$, $\{z_i\}_{i \in \mathcal{T}}$, and $\{[\tilde{z}_j]_i\}_{j \in [N], i \in \mathcal{T}}$; 2) $\sum_{i \in \mathcal{U}_1} n_i$ is a function of $\{z_i\}_{i \in \mathcal{U}_1}$ and $\{[\tilde{z}_j]_i\}_{j \in \mathcal{U}_1, i \in \mathcal{T}}$; 3) $z_i$ is uniformly distributed and hence has the maximum entropy in $\mathbb{F}_q^d$; combined with the non-negativity of mutual information.

C.2.1 Discussion

As shown in Table C.1, compared with the SecAgg protocol [17], LightSecAgg significantly improves the computational efficiency at the server during aggregation. SecAgg requires the server to retrieve $T + 1$ secret shares of a secret key for each of the $N$ users, and to compute a single PRG function if the user survives, or $N - 1$ PRG functions to recover the $N - 1$ pairwise masks if the user drops out, yielding a total computational load of $O(N^2 d)$ at the server. In contrast, as we have analyzed in Section 6.5.2, for $U = O(N)$, LightSecAgg incurs an almost constant ($O(d \log N)$) computational load at the server. This admits a scalable design and is expected to achieve a much faster end-to-end execution for a large number of users, given the fact that the overall execution time is dominated by the server's computation in SecAgg [17, 15]. SecAgg has a smaller storage overhead than LightSecAgg, as the stored secret shares of keys are small (e.g., as small as an integer) and the model size is much larger than the number of users $N$ in typical FL scenarios. This effect also allows SecAgg to have a smaller communication load in the phase of aggregate-model recovery.
Finally, we would like to note that another advantage of LightSecAgg over SecAgg is its reduced dependence on cryptographic primitives such as a public key infrastructure and a key agreement mechanism, which further simplifies the implementation of the protocol. SecAgg+ [8] improves both the communication and computational load of SecAgg by considering a sparse random graph of degree $O(\log N)$, reducing the complexity by a factor of $O(\frac{N}{\log N})$. However, SecAgg+ still incurs an $O(dN \log N)$ computational load at the server, which is much larger than the $O(d \log N)$ computational load at the server in LightSecAgg when $U = O(N)$.

Table C.1: Complexity comparison between SecAgg [17], SecAgg+ [8], and LightSecAgg. Here $N$ is the total number of users. The parameters $d$ and $s$ respectively represent the model size and the length of the secret keys used as the seeds for the PRG, where $s \ll d$. LightSecAgg and SecAgg provide worst-case privacy guarantee $T$ and dropout-resiliency guarantee $D$ for any $T$ and $D$ as long as $T + D < N$. SecAgg+ provides probabilistic privacy guarantee $T$ and dropout-resiliency guarantee $D$. LightSecAgg selects three design parameters $T$, $D$, and $U$ such that $N - D \ge U > T$.

In the synchronous setting, the pairwise masks cancel out in the aggregate since user $i$ (with $i < j$) adds $\mathrm{PRG}(a^{(t)}_{i,j})$ to $x^{(t)}_i$ while user $j$ (with $j > i$) subtracts $\mathrm{PRG}(a^{(t)}_{i,j})$ from $x^{(t)}_j$. In asynchronous FL, however, the cancellation of the pairwise random masks based on the key agreement protocol is not guaranteed, due to the mismatch in staleness between the users. Specifically, at round $t$, user $i \in \mathcal{S}^{(t)}$ sends the server the masked model

$y_i^{(t; t_i)} = \Delta_i^{(t; t_i)} + \mathrm{PRG}\big(b_i^{(t_i)}\big) + \sum_{j: i < j} \mathrm{PRG}\big(a^{(t_i)}_{i,j}\big) - \sum_{j: i > j} \mathrm{PRG}\big(a^{(t_i)}_{j,i}\big)$,  (C.22)

where $\Delta_i^{(t; t_i)}$ is the local update defined in (C.19). When $t_i \ne t_j$, the pairwise random vectors in $y_i^{(t; t_i)}$ and $y_j^{(t; t_j)}$ are not canceled out, as $a^{(t_i)}_{i,j} \ne a^{(t_j)}_{i,j}$. We note that the staleness of each user is not known a priori, hence each pair of users cannot use the same pairwise random seed.
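The cancellation issue in (C.22) can be checked numerically. The following is a toy sketch with a trivial deterministic function standing in for the cryptographic PRG; the dimension, seeds, and values are all illustrative.

```python
# Pairwise masks PRG(a_{i,j}) cancel in the aggregate only when both users
# expand a seed agreed at the SAME round; a staleness mismatch breaks this.
d = 4  # model dimension (toy)

def prg(seed):
    # stand-in for a cryptographic PRG: deterministic expansion of the seed
    return [(seed * (k + 1)) % 1000 for k in range(d)]

x1, x2 = [1] * d, [2] * d
a_round0, a_round1 = 17, 99  # pairwise seeds agreed at rounds 0 and 1

# Same round: user 1 adds PRG(a), user 2 (with larger index) subtracts it.
y1 = [v + m for v, m in zip(x1, prg(a_round0))]
y2 = [v - m for v, m in zip(x2, prg(a_round0))]
assert [u + v for u, v in zip(y1, y2)] == [u + v for u, v in zip(x1, x2)]

# Mismatched staleness: user 1 has moved on to the round-1 seed while user 2
# still uses the round-0 seed, so the masks no longer cancel.
y1_stale = [v + m for v, m in zip(x1, prg(a_round1))]
assert [u + v for u, v in zip(y1_stale, y2)] != [u + v for u, v in zip(x1, x2)]
```

This is precisely the failure mode that motivates encoding masks so that their aggregate can be recovered regardless of when each mask was generated.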
C.5.3 Asynchronous LightSecAgg We now demonstrate how LightSecAgg can be applied to the asynchronous FL setting where the server stores each local update in a buffer of size K and updates the global model by aggregating the stored updates when the buffer is full. Our key intuition is to encode the local masks in a way that the server can recover the aggregate of masks from the encoded masks via a one-shot computation even though the masks are generated in different training rounds. The asynchronous LightSecAgg protocol also consists of three phases with three design parametersD,T,U which are defined in the same way as the synchronous LightSecAgg. Synchronous and asynchronous LightSecAgg have two key differences: (1) In asynchronous FL, the users share the encoded masks with the time stamp in the first phase to figure out which encoded masks should be aggregated for the reconstruction ofaggregate of masks inthe third phase. Due to the commutative property of coding and addition, the server can reconstruct the aggregate of masks even though the masks are generated in different training rounds; (2) In asynchronous FL, the server compensates the staleness of the local updates. This is challenging as this compensation should be carried out over the masked model in the finite field to provide the privacy guarantee while the conventional compensation functions have real numbers as outputs [152, 109]. 202 We now describe the three phases in detail. C.5.4 Offline Encoding and Sharing of Local Masks User i generates z (t i ) i uniformly at random from the finite field F d q , where t i is the global round index when user i downloads the global model from the server. The mask z (t i ) i is partitioned into U−T sub-masks denoted by [z (t i ) i ] 1 ,··· , [z (t i ) i ] U−T , where U denotes the targeted number of surviving users and N−D≥ U ≥ T. User i also selects another T random masks denoted by [n (t i ) i ] U−T+1 ,··· , [n (t i ) i ] U . 
These U partitions [z_i^{(t_i)}]_1, ..., [z_i^{(t_i)}]_{U−T}, [n_i^{(t_i)}]_{U−T+1}, ..., [n_i^{(t_i)}]_U are then encoded through an (N, U) Maximum Distance Separable (MDS) code as follows:

[\tilde{z}_i^{(t_i)}]_j = ( [z_i^{(t_i)}]_1, ..., [z_i^{(t_i)}]_{U−T}, [n_i^{(t_i)}]_{U−T+1}, ..., [n_i^{(t_i)}]_U ) W_j,    (C.23)

where W_j is the j-th column of the Vandermonde matrix defined in (6.5). User i sends [\tilde{z}_i^{(t_i)}]_j to each user j ∈ [N] \ {i}. At the end of this phase, each user i ∈ [N] holds [\tilde{z}_j^{(t_j)}]_i from all j ∈ [N].

C.5.5 Training, Quantizing, Masking, and Uploading of Local Updates

Each user i trains its local model as in (C.19) and (C.20). User i quantizes its local update \Delta_i^{(t;t_i)} from the domain of real numbers to the finite field F_q, since masking and MDS encoding are carried out in the finite field to provide information-theoretic privacy. The field size q is assumed to be large enough to avoid any wrap-around during secure aggregation.

Quantization is a challenging task, as it must be performed in a way that ensures the convergence of the global model. Moreover, the quantization must allow the representation of negative integers in the finite field and enable computations to be carried out in the quantized domain. Therefore, we cannot utilize well-known gradient quantization techniques such as QSGD [alistarh2017qsgd], which represents the sign of a negative number separately from its magnitude. LightSecAgg addresses this challenge with a simple stochastic quantization strategy combined with the two's complement representation, as described subsequently. For any positive integer c ≥ 1, we first define a stochastic rounding function

Q_c(x) = ⌊cx⌋/c        with prob. 1 − (cx − ⌊cx⌋),
         (⌊cx⌋ + 1)/c  with prob. cx − ⌊cx⌋,        (C.24)

where ⌊x⌋ is the largest integer less than or equal to x. This rounding function is unbiased, i.e., E_Q[Q_c(x)] = x. The parameter c is a design parameter that determines the number of quantization levels, and the variance of Q_c(x) decreases as the value of c increases.
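A minimal sketch of the stochastic rounding function in (C.24), illustrating its unbiasedness (function and parameter names are illustrative):

```python
import numpy as np

# Q_c(x) from (C.24): round c*x down or up at random so that E[Q_c(x)] = x.
def stochastic_round(x, c, rng):
    cx = c * np.asarray(x, dtype=float)
    frac = cx - np.floor(cx)                 # cx - floor(cx)
    round_up = rng.random(cx.shape) < frac   # happens with prob. cx - floor(cx)
    return (np.floor(cx) + round_up) / c

rng = np.random.default_rng(0)
c = 4                                        # 1/c is the quantization step
samples = stochastic_round(np.full(100_000, 0.3), c, rng)
# Each sample is 1/4 = 0.25 or 2/4 = 0.5, and the sample mean approaches 0.3,
# illustrating unbiasedness; larger c shrinks the variance of Q_c(x).
```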
We then define the quantized update

\bar{\Delta}_i^{(t;t_i)} := φ( c_l · Q_{c_l}( \Delta_i^{(t;t_i)} ) ),    (C.25)

where the function Q_{c_l} from (C.24) is carried out element-wise, and c_l is a positive integer parameter that determines the quantization level of the local updates. The mapping function φ : R → F_q is defined to represent a negative integer in the finite field by using the two's complement representation,

φ(x) = x       if x ≥ 0,
       q + x   if x < 0.        (C.26)

To protect the privacy of the local updates, user i masks the quantized update \bar{\Delta}_i^{(t;t_i)} in (C.25) as

\tilde{\Delta}_i^{(t;t_i)} = \bar{\Delta}_i^{(t;t_i)} + z_i^{(t_i)},    (C.27)

and sends the pair { \tilde{\Delta}_i^{(t;t_i)}, t_i } to the server. The local round index t_i is used in two cases: (1) when the server identifies the staleness of each local update and compensates for it, and (2) when the users aggregate the encoded masks for one-shot recovery, which will be explained in Section C.5.6.

C.5.6 One-shot Aggregate-update Recovery and Global Model Update

The server stores \tilde{\Delta}_i^{(t;t_i)} in the buffer, and when the buffer of size K is full, the server aggregates the K masked local updates. In this phase, the server intends to recover

\sum_{i ∈ \mathcal{S}^{(t)}} s(t − t_i) \Delta_i^{(t;t_i)},    (C.28)

where \Delta_i^{(t;t_i)} is the local update in the real domain defined in (C.19), \mathcal{S}^{(t)} (with |\mathcal{S}^{(t)}| = K) is the index set of users whose local updates are stored in the buffer and aggregated by the server at round t, and s(τ) is the staleness function defined in (C.21). To do so, the first step is to reconstruct \sum_{i ∈ \mathcal{S}^{(t)}} s(t − t_i) z_i^{(t_i)}. This is challenging because the decoding must be performed in the finite field, while the value of s(τ) is a real number. To address this problem, we introduce a quantized staleness function \bar{s}_{c_g} : {0, 1, ...} → F_q,

\bar{s}_{c_g}(τ) = c_g Q_{c_g}( s(τ) ),    (C.29)

where Q_{c_g}(·) is the stochastic rounding function defined in (C.24), and c_g is a positive integer that determines the quantization level of the staleness function. Then, the server broadcasts the information { \mathcal{S}^{(t)}, {t_i}_{i ∈ \mathcal{S}^{(t)}}, c_g } to all surviving users.
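The mapping φ in (C.26) and its demapping can be sketched as follows; because φ is reduction modulo q, masking and unmasking in the finite field commute with the embedding (field size and helper names are illustrative):

```python
q = 2**31 - 1   # an illustrative (prime) field size

def phi(x):
    # (C.26): two's-complement-style embedding of a signed integer into F_q
    return x % q                      # x if x >= 0, q + x if x < 0

def phi_inv(x):
    # Demapping: small residues are positive, large residues are negative
    return x if x < (q - 1) // 2 else x - q

# A negative quantized value survives a masked round trip:
a, z = -5, 123456                     # quantized (negative) update and a mask
masked = (phi(a) + z) % q             # what the server would see
recovered = phi_inv((masked - z) % q) # unmask, then demap: gives back -5
```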
After identifying the selected users in \mathcal{S}^{(t)}, the local round indices {t_i}_{i ∈ \mathcal{S}^{(t)}}, and the corresponding staleness, each user j ∈ [N] aggregates its encoded sub-masks \sum_{i ∈ \mathcal{S}^{(t)}} \bar{s}_{c_g}(t − t_i) [\tilde{z}_i^{(t_i)}]_j and sends the result to the server for the purpose of one-shot recovery. The key difference between asynchronous and synchronous LightSecAgg is that in the asynchronous setting the time stamp t_i of the encoded mask [\tilde{z}_i^{(t_i)}]_j can differ across i ∈ \mathcal{S}^{(t)}, hence each user j ∈ [N] must aggregate the encoded masks with the proper round indices. Due to the commutative property of coding and linear operations, each \sum_{i ∈ \mathcal{S}^{(t)}} \bar{s}_{c_g}(t − t_i) [\tilde{z}_i^{(t_i)}]_j is an encoded version of \sum_{i ∈ \mathcal{S}^{(t)}} \bar{s}_{c_g}(t − t_i) [z_i^{(t_i)}]_k for k ∈ [U − T], under the MDS (Vandermonde) matrix defined in (C.23). Thus, after receiving the results from any set \mathcal{U}_2 of surviving users with |\mathcal{U}_2| = U, the server reconstructs \sum_{i ∈ \mathcal{S}^{(t)}} \bar{s}_{c_g}(t − t_i) [z_i^{(t_i)}]_k for k ∈ [U − T] via MDS decoding. By concatenating the U − T aggregated sub-masks \sum_{i ∈ \mathcal{S}^{(t)}} \bar{s}_{c_g}(t − t_i) [z_i^{(t_i)}]_k, the server recovers \sum_{i ∈ \mathcal{S}^{(t)}} \bar{s}_{c_g}(t − t_i) z_i^{(t_i)}. Finally, the server obtains the desired global update as

g^{(t)} = \frac{1}{c_g c_l \sum_{i ∈ \mathcal{S}^{(t)}} \bar{s}_{c_g}(t − t_i)} φ^{−1}\Big( \sum_{i ∈ \mathcal{S}^{(t)}} \bar{s}_{c_g}(t − t_i) \tilde{\Delta}_i^{(t;t_i)} − \sum_{i ∈ \mathcal{S}^{(t)}} \bar{s}_{c_g}(t − t_i) z_i^{(t_i)} \Big),    (C.30)

where c_l is defined in (C.25) and φ^{−1} : F_q → R is the demapping function defined as

φ^{−1}(x) = x       if 0 ≤ x < (q − 1)/2,
            x − q   if (q − 1)/2 ≤ x < q.
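The one-shot recovery above can be sketched end-to-end with toy parameters (the field size, Vandermonde evaluation points, and helper names are illustrative, not the thesis's exact construction): each user locally aggregates its staleness-weighted encoded shares, and the server decodes the weighted aggregate of the masks from any U of them, exploiting the commutativity of MDS encoding and linear combination.

```python
import numpy as np

q = 2**13 - 1            # toy prime field
N, U, T = 5, 3, 1        # users, target survivors, privacy threshold
L = 2                    # length of each sub-mask
# N x U Vandermonde matrix over F_q with evaluation points 1..N (cf. (C.23))
V = np.array([[pow(a, j, q) for j in range(U)] for a in range(1, N + 1)],
             dtype=np.int64)
rng = np.random.default_rng(1)

def encode(z_sub):
    """Stack U-T mask sub-blocks with T random sub-blocks; encode to N shares."""
    parts = np.vstack([z_sub, rng.integers(0, q, (T, L))])
    return V @ parts % q             # row j is the share held by user j

def inv_mod(A):
    """Invert a matrix over F_q (q prime) by Gauss-Jordan elimination."""
    n = len(A)
    M = np.hstack([A % q, np.eye(n, dtype=np.int64)])
    for c in range(n):
        p = c + int(np.nonzero(M[c:, c])[0][0])         # pivot row
        M[[c, p]] = M[[p, c]]
        M[c] = M[c] * pow(int(M[c, c]), q - 2, q) % q   # Fermat inverse
        for r in range(n):
            if r != c:
                M[r] = (M[r] - M[r, c] * M[c]) % q
    return M[:, n:]

# Two buffered users whose masks were generated in (possibly) different rounds
z = [rng.integers(0, q, (U - T, L)) for _ in range(2)]
shares = [encode(zi) for zi in z]
w = [3, 2]               # quantized staleness weights s_cg(t - t_i)

# Each user aggregates its weighted encoded shares locally (one message)...
agg = (w[0] * shares[0] + w[1] * shares[1]) % q
# ...and the server decodes from ANY U of them; here the first U users survive.
decoded = (inv_mod(V[:U]) @ agg[:U]) % q
# The first U-T rows equal the staleness-weighted aggregate of the mask sub-blocks.
```

The T random sub-blocks decode to weighted aggregates of the n-masks and are simply discarded; only the first U − T rows are concatenated to rebuild the aggregate mask.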
Asset Metadata
Creator: So, Jin Hyun (author)
Core Title: Coding centric approaches for efficient, scalable, and privacy-preserving machine learning in large-scale distributed systems
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Degree Conferral Date: 2022-12
Publication Date: 10/10/2022
Defense Date: 07/06/2022
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tags: coding theory; federated learning; information theory; large-scale distributed computing; OAI-PMH Harvest; privacy-preserving machine learning
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Avestimehr, Salman (committee chair); Annavaram, Murali (committee member); Razaviyayn, Meisam (committee member)
Creator Email: jinhyun.soh@gmail.com, jinhyuns@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC112114246
Unique Identifier: UC112114246
Identifier: etd-SoJinHyun-11265.pdf (filename)
Legacy Identifier: etd-SoJinHyun-11265
Document Type: Dissertation
Rights: So, Jin Hyun
Internet Media Type: application/pdf
Type: texts
Source: 20221017-usctheses-batch-986; University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu