TAMING HETEROGENEITY, THE UBIQUITOUS BEAST IN CLOUD COMPUTING AND DECENTRALIZED LEARNING

by Saurav Prakash

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2022

Copyright 2022 Saurav Prakash

Dedicated to my beloved parents, Jay Prakash Lal and Anita Lal, and my lovely sister, Surabhi Prakash.

Acknowledgements

This dissertation would not have been possible without the involvement of so many people in my life. I would like to thank them for their invaluable support – it has been truly humbling.

First and foremost, I am deeply grateful for being advised by Prof. Amir Salman Avestimehr. I was fortunate to work under his mentorship as an IUSSTF Viterbi-India undergraduate intern and then to subsequently return to USC and have him as my PhD advisor. My sincere gratitude to him for all his help, guidance and support throughout my PhD studies. His constant emphasis on tackling big, hard problems, finding efficient solutions, and becoming a leader in the research field has shaped my PhD journey quite positively. He has an amazing ability to quickly grasp new ideas and to help formulate clean research problems, which has been instrumental in my PhD studies. Furthermore, he has always motivated me to become an independent researcher, giving me ample freedom to identify and pursue research problems to work on. I would also like to express my deepest gratitude for his unrelenting assistance during the tumultuous and extremely uncertain times of the COVID-19 pandemic. I cannot imagine finishing this dissertation without his help. Finally, I have learnt immensely from him about technical writing and presentation skills, for which I will be forever grateful. I am also indebted to his wonderful family for gracefully hosting all of us in the group during the Nowruz celebrations at their home.
I am also immensely grateful to Prof. Ramtin Pedarsani for his guidance and insightful comments, which were integral in formulating research problems and resolving technical difficulties arising during our multiple fruitful collaborations. I am also indebted to Amirhossein Reisizadeh, an excellent open-minded and ambitious researcher, for being an awesome friend and collaborator on multiple projects during my PhD journey. Together with Prof. Pedarsani and Prof. Avestimehr, we developed many foundational works addressing multiple challenging problems, particularly in the nascent field of coded distributed computing, culminating in many publications. Furthermore, our research proposal led us into the finals of the prestigious Qualcomm Innovation Fellowship in 2019, which was an exceptional experience.

I would also like to express my sincere gratitude to Prof. Murali Annavaram, Hanieh Hashemi and Yongqin Wang for their collaboration in the area of secure decentralized learning in the final phase of my PhD. It has been a wonderful experience working with them. I would especially like to thank Prof. Murali, who is very kind and approachable, and has been immensely helpful in discussing the practical aspects of the underlying research problems. His focus on demonstrating end-to-end gains in practice has been quite crucial in formulating problems and designing experiments. Hanieh is an excellent researcher with unique expertise at the intersection of machine learning and hardware security, which has been very instrumental in our collaboration.

I have also had the fortune to gain industry experience through multiple internships. I spent Summer 2018 and Summer 2019 as a Research Intern at Intel Labs under Dr. Shilpa Talwar and Dr. Nageen Himayat, respectively. During Summer 2021, I was an Applied Scientist Intern at Amazon Alexa AI under Dr. Clement Chung and Dr. Rahul Gupta. I am deeply grateful to my managers for having me on their stellar teams.
I would particularly like to thank Nageen for being a great mentor during my time at Intel; she is a formidable researcher, an amicable colleague, and an awesome person to work with. Furthermore, special thanks to Sagar Dhakal, a great friend and mentor, who made my time at Intel a really memorable experience. I immensely enjoyed going deeply into research topics with him; his attention to the minutest details helped tremendously in resolving the technical difficulties in our research works. With their amazing guidance, support and feedback, and in collaboration with Shilpa Talwar and other awesome researchers Yair Yona, Ravikumar Balakrishnan, and Mustafa Akdeniz, I pursued multiple research directions in cloud computing and federated learning. My research at Intel culminated in multiple publications and patent applications, some of which are part of this dissertation. In particular, Chapter 5 of this dissertation is an outcome of the work I did at Intel.

This dissertation has benefited greatly from the in-depth comments and insightful suggestions from the members of my qualifying exam committee: Prof. Murali Annavaram, Prof. Bhaskar Krishnamachari, Prof. Antonio Ortega and Prof. Meisam Razaviyayn. I would also like to thank Prof. Annavaram and Prof. Leana Golubchik for agreeing to serve on my dissertation committee. Their expertise has been invaluable.

I am also grateful to have worked alongside brilliant and inspiring colleagues, both within and outside our group at USC, who have made my doctoral journey at USC a great learning and enjoyable experience. Special thanks to Hanieh Hashemi, Yongqin Wang, Ahmed Roushdy, Yue Niu, Sara Babakniya, Souvik Kundu, Sunwoo Lee, and Haleh Akrami for the rewarding collaborations; I acquired a lot of knowledge while working with them. I am also very thankful to Chaoyang He, with whom I secured the Qualcomm Innovation Fellowship in 2021.
Furthermore, I want to express my deepest gratitude to my colleagues and friends Songze Li, Qian Yu, Chien-Sheng Yang, Mohammadreza Kalan, Jinhyun So, Navid Naderializadeh, Dhruva Kartik, Chaitanya Kalagarla, Ajinkya Jayawant, Krishnagiri Narra, Jitin Singla, Feng Ling, Jorge Gomez, Hussein Hammoud, Olaoluwa Adigun, Emir Ceyani, Yahya Ezzeldin, Ramy Ali, Tuo Zhang, Amir Ziashahabi, Erum Mushtaq and Akash Panda. I immensely enjoyed having enriching discussions with them on a variety of topics, including research, life and spirituality.

I am also grateful to have had the constant support of Diane Demetras, our department student services director. Tracy Charles and Andy Chen, the doctoral programs coordinators, have also played an integral role in this journey. I would also like to thank the administrative staff Susan Wiedem, Gerrielyn Ramos and Corine Wong, who have been quite instrumental in carrying out the administrative procedures smoothly throughout my years at USC. I am also grateful to Ben and Cathy for organizing the various activities in the department, including the memorable departmental retreat in 2019. I would also like to thank Ms. Monika Madan at IUSSTF and Prof. Cauligi Raghavendra at USC. They have been quite instrumental in managing the IUSSTF Viterbi-India internship program, which provided me a crucial internship opportunity to explore research at USC as an undergraduate.

I would also like to thank my professors at IIT Kanpur, who inspired and motivated me during my undergraduate years to pursue higher studies and build a career in academia. In particular, I would like to express my sincere gratitude to Prof. Aditya Jagannatham, who mentored me during my undergraduate research studies. I am deeply indebted to him, and to Prof. Satyadev Nandakumar, for giving me their valuable recommendation letters. I would also like to thank Prof. Vinay Namboodiri, Prof. Adrish Banerjee, Prof. Sri Sivakumar, Prof. Ketan Rajawat, and Prof.
Sandeep Verma for their time and support at IIT Kanpur.

Lastly, and most importantly, I have been extremely fortunate to have such a supportive family throughout my life. In particular, their relentless love, wise counsel and connectedness have been quite integral in my doctoral studies. Despite being physically so far away, they have always been close to my heart. This dissertation is dedicated to them.

Table of Contents

Dedication ii
Acknowledgements iii
List of Tables xi
List of Figures xii
Abstract xvii

Chapter 1: Introduction 1
  1.1 Taming Heterogeneity in Cloud Computing 4
  1.2 Taming Heterogeneity in Decentralized Learning 8

Chapter 2: Coded Computing for Distributed Graph Analytics 11
  2.1 Introduction 11
  2.2 Problem Setting 19
    2.2.1 Computation Model 19
    2.2.2 Distributed Implementation 22
    2.2.3 Problem Formulation 26
  2.3 Main Results 29
  2.4 Proposed Scheme and Proof of Achievability of Theorem 1 34
    2.4.1 Proposed Scheme 34
    2.4.2 Proof of Achievability of Theorem 1 38
    2.4.3 Proof of Lemma 1 42
  2.5 Converse for the Erdös-Rényi Model 45
  2.6 Achievability for the Power Law Model 49
  2.7 Experiments over Amazon EC2 Clusters 54
    2.7.1 Implementation Details 54
    2.7.2 Experimental Results 57
  2.8 Conclusion 59

Chapter 3: Coded Computation over Heterogeneous Clusters 62
  3.1 Introduction 62
  3.2 Problem Formulation and Main Results 66
    3.2.1 Computation Model 66
    3.2.2 Network Model 68
    3.2.3 Problem Formulation 69
    3.2.4 Main Results 71
  3.3 The Proposed HCMM Scheme and Proofs of Theorems 5 and 6 73
    3.3.1 Alternative Formulation of P_main via Maximal Aggregate Return 73
    3.3.2 Solving the Alternative Formulation 75
    3.3.3 Asymptotic Optimality of HCMM 77
    3.3.4 Comparison with Uncoded Schemes 82
  3.4 Generalization to the Shifted Weibull Model and Proofs of Theorems 7 and 8 85
  3.5 Numerical Studies and Experiments using Amazon EC2 Machines 89
    3.5.1 Numerical Analysis 89
    3.5.2 Experiments using Amazon EC2 Machines 93
  3.6 Generalization to Computing Scenarios under Budget Constraints 97
  3.7 Conclusion 103

Chapter 4: CodedReduce: A Fast and Robust Framework for Gradient Aggregation in Distributed Learning 104
  4.1 Introduction 104
  4.2 Problem Setup and Background 108
    4.2.1 Problem Setting 109
    4.2.2 Ring-AllReduce 110
    4.2.3 Gradient Coding 111
  4.3 Proposed CodedReduce Scheme 113
    4.3.1 Description of CR Scheme 113
    4.3.2 An Example for CR 116
    4.3.3 Theoretical Guarantees of CR 118
  4.4 Empirical Evaluation of CR 121
    4.4.1 Convex Optimization 122
      4.4.1.1 Real Data Set 122
      4.4.1.2 Artificial Data Set 125
    4.4.2 Neural Networks 127
  4.5 Conclusion 129

Chapter 5: Coded Computing for Low-Latency Federated Learning over Wireless Edge Networks 131
  5.1 Introduction 131
  5.2 Problem Setup and MEC Model 137
    5.2.1 Federated Learning 137
    5.2.2 Compute and Communication Models 141
  5.3 Proposed CodedFedL Scheme 142
    5.3.1 Distributed Kernel Embedding 143
    5.3.2 Distributed Encoding 145
    5.3.3 Coding Redundancy and Load Assignment 146
    5.3.4 Weight Matrix Construction 150
    5.3.5 Coded Federated Aggregation 150
  5.4 Analyzing CodedFedL Load Design 153
  5.5 Numerical Experiments 156
    5.5.1 Simulation Setting 157
    5.5.2 Results 159
  5.6 Conclusion 162

Chapter 6: Hierarchical Coded Gradient Aggregation for Learning at the Edge 164
  6.1 Introduction 164
  6.2 Problem Setting and Main Results 166
    6.2.1 Computation Model 166
    6.2.2 Network and Communication Model 167
    6.2.3 Metrics 168
    6.2.4 Problem Formulation 170
    6.2.5 Main Results 170
  6.3 Motivating Examples 171
  6.4 Aligned Repetition Coding and Proof of Achievability for C*_HM 174
  6.5 Aligned MDS Coding and Proof of Achievability for C*_EH 176
  6.6 Conclusion 178

Chapter 7: Secure and Fault Tolerant Decentralized Learning 180
  7.1 Introduction 180
  7.2 Problem Setup 185
    7.2.1 Federated Learning with SGD 185
    7.2.2 Fault Model 186
    7.2.3 Trusted Execution Environments 187
  7.3 The Proposed DiverseFL Algorithm 188
    7.3.1 Description of DiverseFL 188
    7.3.2 Effectiveness of per-client fault mitigation 191
  7.4 Experiments 193
    7.4.1 Performance of DiverseFL 193
    7.4.2 Scalability of DiverseFL 196
    7.4.3 Ablation Studies 198
      7.4.3.1 Number of Faulty Clients 198
      7.4.3.2 Performance for Multiple Local Iterations 199
  7.5 Conclusion 200

References 201

Appendices 221
  A Achievability for the Random Bi-partite Model 221
  B Converse for the Random Bi-partite Model 223
  C Achievability for the Stochastic Block Model 224
  D Converse for the Stochastic Block Model 226
  E Proof of Lemma 7 226
  F Proof of Lemma 4 227
  G McDiarmid's Inequality 228
  H Pseudo-code for Computation Allocation Sub-routine 228
  I Pseudo-code for CodedReduce Scheme 229
  J Proof of Theorem 9 230
  K Proof of Theorem 10 232
  L Proof of Optimality of the Two-step Approach 238
  M Proof of Theorem 11 239
  N Proof of Monotonically Increasing Behavior of Optimized Expected Return 240
  O One-shot Solution for AWGN 241
  P Towards Convergence Analysis of CodedFedL 243
  Q Privacy Budget for CodedFedL 247
  R Proof of Converse for C*_HM 248
  S Proof of Converse for C*_EH 248
  T Upper Bound for C_HM for AMC 249
  U Convergence Analysis for DiverseFL 251

List of Tables

3.1 Total computation load (∑_{i=1}^n ℓ_i) of HCMM and Uniform Coded 95
3.2 Amazon EC2 Pricing for Linux 97
4.1 Communication parallelization gain and straggler resiliency of the three designs RAR, GC, and CR in a system with N nodes with computation load r, where CR has a tree communication topology of L layers 113
4.2 Details of the neural network architecture used in the simulations 129
5.1 Main notations 137
5.2 Summary of Results for δ = ψ = 0.1 162
5.3 Summary of Results for δ = ψ = 0.2 162
7.1 Final test accuracies for CIFAR10 under the Gaussian fault for different numbers of faults. Similar results were observed for other faults and datasets 199

List of Figures

2.1 Illustrating the "think like a vertex" paradigm prevalent in common parallel graph computing frameworks. The computation associated with a vertex depends only on its neighbors. In this example, we consider the PageRank computation over a graph with six vertices. Using vertex 1 for representation, we illustrate the file and PageRank update at each vertex.
File w_1 contains the state (the current PageRank Π_1^curr) and the neighborhood parameters (the probabilities of transitioning to neighbors, {P(1→1), P(1→2), P(1→5)}). The PageRank update associated with vertex 1 is a function of only the neighborhood files (specifically, of the PageRanks of neighboring vertices and the transition probabilities from neighbors to vertex 1). 12
2.2 Demonstrating the impact of our proposed coded scheme in practice. We consider a PageRank implementation over a real-world dataset in an Amazon EC2 cluster consisting of 6 servers. In this figure, we have illustrated the overall execution time as well as the times spent in different phases of execution, as a function of the computation load r (details of the implementation are provided in Section 2.7). One can observe that the Shuffle phase is the major component of the overall execution time in the conventional PageRank implementation (computation load r = 1), and our proposed coded scheme slashes the overall execution time by shortening the Shuffle phase (i.e., reducing the communication load) at the expense of increasing the Map phase (i.e., increasing the Map computations). 14
2.3 An illustrative example. 23
2.4 Performance comparison of our proposed coded scheme with the uncoded Shuffle scheme and the proposed lower bound. The averages of the communication load for the two schemes were obtained over graph realizations of the Erdös-Rényi model with n = 300, p = 0.1 and K = 5. 30
2.5 Illustrative instances of the random graph models considered in the paper. In Fig. 2.5(a), each edge exists with a given probability p. In Fig. 2.5(b), the expected degree of each vertex follows a power law distribution with exponent γ. In Fig. 2.5(c), each cross-edge exists with a given probability q. In Fig. 2.5(d), each intra-cluster edge exists with a given probability p and each cross-edge exists with a given probability q. 31
2.6 Illustration of our proposed scheme. 36
2.7 Creating coded messages by aligning the associated intermediate value segments. 40
2.8 Overall execution times for the distributed PageRank implementation for different computation loads for the three scenarios. 58
3.1 Master-worker setup of the computing clusters: The master node receives the input vector x and broadcasts it to all the worker nodes. Upon receiving the input, worker node i starts computing the inner products of the input vector with the locally assigned rows, i.e., y_i = A_i x, and unicasts the output vector y_i to the master node upon completing the computation. The results are aggregated at the master node until r inner products are received and the desired output Ax is recovered. 68
3.2 Illustration of the performance gain of HCMM over the three benchmark schemes for the exponential run-time model. Among the three scenarios, HCMM achieves a performance improvement of up to 71% over Uniform Uncoded, up to 53% over Load-balanced Uncoded, and up to 39% over Uniform Coded. Furthermore, the coding redundancy ∑_{i=1}^n ℓ_i / r for the three scenarios is in the range 1.41–1.46 for HCMM and in the range 2.3–2.8 for Uniform Coded. This demonstrates the efficient utilization of resources by HCMM. 90
3.3 Illustration of the performance gain of HCMM over the three benchmark schemes for the Weibull run-time model.
Among the three scenarios, HCMM achieves a performance improvement of up to 73% over Uniform Uncoded, up to 56% over Load-balanced Uncoded, and up to 42% over Uniform Coded. Furthermore, the coding redundancy ∑_{i=1}^n ℓ_i / r for the three scenarios is in the range 1.30–1.42 for HCMM and in the range 2.0–2.5 for Uniform Coded. This demonstrates the efficient utilization of resources by HCMM. 91
3.4 Illustration of the performance gain of HCMM over the three benchmark schemes. Among the three scenarios, HCMM achieves a performance improvement of up to 61% over Uniform Uncoded, up to 46% over Load-balanced Uncoded, and up to 36% over Uniform Coded. Furthermore, the coding redundancy ∑_{i=1}^n ℓ_i / r for the three scenarios is approximately 1.4 for HCMM and in the range 2.12–2.26 for Uniform Coded. Therefore, HCMM gives the best overall execution time among the four scenarios with minimal coding overhead. 94
3.5 Typical empirical cumulative distribution functions for two instances used in Scenario 3 of our experiments. The measurements were taken in the absence of any manual delay. As demonstrated here, the shifted exponential distribution is a good model for the task execution time on EC2 machines. 96
3.6 Total cost associated with every pair (n_1, n_2), 0 ≤ n_1, n_2 ≤ 10. 102
3.7 Expected time associated with every pair (n_1, n_2), 0 ≤ n_1, n_2 ≤ 10. 102
4.1 Illustration of RAR, GC and CR: In RAR, workers communicate only with their neighbors on a ring, which results in high bandwidth utilization; however, RAR is prone to stragglers. GC is robust to stragglers by doing redundant computations at workers; however, GC imposes a bandwidth bottleneck at the master. CR achieves the benefits of both worlds, providing high bandwidth efficiency along with straggler resiliency. 105
4.2 Average iteration time for gradient aggregation in the different schemes CR, RAR, GC and UMW: training a linear model on a cluster of N = 84 t2.micro instances. 108
4.3 Illustration of the communication strategy in RAR for N = 3 workers. 111
4.4 Illustration of the data allocation and communication strategy in GC for N = 3 workers. 112
4.5 (n, L)-regular tree topology. 114
4.6 Illustration of task allocation in CR. 116
4.7 Illustration of the data allocation and communication strategy in CR for a (3, 2)-regular tree. 117
4.8 Convergence curves for relative error rate vs. wall-clock time for logistic regression over N = 84 workers. The straggler resiliency is α = 1/4. CR achieves a speedup of up to 32.8×, 5.3×, 3.8× and 3.2× over UMW, GC, RAR and SGD, respectively. 122
4.9 Convergence results for relative error rate vs. wall-clock time for logistic regression over N = 156 workers with different straggler resiliency α. 124
4.10 Convergence curves for normalized error rate vs. wall-clock time for linear regression over N = 84 workers. The straggler resiliency is α = 1/4. CR achieves a speedup of up to 24.1×, 4.6×, 3.0× and 2.8× over UMW, GC, RAR and SGD, respectively. 125
4.11 Convergence results for normalized error rate vs. wall-clock time for linear regression over N = 156 workers with different straggler resiliency α. 126
4.12 Convergence curves for normalized error rate vs. wall-clock time for linear regression over N = 156 workers and (d, p) = (32760, 5000).
The straggler resiliency is α = 1/4 and the number of rounds is 50. CR achieves a speedup of up to 11.3×, 9.7×, 1.69× and 6.1× over UMW, GC, RAR and SGD, respectively. 127
4.13 Convergence curves for test accuracy vs. wall-clock time for neural network training over N = 156 workers. The neural network model has p ≈ 120,000 parameters. The straggler resiliency is α = 5/12 and the number of rounds is 2500. CR achieves a speedup of up to 6.6×, 4.8×, 1.8× and 4.0× over UMW, GC, RAR and SGD, respectively. 128
5.1 Illustration of the federated learning paradigm over multi-access edge computing (MEC) networks with n client devices and an MEC server. During each training round, client E_j receives the latest model from server M, computes a local gradient update over its local dataset D_j, and communicates the gradient update to the server. Training performance is critically bottlenecked by the presence of straggling nodes and communication links. 132
5.2 Overview of our proposed CodedFedL framework, illustrating the main processing steps at the MEC server and at each client. 143
5.3 Illustrating the properties of the expected aggregate return E(R_j(t; ℓ̃_j)) based on the result in Theorem 11. We assume p_j = 0.9, τ_j = √3, µ_j = 2, α_j = 20, and for Fig. 5.3b, t = 10. 154
5.4 Illustrating the results for MNIST. 158
5.5 Illustrating the results for Fashion MNIST. 160
6.1 Hierarchical distributed learning setup. For resiliency to up to s straggling links among the n_h helper links, client node i, i ∈ [n_e], encodes its local gradient to obtain coded messages c_{i,j} for the n_h helpers. The master uses the pattern of the messages received by the helpers to direct what the helpers should communicate, and recovers the full gradient from the helpers' messages. 167
6.2 Illustrating the alignment opportunities for the ARC approach in Example 6.3 with n_e = 4, n_h = 4 and s = 1. In the failure pattern tables, 1's denote successful communications and 0's denote failed communications, while the blue boxes denote the messages that the master directs the helpers to locally aggregate and upload to enable recovery of the full gradient. For example, in Fig. 6.2b, it is sufficient for helpers 1 and 2 to aggregate their received messages to obtain g_{D,1} and g_{D,2} respectively and send them to the master, and the master concatenates them to obtain the full gradient. 172
6.3 Illustrating the AMC approach for Example 6.3. The master first finds the maximum number of exactly matching rows in the failure table (highlighted by a black box). The entries in these rows are locally aggregated at the helpers and sent to the master. The entries outside the black box (highlighted in green) are simply forwarded to the master. For example, in Fig. 6.3b, all rows match; thus it is sufficient for helpers 1, 2 and 3 to aggregate their received messages to obtain g_{D,1}, g_{D,2} and g_{D,3} and send them to the master. 173
7.1 Illustration of the system components and general steps in DiverseFL for communication round i ∈ [R]. Without loss of generality, we have assumed that all clients participate in communication round i. For brevity, we use AGG(·) in the final step to jointly denote the aggregation of the potential normal clients as well as the global model update step. 188
7.2 The values of C_1 × C_2 in the 1000 training rounds are plotted in red for faulty clients and in green for normal clients. For normal clients, C_1 > 0 exclusively, and C_2 is concentrated around 1.
For faulty clients, C_1 < 0 in almost all iterations, and C_2 varies significantly. 192
7.3 Top-1 accuracy for neural network training with MNIST (first row of plots), CIFAR10 (second row of plots) and CIFAR100 (third row of plots). DiverseFL with both 1% and 3% sample sharing achieves close to OracleSGD performance under all scenarios. Even in the relatively simpler training setting with MNIST and a small neural network, prior benchmarks degrade in performance in one or more scenarios. Furthermore, the three sets of results demonstrate that increasing the complexity of the dataset and training model widens the performance gap of prior benchmarks in the non-IID setting. 194
7.4 Execution time on a client (computation + communication time) relative to the TEE's guiding update computation. 1: MNIST/3-NN, 2: CIFAR10/VGG-11, and 3: CIFAR100/VGG-11. 1% sampling is used in (a) and 3% sampling in (b). A single TEE supports many clients without stalling FL execution. 197
7.5 Performance evaluation of DiverseFL with multiple local iterations against OracleSGD. 199
6 Comparing the achievability bound for C_HM in (49) with C_HM for the naive MDS approach based on simple forwarding of the messages received at the helpers to the master without any local aggregation. 250

Abstract

Large-scale cloud computing services enable the use of big data by supporting critical computing tasks, including PageRank and the training of massive language and vision models. Additionally, decentralized machine learning (ML) paradigms, such as federated learning (FL), are gaining popularity as they allow building accurate statistical models from privacy-sensitive datasets at multiple data owners, such as health data at hospitals, without requiring the clients to share their raw data.
However, the scaling of distributed computing is severely bottlenecked by heterogeneity that comes in various forms. For example, heterogeneity arising due to non-uniformity in computation and communication resources can result in straggling nodes that can increase task latency significantly. Therefore, this thesis attempts to holistically address real-world challenges posed by heterogeneity in large-scale distributed computing, studying cloud computing as well as decentralized ML. A common aspect of the solutions proposed in this dissertation is the design and application of computation redundancy for taming heterogeneity in innovative ways. For instance, to carry out efficient distributed ML in the cloud, this thesis proposes CodedReduce, which parallelizes the communications over a tree topology leading to efficient bandwidth utilization, and carefully designs a redundant data placement strategy leading to robustness against stragglers, thus reducing the overall training time significantly. In decentralized ML, designing and exploiting computation redundancy for achieving optimal performance becomes even more challenging, due to significant privacy and resource constraints of the data owners. Therefore, another key contribution of this thesis is CodedFedL, which involves privacy-preserving generation of a parity dataset at the FL server from clients' local data, which enables efficient and robust FL in wireless multi-access edge computing networks. In particular, the gradient from the global parity dataset compensates for missing updates from straggling clients, and thereby speeds up convergence significantly. Another key highlight of this thesis is DiverseFL, the first algorithm that makes model aggregation in FL secure as well as robust to faulty clients when data across clients is heterogeneous.
DiverseFL involves a novel secure hardware solution for data redundancy at the FL server, which enables the use of a novel per-client criterion for mitigating faulty client updates during training. Overall, this dissertation presents our works targeted towards taming the beast of heterogeneity that presents a significant barrier in the scaling of critical distributed computing paradigms. To this end, this thesis leverages existing concepts and develops new ideas in information theory, coding theory, communication topology design, optimization theory, data privacy, and hardware security. The proposed solutions are grounded in strong theoretical foundations and complemented with extensive experiments with real data.

Chapter 1
Introduction

A major phenomenon of recent times has been the growing ecosystem of billions of computing devices with sensors connected through the network edge and powered by artificial intelligence. This is shaping the future of public-interest and curiosity-driven scientific discovery, as the massive amounts of data generated each day by this ecosystem have the potential to power a wide range of data analytics based applications, such as predicting vehicular accidents in real-time [63] and predicting events like a heart attack from wearables [16]. To enable usability of these large-scale datasets in real-time, it is essential to develop efficient strategies that can perform data processing at scale. In the recent past, general distributed computing frameworks, such as MapReduce [236], Spark [235] and GraphLab [132], along with the availability of large-scale commodity servers in the cloud such as Amazon EC2, have made it possible to carry out large-scale data analytics at the production level. These virtualized data centers enjoy an abundance of storage space and computing power, and are cheaper to rent by the hour than maintaining dedicated data centers round the year.
Their availability has progressively increased the ease of access to cloud computing for performing a variety of large-scale distributed tasks, including large-scale graph processing algorithms like PageRank [151, 218], and pre-training of large natural language processing (NLP) models (e.g., GPT-3 [23]). Large-scale distributed computing in the cloud, however, suffers from several bottlenecks arising due to the beast of heterogeneity, which manifests itself in various forms. Particularly, as described next, the performance of cloud computing systems is severely bottlenecked by non-uniformities in computation and communication, as well as by the irregularities inherent in the dataset involved in certain underlying tasks. Firstly, cloud computing platforms exhibit compute heterogeneity. In particular, they provide a variety of server types that can be rented out for computing tasks [149, 136, 194, 173]. The instance types have different pricing and system configurations. In addition to the cross-server heterogeneity from different configurations, the same server can exhibit stochasticity in compute power over time [39], thus exhibiting heterogeneous behavior across time. The disparities in compute power may result in some of the servers being stragglers, i.e., being significantly slower than others. Therefore, it is quite essential to take into account the deterministic as well as the stochastic aspects of compute heterogeneity while allocating the resources for completing the underlying large-scale computing tasks. The resource allocation becomes even more challenging under monetary constraints. Secondly, bandwidth heterogeneity is quite common in commodity clusters. Different servers can have different available bandwidths for communicating messages, and the bandwidth can also fluctuate over time [149, 136, 194, 173] due to shared underlying bandwidth across servers.
Additionally, with the rise of multi-access edge computing (MEC) platforms [5, 72, 91, 146, 185], the traditional cloud computing capabilities are progressively being brought closer to the network edge [234], where connections between servers involve wireless channels. Dynamically changing wireless connectivity can lead to packet loss and significant variations in the communication bandwidth, making resiliency of long-running tasks, such as machine learning, a critical problem. Thirdly, heterogeneity in datasets associated with the underlying task can present significant challenges by translating into skewness in computation and communication across servers. For example, the vertex degrees in real-world graphs can be significantly skewed [70], i.e., the graphs can have extremely irregular degree distributions. Typical graph processing tasks, such as PageRank [151], require communicating a large number of messages across servers, and the communications involved can be skewed depending on the densities of the allocated sub-graphs [133]. Therefore, an integrated task and resource allocation is essential for optimal performance to mitigate the challenge of dataset heterogeneity. To overcome the aforementioned challenges, we consider the daunting task of taming heterogeneity in cloud computing in the first part of this dissertation. Particularly, we study three critical distributed computing tasks in the cloud – communication efficient graph processing (in Chapter 2), low-latency matrix multiplication (in Chapter 3), and fast and robust machine learning (in Chapter 4). We ground our contributions in solid theoretical foundations, which we support by strong real-world experiments over AWS. In Section 1.1, we provide an overview of the three chapters, summarizing the main aspects of the underlying challenges and our proposed solutions.
In the second part of this dissertation, we shift our focus to machine learning (ML) in decentralized settings with multiple data owners, which has attracted much recent interest. In such settings, there are significant privacy concerns and regulations [62] associated with the data of the users, and centralizing users' datasets for a cloud computing based training procedure is not feasible. For example, medical researchers might be keenly interested in leveraging the MRI data pertaining to patient records in multiple hospitals for training an accurate model for a certain disease [85]. However, they cannot pool the datasets for centralized training due to privacy concerns linked with private health data. For enabling distributed privacy-preserving machine learning from private datasets of clients, decentralized training methods have gained a surge of recent interest: when the implementation is orchestrated via a central server, the problem is commonly referred to as federated learning (FL) [126, 141, 175, 222]. In an FL architecture, the task of training is federated among the clients themselves. Each participating client locally trains a local model based on its own (private) training dataset and shares only the trained local model with the central entity, which appropriately aggregates the clients' local models. Since the training task is distributed to individual participating clients, the raw data of clients need not be shared with the central entity or other clients, thereby yielding a layer of privacy protection for clients' local datasets. Furthermore, all participating clients can still mutually benefit from the data and models of others. The beast of heterogeneity leads to daunting problems in decentralized learning as well by manifesting itself in several forms that are arguably more challenging to handle than in cloud computing.
For example, federated learning in MEC networks can suffer from severe communication bottleneck, as massive sized model updates must be moved between the FL server (master) and the clients, which can be significantly heterogeneous and bandwidth-constrained. Additionally, clients may not have dedicated resources for carrying out the training procedure, and hence can exhibit high variability in resources available for training task execution. Consequently, the straggling computations and communication links can degrade convergence performance immensely [148, 50]. Furthermore, due to the non-centralization of clients' local datasets, the data distribution across clients for any training task is typically heterogeneous [240]. Also, the interplay between data privacy requirements and heterogeneity of datasets brings in new challenges, thus requiring novel tools [204]. We study these critical bottlenecks in decentralized training in Chapters 5, 6 and 7 of this dissertation. The key aspects are summarized in Section 1.2.

1.1 Taming Heterogeneity in Cloud Computing

As described previously, there are several forms of heterogeneity that can result in multiple bottlenecks causing major degradation of performance in large-scale distributed computing clusters: system failures, bottlenecks due to limited communication bandwidth, latency due to straggler nodes, etc. Prior state-of-the-art approaches to mitigate the impact of this system noise in cloud computing environments involved creation of some form of computation redundancy. For example, replicating the straggling task on another available node is a
We use a coding theoretical lens in this dissertation, taking the viewpoint that coding theory can play a transformational role for creating and exploiting computation redundancy to effectively alleviate the impact of system noise. The benefits of coding theory in injecting structured redundancy to provide optimal efficiency and resiliency in communication and storage systems are well-established. Our results [206, 205, 173, 169, 171, 170, 172] obtained for mitigating several bottlenecks in cloud computing demonstrate that coding theoretic ideas can significantly improve resiliency and efficiency in cloud computing as well. In Chapter 2, we present our results for communication efficient graph processing [206, 205]. Performance of large-scale graph processing significantly suffers from communication bottleneck as a large number of messages are exchanged among servers at each step of the computation. Motivated by graph based MapReduce, we propose a coded computing frame- work that leverages computation redundancy to alleviate the communication bottleneck in distributed graph processing. As a key contribution, we develop a novel coding scheme that systematically injects structured redundancy in the computation phase to enable coded multicasting opportunities during message exchange between servers, reducing the commu- nication load substantially. For theoretical analysis, we consider random graph models, and focus on schemes in which subgraph allocation and Reduce allocation are only depen- dent on vertex ID while the Shuffle design varies with graph connectivity. Specifically, we prove that our proposed scheme enables an (asymptotically) inverse-linear trade-off between computation load and average communication load for two popular random graph models – Erdös-Rényi model, and power law model. Furthermore, for the Erdös-Rényi model, we prove that our proposed scheme is optimal asymptotically as the graph size increases by providing an information-theoretic converse. 
We implement PageRank over Amazon EC2, using artificial as well as real-world datasets, demonstrating gains of up to 50.8% in comparison to the conventional PageRank implementation. Furthermore, we specialize our coded scheme and extend our theoretical results to two other random graph models – the random bi-partite model and the stochastic block model. Our specialized schemes asymptotically enable inverse-linear trade-offs between computation and communication loads in distributed graph processing for these popular random graph models as well. We complement the achievability results with converse bounds for both models. Next, in Chapter 3, we present our results for optimal and cost efficient load allocation for large-scale matrix multiplication [173, 169]. We focus on general heterogeneous distributed computing clusters consisting of a variety of computing machines with different capabilities. We propose a coding framework for speeding up distributed computing in heterogeneous clusters by trading redundancy for reducing the latency of computation. In particular, we propose the Heterogeneous Coded Matrix Multiplication (HCMM) algorithm for performing distributed matrix multiplication over heterogeneous clusters that is provably asymptotically optimal for a broad class of processing time distributions. We also prove that HCMM is unboundedly faster than any uncoded scheme partitioning the total work among the workers. To demonstrate how the proposed HCMM scheme can be applied in practice, we provide results from numerical studies and Amazon EC2 experiments that demonstrate speedups of up to 73% over prior benchmarks. We also carry out experiments demonstrating how HCMM can be combined with rateless codes with nearly linear decoding complexity. In particular, we show that HCMM combined with Luby transform (LT) codes can significantly reduce the overall execution time.
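The redundancy-versus-latency trade-off behind this approach can be illustrated with a small simulation. This is a hedged sketch, not HCMM itself: the worker rates, the shifted-exponential delay model, and the equal-load (4, 3) MDS-style allocation below are assumptions made purely for illustration, whereas HCMM's actual allocation is heterogeneous and optimized:

```python
import random

random.seed(42)

# Heterogeneous per-row processing rates (rows/sec); worker 4 is a straggler.
rates = [10.0, 8.0, 6.0, 1.0]
n, trials = 600, 2000

def runtime(load, rate):
    # Shifted-exponential compute-delay model (an assumption for illustration):
    # a deterministic part proportional to the load, plus a random tail.
    return load / rate + random.expovariate(rate / load)

def uncoded():
    # Equal split of the n rows; must wait for every worker, straggler included.
    return max(runtime(n / 4, r) for r in rates)

def coded(k=3):
    # (4, k) MDS-style redundancy: each worker processes n/k coded rows, and
    # the k fastest responses suffice to recover the full matrix product.
    times = sorted(runtime(n / k, r) for r in rates)
    return times[k - 1]

avg_uncoded = sum(uncoded() for _ in range(trials)) / trials
avg_coded = sum(coded() for _ in range(trials)) / trials
print(f"avg latency  uncoded: {avg_uncoded:.1f}s  coded: {avg_coded:.1f}s")
assert avg_coded < avg_uncoded
```

Even this naive equal-load coding sidesteps the slowest machine; HCMM goes further by tailoring each worker's load to its (possibly stochastic) speed.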
Additionally, we provide a generalization to the problem of optimal load allocation in heterogeneous settings, where we take into account the monetary costs associated with distributed computing clusters. We argue that HCMM is asymptotically optimal for budget-constrained scenarios as well. In particular, we characterize the minimum possible expected cost associated with a computation task over a given cluster of machines. Furthermore, we develop a heuristic algorithm for HCMM load allocation for the distributed implementation of budget-limited computation tasks. Finally, in Chapter 4, we present our results for the commonly used synchronous Gradient Descent paradigm for large-scale distributed learning [172, 170, 171]. In large-scale training tasks, such as pre-training of NLP models with billions of parameters (e.g., GPT-3 [23]), straggling nodes adversely impact the performance by increasing the tail latency [39]. A simple way herein is to ignore the computations carried out at the straggling nodes. However, in many industry settings, ignoring straggling tasks is not favored as it reduces the model quality. Additionally, large models cause communication bottleneck. Therefore, there has been a growing interest to develop efficient and robust gradient aggregation strategies that overcome two key system bottlenecks: communication bandwidth and stragglers' delays. In particular, the Ring-AllReduce (RAR) design has been proposed to avoid bandwidth bottleneck at any particular node by allowing each worker to only communicate with its neighbors that are arranged in a logical ring. However, RAR is not robust to stragglers. On the other hand, Gradient Coding (GC) has been proposed to mitigate stragglers in a master-worker topology by allowing carefully designed redundant data allocation, but GC incurs significant communication bottleneck at the master.
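The straggler-robustness idea behind gradient coding can be seen in the canonical 3-worker, 1-straggler construction from the gradient coding literature. This is a textbook example, not the tree-based scheme of Chapter 4, and the scalar partial gradients below are toy values:

```python
# Canonical gradient-coding example (3 workers, 3 data partitions,
# computation load r = 2, tolerating s = 1 straggler). Each worker holds
# two partitions and sends one carefully coded combination.
g1, g2, g3 = 4.0, -1.0, 2.5        # toy scalar partial gradients

f1 = 0.5 * g1 + g2                  # worker 1 (holds partitions 1, 2)
f2 = g2 - g3                        # worker 2 (holds partitions 2, 3)
f3 = 0.5 * g1 + g3                  # worker 3 (holds partitions 1, 3)

total = g1 + g2 + g3
# The master recovers the full gradient from ANY two workers' messages:
assert abs((f1 + f3) - total) < 1e-9        # worker 2 straggles
assert abs((2 * f1 - f2) - total) < 1e-9    # worker 3 straggles
assert abs((f2 + 2 * f3) - total) < 1e-9    # worker 1 straggles
print("full gradient recovered from every 2-subset of workers")
```

The price of this resiliency is that every worker computes two partial gradients instead of one, and all coded messages still converge on a single master, which is exactly the communication bottleneck that CodedReduce's tree topology addresses.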
To overcome the challenges of GC and RAR while combining their advantages, we propose a joint communication topology design and data set allocation strategy, named CodedReduce (CR) [172]. CR parallelizes the communications over a tree topology leading to efficient bandwidth utilization, and carefully designs a redundant data set allocation and coding strategy at the nodes to make the proposed gradient aggregation scheme robust to stragglers. We quantify the communication parallelization gain and resiliency of the proposed CR scheme, and prove its optimality when the communication topology is a regular tree. Via experiments over AWS, we also demonstrate speedups of up to 27.2× for CR over the prior benchmarks.

1.2 Taming Heterogeneity in Decentralized Learning

As described earlier, ML applications can achieve significant performance gains by training on large volumes of data. In many applications, however, the training data is distributed across multiple data-owners, such as patient records at multiple medical institutions. Additionally, such training data often contains sensitive information (e.g., genetic information, financial transactions, and geolocation information), and with increasing privacy concerns and regulations [62], it becomes extremely difficult to pool user datasets for centralized training. Thus, enabling decentralized privacy-preserving machine learning at the edge has gained much interest recently, as it seeks to learn powerful statistical models from the privately owned and locally generated rich data at the billions of devices in modern day edge networks, while also tackling the risks and responsibilities that are involved in training. The different forms of heterogeneity create several bottlenecks in decentralized machine learning from private data.
In the second part of the dissertation, we focus on addressing these challenges in federated learning, a recent framework that enables training a global model from data located at the client nodes, without sharing of raw data with a centralized server. In Chapter 5 (the contents of this chapter and the associated references [6, 46, 201, 202, 203] are an outcome of the work done at Intel), we focus on multi-access edge computing (MEC) networks, where the performance of federated learning suffers from slow convergence due to heterogeneity and stochastic fluctuations in compute power and communication link qualities across clients. We present our novel coded computing framework, CodedFedL, that injects structured coding redundancy into federated learning for mitigating stragglers and speeding up the training procedure [6, 46, 201, 202, 203]. CodedFedL enables coded computing for non-linear federated learning by efficiently exploiting distributed kernel embedding via random Fourier features, which transforms the training task into a computationally favourable distributed linear regression. Furthermore, clients generate local parity datasets by coding over their local datasets, while the server combines them to obtain the global parity dataset. The gradient from the global parity dataset compensates for straggling gradients during training, and thereby speeds up convergence. For minimizing the epoch deadline time at the MEC server, we provide a tractable approach for finding the amount of coding redundancy and the number of local data points that a client processes during training, by exploiting the statistical properties of compute as well as communication delays. Using differential privacy theory, we characterize the leakage in data privacy in CodedFedL when clients share their local parity datasets with the server. Additionally, we analyze the convergence rate and the iteration complexity of CodedFedL by treating CodedFedL as a stochastic gradient descent algorithm.
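The random Fourier feature embedding mentioned above can be sketched generically. The code below is the standard Rahimi–Recht construction for approximating an RBF kernel, shown only for illustration; the exact embedding and its distributed use in CodedFedL are detailed in Chapter 5, and the dimensions and bandwidth below are arbitrary:

```python
import math
import random

random.seed(1)

D, d, gamma = 4000, 3, 0.5   # feature dimension, input dimension, RBF bandwidth

# Random Fourier features: z(x)_i = sqrt(2/D) * cos(w_i . x + b_i), with
# w_i ~ N(0, 2*gamma*I) and b_i ~ Uniform[0, 2*pi]. Then the inner product
# z(x) . z(y) approximates the RBF kernel exp(-gamma * ||x - y||^2), so
# kernel methods become linear models in the feature space.
W = [[random.gauss(0.0, math.sqrt(2 * gamma)) for _ in range(d)] for _ in range(D)]
b = [random.uniform(0.0, 2 * math.pi) for _ in range(D)]

def z(x):
    return [math.sqrt(2.0 / D) * math.cos(sum(wi[j] * x[j] for j in range(d)) + bi)
            for wi, bi in zip(W, b)]

x, y = [0.2, -0.4, 0.1], [0.5, 0.0, -0.3]
approx = sum(zx * zy for zx, zy in zip(z(x), z(y)))
exact = math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(x, y)))
print(f"RFF kernel estimate {approx:.3f} vs exact RBF kernel {exact:.3f}")
assert abs(approx - exact) < 0.1
```

Once every raw data point is replaced by its (fixed-dimensional) feature vector, training reduces to linear regression over the embedded data, which is the structure that makes linear coding over local datasets possible.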
Finally, for demonstrating the gains that CodedFedL can achieve in practice, we conduct numerical experiments using practical network parameters and benchmark datasets, in which CodedFedL speeds up the overall training time by up to 15× in comparison to the benchmark schemes. In Chapter 6, we present our hierarchical model aggregation approach for straggler resilient decentralized training at the edge [207]. Particularly, to guarantee robustness against straggling communication links, we consider a hierarchical setup with n_e clients and n_h reliable helper nodes that are available to aid in gradient aggregation at the master. To achieve resiliency against straggling client-to-helpers links, we propose two approaches leveraging coded redundancy. First is the Aligned Repetition Coding (ARC) that repeats gradient components on the helper links, allowing significant partial aggregations at the helpers, resulting in a helpers-to-master communication load (C_HM) of O(n_h). ARC however results in a client-to-helpers communication load (C_EH) of Θ(n_h), which may be prohibitive for client nodes due to limited and costly bandwidth. We thus propose Aligned Minimum Distance Separable Coding (AMC) that achieves the optimal C_EH of Θ(1) for a given resiliency threshold by using an MDS code over the gradient components, while achieving a C_HM of O(n_e). Finally, we consider the joint problem of security and fault tolerance in federated learning in Chapter 7. Although FL has emerged as a promising paradigm for training a global model without centralizing clients' raw data, the sharing of a client's local model update with the FL
However, secure aggregation can suffer from performance degra- dation due to error-prone local updates sent by clients that become faulty during training due to underlying software and hardware failures [75, 105]. Moreover, data heterogeneity across clients makes fault mitigation quite a challenging task in itself. This is because even the updates from the normal clients are quite dissimilar. Thus, prior fault tolerant methods, most of which treat any local update differing from the majority of other updates as faulty, perform poorly. We propose DiverseFL [204], the first algorithm that makes model aggregation secure as well as robust to faulty client updates when data is heterogeneous. In DiverseFL, any client whose local model update diverges from its associated guiding update is tagged as being faulty. To implement our novel per-client criteria for fault mitigation, DiverseFL creates a TEE-based secure enclave within the FL server which in addition to performing secure aggregation for global model update, receives a small representative sample of local data from each client only once before training, and computes guiding updates for each participating client during training. Therefore, DiverseFL provides both security against privacy leakage as well as robustness against faulty clients. In our experiments, DiverseFL consistently achieves significant improvements (up to 39%) in absolute test accuracy over prior fault mitigation benchmarks, while performing closely to OracleSGD, where the server only aggregates the updates from the normal clients. We also analyze the convergence rate of DiverseFL with non-IID data, under strong convexity of local loss. 10 Chapter 2 Coded Computing for Distributed Graph Analytics 2.1 Introduction Graphs are widely used to identify and incorporate the relationship patterns and anomalies inherent in real-life datasets. 
Their growing scale and importance have prompted the development of various large-scale distributed graph processing frameworks, such as Pregel [137], PowerGraph [70] and GraphLab [132]. The underlying theme in these systems is the think like a vertex approach [139], where the computation at each vertex requires only the data available in the neighborhood of the vertex (see Fig. 2.1 for an illustrative example). This approach significantly improves performance in comparison to general-purpose distributed data processing systems (e.g., Dryad [79]), which do not leverage the underlying structure of graphs.

Figure 2.1: Illustrating the think like a vertex paradigm prevalent in common parallel graph computing frameworks. The computation associated with a vertex only depends on its neighbors. In this example, we consider the PageRank computation over a graph with six vertices. Using vertex 1 for representation, we illustrate the file and PageRank update at each vertex. File w_1 contains the state (current PageRank Π_1^curr) and the neighborhood parameters (probabilities of transitioning to neighbors {P(1→1), P(1→2), P(1→5)}). The PageRank update associated with vertex 1 is a function of only the neighborhood files (specifically, of the PageRanks of neighboring vertices and the transition probabilities from neighbors to vertex 1).

In these distributed graph processing systems, different subgraphs are stored at different servers, where a subgraph refers to the set of files associated with a subset of graph vertices. As a result of the distributed subgraph allocation, for carrying out the graph computation for a given vertex at a particular server, the intermediate values corresponding to the neighboring vertices whose files are not available at the server have to be communicated from other servers. These distributed graph processing systems, therefore, require many messages to be exchanged among servers during job execution. This results in a communication bottleneck
in parallel computations over graphs [133], accounting for more than 50% of the overall execution time in representative cases [27]. To alleviate the communication bottleneck in distributed graph processing, we develop a new framework that leverages computation redundancy by computing the intermediate values at multiple servers via redundant subgraph allocation. The redundancy in computation of intermediate values at multiple servers allows coded multicasting opportunities during exchange of messages between servers, thus reducing the communication load. Our proposed framework comprises a mathematical model for MapReduce decomposition [41] of the graph computation task. The Map computation for a vertex corresponds to computing the intermediate values for the vertices in its neighborhood, while the Reduce computation for a vertex corresponds to combining the intermediate values from the neighboring vertices to obtain the final result of graph computation. Referring to the example in Fig. 2.1, the Map and Reduce computations associated with vertex 1 are as follows:

Map: Π_1^curr → {v_{1,1}, v_{2,1}, v_{5,1}},    Reduce: Π_1^new = v_{1,1} + v_{1,2} + v_{1,5},

where v_{j,i} = P(i→j) Π_i^curr is the intermediate value obtained from the Map computation of vertex i ∈ N(j).
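A minimal sketch of this per-vertex Map/Reduce decomposition for one PageRank iteration follows. The 4-vertex directed graph, the uniform transition probabilities, and the omission of PageRank damping are illustrative assumptions (Fig. 2.1 itself uses a different 6-vertex graph):

```python
# One MapReduce-style PageRank iteration in the "think like a vertex" style.
# Hypothetical 4-vertex directed graph; transition probabilities are uniform
# over out-neighbors, and the damping factor is omitted for brevity.
out_nbrs = {1: [2, 3], 2: [3], 3: [1, 4], 4: [1]}
pi = {v: 0.25 for v in out_nbrs}           # current PageRanks Pi_i^curr

def map_vertex(i):
    # Map for vertex i: emit one intermediate value v_{j,i} = P(i->j) * Pi_i
    # for every out-neighbor j of i.
    p = 1.0 / len(out_nbrs[i])
    return {(j, i): p * pi[i] for j in out_nbrs[i]}

def reduce_vertex(j, intermediates):
    # Reduce for vertex j: sum the values v_{j,i} arriving from in-neighbors.
    return sum(val for (dst, _src), val in intermediates.items() if dst == j)

shuffled = {}                               # in practice, the Shuffle phase
for i in out_nbrs:
    shuffled.update(map_vertex(i))

new_pi = {j: reduce_vertex(j, shuffled) for j in out_nbrs}
assert abs(sum(new_pi.values()) - 1.0) < 1e-9   # probability mass conserved
print(new_pi)
```

The `shuffled` dictionary stands in for the Shuffle phase: in a distributed deployment, exactly these intermediate values must cross server boundaries, which is the traffic that the coded scheme targets.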
Finally, each server carries out the Reduce computations allocated to it to obtain the final results, using the intermediate values obtained locally during the Map phase and the missing intermediate values obtained from other servers during the Shuffle phase. Using our mathematical model of graph based MapReduce, our framework proposes to trade redundant computations in the Map phase with communication load during the Shuffle phase. The key idea is to leverage the graph structure and create coded messages during the Shuffle phase that simultaneously satisfy the data demand of multiple computing servers in the Reduce phase. Our work is rooted in the recent development of a coding framework that establishes an inverse-linear trade-off between computation and communication for general MapReduce computations – Coded Distributed Computing (CDC) [123]. In the MapReduce formulation considered in [123], there are n input files and the goal is to compute Q output functions, where each of the Q output functions depends on all of the n input files. In CDC, each Map computation is carefully repeated at r servers. The injected redundancy provides coded multicast opportunities in the Shuffle phase where servers exchange coded messages that are simultaneously useful for multiple servers. Each server then decodes the received messages and executes the Reduce computations assigned to it. Compared to uncoded Shuffle, where the required intermediate values are transmitted without leveraging coded multicast, CDC slashes the communication load by r. However, in contrast to graph based MapReduce considered in our framework, CDC does not incorporate the heterogeneity in the 13 file requirements by the Reducers, as each Reducer in CDC is assumed to need intermediate values corresponding to all input files. Figure 2.2: Demonstrating the impact of our proposed coded scheme in practice. 
We consider PageRank implementation over a real-world dataset in an Amazon EC2 cluster consisting of 6 servers. In this figure, we have illustrated the overall execution time as well as the times spent in different phases of execution, as a function of computation load r (details of implementation are provided in Section 2.7). One can observe that the Shuffle phase is the major component of the overall execution time in conventional PageRank implementation (computation load r = 1), and our proposed coded scheme slashes the overall execution time by shortening the Shuffle phase (i.e., reducing the communication load) at the expense of increasing the Map phase (i.e. increasing the Map computations). Moving from the MapReduce framework in [123] to graph based MapReduce, the key challenge is that the computation associated with each vertex highly depends on the graph structure. In particular, graph computation at each vertex requires data only from the neighboring vertices, while in the MapReduce framework in [123], each output computation needs all the input files (which in graph based MapReduce shall correspond to a complete graph). This asymmetry in the data requirements of the graph computations is the main challenge in developing efficient subgraph and Reduce computation allocations and Shuffling schemesforgraphbasedMapReduce. Asakeycomponentofourproposedcodingframework, we propose a coded scheme that creates coding opportunities for communicating messages across servers by Mapping the same graph vertex at different servers, so that each coded transmission satisfies the data demand of multiple servers. Within each multicast group, each server communicates a coded message which is generated using careful alignment of 14 the intermediate values that the server needs to communicate to all the remaining members of the multicast group. 
Each server retrieves the missing intermediate values required for its Reduce computations using the locally available intermediate values from the Map phase and the coded messages received during the Shuffle phase. To characterize the performance of our proposed coding framework for distributed graph analytics, we focus on random undirected graph models. In popular graph processing frameworks such as Pregel [137], the graph partitioning for distributed processing among a set of servers is solely based on vertex ID, such as using hash(ID) mod K, where K is the number of servers. Therefore, in our problem formulation, for a given computation load r and a random graph $\mathcal{G} = (V, E)$, we focus on subgraph and Reduce computation allocations $\mathcal{A}(r)$ that are based only on vertex IDs and not on graph connectivity. Here, $V$ and $E$ respectively denote the vertex set and edge set of $\mathcal{G}$. Although the Map and Reduce allocations are functions solely of the vertex IDs, the Shuffle design needs to incorporate the connectivity of the graph realizations so that the communication load is minimized. This motivates us to consider the characterization of the minimum average normalized communication load $L^*(r)$, which is defined as follows: $L^*(r) := \inf_{A \in \mathcal{A}(r)} \mathbb{E}_{\mathcal{G}}[L_A(r, G)]$, where $L_A(r, G)$ denotes the minimum normalized communication load for a realization $G$ of $\mathcal{G}$ for a given subgraph and Reduce allocation tuple $A \in \mathcal{A}(r)$. The normalization is with respect to the total size of all the intermediate values corresponding to a fully connected graph with the same number of vertices. Further details are deferred to Section 2.2, where we describe our problem formulation in detail. For two popular random graph models, the Erdös-Rényi model and the power law model, we prove that our proposed coded scheme asymptotically achieves an inverse-linear trade-off between the computation load in the Map phase and the average normalized communication load in the Shuffle phase.
Furthermore, for the Erdös-Rényi model, we develop an information-theoretic converse for the average communication load given a computation load of r. Using the asymptotic achievability result, we prove that the converse for the Erdös-Rényi model is asymptotically tight, thus proving the asymptotic optimality of our proposed coded scheme. Specifically, for a given computation load r, we show that the minimum average normalized communication load is as follows: $L^*(r) \approx \frac{1}{r}\, p \left(1 - \frac{r}{K}\right)$, where p is the edge probability in the Erdös-Rényi model of size n, and K denotes the number of servers. To illustrate the benefits of our proposed coded scheme in practice, we demonstrate via simulation results that even for the Erdös-Rényi model with finite n, our proposed coded scheme achieves an average communication load which is within a small gap from the information-theoretic lower bound. Furthermore, it provides a gain of (almost) r in comparison to the baseline scheme with uncoded Shuffling. Additionally, we implement the PageRank algorithm over Amazon EC2 servers using artificial as well as real-world graphs, demonstrating how our proposed coded scheme can be applied in practice. Fig. 2.2 illustrates the results of our experiments over the conventional PageRank approach (r = 1) for a social network webgraph, the Marker Cafe dataset [67]. As demonstrated in Fig. 2.2, our proposed coded scheme achieves a speedup of up to 43.4% over the conventional PageRank implementation and a speedup of 25.5% over the single server implementation. The details of the implementation are provided in Section 2.7. We also specialize our coded scheme and extend the achievability results to two additional random graph models, the random bi-partite model and the stochastic block model. Specifically, we leverage the community structure in these models to adapt our proposed scheme to these models.
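The trade-off above can be made concrete with a short numerical sketch. The two formulas follow the asymptotic characterization $L^*(r) \approx \frac{1}{r} p (1 - \frac{r}{K})$ and the uncoded baseline $p(1 - \frac{r}{K})$ stated in this chapter; the function names and the parameter values (p = 0.1, K = 5, mirroring the simulation setting of Fig. 2.4) are our own, chosen for illustration only.

```python
# Sketch of the asymptotic trade-off for ER(n, p): uncoded Shuffle needs
# about p(1 - r/K), while the proposed coded Shuffle needs about
# (p/r)(1 - r/K), an r-fold reduction. Parameter values are illustrative.

def uncoded_load(p: float, r: int, K: int) -> float:
    """Asymptotic average normalized communication load of uncoded Shuffle."""
    return p * (1 - r / K)

def coded_load(p: float, r: int, K: int) -> float:
    """Asymptotic average normalized communication load of coded Shuffle."""
    return (p / r) * (1 - r / K)

# Example: p = 0.1, K = 5 (the setting used in the simulations of Fig. 2.4).
table = {r: (uncoded_load(0.1, r, 5), coded_load(0.1, r, 5)) for r in range(1, 6)}
# At each r, the coded load is the uncoded load divided by r; at r = K both are 0.
```

Note that increasing r shrinks the Shuffle load but enlarges the Map phase, which is exactly the computation-communication trade-off exercised in the EC2 experiments.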
In the random bi-partite model, we observe that there are no intra-cluster edges, due to which the intermediate values for a particular Reducer in one cluster only come from Mappers in the other cluster. Therefore, we specialize our proposed coded scheme from Section 2.4 for the random bi-partite model, partitioning the available servers in proportion to the cluster sizes, so that there is maximum overlap between Reducers corresponding to vertices in one cluster and Mappers corresponding to vertices in the other cluster. Similarly, for the stochastic block model, we specialize our proposed coded scheme based on the observation that Reducers corresponding to vertices in one cluster depend on the Mappers corresponding to the vertices within the cluster with one probability (due to intra-cluster edges), and on the vertices in the other cluster with another probability (due to cross-cluster edges). For both the random bi-partite model and the stochastic block model, we provide converse bounds. For the random bi-partite model, we remove vertices (and the edges corresponding to them) from the larger cluster so that the reduced graph has two clusters of equal sizes. The reduced graph model thus has two sets of Mappers and Reducers, which correspond to two different Erdös-Rényi models. Applying our converse bound for the Erdös-Rényi model, we arrive at the converse for the random bi-partite model. For the stochastic block model converse, the key idea is to randomly remove edges from the graph such that a larger Erdös-Rényi graph is obtained, then utilize a coupling argument, and finally use our information-theoretic converse bound for the Erdös-Rényi model. Therefore, the modified coded schemes for these models demonstrate that inverse-linear trade-offs between computation and communication loads in distributed graph processing exist for these graph models as well. Related Work.
A number of coding theoretic strategies have been recently proposed to mitigate the bottlenecks in large scale distributed computing [109, 123]. Several generalizations of the Coded Distributed Computing (CDC) technique proposed in [123] have been developed. The authors in [125] extend CDC to wireless scenarios. The work in [119] extends CDC to multistage dataflows. An alternative trade-off between communication and distributed computation has been explored in [60] for the MapReduce framework under predetermined storage constraints. Coding using resolvable designs has been proposed in [100]. [94] extends CDC to heterogeneous computing environments. The work in [122] proposes a coding scheme for reducing the communication load for computations associated with linear aggregation of intermediate results in the final Reduce stage. The key difference between our framework and each of these works is that general MapReduce computations over graphs have heterogeneity in the data requirements for the Reduce functions associated with the vertices. Other notable works that deal with the communication bottleneck in distributed computation include [14, 13, 31], where the authors propose techniques to reduce the communication load in data shuffling in distributed learning. Apart from the communication bottleneck, various coding theoretic works have been proposed to tackle the straggler bottleneck [112, 109, 55, 167, 199, 231, 113, 116, 223, 215, 138, 214, 25, 226, 120, 224, 89, 177, 54, 135, 65, 92, 186]. Stragglers are slow processors that have significantly larger delay for completing their computational task, thus slowing down the overall job execution in distributed computation. The first paper in this line of research proposed erasure correcting codes for straggler mitigation in linear computation [109].
The work in [112] explores the potential of the multicore nature of computing servers, while [167] extends straggler mitigation for the matrix-vector problem to wireless scenarios. Redundant short dot products for the multiplication of a matrix with a long vector have been proposed in [55]. The authors in [167] propose the Heterogeneous Coded Matrix Multiplication (HCMM) scheme for matrix-vector multiplication in heterogeneous scenarios. In [199], the authors propose gradient coding schemes for straggler mitigation in distributed batch gradient descent. The works in [231] and [113] develop coding schemes for computing high-dimensional matrix-matrix multiplication. A Coupon Collector based straggler mitigation scheme for batched gradient descent has been proposed in [116]. Other notable schemes include Substitute decoding for coded iterative computing [223], coding for sparse matrix multiplications [215, 138, 214], approximate gradient coding [25], efficient gradient computation tackling both the straggler and communication bottlenecks [226], a unified coding scheme for distributed matrix multiplication [120], and logistic regression with unreliable components [224], among others. Notation. We denote by $[n]$ the set $\{1, 2, \ldots, n\}$ for $n \in \mathbb{N}$. For non-negative functions $f$ and $g$ of $n$, we denote $f = \Theta(g)$ if there are positive constants $c_1$, $c_2$ and $n_0 \in \mathbb{N}$ such that $c_1 \leq f(n)/g(n) \leq c_2$ for every $n \geq n_0$, and $f = o(g)$ if $f(n)/g(n)$ converges to 0 as $n$ goes to infinity. We define $f = \omega(g)$ if, for any positive constant $c$, there exists a constant $n_0 \in \mathbb{N}$ such that $f(n) > c \cdot g(n)$ for every $n \geq n_0$. To ease the notation, we let $2 \times \mathrm{Bern}(p)$ denote a random variable that takes the value 2 with probability $p$ and 0 otherwise. 2.2 Problem Setting We now describe the setting and formulate our distributed graph analytics problem. In particular, we specify our computation model, our distributed implementation model, and our problem formulation based on random graphs.
2.2.1 Computation Model We consider an undirected graph $\mathcal{G} = (V, E)$, where $V = [n]$ and $E = \{(i,j) : i, j \in V\}$ denote the set of graph vertices and the set of edges, respectively. A binary file $w_i \in \mathbb{F}_{2^F}$ of size $F \in \mathbb{N}$, containing vertex state and neighborhood parameters, is associated with each graph vertex $i \in V$. We denote by $W = \{w_i : i \in V\}$ the set of files associated with all vertices in the graph. The neighborhood of vertex $i$ is denoted by $N(i) = \{j \in V : (j,i) \in E\}$ and the set of files in the neighborhood of $i$ is represented by $W_{N(i)} = \{w_j : j \in N(i)\}$. In general, $\mathcal{G}$ can have self-loops, i.e., vertex $i$ can be contained in $N(i)$. Furthermore, a graph computation is associated with each vertex $i \in V$ as follows: $\phi_i : \mathbb{F}_{2^F}^{|N(i)|} \to \mathbb{F}_{2^B}$, where $\phi_i(\cdot)$ is a function that maps the input files in $W_{N(i)}$ to a length-$B$ binary stream $o_i = \phi_i(W_{N(i)})$. The computation $\phi_i(\cdot)$ can be represented as a MapReduce computation: $\phi_i(W_{N(i)}) = h_i(\{g_{i,j}(w_j) : w_j \in W_{N(i)}\})$, (2.1) where the Map function $g_{i,j} : \mathbb{F}_{2^F} \to \mathbb{F}_{2^T}$ Maps file $w_j$ to a length-$T$ binary intermediate value $v_{i,j} = g_{i,j}(w_j)$, $\forall i \in N(j)$. The Reduce function $h_i : \mathbb{F}_{2^T}^{|N(i)|} \to \mathbb{F}_{2^B}$ Reduces the intermediate values associated with the output function $\phi_i(\cdot)$ into the final output value $o_i = h_i(\{v_{i,j} : j \in N(i)\})$. We illustrate our computation model using the graph presented in the previous section. Fig. 2.3(a) illustrates the graph with n = 6 vertices, where each vertex is associated with a file, while Fig. 2.3(b) illustrates the corresponding MapReduce computations. Common graph based algorithms can be expressed in the MapReduce computation framework described above [130]. For brevity, we present two popular graph algorithms and describe how they can be expressed in the proposed MapReduce computation framework. Example. PageRank [151, 218] is a popular algorithm to measure the importance of the vertices in a webgraph based on the underlying hyperlink structure.
In particular, the algorithm computes the likelihood that a random surfer would visit a page. Mathematically, the rank of a vertex $i$ satisfies the following relation: $\Pi(i) = (1-d) \sum_{j \in N(i)} \Pi(j) P(j \to i) + d \frac{1}{|V|}$, where $(1-d)$ is referred to as the damping factor, $\Pi(i)$ denotes the likelihood that the random surfer will arrive at vertex $i$, $|V|$ is the total number of vertices in the webgraph, and $P(j \to i)$ is the transition probability from vertex $j$ to vertex $i$. The graph computation can be carried out iteratively as follows: $\Pi^k(i) = (1-d) \sum_{j \in N(i)} \Pi^{k-1}(j) P(j \to i) + d \frac{1}{|V|}$, where $k$ and $k-1$ are respectively the current and previous iterations, and $\Pi^0(i) = \frac{1}{|V|}$ for all $i \in V$ and $k = 1, 2, \ldots$. The number of iterations depends on the stopping criterion for the algorithm. Usually, the algorithm is stopped when the change in the PageRank mass of each vertex is less than a pre-defined tolerance. The rank update at each vertex can be decomposed into Map and Reduce functions for each iteration $k$. For a given vertex $i$ and iteration $k$, let $w_i^k = \{\Pi^{k-1}(i)\} \cup \{P(i \to j) : j \in N(i)\}$, and $\phi_i^k(W_{N(i)}^k) = (1-d) \sum_{j \in N(i)} \Pi^{k-1}(j) P(j \to i) + d \frac{1}{|V|}$. The Mapper $g_{i,j}(\cdot)$ maps file $w_j^k$ to the intermediate values $v_{i,j}^k = g_{i,j}(w_j^k) = \Pi^{k-1}(j) P(j \to i)$ for all neighboring vertices $i \in N(j)$. Using the intermediate values from the Map computations, the Reducer $h_i(\cdot)$ computes vertex $i$'s updated rank as $\Pi^k(i) = h_i(\{v_{i,j}^k : j \in N(i)\}) = (1-d) \sum_{j \in N(i)} v_{i,j}^k + d \frac{1}{|V|}$. Example. Single-source shortest path is one of the most studied problems in graph theory. The task here is to find the shortest path to each vertex $i$ in the graph from a source vertex $s$. A sub-problem for this task is to compute the distance of each vertex $i$ from the source vertex $s$, where the distance $D(i)$ is the length of the shortest path from $s$ to $i$. This can be carried out iteratively in parallel. First, initialize $D^0(s) = 0$ and $D^0(i) = +\infty$, $\forall i \in V \setminus \{s\}$.
Subsequently, each vertex $i$ is updated as follows at each iteration $k$: $D^k(i) = \min_{j \in N(i)} \left( D^{k-1}(j) + t(j,i) \right)$, where $t(j,i)$ is the weight of the edge $(j,i)$. The algorithm is stopped when the change in the distance value for each vertex is within a pre-defined tolerance. The distance computation for each vertex at iteration $k$ can be decomposed into Map and Reduce computations. Particularly, for each vertex $i$ and iteration $k$, let $w_i^k = \{D^{k-1}(i)\} \cup \{t(i,j) : j \in N(i)\}$, and $\phi_i^k(W_{N(i)}^k) = \min_{j \in N(i)} (D^{k-1}(j) + t(j,i))$. The Mapper $g_{i,j}(\cdot)$ Maps the file $w_j^k$ to the intermediate values $v_{i,j}^k = g_{i,j}(w_j^k) = D^{k-1}(j) + t(j,i)$ for all neighboring vertices $i \in N(j)$. Using the intermediate values from the Map computations, the Reducer $h_i(\cdot)$ computes $i$'s updated distance value as $D^k(i) = h_i(\{v_{i,j}^k : j \in N(i)\}) = \min_{j \in N(i)} v_{i,j}^k$. 2.2.2 Distributed Implementation For distributing the graph processing task, we consider K servers that are connected through a shared multicast network. Furthermore, at any given time, only one server can multicast over the shared network. Additionally, we assume that a multicast takes the same amount of time as a unicast. As described next, in order to distribute the Map computation tasks among the servers, each server is allocated a subgraph which is comprised of a subset of the graph vertices and the associated files that contain the state and neighborhood information of the vertices. Subgraph Allocation: Each server is assigned the Map computations in (2.1) associated with a subgraph, which consists of a subset of vertices and the associated files containing the state and neighborhood information of the vertices. We denote the subgraph that is allocated to server $k \in [K]$ by $\mathcal{M}_k \subseteq V$. Thus, server $k$ will store all the files in $\mathcal{M}_k$, and will be responsible for computing the Map functions on those files. Note that each file should be Mapped by at least one server.
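The two MapReduce decompositions described above (PageRank and single-source shortest path) can be sketched in a few lines of code. This is an illustrative sketch only: the toy graphs, the helper names (`run_round`, `map_pr`, and so on), and the choice of letting each distance Reducer also keep the vertex's own previous value (equivalently a zero-weight self-loop, which the computation model permits) are our assumptions, not the dissertation's implementation.

```python
# Illustrative sketch of the Map/Reduce decompositions of PageRank and
# single-source shortest path. Toy graphs and helper names are assumed.

INF = float("inf")

def run_round(state, graph, mapper, reducer):
    """One MapReduce iteration: Map every vertex, then Reduce every vertex."""
    inbox = {i: [] for i in graph}
    for j in graph:                               # Map phase
        for i, v in mapper(j, state).items():
            inbox[i].append(v)
    return {i: reducer(i, state, inbox[i]) for i in graph}  # Reduce phase

# --- PageRank: v_{i,j} = Pi(j) P(j->i);  Pi(i) = (1-d) sum_j v_{i,j} + d/|V| ---
neighbors = {1: [2], 2: [3], 3: [1]}              # directed 3-cycle (toy graph)
P = {(1, 2): 1.0, (2, 3): 1.0, (3, 1): 1.0}       # transition probabilities
d = 0.15

def map_pr(j, rank):
    return {i: rank[j] * P[(j, i)] for i in neighbors[j]}

def reduce_pr(i, rank, vals):
    return (1 - d) * sum(vals) + d / len(rank)

rank = run_round({i: 1 / 3 for i in neighbors}, neighbors, map_pr, reduce_pr)
# On this symmetric cycle, the ranks remain at 1/3.

# --- Shortest path: v_{i,j} = D(j) + t(j,i);  D(i) = min_j v_{i,j} ---
und = {1: [2, 3], 2: [1, 3], 3: [1, 2]}           # undirected toy graph
t = {(1, 2): 2, (2, 1): 2, (2, 3): 1, (3, 2): 1, (1, 3): 5, (3, 1): 5}

def map_sp(j, dist):
    return {i: dist[j] + t[(j, i)] for i in und[j]}

def reduce_sp(i, dist, vals):
    # Also keep D^{k-1}(i), i.e. a zero-weight self-loop, so D never grows.
    return min([dist[i]] + vals)

dist = {1: 0, 2: INF, 3: INF}                     # source s = 1
for _ in range(2):
    dist = run_round(dist, und, map_sp, reduce_sp)
# dist == {1: 0, 2: 2, 3: 3}
```

The same `run_round` driver serves both algorithms, which mirrors how any vertex computation fitting the decomposition in (2.1) can be distributed.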
Additionally, we allow redundant computations, i.e., each file can be Mapped by more than one server. The key idea in leveraging redundancy in the Map computation phase is to trade the computational resources in order to reduce the communication load in the Shuffle phase. We define the computation load as follows. Definition 1 (Computation Load). For a subgraph allocation $(\mathcal{M}_1, \cdots, \mathcal{M}_K)$, the computation load $r \in [K]$ is defined as $r := \frac{\sum_{k=1}^{K} |\mathcal{M}_k|}{n}$, where $|\mathcal{M}_k|$ denotes the number of vertices in the subgraph $\mathcal{M}_k$ for $k \in [K]$. Remark 1. For a desired computation load $r$, each server is assigned a subgraph with the same number of vertices, i.e., for each server $k \in [K]$, $|\mathcal{M}_k| = \frac{rn}{K}$. To carry out the Reduce computation in (2.1) for all vertices, each server is assigned a subset of the Reduce functions as follows. (a) An example of a graph with 6 vertices, each of which has a file associated with it that contains its state and neighborhood parameters. (b) MapReduce decomposition of the graph computations for the graph in Fig. 2.3(a). (c) Illustration of subgraph and Reduce allocations for the graph in Fig. 2.3(a) with computation load r = 2 and K = 3 servers. Each server is allocated a subgraph of size 4 and 2 Reducers. After the Map phase, each server needs to obtain the missing intermediate values that are needed to compute the Reduce functions allocated to it. Due to the redundant subgraph allocation, each of the intermediate values missing at a server is available at both other servers. We illustrate two Shuffling schemes. In the uncoded Shuffle, a missing intermediate value is obtained from one of the other two servers, and each server is assigned the task of sending two intermediate values, one for each of the other two servers. In the coded Shuffle, each server XORs the assigned intermediate values and sends only one coded message which is simultaneously useful for both other servers. Figure 2.3: An illustrative example.
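Definition 1 can be checked mechanically. The following minimal sketch (the helper name and the concrete allocation, which is consistent with the r = 2, K = 3 example of Fig. 2.3(c), are our assumptions) computes the computation load of a subgraph allocation:

```python
# Sketch of Definition 1: the computation load r of a subgraph allocation
# (M_1, ..., M_K) over n vertices is the total number of mapped vertices
# divided by n. The example allocation below is assumed for illustration.

def computation_load(subgraphs, n):
    """r := (sum_k |M_k|) / n."""
    return sum(len(M_k) for M_k in subgraphs) / n

# An allocation with n = 6 vertices and K = 3 servers, each mapping 4 vertices.
M = [{1, 2, 3, 4}, {1, 2, 5, 6}, {3, 4, 5, 6}]
r = computation_load(M, 6)  # r = 2.0: every vertex is mapped at two servers
```

Here each server maps $|\mathcal{M}_k| = rn/K = 4$ vertices, matching Remark 1.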
Reduce Allocation: A Reducer is associated with each vertex of the graph $\mathcal{G}$, as represented in (2.1). We use $\mathcal{R}_k \subseteq V$ to denote the set of vertices whose Reduce computations are assigned to server $k \in [K]$. The set of Reduce computations is partitioned into K equal parts and each part is associated exclusively with one server, i.e., $\cup_{k=1}^{K} \mathcal{R}_k = V$ and $\mathcal{R}_m \cap \mathcal{R}_n = \emptyset$ for $m, n \in [K]$, $m \neq n$. Therefore, $|\mathcal{R}_k| = \frac{n}{K}$, $\forall k \in [K]$. For the graph in Fig. 2.3(a) and a computation load of r = 2, we illustrate a scheme for subgraph allocation and Reduce allocation in Fig. 2.3(c). Here, each vertex appears in exactly two subgraphs, i.e., the Map computation associated with each vertex is assigned to exactly two servers. The subgraph and Reduce allocations in Fig. 2.3(c) form key components of our proposed scheme in Section 2.4, in which, for a computation load of r, every unique set of r servers is allocated a unique batch of $n / \binom{K}{r}$ files for Map computations. For a given scheme with subgraph allocation and Reduce allocation tuple denoted by $A = (\mathcal{M}, \mathcal{R})$, where $\mathcal{M} = (\mathcal{M}_1, \cdots, \mathcal{M}_K)$ and $\mathcal{R} = (\mathcal{R}_1, \cdots, \mathcal{R}_K)$, the distributed graph processing proceeds in three phases, as described next. Map phase: Each server first Maps the files associated with the subgraph that is allocated to it. More specifically, for each $i \in \mathcal{M}_k$, server $k$ computes a vector of intermediate values corresponding to the vertices in $N(i)$, that is, $\vec{g}_i = (v_{j,i} : j \in N(i))$. For the running example, we illustrate the intermediate values generated at each server during the Map phase in Fig. 2.3(c), where the color of an intermediate value denotes the server that is allocated the task of executing the corresponding Reducer. Shuffle phase: To be able to do the final Reduce computations, each server needs the intermediate values corresponding to the neighbors of each vertex whose Reduction it is responsible for. Servers exchange messages so that at the end of the Shuffle phase, each server is able to recover its required set of intermediate values.
More formally, the Shuffle phase proceeds as follows. For each $k \in [K]$, (i) server $k$ creates a message $X_k \in \mathbb{F}_{2^{c_k}}$ as a function of the intermediate values computed locally at that server during the Map phase, i.e., $X_k = \psi_k(\{\vec{g}_i : i \in \mathcal{M}_k\})$, where $c_k$ is the length of the binary message $X_k$; (ii) server $k$ multicasts $X_k$ to all the remaining servers; (iii) server $k$ recovers the missing intermediate values $\{v_{i,j} : i \in \mathcal{R}_k, j \in N(i), j \notin \mathcal{M}_k\}$ using the locally computed intermediate values $\{v_{i,j} : i \in N(j), j \in \mathcal{M}_k\}$ and the received messages $\{X_{k'} : k' \in [K] \setminus \{k\}\}$. We define the normalized communication load of the Shuffle phase as follows. Definition 2 (Normalized Communication Load). The normalized communication load, denoted by L, is defined as the number of bits communicated by the K servers during the Shuffle phase, normalized by the maximum possible total number of bits in the intermediate values associated with all the Reduce functions, i.e., $L := \frac{\sum_{k=1}^{K} c_k}{n^2 T}$. For the running example in Fig. 2.3(c), after the Map phase, each server obtains the intermediate values corresponding to the files in its subgraph. The intermediate values that are needed for computing the allocated Reduce functions but are not available after the Map phase have also been highlighted. We illustrate an uncoded Shuffling scheme in which each server is assigned the task of sending some of its locally available intermediate values to the other servers over the shared multicast network. We highlight here that each intermediate value missing at a server is available at two other servers. For example, $v_{5,1}$ and $v_{6,2}$ are missing at server 3, and both of them are available at servers 1 and 2. In this uncoded Shuffle, exactly one of the two servers is uniquely assigned the task of communicating the missing intermediate value to the server. For example, $v_{5,1}$ is multicasted by server 1 while $v_{6,2}$ is multicasted by
As a total of 6 intermediate values are sent over the shared multicast network, the normalized communication load of the uncoded Shuffle is L= 6 36 . The servers can instead send linear combinations of the intermediate values over the multicast network. For example, server 1 multicasts v 5,1 ⊕ v 3,4 . As v 5,1 is locally available at server 2, server 2 can compute (v 5,1 ⊕ v 3,4 )⊕ v 5,1 and obtain the missing intermediate value v 3,4 . Similarly, server 3 can obtain the missing intermediate value v 5,1 . This illustrates that by using coded Shuffle, in which each server sends a combination of locally available intermediate values over the multicast network, the communication load can be improved over the uncoded Shuffle. In this case specifically, the communication load for the coded Shuffle is L= 3 36 , which is factor of two (same as the computation load r =2) improvement over uncoded Shuffle. This forms the motivation behind our proposed scheme in Section 2.4. Reduce phase: Using its locally computed intermediate values and the intermediate values recovered from the messages received from other servers during the Shuffle phase, server k∈ [K] computes the Reduce functions in R k to calculate o i = h i ({v i,j : j ∈N(i)}) for all i∈R k . In Fig. 2.3(c), each server has all the intermediate values that are needed to compute the allocated Reduce functions. For example, for computing the Reduce function associated with vertex 1, server 1 has intermediate values v 1,1 and v 1,2 available locally from the Map phase and the intermediate value v 1,5 obtained from server 2 in the Shuffle phase. Therefore, each of the three servers can compute the Reduce functions allocated to it. 2.2.3 Problem Formulation As illustrated in Fig. 2.3, the communication load during Shuffle phase depends on subgraph allocation, Reduce allocation, and Shuffle strategy. 
For an allowed computation load r, our broader goal is to minimize the communication load during the Shuffle phase through efficient schemes for the allocation of subgraphs and Reducers to servers and for coded Shuffling of intermediate values among the servers. We consider a random undirected graph $\mathcal{G} = (V, E)$, where edges independently exist with probability $\mathbb{P}[(i,j) \in E]$ for all $i, j \in V$. Let $\mathcal{A}(r)$ be the set of all possible subgraph and Reduce allocations for a given computation load r (as defined in the previous subsection). For a graph realization $G$ of $\mathcal{G}$ and an allocation $A \in \mathcal{A}(r)$, a coded Shuffling scheme is feasible if each server can compute all the Reduce functions assigned to it. We denote by $L_A(r, G)$ the minimum (normalized) communication load (as defined in Definition 2) over all feasible coded Shuffling schemes that enable each server to compute all the Reduce functions assigned to it. Hence, for a given realization $G$ of the random graph $\mathcal{G}$, the minimum communication load among all possible subgraph and Reduce allocations and feasible coded Shuffling schemes is as follows: $L^*_G(r) := \inf_{A \in \mathcal{A}(r)} L_A(r, G)$. (2.2) Remark 2. Partitioning of graphs in popular graph processing frameworks such as Pregel [137] is solely based on the vertex ID and not on the vertex neighborhood density. Furthermore, designing subgraph allocation, Reduce allocation and Shuffling schemes for characterizing the minimum communication load in (2.2) is NP-hard in general. This is because for the case of computation load r = 1, finding the minimum communication load is equivalent to finding the minimum K-cut of the graph, which is NP-hard for general graphs [37]. Additionally, existing heuristics for load balancing in distributed graph processing involve additional steps such as migration of vertex files during graph algorithm execution [93], which adds latency to the overall execution time.
Hence, we focus on the problem of finding the subgraph and Reduce allocation tuple $A \in \mathcal{A}(r)$ that minimizes the average normalized communication load across all graph realizations $G$ of $\mathcal{G}$. (Note that uncoded Shuffling schemes are special cases of coded Shuffling schemes and are thus included in the set of all feasible coded Shuffling schemes under consideration.) We formally define our problem as follows. Problem: For a given random undirected graph $\mathcal{G} = (V, E)$ and a computation load $r \in [K]$, our goal is to characterize the minimum average normalized communication load, i.e., $L^*(r) := \inf_{A \in \mathcal{A}(r)} \mathbb{E}_{\mathcal{G}}[L_A(r, G)]$. (2.3) Remark 3. For $r \geq K$, $L^*(r)$ is trivially 0, as each vertex can be mapped at each server, so all the intermediate values associated with the Reducers of any server are available at that server. Remark 4. As defined above, $L^*(r)$ essentially reveals a fundamental trade-off between computation and communication in distributed graph processing. Remark 5. In the above problem formulation, for a given subgraph and Reduce allocation tuple $A \in \mathcal{A}(r)$, in order to minimize the average communication load, the Shuffle scheme needs to take into consideration the connectivity of each realization $G$ of $\mathcal{G}$. As we describe in Section 2.4, our proposed coded scheme utilizes careful alignment of intermediate values for creating coded messages for multicast during the Shuffle phase, leading to significant improvement in the average communication load. Remark 6. Although the main focus of our problem formulation is on minimizing the average communication load for random graph models, our proposed coded scheme in Section 2.4 is applicable to any real-world graph. As demonstrated in Section 2.7, our proposed coded scheme can provide significant performance gains in practice.
Specifically, for implementing PageRank over the real-world social webgraph TheMarker Cafe [67], our proposed scheme provides a gain of up to 43.4% in the overall execution time in comparison to the conventional PageRank implementation. In the next section, we discuss our main results for four popular random graph models. 2.3 Main Results In this section, we present the main results of our work. Our first result is the characterization of $L^*(r)$ (defined in (2.3)) for the Erdös-Rényi model that is defined below. Erdös-Rényi Model: Denoted by ER(n, p), this model consists of graphs of size n in which each edge exists with probability $p \in (0, 1]$, independently of the other edges (Fig. 2.5(a)). Theorem 1. For the Erdös-Rényi model ER(n, p) with $p = \omega(\frac{1}{n^2})$, we have $\lim_{n \to \infty} \frac{L^*(r)}{p} = \frac{1}{r}\left(1 - \frac{r}{K}\right)$. Remark 7. Theorem 1 reveals an interesting inverse-linear trade-off between computation and communication in distributed graph processing. Specifically, our proposed coded scheme in Section 2.4 asymptotically gives a gain of r in the average normalized communication load in comparison to the uncoded Shuffling scheme which, as we discuss later in Section 2.4, only achieves an average normalized communication load of $p(1 - \frac{r}{K})$. This trade-off can be used to leverage additional computing resources and capabilities to alleviate the costly communication bottleneck. Moreover, we numerically demonstrate that even for finite graphs, not only does the proposed scheme significantly reduce the communication load in comparison to the uncoded scheme, but it also has a small gap from the optimal average normalized communication load (Fig. 2.4). Finally, the assumption $p = \omega(\frac{1}{n^2})$ implies the regime of interest in which the average number of edges in the graph grows with n. Otherwise, the problem would not be of interest, since the communication load would become negligible even without redundancy/coding in computation. Remark 8.
Achievability: Theorem 1 is proved in Section 2.4, where we provide the subgraph and Reduce allocations followed by the code design for Shuffling for our proposed scheme. The main idea is to leverage the coded multicast opportunities offered by the injected redundancy and create coded messages which simultaneously satisfy the data demand of multiple servers. Figure 2.4: Performance comparison of our proposed coded scheme with the uncoded Shuffle scheme and the proposed lower bound. The averages of the communication load for the two schemes were obtained over graph realizations of the Erdös-Rényi model with n = 300, p = 0.1 and K = 5. Careful combination of the available intermediate values during the Shuffle phase benefits from the missing graph connections by aligning the intermediate values assigned to be communicated over the shared network. Conversely, Theorem 1 demonstrates that the asymptotic bandwidth gain r achieved by the proposed scheme is optimal and cannot be improved. For the proof of the converse provided in Section 2.5, we use induction to derive information-theoretic lower bounds on the average normalized communication load required by any subset of servers and then apply the result to the set of all K servers. Our second result is the characterization of $L^*(r)$ for the power law model that is defined below. Power Law Model: Denoted by $PL(n, \gamma, \rho)$, this model consists of graphs of size n in which degrees are i.i.d. random variables drawn from a power law distribution with exponent $\gamma$, and edge probabilities are $\rho$-proportional to the product of the degrees of the two end vertices (Fig. 2.5(b)). Theorem 2. For the power law model $PL(n, \gamma, \rho)$ with node degrees $\{d_1, \cdots, d_n\}$, $\gamma > 2$ and $\rho = \frac{1}{\sum_{i=1}^{n} d_i}$, we have $\limsup_{n \to \infty} \frac{n L^*(r)}{\frac{\gamma - 1}{\gamma - 2}} \leq \frac{1}{r}\left(1 - \frac{r}{K}\right)$. (a) Erdös-Rényi model with n = 20.
(b) Power law model with n = 40, $\gamma = 2.3$ and 100 edges. (c) Random bi-partite model with $n_1 = 6$ and $n_2 = 4$. (d) Stochastic block model with $n_1 = 12$ and $n_2 = 18$. Figure 2.5: Illustrative instances of the random graph models considered in the paper. In Fig. 2.5(a), each edge exists with a given probability p. In Fig. 2.5(b), the expected degree of each vertex follows a power law distribution with exponent $\gamma$. In Fig. 2.5(c), each cross-edge exists with a given probability q. In Fig. 2.5(d), each intra-cluster edge exists with a given probability p and each cross-edge exists with a given probability q. Remark 9. Theorem 2 demonstrates that an inverse-linear trade-off between the computation load and the communication load can also be achieved in the power law model. We leverage our coded scheme proposed in Section 2.4 for the proof of Theorem 2 in Section 2.6. Furthermore, we specialize our proposed coded scheme in Section 2.4 to develop subgraph allocation and Reduce allocation schemes along with coded Shuffling schemes for two other popular random graph models, which are described below. Random Bi-partite Model: Denoted by $RB(n_1, n_2, q)$, this model consists of graphs with two disjoint clusters of sizes $n_1$ and $n_2$ in which each inter-cluster edge exists with probability $q \in (0, 1]$, independently of the other inter-cluster edges (Fig. 2.5(c)). No intra-cluster edge exists in this model. Stochastic Block Model: Denoted by $SBM(n_1, n_2, p, q)$, this model consists of graphs with two disjoint clusters of sizes $n_1$ and $n_2$ such that each intra-cluster edge exists with probability $p$ and each inter-cluster edge exists with probability $q$, $0 < q < p \leq 1$, all independently of each other (Fig. 2.5(d)). The following theorems provide the achievability and converse results for the RB and SBM models. Theorem 3. For the random bi-partite model $RB(n_1, n_2, q)$ with $n = n_1 + n_2$, $n_1 = \Theta(n)$, $n_2 = \Theta(n)$, $|n_1 - n_2| = o(n)$ and $q = \omega(\frac{1}{n^2})$, we have $\frac{1}{8r}\left(1 - \frac{2r}{K}\right) \leq \limsup_{n \to \infty} \frac{L^*(r)}{q} \leq \frac{1}{2r}\left(1 - \frac{2r}{K}\right)$. Remark 10.
Theorem 3 characterizes the optimal average normalized communication load within a factor of 4 for the random bi-partite model. We provide the proofs of the achievability and converse of Theorem 3 in Appendices A and B, respectively. For achievability, we observe that there are no intra-cluster edges in the random bi-partite model, due to which the intermediate values for a particular Reducer in one cluster come only from Mappers in the other cluster. Therefore, we specialize our proposed coded scheme of Section 2.4 for the random bi-partite model, partitioning the available servers in proportion to the cluster sizes. This ensures maximum overlap between the Reducers corresponding to vertices in one cluster and the Mappers corresponding to vertices in the other cluster. For proving the converse, we remove vertices (and the edges incident to them) from the larger cluster so that the reduced graph has two clusters of equal sizes. The reduced graph model thus has two sets of Mappers and Reducers, which correspond to two different Erdös-Rényi models. Applying our lower bound for the Erdös-Rényi model in Theorem 1, we arrive at the converse for the bi-partite model.

Theorem 4. For the stochastic block model $\mathrm{SBM}(n_1,n_2,p,q)$ with $n=n_1+n_2$, $n_1=\Theta(n)$, $n_2=\Theta(n)$, and $p=\omega(\frac{1}{n^2})$, $q=\omega(\frac{1}{n^2})$, we have
\[
\limsup_{n\to\infty}\frac{L^*(r)}{\frac{pn_1^2+pn_2^2+2qn_1n_2}{(n_1+n_2)^2}}\le\frac{1}{r}\left(1-\frac{r}{K}\right). \tag{2.4}
\]
Moreover, the following converse inequality holds:
\[
\frac{L^*(r)}{q}\ge\frac{1}{r}\left(1-\frac{r}{K}\right). \tag{2.5}
\]

Remark 11. Using (2.4) and (2.5), it can be easily verified that for the stochastic block model, the converse is within a constant factor of the achievability if $p=\Theta(q)$. The achievability and converse of Theorem 4 are proved in Appendices C and D, respectively.
For achievability, we specialize our proposed coded scheme from Section 2.4 based on the observation that in SBM, the Reducers corresponding to vertices in one cluster depend on the Mappers corresponding to vertices within the same cluster with one probability (due to intra-cluster edges), and on the vertices in the other cluster with another probability (due to cross-cluster edges). For the converse, the key idea is to randomly remove edges from the SBM model such that a larger Erdös-Rényi model is obtained, then utilize a coupling argument, and finally use our information-theoretic converse bound from Theorem 1.

2.4 Proposed Scheme and Proof of Achievability of Theorem 1

In this section, we first describe our proposed coded scheme for distributed graph analytics, and then leverage it to prove the achievability for the Erdös-Rényi model in Theorem 1.

2.4.1 Proposed Scheme

As described in our distributed graph processing framework in Section 2.2, a scheme for distributed implementation of the graph computations consists of a subgraph allocation, a Reduce allocation, and a Shuffling algorithm. We next precisely describe our proposed scheme for a given realization $G$ of the underlying random graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$.

Subgraph Allocation: The $n$ files associated with the $n$ vertices of $G$ are first partitioned serially into $\binom{K}{r}$ batches $\mathcal{B}_1,\mathcal{B}_2,\ldots,\mathcal{B}_{\binom{K}{r}}$, where $\mathcal{B}_j$ comprises the files associated with the vertices with IDs in the range $\{(j-1)g+1,(j-1)g+2,\ldots,jg\}$. Here, $g=n/\binom{K}{r}$ denotes the number of files in each batch. For our example with a graph of 6 vertices, 3 servers, and computation load 2 presented in Section 2.2, the 6 files are partitioned into $\binom{K}{r}=3$ batches, each of size $g=2$, as follows (see Fig. 2.6(a)):
\[
\mathcal{B}_1=\{1,2\},\quad\mathcal{B}_2=\{3,4\},\quad\mathcal{B}_3=\{5,6\}.
\]
Each of the $\binom{K}{r}$ batches of files is associated with a unique set of $r$ servers. Specifically, let $\mathcal{F}_1,\mathcal{F}_2,\ldots,\mathcal{F}_{\binom{K}{r}}$ denote all possible $r$-subsets of $\{1,2,\ldots,K\}$.
Then, each of the servers with indices in $\mathcal{F}_j$ is allocated each of the files contained in batch $\mathcal{B}_j$. Thus, server $k\in[K]$ Maps the vertices in $\mathcal{B}_j$ if $k\in\mathcal{F}_j$. Equivalently, $\mathcal{B}_j\subseteq\mathcal{M}_k$ if $k\in\mathcal{F}_j$. Therefore, we have the following subgraph allocation for server $k$:
\[
\mathcal{M}_k=\bigcup_{j\in[\binom{K}{r}]:\,k\in\mathcal{F}_j}\mathcal{B}_j.
\]
As each server is present in $\binom{K-1}{r-1}$ of the $\binom{K}{r}$ unique combinations of servers, we have the following for each server $k\in[K]$:
\[
|\mathcal{M}_k|=\binom{K-1}{r-1}g=\binom{K-1}{r-1}\frac{n}{\binom{K}{r}}=\frac{rn}{K}.
\]
In Fig. 2.6(a), we illustrate the subgraph allocation for our running example. Here, $\mathcal{F}_1=\{1,2\}$, $\mathcal{F}_2=\{1,3\}$ and $\mathcal{F}_3=\{2,3\}$. Each of the two files in batch $\mathcal{B}_j$ is assigned to each of the servers in $\mathcal{F}_j$, for $j\in\{1,2,3\}$. Thus, server 1 is allocated the files $\mathcal{B}_1\cup\mathcal{B}_2=\{w_1,w_2,w_3,w_4\}$, server 2 is allocated the files $\mathcal{B}_1\cup\mathcal{B}_3=\{w_1,w_2,w_5,w_6\}$, and server 3 is allocated $\mathcal{B}_2\cup\mathcal{B}_3=\{w_3,w_4,w_5,w_6\}$. Thus, $|\mathcal{M}_1|=|\mathcal{M}_2|=|\mathcal{M}_3|=4$.

Reduce Allocation: The $n$ Reduce functions associated with the $n$ graph vertices are disjointly and uniformly partitioned into $K$ subsets, and each subset is assigned exclusively to one server. Specifically, for $k\in[K]$, $|\mathcal{R}_k|=\frac{n}{K}$ and $\mathcal{R}_k=\{(k-1)\frac{n}{K}+1,(k-1)\frac{n}{K}+2,\ldots,k\frac{n}{K}\}$. In our running example, $\mathcal{R}_1=\{1,2\}$, $\mathcal{R}_2=\{3,4\}$ and $\mathcal{R}_3=\{5,6\}$. For notational convenience, we denote our proposed subgraph allocation and Reduce allocation by $A_C$.

Coded Shuffle: As illustrated in Fig. 2.3(c), the key idea in coded Shuffling is to create coded combinations of locally available intermediate values so that the same message can be useful for many servers simultaneously. Due to the subgraph and Reduce allocation $A_C$ described above, every set $\mathcal{F}_j$ of $r$ servers has a unique batch of files $\mathcal{B}_j$. Thus, all the intermediate values corresponding to the Map computations associated with the files in $\mathcal{B}_j$ are available at every server in $\mathcal{F}_j$ after the Map phase. With this observation, consider without loss of generality the set of servers $\mathcal{S}=\{1,2,\ldots,r+1\}$.
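Before detailing the coded Shuffle further, the batching and server assignment of the allocation $A_C$ described above can be sketched in a few lines of Python (an illustrative sketch with our own function name, not the released implementation; vertices are 1-indexed as in the running example):

```python
from itertools import combinations

def allocate(n, K, r):
    """Sketch of the subgraph and Reduce allocation A_C.

    Batch B_j holds the g = n / C(K, r) consecutive vertex IDs assigned to
    the r-subset F_j of servers; server k Maps batch B_j whenever k is in F_j.
    """
    subsets = list(combinations(range(1, K + 1), r))   # F_1, ..., F_{C(K,r)}
    g = n // len(subsets)                              # files per batch
    batches = [list(range(j * g + 1, (j + 1) * g + 1))
               for j in range(len(subsets))]
    # Subgraph (Map) allocation: M_k is the union of batches with k in F_j.
    M = {k: sorted(v for F, B in zip(subsets, batches) if k in F for v in B)
         for k in range(1, K + 1)}
    # Reduce allocation: n/K consecutive Reduce functions per server.
    R = {k: list(range((k - 1) * (n // K) + 1, k * (n // K) + 1))
         for k in range(1, K + 1)}
    return M, R

M, R = allocate(n=6, K=3, r=2)
print(M[1], R[1])   # server 1 Maps vertices [1, 2, 3, 4] and Reduces [1, 2]
```

Running it on the example with $n=6$, $K=3$, $r=2$ reproduces the allocation in Fig. 2.6(a), with $|\mathcal{M}_k|=rn/K=4$ for every server.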
For each server $k\in\mathcal{S}$, let $\mathcal{Z}^k_{\mathcal{S}\setminus\{k\}}$ be the set of all intermediate values that are needed by the Reduce functions at server $k$ and are available exclusively at each server $k'\in\mathcal{S}\setminus\{k\}$, i.e.,
\[
\mathcal{Z}^k_{\mathcal{S}\setminus\{k\}}=\{v_{i,j}:(i,j)\in E,\ i\in\mathcal{R}_k,\ j\in\cap_{k'\in\mathcal{S}\setminus\{k\}}\mathcal{M}_{k'}\}. \tag{2.6}
\]

(a) Illustrating the subgraph allocation and Reduce allocation $A_C$ for the example graph with 6 vertices. The 6 files are partitioned into 3 batches and each batch is assigned to a unique subset of 2 servers. The Reduce functions are partitioned into 3 sets; one set is assigned to each server. (b) For the subgraph and Reduce allocations $A_C$ in Fig. 2.6(a), we illustrate our proposed coded Shuffle scheme. For each intermediate value needed by a server, each of the remaining two servers is assigned the task of communicating a segment which is one-half of the intermediate value. The servers create a table of the segments that they are assigned to send, with each row corresponding to the intermediate values required exclusively by one of the remaining servers. Each server sends two coded messages, each of which is simultaneously useful for both of the remaining servers.

Figure 2.6: Illustration of our proposed scheme.

We observe that after the Map phase, server $r+1$ has $\mathcal{Z}^k_{\mathcal{S}\setminus\{k\}}$ for $k\in\{1,\ldots,r\}$. Furthermore, server 1 has $\mathcal{Z}^k_{\mathcal{S}\setminus\{k\}}$ for $k\in\{2,\ldots,r\}$, server 2 has $\mathcal{Z}^k_{\mathcal{S}\setminus\{k\}}$ for $k\in\{1,3,\ldots,r\}$, and so on. Therefore, server $r+1$ can create a coded message by selecting one intermediate value each from $\mathcal{Z}^k_{\mathcal{S}\setminus\{k\}}$ for $k\in\{1,\ldots,r\}$, and taking the XOR of them. The coded message is simultaneously useful for the servers $\{1,\ldots,r\}$, as each of them can XOR out its own missing intermediate value, since it has the remaining intermediate values associated with the coded message. Similar arguments hold for the coded messages from the other servers within $\mathcal{S}$.

In light of the above arguments, for each $k\in\mathcal{S}$, each intermediate value $v_{i,j}\in\mathcal{Z}^k_{\mathcal{S}\setminus\{k\}}$ is evenly split into $r$ segments $v^{(1)}_{i,j},\cdots,v^{(r)}_{i,j}$, each of size $\frac{T}{r}$ bits.
Each segment is associated with a distinct server in $\mathcal{S}\setminus\{k\}$, where the segment assignment is based on the order of the indices of the $r$ servers in $\mathcal{S}\setminus\{k\}$. Therefore, $\mathcal{Z}^k_{\mathcal{S}\setminus\{k\}}$ is evenly partitioned into $r$ sets, denoted by $\mathcal{Z}^k_{\mathcal{S}\setminus\{k\},s}$ for $s\in\mathcal{S}\setminus\{k\}$. Depending on the connectivity of $G$, the number of intermediate values in $\mathcal{Z}^k_{\mathcal{S}\setminus\{k\}}$ varies, and the maximum possible size of $\mathcal{Z}^k_{\mathcal{S}\setminus\{k\}}$ is $\tilde{g}=g\frac{n}{K}=\frac{n^2}{K\binom{K}{r}}$. Each server $s\in\mathcal{S}$ creates an $r\times\tilde{g}$ table and fills it out with the segments associated with it. Each row of the table is filled from the left by the segments in one of the sets $\mathcal{Z}^k_{\mathcal{S}\setminus\{k\},s}$, where $k\in\mathcal{S}\setminus\{s\}$ (see Fig. 2.7). Then, server $s$ broadcasts the XOR of all the segments in each non-empty column of the table, where for each non-empty column, the empty entries are zero padded. Clearly, there exist at most $\tilde{g}$ such coded messages. The process is carried out similarly for every other subset $\mathcal{S}\subseteq[K]$ of servers with $|\mathcal{S}|=r+1$.

After the Shuffle phase, for each multicast group of $r+1$ servers, all but one of the intermediate values contributing to each coded message are locally available at each intended server. Moreover, all possible subsets of multicast servers have sent their corresponding messages. Therefore, each server can recover all of the intermediate values associated with its assigned set of Reduce functions using the received coded messages and the locally computed intermediate values. Thus, our proposed coded Shuffling scheme is feasible, i.e., for any given graph and the subgraph and Reduce allocation $A_C$, our proposed Shuffling enables each server to compute all the Reduce functions assigned to it.

Remark 12. The proposed scheme carefully aligns and combines the existing intermediate values to benefit from the coding opportunities. This resolves the issue posed by the asymmetry in the data requirements of the Reducers, which is one of the main challenges in moving from the general MapReduce framework in [123] to graph analytics.

In Fig.
2.6(b), every intermediate value in $\mathcal{Z}^3_{\{1,2\}}=\{v_{5,1},v_{6,2}\}$ is split into $r=2$ segments, each associated with a distinct server in $\{1,2\}$. This is done similarly for servers 1 and 2. Then, servers 1, 2, and 3 broadcast their coded messages $X_1=\{v^{(1)}_{5,1}\oplus v^{(1)}_{4,3},\ v^{(1)}_{3,4}\oplus v^{(1)}_{6,2}\}$, $X_2=\{v^{(2)}_{5,1}\oplus v^{(1)}_{1,5},\ v^{(2)}_{6,2}\oplus v^{(1)}_{2,6}\}$, and $X_3=\{v^{(2)}_{4,3}\oplus v^{(2)}_{1,5},\ v^{(2)}_{3,4}\oplus v^{(2)}_{2,6}\}$, respectively. All three servers can recover their missing intermediate values. For instance, server 3 needs $v_{5,1}$ to carry out the Reduce function associated with vertex 5. Since it has already Mapped vertices 3 and 5, the intermediate values $v_{4,3}$ and $v_{1,5}$ are available locally. Server 3 can thus recover $v^{(1)}_{5,1}$ and $v^{(2)}_{5,1}$ from $v^{(1)}_{5,1}\oplus v^{(1)}_{4,3}$ and $v^{(2)}_{5,1}\oplus v^{(1)}_{1,5}$, respectively. As each server sends 2 coded messages to the other servers and each coded message is half the size of an intermediate value, the overall normalized communication load is $\frac{3}{36}$, which is two times better than the normalized communication load for uncoded Shuffling.

2.4.2 Proof of Achievability of Theorem 1

We now analyze the performance of our proposed coded scheme of Section 2.4.1 for the Erdös-Rényi random graph model to prove the achievability of Theorem 1. For our proposed subgraph and Reduce allocation $A_C$, we first compute the average communication load for the uncoded Shuffle, where no coding is utilized during the Shuffle phase.

Uncoded Shuffle: Given the subgraph and Reduce allocation $A_C$, consider a server $k\in[K]$. Due to symmetry, the total expected communication load is the sum of the communication loads of the individual servers, and hence we can focus on finding the communication load of server 1. Note that there are $n/K$ Reducers and $\frac{rn}{K}$ Mappers assigned to server 1. Therefore, for each Reducer at server 1, the expected communication required is $(pn-p\frac{rn}{K})T$ bits.
Summing the expected communication loads over all the Reducers at server 1, the total expected communication load for server 1 is $\frac{n}{K}(pn-p\frac{rn}{K})T$. Summing over all the $K$ servers and normalizing, we obtain the average normalized communication load for the uncoded Shuffle as follows:
\[
\bar{L}^{UC}_{A_C}:=\mathbb{E}_{\mathcal{G}}\left[L^{UC}_{A_C}(r,G)\right]=K\cdot\frac{n}{K}\left(pn-p\frac{rn}{K}\right)T\cdot\frac{1}{n^2T}=p\left(1-\frac{r}{K}\right),
\]
where $L^{UC}_{A_C}(r,G)$ denotes the normalized communication load of the uncoded Shuffle for the graph realization $G$ of the Erdös-Rényi random graph model $\mathcal{G}$.

We now apply our proposed coded Shuffle scheme and compute the induced average communication load. Without loss of generality, we analyze our algorithm via a generic argument for the servers $\mathcal{S}=\{1,\cdots,r+1\}$, which can be applied similarly to any other set of servers $\mathcal{S}$ with $|\mathcal{S}|=r+1$, due to the symmetric structure induced by the graph model and the subgraph and Reduce allocation $A_C$. Denote the $r+1$ servers as $s_1,\cdots,s_{r+1}$, and consider the messages that $s_1$ is assigned to send within the multicast group $\mathcal{S}$; the coded messages sent by the other servers within $\mathcal{S}$ are created similarly. As described in Section 2.4.1 and illustrated in Fig. 2.7, server $s_1$ creates a table of intermediate value segments for transmission. In this table, each row is filled from the left, and for $i\in[r]$, the $i$'th row contains the allocated segments for the intermediate values in the set $\mathcal{Z}^{s_{i+1}}_{\mathcal{S}\setminus\{s_{i+1}\},s_1}$. The number of segments in $\mathcal{Z}^{s_{i+1}}_{\mathcal{S}\setminus\{s_{i+1}\},s_1}$, denoted by $\tilde{g}_i$, depends on the connectivity of the graph $G$ and is upper bounded by $\tilde{g}$, the total number of intermediate values in $\mathcal{Z}^{s_{i+1}}_{\mathcal{S}\setminus\{s_{i+1}\}}$ for a completely connected graph. Server $s_1$ broadcasts at most $\tilde{g}_{\max}=\max(\tilde{g}_1,\tilde{g}_2,\ldots,\tilde{g}_r)$ coded messages $X_1,\cdots,X_{\tilde{g}_{\max}}$, zero padding the empty entries in the non-empty columns. These coded messages are simultaneously and exclusively useful for the servers $s_2,\cdots,s_{r+1}$.
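The table-based construction just described amounts to a column-wise XOR of zero-padded rows of segments, and a receiver extracts its missing segment by XORing out everything it already holds. A minimal byte-level sketch (helper names are ours, not from the released implementation; each row holds the segments destined for one of the other $r$ servers):

```python
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode_columns(rows, seg_len):
    """Column-wise XOR of a segment table, as in the proposed Shuffle.

    `rows` is a list of r lists of byte segments of length seg_len; rows
    shorter than the widest one are implicitly zero padded, so the number
    of coded messages equals the longest row length (g~_max in the text).
    """
    zero = bytes(seg_len)
    width = max(len(row) for row in rows)
    return [reduce(xor_bytes, (row[j] if j < len(row) else zero for row in rows))
            for j in range(width)]

def recover(message, known_segments):
    """A receiver XORs out the segments it already holds (zero paddings
    count as known) to extract its one missing segment from a message."""
    return reduce(xor_bytes, known_segments, message)

# Two rows of 1-byte segments; the second row is shorter and gets zero padded.
msgs = encode_columns([[bytes([5]), bytes([3])], [bytes([4])]], seg_len=1)
print(recover(msgs[0], [bytes([4])]))   # the receiver's missing segment
```

Each coded message is thus simultaneously useful to every receiver that already holds all but one of the segments aligned in its column.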
For each non-empty column $j\in[\tilde{g}_{\max}]$, $X_j$ is the XOR of at most $r$ non-zero segments of size $\frac{T}{r}$ bits associated with server $s_1$. More formally, for each non-empty column $j\in[\tilde{g}_{\max}]$, we have the following:
\[
X_j=\bigoplus_{i=1}^{r}v^{(1)}_{\alpha(i,j)}. \tag{2.7}
\]
In (2.7), for $i\in[r]$ and $j\in[\tilde{g}_i]$, we have used $v^{(1)}_{\alpha(i,j)}$ to denote the non-zero segment in the $i$'th row and $j$'th column of the table, while for $j\in\{\tilde{g}_i+1,\tilde{g}_i+2,\ldots,\tilde{g}\}$, $v^{(1)}_{\alpha(i,j)}$ denotes the zero padding segment. Let the $\mathrm{Bern}(p)$ random variable $E_{\alpha(i,j)}$ indicate the existence of the edge $\alpha(i,j)\in V\times V$, i.e., $E_{\alpha(i,j)}=1$ if $\alpha(i,j)\in E$, and $E_{\alpha(i,j)}=0$ otherwise. Clearly, for all vertices $i,j,t,u\in V$, $E_{\alpha(i,j)}$ is independent of $E_{\alpha(t,u)}$ if $\alpha(i,j)$ and $\alpha(t,u)$ do not represent the same edge, and $E_{\alpha(i,j)}=E_{\alpha(t,u)}$ otherwise. For $i\in[r]$, the random variable $P_i$ is defined as
\[
P_i=\sum_{j=1}^{\tilde{g}}E_{\alpha(i,j)}, \tag{2.8}
\]
i.e., each $P_i$ is the sum of $\tilde{g}$ possibly dependent $\mathrm{Bern}(p)$ random variables. Note that the $P_i$'s are not independent in general.

Figure 2.7: Creating coded messages by aligning the associated intermediate value segments. For $j\in[\tilde{g}]$, the coded message $X_j$ is the XOR $v^{(1)}_{\alpha(1,j)}\oplus v^{(1)}_{\alpha(2,j)}\oplus\cdots\oplus v^{(1)}_{\alpha(r,j)}$ of the $j$'th column, and the $i$'th row of the table contains $P_i$ non-zero segments.

By careful alignment of the present intermediate values (Fig. 2.7), $s_1$ broadcasts $Q$ coded messages, each of size $\frac{T}{r}$ bits, where $Q=\max_{i\in[r]}P_i$. Thus, the total coded communication load sent from server $s_1$ exclusively for servers $s_2,\cdots,s_{r+1}$ is $\frac{T}{r}Q$ bits.
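As a quick numerical illustration of why $Q=\max_{i\in[r]}P_i$ concentrates near $p\tilde{g}$, one can simulate the row sums directly (a simulation sketch, not part of the proof; for simplicity it uses fully independent Bernoulli summands, whereas the proof of Lemma 1 additionally handles the pairwise dependence among the $E_{\alpha(i,j)}$'s):

```python
import random

def estimate_EQ(g_tilde, p, r, trials=400, seed=0):
    """Monte Carlo estimate of E[Q], where Q = max_i P_i and each P_i is a
    sum of g_tilde independent Bernoulli(p) variables."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += max(sum(rng.random() < p for _ in range(g_tilde))
                     for _ in range(r))
    return total / trials

q = estimate_EQ(g_tilde=2000, p=0.05, r=3)
print(q / (2000 * 0.05))   # ratio close to 1: E[Q] is roughly p * g_tilde
```

In the regime $p\tilde{g}=\omega(1)$, the estimate stays within a few standard deviations of $p\tilde{g}$, consistent with the bound $\mathbb{E}[Q]\le p\tilde{g}+o(p\tilde{g})$ established next in Lemma 1.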
By similar arguments for the other sets of servers, we can characterize the average normalized coded communication load of the proposed scheme as follows:
\[
\bar{L}^{C}_{A_C}:=\mathbb{E}_{\mathcal{G}}\left[L^{C}_{A_C}(r,G)\right]=\frac{1}{rn^2}K\binom{K-1}{r}\mathbb{E}[Q], \tag{2.9}
\]
where $L^{C}_{A_C}(r,G)$ denotes the normalized communication load of the proposed coded Shuffle for the graph realization $G$ of the Erdös-Rényi random graph model $\mathcal{G}$. The following lemma asymptotically upper bounds $\mathbb{E}[Q]$; the proof is provided in Section 2.4.3.

Lemma 1. For $\mathrm{ER}(n,p)$ graphs with $p=\omega(\frac{1}{n^2})$, we have $\mathbb{E}[Q]\le p\tilde{g}+o(p\tilde{g})$.

Putting (2.9) and Lemma 1 together, we have
\[
L^*(r)\le\bar{L}^{C}_{A_C}\le\frac{1}{r}p\left(1-\frac{r}{K}\right)+o(p),
\]
hence the achievability claimed in Theorem 1 is proved. Finally, we note that, as explained for the uncoded Shuffle algorithm, the average normalized uncoded communication load of the proposed scheme is $\bar{L}^{UC}_{A_C}=p(1-\frac{r}{K})$, which implies that our scheme achieves an asymptotic gain of $r$.

Remark 13. As we show next in the proof of Lemma 1, the regime $p=\omega(1/n^2)$ is essential in order to have $p\tilde{g}=\omega(1)$. As $\tilde{g}=\frac{n^2}{K\binom{K}{r}}=\Theta(n^2)$ is a deterministic function of $n$, the regime $p=\omega(1/n^2)$ is needed to obtain the achievability and asymptotic optimality of Theorem 1.

2.4.3 Proof of Lemma 1

Before proving Lemma 1, we first present the following lemma that will be used in our proof.

Lemma 2. For the random variables $\{P_i\}_{i=1}^{r}$ defined in (2.8), their moment generating functions for $s'>0$ can be bounded as
\[
\mathbb{E}\left[e^{s'P_i}\right]\le\left(pe^{2s'}+1-p\right)^{\tilde{g}/2}.
\]
Proof. Consider a generic random variable of the form (2.8),
\[
P=\sum_{j=1}^{\tilde{g}}E_j,
\]
where the $E_j$'s are $\mathrm{Bern}(p)$ and possibly dependent. Although the $E_j$'s may not all be independent, the dependency is restricted to pairs of $E_j$'s. In other words, for all $1\le j\le\tilde{g}$, $E_j$ is either independent of all $E_{[\tilde{g}]\setminus\{j\}}$, or is equal to $E_\ell$ for some $\ell\in[\tilde{g}]\setminus\{j\}$ and independent of all $E_{[\tilde{g}]\setminus\{j,\ell\}}$.
By merging the dependent pairs, we can write
\[
P=\sum_{j=1}^{\tilde{g}-J}F_j,
\]
where (i) the $F_j$'s are independent, (ii) $\tilde{g}-2J$ of the $F_j$'s are $\mathrm{Bern}(p)$, and (iii) $J$ of the $F_j$'s are $2\times\mathrm{Bern}(p)$, for some integer $0\le J\le\lfloor\frac{\tilde{g}}{2}\rfloor$. Now, we can bound the moment generating function of $P$. For $s'>0$,
\begin{align*}
\mathbb{E}\left[e^{s'P}\right]&=\mathbb{E}\left[e^{s'\sum_{j=1}^{\tilde{g}-J}F_j}\right]
=\prod_{j=1}^{\tilde{g}-J}\mathbb{E}\left[e^{s'F_j}\right]\\
&=\left(pe^{s'}+1-p\right)^{\tilde{g}-2J}\left(pe^{2s'}+1-p\right)^{J}
=\left[\left(pe^{s'}+1-p\right)^{2}\right]^{\tilde{g}/2-J}\left(pe^{2s'}+1-p\right)^{J}\\
&\overset{(a)}{\le}\left(pe^{2s'}+1-p\right)^{\tilde{g}/2-J}\left(pe^{2s'}+1-p\right)^{J}
=\left(pe^{2s'}+1-p\right)^{\tilde{g}/2},
\end{align*}
where inequality (a) is obtained using Lemma 7 (proof available in Appendix E).

We now complete the proof of Lemma 1. For any $s'>0$, we can write
\[
e^{s'\mathbb{E}[Q]}\le\mathbb{E}\left[e^{s'Q}\right]=\mathbb{E}\left[\max_{i=1,\cdots,r}e^{s'P_i}\right]\le\mathbb{E}\left[\sum_{i=1}^{r}e^{s'P_i}\right]=\sum_{i=1}^{r}\mathbb{E}\left[e^{s'P_i}\right]\le r\left(pe^{2s'}+1-p\right)^{\tilde{g}/2},
\]
where the first inequality is Jensen's inequality and the last inequality follows from Lemma 2. Taking the logarithm of both sides yields
\[
\mathbb{E}[Q]\le\frac{1}{s'}\log(r)+\frac{\tilde{g}}{2s'}\log\left(pe^{2s'}+1-p\right). \tag{2.10}
\]
Let us substitute $s=2s'$ in (2.10). Then,
\[
\mathbb{E}[Q]\le\frac{1}{s}\log(r^2)+\frac{\tilde{g}}{s}\log\left(pe^{s}+1-p\right), \tag{2.11}
\]
for any $s>0$. Let $\bar{p}=1-p$ and pick $s_*=2\sqrt{\frac{\log(r)}{\tilde{g}p\bar{p}}}$. We proceed with the evaluation of the right hand side (RHS) of (2.11) at $s=s_*$. We first recall the following Taylor series:
\[
\log(1+x)=x-\frac{x^2}{2}+\frac{x^3}{3}-\cdots,\ \text{for }x\in(-1,1],\qquad
e^{x}=1+x+\frac{x^2}{2}+\frac{x^3}{3!}+\cdots,\ \text{for }x\in\mathbb{R}.
\]
Let $x=p(e^{s_*}-1)$. It is easy to check that for $p=\omega(\frac{1}{n^2})$, we have $x\to0$ and $s_*\to0$ as $n\to\infty$. Therefore, for $n\to\infty$ we can write
\begin{align*}
\log\left(pe^{s_*}+1-p\right)&=\log(x+1)=x-\frac{x^2}{2}+\frac{x^3}{3}-\cdots\\
&=p(e^{s_*}-1)-\frac{p^2(e^{s_*}-1)^2}{2}+\frac{p^3(e^{s_*}-1)^3}{3}-\cdots\\
&=p\left(s_*+\frac{s_*^2}{2}+\frac{s_*^3}{3!}+\cdots\right)-\frac{p^2}{2}\left(s_*+\frac{s_*^2}{2}+\cdots\right)^2+\cdots\\
&=ps_*+\frac{p\bar{p}}{2}s_*^2+o(ps_*^2).
\end{align*}
Putting everything together, we have
\begin{align*}
\mathbb{E}[Q]&\le\frac{1}{s_*}\log(r^2)+\frac{\tilde{g}}{s_*}\log\left(pe^{s_*}+1-p\right)
=\frac{1}{s_*}\log(r^2)+\frac{\tilde{g}}{s_*}\left(ps_*+\frac{p\bar{p}}{2}s_*^2+o(ps_*^2)\right)\\
&=\frac{1}{s_*}\log(r^2)+\tilde{g}p+\frac{\tilde{g}p\bar{p}}{2}s_*+o(\tilde{g}ps_*)
=\tilde{g}p+2\sqrt{\tilde{g}p\bar{p}\log(r)}+o\left(\sqrt{\tilde{g}p}\right).
\end{align*}
Recall that $\tilde{g}=\frac{n^2}{K\binom{K}{r}}$, which is a deterministic function of $n$.
Therefore, we choose $p=\omega(\frac{1}{n^2})$ to have $\tilde{g}p=\omega(1)$ and thus $\sqrt{\tilde{g}p\bar{p}\log(r)}=\Theta(\sqrt{\tilde{g}p})=o(\tilde{g}p)$. Therefore, $\mathbb{E}[Q]\le p\tilde{g}+o(p\tilde{g})$ as $n\to\infty$.

2.5 Converse for the Erdös-Rényi Model

In this section, we prove the asymptotic optimality of our proposed coded scheme for the Erdös-Rényi model by leveraging the techniques employed in [123]. More precisely, we complete the proof of Theorem 1 by deriving a lower bound on the best average communication load for the Erdös-Rényi model that matches the achievability shown above.

Let $G$ be an $\mathrm{ER}(n,p)$ random graph and consider a subgraph and Reduce allocation $A=(\mathcal{M},\mathcal{R})\in\mathcal{A}(r)$, where $\sum_{k=1}^{K}|\mathcal{M}_k|=rn$ and $|\mathcal{R}_k|=\frac{n}{K}$ for all $k\in[K]$. We denote the number of files that are Mapped at $j$ servers under the Map assignment $\mathcal{M}$ as $a^{j}_{\mathcal{M}}$, for all $j\in[K]$. The following lemma holds.

Lemma 3. $\mathbb{E}_{\mathcal{G}}\left[L_A(r,G)\right]\ge p\sum_{j=1}^{K}\frac{a^{j}_{\mathcal{M}}}{n}\cdot\frac{K-j}{Kj}$.

Proof. We let the intermediate values $v_{i,j}$ be realizations of random variables $V_{i,j}$, uniformly distributed over $\mathbb{F}_{2^T}$. For a random graph $G=(V,E)$ and subsets $\mathcal{I},\mathcal{J}\subseteq V=[n]$, define $V^G_{\mathcal{I},\mathcal{J}}=\{V_{i,j}:(i,j)\in E,\ i\in\mathcal{I},\ j\in\mathcal{J}\}$ as the set of present intermediate values in graph $G$ corresponding to the Reducers in $\mathcal{I}$ and the Mappers in $\mathcal{J}$. For a given allocation $A=(\mathcal{M},\mathcal{R})\in\mathcal{A}(r)$ and a subset of servers $\mathcal{S}\subseteq[K]$, we define $X_{\mathcal{S}}=\{X_k:k\in\mathcal{S}\}$ and $Y^G_{\mathcal{S}}=(V^G_{\mathcal{R}_{\mathcal{S}},:},V^G_{:,\mathcal{M}_{\mathcal{S}}})$, where ":" denotes all possible indices (which depend on both the allocation and the graph realization). As described in Section 2.2.2, each coded message is a function of the present intermediate values Mapped at the corresponding server. Moreover, all the intermediate values required by the Reducers are decodable from the locally available intermediate values and the received messages at the corresponding server. That is, $H(X_k|V^G_{:,\mathcal{M}_k})=0$ and $H(V^G_{\mathcal{R}_k,:}|X_{[K]},V^G_{:,\mathcal{M}_k})=0$ for all servers $k\in[K]$ and graphs $G$. We denote the number of vertices that are exclusively Mapped by $j$ servers in $\mathcal{S}$ as $a^{j,\mathcal{S}}_{\mathcal{M}}$, that is,
\[
a^{j,\mathcal{S}}_{\mathcal{M}}:=\sum_{\mathcal{S}_1\subseteq\mathcal{S}:|\mathcal{S}_1|=j}\left|\left(\cap_{k\in\mathcal{S}_1}\mathcal{M}_k\right)\setminus\left(\cup_{k'\notin\mathcal{S}_1}\mathcal{M}_{k'}\right)\right|.
\]
We prove the following claim by induction.

Claim 1. For any subset $\mathcal{S}\subseteq[K]$,
\[
\mathbb{E}_{\mathcal{G}}\left[H(X_{\mathcal{S}}|Y^G_{\mathcal{S}^c})\right]\ge pT\sum_{j=1}^{|\mathcal{S}|}a^{j,\mathcal{S}}_{\mathcal{M}}\frac{n}{K}\cdot\frac{|\mathcal{S}|-j}{j}. \tag{2.12}
\]
Proof. (i) If $\mathcal{S}=\{k\}$ for any $k\in[K]$, then for any graph $G$ we have $H(X_{\mathcal{S}}|Y^G_{\mathcal{S}^c})\ge0$. Therefore,
\[
\mathbb{E}_{\mathcal{G}}\left[H(X_{\mathcal{S}}|Y^G_{\mathcal{S}^c})\right]\ge0=pT\sum_{j=1}^{1}a^{1,\mathcal{S}}_{\mathcal{M}}\frac{n}{K}\cdot\frac{1-1}{1}.
\]
(ii) Assume that claim (2.12) holds for all subsets of size $S_0$. For any subset $\mathcal{S}\subseteq[K]$ of size $S_0+1$, the following steps hold:
\begin{align}
H(X_{\mathcal{S}}|Y^G_{\mathcal{S}^c})&=\frac{1}{|\mathcal{S}|}\sum_{k\in\mathcal{S}}H(X_{\mathcal{S}},X_k|Y^G_{\mathcal{S}^c})
=\frac{1}{|\mathcal{S}|}\sum_{k\in\mathcal{S}}\left(H(X_{\mathcal{S}}|X_k,Y^G_{\mathcal{S}^c})+H(X_k|Y^G_{\mathcal{S}^c})\right) \tag{2.13}\\
&\ge\frac{1}{|\mathcal{S}|}\sum_{k\in\mathcal{S}}H(X_{\mathcal{S}}|X_k,Y^G_{\mathcal{S}^c})+\frac{1}{|\mathcal{S}|}H(X_{\mathcal{S}}|Y^G_{\mathcal{S}^c}), \tag{2.14}
\end{align}
where (2.14) follows from (2.13) using the chain rule and conditional entropy relations. Simplifying (2.14), using $|\mathcal{S}|-1=S_0$, and noting that $X_k$ is a function of $V^G_{:,\mathcal{M}_k}$, we have the following:
\[
H(X_{\mathcal{S}}|Y^G_{\mathcal{S}^c})\ge\frac{1}{S_0}\sum_{k\in\mathcal{S}}H(X_{\mathcal{S}}|V^G_{:,\mathcal{M}_k},Y^G_{\mathcal{S}^c}). \tag{2.15}
\]
Moreover,
\[
H(X_{\mathcal{S}}|V^G_{:,\mathcal{M}_k},Y^G_{\mathcal{S}^c})=H(V^G_{\mathcal{R}_k,:}|V^G_{:,\mathcal{M}_k},Y^G_{\mathcal{S}^c})+H(X_{\mathcal{S}}|V^G_{:,\mathcal{M}_k},V^G_{\mathcal{R}_k,:},Y^G_{\mathcal{S}^c}). \tag{2.16}
\]
We can lower bound the expected value of the first term on the RHS of (2.16) as follows:
\begin{align}
\mathbb{E}_{\mathcal{G}}\left[H(V^G_{\mathcal{R}_k,:}|V^G_{:,\mathcal{M}_k},Y^G_{\mathcal{S}^c})\right]&=\mathbb{E}_{\mathcal{G}}\left[\sum_{v\in\mathcal{R}_k}H(V^G_{\{v\},:}|V^G_{\{v\},\mathcal{M}_k\cup\mathcal{M}_{\mathcal{S}^c}})\right]\nonumber\\
&=\mathbb{E}_{\mathcal{G}}\left[\sum_{v\in\mathcal{R}_k}\left(|N(v)|-|N(v)\cap(\mathcal{M}_k\cup\mathcal{M}_{\mathcal{S}^c})|\right)\right]T\nonumber\\
&=\frac{n}{K}pT\sum_{j=0}^{S_0}a^{j,\mathcal{S}\setminus\{k\}}_{\mathcal{M}}\ \ge\ \frac{n}{K}pT\sum_{j=1}^{S_0}a^{j,\mathcal{S}\setminus\{k\}}_{\mathcal{M}}. \tag{2.17}
\end{align}
The expected value of the second term on the RHS of (2.16) can be lower bounded using the induction assumption:
\[
\mathbb{E}_{\mathcal{G}}\left[H(X_{\mathcal{S}}|V^G_{:,\mathcal{M}_k},V^G_{\mathcal{R}_k,:},Y^G_{\mathcal{S}^c})\right]=\mathbb{E}_{\mathcal{G}}\left[H(X_{\mathcal{S}\setminus\{k\}}|Y^G_{(\mathcal{S}\setminus\{k\})^c})\right]\ge pT\sum_{j=1}^{S_0}a^{j,\mathcal{S}\setminus\{k\}}_{\mathcal{M}}\frac{n}{K}\cdot\frac{S_0-j}{j}. \tag{2.18}
\]
Putting (2.15), (2.16), (2.17), and (2.18) together, we have
\begin{align*}
\mathbb{E}_{\mathcal{G}}\left[H(X_{\mathcal{S}}|Y^G_{\mathcal{S}^c})\right]&\ge\frac{1}{S_0}\sum_{k\in\mathcal{S}}\mathbb{E}_{\mathcal{G}}\left[H(X_{\mathcal{S}}|V^G_{:,\mathcal{M}_k},Y^G_{\mathcal{S}^c})\right]\\
&=\frac{1}{S_0}\sum_{k\in\mathcal{S}}\left(\mathbb{E}_{\mathcal{G}}\left[H(V^G_{\mathcal{R}_k,:}|V^G_{:,\mathcal{M}_k},Y^G_{\mathcal{S}^c})\right]+\mathbb{E}_{\mathcal{G}}\left[H(X_{\mathcal{S}}|V^G_{:,\mathcal{M}_k},V^G_{\mathcal{R}_k,:},Y^G_{\mathcal{S}^c})\right]\right)\\
&\ge\frac{1}{S_0}\sum_{k\in\mathcal{S}}\left(\frac{n}{K}pT\sum_{i=1}^{S_0}a^{i,\mathcal{S}\setminus\{k\}}_{\mathcal{M}}+pT\sum_{j=1}^{S_0}a^{j,\mathcal{S}\setminus\{k\}}_{\mathcal{M}}\frac{n}{K}\cdot\frac{S_0-j}{j}\right)\\
&=pT\sum_{j=1}^{S_0}\frac{n}{K}\cdot\frac{1}{j}\sum_{k\in\mathcal{S}}a^{j,\mathcal{S}\setminus\{k\}}_{\mathcal{M}}
=pT\sum_{j=1}^{S_0+1}a^{j,\mathcal{S}}_{\mathcal{M}}\frac{n}{K}\cdot\frac{S_0+1-j}{j}.
\end{align*}
(iii) Therefore, claim (2.12) holds for any subset $\mathcal{S}\subseteq[K]$.

Now, pick $\mathcal{S}=[K]$.
Then,
\[
\mathbb{E}_{\mathcal{G}}\left[L_A(r,G)\right]\ge\frac{\mathbb{E}_{\mathcal{G}}\left[H(X_{\mathcal{S}}|Y^G_{\mathcal{S}^c})\right]}{n^2T}\ge p\sum_{j=1}^{K}\frac{a^{j}_{\mathcal{M}}}{n}\cdot\frac{K-j}{Kj}.
\]

Proof of Converse for Theorem 1. First, we use the result in Claim 1 and bound the best average normalized communication load as follows:
\[
L^*(r)\ge\inf_{A}\mathbb{E}_{\mathcal{G}}\left[L_A(r,G)\right]\ge\inf_{A}\ p\sum_{j=1}^{K}\frac{a^{j}_{\mathcal{M}}}{n}\cdot\frac{K-j}{Kj},
\]
where the infimum is over all subgraph and Reduce allocations $A=(\mathcal{M},\mathcal{R})\in\mathcal{A}(r)$ for which $\sum_{k=1}^{K}|\mathcal{M}_k|=rn$ and $|\mathcal{R}_k|=\frac{n}{K}$, $\forall k\in[K]$. Additionally, for any Map allocation with computation load $r$, we have the following equations:
\[
\sum_{j=1}^{K}a^{j}_{\mathcal{M}}=n,\qquad\sum_{j=1}^{K}ja^{j}_{\mathcal{M}}=rn. \tag{2.19}
\]
Using the convexity of $\frac{K-j}{Kj}$ in $j$ and (2.19), the converse is proved as follows:
\[
L^*(r)\ge\inf_{A}\ p\sum_{j=1}^{K}\frac{a^{j}_{\mathcal{M}}}{n}\cdot\frac{K-j}{Kj}\ge\inf_{A}\ p\,\frac{K-\sum_{j=1}^{K}j\frac{a^{j}_{\mathcal{M}}}{n}}{K\sum_{j=1}^{K}j\frac{a^{j}_{\mathcal{M}}}{n}}=\frac{1}{r}p\left(1-\frac{r}{K}\right).
\]

2.6 Achievability for the Power Law Model

We consider a general model for random graphs where the expected degree sequence $d=(d_1,\cdots,d_n)$ is drawn independently from a power law distribution with exponent $\gamma$, i.e., $\Pr[d_i=d]=cd^{-\gamma}$ for $i\in[n]$, $d\ge1$ and a proper constant $c$ [30]. Given the realization of the expected degrees $d$, for $\rho=\frac{1}{\sum_{i=1}^{n}d_i}$ and all $i,j\in[n]$, vertices $i$ and $j$ are connected with probability $p_{i,j}=\mathbb{P}[(i,j)\in E]=\rho d_id_j$, independently of the other edges. We now proceed to analyze the coded and uncoded communication loads, averaged over the random connections and random degrees, induced by the subgraph and Reduce allocation $A_C$ proposed in Section 2.4.1.

Consider the allocation $A_C=(\mathcal{M},\mathcal{R})$ and a subset of servers $\mathcal{S}\subseteq[K]$ of size $|\mathcal{S}|=r+1$. According to the proposed scheme in Section 2.4.1, every server $s\in\mathcal{S}$ forms a table and constructs coded messages using segments of the intermediate values in the sets $\mathcal{Z}^k_{\mathcal{S}\setminus\{k\}}$ (defined in (2.6)), where $k\in\mathcal{S}\setminus\{s\}$. Therefore, $r+1$ tables are formed, each contributing coded messages of total size $\max_{k\in\mathcal{S}\setminus\{s\}}|\mathcal{Z}^k_{\mathcal{S}\setminus\{k\}}|\cdot\frac{T}{r}$ bits.
The total coded load induced by the subset $\mathcal{S}$ (and exclusively for the use of the servers in $\mathcal{S}$), denoted by $L^{C}_{A_C}(\mathcal{S})$, is
\[
L^{C}_{A_C}(\mathcal{S})=\frac{1}{n^2r}\sum_{s\in\mathcal{S}}\max_{k\in\mathcal{S}\setminus\{s\}}|\mathcal{Z}^k_{\mathcal{S}\setminus\{k\}}|.
\]
In the uncoded scenario, however, the total uncoded load induced by the subset $\mathcal{S}$ (and exclusively for the use of the servers in $\mathcal{S}$), denoted by $L^{UC}_{A_C}(\mathcal{S})$, is
\[
L^{UC}_{A_C}(\mathcal{S})=\frac{1}{n^2}\sum_{s\in\mathcal{S}}|\mathcal{Z}^s_{\mathcal{S}\setminus\{s\}}|.
\]
We have
\[
|\mathcal{Z}^s_{\mathcal{S}\setminus\{s\}}|=\sum_{i\in\mathcal{R}_s}|N(i)\cap(\cap_{k'\in\mathcal{S}\setminus\{s\}}\mathcal{M}_{k'})|=\sum_{\substack{i\in\mathcal{R}_s\\ m\in\cap_{k'\in\mathcal{S}\setminus\{s\}}\mathcal{M}_{k'}}}\mathbb{1}\{(i,m)\in E\}, \tag{2.20}
\]
where the random Bernoulli variable $\mathbb{1}\{(i,m)\in E\}$ indicates the realization of the edge connecting vertices $i$ and $m$, i.e., $\mathbb{E}[\mathbb{1}\{(i,m)\in E\}\,|\,d]=\rho d_id_m$. We note that $|\mathcal{R}_s|=n/K$ and $|\cap_{k'\in\mathcal{S}\setminus\{s\}}\mathcal{M}_{k'}|=n/\binom{K}{r}$. Therefore, there are $\tilde{g}=\frac{n^2}{K\binom{K}{r}}$ Bernoulli summands in (2.20), in which every two summands are either independent, or equal and independent of the other summands. More precisely, (2.20) can be decomposed into a sum of all-independent Bernoulli random variables and a sum of the dependent ones as follows:
\[
|\mathcal{Z}^s_{\mathcal{S}\setminus\{s\}}|=\sum_{F_1\text{ or }F_2\text{ or }F_3}\mathbb{1}\{(i,m)\in E\}+2\sum_{\substack{i,m\in\mathcal{R}_s\cap(\cap_{k'\in\mathcal{S}\setminus\{s\}}\mathcal{M}_{k'})\\ i<m}}\mathbb{1}\{(i,m)\in E\}, \tag{2.21}
\]
where we denote the events
\begin{align*}
F_1&:=\{i\in\mathcal{R}_s\setminus\cap_{k'\in\mathcal{S}\setminus\{s\}}\mathcal{M}_{k'},\ m\in\cap_{k'\in\mathcal{S}\setminus\{s\}}\mathcal{M}_{k'}\},\\
F_2&:=\{i\in\mathcal{R}_s,\ m\in\cap_{k'\in\mathcal{S}\setminus\{s\}}\mathcal{M}_{k'}\setminus\mathcal{R}_s\},\\
F_3&:=\{i=m\in\mathcal{R}_s\cap(\cap_{k'\in\mathcal{S}\setminus\{s\}}\mathcal{M}_{k'})\}.
\end{align*}
Note that with this decomposition, all the Bernoulli summands in both terms of (2.21) are independent. Assume that the first and second terms of (2.21) contain $\tilde{g}-2J$ and $J$ summands, respectively. According to Kolmogorov's strong law of large numbers (Proposition 1, provided at the end of this section), and given that the second condition in the proposition is satisfied for Bernoulli random variables, we have
\[
\frac{1}{\tilde{g}-2J}\sum_{F_1\text{ or }F_2\text{ or }F_3}\left(\mathbb{1}\{(i,m)\in E\}-\mathbb{E}[\rho d_id_m]\right)\xrightarrow{a.s.}0,
\]
and
\[
\frac{1}{J}\sum_{\substack{i,m\in\mathcal{R}_s\cap(\cap_{k'\in\mathcal{S}\setminus\{s\}}\mathcal{M}_{k'})\\ i<m}}\left(\mathbb{1}\{(i,m)\in E\}-\mathbb{E}[\rho d_id_m]\right)\xrightarrow{a.s.}0.
\]
Therefore, the size of the set $\mathcal{Z}^s_{\mathcal{S}\setminus\{s\}}$ converges almost surely; that is,
\begin{align*}
\frac{1}{\tilde{g}}\left(|\mathcal{Z}^s_{\mathcal{S}\setminus\{s\}}|-\mathbb{E}\left[|\mathcal{Z}^s_{\mathcal{S}\setminus\{s\}}|\right]\right)&=\frac{\tilde{g}-2J}{\tilde{g}}\cdot\frac{1}{\tilde{g}-2J}\sum_{F_1\text{ or }F_2\text{ or }F_3}\left(\mathbb{1}\{(i,m)\in E\}-\mathbb{E}[\rho d_id_m]\right)\\
&\quad+\frac{2J}{\tilde{g}}\cdot\frac{1}{J}\sum_{\substack{i,m\in\mathcal{R}_s\cap(\cap_{k'\in\mathcal{S}\setminus\{s\}}\mathcal{M}_{k'})\\ i<m}}\left(\mathbb{1}\{(i,m)\in E\}-\mathbb{E}[\rho d_id_m]\right)\xrightarrow{a.s.}0,
\end{align*}
where
\[
\mathbb{E}\left[|\mathcal{Z}^s_{\mathcal{S}\setminus\{s\}}|\right]=\sum_{\substack{i\in\mathcal{R}_s\\ m\in\cap_{k'\in\mathcal{S}\setminus\{s\}}\mathcal{M}_{k'}}}\mathbb{E}[\rho d_id_m]=\mathbb{E}\left[\rho\,\mathrm{vol}(\mathcal{R}_s)\,\mathrm{vol}(\cap_{k'\in\mathcal{S}\setminus\{s\}}\mathcal{M}_{k'})\right],
\]
and $\mathrm{vol}(\mathcal{V})=\sum_{v\in\mathcal{V}}d_v$ for any subset of vertices $\mathcal{V}\subseteq[n]$. Moreover,
\[
\lim_{n\to\infty}\frac{n}{\tilde{g}}\mathbb{E}\left[|\mathcal{Z}^s_{\mathcal{S}\setminus\{s\}}|\right]=\lim_{n\to\infty}\mathbb{E}\left[(\rho n)\cdot\frac{1}{n/K}\mathrm{vol}(\mathcal{R}_s)\cdot\frac{1}{n/\binom{K}{r}}\mathrm{vol}(\cap_{k'\in\mathcal{S}\setminus\{s\}}\mathcal{M}_{k'})\right]. \tag{2.22}
\]
Each of the terms $\mathrm{vol}(\mathcal{R}_s)$, $\mathrm{vol}(\cap_{k'\in\mathcal{S}\setminus\{s\}}\mathcal{M}_{k'})$ and the inverse of $\rho$ is a sum of i.i.d. power law random variables, for which the expected value exists for $\gamma>2$ with $\mathbb{E}[d_1]=\frac{\gamma-1}{\gamma-2}$. Therefore, by the strong law of large numbers (Proposition 1), each term approaches its average almost surely; that is, for $\gamma>2$,
\[
\frac{1}{n/K}\mathrm{vol}(\mathcal{R}_s)\xrightarrow{a.s.}\mathbb{E}[d_1]=\frac{\gamma-1}{\gamma-2},\qquad
\frac{1}{n/\binom{K}{r}}\mathrm{vol}(\cap_{k'\in\mathcal{S}\setminus\{s\}}\mathcal{M}_{k'})\xrightarrow{a.s.}\mathbb{E}[d_1]=\frac{\gamma-1}{\gamma-2},\qquad
\rho n\xrightarrow{a.s.}\frac{1}{\mathbb{E}[d_1]}=\frac{\gamma-2}{\gamma-1}.
\]
Plugging into (2.22), we have $\lim_{n\to\infty}\frac{n}{\tilde{g}}\mathbb{E}\left[|\mathcal{Z}^s_{\mathcal{S}\setminus\{s\}}|\right]=\frac{\gamma-1}{\gamma-2}$. Therefore, $\frac{n}{\tilde{g}}|\mathcal{Z}^s_{\mathcal{S}\setminus\{s\}}|\xrightarrow{a.s.}\frac{\gamma-1}{\gamma-2}$ for any $s\in\mathcal{S}$ and $\mathcal{S}\subseteq[K]$. Putting everything together, for $\gamma>2$ we have
\[
\lim_{n\to\infty}n\mathbb{E}[L^{UC}_{A_C}(\mathcal{S})]=\lim_{n\to\infty}\frac{n}{n^2}\sum_{s\in\mathcal{S}}\mathbb{E}\left[|\mathcal{Z}^s_{\mathcal{S}\setminus\{s\}}|\right]=\frac{1}{K\binom{K}{r}}\lim_{n\to\infty}\sum_{s\in\mathcal{S}}\frac{n}{\tilde{g}}\mathbb{E}\left[|\mathcal{Z}^s_{\mathcal{S}\setminus\{s\}}|\right]=\frac{r+1}{K\binom{K}{r}}\cdot\frac{\gamma-1}{\gamma-2}.
\]
Therefore, denoting by $L^{UC}_{A_C}$ the total uncoded communication load, we have
\[
\lim_{n\to\infty}n\mathbb{E}[L^{UC}_{A_C}]=\lim_{n\to\infty}\sum_{\substack{\mathcal{S}\subseteq[K]\\ |\mathcal{S}|=r+1}}n\mathbb{E}[L^{UC}_{A_C}(\mathcal{S})]=\binom{K}{r+1}\frac{r+1}{K\binom{K}{r}}\cdot\frac{\gamma-1}{\gamma-2}=\left(1-\frac{r}{K}\right)\frac{\gamma-1}{\gamma-2}.
\]
For the coded scheme, we have
\[
\lim_{n\to\infty}n\mathbb{E}[L^{C}_{A_C}(\mathcal{S})]=\lim_{n\to\infty}\frac{n}{n^2r}\sum_{s\in\mathcal{S}}\mathbb{E}\left[\max_{k\in\mathcal{S}\setminus\{s\}}|\mathcal{Z}^k_{\mathcal{S}\setminus\{k\}}|\right]\le\lim_{n\to\infty}\frac{n(r+1)}{n^2r}\mathbb{E}\left[\max_{s\in\mathcal{S}}|\mathcal{Z}^s_{\mathcal{S}\setminus\{s\}}|\right]=\frac{r+1}{rK\binom{K}{r}}\cdot\frac{\gamma-1}{\gamma-2}. \tag{2.23}
\]
The last equality follows from the fact that $\frac{n}{\tilde{g}}\max_{s\in\mathcal{S}}|\mathcal{Z}^s_{\mathcal{S}\setminus\{s\}}|\xrightarrow{a.s.}\frac{\gamma-1}{\gamma-2}$, since $\frac{n}{\tilde{g}}|\mathcal{Z}^s_{\mathcal{S}\setminus\{s\}}|$ converges almost surely for any $s\in\mathcal{S}$.
Plugging into (2.23), the expected coded load satisfies
\[
\lim_{n\to\infty}n\mathbb{E}[L^{C}_{A_C}]=\lim_{n\to\infty}n\sum_{\substack{\mathcal{S}\subseteq[K]\\ |\mathcal{S}|=r+1}}\mathbb{E}[L^{C}_{A_C}(\mathcal{S})]\le\binom{K}{r+1}\frac{r+1}{rK\binom{K}{r}}\cdot\frac{\gamma-1}{\gamma-2}=\frac{1}{r}\left(1-\frac{r}{K}\right)\frac{\gamma-1}{\gamma-2},
\]
which yields
\[
\lim_{n\to\infty}\frac{nL^*(r)}{\left(\frac{\gamma-1}{\gamma-2}\right)}\le\lim_{n\to\infty}\frac{n\mathbb{E}[L^{C}_{A_C}]}{\left(\frac{\gamma-1}{\gamma-2}\right)}\le\frac{1}{r}\left(1-\frac{r}{K}\right).
\]
Comparing the coded load with the uncoded load proves the achievability of the gain $r$ for the power law model.

Proposition 1 (Kolmogorov's Strong Law of Large Numbers [57, 131]). Let $X_1,X_2,\cdots,X_n,\cdots$ be a sequence of independent random variables with $|\mathbb{E}[X_n]|<\infty$ for $n\ge1$. Then
\[
\frac{1}{n}\sum_{i=1}^{n}\left(X_i-\mathbb{E}[X_i]\right)\xrightarrow{a.s.}0,
\]
if one of the following conditions is satisfied:
1. the $X_i$'s are identically distributed;
2. $\forall n$, $\mathrm{var}(X_n)<\infty$ and $\sum_{n=1}^{\infty}\frac{\mathrm{var}(X_n)}{n^2}<\infty$.

2.7 Experiments over Amazon EC2 Clusters

In this section, we demonstrate the practical impact of our proposed coded scheme via experiments over Amazon EC2 clusters. We first present our implementation choices and experimental scenarios. Then, we discuss the results and provide some remarks. The implementation codes are available at [1].

2.7.1 Implementation Details

We implement one iteration of the popular PageRank algorithm (Example 2.2.1), for a real-world graph as well as artificially generated graphs. For the real-world dataset, we use TheMarker Cafe dataset [67]. For generating artificial graph datasets, we use the Erdös-Rényi model, where each edge in the graph is present with probability $p$. We consider the following three scenarios:

• Scenario 1: We use a subgraph of size $n=69360$ of TheMarker Cafe dataset [67]. The computing cluster consists of $K=6$ servers and one master, with a communication bandwidth of 100 Mbps at each server.

• Scenario 2: We generate a graph using the Erdös-Rényi model with $n=12600$ vertices and $p=0.3$. The computing cluster consists of $K=10$ servers and one master, with a communication bandwidth of 100 Mbps at each server.

• Scenario 3: We generate a graph using the Erdös-Rényi model with $n=90090$ vertices and $p=0.01$.
The computing cluster consists of $K=15$ servers and one master, with a communication bandwidth of 100 Mbps at each server.

For each scenario, we carry out the PageRank implementation for different values of the computation load $r$. The case of $r=1$ corresponds to the conventional PageRank implementation, where each vertex $i\in V=[n]$ is stored at exactly one server and $\mathcal{M}_k=\mathcal{R}_k$ for each server $k\in[K]$, i.e., the Map and Reduce tasks associated with any vertex $i$ take place at the same server. For $r>1$, we increase the computation load until the overall execution time starts increasing.

We now describe our implementation choices. We use Python with the mpi4py package. In all of our experiments, the master is of type r4.large and the servers are of type m4.large. For Scenario 2 and Scenario 3, we use a sample from the Erdös-Rényi model; this sampling is carried out using a c4.8xlarge server instance. For each scenario, the graphs are processed and the subgraph allocation is done as a pre-processing step. For $r=1$, the graph is partitioned into smaller instances which have equal numbers of vertices. Each such partition consists of two Python lists: one that contains the vertices that will be Mapped by the corresponding server, and another that contains the neighborhood information of each vertex to be Mapped. The position of the neighborhood tuple in the neighborhood list is the same as the position of the corresponding vertex in the vertex list, so that one can iterate over the two together during the Map stage. For $r>1$, the graph is divided into $\binom{K}{r}$ batches, where each batch consists of equal numbers of vertices. Then each batch is included in the subgraph of the corresponding set of $r$ servers. This way, we obtain a computation load of $r$.

The overall execution consists of the following phases:

1. Map: Without loss of generality, the rank of each vertex is initialized to $\frac{1}{n}$.
Each server goes over its subgraph and Maps the rank associated with a vertex to the intermediate values that are required by the neighboring vertices during the Reduce stage. Each intermediate value consists of a key-value pair, where the key is an integer storing the vertex ID, while the value is a real number storing the associated value. Based on the vertex ID, the intermediate value is associated with the partition where the vertex is Reduced, which is obtained by hashing the vertex ID. For each partition, a separate list is created for storing keys and values.
2. Encode/Pack: In conventional PageRank, no encoding is done, as the transfer of intermediate values is done directly. For $r > 1$, coded multicast packets are created using the proposed encoding scheme. Transmission data is serialized before Shuffling.
3. Shuffle: At any time, only one server is allowed to use the network for transmission. In conventional PageRank, each server unicasts its message to different servers, while for $r > 1$, the communication takes place in multicast groups. For any multicast group, each server takes its turn to broadcast its message to all the remaining servers in the group.
4. Unpack/Decode: The messages received during the Shuffle phase are de-serialized. For $r > 1$, each server decodes the coded packets received from other servers in accordance with the proposed coded scheme to recover the intermediate values. After the decoding phase, all intermediate values that are needed for the Reduce phase are available at the servers.
5. Reduce: Each server goes over the set of vertices that it needs to Reduce and updates the corresponding PageRank values. In conventional PageRank, for any vertex $i \in \mathcal{V}$, the Map and Reduce operations associated with it are done at the same server. Therefore, no further data transmission is needed to communicate the updated ranks for the Map phase in the next iteration.
In the proposed coded scheme, message passing is done in order to transmit the updated PageRanks to the Mappers. Next, we discuss the results of our experiments.

2.7.2 Experimental Results

The overall execution times for the three scenarios are presented in Fig. 2.8.² We make the following observations from the results:
• As demonstrated in Fig. 2.8(a), the maximum gain for Scenario 1 is obtained with a computation load of $r = 5$. Our proposed scheme achieves a speedup of 43.4% over the conventional PageRank implementation ($r = 1$) and a speedup of 25.5% over the single-server implementation ($r = 6$).
• For Scenarios 2 and 3, the optimal gain is obtained for $r = 4$, after which the overall execution time increases due to saturation of the gain in Shuffling time and a large Map time. As demonstrated by Fig. 2.8(b) and Fig. 2.8(c), our proposed scheme achieves speedups of 50.8% and 41.8% for Scenarios 2 and 3, respectively, in comparison to conventional PageRank.
• As demonstrated by Fig. 2.8, the Shuffle phase dominates the overall execution time in the naive implementation of PageRank. By increasing the computation load, our proposed coded scheme leverages extra computing in the Map phase to slash the Shuffle phase, thus speeding up the overall execution time.
• Theoretically, we demonstrated that by increasing the computation load by $r$, we slash the expected communication load in the Shuffle phase by nearly $r$. Here, we empirically observe that due to the large size of the graph model, a similar trade-off between computation load and communication load holds for each sample of the graph model as well.

² The Map time includes the time spent in the Encode/Pack stage, while the Unpack stage is combined with the Reduce phase.

[Figure 2.8: Overall execution times for distributed PageRank implementation for different computation loads for the three scenarios: (a) Scenario 1, (b) Scenario 2, (c) Scenario 3.]
• While the Map time increases almost linearly with $r$, the overall gain begins to saturate, since the Shuffle time does not decrease linearly with $r$. This is because as we increase $r$, the overheads in multicast data transmissions increase and start to dominate the overall Shuffling time. Furthermore, the time for unicasting one packet is smaller than the time for broadcasting the same packet to multiple servers [109].

Remark 14. The overall execution time can be approximated as follows:
$$T_{\text{Total}}(r) \approx rT_{\text{Map}} + T_{\text{Shuffle}}/r + T_{\text{Reduce}}, \quad (2.24)$$
where $T_{\text{Map}}$, $T_{\text{Shuffle}}$ and $T_{\text{Reduce}}$ are the Map, Shuffle and Reduce times for the naive MapReduce implementation. For selecting the computation load for the coded implementation, one heuristic [123] is to choose the integer nearest to the minimizer $r^*$ of (2.24), where
$$r^* = \sqrt{\frac{T_{\text{Shuffle}}}{T_{\text{Map}}}} = \arg\min_r T_{\text{Total}}(r).$$
For instance, in Scenario 2, $T_{\text{Map}} = 1.649$, $T_{\text{Shuffle}} = 43.78$ and $r^* = 5.15$. As demonstrated by Fig. 2.8(b), a computation load of $r = 5$ gives close to the optimal performance attained at $r = 4$.

2.8 Conclusion

We described a mathematical model for graph-based MapReduce computations and demonstrated how coding-theoretic strategies can be employed to substantially reduce the communication load in distributed graph analytics. Our results reveal that an inverse-linear trade-off exists between computation load and communication load in distributed graph processing. This trade-off can be used to leverage additional computing resources and capabilities to alleviate the costly communication bottleneck in distributed graph processing systems.

As a key contribution of this work, we developed a novel coding scheme that systematically injects structured redundancy in the computation phase to enable coded multicasting opportunities during message exchange between servers, reducing the communication load substantially in large-scale graph processing.
For theoretical analysis, we considered random graph models, and proved that our proposed scheme enables an asymptotically inverse-linear trade-off between computation load and average normalized communication load for two popular random graph models: the Erdös-Rényi model and the power law model. Furthermore, for the Erdös-Rényi model, we provided a matching converse, showing the optimality of our proposed scheme. We also carried out experiments over Amazon EC2 clusters to corroborate our claims using real-world as well as artificial graphs, demonstrating speedups of up to 50.8% in the overall execution time of PageRank over the conventional approach.

Additionally, we specialized our coded scheme and extended our theoretical results to two other random graph models: the random bi-partite model and the stochastic block model. Our specialized schemes asymptotically enable inverse-linear trade-offs between computation and communication loads in distributed graph processing for these popular random graph models as well. We complemented the achievability results with converse bounds for both of these models.

One of the major differences from prior frameworks such as Pregel is the use of combiners before Shuffling [137], where the intermediate values that are Mapped at any server are combined at the server depending on the target Reducer computations. Our proposed schemes can be applied on top of combiners, and it is an interesting future direction to explore this in detail. The case with fully connected graphs can be solved using the scheme proposed in the recent work of [122], which shows that the coding gain can be achieved on top of the gain from combiners. For the general MapReduce computation model considered in [123], the proposed scheme in [122] utilizes the techniques of combiners as well as coding across intermediate results, which provides a Shuffling gain that is multiplicative in the gains from combiners and coding.
Furthermore, we focused on subgraph allocation and Reduce allocation schemes that are oblivious to graph realizations. Our motivation came from popular graph processing frameworks such as Pregel [137], where the partitioning of graphs is based solely on the vertex ID and not on the vertex neighborhood density. Also, designing subgraph allocation, Reduce allocation and Shuffling schemes for characterizing the minimum communication load in (2.2) is NP-hard in general. It might, however, be an interesting future direction to explore the development of coded schemes that allocate resources after looking at the graph.

Chapter 3
Coded Computation over Heterogeneous Clusters

3.1 Introduction

General distributed computing frameworks, such as MapReduce [41] and Spark [235], along with the availability of large-scale commodity servers, such as Amazon EC2, have made it possible to carry out large-scale data analytics at the production level. These "virtualized data centers" enjoy an abundance of storage space and computing power, and are cheaper to rent by the hour than maintaining dedicated data centers round the year. However, these systems suffer from various forms of "system noise" which reduce their efficiency: system failures, limited communication bandwidth, straggler nodes, etc.

The current state-of-the-art approaches to mitigate the impact of system noise in cloud computing environments involve the creation of some form of "computation redundancy". For example, replicating the straggling task on another available node is a common approach to deal with stragglers [236], while partial data replication is also used to reduce the communication load in distributed computing [159]. However, recent results have demonstrated that coding can play a transformational role in creating and exploiting computation redundancy to effectively alleviate the impact of system noise.
In particular, two coding concepts have been proposed to deal with the communication and straggler bottlenecks in distributed computing.

The first coding concept, introduced in [118, 123, 124], enables an inverse-linear tradeoff between computation load and communication load in distributed computing. This result implies that increasing the computation load by a factor of $r$ (i.e., evaluating each computation at $r$ carefully chosen nodes) can create novel coding opportunities that reduce the required communication load for computing by the same factor $r$. Hence, these codes can be utilized to pool the underutilized computing resources at the network edge to slash the communication load of Fog computing [21]. Other related works tackling the communication bottleneck in distributed computation include [111, 190, 94, 14, 60].

In the second coding concept, introduced in [111], an inverse-linear tradeoff between computation load and computation latency (i.e., the overall job response time) is established for distributed matrix multiplication in homogeneous computing environments. More specifically, this approach utilizes coding to effectively inject redundant computations that alleviate the effects of stragglers and speed up the computations. Hence, by utilizing more computation resources, this approach can significantly speed up distributed computing applications. A number of related works have been proposed recently to mitigate stragglers in distributed computation. In [55], the authors propose the use of redundant short dot products to speed up the distributed computation of linear transforms. The work in [199] proposes coding schemes for mitigating stragglers in distributed batch gradient computation. Coding schemes for high-dimensional matrix-matrix multiplication have been developed in [231, 61, 113, 215, 232]. Techniques for efficient straggler mitigation for matrix-vector computation in distributed wireless settings have been developed in [167].
In [112], the potential of the multicore nature of computing machines is studied. In [66], the authors propose an anytime approach to distributed computing, developing an approximate matrix multiplication scheme. The authors in [195] propose a novel encoding scheme for achieving large sparsity in the encoded matrix. The work in [8] develops a coding strategy for mitigating straggling decoders in cloud radio access networks. Speeding up the computation of linear transformations with unreliable components is studied in [225]. Straggler mitigation through data encoding in distributed optimization is proposed in [89]. A coded scheme based on LT codes is proposed in [182] for multiplying a matrix by a set of vectors in a distributed computing environment. Addressing stragglers has attracted a lot of attention in queuing-based frameworks for large-scale computation as well [7, 212]. These works utilize the technique of dynamically replicating tasks in a careful manner to minimize run-time.

We extend the problem of distributed matrix multiplication in homogeneous clusters in [111] to heterogeneous environments. As discussed in [236], the computing environments in virtualized data centers are heterogeneous, and algorithms based on homogeneous assumptions can result in significant performance reduction. In this paper, we focus on general heterogeneous distributed computing clusters consisting of a variety of computing machines with different capabilities. Specifically, we propose a coding framework for speeding up distributed matrix multiplication in heterogeneous clusters with straggling servers, named Heterogeneous Coded Matrix Multiplication (HCMM). Matrix multiplication is a crucial computation module in many engineering and scientific disciplines. In particular, it is a fundamental component of many popular machine learning algorithms such as logistic regression, reinforcement learning and gradient descent-based algorithms.
Implementations that speed up matrix multiplication would naturally speed up the execution of a wide variety of popular algorithms. Therefore, we envision HCMM playing a fundamental role in speeding up big data analytics in virtualized data centers by leveraging the wide range of computing capabilities provided by these heterogeneous environments.

We now describe the main ideas behind HCMM, which results in asymptotically optimal performance. In a coded implementation of distributed matrix-vector multiplication, each worker node is assigned the task of computing inner products of the assigned coded rows with the input vector, where the assigned coded rows are random linear combinations of the rows of the original matrix. The computation time at each worker is a random variable, which is first assumed to have a shifted exponential distribution; we later generalize it to a shifted Weibull distribution. The master node receives the results from the worker nodes and aggregates them until it receives a decodable set of inner products and recovers the matrix-vector multiplication. We are interested in finding the optimal load allocation that minimizes the expected waiting time to complete this computation. However, due to heterogeneity, finding the exact solution to this optimization problem appears intractable.

As the main contribution of the paper, we propose an alternative optimization that focuses on maximizing the expected number of computation results returned from the workers. Apart from being computationally tractable, the alternative optimization asymptotically approximates the problem of finding the optimal computation load allocation. Specifically, we develop the HCMM algorithm, derived as a solution to the alternative formulation, and prove that it is asymptotically optimal. Furthermore, we prove that given a heterogeneous cluster of $n$ workers, HCMM is $\Theta(\log n)$ times faster than uncoded schemes under the shifted exponential distribution for run-time.
We further generalize the proposed HCMM algorithm to the shifted Weibull model and provide similar unbounded gains over uncoded scenarios.

In addition to proving the asymptotic optimality of HCMM, we carry out numerical studies and experiments over Amazon EC2 clusters to demonstrate how HCMM can be used in practice. We compare HCMM with three benchmark schemes: Uniform Uncoded, Load-balanced Uncoded, and Uniform Coded. In our numerical analysis, HCMM results in significant speedups of up to 73%, 56% and 42% over the three aforementioned benchmark schemes, respectively. In experiments using Amazon EC2 clusters, we use Luby transform (LT) codes for coding and demonstrate that HCMM combined with LT codes significantly reduces the overall execution time in comparison to uncoded and coded schemes. In particular, HCMM achieves gains of up to 61%, 46% and 36%, respectively, over Uniform Uncoded, Load-balanced Uncoded and Uniform Coded. Furthermore, the overall computation load of HCMM is less than that of Uniform Coded. Our results demonstrate that HCMM combines the benefits of both the Load-balanced Uncoded and Uniform Coded schemes by achieving efficient load balancing along with a minimal number of redundant computations.

Furthermore, we consider the problem of load allocation under budget constraints, considering an intuitive and convincing pricing model. In particular, we show that HCMM is the (asymptotically) optimal load allocation in feasible budget-constrained scenarios as well, and determine whether a budget-constrained computation task is feasible given a cluster of machines. We then develop a heuristic algorithm to find the (sub)optimal load allocations using the proposed HCMM scheme. The heuristic is based on the observation that given a computation task and a set of machines, decreasing the number of fastest machines participating in HCMM results in a smaller average cost.

Notation. We denote by $[n]$ the set $\{1, \cdots, n\}$ for any $n \in \mathbb{N}$.
For non-negative sequences $g(n)$ and $h(n)$, we denote $g(n) = O(h(n))$ if there exist constants $c > 0$ and $n_0 \in \mathbb{N}$ such that $g(n) \le c \cdot h(n)$ for all $n > n_0$; and $g(n) = \Theta(h(n))$ if $g(n) = O(h(n))$ and $h(n) = O(g(n))$. Moreover, we write $g(n) = o(h(n))$ if $\lim_{n\to\infty} g(n)/h(n) = 0$.

3.2 Problem Formulation and Main Results

In this section, we describe our computation model, the network model and the precise problem formulation. We then conclude with four theorems highlighting the main contributions of the paper.

3.2.1 Computation Model

We consider the problem of matrix-vector multiplication, in which, given a matrix $A \in \mathbb{R}^{r \times m}$ for some positive integers $r$ and $m$, we want to compute the output $y = Ax$ for an input vector $x \in \mathbb{R}^m$. Due to limited computing power, the computation cannot be carried out at a single server and a distributed implementation is required. As an example, consider a matrix $A$ with an even number of rows and two computing nodes. The matrix can be divided into two equally tall matrices $A_1$ and $A_2$, each stored at a different worker node. The master node receives the input $x$ and broadcasts it to the two worker nodes. These nodes then compute $y_1 = A_1 x$ and $y_2 = A_2 x$ locally and return their results to the master node, which combines them to obtain the intended outcome $y = [y_1; y_2] = Ax$. This example also illustrates an uncoded implementation of distributed computing, in which results from all the worker nodes are required to recover the final result. We now present the formal definition of Coded Distributed Computation.

Definition 3. (Coded Distributed Computation) The coded distributed implementation of a computation task $f_A(\cdot)$ is specified by:
• local data blocks $\langle A_i \rangle_{i=1}^{n}$ and local computation tasks $\langle f^i_{A_i}(\cdot) \rangle_{i=1}^{n}$;
• a decoding function that outputs $f_A(\cdot)$ given the results from a decodable set of local computations.
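Definition 3 can be made concrete for matrix-vector multiplication with random linear coding, which is the setting studied in this section. The following is a minimal single-process sketch, not the dissertation's actual distributed implementation: the worker count, load split, and arrival order are illustrative assumptions, and the encode/decode steps mirror the description given in this section.

```python
import numpy as np

# A minimal sketch of coded distributed matrix-vector multiplication with
# random linear coding. Workers, loads, and arrival order are illustrative.
rng = np.random.default_rng(0)

r, m = 6, 4                      # A has r rows; any r inner products suffice
A = rng.standard_normal((r, m))
x = rng.standard_normal(m)

loads = [3, 3, 2]                # rows per worker; sum > r gives redundancy
S = [rng.standard_normal((l, r)) for l in loads]   # i.i.d. N(0,1) coding matrices
results = [S_i @ (A @ x) for S_i in S]             # worker i returns A_i x = S_i A x

# Master: collect coded inner products in some arrival order until r arrive.
arrival = [(i, j) for i in range(len(loads)) for j in range(loads[i])]
rows, vals = [], []
for i, j in arrival[:r]:         # the first r results to arrive
    rows.append(S[i][j])
    vals.append(results[i][j])

S_r = np.vstack(rows)            # aggregated r x r coding matrix, full rank w.p. 1
y = np.linalg.solve(S_r, np.array(vals))   # recover Ax from z = S_r (Ax)
assert np.allclose(y, A @ x)
```

Because the coding matrices have continuous i.i.d. entries, any $r$ collected rows form an invertible system with probability 1, so the decode step succeeds regardless of which worker results arrive first.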
For matrix-vector multiplication tasks in particular, the local data blocks $A_i \in \mathbb{R}^{\ell_i \times m}$ are matrices consisting of coded combinations of the rows of $A$, for non-negative integers $\ell_i$. To assign the computation tasks to each worker, we use random linear combinations of the $r$ rows of the matrix $A$, such that the master node can recover the result $Ax$ from any $r$ inner products received from the worker nodes with probability 1. As an example, if worker $i$ is assigned a matrix-vector multiplication with matrix size $\ell_i \times m$, it will compute $\ell_i$ inner products of the assigned coded rows of $A$ with $x$. The master node waits for the first $r$ inner products and uses them to decode the required output. In order to ensure the recovery of the output from any $r$ inner products received from the workers, we pick the computation matrix assigned to worker $i$ as $A_i = S_i A$, where $S_i \in \mathbb{R}^{\ell_i \times r}$ is a coding matrix with i.i.d. $\mathcal{N}(0,1)$ entries. Worker $i$ computes $A_i x$ and returns the result to the master node. Upon receiving $r$ inner products, the aggregated results at the master will be of the form $z = S_{(r)} Ax$, where $S_{(r)} \in \mathbb{R}^{r \times r}$ is the aggregated coding matrix, and it is full-rank with probability 1 [176]. Therefore, the master node can recover $Ax = S_{(r)}^{-1} z$ with probability 1.¹,²

3.2.2 Network Model

The network model is based on the master-worker setup illustrated in Fig. 3.1. The master node receives an input $x$ and broadcasts it to all the workers. Each worker computes its assigned set of computations and unicasts the result to the master node. The master node aggregates the results from the worker nodes until it receives a decodable set of computations and recovers the output $Ax$.

Figure 3.1: Master-worker setup of the computing clusters: The master node receives the input vector $x$ and broadcasts it to all the worker nodes.
Upon receiving the input, worker node $i$ starts computing the inner products of the input vector with the locally assigned rows, i.e., $y_i = A_i x$, and unicasts the output vector $y_i$ to the master node upon completing the computation. The results are aggregated at the master node until $r$ inner products are received and the desired output $Ax$ is recovered.

We denote by $T_i$ the random variable representing the task run-time at node $i$, and assume that the run-times $T_1, \cdots, T_n$ are mutually independent. We consider the distribution of the run-time random variables to be exponential, and later generalize it to the Weibull distribution. More specifically, we consider a 2-parameter shifted exponential distribution for the execution time of each worker, i.e., the CDF of the execution time $T_i$ of worker node $i$, loaded with $\ell_i$ row vectors, is as follows:
$$\Pr[T_i \le t] = 1 - e^{-\frac{\mu_i}{\ell_i}(t - a_i \ell_i)}, \quad (3.1)$$
for $t \ge a_i \ell_i$ and $i \in [n]$, where $a_i > 0$ is the shift parameter and $\mu_i > 0$ denotes the straggling parameter associated with worker node $i$.

¹ Although we consider random linear coding in our theoretical analysis, other codes such as Maximum-Distance Separable (MDS) codes and Luby transform (LT) codes are compatible with HCMM as well, given a decodable set of results at the master. For example, in the MDS case, the entries in the coding matrices $\{S_i\}_{i=1}^{n}$ are drawn from a finite field. Specifically, one can encode the rows of $A$ using a $(\sum_{i=1}^{n} \ell_i, r)$ MDS code and assign $\ell_i$ coded rows to worker node $i$. The output $Ax$ can be recovered from the inner products of any $r$ coded rows with the input vector $x$. Furthermore, to implement the ideas developed in this work, we use LT codes in our experiments over Amazon EC2 clusters.
² Instead of i.i.d. Gaussian, we could use any continuous distribution for the random entries, since the Schwartz-Zippel lemma ensures that such a random matrix is full-rank with high probability.
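The model (3.1) says that $T_i$ equals a deterministic shift $a_i \ell_i$ plus an exponential random variable of rate $\mu_i/\ell_i$, so both the shift and the mean scale linearly with the load. A minimal sampling sketch (the parameter values are illustrative, not fitted values from the experiments):

```python
import numpy as np

# Draw task run-times from the shifted exponential model (3.1):
# T = a*l + Exp(rate mu/l), i.e. shift a*l and scale l/mu. Values illustrative.
rng = np.random.default_rng(1)

def sample_runtime(load, a, mu, size=1, rng=rng):
    """Samples with CDF 1 - exp(-(mu/load) * (t - a*load)) for t >= a*load."""
    return a * load + rng.exponential(scale=load / mu, size=size)

samples = sample_runtime(load=500, a=0.002, mu=2.0, size=200_000)
print(samples.min() >= 0.002 * 500)                             # support check
print(np.isclose(samples.mean(), 0.002 * 500 + 500 / 2.0, rtol=0.02))  # E[T] = a*l + l/mu
```

Both checks reflect the two roles of the parameters: $a_i\ell_i$ sets the earliest possible finish time, while $\ell_i/\mu_i$ sets the average stochastic delay on top of it.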
The shifted exponential model for computation time, which is the sum of a constant (deterministic) term and a variable (stochastic) term, is motivated by the distribution model proposed by the authors in [129] for the latency of querying data files from cloud storage systems. As demonstrated in [111], as well as by our own experiments, the exponential model provides a good fit for the distribution of computation times over cloud computing environments such as Amazon EC2 clusters. Moreover, these experiments confirm the assumption that, as a first-order approximation, both the shift and mean parameters of the shifted exponential distribution scale linearly with the load size.

We further generalize the analysis to the shifted Weibull distribution in Section 3.4, where we consider a 3-parameter shifted Weibull distribution for the execution time of each worker. That is, the CDF of the task run-time at worker node $i$, loaded with $\ell_i$ row vectors, is as follows:
$$\Pr[T_i \le t] = 1 - e^{-\left(\frac{\mu_i}{\ell_i}(t - a_i \ell_i)\right)^{\alpha_i}}, \quad (3.2)$$
for $t \ge a_i \ell_i$ and $i \in [n]$, where $a_i > 0$ denotes the shift parameter, $\mu_i > 0$ is the straggling parameter and $\alpha_i > 0$ represents the shape parameter associated with worker $i$. A similar model has been considered in [56] as well.

3.2.3 Problem Formulation

We consider the problem of using a cluster of $n$ worker nodes for distributedly computing the matrix-vector multiplication $Ax$, where $A$ is a size $r \times m$ matrix for positive integers $r$ and $m$. Let $\ell = (\ell_1, \cdots, \ell_n)$ be the load allocation vector, where $\ell_i$ denotes the number of
rows assigned to worker node $i$. Let $T_{\text{CMP}}$ be the random variable denoting the waiting time for receiving a decodable set of results, i.e., at least $r$ inner products. We aim at finding the optimal load allocation vector that minimizes the average waiting time by solving the following optimization problem:
$$\mathcal{P}_{\text{main}}: \quad \underset{\ell}{\text{minimize}} \quad \mathbb{E}[T_{\text{CMP}}]. \quad (3.3)$$
For a homogeneous cluster, a coded solution can be achieved by dividing $A$ into $k$ equal-size submatrices and applying an $(n, k)$ MDS code to these submatrices. The master node can then obtain the final result from any $k$ responses. In [111], the authors find the optimal $k$ that minimizes the average running time for the shifted exponential run-time model.

For heterogeneous clusters, however, assigning equal loads to servers is clearly not optimal. Moreover, directly finding the optimal solution to $\mathcal{P}_{\text{main}}$ is hard. In homogeneous clusters, the problem of finding a sufficient number of inner products can be mapped to the problem of finding the waiting time for a set of fastest responses, and thus closed-form expressions for the expected computation time can be found using order statistics of i.i.d. run-times. However, this is not straightforward in heterogeneous clusters, where the load allocation is non-uniform. In Section 3.3, we present an alternative formulation to $\mathcal{P}_{\text{main}}$ in (3.3), and show that the solution to the alternative formulation, which we shall name HCMM, is tractable and provably asymptotically optimal.

Assumptions. From now onward, we consider the practically relevant regime where the size of the problem scales linearly with the size of the network, while the computing power and the storage capacity of each worker node remain constant. Specifically, we assume $r = \Theta(n)$, $a_i = \Theta(1)$, $\mu_i = \Theta(1)$ and $\alpha_i = \Theta(1)$ for each worker $i$.

3.2.4 Main Results

Having set up the model and formulation of the problem, we now present the main contributions of this paper. The following theorem characterizes the asymptotic optimality of HCMM for the shifted exponential run-time model.

Theorem 5. Let $T_{\text{HCMM}}$ be the random variable denoting the finish time of the HCMM algorithm, and let $T_{\text{OPT}}$ be the random variable representing the finish time of the optimal algorithm obtained by solving $\mathcal{P}_{\text{main}}$.
Then, for shifted exponential run-times as in (3.1) with constant parameters $a_i = \Theta(1)$ and $\mu_i = \Theta(1)$ for each worker $i \in [n]$ and $r = \Theta(n)$, we have
$$\lim_{n\to\infty} \mathbb{E}[T_{\text{HCMM}}] = \lim_{n\to\infty} \mathbb{E}[T_{\text{OPT}}].$$

Remark 15. Theorem 5 demonstrates that our proposed HCMM algorithm is asymptotically optimal as the number of workers $n$ approaches infinity. In other words, the optimal computation load allocation problem $\mathcal{P}_{\text{main}}$ in (3.3) can be optimally solved using the proposed HCMM algorithm as $n$ gets large.

Remark 16. We note that $\mathcal{P}_{\text{main}}$ in (3.3) is a hard combinatorial optimization problem, since it requires checking all load combinations to minimize the overall expected execution time. The key idea in Theorem 5 is to consider an alternative formulation to (3.3) that focuses on maximizing the expected number of computation results returned from the workers, i.e., maximizing the aggregate return. As we describe in Section 3.3, the alternative optimization problem not only can be solved efficiently in a tractable way, giving rise to the HCMM algorithm, but also asymptotically approximates $\mathcal{P}_{\text{main}}$, allowing us to establish Theorem 5.

Remark 17. While Theorem 5 theoretically characterizes the optimality of our proposed HCMM scheme, we also demonstrate the gains that one can obtain in practice. In particular, we carry out numerical studies and experiments over Amazon EC2 clusters demonstrating that HCMM can provide significant gains in a wide variety of computing scenarios. Specifically, we compare HCMM's performance with three benchmark load allocation policies: Uniform Uncoded, Load-balanced Uncoded, and Uniform Coded. In numerical studies, HCMM achieves speedups of up to 71% over Uniform Uncoded, up to 53% over Load-balanced Uncoded, and up to 39% over Uniform Coded. In EC2 experiments, HCMM combined with Luby transform (LT) codes provides speedups of up to 61%, 46% and 36% over Uniform Uncoded, Load-balanced Uncoded and Uniform Coded, respectively.

Theorem 6.
Let $T_{\text{UC}}$ denote the completion time of the uncoded distributed matrix multiplication algorithm. Then, for shifted exponential run-times with constant parameters and $r = \Theta(n)$,
$$\frac{\mathbb{E}[T_{\text{UC}}]}{\mathbb{E}[T_{\text{HCMM}}]} = \Theta(\log n).$$

Remark 18. As Theorem 6 shows, our proposed HCMM guarantees an improvement of $\Theta(\log n)$ in expected execution time over any uncoded scheme, including the one that optimally allocates the workers' loads. This result illustrates that by leveraging coded computing, one achieves the same order-wise gain over heterogeneous clusters as over homogeneous clusters [111].

Although Theorems 5 and 6 are based on the shifted exponential model (3.1) for the workers' run-time random variables, our analyses are general and can be extended to other models. The following two theorems generalize the results to the case where the execution time of each worker follows the Weibull distribution described in (3.2).

Theorem 7. For the shifted Weibull distribution of run-times with constant parameters $a_i = \Theta(1)$, $\mu_i = \Theta(1)$ and $\alpha_i = \Theta(1)$ for each worker $i \in [n]$ and $r = \Theta(n)$, the proposed HCMM algorithm is asymptotically optimal, i.e.,
$$\lim_{n\to\infty} \mathbb{E}[T_{\text{HCMM}}] = \lim_{n\to\infty} \mathbb{E}[T_{\text{OPT}}].$$

Theorem 8. Under the Weibull distribution for run-times with constant parameters and $r = \Theta(n)$, the proposed HCMM scheme unboundedly outperforms the uncoded scheme, i.e.,
$$\frac{\mathbb{E}[T_{\text{UC}}]}{\mathbb{E}[T_{\text{HCMM}}]} \ge \Theta\big((\log n)^{1/\tilde{\alpha}}\big),$$
where $\tilde{\alpha} = \max_{i\in[n]} \alpha_i$ is the largest shape parameter among the workers.

Remark 19. As stated in Theorem 8, HCMM provides an unbounded gain over any uncoded scheme, including the optimal uncoded load allocation, under the Weibull distribution for the workers' run-times. Furthermore, our numerical simulations demonstrate speedups of up to 73%, 56% and 42% over Uniform Uncoded, Load-balanced Uncoded and Uniform Coded, respectively.

In the following section, we describe our alternative formulation based on the aggregate return, and describe our proposed HCMM algorithm that solves the alternative optimization.
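The intuition behind Theorem 6 is that an uncoded scheme must wait for all $n$ workers (its completion time is a maximum of $n$ run-times, which grows like $\log n$ for exponential tails), whereas a coded scheme with redundancy only waits until $r$ inner products arrive. This can be seen in a toy Monte Carlo sketch; the homogeneous workers, loads, and parameter values below are illustrative simplifications for exposition, not the paper's heterogeneous setup or its actual analysis.

```python
import numpy as np

# Toy Monte Carlo illustration: uncoded waits for the slowest of n workers,
# coded waits only until enough coded inner products have arrived.
rng = np.random.default_rng(2)

n, trials = 100, 2000
a, mu = 1.0, 1.0     # identical workers for simplicity; load 1 row each (uncoded)

# Uncoded: each worker holds 1 of the n rows; finish when the slowest returns.
T = a + rng.exponential(1.0 / mu, size=(trials, n))
uncoded_time = T.max(axis=1).mean()

# Coded: each worker holds 2 coded rows (load 2, so shift and scale double per
# the run-time model). The master is done once the fastest n/2 workers finish,
# since together they supply the n coded inner products needed for decoding.
T2 = 2 * a + rng.exponential(2.0 / mu, size=(trials, n))
coded_time = np.sort(T2, axis=1)[:, n // 2 - 1].mean()

print(uncoded_time > coded_time)   # coded finishes sooner on average
```

Even though every coded worker does twice the work (and hence has a larger shift), avoiding the tail of the slowest workers more than compensates, which is the effect the theorem quantifies.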
3.3 The Proposed HCMM Scheme and Proofs of Theorems 5 and 6

In this section, we prove Theorems 5 and 6 for the exponential model (3.1). In particular, we start by describing the HCMM algorithm and show that it asymptotically achieves the optimal performance, as stated in Theorem 5, and conclude the section by characterizing the gain of HCMM over the uncoded scheme. To derive HCMM, we start by reformulating $\mathcal{P}_{\text{main}}$ defined in (3.3) and show that the alternative formulation can be efficiently solved, as opposed to solving $\mathcal{P}_{\text{main}}$, which requires an exhaustive search over all possible load allocations. The solution to the alternative problem gives rise to HCMM. We further prove the optimality of HCMM and compare its average run-time to uncoded schemes.

3.3.1 Alternative Formulation of $\mathcal{P}_{\text{main}}$ via Maximal Aggregate Return

Consider an $n$-tuple load allocation $\ell = (\ell_1, \cdots, \ell_n)$ and let $t$ be a feasible time for computation, i.e., $t \ge \max_i \{a_i \ell_i\}$. The number of equations received at the master node from worker $i \in [n]$ by time $t$ is a random variable, $X_i(t) = \ell_i \mathbb{1}_{\{T_i \le t\}}$, where $T_i$ is the random execution time of machine $i$ when assigned the load $\ell_i$, and $\mathbb{1}_{\{\cdot\}}$ denotes the indicator function. Then, the aggregate return at the master node at time $t$ is
$$X(t) = \sum_{i=1}^{n} X_i(t).$$
We propose the following two-step alternative formulation for $\mathcal{P}_{\text{main}}$ defined in (3.3). First, for a fixed feasible time $t$, we maximize the aggregate return over different load allocations, i.e., we solve
$$\mathcal{P}^{(1)}_{\text{alt}}: \quad \ell^*(t) = \arg\max_{\ell}\ \mathbb{E}[X(t)]. \quad (3.4)$$
Then, given the load allocation $\ell^*(t) = \big(\ell_1^*(t), \cdots, \ell_n^*(t)\big)$ obtained from $\mathcal{P}^{(1)}_{\text{alt}}$, we find the smallest time $t$ such that, with high probability, there is enough aggregate return at the master node by time $t$, i.e., we solve
$$\mathcal{P}^{(2)}_{\text{alt}}: \quad \text{minimize } t \quad \text{subject to} \quad \Pr\big[X^*(t) < r\big] = o\left(\frac{1}{n}\right), \quad (3.5)$$
where $X^*(t)$ is the aggregate return at time $t$ for the load allocation obtained from $\mathcal{P}^{(1)}_{\text{alt}}$, that is,
$$X^*(t) = \sum_{i=1}^{n} X_i^*(t) = \sum_{i=1}^{n} \ell_i^*(t)\, \mathbb{1}_{\{T_i \le t\}}.$$
From now onward, we denote the solution to $\mathcal{P}^{(2)}_{\text{alt}}$ by $t^*$, and hence $\boldsymbol{\ell}^*(t^*)$ denotes the solution to the two-step alternative formulation in (3.4) and (3.5), which gives rise to our proposed HCMM scheme described next.

3.3.2 Solving the Alternative Formulation

Considering the exponential distribution for the workers' run-times, we first solve $\mathcal{P}^{(1)}_{\text{alt}}$ in (3.4). The expected number of equations aggregated at the master node at time $t$ is
$$\mathbb{E}[X(t)]=\sum_{i=1}^{n}\mathbb{E}[X_i(t)]=\sum_{i=1}^{n}\ell_i\left(1-e^{-\frac{\mu_i}{\ell_i}(t-a_i\ell_i)}\right).$$
Since there is no constraint coupling the load allocations, $\mathcal{P}^{(1)}_{\text{alt}}$ decomposes into $n$ decoupled optimization problems,
$$\ell^*_i(t)=\arg\max_{\ell_i}\ \mathbb{E}[X_i(t)],\qquad(3.6)$$
for all workers $i\in[n]$. The solution to (3.6) satisfies the optimality condition
$$\frac{\partial}{\partial\ell_i}\mathbb{E}[X_i(t)]=1-e^{-\frac{\mu_i}{\ell_i}(t-a_i\ell_i)}\left(\frac{\mu_i t}{\ell_i}+1\right)=0,$$
which yields
$$\ell^*_i(t)=\frac{t}{\lambda_i},\qquad(3.7)$$
where $\lambda_i=\Theta(1)$ is a constant independent of $t$ and is the positive solution to the equation
$$e^{\mu_i\lambda_i}=e^{a_i\mu_i}(\mu_i\lambda_i+1).$$
One can easily check that the condition $t\geq a_i\ell^*_i(t)$ holds for all $i$ as well. Moreover, we denote by $t^*$ the solution to $\mathcal{P}^{(2)}_{\text{alt}}$. We now define the HCMM load allocation as
$$\ell^*_i(t^*)=\frac{t^*}{\lambda_i},\qquad(3.8)$$
for all workers $i$. In the following, we formally define the HCMM algorithm, which is simply the solution to $\mathcal{P}_{\text{alt}}$.

Remark 20. We note that in order to implement any load allocation scheme, each worker must be assigned an integer number of rows as its computation load.
However, the load allocation $\ell^*_i(t^*)$ given by the HCMM scheme in Algorithm 1 is a real number for each worker $i$, so one needs to round the result before proceeding with experiments. In practical scenarios, $\ell^*_i(t^*)$ is fairly large, e.g., on the order of 100 row vectors; therefore, the effect of rounding the load allocations is insignificant.

Algorithm 1: Heterogeneous Coded Matrix Multiplication (HCMM)
Input: computation time parameters $(a_i,\mu_i)$ for each worker $i$ (for the shifted Weibull distribution, the parameters $(a_i,\mu_i,\alpha_i)$ are taken as inputs)
Output: computation load assigned to each worker $i$
1: procedure HCMM
2:   solve $\mathcal{P}^{(1)}_{\text{alt}}$ for any feasible $t$
3:   obtain $\ell^*_i(t)=t/\lambda_i$
4:   solve $\mathcal{P}^{(2)}_{\text{alt}}$ and obtain $t^*$
5:   return $\ell^*_i(t^*)=t^*/\lambda_i$ row vector computations for worker $i$
6: end procedure

We now provide an approximation to $t^*$ and show that it asymptotically converges to $t^*$. The expected aggregate return at time $t$ for the optimal loads obtained in (3.7) is
$$\mathbb{E}[X^*(t)]=\sum_{i=1}^{n}\ell^*_i(t)\left(1-e^{-\frac{\mu_i}{\ell^*_i(t)}(t-a_i\ell^*_i(t))}\right)=\sum_{i=1}^{n}\frac{t}{\lambda_i}\left(1-e^{-\mu_i\lambda_i\left(1-\frac{a_i}{\lambda_i}\right)}\right)=ts,\qquad(3.9)$$
where
$$s=\sum_{i=1}^{n}\frac{1}{\lambda_i}\left(1-e^{-\mu_i\lambda_i\left(1-\frac{a_i}{\lambda_i}\right)}\right)=\sum_{i=1}^{n}\frac{\mu_i}{1+\mu_i\lambda_i}=\Theta(n),$$
since $\mu_i=\Theta(1)$ and $\lambda_i=\Theta(1)$. Let $\tau^*$ be the solution to the following equation when solved for $t$:
$$\mathbb{E}[X^*(t)]=\sum_{i=1}^{n}\ell^*_i(t)\left(1-e^{-\frac{\mu_i}{\ell^*_i(t)}(t-a_i\ell^*_i(t))}\right)=r.\qquad(3.10)$$
In other words, $\tau^*$ is the time at which, on average, exactly $r$ inner products are aggregated at the master node when the workers are loaded according to (3.7). Using (3.7), (3.9) and (3.10), we find that
$$\tau^*=\frac{r}{s}=\Theta(1),\qquad(3.11)$$
$$\ell^*_i(\tau^*)=\frac{\tau^*}{\lambda_i}=\frac{r}{s\lambda_i}=\Theta(1).\qquad(3.12)$$
We now present the following lemma, which shows that $\tau^*$ converges to $t^*$ for large $n$ (see Appendix for the proof).

Lemma 4. Let $t^*$ be the solution to the alternative formulation $\mathcal{P}_{\text{alt}}$ in (3.4)-(3.5) and $\tau^*$ be the solution to (3.10). Then, $\tau^*\leq t^*\leq\tau^*+o(1)$.

3.3.3 Asymptotic Optimality of HCMM

In this subsection, we prove the asymptotic optimality of HCMM as claimed in Theorem 5.
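Before the proof, note that the loads of Algorithm 1 in the asymptotic regime can be computed directly: each $\lambda_i$ is the root of $e^{\mu_i\lambda_i}=e^{a_i\mu_i}(\mu_i\lambda_i+1)$, and $\ell_i=\lceil\tau^*/\lambda_i\rceil$ with $\tau^*=r/s$ per (3.11)-(3.12). A minimal Python sketch (ours, not the dissertation's released code; the bisection on the log form of the $\lambda_i$ equation is an implementation choice):

```python
import math

def solve_lambda(a, mu, tol=1e-12):
    # lambda_i is the root of mu*lam - a*mu = log(1 + mu*lam), the log-domain form
    # of exp(mu*lam) = exp(a*mu) * (mu*lam + 1) from (3.7); the left-hand side of
    # g below is strictly increasing for lam > 0, and g(a) < 0, so bisection works.
    g = lambda lam: mu * lam - a * mu - math.log1p(mu * lam)
    lo, hi = a, a + 1.0
    while g(hi) < 0:                      # grow the bracket until the sign flips
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

def hcmm_loads(r, params):
    # Algorithm 1 with the asymptotic time tau* = r/s from (3.9)-(3.12):
    # s = sum_i mu_i/(1 + mu_i*lambda_i), and worker i gets tau*/lambda_i rows,
    # rounded up to an integer number of rows (Remark 20).
    lambdas = [solve_lambda(a, mu) for a, mu in params]
    s = sum(mu / (1.0 + mu * lam) for (_, mu), lam in zip(params, lambdas))
    tau = r / s
    return [math.ceil(tau / lam) for lam in lambdas]
```

On the 2-mode setting of Section 3.5 ($r=10000$, 50 workers with $(a_i,\mu_i)=(1,1)$ and 50 with $(4,0.5)$), this yields a total load of roughly $1.4r$, consistent with the coding redundancy range reported for HCMM.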
Proof of Theorem 5. Consider the HCMM load assignment in (3.8). Let the random variable $T_{\text{HCMM}}$ denote the finish time associated with this load allocation, i.e., the waiting time to receive at least $r$ inner products from the workers. Let $T_{\max}$ be the random variable denoting the finish time of all the workers under the HCMM load assignment. First, we show that $\mathbb{E}[T_{\text{HCMM}}]\leq t^*+o(1)$. Define the two events
$$\mathcal{E}_1=\{T_{\max}>\Theta(n)\}\quad\text{and}\quad\mathcal{E}_2=\{T_{\text{HCMM}}>t^*\}.$$
Conditioning on these events, we can write
$$\mathbb{E}[T_{\text{HCMM}}]=\mathbb{E}[T_{\text{HCMM}}\,|\,\mathcal{E}_1]\,\mathbb{P}[\mathcal{E}_1]+\mathbb{E}[T_{\text{HCMM}}\,|\,\mathcal{E}_1^c\cap\mathcal{E}_2]\,\mathbb{P}[\mathcal{E}_1^c\cap\mathcal{E}_2]+\mathbb{E}[T_{\text{HCMM}}\,|\,\mathcal{E}_1^c\cap\mathcal{E}_2^c]\,\mathbb{P}[\mathcal{E}_1^c\cap\mathcal{E}_2^c].\qquad(3.13)$$
The second term on the RHS of (3.13) can be bounded as
$$\mathbb{E}[T_{\text{HCMM}}\,|\,\mathcal{E}_1^c\cap\mathcal{E}_2]\,\mathbb{P}[\mathcal{E}_1^c\cap\mathcal{E}_2]=\mathbb{E}[T_{\text{HCMM}}\,|\,T_{\max}\leq\Theta(n),T_{\text{HCMM}}>t^*]\,\mathbb{P}[T_{\max}\leq\Theta(n),T_{\text{HCMM}}>t^*]$$
$$\leq\mathbb{E}[T_{\max}\,|\,T_{\max}\leq\Theta(n),T_{\text{HCMM}}>t^*]\,\mathbb{P}[T_{\text{HCMM}}>t^*]\overset{(a)}{\leq}\Theta(n)\cdot o\!\left(\tfrac{1}{n}\right)=o(1).\qquad(3.14)$$
To prove (a), note that HCMM returns $r$ inner products by time $T_{\text{HCMM}}$, and the aggregate return is increasing in time. Therefore,
$$\mathbb{P}[T_{\text{HCMM}}>t^*]\leq\mathbb{P}[X^*(t^*)<r]=o\!\left(\tfrac{1}{n}\right).$$
Furthermore, we have
$$\mathbb{E}[T_{\max}\,|\,T_{\max}\leq\Theta(n),T_{\text{HCMM}}>t^*]=\frac{1}{\mathbb{P}[T_{\max}\leq\Theta(n),T_{\text{HCMM}}>t^*]}\int_{t_1=0}^{\Theta(n)}\int_{t_2=t^*}^{\infty}t_1\,d\mathbb{P}[T_{\max}\leq t_1,T_{\text{HCMM}}\leq t_2]$$
$$\leq\frac{\Theta(n)}{\mathbb{P}[T_{\max}\leq\Theta(n),T_{\text{HCMM}}>t^*]}\int_{t_1=0}^{\Theta(n)}\int_{t_2=t^*}^{\infty}d\mathbb{P}[T_{\max}\leq t_1,T_{\text{HCMM}}\leq t_2]=\Theta(n).$$
Moreover, the third term on the RHS of (3.13) satisfies
$$\mathbb{E}[T_{\text{HCMM}}\,|\,\mathcal{E}_1^c\cap\mathcal{E}_2^c]\,\mathbb{P}[\mathcal{E}_1^c\cap\mathcal{E}_2^c]\leq\mathbb{E}[T_{\text{HCMM}}\,|\,T_{\max}\leq\Theta(n),T_{\text{HCMM}}\leq t^*]\overset{(b)}{\leq}t^*,\qquad(3.15)$$
where the proof of (b) is similar to that of (a) in (3.14).
For the first term on the RHS of (3.13), we have
$$\mathbb{E}[T_{\text{HCMM}}\,|\,\mathcal{E}_1]\,\mathbb{P}[\mathcal{E}_1]\leq\mathbb{E}[T_{\max}\,|\,T_{\max}>\Theta(n)]\,\mathbb{P}[T_{\max}>\Theta(n)]=\int_{\Theta(n)}^{\infty}t\,f_{\max}(t)\,dt$$
$$\overset{(c)}{\leq}\int_{\Theta(n)}^{\infty}t\,n k_1 e^{-k_1 t}\left(1-e^{-k_1 t}\right)^{n-1}dt\leq\int_{\Theta(n)}^{\infty}n k_1 t\,e^{-k_1 t}\,dt\leq\int_{\Theta(n)}^{\infty}\frac{1}{t^2}\,dt=o(1),\qquad(3.16)$$
for some $k_1=\Theta(1)$ and large enough $n$. To derive inequality (c), we find a stochastic upper bound on $T_{\max}$ by considering $n$ i.i.d. copies of the worker run-time with the largest shift and smallest straggling parameter (both $\Theta(1)$), and use the PDF of the maximum of $n$ i.i.d. exponential random variables. As we later use in the proof of Theorem 7, one can similarly write for the shifted Weibull distribution:
$$\mathbb{E}[T_{\text{HCMM}}\,|\,\mathcal{E}_1]\,\mathbb{P}[\mathcal{E}_1]\leq\int_{\Theta(n)}^{\infty}t\,f_{\max}(t)\,dt\leq\int_{\Theta(n)}^{\infty}n k_1 k_2\,t^{k_2}e^{-k_1 t^{k_2}}\left(1-e^{-k_1 t^{k_2}}\right)^{n-1}dt$$
$$\leq\int_{\Theta(n)}^{\infty}n k_1 k_2\,t^{k_2}e^{-k_1 t^{k_2}}\,dt\leq\int_{\Theta(n)}^{\infty}\frac{1}{t^2}\,dt=o(1),\qquad(3.17)$$
for some constants $k_1$ and $k_2$. Therefore, using (3.14), (3.15) and (3.16) (or (3.17) for the shifted Weibull model) in (3.13), we have $\mathbb{E}[T_{\text{HCMM}}]\leq t^*+o(1)$.

Let $\boldsymbol{\ell}_{\text{OPT}}=(\ell_{\text{OPT},1},\cdots,\ell_{\text{OPT},n})$ denote the optimal load allocation corresponding to $\mathcal{P}_{\text{main}}$ in (3.3) and let $X_{\text{OPT}}(\cdot)$ represent the aggregate return under load allocation $\boldsymbol{\ell}_{\text{OPT}}$. We now prove the following lower bound on the average completion time of the optimal algorithm:
$$\mathbb{E}[T_{\text{OPT}}]\geq t^*-o(1).$$
To this end, we show the following two inequalities:
$$\mathbb{E}[T_{\text{OPT}}]\overset{(d)}{\geq}\tau-\delta_1\overset{(e)}{\geq}t^*-\delta_2-\delta_1,$$
where $\delta_1=\Theta\!\left(\frac{\log n}{\sqrt{n}}\right)$, $\delta_2=\Theta\!\left(\frac{\log n}{\sqrt{n}}\right)$, and $\tau$ is the solution to $\mathbb{E}[X_{\text{OPT}}(\tau)]=r$. We have
$$r-\mathbb{E}[X_{\text{OPT}}(\tau-\delta_1)]=\sum_{i=1}^{n}\ell_{\text{OPT},i}\big(\mathbb{P}[T_i<\tau]-\mathbb{P}[T_i<\tau-\delta_1]\big)=\sum_{i=1}^{n}\ell_{\text{OPT},i}\frac{d}{d\tau}\mathbb{P}[T_i<\tau]\,\delta_1+O(\delta_1^2)=\Theta(n\delta_1)+O(n\delta_1^2)=\Theta(n\delta_1),$$
where we used the fact that $\ell_{\text{OPT},i}=\Theta(1)$. By McDiarmid's inequality (see Appendix for its description), we have
$$\mathbb{P}[X_{\text{OPT}}(\tau-\delta_1)\geq r]=\mathbb{P}\big[X_{\text{OPT}}(\tau-\delta_1)-\mathbb{E}[X_{\text{OPT}}(\tau-\delta_1)]\geq r-\mathbb{E}[X_{\text{OPT}}(\tau-\delta_1)]\big]\leq\exp\left(-\frac{2\big(\mathbb{E}[X_{\text{OPT}}(\tau-\delta_1)]-r\big)^2}{\sum_{i=1}^{n}\ell^2_{\text{OPT},i}}\right)
$$=e^{-\Theta(n\delta_1^2)}=o\!\left(\tfrac{1}{n}\right),$$
which implies inequality (d). We proceed to prove (e) by showing the following two inequalities:
$$\tau\geq\tau^*,\qquad(3.18)$$
$$\tau^*\geq t^*-\delta_2,\qquad(3.19)$$
where $\tau^*$ is obtained in (3.11). Given that HCMM maximizes the expected aggregate return, we have $\mathbb{E}[X^*(t)]\geq\mathbb{E}[X_{\text{OPT}}(t)]$ for every feasible $t$, which implies (3.18). Moreover, Lemma 4 proves (3.19). All in all, we have
$$t^*-o(1)\leq\mathbb{E}[T_{\text{OPT}}]\leq\mathbb{E}[T_{\text{HCMM}}]\leq t^*+o(1),$$
which yields $\lim_{n\to\infty}\mathbb{E}[T_{\text{HCMM}}]=\lim_{n\to\infty}\mathbb{E}[T_{\text{OPT}}]$, and the claim is concluded.

(Footnote 4: We argue that the allocated loads in the optimal coded scheme are all $\Theta(1)$. Without loss of generality, suppose $\ell_{\text{OPT},1}>\Theta(1)$, which implies $\lim_{n\to\infty}\mathbb{P}[T_1<t]=0$ for any $t=\Theta(1)$. We have already exhibited HCMM, a (sub-)optimal algorithm achieving computation time $\tau^*=\Theta(1)$; therefore, the optimal scheme should have a better finishing time $\tau\leq\Theta(1)$. Now assume the load of machine 1 is replaced by $\tilde{\ell}_{\text{OPT},1}=\Theta(1)$. Clearly, for any time $t=\Theta(1)$, the aggregate return for the new set of loads is larger than the former one, almost surely. This contradicts the optimality assumption.)

3.3.4 Comparison with Uncoded Schemes

This subsection provides the proof of Theorem 6 by comparing the performance of HCMM to uncoded schemes. In an uncoded scheme, the redundancy factor is 1; thus, the master node has to wait for the results from all the worker nodes in order to complete the computation.

Proof of Theorem 6. We start by characterizing the expected run-time of the best uncoded scheme. In particular, we show that $\mathbb{E}[T_{\text{UC}}]=\Theta(\log n)$, where $T_{\text{UC}}$ denotes the completion time of the optimal uncoded distributed matrix multiplication algorithm. To do so, we first show that $\mathbb{E}[T_{\text{UC}}]\geq c\log n$ for a constant $c$ independent of $n$. For a set of machines with parameters $\{(a_i,\mu_i)\}_{i=1}^{n}$, let $\tilde{a}=\min_i a_i$ and $\tilde{\mu}=\max_i\mu_i$.
Now, consider another set of $n$ machines in which every machine is replaced with a faster machine with parameters $(\tilde{a},\tilde{\mu})$. Since the computation times of the new set of machines are i.i.d., one can show that the optimal load allocation for these machines is uniform, i.e., $\tilde{\ell}^*_i=\frac{r}{n}$ for every machine $i\in[n]$. Let $\{\tilde{T}_i\}_{i=1}^{n}$ be the i.i.d. shifted exponential random variables denoting the execution times of the new machines, each loaded with $\tilde{\ell}^*_i=\frac{r}{n}$. The CDF of the completion time of each new machine is
$$\mathbb{P}[\tilde{T}_i\leq t]=1-e^{-\frac{\tilde{\mu}}{\tilde{\ell}^*_i}(t-\tilde{a}\tilde{\ell}^*_i)}=1-e^{-\frac{\tilde{\mu}n}{r}\left(t-\frac{\tilde{a}r}{n}\right)},$$
for $t\geq\frac{\tilde{a}r}{n}$, and the expected computation time is
$$\mathbb{E}[\tilde{T}_i]=\frac{r}{n}\left(\tilde{a}+\frac{1}{\tilde{\mu}}\right),$$
for all $i\in[n]$. Since the master needs to wait for all of the machines to return their results, the total run-time is $\tilde{T}_{\text{UC}}=\max_{i\in[n]}\tilde{T}_i$. Therefore,
$$\mathbb{E}[\tilde{T}_{\text{UC}}]=\mathbb{E}\Big[\max_{i\in[n]}\tilde{T}_i\Big]=\frac{\tilde{a}r}{n}+\frac{rH_n}{n\tilde{\mu}},\qquad(3.20)$$
where $H_n=1+\frac{1}{2}+\frac{1}{3}+\cdots+\frac{1}{n}$ is the $n$-th harmonic number. We can further bound (3.20) using the fact that
$$\frac{\tilde{a}r}{n}+\frac{rH_n}{n\tilde{\mu}}\geq\frac{\tilde{a}r}{n}+\frac{r}{n\tilde{\mu}}\log(n+1)\geq c\log n,$$
for a constant $c$ independent of $n$, since $r=\Theta(n)$, $\tilde{a}=\Theta(1)$ and $\tilde{\mu}=\Theta(1)$. All in all, we have the following lower bound for the optimal uncoded scheme:
$$\mathbb{E}[T_{\text{UC}}]\geq\mathbb{E}[\tilde{T}_{\text{UC}}]\geq c\log n.\qquad(3.21)$$
Now consider another set of $n$ machines, where each machine is replaced with a slower one with parameters $(\hat{a},\hat{\mu})$, where $\hat{a}=\max_i a_i$ and $\hat{\mu}=\min_i\mu_i$. By an argument similar to the one used for the lower bound, we can write
$$\mathbb{E}[T_{\text{UC}}]\leq\frac{\hat{a}r}{n}+\frac{rH_n}{n\hat{\mu}}\leq C\log n,\qquad(3.22)$$
for another constant $C$. From (3.21) and (3.22), one can conclude that
$$\mathbb{E}[T_{\text{UC}}]=\Theta(\log n).\qquad(3.23)$$
Further, by Theorem 5 and Lemma 4, we find that
$$\mathbb{E}[T_{\text{HCMM}}]=\Theta(1).\qquad(3.24)$$
Comparing (3.23) with (3.24) demonstrates that HCMM outperforms the best uncoded scheme by a factor of $\Theta(\log n)$, i.e.,
$$\frac{\mathbb{E}[T_{\text{UC}}]}{\mathbb{E}[T_{\text{HCMM}}]}=\Theta(\log n).$$
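The closed form (3.20) is easy to evaluate numerically; the sketch below (ours, with an assumed function name) computes $\mathbb{E}[\tilde{T}_{\text{UC}}]=\frac{\tilde{a}r}{n}+\frac{rH_n}{n\tilde{\mu}}$ and exhibits the $\Theta(\log n)$ growth when $r=\Theta(n)$:

```python
def uncoded_expected_time(n, r, a, mu):
    # E[T~_UC] = a*r/n + r*H_n/(n*mu) from (3.20), for n i.i.d. workers with
    # shifted exponential parameters (a, mu), each loaded uniformly with r/n rows.
    H_n = sum(1.0 / k for k in range(1, n + 1))
    return a * r / n + r * H_n / (n * mu)
```

With $r=n$ and $(a,\mu)=(1,1)$, the value grows from about $6.19$ at $n=100$ to about $10.79$ at $n=10^4$, tracking $1+H_n\approx1+\log n+0.577$.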
3.4 Generalization to the Shifted Weibull Model and Proofs of Theorems 7 and 8

In this section, we consider the shifted Weibull distribution for the workers' execution times, which captures a broader class of run-time models than the exponential distribution. We generalize our proposed HCMM algorithm to the class of shifted Weibull distributed run-times and prove Theorems 7 and 8. More specifically, we argue that the asymptotic optimality of HCMM is derived similarly to the shifted exponential case, and we further show that HCMM provides an asymptotically unbounded gain over uncoded schemes.

A random variable $T$ has a Weibull distribution with shape parameter $\alpha>0$ and scale parameter $\mu>0$, denoted by $T\sim\mathcal{W}(\alpha,\mu)$, if the CDF of $T$ is of the form
$$\mathbb{P}[T\leq t]=1-e^{-(\mu t)^{\alpha}},\qquad t\geq0.$$
The expected value of the Weibull distribution is known to be $\mathbb{E}[T]=\frac{1}{\mu}\Gamma(1+1/\alpha)$, where $\Gamma(\cdot)$ denotes the Gamma function. As stated in Section 3.2.2, we consider the 3-parameter shifted Weibull distribution for the workers' run-times defined in (3.2). The mean run-time of worker $i$ is then $\mathbb{E}[T_i]=a_i\ell_i+\frac{\ell_i}{\mu_i}\Gamma(1+1/\alpha_i)$. Clearly, the shifted exponential distribution is the special case of the shifted Weibull model with $\alpha_i=1$. With slight reparameterizations, the HCMM algorithm proposed in Algorithm 1 applies to this model as well, meaning that the main and alternative optimization problems defined in (3.3), (3.4) and (3.5) can be analyzed similarly under the shifted Weibull model. As in the exponential case, we begin by maximizing the expected aggregate return at the master node ($\mathcal{P}^{(1)}_{\text{alt}}$) under the shifted Weibull distribution, which is given by
$$\mathbb{E}[X(t)]=\sum_{i=1}^{n}\mathbb{E}[X_i(t)]=\sum_{i=1}^{n}\ell_i\left(1-e^{-\left(\frac{\mu_i}{\ell_i}(t-a_i\ell_i)\right)^{\alpha_i}}\right).$$
The optimal load allocation that maximizes the individual expected aggregate return at each worker (and thus the total aggregate return) can be found by solving the following equation:
$$\frac{\partial}{\partial\ell_i}\mathbb{E}[X_i(t)]=1-e^{-\left(\frac{\mu_i}{\ell_i}(t-a_i\ell_i)\right)^{\alpha_i}}\left(1+\alpha_i\mu_i^{\alpha_i}\frac{t}{\ell_i}\left(\frac{t}{\ell_i}-a_i\right)^{\alpha_i-1}\right)=0.\qquad(3.25)$$
Solving (3.25) for $\ell_i$ yields $\ell^*_i(t)=\frac{t}{\lambda_i}$, where the constant $\lambda_i>a_i$ is the positive solution to
$$e^{\mu_i^{\alpha_i}(\lambda_i-a_i)^{\alpha_i}}=1+\alpha_i\mu_i^{\alpha_i}\lambda_i(\lambda_i-a_i)^{\alpha_i-1}.$$
Similar to Section 3.3, we can define $s$ as follows:
$$s=\frac{\mathbb{E}[X^*(t)]}{t}=\frac{1}{t}\sum_{i=1}^{n}\ell^*_i(t)\left(1-e^{-\left(\frac{\mu_i}{\ell^*_i(t)}(t-a_i\ell^*_i(t))\right)^{\alpha_i}}\right)=\sum_{i=1}^{n}\frac{1}{\lambda_i}\left(1-e^{-\mu_i^{\alpha_i}(\lambda_i-a_i)^{\alpha_i}}\right)=\sum_{i=1}^{n}\frac{\alpha_i\mu_i^{\alpha_i}(\lambda_i-a_i)^{\alpha_i-1}}{1+\alpha_i\mu_i^{\alpha_i}\lambda_i(\lambda_i-a_i)^{\alpha_i-1}}=\Theta(n).$$
The last equality uses the fact that all the distribution parameters are constants. The expected aggregate return with optimal loads, $\mathbb{E}[X^*(t)]$, equals $r$ at time $t=\tau^*$. Thus, $\tau^*=\frac{r}{s}=\Theta(1)$ and $\ell^*_i(\tau^*)=\frac{\tau^*}{\lambda_i}=\frac{r}{s\lambda_i}=\Theta(1)$.

Proof of Theorem 7. With the aforementioned reparameterizations of $\lambda_i$, $s$ and $\tau^*$, the HCMM algorithm defined in Algorithm 1 applies identically to the Weibull model. The proof of the asymptotic optimality of HCMM under the Weibull distribution follows the same steps as the proof for the exponential case in Section 3.3.3 (except where specifically justified, e.g., (3.17)). We avoid rewriting these steps for the readability of the paper, but we note that the concentration inequalities used to establish the proof of Theorem 5 can be applied to a wide class of distributions, including the Weibull distribution.

As an implication of Theorem 7, the expected execution time induced by the HCMM algorithm is asymptotically constant, i.e., $\mathbb{E}[T_{\text{HCMM}}]=\Theta(1)$, as was also the case for the shifted exponential distribution. To compare with the uncoded scenario, we start with the following lemma, which characterizes the extreme value of a sequence of Weibull random variables.

Lemma 5.
Let $\{T_i\}_{i=1}^{\infty}$ be a sequence of i.i.d. $\mathcal{W}(\alpha,\mu)$ random variables and let $T^*_n=\max_{i\in[n]}T_i$ denote the maximum of the first $n$ variables. Then, $\mathbb{E}[T^*_n]\geq\Theta\big((\log n)^{1/\alpha}\big)$.

Proof. Consider the sequence of maxima $\{T^*_i\}_{i=1}^{\infty}$. From Markov's inequality, we have
$$\frac{\mathbb{E}[T^*_n]}{t_n}\geq\mathbb{P}[T^*_n\geq t_n],$$
for any $t_n>0$ and $n\in\mathbb{N}$. Pick $t_n=\frac{1}{\mu}(\log n)^{1/\alpha}$. Therefore,
$$\frac{\mathbb{E}[T^*_n]}{\frac{1}{\mu}(\log n)^{1/\alpha}}\geq\mathbb{P}\Big[T^*_n\geq\tfrac{1}{\mu}(\log n)^{1/\alpha}\Big]=1-\prod_{i=1}^{n}\mathbb{P}\Big[T_i<\tfrac{1}{\mu}(\log n)^{1/\alpha}\Big]=1-\left(1-e^{-\log n}\right)^n=1-\left(1-\frac{1}{n}\right)^n.$$
Therefore,
$$\lim_{n\to\infty}\frac{\mathbb{E}[T^*_n]}{\frac{1}{\mu}(\log n)^{1/\alpha}}\geq\lim_{n\to\infty}1-\left(1-\frac{1}{n}\right)^n=1-\frac{1}{e}>0.63,$$
which implies $\mathbb{E}[T^*_n]\geq\Theta\big((\log n)^{1/\alpha}\big)$.

Now we complete the proof of Theorem 8.

Proof of Theorem 8. Recall that $T_{\text{UC}}$ denotes the completion time of the optimal uncoded distributed matrix multiplication algorithm across $n$ workers parameterized by the tuples $\{(a_i,\mu_i,\alpha_i)\}_{i=1}^{n}$. To bound the mean of $T_{\text{UC}}$, assume that every machine is replaced with a stochastically faster machine with parameters $(\tilde{a},\tilde{\mu},\tilde{\alpha})$, where $\tilde{a}=\min_i a_i$, $\tilde{\mu}=\max_i\mu_i$ and $\tilde{\alpha}=\max_i\alpha_i$; the expected run-time of the latter scenario is no greater than that of the former. For the new set of $n$ identical machines, the optimal loading is uniform, i.e., $\tilde{\ell}^*_i=\frac{r}{n}$. Let $\{\tilde{T}_i\}_{i=1}^{n}$ denote the i.i.d. shifted Weibull run-times of the new set of machines, with CDFs of the form
$$\mathbb{P}[\tilde{T}_i\leq t]=1-e^{-\left(\frac{\tilde{\mu}}{\tilde{\ell}^*_i}(t-\tilde{a}\tilde{\ell}^*_i)\right)^{\tilde{\alpha}}}=1-e^{-\left(\frac{\tilde{\mu}n}{r}\left(t-\frac{\tilde{a}r}{n}\right)\right)^{\tilde{\alpha}}},$$
for $t\geq\frac{\tilde{a}r}{n}$. The mean computation time of the new set of machines is
$$\mathbb{E}[\tilde{T}_{\text{UC}}]=\mathbb{E}\Big[\max_{i\in[n]}\tilde{T}_i\Big]=\frac{\tilde{a}r}{n}+\mathbb{E}\Big[\max_{i\in[n]}\tilde{\tilde{T}}_i\Big],$$
where $\tilde{\tilde{T}}_i=\tilde{T}_i-\frac{\tilde{a}r}{n}$ are i.i.d. $\mathcal{W}\big(\tilde{\alpha},\frac{\tilde{\mu}n}{r}\big)$ for all workers $i\in[n]$. Using Lemma 5, we can write
$$\mathbb{E}[T_{\text{UC}}]\geq\mathbb{E}[\tilde{T}_{\text{UC}}]\geq\frac{\tilde{a}r}{n}+\Theta\big((\log n)^{1/\tilde{\alpha}}\big)=\Theta\big((\log n)^{1/\tilde{\alpha}}\big).$$
Comparing the best uncoded scheme with the proposed coded algorithm demonstrates that HCMM outperforms the best uncoded scheme by a factor of at least $\Theta\big((\log n)^{1/\tilde{\alpha}}\big)$, i.e.,
$$\frac{\mathbb{E}[T_{\text{UC}}]}{\mathbb{E}[T_{\text{HCMM}}]}\geq\Theta\big((\log n)^{1/\tilde{\alpha}}\big).$$
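The per-worker constants $\lambda_i$ of this section can likewise be computed numerically. The sketch below (ours; the bracketing strategy is an assumption) bisects the log form of $e^{\mu_i^{\alpha_i}(\lambda_i-a_i)^{\alpha_i}}=1+\alpha_i\mu_i^{\alpha_i}\lambda_i(\lambda_i-a_i)^{\alpha_i-1}$; for $\alpha_i=1$ it recovers the exponential-case $\lambda_i$ of (3.7):

```python
import math

def solve_lambda_weibull(a, mu, alpha, tol=1e-10):
    # Bisect h(lam) = (mu*(lam-a))**alpha - log(1 + alpha*mu**alpha*lam*(lam-a)**(alpha-1)),
    # the log-domain form of the first-order condition below (3.25); h < 0 just
    # above the shift a and h > 0 for large lam, so a sign change is bracketed.
    def h(lam):
        return (mu * (lam - a)) ** alpha - math.log1p(
            alpha * mu ** alpha * lam * (lam - a) ** (alpha - 1))
    lo, hi = a + 1e-9, a + 1.0
    while h(hi) < 0:        # grow the bracket until the sign flips
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)
```

With $(a,\mu,\alpha)=(1,1,1)$ this returns the same root as the exponential-case equation $\lambda-1=\log(1+\lambda)$, and for $\alpha\neq1$ the returned $\lambda$ satisfies the Weibull fixed-point equation to numerical tolerance.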
3.5 Numerical Studies and Experiments using Amazon EC2 Machines

In this section, we present results both from simulations and from experiments over Amazon EC2 clusters. These results demonstrate that HCMM can provide significant speedups over state-of-the-art load allocation schemes.

3.5.1 Numerical Analysis

We now present numerical results evaluating the performance of HCMM. We consider both the shifted exponential model in (3.1) and the shifted Weibull model in (3.2) for the run-time distributions in our simulations, assuming the unit seconds per row (s/row) for $a$ and $1/\mu$. The underlying computation task is to compute $r=10000$ inner products using a heterogeneous cluster of $n=100$ workers, where different scenarios of heterogeneity are considered. For each scenario under consideration, we implement the following load allocation schemes (for each scheme, the load of each worker is rounded up to the nearest larger integer using the ceil() function; in the practical large-load regime considered in the simulations, this rounding step has negligible impact on the load allocation and on the overall results).

Figure 3.2 (caption): Illustration of the performance gain of HCMM over the three benchmark schemes for the exponential run-time model. Among the three scenarios, HCMM achieves a performance improvement of up to 71% over Uniform Uncoded, up to 53% over Load-balanced Uncoded, and up to 39% over Uniform Coded. Furthermore, the coding redundancy $\sum_{i=1}^{n}\ell_i/r$ for the three scenarios is in the range 1.41-1.46 for HCMM and in the range 2.3-2.8 for Uniform Coded. This demonstrates the efficient utilization of resources by HCMM.

1. Uniform Uncoded: Each worker is assigned an equal number of rows, i.e., $\ell_i=r/n$ for all workers $i$.

2. Load-balanced Uncoded: Each worker is assigned a load that is inversely proportional to its expected time for computing one inner product, i.e., for the shifted
exponential model, $\ell_i\propto\frac{\mu_i}{a_i\mu_i+1}$, while for the shifted Weibull model, $\ell_i\propto\frac{\mu_i}{a_i\mu_i+\Gamma(1+1/\alpha_i)}$ for all workers $i$. Furthermore, we set $\sum_{i=1}^{n}\ell_i=r$.

3. Uniform Coded: An equal number of coded rows is assigned to each worker. The redundancy is numerically optimized to minimize the average time until the results of at least $r$ inner products are received at the master node.

4. HCMM: Each worker is assigned the asymptotically optimal load allocation derived in Section 3.3.2, i.e., $\ell_i=\tau^*/\lambda_i$ for each worker $i$, according to (3.8) and (3.12).

For simulations under the shifted exponential model, we consider the following three scenarios:

• Scenario 1 (2-mode heterogeneity): $(a_i,\mu_i)=(1,1)$ for 50 workers, and $(a_i,\mu_i)=(4,0.5)$ for the other 50 workers.

• Scenario 2 (3-mode heterogeneity): $(a_i,\mu_i)=(1,0.5)$ for 25 workers, $(a_i,\mu_i)=(4,2)$ for 25 workers, and $(a_i,\mu_i)=(12,0.25)$ for the remaining 50 workers.

• Scenario 3 (Random heterogeneity): For each worker $i$, the parameters $a_i$ and $\mu_i$ are sampled uniformly at random from the sets $\{1,4,12\}$ and $\{0.5,2,0.25\}$, respectively.

Figure 3.3 (caption): Illustration of the performance gain of HCMM over the three benchmark schemes for the Weibull run-time model. Among the three scenarios, HCMM achieves a performance improvement of up to 73% over Uniform Uncoded, up to 56% over Load-balanced Uncoded, and up to 42% over Uniform Coded. Furthermore, the coding redundancy $\sum_{i=1}^{n}\ell_i/r$ for the three scenarios is in the range 1.30-1.42 for HCMM and in the range 2.0-2.5 for Uniform Coded. This demonstrates the efficient utilization of resources by HCMM.

The following three scenarios are considered for simulations under the shifted Weibull distribution for run-times:

• Scenario 1 (2-mode heterogeneity): $(a_i,\mu_i,\alpha_i)=(1,1,1.2)$ for 50 workers, and $(a_i,\mu_i,\alpha_i)=(4,0.5,0.8)$ for the other 50 workers.
• Scenario 2 (3-mode heterogeneity): $(a_i,\mu_i,\alpha_i)=(1,0.5,0.9)$ for 25 workers, $(a_i,\mu_i,\alpha_i)=(4,2,1.2)$ for 25 workers, and $(a_i,\mu_i,\alpha_i)=(12,0.25,1.5)$ for the remaining 50 workers.

• Scenario 3 (Random heterogeneity): For each worker $i$, the parameters $a_i$, $\mu_i$ and $\alpha_i$ are sampled uniformly at random from the sets $\{1,4,12\}$, $\{0.5,2,0.25\}$ and $\{0.9,1.2,1.5\}$, respectively.

Figures 3.2 and 3.3 illustrate the performance comparison of the four schemes under the two run-time models. We draw the following conclusions from the results.

• HCMM significantly outperforms the benchmark load allocation schemes. In particular, for the shifted exponential model, HCMM provides speedups of up to 71% over Uniform Uncoded, up to 53% over Load-balanced Uncoded, and up to 39% over Uniform Coded among the three scenarios. When the machine run-times follow the shifted Weibull distribution, HCMM yields gains of up to 73%, 56% and 42% over Uniform Uncoded, Load-balanced Uncoded and Uniform Coded, respectively.

• The coding redundancy $\sum_{i=1}^{n}\ell_i/r$ for Uniform Coded is higher than that of HCMM. In particular, under the shifted exponential model, the coding redundancy across the three scenarios is in the range 2.3-2.8 for Uniform Coded and 1.41-1.46 for HCMM. Under the shifted Weibull distribution, it is in the range 2.0-2.5 for Uniform Coded and 1.30-1.42 for HCMM. This demonstrates that HCMM leads to better utilization of the computing resources.

• Both Load-balanced Uncoded and Uniform Coded improve upon Uniform Uncoded. In the Load-balanced Uncoded scheme, assigning larger loads to faster machines improves performance, while for Uniform Coded, redundant computations help because the master does not need to wait for all the results. HCMM provides the best expected execution time among the four schemes, as it combines the gains of Load-balanced Uncoded and Uniform Coded by employing efficient load balancing along with a minimal number of redundant computations.

Next, we present the results from our experiments over Amazon EC2 clusters, which agree with our numerical studies.
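The two uncoded benchmark allocations above have closed forms under the shifted exponential model; a small sketch (ours, with assumed function names):

```python
def uniform_uncoded(r, n):
    # Uniform Uncoded: every worker gets r/n rows.
    return [r / n] * n

def load_balanced_uncoded(r, params):
    # Load-balanced Uncoded: l_i proportional to mu_i/(a_i*mu_i + 1), the inverse
    # of worker i's expected time per row (a_i + 1/mu_i), normalized to sum to r.
    w = [mu / (a * mu + 1.0) for a, mu in params]
    total = sum(w)
    return [r * wi / total for wi in w]
```

For the 2-mode Scenario 1 with $(a_i,\mu_i)=(1,1)$ and $(4,0.5)$, the expected times per row are 2 and 6 s/row, so the fast workers receive three times the load of the slow ones.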
HCMM provides the best expected execution time among the four schemes as it com- bines the gains of Load-balanced Uncoded and Uniform Coded by employing efficient load balancing along with minimal number of redundant computations. Next, we present the results from our experiments over Amazon EC2 clusters. These results show agreement with our numerical studies. 92 3.5.2 Experiments using Amazon EC2 machines We use Python with mpi4py package [38] to implement our developed HCMM scheme over Amazon EC2 clusters. To emulate the straggler effects in large-scale systems [39], we inject artificial delays. 6 This is achieved by selecting some workers to be stragglers at the beginning of experiments and slowing down each such worker by making it wait for 3 times the amount of time it spends in computation before it sends its results to the master. This is done using the sleep() function in time package. For each scenario, the choice of stragglers is made by drawing a sample from the Bernoulli(0.5) distribution for each worker, i.e., each worker is chosen to be a straggler with probability 0.5. In line with our simulation studies, we compare the performance of HCMM with the three benchmark load allocation schemes. For Load-balanced Uncoded, the number of uncoded rows ℓ i assigned to a worker i is proportional to the number of virtual CPUs, and the loads are normalized to have a sum equal to r. For the encoding and the decoding steps for Uniform Coded as well as HCMM, we utilize the Luby transform (LT) codes with peeling decoder which provides nearly linear decoding complexity [134]. Utilization of LT codes for distributed computing is proposed in [138] as well. However, they perform a homogeneous load allocation by assigning an equal number of rows of the encoded data matrix to each worker and hence do not capture the heterogeneity of the computing cluster in distributing the encoded data matrix. 
Towards this end, we relax our goal of recovering all the inner products from any $r$ of the coded inner products to recovering all the inner products from any $r'=r(1+\epsilon)$ coded inner products with high probability. Ideally, we would like $\epsilon>0$ to be as small as possible. In our experiments, we keep $r=10000$ and, based on the results in [138], use the robust Soliton degree distribution with $(c,\delta)=(0.03,0.1)$ and select $\epsilon=0.13$, where $c$ is a tuning parameter and $\delta$ is a bound on the probability of failure of decoding from a certain number of received coded inner products (see [138] for details). Therefore, for both HCMM and Uniform Coded, we design the load allocation such that the master needs to wait for only $r'=11300$ coded inner products. The total computation time equals the waiting time for the $r'=11300$ results plus the average time for decoding the $r=10000$ inner products from the received $r'=11300$ coded inner products. For HCMM, we use the shifted exponential distribution to estimate the computation model of each worker.

Figure 3.4 (caption): Illustration of the performance gain of HCMM over the three benchmark schemes. Among the three scenarios, HCMM achieves a performance improvement of up to 61% over Uniform Uncoded, up to 46% over Load-balanced Uncoded, and up to 36% over Uniform Coded. Furthermore, the coding redundancy $\sum_{i=1}^{n}\ell_i/r$ for the three scenarios is approximately 1.4 for HCMM and in the range 2.12-2.26 for Uniform Coded. Therefore, HCMM gives the best overall execution time among the four schemes with minimal coding overhead.

(Footnote 6: Artificial delays are injected since stragglers are rarely observed in small clusters on Amazon EC2. Other emerging platforms such as federated learning, computation with deadlines, mobile edge computing and fog computing still suffer from stragglers, where our ideas can be employed [121].)
For the performance comparison of the four schemes, we consider the following three computing scenarios:

• Scenario 1: Each row has 500000 elements. We use a heterogeneous cluster of 11 machines: one master of instance type m4.xlarge, four workers of instance type r4.2xlarge, and six workers of instance type r4.xlarge.

• Scenario 2: Each row has 500000 elements. We use a heterogeneous cluster of 16 machines: one master of instance type m4.xlarge, six workers of instance type r4.2xlarge, and nine workers of instance type r4.xlarge.

• Scenario 3: Each row has 1000000 elements. We use the same heterogeneous cluster as in the previous scenario.

(Footnote 7: The average time for decoding the $r=10000$ inner products from any $r(1+\epsilon)$ coded inner products is obtained using an m4.xlarge instance.)

Fig. 3.4 provides a performance comparison of HCMM with the benchmark load allocation schemes for the three scenarios, where the decoding time is taken into account as well. Fig. 3.5 presents typical cumulative distribution functions for the instances used in the experiments. We draw the following conclusions from the results:

• As demonstrated in Fig. 3.5, the shifted exponential model is a good first-order fit for the run-times of the workers.

• HCMM achieves significant speedups over the benchmark load allocation policies. In particular, HCMM combined with LT codes reduces the overall execution time by up to 61% over Uniform Uncoded, up to 46% over Load-balanced Uncoded, and up to 36% over Uniform Coded.

• As presented in Table 3.1, HCMM has a significantly lower total computation load than Uniform Coded. Hence, HCMM leads to efficient utilization of the computing resources, combining the benefits of both the Load-balanced Uncoded and Uniform Coded schemes.
Table 3.1: Total computation load ($\sum_{i=1}^{n}\ell_i$) of HCMM and Uniform Coded

  Scenario   n    HCMM    Uniform Coded
  1          10   11397   22600
  2          15   11402   21201
  3          15   11403   21201

These results demonstrate that HCMM can provide significant speedups in large-scale computing environments.

Figure 3.5 (caption): Typical empirical cumulative distribution functions for two instances used in Scenario 3 of our experiments: (a) $(a,1/\mu)=(1.37\times10^{-3}\ \text{s/row},\,8.25\times10^{-6}\ \text{s/row})$; (b) $(a,1/\mu)=(2.00\times10^{-3}\ \text{s/row},\,8.72\times10^{-6}\ \text{s/row})$. The measurements were taken in the absence of any manual delay. As demonstrated here, the shifted exponential distribution is a good model for the task execution time on EC2 machines.

Table 3.2: Amazon EC2 pricing for Linux machines

  Instance type   vCPU   ECU    Memory (GiB)   Instance Storage (GB)   Price (per hour)
  m3.medium       1      3      3.75           1 x 4 SSD               $0.077
  m3.large        2      6.5    7.5            1 x 32 SSD              $0.154
  m3.xlarge       4      13     15             2 x 40 SSD              $0.308
  m3.2xlarge      8      26     30             2 x 80 SSD              $0.616

3.6 Generalization to Computing Scenarios under Budget Constraints

In this section, we consider the optimization problem in (3.3) under the shifted exponential distribution with a monetary constraint on carrying out the overall computation. The cost of running computation tasks on a commodity server depends on several factors, including CPU, memory, ECU, storage and bandwidth. Different cloud computing platforms employ different pricing policies, and these need to be taken into account when developing efficient task allocation and execution algorithms [194, 42, 97, 227, 136]. For example, Table 3.2 summarizes the cost per hour of using Amazon EC2 clusters with different parameters (at the time of writing this manuscript) [149]. In this section, we incorporate the monetary constraint into the optimization problem in (3.3) and provide a heuristic algorithm for finding the optimal load allocation under a cost budget constraint. We now present the precise problem formulation we are interested in.
For a computation task and a given set of $N$ machines, the goal is to minimize the expected run-time while satisfying the budget constraint $C$, that is,
$$\mathcal{P}_{\text{main-constrained}}:\quad\underset{\boldsymbol{\ell}}{\text{minimize}}\ \ \mathbb{E}[T_{\text{CMP}}]\quad\text{subject to}\ \ \left(\sum_{i=1}^{N}c_i\,\mathbb{1}_{\{\ell_i>0\}}\right)\mathbb{E}[T_{\text{CMP}}]\leq C,\qquad(3.26)$$
where $c_i$ represents the cost per unit time of using machine $i\in[N]$. According to the pricing policies provided by AWS, e.g., Table 3.2, a linear model for cost (versus performance parameters) is intuitive and convincing; considering the last two rows of Table 3.2, for instance, doubling the parameters results in doubled cost. To be general, we model the computation cost of a single machine as $c=\kappa\mu^{\gamma}$ per unit time, which captures a convex dependency on the speed parameter $\mu$ for constants $\gamma\geq1$ and $\kappa>0$. We assume that there are $K$ types of machines parameterized by $\{(a_k,\mu_k)\}_{k=1}^{K}$, with $N_k$ machines of each type $k\in[K]$ available to run a distributed computation task, where $N=\sum_{k=1}^{K}N_k$ is the total number of available machines. We also assume that $\mu_1\leq\cdots\leq\mu_K$ and $a_1\mu_1=\cdots=a_K\mu_K=\xi$ for a constant $\xi$. As we showed in Theorem 5, HCMM is asymptotically optimal (i.e., optimal within a vanishing deviation) with respect to the average run-time. In this section, we again consider the asymptotic regime, i.e., a large enough number of machines, so that HCMM attains optimality per $\mathcal{P}_{\text{main}}$ in (3.3).

The following lemma states a useful observation regarding the solutions to the constrained problem $\mathcal{P}_{\text{main-constrained}}$ and the minimum possible cost for carrying out a computation task.

Lemma 6. HCMM is the (asymptotic) solution to the feasible $\mathcal{P}_{\text{main-constrained}}$. Moreover, given a computation task and a set of machines, decreasing the number of fastest (slowest) machines in HCMM results in a smaller (greater) expected cost. The minimum (maximum) cost of HCMM is induced by running the task only on any number of the slowest (fastest) machines.

Proof.
We first argue that if the budget-constrained problem defined in $\mathcal{P}_{\text{main-constrained}}$ is feasible, then HCMM determines the asymptotically optimal load allocation. Consider a set of $N$ machines and assume that $M$ of them are assigned non-zero loads in the optimal budget-constrained scheme. One can run the HCMM load allocation over the set of these $M$ machines and, according to the asymptotic optimality results, HCMM asymptotically attains the optimal run-time while satisfying the budget constraint.

(Footnote 8: The assumption $a_1\mu_1=\cdots=a_K\mu_K=\xi$ can be intuitively justified as follows. If a machine is $c$ times more powerful than another machine, then as a first-order estimate, one can assume that both the shift $a_k$ and the straggling parameter $\mu_k$ of the computation are $c$ times stronger.)

Now assume that $n_k$ machines of each type $k\in[K]$ are used. Then, assigning the loads obtained from HCMM and using the result of Theorem 5, the induced expected cost (for a large number of machines) can be written as
$$\text{cost}_{\text{HCMM}(n_1,\cdots,n_K)}=\tau^*\sum_{k=1}^{K}n_k c_k=\frac{r}{s}\sum_{k=1}^{K}n_k c_k=\frac{r}{\sum_{k=1}^{K}n_k\frac{\mu_k}{1+\mu_k\lambda_k}}\sum_{k=1}^{K}n_k\kappa\mu_k^{\gamma}=\kappa r x_{\xi}\,\frac{\sum_{k=1}^{K}n_k\mu_k^{\gamma}}{\sum_{k=1}^{K}n_k\mu_k},\qquad(3.27)$$
where $x_{\xi}=1+\mu_k\lambda_k$ is the solution to the equation $e^{x_{\xi}-\xi-1}=x_{\xi}$ for all machine types $k\in[K]$.

In another scenario, assume that we remove one machine of type $K$ (the fastest machine type) and run HCMM accordingly, i.e., with $n_k$ machines of type $k\in[K-1]$ and $n_K-1$ machines of type $K$. The expected cost of this scenario can be written as
$$\text{cost}_{\text{HCMM}(n_1,\cdots,n_K-1)}=\kappa r x_{\xi}\,\frac{\sum_{k=1}^{K-1}n_k\mu_k^{\gamma}+(n_K-1)\mu_K^{\gamma}}{\sum_{k=1}^{K-1}n_k\mu_k+(n_K-1)\mu_K}\overset{(f)}{\leq}\kappa r x_{\xi}\,\frac{\sum_{k=1}^{K}n_k\mu_k^{\gamma}}{\sum_{k=1}^{K}n_k\mu_k}=\text{cost}_{\text{HCMM}(n_1,\cdots,n_K)},\qquad(3.28)$$
where inequality (f) can be easily verified given that $\mu_1\leq\cdots\leq\mu_K$. We can apply the same argument iteratively and conclude that the minimum expected cost is achieved when only the slowest machines are used, that is,
$$C_{\min}:=\text{cost}_{\text{HCMM}(n_1,0,\cdots,0)}=\kappa r x_{\xi}\mu_1^{\gamma-1},\qquad(3.29)$$
for any $1\leq n_1\leq N_1$.
Similar to (3.28), one can show that reducing the number of participating slowest machines increases the induced expected cost of HCMM, that is,

$$\text{cost}_{\text{HCMM}(n_1-1,\cdots,n_K)}\ \ge\ \text{cost}_{\text{HCMM}(n_1,\cdots,n_K)}.\qquad (3.30)$$

Therefore, applying (3.30) iteratively shows that the maximum expected cost occurs when only the fastest machines are employed, that is,

$$C_{\max}:=\text{cost}_{\text{HCMM}(0,\cdots,0,n_K)}=\kappa r x_\xi\,\mu_K^{\gamma-1},$$

for any 1 ≤ n_K ≤ N_K.

Lemma 6 implies that if the available budget C is less than C_min defined in (3.29), then P_main-constrained is infeasible and it is impossible to run the task on the given set of machines while satisfying the budget constraint. Moreover, removing one machine from the available set of fastest machines and running HCMM results in a lower expected cost, while reducing the number of participating slowest machines results in a larger expected cost.

Now that HCMM asymptotically solves the feasible budget-constrained problem in (3.26), i.e. for C ≥ C_min, finding the optimal number of machines of each type to use in HCMM requires a combinatorial search over all possible allocations. However, as Lemma 6 suggests, using faster machines induces a larger cost; further, the computation time increases if we decrease the number of machines. This is the motivation behind our heuristic algorithm for an efficient search over the number of machines of each type to include in HCMM, which we describe next.

Algorithm 2: Heuristic Search
1: procedure Heuristic Search
2:   (n_1, ···, n_K) ← (N_1, ···, N_K)
3: top:
4:   Run HCMM with (n_1, ···, n_K)
5:   if cost_HCMM(n_1, ···, n_K) > C then
6:     n_j ← n_j − 1 where j = max{k : n_k > 0}
7:     goto top
8:   else
9:     return (n_1, ···, n_K)
10:   end
11: end procedure

First, Algorithm 2 runs HCMM using all available machines, i.e., n_k = N_k for each k ∈ [K]. Then, it calculates the corresponding cost according to (3.27). If the cost exceeds C, it starts to decrease the number of available fastest machines, i.e. n_K ← n_K − 1, and runs HCMM again.
While the cost exceeds C, the algorithm keeps decreasing the number of fastest machines in use until n_K = 0; it then starts decreasing n_{K−1}, and so on, until a feasible cost is achieved. Thus, the algorithm returns (N_1, ···, N_j, n_{j+1}, 0, ···, 0), which is the first tuple that satisfies the cost constraint. Therefore, the search space complexity of the heuristic is O(N_1 + ··· + N_K) = O(N), which is more efficient than the exhaustive search, whose complexity is O(N_1 ··· N_K). The pseudo-code in Algorithm 2 summarizes the heuristic.

Example. In this example, we consider two different scenarios to demonstrate the application of the proposed heuristic search algorithm. For the cost model, we assume γ = 2 and κ = 1, i.e. c = μ². Further, we consider the task of computing r = 100 equations.

• Scenario 1: Two types of machines are available, parameterized by (a_1, μ_1) = (0.5, 2) and (a_2, μ_2) = (0.25, 4), with 10 machines available of each type. Further, the available budget is C = 860. Using Lemma 6, the minimum and maximum induced costs are C_min = 629.2 and C_max = 1258.4. As C ≥ C_min, there exists an HCMM load allocation which is asymptotically optimal per (3.26). Applying the proposed heuristic search, it takes 9 iterations (see Figures 3.6 and 3.7) to arrive at the load allocation (n_1, n_2) = (10, 2), which corresponds to the expected cost 808.9 and average execution time E[T_HCMM] = 11.23.

Figure 3.6: Total cost associated with every pair of (n_1, n_2); 0 ≤ n_1, n_2 ≤ 10. [Table of values omitted; e.g., (10, 10) costs 1048.7 and (10, 2) costs 808.9.]

Figure 3.7: Expected time associated with every pair of (n_1, n_2); 0 ≤ n_1, n_2 ≤ 10. [Table of values omitted; e.g., (10, 10) takes 5.24 and (10, 2) takes 11.23.]
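Scenario 1 can be reproduced with a minimal Python sketch of Algorithm 2 (illustrative helper names, not the thesis code; the cost is evaluated via the asymptotic expression (3.27), with x_ξ obtained by fixed-point iteration on $x=\xi+1+\ln x$):

```python
import math

def heuristic_search(N_avail, mu, C, r=100, kappa=1.0, gamma=2.0, xi=1.0):
    # Sketch of Algorithm 2: start from all machines and repeatedly drop one
    # machine of the fastest type still in use until the asymptotic cost
    # (3.27) fits the budget C. mu must be sorted ascending (slowest first).
    x = 1.0 + xi
    for _ in range(200):               # converges to the larger root of e^(x-xi-1)=x
        x = xi + 1.0 + math.log(x)
    cost = lambda n: (kappa * r * x * sum(a * m**gamma for a, m in zip(n, mu))
                      / sum(a * m for a, m in zip(n, mu)))
    if C < kappa * r * x * mu[0]**(gamma - 1):   # budget below C_min: infeasible
        return None, 0
    n, iters = list(N_avail), 0
    while True:
        iters += 1
        if cost(n) <= C:
            return tuple(n), iters
        j = max(k for k in range(len(n)) if n[k] > 0)
        n[j] -= 1                      # drop one machine of the fastest type in use

# Scenario 1: mu = (2, 4), ten machines of each type, budget C = 860
alloc, iters = heuristic_search((10, 10), (2, 4), C=860)
assert alloc == (10, 2) and iters == 9
```

Under the Scenario 1 parameters this returns (10, 2) after 9 cost evaluations, consistent with the iteration count reported above.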
• Scenario 2: Three types of machines are available, parameterized by (a_1, μ_1) = (1, 1), (a_2, μ_2) = (0.5, 2) and (a_3, μ_3) = (0.125, 8), with 10 machines of each type available. Further, the available budget is C = 475. Using Lemma 6, the minimum and maximum induced costs for the task of computing r = 100 equations are C_min = 314.6 and C_max = 2516.8, respectively. It takes 15 iterations for the proposed heuristic search algorithm to arrive at the tuple (n_1, n_2, n_3) = (10, 6, 0). This corresponds to the expected cost 486.2 and the average time E[T_HCMM] = 14.3.

3.7 Conclusion

In this paper, we proposed a coding framework for distributed matrix-vector multiplication in heterogeneous cloud computing environments. In particular, we considered two distributions for the machines' run-times, namely shifted exponential and shifted Weibull, and tackled the intractable problem of minimizing the average run-time of a computation task over all possible load allocations by proposing a tractable alternative formulation. The solution to the alternative problem established our proposed HCMM load allocation scheme, which we proved to be asymptotically optimal. We also demonstrated the speedup of HCMM over three benchmark load allocation schemes and presented both numerical and experimental results. Experiments over Amazon EC2 clusters demonstrate that HCMM combined with LT codes and peeling decoders can provide significant gains in the average overall execution time. Moreover, we argued that HCMM is the asymptotically optimal allocation in budget-constrained scenarios as well, which led us to provide a heuristic search for finding a (sub)optimal load–machine assignment for a given set of machines while satisfying a pre-defined budget constraint.

Chapter 4

CodedReduce: A Fast and Robust Framework for Gradient Aggregation in Distributed Learning

4.1 Introduction

Modern machine learning algorithms are now used in a wide variety of domains.
However, training a large-scale model over a massive data set is an extremely computation- and storage-intensive task; for example, training ResNet with more than 150 layers and hundreds of millions of parameters over the ImageNet data set with more than 14 million images. As a result, there has been significant interest in developing distributed learning strategies that speed up the training of learning models (e.g., [43, 244, 26, 165, 40, 29, 4]).

In the commonly used Gradient Descent (GD) paradigm for learning, parallelization can be achieved by arranging the machines in a master-worker setup. Through a series of iterations, the master is responsible for updating the underlying model from the results received from the workers, which compute partial gradients using their local data batches and upload them to the master at each iteration. For the master-worker setup, both synchronous and asynchronous methods have been developed [43, 244, 26, 165, 40, 29]. In synchronous settings, all the workers wait for each other to complete the gradient computations, while in asynchronous methods, the workers continue the training process as soon as their local gradient is computed. While synchronous approaches provide better generalization behaviors than asynchronous ones [32, 26], they face major system bottlenecks due to (1) bandwidth congestion at the master caused by concurrent communications from the workers [155]; and (2) the delays caused by slow workers, or stragglers, that significantly increase the run-time [165].
[Figure: three-panel schematic comparing the schemes for computing g = g_1 + ··· + g_k — Ring-AllReduce (bandwidth efficiency via parallelizing communications), Gradient Coding (straggler resiliency via redundancy), and CodedReduce (inter-cluster parallelization, intra-cluster redundancy).]

Figure 4.1: Illustration of RAR, GC
and CR: In RAR, workers communicate only with their neighbors on a ring, which results in high bandwidth utilization; however, RAR is prone to stragglers. GC is robust to stragglers by doing redundant computations at the workers; however, GC imposes a bandwidth bottleneck at the master. CR achieves the benefits of both worlds, providing high bandwidth efficiency along with straggler resiliency.

To alleviate the communication bottleneck in distributed learning, various bandwidth-efficient strategies have been proposed [154, 208, 87]. In particular, the Ring-AllReduce (RAR) [155] strategy allows each worker to communicate only with its neighbors, which are arranged in a logical ring. More precisely, the data set D is uniformly distributed among the N workers, and each node combines and passes its partial gradient along the ring such that at the end of the collective operation, each worker has a copy of the full gradient g (Figure 4.1). Due to its master-less topology, RAR avoids a bandwidth bottleneck at any particular node. Furthermore, as shown in [154], RAR is provably bandwidth optimal and induces O(1) communication overhead that does not depend on the number of distributed workers. As a result, RAR has recently become a central component in distributed deep learning for model updating [2, 181, 82]. More recent approaches to mitigating the bandwidth bottleneck in distributed gradient aggregation include compression and quantization of the gradients [128, 230, 196]. Despite being bandwidth efficient, AllReduce-type algorithms are inherently sensitive to stragglers, which makes them prone to significant performance degradation and even complete failure if any of the workers slows down. The straggler bottleneck becomes even more significant as the cluster size increases [39, 239].

One approach to mitigating stragglers in distributed computation is to introduce computational redundancy via replication.
[236] proposes to replicate the straggling task on other available nodes. In [159], the authors propose partial data replication for robustness. Other relevant replication-based strategies have been proposed in [11, 211, 183]. Recently, coding-theoretic approaches have also been proposed for straggler mitigation [109, 55, 167, 231, 89, 226, 233, 145, 144, 220, 45]. Specifically, Gradient Coding (GC) [200] has been proposed to alleviate stragglers in distributed gradient aggregation in a master-worker topology (Figure 4.1). In GC, the data set D is carefully and redundantly distributed among the N workers, where each worker computes a coded gradient from its local batch. The master node waits for the results of any N − S workers and recovers the total gradient g, where the design parameter S denotes the maximum number of stragglers that can be tolerated. Therefore, GC prevents the master from waiting for all the workers to finish their computations, and it was shown to achieve significant speedups over the classical uncoded master-worker setup [200].

However, as the cluster size gets large, GC suffers from significant network congestion at the master. In particular, the communication overhead increases to O(N), as the master needs to receive messages from O(N) workers. Thus, it is essential to design distributed learning strategies that alleviate stragglers while imposing low communication overhead across the cluster. Consequently, our goal in this paper is to answer the following fundamental question: Can we achieve the communication parallelization of RAR and the straggler toleration of GC simultaneously in distributed gradient aggregation?

We answer this question in the affirmative. As the main contribution of this paper, we propose a joint design of data allocation and communication strategy that is robust to stragglers, alongside being bandwidth efficient.
Specifically, we propose a scalable and robust scheme for synchronous distributed gradient aggregation, called CodedReduce (CR). There are two key ideas behind CR.

First, we use a logical tree topology for communication, consisting of a master node and L layers of workers, where each parent node has n children nodes (Figure 4.1). In the proposed configuration, each node communicates only with its parent node for downloading the updated model and uploading partial gradients. As in the classical master-worker setup, the root node (master) recovers the full gradient and updates the model. Except for the leaf nodes, each node receives a sufficient number of coded partial gradients from its children, combines them with its locally computed partial gradient, and uploads the result to its parent. This distributed communication strategy alleviates the communication bottleneck at the nodes, as multiple parents can concurrently receive from their children. Second, the coding strategy utilized in CR provides robustness to stragglers. Towards this end, we exploit ideas from GC and propose a data allocation and communication strategy such that each node needs to wait for only n − s of its children to return their results.

The theoretical guarantees of the proposed CR scheme are two-fold. First, we characterize the computation load introduced by the proposed CR and prove that for a fixed straggler resiliency, CR achieves the optimal computation load (relative size of the assigned local data set to the total data set) among all robust gradient aggregation schemes over a fixed tree topology. Moreover, CR significantly improves upon GC in the computation load of the workers. More precisely, to be robust to straggling/failure of an α fraction of the children, GC loads each worker with ≈ α fraction of the total data set, while CR assigns only ≈ α^L fraction of the total data set, which is a major improvement.
Second, we model the workers' computation times as shifted exponential random variables and asymptotically characterize the average latency of CR, that is, the expected time to aggregate the gradient at the master node as the number of workers tends to infinity. This analysis further demonstrates how CR improves bandwidth efficiency and speeds up the training process by parallelizing the communications via a tree.

In addition to provable theoretical guarantees, the proposed CR scheme offers substantial improvements in practice. As a representative case, Figure 4.2 provides the gradient aggregation time averaged over many gradient descent iterations implemented over Amazon EC2 clusters. Compared to three benchmarks – the classical Uncoded Master-Worker (UMW), GC, and RAR – the proposed CR scheme attains speedups of 22.5×, 6.4× and 4.3×, respectively. This is notable since stragglers are fairly infrequent in Amazon clusters; moreover, in our experiments, no artificial delay is introduced in the workers' run-times. These observations indeed make the proposed CR applicable and efficient for real-world training tasks as well.
Figure 4.2: Average iteration time for gradient aggregation in different schemes CR, RAR, GC and UMW: Training a linear model is implemented on a cluster of N = 84 t2.micro instances.

4.2 Problem Setup and Background

In this section, we provide the problem setup, followed by a brief background on RAR and GC and their corresponding straggler resiliency and communication parallelization.

4.2.1 Problem Setting

Many machine learning tasks involve fitting a model over a training data set by minimizing a loss function. For a given labeled data set D = {x_j ∈ R^{p+1} : j = 1, ···, d}, the goal is to solve the following optimization problem:

$$\theta^*=\arg\min_{\theta\in\mathbb{R}^{p}}\sum_{x\in\mathcal{D}}\ell(\theta;x)+\lambda R(\theta),\qquad (4.1)$$

where ℓ(·) and R(·) respectively denote the loss and regularization functions, and the optimization problem is parameterized by λ. One of the most popular ways of solving (4.1) in distributed learning is to use the Gradient Descent (GD) algorithm.
More specifically, under standard convexity assumptions, the following sequence of model updates {θ^(t)}, t = 0, 1, ···, converges to the optimal solution θ*:

$$\theta^{(t+1)}=h_R\!\left(\theta^{(t)},g\right),\qquad (4.2)$$

where h_R(·) is a gradient-based optimizer depending on the regularizer R(·), and

$$g=\sum_{x\in\mathcal{D}}\nabla\ell\!\left(\theta^{(t)};x\right)\qquad (4.3)$$

denotes the gradient of the loss function evaluated at the model at iteration t over the data set D. Under certain assumptions, the iterations in (4.2) converge to a local optimum in the non-convex case as well. For instance, if all the saddle points of a smooth non-convex objective are strict saddle points, then the iterations in (4.2) converge to a local minimum [108].

The core component of the iterations defined in (4.2) is the computation of the gradient vector g at each iteration. At scale, due to the limited storage and computation capacity of the computing nodes, the gradient aggregation task (4.3) has to be carried out over distributed nodes. This parallelization, as we discussed earlier, introduces two major bottlenecks: stragglers and bandwidth contention. The goal of a distributed gradient aggregation scheme is to provide straggler resiliency as well as communication parallelization. At a high level, straggler resiliency, α, refers to the fraction of straggling workers that the distributed aggregation scheme is robust to, and communication parallelization gain, β, quantifies the number of simultaneous communications in the network by distributed nodes, compared to only one simultaneous communication in a single-node (master-worker) aggregation scheme.

Next, we discuss the data allocation and communication strategies of two synchronous gradient aggregation schemes in distributed learning and their corresponding straggler resiliency and communication parallelization gain.
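For concreteness, the iterations (4.2)–(4.3) can be instantiated in a few lines. The sketch below is illustrative only (a single-process toy with hypothetical choices of loss, regularizer, and step size, not the implementation used in this chapter): it runs plain gradient descent on an ℓ₂-regularized least-squares problem, with the gradient in (4.3) aggregated as a sum of partial gradients over k disjoint partitions of D:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, lam, eta = 120, 5, 0.1, 0.002
X = rng.normal(size=(d, p))                       # data set D with d samples
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=d)

def partial_grad(theta, Xb, yb):
    # gradient of the squared loss over one partition D_i, cf. (4.3)
    return 2.0 * Xb.T @ (Xb @ theta - yb)

parts = np.array_split(np.arange(d), 4)           # k = 4 disjoint partitions of D
theta = np.zeros(p)
for t in range(3000):
    g = sum(partial_grad(theta, X[idx], y[idx]) for idx in parts)   # eq. (4.3)
    theta = theta - eta * (g + 2.0 * lam * theta)                   # eq. (4.2)
# sanity check against the closed-form ridge solution
theta_star = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
assert np.allclose(theta, theta_star, atol=1e-8)
```

The partial gradients over the k partitions sum exactly to g in (4.3); it is precisely this summation that the aggregation schemes below distribute across workers.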
4.2.2 Ring-AllReduce

In AllReduce-type aggregation schemes, the data set is uniformly distributed over N worker nodes {W_1, ···, W_N}, which coordinate among themselves in a master-less setting to aggregate their partial gradients and compute the aggregate gradient g at each worker. In particular, in RAR, each worker W_i partitions its local partial gradient into N segments v_{1,i}, ···, v_{N,i}. In the first round, W_i transmits v_{i,i} to W_{i+1}. Each worker then adds the received segment to the corresponding segment of its local gradient, i.e., W_i obtains v_{i−1,i−1} + v_{i−1,i}. In the second round, the reduced segment is forwarded to the neighbor and added to the corresponding segment. Proceeding similarly, at the end of N − 1 rounds, each worker has a unique segment of the full gradient, i.e., W_i has v_{i+1,1} + ··· + v_{i+1,N}. After this reduce-scatter phase, the workers execute the collective operation of AllGather, after which the full gradient g becomes available at each node. The RAR operation for a cluster of three workers is illustrated in Figure 4.3. It is clear that RAR cannot tolerate any straggling nodes, since the communications are carried out over a ring and each node requires its neighbor's result to proceed, i.e., the straggler resiliency of RAR is α_RAR = 0. However, the ring communication design in RAR alleviates communication congestion at busy nodes and achieves communication parallelization gain β_RAR = Θ(N), which is optimal [155].
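The reduce-scatter and all-gather phases described above can be simulated in a few lines. The sketch below is a toy single-process simulation (not an MPI implementation; `ring_allreduce` is an illustrative name): in round t of the reduce-scatter, worker i passes segment (i − t) mod N to worker (i + 1) mod N, which accumulates it.

```python
import numpy as np

def ring_allreduce(grads):
    # Toy simulation of RAR over N workers; grads[i] is worker W_{i+1}'s
    # local partial gradient. Returns each worker's copy of the full sum.
    N = len(grads)
    seg = [np.array_split(np.asarray(g, dtype=float), N) for g in grads]
    # Reduce-scatter: in round t, worker i sends segment (i - t) mod N to
    # worker (i + 1) mod N, which accumulates it into its own copy.
    for t in range(N - 1):
        for i in range(N):
            s = (i - t) % N
            seg[(i + 1) % N][s] = seg[(i + 1) % N][s] + seg[i][s]
    # All-gather: the fully reduced segment (i + 1) mod N held by worker i
    # circulates around the ring, overwriting stale copies.
    for t in range(N - 1):
        for i in range(N):
            s = (i + 1 - t) % N
            seg[(i + 1) % N][s] = seg[i][s]
    return [np.concatenate(s) for s in seg]

out = ring_allreduce([np.arange(7.0) * (w + 1) for w in range(3)])
assert all(np.array_equal(o, np.arange(7.0) * 6) for o in out)
```

After the 2(N − 1) rounds, every worker holds the full gradient, and each worker sends only its own 1/N-sized segments per round, which is the source of RAR's bandwidth efficiency.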
sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> W 3 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> 1 1 2 2 2 3 3 3 W 1 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> W 2 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> W 3 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> Ring-AllReduce 1 Figure 4.3: Illustration of communication strategy in RAR for N =3 workers. 4.2.3 Gradient Coding Gradient Coding (GC) [200] was recently proposed to provide straggler resiliency in a master- worker topology with one master node and N distributed worker nodes {W 1 ,··· ,W N } as depicted in Figure 4.1. We start the description of GC with an illustrative example. Example (Gradient Coding). To make gradient aggregation over N = 3 workers robust to any S = 1 straggler, GC partitions the data set to {D 1 ,D 2 ,D 3 } and assigns 2 partitions to each worker as depicted in Figure 4.4. Full gradient g =g 1 +g 2 +g 3 can be recovered from any N − S = 2 workers, e.g., the master recovers g from W 1 and W 2 by combining their results as g =2 1 2 g 1 +g 2 − (g 2 − g 3 ). In general, to be robust to any S∈[N]={1,··· ,N} stragglers,GC uniformly partitions the data setD to{D 1 ,··· ,D k } (e.g. 
k = N) with corresponding partial gradients g_1, ···, g_k, and distributes them redundantly among the workers such that each partition is placed on S + 1 workers, thus achieving a computation load of r_GC = (S + 1)/N. Let the matrix G = [g_1, ···, g_k]^⊤ ∈ R^{k×p} denote the collection of partial gradients. Each worker W_i then computes its local partial gradients and sends b_iG to the master, where B = [b_1; ···; b_N] ∈ R^{N×k} denotes the
sha1_base64="jHTncxvr33zmOXPqsSMmSs4GGz8=">AAAB7HicbVBNT8JAEJ3iF+IX6tHLRjDxRFoueiR68YiJBRNoyHaZwobtttndmpCG3+DFg8Z49Qd589+4QA8KvmSSl/dmMjMvTAXXxnW/ndLG5tb2Tnm3srd/cHhUPT7p6CRTDH2WiEQ9hlSj4BJ9w43Ax1QhjUOB3XByO/e7T6g0T+SDmaYYxHQkecQZNVby691Bsz6o1tyGuwBZJ15BalCgPah+9YcJy2KUhgmqdc9zUxPkVBnOBM4q/UxjStmEjrBnqaQx6iBfHDsjF1YZkihRtqQhC/X3RE5jradxaDtjasZ61ZuL/3m9zETXQc5lmhmUbLkoygQxCZl/ToZcITNiagllittbCRtTRZmx+VRsCN7qy+uk02x4bsO7b9ZaN0UcZTiDc7gED66gBXfQBh8YcHiGV3hzpPPivDsfy9aSU8ycwh84nz+SuI3a</latexit> D 3 <latexit sha1_base64="O81KhoWf72zoGfBYiYsg0bzuyo0=">AAAB+HicbVA9T8MwFHwpX6V8NMDIYtEiMVVJGWCsgIGxSLRUaqPIcZ3WquNEtoNUov4SFgYQYuWnsPFvcNoM0HKSpdPde3rnCxLOlHacb6u0tr6xuVXeruzs7u1X7YPDropTSWiHxDyWvQArypmgHc00p71EUhwFnD4Ek+vcf3ikUrFY3OtpQr0IjwQLGcHaSL5drQ8irMcE8+xm5p/XfbvmNJw50CpxC1KDAm3f/hoMY5JGVGjCsVJ910m0l2GpGeF0VhmkiiaYTPCI9g0VOKLKy+bBZ+jUKEMUxtI8odFc/b2R4UipaRSYyTylWvZy8T+vn+rw0suYSFJNBVkcClOOdIzyFtCQSUo0nxqCiWQmKyJjLDHRpquKKcFd/vIq6TYbrtNw75q11lVRRxmO4QTOwIULaMEttKEDBFJ4hld4s56sF+vd+liMlqxi5wj+wPr8Adqjkos=</latexit> <latexit sha1_base64="O81KhoWf72zoGfBYiYsg0bzuyo0=">AAAB+HicbVA9T8MwFHwpX6V8NMDIYtEiMVVJGWCsgIGxSLRUaqPIcZ3WquNEtoNUov4SFgYQYuWnsPFvcNoM0HKSpdPde3rnCxLOlHacb6u0tr6xuVXeruzs7u1X7YPDropTSWiHxDyWvQArypmgHc00p71EUhwFnD4Ek+vcf3ikUrFY3OtpQr0IjwQLGcHaSL5drQ8irMcE8+xm5p/XfbvmNJw50CpxC1KDAm3f/hoMY5JGVGjCsVJ910m0l2GpGeF0VhmkiiaYTPCI9g0VOKLKy+bBZ+jUKEMUxtI8odFc/b2R4UipaRSYyTylWvZy8T+vn+rw0suYSFJNBVkcClOOdIzyFtCQSUo0nxqCiWQmKyJjLDHRpquKKcFd/vIq6TYbrtNw75q11lVRRxmO4QTOwIULaMEttKEDBFJ4hld4s56sF+vd+liMlqxi5wj+wPr8Adqjkos=</latexit> <latexit 
sha1_base64="O81KhoWf72zoGfBYiYsg0bzuyo0=">AAAB+HicbVA9T8MwFHwpX6V8NMDIYtEiMVVJGWCsgIGxSLRUaqPIcZ3WquNEtoNUov4SFgYQYuWnsPFvcNoM0HKSpdPde3rnCxLOlHacb6u0tr6xuVXeruzs7u1X7YPDropTSWiHxDyWvQArypmgHc00p71EUhwFnD4Ek+vcf3ikUrFY3OtpQr0IjwQLGcHaSL5drQ8irMcE8+xm5p/XfbvmNJw50CpxC1KDAm3f/hoMY5JGVGjCsVJ910m0l2GpGeF0VhmkiiaYTPCI9g0VOKLKy+bBZ+jUKEMUxtI8odFc/b2R4UipaRSYyTylWvZy8T+vn+rw0suYSFJNBVkcClOOdIzyFtCQSUo0nxqCiWQmKyJjLDHRpquKKcFd/vIq6TYbrtNw75q11lVRRxmO4QTOwIULaMEttKEDBFJ4hld4s56sF+vd+liMlqxi5wj+wPr8Adqjkos=</latexit> <latexit sha1_base64="O81KhoWf72zoGfBYiYsg0bzuyo0=">AAAB+HicbVA9T8MwFHwpX6V8NMDIYtEiMVVJGWCsgIGxSLRUaqPIcZ3WquNEtoNUov4SFgYQYuWnsPFvcNoM0HKSpdPde3rnCxLOlHacb6u0tr6xuVXeruzs7u1X7YPDropTSWiHxDyWvQArypmgHc00p71EUhwFnD4Ek+vcf3ikUrFY3OtpQr0IjwQLGcHaSL5drQ8irMcE8+xm5p/XfbvmNJw50CpxC1KDAm3f/hoMY5JGVGjCsVJ910m0l2GpGeF0VhmkiiaYTPCI9g0VOKLKy+bBZ+jUKEMUxtI8odFc/b2R4UipaRSYyTylWvZy8T+vn+rw0suYSFJNBVkcClOOdIzyFtCQSUo0nxqCiWQmKyJjLDHRpquKKcFd/vIq6TYbrtNw75q11lVRRxmO4QTOwIULaMEttKEDBFJ4hld4s56sF+vd+liMlqxi5wj+wPr8Adqjkos=</latexit> D 2 <latexit sha1_base64="2Uj/Tsv+JIUDsB4tV6vIzdWZfEc=">AAAB+HicbVC7TsMwFL0pr1IeDTCyWLRITFXSBcYKGBiLRB9SG0WO67RWHSeyHaQS9UtYGECIlU9h429w2gzQciRLR+fcq3t8goQzpR3n2yptbG5t75R3K3v7B4dV++i4q+JUEtohMY9lP8CKciZoRzPNaT+RFEcBp71gepP7vUcqFYvFg54l1IvwWLCQEayN5NvV+jDCekIwz27nfrPu2zWn4SyA1olbkBoUaPv213AUkzSiQhOOlRq4TqK9DEvNCKfzyjBVNMFkisd0YKjAEVVetgg+R+dGGaEwluYJjRbq740MR0rNosBM5inVqpeL/3mDVIdXXsZEkmoqyPJQmHKkY5S3gEZMUqL5zBBMJDNZEZlgiYk2XVVMCe7ql9dJt9lwnYZ736y1ros6ynAKZ3ABLlxCC+6gDR0gkMIzvMKb9WS9WO/Wx3K0ZBU7J/AH1ucP2R6Sig==</latexit> <latexit 
sha1_base64="2Uj/Tsv+JIUDsB4tV6vIzdWZfEc=">AAAB+HicbVC7TsMwFL0pr1IeDTCyWLRITFXSBcYKGBiLRB9SG0WO67RWHSeyHaQS9UtYGECIlU9h429w2gzQciRLR+fcq3t8goQzpR3n2yptbG5t75R3K3v7B4dV++i4q+JUEtohMY9lP8CKciZoRzPNaT+RFEcBp71gepP7vUcqFYvFg54l1IvwWLCQEayN5NvV+jDCekIwz27nfrPu2zWn4SyA1olbkBoUaPv213AUkzSiQhOOlRq4TqK9DEvNCKfzyjBVNMFkisd0YKjAEVVetgg+R+dGGaEwluYJjRbq740MR0rNosBM5inVqpeL/3mDVIdXXsZEkmoqyPJQmHKkY5S3gEZMUqL5zBBMJDNZEZlgiYk2XVVMCe7ql9dJt9lwnYZ736y1ros6ynAKZ3ABLlxCC+6gDR0gkMIzvMKb9WS9WO/Wx3K0ZBU7J/AH1ucP2R6Sig==</latexit> <latexit sha1_base64="2Uj/Tsv+JIUDsB4tV6vIzdWZfEc=">AAAB+HicbVC7TsMwFL0pr1IeDTCyWLRITFXSBcYKGBiLRB9SG0WO67RWHSeyHaQS9UtYGECIlU9h429w2gzQciRLR+fcq3t8goQzpR3n2yptbG5t75R3K3v7B4dV++i4q+JUEtohMY9lP8CKciZoRzPNaT+RFEcBp71gepP7vUcqFYvFg54l1IvwWLCQEayN5NvV+jDCekIwz27nfrPu2zWn4SyA1olbkBoUaPv213AUkzSiQhOOlRq4TqK9DEvNCKfzyjBVNMFkisd0YKjAEVVetgg+R+dGGaEwluYJjRbq740MR0rNosBM5inVqpeL/3mDVIdXXsZEkmoqyPJQmHKkY5S3gEZMUqL5zBBMJDNZEZlgiYk2XVVMCe7ql9dJt9lwnYZ736y1ros6ynAKZ3ABLlxCC+6gDR0gkMIzvMKb9WS9WO/Wx3K0ZBU7J/AH1ucP2R6Sig==</latexit> <latexit sha1_base64="2Uj/Tsv+JIUDsB4tV6vIzdWZfEc=">AAAB+HicbVC7TsMwFL0pr1IeDTCyWLRITFXSBcYKGBiLRB9SG0WO67RWHSeyHaQS9UtYGECIlU9h429w2gzQciRLR+fcq3t8goQzpR3n2yptbG5t75R3K3v7B4dV++i4q+JUEtohMY9lP8CKciZoRzPNaT+RFEcBp71gepP7vUcqFYvFg54l1IvwWLCQEayN5NvV+jDCekIwz27nfrPu2zWn4SyA1olbkBoUaPv213AUkzSiQhOOlRq4TqK9DEvNCKfzyjBVNMFkisd0YKjAEVVetgg+R+dGGaEwluYJjRbq740MR0rNosBM5inVqpeL/3mDVIdXXsZEkmoqyPJQmHKkY5S3gEZMUqL5zBBMJDNZEZlgiYk2XVVMCe7ql9dJt9lwnYZ736y1ros6ynAKZ3ABLlxCC+6gDR0gkMIzvMKb9WS9WO/Wx3K0ZBU7J/AH1ucP2R6Sig==</latexit> W 3 <latexit 
sha1_base64="Djbhe/ugAz/5jRhhlbnedibCJHQ=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstCTaWGLiAQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS0nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAdji5m/vtJ1Sax/LRTBMMIjqSfMgZNVbyq+3+VbVfrrg1dwGyTrycVCBHs1/+6g1ilkYoDRNU667nJibIqDKcCZyVeqnGhLIJHWHXUkkj1EG2OHZGLqwyIMNY2ZKGLNTfExmNtJ5Goe2MqBnrVW8u/ud1UzO8CTIuk9SgZMtFw1QQE5P552TAFTIjppZQpri9lbAxVZQZm0/JhuCtvrxOWvWa59a8h3qlcZvHUYQzOIdL8OAaGnAPTfCBAYdneIU3RzovzrvzsWwtOPnMKfyB8/kDlD2N2w==</latexit> <latexit sha1_base64="Djbhe/ugAz/5jRhhlbnedibCJHQ=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstCTaWGLiAQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS0nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAdji5m/vtJ1Sax/LRTBMMIjqSfMgZNVbyq+3+VbVfrrg1dwGyTrycVCBHs1/+6g1ilkYoDRNU667nJibIqDKcCZyVeqnGhLIJHWHXUkkj1EG2OHZGLqwyIMNY2ZKGLNTfExmNtJ5Goe2MqBnrVW8u/ud1UzO8CTIuk9SgZMtFw1QQE5P552TAFTIjppZQpri9lbAxVZQZm0/JhuCtvrxOWvWa59a8h3qlcZvHUYQzOIdL8OAaGnAPTfCBAYdneIU3RzovzrvzsWwtOPnMKfyB8/kDlD2N2w==</latexit> <latexit sha1_base64="Djbhe/ugAz/5jRhhlbnedibCJHQ=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstCTaWGLiAQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS0nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAdji5m/vtJ1Sax/LRTBMMIjqSfMgZNVbyq+3+VbVfrrg1dwGyTrycVCBHs1/+6g1ilkYoDRNU667nJibIqDKcCZyVeqnGhLIJHWHXUkkj1EG2OHZGLqwyIMNY2ZKGLNTfExmNtJ5Goe2MqBnrVW8u/ud1UzO8CTIuk9SgZMtFw1QQE5P552TAFTIjppZQpri9lbAxVZQZm0/JhuCtvrxOWvWa59a8h3qlcZvHUYQzOIdL8OAaGnAPTfCBAYdneIU3RzovzrvzsWwtOPnMKfyB8/kDlD2N2w==</latexit> <latexit 
sha1_base64="Djbhe/ugAz/5jRhhlbnedibCJHQ=">AAAB7HicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsstCTaWGLiAQlcyN4ywIa9vcvungm58BtsLDTG1h9k579xgSsUfMkkL+/NZGZemAiujet+O4WNza3tneJuaW//4PCofHzS0nGqGPosFrHqhFSj4BJ9w43ATqKQRqHAdji5m/vtJ1Sax/LRTBMMIjqSfMgZNVbyq+3+VbVfrrg1dwGyTrycVCBHs1/+6g1ilkYoDRNU667nJibIqDKcCZyVeqnGhLIJHWHXUkkj1EG2OHZGLqwyIMNY2ZKGLNTfExmNtJ5Goe2MqBnrVW8u/ud1UzO8CTIuk9SgZMtFw1QQE5P552TAFTIjppZQpri9lbAxVZQZm0/JhuCtvrxOWvWa59a8h3qlcZvHUYQzOIdL8OAaGnAPTfCBAYdneIU3RzovzrvzsWwtOPnMKfyB8/kDlD2N2w==</latexit> D 1 <latexit sha1_base64="oG4TX4rHdfsjn0lJrT5eSgVi5lU=">AAAB+HicbVC7TsMwFL0pr1IeDTCyWLRITFXSBcYKGBiLRB9SG0WO67RWHSeyHaQS9UtYGECIlU9h429w2gzQciRLR+fcq3t8goQzpR3n2yptbG5t75R3K3v7B4dV++i4q+JUEtohMY9lP8CKciZoRzPNaT+RFEcBp71gepP7vUcqFYvFg54l1IvwWLCQEayN5NvV+jDCekIwz27nvlv37ZrTcBZA68QtSA0KtH37aziKSRpRoQnHSg1cJ9FehqVmhNN5ZZgqmmAyxWM6MFTgiCovWwSfo3OjjFAYS/OERgv190aGI6VmUWAm85Rq1cvF/7xBqsMrL2MiSTUVZHkoTDnSMcpbQCMmKdF8ZggmkpmsiEywxESbriqmBHf1y+uk22y4TsO9b9Za10UdZTiFM7gAFy6hBXfQhg4QSOEZXuHNerJerHfrYzlasoqdE/gD6/MH15mSiQ==</latexit> <latexit sha1_base64="oG4TX4rHdfsjn0lJrT5eSgVi5lU=">AAAB+HicbVC7TsMwFL0pr1IeDTCyWLRITFXSBcYKGBiLRB9SG0WO67RWHSeyHaQS9UtYGECIlU9h429w2gzQciRLR+fcq3t8goQzpR3n2yptbG5t75R3K3v7B4dV++i4q+JUEtohMY9lP8CKciZoRzPNaT+RFEcBp71gepP7vUcqFYvFg54l1IvwWLCQEayN5NvV+jDCekIwz27nvlv37ZrTcBZA68QtSA0KtH37aziKSRpRoQnHSg1cJ9FehqVmhNN5ZZgqmmAyxWM6MFTgiCovWwSfo3OjjFAYS/OERgv190aGI6VmUWAm85Rq1cvF/7xBqsMrL2MiSTUVZHkoTDnSMcpbQCMmKdF8ZggmkpmsiEywxESbriqmBHf1y+uk22y4TsO9b9Za10UdZTiFM7gAFy6hBXfQhg4QSOEZXuHNerJerHfrYzlasoqdE/gD6/MH15mSiQ==</latexit> <latexit 
sha1_base64="oG4TX4rHdfsjn0lJrT5eSgVi5lU=">AAAB+HicbVC7TsMwFL0pr1IeDTCyWLRITFXSBcYKGBiLRB9SG0WO67RWHSeyHaQS9UtYGECIlU9h429w2gzQciRLR+fcq3t8goQzpR3n2yptbG5t75R3K3v7B4dV++i4q+JUEtohMY9lP8CKciZoRzPNaT+RFEcBp71gepP7vUcqFYvFg54l1IvwWLCQEayN5NvV+jDCekIwz27nvlv37ZrTcBZA68QtSA0KtH37aziKSRpRoQnHSg1cJ9FehqVmhNN5ZZgqmmAyxWM6MFTgiCovWwSfo3OjjFAYS/OERgv190aGI6VmUWAm85Rq1cvF/7xBqsMrL2MiSTUVZHkoTDnSMcpbQCMmKdF8ZggmkpmsiEywxESbriqmBHf1y+uk22y4TsO9b9Za10UdZTiFM7gAFy6hBXfQhg4QSOEZXuHNerJerHfrYzlasoqdE/gD6/MH15mSiQ==</latexit> <latexit sha1_base64="oG4TX4rHdfsjn0lJrT5eSgVi5lU=">AAAB+HicbVC7TsMwFL0pr1IeDTCyWLRITFXSBcYKGBiLRB9SG0WO67RWHSeyHaQS9UtYGECIlU9h429w2gzQciRLR+fcq3t8goQzpR3n2yptbG5t75R3K3v7B4dV++i4q+JUEtohMY9lP8CKciZoRzPNaT+RFEcBp71gepP7vUcqFYvFg54l1IvwWLCQEayN5NvV+jDCekIwz27nvlv37ZrTcBZA68QtSA0KtH37aziKSRpRoQnHSg1cJ9FehqVmhNN5ZZgqmmAyxWM6MFTgiCovWwSfo3OjjFAYS/OERgv190aGI6VmUWAm85Rq1cvF/7xBqsMrL2MiSTUVZHkoTDnSMcpbQCMmKdF8ZggmkpmsiEywxESbriqmBHf1y+uk22y4TsO9b9Za10UdZTiFM7gAFy6hBXfQhg4QSOEZXuHNerJerHfrYzlasoqdE/gD6/MH15mSiQ==</latexit> D 3 <latexit sha1_base64="O81KhoWf72zoGfBYiYsg0bzuyo0=">AAAB+HicbVA9T8MwFHwpX6V8NMDIYtEiMVVJGWCsgIGxSLRUaqPIcZ3WquNEtoNUov4SFgYQYuWnsPFvcNoM0HKSpdPde3rnCxLOlHacb6u0tr6xuVXeruzs7u1X7YPDropTSWiHxDyWvQArypmgHc00p71EUhwFnD4Ek+vcf3ikUrFY3OtpQr0IjwQLGcHaSL5drQ8irMcE8+xm5p/XfbvmNJw50CpxC1KDAm3f/hoMY5JGVGjCsVJ910m0l2GpGeF0VhmkiiaYTPCI9g0VOKLKy+bBZ+jUKEMUxtI8odFc/b2R4UipaRSYyTylWvZy8T+vn+rw0suYSFJNBVkcClOOdIzyFtCQSUo0nxqCiWQmKyJjLDHRpquKKcFd/vIq6TYbrtNw75q11lVRRxmO4QTOwIULaMEttKEDBFJ4hld4s56sF+vd+liMlqxi5wj+wPr8Adqjkos=</latexit> <latexit 
sha1_base64="O81KhoWf72zoGfBYiYsg0bzuyo0=">AAAB+HicbVA9T8MwFHwpX6V8NMDIYtEiMVVJGWCsgIGxSLRUaqPIcZ3WquNEtoNUov4SFgYQYuWnsPFvcNoM0HKSpdPde3rnCxLOlHacb6u0tr6xuVXeruzs7u1X7YPDropTSWiHxDyWvQArypmgHc00p71EUhwFnD4Ek+vcf3ikUrFY3OtpQr0IjwQLGcHaSL5drQ8irMcE8+xm5p/XfbvmNJw50CpxC1KDAm3f/hoMY5JGVGjCsVJ910m0l2GpGeF0VhmkiiaYTPCI9g0VOKLKy+bBZ+jUKEMUxtI8odFc/b2R4UipaRSYyTylWvZy8T+vn+rw0suYSFJNBVkcClOOdIzyFtCQSUo0nxqCiWQmKyJjLDHRpquKKcFd/vIq6TYbrtNw75q11lVRRxmO4QTOwIULaMEttKEDBFJ4hld4s56sF+vd+liMlqxi5wj+wPr8Adqjkos=</latexit> <latexit sha1_base64="O81KhoWf72zoGfBYiYsg0bzuyo0=">AAAB+HicbVA9T8MwFHwpX6V8NMDIYtEiMVVJGWCsgIGxSLRUaqPIcZ3WquNEtoNUov4SFgYQYuWnsPFvcNoM0HKSpdPde3rnCxLOlHacb6u0tr6xuVXeruzs7u1X7YPDropTSWiHxDyWvQArypmgHc00p71EUhwFnD4Ek+vcf3ikUrFY3OtpQr0IjwQLGcHaSL5drQ8irMcE8+xm5p/XfbvmNJw50CpxC1KDAm3f/hoMY5JGVGjCsVJ910m0l2GpGeF0VhmkiiaYTPCI9g0VOKLKy+bBZ+jUKEMUxtI8odFc/b2R4UipaRSYyTylWvZy8T+vn+rw0suYSFJNBVkcClOOdIzyFtCQSUo0nxqCiWQmKyJjLDHRpquKKcFd/vIq6TYbrtNw75q11lVRRxmO4QTOwIULaMEttKEDBFJ4hld4s56sF+vd+liMlqxi5wj+wPr8Adqjkos=</latexit> <latexit sha1_base64="O81KhoWf72zoGfBYiYsg0bzuyo0=">AAAB+HicbVA9T8MwFHwpX6V8NMDIYtEiMVVJGWCsgIGxSLRUaqPIcZ3WquNEtoNUov4SFgYQYuWnsPFvcNoM0HKSpdPde3rnCxLOlHacb6u0tr6xuVXeruzs7u1X7YPDropTSWiHxDyWvQArypmgHc00p71EUhwFnD4Ek+vcf3ikUrFY3OtpQr0IjwQLGcHaSL5drQ8irMcE8+xm5p/XfbvmNJw50CpxC1KDAm3f/hoMY5JGVGjCsVJ910m0l2GpGeF0VhmkiiaYTPCI9g0VOKLKy+bBZ+jUKEMUxtI8odFc/b2R4UipaRSYyTylWvZy8T+vn+rw0suYSFJNBVkcClOOdIzyFtCQSUo0nxqCiWQmKyJjLDHRpquKKcFd/vIq6TYbrtNw75q11lVRRxmO4QTOwIULaMEttKEDBFJ4hld4s56sF+vd+liMlqxi5wj+wPr8Adqjkos=</latexit> 1 2 g 1 +g 2 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> 1 2 g 1 +g 3 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> g 2 g 3 <latexit 
sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> Figure 4.4: Illustration of data allocation and communication strategy in GC for N =3 workers. encoding matrix, i.e. non-zero elements in b i specifies the partitions stored in worker W i . Upon receiving the results of any N− S workers, the master recovers the total gradientg by linearly combining the received results, that is g =a f BG where the row vector a f ∈R 1× N corresponds to a particular set of S stragglers and A = [a 1 ;··· ;a F ] denotes the decoding matrix with F = N S distinct straggling scenarios. The GC algorithm designs encoding and decoding matrices (B,A) such that, in the worst case, the full gradientg is recoverable from the results of any N− S out of N workers, i.e. straggler resiliency α GC = S/N is attained. Although GC prevents the master to wait for all the workers to finish their computations, it requires simultaneous communications from the workers that will cause congestion at the master node, and lead to parallelization gain β GC =Θ(1) for a constant resiliency. HavingreviewedRARandGCstrategiesandtheirresiliencyandparallelizationproperties, wenowinformallyprovidetheguaranteesofourproposedCRschemeinthefollowingremark. Remark 21. CR arranges the available N workers via a tree configuration with L layers of nodes and each parent havingn children, i.e. N =n+··· +n L . The proposed data allocation and communication strategy in CR results in communication parallelization gain β CR = Θ( N 1− 1/L ) which approaches β RAR = Θ( N) for large L. Moreover, given a computation load 0 ≤ r ≤ 1, CR is robust to straggling of α CR ≈ r 1/L fraction of the children per any parent in the tree, while GC is robust to only α GC ≈ r fraction of nodes and RAR has no straggler resiliency. Therefore, CR achieves the best of RAR and GC, simultaneously. 
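To make the GC encode/decode mechanism above concrete, the following sketch (assuming numpy) instantiates one valid (B, A) pair for N = 3 and S = 1, consistent with the combinations shown in Figure 4.4, and checks that every straggling scenario recovers the full gradient g. The specific coefficients are an illustrative choice, not the unique one.

```python
import numpy as np

# Toy instance of GC: N = 3 workers, S = 1 straggler, gradients of dimension d.
d = 4
rng = np.random.default_rng(0)
G = rng.standard_normal((3, d))   # rows g1, g2, g3: partial gradients over D1, D2, D3
g = G.sum(axis=0)                 # full gradient g = g1 + g2 + g3

# Encoding matrix B: worker W_i computes and sends b_i @ G; the non-zero
# entries of row b_i mark the partitions W_i must store.
B = np.array([[0.5, 1.0,  0.0],   # W1 sends (1/2) g1 + g2
              [0.0, 1.0, -1.0],   # W2 sends g2 - g3
              [0.5, 0.0,  1.0]])  # W3 sends (1/2) g1 + g3

# Decoding matrix A: one row a_f per straggling scenario (F = C(3,1) = 3), with
# a zero placed on the straggler so only the N - S = 2 surviving results are used.
A = np.array([[0.0,  1.0, 2.0],   # W1 straggles: g = W2's result + 2 * W3's result
              [1.0,  0.0, 1.0],   # W2 straggles: g = W1's result + W3's result
              [2.0, -1.0, 0.0]])  # W3 straggles: g = 2 * W1's result - W2's result

# Every decoding row recombines the encoded results into the full gradient.
for a_f in A:
    assert np.allclose(a_f @ (B @ G), g)
```

Equivalently, the design requirement is that every row of A B equals the all-ones vector, so that a_f B G sums the three partial gradients regardless of which worker straggles.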
Table 4.1 summarizes these results, and Theorems 9 and 10 formally characterize these guarantees.

Table 4.1: Communication parallelization gain and straggler resiliency of the three designs RAR, GC, and CR in a system with N nodes and computation load r, where CR has a tree communication topology with L layers.

Scheme | Straggler Resiliency (α) | Communication Parallelization Gain (β)
RAR    | 0                        | Θ(N)
GC     | r                        | Θ(1)
CR     | r^{1/L}                  | Θ(N^{1−1/L})

4.3 Proposed CodedReduce Scheme

In this section, we first present our proposed CodedReduce (CR) scheme by describing the data set allocation and communication strategy at the nodes, followed by an illustrative example. We then provide the theoretical guarantees of CR and conclude the section with the optimality of CR.

4.3.1 Description of the CR Scheme

Let us start with the proposed network configuration. CR arranges the communication pattern among the nodes via a regular tree structure, defined as follows. An (n, L)-regular tree graph T consists of a master node and L layers of worker nodes. At any layer (except for the lowest), each parent node is connected to n children nodes in the lower layer, i.e., there is a total of N = n + ··· + n^L nodes (see Figure 4.5). Each node of the tree is identified by a pair (l, i), where l ∈ [L] and i ∈ [n^l] denote the corresponding layer and the node's index within that layer, respectively. Furthermore, T(l, i) denotes the sub-tree rooted at node (l, i). We next introduce a notation that eases the algorithm description: we associate a real scalar b with all the data points in a generic data set D, denoting the result by bD, and define the gradient over bD as g_{bD} = b·g_D = b Σ_{x∈D} ∇ℓ(θ^{(t)}; x).
As a building block of CR, we define the sub-routine CompAlloc, in which, given a generic data set D, n workers are carefully assigned data partitions and combining coefficients such that the full gradient over D can still be recovered when some of those workers straggle.
[Figure 4.5: An (n, L)-regular tree graph, consisting of a master node M, a first layer of worker nodes (1, 1), …, (1, n), and a bottom layer of worker nodes (L, 1), …, (L, n^L).]
sha1_base64="T7tKwJBl9xb74ZGC+ctnijQ67IY=">AAAB8HicbVDLSgNBEOyNrxhfUY9eBhMhgoTdXPQY9OIhhwjmIckaZiezyZCZ2WVmVghLvsKLB0W8+jne/Bsnj4MmFjQUVd10dwUxZ9q47reTWVvf2NzKbud2dvf2D/KHR00dJYrQBol4pNoB1pQzSRuGGU7bsaJYBJy2gtHN1G89UaVZJO/NOKa+wAPJQkawsdJDsVS7kI+182IvX3DL7gxolXgLUoAF6r38V7cfkURQaQjHWnc8NzZ+ipVhhNNJrptoGmMywgPasVRiQbWfzg6eoDOr9FEYKVvSoJn6eyLFQuuxCGynwGaol72p+J/XSUx45adMxomhkswXhQlHJkLT71GfKUoMH1uCiWL2VkSGWGFibEY5G4K3/PIqaVbKnlv27iqF6vUijiycwCmUwINLqMIt1KEBBAQ8wyu8Ocp5cd6dj3lrxlnMHMMfOJ8/peCO+w==</latexit> <latexit sha1_base64="T7tKwJBl9xb74ZGC+ctnijQ67IY=">AAAB8HicbVDLSgNBEOyNrxhfUY9eBhMhgoTdXPQY9OIhhwjmIckaZiezyZCZ2WVmVghLvsKLB0W8+jne/Bsnj4MmFjQUVd10dwUxZ9q47reTWVvf2NzKbud2dvf2D/KHR00dJYrQBol4pNoB1pQzSRuGGU7bsaJYBJy2gtHN1G89UaVZJO/NOKa+wAPJQkawsdJDsVS7kI+182IvX3DL7gxolXgLUoAF6r38V7cfkURQaQjHWnc8NzZ+ipVhhNNJrptoGmMywgPasVRiQbWfzg6eoDOr9FEYKVvSoJn6eyLFQuuxCGynwGaol72p+J/XSUx45adMxomhkswXhQlHJkLT71GfKUoMH1uCiWL2VkSGWGFibEY5G4K3/PIqaVbKnlv27iqF6vUijiycwCmUwINLqMIt1KEBBAQ8wyu8Ocp5cd6dj3lrxlnMHMMfOJ8/peCO+w==</latexit> <latexit sha1_base64="T7tKwJBl9xb74ZGC+ctnijQ67IY=">AAAB8HicbVDLSgNBEOyNrxhfUY9eBhMhgoTdXPQY9OIhhwjmIckaZiezyZCZ2WVmVghLvsKLB0W8+jne/Bsnj4MmFjQUVd10dwUxZ9q47reTWVvf2NzKbud2dvf2D/KHR00dJYrQBol4pNoB1pQzSRuGGU7bsaJYBJy2gtHN1G89UaVZJO/NOKa+wAPJQkawsdJDsVS7kI+182IvX3DL7gxolXgLUoAF6r38V7cfkURQaQjHWnc8NzZ+ipVhhNNJrptoGmMywgPasVRiQbWfzg6eoDOr9FEYKVvSoJn6eyLFQuuxCGynwGaol72p+J/XSUx45adMxomhkswXhQlHJkLT71GfKUoMH1uCiWL2VkSGWGFibEY5G4K3/PIqaVbKnlv27iqF6vUijiycwCmUwINLqMIt1KEBBAQ8wyu8Ocp5cd6dj3lrxlnMHMMfOJ8/peCO+w==</latexit> ··· <latexit 
sha1_base64="Ta7vWzYkYrn031JxQLrrQTiFx2o=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsaLYk2lpjIRwIXsre3wIa9vXN3zoRc+BM2Fhpj69+x89+4wBUKvmSSl/dmMjMvSKQw6LrfTmFjc2t7p7hb2ts/ODwqH5+0TZxqxlsslrHuBtRwKRRvoUDJu4nmNAok7wST27nfeeLaiFg94DThfkRHSgwFo2ilbrXPwhhNdVCuuDV3AbJOvJxUIEdzUP7qhzFLI66QSWpMz3MT9DOqUTDJZ6V+anhC2YSOeM9SRSNu/Gxx74xcWCUkw1jbUkgW6u+JjEbGTKPAdkYUx2bVm4v/eb0Uh9d+JlSSIldsuWiYSoIxmT9PQqE5Qzm1hDIt7K2EjammDG1EJRuCt/ryOmnXa55b8+7rlcZNHkcRzuAcLsGDK2jAHTShBQwkPMMrvDmPzovz7nwsWwtOPnMKf+B8/gBoOY+J</latexit> <latexit sha1_base64="Ta7vWzYkYrn031JxQLrrQTiFx2o=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsaLYk2lpjIRwIXsre3wIa9vXN3zoRc+BM2Fhpj69+x89+4wBUKvmSSl/dmMjMvSKQw6LrfTmFjc2t7p7hb2ts/ODwqH5+0TZxqxlsslrHuBtRwKRRvoUDJu4nmNAok7wST27nfeeLaiFg94DThfkRHSgwFo2ilbrXPwhhNdVCuuDV3AbJOvJxUIEdzUP7qhzFLI66QSWpMz3MT9DOqUTDJZ6V+anhC2YSOeM9SRSNu/Gxx74xcWCUkw1jbUkgW6u+JjEbGTKPAdkYUx2bVm4v/eb0Uh9d+JlSSIldsuWiYSoIxmT9PQqE5Qzm1hDIt7K2EjammDG1EJRuCt/ryOmnXa55b8+7rlcZNHkcRzuAcLsGDK2jAHTShBQwkPMMrvDmPzovz7nwsWwtOPnMKf+B8/gBoOY+J</latexit> <latexit sha1_base64="Ta7vWzYkYrn031JxQLrrQTiFx2o=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsaLYk2lpjIRwIXsre3wIa9vXN3zoRc+BM2Fhpj69+x89+4wBUKvmSSl/dmMjMvSKQw6LrfTmFjc2t7p7hb2ts/ODwqH5+0TZxqxlsslrHuBtRwKRRvoUDJu4nmNAok7wST27nfeeLaiFg94DThfkRHSgwFo2ilbrXPwhhNdVCuuDV3AbJOvJxUIEdzUP7qhzFLI66QSWpMz3MT9DOqUTDJZ6V+anhC2YSOeM9SRSNu/Gxx74xcWCUkw1jbUkgW6u+JjEbGTKPAdkYUx2bVm4v/eb0Uh9d+JlSSIldsuWiYSoIxmT9PQqE5Qzm1hDIt7K2EjammDG1EJRuCt/ryOmnXa55b8+7rlcZNHkcRzuAcLsGDK2jAHTShBQwkPMMrvDmPzovz7nwsWwtOPnMKf+B8/gBoOY+J</latexit> <latexit 
sha1_base64="Ta7vWzYkYrn031JxQLrrQTiFx2o=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsaLYk2lpjIRwIXsre3wIa9vXN3zoRc+BM2Fhpj69+x89+4wBUKvmSSl/dmMjMvSKQw6LrfTmFjc2t7p7hb2ts/ODwqH5+0TZxqxlsslrHuBtRwKRRvoUDJu4nmNAok7wST27nfeeLaiFg94DThfkRHSgwFo2ilbrXPwhhNdVCuuDV3AbJOvJxUIEdzUP7qhzFLI66QSWpMz3MT9DOqUTDJZ6V+anhC2YSOeM9SRSNu/Gxx74xcWCUkw1jbUkgW6u+JjEbGTKPAdkYUx2bVm4v/eb0Uh9d+JlSSIldsuWiYSoIxmT9PQqE5Qzm1hDIt7K2EjammDG1EJRuCt/ryOmnXa55b8+7rlcZNHkcRzuAcLsGDK2jAHTShBQwkPMMrvDmPzovz7nwsWwtOPnMKf+B8/gBoOY+J</latexit> ··· <latexit sha1_base64="Ta7vWzYkYrn031JxQLrrQTiFx2o=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsaLYk2lpjIRwIXsre3wIa9vXN3zoRc+BM2Fhpj69+x89+4wBUKvmSSl/dmMjMvSKQw6LrfTmFjc2t7p7hb2ts/ODwqH5+0TZxqxlsslrHuBtRwKRRvoUDJu4nmNAok7wST27nfeeLaiFg94DThfkRHSgwFo2ilbrXPwhhNdVCuuDV3AbJOvJxUIEdzUP7qhzFLI66QSWpMz3MT9DOqUTDJZ6V+anhC2YSOeM9SRSNu/Gxx74xcWCUkw1jbUkgW6u+JjEbGTKPAdkYUx2bVm4v/eb0Uh9d+JlSSIldsuWiYSoIxmT9PQqE5Qzm1hDIt7K2EjammDG1EJRuCt/ryOmnXa55b8+7rlcZNHkcRzuAcLsGDK2jAHTShBQwkPMMrvDmPzovz7nwsWwtOPnMKf+B8/gBoOY+J</latexit> <latexit sha1_base64="Ta7vWzYkYrn031JxQLrrQTiFx2o=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsaLYk2lpjIRwIXsre3wIa9vXN3zoRc+BM2Fhpj69+x89+4wBUKvmSSl/dmMjMvSKQw6LrfTmFjc2t7p7hb2ts/ODwqH5+0TZxqxlsslrHuBtRwKRRvoUDJu4nmNAok7wST27nfeeLaiFg94DThfkRHSgwFo2ilbrXPwhhNdVCuuDV3AbJOvJxUIEdzUP7qhzFLI66QSWpMz3MT9DOqUTDJZ6V+anhC2YSOeM9SRSNu/Gxx74xcWCUkw1jbUkgW6u+JjEbGTKPAdkYUx2bVm4v/eb0Uh9d+JlSSIldsuWiYSoIxmT9PQqE5Qzm1hDIt7K2EjammDG1EJRuCt/ryOmnXa55b8+7rlcZNHkcRzuAcLsGDK2jAHTShBQwkPMMrvDmPzovz7nwsWwtOPnMKf+B8/gBoOY+J</latexit> <latexit 
sha1_base64="Ta7vWzYkYrn031JxQLrrQTiFx2o=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsaLYk2lpjIRwIXsre3wIa9vXN3zoRc+BM2Fhpj69+x89+4wBUKvmSSl/dmMjMvSKQw6LrfTmFjc2t7p7hb2ts/ODwqH5+0TZxqxlsslrHuBtRwKRRvoUDJu4nmNAok7wST27nfeeLaiFg94DThfkRHSgwFo2ilbrXPwhhNdVCuuDV3AbJOvJxUIEdzUP7qhzFLI66QSWpMz3MT9DOqUTDJZ6V+anhC2YSOeM9SRSNu/Gxx74xcWCUkw1jbUkgW6u+JjEbGTKPAdkYUx2bVm4v/eb0Uh9d+JlSSIldsuWiYSoIxmT9PQqE5Qzm1hDIt7K2EjammDG1EJRuCt/ryOmnXa55b8+7rlcZNHkcRzuAcLsGDK2jAHTShBQwkPMMrvDmPzovz7nwsWwtOPnMKf+B8/gBoOY+J</latexit> <latexit sha1_base64="Ta7vWzYkYrn031JxQLrrQTiFx2o=">AAAB73icbVA9TwJBEJ3DL8Qv1NJmI5hYkTsaLYk2lpjIRwIXsre3wIa9vXN3zoRc+BM2Fhpj69+x89+4wBUKvmSSl/dmMjMvSKQw6LrfTmFjc2t7p7hb2ts/ODwqH5+0TZxqxlsslrHuBtRwKRRvoUDJu4nmNAok7wST27nfeeLaiFg94DThfkRHSgwFo2ilbrXPwhhNdVCuuDV3AbJOvJxUIEdzUP7qhzFLI66QSWpMz3MT9DOqUTDJZ6V+anhC2YSOeM9SRSNu/Gxx74xcWCUkw1jbUkgW6u+JjEbGTKPAdkYUx2bVm4v/eb0Uh9d+JlSSIldsuWiYSoIxmT9PQqE5Qzm1hDIt7K2EjammDG1EJRuCt/ryOmnXa55b8+7rlcZNHkcRzuAcLsGDK2jAHTShBQwkPMMrvDmPzovz7nwsWwtOPnMKf+B8/gBoOY+J</latexit> Figure 4.5: (n,L)–regular tree topology. is retrievable from the computation results of any n− s workers (Pseudo-code in Appendix H). CompAlloc: For specified n and s, GC (Algorithm 2 in [200]) constructs the encoding matrix B = [b 1 ;··· ;b n ] = [b iκ ]. In CompAlloc, the input data set D is partitioned to D =∪ k κ =1 D κ and distributed among the n workers along with the corresponding coefficients. That is, each worker i∈[n] is assigned withD(i)=∪ k κ =1 b iκ D κ which specifies its local data set and corresponding combining coefficients. The parent of the n workers is then able to recover the gradient overD, i.e. g D upon receiving the partial coded gradients of any n− s workers and using the decoding matrix A designed by GC (Algorithm 1 in [200]). CodedReduce: CR is implemented in two phases. It first allocates each worker with its local computation task via CR.Allocate procedure. This specifies each worker with its local data set and combining coefficients. 
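To make the CompAlloc recovery step concrete, the following minimal sketch (assuming NumPy) simulates one parent with n = 3 workers and s = 1 straggler. The matrices A and B are the ones that appear later in the example of Section 4.3.2 (Eq. (4.4)); the partial gradients g_κ are random stand-ins, not actual data gradients.

```python
import numpy as np

# Decoding matrix A (one row per straggler pattern) and encoding matrix B
# (one row per worker), as in Eq. (4.4) for n = 3, s = 1.
A = np.array([[0., 1., 2.],
              [1., 0., 1.],
              [2., -1., 0.]])
B = np.array([[0.5, 1., 0.],
              [0., 1., -1.],
              [0.5, 0., 1.]])

rng = np.random.default_rng(0)
g = rng.standard_normal((3, 4))   # partial gradients g_1, g_2, g_3 (stand-ins)
g_full = g.sum(axis=0)            # target: the gradient g_D over the whole data set

coded = B @ g                     # worker i computes the coded gradient sum_k b_ik * g_k

# Each decoding row has a zero at its straggler's position, so the parent
# recovers g_full from the n - s = 2 surviving coded results alone.
for a in A:
    surviving = coded * (a != 0)[:, None]   # zero out the straggler's message
    assert np.allclose(a @ surviving, g_full)
```

Each decoding row thus plays the role of one straggler pattern: whichever single worker fails, the corresponding row of A combines the remaining two coded gradients back into g_D.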
Then, the communication strategy is determined by CR.Execute.

CR.Allocate:

1. Starting from the master, data set D_{T(1,i)} is assigned to sub-tree T(1,i) for i ∈ [n] via the allocation module CompAlloc (Figure 4.6).

2. In layer l = 1, each worker (1,i), i ∈ [n], picks r_CR · d data points from the corresponding sub-tree's data set D_{T(1,i)} as its local data set D(1,i) and passes the rest, D̄_{T(1,i)} = D_{T(1,i)} \ D(1,i), to its children and their sub-trees (Figure 4.6).

3. Step (1) is repeated by using the module CompAlloc and treating D̄_{T(1,i)} as the input data set to be distributed among the children of node (1,i).

4. The same procedure is applied until reaching the bottom layer (Figure 4.6).

By doing so, the data set D is redundantly distributed across the tree while all the workers are equally loaded with r_CR · d data points, where in Theorem 9 we show that the value r_CR given in (4.5) is the optimal choice for CR.

CR.Execute:

1. All N nodes start their local partial coded gradient computations on the current model θ^(t), i.e., g_{D(l,i)} for all nodes (l,i). Note that g_{D(l,i)} is a coded gradient (i.e., a linear combination of partial gradients), since D(l,i) carries combining coefficients along with its data points.

2. Starting from the leaf nodes, each leaf sends its partial coded gradient computation result (message) m_{(L,i)} = g_{D(L,i)} up to its parent.

3. Upon receiving enough results from their children (any n − s of them), workers in layer L − 1 recover a linear combination of their children's messages via the proper row of the decoding matrix A; e.g., parent node (L−1,1) recovers from its children's messages [m_{(L,1)}; ···; m_{(L,n)}] via the decoding row a_{f(L−1,1)}.

4. The recovered partial gradient is added to the local partial coded gradient and uploaded to the parent; e.g., node (L−1,1) uploads to its parent m_{(L−1,1)} = a_{f(L−1,1)} [m_{(L,1)}; ···; m_{(L,n)}] + g_{D(L−1,1)}.
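The per-parent step of CR.Execute (steps 3 and 4) can be sketched as follows. This is an illustrative stand-in, not the dissertation's implementation: the function name `parent_message` and its arguments are hypothetical, and the decoding matrix is the n = 3, s = 1 matrix A used in the example of Section 4.3.2. A parent waits for any n − s children, decodes their combination with the row of A supported on the arrived children, and adds its own local coded gradient before uploading.

```python
import numpy as np

# Decoding rows, one per straggler pattern (n = 3, s = 1, as in Eq. (4.4)).
A = np.array([[0., 1., 2.],
              [1., 0., 1.],
              [2., -1., 0.]])

def parent_message(child_msgs, local_coded_grad):
    """One parent's step in CR.Execute (hypothetical helper).
    child_msgs[i] is None if child i straggles; at most s = 1 may be None."""
    arrived = [i for i, m in enumerate(child_msgs) if m is not None]
    # Pick a decoding row whose nonzero support lies within the arrived children.
    row = next(a for a in A if set(np.flatnonzero(a)) <= set(arrived))
    combined = sum(row[i] * child_msgs[i] for i in arrived)
    return combined + local_coded_grad

# Example: the third child straggles, so row [2, -1, 0] is used:
# 2 * 1.0 - 1 * 2.0 + 0.5 = 0.5.
m = parent_message([np.array([1.0]), np.array([2.0]), None], np.array([0.5]))
```

The design point illustrated here is that a parent never waits for all n children: any n − s arrivals determine a usable decoding row, which is what gives CR its straggler resiliency at every layer of the tree.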
Figure 4.6: Illustration of task allocation in CR.

5.
The same procedure is repeated until reaching the master node, which is then able to aggregate the total gradient g_D. The pseudo-code for CR is available in Appendix I.

4.3.2 An Example for CR

In this section, we provide a simple example to better illustrate the proposed CR scheme.

Example (CodedReduce). Consider a (3,2)–regular tree with N = 12 nodes and s = 1 straggler per parent. From GC, we have the decoding and encoding matrices

A = [0 1 2; 1 0 1; 2 −1 0],  B = [1/2 1 0; 0 1 −1; 1/2 0 1].  (4.4)

Following CR's description, we partition the data set of size d as D = {D_1, D_2, D_3} and assign D_{T(1,1)} = (1/2)D_1 ∪ D_2 to sub-tree T(1,1). Node (1,1) then picks r_CR · d = (4/15)d data points from D_{T(1,1)} as D(1,1). To do so, D_{T(1,1)} is partitioned into 5 sub-sets as D_{T(1,1)} = D^{T(1,1)}_1 ∪ ··· ∪ D^{T(1,1)}_5, and node (1,1) picks the first two sub-sets, i.e., D(1,1) = D^{T(1,1)}_1 ∪ D^{T(1,1)}_2. (Indeed, D_{T(1,1)} contains |D_1| + |D_2| = 2d/3 data points, so each of the 5 sub-sets has 2d/15 points and keeping two of them yields 4d/15.)
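One can quickly verify that the matrices in (4.4) satisfy the defining gradient-coding requirement: every decoding row must combine the encoding rows into the all-ones vector, so that the recovered combination equals the sum of all k = 3 partial gradients. A minimal numerical check (assuming NumPy):

```python
import numpy as np

A = np.array([[0, 1, 2], [1, 0, 1], [2, -1, 0]], dtype=float)
B = np.array([[0.5, 1, 0], [0, 1, -1], [0.5, 0, 1]], dtype=float)

# Every decoding row a_f must satisfy a_f @ B = [1, 1, 1]; equivalently,
# each row of A @ B is the all-ones vector.
assert np.allclose(A @ B, np.ones((3, 3)))
```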
sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> <latexit sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> D(2,1) <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D(2,2) <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D(2,3) <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D T(1,1) 3 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D T(1,1) 4 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D T(1,1) 4 <latexit 
sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D T(1,1) 3 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D T(1,1) 5 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D T(1,1) 5 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> = <latexit sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> <latexit sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> <latexit 
sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> <latexit sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> = <latexit sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> <latexit 
sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> <latexit sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> <latexit sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> D T(1,1) <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D 1 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D 2 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit 
sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D T(1,1) 1 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D T(1,1) 2 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D T(1,1) 3 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D T(1,1) 4 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D T(1,1) 5 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> = <latexit sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> <latexit 
sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> <latexit sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> <latexit sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> 1/2 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> 1 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> = <latexit 
sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> <latexit sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> <latexit sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> <latexit 
sha1_base64="kaDWDxarIbt8b7begmsQdMm2hCQ=">AAAB6nicbVA9TwJBEJ3DL8Qv1NJmI5hYkTsabUyINpYYBUngQvaWOdiwt3fZ3TMhF36CjYXG2PqL7Pw3LnCFgi+Z5OW9mczMCxLBtXHdb6ewtr6xuVXcLu3s7u0flA+P2jpOFcMWi0WsOgHVKLjEluFGYCdRSKNA4GMwvpn5j0+oNI/lg5kk6Ed0KHnIGTVWuq9eVfvliltz5yCrxMtJBXI0++Wv3iBmaYTSMEG17npuYvyMKsOZwGmpl2pMKBvTIXYtlTRC7WfzU6fkzCoDEsbKljRkrv6eyGik9SQKbGdEzUgvezPxP6+bmvDSz7hMUoOSLRaFqSAmJrO/yYArZEZMLKFMcXsrYSOqKDM2nZINwVt+eZW06zXPrXl39UrjOo+jCCdwCufgwQU04Baa0AIGQ3iGV3hzhPPivDsfi9aCk88cwx84nz9Dq40b</latexit> 1/2 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> 1/2 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> 1/2 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> 1 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> 1 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> 1 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D(1,1) <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit> D T(1,1) 1 <latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit sha1_base64="(null)">(null)</latexit><latexit 
D^{T(1,1)}_2 and the rest of D^{T(1,1)}, namely D^{T(1,1)}_3 ∪ D^{T(1,1)}_4 ∪ D^{T(1,1)}_5, is passed to layer 2. Note that the data points in D(1,1) carry the linear-combination coefficients associated with D^{T(1,1)} = (1/2)D_1 ∪ D_2. Figure 4.7 shows each node in sub-tree T(1,1) with its allocated data set along with the encoding coefficients.

Figure 4.7: Illustration of data allocation and communication strategy in CR for a (3,2)–regular tree.
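As a numerical sanity check on this allocation, the sketch below uses random vectors as stand-ins for the partial gradients of the five parts of the coded sub-tree dataset D^{T(1,1)} and verifies that node (1,1) can assemble its sub-tree aggregate, (1/2)g_{D_1} + g_{D_2}, from any 2 of its 3 children. The child messages anticipate the layer-2 allocations described next in the text; the decoding rows for child pairs other than {(2,1),(2,2)} are our own assumptions, consistent with those allocations:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4  # model dimension (illustrative)

# Stand-in partial gradients for the 5 parts of the coded sub-tree
# dataset D^{T(1,1)}; their sum equals (1/2) g_{D1} + g_{D2}.
g1, g2, g3, g4, g5 = (rng.standard_normal(p) for _ in range(5))
g_subtree = g1 + g2 + g3 + g4 + g5

g_own = g1 + g2          # node (1,1) keeps parts 1 and 2
m21 = 0.5 * g3 + g4      # message from child (2,1)
m22 = g4 - g5            # message from child (2,2)
m23 = 0.5 * g3 + g5      # message from child (2,3)

# Any 2 of the 3 children suffice; only the first row below is
# given in the text, the other two rows are our assumptions.
decodes = {
    "(2,1),(2,2)": 2 * m21 - m22,   # first row of A: [2, -1, 0]
    "(2,1),(2,3)": m21 + m23,       # assumed row: [1, 0, 1]
    "(2,2),(2,3)": m22 + 2 * m23,   # assumed row: [0, 1, 2]
}
for pair, dec in decodes.items():
    m11 = dec + g_own
    assert np.allclose(m11, g_subtree), pair
print("node (1,1) recovers its sub-tree aggregate from any 2 children")
```

Each decoding row annihilates the contribution of the missing child while recovering g3 + g4 + g5, mirroring the MDS-style redundancy in the allocation.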
Moving to layer 2, D^{T(1,1)} is partitioned into 3 subsets and, according to B in (4.4), the allocations to nodes (2,1), (2,2) and (2,3) are as follows:

D(2,1) = (1/2)D^{T(1,1)}_3 ∪ D^{T(1,1)}_4,
D(2,2) = D^{T(1,1)}_4 ∪ (−1)D^{T(1,1)}_5,
D(2,3) = (1/2)D^{T(1,1)}_3 ∪ D^{T(1,1)}_5.

Similarly for the other sub-trees, each node is now allocated a data set in which every data point is associated with a scalar coefficient. For instance, node (2,1) uploads

m(2,1) = g_{D(2,1)} = (1/2)g_{D^{T(1,1)}_3} + g_{D^{T(1,1)}_4}

to its parent (1,1). Node (1,1) can recover from any 2 surviving children, e.g. from (2,1) and (2,2): using the first row of A, it uploads

m(1,1) = [2, −1, 0]·[m(2,1); m(2,2); m(2,3)] + g_{D(1,1)} = 2m(2,1) − m(2,2) + g_{D(1,1)} = (1/2)g_{D_1} + g_{D_2}

to the master. Similarly for the other nodes, the master can recover the full gradient from any two children, e.g. using the second row of the decoding matrix A and the surviving children (1,1) and (1,3):

[1, 0, 1]·[m(1,1); m(1,2); m(1,3)] = m(1,1) + m(1,3) = (1/2)g_{D_1} + g_{D_2} + (1/2)g_{D_1} + g_{D_3} = g_D.

4.3.3 Theoretical Guarantees of CR

In this section, we formally present the theoretical guarantees of CR. We first characterize the computation load induced by CR and demonstrate its significant improvement over GC. Then, we consider the commonly used shifted-exponential run-time distribution and a single-port communication model for the workers, asymptotically characterize the expected run-time of CR, and conclude with a discussion of its communication parallelization gain.

Computation Load Optimality: We show that, for a fixed tree topology, the proposed CR is optimal in the sense that it achieves the minimum per-node computation load for a target resiliency. This optimality is established in two steps in Theorem 9: (i) we first show achievability by characterizing the computation load of CR; and (ii) we establish a converse showing that CR's computation load is as small as possible. The proof is available in Appendix J.

Theorem 9.
For a fixed (n,L)–regular tree, any gradient aggregation scheme robust to any s stragglers per parent requires computation load r with

r ≥ r_CR = 1 / ( n/(s+1) + ··· + (n/(s+1))^L ). (4.5)

Remark 22. While CR is α-resilient, i.e. robust to any s = αn stragglers per parent node, it significantly improves the per-node computation (and storage) load compared to an equivalent GC scheme with the same resiliency. In particular, GC loads each worker with an r_GC = (S+1)/N = (αN+1)/N ≈ α fraction of the data set, while CR considerably reduces it to r_CR = 1/Σ_{l=1}^{L} (n/(αn+1))^l ≈ α^L. For α = 0.5, as an instance, CR reduces the computation load by 7× by rearranging the nodes from 1 layer into 3 layers.

Remark 23. CR makes the distributed GD strategy α-resilient, i.e. robust to any s = αn stragglers per parent node, which sums up to a total of S = αN stragglers, the same as the worst-case number of stragglers in GC. Clearly, if the stragglers are picked adversarially, for instance all the nodes in layer 1, then CR fails to recover the total gradient at the master. However, our experiments over Amazon EC2 confirm that stragglers are randomly distributed over the tree rather than adversarially picked, which is aligned with the random straggler pattern considered in this paper.

Total Gradient Computation Complexity: To better characterize the advantages of CR, we characterize its total gradient computation complexity required to reach a final model of predefined accuracy. More precisely, we focus on learning problems with strongly convex losses and let T_CR denote the total number of iterations needed to reach a final model θ such that ∥θ − θ*∥_2 ≤ ϵ. Since each iteration of CR computes the exact gradient over all d data samples (as in GD), we have T_CR = O(log(1/ϵ)). In each iteration, each of the N worker nodes computes α_CR · d gradients, where, according to Theorem 9, α_CR ≈ α^L.
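The load comparison in Remark 22 is easy to check numerically. The sketch below evaluates r_CR from (4.5) with s = αn and r_GC = (S+1)/N; the tree parameters are illustrative choices of our own:

```python
# Per-node computation load: CR (Theorem 9) vs. GC, for straggler ratio alpha.

def r_cr(n: int, L: int, s: int) -> float:
    """Load 1 / sum_{l=1}^{L} (n/(s+1))^l from (4.5)."""
    return 1.0 / sum((n / (s + 1)) ** l for l in range(1, L + 1))

def r_gc(N: int, S: int) -> float:
    """Gradient Coding load (S+1)/N."""
    return (S + 1) / N

alpha, L = 0.5, 3
n = 1000                                   # children per parent (large-n regime)
N = sum(n ** l for l in range(1, L + 1))   # total workers in the (n,L) tree
ratio = r_gc(N, int(alpha * N)) / r_cr(n, L, int(alpha * n))
print(f"r_GC / r_CR = {ratio:.2f}")        # approaches 7x as n grows
```

For L = 1 the tree degenerates to the master-worker topology and the two loads coincide, while for α = 0.5 and L = 3 the ratio tends to 0.5 / (1/(2+4+8)) = 7, matching Remark 22.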
All in all, in order to reach an ϵ-accurate model, the CR method requires O(α^L · N · log(1/ϵ) · d) gradient computations in total.

One simple yet naive approach to mitigate stragglers is to update the model using the gradient computation results of only a fraction α of the worker nodes (the non-stragglers). This approach can be treated as standard Stochastic Gradient Descent (SGD), which requires T_SGD = O(1/ϵ) iterations in total to reach an ϵ-accurate model. Since each of the N worker nodes stores d/N samples (i.e. no redundant data allocation), each node computes αd/N gradients per iteration. Putting it all together, to reach an ϵ-accurate model, SGD requires O(α · (1/ϵ) · d) gradient computations in total. Comparing the two gradient computation complexities of CR and SGD, we observe that although SGD slashes the complexity by a linear factor N, it suffers from two exponential factors: α^L grows to α, and log(1/ϵ) grows to 1/ϵ, which significantly increases the total gradient computation complexity, since α^L ≪ α and log(1/ϵ) ≪ 1/ϵ.

Latency Performance: While we have derived the straggler resiliency of CR, the ultimate goal of a distributed gradient aggregation scheme is to have a small latency, which is partly attained by establishing higher communication parallelization.

Computation Time Model: We consider a random computation time model for the workers with a shifted exponential distribution, which is used in several prior works [129, 173, 117]. More precisely, for a worker W_i with an assigned data set of size d_i, we model the computation time as a random variable with a shifted exponential distribution as follows:

P[T_i ≤ t] = 1 − e^{−(µ/d_i)(t − a·d_i)}, for t ≥ a·d_i, (4.6)

where the system parameters a = Θ(1) and µ = Θ(1) denote the shift and the exponential rate, respectively. We assume that the T_i's are independent.
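Equivalently, (4.6) says T_i equals the deterministic shift a·d_i plus an exponential random variable with rate µ/d_i, so E[T_i] = a·d_i + d_i/µ. A quick Monte-Carlo check with arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
a, mu, d_i = 1e-4, 5.0, 2000   # illustrative shift, rate, and load

# P[T <= t] = 1 - exp(-(mu/d_i)(t - a*d_i)) for t >= a*d_i,
# i.e. T = a*d_i + Exp(rate mu/d_i).
samples = a * d_i + rng.exponential(scale=d_i / mu, size=200_000)

expected = a * d_i + d_i / mu   # E[T_i] under the model
print(samples.mean(), expected)
```

Both the mean and the hard lower bound t ≥ a·d_i of the distribution are visible in the samples.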
Communication Time Model: To model the communication time and bandwidth bottleneck, we assume that each node can receive messages from only one other node at a time, and that the total available bandwidth is dedicated to the communicating node. We also assume that communicating a partial gradient vector (of size p) from a child to its parent takes a constant time t_c.

The following theorem asymptotically characterizes the expected run-time of CR, which we denote by T_CR (the proof is available in Appendix K). More precisely, we consider the regime of interest where the data set size d and the number of layers L in the tree are fixed, while the number of children per parent, n, approaches infinity with a constant straggler ratio α = s/n = Θ(1).

Theorem 10. Considering the computation time model in (4.6) for the workers, the expected run-time of CR on an (n,L)–regular tree with resiliency α = Θ(1) satisfies the following:

E[T_CR] ≥ (r_CR·d/µ)·log(1/α) + a·r_CR·d + (n(1−α) − o(n) + L − 1)(1 − o(1))·t_c + o(1), (4.7)
E[T_CR] ≤ (r_CR·d/µ)·log(1/α) + a·r_CR·d + n(1 − o(1))·L·t_c + o(1). (4.8)

Remark 24. Theorem 10 implies that the expected run-time of the proposed CR algorithm breaks down into two terms: E[T_CR] = Θ(1) + Θ(n), where the terms Θ(1) and Θ(n) correspond to the computation and communication times, respectively. As a special case, it also implies that the average run-time of GC is E[T_GC] = Θ(1) + Θ(N). This clearly demonstrates that CR is indeed alleviating the bandwidth bottleneck: it improves the communication parallelization gain from β_GC = Θ(1) to β_CR = Θ(N/n) = Θ(N^{1−1/L}) by parallelizing the communications over an L-layer tree structure.

4.4 Empirical Evaluation of CR

In this section, we provide the results of our experiments conducted over Amazon EC2, for which we used Python with the mpi4py package. Our results demonstrate significant speedups of CR over the baseline approaches.
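Before turning to the experiments, the bounds of Theorem 10 can be evaluated numerically for settings of this scale. The sketch below keeps only the leading-order terms of (4.7)–(4.8), dropping all o(·) terms, so it is an approximation under that assumption; the parameter values are illustrative choices consistent with the experiments:

```python
import math

def t_cr_bounds(n, L, alpha, d, mu, a, t_c):
    """Leading-order lower/upper bounds on E[T_CR] from (4.7)-(4.8),
    with every o(.) term dropped (an approximation, not the exact bounds)."""
    s = int(alpha * n)
    r_cr = 1.0 / sum((n / (s + 1)) ** l for l in range(1, L + 1))
    compute = (r_cr * d / mu) * math.log(1 / alpha) + a * r_cr * d
    lower = compute + (n * (1 - alpha) + L - 1) * t_c
    upper = compute + n * L * t_c
    return lower, upper

# n=12 children, L=2 layers, alpha=1/4, GISETTE-sized d; a*mu = 1.
lo, hi = t_cr_bounds(n=12, L=2, alpha=0.25, d=6552, mu=2.0e4, a=5e-5, t_c=0.05)
print(f"E[T_CR] roughly in [{lo:.3f}, {hi:.3f}] seconds")
```

The gap between the two bounds is governed by the communication terms, which dominate as n grows, consistent with the Θ(n) communication component of Remark 24.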
We consider two sets of machine learning experiments, one with a real data set and another with an artificial data set. For each machine learning setting, we consider two cluster configurations, one with N = 84 workers and another with N = 156 workers, using a t2.micro instance for the master and all workers. Furthermore, each experiment is run for 300 rounds. Next, we describe the experiments in detail and provide the results.

4.4.1 Convex Optimization

4.4.1.1 Real Data Set

We consider the machine learning problem of logistic regression via gradient descent (GD) over the real data set GISETTE [74]. The problem is to separate the often-confused digits '9' and '4'. We use d = 6552 training samples, with model size p = 5001. The following relative error rate is considered for model estimation:

Relative Error Rate = ∥θ^(t) − θ^(t−1)∥_2 / ∥θ^(t−1)∥_2, (4.9)

where θ^(t) denotes the estimated model at iteration t. The following schemes are considered for data allocation and gradient aggregation:

Figure 4.8: Convergence curves for relative error rate vs wall-clock time for logistic regression over N = 84 workers. The straggler resiliency is α = 1/4. CR achieves a speedup of up to 32.8×, 5.3×, 3.8× and 3.2× respectively over UMW, GC, RAR and SGD.

1. Uncoded Master-Worker (UMW): This is the naive scheme in which the data set is uniformly partitioned among the workers, and the master waits for results from all the workers to aggregate the gradient.

2. Gradient Coding (GC): We implement GC as described in Section 4.2.3, with the straggler parameter S = αN.

3. Ring-AllReduce (RAR): The data set is uniformly partitioned over the workers, and the MPI function MPI_Allreduce() is used for gradient aggregation.

4. Stochastic Gradient Descent (SGD): The data allocation is the same as in UMW. However, the master updates the model using the partial gradient obtained by aggregating the results of only the first N − S children.
Furthermore, as is typical in SGD experiments, we used a learning rate of c_1/(t + c_2), where c_1 and c_2 were numerically optimized.

5. CodedReduce (CR): We implement our proposed scheme as presented in Section 4.3 on a tree with (n,L) = (12,2), with the straggler parameter s = αn.

Next, we plot the relative error rate defined in (4.9) as a function of wall-clock time for our logistic regression experiments with N = 84 workers and N = 156 workers in Fig. 4.8 and Fig. 4.9, respectively. For N = 84, we consider a straggler resiliency of α = 1/4, while for N = 156, we consider three different values of α: 1/12, 2/12 and 3/12. We make the following observations from the plots:

• As demonstrated by Fig. 4.8 and 4.9, CR achieves significant speedups over the baseline approaches. Specifically, for (N,α) = (84,1/4), CR is faster than UMW, GC, RAR and SGD by 32.8×, 5.3×, 3.8× and 3.2×, respectively. For (N,α) = (156,1/12), CR achieves speedups of 32.3×, 27.2×, 7.0× and 25.4×, respectively, over UMW, GC, RAR and SGD. Similar speedups are obtained with (N,α) = (156,2/12) and (N,α) = (156,3/12), as demonstrated by Fig. 4.9(b) and Fig. 4.9(c), respectively.

• Although GC gains over UMW by avoiding stragglers, its performance is still bottlenecked by bandwidth congestion and by the increase in computation load at each worker by a factor of (S+1) in comparison to UMW. These bottlenecks are reflected in the comparison with SGD, which has similar or better performance than GC due to its much lower computation load per worker.

(a) Convergence curves for α = 1/12. CR achieves a speedup of up to 32.3×, 27.2×, 7.0× and 25.4×, respectively, over UMW, GC, RAR and SGD. (b) Convergence curves for α = 2/12. CR achieves a speedup of up to 29.3×, 23.3×, 6.4× and 21.9×, respectively, over UMW, GC, RAR and SGD. (c) Convergence curves for α = 3/12. CR achieves a speedup of up to 25.0×, 16.8×, 5.4× and 15.4×, respectively, over UMW, GC, RAR and SGD.
Figure 4.9: Convergence results for relative error rate vs wall-clock time for logistic regression over N = 156 workers with different straggler resiliency α.

• RAR significantly outperforms UMW as well as GC for both the N = 84 and N = 156 worker settings. Although RAR achieves performance similar to SGD in the N = 84 scenario, it ultimately beats all the schemes with the generic master-worker topology when the cluster size is increased to N = 156. Our proposed CR algorithm combines the best of GC and RAR by providing straggler robustness via coding and alleviating the bandwidth bottleneck via a tree topology.

Figure 4.10: Convergence curves for normalized error rate vs wall-clock time for linear regression over N = 84 workers. The straggler resiliency is α = 1/4. CR achieves a speedup of up to 24.1×, 4.6×, 3.0× and 2.8× respectively over UMW, GC, RAR and SGD.

4.4.1.2 Artificial Data Set

Next, we solve a linear regression problem via GD over a synthetic data set with parameters (d,p) = (7644,6500). We generate the data set using the following model:

x_j(p+1) = x_j(1:p)^⊤ θ* + z_j, for j ∈ [d], (4.10)

where the true model θ* and the features x_j(1:p) = [x_j(1); ···; x_j(p)] are drawn randomly from the N(0, I_p) distribution, and z_j is standard Gaussian noise. We consider the following normalized error rate:

Normalized Error Rate = ∥θ^(t) − θ*∥_2 / ∥θ*∥_2. (4.11)

In Fig. 4.10 and 4.11, we plot the normalized error rate defined in (4.11) as a function of wall-clock time for N = 84 and N = 156, respectively. We consider configurations and schemes similar to those of the experiments with the real data set. The following observations are made with regard to the experiments:

• As in the previous case of logistic regression with the real data set, CR achieves significant speedups over the baseline approaches for linear regression as well.
Particularly, for (N,α) = (84,1/4), CR achieves speedups of 24.1×, 4.6×, 3.0× and 2.8× over UMW, GC, RAR and SGD, respectively. When (N,α) = (156,1/12), CR achieves speedups of 31.7×, 22.0×, 5.2× and 20.7× in comparison to UMW, GC, RAR and SGD, respectively. Similar speedups are obtained for (N,α) = (156,2/12) and (N,α) = (156,3/12).

(a) Convergence curves for α = 1/12. CR achieves a speedup of up to 31.7×, 22.0×, 5.2× and 20.7×, respectively, over UMW, GC, RAR and SGD. (b) Convergence curves for α = 2/12. CR achieves a speedup of up to 27.1×, 18.1×, 4.4× and 16.8×, respectively, over UMW, GC, RAR and SGD. (c) Convergence curves for α = 3/12. CR achieves a speedup of up to 22.2×, 13.7×, 3.6× and 13.0×, respectively, over UMW, GC, RAR and SGD.

Figure 4.11: Convergence results for normalized error rate vs wall-clock time for linear regression over N = 156 workers with different straggler resiliency α.

• GC performs better than UMW by avoiding stragglers. However, its performance is still bottlenecked by bandwidth congestion and by the increase in computation load at each worker by a factor of (S+1) in comparison to UMW.

• SGD achieves a gain in per-iteration time over UMW and GC. However, it has a higher normalized error with respect to the true model.

• Combined with the results of logistic regression, our experiments complement the theoretical gains of CR established earlier. As demonstrated by the results, a tree-based topology is well suited for alleviating the bandwidth bottleneck in large-scale commodity clusters. Furthermore, the data allocation and coding strategy provide resiliency to stragglers.

Figure 4.12: Convergence curves for normalized error rate vs wall-clock time for linear regression over N = 156 workers and (d,p) = (32760,5000). The straggler resiliency is α = 1/4 and the number of rounds is 50. CR achieves a speedup of up to 11.3×, 9.7×, 1.69× and 6.1× respectively over UMW, GC, RAR and SGD.

Remark 25.
So far, we have considered small-scale datasets in our experiments, motivated by the fact that edge devices with non-dedicated resources have only a small amount of memory available for computation. Nevertheless, our proposed scheme CR can speed up general machine learning in cloud environments. To illustrate this point, we carried out another experiment with a larger dataset, (d,p) = (32760,5000), and (N,α) = (156,1/4). As illustrated by Fig. 4.12, CR outperforms the baseline approaches by considerable margins. Specifically, CR achieves a speedup of 11.3×, 9.7×, 1.69× and 6.1× over UMW, GC, RAR and SGD, respectively.

4.4.2 Neural Networks

We carry out simulations to evaluate the benefits of CR in the distributed training of neural networks with the cross-entropy loss, which involves non-convex and non-smooth loss functions due to a variety of non-linearities such as ReLUs. For this, we consider the CIFAR10 dataset [102], which has 10 different categories of images. The CIFAR10 training set has 50000 images, while the test dataset has 10000 images. We provide the details of the neural network in Table 4.2. We use an initial step size of 0.02, and a step decay of 0.7 at iterations 1300 and 2100. We use the Glorot uniform initializer for the convolutional layer weights, while for the fully connected layers we use the default initializer.

We consider a cluster of N = 156 servers, a resiliency of α = 5/12, and n = 12 children per node for CR. We use a random subset of d = 49920 training images for training. Accuracy is reported on the test dataset. We use the PyTorch library for neural network training. Furthermore, we use the computation and communication model described earlier, where we assume t_c = 0.05 seconds, a = 5×10^{−5} seconds per data point, and aµ = 1. In Fig. 4.13, we plot the accuracy vs wall-clock time curves for the different approaches, where training is carried out for a total of 2500 iterations.
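The p ≈ 120,000 parameter count of this model can be checked by tallying the layer shapes listed in Table 4.2; the presence of a bias term in every layer is our assumption:

```python
# (out_ch * in_ch * k * k + out_ch) for Conv2d, (in * out + out) for Linear,
# following the shapes in Table 4.2; bias terms are assumed for every layer.
layers = [
    ("Conv2d", 3 * 16 * 3 * 3 + 16),
    ("Conv2d", 16 * 64 * 4 * 4 + 64),
    ("Linear", 64 * 384 + 384),
    ("Linear", 384 * 192 + 192),
    ("Linear", 192 * 10 + 10),
]
total = sum(count for _, count in layers)
print(total)  # 117706, i.e. p ~ 120,000
```

The bulk of the parameters sit in the 384×192 fully connected layer, as is typical for small convolutional networks of this shape.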
Clearly, CR outperforms the other approaches by significant margins. Particularly, CR achieves a speedup of up to 6.6×, 4.8×, 1.8× and 4.0×, respectively, over UMW, GC, RAR and SGD.

Figure 4.13: Convergence curves for test accuracy vs wall-clock time for neural network training over N = 156 workers. The neural network model has p ≈ 120,000 parameters. The straggler resiliency is α = 5/12 and the number of rounds is 2500. CR achieves a speedup of up to 6.6×, 4.8×, 1.8× and 4.0× respectively over UMW, GC, RAR and SGD.

Table 4.2: Details of the neural network architecture used in the simulations.
Sl. No.  Parameter  Shape        Hyperparameters
1        Conv2d     3×16×3×3     stride=1, padding=(1,1)
2        Conv2d     16×64×4×4    stride=1, padding=(0,0)
3        Linear     64×384       -
4        Linear     384×192      -
5        Linear     192×10       -

4.5 Conclusion

To conclude, we discussed two critical bottlenecks in scaling up gradient descent-based distributed learning frameworks: communication efficiency and stragglers' delays. We proposed CodedReduce (CR), a joint design of the communication topology and the data set allocation strategy. CR combines the best of two existing approaches, Ring-AllReduce (RAR) and Gradient Coding (GC), by leveraging the communication parallelization of RAR and the straggler resiliency of GC. Theoretically, we characterized the computation load and straggler resiliency of CR, as well as its asymptotic expected run-time. Lastly, we empirically demonstrated that our proposed CR design achieves speedups of up to 27.2× and 7.0×, respectively, over GC and RAR.

We also discussed that, although the main goal in the proposed CR design is to recover the exact total gradient in each iteration of GD, one can relax this goal to inexact gradient aggregation, leading to SGD-type optimization methods. We discussed how straggler resiliency and communication efficiency in GD-type methods can be improved by employing the CR design, while requiring lower computation complexity compared to naive SGD-type procedures.
We note that although SGD has been widely considered for large-scale training, GD is still the prominent choice in many industry settings where one wants to ensure that the gradient computations are carried out in full, so as not to lose even a little bit of performance. This is critical when the model will be used by millions of people, so that even a slight improvement from GD is valuable. We note that CR may not be applicable to SGD settings in its current form, because the coded task allocation and execution described in the proposed CR algorithm serve the purpose of exact gradient recovery, i.e. GD. Such elaborate, extra gradient computation makes less sense if we relax our goal to inexact gradient recovery, i.e. SGD. There are simple and computationally efficient approaches to deal with stragglers in SGD settings, such as waiting for an α fraction of the nodes to respond, as explained in Sections 4.3 and 4.4. It remains an interesting future direction to study potential coding opportunities for straggler mitigation in SGD scenarios. Lastly, the tree structure proposed in this paper opens up interesting new directions for further improving the resiliency of distributed gradient aggregation schemes. For instance, given a fixed set of available worker nodes, how can one find the optimal tree (i.e. the optimal depth and width) that minimizes the expected run-time?

Chapter 5

Coded Computing for Low-Latency Federated Learning over Wireless Edge Networks

5.1 Introduction

Massive amounts of data are generated each day by the Internet of Things, comprising billions of devices including autonomous vehicles, cell phones, and personal wearables [179]. This big data has the potential to power a wide range of statistical machine learning based applications, such as predicting health events like a heart attack from wearable devices [48].
To enable low-latency and efficient computing capabilities close to the user traffic, there have been significant recent efforts to develop multi-access edge computing (MEC) platforms [185, 91, 72, 5, 146].

In classical MEC settings, client data is transferred to an underlying centralized computational infrastructure for further processing. However, client data can be of a personalized nature, due to which there is an increasing privacy concern about moving client data to a central location for model training. For example, a person may want to use a machine learning application to predict health events like low blood sugar, but may not be willing to share their health records.

The federated learning framework has recently been developed to carry out machine learning tasks on data distributed across the client nodes, while the raw data is kept at the clients and never uploaded to the central server [141, 99]. As first formulated in [141], federated learning proceeds in two major steps. First, every client carries out a local gradient update on its local dataset. Second, a central server collects and aggregates the updates from the clients, updates the global model, and transmits it to the clients. This iterative procedure is carried out until convergence.

Figure 5.1: Illustration of the federated learning paradigm over multi-access edge computing (MEC) networks with n client devices and an MEC server. During each training round, client E_j receives the latest model from server M, computes a local gradient update over its local dataset D_j, and communicates the gradient update to the server. Training performance is critically bottlenecked by the presence of straggling nodes and communication links.

Implementation of federated learning in MEC networks suffers from some fundamental bottlenecks.
The heterogeneity of compute and communication resources across clients makes client selection a difficult task, as the overall gradient aggregation at the MEC server can be significantly delayed by straggling computations and communication links (see Figure 5.1). Additionally, federated learning suffers from wireless link failures during transmission. Failed communications can be re-transmitted, but this may drastically prolong the training time. Furthermore, the data distribution across clients in MEC networks is non-IID (not independent and identically distributed), i.e. the data stored locally on a device does not represent the population distribution [240]. Thus, missing updates from clients leads to poor convergence.

CodedFedL Overview: To overcome the aforementioned challenges, we propose CodedFedL, a novel coded computing framework that leverages coding-theoretic ideas to inject structured redundancy into federated learning for mitigating straggling clients and communication links, and for improving the performance of federated learning with non-IID data. In the following, we summarize the key aspects of our proposal.

• Coded Computation at the MEC Server: CodedFedL leverages the compute power of the federated learning server. In particular, for distributed linear regression, we propose to generate masked parity data locally at each client at the start of the training procedure, by taking linear combinations of the features and labels in the local dataset. The encoding coefficients are locally generated by the clients, and neither the raw client data nor the encoding coefficients are shared with the server. The local parity datasets are shared with the server, which aggregates them to obtain the composite global parity dataset. During training, the central server obtains the coded gradient by computing the gradient over the global parity data, which compensates for the erased or delayed parameter updates from the straggling clients.
The combination of the coded gradient computed by the MEC server and the uncoded gradients from the non-straggling clients stochastically approximates the full gradient over the entire dataset available at the clients, thus mitigating the convergence issues that arise from missing updates from clients when data is non-IID.

• Non-linear Federated Learning: To enable non-linear model training, we propose a data pre-processing step that transforms the distributed learning task into linear regression, by leveraging the popular kernel embedding based on random Fourier features (RFF) [161]. Each client then generates its parity dataset by taking linear combinations over its transformed features and the associated labels, and the server combines them to obtain the global parity dataset. Training is then carried out with the transformed dataset at the clients and the global parity dataset at the server, as outlined in the previous bullet.

• Optimal Load Allocation: To obtain the amount of coding redundancy and the number of local data points that a client processes during training, we formulate an optimization problem that finds the minimum deadline time until which the MEC server should wait in each round before updating the model. We provide an analytical and tractable approach for efficiently finding the coding redundancy and load allocation that optimize the deadline time. Our approach is based on solving a key subproblem, which can be cast as a piecewise convex optimization problem with a bounded domain and hence can be solved efficiently using standard convex optimization tools. We also derive the unique closed-form solution of this subproblem for the special case in which the communication links are fully reliable with adequate error-protection coding, thus covering the special case of the AWGN (Additive White Gaussian Noise) channel.
• Privacy Characterization: To characterize the privacy leakage in sharing local parity datasets with the server, we consider the case in which each client utilizes an encoding matrix whose entries are independently drawn from a standard normal distribution. We consider the notion of ϵ-mutual-information differential privacy (MI-DP), which is closely related to differential privacy [36]. Specifically, we leverage the recent result in [188] on the ϵ-MI-DP of Gaussian random projections, and bound the leakage in a client's data privacy as a function of its database and the size of the parity dataset.

• Convergence Analysis: In CodedFedL, the expectation of the combination of the coded gradient and the uncoded gradients that the MEC server receives by the optimal deadline time is approximately equal to the full gradient over the entire dataset at the clients. Under simplifying assumptions, we analyze convergence and quantify the iteration complexity of CodedFedL, by treating the learning process as a stochastic gradient descent algorithm.

• Performance Results: We evaluate the performance gains of CodedFedL by carrying out numerical experiments with a wireless MEC setting, benchmark datasets, and non-IID data across clients. We consider the naive uncoded baseline, where the server waits for all client updates, as well as the greedy uncoded baseline, where the server waits for a subset of the client updates. For achieving the same target test accuracy, CodedFedL achieves significant gains in wall-clock training time of up to 5.8× over naive uncoded, and up to 15× over greedy uncoded. Furthermore, for an identical number of training iterations, CodedFedL achieves almost the same test accuracy as naive uncoded, while it outperforms greedy uncoded by an absolute accuracy margin of up to ~13%, demonstrating the superiority of CodedFedL with non-IID data.
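The central idea of the coded gradient can be illustrated for linear regression with the squared loss: if client j shares the parity dataset (G_j X^(j), G_j Y^(j)) for a random G_j with i.i.d. N(0,1) entries, then, since E[G_j^T G_j] = u·I for u parity rows, the gradient over the parity data matches the uncoded gradient in expectation. A minimal numpy sketch (the sizes and the averaging over repeated draws are illustrative, not the scheme's actual load allocation):

```python
import numpy as np

rng = np.random.default_rng(0)
l_j, d, c, u = 50, 8, 1, 20   # local size, feature dim, label dim, parity rows

X = rng.standard_normal((l_j, d))
Y = rng.standard_normal((l_j, c))
theta = rng.standard_normal((d, c))

grad_uncoded = X.T @ (X @ theta - Y)   # gradient of (1/2)||X theta - Y||^2

# Coded gradient from the parity data, averaged over many draws of G;
# division by u accounts for E[G^T G] = u * I.
trials = 2000
acc = np.zeros_like(grad_uncoded)
for _ in range(trials):
    G = rng.standard_normal((u, l_j))
    Xp, Yp = G @ X, G @ Y              # parity dataset shared with the server
    acc += Xp.T @ (Xp @ theta - Yp) / u
grad_coded = acc / trials

rel_err = np.linalg.norm(grad_coded - grad_uncoded) / np.linalg.norm(grad_uncoded)
print(f"relative error after averaging: {rel_err:.3f}")  # shrinks with more trials
```

In the actual scheme a single parity dataset is shared once, so the coded gradient is an unbiased but noisy surrogate; the averaging here only makes the unbiasedness visible numerically.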
Related Works: A common system-level approach for straggler mitigation in distributed computing has been the introduction of some form of task replication [11, 211]. Recently, coded computing strategies have been developed for injecting computation redundancy in unorthodox encoded forms to efficiently deal with communication bottlenecks and system disturbances like stragglers, outages and node failures in distributed systems [123, 110, 169, 200, 89, 115, 237, 219]. In particular, [110] proposed to use erasure coding for speeding up distributed matrix multiplication and linear regression tasks. Coding for heterogeneous distributed matrix multiplication was proposed in [169], which developed an analytical method to calculate near-optimal coding redundancy. However, the entire data needs to be centrally encoded by the central server before portions are assigned to the compute devices. Reference [200] proposed coding over gradients for synchronous gradient descent, while [90] proposed to encode over the data to avoid the impact of stragglers in linear regression tasks. Many other works on coded computing for straggler mitigation in distributed learning have been proposed in the recent past [226, 45, 231, 164, 25]. In all of these works, the data placement and coding strategy is orchestrated by a central server. As a result, these works are not applicable in the federated learning setting, where the data is privately owned by the clients and cannot be shared with a central server. Our proposed coded computing framework, CodedFedL, provides a novel solution for leveraging coding redundancy for straggler-resilient federated learning.

Prior works that have considered one or more aspects of compute, communication and statistical heterogeneity across clients in federated learning include [127, 51, 15, 229]. In [127], the FedProx algorithm was proposed to address non-IID data across clients. However, [127] did not consider the variability of compute and communication capabilities across clients.
Reference [51] proposed the FEDL algorithm for allocating radio resources to clients to reduce convergence time. In contrast, we consider the MEC setting with personal devices, where the compute and communication resources of the clients cannot be tuned. In [15], important clients are selected based on compute and communication delays, as well as the importance of the data, in each round of training. In contrast, we optimize load allocation and coding redundancy only once at the start of training. Furthermore, these works do not leverage the computing capability of the MEC server. In [229], the authors propose a cooperative mechanism in which a fraction of clients share potentially all of their raw data with the server, which carries out gradient computations and includes them in model updates to mitigate statistical heterogeneity. However, sharing potentially all of the raw data from even a fraction of clients may not be feasible in the privacy-sensitive federated learning paradigm. Additionally, the success of [229] depends on whether the clients that agree to share their raw data with the server adequately represent all the classes. This may not be practical, as clients owning a certain type of data (such as users suffering from certain diseases) may not agree to participate in sharing of raw data. In CodedFedL, the server obtains a global parity dataset at the start of training via distributed encoding across client data, as each client privately encodes over its local dataset. In each training round, the server computes a coded gradient over the parity dataset, which allows it to mitigate the impact of straggling nodes during training by stochastically approximating the gradient over the entire dataset across the clients. We organize the rest of our paper as follows. In Table 5.1, we list the main notations for convenience. Section 5.2 presents a technical background on federated learning, and our proposed compute and communication models for MEC.
Section 5.3 describes our proposed CodedFedL scheme. In Section 5.4, we analyze the load allocation policy in CodedFedL. We provide the results of our numerical experiments in Section 5.5, and provide our concluding remarks in Section 5.6. All technical proofs are provided in the Appendix.

Table 5.1: Main notations
$n$ — number of client nodes
$d$ — dimension of raw feature space
$q$ — dimension of transformed feature space
$[n]$, $n\in\mathbb{N}$ — set $\{1,\dots,n\}$
$\ell_j$, $j\in[n]$ — number of data points in the $j$-th client's dataset $\mathcal{D}_j$
$m$ — number of data points across all clients
$\theta_r$, $r\in\{1,2,\dots\}$ — global model after the $r$-th training iteration

5.2 Problem Setup and MEC Model

In this section, we first describe the federated learning setting, and consider linear regression as well as non-linear regression via kernel embedding. We then present our compute and communication models for MEC.

5.2.1 Federated Learning

There are $n$ client nodes, each connected to the federated learning server. Client $j\in[n]$ has a local dataset $\mathcal{D}_j=(X^{(j)},Y^{(j)})$, where $X^{(j)}\in\mathbb{R}^{\ell_j\times d}$ and $Y^{(j)}\in\mathbb{R}^{\ell_j\times c}$ denote the feature set and the label set respectively, as follows:

$$X^{(j)}=[x_1^{(j)T},\dots,x_{\ell_j}^{(j)T}]^T, \quad Y^{(j)}=[y_1^{(j)T},\dots,y_{\ell_j}^{(j)T}]^T. \tag{5.1}$$

Here, $\ell_j=|\mathcal{D}_j|$ denotes the number of feature-label tuples in $\mathcal{D}_j$, while each data feature $x_k^{(j)}\in\mathbb{R}^{1\times d}$, and its corresponding label $y_k^{(j)}\in\mathbb{R}^{1\times c}$ for $k\in[\ell_j]$. Client $j\in[n]$ does not share its dataset $\mathcal{D}_j$ with the central server due to privacy concerns. The goal in federated learning is to train a model by leveraging the data located at the clients. Specifically, the following general problem is considered:

$$\theta^*=\arg\min_{\theta\in\mathcal{W}} \frac{1}{m}\sum_{j=1}^{n}\sum_{k=1}^{\ell_j} l\big(\theta;(x_k^{(j)},y_k^{(j)})\big), \tag{5.2}$$

where $l(\theta;(x_k^{(j)},y_k^{(j)}))\in\mathbb{R}$ is the predictive loss associated with $(x_k^{(j)},y_k^{(j)})$ for model parameter $\theta\in\mathcal{W}$, $m=\sum_{j=1}^n \ell_j$ denotes the total size of the dataset distributed across the clients, and $\mathcal{W}$ denotes the model parameter space.
The solution to (5.2) is obtained via an iterative training procedure involving gradient descent. Specifically, in iteration $(r+1)$, the server shares the current model $\theta^{(r)}$ with the clients. Client $j\in[n]$ then computes the local gradient $g^{(j)}$ as follows:

$$g^{(j)}=\frac{1}{\ell_j}\sum_{k=1}^{\ell_j}\nabla_\theta\, l\big(\theta^{(r)};(x_k^{(j)},y_k^{(j)})\big). \tag{5.3}$$

The server collects gradients from the clients and aggregates them to recover the gradient of the empirical loss corresponding to the entire distributed dataset across clients as follows:

$$g=\frac{1}{m}\sum_{j=1}^{n}\ell_j\, g^{(j)}. \tag{5.4}$$

The server then executes the model update step as follows:

$$\theta^{(r+1)}=\theta^{(r)}-\mu^{(r+1)} g, \tag{5.5}$$

where $\mu^{(r+1)}$ denotes the learning rate. The iterative procedure is carried out until sufficient convergence has been achieved. For linear regression with squared error loss, the optimization problem in (5.2) is cast as follows:

$$\theta^*=\arg\min_{\theta\in\mathbb{R}^{d\times c}} \frac{1}{2m}\sum_{j=1}^{n}\sum_{k=1}^{\ell_j}\|x_k^{(j)}\theta-y_k^{(j)}\|_2^2 = \arg\min_{\theta\in\mathbb{R}^{d\times c}} \frac{1}{m}\sum_{j=1}^{n}\frac{1}{2}\|X^{(j)}\theta-Y^{(j)}\|_F^2, \tag{5.6}$$

and the local gradient computation at client $j\in[n]$ is as follows:

$$g^{(j)}=\frac{1}{2\ell_j}\nabla_\theta\|X^{(j)}\theta^{(r)}-Y^{(j)}\|_F^2=\frac{1}{\ell_j}X^{(j)T}(X^{(j)}\theta^{(r)}-Y^{(j)}). \tag{5.7}$$

The gradient aggregation and model update steps in (5.4) and (5.5) are modified accordingly. Linear regression has traditionally been used widely in a variety of applications, including weather data analysis, market research studies, and observational astronomy [84, 80, 77]. As evident from (5.7), the gradient computations involve matrix multiplications, which are computationally favorable, particularly for low-powered client devices with limited compute capabilities. However, in many machine learning problems, a linear model does not perform well. To combine the advantages of non-linear models and the low-complexity gradient computations of linear regression, random Fourier feature mapping (RFFM) [161] based kernel regression has been widely used in practice.
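To make the update loop concrete, the following minimal sketch simulates the linear-regression case: each client computes the local gradient in (5.7), and the server aggregates via (5.4) and updates via (5.5). All dimensions, dataset sizes, and the learning rate are illustrative, not taken from the experiments in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 3, 5, 1                      # clients, feature dim, label dim (illustrative)
ell = [20, 30, 50]                     # local dataset sizes ell_j
m = sum(ell)

# Synthetic local datasets D_j = (X_j, Y_j), kept at the clients
data = [(rng.normal(size=(l, d)), rng.normal(size=(l, c))) for l in ell]
theta = np.zeros((d, c))
mu = 0.01                              # learning rate mu^(r+1)

for _ in range(100):
    # Local gradient at client j, eq. (5.7): g_j = X_j^T (X_j theta - Y_j) / ell_j
    grads = [X.T @ (X @ theta - Y) / X.shape[0] for X, Y in data]
    # Server aggregation, eq. (5.4): g = (1/m) sum_j ell_j * g_j
    g = sum(l * gj for l, gj in zip(ell, grads)) / m
    # Model update, eq. (5.5)
    theta = theta - mu * g
```

Note that the weighted aggregation in (5.4) makes the server-side gradient identical to the full-batch gradient over the stacked dataset, which is the property CodedFedL later approximates under stragglers.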
The RFFM proposed in [161] involves explicitly constructing finite-dimensional random features from the raw features, such that the inner product of any pair of transformed features approximates the kernel evaluation corresponding to the two raw features. Specifically, features $v_1\in\mathbb{R}^{1\times d}$ and $v_2\in\mathbb{R}^{1\times d}$ are mapped to $\widehat{v}_1$ and $\widehat{v}_2$ using a feature generating function $\phi:\mathbb{R}^{1\times d}\to\mathbb{R}^{1\times q}$. The RFF mapping approximates a positive definite kernel function $K:\mathbb{R}^{1\times d}\times\mathbb{R}^{1\times d}\to\mathbb{R}$ as represented below:

$$K(v_1,v_2)\approx \widehat{v}_1\widehat{v}_2^T=\phi(v_1)\phi(v_2)^T. \tag{5.8}$$

In Section 5.3.1, we propose how to carry out distributed kernel embedding, so that client $j\in[n]$ transforms its dataset $\mathcal{D}_j=(X^{(j)},Y^{(j)})$ to $\widehat{\mathcal{D}}_j=(\widehat{X}^{(j)},Y^{(j)})$, where $\widehat{X}^{(j)}=[\phi(x_1^{(j)})^T,\dots,\phi(x_{\ell_j}^{(j)})^T]^T$ denotes the transformed feature set. Training is then performed via linear regression over the transformed data located at the clients, i.e., the optimization problem in (5.2) is cast as follows:

$$\theta^*=\arg\min_{\theta\in\mathbb{R}^{q\times c}} \frac{1}{2m}\sum_{j=1}^{n}\|\widehat{X}^{(j)}\theta-Y^{(j)}\|_F^2, \tag{5.9}$$

while the gradient computation in iteration $(r+1)$ at client $j\in[n]$ is as follows:

$$g^{(j)}=\frac{1}{\ell_j}\widehat{X}^{(j)T}(\widehat{X}^{(j)}\theta^{(r)}-Y^{(j)}). \tag{5.10}$$

Gradient aggregation and model update steps are then carried out at the server according to (5.4) and (5.5), respectively.

Remark 26. Training and inference with random features have been shown to work considerably well in practice [161, 162, 184, 88, 174]. Hence, without loss of generality, we focus on non-linear federated learning via kernel regression with RFF mapping in the remaining part of our paper. All results easily generalize to plain linear regression.

To capture the heterogeneity and stochastic nature of compute and communication capabilities across clients in MEC networks, we consider probabilistic models as described next.
5.2.2 Compute and Communication Models

To statistically represent compute heterogeneity, we assume a shifted exponential model for local gradient computation. Specifically, the computation time for the $j$-th client is given by a shifted exponential random variable $T_{\mathrm{cmp}}^{(j)}$ as follows:

$$T_{\mathrm{cmp}}^{(j)}=T_{\mathrm{cmp}}^{(j,1)}+T_{\mathrm{cmp}}^{(j,2)}. \tag{5.11}$$

Here, $T_{\mathrm{cmp}}^{(j,1)}=\frac{\widetilde{\ell}_j}{\mu_j}$ denotes the time in seconds to process the partial gradient over $\widetilde{\ell}_j$ data points, where the data processing rate is $\mu_j$ data points per second, and $\widetilde{\ell}_j$ is bounded by the size of the local dataset $\widehat{\mathcal{D}}_j$, i.e., $\widetilde{\ell}_j\le\ell_j$. $T_{\mathrm{cmp}}^{(j,2)}$ is the random variable denoting the stochastic component of the compute time, coming from random memory access during read/write cycles associated with multiply-accumulate (MAC) operations, where we assume an exponential distribution for $T_{\mathrm{cmp}}^{(j,2)}$, i.e., $p_{T_{\mathrm{cmp}}^{(j,2)}}(t)=\gamma_j e^{-\gamma_j t}$ for $t\ge 0$, where $\gamma_j=\frac{\alpha_j\mu_j}{\widetilde{\ell}_j}$. The parameter $\alpha_j>0$ controls the ratio of the time spent in computing to the average time spent in memory access. In addition to the local computation time $T_{\mathrm{cmp}}^{(j)}$, the overall execution time for the $j$-th client during the $(r+1)$-th epoch includes $T_{\mathrm{com\text{-}d}}^{(j)}$, the time to download $\theta^{(r)}$ from the server, and $T_{\mathrm{com\text{-}u}}^{(j)}$, the time to upload the partial gradient $g^{(j)}$ to the server. We assume that communications between the server and the clients take place over wireless links that fluctuate in quality. It is a typical practice to model the wireless link between the server and the $j$-th client by a tuple $(\eta_j,p_j)$, where $\eta_j$ and $p_j$ denote the achievable data rate (in bits per second per Hz) and the link erasure probability, respectively [3, 35, 17]. Downlink and uplink communication delays are IID random variables given as follows:¹

$$T_{\mathrm{com\text{-}d}}^{(j)}=N_j^d\,\tau_j, \quad T_{\mathrm{com\text{-}u}}^{(j)}=N_j^u\,\tau_j. \tag{5.12}$$

¹For the purpose of this article, we assume the downlink and the uplink delays to be reciprocal. Generalization of our framework to an asymmetric delay model is easy to address.
Here, $\tau_j=\frac{b}{\eta_j W}$ is the deterministic time to upload (or download) a packet of size $b$ bits containing the partial gradient $g^{(j)}$ (or the model $\theta^{(r)}$), and $W$ is the bandwidth in Hz assigned to the $j$-th worker device. $N_j^d$ and $N_j^u$, which denote the number of transmissions required for successful downlink and uplink communications respectively, are distributed IID according to a Geometric$(p=1-p_j)$ distribution as follows:

$$\mathbb{P}(N_j^d=x)=\mathbb{P}(N_j^u=x)=p_j^{x-1}(1-p_j), \quad x=1,2,3,\dots \tag{5.13}$$

Therefore, using (5.11) and (5.12), the total time that the $j$-th device takes to successfully receive the latest model, complete its local gradient computation, and communicate the gradient to the central server, is as follows:

$$T_j=T_{\mathrm{com\text{-}d}}^{(j)}+T_{\mathrm{cmp}}^{(j)}+T_{\mathrm{com\text{-}u}}^{(j)}, \tag{5.14}$$

while the average delay is given as follows:

$$\mathbb{E}(T_j)=\frac{\widetilde{\ell}_j}{\mu_j}\Big(1+\frac{1}{\alpha_j}\Big)+\frac{2\tau_j}{1-p_j}. \tag{5.15}$$

The federated learning procedure can be severely impacted by slow nodes, straggling communication links, and non-IID data across clients. In the following section, we describe our proposed coded computing framework, CodedFedL, which injects structured redundancy into the federated learning procedure over MEC networks to mitigate these challenges.

5.3 Proposed CodedFedL Scheme

We now describe the different modules of our proposed CodedFedL scheme: distributed feature mapping for non-linear regression, distributed encoding for generating composite parity data, optimal load allocation and code design for minimizing the deadline time, and modified training at the MEC server. An overview of CodedFedL is provided in Fig. 5.2.

Figure 5.2: Overview of our proposed CodedFedL framework, illustrating the main processing steps at the MEC server and at each client.
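The delay model in (5.11)–(5.14) is straightforward to sample. The sketch below draws the shifted-exponential compute time and the geometric retransmission counts for one client, and compares the empirical mean delay against the closed form in (5.15). All parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_delay(ell_tilde, mu, alpha, tau, p, rng):
    """Sample T_j = T_com-d + T_cmp + T_com-u per eqs. (5.11)-(5.14)."""
    t_det = ell_tilde / mu                                     # deterministic part of (5.11)
    t_rand = rng.exponential(scale=ell_tilde / (alpha * mu))   # memory-access part, rate gamma_j
    n_down = rng.geometric(1 - p)                              # transmissions until downlink success
    n_up = rng.geometric(1 - p)                                # transmissions until uplink success
    return (n_down + n_up) * tau + t_det + t_rand

# Empirical mean vs. the closed form E(T_j) in (5.15)
ell_tilde, mu, alpha, tau, p = 100, 50.0, 2.0, 0.05, 0.1
samples = [sample_delay(ell_tilde, mu, alpha, tau, p, rng) for _ in range(200000)]
analytic = (ell_tilde / mu) * (1 + 1 / alpha) + 2 * tau / (1 - p)
```

With these parameters, the empirical mean concentrates around the analytic value of roughly 3.11 seconds, matching (5.15).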
5.3.1 Distributed Kernel Embedding

To combine the advantages of the superior performance of non-linear models and the low computational complexity of gradient computations in linear regression, we propose to leverage kernel embedding based on random Fourier feature mapping (RFFM) in federated learning. Let $\mathcal{D}=\cup_{j=1}^n\mathcal{D}_j=(X,Y)$ denote the entire dataset located at the clients, where $X\in\mathbb{R}^{m\times d}$ and $Y\in\mathbb{R}^{m\times c}$ respectively denote the combined feature and label sets at the clients as follows:

$$X=[X^{(1)T},\dots,X^{(n)T}]^T, \quad Y=[Y^{(1)T},\dots,Y^{(n)T}]^T. \tag{5.16}$$

In this paper, we consider the commonly used kernel known as the radial basis function (RBF) kernel [210], in which for features $v_1\in X$ and $v_2\in X$, the following relationship holds:

$$K(v_1,v_2)=e^{-\frac{\|v_1-v_2\|^2}{2\sigma^2}}, \tag{5.17}$$

where $\sigma$ is a kernel hyperparameter. For $i\in[m]$, the RFFM corresponding to the RBF kernel can be carried out for feature vector $v_i$ as follows (see Section V, example (a) in [162]):

$$\widehat{v}_i=\phi(v_i)=\sqrt{\tfrac{2}{q}}\,\big[\cos(v_i\omega_1+\delta_1),\dots,\cos(v_i\omega_q+\delta_q)\big], \tag{5.18}$$

where for $s\in[q]$, the frequency vectors $\omega_s\in\mathbb{R}^{d\times 1}$ are drawn independently from $\mathcal{N}(0,\frac{1}{2\sigma^2}I_d)$, while the shift elements $\delta_s$ are drawn independently from the Uniform$(0,2\pi]$ distribution. Before commencing the training procedure, the $j$-th client carries out RFFM on its raw feature set $X^{(j)}$ to obtain the transformed feature set $\widehat{X}^{(j)}=\phi(X^{(j)})$, and the training proceeds with the transformed dataset $\widehat{\mathcal{D}}=(\widehat{X},Y)$, where $\widehat{X}\in\mathbb{R}^{m\times q}$ is the matrix denoting all the transformed features across all clients.

Remark 27. For distributed transformation of features at the clients, the server sends the same pseudo-random seed to every client, which then obtains the samples required for RFFM in (5.18). This mitigates the need for the server to communicate the frequency vectors $\omega_1,\dots,\omega_q$ and the shift elements $\delta_1,\dots,\delta_q$ to each client, thus reducing the communication overhead of distributed kernel embedding significantly.
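A minimal sketch of the mapping in (5.18), with the shared-seed trick of Remark 27, follows. One modeling note: in this sketch the frequency vectors are drawn from the spectral density $\mathcal{N}(0,\sigma^{-2}I_d)$, the choice under which the inner products of the mapped features match the RBF kernel in (5.17) exactly; the function name and all parameter values are illustrative.

```python
import numpy as np

def rffm(X, q, sigma, seed):
    """Random Fourier feature map in the spirit of (5.18) for the RBF kernel (5.17).
    A shared seed lets every client draw identical (omega, delta), as in Remark 27."""
    d = X.shape[1]
    rng = np.random.default_rng(seed)
    omega = rng.normal(scale=1.0 / sigma, size=(d, q))   # spectral density of the RBF kernel
    delta = rng.uniform(0.0, 2 * np.pi, size=q)
    return np.sqrt(2.0 / q) * np.cos(X @ omega + delta)

rng = np.random.default_rng(2)
V = rng.normal(size=(10, 5))
Z = rffm(V, q=4000, sigma=2.0, seed=42)      # same seed => same map at every client

# Check eq. (5.8): inner products of mapped features approximate the RBF kernel
sq = ((V[:, None, :] - V[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * 2.0 ** 2))
err = np.abs(Z @ Z.T - K).max()
```

Because the map is fully determined by the seed, a client only needs the scalar seed to reproduce the exact same `omega` and `delta` as every other client, which is the communication saving noted in Remark 27.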
Along with the computational benefits of linear regression over the transformed dataset $\widehat{\mathcal{D}}=(\widehat{X},Y)$, applying RFFM enables our distributed encoding strategy for creating global parity data for non-linear federated learning, as described next.

5.3.2 Distributed Encoding

To inject coding redundancy into federated learning, the $j$-th client carries out random linear encoding of its transformed training dataset $\widehat{\mathcal{D}}_j=(\widehat{X}^{(j)},Y^{(j)})$. Specifically, a random generator matrix $G_j\in\mathbb{R}^{u\times\ell_j}$ is used for encoding, where the row dimension $u$ denotes the coding redundancy, which is the amount of parity data to be generated at each device. Typically, $u\ll m$. Our strategy to find the amount $u$ of coding redundancy and the local computation loads of the clients is presented in Section 5.3.3, where we describe our load allocation policy for optimizing the deadline time at the server. Client $j\in[n]$ privately draws the elements of $G_j$ independently from a probability distribution with mean 0 and variance 1. For example, it can use a standard normal $\mathcal{N}(0,1)$ distribution, or a Bernoulli$(1/2)$ distribution with sample space $\{-1,+1\}$. Client $j$ keeps the encoding matrix $G_j$ private and does not share it with the server. $G_j$ is applied on the weighted local dataset to obtain the local parity dataset $\check{\mathcal{D}}_j=(\check{X}^{(j)},\check{Y}^{(j)})$ as follows:

$$\check{X}^{(j)}=G_j W_j\widehat{X}^{(j)}, \quad \check{Y}^{(j)}=G_j W_j Y^{(j)}. \tag{5.19}$$

For $w_j=[w_{j,1},\dots,w_{j,\ell_j}]$, the weight matrix $W_j=\mathrm{diag}(w_j)$ is an $\ell_j\times\ell_j$ diagonal matrix that weighs the training data point $(\widehat{x}_k^{(j)},y_k^{(j)})\in\widehat{\mathcal{D}}_j$ with $w_{j,k}$, based on the stochastic conditions of the compute and communication resources, where $k\in[\ell_j]$. We defer the details of deriving $W_j$ to Section 5.3.4.
The central server receives the local parity data from all client devices and combines them to obtain the composite global parity dataset $\check{\mathcal{D}}=(\check{X},\check{Y})$, where $\check{X}\in\mathbb{R}^{u\times q}$ and $\check{Y}\in\mathbb{R}^{u\times c}$ are the composite global parity feature set and global parity label set as follows:

$$\check{X}=\sum_{j=1}^{n}\check{X}^{(j)}, \quad \check{Y}=\sum_{j=1}^{n}\check{Y}^{(j)}. \tag{5.20}$$

Using (5.19) and (5.20), we have the following:

$$\check{X}=GW\widehat{X}, \quad \check{Y}=GWY, \tag{5.21}$$

where $G=[G_1,\dots,G_n]\in\mathbb{R}^{u\times m}$ is the global encoding matrix and $W\in\mathbb{R}^{m\times m}$ is the global weight matrix given by $W=\mathrm{diag}([w_1,\dots,w_n])$. Equation (5.21) thus represents encoding over the entire decentralized dataset $\widehat{\mathcal{D}}=(\widehat{X},Y)$, performed implicitly in a distributed manner across clients.

Remark 28. Although client $j\in[n]$ shares its locally coded dataset $\check{\mathcal{D}}_j=(\check{X}^{(j)},\check{Y}^{(j)})$ with the central server, the local dataset $\widehat{\mathcal{D}}_j$ as well as the encoding matrix $G_j$ are private to the client and not shared with the server. In Appendix Q, we characterize the privacy leakage in sharing the local parity dataset with the server.

Next, we describe our load allocation policy to minimize the epoch deadline time for receiving the gradient updates from the non-straggling nodes.

5.3.3 Coding Redundancy and Load Assignment

CodedFedL involves load optimization based on the statistical conditions of MEC for obtaining the minimum deadline time $t_{\mathrm{opt}}$, and correspondingly an optimal number of data points $\ell_j^{\mathrm{opt}}\le\ell_j$ to be processed locally at client $j\in[n]$ in each round, as well as $u^{\mathrm{opt}}\le u_{\max}$, the number of coded data points to be processed at the server in each round. Here, we assume that due to memory and storage constraints, the server can process a maximum of $u_{\max}$ coded data points in each round. Furthermore, for generality, we assume that the server offloads the computation to a high-performance computing unit.
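The distributed encoding in (5.19)–(5.21) can be sketched as follows: each client forms its local parity data with a private $G_j$ and a diagonal $W_j$, the server sums the contributions per (5.20), and the result equals the implicit global encoding $GW\widehat{X}$ of (5.21). The weights used here are placeholders (the actual $W_j$ is derived in Section 5.3.4), and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, q, c, u = 3, 8, 2, 6                 # clients, feature dim, label dim, redundancy (illustrative)
ell = [5, 7, 9]
m = sum(ell)

Xs = [rng.normal(size=(l, q)) for l in ell]              # transformed local features X_hat^(j)
Ys = [rng.normal(size=(l, c)) for l in ell]
Ws = [np.diag(rng.uniform(0, 1, size=l)) for l in ell]   # placeholder weight matrices W_j
Gs = [rng.normal(size=(u, l)) for l in ell]              # private encoding matrices G_j

# Local parity datasets, eq. (5.19); only these leave the clients
Xp = [G @ W @ X for G, W, X in zip(Gs, Ws, Xs)]
Yp = [G @ W @ Y for G, W, Y in zip(Gs, Ws, Ys)]

# Server-side aggregation, eq. (5.20)
X_parity = sum(Xp)
Y_parity = sum(Yp)

# Equivalent global view, eq. (5.21): X_parity = G W X_hat with G = [G_1 ... G_n]
G = np.hstack(Gs)
W = np.diag(np.concatenate([np.diag(Wj) for Wj in Ws]))
X_hat = np.vstack(Xs)
```

The key point is that the server only ever sees the $u\times q$ sums `X_parity` and `Y_parity`; neither the raw data nor the individual $G_j$ matrices are revealed, consistent with Remark 28.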
During training, the computing unit receives the current model from the server, carries out gradient computations over its assigned coded data, and uploads the gradient to the server, where the communications to and from the server take place over a wireless channel. Therefore, to parameterize the compute and communication capabilities of the MEC server, we use compute and communication models similar to those described in Section 5.2.2. We let $T_C$ denote the random variable for the overall time spent by the computing unit in receiving the current model from the server, computing the gradient over its assigned coded data, and communicating the gradient to the server. In practice, the MEC server has dedicated, high-performance, and reliable cloud-like compute and communication resources [198, 158]. Thus, in comparison to client devices in practice, the MEC server has higher values of the data processing rate $\mu$ and the parameter $\alpha$ in the computation model in (5.11), a higher value of the data transmission rate $\eta$, and a lower value of the channel failure probability $p$ in the communication model in (5.13). Let $\mathbb{1}_{\{T_j\le t\}}$ be the indicator random variable denoting the event that the server receives the partial gradient over the $\widetilde{\ell}_j\le\ell_j$ local data points from the $j$-th client by the deadline time $t$, where $T_j$ denotes the total delay for client $j\in[n]$. To represent this contribution from the $j$-th client by deadline time $t$, we use the random variable $R_j(t;\widetilde{\ell}_j)=\widetilde{\ell}_j\mathbb{1}_{\{T_j\le t\}}$. Clearly, $R_j(t;\widetilde{\ell}_j)\in\{0,\widetilde{\ell}_j\}$. We let $R_U(t;\widetilde{\ell})=\sum_{j=1}^n R_j(t;\widetilde{\ell}_j)$ denote the uncoded aggregate return from the clients by deadline time $t$. Similarly, for representing the completion of the gradient computation over the parity dataset $\check{\mathcal{D}}=(\check{X},\check{Y})$ within deadline time $t$, we use the random variable $R_C(t;u)=u\mathbb{1}_{\{T_C\le t\}}$, with $\mathbb{1}_{\{T_C\le t\}}$ being the indicator random variable denoting the event that the coded gradient is available for aggregation within deadline time $t$.
Clearly, $R_C(t;u)\in\{0,u\}$. Then, the following denotes the total aggregate return for $t\ge 0$:

$$R(t;(u,\widetilde{\ell}))=R_C(t;u)+R_U(t;\widetilde{\ell}). \tag{5.22}$$

Our goal is to optimize over $t$, $\widetilde{\ell}=(\widetilde{\ell}_1,\dots,\widetilde{\ell}_n)$, and $u$ such that the optimal expected total aggregate return is $m$ for the minimum epoch deadline time, with $m$ being the total number of data points at the clients. More formally, we consider the following optimization problem:

$$\begin{aligned} \text{minimize}\quad & t\\ \text{subject to}\quad & \mathbb{E}\big(R(t;(u,\widetilde{\ell}))\big)=m,\\ & 0\le\widetilde{\ell}\le(\ell_1,\dots,\ell_n),\quad 0\le u\le u_{\max},\quad t\ge 0. \end{aligned} \tag{5.23}$$

Let $(t_{\mathrm{opt}},u_{\mathrm{opt}},\ell_{\mathrm{opt}})$ denote an optimal solution for (5.23). In the following, we propose an efficient and tractable two-step approach for solving (5.23).

Step 1: The first step is to optimize $\widetilde{\ell}$ and $u$ in order to maximize the expected return $\mathbb{E}(R(t;(u,\widetilde{\ell})))$ for a fixed $t$. More precisely, for a given deadline time $t$, the goal is to solve the following for the total expected aggregate return:

$$\begin{aligned} \text{maximize}\quad & \mathbb{E}\big(R(t;(u,\widetilde{\ell}))\big)\\ \text{subject to}\quad & 0\le\widetilde{\ell}\le(\ell_1,\dots,\ell_n),\quad 0\le u\le u_{\max}. \end{aligned} \tag{5.24}$$

As $\mathbb{E}(R(t;(u,\widetilde{\ell})))=\sum_{j=1}^n\mathbb{E}(R_j(t;\widetilde{\ell}_j))+\mathbb{E}(R_C(t;u))$, the optimization in (5.24) can be decomposed into $(n+1)$ independent optimization problems, one for each client $j\in[n]$ as follows:

$$\begin{aligned} \text{maximize}\quad & \mathbb{E}(R_j(t;\widetilde{\ell}_j))\\ \text{subject to}\quad & 0\le\widetilde{\ell}_j\le\ell_j, \end{aligned} \tag{5.25}$$

and one for the MEC server as follows:

$$\begin{aligned} \text{maximize}\quad & \mathbb{E}(R_C(t;u))\\ \text{subject to}\quad & 0\le u\le u_{\max}. \end{aligned} \tag{5.26}$$

Remark 29. In Section 5.4, we derive the mathematical expression for the expected return $\mathbb{E}(R_j(t;\widetilde{\ell}_j))$ for $j\in[n]$, and prove that it is a piece-wise concave function in $\widetilde{\ell}_j>0$. We also characterize the intervals within which the function is concave, and show that the boundaries are functions of the total number of transmissions needed for the successful downlink (model download) and uplink (gradient upload) communications by the deadline time $t$. Therefore, we can solve (5.25) efficiently using any convex optimization toolbox. The analysis follows similarly for (5.26). Therefore, (5.24) can be solved efficiently.
Let $\ell_j^*(t)$, for $j\in[n]$, and $u^*(t)$ denote optimal solutions of (5.25) and (5.26) respectively, which in turn optimize (5.24). Next, we describe the second step of our approach.

Step 2: Optimization over $t$ is considered in order to find the minimum deadline time $t=t^*$ such that the maximized expected total aggregate return $\mathbb{E}(R(t;(u^*(t),\ell^*(t))))$ is equal to $m$. Specifically, the following optimization problem is considered:

$$\begin{aligned} \text{minimize}\quad & t\\ \text{subject to}\quad & \mathbb{E}\big(R(t;(u^*(t),\ell^*(t)))\big)=m,\quad t\ge 0. \end{aligned} \tag{5.27}$$

Remark 30. We show that $\mathbb{E}(R(t;(u^*(t),\ell^*(t))))$ is a monotonically increasing function in $t$ in Section 5.4. Therefore, (5.27) can be efficiently solved to obtain $t^*$ using a bisection search over $t$ with a sufficiently large starting upper bound. Consequently, an optimal load allocation solution $(u^*(t^*),\ell^*(t^*))$ is obtained as a solution of (5.24) for $t=t^*$.

Our proposed two-step load allocation strategy achieves an optimal solution of (5.23), as summarized in the following claim.

Claim 2. Let $(t^*,u^*(t^*),\ell^*(t^*))$ be an optimal solution obtained by solving (5.27). Then, $t^*=t_{\mathrm{opt}}$ and $(t^*,u^*(t^*),\ell^*(t^*))$ is an optimal solution of (5.23).

The proof of the above claim is provided in Appendix L. In the next subsection, we describe the procedure used by client $j\in[n]$ for obtaining the weight matrix $W_j$, which is used for generating the local parity dataset in (5.19).

5.3.4 Weight Matrix Construction

After the evaluation of the optimal load allocation $\ell^*(t^*)$ for the clients described in the previous subsection, the $j$-th client samples $\ell_j^*(t^*)$ data points uniformly at random, which it will process for local gradient computation in each training round. It is not revealed to the server which data points are sampled. The probability that the partial gradient computed at the $j$-th client is not received at the MEC server by deadline time $t^*$ is $\mathrm{pnr}_{j,1}=1-\mathbb{P}(T_j\le t^*)$.
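Step 2 above reduces to a one-dimensional root search. The sketch below implements the bisection of Remark 30 over a generic monotone expected-return function; the toy curve standing in for $\mathbb{E}(R(t;(u^*(t),\ell^*(t))))$ is illustrative only, not the paper's model.

```python
import numpy as np

def bisect_deadline(expected_total_return, m, t_hi=1e6, tol=1e-6):
    """Step 2, eq. (5.27): smallest t with E[R(t; u*(t), l*(t))] >= m,
    exploiting the monotonicity of the maximized expected return in t (Remark 30)."""
    t_lo = 0.0
    while expected_total_return(t_hi) < m:
        t_hi *= 2                        # grow the bracket if the starting bound was too small
    while t_hi - t_lo > tol:
        t_mid = 0.5 * (t_lo + t_hi)
        if expected_total_return(t_mid) >= m:
            t_hi = t_mid                 # feasible: shrink from above
        else:
            t_lo = t_mid                 # infeasible: shrink from below
    return t_hi

# Toy monotone return curve standing in for E[R(t; u*(t), l*(t))] (illustrative only)
m = 100
f = lambda t: 120 * (1 - np.exp(-0.5 * t))   # increasing, saturates above m
t_star = bisect_deadline(f, m)
```

For this toy curve, the constraint $f(t^*)=m$ has the closed-form root $t^*=2\ln 6\approx 3.58$, which the bisection recovers to within the tolerance.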
Furthermore, $(\ell_j-\ell_j^*(t^*))$ data points are never evaluated at the client, which implies that the probability of no return for them is $\mathrm{pnr}_{j,2}=1$. The diagonal weight matrix $W_j\in\mathbb{R}^{\ell_j\times\ell_j}$, which is used for generating the local parity dataset in (5.19), captures the absence of the updates corresponding to the different data points during the training procedure. Specifically, for $k\in[\ell_j]$, if the data point $(\widehat{x}_k^{(j)},y_k^{(j)})$ is among the $\ell_j^*(t^*)$ data points that are to be processed at the client during gradient computation, the corresponding weight matrix coefficient is $w_{j,k}=\sqrt{\mathrm{pnr}_{j,1}}$; otherwise, $w_{j,k}=\sqrt{\mathrm{pnr}_{j,2}}$. As we illustrate next, this weighting ensures that the combination of the coded gradient and the uncoded gradient updates from the non-straggling clients stochastically approximates the full gradient $g$ in (5.4) over the entire dataset $\widehat{\mathcal{D}}=(\widehat{X},Y)$ distributed across the clients.

5.3.5 Coded Federated Aggregation

In epoch $(r+1)$, the MEC server sends the current model $\theta^{(r)}$ to the clients, as well as to its own computing unit, for gradient computations, and waits until the optimal deadline time $t^*$ before updating the model. The computing unit of the MEC server computes the coded gradient, which is the gradient over the composite parity data $\check{\mathcal{D}}=(\check{X},\check{Y})$, and the MEC server weighs it with a factor of $1/(1-\mathrm{pnr}_C)$, where $\mathrm{pnr}_C=1-\mathbb{P}(T_C\le t^*)$ denotes the probability of no return for the coded gradient. Effectively, the coded gradient used by the MEC server during gradient aggregation can be represented as follows:

$$g_C=\mathbb{1}_{\{T_C\le t^*\}}\,\frac{1}{(1-\mathrm{pnr}_C)}\,\frac{1}{u^*(t^*)}\,\check{X}^T(\check{X}\theta^{(r)}-\check{Y})=\frac{\mathbb{1}_{\{T_C\le t^*\}}}{(1-\mathrm{pnr}_C)}\,\widehat{X}^T W^T\frac{G^T G}{u^*(t^*)}\,W(\widehat{X}\theta^{(r)}-Y), \tag{5.28}$$

where $\mathbb{1}_{\{T_C\le t^*\}}$ is the indicator random variable that denotes whether the coded gradient is available for aggregation by the optimal deadline $t^*$ or not.
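The weight construction of Section 5.3.4 admits a quick sanity check: for every data point, the coded weight $w_{j,k}^2$ plus the probability that its uncoded gradient arrives by $t^*$ equals 1, which is exactly what makes the coded and expected uncoded contributions sum to the full-gradient weighting. The sketch below builds $W_j$ for one hypothetical client; all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
ell_j, ell_star = 10, 6                 # local size and optimized load l_j*(t*) (illustrative)
p_return = 0.7                          # P(T_j <= t*) for this client (illustrative)
pnr_1 = 1.0 - p_return                  # prob. of no return for processed points
pnr_2 = 1.0                             # never-processed points never return

# Client privately samples ell_star points to process; the rest get weight sqrt(pnr_2) = 1
processed = rng.choice(ell_j, size=ell_star, replace=False)
w = np.full(ell_j, np.sqrt(pnr_2))
w[processed] = np.sqrt(pnr_1)
W_j = np.diag(w)                        # the diagonal weight matrix of Section 5.3.4

# For every point: w^2 + P(its uncoded gradient arrives by t*) = 1,
# so coded weight and expected uncoded weight together recover the full gradient
arrival = np.where(np.isin(np.arange(ell_j), processed), p_return, 0.0)
total = w ** 2 + arrival
```

Unreliable clients (small `p_return`) thus get large weights in the parity data, shifting their contribution into the coded gradient at the server.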
As we describe shortly, weighing the coded gradient by a factor of $\frac{1}{(1-\mathrm{pnr}_C)}$ accounts for the averaging effect caused by the random variable $\mathbb{1}_{\{T_C\le t^*\}}$, and results in a stochastic approximation of the true gradient. Similarly, each client $j\in[n]$ computes its partial gradient, and the server computes a weighted combination of the uncoded gradients received from the clients by the deadline time $t^*$ as follows:

$$g_U=\sum_{j=1}^{n}\ell_j^*(t^*)\,g_U^{(j)},$$

where $g_U^{(j)}$ represents the effective gradient contribution from client $j$ by deadline time $t^*$ as follows:

$$g_U^{(j)}=\mathbb{1}_{\{T_j\le t^*\}}\,\frac{1}{\ell_j^*(t^*)}\,\widetilde{X}^{(j)T}(\widetilde{X}^{(j)}\theta^{(r)}-\widetilde{Y}^{(j)}). \tag{5.29}$$

Here, $\widetilde{\mathcal{D}}_j=(\widetilde{X}^{(j)},\widetilde{Y}^{(j)})$ is composed of the $\ell_j^*(t^*)$ data points that the $j$-th client samples for processing before training. As we show shortly, no further factors analogous to $\frac{1}{(1-\mathrm{pnr}_C)}$ in (5.28) need to be applied to the uncoded gradients, as they are already accounted for during the creation of the local parity datasets. Thus, the MEC server waits for the uncoded gradients from the clients and the coded gradient from its computing unit until the optimized deadline time $t^*$, and then aggregates $g_C$ and $g_U$ to obtain the coded federated gradient as follows:

$$g_M=\frac{1}{m}(g_C+g_U). \tag{5.30}$$

The coded federated gradient $g_M$ in (5.30) stochastically approximates the full gradient $g$ in (5.4) for a sufficiently large coding redundancy $u^*(t^*)$; specifically, $\mathbb{E}(g_M)\approx g$. To verify this, we first observe that the following holds for the coded gradient $g_C$ in (5.28):

$$\mathbb{E}(g_C)=\frac{\mathbb{E}(\mathbb{1}_{\{T_C\le t^*\}})}{(1-\mathrm{pnr}_C)}\,\widehat{X}^T W^T\frac{G^T G}{u^*(t^*)}\,W(\widehat{X}\theta^{(r)}-Y)\overset{(a)}{\approx}\widehat{X}^T W^T W(\widehat{X}\theta^{(r)}-Y)=\sum_{j=1}^{n}\sum_{k=1}^{\ell_j}w_{j,k}^2\,\widehat{x}_k^{(j)T}(\widehat{x}_k^{(j)}\theta^{(r)}-y_k^{(j)}). \tag{5.31}$$

In (a), by using the weak law of large numbers, we have approximated the quantity $\frac{1}{u^*(t^*)}G^T G$ by an identity matrix.
This is a reasonable approximation for a sufficiently large coding redundancy $u^*(t^*)$, since each diagonal entry of $\frac{G^T G}{u^*(t^*)}$ converges to 1 in probability, while each non-diagonal entry converges to 0 in probability. Furthermore, as we demonstrate via numerical experiments in Section 5.5, the convergence curve as a function of iteration for CodedFedL significantly overlaps that of the naive uncoded scheme, where the server waits to aggregate the results of all the clients. The expected aggregate gradient $\mathbb{E}(g_U)$ from the clients received by the server by the deadline time $t^*$ is as follows:

$$\mathbb{E}(g_U)=\sum_{j=1}^{n}\mathbb{P}(T_j\le t^*)\,\widetilde{X}^{(j)T}(\widetilde{X}^{(j)}\theta^{(r)}-\widetilde{Y}^{(j)})\overset{(a)}{=}\sum_{j=1}^{n}\sum_{\substack{k\in[\ell_j]\\ (\widehat{x}_k^{(j)},y_k^{(j)})\in\widetilde{\mathcal{D}}_j}}\mathbb{P}(T_j\le t^*)\,\widehat{x}_k^{(j)T}(\widehat{x}_k^{(j)}\theta^{(r)}-y_k^{(j)})\overset{(b)}{=}\sum_{j=1}^{n}\sum_{k=1}^{\ell_j}(1-w_{j,k}^2)\,\widehat{x}_k^{(j)T}(\widehat{x}_k^{(j)}\theta^{(r)}-y_k^{(j)}), \tag{5.32}$$

where in (a), the inner sum denotes the sum over data points in $\widetilde{\mathcal{D}}_j=(\widetilde{X}^{(j)},\widetilde{Y}^{(j)})$, while in (b), all the points in the local dataset are included, with $(1-w_{j,k}^2)=0$ for the points in the set $\widehat{\mathcal{D}}_j\setminus\widetilde{\mathcal{D}}_j$. In light of (5.31) and (5.32), it follows that $\mathbb{E}(g_M)\approx g$.

Remark 31. In Appendix P, we perform a convergence analysis of CodedFedL and find its iteration complexity under the simplifying assumption that $\frac{1}{u^*(t^*)}G^T G=I_m$, which, based on the above analysis, implies $\mathbb{E}(g_M)=g$. We bound the variance of $g_M$ and leverage a standard result in the literature for convergence of stochastic gradient descent. The exact convergence analysis taking into account the underlying distribution of the encoding matrix will be addressed in future work.

In the following section, we analyze the load allocation policy of CodedFedL, demonstrating how our proposed two-step load allocation problem in (5.27) can be solved efficiently.
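The approximation in step (a) of (5.31) is easy to probe numerically: as the coding redundancy $u$ grows, $\frac{1}{u}G^TG$ approaches the identity, with diagonal entries concentrating at 1 and off-diagonal entries at 0. A minimal check, with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(4)
m = 20                                   # total number of data points (illustrative)

# Entries of G are iid with mean 0 and variance 1, so by the weak law of large
# numbers each entry of (1/u) G^T G converges to the corresponding entry of I_m.
errs = []
for u in (50, 500, 5000):
    G = rng.normal(size=(u, m))
    M = G.T @ G / u
    errs.append(np.abs(M - np.eye(m)).max())
```

The maximum entrywise deviation shrinks roughly like $1/\sqrt{u}$, which is why a sufficiently large $u^*(t^*)$ justifies the identity approximation in (5.31).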
5.4 Analyzing the CodedFedL Load Design

In this section, we demonstrate how our load allocation policy provides an efficient and tractable approach to obtaining the minimum deadline time in (5.23). For ease of notation, we index the clients and the MEC server using $j\in[n+1]$ throughout this section, where $j\in[n]$ denotes the $n$ clients and $j=n+1$ denotes the MEC server, and we use the generic term node for the MEC server as well as the clients. Likewise, $\widetilde{\ell}_{n+1}=u$, $\ell_{n+1}=u_{\max}$, $\ell_{n+1}^{\mathrm{opt}}=u^{\mathrm{opt}}$, $\ell_{n+1}^*(t^*)=u^*(t^*)$, $T_{n+1}=T_C$, and $R_{n+1}(t;\widetilde{\ell}_{n+1})=R_C(t;u)$. In the following, we present our result for the expected return $\mathbb{E}(R_j(t;\widetilde{\ell}_j))$ for node $j\in[n+1]$, as defined in Section 5.3.3.

Theorem 11. For the compute and communication models defined in (5.11) and (5.13), let $0\le\widetilde{\ell}_j\le\ell_j$ be the number of data points processed by node $j\in[n+1]$ in each training epoch. For a deadline time of $t$ at the server, the expectation of the return $R_j(t;\widetilde{\ell}_j)=\widetilde{\ell}_j\mathbb{1}_{\{T_j\le t\}}$ satisfies the following:

$$\mathbb{E}(R_j(t;\widetilde{\ell}_j))=\begin{cases}\displaystyle\sum_{\nu=2}^{\nu_m} U\!\Big(t-\frac{\widetilde{\ell}_j}{\mu_j}-\tau_j\nu\Big)\, h_\nu\, f_\nu(t;\widetilde{\ell}_j) & \text{if } \nu_m\ge 2,\\[2pt] 0 & \text{otherwise,}\end{cases}$$

where $U(\cdot)$ is the unit step function with $U(x)=1$ for $x>0$ and $0$ otherwise,

$$f_\nu(t;\widetilde{\ell}_j)=\widetilde{\ell}_j\Big(1-e^{-\frac{\alpha_j\mu_j}{\widetilde{\ell}_j}\big(t-\frac{\widetilde{\ell}_j}{\mu_j}-\tau_j\nu\big)}\Big), \quad h_\nu=(\nu-1)(1-p_j)^2 p_j^{\nu-2},$$

and $\nu_m\in\mathbb{Z}$ satisfies $t-\tau_j\nu_m>0$ and $t-\tau_j(\nu_m+1)\le 0$.

The proof of Theorem 11 is provided in Appendix M. Next, we analyze the behavior of $\mathbb{E}(R_j(t;\widetilde{\ell}_j))$ for $\nu_m\ge 2$. For a given $t>0$ and $\nu\in\{2,\dots,\nu_m\}$, consider $f_\nu(t;\widetilde{\ell}_j)$ for $\widetilde{\ell}_j>0$. Then, the following holds:

$$f''_\nu(t;\widetilde{\ell}_j)=-e^{-\frac{\alpha_j\mu_j}{\widetilde{\ell}_j}\big(t-\frac{\widetilde{\ell}_j}{\mu_j}-\nu\tau_j\big)}\,\frac{\alpha_j^2\mu_j^2(t-\nu\tau_j)^2}{\widetilde{\ell}_j^3}<0.$$

[Fig. 5.3(a): Illustration of the piece-wise concavity of the expected return $\mathbb{E}(R_j(t;\widetilde{\ell}_j))$ for $\widetilde{\ell}_j>0$ for node $j$. Fig. 5.3(b): Illustration of the monotonic relationship between the optimal expected return $\mathbb{E}(R_j(t;\ell_j^*(t)))$ and $t$ for the $j$-th node.]
Figure 5.3: Illustrating the properties of the expected aggregate return $\mathbb{E}(R_j(t;\widetilde{\ell}_j))$ based on the result in Theorem 11. We assume $p_j=0.9$, $\tau_j=\sqrt{3}$, $\mu_j=2$, $\alpha_j=20$, and for Fig. 5.3b, $t=10$.

Thus, $f_\nu(t;\widetilde{\ell}_j)$ is strictly concave in the domain $\widetilde{\ell}_j>0$. Furthermore, $f_\nu(t;\widetilde{\ell}_j)\le 0$ for $\widetilde{\ell}_j\ge\mu_j(t-\tau_j\nu)$ for $\nu\in\{2,\dots,\nu_m\}$. Therefore, as highlighted in Remark 29, the expected return $\mathbb{E}(R_j(t;\widetilde{\ell}_j))$ is piece-wise concave in $\widetilde{\ell}_j$, and the exact intervals of concavity are $(0,\mu_j(t-\nu_m\tau_j)),\dots,(\mu_j(t-3\tau_j),\mu_j(t-2\tau_j))$. Furthermore, $\widetilde{\ell}_j$ is upper bounded by $\ell_j$. Thus, for a given $t>0$, each of (5.25) and (5.26) decomposes into a finite number of convex optimization problems, which are efficiently solved in practice [22]. The piece-wise concave relationship between $\mathbb{E}(R_j(t;\widetilde{\ell}_j))$ and $\widetilde{\ell}_j$ is also illustrated in Fig. 5.3a.

Consider an optimal solution $\ell_j^*(t)$ for node $j\in[n+1]$, and the corresponding optimized expected return $\mathbb{E}(R_j(t;\ell_j^*(t)))$. Intuitively, as we increase the deadline time $t$, the optimized load $\ell_j^*(t)$ should vary such that the server receives more optimal expected return from node $j$. In Appendix N, we formally prove that $\mathbb{E}(R_j(t;\ell_j^*(t)))$ is monotonically increasing in the deadline time $t$. We also illustrate this in Fig. 5.3b. Furthermore, as the maximal expected total aggregate return is $\mathbb{E}(R(t;(\ell_1^*(t),\dots,\ell_{n+1}^*(t))))=\sum_{j=1}^{n+1}\mathbb{E}(R_j(t;\ell_j^*(t)))$, the maximal expected total aggregate return is also monotonically increasing in $t$. Therefore, (5.27) can be solved efficiently using bisection search, as claimed earlier in Remark 30.

When communication links do not provide time diversity, as in an AWGN channel, one reliable transmission is performed at less than a $10^{-5}$ bit error rate with adequate error protection coding. This motivates us to consider the special case where for each node $j\in[n+1]$, $p_j=0$, resulting in the following specialized expression for the expected return:

$$\mathbb{E}(R_j(t;\widetilde{\ell}_j))=U\!\Big(t-\frac{\widetilde{\ell}_j}{\mu_j}-2\tau_j\Big)\cdot{}$$
f 2 (t; e ℓ j ), =U t− e ℓ j µ j − 2τ j ! e ℓ j 1− e − α j µ j e ℓ j (t− e ℓ j µ j − 2τ j ) . (5.33) For this special case, we have a unique closed form solution for ℓ ∗ (t)=(ℓ ∗ 1 (t),...,ℓ ∗ n+1 (t)), and consequently a closed form result forE(R(t;ℓ ∗ (t))). Specifically, for node j∈[n+1], we have the following one-shot solution for the optimal load ℓ ∗ j (t): ℓ ∗ j (t)= 0 if t≤ 2τ j s j (t− 2τ j ) if 2τ j <t≤ ζ j ℓ j otherwise (5.34) 155 where s j =− α j µ j W − 1(− e − (1+α j ) )+1 and ζ j =( ℓ j s j + 2τ j ). Here, W − 1 (·) is the minor branch of the Lambert W-function [33], which is the inverse function of f(W)=We W . Consequently, we have the following one-shot solution for the optimized return for node j∈[n+1]: E(R j (t;ℓ ∗ j (t))) = 0 if t≤ 2τ j e s j (t− 2τ j ) if 2τ j <t≤ ζ j ℓ j 1− e − α j µ j ℓ j t− ℓ j µ j − 2τ j ! otherwise (5.35) wheree s j =s j (1− e − α j ( µ j s j − 1) ). We prove (5.34) and (5.35) in Appendix O. Using these results, we have a closed form for the maximum expected total aggregate return from the nodes as follows: E(R(t;ℓ ∗ (t)))= X j∈[n+1] 2τ j <t≤ ζ j e s j (t− 2τ j ) + X j∈[n+1] ζ j <t ℓ j 1− e − α j µ j ℓ j t− ℓ j µ j − 2τ j ! , (5.36) which is monotonically increasing in the deadline time t. Therefore, (5.27) can be solved efficiently using bisection search to obtain the optimal deadline time t ∗ . In the next section, we present the results of our numerical experiments, which demon- strate the performance gains that CodedFedL can achieve in practice. 5.5 Numerical Experiments In this section, we demonstrate the performance gains of CodedFedL via numerical experi- ments. First we describe our simulation setting, and then we present the numerical results. 156 5.5.1 Simulation Setting MEC Scenario: We consider a wireless scenario consisting ofn=30 client nodes and 1 MEC server. 
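The one-shot solution in (5.34) is easy to compute once W_{−1} is available. Below is a minimal sketch; since the standard library has no Lambert W, we include a pure-Python bisection stand-in for the lower branch (e.g. `scipy.special.lambertw(y, -1)` would serve the same purpose). Function and parameter names are ours.

```python
import math

def lambert_w_minus1(y, iters=200):
    """Lower branch W_{-1}(y) for -1/e <= y < 0, via bisection on the map
    x -> x*exp(x), which is decreasing for x <= -1 (a stdlib stand-in
    for scipy.special.lambertw(y, -1))."""
    lo, hi = -2.0, -1.0
    while lo * math.exp(lo) < y:   # push lo left until g(lo) >= y
        lo *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) >= y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def optimal_load(t, l_max, mu, tau, alpha):
    """One-shot optimal load l*_j(t) from (5.34); `l_max` plays the role
    of the cap ell_j, and mu, tau, alpha are the node parameters."""
    s = -alpha * mu / (lambert_w_minus1(-math.exp(-(1.0 + alpha))) + 1.0)
    zeta = l_max / s + 2.0 * tau
    if t <= 2.0 * tau:
        return 0.0
    if t <= zeta:
        return s * (t - 2.0 * tau)
    return l_max
```

The resulting load is zero until the deadline exceeds the round-trip delay 2τ_j, grows linearly with slope s_j, and saturates at ℓ_j, matching the three branches of (5.34).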
For each client, the delay model described in Section 5.2.2 is used for the overall time in downlink (downloading the model), gradient computation, and uplink (uploading the gradient update), where the system parameters are as described next. We use an LTE network, and assume that each client is uniformly allocated 3 resource blocks, resulting in a maximum PHY level information rate of 216 kbps. Note that depending on the channel conditions, the effective information rate can be lower than 216 kbps. To model heterogeneity, we generate normalized effective information rates using {1, k_1, k_1², ..., k_1^{29}} and assign a random permutation of them to the clients, the maximum effective information rate being 216 kbps. Furthermore, we use the same failure probability p_j = 0.1 for j ∈ {1, ..., 30}, capturing the typical practice in wireless of adapting the transmission rate for a constant failure probability [3]. An overhead of 10% is assumed and each scalar is represented by 32 bits. The normalized processing powers are generated using {1, k_2, k_2², ..., k_2^{29}}, the maximum MAC rate being 3.072 × 10^6 MAC/s. Furthermore, we set α_j = 2 for j ∈ {1, ..., 30}. We fix (k_1, k_2) = (0.95, 0.8). We assume that the MEC server has dedicated, high performance and reliable resources, so the coded gradient is available for aggregation with probability 1 by any finite deadline time t, i.e. P(T_C ≤ t) = 1 for any t > 0. Essentially, this implies u^{opt} = u_max. Furthermore, we let δ = u_max/m for notational convenience.

Datasets and Hyperparameters: We consider two benchmark datasets: MNIST [107] and Fashion MNIST [216]. The features are vectorized, and the labels are one-hot encoded. For kernel embedding, the hyperparameters are (σ, q) = (5, 2000). A common practice in large-scale distributed learning is to perform training using mini-batch stochastic gradient descent (SGD), wherein the local dataset is first sorted and partitioned into mini-batches.
Then, in each training iteration, each client computes the gradient over a local mini-batch selected sequentially, and the model update is based on the gradient over the global mini-batch obtained by aggregating the gradients over the local mini-batches across clients. We consider the same mini-batch implementation for the uncoded schemes (as we describe in the next paragraph), while for CodedFedL, the data allocation, encoding and training modules are based on each global mini-batch. We assign an equal number of data points to each client and use a global mini-batch size of m = 12000. Thus, each complete epoch over the training dataset constitutes 5 global mini-batch steps.

[Figure 5.4: Illustrating the results for MNIST. (a) Test accuracy with respect to wall-clock time for naive uncoded, and CodedFedL with different δ; the latency overhead due to uploading of the coded data is also highlighted. (b) Test accuracy with respect to mini-batch update iteration for naive uncoded, greedy uncoded with ψ ∈ {0.1, 0.2} and CodedFedL with δ ∈ {0.1, 0.2}. (c) Test accuracy with respect to wall-clock time for naive uncoded, greedy uncoded with ψ ∈ {0.1, 0.2} and CodedFedL with δ ∈ {0.1, 0.2}.]
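The kernel embedding step mentioned above, with hyperparameters (σ, q), can be realized with random Fourier features; the conclusion of this chapter notes that clients share a common pseudo-random seed so that they obtain identical feature maps without collaborating. The following is an illustrative sketch of that idea (the function name and sampling details are our own choices, not the dissertation's implementation):

```python
import math
import random

def rff_embed(x, q, sigma, seed):
    """Sketch of a random Fourier feature map approximating a Gaussian
    kernel. With a common `seed`, every client draws the same (w, b)
    pairs and hence computes an identical embedding of its raw feature
    vector x without collaborating with other clients."""
    rng = random.Random(seed)
    scale = math.sqrt(2.0 / q)
    feats = []
    for _ in range(q):
        w = [rng.gauss(0.0, 1.0 / sigma) for _ in x]   # spectral sample
        b = rng.uniform(0.0, 2.0 * math.pi)            # random phase
        feats.append(scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b))
    return feats
```

Two clients calling `rff_embed` with the same seed produce the same q-dimensional map, which is what allows the subsequent linear-regression formulation to be consistent across clients.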
To study the impact of non-IID datasets across clients and to demonstrate the superiority of CodedFedL in dealing with statistical heterogeneity, we first sort the training dataset according to class labels, and then partition the entire sorted training dataset into 30 equally sized shards, each of them to be assigned to a different worker. We then sort the clients according to the expected total time using the formula in (5.15) with ℓ̃_j = 400, i.e. the size of the local mini-batch. Then, the 30 data shards are allocated in the order of the sorted clients. For all approaches, an initial step size of 6 is used with a step decay of 0.8 at epochs 40 and 65, while the total number of epochs is 70. Additionally, we use an L_2 regularization of (λ/2)∥θ∥²_F with the loss defined in (5.9), and the regularization parameter λ is set to 9 × 10^{−6}. Accuracy is reported on the test dataset for each training iteration.

Schemes: We compare the following schemes:

• Naive Uncoded: Each client computes a gradient over its local mini-batch selected sequentially, and the server waits to aggregate local gradients from all the clients.

• Greedy Uncoded: Clients compute gradients over their local mini-batches, and the server waits for results from the first (1 − ψ)N clients. This corresponds to an aggregate return of (1 − ψ)m from the clients.

• CodedFedL: We simulate our approach described in Section 5.3. Client j ∈ [n] computes the gradient over a fixed subset of ℓ*_j(t*) data points in the local mini-batch, and the server only waits until the deadline time t* before carrying out the coded federated aggregation in (5.30) corresponding to the global mini-batch for that iteration. We also include the overhead time for uploading the local parity datasets from the clients to the server.

5.5.2 Results

Fig. 5.4 illustrates the results for MNIST, while Fig. 5.5 illustrates the results for Fashion MNIST. In Fig. 5.4a and Fig.
5.5a, we present the generalization accuracy as a function of wall-clock time for naive uncoded and CodedFedL with different coding redundancy. Clearly, as the coding redundancy is increased by increasing δ, the overall training time reduces significantly. Additionally, as highlighted by the inner figures in Fig. 5.4a and Fig. 5.5a, the initial time spent in uploading the coded data to the server generally increases with increased coding redundancy.

[Figure 5.5: Illustrating the results for Fashion MNIST. (a) Test accuracy with respect to wall-clock time for naive uncoded, and CodedFedL with different δ; the latency overhead due to uploading of the coded data is also highlighted. (b) Test accuracy with respect to mini-batch update iteration for naive uncoded, greedy uncoded with ψ ∈ {0.1, 0.2} and CodedFedL with δ ∈ {0.1, 0.2}. (c) Test accuracy with respect to wall-clock time for naive uncoded, greedy uncoded with ψ ∈ {0.1, 0.2} and CodedFedL with δ ∈ {0.1, 0.2}.]

However, the gain in training time accumulates across training iterations and the impact of this overhead becomes negligible. Furthermore, Fig. 5.4a and Fig.
5.5a illustrate that for the same number of training iterations, CodedFedL achieves a similar generalization accuracy as the naive uncoded scheme even over a large range of coding redundancy. These plots complement our proof of the stochastic approximation of the naive uncoded gradient aggregation by the coded federated gradient aggregation in Section 5.3.5, and show that even for a large coding redundancy, accuracy does not drop significantly.²

(Footnote 2: Using the generic local minimizer function fminbnd in MATLAB for solving the concave maximization subproblems, it takes less than 2 minutes to implement our two-step approach for obtaining the optimal load allocation and deadline time for all values of coding redundancy considered in the simulations. Although this is negligible compared to the overall training time (see Fig. 5.4c for example), it can be further improved through an implementation specialized for convex programming.)

To highlight the superior performance of CodedFedL when data is non-IID, we compare the convergence plots of generalization accuracy vs training iteration for CodedFedL with δ ∈ {0.1, 0.2} and greedy uncoded with ψ ∈ {0.1, 0.2}. By design of the simulation setting, ψ = 0.1 implies that for greedy uncoded, the server misses all the updates associated with a particular class in most iterations, and similarly, ψ = 0.2 implies that the server misses all the updates associated with two classes in most iterations. As shown in Fig. 5.4b and Fig. 5.5b, this results in a poorer generalization performance with respect to training iteration for greedy uncoded in comparison to CodedFedL. Additionally, due to the optimal load allocation, CodedFedL performs significantly better than greedy uncoded in the overall training time for an identical number of training iterations, as shown in Fig. 5.4c and Fig. 5.5c. Clearly, CodedFedL has significantly better convergence time than the naive uncoded and greedy uncoded approaches, and as highlighted in Section 5.3.5, the coded federated gradient aggregation approximates the naive uncoded gradient aggregation well for large datasets.
For further insight, let γ be the target accuracy for a dataset, and let t^U_γ, t^G_γ and t^C_γ respectively be the first time instants to reach the γ accuracy for naive uncoded, greedy uncoded and CodedFedL. In Table 5.2, we summarize the results for δ = ψ = 0.1. Gains in the overall training time for CodedFedL are up to 2.5× and 8.8× over naive uncoded and greedy uncoded respectively. In Table 5.3, we compare results for δ = ψ = 0.2, where the gains in the training time for the target accuracy are up to 5.4× and 15× over naive uncoded and greedy uncoded respectively. Also, greedy uncoded never reaches the target accuracy of 93.3% for MNIST and 82.8% for Fashion MNIST, hence the corresponding fields in Tables 5.2 and 5.3 are empty.

Table 5.2: Summary of Results for δ = ψ = 0.1

Dataset        | γ (%) | t^U_γ (h) | t^G_γ (h) | t^C_γ (h) | t^U_γ/t^C_γ | t^G_γ/t^C_γ
MNIST          | 93.3  | 501       | —         | 198       | 2.5×        | —
MNIST          | 87.5  | 63        | 233       | 27        | 2.3×        | 8.8×
Fashion MNIST  | 82.8  | 521       | —         | 219       | 2.4×        | —
Fashion MNIST  | 82.1  | 377       | 224       | 145       | 2.6×        | 1.6×

Table 5.3: Summary of Results for δ = ψ = 0.2

Dataset        | γ (%) | t^U_γ (h) | t^G_γ (h) | t^C_γ (h) | t^U_γ/t^C_γ | t^G_γ/t^C_γ
MNIST          | 93.3  | 501       | —         | 93.2      | 5.4×        | —
MNIST          | 80.2  | 15.8      | 125       | 8.17      | 1.9×        | 15×
Fashion MNIST  | 82.8  | 521       | —         | 90.4      | 5.8×        | —
Fashion MNIST  | 73.8  | 30.6      | 123       | 11.1      | 2.7×        | 11×

5.6 Conclusion

We propose CodedFedL, the first coding theoretic framework for straggler resilient federated learning in multi-access edge computing networks with general non-linear regression and classification tasks and non-IID data across clients. As a key component, we propose distributed kernel embedding of raw features at the clients using a common pseudo-random seed across clients, so that they can obtain the kernel features without having to collaborate with each other.
In addition to transforming the non-linear federated learning problem into a computationally favourable distributed linear regression, kernel embedding enables our novel distributed encoding strategy that generates global parity data for straggler mitigation. The parity data allows the central server to perform gradient computations that substitute for missing gradient updates from straggling client devices, thus clipping the tail behavior of gradient aggregation and significantly improving the convergence performance when data is non-IID. Furthermore, no decoding of partial gradients is required at the central server. We provide an analytical solution for load allocation and coding redundancy computation for obtaining the optimal deadline time, by utilizing statistical knowledge of the compute and communication delays of the MEC nodes. Additionally, we provide a privacy analysis of generating local parity datasets, and analyze the convergence performance of CodedFedL under simplifying assumptions. Finally, we provide results from numerical experiments over benchmark datasets and practical network parameters that demonstrate gains of up to 15× in the wall-clock training time over benchmark schemes.

CodedFedL opens up many interesting future directions. As the global parity dataset is obtained by the MEC server by aggregating the local parity datasets from the clients, the encoded data of each client can be further anonymized by using secure aggregation [19], so that the server only learns the global parity dataset, without knowing any individual local parity dataset. With respect to any given client j ∈ [n], the server will thus receive the sum of the local parity dataset from client j and a noise term, where the noise term is the sum of the local parity datasets of the remaining clients. Exploring and characterizing this aspect of privacy is left for future study.
Furthermore, the problem of characterizing the complete impact of the encoding matrix on convergence and optimizing the deadline time based on convergence criteria will be addressed in future work. Additionally, formulating and studying the load optimization problem based on outage probability for the aggregate return is an interesting future work. Adapting CodedFedL to scenarios where the datasets at the clients change over time is a motivating future direction as well. Moreover, establishing theoretical foundations of combining coding with random Fourier feature mapping is of significant interest. Another important extension of CodedFedL is to develop coded computing solutions for federated learning with neural network workloads.

Chapter 6

Hierarchical Coded Gradient Aggregation for Learning at the Edge

6.1 Introduction

Massive amounts of data generated each day by modern networks of remote devices [179] can power a range of statistical machine learning based applications [49]. However, there is an increasing privacy concern with moving client data to a central server for model training. We therefore focus on fully distributed training from data distributed at the edge by decomposing the training procedure into two steps: first, parallel model updates based on local data at the clients, and second, global model aggregation at the central server. This training procedure broadly covers the privacy preserving federated learning and collaborative learning settings [141, 238, 168, 166].

Distributed training suffers from a communication bottleneck, as massive gradient vectors must be moved from the clients to the master. Furthermore, training is performed over many iterations, causing a severe bottleneck for the bandwidth-constrained client devices. Additionally, communications from the resource-constrained client nodes may straggle, resulting in re-transmission of gradients and slowing the overall aggregation.
Motivated by the emergence of the multi-access edge computing (MEC) ecosystem [180], we consider a hierarchical setup for distributed learning, where reliable helper server nodes, located close to client traffic, are available for resilient and communication efficient upload of gradients from the client devices to the master. We thus focus on the problem of making hierarchical distributed learning robust to straggling client-to-helpers links and characterizing the optimal client-to-helpers and helpers-to-master communication loads.

Our key idea is to leverage coded computing, which utilizes computation redundancy in encoded forms to provide efficiency, resiliency, and security in large-scale distributed computing [109, 123, 169, 231, 55, 186, 144, 172, 205, 220, 45, 14, 191, 101, 73, 192, 199, 233, 242, 163, 104, 81, 197, 64, 153, 167, 145, 221, 46]. Prior works on gradient coding (see e.g. [199]) have focused on centralized data placement. The key challenge for us is that the data is private to the clients and a centralized data placement strategy is not feasible. We propose to leverage coded redundancy in the communication links, such that each of the n_e clients sends coded gradient updates to the helpers and the aggregation process is successful for up to any s straggling links out of the n_h helper links per client. Towards this end, we propose two different approaches for hierarchical aggregation.

In our first approach, Aligned Repetition Coding (ARC), each client partitions its gradient update and transmits each unique component to multiple helpers, so that corresponding gradient components from different clients are aligned at the helpers. This allows significant partial aggregation opportunities at the helpers, achieving a normalized helpers-to-master communication load (C_HM) of O(n_h), which also provides an upper bound for the optimal C_HM.
ARC, however, requires a normalized client-to-helpers communication load (C_EH) of Θ(n_h), which can be prohibitive for bandlimited clients.

Our second approach, Aligned Minimum Distance Separable Coding (AMC), achieves the optimal C_EH of Θ(1). Each client partitions its gradient and applies an MDS code over the partitions. The generator matrix is the same across clients, so that each parity corresponds to a unique helper for all the clients. However, partial aggregations at the helpers are useful only when (n_h − s) helpers successfully receive the messages from the same set of client nodes. Using this observation, we formulate the analysis of C_HM as a balls and bins maximum occupancy problem [160] and develop a bound for C_HM, which is better than the naive MDS scheme where the helpers simply forward the messages to the master, achieving C_HM = n_e. Our schemes thus highlight an interesting trade-off between the client-to-helpers and helpers-to-master communication loads, opening up new future directions.

6.2 Problem Setting and Main Results

In this section we describe the computation, communication and network models of our setting, define our metrics and formulate our problem. We also present our main results.

6.2.1 Computation Model

In many machine learning problems, the goal is to fit a model over a training data set by minimizing an underlying loss function. For a labeled data set D = {x_j ∈ R^{p+1} : j = 1, ..., d}, the following optimization problem is considered:

θ* = argmin_{θ∈R^p} Σ_{x∈D} ℓ(θ; x) + λR(θ), (6.1)

where ℓ(·), R(·) and λ denote the loss function, the regularization function, and the regularization parameter respectively. A popular strategy is to iteratively solve (6.1) using the Gradient Descent (GD) algorithm.
Specifically, the following sequence of model updates {θ^{(t)}}_{t=0}^∞ is carried out:

θ^{(t+1)} = h_R(θ^{(t)}, g_D), (6.2)

where h_R(·) is a gradient-based optimizer depending on the regularizer R(·) and g_D = Σ_{x∈D} ∇ℓ(θ^{(t)}; x) denotes the gradient of the loss function evaluated at the model at iteration t over D. The core component of the iterations in (6.2) is the computation of the gradient vector g_D at each iteration.

[Figure 6.1: Hierarchical distributed learning setup. For resiliency to up to s straggling links among the n_h helper links, client node i, i ∈ [n_e], encodes its local gradient to obtain coded messages c_{i,j} for the n_h helpers. The master uses the pattern of the messages received by the helpers to direct the helpers on what they should communicate, and recovers the full gradient from the helpers' messages.]

6.2.2 Network and Communication Model

We consider a distributed learning setup in which data is distributedly available across n_e clients. For i ∈ [n_e], D_i denotes the data set available at client node i and D = ∪_{i=1}^{n_e} D_i. Here, we denote [n] := {1, ..., n} for n ∈ N. Let g_i denote the gradient associated with D_i, i.e. g_i = Σ_{x∈D_i} ∇ℓ(θ^{(t)}; x). Therefore, we have the following for the gradient over the entire data set D:

g_D = Σ_{i=1}^{n_e} g_i = Σ_{x∈D} ∇ℓ(θ^{(t)}; x). (6.3)

The master needs to recover g_D from the clients to take the model update step in (6.2). A set of reliable helper nodes with error-free communication links to the master are available to aid this process. The nodes are arranged in a hierarchical network setup (see Fig. 6.1). Each client has a unicast communication link to each helper. The clients-to-helpers communication links are unreliable. The helpers forward the client messages to the master after potential partial aggregation and coding. Next, we define the metrics associated with our problem.
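The two-step structure of (6.2) and (6.3) can be sketched concretely. The following toy implementation uses a squared loss and a plain gradient step as a stand-in for h_R (our simplifying assumptions: no regularizer, scalar-weighted features; names are ours):

```python
def distributed_gd_step(theta, client_data, lr):
    """Sketch of one iteration of (6.2): each client i computes its local
    gradient g_i over D_i, the master forms g_D = sum_i g_i as in (6.3),
    and takes a plain gradient step (our stand-in for h_R, with squared
    loss and no regularizer)."""
    g_D = [0.0] * len(theta)
    for X, y in client_data:                    # (features, labels) per client
        for xi, yi in zip(X, y):
            err = sum(t * x for t, x in zip(theta, xi)) - yi
            for k, x in enumerate(xi):          # gradient of (theta.x - y)^2
                g_D[k] += 2.0 * err * x
    return [t - lr * g for t, g in zip(theta, g_D)]
```

Note that the master never sees raw data, only the aggregated gradient, which is the privacy-motivated decomposition described above.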
6.2.3 Metrics

Our aim is to design aggregation schemes in which the master is able to recover the full gradient g_D in (6.3) under all scenarios of up to s straggling links out of the n_h helper links per client. The resiliency criterion is formally defined below:

Definition 4. Resiliency threshold (s): A given aggregation scheme has a resiliency threshold of s, for s ∈ [n_h − 1], if the master is able to recover g_D in (6.3) for every pattern of up to s straggling helper links per client.

For achieving the resiliency threshold of s, we propose to inject redundancy in the client to helper communications. As shown in Fig. 6.1, client i encodes g_i to generate the aggregate coded message c_i, which is the concatenation of the coded messages sent to the helpers, i.e. c_i^T = [c_{i,1}^T, ..., c_{i,n_h}^T]. For encoding, client i first partitions g_i into k components as g_i^T = [g_{i,1}^T, ..., g_{i,k}^T], then uses an encoding matrix G_i ∈ R^{n_h × k} over the k gradient components to obtain c_i = G_i [g_{i,1}^T, ..., g_{i,k}^T]^T ∈ R^{q×1}, q = n_h p / k. Normalizing the communication per client by the size of the local gradient, we obtain the normalized communication load C_EH, which is defined as follows:

Definition 5. Client-to-helpers communication load (C_EH): For a given aggregation scheme, C_EH is the normalized communication load from a client to the helper nodes; that is, the total size of the coded messages generated by a client for the helpers, normalized by the local gradient vector size, i.e.

C_EH = q/p = n_h/k, (6.4)

where k is the number of components into which the local gradient vector is partitioned for encoding, while p and q are respectively the lengths of the local gradient and aggregate coded vector at each client. For example, client i can repeat g_i on each helper link, in which case c_i^T = [g_i^T, ..., (n_h times), ..., g_i^T] and G_i = 1_{n_h × 1}, k = 1, q = n_h p and C_EH = n_h.
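The encoding in Def. 5 amounts to a block-wise matrix-vector product, which can be sketched as follows (function name is ours; entries of G scale length-(p/k) blocks of the gradient):

```python
def encode(g, G):
    """Sketch of the encoding in Def. 5: split the length-p gradient g into
    k equal parts g_1..g_k and form the helper message c_j = sum_r G[j][r]*g_r
    for each of the n_h rows of G. Returns the messages and C_EH = n_h/k."""
    n_h, k = len(G), len(G[0])
    part = len(g) // k                       # assumes k divides p
    parts = [g[r * part:(r + 1) * part] for r in range(k)]
    c = [[sum(G[j][r] * parts[r][m] for r in range(k)) for m in range(part)]
         for j in range(n_h)]
    return c, n_h / k
```

With G = 1_{n_h × 1}, as in the repetition example above, every helper receives a full copy of g and the load is C_EH = n_h; adding more partitions (larger k) drives the load down toward Θ(1).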
To recover g_D from the messages successfully uploaded by the clients to the helpers, the master receives from the helpers the straggling link pattern from the clients to the helpers, and directs the helpers to upload messages for recovering g_D. It thus obtains {v_j : j ∈ [n_h]} from the n_h helper nodes, where v_j is a function of the messages received from the client nodes as well as the straggling link pattern across the clients. For an aggregation scheme with resiliency threshold s, let Ω(s) be the set of all straggling patterns with s straggling links per client and let f ∈ Ω(s). For helper j, let u_j^f ∈ R^{l_j^f × 1} be the aggregate message obtained by concatenating the messages received from the clients according to the straggling pattern f. Helper j sends a coded message v_j^f = H_j^f u_j^f ∈ R^{h_j^f × 1}, where H_j^f ∈ R^{h_j^f × l_j^f}. We define the average communication load from the helpers to the master as follows:

Definition 6. Helpers-to-master communication load (C_HM): For an aggregation scheme with a resiliency threshold of s, C_HM is the total size of the messages sent from the helpers to the master, averaged over the set Ω(s) of all straggling patterns with s straggling links per client, and normalized by the size of the gradient vector, i.e.

C_HM = (1/|Ω(s)|) Σ_{f∈Ω(s)} C_HM^f = (1/|Ω(s)|) Σ_{f∈Ω(s)} Σ_{j=1}^{n_h} h_j^f / p, (6.5)

where h_j^f is the length of the message sent by helper node j ∈ [n_h] for the straggling pattern f, and C_HM^f is the normalized communication load from the helpers to the master for f ∈ Ω(s). For example, for the aggregation scheme where each client i repeats its local gradient g_i to each helper node, it is easy to see that a copy of g_i is available with exactly one helper node in each straggling pattern when the resiliency threshold is s = (n_h − 1). Each helper can forward to the master all the messages received by it from the clients. Thus, H_j^f is simply an identity matrix with suitable dimensions and C_HM = n_e.
Next, we formally define our problem.

6.2.4 Problem Formulation

For a resiliency threshold of s, we have the following definition for an achievable tuple of communication loads:

Definition 7. Achievable tuple: For a resiliency threshold of s ∈ [n_h − 1], a tuple (C_EH, C_HM) is achievable if there exists an aggregation strategy with encoding matrices G_i's and decoding matrices H_j^f's for i ∈ [n_e], j ∈ [n_h] and f ∈ Ω(s) that is s-resilient and achieves communication loads C_EH and C_HM as described in Def. 5 and Def. 6 respectively.

Problem: For the hierarchical distributed learning setting with n_e clients, n_h helpers and a resiliency threshold of s ∈ [n_h − 1], the problem is to characterize the minimum helpers-to-master and client-to-helpers communication loads as follows:

C*_HM = inf_{(C_EH, C_HM)∈A} C_HM, (6.6)

C*_EH = inf_{(C_EH, C_HM)∈A} C_EH, (6.7)

where A is the set of all achievable (C_EH, C_HM) tuples.

6.2.5 Main Results

We now present our main results, characterizing C*_HM and C*_EH defined in (6.6) and (6.7) respectively.

Theorem. For the hierarchical distributed learning setting with n_e clients, n_h helpers and a resiliency threshold of s ∈ [n_h − 1], we have the following characterizations for C*_HM and C*_EH:

1 ≤ C*_HM ≤ s + 1, (6.8)

C*_EH = n_h / (n_h − s). (6.9)

Remark 32. The converse for C*_HM in (6.8) is proved in Appendix R using a cut-set argument. The achievability bound is proved in Section 6.4, where we propose Aligned Repetition Coding (ARC), which has a maximum communication load from helpers to master of (s + 1).

Remark 33. The converse for C*_EH in (6.9) is proved in Appendix S using cut-set bounds, while achievability is proved in Section 6.5, where we propose Aligned MDS Coding (AMC).

Remark 34. For the practical regime of s = αn_h, n_h = log(n_e), and 0 < α < 1, (6.9) implies that C*_EH = Θ(1), and thus our proposed AMC approach in Section 6.5 achieves a constant communication load per client.
In contrast to AMC, we show in Section 6.4 that ARC achieves a communication load C_EH of Θ(log(n_e)), which is costly for bandlimited clients.

Next we illustrate our proposed schemes through examples.

6.3 Motivating Examples

We now illustrate our two approaches proposed in Section 6.4 and Section 6.5 through examples. For illustration, we fix n_h = 4, n_e = 4 and s = 1.

Example (Aligned Repetition Coding (ARC)): As illustrated in Fig. 6.2a, each of the four clients partitions its local gradient into 2 components and repeats each of them for two unique helpers, achieving C_EH = 2. The helpers send the list of messages received from the clients, and the master constructs the failure table for each scenario, with 1 and 0 denoting successful and unsuccessful communications respectively. We consider scenarios with s = 1 failure per client, thus each row in the failure table has three 1's and one 0. Using the failure table, the master assigns, for each unique component of a client, the first helper in the corresponding row that successfully receives it. The assignments for the three failure scenarios are highlighted in Fig. 6.2.

[Figure 6.2: Illustrating the alignment opportunities for the ARC approach in Example 6.3 with n_e = 4, n_h = 4 and s = 1: (a) message table, (b) failure scenario 1, (c) failure scenario 2, (d) failure scenario 3. In the failure pattern tables, the 1's denote successful communications, the 0's denote failed communications, and the blue boxes denote the messages that the helpers are directed by the master to locally aggregate and upload to enable recovery of the full gradient. For example, in Fig. 6.2b, it is sufficient for helpers 1 and 2 to aggregate their received messages to obtain g_{D,1} and g_{D,2} respectively and send them to the master, and the master concatenates them to obtain the full gradient.]

Each helper then aggregates the assigned messages and uploads them to the master, and the master constructs g_{D,1} = Σ_{i=1}^4 g_{i,1} and g_{D,2} = Σ_{i=1}^4 g_{i,2}.
For example, in Fig. 6.2b, it is sufficient for helpers 1 and 2 to aggregate their received messages to obtain g_{D,1} and g_{D,2} respectively and send them to the master, while in Fig. 6.2c, the master obtains g_{D,1} by adding the results from helpers 1 and 3, and g_{D,2} by adding the results from helpers 2 and 4. As each component is half the size of the local gradient, the helpers-to-master communication loads in failure scenarios 1, 2 and 3 are 1, 2 and 1.5 respectively, showing that 1 ≤ C_HM ≤ 2.

Example (Aligned MDS Coding (AMC)):

[Figure 6.3: Illustrating the AMC approach for Example 6.3: (a) message table, (b) failure scenario 1, (c) failure scenario 2, (d) failure scenario 3. The master first finds the maximum number of exactly matching rows in the failure table (highlighted by a black box). The entries in these rows are locally aggregated at the helpers and sent to the master. The entries outside the black box (highlighted in green) are simply forwarded to the master. For example, in Fig. 6.3b, all rows match, thus it is sufficient for helpers 1, 2 and 3 to aggregate their received messages to obtain g_{D,1}, g_{D,2} and g_{D,3} and send them to the master.]

In our second approach, we use an MDS code over the gradient components to minimize the overhead from the clients to the helpers. As illustrated in Fig. 6.3a, each client splits its local gradient into 3 components and applies a (4, 3) MDS code over them, achieving C_EH = 4/3, which is less than the C_EH of 2 in the ARC example. We consider the same failure scenarios as in Example 6.3. The master first finds the maximum number of exactly matching rows in the failure table (highlighted using a dotted black box). The key idea is that for decoding any missing entry in any row, the master needs all the remaining three entries in the row. The same decoding criterion applies to partially aggregated components, i.e.
for any partial aggregation at a helper (a column in the failure table) to be useful, partial aggregations over the exact same clients (rows in the failure table) at two other helpers (columns in the failure table) must be received by the master. We consider the same failure scenarios as in Example 6.3. The entries within the maximum matching rows (highlighted in blue) are partially aggregated at the respective helpers and uploaded to the master. The remaining entries (highlighted in green) are simply forwarded. For example, in Fig. 6.3b, all four rows match, thus it is sufficient for helpers 1, 2 and 3 to aggregate their received messages to obtain $g_{D,1}$, $g_{D,2}$ and $g_{D,3}$ respectively and send them to the master. However, in Fig. 6.3c, each row is unique, so the maximum number of matching rows is 1 and all entries are simply forwarded to the master. The master performs the necessary decoding; for example, to recover $g_{2,1}$, the master sums the messages received from helpers 2 and 3 and subtracts the sum from the message received from helper 4. As each component is a third of the size of the local gradients, the helpers-to-master communication loads in the three scenarios are 1, 4 and 3 respectively. In the next two sections, we generalize the approaches illustrated above.

6.4 Aligned Repetition Coding and Proof of Achievability for $C^*_{HM}$

We now describe our approach based on repetition coding, generalizing the ideas presented in Example 6.3 in Section 6.3.

Encoding at the clients: For a recovery threshold of $s$, client $i \in [n_e]$ carries out the encoding process as follows. Client $i$ first partitions its gradient update into $k$ components as $g_i = [g_{i,1}^T, \ldots, g_{i,k}^T]^T$, where $k = \frac{n_h}{s+1}$ and $g_{i,r} \in \mathbb{R}^{\frac{p}{k} \times 1}$ for $r \in [k]$, where we assume that $n_h$ is divisible by $(s+1)$. The aggregate coded message is computed as $c_i = G[g_{i,1}^T, \ldots, g_{i,k}^T]^T = [c_{i,1}^T, \ldots, c_{i,n_h}^T]^T$, where $G$ is constructed by vertically stacking $(s+1)$ copies of the identity matrix:

$$G = \big[\, \underbrace{I_k^T \;\; I_k^T \;\; \cdots \;\; I_k^T}_{(s+1)\ \text{times}} \,\big]^T.$$
(6.10)

Essentially, each of the $k$ partitions of $g_i$ is repeated for a unique group of $(s+1)$ helpers, i.e., $c_{i,r+(q-1)k} = g_{i,r}$ for $r \in [k]$, $q \in [s+1]$. Furthermore, as the encoding matrix is the same across the client nodes, each client node repeats its $r$-th component for the same group of helpers $r+(q-1)k$ for $q \in [s+1]$. Thus, by construction, the client-to-helpers communication load for ARC is $C_{EH} = n_h \cdot \frac{1}{k} = (s+1)$.

Hierarchical Aggregation: Let $\Omega(s)$ be the set of all straggling patterns with $s$ straggling links per client, and consider $f \in \Omega(s)$. The helper nodes first communicate to the master the list of successful client node communications, and the master constructs the straggling table of the client-to-helper communications, with $n_e$ rows and $n_h$ columns, with 1 and 0 for successful and unsuccessful communications respectively. By the encoding construction, for each client, each of the $k$ unique partitions of the local gradient is repeated over $(s+1)$ helper links. Consider the $i$-th client, its $r$-th gradient component, and the $i$-th row in the failure table corresponding to the client. There is a 1 in at least one of the $(s+1)$ corresponding columns for the $r$-th component, and the helper corresponding to the first of these 1-containing columns is assigned the task of uploading the $r$-th component to the master. Each helper then locally aggregates the assigned upload components and sends the aggregate to the master. The master performs any remaining aggregation to obtain each of the $k$ components of $g_D$. As the partitions of the client gradients are simply repeated over the helper links, no additional decoding is needed at the master.

We now bound the communication load $C_{HM}$ for ARC, proving the achievability for $C^*_{HM}$ in (6.8). Among the straggling patterns in $\Omega(s)$, the maximum helpers-to-master communication load occurs when each helper has at least one unique component, so that each helper has to send a message to the master. This corresponds to straggling scenario 2 in Example 6.3 (Fig. 6.2c).
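As a concrete sketch of the ARC encoding described above, the repetition step can be expressed in a few lines of NumPy (an illustrative sketch; the function and variable names are ours, not from the dissertation's implementation):

```python
import numpy as np

def arc_encode(g, n_h, s):
    """Aligned Repetition Coding: split the local gradient g into
    k = n_h / (s+1) components and repeat each component for a unique
    group of (s+1) helpers, so helper r + (q-1)k receives component r."""
    assert n_h % (s + 1) == 0, "n_h must be divisible by (s + 1)"
    k = n_h // (s + 1)
    parts = np.array_split(g, k)                  # g_{i,1}, ..., g_{i,k}
    return [parts[r] for q in range(s + 1) for r in range(k)]

g = np.arange(8.0)                                # toy local gradient, p = 8
msgs = arc_encode(g, n_h=4, s=1)                  # k = 2; C_EH = s + 1 = 2
# helpers 1 and 3 hold g_{i,1}; helpers 2 and 4 hold g_{i,2}
```

With $n_h = 4$ and $s = 1$, the four messages total $(s+1)p$ symbols, matching the load $C_{EH} = s+1$.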
This results in the maximum communication load of $C^{\max}_{HM} = \sup_{f \in \Omega(s)} C^f_{HM} = n_h \cdot \frac{1}{k} = (s+1)$. Therefore, $C^*_{HM} \le C_{HM} \le C^{\max}_{HM} = (s+1)$, completing the proof of achievability for $C^*_{HM}$.

Although ARC provides significant aggregation opportunities at the helpers and zero decoding complexity at the master, it has $C_{EH} = (s+1)$, which may not be feasible in practice. For the practical regime of $n_h = \log(n_e)$ and $s = \alpha n_h$ with $0 < \alpha < 1$, $C_{EH}$ grows unboundedly. For typical client nodes such as cell phones on LTE, costly bandwidth and limited battery make ARC practically infeasible at scale. In the next section, we propose an MDS coding based approach that has a $C_{EH}$ of $\Theta(1)$, thus overcoming the costly bandwidth requirement of ARC.

6.5 Aligned MDS Coding and Proof of Achievability for $C^*_{EH}$

We now describe our approach based on MDS codes. Although any MDS code can be used, for brevity, we use the Vandermonde construction [95].

Encoding at the clients: For a resiliency threshold of $s$, client node $i \in [n_e]$ carries out the encoding process as follows. It partitions its gradient update $g_i$ into $k = (n_h - s)$ components as $g_i = [g_{i,1}^T, \ldots, g_{i,k}^T]^T$, where $g_{i,r} \in \mathbb{R}^{\frac{p}{k} \times 1}$ for $r \in [k]$. Client node $i$ then obtains the aggregate coded message $c_i = G[g_{i,1}^T, \ldots, g_{i,k}^T]^T = [c_{i,1}^T, \ldots, c_{i,n_h}^T]^T$, where $G$ is a Vandermonde matrix for an $(n_h, k)$ MDS code:

$$G = \begin{bmatrix} 1 & a_1 & a_1^2 & \cdots & a_1^{k-1} \\ 1 & a_2 & a_2^2 & \cdots & a_2^{k-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & a_{n_h} & a_{n_h}^2 & \cdots & a_{n_h}^{k-1} \end{bmatrix}, \quad (6.11)$$

where the $a_j$'s are distinct elements of $\mathbb{R}$. By construction, we have the following result for the client-to-helpers communication load for AMC:

$$C_{EH} = n_h \cdot \frac{1}{k} = \frac{n_h}{n_h - s}, \quad (6.12)$$

which proves the achievability for $C^*_{EH}$ in (6.9).

Hierarchical Aggregation: By the encoding construction, for each failure pattern, exactly $(n_h - s)$ coded messages per client are available at the helper nodes.
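The Vandermonde encoding of (6.11) can be sketched as follows (an illustrative sketch; the evaluation points $a_j$ are arbitrary distinct reals, and the names are ours):

```python
import numpy as np

def amc_encode(g, n_h, s):
    """Aligned MDS Coding: split g into k = n_h - s components and
    encode with an (n_h, k) Vandermonde code, so helper j receives
    c_{i,j} = sum_r a_j^{r-1} g_{i,r}."""
    k = n_h - s
    parts = np.stack(np.array_split(g, k))        # shape (k, p/k)
    a = np.arange(1.0, n_h + 1.0)                 # distinct a_1, ..., a_{n_h}
    G = np.vander(a, N=k, increasing=True)        # row j: 1, a_j, ..., a_j^{k-1}
    return G @ parts                              # n_h coded messages

g = np.arange(6.0)
coded = amc_encode(g, n_h=4, s=1)                 # k = 3; C_EH = 4/3
```

Any $k$ of the $n_h$ coded messages suffice to recover the $k$ components, since every $k \times k$ submatrix of a Vandermonde matrix with distinct $a_j$'s is invertible.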
By the MDS property, the helper nodes can forward their messages to the master, and the master can recover the local gradient components for each client. This results in a helpers-to-master communication load of

$$C^{\max}_{HM} = \sup_{f \in \Omega(s)} C^f_{HM} = n_e \cdot k \cdot \frac{1}{k} = n_e. \quad (6.13)$$

However, as illustrated in Example 6.3 in Section 6.3, alignment and partial aggregation opportunities arise at the helper nodes. Consider, for example, the case where, for each client node, the communications corresponding to the helpers $j \in [n_h - s]$ are successful. Then, local aggregation of the received messages at helper $j$ results in the following aggregated coded message:

$$v_j = \sum_{i=1}^{n_e} c_{i,j} = \sum_{i=1}^{n_e} \sum_{r=1}^{k} a_j^{r-1} g_{i,r} = \sum_{r=1}^{k} a_j^{r-1} \left( \sum_{i=1}^{n_e} g_{i,r} \right) = \sum_{r=1}^{k} a_j^{r-1} g_{D,r}, \quad (6.14)$$

where $v_j \in \mathbb{R}^{\frac{p}{k}}$ and $g_D = \sum_{i=1}^{n_e} g_i = [g_{D,1}^T, \ldots, g_{D,k}^T]^T$. Thus, using the $k$ partially aggregated messages from the first $k$ helpers, the master can decode each of the $k$ components of the full gradient $g_D$ in (6.3). This corresponds to the minimum communication load from the helpers to the master:

$$C^{\min}_{HM} = \inf_{f \in \Omega(s)} C^f_{HM} = 1. \quad (6.15)$$

This is the motivation behind the hierarchical aggregation procedure described as follows. For $f \in \Omega(s)$, the helper nodes first communicate to the master the pattern of successful client node communications. The master constructs the straggling table of the client-to-helper communications, with $n_e$ rows and $n_h$ columns, with 1 and 0 for successful and unsuccessful communications. The master finds the largest set of rows in the failure table that match exactly. In this block of $M$ maximum matching rows, there are exactly $(n_h - s)$ columns having all entries 1. Each of the helpers corresponding to these columns aggregates the messages from the clients corresponding to the $M$ maximum matching rows, and sends the aggregate message to the master. For the remaining $(n_e - M)$ clients, the corresponding $(n_h - s)$ helpers forward their messages to the master.
Finally, the master carries out the necessary decoding to obtain the original $(n_h - s)$ components of the full gradient $g_D$ in (6.3). In Appendix T, we obtain the following bound on $C_{HM}$ for AMC by formulating the analysis as a balls-and-bins maximum occupancy problem [160]:

$$C_{HM} \le n_e - \frac{n_e}{\binom{n_h}{n_h - s}} + 1. \quad (6.16)$$

6.6 Conclusion

We proposed a novel framework for robust and communication-efficient hierarchical gradient aggregation for model training on data distributed at the client nodes of edge networks, and proposed two approaches that minimize the client-to-helpers and helpers-to-master communication loads separately. It is an interesting direction for future work to develop a scheme that achieves the best of both the client-to-helpers and helpers-to-master communication loads. Additionally, it would be interesting to explore non-symmetric and probabilistic straggling scenarios, as opposed to the symmetric worst-case straggling scenarios considered in our work. Furthermore, adapting our proposed schemes to scenarios with straggling helpers-to-master communication links, and to the presence of adversarial clients, are also practical directions to explore.

Chapter 7

Secure and Fault Tolerant Decentralized Learning

7.1 Introduction

Massive amounts of data are being generated each day by the growing ecosystem of billions of computing devices with sensors, connected through the network edge and powered by artificial intelligence (AI). This is shaping the future of both public-interest and curiosity-driven scientific discovery, as it has the potential to power a wide range of statistical machine learning (ML) based applications. While ML applications can achieve significant performance gains from large volumes of user data [44, 53], the training data is distributed across multiple data owners in many scenarios, and sharing of user data is limited due to privacy concerns and regulations [126, 175].
To enable decentralized machine learning from user data while preserving data privacy, federated learning (FL) has arisen as a promising approach [98, 99, 141, 126, 86, 175]. A generic FL algorithm consists of two main steps: the local SGD updates at each participating client using local data, and the global model update at the central server using the local updates from the participating clients. These steps are carried out in tandem, iteratively, until convergence. Sharing raw local updates, however, can leak significant private information about the data at the clients [69, 243]. To overcome the privacy leakage from local model updates, [20] proposed a cryptographic obfuscation algorithm. The principal idea is that each client masks its model update using random pairwise secret masks, such that the masks cancel appropriately at the server when it combines the masked local updates during aggregation. This protocol, however, is susceptible to clients dropping out during training. As a result, the FL server has to reconstruct the secret shares corresponding to the dropped clients and use them to cancel their corresponding masks, which are present in the masked updates of the remaining clients, before being able to carry out the final aggregation and model update. This overhead in each FL round, in addition to the overhead of generating the pairwise masks at the beginning of each round, can slow down FL training.

Another critical bottleneck in the aggregation of local updates at the server is that some clients can send faulty updates during training due to malfunctioning devices, which can degrade the training performance tremendously [75, 105]. Our focus is on scenarios where the clients are honest-but-faulty, i.e., the clients follow the training protocol honestly but can malfunction during training, rendering the computed local update faulty.
For example, the NIH may orchestrate FL for multi-institutional collaborative training on private patient data [187]. In such a scenario, it is practical to assume that only accredited hospitals participate and follow the FL training procedure honestly. Faults, however, may arise during training due to software and hardware errors, and it is essential to make the secure aggregation fault tolerant as well.

A number of works focused exclusively on addressing faulty behaviors have been proposed for the IID data setting [28, 18, 228, 193, 71, 217, 157, 150, 83, 152, 58]. When data is IID, the model updates received from the normal clients tend to be distributed around the true model update. Hence, faulty updates have been detected through distance-based schemes or other robust statistical estimation techniques. Data in typical FL settings, however, is non-IID across clients [240], and thus achieving fault resiliency becomes even more challenging, as the updates from even the normal clients are quite dissimilar. Furthermore, these methods are not compatible with the pairwise-masking-based secure aggregation of [20], as they require the FL server to access unmasked client updates.

Some recent works, such as [68, 114, 76, 156, 24], deal with heterogeneous data distributions in FL. In [76], for example, a resampling approach is proposed to improve the performance of existing fault-robust schemes for the IID setting (e.g., Median [228]) by reducing the inner/outer variations coming from heterogeneity in stochastic updates within/across clients. In the parallel work of [24], a trust bootstrapping method is proposed, in which the server collects a small clean training dataset, independently of client data, before training. During training, it computes a model update on this root dataset and uses it to assign similarity scores to client updates and carry out their weighted average.
While such schemes typically perform better than their IID counterparts when data is non-IID across clients, we demonstrate in our experiments in Section 7.4 that their performance is limited in comparison to OracleSGD, in which an omniscient FL server aggregates only the model updates from the clients that are not faulty. Additionally, just like the works mentioned in the previous paragraph, these papers do not provide privacy protection of the local updates.

The key challenge in making the secure aggregation algorithm of [20] resilient to faulty client updates is that the masked client updates are interdependent due to the pairwise additive masking, and only the final aggregated update is accessible in unmasked form. However, prior algorithms for fault mitigation require access to the actual client updates. To overcome the dependency of a client's update on the other participating clients for carrying out secure aggregation, and to speed up the secure aggregation process, trusted execution environments (TEEs) [10, 178, 34] within the FL server have recently been proposed and even deployed in the production lines of companies including Meta [147, 143, 59]. TEEs such as Intel SGX [34] provide secure isolated execution environments where the data confidentiality and computational integrity of the individual client's application are guaranteed by hardware and software encryption. Therefore, individual updates can be securely aggregated within the secure TEE enclave at the FL server without requiring pairwise additive masks as in [20], while simultaneously protecting the privacy of the client datasets and model updates [147] from an honest-but-curious FL server.

Our contributions: Motivated by practical TEE-assisted secure aggregation for FL, we propose DiverseFL, a novel TEE-based secure aggregation and fault tolerance algorithm for federated learning with heterogeneous data across clients.
DiverseFL utilizes a per-client criterion for filtering out faulty updates within a secure enclave. We summarize the key aspects of our work below.

Per-client fault mitigation: Rather than relying on the similarity of updates across clients, our work takes the view that the similarity between the expected and received updates from a client is a better marker for detecting faulty behaviors. Before training starts, the FL server asks each client to share with the TEE enclave a small, representative sample of its local data, which has the same proportion of labels as the client's data. During training, for each client, a guiding update over its TEE sample is computed within the secure enclave, and this is leveraged to estimate whether the corresponding client is faulty. Our main intuition is that the guiding update associated with a client is similar to the model update received from the client if it is normal, while an arbitrary update from a faulty client is quite different from its associated guiding update. Based on these ideas, we propose a novel approach for fault mitigation that works on a per-client basis. Specifically, a client is flagged as faulty if either of our proposed similarity conditions is violated: (i) the dot product between the guiding update and the client update must be greater than a pre-defined positive constant, and (ii) the ratio of the Euclidean norms of the two must lie within a pre-defined range. Following the fault mitigation, the updates from the non-flagged clients are aggregated within the TEE enclave, thus ensuring secure aggregation.

Remark 35. DiverseFL's per-client similarity criteria rely on clean representative samples provided by the clients for computing the guiding updates used for fault mitigation during training. This covers various use cases in FL. For example, the NIH may orchestrate FL for oncology research in which only accredited hospitals participate [187].
In this scenario, it is practical to assume that the experts at the hospitals would send clean data securely to the TEE enclave at the server. However, faults can arise during training due to software/hardware errors, and DiverseFL can be applied to provide superior convergence performance, performing significantly better than prior benchmarks focused on fault tolerance (with no secure aggregation).

Secure Enclave for Privacy Protection: As mentioned previously, DiverseFL enables the clients to privately share their samples with a Trusted Execution Environment (TEE) [34, 140, 10] based secure enclave on the FL server. The data as well as the applications inside the TEE are protected via software and hardware cryptographic mechanisms. Because of their hardware-guaranteed privacy/integrity features, TEEs have been widely used in recent years for privacy-preserving ML in centralized training with private data [209, 78, 96, 143, 52], for secure aggregation in FL (with no fault mitigation) [147], and more recently, for sharing raw data among clients for performance improvement in decentralized serverless training [47] (without any fault mitigation). DiverseFL is the first algorithm that uses a TEE enclave (based on Intel SGX [34] in particular) on the FL server to provide secure aggregation as well as fault mitigation in an FL setting with non-IID data across clients, where it is hard to know whether dissimilarity between client updates is due to faults or data heterogeneity. In particular, the secure enclave in DiverseFL enables the federated clients to verify that their data samples are not leaked to any external party, not even to the FL server administrator. Furthermore, the computation of the guiding updates, the application of DiverseFL's per-client criteria to filter out faulty updates, the aggregation of the non-faulty updates, and finally the model update are all carried out securely within the secure enclave.
As we demonstrate, DiverseFL beats prior (non-secure) baselines for fault mitigation in non-IID FL by significant margins.

Experimental Results: We evaluate neural network training performance with different benchmark datasets (see Section 7.4 for details). Our results exhibit the important aspects of DiverseFL, as described next. First, we demonstrate that even when each client shares just 1–3% of its local data with the secure TEE enclave, that sample size is sufficient for reliably computing guiding updates and for significantly improving convergence performance. Next, we experimentally analyze the scalability of DiverseFL. For this, we measure the performance of the Intel SGX based TEE for computing guiding updates, and compare the performance of the TEE against that of the edge devices to determine the number of clients each TEE can support without any slowdown. We show that a single TEE can support up to 316 clients. Such analysis is quite useful in deciding the number of TEEs that the server needs for setting up the secure enclave.

Convergence Analysis: We provide a convergence analysis of DiverseFL for non-IID data distributions across clients in the presence of an arbitrary number of faulty clients, under standard assumptions such as strong convexity of the local loss functions. For this, we first obtain the necessary probabilistic bounds pertaining to the error between the current model and the optimal global model with respect to each client that satisfies DiverseFL's per-client similarity criteria in a given iteration, and then use these intermediate results to prove convergence.
7.2 Problem Setup

In this section, we first describe stochastic gradient descent (SGD) in the context of federated learning in the absence of faulty clients, then present the fault model for the participating clients, and finally give a concise background on prior applications of Trusted Execution Environments (TEEs) for privacy-preserving machine learning in centralized as well as distributed settings.

7.2.1 Federated Learning with SGD

We consider a federated learning (FL) setup with $N$ client nodes (workers) that are connected to a central FL server (master). For $j \in [N] = \{1, \ldots, N\}$, let $D_j$ denote client $j$'s labelled dataset, with $|D_j| = n$ being the number of points in $D_j$. Furthermore, each data point in $D_j$ is drawn from an unknown local data distribution $\mathcal{D}_j$. Since data is typically non-IID across clients in FL [240], $\mathcal{D}_i$ can be different from $\mathcal{D}_j$ for $i, j \in [N]$ and $i \neq j$. The primary aim in FL is to solve the following global optimization problem:

$$\theta^* = \arg\min_{\theta \in \Theta} \frac{1}{N} \sum_{j=1}^{N} F_j(\theta), \quad (7.1)$$

where $F_j(\theta) = \mathbb{E}_{\zeta_j \sim \mathcal{D}_j}\big(l(\theta; \zeta_j)\big)$ denotes the local expected loss at client $j \in [N]$, and $\Theta \subseteq \mathbb{R}^d$ denotes the model parameter space. Here, $l(\theta; \zeta_j) \in \mathbb{R}$ denotes the predictive loss function (such as the cross-entropy loss) for model parameter $\theta \in \Theta$ on the data point $\zeta_j$.

The solution to (7.1) is obtained by an iterative training procedure that involves two key steps. For a general communication round $i \in \{1, \ldots, R\}$, the FL server selects a subset $S^{(i)}$ of the clients, where $|S^{(i)}| = C \le N$. Each client $j \in S^{(i)}$ receives the current model $\theta^{(i-1)}$ from the master, and locally updates it using stochastic gradient descent (SGD) with a learning rate of $\alpha^{(i)}$ for $E$ iterations. Specifically, for $\tau \in \{1, \ldots, E\}$, client $j$ carries out a stochastic gradient update $\theta_j^{(i,\tau)} = \theta_j^{(i,\tau-1)} - \alpha^{(i)} g_j^{(i,\tau)}$, where $\theta_j^{(i,0)} = \theta^{(i-1)}$ and $g_j^{(i,\tau)} = \nabla_\theta\, l(\theta_j^{(i,\tau-1)}; M_j^{(i,\tau)})$.
Here, $M_j^{(i,\tau)}$ is a mini-batch of size $m$ sampled uniformly at random from $D_j$, and $l(\theta_j^{(i,\tau-1)}; M_j^{(i,\tau)})$ denotes the empirical loss over $M_j^{(i,\tau)}$. In the second step, the master receives the model update $\Delta_j^{(i)} = \theta^{(i-1)} - \theta_j^{(i,E)}$ from each client $j \in S^{(i)}$ and carries out the global model update $\theta^{(i)} = \theta^{(i-1)} - \Delta^{(i)}$, where $\Delta^{(i)} = \frac{1}{|S^{(i)}|} \sum_{j \in S^{(i)}} \Delta_j^{(i)}$. This combination of local and global training steps is repeated for $R$ communication rounds, where $R$ is a hyperparameter. While the participating clients are honest, they can exhibit faulty behaviors during training, as described next.

7.2.2 Fault Model

During training, a client can malfunction or experience hardware/software errors, which are hard to detect directly at the FL server due to the decentralized and distributed implementation of federated learning. As a result, the server can receive corrupted updates from faulty clients, which, if not filtered out of the secure aggregation at the FL server, can significantly degrade the training performance. Formally, for round $i \in [R]$, let $F^{(i)} \subset S^{(i)}$ denote the set of faulty clients and $N^{(i)} = S^{(i)} \setminus F^{(i)}$ denote the set of normal clients. Then, $z_j^{(i)} = \Delta_j^{(i)}$ for $j \in N^{(i)}$, while for $j \in F^{(i)}$, $z_j^{(i)} = *$. Here, $*$ denotes that $z_j^{(i)}$ can be an arbitrary vector in $\mathbb{R}^d$.

Our proposal for fault mitigation, while securely aggregating the local updates from the participating clients, leverages the hardware-guaranteed privacy/integrity features of Trusted Execution Environments (TEEs), particularly Intel SGX. In the following, we provide a background on TEEs and their applications in privacy-preserving machine learning.
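The local update of Section 7.2.1 and the fault model above can be sketched together as follows (a minimal NumPy sketch with a toy quadratic loss; all names are ours, not from the dissertation's implementation):

```python
import numpy as np

def local_update(theta, data, lr, E, batch_size, grad_fn, rng):
    """E local SGD steps at a client starting from the global model;
    returns the model update Delta_j = theta^(i-1) - theta_j^(i,E)."""
    th = theta.copy()
    for _ in range(E):
        batch = data[rng.choice(len(data), size=batch_size, replace=False)]
        th -= lr * grad_fn(th, batch)
    return theta - th

# toy loss l(theta; x) = 0.5 * ||theta - x||^2, so grad = theta - mean(batch)
grad_fn = lambda th, batch: th - batch.mean(axis=0)
rng = np.random.default_rng(1)
clients = [rng.standard_normal((50, 3)) + c for c in range(3)]  # non-IID data

theta = np.zeros(3)
deltas = [local_update(theta, D, 0.1, 5, 10, grad_fn, rng) for D in clients]
faulty = [np.full(3, 100.0)]               # z_j = * for one faulty client
theta -= np.mean(deltas + faulty, axis=0)  # naive averaging is corrupted
```

The last line illustrates the fault model: a single arbitrary vector $z_j^{(i)}$ can dominate the unprotected average, which motivates the per-client filtering of DiverseFL.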
7.2.3 Trusted Execution Environments

Trusted Execution Environments (TEEs) such as ARM TrustZone [10], ARM REALM [178], and Intel SGX [34] provide secure isolated execution environments, where the data confidentiality and computational integrity of the client's application are guaranteed by cryptographic hardware and software encryption. This has enabled system-based privacy guarantees for artificial intelligence (AI) applications involving sensitive user data. For instance, the authors in [52] introduced a genotype imputation tool based on Intel SGX. Recently, TEEs have also been used for privacy-preserving aggregation of local updates from clients in federated learning (albeit without mitigation of faulty clients during training) in the production lines of different companies including Meta [59, 143, 147].

In our proposal, we use Intel SGX, which provides important features for privacy and integrity, including remote attestation, local attestation, and sealing. Using remote attestation, an encrypted communication channel can be established between the server and the relying party, providing private direct communication between the two. Local attestation enables secure communication between multiple enclaves on the same platform. Sealing provides secure data saving for transferring data to untrusted hardware while protecting data privacy.

While Intel SGX has been widely used for privacy-preserving AI applications, computations performed within SGX are restricted to execution on the CPU in current implementations. Moreover, TEEs generally provide a limited amount of secure memory (enclave) that is tamper-proof even from a root client. Our proposal, DiverseFL, leverages Intel SGX for detecting faulty clients and performing secure aggregation at each communication round.
As we describe in the next section, DiverseFL requires a low SGX computation load per client, thus allowing multiple participating clients to be supported by a single TEE without increasing the wall-clock time.

7.3 The Proposed DiverseFL Algorithm

In this section, we describe the various aspects of our proposal, DiverseFL, in detail. We also illustrate the effectiveness of the core per-client criteria of DiverseFL for fault mitigation. In Fig. 7.1, we provide an overview of DiverseFL.

7.3.1 Description of DiverseFL

DiverseFL has five fundamental steps, as described next.

Figure 7.1: Illustration of the system components and general steps in DiverseFL for communication round $i \in [R]$. Without loss of generality, we have assumed that all clients participate in communication round $i$. For brevity, we use AGG(·) in the final step to jointly denote the aggregation of the potential normal clients as well as the global model update step.

Step 1: Sharing of small samples with the TEE: The FL server asks each client $j \in [N]$ to draw a mini-batch $M_j^{(0)}$ of size $s$ from its local dataset and share it with a TEE-based secure enclave on the FL server, using a mutually agreed encryption key between the TEE and the client. To obtain a sample mini-batch that is representative of its local dataset, a client first finds the number of features corresponding to each label in the local dataset (as data is non-IID across clients, certain labels may not be available in a client's dataset). Thereafter, the client computes the number of data points that need to be sampled from each local label set so that the final mini-batch of size $s$ has the same proportion of labels as the local dataset. Finally, from each label set, the client samples the proportionate number of data points uniformly at random, and the final mini-batch is shared securely with the TEE.

Remark 36.
The TEE enclave code can be verified by each client to guarantee that none of the samples it shares with the TEE leaves the TEE, and that the TEE only uses them for computing guiding updates (described in Step 3 below) during training. Furthermore, this sample sharing step occurs only once, in the offline phase before training begins. We further emphasize that the FL administrator can access neither the data samples nor the guiding updates computed within the TEE, thus protecting data privacy.

In each communication round $i \in [R]$, the following four training steps are carried out.

Step 2: Training on the clients: Training in DiverseFL proceeds as follows. During the $i$-th round, the server selects a subset $S^{(i)}$ of the clients and sends the current model $\theta^{(i-1)}$ to each client $j \in S^{(i)}$. A normal client $j \in N^{(i)} \subseteq S^{(i)}$ performs $E$ local SGD updates as described in Section 7.2.1 to obtain the locally updated model $\theta_j^{(i,E)}$, and computes the model update $\Delta_j^{(i)} = (\theta^{(i-1)} - \theta_j^{(i,E)})$. It then uploads $z_j^{(i)} = \Delta_j^{(i)}$ to the master. Local updates from the faulty clients $F^{(i)} = S^{(i)} \setminus N^{(i)}$ can be arbitrary vectors.

Step 3: Guiding model update computations on the TEE: For each client $j \in S^{(i)}$, the TEE on the FL server updates $\theta^{(i-1)}$ via gradient descent for $E$ iterations using the data sample $M_j^{(0)}$. More formally, for $\tau \in \{1, \ldots, E\}$, the TEE obtains $\tilde{\theta}_j^{(i,\tau)} = \tilde{\theta}_j^{(i,\tau-1)} - \alpha^{(i)} \tilde{g}_j^{(i,\tau)}$, where $\tilde{g}_j^{(i,\tau)} = \nabla_\theta\, l(\tilde{\theta}_j^{(i,\tau-1)}; M_j^{(0)})$ and $\tilde{\theta}_j^{(i,0)} = \theta^{(i-1)}$. The guiding model update is then computed as $\tilde{\Delta}_j^{(i)} = (\theta^{(i-1)} - \tilde{\theta}_j^{(i,E)})$.

Remark 37. The guiding model updates in the TEE are computed concurrently with the client SGD computations. In our experimental setup, we demonstrate that each Intel SGX based enclave can perform guiding update computations for several tens of clients within the time needed by a client to provide its local update to the TEE (see Fig. 7.4 in Section 7.4.2 for details).
Hence, Step 3 occurs concurrently with Step 2 without any loss in training speed.

Step 4: Fault identification by the FL server: The secure TEE-based enclave within the FL server receives the model updates from the clients and estimates whether the update from client $j \in [N]$ is faulty based on the extent of similarity between $z_j^{(i)}$ and $\tilde{\Delta}_j^{(i)}$, using $\tilde{\Delta}_j^{(i)}$ as a surrogate for $\Delta_j^{(i)}$. It considers the following two similarity metrics:

$$\text{Direction Similarity}: \; C_1 = \mathrm{sign}\big(\tilde{\Delta}_j^{(i)} \cdot z_j^{(i)}\big), \quad (7.2)$$
$$\text{Length Similarity}: \; C_2 = \frac{\|z_j^{(i)}\|_2}{\|\tilde{\Delta}_j^{(i)}\|_2}. \quad (7.3)$$

The main idea is that $\tilde{\Delta}_j^{(i)}$ approximates the model update $\Delta_j^{(i)}$ of a normal client. Therefore, the similarity between $\tilde{\Delta}_j^{(i)}$ and $z_j^{(i)}$ when $j \in N^{(i)}$ (and likewise the dissimilarity between $\tilde{\Delta}_j^{(i)}$ and $z_j^{(i)}$ when $j \in F^{(i)}$) can be leveraged for fault mitigation. Thus, when $C_1 > 0$, it suggests that $z_j^{(i)}$ points in approximately the same direction as $\Delta_j^{(i)}$ for client $j$. Similarly, when $C_2 \sim 1$, it suggests that the norms $\|z_j^{(i)}\|_2$ and $\|\Delta_j^{(i)}\|_2$ are approximately equal, while very large or very small values of $C_2$ suggest a large deviation between $z_j^{(i)}$ and $\Delta_j^{(i)}$. These arguments motivate the following two key conditions for fault mitigation, which are verified within the TEE for the local update from each participating client $j \in S^{(i)}$:

$$\text{Condition 1}: \; C_1 > \epsilon_1, \quad (7.4)$$
$$\text{Condition 2}: \; \epsilon_2 < C_2 < \epsilon_3, \quad (7.5)$$

where $\epsilon_1$, $\epsilon_2$ and $\epsilon_3$ are hyperparameters of DiverseFL. Any client $j \in S^{(i)}$ is flagged as a faulty node if either of the above two conditions is not satisfied.

Step 5: Secure aggregation and global update: Let $\tilde{F}^{(i)}$ denote the set of faulty nodes as estimated in Step 4, and likewise let $\tilde{N}^{(i)} = S^{(i)} \setminus \tilde{F}^{(i)}$ be the estimated set of normal clients. The following global model update is executed within the TEE:

$$\theta^{(i)} = \theta^{(i-1)} - \frac{1}{|\tilde{N}^{(i)}|} \sum_{j \in \tilde{N}^{(i)}} z_j^{(i)}.$$
(7.6)

As the aggregation step is also carried out securely within the TEE at the FL server, DiverseFL ensures that there is no privacy leakage from the local model updates shared by the clients. Next, we illustrate the effectiveness of the core per-client criteria of DiverseFL for fault mitigation.

7.3.2 Effectiveness of per-client fault mitigation

To illustrate the effectiveness of using the two similarity metrics in the detection of a faulty node, we consider the product $C_1 \times C_2$ of the metrics in equations (7.2) and (7.3), and plot its variation for different clients across iterations. For this, we consider the setting of 23 clients and neural network training with the MNIST dataset, in the presence of the label flip fault, as described in Section 7.4.1. Data is distributed non-IID, a data sharing of 1% is used, and 5 of the 23 clients are assumed to be faulty during training. For examining how the two similarity metrics behave in each round, we consider the oracle algorithm, named OracleSGD, in which only the normal client updates are aggregated for updating the model in each round, i.e., only the updates of the 18 normal clients are used for secure aggregation and the global model update. Further setup details are deferred to Section 7.4.1.

In Fig. 7.2, we plot the product $C_1 \times C_2$ over 1000 training rounds for 10 different normal clients and 3 different faulty clients, where a green marker denotes that a client is normal, while a red marker denotes that a client is faulty. As can be seen from Fig. 7.2, $C_1 > 0$ exclusively and $C_2$ is concentrated around 1 in all rounds for each normal client.

Figure 7.2: The values of $C_1 \times C_2$ in the 1000 training rounds are plotted in red for faulty clients, and in green for normal clients. For normal clients, $C_1 > 0$ exclusively, and $C_2$ is concentrated around 1. For faulty clients, $C_1 < 0$ in almost all iterations, and $C_2$ varies significantly.

For clients exhibiting faulty behaviour, $C_1 < 0$ almost exclusively, and there is a large variation of $C_2$ during training. For example, out of the 1000 training rounds, $C_1 > 0$ for only 1 round for both client 11 and client 12, and for 3 rounds for client 13. Similar results were observed for the other clients. Therefore, Condition 1 presented in (7.4) is critical for mitigating faults. Condition 2 presented in (7.5) is complementary to Condition 1, as it helps to keep the deviations from the true updates low, thus helping to filter faulty updates in scenarios in which $\Delta_j^{(i)}$ simply gets scaled by a large absolute value due to an underlying software or hardware error. As we demonstrate in the experiments in Section 7.4, the proposed similarity criteria result in the superior performance of DiverseFL.

In Appendix U, we provide a convergence analysis for DiverseFL to show how the per-client similarity criteria can lead to convergence when data is non-IID. In the following section, we present results from our experiments.

7.4 Experiments

Experimental Setup: We implement DiverseFL using an SGX-enabled FL server built with an Intel(R) Coffee Lake E-2174G 3.80GHz processor. The server has 64 GB RAM and supports Intel Software Guard Extensions (SGX). For the SGX implementation, we used the Intel Deep Neural Network Library (DNNL) for designing the DNN layers, including the convolution, ReLU and max-pooling layers, and the Eigen library for the dense layer. Clients are based on the popular edge device Raspberry Pi 3, consisting of quad-core Armv7 CPUs based on BCM2835 hardware, running Debian 10.9 with Linux kernel 5.10.17. We built PyTorch for the ARMv7 ISA running Linux and installed Torch on the Raspberry Pi using this build. The link bandwidth between the FL server and each client is 100 Mbps. Unless stated otherwise, we consider N = 23 clients, and all clients participate in each round.
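Before turning to the results, the per-client test of Section 7.3.1 that these experiments evaluate can be sketched in a few lines of Python. This is an illustrative sketch only: plain lists stand in for model-update vectors, and the function names are ours, not part of DiverseFL's implementation; the default thresholds mirror the $(\epsilon_1, \epsilon_2, \epsilon_3) = (0, 0.5, 2)$ used below.

```python
import math

def is_faulty(z, guide, eps1=0.0, eps2=0.5, eps3=2.0):
    """Flag a client update z by comparing it with its guiding update.

    C1 (direction): sign of the inner product <guide, z>  -- Condition 1: C1 > eps1.
    C2 (length):    ||z||_2 / ||guide||_2                 -- Condition 2: eps2 < C2 < eps3.
    """
    dot = sum(g * x for g, x in zip(guide, z))
    c1 = math.copysign(1.0, dot) if dot != 0 else 0.0
    c2 = math.sqrt(sum(x * x for x in z)) / math.sqrt(sum(g * g for g in guide))
    return not (c1 > eps1 and eps2 < c2 < eps3)

def aggregate(updates, guides):
    """Average only the updates that pass both similarity conditions (cf. (7.6))."""
    kept = [z for z, g in zip(updates, guides) if not is_faulty(z, g)]
    dim = len(updates[0])
    return [sum(z[k] for z in kept) / len(kept) for k in range(dim)]

# A normal update points roughly along its guide; a sign-flipped one does not.
guide = [0.5, -0.2, 0.1]
normal = [0.45, -0.25, 0.12]     # similar direction and length -> kept
flipped = [-0.45, 0.25, -0.12]   # C1 < 0 -> flagged
print(is_faulty(normal, guide), is_faulty(flipped, guide))  # False True
```

Note that the decision is made per client against that client's own guiding update, so it does not depend on how similar the clients' data distributions are to each other.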
The secure TEE enclave in DiverseFL enables secure aggregation by default. Therefore, in our first set of experiments in Section 7.4.1, we demonstrate the superiority of DiverseFL in fault mitigation by comparing it with SOTA algorithms that consider only fault mitigation and not secure aggregation. Thereafter, in Section 7.4.2, we present a detailed wall-clock timing analysis for implementing DiverseFL in our TEE-assisted federated learning setup and demonstrate the scalability of our proposal in practice. Finally, we provide ablation studies for DiverseFL in Section 7.4.3.

7.4.1 Performance of DiverseFL

Schemes: In the following, we first summarize the benchmark schemes that we simulate for comparison with DiverseFL. OracleSGD: The identities of the faulty clients are known at the FL server, and the global model is updated using the aggregate of weight updates from only the normal clients. Median [228]: An element-wise median of all the received updates from the clients is computed and used to update the global model. Bulyan [71]: The FL server applies Krum [18] recursively to remove a set of 2f possibly faulty clients, and then applies an element-wise trimmed mean operation to filter out 2f entries along each dimension.

Figure 7.3: Top-1 accuracy for neural network training with MNIST (first row of plots), CIFAR10 (second row of plots) and CIFAR100 (third row of plots). DiverseFL with both 1% and 3% sample sharing achieves close to OracleSGD performance under all scenarios. Even in the relatively simple training setting with MNIST and a small neural network, prior benchmarks degrade in performance in one or more scenarios. Furthermore, the three sets of results demonstrate that increasing the complexity of the dataset and training model increases the performance gap of prior benchmarks in the non-IID setting.

Resampling [76]: The resampling approach of [76] is leveraged to first create a set of N modified updates.
For each of them, the FL server samples $S_R$ client updates uniformly at random and averages them. It then computes the final update by applying Median on the N modified updates. FLTrust [24]: The FL server computes a model update $\tilde{\Delta}_M$ on its root dataset $D_M$, a small clean training dataset collected independently of clients' data before training. Client updates are projected onto the root update, and a weighted aggregation of them is then carried out. For the best-case scenario of FLTrust, we construct the root dataset by randomly selecting a subset of the training dataset.

Fault types: For a faulty client $j \in [N]$ in round $i \in R$, we consider four popular faults. Gaussian: The message to be uploaded is set to $z_j^{(i)}$, whose elements follow a Gaussian distribution with mean 0 and standard deviation $\sigma_G$. Sign Flip: The sign of each entry of the local model update is flipped before uploading to the server. Same Value: The message to be uploaded is set to $z_j^{(i)} = \sigma_S \mathbf{1}$, where $\mathbf{1} \in \mathbb{R}^d$ is the all-one vector. Label Flip: The class label $c$ of each data point in the local training mini-batches is flipped to $c_n - c$, where $c_n$ is the number of classes, during data reading at training time.

Datasets, Models and Hyperparameters: We consider three benchmark datasets: MNIST, CIFAR10, and CIFAR100 [106, 103]. While MNIST and CIFAR10 have 10 classes each, CIFAR100 has 100 classes. To simulate non-IID data, for each dataset, the training data is sorted by class and then partitioned into N subsets, and each subset is assigned to a different client. For MNIST, we use a neural network with three fully connected layers (referred to as 3-NN), in which each of the two intermediate layers has 200 neurons. For CIFAR10/CIFAR100, we use the VGG network [189], specifically VGG-11. We replace the batch normalization layers with group normalization layers, as batch norm layers have been found to perform suboptimally when data is non-IID.
For group norm, we set the group parameter such that each group has 16 channels throughout the network. The Glorot uniform initializer is used for the weights in all convolutional layers, while default initialization is used for the fully connected layers. For each scenario, the local training batch size is 10% of the local dataset, the regularization parameter is $\lambda = 0.0005$, $E = 1$, and for DiverseFL, we use $(\epsilon_1, \epsilon_2, \epsilon_3) = (0, 0.5, 2)$ and two sampling-size scenarios of 1% and 3% of the local dataset. For Resampling, $S_R = 2$, and for FLTrust, a 1% random subset of the training data is used as the root dataset. For the Gaussian and same value faults, for each dataset we set $\sigma_G = \sigma_S = 10$. For MNIST, we set R to 1000 and use an initial learning rate of 0.06 with a step decay of 0.5 at rounds 500 and 950. For CIFAR10 and CIFAR100, warmup is used for the first 1000 rounds, increasing the learning rate linearly from 0.05 to 0.1. Also, the number of rounds is R = 2400, and the learning rate is stepped down by a factor of 0.4 at iteration 2000.

Results: Fig. 7.3 illustrates the results for MNIST (with 3-NN), CIFAR10 (with VGG-11) and CIFAR100 (with VGG-11), which have 10, 10, and 100 classes, respectively. For MNIST, DiverseFL (with both 1% and 3% sampling rates) outperforms prior approaches by significant margins in all fault cases, and almost matches the performance of OracleSGD. For CIFAR10, DiverseFL with both 1% and 3% sampling sizes performs quite close to OracleSGD, with the 3% case performing slightly better. Note that for CIFAR10 and CIFAR100, we consider mainly the prior baselines for non-IID data, as both Median and Bulyan, which are primarily fault-tolerant approaches for IID data, perform quite poorly across all faults. For the more complex CIFAR100 with a large number of classes, the sampling rate of 3% provides better performance than 1%. Nevertheless, even with 1% sampling, DiverseFL outperforms all prior schemes by significant margins.
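The sorted-by-class non-IID partitioning used in these experiments (sort the training data by class, then split it into N contiguous subsets, one per client) can be sketched as follows. This is a minimal illustration with a toy label list; the helper name is ours.

```python
def non_iid_partition(labels, num_clients):
    """Sort sample indices by class label and split them into contiguous
    subsets, one per client, so each client sees only a few classes."""
    order = sorted(range(len(labels)), key=lambda k: labels[k])
    shard = len(order) // num_clients
    return [order[c * shard:(c + 1) * shard] for c in range(num_clients)]

# 8 samples from 4 classes split across 2 clients: each client gets 2 classes.
labels = [3, 0, 1, 2, 0, 3, 1, 2]
parts = non_iid_partition(labels, 2)
print([sorted(labels[k] for k in p) for p in parts])  # [[0, 0, 1, 1], [2, 2, 3, 3]]
```

Under this split, each client's local label distribution is far from the global one, which is exactly the regime where update-similarity defenses that compare clients against each other struggle.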
It is interesting to note that for both Resampling and FLTrust, the gap in convergence performance from OracleSGD is much larger for CIFAR10/CIFAR100 than for MNIST, due to the greater dataset complexity and larger neural networks, which introduce much larger variations both within and across client updates. Hence, DiverseFL is ideally suited for the complex models that are becoming more common in the FL setting. We also note that as FLTrust utilizes the root update to normalize and aggregate the client updates, it achieves stable convergence and improved accuracy in comparison to prior schemes. However, as each client's data distribution is quite different from the root data, the projection of client updates onto the root update leads to a loss of information, due to which the performance of FLTrust is much lower than that of DiverseFL.

7.4.2 Scalability of DiverseFL

In the previous subsection, we demonstrated that DiverseFL can consistently outperform prior schemes in accuracy in the presence of faults. DiverseFL, however, does place an additional computational burden on the TEE: in addition to leveraging the TEE for secure aggregation, DiverseFL also requires the computation of the guiding updates for fault mitigation. In this section, we demonstrate that each TEE can accommodate the guiding update computations of many dozens of clients without causing any additional latency delays. To quantify this aspect, we used the Raspberry Pi 3 as an edge device, as in prior works [241, 213].

Figure 7.4: Execution time on client (computation + communication time) relative to TEE's guiding update computation. 1: MNIST/3-NN, 2: CIFAR10/VGG-11, and 3: CIFAR100/VGG-11. 1% sampling used in (a) and 3% sampling in (b). A single TEE supports many clients without stalling FL execution.

Fig. 7.4 illustrates the details of the timing analysis for the different networks and datasets considered in the previous subsection. Fig.
7.4-(a) shows, for different models and datasets, the execution time of client-device training relative to the SGX execution time for computing the guiding update of a single client with a sampling rate of 1%. Here, the edge-device training time is split into two components: the update computation time and the time to communicate the update to the FL server. Consider, for example, the CIFAR10/VGG-11 scenario. Fig. 7.4-(a) shows that the TEE computation is 150 times faster than an edge device's combined computation and communication time. Thus, the TEE can support 150 clients. Similarly, the TEE can support about 119 clients with CIFAR100. We also observe that the TEE's relative performance is lower with VGG-11 than with 3-NN. This is because of the stringent memory limitation of the TEE (128 MB in our current implementation): when the model does not fit entirely within the TEE at once, it causes some memory overheads inside the TEE. Fig. 7.4-(b) illustrates that increasing the sampling rate to 3% decreases the maximum number of clients a TEE can support, since the number of data samples that the TEE needs to process increases. Nevertheless, a single TEE can still support between 38 and 105 clients, depending on the model and data size.

To scale out the system to support even more clients, one can utilize more instances of the TEEs. Since guiding update computations are done on a per-client basis, this scaling to multiple TEEs is quite efficient, without any substantial synchronization overheads. Furthermore, Intel has recently announced SGX enhancements supporting larger secure memory, which can potentially alleviate the large-model execution overheads in the TEE.

7.4.3 Ablation Studies

We present ablation studies for DiverseFL with respect to different hyperparameters. While we only provide results with the CIFAR10 dataset and the Gaussian fault for presentation purposes, we note that we observed similar trends for other datasets and faults.
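The four fault types of Section 7.4.1, which the ablations below simulate, can be sketched as follows. This is an illustrative sketch only: `random.gauss` stands in for the Gaussian fault, the `sigma` default mirrors the $\sigma_G = \sigma_S = 10$ used in the experiments, and the label flip corrupts training labels (rather than the uploaded update) following the $c \rightarrow c_n - c$ convention stated in Section 7.4.1.

```python
import random

def apply_fault(update, fault, sigma=10.0):
    """Corrupt a local model update according to one of the fault types
    of Section 7.4.1 (the label flip fault corrupts data, not the update,
    and is sketched separately below)."""
    if fault == "gaussian":    # replace entries with N(0, sigma^2) noise
        return [random.gauss(0.0, sigma) for _ in update]
    if fault == "sign_flip":   # flip the sign of every entry
        return [-x for x in update]
    if fault == "same_value":  # constant all-sigma vector
        return [sigma] * len(update)
    return update

def flip_label(c, num_classes):
    """Label flip fault: class c becomes num_classes - c, per Section 7.4.1."""
    return num_classes - c

print(apply_fault([1.0, -2.0], "sign_flip"))  # [-1.0, 2.0]
```

Note that the Gaussian and same-value faults break the length condition (7.5), while the sign flip breaks the direction condition (7.4), which is why both conditions are checked.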
7.4.3.1 Number of Faulty Clients

The per-client fault mitigation approach makes DiverseFL applicable to an arbitrary number of faulty clients. To demonstrate this in practice, we consider the setting in Section 7.4.1 with N = 23 clients, and report the final test accuracy for both OracleSGD and DiverseFL (with 3% sample sharing), for f = 5 as well as for f = 17 (which is equivalent to ~75% faulty nodes in the system). As demonstrated by the results in Table 7.1, DiverseFL almost matches the performance of OracleSGD even when more than a majority of nodes are faulty.

Table 7.1: Final test accuracies for CIFAR10 under the Gaussian fault for different numbers of faulty clients. Similar results were observed for other faults and datasets.

              Test Accuracy (%): f = 5      Test Accuracy (%): f = 17
              OracleSGD    DiverseFL        OracleSGD    DiverseFL
Gaussian      81.0         80.8             28.5         28.5

7.4.3.2 Performance for Multiple Local Iterations

In FL, it is common practice to perform multiple local iterations in each communication round, i.e., to have E > 1 to reduce the number of communication rounds. Hence, we evaluate the performance of DiverseFL when multiple local SGD training steps are implemented in each communication round. For this, we consider the CIFAR10 dataset and the VGG-11 model as described in Section 7.4.1. To simulate a data heterogeneity similar to that described in [142], we consider a setting with N = 25 clients and choose 6 of them as faulty clients. The CIFAR10 training dataset is first sorted by class, then partitioned into 50 shards, and each client is assigned 2 shards randomly without replacement. We consider E ∈ {1, 2, 3, 4}, a sampling size of 3% for DiverseFL, and carry out training for a total of R = 1500 rounds. All other hyperparameters are as described in Section 7.4.1. For comparison, we consider the OracleSGD scheme with E = 4. The results are illustrated in Fig. 7.5. As demonstrated by Fig.
7.5, when E is increased, DiverseFL provides a better convergence rate per communication round, further showing that DiverseFL is well suited for federated learning. Additionally, DiverseFL maintains its superior fault resiliency, as demonstrated by its close-to-OracleSGD performance when E = 4 under the Gaussian fault.

Figure 7.5: Performance evaluation of DiverseFL with multiple local iterations against OracleSGD.

7.5 Conclusion

Motivated by the problem of making secure aggregation in federated learning (FL), with non-IID data across clients, resilient to faults that arise at the clients during training, we propose a novel Trusted Execution Environment (TEE) based solution named DiverseFL. As a key contribution, we develop a novel per-client approach for fault mitigation. It leverages, during training, the similarity between the model update received from a client and its associated guiding update, which is computed inside the TEE (at the FL server) on a small representative sample of the client's local data that the client shares securely with the TEE only once before training starts. Any participating client whose local model update diverges from its associated guiding update is tagged as faulty. The TEE-based enclave enables client model privacy and protection against potential privacy leakages from sample sharing and guiding update computations, thus enabling fault resiliency for secure aggregation. We demonstrate through extensive experimental results that DiverseFL significantly improves the model accuracy and fault resiliency of secure FL with non-IID data as compared to prior benchmarks, which mainly rely on the similarity of updates across clients.

References

[1] https://github.com/AvestimehrResearchGroup/Coded-PageRank.
[2] Bringing HPC Techniques to Deep Learning. https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/. Accessed: 2019-01-01.
[3] 3GPP. LTE; evolved universal terrestrial radio access (e-utra); physical channels and modulation.
3GPP TS 36.211, 14.2.0(14), 2014.
[4] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
[5] Yuan Ai, Mugen Peng, and Kecheng Zhang. Edge computing technologies for Internet of Things: a primer. Digital Communications and Networks, 4(2):77–86, 2018.
[6] Mustafa Riza Akdeniz, Arjun Anand, Nageen Himayat, A. Salman Avestimehr, Ravikumar Balakrishnan, Prashant Bhardwaj, Jeongsik Choi, Yang-Seok Choi, Sagar Dhakal, Brandon Gary Edwards, Saurav Prakash, Amit Solomon, Shilpa Talwar, and Yair Eliyahu Yona. Systems and methods for distributed learning for wireless edge dynamics, 2021. WO Patent App. PCT/US2020/067068.
[7] Mehmet Fatih Aktas, Pei Peng, and Emina Soljanin. Effective straggler mitigation: which clones should attack and when? ACM SIGMETRICS Performance Evaluation Review, 45(2):12–14, 2017.
[8] Malihe Aliasgari, Jorg Kliewer, and Osvaldo Simeone. Coded computation against straggling decoders for network function virtualization. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 711–715. IEEE, 2018.
[9] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding. In Advances in Neural Information Processing Systems 30, pages 1709–1720. Curran Associates, Inc., 2017.
[10] Tiago Alves. Trustzone: Integrated hardware and software security. White paper, 2004.
[11] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Effective straggler mitigation: Attack of the clones. In NSDI, volume 13, pages 185–198, 2013.
[12] Android Developers. Attribution reporting. https://developer.android.com/design-for-safety/ads/attribution/, 2022. [Online; accessed 23-April-2022].
[13] Mohamed A Attia and Ravi Tandon. Near optimal coded data shuffling for distributed learning. arXiv preprint arXiv:1801.01875, 2018.
[14] Mohamed Adel Attia and Ravi Tandon. Information theoretic limits of data shuffling for distributed learning. GLOBECOM, 2016.
[15] Ravikumar Balakrishnan, Mustafa Akdeniz, Sagar Dhakal, and Nageen Himayat. Resource management and fairness for federated learning over wireless edge networks. IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2020.
[16] Karim Bayoumy, Mohammed Gaber, Abdallah Elshafeey, Omar Mhaimeed, Elizabeth H Dineen, Francoise A Marvel, Seth S Martin, Evan D Muse, Mintu P Turakhia, Khaldoun G Tarakji, et al. Smart wearable devices in cardiovascular care: where we are and how to move forward. Nature Reviews Cardiology, pages 1–19, 2021.
[17] Christian R Berger, Shengli Zhou, Yonggang Wen, Peter Willett, and Krishna Pattipati. Optimizing joint erasure-and error-correction coding for wireless packet transmissions. IEEE Transactions on Wireless Communications, 7(11):4586–4595, 2008.
[18] Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pages 119–129, 2017.
[19] K. A. Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for federated learning on user-held data. In NIPS Workshop on Private Multi-Party Machine Learning, 2016.
[20] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1175–1191, 2017.
[21] Flavio Bonomi, Rodolfo Milito, Jiang Zhu, and Sateesh Addepalli.
Fog computing and its role in the internet of things. In Proceedings of the first edition of the MCC workshop on Mobile cloud computing, pages 13–16. ACM, 2012.
[22] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
[23] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Neural Information Processing Systems (NeurIPS), 2020.
[24] Xiaoyu Cao, Minghong Fang, Jia Liu, and Neil Zhenqiang Gong. FLTrust: Byzantine-robust federated learning via trust bootstrapping. Network and Distributed Systems Security (NDSS) Symposium, 2021.
[25] Zachary Charles, Dimitris Papailiopoulos, and Jordan Ellenberg. Approximate gradient coding via sparse random graphs. arXiv preprint arXiv:1711.06771, 2017.
[26] Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.
[27] Rong Chen, Xin Ding, Peng Wang, Haibo Chen, Binyu Zang, and Haibing Guan. Computation and communication efficient graph processing with distributed immutable view. HPDC, 2014.
[28] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):1–25, 2017.
[29] Trishul M Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In OSDI, volume 14, pages 571–582, 2014.
[30] Fan Chung and Linyuan Lu. The average distance in a random graph with given expected degrees. Internet Mathematics, 1(1):91–113, 2004.
[31] Jichan Chung, Kangwook Lee, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. Ubershuffle: Communication-efficient data shuffling for sgd via coding theory.
NIPS Workshop on ML Systems, 2017.
[32] Guojing Cong, Onkar Bhardwaj, and Minwei Feng. An efficient, distributed stochastic gradient descent algorithm for deep-learning applications. In Parallel Processing (ICPP), 2017 46th International Conference on, pages 11–20. IEEE, 2017.
[33] Robert M Corless, Gaston H Gonnet, David EG Hare, David J Jeffrey, and Donald E Knuth. On the LambertW function. Advances in Computational mathematics, 5(1):329–359, 1996.
[34] Victor Costan and Srinivas Devadas. Intel sgx explained. IACR Cryptology ePrint Archive, 2016(086):1–118, 2016.
[35] Thomas A Courtade and Richard D Wesel. Optimal allocation of redundancy between packet-level erasure coding and physical-layer channel coding in fading channels. IEEE Transactions on Communications, 59(8):2101–2109, 2011.
[36] Paul Cuff and Lanqing Yu. Differential privacy as a mutual information constraint. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 43–54, 2016.
[37] Elias Dahlhaus, David S Johnson, Christos H Papadimitriou, Paul D Seymour, and Mihalis Yannakakis. The complexity of multiway cuts. STOC, 1992.
[38] Lisandro Dalcin, Rodrigo Paz, and Mario Storti. MPI for Python. Journal of Parallel and Distributed Computing, 65(9):1108–1115, 2005.
[39] Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the ACM, 56(2):74–80, 2013.
[40] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large Scale Distributed Deep Networks. In Advances in neural information processing systems, pages 1223–1231, 2012.
[41] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[42] Ewa Deelman, Gurmeet Singh, Miron Livny, Bruce Berriman, and John Good. The cost of doing science on the cloud: the montage example.
In High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for, pages 1–12. IEEE, 2008.
[43] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal Distributed Online Prediction Using Mini-Batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
[44] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, 2019.
[45] Sagar Dhakal*, Saurav Prakash*, Yair Yona, Shilpa Talwar, and Nageen Himayat. Coded computing for distributed machine learning in wireless edge network. In 2019 IEEE 90th Vehicular Technology Conference (VTC2019-Fall), pages 1–6. IEEE, 2019.
[46] Sagar Dhakal, Saurav Prakash, Yair Yona, Shilpa Talwar, and Nageen Himayat. Coded federated learning. In 2019 IEEE Globecom Workshops (GC Wkshps), pages 1–6. IEEE, 2019.
[47] Akash Dhasade, Nevena Dresevic, Anne-Marie Kermarrec, and Rafael Pires. Tee-based decentralized recommender systems: The raw data sharing redemption. In IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2022.
[48] Sadia Din and Anand Paul. Smart health monitoring and management system: Toward autonomous wearable sensing for internet of things using big data analytics. Future Generation Computer Systems, 91:611–619, 2019.
[49] Sadia Din and Anand Paul. Smart health monitoring and management system: Toward autonomous wearable sensing for internet of things using big data analytics. Future Generation Computer Systems, 91:611–619, 2019.
[50] Canh T Dinh, Nguyen H Tran, Minh NH Nguyen, Choong Seon Hong, Wei Bao, Albert Y Zomaya, and Vincent Gramoli. Federated learning over wireless networks: Convergence analysis and resource allocation. IEEE/ACM Transactions on Networking, 29(1):398–409, 2020.
[51] Canh T Dinh, Nguyen H Tran, Minh NH Nguyen, Choong Seon Hong, Wei Bao, Albert Y Zomaya, and Vincent Gramoli. Federated learning over wireless networks: Convergence analysis and resource allocation. IEEE/ACM Transactions on Networking, 29(1):398–409, 2020.
[52] Natnatee Dokmai, Can Kockan, Kaiyuan Zhu, XiaoFeng Wang, S Cenk Sahinalp, and Hyunghoon Cho. Privacy-preserving genotype imputation in a trusted execution environment. Cell Systems, 10(12), 2021.
[53] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16×16 words: Transformers for image recognition at scale. International Conference on Learning Representations (ICLR), 2021.
[54] S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed sgd. arXiv preprint arXiv:1803.01113, 2018.
[55] Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. Short-dot: Computing large linear transforms distributedly using coded short dot products. In Advances In Neural Information Processing Systems, pages 2100–2108, 2016.
[56] Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. Coded convolution for parallel and distributed computing within a deadline. In Information Theory (ISIT), 2017 IEEE International Symposium on, pages 2403–2407. IEEE, 2017.
[57] Rudolph, P. E. Sen, P. K.; Singer, J. M.: Large Sample Methods in Statistics. An Introduction with Applications. Chapman & Hall, New York–London 1993, XII, 382 pp., £35.00, ISBN 0-412-04221-5. Biometrical Journal, 36(5):602–602.
[58] El-Mahdi El-Mhamdi, Rachid Guerraoui, and Sébastien Rouault. Distributed momentum for byzantine-resilient learning. arXiv preprint arXiv:2003.00010, 2020.
[59] Hericles Emanuel. Pysyft, pytorch and Intel SGX: Secure aggregation on trusted execution environments.
https://blog.openmined.org/pysyft-pytorch-intel-sgx/, 2020. [Online; accessed 15-April-2020].
[60] Yahya H Ezzeldin, Mohammed Karmoose, and Christina Fragouli. Communication vs distributed computation: an alternative trade-off curve. arXiv preprint arXiv:1705.08966, 2017.
[61] Mohammad Fahim, Haewon Jeong, Farzin Haddadpour, Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. On the optimal recovery threshold of coded matrix multiplication. In Communication, Control, and Computing (Allerton), 2017 55th Annual Allerton Conference on, pages 1264–1270. IEEE, 2017.
[62] Elizabeth L Feld. United States Data Privacy Law: The Domino Effect After the GDPR. In N.C. Banking Inst., volume 24, page 481. HeinOnline, 2020.
[63] Di Feng, Lars Rosenbaum, and Klaus Dietmayer. Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3d vehicle detection. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 3266–3273. IEEE, 2018.
[64] Nuwan Ferdinand and Stark C Draper. Hierarchical coded computation. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1620–1624. IEEE, 2018.
[65] Nuwan Ferdinand, Benjamin Gharachorloo, and Stark C Draper. Anytime exploitation of stragglers in synchronous stochastic gradient descent. In Machine Learning and Applications (ICMLA), 2017 16th IEEE International Conference on, pages 141–146. IEEE, 2017.
[66] Nuwan S Ferdinand and Stark C Draper. Anytime coding for distributed computation. In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on, pages 954–960. IEEE, 2016.
[67] M. Fire, L. Tenenboim, O. Lesser, R. Puzis, L. Rokach, and Y. Elovici. Link prediction in social networks using computationally efficient topological features. In IEEE Third International Conference on Social Computing (SocialCom), pages 73–80. IEEE, 2011.
[68] Clement Fung, Chris JM Yoon, and Ivan Beschastnikh. Mitigating sybils in federated learning poisoning.
arXiv preprint arXiv:1808.04866, 2018.
[69] Jonas Geiping, Hartmut Bauermeister, Hannah Droge, and Michael Moeller. Inverting gradients–how easy is it to break privacy in federated learning? Advances in Neural Information Processing Systems (NeurIPS), 2020.
[70] Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Powergraph: distributed graph-parallel computation on natural graphs. In OSDI, 2012.
[71] Rachid Guerraoui, Sébastien Rouault, et al. The hidden vulnerability of distributed learning in Byzantium. In International Conference on Machine Learning, pages 3521–3530. PMLR, 2018.
[72] Hongzhi Guo, Jiajia Liu, and Jie Zhang. Efficient computation offloading for multi-access edge computing in 5G HetNets. In 2018 IEEE International Conference on Communications (ICC), pages 1–6. IEEE, 2018.
[73] Vipul Gupta, Swanand Kadhe, Thomas Courtade, Michael W Mahoney, and Kannan Ramchandran. Oversketched newton: Fast convex optimization for serverless systems. In 2020 IEEE International Conference on Big Data (Big Data), pages 288–297. IEEE, 2020.
[74] Isabelle Guyon, Jiwen Li, Theodor Mader, Patrick A Pletscher, Georg Schneider, and Markus Uhr. Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Pattern recognition letters, 28(12):1438–1444, 2007.
[75] Imran S Haque and Vijay S Pande. Hard data on soft errors: A large-scale assessment of real-world error rates in gpgpu. In 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pages 691–696. IEEE, 2010.
[76] Lie He, Sai Praneeth Karimireddy, and Martin Jaggi. Byzantine-robust learning on heterogeneous datasets via resampling. arXiv preprint arXiv:2006.09365, 2020.
[77] Ronald R Hocking. Methods and applications of linear models: regression and the analysis of variance. John Wiley & Sons, 2013.
[78] Tyler Hunt, Congzheng Song, Reza Shokri, Vitaly Shmatikov, and Emmett Witchel.
Chiron: Privacy-preserving machine learning as a service. arXiv preprint arXiv:1803.05961, 2018. [79] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. EuroSys, 2007. [80] Takashi Isobe, Eric D Feigelson, Michael G Akritas, and Gutti Jogesh Babu. Linear regression in astronomy. The Astrophysical Journal, 364:104–113, 1990. [81] Tayyebeh Jahani-Nezhad and Mohammad Ali Maddah-Ali. Codedsketch: Coded dis- tributed computation of approximated matrix multiplication. In 2019 IEEE Interna- tional Symposium on Information Theory (ISIT), pages 2489–2493. IEEE, 2019. [82] Peter H Jin, Qiaochu Yuan, Forrest Iandola, and Kurt Keutzer. How to scale dis- tributed deep learning? ML Systems Workshop, NIPS, 2016. [83] Richeng Jin, Yufan Huang, Xiaofan He, Huaiyu Dai, and Tianfu Wu. Stochastic- sign sgd for federated learning with theoretical guarantees. arXiv preprint arXiv:2002.10940, 2020. [84] John D Jobson. A multivariate linear regression test for the arbitrage pricing theory. The Journal of Finance, 37(4):1037–1042, 1982. [85] Arthur Jochems, Timo M Deist, Johan Van Soest, Michael Eble, Paul Bulens, Philippe Coucke, Wim Dries, Philippe Lambin, and Andre Dekker. Distributed learning: devel- oping a predictive model based on data from multiple hospitals without data leaving the hospital–a real life proof of concept. Radiotherapy and Oncology, 121(3):459–467, 2016. [86] Peter Kairouz, H Brendan McMahan, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1), 2021. 207 [87] Krishna Kandalla, Hari Subramoni, Abhinav Vishnu, and Dhabaleswar K Panda. De- signing topology-aware collective communication algorithms for large scale infiniband clusters: Case studies with Scatter and Gather. In Parallel & Distributed Process- ing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, pages 1–8. IEEE, 2010. 
[88] Purushottam Kar and Harish Karnick. Random feature maps for dot product kernels. In Artificial Intelligence and Statistics , pages 583–591, 2012. [89] Can Karakus, Yifan Sun, Suhas Diggavi, and Wotao Yin. Straggler mitigation in distributed optimization through data encoding. In Advances in Neural Information Processing Systems, pages 5440–5448, 2017. [90] Can Karakus, Yifan Sun, Suhas N Diggavi, and Wotao Yin. Redundancy techniques for straggler mitigation in distributed optimization and learning. Journal of Machine Learning Research, 20(72):1–47, 2019. [91] Shingo Kato and Ryoichi Shinkuma. Priority control in communication networks for accuracy-freshness tradeoff in real-time road-traffic information delivery. IEEE Access, 5:25226–25235, 2017. [92] Yasaman Keshtkarjahromi and Hulya Seferoglu. Coded cooperative computation for internet of things. arXiv preprint arXiv:1801.04357, 2018. [93] Zuhair Khayyat, Karim Awara, Amani Alonazi, Hani Jamjoom, Dan Williams, and Panos Kalnis. Mizan: a system for dynamic load balancing in large-scale graph pro- cessing. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 169–182, 2013. [94] MehrdadKiamari, ChenweiWang, andASalmanAvestimehr. Onheterogeneouscoded distributed computing. arXiv preprint arXiv:1709.00196, 2017. [95] Allen Klinger. The vandermonde matrix. The American Mathematical Monthly, 74(5):571–574, 1967. [96] Can Kockan, Kaiyuan Zhu, Natnatee Dokmai, Nikolai Karpov, M Oguzhan Kulekci, David P Woodruff, and S Cenk Sahinalp. Sketching algorithms for genomic data analysis and querying in a secure enclave. Nature methods, 17(3):295–301, 2020. [97] Derrick Kondo, Bahman Javadi, Paul Malecot, Franck Cappello, and David P An- derson. Cost-benefit analysis of cloud computing versus desktop grids. In IPDPS, volume 9, pages 1–12, 2009. [98] Jakub Konecny, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed optimization beyond the datacenter. 
arXiv preprint arXiv:1511.03575, 2015. [99] Jakub Konecny, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016. 208 [100] Konstantinos Konstantinidis and Aditya Ramamoorthy. Leveraging coding techniques for speeding up distributed computing. arXiv preprint arXiv:1802.03049, 2018. [101] Konstantinos Konstantinidis and Aditya Ramamoorthy. CAMR: Coded Aggregated MapReduce. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 1427–1431. IEEE, 2019. [102] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10 (canadian institute for advanced research). [103] AlexKrizhevsky, VinodNair, andGeoffreyHinton. Learningmultiplelayersoffeatures from tiny images. 2009. [104] Eleftherios Lampiris, Daniel Jiménez Zorrilla, and Petros Elia. Mapping heterogeneity does not affect wireless coded mapreduce. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 1422–1426. IEEE, 2019. [105] LeslieLamport, RobertShostak, andMarshallPease. TheByzantinegeneralsproblem. In Concurrency: the Works of Leslie Lamport, pages 203–226. 2019. [106] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. [107] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann. lecun. com/exdb/mnist, 2, 2010. [108] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on learning theory, pages 1246– 1257. PMLR, 2016. [109] Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 2017. [110] Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. Speeding up distributed machine learning using codes. 
IEEE Transactions on Information Theory, 64(3):1514–1529, 2017. [111] Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 64(3):1514–1529, 2018. [112] Kangwook Lee, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchan- dran. Coded computation for multicore setups. In Information Theory (ISIT), 2017 IEEE International Symposium on, pages 2413–2417. IEEE, 2017. [113] Kangwook Lee, Changho Suh, and Kannan Ramchandran. High-dimensional coded matrix multiplication. In Information Theory (ISIT), 2017 IEEE International Sym- posium on, pages 2418–2422. IEEE, 2017. 209 [114] Liping Li, Wei Xu, Tianyi Chen, Georgios B Giannakis, and Qing Ling. RSA: Byzantine-robust stochastic aggregation methods for distributed learning from het- erogeneous datasets. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 33, pages 1544–1551, 2019. [115] Songze Li and Salman Avestimehr. Coded computing. Foundations and Trends® in Communications and Information Theory, 17(1), 2020. [116] Songze Li, Seyed Mohammadreza Mousavi Kalan, A Salman Avestimehr, and Mahdi Soltanolkotabi. Near-optimal straggler mitigation for distributed gradient methods. arXiv preprint arXiv:1710.09990, 2017. [117] Songze Li, Seyed Mohammadreza Mousavi Kalan, A Salman Avestimehr, and Mahdi Soltanolkotabi. Near-optimal straggler mitigation for distributed gradient methods. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 857–866. IEEE, 2018. [118] Songze Li, Mohammad Ali Maddah-Ali, and A Salman Avestimehr. Coded MapRe- duce. InCommunication, Control, and Computing (Allerton), 2015 53rd Annual Aller- ton Conference on, pages 964–971. IEEE, 2015. [119] Songze Li, Mohammad Ali Maddah-Ali, and A Salman Avestimehr. Coded distributed computing: Straggling servers and multistage dataflows. 
In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on, pages 164–171. IEEE, 2016. [120] Songze Li, Mohammad Ali Maddah-Ali, and A Salman Avestimehr. A unified coding framework for distributed computing with straggling servers. In Globecom Workshops (GC Wkshps), 2016 IEEE, pages 1–6. IEEE, 2016. [121] Songze Li, Mohammad Ali Maddah-Ali, and A Salman Avestimehr. Coding for dis- tributed fog computing. IEEE Communications Magazine, 55(4):34–40, 2017. [122] Songze Li, Mohammad Ali Maddah-Ali, and A Salman Avestimehr. Compressed coded distributed computing. arXiv preprint arXiv:1805.01993, 2018. [123] Songze Li, Mohammad Ali Maddah-Ali, Qian Yu, and A Salman Avestimehr. A fun- damental tradeoff between computation and communication in distributed computing. IEEE Transactions on Information Theory, 64(1):109–128, 2018. [124] Songze Li, Sucha Supittayapornpong, Mohammad Ali Maddah-Ali, and Salman Aves- timehr. Coded terasort. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International, pages 389–398. IEEE, 2017. [125] Songze Li, Qian Yu, Mohammad Ali Maddah-Ali, and A Salman Avestimehr. A scal- able framework for wireless distributed computing. IEEE/ACM Transactions on Net- working, 25(5):2643–2654, 2017. 210 [126] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learn- ing: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020. [127] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Conference on Machine Learning and Systems (MLSys), 2020. [128] Youjie Li, Mingchao Yu, Songze Li, Salman Avestimehr, Nam Sung Kim, and Alexan- der Schwing. Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training. In Advances in Neural Information Processing Systems, pages 8056–8067, 2018. [129] Guanfeng Liang and Ulas C Kozat. 
Tofec: Achieving optimal throughput-delay trade- off of cloud storage using erasure codes. In INFOCOM, 2014 Proceedings IEEE, pages 826–834. IEEE, 2014. [130] Jimmy Lin and Michael Schatz. Design patterns for efficient graph algorithms in mapreduce. MLG Workshop, 2010. [131] M. Loeve. Probability Theory I. Graduate Texts in Mathematics. Springer New York, 1977. [132] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M Hellerstein. Distributed graphlab: a framework for machine learning and data mining in the cloud. VLDB, 2012. [133] Andrew Lumsdaine, Douglas Gregor, Bruce Hendrickson, and Jonathan Berry. Chal- lenges in parallel graph processing. Parallel Processing Letters, 17(01):5–20, 2007. [134] David JC MacKay and David JC Mac Kay. Information theory, inference and learning algorithms. Cambridge university press, 2003. [135] R. K. Maity and A. S.and Mazumdar Rawat. Robust gradient descent via moment encoding with ldpc codes. arXiv preprint arXiv:1805.08327, 2018. [136] Maciej Malawski, Gideon Juve, Ewa Deelman, and Jarek Nabrzyski. Algorithms for cost-and deadline-constrained provisioning for scientific workflow ensembles in iaas clouds. Future Generation Computer Systems, 48:1–18, 2015. [137] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph process- ing. SIGMOD, 2010. [138] Ankur Mallick, Malhar Chaudhari, and Gauri Joshi. Rateless codes for near- perfect load balancing in distributed matrix-vector multiplication. arXiv preprint arXiv:1804.10331, 2018. 211 [139] Robert Ryan McCune, Tim Weninger, and Greg Madey. Thinking like a vertex: a survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Computing Surveys, 2015. [140] Frank McKeen, Ilya Alexandrovich, Alex Berenzon, Carlos V Rozas, Hisham Shafi, Vedvyas Shanbhogue, and Uday R Savagaonkar. 
Innovative instructions and software model for isolated execution. Hasp@ isca, 10(1), 2013. [141] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics , pages 1273–1282. PMLR, 2017. [142] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics , pages 1273–1282. PMLR, 2017. [143] Fan Mo, Hamed Haddadi, Kleomenis Katevas, Eduard Marin, Diego Perino, and Nico- las Kourtellis. Ppfl: privacy-preserving federated learning with trusted execution en- vironments. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services, pages 94–108, 2021. [144] Hema Venkata Krishna Giri Narra, Zhifeng Lin, Ganesh Ananthanarayanan, Salman Avestimehr, and Murali Annavaram. Collage inference: Using coded redundancy for lowering latency variation in distributed image classification systems. In 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS),pages453– 463. IEEE, 2020. [145] Krishna Giri Narra, Zhifeng Lin, Mehrdad Kiamari, Salman Avestimehr, and Murali Annavaram. Slack squeeze coded computing for adaptive straggler mitigation. In Proceedings of the International Conference for High Performance Computing, Net- working, Storage and Analysis, page 14. ACM, 2019. [146] Anselme Ndikumana, Nguyen H Tran, Tai Manh Ho, Zhu Han, Walid Saad, Dusit Niy- ato, and Choong Seon Hong. Joint communication, computation, caching, and control in big data multi-access edge computing. IEEE Transactions on Mobile Computing, 2019. [147] John Nguyen, Kshitiz Malik, Hongyuan Zhan, Ashkan Yousefpour, Michael Rabbat, Mani Malek Esmaeili, and Dzmitry Huba. Federated learning with buffered asyn- chronousaggregation. arXiv preprint arXiv:2106.06639. 
Presented at the International Workshop on Federated Learning for User Privacy and Data Confidentiality in Con- junction with ICML 2021 (FL-ICML’21), 2021. [148] Takayuki Nishio and Ryo Yonetani. Client selection for federated learning with het- erogeneous resources in mobile edge. In ICC 2019-2019 IEEE international conference on communications (ICC), pages 1–7. IEEE, 2019. [149] [Online]. Amazon EC2 pricing. https://aws.amazon.com/ec2/pricing/. Accessed date: July 5th, 2017. 212 [150] Mustafa Safa Ozdayi, Murat Kantarcioglu, and Yulia R Gel. Defending against back- doorsinfederatedlearningwithrobustlearningrate. arXiv preprint arXiv:2007.03767, 2020. [151] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, 1999. [152] Xudong Pan, Mi Zhang, Duocai Wu, Qifan Xiao, Shouling Ji, and Zhemin Yang. Justinian’s gaavernor: Robust distributed learning with gradient aggregation agent. In 29th {USENIX} Security Symposium ({USENIX} Security 20), pages 1641–1658, 2020. [153] Hyegyeong Park, Kangwook Lee, Jy-yong Sohn, Changho Suh, and Jaekyun Moon. Hierarchicalcodingfordistributedcomputing. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 1630–1634. IEEE, 2018. [154] Pitch Patarasuk and Xin Yuan. Bandwidth Efficient All-reduce Operation on Tree Topologies. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1–8. IEEE, 2007. [155] Pitch Patarasuk and Xin Yuan. Bandwidth optimal all-reduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing, 69(2):117–124, 2009. [156] Jie Peng, Zhaoxian Wu, and Qing Ling. Byzantine-robust variance-reduced federated learning over distributed non-iid data. arXiv preprint arXiv:2009.08161, 2020. [157] Krishna Pillutla, Sham M Kakade, and Zaid Harchaoui. Robust aggregation for fed- erated learning. arXiv preprint arXiv:1912.13445, 2019. 
[158] Pawani Porambage, Jude Okwuibe, Madhusanka Liyanage, Mika Ylianttila, and Tarik Taleb. Survey on multi-access edge computing for internet of things realization. IEEE Communications Surveys & Tutorials, 20(4):2961–2991, 2018. [159] Qifan Pu, Ganesh Ananthanarayanan, Peter Bodik, Srikanth Kandula, Aditya Akella, Paramvir Bahl, and Ion Stoica. Low latency geo-distributed data analytics. ACM SIGCOMM Computer Communication Review, 45(4):421–434, 2015. [160] Martin Raab and Angelika Steger. Balls into bins—A simple and tight analysis. In In- ternational Workshop on Randomization and Approximation Techniques in Computer Science, pages 159–170. Springer, 1998. [161] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008. [162] Ali Rahimi and Benjamin Recht. Uniform approximation of functions with random bases. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pages 555–561. IEEE, 2008. 213 [163] Vinayak Ramkumar and P Vijay Kumar. Coded MapReduce Schemes Based on Place- ment Delivery Array. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 3087–3091. IEEE, 2019. [164] Netanel Raviv, Rashish Tandon, Alex Dimakis, and Itzhak Tamo. Gradient coding from cyclic MDS codes and expander graphs. ICML, 2018. [165] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A Lock- Free Approach to Parallelizing Stochastic Gradient Descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011. [166] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, and Ramtin Pedarsani. Fedpaq: A communication-efficient federated learning method with peri- odic averaging and quantization. arXiv preprint arXiv:1909.13014, 2019. [167] AmirhosseinReisizadehandRamtinPedarsani. Latencyanalysisofcodedcomputation schemes over wireless networks. arXiv preprint arXiv:1707.00040, 2017. 
[168] AmirhosseinReisizadeh, HosseinTaheri, AryanMokhtari, HamedHassani, andRamtin Pedarsani. Robust and communication-efficient collaborative learning. In Advances in Neural Information Processing Systems, pages 8386–8397, 2019. [169] AmirhosseinReisizadeh,SauravPrakash, RamtinPedarsani, andAmirSalmanAves- timehr. Coded computation over heterogeneous clusters. IEEE Transactions on In- formation Theory, 2019. [170] Amirhossein Reisizadeh*, Saurav Prakash*, Ramtin Pedarsani, and Amir Salman Avestimehr. Codedreduce: A fast and robust framework for gradient aggregation in distributed learning. International Conference on Machine Learning (ICML) Work- shop on Coding Theory For Large-scale Machine Learning, 2019. [171] Amirhossein Reisizadeh*, Saurav Prakash*, Ramtin Pedarsani, and Amir Salman Avestimehr. Tree gradient coding. In 2019 IEEE International Symposium on Infor- mation Theory (ISIT), pages 2808–2812. IEEE, 2019. [172] Amirhossein Reisizadeh*, Saurav Prakash*, Ramtin Pedarsani, and Amir Salman Avestimehr. Codedreduce: A fast and robust framework for gradient aggregation in distributed learning. IEEE/ACM Transactions on Networking, 2021. [173] Amirhossein Reisizadeh, Saurav Prakash, Ramtin Pedarsani, and Salman Aves- timehr. Coded computation over heterogeneous clusters. In Information Theory (ISIT), 2017 IEEE International Symposium on, pages 2408–2412. IEEE, 2017. [174] Jen-Hao Rick Chang, Aswin C Sankaranarayanan, and BVK Vijaya Kumar. Random features for sparse signal classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5404–5412, 2016. 214 [175] Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R Roth, Shadi Albar- qouni, Spyridon Bakas, Mathieu N Galtier, Bennett A Landman, Klaus Maier-Hein, et al. The future of digital health with federated learning. NPJ digital medicine, 3(1):1–7, 2020. [176] Mark Rudelson and Roman Vershynin. 
Non-asymptotic theory of random matrices: extreme singular values. In Proceedings of the International Congress of Mathemati- cians 2010 (ICM 2010) (In 4 Volumes) Vol. I: Plenary Lectures and Ceremonies Vols. II–IV: Invited Lectures, pages 1576–1602. World Scientific, 2010. [177] A Salman Avestimehr, Seyed Mohammadreza Mousavi Kalan, and Mahdi Soltanolkotabi. Fundamental resource trade-offs for encoded distributed optimization. arXiv e-prints, pages arXiv–1804, 2018. [178] Jim Salter. Containerize all the things! Arm v9 takes security seriously. https: //blog.openmined.org/pysyft-pytorch-intel-sgx/, 2021. [Online; accessed 18- October-2021]. [179] Vijayalakshmi Saravanan, Fatima Hussain, and Naik Kshirasagar. Role of big data in internet of things networks. In Handbook of Research on Big Data and the IoT, pages 273–299. IGI Global, 2019. [180] Mahadev Satyanarayanan. The emergence of edge computing. Computer, 50(1):30–39, 2017. [181] Alexander Sergeev and Mike Del Balso. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799, 2018. [182] Albin Severinson, Eirik Rosnes, et al. Block-diagonal and LT codes for distributed computing with straggling servers. arXiv preprint arXiv:1712.08230, 2017. [183] Nihar B Shah, Kangwook Lee, and Kannan Ramchandran. When Do Redundant Requests Reduce Latency? IEEE Transactions on Communications, 64(2):715–722, 2016. [184] ShahinShahrampour, AhmadBeirami, andVahidTarokh. Ondata-dependentrandom features for improved generalization in supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence , 2018. [185] Sonia Shahzadi, Muddesar Iqbal, Tasos Dagiuklas, and Zia Ul Qayyum. Multi-access edge computing: open issues, challenges and future perspectives. Journal of Cloud Computing, 6(1):30, 2017. [186] NishantShakya, FanLi, andJinyuanChen. Distributedcomputingwithheterogeneous communication constraints: The worst-case computation load and proof by contradic- tion. 
arXiv preprint arXiv:1802.00413, 2018. 215 [187] Micah J Sheller, Brandon Edwards, G Anthony Reina, Jason Martin, Sarthak Pati, Aikaterini Kotrotsou, Mikhail Milchenko, Weilin Xu, Daniel Marcus, Rivka R Colen, et al. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Nature Scientific Reports , 10(1):1–12, 2020. [188] Mehrdad Showkatbakhsh, Can Karakus, and Suhas Diggavi. Privacy-utility trade- off of linear regression under random projections and additive noise. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 186–190. IEEE, 2018. [189] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large- scale image recognition. arXiv preprint arXiv:1409.1556, 2014. [190] Linqi Song, Christina Fragouli, and Tianchu Zhao. A pliable index coding approach to data shuffling. arXiv preprint arXiv:1701.05540, 2017. [191] Pedro Soto, Jun Li, and Xiaodi Fan. Dual entangled polynomial code: Three- dimensional coding for distributed matrix multiplication. In International Conference on Machine Learning, pages 5937–5945, 2019. [192] Sundara Rajan Srinivasavaradhan, Linqi Song, and Christina Fragouli. Distributed computing trade-offs with random connectivity. In 2018 IEEE International Sympo- sium on Information Theory (ISIT), pages 1281–1285. IEEE, 2018. [193] Lili Su and Jiaming Xu. Securing distributed machine learning in high dimensions. arXiv preprint arXiv:1804.10140, 2018. [194] Sen Su, Jian Li, Qingjia Huang, Xiao Huang, Kai Shuang, and Jie Wang. Cost-efficient task scheduling for executing large programs in the cloud. Parallel Computing, 39(4- 5):177–188, 2013. [195] GeewonSuh, KangwookLee, andChanghoSuh. Matrixsparsificationforcodedmatrix multiplication. In Communication, Control, and Computing (Allerton), 2017 55th Annual Allerton Conference on, pages 1271–1278. IEEE, 2017. [196] Jun Sun, Tianyi Chen, Georgios B Giannakis, Qinmin Yang, and Zaiyue Yang. 
Lazily aggregated quantized gradient innovation for communication-efficient federated learn- ing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. [197] Yuxuan Sun, Junlin Zhao, Sheng Zhou, and Deniz Gunduz. Heterogeneous coded computation across heterogeneous workers. In 2019 IEEE Global Communications Conference (GLOBECOM), pages 1–6. IEEE, 2019. [198] Tarik Taleb, Konstantinos Samdanis, Badr Mada, Hannu Flinck, Sunny Dutta, and Dario Sabella. On multi-access edge computing: A survey of the emerging 5G network edgecloudarchitectureandorchestration. IEEE Communications Surveys & Tutorials, 19(3):1657–1681, 2017. [199] Rashish Tandon, Qi Lei, Alexandros G Dimakis, and Nikos Karampatziakis. Gradient coding: Avoiding stragglers in distributed learning. ICML, 2017. 216 [200] Rashish Tandon, Qi Lei, Alexandros G Dimakis, and Nikos Karampatziakis. Gradient coding: Avoiding stragglers in distributed learning. In International Conference on Machine Learning, pages 3368–3376, 2017. [201] SauravPrakash, Sagar Dhakal, Mustafa Akdeniz, A Salman Avestimehr, and Nageen Himayat. Codedcomputingforfederatedlearningattheedge. International Workshop on Federated Learning for User Privacy and Data Confidentiality, in Conjunction with ICML 2020 (FL-ICML’20), 2020. [202] Saurav Prakash, Sagar Dhakal, Mustafa Riza Akdeniz, Yair Yona, Shilpa Talwar, Salman Avestimehr, and Nageen Himayat. Coded computing for low-latency federated learning over wireless edge networks. IEEE Journal on Selected Areas in Communica- tions, 39(1):233–250, 2020. [203] Saurav Prakash, Sagar Dhakal, Yair Yona, Nageen Himayat, and Shilpa Talwar. Technologies for distributing gradient descent computation in a heterogeneous multi- access edge computing (MEC) networks, 2019. US Patent App. 16/235,682. [204] Saurav Prakash, Hanieh Hashemi, Yongqin Wang, Murali Annavaram, and Amir Salman Avestimehr. Byzantine-resilient federated learning with heterogeneous data distribution. 
arXiv preprint arXiv:2010.07541, 2021. [205] Saurav Prakash*, Amirhossein Reisizadeh*, Ramtin Pedarsani, and Amir Salman Avestimehr. Coded computing for distributed graph analytics. IEEE International Symposium on Information Theory (ISIT), 2018. [206] Saurav Prakash*, Amirhossein Reisizadeh*, Ramtin Pedarsani, and Amir Salman Avestimehr. Coded computing for distributed graph analytics. IEEE Transactions on Information Theory, 66(10):6534–6554, 2020. [207] Saurav Prakash*, Amirhossein Reisizadeh*, Ramtin Pedarsani, and Amir Salman Avestimehr. Hierarchical coded gradient aggregation for learning at the edge. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2616–2621. IEEE, 2020. [208] Rajeev Thakur, Rolf Rabenseifner, and William Gropp. Optimization of Collective Communication Operations in MPICH. The International Journal of High Perfor- mance Computing Applications, 19(1):49–66, 2005. [209] Florian Tramer and Dan Boneh. Slalom: Fast, verifiable and private execution of neural networks in trusted hardware. International Conference on Learning Represen- tations (ICLR), 2019. [210] Jean-Philippe Vert, Koji Tsuda, and Bernhard Schölkopf. A primer on kernel methods. Kernel methods in computational biology, 47:35–70, 2004. [211] DaWang, GauriJoshi, andGregoryWornell. Efficienttaskreplicationforfastresponse times in parallel computation. In ACM SIGMETRICS Performance Evaluation Re- view, volume 42, pages 599–600. ACM, 2014. 217 [212] Da Wang, Gauri Joshi, and Gregory Wornell. Using straggler replication to reduce la- tency in large-scale parallel computing. ACM SIGMETRICS Performance Evaluation Review, 43(3):7–11, 2015. [213] Shiqiang Wang, Tiffany Tuor, Theodoros Salonidis, Kin K Leung, Christian Makaya, Ting He, and Kevin Chan. When edge meets learning: Adaptive control for resource- constrained distributed machine learning. In IEEE INFOCOM 2018-IEEE Conference on Computer Communications, pages 63–71. IEEE, 2018. 
[214] Sinong Wang, Jiashang Liu, and Ness Shroff. Coded sparse matrix multiplication. arXiv preprint arXiv:1802.03430, 2018. [215] Sinong Wang, Jiashang Liu, Ness Shroff, and Pengyu Yang. Fundamental limits of coded linear transform. arXiv preprint arXiv:1804.09791, 2018. [216] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017. [217] Cong Xie, Sanmi Koyejo, and Indranil Gupta. Zeno: Distributed stochastic gradient descent with suspicion-based fault-tolerance. In International Conference on Machine Learning, pages 6893–6901. PMLR, 2019. [218] Wenpu Xing and Ali Ghorbani. Weighted pagerank algorithm. In Communication Networks and Services Research, 2004. Proceedings. Second Annual Conference on, pages 305–314. IEEE, 2004. [219] Qifa Yan, Michèle Wigger, Sheng Yang, and Xiaohu Tang. A fundamental storage- communication tradeoff for distributed computing with straggling nodes. IEEE Trans- actions on Communications, 2020. [220] Chien-Sheng Yang, Ramtin Pedarsani, and A Salman Avestimehr. Timely-throughput optimal coded computing over cloud networks. In Proceedings of the Twentieth ACM International Symposium on Mobile Ad Hoc Networking and Computing, pages 301– 310. ACM, 2019. [221] Chien-Sheng Yang, Ramtin Pedarsani, and A Salman Avestimehr. Edge comput- ing in the dark: Leveraging contextual-combinatorial bandit and coded computing. IEEE/ACM Transactions on Networking, 2021. [222] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learn- ing: Concept and applications. In ACM Transactions on Intelligent Systems and Technology (TIST), volume 10, pages 1–19. ACM New York, NY, USA, 2019. [223] Yaoqing Yang, Malhar Chaudhari, Pulkit Grover, and Soummya Kar. Coded iterative computing using substitute decoding. arXiv preprint arXiv:1805.06046, 2018. [224] Yaoqing Yang, Pulkit Grover, and Soummya Kar. 
Fault-tolerant distributed logistic regression using unreliable components. In Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on, pages 940–947. IEEE, 2016. 218 [225] Yaoqing Yang, Pulkit Grover, and Soummya Kar. Computing linear transformations with unreliable components. IEEE Transactions on Information Theory, 63(6):3729– 3756, 2017. [226] Min Ye and Emmanuel Abbe. Communication-computation efficient gradient coding. arXiv preprint arXiv:1802.03475, 2018. [227] Sangho Yi, Artur Andrzejak, and Derrick Kondo. Monetary cost-aware checkpoint- ing and migration on amazon cloud spot instances. IEEE Transactions on Services Computing, 5(4):512–524, 2012. [228] DongYin, YudongChen, KannanRamchandran, andPeterBartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. ICML, 2018. [229] Naoya Yoshida, Takayuki Nishio, Masahiro Morikura, Koji Yamamoto, and Ryo Yone- tani. Hybrid-FL for wireless networks: Cooperative learning mechanism using non-iid data. In ICC 2020-2020 IEEE International Conference on Communications (ICC), pages 1–7. IEEE, 2020. [230] Mingchao Yu, Zhifeng Lin, Krishna Narra, Songze Li, Youjie Li, Nam Sung Kim, Alexander Schwing, Murali Annavaram, and Salman Avestimehr. GradiVeQ: Vec- tor Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training. In Advances in Neural Information Processing Systems, pages 5129–5139, 2018. [231] Qian Yu, Mohammad Maddah-Ali, and Salman Avestimehr. Polynomial codes: an op- timal design for high-dimensional coded matrix multiplication. In Advances in Neural Information Processing Systems, pages 4403–4413, 2017. [232] QianYu, MohammadAliMaddah-Ali, andASalmanAvestimehr. Stragglermitigation in distributed matrix multiplication: Fundamental limits and optimal coding. arXiv preprint arXiv:1801.07487, 2018. [233] Qian Yu, Netanel Raviv, Jinhyun So, and A Salman Avestimehr. 
Lagrange coded computing: Optimal design for resiliency, security and privacy. arXiv preprint arXiv:1806.00939, 2018. [234] Quan Yuan, Haibo Zhou, Jinglin Li, Zhihan Liu, Fangchun Yang, and Xuemin Sher- man Shen. Toward efficient content delivery for automated driving services: An edge computing solution. IEEE Network, 32(1):80–86, 2018. [235] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark: cluster computing with working sets. HotCloud, 10:10–10, 2010. [236] Matei Zaharia, Andy Konwinski, Anthony D Joseph, Randy H Katz, and Ion Stoica. Improving mapreduce performance in heterogeneous environments. In Osdi, volume 8, page 7, 2008. 219 [237] Jingjing Zhang and Osvaldo Simeone. Improved latency-communication trade-off for map-shuffle-reduce systems with stragglers. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8172–8176. IEEE, 2019. [238] Lingchen Zhao, Qian Wang, Qin Zou, Yan Zhang, and Yanjiao Chen. Privacy- preservingcollaborativedeeplearningwithunreliableparticipants. IEEE Transactions on Information Forensics and Security, 15:1486–1500, 2019. [239] Yiyang Zhao, Linnan Wang, Wei Wu, George Bosilca, Richard Vuduc, Jinmian Ye, WenqiTang, andZenglinXu. EfficientCommunicationsinTrainingLargeScaleNeural Networks. In Proceedings of the on Thematic Workshops of ACM Multimedia 2017, pages 110–116. ACM, 2017. [240] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018. [241] Zhuoran Zhao, Kamyar Mirzazad Barijough, and Andreas Gerstlauer. Deepthings: Distributed adaptive deep learning inference on resource-constrained iot edge clusters. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(11):2348–2359, 2018. [242] Jingge Zhu, Ye Pu, Vipul Gupta, Claire Tomlin, and Kannan Ramchandran. 
A sequential approximation framework for coded distributed optimization. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1240–1247. IEEE, 2017.
[243] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. In Advances in Neural Information Processing Systems, pages 14774–14784, 2019.
[244] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized Stochastic Gradient Descent. In Advances in Neural Information Processing Systems, pages 2595–2603, 2010.

Appendices

A Achievability for the Random Bi-partite Model

In this section, we specialize our proposed scheme in Section 2.4 to the random bi-partite model and prove the achievability of Theorem 3. Consider an RB$(n_1,n_2,q)$ graph $G=(V_1\cup V_2, E)$ with $n=n_1+n_2$, $|V_1|=n_1=\Theta(n)$, and $|V_2|=n_2=\Theta(n)$, where $|n_1-n_2|=o(n)$. The prior knowledge of the bi-partite structure of the graph implies that the Reduction of vertices in $V_1$ depends only on the Mappers in $V_2$. Therefore, it is preferable to assign the two operations to the same set of servers. Inspired by this observation, we describe the subgraph and Reduce allocations as follows. We divide the total $K$ servers into two sets of $K_1=\frac{n_1}{n}K$ and $K_2=\frac{n_2}{n}K$ servers, and assume $n_1\ge n_2$.

(I) Mappers in $V_1$ and Reducers in $V_2$ are distributedly allocated to the $K_1$ servers according to the allocation scheme proposed in Section 2.4.1. Each of the $K_1$ servers Maps $\frac{n_1 r}{K_1}=\frac{nr}{K}$ vertices (in $V_1$) and Reduces $\frac{n_2}{K_1}=\frac{n_2}{n_1}\cdot\frac{n}{K}$ vertices (in $V_2$). Note that although each server in the first set is loaded at its capacity with $\frac{nr}{K}$ Mappers, these servers are assigned only $\frac{n_2}{n_1}\cdot\frac{n}{K}\le \frac{n}{K}$ Reducers, which implies that more Reducers can be assigned to these servers.

(II) Next, we allocate the Mappers in $V_2$ to the other set of $K_2$ servers similar to the Mappers in $V_1$.
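The server split and per-server loads in phases (I)–(II) can be sanity-checked with concrete numbers; the instance below is hypothetical and only illustrates the counting:

```python
# Hypothetical toy instance of the allocation above: n1 = 600, n2 = 300, K = 9, r = 2.
n1, n2, K, r = 600, 300, 9, 2
n = n1 + n2
K1, K2 = n1 * K // n, n2 * K // n      # server split: 6 and 3 servers

maps_per_K1_server = n1 * r // K1      # Mappers of V1 spread over the K1 servers
reds_per_K1_server = n2 // K1          # Reducers of V2 spread over the K1 servers
maps_per_K2_server = n2 * r // K2      # Mappers of V2 spread over the K2 servers

# every server Maps nr/K vertices, and the K1 servers have spare Reducer capacity
assert maps_per_K1_server == maps_per_K2_server == n * r // K
assert reds_per_K1_server <= n // K
```

Phase (III), described next, then tops the $K_1$ servers' Reducer counts back up to $n/K$ each.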
According to our pick for $K_2$ and the allocation scheme proposed in Section 2.4.1, each server in the second set is assigned $\frac{n_2 r}{K_2}=\frac{nr}{K}$ vertices (in $V_2$). To allocate the $n_1$ Reductions in $V_1$ to the $K_2$ servers, we note that these servers can accommodate at most $K_2\cdot\frac{n}{K}=n_2$ Reductions, which is less than $n_1$. To allocate all Reductions, we use the remaining Reduction space in the $K_1$ servers. More precisely, we first allocate $n_2$ out of the total $n_1$ Reductions in $V_1$ to the $K_2$ servers.

(III) Finally, we allocate the remaining $n_1-n_2$ vertices to the $K_1$ servers.

All in all, each of the $K$ servers is now assigned $nr/K$ Mappers and $n/K$ Reducers. We denote this allocation by $\tilde{A}\in\mathcal{A}(r)$. Moreover, coded Shuffling applies the coded scheme proposed in Section 2.4.1 to the Reductions in phases (I) and (II) separately. We also allow uncoded communications for enabling the Reductions required in phase (III). Now, we evaluate the communication load of each of the above phases. Let $\bar{L}^{C_1}_{\tilde{A}}$ and $\bar{L}^{C_2}_{\tilde{A}}$ denote the average normalized communication loads for phases (I) and (II), and let $\bar{L}^{UC_3}_{\tilde{A}}$ denote the average normalized communication load for phase (III). From the achievability result in Theorem 1, for $q=\omega(\frac{1}{n^2})$, we have
$$\bar{L}^{C_1}_{\tilde{A}} \le \frac{1}{r}\, q\, \frac{n_1 n_2}{n^2}\left(1-\frac{r}{K_1}\right)+o(q), \qquad \bar{L}^{C_2}_{\tilde{A}} \le \frac{1}{r}\, q\, \frac{n_2^2}{n^2}\left(1-\frac{r}{K_2}\right)+o(q).$$
As mentioned before, the Reduction of the remaining $n_1-n_2$ vertices in phase (III) is carried out uncoded, which induces the average normalized communication load
$$\bar{L}^{UC_3}_{\tilde{A}} = q\,\frac{n_2(n_1-n_2)}{n^2}.$$
Putting everything together, the proposed achievable scheme has the total average normalized communication load
$$\bar{L}_{\tilde{A}} = \bar{L}^{C_1}_{\tilde{A}}+\bar{L}^{C_2}_{\tilde{A}}+\bar{L}^{UC_3}_{\tilde{A}} \le \frac{1}{r}\, q\, \frac{n_1 n_2}{n^2}\left(1-\frac{r}{K_1}\right)+\frac{1}{r}\, q\, \frac{n_2^2}{n^2}\left(1-\frac{r}{K_2}\right)+q\,\frac{n_2(n_1-n_2)}{n^2}+o(q).$$
Hence, the achievability claim of Theorem 3 can be concluded as follows:
$$\limsup_{n\to\infty} \frac{L^*(r)}{q} \le \limsup_{n\to\infty} \frac{\bar{L}_{\tilde{A}}}{q} \le \limsup_{n\to\infty} \frac{1}{r}\,\frac{n_1 n_2}{n^2}\left(1-\frac{r}{K_1}\right)+\limsup_{n\to\infty} \frac{1}{r}\,\frac{n_2^2}{n^2}\left(1-\frac{r}{K_2}\right)+\limsup_{n\to\infty} \frac{n_2(n_1-n_2)}{n^2} = \frac{1}{2r}\left(1-\frac{2r}{K}\right). \tag{7}$$

B Converse for the Random Bi-partite Model

Here we provide a lower bound on the optimal average communication load for the random bi-partite model that is within a constant factor of the upper bound, which completes the proof of Theorem 3. Consider $G=(V_1\cup V_2, E)$ and assume that $n_1\ge n_2$. To derive a lower bound on $L^*(r)$, for every realization of the RB$(n_1,n_2,q)$ graph, we arbitrarily remove $n_1-n_2$ vertices in $V_1$ along with their corresponding edges. The new bi-partite graph represents two random ER graphs with $n_2$ vertices each. Consider Reducing the vertices on one side of the new graph, e.g. $V_2$. Clearly, this provides a lower bound on $L^*(r)$. Note that now each Mapper can benefit from a redundancy factor of $2r$. According to Theorem 1, Reducing $V_2$ induces the (optimal) communication load $\frac{1}{2r}\,q\left(1-\frac{2r}{K}\right)+o(q)$, which implies
$$\limsup_{n\to\infty} \frac{L^*(r)}{q} \ge \limsup_{n\to\infty} \frac{1}{2r}\,\frac{n_2^2}{n^2}\left(1-\frac{2r}{K}\right) = \frac{1}{8r}\left(1-\frac{2r}{K}\right). \tag{8}$$
Hence, the proof of the converse of Theorem 3 is complete. Furthermore, (7) and (8) together asymptotically characterize the optimal average normalized communication load $L^*(r)$ within a factor of 4.

C Achievability for the Stochastic Block Model

In this section, we specialize our proposed scheme in Section 2.4 to the stochastic block model and prove the achievability of Theorem 4. Consider an SBM$(n_1,n_2,p,q)$ graph $G=(V_1\cup V_2, E_1\cup E_2\cup E_3)$ with $n=n_1+n_2$, $|V_1|=n_1=\Theta(n)$, and $|V_2|=n_2=\Theta(n)$. The edge subsets $E_1$, $E_2$ and $E_3$ respectively represent intra-cluster edges among vertices in $V_1$, intra-cluster edges among vertices in $V_2$, and inter-cluster edges between vertices in $V_1$ and $V_2$.
Let $G_1=(V_1,E_1)$ and $G_2=(V_2,E_2)$ be the graphs induced by $V_1$ and $V_2$, respectively, and denote the graph of inter-cluster connections by $G_3=(V_1\cup V_2, E_3)$. Clearly, $G_1$ and $G_2$ are ER$(n_1,p)$ and ER$(n_2,p)$ graphs, while $G_3$ is an RB$(n_1,n_2,q)$ graph. The subgraph and Reduce allocations are described as follows. Mappers in $V_1$ and Reducers in $V_2$ are distributedly allocated to the $K$ servers according to the allocation scheme proposed in Section 2.4. Similarly, Mappers in $V_2$ and Reducers in $V_1$ are distributedly allocated to the $K$ servers according to the allocation scheme proposed in Section 2.4. Therefore, each server Maps $n_1 r/K$ vertices in $V_1$ and $n_2 r/K$ vertices in $V_2$, inducing the computation load $r$. Moreover, each server Reduces $n_1/K$ functions in $V_1$ and $n_2/K$ functions in $V_2$. We consider this allocation, denoted by $\tilde{A}$, for both the uncoded and coded Shuffling schemes.

In the uncoded scheme, Reducing each function in $V_1$ requires on average $pn_1$ intermediate values Mapped by vertices in $V_1$ due to intra-cluster connections, which introduces the average uncoded load $\bar{L}^{UC_1}_{\tilde{A}} = p\,\frac{n_1^2}{(n_1+n_2)^2}\left(1-\frac{r}{K}\right)$. Similarly, the average uncoded load for Reducing $V_2$ due to intra-cluster connections is $\bar{L}^{UC_2}_{\tilde{A}} = p\,\frac{n_2^2}{(n_1+n_2)^2}\left(1-\frac{r}{K}\right)$. Moreover, inter-cluster connections induce an average load $\bar{L}^{UC_3}_{\tilde{A}} = q\,\frac{2n_1 n_2}{(n_1+n_2)^2}\left(1-\frac{r}{K}\right)$.

In the coded scheme, we propose to employ coded Shuffling for the ER and RB models in the regime of interest, that is, $p=\omega(\frac{1}{n^2})$, $q=\omega(\frac{1}{n^2})$ and $p\ge q$. Thus, the overall communication load can be decomposed into three components. We first apply the coded Shuffling scheme described in Section 2.4.1 to the ER graph $G_1$, which induces the average normalized communication load
$$\bar{L}^{C_1}_{\tilde{A}} \le \frac{1}{r}\,\bar{L}^{UC_1}_{\tilde{A}}+o(p) = \frac{1}{r}\, p\,\frac{n_1^2}{(n_1+n_2)^2}\left(1-\frac{r}{K}\right)+o(p).$$
Similarly, the same scheme applied to the ER graph $G_2$ results in the average normalized communication load
$$\bar{L}^{C_2}_{\tilde{A}} \le \frac{1}{r}\,\bar{L}^{UC_2}_{\tilde{A}}+o(p) = \frac{1}{r}\, p\,\frac{n_2^2}{(n_1+n_2)^2}\left(1-\frac{r}{K}\right)+o(p).$$
Finally, we employ the same scheme twice for the two ER models constituting the RB graph $G_3$, which induces the average normalized communication load
$$\bar{L}^{C_3}_{\tilde{A}} \le \frac{1}{r}\,\bar{L}^{UC_3}_{\tilde{A}}+o(q) = \frac{1}{r}\, q\,\frac{2n_1 n_2}{(n_1+n_2)^2}\left(1-\frac{r}{K}\right)+o(q).$$
Let us denote by $\bar{L}^{C}_{\tilde{A}}$ and $\bar{L}^{UC}_{\tilde{A}}$ the total average normalized communication loads of the coded and uncoded schemes, respectively. Therefore,
$$L^*(r) \le \bar{L}^{C}_{\tilde{A}} = \bar{L}^{C_1}_{\tilde{A}}+\bar{L}^{C_2}_{\tilde{A}}+\bar{L}^{C_3}_{\tilde{A}} \le \frac{1}{r}\left(\bar{L}^{UC_1}_{\tilde{A}}+\bar{L}^{UC_2}_{\tilde{A}}+\bar{L}^{UC_3}_{\tilde{A}}\right)+o(p) = \frac{1}{r}\,\bar{L}^{UC}_{\tilde{A}}+o(p) = \frac{1}{r}\cdot\frac{pn_1^2+pn_2^2+2qn_1n_2}{(n_1+n_2)^2}\left(1-\frac{r}{K}\right)+o(p),$$
which concludes the proof of achievability of Theorem 4.

D Converse for the Stochastic Block Model

In this section, we provide the proof of the converse of Theorem 4. Consider an SBM$(n_1,n_2,p,q)$ graph $G=(V_1\cup V_2, E_1\cup E_2\cup E_3)$ with $n=n_1+n_2$, $|V_1|=n_1=\Theta(n)$, and $|V_2|=n_2=\Theta(n)$. Our approach to deriving a lower bound on the minimum average communication load is to randomly remove edges from the two intra-cluster edge sets, i.e. $E_1$ and $E_2$. Moreover, edges are removed such that each of those clusters becomes an Erdos-Renyi model with connectivity probability $q$ (reduced from $p$). This can be verified by the following coupling-type argument. Let the Bernoulli random variable $E_p$ denote the indicator of the existence of a generic edge in an ER$(n,p)$ graph, i.e. $\Pr[E_p=1]=p$. Now, generate another Bernoulli random variable $E_q$ by randomly removing edges from the realized ER graph as follows:
$$E_q = \begin{cases} 0 & \text{if } E_p=0, \\ 0 & \text{w.p. } 1-q/p \text{ if } E_p=1, \\ 1 & \text{w.p. } q/p \text{ if } E_p=1. \end{cases}$$
Clearly, $E_q$ is Bernoulli$(q)$, and the resulting graph has a smaller number of edges than the original one (with probability 1). By doing so for the two ER components of the SBM graph, we obtain a larger ER graph of size $n=n_1+n_2$ with connectivity probability $q$. Using the converse in Theorem 1, we have the following for the average normalized communication load for the stochastic block model:
$$\frac{L^*(r)}{q} \ge \frac{1}{r}\left(1-\frac{r}{K}\right).$$

E Proof of Lemma 7

Lemma 7.
For all $p\in[0,1]$ and $s'>0$, we have
$$\left(pe^{s'}+1-p\right)^2 \le pe^{2s'}+1-p.$$

Proof. For a given $p\in[0,1]$, define
$$f(s') = \left(pe^{s'}+1-p\right)^2 - \left(pe^{2s'}+1-p\right).$$
Clearly $f(0)=0$. Moreover, $f'(s')=2p\bar{p}\left(e^{s'}-e^{2s'}\right)<0$ for $s'>0$, where $\bar{p}=1-p$. Therefore, $f(s')\le 0$ for all $s'>0$, concluding the claim of the lemma.

F Proof of Lemma 4

For each $i$, the aggregate return at time $t$ satisfies $X_i(t)\in\{0,\ell_i\}$. Therefore, we can use McDiarmid's inequality (see Section G for reference) as follows:
$$\Pr[X(t)-\mathbb{E}[X(t)]\ge \epsilon] \le \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{n}\ell_i^2}\right), \qquad \Pr[\mathbb{E}[X(t)]-X(t)\ge \epsilon] \le \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{n}\ell_i^2}\right),$$
for any $\epsilon>0$. Now, we proceed to complete the proof of Lemma 4.

Proof. Let $t=\tau^*+\delta$ for some $\delta=\Theta\left(\frac{\log n}{\sqrt{n}}\right)$ and $\epsilon=\delta^2$. The claim is that $\Pr\left[X^*(t)\le r-\epsilon\right]=o\left(\frac{1}{n}\right)$. From McDiarmid's inequality, we have
$$\Pr\left[X^*(t)\le r-\epsilon\right] \le \exp\left(\frac{-2\left(\mathbb{E}[X^*(t)]-r+\epsilon\right)^2}{\sum_i \ell_i^{*2}(t)}\right) = \exp\left(\frac{-2\left(ts-r+\epsilon\right)^2}{\sum_i \ell_i^{*2}(t)}\right) = \exp\left(\frac{-\left(2\delta^2 s^2+4\delta^3 s+2\delta^4\right)}{\left(\left(\frac{r}{s}\right)^2+2\delta\,\frac{r}{s}+\delta^2\right)\sum_i \lambda_i^2}\right) \overset{(g)}{=} e^{-\Theta(n\delta^2)} = o\left(\frac{1}{n}\right).$$
In the above, equality (g) follows from the facts that $r=\Theta(n)$, $s=\Theta(n)$, $\lambda_i=\Theta(1)$, $\delta=\Theta\left(\frac{\log n}{\sqrt{n}}\right)$, and therefore $\sum_i \lambda_i^2=\Theta(n)$ and $s^2=\Theta(n^2)$. Moreover, if $t^*<\tau^*$, with a positive probability there are fewer than $r$ equations at the master node by time $t^*$, which is a contradiction. Therefore, $\tau^*\le t^*\le \tau^*+\delta$.

G McDiarmid's Inequality

Let $X_1,\cdots,X_n$ be independent random variables taking values in $\mathcal{X}$. Further, let the function $f:\mathcal{X}^n\to\mathbb{R}$ be $L_i$-Lipschitz for all $i\in[n]$, that is,
$$|f(x_1,\cdots,x_i,\cdots,x_n)-f(x_1,\cdots,x_i',\cdots,x_n)|\le L_i,$$
for any $x_1,\cdots,x_n,x_i'\in\mathcal{X}$ and $i\in[n]$. Then, for any $\epsilon>0$,
$$\Pr\left[f(X_1,\cdots,X_n)-\mathbb{E}[f(X_1,\cdots,X_n)]\ge \epsilon\right] \le \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{n}L_i^2}\right), \qquad \Pr\left[\mathbb{E}[f(X_1,\cdots,X_n)]-f(X_1,\cdots,X_n)\ge \epsilon\right] \le \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^{n}L_i^2}\right).$$
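As a quick numerical illustration of the inequality above (a sketch with hypothetical parameters, not part of any proof), a Monte Carlo estimate of the upper-tail probability of a sum of bounded independent increments stays below the McDiarmid bound:

```python
import math, random

random.seed(1)
n, p, trials = 50, 0.5, 100_000
ell = [1.0] * n                    # increments X_i take values in {0, ell_i}
mean = p * sum(ell)
eps = 5.0

# McDiarmid bound: Pr[X - E[X] >= eps] <= exp(-2 eps^2 / sum(ell_i^2))
bound = math.exp(-2 * eps ** 2 / sum(l * l for l in ell))

hits = 0
for _ in range(trials):
    x = sum(l for l in ell if random.random() < p)   # one realization of the sum
    hits += x - mean >= eps
tail = hits / trials                                  # empirical upper-tail probability
```

With these parameters the bound is $e^{-1}\approx 0.37$, comfortably above the empirical tail.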
H Pseudo-code for Computation Allocation Sub-routine

Algorithm 3: Computation Allocation
Input: dataset $\mathcal{D}$, $n$ workers, straggler toleration $s$, computation matrix $B=[b_1;\cdots;b_n]\in\mathbb{R}^{n\times k}$
Output: dataset allocation $\{\mathcal{D}^{(1)},\cdots,\mathcal{D}^{(n)}\}$ for the $n$ workers
1: procedure CompAlloc($\mathcal{D}$, $B$)
2:   uniformly partition $\mathcal{D}=\cup_{\kappa=1}^{k}\mathcal{D}_\kappa$
3:   for worker $i\leftarrow 1$ to $n$ do
4:     $\mathcal{D}^{(i)}\leftarrow \cup_{\kappa=1}^{k} b_{i\kappa}\mathcal{D}_\kappa$   // $\mathcal{D}^{(i)}$ is assigned to worker $W_i$
5:   end for
6: end procedure

I Pseudo-code for CodedReduce Scheme

Algorithm 4: CodedReduce
Input: dataset $\mathcal{D}$, $(n,L)$–regular tree $T$, straggler toleration $s$ (per parent), model $\theta^{(t)}$
Output: gradient $g_{\mathcal{D}}=\sum_{x\in\mathcal{D}}\nabla\ell(\theta^{(t)};x)$ aggregated at the master
1: procedure CR.Allocate
2:   GC generates $B$ specified by $n,s$
3:   for $l\leftarrow 1$ to $L$ do
4:     for $i\leftarrow 1$ to $n^{l-1}$ do
5:       $\{\mathcal{D}_{T(l,n(i-1)+1)},\cdots,\mathcal{D}_{T(l,ni)}\}=$ CompAlloc($\mathcal{D}_{T(l-1,i)}$, $B$)
6:     end for
7:     for $i\leftarrow 1$ to $n^l$ do
8:       pick $r_{CR}\cdot d$ data points of $\mathcal{D}_{T(l,i)}$ as $\mathcal{D}(l,i)$
9:       $\mathcal{D}_{T(l,i)}\leftarrow \mathcal{D}_{T(l,i)}\setminus\mathcal{D}(l,i)$
10:    end for
11:  end for
12: end procedure
13: procedure CR.Execute
14:   GC generates $A$ from $B$
15:   all the workers compute their local partial gradients $g_{\mathcal{D}(l,i)}$
16:   for $l\leftarrow L-1$ to 1 do
17:     for $i\leftarrow 1$ to $n^l$ do
18:       worker node $(l,i)$:
19:         receives $[m_{(l+1,n(i-1)+1)};\cdots;m_{(l+1,ni)}]$ from its children
20:         uploads $m_{(l,i)}=a_{f(l,i)}[m_{(l+1,n(i-1)+1)};\cdots;m_{(l+1,ni)}]+g_{\mathcal{D}(l,i)}$ to its parent
21:     end for
22:   end for
23:   master node:
24:     receives $[m_{(1,1)};\cdots;m_{(1,n)}]$ from its children
25:     recovers $g=a_{f(0,1)}[m_{(1,1)};\cdots;m_{(1,n)}]$
26: end procedure

J Proof of Theorem 9

Achievability: According to the data allocation described in Algorithm 4, to be robust to any $s$ straggling children of the master, the dataset $\mathcal{D}$ is redundantly assigned to the sub-trees $T(1,1),\cdots,T(1,n)$ such that each data point is placed in $s+1$ sub-trees, which yields
$$|\mathcal{D}_{T(1,i)}|=\frac{s+1}{n}\,d, \quad \forall i\in[n].$$
(9)
Then, the nodes in layer $l=1$ pick $r_{CR}\,d$ data points as their corresponding datasets and similarly distribute the remainder among their children, which together with (9) yields
$$|\mathcal{D}_{T(2,i)}|=\frac{s+1}{n}\left(\frac{s+1}{n}\,d-r_{CR}\,d\right)=\frac{s+1}{n}\left(\frac{s+1}{n}-r_{CR}\right)d, \quad \forall i\in[n^2].$$
By the same argument for each layer, we have
$$|\mathcal{D}_{T(L,i)}|=\frac{s+1}{n}\left(\left(\frac{s+1}{n}\right)^{L-1}-\left(\frac{s+1}{n}\right)^{L-2}r_{CR}-\cdots-\frac{s+1}{n}\,r_{CR}-r_{CR}\right)d, \quad \forall i\in[n^L]. \tag{10}$$
Putting (10) together with $|\mathcal{D}_{T(L,i)}|=r_{CR}\,d$ yields
$$r_{CR}=\frac{1}{\frac{n}{s+1}+\cdots+\left(\frac{n}{s+1}\right)^L}.$$

Optimality: In an $\alpha$–resilient scheme, the master node is able to recover from any $s=\alpha n$ straggling sub-trees among $T(1,1),\cdots,T(1,n)$. Therefore, each data point has to be placed in at least $s+1$ of such sub-trees, which yields
$$|\mathcal{D}_{T(1,1)}|+\cdots+|\mathcal{D}_{T(1,n)}|\ge (s+1)d, \tag{11}$$
where the equality is achieved only if each data point is assigned to exactly $s+1$ sub-trees. Hence, we can assume the optimal scheme satisfies (11) with equality. Moving to the second layer, the following claim bounds the required redundancy assigned to the sub-trees $T(2,1),\cdots,T(2,n)$. A similar claim holds for any other group of siblings in this layer.

Claim 3. The following inequality holds:
$$|\mathcal{D}_{T(2,1)}|+\cdots+|\mathcal{D}_{T(2,n)}|\ge (s+1)\left(|\mathcal{D}_{T(1,1)}|-rd\right).$$

Proof of Claim 3. First, note that $|\mathcal{D}_{T(1,1)}\setminus\mathcal{D}(1,1)|\ge |\mathcal{D}_{T(1,1)}|-rd$. If the claim does not hold, then there exists a data point $x\in\mathcal{D}_{T(1,1)}\setminus\mathcal{D}(1,1)$ such that $x$ is placed in at most $s$ sub-trees rooted at the node $(1,1)$, e.g. $T(2,1),\cdots,T(2,s)$. Note that besides the sub-tree $T(1,1)$, $x$ is placed in only $s$ more sub-trees, e.g. $T(1,2),\cdots,T(1,s+1)$. Now consider a straggling pattern where $T(1,2),\cdots,T(1,s+1)$ and $T(2,1),\cdots,T(2,s)$ fail to return their results. Then, $x$ is missed at the master, which fails the aggregation recovery.

By the same logic used in the above proof, Claim 3 holds for any parent node and its children, i.e. for any layer $l\in[L]$ and $i\in[n^{l-1}]$,
$$|\mathcal{D}_{T(l,n(i-1)+1)}|+\cdots+|\mathcal{D}_{T(l,ni)}|\ge (s+1)\left(|\mathcal{D}_{T(l-1,i)}|-rd\right).$$
(12)
Specifically, applying (12) to layer $L$ and noting that $|\mathcal{D}_{T(L,i)}|=|\mathcal{D}(L,i)|=rd$ for any $i$, we conclude that
$$rd\left(\frac{n}{s+1}+1\right)\ge |\mathcal{D}_{T(L-1,1)}|.$$
We can then use the above inequality and furthermore write (12) for layer $L-1$, which results in
$$rd\left(\left(\frac{n}{s+1}\right)^2+\frac{n}{s+1}+1\right)\ge |\mathcal{D}_{T(L-2,1)}|.$$
By deriving the above inequality recursively up to the master node, we get
$$rd\left(\left(\frac{n}{s+1}\right)^{L-1}+\cdots+\frac{n}{s+1}+1\right)\ge \frac{s+1}{n}\,d,$$
which concludes the optimality claim in Theorem 9.

K Proof of Theorem 10

Let us begin with the lower bound
$$\mathbb{E}[T_{CR}]\ge \frac{r_{CR}\,d}{\mu}\log\frac{1}{\alpha}+a\,r_{CR}\,d+\left(n(1-\alpha)-o(n)+L-1\right)(1-o(1))\,t_c+o(1).$$
Consider the group of siblings¹ placed in layer $L$ whose result reaches their parent node first. Let $\widehat{T}$ denote the time at which the parent of such a group is able to recover the partial gradient from its fastest children's computations, i.e. the fastest $n-s$ of them. We also denote by $T_1,\cdots,T_n$ the partial gradient computation times of the siblings. According to the random computation time model described in the paper and the computation load of CR, each $T_i$ is a shifted exponential with shift parameter $a\,d_i=a\,r_{CR}\,d$ and rate parameter $\frac{\mu}{d_i}=\frac{\mu}{r_{CR}\,d}$. Since CR is robust to any $s$ stragglers per parent, the partial gradient computation time for any group of siblings is $T_{(n-s)}$, i.e. the $(n-s)$'th order statistic of $\{T_1,\cdots,T_n\}$. In [167], the authors consider coded computation scenarios in a master-worker topology where the master only needs to wait for the results of the first $\alpha$ fraction of the workers. Moreover, as in the scenario here, the limited bandwidth at the master only allows for one transmission at a time. From the latency analysis in [167], we have the following.

¹A group of siblings refers to $n$ nodes with the same parent.

Lemma 8 (Theorem 2, [167]). With probability $1-o(1)$, we have
$$\widehat{T}\ge T_{(n-s)}+\left(n(1-\alpha)-o(n)\right)t_c.$$
(13)
Now, conditioned on the event in (13), we can write
$$\mathbb{E}[T_{CR}]\ge \mathbb{E}\left[T_{(n-s)}+\left(n(1-\alpha)-o(n)\right)t_c\right](1-o(1))+\mathbb{E}\left[T_{(n-s)}+L\,t_c\right]o(1)$$
$$\ge \mathbb{E}\left[T_{(n-s)}\right]+\left(n(1-\alpha)-o(n)+L-1\right)(1-o(1))\,t_c$$
$$\overset{(a)}{\ge} \frac{r_{CR}\,d}{\mu}\log\frac{1}{\alpha}+a\,r_{CR}\,d+\left(n(1-\alpha)-o(n)+L-1\right)(1-o(1))\,t_c+o(1),$$
where inequality (a) uses the facts that $\mathbb{E}\left[T_{(n-s)}\right]=\frac{r_{CR}\,d}{\mu}\left(H_n-H_s\right)+a\,r_{CR}\,d$ and $\log(i)<H_i=1+\frac{1}{2}+\cdots+\frac{1}{i}<\log(i+1)$ for any positive integer $i$.

To derive the upper bound on $\mathbb{E}[T_{CR}]$, that is,
$$\mathbb{E}[T_{CR}]\le \frac{r_{CR}\,d}{\mu}\log\frac{1}{\alpha}+a\,r_{CR}\,d+n(1-o(1))L\,t_c+o(1),$$
we prove the following concentration inequality on the computation time of any group of siblings.

Lemma 9. Let $T_1,\cdots,T_n$ denote i.i.d. exponential random variables with constant rate $\lambda=\Theta(1)$. For $\varepsilon=\Theta\left(\frac{1}{n^{1/4}}\right)$ and constant $\alpha=\frac{s}{n}$, we have the following concentration bound for the order statistic $T_{(n-s)}$:
$$\Pr\left[T_{(n-s)}-\mathbb{E}\left[T_{(n-s)}\right]\ge \varepsilon\right]\le e^{-\Theta(\sqrt{n})}. \tag{14}$$
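The proof below rests on writing the gaps between consecutive exponential order statistics as independent exponentials, which yields $\mathbb{E}[T_{(n-s)}]\propto H_n-H_s$. A small simulation (with hypothetical $n=20$, $s=5$, unit rate) confirms this mean:

```python
import random

random.seed(2)
lam, n, s, trials = 1.0, 20, 5, 50_000

# Monte Carlo mean of the (n-s)-th order statistic of n i.i.d. Exp(lam) samples
emp = sum(
    sorted(random.expovariate(lam) for _ in range(n))[n - s - 1]
    for _ in range(trials)
) / trials

# closed form from the independent-gaps representation: (H_n - H_s) / lam
harmonic = lambda m: sum(1.0 / i for i in range(1, m + 1))
theory = (harmonic(n) - harmonic(s)) / lam
```

The empirical mean lands within Monte Carlo error of $(H_{20}-H_5)\approx 1.31$.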
As described in Section 4.3.1, in the proposedCR scheme all the worker nodes start their assigned partial gradient computations simultaneously; each parent waits for enough number of children to receive their results; combines with its partial computation and sends the result uptoitsparent. ToupperboundthetotalaggregationtimeT CR ,onecanseparateallthelocal computations from the communications. Let T comp denote the time at which enough number of workers have executed their local gradient computations and no more local computation is neededforthefinalgradientrecovery. Moreover, weassumethatallthecommunicationsfrom children to parent are pipe-lined. Hence, we haveE[T CR ]≤ E[T comp ]+L(n− s)t c . To bound the computation time T comp , consider the following event which keeps the local computation times for all the N/n groups of siblings concentrated below their average deviated by ε = Θ 1 n 1/4 : E 1 := n T gr (n− s) ≤ E h T gr (n− s) i +ε for all the N/n groups gr o , where a groupgr is a collection ofn children with the same parent, i.e. there are N/n groups in the (n,L)–regular tree. For a group gr, {T gr 1 ,··· ,T gr n } denote the random run-times of the nodes in the group and T gr (n− s) represents its (n− s)’th order statistics. Clearly, E[T comp |E 1 ]≤ E T (n− s) +o(1). (15) 235 Now let e T denote the computation time corresponding to the slowest group of siblings, i.e. e T := max over all N/n groups gr T gr (n− s) . Consider the following event: E 2 := n e T >Θ(log n) o . We can write E[T comp |E c 1 ∩E c 2 ]≤ Θ(log n), (16) and E[T comp |E c 1 ∩E 2 ]P[E 2 ]≤ E h e T|E c 1 ∩E 2 i P[E 2 ] =E h e T| e T ≤ Θ(log n) i P h e T ≤ Θ(log n) i ≤ E h e T i ≤ E[T max ] = r CR d µ H N +ar CR d =Θ(log N) =LΘ(log n). (17) In the above derivation, T max denotes the largest computation time over all the N nodes. Putting (16) and (17) together, we can write E[T comp |E c 1 ]=E[T comp |E c 1 ∩E 2 ]P[E 2 ]+E[T comp |E c 1 ∩E c 2 ]P[E c 2 ] ≤ Θ(log n). 
(18)
Moreover, using the union bound over the $N/n$ groups of workers, we derive the following inequality:
$$\Pr[\mathcal{E}_1^c]\le \frac{N}{n}\,\Pr\left[T_{(n-s)}\ge \mathbb{E}\left[T_{(n-s)}\right]+\varepsilon\right]\le \Theta\left(n^{L-1}\right)e^{-\Theta(\sqrt{n})}. \tag{19}$$
Putting (15), (18) and (19) together, we have
$$\mathbb{E}[T_{comp}]=\mathbb{E}[T_{comp}\,|\,\mathcal{E}_1]\Pr[\mathcal{E}_1]+\mathbb{E}[T_{comp}\,|\,\mathcal{E}_1^c]\Pr[\mathcal{E}_1^c]\le \mathbb{E}\left[T_{(n-s)}\right]+\varepsilon+\Theta(\log n)\,\Theta\left(n^{L-1}\right)e^{-\Theta(\sqrt{n})}=\mathbb{E}\left[T_{(n-s)}\right]+o(1)=\frac{r_{CR}\,d}{\mu}\left(H_n-H_s\right)+a\,r_{CR}\,d+o(1).$$
Therefore,
$$\mathbb{E}[T_{CR}]\le \mathbb{E}[T_{comp}]+Ln(1-\alpha)t_c=\frac{r_{CR}\,d}{\mu}\left(H_n-H_s\right)+a\,r_{CR}\,d+Ln(1-\alpha)t_c+o(1)\le \frac{r_{CR}\,d}{\mu}\log\frac{1}{\alpha}+a\,r_{CR}\,d+n(1-o(1))L\,t_c+o(1),$$
which completes the proof.

Lemma 10 (Bernstein's Inequality). Suppose $E_1,\cdots,E_m$ are independent random variables such that
$$\mathbb{E}\left[|E_i|^k\right]\le \frac{1}{2}\,\mathbb{E}\left[E_i^2\right]B^{k-2}\,k!,$$
for some $B>0$ and every $i=1,\cdots,m$, $k\ge 2$. Then, for $\varepsilon>0$,
$$\Pr\left[\sum_{i=1}^{m}E_i-\sum_{i=1}^{m}\mathbb{E}[E_i]\ge \varepsilon\right]\le \exp\left(\frac{-\varepsilon^2}{2\left(\sum_{i=1}^{m}\mathbb{E}[E_i^2]+\varepsilon B\right)}\right).$$

L Proof of Optimality of the Two-step Approach

Let $(t^*,u^*(t^*),\ell^*(t^*))$ be an optimal solution of (5.27). Then, $t^*\ge 0$, $0\le u^*(t^*)\le u_{max}$, $0\le \ell^*(t^*)\le (\ell_1,\ldots,\ell_n)$, and the following holds:
$$\mathbb{E}\left(R(t^*;(u^*(t^*),\ell^*(t^*)))\right)=m=\mathbb{E}\left(R(t_{opt};(u_{opt},\ell_{opt}))\right). \tag{20}$$
Thus, $(t^*,u^*(t^*),\ell^*(t^*))$ is a feasible solution of the optimization problem in (5.23). Therefore, we only need to show that $t^*=t_{opt}$. As the optimization problem in (5.23) has a larger solution space than the two-step optimization problem in (5.27), we have the following inequality:
$$t_{opt}\le t^*. \tag{21}$$
Next, we prove that the optimal expected total aggregate return $\mathbb{E}\left(R(t;(u^*(t),\ell^*(t)))\right)$ for $t=t_{opt}$ is the same as for $t=t^*$, i.e. $\mathbb{E}\left(R(t_{opt};(u^*(t_{opt}),\ell^*(t_{opt})))\right)=m$. We first observe that, as $(u^*(t),\ell^*(t))$ maximizes the expected total aggregate return for a given deadline time $t$, we have the following:
$$\mathbb{E}\left(R(t_{opt};(u^*(t_{opt}),\ell^*(t_{opt})))\right)\ge \mathbb{E}\left(R(t_{opt};(u_{opt},\ell_{opt}))\right)=m. \tag{22}$$
Next, assume that (22) holds with strict inequality. Then, by (20), we have the following:
$$\mathbb{E}\left(R(t_{opt};(u^*(t_{opt}),\ell^*(t_{opt})))\right)>\mathbb{E}\left(R(t^*;(u^*(t^*),\ell^*(t^*)))\right).$$
(23)
By the monotonicity of $\mathbb{E}\left(R(t;(u^*(t),\ell^*(t)))\right)$ with respect to $t$, (23) implies $t_{opt}>t^*$, which is a contradiction. Hence, $\mathbb{E}\left(R(t_{opt};(u^*(t_{opt}),\ell^*(t_{opt})))\right)=m$. Therefore, using the fact that $t=t^*$ is the minimum $t$ such that $\mathbb{E}\left(R(t;(u^*(t),\ell^*(t)))\right)=m$, we have $t^*\le t_{opt}$. Hence, together with (21), the claim $t^*=t_{opt}$ is proved.

M Proof of Theorem 11

Using the computation and communication models presented in (5.11) and (5.13) in Section 5.2.2, we have the following for the execution time of one epoch for node $j\in[n+1]$:
$$T_j=T^{(j,1)}_{cmp}+T^{(j,2)}_{cmp}+T^{(j)}_{com\text{-}d}+T^{(j)}_{com\text{-}u}=\frac{\tilde{\ell}_j}{\mu_j}+T^{(j,2)}_{cmp}+\tau_j N^{(j)}_{com}, \tag{24}$$
where $N^{(j)}_{com}\sim NB(r=2,\,p=1-p_j)$ has a negative binomial distribution, while $T^{(j,2)}_{cmp}\sim \mathcal{E}\left(\frac{\alpha_j\mu_j}{\tilde{\ell}_j}\right)$ has an exponential distribution. Here, we have used the fact that $T^{(j)}_{com\text{-}d}$ and $T^{(j)}_{com\text{-}u}$ are IID geometric $G(p_j)$ random variables, and the sum of $r$ IID $G(p)$ random variables is $NB(r,p)$. Therefore, the probability distribution of $T_j$ is obtained as follows:
$$\Pr(T_j\le t)=\Pr\left(\frac{\tilde{\ell}_j}{\mu_j}+T^{(j,2)}_{cmp}+\tau_j N^{(j)}_{com}\le t\right)=\sum_{\nu=2}^{\infty}\Pr\left(N^{(j)}_{com}=\nu\right)\cdot\Pr\left(T^{(j,2)}_{cmp}\le t-\frac{\tilde{\ell}_j}{\mu_j}-\tau_j N^{(j)}_{com}\,\Big|\,N^{(j)}_{com}=\nu\right)$$
$$\overset{(a)}{=}\sum_{\nu=2}^{\infty}\Pr\left(N^{(j)}_{com}=\nu\right)\Pr\left(T^{(j,2)}_{cmp}\le t-\frac{\tilde{\ell}_j}{\mu_j}-\tau_j\nu\right)\overset{(b)}{=}\sum_{\nu=2}^{\infty}U\left(t-\frac{\tilde{\ell}_j}{\mu_j}-\tau_j\nu\right)(\nu-1)(1-p_j)^2 p_j^{\nu-2}\left(1-e^{-\frac{\alpha_j\mu_j}{\tilde{\ell}_j}\left(t-\frac{\tilde{\ell}_j}{\mu_j}-\tau_j\nu\right)}\right), \tag{25}$$
where (a) holds due to the independence of $T^{(j,2)}_{cmp}$ and $N^{(j)}_{com}$, while in (b) we have used $U(\cdot)$ to denote the unit step function with $U(x)=1$ for $x>0$ and $U(x)=0$ for $x\le 0$. For a fixed $t$, $\Pr(T_j\le t)=0$ if $t\le 2\tau_j$. For $t>2\tau_j$, let $\nu_m\ge 2$ satisfy the following criteria:
$$t-\tau_j\nu_m>0, \qquad t-\tau_j(\nu_m+1)\le 0. \tag{26}$$
Therefore, for $\nu>\nu_m$, the terms in (b) are 0. Finally, as $\mathbb{E}(R_j(t;\tilde{\ell}_j))=\tilde{\ell}_j\,\mathbb{E}(1_{\{T_j\le t\}})=\tilde{\ell}_j\,\Pr(T_j\le t)$, we arrive at the result of Theorem 11.
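The series in (25) can be checked against a direct simulation of the latency model in (24); the node parameters below are hypothetical:

```python
import math, random

random.seed(3)
mu, alpha, tau, p, ell, t = 2.0, 1.0, 0.1, 0.3, 1.0, 2.0   # hypothetical node parameters
trials = 100_000

def geometric():
    # G(p) on {1, 2, ...}: number of link uses until success, failure probability p
    return 1 + int(math.log(1.0 - random.random()) / math.log(p))

# Monte Carlo: T = l/mu + Exp(alpha*mu/l) + tau * (two geometric link uses, i.e. NB(2, 1-p))
emp = sum(
    ell / mu + random.expovariate(alpha * mu / ell) + tau * (geometric() + geometric()) <= t
    for _ in range(trials)
) / trials

# truncated series (25): terms with t - l/mu - tau*nu <= 0 vanish via the unit step
theory = sum(
    (nu - 1) * (1 - p) ** 2 * p ** (nu - 2)
    * (1 - math.exp(-(alpha * mu / ell) * (t - ell / mu - tau * nu)))
    for nu in range(2, int(t / tau) + 1)
    if t - ell / mu - tau * nu > 0
)
```

The empirical CDF value and the truncated series agree to within Monte Carlo error.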
N Proof of Monotonically Increasing Behavior of the Optimized Expected Return

Recall from Section 5.4 that for a given deadline time $t$ at the server, the expectation of the return $R_j(t;\tilde{\ell}_j)=\tilde{\ell}_j 1_{\{T_j\le t\}}$ for node $j\in[n+1]$ satisfies the following:
$$\mathbb{E}(R_j(t;\tilde{\ell}_j))=\begin{cases}\sum_{\nu=2}^{\nu_m}U\left(t-\frac{\tilde{\ell}_j}{\mu_j}-\tau_j\nu\right)h_\nu\, f_\nu(t;\tilde{\ell}_j) & \text{if } \nu_m\ge 2, \\ 0 & \text{otherwise,}\end{cases}$$
where $U(\cdot)$ is the unit step function with $U(x)=1$ for $x>0$ and $0$ otherwise, $f_\nu(t;\tilde{\ell}_j)=\tilde{\ell}_j\left(1-e^{-\frac{\alpha_j\mu_j}{\tilde{\ell}_j}\left(t-\frac{\tilde{\ell}_j}{\mu_j}-\tau_j\nu\right)}\right)$, $h_\nu=(\nu-1)(1-p_j)^2 p_j^{\nu-2}$, and $\nu_m\in\mathbb{Z}$ satisfies $t-\tau_j\nu_m>0$, $t-\tau_j(\nu_m+1)\le 0$.

Fix $j\in[n+1]$, and consider a fixed load $\tilde{\ell}_j$ and a given $\nu\in\mathbb{Z}$. Then, $f_\nu(t;\tilde{\ell}_j)$ is monotonically increasing in $t$, as $\frac{\partial f_\nu(t;\tilde{\ell}_j)}{\partial t}=\alpha_j\mu_j\, e^{-\frac{\alpha_j\mu_j}{\tilde{\ell}_j}\left(t-\frac{\tilde{\ell}_j}{\mu_j}-\tau_j\nu\right)}>0$ for all $t>0$. Furthermore, by definition, $\nu_m$ is monotonically increasing in the deadline time $t$. Therefore, the total number of terms in the expression for $\mathbb{E}(R_j(t;\tilde{\ell}_j))$ increases monotonically with $t$, and each of those terms increases monotonically with $t$. Thus, for a fixed $\tilde{\ell}_j$, $\mathbb{E}(R_j(t;\tilde{\ell}_j))$ is monotonically increasing in $t$.

Consider two different deadline times $t=t_1$ and $t=t_2$ with $t_2>t_1$. Based on the discussion in Section 5.4, let $\tilde{\ell}_j=\ell^*_j(t_1)$ be the optimal load that maximizes $\mathbb{E}(R_j(t_1;\tilde{\ell}_j))$, and let $\mathbb{E}(R_j(t_1;\ell^*_j(t_1)))$ be the corresponding optimized expected return. Similarly, let $\tilde{\ell}_j=\ell^*_j(t_2)$ be the optimal load that maximizes $\mathbb{E}(R_j(t_2;\tilde{\ell}_j))$, and let $\mathbb{E}(R_j(t_2;\ell^*_j(t_2)))$ be the corresponding optimized expected return. Since $t_2>t_1$ and the expected return is monotonically increasing with $t$, we have $\mathbb{E}(R_j(t_2;\ell^*_j(t_1)))\ge \mathbb{E}(R_j(t_1;\ell^*_j(t_1)))$. Since $\mathbb{E}(R_j(t_2;\ell^*_j(t_2)))$ is the optimal expected return for $t_2$, it follows that $\mathbb{E}(R_j(t_2;\ell^*_j(t_2)))\ge \mathbb{E}(R_j(t_2;\ell^*_j(t_1)))\ge \mathbb{E}(R_j(t_1;\ell^*_j(t_1)))$. Therefore, the optimized expected return $\mathbb{E}(R_j(t;\ell^*_j(t)))$ is monotonically increasing in the deadline time $t$.
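The monotonicity argument above can be observed numerically for a single $f_\nu$ term with $\nu=2$; the parameters are hypothetical, and a grid search over the load stands in for the exact maximization:

```python
import math

alpha, mu, tau = 1.0, 2.0, 0.1     # hypothetical node parameters, single term with nu = 2

def f2(t, l):
    # expected-return summand for nu = 2; zero outside the feasible region
    if l <= 0 or t - l / mu - 2 * tau <= 0:
        return 0.0
    return l * (1 - math.exp(-(alpha * mu / l) * (t - l / mu - 2 * tau)))

def optimized_return(t):
    # grid search over the load stands in for the exact maximizer l*_j(t)
    return max(f2(t, l / 1000.0) for l in range(1, 4001))

deadlines = (0.5, 0.8, 1.1, 1.4)
vals = [optimized_return(t) for t in deadlines]   # should be non-decreasing in t
```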
O One-shot Solution for AWGN

For node $j\in[n+1]$, consider the function $f_\nu(t;\tilde{\ell}_j)=\tilde{\ell}_j\left(1-e^{-\frac{\alpha_j\mu_j}{\tilde{\ell}_j}\left(t-\frac{\tilde{\ell}_j}{\mu_j}-\tau_j\nu\right)}\right)$ for $\nu\in\{2,\ldots,\nu_m\}$. We recall that $f_\nu(t;\tilde{\ell}_j)$ is strictly concave for $\tilde{\ell}_j>0$. Furthermore, $f_\nu(t;\tilde{\ell}_j)\le 0$ for $\tilde{\ell}_j\ge \mu_j(t-\tau_j\nu)$. Solving $f'_\nu(t;\tilde{\ell}_j)=0$, we obtain the optimal load maximizing $f_\nu(t;\tilde{\ell}_j)$ as follows:
$$\ell^*_j(t,\nu)=\frac{-\alpha_j\mu_j}{W_{-1}\left(-e^{-(1+\alpha_j)}\right)+1}\,(t-\nu\tau_j), \tag{27}$$
where $W_{-1}(\cdot)$ is the minor branch of the Lambert W-function, the Lambert W-function being the inverse of $f(W)=We^W$. Next, consider the special case of the AWGN channel in Section 5.4, for which the expected return for node $j\in[n+1]$ simplifies as follows:
$$\mathbb{E}(R_j(t;\tilde{\ell}_j))=U\left(t-\frac{\tilde{\ell}_j}{\mu_j}-2\tau_j\right)f_2(t;\tilde{\ell}_j). \tag{28}$$
When $t\le 2\tau_j$, $\mathbb{E}(R_j(t;\tilde{\ell}_j))=0$, thus $\ell^*_j(t)=0$. For $t>2\tau_j$, using (27) and the fact that $\tilde{\ell}_j$ is upper bounded by $\ell_j$, we have $\ell^*_j(t)=\min\{\ell^*_j(t,2),\ell_j\}$, where $\ell^*_j(t,2)$ is as follows:
$$\ell^*_j(t,2)=s_j(t-2\tau_j), \tag{29}$$
where $s_j=\frac{-\alpha_j\mu_j}{W_{-1}\left(-e^{-(1+\alpha_j)}\right)+1}$. As $\ell^*_j(t,2)$ is strictly increasing, there is a unique deadline time for which $\ell^*_j(t,2)=\ell_j$. Let $t=\zeta_j$ be the deadline time for which this holds. Then, using (29), we have the following for $\zeta_j$:
$$\zeta_j=\frac{\ell_j}{s_j}+2\tau_j. \tag{30}$$
Thus, we have the following closed form expression for $\ell^*_j(t)$:
$$\ell^*_j(t)=\begin{cases}0 & \text{if } t\le 2\tau_j, \\ s_j(t-2\tau_j) & \text{if } 2\tau_j<t\le \zeta_j, \\ \ell_j & \text{otherwise.}\end{cases} \tag{31}$$
Furthermore, by (29), the optimal expected return from the $j$-th node for deadline time $2\tau_j<t\le \zeta_j$ can be simplified as follows:
$$\mathbb{E}(R_j(t;\ell^*_j(t)))=\ell^*_j(t)\left(1-e^{-\frac{\alpha_j\mu_j}{\ell^*_j(t)}\left(t-\frac{\ell^*_j(t)}{\mu_j}-2\tau_j\right)}\right)=s_j(t-2\tau_j)\left(1-e^{-\left(\frac{\alpha_j\mu_j}{s_j}-\alpha_j\right)}\right)=\tilde{s}_j(t-2\tau_j), \tag{32}$$
where $\tilde{s}_j=s_j\left(1-e^{-\left(\frac{\alpha_j\mu_j}{s_j}-\alpha_j\right)}\right)$. Thus, we have the following closed form expression for the optimal expected return:
$$\mathbb{E}(R_j(t;\ell^*_j(t)))=\begin{cases}0 & \text{if } t\le 2\tau_j, \\ \tilde{s}_j(t-2\tau_j) & \text{if } 2\tau_j<t\le \zeta_j, \\ \ell_j\left(1-e^{-\frac{\alpha_j\mu_j}{\ell_j}\left(t-\frac{\ell_j}{\mu_j}-2\tau_j\right)}\right) & \end{cases}$$
otherwise. (33)
Thus, we have a closed form for the maximum expected total aggregate return from the nodes until deadline time $t$ as follows:
$$\mathbb{E}(R(t;\ell^*(t)))=\sum_{\substack{j\in[n+1] \\ 2\tau_j<t\le \zeta_j}}\tilde{s}_j(t-2\tau_j)+\sum_{\substack{j\in[n+1] \\ \zeta_j<t}}\ell_j\left(1-e^{-\frac{\alpha_j\mu_j}{\ell_j}\left(t-\frac{\ell_j}{\mu_j}-2\tau_j\right)}\right), \tag{34}$$
which is monotonically increasing in $t$.

P Towards Convergence Analysis of CodedFedL

For proving the convergence of CodedFedL, we consider $u^*(t^*)$ to be large, and make the following assumption for simplification:
$$\frac{G^T G}{u^*(t^*)}=I_m. \tag{35}$$
The key motivation for our assumption is the observation that, by the weak law of large numbers, in the limit that the coding redundancy $u^*(t^*)\to\infty$, each diagonal entry of $\frac{G^T G}{u^*(t^*)}$ converges to 1 in probability, while each off-diagonal entry converges to 0 in probability. Furthermore, as we demonstrate via numerical experiments in Section 5.5, the convergence curve as a function of iteration for CodedFedL significantly overlaps that of the naive uncoded scheme where the server waits to aggregate the results of all the clients. Hence, for simplifying our analysis, we assume in the remaining proof that $\frac{G^T G}{u^*(t^*)}=I_m$. The general analysis will be addressed in future work.

In the following, we list the remaining assumptions in our analysis:
• Assumption 1: The model parameter space $\mathcal{W}\subseteq\mathbb{R}^{q\times c}$ is a closed and convex set.
• Assumption 2: $\sup_{\theta\in\mathcal{W}}\|\theta-\theta^{(0)}\|_F^2\le R^2$, where $\theta^{(0)}\in\mathcal{W}$ is given.
• Assumption 3: $\left\|\frac{1}{\ell^*_j(t^*)}\widetilde{X}^{(j)T}(\widetilde{X}^{(j)}\theta-\widetilde{Y}^{(j)})\right\|_F^2\le B_j$ for all $\theta\in\mathcal{W}$, for client $j\in[n]$.
• Assumption 4: $\max\{\text{singular values of } \widehat{X}^{(j)}\}\le L_j$, for client $j\in[n]$.
• Assumption 5: $\Pr(T_C\le t^*)=1$, i.e. the gradient over the coded data is available at the MEC server by $t^*$ with probability 1.
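The limiting behavior motivating assumption (35) can be checked empirically: for a matrix $G$ with i.i.d. standard normal entries, $\frac{G^TG}{u}$ approaches $I_m$ as $u$ grows. A sketch with hypothetical sizes:

```python
import random

random.seed(5)
m, u = 4, 20_000                    # hypothetical model dimension and coding redundancy
G = [[random.gauss(0.0, 1.0) for _ in range(m)] for _ in range(u)]

# empirical Gram matrix G^T G / u; should be close to the identity I_m
gram = [
    [sum(G[k][i] * G[k][j] for k in range(u)) / u for j in range(m)]
    for i in range(m)
]
max_err = max(
    abs(gram[i][j] - (1.0 if i == j else 0.0)) for i in range(m) for j in range(m)
)
```

Every entry deviates from the identity by $O(1/\sqrt{u})$, consistent with the weak law of large numbers.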
Under the above assumptions, for a given model $\theta\in\mathcal{W}$, the stochastic gradient obtained by the MEC server by the deadline time $t^*$ is as follows:
$$g_M(\theta)=\frac{1}{m}\left(g_C(\theta)+g_U(\theta)\right)=\frac{1}{m}\left(\widehat{X}^T W^T W(\widehat{X}\theta-Y)+\sum_{j=1}^{n}1_{\{T_j\le t^*\}}\widetilde{X}^{(j)T}(\widetilde{X}^{(j)}\theta-\widetilde{Y}^{(j)})\right)$$
$$=\frac{1}{m}\Bigg(\sum_{j=1}^{n}\sum_{\substack{k\in[\ell_j] \\ (\widehat{x}^{(j)}_k,y^{(j)}_k)\in\widetilde{\mathcal{D}}_j}}\left(1-\Pr(T_j\le t^*)\right)\widehat{x}^{(j)T}_k(\widehat{x}^{(j)}_k\theta-y^{(j)}_k)+\sum_{j=1}^{n}\sum_{\substack{k\in[\ell_j] \\ (\widehat{x}^{(j)}_k,y^{(j)}_k)\in\widehat{\mathcal{D}}_j\setminus\widetilde{\mathcal{D}}_j}}\widehat{x}^{(j)T}_k(\widehat{x}^{(j)}_k\theta-y^{(j)}_k)+\sum_{j=1}^{n}\sum_{\substack{k\in[\ell_j] \\ (\widehat{x}^{(j)}_k,y^{(j)}_k)\in\widetilde{\mathcal{D}}_j}}1_{\{T_j\le t^*\}}\widehat{x}^{(j)T}_k(\widehat{x}^{(j)}_k\theta-y^{(j)}_k)\Bigg). \tag{36}$$
Averaging over the stochastic conditions of compute and communication, we have the following for a given $\theta\in\mathcal{W}$:
$$\mathbb{E}(g_M(\theta))=\frac{1}{m}\Bigg(\sum_{j=1}^{n}\sum_{\substack{k\in[\ell_j] \\ (\widehat{x}^{(j)}_k,y^{(j)}_k)\in\widetilde{\mathcal{D}}_j}}\left(1-\Pr(T_j\le t^*)\right)\widehat{x}^{(j)T}_k(\widehat{x}^{(j)}_k\theta-y^{(j)}_k)$$
Furthermore, forθ 1 ∈W andθ 2 ∈W, we have the following bound 245 for smoothness: ∥g(θ 1 )− g(θ 2 )∥ F = 1 m n X j=1 b X (j)T ( b X (j) θ 1 − b Y (j) )− b X (j)T ( b X (j) θ 2 − b Y (j) ) F , = 1 m n X j=1 b X (j)T b X (j) (θ 1 − θ 2 ) F , ≤ 1 m n X j=1 b X (j)T b X (j) (θ 1 − θ 2 ) F , a ≤ 1 m n X j=1 b X (j)T b X (j) 2 θ 1 − θ 2 F , ≤ 1 m n X j=1 L 2 j ∥θ 1 − θ 2 ∥ F =L∥θ 1 − θ 2 ∥ F . (41) In (a),∥A∥ 2 denotes the spectral norm of matrix A. Therefore, by Theorem 2.1 in [9], for a total number of r max iterations and a constant learning rate of µ (r) = 1 L+1/γ with γ = q 2R 2 Br max , we have the following result: E 1 2m n X j=1 ∥ b X (j) θ 1:r max − Y (j) ∥ 2 F − min θ ∈W 1 2m n X j=1 ∥ b X (j) θ − Y (j) ∥ 2 F ≤ R r 2B r max + LR 2 r max , (42) where θ 1:r max = 1 r max P r max r=1 θ (r) . Hence, for achieving an error less than a given ϵ> 0, the iteration complexity of CodedFedL is r max =O(R 2 max( 2B ϵ 2 , L ϵ )). E 1 2m n X j=1 ∥ b X (j) θ 1:r max − Y (j) ∥ 2 F − min θ ∈W 1 2m n X j=1 ∥ b X (j) θ − Y (j) ∥ 2 F ≤ R r 2B r max + LR 2 r max , 246 Q Privacy Budget for CodedFedL We utilize ϵ -mutual-information differential privacy (MI-DP) metric, as proposed in [36], for finding privacy leakage in CodedFedL. For completeness, we first provide the definition of ϵ -MI-DP (shown to be stronger than the standard (ϵ,δ )-DP metric) as presented in [188]. • ϵ -Mutual-Information Differential Privacy: Let D N = (D 1 ,...,D N ) be a database with N entries. D N returns a query as per a randomized mechanism Q(·). Let D − i be the database with all entries except D i . Then, the randomized mechanism satisfies ϵ -MI-DP if the following is satisfied: sup i,P(D N ) I(D i ;Q(D N )|D − i )≤ ϵ bits, (43) where the supremum is taken over all distributions on D N . Next, leveraging the result for random linear projections in [188], we can calculate the privacy budget required for sharing the local parity dataset ( ( X (j) , ( Y (j) ) for a given client j∈[n]. 
As we aim to preserve the privacy of each entry of $\hat{X}^{(j)}$, we need to compute the required privacy budget with respect to the largest diagonal entry of the scaling matrix $W_j$. Therefore, replacing $W_j$ by the identity matrix, we equivalently consider the privacy leakage from sharing $(\breve{X}^{(j)} = G_j\hat{X}^{(j)},\, \breve{Y}^{(j)} = G_j Y^{(j)})$ (see Sections 5.3.2 and 5.3.4 for details). Furthermore, we assume that the entries of $G_j$ are drawn independently from a standard normal distribution. Then, based on the result for $\epsilon$-MI-DP from Section III-B of [188], CodedFedL needs to allocate a privacy budget $\epsilon_j$ for sharing $u^*(t^*)$ local parity data points $(\breve{X}^{(j)}, \breve{Y}^{(j)})$ with the MEC server, where $\epsilon_j$ is given by:

\[
\epsilon_j = \frac{1}{2}\log_2\bigg(1 + \frac{u^*(t^*)}{f^2(\hat{X}^{(j)})}\bigg), \tag{44}
\]

where

\[
f(\hat{X}^{(j)}) = \min_{k_2 \in [q]} \sqrt{\sum_{k_1=1}^{\ell_j} \big|\hat{x}^{(j)}_{k_1}(k_2)\big|^2 - \max_{k_3 \in [\ell_j]} \big|\hat{x}^{(j)}_{k_3}(k_2)\big|^2}.
\]

Here, we have used $\hat{x}^{(j)}_i(k)$ to denote the value of the $i$-th data point at the $k$-th feature in the raw database $\hat{X}^{(j)}$. Intuitively, when the raw data distribution is concentrated along a small number of features, $f(\hat{X}^{(j)})$ is small and a larger privacy budget is required for generating coded data that effectively hides those vulnerable features. In contrast, when the raw data distribution is uniform in feature space, very little information is leaked by the parity data generated in CodedFedL.

R Proof of Converse for $C^*_{HM}$

To prove the converse for $C^*_{HM}$ in (6.8), we note that $g \in \mathbb{R}^{p \times 1}$ and that the master must be able to recover $g$ from the messages uploaded by the helper nodes. Therefore, by the cut-set bound, we have $C_{HM} \ge 1$ for every achievable tuple in $\mathcal{A}$.

S Proof of Converse for $C^*_{EH}$

We now provide the proof of converse for $C^*_{EH}$ in (6.9). Without loss of generality, let us focus on client node 1. Let $\Omega(s)$ be the set of all patterns of $s$ out of $n_h$ link failures for client node 1.
Let $l_1, \ldots, l_{n_h}$ denote the normalized communication loads from client node 1 to the helper nodes in the case of no link failure. For a failure pattern $f \in \Omega(s)$, let $i_1, \ldots, i_{n_h-s}$ denote the $(n_h - s)$ surviving helper links. For recoverability, the update from client node 1 needs to be available at the collection of surviving helper nodes. Thus, by the cut-set bound, we obtain the following:

\[
l_{i_1} + l_{i_2} + \ldots + l_{i_{n_h-s}} \ge 1. \tag{45}
\]

Clearly, $|\Omega(s)| = \binom{n_h}{n_h-s}$. Furthermore, each load $l_j$, $j \in [n_h]$, occurs in $\binom{n_h-1}{n_h-1-s}$ of these patterns. Summing (45) over all $\binom{n_h}{n_h-s}$ patterns, the following holds:

\[
\binom{n_h-1}{n_h-1-s}\big(l_1 + \ldots + l_{n_h}\big) \ge \binom{n_h}{n_h-s} \quad \therefore\; C_{EH} \ge \frac{n_h}{n_h-s}. \tag{46}
\]

T Upper Bound for $C_{HM}$ for AMC

For the maximum matching rows, the corresponding $(n_h - s)$ helpers with entry 1 each communicate a message of length $\frac{p}{n_h-s}$, while for each of the remaining $(n_e - M)$ client rows, $(n_h - s)$ messages, each of length $\frac{p}{n_h-s}$, have to be communicated to the master. Combining the two communication loads, we have the following:

\[
C^f_{HM} = (n_h - s)\,\frac{1}{n_h-s} + (n_e - M)(n_h - s)\,\frac{1}{n_h-s} = n_e - M + 1. \tag{47}
\]

Thus, the key component in analyzing the average helper-to-master communication load $C_{HM}$ for AMC is to find the average of the maximum matching $M$ across all straggling patterns $f \in \Omega(s)$. As we are only interested in the average value of $M$, this is equivalent to finding the expectation $\mathbb{E}[M]$ when each pattern $f \in \Omega(s)$ occurs with uniform probability $\frac{1}{|\Omega(s)|}$. Furthermore, for each client there are $\binom{n_h}{n_h-s}$ possible straggling patterns, each occurring with uniform probability, and the straggling patterns of the clients are independent of each other. The problem of finding $\mathbb{E}[M]$ is thus equivalent to finding the expectation of the maximum number $M$ of balls in any bin when we throw $m = n_e$ balls independently and uniformly at random into $n = \binom{n_h}{n_h-s}$ bins.
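The balls-in-bins reduction above can be checked by direct simulation. The sketch below uses small, hypothetical values of $n_e$, $n_h$, and $s$, estimates $\mathbb{E}[M]$ empirically, and confirms the lower bound $m/n$ that drives the bound on $C_{HM}$.

```python
import math
import random

def avg_max_load(m, n, trials=20000, seed=0):
    """Estimate E[max balls in any bin] for m balls thrown u.a.r. into n bins."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counts = [0] * n
        for _ in range(m):
            counts[rng.randrange(n)] += 1
        total += max(counts)
    return total / trials

n_e, n_h, s = 20, 5, 1                 # hypothetical sizes
n_bins = math.comb(n_h, n_h - s)       # number of straggling patterns per client
EM = avg_max_load(n_e, n_bins)         # empirical E[M]

# Eq. (48): E[M] >= m/n (pigeonhole: max load >= average load in every trial)
# Eq. (49): the resulting bound improves on the naive C_HM = n_e
C_HM_bound = n_e - n_e / n_bins + 1
```

Since the maximum load is at least the average load in every single trial, `EM >= n_e / n_bins` holds deterministically, and `C_HM_bound` is strictly below the naive load $n_e$ whenever $n_e > \binom{n_h}{n_h-s}$... the gap Remark 38 illustrates.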
Figure 6: Comparing the achievability bound for $C_{HM}$ in (49) with the $C_{HM}$ of the naive MDS approach, based on simple forwarding of the messages received at the helpers to the master without any local aggregation.

Let $X_b$, $b \in [n]$, be the random variable denoting the number of balls in bin $b$. As the total number of balls is $m$, we have $\sum_{b=1}^{n} X_b = m$. Using symmetry and linearity of expectation, $\mathbb{E}[X_b] = \frac{m}{n}$ for each bin $b$. As $M = \sup_{b \in [n]} X_b$, the following result holds:

\[
\mathbb{E}[M] \ge \mathbb{E}[X_1] = \frac{m}{n}. \tag{48}
\]

Combining the results in (47) and (48), we have the following bound for $C_{HM}$:

\[
C_{HM} \le n_e - \frac{n_e}{\binom{n_h}{n_h-s}} + 1. \tag{49}
\]

Remark 38. For the regime $n_h = \lfloor \log(n_e) \rfloor$ and $s = \lfloor 0.2\, n_h \rfloor$, Fig. 6 illustrates the gain of AMC, using the bound in (49), over the naive MDS approach, in which the helpers simply forward the messages received from the clients to the master without any local aggregation, achieving $C_{HM} = n_e$.

U Convergence Analysis for DiverseFL

We now provide a convergence analysis for DiverseFL under non-IID data distribution, adapting the proof developed in [24]. While [24] performs trust bootstrapping using a small root dataset, DiverseFL applies a per-client criterion, which makes our analysis different. Furthermore, [24] assumes that the root dataset and the clients have samples from the same data distribution, whereas we consider non-IID data across clients. We recall from Section 7.2 that $\mathcal{D}_j$ denotes the data distribution at client $j \in [N] = \{1, \ldots, N\}$, and the optimal model $\theta^* \in \Theta$ minimizes the loss function $\frac{1}{N}\sum_{j=1}^{N} F_j(\theta)$, where $F_j(\theta) = \mathbb{E}_{\zeta_j}(l(\theta; \zeta_j))$ denotes the local loss function at client $j \in [N]$ corresponding to its local data distribution. In the following, we list the standard assumptions for our convergence result:

Assumption 1 ($\mu$-Strong convexity). For any $\theta, \hat{\theta} \in \Theta$ and $j \in [N]$:

\[
F_j(\theta) \ge F_j(\hat{\theta}) + \big\langle \nabla F_j(\hat{\theta}),\, \theta - \hat{\theta} \big\rangle + \frac{\mu}{2}\|\theta - \hat{\theta}\|^2. \tag{50}
\]

Assumption 2 ($L$-Lipschitz continuity).
For any $\theta, \hat{\theta} \in \Theta$ and $j \in [N]$:

\[
\|\nabla F_j(\theta) - \nabla F_j(\hat{\theta})\| \le L\|\theta - \hat{\theta}\|. \tag{51}
\]

Furthermore, let $\mathcal{M}_j \sim \mathcal{D}_j$ denote any batch of data from the data distribution $\mathcal{D}_j$ of client $j \in [N]$ such that $|\mathcal{M}_j| \ge s$. Then, similar to the empirical form of Lipschitz continuity used in the convergence analysis in [24], for any $\delta > 0$, we make the following assumption:

\[
\mathbb{P}\Bigg(\sup_{\theta, \hat{\theta} \in \Theta,\, \theta \ne \hat{\theta}} \frac{\|l(\theta; \mathcal{M}_j) - l(\hat{\theta}; \mathcal{M}_j)\|}{\|\theta - \hat{\theta}\|} \le L_1\Bigg) \ge 1 - \frac{\delta}{3}.
\]

Assumption 3 (Boundedness). Let $\mathcal{M}_j \sim \mathcal{D}_j$ denote any batch of data from $\mathcal{D}_j$ for $j \in [N]$, and let $h(\mathcal{M}_j, \theta) = \nabla l(\theta; \mathcal{M}_j) - \nabla l(\theta^*; \mathcal{M}_j)$ for $\theta \in \Theta$. Let $\mathcal{B} = \{v \in \mathbb{R}^d : \|v\| = 1\}$ denote the unit sphere. We assume that for any $\theta \in \Theta$ with $\theta \ne \theta^*$ and for any unit vector $v \in \mathcal{B}$, $\nabla l(\theta^*; \mathcal{M}_j) \cdot v$ is sub-exponential with parameters $(\sigma_1, \gamma_1)$, while $\big(h(\mathcal{M}_j, \theta) - \mathbb{E}(h(\mathcal{M}_j, \theta))\big) \cdot v / \|\theta - \theta^*\|$ is sub-exponential with parameters $(\sigma_2, \gamma_2)$, where $x_1 \cdot x_2$ denotes the dot product between $x_1$ and $x_2$. More formally, there exist positive constants $\sigma_1, \sigma_2, \gamma_1$, and $\gamma_2$ such that for any $\theta \in \Theta$ with $\theta \ne \theta^*$, for any unit vector $v \in \mathcal{B}$, and $\forall\, |\xi| \le \min\{1/\gamma_1, 1/\gamma_2\}$, we have the following:

\[
\sup_{v \in \mathcal{B}} \mathbb{E}\big(\exp\big(\xi(\nabla l(\theta^*; \mathcal{M}_j) \cdot v)\big)\big) \le \exp\bigg(\frac{\sigma_1^2 \xi^2}{2}\bigg),
\]
\[
\sup_{v \in \mathcal{B},\, \theta \in \Theta} \mathbb{E}\Bigg(\exp\bigg(\frac{\xi\big(h(\mathcal{M}_j, \theta) - \mathbb{E}(h(\mathcal{M}_j, \theta))\big) \cdot v}{\|\theta - \theta^*\|}\bigg)\Bigg) \le \exp\bigg(\frac{\sigma_2^2 \xi^2}{2}\bigg).
\]

Furthermore, we assume $\beta$-boundedness of the data heterogeneity:

\[
\Big\|\nabla F_j(\theta) - \frac{1}{N}\sum_{j' \in [N]} \nabla F_{j'}(\theta)\Big\| \le \beta, \quad \forall j \in [N],\; \forall \theta \in \Theta. \tag{52}
\]

Additionally, we assume that the model parameter space $\Theta$ is bounded, i.e., there exists $r > 0$ such that $\Theta \subset \{\theta : \|\theta - \theta^*\| \le r\sqrt{d}\}$.

Theorem 12. Assume $\epsilon_1 = 0$, $\epsilon_2 = 1/\epsilon_3$, $E = 1$, $|S^{(i)}| = N$ $\forall i \in [R]$, $\mathcal{M}^{(0)}_j \sim \mathcal{D}_j$, $|\mathcal{M}^{(0)}_j| = s$ $\forall j \in [N]$, and that Assumptions 1, 2 and 3 hold.
Then, for an arbitrary number of malicious clients, for any $\delta > 0$, and with a constant learning rate of $\alpha = \mu/(2L^2)$, DiverseFL achieves the following error bound with probability at least $1 - \tilde{\delta}$:

\[
\|\theta_i - \theta^*\| \le (1 - \rho)^i \|\theta_0 - \theta^*\| + \frac{\alpha(2 + \epsilon_3)(4\Gamma_1 + \beta)}{\rho}, \tag{53}
\]

where $\theta_i$ denotes the global model after communication round $i$, and

\[
\Gamma_1 = \sigma_1 \sqrt{\frac{2}{s}}\sqrt{d\log 6 + \log(3/\delta)}, \qquad \delta = \frac{\tilde{\delta}}{N},
\]
\[
\rho = 1 - \Bigg(\sqrt{1 - \frac{\mu^2}{4L^2}} + 8\alpha(2 + \epsilon_3)\Gamma_2 + \alpha(1 + \epsilon_3)L\Bigg),
\]
\[
\Gamma_2 = \sigma_2 \sqrt{\frac{2}{s}} \sqrt{d\log\frac{18L_2}{\sigma_2} + \frac{1}{2}d\log\frac{s}{d} + \log\frac{6\sigma_2^2\, r\sqrt{s}}{\gamma_2 \sigma_1 \delta}}, \qquad L_2 = \max\{L, L_1\}.
\]

When $|1 - \rho| < 1$, $\limsup_{R \to \infty} \|\theta_R - \theta^*\| \le \frac{\alpha(2+\epsilon_3)(4\Gamma_1 + \beta)}{\rho}$.

Proof. From Section 7.3, we recall that for round $i \in [R]$, $\tilde{N}^{(i)}$ is the set of indices of the clients for which the per-client similarity conditions in (7.4) and (7.5) are satisfied. Therefore, for $j \in \tilde{N}^{(i)}$, the client's uploaded update $z^{(i)}_j$ and its corresponding guiding update $\tilde{\Delta}^{(i)}_j$ satisfy $z^{(i)}_j \cdot \tilde{\Delta}^{(i)}_j > 0$ and $\|z^{(i)}_j\|/\|\tilde{\Delta}^{(i)}_j\| < \epsilon_3$. As $E = 1$, $\tilde{\Delta}^{(i)}_j = \alpha_i \tilde{g}^{i,1}_j$. For simplicity, we define $\tilde{g}^i_j \triangleq \tilde{g}^{i,1}_j$. For proving our Theorem, we first need multiple lemmas, described next.

Lemma 11. For $i \in [R]$ and $j \in \tilde{N}^{(i)}$, we have the following:

\[
\|z^{(i)}_j - \alpha_i(\nabla F_j(\theta^{(i-1)}) - \nabla F_j(\theta^*))\| \le (2 + \epsilon_3)\alpha_i \|\tilde{g}^i_j - \nabla F_j(\theta^{(i-1)})\| + (1 + \epsilon_3)\alpha_i \|\nabla F_j(\theta^{(i-1)}) - \nabla F_j(\theta^*)\| + (2 + \epsilon_3)\alpha_i \beta. \tag{54}
\]

Proof.

\[
\begin{aligned}
\|z^{(i)}_j - \alpha_i(\nabla F_j(\theta^{(i-1)}) - \nabla F_j(\theta^*))\|
&\le \|z^{(i)}_j - \tilde{\Delta}^i_j\| + \|\tilde{\Delta}^i_j - \alpha_i \nabla F_j(\theta^{(i-1)})\| + \alpha_i\|\nabla F_j(\theta^*)\| \\
&\overset{(a)}{\le} \|z^{(i)}_j + \tilde{\Delta}^i_j\| + \|\tilde{\Delta}^i_j - \alpha_i \nabla F_j(\theta^{(i-1)})\| + \alpha_i\|\nabla F_j(\theta^*)\| \\
&\overset{(b)}{\le} (1 + \epsilon_3)\|\tilde{\Delta}^i_j\| + \|\tilde{\Delta}^i_j - \alpha_i \nabla F_j(\theta^{(i-1)})\| + \alpha_i\|\nabla F_j(\theta^*)\| \\
&\overset{(c)}{\le} (2 + \epsilon_3)\alpha_i \|\tilde{g}^i_j - \nabla F_j(\theta^{(i-1)})\| + (1 + \epsilon_3)\alpha_i \|\nabla F_j(\theta^{(i-1)}) - \nabla F_j(\theta^*)\| + (2 + \epsilon_3)\alpha_i \beta,
\end{aligned} \tag{55}
\]

where (a) holds as $z^{(i)}_j \cdot \tilde{\Delta}^{(i)}_j > 0$, and (b) holds because $\|z^{(i)}_j\|/\|\tilde{\Delta}^{(i)}_j\| < \epsilon_3$. For (c), we expand $\|\tilde{\Delta}^i_j\| = \alpha_i\|\tilde{g}^i_j\| \le \alpha_i\|\tilde{g}^i_j - \nabla F_j(\theta^{(i-1)})\| + \alpha_i\|\nabla F_j(\theta^{(i-1)}) - \nabla F_j(\theta^*)\| + \alpha_i\|\nabla F_j(\theta^*)\|$, and note that $\theta^*$ satisfies $\frac{1}{N}\sum_{j \in [N]} \nabla F_j(\theta^*) = 0$. Therefore, by Assumption 3, $\|\nabla F_j(\theta^*)\| \le \beta$.
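The per-client similarity conditions invoked at the start of the proof ($z^{(i)}_j \cdot \tilde{\Delta}^{(i)}_j > 0$ and $\|z^{(i)}_j\|/\|\tilde{\Delta}^{(i)}_j\| < \epsilon_3$) can be sketched as a simple acceptance test. The updates below are synthetic stand-ins, not actual DiverseFL outputs:

```python
import numpy as np

def accept(z, delta_tilde, eps3):
    """Per-client similarity check: the uploaded update must agree in direction
    with the guiding update, and its magnitude must not be disproportionately large."""
    return float(np.dot(z, delta_tilde)) > 0 and \
           np.linalg.norm(z) / np.linalg.norm(delta_tilde) < eps3

guide = np.array([1.0, -2.0, 0.5])                  # guiding update Delta~
honest = guide + 0.1 * np.array([0.3, -0.1, 0.2])   # close to the guide
flipped = -guide                                    # sign-flip attack
boosted = 100.0 * guide                             # scaling attack

assert accept(honest, guide, eps3=10.0)             # passes both conditions
assert not accept(flipped, guide, eps3=10.0)        # fails the inner-product test
assert not accept(boosted, guide, eps3=10.0)        # fails the norm-ratio test
```

The two conditions filter complementary attacks: the inner product catches direction reversal, while the norm ratio catches magnitude inflation.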
Lemma 12. Let the learning rate be $\alpha_i = \alpha = \mu/(2L^2)$ for each communication round $i \in [R]$. Then, the following holds:

\[
\|\theta^{(i-1)} - \theta^* - \alpha(\nabla F_j(\theta^{(i-1)}) - \nabla F_j(\theta^*))\| \le \sqrt{1 - \frac{\mu^2}{4L^2}}\, \|\theta^{(i-1)} - \theta^*\|.
\]

Proof. Expanding the left-hand side, we have:

\[
\|\theta^{(i-1)} - \theta^* - \alpha(\nabla F_j(\theta^{(i-1)}) - \nabla F_j(\theta^*))\|^2 = \|\theta^{(i-1)} - \theta^*\|^2 + \alpha^2\|\nabla F_j(\theta^{(i-1)}) - \nabla F_j(\theta^*)\|^2 - 2\alpha\,(\theta^{(i-1)} - \theta^*) \cdot (\nabla F_j(\theta^{(i-1)}) - \nabla F_j(\theta^*)). \tag{56}
\]

By Assumptions 1 and 2, we have the following:

\[
\|\nabla F_j(\theta^{(i-1)}) - \nabla F_j(\theta^*)\| \le L\|\theta^{(i-1)} - \theta^*\|, \tag{57}
\]
\[
F_j(\theta^*) + \nabla F_j(\theta^*) \cdot (\theta^{(i-1)} - \theta^*) \le F_j(\theta^{(i-1)}) - \frac{\mu}{2}\|\theta^{(i-1)} - \theta^*\|^2, \tag{58}
\]
\[
F_j(\theta^{(i-1)}) + \nabla F_j(\theta^{(i-1)}) \cdot (\theta^* - \theta^{(i-1)}) \le F_j(\theta^*). \tag{59}
\]

Summing up (58) and (59) results in the following:

\[
(\theta^* - \theta^{(i-1)}) \cdot (\nabla F_j(\theta^{(i-1)}) - \nabla F_j(\theta^*)) \le -\frac{\mu}{2}\|\theta^{(i-1)} - \theta^*\|^2. \tag{60}
\]

Substituting (57) and (60) in (56), we have the following:

\[
\|\theta^{(i-1)} - \theta^* - \alpha(\nabla F_j(\theta^{(i-1)}) - \nabla F_j(\theta^*))\|^2 \le (1 + \alpha^2 L^2 - \alpha\mu)\|\theta^{(i-1)} - \theta^*\|^2. \tag{61}
\]

Taking the square root and using $\alpha = \frac{\mu}{2L^2}$ in (61) completes the proof.

The following lemma is adapted from Lemma 4 in [24] and reproduced here for completeness.

Lemma 13. For any $\delta \in (0,1)$, let $\Gamma_1 = \sqrt{2}\,\sigma_1\sqrt{(d\log 6 + \log(3/\delta))/s}$,

\[
\Gamma_2 = \sigma_2\sqrt{\frac{2}{s}}\sqrt{d\log\frac{18L_2}{\sigma_2} + \frac{1}{2}d\log\frac{s}{d} + \log\frac{6\sigma_2^2\, r\sqrt{s}}{\gamma_2\sigma_1\delta}}, \qquad L_2 = \max\{L, L_1\}.
\]

When $\Gamma_1 \le \sigma_1^2/\gamma_1$ and $\Gamma_2 \le \sigma_2^2/\gamma_2$, we have the following for each client $j \in \tilde{N}^{(i)}$ in communication round $i \in [R]$:

\[
\mathbb{P}\big(\|\tilde{g}^i_j - \nabla F_j(\theta^{(i-1)})\| \le 8\Gamma_2\|\theta^{(i-1)} - \theta^*\| + 4\Gamma_1\big) \ge 1 - \delta. \tag{62}
\]

Next, we leverage the above results to prove our Theorem.
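The contraction in Lemma 12 can be verified numerically on a strongly convex quadratic. The example below is synthetic: $F(\theta) = \frac{1}{2}\theta^T A\theta$ with the extreme eigenvalues of $A$ playing the roles of $\mu$ and $L$, and $\alpha = \mu/(2L^2)$ as in the lemma.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
A = np.diag(np.linspace(1.0, 4.0, d))     # F(theta) = 0.5 * theta^T A theta
mu, L = 1.0, 4.0                          # smallest / largest eigenvalues of A
alpha = mu / (2 * L**2)
theta_star = np.zeros(d)                  # minimizer of F
factor = np.sqrt(1 - mu**2 / (4 * L**2))  # contraction factor from Lemma 12

for _ in range(100):
    theta = rng.normal(size=d)
    grad_diff = A @ theta - A @ theta_star   # grad F(theta) - grad F(theta*)
    lhs = np.linalg.norm(theta - theta_star - alpha * grad_diff)
    assert lhs <= factor * np.linalg.norm(theta - theta_star) + 1e-12
```

For this quadratic the step multiplies each eigendirection by $1 - \alpha\lambda \in [1 - \mu/(2L^2)\cdot L,\, 1 - \mu^2/(2L^2)]$, which is strictly inside the guaranteed factor $\sqrt{1 - \mu^2/(4L^2)}$, so the assertion holds for every random $\theta$.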
For communication round $i \in [R]$, we have the following:

\[
\begin{aligned}
\|\theta_i - \theta^*\| &= \Bigg\|\theta^{(i-1)} - \theta^* - \frac{\alpha}{|\tilde{N}^{(i)}|}\sum_{j \in \tilde{N}^{(i)}} \big(\nabla F_j(\theta^{(i-1)}) - \nabla F_j(\theta^*)\big) + \frac{1}{|\tilde{N}^{(i)}|}\sum_{j \in \tilde{N}^{(i)}} \Big(\alpha\big(\nabla F_j(\theta^{(i-1)}) - \nabla F_j(\theta^*)\big) - z^i_j\Big)\Bigg\| \\
&\overset{(a)}{\le} \underbrace{\frac{1}{|\tilde{N}^{(i)}|}\sum_{j \in \tilde{N}^{(i)}} \big\|\theta^{(i-1)} - \alpha(\nabla F_j(\theta^{(i-1)}) - \nabla F_j(\theta^*)) - \theta^*\big\|}_{e_1} + \underbrace{\frac{(2 + \epsilon_3)\alpha}{|\tilde{N}^{(i)}|}\sum_{j \in \tilde{N}^{(i)}} \big\|\tilde{g}^i_j - \nabla F_j(\theta^{(i-1)})\big\|}_{e_2} \\
&\qquad + \underbrace{\frac{(1 + \epsilon_3)\alpha}{|\tilde{N}^{(i)}|}\sum_{j \in \tilde{N}^{(i)}} \big\|\nabla F_j(\theta^{(i-1)}) - \nabla F_j(\theta^*)\big\|}_{e_3} + \underbrace{(2 + \epsilon_3)\alpha\beta}_{e_4},
\end{aligned}
\]

where (a) follows from the triangle inequality and Lemma 11. For bounding the above, we note that $e_1$ and $e_3$ are bounded based on Lemma 12 and Assumption 2, respectively, while $e_4$ is the constant term from Lemma 11 obtained via Assumption 3. Furthermore, by Lemma 13, each term in the summation in $e_2$ is bounded by $(8\Gamma_2\|\theta^{(i-1)} - \theta^*\| + 4\Gamma_1)$ with probability at least $(1 - \delta)$. Hence, by the Fréchet lower bound for the probability of an intersection of events, $e_2$ can be bounded by $(2 + \epsilon_3)\alpha(8\Gamma_2\|\theta^{(i-1)} - \theta^*\| + 4\Gamma_1)$ with probability at least $1 - |\tilde{N}^{(i)}|\delta \ge 1 - N\delta$. Hence, by defining $\tilde{\delta} \triangleq N\delta$, we have the following with probability at least $1 - \tilde{\delta}$:

\[
\|\theta_i - \theta^*\| \le \Bigg(\sqrt{1 - \frac{\mu^2}{4L^2}} + 8\alpha(2 + \epsilon_3)\Gamma_2 + \alpha(1 + \epsilon_3)L\Bigg)\|\theta^{(i-1)} - \theta^*\| + 4\alpha(2 + \epsilon_3)\Gamma_1 + \alpha(2 + \epsilon_3)\beta. \tag{63}
\]

Recursively applying (63), we arrive at (53) in the Theorem.
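The final recursion has the form $e_i \le (1-\rho)e_{i-1} + c$ with $c = \alpha(2+\epsilon_3)(4\Gamma_1 + \beta)$, whose iterates approach the fixed point $c/\rho$ when $|1-\rho| < 1$. A quick check with placeholder constants (the values below are illustrative, not derived from the theorem's parameters):

```python
def recurse(e0, rho, c, rounds):
    """Iterate e_i = (1 - rho) * e_{i-1} + c, the shape of Eq. (63)."""
    e = e0
    for _ in range(rounds):
        e = (1 - rho) * e + c
    return e

rho, c = 0.1, 0.05                 # placeholder contraction and noise terms
fixed_point = c / rho              # the limit appearing in Eq. (53)
e_final = recurse(e0=10.0, rho=rho, c=c, rounds=500)
# after many rounds the transient (1 - rho)^i * (e0 - c/rho) has vanished
assert abs(e_final - fixed_point) < 1e-6
```

Unrolling the recursion gives exactly the two terms of (53): a geometric decay $(1-\rho)^i\|\theta_0 - \theta^*\|$ of the initial error plus the residual floor $c/\rho$.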