Federated and Distributed Machine Learning at Scale: From Systems to Algorithms to Applications

by Chaoyang He

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2022

Copyright 2022 Chaoyang He

To my beloved parents and brother, my beautiful wife, and my lovely daughter

Acknowledgements

The past several years at USC have been an unforgettable and invaluable experience. In August of 2018, I resigned from my job as an R&D manager at Tencent and embarked on an adventure in the United States. The first year in Los Angeles was the most challenging year for my family and me due to our unfamiliarity with the language and culture. Re-entering the classroom, reading papers, and resharpening my mathematics skills were really difficult for someone who had left campus many years ago. However, after adjusting to my new life, I gradually regained my confidence in the American context. During the first year of my Ph.D., I witnessed how rapidly Federated and Distributed Machine Learning has been developing, and I have been so excited to be part of this field. By graduation, I had published more than 30 papers, maintained a well-known open-source community, and made many good friends. Fortunately, some of my research results were also favored by venture investors, and these investments have allowed me to continue my passion for innovation by running a company and developing real-world products.

I have enjoyed these concentrated and peaceful years. Through this process, I have learned to no longer be afraid of the unknown. When faced with unresolved problems, whether big or small, I can still maintain curiosity, adopt various methodologies, and finally arrive at an answer. This process has also been full of fun and friendship, and it has truly strengthened my passion for technological innovation for the rest of my life. My goal is to work persistently on technological innovation, whether by writing compelling papers on novel ideas or by impacting society through product development. During this journey, I could not have achieved what I have without the support of so many people, to whom I am deeply grateful.

First and foremost, at this tremendously exciting moment of my life, my most sincere gratitude goes to my fantastic advisor, Professor Salman Avestimehr, for his constant and generous support and guidance during my Ph.D. study at USC. We share similar interests in research topics and in productizing research results, and we both have transnational cultural backgrounds. It has been an absolute honor to work with him. I would also like to thank Professor Mahdi Soltanolkotabi and Professor Murali Annavaram for their collaboration on several papers in distributed machine learning. I always appreciate their insightful ideas for solving technical difficulties. Their writing and presentation skills are also excellent and have helped me to become a better researcher in the English-speaking community.

I would also like to thank Professor Tong Zhang and Professor Qiang Yang from HKUST for their guidance at the early stages of my Ph.D. study. Their positive thinking and encouraging words during challenging times gave me the confidence to master complex research problems. They both are excellent AI scholars, and their high academic standards and genuine love for scientific research inspire me to keep moving forward.
I would also like to thank my thesis committee members and my qualifying exam committee members: Salman Avestimehr, Mahdi Soltanolkotabi, Murali Annavaram, Ram Nevatia, Barath Raghavan, and Xiang Ren. Their insightful feedback has helped to significantly improve the quality of this dissertation.

I would also like to thank researchers at Google, Facebook, and Amazon for their collaboration and help during my Ph.D. study. I thank Jakub Konečný, Peter Kairouz, and Xu Zheng from Google for their insightful suggestions and inspiration. I also want to express my special thanks to Shen Li on the Facebook (Meta) PyTorch Distributed Training team; he is one of the most excellent distributed-systems engineers I have ever met. I also thank Shuai Zheng from Alex Smola and George Karypis's team at AWS AI for providing me the opportunity to work there on fun research topics. I thank Rahul Gupta, Anit Kumar Sahu, and Jie Ding from Alexa AI at Amazon for their help in many research projects. Working with so many excellent researchers in industry has been such an enjoyable journey. I also appreciate their job offers, which I respectfully had to decline, as I plan to create and operate my own company after graduation. I sincerely look forward to future opportunities for us to cross paths.

My years working at the fast-growing Tencent and Baidu gave me many skills in scientific research methodology and in getting along with people, the determination to face the unknown, and a broad network of connections as well as industry vision. I especially thank Menglin Chen (Engineering Director at Tencent), Weihao Gu (VP at Baidu, now CEO at haomo.ai), Fei Qi (General Manager at Baidu), and Jianjia Xie (General Manager at Tencent) for providing me with a wide perspective on real research problems and for writing enthusiastic recommendation letters for my Ph.D. application. I thank Peilin Zhao (Principal Research Scientist at Tencent AI Lab), Junzhou Huang (Director of the Machine Learning Center at Tencent AI Lab), Wenbing Huang (Assistant Professor at Tsinghua University), Roy Yu Rong (Senior Research Scientist at Tencent AI Lab), and Shen Li (Senior Research Scientist at Tencent AI Lab) for guiding me in conducting cutting-edge machine learning research, which gave me a better chance in the US. I thank Shengping Zhou (CEO/CTO at a cloud computing startup, former Principal Software Engineer at Tencent), Qiaozhong Liang (Staff Software Engineer at Tencent), Zongchang Jie (Staff Software Engineer at Tencent), Liu Yang (Team Leader at Tencent CSIG), and Weizhou Pan (Engineering Manager at Tencent QQ) for their constructive discussions on open-source community building and product development during my Ph.D. study. I also thank Professor Yang Liu (former Research Scientist at WeBank), who provided me with her unique perspective from the financial industry.

I would also like to thank many other excellent collaborators from academia, specifically Haishan Ye from HKUST (Tong Zhang's postdoc), Hongyi Wang (CMU), Jianyu Wang (CMU), Tian Li (CMU), Praneeth Vepakomma (MIT), and Xiaoyang Wang (UIUC), for their insightful discussions in the field of distributed machine learning. I also thank Xinle Wu and Bin Yang from Aalborg University, Denmark, for their research support in automated time-series forecasting.

I thank the USC vITAL Lab (Salman Avestimehr's group), the Signal and Data Foundation Lab (Mahdi's group), and the SCIP Lab (Murali's group).
I truly believe that the combination of these three labs creates the best research group in the world for system machine learning (SysML). I thank the postdoctoral researchers and Ph.D. students in these labs, especially Yayha Essa, Ramy Ali, Sunwoo Lee, Songze Li, Qian Yu, Chien-Sheng Yang, Saurav Prakash, Jinhyun So, Yue Niu, Ahmed R. Elkordy, Keshav Balasubramanian, Tingting Tang, and Zalan Fabian, for their kind help in addressing some challenging technical issues. They provided me a lot of support at various times. I am also fortunate to have supervised many junior students in our lab, including Emir Ceyani, Erum Mushtaq, Tuo Zhang, and Amir Ziashahabi; this experience has also been truly valuable to me. I thank them for their critical thinking and challenging questions, which pushed me to better understand how to maintain a cross-cultural research group. There are also many other terrific research labs at USC that I have learned from, one of which is the Ink Lab (Intelligence and Knowledge Discovery Research Lab). I sincerely enjoyed my collaboration with Yuchen Lin there; through our collaboration on FedNLP, he taught me many cutting-edge natural language processing research skills and tastes. Overall, I have been truly fortunate to be able to grow in such an excellent scientific research environment.

I also want to thank the alumni of USC. I especially thank Professor Mi Zhang at MSU. He has been like an older brother to me, and we have discussed many topics in industry, academia, and daily life on WeChat. His words are always encouraging and inspiring, and I appreciate the support he has given me. I also want to thank CoCoPIE CEO and Chairman Yanzhi Wang (also a professor at Northeastern University), who inspires me to transform research results into impactful products.

I would also like to thank the three spiritual idols who have always encouraged me: Professor Tong Zhang from the Hong Kong University of Science and Technology, Xing Wang, CEO of Meituan.com, and the singer Jian Li, who come from the fields of scientific research, entrepreneurship, and art, respectively. They are all excellent lifelong learners, each with a 20-year history in his field. During my Ph.D., whenever I felt like giving up, I relived their stories late at night, which inspired me to renew my fighting spirit and push forward. Tong has experienced both academia and industry in the United States and China but still maintains his original aspirations for research; twenty years after graduating with his Ph.D., he is still at the forefront of scientific research, solving critical problems for the next decade. Xing Wang has grown from a software engineer to a successful entrepreneur and investor, from technology to products to business; his super learning ability and growth speed have always set an example for me. Jian Li was also a software engineer after graduation, but he found his true passion in music, practiced it for ten years without giving up, and eventually became a famous singer in China. Classical elegance, purity, and innovation are the inspirations for his musical style. Twenty years later, he is still releasing new songs (at the time of writing this thesis, his new album "All the time" had just been released). I admire his persistence and genuine love. I am sincerely thankful for these role models.

Last but not least, I would like to express my deepest gratitude to my lovely family for their relentless love and support.
The time I spend with my beautiful wife and my cute daughter is always the best booster for my research productivity. My fondest memories during my Ph.D. study are from my time spent with my daughter. It was really fun watching her play the piano, painting with her, doing sports together in the park, going to the beach to enjoy the California sunshine, and listening to her tell jokes in English. Though far across the ocean, my parents and relatives are also my spiritual support. These years of experience have made me feel the greatness of my father, mother, and brother. I thank them for everything they did for me. This dissertation is dedicated to them. vii Table of Contents Dedication ii Acknowledgements iii List of Tables xvi List of Figures xxii Abstract xxx Chapter 1: Introduction 1 1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Advances and Open Problems in Federated Learning . . . . . . . . . . . . . 3 1.3 Thesis Overview: Towards End-to-end Federated Learning at Scale . . . . . 4 1.4 Thesis Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 I Federated and Distributed Machine Learning: System 8 Chapter 2: FedML: An Open Source Research Library for Federated Learning 9 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Architecture Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3 Programming Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4 Algorithms, Models, and Datasets . . . . . . . . . . . . . . . . . . . . . . . . 19 2.4.1 Algorithms: Federated Optimizer . . . . . . . . . . . . . . . . . . . . 19 2.4.2 Models and Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Chapter 3: PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models 24 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2.1 Background and Problem Setting . . . . . . . . . . . . . . . . . . . . 27 3.2.2 Overall Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.3 Algorithm and System Design . . . . . . . . . . . . . . . . . . . . . . . . . . 30 viii 3.3.1 Freeze Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.3.2 AutoPipe: Elastic Pipelining . . . . . . . . . . . . . . . . . . . . . . . 31 3.3.3 AutoDP: Spawning More Pipeline Replicas . . . . . . . . . . . . . . . 35 3.3.4 AutoCache: Cross-pipeline Caching . . . . . . . . . . . . . . . . . . . 36 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.4.2 Overall Training Acceleration . . . . . . . . . . . . . . . . . . . . . . 39 3.4.3 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.5 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
43 II Federated and Distributed Machine Learning: Algorithm 45 Chapter 4: FedGKT: Edge-cloud Collaborative Training for Resource-constrained Clients 46 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.3 Group Knowledge Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3.2 Reformulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.3.3 FedGKT: Group Knowledge Transfer . . . . . . . . . . . . . . . . . . 53 4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.4.2 Result of Model Accuracy . . . . . . . . . . . . . . . . . . . . . . . . 57 4.4.3 Efficiency Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.4.4 Ablation Study: Understanding FedGKT under Different Settings . . 59 4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Chapter 5: FedNAS: Towards Automation on Invisible Data via Neural Architecture Search 64 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5.2 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.2.2 Search Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.2.3 Local Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.2.4 FedNAS: Federated Neural Architecture Search . . . . . . . . . . . . 69 5.2.5 Personalized FedNAS: Alternative Local Adaptation . . . . . . . . . . 70 5.2.6 AutoFL System Design . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.3 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.3.1 Personalized Models Search via FedNAS . . . . . . . . . . . . . . . . 74 5.3.1.1 Results on Non-I.I.D. (Label Skew Partition and LDA distribution) . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.3.2 Global Model Search via FedNAS . . . . . . . . . . . . . . . . . . . . 76 ix 5.3.2.1 Results on Non-I.I.D. (LDA Partition) . . . . . . . . . . . . 76 5.3.3 Evaluation of the System Efficiency . . . . . . . . . . . . . . . . . . . 77 5.4 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Chapter 6: SpreadGNN: Effective Training on Decentralized Topology 80 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.2 SpreadGNN Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 6.2.1 Federated Graph Neural Networks for Graph-Level Learning . . . . . 82 6.2.2 Federated Multi-Task Learning with Graph Neural Networks . . . . . 85 6.2.3 SpreadGNN: Serverless Federated MTL for GNNs . . . . . . . . . . . . 87 6.2.3.1 Convergence Properties . . . . . . . . . . . . . . . . . . . . 89 6.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 6.3.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 6.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . 92 6.3.3 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 6.4 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 Chapter 7: SSFL: Tackling Label Deficiency via Personalized Self-Supervision 97 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 7.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 7.2.1 Federated Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 100 7.2.2 Self-supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . 100 7.3 SSFL: Self-supervised Federated Learning . . . . . . . . . . . . . . . . . . . 101 7.3.1 General Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 7.3.2 Global-SSFL: Collaboration Towards a Global Model without Supervision103 7.3.3 Per-SSFL: Learning Personalized Models without Supervision . . . . 104 7.4 Training System and Evaluation Pipeline for SSFL . . . . . . . . . . . . . . 106 7.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 7.5.1 Comparisons on SimSiam, SimCLR, SwAV, and BYOL . . . . . . . . 108 7.5.2 Evaluation on Global-SSFL . . . . . . . . . . . . . . . . . . . . . . . 109 7.5.3 Evaluation on Per-SSFL . . . . . . . . . . . . . . . . . . . . . . . . . 110 7.5.4 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 7.5.4.1 Role of Batch Size . . . . . . . . . . . . . . . . . . . . . . . 111 7.5.4.2 On Different Degrees of Non-I.I.D.ness . . . . . . . . . . . . 112 7.5.4.3 Understanding the Linear Evaluation of Personalized Encoders112 7.6 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 7.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 Chapter 8: LightSecAgg: Lightweight and Versatile Secure Aggregation 114 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 8.2 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 8.3 Overview of Baseline Protocols: SecAgg and SecAgg+ . . . . . . . . . . . . . 120 8.4 LightSecAgg Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 x 8.4.1 General Description of LightSecAgg for Synchronous FL . . . . . . . 125 8.4.2 Extension to Asynchronous FL . . . . . . . . . . . . . . . . . . . . . 127 8.5 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 8.5.1 Theoretical Guarantees . . . . . . . . . . . . . . . . . . . . . . . . . . 128 8.5.2 Complexity Analysis of LightSecAgg . . . . . . . . . . . . . . . . . . 129 8.6 System Design and Optimization . . . . . . . . . . . . . . . . . . . . . . . . 130 8.7 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 8.7.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 8.7.2 Overall Evaluation and Performance Analysis . . . . . . . . . . . . . 135 8.7.3 Performance Breakdown . . . . . . . . . . . . . . . . . . . . . . . . . 137 8.7.4 Convergence Performance in Asynchronous FL . . . . . . . . . . . . . 137 8.8 Conclusion and Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . 138 III Federated and Distributed Machine Learning: Application 139 Chapter 9: FedNLP: FedML for Natural Language Processing 140 9.1 Introduction . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 9.2 Federated Learning for NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 9.2.1 Federated Learning Concepts . . . . . . . . . . . . . . . . . . . . . . 143 9.2.2 Federated Optimization Framework . . . . . . . . . . . . . . . . . . . 144 9.2.3 FedNLP Training System: Security and Efficiency . . . . . . . . . . . 145 9.3 Benchmark for FedNLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 9.3.1 Task Formulations, Datasets, and Models . . . . . . . . . . . . . . . . 147 9.3.2 Non-IID Partitioning Strategies . . . . . . . . . . . . . . . . . . . . . 148 9.4 Experiments and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 9.5 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 9.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 9.7 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 Chapter10: FedGraphNN: FedML for Graph Neural Networks 160 10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 10.2 Federated Graph Neural Networks (FedGraphNN) . . . . . . . . . . . . . . . . 162 10.3 FedGraphNN Open Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 10.3.1 Generating Federated Learning Datasets . . . . . . . . . . . . . . . . 167 10.3.1.1 Dirichlet distribution-based Sampling . . . . . . . . . . . . . 168 10.3.1.2 Non-I.I.D. Sampling Based on Meta-Data . . . . . . . . . . 169 10.4 FedGraphNN Benchmark System: Efficient, Secure, and Modularized . . . . . 170 10.5 FedGraphNN Empirical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 172 10.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 10.5.2 Baseline Performance Analysis . . . . . . . . . . . . . . . . . . . . . . 173 10.6 Related Works and Open Challenges . . . . . . . . . . . . . . . . . . . . . . 174 10.7 Conclusions and Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . 175 Chapter11: FedCV: FedML for Computer Vision 179 11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 xi 11.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 11.3 Preliminary and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 11.4 FedCV Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184 11.5 FedCV Benchmark Suite: Datasets, Models, and Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 11.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 11.6.1 Image Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 11.6.1.1 Implementation Details . . . . . . . . . . . . . . . . . . . . 188 11.6.1.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . 188 11.6.2 Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 192 11.6.2.1 Implementation Details . . . . . . . . . . . . . . . . . . . . 192 11.6.2.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . 193 11.6.2.3 System Performance Analysis . . . . . . . . . . . . . . . . . 196 11.6.3 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 11.6.3.1 Implementation Details . . . . . . . . . . . . . . . . . . . . 196 11.6.3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . 197 11.7 Conclusion . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198 Chapter12: FedIoT: FedML for Internet of Things 199 12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 12.2 Algorithm and System Design . . . . . . . . . . . . . . . . . . . . . . . . . . 202 12.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 12.2.2 Dataset and Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . 203 12.2.3 Anomaly Detection with Deep Autoencoder . . . . . . . . . . . . . . 204 12.2.4 FedDetect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 12.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 12.3.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 12.3.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 12.3.3 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 12.3.4 Results of Learning Performance . . . . . . . . . . . . . . . . . . . . 210 12.3.5 Analysis of System Efficiency . . . . . . . . . . . . . . . . . . . . . . 212 12.4 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 12.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 Bibliography 216 Appendices 258 Chapter A:Supplement to Chapter 2 - FedML 260 A.1 The Taxonomy of Research Areas and a Comprehensive Publication List . . 260 A.2 Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261 A.2.1 Details of Supported Algorithms . . . . . . . . . . . . . . . . . . . . . 261 A.2.2 Details of Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262 A.2.3 Lack of Fair Comparison: Diverse Non-I.I.D. Datasets and Models . . 263 A.3 IoT Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263 A.3.1 Raspberry Pi 4 (Edge CPU Computing - ARMv7l) . . . . . . . . . . 263 xii A.3.2 NVIDIA Jetson Nano (Edge GPU Computing) . . . . . . . . . . . . . 264 Chapter B:Supplement to Chapter 3 - PipeTransformer 266 B.1 Background and Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . 267 B.1.1 Transformer Models: ViT and BERT . . . . . . . . . . . . . . . . . . 267 B.1.2 Freeze Training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269 B.1.3 Pipeline Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270 B.1.4 Data Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271 B.1.5 Hybrid of Pipeline Parallelism and Data Parallelism . . . . . . . . . . 271 B.2 More Details of Freeze Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 272 B.3 More Details of AutoPipe . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274 B.4 More details of AutoDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276 B.4.1 Data Redistributing . . . . . . . . . . . . . . . . . . . . . . . . . . . 276 B.4.2 Skip Frozen Parameters in AutoDP . . . . . . . . . . . . . . . . . . . 277 B.5 More Details of AutoCache . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277 B.6 More Experimental Results and Details . . . . . . . . . . . . . . . . . . . . . 278 B.6.1 Hyper-Parameters Used in Experiments . . . . . . . . . . . . . . . . . 278 B.6.2 More Details of Speedup Breakdown . . . . . . . . . . . . . . . . . . 278 B.6.3 Tuning α for ViT on ImageNet . . . . . . . . . . . . . . . . . . . . . 
280 B.6.4 The Method That Can Accurately Measure the Communication Cost 280 B.6.5 Overheads of Pipe Transformation . . . . . . . . . . . . . . . . . . . 281 Chapter C:Supplement to Chapter 4 - FedGKT 282 C.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282 C.1.1 A Summary of Dataset Used in Experiments . . . . . . . . . . . . . . 282 C.2 Heterogeneous Distribution (non-I.I.D.) in Each Client . . . . . . . . . . . . 283 C.3 Extra Experimental Results and Details . . . . . . . . . . . . . . . . . . . . 283 C.3.1 Computational Efficiency on CIFAR-10 and CINIC-10 . . . . . . . . 283 C.4 The Method of Communication Cost Calculation . . . . . . . . . . . . . . . 284 C.5 Details of Convolutional Neural Architecture on Edge and Server . . . . . . 284 C.5.1 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285 Chapter D:Supplement to Chapter 5 - FedNAS 289 D.1 Details of the Search Space Definition . . . . . . . . . . . . . . . . . . . . . . 289 D.2 Details of the heterogeneous distribution on each client (non-IID) . . . . . . 290 D.3 Results for CIFAR10 (lda) and gld23k . . . . . . . . . . . . . . . . . . . . . 290 D.4 Hyperparameter Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292 D.5 Visualization of the Search Architecture . . . . . . . . . . . . . . . . . . . . 293 D.6 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294 Chapter E:Supplement to Chapter 6 - SpreadGNN 296 E.1 Algorithm Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296 E.2 Dataset Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296 E.2.1 Feature Extraction Procedure for Molecules . . . . . . . . . . . . . . 297 E.2.2 Model Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . 298 xiii E.2.2.1 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . 298 E.2.2.2 Hyperparameter Configurations . . . . . . . . . . . . . . . . 299 E.3 Detailed Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299 E.3.1 Effect of Communication Period τ . . . . . . . . . . . . . . . . . . . . 299 E.3.2 Proof for Convergence of SpreadGNN . . . . . . . . . . . . . . . . . . 301 Chapter F:Supplement to Chapter 7 - SSFL 305 F.1 Comparison of Self-supervised Learning Frameworks . . . . . . . . . . . . . . 305 F.2 Formulation and Pseudo Code for Algorithms Under SSFL Framework . . . 308 F.2.1 Per-SSFL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308 F.2.2 Personalized SSFL with Local Adaptation (FedAvg-LA) . . . . . . . . 309 F.2.3 Personalized SSFL with MAML-SSFL . . . . . . . . . . . . . . . . . 311 F.2.4 Personalized SSFL with BiLevel-SSFL . . . . . . . . . . . . . . . . . 312 F.3 Distributed Training System for SSFL . . . . . . . . . . . . . . . . . . . . . 313 F.3.1 Experimental Results on GLD-23K Dataset . . . . . . . . . . . . . . 315 F.3.2 Extra Experimental Results and Details . . . . . . . . . . . . . . . . 315 F.3.3 Visualization of Non-I.I.D. dataset . . . . . . . . . . . . . . . . . . . 315 F.3.4 Hyper-parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316 F.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317 Chapter G:Supplement to Chapter 8 - LightSecAgg 318 G.1 Pseudo Code of LightSecAgg . . . . . . . . . . . . . . . . . . . . . . . . . . 318 G.2 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320 G.2.1 Discussion . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323 G.3 Experimental Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325 G.4 Proof of Lemma 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327 G.5 Application of LightSecAgg to Asynchronous FL . . . . . . . . . . . . . . . 327 G.6 General Description of Asynchronous FL . . . . . . . . . . . . . . . . . . . . 328 G.6.1 Incompatibility of SecAgg and SecAgg+ with Asynchronous FL . . . . 329 G.6.2 Asynchronous LightSecAgg . . . . . . . . . . . . . . . . . . . . . . . 330 G.6.3 Offline Encoding and Sharing of Local Masks . . . . . . . . . . . . . 331 G.6.4 Training, Quantizing, Masking, and Uploading of Local Updates . . . 331 G.6.5 One-shot Aggregate-update Recovery and Global Model Update . . . 333 G.7 Convergence Analysis of Asynchronous LightSecAgg . . . . . . . . . . . . . 335 G.8 Experiments for Asynchronous LightSecAgg . . . . . . . . . . . . . . . . . . 338 Chapter H:Supplement to Chapter 9 - FedNLP 341 H.1 Motivation Behind FL+NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . 341 H.2 Challenges of Applying FL in NLP . . . . . . . . . . . . . . . . . . . . . . . 342 H.3 Basic Formulations of NLP Tasks . . . . . . . . . . . . . . . . . . . . . . . . 342 H.4 The System Design of FedNLP . . . . . . . . . . . . . . . . . . . . . . . . . 344 H.4.1 Overall Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345 H.4.2 The Application Layer . . . . . . . . . . . . . . . . . . . . . . . . . . 345 H.4.3 The Algorithm Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 347 H.4.4 The Infrastructure Layer . . . . . . . . . . . . . . . . . . . . . . . . . 348 xiv H.4.5 Enhancing Security with Secure Aggregation (SA) . . . . . . . . . . . 348 H.5 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349 H.6 More Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350 Chapter I: Supplement to Chapter 10 - FedGraphNN 351 I.1 More Details of the Supported Graph Neural Network Architectures . . . . . 351 I.2 More Details of the Open Datasets . . . . . . . . . . . . . . . . . . . . . . . 352 I.2.1 Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352 I.2.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356 I.2.3 Non-I.I.D. Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . 357 I.3 More Details of FedGraphNN System Design . . . . . . . . . . . . . . . . . . 360 I.3.1 More Results of System Efficiency and Security . . . . . . . . . . . . 360 I.3.2 Evaluation on System Efficiency . . . . . . . . . . . . . . . . . . . . . 360 I.3.3 Evaluation on Security (LightSecAgg) . . . . . . . . . . . . . . . . . 362 I.4 More Details of the Empirical Analysis . . . . . . . . . . . . . . . . . . . . . 364 I.4.1 Hyper-parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364 I.4.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375 I.5 More Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 376 Chapter J: Supplement to Chapter 11 - FedCV 377 J.1 Benchmark Suite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377 J.1.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377 J.1.2 Non-I.I.D. Partition and Distribution Visualization . . . . . . . . . . 378 J.1.3 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378 J.1.4 FL Algorithms . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . 382 J.2 More Experimental Results and Hyper-parameters . . . . . . . . . . . . . . . 383 J.2.1 Key Takeaways . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387 Chapter K:Supplement to Chapter 12 - FedIoT 390 K.1 FedIoT System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390 K.2 Data Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391 K.3 System Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392 xv List of Tables 2.1 Comparison between FedML and existing federated learning libraries and benchmarks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2 Federated datasets for linear models (convex optimization). . . . . . . . . . . 20 2.3 Federated datasets for lightweight shallow neural networks (non-convex optimization). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.4 Federated datasets for deep neural networks. . . . . . . . . . . . . . . . . . . 21 2.5 Experimental results of training modern CNNs. . . . . . . . . . . . . . . . . 22 2.6 Training time with FedAvg on modern CNNs (Hardware: 8 x NVIDIA Quadro RTX 5000 GPU (16GB/GPU); RAM: 512G; CPU: Intel Xeon Gold 5220R 2.20GHz). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.1 Speedup for ViT and BERT Training . . . . . . . . . . . . . . . . . . . . . . 40 3.2 Communication Cost v.s. Computational Cost . . . . . . . . . . . . . . . . . 41 4.1 The Test Accuracy of ResNet-56 and ResNet-110 on Three Datasets. . . . . 58 4.2 Ablation Study on Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . 59 4.3 Asynchronous Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.4 FedGKT with Different # of Edge . . . . . . . . . . . . . . . . . . . . . . . . 60 4.5 Small CNNs on CIFAR-10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.1 Average local validation Accuracy Comparison of FedNAS with other personalization techniques) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.2 Efficiency Comparison (16 RTX2080Ti GPUs as clients, and 1 RTX2080Ti as server) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 6.1 Dataset summary used in our experiments. . . . . . . . . . . . . . . . . . . 91 6.2 Results on the molecular property prediction task. SpreadGNN uses a complete topology: all clients communicated with each other. Communication period τ =1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 xvi 7.1 Evaluation accuracy comparison between supervised FL and SSFL. . . . . . . 110 7.2 Evaluation Accuracy for Various Per-SSFL Methods. . . . . . . . . . . . . . 110 8.1 Complexity comparison between SecAgg, SecAgg+, and LightSecAgg. Here N is the total number of users, d is the model size, s is the length of the secret keys as the seeds for PRG (s≪ d). In the table, U stands for User and S stands for Server. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 8.2 Summary of four implemented machine learning tasks and performance gain of LightSecAgg with respect to SecAgg and SecAgg+. All learning tasks are for image classification. 
MNIST, FEMNIST and CIFAR-10 are low-resolution datasets, while images in GLD-23K are high resolution, which cost much longer training time; LR and CNN are shallow models, but MobileNetV3 and EfficientNet-B0 are much larger models, but they are tailored for efficient edge training and inference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 8.3 Performance gain in different bandwidth settings. . . . . . . . . . . . . . . . 136 8.4 Breakdown of the running time (sec) of LightSecAgg and the state-of-the-art protocols (SecAgg [37] and SecAgg+ [19]) to train CNN [308] on the FEMNIST dataset [52] with N =200 users, for dropout rate p=10%,30%,50%. . . . 137 9.1 Statistics of the selected datasets for our experiments. *37 is the size of the tag vacabulary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 9.2 The comparisons between different FL methods under the same setting on different NLP tasks. The number of workers per round are 10, expect for the MRQA task, which uses 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 9.3 Performance (Acc.%) on 20news (TC) when different parts of DistilBERT are frozen for centralized training and FedOpt (at 28-th round). E stands for the embedding layer and L i means the i-th layer. The significant lower accuracy are underlined. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 10.1 Summary of open graph datasets from various domains contained in FedGraphNN.177 10.2 Performance of graph classification in the graph-level FL setting (#clients=4). 178 10.3 Performance of link prediction in the subgraph-level FL setting (#clients = 8).178 10.4 Performance of Node classification in the node-level FL setting (#clients = 10). 178 11.1 Summary of benchmark suite. . . . . . . . . . . . . . . . . . . . . . . . . . . 186 xvii 11.2 Summary of experimental results on image classification. In this table, Cent. refers to centralized training. For all experiments, we use a batch size of 256 for centralized training and 32 for FedAvg. We use a linear learning rate scheduler with a step size of 0.97 for centralized training, but no scheduler for FedAvg. We use momentum SGD with momentum coefficient of 0.9 for all experiments. More experimental results on other settings can be found in Tables J.3, J.4, J.5, J.6 and J.7 in the Appendix. . . . . . . . . . . . . . . . 190 11.3 Efficiency of training MobileNet V3, EfficientNet, Vit models with FedAvg. In this table, MMACs refer to the forward computation for one sample. Total time refers to the entire training time plus evaluating time; we evaluate the model per 100 communication rounds. For the MobileNet and EfficientNet, the number of total communication rounds is 4000, and for ViT it is 8000. The communication cost is theoretically calculated out. Note the actual communication time should be larger than the theoretical communication time due to the straggler problem and other overhead. . . . . . . . . . . . . 191 11.4 Dataset, models and hyper-parametesr choices for federated image segmentation task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 11.5 Summary of test results on Pascal VOC dataset for federated image segmentation task. DD: Data Distribution Type. N-IID: Heterogeneous distribution with partition factor α =0.5 IID: Homogeneous distribution. C: Number of Clients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
194 11.6 Performance and memory analysis for various batch size of segmentation models on Pascal VOC Dataset. BS: Batch Size . . . . . . . . . . . . . . . . 194 11.7 System performance chart of segmentation network architectures we considered. TT: Training Type. BS: Batch Size . . . . . . . . . . . . . . . . . . . . . . 195 11.8 System performance of YOLOv5 . . . . . . . . . . . . . . . . . . . . . . . . . 198 12.1 CPU/GPU Training v.s. IoT Edge Training . . . . . . . . . . . . . . . . . . 212 12.2 Breakdown of the End-to-end Training Time . . . . . . . . . . . . . . . . . . 214 A.1 The taxonomy of research areas in federated learning and related publication statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260 A.2 various datasets and models used in latest publications from the machine learning community . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265 B.1 Hyperparameters used in Experiments . . . . . . . . . . . . . . . . . . . . . 279 B.2 Overheads of pipe transformation (seconds) . . . . . . . . . . . . . . . . . . 281 C.1 The actual heterogeneous data distribution (non-I.I.D.) generated from CIFAR-10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283 xviii C.2 Detailed information of the ResNet-8 architecture used in our experiment . . 284 C.3 Detailed information of the ResNet-55 architecture used in our experiment . 285 C.4 Detailed information of the ResNet-109 architecture used in our experiment . 286 C.5 Hyperparameters used in Experiments on dataset CIFAR-10 . . . . . . . . . 287 C.6 Hyperparameters used in Experiments on dataset CIFAR-100 . . . . . . . . 287 C.7 Hyperparameters used in Experiments on dataset CINIC-10 . . . . . . . . . 288 D.1 Heterogeneous data distribution (non-IID) used in FedNAS for Global Model experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291 E.1 Atom features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298 E.2 Hyperparameter Range for Experiments . . . . . . . . . . . . . . . . . . . . 299 E.3 Hyperparameters used in our experiments. For SpreadGNN we use a communication period τ = 1 and a complete topology (all clients connected to all other clients) in all experiments. . . . . . . . . . . . . . . . . . . . . . . . 300 F.1 [69] Comparisons on ImageNet linear classification . All are based on ResNet-50 pre-trained with two 224× 224 views in a centralized setting. Evaluation is on a single crop. “repro.” denotes reproduction conducted by authors of SimSiam [69], and “+” denotes improved reproduction v.s. original papers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 F.2 Evaluation Accuracy for Various Per-SSFL Methods. . . . . . . . . . . . . . 315 F.3 Hyper-parameters for Section 7.5.2 . . . . . . . . . . . . . . . . . . . . . . . 316 F.4 Hyper-parameters for Section 7.5.4.2 . . . . . . . . . . . . . . . . . . . . . . 316 F.5 Hyper-parameters for experimental results in Section 7.5.3 . . . . . . . . . . 316 G.1 Complexity comparison between SecAgg [37], SecAgg+ [19], and LightSecAgg. Here N is the total number of users. The parameters d and s respectively represent the model size and the length of the secret keys as the seeds for PRG, where s≪ d. LightSecAgg and SecAgg provide worst-case privacy guarantee T and dropout-resiliency guarantee D for any T and D as long as T +D < N. SecAgg+ provides probabilistic privacy guarantee T and dropout-resiliency guarantee D. 
LightSecAgg selects three design parameters T, D and U such that T

[Table 2.6, partial rows: "… > 4 days | > 3 days" and "Multi-GPU distributed training (wall clock time) | 11 hours | 7 hours". *Note that the number of workers can be larger than the number of GPUs because FedML supports multi-process training in a single GPU.]

We also compared the training time of distributed computing with that of standalone simulation. The result in Table 2.6 reveals that when training large CNNs, the standalone simulation is about 8 times slower than distributed computing with 10 parallel workers. Therefore, when training large DNNs, we suggest using FedML's distributed computing paradigm, which is not supported by LEAF [52]. Moreover, FedML supports multiprocessing in a single GPU, which enables FedML to run a large number of training workers using only a few GPUs. For example, when training ResNet on CIFAR-10, FedML can run 112 workers in a server with 8 GPUs.

6 https://www.wandb.com/
7 https://github.com/FedML-AI/FedML/tree/master/benchmark

2.6 Conclusion

FedML is a federated learning library and benchmark that can be used by researchers as well as industry practitioners. Our goal is to empower researchers and engineers with an end-to-end toolkit that facilitates developing FL algorithms and provides a platform to fairly compare different design choices and their tradeoffs. We welcome feedback from readers, and we will continuously update FedML to support the research of the federated learning community.

Chapter 3
PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models

3.1 Introduction

[Figure 3.1: Interpretable Freeze Training: DNNs converge bottom up (results on CIFAR-10 using ResNet). Each pane shows layer-by-layer similarity using SVCCA [360] at T0 (0% trained), T1 (35% trained), T2 (75% trained), and T3 (100% trained).]

Large Transformer models [46, 235] have powered accuracy breakthroughs in both natural language processing and computer vision. GPT-3 set a new accuracy record on nearly all NLP tasks. Vision Transformer (ViT) [92] also achieved 89% top-1 accuracy on ImageNet, outperforming state-of-the-art convolutional networks ResNet-152 [163] and EfficientNet [435]. To tackle the growth in model sizes, researchers have proposed various distributed training techniques, including parameter servers [241, 196, 214], pipeline parallelism [180, 339, 328], intra-layer parallelism [235, 404, 410], and zero-redundancy data parallelism [361].

Existing distributed training solutions, however, only study scenarios where all model weights are required to be optimized throughout the training (i.e., the computation and communication overhead remains relatively static over different iterations). Recent works on freeze training [360, 325, 407] suggest that parameters in neural networks usually converge from the bottom up (i.e., not all layers need to be trained all the way through training). Figure 3.1 shows an example of how weights gradually stabilize during training in this approach. This observation motivates us to utilize freeze training for distributed training of Transformer models, accelerating training by dynamically allocating resources to focus on a shrinking set of active layers. Such a layer freezing strategy is especially pertinent to pipeline parallelism, as excluding consecutive bottom layers from the pipeline can reduce computation, memory, and communication overhead.
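To make the freeze-training idea concrete, the following is a minimal, self-contained PyTorch sketch of bottom-up layer freezing. It is a generic illustration rather than PipeTransformer's implementation; the toy model, the layer count, and the helper name `freeze_bottom_layers` are invented for this example.

```python
import torch
import torch.nn as nn

# Toy "Transformer-like" stack of L blocks; a real workload would use ViT or BERT.
L = 12
model = nn.Sequential(
    *[nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True) for _ in range(L)]
)

def freeze_bottom_layers(model: nn.Sequential, num_frozen: int) -> None:
    """Freeze the first `num_frozen` blocks so they need no gradients or optimizer state."""
    for idx, block in enumerate(model):
        for p in block.parameters():
            p.requires_grad_(idx >= num_frozen)

# Suppose a freeze policy decides the bottom 4 layers have converged.
freeze_bottom_layers(model, num_frozen=4)

# Only still-active parameters go to the optimizer, shrinking the backward pass and,
# in data-parallel training, the gradient volume that must be synchronized.
optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.1)

x = torch.randn(8, 16, 64)           # (batch, sequence length, hidden size)
loss = model(x).pow(2).mean()        # dummy objective, just to drive one backward pass
loss.backward()
optimizer.step()
```

In a pipeline-parallel setting the same idea goes further: the frozen prefix can be removed from the pipeline entirely, which is what the AutoPipe module described below automates.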
[Figure 3.2: The process of PipeTransformer's automated and elastic pipelining to accelerate distributed training of Transformer models for NLP (BERT) and CV (ViT) — as layers freeze from T0 to T3, (1) the Freeze Algorithm, (2) AutoPipe (elastic pipelining), (3) AutoDP (spawning more pipeline replicas), and (4) AutoCache (cross-process caching) shrink each pipeline and add replicas.]

In this work, we propose PipeTransformer, an elastic pipelining training acceleration framework that automatically reacts to frozen layers by dynamically transforming the scope of the pipelined model and the number of pipeline replicas. To the best of our knowledge, this is the first work that studies layer freezing in the context of both pipeline and data-parallel training. Figure 3.2 demonstrates the benefits of such a combination. First, by excluding frozen layers from the pipeline, the same model can be packed into fewer GPUs, leading to both fewer cross-GPU communications and smaller pipeline bubbles. Second, after packing the model into fewer GPUs, the same cluster can accommodate more pipeline replicas, increasing the width of data parallelism. More importantly, the speedups acquired from these two benefits are multiplicative rather than additive, further accelerating the training.

The design of PipeTransformer faces four major challenges. First, the freeze algorithm must make adaptive freezing decisions on the fly; however, existing work [360] only provides a posterior analysis tool. Second, the efficiency of pipeline re-partitioning is influenced by multiple factors, including partition granularity, cross-partition activation size, and the chunking (the number of micro-batches) of mini-batches, which requires reasoning and searching over a large solution space. Third, to dynamically introduce additional pipeline replicas, PipeTransformer must overcome the static nature of collective communications and avoid potentially complex cross-process messaging protocols when onboarding new processes (one pipeline is handled by one process). Finally, caching can save time for repeated forward propagation of frozen layers, but it must be shared between existing pipelines and newly added ones, as the system cannot afford to create and warm up a dedicated cache for each replica.

PipeTransformer is designed with four core building blocks to address the aforementioned challenges. First, we design a tunable and adaptive algorithm to generate signals that guide the selection of layers to freeze over different iterations (Section 3.3.1). Once triggered by these signals, our elastic pipelining module AutoPipe packs the remaining active layers into fewer GPUs by taking into account both activation sizes and the variance of workloads across heterogeneous partitions (frozen layers and active layers). It then splits a mini-batch into an optimal number of micro-batches based on prior profiling results for different pipeline lengths (Section 3.3.2). Our next module, AutoDP, spawns additional pipeline replicas to occupy freed-up GPUs and maintains hierarchical communication process groups to attain dynamic membership for collective communications (Section 3.3.3).
Our final module, AutoCache, efficiently shares activations across existing and new data-parallel processes and automatically replaces stale caches during transitions (Section 3.3.4).

Overall, PipeTransformer combines the Freeze Algorithm, AutoPipe, AutoDP, and AutoCache modules to provide a significant training speedup. We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on the GLUE and SQuAD datasets. Our results show that PipeTransformer attains up to a 2.83-fold speedup without losing accuracy. We also provide various performance analyses for a more comprehensive understanding of our algorithmic and system-wise design. Finally, we have also developed open-source flexible APIs for PipeTransformer which offer a clean separation among the freeze algorithm, model definitions, and training accelerations, allowing for transferability to other algorithms that require similar freezing strategies. The source code is made publicly available.

3.2 Overview

3.2.1 Background and Problem Setting

Suppose we aim to train a massive model in a distributed training system that uses a hybrid of pipelined model parallelism and data parallelism, targeting scenarios where either the memory of a single GPU device cannot hold the model or, if it can be loaded, the batch size must be kept small enough to avoid running out of memory. More specifically, we define our settings as follows:

Training task and model definition. We train Transformer models (e.g., Vision Transformer [92], BERT [85]) on large-scale image or text datasets. The Transformer model F has L layers, in which the i-th layer is composed of a forward computation function f_i and a corresponding set of parameters w_i. With this definition, the overall model is F = f_0(w_0) ∘ ... ∘ f_{L-1}(w_{L-1}). The model size is S, and the batch size is set to N_bs.

Training infrastructure. Assume the training infrastructure contains a GPU cluster that has N GPU servers (i.e., nodes). Each node has I GPUs. Our cluster is homogeneous, meaning that each GPU and server have the same hardware configuration. Each GPU's memory capacity is M_GPU. Servers are connected by a high-bandwidth network interface such as an InfiniBand interconnect.

Pipeline parallelism. In each machine, we load a model F into a pipeline P which has K partitions (K also represents the pipeline length). The k-th partition p_k consists of consecutive layers p_k = f_i(w_i) ∘ ... ∘ f_j(w_j), and P = p_0 ∘ ... ∘ p_{K-1}. We assume each partition is handled by a single GPU device and 1 ≤ K ≤ I, meaning that we can build multiple pipelines for multiple model replicas in a single machine. We assume all GPU devices in a pipeline belong to the same machine. Our pipeline is a synchronous pipeline, which does not involve stale gradients, and the number of micro-batches is M. In the Linux OS, each pipeline is handled by a single process. We refer the reader to GPipe [180] for more details.

Data parallelism. DDP [246] is a cross-machine distributed data-parallel process group within R parallel workers. Each worker is a pipeline replica (a single process). The r-th worker's index (ID) is rank r. For any two pipelines P^(r_i) and P^(r_j) in DDP, r_i and r_j can belong to either the same GPU server or different GPU servers, and they can exchange gradients with the AllReduce algorithm.

Under these settings, our goal is to accelerate training by leveraging freeze training, which does not require all layers to be trained throughout the duration of the training.
Additionally, it may help save computation, communication, and memory cost, and it can potentially prevent overfitting by consecutively freezing layers. However, these benefits can only be achieved by overcoming the four challenges of designing an adaptive freezing algorithm, dynamic pipeline re-partitioning, efficient resource reallocation, and cross-process caching, as discussed in the introduction. We next describe our overall design, named PipeTransformer, which can address these challenges.

3.2.2 Overall Design

PipeTransformer co-designs an on-the-fly freeze algorithm and an automated elastic pipelining training system that can dynamically transform the scope of the pipelined model and the number of pipeline replicas. The overall system architecture is illustrated in Figure 3.3. To support PipeTransformer's elastic pipelining, we maintain a customized version of PyTorch Pipe [213]. For data parallelism, we use PyTorch DDP [246] as a baseline. Other libraries are standard mechanisms of an operating system (e.g., multi-processing) and thus avoid specialized software or hardware customization requirements.

[Figure 3.3: Overview of the PipeTransformer training system — the Freeze Algorithm, AutoPipe, AutoDP, and AutoCache modules built on top of the deep learning training engine (PyTorch), pipeline parallelism, distributed data parallelism, CUDA, NCCL/GLOO, multi-processing, and cross-process shared memory.]

To ensure the generality of our framework, we have decoupled the training system into four core components: the freeze algorithm, AutoPipe, AutoDP, and AutoCache. The freeze algorithm (grey) samples indicators from the training loop and makes layer-wise freezing decisions, which are shared with AutoPipe (green). AutoPipe is an elastic pipeline module that speeds up training by excluding frozen layers from the pipeline and packing the active layers into fewer GPUs (pink), leading to both fewer cross-GPU communications and smaller pipeline bubbles. Subsequently, AutoPipe passes pipeline length information to AutoDP (purple), which then spawns more pipeline replicas to increase data-parallel width, if possible. The illustration also includes an example in which AutoDP introduces a new replica (purple). AutoCache (orange edges) is a cross-pipeline caching module, as illustrated by the connections between pipelines. The source code architecture is aligned with Figure 3.3 for readability and generality.
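As a concrete picture of the "indicators" that the freeze algorithm samples from the training loop, the sketch below collects one gradient-norm value per layer after a backward pass. This is a hedged, generic PyTorch illustration: the toy model and the helper `per_layer_grad_norms` are assumptions for the example, not part of PipeTransformer's API.

```python
import torch
import torch.nn as nn

# Stand-in for a stack of transformer layers.
model = nn.Sequential(*[nn.Linear(32, 32) for _ in range(6)])

def per_layer_grad_norms(model: nn.Sequential) -> list:
    """Return one L2 gradient norm per layer (0.0 if a layer currently has no gradients)."""
    norms = []
    for layer in model:
        grads = [p.grad.flatten() for p in layer.parameters() if p.grad is not None]
        norms.append(torch.cat(grads).norm().item() if grads else 0.0)
    return norms

# One dummy training step to populate gradients.
x, y = torch.randn(16, 32), torch.randn(16, 32)
nn.functional.mse_loss(model(x), y).backward()

print(per_layer_grad_norms(model))  # layer-wise indicators handed to the freeze policy
```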
To stabilize training, we enforce an upper bound L (T− 1) frozen +α (L− L (T− 1) frozen ) for the number of frozen layers, which is a geometric sequence containing a hyper-parameter α . This essentially freezes an α fraction of the remaining active layers. To illustrate the impact of α , we rewrite the equation as: L (T) frozen = (1− α ) T [ αL 1− α + P T t=2 αL (1− α ) t ] (see Appendix for the derivation), and draw the curve of this function in Figure 9.6. As we can see, a larger α leads to a more aggressive layer freezing. Therefore, Equation 3.1 calculates the number of frozen layers at timestep T using both the gradient norm and a tunable argument α . The α parameter controls the trade-off between accuracy and training speed. This algorithm is also analogous to learning rate (LR) decay. Both algorithms use a scheduler function during training, and take the progress of training as an indicator. The difference is 30 0 2 4 6 8 Epoch 0 5 10 15 20 25 Frozen Layer Number Freeze algorithm alpha=0.1 alpha=0.2 alpha=0.3 alpha=0.5 Figure 3.4: Freeze Algorithm Using Different α that the above freeze algorithm also takes gradient norm into account, making the algorithm simple and effective. Remark: Our system design idea can be generalized to many other progressive training algorithms. See Section 3.6 for more discussions. 3.3.2 AutoPipe: Elastic Pipelining Triggered by the freeze algorithm, AutoPipe can accelerate training by excluding frozen layers from the pipeline and packing the active layers into fewer GPUs. This section elaborates onthekeycomponentsof AutoPipethatdynamicallypartitionpipelines, minimizethenumber of pipeline devices and optimize mini-batch chunk size accordingly. Algorithm 7 presents the pseudo-code. addition Layer Norm Multi-Head Attention FFN addition partition k-1 … … partition k partition k-2 Intermediate output Layer Norm Figure 3.5: Partition boundary is in the middle of a skip connection 31 Algorithm 1 AutoPipe Algorithm 1: Input: modelF, layer number L and L frozen , pipeline length K, frozen layer cost factor λ frozen 2: Return: modelF frozen , modelF pipe , updated K; 3: def m_partition(F,L, L frozen ): //see 3.3.2 4: F frozen = Sequential(); model size S frozen =0 5: F pipe = Sequential(); per-layer size S pipe = [] 6: for layer index = L frozen to L do 7: f ATTi ,f MLPi ← f i 8: F pipe .append(f ATTi );S pipe .append(m_size(f ATTi )) 9: F pipe .append(f MLPi );S pipe .append(m_size(f MLPi )) 10: end for 11: returnF frozen ,S frozen ,F pipe ,S pipe 12: def load_balance(F pipe , S pipe , K): //Section 3.3.2 13: B L =dict(), B S =dict() // balanced L and S 14: L assigned =0; S total = sum(S pipe ) 15: for partition index = k to K do 16: mean=S total /(K - k); 17: var=np.var(S pipe [L assigned :])/(K - k) 18: for sublayer index i = L assigned to len(S pipe ) do 19: S k = S pipe [i] 20: criterion=B S [i]-S frozen (1.0- λ frozen )+S k 21: if criterion < mean + var then 22: B S +=S k ; B L +=1; L assigned +=1; S total -=S k 23: else 24: break 25: end if 26: end for 27: end for 28: return B L , B S 29: F frozen ,S frozen ,F pipe ,S pipe = m_partition(F,L, L frozen ) 30: while K≥ 2 do 31: B L , B S = load_balance(F pipe , S pipe , K/2) 32: B S [0] -= S frozen (1.0 - λ frozen ); 33: M (T) GPU = max(B S ) //Equation 3.2 34: if M (T) GPU <M (0) GPU then 35: K=K/2 36: else 37: break 38: end if 39: end while 40: loadF frozen andF pipe to K GPUs using B S and B L 41: Pipe(F pipe , chunks= get_optimal_chunks (K)) Balanced Pipeline Partitioning. 
Balancing computation time across partitions is critical topipelinetrainingspeed, asskewedworkloaddistributionsacrossstagescanleadtostragglers, forcing devices with lighter workloads to wait (demonstrated by Section 3.4.3). However, maintaining optimally balanced partitions does not guarantee the fastest training speed because other factors also play a crucial role: 32 1. Cross-partition communication overhead. Placing a partition boundary in the middle of a skip connection leads to additional communications since tensors in the skip connection must now be copied to a different GPU. For example, with BERT partitions in figure 3.5, partition k must take intermediate outputs from both partition k− 2 and partition k− 1. In contrast, if the boundary is placed after the addition layer, the communication overhead between partition k− 1 and k is visibly smaller. Our measurements show that having cross- device communication is more expensive than having slightly imbalanced partitions (see the Appendix). Therefore, we do not consider breaking skip connections (highlighted separately as an entire attention layer f ATT i and MLP layer f MLP i in green at line 7 in Algorithm 7). 2. Frozen layer memory footprint. During training, AutoPipe must recompute partition boundaries several times to balance two distinct types of layers: frozen layers and active layers. The frozen layer’s memory cost is a fraction of that in active layers, given that the frozen layer does not need backward activation maps, optimizer states, and gradients. Instead of launching intrusive profilers to obtain thorough metrics on memory and computational cost, we define a tunable cost factor λ frozen to estimate the memory footprint ratio of a frozen layer over the same active layer. Based on empirical measurements in our experimental hardware, we set λ frozen to 1 6 . Based on the above two considerations, AutoPipe balances pipeline partitions based on parameter sizes. More specifically, AutoPipe uses a greedy algorithm to allocate all frozen and active layers to evenly distribute partitioned sublayers into K GPU devices. Pseudo code is described as the load_balance() function in Algorithm 7. The frozen layers are extracted from the original model and kept in a separate model instance F frozen in the first device of a pipeline. Note that the partition algorithm employed in this work is not the only option; PipeTransformer is modularized to work with any alternatives. Pipeline Compression. Pipeline compression helps to free up GPUs to accommodate more pipeline replicas and reduce the number of cross-device communications between partitions. To determine the timing of compression, we can estimate the memory cost of the largest 33 partition after compression, and then compare it with that of the largest partition of a pipeline at timestep T = 0. To avoid extensive memory profiling, the compression algorithm uses the parameter size as a proxy for the training memory footprint. Based on this simplification, the criterion of pipeline compression is as follows: compress the pipeline if M (T) GPU ≤ M (0) GPU where M (T) GPU ⇔ max k∈{0,··· ,K− 1} S p k (3.2) Once the freeze notification is received, AutoPipe will always attempt to divide the pipeline length K by 2 (e.g., from 8 to 4, then 2). By using K 2 as the input, the compression algorithm can verify if the result satisfies the criterion in Equation (1). Pseudo code is shown in lines 25-33 in Algorithm 7. 
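Because Eq. (3.2) uses parameter size as a proxy for memory, the compression decision itself is compact. The sketch below is a simplified stand-in for the load_balance() routine and compression loop of Algorithm 7 (it drops the variance term and other details); λ_frozen = 1/6 follows the value reported above, while active_sizes and frozen_size are hypothetical inputs holding per-sublayer parameter counts.

```python
# Simplified sketch of the pipeline-compression criterion in Eq. (3.2), not the exact
# Algorithm 7 implementation. active_sizes: parameter counts of the active sublayers;
# frozen_size: total parameter count of the frozen layers (kept on the first device).
from typing import List

LAMBDA_FROZEN = 1.0 / 6.0   # empirical memory-cost factor of a frozen layer

def max_partition_size(active_sizes: List[int], frozen_size: float, k: int) -> float:
    """Greedily split the active sublayers into k consecutive, roughly size-balanced
    partitions and return the largest partition size (a proxy for M_GPU)."""
    target = sum(active_sizes) / k
    partitions, idx = [0.0] * k, 0
    for s in active_sizes:
        if partitions[idx] + s > target and idx < k - 1:
            idx += 1
        partitions[idx] += s
    partitions[0] += frozen_size * LAMBDA_FROZEN   # discounted frozen layers on device 0
    return max(partitions)

def maybe_compress(active_sizes: List[int], frozen_size: float, K: int, m0_gpu: float) -> int:
    """Keep halving the pipeline length while the largest partition after compression
    stays within the timestep-0 budget M_GPU^(0), mirroring Eq. (3.2)."""
    while K >= 2 and max_partition_size(active_sizes, frozen_size, K // 2) <= m0_gpu:
        K //= 2
    return K
```

Always attempting K → K/2 (rather than K − 1) mirrors the paper's choice, which frees whole groups of GPUs that AutoDP can then reuse for new pipeline replicas.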
Note that this compression makes the acceleration ratio exponentially increase during training, meaning that if a GPU server has a larger number of GPUs (e.g., more than 8), the acceleration ratio will be further amplified. F 0,0 F 0,1 F 0,2 F 0,3 F 1,0 F 1,1 F 1,2 F 1,3 F 2,0 F 2,1 F 2,2 F 2,3 F 3,0 F 3,1 F 3,2 F 3,3 B 3,0 B 3,1 B 3,2 B 3,3 B 2,0 B 2,1 B 2,2 B 2,3 B 1,0 B 1,1 B 1,2 B 1,3 B 0,0 B 0,1 B 0,2 U 1 U 3 B 0,3 U 0 GPU 0 U 2 GPU 1 GPU 2 GPU 3 K - 1 K - 1 K - 1 K - 1 K - 1 K - 1 K - 1 K is pipeline length (devices) Figure 3.6: Pipeline Bubble: F d,b , B d,b , and U d denote forward, backward, and the optimizer update of micro-batch b on device d, respectively. The total bubble size in each iteration is (K− 1) times per micro-batch forward and backward cost. Additionally, such a technique can also speed up training by shrinking the size of pipeline bubbles. To explain bubble sizes in a pipeline, Figure 3.6 depicts how 4 micro-batches run through a 4-device pipeline (K =4). In general, the total bubble size is (K− 1) times per micro-batch forward and backward cost (for further explanation, please refer to Appendix. Therefore, it is clear that shorter pipelines have smaller bubble sizes. 34 Dynamic Number of Micro-batches. Prior pipeline parallel systems use a fixed number of micro-batches per mini-batch (M). GPipe suggests M ≥ 4× K, where K is the number of partitions (pipeline length). However, given that that PipeTransformer dynamically configures K, we find it to be sub-optimal to maintain a static M during training. Moreover, when integrated with DDP, the value of M also has an impact on the efficiency of DDP gradient synchronizations. Since DDP must wait for the last micro-batch to finish its backward computation on a parameter before launching its gradient synchronization, finer micro- batches lead to a smaller overlap between computation and communication (see Appendix for illustration). Hence, instead of using a static value, PipeTransformer searches for optimal M on the fly in the hybrid of DDP environment by enumerating M values ranging from K to 6K. For a specific training environment, the profiling needs only to be done once (see Algorithm 7 line 35). Section 10.5 will provide performance analyses of M selections. 3.3.3 AutoDP: Spawning More Pipeline Replicas As AutoPipe compresses the same pipeline into fewer GPUs, AutoDP can automatically spawn new pipeline replicas to increase data-parallel width. Despite the conceptual simplicity, subtle dependencies on communications and states require careful design. The challenges are threefold: 1. DDP Communication: Collective communications in PyTorch DDP requires static membership, which prevents new pipelines from connecting with existing ones; 2. State Synchronization: newly activated processes must be consistent with existing pipelines in the training progress (e.g., epoch number and learning rate), weights and optimizer states, the boundary of frozen layers, and pipeline GPU range; 3. Dataset Redistribution: the dataset should be re-balanced to match a dynamic number of pipelines. This not only avoids stragglers but also ensures that gradients from all DDP processes are equally weighted. To tackle these challenges, we create double communication process groups for DDP. As in the example shown in Figure 3.7, the message process group (purple) is responsible for 35 0 1 3 4 5 6 7 8 9 active training process group message process group message between groups: 1. progress of training 2. 
Pipelining info 2 15 14 13 12 11 10 0 1 3 4 5 6 7 8 9 2 15 14 13 12 11 10 T0 T1 Figure 3.7: AutoDP: handling dynamical data parallel with messaging between double process groups (Process 0-7 belong to machine 0, while process 8-15 belong to machine 1) light-weight control messages and covers all processes, while the active training process group (yellow) only contains active processes and serves as a vehicle for heavy-weight tensor communications during training. The message group remains static, whereas the training group is dismantled and reconstructed to match active processes. In T0, only process 0 and 8 are active. During the transition to T1, process 0 activates processes 1 and 9 (newly added pipeline replicas) and synchronizes necessary information mentioned above using the message group. The four active processes then form a new training group, allowing static collective communications adaptive to dynamic memberships. To redistribute the dataset, we implement a variant of DistributedSampler that can seamlessly adjust data samples to match the number of active pipeline replicas. The above design also naturally helps to reduce DDP communication overhead. More specifically, when transitioning from T0 to T1, processes 0 and 1 destroy the existing DDP instances, and active processes construct a new DDP training group using F pipe (AutoPipe storesF frozen andF pipe separately, introduced in Section 3.3.2). Discussion of communication cost can be found in Appendix. 3.3.4 AutoCache: Cross-pipeline Caching Caching activation maps from frozen layers can help further speed up training. This idea appears to be straightforward, but several caveats must be carefully addressed. 36 Disk storage 3 3 4 5 pipeline 0 (process 0) newly added pipeline1 (process 1) 9 cross process caching sharing 1 2 automating the timing of caching Caching Daemon CPU Host memory T1 T2 T1 T2 Figure 3.8: AutoCache Cross-process caching. The cache must be shared across processes in real time, as creating and warming up a dedicated cache for each model replica slow down the training. This is achieved by spawning a dedicated daemon process to hold cache in shared memory that all training processes can access in real time. Figure 3.8 shows an example of the transition from T1 to T2, assuming T1 freezes 3 layers, T2 freezes 4 layers, and 5 layers remain active in T2. Immediately after the transition by AutoDP the cache still holds cached activations from layer 3, which must be replaced by activations from layer 7. Therefore, all processes read their corresponding activations from the cache, feed them to the next 4 layers to compute activations for layer 7, then replace the existing cache with new activations for their samples accordingly. In this way, AutoCache can gradually update cached activations without running any sample through any frozen layers twice. When the activations are too large to reside on CPU memory, AutoCache will also swap them to the disk and perform pre-fetching automatically. More details on the cross-process cache design can be found in the Appendix. Timing of cache is also important, as the cache can be slower than running the real forward propagation, especially if frozen layers are few and activations are large. 
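As a concrete illustration of this trade-off, the sketch below simply times a cached-activation fetch against recomputing the frozen forward pass and reports which is cheaper; it is a hedged, simplified stand-in for the decision described next, and frozen_layers, sample_batch, and fetch_cached are hypothetical placeholders (a CUDA device is assumed).

```python
# Illustrative timing check for the caching decision; not AutoCache's actual profiler.
import time
import torch

@torch.no_grad()
def caching_is_faster(frozen_layers: torch.nn.Module, sample_batch: torch.Tensor,
                      fetch_cached, trials: int = 5) -> bool:
    """fetch_cached: a zero-argument callable that reads the cached activations for
    sample_batch (e.g., from shared host memory or disk)."""
    def avg_time(fn):
        torch.cuda.synchronize()          # assumes a CUDA device; drop on CPU
        start = time.perf_counter()
        for _ in range(trials):
            fn()
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / trials

    forward_t = avg_time(lambda: frozen_layers(sample_batch))
    cache_t = avg_time(fetch_cached)
    return cache_t < forward_t            # enable caching only when fetching wins
```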
To ensure that our training system can adapt to different hardware, model architecture, and batch size settings, AutoCache also contains a profiler that helps evaluate the appropriate transition to 37 enable caching, and it only employs cached activations when the profiler suggests caching can speed up the forward pass. Performance analysis is provided at Section 3.4.3. 3.4 Experiments This section first summarizes experiment setups and then evaluates PipeTransformer using computer vision and natural language processing tasks. More comprehensive results can be found in the Appendix. 3.4.1 Setup Hardware. Experiments were conducted on 2 identical machines connected by InfiniBand CX353A (5GB/s), where each machine is equipped with 8 NVIDIA Quadro RTX 5000 (16GB GPU memory). GPU-to-GPU bandwidth within a machine (PCI 3.0, 16 lanes) is 15.754GB/s. Implementation. We used PyTorch Pipe as a building block, which has not yet been officially released at the time of writing of this work. Hence, we used the developer version 1.8.0.dev20201219. The BERT model definition, configuration, and related tokenizer are from HuggingFace 3.5.0. We implemented Vision Transformer using PyTorch by following its TensorFlow implementation. More details can be found in our source code. Models and Datasets. Experiments employ two representative Transformers in CV and NLP: Vision Transformer (ViT) and BERT. ViT was run on an image classification task, initialized with pre-trained weights on ImageNet21K and fine-tuned on ImageNet and CIFAR-100. BERT was run on two tasks, text classification on the SST-2 dataset from the General Language Understanding Evaluation (GLUE) benchmark, and question answering on the SQuAD v1.1 Dataset (Stanford Question Answering) which is a collection of 100k crowdsourced question/answer pairs. 38 Training Schemes. Given that large models normally would require thousands of GPU- days (e.g., GPT-3) if trained from scratch, fine-tuning downstream tasks using pre-trained models has become a trend in CV and NLP communities. Moreover, PipeTransformer is a complex training system that involves multiple core components. Thus, for the first version of PipeTransformer system development and algorithmic research, it is not cost-efficient to develop and evaluate from scratch using large-scale pretraining. Therefore, experiments presentedinthissectionfocusesonpre-trainedmodels. Notethatsincethemodelarchitectures in pre-training and fine-tuning are the same, PipeTransformer can serve both. We discussed pre-training results in the Appendix. Baseline. Experiments in this section compares PipeTransformer to the state-of-the-art framework, a hybrid scheme of PyTorch Pipe (PyTorch’s implementation of GPipe [180]) and PyTorch DDP. Since this is the first work that studies accelerating distributed training by freezing layers, there are no perfectly aligned counterpart solutions yet. Hyper-parameters. Experiments use ViT-B/16 (12 transformer layers, 16× 16 input patch size) for ImageNet and CIFAR-100, BERT-large-uncased (24 layers) for SQuAD 1.1, and BERT-base-uncased (12 layers) for SST-2. With PipeTransformer ViT and BERT training can set the per-pipeline batch size to around 400 and 64 respectively. Other hyperparameters (e.g., epoch, learning rate) for all experiments are presented in Appendix. 3.4.2 Overall Training Acceleration We summarize the overall experimental results in Table 3.1. 
Note that the speedup we report is based on a conservative α (1/3) value that obtains comparable or even higher accuracy. A more aggressive α (2/5, 1/2) can obtain a higher speedup but may lead to a slight loss in accuracy (see Section 3.4.3). Note also that the model size of BERT (24 layers) is larger than that of ViT-B/16 (12 layers), so it takes more time for communication (see Section 3.4.3 for details).

Table 3.1: Speedup for ViT and BERT Training

Dataset | Baseline Accuracy | Baseline Training Time | PipeTransformer Accuracy | PipeTransformer Training Time | Training Speedup
ImageNet | 80.83 ± 0.05 | 26h 30m | 82.18 ± 0.32 | 9h 21m | 2.83×
CIFAR-100 | 91.21 ± 0.07 | 35m 6s | 91.33 ± 0.05 | 12m 23s | 2.44×
SQuAD 1.1 | 90.71 ± 0.18 | 5h 7m | 90.69 ± 0.23 | 2h 26m | 2.10×

*Note: 1. the accuracy is the mean and variance of three independent runs with the same random seed; 2. the training time among different runs is relatively stable (the gap is less than 1 minute); 3. GLUE (SST-2)'s training time is relatively small, so we mainly used it for debugging without reporting its few-minute result; 4. accuracy metric: ImageNet/CIFAR-100: top-1 accuracy; SQuAD: F1 score.

3.4.3 Performance Analysis

This section presents evaluation results and analyzes the performance of different components in PipeTransformer. More experimental results can be found in the Appendix.

[Figure 3.9: Speedup Breakdown (ViT on ImageNet). Panel (a) plots sample throughput for five configurations: No Freeze (baseline), Freeze Only, Freeze + AutoPipe + AutoDP, Freeze + AutoPipe + AutoCache, and Freeze + AutoPipe + AutoDP + AutoCache. Panel (b) compares the corresponding speedup ratios, ranging from 0.95× to 2.83× over the 1.0× baseline.]

Speedup Breakdown. To understand the efficacy of all four components and their impacts on training speed, we experimented with different combinations and used their training sample throughput (samples/second) and speedup ratio as metrics. Results are illustrated in Figure 3.9. Key takeaways from these experimental results are: 1. the main speedup is the result of elastic pipelining, which is achieved through the joint use of AutoPipe and AutoDP; 2. AutoCache's contribution is amplified by AutoDP; 3. freeze training alone, without system-wise adjustment, even downgrades the training speed (discussed in Section 3.3.2). We provide additional explanations of these results in the Appendix.

[Figure 3.10: Some Results of Performance Analysis. Panel (a): tuning α in the freeze algorithm; panel (b): profiling the optimal chunk number M for pipeline lengths K = 2, 4, 8 (at K = 8, throughput for M = 1 to 6 is 368, 410, 386, 340, 298, and 286 samples/second); panel (c): the timing of caching.]

Communication Cost. We also analyzed how communication and computation contribute to the overall training time. Since PyTorch DDP overlaps communication with computation, the time difference between a local training iteration and a distributed training iteration does not faithfully represent the communication delay. Moreover, as DDP also organizes parameters into buckets and launches an AllReduce for each bucket, recording the start and finish time of overall communications also falls short, as there can be time gaps between buckets. To correctly measure DDP communication delay, we combined the DDP communication hook with the CUDAFuture callback. More details of this measurement are documented in the Appendix. Key takeaways: 1. larger models cost more time on communication (BERT on SQuAD); 2. a higher cross-machine bandwidth can further speed up training, especially for larger models.

Table 3.2: Communication Cost vs. Computational Cost

Dataset | Overall Cost | Communication Cost | Computation Cost | Communication Cost Ratio
ImageNet | 9h 21m | 34m | 8h 47m | 5.9%
SQuAD | 2h 26m | 16m 33s | 2h 9m | 8.8%

Tuning α in the Freezing Algorithm. We ran experiments to show how the α in the freeze algorithm influences training speed. The results clearly demonstrate that a larger α (excessive freezing) leads to a greater speedup but suffers from a slight performance degradation. In the case shown in Figure 3.10(a), where α = 1/5, freeze training outperforms normal training and obtains a 2.04-fold speedup. We provide more results in the Appendix.

Optimal Chunks in the Elastic Pipeline. We profiled the optimal number of micro-batches M for different pipeline lengths K. Results are summarized in Figure 3.10(b). As we can see, different K values lead to different optimal M, and the throughput gaps across different M values are large (as shown when K = 8), which confirms the necessity of an anterior profiler in elastic pipelining.

Understanding the Timing of Caching. To evaluate AutoCache, we compared the sample throughput of training that activates AutoCache from epoch 0 with a training job without AutoCache. Figure 3.10(c) shows that enabling caching too early can slow down training, as caching can be more expensive than forward propagation on a small number of frozen layers. After freezing more layers, caching activations clearly outperforms the corresponding forward propagation. As a result, AutoCache uses a profiler to determine the proper timing to enable caching. In our system, for ViT (12 layers), caching starts from 3 frozen layers, while for BERT (24 layers), caching starts from 5 frozen layers.

3.5 Related Works

PipeTransformer combines pipeline parallelism [180, 328, 327, 339, 498] and data parallelism [246]. Both techniques have been extensively studied in prior work. GPipe [180] parallelizes micro-batches within a mini-batch and enforces synchronizations between consecutive mini-batches. The synchronization barrier creates execution bubbles, which are exacerbated as the model spans more devices. PipeDream [328, 327], Megatron-LM [327], HetPipe [339], and PipeMare [498] remove or mitigate execution bubbles by allowing a configurable amount of staleness. Although evaluations show that models can still converge with high accuracy, this staleness breaks the mathematical equivalence to local training. PipeTransformer builds on top of the PyTorch pipeline-parallel and distributed data-parallel APIs [246]. Compared to prior solutions, PipeTransformer reduces the size of bubbles during training by dynamically packing the active layers into fewer GPUs. Moreover, the communication overhead for data-parallel training, which is the dominant source of delay, also drops when the active model size shrinks.

3.6 Discussion

Pretraining vs. Fine-tuning: Given that the model architectures in pre-training and fine-tuning are the same, we do not need to change the system design. Running larger Transformers (over 32 layers) is straightforward because almost all giant Transformer models are designed by simply stacking more transformer encoder layers. PipeTransformer can serve as a training system for both pre-training and fine-tuning. We plan to run our training system on more models and datasets in both settings.

Designing a better freeze algorithm: Our proposed algorithm is simple, yet it proves to be effective on various tasks.
However, we believe that further developments to the freeze algorithm may lead to better generalization and obtain higher accuracy. Versatility: PipeTransformer training system can also be used on other algorithms that run progressive training [124] or gradually fix portions of neural network. For example, cross- silo federated learning, layer-by-layer neural architecture search, and pruning large DNNs are all potential use cases of our training system. We will explore the training acceleration for these scenarios in our future works. 3.7 Conclusion This work proposes PipeTransformer, a holistic solution that combines elastic pipeline- parallel and data-parallel for distributed training. More specifically, PipeTransformer incrementally freezes layers in the pipeline, packs remaining active layers into fewer GPUs, 43 and forks more pipeline replicas to increase the data-parallel width. Evaluations on ViT and BERT models show that compared to the state-of-the-art baseline, PipeTransformer attains up to 2.83× speedups without accuracy loss. 44 Part II Federated and Distributed Machine Learning: Algorithm 45 Chapter 4 FedGKT: Edge-cloud Collaborative Training for Resource-constrained Clients 4.1 Introduction The size of convolutional neural networks (CNN) matters. As seen in both manually de- signed neural architectures (ResNet [163]) and automated architectures discovered by neural architecture search (DARTS [278], MiLeNAS [159], EfficientNets [435]), scaling up CNN size (e.g., width, depth, etc.) is known to be an effective approach for improving model accuracy. Unfortunately, training large CNNs is challenging for resource-constrained edge devices (e.g., smartphones, IoT devices, and edge servers). The demand for edge-based training is increasing as evinced by a recent surge of interest in Federated Learning (FL) [207]. FL is a distributed learning paradigm that can collaboratively train a global model for many edge devices without centralizing any device’s dataset [310, 148, 465]. FL can boost model accuracy in situations when a single organization or user does not have sufficient or relevant data. Consequently, many FL services have been deployed commercially. For instance, Google has improved the accuracy of item ranking and language models on Android smartphones by using FL [39]. FL is also a promising solution when data centralization is undesirable or infeasible due to privacy and regulatory constraints [207]. However, one significant impediment in edge training is the gap between the computational demand of large 46 Compact CNN f extractor f classifier … f extractor f server Edge Server f server LOSS KD 1 - local training 2 - periodic transfer 3 - transfer back 4 - edge-sided model f extractor f classifier f extractor f classifier LOSS KD asynchronous (a) Alternating and periodical knowledge transfer Conv1+BN1+ ReLU1 BottleNeck x 2 FC Layer (classifier) BottleNeck x (18 or 36) FC Layer (classifier) ResNet-56 or ResNet-110 fserver fclient ResNet-55/109 ResNet-8 (b) CNN architectures on the edge and server Figure 4.1: Reformulation of Federated Learning: Group Knowledge Transfer CNNs and the meager computational power on edge devices. FL approaches, such as FedAvg [310] can reduce communication frequency by local SGD and model averaging [525], but they only evaluate the convergence property on small CNNs, or assume the client has enough computational power with GPUs to train large CNNs, which is unrealistic in a real-world system. 
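For reference, the aggregation step that FedAvg performs after each round of local SGD can be sketched as follows. This is a generic illustration of the N^(k)/N-weighted averaging rule rather than the implementation in [310], and the function name and inputs are hypothetical.

```python
# Generic sketch of FedAvg-style weighted model averaging (illustrative, not [310]'s code).
from collections import OrderedDict
from typing import List
import torch

def fedavg_aggregate(client_states: List[OrderedDict], client_sizes: List[int]) -> OrderedDict:
    """client_states: state_dicts returned by the clients after local SGD;
    client_sizes: local sample counts N^(k), used as averaging weights."""
    total = float(sum(client_sizes))
    global_state = OrderedDict()
    for name in client_states[0]:
        # Weighted average of every parameter/buffer (buffers are averaged too, for simplicity).
        global_state[name] = sum((n / total) * state[name].float()
                                 for state, n in zip(client_states, client_sizes))
    return global_state
```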
To tackle the computational limitation of edge nodes, model parallelism-based split learning (SL) [129, 458] partitions a large model and offloads some portion of the neural architecture to the cloud, but SL has a severe straggler problem because a single mini-batch iteration requires multiple rounds of communication between the server and edges. In this work, we propose Group Knowledge Transfer (FedGKT), an efficient federated learning framework for resource-constrained edge devices. FedGKT aims to incorporate benefits from both FedAvg [310] and SL [129, 458] by training using local SGD as in FedAvg but also placing low compute demand at the edge as in SL. FedGKT can transfer knowledge from many compact CNNs trained at the edge to a large CNN trained at a cloud server. The essence of FedGKT is that it reformulates FL as an alternating minimization (AM) approach [338, 25, 33, 13, 481, 366], which optimizes two random variables (the edge model and the server model) by alternatively fixing one and optimizing another. Under this reformulation, FedGKT not only boosts training CNNs at the edge but also contributes to the development of a new knowledge distillation (KD) paradigm, group knowledge transfer, to boost the 47 performance of the server model. Fig. 4.1(a) provides an overview of FedGKT. The compact CNN on the edge device consists of a lightweight feature extractor and classifier that can be trained efficiently using its private data ( 1 - local training). After local training, all the edge nodes agree to generate exactly the same tensor dimensions as an output from the feature extractor. The larger server model is trained by taking features extracted from the edge-side model as inputs to the model, and then uses KD-based loss function that can minimize the gap between the ground truth and soft label (probabilistic prediction in KD [167, 47, 16, 380]) predicted from the edge-side model (2 - periodic transfer). To boost the edge model’s performance, the server sends its predicted soft labels to the edge, then the edge also trains its local dataset with a KD-based loss function using server-side soft labels (3 - transfer back). Thus, the server’s performance is essentially boosted by knowledge transferred from the edge models and vice-versa. Once the training is complete, the final model is a combination of its local feature extractor and shared server model (4 - edge-sided model). The primary trade-off is that FedGKT shifts the computing burden from edge devices to the powerful server. FedGKT unifies multiple advantages into a single framework: 1. FedGKT is memory and computation efficient, similar to SL; 2. FedGKT can train in a local SGD manner like FedAvg to reduce the communication frequency; 3. Exchanging hidden features as in SL, as opposed to exchanging the entire model as in FedAvg, reduces the communication bandwidth requirement. 4. FedGKT naturally supports asynchronous training, which circumvents the severe synchronization issue in SL. The server model can immediately start training when it receives inputs from any client. We develop FedGKT based on the FedML research library [153] and comprehensively evaluate FedGKT using edge and server CNNs designed based on ResNet [163] (as shown in Fig. 4.1(b)). We train on three datasets with varying training difficulties (CIFAR-10 [223], CIFAR-100 [223], and CINIC-10 [81]) and their non-I.I.D. (non identical and independent distribution) variants. As for the model accuracy, our results on both I.I.D. and non-I.I.D. 
datasets show that FedGKT can obtain accuracy comparable to FedAvg [310]. More importantly, FedGKT makes edge training affordable. Compared to 48 FedAvg, FedGKT demands 9 to 17 times less computational power (FLOPs) on edge devices and requires 54 to 105 times fewer parameters in the edge CNN. To understand FedGKT comprehensively, asynchronous training and ablation studies are performed. Some limitations are also discussed. 4.2 Related Works Federated Learning. Existing FL methods such as FedAvg [310], FedOpt [367], and FedMA [465] face significant hurdles in training large CNNs on resource-constrained devices. Recent works FedNAS [142, 159] and [172] work on large CNNs, but they rely on GPU training to complete the evaluations. Others [21, 476, 438, 6, 272, 419, 463, 417, 98] optimize the communication cost without considering edge computational limitations. Model parallelism- based split learning [129, 458] attempts to break the computational constraint, but it requires frequent communication with the server. Knowledge Distillation (KD). We use KD [167] in a different manner from existing and concurrent works [270, 29, 234, 546, 429, 297, 187, 245]. Previous works only consider transferring knowledge from a large network to a smaller one [167, 47, 16, 380], or they transfer knowledge from a group, but each member in the group shares the same large model architecture or a large portion of the neural architecture with specific tail or head layers [536, 10, 425, 191, 62, 340]. Moreover, all teachers and students in distillation share the same dataset [62, 447, 550, 459], while in our setting each member (client) can only access its own independent dataset. Previous methods use centralized training, but we utilize an alternating training method. Efficient On-device Deep Learning . Our work also relates to efficient deep learning on edge devices, such as model compression [133, 165, 516], manually designed architectures (MobileNets [171], ShuffeNets [535], SqueezeNets [184]), or even efficient neural architecture search (EfficientNets [435], FBNet [482]). However, all of these techniques are tailored for the inference phase rather than the training phase. 
4.3 Group Knowledge Transfer

[Figure 4.2 appears here. It contrasts three training paradigms: (a) Federated Learning, where each device trains the full model W, which is not affordable if W is a large CNN; (b) Split Learning, where each training iteration requires frequent communication between the edges and the server; and (c) our reformulation, which consolidates benefits from both FL and SL by keeping a small feature extractor and classifier (W_e^{(k)}, W_c^{(k)}) on each edge and a large model W_s on the server.]
sha1_base64="TBqWkX+HKMMJJPA/MHoLNHbPM/A=">AAAB6nicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoMeiF48V7Qe0oWy2k3bpZhN2N0IJ/QlePCji1V/kzX/jts1BWx8MPN6bYWZekAiujet+O4W19Y3NreJ2aWd3b/+gfHjU0nGqGDZZLGLVCahGwSU2DTcCO4lCGgUC28H4dua3n1BpHstHM0nQj+hQ8pAzaqz00O5jv1xxq+4cZJV4OalAjka//NUbxCyNUBomqNZdz02Mn1FlOBM4LfVSjQllYzrErqWSRqj9bH7qlJxZZUDCWNmShszV3xMZjbSeRIHtjKgZ6WVvJv7ndVMTXvsZl0lqULLFojAVxMRk9jcZcIXMiIkllClubyVsRBVlxqZTsiF4yy+vklat6l1Ua/eXlfpNHkcRTuAUzsGDK6jDHTSgCQyG8Ayv8OYI58V5dz4WrQUnnzmGP3A+fwApiI23</latexit> FP BP Figure 4.2: Reformulation of FL: An Alternating Minimization Perspective 4.3.1 Preliminary We aim to collaboratively train large convolutional neural networks (e.g., ResNet) on many resource-constraineddevicesthatarenotequippedwithGPUaccelerators,withoutcentralizing each device’s dataset to the server side. We specifically consider supervised learning with C categories in the entire datasetD. We assume that there are K clients (edge devices) in the network. Thekth node has its own datasetD k := X k i ,y i N (k) i=1 , whereX i is theith training sample, y i is the corresponding label of X i , y i ∈{1,2,...,C} (a multi-classification learning task), and N (k) is the sample number in datasetD k . D ={D 1 ,D 2 ,...,D k }, N = P K k=1 N (k) . In general, we can formulate CNN-based federated learning as a distributed optimization problem: min W F(W) def = min W K X k=1 N (k) N · f (k) (W),where f (k) (W)= 1 N (k) N (k) X i=1 ℓ(W;X i ,y i ) (4.1) whereW represents the network weight of a global CNN in each client. f (k) (W) is the kth client’s local objective function that measures the local empirical risk over the heterogeneous datasetD k . ℓ is the loss function of the global CNN model. Most off-the-shelf federated optimization methods (e.g., FedAvg [310], FedProx [253], FedNova [471], and FedOpt [367]) propose to solve objective function equation 11.1 with variant local SGD [525] optimization methods for communication-efficient training and 50 demonstrate their characteristics with experiments on linear models (logistic regression) or shallow neural networks (2 convolutional layers). However, as shown in Fig. 4.2(a), the main drawback is that these methods cannot train large CNN at the resource-constrained edge devices due to lack of GPU accelerators and sufficient memory. Model parallelism-based split learning [129, 458], as shown in Fig. 4.2(b), attempts to break the computational constraint by splitting W into two portions and offloading the larger portion into the server-side, but a single mini-batch iteration requires remote forward propagation and backpropagation. For edge computing, such a highly frequent synchronization mechanism may lead to the severe straggler problem that significantly slows down the training process. 4.3.2 Reformulation Non-convex Optimization. To solve the resource-constrained problem in existing FL, we reconsider another methodology to solve the FL optimization problem. As illustrated in Fig. 4.2(c), we divide the global CNN W in Eq. equation 11.1 into two partitions: a small feature extractor model W e and a large-scale server-side model W s , and put them on the edge and the server, respectively. We also add a classifier W c forW e to create a small but fully trainable model on the edge. Consequently, we reformulate a single global model optimization into an non-convex optimization problem that requires us to solve the server model F s and the edge model F c simultaneously. 
4.3.2 Reformulation

Non-convex Optimization. To solve the resource-constrained problem in existing FL, we reconsider another methodology to solve the FL optimization problem. As illustrated in Fig. 4.2(c), we divide the global CNN W in Eq. (4.1) into two partitions: a small feature extractor model W_e and a large-scale server-side model W_s, and put them on the edge and the server, respectively. We also add a classifier W_c for W_e to create a small but fully trainable model on the edge. Consequently, we reformulate a single global model optimization into a non-convex optimization problem that requires us to solve the server model F_s and the edge model F_c simultaneously. Our reformulation is as follows:

\operatorname*{argmin}_{W_s} F_s(W_s, W_e^*) = \operatorname*{argmin}_{W_s} \sum_{k=1}^{K} \sum_{i=1}^{N^{(k)}} \ell_s\big(f_s(W_s; H_i^{(k)}), y_i^{(k)}\big) \qquad (4.2)

\text{subject to: } H_i^{(k)} = f_e^{(k)}(W_e^{(k)}; X_i^{(k)}) \qquad (4.3)

\operatorname*{argmin}_{(W_e^{(k)}, W_c^{(k)})} F_c(W_e^{(k)}, W_c^{(k)}) = \operatorname*{argmin}_{(W_e^{(k)}, W_c^{(k)})} \sum_{i=1}^{N^{(k)}} \ell_c\big(f^{(k)}(W_e^{(k)}, W_c^{(k)}); X_i^{(k)}, y_i^{(k)}\big) \qquad (4.4)

= \operatorname*{argmin}_{(W_e^{(k)}, W_c^{(k)})} \sum_{i=1}^{N^{(k)}} \ell_c\big(f_c^{(k)}(W_c^{(k)}; \underbrace{f_e^{(k)}(W_e^{(k)}; X_i^{(k)})}_{H_i^{(k)}}), y_i^{(k)}\big) \qquad (4.5)

where \ell_s and \ell_c are general loss functions for the server model and the edge model, respectively. f_s is the server model, and f^{(k)} is the edge-side model, which consists of a feature extractor f_e^{(k)} followed by a classifier f_c^{(k)}. W_s, W_e^{(k)}, and W_c^{(k)} are the network weights of f_s, f_e^{(k)}, and f_c^{(k)}, respectively. H_i^{(k)} is the ith sample's feature map (a hidden vector or tensor) output by the feature extractor f_e^{(k)} (Eq. (4.3)). Note that Eq. (4.5) can be solved independently on each client. The kth client model f^{(k)} is trained on its local dataset (Eq. (4.5)), while the server model f_s is trained using H_i^{(k)} as input features (Eq. (4.2)). During the inference phase, the final trained model architecture for client k is stacked from the architecture of the feature extractor f_e^{(k)} and the architecture of the server model f_s. In practice, the client can either run offline inference by downloading the server model f_s and using it locally, or perform online inference through a network connection with the server.

Advantages and Challenges. The core advantage of the above reformulation is that when we assume the model size of f^{(k)} is multiple orders of magnitude smaller than that of f_s, the edge training is affordable. Moreover, as discussed in [129, 458], for large CNN training, the communication bandwidth for transferring H_i^{(k)} to the server is substantially less than communicating all model parameters as is done in traditional federated learning. Conversely, we also observe the difficulty of the reformulated optimization problem. First, each client is expected to adequately solve the inner optimization (Eq. (4.5)). Namely, each client should train its feature extractor f_e^{(k)} well to ensure that Eq. (4.3) can accurately generate H_i^{(k)} for any given input image. However, in the FL setting, the dataset on each edge device is small and thus may be inadequate for training a CNN-based feature extractor solely on the local dataset. In addition, the outer optimization Eq. (4.2) and the inner optimization Eq. (4.5) are correlated: Eq. (4.2) relies on the quality of H_i^{(k)}, which is optimized by Eq. (4.5). This correlation further makes the outer optimization Eq. (4.2) difficult to converge if the individual client-side feature extractors f_e^{(k)} are not trained adequately.

4.3.3 FedGKT: Group Knowledge Transfer

Scaling Edge Dataset Limitations with Knowledge Transfer. Given the above challenges, we incorporate a knowledge distillation loss into the optimization equations to circumvent the optimization difficulty. The intuition is that knowledge transferred from the server model can boost the optimization on the edge (Eq. (4.5)). As such, we propose to transfer group knowledge bidirectionally. The server CNN absorbs the knowledge from many edges, and an individual edge CNN obtains enhanced knowledge from the server CNN. To be more specific, in Eq. (4.2) and Eq. (4.5), we design \ell_s and \ell_c as follows.
\ell_s = \ell_{CE} + \sum_{k=1}^{K} \ell_{KD}\big(z_s, z_c^{(k)}\big) = \ell_{CE} + \sum_{k=1}^{K} D_{KL}(p_k \,\|\, p_s) \qquad (4.6)

\ell_c^{(k)} = \ell_{CE} + \ell_{KD}\big(z_s, z_c^{(k)}\big) = \ell_{CE} + D_{KL}(p_s \,\|\, p_k) \qquad (4.7)

\ell_{CE} is the cross-entropy loss between the predicted values and the ground truth labels. D_{KL} is the Kullback-Leibler (KL) divergence function that serves as a term in the loss functions \ell_s and \ell_c to transfer knowledge from one network to another. p_k^i = \frac{\exp(z_c^{(k,i)}/T)}{\sum_{i=1}^{C} \exp(z_c^{(k,i)}/T)} and p_s^i = \frac{\exp(z_s^i/T)}{\sum_{i=1}^{C} \exp(z_s^i/T)} are the probabilistic predictions of the edge model f^{(k)} and the server model f_s, respectively. They are calculated with the softmax of the logits z. The logits z_s and z_c^{(k)} are the outputs of the last fully connected layer in the server model and the client model, respectively. T is the temperature hyperparameter of the softmax function. Intuitively, the KL divergence loss attempts to bring the soft label and the ground truth close to each other. In doing so, the server model absorbs the knowledge gained from each of the edge models. Similarly, the edge models attempt to bring their predictions closer to the server model's prediction and thereby absorb the server model's knowledge to improve their feature extraction capability.

Improved Alternating Minimization. After plugging Eq. (4.6) and Eq. (4.7) into our reformulation (Eq. (4.2) and Eq. (4.5)), we propose a variant of Alternating Minimization (AM) [338, 25, 33, 13, 481, 366] to solve the reformulated optimization problem as follows:

\operatorname*{argmin}_{W_s} F_s(W_s, W_e^{(k)*}) = \operatorname*{argmin}_{W_s} \sum_{k=1}^{K} \sum_{i=1}^{N^{(k)}} \ell_{CE}\big(f_s(W_s; \underbrace{f_e^{(k)}(W_e^{(k)*}; X_i^{(k)})}_{H_i^{(k)}}), y_i^{(k)}\big) + \sum_{k=1}^{K} \ell_{KD}\big(z_c^{(k)*}, z_s\big) \qquad (4.8)

\text{where } z_c^{(k)*} = f_c^{(k)}(W_c^{(k)}; \underbrace{f_e^{(k)}(W_e^{(k)*}; X_i^{(k)})}_{H_i^{(k)}}), \text{ and } z_s = f_s(W_s; H_i^{(k)}) \qquad (4.9)

\operatorname*{argmin}_{W^{(k)}} F_c(W_s^*, W^{(k)}) = \operatorname*{argmin}_{W^{(k)}} \sum_{i=1}^{N^{(k)}} \ell_{CE}\big(f_c^{(k)}(W_c^{(k)}; \underbrace{f_e^{(k)}(W_e^{(k)}; X_i^{(k)})}_{H_i^{(k)}}), y_i^{(k)}\big) + \ell_{KD}\big(z_s^*, z_c^{(k)}\big) \qquad (4.10)

\text{where } z_c^{(k)} = f_c^{(k)}(W_c^{(k)}; \underbrace{f_e^{(k)}(W_e^{(k)}; X_i^{(k)})}_{H_i^{(k)}}), \text{ and } z_s^* = f_s(W_s^*; H_i^{(k)}) \qquad (4.11)

where the * superscript in the above equations indicates that the related random variables are fixed during optimization, and W^{(k)} is the combination of W_e^{(k)} and W_c^{(k)}. AM is a solver in convex and non-convex optimization theory and practice that optimizes two random variables alternately. In Eq. (4.8), we fix W^{(k)} and optimize (train) W_s for several epochs, and then we switch to Eq. (4.10) to fix W_s and optimize W^{(k)} for several epochs. This optimization alternates over many rounds between Eq. (4.8) and Eq. (4.10) until reaching a convergence state.

Key Insight. The essence of our reformulation is that the alternating minimization (Eq. (4.8) and Eq. (4.10)) uses knowledge distillation across all edges to simplify the optimization, which scales the dataset limitation on each edge in federated learning. In particular, we achieve this objective using a local cross-entropy loss computed based only on the ground truth and the model output, and a second loss that uses the KL divergence across edges and the server, which effectively captures the contribution (knowledge) from multiple client datasets. Moreover, each minimization subproblem can be solved with SGD and its variants (e.g., SGD with momentum [357], ADAM [216, 552]).

Algorithm 2 Group Knowledge Transfer. The subscripts s and k stand for the server and the kth edge, respectively.
E is the number of local epochs,T is the number of communication rounds; η is the learning rate; X (k) represents input images at edge k;H (k) is the extracted feature map from X (k) ; Z s and Z (k) c are the logit tensor from the client and the server, respectively. 1: ServerExecute(): 2: for each round t=1,2,...,T do 3: for each client k in parallel do 4: //theserverbroadcastsZ (k) c totheclient 5: H (k) ,Z (k) c ,Y (k) ← ClientTrain(k,Z (k) s ) 6: end for 7: Z s ← empty dictionary 8: for each local epoch i from 1 to E s do 9: for each client k do 10: for idx,b∈{H (k) ,Z (k) c ,Y (k) } do 11: W s ← W s − η s ∇ℓ s (W s ;b) 12: if i==E s then 13: Z (k) s [idx]← f s (W s ;h (k) ) 14: end if 15: end for 16: end for 17: end for 18: // illustrated as "transfer back" in Fig. 4.1(a) 19: for each client k in parallel do 20: send the server logits Z (k) s to client k 21: end for 22: end for 23: 24: ClientTrain(k,Z (k) s ): 25: //illustrated as "local training "in Fig. 4.1(a) 26: for each local epoch i from 1 to E c do 27: for batchb∈{X (k) ,Z (k) s ,Y (k) } do 28: // ℓ (k) c is computed using Eq. equation 4.7 29: W (k) ← W (k) − η k ∇ℓ (k) c (W (k) ;b) 30: end for 31: end for 32: // extract features and logits 33: H (k) ,Z (k) c ← empty dictionary 34: for idx, batchx (k) ,y (k) ∈{X (k) ,Y (k) } do 35: h (k) ← f (k) e (W (k) e ;x (k) ) 36: z (k) c ← f c (W (k) c ;h (k) ) 37: H (k) [idx]← h (k) 38: Z (k) c [idx]← z (k) c 39: end for 40: returnH (k) ,Z (k) c ,Y (k) to server 55 Training Algorithm. To elaborate, we illustrate the alternating training algorithm FedGKT in Fig. 4.1(a) and summarize it as Algorithm 2. During each round of training, the client uses local SGD to train several epochs and then sends the extracted feature maps and related logits to the server. When the server receives extracted features and logits from each client, it trains the much larger server-side CNN. The server then sends back its global logits to each client. This process iterates over multiple rounds, and during each round the knowledge of all clients is transferred to the server model and vice-versa. For the FedGKT training framework, the remaining step is to design specific neural architectures for the client model and the server model. To evaluate the effectiveness of FedGKT, we design CNN architectures based on ResNet [163], which are shown in Fig. 4.1(b). More details can also be found in Appendix C.5. 4.4 Experiments 4.4.1 Experimental Setup Implementation. We develop the FedGKT training framework based on FedML [153], an open source federated learning research library that simplifies the new algorithm development and deploys it in a distributed computing environment. Our server node has 4 NVIDIA RTX 2080Ti GPUs with sufficient GPU memory for large model training. We use several CPU-based nodes as clients training small CNNs. Task and Dataset. Our training task is image classification on CIFAR-10 [223], CIFAR- 100 [223], and CINIC-10 [81]. We also generate their non-I.I.D. variants by splitting training samples into K unbalanced partitions. Details of these three datasets are introduced in Appendix C.1.1. The test images are used for a global test after each round. For different methods, we record the top 1 test accuracy as the metric to compare model performance. 
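To make the loss design of Eq. (4.6) and Eq. (4.7) concrete, the following is a minimal PyTorch-style sketch of the distillation terms used on each side. It is an illustrative reading of the method rather than the released FedML implementation; the function names, the default temperature, and the T² scaling (common knowledge-distillation practice) are assumptions.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=3.0):
    # D_KL(p_teacher || p_student) on temperature-softened logits; this is the
    # l_KD term of Eq. (4.6)/(4.7). The T*T factor is common KD practice (assumed here).
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

def client_loss(client_logits, server_logits, labels):
    # Eq. (4.7): local cross-entropy plus D_KL(p_s || p_k), distilling the server into the edge.
    return F.cross_entropy(client_logits, labels) + kd_loss(client_logits, server_logits)

def server_loss_for_client_k(server_logits, client_logits, labels):
    # One client's contribution to Eq. (4.6): cross-entropy plus D_KL(p_k || p_s);
    # the server accumulates this term over the K clients' uploaded batches.
    return F.cross_entropy(server_logits, labels) + kd_loss(server_logits, client_logits)
```

In the flow of Algorithm 2, each client would apply `client_loss` locally with the server logits Z_s^(k) it received in the previous round, while the server would apply the per-client term above to every uploaded tuple (H^(k), Z_c^(k), Y^(k)).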
Note that we do not use LEAF [52] benchmark datasets because the benchmark models provided are tiny models (CNN with only two convolutional layers) or the datasets they contain are too 56 easy for modern CNNs (e.g., Federated EMNIST), which are unable to adequately evaluate our algorithm running on large CNN models. Compared to LEAF, FedML [153] benchmark supports CIFAR-10, CIFAR-100, and CINIC-10 (contains images from ImageNet). Baselines. We compare FedGKT with state-of-the-art FL method FedAvg [310], and a centralized training approach. Split Learning-based method [129, 458] is used to compare the communication cost. Note that we do not compare with FedProx [253] because it performs worse than FedAvg in the large CNN setting, as demonstrated in [465]. We also do not compare with FedMA [465] because it cannot work on modern DNNs that contain batch normalization layers (e.g., ResNet). Model Architectures. Two modern CNN architectures are evaluated: ResNet-56 and ResNet-110 [163]. The baseline FedAvg requires all edge nodes to train using these two CNNs. For FedGKT, the edge and server-sided models are designed based on these two CNNs. On the edges, we design a tiny CNN architecture called ResNet-8, which is a compact CNN containing 8 convolutional layers (described in Fig. 4.1(b) and Table C.2 in Appendix). The server-sided model architectures are ResNet-55 and ResNet-109 (Table C.3 and C.4 in Appendix), which have the same input dimension to match the output of the edge-sided feature extractor. For split learning, we use the extractor in ResNet-8 as the edge-sided partition of CNNs, while the server-side partitions of CNN are also ResNet-55 and ResNet-109. 4.4.2 Result of Model Accuracy F edA V G (IID) GKT (IID) Centr aliz ed (ResNet-56) Centr aliz ed (ResNet-8) (a) ResNet-56 on CIFAR-10 (b) ResNet-56 on CIFAR-100 (c) ResNet-56 on CINIC-10 Figure 4.3: The Test Accuracy of ResNet-56 (Edge Number = 16) 57 For standard experiments, we run on 16 clients and a GPU server for all datasets and models. Fig. 4.3 shows the curve of the test accuracy during training on ResNet-56 model with 3 datasets. It includes the result of centralized training, FedAvg, and FedGKT. We also summarize all numerical results of ResNet-56 and ResNet-110 in Table 4.1. In both I.I.D. and non-I.I.D. setting, FedGKT obtains comparable or even better accuracy than FedAvg. Hyperparameters. There are four important hyper-parameters in our FedGKT frame- work: the communication round, as stated in line #2 of Algorithm 2, the edge-side epoch number, the server-side epoch number, and the server-side learning rate. After a tuning effort, we find that the edge-side epoch number can simply be 1. The server epoch number depends on the data distribution. For I.I.D. data, the value is 20, and for non-I.I.D., the value depends on the level of data bias. For I.I.D., Adam optimizer [216] works better than SGD with momentum [357], while for non-I.I.D., SGD with momentum works better. During training, we reduce the learning rate once the accuracy has plateaued [259, 408]. We use the same data augmentation techniques for fair comparison (random crop, random horizontal flip, and normalization). More details of hyper-parameters are described in Appendix C.5.1. Table 4.1: The Test Accuracy of ResNet-56 and ResNet-110 on Three Datasets. Model Methods CIFAR-10 CIFAR-100 CINIC-10 I.I.D. non-I.I.D. I.I.D. non-I.I.D. I.I.D. non-I.I.D. 
ResNet-56 FedGKT (ResNet-8, ours) 92.97 86.59 69.57 63.76 81.51 77.80 FedAvg (ResNet-56) 92.88 86.60 68.09 63.78 81.62 77.85 Centralized (ResNet-56) 93.05 69.73 81.66 Centralized (ResNet-8) 78.94 37.67 67.72 ResNet-110 FedGKT (ResNet-8, ours) 93.47 87.18 69.87 64.31 81.98 78.39 FedAvg (ResNet-110) 93.49 87.20 68.58 64.35 82.10 78.43 Centralized (ResNet-110) 93.58 70.18 82.16 Centralized (ResNet-8) 78.94 37.67 67.72 *Note: 1. It is a normal phenomenon when the test accuracy in non-I.I.D. is lower than that of I.I.D.. This is confirmed by both this study and other CNN-based FL works [172, 367]; 2. In the non-I.I.D. setting, since the model performance is sensitive to the data distribution, we fix the distribution of non-I.I.D. dataset for a fair comparison. Appendix C.2 describes the specific non-I.I.D. distribution used in the experiment; 3. Table C.5,C.6,C.7 in Appendix summarize the corresponding hyperparameters used in the experiments. 58 4.4.3 Efficiency Evaluation ResNet-8 ResNet-56 ResNet-110 0.6 5.4 10.2 petaFLOPs 11 591 1,150 #Params (K) 30 488 950 CPU (ms) Figure 4.4: Edge Computational Efficiency (CIFAR-100) SL FedGKT 249.6 125.3 CIFAR-10 249.6 125.3 CIFAR-100 1,347.8 676.5 CINIC-10 Figure 4.5: Communication Efficiency (ResNet-56) To compare the computational demand on training, we count the number of FLOPs (floating-point operations) performed on edge using prior methods [336, 166]. We report the result on CIFAR-100 in Fig. 4.4. Compared to the FedAvg baseline, the computational cost on the edge of our FedGKT (ResNet-8) is 9 times less than that of ResNet-56 and 17 times less than that of ResNet-110 (The memory cost comparison can be roughly compared by the model parameter number: ResNet-8 has 11K parameters, which is 54 times less than that of ResNet-56 and 105 times less than that of ResNet-110. We also test the CPU running time per mini-batch (batch size is 64) forward-backward propagation on Intel i7 CPU (which has a more aggressive performance than current edge devices). The results show that ResNet-8 requires only 3% of ResNet-110’s training time (30 ms v.s. 950 ms). To compare communication costs, we use SL [129, 458] as the baseline, which also exchanges hidden feature maps rather than the entire model. The communication cost is calculated using Eq. equation C.4.2 and equation C.4.3 in Appendix C.4 without using data compression techniques. The results are shown in Fig. 4.5 (X-axis units: GBytes). FedGKT uses fewer feature map exchanges with the server than SL. 4.4.4 Ablation Study: Understanding FedGKT under Different Settings Table 4.2: Ablation Study on Loss Functions CIFAR-10 CIFAR-100 CINIC-10 None -/diverge -/diverge -/diverge S–>E 92.97 68.44 81.51 S<–>E 90.53 69.57 80.01 Table 4.3: Asynchronous Training CIFAR-10 CIFAR-100 CINIC-10 Sync 92.97 69.57 81.51 Async 92.92 69.65 81.43 59 The Effectiveness of Knowledge Transfer . Table 4.2 shows the results on the efficacy of using distillation loss ℓ KD in Eq. equation 4.7 and Eq. equation 4.6. We created a scenario in which both the client and server only use ℓ CE without using ℓ KD (labeled None). In this setting, the accuracy is low (e.g., 40%) or the training diverges (uniformly notated as “-/diverge”). In another scenario, only the clients use ℓ KD to update their local models, but the server does not (noted as single directional transfer S->E). 
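The loss-function ablation in Table 4.2 toggles the two distillation terms of Eq. (4.6) and Eq. (4.7). Schematically, the three configurations can be read as in the sketch below, where the mode names mirror the table and `ce`/`kd_*` stand for already-computed cross-entropy and KL terms (illustrative only).

```python
def client_objective(ce, kd_from_server, mode):
    # "None": neither side distills; "S->E": only the edge distills from the
    # server's logits; "S<->E": bidirectional transfer as in Eq. (4.6)-(4.7).
    return ce + (kd_from_server if mode in ("S->E", "S<->E") else 0.0)

def server_objective(ce, kd_from_clients, mode):
    # The server-side KD term is active only in the bidirectional setting.
    return ce + (kd_from_clients if mode == "S<->E" else 0.0)
```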
We observe that the transfer from the server to the edge is always helpful, while the bidirectional transfer (S<–>E) is more effective as the dataset becomes increasingly difficult (CIFAR-100). Asynchronous Training. Since the server does not need to wait for updates from all clients to start training, FedGKT naturally supports asynchronous training. We present the experimental results in Table 4.3. The result shows that asynchronous training does not negatively affect model accuracy. This demonstrates the advantage of our method over SL, in which every edge requires multiple synchronizations for each mini-batch iteration. Table 4.4: FedGKT with Different # of Edge 8 16 64 128 FedGKT 69.51 69.57 69.65 69.59 Table 4.5: Small CNNs on CIFAR-10 ResNet-4 ResNet-6 ResNet-8 Test Accuracy 88.86 90.32 92.97 FedGKT with Different Edge Number. To understand the scalability of FedGKT, we evaluate its performance with varying edge nodes. The test accuracy results are shown in Table 4.4. In general, adding more edge nodes does not negatively affect accuracy. Smaller Architectures. We test the performance of FedGKT using even smaller edge models: ResNet-4 and ResNet-6 on CIFAR-10. ResNet-4 and ResNet-6 use one and two BasicBlock components (including two convolutional layers), respectively. The result is shown in Table 4.5. While reducing the edge model size to ResNet-8 did not reduce accuracy, when the model size is reduced even more substantially, it does reduce the overall accuracy. 60 4.5 Discussion Federated learning (FL) is an art of trade-offs among many aspects, including model accuracy, data privacy, computational efficiency, communication cost, and scalability. We recognize the challenges of developing a universal method that can address all problems; thus, we discuss some limitations of our method. 1. Privacy and robustness: [464] shows we can backdoor federated learning. Although our work does not address the privacy concern, we believe existing methods such as differential privacy (DP) and multi-party computation (MPC) can defend the data privacy from the hidden vector reconstruction attack. Intuitively, exchanging hidden feature maps is safer than exchanging the model or gradient. Note that the hidden map exchange happens at the training phase. This consequently makes the attack more difficult because the attacker’s access is limited to the evolving and untrained feature map rather than the fully trained feature map that represents the raw data. Given that the model and gradient exchange may also leak privacy, the lack of analysis and comparison of the degree of privacy leakages between these three settings (gradient, model, and hidden map) is the first limitation of our work. 2. Communication cost: compared to the entire model weight or gradient, the hidden vector is definitely much smaller (e.g., the hidden vector size of ResNet-110 is around 64KB while the entire gradient/model size is 4.6MB for 32x32 images). Even in the high resolution vision tasks settings, this observation also holds (e.g., when image size is 224x224, the hidden feature map size is only 1Mb, compared to the size of ResNet 100Mb). Since the hidden vector for each data point can be transmitted independently, FedGKT has a smaller bandwidth requirement than gradient or model exchange. 
However, our proposed method has a potential drawback in that the total communication cost depends on the number of data points, although our experimental results demonstrate that our method has smaller communication costs than split learning because of fewer communication rounds for convergence. In settings 61 where the sample number is extremely large and the image resolution is extremely high, both our method and split learning would have a high communication cost in total. 3. Label deficiency: The proposed FedGKT can only work on supervised learning. However, label deficiency is a practical problem that cannot be ignored. Many application cases do not have sufficient labels, since it is difficult to design mechanisms to incentivize users to label their private local data. 4. Scalability (a large number of clients): in the cross-device setting, we need to collaboratively train models with numerous smartphones (e.g., if the client number is as high as 1 million). One way to mitigate the scalability is by selecting clients in each round with a uniform sampling strategy [310]. We run experiments under this setting but found that this sampling method requires many more rounds of training to converge. Even though the communication cost is acceptable, this sampling method is still imperfect in practice ([39] describes many constraints that a production system might face). We argue that uniform sampling may not be the best practice and that scalability is a common limitation for most existing works. In summary, we concede that our proposed method does not have an advantage in addressing the scalability challenge. 5. Model personalization: the final trained model under our FedGKT framework is a combination of the global server model and the client model, which is a potential method to help clients learn personalized models. For example, we can fine-tune the client model for several epochs to see if the combination of such a personalized client model and the server model is more effective. We do not explicitly demonstrate this in our experiments, but we hope to explore this possibility in future works. 4.6 Conclusion Inthiswork, totackletheresource-constrainedreality, wereformulateFLasagroupknowledge transfer (FedGKT) training algorithm. FedGKT can efficiently train small CNNs on edges 62 and periodically transfer their knowledge by knowledge distillation to a server-side CNN with a large capacity. FedGKT achieves several advantages in a single framework: reduced demand for edge computation, lower communication cost for large CNNs, and asynchronous training, all while maintaining model accuracy comparable to FL. To simplify the edge training, we also develop a distributed training system based on our FedGKT. We evaluate FedGKT by training modern CNN architectures (ResNet-56 and ResNet-110) on three distinct datasets (CIFAR-10, CIFAR-100, and CINIC-10) and their non-I.I.D. variants. Our results show that FedGKT can obtain comparable or even slightly higher accuracy. More importantly, FedGKT makes edge training affordable. Compared to the edge training using FedAvg, FedGKT costs 9 to 17 times less computational power (FLOPs) and requires 54 to 105 times fewer parameters. 63 Chapter 5 FedNAS: Towards Automation on Invisible Data via Neural Architecture Search 5.1 Introduction Federated Learning (FL) is a promising approach for decentralized machine learning, which aims to avoid data sharing and lower the communication cost [310]. 
As such, it has gained a lot of attention in various domains of machine learning such as computer vision, natural language processing, and data mining. Despite its widespread popularity, one of the key challenges of FL is data heterogeneity. Since users' data are not identically or independently distributed (non-I.I.D.) in nature, a globally learned model may not perform optimally on all user devices. Interwoven with data heterogeneity, data invisibility is another issue that has rarely been studied. Because the data cannot be inspected, to find a better model architecture with higher accuracy, developers must design or choose multiple architectures and then tune hyperparameters remotely to fit the scattered data. This process is extremely expensive because attempting many rounds of training on edge devices results in a remarkably higher communication cost and on-device computational burden than in the data center environment.

To mitigate the challenge of data heterogeneity, researchers have proposed methods to train a global model, including FedProx [253], FedNova [471], and FedOPT [367]. Additionally, personalized frameworks such as Ditto [250], pFedMe [89], and PerFedAvg [104] have recently been developed to optimize personalized models that adapt to each individual user's data. These prior works have made remarkable progress in designing optimization schemes for pre-defined model architectures, operating purely at the optimization level. However, these algorithms all require substantial effort to tune hyperparameters; this is attributed to their strong prior assumptions, which may not always match the unknown data distribution. For example, practitioners must tune the regularization parameter in Ditto [250] and pFedMe [89] to find a proper correlation between the aggregated global model and the local model. Moreover, their design stays at the optimization level and does not consider the efficacy of model selection and neural architecture design, leading to a suboptimal solution when a pre-defined model is used.

Figure 5.1: Illustration of Federated Neural Architecture Search (step 1: search locally; step 2: send the gradients of α and w to the server; step 3: merge gradients to get global α and w; step 4: synchronize the updated α and w to each client.)

We aim to address data heterogeneity in FL via a different and complementary approach that is based on model personalization through neural architecture search (NAS). NAS has recently gained much momentum for adapting to heterogeneity in neural architecture design [415, 61], latency [482, 436, 49], memory footprint [48, 301], and energy consumption [173, 509] for edge devices.
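As a brief illustration of the coupling that the regularization-based personalization methods above introduce, a Ditto-style local objective adds a λ-weighted proximal term that pulls each client's personalized weights toward the aggregated global model. The sketch below is schematic; the exact formulations and update rules of [250] and [89] differ in their details, and `global_params` is assumed to be a list of tensors aligned with the local model's parameters.

```python
import torch

def personalized_local_step(local_model, global_params, batch, loss_fn, lam, lr=0.01):
    # Local task loss + (lambda/2) * ||v_k - w_global||^2; lambda is the knob
    # that, as noted above, must be tuned for each (unknown) data distribution.
    x, y = batch
    task_loss = loss_fn(local_model(x), y)
    prox = sum(((p - g.detach()) ** 2).sum()
               for p, g in zip(local_model.parameters(), global_params))
    loss = task_loss + 0.5 * lam * prox
    local_model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in local_model.parameters():
            p -= lr * p.grad
    return float(loss)
```

FedNAS sidesteps this manual coupling by additionally personalizing the architecture itself, as the remainder of this chapter describes.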
NAS methods are often categorized into three types: gradient-based methods [278], evolutionary methods [291], and reinforcement learning (RL)-based methods [189]. Among these, gradient-based methods are the most efficient as they can finish searching in only a few hours, compared to thousands of GPU days with other methods. 65 In this work, to search for a personalized neural architecture for mitigating the data heterogeneity, we adopt an improved variant of the gradient-based method, MiLeNAS [156], which is computationally tractable and particularly suitable for resource-constrained edge devices. Particularly, we propose a new method named Federated NAS (FedNAS) to search model architectures among edge devices collaboratively. As shown in Figure 5.1, FedNAS works in the following way. We first utilize the MiLeNAS [156] as a local searcher on each client’s local data, which can be distributed easily and efficiently in search time (Step 1). Formally, it formulates NAS as a mixed-level problem: w = w− η w ∇ w L tr (w,α ),α = α − η α (∇ α L tr (w,α )+λ ∇ α L val (w,α )), wherew representsthenetworkweightandα represents the neural architecture. L tr (w,α ) and L val (w,α ) denote the loss with respect to training data and validation data, respectively. After the local search, each client then transmits weights w and architecture α to the server (Step 2). The server then applies a weighted aggregation to obtain the server-side α and w (Step 3) and sends the updated parameters back to each client for the next round of searching (Step 4). During the searching process, we can personalize the α and w parameters by alternative local adaptation. Such personalization method can either obtain a higher accuracy for various data distributions, or automate the training process with lightweight hyper-parameter searching efforts. We evaluate FedNAS comprehensively in curated non-I.I.D. datasets, including CIFAR-10 and GLD-23K. Our datasets cover both global model training and personalized model training. We also consider different training scenarios: cross-silo FL and cross-device FL, which has a different total number of clients and number of clients per round. We demonstrate that the personalized model architectures learned by the individual clients perform better than the fine-tuned FedAvg and other representative personalized FL methods such Ditto [250] and perFedAvg [104] with default hyper-parameters in most settings. In summary, our main contributions in this work are three-fold. 66 1. We propose the FedNAS method to search for both global model and personalized model architectures collaboratively among edge devices and show its satisfying performance in a variety of FL settings. 2. We investigate the role of NAS to address the challenge of data-heterogeneity in FL and demonstrate via experimental results that it can adapt to users’ data better than existing local adaptation and personalization schemes. 3. We experimentally show that FedNAS can achieve state-of-the-art performance for both cross-silo and cross-device settings. 5.2 Proposed Method 5.2.1 Problem Definition In the federated learning setting, there are K nodes in the network. Each node has a dataset D k := x k i ,y i N k i=1 which is non-IID. 
When collaboratively training a deep neural network (DNN) model with K nodes, the objective function is defined as: min w f(w, α |{z} fixed ) def = min w K X k=1 N k N · 1 N k X i∈D k ℓ(x i ,y i ;w, α |{z} fixed ) (5.1) where w represents the network weight, α determines the neural architecture, and ℓ is the loss function of the DNN model. To minimize the objective function above, previous works choose a fixed model architecture α then design variant optimization techniques to train the model w. We propose to optimize the federated learning problem from a completely different angle, optimizing w and α simultaneously. Formally, we can reformulate the objective function as: min w,α f(w,α ) def = min w,α K X k=1 N k N · 1 N k X i∈D k ℓ(x i ,y i ;w,α ) (5.2) 67 In other words, for the non-IID dataset scattered across many workers, our goal is to search for an optimal architecture α and related model parameters w to fit the dataset more effectively and thus achieve better model performance. In this work, we consider searching for CNN architecture to improve the performance of the image classification task. 5.2.2 Search Space 3x3 conv Normal Cell x N Reduction Cell softmax S0 S1 S2 S3 S4 S5 S6 a normal cell Normal Cell x N Reduction Cell Normal Cell x N S0 S1 S2 S3 S4 S5 S6 a reduction cell Figure 5.2: Search Space Normally, NAS includes three consecutive components: the search space definition, the search algorithm, and the performance estimation strategy [183]. Our search space follows the mixed-operation search space defined in DARTS [278] and MiLeNAS [156], where we search in two shared convolutional cells, then build it up as an entire model architecture (as shown in Figure 5.2). Inside the cell, to relax the categorical candidate operations between two nodes (e.g., convolution, max pooling, skip connection, zero) to a continuous search space, mixed operation using softmax over all possible operations is proposed: ¯o (i,j) (x)= d X k=1 exp(α (i,j) k ) P d k ′ =1 exp(α (i,j) k ′ ) | {z } p k o k (x) (5.3) where the weight p k of the mixed operation ¯o (i,j) (x) for a pair of nodes (i,j) is parameterized by a vector α i,j . Thus, all architecture operation options inside a network (model) can be parameterized as α . More details are introduced in Appendix D.1. 68 5.2.3 Local Search Following the aforementioned search space, each worker searches locally by utilizing the mixed-level optimization technique MiLeNAS [156]: w =w− η w ∇ w L tr (w,α ) α =α − η α (∇ α L tr (w,α )+λ ∇ α L val (w,α )) (5.4) whereL tr (w,α ) andL val (w,α ) denote the loss with respect to the local training data and validation data with w and α , respectively. 5.2.4 FedNAS: Federated Neural Architecture Search Algorithm 3 FedNAS Algorithm. 1: Initialization: initialize w 0 and α 0 ; K clients are selected and indexed by k; E is the number of local epochs; T is the number of rounds. 
2: Server executes: 3: for each round t=0,1,2,...,T − 1 do 4: for each client k in parallel do 5: w k t+1 ,α k t+1 ← ClientLocalSearch(k,w t ,α t ) 6: end for 7: w t+1 ← P K k=1 N k N w k t+1 8: α t+1 ← P K k=1 N k N α k t+1 9: end for 10: 11: ClientLocalSearch(k, w, α ): // Run on client k 12: for e in epoch do 13: for minibatch in training and validation data do 14: Update w =w− η w ∇ w L tr (w,α ) 15: Update 16: α =α − η α (∇ α L tr (w,α )+λ ∇ α L val (w,α )) 17: end for 18: end for 19: return w, α to server We propose FedNAS, a distributed neural architecture search algorithm that aims at optimizing the objective function in Equation 5.2 under the FL setting. We introduce FedNAS corresponding to four steps in Figure 5.1: 1) The local searching process: each 69 worker optimizes α and w simultaneously using Eq. 5.4 for several epochs; 2) All clients send their α and w to the server; 3) The central server aggregates these gradients as follows: w t+1 ← K X k=1 N k N w k t+1 α t+1 ← K X k=1 N k N α k t+1 (5.5) 4) The server sends back the updated α and w to clients, and each client updates its local α and w accordingly before running the next round of searching. This process is summarized in Algorithm 1. After searching, an additional evaluation stage is conducted by using a traditional federated optimization method such as FedAvg [310]. 5.2.5 Personalized FedNAS: Alternative Local Adaptation Local Adaptation. To personalize local models, we fine-tune the received global model locally. Such local fine-tuning follows Equation 5.4, meaning that each client alternatively optimizes its local architecture α and model weights w. We find from experiments that such fine-tuningcanmakealocalmodelmorerobustagainstlocaldataheterogeneitycomparedwith fine-tuning and local adaptation based on predefined model and state-of-the-art personalized optimization methods (e.g., Ditto [250] and perFedAvg [104]). Robust to Varying Data Heterogeneity and Training Scenarios. In addition to the benefit of data heterogeneity with personalization, an essential feature of FedNAS is that it does not require many rounds of hyperparameter searching to adapt to diverse data distributions. Most of the time, the default hyper-parameter already perform very well. This property of FedNAS is attributed to three aspects described below. • Intuitively, the personalized architecture and weights have an additive effect in adapting data heterogeneity, compared with solely personalizing the model weight, especially when the architecture search space is huge. 70 • Most of the personalized methods are built based on an optimization framework with strong prior assumptions, which may not always match the unknown data distribution. For example, Ditto [250] and pFedMe [89] utilize a bi-level optimization and correlate the relationship of aggregated global model and local model by a regularization-based method. Practitioners must tune the λ value to make it work manually. Although perFedAvg [104] brings the idea of meta-learning to adapt to data heterogeneity, it is difficult for practitioners to decide the boundary of its meta-train phase and meta-test phase when the data distribution is unknown. • Different training scenarios also bring additional randomness and uncertainty. For example, in the cross-device setting, the total client number and the client number per round differs from the cross-silo setting significantly, which further increases the difficulty of model selection and hyper-parameter tuning. 
FedNAS may be more resilient against this uncertainty in practice. To verify the advantage of FedNAS, we run experiments to search for both personalized and global models on cross-silo and cross-device settings (see Section 5.3.1 and 5.3.2). 5.2.6 AutoFL System Design Send Thread Abstract Communication Layer MPI (Message Passing Interface) Receive Thread ComManager Deep Neural Networks (MiLeNAS, DenseNet, etc) On-Client Learning Framework PyTorch ServerManager (Aggregation, Synchronization, Global Statistics ) ClientManager (Local Search, Synchronization) Communication Protocol Component On-Device Deep Learning Component Trainer Aggregator Figure 5.3: Abstract System Architecture of AutoFL 71 We design an AutoFL system using FedNAS based on FedML [153], an open-source research library for federated learning. The system architecture is shown in Figure 5.3. This design separates the communication and the model training into two core components shared by the server and clients. The first is the communication protocol component responsible for low-level communication among the server and clients. The second is the on-device deep learning component, which is built based on the popular deep learning framework PyTorch. These two components are encapsulated as ComManager, Trainer, and Aggregator, providing high-level APIs for the above layers. With the help of these APIs, in ClientManager, the client can train or search for better architectures and then send its results to the server-side. In contrast, in ServerManager, the server can aggregate and synchronize the model architecture and the model parameters with the client-side. More details of the system design can be found in the Appendix. 5.3 Experiments and Results In this section, we introduce the experimental results of FedNAS to train a global model as well as personalized models. All our experiments are based on a non-IID data distribution among users. In our experiments, we explore two types of non-IID data distributions, label skewed and latent Dirichlet allocation (LDA), which are well explored in literature in FL settings [529], [155], [11]. Implementation and Deployment. We set up our experiment in a distributed computing network equipped with GPUs. We perform experiments for two settings: FedNAS for a global model search and FedNAS for personalized models search. For investigating the former setting, we set up our experiment in a cross-silo setting for simplicity and use 17 nodes in total, one representing the server-side and the other 16 nodes representing clients, which can be organizations in the real world (e.g., hospitals and clinics). For personalized model search, we use a larger set of nodes, 21 in total, one representing the server-side and 72 the other 20 nodes representing clients. We pick four clients at random for each round of FedNAS. For all experiments, each node is a physical server with an NVIDIA RTX 2080Ti GPU card inside. We deploy the FedNAS system described in Appendix 5.2.6 on each node. Our code implementation is based on PyTorch 1.4.0, MPI4Py 1 3.0.3 , and Python 3.7.4. Task and Dataset. Our training task is image classification on the CIFAR10 dataset, which consists of 60000 32x32 color images in 10 classes, with 6000 images per class. 
For global model searching via FedNAS, we generate non-IID (non identical and independent distribution) local data by splitting the 50000 training images into K clients in an unbalanced manner: sampling p c ∼ Dir J (0.5) and allocating a p c,k proportion of the training samples of class c to local client k. The 10000 test images are used for a global test after the aggregation of each round. The actual data distribution used for this experiment is given in Table D.1. (a) Image Allocation per Client (b) Label Allocation per Client Figure 5.4: CIFAR10: Label Skew Partition For personalized model experiments, we generate non-IID data by label skewed partition. In this partition scheme, we assign images of only five classes to each client and keep the number of images per client the same, namely 3000, as shown in Figure 5.4. For each client, we further split these 3000 images into the training and testing datasets by using 75% of the data (i.e., 2250 images) for training and the other 25% as testing data. We perform this split to test personalization as it requires each client to have their own local test dataset. We also explore latent Dirichlet distribution (LDA) based non-IID data distribution for the 1 https://pypi.org/project/mpi4py/ 73 personalized model setup. Details of this distribution for this experiment can be found in the appendix D.2. Since the model performance is sensitive to the data distribution, we fix the non-IID dataset in all experiments for a fair comparison. 5.3.1 Personalized Models Search via FedNAS To demonstrate the efficacy of FedNAS in designing better local models, we compare FedNAS with local adaption (via FedAvg), Ditto and perFedAvg. Aside from FedNAS, every other method runs on a manually designed architecture, ResNet18 [444], which has more model parameters, 11M, than the 8-layer DARTs cell structure of FedNAS, which has only 4M model parameters [156]. To evaluate the performance, we use the average validation accuracy of all clients as a performance metric. 5.3.1.1 Results on Non-I.I.D. (Label Skew Partition and LDA distribution) Table 5.1 illustrates the performance comparison of FedNAS with local adaptation, Ditto and perFedAvg. For a fair comparison, we fine-tune hyper-parameters of each method, such as fine-tuning the learning rate (lr) hyperparameter over the set {0.1, 0.3, 0.01, 0.03, 0.001, 0.003} of each method. Batch size is fixed at 32 for all comparisons. For Ditto, in addition to lr, we tune λ over the set {2, 1, 0.1, 0.01, 0.001}. For perFedAvg, we tune the global lr over {0.1, 0.3, 0.01, 0.03, 0.001, 0.003} by keeping the local lr {1, 3, 5, 7, 10} times higher than the global lr. Table 5.1 draws the comparison between different methods for the average validation accuracy of all the clients metric for lda and label skew distribution. Interestingly, FedNAS outperforms all other methods for both label skew and lda distribution, highlighting its power to effectively adapt to user’s data and perform better locally as well. Overall, for label skew distribution, it achieves an average validation accuracy of 91.3%, which is 5% higher than the local adaptation’s validation accuracy and 2% higher than Ditto. We also observe that 74 Ditto outperforms the local adaptation in terms of validation accuracy but has a higher standard deviation than local adaptation. 
Table 5.1: Average local validation Accuracy Comparison of FedNAS with other personaliza- tion techniques) Method Parameter size Accuracy (Label Skew) Accuracy (lda Distribution) FedNAS 4M 0.913±0.025 0.907±0.024 Local Adaptation 11M 0.864±0.028 0.861±0.0357 Ditto 11M 0.894±0.035 0.88 ±0.032 perFedAvg 11M 0.888±0.036 0.894±0.032 (a) Local Validation Accuracy (b) Average Validation Accuracy Distribution (c) Average Validation Accuracy Improvement Figure 5.5: Visualization of validation accuracy of each client and accuracy improvement distribution for personalized model search. For a detailed comparison, we compare the validation accuracies of all clients for the best round of label skew distribution. The best round for each method is selected as the round that provides the highest average validation accuracy. We visualize the validation accuracy of each client 5.5(a), average validation accuracy distribution 5.5(b) and average validation improvement distribution 5.5(c). For accuracy improvement distribution, we subtract the validation accuracy of FedNAS from the respective method for each client and plot the histogram. The improvement histogram shows that for one of the clients the improvement can be as high as 15% compared to perFedAvg. Compared to Ditto, we see a 2.5% improvement for even 6 clients. Conversely, there are only two clients, numbers 10 and 20, for which local adaptation performs slightly better (2.5%). Although there are some clients for which FedNAS does not perform well compared to these methods, it is important to note that the standard deviation of FedNAS more prominent in figure (b) is lowest, and accuracy 75 histograms are concentrated towards the right side, whereas for other methods, these bars fall to as low as an 82% accuracy. 5.3.2 Global Model Search via FedNAS To investigate the performance of FedNAS to design a global model, we search a global model via FedNAS and compare it to the well-known FL algorithm FedAvg, which runs on DenseNet [178], a manually designed architecture that extends ResNet [163] but has a higher performance and fewer model parameters. We run both of these experiments on the same non-IID dataset. 5.3.2.1 Results on Non-I.I.D. (LDA Partition) 0 0 0.2 0.4 0.6 0.8 0.7778 Round (a) FedAvg 0 10 20 30 40 50 Round 0 0.2 0.4 0.6 0.8 0.8124 (b) FedNas Figure 5.6: Test Accuracy on Non-IID Dataset (multiple runs) FedAvg on DenseNet vs. FedNAS Figure 5.6 demonstrates the performance of FedNAS vs. FedAvg. We use a specific non-IID data distribution given in appendix D.2, and keep it fixed for both experiments. For a fair comparison, results are obtained by fine-tuning hyperparameters of each method, and each method is run three times. Details of hyperparameter tuning can be found in the appendix D.4. Figure5.6(a)showstheglobaltestaccuracyduringthetrainingprocessofFedAvg, whereas, Figure 5.6(b) reports the global test accuracy during the searching process of FedNAS. Global test accuracy is calculated using the 10000 test images of the CIFAR10 dataset. First, we 76 demonstrate the compatibility of NAS for the data-heterogeneous FL setting. In addition to the convergence of FedNAS, we show that FedNAS can achieve a higher accuracy than FedAvg during the searching process (81.24% in Figure 5.6(b); 77.78% in Figure 5.6(a)). This 4% performance benefit further confirms the efficacy of FedNAS. We also evaluate the searched architecture under this data distribution. We find that each run of FedNAS can obtain a higher test accuracy than each run of FedAvg. 
On average, the architecture searched by FedNAS obtains a test accuracy 4% higher than FedAvg. Hyperparameters and visualization of the searched architecture can be found in Appendix D.4 and D.5, respectively. Remark. We also run experiments on other distributions of non-IID datasets, in which FedNAS is also demonstrated to beat FedAvg, confirming that FedNAS searches for better architectures with a higher model performance. 5.3.3 Evaluation of the System Efficiency Table 5.2: Efficiency Comparison (16 RTX2080Ti GPUs as clients, and 1 RTX2080Ti as server) Method Search Time Parameter Size Hyperparameter FedAvg (single) > 3 days - rounds = 100, local epochs=20, batch size=64 FedAvg (distributed) 12 hours 20.01M FedNAS (single) 33 hours - rounds = 50, local epochs=5, batch size=64 FedNAS (distributed) < 5 hours 1.93M In order to more comprehensively reflect our distributed search overhead, we developed the single-process and distributed version of FedNAS and FedAvg. The single-process version simulates the algorithm by performing a client-by-client search on a single GPU card. As shown in Table 5.2, compared with FedAvg and manually designed DenseNet, FedNAS can find a better architecture with fewer parameters and in less time. FedAvg spends more time because it requires more local epochs to converge. 77 5.4 Related Works Recently, NeuralArchitectureSearch(NAS)[183]hasattractedwidespreadattentionduetoits advantagesovermanuallydesignedmodels. TherearethreemajorNASmethods: evolutionary algorithms, reinforcement learning-based methods, and gradient-based methods [156]. While in the Federated Learning (FL) domain [310, 148], using pre-designed model architectures and optimizing by FedAvg [310] is the main method to improve model performance. To our knowledge, NAS is rarely studied in the FL setting to study the aspect of personal model search for real-time setting. Although [207] first proposed the concept of automating FL via NAS, the concrete method and details are never given. There are a few works done in the direction of the using a NAS to search for a global model; however, no personalization exploration is provided. For global model search, [547] is a NAS based FL work that exploits evolutionary NAS to design a master model; however, they utilize double-client sampling to make their method edge resource friendly. Contrary to this, we exploit the gradient-based NAS method, MileNAS, which is comparatively faster and more resource friendly than the evolutionary and reinforcement-based methods. The other work in this direction is [412], which explores the concept of differential privacy by using DARTs as an NAS solver to search for a global model. However, our proposed work uses the MileNAS solver which has an extensive analysis of its performance efficiency over DARTS provided in the original MileNAS work [156]. Another work [113] uses DSNAS, which is another gradient based NAS algorithm to search for a global model. DSNAS works on sampling a child network from a supernetwork even in the search phase, whereas the MileNAS solver searches over the complete supernetwork and consequently has the potential to provide more freedom to clients to search for a better and personalized architecture. Another work [494] proposes a very different idea than conventional neural architecture search where they begin with a pre-trained manually designed model and continue pruning the model until it satisfies the efficiency budget. 
Although they named their work as federated neural architecture search but model search is performed on the server side alone, 78 none of the clients participate in searching a better model (finding architecture parameters). Clients only participate in training the pruned model’s parameters communicated to clients by the server. Furthermore, to the best of our knowledge, this is the first work to investigate the performance of locally searched architectures in federated NAS. 5.5 Conclusion In this work, we propose FedNAS, a unified neural architecture search based federated learning framework to design a global model collaboratively. First, we study the compliance of gradient based neural architecture search algorithm, MileNAS, with the FedAvg algorithm. We analyze its performance for both cross silo and cross-device settings and show its convergence for both setups. We also investigate the proposed framework, FedNAS, from the perspective of personalization and its role to overcome the challenge of data-heterogeneity in FL. To test data-heterogeneity, we explore FedNAS for both label-skewed and lda-based non-IID data distributions and show via experimental results its superiority over other personalization methods such as local fine-tuning, Ditto and perFedAvg. 79 Chapter 6 SpreadGNN: Effective Training on Decentralized Topology 6.1 Introduction Graph Neural Networks (GNN) [132] are expressive models that can distill structural knowl- edge into highly representative embeddings. While graphs are the representation of choice in domains such as social networks [28], knowledge graphs for recommendation systems [58], in this work we focus on molecular graphs that are the core of drug discovery, molecular property prediction [123, 211] and virtual screening [542]. Molecular graphs differ from their more well-known counterparts such as social network graphs. First, each molecule is a graph representation of the basic atoms and bonds that constitute the molecule and hence the size of the graph is small. Second, even though each graph may be small numerous molecules are being developed continuously for varied use cases. Hence, what they lack in size they make up for it in structural heterogeneity. Third, molecules can be labeled along multiple orthogonal dimensions. Since each graph has multiple labels the learning itself can be characterized as multi-task learning. For instance, whether a molecule has potentially harmful interactions with a diabetics drug, or whether that molecule can turn toxic under certain conditions are distinct labels. Molecular property analysis and labeling require wet-lab experiments, which are time-consuming and resource-costly. As a consequence, many entities may only have partially labeled molecules even if they know the graph structure. Finally, molecules are coveted inventions and hence entities often possess a proprietary graph representation that 80 Centralized Molecular Property Prediction Serverless Molecular Property Prediction with SpreadGNN Institution A Institution B Institution C Institution D Institution A Institution B Institution C Institution D Figure 6.1: Serverless Multi-task Federated Learning for Graph Neural Networks. cannot be shared with other institutions for competitive and regulatory reasons. However, training collectively over a private set of molecular graphs can have immense societal benefits such as accelerated drug discovery. 
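To make the molecular-graph setting concrete, here is a toy example of how one such graph can be stored as atom (node) features, a bond (edge) list, and a partially observed multi-task label vector. The featurization is purely illustrative; real pipelines typically derive much richer atom and bond descriptors.

```python
import torch

# Formaldehyde (CH2O): 4 atoms, 3 bonds (C=O, C-H, C-H).
node_features = torch.tensor([[6.], [8.], [1.], [1.]])   # atomic numbers: C, O, H, H

# Undirected bonds stored in both directions; edge feature = bond order.
edge_index = torch.tensor([[0, 1, 0, 2, 0, 3],
                           [1, 0, 2, 0, 3, 0]])          # shape (2, num_edges)
edge_features = torch.tensor([[2.], [2.], [1.], [1.], [1.], [1.]])

# Multi-task labels for this one graph; NaN marks properties this owner has
# never measured -- the partially labeled situation described above.
labels = torch.tensor([1.0, float("nan"), 0.0])
```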
Federated Learning (FL) is a distributed learning paradigm that addresses this data isola- tion problem via collaborative training. In this paradigm, training is an act of collaboration between multiple clients (such as research institutions) without requiring centralized local data while providing a certain degree of user-level privacy [308, 207, 98, 99]. However, there are still challenges and shortcomings to training GNN in a federated setting. This setting (Figure 6.1) is the typical case in molecular graphs since each owner may have different molecules and even when they have the same molecular graph each owner may have an incomplete set of labels for each molecule. The left half of Figure 6.1 shows a simpler case where all clients can communicate through a central server. However, in practice, the presence of a central server is not feasible when multiple competing entities may want to collaboratively learn. The challenges are further compounded by the lack of a central server as shown in the right half of the Figure 6.1. Thus, it remains an open problem to design a federated learning framework for molecular GNNs in a realistic setting, in which clients only have partial labels and one in which there is no reliance on a central server. This is the problem we seek to address in this work. 81 We propose a multi-task federated learning framework called SpreadGNN that operates in the presence of multiple, but partial labels for each client and the absence of a central server as shown in Figure 6.1. First, we present a multi-task learning (MTL) formulation to learn from partial labels. Second, in our MTL formulation, we utilize decentralized periodic averaging stochastic gradient descent to solve the serverless MTL optimization problem and provide a theoretical guarantee on the convergence properties, which further verifies the rationality of our design. We evaluate SpreadGNN on graph-level molecular property prediction and regression tasks. We synthesize non-I.I.D. and partially labeled datasets by using curated data from the Molecule Net [486] machine learning benchmark. With extensive experiments and analysis, we find that SpreadGNN can achieve even better performance than FedAvg [312], not only when all clients can communicate with each other, but also when clients are constrained to communicate with a subset of other clients. We plan on publishing the source code of SpreadGNN as well as related datasets for future exploration. 6.2 SpreadGNN Framework 6.2.1 Federated Graph Neural Networks for Graph-Level Learning We seek to learn graph level representations in a federated learning setting over decentralized graph datasets located in edge servers which cannot be centralized for training due to privacy and regulation restrictions. For instance, compounds in molecular trials [383] may not be shared across entities because of intellectual property or regulatory concerns. Under this setting, we assume that there are K clients in the FL network, and the k th client has its own datasetD (k) := ¶Ä G (k) i ,y (k) i ä© N (k) i=1 , where G (k) i = (V (k) i ,E (k) i ) is the i th graph sample inD (k) with node & edge feature setsX (k) = ¶ x (k) m © m∈V (k) i andZ (k) = ¶ e (k) m,n © m,n∈V (k) i , y i (k) is the corresponding label of G (k) i , N (k) is the sample number in dataset D (k) , and N = P K k=1 N (k) . 82 Each client owns a GNN with a readout, to learn graph-level representations. We call this model of a GNN followed by a readout function, a graph classifier . 
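Before the message-passing equations below make this precise, a minimal sketch of such a graph classifier, i.e., a stack of GNN layers whose node embeddings are pooled by a readout and fed to a task head, might look as follows. A simple sum-aggregation layer with one-dimensional edge features is chosen purely for illustration; it is not the exact architecture used in the experiments.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One message-passing layer: sum messages from neighbors, then update the node state."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # Message input is [h_i, h_j, e_ij]; edge features are assumed 1-dimensional here.
        self.message = nn.Linear(2 * in_dim + 1, out_dim)
        self.update = nn.GRUCell(out_dim, out_dim)
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, h, edge_index, edge_attr):
        src, dst = edge_index
        msg = self.message(torch.cat([h[dst], h[src], edge_attr], dim=1))
        agg = h.new_zeros(h.size(0), msg.size(1)).index_add_(0, dst, msg)  # sum aggregation
        return self.update(agg, self.proj(h))

class GraphClassifier(nn.Module):
    """GNN feature extractor followed by a readout (mean pooling + task classifier)."""
    def __init__(self, in_dim, hidden_dim, num_tasks, num_layers=2):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * num_layers
        self.layers = nn.ModuleList(SimpleGNNLayer(d_in, hidden_dim)
                                    for d_in in dims[:-1])
        self.task_head = nn.Linear(hidden_dim, num_tasks)

    def forward(self, x, edge_index, edge_attr):
        h = x
        for layer in self.layers:
            h = layer(h, edge_index, edge_attr)
        graph_embedding = h.mean(dim=0, keepdim=True)   # readout / pooling
        return self.task_head(graph_embedding)          # one score per task
```

For instance, `GraphClassifier(in_dim=1, hidden_dim=16, num_tasks=3)` applied to a small molecular graph such as the toy example earlier yields one prediction per task; in the notation introduced below, the layers play the role of the message and update functions while the mean-pool-plus-head plays the role of the readout R.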
Multiple clients are interested in collaborating to improve their GNN models without necessarily revealing their graph datasets. In this work, we build our theory upon the Message Passing Neural Network (MPNN) framework [123] as most spatial GNN models [218, 457, 132] can be unified into this framework. The forward pass of an MPNN has two phases: a message-passing phase (Eq equation 6.1) and an update phase (Equation equation 6.2). For each client, we define the graph classifier with an L-layer GNN followed by a readout function as follows: m (k,ℓ+1) i =AGG Ķ M (k,ℓ+1) θ Ä h (k,ℓ) i ,h (k,ℓ) j ,e (k) i,j ä |j∈N i ©ä ,ℓ=0,...,L− 1 (6.1) h (k,ℓ+1) i =U (k,ℓ+1) Ψ Ä h (k,ℓ) i ,m (k,ℓ+1) i ä ,ℓ=0,...,L− 1 (6.2) ˆ y (k) i =R Φ pool ,Φ task Ķ h (k,L) j |j∈V (k) i ©ä (6.3) whereh (k,0) i =x (k) i is the k th client’s node features, ℓ is the layer index, AGG is the aggregation function (e.g., in the GCN model [218], the aggregation function is a simple SUM operation), and N i is the neighborhood set of node i. In Equation. equation 6.1, M θ (k,ℓ+1)(·) is the message generation function which takes the hidden state of current node h i , the hidden state of the neighbor node h j and the edge featurese i,j as inputs to gather and transform neighbors’ messages. In other words, M θ combines a vertex’s hidden state with the edge and vertex data from its neighbors to generate a new message. U (k,ℓ+1) Ψ (·) is the state update function that updates the model using the aggregated feature m (k,ℓ+1) i as in Equation. equation 6.2. After propagating through L GNN layers, the final module of the graph classifier is a readout function R Φ pool ,Φ task (·) which allows clients to predict a label for the graph, given node embeddings that are learned from Equation.equation 6.2. In general the readout is composed of two neural networks: the pooling function parameterized by Φ pool ; and a classifier parameterized by Φ task . The role of the pooling function is to learn a single 83 graph embedding given node embedding from Equation equation 6.2. The classifier then uses the graph level embedding to predict a label for the graph. ToformulateGNN-basedFL,usingthemodeldefinitionabove,wedefine W ={θ, Ψ ,Φ pool ,Φ task } as the overall learnable weights. Note that W is independent of graph structure as both the GNN and Readout parameters make no assumptions about the input graph. Thus, one can learn W using a FL based approach. The the overall FL task can be formulated as a distributed optimization problem as: min W F(W) def = min W K X k=1 N (k) N · f (k) (W) (6.4) where f (k) (W) = 1 N (k) P N (k) i=1 L(ˆ y (k) i ,y (k) i ) is the k th client’s local objective function that measures the local empirical risk over the heterogeneous graph dataset D (k) . L is the loss function of the global graph classifier. With such a formulation, it might seem like an optimization problem tailored for a FedAvg based optimizers [312]. Unfortunately, in molecular graph settings this is not the case for the following reasons. (a) In our setting, clients belong to a decentralized, serverless topology. There is no central server that can average model parameters from the clients. (b) Clients in our setting possess incomplete labels, hence the dimensions of Φ task can be different on different clients in decentralized topologies. For instance, a client may only have partial toxicity labels for a molecule, while another client may have a molecule’s interaction properties with another drug compound. 
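As a concrete note on training with such incomplete labels, a common and simple device is to mask out the missing label entries when computing each client's multi-task loss. The sketch below is illustrative and not necessarily the exact loss used by SpreadGNN, which additionally couples tasks through the covariance-based regularizer introduced next.

```python
import torch
import torch.nn.functional as F

def masked_multitask_bce(logits, labels):
    """Binary multi-task loss that ignores label entries a client does not own.

    logits, labels: tensors of shape (batch, num_tasks); missing labels are NaN.
    """
    mask = ~torch.isnan(labels)
    safe_labels = torch.where(mask, labels, torch.zeros_like(labels))
    per_entry = F.binary_cross_entropy_with_logits(logits, safe_labels, reduction="none")
    # Average only over the observed (client-owned) label entries.
    return (per_entry * mask).sum() / mask.sum().clamp(min=1)
```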
Even with such incomplete information from each client, our learning task aims to classify each molecule across multiple label categories. To address these issues, we propose a federated learning framework that achieves model personalization and a decentralized topology simultaneously. In particular, we propose a novel decentralized multi-task learning framework. Next, we first introduce a centralized FMTL framework for graph neural networks as a preliminary. We then extend this centralized FMTL framework to a serverless scenario, resulting in a decentralized FMTL framework named SpreadGNN. We also describe how we address the inconsistency of multi-task labels and the challenge of correlating model weights across clients with different task sets.

Figure 6.2: Federated Graph Multi-Task Learning Framework (FedGMTL): (a) centralized federated learning, (b) graph multi-task learning (GMTL) block, (c) decentralized federated learning.

6.2.2 Federated Multi-Task Learning with Graph Neural Networks

Under the regularized MTL paradigm [101], we define the centralized federated graph MTL problem (FedGMTL) as follows:

\min_{\theta, \Psi, \Phi_{\mathrm{pool}}, \Phi_{\mathrm{task}}, \Omega} \; \sum_{k=1}^{K} \frac{1}{N_k} \sum_{i=1}^{N_k} \mathcal{L}(\hat{y}_i^{(k)}, y_i^{(k)}) + \mathcal{R}(W, \Omega), \quad \text{s.t. } \Omega \succeq 0 \text{ and } \mathrm{Tr}(\Omega) = 1,    (6.5)

where

\mathcal{R}(W, \Omega) = \frac{1}{2}\lambda_1 \mathrm{Tr}(\Phi_{\mathrm{task}} \Omega^{-1} \Phi_{\mathrm{task}}^T) + \frac{1}{2} \sum_{\chi \in \{\theta, \Psi, \Phi_{\mathrm{pool}}, \Phi_{\mathrm{task}}\}} \lambda_{\chi} \|\chi\|_F^2    (6.6)

is the bi-convex regularizer introduced in [537]. The first term of Equation 6.5 is the sum of the empirical losses of the individual clients, which is exactly what Equation 6.4 addresses. The second term serves as a task-relationship regularizer, with Ω ∈ R^{S×S} being the covariance matrix for the S different tasks, constraining the task weights Φ_task = [Φ_task,1, ..., Φ_task,S] ∈ R^{d×S} through the matrix trace Tr(Φ_task Ω^{-1} Φ_task^T).
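As a concrete illustration, the snippet below sketches how the bi-convex regularizer in Equation 6.6 could be computed in PyTorch. The function name, the use of a single λ_χ for all parameter groups, and the example λ values are hypothetical simplifications, not the reference SpreadGNN implementation.

```python
import torch

def fedgmtl_regularizer(phi_task, other_params, omega, lambda_1=0.01, lambda_chi=1e-4):
    """Bi-convex regularizer R(W, Omega): a task-relationship trace term plus Frobenius penalties.

    phi_task:     d x S task-weight matrix (one column per task head)
    other_params: iterable with the remaining weight tensors (theta, psi, phi_pool)
    omega:        S x S task-covariance matrix (positive definite, Tr(omega) = 1)
    """
    # 0.5 * lambda_1 * Tr(Phi_task Omega^{-1} Phi_task^T) couples tasks through Omega
    trace_term = 0.5 * lambda_1 * torch.trace(phi_task @ torch.linalg.inv(omega) @ phi_task.T)
    # 0.5 * sum_chi lambda_chi ||chi||_F^2, here with a single lambda_chi for brevity
    frob_term = 0.5 * lambda_chi * sum(p.pow(2).sum() for p in list(other_params) + [phi_task])
    return trace_term + frob_term
```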
Recall that each client in our setting may only have a partial set of tasks in the labels of its training dataset, yet it still needs to make predictions at test time for tasks that are absent from its labels. This regularizer helps each client relate its own tasks to the tasks of other clients. Intuitively, it determines how closely two tasks i and j are related: the closer Φ_task,i and Φ_task,j are, the larger Ω_{i,j} will be. If Ω is an identity matrix, then the tasks are independent of one another. But as our results show, there is often a strong correlation between different molecular properties, which compels us to use a federated learning model. Figure 6.2-a depicts the FedGMTL framework, where the clients' graph classifier weights are aggregated using an FL server.

While the above formulation enhances FMTL with a constrained regularizer that can be used for GNN learning, we still need to solve the final challenge, which is to remove the reliance on a central server to perform the computations in Equation 6.5. Therefore, we propose a decentralized graph multi-task learning framework, SpreadGNN, that extends the FedGMTL framework to the decentralized case, as shown in Figure 6.2-c. Note that each client's learning task remains the same, but the aggregation process differs in SpreadGNN.

Figure 6.3: Decentralized Multi-Task Learning Correlation Matrix Exchanging Algorithm.

6.2.3 SpreadGNN: Serverless Federated MTL for GNNs

In a serverless setting, in which clients are not necessarily connected to all other clients through a central server, it is also impossible to maintain a single task covariance matrix Ω. Thus, the naive formulation in Equation 6.5 becomes obsolete in the serverless case. To combat this issue, we propose using a distinct covariance matrix Ω_k for each client, efficiently updated using the exchange mechanism illustrated in Figure 6.3. We formalize this idea as follows. Consider one particular client m with task weights Φ_task,m ∈ R^{d×S_m}, where S_m is the number of tasks that client m has. Each client then localizes the optimization procedure in Equation 6.5 with respect to its neighbors. In the decentralized setting, we emphasize that clients can collectively learn an exhaustive set of tasks, even when individual clients do not have access to some of the classes in each label. That is, S_i ∩ S_j = ∅ for all i ≠ j and ∪_i S_i = S.
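Before the formal objective, the following numpy sketch illustrates the per-client covariance update just described: each client forms the analytical estimate from its neighborhood task weights (the closed form appears in Equation 6.11 below) and averages it with estimates received from its neighbors. The alignment step f_align is omitted here, i.e., the neighbor matrices are assumed to be already aligned to a common task index space, and the weighting is simplified; this is not the exact SpreadGNN implementation.

```python
import numpy as np
from scipy.linalg import sqrtm

def analytical_omega(phi):
    """Closed-form covariance for a task-weight matrix phi of shape (d, S):
    (phi^T phi)^{1/2} normalized so that its trace equals 1."""
    root = np.real(sqrtm(phi.T @ phi))
    return root / np.trace(root)

def omega_exchange_update(phi_neighborhood, neighbor_omegas, neighbor_sizes, eta=1.0):
    """One decentralized update of a client's task-covariance matrix.

    phi_neighborhood: stacked task weights of the client and its neighbors (d x |S_Mk|)
    neighbor_omegas:  neighbors' covariance estimates, assumed pre-aligned to this client's tasks
    neighbor_sizes:   neighbors' local sample counts N_i, used as 1/N_i weights
    """
    own_estimate = analytical_omega(phi_neighborhood)
    weighted_sum = sum(om / n for om, n in zip(neighbor_omegas, neighbor_sizes))
    m = len(neighbor_omegas) + 1     # neighborhood size, including the client itself
    return (eta / m) * (weighted_sum + own_estimate)
```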
Then, the new non-convex objective function is defined as:

\min_{\theta, \Psi, \Phi_{\mathrm{pool}}, \Phi_{\mathrm{task}}, \Omega} \; \sum_{k=1}^{K} \frac{1}{N_k} \Big[ \sum_{i=1}^{N_k} \mathcal{L}\big(\hat{y}_i^{(k)}(X^{(k)}, Z^{(k)}; W_k), y_i^{(k)}\big) + \frac{1}{2}\lambda_1 \mathrm{Tr}\big(\Phi_{\mathrm{task}_{\mathcal{M}_k}} \Omega_k^{-1} \Phi_{\mathrm{task}_{\mathcal{M}_k}}^T\big) \Big] + \frac{1}{2} \sum_{\chi \in \{\theta, \Psi, \Phi_{\mathrm{pool}}, \Phi_{\mathrm{task}}\}} \lambda_{\chi} \|\chi\|_F^2,
\quad \text{s.t. } \Omega_k \succeq 0 \text{ and } \mathrm{Tr}(\Omega_k) = 1, \; k = 1, 2, \ldots, K.    (6.7)

where W_k = \{\theta, \Psi, \Phi_{\mathrm{pool}}, \Phi_{\mathrm{task},k}\} is the set of all learnable weights for client k, and \mathcal{M}_k = \{k\} \cup \mathcal{N}_k is the neighbor set of client k, including itself. This gives rise to \Phi_{\mathrm{task}_{\mathcal{M}_k}} = [\Phi_{\mathrm{task},1} \| \Phi_{\mathrm{task},2} \| \ldots \| \Phi_{\mathrm{task},|\mathcal{M}_k|}] \in \mathbb{R}^{d \times |\mathcal{S}_{\mathcal{M}_k}|}, the task weight matrix for client k and its neighbors, where \| denotes row-wise concatenation. The matrix \Omega_k \in \mathbb{R}^{|\mathcal{S}_{\mathcal{M}_k}| \times |\mathcal{S}_{\mathcal{M}_k}|} represents the correlation among all the available tasks for the set \mathcal{M}_k. To solve this non-convex problem, we apply the alternating optimization method presented in [537], where W_k and \Omega_k are updated in an alternating fashion.

Optimizing W_k: For simplicity, we define \Omega = \{\Omega_1, \Omega_2, \ldots, \Omega_K\} as the set of correlation matrices for all clients. Fixing \Omega, we can use SGD to update W_k jointly. Let \mathcal{L} = \sum_{k=1}^{K} \frac{1}{N_k} \sum_{i=1}^{N_k} \mathcal{L}(\hat{y}_i^{(k)}(X_i^{(k)}, Z_i^{(k)}; W_k), y_i^{(k)}). Then our problem can be reformulated as:

G(W_k \mid \Omega) = \mathcal{L} + \sum_{k=1}^{K} \frac{1}{N_k} \sum_{i=1}^{N_k} \frac{1}{2}\lambda_1 \mathrm{Tr}\big(\Phi_{\mathrm{task}_{\mathcal{M}_k}} \Omega_k^{-1} \Phi_{\mathrm{task}_{\mathcal{M}_k}}^T\big) + \frac{1}{2} \sum_{\chi \in \{\theta, \Psi, \Phi_{\mathrm{pool}}, \Phi_{\mathrm{task}}\}} \lambda_{\chi} \|\chi\|_F^2,    (6.8)

where the summation in Equation 6.8 is over all nodes connected to node k. The gradients for each node are then:

\frac{\partial G(W_k \mid \Omega)}{\partial \Phi_{\mathrm{task}_{\mathcal{M}_k}}} = \frac{\partial \mathcal{L}}{\partial \Phi_{\mathrm{task}_{\mathcal{M}_k}}} + \lambda_1 \sum_{i=1}^{|\mathcal{M}_k|} \frac{1}{N_i} \Phi_{\mathrm{task}_{\mathcal{M}_k}} \Omega_i^{-1} + \lambda_{\chi} \Phi_{\mathrm{task}_{\mathcal{M}_k}}    (6.9)

\frac{\partial G(W_k \mid \Omega)}{\partial \chi} = \frac{\partial \mathcal{L}}{\partial \chi} + \lambda_{\chi} \chi, \quad \forall \chi \in \{\theta, \Psi, \Phi_{\mathrm{pool}}\}    (6.10)

Optimizing \Omega_k \in \Omega: In [537], the analytical solution for \Omega is \hat{\Omega} = (\Phi_{\mathrm{task}}^T \Phi_{\mathrm{task}})^{\frac{1}{2}} / \mathrm{Tr}\big((\Phi_{\mathrm{task}}^T \Phi_{\mathrm{task}})^{\frac{1}{2}}\big). However, this solution is only applicable in the centralized case, because the missing central node makes it impossible to average parameters globally. We therefore propose a novel way to update each \Omega_k \in \Omega:

f_{\mathrm{align}}(\Omega_k^{(t+1)}) \leftarrow \frac{\eta}{|\mathcal{M}_k|} \Big( \sum_{i=1}^{|\mathcal{M}_k|} \frac{1}{N_i} f_{\mathrm{align}}(\Omega_i^{(t)}) + f_{\mathrm{align}}(\hat{\Omega}_k) \Big)    (6.11)

where \hat{\Omega}_k = (\Phi_{\mathrm{task}_{\mathcal{M}_k}}^T \Phi_{\mathrm{task}_{\mathcal{M}_k}})^{\frac{1}{2}} / \mathrm{Tr}\big((\Phi_{\mathrm{task}_{\mathcal{M}_k}}^T \Phi_{\mathrm{task}_{\mathcal{M}_k}})^{\frac{1}{2}}\big) is the analytical solution for node k. The first (averaging) term incorporates the correlations of nearby nodes into the client's own estimate. Note that each \Omega_i may have a different dimension (a different number of neighbors), so this averaging is based on a node-wise alignment within the global \Omega. The f_align function, which operates on each client, is illustrated in Figure 6.3; the second term captures the new correlation among the client's neighbors, also shown in Figure 6.3. We refer readers to Algorithm 9 in Appendix E.3.2.

6.2.3.1 Convergence Properties

In this section, we present our convergence analysis for the SpreadGNN optimization problem. Before any formalism, we first introduce a connection matrix M ∈ R^{K×K} that records the connections in the network. We assume that M satisfies the following:
1. M 1_K = 1_K
2. M^T = M
3. max{|λ_2(M)|, ..., |λ_K(M)|} < λ_1(M) = 1
where λ_i(M) denotes an eigenvalue of M. In our analysis, we assume that the following properties hold [41]:
• The objective function F(·) is L-Lipschitz.
• F(·) is lower bounded by F_inf, i.e., F(·) ≥ F_inf.
• The full gradient of the objective function F(·) is approximated by a stochastic gradient g on a mini-batch ϵ_k with an unbiased estimate: E_{ϵ_k}[g(x)] = ∇F(x).
• The variance of the stochastic gradient g(x) given a mini-batch ϵ_k is upper bounded: Var_{ϵ_k}(g(x)) ≤ β ‖∇F(x)‖^2 + σ^2, ∃ β, σ^2 ≥ 0, ∀ k.

For a fixed step size, our updates can be written as:

X_{t+1} = (X_t - \eta G_t) \cdot M_t,    (6.12)

where X_t = [x_t^{(1)}, \ldots, x_t^{(K)}] is the matrix of interest (e.g., each element of W_k^{(t)}), G_t is the gradient, and M_t is the connection matrix at time t. When t mod τ = 0, M_t = M; otherwise, M_t = I_K. Multiplying both sides of Equation 6.12 by \frac{\mathbf{1}_K}{K} and defining the averaged model u_t = X_t \frac{\mathbf{1}_K}{K}, we obtain the following update:

u_{t+1} = u_t - \eta g_t = u_t - \eta \Big[ \frac{1}{K} \sum_{i=1}^{K} g(x_t^{(i)}) \Big]    (6.13)

Next, we present our analysis of the convergence of the averaged model u_t. For non-convex optimization, previous works on SGD convergence analysis use the average squared gradient norm as an indicator of convergence to a stationary point [41].

Theorem 1 (Convergence of SpreadGNN). If the learning rate η satisfies the condition

\eta L + \frac{\eta^2 L^2 \tau^2}{1 - \zeta} \Big( \frac{2\zeta^2}{1 + \zeta} + \frac{2\zeta}{1 - \zeta} + \frac{\tau - 1}{\tau} \Big) \leq 1,    (6.14)

where τ is the averaging period (one synchronization per τ local updates) and ζ = max{|λ_2(M)|, ..., |λ_K(M)|}, and all local models are initialized at the same point x_0, then after T iterations the average squared gradient norm is bounded as

\mathbb{E}\Big[ \frac{1}{T} \sum_{t=1}^{T} \|\nabla F(u_t)\|^2 \Big] \leq \frac{2[F(x_1) - F_{\mathrm{inf}}]}{\eta T} + \frac{\eta L \sigma^2}{K} + \eta^2 L^2 \sigma^2 \Big( \frac{1 + \zeta^2}{1 - \zeta^2} \tau - 1 \Big)    (6.15)

Theorem 1 shows that the SpreadGNN algorithm converges to a stationary-point solution under certain conditions. [467] presents similar results in a unified framework, but it does not provide adequate theoretical analysis or empirical evaluation for federated learning. The proof of Theorem 1 is given in Appendix E.3.2.

Table 6.1: Dataset summary used in our experiments.
Dataset | # Molecules | Avg # Nodes | Avg # Edges | # Tasks | Task Type | Evaluation Metric
SIDER | 1427 | 33.64 | 35.36 | 27 | Classification | ROC-AUC
Tox21 | 7831 | 18.51 | 25.94 | 12 | Classification | ROC-AUC
MUV | 93087 | 24.23 | 76.80 | 17 | Classification | ROC-AUC
QM8 | 21786 | 7.77 | 23.95 | 12 | Regression | MAE

6.3 Experiments

6.3.1 Setup

Implementation. All experiments are conducted on a single GPU server equipped with 2 NVIDIA GeForce GTX 1080Ti GPUs and an AMD Ryzen 7 3700X 8-core processor. Our models are built on top of the FedML framework [153] and PyTorch [342].

Multi-Label Dataset. We use molecular datasets from the MoleculeNet [486] machine learning benchmark in our evaluation. In particular, we evaluate our approach on the molecular property prediction datasets described in Table 6.1. The label of each molecular graph is a vector in which each element denotes a property of the molecule. Properties are binary in the case of classification or continuous values in the case of regression. As such, each multi-label dataset can adequately evaluate our learning framework.

Non-I.I.D. Partition for Quantity and Label Skew. We introduce non-I.I.D.ness in two additive ways. The first is a non-I.I.D. split of the training data based on quantity: we use a Dirichlet distribution parameterized by α to split the training data between clients, so the number of training samples present at each client is non-I.I.D. The second source of non-I.I.D.ness is a label masking scheme designed to represent the scenario in which different clients may possess partial labels, as shown in Figure 6.1. More specifically, we randomly mask out a subset of classes in each label on every client.
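The following numpy sketch illustrates the two non-I.I.D. mechanisms just described: a Dirichlet-based quantity split of sample indices and a per-client label mask. The function names and the NaN encoding of missing tasks are illustrative assumptions rather than the exact partitioning code used in our experiments.

```python
import numpy as np

def dirichlet_quantity_split(num_samples, num_clients, alpha, seed=0):
    """Split sample indices so that per-client sample counts follow Dir(alpha)."""
    rng = np.random.default_rng(seed)
    proportions = rng.dirichlet([alpha] * num_clients)
    counts = (proportions * num_samples).astype(int)
    counts[-1] = num_samples - counts[:-1].sum()      # make the counts add up exactly
    idx = rng.permutation(num_samples)
    return np.split(idx, np.cumsum(counts)[:-1])      # one index array per client

def mask_labels(labels, client_task_ids):
    """Keep only the tasks assigned to a client; other label entries become NaN (missing)."""
    masked = np.full(labels.shape, np.nan, dtype=float)
    masked[:, client_task_ids] = labels[:, client_task_ids]
    return masked
```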
In our experiments, the sets of unmasked classes across all clients are mutually exclusive and collectively exhaustive. This setting simulates a worst-case scenario where no two clients share the same task; however, our framework is just as applicable when there is label overlap. Such masking introduces a class imbalance between the clients, making the label distribution non-I.I.D. as well.

Table 6.2: Results on the molecular property prediction task. SpreadGNN uses a complete topology (all clients communicate with each other) and communication period τ = 1.
Dataset | GraphSAGE FedAvg | GraphSAGE FedGMTL | GraphSAGE SpreadGNN | GAT FedAvg | GAT FedGMTL | GAT SpreadGNN
SIDER | 0.582 | 0.629 | 0.5873 | 0.5857 | 0.61 | 0.6034
Tox21 | 0.5548 | 0.6664 | 0.585 | 0.6035 | 0.6594 | 0.6056
MUV | 0.6578 | 0.6856 | 0.703 | 0.709 | 0.6899 | 0.713
QM8 | 0.02982 | 0.03624 | 0.02824 | 0.0392 | 0.0488 | 0.0315

Baseline Algorithm. For a fair and reasonable comparison with our baseline, we use the same masking to simulate missing labels for each client. We then train models with FedAvg using the same loss as in the decentralized case.

Models. To demonstrate that our framework is agnostic to the choice of GNN model, we run experiments on GraphSAGE [132] and GAT [457].

Network Topology. We first evaluate our framework in a complete topology, in which all clients are connected to all other clients, to measure the efficacy of our proposed regularizer. We then perform ablation studies on the number of neighbors of each client to stress our framework in more constrained settings.

Hyperparameters. We use Adam [217] as the client optimizer in all of our experiments. For Tox21, MUV, and QM8, we use an 8-client topology; for SIDER, we use a 4-client topology. A more comprehensive list of hyperparameters for network topologies and models can be found in Appendix E.2.2.

Figure 6.4: Effect of Task-Relationship Regularizer on Learning (left: GraphSAGE + MUV, test ROC-AUC over rounds; right: GAT + QM8, test MAE over rounds).
Figure 6.5: Effect of Topology on Learning for GraphSAGE Model (left: Tox21, test ROC-AUC over rounds; right: SIDER, test ROC-AUC over rounds).

6.3.2 Results

We use a centralized, server-dependent FedAvg system as the baseline for comparison. More specifically, all clients are involved in the averaging of model parameters in every averaging round. Our results, summarized in Table 6.2, demonstrate that SpreadGNN (third column) outperforms a centralized federated learning system that uses FedAvg (first column) when all clients can communicate with all other clients. This shows that by using the combination of the task regularizer in Equation 6.6 and decentralized optimization, we can eliminate the dependence on a central server and enable clients to learn more effectively in the presence of missing molecular properties in their respective labels. Additionally, the results show that our framework is agnostic to the type of GNN model used in the molecular property prediction task, since both GraphSAGE and GAT benefit from our framework. Our framework also works when a trusted central server is available (second column).
The presence of a trusted central server improves the accuracy in a few scenarios. However, SpreadGNN provides competitive performance in the more realistic setting where a central server is not feasible.

6.3.3 Sensitivity Analysis

Task Regularizer. Figure 6.4 illustrates the effect of λ_1 on both regression and classification tasks. Interestingly, regression is much more robust to variation in λ_1, while classification demands more careful tuning of λ_1 to achieve optimal model performance. This implies that the different properties in the regression task are more independent than the properties in the classification task.

Network Topology. The network topology dictates how many neighbors each client can communicate with in a communication round. While Table 6.2 shows that SpreadGNN outperforms FedAvg in a complete topology, Figure 6.5 shows that our framework outperforms FedAvg even when clients are constrained to communicate with fewer neighbors. We can also see that it is not just the number of neighbors that matters; the topology in which clients are connected matters too. When the number of neighbors is 2, a ring topology outperforms a random topology, as a ring guarantees a path from any client to any other client. Thus, learning is shared indirectly between all clients; the same is not true in a random topology. Figure 6.5 (right) also illustrates the effect of varying topologies on SpreadGNN for the SIDER dataset when using GraphSAGE as the GNN. The qualitative behavior is similar to the Tox21 results in Figure 6.5 (left), in that when each client is connected to more neighbors, its local model is more robust. However, when the total number of clients in the network is smaller, the effect of topology is understated and the total number of neighbors matters more. Recall that in the 8-client network, when each client was restricted to only 2 neighbors, random connections performed worse than a ring topology, meaning that the topology mattered as much as the mere number of neighbors. In the 4-client network, however, there is minimal difference between a 2-neighbor random configuration and a 2-neighbor ring configuration.

Period. The communication period τ is another important hyperparameter in our framework. As we increase the communication period τ, model performance decreases. However, selecting τ = 5 can sometimes be better than averaging and exchanging every round. This indicates that tuning τ is important for controlling the tradeoff between performance and running time. In general, our experiments suggest that a lower period is better, but this is not always the case. We include an ablation study on τ to support this claim in Appendix E.3.

6.4 Related Works

Molecular Representation Learning. [377] encodes the neighbors of atoms in a molecule into a fixed-length vector to obtain vector-space representations. To improve the expressive power of chemical fingerprints, [94, 76] use CNNs to learn rich molecule embeddings for downstream tasks like property prediction. [211, 397] explore graph convolutional networks to encode molecular graphs into neural fingerprints. To better capture the interactions among atoms, [123] proposes a message passing framework.

FL. Early examples of research into federated learning include [220, 311].
To address both statistical and system challenges in FL, [414] proposes a multi-task learning framework for federated learning and its related optimization algorithms, which extends early works in distributed machine learning [515, 190]. The main limitation, however, is that strong duality is only guaranteed when the objective function is convex, which cannot be generalized to non-convex settings. [198, 305, 466, 18, 281] extend federated multi-task learning to the distributed multi-task learning setting, but not only does this limitation remain, requiring all nodes to perform the same amount of work is also prohibitive in FL. [353] proposes a coding-theoretic approach that mitigates statistical and communication heterogeneity simultaneously to speed up FL in multi-access edge computing (MEC) networks.

Federated Graph Neural Networks. [432] and [314] use computed graph statistics for information exchange and aggregation to avoid node information leakage. [194] incorporates cryptographic approaches into GNN learning. [462] proposes a hybrid of federated and meta learning to solve the semi-supervised graph node classification problem in decentralized social network datasets. [317] uses an edge-cloud partitioned GNN model for spatio-temporal traffic forecasting tasks over sensor networks. These previous works do not consider graph learning in a decentralized setting.

Stochastic Gradient Descent Optimization. In large-scale distributed machine learning problems, synchronized mini-batch SGD is a well-known method to address the communication bottleneck by increasing the computation-to-communication ratio [242]. It has been shown that FedAvg [221] is a special case of local SGD, which allows nodes to perform local updates with infrequent synchronization so that they communicate less while converging quickly [467, 524, 269]. Decentralized SGD, another approach to reducing communication, has been successfully applied to deep learning [197, 261, 99]. Asynchronous SGD is a potential method for alleviating synchronization delays in distributed learning [320], but existing asynchronous SGD does not fit federated learning because the staleness problem is particularly severe due to the heterogeneity of the federated setting [80].

6.5 Conclusion

In this work, we propose SpreadGNN to train federated graph neural networks in a decentralized manner. We motivate our framework through a realistic setting in which clients involved in molecular machine learning research cannot share data with each other due to privacy regulations and competition. Moreover, we account for the fact that clients possess multiple, but partial, labels. For the first time, experiments show that training federated graph neural networks does not require a centralized topology and that our framework can address the non-I.I.D.ness in dataset size and label distribution across clients. SpreadGNN can outperform a central-server-dependent baseline even when clients can only communicate with a few neighbors. To support our empirical results, we also provide a convergence analysis for our framework.

Chapter 7
SSFL: Tackling Label Deficiency via Personalized Self-Supervision

7.1 Introduction

Federated Learning (FL) is a contemporary distributed machine learning paradigm that aims at strengthening data privacy, reducing data migration costs, and overcoming regulatory restrictions [206, 468]. It has been widely applied to computer vision, natural language processing, and data mining.
However, there are two main challenges impeding its wider adoption in machine learning. One is data heterogeneity, a natural property of FL in which diverse clients may generate datasets with different distributions due to behavior preferences (e.g., the most common cause of heterogeneity is a skewed label distribution, which might result from some smartphone users taking more landscape pictures while others take more photos of daily life). The second challenge is label deficiency at the edge, which is relatively less studied. This issue is more severe at the edge than in a centralized setting because users are reluctant to annotate their private and sensitive data, and/or smartphones and IoT devices do not have a user-friendly interface to assist with annotation.

Figure 7.1: Depiction of the Self-supervised and Personalized Federated Learning (SSFL) framework.

To mitigate the data heterogeneity issue among clients, researchers have proposed algorithms for training a global model, such as FedAvg [308], FedProx [253], FedNova [471], and FedOPT [367], as well as personalized FL frameworks (e.g., pFedMe, Ditto, Per-FedAvg). These algorithms all depend on the strong assumption that the data at the edge has sufficient labels. To address the label deficiency issue in FL, recent works [288, 293, 187, 192, 265, 540, 530, 538] assume that the server or client has a fraction of labeled data and use semi-supervised methods such as consistency loss [322] or pseudo-labeling [232] to train a global model. A more realistic but challenging setting is fully unsupervised training. Although a recent work in FL [389] attempts to address this challenge through Siamese networks proposed around thirty years ago [44], its design does not tackle data heterogeneity for learning personalized models, and it only trains on small-scale sensor data from IoT devices. Moreover, these existing works in FL have not examined recent progress in the Self-Supervised Learning (SSL) community, where methods such as SimCLR [68], SwAV [53], BYOL [127], and SimSiam [69] have shown tremendous improvement in reducing the amount of labeled data required to achieve state-of-the-art performance. As such, it remains unclear how these SSL methods can be incorporated into FL and how well they would perform, especially when intertwined with the data heterogeneity challenge that does not exist in centralized training.

In this work, we propose Self-Supervised Federated Learning (SSFL), a unified self-supervised and personalized federated learning framework, and a series of algorithms under this framework to address these challenges. As shown in Figure 7.1, this framework brings state-of-the-art self-supervised learning algorithms to the realm of FL in order to enable training without using any supervision, while also integrating model personalization to deal with data heterogeneity (Section 7.3.1). More specifically, under the SSFL framework, we analyze the compatibility of various centralized self-supervised learning methods in the FL setting and demonstrate that the SimSiam network performs best with the standard FedAvg algorithm (Section 7.3.2). Moreover, to address data heterogeneity at edge devices, we develop a series of algorithms that broaden the reach of existing supervised personalization algorithms, including perFedAvg [104], Ditto [249], and local fine-tuning, among others, into the setting of self-supervised learning.
We further propose a novel personalized federated self-supervised learning algorithm, Per-SSFL (Section 7.3.3), which balances personalization and consensus by carefully regulating the distance between the local and global representations of the data (shown as the yellow block in Figure 7.1). To provide a comprehensive and comparative analysis of the proposed algorithms, we also develop a distributed training system and an evaluation protocol for SSFL. Using this training system, we conduct experiments on a synthetic non-I.I.D. dataset based on CIFAR-10 and a naturally non-I.I.D. dataset, GLD-23K. Our experimental results demonstrate that all algorithms in our framework work reliably: in FL, the gap in evaluation accuracy between supervised and unsupervised learning is small, and personalized SSFL performs better than FedAvg-based SSFL. We also conduct ablation studies to fully understand the SSFL framework, namely the role of batch size, different degrees of non-I.I.D.ness, and performance on more datasets. Finally, our unified API design can serve as a suitable platform and baseline, enabling further development of more advanced SSFL algorithms.

7.2 Preliminaries

SSFL builds upon two fundamental areas in machine learning: federated optimization and self-supervised learning. Thus, we first introduce some basics and formulations in these areas.

7.2.1 Federated Optimization

Federated optimization refers to the distributed optimization paradigm in which a network of K devices collaboratively solves a machine learning task. In general, it can be formulated as a distributed optimization problem of the form [308]: \min_{\theta} \sum_{k=1}^{K} \frac{|D_k|}{|D|} \mathcal{L}(\theta, D_k). Here, each device k has a local dataset D_k drawn from a local distribution \mathcal{X}_k. The combined dataset D = \cup_{k=1}^{K} D_k is the union of all local datasets D_k, \theta represents the weights of a client model, and \mathcal{L} is the client's local loss function, which measures the local empirical risk over the heterogeneous dataset D_k. Under this formulation, to mitigate the non-I.I.D. issue, researchers have proposed algorithms such as FedProx [253], FedNova [471], and FedOPT [367] for training a global model, as well as personalized FL frameworks such as Ditto [249] and Per-FedAvg [103]. All of these algorithms rely on the strong assumption that data at the edge have sufficient labels, meaning that their analysis and experimental settings are based on a supervised loss function, such as the cross-entropy loss for image classification.

7.2.2 Self-supervised Learning

Self-supervised learning (SSL) aims to learn meaningful representations of samples without human supervision. Formally, it aims to learn an encoder function f_\theta : \mathcal{X} \to \mathbb{R}^d, where \theta is the parameter of the function, \mathcal{X} is the unlabeled sample space (e.g., images or text), and the output is a d-dimensional vector containing enough information for downstream tasks such as image classification and segmentation. The key to SSL's recent success is the inductive bias that a good representation encoder should remain consistent under different perturbations of the input (i.e., consistency regularization). Prominent examples among recent advances in modern SSL frameworks are the Siamese architecture [44] and its improved variants SimCLR [68], SwAV [53], BYOL [127], and SimSiam [69]. Here we review the most elegant architecture, SimSiam, and defer the description and comparison of the other three to Appendix F.1.
SimSiam proposes a two-head architecture in which two different views (augmentations) of the same image are encoded by the same network f_\theta. Subsequently, a predictor multi-layer perceptron (MLP) h_\theta and a stop-gradient operation, denoted by \widehat{\cdot}, are applied to the two heads. In the SSL context, "stop gradient" means that the optimizer stops at a specific neural network layer during back-propagation and treats the parameters in the preceding layers as constants. Here, \theta is the concatenation of the parameters of the encoder network and the predictor MLP. The algorithm aims to minimize the negative cosine similarity \mathcal{D}(\cdot, \cdot) between the two heads. More concretely, the loss is defined as

\mathcal{L}_{SS}(\theta, D) = \frac{1}{|D|} \sum_{x \in D} \mathcal{D}\big(f_\theta(\mathcal{T}(x)), \widehat{h_\theta(f_\theta(\mathcal{T}(x)))}\big),    (7.1)

where \mathcal{T} represents stochastic data augmentation and D is the dataset.

7.3 SSFL: Self-supervised Federated Learning

In this section, we propose SSFL, a unified framework for self-supervised federated learning. Specifically, we introduce how SSFL works for the collaborative training of a global model and of personalized models, respectively.

7.3.1 General Formulation

We formulate self-supervised federated learning as the following distributed optimization problem:

\min_{\Theta, \{\theta_k\}_{k \in [K]}} G\big(\mathcal{L}(\theta_1, \Theta; \mathcal{X}_1), \ldots, \mathcal{L}(\theta_K, \Theta; \mathcal{X}_K)\big)    (7.2)

where \theta_k is the parameter of the local model (f_{\theta_k}, h_{\theta_k}); \Theta is the parameter of the global model (f_\Theta, h_\Theta); \mathcal{L}(\theta_k, \Theta; \mathcal{X}_k) is a loss measuring the quality of the representations encoded by f_{\theta_k} and f_\Theta on the local distribution \mathcal{X}_k; and G(\cdot) denotes the aggregation function (e.g., the sum of client losses weighted by \frac{|D_k|}{|D|}). To capture the two key challenges in federated learning (data heterogeneity and label deficiency), we hold two core assumptions in the proposed framework: (1) the \mathcal{X}_k of all clients are heterogeneous (non-I.I.D.), and (2) there are no labels.

To tackle the above problem, we propose a unified training framework for federated self-supervised learning, as described in Algorithm 4. This framework can handle both non-personalized and personalized federated training. In particular, if one enforces the constraint \theta_k = \Theta for all clients k \in [K], the problem reduces to learning a global model. When this constraint is not enforced, \theta_k can be different for each client, allowing for model personalization. ClientSSLOpt is the local optimizer at the client side, which solves the local sub-problem in a self-supervised manner. ServerOpt takes the updates from the client side and generates a new global model for all clients.

Algorithm 4 SSFL: A Unified Framework for Self-supervised Federated Learning
Input: K, T, \Theta^{(0)}, \{\theta_k^{(0)}\}_{k \in [K]}, ClientSSLOpt, ServerOpt
for t = 0, ..., T - 1 do
    Server randomly selects a subset of devices S^{(t)}
    Server sends the current global model \Theta^{(t)} to S^{(t)}
    for each device k \in S^{(t)} in parallel do
        Solve the local sub-problem of Equation 7.2: \theta_k, \Theta_k^{(t)} \leftarrow ClientSSLOpt(\theta_k^{(t)}, \Theta^{(t)}, \nabla \mathcal{L}(\theta_k, \Theta; \mathcal{X}_k))
        Send \Delta_k^{(t)} := \Theta_k^{(t)} - \Theta^{(t)} back to the server
    \Theta^{(t+1)} \leftarrow ServerOpt(\Theta^{(t)}, \{\Delta_k^{(t)}\}_{k \in S^{(t)}})
Return: \{\theta_k\}_{k \in [K]}, \Theta^{(T)}

Next, we introduce specific forms of ClientSSLOpt and ServerOpt for global training and personalized training.

7.3.2 Global-SSFL: Collaboration Towards a Global Model without Supervision

To train a global model using SSFL, we design a specific form of ClientSSLOpt using SimSiam. We choose SimSiam over other contemporary self-supervised learning frameworks (e.g., SimCLR, SwAV, BYOL) based on the following analysis as well as experimental results (see Section 7.5.1).
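For reference, here is a minimal PyTorch-style sketch of the symmetrized SimSiam loss with the stop-gradient operation, which is what ClientSSLOpt minimizes locally in Global-SSFL. The module interfaces (encoder, predictor) and the two-view sampling are illustrative; see the original SimSiam paper for the reference implementation.

```python
import torch.nn.functional as F

def negative_cosine(p, z):
    """D(p, z) with stop-gradient on the target: z.detach() blocks gradients through z."""
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def simsiam_loss(encoder, predictor, view1, view2):
    """Symmetrized SimSiam loss on two augmented views of the same batch of images."""
    z1, z2 = encoder(view1), encoder(view2)     # representations f_theta(T(x))
    p1, p2 = predictor(z1), predictor(z2)       # predictions h_theta(f_theta(T(x)))
    return 0.5 * (negative_cosine(p1, z2) + negative_cosine(p2, z1))
```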
Simplicity of the neural architecture and training method. SimSiam's architecture and training method are relatively simple. For instance, compared with SimCLR, SimSiam has a simpler loss function; compared with SwAV, SimSiam does not require an additional neural component (prototype vectors) or the Sinkhorn-Knopp algorithm; compared with BYOL, SimSiam does not need to maintain an additional moving-average network alongside the online network. Moreover, the required batch size of SimSiam is the smallest, making it relatively friendly for resource-constrained federated learning. A more comprehensive comparison can be found in Appendix F.1.

Interpretability of SimSiam leads to simpler local optimization. More importantly, SimSiam is more interpretable from an optimization standpoint, which simplifies the local optimization. In particular, it can be viewed as an implementation of an Expectation-Maximization (EM)-like algorithm, meaning that optimizing \mathcal{L}_{SS} in Equation 7.1 implicitly optimizes the following objective:

\min_{\theta, \eta} \; \mathbb{E}_{\mathcal{T}, x \sim \mathcal{X}} \big[ \| f_\theta(\mathcal{T}(x)) - \eta_x \|_2^2 \big].    (7.3)

Here, f_\theta is the encoder neural network parameterized by \theta, and \eta is an extra set of parameters whose size is proportional to the number of images; \eta_x refers to using the image index of x to access a sub-vector of \eta. This formulation is with respect to both \theta and \eta and can be optimized via an alternating algorithm. At time step t, the update of \eta_x^t takes the form \eta_x^t \leftarrow \mathbb{E}_{\mathcal{T}}[f_{\theta^t}(\mathcal{T}(x))], indicating that \eta_x^t is assigned the average representation of x over the distribution of augmentations. However, it is impossible to compute this step by going over the entire dataset during training. Thus, SimSiam uses one-step optimization to approximate the EM-like two-step iteration by introducing the predictor h_\theta to approximate \eta and learn the expectation (i.e., h_\theta(z) \approx \mathbb{E}_{\mathcal{T}}[f_\theta(\mathcal{T}(x))]) for any image x. After this approximation, the expectation \mathbb{E}_{\mathcal{T}}[\cdot] is ignored because the sampling of \mathcal{T} is implicitly distributed across multiple epochs. Finally, we obtain the self-supervised loss function in Equation 7.1, in which the negative cosine similarity \mathcal{D} is used in practice (the equivalent L_2 distance is used in Equation 7.3 for the sake of analysis). Applying Equation 7.1 in ClientSSLOpt simplifies the local optimization for each client in a self-supervised manner.

7.3.3 Per-SSFL: Learning Personalized Models without Supervision

In this section, we explain how SSFL addresses the data heterogeneity challenge by learning personalized models. Inspired by the interpretation in Section 7.3.2, we define the following sub-problem for each client k \in [K]:

\min_{\theta_k, \eta_k} \; \mathbb{E}_{\mathcal{T}, x \sim \mathcal{X}_k} \Big[ \| f_{\theta_k}(\mathcal{T}(x)) - \eta_{k,x} \|_2^2 + \frac{\lambda}{2} \| \eta_{k,x} - \mathcal{H}^*_x \|_2^2 \Big]
\quad \text{s.t. } \Theta^*, \mathcal{H}^* \in \arg\min_{\Theta, \mathcal{H}} \sum_{i=1}^{n} \frac{|D_i|}{|D|} \mathbb{E}_{\mathcal{T}, x \sim \mathcal{X}_i} \big[ \| f_\Theta(\mathcal{T}(x)) - \mathcal{H}_x \|_2^2 \big]    (7.4)

Compared to global training, we additionally include \Theta, the global model parameter, and \mathcal{H}, the global version of \eta (the expected global representations), which correspond to the personalized parameters \theta_k and \eta_k. In particular, through the term \| \eta_{k,x} - \mathcal{H}^*_x \|_2^2, we require the expected local representation of any image x to reside within a neighborhood around its expected global representation. Therefore, by controlling the radius of this neighborhood, the hyperparameter \lambda helps to balance consensus and personalization.
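The following PyTorch-style sketch shows one way a client's personalized loss could be computed in practice: a local symmetrized SimSiam term plus a λ-weighted term that pulls the local predictions toward the global model's representations, mirroring the surrogate objective and Algorithm 5 given below. The function names and the treatment of the global model as a fixed target in the personalization term are illustrative simplifications, not the exact Per-SSFL implementation.

```python
import torch
import torch.nn.functional as F

def neg_cos(p, z):
    # negative cosine similarity with stop-gradient on the target z
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def per_ssfl_client_loss(local_enc, local_pred, global_enc, global_pred, view1, view2, lam):
    """Per-client loss: local SimSiam term + lambda * distance between local and global
    representations of the same augmented views."""
    z1, z2 = local_enc(view1), local_enc(view2)
    p1, p2 = local_pred(z1), local_pred(z2)
    local_term = 0.5 * (neg_cos(p1, z2) + neg_cos(p2, z1))

    with torch.no_grad():                      # the global model serves as a fixed target here
        P1, P2 = global_pred(global_enc(view1)), global_pred(global_enc(view2))
    align_term = 0.25 * (neg_cos(p1, P1) + neg_cos(p1, P2) + neg_cos(p2, P1) + neg_cos(p2, P2))
    return local_term + lam * align_term
```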
Algorithm 5 Per-SSFL
Input: K, T, \lambda, \Theta^{(0)}, \{\theta_k^{(0)}\}_{k \in [K]}, s (number of local iterations), \beta (learning rate)
for t = 0, ..., T - 1 do
    Server randomly selects a subset of devices S^{(t)}
    Server sends the current global model \Theta^{(t)} to S^{(t)}
    for each device k \in S^{(t)} in parallel do
        ClientSSLOpt: sample a mini-batch B_k from the local dataset D_k and run s local iterations:
            /* Optimize the global parameter \Theta locally */
            Z_1, Z_2 \leftarrow f_{\Theta^{(t)}}(\mathcal{T}(B_k)), f_{\Theta^{(t)}}(\mathcal{T}(B_k))
            P_1, P_2 \leftarrow h_{\Theta^{(t)}}(Z_1), h_{\Theta^{(t)}}(Z_2)
            \Theta_k^{(t)} \leftarrow \Theta^{(t)} - \beta \nabla_{\Theta^{(t)}} \frac{\mathcal{D}(P_1, \widehat{Z}_2) + \mathcal{D}(P_2, \widehat{Z}_1)}{2}, where \widehat{\cdot} denotes stop-gradient
            /* Optimize the local parameter \theta_k */
            z_1, z_2 \leftarrow f_{\theta_k}(\mathcal{T}(B_k)), f_{\theta_k}(\mathcal{T}(B_k))
            p_1, p_2 \leftarrow h_{\theta_k}(z_1), h_{\theta_k}(z_2)
            \theta_k \leftarrow \theta_k - \beta \nabla_{\theta_k} \Big( \frac{\mathcal{D}(p_1, \widehat{z}_2) + \mathcal{D}(p_2, \widehat{z}_1)}{2} + \lambda \frac{\mathcal{D}(p_1, P_1) + \mathcal{D}(p_1, P_2) + \mathcal{D}(p_2, P_1) + \mathcal{D}(p_2, P_2)}{4} \Big)
        Send \Delta_k^{(t)} := \Theta_k^{(t)} - \Theta^{(t)} back to the server
    ServerOpt: \Theta^{(t+1)} \leftarrow \Theta^{(t)} + \sum_{k \in S^{(t)}} \frac{|D_k|}{|D|} \Delta_k^{(t)}
Return: \{\theta_k\}_{k \in [K]}, \Theta^{(T)}

We see that Equation 7.4 is an optimization problem with respect to both \theta and \eta. However, since this objective is intractable in practice, following an analysis similar to Section 7.3.2, we use the objective below as a surrogate:

\min_{\theta_k} \; \mathcal{L}_{SS}(\theta_k, D_k) + \frac{\lambda}{|D_k|} \sum_{x \in D_k} \mathcal{D}\big( h_{\theta_k}(f_{\theta_k}(\mathcal{T}(x))), \, h_{\Theta^*}(f_{\Theta^*}(\mathcal{T}(x))) \big)    (7.5)
\quad \text{s.t. } \Theta^* \in \arg\min_{\Theta} \mathcal{L}_{SS}(\Theta, D)    (7.6)

In practice, \Theta can be optimized independently of \theta_k through the FedAvg [308] algorithm. To make the computation more efficient, we also apply the symmetrization trick proposed in [69]. We refer to this algorithm as Per-SSFL and provide a detailed description in Algorithm 5 (also illustrated in Figure 7.1).

Regarding the theoretical analysis. To our knowledge, none of the existing self-supervised learning frameworks has a complete theoretical analysis yet, particularly the SimSiam dual-neural-network architecture. Our formulation and optimization framework are nevertheless interpretable: they are built on an EM-like view of SimSiam and on minimizing the distance between the data representations of the private model and the global model.

Innovating baselines to verify SSFL. Note that we have not found any related works that explore a Siamese-like SSL architecture in an FL setting. As such, to investigate the performance of our proposed algorithm, we further propose several other algorithms that can leverage the SSFL framework.
1. LA-SSFL. We apply FedAvg [308] on the SimSiam loss \mathcal{L}_{SS} for each client to obtain a global model, and perform one step of SGD on each client's local data for local adaptation.
2. MAML-SSFL. This algorithm is inspired by perFedAvg [103] and views the personalization on each device as the inner loop of MAML [108]. It aims to learn an encoder that can be easily adapted to the client's local distribution. During inference, we perform one step of SGD on the global model for personalization.
3. BiLevel-SSFL. Inspired by Ditto [249], we learn personalized encoders on each client by restricting the parameters of all personalized encoders to be close to a global encoder independently learned by weighted aggregation.
More details of these algorithms, their formulations, and pseudo code are introduced in Appendix F.2. In Section 7.5.3, we show comparison results for these proposed SSFL algorithmic variants.

7.4 Training System and Evaluation Pipeline for SSFL

A distributed training system to accelerate algorithmic exploration in the SSFL framework. We also contribute to reproducible research via our distributed training system.
This is necessary for two reasons: (1) Running a stand-alone simulation (training clients sequentially, one by one), as most existing FL works do, requires a prohibitively long training time when training a large number of clients. In SSFL, the model size (e.g., ResNet-18 vs. the shallow CNNs used in the original FedAvg paper) and the number of rounds needed for convergence (e.g., 800 epochs in the centralized SimSiam framework) are relatively larger than in the FL literature. By running all clients in parallel on multiple CPUs/GPUs, we can largely accelerate the process. (2) Given that SSFL is a unified and generic learning framework, researchers may develop more advanced ways to improve our work. As such, we believe it is necessary to design unified APIs and system frameworks in line with the algorithmic aspects of SSFL. See Appendix I.3 for more details on our distributed training system.

Evaluation Pipeline. In the training phase, we use a KNN classifier [487] as an online indicator to monitor the quality of the representations generated by the SimSiam encoder. For Global-SSFL, we report the KNN test accuracy using the global model and the global test data, while in Per-SSFL, we evaluate all clients' local encoders separately with their local test data and report their averaged accuracy. After self-supervised training, to evaluate the performance of the trained encoder, we freeze the encoder and attach a linear classifier to its output. For Global-SSFL, we can easily verify the performance of the SimSiam encoder by training the attached linear classifier with FedAvg. However, for Per-SSFL, each client learns a personalized SimSiam encoder. As the representations encoded by personalized encoders might reside in different spaces, using a single linear classifier trained by FedAvg to evaluate these representations is unreasonable (see the experiments in Section 7.5.4.3). As such, we suggest an additional evaluation step to provide a more representative evaluation of Per-SSFL's performance: for each personalized encoder, we use the entire training data to train the linear classifier but evaluate on each client's local test data.

7.5 Experiments

In this section, we present experimental results for SSFL with and without personalization and provide a performance analysis covering a wide range of aspects, including the role of batch size, different degrees of non-I.I.D.ness, and understanding the evaluation protocol.

Implementation. We develop the SSFL training system to simplify and unify the algorithmic exploration. Details of the training system design can be found in Appendix I.3. We deploy the system in a distributed computing environment with 8 NVIDIA A100-SXM4 GPUs, which have sufficient memory (40 GB per GPU) to explore different batch sizes (Section 7.5.4.1). Our training framework can run multiple parallel training workers on a single GPU, so it supports federated training with a large number of clients. The number of clients selected per round in all experiments is 10, which is a reasonable setting suggested by recent literature [367].

Learning Task. Following SimCLR [68], SimSiam [69], BYOL [127], and SwAV [54] in the centralized setting, we evaluate SSL on the image classification task and use representative datasets for federated learning.

Dataset. We run experiments on a synthetic non-I.I.D. dataset, CIFAR-10, and an intrinsically non-I.I.D. dataset, Google Landmarks-23K (GLD-23K), which are suggested by multiple canonical works in the FL community [367, 153, 207]. For the non-I.I.D.
setting, we distribute the dataset using a Dirichlet distribution [176], which samples p_c ∼ Dir(α) (we assume a uniform prior distribution) and allocates a proportion p_{c,k} of the training samples of class c to local client k. We provide a visualization of the data distribution in Appendix F.3.3.

Model Architecture. For the model architecture, ResNet-18 is used as the backbone of the SimSiam framework, and the predictor is the same as that in the original paper. Next, we focus on results from the curated CIFAR-10 dataset and defer GLD-23K to Appendix F.3.1.

7.5.1 Comparisons on SimSiam, SimCLR, SwAV, and BYOL

Our first experiment determines which SSL method is ideal for FL settings. We run experiments using FedAvg for these four methods and obtain two findings: (1) SimSiam outperforms SimCLR in terms of accuracy; (2) BYOL and SwAV do not work in FL: we tested a wide range of hyper-parameters, but they are still unable to converge to a reasonable accuracy. These experimental results confirm our analysis in Section 7.3.2.

7.5.2 Evaluation on Global-SSFL

Figure 7.2: Training and Evaluation using SSFL ((a) training-time accuracy and (b) training loss for SSFL on I.I.D. and non-I.I.D. data).

The goal of this experiment is to understand the accuracy gap between supervised and self-supervised federated learning in both I.I.D. and non-I.I.D. settings, where we aim to train a global model from clients' private data.

Setup and Hyper-parameters. We evaluate Global-SSFL using non-I.I.D. data from CIFAR-10 and set α = 0.1 for the Dirichlet distribution. For supervised learning, the test accuracy is evaluated on a global test dataset. For self-supervised training, we follow the evaluation protocol introduced in Section 7.4. We use SGD with momentum as the client-side optimizer and a learning rate scheduler across communication rounds. We searched for the learning rate on the grid {0.1, 0.3, 0.01, 0.03} and report the best result. The experiment is run three times using the same learning rate with fixed random seeds to ensure reproducibility. The training lasts for 800 rounds, which is sufficient to achieve convergence for all methods. More hyperparameters are listed in Appendix F.3.4.

We display the training curves in Figure 7.2, which demonstrate that SSFL can converge reliably in both I.I.D. and non-I.I.D. settings. For the I.I.D. data, we find that SSFL can achieve the same accuracy as the centralized accuracy reported in the SimSiam paper [69]. For the non-I.I.D. data, SSFL achieves a reasonable accuracy compared to the centralized accuracy. The accuracy comparisons along different dimensions (supervised vs. self-supervised; I.I.D. vs. non-I.I.D.) are summarized in Table 7.1.

Table 7.1: Evaluation accuracy comparison between supervised FL and SSFL.
Accuracy | Supervised | Self-Supervised | Acc. Gap
I.I.D. | 0.932 | 0.914 | 0.018
non-I.I.D. | 0.8812 | 0.847 | 0.0342
Acc. Gap | 0.0508 | 0.06 | N/A

Table 7.2: Evaluation accuracy for various Per-SSFL methods.
Method | KNN Indicator | Evaluation
LA-SSFL | 0.9217 | 0.8013
MAML-SSFL | 0.9355 | 0.8235
BiLevel-SSFL | 0.9304 | 0.8137
Per-SSFL | 0.9388 | 0.8310

7.5.3 Evaluation on Per-SSFL

Based on the results of SSFL with FedAvg, we further add the personalization components for SSFL introduced in Section 7.3.3 (Per-SSFL).

Setup and Hyper-parameters. For a fair comparison, we evaluate Per-SSFL on non-I.I.D.
data from CIFAR-10 and set α = 0.1 for the Dirichlet distribution. For Per-SSFL training, we follow the evaluation protocol introduced in Section 7.4. As with SSFL, we use SGD with momentum as the client-side optimizer and a learning rate scheduler across communication rounds. We search for the learning rate on a grid of {0.1, 0.3, 0.01, 0.03} and report the best result. For Per-SSFL and BiLevel-SSFL, we also tune the λ of the regularization term over the search space {1, 10, 0.1, 0.01, 0.001}. The experiments are run three times with the same learning rate and with fixed random seeds to ensure reproducibility. The training also lasts for 800 communication rounds, the same as for Global-SSFL. Other hyperparameters can be found in Appendix F.3.4.

Figure 7.3: Training and Evaluation using SSFL ((a) training loss and (b) averaged personalized accuracy for LA-SSFL, MAML-SSFL, BiLevel-SSFL, and Per-SSFL).

We illustrate our results in Figure 7.3 and Table 7.2. To confirm convergence, we draw loss curves for all methods in Figure 7.3(a) (note that they have different scales due to the differences in their loss functions). Figure 7.3(b) indicates that Per-SSFL performs best among all methods. MAML-SSFL is also a recommended method since it obtains comparable accuracy. LA-SSFL is a practical method, but it does not perform well in the self-supervised setting; in Figure 7.3(b), the averaged personalized accuracy of LA-SSFL diverges in the latter phase. Based on BiLevel-SSFL's result, we conclude that such a method is not a strong candidate for personalization, even though it shares a similar objective function with Per-SSFL. This indicates that regularization through the representations encoded by SimSiam outperforms regularization through weights.

7.5.4 Performance Analysis

7.5.4.1 Role of Batch Size

Figure 7.4: Results for batch sizes (accuracy over rounds for batch sizes 8, 16, 32, 64, and 256).

FL typically requires a small batch size to enable practical training on resource-constrained edge devices. Therefore, understanding the role of batch size in SSFL is essential to practical deployment and applicability. To investigate this, we use different batch sizes and tune the learning rate to find the optimal accuracy for each one. The results in Figure 7.4 show that SSFL requires a large batch size (256); otherwise, the accuracy drops or training diverges. Since system efficiency is not the focus of this work, we use gradient accumulation, which is a simple yet effective method: we fix the batch size at 32 and use an accumulation step of 8 for all experiments. For an even larger batch size (e.g., 512), the memory cost is significant, though there is no notable gain in accuracy. Therefore, we discontinue the search for batch sizes larger than 256. More advanced methods include batch-size-one training and knowledge distillation; we defer this discussion to Appendix F.4.

7.5.4.2 On Different Degrees of Non-I.I.D.ness

We investigate the impact of the degree of data heterogeneity on SSFL performance. We compare the performance between α = 0.1 and α = 0.5. These two settings produce a non-negligible gap in the label distribution at each client (see our visualization in Appendix F.3.3). Figures 7.5(a) and 7.5(b) show the learning curve comparisons.
It is clearly observed that a higher degree of data heterogeneity slows convergence and adversely affects accuracy.

Figure 7.5: Evaluation on Different Degrees of Non-I.I.D.ness ((a) averaged personalized accuracy and (b) training loss for α = 0.1 and α = 0.5).

Figure 7.6: Understanding the Evaluation Protocol (accuracy over rounds for Per-SSFL and LA-SSFL under federated linear evaluation).

7.5.4.3 Understanding the Linear Evaluation of Personalized Encoders

As discussed in Section 7.4, in SSFL we can easily verify the quality of the SimSiam encoder using federated linear evaluation; however, in Per-SSFL, each client learns a personalized SimSiam encoder. Such heterogeneity across diverse encoders makes a fair evaluation difficult. To demonstrate this, we run experiments with naive federated linear evaluation on personalized encoders and surprisingly find that such an evaluation protocol degrades the performance. As shown in Figure 7.6, the federated linear evaluation for Per-SSFL performs worse than even LA-SSFL. This may be attributed to the fact that the naive aggregation drags the parameter spaces of all heterogeneous encoders close together, making the encoders degenerate in terms of personalization.

7.6 Related Works

Federated Learning (FL) with Personalization. pFedMe [90], perFedAvg [104], and Ditto [249] are some representative works in this direction. However, these methods all rest on the strong assumption that users can provide reliable annotations for their private and sensitive data, which we argue is unrealistic and impractical.

Label deficiency in FL. There are a few related works that tackle label deficiency in FL [288, 293, 187, 192, 265, 540, 538]. Compared to these works, our proposed SSFL does not use any labels during training. FedMatch [192] and FedCA [530] require additional communication costs to synchronize helper models or a public labeled dataset. [389] addresses the fully unsupervised challenge on small-scale sensor data in IoT devices. However, compared to our work, it uses the Siamese networks proposed around thirty years ago [44] and does not consider the advances of the past two years (i.e., SimCLR [68], SwAV [53], BYOL [127], and SimSiam [69]). Moreover, these works do not have any design for learning personalized models.

7.7 Conclusion

We propose the Self-supervised Federated Learning (SSFL) framework and a series of algorithms under this framework to address two challenges: data heterogeneity and label deficiency. SSFL works for both global model training and personalized model training. We conduct experiments on a synthetic non-I.I.D. dataset based on CIFAR-10 and the intrinsically non-I.I.D. GLD-23K dataset. Our experimental results demonstrate that SSFL works reliably and achieves reasonable evaluation accuracy that is suitable for use in various applications.

Chapter 8
LightSecAgg: Lightweight and Versatile Secure Aggregation

8.1 Introduction

Federated learning (FL) has emerged as a promising approach to enable distributed training over a large number of users while protecting the privacy of each user [308, 309, 469]. The key idea of FL is to keep users' data on their devices and instead train local models at each user. The locally trained models are then aggregated via a server to update a global model, which is then pushed back to users.
Due to model inversion attacks (e.g., [120, 474, 548]), a critical consideration in FL design is to also ensure that the server does not learn the locally trained model of each user during model aggregation. Furthermore, model aggregation should be robust against likely user dropouts (due to poor connectivity, low battery, unavailability, etc.) in FL systems. As such, there have been a series of works that aim at developing secure aggregation protocols for FL that protect the privacy of each user's individual model while allowing their global aggregation amidst possible user dropouts [37, 204, 423].

Figure 8.1: Illustration of our proposed LightSecAgg protocol. (1) Sharing encoded masks: users encode and share their generated local masks. (2) Masking models: each user masks its model with its random mask and uploads the masked model to the server. (3) Reconstructing the aggregate-mask: the surviving users upload aggregates of the encoded masks to reconstruct the desired aggregate-mask. The server recovers the aggregate-model by canceling out the reconstructed aggregate-mask.

The state-of-the-art secure aggregation protocols essentially rely on two main principles: (1) pairwise random-seed agreement between users to generate masks that hide users' models while having an additive structure that allows their cancellation when added at the server, and (2) secret sharing of the random seeds to enable the reconstruction and cancellation of masks belonging to dropped users. The main drawback of such approaches is that the number of mask reconstructions at the server grows substantially as more users drop, causing a major computational bottleneck. For instance, the execution time of the SecAgg protocol proposed in [37] is observed to be significantly limited by mask reconstructions at the server [38]. SecAgg+ [19], an improved version of SecAgg, reduces the overhead at the server by replacing the complete communication graph of SecAgg with a sparse random graph, such that secret sharing is only needed within a subset of users rather than all user pairs. However, the number of mask reconstructions in SecAgg+ still increases as more users drop, eventually limiting the scalability of FL systems. There have also been several other approaches, such as [423, 204], to alleviate this bottleneck; however, they either increase round/communication complexity or compromise the dropout and privacy guarantees.

Contributions. We propose a new perspective for secure model aggregation in FL by turning the design focus from "pairwise random-seed reconstruction of the dropped users" to "one-shot aggregate-mask reconstruction of the surviving users". Using this viewpoint, we develop a new protocol named LightSecAgg that provides the same level of privacy and dropout-resiliency guarantees as the state-of-the-art while substantially reducing the aggregation (hence runtime) complexity. As illustrated in Figure 8.1, the main idea of LightSecAgg is that each user protects its local model using a locally generated random mask. This mask is then encoded and shared with other users in such a way that the aggregate-mask of any sufficiently large set of surviving users can be directly reconstructed at the server.
In sharp contrast to prior schemes, in this approach the server only needs to reconstruct one mask in the recovery phase, independent of the number of dropped users.

Moreover, we provide a modular federated training system design and optimize on-device parallelization to improve efficiency when secure aggregation and model training interact at the edge devices. This enables computational overlap between model training and on-device encoding, as well as faster concurrent receiving and sending of chunked masks. To the best of our knowledge, this provides the first open-source, secure-aggregation-enabled FL system built on a modern deep learning framework (PyTorch) and modern neural architectures (e.g., ResNet) with system and security co-design. We further propose system-level optimization methods to improve the run-time. In particular, we design a federated training system that takes advantage of the fact that the generation of random masks is independent of the computation of the local model; hence each user can run these two operations in parallel, which benefits all evaluated secure aggregation protocols by reducing the total running time.

In addition to the synchronous FL setting, where all users train local models based on the same global model and the server performs a synchronized aggregation at each round, we also demonstrate that LightSecAgg enables secure aggregation when no synchrony between users' local updates is imposed. This is unlike prior secure aggregation protocols, such as SecAgg and SecAgg+, which are not compatible with asynchronous FL. To the best of our knowledge, in the asynchronous FL setting, this is the first work to protect the privacy of individual updates without relying on differential privacy [452] or trusted execution environments (TEEs) [330].

We run extensive experiments to empirically demonstrate the performance of LightSecAgg in a real-world cross-device FL setting with up to 200 users and compare it with two state-of-the-art protocols, SecAgg and SecAgg+. To provide comprehensive coverage of realistic FL settings, we train various machine learning models, including logistic regression, a convolutional neural network (CNN) [308], MobileNetV3 [170], and EfficientNet-B0 [435], for image classification over datasets of different image sizes: low-resolution images (FEMNIST [52], CIFAR-10 [223]) and high-resolution images (Google Landmarks Dataset 23k [479]). The empirical results show that LightSecAgg provides significant speedup for all considered FL training tasks, achieving a performance gain of 8.5×-12.7× over SecAgg and 2.9×-4.4× over SecAgg+ in realistic bandwidth settings at the users. Hence, compared to the baselines, LightSecAgg can even survive and speed up the training of large deep neural network models on high-resolution image datasets. Breakdowns of the total running time further confirm that the primary gain lies in the complexity reduction at the server provided by LightSecAgg, especially when the number of users is large.

Related works. Beyond the secure aggregation protocols proposed in [37, 19], there have also been other works that aim to make secure aggregation more efficient. TurboAgg [423] utilizes a circular communication topology to reduce the communication overhead, but it incurs an additional round complexity and provides a weaker privacy guarantee than SecAgg, as it guarantees model privacy in the average sense rather than in the worst case.
FastSecAgg [204] reduces the per-user overhead by using a Fast Fourier Transform-based multi-secret sharing scheme, but it provides weaker privacy and dropout guarantees compared to the other state-of-the-art protocols. The idea of one-shot reconstruction of the aggregate-mask was also employed in [539], where the aggregated masks corresponding to each user dropout pattern were prepared by a trusted third party, encoded, and distributed to users prior to model aggregation. The major advantages of LightSecAgg over the scheme in [539] are (1) not requiring a trusted third party, and (2) requiring significantly less randomness generation and a much smaller storage cost at each user. Furthermore, [539] lacks system-level performance evaluations in FL experiments. Finally, we emphasize that our LightSecAgg protocol can be applied to any aggregation-based FL approach (e.g., FedNova [471], FedProx [253], FedOpt [12]), personalized FL frameworks [433, 250, 104, 326, 162], communication-efficient FL [409, 369, 98], and asynchronous FL, and their applications in computer vision [149, 144, 142], natural language processing [268, 160], data mining [151, 102, 263, 161, 148], and the Internet of Things (IoT) [532, 533].

8.2 Problem Setting

FL is a distributed training framework for machine learning, where the goal is to learn a global model x with dimension d using data held at edge devices. This can be represented by minimizing a global objective function F: $F(x) = \sum_{i=1}^{N} p_i F_i(x)$, where N is the total number of users, $F_i$ is the local objective function of user i, and $p_i \geq 0$ is a weight parameter assigned to user i to specify the relative impact of each user, such that $\sum_{i=1}^{N} p_i = 1$. (For simplicity, we assume that all users have equal-sized datasets, i.e., $p_i = \frac{1}{N}$ for all $i \in [N]$.)

Training in FL is performed through an iterative process, where the users interact through a server to update the global model. At each iteration, the server shares the current global model, denoted by x(t), with the edge users. Each user i creates a local update, $x_i(t)$. The local models are sent to the server and then aggregated by the server. Using the aggregated models, the server updates the global model x(t+1) for the next iteration. In FL, some users may drop from the learning procedure for various reasons, such as unreliable communication connections. The goal of the server is to obtain the sum of the surviving users' local models. The update equation is given by $x(t+1) = \frac{1}{|\mathcal{U}(t)|} \sum_{i \in \mathcal{U}(t)} x_i(t)$, where $\mathcal{U}(t)$ denotes the set of surviving users at iteration t. Then, the server pushes the updated global model x(t+1) to the edge users.

Local models carry extensive information about the users' datasets, and in fact their private data can be reconstructed from the local models by using a model inversion attack [120, 474, 548]. To address this privacy leakage from local models, secure aggregation has been introduced in [37]. A secure aggregation protocol enables the computation of the aggregated global model while ensuring that the server (and other users) learn no information about the local models beyond their aggregate. In particular, the goal is to securely recover the aggregate of the local models $y = \sum_{i \in \mathcal{U}} x_i$, where the iteration index t is omitted for simplicity.
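As a point of reference before any masking is introduced, the sketch below simulates the plain (insecure) aggregation goal above: averaging only the local models of the users that survive a round. All names and shapes are illustrative.

```python
import numpy as np

def aggregate_surviving(local_models, surviving_users):
    # local_models: dict mapping user id -> local model vector x_i of length d
    # surviving_users: the ids in U(t) whose updates actually reached the server
    stacked = np.stack([local_models[i] for i in surviving_users])
    # x(t+1) = (1 / |U(t)|) * sum_{i in U(t)} x_i(t)
    return stacked.mean(axis=0)

# Example: 4 users with d = 5; user 2 drops during this round.
d = 5
models = {i: np.random.randn(d) for i in range(4)}
x_next = aggregate_surviving(models, surviving_users=[0, 1, 3])
```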
Since secure aggregation protocols build on cryptographic primitives that require all operations to be carried out over a finite field, we assume that the elements of $x_i$ and y are from a finite field $\mathbb{F}_q$ for some field size q. We require a secure aggregation protocol for FL to have the following key features.

• Threat model and privacy guarantee. We consider a threat model where the users and the server are honest but curious. We assume that up to T (out of N) users can collude with each other as well as with the server to infer the local models of other users. The secure aggregation protocol has to guarantee that nothing can be learned beyond the aggregate-model, even if up to T users cooperate with each other. We consider privacy leakage in the strong information-theoretic sense. This requires that for every subset of users $\mathcal{T} \subseteq [N]$ of size at most T, we must have the mutual information $I(\{x_i\}_{i \in [N]}; Y \mid \sum_{i \in \mathcal{U}} x_i, Z_{\mathcal{T}}) = 0$, where Y is the collection of information at the server and $Z_{\mathcal{T}}$ is the collection of information at the users in $\mathcal{T}$.

• Dropout-resiliency guarantee. In the FL setting, it is common for users to be dropped or delayed at any time during protocol execution for various reasons, e.g., delayed/interrupted processing, poor wireless channel conditions, low battery, etc. We assume that there are at most D dropped users during the execution of the protocol, i.e., there are at least N − D surviving users after potential dropouts. The protocol must guarantee that the server can correctly recover the aggregated models of the surviving users, even if up to D users drop.

• Applicability to asynchronous FL. Synchronizing all users for training at each round of FL can be slow and costly, especially when the number of users is large. Asynchronous FL handles this challenge by incorporating the updates of the users in an asynchronous fashion [489, 88, 55, 71]. This asynchrony, however, creates a mismatch of staleness among the users, which makes the existing secure aggregation protocols (such as [37, 19]) incompatible. More specifically, since it is not known a priori which local models will be aggregated together, the current secure aggregation protocols that are based on pairwise random masking among the users fail to work. We aim at designing a versatile secure aggregation protocol that is applicable to both synchronous and asynchronous FL.

Goal. We aim to design an efficient and scalable secure aggregation protocol that simultaneously achieves strong privacy and dropout-resiliency guarantees scaling linearly with the number of users N, e.g., simultaneously achieving a privacy guarantee of T = N/2 and a dropout-resiliency guarantee of D = N/2 − 1. Moreover, the protocol should be compatible with both synchronous and asynchronous FL.

8.3 Overview of Baseline Protocols: SecAgg and SecAgg+

We first review the state-of-the-art secure aggregation protocols SecAgg [37] and SecAgg+ [19] as our baselines. Essentially, SecAgg and SecAgg+ require each user to mask its local model using random keys before aggregation. In SecAgg, the privacy of the individual models is protected by pairwise random masking. Through a key agreement protocol (e.g., Diffie-Hellman [87]), each pair of users $i, j \in [N]$ agrees on a pairwise random seed $a_{i,j} = \text{Key.Agree}(sk_i, pk_j) = \text{Key.Agree}(sk_j, pk_i)$, where $sk_i$ and $pk_i$ are the private and public keys of user i, respectively.
In addition, user i creates a private random seed $b_i$ to prevent the privacy breaches that may occur if user i is only delayed rather than dropped, in which case the pairwise masks alone are not sufficient for privacy protection. User $i \in [N]$ then masks its model $x_i$ as $\tilde{x}_i = x_i + \text{PRG}(b_i) + \sum_{j: i<j} \text{PRG}(a_{i,j}) - \sum_{j: i>j} \text{PRG}(a_{j,i})$, where PRG is a pseudo-random generator, and sends it to the server. Finally, user i secret shares its private seed $b_i$ as well as its private key $sk_i$ with the other users via Shamir's secret sharing [518]. From the subset of users who survived the previous stage, the server collects either the shares of the private key belonging to a dropped user, or the shares of the private seed belonging to a surviving user (but not both). Using the collected shares, the server reconstructs the private seed of each surviving user and the pairwise seeds of each dropped user. The server then computes the aggregated model as

$\sum_{i \in \mathcal{U}} x_i = \sum_{i \in \mathcal{U}} \left( \tilde{x}_i - \text{PRG}(b_i) \right) + \sum_{i \in \mathcal{D}} \left( \sum_{j: i<j} \text{PRG}(a_{i,j}) - \sum_{j: i>j} \text{PRG}(a_{j,i}) \right)$,   (8.1)

where $\mathcal{U}$ and $\mathcal{D}$ represent the sets of surviving and dropped users, respectively. SecAgg protects model privacy against T colluding users and is robust to D user dropouts as long as N − D > T.

Figure 8.2: An illustration of SecAgg in the example of 3 users. The users first agree on pairwise random seeds, and secret share their private random seeds and private keys. The local models are protected by pairwise random masking. Suppose that user 1 drops. To recover the aggregate-mask, the server first reconstructs the private random seeds of the surviving users and the private key of user 1 by collecting the secret shares for each of them. Then, the server recovers $z_{1,2}$, $z_{1,3}$, $n_2$, and $n_3$, which incurs a computational cost of 4d at the server.

We now illustrate SecAgg through a simple example. Consider a secure aggregation problem in FL with N = 3 users, privacy guarantee T = 1, and dropout-resiliency guarantee D = 1. Each user $i \in \{1,2,3\}$ holds a local model $x_i \in \mathbb{F}_q^d$, where d is the model size and q is the size of the finite field. As shown in Figure 8.2, SecAgg is composed of the following three phases.

Offline pairwise agreement. User 1 and user 2 agree on pairwise random seed $a_{1,2}$. User 1 and user 3 agree on pairwise random seed $a_{1,3}$. User 2 and user 3 agree on pairwise random seed $a_{2,3}$. In addition, user $i \in \{1,2,3\}$ creates a private random seed $b_i$. Then, user i secret shares $b_i$ and its private key $sk_i$ with the other users via Shamir's secret sharing. In this example, a 2-out-of-3 secret sharing is used to tolerate 1 curious user.

Masking and uploading of local models. To protect the privacy of each individual model, user $i \in \{1,2,3\}$ masks its model $x_i$ as follows:

$\tilde{x}_1 = x_1 + n_1 + z_{1,2} + z_{1,3}$,
$\tilde{x}_2 = x_2 + n_2 + z_{2,3} - z_{1,2}$,
$\tilde{x}_3 = x_3 + n_3 - z_{1,3} - z_{2,3}$,

where $n_i = \text{PRG}(b_i)$ and $z_{i,j} = \text{PRG}(a_{i,j})$ are the random masks generated by a pseudo-random generator. Then user $i \in \{1,2,3\}$ sends its masked local model $\tilde{x}_i$ to the server.

Aggregate-model recovery. Suppose that user 1 drops in the previous phase. The goal of the server is to compute the aggregate of models $x_2 + x_3$. Note that

$x_2 + x_3 = \tilde{x}_2 + \tilde{x}_3 + (z_{1,2} + z_{1,3} - n_2 - n_3)$.   (8.2)

Hence, the server needs to reconstruct the masks $n_2$, $n_3$, $z_{1,2}$, $z_{1,3}$ to recover $x_2 + x_3$. To do so, the server has to collect two shares for each of $b_2$, $b_3$, $sk_1$, and then compute the aggregate model by (8.2).
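The following minimal NumPy sketch reproduces the 3-user example numerically and verifies the recovery in (8.2). The Diffie-Hellman key agreement and Shamir secret sharing are abstracted away (the seeds are simply given), and a seeded NumPy generator stands in for the PRG, so this illustrates only the masking arithmetic of the example, not the full SecAgg protocol.

```python
import numpy as np

q, d = 2**31 - 1, 8                 # toy field size and model dimension (illustrative)
prg = lambda seed: np.random.default_rng(seed).integers(0, q, size=d)  # stands in for PRG

# Pairwise seeds a_{i,j} and private seeds b_i (agreed/shared out of band here).
a = {(1, 2): 12, (1, 3): 13, (2, 3): 23}
b = {1: 101, 2: 102, 3: 103}
x = {i: np.arange(d) + 10 * i for i in (1, 2, 3)}    # toy local models in F_q

z = {key: prg(seed) for key, seed in a.items()}      # z_{i,j} = PRG(a_{i,j})
n = {i: prg(seed) for i, seed in b.items()}          # n_i = PRG(b_i)

x_masked = {
    1: (x[1] + n[1] + z[(1, 2)] + z[(1, 3)]) % q,
    2: (x[2] + n[2] + z[(2, 3)] - z[(1, 2)]) % q,
    3: (x[3] + n[3] - z[(1, 3)] - z[(2, 3)]) % q,
}

# User 1 drops: the server reconstructs b_2, b_3 (surviving users) and sk_1
# (dropped user) from the secret shares, regenerates the four masks, and
# applies equation (8.2).
recovered = (x_masked[2] + x_masked[3]
             + z[(1, 2)] + z[(1, 3)] - n[2] - n[3]) % q
assert np.array_equal(recovered, (x[2] + x[3]) % q)
```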
Since the complexity of evaluating a PRG scales linearly with its output size, the computational cost of the server for mask reconstruction in this example is 4d. We note that SecAgg requires the server to compute a PRG function on each of the reconstructed seeds to recover the aggregated masks, which incurs an overhead of $O(N^2)$ (see more details in Section 8.5) and dominates the overall execution time of the protocol [37, 38]. SecAgg+ reduces the overhead of mask reconstructions from $O(N^2)$ to $O(N \log N)$ by replacing the complete communication graph of SecAgg with a sparse random graph of degree $O(\log N)$ to reduce both communication and computational loads. Reconstructing pairwise random masks in SecAgg and SecAgg+ poses a major bottleneck in scaling to a large number of users.

Remark 1. (Incompatibility of SecAgg and SecAgg+ with Asynchronous FL). It is important to note that SecAgg and SecAgg+ cannot be applied to asynchronous FL, as the cancellation of the pairwise random masks based on the key agreement protocol is not guaranteed. This is because the users do not know a priori which local models will be aggregated together, hence the masks cannot be designed to cancel out in these protocols. We explain this in more detail in Appendix G.6.1. It is also worth noting that a recently proposed protocol known as FedBuff [330] enables secure aggregation in asynchronous FL through a trusted execution environment (TEE)-enabled buffer, where the server stores the local models that it receives in this private buffer. The reliance of FedBuff on TEEs, however, limits the buffer size in this approach, as TEEs have limited memory. It would also limit its application to FL settings where TEEs are available.

Figure 8.3: An illustration of LightSecAgg in the example of 3 users. Each user first generates a single mask. Each user's mask is encoded and shared with the other users. Each user's local model is protected by its generated mask. Suppose that user 1 drops during the execution of the protocol. The server directly recovers the aggregate-mask in one shot. In this example, LightSecAgg reduces the computational cost at the server from 4d to d.

8.4 LightSecAgg Protocol

Before providing a general description of LightSecAgg, we first illustrate its key ideas through the previous 3-user example in the synchronous setting. As shown in Figure 8.3, LightSecAgg has the following three phases.

Offline encoding and sharing of local masks. User $i \in \{1,2,3\}$ randomly picks $z_i$ and $n_i$ from $\mathbb{F}_q^d$. User $i \in \{1,2,3\}$ creates the masked versions of $z_i$ as

$\tilde{z}_{1,1} = -z_1 + n_1$, $\tilde{z}_{1,2} = 2z_1 + n_1$, $\tilde{z}_{1,3} = z_1 + n_1$;
$\tilde{z}_{2,1} = -z_2 + n_2$, $\tilde{z}_{2,2} = 2z_2 + n_2$, $\tilde{z}_{2,3} = z_2 + n_2$;
$\tilde{z}_{3,1} = -z_3 + n_3$, $\tilde{z}_{3,2} = 2z_3 + n_3$, $\tilde{z}_{3,3} = z_3 + n_3$;

and user $i \in \{1,2,3\}$ sends $\tilde{z}_{i,j}$ to each user $j \in \{1,2,3\}$. Thus, user $i \in \{1,2,3\}$ receives $\tilde{z}_{j,i}$ for $j \in \{1,2,3\}$. This procedure provides robustness against 1 dropped user and privacy against 1 curious user.

Masking and uploading of local models. To make each individual model private, each user $i \in \{1,2,3\}$ masks its local model as

$\tilde{x}_1 = x_1 + z_1$, $\tilde{x}_2 = x_2 + z_2$, $\tilde{x}_3 = x_3 + z_3$,   (8.3)

and sends its masked model to the server.

One-shot aggregate-model recovery. Suppose that user 1 drops in the previous phase. To recover the aggregate of models $x_2 + x_3$, the server only needs to know the aggregated mask $z_2 + z_3$.
To recover $z_2 + z_3$, the surviving users 2 and 3 send

$\tilde{z}_{2,2} + \tilde{z}_{3,2} = 2(z_2 + z_3) + n_2 + n_3$ and $\tilde{z}_{2,3} + \tilde{z}_{3,3} = (z_2 + z_3) + n_2 + n_3$

to the server, respectively. After receiving the messages from user 2 and user 3, the server can directly recover the aggregated mask via a one-shot computation:

$z_2 + z_3 = \tilde{z}_{2,2} + \tilde{z}_{3,2} - (\tilde{z}_{2,3} + \tilde{z}_{3,3})$.   (8.4)

Then, the server recovers the aggregate-model $x_2 + x_3$ by subtracting $z_2 + z_3$ from $\tilde{x}_2 + \tilde{x}_3$. As opposed to SecAgg, which has to reconstruct the random seeds of the dropped users, LightSecAgg enables the server to reconstruct the desired aggregate of masks via a one-shot recovery. Compared with SecAgg, LightSecAgg reduces the server's computational cost from 4d to d in this simple example.

8.4.1 General Description of LightSecAgg for Synchronous FL

We formally present LightSecAgg, whose idea is to encode the locally generated random masks in a way that allows the server to recover the aggregate of masks from the encoded masks via a one-shot computation with a cost that does not scale with N. LightSecAgg has three design parameters: (1) $0 \leq T \leq N - 1$, representing the privacy guarantee; (2) $0 \leq D \leq N - 1$, representing the dropout-resiliency guarantee; and (3) $1 \leq U \leq N$, representing the targeted number of surviving users. The parameters T, D, and U are selected such that $N - D \geq U > T \geq 0$.

LightSecAgg is composed of three main phases. First, each user partitions its local random mask into U − T pieces and creates encoded masks via a Maximum Distance Separable (MDS) code [384, 527, 442, 422] to provide robustness against D dropped users and privacy against T colluding users. Each user sends one of the encoded masks to each of the other users for the purpose of one-shot recovery. Second, each user uploads its masked local model to the server. Third, the server reconstructs the aggregated mask of the surviving users to recover their aggregate of models: each surviving user sends its aggregate of encoded masks to the server, and after receiving U aggregated encoded masks from the surviving users, the server recovers the aggregate-mask and the desired aggregate-model. The pseudo-code of LightSecAgg is provided in Appendix G.1. We now describe each of these phases in detail.

Offline encoding and sharing of local masks. User $i \in [N]$ picks $z_i$ uniformly at random from $\mathbb{F}_q^d$ and partitions it into U − T sub-masks $[z_i]_k \in \mathbb{F}_q^{d/(U-T)}$, $k \in [U-T]$. With randomly picked $[n_i]_k \in \mathbb{F}_q^{d/(U-T)}$ for $k \in \{U-T+1, \ldots, U\}$, user $i \in [N]$ encodes the sub-masks $[z_i]_k$ as

$[\tilde{z}_i]_j = ([z_i]_1, \ldots, [z_i]_{U-T}, [n_i]_{U-T+1}, \ldots, [n_i]_U) \cdot W_j$,   (8.5)

where $W_j$ is the j-th column of a T-private MDS matrix $W \in \mathbb{F}_q^{U \times N}$. (A matrix $W \in \mathbb{F}_q^{U \times N}$ with U < N is an MDS matrix if every U × U sub-matrix of W is non-singular.) In particular, we say an MDS matrix is T-private iff the submatrix consisting of its $\{U-T+1, \ldots, U\}$-th rows is also MDS. A T-private MDS matrix guarantees that $I(z_i; \{[\tilde{z}_i]_j\}_{j \in \mathcal{T}}) = 0$ for any $i \in [N]$ and any $\mathcal{T} \subseteq [N]$ of size T, provided that the $[n_i]_k$'s are jointly uniformly random. We can always find T-private MDS matrices for any U, N, and T (e.g., [401, 527, 384]). Each user $i \in [N]$ sends $[\tilde{z}_i]_j$ to user $j \in [N] \setminus \{i\}$. At the end of this phase, each user $i \in [N]$ holds $[\tilde{z}_j]_i$ for all $j \in [N]$. (All users communicate through secure, i.e., private and authenticated, channels, so the server would only observe encrypted versions of the $[\tilde{z}_i]_j$'s. Such secure communication is also used in prior works on secure aggregation, e.g., SecAgg and SecAgg+.)

Masking and uploading of local models. To protect the local models, each user i masks its local model as $\tilde{x}_i = x_i + z_i$ and sends it to the server. Since some users may drop in this phase, the server identifies the set of surviving users, denoted by $\mathcal{U}_1 \subseteq [N]$.
The server intends to recover $\sum_{i \in \mathcal{U}_1} x_i$. We note that before masking the model, each user quantizes the local model to convert it from the domain of real numbers to the finite field (Appendix G.6.4).

One-shot aggregate-model recovery. After identifying the surviving users in the previous phase, each user $j \in \mathcal{U}_1$ is notified to send its aggregated encoded sub-masks $\sum_{i \in \mathcal{U}_1} [\tilde{z}_i]_j$ to the server for the purpose of one-shot recovery. We note that each $\sum_{i \in \mathcal{U}_1} [\tilde{z}_i]_j$ is an encoded version of $\sum_{i \in \mathcal{U}_1} [z_i]_k$ for $k \in [U-T]$ using the MDS matrix W (see more details in Appendix G.2). Thus, the server is able to recover $\sum_{i \in \mathcal{U}_1} [z_i]_k$ for $k \in [U-T]$ via MDS decoding after receiving any U messages from the participating users. The server obtains the aggregated mask $\sum_{i \in \mathcal{U}_1} z_i$ by concatenating the $\sum_{i \in \mathcal{U}_1} [z_i]_k$'s. Lastly, the server recovers the desired aggregate of models for the set of participating users $\mathcal{U}_1$ by subtracting $\sum_{i \in \mathcal{U}_1} z_i$ from $\sum_{i \in \mathcal{U}_1} \tilde{x}_i$.

Remark 2. Note that it is not necessary to have a stable communication link between every pair of users in LightSecAgg. Specifically, given the design parameter U, LightSecAgg only requires at least U surviving users at any time during the execution. That is, even if up to N − U users drop or get delayed due to unstable communication links, the server can still reconstruct the aggregate-mask.

Remark 3. We note that LightSecAgg directly applies to secure aggregation of weighted local models. The sharing of the masking keys among the clients does not require knowledge of the weight coefficients. For example, LightSecAgg can work for the case in which users do not have equal-sized datasets. Suppose that user i holds a dataset with $s_i$ samples. Rather than directly masking the local model $x_i$, user i first computes $x'_i = s_i x_i$. Then, user i uploads $x'_i + z_i$ to the server. Through the LightSecAgg protocol, the server can recover $\sum_{i \in \mathcal{U}} x'_i = \sum_{i \in \mathcal{U}} s_i x_i$ securely. By dividing by $\sum_{i \in \mathcal{U}} s_i$, the server obtains the desired aggregate of weighted models $\sum_{i \in \mathcal{U}} p_i x_i$, where $p_i = \frac{s_i}{\sum_{i \in \mathcal{U}} s_i}$.
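To make the encoding in (8.5) and the one-shot recovery concrete, here is a small self-contained NumPy sketch under illustrative assumptions: a toy prime field, N = 4 users with U = 3 and T = 1, a Vandermonde matrix (distinct nonzero evaluation points) as the T-private MDS matrix, and plain Gauss-Jordan elimination over the field for MDS decoding. It mirrors the three phases described above but is only a sketch, not the optimized LightSecAgg implementation (no quantization, PRGs, or secure channels).

```python
import numpy as np

q = 2**13 - 1            # small prime field size for the demo
N, U, T = 4, 3, 1        # users, target surviving users, privacy parameter
d = 6                    # model dimension, divisible by U - T
k_sub = d // (U - T)     # length of each sub-mask
rng = np.random.default_rng(0)

def mat_inv_mod(A, p):
    # Gauss-Jordan inverse of a square matrix over F_p (p prime).
    n = A.shape[0]
    M = np.concatenate([A % p, np.eye(n, dtype=np.int64)], axis=1)
    for col in range(n):
        piv = col + next(r for r in range(n - col) if M[col + r, col] != 0)
        M[[col, piv]] = M[[piv, col]]
        M[col] = (M[col] * pow(int(M[col, col]), -1, p)) % p
        for r in range(n):
            if r != col:
                M[r] = (M[r] - M[r, col] * M[col]) % p
    return M[:, n:]

# T-private MDS matrix: Vandermonde columns W_j = (1, a_j, ..., a_j^{U-1})
# with distinct nonzero evaluation points a_j.
alphas = np.arange(1, N + 1, dtype=np.int64)
W = np.array([[pow(int(a), r, q) for a in alphas] for r in range(U)], dtype=np.int64)

# Each user i holds [z_i | n_i]: U - T random sub-masks plus T random fillers.
S = {i: rng.integers(0, q, size=(k_sub, U)) for i in range(N)}
z = {i: S[i][:, :U - T].reshape(-1, order="F") for i in range(N)}   # concatenated z_i
x = {i: rng.integers(0, q, size=d) for i in range(N)}               # toy local models
x_masked = {i: (x[i] + z[i]) % q for i in range(N)}

enc = {(i, j): (S[i] @ W[:, j]) % q for i in range(N) for j in range(N)}  # [z~_i]_j

# Suppose user 0 drops. Each surviving user j sends sum_{i in U_1} [z~_i]_j,
# and the server decodes the aggregate mask in one shot.
U1 = [1, 2, 3]
agg_enc = {j: sum(enc[(i, j)] for i in U1) % q for j in U1}
Q = np.stack([agg_enc[j] for j in U1], axis=1)                      # (k_sub, U)
S_sum = (Q @ mat_inv_mod(W[:, U1], q)) % q                          # = sum_i [z_i | n_i]
z_sum = S_sum[:, :U - T].reshape(-1, order="F")                     # aggregate mask
x_sum = (sum(x_masked[i] for i in U1) - z_sum) % q                  # recovered aggregate
assert np.array_equal(x_sum, sum(x[i] for i in U1) % q)
```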
8.4.2 Extension to Asynchronous FL

We now describe how LightSecAgg can be applied to asynchronous FL. We consider the asynchronous FL setting with bounded staleness, as considered in [330], where the updates of the users are not synchronized and the staleness of each user is bounded by $\tau_{\max}$. In this setting, the server stores the models that it receives in a buffer of size K and updates the global model once the buffer is full. More generally, LightSecAgg may apply to any asynchronous FL setting where a group of local models is aggregated at each round; that is, the group size does not need to be fixed across rounds. While the baselines are not compatible with this setting, LightSecAgg can be applied by encoding the local masks in a way that allows the server to recover the aggregate of masks from the encoded masks via a one-shot computation, even though the masks are generated in different training rounds. Specifically, the users share the encoded masks together with timestamps to determine which encoded masks should be aggregated for the reconstruction of the aggregate of masks. Since the users aggregate the encoded masks after the server stores the local updates in the buffer, they can aggregate the encoded masks according to the timestamps of the stored updates. Due to the commutativity of MDS encoding and addition, the server can reconstruct the aggregate of masks even though the masks were generated in different training rounds. We postpone the detailed description of the LightSecAgg protocol for the asynchronous setting to Appendix G.5.

8.5 Theoretical Analysis

8.5.1 Theoretical Guarantees

We now state our main result for the theoretical guarantees of the LightSecAgg protocol.

Theorem 2. Consider a secure aggregation problem in federated learning with N users. The proposed LightSecAgg protocol can simultaneously achieve (1) a privacy guarantee against up to any T colluding users, and (2) a dropout-resiliency guarantee against up to any D dropped users, for any pair of privacy guarantee T and dropout-resiliency guarantee D such that T + D < N.

The proof of Theorem 2, which applies to both synchronous and asynchronous FL settings, is presented in Appendix G.2.

Remark 4. Theorem 2 provides a trade-off between privacy and dropout-resiliency guarantees, i.e., LightSecAgg can increase the privacy guarantee by reducing the dropout-resiliency guarantee and vice versa. Like SecAgg [37], LightSecAgg achieves a worst-case dropout-resiliency guarantee. That is, for any privacy guarantee T and number of dropped users D < N − T, LightSecAgg ensures that any set of dropped users of size D can be tolerated in secure aggregation. In contrast, SecAgg+ [19], FastSecAgg [204], and TurboAgg [423] relax the worst-case constraint to random dropouts and provide a probabilistic dropout-resiliency guarantee, i.e., the desired aggregate-model can be correctly recovered with high probability.

Remark 5. From the training convergence perspective, LightSecAgg only adds a quantization step to the local model updates of the users. The impact of this model quantization on FL convergence is well studied in synchronous FL [369, 98]. In asynchronous FL, however, we need to analyze the convergence of LightSecAgg. We provide this analysis in the smooth and non-convex setting in Appendix G.7.

8.5.2 Complexity Analysis of LightSecAgg

We measure the storage cost, communication load, and computational load of LightSecAgg in units of elements or operations in $\mathbb{F}_q$ for a single training round. Recall that U is a design parameter chosen such that $N - D \geq U > T$.

Offline storage cost. Each user i independently generates a random mask $z_i$ of length d. Additionally, each user i stores a coded mask $[\tilde{z}_j]_i$ of size $\frac{d}{U-T}$ for each $j \in [N]$. Hence, the total offline storage cost at each user is $(1 + \frac{N}{U-T})d$.

Offline communication and computation loads. For each iteration of secure aggregation, before the local model is computed, each user prepares offline coded random masks and distributes them to the other users. Specifically, each user encodes U local data segments, each of size $\frac{d}{U-T}$, into N coded segments and distributes each of them to one of the N users. Hence, the offline computational and communication loads of LightSecAgg at each user are $O(\frac{dN \log N}{U-T})$ and $O(\frac{dN}{U-T})$, respectively.

Communication load during aggregation. While each user uploads a masked model of length d, in the phase of aggregate-model recovery, no matter how many users drop, each surviving user in $\mathcal{U}_1$ sends a coded mask of size $\frac{d}{U-T}$.
The server is guaranteed to recover the aggregate-model of the surviving users in $\mathcal{U}_1$ after receiving messages from any U users. The total required communication load at the server in the phase of mask recovery is therefore $\frac{U}{U-T}d$.

Computation load during aggregation. The major computational bottleneck of LightSecAgg is the decoding process to recover $\sum_{j \in \mathcal{U}_1} z_j$ at the server. This involves decoding a U-dimensional MDS code from U coded symbols, which can be performed with $O(U \log U)$ operations on elements in $\mathbb{F}_q$, hence a total computational load of $\frac{U \log U}{U-T}d$.

Table 8.1: Complexity comparison between SecAgg, SecAgg+, and LightSecAgg. Here N is the total number of users, d is the model size, and s is the length of the secret keys used as seeds for the PRG (s ≪ d). In the table, U stands for user and S stands for server.

                      SecAgg             SecAgg+                      LightSecAgg
offline comm. (U)     O(sN)              O(s log N)                   O(d)
offline comp. (U)     O(dN + sN^2)       O(d log N + s log^2 N)       O(d log N)
online comm. (U)      O(d + sN)          O(d + s log N)               O(d)
online comm. (S)      O(dN + sN^2)       O(dN + sN log N)             O(dN)
online comp. (U)      O(d)               O(d)                         O(d)
reconstruction (S)    O(dN^2)            O(dN log N)                  O(d log N)

We compare the communication and computational complexities of LightSecAgg with the baseline protocols. In particular, we consider the case where the secure aggregation protocols aim at simultaneously providing a privacy guarantee $T = \frac{N}{2}$ and a dropout-resiliency guarantee $D = pN$ for some $0 \leq p < \frac{1}{2}$. As shown in Table 8.1, by choosing $U = (1-p)N$, LightSecAgg significantly improves the computational efficiency at the server during aggregation. SecAgg and SecAgg+ incur total computational loads of $O(dN^2)$ and $O(dN \log N)$ at the server, respectively, while the server complexity of LightSecAgg remains nearly constant with respect to N. This is expected to substantially reduce the overall aggregation time for a large number of users, which is bottlenecked by the server's computation in SecAgg [37, 38]. More detailed discussions, as well as a comparison with another recently proposed secure aggregation protocol [539] that achieves a similar server complexity as LightSecAgg, are carried out in Appendix G.2.1.

8.6 System Design and Optimization

Apart from the theoretical design and analysis, we have further designed an FL training system to reduce the overhead of secure model aggregation and enable realistic evaluation of LightSecAgg in cross-device FL.

Figure 8.4: Overview of the system design. The foundation layer provides the tensor-aware communicator (PyTorch RPC, gRPC) and the training engines (standard PyTorch and ARM-based PyTorch); the algorithm layer contains the Client Manager (trainer and client encoder) and the Server Manager (secure aggregator with a cache for masked models and the reconstruction decoder), supported by security primitive APIs (key agreement, secret sharing, MDS encoding/decoding, pseudo-random generator). Circled numbers 1-7 mark the sequential steps of a single FL round.

The software architecture is shown in Figure 8.4. To keep the software architecture lightweight and maintainable, we do not over-design; we modularize the system into only a foundation layer and an algorithm layer. The foundation layer (blocks below the dashed line) contains the communicator and training engine.
The communicator supports multiple communication protocols (PyTorch RPC [355] and gRPC [128]) but provides a unified communication interface to the algorithm layer. In the training engine, in addition to standard PyTorch for GPUs, we also compile ARM-based PyTorch for embedded edge devices (e.g., Raspberry Pi). In the algorithm layer, the Client Manager calls the Trainer in the foundation layer to perform on-device training. The Client Manager also integrates the Client Encoder to complete the secure aggregation protocol, which is supported by security primitive APIs. In the Server Manager, the Secure Aggregator maintains a cache for masked models; once the cache is full, it starts reconstruction based on the aggregated masks uploaded by clients. The server then synchronizes the updated global model to clients for the next round of training. In Figure 8.4, we mark the 7 sequential steps in a single FL round as circled numbers to clearly show the interplay between federated training and the secure aggregation protocol.

This software architecture has two special designs that further reduce the computational and communication overhead of the secure aggregation protocol.

Parallelization of the offline phase and model training. We note that for all considered protocols (LightSecAgg, SecAgg, and SecAgg+), the communication and computation time to generate and exchange the random masks in the offline phase can be overlapped with model training. Hence, in our design, we reduce the offline computation and communication overhead by allowing each user to train the model and carry out the offline phase simultaneously by running two parallel processes (multi-threading performs relatively worse due to the Python GIL, the Global Interpreter Lock), shown in purple and red in Figure 8.4. We also show the timing diagram of the overlapped implementation in a single FL training round in Figure 8.5. We analyze its impact on overall acceleration in Section 8.7.2.

Figure 8.5: Timing diagrams of the (a) non-overlapped and (b) overlapped implementations in LightSecAgg and SecAgg+ [19] for a single FL training round to train MobileNetV3 [170] on the CIFAR-100 dataset [223]. SecAgg [37] is not included, as it takes much longer than the other two protocols.

Optimized federated training system and communication APIs via tensor-aware RPC (Remote Procedure Call). As the yellow blocks in Figure 8.4 show, we specially design the sending and receiving queues to accelerate the scenario in which a device has to be a sender and a receiver simultaneously.
As such, the offline phase of LightSecAgg can be further accelerated by parallelizing the transmission and reception of $[\tilde{z}_i]_j$. This design can also speed up the offline pairwise agreement in SecAgg and SecAgg+. Moreover, we choose PyTorch RPC [355] as the communication backend rather than gRPC [128] or MPI [335] because its tensor-aware communication API reduces latency in scenarios where the communicator is launched frequently, i.e., each client in the offline mask-exchange phase needs to distribute N coded segments to N users. With the above design, we can deploy LightSecAgg on both embedded IoT devices and AWS EC2 instances. AWS EC2 instances can also represent a realistic cross-device setting because, in our experiments, we use AWS EC2 m3.medium instances, which are CPU-based and have the same hardware configuration as modern smartphones such as iOS and Android devices. Furthermore, we package our system as a Docker image to simplify system deployment to hundreds of edge devices.

Figure 8.6: Total running time of LightSecAgg versus the state-of-the-art protocols (SecAgg and SecAgg+) to train the CNN [308] on the FEMNIST dataset [52] as the number of users increases, for various dropout rates, in the (a) non-overlapped and (b) overlapped implementations.

8.7 Experimental Results

8.7.1 Setup

Dataset and models. To provide comprehensive coverage of realistic FL settings, we train four models over computer vision datasets of different sizes, summarized in Table 8.2. The hyper-parameter settings are provided in Appendix G.3.

Table 8.2: Summary of the four implemented machine learning tasks and the performance gain of LightSecAgg with respect to SecAgg and SecAgg+ (each gain cell lists the gain over SecAgg, then over SecAgg+). All learning tasks are image classification. MNIST, FEMNIST, and CIFAR-10 are low-resolution datasets, while images in GLD-23K are high resolution and thus require much longer training time; LR and the CNN are shallow models, whereas MobileNetV3 and EfficientNet-B0 are much larger models tailored for efficient edge training and inference.

No.  Dataset          Model                   Model Size (d)   Gain (non-overlapped)   Gain (overlapped)   Gain (aggregation-only)
1    MNIST [231]      Logistic Regression     7,850            6.7×, 2.5×              8.0×, 2.9×          13.0×, 4.1×
2    FEMNIST [52]     CNN [308]               1,206,590        11.3×, 3.7×             12.7×, 4.1×         13.2×, 4.2×
3    CIFAR-10 [223]   MobileNetV3 [170]       3,111,462        7.6×, 2.8×              9.5×, 3.3×          13.1×, 3.9×
4    GLD-23K [479]    EfficientNet-B0 [435]   5,288,548        3.3×, 1.6×              3.4×, 1.7×          13.0×, 4.1×

Dropout rate. To model the dropped users, we randomly select pN users, where p is the dropout rate. We consider the worst-case scenario [37], where the selected pN users artificially drop after uploading their masked models. All three protocols provide a privacy guarantee of $T = \frac{N}{2}$ and resiliency for three different dropout rates, p = 0.1, p = 0.3, and p = 0.5, which are realistic values according to industrial observations of real FL systems [39]: even when carefully selecting devices that are likely to remain online during the training period, the dropout rate can be as high as 10%; when considering intermittently connected devices, only up to 10K devices can participate simultaneously out of 10M daily active devices (a 1:1000 ratio).

Number of users and communication bandwidth. In our experiments, we train up to N = 200 users. The measured real bandwidth is 320 Mbps. We also consider two other bandwidth settings, 4G (LTE-A) and 5G cellular networks, as discussed later.

Baselines. We analyze and compare the performance of LightSecAgg with two baseline schemes: SecAgg and SecAgg+, described in Section 8.3. While there are also other secure aggregation protocols (e.g., TurboAgg [423] and FastSecAgg [204]), we use SecAgg and SecAgg+ as our baselines since the other schemes weaken the privacy guarantees, as discussed in the Related Works part of Section 8.1.

8.7.2 Overall Evaluation and Performance Analysis

For the performance analysis, we measure the total running time for a single round of global iteration, which includes model training and secure aggregation with each protocol, while gradually increasing the number of users N for different user dropout rates. Our results from training the CNN [308] on the FEMNIST dataset [52] are shown in Figure 8.6. The performance gain of LightSecAgg with respect to SecAgg and SecAgg+ for training the other models is also provided in Table 8.2. More detailed experimental results are provided in Appendix G.3.
We make the following key observations.

Impact of dropout rate: the total running time of SecAgg and SecAgg+ increases monotonically with the dropout rate. This is because their total running time is dominated by the mask recovery at the server, which increases quadratically with the number of users.

Non-overlapped vs. overlapped implementations: in the non-overlapped implementation, LightSecAgg provides a speedup of up to 11.3× and 3.7× over SecAgg and SecAgg+, respectively, by significantly reducing the server's execution time; in the overlapped implementation, LightSecAgg provides a further speedup of up to 12.7× and 4.1× over SecAgg and SecAgg+, respectively. This is because LightSecAgg requires more communication and a higher computational cost in the offline phase than the baseline protocols, and the overlapped implementation helps to mitigate this extra cost.

Impact of model size: LightSecAgg provides a significant speedup of the aggregate-model recovery phase at the server over the baseline protocols for all considered model sizes. When training EfficientNet-B0 on the GLD-23K dataset, LightSecAgg provides the smallest speedup in this most training-intensive task. This is because training time is dominant in this task, and training takes almost the same time in LightSecAgg and the baseline protocols.

Aggregation-only: when comparing the aggregation time only, the speedup remains the same for various model sizes, as shown in Table 8.2. We note that speeding up the aggregation phase by itself is still very important because the local training and aggregation phases do not necessarily happen one immediately after the other. For example, local training may be done sporadically and opportunistically throughout the day (whenever resources are available), while global aggregation may be postponed to a later time when a large fraction of the users are done with local training and are available for aggregation (e.g., 2 am).

Impact of U: LightSecAgg incurs the smallest running time for the case p = 0.3, which is almost identical to the case p = 0.1. Recall that LightSecAgg can select the design parameter U between T = 0.5N and N − D = (1 − p)N. Within this range, while increasing U reduces the size of the symbol to be decoded, it also increases the complexity of decoding each symbol. The experimental results suggest that the optimal choices for the cases p = 0.1 and p = 0.3 are both $U = \lfloor 0.7N \rfloor$, which leads to faster execution than in the case p = 0.5, where U can only be chosen as U = 0.5N + 1.

Impact of bandwidth: we have also analyzed the impact of communication bandwidth at the users. In addition to the default bandwidth setting used in this section, we have considered two other edge scenarios, 4G (LTE-A) and 5G cellular networks, using realistic bandwidth settings of 98 and 802 Mbps, respectively [318, 396]. The results are reported in Table 8.3 for a single FL round to train the CNN over FEMNIST.

Table 8.3: Performance gain in different bandwidth settings.

Protocols    4G (98 Mbps)    320 Mbps    5G (802 Mbps)
SecAgg       8.5×            12.7×       13.5×
SecAgg+      2.9×            4.1×        4.4×

Table 8.4: Breakdown of the running time (sec) of LightSecAgg and the state-of-the-art protocols (SecAgg [37] and SecAgg+ [19]) to train the CNN [308] on the FEMNIST dataset [52] with N = 200 users, for dropout rates p = 10%, 30%, 50%.
Protocol       Phase       Non-overlapped              Overlapped
                           p=10%    p=30%    p=50%     p=10%    p=30%    p=50%
LightSecAgg    Offline     69.3     69.0     191.2     75.1     74.9     196.9
               Training    22.8 (same across settings)
               Uploading   12.4     12.2     21.6      12.6     12.0     21.4
               Recovery    40.9     40.7     64.5      40.7     41.0     64.9
               Total       145.4    144.7    300.1     123.4    127.3    283.2
SecAgg         Offline     95.6     98.6     102.6     101.2    102.3    101.3
               Training    22.8 (same across settings)
               Uploading   10.7     10.9     11.0      10.9     10.8     11.2
               Recovery    911.4    1499.2   2087.0    911.2    1501.3   2086.8
               Total       1047.5   1631.5   2216.4    1030.3   1614.4   2198.9
SecAgg+        Offline     67.9     68.1     69.2      73.9     73.8     74.2
               Training    22.8 (same across settings)
               Uploading   10.7     10.8     10.7      10.7     10.8     10.9
               Recovery    379.1    436.7    495.5     378.9    436.7    497.3
               Total       470.5    538.4    608.2     463.6    521.3    582.4

8.7.3 Performance Breakdown

To further investigate the primary gain of LightSecAgg, we provide the breakdown of the total running time for training the CNN [308] on the FEMNIST dataset [52] in Table 8.4. The breakdown of the running time confirms that the primary gain lies in the complexity reduction at the server provided by LightSecAgg, especially for a large number of users.

8.7.4 Convergence Performance in Asynchronous FL

As described in Remark 1, SecAgg and SecAgg+ are not applicable to asynchronous FL, and hence we cannot compare the total running time of LightSecAgg with these baseline secure aggregation protocols. As such, in our experiments here we instead focus on the convergence performance of LightSecAgg compared to FedBuff [330] to investigate the impact of asynchrony and quantization on performance. In Figure 8.7, we demonstrate that LightSecAgg has almost the same performance as FedBuff on the CIFAR-10 dataset, even though LightSecAgg includes quantization noise to protect the privacy of individual local updates of users. The details of the experimental setting and additional experiments for asynchronous FL are provided in Appendix G.8.

Figure 8.7: Accuracy of asynchronous LightSecAgg and FedBuff on the CIFAR-10 dataset [223] with two strategies for mitigating staleness: a constant function $s(\tau) = 1$, named Constant, and a polynomial function $s_\alpha(\tau) = (1+\tau)^{-\alpha}$, named Poly, where $\alpha = 1$. The accuracy is reasonable since we use a variant of LeNet-5 [489].

8.8 Conclusion and Future Works

This work proposed LightSecAgg, a new approach for secure aggregation in synchronous and asynchronous FL. Compared with the state-of-the-art protocols, LightSecAgg reduces the overhead of model aggregation in FL by leveraging one-shot aggregate-mask reconstruction of the surviving users, while providing the same privacy and dropout-resiliency guarantees. In a realistic FL framework, extensive empirical results also show that LightSecAgg can provide substantial speedup over baseline protocols for training diverse machine learning models. While we focused on privacy in this work (under the honest-but-curious threat model), an interesting direction for future research is to combine LightSecAgg with state-of-the-art Byzantine-robust aggregation protocols (e.g., [164, 421, 99, 209]) to also mitigate Byzantine users while ensuring privacy.

Part III

Federated and Distributed Machine Learning: Application

Chapter 9

FedNLP: FedML for Natural Language Processing

9.1 Introduction

Fine-tuning large pre-trained language models (LMs) such as BERT [86] often leads to state-of-the-art performance in many realistic NLP applications (e.g., text classification, named entity recognition, question answering, summarization, etc.) when large-scale, centralized training datasets are available.
However, due to increasing concerns and regulations about data privacy (e.g., GDPR [368]), emerging data from real users have become much more fragmented and distributed, forming decentralized private datasets, i.e., multiple "data silos" (each of which can be viewed as an individual dataset) across different clients (e.g., organizations or personal devices). To respect the privacy of the users and abide by these regulations, we must assume that users' data in a silo are not allowed to be transferred to a centralized server or to other clients. For example, a client cannot share its private user data (e.g., documents, conversations, questions asked on a website/app) with other clients. This is a common concern for organizations such as hospitals, financial institutions, and legal firms, as well as for personal computing devices such as smartphones, virtual assistants (e.g., Amazon Alexa, Google Assistant), and personal computers. From a machine learning perspective, however, models trained on a centralized dataset that combines the data from all organizations or devices usually achieve better performance in the NLP domain. Therefore, it is of vital importance to study NLP problems in this realistic yet more challenging scenario, i.e., where training data are distributed across different clients and cannot be shared due to privacy concerns.

Figure 9.1: The FedNLP benchmarking framework: clients fine-tune Transformer LMs for NLP tasks (text classification, sequence tagging, question answering, language modeling, and text generation) via federated learning, uploading the updates of their local models and downloading the updated global model, while their private local data are never exposed.

The nascent field of federated learning [4, 252] (FL) aims to enable many individual clients to train their models jointly while keeping their local data decentralized and completely private from other users or a centralized server. A common training schema of FL methods is that each client sends its model parameters to the server, which updates and sends back the global model to all clients in each round. Since the raw data of one client are never exposed to others, FL is a promising and effective way to address the above challenges, particularly in the NLP domain, where much user-generated text data contain sensitive and/or personal information. Despite the growing progress in the FL domain, research into and applications for NLP have been rather limited. There are indeed a number of recent works on using FL methods for medical information extraction tasks [428]. However, such prior work usually has its own experimental setup and specific task, making it difficult to fairly compare these FL methods and analyze their performance on other NLP tasks. We argue that future research in this promising direction (FL for NLP) would highly benefit from a universal benchmarking platform for systematically comparing different FL methods for NLP. To the best of our knowledge, such a benchmarking platform is still absent from the literature. Therefore, our goal in this work is to provide comprehensive comparisons between popular FL methods (e.g., FedAvg [308], FedOPT [367], FedProx [253]) for four mainstream formulations of NLP tasks: text classification, sequence tagging, question answering, and seq2seq generation. Although there are few available realistic FL datasets for NLP due to privacy concerns, we manage to use existing NLP datasets to create various non-IID data partitions over clients.
These non-IID partitions simulate various kinds of distribution shifts (e.g., in labels, features, and quantities) over the clients, which often occur in real-world NLP applications. As for the base NLP models, we use the Transformer architecture [455] as the backbone and support a wide range of pre-trained LMs such as DistilBERT [393], BERT [86], BART [238], etc. In order to conduct extensive experiments, we need to support multiple options along dimensions such as (1) task formulations, (2) NLP models, (3) FL algorithms, and (4) non-IID partitions. Therefore, we propose FedNLP, a modular framework with universal interfaces among the above four components, which is thus more extensible for supporting future research in FL for NLP. In summary, we aim to unblock the research of FL for NLP with the following two-fold contributions:

• Evaluation and analysis. We systematically compare popular federated learning algorithms for mainstream NLP task formulations under multiple non-IID data partitions, thus providing the first comprehensive understanding. Our analysis reveals that there is a considerably large gap between centralized and decentralized training under various settings. We also analyze the efficiency of different FL methods and model sizes. With our analysis, we highlight several directions to advance FL for NLP.

• Resource. The implementation of our experiments forms a general open-source framework, FedNLP, which is capable of evaluating, analyzing, and developing FL methods for NLP. We also provide decentralized NLP datasets of various task formulations, created by various non-IID partitioning strategies, for future research.

The remainder of this work is structured as follows. We introduce the background knowledge of federated learning and several typical FL algorithms in §9.2. Then, we present the proposed non-IID partitioning strategies used to create synthetic datasets for different task formulations in §9.3. We present our results, analysis, and findings in §9.4. Finally, we discuss related works (§9.5) and conclusions (§9.6).

9.2 Federated Learning for NLP

In this section, we first introduce the background knowledge of federated learning (FL) in the context of NLP tasks. Then, we briefly illustrate a unified FL framework that generalizes to other typical algorithms. Finally, we introduce our framework design, which is used for our benchmarking experiments and forms a general training pipeline for FL+NLP.

9.2.1 Federated Learning Concepts

Federated learning (FL) is a machine learning paradigm in which multiple entities (clients) collaborate in solving a machine learning problem under the coordination of a central server or service provider. Each client's raw data are stored locally and not exchanged or transferred; instead, focused updates intended for immediate aggregation are used to achieve the learning objectives [207]. Therefore, federated learning is seen as a promising way to decrease the risk of attack and leakage, reduce the difficulty and cost of data movement, and meet privacy-related data storage regulations.

In the basic conception of federated learning, we would like to minimize the objective function

$F(x) = \mathbb{E}_{i \sim \mathcal{P}}[F_i(x)]$, where $F_i(x) = \mathbb{E}_{\xi \sim \mathcal{D}_i}[f_i(x, \xi)]$.   (9.1)

Here, $x \in \mathbb{R}^d$ represents the parameters of the global model, $F_i: \mathbb{R}^d \rightarrow \mathbb{R}$ denotes the local objective function at client i, and $\mathcal{P}$ denotes a distribution over the collection of clients $\mathcal{I}$.
The local loss functions $f_i(x, \xi)$ are often the same across all clients, but the local data distribution $\mathcal{D}_i$ will often vary, capturing data heterogeneity.

Federated averaging (FedAvg) [308] is a common algorithm for solving (9.1) that divides the training process into rounds. At the beginning of the t-th round (t ≥ 0), the server broadcasts the current global model $x^{(t)}$ to a cohort of participants: a random subset $\mathcal{S}^{(t)}$ of the M clients. Then, each sampled client in the round's cohort performs $\tau_i$ local SGD updates on its own local dataset and sends the local model changes $\Delta_i^{(t)} = x_i^{(t, \tau_i)} - x^{(t)}$ to the server. Finally, the server uses the aggregated $\Delta_i^{(t)}$ to update the global model:

$x^{(t+1)} = x^{(t)} + \frac{\sum_{i \in \mathcal{S}^{(t)}} p_i \Delta_i^{(t)}}{\sum_{i \in \mathcal{S}^{(t)}} p_i}$,

where $p_i$ is the relative weight of client i. The above procedure repeats until the algorithm converges. In the cross-silo setting, where all clients participate in training in every round (each cohort is the entire population), we have $\mathcal{S}^{(t)} = \{1, 2, \ldots, M\}$. Consequently, we can learn a global model that benefits all clients while preserving their data privacy.

9.2.2 Federated Optimization Framework

In this work, we propose to use FedOPT [367], a generalized version of FedAvg, to build the FedNLP platform. As shown in the pseudo-code of Algorithm 6, the algorithm is parameterized by two gradient-based optimizers, ClientOpt and ServerOpt, with client learning rate α and server learning rate $\alpha_s$, respectively. While ClientOpt is used to update the local models, ServerOpt treats the negative of the aggregated local changes, $-\Delta^{(t)}$, as a pseudo-gradient and applies it to the global model. This optimization framework generalizes to many aggregation-based FL algorithms and simplifies the system design.

Algorithm 6: FedOpt [367], a generic FedAvg algorithm.
Input: initial model $x^{(0)}$, ClientOpt, ServerOpt
1:  for $t \in \{0, 1, \ldots, T-1\}$ do
2:    Sample a subset $\mathcal{S}^{(t)}$ of clients
3:    for each client $i \in \mathcal{S}^{(t)}$ in parallel do
4:      Initialize the local model $x_i^{(t,0)} = x^{(t)}$
5:      for $k = 0, \ldots, \tau_i - 1$ do
6:        Compute the local stochastic gradient $g_i(x_i^{(t,k)})$
7:        Perform the local update $x_i^{(t,k+1)} = \text{ClientOpt}(x_i^{(t,k)}, g_i(x_i^{(t,k)}), \alpha, t)$
8:      end for
9:      Compute the local model changes $\Delta_i^{(t)} = x_i^{(t,\tau_i)} - x_i^{(t,0)}$
10:   end for
11:   Aggregate the local changes $\Delta^{(t)} = \sum_{i \in \mathcal{S}^{(t)}} p_i \Delta_i^{(t)} / \sum_{i \in \mathcal{S}^{(t)}} p_i$
12:   Update the global model $x^{(t+1)} = \text{ServerOpt}(x^{(t)}, -\Delta^{(t)}, \alpha_s, t)$
13: end for

In terms of optimization, we explore different combinations of ServerOpt and ClientOpt. The original FedAvg algorithm implicitly sets both ServerOpt and ClientOpt to SGD, with a fixed server learning rate $\alpha_s$ of 1.0. FedProx [253], which tackles statistical heterogeneity by restricting the local model updates to be closer to the initial (global) model, can be easily incorporated into this framework by adding L2 regularization for better training stability. Moreover, given that AdamW [294] is widely used in NLP, we use it as ClientOpt and let ServerOpt be SGD with momentum to reduce the burden of hyper-parameter tuning.
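As an illustration of this optimization framework, the following is a compact PyTorch sketch of one FedOPT round with AdamW as ClientOpt and SGD with momentum as ServerOpt (the combination used above). Function and variable names are ours, and this is a simplified single-process sketch rather than the FedNLP implementation (no client sampling, communication, or secure aggregation).

```python
import copy
import torch

def fedopt_round(global_model, client_loaders, loss_fn,
                 client_lr=5e-5, server_lr=1.0, local_steps=1):
    # ServerOpt: SGD with momentum applied to the pseudo-gradient -Delta^(t).
    # (In a full implementation the server optimizer state persists across rounds.)
    server_opt = torch.optim.SGD(global_model.parameters(),
                                 lr=server_lr, momentum=0.9)
    global_state = copy.deepcopy(global_model.state_dict())
    deltas, weights = [], []

    for loader in client_loaders:                    # the cohort S^(t)
        local_model = copy.deepcopy(global_model)
        client_opt = torch.optim.AdamW(local_model.parameters(), lr=client_lr)
        for _ in range(local_steps):                 # tau_i local passes
            for x, y in loader:
                client_opt.zero_grad()
                loss_fn(local_model(x), y).backward()
                client_opt.step()
        with torch.no_grad():                        # Delta_i^(t) = x_i^(t,tau_i) - x^(t)
            deltas.append({k: local_model.state_dict()[k] - global_state[k]
                           for k, _ in global_model.named_parameters()})
        weights.append(len(loader.dataset))          # p_i proportional to local data size

    total = float(sum(weights))
    with torch.no_grad():
        # ServerOpt treats -Delta^(t) as a pseudo-gradient for the global model.
        for name, param in global_model.named_parameters():
            avg_delta = sum(w * d[name] for w, d in zip(weights, deltas)) / total
            param.grad = -avg_delta
        server_opt.step()
    return global_model
```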
These algorithms follow the same framework introduced in Algorithm 6. The algorithmic APIs are modularized: all data loaders follow the same format of input and output arguments, which are compatible with different models and algorithms and make it easy to support new datasets; the way the model and related trainer are defined is kept the same as in centralized training to reduce the difficulty of developing the distributed training framework. For new FL algorithm development, worker-oriented programming reduces the difficulty of message passing and message definition. More details are introduced in Appendix H.4.3.

Enabling secure benchmarking with lightweight secure aggregation. In particular, FedNLP enhances the security aspect of federated training, which is not supported by existing non-NLP-oriented benchmarking libraries (e.g., TFF, LEAF). This is motivated by the fact that model weights from clients may still carry the risk of privacy leakage [548]. To break this barrier, we integrate secure aggregation (SA) algorithms into the FedNLP system. NLP researchers do not need to master security-related knowledge and still benefit from a secure distributed training environment. To be more specific, FedNLP supports the state-of-the-art SA algorithms LightSecAgg, SecAgg [37], and SecAgg+ [19]. At a high level, SA protects each client model with random masks that cancel out when the updates are aggregated at the server. Consequently, the server can only see the aggregated model and not the raw model from each client. In this work, our main effort is to design and optimize these SA algorithms in the context of the FedNLP system. We provide an algorithmic performance comparison in Appendix I.3.3.

Realistic evaluation with efficient distributed system design. FedNLP aims to support distributed training on multiple edge servers (e.g., AWS EC2) or edge devices (e.g., IoT devices and smartphones). To achieve this, the system is designed with three layers: the application layer, the algorithm layer, and the infrastructure layer. At the application layer, FedNLP provides three modules: data management, model definition, and a single-process trainer for all task formats; at the algorithm layer, FedNLP supports various FL algorithms; at the infrastructure layer, FedNLP aims at integrating single-process trainers with a distributed learning system for FL. Specifically, each layer and module performs its own duties with a high degree of modularization. We refer readers to Appendix H.4 for a detailed description of the system architecture and design philosophy.

Table 9.1: Statistics of the selected datasets for our experiments. (*37 is the size of the tag vocabulary.)

Task       | Txt.Cls. | Seq.Tag. | QA    | Seq2Seq
Dataset    | 20News   | Onto.    | MRQA  | Giga.
# Training | 11.3k    | 50k      | 53.9k | 10k
# Test     | 7.5k     | 5k       | 3k    | 2k
# Labels   | 20       | 37*      | N/A   | N/A
Metrics    | Acc.     | F-1      | F-1   | ROUGE

9.3 Benchmark for FedNLP

Here we introduce how we create benchmark datasets covering a wide range of NLP tasks with different non-IID partition methods for evaluating different federated learning methods.

9.3.1 Task Formulations, Datasets, and Models

There are numerous NLP applications, but most of them can be categorized into four mainstream formulations: text classification (TC), sequence tagging (ST), question answering (QA), and seq2seq generation (SS). The formal definition of each formulation is detailed in Appendix §H.3.
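One plausible way to instantiate task-specific models for these four formulations on top of a shared pre-trained backbone is through the Hugging Face transformers auto classes, sketched below. This mapping and the helper build_model are our illustration rather than the exact FedNLP model API, and the checkpoint names are only examples.

from transformers import (AutoModelForSequenceClassification,   # text classification (TC)
                          AutoModelForTokenClassification,      # sequence tagging (ST)
                          AutoModelForQuestionAnswering,        # extractive QA
                          AutoModelForSeq2SeqLM)                # seq2seq generation (SS)

TASK_HEADS = {
    "text_classification": AutoModelForSequenceClassification,
    "sequence_tagging":    AutoModelForTokenClassification,
    "question_answering":  AutoModelForQuestionAnswering,
    "seq2seq":             AutoModelForSeq2SeqLM,
}

def build_model(formulation, checkpoint="distilbert-base-uncased", **kwargs):
    # Seq2seq requires an encoder-decoder checkpoint such as "facebook/bart-base".
    return TASK_HEADS[formulation].from_pretrained(checkpoint, **kwargs)

model = build_model("text_classification", num_labels=20)  # e.g., for 20Newsgroups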
To cover all formulations while keeping our experiments within a reasonable scope, we select one representative task for each formulation:

• Text Classification: 20Newsgroup [230] is a news classification dataset with annotations for 20 labels. (We showcase FedNLP with this dataset as it has a larger output space (20 labels) than sentiment-analysis datasets, which is an important factor for the label-distribution shift scenarios.)

• Sequence Tagging: OntoNotes [347] (5.0) is a corpus where sentences have annotations for entity spans and types. We use it for the named entity recognition task, which is fundamental to information extraction and other applications.

• QA: MRQA [109] is a benchmark consisting of 6 popular datasets: SQuAD [362] (8529/431), NewsQA [450] (11877/613), TriviaQA [200] (4120/176), SearchQA [93] (9972/499), HotpotQA [517], and NQ [226] (9617/795). (We only use part of the data to demonstrate and verify our hypothesis; the train/test splits are shown in brackets.)

• Seq2Seq: Gigaword [354] is a news corpus with headlines that is often used for testing seq2seq models as a summarization task. Other tasks such as dialogue response generation and machine translation can also be adapted to this format.

We show the basic statistics of the above selected datasets in Table 9.1. Note that our FedNLP research platform supports a much wider range of specific tasks for each formulation; we only introduce the ones used in our experiments here with typical settings. Moreover, our contribution is more of a general FL+NLP benchmarking platform rather than particular datasets and partitions.

Base NLP Models. Fine-tuning pre-trained LMs has been the de facto method for modern NLP research, and thus we focus on testing Transformer-based architectures in FedNLP. Specifically, we choose to use BART [238], a text-to-text Transformer model similar to the T5 model [359], for seq2seq tasks.

9.3.2 Non-IID Partitioning Strategies

The existing datasets have been used for centralized training in NLP. As our focus here is to test decentralized learning methods, we need to distribute the existing datasets to a set of clients. It is the non-IIDness of the client distributions that makes federated learning a challenging problem. Thus, we extend the common practice widely used in prior works for generating synthetic FL benchmarks [243] to the NLP domain. We first introduce how we control the label distribution shift for TC and ST, then the quantity distribution shift, and finally how we model the distribution shift in terms of input features for non-classification NLP tasks (e.g., summarization).

Non-IID Label Distributions. Here we present how we synthesize data partitions such that clients share the same (or a very similar) number of examples but have different label distributions from each other. We assume that on every client, training examples are drawn independently with labels following a categorical distribution over L classes parameterized by a vector q (q_i \geq 0, i \in [1, L], and \|q\|_1 = 1). To synthesize a population of non-identical clients, we draw q \sim Dir_L(\alpha p) from a Dirichlet distribution, where p characterizes a prior class distribution over the L classes and \alpha > 0 is a concentration parameter controlling the identicalness among clients. For each client C_j, we draw a q_j as its label distribution and then sample examples without replacement from the global dataset according to q_j.
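A minimal NumPy sketch of this label-based Dirichlet partition is shown below. The function name and signature are ours for illustration; the FedNLP data loaders may organize this step differently.

import numpy as np

def dirichlet_label_partition(labels, num_clients, alpha, prior=None, seed=0):
    # Give each client a label distribution q_j ~ Dir_L(alpha * p), then sample
    # roughly equal-sized local datasets from the global pool without replacement.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    p = np.ones(len(classes)) / len(classes) if prior is None else np.asarray(prior)
    pools = {c: list(rng.permutation(np.where(labels == c)[0])) for c in classes}
    per_client = len(labels) // num_clients
    partition = []
    for _ in range(num_clients):
        q = rng.dirichlet(alpha * p)                # client-specific label distribution q_j
        counts = rng.multinomial(per_client, q)     # how many examples to draw per class
        idx = []
        for c, n in zip(classes, counts):
            take = min(n, len(pools[c]))            # without replacement from the global pool
            idx.extend(pools[c][:take])
            pools[c] = pools[c][take:]
        partition.append(np.array(idx))
    return partition

For example, calling it with num_clients=100 and alpha=1.0 produces a split in the spirit of the 20News partitions used later, though the exact shards in FedNLP may differ.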
With α → ∞, all clients have identical distributions to the prior (i.e., the uniform distribution); with α → 0, on the other extreme, each client holds examples from only one class chosen at random. In Figure 9.2, we show a series of heatmaps visualizing the distribution differences between clients. Figure 9.3 shows an example of the concrete label distributions for all clients with different α. We can see that when α is smaller, the overall label distribution shift becomes larger.

Figure 9.2: The J-S divergence matrix between 100 clients on the 20News dataset when α ∈ {1, 5, 10, 100}. Each sub-figure is a 100x100 symmetric matrix. The intensity of a cell (i, j)'s color represents the distance between the label distributions of Client i and Client j. As expected, when α is smaller, the partition over clients is more non-IID in terms of their label distributions.

Figure 9.3: Visualizing the non-IID label distributions on 20News with α in {1, 5, 10, 100}. Each sub-figure is a 100x20 matrix, where 100 is the number of clients and 20 is the number of labels. The intensity of a cell represents the ratio of a particular label in the local data of a client. When α is smaller (1, 5, 10), each client has a relatively unique label distribution, so the differences between clients are larger; when α = 100, every client has a nearly uniform label distribution.

Table 9.2: Comparison of different FL methods under the same setting on different NLP tasks. The number of workers per round is 10, except for the MRQA task, which uses 6.

Task                | Dataset   | Partition               | Clients | FedAvg | FedProx | FedOPT | # Rounds
Text Classification | 20news    | α = 1 (label shift)     | 100     | 0.5142 | 0.5143  | 0.5349 | 22
Sequence Tagging    | OntoNotes | α = 0.1 (label shift)   | 30      | 0.7382 | 0.6731  | 0.7918 | 17
Question Answering  | MRQA      | natural factor          | 6       | 0.2707 | 0.2706  | 0.3280 | 13
Seq2Seq Generation  | Gigaword  | α = 0.1 (feature shift) | 100     | 0.3192 | 0.3169  | 0.3037 | 13

Figure 9.4: The learning curves of the three FL methods on four different task formulations (panels: 20news, OntoNotes, Gigaword, MRQA). The metrics used for these tasks are accuracy, span-F1, token-F1, and ROUGE, respectively; the x-axis is the number of rounds in federated learning.

Controlling non-IID Quantity. It is also common that different clients have very different data quantities while sharing a similar label distribution. We thus also provide a quantity-level Dirichlet allocation z ∼ Dir_N(β), where N is the number of clients. We can then allocate the examples of a global dataset to all clients according to the distribution z, that is, |D_i| = z_i |D_G|. If we would like to model both quantity and label distribution shift, it is easy to combine both factors. Note that one could instead assume a uniform distribution z ∼ U(N) (or, equivalently, β → ∞) if we expect all clients to share a similar number of examples. A concrete example is shown in Figure H.1 (Appendix).
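The quantity-level allocation just described can also be sketched in a few lines of NumPy; as before, the function name is ours and only illustrates the allocation rule |D_i| = z_i |D_G|.

import numpy as np

def dirichlet_quantity_partition(num_examples, num_clients, beta, seed=0):
    # Split a global dataset of num_examples items into client shards whose sizes
    # follow z ~ Dir_N(beta); larger beta yields more balanced shard sizes.
    rng = np.random.default_rng(seed)
    z = rng.dirichlet(np.full(num_clients, beta))     # client quantity proportions
    sizes = rng.multinomial(num_examples, z)          # integer shard sizes summing to the total
    order = rng.permutation(num_examples)             # shuffle, then cut into consecutive shards
    cuts = np.cumsum(sizes)[:-1]
    return np.split(order, cuts)

shards = dirichlet_quantity_partition(num_examples=11300, num_clients=100, beta=1.0)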
Controlling non-IID Features. Although straightforward and effective, the above label-based Dirichlet allocation method has a major limitation: it is only suitable for text classification tasks, where the outputs can be modeled as category-based random variables. To create synthetic partitions for other non-classification NLP tasks and to model feature distribution shift, we propose a partition method based on feature clustering. Specifically, we use SentenceBERT [Reimers2019SentenceBERTSE] to encode each example into a dense vector from its text, then apply K-Means clustering to obtain a cluster label for each example; finally, we use these cluster labels (as if they were classification labels) and follow the steps for modeling label distribution shift. There are two obvious benefits of this clustering-based Dirichlet partition method: 1) it enables us to easily synthesize FL datasets for non-classification tasks (i.e., ST, QA, SS), which do not have discrete labels as their output space; 2) the BERT-based clustering results naturally reflect different sub-topics of a dataset, so feature shift can be seen as a shift of latent labels, and we can reuse the same label-based Dirichlet partition method.

Natural Factors. For datasets like MRQA, we consider a cross-silo setting where each client is associated with a particular sub-dataset (out of the six datasets of the same format), forming a natural distribution shift based on inherent factors such as data source and annotation style.

9.4 Experiments and Analysis

In this section, we analyze typical federated learning methods (introduced in §9.2) on our benchmark datasets along multiple dimensions, using the base NLP models listed previously. We put more implementation details and additional results in the Appendix. We organize our extensive experimental results and findings as a collection of research questions with answers.

Experimental Setup and Hyper-parameters. We use DistilBERT and BART-base for most of our experiments: the former is a distilled version of BERT with a 7x speed improvement over BERT-base on mobile devices, a common scenario for FL applications, and the BART-base model is the most suitable option considering the trade-off between performance and computation cost. We leave our implementation details and the selected hyper-parameters in the submitted supplementary materials. Our experiments cover both cross-device and cross-silo settings. As shown in Table 9.2, in the cross-device setting, we use uniform sampling to select 10 clients for each round when the number of clients in a dataset is very large (e.g., 100). For the cross-silo setting, each round selects the same number of clients (we use 6 for the QA task). The local epoch number is set to 1 for all experiments. To make our results reproducible, we use wandb.ai to store all experiment logs, hyper-parameters, and running scripts.

Figure 9.5: Testing FedOPT with DistilBERT for 20News under different data partition strategies (uniform, label shift with α ∈ {1, 5, 10}, and quantity shift).

Q1: How do popular FL methods perform differently under the same setting? We compare the three typical FL methods under the same setting (i.e., data partition, communication rounds, training hyper-parameters, etc.) for each task formulation. As shown in Table 9.2, we report the results of FedAvg, FedProx, and FedOPT. We can see that overall FedOPT performs better than the other two methods, with the only exception being the seq2seq generation task. FedAvg and FedProx perform similarly with marginal differences, but FedAvg outperforms FedProx in sequence tagging. These two exceptions are surprising findings, as many prior works in the FL community show that FedOPT is generally better than FedProx and FedAvg on vision tasks and datasets.
We conjecture that such inconsistent performance across tasks suggests that differences in the loss functions have a great impact on FL performance. Seq2seq and sequence tagging tasks usually have more complex loss landscapes than text classification, as both are typical structured prediction tasks, while text classification has a much smaller output space. From Figure 9.4, we see that FedOPT outperforms the other two methods at the beginning but gradually becomes worse over time. This tells us that using AdamW as the client optimizer may not always be a good choice, especially for complex tasks such as seq2seq, as its adaptive learning-rate scheduling might cause implicit conflicts. These observations suggest that federated optimization algorithms need to be tailored to various NLP tasks, and exploring FL-friendly model architectures or loss functions is also a promising direction to address these challenges.

Table 9.3: Performance (Acc. %) on 20news (TC) when different parts of DistilBERT are frozen, for centralized training and FedOPT (at the 28th round). E stands for the embedding layer and L_i means the i-th Transformer layer. The significantly lower accuracies are underlined in the original table.

Frozen Layers | # Tunable Params. | Cent. | FedOPT
None          | 67.0M | 86.86 | 55.11
E             | 43.1M | 86.19 | 54.86
E + L_0       | 36.0M | 86.54 | 52.91
E + L_0→1     | 29.0M | 86.52 | 53.92
E + L_0→2     | 21.9M | 85.71 | 52.01
E + L_0→3     | 14.8M | 85.47 | 30.68
E + L_0→4     | 7.7M  | 82.76 | 16.63
E + L_0→5     | 0.6M  | 63.83 | 12.97

Q2: How do different non-IID partitions of the same data influence FL performance? The FedNLP platform lets users investigate the performance of an FL algorithm under a wide range of data partitioning strategies, as discussed in §9.3.2. Here we look at the training curves of FedOPT on different partitions, as shown in Figure 9.5. We observe several findings:

• When α is smaller (i.e., the partition is more non-IID in terms of label distribution), the performance tends to degrade, based on the three curves (α ∈ {1, 5, 10}).

• The variance is also larger when the label distribution shift is larger. Both the uniform and quantity-skew partitions have smoother curves, and the variance is smaller for a larger α (e.g., 10).

• Quantity skew does not introduce a great challenge for federated learning when the label distribution is close to uniform.

These findings suggest that it is important to design algorithms that mitigate data heterogeneity. One promising direction is personalized FL, which enables each client to learn its own personalized model by adapting to its local data distribution and system resources [89, 104, 247].

Figure 9.6: Testing FedOPT with DistilBERT for 20News under different frozen layers (None, E, E+L_0, ..., E+L_0→5).

Q3: How does freezing Transformer layers influence FL performance? Communication cost is a major concern in the federated learning process. It is thus natural to consider freezing some Transformer layers of the client models in order to reduce the size of the trainable parameters that are transmitted between servers and clients. To study the influence of freezing layers on FL performance, we conduct a series of experiments that freeze the layers from the embedding layer (E) up to the top layer (L_5) of DistilBERT, with both centralized training and FedOPT, on the text classification task. We report our results in Table 9.3 and Figure 9.6.
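One way to implement this kind of freezing for a Hugging Face DistilBERT classifier is sketched below. The freeze_bottom_layers helper is our own illustration, not a FedNLP API; it simply relies on the standard module layout of DistilBertForSequenceClassification.

from transformers import DistilBertForSequenceClassification

def freeze_bottom_layers(model, num_frozen_layers):
    # Freeze the embedding layer E and the bottom num_frozen_layers Transformer
    # blocks (L_0 ... L_{k-1}); the remaining blocks and the task head stay trainable.
    for p in model.distilbert.embeddings.parameters():
        p.requires_grad = False
    for block in model.distilbert.transformer.layer[:num_frozen_layers]:
        for p in block.parameters():
            p.requires_grad = False
    return model

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=20)
model = freeze_bottom_layers(model, num_frozen_layers=4)   # corresponds to E + L_0->3 in Table 9.3
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.1f}M")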
We find that in centralized training, the largest performance gain happens when we unfreeze the last layer, while in FedOPT we have to unfreeze the last three layers to reach performance comparable to the full model. This suggests that reducing communication costs by freezing some layers of Transformer LMs is feasible, though one should be aware that experience from centralized training may not generalize to FL experiments.

Q4: Are compact models such as DistilBERT adequate for FL+NLP? We know that BERT performs better than DistilBERT due to its larger model size. However, is it cost-effective to use BERT rather than DistilBERT? To study this, we compare the performance of both models with FedOPT on text classification, sharing the same setting as the above experiments. As shown in Figure 9.7, although BERT-base achieves a better performance, the performance of DistilBERT is not significantly worse. Considering the communication cost (BERT-base is almost 2x larger than DistilBERT), we argue that using DistilBERT is a more cost-effective choice for both experimental analysis and realistic applications.

Figure 9.7: FedOPT for 20News with different LMs (bert-base vs. distilbert-base).

9.5 Related Works

FL benchmarks and platforms. In the last few years, a proliferation of frameworks and benchmark datasets has been developed to enable researchers to better explore and study algorithms and modeling for federated learning, both from academia, e.g., LEAF [52], FedML [155], and Flower [26], and from industry, e.g., PySyft [388], TensorFlow-Federated (TFF) [185], FATE [511], Clara [333], PaddleFL [298], and Open FL [186]. However, most platforms only focus on designing a unified framework for federated learning methods and do not provide a dedicated environment for studying NLP problems with FL methods. LEAF [52] contains a few text datasets; however, it is limited to classification and next-word prediction and does not consider pre-trained language models. We want to provide a dedicated platform for studying FL methods in realistic NLP applications with state-of-the-art language models.

Federated learning in NLP applications. A few prior works have begun to apply FL methods in privacy-oriented NLP applications. For example, federated learning has been applied to many keyboard-related applications [134, 426, 236, 365, 514], sentence-level text intent classification using Text-CNN [551], and pre-training and fine-tuning of BERT using medical data from multiple silos without fetching all data to the same place [275]. FL methods have also been proposed to train high-quality language models that can outperform models trained without federated learning [193, 65]. Besides these applications, some work has been done on medical relation extraction [119] and medical named entity recognition [428]. These methods use federated learning to preserve the privacy of sensitive medical data and learn from data on different platforms without the need to exchange data between them. Our work aims to provide a unified platform for studying various NLP applications in a shared environment so that researchers can better design new FL methods, either for a specific NLP task or as a general-purpose model. The aforementioned prior works would thus be particular instances of the settings supported by the FedNLP platform.
9.6 Conclusion

We present FedNLP, an open-source benchmarking framework for developing, evaluating, and analyzing FL methods for NLP tasks. On top of FedNLP, we conduct extensive experiments covering three typical FL methods and four mainstream NLP task formulations under different non-IID partition methods. Our findings suggest that there is still a huge gap between centralized training and federated learning. Our analysis also reveals a few observations that conflict with conventional FL evaluations on non-NLP tasks because of the inherent complexity of structured prediction problems in NLP (e.g., seq2seq), suggesting future directions on syncing learning rates for fine-tuning Transformer-based NLP models. We also empirically show the effect of fine-tuning different numbers of parameters of pre-trained models, via freezing bottom layers, for reducing the cost of data transfer. Finally, we suggest several future directions for FL+NLP research.

9.7 Future Directions

Minimizing the performance gap. In the FL setting, we demonstrate that federated fine-tuning still has a large accuracy gap on non-IID datasets compared to centralized fine-tuning. Developing algorithms for Transformer models on NLP tasks is of the highest priority.

Improving the system efficiency and scalability. Transformer models are usually large, while resource-constrained edge devices may not be able to run large models. Designing efficient FL methods for NLP tasks is thus a practical problem worth solving. How to adopt a reasonable user selection mechanism to avoid stragglers and speed up the convergence of training algorithms is also a pressing problem to be solved.

Trustworthy and privacy-preserving NLP. We argue that analyzing and assuring the privacy-preserving ability of these methods is an important future research direction, although our focus in this work is the implementation and performance analysis of FL methods for NLP tasks. This remains an open problem for both the FL and NLP areas; improving the trustworthiness of decentralized learning is an orthogonal goal, and studying privacy preservation is only possible once an FL+NLP platform exists. This is also part of our motivation in proposing FedNLP, and we believe our framework provides a set of flexible interfaces for future development to analyze and improve the privacy-preserving ability of FL methods for NLP tasks and beyond.

Personalized FedNLP. From the perspective of the data itself, user-generated text is inherently personalized. Designing personalized algorithms to improve model accuracy or fairness is a very promising direction. In addition, it is also an interesting problem to adapt heterogeneous model architectures for each client in the FL network. We show that it is feasible to fine-tune only a small fraction of the parameters of LMs, so it is promising to adapt recent prefix-tuning methods [258] for personalizing the parameters of NLP models within the FedNLP framework.

Chapter 10
FedGraphNN: FedML for Graph Neural Networks

10.1 Introduction

Graph Neural Networks (GNNs) are state-of-the-art models that learn representations from complex graph-structured data in various domains such as drug discovery [383, 430, 503], social networks [132, 485, 146, 504], recommendation systems [483, 292, 118, 502], and traffic flow modeling [472, 78].
However, due to privacy concerns, regulatory restrictions, and commercial competition, there are widespread real-world cases in which graph data is decentralized. For example, in the AI-based drug discovery industry, pharmaceutical research institutions would benefit significantly from other institutions' data, but none of them can afford to disclose their private data for commercial reasons.

Figure 10.1: Three settings of graph federated learning: (a) graph-level FL, (b) subgraph-level FL, and (c) node-level FL.

Federated Learning (FL) is a distributed learning paradigm that addresses this data isolation problem. In FL, training is an act of collaboration between multiple clients without requiring centralized local data [308, 207]. Despite its successful application in domains like computer vision [145, 285, 174, 150] and natural language processing [135, 118, 268], FL has yet to be widely adopted in the domain of machine learning on graph data. There are multiple reasons for this:

1. There is a lack of a unified formulation over the various graph FL settings and tasks in the current literature, making it difficult for researchers who focus on SGD-based federated optimization algorithms to understand the essential challenges in federated GNNs;

2. Existing FL libraries, as summarized by [153], do not support diverse datasets and learning tasks for benchmarking different models and training algorithms. Given the complexity of graph data, the dynamics of training GNNs in an FL setting may differ from those of training vision or language models [531, 490, 161, 505]. A fair and easy-to-use benchmark with standardized open datasets and reference implementations is essential to the development of new graph FL models and algorithms;

3. Simulation-oriented federated training systems are inefficient and insecure for federated GNN research on large-scale and private graph datasets in cross-silo settings. Disruptive research ideas may be constrained by the lack of a modularized federated training system tailored for diverse GNN models and FL algorithms.

To address these issues, we present an open FL benchmark system for GNNs, called FedGraphNN, which contains a variety of graph datasets from different domains and eases the training and evaluation of various GNN models and FL algorithms. We first formulate graph FL to provide a unified framework for federated GNNs (Section 10.2). Under this formulation, we introduce the various graph datasets with synthesized partitions according to real-world application scenarios (Section 10.3). An efficient and secure FL system is designed and implemented to support popular GNN models and FL algorithms and to provide low-level programmable APIs for customized research and industrial deployment (Section 10.4). Extensive empirical analysis demonstrates the utility and efficiency of our system and indicates the need for further research in graph FL (Section 10.5). Finally, we summarize the open challenges in graph FL based on emerging related works (Section 10.6) as well as future directions based on FedGraphNN (Section 10.7).

10.2 Federated Graph Neural Networks (FedGraphNN)

We consider a distributed graph scenario in which a single graph is partitioned or multiple graphs are dispersed over multiple edge servers that cannot be centralized for training due to privacy or regulatory restrictions. However, collaborative training over the dispersed data can aid the formulation of more powerful and generalizable graph models.
In this work, we focus on training GNNs using FL in a central-server setting.

Figure 10.2: Formulation of FedGraphNN (Federated Graph Neural Network). Each FL client runs a local GNN (phase 1: message passing; phase 2: readout and loss), coordinated by the FL server.

In our unified framework of FedGraphNN, we assume that there are K clients in the distributed graph scenario, and the k-th client has its own dataset D^{(k)} := (G^{(k)}, Y^{(k)}), where G^{(k)} = (V^{(k)}, E^{(k)}) is the graph (or set of graphs) in D^{(k)} with vertex feature set X^{(k)} = \{x_m^{(k)}\}_{m \in V^{(k)}} and edge feature set Z^{(k)} = \{e_{m,n}^{(k)}\}_{m,n \in V^{(k)}}, and Y^{(k)} is the label set of G^{(k)}. Each client owns a GNN model to learn graph representations and make predictions. Multiple clients are interested in collaborating through a server to improve their GNN models without necessarily revealing their graph datasets. We illustrate the formulation of FedGraphNN in Figure 10.2. Without loss of generality, we use a Message Passing Neural Network (MPNN) framework [123, 381]. Most spatial-based GNN models [218, 457, 132] can be unified into this framework, where the forward pass has two phases: a message-passing phase and a readout phase.

GNN phase 1: Message passing (same for all tasks). The message-passing phase contains two steps: (1) the model gathers and transforms the neighbors' messages, and (2) the model uses the aggregated messages to update the nodes' hidden states. Mathematically, for client k and layer indices \ell = 0, ..., L-1, an L-layer MPNN is formalized as follows:

m_i^{(k,\ell+1)} = \mathrm{AGG}\big(\{ M_\theta^{(k,\ell+1)}(h_i^{(k,\ell)}, h_j^{(k,\ell)}, z_{i,j}) \mid j \in \mathcal{N}_i \}\big), \quad h_i^{(k,\ell+1)} = U_\phi^{(k,\ell+1)}(h_i^{(k,\ell)}, m_i^{(k,\ell+1)}),    (10.1)

where h_i^{(k,0)} = x_i^{(k)} is the k-th client's node feature, \ell is the layer index, AGG is the aggregation function (e.g., in the GCN model, the aggregation function is a simple SUM operation), \mathcal{N}_i is the neighbor set of node i, and M_\theta^{(k,\ell+1)}(\cdot) is the message generation function, which takes the hidden state of the current node h_i, the hidden state of the neighbor node h_j, and the edge features z_{i,j} as inputs. U_\phi^{(k,\ell+1)}(\cdot) is the state update function, which receives the aggregated feature m_i^{(k,\ell+1)}.

GNN phase 2: Readout (different across tasks). After propagating through an L-layer MPNN, the readout phase computes feature vectors from the hidden states of the last MPNN layer and makes predictions for downstream tasks, that is,

\hat{y}_S^{(k)} = R_\delta\big(\{ h_i^{(k,L)} \mid i \in V_S^{(k)} \}\big).    (10.2)

Note that to handle different downstream tasks, S can be a single node (node classification), a node pair (link prediction), a node set (graph classification), and so forth, and R_\delta can be a concatenation function or a pooling function such as SUM followed by a single- or multi-layer perceptron.

GNN with FL. To formulate the FL setting, we define W = \{M_\theta, U_\phi, R_\delta\} as the overall learnable weights in the GNN. Consequently, we formulate FedGraphNN as a distributed optimization problem:

\min_W F(W) \stackrel{\text{def}}{=} \min_W \sum_{k=1}^{K} \frac{N^{(k)}}{N} \cdot f^{(k)}(W),    (10.3)

where f^{(k)}(W) = \frac{1}{N^{(k)}} \sum_{i=1}^{N^{(k)}} \mathcal{L}(W; x_i^{(k)}, z_i^{(k)}, y_i^{(k)}) is the k-th client's local objective function, which measures the local empirical risk over the graph dataset D^{(k)} with N^{(k)} data samples, and F(W) is the loss function of the global GNN model. To solve this problem, the most straightforward algorithm is FedAvg [308]. It is important to note that in FedAvg, the aggregation function on the server merely averages model parameters. We use GNNs inductively; thus, no topological information about the graphs on any client is required on the server during parameter aggregation. Other advanced algorithms such as FedOPT [367], FedGKT [145], and Decentralized FL [148, 161] can also be applied.
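For intuition, a single message-passing layer of the form (10.1) with SUM aggregation can be sketched in plain PyTorch as below. This toy layer is our illustration of the MPNN abstraction, not the PyTorch Geometric implementations that FedGraphNN actually uses.

import torch
import torch.nn as nn

class ToyMPNNLayer(nn.Module):
    # One MPNN layer: message M_theta over (h_i, h_j, z_ij), SUM aggregation, update U_phi.
    def __init__(self, hidden_dim, edge_dim):
        super().__init__()
        self.message_fn = nn.Linear(2 * hidden_dim + edge_dim, hidden_dim)  # M_theta
        self.update_fn = nn.GRUCell(hidden_dim, hidden_dim)                 # U_phi

    def forward(self, h, edge_index, edge_attr):
        # edge_index: LongTensor [2, E] with rows (source j, target i); edge_attr: [E, edge_dim]
        src, dst = edge_index
        msg = torch.relu(self.message_fn(torch.cat([h[dst], h[src], edge_attr], dim=-1)))
        agg = torch.zeros_like(h).index_add_(0, dst, msg)   # SUM over neighbors j of each node i
        return self.update_fn(agg, h)                       # h_i^{l+1} = U_phi(h_i^l, m_i^{l+1})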
Under the unified framework of FedGraphNN, we organize the various distributed graph scenarios motivated by real-world applications into three settings, based on how the graphs are distributed across silos, and provide support for the corresponding typical tasks in each setting.

• Graph-level FedGraphNN: Each client holds a set of graphs, where the typical task is graph classification/regression. Real-world scenarios include molecular trials [383], protein discovery [508], and so on, where each institute might hold a limited set of graphs with ground-truth labels due to expensive experiments.

• Subgraph-level FedGraphNN: Each client holds a subgraph of a larger global graph, where the typical tasks are node classification and link prediction. Real-world scenarios include recommendation systems [510], knowledge graph completion [58], and so forth, where each institute might hold a subset of user-item interaction data or entity/relation data.

• Node-level FedGraphNN: Each client holds the ego-networks of one or multiple nodes, where the typical tasks are node classification and link prediction. Real-world scenarios include social networks [544], sensor networks [520], etc., where each node only sees its k-hop neighbors and their connections in the large graph.

Supported GNN models and FL algorithms. FedGraphNN's latest release supports the GNN models GCN [218], GAT [457], GraphSAGE [132], SGC [484], and GIN [493], implemented via PyTorch Geometric [107]. For FL algorithms, aside from FedAvg [308], other advanced algorithms such as FedOPT [367] are also supported within the FedML library [153]. We refer to Appendix I.1 for more details on the supported GNN baselines.

10.3 FedGraphNN Open Datasets

FedGraphNN is centered around the three federated GNN settings based on the ways graph data can be distributed in real-world scenarios, and it covers a broad range of domains, tasks, and challenges of graph FL. Specifically, it includes 36 datasets from 7 domains, such as molecules, proteins, knowledge graphs, recommendation systems, citation networks, and social networks. Here, to facilitate a clear understanding of the various graph FL settings, we organize and introduce examples of real-world datasets in each of the three federated GNN settings. Exact sources and statistics of the datasets are provided in Table 10.1, while more details are provided in Appendix I.2.

• Graph-level Setting: In the real world, biomedical institutions might hold their own sets of graphs such as molecules and proteins, and social network companies might hold their own sets of community graphs. Such graphs may constitute large and diverse datasets for GNN training, but they cannot be directly shared across silos. To simulate such scenarios, we utilize datasets from the domains of molecular machine learning [486], bioinformatics [40, 398, 132], and social computing [497]. We also introduce a new large-scale dataset, called hERG [115], for federated drug discovery.

• Subgraph-level Setting: The first realistic scenario of subgraph-level FL is recommendation systems, where users can interact with items owned by different shops or sectors, which means that each data owner holds only a part of the global user-item graph.
To simulate such scenarios, we use recommendation datasets from both publicly available sources [441, 374] and internal sources [147], which have high-quality meta-data information. Another realistic scenario is knowledge graphs, where different organizations or departments might only have a subset of the entire knowledge due to their focus on particular domains. We integrate the FB15k-237 [84], WN18RR [445], and YAGO3-10 [299] datasets, where subgraphs can be built based on relation types to distinguish specialized fields, or based on communities to distinguish the entities of focus.

• Node-level Setting: In social networks, each user's personal data can be sensitive and only visible to his/her k-hop neighbors (e.g., in Instagram, k = 1 for contents and k = 2 for links of private accounts). Thus, it is natural to consider node-level FL in social networks with clients holding the user ego-networks. To simulate this scenario, we use open social networks [405] and publication networks [306, 32, 122, 399, 440] and partition them into sets of ego-networks.

In terms of graph mining tasks, FedGraphNN supports all three common tasks: graph classification, node classification, and link prediction. Some tasks are naturally important in certain graph FL settings while others are not, which we clarify with real examples as follows:

• Graph Classification: This task is to categorize different types of graphs based on their structure and overall information. Unlike the other tasks, it requires characterizing the property of the entire input graph. This task is naturally important in graph-level FL, with real examples such as molecule property prediction, protein function prediction, and social community classification.

• Link Prediction: This task is to estimate the probability of links between any two nodes in a graph. It is important in subgraph-level FL, for example in recommendation systems and knowledge graphs, where link probabilities are predicted in the former and relation types are predicted in the latter. It is less likely but still viable in the node-level setting, where, for example, friend suggestion and social relation profiling can be attempted in users' ego-networks.

• Node Classification: This task is to predict the labels of individual nodes in graphs. It is more important in node-level FL, such as predicting the active research fields of an author based on his/her k-hop collaborators or the habits of a user based on his/her k-hop friends. It might also be important in subgraph-level FL, such as the collaborative prediction of disease infections based on patient networks dispersed across multiple healthcare facilities.

Data sources. We exhaustively examined and collected 36 datasets from 7 domains. Among them, 34 are from publicly available sources such as MoleculeNet [486] and the graph kernel datasets [40]. In addition, we introduce two new de-identified datasets based on our collaboration with Tencent: hERG [215, 116], a graph dataset for classifying protein molecules responsible for cardiac toxicity, and Tencent [146], a large bipartite graph representing the relationships between users and groups. More details and their specific preprocessing procedures can be found in Appendices I.2.1 & I.2.2. We plan to continually enrich the available datasets in the future through active collection of open datasets and collaboration with industrial partners.

10.3.1 Generating Federated Learning Datasets

Non-I.I.D.-ness is a major challenge in simulating realistic FL scenarios.
Coupled with the persistent structure and feature heterogeneity of graphs [501, 509, 499, 490], multiple sources of non-I.I.D.-ness are hard to disentangle in graph FL. Here, we present two ways of injecting non-I.I.D.-ness:

1. Dirichlet distribution-based sampling [528, 176, 174, 153, 473, 529, 465];
2. Meta-data based sampling [304, 51].

10.3.1.1 Dirichlet Distribution-based Sampling

This scheme allows users to split datasets in a reproducible, statistical (sample-based) manner. To generate sample-based non-I.I.D.-ness, we use an unbalanced partition algorithm via the Dirichlet distribution, and this method can be applied regardless of data domain. Following [529, 465], we generate a heterogeneous partition into J clients by sampling p_k ∼ Dir_J(α) and allocating a p_{k,j} proportion of the training instances of class k to local client j. As shown in Figure 10.3, the α parameter controls the I.I.D.-ness of the sample distribution: the lower the α, the more non-I.I.D. the sample distribution is.

Figure 10.3: Unbalanced (non-I.I.D.) sample distributions for citation networks (node-level tasks): (a) α = 0.1, (b) α = 10.0.

This method is utilized for graph-level and node-level tasks only. It is important to note that Dirichlet-based sampling is useful when users have imperfect or no meta-data information to perform a non-I.I.D. split. Figure I.1 in the Appendix shows several datasets' non-I.I.D. distributions generated with this method for graph-level tasks. The α values of LDA for representative datasets can be found in Tables 10.2 and I.15 in Appendix I.5.

Figure 10.4: Example non-I.I.D. sample distributions for the recommendation system datasets: (a) Ciao and (b) Epinions, showing the number of samples per item category.

10.3.1.2 Non-I.I.D. Sampling Based on Meta-Data

When users have enough background information on how certain features affect the non-I.I.D.-ness of the data in a particular domain, FedGraphNN allows data partitioning based on meta-data. Meta-data can reflect the intrinsic data partition in real-life scenarios. For example, in recommendation systems, user behavior differs across items from different categories [73]. Splitting the whole user-item bipartite graph based on item categories captures this behavioral difference, so the user data of different sub-graphs is non-uniformly distributed. Figure 10.4 shows the non-I.I.D.-ness of users' rating counts for items from different categories. Another example is knowledge graphs, with two possible non-I.I.D. settings: building sub-graphs from different relation types, and building them from node communities. Both settings are non-I.I.D. because different relations and communities do not span the graph evenly. Notice that this type of sampling can be used for any FedGraphNN task. In addition, researchers and practitioners can also synthesize non-I.I.D.-ness using the additional meta-data we make available. Yet, deeply decoupling and quantifying non-I.I.D.-ness in federated GNNs remains an open problem [473, 490].
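As a concrete illustration of meta-data based splitting, the snippet below groups a user-item rating table by item category so that each category becomes one client's bipartite subgraph. The column names and the pandas-based representation are our assumptions for illustration; the actual FedGraphNN data loaders may store the graphs differently.

import pandas as pd

def split_by_item_category(ratings: pd.DataFrame):
    # Meta-data based non-I.I.D. split: one client per item category.
    # Expects columns: user_id, item_id, category, rating.
    clients = {}
    for category, edges in ratings.groupby("category"):
        # Each client holds the bipartite subgraph induced by its category's items.
        clients[category] = {
            "users": edges["user_id"].unique(),
            "items": edges["item_id"].unique(),
            "edges": edges[["user_id", "item_id", "rating"]].to_numpy(),
        }
    return clients

ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3], "item_id": [10, 11, 10, 12],
    "category": ["books", "music", "books", "music"], "rating": [5, 3, 4, 2],
})
clients = split_by_item_category(ratings)   # two clients: "books" and "music"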
In summary, we provide a comprehensive study of, and solutions for, several challenges in data collection and benchmarking for graph FL: (1) collecting, analyzing, and categorizing a large number of public, real-world datasets into the different federated GNN settings with their corresponding typical tasks; (2) standardizing the procedure to synthesize non-I.I.D. data distributions for all graph-structured datasets by providing Dirichlet distribution-based sampling and associated meta-data.

10.4 FedGraphNN Benchmark System: Efficient, Secure, and Modularized

The system design of FedGraphNN is tailored for benchmarking graph FL and promoting algorithmic innovations, with three key advantageous designs in the context of FL that are not supported by existing simulation-oriented benchmarks and libraries [153].

Figure 10.5: Overview of the FedGraphNN system architecture design.

Enhancing realistic evaluation with an efficient and deployable distributed system design. We design the training system to support realistic distributed computing on multiple edge servers, given that FedGraphNN is mainly executed in cross-silo settings where each FL client represents an edge server belonging to an organization rather than a smartphone or IoT device. The system architecture, shown in Figure 11.2, is composed of three layers: the FedML-core layer, the FedML-API layer, and the Application layer. The FedML-core layer supports both RPC (remote procedure call) and MPI (message passing interface), which enable communication among edge servers located at different data centers. More specifically, the RPC API is tensor-oriented, meaning that transmitting weights or gradients among clients is much faster than with naïve gRPC thanks to GPU-direct communication (e.g., data owners from different AWS EC2 accounts can utilize this feature for faster training). The communication primitives are wrapped as abstract communication APIs (i.e., ComManager in Figure 11.2) to simplify the message definition and passing requested by different FL algorithms in the FedML-API layer (see more details in Appendix I.3). From the deployment perspective, the FL client library should be compatible with heterogeneous hardware and OS configurations. To meet this goal, we provide Docker containers to simplify large-scale deployment for FL. With this design, researchers can run realistic evaluations in a parallel computing environment where multiple CPU/GPU servers are located in multiple organizations (e.g., edge servers in AWS EC2). As such, for medium- and small-scale graph datasets, the training can be finished in only a few minutes. For large-scale graph datasets, researchers can also measure system-wise performance (communication and computation costs) to obtain a tangible trade-off between accuracy and system efficiency. Scaling up to numerous edge servers (FL clients) is further simplified by the Docker-based deployment.

Enabling secure benchmarking with lightweight secure aggregation. Researchers in industry may also need to explore FL on their private customer datasets together with other organizations. However, model weights from clients may still carry the risk of privacy leakage [548]. As such, legal and regulatory departments normally do not permit FL research on private customer datasets when strong security is not guaranteed. To break this barrier, we integrate secure aggregation (SA) algorithms into the FedGraphNN system. ML researchers do not need to master security-related knowledge and can still enjoy a secure distributed training environment. To be more specific, we support FedGraphNN with LightSecAgg, a state-of-the-art SA algorithm developed by our team (Appendix I.3.3). At a high level, LightSecAgg protects the client model by adding a single random mask, and the masks are designed to cancel out when aggregated at the server. Consequently, the server can only see the aggregated model and not the raw model from each client. The design and implementation of LightSecAgg span FedML-core and FedML-API in the system architecture, as shown in Figure 11.2. Baselines such as SecAgg [37] and SecAgg+ [19] are also supported.
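The mask-cancellation idea behind such SA protocols can be illustrated with the toy additive-masking example below. This is only a didactic sketch using pairwise masks over a finite field (closer in spirit to SecAgg than to LightSecAgg's single-mask design) and is not the actual protocol or API shipped with FedGraphNN.

import numpy as np

P = 2**31 - 1  # arithmetic over a finite field so masked uploads look random individually

def masked_updates(updates, seed=0):
    # Client i adds a pairwise mask r_ij for j > i and subtracts r_ji for j < i,
    # so all masks cancel in the sum while every single upload reveals nothing.
    rng = np.random.default_rng(seed)
    n, d = len(updates), len(updates[0])
    pairwise = {(i, j): rng.integers(0, P, size=d) for i in range(n) for j in range(i + 1, n)}
    masked = []
    for i, u in enumerate(updates):
        m = u.astype(np.int64) % P
        for j in range(n):
            if i < j:
                m = (m + pairwise[(i, j)]) % P
            elif j < i:
                m = (m - pairwise[(j, i)]) % P
        masked.append(m)
    return masked

updates = [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9])]
aggregate = sum(masked_updates(updates)) % P   # equals [12, 15, 18]: masks cancel at the server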
Facilitating algorithmic innovations with diverse datasets, GNN models, and FL algorithms. FedGraphNN also aims to enable flexible customization for future algorithmic innovations. To support diverse datasets, GNN models, and FL algorithms, we have modularized the API and component design. All data loaders follow the same format of input and output arguments, which are compatible with different models and algorithms and make it easy to support new datasets. The way the model and related trainer are defined is kept the same as in centralized training to reduce the difficulty of developing the distributed training framework. For new FL algorithm development, worker-oriented programming reduces the difficulty of message passing and message definition (details are introduced in Appendix I.3). Diverse algorithmic implementations serve as reference source code for future algorithmic innovation. The user-oriented interface (the main training script) is simplified, as in the example code shown in Figure 10.6, where a few lines of code can launch federated training in a cross-silo cloud environment.

Figure 10.6: Example code for benchmark evaluation with FedGraphNN.

10.5 FedGraphNN Empirical Analysis

10.5.1 Experimental Setup

Our experiments are conducted on multiple GPU servers, each equipped with 8 NVIDIA Quadro RTX 5000 GPUs (16GB GPU memory each). The hyper-parameters are selected via our built-in efficient parameter sweeping functionalities from the ranges listed in Appendix I.4.1. We report ROC-AUC for graph classification, RMSE and MAE for graph regression, MAE, MSE, and RMSE for link prediction, and micro-F1 for node classification. More evaluation metrics supported by FedGraphNN are presented in Appendix I.4.2.

10.5.2 Baseline Performance Analysis

We report experimental results for several popular GNN models trained with the most widely used FL algorithm, FedAvg, to exemplify the utility of FedGraphNN. More results with varying baselines, hyper-parameters, evaluation metrics, and visualizations are being updated and presented in Appendix I.5. After hyper-parameter tuning, we present the main performance results as well as runtimes in Tables 10.2, 10.3, and 10.4. Besides showcasing the utility of FedGraphNN, there are multiple takeaways from these results:

1. When the graph datasets are small, FL accuracy is often on par with centralized learning.

2. When dataset sizes and the number of clients grow, GNN accuracy in the FL setting becomes significantly worse than in centralized learning. We conjecture that this accuracy drop occurs because the basic GNN models and FL algorithms cannot properly handle the missing links among clients or quantify the multiple sources of non-I.I.D.-ness in graphs.

3. The dynamics of training GNNs in a federated setting are different from those of training federated vision or language models; our findings show that the best model in the centralized setting may not necessarily be the best model in the FL setting.
4. Counterintuitive phenomena (highlighted in the tables above) further add to the mystery of federated graph neural networks: in the graph-level experiments, GAT suffers the largest performance compromise on 5 out of 9 datasets; in both subgraph-level and node-level FL, results on some datasets (Ciao, CORA, PubMed) may even be slightly higher than centralized training; GAT cannot achieve reasonable accuracy in node-level FL (e.g., on the CORA dataset); etc.

These results indicate the limitations of the baselines in FedGraphNN and motivate much further research into understanding the nuances and improving the training of GNNs in the FL setting.

Evaluation on system efficiency and security. We provide additional results on system performance evaluation, summarized in Appendix I.3.2. Depending on the size of the graph data, FedGraphNN can complete training efficiently: the training time ranges from a few minutes to about 1 hour, even for large-scale graphs. On the security side, the main result of LightSecAgg is provided in Appendix I.3.3. The key benefit is that it not only obtains the same level of privacy guarantees as the state-of-the-art (SecAgg [7] and SecAgg+ [3]) but also substantially reduces the aggregation complexity (hence much faster training).

10.6 Related Works and Open Challenges

FedGraphNN lies at the intersection of GNNs and FL. We first discuss related works under the umbrella of the three graph FL settings. (1) Graph-level (Figure 10.1(a)): we believe molecular machine learning is a paramount application in this setting, where many small graphs are distributed between multiple institutions, as demonstrated in [161, 490]. [490] proposes a clustered FL framework specifically for GNNs to deal with feature and structure heterogeneity. [161] develops a multi-task learning framework suitable for training federated graph-level GNNs without the need for a central server. (2) Subgraph-level (Figure 10.1(b)): this scenario typically pertains to entire social networks, recommender networks, or knowledge graphs that need to be partitioned into many smaller subgraphs due to data barriers between different departments in a giant company or data platforms with different domain focuses, as demonstrated in [483, 531]. [483] proposes a federated recommendation system with GNNs, whereas [531] proposes FedSage, a subgraph-level federated GNN that generates pseudo-neighbors using a variational graph autoencoder. (3) Node-level (Figure 10.1(c)): when the privacy of specific nodes in a graph is important, node-level graph FL is useful in practice. The IoT setting is a good example [543]; [462] uses a hybrid of FL and meta-learning to solve the semi-supervised graph node classification problem in decentralized social network datasets; [317] attempts to protect node-level privacy using an edge-cloud partitioned GNN model for spatio-temporal forecasting on node-level traffic sensor datasets. Before our unified FedGraphNN system, there was a serious lack of standardized datasets and baselines for training GNNs in a federated setting. Previous platforms like LEAF [51], TFF [304, 34], and PySyft [388] have no support for GNNs.

Beyond the direct goals of FedGraphNN, many open algorithmic challenges in graph FL remain to be studied.
First, the partitioning of a large graph into sub-graphs, or of ego-networks into local clients, introduces dataset bias and information loss in the form of missing cross-subgraph links, which can impede GNN performance in the FL setting (as shown in Tables 10.3 & 10.4) and motivates us to study the proper recovery of such missing cross-subgraph links [531]. Second, as observed from Tables 10.2-10.4, the non-I.I.D.-ness in distributed graph datasets can affect the gap between federated and centralized GNNs, which motivates us to study the deep decoupling and quantification of the multiple sources of non-I.I.D.-ness in graphs, toward the appropriate design of graph FL algorithms [490, 473]. Third, integrating both the graph topology of GNNs and the network topology of FL in a principled and efficient way is of great interest when federated GNNs can be trained in an asynchronous fashion. Finally, a universally useful framework for secure graph FL, despite the various privacy-preserving methods [545, 194, 543] in the literature, is still missing.

10.7 Conclusions and Future Works

In this work, we design an FL system and benchmark for GNNs, named FedGraphNN, which includes open datasets, baseline implementations, and programmable APIs, all integrated in a robust system affordable to most research labs. We hope FedGraphNN can serve as an easy-to-use research platform for researchers to explore vital problems at the intersection of FL and GNNs. Here we highlight some future improvements and research directions based on our FedGraphNN system:

1. supporting more graph datasets and GNN models for diverse applications; possible applications include, but are not limited to, sensor networks and spatio-temporal forecasting [280, 500];
2. optimizing the system to further accelerate the training speed for large graphs [541, 233];
3. designing advanced graph FL algorithms to mitigate the accuracy gap on datasets with non-I.I.D.-ness, such as tailoring FedNAS [142, 159] to search for personalized GNN models for individual FL clients;
4. exploring label-efficient GNN models based on concepts such as meta-learning and self-supervision to exploit the graphs in each client and their collaboration [491];
5. addressing challenges in security and privacy under the federated GNN setting [98, 349, 352, 350, 505, 63];
6. proposing efficient compression algorithms that adapt the level of compression to the available bandwidth of the users while preserving the privacy of users' local data;
7. organizing data competitions, themed workshops, special issues, etc., for the dissemination of FedGraphNN;
8. actively discussing ethics and societal impacts to avoid unwanted negative effects.
Table 10.1: Summary of open graph datasets from various domains contained in FedGraphNN.

Task-Level     | Category             | Dataset             | # Graphs | Avg. # Nodes | Avg. # Edges | Avg. Degree | # Classes
Graph-Level    | Molecules            | BACE [427]          | 1513     | 34.12        | 36.89        | 2.16        | 2
Graph-Level    | Molecules            | HIV [376]           | 41127    | 25.53        | 27.48        | 2.15        | 2
Graph-Level    | Molecules            | MUV [378]           | 93087    | 24.23        | 26.28        | 2.17        | 17
Graph-Level    | Molecules            | Clintox [117]       | 1478     | 26.13        | 27.86        | 2.13        | 2
Graph-Level    | Molecules            | SIDER [225]         | 1427     | 33.64        | 35.36        | 2.10        | 27
Graph-Level    | Molecules            | Toxcast [373]       | 8575     | 18.78        | 19.26        | 2.05        | 167
Graph-Level    | Molecules            | Tox21 [446]         | 7831     | 18.51        | 25.94        | 2.80        | 12
Graph-Level    | Molecules            | BBBP [303]          | 2039     | 24.05        | 25.94        | 2.16        | 2
Graph-Level    | Molecules            | QM9 [114]           | 133885   | 8.8          | 27.6         | 6.27        | 1
Graph-Level    | Molecules            | ESOL [83]           | 1128     | 13.29        | 40.65        | 6.11        | 1
Graph-Level    | Molecules            | FreeSolv [323]      | 642      | 8.72         | 25.6         | 5.87        | 1
Graph-Level    | Molecules            | Lipophilicity [114] | 4200     | 27.04        | 86.04        | 6.36        | 1
Graph-Level    | Molecules            | hERG [115]          | 10572    | 29.39        | 94.09        | 6.40        | 1
Graph-Level    | Molecules            | MUTAG [82]          | 188      | 17.93        | 19.79        | 2.21        | 2
Graph-Level    | Molecules            | NCI1 [461]          | 4110     | 29.87        | 32.3         | 2.16        | 2
Graph-Level    | Proteins             | PROTEINS [40]       | 1113     | 39.06        | 72.82        | 3.73        | 2
Graph-Level    | Proteins             | DDI [398]           | 1178     | 284.32       | 715.66       | 5.03        | 2
Graph-Level    | Proteins             | PPI [132]           | 24       | 56,944       | 818,716      | 28.76       | 121
Graph-Level    | Social networks      | COLLAB [497]        | 5000     | 74.49        | 2457.78      | 65.99       | 3
Graph-Level    | Social networks      | REDDIT-B [497]      | 2000     | 429.63       | 497.75       | 2.32        | 2
Graph-Level    | Social networks      | REDDIT-M-5K [497]   | 4999     | 508.52       | 594.87       | 2.34        | 5
Graph-Level    | Social networks      | IMDB-B [497]        | 1000     | 19.77        | 96.53        | 9.77        | 2
Graph-Level    | Social networks      | IMDB-M [497]        | 1500     | 13           | 65.94        | 10.14       | 3
Subgraph-Level | Recomm. systems      | Ciao [441]          | 28       | 5150.93      | 19280.93     | 3.74        | 5
Subgraph-Level | Recomm. systems      | Epinions [374]      | 27       | 15824.22     | 66420.52     | 4.20        | 5
Subgraph-Level | Recomm. systems      | Tencent [147]       | 1        | 709074       | 991713       | 2.80        | 2
Subgraph-Level | Knowledge graphs     | FB15k-237 [84]      | 1        | 14505        | 212110       | 14.62       | 237
Subgraph-Level | Knowledge graphs     | WN18RR [445]        | 1        | 40559        | 71839        | 1.77        | 11
Subgraph-Level | Knowledge graphs     | YAGO3-10 [299]      | 1        | 123143       | 774182       | 6.29        | 37
Node-Level     | Publication networks | CORA [306]          | 1        | 2708         | 5429         | 2.00        | 7
Node-Level     | Publication networks | CORA-full [32]      | 1        | 19793        | 65311        | 3.30        | 70
Node-Level     | Publication networks | CITESEER [122]      | 1        | 4230         | 5358         | 1.27        | 6
Node-Level     | Publication networks | PUBMED [399]        | 1        | 19717        | 44338        | 2.25        | 3
Node-Level     | Publication networks | DBLP [440]          | 1        | 17716        | 105734       | 5.97        | 4
Node-Level     | Social networks      | CS [405]            | 1        | 18333        | 81894        | 4.47        | 15
Node-Level     | Social networks      | Physics [405]       | 1        | 34493        | 247962       | 7.19        | 5

Table 10.2: Performance of graph classification in the graph-level FL setting (#clients = 4). Dirichlet α per dataset: SIDER 0.2, BACE 0.5, Clintox 0.5, BBBP 2, Tox21 3. Training times for the MoleculeNet results are not published.

ROC-AUC
Method                  | SIDER  | BACE   | Clintox | BBBP   | Tox21
MoleculeNet Results     | 0.6380 | 0.8060 | 0.8320  | 0.6900 | 0.8290
GCN (Centralized)       | 0.6476 | 0.7657 | 0.8914  | 0.8705 | 0.7800
GCN (FedAvg)            | 0.6266 | 0.6594 | 0.8784  | 0.7629 | 0.7128
GAT (Centralized)       | 0.6639 | 0.9221 | 0.9573  | 0.8824 | 0.8144
GAT (FedAvg)            | 0.6591 | 0.7714 | 0.9129  | 0.8746 | 0.7186
GraphSAGE (Centralized) | 0.6669 | 0.9266 | 0.9716  | 0.8930 | 0.8317
GraphSAGE (FedAvg)      | 0.6700 | 0.8604 | 0.9246  | 0.8935 | 0.7801

Training Time (sec.)
Method                  | SIDER | BACE | Clintox | BBBP | Tox21
GCN (Centralized)       | 458   | 545  | 686     | 532  | 1034
GCN (FedAvg)            | 358   | 297  | 280     | 253  | 903
GAT (Centralized)       | 739   | 603  | 678     | 533  | 2045
GAT (FedAvg)            | 528   | 327  | 457     | 328  | 1549
GraphSAGE (Centralized) | 193   | 327  | 403     | 312  | 1132
GraphSAGE (FedAvg)      | 127   | 238  | 282     | 206  | 771

Table 10.3: Performance of link prediction in the subgraph-level FL setting (#clients = 8).

Method                  | MAE (Ciao) | MAE (Epinions) | MSE (Ciao) | MSE (Epinions) | RMSE (Ciao) | RMSE (Epinions) | Time, s (Ciao) | Time, s (Epinions)
GCN (Centralized)       | 0.8167     | 0.8847         | 1.1184     | 1.3733         | 1.0575      | 1.1718          | 268            | 650
GCN (FedAvg)            | 0.7995     | 0.9033         | 1.0667     | 1.4378         | 1.0293      | 1.1924          | 352            | 717
GAT (Centralized)       | 0.8214     | 0.8934         | 1.1318     | 1.3873         | 1.0639      | 1.1767          | 329            | 720
GAT (FedAvg)            | 0.7987     | 0.9032         | 1.0682     | 1.4248         | 1.0311      | 1.1882          | 350            | 749
GraphSAGE (Centralized) | 0.8231     | 1.0436         | 1.1541     | 1.8454         | 1.0742      | 1.3554          | 353            | 721
GraphSAGE (FedAvg)      | 0.8290     | 0.9816         | 1.1320     | 1.6136         | 1.0626      | 1.2625          | 551            | 810
Method CORA CITESEER PUBMED DBLP CORA CITESEER PUBMED DBLP GCN (Centralized) 0.8622 0.9820 0.9268 0.9294 1456 742 1071 1116 GCN (FedAvg) 0.8549 0.9743 0.9128 0.9088 833 622 654 653 GAT (Centralized) diverge 0.9653 0.8621 0.8308 1206 1765 1305 957 GAT (FedAvg) 0.9610 0.8557 0.8201 871 652 682 712 GraphSAGE (Centralized) 0.9692 0.9897 0.9724 0.9798 1348 934 692 993 GraphSAGE (FedAvg) 0.9749 0.9854 0.9761 0.9749 774 562 622 592 178 Chapter 11 FedCV: FedML for Computer Vision 11.1 Introduction FL has the potential to rescue many interesting computer vision (CV) applications which centralized training cannot handle due to various issues such as privacy concerns (e.g. in medical settings), data transfer and maintenance costs (most notably in video analytic) [534], or sensitivity of proprietary data (e.g. facial recognition) [207]. In essence, FL is an art of trade-offs among many optimization objectives [102], including improving model accuracy and personalization [162, 151], system efficiency (communication and computation) [144, 160, 161], robustness to attacks [112, 17, 60, 431], and privacy [451]. There has been steady progress in FL algorithmic research to achieve these goals. However, the research gap between computer vision (CV) [157, 179] and federated learning (FL) is large. First, research in the FL community focuses almost exclusively on distributed optimization methods with small-scale datasets and models in image classification (see Table J.8 in the Appendix), while the research trends in CV focus more on large-scale supervised/self-supervised pre-training [69] with efficient CNN [434] or Transformer models [92], which largely improves the performance of classification tasks on ImageNet and various downstream tasks such as object detection and image segmentation. 179 Figure 11.1: Our philosophy of federated learning on computer vision: connecting the algorithmic FL research and CV application-drive research with an unified research framework. Second, CV model training normally requires large-scale computing research in a dis- tributed computing environment, but current FL algorithms are mostly published as stan- dalone simulations, which further enlarges the research gap (e.g., the recently released FedVision library [285] only contains object detection and single GPU training). Third, the efficacy of proposed FL algorithms on diverse CV tasks is still vague. Currently, only image classification in small-scale datasets and models has been evaluated in these algorithms (see Table J.8 in the Appendix). Researchers may attempt to solve a specific probleminrealisticCVtasksbydesigningnewalgorithms, butthecurrentresearchcommunity lacks such a library to connect diverse CV tasks. Due to these obstacles, there is an urgent need to bridge the gap between pure algorithmic research and CV application-driven research. Our philosophy to do so can be illustrated in Figure 11.1. Specifically, we design a unified federated learning library, named FedCV, to 180 connect various FL algorithms with multiple important CV tasks, including image segmen- tation and object detection. Model-wise, we believe the best solution for CV is to improve pre-training for SOTA models with efficient federated learning methods, which requires us to design efficient and effective task-specific models with pre-trained models as the backbone in the FL setting. To reduce the learning curve and engineering burden for CV researchers, we provide various representative FL algorithms as one line, easy-to-use APIs. 
Most importantly, these APIs provide a distributed computing paradigm, which is essential to accelerating the federated training of most CV models. Moreover, we also make the framework flexible for exploring algorithms with new distributed-computing protocols, such as customizing the information exchanged among clients and defining specialized training procedures. To demonstrate the ability of our framework and provide benchmarking experimental results, we run experiments on three computer vision tasks: image classification, image segmentation, and object detection. Our benchmark study suggests that there are multiple challenges that deserve future exploration: many deep learning training tricks may not be directly applied to FL; the non-IID dataset downgrades the model accuracy to some degree in different tasks; and improving the system efficiency of federated training is challenging given the huge number of parameters and the per-client memory cost. We hope FedCV will serve as an easy-to-use platform for researchers to explore diverse research topics at the intersection of computer vision and federated learning, such as improving models, systems, or federated optimization methods.
11.2 Related Works
[175] is the first work that applies federated learning to a real-world image dataset, Google Landmark [478], which has now become the standard image dataset for federated learning research. [56, 239, 255] apply federated learning to medical image segmentation tasks, which aims at solving the issue that training data may not be available at a single medical institution due to data privacy regulations. In the object detection task, [526] proposes a KL-divergence method to mitigate the model accuracy loss caused by non-I.I.D. data. Our work is closely related to FedVision [285], a federated learning framework for computer vision. It supports object detection in the smart city scenario using models including Fast R-CNN and YOLOv3. However, FedVision only supports the FedAvg algorithm and single-GPU training. Our FedCV platform provides diverse computer vision tasks and various FL algorithms. For federated learning in other application domains, we refer to the comprehensive vision paper [207].
11.3 Preliminary and Challenges
Federated learning (FL) leverages scattered and isolated datasets to train a global or personalized model for each client (participant) while achieving privacy preservation and savings on communication and storage costs for such large edge data. The most straightforward formulation assumes that all clients collaboratively train a global model, which is defined as:
\min_W F(W) \overset{\text{def}}{=} \min_W \sum_{k=1}^{K} \frac{N^{(k)}}{N} \cdot f^{(k)}(W), \qquad f^{(k)}(W) = \frac{1}{N^{(k)}} \sum_{i=1}^{N^{(k)}} \ell(W; X_i, y_i) \quad (11.1)
In computer vision, W can be any CNN or Transformer model (e.g., ViT). f^{(k)}(W) is the k-th client's local objective function, which measures the local empirical risk over the heterogeneous dataset D_k. \ell is the loss function of the global CNN model. For the image classification task, \ell is the cross-entropy loss. FedAvg is the first federated optimization algorithm to propose the concept of FL. To better understand the challenges of FL on CV, we rewrite its optimization process in Algorithm 7 with annotations.
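To make Equation 11.1 and the FedAvg update concrete before walking through the pseudocode, the following is a minimal PyTorch-style sketch of one communication round; the helper names (local_sgd, client_loaders) and the handling of partial participation are illustrative assumptions, not FedCV's actual implementation.

import copy
import random
import torch

def local_sgd(model, loader, epochs, lr):
    """One client's local update: plain SGD on its own data (ClientUpdate in Algorithm 7)."""
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()  # the loss l in Eq. (11.1) for image classification
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model.state_dict(), len(loader.dataset)

def fedavg_round(global_model, client_loaders, frac=0.1, epochs=1, lr=0.1):
    """One round: sample clients, run local updates, average weighted by sample counts."""
    sampled = random.sample(list(client_loaders), max(1, int(frac * len(client_loaders))))
    updates, n_total = [], 0
    for k in sampled:
        w_k, n_k = local_sgd(global_model, client_loaders[k], epochs, lr)
        updates.append((w_k, n_k))
        n_total += n_k
    # Server aggregation: W_{t+1} = sum_k (n_k / n) * W_{t+1}^k
    # (integer buffers such as BatchNorm counters are averaged too, for brevity).
    new_state = {name: sum(w[name] * (n / n_total) for w, n in updates)
                 for name in updates[0][0]}
    global_model.load_state_dict(new_state)
    return global_model

Looping fedavg_round over communication rounds reproduces the server-side logic of Algorithm 7 below; the FL-specific challenges discussed next all show up in practice as constraints on this simple loop.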
There are several clear characteristics that distinguish FL from conventional distributed training in a sealed data center:
Algorithm 7 FedAvg Algorithm: A Challenge Perspective
1: Initialization: there are K clients in the network; client k has a local dataset D_k; each client's local model is initialized as W_0;
2:
3: Server_Executes:
4: for each round t = 0, 1, 2, ... do
5:   S_t ← (sample a random set of clients)
6:   for each client k ∈ S_t in parallel do
7:     W_{t+1}^k ← ClientUpdate(k, W_t)
8:   end for
9:   W_{t+1} ← Σ_{k=1}^{K} (n_k / n) · W_{t+1}^k
10: end for
11:
12: ClientUpdate(k, W): // Run on client k
13: B ← (split D_k into batches)
14: for each local epoch i with i = 1, 2, ... do
15:   for batch b ∈ B do
16:     W ← W − η ∇_W F(W; b)
17:   end for
18: end for
19: return W to server
1. Data heterogeneity and label deficiency at the edge. Different from centralized training, the data in FL is generated at the edge in a non-independent and identically distributed (non-I.I.D.) manner. For example, in the CV scenario, smartphone users generate images or videos with distinct resolutions, qualities, and contents due to differences in their hardware and user behaviors. In addition, incentivizing users to label their private image and video data is challenging due to privacy concerns.
2. System constraints and heterogeneity. Training large DNN models at the edge is extremely challenging even when using the most powerful edge devices. In terms of memory, edge training requires significantly more memory than edge inference does. The bandwidth of edge devices is smaller than that of distributed training in a data center environment (where InfiniBand can be used), and edge devices normally do not have GPU accelerators. What is even worse is that these system capabilities are heterogeneous due to diverse hardware configurations.
3. Robustness and privacy. Since federated training does not take place in a sealed data center environment, as traditional distributed training does, it is easier to mount data and model poisoning attacks. Therefore, making the training algorithm robust against attacks is an important research direction in FL. In addition, although privacy preservation is one of the main goals, researchers have also demonstrated that the gradients exchanged between the client and the server may, to some degree, lead to privacy leaks. More privacy-preserving techniques must be evaluated on various computer vision applications [506].
11.4 FedCV Design
To solve these challenges in diverse CV tasks, we need a flexible and efficient distributed training framework with easy-to-use APIs, benchmark datasets and models, and reference implementations for various FL algorithms. To bridge the gap between CV and FL research, we have designed an open-source federated learning system for computer vision, named FedCV. FedCV is built on the FedML research library [154], a widely used FL library that only supports image classification with ResNet and simple CNN models. The system architecture of FedCV is illustrated in Figure 11.2. To distinguish FedCV from FedML, we color-code the modules specific to FedCV. FedCV makes the following contributions:
Benchmark Suite for Diverse CV Tasks: FedCV supports three computer vision tasks: image classification, image segmentation, and object detection. Related datasets and data loaders are provided. Users can either reuse our data distribution or manipulate the degree of non-I.I.D.-ness by setting hyper-parameters, as sketched below. Models are curated for benchmark evaluation.
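As a reference for how a single hyper-parameter controls the non-I.I.D. degree, the sketch below shows a Dirichlet-based label partition of the kind commonly used for such benchmarks; the function name and the handling of uneven shards are simplifying assumptions rather than FedCV's exact partitioning code.

import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with class proportions drawn from Dir(alpha).

    Smaller alpha -> more skewed (heterogeneous) per-client label distributions;
    a large alpha (e.g., 100) approaches an I.I.D. split.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx_c = rng.permutation(np.where(labels == c)[0])
        # Proportion of class c assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions) * len(idx_c)).astype(int)[:-1]
        for client_id, shard in enumerate(np.split(idx_c, cuts)):
            client_indices[client_id].extend(shard.tolist())
    return client_indices

# Example: 10 clients on CIFAR-100-style labels with alpha = 0.5, as in Table 11.2.
# parts = dirichlet_partition(train_labels, num_clients=10, alpha=0.5)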
More details of the benchmark suite are given in Section 11.5. Reference Implementation for Representative FL Algorithms: Currently, FedCV includes the standard implementations of multiple state of the art FL algorithms: Federated Averaging (FedAvg) [308], FedOpt (server Adam) [367], FedNova (client optimizer) [471], FedProx 184 Figure 11.2: Overview of FedCV System Architecture Design [390], FedMA [465], as well as some novel algorithms that have diverse training paradigms and network typologies, including FedGKT (efficient edge training) [144], Decentralized FL [148], Vertical Federated Learning (VFL) [512], Split Learning [129, 458], Federated Neural Architecture Search (FedNAS) [142], and Turbo-Aggregate [416]. These algorithms support multi-GPU distributed training, which enables training to be completed in a reasonable amountoftime. NotethatmostpublishedFLoptimizationalgorithmsarebasedonstandalone simulations, which lead to a extremely long training time. In this work, we bridge this gap and make the CV-based FL research computationally affordable. 185 Easy-to-use APIs for Algorithm Customization: With the help of the FedML API design, FedCV enables diverse networks, flexible information exchange among workers/clients, and various training procedures. We can easily implement new FL algorithms in a distributed computing environment. We defer API design details to the Appendix. Other Functionality: We support several development tools to simplify the research exploration. Specifically, researchers can load multiple clients into a single GPU, which scales up the client number with fewer GPUs, although there may be GPU contention among processes; in the lowest layer, FedCV reuses FedML-core APIs but further supports tensor-aware RPC (remote procedure call), which enables the communication between servers located at different data centers (e.g., different medical institutes); enhanced security and privacy primitive modules are added to support techniques such as secure aggregation in upper layers. 11.5 FedCV Benchmark Suite: Datasets, Models, and Algorithms Table 11.1: Summary of benchmark suite. Task Dataset Model Image Classification CIFAR-100 EfficientNet[434] MobileNet[169] GLD-23k[478] ViT[91] Image Segmentation PASCAL VOC[139] DeeplabV3+ UNet[334] Object Detection COCO[271] YOLOv5[199] FL Algorithms FedAvg, FedOpt ... We summarize the benchmark suite in FedCV in Table 11.1, and introduce such a curated list task-by-task as follows: Image Classification. The curated datasets are Google Landmarks Dataset 23k (GLD- 23K) [478] and CIFAR-100 dataset [222] with non-I.I.D partition. GLD-23K dataset is 186 suggested by Tensorflow Federated [126], a natural federated dataset from smartphone users. Forthemodel, wesuggestEfficientNet[434]andMobileNet-V3[169], whicharetwolightweight CNNs. Since the attention-based Transformer model has become a trending model in CV, we suggest Vision Transformer (ViT) [91] (ViT-B/16) to conduct experiments. As the research progresses, we may be able to support more efficient Transformers. Image Segmentation. We use the augmented PASCAL VOC dataset with annotations from 11355 images [139]. These images are taken from the original PASCAL VOC 2011 dataset which contains 20 foreground object classes and one background class. For models, DeepLabV3+ [64] and U-Net [334] are supported since they are representative image seg- mentation models in centralized training. In our experiments, we utilize ResNet-101 and MobileNet-V2 as two backbones of DeepLabV3+ and U-Net. 
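For readers who want to reproduce a comparable segmentation setup outside FedCV, a rough stand-in can be instantiated directly from torchvision; note that torchvision ships DeepLabV3 rather than DeepLabV3+, and the MobileNetV2-backbone variant used above is a custom build, so this sketch is only an approximation of the models evaluated here.

import torch
from torchvision.models.segmentation import deeplabv3_resnet101

# 20 PASCAL VOC foreground classes + 1 background class.
model = deeplabv3_resnet101(num_classes=21)
model.eval()

x = torch.randn(2, 3, 513, 513)        # a typical DeepLab crop size for VOC
with torch.no_grad():
    out = model(x)["out"]              # torchvision segmentation models return a dict
print(out.shape)                       # -> torch.Size([2, 21, 513, 513])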
Object Detection. We use the COCO [271] dataset since it contains realistic images that include detecting and segmenting objects found in everyday life through extensive use of Amazon Mechanical Turk. We then use YOLOv5 [199], an optimized version of YOLOv4 [31] as the baseline model. It outperforms all the previous versions and approaches EfficientDet Average Precision(AP) with higher frames per second (FPS). In YOLOv5, four network models (YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x) with different network depths and widths are provided to cater to various applications. We use these four models as the pretrained models. Figure 11.3: Non I.I.D. data distribution on augmented PASCAL VOC dataset with different α values. Each square represents the number of data samples of a specific class for a client. 187 Non-I.I.D. Preprocessing. Our non-I.I.D. partition method is Latent Dirichlet Allocation (LDA) [453], which is a common practice in FL research to obtain synthetic federated datasets. As an example, we visualize the non-I.I.D. in Figure 11.3. Further details of the dataset partition can be found in the Appendix. For all datasets, we provide data downloading scripts and data loaders to simplify the data preprocessing. Note that we will update regularly to support new datasets and models. The supported FL algorithms include FedAvg, FedOpt and many other representative algorithms. We provide a full list and description in the appendix. 11.6 Experiments Inthissection, wepresenttheexperimentalresultsonimageclassification, imagesegmentation, and objective detection tasks on varying deep learning models and datasets. 11.6.1 Image Classification 11.6.1.1 Implementation Details For image classification, the client number per round is 10. All experiments are conducted on a computing cluster with GTX 2080Ti. Each client has one GPU and the communication bandwidth is 10 Gbps. We conduct extensive experiments with EfficientNet [434], MobileNet V3 [169] and ViT [91] on CIFAR-100 [222], and CLD-23K [478][126] datasets. The hyper- parameter settings are listed in the Appendix. 11.6.1.2 Experimental Results The main experimental results are presented in table 11.2. Below, we provide detailed comparisons of the implemented classification models on the proposed FedCV platform. 188 0 1000 2000 3000 4000 5000 6000 7000 8000 Round 0 20 40 60 80 Top-1 Test Accuracy [%] EfficientNet-b0 FL MobileNet-V3 FL ViT FL EfficientNet-b0 Centralized MobileNet-V3 Centralized ViT Centralized (a) Three models on GLD-23k 0 500 1000 1500 2000 2500 3000 3500 4000 Round 0 10 20 30 40 50 60 Top-1 Test Accuracy [%] a=0.1 FT. a=0.5 FT. a=100 FT. a=0.1 No FT. a=0.5 No FT. a=100 No FT. (b) EfficientNet on CIFAR-100 0 1000 2000 3000 4000 5000 6000 7000 8000 Round 0 10 20 30 40 50 60 70 80 Top-1 Test Accuracy [%] EfficientNet-b0 FT. MobileNet-V3 FT. ViT FT. EfficientNet-b0 No FT. MobileNet-V3 No FT. (c) Three models on GLD-23k 0 500 1000 1500 2000 2500 3000 3500 4000 Round 0 10 20 30 40 50 Top-1 Test Accuracy [%] a=0.1 SGD FT. a=0.1 SGD No FT. a=0.5 SGD FT. a=0.5 SGD No FT. a=0.1 M-SGD FT. a=0.1 M-SGD No FT. a=0.5 M-SGD FT. a=0.5 M-SGD No FT. (d) EfficientNet on CIFAR-100 Figure 11.4: Experiments on classification task. Figure (a): Test accuracy on GLD-23k with FedAvg and centralized training. The maximum number of epochs of centralized training is 400. The learning rate is 0.3 for EfficientNet and MobileNet of centralzed training, 0.03 for ViT of centralized traning, and 0.1 for all three models of FedAvg. 
Here the learning scheduler is not used for FedAvg. Figure (b): Test accuracy of FedAvg with EfficientNet on CIFAR-100 with different Non-IID degree. Hyper-parameters of this figure are set as Table J.3 in the appendix. Here, FT. means fine-tuning, i.e. loading a pretrained model and doing FedAvg on this model. Figure (c): Test accuracy of FedAvg with EfficientNet, MobileNet and ViT on GLD-23K, with/without fine-tuning. Hyper-parameters of this figure can be found in Tables J.5, J.6 and J.7 in appendix. Figure (d): Test accuracy of FedAvg with EfficientNet on CIFAR-100, with/without fine-tuning, SGD or momentum SGD. Hyper-parameters of this figure are set as Table J.3 and Table J.4 in appendix. Here, M-SGD means using local SGD with momentum. GLD-23k NonIID vs. IID. Figure 11.4(a) shows that the test accuracy of centralized training with EfficientNet and MobileNet outperforms FedAvg training by ten percent. And for the ViT, the accuracy of centralized training is similar with FedAvg. Impacts of different degrees of Non-IID. Figure 11.4(b) and Figure J.6 (in Appendix) show the influence of different degrees of Non-IID on the training performance of EfficientNet 189 Table 11.2: Summary of experimental results on image classification. In this table, Cent. refers to centralized training. For all experiments, we use a batch size of 256 for centralized training and 32 for FedAvg. We use a linear learning rate scheduler with a step size of 0.97 for centralized training, but no scheduler for FedAvg. We use momentum SGD with momentum coefficient of 0.9 for all experiments. More experimental results on other settings can be found in Tables J.3, J.4, J.5, J.6 and J.7 in the Appendix. Dataset Model Partition LR Acc CIFAR-100 EfficientNet Cent. 0.01 0.6058 a=0.1 0.003 0.4295 a=0.5 0.01 0.5502 a=100.0 0.003 0.6158 MobileNet V3 Cent. 0.01 0.5785 a=0.1 0.003 0.4276 a=0.5 0.01 0.4691 a=100.0 0.003 0.5203 GLD-23k EfficientNet Cent. 0.3 0.8826 Non-IID 0.1 0.8035 MobileNet V3 Cent. 0.3 0.8851 Non-IID 0.03 0.7841 ViT-B/16 Cent. 0.03 0.7565 Non-IID 0.03 0.7611 and MobileNetV3. Experimental results align with the results of LDA [453]. A higher α (i.e., lower degree of Non-IID) causes the test accuracy to increase. Fine-tuning vs. training from scratch. Figure 11.4(b), Figure J.6 in appendix, and Figure 11.4(c) show that the performance of fine-tuning is more effective than training from scratch. For the convergence speed, fine-tuning can achieve a test accuracy of 60%, nearly 20× faster than training from scratch. After training is completed, fine-tuning outperforms training from scratch by about 20 percent. Momentum SGD vs SGD. Figure 11.4 (d), and Figures J.5(a)-(b) (in appendix) show that SGD with momentum cannot guarantee better performance than vanilla SGD. When using EfficientNet On CIFAR-100 dataset of α = 0.5, momentum SGD has similar performance to SGD with fine tuning, but with a much higher test accuracy than SGD training from scratch. With α = 0.1, the performance of momentum SGD is not significantly influenced by fine-tuning, whereas vanilla SGD can see significant improvement. Learning rate scheduler. Figure 11.5, and Figure J.5(c)-(d) (in appendix) show an interesting result in which the linear learning rate decay may not improve the performance, 190 0 500 1000 1500 2000 2500 3000 3500 4000 Round 0 10 20 30 40 50 Top-1 Test Accuracy [%] a=0.1 No sched.+FT. a=0.1 No sched.+No FT. a=0.1 Sched.+FT. a=0.1 Sched.+No FT. a=0.5 No sched.+FT. a=0.5 No sched.+No FT. a=0.5 Sched.+FT. a=0.5 Sched.+No FT. 
Figure 11.5: Test accuracy on CIFAR-100 with EfficientNet trained with momentum SGD, with/without fine-tuning and learning rate scheduler. Hyper-parameters are set as Table J.3 in appendix. Here, Sched. means using learning rate scheduler with step size of 0.99. and even leads to performance decrease. One reason may be that in the last training epochs, each client cannot converge with too small learning rate. However, learning rate decay is able to make the training process more stable. For cases where α = 0.1 and α = 0.5, four curves of linear learning rate decay are smoother than without learning rate decay. Table 11.3: Efficiency of training MobileNet V3, EfficientNet, Vit models with FedAvg. In this table, MMACs refer to the forward computation for one sample. Total time refers to the entire training time plus evaluating time; we evaluate the model per 100 communication rounds. For the MobileNet and EfficientNet, the number of total communication rounds is 4000, and for ViT it is 8000. The communication cost is theoretically calculated out. Note the actual communication time should be larger than the theoretical communication time due to the straggler problem and other overhead. Model MobileNet-V3 EfficientNet ViT-B/16 Params 4M 3.8M 81.8M MMACs 2137 3796.4 16067.5 Comm rounds 4000 4000 8000 Total time 5.16h 5.05h 31.1h Comm cost 0.278h 0.264h 5.68h Efficiency analysis. We summarize the system performance of three models in Table 11.3, which demonstrate that if we train a big deep learning model such as ViT in the federated setting, there exists a huge communication overhead compared with small models. Furthermore, in the real federated environment, the communication bandwidth could be even worse. 191 11.6.2 Image Segmentation 11.6.2.1 Implementation Details 25 50 75 100 125 150 175 200 57.5 60.0 62.5 65.0 67.5 70.0 72.5 75.0 Test mIoU [%] DeeplabV3+ (ResNet) Deeplab-Resnet101(FT) Deeplab-Resnet101(SFL) 20 40 60 80 100 120 140 30 40 50 60 70 UNet (ResNet) Unet-Resnet101(FT) Unet-Resnet101(SFL) Round (a) Experiments with/without fine-tuning 20 40 60 80 100 62 64 66 68 70 72 74 76 Test mIoU [%] DeeplabV3+ (ResNet) Batch size:10 Batch size:4 Batch size:6 Batch size:8 20 40 60 80 100 120 140 160 35 40 45 50 55 60 65 70 75 UNet (ResNet) Batch size:10 Batch size:4 Batch size:6 Batch size:8 Round (b) Experiments with varying batch sizes Figure 11.6: Performance evaluation of segmentation tasks on Pascal VOC dataset. Figure (a): Comparing performance of DeeplabV3+ and UNet models with fine-tuning (FT) Resnet101 backbones against training from scratch (SFL) . Figure (b): Evaluating performance of DeeplabV3+ and UNet models on various batch sizes. For the image segmentation, we train DeeplabV3+ and U-Net within the FedCV platform, in which the number of clients involved in each round of image segmentation are either 4 or 8. These studies are carried out on a computing cluster with a Quadro RTX 5000 graphics card. Each client has one GPU, with a 15.754 GB/s communication bandwidth. 192 Table 11.4 comprises a list of models and hyper-parameters we explored for evaluating performance of segmentation tasks in the federated setting. Note that we use following abbreviation throughout our analysis: TT: Training Type for Backbone. There are three strategies that we use for training the backbone. (i) Fine-Tuning (FT): We start with a ImageNet-pretrained backbone and fine-tune it for our task. 
(ii) Freezed Backbone (FZ): Similarly to FT, we start with ImageNet[387] pretrained backbone but do not train or fine-tune the backbone at all to save on computational complexity. (iii) Scratch Federated Learning (SFL): Training the entire architecture end-to-end starting from scratch. Table 11.4: Dataset, models and hyper-parametesr choices for federated image segmentation task. Dataset Augemented PASCAL VOC Model DeeplabV3+, UNet Backbone ResNet-101, MobileNetV2 Backbone TT FT, FZ, SFL Batch Size Range 4 to 16 LR Range 0.0007 to 0.1 11.6.2.2 Experimental Results In this section, we analyze and discuss our results for image segmentation tasks in the federated setting. We summarize our top results in Table 11.5 for a variety of training setups. Backbone Training vs. Fine-Tuning. Figure 11.6(a) shows that pre-trained backbones coupled with fine-tuning results in only a slightly better performance (less than 2%) compared to training from scratch, which indicates that while pre-trained backbones aid in federated image segmentation accuracy, they are not necessary. This finding opens the door to advanced tasks such as medical imaging, where pre-trained backbones may not be useful and end-to-end training from scratch is the only viable alternative. Batch Size vs. Memory Trade-Off. Figure 11.6(b) and Table 11.6 show that a smaller batch size, such as 4 instead of 10, reduces memory by roughly a factor of two while sacrificing 193 Table 11.5: Summary of test results on Pascal VOC dataset for federated image segmentation task. DD: Data Distribution Type. N-IID: Heterogeneous distribution with partition factor α =0.5 IID: Homogeneous distribution. C: Number of Clients Model Backbone (TT) DD C mIOU DeeplabV3+ ResNet-101 (FT) IID 4 77.9% DeeplabV3+ ResNet-101 (FT) N-IID 4 76.47% DeeplabV3+ ResNet-101 (FT) N-IID 8 75.69% DeeplabV3+ ResNet-101 (SFL) N-IID 4 75.44% DeeplabV3+ ResNet-101 (FZ) N-IID 4 68.24% DeeplabV3+ MobileNetV2 (FT) N-IID 4 69.31% UNet ResNet-101 (FT) IID 4 75.14% UNet ResNet-101 (FT) N-IID 4 74.34% UNet ResNet-101 (FT) N-IID 8 73.65% UNet ResNet-101 (SFL) N-IID 4 74.2% UNet ResNet-101 (FZ) N-IID 4 51.19% UNet MobileNetV2 (FT) N-IID 4 66.14% nearly 2% accuracy. This is an important trade off to make because edge devices in a federated learning setup may have constrained memory. Table 11.6: Performance and memory analysis for various batch size of segmentation models on Pascal VOC Dataset. BS: Batch Size Model Backbone BS Memory mIOU DeeplabV3+ ResNet-101 4 6119M 72.38% DeeplabV3+ ResNet-101 6 8009M 73.28% DeeplabV3+ ResNet-101 8 10545M 74.89% DeeplabV3+ ResNet-101 10 13084M 75.5% UNet ResNet-101 4 6032M 71.54% UNet ResNet-101 6 8456M 71.89% UNet ResNet-101 8 10056M 72.4% UNet ResNet-101 10 12219M 73.55% Data distribution impact analysis. For various partition values α , Figure 11.3 depicts the distribution of classes among clients. 
Even when the partition factor changes from totally 194 20 40 60 80 100 67 68 69 70 71 72 73 74 75 Test mIoU [%] DeeplabV3+ (ResNet) hetero a:0.1 hetero a:0.5 hetero a:100 homo 20 40 60 80 100 120 140 35 40 45 50 55 60 65 70 75 UNet (ResNet) hetero a:0.1 hetero a:0.5 hetero a:100 homo Round (a) Experiments on various partition factors 20 40 60 80 100 120 140 64 66 68 70 72 74 76 Test mIoU [%] DeeplabV3+ (ResNet) Number of clients:4 Number of clients:8 Number of clients:1 25 50 75 100 125 150 175 200 30 40 50 60 70 UNet (ResNet) Number of clients:4 Number of clients:8 Number of clients:1 Round (b) Experiments on varying number of clients Figure 11.7: Performance evaluation of segmentataion task on Pascal VOC dataset. Figure (a): Evaluating performance of DeeplabV3+ and UNet models with Resnet101 as a backbone on various partition factors (a).Figure (b): Evaluation performance of DeeplabV3+ and UNet models with Resnet-101 as backbone on varying number of clients. Table 11.7: System performance chart of segmentation network architectures we considered. TT: Training Type. BS: Batch Size Model Backbone (TT) Dataset Params FLOPS Memory (BS) Total Time DeeplabV3+ ResNet-101 (FT) PASCAL VOC 59.34M 88.85G 13084M (10) 14.16h DeeplabV3+ ResNet-101 (FZ) PASCAL VOC 16.84M 88.85G 7541M (16) 23.59h DeeplabV3+ MobileNetV2 (FT) PASCAL VOC 5.81M 26.56G 12104M (16) 20.5h UNet ResNet-101 (FT) PASCAL VOC 51.51M 62.22G 12219M (10) 14.5h UNet ResNet-101 (FZ) PASCAL VOC 9.01M 62.22G 7687M (16) 51.11h UNet MobileNetV2 (FT) PASCAL VOC 7.91M 14.24G 11706M (16) 22.03h homogeneous to extremely heterogeneous, as shown in Figure ?? (a), the accuracy only degrades by about 2%. This further demonstrates that federated segmentation learning can instill enough generalization capability in local clients to allow them to perform well on unknown data, obviating the need for centralized or widely distributed data. Resiliency in the face of increasing clients The number of rounds needed for the model to converge increases as the number of clients increases (see figure ??(b)). When compared to smaller client sizes, which are theoretically expected to perform better since each local client has more data points to train on, it has little effect on final accuracy after a sufficient number of rounds. 195 25 50 75 100 125 150 175 200 Round 30 40 50 60 70 Test mIoU [%] Deeplab-Mobilenetv2 Deeplab-Resnet101(FT) Deeplab-Resnet101(FZ) Unet-Mobilenetv2 Unet-Resnet101(FT) Unet-Resnet101(FZ) Figure 11.8: Performance comparison of DeeplabV3+ and UNet with ResNet101 and Mo- bileNetV2 as backbones. DeeplabV3+ (Resnet101) reaches a better accuracy compared to other alternatives. FT: Fine-Tuning Backbone. FZ: Freezed Backbone 11.6.2.3 System Performance Analysis ResNet is one of the most widely used backbones for encoder-decoder architecture in image segmentation tasks; however, it has a high computing cost that many edge devices might not be able to bear. There are two obvious ways to trim the cost down: (i) Freezing the pre-trained backbone; (ii) Plugging computationally efficient backbone (Eg. MobileNetV2). Figure 11.8 depicts the performance variance when one of the two described strategies is applied for backbones in DeeplabV3+ and UNet architectures for federated image segmentation. When compared to every other mix, the accuracy of ResNet-101 backbone is demonstrably higher. On the other hand, as shown in Table 11.7, the alternatives are extremely efficient at the cost of performance degradation. 
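The FZ strategy above amounts to disabling gradients on the backbone and optimizing only the segmentation head; a minimal PyTorch sketch is shown below, using torchvision's DeepLabV3 with a ResNet-101 backbone as a stand-in for the architectures evaluated here (attribute names and the learning rate are illustrative).

import torch
from torchvision.models.segmentation import deeplabv3_resnet101

model = deeplabv3_resnet101(num_classes=21)

# FZ: freeze the backbone so it is never updated during local training.
for p in model.backbone.parameters():
    p.requires_grad = False

# Only the (much smaller) head is trained and aggregated in FL, which is where
# the trainable-parameter reduction reported in Table 11.7 comes from.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.007, momentum=0.9)
print(f"trainable params: {sum(p.numel() for p in trainable) / 1e6:.2f}M")

Beyond saving computation, freezing also shrinks the per-round payload each client must upload, at the accuracy cost visible in Table 11.5.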
11.6.3 Object Detection 11.6.3.1 Implementation Details For object detection, we use pre-trained YOLOV5 for federated learning experiments with the FedAvg algorithm. The client number we used include 4 and 8 for performance comparison. Each client was run at one GPU (NVIDIA V100). The metric in our experiments is mAP@0.5 (mean average precision with a threshold of 0.5 for IOU). 196 (a) Different learning rates (b) Different number of clients Figure 11.9: Experiments on detection tasks on varying learning rates and number of clients. Figure (a): Non-IID data comparsion with different learning rate. Figure (b): Non-IID data comparsion with different number of clients. 11.6.3.2 Experimental Results Learning rate. In the federated setting, different learning rates are evaluated. While keeping the other hyper-parameters (e.g., client number is set to 4), we notice that lr=0.01 can have a better result compared to the other choices from Figure 11.9 (a). Non-I.I.D. evaluation. For the client numbers 4 and 8, we use the partition method introduced in the appendix to obtain synthetic federated datasets. We found that when using YOLOv5, it is difficult for FedAvg to reach the same results as that of centralized for Non-IID dataset. Figure 11.9 shows there is a large gap between centralized training and FedAvg-based federated training. In centralized training of YOLOv5, test mAP of all four model variants is over 0.95 [199], whereas the best accuracy in the federated setting is smaller than 0.85. The main reason is that the optimizer and training tricks used in centralized training could not be directly transplanted to the FL framework, indicating that further research for object detection in the federated setting is required. Evaluation on different number of clients. We also show the performance with different clientsamong4modelsinFigure11.9(b). Resultsshowingthat C =8hasalowerperformance compared to the C = 4. 197 System performance analysis. Table 11.8 summarizes the system performance of four different model variants. We can see that as the network structure depth and width increased among the four models, the model performed well with a better mAP. Table 11.8: System performance of YOLOv5 Model Layers Parameters FLOPS Total Time YOLOv5s 283 7.27M 17.1G 25.1h YOLOv5m 391 21.4M 51.4G 49.3h YOLOv5l 499 47.1M 115.6G 73.5h YOLOv5x 607 87.8M 219.0G 92.4h 11.7 Conclusion In this work, we propose an easy-to-use federated learning framework for diverse computer vision tasks, including image classification, image segmentation, and object detection, dubbed FedCV. We provide several non-IID benchmarking datasets, models, and various reference FL algorithms. We hope that FedCV can open doors for researchers to develop new federated algorithms for various computer vision tasks. FedML Ecosystem [154] aims to provide a one-stop scientific research platform through FedML Ecosystem and finally realize trustworthy ML/AI, which is more secure, scalable, efficient, and ubiquitous. FedCV serves as one of the key components of FedML ecosystem. The other important applications include FedNLP [268], FedGraphNN [151], and FedIoT [534]. 198 Chapter 12 FedIoT: FedML for Internet of Things 12.1 Introduction Along with the faster Internet speed and more endpoints brought by the 5G, billions of IoT devices online will be deployed [371]. 
However, for the data anomaly detection task (e.g., DDoS attack detection), the centralized over-the-cloud approach [315] may not fit this trend due to data privacy and extremely high communication/storage overhead (e.g., high-frequency data from time-series sensors) centralizing data from numerous IoT devices. As such, researchers attempt to address these challenges using federated learning (FL), which is a trending paradigm that can train a global or personalized model without centralizing data from edge devices [207]. DIoT [331] and IoTDefender [105] employ FL for intrusion detection in IoT devices by collaboratively training isolated datasets for a global or even personalized model. [286] further upgrades the model to a complex attention-based CNN-LSTM but mitigates the communication cost with Top-k gradient compression. Beyond detecting the abnormal data, [370] even considers an adversarial setup where several malicious participants poison the federated model. However, as the traffic volume of IoT-based DDoS attacks reaches unprecedented levels [24], the efficacy of these works is unclear, mainly when the attacks spread to large-scale types and devices but training on the limited data from small-scale devices cannot obtain high accuracy. More significantly, our research community lacks an open, generic, and flexible 199 FL-enabled IoT platform for advanced researches. Existing works only run simulations rather than perform experiments in real IoT platforms, or their specialized system design is not generalized enough for future research. In addition, given that IoT devices are resource- constrained, system performance analysis is also an essential step. Unfortunately, none of these works provides such analysis towards practical deployment. Figure 12.1: Overview of FedIoT Platform To further push forward the research in FL-based IoT cybersecurity, we build FedIoT platform with a simple but effective design philosophy that lays the foundation for future scientific research. The overall design spans dataset, model, algorithm, and system design. More specifically, we propose a federated learning algorithmic framework, FedDetect which utilizes adaptive optimizer (e.g., Adam) and cross-round learning rate scheduler, rather than naive FedAvg [313] for local training. Furthermore, FedDetect supports both global threshold and personalized threshold for different scenarios. In order to verify the effectiveness of FL for IoT security, we design a novel method to synthesize the testset from public dataset for FL-based IoT cybersecurity research. Its design aims to evaluate whether the global model obtained through FL training can recognize more attack types and has higher 200 detection performance (the evaluation and test dataset cover all attack types on the entire IoT network). In addition, we build FedIoT platform for realistic IoT devices with a high degree of modularization and flexible APIs. As such, we can add new data, models, and algorithms with only lightweight modification. Most importantly, FedIoT supports both IoT edge training (e.g., Raspberry PI) and CPU/GPU-based distributed training with the support of MQTT and MPI communication backend, respectively. We evaluate FedIoT platform and FedDetect algorithm with both global threshold and personalized threshold on N-BaIoT [315] and LANDER [454] dataset, and also analyze the system performance comprehensively (i.e., computational speed, communication cost, and memory cost). 
Our results demonstrate the efficacy of federated learning in detecting a large range of attack types. More specifically, we find that the global model obtained through FedDetect training has higher detection performance, while centralizing the data from only a few IoT devices yields worse performance due to insufficient benign training samples or attack types. FedDetect with a personalized threshold also suggests that FL-based detection may beat training solely on the insufficient local data of a single device. Our system efficiency analysis shows that the training-time memory cost occupies only a small fraction of the entire host memory of the IoT platform (Raspberry Pi), and the end-to-end training time is feasible (less than 1 hour) for practical applications, although the ratio of communication cost is high. In essence, we summarize our contributions as follows:
• We provide a novel method to synthesize a dataset for FL-based IoT cybersecurity research, aiming at evaluating the efficacy of FL in recognizing a wide range of attack types.
• We propose a federated learning framework, FedDetect, for IoT cybersecurity. Most importantly, FedDetect incorporates local adaptivity and a cross-round learning rate scheduler for effective distributed optimization and also supports both global and personalized thresholds for different scenarios.
• We build the FedIoT platform for realistic IoT devices (e.g., Raspberry Pi). Our performance analysis, system design, and flexible APIs demonstrate the feasibility and generality for future exploration.
12.2 Algorithm and System Design
12.2.1 Overview
Federated learning (FL)-based IoT cybersecurity aims to detect network intrusion in IoT devices without centralizing a large amount of high-frequency edge data. A generic setting is that many IoT devices collaboratively train a deep Autoencoder model for anomaly detection via federated learning. In this work, our goal is to build an FL system, FedIoT, for this setting to analyze both algorithmic and system performance on real IoT devices (e.g., Raspberry Pi). We build the FedIoT platform with a simple but effective design philosophy that lays the foundation for future scientific research. The overall design is illustrated in Figure 12.1. More specifically, the entire software architecture consists of three layers: the application layer, the algorithm layer, and the infrastructure layer. We make each layer and module perform its own duty with a high degree of modularization. In the application layer, FedIoT provides a one-line API to launch federated training on IoT devices in a distributed computing manner. This API takes non-I.I.D. datasets (Section 12.2.2) and a simple but effective deep Autoencoder model (Section 12.2.3) as its input; at the algorithm layer, FedIoT supports various FL algorithms such as FedAvg [307], FedOPT [367], and their customized version FedDetect for anomaly detection (Section 12.2.4); at the infrastructure layer, FedIoT supports lightweight communication APIs with MQTT [182] (i.e., Message Queuing Telemetry Transport, a standard for IoT messaging) and a customized PyTorch library [343] that can execute primitives of on-device model training such as forward propagation, loss computation, and back-propagation (Section K.1). Our proposed lightweight FedIoT framework can support the direct implementation of federated learning on AI-enabled IoT edge devices, such as the Raspberry Pi and NVIDIA Jetson Nano.
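To illustrate the kind of lightweight message exchange the infrastructure layer is built around, the sketch below serializes a model's weights and exchanges them over MQTT with the paho-mqtt client. The broker address, topic names, and torch-save serialization are illustrative assumptions for a client-side round trip, not FedIoT's actual wire format.

import io

import paho.mqtt.client as mqtt
import torch

BROKER, PORT = "fl-server.local", 1883                     # hypothetical broker address
UPLINK, DOWNLINK = "fediot/client7/upload", "fediot/global_model"  # illustrative topics

def serialize(state_dict):
    buf = io.BytesIO()
    torch.save(state_dict, buf)                            # model weights -> bytes payload
    return buf.getvalue()

def on_message(client, userdata, msg):
    # Server broadcast of the aggregated global model for the next round.
    global_state = torch.load(io.BytesIO(msg.payload))
    print("received global model with", len(global_state), "tensors")

client = mqtt.Client()
client.on_message = on_message
client.connect(BROKER, PORT)
client.subscribe(DOWNLINK)

# After a round of local training, upload the local model to the server.
local_state = {"w": torch.zeros(10)}                       # placeholder for a trained state_dict
client.publish(UPLINK, payload=serialize(local_state))
client.loop_forever()                                      # block and wait for the next global model

The same publish/subscribe pattern also explains why MQTT suits resource-constrained clients: the device never needs to accept inbound connections, and the broker decouples it from the aggregation server.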
12.2.2 Dataset and Preprocessing We introduce a widely used public IoT dataset for anomaly detection and then synthesize a dataset for realistic evaluation on the FL-based method. In addition, we also introduce another novel private dataset to consolidate the evaluation. N-BaIoT dataset [315] is a widely used public dataset for research on anomaly detection of IoT data. N-BaIoT captures network traffic flow from 9 commercial IoT devices authentically attacked by network intrusions. In the original N-BaIoT dataset, it provides 115 statistic features, which could be severely influenced by the hostile attack. Each IoT device has 2 subsets: one benign set containing normal network flow data only, and one attack data subset consisting of two common malware attacks, Mirai and BASHLITE, which each contains five different kinds of attack types. USC LANDER IoT Operation Traces-20200127 dataset [454] is one of the latest dataset for the research of the operational traffic on IoT edge devices. The LANDER dataset contains 10-day operational traffic for 14 different widely-used IoT devices located in a LAN network without any types of attack. The detailed data distribution including the statistic features of these two datasets will be shown as tables in the Appendix on the github website. In order to verify the effectiveness of FL for IoT security, different from previous works [331, 105, 188, 286, 370], we hope to learn a detection model from benign data widely distributed in different types of devices that can identify a larger range of attacks. Specifically, we hope that the data design meets three characteristics: 1. It contains training data of multiple device types (benign data); 2. Each device has no full set of all attack types; 3. The evaluation and test dataset should cover all attack types on the entire IoT network. These requirements are based on several real-world assumptions: 203 • From the perspective of benign data, features of the network data flow among different types of devices are inconsistent. For example, a surveillance camera will record in real-time (24x7 hours), while the data generated by a doorbell is intermittent. • The detection model should have the ability to identify multiple attack types, but the types of attacks encountered by a single device are likely to be only part of the full set. Only by learning with the feature of all attack types in the entire IoT network, the detection model on a single device can have the ability to detect unknown attacks in a wide range. • Because of privacy (e.g., camera video) and extremely high communication/storage overhead (e.g., high-frequency data from time-series sensors), it is infeasible to centralize data on massive devices. Therefore, we use N-BaIoT and LANDER to synthesize the testset for FL-based IoT cybersecurity research. Our synthesized testset is generated by the following rules: 1. For each device, we assign the first 2/3 of selected benign data as the training data and the rest 1/3 as the evaluation dataset (i.e., calculating the anomaly threshold, see Section 12.2.3); 2. We enforce the global test dataset to compose all devices’ benign data and all types of attack data. More specifically, for each device, we randomly select 5000 benign data samples and 5000 malicious data samples from a wide range of attack types (some devices may not have sufficient data samples from dataset). 
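A compact sketch of the synthesis rules above, i.e., the per-device 2/3 benign training and 1/3 evaluation split plus a global test set of up to 5,000 benign and 5,000 attack samples per device, is given below; the dictionary layout and function name are illustrative assumptions.

import numpy as np

def synthesize(benign_per_device, attack_per_device, seed=0):
    """benign_per_device / attack_per_device: dicts mapping device id -> (n_samples, n_features) arrays."""
    rng = np.random.default_rng(seed)
    train, evaluation, test_x, test_y = {}, {}, [], []
    for dev, benign in benign_per_device.items():
        cut = 2 * len(benign) // 3
        train[dev], evaluation[dev] = benign[:cut], benign[cut:]      # rule 1
        # Rule 2: the global test set covers every device's benign data and all attack types.
        for data, label in ((benign, 0), (attack_per_device[dev], 1)):
            k = min(5000, len(data))                                  # some devices have fewer samples
            test_x.append(data[rng.choice(len(data), size=k, replace=False)])
            test_y.append(np.full(k, label))
    return train, evaluation, np.concatenate(test_x), np.concatenate(test_y)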
Intuitively, the global model obtained through FL training can recognize more attack types and achieve higher detection performance, while local training alone may perform poorly due to insufficient benign training samples or attack types.
12.2.3 Anomaly Detection with Deep Autoencoder
We apply a Deep Autoencoder [386] as the model for anomaly detection. The Deep Autoencoder is simple but effective and does not lose generality for evaluating FL algorithms and our FedIoT platform. Other advanced models (e.g., a Variational Autoencoder or an attention-based CNN-LSTM [286]) can also be applied in our framework without additional engineering effort.
Figure 12.2: Autoencoder Architecture
Model Definition. The Deep Autoencoder focuses on the reconstruction of the input data in an unsupervised learning manner. Figure 12.2 shows an example of the model architecture. Essentially, the Autoencoder splits the neural network into two segments, the encoder f_{\theta_e} and the decoder f_{\theta_d}. The encoder f_{\theta_e} compresses the input x into a latent representation z. The decoder f_{\theta_d} then attempts to restore the original input after some generalized non-linear transformation. Mathematically, the loss function can be written as \mathcal{L}(x, x') = \|x - x'\|^2 = \|x - f_{\theta_d}(z)\|^2 = \|x - f_{\theta_d}(f_{\theta_e}(x))\|^2. This loss is also called the reconstruction error and is computed as the mean squared error \text{MSE} = \frac{1}{d} \sum_{i=1}^{d} (x_i - \hat{x}_i)^2, where d is the dimension of the input. In essence, this loss function aims to encode the input into a latent representation z such that it can be regenerated by the decoder. To minimize the loss, a common deep learning optimizer such as Adam [216] can be applied.
tr = \text{MSE} + \frac{\alpha}{\sqrt{s}} \, \sigma(\text{MSE}) \quad (12.1)
Anomaly Detection. For anomaly detection, we train the Deep Autoencoder on benign IoT traffic data to learn the IoT devices' normal behavior, so that the Autoencoder can successfully extract and reconstruct features of benign samples but fails to do so on abnormal samples with unseen features. During the detection phase, an input sample whose reconstruction error exceeds a threshold is flagged as abnormal. In detail, after training the Autoencoder on the benign training dataset, we first calculate the reconstruction error (MSE) for each sample in the benign evaluation dataset, and then obtain the threshold by Equation 12.1, which computes the mean of the MSE plus the scaled standard deviation of the MSE over all evaluation samples (note that when calculating the reconstruction error with a mini-batch of s samples, the standard deviation is divided by \sqrt{s}, and the scaled standard deviation is further multiplied by a coefficient \alpha). The value of the threshold should be as large as possible to suppress the majority of benign samples while preventing abnormal samples from being classified as benign. Extensive experiments show that the overall performance is best when \alpha equals 2.
12.2.4 FedDetect
We propose a federated learning algorithmic framework, FedDetect, for anomaly detection across distributed IoT devices. In this work, we make the following assumptions: 1. The IoT device may be vulnerable but is not initially compromised. 2. The Internet gateway (i.e., the router) is not compromised. In contrast to existing works on FL-based IoT anomaly detection, FedDetect utilizes an adaptive optimizer (e.g., Adam) and a cross-round learning rate scheduler, rather than naive FedAvg [313], for local training. Moreover, FedDetect supports both a global threshold and a personalized threshold for different scenarios.
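Before turning to the federated training loop, the detection model and the threshold rule of Equation 12.1 can be written down directly. The layer widths below follow the approximate 75%/50%/33%/25% schedule described later in Section 12.3.1 (rounded), and the code is a sketch rather than FedIoT's exact architecture.

import torch
import torch.nn as nn

class AE(nn.Module):
    """Deep Autoencoder for 115-dim traffic features; tanh activations as in Section 12.3.1."""
    def __init__(self, d=115, dims=(86, 57, 38, 29)):
        super().__init__()
        enc, prev = [], d
        for h in dims:                                   # encoder: 115 -> 86 -> 57 -> 38 -> 29
            enc += [nn.Linear(prev, h), nn.Tanh()]
            prev = h
        dec = []
        for h in list(dims[::-1][1:]) + [d]:             # decoder mirrors the encoder back to 115
            dec += [nn.Linear(prev, h), nn.Tanh()]
            prev = h
        self.encoder, self.decoder = nn.Sequential(*enc), nn.Sequential(*dec)

    def forward(self, x):
        return self.decoder(self.encoder(x))

@torch.no_grad()
def threshold(model, benign_eval, alpha=2.0):
    """Equation (12.1): tr = mean(MSE) + (alpha / sqrt(s)) * std(MSE) over benign evaluation data."""
    mse = ((model(benign_eval) - benign_eval) ** 2).mean(dim=1)
    s = mse.numel()
    return mse.mean() + alpha / s ** 0.5 * mse.std()

# Detection: a sample is flagged as an attack if its reconstruction MSE exceeds tr.
# For the global threshold, the MSE sequences from all devices are concatenated before
# applying the same formula; for the personalized threshold, each device uses only its
# own benign evaluation data.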
FedDetect is summarized in Algorithm 8.
Local Adaptivity and Cross-round Learning Rate Scheduler. The choice of Adam for local training and of a cross-round learning rate scheduler is based on our experimental observations.
Algorithm 8 FedDetect
1: Initialization: w_0
2: for round t = 0, 1, ... do
3:   Adjust the cross-round learning rate (cosine scheduler)
4:   for client i = 0 to K − 1 do
5:     w_{t+1}^i ← Local Training with Adam
6:     Upload w_{t+1}^i to the server
7:   end for
8:   w_{t+1} = (1/K) Σ_{i=0}^{K−1} w_{t+1}^i
9:   Clients receive the new model w_{t+1}
10: end for
11: Personalized Threshold Algorithm:
12: for client i = 0 to K − 1 do
13:   tr_i = MSE_i + (α/√s) σ(MSE_i)
14: end for
15: Globalized Threshold Algorithm:
16: MSE_Global = [MSE_0, ..., MSE_{K−1}]
17: tr_Global = MSE_Global + (α/√s) σ(MSE_Global)
We empirically find that local Adam beats naive local SGD, or SGD with momentum, when a cross-round learning rate scheduler (e.g., a cosine scheduler) is applied. The efficacy of local adaptivity has also been intensively verified in CV and NLP tasks by recent theoretical works [470, 367].
Global and Personalized Threshold. After obtaining the global model via federated training, we propose two algorithmic modules to calculate the anomaly threshold for each device: Global Threshold and Personalized Threshold. More specifically, in the Global Threshold algorithm, each device runs the global model locally to obtain its MSE sequence and then synchronizes it to the server, and the server uses the MSE sequences from all devices to generate a unified global threshold for detection on each device. In the Personalized Threshold algorithm, each device computes its local threshold using its local data only. The Global Threshold algorithm objectively expresses the detection performance of the FL-trained global model on the entire IoT network (i.e., detecting a larger range of attack types, as introduced in Section 12.2.2); the Personalized Threshold algorithm reflects how well the global model generalizes to the local data of each device. Experimental results of these two algorithms demonstrate the efficacy of FL in diverse real-world scenarios (Section 12.3.4).
12.3 Experiments
We evaluated the FedIoT platform on two aspects: algorithmic performance in both the global and personalized model settings, and a comprehensive analysis of system efficiency, including computational speed, communication cost, and memory cost.
12.3.1 Setup
Implementation. We implemented two computing paradigms of the FedIoT platform: (1) IoT edge training and (2) CPU/GPU-based distributed training. For IoT edge training, we chose 9 Raspberry Pi 4B devices as clients and a GPU server that integrates both the FL server and the MQTT service, following the design introduced in Section K.1. The Raspberry Pi 4B has a 1.5 GHz Broadcom BCM2711 processor and 4 GB of RAM. The GPU server has 2x AMD EPYC 7302 processors, an RTX A4000 GPU, and 256 GB of memory. Local training is performed on the Raspberry Pis, and weight aggregation is performed on the GPU server. For CPU/GPU-based distributed computing, we used a GPU server that contains 4 NVIDIA RTX 2080Ti GPUs with sufficient GPU memory for our setting. Dataset and Model Definitions. We evaluated the FedIoT platform using the two datasets described in Section 12.2.2. For the Autoencoder, we set the input dimension to 115, the same as the number of data features. The encoder network has four hidden layers, whose dimensions decrease to 75%, 50%, 33%, and 25% of the input dimension.
The decoder has the same layer design as the encoder, but with an increasing sequence. The number of parameters for the Autoencoder is equal to 36628. 208 Hyper-parameters. We searched for the learning rate on a range of {0.1, 0.01, 0.001, 0.0001, 0.00001}, input batch size on a range of {8, 16, 32, 64, 128}, local training epoch on a range of {1, 5, 10, 15, 20}, total training round on a range of {50, 100} and tested both tanh and sigmoid activation functions in Autoencoder. After hyper-parameter searching, we fixed the following hyper-parameters: the batch size for input is 64, the local training epoch in FL is 1, total training round is 100, and the activation function inside the Autoencoder is set as tanh function. More hyper-parameters can be found in our source code. 12.3.2 Baselines To evaluate our proposed algorithm in Section 12.2.4 comprehensively, we design the following three baselines: • CL-Single: Each device trains a detection model with its own local dataset. This baseline can obtain the model performance when training a local model without the help of federated learning. • CL-Multi: A detection model is trained using the merged data of three devices, which have the top 3 performances when trained by CL-Single. This baseline is used to simulate the realistic scenarios that we can only centralize the data from a fraction of devices. • CL-Combined: A detection model is trained with the merged data from nine devices. The result of this baseline serves as the upper bound performance for the centralized training. It may perform the best because it gathers all data samples from the entire IoT network. 12.3.3 Metrics Following the existing works, we use three metrics to evaluate the detection performance: accuracy (ACC), precision (PR) and false positive rate (FPR). 209 12.3.4 Results of Learning Performance Evaluationusingtheglobalthreshold. Wefirstevaluatedtheperformanceof FedDetect algorithm using the global threshold and compared its performance with three baselines. For the baseline CL-Single, we reported the average value of the nine devices’ model performances. The accuracy results shown in Figure G.4a and G.4b are evaluated on N-BaIoT dataset and LANDER dataset respectively. The full matrix evaluation plots and training curves are all listed in the Appendix and the github website. We could observe that as expected, the centralized training CL-Combined has the best performance in both dataset evaluations. It is clear that the FedDetect has a much better performance compared to the CL-Single and CL-Multi and achieves nearly upper bound performance compared to CL-Combined. In the evaluation on N-BaIoT dataset, the upper bound for the centralized training has accuracy of 94.89%, precision of 94.12%, and FPR of 5.26%. Meanwhile, the performance of the FedDetect has accuracy of 93.7%, precision of 88.2%, FPR of 11.9%. In the evaluation on LANDER dataset, the upper bound for the centralized training has accuracy of of 96.76%, precision of 98.69%, FPR of 1.25%. On the other hands, the FedDetect achieves accuracy of 95.27%, precision of 93.81%, FPR of 6.39%. We could see that accuracy under the FL setting is nearly the same as upper bound performance in centralized training for both evaluations. Evaluation using the personalized threshold. For FedDetect with personalized thresh- old, we evaluated its local performance on each edge device compared with the CL-Single baseline. 
As the results shown in Figure G.4c, in the evaluation on N-BaIoT dataset, except for device A, FedDetect performs better or nearly equal to the CL-Single. The numbers listed above the bar are the relative difference between CL and FL settings. For example, device D and I achieve 0.126 and 0.281 increase of accuracy from FL, respectively. As the results shown in Figure G.4d, in the evaluation on LANDER dataset, nearly all devices performs equally in both CL and FL settings. The number shows above the bar is the relative 210 (a) Accuracy Performance for N-BaIoT (b) Accuracy Performance for LANDER (c) Accuracy Performance for N-BaIoT (d) Accuracy Performance for LANDER Figure 12.3: Experiment Results for Accuracy Performance over 4 experiment settings: (a)-(b) subfigures are evaluation under global threshold; (c)-(d) subfigures are evaluation under personalized threshold. difference between the performance of FL and CL-Single. The detailed evaluation plots under personalized threshold are all shown in the Appendix. Understanding the result. In general speaking, the performance on LANDER dataset is better than the performance on N-BaIoT dataset. The major reason is that N-BaIoT dataset contains ten different types of attacks but the LANDER dataset only contains one type of attack, which makes N-BaIoT dataset more complicated to perform the anomaly detection. For the detection model, the more benign data the model can train on, the better performance it should have. Under the global evaluation, the baseline CL-Combined trains on all benign data. Consequently, it is the best performance among all models. The FedDetect algorithm has better performance than CL-Single and CL-Multi, because the FedDetect trains on more data among all devices collaboratively, thus it can capture more features. Within the same amount of training samples, FL with distributed training achieves nearly 211 the same performance on accuracy and precision compared to upper bound of centralized training. The FPR performance of FL is worse than CL, because within Federated Learning, the cloud server will not receive any local data from edge devices by the security setting. As a result, the cloud server could not directly learn the features of the data as in the centralized training, which makes FL be worse on feature capturing compared to CL. For the personalized threshold, FedDetect could achieve nearly same performance on the local model compared to the centralized training. The performance of accuracy and precision does not differ much between FL and CL. However, just as under the global evaluation, the FPR performance of FL is worse than CL. This can be explained that under CL setting, each model trains on its own data, while FL model is trained collaboratively on the data from all devices. During ML optimization, the direction of gradient descent is shifted after the aggregation by averaging, leading to sub-optimal minimum, which may not be suitable for the local evaluation target. The shifting of the gradient descent in FL also leads a slower convergence speed compared to CL, which could be seen from the training curve shown in the Appendix. 12.3.5 Analysis of System Efficiency For the second part of the experiment, we evaluated the system performance of FedIoT with globalized threshold on the Raspberry Pi within N-BaIoT dataset. Table 12.1: CPU/GPU Training v.s. 
12.3.5 Analysis of System Efficiency

In the second part of the experiments, we evaluated the system performance of FedIoT with the global threshold on the Raspberry Pi using the N-BaIoT dataset.

Table 12.1: CPU/GPU Training vs. IoT Edge Training

                Accuracy   Precision   FPR
Simulation      0.937      0.882       0.119
Raspberry Pi    0.931      0.887       0.125

We first verified that FedIoT on a real IoT device achieves the same results as CPU/GPU distributed training. From Table 12.1, the results on the Raspberry Pi are nearly the same as the results from the CPU/GPU simulation; the slight difference is due to different random initializations (i.e., different runs on different platforms).

Figure 12.4: Properties of the experiments on Raspberry Pi: (a) memory percentage; (b) running time per round.

We further tested the system efficiency on the Raspberry Pi platform. From Figure 12.4, the memory cost during training occupies only a small fraction of the Raspberry Pi's host memory (4 GB in total), and the training time per round is less than one minute. To understand the system cost more comprehensively, we analyzed the breakdown of the end-to-end training time at a bandwidth of 7.65 MB/s (a reasonable bandwidth for 4G/5G wireless communication); the results are shown in Table 12.2.

Table 12.2: Breakdown of the End-to-end Training Time

Type                        Value
end-to-end time             2547 seconds
uplink latency              0.167 seconds
communication time ratio    42.2%
computation time ratio      57.8%
bandwidth                   7.65 MB/s

Note: the communication time is measured as the interval between the moment the Raspberry Pi uploads its local model to the server and the moment it receives the global model from the server. The experiment is conducted over a WiFi connection.

Overall, the end-to-end training time (less than one hour) is acceptable for practical applications. We also find that communication accounts for almost half of the end-to-end training time, indicating that communication compression techniques [272, 438] are essential to improve the system performance in the IoT setting.
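To illustrate the kind of compression that could reduce this communication share, the sketch below shows Top-k sparsification of a model update, in the spirit of the Top-k gradient compression discussed below and in deep gradient compression [272]: only the k largest-magnitude entries of each update tensor are transmitted. This is a simplified illustration under our own assumptions (per-tensor selection, no error feedback or momentum correction), not part of the FedIoT implementation, and the function names are ours.

import torch

def topk_sparsify(update, k_ratio=0.01):
    # Keep only the k largest-magnitude entries of a model update (weight delta);
    # only (indices, values, shape) need to be transmitted.
    flat = update.flatten()
    k = max(1, int(k_ratio * flat.numel()))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices], update.shape

def topk_densify(indices, values, shape):
    # Receiver side: rebuild a dense tensor from the transmitted sparse update.
    flat = torch.zeros(shape).flatten()
    flat[indices] = values
    return flat.reshape(shape)

With k_ratio = 0.01, the uplink payload of each tensor shrinks to a few percent of its dense size (values plus indices), which would directly cut into the 42.2% communication share reported in Table 12.2; in practice, error feedback and momentum correction, as in [272], are usually added to preserve accuracy.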
12.4 Related Works

Our work is related to the application of federated learning to IoT cybersecurity. DÏoT [331] is the first system to employ a federated learning approach for anomaly-detection-based intrusion detection on IoT devices. IoTDefender [105] is a similar framework but obtains a personalized model by fine-tuning the global model trained with federated learning. [188] evaluates an FL-based anomaly detection framework on learning tasks such as aggressive driving detection and human activity recognition. [286] further proposed an attention-based CNN-LSTM model to detect anomalies in an FL manner and reduced the communication cost by using Top-k gradient compression. Recently, [370] evaluated the impact of malicious clients under the setting of FL-based anomaly detection. Compared to these existing works, our FedIoT platform is the first to analyze both algorithmic and system performance on a real IoT platform.

12.5 Conclusion

In this work, to further push forward the research in FL-based IoT cybersecurity, we build the FedIoT platform with a simple but effective design philosophy. We apply a Deep Autoencoder [386] as the anomaly detection model to evaluate FL algorithms and our FedIoT platform. Moreover, we propose FedDetect, a federated learning algorithmic framework that utilizes an adaptive optimizer and a cross-round learning rate scheduler for local training, rather than naive FedAvg [313]. FedDetect supports both a global threshold and personalized thresholds for different scenarios. FedIoT supports both IoT edge training and CPU/GPU-based distributed training through MQTT and MPI communication backends, respectively. We evaluate the FedIoT platform and the FedDetect algorithm under both the global and the personalized threshold, and analyze the system performance comprehensively. Our results demonstrate the efficacy of federated learning in detecting a wide range of attack types, and the system efficiency analysis shows that both the end-to-end training time and the memory cost are affordable and promising for resource-constrained IoT devices.

Bibliography

[1] Mehdi Salehi Heydar Abad et al. “Hierarchical federated learning across heterogeneous cellular networks”. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 8866–8870.
[2] Martín Abadi et al. “Tensorflow: A system for large-scale machine learning”. In: 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16). 2016, pp. 265–283.
[3] Jin-Hyun Ahn, Osvaldo Simeone, and Joonhyuk Kang. “Wireless federated distillation for distributed edge learning with heterogeneous data”. In: 2019 IEEE 30th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC). IEEE. 2019, pp. 1–6.
[4] P. Kairouz et al. “Advances and Open Problems in Federated Learning”. In: ArXiv (2019).
[5] Abdullatif Albaseer et al. “Exploiting Unlabeled Data in Smart Cities using Federated Learning”. In: arXiv preprint arXiv:2001.04030 (2020).
[6] Dan Alistarh et al. “QSGD: Communication-efficient SGD via gradient quantization and encoding”. In: Advances in Neural Information Processing Systems. 2017, pp. 1709–1720.
[7] Mohammad Mohammadi Amiri et al. “Federated Learning With Quantized Global Model Updates”. In: arXiv preprint arXiv:2006.10672 (2020).
[8] Muhammad Ammad-Ud-Din et al. “Federated Collaborative Filtering for Privacy-Preserving Personalized Recommendation System”. In: arXiv preprint arXiv:1901.09888 (2019).
[9] Chrysovalantis Anastasiou et al. “Admsv2: A modern architecture for transportation data management and analysis”. In: Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Advances on Resilient and Intelligent Cities. 2019, pp. 25–28.
[10] Rohan Anil et al. “Large scale distributed neural network training through online distillation”. In: arXiv preprint arXiv:1804.03235 (2018).
[11] Manoj Ghuhan Arivazhagan et al. “Federated learning with personalization layers”. In: arXiv preprint arXiv:1912.00818 (2019).
[12] Muhammad Asad, Ahmed Moustafa, and Takayuki Ito. “FedOpt: Towards communication efficiency and privacy preservation in federated learning”. In: Applied Sciences 10.8 (2020), p. 2864.
[13] Hédy Attouch et al. “Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality”. In: Mathematics of Operations Research 35.2 (2010), pp. 438–457.
[14] Sean Augenstein et al. “Generative models for effective ml on private, decentralized datasets”. In: arXiv preprint arXiv:1911.06679 (2019).
[15] The TensorFlow Federated Authors. TensorFlow Federated Stack Overflow dataset. 2019. url: https://www.tensorflow.org/federated/api_docs/python/tff/simulation/datasets/stackoverflow/load_data.
[16] Jimmy Ba and Rich Caruana. “Do deep nets really need to be deep?” In: Advances in neural information processing systems. 2014, pp. 2654–2662.
[17] Eugene Bagdasaryan et al. “How to backdoor federated learning”.
In: International Conference on Artificial Intelligence and Statistics . 2020, pp. 2938–2948. [18] Inci M. Baytas et al. “Asynchronous multi-task learning”. In: Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 2016, pp. 11–20. [19] James Henry Bell et al. “Secure single-server aggregation with (poly) logarithmic overhead”. In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. 2020, pp. 1253–1269. [20] Guy W Bemis and Mark A Murcko. “The properties of known drugs. 1. Molecular frameworks”. In: Journal of medicinal chemistry 39.15 (1996), pp. 2887–2893. [21] Jeremy Bernstein et al. “signSGD: Compressed optimisation for non-convex problems”. In: arXiv preprint arXiv:1802.04434 (2018). [22] David Berthelot et al. “MixMatch: A Holistic Approach to Semi-Supervised Learning”. In: Neural Information Processing Systems. Dec. 8, 2019, pp. 5049–5059. [23] David Berthelot et al. “ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring”. In: International Conference on Learning Representations. Apr. 30, 2020. 217 [24] E. Bertino and N. Islam. “Botnets and Internet of Things Security”. In: Computer 50.02 (Feb. 2017), pp. 76–79. issn: 1558-0814. doi: 10.1109/MC.2017.62. [25] Dimitri P Bertsekas and John N Tsitsiklis. Parallel and distributed computation: numerical methods. Vol. 23. Prentice hall Englewood Cliffs, NJ, 1989. [26] Daniel J Beutel et al. “Flower: A Friendly Federated Learning Research Framework”. In: arXiv preprint arXiv:2007.14390 (2020). [27] Arjun Nitin Bhagoji et al. “Analyzing federated learning through an adversarial lens”. In: International Conference on Machine Learning. 2019, pp. 634–643. [28] Tian Bian et al. “Rumor detection on social media with bi-directional graph convolutional networks”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 01. 2020, pp. 549–556. [29] Ilai Bistritz, Ariana Mann, and Nicholas Bambos. “Distributed Distillation for On-Device Learning”. In: Advances in Neural Information Processing Systems 33. 2020. [30] Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. “Machine learning with adversaries: Byzantine tolerant gradient descent”. In: Advances in Neural Information Processing Systems. 2017, pp. 119–129. [31] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. “Yolov4: Optimal speed and accuracy of object detection”. In: arXiv preprint arXiv:2004.10934 (2020). [32] Aleksandar Bojchevski and Stephan Günnemann. “Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking”. In: International Conference on Learning Representations. 2018. url: https://openreview.net/forum?id=r1ZdKJ-0W. [33] Jérôme Bolte, Shoham Sabach, and Marc Teboulle. “Proximal alternating linearized minimization for nonconvex and nonsmooth problems”. In: Mathematical Programming 146.1-2 (2014), pp. 459–494. [34] K Bonawitz, H Eichner, W Grieskamp, et al. TensorFlow Federated: Machine Learning on Decentralized Data. 2020. [35] Keith Bonawitz et al. “Federated learning with autotuned communication-efficient secure aggregation”. In: 2019 53rd Asilomar Conference on Signals, Systems, and Computers. IEEE. 2019, pp. 1222–1226. [36] Keith Bonawitz et al. “Practical secure aggregation for federated learning on user-held data”. In: arXiv preprint arXiv:1611.04482 (2016). 218 [37] Keith Bonawitz et al. “Practical secure aggregation for privacy-preserving machine learning”. 
In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 2017, pp. 1175–1191. [38] Keith Bonawitz et al. “Towards Federated Learning at Scale: System Design”. In: Proceedings of Machine Learning and Systems. Vol. 1. 2019, pp. 374–388. url: https://proceedings.mlsys.org/paper/2019/file/ bd686fd640be98efaae0091fa301e613-Paper.pdf. [39] Keith Bonawitz et al. “Towards federated learning at scale: System design”. In: arXiv preprint arXiv:1902.01046 (2019). [40] Karsten M Borgwardt et al. “Protein function prediction via graph kernels”. In: Bioinformatics 21.suppl_1 (2005), pp. i47–i56. [41] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. “Optimization methods for large-scale machine learning”. In: SIAM Review 60.2 (2018), pp. 223–311. [42] H. Brendan McMahan et al. “Communication-Efficient Learning of Deep Networks from Decentralized Data”. In: arXiv e-prints, arXiv:1602.05629 (Feb. 2016), arXiv:1602.05629. arXiv: 1602.05629 [cs.LG]. [43] Christopher Briggs, Zhong Fan, and Peter Andras. “Federated learning with hierarchical clustering of local updates to improve training on non-IID data”. In: arXiv preprint arXiv:2004.11791 (2020). [44] Jane Bromley et al. “Signature verification using a “siamese” time delay neural network”. In: International Journal of Pattern Recognition and Artificial Intelligence 7.04 (1993), pp. 669–688. [45] Tom Brown et al. “Language models are few-shot learners”. In: Advances in neural information processing systems 33 (2020), pp. 1877–1901. [46] Tom B Brown et al. “Language models are few-shot learners”. In: arXiv preprint arXiv:2005.14165 (2020). [47] Cristian Buciluˇ a, Rich Caruana, and Alexandru Niculescu-Mizil. “Model compression”. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 2006, pp. 535–541. [48] Han Cai, Ligeng Zhu, and Song Han. “Proxylessnas: Direct neural architecture search on target task and hardware”. In: arXiv preprint arXiv:1812.00332 (2018). [49] Han Cai et al. “Once-for-all: Train one network and specialize it for efficient deployment”. In: arXiv preprint arXiv:1908.09791 (2019). 219 [50] Han Cai et al. “TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning”. In: Advances in Neural Information Processing Systems 33 (2020). [51] Sebastian Caldas et al. LEAF: A Benchmark for Federated Settings. 2019. arXiv: 1812.01097 [cs.LG]. [52] Sebastian Caldas et al. “Leaf: A benchmark for federated settings”. In: arXiv preprint arXiv:1812.01097 (2018). [53] Mathilde Caron et al. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. 2021. arXiv: 2006.09882 [cs.CV]. [54] Mathilde Caron et al. “Unsupervised learning of visual features by contrasting cluster assignments”. In: arXiv preprint arXiv:2006.09882 (2020). [55] Zheng Chai et al. “FedAt: A communication-efficient federated learning method with asynchronous tiers under non-iid data”. In: arXiv preprint arXiv:2010.05958 (2020). [56] Qi Chang et al. “Synthetic Learning: Learn From Distributed Asynchronized Discriminator GAN Without Sharing Medical Image Data”. In: arXiv e-prints, arXiv:2006.00080 (May 2020), arXiv:2006.00080. arXiv: 2006.00080 [eess.IV]. [57] Chaochao Chen et al. “Practical privacy preserving poi recommendation”. In: arXiv preprint arXiv:2003.02834 (2020). [58] Chaochao Chen et al. “Survey and Open Problems in Privacy Preserving Knowledge Graph: Merging, Query, Representation, Completion and Applications”. In: arXiv preprint arXiv:2011.10180 (2020). 
[59] Chen Chen et al. “Robust Federated Recommendation System”. In: arXiv preprint arXiv:2006.08259 (2020). [60] Chien-Lun Chen, Leana Golubchik, and Marco Paolieri. “Backdoor Attacks on Federated Meta-Learning”. In: arXiv preprint arXiv:2006.07026 (2020). [61] Daoyuan Chen et al. “Adabert: Task-adaptive bert compression with differentiable neural architecture search”. In: arXiv preprint arXiv:2001.04246 (2020). [62] Defang Chen et al. “Online Knowledge Distillation with Diverse Peers”. In: arXiv preprint arXiv:1912.00350 (2019). [63] Liang Chen et al. “Understanding Structural Vulnerability in Graph Convolutional Networks”. In: IJCAI. 2021. [64] Liang-Chieh Chen et al. “Rethinking Atrous Convolution for Semantic Image Segmentation”. In: ArXiv abs/1706.05587 (2017). 220 [65] Mingqing Chen et al. “Federated Learning of N-Gram Language Models”. In: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). 2019. [66] Mingqing Chen et al. “Federated learning of N-gram language models”. In: arXiv preprint arXiv:1910.03432 (2019). [67] Tianqi Chen et al. “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems”. In: arXiv preprint arXiv:1512.01274 (2015). [68] Ting Chen et al. “A simple framework for contrastive learning of visual representations”. In: International conference on machine learning. PMLR. 2020, pp. 1597–1607. [69] Xinlei Chen and Kaiming He. “Exploring Simple Siamese Representation Learning”. In: arXiv: Computer Vision and Pattern Recognition (Nov. 20, 2020). [70] Yiqiang Chen et al. “Fedhealth: A federated transfer learning framework for wearable healthcare”. In: IEEE Intelligent Systems (2020). [71] Yujing Chen et al. “Asynchronous online federated learning for edge devices with non-iid data”. In: 2020 IEEE International Conference on Big Data (Big Data). IEEE. 2020, pp. 15–24. [72] Kewei Cheng et al. “Secureboost: A lossless federated learning framework”. In: arXiv preprint arXiv:1901.08755 (2019). [73] Young Sung Cho et al. “Clustering method using item preference based on rfm for recommendation system in u-commerce”. In: Ubiquitous information technologies and applications. Springer, 2013, pp. 353–362. [74] Tat-Seng Chua et al. “NUS-WIDE: A real-world web image database from National University of Singapore”. In: CIVR (2009). [75] G Cohen et al. “EMNIST: an extension of MNIST to handwritten letters. arXiv e-prints”. In: arXiv preprint arXiv:1702.05373 (2017). [76] Connor W Coley et al. “Convolutional embedding of attributed molecular graphs for physical property prediction”. In: Journal of chemical information and modeling 57.8 (2017), pp. 1757–1772. [77] Ekin Dogus Cubuk et al. “AutoAugment: Learning Augmentation Policies from Data”. In: 2019. 221 [78] Zhiyong Cui et al. Traffic Graph Convolutional Recurrent Neural Network: A Deep Learning Framework for Network-Scale Traffic Learning and Forecasting . 2019. arXiv: 1802.07007 [cs.LG]. [79] Marco Cuturi. Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances. 2013. arXiv: 1306.0895 [stat.ML]. [80] Wei Dai et al. “Toward Understanding the Impact of Staleness in Distributed Machine Learning”. en. In: arXiv:1810.03264 [cs, stat] (Oct. 2018). arXiv: 1810.03264. url: http://arxiv.org/abs/1810.03264 (visited on 12/22/2018). [81] Luke N Darlow et al. “CINIC-10 is not ImageNet or CIFAR-10”. In: arXiv preprint arXiv:1810.03505 (2018). [82] Asim Kumar Debnath et al. 
“Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity”. In: Journal of medicinal chemistry 34.2 (1991), pp. 786–797. [83] John S Delaney. “ESOL: estimating aqueous solubility directly from molecular structure”. In: Journal of chemical information and computer sciences 44.3 (2004), pp. 1000–1005. [84] Tim Dettmers et al. “Convolutional 2D Knowledge Graph Embeddings”. In: AAAI. 2018, pp. 1811–1818. url: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17366. [85] J. Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: NAACL-HLT. 2019. [86] Jacob Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proc. of NAACL-HLT. 2019. [87] Whitfield Diffie and Martin Hellman. “New directions in cryptography”. In: IEEE transactions on Information Theory 22.6 (1976), pp. 644–654. [88] Marten van Dijk et al. “Asynchronous Federated Learning with Reduced Number of Rounds and with Differential Privacy from Less Aggregated Gaussian Noise”. In: arXiv preprint arXiv:2007.09208 (2020). [89] Canh T Dinh, Nguyen H Tran, and Tuan Dung Nguyen. “Personalized federated learning with Moreau envelopes”. In: arXiv preprint arXiv:2006.08848 (2020). [90] Canh T. Dinh, Nguyen H. Tran, and Tuan Dung Nguyen. “Personalized Federated Learning with Moreau Envelopes”. In: arXiv: Learning (June 16, 2020). 222 [91] Alexey Dosovitskiy et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. In: International Conference on Learning Representations. 2021. url: https://openreview.net/forum?id=YicbFdNTTy. [92] Alexey Dosovitskiy et al. “An image is worth 16x16 words: Transformers for image recognition at scale”. In: arXiv preprint arXiv:2010.11929 (2020). [93] Matthew Dunn et al. “SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine”. In: ArXiv (2017). [94] David Duvenaud et al. “Convolutional networks on graphs for learning molecular fingerprints”. In: arXiv preprint arXiv:1509.09292 (2015). [95] Ahmet M Elbir and Sinem Coleri. “Federated Learning for Vehicular Networks”. In: arXiv preprint arXiv:2006.01412 (2020). [96] A. Elkordy and A. Avestimehr. “Secure Aggregation with Heterogeneous Quantization in Federated Learning”. In: ArXiv (2020). [97] Ahmed ElKordy and A. Salman Avestimehr. “Secure aggregation with heterogeneous quantization in federated learning”. In: arXiv preprint arxiv:2009.14388 (2020). [98] Ahmed Roushdy Elkordy and A. Salman Avestimehr. “Secure Aggregation with Heterogeneous Quantization in Federated Learning”. In: arXiv preprint arXiv:2009.14388 (2020). [99] Ahmed Roushdy Elkordy, Saurav Prakash, and A Salman Avestimehr. “Basil: A Fast and Byzantine-Resilient Approach for Decentralized Training”. In: arXiv preprint arXiv:2109.07706 (2021). [100] David Enthoven and Zaid Al-Ars. “An Overview of Federated Deep Learning Privacy Attacks and Defensive Strategies”. In: arXiv preprint arXiv:2004.04676 (2020). [101] Theodoros Evgeniou and Massimiliano Pontil. “Regularized multi–task learning”. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. 2004, pp. 109–117. [102] Yahya H Ezzeldin et al. “FairFed: Enabling Group Fairness in Federated Learning”. In: ICML 2021 - International Workshop on Federated Learning for User Privacy and Data Confidentiality (2021). [103] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. 
Personalized Federated Learning: A Meta-Learning Approach. Feb. 18, 2020. [104] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. “Personalized federated learning: A meta-learning approach”. In: arXiv preprint arXiv:2002.07948 (2020). 223 [105] Yulin Fan et al. “IoTDefender: A Federated Transfer Learning Intrusion Detection Framework for 5G IoT”. In: 2020 IEEE 14th International Conference on Big Data Science and Engineering (BigDataSE). 2020, pp. 88–95. doi: 10.1109/BigDataSE50710.2020.00020. [106] Han Feng Siwei Yu. “Multi-Participant Multi-Class Vertical Federated Learning”. In: arXiv preprint arXiv:2001.11154 (2020). [107] Matthias Fey and Jan Eric Lenssen. Fast Graph Representation Learning with PyTorch Geometric. 2019. arXiv: 1903.02428 [cs.LG]. [108] Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks”. In: arXiv:1703.03400 [cs] (July 18, 2017). arXiv: 1703.03400. url: http://arxiv.org/abs/1703.03400 (visited on 05/27/2021). [109] Adam Fisch et al. “MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension”. In: Proceedings of the 2nd Workshop on Machine Reading for Question Answering. 2019. [110] Adrian Flanagan et al. “Federated Multi-view Matrix Factorization for Personalized Recommendations”. In: arXiv preprint arXiv:2004.04256 (2020). [111] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. “Model inversion attacks that exploit confidence information and basic countermeasures”. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. 2015, pp. 1322–1333. [112] Clement Fung, Chris JM Yoon, and Ivan Beschastnikh. “Mitigating sybils in federated learning poisoning”. In: arXiv preprint arXiv:1808.04866 (2018). [113] Anubhav Garg, Amit Kumar Saha, and Debo Dutta. “Direct federated neural architecture search”. In: arXiv preprint arXiv:2010.06223 (2020). [114] Anna Gaulton et al. “ChEMBL: a large-scale bioactivity database for drug discovery”. In: Nucleic acids research 40.D1 (2012), pp. D1100–D1107. [115] Anna Gaulton et al. “The ChEMBL database in 2017”. In: Nucleic Acids Research 45.D1 (Nov. 2016), pp. D945–D954. issn: 0305-1048. doi: 10.1093/nar/gkw1074. eprint: https://academic.oup.com/nar/article- pdf/45/D1/D945/8846762/gkw1074.pdf. url: https://doi.org/10.1093/nar/gkw1074. [116] Anna Gaulton et al. “The ChEMBL database in 2017”. In: Nucleic acids research 45.D1 (2017), pp. D945–D954. 224 [117] Kaitlyn M Gayvert, Neel S Madhukar, and Olivier Elemento. “A data-driven approach to predicting successes and failures of clinical trials”. In: Cell chemical biology 23.10 (2016), pp. 1294–1301. [118] Suyu Ge et al. “FedNER: Medical Named Entity Recognition with Federated Learning”. In: arXiv preprint arXiv:2003.09288 (2020). [119] Suyu Ge et al. “FedNER: Privacy-preserving Medical Named Entity Recognition with Federated Learning”. In: ArXiv (2020). [120] Jonas Geiping et al. “Inverting Gradients - How easy is it to break privacy in federated learning?” In: Advances in Neural Information Processing Systems. Ed. by H. Larochelle et al. Vol. 33. Curran Associates, Inc., 2020, pp. 16937–16947. url: https://proceedings.neurips.cc/paper/2020/file/ c4ede56bbd98819ae6112b20ac6bf145-Paper.pdf. [121] Robin C Geyer, Tassilo Klein, and Moin Nabi. “Differentially private federated learning: A client level perspective”. In: arXiv preprint arXiv:1712.07557 (2017). [122] C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. “CiteSeer: An Automatic Citation Indexing System”. 
In: Proceedings of the Third ACM Conference on Digital Libraries. DL ’98. Pittsburgh, Pennsylvania, USA: Association for Computing Machinery, 1998, pp. 89–98. isbn: 0897919653. doi: 10.1145/276675.276685. url: https://doi.org/10.1145/276675.276685. [123] Justin Gilmer et al. “Neural message passing for quantum chemistry”. In: International Conference on Machine Learning. PMLR. 2017, pp. 1263–1272. [124] Linyuan Gong et al. “Efficient training of bert by progressively stacking”. In: International Conference on Machine Learning. PMLR. 2019, pp. 2337–2346. [125] Ian J. Goodfellow et al. “Generative Adversarial Nets”. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. NIPS’14. Montreal, Canada: MIT Press, 2014, pp. 2672–2680. [126] Google. TensorFlow Federated Datasets. https://www.tensorflow.org/federated/ api_docs/python/tff/simulation/datasets. [127] Jean-Bastien Grill et al. “Bootstrap your own latent: A new approach to self-supervised learning”. In: arXiv preprint arXiv:2006.07733 (2020). [128] gRPC: A high performance, open source universal RPC framework. https://grpc.io/. 2021. 225 [129] Otkrist Gupta and Ramesh Raskar. “Distributed learning of deep neural network over multiple agents”. In: Journal of Network and Computer Applications 116 (2018), pp. 1–8. [130] Farzin Haddadpour et al. “Federated Learning with Compression: Unified Analysis and Sharp Guarantees”. In: arXiv preprint arXiv:2007.01154 (2020). [131] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. “Exploring Network Structure, Dynamics, and Function using NetworkX”. In: Proceedings of the 7th Python in Science Conference. Ed. by Gaël Varoquaux, Travis Vaught, and Jarrod Millman. Pasadena, CA USA, 2008, pp. 11–15. [132] William L. Hamilton, Rex Ying, and Jure Leskovec. “Inductive Representation Learning on Large Graphs”. In: CoRR abs/1706.02216 (2017). arXiv: 1706.02216. url: http://arxiv.org/abs/1706.02216. [133] Song Han, Huizi Mao, and William J Dally. “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding”. In: arXiv preprint arXiv:1510.00149 (2015). [134] Andrew Hard et al. “Federated Learning for Mobile Keyboard Prediction”. In: ArXiv (2018). [135] Andrew Hard et al. “Federated learning for mobile keyboard prediction”. In: arXiv preprint arXiv:1811.03604 (2018). [136] Corentin Hardy, Erwan Le Merrer, and Bruno Sericola. “Md-gan: Multi-discriminator generative adversarial networks for distributed datasets”. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE. 2019, pp. 866–877. [137] Stephen Hardy et al. “Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption”. In: arXiv preprint arXiv:1711.10677 (2017). [138] Stephen Hardy et al. “Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption”. In: CoRR abs/1711.10677 (2017). arXiv: 1711.10677. url: http://arxiv.org/abs/1711.10677. [139] Bharath Hariharan et al. “Semantic Contours from Inverse Detectors”. In: International Conference on Computer Vision (ICCV). 2011. [140] Hanieh Hashemi, Yongqin Wang, and Murali Annavaram. “DarKnight: A Data Privacy Scheme for Training and Inference of Deep Neural Networks”. In: arXiv preprint arXiv:2006.01300 (2020). 226 [141] Chaoyang He. Chaoyang He’s Publication during the PhD study. https://chaoyanghe.com/publications/. 2022. 
[142] Chaoyang He, Murali Annavaram, and Salman Avestimehr. “FedNAS: Federated Deep Learning via Neural Architecture Search”. In: arXiv preprint arXiv:2004.08546 (2020). [143] Chaoyang He, Murali Annavaram, and Salman Avestimehr. “Group Knowledge Transfer: Federated Learning of Large CNNs at the Edge”. In: 2020. [144] Chaoyang He, Murali Annavaram, and Salman Avestimehr. “Group Knowledge Transfer: Federated Learning of Large CNNs at the Edge”. In: NeurIPS 2020 (Advances in Neural InformationProcessing Systems 2020) (2020). [145] Chaoyang He, Murali Annavaram, and Salman Avestimehr. “Group Knowledge Transfer: Federated Learning of Large CNNs at the Edge”. In: Advances in Neural Information Processing Systems 33 (2020). [146] Chaoyang He et al. “Cascade-BGNN: Toward Efficient Self-supervised Representation Learning on Large-scale Bipartite Graphs”. In: arXiv preprint arXiv:1906.11994 (2019). [147] Chaoyang He et al. “Cascade-BGNN: Toward Efficient Self-supervised Representation Learning on Large-scale Bipartite Graphs”. In: arXiv preprint arXiv:1906.11994 (2019). [148] Chaoyang He et al. “Central server free federated learning over single-sided trust social networks”. In: arXiv preprint arXiv:1910.04956 (2019). [149] Chaoyang He et al. “FedCV: A Federated Learning Framework for Diverse Computer Vision Tasks”. In: arXiv preprint arXiv:2111.11066 (2021). [150] Chaoyang He et al. “FedCV: A Federated Learning Framework for Diverse Computer Vision Tasks”. In: (2021). [151] Chaoyang He et al. FedGraphNN: A Federated Learning System and Benchmark for Graph Neural Networks. 2021. arXiv: 2104.07145 [cs.LG]. [152] Chaoyang He et al. “FedGraphNN: A Federated Learning System and Benchmark for Graph Neural Networks”. In: 2021. [153] Chaoyang He et al. “FedML: A Research Library and Benchmark for Federated Machine Learning”. In: arXiv preprint arXiv:2007.13518 (2020). 227 [154] Chaoyang He et al. “FedML: A Research Library and Benchmark for Federated Machine Learning”. In: NeurIPS 2020 (Advances in Neural InformationProcessing Systems 2020) Federated Learning Workshop Best Paper Award abs/2007.13518 (2020). [155] Chaoyang He et al. “Fedml: A research library and benchmark for federated machine learning”. In: arXiv preprint arXiv:2007.13518 (2020). [156] Chaoyang He et al. “MiLeNAS: Efficient Neural Architecture Search via Mixed-Level Reformulation”. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [157] Chaoyang He et al. “MiLeNAS: Efficient Neural Architecture Search via Mixed-Level Reformulation”. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020), pp. 11990–11999. [158] Chaoyang He et al. “MiLeNAS: Efficient Neural Architecture Search via Mixed-Level Reformulation”. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020. [159] Chaoyang He et al. “Milenas: Efficient neural architecture search via mixed-level reformulation”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 11993–12002. [160] Chaoyang He et al. “PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models”. In: International Conference on Machine Learning. PMLR. 2021, pp. 4150–4159. [161] Chaoyang He et al. “SpreadGNN: Serverless Multi-task Federated Learning for Graph Neural Networks”. 
In: International Workshop on Federated Learning for User Privacy and Data Confidentiality in Conjunction with ICML 2021 (FL-ICML’21) and Deep Learning on Graphs: Method and Applications with KDD 2021 (DLG-KDD’21) (2021). [162] Chaoyang He et al. “SSFL: Tackling Label Deficiency in Federated Learning via Personalized Self-Supervision”. In: arXiv preprint arXiv:2110.02470 (2021). [163] Kaiming He et al. “Deep residual learning for image recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770–778. [164] Lie He, Sai Praneeth Karimireddy, and Martin Jaggi. “Secure byzantine-robust machine learning”. In: arXiv preprint arXiv:2006.04747 (2020). [165] Yihui He et al. “Amc: Automl for model compression and acceleration on mobile devices”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 784–800. 228 [166] Danny Hernandez and Tom B Brown. “Measuring the Algorithmic Efficiency of Neural Networks”. In: arXiv preprint arXiv:2005.04305 (2020). [167] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network”. In: arXiv preprint arXiv:1503.02531 (2015). [168] Briland Hitaj, Giuseppe Ateniese, and Fernando Perez-Cruz. “Deep models under the GAN: information leakage from collaborative deep learning”. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 2017, pp. 603–618. [169] A. Howard et al. “Searching for MobileNetV3”. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019, pp. 1314–1324. doi: 10.1109/ICCV.2019.00140. [170] Andrew Howard et al. “Searching for mobilenetv3”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, pp. 1314–1324. [171] Andrew G Howard et al. “Mobilenets: Efficient convolutional neural networks for mobile vision applications”. In: arXiv preprint arXiv:1704.04861 (2017). [172] Kevin Hsieh et al. “The non-IID data quagmire of decentralized machine learning”. In: arXiv preprint arXiv:1910.00189 (2019). [173] Chi-Hung Hsu et al. “Monas: Multi-objective neural architecture search using reinforcement learning”. In: arXiv preprint arXiv:1806.10332 (2018). [174] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. “Federated Visual Classification with Real-World Data Distribution”. In: arXiv preprint arXiv:2003.08082 (2020). [175] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. “Federated Visual Classification with Real-World Data Distribution”. In: arXiv e-prints, arXiv:2003.08082 (Mar. 2020), arXiv:2003.08082. arXiv: 2003.08082 [cs.LG]. [176] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. “Measuring the effects of non-identical data distribution for federated visual classification”. In: arXiv preprint arXiv:1909.06335 (2019). [177] Zijian Hu et al. “SimPLE: Similar Pseudo Label Exploitation for Semi-Supervised Classification”. In: arXiv:2103.16725 [cs] (Mar. 30, 2021). arXiv: 2103.16725. url: http://arxiv.org/abs/2103.16725 (visited on 05/23/2021). [178] Gao Huang et al. “Densely connected convolutional networks”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 4700–4708. 229 [179] Han Huang et al. “Lightweight Image Super-Resolution with Hierarchical and Differentiable Neural Architecture Search”. In: arXiv preprint arXiv:2105.03939 (2021). [180] Yanping Huang et al. “GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism”. In: Advances in Neural Information Processing Systems. Ed. by H. Wallach et al. Vol. 32. 
Curran Associates, Inc., 2019, pp. 103–112. [181] Yanping Huang et al. “Gpipe: Efficient training of giant neural networks using pipeline parallelism”. In: arXiv preprint arXiv:1811.06965 (2018). [182] Urs Hunkeler, Hong Linh Truong, and Andy Stanford-Clark. “MQTT-S—A publish/subscribe protocol for Wireless Sensor Networks”. In: 2008 3rd International Conference on Communication Systems Software and Middleware and Workshops (COMSWARE’08). IEEE. 2008, pp. 791–798. [183] Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren, eds. Automatic Machine Learning: Methods, Systems, Challenges. Springer, 2019. [184] Forrest N Iandola et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and< 0.5 MB model size”. In: arXiv preprint arXiv:1602.07360 (2016). [185] Alex Ingerman and Krzys Ostrowski. TensorFlow Federated. 2019. url: https: //medium.com/tensorflow/introducing-tensorflow-federated-a4147aa20041. [186] Intel®. Intel® Open Federated Learning. 2021. [187] Sohei Itahara et al. “Distillation-Based Semi-Supervised Federated Learning for Communication-Efficient Collaborative Training with Non-IID Private Data”. In: arXiv preprint arXiv:2008.06180 (2020). [188] Rei Ito, Mineto Tsukada, and Hiroki Matsutani. An On-Device Federated Learning Approach for Cooperative Anomaly Detection. Feb. 2020. [189] Yesmina Jaafra et al. “Reinforcement learning for neural architecture search: A review”. In: Image and Vision Computing 89 (2019), pp. 57–66. [190] Martin Jaggi et al. “Communication-efficient distributed dual coordinate ascent”. In: Advances in neural information processing systems. 2014, pp. 3068–3076. [191] Eunjeong Jeong et al. “Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data”. In: arXiv preprint arXiv:1811.11479 (2018). [192] Wonyong Jeong et al. “Federated Semi-Supervised Learning with Inter-Client Consistency”. In: arXiv preprint arXiv:2006.12097 (2020). 230 [193] Shaoxiong Ji et al. “Learning Private Neural Language Modeling with Attentive Aggregation”. In: 2019 International Joint Conference on Neural Networks (IJCNN) (2019). [194] Meng Jiang et al. “Federated Dynamic GNN with Secure Aggregation”. In: arXiv preprint arXiv:2009.07351 (2020). [195] Yihan Jiang et al. “Improving federated learning personalization via model agnostic meta learning”. In: arXiv preprint arXiv:1909.12488 (2019). [196] Yimin Jiang et al. “A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters”. In: 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, Nov. 2020, pp. 463–479. isbn: 978-1-939133-19-9. url: https://www.usenix.org/conference/osdi20/presentation/jiang. [197] Zhanhong Jiang et al. “Collaborative deep learning in fixed topology networks”. In: Advances in Neural Information Processing Systems. 2017, pp. 5904–5914. [198] Xin Jin et al. “Collaborating between local and global learning for distributed online multiple tasks”. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 2015, pp. 113–122. [199] Glenn Jocher. YOLOv5. 2020. url: https://github.com/ultralytics/yolov5. [200] Mandar Joshi et al. “TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension”. In: Proc. of ACL. 2017. [201] Ce Ju et al. “Federated Transfer Learning for EEG Signal Classification”. In: arXiv preprint arXiv:2004.12321 (2020). [202] Ce Ju et al. 
“Privacy-Preserving Technology to Help Millions of People: Federated Prediction Model for Stroke Prevention”. In: arXiv preprint arXiv:2006.10517 (2020). [203] John Jumper et al. “Highly accurate protein structure prediction with AlphaFold”. In: Nature 596.7873 (2021), pp. 583–589. [204] Swanand Kadhe et al. “FastSecAgg: Scalable Secure Aggregation for Privacy-Preserving Federated Learning”. In: arXiv preprint arXiv:2009.11248 (2020). [205] Kaggle. Lending Club Loan Data. https://www.kaggle.com/wendykan/lending-club-loan-data. [206] P. Kairouz et al. “Advances and Open Problems in Federated Learning”. In: Found. Trends Mach. Learn. 14 (2021), pp. 1–210. 231 [207] Peter Kairouz et al. “Advances and open problems in federated learning”. In: arXiv preprint arXiv:1912.04977 (2019). [208] Peter Kairouz et al. “Advances and open problems in federated learning”. In: Foundations and Trends® in Machine Learning 14.1–2 (2021), pp. 1–210. [209] Sai Praneeth Karimireddy, Lie He, and Martin Jaggi. “Learning from history for byzantine robust optimization”. In: International Conference on Machine Learning. PMLR. 2021, pp. 5311–5319. [210] Sai Praneeth Karimireddy et al. “SCAFFOLD: Stochastic Controlled Averaging for Federated Learning”. In: arXiv preprint arXiv:1910.06378 (2019). [211] Steven Kearnes et al. “Molecular graph convolutions: moving beyond fingerprints”. In: Journal of computer-aided molecular design 30.8 (2016), pp. 595–608. [212] Mikhail Khodak, Maria-Florina F Balcan, and Ameet S Talwalkar. “Adaptive gradient-based meta-learning methods”. In: Advances in Neural Information Processing Systems. 2019, pp. 5917–5928. [213] Chiheon Kim et al. “torchgpipe: On-the-fly pipeline parallelism for training giant models”. In: arXiv preprint arXiv:2004.09910 (2020). [214] Soojeong Kim et al. “Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks”. In: Proceedings of the Fourteenth EuroSys Conference 2019. 2019, pp. 1–15. [215] Sunghwan Kim et al. “PubChem in 2021: new data content and improved web interfaces”. In: Nucleic Acids Research 49.D1 (2021), pp. D1388–D1395. [216] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014). [217] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: CoRR abs/1412.6980 (2015). [218] Thomas N. Kipf and Max Welling. “Semi-Supervised Classification with Graph Convolutional Networks”. In: CoRR abs/1609.02907 (2016). arXiv: 1609.02907. url: http://arxiv.org/abs/1609.02907. [219] Yusuke Koda et al. “Communication-Efficient Multimodal Split Learning for mmWave Received Power Prediction”. In: IEEE Communications Letters 24.6 (2020), pp. 1284–1288. 232 [220] Jakub Konečný, Brendan McMahan, and Daniel Ramage. “Federated Optimization:Distributed Optimization Beyond the Datacenter”. In: arXiv:1511.03575 [cs, math] (Nov. 2015). arXiv: 1511.03575. url: http://arxiv.org/abs/1511.03575 (visited on 01/19/2019). [221] Jakub Konečný et al. “Federated Learning: Strategies for Improving Communication Efficiency”. en. In: arXiv:1610.05492 [cs] (Oct. 2016). arXiv: 1610.05492. url: http://arxiv.org/abs/1610.05492 (visited on 12/22/2018). [222] A. Krizhevsky and G. Hinton. “Learning multiple layers of features from tiny images”. In: Master’s thesis, Department of Computer Science, University of Toronto (2009). [223] Alex Krizhevsky, Geoffrey Hinton, et al. “Learning multiple layers of features from tiny images”. In: Technical Report (2009). 
[224] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep convolutional neural networks”. In: Advances in neural information processing systems 25 (2012). [225] Michael Kuhn et al. “The SIDER database of drugs and side effects”. In: Nucleic acids research 44.D1 (2016), pp. D1075–D1079. [226] Tom Kwiatkowski et al. “Natural Questions: A Benchmark for Question Answering Research”. In: Transactions of the Association for Computational Linguistics (2019). [227] Samuli Laine and Timo Aila. “Temporal Ensembling for Semi-Supervised Learning”. In: arXiv: Neural and Evolutionary Computing (Oct. 7, 2016). [228] Anusha Lalitha et al. “Decentralized bayesian learning over graphs”. In: arXiv preprint arXiv:1905.10466 (2019). [229] Greg Landrum. RDKit: Open-source cheminformatics. 2006. url: http://www.rdkit.org. [230] Ken Lang. “Newsweeder: Learning to filter netnews”. In: Proc. of ICML. 1995. [231] Yann LeCun et al. “Gradient-based learning applied to document recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324. [232] Dong-Hyun Lee. “Pseudo-Label : The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks”. In: 2013. [233] S. Lee et al. “Communication-Efficient Local Stochastic Gradient Descent for Scalable Deep Learning”. In: 2020 IEEE International Conference on Big Data (Big Data). 2020, pp. 718–727. doi: 10.1109/BigData50022.2020.9378178. 233 [234] Sang-ho Lee, Kiyoon Yoo, and Nojun Kwak. “Asynchronous Edge Learning using Cloned Knowledge Distillation”. In: arXiv preprint arXiv:2010.10338 (2020). [235] Dmitry Lepikhin et al. “Gshard: Scaling giant models with conditional computation and automatic sharding”. In: arXiv preprint arXiv:2006.16668 (2020). [236] David Leroy et al. “Federated Learning for Keyword Spotting”. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, United Kingdom, May 12-17, 2019. 2019. [237] David Leroy et al. “Federated learning for keyword spotting”. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019, pp. 6341–6345. [238] Mike Lewis et al. “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension”. In: Proc. of ACL. 2020. [239] Daiqing Li et al. “Fed-Sim: Federated Simulation for Medical Imaging”. In: arXiv e-prints, arXiv:2009.00668 (Sept. 2020), arXiv:2009.00668. arXiv: 2009.00668 [cs.CV]. [240] Daliang Li and Junpu Wang. “Fedmd: Heterogenous federated learning via model distillation”. In: arXiv preprint arXiv:1910.03581 (2019). [241] Mu Li et al. “Scaling distributed machine learning with the parameter server”. In: 11th{USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 14). 2014, pp. 583–598. [242] Mu Li et al. “Scaling Distributed Machine Learning with the Parameter Server.” In: OSDI. Vol. 14. 2014, pp. 583–598. [243] Q. Li et al. “Federated Learning on Non-IID Data Silos: An Experimental Study”. In: ArXiv (2021). [244] Qimai Li, Zhichao Han, and Xiao-Ming Wu. “Deeper insights into graph convolutional networks for semi-supervised learning”. In: Proceedings of the AAAI Conference on Artificial Intelligence . Vol. 32. 2018. [245] Qinbin Li, Bingsheng He, and Dawn Song. “Model-Agnostic Round-Optimal Federated Learning via Knowledge Transfer”. In: arXiv preprint arXiv:2010.01017 (2020). [246] Shen Li et al. “PyTorch Distributed: Experiences on Accelerating Data Parallel Training”. 
In: Proceedings of the VLDB Endowment 13.12 (2020). 234 [247] T. Li et al. “Ditto: Fair and Robust Federated Learning Through Personalization”. In: 2020. [248] Tan Li, Linqi Song, and Christina Fragouli. “Federated Recommendation System via Differential Privacy”. In: arXiv preprint arXiv:2005.06670 (2020). [249] Tian Li et al. “Ditto: Fair and Robust Federated Learning Through Personalization”. In: arXiv:2012.04221 [cs, stat] (Feb. 24, 2021). arXiv: 2012.04221. url: http://arxiv.org/abs/2012.04221 (visited on 03/31/2021). [250] Tian Li et al. “Ditto: Fair and robust federated learning through personalization”. In: arXiv: 2012.04221 (2020). [251] Tian Li et al. “Fair resource allocation in federated learning”. In: arXiv preprint arXiv:1905.10497 (2019). [252] Tian Li et al. “Federated Learning: Challenges, Methods, and Future Directions”. In: IEEE Signal Processing Magazine (2020). [253] Tian Li et al. “Federated optimization in heterogeneous networks”. In: arXiv preprint arXiv:1812.06127 (2018). [254] Wei Li and Andrew McCallum. “Pachinko allocation: DAG-structured mixture models of topic correlations”. In: Proceedings of the 23rd international conference on Machine learning. 2006, pp. 577–584. [255] Wenqi Li et al. “Privacy-preserving Federated Brain Tumour Segmentation”. In: arXiv e-prints, arXiv:1910.00962 (Oct. 2019), arXiv:1910.00962. arXiv: 1910.00962 [cs.CV]. [256] Wenqi Li et al. “Privacy-preserving federated brain tumour segmentation”. In: International Workshop on Machine Learning in Medical Imaging. Springer. 2019, pp. 133–141. [257] Xiang Li et al. “On the convergence of fedavg on non-iid data”. In: arXiv preprint arXiv:1907.02189 (2019). [258] Xiang Lisa Li and Percy Liang. “Prefix-Tuning: Optimizing Continuous Prompts for Generation”. In: Proc. of ACL. 2021. [259] Zhiyuan Li and Sanjeev Arora. “An exponential learning rate schedule for deep learning”. In: arXiv preprint arXiv:1910.07454 (2019). [260] Zhize Li et al. “Acceleration for Compressed Gradient Descent in Distributed and Federated Optimization”. In: arXiv preprint arXiv:2002.11364 (2020). 235 [261] Xiangru Lian et al. “Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent”. In: arXiv:1705.09056 [cs, math, stat] (May 2017). arXiv: 1705.09056. url: http://arxiv.org/abs/1705.09056 (visited on 12/24/2018). [262] Xiangru Lian et al. “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent”. In: Advances in Neural Information Processing Systems. 2017, pp. 5330–5340. [263] Jiacheng Liang et al. “OmniLytics: A Blockchain-based Secure Data Market for Decentralized Machine Learning”. In: arXiv preprint arXiv:2107.05252 (2021). [264] Xinle Liang et al. “Federated Transfer Reinforcement Learning for Autonomous Driving”. In: arXiv preprint arXiv:1910.06001 (2019). [265] Xinle Liang et al. “Self-supervised Cross-silo Federated Neural Architecture Search”. In: arXiv preprint arXiv:2101.11896 (2021). [266] Feng Liao et al. “Federated Hierarchical Hybrid Networks for Clickbait Detection”. In: arXiv preprint arXiv:1906.00638 (2019). [267] Wei Yang Bryan Lim et al. “Towards Federated Learning in UAV-Enabled Internet of Vehicles: A Multi-Dimensional Contract-Matching Approach”. In: arXiv preprint arXiv:2004.03877 (2020). [268] Bill Yuchen Lin et al. “FedNLP: A Research Platform for Federated Learning in Natural Language Processing”. In: arXiv preprint arXiv:2104.08815 (2021). 
[269] Tao Lin, Sebastian U. Stich, and Martin Jaggi. “Don’t Use Large Mini-Batches, Use Local SGD”. en. In: arXiv:1808.07217 [cs, stat] (Aug. 2018). arXiv: 1808.07217. url: http://arxiv.org/abs/1808.07217 (visited on 12/24/2018). [270] Tao Lin et al. “Ensemble Distillation for Robust Model Fusion in Federated Learning”. In: arXiv preprint arXiv:2006.07242 (2020). [271] Tsung-Yi Lin et al. “Microsoft coco: Common objects in context”. In: European conference on computer vision. Springer. 2014, pp. 740–755. [272] Yujun Lin et al. “Deep gradient compression: Reducing the communication bandwidth for distributed training”. In: arXiv preprint arXiv:1712.01887 (2017). [273] Boyi Liu, Lujia Wang, and Ming Liu. “Lifelong federated reinforcement learning: a learning architecture for navigation in cloud robotic systems”. In: IEEE Robotics and Automation Letters 4.4 (2019), pp. 4555–4562. 236 [274] Boyi Liu et al. “Federated Imitation Learning: A Privacy Considered Imitation Learning Framework for Cloud Robotic Systems with Heterogeneous Sensor Data”. In: arXiv preprint arXiv:1909.00895 (2019). [275] D. Liu and T. Miller. “Federated pretraining and fine tuning of BERT using clinical notes from multiple silos”. In: ArXiv (2020). [276] Dianbo Liu and Tim Miller. “Federated pretraining and fine tuning of BERT using clinical notes from multiple silos”. In: arXiv preprint arXiv:2002.08562 (2020). [277] Dianbo Liu et al. “Fadl: Federated-autonomous deep learning for distributed electronic health record”. In: arXiv preprint arXiv:1811.11400 (2018). [278] Hanxiao Liu, Karen Simonyan, and Yiming Yang. “Darts: Differentiable architecture search”. In: arXiv preprint arXiv:1806.09055 (2018). [279] Lumin Liu et al. “Client-edge-cloud hierarchical federated learning”. In: arXiv preprint arXiv:1905.06641 (2019). [280] Meng Liu et al. “DIG: A Turnkey Library for Diving into Graph Deep Learning Research”. In: arXiv preprint arXiv:2103.12608 (2021). [281] Sulin Liu, Sinno Jialin Pan, and Qirong Ho. “Distributed multi-task relationship learning”. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 937–946. [282] Yang Liu, Zhihao Yi, and Tianjian Chen. “Backdoor attacks and defenses in feature-partitioned collaborative learning”. In: arXiv e-prints, arXiv:2007.03608 (July 2020), arXiv:2007.03608. arXiv: 2007.03608 [cs.LG]. [283] Yang Liu, Xiong Zhang, and Libin Wang. “Asymmetrically Vertical Federated Learning”. In: arXiv preprint arXiv:2004.07427 (2020). [284] Yang Liu et al. “A Communication Efficient Collaborative Learning Framework for Distributed Features”. In: arXiv e-prints, arXiv:1912.11187 (Dec. 2019), arXiv:1912.11187. arXiv: 1912.11187 [cs.LG]. [285] Yang Liu et al. “FedVision: An Online Visual Object Detection Platform Powered by Federated Learning.” In: AAAI. 2020, pp. 13172–13179. [286] Yi Liu et al. “Deep Anomaly Detection for Time-Series Data in Industrial IoT: A Communication-Efficient On-Device Federated Learning Approach”. In: IEEE Internet of Things Journal 8.8 (2021), pp. 6348–6358. doi: 10.1109/JIOT.2020.3011726. [287] Yi Liu et al. “Privacy-preserving Traffic Flow Prediction: A Federated Learning Approach”. In: IEEE Internet of Things Journal (2020). 237 [288] Yi Liu et al. “RC-SSFL: Towards Robust and Communication-efficient Semi-supervised Federated Learning System”. In: arXiv preprint arXiv:2012.04432 (2020). [289] Yuan Liu et al. “FedCoin: A Peer-to-Peer Payment System for Federated Learning”. 
In: arXiv preprint arXiv:2002.11711 (2020). [290] Yun-qiang Liu et al. “A Secure Federated Transfer Learning Framework”. In: The Missouri Review (2020), pp. 1–1. [291] Yuqiao Liu et al. “A survey on evolutionary neural architecture search”. In: arXiv preprint arXiv:2008.10937 (2020). [292] Zhiwei Liu et al. “Basconv: Aggregating heterogeneous interactions for basket recommendation with graph convolutional neural network”. In: Proceedings of the 2020 SIAM International Conference on Data Mining. SIAM. 2020, pp. 64–72. [293] Zewei Long et al. “FedSemi: An Adaptive Federated Semi-Supervised Learning Framework”. In: arXiv preprint arXiv:2012.03292 (2020). [294] Ilya Loshchilov and Frank Hutter. “Decoupled Weight Decay Regularization”. In: Proc. of ICLR. 2019. [295] Siqi Luo et al. “HFEL: Joint Edge Association and Resource Allocation for Cost-Efficient Hierarchical Federated Edge Learning”. In: arXiv preprint arXiv:2002.11343 (2020). [296] Lingjuan Lyu et al. “Privacy and Robustness in Federated Learning: Attacks and Defenses”. In: ArXiv preprint (2020). [297] Jiaxin Ma, Ryo Yonetani, and Zahid Iqbal. “Adaptive Distillation for Decentralized Learning from Heterogeneous Clients”. In: arXiv preprint arXiv:2008.07948 (2020). [298] Yanjun Ma et al. “PaddlePaddle: An Open-Source Deep Learning Platform from Industrial Practice”. In: Frontiers of Data and Domputing 1.1 (2019), pp. 105–115. [299] Farzaneh Mahdisoltani, Joanna Biega, and Fabian M. Suchanek. “YAGO3: A Knowledge Base from Multilingual Wikipedias”. In: CIDR. Asilomar, United States, Jan. 2013. url: https://hal-imt.archives-ouvertes.fr/hal-01699874. [300] Grigory Malinovsky et al. “From Local SGD to Local Fixed Point Methods for Federated Learning”. In: arXiv preprint arXiv:2004.01442 (2020). 238 [301] Alberto Marchisio et al. “NASCaps: A framework for neural architecture search to optimize the accuracy and hardware efficiency of convolutional capsule networks”. In: 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD). IEEE. 2020, pp. 1–9. [302] Elan Markowitz et al. Graph Traversal with Tensor Functionals: A Meta-Algorithm for Scalable Learning. 2021. arXiv: 2102.04350 [cs.LG]. [303] Ines Filipa Martins et al. “A Bayesian approach to in silico blood-brain barrier penetration modeling”. In: Journal of chemical information and modeling 52.6 (2012), pp. 1686–1697. [304] Martıín Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. 2015. url: https://www.tensorflow.org/. [305] David Mateos-Núñez, Jorge Cortés, and Jorge Cortes. “Distributed optimization for multi-task learning via nuclear-norm approximation”. In: IFAC Workshop on Distributed Estimation and Control in Networked Systems. Vol. 48. 2015, pp. 64–69. [306] Andrew Kachites McCallum et al. “Automating the construction of internet portals with machine learning”. In: Information Retrieval 3.2 (2000), pp. 127–163. [307] Brendan McMahan et al. “Communication-Efficient Learning of Deep Networks from Decentralized Data”. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA. Proceedings of Machine Learning Research. 2017. [308] Brendan McMahan et al. “Communication-efficient learning of deep networks from decentralized data”. In: Artificial Intelligence and Statistics . 2017, pp. 1273–1282. [309] H Brendan McMahan et al. “Advances and open problems in federated learning”. 
In: Foundations and Trends® in Machine Learning 14.1 (2021). [310] H Brendan McMahan et al. “Communication-efficient learning of deep networks from decentralized data”. In: arXiv preprint arXiv:1602.05629 (2016). [311] H. Brendan McMahan et al. “Communication-Efficient Learning of Deep Networks from Decentralized Data”. en. In: arXiv:1602.05629 [cs] (Feb. 2016). arXiv: 1602.05629. url: http://arxiv.org/abs/1602.05629 (visited on 12/22/2018). [312] H. Brendan McMahan et al. “Federated Learning of Deep Networks using Model Averaging”. In: CoRR abs/1602.05629 (2016). arXiv: 1602.05629. url: http://arxiv.org/abs/1602.05629. 239 [313] H. Brendan McMahan et al. “Federated Learning of Deep Networks using Model Averaging”. In: CoRR abs/1602.05629 (2016). arXiv: 1602.05629. url: http://arxiv.org/abs/1602.05629. [314] Guangxu Mei et al. “Sgnn: A graph neural network based federated learning approach by hiding structure”. In: 2019 IEEE International Conference on Big Data (Big Data). IEEE. 2019, pp. 2560–2568. [315] Yair Meidan et al. “N-BaIoT—Network-Based Detection of IoT Botnet Attacks Using Deep Autoencoders”. In: IEEE Pervasive Computing 17.3 (2018), pp. 12–22. doi: 10.1109/MPRV.2018.03367731. [316] Luca Melis et al. “Exploiting unintended feature leakage in collaborative learning”. In: 2019 IEEE Symposium on Security and Privacy (SP). IEEE. 2019, pp. 691–706. [317] Chuizheng Meng, Sirisha Rambhatla, and Yan Liu. Cross-Node Federated Graph Neural Network for Spatio-Temporal Data Modeling. 2021. url: https://openreview.net/forum?id=HWX5j6Bv_ih. [318] Dimitar Minovski et al. “Throughput prediction using machine learning in lte and 5g networks”. In: IEEE Transactions on Mobile Computing (2021). [319] Fatemehsadat Mirshghallah et al. “Privacy in Deep Learning: A Survey”. In: arXiv preprint arXiv:2004.12254 (2020). [320] Ioannis Mitliagkas et al. “Asynchrony begets momentum, with an application to deep learning”. In: Communication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on. IEEE, 2016, pp. 997–1004. [321] Takeru Miyato et al. “Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 41.8 (Aug. 1, 2019), pp. 1979–1993. doi: 10.1109/TPAMI.2018.2858821. [322] Takeru Miyato et al. “Virtual adversarial training: a regularization method for supervised and semi-supervised learning”. In: IEEE transactions on pattern analysis and machine intelligence 41.8 (2018), pp. 1979–1993. [323] David L Mobley and J Peter Guthrie. “FreeSolv: a database of experimental and calculated hydration free energies, with input files”. In: Journal of computer-aided molecular design 28.7 (2014), pp. 711–720. [324] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. “Agnostic federated learning”. In: arXiv preprint arXiv:1902.00146 (2019). 240 [325] Ari Morcos, Maithra Raghu, and Samy Bengio. “Insights on representational similarity in neural networks with canonical correlation”. In: Advances in Neural Information Processing Systems 31. Ed. by S. Bengio et al. Curran Associates, Inc., 2018, pp. 5732–5741. url: http://papers.nips.cc/paper/7815-insights-on-representational- similarity-in-neural-networks-with-canonical-correlation.pdf. [326] Erum Mushtaq et al. “SPIDER: Searching Personalized Neural Architecture for Federated Learning”. In: arXiv preprint arXiv:2112.13939 (2021). [327] Deepak Narayanan et al. “Efficient large-scale language model training on gpu clusters”. 
In: Thirty-eighth International Conference on Machine Learning. 2021. [328] Deepak Narayanan et al. “PipeDream: Generalized Pipeline Parallelism for DNN Training”. In: Proceedings of the 27th ACM Symposium on Operating Systems Principles. SOSP ’19. Huntsville, Ontario, Canada: Association for Computing Machinery, 2019, pp. 1–15. isbn: 9781450368735. doi: 10.1145/3341301.3359646. [329] Milad Nasr, Reza Shokri, and Amir Houmansadr. “Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning”. In: 2019 IEEE Symposium on Security and Privacy (SP). IEEE. 2019, pp. 739–753. [330] John Nguyen et al. “Federated Learning with Buffered Asynchronous Aggregation”. In: arXiv preprint arXiv:2106.06639 (2021). [331] Thien Nguyen et al. “DÏoT: A Federated Self-learning Anomaly Detection System for IoT”. In: July 2019, pp. 756–767. doi: 10.1109/ICDCS.2019.00080. [332] Richard Nock et al. “Entity resolution and federated learning get a federated resolution”. In: arXiv preprint arXiv:1803.04035 (2018). [333] NVIDIA. NVIDIA Clara. 2019. [334] Thomas Brox Olaf Ronneberger Philipp Fischer. U-Net: Convolutional Networks for Biomedical Image Segmentation. 2015. url: https://arxiv.org/abs/1505.04597v1. [335] Open MPI: Open Source High Performance Computing. https://grpc.io/. [336] OpenAI. AI and Compute. https://openai.com/blog/ai-and-compute. 2018. url: https://openai.com/blog/ai-and-compute/ (visited on 05/16/2018). [337] Tribhuvanesh Orekondy et al. “Gradient-Leaks: Understanding and Controlling Deanonymization in Federated Learning”. In: arXiv preprint arXiv:1805.05838 (2018). 241 [338] James M Ortega and Werner C Rheinboldt. Iterative solution of nonlinear equations in several variables. Vol. 30. Siam, 1970. [339] Jay H. Park et al. “HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism”. In: 2020 USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, July 2020, pp. 307–321. isbn: 978-1-939133-14-4. url: https://www.usenix.org/conference/atc20/presentation/park. [340] Jihong Park et al. “Distilling on-device intelligence at the network edge”. In: arXiv preprint arXiv:1908.05895 (2019). [341] Jihong Park et al. “Wireless network intelligence at the edge”. In: Proceedings of the IEEE 107.11 (2019), pp. 2204–2239. [342] Adam Paszke et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library”. In: CoRR abs/1912.01703 (2019). arXiv: 1912.01703. url: http://arxiv.org/abs/1912.01703. [343] Adam Paszke et al. “PyTorch: An imperative style, high-performance deep learning library”. In: Advances in Neural Information Processing Systems. 2019, pp. 8024–8035. [344] Yanghua Peng et al. “A generic communication scheduler for distributed DNN training acceleration”. In: Proceedings of the 27th ACM Symposium on Operating Systems Principles. 2019, pp. 16–29. [345] Constantin Philippenko and Aymeric Dieuleveut. “Artemis: tight convergence guarantees for bidirectional compression in Federated Learning”. In: arXiv preprint arXiv:2006.14591 (2020). [346] Krishna Pillutla, Sham M Kakade, and Zaid Harchaoui. “Robust aggregation for federated learning”. In: arXiv preprint arXiv:1912.13445 (2019). [347] Sameer Pradhan et al. “Towards Robust Linguistic Analysis using OntoNotes”. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning. 2013. [348] Saurav Prakash and A. Salman Avestimehr. 
“Mitigating Byzantine Attacks in Federated Learning”. In: arXiv preprint arxiv: 2010.07541 (2020). [349] Saurav Prakash and Amir Salman Avestimehr. “Mitigating Byzantine Attacks in Federated Learning”. In: arXiv preprint arXiv:2010.07541 (2020). [350] Saurav Prakash et al. “Coded computing for distributed graph analytics”. In: IEEE Transactions on Information Theory 66.10 (2020), pp. 6534–6554. 242 [351] Saurav Prakash et al. “Coded Computing for Federated Learning at the Edge”. In: IEEE Journal on Selected Areas in Communication, Series on Machine Learning for Communications and Networks (2020). [352] Saurav Prakash et al. “Coded Computing for Low-Latency Federated Learning Over Wireless Edge Networks”. In: IEEE Journal on Selected Areas in Communications 39.1 (2020), pp. 233–250. [353] Saurav Prakash et al. “Coded computing for low-latency federated learning over wireless edge networks”. In: IEEE Journal on Selected Areas in Communications 39.1 (2020), pp. 233–250. [354] Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, AKBC-WEKEX@NAACL-HLT 2012, Montrèal, Canada, June 7-8, 2012. 2012. isbn: 978-1-937284-20-6. [355] PyTorch RPC: Distributed Deep Learning Built on Tensor-Optimized Remote Procedure Calls. https://pytorch.org/docs/stable/rpc.html. 2021. [356] Tao Qi et al. “FedRec: Privacy-Preserving News Recommendation with Federated Learning”. In: arXiv (2020), arXiv–2003. [357] Ning Qian. “On the momentum term in gradient descent learning algorithms”. In: Neural networks 12.1 (1999), pp. 145–151. [358] Dragomir R Radev et al. “Evaluating Web-based Question Answering Systems.” In: LREC. Citeseer. 2002. [359] Colin Raffel et al. “Exploring the limits of transfer learning with a unified text-to-text transformer”. In: Journal of Machine Learning Research 140 (2020). [360] M. Raghu et al. “SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability”. In: NIPS. 2017. [361] Samyam Rajbhandari et al. “Zero: Memory optimization towards training a trillion parameter models”. In: arXiv preprint arXiv:1910.02054 (2019). [362] Pranav Rajpurkar et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text”. In: Proc. of EMNLP. 2016. [363] Raghunathan Ramakrishnan et al. “Electronic spectra from TDDFT and machine learning in chemical space”. In: The Journal of Chemical Physics 143.8 (Aug. 2015), p. 084111. issn: 1089-7690. doi: 10.1063/1.4928757. url: http://dx.doi.org/10.1063/1.4928757. 243 [364] Raghunathan Ramakrishnan et al. “Quantum chemistry structures and properties of 134 kilo molecules”. In: Scientific Data 1 (2014). [365] Swaroop Indra Ramaswamy et al. “Federated Learning for Emoji Prediction in a Mobile Keyboard”. In: ArXiv (2019). [366] Meisam Razaviyayn, Mingyi Hong, and Zhi-Quan Luo. “A unified convergence analysis of block successive minimization methods for nonsmooth optimization”. In: SIAM Journal on Optimization 23.2 (2013), pp. 1126–1153. [367] Sashank Reddi et al. “Adaptive Federated Optimization”. In: arXiv preprint arXiv:2003.00295 (2020). [368] General Data Protection Regulation. “Regulation EU 2016/679 of the European Parliament and of the Council of 27 April 2016”. In: Official Journal of the European Union. Available at: http://ec. europa. eu/justice/data-protection/reform/files/regulation_oj_en. pdf (accessed 20 September 2017) (2016). [369] Amirhossein Reisizadeh et al. 
“Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization”. In: International Conference on Artificial Intelligence and Statistics . PMLR. 2020, pp. 2021–2031. [370] Valerian Rey et al. “Federated Learning for Malware Detection in IoT Devices”. In: arXiv preprint arXiv:2104.09994 (2021). [371] Khaled Riad, Teng Huang, and Lishan Ke. “A dynamic and hierarchical access control for IoT in multi-authority cloud storage”. In: Journal of Network and Computer Applications 160 (Apr. 2020), p. 102633. doi: 10.1016/j.jnca.2020.102633. [372] Mónica Ribero et al. “Federating Recommendations Using Differentially Private Prototypes”. In: arXiv preprint arXiv:2003.00602 (2020). [373] Ann M Richard et al. “ToxCast chemical landscape: paving the road to 21st century toxicology”. In: Chemical research in toxicology 29.8 (2016), pp. 1225–1251. [374] Matthew Richardson, Rakesh Agrawal, and Pedro Domingos. “Trust management for the semantic web”. In: International semantic Web conference. Springer. 2003, pp. 351–368. [375] Nicola Rieke et al. “The future of digital health with federated learning”. In: arXiv preprint arXiv:2003.08119 (2020). 244 [376] Kaspar Riesen and Horst Bunke. “IAM graph database repository for graph based pattern recognition and machine learning”. In: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer. 2008, pp. 287–297. [377] David Rogers and Mathew Hahn. “Extended-connectivity fingerprints”. In: Journal of chemical information and modeling 50.5 (2010), pp. 742–754. [378] Sebastian G Rohrer and Knut Baumann. “Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data”. In: Journal of chemical information and modeling 49.2 (2009), pp. 169–184. [379] Sebastian G Rohrer and Knut Baumann. “Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data”. In: Journal of chemical information and modeling 49.2 (2009), pp. 169–184. [380] Adriana Romero et al. “Fitnets: Hints for thin deep nets”. In: arXiv preprint arXiv:1412.6550 (2014). [381] Yu Rong et al. “Deep Graph Learning: Foundations, Advances and Applications”. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge DiscoverY; Data Mining. KDD ’20. Virtual Event, CA, USA: Association for Computing Machinery, 2020, pp. 3555–3556. isbn: 9781450379984. doi: 10.1145/3394486.3406474. url: https://doi.org/10.1145/3394486.3406474. [382] Yu Rong et al. Self-Supervised Graph Transformer on Large-Scale Molecular Data. 2020. arXiv: 2007.02835 [q-bio.BM]. [383] Yu Rong et al. “Self-Supervised Graph Transformer on Large-Scale Molecular Data”. In: Advances in Neural Information Processing Systems 33 (2020). [384] Ron M Roth and Abraham Lempel. “On MDS codes via Cauchy matrices”. In: IEEE transactions on information theory 35.6 (1989), pp. 1314–1319. [385] Daniel Rothchild et al. “FetchSGD: Communication-Efficient Federated Learning with Sketching”. In: arXiv preprint arXiv:2007.07682 (2020), p. 12. [386] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Tech. rep. California Univ San Diego La Jolla Inst for Cognitive Science, 1985. [387] Olga Russakovsky et al. “Imagenet large scale visual recognition challenge”. In: International journal of computer vision 115.3 (2015), pp. 211–252. [388] Theo Ryffel et al. 
“A generic framework for privacy preserving deep learning”. In: arXiv preprint arXiv:1811.04017 (2018). 245 [389] Aaqib Saeed et al. “Federated Self-Supervised Learning of Multisensor Representations for Embedded Intelligence”. In: IEEE Internet of Things Journal 8.2 (2020), pp. 1030–1040. [390] Anit Kumar Sahu et al. “On the Convergence of Federated Optimization in Heterogeneous Networks”. In: ArXiv abs/1812.06127 (2018). [391] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. “Regularization with stochastic transformations and perturbations for deep semi-supervised learning”. In: Neural Information Processing Systems. Dec. 5, 2016, pp. 1171–1179. [392] Tim Salimans and Diederik P. Kingma. “Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks”. In: CoRR abs/1602.07868 (2016). arXiv: 1602.07868. url: http://arxiv.org/abs/1602.07868. [393] Victor Sanh et al. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”. In: ArXiv (2019). [394] Yuris Mulya Saputra et al. “Energy demand prediction with federated learning for electric vehicle networks”. In: 2019 IEEE Global Communications Conference (GLOBECOM). IEEE. 2019, pp. 1–6. [395] Yuris Mulya Saputra et al. “Federated Learning Meets Contract Theory: Energy-Efficient Framework for Electric Vehicle Networks”. In: arXiv preprint arXiv:2004.01828 (2020). [396] Joel Scheuner and Philipp Leitner. “A cloud benchmark suite combining micro and applications benchmarks”. In: Companion of the 2018 ACM/SPEC International Conference on Performance Engineering. 2018, pp. 161–166. [397] Kristof T Schütt et al. “Quantum-chemical insights from deep tensor neural networks”. In: Nature communications 8.1 (2017), pp. 1–8. [398] Isabel Segura Bedmar, Paloma Martıínez, and Marıía Herrero Zazo. “Semeval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (ddiextraction 2013)”. In: Association for Computational Linguistics. 2013. [399] Prithviraj Sen et al. “Collective classification in network data”. In: AI magazine 29.3 (2008), pp. 93–93. [400] Alexander Sergeev and Mike Del Balso. “Horovod: fast and easy distributed deep learning in TensorFlow”. In: arXiv preprint arXiv:1802.05799 (2018). [401] Adi Shamir. “How to share a secret”. In: Communications of the ACM 22.11 (1979), pp. 612–613. 246 [402] Shreya Sharma et al. “Secure and Efficient Federated Transfer Learning”. In: 2019 IEEE International Conference on Big Data (Big Data). IEEE. 2019, pp. 2569–2576. [403] Vivek Sharma et al. “ExpertMatcher: Automating ML Model Selection for Clients using Hidden Representations”. In: arXiv preprint arXiv:1910.03731 (2019). [404] Noam Shazeer et al. “Mesh-TensorFlow: Deep Learning for Supercomputers”. In: Advances in Neural Information Processing Systems. Ed. by S. Bengio et al. Vol. 31. Curran Associates, Inc., 2018, pp. 10414–10423. [405] Oleksandr Shchur et al. Pitfalls of Graph Neural Network Evaluation. 2019. arXiv: 1811.05868 [cs.LG]. [406] Micah J Sheller et al. “Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation”. In: International MICCAI Brainlesion Workshop. Springer. 2018, pp. 92–104. [407] Sheng Shen et al. “Reservoir Transformer”. In: arXiv preprint arXiv:2012.15045 (2020). [408] Bin Shi, Weijie J Su, and Michael I Jordan. “On Learning Rates and Schr\" odinger Operators”. In: arXiv preprint arXiv:2004.06977 (2020). [409] Nir Shlezinger et al. 
“UVeQFed: Universal vector quantization for federated learning”. In: IEEE Transactions on Signal Processing 69 (2020), pp. 500–514. [410] Mohammad Shoeybi et al. “Megatron-lm: Training multi-billion parameter language models using model parallelism”. In: arXiv preprint arXiv:1909.08053 (2019). [411] Abhishek Singh et al. “Detailed comparison of communication efficiency of split learning and federated learning”. In: arXiv preprint arXiv:1909.09145 (2019). [412] Ishika Singh et al. “Differentially-private Federated Neural Architecture Search”. In: arXiv preprint arXiv:2006.10559 (2020). [413] Virginia Smith et al. “Federated Multi-Task Learning”. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 2017. [414] Virginia Smith et al. “Federated multi-task learning”. In: Advances in Neural Information Processing Systems. 2017, pp. 4424–4434. [415] Sean C Smithson et al. “Neural networks designing neural networks: multi-objective hyper-parameter optimization”. In: Proceedings of the 35th International Conference on Computer-Aided Design. 2016, pp. 1–8. 247 [416] Jinhyun So, Basak Guler, and A Salman Avestimehr. “Turbo-Aggregate: Breaking the Quadratic Aggregation Barrier in Secure Federated Learning”. In: arXiv preprint arXiv:2002.04156 (2020). [417] Jinhyun So, Basak Guler, and A. Salman Avestimehr. “Byzantine-Resilient Secure Federated Learning”. In: arXiv preprint arXiv:2007.11115 (2020). [418] Jinhyun So, Basak Guler, and A. Salman Avestimehr. “Byzantine-resilient secure federated learning”. In: IEEE Journal on Selected Areas in Communication, Series on Machine Learning for Communications and Networks (2020). [419] Jinhyun So, Basak Guler, and A. Salman Avestimehr. “Turbo-Aggregate: Breaking the Quadratic Aggregation Barrier in Secure Federated Learning”. In: arXiv preprint arXiv:2002.04156 (2020). [420] Jinhyun So, Başak Güler, and A Salman Avestimehr. “Byzantine-resilient secure federated learning”. In: IEEE Journal on Selected Areas in Communications (2020). [421] Jinhyun So, Başak Güler, and A Salman Avestimehr. “Byzantine-resilient secure federated learning”. In: IEEE Journal on Selected Areas in Communications 39.7 (2021), pp. 2168–2181. [422] Jinhyun So, Başak Güler, and A Salman Avestimehr. “CodedPrivateML: A fast and privacy-preserving framework for distributed machine learning”. In: IEEE Journal on Selected Areas in Information Theory 2.1 (2021), pp. 441–451. [423] Jinhyun So, Başak Güler, and A Salman Avestimehr. “Turbo-aggregate: Breaking the quadratic aggregation barrier in secure federated learning”. In: IEEE Journal on Selected Areas in Information Theory (2021). [424] Jinhyun So et al. “Securing Secure Aggregation: Mitigating Multi-Round Privacy Leakage in Federated Learning”. In: arXiv preprint arXiv:2106.03328 (2021). [425] Guocong Song and Wei Chai. “Collaborative learning for deep neural networks”. In: Advances in Neural Information Processing Systems. 2018, pp. 1832–1841. [426] Joel Stremmel and Arjun Singh. “Pretraining Federated Text Models for Next Word Prediction”. In: ArXiv (2020). [427] Govindan Subramanian et al. “Computational modeling of β -secretase 1 (BACE-1) inhibitors using ligand based approaches”. In: Journal of chemical information and modeling 56.10 (2016), pp. 1936–1949. [428] Dianbo Sui et al. “FedED: Federated Learning via Ensemble Distillation for Medical Relation Extraction”. In: Proc. of EMNLP. 2020. 248 [429] Lichao Sun and Lingjuan Lyu. 
“Federated Model Distillation with Noise-Free Differential Privacy”. In: arXiv preprint arXiv:2009.05537 (2020). [430] Mengying Sun et al. “Graph convolutional networks for computational drug development and discovery”. In: Briefings in Bioinformatics 21.3 (June 2019), pp. 919–935. issn: 1477-4054. doi: 10.1093/bib/bbz042. eprint: https://academic.oup.com/bib/article-pdf/21/3/919/33227266/bbz042.pdf. url: https://doi.org/10.1093/bib/bbz042. [431] Ziteng Sun et al. “Can you really backdoor federated learning?” In: arXiv preprint arXiv:1911.07963 (2019). [432] Toyotaro Suzumura et al. “Towards federated graph learning for collaborative financial crimes detection”. In: arXiv preprint arXiv:1909.12946 (2019). [433] Canh T. Dinh, Nguyen Tran, and Josh Nguyen. “Personalized Federated Learning with Moreau Envelopes”. In: Advances in Neural Information Processing Systems. Ed. by H. Larochelle et al. Vol. 33. Curran Associates, Inc., 2020, pp. 21394–21405. url: https://proceedings.neurips.cc/paper/2020/file/ f4f1f13c8289ac1b1ee0ff176b56fc60-Paper.pdf. [434] Mingxing Tan and Quoc Le. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”. In: Proceedings of the 36th International Conference on Machine Learning. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning Research. PMLR, Sept. 2019, pp. 6105–6114. url: http://proceedings.mlr.press/v97/tan19a.html. [435] Mingxing Tan and Quoc Le. “Efficientnet: Rethinking model scaling for convolutional neural networks”. In: International Conference on Machine Learning. PMLR. 2019, pp. 6105–6114. [436] Mingxing Tan et al. “Mnasnet: Platform-aware neural architecture search for mobile”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 2820–2828. [437] Hanlin Tang et al. “Communication Compression for Decentralized Training”. en. In: arXiv:1803.06443 [cs, stat] (Mar. 2018). arXiv: 1803.06443. url: http://arxiv.org/abs/1803.06443 (visited on 12/22/2018). [438] Hanlin Tang et al. “Communication compression for decentralized training”. In: Advances in Neural Information Processing Systems. 2018, pp. 7652–7662. [439] Hanlin Tang et al. “DeepSqueeze: Decentralization Meets Error-Compensated Compression”. In: arXiv (2019), arXiv–1907. 249 [440] Jie Tang et al. “Arnetminer: extraction and mining of academic social networks”. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. 2008, pp. 990–998. [441] Jiliang Tang, Huiji Gao, and Huan Liu. “mTrust: Discerning multi-faceted trust in a connected world”. In: Proceedings of the fifth ACM international conference on Web search and data mining. 2012, pp. 93–102. [442] Tingting Tang et al. “Verifiable coded computing: Towards fast, secure and private distributed machine learning”. In: arXiv preprint arXiv:2107.12958 (2021). [443] Zhenheng Tang, Shaohuai Shi, and Xiaowen Chu. “Communication-efficient decentralized learning with sparsification and adaptive peer selection”. In: arXiv preprint arXiv:2002.09692 (2020). [444] Sasha Targ, Diogo Almeida, and Kevin Lyman. “Resnet in resnet: Generalizing residual architectures”. In: arXiv preprint arXiv:1603.08029 (2016). [445] Kristina Toutanova and Danqi Chen. “Observed Versus Latent Features for Knowledge Base and Text Inference”. In: CVSCW. July 2015. doi: 10.18653/v1/W15-4007. [446] Tox21 Challenge. https://tripod.nih.gov/tox21/challenge/. 2017. [447] Linh Tran et al. 
“Hydra: Preserving Ensemble Diversity for Model Distillation”. In: arXiv preprint arXiv:2001.04694 (2020). [448] Aleksei Triastcyn and Boi Faltings. “Federated generative privacy”. In: IEEE Intelligent Systems (2020). [449] Aleksei Triastcyn and Boi Faltings. “Federated learning with Bayesian differential privacy”. In: 2019 IEEE International Conference on Big Data (Big Data). IEEE. 2019, pp. 2587–2596. [450] Adam Trischler et al. “NewsQA: A Machine Comprehension Dataset”. In: Proceedings of the 2nd Workshop on Representation Learning for NLP. 2017. [451] Stacey Truex et al. “A hybrid approach to privacy-preserving federated learning”. In: Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security . 2019, pp. 1–11. [452] Stacey Truex et al. “LDP-Fed: Federated learning with local differential privacy”. In: Proceedings of the Third ACM International Workshop on Edge Systems, Analytics and Networking. 2020, pp. 61–66. 250 [453] Matthew Brown Tzu-Ming Harry Hsu Hang Qi. Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification . 2019. url: https://arxiv.org/pdf/1909.06335.pdf. [454] USC/LANDER. 10-day Operantional IoT Traces. 2020. url: http://www.isi.edu/ant/lander. [455] Ashish Vaswani et al. “Attention is All you Need”. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 2017. [456] Ashish Vaswani et al. “Attention is all you need”. In: arXiv preprint arXiv:1706.03762 (2017). [457] Petar Veličković et al. Graph Attention Networks. 2018. arXiv: 1710.10903 [stat.ML]. [458] Praneeth Vepakomma et al. “Split learning for health: Distributed deep learning without sharing raw patient data”. In: arXiv preprint arXiv:1812.00564 (2018). [459] Jayakorn Vongkulbhisal, Phongtharin Vinayavekhin, and Marco Visentini-Scarzanella. “Unifying heterogeneous classifiers with distillation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 3175–3184. [460] Aidmar Wainakh et al. “Enhancing Privacy via Hierarchical Federated Learning”. In: arXiv preprint arXiv:2004.11361 (2020). [461] Nikil Wale, Ian A Watson, and George Karypis. “Comparison of descriptor spaces for chemical compound retrieval and classification”. In: Knowledge and Information Systems 14.3 (2008), pp. 347–375. [462] Binghui Wang et al. “GraphFL: A Federated Learning Framework for Semi-Supervised Node Classification on Graphs”. In: arXiv preprint arXiv:2012.04187 (2020). [463] Hongyi Wang et al. “Atomo: Communication-efficient learning via atomic sparsification”. In: Advances in Neural Information Processing Systems. 2018, pp. 9850–9861. [464] Hongyi Wang et al. “Attack of the Tails: Yes, You Really Can Backdoor Federated Learning”. In: arXiv preprint arXiv:2007.05084 (2020). [465] Hongyi Wang et al. “Federated learning with matched averaging”. In: arXiv preprint arXiv:2002.06440 (2020). [466] Jialei Wang, Mladen Kolar, and Nathan Srerbo. “Distributed multi-task learning”. In: Artificial Intelligence and Statistics . 2016, pp. 751–760. 251 [467] Jianyu Wang and Gauri Joshi. “Cooperative SGD: A unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms”. en. In: (Aug. 2018). url: https://arxiv.org/abs/1808.07576v2 (visited on 12/24/2018). [468] Jianyu Wang et al. “A Field Guide to Federated Optimization”. In: ArXiv abs/2107.06917 (2021). [469] Jianyu Wang et al. “A field guide to federated optimization”. 
In: arXiv preprint arXiv:2107.06917 (2021). [470] Jianyu Wang et al. “Local Adaptivity in Federated Learning: Convergence and Consistency”. In: 2021. [471] Jianyu Wang et al. “Tackling the objective inconsistency problem in heterogeneous federated optimization”. In: arXiv preprint arXiv:2007.07481 (2020). [472] Xiaoyang Wang et al. “Traffic Flow Prediction via Spatial Temporal Graph Neural Network”. In: Proceedings of The Web Conference 2020. New York, NY, USA: Association for Computing Machinery, 2020, pp. 1082–1092. isbn: 9781450370233. url: https://doi.org/10.1145/3366423.3380186. [473] Yiqi Wang et al. Non-IID Graph Neural Networks. 2020. arXiv: 2005.12386 [cs.LG]. [474] Zhibo Wang et al. “Beyond inferring class representatives: User-level privacy leakage from federated learning”. In: IEEE INFOCOM 2019-IEEE Conference on Computer Communications. IEEE. 2019, pp. 2512–2520. [475] Zhuzhu Wang et al. “Cloud-based Federated Boosting for Mobile Crowdsensing”. In: arXiv preprint arXiv:2005.05304 (2020). [476] Jianqiao Wangni et al. “Gradient sparsification for communication-efficient distributed optimization”. In: Advances in Neural Information Processing Systems. 2018, pp. 1299–1309. [477] Wenqi Wei et al. “A Framework for Evaluating Gradient Leakage Attacks in Federated Learning”. In: arXiv preprint arXiv:2004.10397 (2020). [478] T. Weyand et al. “Google Landmarks Dataset v2 – A Large-Scale Benchmark for Instance-Level Recognition and Retrieval”. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020, pp. 2572–2581. doi: 10.1109/CVPR42600.2020.00265. [479] Tobias Weyand et al. “Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 2575–2584. 252 [480] Thomas Wolf et al. “Transformers: State-of-the-Art Natural Language Processing”. In: Proc. of EMNLP. 2020. [481] Stephen J Wright. “Coordinate descent algorithms”. In: Mathematical Programming 151.1 (2015), pp. 3–34. [482] Bichen Wu et al. “Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 10734–10742. [483] Chuhan Wu et al. “FedGNN: Federated Graph Neural Network for Privacy-Preserving Recommendation”. In: arXiv preprint arXiv:2102.04925 (2021). [484] Felix Wu et al. Simplifying Graph Convolutional Networks. 2019. arXiv: 1902.07153 [cs.LG]. [485] Le Wu et al. “SocialGCN: An Efficient Graph Convolutional Network based Model for Social Recommendation”. In: CoRR abs/1811.02815 (2018). arXiv: 1811.02815. url: http://arxiv.org/abs/1811.02815. [486] Zhenqin Wu et al. “MoleculeNet: a benchmark for molecular machine learning”. In: Chemical science 9.2 (2018), pp. 513–530. [487] Zhirong Wu et al. Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination. 2018. arXiv: 1805.01978 [cs.CV]. [488] Chulin Xie et al. “DBA: Distributed Backdoor Attacks against Federated Learning”. In: International Conference on Learning Representations. 2019. [489] Cong Xie, Sanmi Koyejo, and Indranil Gupta. “Asynchronous federated optimization”. In: arXiv preprint arXiv:1903.03934 (2019). [490] Han Xie et al. Federated Graph Classification over Non-IID Graphs . 2021. arXiv: 2106.13423 [cs.LG]. [491] Yaochen Xie et al. “Self-Supervised Learning of Graph Neural Networks: A Unified Review”. In: arXiv preprint arXiv:2102.10757 (2021). 
[492] Ruibin Xiong et al. “On layer normalization in the transformer architecture”. In: International Conference on Machine Learning. PMLR. 2020, pp. 10524–10533. [493] Keyulu Xu et al. How Powerful are Graph Neural Networks? 2019. arXiv: 1810.00826 [cs.LG]. [494] Mengwei Xu et al. “Federated neural architecture search”. In: arXiv preprint arXiv:2002.06352 (2020). 253 [495] Mengwei Xu et al. “Neural Architecture Search over Decentralized Data”. In: arXiv preprint arXiv:2002.06352 (2020). [496] Runhua Xu et al. “Hybridalpha: An efficient approach for privacy-preserving federated learning”. In: Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security. 2019, pp. 13–23. [497] Pinar Yanardag and S.V.N. Vishwanathan. “Deep Graph Kernels”. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: Association for Computing Machinery, 2015, pp. 1365–1374. isbn: 9781450336642. url: https://doi.org/10.1145/2783258.2783417. [498] Bowen Yang et al. “Pipemare: Asynchronous pipeline parallel dnn training”. In: Proceedings of Machine Learning and Systems 3 (2021). [499] Carl Yang et al. “Conditional Structure Generation through Graph Variational Generative Adversarial Nets”. In: NIPS. 2019. [500] Carl Yang et al. “Did You Enjoy the Ride? Understanding Passenger Experience via Heterogeneous Network Embedding”. In: ICDE. 2018. [501] Carl Yang et al. “Heterogeneous Network Representation Learning: A Unified Framework with Survey and Benchmark”. In: TKDE. 2020. [502] Carl Yang et al. “MultiSage: Empowering GraphSage with Contextualized Multi-Embedding on Web-Scale Multipartite Networks”. In: KDD. 2020. [503] Carl Yang et al. “Node, Motif and Subgraph: Leveraging Network Functional Blocks Through Structural Convolution”. In: ASONAM. 2018. [504] Carl Yang et al. “Relation Learning on Social Networks with Multi-Modal Graph Edge Variational Autoencoders”. In: WSDM. 2020. [505] Carl Yang et al. “Secure Deep Graph Generation with Link Differential Privacy”. In: IJCAI. 2021. [506] Chien-Sheng Yang et al. “LightSecAgg: Rethinking Secure Aggregation in Federated Learning”. In: arXiv preprint arXiv:2109.14236 (2021). [507] Kai Yang et al. “A quasi-newton method based vertical federated learning framework for logistic regression”. In: arXiv preprint arXiv:1912.00513 (2019). [508] Kevin K Yang et al. “Learned protein embeddings for machine learning”. In: Bioinformatics 34.15 (2018), pp. 2642–2648. 254 [509] Lei Yang et al. “Co-exploration of neural architectures and heterogeneous asic accelerator designs targeting multiple tasks”. In: 2020 57th ACM/IEEE Design Automation Conference (DAC). IEEE. 2020, pp. 1–6. [510] Liangwei Yang et al. “ConsisRec: Enhancing GNN for Social Recommendation via Consistent Neighbor Aggregation”. In: arXiv preprint arXiv:2105.02254 (2021). [511] Qiang Yang et al. “Federated learning”. In: Synthesis Lectures on Artificial Intelligence and Machine Learning 13.3 (2019), pp. 1–207. [512] Qiang Yang et al. “Federated Machine Learning: Concept and Applications”. In: ACM Trans. Intell. Syst. Technol. 10.2 (Jan. 2019). issn: 2157-6904. doi: 10.1145/3298981. url: https://doi.org/10.1145/3298981. [513] Shengwen Yang et al. “Parallel distributed logistic regression for vertical federated learning without third-party coordinator”. In: arXiv preprint arXiv:1911.09824 (2019). [514] T. Yang et al. “APPLIED FEDERATED LEARNING: IMPROVING GOOGLE KEYBOARD QUERY SUGGESTIONS”. In: ArXiv (2018). [515] Tianbao Yang et al. 
“Analysis of distributed stochastic dual coordinate ascent”. In: arXiv preprint arXiv:1312.1031 (2013). [516] Tien-Ju Yang et al. “Netadapt: Platform-aware neural network adaptation for mobile applications”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 285–300. [517] Zhilin Yang et al. “HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering”. In: Proc. of EMNLP. 2018. [518] Andrew C Yao. “Protocols for secure computations”. In: 23rd annual symposium on foundations of computer science (sfcs 1982). IEEE. 1982, pp. 160–164. [519] Haishan Ye et al. “Multi-consensus Decentralized Accelerated Gradient Descent”. In: arXiv preprint arXiv:2005.00797 (2020). [520] Jennifer Yick, Biswanath Mukherjee, and Dipak Ghosal. “Wireless sensor network survey”. In: Computer networks 52.12 (2008), pp. 2292–2330. [521] Dong Yin et al. “Byzantine-robust distributed learning: Towards optimal statistical rates”. In: arXiv preprint arXiv:1803.01498 (2018). [522] Feng Yin et al. “FedLoc: Federated Learning Framework for Data-Driven Cooperative Localization and Location Data Processing”. In: arXiv preprint arXiv:2003.03697 (2020). 255 [523] Felix X Yu et al. “Federated Learning with Only Positive Labels”. In: arXiv preprint arXiv:2004.10342 (2020). [524] Hao Yu, Sen Yang, and Shenghuo Zhu. “Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning”. en. In: arXiv:1807.06629 [cs, math] (July 2018). arXiv: 1807.06629. url: http://arxiv.org/abs/1807.06629 (visited on 12/24/2018). [525] Hao Yu, Sen Yang, and Shenghuo Zhu. “Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning”. In: Proceedings of the AAAI Conference on Artificial Intelligence . Vol. 33. 2019, pp. 5693–5700. [526] Peihua Yu and Yunfeng Liu. “Federated Object Detection: Optimizing Object Detection Model with Federated Learning”. In: Proceedings of the 3rd International Conference on Vision, Image and Signal Processing. 2019, pp. 1–6. [527] Qian Yu et al. “Lagrange coded computing: Optimal design for resiliency, security, and privacy”. In: The 22nd International Conference on Artificial Intelligence and Statistics. 2019, pp. 1215–1225. [528] Tao Yu, Eugene Bagdasaryan, and Vitaly Shmatikov. “Salvaging federated learning by local adaptation”. In: arXiv preprint arXiv:2002.04758 (2020). [529] Mikhail Yurochkin et al. “Bayesian nonparametric federated learning of neural networks”. In: arXiv preprint arXiv:1905.12022 (2019). [530] Fengda Zhang et al. “Federated Unsupervised Representation Learning”. In: arXiv preprint arXiv:2010.08982 (2020). [531] Ke Zhang et al. Subgraph Federated Learning with Missing Neighbor Generation. 2021. arXiv: 2106.13430 [cs.LG]. [532] Tuo Zhang et al. “Federated Learning for Internet of Things”. In: Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems (2021). [533] Tuo Zhang et al. “Federated Learning for Internet of Things”. In: Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems (2021). [534] Tuo Zhang et al. “Federated Learning for Internet of Things: Applications, Challenges, and Opportunities”. In: arXiv preprint arXiv:2111.07494 (2021). [535] Xiangyu Zhang et al. “Shufflenet: An extremely efficient convolutional neural network for mobile devices”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp. 6848–6856. 256 [536] Ying Zhang et al. 
“Deep mutual learning”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 4320–4328. [537] Yu Zhang and Dit-Yan Yeung. “A convex formulation for learning task relationships in multi-task learning”. In: arXiv preprint arXiv:1203.3536 (2012). [538] Zhengming Zhang et al. “Improving Semi-supervised Federated Learning by Reducing the Gradient Diversity of Models”. In: arXiv preprint arXiv:2008.11364 (2020). [539] Yizhou Zhao and Hua Sun. “Information Theoretic Secure Aggregation with User Dropouts”. In: arXiv preprint arXiv:2101.07750 (2021). [540] Yuchen Zhao et al. “Semi-supervised Federated Learning for Activity Recognition”. In: arXiv preprint arXiv:2011.00851 (2020). [541] Da Zheng et al. “DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs”. In: arXiv preprint arXiv:2010.05337 (2020). [542] Liangzhen Zheng, Jingrong Fan, and Yuguang Mu. “Onionnet: a multiple-layer intermolecular-contact-based convolutional neural network for protein–ligand binding affinity prediction”. In: ACS omega 4.14 (2019), pp. 15956–15965. [543] Longfei Zheng et al. “ASFGNN: Automated Separated-Federated Graph Neural Network”. In: arXiv preprint arXiv:2011.03248 (2020). [544] Bin Zhou, Jian Pei, and WoShun Luk. “A brief survey on anonymization techniques for privacy preserving publishing of social network data”. In: ACM Sigkdd Explorations Newsletter 10.2 (2008), pp. 12–22. [545] Jun Zhou et al. “Privacy-preserving graph neural network for node classification”. In: arXiv preprint arXiv:2005.11903 (2020). [546] Yanlin Zhou et al. “Distilled One-Shot Federated Learning”. In: arXiv preprint arXiv:2009.07999 (2020). [547] Hangyu Zhu and Yaochu Jin. “Real-time federated evolutionary neural architecture search”. In: arXiv preprint arXiv:2003.02793 (2020). [548] Ligeng Zhu and Song Han. “Deep leakage from gradients”. In: Federated Learning. Springer, 2020, pp. 17–31. [549] Ligeng Zhu, Zhijian Liu, and Song Han. “Deep leakage from gradients”. In: Advances in Neural Information Processing Systems. 2019, pp. 14774–14784. 257 [550] Xiatian Zhu, Shaogang Gong, et al. “Knowledge distillation by on-the-fly native ensemble”. In: Advances in neural information processing systems. 2018, pp. 7517–7527. [551] Xinghua Zhu et al. “Empirical Studies of Institutional Federated Learning For Natural Language Processing”. In: Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. [552] Fangyu Zou et al. “A sufficient condition for convergences of adam and rmsprop”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 11127–11135. 
Appendices

Chapter A
Supplement to Chapter 2 - FedML

A.1 The Taxonomy of Research Areas and a Comprehensive Publication List

Table A.1: The taxonomy of research areas in federated learning and related publication statistics

Research Areas | Approaches or Sub-problems (# of Papers) | Subtotal
Statistical Challenges | Distributed Optimization (56), Non-IID and Model Personalization (49), Vertical FL (8), Decentralized FL (3), Hierarchical FL (7), Neural Architecture Search (4), Transfer Learning (11), Semi-Supervised Learning (3), Meta Learning (3) | 144
Trustworthiness | Preserving Privacy (35), Adversarial Attack (43), Fairness (4), Incentive Mechanism (5) | 87
System Challenges | Communication Efficiency (27), Computation Efficiency (17), Wireless Communication and Cloud Computing (71), FL System Design (19) | 134
Models and Applications | Models (22), Natural Language Processing (15), Computer Vision (3), Health Care (27), Transportation (13), Other (21) | 101
Common | Benchmark and Dataset (20), Survey (7) | 27

From a comprehensive FL publication list: https://github.com/chaoyanghe/Awesome-Federated-Learning

A.2 Benchmark

A.2.1 Details of Supported Algorithms

Federated Averaging (FedAvg). FedAvg [308] is a standard federated learning algorithm that is normally used as a baseline for comparison with more advanced algorithms. We summarize the algorithm message flow in Figure 2.4(a). Each worker trains its local model for several epochs and then uploads its local model to the server. The server aggregates the uploaded client models into a global model by weighted coordinate-wise averaging (the weights are determined by the number of data points on each worker) and then synchronizes the global model back to all workers. In our FedML library, based on the worker-oriented programming interface, we can implement this algorithm in a distributed computing manner. We suggest that users start from FedAvg to learn how to use FedML.
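To make the weighted coordinate-wise averaging step concrete, below is a minimal server-side aggregation sketch in PyTorch; the function name aggregate and the (state_dict, sample-count) input format are illustrative assumptions rather than FedML's actual API.

    import copy
    import torch

    def aggregate(client_models, client_sample_nums):
        """Minimal FedAvg aggregation sketch (illustrative, not FedML's API).

        client_models: list of state_dicts uploaded by the workers.
        client_sample_nums: number of local data points held by each worker.
        Returns a state_dict obtained by weighted coordinate-wise averaging.
        """
        total = float(sum(client_sample_nums))
        global_model = copy.deepcopy(client_models[0])
        for key in global_model.keys():
            weighted_sum = torch.zeros_like(global_model[key], dtype=torch.float32)
            for state_dict, n_k in zip(client_models, client_sample_nums):
                # Each client's tensor is weighted by its share of the total data.
                weighted_sum += state_dict[key].float() * (n_k / total)
            global_model[key] = weighted_sum
        return global_model

A weighted (rather than uniform) average is used so that workers holding more data contribute proportionally more to the global model.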
Decentralized FL. We use [148], a central-server-free FL algorithm, to demonstrate how FedML supports a decentralized topology with directed communication. As Figure 2.4(b) shows, such an algorithm uses a decentralized topology; more specifically, some workers do not send messages (models) to all of their neighbors. The worker-oriented programming interface can easily meet this requirement since it allows users to define any behavior for each worker.

Vertical Federated Learning (VFL). VFL, or feature-partitioned FL [512], is applicable to the cases where all participating parties share the same sample space but differ in the feature space. As illustrated in Figure 2.4(c), VFL is the process of aggregating different features and computing the training loss and gradients in a privacy-preserving manner to build a model with data from all parties collaboratively [138, 72, 282, 284]. The FedML library currently supports the logistic regression model with customizable local feature extractors in the vertical FL setting, and it provides the NUS-WIDE [74] and Lending Club Loan [205] datasets for the experiments.

Split Learning. Split learning is a computing- and memory-efficient variant of FL introduced in [129, 458], where the model is split at a layer, and the parts of the model preceding and succeeding this layer are shared across the worker and server, respectively. Only the activations and gradients from a single layer are communicated in split learning, whereas the weights of the entire model are communicated in federated learning. Split learning achieves better communication efficiency under several settings, as shown in [411]. Applications of this model to wireless edge devices are described in [219, 341]. Split learning also enables matching client-side model components with the best server-side model components for automating model selection, as shown in the work on ExpertMatcher [403].

Federated Neural Architecture Search (FedNAS). FedNAS [142] is a federated neural architecture search algorithm [159] that enables scattered clients to collaboratively search for a neural architecture. FedNAS differs from other FL algorithms in that it exchanges information beyond gradients, even though it has a centralized topology similar to FedAvg.

A.2.2 Details of Datasets

Federated EMNIST: EMNIST [75] consists of images of digits and upper and lower case English characters, with 62 total classes. The federated version of EMNIST [52] partitions the digits by their author. The dataset has natural heterogeneity stemming from the writing style of each person.

CIFAR-100: Google introduced a federated version of CIFAR-100 [223] by randomly partitioning the training data among 500 clients, with each client receiving 100 examples [367]. The partition method is the Pachinko Allocation Method (PAM) [254].

Shakespeare: [308] first introduced this dataset to the FL community. It is a dataset built from The Complete Works of William Shakespeare. Each speaking role in each play is considered a different device.

StackOverflow [15]: The Google TensorFlow Federated (TFF) team maintains this federated dataset, which is derived from the Stack Overflow Data hosted by kaggle.com. We integrate this dataset into our benchmark.

CIFAR-10 and CIFAR-100: CIFAR-10 and CIFAR-100 [223] both consist of 32 × 32 color images. CIFAR-10 has 10 classes, while CIFAR-100 has 100 classes. Following [529] and [465], we use latent Dirichlet allocation (LDA) to partition the dataset according to the number of workers involved in training in each round.
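As a concrete illustration of the LDA-based partition mentioned above, the following is a minimal sketch (not the exact FedML implementation) that draws a Dirichlet distribution over clients for every class and splits the sample indices accordingly; smaller values of the concentration parameter alpha produce more skewed, more non-IID partitions.

    import numpy as np

    def lda_partition(labels, num_clients, alpha=0.5, seed=0):
        """Minimal sketch of a Dirichlet (LDA-style) non-IID partition.

        labels: 1-D numpy array with the class label of every training sample.
        Returns a dict mapping client_id -> list of sample indices.
        """
        rng = np.random.default_rng(seed)
        num_classes = int(labels.max()) + 1
        client_indices = {cid: [] for cid in range(num_clients)}
        for c in range(num_classes):
            idx_c = np.where(labels == c)[0]
            rng.shuffle(idx_c)
            # Sample the proportion of class-c data assigned to each client.
            proportions = rng.dirichlet(alpha * np.ones(num_clients))
            split_points = (np.cumsum(proportions) * len(idx_c)).astype(int)[:-1]
            for cid, shard in enumerate(np.split(idx_c, split_points)):
                client_indices[cid].extend(shard.tolist())
        return client_indices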
CINIC-10: CINIC-10 [81] has 4.5 times as many images as CIFAR-10. It is constructed from two different sources: ImageNet and CIFAR-10. It is not guaranteed that the constituent elements are drawn from the same distribution. This characteristic fits federated learning well because we can evaluate how well models cope with samples drawn from similar but not identical distributions.

A.2.3 Lack of Fair Comparison: Diverse Non-I.I.D. Datasets and Models

Table A.2 summarizes the diverse datasets, partition methods, and models used in recent publications, which makes a fair comparison across papers difficult.

A.3 IoT Devices

Currently, we support two IoT devices: Raspberry Pi 4 (edge CPU computing) and NVIDIA Jetson Nano (edge GPU computing).

A.3.1 Raspberry Pi 4 (Edge CPU Computing - ARMv7l)

The Raspberry Pi 4 Desktop Kit is supplied with:
• Raspberry Pi 4 Model B (2GB, 4GB or 8GB version)
• Raspberry Pi Keyboard and Mouse
• 2 × micro HDMI to Standard HDMI (A/M) 1m Cables
• Raspberry Pi 15.3W USB-C Power Supply
• 16GB NOOBS with Raspberry Pi OS microSD card

For more details, please check this link: https://www.raspberrypi.org/products/raspberry-pi-4-desktop-kit.

A.3.2 NVIDIA Jetson Nano (Edge GPU Computing)

The NVIDIA Jetson Nano Developer Kit is a small, powerful computer that lets you run multiple neural networks in parallel for applications like image classification, object detection, segmentation, and speech processing, all in an easy-to-use platform that runs in as little as 5 watts. For more details, please check this link: https://developer.nvidia.com/embedded/jetson-nano-developer-kit.

Table A.2: Various datasets and models used in the latest publications from the machine learning community

Conference | Paper Title | Dataset | Partition Method | Model | Worker/Device Number
ICML 2019 | Analyzing Federated Learning through an Adversarial Lens [27] | Fashion-MNIST | natural non-IID | 3-layer CNN | 10
 | | UCI Adult Census dataset | - | fully connected neural network | 10
ICML 2019 | Agnostic Federated Learning [324] | UCI Adult Census dataset | - | logistic regression | 10
 | | Fashion-MNIST | - | logistic regression | 10
 | | Cornell movie dataset | - | two-layer LSTM model | 10
 | | Penn TreeBank (PTB) dataset | - | two-layer LSTM model | 10
ICML 2019 | Bayesian Nonparametric Federated Learning of Neural Networks [529] | MNIST | Dir(0.5) | 1-hidden-layer neural network | 10
 | | CIFAR-10 | Dir(0.5) | 1-hidden-layer neural network | 10
ICML 2020 | Adaptive Federated Optimization [367] | CIFAR-100 | Pachinko Allocation Method | ResNet-18 | 10
 | | FEMNIST | natural non-IID | CNN (2x conv) | 10
 | | FEMNIST | natural non-IID | Auto Encoder | 10
 | | Shakespeare | natural non-IID | RNN | 10
 | | StackOverflow | natural non-IID | logistic regression | 10
 | | StackOverflow | natural non-IID | 1 RNN LSTM | 10
ICML 2020 | FetchSGD: Communication-Efficient Federated Learning with Sketching [385] | CIFAR-10/100 | 1 class / 1 client | ResNet-9 | -
 | | FEMNIST | natural non-IID | ResNet-101 | -
 | | PersonaChat | natural non-IID | GPT2-small | -
ICML 2020 | Federated Learning with Only Positive Labels [523] | CIFAR-10 | 1 class / client | ResNet-8/32 | -
 | | CIFAR-100 | 1 class / client | ResNet-56 | -
 | | AmazonCAT | 1 class / client | Fully Connected Nets | -
 | | WikiLSHTC | 1 class / client | - | -
 | | Amazon670K | 1 class / client | - | -
ICML 2020 | SCAFFOLD: Stochastic Controlled Averaging for Federated Learning [210] | EMNIST | 1 class / 1 client | fully connected network | -
ICML 2020 | From Local SGD to Local Fixed-Point Methods for Federated Learning [300] | a9a (LIBSVM) | - | logistic regression | -
 | | a9a (LIBSVM) | - | logistic regression | -
ICML 2020 | Acceleration for Compressed Gradient Descent in Distributed and Federated Optimization [260] | a5a | - | logistic regression | -
 | | mushrooms | - | logistic regression | -
 | | a9a | - | logistic regression | -
 | | w6a (LIBSVM) | - | logistic regression | -
ICLR 2020 | Federated Learning with Matched Averaging [465] | CIFAR-10 | - | VGG-9 | 16
 | | Shakespeare | sampling 66 clients | 1-layer LSTM | 66
ICLR 2020 | Fair Resource Allocation in Federated Learning [251] | Synthetic dataset (for LR) | natural non-IID | multinomial logistic regression | 10
 | | Vehicle | natural non-IID | SVM for binary classification | 10
 | | Shakespeare | natural non-IID | RNN | 10
 | | Sent140 | natural non-IID | RNN | 10
ICLR 2020 | On the Convergence of FedAvg on Non-IID Data [257] | MNIST | natural non-IID | logistic regression | 10
 | | Synthetic dataset (for LR) | natural non-IID | logistic regression | 10
ICLR 2020 | DBA: Distributed Backdoor Attacks against Federated Learning [488] | Lending Club Loan Data | - | 3 FC | 10
 | | MNIST | - | 2 conv and 2 FC | 10
 | | CIFAR-10 | - | lightweight ResNet-18 | 10
 | | Tiny-ImageNet | - | ResNet-18 | 10
MLSys 2020 | Federated Optimization in Heterogeneous Networks [390] | MNIST | natural non-IID | multinomial logistic regression | 10
 | | FEMNIST | natural non-IID | multinomial logistic regression | 10
 | | Shakespeare | natural non-IID | RNN | 10
 | | Sent140 | natural non-IID | RNN | 10
*Note: we will update this list once new publications are released.

Chapter B
Supplement to Chapter 3 - PipeTransformer

This appendix provides background and preliminaries, more details of the four components, additional experimental details and results, and discussions. The organization is as follows:
Background and Preliminaries. Appendix B.1 provides the introduction to Transformer models, freeze training, pipeline parallelism, data parallelism, and the hybrid of pipeline parallelism and data parallelism. This section serves as the required knowledge to understand PipeTransformer.

More Details of Freeze Algorithm, AutoPipe, AutoDP, and AutoCache. Appendix B.2 explains more details of the design motivation for the freeze training algorithm and shows details of the derivation; Appendix B.3 provides more analysis to understand the design choices of AutoPipe; Appendix B.4 contains more details of AutoDP, including dataset redistribution and a comparison with another way to skip frozen parameters; Appendix B.5 introduces additional details for AutoCache.

More Experimental Results and Details. In Appendix B.6, we provide hyper-parameters and more experimental results. In particular, we provide more details of the speedup breakdown in B.6.2.

Discussion. In Appendix 3.6, we will discuss pre-training vs. fine-tuning, designing better freeze algorithms, and the versatility of our approach.

B.1 Background and Preliminaries

B.1.1 Transformer Models: ViT and BERT

Figure B.1: Evolution of Transformer Models.

Transformer. The Transformer model originates from the Natural Language Processing (NLP) community. It replaces the recurrent neural network (RNN) with a self-attention mechanism, which relates different positions of a single sequence in order to compute a representation of the sequence. The Transformer model has an encoder-decoder structure, which is a classical structure for sequence modeling. The encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n). Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time. As shown in Figure B.2, the Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder (left) and decoder (right). To better understand this architecture, we refer readers to the tutorial "The Annotated Transformer" (http://nlp.seas.harvard.edu/2018/04/03/attention.html).

Figure B.2: Transformer Model Architecture [456]

BERT. BERT [85], which stands for Bidirectional Encoder Representations from Transformers, simply stacks multiple Transformer encoders (also called Transformer layers; Figure B.2, left). BERT-Base has 12 Transformer layers, and its total number of parameters is 110M. BERT-Large has 24 Transformer layers, and its total number of parameters is 340M. BERT is pre-trained using unsupervised tasks (masked language modeling and next sentence prediction) and then fine-tuned for various NLP tasks such as text classification and question answering.

Vision Transformer (ViT). ViT [92] attains excellent results compared to state-of-the-art convolutional networks. Its architecture is shown in Figure B.3. It splits an image into fixed-size patches, linearly embeds each of them, adds position embeddings, and feeds the resulting sequence of vectors to a Transformer encoder. Similar to BERT, the Transformer encoder is repeated for multiple layers.

Figure B.3: Vision Transformer [92]

Model Architecture Comparison. Note that ViT and BERT place layer normalization in different locations of the Transformer encoder. To understand the differences between these two architectures, please refer to the analysis in [492].

Figure B.4: Comparison of Transformer in BERT and ViT
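To make the layer-normalization difference concrete, below is a minimal PyTorch sketch (an illustrative simplification, not PipeTransformer's partition code) of one Transformer encoder block in the post-LN style used by BERT and the pre-LN style used by ViT; the class name EncoderBlock and the default dimensions are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        """One encoder block; pre_norm=True mimics ViT, False mimics BERT."""

        def __init__(self, dim=768, heads=12, mlp_dim=3072, pre_norm=True):
            super().__init__()
            self.pre_norm = pre_norm
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(),
                                     nn.Linear(mlp_dim, dim))
            self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

        def forward(self, x):
            if self.pre_norm:
                # ViT-style: LayerNorm is applied before attention and MLP.
                h = self.norm1(x)
                x = x + self.attn(h, h, h, need_weights=False)[0]
                x = x + self.mlp(self.norm2(x))
            else:
                # BERT-style: LayerNorm is applied after the residual addition.
                x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
                x = self.norm2(x + self.mlp(x))
            return x

    # Example: a batch of 2 sequences of length 196 with hidden size 768.
    tokens = torch.randn(2, 196, 768)
    out = EncoderBlock(pre_norm=True)(tokens)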
Due to this slight difference, our PipeTransformer source code implements the model partition of these two architectures separately.

B.1.2 Freeze Training

The concept of freeze training was first proposed by [360], which provides a posterior algorithm, named SVCCA (Singular Vector Canonical Correlation Analysis), to compare two representations. SVCCA can compare the representation at a layer at different points during training to its final representation, and it finds that lower layers tend to converge faster than higher layers. This means that not all layers need to be trained throughout the entire training process. We can save computation and prevent overfitting by consecutively freezing layers. However, SVCCA has to take the entire dataset as its input, which does not fit an on-the-fly analysis. This drawback motivates us to design an adaptive, on-the-fly freeze algorithm.

B.1.3 Pipeline Parallelism

Figure B.5: GPipe [181]

In PipeTransformer, we reuse GPipe as the baseline. GPipe is a pipeline parallelism library that can divide different sub-sequences of layers across separate accelerators, which provides the flexibility to scale a variety of different networks to gigantic sizes efficiently. The key design in GPipe is that it splits the mini-batch into M micro-batches, which trains faster than naive model parallelism (shown in Figure B.5(b)). However, as illustrated in Figure B.5(c), micro-batches still cannot thoroughly avoid bubble overhead (some idle time per accelerator). GPipe empirically demonstrates that the bubble overhead is negligible when M ≥ 4 × K. Different from GPipe, PipeTransformer has elastic pipeline parallelism, in which K and the number of pipelines are dynamic during training.

Figure B.6: PyTorch DDP Bucket-based AllReduce

B.1.4 Data Parallelism

In PyTorch DDP [246], to improve communication efficiency, gradients are organized into buckets, and AllReduce is operated on one bucket at a time. The mapping from parameter gradients to buckets is determined at construction time, based on the bucket size limit and parameter sizes. Model parameters are allocated into buckets in (roughly) the reverse order of Model.parameters() from the given model. The reverse order is used because DDP expects gradients to become ready during the backward pass in approximately that order. Figure B.6 shows an example: grad0 and grad1 are in bucket1, and the other two gradients are in bucket0. With this bucket design, DDP can overlap part of the communication time with the computation time of backward propagation.
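As a minimal illustration of this bucketing behavior (a sketch assuming a standard single-node setup where the process group has already been initialized, e.g. by torchrun; the model and sizes are placeholders), the bucket size limit can be controlled through DDP's bucket_cap_mb argument, and gradient synchronization is launched bucket by bucket during loss.backward():

    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP

    def build_ddp_model(local_rank: int) -> DDP:
        # Assumes init_process_group("nccl") has already been called (e.g. via torchrun).
        model = torch.nn.Sequential(
            torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 10)
        ).to(local_rank)
        # bucket_cap_mb sets the bucket size limit (in MB); gradients are AllReduced
        # bucket by bucket as they become ready during the backward pass.
        return DDP(model, device_ids=[local_rank], bucket_cap_mb=25)

    def train_step(ddp_model, batch, labels):
        loss = torch.nn.functional.cross_entropy(ddp_model(batch), labels)
        loss.backward()  # per-bucket AllReduce overlaps with the remaining backward compute
        return loss

Smaller buckets allow AllReduce to launch earlier but incur more communication calls, which is exactly the trade-off the bucket size limit controls.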
F 1,0 F 1,1 F 3,0 F 3,1 B 3,0 B 3,1 B 1,0 B 1,1 GPU 1 GPU 3 U 3 U 1 AR AR F 0,0 F 0,1 F 2,0 F 2,1 B 2,0 B 2,1 B 0,0 B 0,1 GPU 0 GPU 2 U 2 U 0 AR AR Pipeline 0 Pipeline 1 F 1,2 F 3,2 F 0,2 F 2,1 B 3,2 B 1,12 B 2,2 B 0,2 DDP Rank 0 DDP Rank 1 Figure B.7: Illustration for Hybrid of Pipeline-parallel and Data-parallel In this example, to simplify the figure, we assume that the bucket size is large enough to fit all gradients on a single device. That is to say, DDP uses one bucket per device, resulting in two AllReduce operations. Note that, since AllReduce can start as soon as gradients in corresponding buckets become ready. In this example, DDP launches AllReduce on GPU 1 and 3 immediately after B 3,1 and B 1,1 , without waiting for the rest of backward computation. Lastly, the optimizer updates the model weights. B.2 More Details of Freeze Algorithm Explanation of Equation 3.1. In numerical optimization, the weight with the smallest gradient norm converges first. With this assumption, we use the gradient norm as the indicator to identify which layers can be frozen on the fly. To verify this idea, we save the gradient norm for all layers at different iterations (i.e., epoch). With this analysis, we found that in the later phase of training, the pattern of gradient norm in different layers matches the assumption, but in the early phase, the pattern is random. Sometimes, we can even see that the gradient norm of those layers close to the output is the smallest. Figure B.8 shows 272 layer index gradient norm upper bound of frozen layer number the layer which has the lowest gradient now Figure B.8: An example that the smallest gradient is not close to the input layer. such an example. If we freeze all layers preceding the blue dash line layer, the freezing is too aggressive since some layers have not converged yet. This motivates us further amend this naive gradient norm indicator. To avoid the randomness of gradient norm at the early phase of training, we use a tunable bound to limit the maximum number of frozen layers. We do not freeze all layers preceding the layer with the smallest gradient norm for the case in the figure. Instead, we freeze layers preceding the bound (the red color dash line). Deviation. The term L (T− 1) frozen +α (L− L (T− 1) frozen ) in Equation 3.1 can be written as: L (T) frozen =(1− α ) T [ αL 1− α + T X t=2 αL (1− α ) t ] (B.2.1) The deviation is as follows: L (1) frozen =αL (B.2.2) L (2) frozen =(L− L (1) frozen )α +L (1) frozen (B.2.3) L (T) frozen =(L− L (T− 1) frozen )α +L (T− 1) frozen (B.2.4) 273 L (T) frozen =αL +(1− α )L (T− 1) frozen (B.2.5) L (T) frozen (1− α ) T = αL (1− α ) T + L (T− 1) frozen (1− α ) (T− 1) (B.2.6) L (T) frozen (1− α ) T = αL (1− α ) + T X t=2 αL (1− α ) t (B.2.7) (B.2.8) B.3 More Details of AutoPipe Balanced Partition: Trade-off between Communication and Computational Cost. Let us compute the communication cost in Figure 3.5. The intermediate tensor from partition k− 2 needs two cross-GPU communications to arrive to partition k. The parameter number of this intermediate tensor depends on the batch size and the Transformer model architecture. In BERT base , the intermediate tensor width and height is the hidden feature size and sequence length, respectively (i.e., 1024, 512). If we use a batch size 300 in a pipeline, the total parameter number is 1024× 512× 300. If we store it using float32, the memory cost is 0.63 GB. The GPU-to-GPU communication bandwidth is 15.754 GB (PCI 3.0, 16 lanes). Then one cross-GPU communication costs 40 ms. 
In practice, the time cost will be higher than this value. Therefore, two cross-GPU communications cost around 100 ms. To compare with the computation cost, we quantify the time cost for the forward propagation of a Transformer layer (12 million parameters), the time cost is around 35 ms, meaning that the communication cost for skip connection is far more than a specific layer’s computation cost. Compared to a slightly unbalanced partition in parameter number wise, 100 ms is non-trivial. If we do not break the skip connection, the parameter number gap between different partitions is far less than 12 million (e.g., 4M or even less than 1 M). Therefore, this analysis explains partitioning without breaking the skip connection is a reasonable design choice. We also find that when the GPU device number in a machine is fixed (e.g., 8), the larger the model size is, the smaller the partition gap, which further indicates that our design’s rationality. 274 Understanding Bubble in Pipeline. In the main text, Figure 3.6 depicts an example of running 4 micro-batches through a 4-device pipeline. Time flows from left to right, and each row denotes workload on one GPU device. F and B squares with the same color represent the forward and the backward pass time blocks of the same micro-batch. U represents the time block for updating parameters. Empty time blocks are bubbles. Assume that the load of the pipeline is evenly distributed amongst all devices. Consequently, all the time blocks during the forward pass are roughly in the same size, and similarly for backward time blocks. Note that the sizes of the forward time blocks can still differ from the backward ones. Based on these assumptions, we can estimate the per-iteration bubble size by simply counting the number of empty blocks during the forward and backward passes, respectively. In both the forward and backward pass, each device idles for (K− 1) time blocks. Therefore, the total bubble size is (K− 1) times per micro-batch forward and backward delay, which clearly decreases with fewer pipeline devices. Relationship Between Number of Micro-batches per Mini-batch (M) and DDP. To understand the reason why M and DDP have mutual impacts, a thorough understanding of Section B.1.5 is needed first. In essence, DDP and pipelining has opposite requirement for M: DDP requires a relatively larger chunk of the bucket (smaller M) to overlap the communication (introduced in Section B.1.4), while pipelining requires a larger M to avoid bubble overhead (introduced in Section B.1.3). To further clarify, we must first remember that DDP must wait for the last micro-batch to finish its backward computation on a parameter before launching its gradient synchronization, then imagine two extreme cases. One case is that M =1, meaning the communication can be fully overlapped with computation using buckets. However, setting M =1 leads to a performance downgrade of pipelining (overhead of bubbles). Another extreme case is a very large M, then the communication time (labeled as green “AR” in Figure B.1.5) may be higher than the computation time for a micro-batch (note that the width of a block in Figure B.1.5 represents the wall clock time). With these two extreme cases, we can see that there must be an optimal value of M in a dynamical 275 environment (K and parameter number of active layers) of PipeTransformer, indicating that it is sub-optimal to fix M during training. This explains the need for a dynamic M for elastic pipelining. 
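The numbers in this section are easy to reproduce with a short back-of-envelope script. The snippet below is an illustrative calculation of ours, not part of the PipeTransformer code; it recomputes the skip-connection transfer cost from the values above and uses a common closed-form estimate of the bubble fraction, (K−1)/(M+K−1), which is consistent with the counting argument in this section.

# cross-GPU cost of one intermediate BERT-base tensor (Section B.3)
hidden, seq_len, batch, bytes_fp32 = 1024, 512, 300, 4
tensor_gb = hidden * seq_len * batch * bytes_fp32 / 1e9        # ~0.63 GB
bandwidth_gb_s = 15.754                                        # PCIe 3.0, 16 lanes
one_hop_ms = tensor_gb / bandwidth_gb_s * 1000                 # ~40 ms per hop
print(f"two-hop skip-connection cost: {2 * one_hop_ms:.0f} ms")

def bubble_fraction(K, M):
    # each device idles for (K - 1) time blocks out of roughly (M + K - 1)
    # blocks per forward (and per backward) pass
    return (K - 1) / (M + K - 1)

for K in (8, 4, 2, 1):
    M = 4 * K  # GPipe's rule of thumb: bubbles are negligible when M >= 4K
    print(f"K={K}, M={M}, bubble fraction ~ {bubble_fraction(K, M):.1%}")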
B.4 More details of AutoDP B.4.1 Data Redistributing In standard data parallel-based distributed training, PyTorch uses DistributedSampler to make sure each worker in DP only load a subset of the original dataset that is exclusive to each other. The example code is as follows: self.train_sampler = DistributedSampler(self.train_dataset, num_replicas=num_replicas, rank=local_rank) Compared to this standard strategy, we made the following optimizations: 1. dynamic partition: the number of DP workers is increased when new pipelines have participated in DP. In order to guarantee that the data partition is evenly assigned after adding new pipes, the training dataset is repartitioned by rebuilding the DistributedSampler and setting new num_replicas and rank as arguments. 2. to reuse the computation of FP for frozen layers, we cached the hidden states in host memory and disk memory as well. Since the training requires to shuffle each epoch, the cache order of hidden features with respect to the order of original samples is different across different epochs. In order to identify which data point a hidden feature belongs to, we build a sample unique ID by returning index in the get_item() function of Dataset class. With this unique ID, we can find a sample’s hidden feature with O(1) time complexity during training. 3. when data is shuffled in each epoch, a data sample trained in the previous epoch may be moved to another machine for training in the current epoch. This makes the cache not reused across epochs. To address this issue, we fix a subset of entire samples in a machine and only do shuffle for this subset. This guarantees the shuffle during epochs is only executed 276 inside a machine, thus the hidden feature’s cache can be reused deterministically. To achieve this, rather than maintaining a global rank for DistributedSampler, we introduce node_rank and local_rank. node_rank is used to identify which subset of samples a machine needs to hold. local_rank is used by DistributedSampler to identify which part of the shuffle subset that a worker inside a machine should train. Note that this does not hurt the algorithmic convergence property. Shuffling for multiple subsets obtains more randomness than randomness obtained by a global shuffle, which further increases the robustness of training. The only difference is that some parallel processes in distributed training are fixed in part of the shuffled datasets. If a training task does not need to shuffle the dataset across epochs, the above-mentioned optimization will not be activated. B.4.2 Skip Frozen Parameters in AutoDP To reduce communication cost, another method is to use PyTorch DDP API 2 . However, this API is temporally designed for Facebook-internal usage, and we must carefully calculate and synchronize the information regarding which parameters should be skipped, making our system unstable and difficult to be debugged. Our design avoids this issue and simplifies the system design. Since AutoPipe storesF frozen andF pipe separately (introduced in Section 3.3.2), we can naturally skip the frozen parameters because AutoDP only needs to initialize the data parallel worker withF pipe . B.5 More Details of AutoCache AutoCache supports hierarchical caching. Figure B.9 shows our design. We maintain a sliding window to represent the maximum memory that the CPU host memory can hold, then move the window to prefetch the caching that the training requires and delete the caching that is consumed from the CPU host memory. 
In our implementation, we define the window size as 2 See the internal API defined by PyTorch DDP: https://github.com/pytorch/pytorch/blob/master /torch/nn/parallel/distributed.py, _set_params_and_buffers_to_ignore_for_model(). 277 Figure B.9: Hierarchical Caching the maximum batch number that the CPU host memory can hold. To avoid frequent memory exchange between disk storage and CPU host memory, we also define the block size that every time we prefetch (as the grey and green blocks are shown in the figure). In general, this hierarchical caching is useful when the training dataset is too large and exceeds the CPU host memory limit. However, we have to point out that this complex caching may not always be the optimal choice in the training system since the caching exchange itself may cost time. To this end, we suggest users of PipeTransformer using a relatively larger CPU host memory, which avoids activating the hierarchical caching and obtains faster training. B.6 More Experimental Results and Details B.6.1 Hyper-Parameters Used in Experiments In Table B.1, we follow the same hyper-parameters used in the original ViT and BERT paper. Note that for ViT model, we use image size 224 for fine-tuning training. B.6.2 More Details of Speedup Breakdown Understanding the speed downgrade of freeze only. As shown in Figure 3.9, the Freeze Only strategy is about 5% slower than the No Freeze baseline. After the performance analysis, we found it is because Freeze Only changes memory usage pattern and introduced additional 278 Table B.1: Hyperparameters used in Experiments Dataset Model Hyperparameters Comments SQuAD BERT batch size 64 max sequence length 512 learning rate {1e-5, 2e-5, 3e-5, 4e-5, 5e-5} epochs 3 gradient accumulation steps 1 ImageNet ViT batch size 400 image size 224 learning rate {0.1, 0.3, 0.01, 0.03} weighs decay 0.3 decay type cosine warmup steps 2 epochs 10 CIFAR-100 ViT batch size 320 image size 224 learning rate {0.1, 0.3, 0.01, 0.03} weighs decay 0.3 decay type cosine warmup steps 2 epochs 10 overhead in PyTorch’s CUDACachingAllocator 3 . More specifically, to reduce the number of expensive CUDA memory allocation operations, PyTorch maintains a CUDACachingAllocator that caches CUDA memory blocks to speed up future reuses. Without freezing, the memory usage pattern in every iteration stays consistent, and hence the cached memory blocks can 3 To understand the design of this API, please refer to Section 5.3 in the original PyTorch paper [343]. The source code is at https://github.com/pytorch/pytorch/blob/master/c10/cuda/CUDACachingAllo cator.h 279 be perfectly reused. After introducing layer freezing, although it helps to reduce memory footprint, on the other hand, it might also change the memory usage pattern, forcing CUDACachingAllocator to split blocks or launch new memory allocations, which slightly slows down the training. In essence, this underlying mechanism of PyTorch is not tailored for freeze training. Customizing it for freeze training requires additional engineering efforts. B.6.3 Tuning α for ViT on ImageNet Figure B.10: Tuning α for ViT on ImageNet B.6.4 TheMethodThatCanAccuratelyMeasuretheCommunication Cost Since PyTorch DDP overlaps communication with computation, the time difference between a local training iteration and a distributed training iteration does not faithfully represent the communication delay. 
Moreover, as DDP also organizes parameters into buckets and launches an AllReduce for each bucket, recording the start and finish time of overall communications is also insufficient. To correctly measure DDP communication delay, we combined the DDP communication hook with CUDAFuture callback. We developed a communication hook function that records a start CUDA event immediately before launching AllReduce. Then, in the CUDAFuture returned by the AllReduce function, we install a callback that records a finish CUDA event immediately after the non-blocking CUDAFuture completes. The difference between these 280 two CUDA events represents the AllReduce communication delay of one bucket. We collected the events for all buckets and removed time gaps between buckets if there were any. The remaining duration in that time range accurately represents the overall DDP communication delay. B.6.5 Overheads of Pipe Transformation Table B.2: Overheads of pipe transformation (seconds) Pipeline Transformation Overall Time Cost Dissect C P D initialization (length = 8) 18.2 16.6 0.7 0.9 length is compressed from 8 to 4 10.2 8.3 1.3 0.6 length is compressed from 4 to 2 5.5 3.8 2.1 0.7 length is compressed from 2 to 1 9.5 2.3 6.1 1.0 *C - creating CUDA context; P - Pipeline Warmup; D - DDP. We have verified the time cost of pipeline transformation. The result in Table B.2 shows that the overall cost of pipeline transformation is very small (less than 1 minute), compared to the overall training time. Therefore, we do not consider further optimization. 281 Chapter C Supplement to Chapter 4 - FedGKT C.1 Datasets C.1.1 A Summary of Dataset Used in Experiments CIFAR-10 [223] consists of 60000 32× 32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. CIFAR-100 [223] has the same amount of samples as CIFAR-10, but it is a more challenging dataset since it has 100 classes containing 600 images each. CINIC-10 [81] has a total of 270,000 images, 4.5 times that of CIFAR-10. It is constructed from two different sources: ImageNet and CIFAR-10. It is not guaranteed that the constituent elements are drawn from the same distribution. This characteristic fits for federated learning because we can evaluate how well models cope with samples drawn from similar but not identical distributions. CINIC-10 has three sub-datasets: training, validation, and testing. We train on the training dataset and test on the testing, without using the validation dataset for all experiments. Our source code provides the link to download these three datasets. For the non-I.I.D. dataset, the partition is unbalanced: sampling p c ∼ Dir J (0.5) and allocating a p c,k proportion of the training samples of class c to local client k. 282 C.2 HeterogeneousDistribution(non-I.I.D.)inEachClient We fix the non-I.I.D. distribution to fairly compare different methods. Table C.1 is a specific distribution used in the experiments. We also conduct experiments in other non-I.I.D. distributions and observe that our FedGKT method also outperforms baselines. To generate the different distribution, we can change the random seed in main.py of our source code. 
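For reference, the Dirichlet-based partition described above can be sketched as follows. This is a minimal NumPy illustration written by us (the function name and the rounding of fractional counts are our choices); the experiments use the partition produced by the released source code and its random seed.

import numpy as np

def dirichlet_partition(labels, num_clients, alpha=0.5, seed=0):
    # labels: 1-D array of class ids; returns one index array per client
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx_c = rng.permutation(np.where(labels == c)[0])
        p = rng.dirichlet(alpha * np.ones(num_clients))   # p_c ~ Dir_J(alpha)
        # allocate a p_{c,k} fraction of the class-c samples to client k
        cuts = (np.cumsum(p) * len(idx_c)).astype(int)[:-1]
        for k, part in enumerate(np.split(idx_c, cuts)):
            client_idx[k].extend(part.tolist())
    return [np.array(sorted(idx)) for idx in client_idx]

# example: client_indices = dirichlet_partition(train_labels, num_clients=16, alpha=0.5)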
Client ID Numbers of Samples in the Classes Distribution c 0 c 1 c 2 c 3 c 4 c 5 c 6 c 7 c 8 c 9 k=0 144 94 1561 133 1099 1466 0 0 0 0 k=1 327 28 264 16 354 2 100 20 200 3 k=2 6 6 641 1 255 4 1 2 106 1723 k=3 176 792 100 28 76 508 991 416 215 0 k=4 84 1926 1 408 133 24 771 0 0 0 k=5 41 46 377 541 7 235 54 1687 666 0 k=6 134 181 505 720 123 210 44 58 663 221 k=7 87 2 131 1325 1117 704 0 0 0 0 k=8 178 101 5 32 1553 10 163 9 437 131 k=9 94 125 0 147 287 100 23 217 608 279 k=10 379 649 106 90 35 119 807 819 3 85 k=11 1306 55 681 227 202 34 0 648 0 0 k=12 1045 13 53 6 77 70 482 7 761 494 k=13 731 883 15 161 387 552 4 1051 0 0 k=14 4 97 467 899 0 407 50 64 1098 797 k=15 264 2 93 266 412 142 806 2 243 1267 Table C.1: The actual heterogeneous data distribution (non-I.I.D.) generated from CIFAR-10 C.3 Extra Experimental Results and Details C.3.1 Computational Efficiency on CIFAR-10 and CINIC-10 ResNet-8 ResNet-56 ResNet-110 0.6 5.4 10.2 petaFLOPs 11 591 1,150 #Params (K) 30 488 950 CPU (ms) Figure C.1: Edge Computational Efficiency (CIFAR- 100) ResNet-8 ResNet-56 ResNet-110 1.2 10.8 20.4 petaFLOPs 11 591 1,150 #Params (K) 30 488 950 CPU (ms) Figure C.2: Edge Computational Efficiency (CINIC- 10) 283 C.4 The Method of Communication Cost Calculation For split learning (SL), the method to calculate the communication cost is: Communication Cost of SL (C.4.1) =(the size of the hidden feature map+the size of the gradient in the split layer) × (number of samples in dataset)× (number of epochs) (C.4.2) For FedGKT, the method to calculate the communication cost is: Communication Cost of FedGKT =(the size of the hidden feature map+ the size of soft labels received from the server side)× (number of samples in dataset) × (number of communication rounds) (C.4.3) C.5 Details of Convolutional Neural Architecture on Edge and Server ResNet-8 is a compact CNN. Its head convolutional layer (including batch normalization and ReLU non-linear activation) is used as the feature extractor. The remaining two Bottlenecks (a classical component in ResNet, each containing 3 convolutional layers) and the last fully-connected layer are used as the classifier. 
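As a rough sketch of this split (the attribute names below are hypothetical placeholders for a ResNet-8-like implementation; the actual module definitions are in the released FedGKT source code), the edge-side feature extractor and classifier could be separated as:

import torch.nn as nn

class FedGKTEdgeModel(nn.Module):
    # wraps a ResNet-8-like backbone and exposes the FedGKT split:
    # the head convolutional block as feature extractor, the rest as classifier
    def __init__(self, resnet8):
        super().__init__()
        self.extractor = nn.Sequential(            # conv + BN + ReLU + max pool
            resnet8.conv1, resnet8.bn1, nn.ReLU(inplace=True), resnet8.maxpool)
        self.classifier = nn.Sequential(           # residual blocks + avg pool + FC
            resnet8.layer1, resnet8.avgpool, nn.Flatten(), resnet8.fc)

    def forward(self, x):
        h = self.extractor(x)         # hidden feature map exchanged with the server
        logits = self.classifier(h)   # local predictions used to produce soft labels
        return h, logits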
Table C.2: Detailed information of the ResNet-8 architecture used in our experiment Layer Parameter & Shape (cin, cout, kernal size) & hyper-parameters # conv1: 3× 16× 3× 3, stride:(1, 1); padding:(1, 1) × 1 maxpool: 3× 1 × 1 layer1 conv1: 16× 16× 3× 3, stride:(1, 1); padding:(1, 1) × 2 conv2: 16× 16× 3× 3, stride:(1, 1); padding:(1, 1) avgpool × 1 fc: 16× 10 × 1 284 Table C.3: Detailed information of the ResNet-55 architecture used in our experiment Layer Parameter & Shape (cin, cout, kernal size) & hyper-parameters # layer1 conv1: 16× 16× 1× 1, stride:(1, 1) × 1 conv2: 16× 16× 3× 3, stride:(1, 1); padding:(1, 1) conv3: 16× 64× 1× 1, stride:(1, 1) downsample.conv: 16× 64× 1× 1, stride:(1, 1) conv1: 64× 16× 1× 1, stride:(1,1) × 5 conv2: 16× 16× 3× 3, stride:(1, 1), padding:(1,1) conv3: 16× 64× 1× 1, stride:(1, 1) layer2 conv1: 64× 32× 1× 1, stride:(1, 1) × 1 conv2: 32× 32× 3× 3, stride:(2, 2); padding:(1, 1) conv3: 32× 128× 1× 1, stride:(1, 1) downsample.conv: 64× 128× 1× 1, stride:(2, 2) conv1: 128× 32× 1× 1, stride:(1, 1)] × 5 conv2: 32× 32× 3× 3, stride:(1, 1); padding:(1, 1) conv3: 32× 128× 1× 1, stride:(1, 1) layer3 conv1: 128× 64× 1× 1, stride:(1, 1) × 1 conv2: 64× 64× 3× 3, stride:(2, 2); padding:(1, 1) conv3: 64× 256× 1× 1, stride:(1, 1) downsample.conv: 128× 256× 1× 1, stride:(2, 2) conv1: 256× 64× 1× 1, stride:(1, 1) × 5 conv2: 64× 64× 3× 3, stride:(1, 1); padding:(1, 1) conv3: 64× 256× 1× 1, stride:(1, 1) avgpool × 1 fc: 256× 10 × 1 C.5.1 Hyperparameters In table C.5, C.6, and C.7, we summarize the hyperparameter settings for all experiments. If applying our FedGKT framework to a new CNN architecture with different datasets, we suggest tuning all hyper-parameters based on our hyperparameters. 285 Table C.4: Detailed information of the ResNet-109 architecture used in our experiment Layer Parameter & Shape (cin, cout, kernal size) & hyper-parameters # layer1 conv1: 16× 16× 1× 1, stride:(1, 1) × 1 conv2: 16× 16× 3× 3, stride:(1, 1); padding:(1, 1) conv3: 16× 64× 1× 1, stride:(1, 1) downsample.conv: 16× 64× 1× 1, stride:(1, 1) conv1: 64× 16× 1× 1, stride:(1,1) × 11 conv2: 16× 16× 3× 3, stride:(1, 1), padding:(1,1) conv3: 16× 64× 1× 1, stride:(1, 1) layer2 conv1: 64× 32× 1× 1, stride:(1, 1) × 1 conv2: 32× 32× 3× 3, stride:(2, 2); padding:(1, 1) conv3: 32× 128× 1× 1, stride:(1, 1) downsample.conv: 64× 128× 1× 1, stride:(2, 2) conv1: 128× 32× 1× 1, stride:(1, 1)] × 11 conv2: 32× 32× 3× 3, stride:(1, 1); padding:(1, 1) conv3: 32× 128× 1× 1, stride:(1, 1) layer3 conv1: 128× 64× 1× 1, stride:(1, 1) × 1 conv2: 64× 64× 3× 3, stride:(2, 2); padding:(1, 1) conv3: 64× 256× 1× 1, stride:(1, 1) downsample.conv: 128× 256× 1× 1, stride:(2, 2) conv1: 256× 64× 1× 1, stride:(1, 1) × 11 conv2: 64× 64× 3× 3, stride:(1, 1); padding:(1, 1) conv3: 64× 256× 1× 1, stride:(1, 1) avgpool × 1 fc: 256× 10 × 1 286 Table C.5: Hyperparameters used in Experiments on dataset CIFAR-10 Model Methods Hyperparameters CIFAR-10 I.I.D. non-I.I.D. ResNet-56/110 FedGKT (ours) optimizer Adam, lr=0.001, wd=0.0001 SGD, lr=0.005, momentum=0.9 batch size 256 256 edge epochs 1 1 server epochs 20 40 communication rounds 200 200 FedAvg optimizer Adam, lr=0.001, wd=0.0001 Adam, lr=0.001, wd=0.0001 batch size 64 64 local epochs 20 20 communication rounds 200 200 Centralized optimizer Adam, lr=0.003, wd=0.0001 batch size 256 epochs 300 Centralized (ResNet-8) optimizer Adam, lr=0.003, wd=0.0001 batch size 256 epochs 300 Table C.6: Hyperparameters used in Experiments on dataset CIFAR-100 Model Methods Hyperparameters CIFAR-100 I.I.D. 
non-I.I.D. ResNet-56/110 FedGKT (ours) optimizer Adam, lr=0.001, wd=0.0001 SGD, lr=0.005, momentum=0.9 batch size 256 256 edge epochs 1 1 server epochs 20 40 communication rounds 200 200 FedAvg optimizer Adam, lr=0.001, wd=0.0001 Adam, lr=0.001, wd=0.0001 batch size 64 64 local epochs 20 20 communication rounds 200 200 Centralized optimizer Adam, lr=0.003, wd=0.0001 batch size 256 epochs 300 Centralized (ResNet-8) optimizer Adam, lr=0.003, wd=0.0001 batch size 256 epochs 300 287 Table C.7: Hyperparameters used in Experiments on dataset CINIC-10 Model Methods Hyperparameters CINIC-10 I.I.D. non-I.I.D. ResNet-56/110 FedGKT (ours) optimizer Adam, lr=0.001, wd=0.0001 SGD, lr=0.005, momentum=0.9 batch size 256 256 edge epochs 1 1 server epochs 20 40 communication rounds 200 200 FedAvg optimizer Adam, lr=0.001, wd=0.0001 Adam, lr=0.001, wd=0.0001 batch size 64 64 local epochs 20 20 communication rounds 200 200 Centralized optimizer Adam, lr=0.003, wd=0.0001 batch size 256 epochs 300 Centralized (ResNet-8) optimizer Adam, lr=0.003, wd=0.0001 batch size 256 epochs 300 288 Chapter D Supplement to Chapter 5 - FedNAS D.1 Details of the Search Space Definition We adopt the following 7 operations in all our experiments: 3 × 3 and 5 × 5 separable convolutions, 3 × 3 and 5× 5 dilated separable convolutions, 3 × 3 max pooling, 3 × 3 average pooling, identity, and zero. The network is formed by stacking convolutional cells multiple times. Cell k takes the outputs of cell k− 2 and cell k− 1 as its input. Each cell contains seven nodes: two input nodes, one output node, and four intermediate nodes inside the cell. The input of the first intermediate node is set equal to two input nodes, and the other intermediate nodes take all previous intermediate nodes’ output as input. The output node concatenates all intermediate nodes’ output depth-wise. There are two types of cells: the normal cell and the reduction cell. The reduction cell is designed to reduce the spatial resolution of feature maps located at 1/3 and 2/3 of the total depth of the network. Architecture parameters determine the discrete operation value between two nodes. All normal cells and all reduction cells share the same architecture parametersα n andα r , respectively. By this definition, our method alternatively optimizes architecture parameters (α n , α r ) and model weight parameters w. Besides the search space, the other details of the system design can be found in our source code. 289 D.2 Details of the heterogeneous distribution on each client (non-IID) In this work, we performed experiments on CIFAR10 and gld23k datasets. For CIFAR10, we explored two types of non-IIDness, label-skewed and lda distribution. Table ?? shows the lda data distribution used in our experiment of global model search Via FedNAS. We can see that the sample number of each class in each worker is highly unbalanced. Some classes in a worker even have no samples, and some classes take up most of the proportion (highlighted in the table). For personalized experiments, we used two types of heterogeneity settings shown in Figures 5.4 and D.1. As it can be seen that the distribution setting of D.1 is challenging given not that the number of images per client varies but also the number of images belonging to a specific class. Besides CIFAR10, we also evaluated personalized experiments on gld23k dataset. 
Since gld23k dataset have 203 clients data and some client’s can have as low as 30 images and splitting it further in training and test dataset would make it insufficient for efficient training. Therefore, out of 203 clients, we only use those client’s data which have images greater than 200. This condition would provide us sufficient data to perform search/training at each client and further test local inference to record client’s validation accuracy. Figure D.2 plots the image and label allocation per client for gld23k federated dataset under this setting. As it can be seen the distribution is non-IID especially in terms of label allocation per client. D.3 Results for CIFAR10 (lda) and gld23k Figure D.3 illustrates the results of comparison of FedNAS withFedAvg (with local adaptation), perFedAvg, Ditto and FedNAS with lda distribution of cifar10 (which is given in Figure D.1). It can be seen that FedNAS outperforms all these methods. Since the number of rounds of convergence were for these methods, we plotted these figures separately for clarity. The best 290 Client ID Numbers of samples in the classes Distribution c 0 c 1 c 2 c 3 c 4 c 5 c 6 c 7 c 8 c 9 k=0 144 94 1561 133 1099 1466 0 0 0 0 k=1 327 28 264 16 354 2 100 20 200 3 k=2 6 6 641 1 255 4 1 2 106 1723 k=3 176 792 100 28 76 508 991 416 215 0 k=4 84 1926 1 408 133 24 771 0 0 0 k=5 41 46 377 541 7 235 54 1687 666 0 k=6 134 181 505 720 123 210 44 58 663 221 k=7 87 2 131 1325 1117 704 0 0 0 0 k=8 178 101 5 32 1553 10 163 9 437 131 k=9 94 125 0 147 287 100 23 217 608 279 k=10 379 649 106 90 35 119 807 819 3 85 k=11 1306 55 681 227 202 34 0 648 0 0 k=12 1045 13 53 6 77 70 482 7 761 494 k=13 731 883 15 161 387 552 4 1051 0 0 k=14 4 97 467 899 0 407 50 64 1098 797 k=15 264 2 93 266 412 142 806 2 243 1267 Table D.1: Heterogeneous data distribution (non-IID) used in FedNAS for Global Model experiments (a) Image Allocation per Client (b) Label Allocation per Client Figure D.1: Heterogeneous data distribution (non-IID) used in FedNAS for Personalized Model experiments accuracy for this setting of FedNAS is 90.64% whereas FedAvg yields accuracy of 86.1%. On the other hand, we achieve 88.0% and 89.4% average validation accuracies of all the clients 291 (a) Image Allocation per Client (b) Label Allocation per Client Figure D.2: Heterogeneous data distribution (non-IID) with Federated gld23k used in FedNAS for Personalized Model experiments with Ditto and perFedAvg, respectively. Likewise, for gld23k we obtain 56.45%, , 45.28%, 43.92% and 34.5% accuracies with FedNAS, Ditto, FedAVg with Local Adaptation and MAML, respectively. The accuracy gap for gld23k between Ditto and FedNAS is more than 10%. (a) Average Validation Accuracy all clients for FedNAS and Local Adaptation (b) Average Validation Accuracy Clients for Ditto and PerFedAvg Figure D.3: Average Validation Accuracy for CIFAR10 LDA Partition (α = 0.5) D.4 Hyperparameter Setting We report important well-tuned hyperparameters used in our experiments. For global search experiments, FedNAS searches 50 communication rounds using five local searching epochs, with a batch size of 64. For FedAvg, DenseNet201 is used for training, with 100 communication 292 rounds, 20 local epochs, a learning rate of 0.08, and a batch size of 64. Both methods use the same data augmentation techniques that are used in image classification, such as random crops, flips, and normalization. More details and other parameter settings can be found in our source code. 
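For concreteness, the local search step of FedNAS, which alternates between the architecture parameters (α_n, α_r) and the model weights w before both are sent to the server for aggregation, can be sketched as below. This is a simplified, first-order DARTS-style illustration with names of our own choosing; the exact optimizers, schedules, and data handling are those in the released source code.

import torch

def local_search(model, arch_params, weight_params, train_loader, val_loader,
                 epochs, lr_w=0.025, lr_a=3e-4):
    # alternately optimize architecture parameters on validation batches
    # and model weights on training batches, as in the FedNAS local step
    opt_w = torch.optim.SGD(weight_params, lr=lr_w, momentum=0.9)
    opt_a = torch.optim.Adam(arch_params, lr=lr_a, weight_decay=1e-3)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
            opt_a.zero_grad()
            criterion(model(x_val), y_val).backward()  # update alpha on the val split
            opt_a.step()
            opt_w.zero_grad()
            criterion(model(x_tr), y_tr).backward()    # update w on the train split
            opt_w.step()
    # both (alpha, w) are then returned to the server for aggregation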
For personalized models search, we explored CIFAR10 with label skewed and lda distribu- tion. For label skew distribution, we searched over 500 communication rounds with batch size 32 for FedNAS and Local Adaptation, and 3000 rounds for Ditto and perFedAvg. We searched hyperparameters over the lr set of {0.1,0.3, 0.01, 0.03, 0.003, 0.001} and found the best lr to be 0.01, 0.01, 0.001, 0.003 for FedAvg with local adaption, FedNAS, Ditto, perFedAvg, respectively, for both label skewed and lda distribution with cifar10. For gld23k, we searched over the same hyperparameters and found the best performing hyperparameters to be 0.1, 0.1, 0.001, 0.003 for FedAvg with local adaption, FedNAS, Ditto, perFedAvg, respectively. D.5 Visualization of the Search Architecture c_{k-2} 0 sep_conv_3x3 1 sep_conv_5x5 c_{k-1} sep_conv_3x3 2 sep_conv_3x3 sep_conv_3x3 c_{k} sep_conv_5x5 3 dil_conv_5x5 sep_conv_3x3 Figure D.4: Normal Cell Architecture We report the architecture searched based on the above given non-IID dataset and hyper-parameter setting for global experiments of FedNAS. Figures D.4 and D.5 show the normal cell architecture and the reduction cell architecture, respectively. We can see that the reduction cell uses more pooling operations while the normal cell has more convolutional operations. 293 c_{k-2} 0 max_pool_3x3 1 max_pool_3x3 2 max_pool_3x3 3 max_pool_3x3 c_{k-1} skip_connect dil_conv_5x5 max_pool_3x3 dil_conv_5x5 c_{k} Figure D.5: Reduction Cell Architecture D.6 Future Works Our future work aims to improve the FedNAS framework from form the following perspectives. • Local NAS under Resource Constraint. Our current search space fits cross- organization federated learning, where the edge device can be equipped with powerful GPU devices. But when used in resource-constrained environments such as smartphones or IoT devices, the memory of our search space is too large. Searching on compact search space or using sampling methods are potential solutions to this challenge. • Privacy-preserved FedNAS. In this work, we explored neural architecture search where client communicates its architecture parameters α and model parameters w with the server. We also showed that FedNAS has the potential to yield personalization benefits. Given this context, revealing both α and w to adversary may provide more information than sharing only w to server with a predetermined model. Therefore, exploration of privacy preserved FedNAS can be an interesting but a challenging direction to investigate. • Transferability and Federated Learning with Weight Sharing. Another inter- esting direction would be transferring the searched architectures on each client. It is important to note that after the transfer, each client may have a different architecture, 294 therefore, conventional FL weight aggregation may not work. To train this transferred models, one can explore weight sharing to train these models in federated setting. 295 Chapter E Supplement to Chapter 6 - SpreadGNN This section includes additional information that a reader might find useful. Apart from the proof of Theorem 1, we include the algorithm sketch for SpreadGNN, a more detailed description of the datasets we used, hyperparameter configurations used in our experiments and additional ablation studies on communication period and network topology. E.1 Algorithm Sketch E.2 Dataset Details Table 6.1 summarizes the necessary information of benchmark datasets [486]. 
The details of each dataset are listed below: • SIDER [225], or Side Effect Resource, the dataset consists of marketed drugs with their adverse drug reactions. • Tox21[446] is a dataset which records the toxicity of compounds. • MUV [379] is a subset of PubChem BioAssay processed via refined nearest neighbor analysis. Contains 17 tasks for around 90 thousand compounds and is specifically designed for validation of virtual screening techniques. 296 Algorithm 9 SpreadGNN : Serverless Multi-task Federated Learning for Graph Neural Networks Require: initial parameters for each node W (t=0) k = {θ (t=0) ,Ψ (t=0) ,Φ (t=0) pool ,Φ (t=0) task,k } and Ω (t=0) = (Ω (t=0) 1 ,Ω (t=0) 2 ,...,Ω (t=0) K ); learning rate η ; maximum number of global itera- tions T, maximum number of client epochs E; communication period τ . 1: for all nodes: k =1,2,...,K in parallel do 2: for t=1 to T do do 3: for k =1 to E(epoch loop) do do 4: for m∈MB (mini-batch loop) do do 5: Read a minibatch m 6: Calculate gradient: g(W t,m k )=∂G(W t,m k |Ω t,m )/∂W t,m k 7: Update the local k th optimization variables: W (t+1,m) k ← W (t,m) k − ηg (W (t,m) k ) Ω (t+1,m) k ← (Φ T task M k Φ task M k ) 1 2 /Tr((Φ T task M k Φ task M k ) 1 2 ) 8: end for 9: end for 10: if t mod τ =0 then 11: Perform aggregation and alignment over neighbors for node k: W (t+1) k ← ( P |M k | j=1 1 N j W (t) j )/|M k |, f align (Ω t+1 k )← η ( P j=M k \k 1 N j f align (Ω (t) j )+f align ( (Φ T task M k Φ task M k ) 1 2 Tr((Φ T task M k Φ task M k ) 1 2) ))/|M k | 12: end if 13: end for 14: end for • QM8 [363] is composed from a recent study on modeling quantum mechanical calculations of electronic spectra and excited state energy of small molecules. E.2.1 Feature Extraction Procedure for Molecules The feature extraction is in two steps: 1. Atom-level feature extraction and Molecule object construction using RDKit [229]. 2. Constructing graphs from molecule objects using NetworkX [131]. Atom features, shown in Table I.1, are the atom features we used. It’s exactly the same set of features as used in [382]. 297 Table E.1: Atom features Features Size Description atom type 100 Representation of atom (e.g., C, N, O), by its atomic number formal charge 5 An integer electronic charge assigned to atom number of bonds 6 Number of bonds the atom is involved in chirality 5 Number of bonded hydrogen atoms number of H 5 Number of bonded hydrogen atoms atomic mass 1 Mass of the atom, divided by 100 aromaticity 1 Whether this atom is part of an aromatic system hybridization 5 SP, SP2, SP3, SP3D, or SP3D2 E.2.2 Model Hyperparameters E.2.2.1 Model Architecture As explained in section 6.2.1 our model is made up of a GNN and Readout. The GNNs we use are GAT [457] and GraphSAGE [132]. Each accepts input node features X v ∈R |V M |× d input and outputs node embeddings h v ∈R |V M |× d node , v∈ V M . Where V M is the set of atoms in molecule M. Given the output node embeddings from the GNN the Readout function we use is defined as follows: R Φ pool ,Φ task (h v ,X v )= MEAN(ReLU(Φ task (ReLU(Φ pool (X v ∥h v ))))) where ∥ represents the row wise concatenation operation. Φ pool ∈ R (d node +d input )× d pool and Φ task ∈R d pool × dout are learnable transformation matrices. d out represents the number of classes/tasks present in the classification label. The MEAN operation here is a column wise mean. 
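A direct PyTorch rendering of this readout is sketched below (the module and argument names are ours; the dimensions follow Table E.2):

import torch
import torch.nn as nn

class Readout(nn.Module):
    # R(h_v, X_v) = MEAN_v( ReLU( Phi_task( ReLU( Phi_pool([X_v || h_v]) ) ) ) )
    def __init__(self, d_input, d_node, d_pool, d_out):
        super().__init__()
        self.phi_pool = nn.Linear(d_node + d_input, d_pool)
        self.phi_task = nn.Linear(d_pool, d_out)

    def forward(self, h, x):
        # h: |V_M| x d_node node embeddings, x: |V_M| x d_input input features
        z = torch.cat([x, h], dim=1)                          # row-wise concatenation
        z = torch.relu(self.phi_task(torch.relu(self.phi_pool(z))))
        return z.mean(dim=0)                                  # column-wise mean over nodes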
Note that while our general description of the readout in section 6.2.1 does not include the input features as part of the input, we find that including the input features leads to better generalization.

E.2.2.2 Hyperparameter Configurations

For each task, we utilize grid search to find the best results. Table E.2 lists all the hyperparameter ranges used in our experiments. All hyperparameter tuning is run on a single GPU. The best hyperparameters for each dataset and model are listed in Table E.3. The batch size is kept at 1, which corresponds to processing a single molecule at a time. The number of GNN layers is fixed to 2 because having too many GNN layers results in the over-smoothing phenomenon, as shown in [244]. For all experiments, we used the Adam optimizer.

Table E.2: Hyperparameter Range for Experiments
Hyperparameter | Description | Range
Learning rate | Rate at which the model learns | [0.00015, 0.001, 0.0015, 0.0025, 0.015, 0.15]
Dropout rate | Dropout ratio | [0.3, 0.5, 0.6]
Node embedding dimension (d_node) | Dimensionality of the node embedding | 64
Hidden layer dimension | GNN hidden layer dimensionality | 64
Readout embedding dimension (d_pool) | Readout hidden layer dimensionality | 64
Graph embedding dimension (d_out) | Dimensionality of the final graph embedding | 64
Attention heads | Number of attention heads required for GAT | 1-7
Alpha | LeakyReLU parameter used in the GAT model | 0.2
Rounds | Number of federated learning rounds | 150
Epoch | Epochs per client | 1
Number of clients | Number of users in a federated learning round | 4-10
Communication period | Exchange period between clients | 1

E.3 Detailed Ablation Studies

E.3.1 Effect of Communication Period τ

Figures E.1 & E.2 illustrate the effect of the communication period on the SIDER and Tox21 datasets. As we increase the communication period τ, model performance decreases. However, selecting τ = 5 can sometimes be better than averaging and exchanging every round. This indicates that tuning τ is important for controlling the tradeoff between performance and running time.
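The exchange that this period controls (lines 10–11 of Algorithm 9) amounts to a periodic averaging with topology neighbors every τ rounds. The sketch below is a simplified uniform average over parameter dictionaries written by us; the full algorithm additionally weights neighbors by their dataset sizes and aligns the task-covariance matrices Ω.

def maybe_average_with_neighbors(round_idx, tau, my_state, neighbor_states):
    # my_state / neighbor_states: dicts of parameter tensors (torch state_dicts)
    if round_idx % tau != 0:
        return my_state                      # keep training locally between exchanges
    group = [my_state] + list(neighbor_states)
    averaged = {}
    for name in my_state:
        averaged[name] = sum(s[name] for s in group) / len(group)
    return averaged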
GraphSAGE GAT Parameters FedAvg FedGMTL SpreadGNN FedAvg FedGMTL SpreadGNN SIDER ROC-AUC Score 0.582 0.629 0.5873 0.5857 0.61 0.603 Partition alpha 0.2 0.2 0.2 0.2 0.2 0.2 Learning rate 0.0015 0.0015 0.0015 0.0015 0.0015 0.0015 Dropout rate 0.3 0.3 0.3 0.3 0.3 0.3 Node embedding dimension 64 64 64 64 64 64 Hidden layer dimension 64 64 64 64 64 64 Readout embedding dimension 64 64 64 64 64 64 Graph embedding dimension 64 64 64 64 64 64 Attention Heads NA NA NA 2 2 2 Leaky ReLU alpha NA NA NA 0.2 0.2 0.2 Number of Clients 4 4 4 4 4 4 Task Regularizer NA 0.001 0.001 NA 0.001 0.001 Tox21 ROC-AUC Score 0.5548 0.6644 0.585 0.6035 0.6594 0.6056 Partition alpha 0.1 0.1 0.1 0.1 0.1 0.1 Learning rate 0.0015 0.0015 0.0015 0.0015 0.0015 0.0015 Dropout rate 0.3 0.3 0.3 0.3 0.3 0.3 Node embedding dimension 64 64 64 64 64 64 Hidden layer dimension 64 64 64 64 64 64 Readout embedding dimension 64 64 64 64 64 64 Graph embedding dimension 64 64 64 64 64 64 Attention Heads NA NA NA 2 2 2 Leaky ReLU alpha NA NA NA 0.2 0.2 0.2 Number of Clients 8 8 8 8 8 8 Task Regularizer NA 0.001 0.001 NA 0.001 0.001 MUV ROC-AUC Score 0.6578 0.6856 0.703 0.709 0.6899 0.713 Partition alpha 0.3 0.3 0.3 0.3 0.3 0.3 Learning rate 0.001 0.001 0.001 0.0025 0.0025 0.0025 Dropout rate 0.3 0.3 0.3 0.3 0.3 0.3 Node embedding dimension 64 64 64 64 64 64 Hidden layer dimension 64 64 64 64 64 64 Readout embedding dimension 64 64 64 64 64 64 Graph embedding dimension 64 64 64 64 64 64 Attention Heads NA NA NA 2 2 2 Leaky ReLU alpha NA NA NA 0.2 0.2 0.2 Number of Clients 8 8 8 8 8 8 Task Regularizer NA 0.001 0.001 NA 0.002 0.002 QM8 RMSE Score 0.02982 0.03624 0.02824 0.0392 0.0488 0.0333 Partition alpha 0.5 0.5 0.5 0.5 0.5 0.5 Learning rate 0.0015 0.0015 0.0015 0.0015 0.0015 0.0015 Dropout rate 0.3 0.3 0.3 0.3 0.3 0.3 Node embedding dimension 64 64 64 64 64 64 Hidden layer dimension 64 64 64 64 64 64 Readout embedding dimension 64 64 64 64 64 64 Graph embedding dimension 64 64 64 64 64 64 Attention Heads NA NA NA 2 2 2 Leaky ReLU alpha NA NA NA 0.2 0.2 0.2 Number of Clients 8 8 8 8 8 8 Task Regularizer NA 0.3 0.3 NA 0.3 0.3 0 20 40 60 80 100 120 140 Rounds 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.60 Test ROC-AUC GraphSAGE + SIDER Period 1 Period 5 Period 10 Period 15 Period 20 0 20 40 60 80 100 120 140 Rounds 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 Test ROC-AUC GraphSAGE + Tox21 Period 1 Period 5 Period 10 Period 15 Period20 Figure E.1: Effect of Communication Period τ on GraphSAGE Model 300 0 20 40 60 80 100 120 Rounds 0.52 0.54 0.56 0.58 0.60 Test ROC-AUC GAT + SIDER Period 1 Period 5 Period 10 Period 15 Period 20 0 20 40 60 80 100 120 140 Rounds 0.49 0.50 0.51 0.52 0.53 0.54 0.55 0.56 Test ROC-AUC GAT + Tox21 Period 1 Period 5 Period 10 Period 15 Period 20 Figure E.2: Effect of Communication Period τ on GAT Model E.3.2 Proof for Convergence of SpreadGNN In order to get a more clear view of our algorithm, we reformulate the loss function on each worker as follows: f k Ä Γ ,Φ task M k ,Ω (k) ;ξ (k) i ä :=L Ä ˆ y (k) i (X i,k ,Z i,k ;W k ),y (k) i ä + 1 2 λ 1 Tr(Φ task M k Ω − 1 k Φ task M k ) + 1 2 X χ ∈{θ, Ψ ,Φ pool ,Φ task} λ χ ||χ || 2 F , where we include all the parameters that is different on each worker to be δ (k) :={Φ task M k }, Γ denotes the shared parameters, and ξ (k) i is the random variable that denotes the data samples (X i,k ,Z i,k ,y (k) i ). Therefore, the original objective function an be cast into the following form: F Ä Γ ,δ (1:K) ,Ω (1:K) ä = 1 n K X i=1 E ξ (k) i f k Ä Γ ,δ (k) ,Ω (k) ;ξ (k) i ä . 
Notice that the updating rule for Γ , δ (1:K) , and Ω (1:K) are different in that: Γ t+1 =Γ t − η ∂F ∂Γ t M, δ (k) t+1 =δ (k) t − η ∂F ∂δ (k) t , Ω (k) t+1 =argmin Ω (k)F Ä Γ ,δ (1:K) t ,Ω (1:K) t ä . 301 From the update rule of Ω (1:K) t , we know that E t F Ä Γ t+1 ,δ (1:K) t+1 ,Ω (1:K) t+1 ä − E t F Ä Γ t ,δ (1:K) t ,Ω (1:K) t ä ≤ E t ≠ ∂F ∂Γ t ,Γ t+1 − Γ t ∑ +E t Æ ∂F ∂δ (1:K) t ,δ (1:K) t+1 − δ (1:K) t ∏ + L 2 E t Å Γ t+1 − Γ t 2 F + δ (1:K) t+1 − δ (1:K) t 2 F ã ≤− η Å 1− Lη 2 ã E t ∂F ∂δ (1:K) t 2 F − η 2 E t ∂F ∂Γ t 2 F − η (1− Lη ) 2 E t ∂F ∂Γ t 1 n n 2 F + η 2 ∂F ∂Γ t − ∂F ∂Γ t 1 n n 2 F + Lη 2 (σ 2 Γ +σ 2 δ ) 2K , (E.3.1) where the overline∗ denotes the expectation operationE, σ 2 Γ and σ 2 δ are the variance bounds for the stochastic gradients of Γ t and δ t respectively. To estimate the upper bound for E ∂F ∂Γ t − ∂F ∂Γ t 1n n 2 F , we have: E ∂F ∂Γ t − ∂F ∂Γ t 1 n n 2 F = 1 n 2 E n X i=1 Ç ∂F i ∂Γ t − ∂F i ∂Γ (i) t å 2 F ≤ 1 n n X i=1 E ∂F i ∂Γ t − ∂F i ∂Γ (i) t 2 F ≤ L 2 n E n X i=1 Γ t − Γ (i) t 2 F . Therefore, we re-write the lower bound equation E.3.1 as − η Å 1− Lη 2 ã E t ∂F ∂δ (1:K) t 2 F − η 2 E t ∂F ∂Γ t 2 F − η (1− Lη ) 2 E t ∂F ∂Γ t 1 n n 2 F + ηL 2 2n E t n X i=1 Γ t − Γ (i) t 2 F + Lη 2 (σ 2 Γ +σ 2 δ ) 2K . Summing the inequality above for all time-steps t=0,...,T, we get (2− Lη ) T X t=0 E t ∂F ∂δ (1:K) t 2 F + T X t=0 E t ∂F ∂Γ t 2 F +(1− Lη ) T X t=0 E t ∂F ∂Γ t 1 n n 2 F + L 2 2n T X t=0 E t n X i=1 Γ t − Γ (i) t 2 F + Lη (σ 2 Γ +σ 2 δ )T 2K . (E.3.2) 302 Themainchallengehowevernowbecomesboundingthetermwith P T t=0 E t P n i=1 Γ t − Γ (i) t 2 F . Bounding it requires to derive another lower bound and using an available result. First re-write theE ∂f ∂Γ t 2 F by utilizing Frobenius-norm properties: E ∂f ∂Γ t 2 F =E Å ∂f ∂Γ t − ∂F ∂Γ t ã + ∂F ∂Γ t 2 F =E ∂f ∂Γ t − ∂F ∂Γ t 2 F +E ∂F ∂Γ t 2 F +2E ≠ ∂f ∂Γ t − ∂F ∂Γ t , ∂F ∂Γ t ∑ =E ∂f ∂Γ t − ∂F ∂Γ t 2 F +E ∂F ∂Γ t 2 F Then bound each term as follows: ≤ nσ 2 Γ +E n X i=1 Ç ∂F i ∂Γ (i) t − ∂F i ∂Γ t å + Å ∂F i ∂Γ t − ∂F ∂Γ t ã + ∂F ∂Γ t 2 F ≤ nσ 2 Γ +3 n X i=1 E ∂F i ∂Γ (i) t − ∂F i ∂Γ t 2 F +3 n X i=1 E ∂F i ∂Γ t − ∂F ∂Γ t 2 F +3E ∂F ∂Γ t 2 F ≤ nσ 2 Γ +3L 2 n X i=1 E Γ t − Γ (i) t 2 F +3nζ 2 +3nE ∂F ∂Γ t 2 F , where first term is bounded by its stochastic gradient variance. To bound the second term E ∂F ∂Γ t 2 F above, first bound the term with sum of the individual components’ norm and then use add-subtract trick used with ∂F ∂Γ t and ∂F i ∂Γ t . Then, use the facts P n i=1 E ∂F i ∂Γ t − ∂F i ∂Γ (i) t 2 F ≤ L 2 E P n i=1 Γ t − Γ (i) t 2 F (an intermeditate result derived above) and P n i=1 E ∂F i ∂Γ t − ∂F ∂Γ t 2 F ≤ nζ 2 , while ζ being the spectral gap of matrix M. Then, from [437], we have T X t=1 n X i=1 E Γ t − Γ (i) t 2 F ≤ 2 (1− ζ ) 2 T X t=1 η 2 ∂f ∂Γ t 2 F , withthepreviouslyderivedboundover P T t=1 η 2 F ∂f ∂Γ t 2 F boundleadingtoanotherintermediate bound Å 1− 6L 2 (1− ζ ) 2 ã T X t=1 n X i=1 E Γ t − Γ (i) t 2 F ≤ 2η 2 (1− ζ ) 2 nσ 2 Γ +3nζ 2 +3n T X t=1 E ∂F ∂Γ t 2 F T X t=0 n X i=1 E Γ t − Γ (i) t 2 F ≤ C 1 η 2 n σ 2 Γ +3ζ 2 T +3 T X t=1 E ∂F ∂Γ t 2 F (E.3.3) 303 where C 1 being a function of the spectral gap. 
Finally, combining equation E.3.2 with the intermediate bound equation E.3.3 together first, bounding the Frobenius norms, and dividing both sides by T (to average in the end), we get the desired lower bound 2[F(x 0 )− F inf ] ηT + ηLσ 2 K +η 2 L 2 σ 2 Å 1+ζ 2 1− ζ 2 τ − 1 ã where η is the learning rate that satisfies the given conditions, x 0 ={Γ 0 ,δ (1:K) 0 ,Ω (1:K) 0 } is the initial starting point and σ 2 = σ 2 Γ +σ 2 δ is the total variance bound over stochastic gradients of Γ & δ □ 304 Chapter F Supplement to Chapter 7 - SSFL F.1 Comparison of Self-supervised Learning Frameworks We compare state-of-the-art self-supervised learning frameworks (SimCLR, SwAV, BYOL) with SimSiam [69] in light of federated learning. We choose SimSiam [69] because it requires a much smaller batch to perform normally. In the centralized setting, for each method to reach an accuracy level similar to that of SimSiam, a much larger batch size is necessary. Table F.1 adopted from [69] provides a brief comparison between all listed self-supervised learning frameworks. method batch size negative pairs momentum encoder 100 ep 200 ep 400 ep 800 ep SimCLR (repro.+) 4096 ✓ 66.5 68.3 69.8 70.4 BYOL (repro.) 4096 ✓ 66.5 70.6 73.2 74.3 SwAV (repro.+) 4096 66.5 69.1 70.7 71.8 SimSiam 256 68.1 70.0 70.8 71.3 Table F.1: [69] Comparisons on ImageNet linear classification . All are based on ResNet-50 pre-trained with two 224× 224 views in a centralized setting. Evaluation is on a single crop. “repro.” denotes reproduction conducted by authors of SimSiam [69], and “+” denotes improved reproduction v.s. original papers. Another reason we prefer SimSiam [69] as the basic framework to build SSFL is that the design of SimSiam simplifies all other baselines and also obtains a relatively higher accuracy. Figure F.1 abstracts these methods. The “encoder” contains all layers that can be 305 shared between both branches (e.g., backbone, projection MLP [68], prototypes [53]). The components in red are those missing in SimSiam. encoder similarity encoder predictor image SimSiam encoder similarity & dissimilarity encoder image SimCLR encoder similarity encoder Sinkhorn-Knopp image SwAV encoder similarity momentum encoder predictor image moving average BYOL grad grad grad grad grad Figure F.1: [69] Comparison on Siamese architectures. The encoder includes all layers that can be shared between both branches. The dashed lines indicate the gradient propagation flow. In BYOL, SwAV, and SimSiam, the lack of a dashed line implies stop-gradient, and their symmetrization is not illustrated for simplicity. The components in red are those missing in SimSiam. SimCLR [68]. SimCLR relies on negative samples (“dissimilarity”) to prevent collapsing. SimSiam can be thought of as “SimCLR without negatives". In every mini-batch, for any image, one augmented view of the same image is considered to be its positive sample, and the remaining augmented views of different images are considered to be its negative samples. A contrastive loss term is calculated to push positive samples together and negative samples away. SwAV[53]. SimSiam is conceptually analogous to “SwAV without online clustering". Sim- Siam encourages the features of the two augmented views of the same image to be similar, while SwAV encourages features of the two augmented views of the same image to belong to 306 the same cluster. An additional Sinkhorn-Knopp (SK) transform [79] is required for online clustering of SwAV. 
The authors of SimSiam [69] build up the connection between SimSiam and SwAV by recasting a few components in SwAV. (i) The shared prototype layer in SwAV can be absorbed into the Siamese encoder. (ii) The prototypes were weight-normalized outside of gradient propagation in [53]; the authors of SimSiam instead implement by full gradient computation [392]. (iii) The similarity function in SwAV is cross-entropy. With these abstractions, a highly simplified SwAV illustration is shown in Figure F.1. BYOL[127]. SimSiamcanbethoughtofas“BYOLwithoutthemomentumencoder", subject to many implementation differences. Briefly, in BYOL, one head of the Siamese architecture used in SimSiam is replaced by the exponential moving average of the encoder. As the momentum encoder has an identical architecture to that of the encoder, the introduction of an additional momentum encoder doubles the memory cost of the model. SSL’s recent success is the inductive bias that ensures a good representation encoder remains consistent under different perturbations of the input (i.e. consistency regularization). The perturbations can be either domain-specific data augmentation (e.g. random flipping in the image domain) [22, 227, 391, 23, 177], drop out [391], random max pooling [391], or an adversarial transformation [321]. With this idea, a consistency loss L is defined to measure the quality of the representations without any annotations. 307 F.2 Formulation and Pseudo Code for Algorithms Under SSFL Framework Inspired by recent advances in personalized FL and self-supervised learning, we innovate several representative algorithms under SSFL framework. For each algorithm, we present its mathematical formulation and its pseudo code. F.2.1 Per-SSFL For Per-SSFL, as the formulation and algorithm have already been presented in Equation 7.4 and Algorithm 5, we provide a PyTorch style pseudo code in Algorithm 10 for additional clarity. 308 Algorithm 10 Per-SSFL PyTorch Style Pseudo Code 1 # F: global encoder 2 # H: global predictor 3 # f: local encoder 4 # h: local predictor 5 6 for x in loader: # load a mini-batch x with n samples 7 x1, x2 = aug(x), aug(x) # random augmentation 8 Z1, Z2 = F(x1), F(x2) # global projections, n-by-d 9 P1, P2 = H(Z1), H(Z2) # global predictions, n-by-d 10 11 L = D(P1, Z2) / 2 + D(P2, Z1) / 2 # global loss 12 13 L.backward() # back-propagate 14 update(F, H) # SGD update global model 15 16 z1, z2 = f(x1), f(x2) # local projections, n-by-d 17 p1, p2 = h(z1), h(z2) # local predictions, n-by-d 18 19 l = D(p1, z2) / 2 + D(p2, z1) / 2 # local loss 20 21 # distance between local and global representations 22 l = l + λ * (D(p1, P1) + D(p1, P2) + D(p2, P1) + D(p2, P2)) / 4 23 24 l.backward() # back-propagate 25 update(f, h) # SGD update local model 26 27 def D(p, z): # negative cosine similarity 28 z = z.detach() # stop gradient 29 30 p = normalize(p, dim=1) # l2-normalize 31 z = normalize(z, dim=1) # l2-normalize 32 return -(p * z).sum(dim=1).mean() F.2.2 Personalized SSFL with Local Adaptation (FedAvg-LA) FedAvg-LA apply FedAvg [42] on the SimSiam lossL SS for each client to obtain a global model. We perform one step of SGD on the clients’ local data for local adaption. The objective is defined in Equation F.2.1, and the algorithm is provided in Algorithm 11. 
min Θ ,H n X i=1 |D k | |D| E T x∼ X i î ∥f Θ (T(x))−H x ∥ 2 2 ó (F.2.1) 309 Algorithm 11 FedAvg-LA input :K,T,λ, Θ (0) ,{θ (0) i } k∈[K] ,s: number of local iteration,β : learning rate for t=0,...,T − 1 do Server randomly selects a subset of devices S (t) Server sends the current global model Θ (t) to S (t) for device k∈S (t) in parallel do ClientSSLOpt Sample mini-batch B k from local dataset D k , and do s local iterations /* Optimize the global parameter Θ locally */ Z 1 ,Z 2 ← f Θ (t)(T(B k )),f Θ (t)(T(B k )) P 1 ,P 2 ← h Θ (t)(Z 1 ),h Θ (t)(Z 2 ) Θ (t) k ← Θ (t) − β ∇ Θ (t) D(P 1 , ” Z 2 )+D(P 2 , ” Z 1 ) 2 , whereb · stands for stop-gradient Send ∆ (t) k :=Θ (t) k − Θ (t) back to server ServerOpt Θ (t+1) ← Θ (t) + P k∈S (t) |D k | |D| ∆ (t) k return :{θ i } i∈[n] ,Θ (T) 310 F.2.3 Personalized SSFL with MAML-SSFL MAML-SSFL is inspired by perFedAvg [103] and views the personalization on each devices as the inner loop of MAML [108]. It aims to learn an encoder that can be easily adapted to the clients’ local distribution. During inference, we perform one step of SGD on the global model for personalization. The objective is defined in Equation F.2.2, and the algorithm is provided in Algorithm 12. min Θ ,H n X i=1 |D k | |D| E T x∼ X i î ∥f Θ ′(T(x))−H x ∥ 2 2 ó s.t. Θ ′ =Θ −∇ Θ n X i=1 |D k | |D| E T x∼ X i î ∥f Θ (T(x))−H x ∥ 2 2 ó (F.2.2) Algorithm 12 MAML-SSFL input :K,T,λ, Θ (0) ,{θ (0) i } k∈[K] ,s: number of local iteration,β : learning rate,M for t=0,...,T − 1 do Server randomly selects a subset of devices S (t) Server sends the current global model Θ (t) to S (t) for device k∈S (t) in parallel do ClientSSLOpt Sample mini-batch B k ,B ′ k from local dataset D k , and do s local iterations /* Inner loop update */ Θ ′(t) k ← Θ (t) for m=0,...,M− 1 do Z ′ 1 ,Z ′ 2 ← f Θ ′(t)(T(B ′ k )),f Θ ′(t)(T(B ′ k )) P ′ 1 ,P ′ 2 ← h Θ ′(t)(Z ′ 1 ),h Θ ′(t)(Z ′ 2 ) Θ ′(t) k ← Θ ′(t) k − β ∇ Θ ′(t) k D(P ′ 1 , ” Z ′ 2 )+D(P ′ 2 , ” Z ′ 1 ) 2 , whereb · stands for stop-gradient /* Outer loop update */ Z 1 ,Z 2 ← f Θ ′(t)(T(B k )),f Θ ′(t)(T(B k )) P 1 ,P 2 ← h Θ ′(t)(Z 1 ),h Θ ′(t)(Z 2 ) Θ (t) k ← Θ (t) − β ∇ Θ (t) D(P 1 , ” Z 2 )+D(P 2 , ” Z 1 ) 2 Send ∆ (t) k :=Θ (t) k − Θ (t) back to server ServerOpt Θ (t+1) ← Θ (t) + P k∈S (t) |D k | |D| ∆ (t) k return :{θ i } i∈[n] ,Θ (T) 311 F.2.4 Personalized SSFL with BiLevel-SSFL Inspired by Ditto [249], BiLevel-SSFL learns personalized encoders on each client by restrict- ing the parameters of all personalized encoders to be close to a global encoder independently learned by weighted aggregation. The objective is defined in Equation F.2.3, and the algorithm is provided in Algorithm 13. min θ k ,η k E T x∼ X k ï ∥f θ k (T(x))− η k,x ∥ 2 2 + λ 2 ∥θ k − Θ ∗ x ∥ 2 2 ò s.t. 
Θ ∗ ,H ∗ ∈argmin Θ ,H n X i=1 |D k | |D| E T x∼ X i î ∥f Θ (T(x))−H x ∥ 2 2 ó (F.2.3) Algorithm 13 BiLevel-SSFL input :K,T,λ, Θ (0) ,{θ (0) i } k∈[K] ,s: number of local iteration,β : learning rate for t=0,...,T − 1 do Server randomly selects a subset of devices S (t) Server sends the current global model Θ (t) to S (t) for device k∈S (t) in parallel do ClientSSLOpt Sample mini-batch B k from local dataset D k , and do s local iterations /* Optimize the global parameter Θ locally */ Z 1 ,Z 2 ← f Θ (t)(T(B k )),f Θ (t)(T(B k )) P 1 ,P 2 ← h Θ (t)(Z 1 ),h Θ (t)(Z 2 ) Θ (t) k ← Θ (t) − β ∇ Θ (t) D(P 1 , ” Z 2 )+D(P 2 , ” Z 1 ) 2 , whereb · stands for stop-gradient /* Optimize the local parameter θ k */ z 1 ,z 2 ← f θ k (T(B k )),f θ k (T(B k )) p 1 ,p 2 ← h θ k (z 1 ),h θ k (z 2 ) θ k ← θ k − β ∇ θ k Å D(p1,c z 2 )+D(p2,c z 1 ) 2 +λ Θ (t) − θ k 2 2 ã Send ∆ (t) k :=Θ (t) k − Θ (t) back to server ServerOpt Θ (t+1) ← Θ (t) + P k∈S (t) |D k | |D| ∆ (t) k return :{θ i } i∈[n] ,Θ (T) 312 F.3 Distributed Training System for SSFL Figure F.2: Distributed Training System for SSFL framework We develop a distributed training system for our SSFL framework which contains three layers. In the infrastructure layer, communication backends such as MPI are supported to facilitate the distributed computing. We abstract the communication as ComManager to simplify the message passing between the client and the server. Trainer reuses APIs from PyTorch to handle the model optimizations such as forward propagation, loss function, and back propagation. In the algorithm layer, Client Manager and Server Manager are the entry points of the client and the server, respectively. The client managers incorporates 313 various SSFL trainers, including Per-SSFL, MAML-SSFL, BiLevel-SSFL, and LA-SSFL. The server handles the model aggregation using Aggregator. We design simplified APIs for all of these modules. With the abstraction of the infrastructure and algorithm layers, developers can begin FL training by developing a workflow script that integrates all modules (as the “SSFL workflow” block shown in the figure). Overall, we found that this distributed training system accelerates our research by supporting parallel training, larger batch sizes, and easy-to-customize APIs, which cannot be achieved by a simple single-process simulation. 314 F.3.1 Experimental Results on GLD-23K Dataset We also evaluate the performance of SSFL on GLD-23K dataset. We use 30% of the original local training dataset as the local test dataset and filter out those clients that have a number of samples less than 100. Due to the natural non-I.I.D.ness of GLD-23K dataset, we only evaluate the Per-SSFL framework. The results are summarized in Table F.2. Note: we plan to further explore more datasets and run more experiments; thus we may report more results during the rebuttal phase. Table F.2: Evaluation Accuracy for Various Per-SSFL Methods. Method KNN Indicator Evaluation LA-SSFL 0.6011 0.4112 MAML-SSFL 0.6237 0.4365 BiLevel-SSFL 0.6195 0.4233 Per-SSFL 0.6371 0.4467 *Note: the accuracy on supervised federated training using FedAvg is around 47% F.3.2 Extra Experimental Results and Details F.3.3 Visualization of Non-I.I.D. dataset (a) Sample Number Distribution (b) Label Distribution (deeper color stands for more samples Figure F.3: Visualization for non-I.I.D. 
synthesized using CIFAR-10 315 (a) Sample Number Distribution (X-axis: Client Index; Y-axis: Number of Training Samples) (b) Sample Number Distribution (X-axis: Num- ber of Training Samples; Y-axis: Number of Clients) Figure F.4: Visualization for non-I.I.D. on GLD-23K F.3.4 Hyper-parameters Table F.3: Hyper-parameters for Section 7.5.2 Method Learning Rate Local Optimizer SSFL (I.I.D) 0.1 SGD with Momemtum (0.9) SSFL (non-I.I.D) 0.1 SGD with Momemtum (0.9) Table F.4: Hyper-parameters for Section 7.5.4.2 Method Learning Rate λ Local Optimizer Per-SSFL (α =0.1) 0.03 0.1 SGD with Momemtum (0.9) Per-SSFL (α =0.5) 0.03 0.1 SGD with Momemtum (0.9) Table F.5: Hyper-parameters for experimental results in Section 7.5.3 Method Learning Rate λ Local Optimizer LA-SSFL 0.1 1 SGD with Momemtum (0.9) MAML-SSFL 0.03 1 SGD with Momemtum (0.9) BiLevel-SSFL 0.1 1 SGD with Momemtum (0.9) Per-SSFL 0.03 0.1 SGD with Momemtum (0.9) All experiments set the local epoch number as 1, round number as 800, batch size as 256 (batch size 32 with 8 gradient accumulation steps). 316 F.4 Discussion To overcome the large batch size requirement in SSFL and practical FL edge training, one direction is to use efficient DNN models such as EfficientNet [435] and MobileNet [171] as the backbone of SimSiam. However, we tested its performance under our framework and found that the performance downgrades to a level of accuracy that is not useful (less than 60%). A recent work in centralized self-supervised learning mitigates these models’ accuracy gap by knowledge distillation, which works in a centralized setting but is still not friendly to FL since KD requires additional resources for the teacher model. In practice, we can also explore batch size 1 training [50] at the edge, which dramatically reduces the memory cost with additional training time. 
317 Chapter G Supplement to Chapter 8 - LightSecAgg G.1 Pseudo Code of LightSecAgg 318 Algorithm 14 The LightSecAgg protocol Input: T (privacy guarantee), D (dropout-resiliency guarantee), U (target number of surviving users) 1: Server Executes: 2: // phase: offline encoding and sharing of local masks 3: for each user i=1,2,...,N in parallel do 4: z i ← randomly picks fromF d q 5: [z i ] 1 ,...,[z i ] U− T ← obtained by partitioning z i to U− T pieces 6: [n i ] U− T+1 ,...,[n i ] U ← randomly picks fromF d U− T q 7: {[˜ z i ] j } j∈[N] ← obtained by encoding [z i ] k ’s and [n i ] k ’s using equation 8.5 8: sends encoded mask [˜ z i ] j to user j∈[N]\{i} 9: receives encoded mask [˜ z j ] i from user j∈[N]\{i} 10: end for 11: // phase: masking and uploading of local models 12: for each user i=1,2,...,N in parallel do 13: // user i obtains x i after the local update 14: ˜ x i ← x i +z i // masks the local model 15: uploads masked model ˜ x i to the server 16: end for 17: identifies set of surviving users U 1 ⊆ [N] 18: gathers masked models ˜ x i from user i∈U 1 19: // phase: one-shot aggregate-model recovery 20: for each user i∈U 1 in parallel do 21: computes aggregated encoded masks P j∈U 1 [˜ z j ] i 22: uploads aggregated encoded masks P j∈U 1 [˜ z j ] i to the server 23: end for 24: collects U messages of aggregated encoded masks P j∈U 1 [˜ z j ] i from user i∈U 1 25: // recovers the aggregated-mask 26: P i∈U 1 z i ← obtained by decoding the received U messages 27: // recovers the aggregate-model for the surviving users 28: P i∈U 1 x i ← P i∈U 1 ˜ x i − P i∈U 1 z i 319 G.2 Proof of Theorem 2 We prove the dropout-resiliency guarantee and the privacy guarantee for a single FL training round. As all randomness is independently generated across each round, one can extend the dropout-resiliency guarantee and the privacy guarantee for all training rounds for both synchronous and asynchronous FL setting. For simplicity, round index t is omitted in this proof. For any pair of privacy guarantee T and dropout-resiliency guarantee D such that T +D <N, we select an arbitrary U such that N− D≥ U >T. In the following, we show that LightSecAgg with chosen design parameters T, D and U can simultaneously achieve (1) privacy guarantee against up to any T colluding users, and (2) dropout-resiliency guarantee against up to any D dropped users. We denote the concatenation of{[n i ] k } k∈U− T+1,...,U by n i for i∈[N]. (Dropout-resiliency guarantee) We now focus on the phase of one-shot aggregate- model recovery. Since each user encodes its sub-masks by the same MDS matrix W, each P i∈U 1 [˜ z i ] j is an encoded version of P i∈U 1 [z i ] k for k ∈ [U − T] and P i∈U 1 [n i ] k for k ∈ {U− T +1,...,U} as follows: X i∈U 1 [˜ z i ] j =( X i∈U 1 [z i ] 1 ,..., X i∈U 1 [z i ] U− T , X i∈U 1 [n i ] U− T+1 ,..., X i∈U 1 [n i ] U )· W j , (G.2.1) where W j is the j’th column of W. Since N− D≥ U, there are at least U surviving users after user dropouts. Thus, the server is able to recover P i∈U 1 [z i ] k for k∈[U− T] via MDS decoding after receiving a set of any U messages from the surviving users. Recall that [z i ] k ’s are sub-masks of z i , so the server can successfully recover P i∈U 1 z i . Lastly, the server recovers the aggregate-model for the set of surviving usersU 1 by P i∈U 1 x i = P i∈U 1 ˜ x i − P i∈U 1 z i = P i∈U 1 (x i +z i )− P i∈U 1 z i . (Privacy guarantee) We first present Lemma 1, whose proof is provided in Appendix G.4. 320 Lemma 1. 
(Privacy guarantee) We first present Lemma 1, whose proof is provided in Appendix G.4.

Lemma 1. For any $\mathcal{T}\subseteq[N]$ of size $T$ and any $\mathcal{U}_1\subseteq[N]$ with $|\mathcal{U}_1|\ge U$ such that $U>T$, if the random masks $[n_i]_k$'s are jointly uniformly random, we have

$$ I\big(\{z_i\}_{i\in[N]\setminus\mathcal{T}};\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\big)=0. \tag{G.2.2} $$

We consider the worst-case scenario in which all messages sent from the users are received by the server during the execution of LightSecAgg, i.e., the users identified as dropped are merely delayed. Thus, the server receives $x_i+z_i$ from user $i\in[N]$ and $\sum_{j\in\mathcal{U}_1}[\tilde z_j]_i$ from user $i\in\mathcal{U}_1$. We now show that LightSecAgg provides privacy guarantee $T$, i.e., for an arbitrary set of colluding users $\mathcal{T}$ of size $T$, the following holds:

$$ I\Big(\{x_i\}_{i\in[N]};\{x_i+z_i\}_{i\in[N]},\Big\{\sum_{j\in\mathcal{U}_1}[\tilde z_j]_i\Big\}_{i\in\mathcal{U}_1}\,\Big|\,\sum_{i\in\mathcal{U}_1}x_i,\{x_i\}_{i\in\mathcal{T}},\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big)=0. \tag{G.2.3} $$

We prove it as follows:

$$
\begin{aligned}
& I\Big(\{x_i\}_{i\in[N]};\{x_i+z_i\}_{i\in[N]},\Big\{\sum_{j\in\mathcal{U}_1}[\tilde z_j]_i\Big\}_{i\in\mathcal{U}_1}\,\Big|\,\sum_{i\in\mathcal{U}_1}x_i,\{x_i\}_{i\in\mathcal{T}},\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big) && \text{(G.2.4)}\\
&= H\Big(\{x_i+z_i\}_{i\in[N]},\Big\{\sum_{j\in\mathcal{U}_1}[\tilde z_j]_i\Big\}_{i\in\mathcal{U}_1}\,\Big|\,\sum_{i\in\mathcal{U}_1}x_i,\{x_i\}_{i\in\mathcal{T}},\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big)\\
&\quad - H\Big(\{x_i+z_i\}_{i\in[N]},\Big\{\sum_{j\in\mathcal{U}_1}[\tilde z_j]_i\Big\}_{i\in\mathcal{U}_1}\,\Big|\,\{x_i\}_{i\in[N]},\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big) && \text{(G.2.5)}\\
&= H\Big(\{x_i+z_i\}_{i\in[N]},\sum_{i\in\mathcal{U}_1}z_i,\sum_{i\in\mathcal{U}_1}n_i\,\Big|\,\sum_{i\in\mathcal{U}_1}x_i,\{x_i\}_{i\in\mathcal{T}},\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big)\\
&\quad - H\Big(\{z_i\}_{i\in[N]},\sum_{i\in\mathcal{U}_1}z_i,\sum_{i\in\mathcal{U}_1}n_i\,\Big|\,\{x_i\}_{i\in[N]},\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big) && \text{(G.2.6)}\\
&= H\Big(\{x_i+z_i\}_{i\in[N]\setminus\mathcal{T}},\sum_{i\in\mathcal{U}_1}z_i,\sum_{i\in\mathcal{U}_1}n_i\,\Big|\,\sum_{i\in\mathcal{U}_1}x_i,\{x_i\}_{i\in\mathcal{T}},\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big)\\
&\quad - H\Big(\{z_i\}_{i\in[N]},\sum_{i\in\mathcal{U}_1}z_i,\sum_{i\in\mathcal{U}_1}n_i\,\Big|\,\{x_i\}_{i\in[N]},\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big) && \text{(G.2.7)}\\
&= H\Big(\{x_i+z_i\}_{i\in[N]\setminus\mathcal{T}}\,\Big|\,\sum_{i\in\mathcal{U}_1}x_i,\{x_i\}_{i\in\mathcal{T}},\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big)\\
&\quad + H\Big(\sum_{i\in\mathcal{U}_1}z_i,\sum_{i\in\mathcal{U}_1}n_i\,\Big|\,\{x_i+z_i\}_{i\in[N]\setminus\mathcal{T}},\sum_{i\in\mathcal{U}_1}x_i,\{x_i\}_{i\in\mathcal{T}},\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big)\\
&\quad - H\Big(\{z_i\}_{i\in[N]}\,\Big|\,\{x_i\}_{i\in[N]},\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big)\\
&\quad - H\Big(\sum_{i\in\mathcal{U}_1}z_i,\sum_{i\in\mathcal{U}_1}n_i\,\Big|\,\{z_i\}_{i\in[N]},\{x_i\}_{i\in[N]},\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big) && \text{(G.2.8)}\\
&= H\Big(\{x_i+z_i\}_{i\in[N]\setminus\mathcal{T}}\,\Big|\,\sum_{i\in\mathcal{U}_1}x_i,\{x_i\}_{i\in\mathcal{T}},\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big)\\
&\quad + H\Big(\sum_{i\in\mathcal{U}_1}n_i\,\Big|\,\{x_i+z_i\}_{i\in[N]\setminus\mathcal{T}},\sum_{i\in\mathcal{U}_1}x_i,\{x_i\}_{i\in\mathcal{T}},\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big)\\
&\quad - H\Big(\{z_i\}_{i\in[N]\setminus\mathcal{T}}\,\Big|\,\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big)
 - H\Big(\sum_{i\in\mathcal{U}_1}n_i\,\Big|\,\{z_i\}_{i\in[N]},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big) && \text{(G.2.9)}\\
&= H\Big(\{x_i+z_i\}_{i\in[N]\setminus\mathcal{T}}\,\Big|\,\sum_{i\in\mathcal{U}_1}x_i,\{x_i\}_{i\in\mathcal{T}},\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big)\\
&\quad + H\Big(\sum_{i\in\mathcal{U}_1}n_i\,\Big|\,\{x_i+z_i\}_{i\in[N]\setminus\mathcal{T}},\sum_{i\in\mathcal{U}_1}x_i,\{x_i\}_{i\in\mathcal{T}},\{z_i\}_{i\in\mathcal{T}},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big)\\
&\quad - H\big(\{z_i\}_{i\in[N]\setminus\mathcal{T}}\big)
 - H\Big(\sum_{i\in\mathcal{U}_1}n_i\,\Big|\,\{z_i\}_{i\in[N]},\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}\Big) && \text{(G.2.10)}\\
&= 0, && \text{(G.2.11)}
\end{aligned}
$$

where equation G.2.6 follows from the fact that $\{\sum_{j\in\mathcal{U}_1}[\tilde z_j]_i\}_{i\in\mathcal{U}_1}$ is invertible to $\sum_{i\in\mathcal{U}_1}z_i$ and $\sum_{i\in\mathcal{U}_1}n_i$. Equation G.2.7 holds since $\{x_i+z_i\}_{i\in\mathcal{T}}$ is a deterministic function of $\{z_i\}_{i\in\mathcal{T}}$ and $\{x_i\}_{i\in\mathcal{T}}$. Equation G.2.8 follows from the chain rule. In equation G.2.9, the second term follows from the fact that $\sum_{i\in\mathcal{U}_1}z_i$ is a deterministic function of $\{x_i+z_i\}_{i\in[N]\setminus\mathcal{T}}$, $\sum_{i\in\mathcal{U}_1}x_i$, $\{x_i\}_{i\in\mathcal{T}}$, and $\{z_i\}_{i\in\mathcal{T}}$; the third term follows from the independence of the $x_i$'s and $z_i$'s; the last term follows from the fact that $\sum_{i\in\mathcal{U}_1}z_i$ is a deterministic function of $\{z_i\}_{i\in[N]}$, together with the independence of the $n_i$'s and $x_i$'s. In equation G.2.10, the third term follows from Lemma 1.
Equation G.2.11 follows from: 1) $\sum_{i\in\mathcal{U}_1}n_i$ is a function of $\{x_i+z_i\}_{i\in[N]\setminus\mathcal{T}}$, $\sum_{i\in\mathcal{U}_1}x_i$, $\{x_i\}_{i\in\mathcal{T}}$, $\{z_i\}_{i\in\mathcal{T}}$, and $\{[\tilde z_j]_i\}_{j\in[N],i\in\mathcal{T}}$; 2) $\sum_{i\in\mathcal{U}_1}n_i$ is a function of $\{z_i\}_{i\in\mathcal{U}_1}$ and $\{[\tilde z_j]_i\}_{j\in\mathcal{U}_1,i\in\mathcal{T}}$; 3) $z_i$ is uniformly distributed and hence has the maximum entropy in $\mathbb{F}_q^d$; these facts, combined with the non-negativity of mutual information, yield equation G.2.11.

G.2.1 Discussion

As shown in Table G.1, compared with the SecAgg protocol [37], LightSecAgg significantly improves the computational efficiency at the server during aggregation. SecAgg requires the server to retrieve T+1 secret shares of a secret key for each of the N users, and to compute a single PRG function if the user survives or N-1 PRG functions to recover the N-1 pairwise masks if the user drops out, yielding a total computational load of O(N^2 d) at the server. In contrast, as analyzed in Section 8.5.2, for U = O(N), LightSecAgg incurs an almost constant (O(d log N)) computational load at the server. This admits a scalable design and is expected to achieve a much faster end-to-end execution for a large number of users, given that the overall execution time of SecAgg is dominated by the server's computation [37, 38]. SecAgg has a smaller storage overhead than LightSecAgg, since it stores secret shares of keys with small sizes (e.g., as small as an integer) and the model size d is much larger than the number of users N in typical FL scenarios. This also allows SecAgg to have a smaller communication load in the phase of aggregate-model recovery. Finally, we note that another advantage of LightSecAgg over SecAgg is its reduced dependence on cryptographic primitives such as a public key infrastructure and a key agreement mechanism, which further simplifies the implementation of the protocol. SecAgg+ [19] improves both the communication and the computational load of SecAgg by considering a sparse random graph of degree O(log N), which reduces the complexity by a factor of O(N / log N). However, SecAgg+ still incurs an O(dN log N) computational load at the server, which is much larger than the O(d log N) computational load at the server in LightSecAgg when U = O(N).

Table G.1: Complexity comparison between SecAgg [37], SecAgg+ [19], and LightSecAgg. Here N is the total number of users. The parameters d and s respectively represent the model size and the length of the secret keys used as the seeds for the PRG, where s ≪ d. LightSecAgg and SecAgg provide worst-case privacy guarantee T and dropout-resiliency guarantee D for any T and D as long as T + D < N. SecAgg+ provides probabilistic privacy guarantee T and dropout-resiliency guarantee D. LightSecAgg selects three design parameters T, D, and U such that T < U ≤ N - D.

In synchronous FL with pairwise masking, the pairwise random masks cancel out in the aggregate because, for each pair i < j, user i adds $\mathrm{PRG}(a_{i,j}^{(t)})$ to $x_i^{(t)}$ while user j subtracts $\mathrm{PRG}(a_{i,j}^{(t)})$ from $x_j^{(t)}$. In asynchronous FL, however, the cancellation of the pairwise random masks based on the key agreement protocol is not guaranteed, due to the mismatch in staleness between the users. Specifically, at round t, user $i\in\mathcal{S}^{(t)}$ sends the masked model $y_i^{(t;t_i)}$ to the server, given by

$$ y_i^{(t;t_i)}=\Delta_i^{(t;t_i)}+\mathrm{PRG}\big(b_i^{(t_i)}\big)+\sum_{j:\,i<j}\mathrm{PRG}\big(a_{i,j}^{(t_i)}\big)-\sum_{j:\,i>j}\mathrm{PRG}\big(a_{j,i}^{(t_i)}\big), \tag{G.6.5} $$

where $\Delta_i^{(t;t_i)}$ is the local update defined in equation G.6.2. When $t_i\ne t_j$, the pairwise random vectors in $y_i^{(t;t_i)}$ and $y_j^{(t;t_j)}$ are not canceled out, since $a_{i,j}^{(t_i)}\ne a_{i,j}^{(t_j)}$. We note that the staleness of each user is not known a priori, hence each pair of users cannot use the same pairwise random seed.
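The cancellation failure above can be seen with a small toy example. The following sketch uses a seeded NumPy generator as a stand-in for the PRG and hypothetical pairwise seeds; it only illustrates the sign convention in equation G.6.5, not the actual SecAgg key agreement.

# Toy illustration (not the actual SecAgg construction): pairwise masks cancel
# only when both users derive them from the seed of the same round.
import numpy as np

def prg(seed, d=4, q=2**16):
    return np.random.default_rng(seed).integers(0, q, d)

q, d = 2**16, 4
a = {1: 1234, 2: 5678}             # hypothetical pairwise seeds a_{1,2}^{(t)} for rounds t = 1, 2
x1, x2 = np.arange(d), np.arange(d) * 10

# Synchronous case: both users use the round-2 seed, so masks cancel in the sum.
y1 = (x1 + prg(a[2], d, q)) % q    # user 1 (i < j) adds PRG(a_{1,2}^{(2)})
y2 = (x2 - prg(a[2], d, q)) % q    # user 2 (j > i) subtracts PRG(a_{1,2}^{(2)})
assert np.array_equal((y1 + y2) % q, (x1 + x2) % q)

# Asynchronous case: user 2 is stale and still uses the round-1 seed; no cancellation.
y2_stale = (x2 - prg(a[1], d, q)) % q
assert not np.array_equal((y1 + y2_stale) % q, (x1 + x2) % q)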
G.6.2 Asynchronous LightSecAgg

We now demonstrate how LightSecAgg can be applied to the asynchronous FL setting, where the server stores each local update in a buffer of size K and updates the global model by aggregating the stored updates when the buffer is full. Our key intuition is to encode the local masks in such a way that the server can recover the aggregate of the masks from the encoded masks via a one-shot computation, even though the masks are generated in different training rounds. The asynchronous LightSecAgg protocol also consists of three phases, with three design parameters D, T, and U defined in the same way as in synchronous LightSecAgg.

Synchronous and asynchronous LightSecAgg have two key differences: (1) in asynchronous FL, the users share the encoded masks together with their time stamps in the first phase, so that they can determine which encoded masks should be aggregated for the reconstruction of the aggregate of masks in the third phase; due to the commutative property of coding and addition, the server can reconstruct the aggregate of masks even though the masks are generated in different training rounds; (2) in asynchronous FL, the server compensates for the staleness of the local updates. This is challenging because the compensation must be carried out over the masked models in the finite field to preserve the privacy guarantee, while conventional compensation functions produce real-valued outputs [489, 330]. We now describe the three phases in detail.

G.6.3 Offline Encoding and Sharing of Local Masks

User i generates $z_i^{(t_i)}$ uniformly at random from the finite field $\mathbb{F}_q^d$, where $t_i$ is the global round index at which user i downloads the global model from the server. The mask $z_i^{(t_i)}$ is partitioned into $U-T$ sub-masks denoted by $[z_i^{(t_i)}]_1,\ldots,[z_i^{(t_i)}]_{U-T}$, where $U$ denotes the targeted number of surviving users and $N-D\ge U>T$. User i also selects another $T$ random masks denoted by $[n_i^{(t_i)}]_{U-T+1},\ldots,[n_i^{(t_i)}]_U$. These $U$ partitions $[z_i^{(t_i)}]_1,\ldots,[z_i^{(t_i)}]_{U-T},[n_i^{(t_i)}]_{U-T+1},\ldots,[n_i^{(t_i)}]_U$ are then encoded through an $(N,U)$ Maximum Distance Separable (MDS) code as

$$ [\tilde z_i^{(t_i)}]_j=\big([z_i^{(t_i)}]_1,\ldots,[z_i^{(t_i)}]_{U-T},[n_i^{(t_i)}]_{U-T+1},\ldots,[n_i^{(t_i)}]_U\big)\,W_j, \tag{G.6.6} $$

where $W_j$ is the $j$-th column of the Vandermonde matrix defined in equation 8.5. User i sends $[\tilde z_i^{(t_i)}]_j$ to user $j\in[N]\setminus\{i\}$. At the end of this phase, each user $i\in[N]$ holds $[\tilde z_j^{(t_j)}]_i$ from every $j\in[N]$.

G.6.4 Training, Quantizing, Masking, and Uploading of Local Updates

Each user i trains its local model as in equations G.6.2 and G.6.3. User i quantizes its local update $\Delta_i^{(t;t_i)}$ from the domain of real numbers to the finite field $\mathbb{F}_q$, since masking and MDS encoding are carried out in the finite field to provide information-theoretic privacy. The field size q is assumed to be large enough to avoid any wrap-around during secure aggregation. Quantization is a challenging task, as it must be performed in a way that ensures the convergence of the global model. Moreover, the quantization should allow the representation of negative integers in the finite field and enable computations to be carried out in the quantized domain. Therefore, we cannot utilize well-known gradient quantization techniques such as [6], which represents the sign of a negative number separately from its magnitude. LightSecAgg addresses this challenge with a simple stochastic quantization strategy combined with the two's complement representation, as described subsequently.
For any positive integer $c\ge 1$, we first define a stochastic rounding function

$$ Q_c(x)=\begin{cases}\dfrac{\lfloor cx\rfloor}{c} & \text{with probability } 1-(cx-\lfloor cx\rfloor),\\[4pt] \dfrac{\lfloor cx\rfloor+1}{c} & \text{with probability } cx-\lfloor cx\rfloor,\end{cases} \tag{G.6.7} $$

where $\lfloor x\rfloor$ is the largest integer less than or equal to $x$. This rounding function is unbiased, i.e., $\mathbb{E}_Q[Q_c(x)]=x$. The parameter $c$ determines the number of quantization levels, and the variance of $Q_c(x)$ decreases as $c$ increases. We then define the quantized update

$$ \overline{\Delta}_i^{(t;t_i)}:=\phi\big(c_l\cdot Q_{c_l}\big(\Delta_i^{(t;t_i)}\big)\big), \tag{G.6.8} $$

where the function $Q_{c_l}$ from equation G.6.7 is applied element-wise, and $c_l$ is a positive integer that determines the quantization level of the local updates. The mapping function $\phi:\mathbb{R}\to\mathbb{F}_q$ represents a negative integer in the finite field through the two's complement representation,

$$ \phi(x)=\begin{cases} x & \text{if } x\ge 0,\\ q+x & \text{if } x<0.\end{cases} \tag{G.6.9} $$

To protect the privacy of the local updates, user i masks the quantized update $\overline{\Delta}_i^{(t;t_i)}$ in equation G.6.8 as

$$ \widetilde{\Delta}_i^{(t;t_i)}=\overline{\Delta}_i^{(t;t_i)}+z_i^{(t_i)}, \tag{G.6.10} $$

and sends the pair $\big\{\widetilde{\Delta}_i^{(t;t_i)},t_i\big\}$ to the server. The local round index $t_i$ is used in two cases: (1) when the server identifies the staleness of each local update and compensates for it, and (2) when the users aggregate the encoded masks for one-shot recovery, which will be explained in Section G.6.5.

G.6.5 One-shot Aggregate-update Recovery and Global Model Update

The server stores $\widetilde{\Delta}_i^{(t;t_i)}$ in the buffer, and when the buffer of size K is full, the server aggregates the K masked local updates. In this phase, the server intends to recover

$$ \sum_{i\in\mathcal{S}^{(t)}} s(t-t_i)\,\Delta_i^{(t;t_i)}, \tag{G.6.11} $$

where $\Delta_i^{(t;t_i)}$ is the local update in the real domain defined in equation G.6.2, $\mathcal{S}^{(t)}$ (with $|\mathcal{S}^{(t)}|=K$) is the index set of users whose local updates are stored in the buffer and aggregated by the server at round t, and $s(\tau)$ is the staleness function defined in equation G.6.4. To do so, the first step is to reconstruct $\sum_{i\in\mathcal{S}^{(t)}} s(t-t_i)\,z_i^{(t_i)}$. This is challenging because the decoding must be performed in the finite field, while $s(\tau)$ is a real number. To address this problem, we introduce a quantized staleness function $s_{c_g}:\{0,1,\ldots\}\to\mathbb{F}_q$,

$$ s_{c_g}(\tau)=c_g\,Q_{c_g}(s(\tau)), \tag{G.6.12} $$

where $Q_c(\cdot)$ is the stochastic rounding function defined in equation G.6.7 and $c_g$ is a positive integer that determines the quantization level of the staleness function. The server then broadcasts $\mathcal{S}^{(t)}$, $\{t_i\}_{i\in\mathcal{S}^{(t)}}$, and $c_g$ to all surviving users. After identifying the selected users in $\mathcal{S}^{(t)}$, the local round indices $\{t_i\}_{i\in\mathcal{S}^{(t)}}$, and the corresponding staleness, user $j\in[N]$ aggregates its encoded sub-masks $\sum_{i\in\mathcal{S}^{(t)}} s_{c_g}(t-t_i)\,[\tilde z_i^{(t_i)}]_j$ and sends the result to the server for one-shot recovery. The key difference between asynchronous and synchronous LightSecAgg is that, in the asynchronous setting, the time stamps $t_i$ of the encoded masks $[\tilde z_i^{(t_i)}]_j$ for $i\in\mathcal{S}^{(t)}$ can differ, hence user $j\in[N]$ must aggregate each encoded mask with the proper round index. Due to the commutative property of coding and linear operations, each $\sum_{i\in\mathcal{S}^{(t)}} s_{c_g}(t-t_i)\,[\tilde z_i^{(t_i)}]_j$ is an encoded version of $\sum_{i\in\mathcal{S}^{(t)}} s_{c_g}(t-t_i)\,[z_i^{(t_i)}]_k$ for $k\in[U-T]$ under the MDS (Vandermonde) matrix defined in equation G.6.6. Thus, after receiving a set of any $U$ results from the surviving users in $\mathcal{U}_2$, where $|\mathcal{U}_2|=U$, the server reconstructs $\sum_{i\in\mathcal{S}^{(t)}} s_{c_g}(t-t_i)\,[z_i^{(t_i)}]_k$ for $k\in[U-T]$ via MDS decoding.
By concatenating the $U-T$ aggregated sub-masks $\sum_{i\in\mathcal{S}^{(t)}} s_{c_g}(t-t_i)\,[z_i^{(t_i)}]_k$, the server recovers $\sum_{i\in\mathcal{S}^{(t)}} s_{c_g}(t-t_i)\,z_i^{(t_i)}$. Finally, the server obtains the desired global update as

$$ g^{(t)}=\frac{1}{c_g\,c_l\sum_{i\in\mathcal{S}^{(t)}} s_{c_g}(t-t_i)}\;\phi^{-1}\Big(\sum_{i\in\mathcal{S}^{(t)}} s_{c_g}(t-t_i)\,\widetilde{\Delta}_i^{(t;t_i)}-\sum_{i\in\mathcal{S}^{(t)}} s_{c_g}(t-t_i)\,z_i^{(t_i)}\Big), \tag{G.6.13} $$

where $c_l$ is defined in equation G.6.8 and $\phi^{-1}:\mathbb{F}_q\to\mathbb{R}$ is the demapping function

$$ \phi^{-1}(x)=\begin{cases} x & \text{if } 0\le x<\frac{q-1}{2},\\ x-q & \text{if } \frac{q-1}{2}\le x<q.\end{cases} $$
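The quantization pipeline of equations G.6.7-G.6.10 and the recovery via $\phi^{-1}$ can be sketched in a few lines. The snippet below is a minimal illustration under an assumed toy field size; the function names (quantize, phi, phi_inv) and parameter values are ours, not part of the FedML/LightSecAgg code.

# Minimal sketch of the stochastic rounding Q_c (equation G.6.7) and the
# two's-complement mapping phi / phi^{-1} (equation G.6.9 and its inverse);
# q, c_l, and all names here are illustrative toy choices.
import numpy as np

q = 2**31 - 1                      # toy prime field size
rng = np.random.default_rng(0)

def quantize(x, c):
    """Return c * Q_c(x) as integers, where Q_c is the unbiased rounding in G.6.7."""
    scaled = c * np.asarray(x, dtype=np.float64)
    low = np.floor(scaled)
    frac = scaled - low            # round up with probability frac (unbiased)
    return (low + (rng.random(low.shape) < frac)).astype(np.int64)

def phi(v):
    """Map signed integers to F_q via the two's-complement representation."""
    v = np.asarray(v, dtype=np.int64)
    return np.where(v >= 0, v, q + v) % q

def phi_inv(v):
    """Inverse map from F_q back to signed integers."""
    v = np.asarray(v, dtype=np.int64)
    return np.where(v < (q - 1) // 2, v, v - q)

# Quantize a local update as in equation G.6.8, mask it, then recover it.
c_l = 1024
delta = np.array([0.30, -1.25, 0.0049])
quantized = phi(quantize(delta, c_l))
mask = rng.integers(0, q, delta.shape)
masked = (quantized + mask) % q            # what the server would receive
recovered = phi_inv((masked - mask) % q) / c_l
print(recovered)                           # close to delta (within 1/c_l)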
Abstract
Federated learning (FL) is a machine learning paradigm in which many clients (e.g., edge servers or mobile/IoT devices) collaboratively train a model while keeping the training data decentralized. It has shown huge potential in mitigating many of the systemic privacy risks, regulatory restrictions, and communication costs resulting from traditional, over-the-cloud machine learning and data science approaches in healthcare, finance, smart cities, autonomous driving, and the Internet of Things. Though FL is promising, turning it into trustworthy data-centric AI infrastructure faces many challenges from learning algorithms (e.g., data heterogeneity, label deficiency) and distributed systems (resource constraints, system heterogeneity, security, privacy, etc.), requiring interdisciplinary research in machine learning, distributed systems, and security/privacy. Toward this goal, this thesis focuses on scaling federated and distributed machine learning end-to-end, from systems to algorithms to applications.
In the first part, we focus on the design of distributed systems for federated and distributed machine learning. We propose FedML, a now widely adopted open-source library for federated learning. We also propose PipeTransformer, which leverages automated elastic pipelining for efficient distributed training of Transformer models. FedML supports three computing paradigms: on-device training using a federation of edge devices, distributed training in the cloud that supports the exchange of auxiliary information beyond just gradients, and single-machine simulation of a federated learning algorithm. FedML also promotes diverse algorithmic research with a flexible and generic API design and comprehensive reference baseline implementations (optimizers, models, and datasets). In PipeTransformer, we design an adaptive on-the-fly freeze algorithm that identifies and gradually freezes some layers during training, and an elastic pipelining system that dynamically allocates resources to train the remaining active layers. More specifically, PipeTransformer automatically excludes frozen layers from the pipeline, packs active layers onto fewer GPUs, and forks more replicas to increase the data-parallel width.
In the second part, we propose a series of algorithms to scale up federated learning by breaking many of the aforementioned constraints: FedGKT, an edge-cloud collaborative training method for resource-constrained clients; FedNAS, a method toward automation on invisible data via neural architecture search; SpreadGNN, effective training over decentralized topologies; SSFL, which tackles label deficiency via personalized self-supervision; and LightSecAgg, a lightweight and versatile secure aggregation protocol. Most of these algorithms are compatible with one another. Specifically, we unify all implementations under the FedML framework. Therefore, under the complex constraints of the real world, orchestrating these algorithms has the potential to greatly enhance the scalability of federated learning.
Finally, we propose the FedML Ecosystem, a family of open research libraries that facilitate federated learning research in diverse application domains: FedNLP (Natural Language Processing), FedCV (Computer Vision), FedGraphNN (Graph Neural Networks), and FedIoT (Internet of Things). Compared with TFF and LEAF, FedNLP and FedCV greatly enrich the diversity of datasets and learning tasks. FedNLP supports various popular task formulations in the NLP domain, such as text classification, sequence tagging, question answering, seq2seq generation, and language modeling. FedCV helps researchers evaluate the three most representative vision tasks: image classification, image segmentation, and object detection. Moreover, FedGraphNN is the first FL research platform for analyzing graph-structured data using Graph Neural Networks in a distributed computing manner. FedIoT further extends FL to wireless communication (e.g., 5G) and mobile computing (e.g., embedded IoT devices such as the Raspberry Pi and smartphones running Android OS).
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Building straggler-resilient and private machine learning systems in the cloud
Theoretical foundations for dealing with data scarcity and distributed computing in modern machine learning
Enhancing privacy, security, and efficiency in federated learning: theoretical advances and algorithmic developments
Coding centric approaches for efficient, scalable, and privacy-preserving machine learning in large-scale distributed systems
Striking the balance: optimizing privacy, utility, and complexity in private machine learning
Taming heterogeneity, the ubiquitous beast in cloud computing and decentralized learning
Algorithm and system co-optimization of graph and machine learning systems
Heterogeneous federated learning
Coded computing: Mitigating fundamental bottlenecks in large-scale data analytics
Scaling recommendation models with data-aware architectures and hardware efficient implementations
Scaling up deep graph learning: efficient algorithms, expressive models and fast acceleration
On scheduling, timeliness and security in large scale distributed computing
Coded computing: a transformative framework for resilient, secure, private, and communication efficient large scale distributed computing
Fast and label-efficient graph representation learning
High-throughput methods for simulation and deep reinforcement learning
Scaling up temporal graph learning: powerful models, efficient algorithms, and optimized systems
High-performance distributed computing techniques for wireless IoT and connected vehicle systems
Algorithms and systems for continual robot learning
Edge-cloud collaboration for enhanced artificial intelligence
Accelerating reinforcement learning using heterogeneous platforms: co-designing hardware, algorithm, and system solutions
Asset Metadata
Creator
He, Chaoyang
(author)
Core Title
Federated and distributed machine learning at scale: from systems to algorithms to applications
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2022-08
Publication Date
08/05/2022
Defense Date
03/25/2022
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
deep learning,distributed systems,distributed training,federated learning,machine learning,OAI-PMH Harvest,open source library,privacy,Security
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Avestimehr, Salman (committee chair), Annavaram, Murali (committee member), Nevatia, Ram (committee member), Ren, Xiang (committee member), Soltanolkotabi, Mahdi (committee member)
Creator Email
chaoyanghe.com@gmail.com,chaoyanh@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC111376197
Unique identifier
UC111376197
Legacy Identifier
etd-HeChaoyang-11110
Document Type
Dissertation
Format
application/pdf (imt)
Rights
He, Chaoyang
Type
texts
Source
20220806-usctheses-batch-971 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu
Tags
deep learning
distributed systems
distributed training
federated learning
machine learning
open source library