HETEROGENEOUS FEDERATED LEARNING
by
Dimitris Stripelis
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2023
Copyright 2023 Dimitris Stripelis
Dedication
To the aspiring and visionary.
I dedicate this thesis to my parents.
Acknowledgements
I want to thank the many people who helped through comments and discussions while writing the various
papers composing this thesis. I want to express my gratitude to my advisor José Luis Ambite for his contin-
ual help, support, and advice throughout this years-long journey. I really enjoyed our lengthy brainstorming
sessions and I was always impressed by his ability to provide constructive feedback with positivity and
humor. I would also like to thank (in alphabetical order) my thesis committee members Cyrus Shahabi,
Greg Ver Steeg, Meisam Razaviyayn, and Paul M. Thompson for their valuable comments. Especially,
Cyrus for his confidence in supporting me during my first years at USC as a master’s and later as a Ph.D.
student and for allowing me to be part of the InfoLab group all these years. I would also like to thank
Paul deeply for his invaluable feedback on shaping and formalizing many of the presented research ideas,
particularly in the neuroimaging domain. Without his expertise and contributions, this thesis would not
have been possible. Finally, I want to thank my collaborators at USC and from other institutions: Chryso-
valantis (Chrys) Anastasiou, Hamza Saleem, Manuel Namici, Marcin Abram, Nikhil Dhinagar, Muhammad
Naveed, Ritesh Ahuja, Srivatsan Ravi, and Umang Gupta, for their enormous help to develop and deliver
many of the presented research ideas. Particularly, Umang for helping me become a better researcher with
his methodological thinking when tackling extremely challenging research problems, and Chrys for help-
ing me become a better developer with his profound engineering skills. I am thankful to my family and my
dearest friends back in Greece for their support and encouragement to keep pushing forward and pursue
my goals and ideas.
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Learning Without Data Sharing
    1.1 Meaningful Data Analysis using Deep Learning
    1.2 From Distributed to Federated Deep Learning
    1.3 The Case for Federated Learning
    1.4 Federated Learning Applications
    1.5 Thesis Structure
Chapter 2: Federated Learning
    2.1 Federated Optimization
        2.1.1 Optimization Formulation
        2.1.2 Local Model Optimization
        2.1.3 Global Model Optimization
    2.2 Comparing Centralized to Federated Model Performance
        2.2.1 Computer Vision
        2.2.2 Named Entity Recognition
        2.2.3 Neuroimaging
    2.3 Security & Privacy Concerns
        2.3.1 Threat Models
        2.3.2 Privacy Risks
        2.3.3 Security Threats
Chapter 3: Heterogeneity in Federated Learning
    3.1 System Heterogeneity
        3.1.1 Processing Heterogeneity
        3.1.2 Communication Heterogeneity
    3.2 Statistical Heterogeneity
    3.3 Semantic Heterogeneity
        3.3.1 Data Storage & Schema Heterogeneity
        3.3.2 Data Value Heterogeneity
    3.4 Thesis Statement
Chapter 4: Accelerated Learning in Heterogeneous Environments
    4.1 Training Protocols
        4.1.1 Synchronous Protocols
        4.1.2 Asynchronous Protocols
        4.1.3 Semi-Synchronous Protocol
        4.1.4 Training Policies Cost Analysis
        4.1.5 Federated Training Environment
        4.1.6 Protocols Evaluation
    4.2 MetisFL: A Scalable Federated Training Framework
        4.2.1 Programming Model
        4.2.2 Federation Controller Optimizations
        4.2.3 FL Systems Comparison
    4.3 Model Sparsification
        4.3.1 FedSparsify: Federated Purge-Merge-Tune
        4.3.2 FedSparsify Convergence Analysis
        4.3.3 FedSparsify Evaluation
        4.3.4 Sparsified Federated Neuroimaging
Chapter 5: Secure, Private & Robust Federated Learning
    5.1 Protection through Encryption
        5.1.1 Homomorphic Encryption
        5.1.2 Federated Learning with Homomorphic Encryption
        5.1.3 Secure Federated Neuroimaging
    5.2 Protection through Data Leakage Prevention
        5.2.1 Membership Inference Attacks
            5.2.1.1 Attacks Setup
            5.2.1.2 Attacks in Centralized Settings
            5.2.1.3 Attacks in Federated Settings
        5.2.2 Defensive Mechanisms
            5.2.2.1 Gaussian Noise
            5.2.2.2 Learning with Non-unique Gradients
    5.3 Robust Training under Data Corruption
        5.3.1 Data Corruption Background
        5.3.2 Theoretical Limits of Data Corruption
        5.3.3 The Performance Weighting Scheme
        5.3.4 Performance Scoring Evaluation
Chapter 6: Data Harmonization for Federated Learning
    6.1 Principled Data Integration
        6.1.1 Logical Integration through Schema Mappings
            6.1.1.1 Domain Schema
            6.1.1.2 Schema Mappings
            6.1.1.3 Query Rewriting
        6.1.2 Data Imputation
            6.1.2.1 Generating Missing Values
            6.1.2.2 Handling Missing Values
    6.2 Logical Data Integration
        6.2.1 Data Mediation for Schizophrenia Neuroimaging
        6.2.2 Spark Mediator: Large-Scale Mediated Data Analysis
            6.2.2.1 Optimizing Spark SQL Federation Engine
            6.2.2.2 SchizConnect Evaluation
    6.3 Federated Learning INTegration (FLINT)
        6.3.1 The Federated Data Integration Problem
        6.3.2 Federated Data Harmonization & Imputation
Chapter 7: Future Research Directions
Bibliography
Appendix A
    Training Protocols Convergence Analysis
    A.1 Semi-Synchronous
    A.2 FedSparsify
List of Tables

2.1 Final learning performance of centralized vs federated models. The Clients-Uni and Clients-Ske environments refer to the federation environments with uniform and skewed assignment of training samples across clients.
2.2 UKBB Evaluation. MSE: Mean Square Error. RMSE: Root MSE. MAE: Mean Absolute Error. Corr: Correlation. Mean and std values for 3 runs.
2.3 AD: Train/test splits per cohort and target label.
2.4 Alzheimer's Disease Prediction. Test results on a global stratified test dataset (5 sites), for each dataset by itself; 3 sites, ADNI 1,2,3 (3A); 4 sites, ADNI 1,2,3 + OASIS (4AO), and 5 sites, ADNI 1,2,3 + OASIS + AIBL. In federated environments, each dataset is at a different learner. Centralized environments are trained over all the corresponding datasets.
4.1 Federated Learning Training Policies: Characteristics
4.2 CIFAR-10 Performance Metrics on Homogeneous Cluster. SemiSync with Momentum outperforms all other Synchronous and Semi-Synchronous policies. The first column refers to the federated learning domain and the target accuracy that each policy needs to reach. Column 'Com. Cost' is an abbreviation for Communication Cost. The total number of models exchanged during training is twice the communication cost value. For every experiment the energy savings are computed against the synchronous Vanilla SGD baseline.
4.3 CIFAR-10 Performance Metrics on Heterogeneous Cluster. Semi-Synchronous with Momentum outperforms all other policies in time, communication, and energy costs. The first column refers to the federated learning domain and the target accuracy that each policy needs to reach. (Iter. = total number of local iterations, PT(s) = parallel processing time in seconds, CT(s) = cumulative processing time in seconds, CC = communication cost, and EC(EF) = energy cost with energy efficiency factor). For every experiment, the energy savings are computed against the synchronous Vanilla SGD baseline.
4.4 Qualitative comparison of different FL Systems.
4.5 Comparison of sparse (FedSparsify-Global) and non-sparse (FedAvg) federated models in the FashionMNIST, CIFAR-10, and CIFAR-100 Non-IID environments with 10 clients. Inference evaluations are done on models obtained at the end of training. Sparsity 0.0 represents FedAvg. C.C. is the communication cost in millions (MM) of parameters exchanged. Inference efficiency is measured by the mean processing time per batch (Inf.Latency - ms/batch), the number of iterations (Inf.Iterations), and processed examples per second (Inf.Throughput - examples/sec). Values in parentheses show the reduction factor (model size, communication cost, and inference latency) and increase/speedup factor (inference iterations and throughput) compared to non-pruning.
4.6 BrainAGE Federated Models Comparison in the Skewed-IID Environment.
5.1 Membership inference attack accuracies on centrally trained models (averaged over 5 attacks).
5.2 Average attack accuracies on 3D and 2D-CNN federated models (across all successful attacks). Numbers in parentheses indicate the median number of successful attacks over 5 runs.
6.1 Structure of source-level queries
List of Figures

1.1 World map with countries that have (green) or have not (red) enacted or drafted data privacy laws along with the names of some major data privacy laws.
1.2 Representative Centralized and Federated Learning Environments.
1.3 Performance evaluation of standalone silos against the federated (community) model on the BrainAge task. Federated learning leverages all available data points and learns a more performant model.
2.1 CIFAR10 Centralized vs FedAvg.
2.2 CoNLL-2003 dataset: B-LOC, B-ORG and B-PER entity (tag) distribution for each client within each federation environment.
2.3 CoNLL-2003 dataset: Number of unique B-LOC, B-ORG and B-PER entity (tag) distribution for each client within each federation environment.
2.4 CoNLL-2003 dataset B-LOC, B-ORG, and B-PER entity (tag) common occurrences across clients for each federation environment (log scale).
2.5 NER model learning performance in centralized and federated settings.
2.6 BrainAgeCNN
2.7 UKBB Federation Data Distributions
2.8 Centralized vs Federated BrainAGE performance comparison.
2.9 Federated Learning Threats Taxonomy.
3.1 A Typical Federated Learning System Architecture.
3.2 Federated Learning Heterogeneities.
3.3 Active vs Idle time for a computationally heterogeneous federated learning environment consisting of 5 fast (GPUs) and 5 slow (CPUs) learners training on the CIFAR-10 dataset using a 2-CNN model.
3.4 Effect of computational (processing) heterogeneity on federated model learning performance.
3.5 Statistically heterogeneous federated learning environments in CIFAR-10.
3.6 Effect of heterogeneous data distributions on federated learning performance.
3.7 Semantic Heterogeneities.
3.8 Federated Model Evaluation on Silo Removal due to Storage and Schema Heterogeneity. For MSE scores, the lower the value, the better.
3.9 Federated Model Evaluation on Removed Training Samples due to Missing Values in IID and NonIID Environments. For MSE scores, the lower the value, the better.
4.1 Federated Learning Training Policies: Execution Flow
4.2 Active vs Idle time for a heterogeneous computational environment with 5 fast (GPUs) and 5 slow (CPUs) learners. Synchronous protocol was run with 4 epochs per learner and Semi-Synchronous with λ = 2 (including cold start federation round).
4.3 Community model computation with (left) and without (right) caching.
4.4 CIFAR and ExtendedMNIST Sample Target Class and Data Size Distributions
4.5 UK Biobank Data Distributions. Top Row: Amount of data examples across learners in terms of age buckets/ranges. Bottom Row: Local age distribution (histogram) of each learner.
4.6 Homogeneous Computational Environment on CIFAR-10. SemiSync with Momentum (λ = 2) has the fastest convergence. ('G' = GPU in data distribution insets)
4.7 Heterogeneous Computational Environment on CIFAR-10. SemiSync with Momentum (λ = 2) has the fastest convergence and lowest communication cost for a given accuracy. ('G' = GPU, 'C' = CPU)
4.8 Heterogeneous Computational Environment on CIFAR-100. SemiSync with Momentum (λ = 0.5) significantly outperforms all other policies in this challenging domain. ('G' = GPU, 'C' = CPU)
4.9 Heterogeneous Computational Environments on ExtendedMNIST By Class. SemiSync with Vanilla SGD (λ = 1) outperforms all other policies, with faster convergence and less communication cost.
4.10 Homogeneous Computation, Heterogeneous Data Environments on BrainAGE. SemiSync with Vanilla SGD (λ = 4) converges faster and with less communication cost compared to its synchronous counterpart. SemiSync with λ = 3 is close to the performance of the centralized model in Skewed & IID.
4.11 Metis Federated Learning Framework Architecture
4.12 A typical Federated Learning workflow. Red color represents operations executed by the controller. Green color represents operations executed by the learners.
4.13 Evaluation Task Dispatch Time over Increasing Number of Learners (10 to 100) and Model Sizes (10^5 to 10^7 parameters).
4.14 Model Aggregation Time with or without Optimization over Increasing Number of Learners.
4.15 Execution flow diagrams for different federated sparsification methods.
4.16 FashionMNIST Federated Data Distributions.
4.17 CIFAR-10 Federated Data Distributions
4.18 CIFAR-100 Federated Data Distributions
4.19 FedSparsify Tuning. Top row shows the convergence when exploring different sparsification frequency values with FedSparsify-Global at 0.95 sparsity on FashionMNIST with 10 clients over Non-IID data distribution with respect to Federation Rounds (Figure 4.19a) and Transmission Cost (Figure 4.19b); the exponent value for these experiments is set to 3. The middle row (Figures 4.19c and 4.19d) shows the exponent hyperparameter exploration in terms of Federation Rounds and Transmission Cost, respectively. An exponent of n = 3 provides a good trade-off between sparsification and model performance.
4.20 Convergence of FedSparsify-Local with Majority-Voting (MV) as aggregation rule and FedSparsify-Local with Weighted Average (FedAvg/Avg) as aggregation rule on FashionMNIST with 10 clients over IID and Non-IID data distributions at 0.9 sparsity.
4.21 Sparsity vs. Test Accuracy for 10 clients. FedSparsify outperforms pruning alternatives, and is comparable to no-pruning.
4.22 Sparsity vs. Test Accuracy for 100 clients (0.1 participation rate). FedSparsify outperforms pruning alternatives and is comparable to or better than no-pruning (particularly in non-IID domains).
4.23 Federation Round vs. Accuracy for FashionMNIST (top row), CIFAR-10 (middle row) over the course of 200 federation rounds and for CIFAR-100 (bottom row) over the course of 100 federation rounds. Across all environments, SNIP, GraSP, Random, FedSparsify-Local and FedSparsify-Global convergence is shown at 0.9 model sparsity and PruneFL at 0.3.
4.24 Transmission Cost vs. Accuracy for FashionMNIST (top row), CIFAR-10 (middle row) over the course of 200 federation rounds, and for CIFAR-100 (bottom row) over the course of 100 federation rounds. Across all environments, SNIP, GraSP, Random, FedSparsify-Local, and FedSparsify-Global convergence is shown at 0.9 model sparsity and PruneFL at 0.3.
4.25 Federated BrainAGE models parameters progression without (FedAvg) and with (FedSparsify) sparsification for different sparsification degrees.
4.26 Federated BrainAGE models learning performance at different degrees of sparsification across all four federated learning environments. Dashed line represents performance of the non-sparsified model.
5.1 Federated System Architecture with Encryption
5.2 CIFAR10 Evaluation with and without FHE. Federation Rounds as x-axis.
5.3 CIFAR10 Evaluation with and without FHE. Wall-Clock time as x-axis.
5.4 MetisFL Training Pipeline with Encryption.
5.5 Federated Learning (SyncFedAvg) with and without CKKS homomorphic encryption on the BrainAge 3D-CNN. The vertical marker represents the training time it takes for each approach to complete 20 federation rounds.
5.6 Distribution of prediction error and gradient magnitudes from the trained models in a centralized setting.
5.7 Privacy vulnerability increases with federation rounds. Vulnerability is measured as the average accuracy of distinguishing train samples vs unseen samples across learners. The model architecture used is 3D-CNN.
5.8 Vulnerability vs Performance trade-off when training learners with Differential Privacy (Gaussian Noise), the Non-unique Gradients approach, and a model trained to achieve 99% sparsity in the final global model. Lower vulnerability and lower MAE are desired, i.e., points towards the bottom left are better. The model architecture used is 3D-CNN.
5.9 A robustness analysis of a perfect learner under different levels of data corruption (label flipping). The chance that a perfect learner can learn the correct association between a specific class of objects (e.g., pictures of dogs) and a corresponding label (e.g., the description "dog"), under different levels of data corruption c and with a different number of available training examples per class n.
5.10 A comparison between a perfect and a realistic learner. The upper plot shows a theoretical upper bound for a perfect learner to learn the correct association (a 10-classes case). The lower plot shows experimental results on Fashion-MNIST.
5.11 Execution pipeline of the performance weighting aggregation scheme. Initially, learners train locally on their local dataset, and the controller receives the trained models and sends them for evaluation to the evaluation service of every participating learner. The controller aggregates all models based on their performance scores and computes the new community model. With the computation of the new community model, the new federation round begins.
5.12 Federated training policies convergence for different data corruption environments. The stacked bar chart inset shows each environment's data distribution; hatch bar represents corruption.
5.13 Federated training policies performance over incremental degrees of corruption (increasing number of corrupted learners).
5.14 Learners performance score (contribution value) based on DVW-GMean over different data corruption environments.
6.1 SchizConnect Domain Model (selected predicates).
6.2 SchizConnect Sample Schema Mappings.
6.3 SchizConnect Query Rewriting.
6.4 Apache Spark Mediator Diagram.
6.5 Execution time for 8 queries.
6.6 A Harmonized Federated Learning Workflow.
6.7 Federated Learning and Integration Architecture and Internal Components.
6.8 Global Schema and Schema Mapping Rules.
6.9 Comparing Simple Intra-Silo and Inter-Silo Imputation in the California Housing Dataset.
Abstract
Data relevant to machine learning problems are often distributed across multiple data silos that cannot
share their data due to regulatory, competitiveness, or privacy reasons. Federated Learning has emerged as
a standard computational paradigm for distributed training of machine learning and deep learning models
across silos. However, the participating silos may have heterogeneous system capabilities and data specifi-
cations. In this thesis, we address the challenges in federated learning arising from both computational and
semantic heterogeneity. We present federated training policies that accelerate the convergence of the fed-
erated model and lead to reduced communication, processing, and energy costs during model aggregation,
training, and inference. We show the efficacy of these policies across a wide range of challenging federated
environments with highly diverse data distributions in benchmark domains and in neuroimaging. We con-
clude by describing the federated data harmonization problem and presenting a comprehensive federated
learning and integration system architecture that addresses the critical challenges of secure and private
federated data harmonization, including schema mapping, data normalization, and data imputation.
Chapter 1
Learning Without Data Sharing
Data useful for a machine learning problem is often generated at multiple, distributed locations. In many
situations, these data cannot be exported from their original location due to regulatory, competitiveness,
or privacy reasons. A primary motivating example is health records, which are heavily regulated and
protected, restricting the ability to analyze large datasets. Industrial data (e.g., accident or safety data) is
also not shared for competitiveness reasons. Given recent high-profile data leak incidents, e.g., Facebook
in 2018 and Equifax in 2017, more strict data regulatory frameworks have been enacted in many countries,
such as the European Union’s General Data Protection Regulation (GDPR), China’s Personal Information
Protection Law (PIPL), and the California Consumer Privacy Act (CCPA). Figure 1.1 shows a sample of major data privacy bills passed across the world, along with the countries that have (green color) or have not (red color) put in place legislation to secure the protection of data and privacy; the legislation data were gathered by the United Nations Conference on Trade and Development.* In total, 137 out of 194 countries worldwide have enacted or drafted data privacy laws. These data regulatory frameworks govern data accessing, processing, and analysis procedures and can provide protection for personal and proprietary data from illegal access and malicious use. When we also consider the increasing amount of data being generated across various disciplines, new tools are required that allow meaningful large-scale data analysis while complying with the introduced data privacy and security legislation. These situations bring data distribution, security, and privacy to the forefront of the machine learning ecosystem and impose new challenges on how data should be processed and analyzed while respecting privacy.

* https://unctad.org/page/data-protection-and-privacy-legislation-worldwide
[Figure 1.1 image: world map annotated with major data privacy laws, including CCPA, GDPR, RFL, PIPL, APA, LGPD, PIPEDA, POPI, PDPL, APPI, PDP, DPDP, and VCDPA.]
Figure 1.1: World map with countries that have (green) or have not (red) enacted or drafted data privacy
laws along with the names of some major data privacy laws.
One potential solution for secure and private data analysis is Federated Learning [120, 180, 285]. Fed-
erated Learning has emerged as a standard computational paradigm for distributed machine and deep
learning that allows geographically distributed data sources to efficiently train machine learning models
while their raw data remain at their original location. This contrasts with traditional machine learning algorithms, which require all training data to be aggregated at a centralized location. This data aggregation
step introduces a huge security and privacy threat and can violate the regulations enforced by the data reg-
ulatory frameworks. Alternatively, Federated Learning relaxes the need for raw data migration and instead
pushes the training of the machine learning model down to each source. During federated training, only
the locally trained model parameters are shared with a centralized entity that is responsible for aggregat-
ing these parameters to compute the federated model. Figure 1.2 shows a representative centralized and a representative federated learning environment, and how the two differ with regard to what is shared: data (centralized) versus models (federated).
[Figure 1.2 image: (a) Centralized — Silos A, B, and C send their local data to a centralized model; (b) Federated — Silos A, B, and C keep their local data and share only local models, which are aggregated into a federated model.]
Figure 1.2: Representative Centralized and Federated Learning Environments.
In the following, we give some background on the origins of Deep Learning and its importance in per-
forming meaningful data analysis. We also discuss the relationship between Distributed Deep Learning and
Federated Learning and why Federated Learning has found wide applicability across multiple disciplines.
1.1 Meaningful Data Analysis using Deep Learning
Deep Learning is a branch of Machine Learning, inspired by the information processing patterns found
in the human brain [9]. Deep Learning allows automated identification of features from raw input data.
A Deep Learning (neural network) model is a sequence of computational layers consisting of a group
of nodes, called neurons, which aims to learn a complex function over a large amount of data. Over
the past few decades (1990s-2020s), many neural network architectures have been proposed, including
but not limited to convolutional neural networks (CNN [142]), graph neural networks (GNN [223]), and
recurrent neural networks (RNN [215]). However, it was not until 2006 that interest in Deep Learning,
in general, was revived. A group of researchers brought together by the Canadian Institute for Advanced
Research (CIFAR) successfully trained deep neural networks [23, 99]. Subsequently, major advancements
in graphical processing units (GPUs) provided cheaper and faster computing power and made it possible to train deep learning models more efficiently. Researchers in 2012 were able to train a CNN model (AlexNet)
on a million images and achieve spectacular results (compared to the state-of-the-art at the time) in the
ImageNet competition [132].
In the last decade, we have observed the widespread adoption of Deep Learning models across multi-
ple disciplines. As structured and unstructured data became readily available at unprecedented volumes
in different fields, traditional Machine Learning approaches could not handle such massive information.
This spurred further research in the Deep Learning domain due to the ability of deep neural networks to
extract higher-level features from massive (structured or unstructured) data amounts. As a result, data
analysis using Deep Learning led to breakthroughs in image, video, speech, and audio data processing and
significant technological advances. Deep Learning models provided outstanding results in complex scien-
tific domains, such as drug discovery [116], genomics [68] and materials science [40], as well as in real
applications, such as autonomous driving, transportation prediction, speech recognition, and translation,
among many others [204].
As the amount of training data increased, so did the number of parameters required to learn a deep
learning model over the available data. When the memory requirements to train a deep learning model
are small, a single GPU or a set of GPUs hosted on the same server/machine is sufficient to carry out the
training (commonly known as centralized deep learning). Conversely, when models exceed
the available memory capacity, model training needs to scale out, and therefore distributed-based training
methods must be employed [52].
1.2 From Distributed to Federated Deep Learning
Depending on the number and the size of training examples and the number of parameters required to
train a deep learning model, different training methods can be used based on the available computational
resources, such as training in a standalone mode (e.g., multi-core / multi-GPU single server) or in a dis-
tributed mode (e.g., multiple GPUs distributed across multiple servers). To enable this type of training,
three deep learning training variants have been proposed [160]: data parallelism, model parallelism, and
pipeline parallelism. Data and model parallelisms were introduced in the DistBelief framework [52], and
pipeline parallelism in the GPipe [107] library. All parallel training schemes can be executed through the
coordination of a Parameter Server [52, 147] or through collective communication schemes, e.g., using the
Message Passing Interface (MPI). A Parameter Server (PS) is an abstract concept that handles the computed gradients across processing units that may live within a single server or be distributed across multiple servers.
Data Parallelism. Data parallelism is realized by parallelizing data processing across multiple process-
ing units [52], with each unit maintaining a network replica (local model copy). Units train on their own
local data partition and periodically synchronize their model weights with other units (using either collec-
tive communication primitives or parameter servers [147]) that can be located within the same computing
node or across (distributed) nodes. The amount of data communication required by data parallelism is
determined by the model size and the number of processing units participating in the training cluster.
Overall, data parallelism requires simpler coordination schemes when compared to model parallelism and
offers more flexible cluster utilization.
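As a concrete illustration, the following is a minimal Python/NumPy sketch of synchronous data-parallel training, assuming a hypothetical toy linear model with a mean-squared-error loss (the model, data, and worker count are illustrative, not drawn from this thesis): each simulated worker holds a replica of the parameters, computes a gradient on its own data shard, and the gradients are averaged, mimicking an all-reduce or a parameter-server aggregation.

    import numpy as np

    def local_gradient(w, X, y):
        """MSE gradient of a linear model X @ w, computed on one worker's shard."""
        return 2.0 * X.T @ (X @ w - y) / len(y)

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
    shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))  # 4 workers, 4 shards

    w = np.zeros(5)  # every worker keeps an identical replica of the parameters
    for step in range(100):
        grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]  # in parallel in practice
        w -= 0.01 * np.mean(grads, axis=0)  # "all-reduce": average gradients, update replicas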
Model Parallelism. Model parallelism is realized by partitioning the neural network, e.g., layer-based
partitioning, across processing units within the same computing node (i.e., server) or across multiple
nodes [52]. The processing units can be dependent, i.e., units that have data dependencies, or indepen-
dent [160]. Due to the complex partitioning of the neural network’s architecture across processing units,
model parallelism requires sophisticated information exchange and coordination. Model parallelism solutions must efficiently handle concurrent updates and account for an effective overlap between computation
and communication among processing units (and computing nodes). Model parallelism is (usually) applied
when the features of training examples are distributed across different processing units and/or computing
nodes.
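To make the layer-based partitioning concrete, here is a minimal sketch assuming a hypothetical two-layer network whose first layer lives on one processing unit ("device A") and whose second layer lives on another ("device B"); the forward pass ships activations across the partition boundary and the backward pass ships the boundary gradient back, which is exactly the coordination overhead discussed above.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = 0.1 * rng.normal(size=(5, 8))   # layer 1 parameters, held by device A
    W2 = 0.1 * rng.normal(size=(8, 1))   # layer 2 parameters, held by device B
    x, y = rng.normal(size=(32, 5)), rng.normal(size=(32, 1))

    # Forward pass: device A computes its layer, then sends activations h to device B.
    h = np.maximum(x @ W1, 0.0)          # device A: linear layer + ReLU
    y_hat = h @ W2                       # device B: output layer

    # Backward pass: device B computes its gradients, then sends dL/dh back to device A.
    dy = 2.0 * (y_hat - y) / len(y)      # device B: MSE loss gradient
    gW2 = h.T @ dy                       # device B: gradient of its own parameters
    dh = (dy @ W2.T) * (h > 0)           # boundary gradient shipped back to device A
    gW1 = x.T @ dh                       # device A: gradient of its own parameters
    W1, W2 = W1 - 0.1 * gW1, W2 - 0.1 * gW2   # each device updates its own partition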
Pipeline Parallelism. Pipeline parallelism [107, 186] is almost identical to model parallelism but aims to
solve GPUs’ idling problem. Pipeline parallelism partitions the layers of a DNN model into multiple stages.
Each stage consists of a consecutive set of model layers and is assigned to a separate GPU, with each GPU
being responsible for executing the forward and backward passes for all layers within the delegated stage. When
a single batch is active in the system, then pipeline parallelism reduces to simple (naive) model parallelism
but when multiple batches are processed (e.g., by chunking model batches into micro-batches) pipeline
parallelism allows all GPUs to be active. Compared to data parallelism, pipeline parallelism often requires
far less communication, since nodes need to only share subsets of computed gradients, and compared to
model parallelism, pipelining is able to efficiently overlap computation and communication.
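The schedule below is a minimal sketch of the micro-batching idea for a hypothetical two-stage pipeline (the micro-batch names and depth are illustrative): once the pipeline fills, both stages (GPUs) are busy at every clock tick, each on a different micro-batch, whereas naive model parallelism would run only one stage at a time.

    micro_batches = ["mb0", "mb1", "mb2", "mb3"]  # one mini-batch chunked into micro-batches
    ticks = len(micro_batches) + 1                # a 2-stage pipeline adds one drain tick
    for tick in range(ticks):
        stage0 = micro_batches[tick] if tick < len(micro_batches) else "idle"  # GPU 0
        stage1 = micro_batches[tick - 1] if tick >= 1 else "idle"              # GPU 1
        print(f"tick {tick}: stage0 -> {stage0}, stage1 -> {stage1}")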
Transitioning to Federated Learning. So far, we have discussed different training approaches that can
be used in a Distributed Deep Learning (DDL) setting. However, a critical problem during DDL execution
is data sharing, since participating processing units/nodes/workers can access the entire training dataset
and no data ownership exists across nodes. This is in contrast to Federated Learning, where each partici-
pating node (client, learner) in the federation is inherently a data silo that only trains the model on its own
local private dataset without exchanging or sharing its data with any other participating node [180]. Con-
sequently, Federated Learning has a major advantage compared to DDL when it comes to data regulations
compliance.
Another difference between DDL and FL is the frequency of parameter updates. In DDL usually, gra-
dients are shared after a batch (a small number of examples) is processed by a node. However, in FL a node
may process many batches, often one or more epochs. This changes the properties of the optimization
process [120].
Nevertheless, Federated Learning bears similarities with existing DDL training techniques [160]. Simi-
lar to DDL settings, where the execution is coordinated by a parameter server, the execution in a federated
learning environment is orchestrated by a central server (federation controller). The federation controller
can be seen as an extension of the parameter server concept.
Depending on how the data are partitioned across the federation nodes, Federated Learning can be
classified into three categories: horizontal, vertical, and hybrid [22, 149, 285]. Horizontal refers to the case
where nodes own data from the same feature and label space. Vertical refers to the case where nodes own
data from the same id space but with different features and/or label space. Hybrid refers to the case where
nodes own data from different id and feature spaces. Federated training in the case of horizontal data
partitioning generally resembles data parallelism but with the exception that the federated training data
can be non-IID (non-Independent and Identically Distributed). Accordingly, vertical FL typically resembles
model parallelism, since different parts of the training data points are distributed across different federation
nodes (processing units in DDL). However, hybrid FL relies on transfer learning [195], which does not
necessarily follow any of the aforementioned parallelism approaches.
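The distinction between horizontal and vertical partitioning can be sketched with a toy table, where rows are sample ids and columns are features (the array and split points are purely illustrative):

    import numpy as np

    # Toy dataset: 10 samples (rows) described by 6 attributes (columns).
    data = np.arange(60, dtype=float).reshape(10, 6)

    # Horizontal FL: silos share the same feature/label space but hold different samples.
    silo_a_rows, silo_b_rows = data[:5, :], data[5:, :]   # split by rows (sample ids)

    # Vertical FL: silos share the same sample ids but hold different features,
    # e.g., two institutions describing the same individuals with different attributes.
    silo_a_cols, silo_b_cols = data[:, :3], data[:, 3:]   # split by columns (features)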
1.3 The Case for Federated Learning
To demonstrate the benefits of Federated Learning we compared the performance of standalone siloed and
federated models in a realistic biomedical domain. We analyzed the learning performance of individual
silos while training a CNN model exclusively on their local data versus when participating in a federation.
The evaluation is conducted on the BrainAGE (Brain Age Gap Estimation) learning task, which assesses the
acceleration or deceleration of an individual’s brain aging from structural MRI scans.
We compare the performance of the two (siloed, federated) model types on two different data distribu-
tions: Uniform & IID and Skewed & Non-IID. In the case of Uniform & IID every silo has the same amount of
data and the local MRI scans have a full representation of age ranges (IID), whereas in the case of Skewed
& Non-IID different silos have different amounts of data and the age range of the local MRI scans repre-
sents a subset of the global age range (Non-IID); please see also Figures 2.7a, 2.7d for Uniform & IID and
Figures 2.7c, 2.7f for a full description of the Skewed & Non-IID data distributions.
[Figure 1.3 plots: "BrainAgeCNN Silo vs Federated Model Test Convergence" — test MAE versus training time (minutes) for the individual silos G:1–G:8 and the Community model, in (a) the Uniform & IID and (b) the Skewed & Non-IID environment.]
Figure 1.3: Performance evaluation of standalone silos against the federated (community) model on the
BrainAge task. Federated learning leverages all available data points and learns a more performant model.
We show the mean absolute error (MAE) in predicting the age of the brain from an MRI scan in Fig-
ure 1.3. In the benign Uniform & IID learning environment (cf. Figure 1.3a), the community (federated)
model obtained by the federation significantly outperforms the model achieved by any individual silo. A
similar outcome is also observed in the more challenging Skewed & Non-IID environment (cf. Figure 1.3b).
This latter environment represents a more realistic learning scenario that would be typical of research con-
sortia or federations of hospitals and clinics of different sizes and different disease prevalences. As also shown, silos with smaller datasets or data distributions farther from the global distribution show very poor performance. Interestingly, even silos with much larger datasets, whose local data distributions are much closer to the global distribution, cannot match the performance of the community model. In summary, silos (learning sites) with both small and large numbers of training samples have strong incentives to join Federated Learning.
1.4 Federated Learning Applications
Federated Learning over neural networks was first introduced in 2016 by McMahan et al. [180] for user data
in mobile phones to improve Google’s keyboard Gboard. Over the last few years, Federated Learning has
been adopted across many different disciplines with applications ranging from smart transportation [163]
and smart cities [298] to healthcare [211, 228] and pharmaceuticals [192]. Due to its inherent data security
and privacy benefits, Federated Learning generally applies to any domain that can benefit from cross-
regional collaborative machine learning without data sharing. Biomedical research domains such as AI in
neuroscience [220] and genetics [8] are some of the most prominent applications on that front. Other do-
mains that can employ the Federated Learning paradigm are edge computing [190, 274] (e.g., for anomaly
and intrusion detection), finance [164], NLP [157], material sciences [90], autonomous vehicles [221], en-
ergy demand [257], and industrial and manufacturing sectors [285]. Recently, the US and UK governments
expressed their interest in using privacy-enhancing technologies (PETs) to combat global societal chal-
lenges by initiating the US/UK PETs challenge.† The challenge aimed to promote privacy-preserving federated learning solutions that can help with financial crime prevention and forecast an individual's risk of infection in the case of a disease outbreak. Our federated learning framework, MetisFL (see also section 4.2), was accepted to participate as a blue team in the challenge.

† https://www.whitehouse.gov/ostp/news-updates/2021/12/08/us-and-uk-to-partner-on-a-prize-challenges-to-advance-privacy-enhancing-technologies
1.5 Thesis Structure
Chapter 2 provides the necessary background to understand the federated models’ optimization prob-
lem and compares the performance of federated machine learning models to their centralized counter-
parts across multiple benchmark datasets and models. Chapter 3 discusses the various heterogeneities
observed in federated learning environments and demonstrates how they affect federated models’ conver-
gence. Chapter 4 presents various federated training protocols for accelerated model convergence, efficient
model training and inference, and scalable and efficient model aggregation. Chapter 5 addresses the pri-
vacy and security concerns that may threaten the convergence of federated models through homomorphic
encryption training schemes, enhanced privacy-preserving approaches based on gradient perturbation,
and resilient learning against corrupted sources through robust aggregation rules. Chapter 6 introduces
the federated data imputation and harmonization problems for the first time and demonstrates an architec-
tural system vision to address them through principled and practical data integration techniques. Chapter 7
concludes this thesis by discussing future research directions and open research problems across various
subdomains of the rapidly growing field of federated learning.
Chapter 2
Federated Learning
Federated Learning is a distributed machine learning paradigm that allows multiple learners (clients) to
jointly train a machine learning model without sharing data (i.e., moving the data out of their original
source to a central repository). In this chapter, we introduce the federated optimization problem and
demonstrate that federated learning models can perform comparably to their centralized counterparts.
We conclude the chapter by discussing the security and privacy concerns that may arise during federated
training and compromise the federated models’ convergence or lead to sensitive information leakage.
2.1 Federated Optimization
Federated model optimization is decoupled into two optimization tiers, Global Model and Local Model Optimization [208, 265]. The server, or federation controller, is responsible for aggregating learners' local models to compute the global model, i.e., Global Model Optimization, while the learners optimize the global model locally on their private dataset, i.e., Local Model Optimization. Algorithm 1 shows how the federated optimization problem can be solved at each tier, respectively. The algorithm is parameterized by two gradient-based optimizers, GlobalOpt and LocalOpt, with server (global) learning rate η_g and learner (local) learning rate η_l. When compared to centralized optimization methods, federated learning
optimization methods need to take into account different key aspects, related to the federated setting
11
and topology, client availability, communication efficiency, statistical (data) heterogeneity, computational
constraints, privacy and security concerns, and system complexity [265]. Recently, various algorithms
have been proposed to address these aspects either individually or jointly.
There are two primary federated learning environments: cross-silo and cross-device [120]. A cross-silo
setting consists of tens or hundreds of stateful, reliable, and highly-available geographically distributed
clients, such as data centers and large organizations/institutions (e.g., medical or financial). In a cross-
device setting, the available clients are thousands or millions of stateless, unreliable, and often unavailable
devices, such as mobile phones and IoT devices.
In a cross-silo setting, all learners are considered during global model optimization, whereas in a cross-device setting, due to the extremely large number of clients and their unreliability (e.g., dropping out of training due to network errors, power, or idleness), only a subset of clients is selected to participate at each training round. This client subset selection is referred to as partial participation or participation ratio; in Algorithm 1 this is represented with the percentage rate R%. For cross-silo settings, R = 100%, while for cross-device settings, R ≪ 100%.
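A minimal sketch of this sampling step (the function name and rates are illustrative and not part of Algorithm 1's notation):

    import random

    def sample_participants(clients, participation_rate, seed=None):
        """Select the round's client subset; a rate of 1.0 recovers the cross-silo case."""
        k = max(1, round(len(clients) * participation_rate))
        return random.Random(seed).sample(clients, k)

    clients = [f"client_{i}" for i in range(10_000)]
    cross_device = sample_participants(clients, 0.001, seed=7)  # R << 100%
    cross_silo = sample_participants(clients, 1.0)              # R = 100%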
In the remainder of this section, we formulate the federated optimization problem and discuss algo-
rithmic advances for both global and local optimization tiers.
2.1.1 Optimization Formulation
In Federated Learning the goal is to find the optimal set of parameters w* that minimize the global objective function:

    w* = argmin_w f(w)   where   f(w) = Σ_{k=1}^{K} (p_k / P) F_k(w)        (2.1)

where K denotes the total number of learners participating in the training of the global model, p_k the contribution value (weighting factor) of learner k in the global model, P = Σ_k p_k the normalization factor (so that Σ_k p_k / P = 1), and F_k(w) the local objective function of learner k.
Algorithm 1 Federated Optimization.

Server:
    for t = 0, ..., T − 1 do
        w_c^t = w_c                                   ▷ Initialize current round's global model state.
        w_K^t, Δ_K^t = {}, {}                         ▷ Current round's sets of local models and pseudo-gradients.
        K = Sample R% of learners
        for each learner k ∈ K in parallel do
            w_k^t, Δ_k^t = LocalOpt(w_c^t)            ▷ Train global model locally at each learner.
            w_K^t = w_K^t ∪ {w_k^t}                   ▷ Expand set with learner's local model.
            Δ_K^t = Δ_K^t ∪ {Δ_k^t}                   ▷ Expand set with learner's pseudo-gradients.
        w_c = GlobalOpt(w_c^t, w_K^t, Δ_K^t)          ▷ Update global model state.

GlobalOpt(w_c^t, w_K^t, Δ_K^t):
    w_c = (1/P) Σ_k p_k w_k^t,  P = Σ_k p_k           (FedAvg)   ▷ Weighted average of local models.
    Δ_μ = (1/P) Σ_k p_k Δ_k^t,  P = Σ_k p_k           ▷ Weighted average of pseudo-gradients.
    w_c = w_c^{t+1},  w_c^{t+1} = w_c^t + Δ_μ         (FedAvg w/ Pseudo-Gradients)
    w_c = w_c^{t+1},  w_c^{t+1} = w_c^t − v,  v = γ_g v − Δ_μ     (FedAvgM)
    m_t = β_1 m_{t−1} + (1 − β_1) Δ_μ
    v_t = v_{t−1} + Δ_μ²                              (FedAdagrad)
    v_t = v_{t−1} − (1 − β_2) Δ_μ² sign(v_{t−1} − Δ_μ²)   (FedYogi)
    v_t = β_2 v_{t−1} + (1 − β_2) Δ_μ²                (FedAdam)
    w_c = w_c^t + η_g m_t / (√v_t + τ)
    Reply w_c

Learner:
LocalOpt(w_c^t):
    B = epochs · |D_k^T| / β                          ▷ Split training set into batches.
    w_i^t = w_c^t                                     ▷ Assign global model to the initial local model state.
    for b ∈ B do
        w_{i+1}^t =
            w_i^t − η_l ∇F_k(w_i^t; b)                                  (Vanilla SGD)
            w_i^t − η_l ∇F_k(w_i^t; b) + λ ‖w_i^t‖₂²                    (Weight Decay)
            w_i^t + u_{i+1},  u_{i+1} = γ_l u_i − η_l ∇F_k(w_i^t; b)    (Momentum SGD)
            w_i^t − η_l ∇F_k(w_i^t; b) − η_l μ (w_i^t − w_c^t)          (FedProx)
    w_k^t = w_I^t                                     (Local Model)
    Δ_k^t = w_k^t − w_c^t                             (Pseudo-Gradients)
    Reply (w_k^t, Δ_k^t)
We refer to the model computed using Equation 2.1 as the global (community) model w_c. Every learner computes its local objective by minimizing the empirical risk over its local training set D_k^T as F_k(w) = E_{x_k ∼ D_k^T}[ℓ_k(w; x_k)], with ℓ_k being the loss function. For example, in the original FedAvg weighting scheme [180], the contribution value for any learner k is equal to its local training set size, p_k = |D_k^T|. Depending on the execution protocol or the optimization function involved, the contribution value p_k can be defined statically, or dynamically at run time. Following the original Federated Average (FedAvg) algorithm [180], every learner aims to minimize its local objective function F_k(w) using Vanilla Stochastic Gradient Descent (SGD) as its local optimization solver:

    w_{i+1} = w_i − η ∇F_k(w_i)        (2.2)

Upon completion of their local training workload, the learners send back to the server their local model weights or their pseudo-gradients [268], which are clients' updates that are not gradients, or even unbiased estimates of gradients. The pseudo-gradients are computed as the relative difference of the local model weights from the global weights, and they are an important concept that helps to provide a generalization to the global federated optimization problem, as discussed later in section 2.1.3.

Algorithm 1 demonstrates how FedAvg can be decoupled into local and global optimization steps. During local optimization, learners update their local models using Vanilla SGD by iterating over their local training dataset D_k^T for a number of local update steps and using a local learning rate (step size) η_l. During global optimization, the server aggregates the local models by taking their weighted average based on the number of training examples each model was trained on, see FedAvg. A similar approach is also applied in the case of pseudo-gradients, where implicitly the global learning rate (η_g) is set to 1, see FedAvg w/ Pseudo-Gradients. In most of this work, if not stated otherwise, all the optimization algorithms and techniques are performed on model weights.
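Putting Equations 2.1–2.2 and Algorithm 1 together, the following is a minimal runnable sketch of FedAvg in Python/NumPy. It assumes a hypothetical linear regression model with an MSE loss; the silo datasets, learning rate, and round count are illustrative, not taken from the experiments in this thesis.

    import numpy as np

    def local_opt(w_global, X, y, eta_l=0.01, epochs=1, batch=32):
        """Vanilla-SGD local training (Eq. 2.2); returns local model and pseudo-gradient."""
        w = w_global.copy()
        for _ in range(epochs):
            for i in range(0, len(y), batch):
                Xb, yb = X[i:i + batch], y[i:i + batch]
                w -= eta_l * 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)  # MSE gradient step
        return w, w - w_global

    def fedavg(local_models, sizes):
        """Global step: weighted average of local models with p_k = |D_k^T| (Eq. 2.1)."""
        return np.average(local_models, axis=0, weights=sizes)

    rng = np.random.default_rng(0)
    silos = [(rng.normal(size=(n, 5)), rng.normal(size=n)) for n in (100, 400, 1500)]
    w_c = np.zeros(5)  # global (community) model
    for t in range(10):  # federation rounds
        results = [local_opt(w_c, X, y) for X, y in silos]
        w_c = fedavg([w for w, _ in results], [len(y) for _, y in silos])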
2.1.2 Local Model Optimization
Given the heterogeneous local data distributions of the participating learners, several research works have studied how local optimization solvers can help accelerate the convergence of the federated model [161], limit client drift [152], and reduce the associated federated training communication cost.

Formally, in the case of SGD with Momentum [161], the local solution w_{i+1} at iteration i is computed as (u = momentum term, γ = momentum attenuation factor):

    u_{i+1} = γ u_i + ∇F_k(w_i)
    w_{i+1} = w_i − η u_{i+1}        (2.3)
For FedProx [152], which is a variant of the local SGD solver that introduces a proximal term in the update rule to regularize the local updates based on the divergence of the local solution from the global solution (i.e., the community model w_c), the local solution w_{i+1} is computed as:

    w_{i+1} = w_i − η ∇F_k(w_i) − η μ (w_i − w_c)        (2.4)
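A compact sketch consolidating the three local update rules above (the helper name and hyperparameter values are hypothetical; grad stands for ∇F_k(w_i; b) on the current batch):

    import numpy as np

    def local_step(w, grad, w_c, u, eta=0.01, gamma=0.9, mu=0.1, rule="sgd"):
        """One local update; u is the momentum buffer, w_c the current global model."""
        if rule == "sgd":         # Vanilla SGD (Eq. 2.2)
            return w - eta * grad, u
        if rule == "momentum":    # SGD with Momentum (Eq. 2.3)
            u = gamma * u + grad
            return w - eta * u, u
        if rule == "fedprox":     # FedProx proximal update (Eq. 2.4)
            return w - eta * grad - eta * mu * (w - w_c), u
        raise ValueError(f"unknown rule: {rule}")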
In general, other gradient optimization solvers can also be applied during local model optimization, such as SGD with Weight Decay and Adam (w/ Weight Decay [167]), as well as optimization algorithms that are more tailored to federated optimization challenges, such as SCAFFOLD [124], which uses client control variates, and FedDyn [4], which performs dynamic client regularization. More importantly, during local model optimization, learners can also perform local training using Differentially Private SGD [2], which can significantly enhance model privacy. Similarly, in [271] the authors propose a framework based on differential privacy (DP) that prevents information leakage by adding artificial noise to clients' parameters right before sharing them with the server.
2.1.3 Global Model Optimization
In [268] the federated optimization problem is formalized as a global gradient update step at the server
end, using pseudo-gradients. The pseudo-gradient formalization helps reason about critical challenges present in federated optimization settings that are not always apparent in centralized settings, such as multiple and disproportional client update steps, data heterogeneity, and communication complexity. However, analysis through the pseudo-gradient construction is challenging due to the incurred high bias and high variance: the expectation of the pseudo-gradients is not the "true" gradient of the empirical loss function.
The multiple local updates performed by clients on data with highly heterogeneous data distributions
lead to compounding variance. As the authors note [268], adaptive optimization methods have had notable success in addressing similar issues in centralized settings. To this end, the authors propose adapting such methods to federated settings, yielding FedAdagrad, FedYogi, and FedAdam, and define a generalized optimization framework called FedOPT. The Federated Average with Momentum variant, FedAvgM, which can also be constructed through this formalization, was proposed in the earlier work of [104].
Algorithm 1 demonstrates how the weighted average of clients' pseudo-gradients is used to compute the necessary constructions for each adaptive optimization method separately at the server. To control the magnitude of the update, for each method, two momentum coefficients are used (β_1, β_2), and the update is applied to the community/global model weights using a global step (global learning rate η_g). Finally, as is also shown in the original work [268], a critical component for any of the proposed adaptive methods is the use of VanillaSGD during local training with a decaying learning rate value.
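For illustration, the following is a sketch of a FedAdam-style global step in the spirit of the FedOPT template of [268], with models flattened to NumPy vectors; the function name, state handling, and hyperparameter defaults are placeholders rather than the reference implementation.

import numpy as np

def fedadam_server_update(w_c, deltas, weights, m, v,
                          eta_g=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """One adaptive global step on the weighted average of the clients' pseudo-gradients."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    delta = sum(p_k * d_k for p_k, d_k in zip(p, deltas))  # aggregated pseudo-gradient
    m = beta1 * m + (1.0 - beta1) * delta                  # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * delta**2               # second-moment estimate
    w_c = w_c + eta_g * m / (np.sqrt(v) + tau)             # adaptive update of w_c
    return w_c, m, v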
Algorithm 1 also shows that different merging strategies and scaling factors (contribution values) can
also be used to weight learners’ local models (or pseudo-gradients) during global aggregation. For instance,
in the case of FedAvg the contribution value assigned to the local model of each learner is based on the size
of its local training dataset. As we also discuss later in section 4.1 another factor that can affect global model
convergence is the communication protocol employed during federated training, such as synchronous vs.
asynchronous.
Apart from the adaptive federated optimization methods, other works have also shown, both theoretically and empirically, that the global aggregation step can lead to enhanced security and privacy. For instance, the authors of [271] show that in order to satisfy an (ϵ,δ)-DP requirement during model aggregation, additional noise must be added by the server during the merging step. As discussed later in chapter 5, the server can provide robust learning in the presence of corrupted data sources or untrusted learners through validation and performance-based techniques [240] or through outlier detection [252]. For the case of an untrusted server, recent works have proposed secure aggregation protocols [32] based on Homomorphic Encryption [244, 291] or Masking [21, 233].
2.2 Comparing Centralized to Federated Model Performance
Centralized Machine Learning in our work refers to the case where the local datasets of all participat-
ing learners in the federation are aggregated into a single repository. Here, our goal is to demonstrate
the effectiveness of Federated Learning in reaching performance comparable to its centralized counterpart
across a varying set of challenging federated learning environments in the Computer Vision (CV), Natural
Language Processing (NLP), and Neuroimaging domains.
2.2.1 Computer Vision
To demonstrate the viability of federated learning in the context of image classification (computer vision), in this section we compare the learning performance of FedAvg against a centralized system on the standard CIFAR-10 [131] benchmark dataset. We evaluate the performance of the two approaches across nine
challenging federated learning environments with heterogeneous data distributions.
To generate these heterogeneous federated learning environments, we consider three possible assignments with respect to the number of training samples owned by each learner: Uniform, Skewed, and PowerLaw. In Uniform, every learner has the same number of examples; in Skewed, each learner has a progressively smaller set of examples; and in PowerLaw (exponent = 1.5), we follow a more extreme (exponentially decreasing) variation in the amount of data.
Figure 2.1: CIFAR10 Centralized vs FedAvg. [Nine panels of test top-1 accuracy over 200 federation rounds, one per environment (Uniform, Skewed, PowerLaw × IID, Non-IID(5), Non-IID(3)), each with an inset bar chart of the per-learner (G:1-G:10) data distribution; the centralized baseline is a horizontal line and FedAvg a convergence curve.]
Additionally, to distinguish the degree of label skewness [120, 297], we further classify each learning environment as IID (independent and identically distributed) or Non-IID. With IID we refer to the learning scenario where all learners hold training examples from all target classes, and with Non-IID(x) to the case where learners hold training examples from only x classes. For instance, Non-IID(3) in the case of CIFAR-10 means that each learner only has training examples from 3 target classes (out of the 10 classes in CIFAR-10); later, in section 4.1, we follow a similar approach to compare different federated training policies across multiple domains.
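The following is a minimal NumPy sketch of how such environments can be generated; the function is our illustration (names and defaults are arbitrary), combining a quantity assignment (Uniform, or PowerLaw with exponent 1.5 as above) with an optional Non-IID(x) restriction on the classes each learner holds.

import numpy as np

def make_partitions(labels, num_learners=10, sizes="uniform",
                    classes_per_learner=None, seed=1990):
    """Return a list of example-index lists, one per learner."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    if sizes == "uniform":
        counts = np.full(num_learners, n // num_learners)
    else:  # "powerlaw": exponentially decreasing amounts (exponent = 1.5)
        raw = np.arange(1, num_learners + 1, dtype=float) ** -1.5
        counts = np.maximum((raw / raw.sum() * n).astype(int), 1)
    classes = np.unique(labels)
    pools = {c: list(rng.permutation(np.where(labels == c)[0])) for c in classes}
    partitions = []
    for k in range(num_learners):
        # IID: draw from every class; Non-IID(x): draw from only x classes
        cls = classes if classes_per_learner is None else \
              rng.choice(classes, classes_per_learner, replace=False)
        per_class, idx = counts[k] // len(cls), []
        for c in cls:
            idx.extend(pools[c][:per_class])
            pools[c] = pools[c][per_class:]
        partitions.append(idx)
    return partitions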
In Figure 2.1 we demonstrate the convergence of FedAvg with respect to federation rounds compared
to the best learning performance (∼ 0.84 test accuracy) achieved by its centralized counterpart. The x-axis
represents federation rounds. The performance of the centralized model is shown as a horizontal dashed
line. As expected, the federated learning performance is comparable to the centralized system when the
federated environment is IID. However, performance degrades relative to the centralized counterpart as the learning environments become progressively harder, e.g., moving from Skewed & Non-IID(5) to Skewed & Non-IID(3). These results corroborate those reported in the original federated learning work of [180].
2.2.2 Named Entity Recognition
Here, we study Federated Learning in the context of Natural Language Processing (NLP) with a focus on
the Named Entity Recognition (NER) task. NER is an important tool for textual analysis that has been
applied across various domains, including information extraction, question-answering, and semantic an-
notation [177]. However, some domains require the textual data to remain private; for example, intelligence
applications or analyses of email data across different organizations. These private-text domains motivate
our research into federated learning approaches for NLP problems, such as Named Entity Recognition.
For the NER task that we investigate in this work, one of the first successful uses of neural networks combined the Bi-LSTM [102] and CRF [136] models into a single state-of-the-art deep learning architecture [139] that exhibited superior performance. Other recent works have proposed large transformer-based models such as BERT [55] and XLM [140]. However, to understand the impact of federated training on the learning performance of the NER task, in this work we employ the Bi-LSTM-CRF model, similar to [178].
We use a Bi-LSTM layer whose input is the concatenation of 300-dimensional GloVe [199] word embeddings and a character embedding trained on the data. We apply dropout during training, set at 0.5. The output of the Bi-LSTM model is then fed into a CRF layer in order to capture neighboring tagging decisions. The CRF layer produces scores for all possible sequences of tags, over which we apply the Softmax function to produce the output tag sequence. The total number of trainable parameters of our model architecture is 322,690 (our implementation follows https://github.com/guillaumegenthial/tf_ner).
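For illustration, the following is a PyTorch sketch of such an architecture. It is not the implementation referenced above: the pytorch-crf package is an assumed dependency, the dimensions are placeholders, and the character features are collapsed to a single embedding per token for brevity (real implementations encode each word's characters separately).

import torch
import torch.nn as nn
from torchcrf import CRF  # assumed dependency: pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, num_words, num_chars, num_tags,
                 word_dim=300, char_dim=50, hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(num_words, word_dim)  # initialized from GloVe in practice
        self.char_emb = nn.Embedding(num_chars, char_dim)  # trained on the data
        self.lstm = nn.LSTM(word_dim + char_dim, hidden,
                            bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(0.5)
        self.proj = nn.Linear(2 * hidden, num_tags)        # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)         # learns the transition matrix

    def forward(self, words, chars, tags=None):
        x = torch.cat([self.word_emb(words), self.char_emb(chars)], dim=-1)
        emissions = self.proj(self.dropout(self.lstm(x)[0]))
        if tags is not None:                               # training: negative log-likelihood
            return -self.crf(emissions, tags)
        return self.crf.decode(emissions)                  # inference: best tag sequence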
Federated Training. Our federated environments follow a centralized learning topology [22, 120, 285], where a single aggregator (server) is responsible for orchestrating the execution of the participating clients. In our experiments, we consider full client participation at every round. We test the performance of the federated model on federation environments consisting of 8, 16, and 32 clients. Each client trained on its local dataset for 4 local epochs in-between federation rounds, and each federated experiment was run for a total of 200 rounds. When merging the local models at the server, we used FedAvg as the merging function. During training, clients shared the kernel and bias matrices of the LSTM and dense layers and the transition matrix of the CRF layer. All federated environments were run using the Metis framework [246].
Federated Data Distributions. The CoNLL-2003 [218] is a language-independent newswire dataset
developed for the named entity recognition task. The dataset consists of 20,744 sentences (14041 training,
3250 validation, 3453 test) and contains entities referring to locations (LOC), organizations (ORG), people
(PER), and miscellaneous (MISC). The training and validation sets combined contain 8,977 LOC, 7,662 ORG, 8,442 PER, and 4,360 MISC entities; the test set contains 1,668 LOC, 1,661 ORG, 1,617 PER, and 702 MISC entities. We measure the classification performance for all entity types. The original tagging scheme is BIO (beginning-intermediate-other). When extracting named entities, two subtasks must be solved: finding the exact boundaries of an entity, and determining its type. The metrics used to evaluate correct tag (entity) predictions are Precision, Recall, and F1. All models were evaluated on the same original test dataset.
To generate the federated environments that we investigate in this work, we split the dataset into
equal (Uniform) and unequal (Skewed) sized partitions. For the Uniform environments, we combine the
training and validation datasets of the original dataset and split them into approximately equally sized
partitions for 8, 16, and 32 clients, such that each partition (client) has almost the same proportion of
different tag types. The proportion of tags in each split is approximate, as can also be seen in Figure 2.2,
since any sentence can have any number of different tags. For the 32-client Skewed environment, we randomly partitioned the combined training and validation datasets into 32 splits of increasing size (200 to 872 sentences). This increased the data heterogeneity across all partitions
with certain partitions containing more unique entity mentions (compared to the Uniform environments);
see also Figure 2.3d.
Figure 2.2 presents the total number of location, organization, and person tags at each client within each federation environment (the Miscellaneous (MISC) tag has a similar distribution). For Uniform environments, a similar number of tags has been assigned to each client. For the Skewed environment, the distribution of tags follows a left-skewed assignment. A similar distribution pattern can be observed for tags present in only one client, which we call unique tags. Figure 2.3 shows the number of unique location, organization, and person entities each client contains. Finally, Figure 2.4 shows how many clients contain each entity, that is, how many tags appear simultaneously at K clients (e.g., K=5 gives how many tags appear in exactly 5 clients).
Figure 2.2: CoNLL-2003 dataset: B-LOC, B-ORG and B-PER entity (tag) distribution for each client within each federation environment. [Bar charts of per-client tag frequency for (a) 8-Clients Uniform, (b) 16-Clients Uniform, (c) 32-Clients Uniform, (d) 32-Clients Skewed.]
Figure 2.3: CoNLL-2003 dataset: Number of unique B-LOC, B-ORG and B-PER entities (tags) for each client within each federation environment. [Bar charts of per-client unique-entity counts for (a) 8-Clients Uniform, (b) 16-Clients Uniform, (c) 32-Clients Uniform, (d) 32-Clients Skewed.]
Figure 2.4: CoNLL-2003 dataset: B-LOC, B-ORG, and B-PER entity (tag) common occurrences across clients for each federation environment (log scale). [Counts of tags appearing in exactly K clients for (a) 8-Clients Uniform, (b) 16-Clients Uniform, (c) 32-Clients Uniform, (d) 32-Clients Skewed.]
Environment      Precision   Recall   F1-score
Centralized      90.91       90.12    90.51
8-clients-Uni    88.85       88.30    88.57
16-clients-Uni   88.99       87.69    88.34
32-clients-Uni   88.52       86.15    87.32
32-clients-Ske   88.43       86.37    87.39

Table 2.1: Final learning performance of centralized vs federated models. The clients-Uni and clients-Ske environments refer to the federation environments with uniform and skewed assignment of training samples across clients, respectively.
Figure 2.5: NER model learning performance in centralized and federated settings. [F1-score over 200 federation rounds for the 8-, 16-, and 32-client Uniform and 32-client Skewed environments; the centralized best F1 is a horizontal line.]
Results. Table 2.1 shows the final learning performance of each model for each learning environment.
As expected, the centralized model outperforms the federated models. However, the performance degrada-
tion is small (∼ 2-3 percentage points). The degradation increases as the federation environments become
more challenging, that is, with the number of clients and the heterogeneity of the data. As the number of
clients increases, the amount of data available for local training decreases, which makes federated training
harder. Nonetheless, federated learning remains effective even with a larger number of clients and skewed
distributions.
Figure 2.5 shows the convergence rate of the federated models, as F1-score over federation rounds. For
the centralized model, we plot the best F1 score as a horizontal line. The harder the federation environment,
the more training (federation) rounds the federated model needs to reach an acceptable performance.
2.2.3 Neuroimaging
The learning task we present in this section is the Brain Age Gap Estimation (BrainAGE). Deep 3D convo-
lutional regression networks have been used for the BrainAGE task [43, 117]. These networks extend the
VGG and ResNet architectures to 3D images by replacing 2D convolution/maxpool operations with their
3D counterparts.
Neural Architecture. Figure 2.6 shows the convolutional encoding network we trained for the brain
age prediction task. The model architecture is similar to that in [198] with the main difference being the
replacement of the batch normalization (BatchNorm) layer with an instance normalization (InstanceNorm)
layer. Collectively, the network consists of seven blocks, with the first five composed of a 3x3x3 3D convo-
lutional layer (stride=1, padding=1), followed by an instance norm, a 2x2x2 max-pool (stride=2), with ReLU
activation functions. The number of filters in the first block is 32 and doubles in each subsequent block up to 256, with both blocks 4 and 5 having 256 filters. The sixth block contains a 1x1x1 3D convolutional layer (stride=1, filters=64),
followed by an instance norm and ReLU activation. The final, seventh, block contains an average pooling
layer, a dropout layer (set to p=0.5 during training), and a 1x1x1 3D convolutional layer (stride=1). To train
the model we used Mean Squared Error as the loss function and Vanilla SGD as the network’s optimizer.
Figure 2.6: BrainAgeCNN
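The following is a PyTorch sketch of this architecture as we read it from the description above; it is an approximation (e.g., the final average pooling is realized with adaptive pooling), not the exact trained network.

import torch.nn as nn

def conv_block(cin, cout):
    # 3x3x3 conv (stride=1, padding=1) -> InstanceNorm -> 2x2x2 max-pool (stride=2) -> ReLU
    return nn.Sequential(nn.Conv3d(cin, cout, 3, stride=1, padding=1),
                         nn.InstanceNorm3d(cout),
                         nn.MaxPool3d(2, stride=2),
                         nn.ReLU())

class BrainAgeCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 32), conv_block(32, 64), conv_block(64, 128),  # blocks 1-3
            conv_block(128, 256), conv_block(256, 256),                  # blocks 4-5
            nn.Conv3d(256, 64, 1), nn.InstanceNorm3d(64), nn.ReLU())     # block 6
        self.head = nn.Sequential(nn.AdaptiveAvgPool3d(1),               # block 7
                                  nn.Dropout(0.5),
                                  nn.Conv3d(64, 1, 1))

    def forward(self, x):                               # x: (batch, 1, 91, 109, 91) MRI volume
        return self.head(self.features(x)).flatten(1)   # predicted brain age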
Federated Model. During federated training all learners train on their local data using the same neural
architecture and hyperparameters (e.g., learning rate, batch size). Once a learner finishes its local training,
it sends its local model parameters to the controller.
Neuroimaging Data. For the BrainAGE task we use structural MRI scans to assess the acceleration or
deceleration of an individual's brain aging. The difference between the true chronological age and the predicted age of the brain is considered an important biomarker for early detection of age-associated neurodegenerative and neuropsychiatric diseases [77, 273], such as cognitive impairments [42], schizophrenia [130], and chronic pain [133]. Recently, deep learning methods based on RNN [137, 117] and CNN [43, 58, 91, 197] architectures have demonstrated accurate brain age predictions.
From the original UKBB dataset [183] of 16,356 individuals with neuroimaging, we selected 10,446
who had no indication of neurological pathology and no psychiatric diagnosis as defined by the ICD-10
criteria. The age range was 45-81 years (mean: 62.64; SD: 7.41; 47% women, 53% men). All image scans were
evaluated with a manual quality control procedure, where scans with severe artifacts were discarded. The
remaining scans were processed using a standard preprocessing pipeline with non-parametric intensity
normalization for bias field correction and brain extraction using FreeSurfer, and linear registration to a (2 mm)³ UKBB minimum deformation template using FSL FLIRT. The final dimension of the registered images was 91x109x91. The 10,446 records were split into 8,356 for training and 2,090 for testing.
Federated Data Distributions. We define several computationally and statistically heterogeneous fed-
erated learning environments by partitioning the centralized UKBB neuroimaging training dataset (8356
records) across a federation of 8 learners. Computationally, the first four learners were equipped with
NVIDIA GeForce GTX 1080 Ti GPUs, while the last four had (faster) Quadro RTX 8000 GPUs. Figure 2.7
shows the three environments with diverse data amounts and data distributions. For data amounts, we con-
sidered bothUniform, an equal number of training samples per learner, andSkewed, decreasing amount of
training samples for each learner. For data distributions, we considered both Independent and Identically
Distributed (IID), the local data distribution of each learner contains scans with the same distribution as
the global age distribution, and Non-IID, different age distributions. Every environment was evaluated on
the same test dataset (2090 records) and the learners used their allocated records for training.
(a) Uniform & IID
Age Buckets
(b) Uniform & Non-IID Age Buckets (c) Skewed & Non-IID Age Buckets
(d) Uniform & IID
Age Distribution
(e) Uniform & Non-IID Age
Distribution
(f) Skewed & Non-IID Age
Distribution
Figure 2.7: UKBB Federation Data Distributions
Training Hyperparameters. All learners trained on the same CNN model (Fig. 2.6). For both centralized and federated models, we used Vanilla SGD with a learning rate of 5×10⁻⁵ and a batch size (β_k) of 1. For SyncFedAvg, each learner runs 4 local epochs in all distributions. For all experiments, the random seed was 1990.
                                   MSE              RMSE            MAE             Corr
Centralized Model                  12.885 ± 0.021   3.589 ± 0.003   2.895 ± 0.006   0.881
Federated Model (Data Distribution, Policy)
  Uniform & IID, SyncFedAvg        13.749 ± 0.138   3.707 ± 0.018   2.995 ± 0.018   0.875
  Uniform & Non-IID, SyncFedAvg    19.853 ± 1.347   4.453 ± 0.151   3.625 ± 0.135   0.861
  Skewed & Non-IID, SyncFedAvg     19.148 ± 0.086   4.376 ± 0.009   3.553 ± 0.003   0.851

Table 2.2: UKBB Evaluation. MSE: Mean Square Error. RMSE: Root MSE. MAE: Mean Absolute Error. Corr: Correlation. Mean and std values for 3 runs.
Figure 2.8: Centralized vs Federated BrainAGE performance comparison. [MAE over federation rounds for the Uniform & IID, Uniform & Non-IID, and Skewed & Non-IID environments against the centralized baseline.]
Results. Figure 2.8 shows the convergence of the federated models in terms of federation rounds across
the three learning environments compared to their centralized counterpart. As also shown in Table 2.2, for the Uniform & IID setting, the federation reaches a Mean Absolute Error (MAE) value that is very close to the one achieved by the centralized model. However, as the learning setting becomes more challenging (i.e., Non-IID), the model learning performance is reduced, reaching a final error about 0.5 years above that of the centralized model.
Alzheimer’s Disease Detection. We also study the performance of deep learning for AD prediction
in centralized and federated environments. If federated learning achieves comparable performance to
centralized training, we posit that, ultimately, federated learning will allow the analysis of much greater
quantities of data, since it sidesteps many of the challenges of centralized data sharing.
We study 3 prominent AD studies: the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [185], with
its three phases ADNI 1, ADNI 2 and ADNI 3; the Open Access Series of Imaging Studies (OASIS) [138];
and the Australian Imaging, Biomarkers & Lifestyle Flagship Study of Ageing (AIBL) [76]. These stud-
ies contain T1-weighted brain MRIs taken from patients with different degrees of dementia and healthy
subjects acting as controls. For our work, we only use images from control subjects and patients whose cause of dementia is Alzheimer's disease. These studies are longitudinal. To obtain unbiased performance estimates, all the samples for a given subject appear either in the training or in the test set. Table 2.3 shows the number of training and testing samples from each study and the target labels.

             Training Set               Test Set
Cohort   Alzheimer's   Controls   Alzheimer's   Controls
ADNI1       1,313        1,832        324          458
ADNI2         384          457         95          115
ADNI3          51          232         13           59
OASIS          36          150         10           38
AIBL          112          641         29          161
Total       1,896        3,312        471          831

Table 2.3: AD: Train/test splits per cohort and target label.
Images were preprocessed following the pipeline in [56]. First, images were reoriented using fslreorient2std (FSL v6.0.1) to match the orientation of standard template images. Then, brain extraction
was performed: skull parts present in the image were removed using the HD-BET CPU implementation,
and grey- and white-matter masks were extracted using FSL-FAST (FSL v6.0.1 Automated Segmentation
Tool). An intensity normalization step (N4 bias field correction) using ANTs (v2.2) followed. Next, linear
registration to a UK Biobank minimum deformation template was obtained by using the FSL-FLIRT (FSL
v6.0.1 Linear Image Registration Tool) with 6 degrees of freedom. Finally, an isometric voxel resampling
to 2mm was applied using the ANTs ResampleImage tool (v2.2). The final size of the images after preprocessing was 91x109x91 voxels.
We trained a 3D-CNN neural model over a federation of 3 (ADNI phases), 4 (ADNI phases + OASIS),
and 5 learners (ADNI phases + OASIS + AIBL). Table 2.4 shows the performance of the federated and the
centralized models. The federated models were trained using the synchronous protocol (i.e., SyncFedAvg)
for 40 federation rounds with each learner training locally for 4 epochs in-between rounds. The centralized
models were trained for 100 epochs with early stopping from epoch 50 to avoid the effects of overfitting.
All models were trained using Stochastic Gradient Descent with a learning rate of 2e-4, dropout layers
with a dropout rate of 0.2, and L2 weight regularization with λ = 0.1. All experiments were run 3 times
and the results show the average and standard deviation of the metrics.
Federated Learning achieves performance comparable to Centralized training, even though the 5 sites
are very heterogeneous in terms of data sizes and proportions of cases versus controls. In fact, federated
learning achieves slightly better performance for some metrics and exhibits lower variance in the results,
pointing to better generalization.
Model   Accuracy   Precision   Recall   F1   PR AUC   ROC AUC
ADNI only 0.8570± 0.0090 0.7940± 0.0311 0.8270± 0.0288 0.8095± 0.0080 0.8639± 0.0052 0.8954± 0.0057
OASIS only 0.4428± 0.0194 0.3686± 0.0036 0.7447± 0.0518 0.4927± 0.0091 0.3396± 0.0020 0.4631± 0.0047
AIBL only 0.8050± 0.0044 0.7246± 0.0153 0.7577± 0.0172 0.7405± 0.0022 0.7990± 0.0005 0.8526± 0.0017
Centralized 5AOB 0.8612± 0.0106 0.7977± 0.0350 0.8287± 0.0271 0.8122± 0.0091 0.8683± 0.0130 0.8986± 0.0051
Federated 3A 0.8462± 0.0043 0.8148± 0.0189 0.8048± 0.0194 0.8095± 0.0038 0.8791± 0.0039 0.8967± 0.0012
Federated 4AO 0.8474± 0.0073 0.7955± 0.0126 0.8296± 0.0026 0.8121± 0.0077 0.8766± 0.0007 0.8920± 0.0014
Federated 5AOB 0.8633± 0.0013 0.8098± 0.0043 0.8132± 0.0097 0.8114± 0.0031 0.8682± 0.0009 0.8971± 0.0006
Table 2.4: Alzheimer’s Disease Prediction. Test results on a global stratified test dataset (5 sites), for each
dataset by itself; 3 sites, ADNI 1,2,3 (3A); 4 sites, ADNI 1,2,3 + OASIS (4AO), and 5 sites, ADNI 1,2,3 +
OASIS + AIBL. In federated environments, each dataset is at a different learner. Centralized environments
are trained over all the corresponding datasets.
2.3 Security & Privacy Concerns
Recent works have shown that Federated Learning protocols may not always provide sufficient privacy and
robustness guarantees and are vulnerable to adversaries operating inside or outside the system [14, 171,
182, 264, 299]. These adversaries may impose different privacy risks and/or security threats by attempting
to extract sensitive information regarding federation participants or compromising the integrity of the
entire federated system. Building upon the previous work of [171], in Figure 2.9 we taxonomize all possible
threat models that may occur at different phases of the federated learning process.
2.3.1 Threat Models
Based on the attacker locality (cf. Figure 2.9), an attack can be initiated by a federation insider (aggregator,
participants) or an outsider, such as an eavesdropper listening to the communication channel between
the participants and the aggregator, or the final users of the federation model.

Figure 2.9: Federated Learning Threats Taxonomy. [Tree classifying threats by attacker locality (outsider: eavesdropper, FL model users; insider: semi-honest controller or learner), adversary setting (honest-but-curious vs. malicious, e.g., modifying exchanged messages), attack stage (training: data/model poisoning; inference: white-box/black-box), and attack aim (targeted: learn to predict adversary outcomes; untargeted: compromise global model integrity).]

Additionally, if the adver-
sary aims to learn participants’ private states but deviates from the federated learning protocol (e.g., by
modifying or removing messages) then we have a malicious setting, while if the adversary follows the pro-
tocol truthfully and only observes the exchanged information, then we have an honest-but-curious setting.
Moreover, depending on the model’s learning stage upon which an attack is launched, the attack can be
performed during model training and/or model inference. Finally, if the attacker aims to compromise the
model’s integrity arbitrarily then these attacks are classified as untargeted, while if the attacker aims to
manipulate the model toward learning specific adversary outcomes then these attacks are called targeted.
2.3.2 Privacy Risks
During federated training learners exchange locally trained model parameters (gradients, weights) with
the federation controller. These parameters may contain sensitive information learned from features of the participants' training data; if not protected, private data features may be leaked to an adversary. An adversary may use these parameters directly through gradients or "pseudo-gradients" (difference
of two consecutive model weights) to infer: class representatives (GAN attacks [100]), whether specific
training samples were used during training (membership inference attack [92, 188]), certain properties
of other participants’ training data [182], or even the entire training data input features and labels (deep
leakage [299]). In general, this information leakage is commonly attributed to the fact that deep learning
models tend to memorize more training features than the ones required to learn the main task [92, 292].
To protect against such privacy attacks, existing works have proposed approaches based on homomorphic
encryption [84, 194, 244], secure multi-party computation [54, 287], and differential privacy [66, 67]. In
the latter case, three distinctive approaches can be identified depending on the origin of the added noise:
Centralized Differential Privacy (CDP [119, 181, 271]), Local Differential Privacy (LDP [64, 284]), and Dis-
tributed Differential Privacy (DDP [6, 207]). CDP refers to the case where a trusted aggregator adds noise
to the aggregated parameters, LDP is where each participant is responsible to add noise locally before
sharing its own parameters, and DDP where the required noise is sourced from multiple participants. In
sections 5.1 and 5.2 we discuss defensive mechanisms against privacy attacks, based on fully homomorphic
encryption, and against membership inference attacks based on gradient noise.
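As a flavor of the local (LDP) approach, the following sketch clips a learner's update and perturbs it with Gaussian noise before it is shared; it is illustrative only, since a real deployment must calibrate the noise scale and track the cumulative (ϵ, δ) budget across rounds, as in DP-SGD [2].

import numpy as np

def privatize_update(delta, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip a local update and add Gaussian noise before sharing it with the server."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(delta)
    delta = delta * min(1.0, clip_norm / (norm + 1e-12))  # bound the sensitivity
    sigma = noise_multiplier * clip_norm                   # Gaussian-mechanism noise scale
    return delta + rng.normal(0.0, sigma, size=delta.shape)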
2.3.3 Security Threats
Adversaries pose security threats when their objective is to compromise the robustness of the federated
learning system and put the entire learning process at risk [171]. By deliberately altering the local training
data of the participants (data poisoning attacks [264, 280]) or the trained parameters (model poisoning
attacks [20, 27, 80, 247]), adversaries can modify the behavior of the system, attack the convergence of the
global model [24], and/or implant backdoor triggers into the federated model [20]. For instance, through
Byzantine attacks [31, 288], an adversary aims to destroy the convergence and the performance of the
federated model, while through backdoor triggers [80], an adversary can trick the federated model into
always predicting an adversarial target task while maintaining good performance on the main task. To
provide Byzantine-resilient federated training recent works have proposed robust aggregation rules at the
aggregator, such as Krum/Multi-Krum [31], RFA [201], and Coordinate-wise Median [288], while other
works provide protection against backdoor attacks through outlier detection and erasing techniques [155]; see also section 5.3.
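As a minimal sketch of one such robust rule, coordinate-wise median [288] replaces the weighted average at the aggregator with a per-coordinate median, which bounds the influence any single (possibly Byzantine) learner can exert; the helper below assumes the learners' models are flattened to vectors.

import numpy as np

def coordinate_wise_median(local_models):
    """Byzantine-resilient aggregation: per-coordinate median of the learners' models."""
    return np.median(np.stack(local_models, axis=0), axis=0)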
Chapter 3
Heterogeneity in Federated Learning
A typical federated learning environment consists of a set of participating clients (learners) and a central-
ized aggregator (federation controller) that orchestrates the federated execution. This is also known as a
centralized [151, 211] or star-shaped [22] federated learning topology. As shown in Figure 3.1, such a centralized topology can be further divided into different learning tiers: the Model Exchange Tier, the Learners Tier, and the Data Tier. The Model Exchange Tier represents the communication layer between the learners and the controller, the Learners Tier represents the model processing (training) layer, and the Data Tier represents the (pre-)processing layer of every learner's private local dataset.
Figure 3.1: A Typical Federated Learning System Architecture. [The federation controller computes the community model as a weighted average of the learners' models W_1, ..., W_N (Model Exchange Tier); each learner's trainer operates on its private training set (Learners Tier and Data Tier).]
However, depending on the heterogeneity of the system and data characteristics of the federated learn-
ing environment, different learning challenges may arise that may impact the convergence of the federated
model. To simplify the analysis of these challenges, in Figure 3.2 we provide a categorization of these het-
erogeneities into system (communication and processing), statistical, and semantic (data storage, data schema, and data value) heterogeneity.
Figure 3.2: Federated Learning Heterogeneities. [Communication heterogeneity: clients may have different communication capabilities (e.g., bandwidth). Processing heterogeneity: clients may have different processing (e.g., CPU, GPU, TPU) and memory capabilities. Statistical heterogeneity: the data distribution may not follow the global distribution. Data storage heterogeneity: data is stored in different storage engines and formats. Data schema heterogeneity: data may be structured under different schemata. Data value heterogeneity: data may have missing values and unnormalized values referring to different entities.]
3.1 System Heterogeneity
System heterogeneity refers to all the federated system-related characteristics that can affect the system
throughput and model learning performance. In a cross-silo setting, the primary bottleneck is processing
heterogeneity because clients are usually large institutions/organizations with varying processing power
but relatively fast network interconnectivity [120, 246]. However, in a cross-device setting, where feder-
ations are formed by a large pool of clients with limited computational capabilities, processing and com-
munication heterogeneities have a greater impact on the overall system performance [120].
3.1.1 Processing Heterogeneity
Irrespective of the investigated federated learning environment (cross-silo or cross-device), participating
clients may be equipped with a different hardware or processing unit (e.g., TPU, GPU, CPU) to perform
model training and evaluation. This heterogeneous hardware capability across clients can introduce exe-
cution latency in the overall federated training procedure and heavily underutilize the available federation
resources. In Figure 3.3, we show how such heterogeneity can lead to disproportional idle and active train-
ing times between fast (GPU) and slow (CPU) clients when training a 2-CNN network on the CIFAR-10
dataset using a synchronous protocol. As is evident, the fast learners are severely underutilized, waiting for the slow learners (stragglers) to complete their local training; the CPUs' batch processing is 10 times slower than the GPUs'.
Figure 3.3: Active vs Idle time for a computationally heterogeneous federated learning environment consisting of 5 fast (GPU) and 5 slow (CPU) learners training on the CIFAR-10 dataset using a 2-CNN model. [Horizontal bars of active vs. idle processing time (secs) per learner (GPU:1-5, CPU:1-5).]
To further demonstrate the impact of processing heterogeneity on the model learning performance,
in Figure 3.4 we show how two federated environments of 10 clients with identical data distributions
but different computational capabilities require different amounts of processing/training time to reach
the same levels of accuracy in the CIFAR-10 dataset. Both federated environments were trained using a
synchronous communication protocol and FedAvg as the local models’ aggregation rule. In particular, the
computationally homogeneous federation environment is 16 times faster than its heterogeneous counterpart (cf. Figure 3.3), since it needs close to 500 secs to learn a model with a test accuracy of around 80%, while the heterogeneous environment needs around 8,000 secs to learn a similar model.
Figure 3.4: Effect of computational (processing) heterogeneity on federated model learning performance. [Test top-1 accuracy vs. processing time (secs) on CIFAR-10 for a homogeneous (10 GPUs) and a heterogeneous (5 CPUs, 5 GPUs) computational environment.]
3.1.2 Communication Heterogeneity
Even though communication is not always the primary bottleneck for a cross-silo setting [120], various
communication latencies may exist due to the physical distance of organizations/data centers [176] par-
ticipating in the federation. However, in a cross-device setting where a large number of clients with com-
putationally limited capabilities (e.g., mobile phones, IoT) participate in the federation, communication
latencies may exist due to the varying network interconnectivity across clients, limited channel capac-
ity, and bandwidth. This variability creates challenging federated environments that consist of highly
unreliable and stateless clients that may drop out of the federation unexpectedly. These heterogeneous
communication resources can have a severe impact on the convergence speed and learning performance
of the federated model [120, 151, 285].
3.2 Statistical Heterogeneity
In a "traditional" centralized machine learning setting all the required training data are aggregated at a
single location. This centralization enables the training of machine learning models on Independent and
Identically Distributed (IID) data and allows the application of different optimization techniques, leading
to improved learning performance. However, this IID assumption is infeasible in practice in federated
learning settings. Clients participating in a federation may own a disproportional number of local training
examples and their private local dataset may not always follow the same data distribution. This data
distribution or statistical heterogeneity is commonly referred to as non-IID [120, 153, 154, 180].
In a centralized setting, this data distribution (or statistical) heterogeneity is observed in the context
of dataset shift [206], which studies the differences in the joint distribution of inputs and outputs between
training and test data distribution. Following the taxonomy presented in the work of [120], the heterogene-
ity of the training sets can be characterized by different non-IID aspects, such as feature distribution skew, label distribution skew, same label but different features, same features but different labels, and quantity skew (unbalancedness).
In Figure 3.5 we present representative federated environments for the CIFAR-10 dataset: balanced and IID (Figure 3.5a), and quantity-skewed environments with label distribution skewness of varying severity, from extreme (Figure 3.5b) to less extreme (Figure 3.5d).
Figure 3.5: Statistically heterogeneous federated learning environments in CIFAR-10. [Per-learner (GPU:1-GPU:10) training-example counts for (a) Uniform & IID, (b) PowerLaw & Non-IID(5), (c) Skewed & Non-IID(5), (d) Skewed & Non-IID(3).]
To further emphasize the impact of this statistical heterogeneity on the convergence of the federated model, in Figure 3.6 we show the convergence of the federated model trained on each federated environment presented in Figure 3.5, in terms of test accuracy over the course of 200 federation rounds. As is expected, the performance of the federated model learned in the Uniform & IID data distribution outper-
forms all other environments. However, as we get to more extreme degrees of label distribution skewness
(cf. Skewed & Non-IID(3) vs. PowerLaw & Non-IID(5)) the impact on the final learning performance is
greater, with the Skewed & Non-IID(3) environment reaching a test accuracy of 71%, while PowerLaw &
Non-IID(5) reaches a test accuracy of 79.5% and Uniform & IID an accuracy of 83% that is very close to the
centralized environment.
Figure 3.6: Effect of heterogeneous data distributions on federated learning performance. [Test top-1 accuracy over 200 federation rounds on CIFAR-10 for Uniform & IID, PowerLaw & Non-IID(5), Skewed & Non-IID(5), and Skewed & Non-IID(3).]
3.3 Semantic Heterogeneity
In Federated Learning settings, the same privacy constraints that prevent data sharing across sources in-
herently create isolated data environments, i.e., data silos [109]. Most work in Federated Learning focuses
on solving challenges related to the distributed learning optimization problem [151, 265], but the core
challenge of data harmonization across silos is overlooked. Existing systems [149] assume that the local
data at the participating sources (input into the learning model) follow the same schema, format, seman-
tics, and storage and access capabilities. Such an assumption does not hold in realistic learning scenarios,
where geographically distributed data sources have their own unique data specifications (cf. Figure 3.7); a
challenge that is commonly observed in Federated Database Management Systems [98, 231], Data Integra-
tion [60], and Data Exchange [70]. In this section, we discuss the different levels of semantic heterogeneity
that may be found within each (data) silo in a federated learning environment and demonstrate how each
heterogeneity can affect the convergence of the federated model.
Figure 3.7: Semantic Heterogeneities. [Each data silo differs in its local data storage, local data schema, and local data values; every learner trains a local ML model that the federation controller aggregates into the global ML model.]
3.3.1 Data Storage & Schema Heterogeneity
Every participating learner trains the global model on its private local dataset during federated training.
However, learners’ local datasets do not always meet the exact data specifications. A learner may require
different data preprocessing steps to prepare local data as input to the training model. For instance, the
required input data may be stored across multiple data storage engines under different schemata from
which the data must be extracted, transformed, and loaded (ETL). Harmonizing these data discrepancies
across all learners is critical to ensure every learner participates in the federation.
Suppose a learner does not reconcile its storage and schema heterogeneities to map its local dataset
to the common input schema required by the model. In that case, it may be unable to participate and
contribute to the federation. As a result, this may lead to reduced data availability and reduced model
performance for the entire federation. In Figure 3.8, we show how the learning performance of the fed-
erated model is affected when silos are removed from a federation if they do not meet the required data
specification in IID and Non-IID data environments. Specifically, from an original federation pool of 10
and 100 available learners, we train the federated model over 3, 5, 8, 9, and 10 learners (out of 10) and 30,
50, 80, 90, and 100 learners (out of 100). The investigated federated environments shown in Figure 3.8 are
trained using a simple MLP over the California Housing dataset, where the task is to predict California
districts’ median house value in terms of hundreds of thousands of dollars. The dataset was derived from
the 1990 U.S. census (see https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html)
and consists of 20,640 data records, from which we use 16,512 samples for training
and 4,128 for testing. We randomly split the train records across 10 and 100 learners for the IID data envi-
ronments, respectively. For the Non-IID environments, we first sort the records in increasing order based
on the median house value (prediction variable) and then partition the dataset across 10 and 100 learners
in a round-robin fashion.
To perform the federated learning experiments shown in Figure 3.8, we use synchronous FedAvg with full client participation, i.e., we consider all available learners at each federation round. As shown in the
figure, the more learners (more data) we add to the federation, the better the learning performance (test
error decreases) of the federated model. It is important to note, though, that the model learning performance
in the case of 100 learners is worse than in the case of 10 learners because each learner in the former case
contains 1/100 of the original data, which makes the overall federated training harder.
Figure 3.8: Federated Model Evaluation on Silo Removal due to Storage and Schema Heterogeneity. For MSE scores, the lower the value, the better. [Test MSE vs. federation size (3, 5, 8, 9, 10 of 10 and 30, 50, 80, 90, 100 of 100 remaining learners) for (a) California Housing IID and (b) California Housing Non-IID environments.]
3.3.2 Data Value Heterogeneity
Even if the data schemata and storage heterogeneities are harmonized across learners, the harmonized
input model data may need to be rescaled or normalized to reflect the common feature space across all
learners. For example, in a radiation oncology domain, one source may code an anatomical structure with
a value “LTemp lobe,” while another uses the value “LTemporal.” To provide precise semantics for analysis,
we must map these two values to a normalized value such as "Left Temporal Lobe". Some of the harmonized
data across learners may also contain missing values. In such cases, learners must either drop the records
with missing values altogether or impute them (cf. Chapter 6).
However, suppose value normalization is impossible or practically infeasible (due to privacy or security constraints, cf. Section 7). In that case, it may be better to discard such learners from the federation, since they might not be able to contribute to the training of the federated model. As also shown in Figure 3.8, the
greater the number of learners being removed, the greater the impact on the convergence of the federated
model. Accordingly, in the case of training data records/samples with missing values, if an imputation
method is not applied, then the overall data contribution of a learner to the federated model is reduced since
the learner will have to train the model on a smaller dataset. In Figure 3.9, we show how the performance of
the federated model is negatively affected when removing data samples with missing values from learners’
local training dataset through three different data missingness mechanisms [261]: MCAR, MAR, and MNAR; see Chapter 6 for more details on these mechanisms.
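For concreteness, the following sketch injects MCAR missingness, where every cell is masked independently with the same probability; MAR and MNAR instead condition the mask on observed or unobserved values, respectively (see Chapter 6). The helper is our illustration, not the experimental code.

import numpy as np

def inject_mcar(X, missing_ratio=0.3, seed=1990):
    """MCAR: mask each cell independently with probability missing_ratio."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    X[rng.random(X.shape) < missing_ratio] = np.nan
    return X

# Listwise deletion (the "MVs Deletion" setting) then drops rows with any missing value:
# X_complete = X[~np.isnan(X).any(axis=1)]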
Figure 3.9: Federated Model Evaluation on Removed Training Samples due to Missing Values in IID and Non-IID Environments. For MSE scores, the lower the value, the better. [Test MSE vs. missing ratio (0.0-0.3) under MCAR, MAR, and MNAR for (a) 10 learners IID, (b) 100 learners IID, (c) 10 learners Non-IID, (d) 100 learners Non-IID, with and without missing-value deletion.]
As shown in Figure 3.9, when data missingness becomes too extreme (e.g., missing ratio = 0.3), the federated model learned from incomplete records underperforms the federated model trained using the complete dataset in all federated environments. Interestingly, in the more challenging federated envi-
ronments with 100 learners and Non-IID distributions, any missing record can hurt the federated model
learning performance. However, when the data is partitioned across relatively few learners (i.e., 10 learners), the effect of missing data is smaller, and in some cases the model performs comparably to, or even better than, the model trained on the complete data records (e.g., 10 learners, Non-IID, MCAR and MNAR with missing ratios 0.1 and 0.2).
3.4 Thesis Statement
Improve model convergence, model learning performance, and resource utilization in semantically, statisti-
cally, and computationally heterogeneous Federated Learning environments, without data sharing.
Chapter 4
Accelerated Learning in Heterogeneous Environments
This chapter presents a set of federated learning protocols that help address the system and statistical heterogeneities arising during federated training (cf. Chapter 3). We analyze how these protocols scale and accelerate federated models' training and inference time and reduce processing and communication costs across a wide range of challenging heterogeneous federated environments, on standard computer vision and neuroimaging benchmark datasets.
4.1 Training Protocols
In this section, we review the main characteristics of synchronous and asynchronous federated learning policies and introduce our novel semi-synchronous policy. Figure 4.1 sketches their training execution flow. We compare these approaches under three evaluation criteria: convergence time, communication cost, and energy cost. The rate of convergence is expressed in terms of parallel processing time, that is, the time it takes the federation to compute a community model of a given accuracy with all the learners running in parallel. Communication cost is measured in terms of update requests, that is, the number of local models sent from any learner to the controller during training. Each learner also receives a community model after each request, so the total number of models exchanged is twice the number of update requests. Energy cost is based on the cumulative processing time (total wall-clock time across learners required to compute the community model) and on the energy efficiency of each learner (e.g., GPU, CPU). Table 4.1 summarizes our findings (cf. Section 4.1.6).

Protocol         | Processing Cost | Communication Cost | Energy Cost | Idle-Free | Stale-Free
Synchronous      | high            | low                | high        | x         | ✓
Asynchronous     | low             | high               | medium      | ✓         | x
Semi-synchronous | low             | low                | low         | ✓         | ✓

Table 4.1: Federated Learning Training Policies: Characteristics
Figure 4.1: Federated Learning Training Policies: Execution Flow
4.1.1 Synchronous Protocols
Under a synchronous communication protocol, each learner performs a given number of local steps (usu-
ally expressed in terms of local epochs). After all learners have finished their local training, they share
their local models with the centralized server (federation controller) and receive a new community model.
This training procedure continues for a number of federation rounds (synchronization points). This is a
well-established training approach with strong theoretical guarantees and robust convergence for both IID
and Non-IID data [153, 180].
However, a limitation of synchronous policies is their slow convergence due to waiting for slow learners (stragglers [51]). For a federation of learners with heterogeneous computational capabilities, fast learners remain idle most of the time, since they need to wait for the slow learners to complete their local training before a new community model can be computed (Figure 4.1). As we move towards larger networks, this resource underutilization is exacerbated. Figure 4.2(a,c) shows the idle times of a synchronous protocol when training a 2-CNN (a) or a ResNet-50 (c) network in a federation with fast (GPU) and slow (CPU) learners.
Fast learners are severely underutilized.
[Figure 4.2 plots: Active vs. Idle processing time (secs) per learner (GPU:1–5, CPU:1–5) for (a) 2-CNN CIFAR-10 (Sync), (b) 2-CNN CIFAR-10 (SemiSync), (c) ResNet-50 CIFAR-100 (Sync), and (d) ResNet-50 CIFAR-100 (SemiSync).]
Figure 4.2: Active vs Idle time for a heterogeneous computational environment with 5 fast (GPUs) and 5 slow (CPUs) learners. The synchronous protocol was run with 4 epochs per learner and Semi-Synchronous with λ = 2 (including the cold start federation round).
4.1.2 Asynchronous Protocols
In asynchronous Federated Learning, no synchronization point exists and learners can request a commu-
nity update from the controller whenever they complete their locally assigned training. Asynchronous
protocols have faster convergence speed since no idle time occurs for any of the participating learners.
However, they incur higher communication costs and lower generalizability due to staleness [45, 48]. Figure 4.1 (center) illustrates a typical asynchronous policy. The timestamps t_i represent the update requests issued by the learners to the federation controller. No synchronization point exists and every learner issues an update request at its own learning pace. All learners run continuously, so there is no idle time.
Since in asynchronous protocols no strict consistency model [141] exists, it is inevitable for learners to train on stale models. Community model updates are not directly visible to all learners, and different degrees of staleness may be observed [45, 101]. Recently, FedAsync [281] was proposed as an asynchronous training policy for Federated Learning that weights every learner in the federation based on functions of model staleness. FedAsync defines staleness as the difference between the current global timestamp (vector clock [179]) of the community model and the global timestamp associated with the committing model of a requesting learner. Specifically, for learner k, its staleness value is S_k = (T − τ_k + 1)^{−1/2} (i.e., FedAsync+Poly), with T being the current global clock value and τ_k the global clock value of the committing model of the requesting learner.
Given that staleness can also be controlled by tracking the total number of iterations or the number of steps (i.e., batches) applied to the community model, we propose a new asynchronous protocol, FedRec, which weights models based on recency, extending the notion of effective staleness in [47, 48]. For each model, we define the number of steps s that were performed in its computation. Assume a learner k that receives a community model c_t at time t, which was computed over a cumulative number of steps s^t_c (the sum of the steps used by each of the local models involved in computing the community model). Learner k then performs s_k local steps starting from this community model and requests a community update at time t'. By that time, the current community model may contain s^{t'}_c (> s^t_c) steps, since other learners may have contributed steps between t and t'. Therefore, the effective staleness weighting value of learner k in terms of steps is S_k = (s^{t'}_c − (s^t_c + s_k))^{−1/2} (following the FedAsync+Poly function). When s^{t'}_c − (s^t_c + s_k) ≤ 0, S_k is set to 1. As shown in Section 4.1.6, the step-based recency/staleness function of FedRec outperforms the time-based staleness function of FedAsync (cf. Figures 4.7, 4.8, 4.9).
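For illustration, a minimal sketch of the FedRec weighting rule in plain Python (the function and argument names are ours, not the system's):

def fedrec_staleness_weight(community_steps_now, community_steps_received, local_steps):
    # Effective staleness in steps: delta = s_c^{t'} - (s_c^t + s_k).
    delta = community_steps_now - (community_steps_received + local_steps)
    # FedAsync+Poly-style polynomial weighting, clipped to 1 when delta <= 0.
    return delta ** -0.5 if delta > 0 else 1.0

# A learner whose committed model misses 99 community steps is down-weighted,
# while an up-to-date learner keeps full weight.
print(fedrec_staleness_weight(300, 181, 20))  # delta = 99 -> ~0.10
print(fedrec_staleness_weight(200, 180, 20))  # delta = 0  -> 1.0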
4.1.3 Semi-Synchronous Protocol
We have developed a novel Semi-Synchronous training policy that seeks to balance resource utilization and communication costs∗ (cf. Figure 4.2(b,d)). In this policy, every learner continues training up to a specific synchronization time point (cf. Figure 4.1 (right)). The synchronization point is based on the maximum time it takes for any learner to perform a single epoch (t^e_k). Specifically:

t_k^e = \frac{|D_k^T|}{\beta_k} \, t_{\beta_k}, \quad \beta_k, t_{\beta_k} > 0; \qquad t_{\max}(\lambda) = \lambda \max_{k \in N} \{ t_k^e \}, \quad \lambda > 0; \qquad B_k = \frac{t_{\max}}{t_{\beta_k}}, \quad \forall k \in N \qquad (4.1)

∗In our cross-silo settings, we do not consider delays due to learners' transmission speed. In our experiments in Section 4.1.6, model transmission time is less than 0.5% of the computation time.
where |D^T_k| refers to the local training data size of learner k, β_k to the batch size of learner k, and t_{β_k} to the time it takes learner k to perform a single step (i.e., process a single batch). The hyperparameter λ controls the number of local passes the slowest learner in the federation needs to perform before all learners synchronize. For example, λ = 2 means the slowest learner completes two epochs per federation round. The hyperparameter λ can also be fractional, that is, the slowest learner may only process part of its training set in a federation round. The term B_k denotes the number of steps (batches) learner k needs to perform before issuing an update request, which depends on its computational speed. A theoretical analysis of the weight divergence of our proposed Semi-Synchronous training scheme compared to the centralized model, and a more detailed formulation of the federated optimization problem, can be found in Appendix A.1.
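As an illustration of Eq. 4.1, the sketch below (our own helper; the batch timings echo the CIFAR-10 measurements reported in Section 4.1.6) derives the synchronization period t_max(λ) and the per-learner step budget B_k from the cold-start statistics:

def semi_sync_schedule(dataset_sizes, batch_sizes, secs_per_batch, lam):
    # t_e_k = (|D_k| / beta_k) * t_beta_k: time for learner k to run one epoch.
    epoch_times = [n / b * t for n, b, t in zip(dataset_sizes, batch_sizes, secs_per_batch)]
    t_max = lam * max(epoch_times)                    # synchronization period
    steps = [int(t_max / t) for t in secs_per_batch]  # B_k batches fit within t_max
    return t_max, steps

# 2 fast (GPU, ~0.03 s/batch) and 2 slow (CPU, ~0.3 s/batch) learners,
# 5,000 samples each, batch size 100, lambda = 2:
t_max, steps = semi_sync_schedule([5000] * 4, [100] * 4,
                                  [0.03, 0.03, 0.3, 0.3], lam=2)
print(t_max)  # 30.0 s: the slowest learner completes exactly lambda = 2 epochs
print(steps)  # [1000, 1000, 100, 100]: fast learners run 10x more batches, none idle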
To compute the necessary statistics (i.e., the time-per-batch of each learner), SemiSync performs an initial cold-start federation round (see GPUs in Figure 4.2(b,d)) where every learner trains for a single epoch and the controller collects the statistics to synchronize the next SemiSync federation round. Here, the hyperparameter λ and the timings per batch are kept static throughout the federation training once defined, although other schedules are possible. To obtain a good estimate of the processing time per batch, in the cold-start phase the system has every learner complete a full epoch. The system sets a maximum duration for the cold start to prevent a very slow learner from disrupting the federation.
In our SemiSync approach, the learners’ synchronization point does not depend on the number of
completed epochs, but on the synchronization period. Learners with different computational power and
amounts of data perform a different number of epochs, including fractional epochs. There is no idle time.
Since the basic unit of computation is the batch, this allows for a more fine-grained control of when a
learner contributes to the community model. This policy is particularly beneficial in heterogeneous com-
putational and data distribution environments.
4.1.4 Training Policies Cost Analysis
Parallel Processing Time. We are interested in the wall-clock time it takes the federation to reach a community model of a given accuracy, with all learners running in parallel. For synchronous and semi-synchronous protocols, this is simply the number of federation rounds (R) times the synchronization period (t_max(λ)) [170]. For asynchronous protocols, this is the time at which a learner submits the last local model that makes the community model reach the desired accuracy (e.g., time t_6 in Figure 4.1). If pt denotes the processing time difference between update requests and U the number of update requests, then the total parallel processing time is computed as:

PT_{\text{Sync/SemiSync}} = R \, t_{\max}(\lambda); \qquad PT_{\text{Async}} = \sum_{i=1}^{U} pt_i \qquad (4.2)
Communication Cost. We measure communication cost by the total number of update requests issued during training by the learners to the federation controller. Each update request accounts for two model exchanges: the learner sends its local model to the controller and receives the community model. In a federation of N learners, for synchronous and semi-synchronous protocols with R synchronization points, and for asynchronous protocols with U update requests:

CC_{\text{Sync/SemiSync}} = N R; \qquad CC_{\text{Async}} = U \qquad (4.3)
Energy Cost. The energy cost is based on the cumulative processing time of all learners to complete their local training [170], weighted by the energy cost (ϵ_k) of each learner's processor (e.g., GPU or CPU). For asynchronous protocols, let Λ_k denote the total number of local epochs performed by a learner k to reach a particular timestamp; then the cumulative energy cost is:

EC_{\text{Sync}} = R \lambda \sum_{k=1}^{N} \epsilon_k t_k^e; \qquad EC_{\text{SemiSync}} = R \, t_{\max}(\lambda) \sum_{k=1}^{N} \epsilon_k; \qquad EC_{\text{Async}} = \sum_{k=1}^{N} \epsilon_k \Lambda_k t_k^e \qquad (4.4)
4.1.5 Federated Training Environment
Algorithm 2 describes the execution pipeline for the synchronous, semi-synchronous, and asynchronous communication protocols. In synchronous and semi-synchronous protocols, the controller waits for all the participating learners to finish their local training task before it computes a community model, distributes it to the learners, and proceeds to the next global iteration. In asynchronous protocols, the controller computes a community model whenever a single learner finishes its local training, using the caching mechanism, and sends the new model to that learner. In all cases, the controller assigns a contribution value p_k to the local model w_k that a learner k shares with the community. For synchronous and asynchronous FedAvg (SyncFedAvg and AsyncFedAvg, respectively), this value is statically defined and based on the size of the learner's local training dataset, |D^T_k|. For other weighting schemes, such as FedRec, the Staleness procedure computes it dynamically. The LearnerOpt procedure implements the local training of each learner. The local training task assignment information is passed to every learner through the metadata collection, meta. Finally, a learner trains its local model using either Vanilla SGD, Momentum SGD, or FedProx as its local SGD solver.
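For concreteness, a NumPy sketch of the three local update rules used inside LearnerOpt (the gradient is left abstract; µ is the FedProx proximal coefficient from [152]):

import numpy as np

def local_step(w, grad, eta, rule, u=None, gamma=0.0, mu=0.0, w_community=None):
    # One mini-batch update under the three local solvers of Algorithm 2.
    if rule == "vanilla":   # w_{t+1} = w_t - eta * grad
        return w - eta * grad, u
    if rule == "momentum":  # u_{t+1} = gamma*u_t - eta*grad; w_{t+1} = w_t + u_{t+1}
        u = gamma * u - eta * grad
        return w + u, u
    if rule == "fedprox":   # vanilla step plus a proximal pull toward w_c
        return w - eta * grad - eta * mu * (w - w_community), u
    raise ValueError(f"unknown rule: {rule}")

# One step of each solver on a toy one-dimensional weight with gradient 0.5:
w, g = np.array([1.0]), np.array([0.5])
print(local_step(w, g, 0.05, "vanilla")[0])                                     # [0.975]
print(local_step(w, g, 0.05, "momentum", u=np.zeros(1), gamma=0.75)[0])         # [0.975]
print(local_step(w, g, 0.05, "fedprox", mu=0.001, w_community=np.zeros(1))[0])  # [0.97495]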
Asynchronous Community Model using Caching. As presented in Equation 2.1, the new community model w_c can be computed as the weighted average of the most recent model that each learner has shared with the controller. To facilitate this computation, it is natural to store the most recently received local model of every learner in memory or on disk. The memory and storage requirements of a community model therefore depend on the number of learners contributing to it. Similarly to the computation of the community model in synchronous protocols, we extend this approach to asynchronous protocols by applying our proposed caching scheme at every update request.
Algorithm 2: FL Training Protocols. Community model w_c, comprising m matrices, is computed from N learners; γ = momentum attenuation factor; η = learning rate; β = batch size.

Initialization: w_c, γ, η, β

(Semi-)Synchronous:
  for t = 0, ..., T − 1 do
    for each learner k ∈ N in parallel do
      w_k = LearnerOpt(w_c, meta)
      p_k = |D^T_k|                                  (SyncFedAvg)
    w_c = (1/P) Σ_{k=1..N} p_k w_k, with P = Σ_k p_k
    Reply w_c to every learner

Asynchronous:
  P = 0; ∀k ∈ N, p_k = 0; ∀i ∈ m, W_{c,i} = 0
  ∀k ∈ N: LearnerOpt(w_c, meta)
  while true do
    if learner k requests update then
      p'_k = |D^T_k|                                 (AsyncFedAvg)
             or Staleness(k)                         (FedRec)
      P' = P + p'_k − p_k
      for i ∈ m do
        W'_{c,i} = W_{c,i} + p'_k w'_{k,i} − p_k w_{k,i}
      w'_c = (1/P') W'_c
      Reply w'_c to learner k

LearnerOpt(w_t, meta):
  B = meta[epochs] · |D^T_k| / β                     (Sync & Async)
  B = meta[t_max] / t_{β_k}                          (SemiSync)
  B = shuffle B training batches of size β
  for b ∈ B do
    if Vanilla SGD then
      w_{t+1} = w_t − η ∇F_k(w_t; b)
    if Momentum then
      u_{t+1} = γ u_t − η ∇F_k(w_t; b)
      w_{t+1} = w_t + u_{t+1}
    if FedProx then
      w_{t+1} = w_t − η ∇F_k(w_t; b) − ημ (w_t − w_c)
  Reply w_{t+1} to controller

Staleness(k):
  s^{t'}_c = community model steps at the current time t'
  s^t_c = community model steps at the previous time t
  s_k = learner k steps between t and t'
  δ = s^{t'}_c − (s^t_c + s_k)
  S_k = δ^{−1/2} if δ > 0, else 1
  Reply S_k
For synchronous protocols, we always need to perform a pass over the entire collection of stored local models, with a computational cost of O(MN), where M is the size of the model and N is the number of participating learners. For asynchronous protocols, where learners generate update requests at different paces and the total number of update requests is far greater than in the synchronous case, such a repeated complete pass is expensive; instead, we can leverage the cached/stored local models to compute a new community model in O(M) time, independent of the number of learners.
Consider an unnormalized community model consisting of m matrices, w_c = ⟨w_{c_1}, w_{c_2}, ..., w_{c_m}⟩, and a community normalization weighting factor P = Σ_{k=1..N} p_k. Given a new request from learner k with new community contribution value p'_k, the new normalization value is P' = P + p'_k − p_k, where p_k is the learner's previous contribution value. For every component matrix w_{c_i} of the community model, the updated matrix is w'_{c_i} = w_{c_i} + p'_k w'_{k,i} − p_k w_{k,i}, where w'_{k,i} and w_{k,i} are the new and existing component matrices of learner k. The new community model is w_c = (1/P') w'_c. Using this caching approach, in asynchronous execution environments, the most recently contributed local model of every learner in the federation is always considered in the community model.
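A minimal NumPy sketch of the cached O(M) update (our own illustration, with a single matrix per learner for brevity; the full scheme applies the same rule to every component matrix):

import numpy as np

class CommunityCache:
    # Keeps the unnormalized community model and each learner's last (p_k, w_k).
    def __init__(self, model_shape):
        self.W = np.zeros(model_shape)  # unnormalized community model
        self.P = 0.0                    # normalization factor P = sum of contributions
        self.cache = {}                 # learner_id -> (p_k, w_k)

    def update(self, learner_id, p_new, w_new):
        # O(M) update: subtract the learner's old contribution, add the new one.
        p_old, w_old = self.cache.get(learner_id, (0.0, np.zeros_like(w_new)))
        self.P += p_new - p_old
        self.W += p_new * w_new - p_old * w_old
        self.cache[learner_id] = (p_new, w_new)
        return self.W / self.P          # normalized community model w_c

# A repeat update from learner 0 replaces (rather than re-adds) its old model:
agg = CommunityCache((2, 2))
agg.update(0, 1.0, np.ones((2, 2)))            # w_c = 1.0 everywhere
agg.update(1, 3.0, 2 * np.ones((2, 2)))        # w_c = (1*1 + 3*2)/4 = 1.75
w_c = agg.update(0, 1.0, 3 * np.ones((2, 2)))  # w_c = (1*3 + 3*2)/4 = 2.25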
Some existing asynchronous community mixing approaches [236, 281] compute a weighted average, using a mixing hyperparameter, between the current community model and the committing local model of a requesting learner. In contrast, our caching approach eliminates this mixing-hyperparameter dependence and performs a weighted aggregation using all recently contributed local models. Figure 4.3 shows the computation cost for different sizes of a ResNet community model (from ResNet-20 to ResNet-200) in the CIFAR-100 domain, as the federation grows to 1000 learners. With our caching mechanism, the time to compute a community model remains constant irrespective of the number of participating learners, while it increases significantly without it.
[Figure 4.3 plots: Community model computation time (ms) vs. number of learners (0–1000), with cache (left) and without cache (right), for ResNet-20 through ResNet-200.]
Figure 4.3: Community model computation with (left) and without (right) caching.
4.1.6 Protocols Evaluation
We conduct an extensive experimental evaluation of the different training policies on a diverse set of federated learning environments with heterogeneous amounts of data per learner, local data distributions, and computational resources. We evaluate the protocols on the CIFAR-10, CIFAR-100, and ExtendedMNIST By Class [41, 34] benchmark datasets with a federation of 10 learners, as well as on the BrainAGE prediction task [43, 91, 117, 198, 243] with a federation of 8 learners. The asynchronous protocols (i.e., FedRec and AsyncFedAvg) were run using the caching mechanism, except for FedAsync. FedAsync was run using the polynomial staleness function, i.e., FedAsync+Poly, with mixing hyperparameter a = 0.5 and model divergence regularization factor ρ = 0.005, which is reported to have the best performance [281].
Models Architecture. The architectures of the deep learning networks for CIFAR-10 and CIFAR-100 come from the TensorFlow tutorials: for CIFAR-10 we train a 2-CNN† and for CIFAR-100 a ResNet-50‡. The 2-CNN model§ architecture for ExtendedMNIST comes from the LEAF benchmark [34]. The 5-CNN for BrainAGE is from [243]. For all models, during training, we share all trainable weights (i.e., kernels and biases). For ResNet we also share the batch normalization gamma and beta matrices. The random seed for all our experiments is set to 1990.

†CIFAR-10: https://github.com/tensorflow/models/tree/r1.13.0/tutorials/image/cifar10
‡CIFAR-100: https://github.com/tensorflow/models/tree/r1.13.0/official/resnet
§ExtendedMNIST: https://github.com/TalwalkarLab/leaf/blob/master/models/femnist/cnn.py
Models Hyperparameters. For CIFAR-10, in both homogeneous and heterogeneous environments, the synchronous and semi-synchronous protocols were run with Vanilla SGD, Momentum SGD, and FedProx; the asynchronous protocols (FedRec, AsyncFedAvg) were run with Momentum SGD. For CIFAR-100, all the methods were run with Momentum (following the tutorial recommendation). For ExtendedMNIST By Class (following the benchmark recommendation) and BrainAGE, we used Vanilla SGD. We originally performed a grid search, on the centralized model, over different combinations of learning rate η, momentum factor γ, and mini-batch size β. For the proximal term µ in FedProx, we used the values from the original work [152]. After identifying the optimal combination, we kept the hyperparameter values fixed throughout the federation training. In particular, for CIFAR-10 we used η = 0.05, γ = 0.75, µ = 0.001, and β = 100; for CIFAR-100, η = 0.1, γ = 0.9, β = 100; for ExtendedMNIST, η = 0.01 and β = 100; and for BrainAGE, η = 5×10^{−5} and β = 1. For both synchronous and asynchronous policies, we originally evaluated the convergence rate of the federation under different numbers of local epochs {1, 2, 4, 8, 16, 32} and observed the best performance when assigning 4 local epochs per learner. For the semi-synchronous case, we investigated the convergence of hyperparameter λ within the set {0.5, 1, 2, 3, 4}.
Computational Environment. Our homogeneous federation environment for CIFAR and ExtendedMNIST consists of 10 fast learners (GPUs), and our heterogeneous environment of 5 fast (GPU) and 5 slow (CPU) learners. For the BrainAGE task, our homogeneous environment consists of 8 fast learners. The fast learners were run on a dedicated GPU server equipped with 8 GeForce GTX 1080 Ti graphics cards with 10 GB RAM each, 40 Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz cores, and 128GB DDR4 RAM. The slow learners were run on a separate server equipped with 48 Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz cores and 128GB DDR4 RAM. The processing time per batch, t_{β_k}, is approximately: for the 2-CNN used in CIFAR-10, 30ms for fast learners and 300ms for slow; for the ResNet-50 used in CIFAR-100, 60ms for fast and 2000ms for slow; for the 2-CNN used in ExtendedMNIST, 50ms for fast and 800ms for slow; and for the 5-CNN used in BrainAGE, 120ms for fast.
Data Distributions. We evaluate the training policies over multiple environments with heterogeneous data sizes, data distributions, and learning problems (classification for CIFAR & ExtendedMNIST, regression for BrainAGE).¶ We consider three types of data size distributions: Uniform, where every learner has the same number of examples; Skewed, where each learner has a progressively smaller set of examples; and Power Law (exponent = 1.5), to model extreme variations in the amount of data (intended to model the long tail of science). Learners train on all available local examples.

To model statistical heterogeneity in classification tasks, we assign a different number of examples per class per learner. Specifically, with IID we denote the case where all learners hold training examples from all the target classes, and with Non-IID(x) we denote the case where every learner holds training examples from only x classes. For example, Non-IID(3) in CIFAR-10 means that each learner only has training examples from 3 target classes (out of the 10 classes in CIFAR-10). For Power Law data sizes and Non-IID configurations, in order to preserve scale invariance, we needed to assign data from more classes to the learners at the head of the distribution. For example, for CIFAR-10 with Power Law and a goal of 5 classes per learner, the actual distribution is Non-IID(8x1,7x1,6x1,5x7), meaning that the first learner holds data from 8 classes, the second from 7 classes, the third from 6 classes, and all 7 subsequent learners hold data from 5 classes. For brevity, we refer to this distribution as Non-IID(5). Similarly, for CIFAR-10 Power Law and Non-IID(3), the actual distribution is Non-IID(8x1,4x1,3x8), and for CIFAR-100, Power Law, and Non-IID(50), the actual distribution is Non-IID(84x1,76x1,68x1,64x1,55x1,50x5).

To model computational heterogeneity, we use fast (GPU) and slow (CPU) learners. In order to simulate realistic learning environments, we sort each configuration in descending data size order and assign the data to each learner in an alternating fashion (i.e., fast learner, slow learner, fast learner, etc.), except for the uniform distributions where the data size is identical for all learners. Due to space limitations, for every experiment, we include the respective data distribution configuration as an inset in the convergence rate plots (Figures 4.6, 4.7, 4.8, 4.9, 4.10).

¶CIFAR & ExtendedMNIST: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2FPQ34F8; BrainAGE: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/2RKAQP
[Figure 4.4 plots: number of examples per class per learner (alternating GPU/CPU learners). Panels: (a) CIFAR-10, Uniform & Non-IID(5); (b) CIFAR-100, Skewed & Non-IID(50); (c) ExtendedMNIST By Class, Power Law & IID.]
Figure 4.4: CIFAR and ExtendedMNIST Sample Target Class and Data Size Distributions
[Figure 4.5 plots: per-learner training data distributions (probability density over ages 40–85, with each learner's mean and standard deviation) for (a,b) Uniform & IID, (c,d) Skewed & IID, and (e,f) Skewed & Non-IID environments.]
Figure 4.5: UK Biobank Data Distributions. Top Row: Amount of data examples across learners in terms of age buckets/ranges. Bottom Row: Local age distribution (histogram) of each learner.
Figure 4.4 shows three representative data distributions for the three classification domains.
The data distributions of the classification tasks that we investigate are based on the work of [297], where the data are evenly (i.e., Uniform in our case) partitioned across 10 learners, with a different class distribution per learner (i.e., Non-IID(2) refers to examples from 2 classes per learner). We extend their work by also investigating the cases where the size of the partitions is not uniform but follows a skewed or a power law distribution (called quantity skew/unbalancedness in [120]).
For the statistical heterogeneity of the regression task in BrainAGE, we partition (following [243]) the
UKBB neuroimaging dataset [183] into training (8356 records) and test (2090 records) across a federation of
8 learners with skewed data amounts, and IID and Non-IID age distributions. In the IID case, every learner
holds data examples from all possible age ranges, while in the Non-IID case learners hold examples from
a subset of age ranges. Figure 4.5 shows the distribution of the training examples across the 8 learners
over three different learning environments in BrainAGE. Every environment is evaluated on the same test
dataset representative of the global age distribution.
Within each learning domain, all training policies were run for the same amount of wall clock time.
In homogeneous computational environments (cf. Figure 4.6b), the total number of update requests for
synchronous and asynchronous policies is similar, since any latency occurs only due to differences in the
amount of training data. However, in heterogeneous computational environments, during asynchronous
training, computationally fast learners issue many more update requests than slow learners. In synchronous and semi-synchronous policies, the update requests are substantially fewer and are driven by the slowest learner in the federation. To highlight the differences in communication costs, Figures 4.7b, 4.8b, and 4.9b show the number of update requests issued by all learners for the same wall-clock time period (set in Figures 4.7a, 4.8a, and 4.9a, respectively). Point markers in the asynchronous policies are just a visual
aid to distinguish them from synchronous and semi-synchronous policies.
CIFAR-10 Results. Figure 4.6 shows the performance of synchronous, semi-synchronous, and asynchronous policies in a homogeneous computational environment, where all learners have the same computational capabilities (10 GPUs). We compare the different policies on heterogeneous amounts of data per learner (Skewed, Power Law), since for a homogeneous computational environment with an equal number of data points (Uniform) across learners, all policies are equivalent. For the synchronous policies, idle time still occurs due to the different data amounts in the Skewed and Power Law data distributions. For SemiSync and asynchronous policies, there is no idle time. We evaluate SyncFedAvg and SemiSyncFedAvg with Vanilla SGD, Momentum SGD, and FedProx, and AsyncFedAvg, FedRec, and FedAsync with Momentum SGD.

[Figure 4.6 plots: (a) Parallel Processing Time; (b) Communication Cost (Cumulative Update Requests).]
Figure 4.6: Homogeneous Computational Environment on CIFAR-10. SemiSync with Momentum (λ = 2) has the fastest convergence. ('G' = GPU in data distribution insets)

SyncFedAvg with Momentum converges very slowly in Power Law data distributions, but using
FedProx as a local optimizer rescues it. SemiSync with Momentum and λ = 2 results in the fastest convergence across the experiments in Figure 4.6(a) (see also Table 4.2). Similar results hold for communication cost in terms of update requests, as shown in Figure 4.6(b). When the difference in the amount of data across learners is not too great (Skewed data distributions), AsyncFedAvg and FedRec have comparable performance, but as the difference in data increases (Power Law data distributions), AsyncFedAvg is more efficient compared to the staleness-aware weighting schemes of both FedRec and FedAsync.
Figure 4.7 shows the performance on the CIFAR-10 domain of synchronous, asynchronous, and semi-
synchronous policies in a heterogeneous computational environment (with 10 learners: 5 fast GPUs, and
5 slow CPUs; the CPUs’ batch processing is 10 times slower than the GPUs). Again, our SemiSync with
Momentum (λ = 2) has the best performance, with the fastest convergence (some other policies eventually reach comparable accuracy levels). As we move towards more challenging Non-IID learning environments,
all methods show a reduction in the final performance (lower accuracy). SyncFedAvg with Momentum
performs reasonably well with moderate levels of data heterogeneity (Skewed data amounts, with either
IID or Non-IID distributions). However, in more extreme data distributions (Power Law), it learns very
slowly, since the local models are more disparate and the momentum factor limits changes to the commu-
nity model. In SemiSync the fast learners perform more local iterations and mix more frequently, which
facilitates the convergence of the federation even for the same momentum factor that slowed learning for
SyncFedAvg. This is exacerbated in the heterogeneous case, compared to the homogeneous case, since the
training speed differences are much greater. Interestingly, SyncFedAvg with FedProx performs much bet-
ter in these cases, although still worse than the SemiSync policies. FedRec dominates other asynchronous
policies on convergence rate, especially in IID environments and in Power Law distributions. In Uni-
form and Skewed Non-IID environments, its performance is comparable to AsyncFedAvg. Comparing the
experiments based on data amounts, we can see that in some learning environments, such as Skewed &
Non-IID(5) and Power Law & Non-IID(5) in Figures 4.6 and 4.7, the optimal accuracy is higher in the Power
Law case compared to its Skewed counterpart. This is due to the fact that the head (G:1) of the Power Law
distribution covers most of the domain data (8 classes) and therefore its local model is more valuable and
has a greater contribution value in the community model.
The communication cost of the SemiSync policies is comparable to that of the synchronous policies, while they reach a high accuracy much more quickly; asynchronous policies require many more update requests to achieve the same (or a worse) level of accuracy (Figure 4.7(b)). For Power Law data distributions, SyncFedAvg with Momentum fails to learn, but SyncFedAvg with FedProx learns and uses communication efficiently. Similar to the homogeneous case, our semi-synchronous policies dominate in heterogeneous environments (see also Table 4.3).
CIFAR-100 Results. Figure 4.8 shows the performance on the CIFAR-100 domain of synchronous, asynchronous, and semi-synchronous policies in a heterogeneous computational environment (with 10 learners: 5 fast GPUs and 5 slow CPUs). Due to the size and computational complexity of the ResNet-50 model used to train on the CIFAR-100 domain, the performance difference between fast and slow learners is large (the CPUs' batch processing is 33 times slower than the GPUs'). Using a smaller value of λ leads to better results for the SemiSync policy. We use λ = 0.5, which means that the slowest learner processes only half of its local dataset at each SemiSync synchronization point. However, since the batches are chosen randomly, after two synchronization points all the data is processed (on average). FedRec dominates the alternative asynchronous policies, though it is less stable. Our SemiSync with Momentum (λ = 0.5) policy yields the fastest convergence rate and final accuracy while remaining communication efficient.
ExtendedMNIST Results. Figure 4.9 shows the results on the ExtendedMNIST By Class domain in heterogeneous environments (10 learners: 5 fast, 5 slow; slow learners (CPUs) are 16 times slower than fast learners (GPUs)).

[Figure 4.7 plots: (a) Parallel Processing Time; (b) Communication Cost (Cumulative Update Requests).]
Figure 4.7: Heterogeneous Computational Environment on CIFAR-10. SemiSync with Momentum (λ = 2) has the fastest convergence and lowest communication cost for a given accuracy. ('G' = GPU, 'C' = CPU)

[Figure 4.8 plots: (a) Parallel Processing Time; (b) Communication Cost (Cumulative Update Requests).]
Figure 4.8: Heterogeneous Computational Environment on CIFAR-100. SemiSync with Momentum (λ = 0.5) significantly outperforms all other policies in this challenging domain. ('G' = GPU, 'C' = CPU)

ExtendedMNIST By Class is a very challenging learning task due to its unbalanced distribution of target classes and its large number of data samples (731,668 training examples). As
we empirically show, SemiSync performs considerably better than the other policies, both in terms of convergence speed and in terms of communication cost. Among the asynchronous policies, FedRec has faster convergence and better generalization. Interestingly, the asynchronous policies have faster convergence at the beginning of the federated training, a behavior that is more pronounced for FedRec in the Power Law & IID environment. Given the large number of records allocated to the slow learners (e.g., C:1 owns ∼90,000 examples in Skewed and >140,000 in Power Law), the total processing time required to complete their local training task is much higher than for the fast learners. Moreover, since in asynchronous policies no idle time exists for the fast learners, the convergence of the community model is driven by their learning pace and more frequent communication, whereas in synchronous and semi-synchronous policies all learners need to complete their local training task before a new community model is computed; hence the delayed convergence of Sync and SemiSync at the start of training.
Scaling Federated Neuroscience Research. We evaluate the synchronous and SemiSync policies on a computationally homogeneous environment with Skewed IID and Non-IID data distributions (for Uniform data amounts both policies behave identically; see [243] for results in this environment) on the BrainAGE domain. In terms of parallel processing time, SemiSync with λ = 3, 4 provides faster convergence for both Skewed IID and Non-IID distributions compared to the synchronous policy. In terms of communication cost, SemiSync with λ = 4 is more communication efficient for both data distributions, with λ = 3 being comparable to synchronous. For IID, SemiSync with λ = 3 leads to the smallest mean absolute error, which is very close to the error reached by the centralized model.
[Figure 4.9 plots: (a) Parallel Processing Time; (b) Communication Cost (Cumulative Update Requests).]
Figure 4.9: Heterogeneous Computational Environments on ExtendedMNIST By Class. SemiSync with Vanilla SGD (λ = 1) outperforms all other policies, with faster convergence and less communication cost.
[Figure 4.10 plots: Test MAE vs. (a) processing time in seconds and (b) federation rounds, for Skewed & IID and Skewed & Non-IID environments, comparing Centralized, SyncFedAvg w/ Vanilla SGD, and SemiSyncFedAvg w/ Vanilla SGD with λ = 2, 3, 4; insets show per-learner data amounts by age bucket ((39,50], (50,60], (60,70], (70,80]).]
Figure 4.10: Homogeneous Computation, Heterogeneous Data Environments on BrainAGE. SemiSync with Vanilla SGD λ = 4 converges faster and with less communication cost compared to its synchronous counterpart. SemiSync with λ = 3 is close to the performance of the centralized model in Skewed & IID.
Models Performance: Time, Communication, and Energy. We analyze the performance of the different federated training policies in terms of time, communication,∥ and energy costs in the CIFAR-10 domain, in both homogeneous (Table 4.2) and heterogeneous (Table 4.3) environments. To compare the training policies, we pick a target accuracy that can be reached by (most of) the policies and calculate the different metrics as defined in Section 4.1.4. To compute the energy cost in the heterogeneous computational environment, we set ϵ_k = 2 for GPUs and ϵ_k = 1 for CPUs (see Eq. 4.4). Since estimating the full energy consumption (network, storage, power conversion) is challenging, we compute the energy cost ratio between GPUs and CPUs based on their Thermal Design Power (TDP) value [50], which accounts for the maximum amount of heat generated by a processor. For the GeForce GTX 1080 Ti GPU the TDP value is ∼180W, while for the Intel(R) Xeon(R) CPU the TDP value is ∼90W. In both homogeneous and heterogeneous domains, SemiSync with Momentum (λ = 2) performs best in terms of the (parallel/wall-clock) time to reach the desired target accuracy and in terms of energy cost (which depends on the cumulative processing time across all learners needed to reach that accuracy). In heterogeneous domains, SemiSync with Momentum (λ = 2) is also the most efficient in terms of communication. In homogeneous domains, SemiSync with Momentum (λ = 4) is more communication efficient, but with slightly slower convergence and larger energy cost compared to λ = 2. Remarkably, when our SemiSync strategy is combined with any local optimizer and compared to its synchronous counterpart, it yields significant energy savings, close to an average 40% energy cost reduction in both homogeneous and heterogeneous computational environments (cf. Tables 4.2 and 4.3). Finally, comparing the performance of the different local optimizers, Momentum SGD provides accelerated convergence with lower communication overhead compared to Vanilla SGD and FedProx. At the same time, Momentum SGD is the most energy-efficient policy compared to the synchronous Vanilla SGD baseline, with a reduction of 3 to 9 times the energy cost.

∥We disregard model transmission costs. In heterogeneous environments for CIFAR-10, CIFAR-100, and ExtendedMNIST, the model size, average model transmission time, average processing time for SemiSync, and ratio of model transmission to computation are: (4MB, 0.4secs, 80secs λ = 2, 0.005), (5MB, 0.4secs, 80secs λ = 2, 0.003), and (50MB, 5secs, 1000secs λ = 1, 0.005), respectively. Similarly, for the homogeneous environment of BrainAGE we have (11MB, 0.9secs, 850secs λ = 2, 0.001).
Experiment (target acc.) / Policy  | Iterations | Parallel Time (s) | Cumulative Time (s) | Com. Cost | Energy Cost (Efficiency)

Skewed & Non-IID(5), 0.7
  Sync w/ Vanilla                  | 123220 | 613  | 6133  | 610  | 12267
  SemiSync (λ=2) w/ Vanilla        | 156025 | 444  | 3541  | 970  | 7083 (1.7x)
  SemiSync (λ=4) w/ Vanilla        | 201385 | 551  | 4507  | 630  | 9014 (1.3x)
  Sync w/ Momentum                 | 92920  | 319  | 3190  | 460  | 6381 (1.9x)
  SemiSync (λ=2) w/ Momentum       | 47485  | 121  | 1005  | 300  | 2010 (6.1x)
  SemiSync (λ=4) w/ Momentum       | 58825  | 161  | 1324  | 190  | 2648 (4.6x)
  Sync w/ FedProx                  | 135340 | 630  | 6305  | 670  | 12610 (0.9x)
  SemiSync (λ=2) w/ FedProx        | 144685 | 380  | 3134  | 900  | 6268 (1.9x)
  SemiSync (λ=4) w/ FedProx        | 185185 | 481  | 3876  | 580  | 7752 (1.5x)
  FedRec                           | 61224  | 924  | 3830  | 358  | 7661 (1.6x)
  AsyncFedAvg                      | 78098  | 1166 | 4864  | 456  | 9729 (1.2x)
  FedAsync                         | 198990 | 3120 | 12709 | 1164 | 25419 (0.4x)

Power Law & Non-IID(3), 0.65
  Sync w/ Vanilla                  | 24240  | 181  | 1811  | 120  | 3623
  SemiSync (λ=2) w/ Vanilla        | 54905  | 199  | 1347  | 170  | 2695 (1.3x)
  SemiSync (λ=4) w/ Vanilla        | 68505  | 208  | 1694  | 110  | 3389 (1.1x)
  Sync w/ Momentum                 | did not reach target accuracy
  SemiSync (λ=2) w/ Momentum       | 31105  | 86   | 667   | 100  | 1335 (2.7x)
  SemiSync (λ=4) w/ Momentum       | 48105  | 146  | 1181  | 80   | 2363 (1.5x)
  Sync w/ FedProx                  | 38380  | 287  | 2876  | 190  | 5752
  SemiSync (λ=2) w/ FedProx        | 75305  | 208  | 1620  | 230  | 3241 (1.1x)
  SemiSync (λ=4) w/ FedProx        | 95705  | 291  | 2157  | 150  | 4315 (0.8x)
  FedRec                           | 45822  | 877  | 3360  | 365  | 6720 (0.5x)
  AsyncFedAvg                      | 30894  | 609  | 2357  | 255  | 4715 (0.7x)
  FedAsync                         | 247068 | 4727 | 18165 | 1981 | 36331 (0.1x)

Table 4.2: CIFAR-10 Performance Metrics on Homogeneous Cluster. SemiSync with Momentum outperforms all other synchronous and semi-synchronous policies. The first column gives the federated learning domain and the target accuracy that each policy needs to reach. Column 'Com. Cost' abbreviates Communication Cost. The total number of models exchanged during training is twice the communication cost value. For every experiment, the energy savings are computed against the synchronous Vanilla SGD baseline.
Exp. (target acc.) / Policy   | Fast (GPU): Iter. | CT(s) | EC    | Slow (CPU): Iter. | CT(s) | EC    | Total: Iter. | PT(s) | CT(s) | CC   | EC (EF)

Uniform & IID, 0.75
  Sync w/ Vanilla             | 24000  | 578  | 1157  | 24000 | 16129 | 16129 | 48000  | 3225 | 16707 | 240  | 17286
  SemiSync (λ=2) w/ Vanilla   | 45250  | 1018 | 2036  | 4750  | 3045  | 3045  | 50000  | 631  | 4064  | 100  | 5082 (3.4x)
  Sync w/ Momentum            | 11000  | 185  | 371   | 11000 | 2700  | 2700  | 22000  | 540  | 2885  | 110  | 3071 (5.6x)
  SemiSync (λ=2) w/ Momentum  | 20250  | 335  | 671   | 2250  | 1217  | 1217  | 22500  | 269  | 1553  | 50   | 1889 (9.1x)
  Sync w/ FedProx             | 24000  | 548  | 1097  | 24000 | 22015 | 22015 | 48000  | 4403 | 22564 | 240  | 23113 (0.7x)
  SemiSync (λ=2) w/ FedProx   | 40250  | 817  | 1634  | 4250  | 4074  | 4074  | 44500  | 928  | 4891  | 90   | 5708 (3x)
  FedRec                      | 50800  | 1588 | 3177  | 1600  | 1449  | 1449  | 52400  | 732  | 3038  | 261  | 4627 (3.7x)
  AsyncFedAvg                 | 69200  | 2439 | 4879  | 8800  | 2870  | 2870  | 78000  | 1315 | 5310  | 389  | 7750 (2.2x)
  FedAsync                    | 76600  | 2703 | 5406  | 8000  | 3147  | 3147  | 84600  | 1406 | 5850  | 422  | 8554 (2x)

Skewed & Non-IID(5), 0.65
  Sync w/ Vanilla             | 28200  | 1010 | 2020  | 22300 | 22812 | 22812 | 50500  | 4562 | 23822 | 250  | 24832
  SemiSync (λ=2) w/ Vanilla   | 164082 | 3585 | 7170  | 16603 | 10006 | 10006 | 180685 | 2153 | 13591 | 220  | 17176 (1.4x)
  Sync w/ Momentum            | 28200  | 670  | 1341  | 22300 | 10896 | 10896 | 50500  | 2179 | 11567 | 250  | 12238 (2x)
  SemiSync (λ=2) w/ Momentum  | 93882  | 1446 | 2893  | 9583  | 4434  | 4434  | 103465 | 1059 | 5881  | 130  | 7328 (3.3x)
  Sync w/ FedProx             | 29328  | 1065 | 2130  | 23192 | 22998 | 22998 | 52520  | 4599 | 24063 | 260  | 25129 (0.9x)
  SemiSync (λ=2) w/ FedProx   | 210882 | 4677 | 9354  | 21283 | 12872 | 12872 | 232165 | 2884 | 17549 | 280  | 22226 (1.1x)
  FedRec                      | 155731 | 5205 | 10411 | 4378  | 4448  | 4448  | 160109 | 2431 | 9654  | 803  | 14860 (1.6x)
  AsyncFedAvg                 | 183982 | 6185 | 12370 | 17361 | 7349  | 7349  | 201343 | 3346 | 13534 | 1021 | 19719 (1.2x)
  FedAsync                    | did not reach target accuracy

Power Law & Non-IID(3), 0.6
  Sync w/ Vanilla             | 9664   | 600  | 1201  | 6496  | 10107 | 10107 | 16160  | 2021 | 10707 | 80   | 11308
  SemiSync (λ=2) w/ Vanilla   | 80102  | 1831 | 3662  | 8183  | 4939  | 4939  | 88285  | 1103 | 6770  | 80   | 8601 (1.3x)
  Sync w/ Momentum            | did not reach target accuracy
  SemiSync (λ=2) w/ Momentum  | 34502  | 573  | 1147  | 3623  | 1723  | 1723  | 38125  | 438  | 2296  | 40   | 2870 (3.9x)
  Sync w/ FedProx             | 13288  | 863  | 1727  | 8932  | 16363 | 16363 | 22220  | 3272 | 17227 | 110  | 18091 (0.6x)
  SemiSync (λ=2) w/ FedProx   | 102902 | 2084 | 4168  | 10463 | 7084  | 7084  | 113365 | 1719 | 9168  | 100  | 11252 (1x)
  FedRec                      | 90458  | 3671 | 7343  | 2647  | 2922  | 2922  | 93105  | 1832 | 6594  | 664  | 10266 (1.1x)
  AsyncFedAvg                 | did not reach target accuracy
  FedAsync                    | did not reach target accuracy

Table 4.3: CIFAR-10 Performance Metrics on Heterogeneous Cluster. Semi-Synchronous with Momentum outperforms all other policies in time, communication, and energy costs. The first column gives the federated learning domain and the target accuracy that each policy needs to reach. (Iter. = total number of local iterations, PT(s) = parallel processing time in seconds, CT(s) = cumulative processing time in seconds, CC = communication cost, and EC(EF) = energy cost with energy efficiency factor.) For every experiment, the energy savings are computed against the synchronous Vanilla SGD baseline.
4.2 MetisFL: A Scalable Federated Training Framework
A scalable federated learning solution should adhere to the architectural principles of modularity, extensi-
bility, and configurability [26]. Modularity refers to the development of functionally independent services
(micro-services) that allow finer control of system components’ interoperability. Extensibility refers to the
functional interface expansion of each service. Configurability refers to the ease of deployment of new
federated models and procedures.
We have designed and developed a flexible Federated Learning system, called Metis, to explore different communication protocols and model aggregation weighting schemes (Figure 4.11). Metis uses TensorFlow [1] as its core deep learning execution engine; other deep learning backends, such as PyTorch and JAX, can also be supported.
[Figure 4.11 diagram: the Federation Driver (workflow definition, cluster initialization/shutdown, federated model initialization) launches the Federation Controller, a VM/containerized service comprising a Training Task Scheduler (Sync, Async, Semi-Sync), Evaluation Task Scheduler, Model Aggregator (e.g., FedAvg, FedAdam), Secure Aggregator (e.g., using FHE), Model Store (e.g., key-value DB), and Client Selector (all, random). The controller communicates over gRPC with Learners 1..N, each running a Model Trainer, Model Evaluator, neural network model, and Dataset Loader; the user submits workflows through a UI, and real-time metrics flow to a Catalog for visualization.]
Figure 4.11: Metis Federated Learning Framework Architecture
4.2.1 Programming Model
Following the successful programming model of Apache Spark [290], the Federation Controller operates
as the cluster manager of the federation, Learners as the computing nodes, and the Driver as the entry
point of the federation, launching various operations in parallel.
Federation Controller. The controller orchestrates the distributed training of the federated model
across learners. It comprises four main components. First, the Model Aggregator mixes the local models
of the learners to construct a new global model. Second, the Training Task Scheduler manages learners’
participation and synchronization points and delegates local training tasks. The Model Store saves the local models and the contribution value of each learner in the federation, to improve the efficiency of model aggregation over multiple training protocols [246]. The Model Store component can be materialized through an in-memory or on-disk key-value store, depending on the number of learners and the size of the models (key: learner id; value: model and its contribution p_k). The controller may also operate within an encrypted environment, in which case it needs to store encrypted local models, and the global model aggregation function needs to be computed with homomorphic operations, such as the commonly-used weighted average methods [123, 245, 291]. Finally, the Evaluation Task Scheduler is responsible for dispatching the evaluation tasks to the learners and collecting the associated metrics.
Learner. Every data silo acts as an independent learning entity that receives the global model and trains it on its privately held local dataset through its Model Trainer component. A learner can also support a Model Evaluator component to evaluate incoming models on its local training, validation, and/or test datasets. Such evaluation can provide a score to weigh the models based on actual learning performance [241]. The Dataset Loader feeds data to the training and evaluation components in the appropriate format.
Driver. The Driver defines the high-level control flow of the federated application. Its main tasks are
to initialize the Federation Controller and Learner services and define and initialize the neural network
architecture (with a random or a pre-trained model). The driver also collects real-time metadata associated
with the federated training process and stores them inside the Catalog for further bookkeeping. In our
design, we consider the driver to be an independent trusted entity that can generate the key pairs of the
encryption scheme, cf. Section 5.1.
4.2.2 Federation Controller Optimizations
In a typical federated learning (synchronous or asynchronous) workflow, the federation controller needs to handle the majority of the distributed operations required to perform the model learning task. When we streamline the federated training workflow, as shown in Figure 4.12, we observe that the controller is responsible for sending (dispatching) the training task to the participating learners, saving (inserting) the received local model(s) in the model store (in-memory or other storage engines), selecting from the model store the required model(s) for aggregation, and finally sending (dispatching) the new global model to every learner for evaluation.

In an asynchronous federated setting, where update requests are submitted to the controller in order, the controller processes them in a FIFO manner (cf. Section 4.2.1). In this scenario, once the local model is received by the controller, the execution time of an update request is governed by the time it takes the controller to perform the selection, insertion, and aggregation operations. Comparatively, in a synchronous federated setting, the overall execution time of a training workflow may increase further due to the additional time required to dispatch the training and evaluation tasks across learners. The larger the pool of participating learners in a federation round, the higher the latency of the dispatch operations, as also shown in Figure 4.13. Moreover, in Figure 4.14 we show how the number and size of the models considered during global model computation affect the global model aggregation time.
As a result, the cumulative execution time of a federated training workflow relies heavily on the computation time required by the controller. To this end, we implemented the controller from the ground up in C++ to ensure more fine-grained control and allocation of the available system resources, including better thread management and more efficient memory utilization. Through this engineering effort, we were able to provide seamless integration with existing homomorphic encryption libraries (i.e., PALISADE [203, 250]; see also Section 5.1) and to handle hundreds to thousands of learners with millions of model parameters at minimal cost compared to other existing open-source frameworks.
[Figure 4.12 diagram: workflow operations across timestamps T1–T9: Dispatch Train Task, Train Global Model, Send Local Model, Insert Local Models, Select Local Models, Aggregate Local Models, Dispatch Eval Task, Eval Global Model.]
Figure 4.12: A typical Federated Learning workflow. Red represents operations executed by the controller. Green represents operations executed by the learners.
In Figure 4.13, we demonstrate the cumulative execution time (msec) required by the controller to dispatch the evaluation task across all participating learners (10, 25, 50, 100) for increasing model sizes, from 10,000 to 10,000,000 parameters. The evaluation task shipped by the controller to the learners contains the parameters of the global model and the associated evaluation metrics for the given learning task. As the number of learners increases, so does the execution time (latency) to dispatch the task across all learners; similar results hold when dispatching the training task. When the size of the model is large (10^7 parameters) and the number of learners is relatively large (≥ 50), the latency is substantially higher compared to smaller model sizes.
[Figure 4.13 plot: Cumulative Dispatch Time (msec) vs. Number of Learners (10, 25, 50, 100) for model sizes 10^5, 10^6, and 10^7 parameters.]
Figure 4.13: Evaluation Task Dispatch Time over Increasing Number of Learners (10 to 100) and Model Sizes (10^5 to 10^7 parameters).
In terms of aggregation, for any given model size, our federation controller can compute the global model in (almost) constant time, irrespective of the number of local models considered in the aggregation step; see Figure 4.14. To accomplish this, the controller leverages OpenMP [46] to perform the aggregation of each model variable in parallel (one thread per variable). This optimization leads to a 6-fold improvement compared to the aggregation step with no parallelism; compare 1400 msec vs. 220 msec for 10M parameters and 100 learners in Figure 4.14c.
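The idea can be illustrated in Python with one aggregation task per model variable (an analogy only; the actual controller performs this in C++ with OpenMP):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def aggregate_parallel(local_models, weights):
    # Weighted-average each model variable in a separate task, mirroring the
    # controller's one-thread-per-variable scheme. NumPy releases the GIL,
    # so the per-variable reductions can proceed in parallel.
    total = sum(weights)

    def aggregate_var(i):
        return sum(p * m[i] for p, m in zip(weights, local_models)) / total

    with ThreadPoolExecutor() as pool:
        return list(pool.map(aggregate_var, range(len(local_models[0]))))

# 10 learners, each holding 3 variables (e.g., kernel, bias, batch-norm gamma):
models = [[np.random.rand(256, 256), np.random.rand(256), np.random.rand(256)]
          for _ in range(10)]
global_model = aggregate_parallel(models, weights=[1.0] * 10)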
[Figure 4.14 plots: Aggregation Time (msec) vs. number of aggregating models (10, 25, 50, 100), with (OPT) and without (No-OPT) the parallel optimization, for (a) 100K, (b) 1M, and (c) 10M parameters.]
Figure 4.14: Model Aggregation Time with or without Optimization over Increasing Number of Learners.
4.2.3 FL Systems Comparison
Several Federated Learning architectures have recently become available, including the OpenFL framework [210], Nvidia FLARE [213], PriMIA [122], Federated AI Technology Enabler (FATE), Flower [26], FedML [97], IBM FL [169], FLUTE [57], LEAF [34], TensorFlow Federated (TFF), and COINSTAC [202]. The extensibility, security, and privacy protection of these systems vary. Our focus has been on providing an extensible, modular architecture with strong security guarantees via homomorphic encryption and enhanced resilience against data leakage (such as membership inference attacks; see Section 5.2).

Following the taxonomy in the works of [120, 149, 162], in Table 4.4 we provide a qualitative comparison of all the aforementioned federated learning systems. We do not report FLUTE, LEAF, and TFF, since these frameworks have been developed explicitly for testing and experimentation in simulated environments, not for deployment in production environments. Moreover, even though COINSTAC provides a powerful platform for decentralized neuroimaging analysis, we do not consider it in our evaluation since it has limited support for deep learning methods and is not tailored for federated learning settings. We compare all remaining systems with respect to their out-of-the-box support for deployment, machine learning environment development, federated data partitioning environments, private and secure training protocols, communication network, synchronization protocols, software development, and application domains.
Dimension OpenFL Nvidia FLARE PriMIA FATE Flower FedML IBM FL MetisFL*
Deployment
Standalone ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Distributed ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Cross-Silo ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Cross-Device × × × × ✓ ✓ × ✓
Containerized × ✓ × ✓ ✓ ✓ ✓ ✓
ML Environment
Model Types ML|DL ML|DL DL ML|DL ML|DL ML|DL ML|DL ML|DL
Backend Torch|TF Torch|TF|MONAI Torch Torch|TF Torch|TF|MX|JAX Torch|TF|MX|JAX Torch|TF TF
LocalOpt ✓ ✓ × ✓ ✓ ✓ ✓ ✓
GlobalOpt ✓ ✓ × ✓ ✓ ✓ ✓ ✓
Data Partitioning
Horizontal ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Vertical × × × ✓ × × × ×
Privacy & Security
Private Training ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Secure Aggregation TEE FHE SMPC SMPC|PHE Masking Masking FHE FHE
Cryptography Library Graphene TenSeal PySyft native native native IBM HElayers PALISADE
Communication
Centralized ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Decentralized × × × ✓ × ✓ × ×
Hierarchical × × × × × × × ×
TLS ✓ ✓ × ✓ ✓ ✓ ✓ ✓
Network gRPC gRPC gRPC gRPC gRPC MPI AMQP gRPC
Aggregation Protocol
Synchronous ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Asynchronous × × × × × × × ✓
Software
End-user Python Python Python Python Python Python Python Python
Learner Python Python Python Python Python Python Python Python
Aggregator Python Python Python Python Python Python Python C++
Table 4.4: Qualitative comparison of different FL Systems.
In the Deployment category, standalone refers to the case where a federation can be run in a simulated environment as parallel processes within a single server, and distributed refers to execution across multiple nodes (servers). All systems support standalone and distributed execution and deployment in cross-silo settings. However, OpenFL, Nvidia FLARE, PriMIA, FATE, and IBM FL do not support execution in cross-device settings, and OpenFL and PriMIA have no support for containerized execution.
In the ML Environment category, model types describe the machine learning models each system can support, backend the machine/deep learning engine used to perform model training, evaluation, and inference, and LocalOpt and GlobalOpt whether the system allows the development of customized local (learner) and global (controller) optimization algorithms. In this category, almost all frameworks, except for PriMIA, support the execution of various model architectures and support both the PyTorch and TensorFlow (TF) engines. Our system (MetisFL) currently supports only TensorFlow (PyTorch support is forthcoming shortly); however, it can be easily extended to support other training engines as well, e.g., MONAI, MXNet (MX), and JAX.
In the case of Data Partitioning, we evaluate whether each system can support learning over different partitioning schemes. All systems readily support horizontally partitioned learning environments, with only FATE providing support for the more challenging vertical learning environment [285]. In their documentation, FedML also claim that their API can be extended to support vertical learning scenarios; however, such support is not provided out of the box. Given that vertical federated learning environments may be observed frequently in federated healthcare settings, in our immediate future plans we aim to extend our framework to support these learning settings as well.
For the Privacy & Security category, we assess whether a system can support private learning (differential privacy), the cryptographic method used to perform secure aggregation, and which library is used for the cryptographic operations. All systems support differentially private learning. Nvidia FLARE, IBM FL, and our solution support fully homomorphic operations (FHE) through the CKKS [38] construction scheme, PriMIA supports secure multi-party computation (SMPC) through the SPDZ [49] scheme, and FATE supports SMPC and partial homomorphic encryption (PHE) through the Paillier (BatchCrypt [291]) construction scheme. OpenFL uses a hardware-integrated trusted execution environment (TEE [216]). Flower (Salvia [146]) and FedML (LightSecAgg [233]) both utilize a mask-based encryption approach. With respect to the cryptography library, most systems (including ours) depend on an external library, while FATE, Flower, and FedML provide native implementations of the required cryptographic operations.
In the Communication category, we compare the federated learning topologies under which each system can operate, whether communication across all participating parties is performed within an encrypted channel (TLS), and what network protocol is used to exchange messages. All systems can operate in a centralized federated learning environment (one aggregator, multiple clients). FATE and FedML provide support for decentralized (peer-to-peer) settings, and FedML claims to provide support for hierarchical federated settings [249]. In terms of TLS, only PriMIA does not provide support. Finally, every system uses gRPC to establish communication across all federation parties, except for FedML and IBM FL, which use the MPI and AMQP protocols, respectively.
Another category in which we observe limited implementation capabilities across existing systems is the Aggregation Protocol. Even though all systems provide support for synchronous communication and aggregation, they lack support for asynchronous protocols. Our system, on the contrary, provides both synchronous (including semi-synchronous) and asynchronous execution. Based on their documentation, Flower and FedML are the only other systems planning to provide support for asynchronous execution.
Finally, in addition to previously proposed metrics [120, 149, 162], we also consider the programming language used to develop each component of the federated learning system: aggregator, learner, and end-user API. All three components of all presented systems are developed in Python; in our framework, however, the aggregator is developed in C++. In our original implementation, the aggregator was also developed in Python, but this led to high latency when aggregating large models and/or aggregating models from a large pool of participating clients (e.g., > 200), due to Python's limited memory management capabilities. Moreover, in our original Python-based implementation, the delegation of training and evaluation tasks became extremely slow as more clients joined the federation. Since Python internally relies on the Global Interpreter Lock (GIL) for thread management, which hinders the concurrent execution of tasks, the federated execution workflow slows down dramatically. For these reasons, we re-designed and refactored our aggregator as a native C++ implementation.
4.3 Model Sparsification
Neural networks require significant memory and computation resources during training and inference.
To this end, many model pruning techniques (i.e., techniques that remove neural network parameters that are not useful) have been proposed to improve model generalization and resource utilization [103]. Model sparsification, a.k.a. model pruning (e.g., [78, 165]), seeks to produce small neural models with performance similar to the original fully-parameterized large models. Inspired by model pruning in centralized training [78, 165], we propose FedSparsify, an iterative federated pruning procedure that progressively sparsifies
model parameters based on weight magnitude at the end of each federation round. Our method simulta-
neously learns smaller neural networks for faster inference (and training) and reduces training communi-
cation costs by decreasing the total number of model parameters exchanged between the clients and the
server.
4.3.1 FedSparsify: Federated Purge-Merge-Tune
FedSparsify follows an iterative pruning schedule that performs model pruning based on weight magnitude at different learning tiers (clients or server). The entire procedure is summarized in Algorithm 3.
Weight Magnitude-based Pruning. Neural networks often have millions of parameters, but not all parameters influence the outcome/predictions equally. A simple and surprisingly effective proxy for identifying weights with a small effect on the final outcome is the weights' magnitude [78, 94]. Weights with magnitudes lower than some threshold can be removed or set to zero without penalizing performance. We choose this threshold based on the number of parameters to be pruned (the prune percentage, $s_t$). We prune the parameters whose weight magnitude is in the bottom $s_t$% in an unstructured way, i.e., considering the magnitude of each parameter separately. Our approach is modular, and other model pruning approaches
Algorithm 3 FedSparsify. Global model $w$ and global mask $m$ are computed from $N$ participating clients (indexed by $k$) at round $t$ out of $T$ rounds. $E$ is the number of local training epochs; $s_t$ is the sparsification percentage of model weights; purging_mask is the pruning operation returning the binary sparsification mask; $B$ is the total number of batches per epoch; $\eta$ is the learning rate; $g_k^{(i)}$ denotes the gradient of the $k$-th client's objective with parameters $w_k^{(i)}$. If no sparsification is used, FedSparsify-Global is equivalent to FedAvg.

Procedure Server($w^{(1)}$, $m^{(1)}$):
    for $t = 1$ to $T$ do
        if FedSparsify-Global then
            for $k = 1$ to $N$ do
                $w_k^{(t)}$ = Client($w^{(t)}$, $m^{(t)}$, $E$, null)
            $w^{(t+1)} = \sum_{k=1}^{N} \frac{|D_k|}{|D|}\, w_k^{(t)}$
            $m^{(t+1)}$ = purging_mask($w^{(t+1)}$, $s_t$)
            $w^{(t+1)} = w^{(t+1)} \odot m^{(t+1)}$
        if FedSparsify-Local then
            for $k = 1$ to $N$ do
                $w_k^{(t)}, m_k^{(t)}$ = Client($w^{(t)}$, $m^{(t)}$, $E$, $s_t$)
            $(w^{(t+1)}, m^{(t+1)})$ := Compute using Eq. 4.6
    return $w^{(t+1)}$

Procedure Client($w$, $m$, $E$, $s_t$):
    $w_k^{(0)} = w$;  $S = E \cdot B$
    for $i = 0$ to $S$ do
        $w_k^{(i+1)} = w_k^{(i)} - \eta\, g_k^{(i)} \odot m$
    if FedSparsify-Local then
        $m_k$ = purging_mask($w_k^{(S)}$, $s_t$)
        return $w_k^{(S)}, m_k$
    return $w_k^{(S)}$
that prune groups of parameters based on magnitude (i.e., structured pruning [165]) can also be readily used; a minimal sketch of the unstructured purging operation is given below.
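As a concrete illustration, the following minimal Python (NumPy) sketch implements the unstructured magnitude-based purging_mask operation described above; the exact implementation in our framework may differ.

    import numpy as np

    def purging_mask(weights: np.ndarray, prune_percent: float) -> np.ndarray:
        # Binary mask that zeroes out the bottom-s_t% of weights by absolute
        # magnitude (unstructured pruning: each parameter considered separately).
        num_prune = int(weights.size * prune_percent)
        mask = np.ones(weights.size, dtype=np.uint8)
        if num_prune > 0:
            # Indices of the num_prune smallest-magnitude parameters.
            prune_idx = np.argsort(np.abs(weights).ravel())[:num_prune]
            mask[prune_idx] = 0
        return mask.reshape(weights.shape)

    w = np.array([0.5, -0.01, 0.3, 0.002, -0.8])
    print(purging_mask(w, 0.4))  # -> [1 0 1 0 1]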
Pruning Schedule. A critical step in our approach is deciding how often and how many parameters to prune during federated training. Pruning too many parameters early in training can cause irrecoverable damage to performance [78], while pruning too late leads to increased communication costs. To balance this, we prune iteratively, gradually reducing the number of trainable parameters. Fine-tuning after pruning often improves performance and allows pruning more parameters while preserving performance [78, 95, 300]. Therefore, model pruning at the end of each federation round is a natural choice, since clients can fine-tune the aggregated pruned global model during the next federation round. Intuitively, two pruning strategies are possible: prune locally at the clients before aggregation (FedSparsify-Local), or globally at the server after aggregation (FedSparsify-Global). We explore both strategies. Once a parameter is pruned, it never rejoins training (i.e., no network/weight regrowth). Motivated by [300], we adapt the iterative exponential weight-pruning formula of a standalone model in a centralized setting to a federated environment, yielding the following federated pruning schedule:
$$s_t = S_T + (S_0 - S_T)\left(1 - \frac{F\lfloor t/F\rfloor - t_0}{T - t_0}\right)^{n} \qquad (4.5)$$
where $t$ is the federation round, $s_t$ is the sparsification percentage at round $t$, $S_T$ is the final desired sparsification, $S_0$ is the initial sparsification percentage, $t_0$ is the round at which pruning starts, $T$ is the total number of rounds, and $F$ is the pruning frequency (e.g., $F=1$ prunes at every round, while $F=5$ prunes every 5 rounds). The exponent $n$ controls the rate of sparsification: a higher $n$ leads to aggressive sparsification at the start of training, and a smaller $n$ to more sparsification towards the end of training; we use $n=3$ in our experiments. Overall, this formula provides an interplay between communication cost, model sparsity, and learning performance (see also Section 4.3.3).
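For concreteness, the following Python sketch is a direct transcription of Equation 4.5; the function name and defaults are ours, with the defaults mirroring the hyperparameters used in Section 4.3.3.

    import math

    def sparsity_at_round(t, T, t0=1, S0=0.0, ST=0.95, F=1, n=3):
        # Sparsification percentage s_t at federation round t, per Eq. 4.5.
        if t < t0:
            return S0
        frac = 1.0 - (F * math.floor(t / F) - t0) / (T - t0)
        return ST + (S0 - ST) * frac ** n

    # With T=200 rounds and a 0.95 target, sparsity ramps up quickly early on
    # and approaches the target towards the end of training:
    for t in [1, 50, 100, 200]:
        print(t, round(sparsity_at_round(t, T=200), 3))  # 0.0, ~0.543, ~0.829, 0.95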
FedSparsify-Local. Model pruning takes place at each client after local training is complete. Each client sends its model, $w_k$, to the server along with the associated binary sparsification mask, $m_k$. The server may aggregate the local models using FedAvg. However, as the number of clients increases, it becomes increasingly unlikely that a particular weight will be zero for all clients, which results in slow sparsification rates. To address this, we aggregate local models using our proposed Majority Voting scheme, where a global model parameter is zeroed out only if less than half of the local models' masks preserve it. Otherwise, the standard weighted average aggregation rule applies. Formally:
$$[m]_i = \begin{cases} 1 & \text{if } \sum_{k=1}^{N} [m_k]_i \ge \frac{N}{2} \\ 0 & \text{otherwise} \end{cases}; \qquad w = m \odot \left( \sum_{k=1}^{N} \frac{|D_k|}{|D|}\, w_k \right) \qquad (4.6)$$
where $[\cdot]_i$ is the parameter value at the $i$-th position, $w$ is the global model, $N$ is the number of clients participating in the current round, and $m_k$ is the local binary mask of client $k$. A minimal sketch of this aggregation rule is given below.
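The following NumPy sketch is an illustrative transcription of the Majority Voting rule of Equation 4.6, not our framework's implementation:

    import numpy as np

    def majority_vote_aggregate(local_weights, local_masks, dataset_sizes):
        # Keep position i only if at least half of the local masks preserve it;
        # elsewhere apply the standard weighted average and zero out pruned positions.
        N = len(local_weights)
        total = float(sum(dataset_sizes))
        global_mask = (np.sum(local_masks, axis=0) >= N / 2).astype(np.float64)
        weighted_avg = sum((d / total) * w
                           for w, d in zip(local_weights, dataset_sizes))
        return global_mask * weighted_avg, global_mask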
FedSparsify-Global. Model pruning occurs at the server right after the participating clients' models are aggregated, and the sparse structure is maintained throughout local training.
FedSparsify-Global and FedSparsify-Local differ primarily in the pruning tier. In FedSparsify-Global, the server prunes the global model after aggregation. This is in contrast with FedSparsify-Local, where the clients prune their local models after local training is complete and share their local binary masks with the server, which aggregates the models using the Majority Voting scheme. In both schemes, there is no mask disagreement during local training: all clients update the same set of model parameters due to the shared global mask. Intuitively, FedSparsify-Local prunes using only local client information, whereas FedSparsify-Global prunes after aggregation and hence uses information from all clients. For this reason, we expect FedSparsify-Global to perform slightly better than FedSparsify-Local. This is validated in our empirical evaluation (Section 4.3.3): FedSparsify-Global outperforms FedSparsify-Local across almost all federated environments, especially in the more challenging Non-IID environments. FedSparsify-Local may be useful in asynchronous settings, but our experiments concern only synchronous settings. A sketch of one FedSparsify-Global server round is given below.
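For completeness, the following Python sketch shows one FedSparsify-Global server round; it reuses the purging_mask sketch from above, assumes flattened weight arrays, and is illustrative rather than the framework's implementation.

    import numpy as np

    def fedsparsify_global_round(client_weights, dataset_sizes, s_t):
        # (1) FedAvg the local models, (2) prune the bottom-s_t% of global
        # weights by magnitude, (3) return the pruned model and its mask,
        # which are broadcast to the clients for the next round.
        total = float(sum(dataset_sizes))
        global_w = sum((d / total) * w
                       for w, d in zip(client_weights, dataset_sizes))
        mask = purging_mask(global_w, s_t)  # magnitude mask (see earlier sketch)
        return global_w * mask, mask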
4.3.2 FedSparsify Convergence Analysis
We analyze the convergence rate for FedSparsify when $|D_k| = |D|/N,\ \forall k$, i.e., equal weights for each client and a participation ratio of 1; $|D_k|$ represents client $k$'s local training dataset size and $D = \bigsqcup_{k=1}^{N} D_k$. These relaxations are made to simplify the analysis but are not critical to the proof. See [106, 154] regarding the treatment of partial participation at each round, and [114, 154] for the analysis with weighted averaging. We make the same assumptions as [106, 114], which are stated in Appendix A.2. Using the same notation as Section 4.3.1, we state the theorem below.
Theorem 1. If Assumptions 1-7 hold and the learning rate satisfies $\eta < \left(4\sqrt{2}\, L S^{3/2}\right)^{-1}$, then the parameters obtained at the end of each federation round of the FedSparsify algorithm satisfy

$$\frac{1}{T}\sum_{t=1}^{T}\left\lVert m^{(t)} \odot \nabla f\!\left(w^{(t)}\right)\right\rVert^2 \le 2\eta L\left(1+4L\eta S^2\right)\sigma^2 + 16L\eta S^3\epsilon^2 + \frac{4}{T\eta S}\,\mathbb{E}\!\left[f\!\left(w^{(1)}\right)-f\!\left(w^{(*)}\right)\right] + \frac{4}{T\eta S}\sum_{t=1}^{T} L\left\lVert w^{(t+1)} - w^{(t+1)}\odot m^{(t)}\right\rVert$$

where $w^{(t+1)}\odot m^{(t)} := \frac{1}{N}\sum_{k=1}^{N} w_k^{(t,S)}$, i.e., the parameters right before sparsification is performed, and $w^{(*)}$ is the optimal parameter of sparsity $S_T$.
The proof is provided in Appendix A.2. Theorem 1 demonstrates that the convergence rate for FedSparsify is $O(\frac{1}{T})$, which is the same as that of FedAvg [154]. However, compared to the usual federated training with FedAvg, the bound for FedSparsify has an additional term: the magnitude of the difference of the weights before and after pruning. Noting that $m^{(t)}$ describes the non-zero parameters in the $t$-th iteration and that $w^{(t+1)} \odot m^{(t)} := \frac{1}{N}\sum_{k=1}^{N} w_k^{(t,S)}$, we can further upper bound this difference by observing that

$$\left\lVert w^{(t+1)} - w^{(t+1)}\odot m^{(t)} \right\rVert \;\le\; \left\lVert w^{(t+1)}\odot m^{(t)} \right\rVert$$

and assuming that the magnitude of the neural network parameters is upper bounded by $B$ (as assumed in [114]).
However, this naive upper bound ignores that we purge the parameters with the lowest magnitude in FedSparsify-Global. This fact yields a tighter bound for FedSparsify-Global. Note that $w^{(t+1)} - w^{(t+1)}\odot m^{(t)}$ is 0 everywhere except at the indexes marked for pruning, i.e., the smallest entries before the $(t+1)$-th round, and that exactly $\lfloor |w| \times s_{t+1} \rfloor - \lfloor |w| \times s_t \rfloor$ entries will be non-zero, giving the tighter bound

$$\left\lVert w^{(t+1)} - w^{(t+1)}\odot m^{(t)} \right\rVert \;\lesssim\; \left\lVert w^{(t+1)}\odot m^{(t)} \right\rVert (s_{t+1} - s_t) \;\lesssim\; \left\lVert w^{(t+1)} \right\rVert \frac{s_{t+1} - s_t}{1 - (s_{t+1} - s_t)}$$

We use $\lesssim$ instead of $\le$ to make explicit that integer effects are ignored. In the case of FedSparsify-Local and majority voting, we remove parameters based on whether most of the clients agree; thus, the pruned parameter values are not necessarily the smallest, and the above bound may not hold. In this work, we focus on removing a pre-defined percentage of the parameters with the smallest magnitude.
4.3.3 FedSparsify Evaluation
We compare FedSparsify against a suite of pruning algorithms that perform model sparsification at different stages of federated training, as well as against non-pruning methods. The code to reproduce the experiments is publicly available at https://github.com/raschild/FedSparsify.
Baselines. We compare our FedSparsify-Global and FedSparsify-Local approaches against pruning-at-initialization schemes that sparsify the global model prior to federated training, such as SNIP [143] and GraSP [263], and against a dynamic pruning scheme that prunes and resurrects model parameters during training, PruneFL [114]. We also conceptualize a OneShot pruning-with-fine-tuning baseline for federated training based on [95]. Lastly, we validate the benefits of magnitude-based pruning by substituting it with random pruning in FedSparsify-Global; we refer to this scheme as Random.
SNIP [143] and GraSP [263] construct a fixed sparse model prior to the beginning of federated training. Following previous works [29, 114], we apply these schemes in a federated setting by randomly picking a client at the beginning of training to create the initial sparsification mask, and we enforce that mask globally throughout training.
PruneFL [114] aims to maximize the reduction of empirical risk per unit of training time. It prunes before training as well as during training. During training, PruneFL identifies parameters to prune based on the ratio of gradient magnitude to execution time. We follow the training and pruning configurations suggested in the original work. At the start of training, PruneFL picks a client at random from the federation to learn the initial pruning mask after 5 reconfigurations. We perform global mask readjustment every 50 rounds, and the sparsification ratio at round $t$ is set to $s \times 0.5^{t/1000}$ with $s \in \{0.3, 0.8\}$; 0.3 is the recommended value.
We adapt the one-shot pruning approach [95], originally proposed for centralized settings, to federated settings by training the original dense model for a specific number of rounds and pruning the global model only once (at the server) to the desired sparsity. As with other pruning approaches, to restrict training to the non-pruned weights across all clients, the server shares the global model's sparsification mask with every client to mask their local updates. For our evaluation, we consider two variations of this approach: one where the global model is not fine-tuned after pruning (OneShot), and one where it is fine-tuned after pruning (OneShot w/ FineTuning) for the last 10 rounds. Figure 4.15 presents the federated execution flow of our FedSparsify pruning schemes, pruning at initialization, and one-shot pruning.
Finally, we consider FedAvg with Vanilla SGD [180], FedAvg with Momentum SGD [161], referred to as FedAvg (MFL), and FedProx [152] as the non-pruning baselines. For FashionMNIST we use FedAvg with SGD and FedProx; for CIFAR-10 and CIFAR-100 we use FedAvg (MFL) and FedProx.
(a) FedSparsify-Local: Learners prune their local model right after local training is complete. Thereafter, they share their pruned model along with the binary sparsification mask with the controller (server), and the controller aggregates the pruned models using Majority Voting.
(b) FedSparsify-Global: The model is pruned only by the controller (server). The learners receive the pruned global model and the associated binary sparsification mask. During local training, only the non-pruned parameters of the global model are updated.
(c) Pruning at initialization: Before federated training begins, a participating learner is randomly selected to prune the original model. Once the model is pruned, federated training starts with all learners updating the set of non-pruned parameters enforced by the sparsification mask.
(d) OneShot: The global model is pruned right after a specific number of federation rounds is reached. If fine-tuning is enabled (OneShot w/ FineTuning), the federated model is then fine-tuned (trained) for a couple of rounds.
Figure 4.15: Execution flow diagrams for different federated sparsification methods.
Federated Models & Environments. The random seed for all experiments was set to 1990. All experiments were run on a dedicated GPU server equipped with 4 Quadro RTX 6000/8000 graphics cards with 50 GB RAM each, 31 Intel(R) Xeon(R) Gold 5217 CPUs @ 3.00 GHz, and 251 GB DDR4 RAM. We use FashionMNIST, CIFAR-10, and CIFAR-100 as the benchmark datasets. We train a 2-layer fully connected (FC) network for FashionMNIST, a 6-layer convolutional network (CNN) for CIFAR-10, and a VGG-16 network for CIFAR-100, with 118,282, 1,609,930, and 14,782,884 trainable parameters, respectively. We create four federated environments for each dataset by varying the number of clients (10 and 100) and the data distribution (IID and Non-IID) at the clients.
We generate the IID data distributions by randomly partitioning the data into 10 and 100 chunks [180]. We create the Non-IID data distributions by skewing the label distribution [120]: a subset of classes is assigned to each client, similar to [104, 297], namely 2 classes (out of 10) for FashionMNIST, 5 classes (out of 10) for CIFAR-10, and 50 classes (out of 100) for CIFAR-100. All clients participate in every round in the 10-client environments. In the 100-client environments, 10 clients are randomly selected at each round (i.e., a 0.1 participation rate). A minimal sketch of such a label-skew partitioner is shown below.
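The following Python sketch conveys the idea of the label-skew partitioning; the exact splitting logic used in our experiments may differ.

    import numpy as np

    def label_skew_partition(labels, num_clients, classes_per_client, seed=1990):
        # Assign each client a random subset of classes, then split each class's
        # examples evenly among the clients that hold that class.
        rng = np.random.default_rng(seed)
        num_classes = int(labels.max()) + 1
        client_classes = [rng.choice(num_classes, classes_per_client, replace=False)
                          for _ in range(num_clients)]
        client_indices = [[] for _ in range(num_clients)]
        for c in range(num_classes):
            holders = [k for k in range(num_clients) if c in client_classes[k]]
            if not holders:
                continue
            idx = rng.permutation(np.where(labels == c)[0])
            for part, k in zip(np.array_split(idx, len(holders)), holders):
                client_indices[k].extend(part.tolist())
        return client_indices

    # e.g., FashionMNIST Non-IID(2): label_skew_partition(y_train, 10, 2)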
Figures 4.16, 4.17, and 4.18 present the federated data distributions used in this work for FashionMNIST,
CIFAR-10, and CIFAR 100 datasets, respectively. In all the figures, the x-axis refers to the clients, with one
bar plot per client, and the y-axis shows the number of samples per client and per class. To represent the
class distribution inside each client, we use sequential coloring for the bar plot with each increasing color
representing a different class.
Figure 4.16: FashionMNIST Federated Data Distributions. Panels: (a) 10 Clients IID, (b) 100 Clients IID, (c) 10 Clients Non-IID(2), (d) 100 Clients Non-IID(2); x-axis: clients (one bar per client), y-axis: number of examples.
Federated Training Hyperparameters. Every federated model for FashionMNIST and CIFAR-10 was trained for 200 rounds, and every CIFAR-100 model for 100 rounds. Each client trains for 4 local epochs per round.
Figure 4.17: CIFAR-10 Federated Data Distributions. Panels: (a) 10 Clients IID, (b) 100 Clients IID, (c) 10 Clients Non-IID(5), (d) 100 Clients Non-IID(5); x-axis: clients, y-axis: number of examples.
Figure 4.18: CIFAR-100 Federated Data Distributions. Panels: (a) 10 Clients IID, (b) 100 Clients IID, (c) 10 Clients Non-IID(50), (d) 100 Clients Non-IID(50); x-axis: clients, y-axis: number of examples.
We used a batch size of 32 for FashionMNIST and CIFAR-10 and a batch size of 128 for CIFAR-100. For CIFAR-100, with 5000 and 500 examples per client in the environments of 10 and 100 clients, respectively, this translates to $4 \times 5000/128 = 156$ and $4 \times 500/128 = 15$ local steps per client per round. We used SGD with a learning rate of 0.02 for FashionMNIST, and SGD with momentum 0.75 and a learning rate of 0.005 for CIFAR-10. For FedProx, the proximal term $\mu$ is kept constant at 0.001. For FedProx on CIFAR-100, we searched for a proximal term among 0.01, 0.001, and 0.0001, with 0.001 performing best. For all other CIFAR-100 experiments, we used SGD with momentum 0.9 and a learning rate of 0.01. A sketch of the FedProx local objective is given below.
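As a reminder of what the proximal term does, the following TensorFlow sketch adds the FedProx penalty $(\mu/2)\lVert w - w_{global}\rVert^2$ from [152] to a client's task loss; this is an illustrative transcription, not our framework's API.

    import tensorflow as tf

    def fedprox_local_loss(task_loss, local_vars, global_vars, mu=0.001):
        # FedProx local objective: task loss plus a proximal term that keeps
        # local weights close to the global model received at round start.
        prox = tf.add_n([tf.reduce_sum(tf.square(w - wg))
                         for w, wg in zip(local_vars, global_vars)])
        return task_loss + (mu / 2.0) * prox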
FedSparsify Tuning. For all FedSparsify-Local and FedSparsify-Global experiments, sparsification starts at round 1 ($t_0 = 1$), the initial degree of sparsification is 0 ($S_0 = 0$), the sparsification frequency is 1 ($F = 1$, i.e., 1 round of tuning), and the exponent value is 3 ($n = 3$).
During frequency value exploration, we observed that frequency values of $F = 1$ and $F = 2$ behave similarly. However, for higher frequency values (e.g., $F \in \{5, 10, 15, 20\}$), i.e., more rounds of fine-tuning between pruning steps, there is a big drop in model performance when pruning takes place, since a larger ratio of weights is pruned in a single pruning step. We show the effects of training with different pruning frequencies in terms of Federation Rounds in Figure 4.19a and in terms of Transmission Cost in Figure 4.19b. In Figures 4.19c and 4.19d we show the effects of different exponent values on the final performance. As expected, for high exponent values the pruning effect is more pronounced and leads to pruned models with reduced final performance. This is in contrast to a very small exponent value ($n = 1$), which can learn better models mid-training due to the larger number of available non-pruned weights, but fails to retain that performance in the final rounds, where pruning is more aggressive. An exponent value of 3 provides a good trade-off between sparsification and model performance for our scheme. Similar effects hold for transmission cost: high exponent values lead to reduced transmission costs but degraded performance, and small exponent values to higher transmission costs and better performance. Again, an exponent value of 3 provides a good balance between transmission cost and learning performance. Finally, for FedSparsify-Local we use Majority Voting as the aggregation rule for the local models, while for the Random pruning baseline and FedSparsify-Global we use FedAvg's aggregation rule.
In Figure 4.20, we show the learning performance (left y-axis) and the decrease in global model parameters (right y-axis) for the federated FashionMNIST model in a federated environment of 10 clients trained using the FedSparsify-Local sparsification schedule, when Majority Voting and FedAvg are used as the aggregation rule for the learners' local models. As shown in the inset of the figure, at the beginning of training Majority Voting preserves the sparsity of the local models enforced by the clients' local masks, while FedAvg prunes fewer parameters.
Evaluation Criteria. Our goal is to develop federated training strategies that can learn a global model with the highest achievable accuracy at high sparsification rates. To this end, we evaluate the trade-off between model sparsity and learning performance (i.e., accuracy). Figures 4.21 and 4.22 show the accuracy on a held-out test set at different sparsities (0.8, 0.85, 0.9, 0.95, 0.99) for FashionMNIST and CIFAR-10, and (0.9, 0.95, 0.99) for CIFAR-100. We also evaluate model operations and model throughput in Table 4.5. We
(a) Sparsification frequency exploration with respect to federation rounds (x-axis). Left y-axis and solid lines show accuracy; right y-axis shows the progression of global model parameters. The higher the sparsification frequency F, the bigger the drop in model performance.
(b) Sparsification frequency exploration with respect to transmission cost (x-axis). Right y-axis shows the progression of global model parameters. The higher the sparsification frequency, the higher the communication cost incurred during training.
(c) Exponent exploration with respect to federation rounds (x-axis). Left y-axis and solid lines show accuracy; right y-axis shows the progression of global model parameters. The higher the exponent value (e.g., n = 6, 12), the more weights are pruned during the early training stages.
(d) Exponent exploration with respect to transmission cost (x-axis). Left y-axis and solid lines show accuracy; right y-axis shows the progression of global model parameters. The higher the exponent value (e.g., n = 6, 12), the smaller the transmission cost, but the worse the final performance.
Figure 4.19: FedSparsify Tuning. The top row (Figures 4.19a and 4.19b) shows convergence when exploring different sparsification frequency values with FedSparsify-Global at 0.95 sparsity on FashionMNIST with 10 clients over a Non-IID data distribution, with respect to Federation Rounds and Transmission Cost; the exponent value for these experiments is set to 3. Figures 4.19c and 4.19d show the exponent hyperparameter exploration in terms of Federation Rounds and Transmission Cost, respectively. An exponent of n = 3 provides a good trade-off between sparsification and model performance.
Figure 4.20: Convergence of FedSparsify-Local with Majority Voting (MV) as the aggregation rule versus FedSparsify-Local with Weighted Average (FedAvg/Avg) as the aggregation rule, on FashionMNIST with 10 clients over IID and Non-IID data distributions at 0.9 sparsity. Left y-axis: test top-1 accuracy; right y-axis: global model parameters.
report model convergence with respect to (w.r.t.) federation rounds and global model size reduction for
FashionMNIST in Figure 4.23. Convergence plots for all other environments and communication cost of
federated training are presented in Figure 4.24. We do not measure convergence w.r.t. computation or
wall-clock time speed-up, since we do not employ any dedicated hardware accelerators to perform sparse
operations.
With regard to transmission cost, we measure it in terms of Megabits (Mbit) exchanged over all federated training rounds. We train models for a fixed number of rounds; therefore, models do not all have the same transmission cost at the end of training. We compute the cost of transmitting parameters without any compression: the transmission cost at each round is the total number of clients participating in the round multiplied by the total number of non-zero parameters sent by the server at the beginning of the round (i.e., the global model size) to all participating clients, plus the total number of non-zero parameters uploaded to the server by all participating clients at the end of the round. We multiply this aggregate quantity by 32, assuming full-precision training. If the sparsification scheme exchanges binary masks with the server during federated training (e.g., FedSparsify-Local), we also add to this quantity the total number of parameters of the original model, since the size of the binary mask (1 bit per parameter) is equal to the original model size without any sparsification. A sketch of this accounting is given below.
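The following Python sketch captures one reasonable reading of this per-round accounting; the helper name and argument layout are ours, and the mask-cost term is an assumption about how masks are charged per exchange.

    def round_transmission_cost_mbit(num_clients, global_nonzero, client_nonzeros,
                                     mask_params=0, bits=32):
        # global_nonzero: non-zero params broadcast by the server to each client;
        # client_nonzeros: non-zero params uploaded by each participating client;
        # mask_params: original model size, charged at 1 bit per parameter for
        # every binary mask exchanged (0 when no masks are exchanged).
        download_params = num_clients * global_nonzero
        upload_params = sum(client_nonzeros)
        mask_bits = (num_clients + len(client_nonzeros)) * mask_params
        return ((download_params + upload_params) * bits + mask_bits) / 1e6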
Figure 4.21: Sparsity vs. Test Accuracy for 10 clients, over IID and Non-IID distributions. Panels: (a) CIFAR-100 (VGG), (b) CIFAR-10 (CNN), (c) FashionMNIST (FC). Methods compared: FedAvg/FedAvg (MFL), FedProx, PruneFL, SNIP, GraSP, OneShot, OneShot w/ FineTuning, Random, FedSparsify-Local (ours), FedSparsify-Global (ours). FedSparsify outperforms pruning alternatives, and is comparable to no-pruning.
Figure 4.22: Sparsity vs. Test Accuracy for 100 clients (0.1 participation rate), over IID and Non-IID distributions. Panels: (a) CIFAR-100 (VGG), (b) CIFAR-10 (CNN), (c) FashionMNIST (FC). FedSparsify outperforms pruning alternatives and is comparable to or better than no-pruning (particularly in Non-IID domains).
FashionMNIST Results. Figures 4.21c and 4.22c show the performance of different methods at different
sparsity for the FashionMNIST environments with 10 and 100 clients, respectively, on IID and Non-IID
data distributions. The more complex the learning environment is (cf. IID vs Non-IID), the lower the final
accuracy of the global model is for both pruning and non-pruning schemes. All sparsification methods
have similar performance at moderate sparsification (i.e., 0.8, 0.85, 0.9) and IID distribution. However, for
extreme sparsities (i.e., 0.95, 0.99) and more challenging data distributions (Non-IID), other sparsification
methods underperform and, in some cases, cannot even learn a model of reasonable performance (e.g.,
SNIP, GraSP, and One-Shot).
Even though SNIP and GraSP are capable of training with reduced communication costs compared to other approaches (cf. Figure 4.24), restricting model training to a predetermined sparsified network from the very beginning incurs a substantial performance drop compared to our progressive sparsification schemes. We attribute this performance degradation to the binary mask being learned over the local dataset
of a randomly selected client, which may not follow the global data distribution, especially in the case of
Non-IID environments; similar observations were also reported in [29]. Similarly, the degraded learning
performance of PruneFL in the Non-IID settings is due to the random client selection at the start of feder-
ated training for constructing the initial sparsification mask. A noteworthy outcome of our experiments
is the effect of fine-tuning in the case of one-shot pruning. Fine-tuning improves the final performance by
a large margin in IID (0.2 vs 0.7 at 0.99 sparsity) and Non-IID settings (0.25 vs 0.67 at 0.95 sparsity).
Figures 4.23a and 4.23b show global model convergence and its total number of parameters during
training. FedAvg, FedProx, OneShot and OneShot w/ FineTuning have a constant model size (overlapping
top dashed lines), except for the latter two approaches for which pruning occurs at round 190 and 200, re-
spectively, and hence the sudden size drop. Similarly, SNIP and GraSP train on a pruned model of constant
size (overlapping bottom dashed lines), since the initial training model is already sparsified. All progressive
sparsification schemes (FedSparsify, Random) have a logarithmically decreasing model size (mid-low over-
lapping decreasing dashed lines) while the dynamic pruning scheme (PruneFL) has a step-like increasing
model size.
PruneFL’s performance drops every 50 federation rounds when the model expands parameters. The
effect is stronger in the Non-IID environment. As expected, OneShot pruning behaves similarly to the
non-pruning baselines until pruning. The accuracy drops substantially when the model is pruned at round
200 for OneShot and 190 for OneShot w/ FineTuning due to the removal of a large number of parameters
that affects model outcomes. Interestingly, OneShot w/ FineTuning recovers some of the lost performance
during federated fine-tuning in the last 10 rounds. Finally, Random pruning does not suffer during early
federated training or at lower sparsities but fails to perform towards the end of the training, indicating the
usefulness of magnitude-based pruning.
CIFAR-10 Results. As shown in Figures 4.21b and 4.22b, FedSparsify outperforms existing federated pruning approaches while learning sparse models at extreme sparsification rates (e.g., 0.9-0.99), and its performance is often similar to or better than the non-pruning FedAvg (MFL) baselines (e.g., the Non-IID environment in Figure 4.21b). Similar to the FashionMNIST results, we attribute the performance drop
of pruning at initialization schemes to their need to remove a large proportion of the network’s trainable
weights at the beginning of training based on the local dataset of a randomly chosen client that may
not be representative of the global dataset. Similarly, PruneFL also relies on an initial randomly selected
sparsification mask, and cannot recover performance even after model regrowth during federated training.
Interestingly, the Random pruning scheme is a strong baseline, with performance comparable to and often better than existing pruning methods, though worse than the one-shot pruning approaches. However, at extreme levels of sparsity, random pruning does not learn, as the remaining model weights are crucial and pruning randomly may have an irreversible, negative effect on the final model performance. OneShot pruning with fine-tuning generally has performance comparable to our FedSparsify schemes but fails to
Figure 4.23: Federation Round vs. Accuracy for FashionMNIST (top row) and CIFAR-10 (middle row) over the course of 200 federation rounds, and for CIFAR-100 (bottom row) over the course of 100 federation rounds. Panels: (a) FashionMNIST (FC) - 10 Clients, (b) FashionMNIST (FC) - 100 Clients, (c) CIFAR-10 (CNN) - 10 Clients, (d) CIFAR-10 (CNN) - 100 Clients, (e) CIFAR-100 (VGG) - 10 Clients, (f) CIFAR-100 (VGG) - 100 Clients; each panel shows IID and Non-IID distributions, with the right y-axis tracking global model parameters. Across all environments, SNIP, GraSP, Random, FedSparsify-Local, and FedSparsify-Global convergence is shown at 0.9 model sparsity and PruneFL at 0.3.
Figure 4.24: Transmission Cost vs. Accuracy for FashionMNIST (top row) and CIFAR-10 (middle row) over the course of 200 federation rounds, and for CIFAR-100 (bottom row) over the course of 100 federation rounds. Panels: (a) FashionMNIST (FC) - 10 Clients, (b) FashionMNIST (FC) - 100 Clients, (c) CIFAR-10 (CNN) - 10 Clients, (d) CIFAR-10 (CNN) - 100 Clients, (e) CIFAR-100 (VGG) - 10 Clients, (f) CIFAR-100 (VGG) - 100 Clients. Across all environments, SNIP, GraSP, Random, FedSparsify-Local, and FedSparsify-Global convergence is shown at 0.9 model sparsity and PruneFL at 0.3.
Sparsity Accuracy Params Model Size (MBs) C.C. (MM) Inf.Latency Inf.Iterations Inf.Throughput
0.0 0.7489 118,282 0.434 473 0.607 755,817 403,096
0.8 0.74 23,657 0.109 (x3.97) 190 (x2.48) 0.601 (x1.01) 763,298 (x1.01) 407,085 (x1.01)
0.85 0.735 17,743 0.082 (x5.24) 173 (x2.73) 0.594 (x1.02) 772,976 (x1.02) 412,251 (x1.02)
0.90 0.749 11,829 0.056 (x7.75) 155 (x3.04) 0.588 (x1.03) 781,005 (x1.03) 416,532 (x1.03)
0.95 0.735 5,915 0.029 (x14.68) 137 (x3.43) 0.587 (x1.03) 783,000 (x1.03) 417,596 (x1.03)
0.99 0.687 1,183 0.008 (x53.95) 123 (x3.82) 0.58 (x1.04) 792,332 (x1.04) 422,569 (x1.04)
(a) FashionMNIST
Sparsity Accuracy Params Model Size (MBs) C.C. (MM) Inf.Latency Inf.Iterations Inf.Throughput
0.0 0.75 1,609,930 5.903 6,441 115 4,145 8,831
0.8 0.752 322,370 1.54 (x3.83) 2,596 (x2.48) 61 (x1.88) 7,812 (x1.88) 16,651 (x1.88)
0.85 0.755 241,874 1.178 (x5.01) 2,356 (x2.73) 51 (x2.25) 9,222 (x2.22) 19,660 (x2.22)
0.90 0.749 161,377 0.802 (x7.36) 2,116 (x2.97) 43 (x2.67) 10,975 (x2.64) 23,399 (x2.64)
0.95 0.751 80,881 0.415 (x14.22) 1,875 (x3.43) 32 (x3.59) 14,682 (x3.54) 31,306 (x3.54)
0.99 0.7 16,484 0.104 (x56.75) 1,683 (x3.82) 27 (x4.25) 17,707 (x4.27) 37,763 (x4.27)
(b) CIFAR-10
Sparsity Accuracy Params Model Size (MBs) C.C. (MM) Inf.Latency Inf.Iterations Inf.Throughput
0.0 0.6091 14,782,884 54.581 946,104 524 919 1945
0.90 0.5886 1,485,892 7.556 (x7.22) 314,253 (x3.01) 171 (x3.06) 2801 (x3.04) 5962 (x3.06)
0.95 0.5817 747,170 3.978 (x13.72) 279,150 (x3.38) 134 (x3.91) 3575 (x3.89) 7613 (x3.91)
0.99 0.551 156,193 0.881 (x61.95) 251,067 (x3.76) 65 (x8.06) 7294 (x7.93) 15549 (x7.99)
(c) CIFAR-100
Table 4.5: Comparison of sparse (FedSparsify-Global), and non-sparse (FedAvg) federated models in the
FashionMNIST, CIFAR-10 and CIFAR-100 Non-IID environments with 10 clients. Inference evaluations are
done on models obtained at the end of training. Sparsity 0.0 represents FedAvg. C.C. is the communication
cost in millions (MM) of parameters exchanged. Inference efficiency is measured by the mean processing
time per batch (Inf.Latency - ms/batch), the number of iterations (Inf.Iterations), and processed examples
per second (Inf.Throughput - examples/sec). Values in parenthesis show the reduction factor (model size,
communication cost, and inference latency) and increase/speedup factor (inference iterations and through-
put) compared to non-pruning.
learn a model of reasonable performance at very extreme levels of sparsity, i.e., 99%. Even though fine-tuning can help recover the lost performance at almost all sparsification degrees, it fails to do so at such extreme levels. Iterative pruning, in contrast, removes parameters gradually and hence can maintain performance even at extreme sparsity levels.
The results with 100 clients are similar: FedSparsify outperforms alternative pruning methods, with performance comparable to or better than non-pruning. In the more challenging Non-IID environment, FedSparsify-Global performs slightly better than FedSparsify-Local (see Figure 4.22b).
CIFAR-100 Results. Figures 4.21a and 4.22a show that both FedSparsify schemes perform comparably to or better than the dense-model baselines. SNIP, GraSP, and PruneFL are not able to learn a model of acceptable performance (better than random) within the allocated training budget of 100 federation rounds (we only plot SNIP, since all three methods behave similarly). These methods may require a greater number of training iterations or more parameters. To validate this, we ran SNIP and GraSP in a centralized setting and observed that as the sparsification degree increases, so does the number of training iterations needed to learn a good model. PruneFL required many more rounds and more frequent communication (e.g., every 4 local update steps), as reported in the original work, whereas in our setting we synchronize models every 4 epochs.
We observe a substantial drop in accuracy from 10 clients to 100 clients (∼60% in Figure 4.21a vs. ∼40% in Figure 4.22a). This degradation could be due to the reduced number of training examples the global model is trained on at each round: 50,000 samples for 10 clients, but only 5,000 samples for 100 clients (since only 10 of the 100 clients participate). This data scarcity per training round also affects the OneShot w/ FineTuning method. Even though fine-tuning helps recover some of the lost model performance in the case of 10 clients, its effect is minor with 100 clients. However, FedSparsify outperforms even FedAvg and FedProx in the 100-client environments. We attribute this behavior to the regularization effect of gradual pruning during training [103]. Finally, at high sparsity degrees, FedSparsify-Global performs better than FedSparsify-Local. We posit that FedSparsify-Global is better suited to efficiently learning extremely sparse models from large networks.
Sparsification Efficiency. The main goal of our sparsification scheme is to improve federated models' inference efficiency while remaining as performant as non-pruning methods. Table 4.5 provides a quantitative comparison supporting this claim. The table reports the performance of non-pruning (FedAvg) and sparsified models learned with our FedSparsify-Global approach in the Non-IID environments with 10 clients for FashionMNIST, CIFAR-10, and CIFAR-100. Following previous work on benchmarking the inference efficiency of sparsified models [108, 134], we record the total number of batches (iterations) completed by the sparsified model within an allocated execution time and compute the number of items processed per second (throughput - items/sec) and the processing time per batch (ms/batch). Compared to the dense CNN and VGG models for CIFAR-10 and CIFAR-100, respectively, the learned sparse models achieve significant efficiency improvements. Sparse models at 0.99 sparsity provide a 4-fold (CNN) and 8-fold (VGG) improvement w.r.t. the number of completed batches/iterations, latency, and throughput, with only a small penalty (7% to 9%) in model accuracy, a striking 56-fold (CNN) and 61-fold (VGG) model size compression, and a 2- to 4-fold communication cost reduction. Similarly, for FashionMNIST we observe a 54-fold reduction in memory size and a 4-fold reduction in communication cost. A sketch of the benchmarking procedure is given below.
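The benchmarking procedure can be sketched as follows in Python; a Keras-style predict_on_batch interface is assumed, and the exact harness we used may differ.

    import time

    def benchmark_inference(model, batch, allotted_seconds=60, warmup_seconds=10):
        # Warm up without counting, then count completed iterations within an
        # allotted execution time; report latency (sec/batch) and throughput
        # (examples/sec).
        deadline = time.time() + warmup_seconds
        while time.time() < deadline:
            model.predict_on_batch(batch)          # warmup, not counted
        iterations, start = 0, time.time()
        while time.time() - start < allotted_seconds:
            model.predict_on_batch(batch)
            iterations += 1
        elapsed = time.time() - start
        latency = elapsed / iterations             # seconds per batch
        throughput = iterations * len(batch) / elapsed
        return iterations, latency, throughput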
4.3.4 Sparsified Federated Neuroimaging
We train the 3D-CNN model for the brain age gap estimation (BrainAGE) prediction task in different learning setups. In this setting, we explore the effectiveness of federated model sparsification through our FedSparsify-Global scheme. We progressively prune the updated (global model) weights before communicating them to the learners, and we explore different values for the final model sparsity level by varying $S_T$ in Equation 4.5. Figure 4.25 demonstrates the reduction in model parameters as federated training progresses
Figure 4.25: Federated BrainAGE model parameters progression without (FedAvg) and with (FedSparsify) sparsification for different sparsification degrees ($S_T \in \{0.85, 0.9, 0.95, 0.99\}$); x-axis: federation round, y-axis: model parameters.
Sparsity Params Size(MBs) Comm.(MM) Test MAE Throughput
0.0 2,950,401 10.85 1888 2.879 64.31
0.85 442,561 2.09 714 2.881 69.06
0.9 295,041 1.43 645 2.859 71.28
0.95 147,521 0.73 576 2.861 78.27
0.99 29,505 0.16 521 3.024 128.55
Table 4.6: BrainAGE Federated Models Comparison in the Skewed-IID Environment.
for different sparsity degrees. As is evident, federated model parameter pruning follows an exponentially decreasing schedule. Across all investigated environments, the model is trained using Vanilla SGD with a batch size of 1 and a learning rate of $10^{-5}$. The investigated federated environments are identical to the ones described in Section 4.1.6. During federated training, learners train the global model locally for 4 epochs between federation rounds. All experiments were run on a dedicated GPU server equipped with 4 Quadro RTX 6000/8000 graphics cards with 50 GB RAM each, 31 Intel(R) Xeon(R) Gold 5217 CPUs @ 3.00 GHz, and 251 GB DDR4 RAM.
Model pruning does not hurt performance. We study model performance at different sparsity levels by evaluating the models on a held-out test set. The results of our federated progressive pruning are summarized in Figure 4.26. In all cases, model performance is not affected up to the 95% sparsity level and matches that of the FedAvg model, which is trained without pruning. Even when only 1% of the parameters are preserved, i.e., at 99% sparsity, model performance degrades only slightly. Table 4.6 provides a quantitative
Figure 4.26: Federated BrainAGE model learning performance (Test MAE, y-axis) at different degrees of sparsification (x-axis) across all four federated learning environments (Uniform-IID, Uniform-NonIID, Skewed-IID, Skewed-NonIID). The dashed line represents the performance of the non-sparsified (FedAvg) model.
comparison of the total number of parameters and the memory/disk size of the final model, the cumulative communication cost in terms of the total number of parameters exchanged during training, and the model's learning performance. Our pruning schedule can learn a highly sparsified federated model with 3 to 3.5 times lower communication cost than its unpruned counterpart (cf. 521 million vs. 1,888 million parameters). Moreover, the reduced number of final model parameters also reduces the model's space/memory footprint, with the sparsified models at 95% and 99% sparsification being 15 and 67 times smaller than the original model, respectively. Following previous work [135] on model efficiency evaluation, we benchmark the inference time of the sparse and non-sparse models by recording the total number of items processed per second (Throughput - items/sec) by each model. Specifically, we take the final models learned with (FedSparsify) and without (FedAvg) sparsification and stress test their inference time by allocating a total execution time of 60 seconds with a warmup period of 10 seconds. As Table 4.6 shows, as sparsification increases, model throughput increases too, leading to improved inference efficiency, especially at 99% sparsity.
Chapter 5
Secure, Private & Robust Federated Learning
Several authors have discussed and tackled data privacy and security aspects of distributed computa-
tions [12, 92, 172, 202], including passive attacks that can extract information from gradients [299] and
active attacks that can poison the global model [227]. Andreux et al. [12] noted that even in FL, sharing aggregated statistics or model parameters can leak sensitive personal information. The federated
learning approaches we propose in this work operate over gradients aggregated over many batches and
epochs, therefore exact data recovery is not possible. However, sometimes a prototype example can be ob-
tained [100], which can be problematic in specific situations (e.g., if all pictures in a site are from the same
individual) [224]. In our membership inference attacks study [92], as we also discuss later in section 5.2,
we show that it is possible to infer if a person’s data was used to train a model given only access to the
model prediction (black-box) or access to the model itself (white-box), and some leaked samples from the
training data distribution. We correctly identified whether specific MRI scans were used in model training
with a 60% to over 80% success rate, depending on model complexity and security assumptions. Differential privacy [66] constitutes another proposed privacy-preserving data mining solution for biomedical settings [202]. However, as already shown in [92], even though it is an effective mechanism, it can lead to significantly increased model training costs and large drops in model performance. In contrast to some past works [174, 254], in our work we consider a security protocol where the global model is not leaked
to the controller. Specifically, we consider an honest-but-curious threat model and assume that the partic-
ipating learners do not collude with each other. We provide a comprehensive evaluation to elucidate the
tradeoffs between learning performance, robustness, and privacy, including the effect of different model
aggregation policies. We primarily focus on the honest-but-curious threat model, but we also investigate
cases of corrupted, unreliable, or non-performing sites.
5.1 Protection through Encryption
Several past works have explored the use of HE in FL environments for secure aggregation. Truex et al.
[254] use a threshold variant of Paillier, an additive homomorphic scheme, for private model aggregation.
The threshold variant protects against colluding learners, however, the global model is leaked to the con-
troller during aggregation. BatchCrypt [291] explores quantization schemes for encoding weight updates as a batch and processing them in a single-instruction-multiple-data (SIMD) fashion using the Paillier scheme.
HybridAlpha [283] uses functional encryption (FE) with a secure multi-party computation (SMC) protocol to prevent privacy leakage. FLASHE [115] proposes a masking-based protocol for secure aggregation
that is orders of magnitude more efficient than traditional HE-based protocols in terms of both compu-
tation and communication costs. However, their secure aggregation protocol is limited to only addition
operations and assumes the absence of collusion between a learner and the centralized controller. POSEIDON [219] presents an N-party protocol for FL where the training is performed on encrypted models using
a multi-key variant of the CKKS scheme [38]. They use polynomial approximations to compute activation
functions such as sigmoid, ReLU, and softmax in the encrypted space. The training requires hours to com-
plete and the approximations reduce the model’s accuracy. Finally, xMK-CKKS [174] proposes a multi-key variant of the CKKS scheme [38] to protect against collusion between the learners and the controller. However, the scheme leaks the aggregated model to the controller.
In this work, we use CKKS, a fully homomorphic encryption (FHE) scheme [38] for secure aggregation.
Compared to the Paillier scheme, CKKS supports arithmetic operations over real and complex data. Moreover, in contrast to partial HE schemes such as Paillier used in past works [254, 291], CKKS is orders of magnitude faster and supports an unbounded number of additions and multiplications over encrypted data. In contrast to some past works [174, 254], our protocol does not leak the global model to
the controller. In this work, we consider an honest-but-curious threat model and assume the participating
parties do not collude with each other.
5.1.1 Homomorphic Encryption
A homomorphic encryption (HE) scheme, unlike regular cryptographic schemes, allows for certain operations (e.g., addition, multiplication) to be performed directly over encrypted data without a need for decryption. Formally, such a scheme is homomorphic if it satisfies the following equation: $E(m_1) * E(m_2) = E(m_1 * m_2)\, \forall m_1, m_2 \in M$, where $*$ represents a homomorphic operation, and $M$ represents the set of all possible messages [3].
A HE scheme can be primarily described by 4 main algorithms: KeyGen, Enc, Dec, and Eval; they are individually discussed below:
• $KeyGen(1^{\lambda}) \rightarrow (p_k, s_k)$: Takes as input the security parameter $\lambda$, and outputs a pair of keys: a public key $p_k$ and a private key $s_k$.
• $Enc(p_k, m) \rightarrow c$: Takes as input the public key $p_k$ and a message $m$, and outputs the ciphertext $c$.
• $Dec(s_k, c) \rightarrow m$: Takes as input the private key $s_k$ and a ciphertext $c$, and outputs the message $m$.
• $Eval(p_k, F, c_1, c_2, \ldots, c_n) \rightarrow c^*$: Takes as input the public key $p_k$, a permitted evaluation function $F$, and the ciphertexts $c_1, \ldots, c_n$, and computes $F(c_1, \ldots, c_n)$. The evaluation is correct if the following holds: $Dec(s_k, Eval(p_k, F, c_1, \ldots, c_n)) = F(m_1, \ldots, m_n)$, where $\{c_1, \ldots, c_n\}$ is the encryption of $\{m_1, \ldots, m_n\}$.
The security parameter $\lambda$, informally, refers to how hard (computationally) it is for an adversary to successfully break the encryption scheme. In general, a message $m \in M$ can be a string, integer, or other type of encoding, but for our purposes, we work with vectors of real (floating point) numbers. Additionally, a HE scheme can typically be classified as one of the following: partially-homomorphic (PHE), somewhat-homomorphic (SHE), or fully-homomorphic (FHE). PHE schemes allow for either additive or multiplicative operations over ciphertexts, SHE schemes allow for both but up to a pre-defined limit, and FHE schemes allow for an arbitrary (unbounded) number of additions and multiplications. In our study, we employ a fully-homomorphic encryption scheme due to its inherent support for an unbounded number of additive and multiplicative operations, which are essential during the aggregation of encrypted models in federated settings.
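To make the four algorithms concrete, the sketch below exercises them with the open-source TenSEAL library, which we use here purely for illustration; the actual experiments in this work rely on PALISADE. Creating the context plays the role of KeyGen, and element-wise addition and multiplication are the permitted evaluation functions:

```python
import tenseal as ts

# KeyGen: creating a CKKS context generates the public/secret key material.
ctx = ts.context(ts.SCHEME_TYPE.CKKS,
                 poly_modulus_degree=8192,
                 coeff_mod_bit_sizes=[60, 40, 40, 60])
ctx.global_scale = 2 ** 40  # precision of the approximate arithmetic

m1, m2 = [1.5, 2.5, 3.5], [10.0, 20.0, 30.0]
c1, c2 = ts.ckks_vector(ctx, m1), ts.ckks_vector(ctx, m2)  # Enc

c_add = c1 + c2   # Eval: homomorphic element-wise addition
c_mul = c1 * c2   # Eval: homomorphic element-wise multiplication

# Dec: recovers F(m1, m2) up to the scheme's configurable approximation error.
print(c_add.decrypt())  # approximately [11.5, 22.5, 33.5]
print(c_mul.decrypt())  # approximately [15.0, 50.0, 105.0]
```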
The CKKS Scheme. In this work, we apply the Cheon-Kim-Kim-Song (CKKS) [38] fully-homomorphic construction, which is based on the hardness of the Learning-With-Errors (LWE) [209] problem, or its ring variant (RLWE [173]). Unlike other FHE schemes, such as the BGV [33] and BFV [71] (integer) constructions, CKKS allows for approximate arithmetic over real and floating point numbers. It is an approximate scheme in the sense that it provides limited (configurable) precision by treating the encryption noise as natural error incurred through approximate computations, and by dropping the least significant bits of computations via the rescaling of encrypted data.
Rescaling refers to the underlying process of limiting ciphertext noise and keeping the scale (controlling the precision) constant throughout a pre-defined number of multiplications allowed in the computation. Due to this, compounded computations (specifically multiplications) scale much more efficiently compared to the integer schemes, as they no longer need to be exact. Furthermore, CKKS benefits from packing, the process of slotting multiple data values into one ciphertext, which allows encrypted computations to be performed in a Single Instruction Multiple Data (SIMD) fashion. These properties make CKKS useful for our particular use case. We refer the reader to [38] for more specific details.
CKKS Parameters Configuration. For the implementation of the CKKS scheme, we utilize the lattice-based cryptographic & HE library PALISADE [250]. In order to meet standards set by the Homomorphic Encryption Standard [7], PALISADE allows the configuration of scheme parameters to achieve the precision, performance, and security goals that a user may have. For our work, we configured the following parameters: multiplicative depth = 2, scale factor bits = 52, batch size = 8192, and security level = 128 bits. Multiplicative depth refers to the maximum multiplicative path-length that may occur in a computation (e.g., multiplying $1 \cdot 2$ and then multiplying that quantity with $3$ has a path-length of 2). Scale factor bits represent the bit-length of the scaling factor $D$ present in the CKKS scheme; choosing it controls the bit-level accuracy of the desired computation. Batch size is precisely the number of plaintext slots used in a single ciphertext. As discussed in Section 5.1.1, CKKS can pack multiple plaintext values in each ciphertext. Lastly, the security level controls the bit-security that such an implementation achieves according to FHE standards.
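For readers without a PALISADE build, an approximately equivalent configuration can be written with the TenSEAL library; the parameter mapping below (two 52-bit middle primes for a multiplicative depth of 2, and a polynomial degree of 16384 for 8192 plaintext slots) is our own illustrative approximation, not the exact setup used in our experiments:

```python
import tenseal as ts

# Two middle 52-bit primes allow a multiplicative depth of 2, and
# poly_modulus_degree=16384 yields 16384/2 = 8192 plaintext slots;
# TenSEAL targets 128-bit security by default.
ctx = ts.context(ts.SCHEME_TYPE.CKKS,
                 poly_modulus_degree=16384,
                 coeff_mod_bit_sizes=[60, 52, 52, 60])
ctx.global_scale = 2 ** 52  # scale factor bits = 52
```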
5.1.2 Federated Learning with Homomorphic Encryption
We consider a federated learning environment where each learner trains on its local dataset for an as-
signed number of local epochs, and upon completion of its local training task, it sends the local model to
the federation controller to compute the new community model. In an encrypted federation environment,
the procedure is similar, with the addition of three key steps: encryption, encrypted aggregation, and decryption. During the encryption step, every learner encrypts its locally trained model with an HE scheme
using the public key and sends the encrypted model (ciphertext) to the controller. For each learner, its en-
crypted model is treated as a vector of ciphertext objects, each object corresponding to a model array. With
this approach, the encrypted data is represented as a (concatenated) collection of flattened data-vectors,
each of them representing the local data for a particular learner. The controller receives all the encrypted
local models and then performs the encrypted weighted aggregation to compute the new encrypted com-
munity model without ever decrypting any of the individual models. Subsequently, the controller sends
the new (still encrypted) community model to all the learners, and the learners decrypt it using the private key. Once the decryption is complete, the learners train the (new) decrypted model on their local dataset and the entire
procedure repeats. This pipeline is represented schematically in Fig. 5.1, and algorithmically in Alg. 4. In
our setup, the (encrypted) weighted aggregation rule applied by the controller on learners’ local models is
based on the local training dataset size of each learner (i.e., FedAvg).
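A minimal sketch of one such encrypted aggregation round follows (again with TenSEAL for illustration rather than the PALISADE-based implementation used in our system; `local_weights` is a hypothetical list of flattened local model vectors and `dataset_sizes` the corresponding training set sizes):

```python
import tenseal as ts

def encrypted_fedavg_round(ctx, local_weights, dataset_sizes):
    """One round of FedAvg where the controller only ever sees ciphertexts."""
    total = float(sum(dataset_sizes))
    # Learner side: each learner encrypts its flattened local model.
    ciphertexts = [ts.ckks_vector(ctx, w) for w in local_weights]
    # Controller side: weighted aggregation in the encrypted space. The
    # plaintext scaling factors n_k / total are public; the weights are not.
    community = ciphertexts[0] * (dataset_sizes[0] / total)
    for c, n in zip(ciphertexts[1:], dataset_sizes[1:]):
        community = community + c * (n / total)
    return community  # learners recover it with community.decrypt()
```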
[Figure 5.1 depicts the encrypted training pipeline: (1) learners receive the encrypted community model and decrypt it using the private key; (2) learners train the decrypted model on their local training dataset and encrypt the new local model using the public key; (3) learners send their encrypted local models to the federation controller; (4) the controller aggregates the encrypted local models and computes the new community model.]
Figure 5.1: Federated System Architecture with Encryption
Algorithm 4 Federated Learning with HE. The encrypted community model $w_c^e$ is computed with $N$ clients, each indexed by $k$; $\beta$ is the batch size; $\eta$ is the learning rate; $E$ are the local epochs.

Initialize: $w_c, \eta$
Controller:
  $w_c^e$ = encrypt initial community model
  for $t = 0, \ldots, T-1$ do
    for each learner $k \in N$ in parallel do
      send encrypted community model $w_c^e$
      $w_k$ = LearnerOpt($w_c^e$)
    $w_c^e$ = encrypted aggregation of all $w_k$

LearnerOpt($w_t^e$):
  $w_t$ = decrypt community model $w_t^e$
  $B \leftarrow$ split training data $D_k^T$ into batches of size $\beta$
  for $i \in E$ do
    for $b \in B$ do
      $w_{t+1} = w_t - \eta \nabla F_k(w_t; b)$
  $w_{t+1}^e$ = encrypt $w_{t+1}$
  send $w_{t+1}^e$ to controller
FHE Evaluation on CIFAR-10. We evaluate the performance of the security protocol presented in Figure 5.1 and Algorithm 4 using the CKKS fully-homomorphic encryption scheme, in terms of final model learning performance and execution latency. We conduct the evaluation on the CIFAR-10 standard computer vision benchmark dataset across nine challenging federated environments (cf. Section 4.1). Figure 5.2 shows the convergence of the federated model with respect to federation rounds. As is evident, the federated model learned using FHE reaches the same level of performance as the non-secure training protocol, at the expense of a 5%-7% additional execution time latency, as shown in Figure 5.3.
MetisFL & FHE. In Figure 5.4 we show how the homomorphic encryption training pipeline is carried out in MetisFL (cf. Section 4.2) when the synchronous communication protocol is employed. After the configuration of the federation, the federation controller sends the original encrypted model to each learner; the learners decrypt and train the received global model, then encrypt and send their new local models to the controller; the controller aggregates the encrypted models, and a new federation round begins.
[Figure 5.2 plots: Test Top-1 Accuracy (y-axis) vs. Federation Rounds (x-axis) for FedAvg and FedAvg w/ CKKS across nine federated environments (Uniform & IID, Uniform & Non-IID(5), Uniform & Non-IID(3), Skewed & IID, Skewed & Non-IID(5), Skewed & Non-IID(3), PowerLaw & IID, PowerLaw & Non-IID(5), PowerLaw & Non-IID(3)); each panel includes an inset of the per-learner data distribution (#Examples for learners G:1-G:10).]
Figure 5.2: CIFAR10 Evaluation with and without FHE. Federation Rounds as x-axis.
[Figure 5.3 plots: Test Top-1 Accuracy (y-axis) vs. Wall-Clock Time in seconds (x-axis) for FedAvg and FedAvg w/ CKKS across the same nine federated environments and per-learner data distribution insets as Figure 5.2.]
Figure 5.3: CIFAR10 Evaluation with and without FHE. Wall-Clock time as x-axis.
[Figure 5.4 depicts the MetisFL encrypted training pipeline. Configuration steps: (A) the controller sets up its dedicated gRPC server and the learners their respective connecting stubs; (B) the learners set up their dedicated gRPC servers and the controller a connecting stub to each learner; (C) learners load their local training datasets; (D) the controller specifies the synchronization protocol. Per-round steps: (1) the controller encrypts the initial community model using the public key and delegates the training task to every learner (run_task); (2) each learner decrypts and trains the model, encrypts the new local model, and sends it to the controller, which stores it in the Model Store; (3) the controller aggregates the encrypted local models and computes the new global model in the encrypted space; (4) the controller sends the new encrypted global model to every learner and a new federation round begins.]
Figure 5.4: MetisFL Training Pipeline with Encryption.
5.1.3 Secure Federated Neuroimaging
We evaluate the learning performance of FedAvg with and without encryption on the BrainAGE prediction task over the UK Biobank (UKBB) neuroimaging dataset [183] for several data distribution environments, using a 3D-CNN model (3 million parameters).
UKBB Dataset. For our evaluation, we use a total of 10,446 MRI scans [137] from subjects with no psychiatric diagnosis as defined by ICD-10 criteria, out of the 16,356 available subjects in the UK Biobank dataset [183]. The scans were processed using a standard preprocessing pipeline with non-parametric intensity normalization for bias field correction, brain extraction using FreeSurfer, and linear registration to a $(2mm)^3$ UKBB minimum deformation template using FSL FLIRT, with the final dimensions of the images being 91×109×91. The dataset is split into training and test sets of size 8,356 and 2,090, respectively.
Results. Figure 5.5 shows the execution (wall-clock) time of synchronous federated average (SyncFedAvg) with and without encryption. Learning performance is almost identical, at a small (∼7%) additional training time cost. As shown in Fig. 5.5, and similar to the results on the CIFAR-10 dataset discussed in the previous section, encryption does not penalize the learning performance of the federated model, and in some learning scenarios (e.g., Uniform & Non-IID) the encrypted model can even lead to faster convergence compared to its non-encrypted counterpart. This can be attributed to the stochastic effect that the encryption scheme introduces to the weights of the federated learning model during encoding, decoding, and private aggregation, thereby acting as a regularizer during federated training. Eventually, as training progresses, both the encrypted and non-encrypted models reach the same final score.
[Figure 5.5 plots: MAE (y-axis) vs. Wall-Clock Time in seconds (x-axis) for SyncFedAvg and SyncFedAvg (CKKS) in four environments (Uniform & IID, Uniform & Non-IID, Skewed & IID, Skewed & Non-IID); insets show the per-learner data distribution (#Examples for learners G:1-G:8) by age group.]
Figure 5.5: Federated Learning (SyncFedAvg) with and without CKKS homomorphic encryption on the BrainAGE 3D-CNN. The vertical marker represents the training time it takes for each approach to complete 20 federation rounds.
5.2 Protection through Data Leakage Prevention
Federated learning avoids sharing datasets. However, the exchanged parameters may reveal private information. Various works have highlighted this via practical privacy attacks, such as membership inference [187, 232] and model inversion attacks [83, 299], in both federated and centralized settings. Researchers have focused on reducing overfitting [217, 256], limiting the information in activations and weights [113], or using differentially private mechanisms to alleviate such privacy concerns. Learning under differential privacy is particularly attractive as it comes with theoretically solid worst-case guarantees. However, these works assume different threat models, which may relax the problem. For example, [271] assumes that the server can be trusted, whereas [191, 296] consider a stricter threat model, treating the server as honest-but-curious, similar to us. Rather than using a theoretical upper-bound measure of privacy, we focus on a more practical measure, i.e., the success rate of membership inference attacks.
5.2.1 Membership Inference Attacks
Membership inference attacks are one of the most popular attacks to evaluate privacy leakage in practice [111]. The malicious use of trained models to infer which subjects participated in the training set, by having access to some or all attributes of the subjects, is termed a membership inference attack [187, 232]. These attacks aim to infer if a record (a person’s MRI scan in our case) was used to train the model, revealing information about the subject’s participation in the study, which could have legal implications. These attacks are often distinguished by the access to information that the adversary has [187]. Most successful membership inference attacks in the deep neural network literature require access to some parts of the training data, or at least some samples from the training data distribution [205, 217, 256]. White-box attacks assume that the attacker is also aware of the training procedure and has access to the trained model parameters, whereas black-box attacks only assume unlimited access to an API that provides the output of the model [144, 187].
Creating efficient membership inference attacks with minimal assumptions and information is an ac-
tive area of research [39, 112, 235]. However, our work is focused on demonstrating the vulnerability of
deep neural networks to membership inference attacks in the federated as well as non-federated setup.
Therefore, we make straightforward assumptions and assume somewhat lenient access to information.
Our attack models are inspired by [187, 232], and we use similar features such as gradients of parameters,
activations, predictions, and labels to simulate membership inference attacks. In particular, we learn deep
binary classifiers to distinguish training samples from unseen samples using these features.
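As a concrete illustration, the following sketch trains such a binary membership classifier (a logistic-regression stand-in for the deep classifiers used in our study; `feats_member` and `feats_nonmember` are hypothetical per-sample feature matrices, e.g., per-layer gradient magnitudes concatenated with predictions and labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_membership_attack(feats_member, feats_nonmember, seed=0):
    """Binary classifier: label 1 = sample was in the victim's training set."""
    X = np.vstack([feats_member, feats_nonmember])
    y = np.concatenate([np.ones(len(feats_member)),
                        np.zeros(len(feats_nonmember))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    attack = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # Accuracy above the 50% random baseline indicates membership leakage.
    return attack, attack.score(X_te, y_te)
```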
In the case of federated learning, each learner receives model parameters and has some private training
data. Thus, any learner is capable of launching white-box attacks. Moreover, in this scenario, the learner
has access to the community models received at each federation round. When simulating membership
attacks on federated models, we simulate attacks from the learners’ perspective by training on learners’
private data and the task is to identify other learners’ subjects. In the case of models trained via centralized
training, we assume that the adversary can access some public training and test samples. We simulate both
white-box and black-box attacks in this case.
5.2.1.1 Attacks Setup
Trained Models for Predicting Brain Age. We use models trained to predict brain age from structural MRIs to demonstrate vulnerability to membership inference attacks. We show successful attacks on 3D-CNN and 2D-slice-mean models. For centralized training, we use the same dataset and training setup as [91], and for federated training, we use the same training setup and dataset as [243]. In the latter, the authors simulate different federated training environments by considering diverse amounts of records (i.e., Uniform and Skewed) and varying subject age distributions across learners (i.e., IID and non-IID). All models are trained on T1 structural MRI scans of healthy subjects from the UK Biobank dataset [184] with the same pre-processing as [137].
Configuration. As discussed in Section 5.2.1, attackers may have access to some part of the training set and additional MRI samples that were not used for training, referred to hereafter as the unseen set. We train a binary classifier to distinguish whether a sample was part of the training set. We study the effectiveness of different features for the attacks in Section 5.2.1.2.
In the case of models trained via centralized training, the attack models are trained on a balanced training set using 1500 samples each from the training and unseen sample sets∗. For testing, we create a balanced set from the remaining train and unseen sets (694 samples each) and report accuracy as the vulnerability measure. To attack models trained via federated learning, we consider each learner as the attacker. Thus, the attacker is trained on its private dataset and some samples from the unseen set that it may have. This way, we created a balanced training set of up to 1000† samples each from the training and unseen sets. Unlike the centralized setup, the distributions of the unseen set and the training set that the attacker model is trained on could be different, particularly in non-IID environments. In this scenario, the attacks are made on the private data of other learners. Thus, we report the classifier’s accuracy on a test set created from the training samples of the learner being attacked and new examples from the unseen set.
We simulate membership inference attacks on both centrally and federally trained models for the BrainAGE problem. We report results on models trained centrally in Section 5.2.1.2 and federated in Section 5.2.1.3. Conventional deep learning models are trained using gradient descent. Thus, the gradients of the loss w.r.t. the parameters computed from a trained model are likely to be smaller for the training set than for the unseen set. We evaluate features derived from gradients, activations, errors, and predictions of the trained model to train the binary classifier, and study their effectiveness in Section 5.2.1.2. The main task is to identify if a sample belonged to the training set. We report the accuracy of correct identification on a test set created from the training and unseen sample sets that were not used to train the attack model but were used for training and evaluating the brain age models.
∗ In the implementation, the unseen set is the same as the test dataset used to evaluate the brain age model. The unseen set and the training set are IID samples from the same distribution.
† In the case of the Skewed & non-IID environment, some learners had fewer than 1000 training samples. As a result, the attack model had to be trained with fewer samples.
5.2.1.2 Attacks in Centralized Settings
Table 5.1 summarizes the results of simulating membership attacks with various features. As apparent from Figure 5.6a, test and train samples have different error distributions due to the inherent tendency of deep neural networks to overfit on the training set [293]. Consequently, the error is a useful feature for membership inference attacks. Error is the difference between prediction and label, and using prediction and label as two separate features produced even stronger attacks, as indicated by higher membership attack accuracies. One reason for this could be that the model overfits more for some age groups. Using the true age information (the label) would enable the attack model to find these age groups, resulting in higher attack accuracy.

Attacks made using the error, or the prediction and label, are black-box attacks. A white-box attacker may also utilize more information about the model’s internal workings, like the gradients, knowledge about the loss function, training algorithm, etc. Deep learning models are commonly trained until convergence using some variant of gradient descent. Convergence is achieved when the gradient of the loss w.r.t. the parameters on the training set is close to 0. As a result, gradient magnitudes are higher or similar for unseen samples compared to training samples (see Figure 5.6b). Therefore, we used the gradient magnitude of each layer as a feature, resulting in attack accuracies of 72.6 and 78.34 for the 3D-CNN and 2D-slice-mean models, respectively.
Finally, we simulated attacks using gradients of parameters at different layers‡. We find that parameter-gradients of layers closer to the output layer (i.e., conv 6 and output layers) are more effective compared to the gradients of layers closer to the input (conv 1). Preliminary results hinted that activations do not provide much information to attack the models, so we did not simulate attacks on the 2D-slice-mean models with activations as features. The best attack accuracies, 78.05 and 83.04 when attacking the 3D-CNN and 2D-slice-mean models, respectively, were achieved by using predictions, labels, and gradients of parameters close to the output layer. The successful membership inference attacks demonstrated in this section accessed samples from the training set, which is limiting.

‡ We consider layers close to the input or output layers, as these have fewer parameters and attack models are easily trained. Intermediate layers had more parameters, making it harder to learn the attack model.
[Figure 5.6 plots: (a) density of the output error (years) for train vs. test samples, and (b) density of the gradient magnitudes of the conv 1 layer for train vs. test samples, for the 2D-slice-mean and 3D-CNN models.]
Figure 5.6: Distribution of prediction errors and gradient magnitudes from the trained models in a centralized setting.
Features | 3D-CNN | 2D-slice-mean
activation | 56.63 | -
error | 59.90 ± 0.01 | 74.06 ± 0.00
gradient magnitude | 72.60 ± 0.45 | 78.34 ± 0.17
gradient (conv 1 layer) | 71.01 ± 0.64 | 80.52 ± 0.40
gradient (output layer) | 76.65 ± 0.44 | 82.16 ± 0.29
gradient (conv 6 layer) | 76.96 ± 0.57 | 82.89 ± 0.83
prediction + label | 76.45 ± 0.20 | 81.70 ± 0.29
prediction + label + gradient (conv 6 + output) | 78.05 ± 0.47 | 83.04 ± 0.50
Table 5.1: Membership inference attack accuracies on centrally trained models (averaged over 5 attacks).
5.2.1.3 Attacks in Federated Settings
We consider three different federated learning environments consisting of 8 learners and investigate cases where malicious learners attack the community model. The community model is the aggregated result of the learners’ local models, and a malicious learner may use it to extract information about other learners’ training samples. In this scenario, a malicious learner can learn an attack model by leveraging its access to the community models of all federation rounds and its local training dataset; we simulate attacks using this information (see also Section 5.2.1.1). The model vulnerability is likely to increase with more training iterations, and hence we used features derived from the community models received during the last five federation rounds, and each learner uses its private samples to learn the attack model. Each learner may try membership inference attacks on any of the other seven learners, resulting in 56 possible attack combinations. An attack is considered successful if its accuracy is more than 50%, which is the random prediction baseline.
Table 5.2 shows the average accuracy of successful attacks and the total number of successful learner-attacker attack instances (in parentheses) across all possible learner-attacker pairs (56 in total). We empirically observed that the success rate of the attacks is susceptible to data distribution shifts. In particular, distribution-shift-agnostic features like gradient magnitudes can lead to more successful attacks (count-wise) when the data distribution across learners differs. For the results shown in Table 5.2 and Figure 5.7, we used all available features (i.e., gradient magnitudes, predictions, labels, and gradients of the last layers).

We also observe that the overall attack accuracies are lower than in the centralized counterpart discussed in Section 5.2.1.2. This drop can be attributed to the following: a) As we show in Section 5.2.2, attack accuracies are highly correlated with overfitting. Federated learning provides more regularization than centralized training and reduces overfitting, but does not eliminate the possibility of an attack. b) Federated models are slow to train, but as the model is trained for more federation rounds, the vulnerability increases (see Figure 5.7). Moreover, Table 5.2 only presents an average-case view of the attacks, and we observe that the attack performance depends on the data distribution of the learner-attacker pair. When the local data distributions across learners are highly diverse, i.e., Skewed & non-IID, attack accuracies can be as high as 80% for specific learner-attacker pairs.

An interesting outcome of the experiments in Table 5.2 is that an attack carried out in the harder federated learning environments (non-IID) can be more successful (with higher accuracy) than in the IID settings. However, the overall number of successful attacks in the non-IID environments is significantly lower than in the IID environment (25 vs 56).
[Figure 5.7 plot: Privacy Vulnerability vs Federation Rounds; vulnerability (y-axis, 0.45-0.75) increases over 40 federation rounds (x-axis) for the Uniform & IID, Uniform & Non-IID, Skewed & IID, and Skewed & Non-IID environments.]
Figure 5.7: Privacy vulnerability increases with federation rounds. Vulnerability is measured as the average accuracy of distinguishing train samples vs unseen samples across learners. The model architecture used is 3D-CNN.
Data distribution | 3D-CNN | 2D-slice-mean
Uniform & IID | 60.06 (56) | 58.11 (56)
Uniform & non-IID | 61.00 (28) | 60.28 (29)
Skewed & non-IID | 64.12 (25) | 63.81 (24)
Table 5.2: Average attack accuracies on 3D and 2D-CNN federated models (across all successful attacks). Numbers in parentheses indicate the median number of successful attacks over 5 runs.
5.2.2 Defensive Mechanisms
The success of privacy attacks is often attributed to the ability of learning algorithms to memorize information about a single sample [113]. Therefore, defending against privacy leakage involves limiting the information that the learning algorithm may extract about each sample, or simply limiting the information in the neural network’s weights. We explore two approaches to limit the vulnerability to membership inference attacks: differentially private training via DP-SGD [2] and SGD with non-unique gradients. Models with fewer parameters are less prone to overfitting and hence should be less vulnerable to attacks. To this end, we also evaluate whether learning extremely sparse models can limit the ability of an attacker to attack the final global models.
5.2.2.1 Gaussian Noise
Differential privacy is a formal framework to reason about privacy. A differentially private training mechanism ensures that the outcomes (i.e., final model weights) do not change much between two training sets that differ by one example [67]. For training brain age models with differential privacy guarantees, we use the DP-SGD algorithm [2]. Briefly, the principal modifications to SGD to limit the influence of a single sample are to clip the gradient from each sample so that it does not exceed a maximum norm, and to add spherical Gaussian noise. We update each learner during federated training with these private gradients. During initial experiments, we found that achieving non-vacuous differential privacy guarantees requires adding significant Gaussian noise to the gradients, which annihilates learning performance. However, we observed that practical privacy attacks, such as membership inference attacks, can be thwarted by adding Gaussian noise of much smaller magnitudes [111]. Therefore, we evaluate training with gradients perturbed by small additive Gaussian noise.
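A sketch of this clip-and-noise update on per-sample gradients follows (a simplified NumPy stand-in for DP-SGD [2], not the exact implementation used in our experiments; `clip_norm` and `sigma` are illustrative hyperparameters):

```python
import numpy as np

def clipped_noisy_gradient(per_sample_grads, clip_norm=1.0, sigma=0.1,
                           rng=np.random.default_rng(0)):
    """per_sample_grads: array of shape (batch, dim), one flattened
    gradient per sample; returns the privatized mean gradient."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    # Clip: rescale any per-sample gradient whose norm exceeds clip_norm.
    clipped = per_sample_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
    # Average, then add spherical Gaussian noise; small sigma already
    # blunts membership inference, even without formal DP guarantees.
    mean_grad = clipped.mean(axis=0)
    noise = rng.normal(0.0, sigma * clip_norm / len(per_sample_grads),
                       size=mean_grad.shape)
    return mean_grad + noise
```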
5.2.2.2 Learning with Non-unique Gradients
To learn good machine learning models, we would like to extract patterns while ignoring information about specific samples. Training models using gradient descent can leak an individual’s information during training because there is no restriction on what information a sample may contribute. Thus, the model may preserve information unique to each individual, leaking privacy. Differential privacy adds the same noise to all gradients to limit the information or influence of a single sample on the neural network, but that may also destroy useful information in the attempt to reduce memorization. We investigate removing the unique information from each sample’s gradient and training with only the non-unique parts. We compute the gradient of the loss ($L$) w.r.t. the parameters ($\theta$) for each sample $(x_i, y_i)$ in a batch ($B$), i.e., $g_i = \nabla_\theta L(f(x_i; \theta), y_i)\ \forall i \in \{1 \ldots B\}$. To compute the non-unique part, we project each gradient vector on the subspace spanned by the rest of the gradient vectors ($g_i^{span}$). We consider the residual part as the unique information about each sample (i.e., $g_i^{unique} = g_i - g_i^{span}$). Ideally, we would like to train with only the non-unique part. However, we observe that this may harm the performance too much, and therefore we downweight the effect of the unique part and use $\hat{g}_i = g_i^{span} + \alpha g_i^{unique}$, $\alpha < 1$, to update the model at the local learners. $\alpha$ is a hyperparameter that we tune to trade off privacy and performance.
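A minimal sketch of this projection with NumPy follows (our own reconstruction of the update; per-sample gradients are flattened into the rows of `grads`):

```python
import numpy as np

def non_unique_gradient(grads, alpha=0.1):
    """grads: (B, d) matrix of per-sample gradients g_i.
    Returns the batch mean of g_hat_i = g_span_i + alpha * g_unique_i."""
    B = grads.shape[0]
    g_hat = np.empty_like(grads)
    for i in range(B):
        others = np.delete(grads, i, axis=0)  # the remaining B-1 gradients
        # Least-squares projection of g_i onto span{g_j : j != i}.
        coef, *_ = np.linalg.lstsq(others.T, grads[i], rcond=None)
        g_span = others.T @ coef
        g_unique = grads[i] - g_span          # residual = unique information
        g_hat[i] = g_span + alpha * g_unique  # down-weight the unique part
    return g_hat.mean(axis=0)
```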
Figure 5.8 shows the privacy and learning performance trade-off when learners are trained with small-magnitude Gaussian noise and with our non-unique gradient approach. Both approaches can reduce the vulnerability of the model to privacy attacks. Although the magnitude of the Gaussian noise is much smaller than what formal differential privacy guarantees would require, it effectively reduces membership inference attacks. We hypothesize that the small additive noise is enough to reduce the mutual information between the data and the neural network outputs/activations, which limits the success of membership inference attacks [113]. We also show the vulnerability reduction due to model sparsification. In IID environments, non-unique gradients perform similarly to adding Gaussian noise. However, they are significantly faster to train.
[Figure 5.8 plots: Vulnerability (y-axis) vs. MAE/performance (x-axis) for Gaussian Noise, Non-unique gradients, No Defense, and Sparsification in (a) Uniform & IID and (b) Uniform & Non-IID environments.]
Figure 5.8: Vulnerability vs Performance trade-off when training learners with Differential Privacy (Gaussian Noise), the Non-unique gradients approach, and a model trained to achieve 99% sparsity in the final global model. Lower vulnerability and lower MAE are desired, i.e., points towards the bottom left are better. The model architecture used is 3D-CNN.
The Gaussian noise models required training for 40 rounds, whereas the non-unique gradients required only 25 rounds. In non-IID environments, training with Gaussian noise provides a better trade-off than non-unique gradients. This may be due to learners overfitting to their private datasets earlier in training, thus deviating from the community model. In summary, both small-magnitude Gaussian noise added to the gradients and non-unique gradients are effective at preventing membership attacks.
5.3 Robust Training under Data Corruption
Although federated learning provides some data protection, since it operates on aggregated parameters, it is susceptible to data corruption/poisoning attacks that can originate at the sources [121, 172]. Inadvertent alterations to the integrity of the local training data (e.g., human annotation errors, systematic mislabeling) [196] or targeted adversarial data corruptions [37] can severely damage the performance of the federated learning task, with dire consequences on the final outcome. To circumvent these challenges, new defensive mechanisms need to be developed that are capable of improving the resilience of federated learning systems against such attacks.
In this section, we analyze the learning convergence of models under different rates of data corruption in centralized and federated settings to quantify when a data corruption attack can be successful. We present a new Performance Weighting scheme for robust federated learning in the presence of corrupted data sources that can be used both as a defensive and as a detection mechanism. We use the local validation dataset of each learner as a testbed to evaluate the performance of the local models and to measure their weight in the community model. We empirically show that our weighting scheme is robust to data corruption even if the majority of the training and validation data are corrupted.
5.3.1 Data Corruption Background
The widespread adoption of federated learning has spurred research on adversarial attacks that aim to tamper with the learning process of the federation, such as backdoor attacks [20, 247, 264, 279], membership inference attacks [92, 188, 255], and poisoning attacks. Poisoning attacks can be classified into two subcategories: model poisoning and data poisoning. In model poisoning, the attacker modifies the learning process by introducing adversarial gradients and parameter updates [27, 72, 200, 270], while in data poisoning the attacker aims to manipulate the training data by directly modifying the training examples [79, 229, 252].

Our focus is on data poisoning (corruption) attacks, and specifically label flipping [229, 238, 278, 277, 30] and label shuffling [150]. In our learning environments, we consider static adaptability [121, 200], where the data poisoning attack is fixed throughout the federated execution, as well as a continuous participation rate, where the attacker continuously participates in the learning process; a situation frequently observed in cross-silo federated settings.

It has been shown that in centralized environments, deep learning can be robust even to extreme degrees of corruption [212, 234], such as label shuffling exceeding 90%. However, as we show in this work, when considering a federated learning environment, the federation can be strongly affected by corruption
Figure 5.9: A robustness analysis of a perfect learner under different levels of data corruption (label flipping). The probability that a perfect learner can learn the correct association between a specific class of objects (e.g., pictures of dogs) and a corresponding label (e.g., the description “dog”), under different levels of data corruption c and with a different number of available training examples per class n.
when using standard aggregation schemes such as federated average (FedAvg). Namely, there is a substantial degradation of the community model’s performance even in the presence of very low corruption levels (e.g., 30% label shuffling). To address this, we propose a scoring-based aggregation scheme with improved robustness to various data corruption attacks.
5.3.2 Theoretical Limits of Data Corruption
To analyze the robustness of models to label shuffling and targeted label flipping, we consider the following thought experiment. Assume a perfect learner that can learn the correct association as long as the correct labels for a specific class of objects are more popular than any other class of incorrect labels (e.g., if 10 pictures of dogs are labeled as dog, 9 as duck, and 7 as cat, we say that the correct label is the most popular, and we assume that a perfect learner would associate pictures of dogs with the correct label dog). Now, consider a federation with many learners, a proportion of which are corrupted. For simplicity, we assume that the data is evenly distributed among the learners.

If a proportion of $\mu \in [0,1]$ learners is corrupted, and the type of corruption is targeted label flipping (e.g., a $\mu$ portion of learners have pictures of dogs labeled as cat), the theoretical upper limit for the corruption that the federation can sustain is $50\% - \varepsilon$. The reason is that the machine learning model does not have
Figure 5.10: A comparison between a perfect and a realistic learner. The upper plot shows a theoretical upper bound for a perfect learner to learn the correct association (a 10-class case). The lower plot shows experimental results on Fashion-MNIST.
any prior understanding of what the true association should be. A situation where our objective is to learn that objects $x \in O$ from a certain class of objects $O$ are associated with label $L_1$, under a corruption where a proportion of $\mu$ objects are mislabeled as $L_2$, is equivalent to an alternative situation where our objective is to learn the association between $x \in O$ and $L_2$ under a corruption level of $1 - \mu$. Due to the symmetry of the problem, the theoretical upper limit for sustaining the corruption must be $\mu = 1 - \mu = 1/2$.
When considering the random label-shuffling type of data corruption, the analysis is more complex.
Even if the data is distributed uniformly between the learners, due to random chances some learners will
have more examples of a specific class than others. If we corrupt a µ portion of learners, the chances are
that from the perspective of the federation, sometimes the correct label will be more popular than any
other class of incorrect labels, while in other cases not. In general, that probabilityp
C
of the correct label
to be the most popular will be a function of both µ (a portion of learners being corrupted) and n (the
number of examples of that specific class in the federation). By performing a Monte Carlo simulation, we
can estimate the value of that probability (cf. Figure. 5.9, where we took the number of classesC =10). As
we see, with the increasing value ofn we can sustain increasingly stronger corruption levels with the limit
123
ofµ →1 whenn→+∞. To illustrate how those theoretical limits correspond to some realistic cases, we
trained a centralized model on the standard Fashion-MNIST classification task, taking a different number
of training examples per class (n = 100 or n = 870) and measuring the best accuracy on the test set as a
function of the corruption level (cf. Figure 5.10). We see that the qualitative comparison with the theoretical
bound is good, with the transition between a high and a low accuracy being somewhat smoother for the
realistic learner, likely due to the limitations of the model or the inefficiency of the learning procedure.
These results can be easily generalized to other domains and are not specific to the particular classification
task investigated here [212, 234].
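To make the thought experiment concrete, the following is a minimal NumPy sketch of the Monte Carlo estimate of p_C under uniform label shuffling; the vote-counting model of the perfect learner follows the text, while the function name and trial counts are illustrative choices, not the exact code behind Figure 5.9:

    import numpy as np

    rng = np.random.default_rng(0)

    def p_correct(mu, n, C=10, trials=2000):
        # Probability that the true label of a class with n federation-wide
        # examples stays the most popular when a fraction mu of those examples
        # is relabeled uniformly at random over C classes.
        n_corrupt = int(mu * n)
        wins = 0
        for _ in range(trials):
            votes = np.bincount(rng.integers(0, C, size=n_corrupt), minlength=C)
            votes[0] += n - n_corrupt  # index 0 plays the role of the true label
            wins += votes[0] > votes[1:].max()
        return wins / trials

    for n in (100, 870):
        print(n, [round(p_correct(mu, n), 2) for mu in (0.3, 0.6, 0.9)])

Sweeping µ and n this way traces curves that qualitatively match the theoretical bound discussed above: larger n sustains stronger corruption.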
We use this analysis as motivation for an extensive study of these cases in the federated learning environment. We show that while we can still expect the federation to be resistant to strong data corruption attacks, the standard global aggregation rules over local weights exhibit strong inefficiency (the federation becomes much more sensitive to label corruption than we would expect from the analysis of the centralized case). Consequently, we propose an alternative model weights aggregation scheme that significantly improves over the standard method both in terms of speed of convergence and final model accuracy.
5.3.3 The Performance Weighting Scheme
Our proposed Performance Weighting scheme builds upon the work of [241]. Similar approaches were also
investigated by [269, 295]. Specifically, in [295] the authors aggregate the local models into sub-models
and delegate their evaluation to learners that have a similar data distribution with the aggregated model,
while in [269] the authors evaluate the local models against a validation dataset that is hosted at the central
server. However, both proposed approaches used the validation-based accuracy as a detection mechanism
to discard corrupted learners from the federation, whereas in our work we keep the corrupted models in the
federation with a downgraded contribution value. We show empirically that by not discarding corrupted
models during federated training, our method leads to similar or even better performance than in the case
of total exclusion of the corrupted models [252].
In a federated learning environment with N participating learners, the goal is to find a set of optimal model parameters by jointly optimizing the global objective function f(w) = Σ_{k=1}^{N} (p_k / P) F_k(w), with F_k denoting the local objective function of learner k, weighted by a factor p_k normalized to [0,1] through the sum of all weighting factors, P = Σ_{k=1}^{N} p_k. Every learner optimizes its local function F_k by minimizing the empirical risk over its local training dataset, D_k^T.
In the case of Federated Average (FedAvg), the contribution value p_k is equal to the size of the local training dataset, p_k = |D_k^T|. However, in the case of Performance Weighting the contribution value is estimated by evaluating the performance of a learner's local model against a validation dataset. Similar to the distributed validation weighting scheme (DVW) proposed in [241], we assign a performance score to the local model of each learner by evaluating it against the validation dataset of every other learner. In particular, every learner splits its local dataset into two disjoint datasets, a training and a validation dataset, and reserves the validation dataset, D_k^V, throughout the federated execution for evaluating other learners' models. Note that in FedAvg the learners train with all available examples, meaning D_k^V = ∅. The size of the available training dataset is smaller for Performance Weighting compared to FedAvg, since we need to reserve a validation dataset, D_k^V ≠ ∅, for model evaluation, which never becomes part of training.
Because a federation may consist of learners with diverse amounts of data, and the number of training examples per class may not be balanced, the validation dataset is assembled through stratified sampling. The validation dataset needs to accurately represent the training data distribution of all learners: if it were generated by randomly sampling examples, the prediction tasks (e.g., classes in classification, score ranges in regression) could be over- or under-represented and would not reflect the underlying training distribution of the learner. To circumvent this, every learner constructs its local validation dataset by sampling 5% of the training examples from each class, as sketched below.
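For illustration, a minimal NumPy sketch of this per-class reservation follows; the 5% fraction comes from the text, while the array-based dataset representation and function name are our assumptions:

    import numpy as np

    def split_train_validation(x, y, val_fraction=0.05, seed=42):
        # Reserve `val_fraction` of the examples of every class for the local
        # validation dataset D_k^V; the rest remain in the training set D_k^T.
        rng = np.random.default_rng(seed)
        val_mask = np.zeros(len(y), dtype=bool)
        for cls in np.unique(y):
            cls_idx = np.flatnonzero(y == cls)
            rng.shuffle(cls_idx)
            k = max(1, int(len(cls_idx) * val_fraction))  # at least one example
            val_mask[cls_idx[:k]] = True
        return (x[~val_mask], y[~val_mask]), (x[val_mask], y[val_mask])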
Algorithm 5 Performance Weighting.

Controller():
    for t = 0, ..., T − 1 do
        for each learner k ∈ N do
            w_k = ClientOpt(w_c, ϵ)
            p_k = Eval(w_k)
        w_c = Σ_{k=1}^{N} (p_k / P) w_k, with P = Σ_{k=1}^{N} p_k

ClientOpt(w_t, ϵ):
    B ← Split ϵ ∗ D_k^T in batches of size β
    for b ∈ B do
        w_{t+1} = w_t − η ∇F_k(w_t; b)
    Return w_{t+1}

Eval(w):
    CM = 0_{C,C}   {Confusion matrix C×C}
    for each learner k ∈ N in parallel do
        CM = CM + Evaluator_k(w)
    Score = fn(CM)
    Return Score
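The aggregation step in the Controller procedure reduces to a score-weighted average of the local models. The following is a minimal NumPy sketch of that step, under the simplifying assumption that each local model is flattened into a single weight vector (function and variable names are illustrative):

    import numpy as np

    def aggregate(local_models, scores):
        # Community model: w_c = sum_k (p_k / P) * w_k, with P = sum_k p_k.
        weights = np.asarray(scores, dtype=float)
        weights = weights / weights.sum()  # normalized contribution values p_k / P
        return sum(w * m for w, m in zip(weights, local_models))

Setting every score to the local training dataset size, p_k = |D_k^T|, recovers plain FedAvg from the same routine; Performance Weighting differs only in how the scores are produced.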
Execution Pipeline. At the start of the federation, all learners receive from the controller the origi-
nal model state and train on their local dataset for an assigned number of iterations (see Controller
and ClientOpt procedures in Algorithm 5; parameter ϵ denotes local epochs). Upon completing their lo-
cal training, learners send their model back to the controller and request a community update. When the
controller receives the local model (w_k) from all learners (i.e., synchronous execution), it aggregates all local models to compute the new global (community) model (w_c) and sends it back to the learners to continue
training. When Performance Weighting is applied, prior to aggregating the local models, the controller
evaluates every local model against the validation dataset of every participating learner to determine the
performance weighting value of each individual model (see Eval procedure in Algorithm 5). To achieve
this, the controller sends the local model of each learner to the evaluator service of every other learner
and accumulates the respective evaluation metrics from all services. We do not evaluate the performance
[Figure 5.11 depicts the four-step pipeline across Learners 1..N (each holding a training and a validation dataset) and the Federation Controller: (1) learners send their local models to the controller; (2) the controller sends the local models for evaluation across the distributed validation dataset; (3) the controller computes a performance score for every local model; (4) the controller aggregates the local models and computes the community model.]
Figure 5.11: Execution pipeline of the performance weighting aggregation scheme. Initially, learners train
locally on their local dataset, and the controller receives the trained models and sends them for evaluation
to the evaluation service of every participating learner. The controller aggregates all models based on
their performance scores and computes the new community model. With the computation of the new
community model, the new federation round begins.
of the global model but rather each local model individually, since our goal is to measure the degree of the
corruption (if any) and assign a performance score that can be used during global aggregation. Figure 5.11
illustrates the evaluation pipeline. Currently, we focus only on classification tasks (not regression) and
the controller combines the individual confusion matrices into a cumulative confusion matrix. We also
assume that every learner truthfully evaluates the received models. Thus, by analogy to the “curious but
honest” class of users [61], we can say we operate in the regime of “forgetful but honest” participants.
Performance Scores. Once the final matrix is accumulated, the controller computes a range of classification metrics (see fn(CM) in Algorithm 5) [74]. We are currently evaluating Micro- and Macro-Average Accuracy, as well as Geometric Mean. Specifically, Micro-Accuracy is defined as DVW_µacc = TP_C / #Examples, with TP_C denoting the total number of true positives over the validation examples of all classes C. Macro-Accuracy is defined as DVW_Macc = (Σ_{i=1}^{C} a_i) / C, with a_i denoting the accuracy of class i, which is equal to a_i = TP_i / m_i, with TP_i and m_i being the total number of true positives and validation examples of class i, respectively. Similarly, for multi-class evaluation, the Geometric Mean is computed as the geometric average of the partial accuracies of each class, DVW_GMean = (Π_{i=1}^{C} a_i)^{1/C}. In cases where a learner has no training examples for a particular class, the accuracy value for this class is going to be 0 and, as a result, the Geometric Mean value will also be 0. To account for these learning scenarios, we add a small arbitrary correction value, ε = 0.001. For instance, if for three classes {0, 1, 2} a classifier has respective accuracies {0.76, 0.84, 0}, then its Geometric Mean value will be equal to (0.76 ∗ 0.84 ∗ 0.001)^{1/3} = 0.086.
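All three scores can be read directly off the cumulative confusion matrix. The following is a minimal NumPy sketch, assuming rows hold the true classes and columns the predicted classes (the orientation and function name are our illustrative choices; the ε-correction for zero-accuracy classes follows the text):

    import numpy as np

    def dvw_scores(cm, eps=1e-3):
        # cm: cumulative C x C confusion matrix (rows: true, columns: predicted).
        tp = np.diag(cm).astype(float)              # true positives per class
        a = tp / np.maximum(cm.sum(axis=1), 1)      # a_i = TP_i / m_i
        micro = tp.sum() / cm.sum()                 # TP_C / #Examples
        macro = a.mean()                            # (sum_i a_i) / C
        gmean = np.prod(np.where(a > 0, a, eps)) ** (1.0 / len(a))
        return micro, macro, gmean

    cm = np.array([[76, 24, 0], [16, 84, 0], [50, 50, 0]])
    print(dvw_scores(cm))  # GMean = (0.76 * 0.84 * 0.001)**(1/3), roughly 0.086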
5.3.4 Performance Scoring Evaluation
We explore two different modes of data corruption, uniform label shuffling and targeted label flipping,
similar to the works of [150, 252]. In the case of label shuffling, the labels of the local dataset of a learner
are randomly scrambled, while in targeted label flipping a learner flips the label of examples that corre-
spond to a specific source class to a target class. The goal of both corruptions (attacks) is to tamper with the
training process of the federation. In our learning environments, we poison both training and validation examples and we evaluate the effectiveness of the attacks on IID data distributions over both homogeneous (equally assigned, Uniform) and heterogeneous (right-skewed, PowerLaw) amounts of data across learners. The success of each attack is measured by the top-1 accuracy of the community model on the test set. We also analyze the weighting value of each learner in the federation when the
controller aggregates the local models using our performance weighting scheme. In our work, we explore
the robustness of the community model against the two attack modes on the CIFAR-10 domain over an
increasing number of corrupted learners. In our experiments, we consider 10 learners in total.§
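For reference, the following is a minimal NumPy sketch of how the two corruptions can be applied to a corrupted learner's label vector (function names and defaults are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def shuffle_labels(y, num_classes=10):
        # Uniform label shuffling: every label is redrawn from U{0, num_classes-1}.
        return rng.integers(0, num_classes, size=len(y))

    def flip_labels(y, source=0, target=2):
        # Targeted label flipping: relabel all `source`-class examples
        # (e.g., airplanes, 0) as the `target` class (e.g., birds, 2).
        y = np.asarray(y).copy()
        y[y == source] = target
        return y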
Uniform Label Shuffling. For every corrupted learner, we shuffle the labels of its local training dataset based on a discrete uniform distribution, U{0,9}. For the Uniform & IID learning environment, we investigate five different degrees of corruption, i.e., {1,3,5,6,8} corrupted learners or {10,30,50,60,80}% of global data corruption. In the case of the PowerLaw & IID environment, due to the data amount heterogeneity, we corrupt learners starting from the head of the distribution and going rightwards; the head owns the majority of the training examples.
§ In the Uniform case every learner holds 500 examples from each class, out of the 10 classes. In the PowerLaw case, the number of samples is distributed as: {G1: 16964, G2: 11314, G3: 7537, G4: 5023, G5: 3348, G6: 2232, G7: 1488, G8: 992, G9: 661, G10: 441}. Gi is an abbreviation for GPU and i its index.
Figure 5.12: Federated training policies convergence (test top-1 accuracy per federation round) for different data corruption environments. The stacked bar chart inset shows each environment's data distribution; hatched bars represent corruption. (a) Label Shuffling (k=5 for Uniform, k=3 for PowerLaw); (b) Targeted Label Flipping: airplanes → birds (k=6 for Uniform, k=3 for PowerLaw). Policies shown: FedAvg (x10 learners – no corruption), k-Corrupted Learners Removed, FedAvg, DVW-MicroAccuracy, DVW-MacroAccuracy, DVW-GMean.
We consider only three corruption levels, i.e., {1,3,5} corrupted learners or {33%, 71%, 88%} of global data corruption.
Figure 5.12a shows the convergence of each federated training policy when 5 and 3 learners are cor-
rupted in the Uniform and PowerLaw environments, respectively. Figure 5.13a presents the best accuracy
achieved by each policy over an increasing number of corrupted learners, i.e., increasing degrees of cor-
ruption, and Figure 5.14a shows the performance score of each learner during global aggregation.
In the Uniform learning environments, our proposed Geometric Mean scheme (DVW-GMean) out-
performs the standard Federated Average (FedAvg) by a large margin. FedAvg fails to learn robustly as
levels of corruption increase. DVW-GMean also outperforms the alternative performance weighting schemes, DVW-MicroAccuracy and DVW-MacroAccuracy. Note that “FedAvg (x10 learners – no corruption)” indicates the case without any corruption, and is shown to illustrate the overall impact of data corruption.
The policy named “k-Corrupted Learners Removed” represents the case where all corrupted learners are
removed at the beginning of the federation and training occurs only over those learners whose datasets
are unaffected [252, 269, 295]. Geometric Mean approaches the standard “FedAvg (x5 learners – no corrup-
tion)” quickly and converges after 100 federation rounds, which further demonstrates the resilience of our
aggregation scheme to label shuffling attacks. We attribute the similar performance between GMean and
“k-Corrupted Learners Removed” to the efficiency of GMean in progressively downgrading the contribution value of the corrupted learners in the federation as training progresses (see Figure 5.14a).
In the case of the PowerLaw & IID environment, the difference between the proposed DVW-GMean and FedAvg is even greater. Standard FedAvg, even though it performs well initially, quickly degrades as training progresses to a test accuracy close to 0.25, which is only marginally better than choosing classes
at random. In contrast, all DVW schemes quickly converge to the level of “FedAvg (x5 learners – no
corruption)”, with GMean exhibiting a better performance compared to its alternatives.
Figure 5.13: Federated training policies performance (best test top-1 accuracy) over incremental degrees of corruption (increasing number of corrupted learners), for the Uniform and PowerLaw environments. (a) Label Shuffling; (b) Targeted Label Flipping (airplanes → birds). Policies shown: FedAvg, k-Corrupted Learners Removed, DVW-MicroAccuracy, DVW-MacroAccuracy, DVW-GMean.
Targeted Label Flipping. In our learning environments, we inspect the performance of the federation when flipping airplane examples to birds (0 → 2) over an increasing number of corrupted learners. We selected this corruption since airplanes and birds are relatively close in the feature space and hence the attack is harder to detect. For the Uniform & IID environment, we investigate five different degrees of corruption, i.e., {1,3,5,6,8} corrupted learners or {10%,30%,50%,60%,80%} class-degree corruption. In the case of PowerLaw & IID we corrupt learners starting from the head of the data distribution and consider three corruption levels, i.e., {1,3,5} corrupted learners or {34%,72%,88%} class-degree corruption.
Figure 5.12b shows the convergence of each federated training policy when all the examples of the
airplane class are corrupted for 6 learners in the Uniform and 3 learners in the PowerLaw environments.
Figure 5.13b presents the best accuracy achieved by each policy over an increasing number of corrupted
learners and Figure 5.14b shows the performance score of each learner during global aggregation.
In the Uniform environment, Figure 5.12b, our proposed Geometric Mean scheme (DVW-GMean) out-
performs all alternative schemes, including the standard Federated Average (FedAvg). Similar to the cases
discussed earlier, DVW-GMean converges to the level represented by “FedAvg (x5 learners – no corrup-
tion)”, indicating a very good resilience to the corruption attack. The only visible effect of corruption is
a slightly slower convergence. While “FedAvg (x5 learners – no corruption)” levels off after about 100
federation rounds, DVW-GMean needs about 200 rounds to achieve the same accuracy level. At moderate
levels of corruption, e.g., between 30% to 60% class-degree corruption, (Figure 5.13b) DVW-GMean has
better or comparable performance as “k-Corrupted Learners Removed”. However, as corruption becomes
more extreme, e.g., 80% class-degree corruption, almost all training policies perform similarly.
In the more challenging PowerLaw environment, both FedAvg and “k-Corrupted Learners Removed”
baselines have suboptimal performance compared to all tested DVW schemes. Both baselines reach a sat-
uration learning point after roughly 100 rounds of training while all DVW schemes are able to continue
learning with the DVW-GMean reaching a learning performance that is close to the (clean) non-corrupted
environment. We attribute the degraded performance of the “k-Corrupted Learners Removed” scheme to the fact that the corrupted learners being removed from the federation are the ones holding the majority of the training samples; their removal therefore penalizes the performance of the federation due to the limited number of remaining training samples. In contrast, none of the DVW schemes discards the updates of the corrupted learners; instead, they downgrade their overall contribution value in the federation. This
phenomenon demonstrates that our scheme is more beneficial compared to detecting and excluding cor-
rupted sources [252, 269, 295] since it can leverage the information stored in the corrupted sources to train
a model of increased robustness and performance. Compared to DVW-Micro and DVW-Macro accuracy,
our DVW-GMean scheme performs the best, even at extreme levels of corruption. As also shown in Fig-
ure 5.14b, DVW-GMean is able to accurately detect corrupted learners throughout training and assign them a scoring value that is half of the value assigned to their non-corrupted counterparts.
Figure 5.14: Learners' performance score (contribution value) based on DVW-GMean over different data corruption environments, per federation round, for learners G1–G10. (a) Label Shuffling; (b) Targeted Label Flipping.
Chapter 6
Data Harmonization for Federated Learning
Data silos participating in federated learning environments often have different schemata, data formats,
data values, and access patterns. However, most of the existing federated learning systems [149] assume
that the data semantics across all participating silos are already harmonized, a belief that is not always
true in real-world settings. This kind of data harmonization challenge is usually observed in Federated
Database Management Systems [98, 231], Data Integration [60], and Data Exchange [70].
Even though many different systems have recently emerged that can potentially provide a solution to
this problem, they are not readily applicable to federated learning due to the unique data ownership, data
security, and data privacy challenges arising in federated learning environments. In particular, large-scale
general-purpose data analytics systems (e.g., Apache Hadoop, Spark, NoSQL engines) often offer little sup-
port for efficiently integrating multiple external sources and therefore cannot be used out-of-the-box for
federated learning tasks. Moreover, traditional virtual data integration approaches (e.g., Mediation Sys-
tems [18]) assume that all the required data and their schemas are accessible by the integration engine, an
assumption that violates the data security and privacy requirement imposed by federated learning. Ad-
ditionally, Federated Database Management Systems [98] assume that the participating data sources/silos
can cooperate during the federated query execution and have global data access, which contradicts the
data silo independence required by federated learning environments. Similarly, a more recently refined version of FDBMS, named Polystores [65], assumes that the federated system query engine has ownership
over the data sources and therefore can perform data migration across different execution engines to op-
timize query workloads, an assumption which again does not adhere to the data privacy requirements of
federated learning.
When taking all these system limitations into account, we see that there is a clear need for new Fed-
erated Data Learning and Management Systems that can provide seamless support for secure and private
end-to-end integrated federated training pipelines. In this chapter, we start by discussing the data inte-
gration problem observed in non-federated learning settings and conclude with our design for a Feder-
ated Learning INTegration (FLINT) architecture that combines data integration, imputation, and federated
learning as core components.
6.1 Principled Data Integration
Current machine learning applications typically require a preliminary ad-hoc data preparation process in
which the training data is cleaned, reconciled, and transformed in order to be fed to the machine learn-
ing model for training and inference. However, when the required model input data is generated from
a variety of data sources, e.g., multisensor, environmental, and clinical, meaningfully integrating the ex-
pected variety of data types, formats, and schemata is a significant challenge. To harmonize the vari-
ous data discrepancies, we first need to reach a clear understanding of the data, how they are organized,
where they are located across diverse sources, and the meaning of entities, relationships, and data values.
This common understanding is achieved by mapping values to standard vocabularies/taxonomies (e.g.,
RxNORM for drugs, or the Gene Ontology [17] for genomic annotations) and by mapping the schemas
of each of the sources/datasets into a common target schema (e.g., NIH’s Commons Data Element efforts,
www.nlm.nih.gov/cde/). This common/domain/mediated schema represents an agreed-upon view of the
application domain for the purposes of the intended analyses. To address this variety of data formats in
a scalable and principled way, we need to use state-of-the-art schema mappings in data integration and
exchange [60, 15, 93, 145, 70].
Here, our focus is on the virtual data integration, mediation, approach. The salient features of virtual
data integration are: (1) data resides at and is maintained at the original data sources in their original
schemas and formats; no changes are required to any of the data sources, (2) all that is required for inte-
gration is knowledge of the information content and access to the data source via an API, for example, a
JDBC connection or a web service, (3) the mediator defines a common schema/ontology and declarative
formal mappings between the source and common schema, and (4) integrated data access is provided via
the mediator software that can reside on any server and contacts the various data sources at query time
to obtain the information requested.
When comparing virtual data integration to the data exchange approach, in a data exchange setting we
need to materialize a finite target instance that best reflects the given source instance [70, 69]. In virtual
data integration, no data exchange is required. Moreover, as opposed to virtual data integration settings,
in a data exchange setting, it may not be feasible to couple applications together in a manner that data
may be retrieved and shared on-demand at query time. As a result, queries over the domain schema may
have to be answered using the materialized target instance alone, without reference to the original sources.
This may lead to inconsistent or outdated (stale) results since source updates are not reflected in the query
results.
6.1.1 Logical Integration through Schema Mappings
Integrating data from multiple diverse sources requires the construction of a homogeneous schema, the domain schema, which can encapsulate the data discrepancies across heterogeneous data sources and enable
reasoning for meaningful data analysis. Through the use of formal schema mappings and query rewriting
techniques, we can automatically generate and optimize the mapping and normalization workflow that
transforms the incoming data into the schema of the homogeneous data model. This approach facilitates
maintenance because only a few mapping rules need to be specified, as opposed to maintaining a complex
code base.
6.1.1.1 Domain Schema
Each data source follows different semantics for describing its data values and types. Bridging these se-
mantic heterogeneities requires a solid understanding of how schema elements from one source are related
to schema elements of the other sources. To manage this complexity, a unified domain schema (also called
the target, mediated, or global schema) is defined, and the schemata of the sources are mapped to the
domain schema using logical rules, known as schema mappings [59, 93, 145]. The domain schema does
not need to include every element from each source but rather those elements that are necessary to an-
swer specific types of queries. To enable this, a translation scheme is required whereby data elements from
different source schemata are translated/related to data elements from the domain schema. A simple trans-
lation would be unit mappings, such as weight in pounds vs. kilograms, or smoking intensity in cigarettes
per day vs. packs-year. This process is most straightforward when source data have been collected us-
ing common instruments. Overall, the domain schema provides a degree of freedom in the design of the
system and strikes a balance between simplicity (just enough to describe the contents of the sources) and
generality (the ability to accommodate new sources). As additional sources arrive, the domain schema can
be easily extended to include their semantic descriptions and support new query types based on the new
source definitions.
6.1.1.2 Schema Mappings
The schema mappings describe the relationships between the sources’ schemata and the domain schema.
These mappings are described through a set of declarative logical rules. As opposed to the manually-
defined, ad-hoc, programmatic workflows that are often used to integrate datasets, declarative mappings
are easier to generate, maintain, and provide opportunities for automatic learning and optimization. This
logical framework enables the definition of precise, deep annotations and characterization of the hetero-
geneous data semantics, and supports the enhancement of the data through reasoning over associated
ontologies and knowledge resources developed by the scientific community. The data integration system,
i.e., the mediator system, uses a set of declarative rules in order to define how the predicates of the domain
schema relate to predicates of the sources schema.
There are two main ways of specifying the logical rules between the domain and the source schemata:
Global-as-View (GAV) [82, 75, 260, 5] and Local-as-View (LAV) [145, 93]. In the GAV approach, each
domain term is defined by a logical formula (a view in database parlance) involving source terms. The GAV
approach takes the perspective that the domain schema is defined by compositions of source predicates,
formally:
∀X, Y . Φ_s(X, Y) → ∃Z . g(X, Z)

where Φ_s is a conjunctive formula over source predicates, g is a single domain predicate, and X, Y, and Z
are vectors of variables. Note that not all the attributes from the sources need to be modeled (the variables
Y are projected out), and that some of the attributes in the global predicate (the variables Z in g) may not
be produced by the sources in a given rule (the mediator may be able to infer them from other GAV rules
and domain constraints).
Conversely, in the LAV approach, each source term is defined by a logical formula (view) over the
mediated predicates, formally:
∀X, Y . s(X, Y) → ∃Z . Ψ_G(X, Z)

where s is a single source predicate and Ψ_G is a conjunctive formula over domain (global) predicates and
constraints. The perspective in LAV is that each source predicate is given meaning by defining it using
well-understood terms from the domain vocabulary.
The GAV and LAV approaches have complementary properties (see [5, 93] for detailed comparison).
The query rewriting problem is simpler in GAV since it amounts to the unfolding of the source descriptions.
The critical advantage of LAV is its scalability in terms of the addition of new sources – since each source is
described independently, new sources can be incorporated rapidly without having to modify pre-existing
definitions. These approaches can be generalized to GLAV mappings, also known as source-to-target tuple-generating dependencies (st-tgds) [69], where both the antecedent and the consequent of the mapping rules are (conjunctive) formulas, formally:

∀X, Y . Φ_s(X, Y) → ∃Z . Ψ_G(X, Z)

where the antecedent (Φ_s) is a conjunctive formula over the source schemata (S) predicates, and the consequent (Ψ_G) is a conjunctive formula over the domain schema (G) predicates.
6.1.1.3 Query Rewriting
The core technique to answer queries over schema mappings is Query Rewriting [93, 145], which trans-
lates a query expressed over the domain schema into a query expressed over the sources’ schemata. Given
a SQL query, the distributed query evaluation engine constructs a distributed query evaluation plan. For
example, the SchizConnect mediator [11] built as a distributed execution workflow with relational algebra
140
operations building upon the Open Grid Services Architecture (OGSA) Distributed Access and Integration
(DAI), and Distributed Query Processing (DQP) projects [13, 89]. The query optimizer partitions the work-
flow across multiple sources pushing as much of the evaluation of subqueries to remote sources, since this
often results in more efficient evaluation plans. In general, the query optimizer of the mediator performs
a variety of query optimization techniques [88, 110, 248] on the logical and physical query execution plan,
such as grouping common subexpressions, pushing selections early into the evaluation tree, and using
data statistics to select optimal plans.
6.1.2 Data Imputation
A typical step while preparing the machine learning model input data pipeline is data imputation, i.e., filling
missing values with likely values generated by statistical processes. The goal of imputation is to preserve
the structure of the data, not to correctly guess the true value of the missing points. With real-world data, we
often do not have and never will have "true" values. With imputation, we want to preserve the covariance
(and thus correlation) between features. Imputation is not prediction [261]. However, as we will see in the
following sections, imputation improves the performance of the machine learning methods.
6.1.2.1 Generating Missing Values
Intentionally missing data are planned by the data collector (e.g., excluding samples on purpose), while
unintentionally missing data are not under the control of the data collector (e.g., error in data transmission,
subject dropout) [261]. To artificially generate missing values from the complete data records, three
prominent mechanisms exist: MCAR, MAR, MNAR. When the probability of being missing is independent
of the observed and unobserved data (i.e., missingness is unrelated to the data), the data are said to be
missing completely at random (MCAR). The missing at random (MAR) mechanism is a more general and
realistic missingness case than MCAR, since data missingness considers the complete data observations
and allows for missingness patterns to be dependent (conditioned) on known variables [28]. However, if
neither MCAR nor MAR holds, we have missing not at random (MNAR), which means that the probability
of missing varies for unknown reasons, i.e., it depends on variables that are not measured.
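For illustration, a minimal NumPy sketch of artificially generating MCAR missingness follows; each entry is masked independently of any data value, which is exactly what distinguishes MCAR from MAR and MNAR (the function name and the 20% default rate are our choices):

    import numpy as np

    rng = np.random.default_rng(0)

    def make_mcar(x, miss_rate=0.2):
        # MCAR: mask each entry independently with probability `miss_rate`,
        # regardless of the observed and unobserved values.
        x = np.asarray(x, dtype=float).copy()
        mask = rng.random(x.shape) < miss_rate
        x[mask] = np.nan
        return x, mask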
6.1.2.2 Handling Missing Values
Many different approaches have been proposed to handle missing values, ranging from deleting records
with missing values to replacing missing values with imputed values by applying simple [62], multi-
ple [261] imputation, or machine learning (e.g., kNNimpute [253], MissForest [239]) and deep learning-
based imputation methods (e.g., Generative Adversarial Imputation Nets [289]). Below we discuss the internal mechanics of some typical classes of imputation approaches and specific methods. Please see the survey paper [158] for a thorough discussion of different methods in the data imputation literature.
Deletion. The deletion of records with incomplete cases is known as listwise deletion or complete-case analysis. Even though the inclination to delete the missing data is understandable [261] and has been acknowledged by Orchard and Woodbury [193] as "Obviously the best way to treat missing data is not to have them.", it may introduce statistical bias [53, 159] and even result in losing important statistical information.
Simple Imputation. Simple imputation approaches replace the missing values of a quantitative or qualitative attribute using information from the complete data records [294]. Imputation methods such as mode, mean, median, or random (select a random value from the complete data values) can be used as an easy reference technique to provide fast and simple imputed values, but they underestimate variance, can compromise the relationships between variables, and bias summary statistics.
Regression Imputation. Regression imputation incorporates knowledge of other variables to produce
imputations. The first step involves building a model from the observed data. Predictions for the incom-
plete cases are then calculated under the fitted model and serve as replacements for the missing data [261].
An expectation-maximization algorithm can be used to find the estimates of the model parameters. How-
ever, regression imputation artificially strengthens the relations in the data, with the correlations being
biased upwards and variability being underestimated. A refinement of regression imputation, which at-
tempts to address correlation bias by adding noise to the predictions, is stochastic regression imputation.
Multiple Imputation. Multiple imputation differs from single imputation methods because missing
data are filled in many times, with many different plausible values estimated for each missing value [148].
Multiple imputation creates a set of M imputed datasets out of a dataset with missing values. The M
datasets are then pooled into a final point estimate plus a standard error by pooling rules (“Rubin’s rules”).
The number of imputed datasets (M) can be relatively high, somewhere in the range of 10–20, but even with
a small number of imputed datasets (e.g., 2-10), multiple imputation can create unbiased estimates with
correct confidence intervals. Multiple imputation is performed in three phases: (1) generate M imputed
datasets (M > 1), (2) analyze the M generated datasets to estimate the parameters of interest from each imputed dataset, and (3) pool the M parameter estimates into one estimate.
In phase (1), multiple imputation imputes the missing entries based on statistical characteristics of the
data, e.g., a set of regression models each predicting a variable with missingness using all other variables
and obtaining a distribution for the expected value for each missing value to sample from it. Once the
imputed data sets are created, in phase 2 we proceed by analyzing every imputed dataset with methods
that we would otherwise apply on datasets with no missing values to answer specific questions. In phase 3,
we compute the pooled estimated statistic by combining the estimates of interest (e.g., the mean difference
in outcome between treatment and control groups) from all imputed datasets using standard rules [214];
for instance, a simple approach to compute the pooled estimated statistic is by taking the average of the
statistic from each of the M imputed datasets.
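A minimal sketch of the three phases follows, using scikit-learn's IterativeImputer with posterior sampling as one possible phase-1 engine (this is an illustrative instantiation, not the specific procedure of [148] or [214]); the pooled statistic here is simply a column mean:

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    def pooled_mean(x_missing, column=0, m=10):
        estimates = []
        for seed in range(m):
            # Phase 1: one stochastic imputation per seed (sample_posterior=True
            # draws imputed values instead of using point predictions).
            completed = IterativeImputer(sample_posterior=True,
                                         random_state=seed).fit_transform(x_missing)
            # Phase 2: compute the statistic of interest on the completed dataset.
            estimates.append(completed[:, column].mean())
        # Phase 3: pool the M estimates; here, by simple averaging.
        return float(np.mean(estimates))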
KNNimpute. The KNNimpute algorithm [253] imputes the missing value of a given variable by finding
its k nearest observed variables and taking a weighted mean of these k variables for imputation. The
weights for computing the weighted mean depend on the distance of the imputing variable from the rest
of the variables; usually, the Euclidean distance is used as the distance metric. When using KNNimpute
the choice of the tuning parameter k can have a large effect on the performance of the imputation and the
associated computational cost; a reasonable k value can be estimated through cross-validation.
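For reference, scikit-learn ships a KNN-based imputer that applies the same weighted-mean idea, although it searches for the k nearest samples rather than the k nearest variables of the original formulation; a minimal sketch:

    import numpy as np
    from sklearn.impute import KNNImputer

    x = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])
    # Each missing entry becomes the distance-weighted mean of its k nearest
    # observed neighbors; k is a tuning parameter (e.g., set via cross-validation).
    print(KNNImputer(n_neighbors=2, weights="distance").fit_transform(x))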
MissForest. Compared to imputation methods that are restricted to only one type of variable (continuous or categorical), the MissForest algorithm [239] can cope with multivariate data consisting of continuous and categorical variables simultaneously, not separately. Through this mixed-type imputation approach, MissForest can better capture the relations between different variable types. At its core, MissForest is a non-parametric iterative imputation method based on a random forest that makes as few assumptions as possible about the structural aspects of the data.
GAIN. For imputing missing data with deep learning models, a generative model termed Generative Adversarial Imputation Nets (GAIN) was recently proposed [289]. The GAIN model is a generalized
version of the GAN architecture adapted to deal with the unique characteristics of the data imputation
problem. As described in the original work, the training pipeline of the GAIN model starts with the gen-
erator (G), which observes components of a real data vector, imputes the missing components conditioned
on what is actually observed, and outputs a completed vector. The discriminator (D) takes the completed
vector and attempts to determine which components were actually observed and which were imputed.
To ensure that D forces G to learn the desired distribution, D is provided with additional information in
the form of a hint vector. The hint reveals to D partial information about the missingness of the original
sample, which is used by D to focus its attention on the imputation quality of particular components. This
hint ensures that G does in fact learn to generate according to the true data distribution.
6.2 Logical Data Integration
To illustrate the methodology of logical data integration, we will describe our work on the SparkMedia-
tor [242], which builds upon the SchizConnect Mediator [10].
6.2.1 Data Mediation for Schizophrenia Neuroimaging
The study of complex diseases, such as Schizophrenia, requires integrating data from multiple heteroge-
neous data sources [258] to obtain sufficient subjects to reach statistically valid conclusions. Each source
features a diverse combination of data models, data types, and values, rendering their harmonization chal-
lenging. To handle this multi-source heterogeneity and variety in querying languages and data schemata,
we must introduce a single query access point that will operate over a unified data model and distribute
the query execution workload to the respective data sources. Over the last few years, many multi-site
consortia have been developed to address this issue by introducing domain-specific standardization tech-
niques. For example, the Human Imaging Database (HID) [125] of the Functional Biomedical Informatics
Research Network (FBIRN) [86] consortium is a federated database where each site follows the same stan-
dard schema. Other prominent consortia within the neuroscience domain that follow this practice are the
Mind Clinical Imaging Consortium [126] and the ENIGMA Network [251]. We next describe SchizCon-
nect [10], a mediator system for the schizophrenia neuroimaging domain.
SchizConnect Participating Sources. The SchizConnect data are distributed across several publicly
available data sources. The FBIRN Phase II @ UCI [86] provides cross-sectional multisite data from 251
Project(source:STRING, study:STRING, projectid:NUMBER, description:STRING)
Subject(source:STRING, study:STRING, subjectid:STRING, age:NUMBER, sex:STRING, dx:STRING)
In_Project(source:STRING, study:STRING, subjectid:STRING, projectid:NUMBER)
Imaging_Protocol(source:STRING, study:STRING, subjectid:STRING, szc_protocol_hier:STRING,
img_date:DATE, notes:STRING, datauri:STRING, maker:STRING,
model:STRING, field_strength:NUMBER, site:STRING)
Cognitive_Assessment(source:STRING, study:STRING, site:STRING, subjectid:STRING,
assessment:STRING, assessment_description:STRING)
Clinical_Assessment(source:STRING, study: STRING, site:STRING, subjectid: STRING,
assessment: STRING, assessment_description:STRING)
Figure 6.1: SchizConnect Domain Model (selected predicates).
subjects including structural and functional magnetic resonance imaging scans. The FBIRN repository is
hosted on the PostgreSQL DBMS of the HID system [125]. The NUSDAST @ XNAT Central [267] source
contains data from 368 subjects together with collected sMRI scans. These data are stored inside the eX-
tensible Neuroimaging Archiving Toolkit (XNAT) [175] public repository which provides a REST API that
accepts queries only in the XNAT-specific XML format, and returns XML data. The COBRE & MCICShare
@ COINS Data Exchange [226] contains data related to 198 subjects from the COBRE project and 212 sub-
jects from the MCICShare project. Due to the dynamic data packaging architecture of the COINS system,
which restricts the immediate return of any data to a submitted query, the COINS data are replicated in a
dedicated MySQL DBMS at USC/ISI.
SchizConnect Domain Schema. The SchizConnect domain schema relies on a minimalistic, incremental design. As new sources with neuroimaging data are found, the Schiz-
Connect domain schema can also be extended to incorporate these sources. The current SchizConnect
domain schema follows the relational model shown in Figure 6.1. The first attribute in all the predicates is
source, which indicates the data provider. Currently, the system has four sources: HID, XNAT, COINS, and
Mappings. Attributes are followed by the data types currently supported by the Mediator: string, number,
and date.
The Project predicate contains the name and the description of the studies in the data sources. The
Subject predicate contains the demographic and diagnostic information for individual participants. The
In_Project predicate indicates which studies and projects each subject has participated in. The Imaging_Protocol predicate refers to MRI records for a particular subject along with various scanner metadata. The Cognitive_Assessment predicate retains information for the neuropsychological assessments of the subjects, while the Clinical_Assessment predicate holds the assessments for different symptoms in the subjects.
SchizConnect Schema Mappings. In SchizConnect the schema mappings are defined as Global-As-
View (GAV) rules, which are st-tgds with a single predicate in the consequent. Figure 6.2 illustrates some
schema mappings currently defined in the domain schema of SchizConnect. In the figure, the domain
predicates are typed in bold (e.g., subject_age) while the source predicates are in italics (e.g., HIDPSQLRe-
source_nc_subjexperiment). The first rule indicates that the subject domain predicate provides data from
the UCI_HID data source and is relevant to the participating subjects of the current study. This rule also
shows that a domain predicate can be constructed through a join operation with other domain (auxiliary)
predicates, in this case from subject_age and subject_sex.
Rules defined with the same consequent denote a union operation. For instance, the subject domain
predicate in the presented schema is obtained through a union operation of the first two rules, each denot-
ing a different data source (UCI_HID and XNAT_CENTRAL, respectively).∗ Constraints can also appear
in the body (antecedent) of the rules (e.g., ’nc_experiment_unique_id = 9610’). The query engine pushes
these selections to the sources. An important feature of the presented mediation language is the declaration of conditional expressions in a rule’s antecedent. The fourth rule shows a conditional equality applied to a relational predicate (e.g., ’nc_experiment_unique_id = 9610’). The virtual data integration system should be able to handle many such complex conditional expressions, which are in turn pushed down to the designated data sources as selections.
∗ We removed a similar rule for COINS for brevity.
subject ("UCI_HID", "fBIRNPhaseII__0010", SUBJECTID, AGE, SEX, DX) <-
subject_age ("UCI_HID", STUDY, SUBJECTID, AGE) ^ (STUDY = "fBIRNPhaseII__0010") ^
subject_sex ("UCI_HID", STUDY, SUBJECTID, SEX) ^
HIDPSQLResource_nc_fbirn_phaseii_dx_mview (SUBJECTID, DX)
subject_age ("UCI_HID", "fBIRNPhaseII__0010", SUBJECTID, AGE) <-
HIDPSQLResource_nc_subjexperiment (uniqueid1, tableid, owner, modtime, moduser,
nc_experiment_uniqueid, SUBJECTID, nc_researchgroup_uniqueid) ^
(nc_experiment_uniqueid = 9610) ^
HIDPSQLResource_nc_assessmentinteger_mview (nc_assessmentdata_uniqueid, scoreorder,
textvalue, textnormvalue, comments, AGE, datanormvalue, storedassessmentid,
ASSESSMENTID, assessmentname, SCORENAME, scoretype, ISVALIDATED, isranked,
SUBJECTID, site, uniqueid) ^
(SCORENAME = "Age") ^ (ISVALIDATED = "TRUE")
subject_sex ("UCI_HID", "fBIRNPhaseII__0010", SUBJECTID, SEX) <-
HIDPSQLResource_nc_subjexperiment (uniqueid1, tableid, owner, modtime, moduser,
nc_experiment_uniqueid, SUBJECTID, nc_researchgroup_uniqueid) ^
(nc_experiment_uniqueid = 9610) ^
HIDPSQLResource_nc_assessmentvarchar (tableid2, nc_assessmentdata_uniqueid2,
scoreorder2, owner2, modtime2, moduser2, textvalue2, textnormvalue2,
comments2, SEX_SRC, datanormvalue2, storedassessmentid2, ASSESSMENTID2,
SCORENAME2, scoretype2, ISVALIDATED2, isranked2, SUBJECTID, entryid2,
keyerid2, raterid2, classification2, uniqueid2) ^
(SCORENAME2 = "Gender") ^ (ISVALIDATED2 = "TRUE") ^
MappingsMySQLResource_value_mappings (SEX, "UCI_HID", SEX_SRC, id)
imaging_protocol ("UCI_HID", "fBIRNPhaseII__0010", SUBJECTID, SZC_PROTOCOL_HIER,
DATE,NOTES, DATAURI, MAKER, MODEL, FIELD_STRENGTH, SITE, "") <-
HIDPSQLResource_nc_subjexperiment (uniqueid, tableid, owner, modtime, moduser,
nc_experiment_uniqueid, SUBJECTID, nc_researchgroup_uniqueid) ^
(nc_experiment_uniqueid = 9610) ^
HIDPSQLResource_nc_scannersbyscan_mview_phase3 (SUBJECTID, SITE, componentid,
segmentid, SOURCE_PROTOCOL, DATE, nc_colequipment_uniqueid,
SOURCE_MAKE, SOURCE_MODEL, DATAURI, NOTES) ^
MappingsMySQLResource_protocol_mappings (SZC_PROTOCOL_HIER, "UCI_HID",
SOURCE_PROTOCOL, ID1) ^
MappingsMySQLResource_scanner_mappings (MAKER, MODEL, FIELD_STRENGTH,
"UCI_HID", SOURCE_MAKE, SOURCE_MODEL, ID2)
subject ("XNAT_CENTRAL", "NUSDAST", SUBJECT_LABEL, AGE, SEX, DX) <-
XnatSubjectResource_xnat__subjectData (project, SUBJECT_LABEL, AGE, SEX, SRC_DX, QS) ^
MappingsMySQLResource_dx_mappings (DX, "XNAT_CENTRAL", 777, SRC_DX, id, "NUSDAST")
Figure 6.2: SchizConnect Sample Schema Mappings.
Finally, the last rule shows the definition of the imaging_protocol domain predicate. This rule employs
the MappingsMySQLResource data source, which contains mapping tables for normalizing attribute values
retrieved from the corresponding sources. Within the SchizConnect schema mappings collection, many
rules perform this normalization but are omitted from Figure 6.2 due to space constraints. The main reason for
this normalization step is that even attributes originating from the same source expose different values
and codes for the same concept. To resolve any attribute discrepancies, a more sophisticated hierarchical ontology structure is used [259].
SchizConnect Query Rewriting. At runtime, the supporting mediation layer uses the schema mappings to reformulate a query expressed in terms of the domain schema into a set of equivalent executable source-level queries. The result of this reformulation is a logical query plan [59]. Figure 6.3 presents a GAV rewriting example. If, for instance, we would like to select all the participants involved in the current studies accessible by SchizConnect, along with all their associated demographic and diagnostic information, the Mediator will unfold the domain query (top-level query in Figure 6.3) into a union of conjunctive source-level queries. The final rewritten query (bottom-level query in Figure 6.3) is a union of conjunctive queries over the data sources (XNAT_CENTRAL, UCI_HID, COINS). In this specific example, the COINS data source is included twice since it contributes to two different studies.
6.2.2 SparkMediator: Large-Scale Mediated Data Analysis
Even though many different federated data platforms have been proposed to address the various data
integration challenges arising in neuroimaging [86, 125, 126, 251] and other research studies, e.g., ge-
nomics [168], these platforms are limited by the extent to which they can support large-scale federated
data analytics. To bridge this gap, and enable large-scale analytics across heterogeneous data sources, we
leverage the data analytics stack of the Apache Spark framework [290] and introduce a Mediation layer
SELECT * FROM subject
( SELECT 'UCI_HID' as source, 'fBIRNPhaseII__0010' as study, T8.subjectid as subjectid,
    T4.szc_protocol_hier as szc_protocol_hier, T6.date as img_date, T6.description as notes,
    T6.datauri as datauri, T2.maker as maker, T2.model as model, T2.field_strength as field_strength,
    T6.site as site
  FROM MappingsMySQLResource_scanner_mappings T2, MappingsMySQLResource_protocol_mappings T4,
    HIDPSQLResource_nc_scannersbyscan_mview_phase3 T6, HIDPSQLResource_nc_subjexperiment T8
  WHERE T2.source_make=T6.source_make AND T2.source_model=T6.source_model AND T2.source = 'UCI_HID'
    AND T4.source_protocol=T6.source_protocol AND T4.source = 'UCI_HID' AND T6.subjectid=T8.subjectid
    AND T8.nc_experiment_uniqueid = 9610 )
UNION
( SELECT 'XNAT_CENTRAL' as source, 'NUSDAST' as study, T12.SUBJECT_LABEL as subjectid,
    T10.szc_protocol_hier as szc_protocol_hier, T12.SCAN_DATE as img_date, T12.SCAN_ID as notes,
    Concat(T12.IMAGE_ID,'/scans/',T12.SCAN_ID) as datauri, 'SIEMENS' as maker, 'VISION 1.5T' as model,
    1.5 as field_strength, 'WUSTL' as site
  FROM MappingsMySQLResource_protocol_mappings T10, XnatMRSessionResource_xnat__mrSessionData T12
  WHERE T10.source_protocol=T12.SCAN_TYPE AND T10.source = 'NUSDAST' )
UNION
( SELECT 'COINS' as source, 'COBRE' as study, T14.ANONYMIZATION_ID as subjectid,
    T14.szc_protocol_hier as szc_protocol_hier, T14.SCAN_DATE as img_date, 'notes' as notes,
    T14.SERIES_ID as datauri, T14.SCANNER_MANUFACTURER as maker, T14.SCANNER_LABEL as model,
    T14.FIELD_STRENGTH as field_strength, T14.SCANNER_SITE as site
  FROM COINSMySQLResource_series_v T14
  WHERE T14.STUDY_ID = 1139 )
UNION
( SELECT 'COINS' as source, 'MCICShare' as study, T16.ANONYMIZATION_ID as subjectid,
    T16.szc_protocol_hier as szc_protocol_hier, T16.SCAN_DATE as img_date, 'notes' as notes,
    T16.SERIES_ID as datauri, T16.SCANNER_MANUFACTURER as maker,
    T16.SCANNER_LABEL as model, T16.FIELD_STRENGTH as field_strength, T16.SCANNER_SITE as site
  FROM COINSMySQLResource_series_v T16
  WHERE T16.STUDY_ID = 5720 )
Figure 6.3: SchizConnect Query Rewriting.
that allows seamless integration and querying of disparate data sources through a harmonized data model
and a query rewriting component.
Spark Mediator’s importance is twofold: it adds a new virtual integration layer to Spark’s highly-
performant analytics stack, and it extends Spark’s efficiency as a distributed query federation engine. In
our previous experience [242], Spark Mediator is a core component of our data analytics tool stack for
large-scale infrastructures that require executing expensive data analysis operations over a group of het-
erogeneous sources.
Our Spark Mediator middleware (see Figure 6.4) exposes a harmonized virtual schema (i.e., domain
schema) over the data of the remote sources. This unified view allows the data to reside at the remote
sources and be structured under their original schemata. During query execution, the Mediator receives
the submitted query over the domain schema and rewrites it to source-level queries through a set of schema
[Figure 6.4 depicts the Mediator components: a domain query enters the Query Rewrite step, driven by the schema mappings, producing source-level queries and an optimized logical plan; the Spark SQL execution engine (with planning and optimization) evaluates the plan against data sources 1..n and returns the result as a Spark SQL dataframe.]
Figure 6.4: Apache Spark Mediator Diagram.
mappings, and declarative logical rules that relate the sources’ and target/domain schemata. Based on
these schema mappings the Mediator is able to reconcile any semantic discrepancies among the remote
data sources. Upon the completion of query rewriting, the Mediator parses the source-level queries and
generates a logically optimized distributed query evaluation plan which forwards to the Spark SQL engine
for execution. Once the source-level queries are executed the Mediator returns the query result set as a
Spark SQL dataframe. Apart from GAV schema mappings, Spark Mediator also supports query rewriting
for both LAV (aka Local-as-View) [128] and GLAV (aka Global-and-Local-as-View) rules.
6.2.2.1 Optimizing the Spark SQL Federation Engine
Spark SQL provides a convenient API for defining new data sources and performing various source-specific
query optimizations. These optimizations include projections and selections pushdown to the external
querying source. However, the available optimizations are still limited in terms of more expensive opera-
tions such as joining tables that belong to the same remote database. For instance, assume that multiple
151
Spark SQL tables are defined with the same external database origin. When a join operation is applied
over them, Spark will naively bring the source data to the Spark cluster and execute the joins locally.
In some cases, this may be efficient, but in most cases, it is preferable that a database takes advantage of
its indexes and statistics to evaluate and process a set of join operations. Another important limitation of
Spark SQL is that, when more complex operations are applied to an external database, Spark SQL does not collect information from the database’s computed indexes or usage statistics in order to optimize the query’s execution plan. As of Spark 2.2, a new cost-based optimization (CBO) framework was introduced, which computes various table- and column-based statistics (e.g., cardinality, max, min) that the Catalyst
optimizer [16] leverages to evaluate which query plan to execute. Even though this CBO framework can
speed up Spark workloads, it does not take into account operations that could be executed directly inside
the external sources and therefore accelerate Spark SQL’s federated execution. In the experiments section
we show the impact of CBO computation on query execution.
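For context, a minimal PySpark sketch of enabling the CBO and collecting the statistics the Catalyst optimizer consumes (assuming an active SparkSession named spark and a catalog table named subject, both placeholders; note that these statistics cover catalog tables, not operations pushed into external sources):

    # Turn on the cost-based optimizer (available since Spark 2.2).
    spark.conf.set("spark.sql.cbo.enabled", "true")
    # Collect table-level and column-level statistics for plan costing.
    spark.sql("ANALYZE TABLE subject COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE subject COMPUTE STATISTICS FOR COLUMNS subjectid, age")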
Pushing joins to be evaluated at data sources often yields more efficient plans, but such a capability
was not implemented in Spark SQL, which is natively designed to push only projections and selections
down to external sources. Therefore, we designed and implemented a simple but extensible mechanism
that allows us to push joins down to the data sources as sub-queries. Specifically, for the source-level
queries generated by the Mediator, the strategy is to group tables that belong to the same external data
source whenever a join operation is being applied to them.
This approach generates a consolidated source-level sub-query for each data source with all the neces-
sary joins, projections, and selections. As an example, assume a source-level query that joins three Spark
SQL tables, where two of them come from the same external database. Our strategy will generate a sub-
query for those two tables, push it down to the database for execution, retrieve the result and then perform
an in-memory join of this result with the third table. For a small cost of several milliseconds during the
query rewriting, this simple strategy proved to be very effective in many queries as it is shown in the
experiments section.
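To make the mechanism concrete, the following is a minimal PySpark sketch of this pushdown pattern, in which the consolidated sub-query is handed to the JDBC source as the "table" to read, so the external database evaluates the grouped joins while the remaining join runs inside Spark. The connection URLs, table names, and column names are hypothetical placeholders, not the actual SchizConnect sources.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-pushdown-sketch").getOrCreate()

# Consolidated sub-query (joins + selections) evaluated inside PostgreSQL,
# which can exploit its own indexes and statistics.
pushed_subquery = """
    (SELECT s.subject_id, s.sex, a.score
     FROM subjects s JOIN assessments a ON s.subject_id = a.subject_id
     WHERE a.score > 20) AS pushed
"""
pushed_result = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/source_db")  # hypothetical
    .option("dbtable", pushed_subquery)  # the sub-query acts as the table
    .load())

# A third table originating from a different source; this join is
# executed in-memory within Spark.
other = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://other-host:3306/other_db")  # hypothetical
    .option("dbtable", "imaging")
    .load())
final = pushed_result.join(other, on="subject_id")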
Once the execution plan of the rewritten query is computed and the above optimization is applied,
the Mediator uses Spark’s native connectors to access the external data sources. Spark provides excellent
support for querying databases through JDBC connectors and offers the ability to partition the result of a
query across the Spark workers based on one or more columns.
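As a short illustration of this partitioning capability, the sketch below (again with hypothetical connection and column names) splits the result of a JDBC read across the Spark workers on a numeric column; Spark issues one range query per partition.

df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/source_db")
    .option("dbtable", "assessments")
    .option("partitionColumn", "subject_id")  # must be numeric, date, or timestamp
    .option("lowerBound", "1")
    .option("upperBound", "100000")
    .option("numPartitions", "8")  # 8 parallel range reads across the workers
    .load())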
A drawback of our approach, though, is that it forgoes Spark's out-of-the-box JDBC caching. When
loading data with JDBC, Spark usually caches the individual tables that have been queried and can later
reuse them. With our approach, even though Spark caches the sub-query result, it cannot reuse it for
subsequent operations. We note this as a future improvement, since we have already designed a solution to
reuse the materialized result.
6.2.2.2 SchizConnect Evaluation
For the UCI_HID PostgreSQL database and the COINS and Mappings MySQL databases, the default JDBC connector
is used. For accessing the XNAT Central Repository, we implemented a custom wrapper on top of the
Spark SQL sources API.
The Spark XNAT Central Wrapper. We developed a customizable data source wrapper for the XNAT
Central repository [175] by extending the Spark SQL sources API. The wrapper extends the PrunedFilteredScan
interface, which captures all the projections and selections passed at query time to the
external XNAT repository and returns the retrieved data as a collection of RDD objects. The wrapper con-
verts all the selections to XML filters and generates the required XML query file on-the-fly. Once the XML
file is created, the wrapper sends it directly to the XNAT service for execution. This filtering optimization
drastically reduces the size of the data shipped from the XNAT service back to Spark, since it allows the
XNAT framework to evaluate the sub-query and drop records that are redundant for the final query result
set.

Table 6.1: Structure of source-level queries
#  Structure  Size (#preds)  Result Size
1  U4CQ       21             567
2  U3CQ       12             1111
3  U2CQ       10             9129
4  U16CQ      127            2104
5  U16CQ      127            1234
6  U9CQ       58             1566
7  U3CQ       8              21745
8  U9CQ       53             20725
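The wrapper itself is implemented against the (Scala) Spark SQL sources API; the following simplified Python sketch only illustrates the selection-to-XML translation step it performs, and the XML element names here are illustrative placeholders rather than the exact XNAT search schema.

import xml.etree.ElementTree as ET

def filters_to_xml(filters):
    # filters: (column, comparison, value) triples captured from the
    # pushed-down Spark selections, e.g., [("age", ">=", 18)].
    root = ET.Element("search")
    where = ET.SubElement(root, "search_where", method="AND")
    for column, comparison, value in filters:
        criterion = ET.SubElement(where, "criteria")
        ET.SubElement(criterion, "schema_field").text = column
        ET.SubElement(criterion, "comparison_type").text = comparison
        ET.SubElement(criterion, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

# The generated XML query is then sent to the XNAT service for execution:
xml_query = filters_to_xml([("age", ">=", 18), ("dx", "=", "schizophrenia")])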
Experimental Evaluation. For our experiments, we used version 2.2.1 of Apache Spark on an Ubuntu
16.04 machine equipped with an Intel Core i7-950 processor at 3.07GHz and 16GB of RAM. The processor
has 4 cores (8 hardware threads) with private L1 (32KB each for data and instructions), L2 (256KB) caches,
and 8MB of shared L3 cache. For the query execution, we initialized Spark in standalone mode with 6 cores
and 12GB of RAM. We ran the queries using 3 Spark executors allocating 2 cores and 2GB of RAM for each
one.
We carefully selected eight domain queries whose respective source-level rewrites could stress both the
Spark SQL federation engine and our simple optimization strategy. All source-level queries take the form of
a union of conjunctive queries. The number of conjunctive queries in each union varies, and so does the
number of predicates involved in each conjunctive query of the union. The structure of the source-level
queries is described in Table 6.1. For instance, query 4 is a union of
16 conjunctive queries which includes a total of 127 predicates and returns 2104 rows after its execution.
While executing the federated queries over Spark, we observed poor performance for the built-in
UNION operator between the conjunctive queries, because of Spark's expensive data shuffling and the
number of comparisons the operation performs between the left and right operands to determine the
distinct records in their combined result set. Instead, we replaced the UNION operator with UNION ALL,
which naively merges the result sets of all the conjunctive queries, and we compute the distinct records
only once on the final set. This simple reformulation of the query plan, together with our pushdown
optimization strategy, gave a significant boost to the execution of our federated queries.
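A minimal sketch of the reformulation, reusing the hypothetical spark session from the earlier sketches; the per-conjunctive-query SQL strings are placeholders for the rewrites produced by the Mediator.

# One SQL string per conjunctive query of the union (illustrative).
conjunctive_queries = [
    "SELECT subject_id, dx FROM cq1_view",
    "SELECT subject_id, dx FROM cq2_view",
]

# Merge with UNION ALL (no per-step duplicate elimination), then
# deduplicate once over the final merged result set.
union_all_sql = " UNION ALL ".join(f"({q})" for q in conjunctive_queries)
final = spark.sql(f"SELECT DISTINCT * FROM ({union_all_sql}) AS merged")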
Figure 6.5 illustrates the query response times for the version of OGSA-DAI [89] that is used as the
execution engine of the SchizConnect mediator,† Spark Native (i.e., no UNION ALL, no Pushdown, no
CBO), Spark with UNION ALL and Pushdown optimization (i.e., Spark U-Pushdown), Spark Native with
CBO enabled, and Spark U-Pushdown with CBO enabled. All the execution times shown for each approach
and for each query correspond to the average time of ten consecutive executions. Spark U-Pushdown and
Spark U-Pushdown CBO outperform Spark Native and Spark Native CBO respectively across all queries,
with the exception of queries 4 and 5 (explained below), which proves the effectiveness of our optimization
strategy. Compared to OGSA-DAI, the execution time of all Spark approaches is better (e.g., query 7) or
almost equivalent (e.g., query 1), except for queries 4 and 5, where OGSA-DAI outperforms Spark,
reflecting OGSA-DAI's smarter cost-based optimization for federated queries with many source-level queries.
We do not include the execution times for queries 4 and 5 in the cases of Spark U-Pushdown and
Spark U-Pushdown CBO, because our greedy optimization strategy generates a source-level sub-query for
the HID PostgreSQL database that is very expensive to execute. The sub-query exceeds the 3-minute
execution time cut-off, due to the size of the involved join operations and the large number of tuples
returned. Just counting the result set of the sub-query within PostgreSQL takes around 50 seconds. If we
disable our greedy optimization strategy and simply use the UNION ALL reformulation, the execution time
is better than Spark Native with or without CBO and comparable to OGSA-DAI.
† In these experiments, OGSA-DAI is extended with statistics collection from remote sources and a simple, but effective, cost-based optimizer.
Figure 6.5: Execution time for 8 queries (Q1-Q8; y-axis: execution time in secs), comparing OGSA-DAI, Spark Native, Spark U-Pushdown, Spark Native CBO, and Spark U-Pushdown CBO.
These queries showcase that a more sophisticated cost-based pushdown optimization strategy can help
reduce their overall execution time.
We believe that we can further improve the query execution time by adding more optimization rules
to our current strategy. One such rule would be to detect sub-queries that share a common structure and
push them only once to the remote data source while performing additional operations on-the-fly within
Spark for each conjunctive query. In effect, we would execute a non-recursive datalog program, in which
common subexpressions are defined as datalog rules. Ultimately, an extension of the cost-based optimizer
with join-pushdown would explore whether there are combinations of join operations that would perform
better if executed within Spark rather than inside the remote sources.
6.3 Federated Learning INTegration (FLINT)
Most of the existing work in the federated learning domain operates on unstructured data, such as images
or text, or on structured data assumed to be consistent across the different silos. However, data silos often
have different schemata, data formats, data values, and access patterns. The field of data integration has
developed many methods to address these challenges, including techniques for data exchange and query
rewriting using declarative schema mappings, and entity linkage. We propose an architectural vision
for an end-to-end Federated Learning and Integration (FLINT) system, incorporating the critical steps of
data harmonization and data imputation, to spur further research on the intersection of data management
information systems and machine learning. Specifically, we propose to model the application domain
through a target schema and formal schema mappings and to execute target queries to provide the input
data for training the federated learning model at each data silo. Since the purpose of data integration is
analysis, we propose new research directions for query answering techniques to incorporate statistical
imputation (instead of discarding answers with labeled nulls).
6.3.1 The Federated Data Integration Problem
Existing federated learning systems [149] assume that the local private data at the participating sources
(which are the input into the learning model) follow the same schema, format, semantics, and storage
and access capabilities. Such an assumption does not always hold in realistic learning scenarios, where
geographically distributed data sources have their own unique data specifications; a challenge that is com-
monly observed in Federated Database Management Systems [98, 231], Data Integration [60], and Data
Exchange [70]. By extending the architecture presented in Figure 3.7 with the necessary data harmoniza-
tion and data imputation components, we can address the semantic heterogeneities (discussed in Chapter 3)
occurring in a federated learning workflow as shown with the high-level system architecture in Figure 6.6.
The Data Harmonization component maps each local schema (and values) to a common schema (and
values) agreed by the federation, which we advocate should be done through declarative schema map-
pings [60, 63, 70, 87, 93, 276]. This common schema intends to support multiple learning scenarios over
the domain of the data. Since not all sources may have values for all the attributes in the common schema,
Figure 6.6: A Harmonized Federated Learning Workflow. (At each data silo, a Harmonization step maps the local data to the common schema and an Imputation step fills in missing values; each Learner then performs local training of its local ML model, and the Federation Controller aggregates the local models into the global ML model.)
it is often necessary to impute missing values to improve the precision of statistical studies [129] and re-
duce prediction bias [19], especially in clinical studies. The Data Imputation component is responsible for
handling data missingness within each data silo. Missing data imputation has implications, though, for
the data integration methods used: instead of removing answers with skolems/labeled nulls, these can
be preserved and the missing values imputed. Altogether, we identify the following core challenges that
need to be solved to facilitate the deployment of Federated Learning solutions in real-world settings:
(1) private and secure data harmonization and normalization across federated learning silos; (2) federated
training over missing values, using imputation to improve learning efficiency and reduce bias; and
(3) efficient data query access patterns for ingesting training data into siloed learning models. In
the remainder, we discuss the need for federated data integration and imputation and describe in detail
our proposed FLINT architecture, Figure 6.7.
Figure 6.7: Federated Learning and Integration Architecture and Internal Components. (The Federation Driver lets a user define and submit a federated workflow over the federation context; the Federation Controller hosts the training and evaluation task schedulers (sync, async, semi-sync), client selector, model aggregator (e.g., FedAvg, FedAdam), secure aggregator (e.g., using FHE), and model store; each Learner runs a local mediator, with schema mappings, local normalization, target-query materialization, and imputation feeding the dataset loader, model trainer, and model evaluator, all communicating over gRPC.)
6.3.2 Federated Data Harmonization & Imputation
Every data silo participating in a federated learning environment is an independent entity, and it is
therefore natural for it to have its own unique data specifications. For instance, in an international
federation of hospitals, each institution may adhere to data specifications unique to the geographical
region it operates in [44, 168].
Creating a consensus data model that can harmonize the nuances of such regional data specifications is
not an easy task, but it is critical for meaningful data analysis.
Source Modeling / Schema Mapping. The federated machine learning model needs a harmonized input
across all participating sites (sources). Although this could be accomplished by ad-hoc ETL pipelines at
each site, such pipelines introduce maintenance and extensibility challenges. To mitigate this, we advocate
for a more principled declarative approach based on formal schema mappings following the vast work in
data integration [60, 63, 70, 87, 93, 276]. First, we define a common schema (aka global, domain, mediated,
or target schema) that represents an agreed-upon view of the application domain for the purposes of
the federation. Such a common model may follow established standards (e.g., the OMOP Common Data
Model and Vocabularies [225]), or be defined pragmatically by the members of the federation. The target
schema is a degree of freedom of the formalism: it does not necessarily need to provide the “perfect” model
for the domain, but it needs to provide sufficient details to support the expected queries and analysis, and
a reasonable expectation of extensibility as new sources are added. Second, we define declarative schema
mappings to translate the data from the sources into the common schema. These mappings are existential
formulas of the form $\forall \vec{x}, \vec{y} \; \varphi_S(\vec{x}, \vec{y}) \rightarrow \exists \vec{z} \; \psi_G(\vec{x}, \vec{z})$, where $\varphi_S$ and $\psi_G$ are conjunctions of predicates
from the source and global (common) schemata, respectively, and $\vec{x}, \vec{y}, \vec{z}$ are tuples of variables. These
mappings can be used for virtual data integration using query rewriting [60] or for data warehousing/data
exchange [70]. Complex constraints can be enforced on the target schema (i.e., making it an ontology), and
corresponding query answering methods exist [87, 276]. Declarative mappings have the advantage of
being easier to generate and maintain, and they provide opportunities for automatic learning and optimization
(e.g., [127]). Figure 6.8 shows an example of a global schema, schema mapping rules, and how queries on
the domain schema support multiple learning tasks.
Entity Linkage/Data Normalization. It is also important to recognize when objects from different
sources correspond to the same entity in the real world. For example, a patient may have interacted with
several doctors, hospitals, testing facilities, pharmacies, etc., each of which may have created different
records of these interactions. The data integration system must recognize that all these records refer to
the same patient, and link them to a complete medical history of the patient. When we deal with complex
structured objects, such as patients, this problem is called entity or record linkage [63, 73, 189]. A simpler
version of the problem also occurs with atomic values; different sources may use different strings to re-
fer to the same value. For example, in a radiation oncology domain, one source may code an anatomical
structure with the value “LTemp lobe,” while another uses the value “LTemporal.” To provide clear seman-
tics for analysis, we need to map these two values to a normalized value such as "Left Temporal Lobe"
(UBERON:0002808).
Figure 6.7 shows the detailed data harmonization components in our architecture. Each learner has an
instance of a local mediator [60, 272] with access to the schema mappings from its local source(s) to the
global schema. We envision that the federation will tackle different learning problems, at different times,
over the same common view of the data that the mediator produces. Each learning problem would require
a different neural network with different inputs. Thus, we obtain the required input data for each problem
through a query over the common schema, as opposed to ad-hoc ETL processes. The local data is never
changed; however, our system can answer such queries using the schema mappings (and target schema
constraints, if any) through query rewriting and data exchange techniques [60, 63, 70, 87, 93, 276]. For data
normalization, in simple cases, we can use a local database with mappings between the values used in each
source and normalized values, or use functional predicates that compute a similarity function between the
source and the global values in more complex cases. These normalization relations can easily be added
to the schema mappings (Figure 6.8) and the query answering procedure as interpreted predicates, e.g.,
[11]. The target query, which computes the input to the neural network, is materialized so that the model
trainer can efficiently operate over it.
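For illustration, a minimal Python sketch of the simple normalization case, using the anatomical-structure values from the example above; the mapping entries and source identifiers are assumptions for illustration only.

# Local mapping table: (source, raw value) -> (normalized label, terminology code).
normalization_map = {
    ("s1", "LTemp lobe"): ("Left Temporal Lobe", "UBERON:0002808"),
    ("s2", "LTemporal"):  ("Left Temporal Lobe", "UBERON:0002808"),
}

def normalize(source_id, raw_value):
    # Direct lookup in the simple case; complex cases would instead invoke
    # a similarity function between source and global values.
    return normalization_map.get((source_id, raw_value), (raw_value, None))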
Entity linkage across learners is more complex. So far we have discussed horizontal federated learning,
where sites share similar id, feature, and label spaces, as needed as input to the learning algorithm,
possibly with imputed values. However, the required input data (i.e., the data of a single learning example)
may be distributed across several sites, so-called vertical federated learning, where truly private cross-
source record linkage is needed [96, 275, 282, 285, 286].
Data Imputation. After data harmonization, the machine learning model input is uniform and mean-
ingful since it is the output of a query over the common schema, and values have been normalized. How-
ever, real sources often have missing values, either missing at random or systematically. One option is to
remove rows or columns with missing values, but that diminishes the utility of the data and the quality
of the learned models (cf. Semantic Heterogeneity section in Chapter 3). It is generally preferable to im-
pute the missing values, that is, to find the most likely value for that given attribute and example [262,
289]. Models learned with imputed values can lead to better performance [25]. In the context of federated
learning, participating sources may have limited information and statistically diverse data distributions,
and therefore their local records alone may not suffice to impute missing values/attributes. In these
learning settings, an imputation function can be learned at the federation level. By training such a
federated imputation function, we can leverage the information from all sources, improve data quality, and
provide better data distribution coverage.
Data imputation interacts with formal query rewriting methods in an interesting way that opens new
avenues for research. Since formal schema mappings have existential variables in the consequent, the
query rewriting process may generate null values (skolems) in the answers to a query. Tuples with such null
values are discarded since they are not certain answers (i.e., true in all possible worlds) [60, 70]. However,
for the purpose of learning, such null values can be imputed probabilistically. Therefore, we advocate
modifying query answering algorithms to preserve null values and incorporate imputation procedures.
Interestingly, the target query may need to retrieve attributes beyond those required by the input of the
machine learning algorithm in order to improve the quality of the imputation.
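A small pandas sketch of this idea, with hypothetical attribute values: the labeled null produced by query rewriting is preserved (represented here as NaN) and imputed, instead of the whole tuple being discarded; mean imputation is used purely for illustration.

import pandas as pd

# Target-query answers where rewriting produced a labeled null for moca.
answers = pd.DataFrame({
    "sex":  ["F", "M", "F"],
    "age":  [71, 68, 75],
    "moca": [26.0, None, 22.0],  # None = labeled null / skolem
})

# Preserve the second tuple and impute the missing MoCA value.
answers["moca"] = answers["moca"].fillna(answers["moca"].mean())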
Global schema
subject(id, sex, re) # demographics, re = race/ethnicity
clinical(id, visit, age, moca, dx) # clinical data, visit = date of the assessment/dx, dx should be icd10 codes
imaging(id, visit, type, image) # medical imaging of different types

Schema mappings
s1(id, dob, sex, re, visit, mmse, dx, mri) ∧ # s1 only has MRIs, dx has missing values
minus(dob, visit, age) ∧ # compute age at assessment as date of birth minus visit date
impute_f1(sex, age, re, mmse, moca_imp, dx_imp) # imputation of MoCA (full column) and missing values of dx
→ subject(id, sex, re) ∧ clinical(id, visit, age, moca_imp, dx_imp) ∧ imaging(id, visit, "MRI", mri)

s2_dem(id, sex, re) ∧
s2_image(id, visit_image, age_image, image_type, scan) ∧ image_type = "MRI" ∧ # only interested in MRIs
s2_dx(id, visit_dx, age_dx, dx) ∧ dx in ["CT", "MCI", "AD"] ∧ # and in Alzheimer's diagnoses
normalize(dx, icd10) ∧ # normalize diagnostic codes
impute_f2(sex, age_dx, re, icd10, moca_imp) # imputation of MoCA values
→ subject(id, sex, re) ∧ clinical(id, visit_dx, age_dx, moca_imp, icd10) ∧ imaging(id, visit_image, "MRI", image)

Alzheimer's prediction query
q(sex, re, age, mri, dx) ← subject(id, sex, re) ∧ imaging(id, visit1, "MRI", mri) ∧
clinical(id, visit2, age, moca, dx) ∧ (|visit1 − visit2| < 60)

Cognitive decline query
q(sex, re, mri1, diff_age, diff_moca) ← subject(id, sex, re) ∧ imaging(id, visit1, "MRI", mri1) ∧
clinical(id, visit1, age1, moca1, dx1) ∧ clinical(id, visit2, age2, moca2, dx2) ∧ visit2 > visit1 ∧
minus(age1, age2, diff_age) ∧ minus(moca1, moca2, diff_moca)

Figure 6.8: Global Schema and Schema Mapping Rules.
Example. Figure 6.8 shows a notional example of horizontal federated learning and integration of data
sources with medical data. The federation designers define a (harmonized) global schema with 3 relations:
subject, which models subject demographics; clinical, which models clinical assessments and diagnoses;
and imaging, which models different types of medical imaging. The federation expects normalized icd10
codes for diagnoses and standardizes the Montreal Cognitive Assessment (MoCA) as a measure for de-
mentia. There are two sources: s1, which represents a clinic specializing in the treatment of Alzheimer’s
Disease that captures magnetic resonance imaging (MRI) and administers a Mini-Mental State Examina-
tion (MMSE) for each patient, both on a single visit; and s2, which represents a hospital that treats a wider
variety of diseases. The two sources are mapped to the global schema using formal schema mappings that
include both data transformation and imputation functional predicates. The first mapping uses a simple
functional predicate (minus) to compute the age at assessment from the difference of the patient’s date
of birth and the visit date, as well as an imputation procedure (impute_f1) that imputes both the MoCA
score and possibly missing diagnosis codes from the MMSE score, age, sex, race/ethnicity, and existing
diagnosis. Since s1 contains only MRIs, this is the recorded type of the resulting imaging in the harmo-
nized schema. The second mapping joins 3 tables from source s2 comprising demographics, imaging, and
diagnoses. Assume the federation is only interested in neuroimaging of Alzheimer’s Disease, so it chooses
to populate the global schema only with MRI scans and relevant diagnoses (AD: Alzheimer’s Disease, MCI:
mild cognitive impairment, and controls). An interpreted predicate maps the source’s diagnoses to appro-
priate ICD10 codes. Finally, a second imputation function (impute_f2) imputes the MoCA values from sex,
age, race/ethnicity, and diagnosis. Note that since the MoCA score is not produced by the source, but is
required by the global schema predicate clinical, it would have been represented by a skolem function in a
traditional data integration system, and query tuples with such skolem would have been removed. Here,
we impute the MoCA score, so no tuples are lost.
The global schema, through the schema mappings, enables a variety of queries that support different
learning tasks within the federation. Figure 6.8 shows two such queries. The first computes the training
data for a classification learning task that predicts an AD status (AD/MCI/CT) diagnosis based on the
MRI, sex, age, and race/ethnicity of a subject. The second query computes the training data for aregression
learning task to predict cognitive decline based on an MRI at an initial time point and the ages and cognitive
assessment values at two-time points.
Privacy. Federated learning assumes that data from one site cannot be leaked to a different site. There-
fore, in our system, federated training encrypts the neural network parameters (weights, gradients) and
the aggregation of neural models from different sites is done under homomorphic encryption. We refer
the reader to [245] for details. To enforce data privacy, query rewriting and data normalization need to be
performed locally at each site and therefore the schema mappings and the normalization tables need to be
kept local at each site and only the global model schema is shared across sites. Similarly, record linkage
can be done in a privacy-preserving manner [85, 156, 222] but federated training becomes significantly
more complex, which is an active area of research [35, 96, 282, 285].
Comparing Simple Intra-Silo and Inter-Silo Imputation. Here, we compare the effect of simple
imputation methods (i.e., mean, median) on the final learning performance of the federated model. We
evaluate two distinct approaches to simple imputation: one where each learner imputes its missing values
using only its locally available information (intra-silo), and one where learners use global information to
impute (inter-silo). The inter-silo approach can alternatively be viewed as a federated imputation
approach.
In the intra-silo case, the learners impute missing values using the mean or median of their complete
values, whereas, in the inter-silo case, the learners impute missing values using the global mean (FedMean)
and global median (FedMedian). To compute the latter two values, each learner shares its local mean or
median with the controller, and the controller computes the federated mean as $\frac{1}{N}\sum_{i=1}^{N}\mu_i$ and the
federated median as $\mathrm{med}(\mathrm{med}_1, \mathrm{med}_2, \ldots, \mathrm{med}_N)$, where $N$ represents the total number of learners,
$\mu_i$ and $\mathrm{med}_i$ the local mean and median of learner $i$, and $\mathrm{med}()$ the median of the local medians. FedMedian
is an approximation of the global median, since learners share only their local median and not their entire
sequence of values, while the federated mean recovers the true global mean (for equally sized local
datasets). In our evaluation, we consider as baselines the federated model trained on the complete data
values (no missing values, No MVs) and the federated model trained on the incomplete data values (Deletion).

Figure 6.9: Comparing Simple Intra-Silo and Inter-Silo Imputation in the California Housing Dataset. (Panels: (a) 10 Learners - IID, (b) 100 Learners - IID, (c) 10 Learners - NonIID, (d) 100 Learners - NonIID; each panel plots Test MSE versus Missing Ratio under MCAR, MAR, and MNAR for No MVs, Deletion, Random, Mean, Median, FedMean, and FedMedian.)
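A minimal NumPy sketch of this protocol with hypothetical learner data: each learner computes statistics only over its locally complete values, shares just those statistics, and the controller aggregates them into FedMean and FedMedian.

import numpy as np

def local_stats(values):
    # Each learner computes its local mean and median over complete values.
    complete = values[~np.isnan(values)]
    return complete.mean(), np.median(complete)

def controller_aggregate(stats):
    means, medians = zip(*stats)
    fed_mean = np.mean(means)        # exact global mean for equally sized silos
    fed_median = np.median(medians)  # median of local medians (approximation)
    return fed_mean, fed_median

# Three hypothetical learners with missing values (np.nan):
learners = [np.array([1.0, 2.0, np.nan]),
            np.array([4.0, np.nan, 6.0]),
            np.array([3.0, 5.0, 7.0])]
fed_mean, fed_median = controller_aggregate([local_stats(v) for v in learners])
imputed = [np.where(np.isnan(v), fed_mean, v) for v in learners]  # FedMean imputation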
Figure 6.9 shows the performance of the respective approaches on the California Housing dataset
under IID and Non-IID data distributions (cf. Chapter 3) for 10 and 100 learners, with a participation ratio
of 0.1 at each round. To generate the MCAR, MAR, and MNAR environments, we first generate missing values
in the original aggregated dataset and then partition the dataset according to the investigated total number
of learners (10 or 100) and data distribution (IID, Non-IID). As shown in the results, any imputation is
better than no imputation for both IID and Non-IID data distributions. However, further work with more
sophisticated imputation methods is required to assess under which learning conditions the intra-silo
approach is better than the inter-silo approach, or vice versa.
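For concreteness, a minimal NumPy sketch of the MCAR missingness-generation and IID partitioning steps described above; the dataset shape, missing ratio, and seed are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def make_mcar(X, missing_ratio):
    # Under MCAR, each cell is masked uniformly at random.
    mask = rng.random(X.shape) < missing_ratio
    X = X.copy()
    X[mask] = np.nan
    return X

X = rng.normal(size=(20640, 8))            # CA-Housing-sized feature matrix
X_mcar = make_mcar(X, missing_ratio=0.2)   # 20% missing ratio
row_ids = rng.permutation(X_mcar.shape[0]) # shuffle rows for an IID split
partitions = [X_mcar[idx] for idx in np.array_split(row_ids, 10)]  # 10 learners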
Chapter 7
Future Research Directions
Federated Learning is an emerging research field with many open research problems. At its core, Federated
Learning represents a complex ecosystem encompassing diverse learning components, from model data
feeding at the learning sites (silos, clients, learners) to ultimate federated model deployment. Most recent
works have proposed approaches that optimize specific aspects of the federated learning training pipeline,
such as model learning optimization, learning with stronger convergence guarantees, and robust and secure
model aggregation techniques. However, the full spectrum of the federated learning training and deployment
lifecycle remains unexplored. Here, we present some of the open research problems we anticipate will
emerge in the foreseeable future.
The Federated Learning Trilemma. Most recent research works have proposed solutions that optimize
many of these different learning components independently rather than holistically. We anticipate
that a new federated learning research subdomain will emerge where new solutions must consider the
trade-offs between federated model learning utility, security & privacy, and system scalability; we posit
this as the federated learning trilemma. These new solutions will provide insights on how existing and new
federated algorithmic optimizations can operate within different federated learning topologies, such as
centralized, decentralized, and hierarchical learning settings [211], and promote innovative optimizations
to unforeseen or unexplored federated learning environments arising in practical and real-world settings.
Federated Models Generalization. A critical problem that is largely overlooked by existing works is
how federated models trained within a federation will perform when deployed on sites that did not
participate in the federation. Take for instance a federated model that was trained within a consortium of
hospitals for breast cancer detection. If this model were to be reused for inference on a different set of
hospitals whose local data distributions may not follow the original data distribution the federated model
was trained on, then the federated model might fail to provide reasonable predictions (e.g., due to inherent
population biases). This problem is very similar to the domain [266] and out-of-distribution [230]
generalization problems commonly studied in centralized settings, but its adaptation to federated settings
is still unexplored.
Secure & Private Federated Model Deployment. Any federated model deployed for inference must
preserve the privacy of the private data it was trained on and ensure no information can be leaked to
malicious third parties. Compared to existing machine learning models, federated models can be deployed
either on the cloud or on-premises within individual learning sites. In the first case, users perform inference
queries over federated models using non-private/public data that can be uploaded to the cloud without any
privacy violation. In the second case, though, users perform inference queries over federated models on-
premises using their private local datasets. To support both deployment types and ensure that no sensitive
information can be leaked during inference, new information-theoretic and privacy-preserving inference
mechanisms must be investigated along with new efficient cryptographic schemes that can support model
inference in an encrypted space (e.g., using homomorphically encrypted operations [118, 166, 237]).
Federated Models Marketplace. Once a federated model has been trained, it is stored in the model
repository for versioning, bookkeeping, and serving. All data owners that contributed to the training of
the federated model have model ownership claims. Any user that wants to use a trained federated model
to perform operations on top of it (e.g., run inference queries or use it as an initialization model state) and has not
contributed to its training, will have to pay a corresponding model serving fee. The collected fee is then
distributed back to the data owners, who contributed to the training of the federated model and to the
model provider that enabled the model transaction. This synergy between model owners, providers, and
consumers (users) constitutes the federated learning marketplace. We envision that this type of reward-
based federated model marketplace will find applicability across many different disciplines and domains,
such as manufacturing, finance, and healthcare. However, to enable such transactions, new pricing
frameworks will have to be proposed that price machine learning model instances directly (as in [36]), but
in the context of federated learning. These frameworks will have to consider a different set of environmental
constraints that are apparent only in federated learning settings, such as the computational power used by
each data silo to train the federated model, the quality of the data owned by each silo, and the final model
performance.
Federated Data Profiling and the Value of Data. An important factor when training a federated
learning model is the quantity and quality of the data on which the federated model will be trained. The
internal characteristics of these training data points may affect, or even damage, the convergence of the
federated model [120]. Therefore, it is imperative to profile the local data distribution of a data silo before
the silo joins the federation (e.g., using data sampling and data sketching) and to help measure its
contribution to the federated model. However, due to the privacy constraints dic-
tated by the federated training settings, such data profiling needs to be performed in a privacy-preserving
manner and new techniques must account for non-IID data distributions. Similarly, different sites may
contribute differently to the overall performance of the learned federated model. A new research direction
is to understand the marginal contribution of each site and which set of sites produces an optimal trade-off
between the sites/data selected and the cost of acquiring such data, for example through Shapley values [81]
for federated data.
Federated Database Learning. As we have previously discussed, Federated Learning bears some
similarities to previous work in Federated Database Management Systems (FDBMS). Inspired by this
analogy, in Chapter 6, we described how to incorporate a virtual data integration component into a federated
learning environment. However, further investigation is required on how the federated data schema upon
which the federated model query will be executed can be defined in a secure and private manner without
leaking any information. Additionally, more thorough analysis and evaluation are required on how the
federated model query can be privately and securely executed within each data silo (e.g., through media-
tors as described in section 6) and handle missing attributes (e.g., vertical federated learning) and values.
These are open problems related to how support for machine learning for IID and Non-IID data distribu-
tions can be provided within the context of federated database systems [98, 231]; revisiting these kinds of
systems will be necessary.
∼⋄∼

I hope the work presented in this thesis will spur further research in the exciting field of federated learning
and promote new ways to address known and unknown problems in many different research areas.
Bibliography
[1] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. “Tensorflow: A system
for large-scale machine learning”. In: 12th USENIX Symposium on Operating Systems Design and
Implementation (OSDI 16). 2016, pp. 265–283.
[2] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and
Li Zhang. “Deep learning with differential privacy”. In: Proceedings of the 2016 ACM SIGSAC
Conference on Computer and Communications Security. 2016, pp. 308–318.
[3] Abbas Acar, Hidayet Aksu, A. Selcuk Uluagac, and Mauro Conti. A Survey on Homomorphic
Encryption Schemes: Theory and Implementation. 2017. arXiv: 1704.03578[cs.CR].
[4] Durmus Alp Emre Acar, Yue Zhao, Ramon Matas Navarro, Matthew Mattina,
Paul N Whatmough, and Venkatesh Saligrama. “Federated learning based on dynamic
regularization”. In: arXiv preprint arXiv:2111.04263 (2021).
[5] Sibel Adali, K Selçuk Candan, Yannis Papakonstantinou, and Vo S Subrahmanian. “Query caching
and optimization in distributed mediator systems”. In: ACM SIGMOD Record 25.2 (1996),
pp. 137–146.
[6] Naman Agarwal, Ananda Theertha Suresh, Felix Xinnan X Yu, Sanjiv Kumar, and
Brendan McMahan. “cpSGD: Communication-efficient and differentially-private distributed
SGD”. In: Advances in Neural Information Processing Systems 31 (2018).
[7] Martin Albrecht, Melissa Chase, Hao Chen, Jintai Ding, Shafi Goldwasser, Sergey Gorbunov,
Shai Halevi, Jeffrey Hoffstein, Kim Laine, Kristin Lauter, Satya Lokam, Daniele Micciancio,
Dustin Moody, Travis Morrison, Amit Sahai, and Vinod Vaikuntanathan. Homomorphic
Encryption Standard. Cryptology ePrint Archive, Report 2019/939.
https://eprint.iacr.org/2019/939. 2019.
[8] Abdulmajeed F Alrefaei, Yousef M Hawsawi, Deyab Almaleki, Tarik Alafif, Faisal A Alzahrani,
and Muhammed A Bakhrebah. “Genetic data sharing and artificial intelligence in the era of
personalized medicine based on a cross-sectional analysis of the Saudi human genome program”.
In: Scientific Reports 12.1 (2022), p. 1405.
[9] Laith Alzubaidi, Jinglan Zhang, Amjad J Humaidi, Ayad Al-Dujaili, Ye Duan, Omran Al-Shamma,
José Santamaría, Mohammed A Fadhel, Muthana Al-Amidie, and Laith Farhan. “Review of deep
learning: Concepts, CNN architectures, challenges, applications, future directions”. In: Journal of
big Data 8 (2021), pp. 1–74.
[10] Jose Luis Ambite, Marcelo Tallis, Kathryn Alpert, David B Keator, Margaret King, Drew Landis,
George Konstantinidis, Vince D Calhoun, Steven G Potkin, Jessica A Turner, et al. “SchizConnect:
virtual data integration in neuroimaging”. In: International Conference on Data Integration in the
Life Sciences. Springer. 2015, pp. 37–51.
[11] José Luis Ambite, Marcelo Tallis, Kathryn I. Alpert, David B. Keator, Margaret D. King,
Drew Landis, George Konstantinidis, Vince D. Calhoun, Steven G. Potkin, Jessica A. Turner, and
Lei Wang. “SchizConnect: Virtual Data Integration in Neuroimaging”. In: Proceedings of the 11th
International Conference on Data Integration in the Life Sciences (DILS 2015). Los Angeles, CA,
2015, pp. 37–51.
[12] Mathieu Andreux, Jean Ogier du Terrail, Constance Beguier, and Eric W. Tramel. Siloed Federated
Learning for Multi-Centric Histopathology Datasets. 2020. arXiv: 2008.07424[cs.CV].
[13] Mario Antonioletti, Malcolm Atkinson, Rob Baxter, Andrew Borley, Neil P Chue Hong,
Brian Collins, Neil Hardman, Alastair C Hume, Alan Knox, Mike Jackson, et al. “The design and
implementation of Grid database services in OGSA-DAI”. In: Concurrency and Computation:
Practice and Experience 17.2-4 (2005), pp. 357–376.
[14] Yoshinori Aono, Takuya Hayashi, Lihua Wang, Shiho Moriai, et al. “Privacy-preserving deep
learning via additively homomorphic encryption”. In: IEEE Transactions on Information Forensics
and Security 13.5 (2017), pp. 1333–1345.
[15] Marcelo Arenas, Pablo Barceló, Leonid Libkin, and Filip Murlak. “Relational and XML data
exchange”. In: Synthesis Lectures on Data Management 2.1 (2010), pp. 1–112.
[16] Michael Armbrust, Reynold S Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K Bradley,
Xiangrui Meng, Tomer Kaftan, Michael J Franklin, Ali Ghodsi, et al. “Spark sql: Relational data
processing in spark”. In: Proceedings of the 2015 ACM SIGMOD international conference on
management of data. 2015, pp. 1383–1394.
[17] Michael Ashburner, Catherine A Ball, Judith A Blake, David Botstein, Heather Butler,
J Michael Cherry, Allan P Davis, Kara Dolinski, Selina S Dwight, Janan T Eppig, et al. “Gene
ontology: tool for the unification of biology”. In: Nature genetics 25.1 (2000), pp. 25–29.
[18] Naveen Ashish and Jose-Luis Ambite. Data Integration in the Life Sciences: 11th International
Conference, DILS 2015, Los Angeles, CA, USA, July 9-10, 2015, Proceedings. Vol. 9162. Springer, 2015.
[19] Olawale F Ayilara, Lisa Zhang, Tolulope T Sajobi, Richard Sawatzky, Eric Bohm, and Lisa M Lix.
“Impact of missing data on bias and precision when estimating change in patient-reported
outcomes from a clinical registry”. In: Health and quality of life outcomes 17.1 (2019), pp. 1–9.
[20] Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. “How to
backdoor federated learning”. In: International Conference on Artificial Intelligence and Statistics .
PMLR. 2020, pp. 2938–2948.
[21] James Henry Bell, Kallista A Bonawitz, Adrià Gascón, Tancrède Lepoint, and Mariana Raykova.
“Secure single-server aggregation with (poly) logarithmic overhead”. In: Proceedings of the 2020
ACM SIGSAC Conference on Computer and Communications Security. 2020, pp. 1253–1269.
[22] Paolo Bellavista, Luca Foschini, and Alessio Mora. “Decentralised learning in federated
deployment environments: A system-level survey”. In: ACM Computing Surveys (CSUR) 54.1
(2021), pp. 1–38.
[23] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. “Greedy layer-wise training
of deep networks”. In: Advances in neural information processing systems 19 (2006).
[24] Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, and Anima Anandkumar. “signSGD
with majority vote is communication efficient and fault tolerant”. In: arXiv preprint
arXiv:1810.05291 (2018).
[25] Dimitris Bertsimas, Colin Pawlowski, and Ying Daisy Zhuo. “From predictive methods to missing
data imputation: an optimization approach”. In: The Journal of Machine Learning Research 18.1
(2017), pp. 7133–7171.
[26] Daniel J Beutel, Taner Topal, Akhil Mathur, Xinchi Qiu, Titouan Parcollet, Pedro PB de Gusmão,
and Nicholas D Lane. “Flower: A friendly federated learning research framework”. In: arXiv
preprint arXiv:2007.14390 (2020).
[27] Arjun Nitin Bhagoji, Supriyo Chakraborty, Prateek Mittal, and Seraphin Calo. “Analyzing
federated learning through an adversarial lens”. In: International Conference on Machine Learning.
PMLR. 2019, pp. 634–643.
[28] Krishnan Bhaskaran and Liam Smeeth. “What is the difference between missing completely at
random and missing at random?” In: International journal of epidemiology 43.4 (2014),
pp. 1336–1339.doi: 10.1093/ije/dyu080.
[29] Sameer Bibikar, Haris Vikalo, Zhangyang Wang, and Xiaohan Chen. “Federated dynamic sparse
training: Computing less, communicating less, yet learning better”. In: Proceedings of the AAAI
Conference on Artificial Intelligence . Vol. 36. 2022, pp. 6080–6088.
[30] Battista Biggio, Blaine Nelson, and Pavel Laskov. “Support vector machines under adversarial
label noise”. In: Asian conference on machine learning. PMLR. 2011, pp. 97–112.
[31] Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. “Machine learning
with adversaries: Byzantine tolerant gradient descent”. In: Advances in neural information
processing systems 30 (2017).
[32] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan,
Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. “Practical secure aggregation for
federated learning on user-held data”. In: arXiv preprint arXiv:1611.04482 (2016).
[33] Zvika Brakerski, Craig Gentry, and Vinod Vaikuntanathan. “(Leveled) Fully Homomorphic
Encryption without Bootstrapping”. In: Proceedings of the 3rd Innovations in Theoretical Computer
Science Conference. ITCS ’12. Association for Computing Machinery, 2012, pp. 309–325.
[34] Sebastian Caldas, Peter Wu, Tian Li, Jakub Konečný, H Brendan McMahan, Virginia Smith, and
Ameet Talwalkar. “Leaf: A benchmark for federated settings”. In: Workshop on Federated Learning
for Data Privacy and Confidentiality . 2019.
[35] Dongchul Cha, MinDong Sung, and Yu-Rang Park. “Implementing Vertical Federated Learning
Using Autoencoders: Practical Application, Generalizability, and Utility Study”. In: JMIR Medical
Informatics 9.6 (2021).doi: 10.2196/26598.
[36] Lingjiao Chen, Paraschos Koutris, and Arun Kumar. “Towards model-based pricing for machine
learning in a data marketplace”. In: Proceedings of the 2019 International Conference on
Management of Data. 2019, pp. 1535–1552.
[37] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. “Targeted backdoor attacks on
deep learning systems using data poisoning”. In: arXiv preprint arXiv:1712.05526 (2017).
[38] Jung Hee Cheon, Andrey Kim, Miran Kim, and Yongsoo Song. “Homomorphic Encryption for
Arithmetic of Approximate Numbers”. In: Advances in Cryptology – ASIACRYPT 2017. Ed. by
Tsuyoshi Takagi and Thomas Peyrin. 2017, pp. 409–437.
[39] Christopher A Choquette Choo, Florian Tramer, Nicholas Carlini, and Nicolas Papernot.
“Label-Only Membership Inference Attacks”. In: arXiv preprint arXiv:2007.14321 (2020).
[40] Kamal Choudhary, Brian DeCost, Chi Chen, Anubhav Jain, Francesca Tavazza, Ryan Cohn,
Cheol Woo Park, Alok Choudhary, Ankit Agrawal, Simon JL Billinge, et al. “Recent advances and
applications of deep learning methods in materials science”. In: npj Computational Materials 8.1
(2022), p. 59.
[41] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. “EMNIST: Extending
MNIST to handwritten letters”. In: 2017 International Joint Conference on Neural Networks
(IJCNN). IEEE. 2017, pp. 2921–2926.
[42] James H Cole, Robert Leech, David J Sharp, and Alzheimer’s Disease Neuroimaging Initiative.
“Prediction of brain age suggests accelerated atrophy after traumatic brain injury”. In: Annals of
neurology 77.4 (2015), pp. 571–581.
[43] James H Cole, Rudra PK Poudel, Dimosthenis Tsagkrasoulis, Matthan WA Caan, Claire Steves,
Tim D Spector, and Giovanni Montana. “Predicting brain age with deep learning from raw
imaging data results in a reliable and heritable biomarker”. In: NeuroImage 163 (2017),
pp. 115–124.
[44] Ricardo J Cruz-Correia, Pedro M Vieira-Marques, Ana M Ferreira, Filipa C Almeida,
Jeremy C Wyatt, and Altamiro M Costa-Pereira. “Reviewing the integration of patient data: how
systems are evolving in practice to meet patient needs”. In: BMC medical informatics and decision
making 7.1 (2007), pp. 1–11.
[45] Henggang Cui, James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Abhimanu Kumar,
Jinliang Wei, Wei Dai, Gregory R Ganger, Phillip B Gibbons, et al. “Exploiting bounded staleness
to speed up big data analytics”. In: 2014 USENIX Annual Technical Conference (USENIX ATC
14). 2014, pp. 37–48.
[46] Leonardo Dagum and Ramesh Menon. “OpenMP: an industry standard API for shared-memory
programming”. In: IEEE computational science and engineering 5.1 (1998), pp. 46–55.
[47] Wei Dai. “Learning with Staleness”. PhD thesis. Carnegie Mellon University, 2018.
[48] Wei Dai, Yi Zhou, Nanqing Dong, Hao Zhang, and Eric Xing. “Toward Understanding the Impact
of Staleness in Distributed Machine Learning”. In: International Conference on Learning
Representations. 2019.
[49] Ivan Damgård, Valerio Pastro, Nigel Smart, and Sarah Zakarias. “Multiparty computation from
somewhat homomorphic encryption”. In: Annual Cryptology Conference. Springer. 2012,
pp. 643–662.
[50] Miyuru Dayarathna, Yonggang Wen, and Rui Fan. “Data center energy consumption modeling: A
survey”. In: IEEE Communications Surveys & Tutorials 18.1 (2015), pp. 732–794.
[51] Jeffrey Dean and Luiz André Barroso. “The tail at scale”. In: Communications of the ACM 56.2
(2013), pp. 74–80.
[52] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao,
Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. “Large scale distributed deep
networks”. In: Advances in neural information processing systems. 2012, pp. 1223–1231.
[53] Serkalem Demissie, Michael P LaValley, Nicholas J Horton, Robert J Glynn, and
L Adrienne Cupples. “Bias due to missing exposure data using complete-case analysis in the
proportional hazards regression model”. In: Statistics in medicine 22.4 (2003), pp. 545–557.
[54] Daniel Demmler, Thomas Schneider, and Michael Zohner. “ABY-A framework for efficient
mixed-protocol secure two-party computation.” In: NDSS. 2015.
[55] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for
Computational Linguistics, June 2019, pp. 4171–4186.doi: 10.18653/v1/N19-1423.
[56] Nikhil J. Dhinagar, Sophia I. Thomopoulos, Conor Owens-Walton, Dimitris Stripelis,
Jose Luis Ambite, Greg Ver Steeg, Daniel Weintraub, Philip Cook, Corey McMillan, and
Paul M. Thompson. “3D convolutional neural networks for classification of Alzheimer’s and
Parkinson’s disease with T1-weighted brain MRI”. In: 17th International Symposium on Medical
Information Processing and Analysis. Ed. by Letícia Rittner, Eduardo Romero Castro M.D.,
Natasha Lepore, Jorge Brieva, and Marius George Linguraru. Vol. 12088. International Society for
Optics and Photonics. SPIE, 2021, pp. 277–286.doi: 10.1117/12.2606297.
[57] Dimitrios Dimitriadis, Mirian Hipolito Garcia, Daniel Madrigal Diaz, Andre Manoel, and
Robert Sim. “FLUTE: A Scalable, Extensible Framework for High-Performance Federated
Learning Simulations”. In: arXiv preprint arXiv:2203.13789 (2022).
[58] Nicola K Dinsdale, Emma Bluemke, Stephen M Smith, Zobair Arya, Diego Vidaurre,
Mark Jenkinson, and Ana IL Namburete. “Learning patterns of the ageing brain in MRI using
deep convolutional networks”. In: NeuroImage 224 (2021), p. 117401.
[59] AnHai Doan, Alon Halevy, and Zachary Ives. Principles of data integration. Elsevier, 2012.
[60] Anhai Doan, Alon Halevy, and Zachary Ives. Principles of Data Integration. Morgan Kauffman,
2012.
[61] Shlomi Dolev, Niv Gilboa, and Marina Kopeetsky. “Efficient private multi-party computations of
trust in the presence of curious and malicious users”. In: Journal of Trust Management 1.1 (2014),
p. 8.doi: 10.1186/2196-064x-1-8.
[62] A Rogier T Donders, Geert JMG Van Der Heijden, Theo Stijnen, and Karel GM Moons. “A gentle
introduction to imputation of missing values”. In: Journal of clinical epidemiology 59.10 (2006),
pp. 1087–1091.
[63] Xin Luna Dong and Divesh Srivastava. Big Data Integration. Synthesis Lectures on Data
Management. Morgan & Claypool Publishers, 2015.doi: 10.2200/S00578ED1V01Y201404DTM040.
[64] John C Duchi, Michael I Jordan, and Martin J Wainwright. “Local privacy and statistical minimax
rates”. In: 2013 IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE. 2013,
pp. 429–438.
[65] Jennie Duggan, Aaron J Elmore, Michael Stonebraker, Magda Balazinska, Bill Howe,
Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stan Zdonik. “The bigdawg
polystore system”. In: ACM Sigmod Record 44.2 (2015), pp. 11–16.
[66] Cynthia Dwork. “Differential privacy”. In: International Colloquium on Automata, Languages, and
Programming. Springer. 2006, pp. 1–12.
[67] Cynthia Dwork and Aaron Roth. “The Algorithmic Foundations of Differential Privacy”. In:
Foundations and Trends in Theoretical Computer Science 9.3–4 (2014), pp. 211–407.issn:
1551-305X.doi: 10.1561/0400000042.
[68] Gökcen Eraslan, Žiga Avsec, Julien Gagneur, and Fabian J Theis. “Deep learning: new
computational modelling techniques for genomics”. In: Nature Reviews Genetics 20.7 (2019),
pp. 389–403.
[69] Ronald Fagin, Phokion G Kolaitis, and Lucian Popa. “Data exchange: getting to the core”. In: ACM
Transactions on Database Systems (TODS) 30.1 (2005), pp. 174–210.
[70] Ronald Fagin, Phokion G. Kolaitis, Renée J. Miller, and Lucian Popa. “Data exchange: semantics
and query answering”. In: Theoretical Computer Science 336.1 (2005), pp. 89–124.issn: 0304-3975.
doi: 10.1016/j.tcs.2004.10.033.
[71] Junfeng Fan and F. Vercauteren. “Somewhat Practical Fully Homomorphic Encryption”. In: IACR
Cryptol. ePrint Arch. (2012), p. 144.
[72] Minghong Fang, Xiaoyu Cao, Jinyuan Jia, and Neil Gong. “Local model poisoning attacks to
Byzantine-robust federated learning”. In: 29th USENIX Security Symposium (USENIX Security
20). 2020, pp. 1605–1622.
[73] Ivan P. Felligi and Alan B. Sunter. “A theory for record linkage”. In: Journal of the American
Statistical Association 64.328 (1969), pp. 1183–1210.
[74] César Ferri, José Hernández-Orallo, and R Modroiu. “An experimental comparison of
performance measures for classification”. In: Pattern Recognition Letters 30.1 (2009), pp. 27–38.
[75] Daniela Florescu, Louiqa Raschid, and Patrick Valduriez. “Answering Queries Using OQL View
Expressions.” In: VIEWS. 1996, pp. 84–90.
[76] Christopher Fowler, Stephanie R Rainey-Smith, Sabine Bird, Julia Bomke, Pierrick Bourgeat,
Belinda M Brown, Samantha C Burnham, Ashley I Bush, Carolyn Chadunow, Steven Collins, et al.
“Fifteen years of the australian imaging, biomarkers and lifestyle (AIBL) study: progress and
observations from 2,359 older adults spanning the spectrum from cognitive normality to
Alzheimer’s disease”. In: Journal of Alzheimer’s disease reports Preprint (2021), pp. 1–26.
[77] Katja Franke and Christian Gaser. “Ten years of BrainAGE as a neuroimaging biomarker of brain
aging: what insights have we gained?” In: Frontiers in neurology (2019), p. 789.
[78] Jonathan Frankle and Michael Carbin. “The Lottery Ticket Hypothesis: Finding Sparse, Trainable
Neural Networks”. In: ICLR. 2018.
[79] Clement Fung, Chris JM Yoon, and Ivan Beschastnikh. “Mitigating sybils in federated learning
poisoning”. In: arXiv preprint arXiv:1808.04866 (2018).
[80] Clement Fung, Chris JM Yoon, and Ivan Beschastnikh. “The Limitations of Federated Learning in
Sybil Settings.” In: RAID. 2020, pp. 301–316.
[81] “Game Theory”. In: Norton, 1989. Chap. Shapley Value, pp. 210–216.
[82] Hector Garcia-Molina, Yannis Papakonstantinou, Dallan Quass, Anand Rajaraman,
Yehoshua Sagiv, Jeffrey Ullman, Vasilis Vassalos, and Jennifer Widom. “The TSIMMIS approach to
mediation: Data models and languages”. In: Journal of intelligent information systems 8 (1997),
pp. 117–132.
[83] Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, and Michael Moeller. “Inverting
gradients-how easy is it to break privacy in federated learning?” In: Advances in Neural
Information Processing Systems 33 (2020), pp. 16937–16947.
[84] Craig Gentry. “Fully homomorphic encryption using ideal lattices”. In: Proceedings of the
forty-first annual ACM symposium on Theory of computing . 2009, pp. 169–178.
[85] Tanmay Ghai, Yixiang Yao, Srivatsan Ravi, and Pedro Szekely. Evaluating the Feasibility of a
Provably Secure Privacy-Preserving Entity Resolution Adaptation of PPJoin using Homomorphic
Encryption. 2022.doi: 10.48550/ARXIV.2208.07999.
[86] Gary H Glover, Bryon A Mueller, Jessica A Turner, Theo GM Van Erp, Thomas T Liu,
Douglas N Greve, James T Voyvodic, Jerod Rasmussen, Gregory G Brown, David B Keator, et al.
“Function biomedical informatics research network recommendations for prospective multicenter
functional MRI studies”. In: Journal of Magnetic Resonance Imaging 36.1 (2012), pp. 39–54.
[87] Georg Gottlob, Thomas Lukasiewicz, and Andreas Pieris. “Datalog+/-: Questions and Answers”.
In: 14th International Conference on Principles of Knowledge Representation and Reasoning KR.
Vienna, Austria, 2014.
[88] Goetz Graefe. “Query evaluation techniques for large databases”. In: ACM Computing Surveys
(CSUR) 25.2 (1993), pp. 73–169.
[89] Alistair Grant, Mario Antonioletti, Alastair C Hume, Amy Krause, Bartosz Dobrzelecki,
Michael J Jackson, Mark Parsons, Malcolm P Atkinson, and Elias Theocharopoulos. “OGSA-DAI:
Middleware for data integration: Selected applications”. In: 2008 IEEE Fourth International
Conference on eScience. IEEE. 2008, pp. 343–343.
[90] Jialong Guo, Ziyi Chen, Zhiwei Liu, Xianwei Li, Zhiyuan Xie, Zongguo Wang, and
Yangang Wang. “Neural network training method for materials science based on multi-source
databases”. In: Scientific Reports 12.1 (2022), pp. 1–10.
[91] Umang Gupta, Pradeep Lam, Greg Ver Steeg, and Paul Thompson. “Improved Brain Age
Estimation with Slice-based Set Networks”. In: IEEE International Symposium on Biomedical
Imaging (ISBI). 2021.
[92] Umang Gupta, Dimitris Stripelis, Pradeep K Lam, Paul Thompson, José Luis Ambite, and
Greg Ver Steeg. “Membership inference attacks on deep regression models for neuroimaging”. In:
Medical Imaging with Deep Learning. PMLR. 2021, pp. 228–251.
[93] Alon Y. Halevy. “Answering queries using views: A survey”. In: The VLDB Journal 10.4 (2001),
pp. 270–294.
[94] Song Han, Huizi Mao, and William J Dally. “Deep compression: Compressing deep neural
networks with pruning, trained quantization and huffman coding”. In: arXiv preprint
arXiv:1510.00149 (2015).
[95] Song Han, Jeff Pool, John Tran, and William Dally. “Learning both weights and connections for
efficient neural network”. In: Advances in neural information processing systems 28 (2015).
[96] Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Richard Nock, Giorgio Patrini,
Guillaume Smith, and Brian Thorne. Private federated learning on vertically partitioned data via
entity resolution and additively homomorphic encryption. 2017. arXiv: 1711.10677[cs.LG].
[97] Chaoyang He, Songze Li, Jinhyun So, Xiao Zeng, Mi Zhang, Hongyi Wang, Xiaoyang Wang,
Praneeth Vepakomma, Abhishek Singh, Hang Qiu, et al. “Fedml: A research library and
benchmark for federated machine learning”. In: arXiv preprint arXiv:2007.13518 (2020).
[98] Dennis Heimbigner and Dennis McLeod. “A federated architecture for information management”.
In: ACM Transactions on Information Systems (TOIS) 3.3 (1985), pp. 253–278.
[99] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. “A fast learning algorithm for deep belief
nets”. In: Neural computation 18.7 (2006), pp. 1527–1554.
[100] Briland Hitaj, Giuseppe Ateniese, and Fernando Perez-Cruz. “Deep models under the GAN:
information leakage from collaborative deep learning”. In: Proceedings of the 2017 ACM SIGSAC
conference on computer and communications security. 2017, pp. 603–618.
[101] Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B Gibbons,
Garth A Gibson, Greg Ganger, and Eric P Xing. “More effective distributed ml via a stale
synchronous parallel parameter server”. In: Advances in neural information processing systems.
2013, pp. 1223–1231.
[102] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural Computation 9
(1997), pp. 1735–1780.
[103] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. “Sparsity in
Deep Learning: Pruning and growth for efficient inference and training in neural networks”. In:
arXiv preprint arXiv:2102.00554 (2021).
[104] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. “Measuring the effects of non-identical data
distribution for federated visual classification”. In: arXiv preprint arXiv:1909.06335 (2019).
[105] Rui Hu, Yanmin Gong, and Yuanxiong Guo. “Federated Learning with Sparsification-Amplified
Privacy and Adaptive Optimization”. In: Proceedings of the Thirtieth International Joint Conference
on Artificial Intelligence, IJCAI. Ed. by Zhi-Hua Zhou. 2021. doi: 10.24963/ijcai.2021/202.
[106] Rui Hu, Yanmin Gong, and Yuanxiong Guo. “Federated Learning with Sparsified Model
Perturbation: Improving Accuracy under Client-Level Differential Privacy”. In: arXiv preprint
arXiv:2202.07178 (2022).
[107] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen,
HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. “Gpipe: Efficient training of giant
neural networks using pipeline parallelism”. In: Advances in neural information processing systems
32 (2019).
[108] Eugenia Iofinova, Alexandra Peste, Mark Kurtz, and Dan Alistarh. “How Well Do Sparse
Imagenet Models Transfer?” In: CoRR abs/2111.13445 (2021). arXiv: 2111.13445. url:
https://arxiv.org/abs/2111.13445.
[109] Ramesh Jain. “Out-of-the-box data engineering events in heterogeneous data environments”. In:
Proceedings 19th International Conference on Data Engineering (Cat. No. 03CH37405). IEEE. 2003,
pp. 8–21.
[110] Matthias Jarke and Jurgen Koch. “Query optimization in database systems”. In: ACM Computing
surveys (CsUR) 16.2 (1984), pp. 111–152.
[111] Bargav Jayaraman and David Evans. “Evaluating Differentially Private Machine Learning in
Practice”. In: 28th USENIX Security Symposium (USENIX Security 19). USENIX Association, 2019,
pp. 1895–1912.
[112] Bargav Jayaraman, Lingxiao Wang, David Evans, and Quanquan Gu. “Revisiting membership
inference under realistic assumptions”. In: arXiv preprint arXiv:2005.10881 (2020).
[113] Sumit Kumar Jha, Susmit Jha, Rickard Ewetz, Sunny Raj, Alvaro Velasquez, Laura L Pullum, and
Ananthram Swami. “An Extension of Fano’s Inequality for Characterizing Model Susceptibility to
Membership Inference Attacks”. In: arXiv preprint arXiv:2009.08097 (2020).
[114] Yuang Jiang, Shiqiang Wang, Victor Valls, Bong Jun Ko, Wei-Han Lee, Kin K Leung, and
Leandros Tassiulas. “Model pruning enables efficient federated learning on edge devices”. In: IEEE
Transactions on Neural Networks and Learning Systems (2022).
[115] Zhifeng Jiang, Wei Wang, and Yang Liu. “FLASHE: Additively Symmetric Homomorphic
Encryption for Cross-Silo Federated Learning”. In: CoRR abs/2109.00675 (2021). arXiv: 2109.00675.
url: https://arxiv.org/abs/2109.00675.
[116] José Jiménez-Luna, Francesca Grisoni, and Gisbert Schneider. “Drug discovery with explainable
artificial intelligence”. In: Nature Machine Intelligence 2.10 (2020), pp. 573–584.
[117] Benedikt Atli Jónsson, Gyda Bjornsdottir, TE Thorgeirsson, Lotta María Ellingsen,
G Bragi Walters, DF Gudbjartsson, Hreinn Stefansson, Kari Stefansson, and MO Ulfarsson. “Brain
age prediction using deep learning uncovers associated sequence variants”. In: Nature
communications 10.1 (2019), pp. 1–10.
[118] Chiraag Juvekar, Vinod Vaikuntanathan, and Anantha Chandrakasan. “GAZELLE: A low latency
framework for secure neural network inference”. In: 27th USENIX Security Symposium (USENIX
Security 18). 2018, pp. 1651–1669.
[119] Peter Kairouz, Ziyu Liu, and Thomas Steinke. “The distributed discrete gaussian mechanism for
federated learning with secure aggregation”. In: International Conference on Machine Learning.
PMLR. 2021, pp. 5201–5212.
[120] Peter Kairouz and H. Brendan McMahan. “Advances and Open Problems in Federated Learning”.
In: Foundations and Trends® in Machine Learning 14.1 (2021). issn: 1935-8237. doi:
10.1561/2200000083.
[121] Peter Kairouz and H. Brendan McMahan. “Advances and Open Problems in Federated Learning”.
In: Foundations and Trends® in Machine Learning 14.1 (2021). issn: 1935-8237. doi:
10.1561/2200000083.
[122] Georgios Kaissis, Alexander Ziller, Jonathan Passerat-Palmbach, Théo Ryffel, Dmitrii Usynin,
Andrew Trask, Ionésio Lima, Jason Mancuso, Friederike Jungmann, Marc-Matthias Steinborn,
et al. “End-to-end privacy preserving deep learning on multi-institutional medical imaging”. In:
Nature Machine Intelligence 3.6 (2021), pp. 473–484.
[123] Georgios A Kaissis, Marcus R Makowski, Daniel Rückert, and Rickmer F Braren. “Secure,
privacy-preserving and federated machine learning in medical imaging”. In: Nature Machine
Intelligence 2.6 (2020), pp. 305–311.
[124] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and
Ananda Theertha Suresh. “Scaffold: Stochastic controlled averaging for federated learning”. In:
International Conference on Machine Learning. PMLR. 2020, pp. 5132–5143.
[125] David B Keator, Jeffrey S Grethe, D Marcus, B Ozyurt, Syam Gadde, Sean Murphy, Steve Pieper,
D Greve, R Notestine, H Jeremy Bockholt, et al. “A national human neuroimaging collaboratory
enabled by the Biomedical Informatics Research Network (BIRN)”. In: IEEE Transactions on
Information Technology in Biomedicine 12.2 (2008), pp. 162–172.
[126] Margaret D King, Dylan Wood, Brittny Miller, Ross Kelly, Drew Landis, William Courtney,
Runtang Wang, Jessica A Turner, and Vince D Calhoun. “Automated collection of imaging and
phenotypic data to centralized and distributed data repositories”. In: Frontiers in neuroinformatics
8 (2014), p. 60.
[127] Craig A. Knoblock, Pedro Szekely, José Luis Ambite, Aman Goel, Shubham Gupta,
Kristina Lerman, Maria Muslea, Mohsen Taheriyan, and Parag Mallick. “Semi-Automatically
Mapping Structured Sources into the Semantic Web”. In: Proceedings of the Extended Semantic
Web Conference. Crete, Greece, 2012.
[128] George Konstantinidis and José Luis Ambite. “Scalable query rewriting: a graph-based approach”.
In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM.
2011, pp. 97–108.
[129] Timur Köse, Su Özgür, Erdal Coşgun, Ahmet Keskinoğlu, and Pembe Keskinoğlu. “Effect of
missing data imputation on deep learning prediction performance for vesicoureteral reflux and
recurrent urinary tract infection clinical study”. In: BioMed Research International 2020 (2020).
[130] Nikolaos Koutsouleris, Christos Davatzikos, Stefan Borgwardt, Christian Gaser,
Ronald Bottlender, Thomas Frodl, Peter Falkai, Anita Riecher-Rössler, Hans-Jürgen Möller,
Maximilian Reiser, et al. “Accelerated brain aging in schizophrenia and beyond: a
neuroanatomical marker of psychiatric disorders”. In: Schizophrenia bulletin 40.5 (2014),
pp. 1140–1153.
[131] Alex Krizhevsky, Geoffrey Hinton, et al. “Learning multiple layers of features from tiny images”.
In: (2009).
[132] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep
convolutional neural networks”. In: Communications of the ACM 60.6 (2017), pp. 84–90.
[133] Anil Kuchinad, Petra Schweinhardt, David A Seminowicz, Patrick B Wood, Boris A Chizh, and
M Catherine Bushnell. “Accelerated brain gray matter loss in fibromyalgia patients: premature
aging of the brain?” In: Journal of Neuroscience 27.15 (2007), pp. 4004–4007.
[134] Mark Kurtz, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin,
William Leiserson, Sage Moore, Bill Nell, Nir Shavit, and Dan Alistarh. “Inducing and Exploiting
Activation Sparsity for Fast Inference on Deep Neural Networks”. In: Proceedings of the 37th
International Conference on Machine Learning. Ed. by Hal Daumé III and Aarti Singh. Vol. 119.
Proceedings of Machine Learning Research. Virtual: PMLR, July 2020, pp. 5533–5543. url:
http://proceedings.mlr.press/v119/kurtz20a.html.
[135] Mark Kurtz, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin,
William Leiserson, Sage Moore, Nir Shavit, and Dan Alistarh. “Inducing and exploiting activation
sparsity for fast inference on deep neural networks”. In: International Conference on Machine
Learning. PMLR. 2020, pp. 5533–5543.
[136] John D. Lafferty, Andrew McCallum, and Fernando Pereira. “Conditional Random Fields:
Probabilistic Models for Segmenting and Labeling Sequence Data”. In: ICML. 2001.
[137] Pradeep K Lam, Vigneshwaran Santhalingam, Parth Suresh, Rahul Baboota, Alyssa H Zhu,
Sophia I Thomopoulos, Neda Jahanshad, and Paul M Thompson. “Accurate brain age prediction
using recurrent slice-based networks”. In: 16th International Symposium on Medical Information
Processing and Analysis. Vol. 11583. International Society for Optics and Photonics. 2020,
p. 1158303.
[138] Pamela J. LaMontagne, Tammie LS. Benzinger, John C. Morris, Sarah Keefe, Russ Hornbeck,
Chengjie Xiong, Elizabeth Grant, Jason Hassenstab, Krista Moulder, Andrei G. Vlassenko,
Marcus E. Raichle, Carlos Cruchaga, and Daniel Marcus. “OASIS-3: Longitudinal Neuroimaging,
Clinical, and Cognitive Dataset for Normal Aging and Alzheimer Disease”. In: medRxiv (2019).
doi: 10.1101/2019.12.13.19014902. eprint:
https://www.medrxiv.org/content/early/2019/12/15/2019.12.13.19014902.full.pdf.
[139] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and
Chris Dyer. “Neural Architectures for Named Entity Recognition”. In: Proceedings of the 2016
Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies. San Diego, California: Association for Computational Linguistics, June
2016, pp. 260–270. doi: 10.18653/v1/N16-1030.
[140] Guillaume Lample and Alexis Conneau. “Cross-lingual Language Model Pretraining”. In: NeurIPS.
2019.
[141] Leslie Lamport. “How to make a multiprocessor computer that correctly executes multiprocess
programs”. In: IEEE transactions on computers 28.9 (1979), pp. 690–691.
[142] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. “Gradient-based learning applied
to document recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.
[143] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. “SNIP: SINGLE-SHOT NETWORK
PRUNING BASED ON CONNECTION SENSITIVITY”. In: International Conference on Learning
Representations. 2018.
[144] Klas Leino and Matt Fredrikson. “Stolen Memories: Leveraging Model Memorization for
Calibrated White-Box Membership Inference”. In: 29th USENIX Security Symposium (USENIX
Security 20). 2020, pp. 1605–1622.
[145] Maurizio Lenzerini. “Data integration: A theoretical perspective”. In: Proceedings of the
twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. 2002,
pp. 233–246.
[146] Kwing Hei Li, Pedro Porto Buarque de Gusmão, Daniel J Beutel, and Nicholas D Lane. “Secure
aggregation for federated learning in flower”. In: Proceedings of the 2nd ACM International
Workshop on Distributed Machine Learning. 2021, pp. 8–14.
[147] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski,
James Long, Eugene J Shekita, and Bor-Yiing Su. “Scaling distributed machine learning with the
parameter server”. In: 11th USENIX Symposium on Operating Systems Design and
Implementation (OSDI 14). 2014, pp. 583–598.
[148] Peng Li, Elizabeth A Stuart, and David B Allison. “Multiple imputation: a flexible tool for
handling missing data”. In: Jama 314.18 (2015), pp. 1966–1967.
[149] Qinbin Li, Zeyi Wen, Zhaomin Wu, Sixu Hu, Naibo Wang, Yuan Li, Xu Liu, and Bingsheng He. “A
survey on federated learning systems: vision, hype and reality for data privacy and protection”.
In: IEEE Transactions on Knowledge and Data Engineering (2021).
[150] Shenghui Li, Edith Ngai, Fanghua Ye, and Thiemo Voigt. “Auto-weighted Robust Federated
Learning with Corrupted Data Sources”. In: arXiv preprint arXiv:2101.05880 (2021).
[151] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. “Federated learning: Challenges,
methods, and future directions”. In: IEEE Signal Processing Magazine 37.3 (2020), pp. 50–60.
[152] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith.
“Federated Optimization in Heterogeneous Networks”. In: Proceedings of Machine Learning and
Systems. Vol. 2. 2020, pp. 429–450.
[153] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. “On the Convergence
of FedAvg on Non-IID Data”. In: International Conference on Learning Representations. 2020.
[154] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. “On the convergence
of fedavg on non-iid data”. In: arXiv preprint arXiv:1907.02189 (2019).
[155] Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. “Backdoor learning: A survey”. In: IEEE
Transactions on Neural Networks and Learning Systems (2022).
[156] Gang Liang and Sudarshan S. Chawathe. “Privacy-Preserving Inter-database Operations”. In:
Intelligence and Security Informatics. Ed. by Hsinchun Chen, Reagan Moore, Daniel D. Zeng, and
John Leavitt. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 66–82. isbn:
978-3-540-25952-7.
[157] Bill Yuchen Lin, Chaoyang He, Zihang Zeng, Hulin Wang, Yufen Huang, Christophe Dupuy,
Rahul Gupta, Mahdi Soltanolkotabi, Xiang Ren, and Salman Avestimehr. “Fednlp: Benchmarking
federated learning methods for natural language processing tasks”. In: arXiv preprint
arXiv:2104.08815 (2021).
[158] Wei-Chao Lin and Chih-Fong Tsai. “Missing value imputation: a review and analysis of the
literature (2006–2017)”. In: Artificial Intelligence Review 53 (2020), pp. 1487–1509.
[159] Roderick JA Little and Donald B Rubin. Statistical analysis with missing data. Vol. 793. John Wiley
& Sons, 2019.
[160] Ji Liu, Jizhou Huang, Yang Zhou, Xuhong Li, Shilei Ji, Haoyi Xiong, and Dejing Dou. “From
distributed machine learning to federated learning: A survey”. In: Knowledge and Information
Systems 64.4 (2022), pp. 885–917.
[161] Wei Liu, Li Chen, Yunfei Chen, and Wenyi Zhang. “Accelerating federated learning via
momentum gradient descent”. In: IEEE Transactions on Parallel and Distributed Systems 31.8
(2020), pp. 1754–1766.
[162] Xiaoyuan Liu, Tianneng Shi, Chulin Xie, Qinbin Li, Kangping Hu, Haoyu Kim, Xiaojun Xu, Bo Li,
and Dawn Song. “Unifed: A benchmark for federated learning frameworks”. In: arXiv preprint
arXiv:2207.10308 (2022).
[163] Yi Liu, JQ James, Jiawen Kang, Dusit Niyato, and Shuyu Zhang. “Privacy-preserving traffic flow
prediction: A federated learning approach”. In: IEEE Internet of Things Journal 7.8 (2020),
pp. 7751–7763.
[164] Yuan Liu, Zhengpeng Ai, Shuai Sun, Shuangfeng Zhang, Zelei Liu, and Han Yu. “Fedcoin: A
peer-to-peer payment system for federated learning”. In: Federated Learning. Springer, 2020,
pp. 125–138.
[165] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. “Rethinking the value of
network pruning”. In: arXiv preprint arXiv:1810.05270 (2018).
[166] Guillermo Lloret-Talavera, Marc Jorda, Harald Servat, Fabian Boemer, Chetan Chauhan,
Shigeki Tomishima, Nilesh N Shah, and Antonio J Pena. “Enabling homomorphically encrypted
inference for large DNN models”. In: IEEE Transactions on Computers 71.5 (2021), pp. 1145–1155.
[167] Ilya Loshchilov and Frank Hutter. “Decoupled weight decay regularization”. In: arXiv preprint
arXiv:1711.05101 (2017).
[168] Brenton Louie, Peter Mork, Fernando Martin-Sanchez, Alon Halevy, and Peter Tarczy-Hornoch.
“Data integration and genomic medicine”. In: Journal of biomedical informatics 40.1 (2007),
pp. 5–16.
[169] Heiko Ludwig, Nathalie Baracaldo, Gegi Thomas, Yi Zhou, Ali Anwar, Shashank Rajamoni,
Yuya Ong, Jayaram Radhakrishnan, Ashish Verma, Mathieu Sinn, et al. “IBM Federated Learning:
an Enterprise Framework White Paper V0. 1”. In: arXiv preprint arXiv:2007.10987 (2020).
[170] Bing Luo, Xiang Li, Shiqiang Wang, Jianwei Huang, and Leandros Tassiulas. “Cost-Effective
Federated Learning Design”. In: arXiv preprint arXiv:2012.08336 (2020).
[171] Lingjuan Lyu, Han Yu, Xingjun Ma, Chen Chen, Lichao Sun, Jun Zhao, Qiang Yang, and
S Yu Philip. “Privacy and robustness in federated learning: Attacks and defenses”. In: IEEE
transactions on neural networks and learning systems (2022).
[172] Lingjuan Lyu, Han Yu, Jun Zhao, and Qiang Yang. “Threats to Federated Learning”. In: Federated
Learning. Springer, 2020, pp. 3–16.
[173] Vadim Lyubashevsky, Chris Peikert, and Oded Regev. “On Ideal Lattices and Learning with Errors
over Rings”. In: J. ACM 60.6 (Nov. 2013). issn: 0004-5411. doi: 10.1145/2535925.
[174] Jing Ma, Si-Ahmed Naas, Stephan Sigg, and Xixiang Lyu. “Privacy-preserving federated learning
based on multi-key homomorphic encryption”. In: International Journal of Intelligent Systems
(2022).
[175] Daniel S Marcus, Timothy R Olsen, Mohana Ramaratnam, and Randy L Buckner. “The extensible
neuroimaging archive toolkit”. In: Neuroinformatics 5.1 (2007), pp. 11–33.
[176] Othmane Marfoq, Chuan Xu, Giovanni Neglia, and Richard Vidal. “Throughput-optimal topology
design for cross-silo federated learning”. In: Advances in Neural Information Processing Systems 33
(2020), pp. 19478–19487.
[177] Mónica Marrero, Julián Urbano, Sonia Sánchez-Cuadrado, Jorge Morato, and
Juan Miguel Gómez-Berbís. “Named entity recognition: fallacies, challenges and opportunities”.
In: Computer Standards & Interfaces 35.5 (2013), pp. 482–489.
[178] Joel Mathew, Shobeir Fakhraei, and José Luis Ambite. “Biomedical named entity recognition via
reference-set augmented bootstrapping”. In: arXiv preprint arXiv:1906.00282 (2019).
[179] Friedemann Mattern et al. Virtual time and global states of distributed systems. Citeseer, 1988.
[180] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas.
“Communication-efficient learning of deep networks from decentralized data”. In: Artificial
Intelligence and Statistics. PMLR. 2017, pp. 1273–1282.
[181] H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. “Learning differentially
private recurrent language models”. In: arXiv:1710.06963 (2017).
[182] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. “Exploiting
unintended feature leakage in collaborative learning”. In: 2019 IEEE symposium on security and
privacy (SP). IEEE. 2019, pp. 691–706.
[183] Karla L Miller, Fidel Alfaro-Almagro, Neal K Bangerter, David L Thomas, Essa Yacoub,
Junqian Xu, Andreas J Bartsch, Saad Jbabdi, Stamatios N Sotiropoulos, Jesper LR Andersson, et al.
“Multimodal population brain imaging in the UK Biobank prospective epidemiological study”. In:
Nature neuroscience 19.11 (2016), pp. 1523–1536.
[184] Karla L Miller, Fidel Alfaro-Almagro, Neal K Bangerter, David L Thomas, Essa Yacoub,
Junqian Xu, Andreas J Bartsch, Saad Jbabdi, Stamatios N Sotiropoulos, Jesper LR Andersson, et al.
“Multimodal population brain imaging in the UK Biobank prospective epidemiological study”. In:
Nature neuroscience 19.11 (2016), pp. 1523–1536.
[185] Susanne G Mueller, Michael W Weiner, Leon J Thal, Ronald C Petersen, Clifford Jack,
William Jagust, John Q Trojanowski, Arthur W Toga, and Laurel Beckett. “The Alzheimer’s
disease neuroimaging initiative”. In: Neuroimaging Clinics 15.4 (2005), pp. 869–877.
[186] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur,
Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. “PipeDream: Generalized pipeline
parallelism for DNN training”. In: Proceedings of the 27th ACM Symposium on Operating Systems
Principles. 2019, pp. 1–15.
[187] M. Nasr, R. Shokri, and A. Houmansadr. “Comprehensive Privacy Analysis of Deep Learning:
Passive and Active White-box Inference Attacks against Centralized and Federated Learning”. In:
2019 IEEE Symposium on Security and Privacy (SP). 2019, pp. 739–753.
[188] Milad Nasr, Reza Shokri, and Amir Houmansadr. “Comprehensive privacy analysis of deep
learning: Passive and active white-box inference attacks against centralized and federated
learning”. In: 2019 IEEE symposium on security and privacy (SP). IEEE. 2019, pp. 739–753.
[189] Felix Naumann and Melanie Herschel. An Introduction to Duplicate Detection. Synthesis Lectures
on Data Management. Morgan & Claypool Publishers, 2010.
[190] Takayuki Nishio and Ryo Yonetani. “Client selection for federated learning with heterogeneous
resources in mobile edge”. In: ICC 2019-2019 IEEE international conference on communications
(ICC). IEEE. 2019, pp. 1–7.
[191] Maxence Noble, Aurélien Bellet, and Aymeric Dieuleveut. “Differentially Private Federated
Learning on Heterogeneous Data”. In: arXiv preprint arXiv:2111.09278 (2021).
[192] Martijn Oldenhof, Gergely Ács, Balázs Pejó, Ansgar Schuffenhauer, Nicholas Holway, Noé Sturm,
Arne Dieckmann, Oliver Fortmeier, Eric Boniface, Clément Mayer, et al. “Industry-Scale
Orchestrated Federated Learning for Drug Discovery”. In: arXiv preprint arXiv:2210.08871 (2022).
[193] Terence Orchard and Max A Woodbury. “A missing information principle: theory and
applications”. In: Volume 1 Theory of Statistics. University of California Press, 1972, pp. 697–716.
[194] Pascal Paillier. “Public-key cryptosystems based on composite degree residuosity classes”. In:
International conference on the theory and applications of cryptographic techniques. Springer. 1999,
pp. 223–238.
[195] Sinno Jialin Pan and Qiang Yang. “A survey on transfer learning”. In: IEEE Transactions on
knowledge and data engineering 22.10 (2010), pp. 1345–1359.
[196] Rahul Pandey, Hemant Purohit, Carlos Castillo, and Valerie L Shalin. “Modeling and Mitigating
Human Annotation Errors to Design Efficient Stream Processing Systems with
Human-in-the-loop Machine Learning”. In: International Journal of Human-Computer Studies
(2022), p. 102772.
[197] Han Peng, Weikang Gong, Christian F Beckmann, Andrea Vedaldi, and Stephen M Smith.
“Accurate brain age prediction with lightweight deep neural networks”. In: Medical image
analysis 68 (2021), p. 101871.
[198] Han Peng, Weikang Gong, Christian F. Beckmann, Andrea Vedaldi, and Stephen M. Smith.
“Accurate brain age prediction with lightweight deep neural networks”. In: Medical Image
Analysis (2020), p. 101871. issn: 1361-8415. doi: 10.1016/j.media.2020.101871.
[199] Jeffrey Pennington, Richard Socher, and Christopher Manning. “GloVe: Global Vectors for Word
Representation”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014,
pp. 1532–1543. doi: 10.3115/v1/D14-1162.
[200] Krishna Pillutla, Sham M Kakade, and Zaid Harchaoui. “Robust aggregation for federated
learning”. In: arXiv preprint arXiv:1912.13445 (2019).
[201] Krishna Pillutla, Sham M Kakade, and Zaid Harchaoui. “Robust aggregation for federated
learning”. In: IEEE Transactions on Signal Processing 70 (2022), pp. 1142–1154.
[202] Sergey M Plis, Anand D Sarwate, Dylan Wood, Christopher Dieringer, Drew Landis, Cory Reed,
Sandeep R Panta, Jessica A Turner, Jody M Shoemaker, Kim W Carter, et al. “COINSTAC: a
privacy enabled model and prototype for leveraging and processing decentralized brain imaging
data”. In: Frontiers in neuroscience 10 (2016), p. 365.
[203] Yuriy Polyakov, Kurt Rohloff, and Gerard W Ryan. “Palisade lattice cryptography library user
manual”. In: Cybersecurity Research Center, New Jersey Institute of Technology (NJIT), Tech. Rep 15
(2017).
[204] Samira Pouyanfar, Saad Sadiq, Yilin Yan, Haiman Tian, Yudong Tao, Maria Presa Reyes,
Mei-Ling Shyu, Shu-Ching Chen, and Sundaraja S Iyengar. “A survey on deep learning:
Algorithms, techniques, and applications”. In: ACM Computing Surveys (CSUR) 51.5 (2018),
pp. 1–36.
[205] Apostolos Pyrgelis, Carmela Troncoso, and Emiliano De Cristofaro. “Knock Knock, Who’s There?
Membership Inference on Aggregate Location Data”. In: CoRR abs/1708.06145 (2017). arXiv:
1708.06145.
[206] Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence.
Dataset shift in machine learning. Mit Press, 2008.
[207] Vibhor Rastogi and Suman Nath. “Differentially private aggregation of distributed time-series
with transformation and encryption”. In: Proceedings of the 2010 ACM SIGMOD International
Conference on Management of data. 2010, pp. 735–746.
[208] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný,
Sanjiv Kumar, and H Brendan McMahan. “Adaptive Federated Optimization”. In: arXiv preprint
arXiv:2003.00295 (2020).
[209] Oded Regev. “On Lattices, Learning with Errors, Random Linear Codes, and Cryptography”. In: J.
ACM 56.6 (Sept. 2009). issn: 0004-5411. doi: 10.1145/1568318.1568324.
[210] G Anthony Reina, Alexey Gruzdev, Patrick Foley, Olga Perepelkina, Mansi Sharma,
Igor Davidyuk, Ilya Trushkin, Maksim Radionov, Aleksandr Mokrov, Dmitry Agapov, et al.
“OpenFL: An open-source framework for Federated Learning”. In: arXiv preprint arXiv:2105.06413
(2021).
[211] Nicola Rieke, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger Roth, Shadi Albarqouni,
Spyridon Bakas, Mathieu N Galtier, Bennett Landman, Klaus Maier-Hein, et al. “The future of
digital health with federated learning”. In: npj Digital Medicine 3.119 (2020).
[212] David Rolnick, Andreas Veit, Serge J. Belongie, and Nir Shavit. “Deep Learning is Robust to
Massive Label Noise”. In: CoRR abs/1705.10694 (2017). arXiv: 1705.10694. url:
http://arxiv.org/abs/1705.10694.
[213] Holger R Roth, Yan Cheng, Yuhong Wen, Isaac Yang, Ziyue Xu, Yuan-Ting Hsieh,
Kristopher Kersten, Ahmed Harouni, Can Zhao, Kevin Lu, et al. “NVIDIA FLARE: Federated
Learning from Simulation to Real-World”. In: arXiv preprint arXiv:2210.13291 (2022).
[214] Donald B Rubin. Multiple imputation for nonresponse in surveys. Vol. 81. John Wiley & Sons, 2004.
[215] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations
by error propagation. Tech. rep. California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
[216] Mohamed Sabt, Mohammed Achemlal, and Abdelmadjid Bouabdallah. “Trusted execution
environment: what it is, and what it is not”. In: 2015 IEEE Trustcom/BigDataSE/ISPA. Vol. 1. IEEE.
2015, pp. 57–64.
[217] Ahmed Salem, Yang Zhang, Mathias Humbert, Mario Fritz, and Michael Backes. “ML-Leaks:
Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning
Models”. In: Network and Distributed Systems Security Symposium 2019. Internet Society. 2019.
[218] Erik F Sang and Fien De Meulder. “Introduction to the CoNLL-2003 shared task:
Language-independent named entity recognition”. In: arXiv preprint cs/0306050 (2003).
[219] Sinem Sav, Apostolos Pyrgelis, Juan Ramón Troncoso-Pastoriza, David Froelicher,
Jean-Philippe Bossuat, Joao Sa Sousa, and Jean-Pierre Hubaux. “POSEIDON: Privacy-Preserving
Federated Neural Network Learning”. In: CoRR abs/2009.00349 (2020). arXiv: 2009.00349. url:
https://arxiv.org/abs/2009.00349.
[220] Neil Savage. “How AI and neuroscience drive each other forwards”. In: Nature 571.7766 (2019),
S15–S15.
[221] Stefano Savazzi, Monica Nicoli, Mehdi Bennis, Sanaz Kianoush, and Luca Barbieri. “Opportunities
of federated learning in connected, cooperative, and automated industrial systems”. In: IEEE
Communications Magazine 59.2 (2021), pp. 16–21.
[222] Monica Scannapieco, Ilya Figotin, Elisa Bertino, and Ahmed K. Elmagarmid. “Privacy Preserving
Schema and Data Matching”. In: Proceedings of the 2007 ACM SIGMOD International Conference on
Management of Data. SIGMOD ’07. Beijing, China: Association for Computing Machinery, 2007,
pp. 653–664. isbn: 9781595936868. doi: 10.1145/1247480.1247553.
[223] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini.
“The graph neural network model”. In: IEEE Transactions on Neural Networks 20.1 (2008), pp. 61–80.
[224] Christopher G Schwarz, Walter K Kremers, Terry M Therneau, Richard R Sharp, Jeffrey L Gunter,
Prashanthi Vemuri, Arvin Arani, Anthony J Spychalla, Kejal Kantarci, David S Knopman, et al.
“Identification of anonymous MRI research participants with face-recognition software”. In: New
England Journal of Medicine 381.17 (2019), pp. 1684–1686.
[225] Observational Health Data Sciences and Informatics. The Book of OHDSI.
https://ohdsi.github.io/TheBookOfOhdsi/. OHDSI, 2019.
[226] Adam Scott, William Courtney, Dylan Wood, Raul De la Garza, Susan Lane, Runtang Wang,
Margaret King, Jody Roberts, Jessica A Turner, and Vince D Calhoun. “COINS: an innovative
informatics and neuroimaging tool suite built for large heterogeneous datasets”. In: Frontiers in
neuroinformatics 5 (2011), p. 33.
[227] Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras,
and Tom Goldstein. “Poison frogs! targeted clean-label poisoning attacks on neural networks”. In:
Advances in neural information processing systems 31 (2018).
[228] Micah J Sheller, Brandon Edwards, G Anthony Reina, Jason Martin, Sarthak Pati,
Aikaterini Kotrotsou, Mikhail Milchenko, Weilin Xu, Daniel Marcus, Rivka R Colen, et al.
“Federated learning in medicine: facilitating multi-institutional collaborations without sharing
patient data”. In: Scientific reports 10.1 (2020), pp. 1–12.
[229] Shiqi Shen, Shruti Tople, and Prateek Saxena. “Auror: Defending against poisoning attacks in
collaborative deep learning systems”. In: Proceedings of the 32nd Annual Conference on Computer
Security Applications. 2016, pp. 508–519.
[230] Zheyan Shen, Jiashuo Liu, Yue He, Xingxuan Zhang, Renzhe Xu, Han Yu, and Peng Cui. “Towards
out-of-distribution generalization: A survey”. In: arXiv preprint arXiv:2108.13624 (2021).
[231] Amit P Sheth and James A Larson. “Federated database systems for managing distributed,
heterogeneous, and autonomous databases”. In: ACM Computing Surveys (CSUR) 22.3 (1990),
pp. 183–236.
[232] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. “Membership Inference
Attacks Against Machine Learning Models”. In: 2017 IEEE Symposium on Security and Privacy
(SP). 2017, pp. 3–18.
[233] Jinhyun So, Chaoyang He, Chien-Sheng Yang, Songze Li, Qian Yu, Ramy E Ali, Basak Guler, and
Salman Avestimehr. “Lightsecagg: a lightweight and versatile design for secure aggregation in
federated learning”. In: Proceedings of Machine Learning and Systems 4 (2022), pp. 694–720.
[234] Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. “Learning from noisy
labels with deep neural networks: A survey”. In: IEEE Transactions on Neural Networks and
Learning Systems (2022).
[235] Liwei Song and Prateek Mittal. “Systematic Evaluation of Privacy Risks of Machine Learning
Models”. In: arXiv preprint arXiv:2003.10595 (2020).
[236] Michael R Sprague, Amir Jalalirad, Marco Scavuzzo, Catalin Capota, Moritz Neun, Lyman Do, and
Michael Kopp. “Asynchronous federated learning for geospatial applications”. In: Joint European
Conference on Machine Learning and Knowledge Discovery in Databases. Springer. 2018, pp. 21–28.
[237] Pratyush Mishra, Ryan Lehmkuhl, Akshayaram Srinivasan, Wenting Zheng, and Raluca Ada Popa.
“DELPHI: A cryptographic inference service for neural networks”. In: Proc. 29th USENIX Secur. Symp. 2019, pp. 2505–2522.
[238] Jacob Steinhardt, Pang Wei Koh, and Percy Liang. “Certified defenses for data poisoning attacks”.
In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017,
pp. 3520–3532.
[239] Daniel J Stekhoven and Peter Bühlmann. “MissForest—non-parametric missing value imputation
for mixed-type data”. In: Bioinformatics 28.1 (2012), pp. 112–118.
[240] Dimitris Stripelis, Marcin Abram, and Jose Luis Ambite. “Performance weighting for robust
federated learning against corrupted sources”. In: arXiv preprint arXiv:2205.01184 (2022).
[241] Dimitris Stripelis and Jose Luis Ambite. “Accelerating Federated Learning in Heterogeneous Data
and Computational Environments”. In: arXiv preprint arXiv:2008.11281 (2020).
[242] Dimitris Stripelis, José Luis Ambite, Yao-Yi Chiang, Sandrah P Eckel, and Rima Habre. “A scalable
data integration and analysis architecture for sensor data of pediatric asthma”. In: 2017 IEEE 33rd
International Conference on Data Engineering (ICDE). IEEE. 2017, pp. 1407–1408.
[243] Dimitris Stripelis, José Luis Ambite, Pradeep Lam, and Paul Thompson. “Scaling Neuroscience
Research using Federated Learning”. In: IEEE International Symposium on Biomedical Imaging
(ISBI). 2021.
[244] Dimitris Stripelis, Hamza Saleem, Tanmay Ghai, Nikhil Dhinagar, Umang Gupta,
Chrysovalantis Anastasiou, Greg Ver Steeg, Srivatsan Ravi, Muhammad Naveed,
Paul M.Thompson, and José Luis Ambite. “Secure Neuroimaging Analysis using Federated
Learning with Homomorphic Encryption”. In: 17th International Symposium on Medical
Information Processing and Analysis (SIPAIM). Campinas, Brazil, 2021.
[245] Dimitris Stripelis, Hamza Saleem, Tanmay Ghai, Nikhil Dhinagar, Umang Gupta,
Chrysovalantis Anastasiou, Greg Ver Steeg, Srivatsan Ravi, Muhammad Naveed,
Paul M Thompson, et al. “Secure neuroimaging analysis using federated learning with
homomorphic encryption”. In: 17th International Symposium on Medical Information Processing
and Analysis. Vol. 12088. SPIE. 2021, pp. 351–359.
[246] Dimitris Stripelis, Paul M Thompson, and José Luis Ambite. “Semi-synchronous federated
learning for energy-efficient training and accelerated convergence in cross-silo settings”. In: ACM
Transactions on Intelligent Systems and Technology (TIST) 13.5 (2022), pp. 1–29.
[247] Ananda Theertha Suresh, Brendan McMahan, Peter Kairouz, and Ziteng Sun. “Can You Really
Backdoor Federated Learning?” In: International Workshop on Federated Learning for Data Privacy
and Confidentiality, NeurIPS. 2019.
[248] Arun Swami. “Optimization of large join queries: Combining heuristics and combinatorial
techniques”. In: Proceedings of the 1989 ACM SIGMOD International Conference on Management of
data. 1989, pp. 367–376.
[249] Zhenheng Tang, Xiaowen Chu, Ryan Yide Ran, Sunwoo Lee, Shaohuai Shi, Yonggang Zhang,
Yuxin Wang, Alex Qiaozhong Liang, Salman Avestimehr, and Chaoyang He. “FedML Parrot: A
Scalable Federated Learning System via Heterogeneity-aware Scheduling on Sequential and
Hierarchical Training”. In: arXiv preprint arXiv:2303.01778 (2023).
PALISADE team. PALISADE Lattice Cryptography Library (release 1.10.6). 2020. url:
https://palisade-crypto.org.
[251] Paul M Thompson, Jason L Stein, Sarah E Medland, Derrek P Hibar, Alejandro Arias Vasquez,
Miguel E Renteria, Roberto Toro, Neda Jahanshad, Gunter Schumann, Barbara Franke, et al. “The
ENIGMA Consortium: large-scale collaborative analyses of neuroimaging and genetic data”. In:
Brain imaging and behavior 8.2 (2014), pp. 153–182.
[252] Vale Tolpegin, Stacey Truex, Mehmet Emre Gursoy, and Ling Liu. “Data poisoning attacks against
federated learning systems”. In: European Symposium on Research in Computer Security. Springer.
2020, pp. 480–501.
[253] Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani,
David Botstein, and Russ B Altman. “Missing value estimation methods for DNA microarrays”.
In: Bioinformatics 17.6 (2001), pp. 520–525.
[254] Stacey Truex, Nathalie Baracaldo, Ali Anwar, Thomas Steinke, Heiko Ludwig, and Rui Zhang. “A
Hybrid Approach to Privacy-Preserving Federated Learning”. In: CoRR abs/1812.03224 (2018).
arXiv: 1812.03224. url: http://arxiv.org/abs/1812.03224.
[255] Stacey Truex, Ling Liu, Mehmet Emre Gursoy, Lei Yu, and Wenqi Wei. “Demystifying
membership inference attacks in machine learning as a service”. In: IEEE Transactions on Services
Computing (2019).
[256] Stacey Truex, Ling Liu, Mehmet Emre Gursoy, Lei Yu, and Wenqi Wei. “Towards Demystifying
Membership Inference Attacks”. In: arXiv preprint arXiv:1807.09173 (2018).
[257] Ye Lin Tun, Kyi Thar, Chu Myaet Thwal, and Choong Seon Hong. “Federated learning based
energy demand prediction with clustered aggregation”. In: 2021 IEEE International Conference on
Big Data and Smart Computing (BigComp). IEEE. 2021, pp. 164–167.
[258] Jessica A Turner. “The rise of large-scale imaging studies in psychiatry”. In: GigaScience 3.1
(2014), p. 29.
[259] Jessica A Turner, Danielle Pasquerello, Matthew D Turner, David B Keator, Kathryn Alpert,
Margaret King, Drew Landis, Vince D Calhoun, Steven G Potkin, Marcelo Tallis, et al.
“Terminology development towards harmonizing multiple clinical neuroimaging research
repositories”. In: International Conference on Data Integration in the Life Sciences. Springer. 2015,
pp. 104–117.
[260] Jeffrey Ullman. “Information integration using logical views”. In: ICDT. Vol. 97. 1997, pp. 19–40.
[261] Stef Van Buuren. Flexible imputation of missing data. CRC press, 2018.
[262] Stef Van Buuren and Karin Groothuis-Oudshoorn. “mice: Multivariate imputation by chained
equations in R”. In: Journal of statistical software 45.1 (2011), pp. 1–67.
[263] Chaoqi Wang, Guodong Zhang, and Roger Grosse. “Picking Winning Tickets Before Training by
Preserving Gradient Flow”. In: ICLR. 2019.
[264] Hongyi Wang, Kartik Sreenivasan, Shashank Rajput, Harit Vishwakarma, Saurabh Agarwal,
Jy-yong Sohn, Kangwook Lee, and Dimitris Papailiopoulos. “Attack of the tails: Yes, you really
can backdoor federated learning”. In: arXiv preprint arXiv:2007.05084 (2020).
[265] Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H Brendan McMahan,
Maruan Al-Shedivat, Galen Andrew, Salman Avestimehr, Katharine Daly, Deepesh Data, et al. “A
field guide to federated optimization”. In: arXiv preprint arXiv:2107.06917 (2021).
[266] Jindong Wang, Cuiling Lan, Chang Liu, Yidong Ouyang, Tao Qin, Wang Lu, Yiqiang Chen,
Wenjun Zeng, and Philip Yu. “Generalizing to unseen domains: A survey on domain
generalization”. In: IEEE Transactions on Knowledge and Data Engineering (2022).
[267] Lei Wang, Alex Kogan, Derin Cobia, Kathryn Alpert, Anthony Kolasny, Michael I Miller, and
Daniel Marcus. “Northwestern University schizophrenia data and software tool (NUSDAST)”. In:
Frontiers in neuroinformatics 7 (2013), p. 25.
[268] Shiqiang Wang, Tiffany Tuor, Theodoros Salonidis, Kin K Leung, Christian Makaya, Ting He, and
Kevin Chan. “Adaptive federated learning in resource constrained edge computing systems”. In:
IEEE Journal on Selected Areas in Communications 37.6 (2019), pp. 1205–1221.
[269] Yuao Wang, Tianqing Zhu, Wenhan Chang, Sheng Shen, and Wei Ren. “Model Poisoning Defense
on Federated Learning: A Validation Based Approach”. In: International Conference on Network
and System Security. Springer. 2020, pp. 207–223.
[270] Kang Wei, Jun Li, Ming Ding, Chuan Ma, Yo-Seb Jeon, and H Vincent Poor. “Covert Model
Poisoning Against Federated Learning: Algorithm Design and Optimization”. In: arXiv preprint
arXiv:2101.11799 (2021).
[271] Kang Wei, Jun Li, Ming Ding, Chuan Ma, Howard H Yang, Farhad Farokhi, Shi Jin, Tony QS Quek,
and H Vincent Poor. “Federated learning with differential privacy: Algorithms and performance
analysis”. In: IEEE Transactions on Information Forensics and Security 15 (2020), pp. 3454–3469.
[272] Gio Wiederhold. “Mediators in the Architecture of Future Information Systems”. In: IEEE
Computer 25.3 (Mar. 1992), pp. 38–49.
[273] David A Wood, Sina Kafiabadi, Ayisha Al Busaidi, Emily Guilhem, Antanas Montvila,
Jeremy Lynch, Matthew Townend, Siddharth Agarwal, Asif Mazumder, Gareth J Barker, et al.
“Accurate brain-age models for routine clinical MRI examinations”. In: NeuroImage (2022),
p. 118871.
[274] Qiong Wu, Kaiwen He, and Xu Chen. “Personalized federated learning for intelligent IoT
applications: A cloud-edge based framework”. In: IEEE Open Journal of the Computer Society 1
(2020), pp. 35–44.
[275] Yuncheng Wu, Shaofeng Cai, Xiaokui Xiao, Gang Chen, and Beng Chin Ooi. “Privacy preserving
vertical federated learning for tree-based models”. In: arXiv preprint arXiv:2008.06170 (2020).
[276] Guohui Xiao, Diego Calvanese, Roman Kontchakov, Domenico Lembo, Antonella Poggi,
Riccardo Rosati, and Michael Zakharyaschev. “Ontology-Based Data Access: A Survey”. In: 27th
International Joint Conference on Artificial Intelligence, IJCAI . Stockholm, Sweden, 2018,
pp. 5511–5519.
[277] Han Xiao, Huang Xiao, and Claudia Eckert. “Adversarial Label Flips Attack on Support Vector
Machines.” In: ECAI. 2012, pp. 870–875.
[278] Huang Xiao, Battista Biggio, Blaine Nelson, Han Xiao, Claudia Eckert, and Fabio Roli. “Support
vector machines under adversarial label contamination”. In: Neurocomputing 160 (2015),
pp. 53–62.
[279] Chulin Xie, Keli Huang, Pin-Yu Chen, and Bo Li. “Dba: Distributed backdoor attacks against
federated learning”. In: International Conference on Learning Representations. 2019.
[280] Chulin Xie, Keli Huang, Pin-Yu Chen, and Bo Li. “Dba: Distributed backdoor attacks against
federated learning”. In: International conference on learning representations. 2020.
[281] Cong Xie, Sanmi Koyejo, and Indranil Gupta. “Asynchronous federated optimization”. In: 12th
Annual Workshop on Optimization for Machine Learning. 2020.
[282] Runhua Xu, Nathalie Baracaldo, Yi Zhou, Ali Anwar, James Joshi, and Heiko Ludwig. FedV:
Privacy-Preserving Federated Learning over Vertically Partitioned Data. 2021. arXiv: 2103.03918
[cs.LG].
[283] Runhua Xu, Nathalie Baracaldo, Yi Zhou, Ali Anwar, and Heiko Ludwig. “HybridAlpha: An
Efficient Approach for Privacy-Preserving Federated Learning”. In: Proceedings of the 12th ACM
Workshop on Artificial Intelligence and Security, AISec@CCS 2019, London, UK, November 15, 2019 .
Ed. by Lorenzo Cavallaro, Johannes Kinder, Sadia Afroz, Battista Biggio, Nicholas Carlini,
Yuval Elovici, and Asaf Shabtai. ACM, 2019, pp. 13–23. doi: 10.1145/3338501.3357371.
[284] Mengmeng Yang, Lingjuan Lyu, Jun Zhao, Tianqing Zhu, and Kwok-Yan Lam. “Local differential
privacy and its applications: A comprehensive survey”. In: arXiv preprint arXiv:2008.03686 (2020).
[285] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. “Federated machine learning: Concept
and applications”. In: ACM Transactions on Intelligent Systems and Technology (TIST) 10.2 (2019),
pp. 1–19.
[286] Shengwen Yang, Bing Ren, Xuhui Zhou, and Liping Liu. “Parallel distributed logistic regression
for vertical federated learning without third-party coordinator”. In: arXiv preprint
arXiv:1911.09824 (2019).
[287] Andrew C Yao. “Protocols for secure computations”. In: 23rd annual symposium on foundations of
computer science (sfcs 1982). IEEE. 1982, pp. 160–164.
[288] Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. “Byzantine-robust distributed
learning: Towards optimal statistical rates”. In: International Conference on Machine Learning.
PMLR. 2018, pp. 5650–5659.
[289] Jinsung Yoon, James Jordon, and Mihaela Schaar. “Gain: Missing data imputation using generative
adversarial nets”. In: International conference on machine learning. PMLR. 2018, pp. 5689–5698.
[290] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, Ion Stoica, et al. “Spark:
Cluster computing with working sets.” In: HotCloud 10.10-10 (2010), p. 95.
[291] Chengliang Zhang, Suyi Li, Junzhe Xia, Wei Wang, Feng Yan, and Yang Liu. “Batchcrypt: Efficient
homomorphic encryption for cross-silo federated learning”. In: 2020 USENIX Annual Technical
Conference. USENIX, 2020, pp. 493–506.
[292] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. “Understanding
deep learning (still) requires rethinking generalization”. In: Communications of the ACM 64.3
(2021), pp. 107–115.
[293] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. “Understanding
deep learning requires rethinking generalization”. In: International Conference on Learning
Representations. 2017.
[294] Zhongheng Zhang. “Missing data imputation: focusing on single imputation”. In: Annals of
translational medicine 4.1 (2016).
[295] Lingchen Zhao, Shengshan Hu, Qian Wang, Jianlin Jiang, Shen Chao, Xiangyang Luo, and
Pengfei Hu. “Shielding collaborative learning: Mitigating poisoning attacks through client-side
detection”. In: IEEE Transactions on Dependable and Secure Computing (2020).
[296] Yang Zhao, Jun Zhao, Mengmeng Yang, Teng Wang, Ning Wang, Lingjuan Lyu, Dusit Niyato, and
Kwok-Yan Lam. “Local differential privacy-based federated learning for internet of things”. In:
IEEE Internet of Things Journal 8.11 (2020), pp. 8836–8853.
[297] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. “Federated
learning with non-iid data”. In: arXiv preprint arXiv:1806.00582 (2018).
[298] Zhaohua Zheng, Yize Zhou, Yilong Sun, Zhang Wang, Boyi Liu, and Keqiu Li. “Applications of
federated learning in smart cities: recent advances, taxonomy, and open challenges”. In:
Connection Science 34.1 (2022), pp. 1–28.
[299] Ligeng Zhu, Zhijian Liu, and Song Han. “Deep leakage from gradients”. In: Advances in neural
information processing systems 32 (2019).
[300] Michael Zhu and Suyog Gupta. “To Prune, or Not to Prune: Exploring the Efficacy of Pruning for
Model Compression”. In: 6th International Conference on Learning Representations, ICLR 2018,
Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings. OpenReview.net, 2018.
url: https://openreview.net/forum?id=Sy1iIDkPM.
Appendix A
Training Protocols Convergence Analysis
A.1 Semi-Synchronous
In this section, we analyze the performance of the federated model by quantifying the weight divergence between the federated model and its centralized counterpart [285, 297]. The smaller the weight divergence, the better the model performance (e.g., higher accuracy).
Weight Divergence. Following the theoretical analysis of [285, 297], we want to bound the weight divergence of the federated model computed using the semi-synchronous protocol relative to the centralized model. Consider a classification problem with $C$ classes that follows a joint probability distribution $p(x, y)$, where $x \in \mathcal{X}$ and $y \in \mathcal{C}$. The probability of class $c$ in the dataset is $p(y=c) = \sum_{x} p(x, c)$. Let the function $\phi$ map the input $x$ to a probability vector $\nu$ over the classes, with $\phi_c$ denoting the probability of $x$ belonging to class $c$. The neural network implements the map $\phi$, which is parameterized by the weights $w$ of the network.
Let $w_{c,r}$ denote the community/global model of the federation at federation round $r$ and $w_z$ the centralized model. We define the population loss $\ell(w)$ through the cross-entropy loss:
$$\ell(w) = \mathbb{E}_{(x,y)\sim(\mathcal{X},\mathcal{Y})}\Big[\sum_{c=1}^{C} \mathbf{1}_{y=c}\,\log \phi_c(w; x)\Big] = \sum_{c=1}^{C} p(y=c)\,\mathbb{E}_{x|y=c}\big[\log \phi_c(w; x)\big]$$
In a centralized setting, we want to optimize the loss function by performing consecutive iterations/steps
over the training dataset. The update rule (using SGD) for a single step is:
$$w_z = w_{z-1} - \eta \sum_{c=1}^{C} p(y=c)\,\nabla_w \mathbb{E}_{x|y=c}\big[\log \phi_c(w; x)\big]$$
In a federated setting where $N$ learners participate at every round (full client participation), every learner $k$ has its own unique local training dataset and data distribution, with $D_i \cap D_j = \emptyset$ and $p_i(y=c) \neq p_j(y=c)$ for $i \neq j$, $|D_k| = n_k$, and $n = \sum_{k=1}^{N} n_k$. The update rule for a global iteration (using SGD), after every learner has performed a single (local) step, is given by:
$$w_{c,r} = \sum_{k=1}^{N} \frac{n_k}{n}\Big(w_{c,r-1} - \eta \sum_{c=1}^{C} p_k(y=c)\,\nabla_w \mathbb{E}_{x|y=c}\big[\log \phi_c(w; x)\big]\Big)$$
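To make the aggregation concrete, here is a minimal sketch (illustrative only, not the thesis's system implementation; all names are assumed) of computing the community model as a normalized weighted average of the learners' local models. The weights can be the data sizes $n_k$, as in the update rule above, or the local step counts $B_k$, as used in the proof of Proposition 1 below.

```python
# Minimal sketch (illustrative) of weighted federated model aggregation.
import numpy as np

def aggregate(models, weights):
    """Return sum_k (weights[k] / sum(weights)) * models[k]."""
    weights = np.asarray(weights, dtype=float)
    scales = weights / weights.sum()   # n_k/n or B_k/B mixing coefficients
    return sum(s * m for s, m in zip(scales, models))

# Toy usage: three learners, each holding a 4-parameter local model.
local_models = [np.random.randn(4) for _ in range(3)]
n_k = [1000, 500, 250]   # training samples per learner -> n_k/n weighting
B_k = [32, 16, 8]        # local steps per learner      -> B_k/B weighting
community_by_data = aggregate(local_models, n_k)
community_by_steps = aggregate(local_models, B_k)
```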
Proposition 1. Consider a classification problem with $C$ classes and a federation of $N$ learners, with each learner $k$ owning $n_k$ training samples that follow a non-IID data distribution $p_k$, where $p_k \neq p_j$, $\forall k \neq j$. If $\nabla_w \mathbb{E}_{x|y=c} \log \phi_c(x, w)$ is $M_{x|y=c}$-Lipschitz for each class $c \in [C]$ and every learner synchronizes its local model every $B_k$ local steps, then the weight divergence of SemiSync is bounded by:
$$\|w_{c,r} - w_z\| \leq \sum_{k=1}^{N} \frac{B_{\max}}{B}\Big(\alpha_k^{B_{\max}}\,\big\|w^{B_{\max}}_{k,r-1} - w^{B_{\max}}_{z-1}\big\| + \eta \sum_{c=1}^{C} \|p_k(y=c) - p(y=c)\|\Big(\sum_{j=0}^{B_{\max}-1} (\alpha_k)^j\, g_{\max}\big(w^{B_{\max}-1-j}_{z-1}\big)\Big)\Big)$$
Proof. Let $w^{B_k}_{k,r}$ denote the local model of learner $k$ at federation round $r$ after $B_k$ local steps, $B = \sum_{k=1}^{N} B_k$ the total number of iterations across all learners, and $w^{B}_z$ the centralized model after $B$ iterations. To measure weight divergence when learners perform a disproportionate number of steps during local training, we analyze the convergence of the federation model by weighting the local model of each learner by the total number of local steps it performed during training, i.e., $w_{c,r} = \sum_{k=1}^{N} \frac{B_k}{B}\, w^{B_k}_{k,r}$:
$$\begin{aligned}
\|w_{c,r} - w_z\| &= \Big\|\sum_{k=1}^{N} \frac{B_k}{B}\, w^{B_k}_{k,r} - w^{B}_z\Big\| \leq \sum_{k=1}^{N} \frac{B_k}{B}\,\big\|w^{B_k}_{k,r} - w^{B}_z\big\| \\
&= \sum_{k=1}^{N} \frac{B_k}{B}\,\Big\|w^{B_k-1}_{k,r} - \eta \sum_{c=1}^{C} p_k(y=c)\,\nabla_w \mathbb{E}_{x|y=c} \log \phi_c\big(x, w^{B_k-1}_{k,r}\big) - w^{B-1}_z + \eta \sum_{c=1}^{C} p(y=c)\,\nabla_w \mathbb{E}_{x|y=c} \log \phi_c\big(x, w^{B-1}_z\big)\Big\| \\
&\overset{(1)}{\leq} \sum_{k=1}^{N} \frac{B_k}{B}\,\big\|w^{B_k-1}_{k,r} - w^{B-1}_z\big\| + \sum_{k=1}^{N} \frac{B_k}{B}\,\eta\,\Big\|\sum_{k=1}^{N} \frac{n_k}{n} \sum_{c=1}^{C} p_k(y=c)\Big(\nabla_w \mathbb{E}_{x|y=c} \log \phi_c\big(x, w^{B_k-1}_{k,r}\big) - \nabla_w \mathbb{E}_{x|y=c} \log \phi_c\big(x, w^{B-1}_z\big)\Big)\Big\| \\
&\overset{(2)}{\leq} \sum_{k=1}^{N} \frac{B_k}{B}\,\big\|w^{B_k-1}_{k,r} - w^{B-1}_z\big\| + \sum_{k=1}^{N} \frac{B_k}{B}\,\eta \sum_{k=1}^{N} \frac{n_k}{n} \sum_{c=1}^{C} p_k(y=c)\, M\,\big\|w^{B_k-1}_{k,r} - w^{B-1}_z\big\| \\
&= \sum_{k=1}^{N} \frac{B_k}{B}\Big(1 + \eta \sum_{k=1}^{N} \frac{n_k}{n} \sum_{c=1}^{C} p_k(y=c)\, M\Big)\big\|w^{B_k-1}_{k,r} - w^{B-1}_z\big\| \\
&\overset{(3)}{=} \sum_{k=1}^{N} \frac{B_k}{B}\,\alpha_k\,\big\|w^{B_k-1}_{k,r} - w^{B-1}_z\big\| \qquad (A.1)
\end{aligned}$$
Inequality (1) holds because the global dataset population for any class $c \in [C]$ is equal to the weighted average of the individual class populations within each learner, i.e., $p(y=c) = \sum_{k=1}^{N} \frac{n_k}{n}\, p_k(y=c)$. Inequality (2) holds because we assume $\nabla_w \mathbb{E}_{x|y=c} \log \phi_c(x, w)$ to be $M_{x|y=c}$-Lipschitz. For equality (3) we set $\alpha_k = \big(1 + \eta \sum_{k=1}^{N} \frac{n_k}{n} \sum_{c=1}^{C} p_k(y=c)\, M\big)$. From equation A.1, if we assign to every learner the maximum number of local steps within a single round (i.e., the number of steps the fastest learner in the federation performs), we get the following inequality for the divergence of the federated model from the centralized model, with $\sum_{k=1}^{N} \frac{B_{\max}}{B} > 1$ and $B_k \leq B_{\max}$, $\forall k \in [N]$:
$$\|w_{c,r} - w_z\| \leq \sum_{k=1}^{N} \frac{B_{\max}}{B}\,\alpha_k\,\big\|w^{B_{\max}-1}_{k,r} - w^{B_{\max}-1}_z\big\| \qquad (A.2)$$
Based on Proposition 3.1 from the work of [297], the weight divergence of the local model of any client $k \in [N]$ from the centralized model after $B_{\max} - 1$ iterations satisfies:
$$\big\|w^{B_{\max}-1}_{k,r} - w^{B_{\max}-1}_z\big\| \leq \alpha_k^{B_{\max}-1}\,\big\|w^{B_{\max}}_{k,r-1} - w^{B_{\max}}_{z-1}\big\| + \eta \sum_{c=1}^{C} \|p_k(y=c) - p(y=c)\|\Big(\sum_{j=0}^{B_{\max}-2} (\alpha_k)^j\, g_{\max}\big(w^{B_{\max}-2-j}_z\big)\Big) \qquad (A.3)$$
where $g_{\max}$ is equal to:
$$g_{\max}\big(w^{B_{\max}-2}_z\big) = \max_{c \in [C]}\,\big\|\nabla_w \mathbb{E}_{x|y=c} \log \phi_c\big(x, w^{B_{\max}-2}_z\big)\big\|$$
By combining equations A.2 and A.3 we derive the following inequality:
$$\|w_{c,r} - w_z\| \leq \sum_{k=1}^{N} \frac{B_{\max}}{B}\Big(\alpha_k^{B_{\max}}\,\big\|w^{B_{\max}}_{k,r-1} - w^{B_{\max}}_{z-1}\big\| + \eta \sum_{c=1}^{C} \|p_k(y=c) - p(y=c)\|\Big(\sum_{j=0}^{B_{\max}-1} (\alpha_k)^j\, g_{\max}\big(w^{B_{\max}-1-j}_{z-1}\big)\Big)\Big)$$
Hence proved.
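As an illustrative numeric check (not from the thesis; all constants are assumed toy values), the bound of Proposition 1 can be evaluated directly. Note that because $\sum_{c} p_k(y=c) = 1$ for every learner, the definition of $\alpha_k$ above reduces to $1 + \eta M$; the bound then grows with the distance between each learner's class distribution $p_k$ and the global distribution $p$.

```python
# Illustrative evaluation of the Proposition 1 bound (toy values, not thesis results).
import numpy as np

eta, M, g_max = 0.01, 10.0, 1.0        # assumed learning rate, Lipschitz constant, gradient bound
B_max, B = 4, 8                        # max local steps per round; total steps across learners
p_global = np.array([1/3, 1/3, 1/3])   # global class distribution (3 classes)
p_k = np.array([[0.6, 0.3, 0.1],       # learner 1: skewed class distribution
                [0.1, 0.3, 0.6]])      # learner 2: skewed the other way
prev_div = 0.05                        # assumed ||w_{k,r-1} - w_{z-1}|| from the previous round

alpha = 1 + eta * M                    # alpha_k simplifies since class probabilities sum to 1

bound = 0.0
for k in range(len(p_k)):
    geometric = sum(alpha**j * g_max for j in range(B_max))
    hetero = eta * np.abs(p_k[k] - p_global).sum() * geometric
    bound += (B_max / B) * (alpha**B_max * prev_div + hetero)
print(f"alpha = {alpha:.2f}, divergence bound = {bound:.4f}")
```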
A.2 FedSparsify
We provide a brief proof sketch due to space constraints. We make the same assumptions as [114, 105], listed below.
Assumption 1. Local objectives are smooth, i.e., $\|\nabla f_k(w_1) - \nabla f_k(w_2)\| \leq L\,\|w_1 - w_2\|$, $\forall w_1, w_2, k$ and some $L > 0$.

Assumption 2. The global objective is Lipschitz, i.e., $\|f(w_1) - f(w_2)\| \leq L_p\,\|w_1 - w_2\|$, $\forall w_1, w_2$ and some $L_p > 0$.

Assumption 3. Clients' stochastic gradients are unbiased, i.e., $\mathbb{E}[g_k(w)] = \nabla f_k(w)$, $\forall k, w$.

Assumption 4. Local models have bounded gradient variance, i.e., $\mathbb{E}\|g_k(w) - \nabla f_k(w)\|^2 \leq \sigma^2$, $\forall k, w$.

Assumption 5. The gradients from clients do not deviate much from the global model, i.e., $\|\nabla f(w) - \nabla f_k(w)\|^2 \leq \epsilon^2$, $\forall k, w$.

Assumption 6. Time-independent gradients, i.e., $\mathbb{E}\big[g_k^{(t_1)} g_k^{(t_2)}\big] = \mathbb{E}\big[g_k^{(t_1)}\big]\,\mathbb{E}\big[g_k^{(t_2)}\big]$, $\forall t_1 \neq t_2$.

Assumption 7. Client-independent gradients, i.e., $\mathbb{E}\big[g_{k_1}^{(t_1)} g_{k_2}^{(t_2)}\big] = \mathbb{E}\big[g_{k_1}^{(t_1)}\big]\,\mathbb{E}\big[g_{k_2}^{(t_2)}\big]$, $\forall k_1 \neq k_2$ and any $t_1, t_2$.
Proof. Since we enforce the sparse structure found in previous iterations during client training and do not allow parameters to resurrect, we only need to demonstrate the convergence of the average over the $\nabla f(w^{(t)}) \odot m^{(t)}$ terms. Our proof technique is similar to previous approaches that have demonstrated convergence for federated learning under different scenarios [106, 114, 154].
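For intuition, here is a minimal sketch (illustrative only, not the thesis's FedSparsify implementation; the objective and all names are assumed) of a mask-constrained local update in which the binary mask $m^{(t)}$ is frozen, so pruned parameters can never resurrect during client training:

```python
# Illustrative sketch of mask-constrained local SGD (pruned weights stay zero).
import numpy as np

def masked_local_steps(w, mask, grad_fn, eta=0.01, steps=5):
    """Run `steps` SGD updates on w while freezing the binary mask `mask`."""
    w = w * mask                      # enforce the sparse structure up front
    for _ in range(steps):
        g = grad_fn(w)                # stochastic gradient g_k(w) (assumed callable)
        w = w - eta * (g * mask)      # masked update: pruned entries never resurrect
    return w

# Toy usage with a quadratic objective f_k(w) = 0.5 * ||w - target||^2.
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 3.0, 0.5])
grad_fn = lambda w: (w - target) + 0.01 * rng.standard_normal(4)
mask = np.array([1.0, 0.0, 1.0, 1.0])  # second parameter is pruned
w_new = masked_local_steps(np.zeros(4), mask, grad_fn)
assert w_new[1] == 0.0                 # the pruned coordinate stays exactly zero
```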
Considering $\mathbb{E}\big[f(w^{(t+1)} \odot m^{(t)}) - f(w^{(t)})\big]$, we get:
$$\mathbb{E}\big[f(w^{t+1} \odot m^t) - f(w^t)\big] \leq \mathbb{E}\big\langle \nabla f(w^t),\, w^{t+1} \odot m^t - w^t \big\rangle + \frac{L}{2}\,\mathbb{E}\big\|w^{t+1} \odot m^t - w^t\big\|^2 \qquad (A.4)$$
Considering the first term from above,
$$\begin{aligned}
\mathbb{E}\big\langle \nabla f(w^t),\, w^{t+1} \odot m^t - w^t \big\rangle
&= \eta\,\mathbb{E}\Big\langle \nabla f(w^t),\, -\frac{1}{N}\sum_{k=1}^{N}\sum_{i=0}^{S-1} g_k^{t,i} \odot m^t \Big\rangle \\
&= \eta\,\mathbb{E}\Big\langle \nabla f(w^t) \odot m^t,\, -\frac{1}{N}\sum_{k=1}^{N}\sum_{i=0}^{S-1} \nabla f_k\big(w_k^{t,i}\big) \odot m^t \Big\rangle \\
&= -\eta\,\big\|\nabla f(w^t) \odot m^t\big\|^2 - \eta\,\Big\|\frac{1}{N}\sum_{k=1}^{N}\frac{1}{S}\sum_{i=0}^{S-1} \nabla f_k\big(w_k^{t,i}\big)\Big\|^2 + \eta\,\Big\|\nabla f(w^t) \odot m^t - \frac{1}{N}\sum_{k=1}^{N}\frac{1}{S}\sum_{i=0}^{S-1} m^t \odot \nabla f_k\big(w_k^{t,i}\big)\Big\|^2 \\
&\leq -\eta\,\big\|m^t \odot \nabla f(w^t)\big\|^2 - \frac{\eta}{NS}\sum_{k=1}^{N}\sum_{i=0}^{S-1}\big\|m^t \odot \nabla f_k\big(w_k^{t,i}\big)\big\|^2 + \frac{\eta L^2}{NS}\sum_{k=1}^{N}\sum_{i=0}^{S-1}\big\|w^t - w_k^{t,i}\big\|^2 \qquad (A.5)
\end{aligned}$$
For the second term in Eq. A.4, we can establish by using Assumptions 4-7 that
$$\mathbb{E}\big\|w^{t+1} \odot m^t - w^t\big\|^2 = \mathbb{E}\Big\|\frac{1}{N}\sum_{k=1}^{N}\sum_{i=0}^{S-1} m^t \odot g_k^{t,i}\Big\|^2 \leq S\sigma^2 + \frac{S}{N}\sum_{k=1}^{N}\sum_{i=0}^{S-1}\mathbb{E}\big\|m^t \odot \nabla f_k\big(w_k^{t,i}\big)\big\|^2 \qquad (A.6)$$
By repeating an analysis similar to Lemma 10 from [106], we can obtain the following result:
$$\mathbb{E}\big\|w^{t,i} - w^t\big\|^2 \leq 16\eta^2 S^2\,\big\|m^t \odot \nabla f(w^t)\big\|^2 + 16\eta^2 S^2 \epsilon^2 + 4\eta^2 S \sigma^2 \qquad (A.7)$$
Substituting Eq. A.5, A.6, and A.7 in Eq. A.4, we get
$$\mathbb{E}\big[f(w^{t+1} \odot m^t) - f(w^t)\big] \leq \Big(-\frac{\eta S}{2} + 8L^2\eta^3 S^4\Big)\big\|m^t \odot \nabla f(w^t)\big\|^2 + \Big(\frac{\eta^2 L S}{2} + 2L^2\eta^3 S^3\Big)\sigma^2 + 8L^2\eta^3 S^4 \epsilon^2 \qquad (A.8)$$
The above result establishes a bound on the weight updates during a federated training round. Pruning further changes the model outputs, but its effect can be bounded due to the Lipschitz assumption (Assumption 2):
$$\mathbb{E}\big[f(w^{t+1}) - f(w^{t+1} \odot m^t)\big] \leq L_p\,\big\|w^{t+1} - w^{t+1} \odot m^t\big\| \qquad (A.9)$$
Adding the two, we get
$$\mathbb{E}\big[f(w^{t+1}) - f(w^t)\big] \leq \Big(-\frac{\eta S}{2} + 8L^2\eta^3 S^4\Big)\big\|m^t \odot \nabla f(w^t)\big\|^2 + \Big(\frac{\eta^2 L S}{2} + 2L^2\eta^3 S^3\Big)\sigma^2 + 8L^2\eta^3 S^4 \epsilon^2 + L_p\,\big\|w^{t+1} - w^{t+1} \odot m^t\big\|$$
Summing over all the time steps, and noting that $\mathbb{E}\big[f(w^{t+1}) - f(w^t)\big] \geq \mathbb{E}\big[f(w^*) - f(w^t)\big]$, gives the desired result.
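To make the last step explicit, here is a sketch of the telescoping argument (a reconstruction under the stated assumptions, not text from the thesis), assuming $T$ training rounds, initial model $w^0$, and a step size small enough that $\frac{\eta S}{2} - 8L^2\eta^3 S^4 > 0$:

```latex
% Telescoping sketch: summing the per-round bound over t = 0, ..., T-1
% collapses E[f(w^{t+1}) - f(w^t)] to E[f(w^T)] - f(w^0), which yields
\begin{align*}
\Big(\frac{\eta S}{2} - 8L^2\eta^3 S^4\Big) \frac{1}{T}\sum_{t=0}^{T-1}
    \big\| m^t \odot \nabla f(w^t) \big\|^2
  \le\ & \frac{f(w^0) - \mathbb{E}\big[f(w^T)\big]}{T}
      + \Big(\frac{\eta^2 L S}{2} + 2L^2\eta^3 S^3\Big)\sigma^2 \\
      & + 8L^2\eta^3 S^4 \epsilon^2
      + \frac{L_p}{T} \sum_{t=0}^{T-1} \big\| w^{t+1} - w^{t+1} \odot m^t \big\|
\end{align*}
```

so the average masked gradient norm is controlled by variance, heterogeneity, and cumulative pruning-perturbation terms.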
Abstract
Data relevant to machine learning problems are distributed across multiple data silos that cannot share their data due to regulatory, competitiveness, or privacy reasons. Federated Learning has emerged as a standard computational paradigm for distributed training of machine learning and deep learning models across silos. However, the participating silos may have heterogeneous system capabilities and data specifications. In this thesis, we address the challenges in federated learning arising from both computational and semantic heterogeneities. We present federated training policies that accelerate the convergence of the federated model and lead to reduced communication, processing, and energy costs during model aggregation, training, and inference. We show the efficacy of these policies across a wide range of challenging federated environments with highly diverse data distributions in benchmark domains and in neuroimaging. We conclude by describing the federated data harmonization problem and presenting a comprehensive federated learning and integration system architecture that addresses the critical challenges of secure and private federated data harmonization, including schema mapping, data normalization, and data imputation.
Repository Email
cisadmin@lib.usc.edu
Tags
deep learning
distributed deep learning
federated learning
federated neuroscience