UNSUPERVISED DOMAIN ADAPTATION WITH PRIVATE DATA
by
Serban Andrei Stan
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
May 2023
Copyright 2023 Serban Andrei Stan
Dedication
To my dear Gokce, whose unwavering support saw me through the highs and lows of this journey, and to
my parents Elena and Ion who set me on this path a long time ago.
Acknowledgements
I want to thank my advisor, Professor Mohammad Rostami, for his invaluable guidance over the course of
my doctorate studies. I deeply appreciate both his insightful research advice and collaboration etiquette
over the years. I wish to also thank my defense committee, Professors Jay Kuo and Professor Aiichiro
Nakano, for their comments in helping me prepare this manuscript.
Table of Contents
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Methods in Domain Adaptation
2.1 Adversarial Adaptation
2.2 Adaptation by Distributional Distance Minimization
2.3 Source-Free Adaptation
2.4 Other Topics in Domain Adaptation
2.4.1 Multi-Source Domain Adaptation
2.4.2 Multi-Target Domain Adaptation
2.4.3 Few-Shot DA
2.4.4 Open-Set DA
Chapter 3: Semantic Segmentation of Real-World Images
3.1 Motivation
3.2 Related Work
3.2.1 Semantic Segmentation
3.2.2 Domain Adaptation
3.2.2.1 Adversarial Adaptation
3.2.2.2 Adaptation by Distribution Alignment
3.3 Problem Formulation
3.4 Proposed Algorithm
3.5 Theoretical Analysis
3.6 Experimental Validation
3.6.1 Implementation and Training Details
3.6.2 State-of-the-Art Methods
3.6.3 Results
3.6.4 Ablation Study
3.7 Remarks
Chapter 4: Semantic Segmentation of Medical Images
4.1 Motivation
4.2 Related Work
4.3 Problem Formulation
4.4 Proposed Algorithm
4.5 Theoretical Analysis
4.6 Experimental Validation
4.6.1 Datasets
4.6.2 Evaluation Methodology
4.6.3 Experimental Setup
4.6.4 Quantitative and Qualitative Results
4.6.5 Ablation Studies and Analysis
4.7 Remarks
Chapter 5: Secure Multi-Source Adaptation
5.1 Motivation
5.2 Related Work
5.3 Problem Formulation
5.4 Proposed Algorithm
5.5 Theoretical Analysis
5.5.1 Experimental Parameters
5.6 Experimental Validation
5.6.1 Performance Results
5.6.2 Ablative Experiments and Empirical Analysis
5.7 Remarks
Chapter 6: Fair Modal Adaptation
6.1 Motivation
6.2 Related Work
6.2.1 Fairness in AI
6.2.2 Unsupervised Domain Adaptation
6.2.3 Domain Adaptation in Fairness
6.3 Problem Formulation
6.4 Proposed Algorithm
6.5 Empirical Validation
6.5.1 Datasets and Tasks
6.5.1.1 Evaluation Protocol Motivation
6.5.2 Fairness Metrics
6.5.3 Parameter Tuning and Implementation
6.5.4 Other Methods
6.5.5 Results
6.5.6 Ablative Experiments
6.6 Remarks
Chapter 7: Conclusion and Future Work
7.1 Future Directions
Bibliography
List of Tables
3.1 Model adaptation comparison results for the SYNTHIA→Cityscapes task. I have used DeepLabV3 [27] as the feature extractor with a VGG16 [166] backbone. The first row presents the source-trained model performance prior to adaptation, to demonstrate the effect of initial knowledge transfer from the source domain.
3.2 Domain adaptation results for different methods for the GTA5→Cityscapes task.
4.1 Segmentation performance comparison for the Cardiac MR→CT adaptation task. Starred methods perform source-free adaptation. Bolded cells show best performance.
4.2 Segmentation performance comparison for the Cardiac CT→MR adaptation task.
4.3 Segmentation performance comparison for the Abdominal MR→CT adaptation task.
4.4 Segmentation performance comparison for the Abdominal CT→MRI adaptation task.
4.5 Percentage of shift in pixel labels during adaptation for the cardiac dataset. A cell (i, j) in the table has three values. The first value represents the percentage of pixels labeled i that are labeled j after adaptation. The second value represents the percentage of switching pixels whose true label is i (lower is better). The third value represents the percentage of switching pixels whose true label is j (higher is better). Bolded cells denote label shift where more than 1% of pixels migrate from i to j.
4.6 Percentage of shift in pixel labels during adaptation for the abdominal organ dataset. The same methodology as in Table 4.5 is used.
4.7 Segmentation performance comparison for the Cardiac MR→CT adaptation task. t-SFS represents results for t components per class.
4.8 Segmentation performance on the Cardiac MR→CT adaptation task for different distributional distances.
5.1 Results on five benchmark datasets. Single best (SB) represents the best performance with respect to any source, source combined (SC) represents performance obtained by pooling the source data together from different domains, and multi-source (MS) represents methods performing multi-source adaptation. * indicates source-free adaptation, guaranteeing privacy between sources and the target. + indicates privacy between source models. Results in bold correspond to the highest accuracy amongst the source-free approaches.
5.2 Results when only the SWD objective, the entropy objective, or both (SMUDA) are used.
5.3 Results comparing SMUDA to non-private variants.
5.4 Single-source results.
5.5 Uniformly combined predictions.
5.6 Analytic experiments to study four strategies for combining the individual model predictions. Mixing based on model reliability proves superior to other popular approaches.
5.7 Performance analysis for different values of γ.
5.8 Performance analysis when source domains are introduced sequentially.
6.1 Data split statistics. A, C, G correspond to the Adult, COMPAS and German datasets respectively. The rows with no number (A, C, G) correspond to random data splits. The numbered rows (A1, A2, A3) correspond to statistics for specific splits. The columns represent the probabilities of specific outcomes for specific splits, e.g., P(Y = 0). Results are reported when using sex as the sensitive attribute.
6.2 Performance results for the three splits of the Adult dataset.
6.3 Performance results for the three splits of the COMPAS dataset.
6.4 Performance results for the three splits of the German dataset.
6.5 Results for random data splits.
6.6 Results when selectively using a subset of losses on the COMPAS dataset.
6.7 Results on the German dataset when optimizing fairness metrics with respect to the age sensitive attribute.
List of Figures
3.1 Diagram of the proposed model adaptation approach (best seen in color): (a) initial model training using the source domain labeled data, (b) estimating the prototypical distribution as a GMM distribution in the embedding space, (c) domain alignment is enforced by minimizing the distance between the prototypical distribution samples and the target unlabeled samples, (d) domain adaptation is enforced for the classifier module to fit correspondingly to the GMM distribution.
3.2 Qualitative performance: examples of the segmented frames for SYNTHIA→Cityscapes using the MAS³ method. Left to right: real images, manually annotated images, source-trained model predictions, predictions based on my method.
3.3 Indirect distribution matching in the embedding space: (a) drawn samples from the GMM trained on the SYNTHIA distribution, (b) representations of the Cityscapes validation samples prior to model adaptation, (c) representations of the Cityscapes validation samples after domain alignment.
3.4 Ablation experiment to study the effect of τ on the GMM learnt in the embedding space: (a) all samples are used, adaptation mIoU = 41.8; (b) a portion of samples is used, adaptation mIoU = 42.7; (c) samples with high model confidence are used, adaptation mIoU = 44.7.
4.1 Proposed method: I first perform supervised training on source MR images. Using the source embeddings, I characterize an internal distribution via a GMM distribution in the latent space. I then perform source-free adaptation by matching the embeddings of the target CT images to the learnt GMM distribution, and fine-tune the classifier on GMM samples. Finally, I verify the improved performance that my model gains from model adaptation.
4.2 Segmentation maps of CT samples from the two datasets. The first five columns correspond to cardiac images, and the last five correspond to abdominal images. From top to bottom: gray-scale CT images, source-only predictions, post-adaptation predictions, supervised predictions on the CT data, ground truth.
4.3 Indirect distribution matching in the embedding space: (a) GMM samples approximating the MMWHS MR latent distribution, (b) CT latent embedding prior to adaptation, (c) CT latent embedding post domain alignment. Colors correspond to: AA, LAC, LVC, MYO.
4.4 Indirect distribution matching in the embedding space: (a) GMM samples approximating the CHAOS MR latent distribution, (b) Multi-Atlas CT embedding prior to adaptation, (c) Multi-Atlas CT embedding post adaptation. Colors correspond to: liver, right kidney, left kidney, spleen.
5.1 Block diagram of my proposed approach: (a) source-specific model training is done independently for each source domain, (b) the distribution of latent embeddings of each source domain is estimated via a mixture of Gaussians, (c) for each source-trained model, adaptation is performed by minimizing the distributional discrepancy between the learnt GMM distribution and the target encodings, (d) the final target domain predictions are obtained via a learnt convex combination of logits for each adapted model.
5.2 Performance for different numbers of latent projections used in the SWD on Office-31.
5.3 Effect of the adaptation process on the Office-home dataset: from left to right, I consider Art, Clipart and Product as the source domains, and Real-World as the target domain.
5.4 Prediction accuracy on Office-home target domain tasks under different levels of source model confidence, and my choice of λ. Target predictions above this threshold attain high accuracy.
5.5 Results on the Office-31 and Image-clef datasets for different values of the confidence parameter λ. The dotted line corresponds to λ = 0.5, used for reporting results in Table 5.1.
5.6 UMAP latent space visualization for Office-caltech with Amazon as the target. Sources in order: Caltech, DSLR, and Webcam. Adaptation shifts target embeddings towards the GMM distribution.
5.7 Effect of the adaptation process on the Domain-Net dataset, where Sketch is the target. Sources are, in order: Clipart, Infograph, Painting, Quickdraw, Real.
5.8 Source and GMM embeddings for the Image-clef dataset with Pascal and Caltech as sources. For both datasets, the GMM samples closely approximate the source embeddings.
5.9 Latent distributions for two domains of the Office-31 dataset when considering the D→A and W→A tasks. Each color gradient represents a latent feature distribution snapshot of the source domains. Darker colors correspond to later training iterations, lighter colors to earlier iterations.
6.1 Block diagram description of the proposed framework.
6.2 UMAP embeddings of the source and target feature spaces for random and custom splits of the Adult dataset.
6.3 Learning behavior for end-to-end training when using both L_fair and L_swd (top) and when using only L_fair (bottom).
Abstract
The recent success of deep learning is conditioned on the availability of large annotated datasets for supervised learning. Data annotation, however, is a laborious and time-consuming task. When a model fully trained on an annotated source domain is applied to a target domain with a different data distribution, greatly diminished generalization performance can be observed due to domain shift. Unsupervised Domain Adaptation (UDA) aims to mitigate the impact of domain shift when the target domain is unannotated. The majority of UDA algorithms assume joint access to source and target data, which may violate data privacy restrictions in many real-world applications. In this thesis I propose source-free UDA approaches that are well suited for scenarios where source and target data are only accessible sequentially. I show that across several application domains, for the adaptation process to be successful it is sufficient to maintain a low-memory approximation of the source embedding distribution instead of the full source dataset. Domain shift is then mitigated by minimizing an appropriate distributional distance metric. First, I validate this idea on adaptation tasks in street image segmentation. I then show that improving the approximation of the source embeddings leads to superior performance when adapting medical image segmentation models. I extend this idea to multi-source adaptation, where several source domains are present and data transfer between pairs of domains is prohibited. Finally, I show that relaxing the constraint for data privacy allows for mitigating domain shift in fair classification.
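The low-memory source summary at the heart of this recipe can be sketched in a few lines. The snippet below is an illustrative sketch, not the thesis's exact algorithm: it fits a single diagonal Gaussian per class (a one-component special case of the GMMs used in later chapters), and all function names are hypothetical.

```python
import random
from statistics import fmean, pstdev

def fit_class_gaussians(embeddings, labels):
    """Summarize source embeddings with one diagonal Gaussian per class;
    only per-class means and standard deviations are kept, so the memory
    footprint is independent of the source dataset size."""
    summary = {}
    dims = len(embeddings[0])
    for c in sorted(set(labels)):
        pts = [e for e, y in zip(embeddings, labels) if y == c]
        mean = [fmean(p[d] for p in pts) for d in range(dims)]
        # Guard against zero variance in any dimension.
        std = [pstdev([p[d] for p in pts]) or 1e-6 for d in range(dims)]
        summary[c] = (mean, std)
    return summary

def sample_prototypes(summary, n_per_class, rng=None):
    """Draw pseudo-source embeddings from the stored summary, to stand in
    for the (discarded) source data during adaptation."""
    rng = rng or random.Random(0)
    samples, labels = [], []
    for c, (mean, std) in summary.items():
        for _ in range(n_per_class):
            samples.append([rng.gauss(m, s) for m, s in zip(mean, std)])
            labels.append(c)
    return samples, labels
```

After source training, only `summary` needs to be retained; adaptation can then draw labeled pseudo-source samples without ever touching the original source data.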
Chapter 1
Introduction
In recent years, autonomous prediction algorithms have demonstrated human-level performance in a large
selection of tasks, from image classification, object tracking, and natural language understanding to play-
ing games as interactive agents [165]. The performance of such approaches is primarily due to the devel-
opment of new deep neural network (DNN) architectures and training techniques, allowing for complex
and informative representations of available information [120, 85]. However, while impressive on specific
problems designed and trained for, such networks are often highly complex, encapsulating up to hundreds
of billions of parameters [211]. This success is conditioned on the availability of large and high-quality
manually annotated datasets to satisfy the required sample complexity bounds for training DNNs. Due
to their size, such models have also been historically shown to be susceptible to overfitting, making them
sensitive to out-of-distribution data. As a result, data annotation is a significant bottleneck when addressing the problem of domain shift, where a discrepancy exists between the distributions of the training and testing domains [113]. This is particularly important in continual learning, where the goal is to enable a learning
agent to learn new domains autonomously [71]. Retraining the model from scratch is not a feasible solu-
tion for continual learning because manual data annotation is an expensive and time-consuming process
for image segmentation, e.g., as much as 1.5 hours for a single image in current benchmark datasets [34,
145]. A practical alternative is to adapt the trained model using only unannotated data.
The problem of model adaptation has been studied extensively in the unsupervised domain adaptation
(UDA) framework. The goal of UDA is to train a model for an unannotated target domain by transfer-
ring knowledge from a secondary related source domain in which annotated data is accessible or easier
to generate, e.g., a synthetically generated domain. Knowledge transfer can be achieved by extracting
domain-invariant features from the source and the target domains to address domain discrepancy. As a
result, if a classifier is trained using the source domain features as its input, the classifier will generalize on
the target domain since the distributions of features are indistinguishable. Distributional alignment can be
achieved by matching the distributions at different levels of abstraction, including the appearance [63, 161], feature [63, 122], and output [215] levels.
Another recent line of research considers Invariant Risk Minimization (IRM) [4], where model generalization across multiple domains is desirable [3]. The key difference from the UDA framework is that UDA prefers minimizing target error over maintaining jointly high performance on both the source and target. Thus, UDA methods operate in a more relaxed setting than IRM approaches, allowing for better target generalization [86, 76].
A large group of the existing UDA algorithms uses adversarial learning for extracting domain-invariant
features [112, 16, 64, 122, 155, 161, 39, 144]. Broadly speaking, a domain discriminator network can be
trained to distinguish whether an input data point comes from the source or the target domain. This net-
work is fooled by a feature generator, which is trained to make the domains similar at its output. Adver-
sarial training [55] of these two networks leads to learning a domain-agnostic embedding space. A second
class of UDA algorithms directly minimizes suitable loss functions that enforce domain alignment [197,
215, 49, 212, 95, 207]. Adversarial learning requires delicate optimization initialization, architecture engi-
neering, and careful selection of hyper-parameters to be stable [153]. In contrast, defining a suitable loss
function for direct domain alignment may not be trivial.
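As a concrete instance of such a loss, the sliced Wasserstein distance (the SWD objective used later in this thesis) reduces distribution matching to sorting along random one-dimensional projections. The sketch below is illustrative only: pure Python, equal-sized point sets assumed, and the function name is my own.

```python
import math
import random

def sliced_wasserstein(xs, ys, n_projections=64, rng=None):
    """Monte Carlo estimate of the sliced Wasserstein distance between
    two equally sized point sets (lists of equal-length float vectors)."""
    rng = rng or random.Random(0)
    dims = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random unit direction.
        theta = [rng.gauss(0.0, 1.0) for _ in range(dims)]
        norm = math.sqrt(sum(t * t for t in theta))
        theta = [t / norm for t in theta]
        # In 1-D, optimal transport pairs points simply by sorting.
        px = sorted(sum(a * t for a, t in zip(x, theta)) for x in xs)
        py = sorted(sum(b * t for b, t in zip(y, theta)) for y in ys)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / n_projections
```

Because each projection only requires sorting, the loss is cheap and differentiable almost everywhere, which is why it is a popular choice for direct domain alignment.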
UDA benefits from having access to both source and target domains simultaneously [31, 160, 49]. This
assumption, however, cannot always be met. In practice, it is natural to assume source datasets
are distributed amongst independent entities, and sharing data between them may constitute a privacy
violation. For example, improving mobile keyboard predictions is performed by securely training models
on independent computing nodes without centrally collecting user data ([205]). Similarly, in medical image
processing applications, data is often distributed amongst different medical institutions. Due to privacy
regulations ([202]), sharing data can be prohibited, and hence central access to data becomes infeasible.
Analogous scenarios arise in many other prediction tasks, such as face anti-spoofing [106], classification
of video data [67] or industrial equipment verification [74]. To maintain the benefits from UDA, source-
free adaptation has been developed to bypass the need for direct access to a source domain at adaptation
time. While source-free UDA has been previously explored for image classification [50, 110, 80], there are
few works addressing this concern for more elaborate problem settings, such as street image segmentation
[184, 228] or medical image analysis [7]. Compared to image classification tasks, segmenting street images
requires more complex network architectures, with each image pixel being labeled. In the case of medical
images, these are produced by imaging devices such as MRI or CT scanners, which directly impact
the input distribution of the image data. Additionally, compared to natural image semantic segmentation,
large portions of medical images remain unlabeled.
In real-world settings, the assumption that source and target datasets are maintained on a shared device
does not cover all adaptation use cases. This work recognizes the need for UDA algorithms that operate in environments where datasets are stored across different devices, and access is permitted only
sequentially between the source and target. Source-free UDA has been mostly explored for image classi-
fication, while other application domains have received less attention. In this work, I develop source-free
UDA algorithms for such application domains: street image segmentation, medical image segmentation,
and multi-source classification. I demonstrate that adding a privacy constraint between different domains
does not significantly degrade performance compared to joint adaptation while making the model more
applicable to real-world scenarios. The main contributions of each of these application domains can be
summarized as follows:
Street Image Segmentation. Predictive problems on real-world images span a large set of applica-
tions, such as image segmentation, object tracking, or vehicle decision planning. Image segmentation is
a primary task in Computer Vision, where a multi-channel image requires partitioning its pixels across
several semantic classes. To this end, in semantic segmentation, each pixel of an image is attributed a label. Compared to regular classification tasks, due to the nature of the data, generating clean labels
for image segmentation is much more costly and time-consuming, with a single image potentially requiring several hours to label correctly [34]. In contrast, many game rendering engines are designed to generate street-like environments, allowing labeling at a much lower cost. The goal of recent UDA works in
semantic segmentation involves leveraging computer generated image datasets to train models capable of
high performance on real-world datasets. Most such works however consider a joint adaptation problem.
I extend this setting to the source-free scenario, where a model can be trained on artificial images and then
stored for later use without maintaining both the source and target datasets. I show that my approach is
comparable to joint adaptation methods for image segmentation while at the same time showing superior
performance to other recent source-free methods. This work has been accepted as a conference paper at AAAI 2021 [172].
MedicalImageSegmentation. The prevalence of magnetic resonance devices has allowed clinicians
to leverage statistical models to perform assistive diagnosis, monitor treatment over time, or automate
medical data processing pipelines. 2D and 3D image segmentation networks have been successfully deployed to address these problems; however, due to the nature of the task, these networks are often represented by large deep learning models. This makes them sensitive to the data modality they are trained on
and limits their applicability across other input types. For example, a network trained on scans sourced
from an MRI machine will offer deteriorated performance if tasked with predicting images sourced from a
CT machine. In addition, compared to street images, medical images are obtained by recording the inten-
sity values from MRI or CT machines. Thus, large portions of the images correspond to body tissue to be
ignored by the models. Directly applying most UDA algorithms designed for street image segmentation
will require additional tuning due to the particularities of medical data. Additionally, medical data often
requires strict privacy regulations, making patient data difficult to share across institutions. I develop a
source-free adaptation algorithm that handles UDA in the medical field without requiring data sharing between the source and target domains. I test my algorithm on cardiac and internal organ datasets, showing that on the cardiac dataset my method outperforms both joint adaptation approaches and recent source-free approaches. This work has appeared as a conference paper at BMVC 2022 [169], and an extended version is under review in IEEE Transactions on Image Processing.
Multi-Source Adaptation. The problem considered in most UDA research is improving the generalization performance of a model trained on an annotated source domain to an unannotated target domain
with a different data distribution. Multi-source adaptation extends this task to scenarios where training data
is distributed across several domains. Distributed machine learning is a common real-world application of
this setting, where a model applicable to different user needs is trained using different data streams, for
example, a keyboard word prediction model [205]. The problem of multi-source UDA (MUDA) has been
studied in cases where data streams are simultaneously accessible. However, in the case of using private
data, as in the above example, pooling confidential information from multiple users into a shared data store increases the system's exposure to adversarial agents. In this work, I develop
an algorithm for source-free MUDA where data sharing is not permitted between pairs of source domains
or between sources and the target. Compared to other source-free MUDA approaches, which still pool
models from all source domains during adaptation, my proposed method offers competitive performance
5
while operating in a stricter privacy setting while also allowing for distributed optimization and fast re-
training in case one of the source domains becomes unavailable due to privacy reasons. This work has
been accepted in TMLR, 2022 [171].
Fair Classification. Predictive models have been widely adopted to automate tasks in many industries. Initial automated screenings are now employed in mortgage approvals, resume selection for job applicants, and credit increase decisions. However, such decision models often translate the bias in their training data into their inferences. Fair Classification is a recent field of AI that aims to provide metrics for detecting and mitigating classification bias. Many approaches for fair classification have been developed recently, with promising results on standard datasets. However, these methods do not investigate the impact of domain shift on fair classification. First, I show that common fair classification datasets used in the literature can exhibit domain shift depending on the chosen training and test data distributions. Next, I verify established fair classification methods on these splits and show that their generalization performance indeed suffers on the target domain. Finally, I propose a novel fair classification algorithm that provides superior fairness while also mitigating domain shift.
In the next chapter, I survey existing related work in UDA. In Chapters 3-6, I describe my contributions and the theoretical and empirical results for each of these four problems. Chapter 7 proposes future extensions to the presented settings and approaches.
Chapter 2
Methods in Domain Adaptation
Research in domain adaptation has been extensive, and a complete survey of all existing works is beyond
the scope of this manuscript. In this chapter, I survey existing works relevant to my proposed algorithms
and discuss additional extensions of DA.
Domain adaptation is an area of machine learning that addresses model generalization on unseen data distributions. When the training and deployment data distributions differ, performance often suffers. UDA research provides tools to mitigate performance degradation on the target domain under different baseline scenarios. It has been shown that, under certain assumptions, the target domain error can be upper bounded in terms of the source domain error and a discrepancy term between the source and target domains. For example, Ben-David et al. [10] utilize the H-divergence as this last term, a discrepancy function upper bounded by the error produced by a binary classifier trained to identify whether data samples correspond to the source or target domain, per Lemma 2 in [10]. This makes the H-divergence a preferred tool in adversarial adaptation, as can be seen below. Other discrepancy metrics have also been proposed. For example, Redko et al. [132] upper bound the target domain error via the source domain error and a discrepancy term depending on the optimal transport distance between the two domains. Broadly, UDA approaches leverage the idea that a shared embedding space between the source and target samples is sufficient for adaptation. If such a suitable mapping function is learned, classifiers trained with data from the source domain can also generalize to the target domain. Additional variations of DA exist, for example, having access to only a small set of labels in the target domain, or having only partial overlap between the source and target classes. In the present work, I focus on UDA where the source and target classes are the same. Two main lines of UDA research that have led to competitive results are adversarial adaptation and adaptation based on distributional distance minimization.
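The H-divergence term mentioned above is often estimated empirically via a "proxy A-distance": a domain classifier is trained to separate source from target samples, and its error ε is converted into the score 2(1 − 2ε). A minimal numpy sketch, using a deliberately simple nearest-centroid domain classifier purely for illustration (practical implementations typically use a linear SVM or an MLP):

```python
import numpy as np

def proxy_a_distance(src, tgt):
    """Estimate 2 * (1 - 2 * err), where err is the error of a domain
    classifier separating source from target samples. A nearest-centroid
    classifier stands in here for the usual SVM/MLP."""
    cs, ct = src.mean(axis=0), tgt.mean(axis=0)
    data = np.vstack([src, tgt])
    # Domain labels: 0 = source, 1 = target.
    dom = np.r_[np.zeros(len(src)), np.ones(len(tgt))]
    # Predict "target" when a sample is closer to the target centroid.
    pred = (np.linalg.norm(data - ct, axis=1)
            < np.linalg.norm(data - cs, axis=1)).astype(float)
    err = np.mean(pred != dom)
    return 2.0 * (1.0 - 2.0 * err)
```

Well-aligned domains leave the domain classifier near chance level (error ≈ 0.5), giving a score near 0, while easily separable domains push the score toward 2.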
2.1 Adversarial Adaptation
Tools employed in adversarial learning [55] have been effectively applied to UDA. In training a Generative Adversarial Network (GAN), a DNN generator attempts to map a random input distribution to mimic a given data distribution. To achieve this, an auxiliary network, called a discriminator, is presented with samples from the generator and from the original dataset and is tasked with determining their origin. The generator aims to produce samples similar to the provided data distribution and thus fool the discriminator. Alternating between improving the discriminator and the generator trains the generator to create increasingly realistic samples. As aligning the source and target embedding distributions also leads to domain adaptation, GAN training tools have proven appropriate for the problem of UDA.
Ganin et al. [51] achieve domain alignment with an encoder network that has a classification head and a discriminator head. The classification head optimizes a log-likelihood loss on samples from the source domain. As in GAN training, the discriminator is tasked with determining the domain provenance of data samples, i.e., source or target. A gradient reversal layer connected to the discriminator head ensures the encoder produces domain-invariant feature representations. Tzeng et al. [183] expand on this idea by building separate networks for the source and target domains instead of sharing network weights.
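The gradient reversal layer itself is simple to state: it is the identity in the forward pass, while during backpropagation it multiplies incoming gradients by a negative constant, so the encoder is pushed to maximize the discriminator's loss. A framework-free sketch of the forward/backward pair (the class name and the λ scaling factor are illustrative, not taken from a specific library):

```python
import numpy as np

class GradientReversal:
    """Identity on the forward pass; flips (and scales by lam) the
    gradient on the backward pass, as in DANN-style training."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x

    def backward(self, grad_output):
        return -self.lam * grad_output
```

In autodiff frameworks this is registered as a custom op; the numpy version above only illustrates the two passes.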
These methods have also been successfully applied to image segmentation. Hoffman et al. [63] employ
image style transfer to produce domain alignment. A style transfer network translating the source images
into the visual style of the target images is used to guide the adaptation process. Additional losses to ensure
consistency between the segmentation maps of the style transfer networks are used to make the learning
process more robust.
Bousmalis et al. [16] develop a model for pixel-level domain adaptation by creating a pseudo-dataset by
making source samples appear as though they are extracted from the target domain. They use a generative
model to adapt source samples to the style of target samples and a domain discriminator to differentiate
between real and fake target data.
Chen et al. [25] follow an idea similar to Ganin et al. [51], applied to segmentation architectures. In addition to a global domain discriminator, they also utilize class-wise discriminators for each of the available semantic classes. Class-wise discriminators, however, require labeled samples on the target domain; these are obtained as pseudo-labels produced by the source-trained networks, leading to improved adaptation performance over using a global domain discriminator alone.
Tsai et al. [182] also use a classification network with an added discriminator for domain alignment. Compared to other approaches following this high-level idea, they additionally apply domain discriminators at intermediate abstraction levels in the encoder network, which improves performance over a single-level discriminator.
Luc et al. [112] train an image segmentation model adversarially with a semantic map generator: a label-map discriminator penalizes the segmentation network for producing label maps more similar to the generated ones than to the source ones.
2.2 Adaptation by Distributional Distance Minimization
Adversarial methods for domain adaptation employ a domain discriminator to produce latent represen-
tations shared between the source and target domains. In contrast, distributional distance minimization
methods attempt to directly minimize the distance between the source and target distributions by min-
imizing an appropriate distance metric. The choice for such metrics is, however, not unique, as several
choices have led to improved target generalization.
Pan et al. [124] utilize Maximum Mean Discrepancy (MMD) [15] as a metric of choice for domain
alignment. An appropriate kernel is learned, which maps source and target data into a shared feature
space where the MMD is minimized. Long et al. [108] expand on this idea by minimizing the MMD at several abstraction layers in a neural network.
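For intuition, the (biased) empirical estimate of squared MMD under a kernel compares average intra-domain kernel values with the cross-domain ones. A small numpy sketch with an RBF kernel (the bandwidth γ is a free choice here, not a value from the cited works):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF kernel matrix between rows of a and b.
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def mmd2(x, y, gamma=1.0):
    """Biased empirical estimate of squared MMD between samples x, y:
    E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())
```

Identical samples give an estimate of zero, and the estimate grows as the two domains drift apart.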
Wu et al. [198] propose an image translation network that takes source and target images as input
and outputs source images in the style of the target. Their proposed architecture does not use adversarial
training but rather is based on the idea that in order for stylistic transfer to be achieved, domain mean
and variance should be similar at different levels of abstraction throughout the translation network. They
accomplish this by minimizing l2-distance in the feature space at various levels of abstraction.
Zhang et al. [215] develop a method for semantic segmentation adaptation by observing that a source
trained model should produce the same data statistics on the target domain as present in the annotated
source distribution. Examples include label distribution or pixels of a certain class clustering around spe-
cific regions in an image. Pseudo-labeling is used to estimate these statistics. To enforce distributional
similarity in the output of a source-trained model to the estimated target statistics, KL-divergence is used
as a distributional minimization metric.
Balaji et al. [6] propose an optimal transport metric for domain adaptation which is robust to outliers
in the data. The authors show that relaxing the optimization with respect to outliers improves adaptation
performance compared to using traditional Wasserstein Distance on several real-world datasets.
2.3 Source-free Adaptation
Most UDA approaches benefit from having joint access to the source and target datasets. In the case of
adversarial approaches, this ensures that a domain discriminator can distinguish between encoded source
and target samples. In the case of distributional minimization approaches, the appropriate distribution
statistics can be computed at runtime for both domains. However, the assumption that both source and target datasets are jointly available does not always hold. Source-free adaptation has been recognized and previously explored in UDA, with most methods designed for classification tasks.
A possible approach for handling the absence of a source domain during adaptation is to learn a data
distribution similar to the original source data distribution. This lends itself naturally to adversarial ap-
proaches. Liu et al. [104] approximate the source distribution by training a GAN when the source domain
is still available. During adaptation, GAN samples are used in place of source samples.
Yang et al. [203] use fixed source models and attempt to model the target latent space directly via a
mixture of Gaussians. The clustering of samples to the correct class is done by minimizing a VAE recon-
struction error, combined with a pseudo-labeling module for the classifier.
Liang et al. [98] perform source-free domain adaptation by employing tools from information maxi-
mization. They observe that in order for a model to mitigate distributional discrepancy between the source
and target domains, the entropy of probabilistic model predictions needs to be low. The adaptation process
is additionally guided by a pseudo-label generation module, which assigns labels to target samples based
on their proximity to class-conditional estimated cluster centers.
2.4 Other Topics in Domain Adaptation
In this section, I present extensions of the UDA problem studied in the literature.
2.4.1 Multi Source Domain Adaptation
In multi source domain adaptation, several data domains are available during source training, with poten-
tially different distributions. These data streams need to be leveraged to infer class labels on an unanno-
tated target domain.
Peng et al. [128] propose a multi source UDA approach where similarity in second order statistics is enforced between pairs of source domains, as well as between each source and the target.
Zhao et al. [220] develop a multi source UDA approach for semantic segmentation. Domain specific
adaptation is performed by minimizing a cycle loss. The latent distributions between source domains are
also encouraged to become similar through the use of inter-domain discriminator networks.
2.4.2 Multi Target Domain Adaptation
In multi target DA, a single source domain is present, and adaptation needs to be performed with respect to several target domains. This scenario can be reduced to single source, single target DA; however, the availability of several similar target domains benefits models that learn domain agnostic relationships by leveraging data across all targets.
Gholami et al. [53] propose an information theoretic approach that uses a shared embedding for domain
agnostic representations of the data while domain specific features are kept separate.
Saporta et al. [162] propose an approach for multi target semantic segmentation based on adversarial
learning. A multi-domain discriminator aligns the source and target domains. Then, domain specific
discriminators are introduced to further tune the decision boundary independently for each target domain.
2.4.3 Few-shot DA
The current work considers scenarios where the source domain(s) and target share the same set of classes,
and the target domain is fully unannotated. UDA has also been explored in scenarios where these re-
strictions are relaxed. Few-shot DA differs from standard UDA in that the target domain is not wholly
unannotated. Instead, a small set of target samples is made available for training.
Motiian et al. [121] leverage the available samples in the target domain to guide the network toward a structured latent representation of the data. Target samples are paired with source samples into one of four possible settings, and a discriminator is tasked with identifying which setting a random pair corresponds to. By combining source and target samples, performance close to that of supervised training is reached.
Wang et al. [192] consider few-shot adaptation for object detection. They pair image features from the source and target domains and use a discriminator to enforce distribution alignment. In addition, an instance-level adaptation module aligns features belonging to the same object.
2.4.4 Open-set DA
In open set DA, the assumption that source and target domains share the same set of classes is no longer
enforced. Rather, only a subset of source and target classes is guaranteed to be shared between domains.
Busto et al. [126] discuss the setting of open set DA and show that both unsupervised and semi-supervised versions of the problem can be solved via integer linear programming.
Saito et al. [156] consider a different version of open set DA, where the source domain only consists of classes that are a subset of the target classes, without access to additional data points. Their adaptation method is based on adversarial learning and leads to a feature generator that embeds samples from seen and unseen classes in a separable way.
In the following four chapters, I describe my contributions in the area of UDA with private data. My
goal is to demonstrate that the core idea of using an intermediate distribution as a surrogate for the source
domain can be used to relax the need to access the source domain data during adaptation across multiple
settings.
Chapter 3
Semantic Segmentation of Real-world Images
My first contribution is addressing the challenge of segmentation of real-world images when annotated
data is not accessible. Note that data annotation in this domain is significantly more challenging than
annotating data for image classification because it involves pixel-level labeling of input images.
3.1 Motivation
A significant limitation of most existing UDA semantic segmentation algorithms is that domain alignment can be performed only if the source and the target domain data are accessible concurrently. However, the source annotated data may not necessarily be accessible during the model adaptation phase in a continual learning scenario [152]. I focus on a more challenging yet more practical model adaptation scenario: I consider that a pre-trained model is given, and the goal is to adapt this model to generalize well in a target domain using solely unannotated target domain data. My algorithm improves over naively using an off-the-shelf pre-trained model by benefiting from the unannotated data in the target domain. This is a step towards lifelong learning ability [164, 150].
In this chapter, I relax the need for source domain annotated data for model adaptation. My idea is
to learn a prototypical distribution that encodes the abstract knowledge learned for image segmentation
using the source domain annotated data. The prototypical distribution is used for aligning the distributions
15
across the two domains in a shared embedding space. I also provide theoretical analysis to justify the
proposed model adaptation algorithm and determine the conditions under which my algorithm is effective.
Finally, I provide experiments on the GTA5→Cityscapes and SYNTHIA→Cityscapes benchmark domain
adaptation image segmentation tasks to demonstrate that my method is effective and leads to competitive
performance, even when compared against existing UDA algorithms.
3.2 Related Work
I provide an overview of semantic segmentation algorithms and describe recent UDA and source-free UDA approaches for this setting.
3.2.1 Semantic Segmentation
Compared to image classification problems, semantic segmentation tasks require each pixel of an image to receive a label from a set of semantic categories. As each image dimension may span thousands of pixels, semantic segmentation models require robust encoder/decoder architectures capable of synthesizing large amounts of image data. Recent SOTA results for supervised semantic segmentation have thus
been obtained by the use of deep neural networks (DNNs), and in particular convolutional neural networks
(CNNs), which are specialized for image processing [93]. While different architecture variants exist [107,
189, 28, 175], approaches often rely on embedding images into a latent feature space via a CNN feature ex-
tractor, followed by an up-sampling decoder which scales the latent space back to the original input space,
where a classifier is trained to offer pixel-wise predictions. Skip connections between different levels of the
network [137, 102], using dilated convolutions [27] or using transformer networks as feature extractors
instead of CNNs [173] have been shown to improve supervised baselines.
While improvements in supervised segmentation are mostly tied to architecture choice and parameter
fine-tuning, model generalization suffers when changes in the input distribution are made. This phe-
nomenon is commonly referred to as domain shift [161]. This issue is common in application domains
where the same model needs to account for multi-modal data, and the training set lacks a certain mode,
e.g., daylight and night-time images [136], clear weather and foggy weather [157], medical data obtained
from different imaging devices [57]. Such differences in input data distributions between source and target
domains significantly impact the generalization power of learned models. When domain shift is present,
source-only training may be at least three-fold inferior compared to training the same model on the tar-
get dataset [65, 64, 95]. Weakly-supervised approaches explore the possibility of having access to limited
label information for the target domain [193, 68, 191]. However, in semantic segmentation tasks, the label
information is expensive to obtain, adding a cost to using such techniques, especially on new datasets.
Due to the limited label availability for semantic segmentation tasks, the use of artificial images and
labels becomes an attractive alternative for training segmentation models. Semantic labels are easy to
generate for virtual images, and a model trained on such images could then be used on real-world data.
Overcoming domain shift becomes the primary bottleneck for successfully applying such models to new
domains.
3.2.2 Domain Adaptation
UDA addresses model generalization in scenarios where target data label information is unavailable. Such
techniques employ a shared feature embedding space between the source and target domain. Most of these
methods achieve this goal using domain discriminators or direct source-target feature alignment.
3.2.2.1 Adversarial Adaptation
Techniques based on adversarial learning employ the idea of a domain discriminator, used in GANs [55], to produce a shared source/target embedding space. A discriminator is tasked with differentiating whether two image encodings originate from the same domain, or whether one is from the source and one is from the target. A feature encoder aims to fool the discriminator, producing increasingly similar source and target latent features as training progresses, until the feature extractor yields a shared embedding space for the source and target data.
In the context of UDA for semantic segmentation, Murez et al. [123] use an intermediate feature dis-
tribution that attempts to capture domain agnostic features of both source and target datasets. To improve
the domain agnostic representation, a discriminator is trained to differentiate whether an encoded image
is part of the source or target domain. The encoder networks are then adversarially trained to fool the
discriminator, resulting in similar embeddings between source and target samples. Hoffman et al. [63]
employ the cycle consistency loss proposed in Zhu et al. [223] to improve the adversarial network adapta-
tion performance. In addition to this, Hoffman et al. [63] use GANs to stylistically transfer images between
source and target domains and use a consistency loss to ensure network predictions on the source image
will be the same as in the stylistically shifted variant. Saito et al. [155] use a discriminator-based approach that does not rely on GANs to mimic the source or target data distributions. They propose the following adversarial learning process on a feature encoder network with two classification heads: (1) they first keep the feature encoder fixed and optimize the classifiers to make their outputs as different as possible; (2) they freeze the classifiers and optimize the feature encoder so that both classifiers produce close outputs. Sankaranarayanan et al. [161] employ an image translation network tasked with translating input images into the target domain feature space. A discriminator differentiates source images from target images passed through the network, and a similar procedure is applied to target images. A pixel-level cross-entropy loss ensures the network is able to perform semantic
segmentation. Lee et al. [95] use a similar idea to Saito et al. [155] in that a network with two classifiers
is used for adaptation. The feature extractor and classification heads are trained in an alternating fashion.
The work of Lee et al. [95] differentiates itself by employing an approximation of optimal transport to
compute these discrepancy metrics, leading to improved performance over Saito et al. [155].
3.2.2.2 Adaptation by Distribution Alignment
Adaptation methods using direct distribution alignment share the same goal as adversarial methods. How-
ever, distribution alignment is reached by minimizing an appropriate distributional distance metric be-
tween the source and target embedding feature distributions.
Zhang et al. [212] employ an adaptation framework based on the idea that source and target latent
features should cluster together similarly. They use a pseudo-labeling approach to produce initial target
labels, then minimize the distance between class-specific latent feature centroids between source and target
domains. The minimization metric of choice is the l2-distance. For improved performance, they use category anchors to guide the adaptation process. Gabourie et al. [49] propose an adaptation method based on a
shared network encoder between source and target domains. Their model is trained by minimizing cross-
entropy loss for the source samples and is tasked with reducing the distance between source and target
embeddings in the latent feature space. To achieve this, Sliced Wasserstein Distance is minimized between
the source and target embeddings, leading to improved classifier performance on target samples.
The assumption that continuous access to source data is guaranteed when performing UDA does not always hold, especially in privacy-sensitive domains. This has been previously explored by methods that do not employ DNNs [43, 72, 196], and has recently become the focus of DNN-based algorithms for image classification tasks [89, 82, 158, 204]. Source-free semantic segmentation has been explored relatively less compared to joint adaptation work. [87] employs source-domain generalization and target pseudo-labeling in the adaptation method. [105] relies on self-supervision and patch-level optimization for adaptation. [209] allows models trained on synthetic data to generalize on real data via a mixture of positive and negative class inference.
My adaptation approach shares the idea of direct distribution alignment. As described previously, several choices for latent feature alignment have been explored, such as the l2-distance [197], KL-divergence [215], or Wasserstein Distance (WD) [49, 95]. The WD has been shown to leverage the geometry of the embedding space better than other distance metrics [178]. Empirically, the Wasserstein metric has been observed to benefit the robustness of training deep models, as in the case of the Wasserstein GAN [5], and to improve the relevance of discrepancy measures, as reported by [95]. A limitation of the WD is the difficulty of optimizing it, as computing the distance requires solving a linear program. Therefore, I employ an approximation of this metric, the Sliced Wasserstein Distance (SWD), which maintains the metric properties of the WD while allowing for end-to-end differentiation in the optimization process.
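The SWD replaces the linear program of the full WD with an average of closed-form one-dimensional distances: both samples are projected onto random directions on the unit sphere, and in 1-D the Wasserstein distance reduces to comparing sorted projections. A minimal numpy sketch of this Monte-Carlo estimate (assuming, for simplicity, equally sized samples; the number of projections is a free parameter):

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=128, seed=0):
    """Monte-Carlo estimate of the squared 2-Sliced Wasserstein
    Distance between two equally sized samples x and y."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)            # random unit direction
        proj_x = np.sort(x @ theta)               # sorted 1-D projections
        proj_y = np.sort(y @ theta)
        total += np.mean((proj_x - proj_y) ** 2)  # closed-form 1-D W2^2
    return total / n_projections
```

Every operation here (projection, sorting, squared differences) is differentiable almost everywhere, which is what allows the SWD to be minimized end-to-end during training.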
I base my source-free UDA approach on estimating the latent source embeddings via a prototypical distribution [141]. This approximation relies on the observation that a supervised model trained on K classes will produce a K-modal distribution in its latent space. This makes it possible to perform adaptation without direct access to source samples, by instead sampling from the K-modal prototypical distribution. The approximation also introduces a negligible number of additional parameters into my model. Once I produce a pseudo-dataset by sampling the prototypical distribution, I align the target feature encodings by minimizing the SWD between the two data distributions. I develop theoretical bounds to show that this approach minimizes the model error on the target set.
Figure 3.1: Diagram of the proposed model adaptation approach (best seen in color): (a) initial model
training using the source domain labeled data, (b) estimating the prototypical distribution as a GMM dis-
tribution in the embedding space, (c) domain alignment is enforced by minimizing the distance between
the prototypical distribution samples and the target unlabeled samples, (d) domain adaptation is enforced
for the classifier module to fit correspondingly to the GMM distribution.
3.3 Problem Formulation
Let $P_S$ be the data distribution corresponding to a source domain, and $P_T$ similarly the data distribution corresponding to a target domain, with $P_S$ being potentially different from $P_T$. Assume a set of multi-channel images $X_S = \{x_1^s, x_2^s, \ldots\}$ is sampled from $P_S$ with corresponding pixel-wise semantic labels $Y_S = \{y_1^s, y_2^s, \ldots\}$. Let $X_T$ be a set of images sampled from $P_T$ with corresponding labels $Y_T$, to which I have no access. Consider both $X_S$ and $X_T$ to be represented as images with real pixel values in $\mathbb{R}^{W \times H \times C}$, where $W$ is the image width, $H$ is the image height, and $C$ is the number of channels. The label spaces $Y_S, Y_T$ share the same space of label maps in $\mathbb{R}^{W \times H}$.
My goal is to learn the parameters $\theta$ of a CNN semantic segmentation model $\phi_\theta(\cdot): \mathbb{R}^{W \times H \times C} \to \mathbb{R}^{W \times H}$ capable of accurately predicting pixel-level labels for images sampled from the target distribution $P_T$. I can formulate this as a risk minimization problem, where the goal is to minimize the target empirical risk, achieved for $\theta^* = \arg\min_\theta \{\mathbb{E}_{x^t \sim P_T}[\mathcal{L}(\phi_\theta(x^t), y^t)]\}$, where $x^t \in X_T$, $y^t \in Y_T$. The difficulty of the above optimization stems from the lack of access to the label set $Y_T$. Instead, I am provided access to the labeled source domain $(X_S, Y_S)$, and subsequently to the target domain $X_T$. The source-free nature of my problem requires that once the target images $X_T$ become available, access to source domain information becomes unavailable.
To achieve my desired outcome, a source model first needs to be learned on the provided source dataset. Let $N$ be the size of the source dataset $X_S$, and let $(x_i^s, y_i^s)$ be image/label pairs from $X_S, Y_S$. Consider $K$ to be the number of semantic classes and $\mathbb{1}_a(b)$ the indicator function determining whether $a$ and $b$ are equal. Then, I learn the parameters that minimize the empirical risk on the source domain by optimizing the standard cross-entropy loss:
$$\hat{\theta} = \arg\min_\theta \Big\{ \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_{ce}(\phi_\theta(x_i^s), y_i^s) \Big\}, \qquad \mathcal{L}_{ce}(p, y) = -\frac{1}{WH} \sum_{w=1}^{W} \sum_{h=1}^{H} \sum_{k=1}^{K} \mathbb{1}_{y_{wh}}(k) \log(p_{wh,k}), \qquad (3.1)$$
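The loss of Eq. (3.1) can be written out directly in numpy; here `probs` holds the per-pixel softmax outputs of shape (W, H, K) and `labels` the integer label map (a sketch of the loss computation only, not the training loop; the small epsilon is an implementation detail added for numerical safety):

```python
import numpy as np

def pixel_cross_entropy(probs, labels):
    """Mean pixel-wise cross-entropy of Eq. (3.1).
    probs  -- (W, H, K) softmax probabilities,
    labels -- (W, H) integer class indices in {0, ..., K-1}."""
    W, H, K = probs.shape
    onehot = np.eye(K)[labels]                   # indicator 1_{y_wh}(k)
    return -(onehot * np.log(probs + 1e-12)).sum() / (W * H)
```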
The optimization setup in Eq. (3.1) ensures model generalization on inputs sampled from $P_S$. In cases where $P_T$ differs from $P_S$, the model will not generalize on the target domain due to domain shift. To account for domain shift, I need to learn an invariant feature space between the two domains without joint access to $X_S$ and $X_T$. Let $f, g, h$ be three parameterized sub-networks such that $\phi = f \circ g \circ h$, where $f: \mathbb{R}^{W \times H \times C} \to \mathbb{R}^L$ is an encoder sub-network, $g: \mathbb{R}^L \to \mathbb{R}^{W \times H}$ is an up-scaling decoder, and $h: \mathbb{R}^{W \times H} \to \mathbb{R}^{W \times H}$ is a semantic classifier, with $L$ the dimension of the latent space. Thus, in order to create a common source-target embedding space, I wish for the network $f \circ g$ to embed source and target samples into a shared distribution. Under this condition, the classifier $h$ trained on source domain samples will be able to generalize on target inputs.

This can be accomplished by direct distribution alignment between the embeddings of the two domains at the decoder output. As previously explored in the literature [49], a suitable distributional distance metric $D$ can be minimized between network outputs for the source and target domains. However, my approach does not permit access to source domain data during adaptation, hence directly minimizing $D(f \circ g(X_S), f \circ g(X_T))$ is not possible. I am still able to benefit from distribution alignment by relaxing the need for access to the source domain samples during adaptation. I describe my source domain approximation approach and the choice for $D$ in the next section.
3.4 Proposed Algorithm
Figure 3.1 presents a high-level visual description of my approach. My solution is based on aligning the source and the target distributions via an intermediate prototypical distribution in the embedding space. Since the last layer of the classifier is a softmax layer, the classifier can be treated as a maximum a posteriori (MAP) estimator. This implies that if, after training, the model generalizes well in the target domain, it must transform the source input distribution into a multimodal distribution $P_Z(z)$ with $K$ separable components in the embedding space (see Figure 3.1 (a)). Each mode of this distribution represents one of the $K$ semantic classes. This prototypical distribution emerges as a result of model training because the classes must become separable in the embedding space for a generalizable softmax classifier. Recently, this property has been used for UDA [125, 22], where the means of the distribution modes are considered the class prototypes. The idea for UDA is to align the domain-specific prototypes for each class to enforce distributional alignment across the domains. My idea is to adapt the trained model using the target unlabeled data such that, in addition to the prototypes, the source-learned prototypical distribution does not change after adaptation. As a result, the classifier subnetwork will still generalize in the target domain because its input distribution has been consolidated.
I model the prototypical distribution P_Z(z) as a Gaussian mixture model (GMM) with k components:

P_Z(z) = Σ_{j=1}^{k} α_j N(z | μ_j, Σ_j),    (3.2)

where α_j denote the mixture weights, i.e., the prior probability of each semantic class. For each component, μ_j and Σ_j denote the mean and covariance of the Gaussian (see Figure 3.1 (b)).
The empirical version of the prototypical distribution is accessible through the source domain samples {(g(f(x^s_i)), y^s_i)}_{i=1}^{N}, which I use for estimating the parameters. Note that since the labels are accessible in the source domain, the parameters of each component can be estimated independently via MAP estimation. To enhance the GMM learning process, I utilize the classifier confidence for source domain pixels by excluding pixels near the decision boundary. The idea behind excluding these data points is that the sampling distribution will have higher separation, improving the adaptation outcome. Let

S_{τ,k} = { f ◦ g(x_i)_{p,q} | ϕ(x_i)_{p,q} > τ, argmax ϕ(x_i)_{p,q} = (y_i)_{p,q} }.
Then, the MAP estimates for the distribution parameters are:

α̂_j = |S_{τ,j}| / Σ_{j′=1}^{k} |S_{τ,j′}|,   μ̂_j = (1 / |S_{τ,j}|) Σ_{(x^s_i, y^s_i) ∈ S_{τ,j}} f ◦ g(x^s_i),
Σ̂_j = (1 / |S_{τ,j}|) Σ_{(x^s_i, y^s_i) ∈ S_{τ,j}} (f ◦ g(x^s_i) − μ̂_j)^⊤ (f ◦ g(x^s_i) − μ̂_j).    (3.3)
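The estimates of Eq. (3.3) reduce to per-class empirical statistics. The following is a minimal NumPy sketch (illustrative, not the dissertation's actual implementation), assuming the confident embeddings of each set S_{τ,j} have already been gathered into one array per class:

```python
import numpy as np

def estimate_prototypical_gmm(class_embeddings):
    """MAP estimates of GMM parameters from per-class confident embeddings.

    class_embeddings: list with one array per semantic class, each of shape
    (n_j, d), holding decoder-output embeddings whose classifier confidence
    exceeded tau. Returns mixture weights, means, and covariance matrices.
    """
    counts = np.array([len(e) for e in class_embeddings], dtype=float)
    alphas = counts / counts.sum()                      # mixture weights alpha_j
    mus = [e.mean(axis=0) for e in class_embeddings]    # per-class means mu_j
    sigmas = []
    for e, mu in zip(class_embeddings, mus):
        centered = e - mu
        sigmas.append(centered.T @ centered / len(e))   # empirical covariance Sigma_j
    return alphas, mus, sigmas
```

Because the source labels make class membership known, each component is fit independently, which is what makes the estimates closed-form rather than requiring iterative EM.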
I take advantage of the prototypical distribution estimate in Eq. (3.3) as a surrogate for the source domain distribution, allowing me to align the source and the target domain distributions in the absence of the source samples. I adapt the model such that the encoder transforms the target domain distribution into the prototypical distribution in the embedding space. To this end, I draw random samples from the prototypical distribution estimate to generate a labeled pseudo-dataset D_Z = (X_Z, Y_Z), where X_Z = [x^z_1, ..., x^z_{N_Z}] ∈ R^(K×N_Z), Y_Z = [y^z_1, ..., y^z_{N_Z}] ∈ R^(K×N_Z), and x^z_i ∼ P_Z(z). To improve the quality of the pseudo-dataset, I use the classifier sub-network prediction on the drawn samples x^z and keep only samples with h(x^z) > τ.

Algorithm 1 MAS³(λ, τ)
1: Initial Training:
2: Input: source domain dataset D_S = (X_S, Y_S)
3: Training on Source Domain:
4: θ̂ = argmin_θ Σ_i L(ϕ_θ(x^s_i), y^s_i)
5: Prototypical Distribution Estimation:
6: Use Eq. (3.3) to estimate α_j, μ_j, and Σ_j
7: Model Adaptation:
8: Input: target dataset D_T = (X_T)
9: Pseudo-Dataset Generation:
10: D_Z = (X_Z, Y_Z) = ([x^z_1, ..., x^z_{N_Z}], [y^z_1, ..., y^z_{N_Z}]), where:
11: x^z_i ∼ P_Z, 1 ≤ i ≤ N_Z, y^z_i = argmax_j { h(x^z_i) }
12: for itr = 1, ..., ITR do
13:   Draw random batches from D_T and D_Z
14:   Update the model by solving Eq. (3.4)
15: end for

After generating the pseudo-dataset, I solve the following optimization problem to align the source and the target distributions indirectly in the embedding space:

argmin_θ { (1/N_p) Σ_{i=1}^{N_p} L_ce(h(x^z_i), y^z_i) + λ D(f ◦ g(P_T), P_Z) },    (3.4)

where D(·,·) denotes a probability distribution metric enforcing alignment of the target domain distribution with the prototypical distribution in the embedding space, and λ is a trade-off parameter between the two terms (see Figure 3.1 (c)).
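The pseudo-dataset generation and confidence filtering steps can be sketched as below. This is a minimal NumPy illustration, not the dissertation's code; `classifier` is a hypothetical stand-in for the classifier sub-network h, and the GMM parameters are assumed to have been estimated already:

```python
import numpy as np

def generate_pseudo_dataset(alphas, mus, sigmas, classifier, n_samples, tau, seed=0):
    """Draw samples from the prototypical GMM and keep only confident ones.

    classifier: maps an array (n, d) of embeddings to class probabilities (n, K).
    Returns (X_Z, Y_Z): confident samples and their pseudo-labels.
    """
    rng = np.random.default_rng(seed)
    # pick a mixture component for each draw, then sample that Gaussian
    comps = rng.choice(len(alphas), size=n_samples, p=alphas)
    z = np.stack([rng.multivariate_normal(mus[j], sigmas[j]) for j in comps])
    probs = classifier(z)
    keep = probs.max(axis=1) > tau          # discard low-confidence draws
    return z[keep], probs[keep].argmax(axis=1)
```

Labels come from the classifier's own argmax on the drawn samples, matching lines 10-11 of the algorithm above.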
3.5 Theoretical Analysis
I prove that Algorithm 1 leads to model generalization on the target domain by minimizing the empirical risk. For such a result, I need to tie the model's target-domain generalization to the distributional distance between the source and target domains. For this purpose, I use the framework developed by [132], designed to upper bound the target risk in terms of the Wasserstein distance between the source and target domains. Then, following Theorem 2 from [132], the following result characterising Algorithm 1 follows:
Theorem 1. Consider the space H of all possible hypotheses applicable to the proposed segmentation task. Let ε_S(h), ε_T(h) represent the expected source and target risk for hypothesis h, respectively. Let μ̂_S, μ̂_Z, μ̂_T be the empirical means of the embedding space for the source, intermediate, and target datasets, respectively. Let W(·,·) represent the Wasserstein distance, and let ξ, ζ be appropriately defined constants. Consider e_C(h) to be the combined error of a hypothesis h on both the source and target domains, i.e. ε_S(h) + ε_T(h), and let h* be the minimizer of this function. Then, for a model h, the following result holds:

ε_T(h) ≤ ε_S(h) + W(μ̂_S, μ̂_Z) + W(μ̂_Z, μ̂_T) + √(2 log(1/ξ)/ζ) (√(1/N_s) + √(1/N_t)) + e_C(h*)    (3.5)
I first provide an intuitive explanation of why Algorithm 1 is able to minimize the right-hand side of Eq. 3.5. The second term is the WD between the source dataset and the sampling dataset; this distance will be small if the GMM approximation of the source domain is successful. The third term is the WD between the sampling dataset and the target dataset; this term is directly minimized by the adaptation loss. The quantity 1 − τ is a constant directly dependent on the confidence threshold τ, which I choose close to 1. The fourth term is directly dependent on the dataset sizes, and becomes small when a large number of samples is present. Finally, the last term is a constant indicating the difficulty of the adaptation problem.
Proof: To formalize this intuition, I utilize the result of Theorem 2 from [132] for target domain generalization:

Theorem 2. (Redko et al. [132]) For the variables defined under Theorem 1, the following distribution alignment inequality holds:

ε_T ≤ ε_S + W(μ̂_S, μ̂_T) + √(2 log(1/ξ)/ζ) (√(1/N_s) + √(1/N_t)) + e_C(h*)    (3.6)
The above relation characterises the target error after source training, and does not consider my specific scenario of using an intermediate distribution. I adapt this bound for Algorithm 1 by expanding the second term of Eq. 3.6. Since W(·,·) is a metric, the triangle inequality gives

W(μ̂_S, μ̂_T) ≤ W(μ̂_S, μ̂_Z) + W(μ̂_Z, μ̂_T)    (3.7)

Combining Eq. 3.7 with Eq. 3.6 leads to the result in Theorem 1.
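Written out, the combination of Eq. 3.6 and Eq. 3.7 amounts to a single substitution:

```latex
\begin{aligned}
\epsilon_T(h) &\le \epsilon_S(h) + W(\hat\mu_S, \hat\mu_T)
  + \sqrt{2\log(1/\xi)/\zeta}\Big(\sqrt{1/N_s} + \sqrt{1/N_t}\Big) + e_C(h^*) \\
 &\le \epsilon_S(h) + \underbrace{W(\hat\mu_S, \hat\mu_Z) + W(\hat\mu_Z, \hat\mu_T)}_{\text{triangle inequality, Eq.~3.7}}
  + \sqrt{2\log(1/\xi)/\zeta}\Big(\sqrt{1/N_s} + \sqrt{1/N_t}\Big) + e_C(h^*),
\end{aligned}
```

which is exactly the bound of Theorem 1.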
3.6 Experimental Validation
To evaluate my approach, I consider three common datasets used in the semantic segmentation literature: GTA5 [135], SYNTHIA [139], and Cityscapes [34]. GTA5 and SYNTHIA consist of artificially generated street images, with 24966 and 9400 instances respectively. Cityscapes is composed of real-world images recorded in several European cities, and consists of 2957 training images and 500 test images. For all three datasets, images are processed and resized to a standard shape of 512×1024.

Following the literature, I consider two adaptation tasks designed to evaluate model adaptation performance when the training set consists of artificial images and the test set consists of real-world images. For both SYNTHIA→Cityscapes and GTA5→Cityscapes, I evaluate performance under two scenarios, when 13 or 19 semantic classes are available.
3.6.1 Implementation and Training Details
I use a DeepLabV3 architecture [27] with a VGG16 encoder [166] as my CNN architecture. The decoder is followed by a 1×1 convolution softmax classifier. I choose a batch size of 4 images for training, and use the Adam optimizer with a learning rate of 1e−4 for gradient propagation. For adaptation, I keep the same optimizer parameters as for training. I use 10 projections in my SWD computation, and set the regularization parameter λ to 0.5. I perform training for 100k iterations and adaptation for 50k iterations. When learning the GMMs, I choose the confidence parameter τ to be 0.95. I observe higher values of τ to be correlated with increased performance, and conclude that a τ setting above 0.9 will lead to similar target performance.

I run my experiments on an NVIDIA Titan XP GPU. Since my method relies on distributional alignment, the label distribution may vary significantly between different target batches. As the batch label distribution approaches the target label distribution when the batch size increases, I use the oracle label distribution per batch when sampling from the GMM; this can be avoided if sufficient GPU memory is available. Experimental code is provided with the current submission.
3.6.2 State-of-the-art methods
Source-free model adaptation algorithms for semantic segmentation have only recently been explored. Thus, because most UDA algorithms are designed for joint training, besides source-free approaches I also include both pioneering and recent UDA image segmentation methods, to be representative of the literature. I compare my performance against the adversarial learning-based UDA methods GIO-Ada [30], ADVENT [188], AdaSegNet [181], TGCF-DA+SE [32], PCEDA [206], and CyCADA [63]. I also include methods based on direct distributional matching, which are more similar to MAS³: FCNs in the Wild [65], CDA [215], DCAN [197], SWD [95], and Cross-City [25].
3.6.3 Results
I provide quantitative and qualitative results. In Table 3.1, I report the performance of my method on the SYNTHIA→Cityscapes adaptation task.
Quantitative performance comparison:

Method               Adv.  road sdwlk bldng light sign vgttn sky person rider car bus mcycl bcycl mIoU
Source Only (VGG16)  N     6.4 17.7 29.7 0.0 7.2 30.3 66.8 51.1 1.5 47.3 3.9 0.1 0.0 20.2
FCNs in the Wild     N     11.5 19.6 30.8 0.1 11.7 42.3 68.7 51.2 3.8 54.0 3.2 0.2 0.6 22.9
CDA                  N     65.2 26.1 74.9 3.7 3.0 76.1 70.6 47.1 8.2 43.2 20.7 0.7 13.1 34.8
DCAN                 N     9.9 30.4 70.8 6.7 23.0 76.9 73.9 41.9 16.7 61.7 11.5 10.3 38.6 36.4
SWD                  N     83.3 35.4 82.1 12.2 12.6 83.8 76.5 47.4 12.0 71.5 17.9 1.6 29.7 43.5
Cross-City           Y     62.7 25.6 78.3 1.2 5.4 81.3 81.0 37.4 6.4 63.5 16.1 1.2 4.6 35.7
GIO-Ada              Y     78.3 29.2 76.9 10.8 17.2 81.7 81.9 45.8 15.4 68.0 15.9 7.5 30.4 43.0
ADVENT               Y     67.9 29.4 71.9 0.6 2.6 74.9 74.9 35.4 9.6 67.8 21.4 4.1 15.5 36.6
AdaSegNet            Y     78.9 29.2 75.5 0.1 4.8 72.6 76.7 43.4 8.8 71.1 16.0 3.6 8.4 37.6
TGCF-DA+SE           Y     90.1 48.6 80.7 3.2 14.3 82.1 78.4 54.4 16.4 82.5 12.3 1.7 21.8 46.6
PCEDA                Y     79.7 35.2 78.7 10.0 28.9 79.6 81.2 51.2 25.1 72.2 24.1 16.7 50.4 48.7
MAS³ (Ours)          N     75.1 49.6 70.9 14.1 25.3 72.7 76.7 48.5 19.9 65.3 17.6 6.8 39.0 44.7

Table 3.1: Model adaptation comparison results for the SYNTHIA→Cityscapes task. I have used DeepLabV3 [27] as the feature extractor with a VGG16 [166] backbone. The first row presents the source-trained model performance prior to adaptation to demonstrate the effect of initial knowledge transfer from the source domain.
SYNTHIA→Cityscapes: I report the quantitative results in Table 3.1. I note that despite addressing a more challenging learning setting, MAS³ outperforms most of the UDA methods. Recently developed UDA methods based on adversarial learning outperform my method, but these methods benefit from a secondary type of regularization in addition to probability matching. Overall, MAS³ performs reasonably well even compared with UDA methods that need source samples. Additionally, MAS³ has the best performance for some important categories, e.g., traffic light.

GTA5→Cityscapes: Quantitative results for this task are reported in Table 3.2. I observe a more competitive performance for this task, but the performance comparison trend is similar. These results demonstrate that although the motivation in this work is source-free model adaptation, MAS³ can also be used as a joint-training UDA algorithm.
Qualitative performance validation:
Method         road sdwk bldng wall fence pole light sign vgttn trrn sky person rider car truck bus train mcycl bcycl mIoU
Source (VGG16) 25.9 10.9 50.5 3.3 12.2 25.4 28.6 13.0 78.3 7.3 63.9 52.1 7.9 66.3 5.2 7.8 0.9 13.7 0.7 24.9
FCNs Wld.      70.4 32.4 62.1 14.9 5.4 10.9 14.2 2.7 79.2 21.3 64.6 44.1 4.2 70.4 8.0 7.3 0.0 3.5 0.0 27.1
CDA            74.9 22.0 71.7 6.0 11.9 8.4 16.3 11.1 75.7 13.3 66.5 38.0 9.3 55.2 18.8 18.9 0.0 16.8 14.6 28.9
DCAN           82.3 26.7 77.4 23.7 20.5 20.4 30.3 15.9 80.9 25.4 69.5 52.6 11.1 79.6 24.9 21.2 1.3 17.0 6.7 36.2
SWD            91.0 35.7 78.0 21.6 21.7 31.8 30.2 25.2 80.2 23.9 74.1 53.1 15.8 79.3 22.1 26.5 1.5 17.2 30.4 39.9
CyCADA         85.2 37.2 76.5 21.8 15.0 23.8 22.9 21.5 80.5 31.3 60.7 50.5 9.0 76.9 17.1 28.2 4.5 9.8 0.0 35.4
ADVENT         86.9 28.7 78.7 28.5 25.2 17.1 20.3 10.9 80.0 26.4 70.2 47.1 8.4 81.5 26.0 17.2 18.9 11.7 1.6 36.1
AdaSegNet      86.5 36.0 79.9 23.4 23.3 23.9 35.2 14.8 83.4 33.3 75.6 58.5 27.6 73.7 32.5 35.4 3.9 30.1 28.1 42.4
TGCF-DA+SE     90.2 51.5 81.1 15.0 10.7 37.5 35.2 28.9 84.1 32.7 75.9 62.7 19.9 82.6 22.9 28.3 0.0 23.0 25.4 42.5
PCEDA          90.2 44.7 82.0 28.4 28.4 24.4 33.7 35.6 83.7 40.5 75.1 54.4 28.2 80.3 23.8 39.4 0.0 22.8 30.8 44.6
MAS³ (Ours)    75.5 53.7 72.2 20.5 24.1 30.5 28.7 37.8 79.6 36.9 78.7 49.6 16.5 77.4 26.0 42.6 18.8 15.3 49.9 43.9

Table 3.2: Domain adaptation results for different methods for the GTA5→Cityscapes task.
In Figure 3.2, I visualize exemplar frames from the Cityscapes dataset for the GTA5→Cityscapes task, segmented using the model before and after adaptation, along with the ground-truth (GT) manual annotation for each image. Visual inspection demonstrates that my method significantly improves image segmentation from the source-only segmentation to the post-adaptation segmentation, noticeably on the sidewalk, road, and car semantic classes for the GTA5-trained model.

To demonstrate that my solution behaves as anticipated, I use the UMAP [116] visualization tool to reduce the dimension of the data representations in the embedding space to two for 2D visualization. Figure 3.3 shows samples of the prototypical distribution along with the target domain data before and after adaptation in the embedding space for the GTA5→Cityscapes task. Each point in Figure 3.3 denotes a single data point, and each color denotes a semantic class cluster. Comparing Figure 3.3b and Figure 3.3c with Figure 3.3a, I can see that the semantic classes in the target domain have become much better separated and more similar to the prototypical distribution after model adaptation. This means that the domain discrepancy has been reduced by MAS³ and the source and the target domain distributions are aligned indirectly, as anticipated, via the intermediate prototypical distribution in the embedding space.
Figure 3.2: Qualitative performance: examples of segmented frames for SYNTHIA→Cityscapes using the MAS³ method. Left to right: real images, manually annotated images, source-trained model predictions, predictions based on my method.
3.6.4 Ablation Study
A major advantage of my algorithm over methods based on adversarial learning is its simplicity, depending on only a few hyper-parameters. The major algorithm-specific hyper-parameters are λ and τ. I observed in my experiments that MAS³ performance is stable with respect to the value of the trade-off parameter λ. This is expected because in Eq. 3.4 the L_ce loss term is small from the beginning, due to prior training on the source domain. I also investigated the impact of the confidence hyper-parameter τ. Figure 3.4 presents the fitted GMM on the source prototypical distribution for three different values of τ. As can be seen, when τ = 0, the fitted GMM clusters are cluttered. As I increase the threshold τ and use only samples for which the classifier is confident, the fitted GMM represents well-separated semantic classes, which increases knowledge transfer from the source domain. This experiment also empirically validates what I deduced about the importance of τ using Theorem 1.
(a) GMM samples (b) Pre-adaptation (c) Post-adaptation
Figure 3.3: Indirect distribution matching in the embedding space: (a) drawn samples from the GMM trained on the SYNTHIA distribution, (b) representations of the Cityscapes validation samples prior to model adaptation, (c) representations of the Cityscapes validation samples after domain alignment.
(a) τ = 0, mIoU = 41.8 (b) τ = 0.8, mIoU = 42.7 (c) τ = 0.97, mIoU = 44.7
Figure 3.4: Ablation experiment studying the effect of τ on the GMM learnt in the embedding space: (a) all samples are used; adaptation mIoU = 41.8, (b) a portion of the samples is used; adaptation mIoU = 42.7, (c) only samples with high model confidence are used; adaptation mIoU = 44.7.
3.7 Remarks

In this chapter, I presented an algorithm for adapting an image segmentation model to generalize in new domains using solely unlabeled data after training. In order to handle the source-free nature of the problem, an approximation of the source domain embeddings is produced by learning a GMM distribution. While this approach is suitable for street semantic segmentation, other types of input distributions may require adjustments. Next, I introduce the problem of source-free medical image segmentation. I build upon the algorithm developed for street image segmentation, and adapt the loss function to account for the large portion of pixels belonging to the background class. I also improve the generalization power of the source-approximating distribution, and further analyze my approach.
Chapter 4
Semantic Segmentation of Medical Images
In this chapter, I extend the algorithm developed for street image segmentation and validate it for medical
image adaptation tasks.
4.1 Motivation
In healthcare applications, sharing data is often difficult due to privacy regulations. To maintain the benefits of UDA, source-free adaptation has been developed to bypass the need for direct access to a source domain at adaptation time. While source-free UDA has been previously explored for image classification [50, 110, 80] and street semantic segmentation [184, 228], there are few works addressing this problem for medical image analysis [7]. Medical images are produced by devices such as MRI or CT scanners, whose physical characteristics directly impact the input distribution of the image data. Additionally, compared to natural image semantic segmentation, large portions of medical images are unlabeled. This makes directly applying segmentation algorithms designed for street scenes unsuitable, as I later show in my experimental section.
I propose a source-free semantic segmentation algorithm for medical images that relies on the idea of distributional distance minimization. After source training, I learn a sampling distribution that approximates the source latent embeddings. I then perform direct distributional alignment between the target embeddings and the sampling distribution using an optimal transport based metric. I provide empirical evidence that my algorithm is competitive with state-of-the-art (SOTA) medical UDA works by exploring domain adaptation problems on a cardiac and an organ dataset. I additionally provide theoretical justification for the performance of my approach by developing an upper bound on the expected target error using optimal transport theory.
4.2 Related Work
Deep neural networks designed for semantic segmentation usually follow a common structure: a feature encoder and an up-sampling decoder, followed by a classification head [26]. The feature encoder summarizes the input data distribution into a latent feature distribution of reduced dimensionality. This summarization allows image-level noise to be removed from the prediction process. An up-sampling decoder then maps the latent embedding space to an embedding space the same size as the input space; this step is necessary because segmentation problems require pixel-level labels. Shortcut links between encoder and decoder layers can improve the quality of the up-sampling process [138]. A classification network then learns a decision function on the pixel-level logits. Domain adaptation approaches for semantic segmentation rely on creating a shared embedding space between the source and target. This has been explored at several levels of abstraction in the latent feature space [155, 125]. Creating a domain-agnostic feature distribution is the primary focus of domain adaptation approaches.
A large number of UDA approaches rely on ideas from adversarial learning to reduce the domain gap between source and target embeddings. One solution reduces the domain gap at the level of the input space. Style-transfer networks [75] have been developed to visually translate images between domains, e.g. turning a street photo into a painting in a specific artist's style [52]. In the case of UDA, learning a style transfer network between the source domain and target domain can offer an extension of the source dataset in the style of the target domain, which allows the source classifier to directly generalize on target samples [64]. A second set of approaches directly uses a discriminator loss to ensure the source and target samples are embedded in a shared space [155]. A discriminator network attempts to determine whether pairs of embeddings are from the same domain or from the source and target, while an encoder network is trained to increase the difficulty of this decision task. While adversarially trained networks have proven their effectiveness in UDA [213, 79], training such networks requires access to large amounts of data and careful hyper-parameter fine-tuning. This can pose significant downsides in the case of source-free UDA, where joint access to the source and target domains is not permitted.
Relating AI problems across latent embedding spaces can be done by directly exploiting distributional relations to transfer information [143]. This idea has been explored in UDA by minimizing the distance between the source and the target distributions [44, 92, 103]; it aims to produce a domain-agnostic embedding space by ensuring the source and target latent distributions are indistinguishable with respect to some chosen distributional distance. Previous choices for this distance include mean statistics [201], KL-divergence [180], and optimal transport [95, 171]. Arriving at the appropriate distributional distance metric is still an active area of research; however, the Wasserstein Distance (WD) has recently shown desirable properties in high-dimensional optimization tasks [48, 92, 84]. The WD does not admit a closed-form solution in the general case, so computing it requires solving a non-linear optimization problem. My proposed approach is also based on minimizing the distributional discrepancy between the source and target domains. In my algorithm, I utilize the Sliced Wasserstein Distance (SWD), an approximation of the WD that allows for end-to-end gradient-based optimization and has been shown to be effective in UDA settings [95, 147, 148].
Most UDA methods are designed with the premise of joint source/target access. This has been true for UDA methods in the medical field as well [69, 218, 23, 78], where restrictions on data privacy can lead to different databases being accessible only sequentially. This problem has been previously explored in source-free UDA, where access to the source data is lost following source training. Most segmentation works exploring source-free UDA have targeted real image segmentation [87, 104]; however, the distributional discrepancy between street images and medical images does not allow for direct application of these algorithms to the medical field, as I later demonstrate in my experimental section. In the medical field, distributional distance minimization approaches have been explored for minimizing the KL-divergence between latent feature distributions [8, 7]. I compare with these recent works and show that my optimal transport based approach offers improved adaptation performance.
4.3 Problem Formulation

Figure 4.1: Proposed method: I first perform supervised training on source MR images. Using the source embeddings, I characterize an internal distribution via a GMM in the latent space. I then perform source-free adaptation by matching the embeddings of the target CT images to the learnt GMM distribution, and by fine-tuning the classifier on GMM samples. Finally, I verify the improved performance that the model gains from adaptation.
Consider the available inputs on the source and target domains to be slices of 3D organ scans. Following the literature [24], I assume each image appears in a three-channel format, i.e. three consecutive slices. The pixel labels correspond to the middle slice. Let the source dataset be D_S = {(x^s_1, y^s_1), (x^s_2, y^s_2), ..., (x^s_N, y^s_N)}, and the target dataset be D_T = {(x^t_1, y^t_1), (x^t_2, y^t_2), ..., (x^t_M, y^t_M)}, where (x^s_i, y^s_i) are source domain image/label pairs, and (x^t_i, y^t_i) have the analogous meaning for the target domain. Also, let X_S be the set of all source images, and X_T be the set of all target images, i.e. X_S = {x^s_1, x^s_2, ...}, X_T = {x^t_1, x^t_2, ...}. I assume each image shares the same input dimension of W×H×3; however, the probability distribution P_S of the source images and P_T of the target images differ. The label space of the source and target domains is the same, i.e. each pixel is labeled with one of K semantic classes. Under the restrictions of UDA, my task is to learn a segmentation model ϕ with learnable parameters θ using the fully supervised source domain D_S and allow it to generalize on the target domain D_T, where only access to the unlabeled images X_T is possible. I structure ϕ as a semantic segmentation network composed of a feature extractor f, an up-sampling decoder g, and a probabilistic classifier h, such that ϕ = f ◦ g ◦ h. The output of ϕ returns pixel-wise class probabilities as K-dimensional vectors, where K is the number of semantic classes.
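The three-consecutive-slice input format described above can be sketched as follows; a minimal NumPy illustration of the assumed preprocessing, not the dissertation's actual code:

```python
import numpy as np

def volume_to_slices(volume, labels):
    """Turn a 3D scan into three-channel 2D training pairs.

    volume: array of shape (D, W, H) holding D consecutive scan slices.
    labels: array of shape (D, W, H) with per-pixel annotations.
    Each training image stacks three consecutive slices as channels; the
    label map corresponds to the middle slice.
    """
    xs = np.stack([volume[i - 1:i + 2].transpose(1, 2, 0)   # (W, H, 3)
                   for i in range(1, volume.shape[0] - 1)])
    ys = labels[1:-1]                                       # middle-slice labels
    return xs, ys
```

The first and last slice of each volume only appear as context channels, never as labeled middle slices.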
Initially, the segmentation model learns an appropriate decision function on the labeled source domain. This is done by minimizing the empirical risk with respect to the available source samples via a pixel-level cross-entropy loss:

θ̂ = argmin_θ (1/N) Σ_{i=1}^{N} L_CE(y^s_i, ϕ_θ(x^s_i))    (4.1)

where L_CE is the K-class cross-entropy loss:

L_CE(y, ŷ) = −(1/(WH)) Σ_{i=1}^{W} Σ_{j=1}^{H} Σ_{k=1}^{K} 1_k(y_{i,j}) log ŷ_{i,j,k}    (4.2)

The notation 1_k(x) represents the indicator function, which returns 1 if x equals k and zero otherwise.
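Eq. 4.2 translates directly into code; the following is a minimal NumPy sketch (illustrative, not the training implementation):

```python
import numpy as np

def pixel_cross_entropy(y, y_hat, eps=1e-12):
    """Pixel-level K-class cross-entropy of Eq. 4.2.

    y:     integer label map of shape (W, H)
    y_hat: predicted class probabilities of shape (W, H, K)
    """
    W, H, K = y_hat.shape
    one_hot = np.eye(K)[y]                          # the indicator 1_k(y_ij)
    return -(one_hot * np.log(y_hat + eps)).sum() / (W * H)
```

A perfect one-hot prediction yields a loss of (essentially) zero, while a uniform prediction over K classes yields log K per pixel.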
After learning θ̂, my model will be able to generalize on new samples from P_S. My goal is to minimize the expected risk on the target distribution P_T without access to target labels. To achieve this, I aim to create a feature space of classifier inputs that is common to the source and target domains. This translates to f ◦ g(X_S) ≈ f ◦ g(X_T). UDA approaches have explored direct alignment between two distributions by choosing an appropriate metric to be minimized [198, 214]. In my work, I choose the SWD as this metric, which allows gradients to pass through for network optimization and has been shown to be an effective metric for distribution alignment. Thus, I will minimize SWD(f ◦ g(X_S), f ◦ g(X_T)).
As my setting is concerned with maintaining the privacy of patient records, I do not assume direct access to the dataset D_S during adaptation. Thus, in order for the above approach to be viable, I require an approximation of the source encodings f ◦ g(X_S). I produce this by learning a set of Gaussians that fit the source latent embedding data. Under the assumption of sufficient training samples, a GMM distribution will be able to accurately approximate the underlying data distribution at the decoder output. This allows me to still use the SWD for domain alignment without needing access to the source domain during adaptation.
4.4 Proposed Algorithm

I present a visual description of my approach in Figure 4.1. The first step is to train the network ϕ on the available source data. Once training is complete, I approximate the distribution at the decoder output, i.e. of f ◦ g, as a GMM. After this step, I no longer require direct access to the source distribution. Once adaptation commences, I initialize the network with the source weights, and minimize the SWD between samples from the GMM and target embeddings at the decoder output. Since the GMM distribution differs from the source embeddings, I additionally fine-tune the classifier on GMM samples. Below, I provide in-depth descriptions of the components of this approach.
After source training, I model the source feature embeddings via a GMM distribution. Let the source embedding distribution be P_Z. As the segmentation problem considered has K semantic classes, I expect the embedding space to be separable into K clusters, each feature cluster corresponding to one of the semantic classes. In the GMM approximation, each cluster is learned independently via t Gaussians. This can be achieved in practice because the source domain labels are available at GMM learning time. The approximation of P_Z is computed as follows:

P_Z(z) = Σ_{i=1}^{t·K} α_i p_i(z) = Σ_{i=1}^{t·K} α_i N(z | μ_i, Σ_i),

where α_i, μ_i, Σ_i represent the coefficients, means, and covariance matrices of the learnt Gaussians. I learn the GMM distribution via the Expectation-Maximization algorithm; however, I can improve the learning process by making use of the source labels. For each semantic class k ∈ {1, ..., K}, I identify which latent embeddings correspond to label k. This in turn permits learning the Gaussians in an unsupervised way only from samples that are already aligned to a specific class. I thus avoid the possibility of samples from different classes negatively impacting the learning process, which could result in a Gaussian corresponding to several semantic classes.

I can further tune the GMM learning process by establishing a confidence threshold for the considered data samples. Each data point used to learn the GMMs is a decoder output, so the classifier h can assign it a confidence score. If this confidence score is low, the sample lies close to the decision boundary of the classifier, whereas if the confidence score is high, the sample is a good representative of its feature cluster. I introduce a confidence parameter ρ to enforce this observation, which allows me to exclude samples close to the decision boundary from the learning process. Thus, for each class k, the t Gaussians in my GMM will be learned from the set of embeddings S_{ρ,k} = { f ◦ g(x_i)_{p,q} | ϕ(x_i)_{p,q} > ρ, argmax ϕ(x_i)_{p,q} = (y_i)_{p,q} }, where (p,q) are pairs of pixel positions in a W×H image. Note that in the above formulation, one Gaussian per semantic class (t = 1) is the minimum requirement for learning the GMM. However, my algorithm is not limited in this regard, and I can choose a larger number of Gaussians per class. A larger value of t will improve the approximation of the feature embeddings, at the cost of a larger training time.
However, the training process is dominated by the optimization of the cross-entropy loss in Eq. 4.1. I choose t = 3 in my reporting, and observe a beneficial impact on the two main metrics I consider.
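Collecting the confident set S_{ρ,k} amounts to simple masking; the sketch below is an illustrative NumPy version, under the assumption that the decoder outputs and classifier probabilities for an image are available as arrays:

```python
import numpy as np

def confident_embeddings(embeddings, probs, labels, rho, k):
    """Collect the set S_{rho,k} of confident, correctly classified pixels.

    embeddings: decoder outputs f∘g(x), shape (W, H, d)
    probs:      classifier probabilities phi(x), shape (W, H, K)
    labels:     ground-truth label map, shape (W, H)
    Keeps pixels of class k whose prediction matches the label and whose
    confidence exceeds rho.
    """
    pred = probs.argmax(axis=-1)
    conf = probs.max(axis=-1)
    mask = (labels == k) & (pred == labels) & (conf > rho)
    return embeddings[mask]                 # shape (n_k, d)
```

The per-class arrays returned here would then feed the t-component Gaussian fit for class k.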
Once the GMM parameters are learned, I no longer require access to the source domain. For adaptation, I minimize the distributional distance between the target distribution and the GMM distribution. In practice, I will optimize the SWD at the batch level between target samples and samples from the GMM. To this end, I create a dataset D_P = (Z_P, Y_P) of GMM samples. The SWD loss acts as an approximation of the WD for high-dimensional distributions. The SWD averages over V evaluations of the WD for 1-dimensional projections of the two distributions. As the WD for one-dimensional distributions has a closed-form solution, minimizing the SWD allows for end-to-end gradient-based optimization of the ϕ network. The general formula for the SWD is as follows:

L_SWD(P, Q) = (1/V) Σ_{i=1}^{V} WD(⟨γ_i, P⟩, ⟨γ_i, Q⟩)    (4.3)

where γ_i is a random projection direction.
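Eq. 4.3 can be sketched numerically as follows. This is a minimal numpy sketch for equal-size sample sets, where the 1-D WD between two equal-size empirical distributions reduces to the mean absolute difference of their sorted projections; the function and parameter names are my own.

```python
import numpy as np

def swd(p_samples, q_samples, num_projections=128, seed=0):
    """Sliced Wasserstein distance between two equal-size empirical distributions.

    p_samples, q_samples: (n, d) arrays of embedding vectors.
    """
    rng = np.random.default_rng(seed)
    d = p_samples.shape[1]
    total = 0.0
    for _ in range(num_projections):
        gamma = rng.normal(size=d)
        gamma /= np.linalg.norm(gamma)        # random unit projection direction
        p_proj = np.sort(p_samples @ gamma)   # sorted <gamma_i, P>
        q_proj = np.sort(q_samples @ gamma)   # sorted <gamma_i, Q>
        total += np.mean(np.abs(p_proj - q_proj))  # closed-form 1-D WD
    return total / num_projections
```

In the actual model the same computation would be expressed in an autodiff framework so the gradient can flow back into the feature extractor; the sorting-based 1-D WD is what makes this end-to-end differentiable almost everywhere.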
The final adaptation loss is thus composed of two terms. The first is the cross-entropy loss with respect to samples from D_P and their labels Y_P, for fine-tuning the classifier. The second term is the SWD loss presented in Eq. 4.3 between samples from D_P and target image embeddings f ∘ g(X_T). I can express the adaptation loss, L_adapt, as follows:

L_adapt = L_CE(Z_P, Y_P) + λ L_SWD(Z_P, f ∘ g(X_T))    (4.4)

for some regularizer λ. The pseudocode for my approach, called Source Free semantic Segmentation (SFS), is presented in Alg. 2.
Algorithm 2 SFS(λ, ρ)
1: Initial Training:
2: Input: Source domain dataset D_S = (X_S, Y_S)
3: Training on Source Domain:
4: θ̂ = argmin_θ Σ_i L(f_θ(x_i^s), y_i^s)
5: Internal Distribution Estimation:
6: Set ρ = 0.97, compute S_{ρ,k} and estimate α̂_j, µ̂_j, and Σ̂_j class-conditionally
7: Model Adaptation:
8: Input: Target dataset D_T = (X_T)
9: Pseudo-Dataset Generation:
10: D_P = (Z_P, Y_P) = ([z_1^p, ..., z_{N_p}^p], [y_1^p, ..., y_{N_p}^p]), where z_i^p ∼ P_Z(z), 1 ≤ i ≤ N_p
11: for itr = 1, ..., ITR do
12:   Draw random batches from D_T and D_P
13:   Update the model by solving Eq. (4.4)
14: end for
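Step 10 of Alg. 2, the pseudo-dataset generation, can be sketched as below. This is a hypothetical numpy sketch; the parameter layout — per-class lists of component means, covariances and mixture weights — is my own assumption about how the fitted GMM parameters would be stored.

```python
import numpy as np

def generate_pseudo_dataset(means, covs, weights, n_per_class, seed=0):
    """Draw a labeled pseudo-dataset D_P = (Z_P, Y_P) from per-class GMMs.

    means[k][j], covs[k][j], weights[k][j]: parameters of the j-th of t
    Gaussian components for semantic class k.
    """
    rng = np.random.default_rng(seed)
    z_list, y_list = [], []
    for k in range(len(means)):
        # choose a component for each sample according to the mixture weights
        comps = rng.choice(len(weights[k]), size=n_per_class, p=weights[k])
        for j in comps:
            z_list.append(rng.multivariate_normal(means[k][j], covs[k][j]))
        y_list.extend([k] * n_per_class)
    return np.array(z_list), np.array(y_list)
```

Because D_P carries labels by construction, it can supply both the cross-entropy term and the SWD term of Eq. (4.4) without any access to source images.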
4.5 Theoretical Analysis
I prove that Algorithm 2 is effective because an upper-bound on the expected error for the target domain is minimized as a result of domain alignment.

I analyze the problem in a standard PAC-learning setting. Consider the set of classifier sub-networks H = {ϕ_w(·) | ϕ_w(·) : Z → R^k, w ∈ R^W} to be my hypothesis space. Let e_S and e_T denote the expected error of the optimal model that belongs to this space on the source and target domains, respectively. Let ϕ_{w*} be the model which minimizes the combined source and target expected error e_C(w*), defined as w* = argmin_w e_C(w) = argmin_w {e_S + e_T}. This model is the best model within the hypothesis space in terms of generalizability in both domains. Additionally, consider µ̂_S = (1/N) Σ_{n=1}^{N} f ∘ g(x_n^s) and µ̂_T = (1/M) Σ_{m=1}^{M} f ∘ g(x_m^t) to be the empirical source distribution and the empirical target distribution in the embedding space, both built from the available data points. Let µ̂_P = (1/N_p) Σ_{q=1}^{N_p} z_q denote the empirical internal distribution, built from the generated pseudo-dataset.
Theorem 3. Consider that I generate a pseudo-dataset using the internal distribution and perform UDA according to Algorithm 2. Then:

e_T ≤ e_S + W(µ̂_S, µ̂_P) + W(µ̂_T, µ̂_P) + (1 − ρ) + e_{C'}(w*) + √(2 log(1/ξ)/ζ) (√(1/N) + √(1/M)) + 2√(1/N_p),    (4.5)

where W(·,·) denotes the WD distance and ξ is a constant, dependent on the loss function L(·).
According to Theorem 3, Algorithm 2 minimizes the upper bound expressed in Eq. (4.5) for the target domain expected risk. The source expected risk is minimized when I train the model on the source domain. The second term in Eq. (4.5) is minimized when the GMM is fitted on the source domain distribution. The third term in the Eq. (4.5) upper bound is minimized because it is the second term in Eq. (4.4). The fourth term is small if I let ρ ≈ 1. The term e_{C'}(w*) will be small if the domains are related to the extent that a jointly-trained model can generalize well on both domains, e.g., there should not be label mismatch between similar classes across the two domains. The last term in Eq. (4.5) is negligible if the training datasets are large enough.
Proof: I again use the bound developed by Redko et al. [132], previously referenced in Theorem 2.

Theorem 2 (Redko et al. [132]): Consider that a model is trained on the source domain. Then for any d' > d and ζ < √2, there exists a constant number N_0 depending on d' such that for any ξ > 0 and min(N, M) ≥ max(ξ^{−(d'+2)}, 1) with probability at least 1 − ξ, the following holds:

e_T ≤ e_S + W(µ̂_T, µ̂_S) + e_C(w*) + √(2 log(1/ξ)/ζ) (√(1/N) + √(1/M)).    (4.6)
Theorem 2 relates the performance of a source-trained model on a target domain through an upper-bound which depends on the distance between the source and the target domain distributions in terms of the WD distance. I use Theorem 2 to deduce Theorem 3 in the paper. Redko et al. [132] provide their analysis for the case of a binary classifier, but their analysis can be extended to a multi-class scenario.
Since the parameter ρ is used to estimate the internal distribution, the probability of predicting incorrect labels for the drawn pseudo-data points is 1 − ρ. The following difference is defined:

|L(h_{w_0}(z_i^p), y_i^p) − L(h_{w_0}(z_i^p), ŷ_i^p)| = { 0, if y_i^p = ŷ_i^p; 1, otherwise. }    (4.7)

I can apply Jensen's inequality after taking the expectation with respect to the target domain distribution in the embedding space, i.e., f ∘ g(P_T), on both sides of Eq. (4.7) and conclude:

|e_P − e_T| ≤ E[ |L(h_{w_0}(z_i^p), y_i^p) − L(h_{w_0}(z_i^p), ŷ_i^p)| ] ≤ (1 − ρ).    (4.8)
Now I use Eq. (4.8) to deduce:

e_S + e_T = e_S + e_T + e_P − e_P ≤ e_S + e_P + |e_T − e_P| ≤ e_S + e_P + (1 − ρ).    (4.9)

Taking the infimum on both sides of Eq. (4.9) and employing the definition of the joint optimal model yields:

e_C(w*) ≤ e_{C'}(w*) + (1 − ρ).    (4.10)
Now I consider Theorem 2 by Redko et al. [132] for the source and target domains in my problem and merge Eq. (4.10) into Eq. (4.6) to conclude:

e_T ≤ e_S + W(µ̂_T, µ̂_S) + e_{C'}(w*) + (1 − ρ) + √(2 log(1/ξ)/ζ) (√(1/N) + √(1/M)).    (4.11)

In Eq. (4.11), e_{C'} denotes the joint optimal model true error for the source domain and the pseudo-dataset as the second domain.
Now I apply the triangle inequality twice in Eq. (4.11) to deduce:

W(µ̂_T, µ̂_S) ≤ W(µ̂_T, µ_P) + W(µ̂_S, µ_P) ≤ W(µ̂_T, µ̂_P) + W(µ̂_S, µ̂_P) + 2W(µ̂_P, µ_P).    (4.12)

I now need Theorem 1.1 by Bolley et al. [13] to simplify the term W(µ̂_P, µ_P) in Eq. (4.12).
Theorem 4. (Theorem 1.1 by Bolley et al. [13]) Consider that p(·) ∈ P(Z) and ∫_Z exp(α ∥x∥_2^2) dp(x) < ∞ for some α > 0. Let p̂(x) = (1/N) Σ_i δ(x_i) denote the empirical distribution that is built from the samples {x_i}_{i=1}^{N} that are drawn i.i.d. from x_i ∼ p(x). Then for any d' > d and ξ < √2, there exists N_0 such that for any ϵ > 0 and N ≥ N_0 max(1, ϵ^{−(d'+2)}), I have:

P(W(p, p̂) > ϵ) ≤ exp(−(ξ²/2) N ϵ²)    (4.13)
This theorem provides a relation to measure the distance between the estimated empirical distribution
and the true distribution when the distance is measured by the WD metric.
I can use both Eq. (4.12) and Eq. (4.13) in Eq. (4.11) to conclude Theorem 3 as stated in the paper:

e_T ≤ e_S + W(µ̂_S, µ̂_P) + W(µ̂_T, µ̂_P) + (1 − ρ) + e_{C'}(w*) + √(2 log(1/ξ)/ζ) (√(1/N) + √(1/M)) + 2√(1/N_p).    (4.14)
4.6 Experimental Validation
4.6.1 Datasets
To evaluate my approach I consider two semantic segmentation datasets for internal organ imaging. I describe the two datasets and the adaptation tasks as follows:

Multi-Modality Whole Heart Segmentation Dataset (MMWHS) [226]: the dataset covers 3D scans of heart tissue, and evaluates algorithms on correctly segmenting the scans with respect to four semantic classes: ascending aorta (AA), left ventricle blood cavity (LVC), left atrium blood cavity (LAC), and myocardium of the left ventricle (MYO). Elements which do not fall into one of the above four categories are to be ignored in the training process. The 3D scans provided are obtained from two types of imaging devices, 20 via MRI imaging and 20 via CT imaging. The adaptation tasks I consider assume the source domain to be the MRI domain and the target domain to be the CT domain, as well as the reverse problem, where I adapt from CT to MRI. The MMWHS dataset has been previously considered in the UDA literature for medical image segmentation. I use the data splits prepared and provided by Chen et al. [21]. The data preparation process involves turning each 3D scan into a series of 2D images. Initially, the values of the pixels are normalized to a standard normal distribution. Then, 3-channel 2D images are obtained by considering three consecutive slices of the 3D scan. The semantic labels correspond to the middle slice. Additional data augmentation techniques such as rotations and cropping are used.
CHAOS MR and Multi-Atlas Labeling Beyond the Cranial Vault: the segmentation problem pertains to the segmentation of abdominal organs into four semantic classes: Liver (L), Right Kidney (RK), Left Kidney (LK) and Spleen (S). I use two publicly available organ segmentation datasets, the 2019 CHAOS MR dataset [81] used in the 2019 CHAOS Grand Challenge, and the Multi-Atlas Labeling Beyond the Cranial Vault MICCAI 2015 Challenge dataset [91]. The CHAOS MR dataset consists of MRI scans, while the second dataset consists of CT scans. The data preparation process is done similarly to the MMWHS heart segmentation case. The CT images were clipped outside of the range [−125, 275] [222], and then normalized to zero mean and unit variance at the pixel level. From the 3D scans, 2D images were produced by following the slicing process used in the MMWHS case. Four types of augmentation were employed: rotations, value negation, cropping and adding noise. I again consider two adaptation tasks, from the MRI domain to the CT domain, and from the CT domain to the MRI domain.
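The slicing and normalization pipeline described for both datasets can be sketched as below. This is a hedged numpy sketch: whether normalization is applied per volume or per slice is my assumption, and the function name and array layout are hypothetical.

```python
import numpy as np

def prepare_slices(volume, labels, ct=False):
    """Turn a 3-D scan (S, H, W) into 3-channel 2-D training images.

    Each image stacks three consecutive slices; its label map is that of
    the middle slice. CT intensities are clipped to [-125, 275] before
    normalization to zero mean and unit variance.
    """
    vol = volume.astype(np.float64)
    if ct:
        vol = np.clip(vol, -125.0, 275.0)
    vol = (vol - vol.mean()) / (vol.std() + 1e-8)
    images, maps = [], []
    for s in range(1, vol.shape[0] - 1):
        images.append(np.stack([vol[s - 1], vol[s], vol[s + 1]], axis=0))
        maps.append(labels[s])
    return np.array(images), np.array(maps)
```

A scan with S slices thus yields S − 2 three-channel images, each paired with the middle slice's semantic map.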
4.6.2 Evaluation Methodology
I use two metrics for evaluating the quality of adaptation methods for medical image segmentation: Dice
coefficient and Average Symmetric Surface Distance (ASSD). The Dice coefficient is an indicator of the
quality of segmentation in terms of area, and is used as a primary metric in medical segmentation works
[24, 21, 222]. A large Dice score will signify that on average, there is high overlap between model predic-
tions and actual labels. However, the detail of region borders can suffer even in the presence of a large
Dice score. As the quality of representations can be especially important in the medical field, the ASSD
score is used to measure the fidelity of the predicted semantic map borders. A low value of ASSD signifies
that the produced semantic maps are similar in border representation to the actual areas of organ split.
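For reference, the per-class Dice coefficient can be computed as below (a minimal numpy sketch; ASSD additionally requires extracting the surfaces of the predicted and true regions and averaging symmetric nearest-surface distances, which I omit here):

```python
import numpy as np

def dice_scores(pred, target, num_classes):
    """Per-class Dice coefficient: 2|P ∩ T| / (|P| + |T|)."""
    scores = []
    for k in range(num_classes):
        p = (pred == k)
        t = (target == k)
        denom = p.sum() + t.sum()
        # convention: empty prediction and empty target count as a perfect match
        scores.append(2.0 * np.logical_and(p, t).sum() / denom if denom else 1.0)
    return scores
```

A high Dice value thus measures area overlap only, which is why ASSD is reported alongside it to capture border fidelity.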
For both adaptation problems, I compare my algorithm with recent UDA approaches for semantic segmentation. I consider PnP-AdaNet [42], SynSeg-Net [70], AdaOutput [182], CycleGAN [224], CyCADA [64], SIFA [21], ARL-GAN [29], DSFN [227], SASAN [179], and DSAN [60]. The above methods are based on adversarial learning and do not work in a source-free regime, giving them an advantage over my approach: they benefit from full access to the source domain during adaptation, and do not need to employ an approximation of the source or some other surrogate procedure. I also compare my approach to two source-free medical image segmentation algorithms that were designed using the MMWHS datasets: AdaEnt [8] and AdaMI [7]. Finally, I support the claim that street image segmentation approaches are not suitable to be directly applied to medical images. I consider GenAdapt [87], a recent street semantic segmentation work with SOTA performance, which however does not generalize well to the medical field. I show that my algorithm, even under the added constraint of not having access to the source domain during adaptation, is able to compete with and outperform the above mentioned approaches, demonstrating the effectiveness of my adaptation procedure.
4.6.3 Experimental Setup
I follow common network design for semantic segmentation. I employ a DeepLabV3 architecture [26] with a VGG16 [167] feature extractor. Following the up-sampling decoder, I use a one-layer linear classifier. The network is trained on the source domain for 90,000 iterations using the Adam optimizer. I use a learning rate of 1e−4, decay of 1e−6, ϵ = 1e−6, and a batch size of 16 images. When learning the GMM, I choose the high confidence parameter ρ = 0.97. The main results are reported for t = 3 components for each semantic class. For adaptation, I initialize the network with source weights. I train for 35,000 iterations, with a batch size of 32 images. I again use the Adam optimizer, with a 5e−5 learning rate, decay of 1e−6 and ϵ = 1e−1. The regularizer for the SWD loss, λ, is set to 0.5. In the batch selection process, the target image slices are selected at random, which can lead to scenarios where certain classes are missing and the batch distribution differs significantly from the target distribution. This can damage the adaptation process, leading to different semantic classes being merged together via the SWD loss. While a larger batch size can make the batch distribution arbitrarily close to the target distribution, this is infeasible when running on a single GPU. Thus, I address this issue by using the actual batch label distribution when selecting samples from the GMMs. The hardware I use is an Nvidia Titan Xp GPU.
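The batch-matching heuristic above can be sketched as follows. This is a hypothetical sketch: in the source-free setting the batch label histogram would in practice have to come from model predictions on the target batch, since target annotations are unavailable — that detail and all names below are my assumptions.

```python
import numpy as np

def sample_gmm_matching_batch(samplers, batch_labels, num_classes):
    """Draw GMM pseudo-samples whose label histogram matches the current
    batch, so the SWD term compares like-for-like class mixtures.

    samplers[k]: callable n -> (n, d) array of samples from class k's GMM.
    """
    counts = np.bincount(batch_labels, minlength=num_classes)
    z_parts, y_parts = [], []
    for k, c in enumerate(counts):
        if c == 0:
            continue  # classes absent from the batch are skipped entirely
        z_parts.append(samplers[k](c))
        y_parts.append(np.full(c, k))
    return np.vstack(z_parts), np.concatenate(y_parts)
```

Matching the class proportions this way prevents the SWD from transporting mass between unrelated classes when a rare class is missing from a randomly drawn batch.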
4.6.4 Quantitative and Qualitative Results
I present my main quantitative results in Tables 4.1 through 4.4. For methods which do not report cardiac and abdominal benchmarks, I report the performance from [21]. The performance is upper bounded by the Supervised benchmark, which corresponds to having full access to the target data for training a segmentation model. The Source-Only benchmark corresponds to the model trained on the source domain directly applied to target data, which leads to the poorest performance. On the cardiac dataset I observe that my method is able to outperform the other algorithms in terms of Dice score, with an 81.3 average performance on the MRI → CT task. I also achieve the best performance on the CT → MRI task, with a 72.3 average Dice score. I restate that this is despite the fact that my method does not perform joint adaptation, losing access to the source domain after training. The GAN based methods offer competitive performance, as my approach achieves the best class-wise performance only on the AA class. I also compare positively to the other source-free methods based on entropy minimization, showing that optimal-transport based adaptation has the potential of achieving better domain alignment. I further investigate GenAdapt [87], a recent source-free algorithm for street semantic segmentation. The performance of this approach trails the other methods, confirming that the difference in data distribution between street images and medical images makes street semantic segmentation algorithms not directly transferable to the medical field. In terms of ASSD score, the best results correspond to a GAN based method, however my method still offers competitive performance.

My results on the abdominal organ dataset follow a similar trend, however on this secondary dataset the best performance is obtained by one of the GAN based methods. As mentioned before, my algorithm does not benefit from access to the source data during adaptation, thus the joint adaptation methods act as an upper-bound to my approach. Still, on the MRI → CT task I am able to obtain the best performance on the Liver class, and close to SOTA performance on the Left Kidney and Spleen classes. On the CT → MRI task I again observe competitive performance against the other methods, and the best performance on the Left Kidney class, however deteriorated performance on the Spleen class. This shows that while overall performance is improved, due to the unsupervised nature of the adaptation process, there may be situations where the increase in accuracy is not similar across all classes.
Dice ASSD
Method AA LAC LVC MYO Avg. AA LAC LVC MYO Avg.
Source-Only 28.4 27.7 4.0 8.7 17.2 20.6 16.2 N/A 48.4 N/A
Supervised
∗ 88.7 89.3 89.0 88.7 87.2 2.6 4.9 2.2 1.6 3.6
GenAdapt
∗ [87] 57 51 36 31 43.8 N/A N/A N/A N/A N/A
PnP-AdaNet [42] 74.0 68.9 61.9 50.8 63.9 12.8 6.3 17.4 14.7 12.8
SynSeg-Net [70] 71.6 69.0 51.6 40.8 58.2 11.7 7.8 7.0 9.2 8.9
AdaOutput [182] 65.2 76.6 54.4 43.3 59.9 17.9 5.5 5.9 8.9 9.6
CycleGAN [224] 73.8 75.7 52.3 28.7 57.6 11.5 13.6 9.2 8.8 10.8
CyCADA [64] 72.9 77.0 62.4 45.3 64.4 9.6 8.0 9.6 10.5 9.4
SIFA [21] 81.3 79.5 73.8 61.6 74.1 7.9 6.2 5.5 8.5 7.0
ARL-GAN [29] 71.3 80.6 69.5 81.6 75.7 6.3 5.9 6.7 6.5 6.4
DSFN [227] 84.7 76.9 79.1 62.4 75.8 N/A N/A N/A N/A N/A
SASAN [179] 82.0 76.0 82.0 72.0 78.0 4.1 8.3 3.5 3.3 4.9
DSAN [60] 79.9 84.7 82.7 66.5 78.5 7.7 6.7 3.8 5.6 5.9
AdaEnt
∗ [8] 75.5 71.2 59.4 56.4 65.6 8.5 7.1 8.4 8.6 8.2
AdaMI
∗ [7] 83.1 78.2 74.5 66.8 75.7 5.6 4.2 5.7 6.9 5.6
SFS
∗ 88.0 83.7 81.0 72.5 81.3 6.3 7.2 4.7 6.1 6.1
Table 4.1: Segmentation performance comparison for the Cardiac MR→ CT adaptation task. Starred
methods perform source-free adaptation. Bolded cells show best performance.
My results indicate that my method can also be used to perform UDA in a sequential setting, where the target domain becomes available only after the source domain data. This setting is similar to the continual learning setting and offers the prospect of adopting my method there [149, 151, 146], while relaxing the criterion that all tasks should be labeled.

In addition to quantitatively verifying my approach on the two datasets, I also provide a visualization of the benefits of the adaptation process. In Figure 4.2 I display target images for the cardiac and abdominal datasets, and semantic maps for true labels, supervised predictions, adaptation predictions and source-only
Dice ASSD
Method AA LAC LVC MYO Avg. AA LAC LVC MYO Avg.
Source only 5.4 30.2 24.6 2.7 15.7 15.4 16.8 13.0 10.8 14.0
Supervised 82.8 80.5 92.4 78.8 83.6 3.6 3.9 2.1 1.9 2.9
PnP-AdaNet [42] 43.7 47.0 77.7 48.6 54.3 11.4 14.5 4.5 5.3 8.9
CyCADA[64] 60.5 44.0 77.6 47.9 57.5 7.7 13.9 4.8 5.2 7.9
SynSeg-Net[70] 41.3 57.5 63.6 36.5 49.7 8.6 10.7 5.4 5.9 7.6
AdaOutput[182] 60.8 39.8 71.5 35.5 51.9 5.7 8.0 4.6 4.6 5.7
CycleGAN[224] 64.3 30.7 65.0 43.0 50.7 5.8 9.8 6.0 5.0 6.6
SIFA [21] 65.3 62.3 78.9 47.3 63.4 7.3 7.4 3.8 4.4 5.7
SASAN [179] 54 73 86 68 70 18.8 9.4 6.1 3.9 9.5
DSAN [60] 71.3 66.2 76.2 52.1 66.5 4.4 7.3 5.5 4.3 5.4
SFS
∗ 66.4 69.0 89.0 64.5 72.3 3.13 2.8 0.33 1.16 1.56
Table 4.2: Segmentation performance comparison for the Cardiac CT→ MR adaptation task.
Dice ASSD
Method L RK LK S Avg. L RK LK S Avg.
Source-Only 73.1 47.3 57.3 55.1 58.2 2.9 5.6 7.7 7.4 5.9
Supervised 94.2 87.2 88.9 89.1 89.8 1.2 1.2 1.1 1.7 1.3
SynSeg-Net [70] 85.0 82.1 72.7 81.0 80.2 2.2 1.3 2.1 2.0 1.9
AdaOutput [182] 85.4 79.7 79.7 81.7 81.6 1.7 1.2 1.8 1.6 1.6
CycleGAN [224] 83.4 79.3 79.4 77.3 79.9 1.8 1.3 1.2 1.9 1.6
CyCADA [64] 84.5 78.6 80.3 76.9 80.1 2.6 1.4 1.3 1.9 1.8
SIFA [21] 88.0 83.3 80.9 82.6 83.7 1.2 1.0 1.5 1.6 1.3
SFS
∗ 88.3 73.7 80.7 81.6 81.1 2.4 4.1 3.5 2.7 3.2
Table 4.3: Segmentation performance comparison for the Abdominal MR→ CT adaptation task.
Dice ASSD
Method L RK LK S Avg. L RK LK S Avg.
Source-Only 48.9 50.9 65.3 65.7 57.7 4.5 12.3 6.8 4.5 7.0
Supervised 92.0 91.1 80.6 85.7 87.3 1.3 2.0 1.5 1.3 1.5
SynSeg-Net [70] 87.2 90.2 76.6 79.6 83.4 2.8 0.7 4.8 2.5 2.7
AdaOutput [182] 85.8 89.7 76.3 82.2 83.5 1.9 1.4 3.0 1.8 2.1
CycleGAN [224] 88.8 87.3 76.8 79.4 83.1 2.0 3.2 1.9 2.6 2.4
CyCADA [64] 88.7 89.3 78.1 80.2 84.1 1.5 1.7 1.3 1.6 1.5
SIFA [21] 88.5 90.0 79.7 81.3 84.9 2.3 0.9 1.4 2.4 1.7
SFS
∗ 86.3 88.0 85.1 74.9 83.5 4.5 1.6 2.2 18.2 6.6
Table 4.4: Segmentation performance comparison for the Abdominal CT→ MRI adaptation task.
predictions. First, I note that the supervised model is able to reach semantic map quality similar to the true labels, however incorrect labels are present with regard to the image borders, e.g. the first three abdominal images. I believe this behavior stems from the resolution and architecture of the network. The Pre-Adapt semantic maps correspond to source-only performance, before model adaptation. These results present large inconsistencies with regard to the true label maps, and show the prediction deterioration due to domain-shift. The Post-Adapt images clearly show improvement over source-only performance, and I observe the semantic maps become much closer to supervised performance. This process is however not perfect, as I notice difficulty in correctly determining some image borders, for example in the second cardiac image and in the first three abdominal images. This supports my quantitative evaluation, where I observed high Dice scores, however GAN based methods showed improved performance with respect to the ASSD metric.
Figure 4.2: Segmentation maps of CT samples from the two datasets. The first five columns correspond to
cardiac images, and last five correspond to abdominal images. From top to bottom: gray-scale CT images,
source-only predictions, post-adaptation predictions, supervised predictions on the CT data, ground truth.
I further investigate the adaptation process in terms of class-wise pixel shift. In Tables 4.5 and 4.6, for each row/column pair I report the percentage of pixels that change categories, the percentage of these which change incorrectly, and finally the percentage which change correctly. I note that when more than 1% of the pixels change, the adaptation process always leads to a decrease in incorrectly labeled pixels. I notice that this happens more for the cardiac dataset compared to the abdominal dataset, which, similar to Figures 4.3 and 4.4, shows that the cardiac dataset has less class separability than the abdominal dataset. I also note that for both datasets, a large number of samples that were labeled incorrectly before the adaptation correspond to the Ignore class.
Ignore MYO LAC LVC AA
Ignore 97.3 99.3 99.3 1.5 20.3 70.0 0.2 80.2 14.8 0.9 6.2 76.1 0.2 43.8 51.7
MYO 13.2 10.4 89.5 81.6 72.2 72.2 0.1 52.7 0.4 5.2 44.6 54.1 0.0 0.0 0.0
LAC 15.1 45.4 46.3 2.5 2.6 79.7 76.1 88.4 88.4 5.9 7.4 87.4 0.4 5.8 77.0
LVC 0.6 67.7 2.3 16.5 33.4 66.3 0.2 83.8 13.0 82.7 92.4 92.4 0.0 93.3 0.0
AA 18.5 7.8 90.9 0.0 0.0 43.7 1.3 5.7 6.2 0.1 0.0 12.9 80.1 91.2 91.2
Table 4.5: Percentage of shift in pixel labels during adaptation for the cardiac dataset. A cell (i, j) in the table has three values. The first value represents the percentage of pixels labeled i that are labeled j after adaptation. The second value represents the percentage of switching pixels whose true label is i - lower is better. The third value represents the percentage of switching pixels whose true label is j - higher is better. Bolded cells denote label shift where more than 1% of pixels migrate from i to j.
Ignore Liver R. Kidney L. Kidney Spleen
Ignore 94.6 98.4 98.4 3.0 18.0 81.6 0.7 23.5 74.3 0.7 34.9 62.6 1.0 19.3 80.5
Liver 6.6 38.1 60.8 92.6 91.3 91.3 0.8 10.4 55.1 0.0 0.0 0.0 0.0 39.0 10.2
RKidney 5.0 13.1 86.9 0.2 0.0 76.9 94.8 94.7 94.7 0.0 0.0 0.0 0.0 0.0 0.0
LKidney 2.2 24.2 75.0 0.1 0.0 0.0 0.0 23.7 0.0 97.5 87.8 87.8 0.2 0.0 7.2
Spleen 23.1 20.8 79.2 0.1 20.2 0.0 0.2 75.0 0.0 0.0 69.4 0.0 76.6 78.7 78.7
Table 4.6: Percentage of shift in pixel labels during adaptation for the abdominal organ dataset. The same
methodology as in Table 4.5 is used.
4.6.5 Ablation Studies and Analysis
(a) GMM samples (b) Pre-adaptation (c) Post-adaptation
Figure 4.3: Indirect distribution matching in the embedding space: (a) GMM samples approximating the
MMWHS MR latent distribution, (b) CT latent embedding prior to adaptation (c) CT latent embedding post
domain alignment. Colors correspond to: AA, LAC, LVC, MYO.
(a) GMM samples (b) Pre-adaptation (c) Post-adaptation
Figure 4.4: Indirect distribution matching in the embedding space: (a) GMM samples approximating the
CHAOS MR latent distribution, (b) Multi-Atlas CT embedding prior to adaptation (c) Multi-Atlas CT em-
bedding post adaptation. Colors correspond to: liver, right kidney, left kidney, spleen.
Next, I analyze the impact of the parameter t on the adaptation performance. In my GMM learning process, t controls how many components are employed for each semantic class. While the minimum value of t is 1, a larger number of components should lead to a better approximation of the underlying distribution. I verify this idea in Table 4.7. I notice that as I increase t, the adaptation performance improves when keeping the number of training iterations fixed. This implies that the learnt GMM distribution captures more information from the source domain, which benefits the SWD distribution alignment process. In my main results in Tables 4.1 and 4.3 I utilize a t value of 3, as I see larger values of t provide diminishing gains.
Dice Average Symmetric Surface Distance
Method AA LAC LVC MYO Average AA LAC LVC MYO Average
1-SFS 86.2 83.5 75.4 70.9 79.0 11.1 5.0 10.8 3.6 9.8
3-SFS 88.0 83.7 81.0 72.5 81.3 6.3 7.2 4.7 6.1 6.1
5-SFS 88.0 83.8 81.9 73.3 81.7 6.2 7.4 4.8 5.7 6.0
7-SFS 86.8 84.8 82.0 73.5 81.8 4.8 7.2 4.4 5.6 5.9
Table 4.7: Segmentation performance comparison for the Cardiac MR→ CT adaptation task. t-SFS represents results for t components per class.
Finally, I compare my distributional alignment metric, the SWD, to other popular distributional alignment approaches. I consider the KL-divergence and MMD as additional comparison metrics. I report results of using these two metrics for adaptation on the cardiac dataset in Table 4.8. I note that using MMD does not benefit my framework. Using the KL-divergence metric provides a positive outcome, and I obtain a result close to the one via SWD, however with slightly deteriorated performance. This empirically shows that SWD is more robust than some other distributional distance metrics.
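For reference, the MMD baseline used in this comparison can be sketched as below. This is a standard biased RBF-kernel estimate in numpy; the kernel bandwidth is my own choice and not necessarily the value used in the experiments.

```python
import numpy as np

def mmd_rbf(x, y, sigma=1.0):
    """Squared maximum mean discrepancy with an RBF kernel (biased estimate)."""
    def k(a, b):
        # pairwise squared Euclidean distances, then Gaussian kernel
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()
```

Unlike the SWD, the MMD compares kernel mean embeddings rather than transported mass, which may explain its different behavior when class clusters are poorly separated.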
Dice Average Symmetric Surface Distance
Method AA LAC LVC MYO Average AA LAC LVC MYO Average
MMD 24.4 73.3 19.1 68.1 46.2 21.1 4.8 33.3 20.3 19.9
KL 87.9 87.3 74.7 62.7 78.1 8.4 6.6 11.7 9.6 9.1
SFS 88.0 83.7 81.0 72.5 81.3 6.3 7.2 4.7 6.1 6.1
Table 4.8: Segmentation performance on the Cardiac MR→ CT adaptation task for different distributional
distances.
4.7 Remarks
In this chapter, I have shown that my source-free UDA algorithm for the semantic segmentation of medical images is competitive with SOTA approaches that do not have the added restriction of data privacy. I validated the adaptation process by further analyzing the changes in distribution shift performed by various components of my approach, and showed that the empirical observations align with the intuition behind the algorithm. In addition, improving the representation power of the GMM distribution showed correlation with an improvement in both considered metrics. Next, I address the problem of multi-source classification. Compared to the semantic segmentation problems seen thus far, multi-source classification considers the setting where several source domains are made available. I again address this setting from a privacy focused standpoint, where direct communication between any pair of domains is prohibited.
Chapter 5

Secure Multi-Source Adaptation
In this chapter I show that approximating the source distribution for source free adaptation is a valid
strategy in multi-source adaptation, and allows for private and distributed optimization.
5.1 Motivation
Recently, single-source unsupervised domain adaptation (SUDA) has been extended to multi-source un-
supervised domain adaptation (MUDA), where several distinct sources of knowledge are available ([128,
221, 101, 58, 176, 185]). The goal in MUDA is to benefit from the collective information encoded in sev-
eral distinct annotated source domains to improve model generalization on an unannotated target domain.
Compared to SUDA, MUDA algorithms require leveraging data distribution discrepancies between pairs
of source domains, as well as between the sources and the target. Thus, an assumption in most MUDA
algorithms is that the annotated source datasets are centrally accessible. Such a premise however ignores
privacy/security regulations or bandwidth limitations that may constrain the possibility of joint data access
between source domains.
In practice, it is natural to assume source datasets are distributed amongst independent entities, and
sharing data between them may constitute a privacy violation. For example, improving mobile keyboard
predictions is performed by securely training models on independent computing nodes without centrally
collecting user data ([205]). Similarly, in medical image processing applications, data is often distributed
amongst different medical institutions. Due to privacy regulations ([202]) sharing data can be prohibited,
and hence central access to data for all the source domains simultaneously becomes infeasible. MUDA
algorithms can offer privacy between the sources and target by operating in a source-free regime ([2]), i.e.,
during the adaptation process source samples are considered to be unavailable, and only the source trained
model or source data statistics are assumed to be accessible. However, approaches operating under this
premise require retraining if new source domains become available, or if a set of source domains becomes
inaccessible. This downside leads to increased time and computational resource cost.
I relax the need for centralized processing of source data in MUDA while maintaining cross-domain
privacy. My approach is robust to accessibility changes for different source domains, allowing for relearn-
ing the target decision function without end-to-end retraining. I relax the need for direct access to source
domain samples for adaptation by approximating the distribution of source embeddings. I perform source-
free adaptation with respect to each source domain, and propose a confidence based pooling mechanism
for target inference.
5.2 Related Work
Single-Source UDA: Single source UDA aims to improve model generalization for an unlabeled target
domain using only a single source domain with annotated data. SUDA has been studied extensively. A
primary workflow employed in recent UDA works consists of training a deep neural network jointly on the
labeled source domain and the unlabeled target domain to achieve distribution alignment between both
domains in a latent embedding space. This goal has been achieved by employing generative adversarial
networks ([55]) to encourage domain alignment ([64, 39, 112, 183, 159]) as well as directly minimizing
an appropriate distributional distance between the source and target embeddings ([108, 111, 119]). SUDA
algorithms do not leverage inter-domain statistics in the presence of several source domains, and thus
extending single-source UDA algorithms to a multi-source setting is nontrivial.
Multi-Source UDA: The MUDA setting is a recent extension of SUDA, where multiple streams of
data are concomitantly leveraged for improved target domain generalization. Xu et al. [199] minimize
discrepancy between source and target domains by optimizing an adversarial loss. Peng et al. [128] adapt
on multiple domains by aligning inter-domain statistics of the source domains in an embedding space. Guo
et al. [59] learn to combine domain specific predictions via meta-learning. Venkat et al. [185] use pseudo-
labels to improve domain alignment. The increased amount of source data in MUDA is not necessarily an
advantage over SUDA, as negative transfer between domains needs to be controlled during adaptation. Li
et al. [97] exploit domain similarity to avoid negative transfer by leveraging model statistics in a shared
embedding space. Zhu et al. [225] achieve domain alignment by adapting deep networks at various levels
of abstraction. Zhao et al. [221] align target features against source trained features via optimal transport,
then combine source domains proportionally to Wasserstein distance. Wen et al. [194] use a discriminator
to exclude data samples with negative generalization impact. Such approaches admit joint access to source
domains during the training process, making them infeasible in settings where data privacy and security
are of concern.
Privacy in Domain Adaptation: The importance of inter-domain privacy has been recognized and
explored for single-source UDA, specifically in source-free adaptation. Note this framework is relevant in
many important practical settings even for SUDA, where privacy regulations limit the possibility of sharing
data ([129, 96, 99, 100]). Several UDA approaches consider maintaining an approximation of the source domain for adaptation. Kurmi et al. [90] benefit from GANs to generate source-domain-like samples
during the adaptation phase. Yeh et al. [208] align distributions via minimizing the KL-divergence in
addition to a variational autoencoder reconstruction loss. Similar to my approach, Yang et al. [203] model the source distribution and assign classes to clustered target samples by minimizing a VAE reconstruction error. Tian et al. [177] approximate the latent source space during adaptation
and use adversarial learning in their approach. Ding et al. [40] also estimate the source distribution and
choose specific anchors to guide the distribution learning process. Adaptation is done by optimizing class
conditional maximum mean discrepancy between samples from the learnt approximation distributions
and the target samples. These above works consider only a single source. In multi-source adaptation, the
strategy for combining information across source domains is key in achieving competitive performance.
Peng et al. [129] perform collaborative adaptation under privacy restrictions between source domains
under the framework of federated learning. Ahmed et al. [2] approach privacy-preserving MUDA via
information maximization and pseudo-labeling. Dong et al. [41] choose high confidence target samples as
class anchors and pseudo-labels are then assigned according to the closest anchors. Unlike my approach, [2,
41] require simultaneous access to all source trained models during adaptation. This makes these methods
unsuitable for scenarios such as asynchronous federated learning [129], where communication between
individual domains may be broken, or processing different source domains may be done at irregular time
intervals ([117]). My main adaptation tool is represented by optimal transport based optimization, followed
by a confidence based pooling mechanism, making the adaptation process considerably more lightweight.
I address a more constrained yet practical setting, where privacy should be preserved both between
pairs of source domains and with respect to the target. My approach allows for efficient distributed op-
timization, not requiring end-to-end retraining if different source domains become inaccessible due to
privacy obligations [117, 129], or more source domains become available after initial training has finished,
allowing for accumulative learning from several domains. My method builds on extending the idea of
probability metric minimization, explored in UDA ([22, 95, 170, 142, 140]) to MUDA. The latent source
and target features are represented via the output space of a neural encoder. Domain alignment implies
a shared embedding space for these representations. To achieve this, a suitable distributional distance
metric is chosen between these two sets of embeddings and minimized. In this work, I used the Sliced
Wasserstein Distance (SWD) ([131, 14]) for this purpose. SWD is a metric for approximating the optimal
transport metric [133]. It is a suitable choice for UDA because: (i) it possesses non-vanishing gradients for
two distributions with non-overlapping supports. As a result, it is a suitable objective function for deep
learning gradient-based optimization techniques. (ii) It can be computed efficiently based on a closed-form
solution using only empirical samples drawn from the two probability distributions.
5.3 Problem formulation
Figure 5.1: Block-diagram of my proposed approach: (a) source-specific model training is done independently for each source domain, (b) the distribution of latent embeddings of each source domain is estimated via a mixture of Gaussians, (c) for each source-trained model, adaptation is performed by minimizing the distributional discrepancy between the learnt GMM distribution and the target encodings, (d) the final target domain predictions are obtained via a learnt convex combination of logits for each adapted model.
Let $S_1, S_2, \ldots, S_n$ be the data distributions of $n$ annotated source domains and $\mathcal{T}$ be the data distribution of an unannotated target domain. Assume the source and target domains share the same feature space $\mathbb{R}^{W \times H \times C}$, where $W, H, C$ describe an image by width, height and number of channels respectively. I consider all domains having a common label-space $\mathcal{Y}$, but not necessarily sharing the same label distribution. For each source domain $k$, I observe the labeled samples $\{(x^s_{k,1}, y_{k,1}), \ldots, (x^s_{k,n^s_k}, y_{k,n^s_k})\}$, where $x^s_k \sim S_k$. I only observe unlabeled samples $\{x^t_1, \ldots, x^t_{n_t}\}$ from the target domain $\mathcal{T}$. The goal is to train a model $f_\theta: \mathbb{R}^{W \times H \times C} \to \mathbb{R}^{|\mathcal{Y}|}$ capable of inferring target labels, where $|\mathcal{Y}|$ is the number of inference classes. The first step in my approach is to independently train decision models for each source domain via empirical risk minimization (ERM) by minimizing the cross-entropy loss $\mathcal{L}_{ce}$: $\theta_k = \arg\min_\theta \frac{1}{n^s_k} \sum_{i=1}^{n^s_k} \mathcal{L}_{ce}(f_\theta(x^s_{k,i}), y_{k,i})$.
Since under my considered setting the target and source domains share a common input and label
space, these models can be directly used on the target to derive a naive solution. However, given the
distributional discrepancy between source domains and target, e.g., real-world images versus clip art, gen-
eralization performance will be poor. The goal of my MUDA approach is to benefit from the unannotated
target dataset and the source-trained models in order to improve upon model generalization while avoiding
negative transfer.
To this end, I decompose the model $f_\theta$ into a feature extractor $g_u(\cdot): \mathbb{R}^{W \times H \times C} \to \mathbb{R}^{d_Z}$ and a classifier subnetwork $h_v(\cdot): \mathbb{R}^{d_Z} \to \mathbb{R}^{|\mathcal{Y}|}$ with learnable parameters $u$ and $v$, such that $f(\cdot) = (h \circ g)(\cdot)$.
I assume input data points are images of size $W \times H \times C$ and the latent embedding is of size $d_Z$. In a SUDA setting, I can improve generalization of each source-specific model on the target domain by aligning the distributions of the source and the target domain in the latent embedding space. Specifically, I can minimize a distributional discrepancy metric $D(\cdot,\cdot)$ across both domains, e.g., the SWD loss, to update the learnable parameters: $u^A_k = \arg\min_u D(g_u(S_k), g_u(\mathcal{T}))$. By aligning the distributions, the source-trained classifier $h_k$ will generalize well on the target domain $\mathcal{T}$. In the MUDA setting, the goal is improving upon SUDA by benefiting from the collective knowledge of the source domains to make predictions on the target. This can be done via a weighted average of predictions made by each of the domain-specific models, i.e., models with learnable parameters $\theta^A_k = (u^A_k, v_k)$. For a sample $x^t_i$ in the target domain, the model prediction will be $\sum_{k=1}^n w_k f_{\theta^A_k}(x^t_i)$, where $w_k$ denotes a set of learnable weights associated with the source domains.
I note the above general approach requires simultaneous access to source and target data during adap-
tation. I relax this constraint and consider the more challenging setting of source-free adaptation, where I
lose access to the source domains once source training finishes. To account for applications with sensitive
data, e.g. medical domains, I also forbid interaction between source models during adaptation. Hence, the
source distributions $S_k$ and their representations in the embedding space, i.e., $g(S_k)$, will become inaccessible. To circumvent this challenge, I rely on intermediate distributional estimates of the source latent embeddings.
5.4 Proposed algorithm
My proposed approach for MUDA with private data is visualized in Figure 5.1. I base my algorithm on two levels of hierarchy. First, I adapt each source-trained model while preserving privacy (left and middle subfigures). I then combine predictions of the source-specific models on the target domain according to their reliability (right subfigure). To tackle the challenge of data privacy, I approximate the distributions of the source domains in the embedding space as a multi-modal distribution and use these distributional estimates for domain alignment (Figure 5.1, left). I can benefit from these estimates because once source training is completed, the input embedding distribution should be mapped into a $|\mathcal{Y}|$-modal distribution to enable the classifier subnetwork to separate the classes. Note, each separated distributional mode encodes one of the classes (see Figure 5.1, left). To approximate these internal distributions I employ Gaussian Mixture Models (GMMs), with learnable mean and covariance parameters $\mu_k, \Sigma_k$. Since I have access to labeled source data points, I can learn $\mu_k$ and $\Sigma_k$ in a supervised fashion. Let $\mathbb{1}_c(x)$ denote the indicator function for $x = c$; then the maximum likelihood estimates for the GMM parameters are:
$$\mu_{k,c} = \frac{\sum_{i=1}^{n^s_k} \mathbb{1}_c(y_{k,i})\, g_{u_k}(x^s_{k,i})}{\sum_{i=1}^{n^s_k} \mathbb{1}_c(y_{k,i})}, \qquad \Sigma_{k,c} = \frac{\sum_{i=1}^{n^s_k} \mathbb{1}_c(y_{k,i})\, \big(g_{u_k}(x^s_{k,i}) - \mu_{k,c}\big)\big(g_{u_k}(x^s_{k,i}) - \mu_{k,c}\big)^T}{\sum_{i=1}^{n^s_k} \mathbb{1}_c(y_{k,i})} \tag{5.1}$$
Learning $\mu_k$ and $\Sigma_k$ for each domain $k$ enables us to sample class-conditionally from the GMM distributional estimates and approximate the distribution $g(S_k)$ in the absence of the source dataset.
I adapt the source-trained model by aligning the target distribution $\mathcal{T}$ and the GMM distribution in the embedding space. To preserve privacy, for each source domain $k$ I generate intermediate pseudo-domains $A_k$ with pseudo-samples $\{z^a_{k,1}, \ldots, z^a_{k,n^a_k}\}$ by drawing random samples from the estimated GMM distribution. The pseudo-domain is used as an approximation of the corresponding source embeddings. To align the two distributions, a suitable distance metric $D(\cdot,\cdot)$ needs to be used. I rely on the SWD due to its previously mentioned appealing properties. The SWD acts as an estimate for the Wasserstein Distance (WD) between two distributions ([131]), by aggregating the tractable $1$-dimensional WD over $L$ projections onto the unit hypersphere. In the context of my algorithm, the SWD discrepancy measure becomes:
$$D(g(\mathcal{T}), A_k) = \frac{1}{L} \sum_{l=1}^{L} \left| \langle g(x^t_{i_l}), \phi_l \rangle - \langle z^a_{k,j_l}, \phi_l \rangle \right|^2 \tag{5.2}$$
where $\phi_l$ is a projection direction, and $i_l, j_l$ are indices corresponding to the sorted projections. While the source and target domains share the same label space $\mathcal{Y}$, they do not necessarily share the same distribution of labels. Since the prior probabilities on classes are not known in the target domain, minimizing the SWD at the batch level may lead to incorrectly clustering samples from different classes together, depending on the discrepancy between the label distributions. To address this challenge, I take advantage of the conditional entropy loss ([56]) as a regularization term based on information maximization. The conditional entropy acts as a soft clustering objective that ensures aligning target samples to the wrong class via SWD will be penalized. I follow the approximation presented in Eq. 6 in [56]:
$$\mathcal{L}_{ent}(f_\theta(\mathcal{T})) = \frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}_{ce}\big(f_\theta(x^t_i), f_\theta(x^t_i)\big) \tag{5.3}$$
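The two loss terms of Eqs. 5.2 and 5.3 can be sketched as follows (a simplified NumPy illustration: in the actual method the SWD is minimized over the encoder parameters via gradient-based optimization, and all names here are assumptions of this sketch):

```python
import numpy as np

def swd(target_emb, pseudo_emb, num_proj=50, rng=None):
    """Sliced Wasserstein discrepancy (Eq. 5.2): mean squared difference of
    sorted 1-D projections onto random unit directions phi_l.
    Assumes equal batch sizes from the two sets."""
    rng = rng or np.random.default_rng(0)
    d = target_emb.shape[1]
    phi = rng.normal(size=(d, num_proj))
    phi /= np.linalg.norm(phi, axis=0, keepdims=True)   # unit projection directions
    pt = np.sort(target_emb @ phi, axis=0)              # sorting yields indices i_l
    pp = np.sort(pseudo_emb @ phi, axis=0)              # ... and j_l
    return np.mean((pt - pp) ** 2)

def conditional_entropy(probs):
    """Entropy regularizer (Eq. 5.3): cross-entropy of predictions with themselves."""
    return -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=1))
```

The combined adaptation objective of Eq. 5.4 is then `swd(...) + gamma * conditional_entropy(...)`, minimized with the classifier frozen.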
To guarantee this added loss term influences the latent representations produced by the feature extractor, the classifier is frozen during model adaptation. My final combined adaptation loss is described as:
$$D(g(\mathcal{T}), A) + \gamma\, \mathcal{L}_{ent}(f_\theta(\mathcal{T})) \tag{5.4}$$
for a regularizer $\gamma$. Once the source-specific adaptation is completed across all domains, the final model predictions on the target domain are obtained by combining probabilistic predictions returned by each of the $n$ domain-specific models. The mixing weights are chosen as a convex vector $w = (w_1, \ldots, w_n)$, i.e., $w_i > 0$ and $\sum_i w_i = 1$, with final predictions taking the form $\sum_{i=1}^{n} w_i f_{\theta_i}$. The choice of $w$ is critical, as assigning large weights to a model which does not generalize well will harm inference power. I utilize the source model prediction confidence on the target domain as a proxy for generalization performance. I have provided empirical evidence for this choice in Section 5.6. I thus set a confidence threshold $\lambda$ and assign $w_k$:
$$\tilde{w}_k \sim \sum_{i=1}^{n_t} \mathbb{1}\big(\max \tilde{f}_{\theta_k}(x^t_i) > \lambda\big), \qquad w_k = \tilde{w}_k \Big/ \sum_k \tilde{w}_k, \tag{5.5}$$
where $\tilde{f}(\cdot)$ denotes the model output at the final SoftMax layer, which corresponds to a probability.
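Eq. 5.5 amounts to counting confident predictions per model and normalizing; a minimal sketch, assuming `probs_per_model` holds each model's SoftMax outputs on the target set (names are mine, not the thesis codebase):

```python
import numpy as np

def mixing_weights(probs_per_model, lam=0.9):
    """Confidence-based pooling (Eq. 5.5): count target samples whose top
    predicted probability exceeds the threshold lambda, then normalize."""
    raw = np.array([np.sum(p.max(axis=1) > lam) for p in probs_per_model],
                   dtype=float)
    if raw.sum() == 0:                    # fall back to uniform weights
        return np.full(len(probs_per_model), 1.0 / len(probs_per_model))
    return raw / raw.sum()

def pooled_prediction(probs_per_model, w):
    """Final target prediction: convex combination of the n models' outputs."""
    return sum(wk * p for wk, p in zip(w, probs_per_model))
```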
Note the only cross-domain information transfer in my framework is communicating the latent means
and covariance matrices of the estimated GMMs plus the domain-specific model weights which provide a
warm start for adaptation. Data samples are never shared between any two domains during pretraining
and adaptation. As a result, my approach preserves data privacy for scenarios in which the source datasets
are distributed across several entities. Additionally, the adaptation process for each source domain is
performed independently. As a result, my approach can be used to incorporate new source domains as
they become available over time without requiring end-to-end retraining from scratch. I will only require
Algorithm 3 Secure Multi-source Unsupervised Domain Adaptation (SMUDA)

procedure SMUDA($S_1 \ldots S_n$, $\mathcal{T}$, $L$, $\gamma$)
    for $k \leftarrow 1$ to $n$ do
        $\mu_k, \Sigma_k, \theta_k$ = Train($S_k$)
        Generate $A_k$ based on $\mu_k, \Sigma_k$
        Compute $w_k$ (Eq. 5.5)
        $\theta^A_k$ = Adapt($\theta_k$, $A_k$, $\mathcal{T}$, $L$, $\gamma$)
    end for
    return $w_1 \ldots w_n$, $\theta^A_1 \ldots \theta^A_n$
end procedure

procedure Train($S_k$)
    Learn $\theta_k = (u_k, v_k)$ by minimizing $\mathcal{L}_{CE}(f_{\theta_k}(S_k), \cdot)$
    Learn parameters $\mu_k, \Sigma_k$ (Eq. 5.1)
    return $\mu_k, \Sigma_k, \theta_k$
end procedure

procedure Adapt($\theta_k$, $A_k$, $\mathcal{T}$, $L$, $\gamma$)
    Initialize network with weights $\theta_k$
    $\theta^A_k = \arg\min_\theta D(g_u(\mathcal{T}), A_k) + \gamma \mathcal{L}_{ent}(f_\theta(\mathcal{T}))$ (Eq. 5.4)
    return $\theta^A_k$
end procedure
to update the normalized mixing weights via Equation 5.5, which takes negligible runtime compared to
model training. My proposed privacy preserving approach, named Secure MUDA (SMUDA), is presented
in Algorithm 3 .
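The control flow of Algorithm 3 can be summarized in a Python skeleton; the callables `train`, `fit_gmm`, `sample`, `adapt`, and `weigh` are placeholders for the stages described above, not actual codebase functions:

```python
def smuda(sources, target, train, fit_gmm, sample, adapt, weigh):
    """SMUDA driver sketch: each source is processed independently, so new
    sources can be added later without end-to-end retraining."""
    weights, adapted = [], []
    for S in sources:                     # no two sources ever share data
        theta = train(S)                  # source-specific ERM training
        mu, sigma = fit_gmm(theta, S)     # Eq. 5.1; only these statistics are kept
        pseudo = sample(mu, sigma)        # intermediate pseudo-domain A_k
        weights.append(weigh(theta, target))           # confidence proxy, Eq. 5.5
        adapted.append(adapt(theta, pseudo, target))   # minimize Eq. 5.4
    total = sum(weights)                  # assumed positive; normalize to convex w
    return [w / total for w in weights], adapted
```

Incorporating a new source only appends one loop iteration plus a renormalization of the mixing weights.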
5.5 Theoretical analysis
I provide an analysis to demonstrate that my algorithm minimizes an upper bound for the target domain error. I adopt the framework developed by [132] for single-source UDA using the Wasserstein distance to provide a theoretical justification for the proposed algorithm. My analysis is performed in the latent embedding space. Let $\mathcal{H}$ represent the hypothesis space of all classifier subnetworks, and let $h_k(\cdot)$ denote the hypothesis learnt by the $k$-th domain-specific model. I also set $e_D(\cdot)$, where $D \in \{S_1 \ldots S_n, \mathcal{T}\}$, to be the true expected error returned by some model $h(\cdot) \in \mathcal{H}$ in the hypothesis space on the domain $D$. Additionally, let $\hat{\mu}_{S_k} = \frac{1}{n^s_k}\sum_{i=1}^{n^s_k} f(g(x^s_{k,i}))$, $\hat{\mu}_{P_k} = \frac{1}{n^a_k}\sum_{i=1}^{n^a_k} x^a_{k,i}$, and $\hat{\mu}_{\mathcal{T}} = \frac{1}{n_t}\sum_{i=1}^{n_t} f(g(x^t_i))$ denote the empirical distributions built using the samples from the source domain, the intermediate pseudo-domain, and the target domain in the latent space. Then the following theorem holds for the MUDA setting:
Theorem 5. Consider Algorithm 3 for MUDA under the explained conditions; then the following holds:
$$e_{\mathcal{T}}(h) \le \sum_{k=1}^{n} w_k \Big( e_{S_k}(h_k) + D(\hat{\mu}_{\mathcal{T}}, \hat{\mu}_{P_k}) + D(\hat{\mu}_{P_k}, \hat{\mu}_{S_k}) + \sqrt{2\log(1/\xi)/\zeta}\Big(\sqrt{\tfrac{1}{N_k}} + \sqrt{\tfrac{1}{M}}\Big) + e_{C_k}(h^*_k) \Big) \tag{5.6}$$
where $C_k$ is the combined error loss with respect to domain $k$, and $h^*_k$ is the optimal model with respect to this loss when a shared model is trained jointly on annotated datasets from all domains simultaneously.
Proof: I start the proof by utilizing a prior result from Redko et al. [132], previously mentioned under Theorem 2.
Theorem 2 provides an upper bound on the target error with respect to the source error, the distance
between source and target domains, a term that is minimized based on the number of samples, and a
constant $e_C = e_S(h^*) + e_{\mathcal{T}}(h^*)$ describing the performance of an optimal hypothesis on the present set of samples.
I adapt the result in Theorem 2 to provide an upper bound in my multi-source setting. Consider the
following two results.
Lemma 1. Under the definitions of Theorem 2,
$$W(\hat{\mu}_S, \hat{\mu}_{\mathcal{T}}) \le W(\hat{\mu}_S, \hat{\mu}_P) + W(\hat{\mu}_P, \hat{\mu}_{\mathcal{T}}) \tag{5.7}$$
where $\hat{\mu}_P$ is the GMM distribution learnt for source domain $S$.
Proof. As $W$ is a distance metric, the proof is an immediate application of the triangle inequality.
Lemma 2. Let $h$ be the hypothesis describing the multi-source model, and let $h_k$ be the hypothesis learnt for source domain $k$. If $e_{\mathcal{T}}(h)$ is the error function for hypothesis $h$ on domain $\mathcal{T}$, then
$$e_{\mathcal{T}}(h) \le \sum_{k=1}^{n} w_k\, e_{\mathcal{T}}(h_k) \tag{5.8}$$
Proof. Let $p(X) = \sum_{k=1}^{n} w_k f_k(X)$, with $\sum_k w_k = 1$ and $w_k > 0$, be the probabilistic estimate returned by my model for some input $X$, and let $y$ be the label associated with this input. The proof for the Lemma proceeds as follows:
$$\begin{aligned}
e_{\mathcal{T}}(h) &= \mathbb{E}_{(X,y)\sim\mathcal{T}}\,\mathcal{L}_{ce}(p(X), \mathbb{1}_y) \\
&= \mathbb{E}_{(X,y)\sim\mathcal{T}}\,[-\log p(X)[y]] \\
&= \mathbb{E}_{(X,y)\sim\mathcal{T}}\Big[-\log\Big(\sum_{k=1}^{n} w_k f_k(X)[y]\Big)\Big] \\
&\le \mathbb{E}_{(X,y)\sim\mathcal{T}}\sum_{k=1}^{n} w_k\big(-\log f_k(X)[y]\big) && \text{Jensen's Ineq.} \\
&= \sum_{k=1}^{n} w_k\,\mathbb{E}_{(X,y)\sim\mathcal{T}}\,\mathcal{L}_{ce}(f_k(X), \mathbb{1}_y) \\
&= \sum_{k=1}^{n} w_k\, e_{\mathcal{T}}(h_k)
\end{aligned}$$
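The Jensen step above (convexity of $-\log$) can be sanity-checked numerically with arbitrary illustrative values:

```python
import math

# an arbitrary convex combination of two models' probabilities for the true class
w = [0.3, 0.7]
p = [0.8, 0.4]          # f_k(X)[y] for k = 1, 2

mixture_loss = -math.log(sum(wk * pk for wk, pk in zip(w, p)))
avg_of_losses = sum(wk * (-math.log(pk)) for wk, pk in zip(w, p))

# -log of the mixture never exceeds the weighted average of the -log losses
assert mixture_loss <= avg_of_losses
```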
I am now equipped to prove Theorem 5.
Proof. (Theorem 5)
$$\begin{aligned}
e_{\mathcal{T}}(h) &\le \sum_{k=1}^{n} w_k\, e_{\mathcal{T}}(h_k) && \text{From Lemma 2} \\
&\le \sum_{k=1}^{n} w_k\Big(e_{S_k}(h_k) + W(\hat{\mu}_{\mathcal{T}}, \hat{\mu}_{S_k}) + \sqrt{2\log(1/\xi)/\zeta}\Big(\sqrt{\tfrac{1}{N_k}} + \sqrt{\tfrac{1}{M}}\Big) + e_{C_k}(h^*_k)\Big) && \text{by Theorem 2} \\
&\le \sum_{k=1}^{n} w_k\Big(e_{S_k}(h_k) + W(\hat{\mu}_{\mathcal{T}}, \hat{\mu}_{P_k}) + W(\hat{\mu}_{P_k}, \hat{\mu}_{S_k}) + \sqrt{2\log(1/\xi)/\zeta}\Big(\sqrt{\tfrac{1}{N_k}} + \sqrt{\tfrac{1}{M}}\Big) + e_{C_k}(h^*_k)\Big) && \text{by Lemma 1}
\end{aligned}$$
5.5.1 Experimental parameters
I use the Adam optimizer with a source learning rate of 1e-5 for each source domain for all datasets. Target learning rates are chosen between 1e-5 and 1e-7 for adaptation. The number of training and adaptation iterations differs per dataset: Office-31 (12k, 48k), Domain-net (80k, 160k), Image-clef (4k, 3k), Office-home (40k, 10k), Office-CalTech (4k, 6k). The training batch size is either 16 or 32, with little difference observed between the two. The adaptation batch size is usually chosen around 10× the number of classes for each dataset, to ensure good class representation when minimizing the SWD. The network size is the same across all datasets, with the SWD minimization space being 256-dimensional. The above-mentioned parameters are also provided in the config.py file in the codebase.
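For illustration, the hyper-parameters above might be organized along these lines (a hypothetical sketch of the referenced config.py; the actual file may differ):

```python
# Illustrative per-dataset hyper-parameters, mirroring Section 5.5.1.
CONFIG = {
    # dataset:        (train_iters, adapt_iters, num_classes)
    "office-31":      (12_000,  48_000,  31),
    "domain-net":     (80_000, 160_000, 345),
    "image-clef":     ( 4_000,   3_000,  12),
    "office-home":    (40_000,  10_000,  65),
    "office-caltech": ( 4_000,   6_000,  10),
}

SOURCE_LR = 1e-5                  # Adam learning rate for source training
TARGET_LR_RANGE = (1e-7, 1e-5)    # adaptation learning rates, chosen per dataset
TRAIN_BATCH = 32                  # 16 works comparably
EMBED_DIM = 256                   # dimension of the SWD minimization space

def adapt_batch_size(dataset):
    """About 10x the class count, for good class coverage in SWD batches."""
    return 10 * CONFIG[dataset][2]
```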
I note the target domain error is upper bounded by the convex combination of the domain-specific
adaptation errors. Algorithm 3 minimizes the right-hand side of Equation 5.6 as follows: for each source
domain, the source expected error is minimized by training the models using ERM. The second term is
minimized by closing the distributional gap between the intermediate pseudo-domain and the target do-
main in the latent space. The third term corresponds to how well the GMM distribution approximates the
latent source samples. My algorithm does not directly minimize this term; however, if the model forms a
multi-modal distribution in the source embedding space, necessary for good performance, this term will be
small. The second to last term is dependent on the number of available samples in the adaptation problem,
and becomes negligible when sufficient samples are accessible. The final term measures the difficulty of
the optimization, and is dependent only on the structure of the data. For related domains, this term will
also be negligible.
5.6 Experimental validation
Datasets: I validate on five datasets: Office-31, Office-Home, Office-Caltech, Image-Clef and DomainNet.
Office-31 ([154]) is a dataset with 31 classes consisting of 4110 images from an office environment pertaining to three domains: Amazon, Webcam and DSLR. Domains differ in image quality, background, number of samples and class distributions. Office-Caltech ([54]) contains 2533 images from 10 office-related classes across four domains: Amazon, Webcam, DSLR, Caltech. Office-Home ([187]) contains 65 classes and 30475 images from four different domains: Art (stylized images), Clipart (clip art sketches), Product (images with no background) and Real-World (realistic images), making it more challenging than the Office datasets.
Image-Clef ([109]) contains 1800 images under 12 generic categories sourced from three domains: Cal-
tech, Imagenet and Pascal. DomainNet ([128]) is a larger, more recent dataset containing 586,575 images
from 345 general classes, with different class distributions for each of its domains: Quickdraw, Clipart,
Painting, Infograph, Sketch, Real.
Preprocessing & Network structure: I follow the literature for fair comparison. For each domain I
re-scale images to a standard size of(224,224,3). I use a ResNet50 ([62]) network as a backbone for the fea-
ture extractor, followed by four fully connected layers. The network classification head consists of a linear
layer, and source-training is performed using cross-entropy loss. The ResNet layers of the feature extractor
are frozen during adaptation. I report classification accuracy, averaged across five runs. As hardware I used
an NVIDIA Titan Xp GPU. The code is available online at https://github.com/serbanstan/secure-muda.
To test the effectiveness of my privacy preserving approach for MUDA, I compare my method against
state-of-the-art SUDA and MUDA approaches. Benchmarks for single best (SB), source combined (SC) and
multi source (MS) performance are reported based on DAN ([108]), D-CORAL ([174]), RevGrad ([51]). I
include most existing MUDA algorithms: DCTN ([199]), FADA ([129]), MFSAN ([225]), MDDA ([221]), Sim-
pAl ([186]), JAN ([111]), MEDA ([190]), MCD ([155]), M3SDA ([128]), MDAN ([219]), MDMN ([97]), DARN
([194]), DECISION ([2]). Note that I maintain full domain privacy throughout training and adaptation.
Hence, most of the above works should be considered as upper bounds in performance, as they address
a more relaxed problem, by allowing joint and persistent access to source data. While [2] also performs
source-free adaptation, they benefit from jointly accessing the source trained models during adaptation,
while my method only assumes joint access when pooling predictions. Despite this additional constraint,
results show my algorithm is competitive with, and at times outperforms, the aforementioned methods. I next
present quantitative and qualitative analysis of my work.
5.6.1 Performance Results
Table 5.1 presents my main results. For Office-31 , I observe state-of-the-art performance (SOTA) on the
→ D,→ A tasks and near SOTA performance on the remaining task. Note that the domains DSLR and
Webcam share similar distributions, as exemplified through the Source-Only results, and for these domains
obtaining a good adaptation performance involves minimizing negative transfer, which my method suc-
cessfully achieves. In the case of Image-clef, I obtain SOTA performance on all tasks, even though the
methods I compare against are not source-free. On theOffice-caltech dataset, I obtain SOTA performance
on the→ A task, with close to SOTA performance on the three other tasks. The domains of the Office-
home dataset have larger domain gaps with more classes compared to the three previous datasets. My
approach obtains near SOTA performance on the→P and→R tasks and competitive performance on the
remaining tasks. Finally, theDomainNet dataset contains a much larger number of classes and variation
in class distributions compared to the other datasets, making it the most challenging considered task in
my experiments. Even so, I am able to obtain SOTA performance on three of the six tasks with competitive
results on the other three. I reiterate most other MUDA algorithms serve as upper-bounds to my work, as
they either access source data directly, simultaneously use models from all sources for adaptation, or both.
Results across all tasks demonstrate that not only am I able to compare favorably against these methods
while preserving data privacy, but I also set new SOTA on several tasks.
5.6.2 Ablative Experiments and Empirical Analysis
I perform ablative experiments by investigating the effect of each loss term in Eq. 5.4 on performance,
and present results in Table 5.2. I observe combining the two terms yields improved performance for
all datasets besides Office-caltech , where the difference is negligible. On the other hand, minimizing the
SWD is more impactful on the Image-clef and Office-home datasets. The conditional entropy contributes
more forOffice-caltech and someOffice-31 tasks. My insight is conditional entropy is more impactful when
the source trained models have higher initial performance on the target (e.g.,→ D,→ W on Office-31 ),
while the SWD term is more beneficial when there is a larger discrepancy between the source domains
and the target domain (e.g.,→ A on Office-31 ). Experiments conclude using both terms further improves
performance.
Preserving privacy when performing adaptation limits information access between domains. It is then
expected that privacy preserving methods are at a disadvantage compared to methods that do not impose
any privacy restrictions. I show that in the case of my method, this observed disadvantage is negligible.
Method →D →W →A Avg.
SB
Source Only 99.3 96.7 62.5 86.2
DAN 99.7 98.0 65.3 87.7
D-CORAL 99.7 98.0 65.3 87.7
RevGrad 99.1 96.9 66.2 87.5
SC
DAN 99.6 97.8 67.6 88.3
D-CORAL 99.3 98.0 67.1 88.1
RevGrad 99.7 98.1 67.6 88.5
MS
MDDA 99.2 97.1 56.2 84.2
DCTN 99.3 98.2 64.2 87.2
MFSAN 99.5 98.5 72.7 90.2
SImpAl 99.2 97.4 70.6 89.0
DECISION∗ 99.6 98.4 75.4 91.1
SMUDA∗+ 99.8 98.5 75.4 91.2
(a) Office-31
Method →P →C →I Avg.
SB
Source Only 74.8 91.5 83.9 83.4
DAN 75.0 93.3 86.2 84.8
D-CORAL 76.9 93.6 88.5 86.3
RevGrad 75.0 96.2 87.0 86.1
SC
DAN 77.6 93.3 92.2 87.7
D-CORAL 77.1 93.6 91.7 87.5
RevGrad 77.9 93.7 91.8 87.8
MS
DCTN 75.0 95.7 90.3 87.0
MFSAN 79.1 95.4 93.6 89.4
SImpAl 77.5 93.3 91.0 87.3
SMUDA∗+ 79.4 96.9 93.9 90.1
(b) Image-clef
Method →W →D →C →A Avg.
SB
Source Only 99.0 98.3 87.8 86.1 92.8
DAN 99.3 98.2 89.7 94.8 95.5
JAN 99.4 99.4 91.2 91.8 95.5
MS
DAN 99.5 99.1 89.2 91.6 94.8
DCTN 99.4 99.0 90.2 91.6 94.8
MEDA 99.3 99.2 91.4 92.9 95.7
MCD 99.5 99.1 91.5 92.1 95.6
M3SDA 99.4 99.2 91.5 94.1 96.1
SImpAl 99.3 99.8 92.2 95.3 96.7
FADA∗+ 88.1 87.1 88.7 84.2 87.1
DECISION∗ 99.6 100 95.9 95.9 98.0
SMUDA∗+ 99.3 97.6 93.9 95.9 96.6
(c) Office-caltech
Method →A →C →P →R Avg.
SB
Source Only 65.3 49.6 79.7 75.4 67.5
DAN 68.2 56.5 80.3 75.9 70.2
D-CORAL 67.0 53.6 80.3 76.3 69.3
RevGrad 67.9 55.9 80.4 75.8 70.0
SC
DAN 68.5 59.4 79.0 82.5 72.4
D-CORAL 68.1 58.6 79.5 82.7 72.2
RevGrad 68.4 59.1 79.5 82.7 72.4
MS
MFSAN 72.1 62.0 80.3 81.8 74.1
M3SDA 64.1 62.8 76.2 78.6 70.4
SImpAl 70.8 56.3 80.2 81.5 72.2
MDAN 68.1 67.0 81.0 82.8 74.8
MDMN 68.7 67.6 81.4 83.3 75.3
DARN 70.0 68.4 82.8 83.9 76.2
DECISION∗ 74.5 59.4 84.4 83.6 75.5
SMUDA∗+ 69.1 61.5 83.5 83.4 74.4
(d) Office-home
Method →Q →C →P →I →S →R Avg.
SB
Source Only 11.8 39.6 33.9 8.2 23.1 41.6 26.4
DAN 16.2 39.1 33.3 11.4 29.7 42.1 28.6
JAN 14.3 35.3 32.5 9.1 25.7 43.1 26.7
ADDA 14.9 39.5 29.1 14.5 30.7 41.9 28.4
MCD 3.8 42.6 42.6 19.6 33.8 50.5 32.2
SC
Source Only 13.3 47.6 38.1 13.0 33.7 51.9 32.9
DAN 15.3 45.4 36.2 12.8 34.0 48.6 32.1
JAN 12.1 40.9 35.4 11.1 32.3 45.8 29.6
ADDA 14.7 47.5 36.7 11.4 33.5 49.1 32.2
MCD 7.6 54.3 45.7 22.1 43.5 58.4 38.5
MS
DCTN 7.2 48.6 48.8 23.5 47.3 53.5 38.2
M3SDA-β 6.3 58.6 52.3 26.0 49.5 62.7 42.6
FADA∗+ 7.9 45.3 38.9 16.3 26.8 46.7 30.3
DECISION∗ 18.9 61.5 54.6 21.6 51 67.5 45.9
SMUDA∗+ 14.6 62.4 53.6 24.4 49.9 68.3 45.5
(e) DomainNet
Table 5.1: Results on five benchmark datasets. Single best (SB) represents the best performance with respect to any source, source combined (SC) represents performance obtained by pooling the source data together from different domains, and multi source (MS) represents methods performing multi-source adaptation. ∗ indicates source-free adaptation, guaranteeing privacy between sources and the target. + indicates privacy between source models. Results in bold correspond to the highest accuracy amongst the source-free approaches.
To study the effect of preserving privacy on UDA performance, I perform experiments where source data
is shared either between sources or with the target domain. I consider three primary scenarios for sharing
source data: (i)SWD loss is computed using the source domain latent features (SW); (ii)Source domains’
data isCombined into a single source (SC);SupervisedSource loss is computed for joint UDA (SS). I report
results for natural combinations of these approaches in Table 5.3. I observe SMUDA performs similarly
to SW. Joint adaptation (SS), source-combined performance (SC), or a combination of the two offer im-
proved performance on all datasets. The improvements from sacrificing privacy are however negligible
Method →D →W →A Avg.
SWD only 92.2 94.1 73.1 86.4
L_ent only 99.8 98.1 66.2 88.1
SMUDA 99.8 98.5 75.4 91.2
(a) Office-31
Method →W →D →C →A Avg.
SWD only 98.1 97.8 92.1 95.5 95.9
L_ent only 99.4 97.7 94 96 96.8
SMUDA 99.3 97.6 93.9 95.9 96.6
(b) Office-caltech
Method →P →C →I Avg.
SWD only 79.3 96.5 94.2 90
L_ent only 78.5 96 91.7 88.7
SMUDA 79.4 96.9 93.9 90.1
(c) Image-clef
Method →A →C →P →R Avg.
SWD only 66.6 59.1 80.9 82.2 72.2
L_ent only 64.5 49.4 77.8 72.2 66
SMUDA 69.1 61.5 83.5 83.4 74.4
(d) Office-home
Table 5.2: Results when only the SWD objective, the entropy objective or both (SMUDA) are used.
compared to SMUDA. I conclude my source domain approximation using GMMs captures sufficient source
information under the considered test cases. This leads my adaptation approach to achieve comparable
performance to settings where privacy is not enforced, with the added benefit of not sharing data between
domains.
Method →D →W →A Avg.
SW 99.5 98.5 75.5 91.2
SC 98.1 96.8 76 90.3
SS 99.8 98.7 76.3 91.6
SW+SS 99.8 98.6 75.7 91.3
SC+SS 98.4 96.9 76.1 90.5
SC+SW+SS 99.0 97.7 76.1 90.9
SMUDA 99.8 98.5 75.4 91.2
(a) Office-31
Method →W →D →C →A Avg.
SW 99.4 96.9 93.9 95.9 96.5
SC 99.7 96.8 94.1 96 96.6
SS 99.6 97.2 94.1 95.9 96.7
SW+SS 99.7 97.4 94.1 96 96.8
SC+SS 99.7 96.8 94.2 96.2 96.7
SC+SW+SS 99.6 97.2 93.3 95.9 96.5
SMUDA 99.3 97.6 93.9 95.9 96.6
(b) Office-caltech
Method →P →C →I Avg.
SW 79.5 95.2 91.3 88.6
SC 79.8 96.6 94.2 90.2
SS 79.4 95.6 91.6 88.9
SW+SS 79.4 95.6 91.8 88.9
SC+SS 79.9 96.6 93.1 89.8
SC+SW+SS 79.5 95.2 91.3 88.6
SMUDA 79.4 96.9 93.9 90.1
(c) Image-clef
Method →A →C →P →R Avg.
SW 68.9 60.8 83.3 83.4 74.1
SC 69.6 62.9 85.3 84.7 75.6
SS 68.9 61.1 84 83.6 74.4
SW+SS 68.5 61 83.8 83.8 74.3
SC+SS 69.4 63 85.5 85 75.7
SC+SW+SS 68.8 62.7 85.1 84.5 75.3
SMUDA 69.2 61.1 83.2 83.5 74.3
(d) Office-home
Table 5.3: Results comparing SMUDA to non-private variants.
To compare my method against ensembles of existing single-source source-free UDA (SFUDA) methods, I performed experiments on the Office-31 dataset. I compare against five recent SFUDA approaches in Table 5.4, and against the ensemble of these methods in Table 5.5. I observe that although my method, while competitive, trails some of these methods in the single-source SFUDA setting, it outperforms them in the MUDA setting. I conclude that my method successfully alleviates the effect of negative transfer and can indeed boost the performance of weaker single-source models. I also note that the SFUDA performance of my method could likely be improved through better probability metrics or model regularization.
Method A→D W→D A→W D→W D→A W→A Avg.
USFDA [88] 64.5 96 71 93.3 62.8 63.6 75.2
SHOT [98] 94.0 99.9 90.1 98.4 74.7 74.3 88.6
AFN [200] 90.7 99.8 90.1 98.6 73.0 70.2 87.1
MDD [217] 93.5 100 94.5 98.4 74.6 72.2 88.9
GVB-GD [37] 95.0 100 94.8 98.7 73.4 73.7 89.3
SMUDA(ours) 92.7 99.8 87.9 98.5 72.1 75.4 87.7
Table 5.4: Single source results
Method →D →W →A Avg.
USFDA [88] 96.0 93.1 65.5 84.9
SHOT-ens [2] 97.8 94.9 75.0 89.3
AFN [200] 98.4 96.4 71.3 88.7
MDD [217] 95.4 99.3 74.1 89.6
GVB-GD [37] 97.2 95.6 74.9 89.2
SMUDA-Uniform 94.2 93 75.4 87.5
SMUDA(ours) 99.8 98.5 75.4 91.2
Table 5.5: Multi-source ensemble results, with baseline single-source predictions combined uniformly.
I study the effect of hyper-parameters on SMUDA performance. I first empirically validate my approach for computing the mixing parameters w_k. I consider four scenarios for combining model predictions: (i) Eq. 5.5, (ii) setting weights proportional to the SWD between the intermediate and the target domains (a cross-domain measure of distributional similarity), (iii) using a uniform average, and (iv) assigning all mixing weight to the model with the best target performance. Average performance for tasks from four of the datasets is reported in Table 5.6. I observe that my choice leads to maximum performance. The single-best strategy slightly outperforms it on one dataset, but suffers on tasks where a significant pairwise domain gap exists. This is expected, as using several domains is beneficial when they complement each other in terms of available information. Assigning weights proportional to D(g(T), A_k) may seem intuitive, given that similarity between pseudo-datasets and target latent features indicates better classifier generalization. However, this method outperforms only uniform averaging. I conclude that model reliability is a superior criterion for combining predictions. Uniform averaging leads to decreased generalization on the target domain, as it treats all domains equally. As a result, models with the least generalization ability on the target domain harm collective performance.
Dataset High confidence W2 Uniform Single best
Office-31 91.2 88.6 85.1 91.2
Image-clef 90.1 89.6 89.8 89.6
Office-caltech 96.6 96.6 96.6 97
Office-home 74.4 74.2 74.2 72.8
Total avg. 88.1 87.2 86.4 87.6
Table 5.6: Analytic experiments to study four strategies for combining the individual model predictions.
Mixing based on model reliability proves superior to other popular approaches.
Figure 5.2: Performance for different numbers of latent projections used in the SWD on Office-31.
I additionally study the effect of the SWD projection hyper-parameter. SWD utilizes L random projections, as detailed in Equation 5.2. While a large L leads to a tighter approximation of the optimal transport metric, it also incurs a computational penalty. I investigate whether there is a range of L values offering sufficient adaptation performance, and analyze the impact of this parameter on the Office-31 dataset. In Figure 5.2 I report performance results for L ∈ {1, 10, 50, 100, 200, 350, 500}. The SWD approximation becomes tighter with an increased number of projections, which translates into performance gains on all three tasks. I also note that above a certain threshold, i.e., L ≈ 200, the gains in performance from increasing L are minimal.
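The estimator behind this experiment can be sketched in NumPy. The function below is an illustrative stand-in, not the exact training code: it assumes equal-sized empirical samples, so the 1D optimal transport plan simply pairs sorted projections.

```python
import numpy as np

def sliced_wasserstein(xs, xt, n_proj=128, rng=None):
    """Monte-Carlo SWD between two equal-sized empirical samples:
    average squared 1D Wasserstein distance over random projections."""
    rng = rng or np.random.default_rng(0)
    theta = rng.normal(size=(n_proj, xs.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # unit directions
    ps, pt = xs @ theta.T, xt @ theta.T                    # (n, n_proj) slices
    ps.sort(axis=0)                                        # sorting solves 1D
    pt.sort(axis=0)                                        # optimal transport
    return float(np.mean((ps - pt) ** 2))

rng = np.random.default_rng(1)
src = rng.normal(size=(500, 8))
tgt = rng.normal(size=(500, 8)) + 2.0   # mean-shifted "target" sample
# More projections give a lower-variance estimate of the same quantity.
for n_proj in (1, 10, 200):
    swd = sliced_wasserstein(src, tgt, n_proj=n_proj)
```

With a single projection the estimate depends heavily on the sampled direction; by a few hundred projections it stabilizes, mirroring the plateau observed around L ≈ 200.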
Figure 5.3: Effect of the adaptation process on the Office-home dataset: from left to right, I consider Art,
Clipart and Product as the source domains, and real-world as the target domain.
In Figure 5.3 I explore the behavior of my adaptation strategy on an Office-home task. For each of the three source domains, I observe an increase in target accuracy once adaptation starts, which is in line with my previous results. Note that this increase in target accuracy also correlates with the minimization of the SWD and entropy losses. I additionally note that the combined multi-source performance using all three source domains outperforms the three single-source (SUDA) performances. The biggest difference is observed for the Clipart-trained model, which exhibits the highest discrepancy from the target domain Real-World.
Figure 5.4: Prediction accuracy on Office-home target domain tasks under different levels of source model
confidence, and my choice of λ . Target predictions above this threshold attain high accuracy.
The confidence threshold λ controls the assignment of the mixing weights w_k. For each source domain, the number of target samples with confidence greater than λ is recorded, and these normalized counts produce w_k. To determine whether a given value of λ leads to a satisfactory choice of mixing weights, it is important to verify that the high-confidence samples are indeed correctly predicted. Figure 5.4 provides the prediction accuracy on target domain samples of the Office-home dataset for different confidence ranges. I consider five confidence probability ranges: [0–20%, 20–40%, 40–60%, 60–80%, 80–100%]. I observe that low-confidence predictions offer poor accuracy on the target domain. For example, when the confidence is less than 0.2, prediction accuracy is below 40%. Conversely, for target samples with a predicted confidence greater than 0.6, I observe accuracy of more than 90% on all three tasks of Office-home. This experiment supports my intuition that the number of high-confidence target samples can be used as a proxy for the domain mixing weights w_k. I also note that the number of high-confidence samples is computed using the source-only models, as adaptation artificially increases confidence across the whole dataset.
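The weight computation described above can be sketched as follows. The helper name and the toy softmax outputs are illustrative, not the exact Eq. 5.5 implementation.

```python
import numpy as np

def mixing_weights(probs_per_model, lam=0.5):
    """w_k proportional to the number of target samples whose maximum
    softmax probability under source model k exceeds the threshold lam."""
    counts = np.array([(p.max(axis=1) > lam).sum() for p in probs_per_model],
                      dtype=float)
    return counts / counts.sum()

# Toy softmax outputs of two source-only models on the same 4 target samples.
m1 = np.array([[0.90, 0.05, 0.05],
               [0.80, 0.10, 0.10],
               [0.40, 0.30, 0.30],
               [0.45, 0.30, 0.25]])   # 2 high-confidence predictions
m2 = np.array([[0.60, 0.20, 0.20],
               [0.55, 0.25, 0.20],
               [0.52, 0.24, 0.24],
               [0.40, 0.35, 0.25]])   # 3 high-confidence predictions
w = mixing_weights([m1, m2])          # → array([0.4, 0.6])
mixed = w[0] * m1 + w[1] * m2         # reliability-weighted ensemble
```

The resulting rows of `mixed` remain valid probability vectors, since the mixture is a convex combination of the per-model softmax outputs.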
Figure 5.5: Results on the Office-31 and Image-clef datasets for different values of the confidence parameter λ. The dotted line corresponds to λ = .5, used for reporting the results in Table 5.1.
I further investigate performance with respect to the λ parameter. While in Figure 5.5 I observe that target accuracy increases with higher levels of classifier confidence, the proportion of high-confidence samples relative to the dataset size is equally important for an appropriate choice of confidence threshold. Setting λ too high may lead to mixing weights that do not capture model behavior on the whole target distribution, only on a small subset of samples, leading to degraded performance. Conversely, a low value of λ leads to results equivalent to uniformly combining predictions. Figure 5.5 portrays both behaviors on the Office-31 and Image-clef datasets. I observe that my choice of λ = .5 obtains the best performance on the Office-31 dataset, and close to the best performance on the Image-clef dataset. I also note that the choice of λ is relatively robust, as values in the interval [.2, .7] offer similar performance.
Figure 5.6: UMAP latent space visualization for Office-caltech with Amazon as the target. Sources in order: Caltech, DSLR, and Webcam. Adaptation shifts target embeddings towards the GMM distribution.
My approach attempts to minimize the distributional distance between target embeddings and GMM
estimations of source embeddings. I provide insight into this process in Figure 5.6, where I reduce the data
representation dimension to two using UMAP ([116]). I display GMM samples, target latent embeddings
before adaptation, and target latent embeddings post-adaptation. For each source domain, the adaptation
process reduces the distance between target domain embeddings (yellow points) and the GMM samples
(red points). This empirically validates the theoretical justification for my algorithm. Given that classifiers trained on the source domains generalize to the GMM samples as a result of pretraining, I conclude that source-specific domain alignment translates to improved collective performance.
I extend the runtime results from Section 6.3 to the Domain-Net dataset. The results in Figure 5.7 share a similar trend with those in Figure 5.3: after the start of the adaptation process, target accuracy improves for each source-trained model. Additionally, pooling information from each of the five source domains leads to improved overall predictive quality of the model.
Figure 5.7: Effect of the adaptation process on the Domain-Net dataset, where Sketch is the target. Sources are, in order: Clipart, Infograph, Painting, Quickdraw, Real.

In Table 5.7 I report results for different choices of the γ parameter on the Office-31 dataset. The main results in Table 5.1 are generated with γ = .02. I test a wider range of γ values to identify how robust performance is to this parameter. I observe that large values of γ put too much emphasis on the entropy loss, harming the optimal-transport-based distribution matching. Small values of γ share a similar problem, as I lose the soft-clustering benefit provided by the entropy loss. Overall, I notice that different tasks have different high-performing γ ranges. For example, on the →D task, any choice of the parameter greater than 0.02 leads to best performance, while on the →A task values of γ less than 0.04 remain competitive. While setting task-specific γ values may lead to improved performance, I show that the more robust setting where I choose the same γ per dataset still works reasonably well in practice.
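For reference, the entropy term that γ scales corresponds to the mean Shannon entropy of the target softmax outputs. The sketch below is a stand-alone illustration, not the exact training code.

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Mean Shannon entropy of softmax outputs; minimizing it pushes
    target predictions toward confident, soft-clustered assignments."""
    p = np.clip(probs, eps, 1.0)
    return float(-(p * np.log(p)).sum(axis=1).mean())

confident = np.array([[0.98, 0.01, 0.01],
                      [0.01, 0.98, 0.01]])
uniform = np.full((2, 3), 1.0 / 3.0)
prediction_entropy(confident) < prediction_entropy(uniform)  # → True
```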
Method →D →W →A Avg.
γ =1 99.8 97.6 64.8 87.4
γ =.04 99.8 98.7 72.8 90.4
γ =.03 99.8 98.4 74.2 90.8
γ =.02 99.8 98.5 75.4 91.2
γ =.01 98.8 97.8 76.3 91
γ =1e-3 89.3 92.4 74.4 85.4
γ =1e-4 87.4 90.7 74.2 84.1
Table 5.7: Performance analysis for different values of γ.
I next justify the beneficial effect of using all available source domains for inference. In Table 5.8 I present results obtained by successively adding source domains to my ensemble on the Office-home dataset. I also include single-best performance as a baseline, representing the highest target accuracy obtained across source domains. The rows of the table correspond to the four MUDA problems considered for Office-home, while the columns correspond to the number of source domains considered in ensembling. For example, the ACP → R task with 2 sources considers the problem AC → R. Similar to Figures 5.3 and 5.7, I observe that ensembling is superior to single-best performance on all tasks. Additionally, my mixing strategy proves robust with respect to negative transfer, as adding new domains always benefits the reported performance.
Method Single best First src. domain First 2 src. domains All 3 src. domains
ACP →R 81.1 81.2 82.3 83.7
ACR→P 83.1 77.1 77.9 83.3
APR→C 59.2 58.3 60.1 60.9
PCR→A 67.2 67.3 67.6 68.2
Table 5.8: Performance analysis when source domains are introduced sequentially.
I investigate the representation quality of the GMM distribution as a surrogate for the source distribution. Note that having GMMs that are good approximations of the source latent features is crucial for my approach. In Figure 5.8, I present visualized data representations of the estimated GMMs and the source domain distributions for the Image-clef dataset. I note that for both source domains, their latent space distributions after pretraining are multi-modal with 12 modes, each corresponding to one class. This observation confirms that I can approximate the source domain distribution with a 12-mode GMM. I also note that for both source domains the estimated GMM distribution offers a close approximation of the original source distribution. This experiment empirically validates that the third term in Eq. 4 is small in practice and that I can use these GMMs as intermediate cross-domain distributions.
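The GMM surrogate can be illustrated by fitting one Gaussian mode per class to latent features and sampling a labeled pseudo-dataset from it. The diagonal-covariance simplification and the toy clusters below are assumptions for brevity, not the fitting procedure used in the experiments.

```python
import numpy as np

def fit_class_gaussians(z, y, n_classes):
    """One mode per class: per-class mean and diagonal variance of the
    latent features (a simplified surrogate for the source distribution)."""
    return [(z[y == c].mean(axis=0), z[y == c].var(axis=0) + 1e-6)
            for c in range(n_classes)]

def sample_pseudo_dataset(params, n_per_class, rng):
    """Draw labeled pseudo-samples from the fitted per-class Gaussians."""
    zs, ys = [], []
    for c, (mu, var) in enumerate(params):
        zs.append(rng.normal(mu, np.sqrt(var), size=(n_per_class, mu.size)))
        ys.append(np.full(n_per_class, c))
    return np.concatenate(zs), np.concatenate(ys)

rng = np.random.default_rng(0)
# Toy "latent features": two well-separated class clusters.
z = np.concatenate([rng.normal(0.0, 1.0, (200, 3)),
                    rng.normal(5.0, 1.0, (200, 3))])
y = np.concatenate([np.zeros(200, int), np.ones(200, int)])
params = fit_class_gaussians(z, y, 2)
pz, py = sample_pseudo_dataset(params, 100, rng)
```

The sampled pseudo-dataset (pz, py) plays the role of the intermediate distribution: it matches the per-class statistics of the latent source features without retaining any individual source sample.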
I additionally analyze the latent source approximation used for adaptation, and compare it to the scenario where the model can be fine-tuned on the source domain after adaptation starts. I present these
Figure 5.8: Source and GMM embeddings for the Image-clef dataset with Pascal and Caltech as sources.
For both datasets, the GMM samples closely approximate the source embeddings.
findings in Figure 5.9. When privacy is not a concern and I have access to all source data, there is no need to approximate the source distribution during adaptation, as I can directly use the source samples. For empirical exploration, results on how the latent distribution changes when source access is permitted during adaptation (corresponding to the SS + SW case in Table 5.3) are presented in Figure 5.9. I notice that as training progresses, the means slightly shift. Visually, however, the change compared to using the latent space only after source training is negligible. This supports the idea that the GMMs learned at the end of source training provide a sufficient approximation of the source distribution.
Figure 5.9: Latent distributions for two domains of the Office-31 dataset when considering the D→A and
W →A tasks. Each color gradient represents a latent feature distribution snapshot of the source domains.
Darker colors correspond to later training iterations, lighter colors to earlier iterations.
5.7 Remarks
In this chapter, I develop a privacy-preserving MUDA algorithm by showing that private and distributed adaptation is possible in the multi-source setting. Empirically, I demonstrate that my approach performs favorably against SOTA algorithms on five UDA benchmarks. I additionally explore the various components of my approach and their impact on adaptation performance.

Next, I consider the setting of fair classification. Most algorithms in fair classification are designed to reduce prediction bias without considering domain shift. I show that in general settings, domain shift may be a problem if the data sourcing process is biased. I show that established fairness approaches are ill-equipped to deal with this setting, and propose an extension of my work that improves target domain performance with respect to both accuracy and fairness metrics.
Chapter 6

Fair Model Adaptation
6.1 Motivation
AI is increasingly being used to automate societal tasks such as processing loan applications, parole decisions, healthcare, and police deployments [33]. The recent success of AI stems from the reemergence of deep learning, allowing for the optimization of increasingly complex decision functions in the presence of large datasets. A major benefit of this data-driven learning approach is that it relaxes the need for tedious feature engineering. However, deep learning methods have their drawbacks, e.g., data annotation is a time-consuming and expensive process.

Data-driven learning can also lead to unintentionally training unfair models due to inherent biases in collected training datasets, or due to skewed data distributions conditioned on certain features [17]. Training models by simply minimizing the empirical error on relevant datasets may introduce spurious correlations between majority-subgroup features and positive outcomes, because machine learning primarily discovers correlations rather than causation. Thus, the decision boundary of biased AI models may be informed by group-specific characteristics independent of the decision task [46], which produce negative outcomes for disadvantaged groups in the data. For example, income level is positively correlated with male gender, which may be discovered as a basis for unfair decisions against female applicants.
The above crucial concerns about fairness in AI have generated significant research interest in the AI community. The first step in addressing bias in AI is to arrive at a commonly agreed-upon definition of fairness. Pioneering works in this area focused on defining quantitative notions of fairness based on commonsense intuition and using them to empirically demonstrate the presence and severity of bias in AI [17, 18]. Most existing fairness metrics assume the input data points contain characteristics of protected subgroups [47], e.g., gender and race, in addition to the normal features used for classification. Based on subgroup membership, majority and minority populations emerge, which under fair learning should receive equal outcomes. A model is then considered fair if its predictions possess a notion of conditional probabilistic independence from data membership in the subgroups [118] (see the Fairness Metrics section for definitions of common fairness metrics).
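As one concrete example of such a metric, demographic parity compares positive-prediction rates across subgroups. The sketch below is a generic illustration, not tied to the chapter's later notation.

```python
import numpy as np

def demographic_parity_gap(y_hat, a):
    """Absolute difference in positive-prediction rates between the two
    subgroups defined by the binary sensitive attribute a; a gap of 0
    means predictions are statistically independent of group membership."""
    y_hat, a = np.asarray(y_hat), np.asarray(a)
    return abs(float(y_hat[a == 1].mean()) - float(y_hat[a == 0].mean()))

# A biased classifier approving 80% of group 1 but only 40% of group 0
# exhibits a demographic parity gap of 0.4:
y_hat = np.array([1, 1, 1, 1, 0, 1, 1, 0, 0, 0])
a     = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
gap = demographic_parity_gap(y_hat, a)
```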
A fair model can be trained by minimizing the model bias under these metrics, e.g., demographic parity, in addition to minimizing the empirical risk on a given training dataset. Despite being effective in mitigating bias, most existing fair training algorithms assume that the data distribution remains stationary after the training stage. However, this assumption is rarely true in practical settings, in particular when a model is used over extended time periods in dynamic societal applications. As a result, a fair model might fail to maintain fairness under input-space distributional shifts, or when the model is used on differently sourced datasets [130]. The naive solution of retraining the model after distributional shifts requires annotating new data points to build datasets representative of the new input distribution. This process, however, is time-consuming and expensive for deep learning. It is thus highly desirable to develop algorithms that preserve model fairness under distribution shifts. Unfortunately, this problem has only been marginally explored in the literature.
The problem of model adaptation under distributional shifts has been investigated extensively in the unsupervised domain adaptation (UDA) literature [183, 38]. The goal in UDA is to train a
classification model with a good generalization performance for a target domain where only unannotated
data is available, through transferring knowledge from a related source domain, where annotated data is
accessible. A primary group of UDA algorithms achieve this goal through matching the source and the
target distributions in an embedding space [134] such that the embedding space is domain-agnostic. As a
result, a classifier that receives its input as data representations in the embedding space will generalize well
in the target domain, despite being trained solely using the source domain annotated data. To align the
two distributions in such an embedding, data points from both domains are mapped into a shared feature
space that is modeled as the output space of a deep neural network encoder. The deep encoder is then
trained to minimize the distance between the two distributions, measured in terms of a suitable probability
distribution metric. Note, however, that existing UDA algorithms overlook model fairness and solely consider improving model performance in the target domain as the ultimate goal. In this work, I adopt the idea of domain alignment to preserve model fairness and mitigate model biases introduced by domain shift.
I address the problem of preserving model fairness under distributional shifts in the input-space when
only unannotated data is accessible. I model this problem within the domain adaptation paradigm where
I consider the initial annotated training data as the source domain and the collected unannotated data
during model execution as the target domain. My contribution is to develop an algorithm that minimizes
distributional mismatches in a shared embedding space, modeled as the output space of a deep encoder.
I present empirical results using tasks that are built from three standard fairness benchmark datasets to
demonstrate the applicability of the proposed algorithm.
6.2 Related Work
My work is related to works in UDA and fairness in AI.
6.2.1 Fairness in AI
There are various approaches to training a fair model for a single domain. A primary idea in existing works is to map data points into an embedding space in which the sensitive attributes are fully removed from the representative features, i.e., an attribute-agnostic space for which fairness is measured at the classifier output using a desired fairness metric. As a result, a classifier that receives its input from this space will
make unbiased decisions due to independence of its decisions from the sensitive attributes. Ray et al. [73]
develop a fairness algorithm that induces probabilistic independence between the sensitive attributes and
the classifier outputs by minimizing the optimal transport distance between the probability distributions
conditioned on the sensitive attributes. The transformed probability in the embedding space then becomes
independent (unconditioned) from the sensitive attributes. Celis et al. [19] study the possibility of using
a meta-algorithm for fairness with respect to several disjoint sensitive attributes. Du et al. [45] have followed a different approach. Rather than training an encoder that removes the sensitive attributes in a latent embedding space and then training the classifier, they propose to debias classifiers by leveraging samples that share the same ground-truth label yet have different sensitive attributes. The idea is to discourage undesirable correlations between the sensitive attribute and predictions in an end-to-end scheme, which allows attribute-agnostic representations to emerge in the model's hidden layers.
Beutel et al. [11] build on the idea of removing sensitive attributes to train fair models by indirectly enforcing decision independence from the sensitive attributes in a latent representation using adversarial learning. They also amend the encoder model with a decoder to form an autoencoder. Since the representations are learned such that they can self-reconstruct the input, they become discriminative for classification purposes as well. My work builds upon adversarial learning to preserve fairness when distribution shifts exist. To combat domain shift, my idea is to additionally match the target data distribution with the source data distribution in the latent embedding space, a process which ensures classifier generalization.
6.2.2 Unsupervised Domain Adaptation
Works on domain alignment for UDA follow a diverse set of strategies. The goal of existing works in
UDA is solely improving the prediction accuracy in the target domain in the presence of domain shift
without exploring the problem of fairness. The closest line of research to my work addresses domain shift
by minimizing a probability discrepancy measure between two distributions in a shared embedding space.
Selection of the discrepancy measure is a critical task for these works. A number of UDA methods simply
match the low-order empirical statistics of the source and the target distributions as a surrogate for the
distributions. For example, the Maximum Mean Discrepancy (MMD) metric is defined to match the means
of two distributions for UDA [108, 109]. Correlation alignment is another approach, which includes second-order moments [174]. Matching only lower-order statistical moments overlooks discrepancies in higher-order statistical moments. To improve upon these methods, a suitable probability distance metric
can be incorporated into UDA to consider higher-order statistics for domain alignment. A suitable metric
for this purpose is the Wasserstein distance (WD) or the optimal transport metric [36, 12]. Since WD
possesses non-vanishing gradients for two non-overlapping distributions, it is a more suitable choice for
deep learning compared to more common distribution discrepancy measures, e.g., KL-divergence. Optimal
transport can be minimized as an objective using first-order optimization algorithms for deep learning.
Using WD has led to a considerable performance boost in UDA [12] compared to methods that rely on
aligning the lower-order statistical moments [108, 174].
6.2.3 Domain Adaptation in Fairness
Despite its practical importance, work on leveraging knowledge transfer for domain adaptation in fairness is relatively limited. Madras et al. [114] benefit
from adversarial learning to learn domain-agnostic transferable representations for fair model generaliza-
tion. Coston et al. [35] consider a UDA setting, where the sensitive attributes for data points are accessible
only in one of the source or the target domains. Their idea is to use a weighted average to compute the
empirical risk and then tune the corresponding data point-specific weights to minimize co-variate shifts.
Schumann et al. [163] consider a similar setting and define the fairness distance of equalized odds and then
use it as a regularization term in addition to empirical risk, minimized for fair cross-domain generalization.
Hu et al. [66] address fairness in a distributed learning setting, where the data exist in various servers with
private demographic information. Singh et al. [168] consider that a causal graph for the source domain
data and anticipated shifts are given. They then use feature selection to estimate the fairness metric in
the target domain for model adaptation. Zhang and Long [216] explore the possibility of training fair models in the presence of missing data in a target domain using a source domain with complete data, and derive theoretical bounds for this purpose. My learning setting is related to, yet different from, the above settings.
I consider a standard UDA setting, where the sensitive attributes are accessible in both domains. The
challenge is to adapt the model to preserve fairness in the target domain.
6.3 Problem Formulation
I formulate my problem in a classic UDA setting. Let (X, A, Y) ∈ R^d × {0, 1} × {0, 1} be the format of the datasets. X ∈ R^d represents a feature vector with d entries, corresponding to different dataset characteristics that are used for prediction, e.g., occupation length, education years, credit history, etc. A ∈ {0, 1} represents a sensitive attribute for which I aim to train a fair model, e.g., sex, race, or age. Finally, Y ∈ {0, 1} represents the binary label of my dataset, e.g., approved for credit, income greater than 50k, etc.
Given a fully annotated dataset in the source domain (X^s, A^s, Y^s), I can train a fair model f_θ : (X^s, A^s) → Y^s with learnable parameters θ. Let P_S(X, A) denote the unknown input feature distribution in the source domain. In a normal learning setting, I search for the optimal parameters θ^* that minimize the true risk, i.e., θ^* = argmin_θ { E_{(x^s, a^s) ∼ P_S(X, A)} [ L(f_θ(x^s, a^s), y^s) ] }, for some suitable loss function L, e.g., cross-entropy. Given that I do not have direct access to the data distribution, I use ERM as a surrogate objective function and learn an approximation θ̂ of θ^* using the training dataset:

    θ̂ = argmin_θ { (1/N) Σ_{i=1}^{N} L_bce(f_θ(x_i^s, a_i^s), y_i^s) },    (6.1)

where N represents the number of samples in the dataset, L_bce is the binary cross-entropy loss, and ŷ = f_θ(x^s, a^s) represents the probabilistic output of the model. Solving Eq. (6.1) does not lead to a fair model, as only prediction accuracy is optimized. Bias in the training dataset, e.g., over- or under-representation of subgroups, can lead to unfair models, even if the sensitive attribute is hidden.
Figure 6.1: Block-diagram description of the proposed framework
I consider that my classifier model f_θ : R^d → R^2 can be decomposed into an encoder subnetwork e_u : R^d → R^z, with learnable parameters u, followed by a classifier subnetwork g_v : R^z → R^2 with learnable parameters v. In this formulation, f_θ(·) = (g_v ∘ e_u)(·), θ = (u, v), and z denotes the size of the latent representation space that I want to make sensitive-agnostic. To induce fairness in the latent space, I consider an additional classification network h_w : R^z → R^2 with learnable parameters w. This classifier is tasked with predicting the sensitive attribute A from the latent space features e_u(x^s, a^s). The primary idea is to induce probabilistic independence from the sensitive attributes by adversarially optimizing e_u and h_w. Latent representations independent of the sensitive attribute A will cause h_w to perform poorly, while fairness-agnostic representations will allow h_w to correctly predict the sensitive attribute for a specific input.
Consider the loss for predicting the sensitive attribute:

    L_fair = L_bce((h_w ∘ e_u)(x^s, a^s), a^s).    (6.2)
From my discussion, I consider the following fairness-guaranteeing alternating minimization process:

1. I fix the encoder e_u and minimize the fairness loss L_fair by updating the attribute classifier h_w.

2. I then fix the attribute classifier h_w and maximize the fairness loss L_fair by updating the encoder e_u.

The first step performs empirical risk minimization (ERM) for the fairness classifier, conditioned on the encoder. The second step keeps the classifier fixed and ensures that the latent data representations are as uninformative as possible about the sensitive attribute A. Empirical exploration demonstrated that iteratively alternating between these two optimization steps forces the encoder e_u to produce latent representations that are independent from the sensitive attributes. This high-level idea is presented in Figure 6.1.
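The two-step alternation can be sketched with a toy linear encoder and a hand-written logistic attribute classifier. The shapes, learning rates, and synthetic data below are illustrative assumptions, not the networks used in the experiments.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
n, d, z_dim = 512, 4, 2
a = rng.integers(0, 2, n).astype(float)           # sensitive attribute
x = rng.normal(size=(n, d)) + 1.5 * a[:, None]    # features correlated with a

# Reference: a logistic classifier on the raw features recovers a easily.
xb = np.c_[x, np.ones(n)]
u0 = np.zeros(d + 1)
for _ in range(1500):
    u0 -= 0.1 * xb.T @ (sigmoid(xb @ u0) - a) / n
base_acc = float(((sigmoid(xb @ u0) > 0.5) == a).mean())

# Adversarial alternation: linear encoder e_u(x) = xW vs. logistic h_w.
W = rng.normal(scale=0.1, size=(d, z_dim))
v = rng.normal(scale=0.1, size=z_dim + 1)         # h_w weights + bias
lr = 0.1
for _ in range(300):
    for _ in range(5):                            # step 1: minimize L_fair wrt w
        zl = np.c_[x @ W, np.ones(n)]
        v -= lr * zl.T @ (sigmoid(zl @ v) - a) / n
    zl = np.c_[x @ W, np.ones(n)]                 # step 2: maximize L_fair wrt u
    W += lr * x.T @ np.outer(sigmoid(zl @ v) - a, v[:z_dim]) / n

fair_acc = float(((sigmoid(np.c_[x @ W, np.ones(n)] @ v) > 0.5) == a).mean())
# fair_acc should sit much closer to chance (0.5) than base_acc does.
```

The encoder can achieve exact independence here by projecting out the direction shared by the features and the attribute, which is what the ascent step drives it towards.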
The above approach would suffice in practice if the target data were drawn from the source domain. When the target data is instead sampled from a domain (X^t, A^t, Y^t) with P_T(X, A) ≠ P_S(X, A), the source-trained model will perform poorly during testing due to domain shift. To improve generalization on the target domain, I can minimize the empirical distance between e_u(x^s, a^s) and e_u(x^t, a^t). Under this restriction, the source domain classifier g_v will be able to generalize fairly on the target domain. While this technique has been used in UDA, by itself it is not sufficient to guarantee fairness at adaptation time.
6.4 Proposed Algorithm
The block-diagram description of my proposed approach is presented in Figure 6.1. Initially, I train a fair model on the source domain dataset (X^s, A^s, Y^s) through the following iterative three-step procedure:

1. I optimize the classifier network (f_θ = g_v ∘ e_u) with the cross-entropy loss in an end-to-end scheme by minimizing Eq. 6.1. This process generates latent features that are informative and discriminative for binary decision making.

2. I then fix the feature-extractor encoder subnetwork e_u and optimize the sensitive attribute classifier h_w by minimizing the loss in Eq. 6.2. This step enforces that the sensitive attribute classifier extracts information from the representations in the embedding space that can be used for predicting the sensitive attribute A.

3. I freeze the sensitive attribute classifier h_w and update the encoder subnetwork e_u to maximize the fairness loss in Eq. 6.2. This step forces the encoder to produce representations that are independent from the sensitive attribute A.
To preserve fairness under distribution shift, the adaptation process relies on coupling the source and target domains through a shared embedding space. If I match the distributions in the embedding space, the decision classifier g_v will be able to generalize well in the target domain. Aligning the embedding distributions is sufficient for achieving this goal, i.e., e(P_S(X, A)) ≈ e(P_T(X, A)).
I employ metric minimization from the UDA literature to enforce domain alignment [94, 134]. The idea is to select a suitable probability distribution distance d(·, ·) and minimize it over the encoder outputs, i.e., d(e(P_S(X, A)), e(P_T(X, A))), to guarantee a shared embedding feature space between the source and the target. The choice of the distribution distance d(·, ·) is an algorithmic design choice, and various metrics have been used for this purpose. In this work, I use the Sliced Wasserstein Distance (SWD) [134]. SWD
92
is an approximation of the optimal transport metric and enhances the applicability of using stochastic-
gradient based optimization. Optimal transport only has a closed-form solution for1D distributions. The
idea behind the SWD approximation is to slice two high dimensional distributions to generate 1D distri-
butions and then compute their distance as the average of these 1D WD slices, computed using several
random projections to generate them. In addition to the closed form solution, I can compute SWD using
the empirical samples from the two distributions. In the context of my setup, SWD can be computed as
follows:
L
swd
=
1
K
K
X
i=1
WD
1
(⟨e(x
s
,a
s
),γ i
⟩,⟨e(x
t
,a
t
),γ i
⟩), (6.3)
where, WD
1
(·,·) denotes the 1D WD distance, K is the number of random projections I am averaging
over andγ i
is one such projection direction.
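Equation 6.3 admits a simple empirical form: for equal-size batches, the 1D Wasserstein distance along each random slice reduces to sorting the projected samples. The sketch below assumes the 1-Wasserstein variant and equal batch sizes; the function and variable names are mine.

```python
import torch

def sliced_wasserstein(zs, zt, K=50):
    """Empirical SWD (Eq. 6.3) between two batches of embeddings.

    Each of the K random unit directions gamma_i projects both batches
    to 1D; the 1D Wasserstein distance between two equal-size empirical
    measures is obtained by sorting the projections and averaging the
    absolute coordinate differences.
    """
    d = zs.shape[1]
    gammas = torch.randn(K, d)
    gammas = gammas / gammas.norm(dim=1, keepdim=True)  # unit-norm slices
    ps = zs @ gammas.T   # (n, K): source embeddings projected on each slice
    pt = zt @ gammas.T   # (n, K): target embeddings projected on each slice
    # Closed-form 1D WD per slice, then average over the K slices.
    wd1 = (ps.sort(dim=0).values - pt.sort(dim=0).values).abs().mean(dim=0)
    return wd1.mean()

torch.manual_seed(0)
zs = torch.randn(128, 20)
zt = torch.randn(128, 20) + 2.0   # target embeddings with a mean shift
```

Identical batches yield a distance of zero, while the shifted batch yields a strictly positive distance, which is the quantity the adaptation step drives down.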
To implement domain alignment to preserve fairness in the target domain under distributional shifts, I augment steps (1)-(3) described above with two additional steps:

4. I minimize the empirical SWD distance between e(P_S(X,A)) and e(P_T(X,A)) via Eq. 6.3. This step ensures the source-trained classifier g will generalize on samples from the target domain, i.e., e(P_T(X,A)).

5. I repeat steps (2) and (3) using solely the sensitive attributes of the target domain.

The above additional steps will update the model on the target domain to preserve both fairness and accuracy. Following steps (1)-(5), the loss function becomes:
L_bce(ŷ, y^src) + α L_fair^src + β L_fair^tar + γ L_swd,   (6.4)
Algorithm 4 FairAdapt(α, β, γ, thresh, ITR)
1: for itr = 1, ..., ITR do
2:   Source Training:
3:     Optimize α L_bce via Eq. 6.1.
4:     Optimize β L_fair via Eq. 6.2, freezing u.
5:     Optimize −β L_fair via Eq. 6.2, freezing h.
6:   if itr > thresh then
7:     Target Adaptation:
8:       Optimize γ L_swd via Eq. 6.3.
9:       Optimize β L_fair via Eq. 6.2, freezing u.
10:      Optimize −β L_fair via Eq. 6.2, freezing h.
11:  end if
12: end for
13: return u, g
where the hyperparameters α, β, and γ can be tuned using cross validation. I provide algorithmic pseudocode for my proposed fairness preservation process in Algorithm 4.
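The control flow of Algorithm 4 amounts to a schedule over loss terms: steps (1)-(3) run at every iteration, while steps (4)-(5) are gated by thresh. A small sketch of this gating (the step labels are illustrative):

```python
def fairadapt_schedule(ITR, thresh):
    """Return, per iteration, the optimization steps Algorithm 4 runs.

    Steps 1-3 (source training) run every iteration; steps 4-5
    (target adaptation) only activate once itr exceeds `thresh`.
    """
    schedule = []
    for itr in range(1, ITR + 1):
        steps = ["min L_bce", "min L_fair (freeze u)", "max L_fair (freeze h)"]
        if itr > thresh:
            steps += ["min L_swd",
                      "min L_fair^tar (freeze u)",
                      "max L_fair^tar (freeze h)"]
        schedule.append(steps)
    return schedule
```

With the settings of Section 6.5.3 (45,000 iterations, adaptation starting at 30,000), `thresh` would sit two-thirds of the way through training.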
6.5 Empirical Validation
My learning setting is underexplored. For this reason, I adopt existing datasets and tailor them for my formulation.
6.5.1 Datasets and Tasks
Datasets relevant to fairness tasks pose a binary decision problem, e.g., approval of a credit application, alongside relevant features, e.g., employment history, credit history, etc., and group-related sensitive attributes, e.g., sex, race, nationality, or age. Based on sensitive group membership, data points can be part of privileged or unprivileged groups.
I perform experiments on three datasets widely used by the AI fairness community. I will consider sex as my sensitive attribute, which is recorded for all three datasets.
The UCI Adult dataset* is part of the UCI database [46] and consists of 1994 US Census data. The task associated with the dataset is predicting whether annual income exceeds 50k. After data cleaning, the dataset consists of more than 48,000 entries. Possible sensitive attributes for this dataset include sex and race.
The UCI German credit dataset† contains financial information for 1000 different people applying for credit. The predictive task involves categorizing individuals as good or bad credit risks. Sex and age are possible sensitive attributes for the German dataset.
The COMPAS recidivism dataset‡ maintains information on over 5,000 individuals' criminal records. Models trained on this dataset attempt to predict people's two-year risk of recidivism. For the COMPAS dataset, sex and race may be used as sensitive attributes.
Split Source Target
Size Y=0 A=0|Y=0 A=0|Y=1 Size Y=0 A=0|Y=0 A=0|Y=1
A 34120 0.76 0.39 0.15 14722 0.76 0.39 0.15
A1 12024 0.53 0.41 0.16 5393 0.91 0.49 0.18
A2 29466 0.66 0.34 0.14 2219 0.97 0.48 0.30
A3 11887 0.52 0.42 0.16 778 0.89 0.39 0.17
C 3701 0.52 0.77 0.86 1577 0.54 0.76 0.84
C1 2886 0.58 0.74 0.82 1096 0.67 0.78 0.86
C2 903 0.47 0.80 0.80 96 0.74 0.70 0.92
C3 292 0.51 0.77 0.79 50 0.68 0.62 0.88
G 697 0.70 0.28 0.37 303 0.70 0.30 0.34
G1 573 0.66 0.34 0.45 427 0.76 0.23 0.20
G2 388 0.61 0.36 0.49 196 0.84 0.20 0.16
G3 439 0.62 0.35 0.45 159 0.87 0.21 0.19
Table 6.1: Data split statistics. A, C, G correspond to the Adult, COMPAS, and German datasets respectively. The rows with no number, i.e., A, C, G, correspond to random data splits. The numbered rows, e.g., A1, A2, A3, correspond to statistics for specific splits. The columns represent the probabilities of specific outcomes for specific splits, e.g., P(Y=0). Results are reported using sex as the sensitive attribute.
* https://archive.ics.uci.edu/ml/datasets/Adult
† https://archive.ics.uci.edu/ml/datasets/statlog+(German+credit+data)
‡ https://github.com/propublica/COMPAS-analysis/
6.5.1.1 Evaluation Protocol Motivation
Historically, experiments on these datasets have considered random 70/30 splits for training and test. While such data splits are useful in evaluating overfitting for fairness algorithms, features for the training and test sets will be sampled from the same data distribution. This assumption is unlikely to hold in practice when deploying a fair model to a different dataset. I consider natural data splits obtained from sub-sampling the three datasets along different criteria. I show that, compared to random splits, where learning a model that guarantees fairness on the source domain is oftentimes enough for the model to guarantee fairness on the target domain, domain discrepancy between the source and target can lead fair models trained on the source domain to produce unfair or degenerate solutions on the target domain. In short, these splits introduce a domain gap between the training and testing splits.
Next, for each of the three datasets I will generate source/target data splits where ignoring domain discrepancy between the source and target can negatively impact fairness transfer. Per dataset, I will produce three such splits. I characterize the label distributions and sensitive attribute conditional distributions for these datasets in Table 6.1.
Adult Dataset. I will use age, education, and race to generate source and target domains. This can be a natural occurrence in practice, as gathered census information may differ along these axes geographically. For example, the urban population is on average more educated than the rural population§, and more ethnically diverse¶. Thus, a fair model trained on one of the two populations will need to overcome distribution shift when evaluated on the other population. Besides differences in the feature distributions, I also note the Adult dataset is both imbalanced in terms of outcome, P(Y=1) = 0.34, and sensitive attribute of positive outcome, P(A=1|Y=1) = 0.85, i.e., only a fraction of participants earn more than 50k/year, and 85% of them are male.
The source/target splits I consider are as follows:
§ https://www.ers.usda.gov/topics/rural-economy-population/employment-education/rural-education/
¶ https://www.ers.usda.gov/data-products/chart-gallery/gallery/chart-detail/?chartId=99538
1. Source data: White, More than 12 education years. Target data: Non-white, Less than 12 education years.
2. Source data: White, Older than 30. Target data: Non-white, Younger than 40.
3. Source data: Younger than 70, More than 12 education years. Target data: Older than 70, Less than 12 education years.
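Splits of this kind are straightforward to construct with pandas, and the conditional statistics reported in Table 6.1 can be checked directly. The sketch below uses a small synthetic frame; the column names (race, education_years, sex, income) are illustrative stand-ins for the actual Adult schema.

```python
import pandas as pd

# Toy stand-in for the Adult dataset; columns are illustrative.
df = pd.DataFrame({
    "race": ["White", "White", "Black", "Asian", "White", "Black"],
    "education_years": [16, 10, 9, 13, 14, 8],
    "sex": [1, 0, 0, 1, 1, 0],        # sensitive attribute A
    "income": [1, 0, 0, 1, 1, 0],     # label Y (income > 50k)
})

# Split 1: source = White with >12 education years,
#          target = non-White with <12 education years.
source = df[(df.race == "White") & (df.education_years > 12)]
target = df[(df.race != "White") & (df.education_years < 12)]

def split_stats(d):
    """P(Y=0) and P(A=0|Y=0), mirroring the columns of Table 6.1."""
    p_y0 = (d.income == 0).mean()
    y0 = d[d.income == 0]
    p_a0_given_y0 = (y0.sex == 0).mean() if len(y0) else float("nan")
    return p_y0, p_a0_given_y0
```

Applying `split_stats` to both frames exposes exactly the kind of distribution mismatch between source and target that Table 6.1 quantifies.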
In Table 6.1 I analyze the label and sensitive attribute conditional distributions for the above data splits. For the random split (A), the training and test label and conditional sensitive attribute distributions are identical, which is to be expected. For the three custom splits I observe that all three distributions P(Y), P(A|Y=0), P(A|Y=1) differ between training and test. I also note the label distribution becomes more skewed towards Y=0.
COMPAS Dataset. Compared to the Adult dataset, the COMPAS dataset is balanced in terms of label distribution; however, it is imbalanced in terms of the conditional distribution of the sensitive attribute. I will split the dataset along age, number of priors, and charge degree, i.e., whether the person committed a felony or a misdemeanor. The considered splits are as follows:
1. Source data: Younger than 45, Less than 3 prior convictions. Target data: Older than 45, More than 3 prior convictions.
2. Source data: Younger than 45, White, At least one prior conviction. Target data: Older than 45,
Non-white, No prior conviction.
3. Source data: Younger than 45, White, At least one prior conviction, convicted for a misdemeanor.
Target data: Older than 45, Non-white, No prior conviction, convicted for a felony.
The first split tests whether a young population with a limited number of convictions can be leveraged to fairly predict outcomes for an older population with more convictions. The second split introduces racial bias in the sampling process. In the third split I additionally consider the charge degree when splitting the dataset. For all splits, the test datasets become more imbalanced compared to the random split.
German Credit Dataset. This dataset is the smallest of the three considered. For splitting, I consider credit history and employment history. Similar to the Adult dataset, the label distribution is skewed towards increased risk, i.e., P(Y=0) = 0.7, and individuals of low risk are also skewed towards being part of the privileged group, i.e., P(A=1|Y=1) = 0.63. I consider the following splits:
1. Source data: Employed up to 4 years. Target data: Employed long term (4+ years).
2. Source data: Up-to-date credit history, Employed less than 4 years. Target data: Unpaid credit, Long term employed.
3. Source data: Delayed or paid credit, Employed up to 4 years. Target data: Critical account condition, Long term employment.
Compared to random data splits, the custom splits reduce label and sensitive attribute imbalance in
the source domain, and increase these in the target domain.
6.5.2 Fairness Metrics
There exist a multitude of criteria developed for evaluating algorithmic fairness [118]. In the context of datasets presenting a privileged and an unprivileged group, these metrics rely on ensuring predictive parity between the two groups under different constraints. The most common fairness metric employed is demographic parity (DP), P(Ŷ=1|A=0) = P(Ŷ=1|A=1), which is optimized when the predicted label probability is identical across the two groups. However, DP only ensures similar representation between the two groups, while ignoring the actual label distribution. Equal opportunity (EO) [61] conditions the fairness value on the true label Y, and is optimized when P(Ŷ=1|A=0,Y=1) = P(Ŷ=1|A=1,Y=1). EO is preferred when the label distribution is different across privilege classes, i.e., P(Y|A=0) ≠ P(Y|A=1). A more constrained fairness metric is averaged odds (AO), which is minimized when outcomes are the same conditioned on both labels and sensitive attributes, i.e., P(Ŷ|A=0,Y=y) = P(Ŷ|A=1,Y=y), y ∈ {0,1}. EO is a special case of AO for the case where y = 1. Following the fairness literature, I will report the difference ∆ between the left-hand side and right-hand side of each of the above measures. Under this format, ∆ values close to 0 signify the model maintains fairness, while values close to 1 signify a lack of fairness. Tuning a model to optimize fairness may incur accuracy trade-offs [114, 83, 195]. For example, a classifier which predicts every element to be part of the same group, e.g., P(Ŷ=0) = 1, will obtain ∆DP = ∆EO = ∆AO = 0, without providing informative predictions. My approach has the advantage that the regularizers of the three employed losses L_CE, L_fair, L_swd can be tuned in accordance with the importance of accuracy against fairness for a specific task.
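The three gaps can be computed directly from predictions, labels, and group membership. Below is a minimal sketch; the function names are mine, and I take ∆AO as the mean of the per-label gaps, which is one common convention.

```python
import numpy as np

def rate(yhat, mask):
    """P(yhat = 1) restricted to the boolean mask."""
    return yhat[mask].mean() if mask.any() else float("nan")

def fairness_gaps(yhat, y, a):
    """Demographic parity, equal opportunity, and averaged odds gaps.

    yhat, y, a are 0/1 numpy arrays. The AO gap is taken here as the
    mean of the per-label gaps for Y=1 and Y=0.
    """
    dp = abs(rate(yhat, a == 0) - rate(yhat, a == 1))
    eo = abs(rate(yhat, (a == 0) & (y == 1)) - rate(yhat, (a == 1) & (y == 1)))
    gap_y0 = abs(rate(yhat, (a == 0) & (y == 0)) - rate(yhat, (a == 1) & (y == 0)))
    ao = 0.5 * (eo + gap_y0)
    return dp, eo, ao

# A degenerate predictor (all zeros) trivially zeroes every gap,
# which is why such models must be excluded during model selection.
y = np.array([1, 0, 1, 0, 1, 0])
a = np.array([0, 0, 0, 1, 1, 1])
zero = np.zeros(6, dtype=int)
```

Note that a perfect predictor can still have a nonzero ∆DP when base rates differ across groups, which is exactly why EO and AO condition on the true label.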
6.5.3 Parameter Tuning and Implementation

Implementation of my approach is done using the PyTorch [127] deep learning library. I model my encoder e_u as a one-layer neural network with output space z ∈ R^20. The classifiers g and h are also one-layer networks with output space R^2. I train my model for 45,000 iterations, where the first 30,000 iterations only involve source training. For the first 15,000 I only perform minimization of the binary cross entropy loss L_bce. I introduce source fairness training at iteration 15,000, and train the fair model, i.e., with respect to both L_bce and L_fair, for 15,000 more iterations. In the last 15,000 iterations I perform adaptation, where I optimize L_bce and L_fair on the source domain, L_fair on the target domain, and L_swd between the source and target embeddings e_u((x^s, a^s)) and e_u((x^t, a^t)) respectively. I use a learning rate of 1e-4 for L_bce and L_fair, and a learning rate of 1e-5 for L_swd. Model selection is done by considering the difference between accuracy on the validation set and demographic parity on the test set. Given that equalized odds and averaged odds require access to the underlying labels on the test set, I cannot use these metrics for model selection. Additionally, models corresponding to degenerate predictions, i.e., test set predicted labels being either all 0s or all 1s, are not included in result reporting.
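The selection criterion above, including the exclusion of degenerate predictors, can be written compactly. This is a sketch with illustrative candidate tuples, not the exact selection code.

```python
def select_model(candidates):
    """Pick the candidate maximizing validation accuracy minus the
    demographic parity gap, skipping degenerate predictors whose
    test predictions are all one class. Each candidate is a tuple
    (name, val_acc, dp_gap, test_preds); the names are illustrative.
    """
    best_name, best_score = None, float("-inf")
    for name, val_acc, dp_gap, preds in candidates:
        if len(set(preds)) < 2:   # all 0s or all 1s: excluded
            continue
        score = val_acc - dp_gap
        if score > best_score:
            best_name, best_score = name, score
    return best_name

chosen = select_model([
    ("all_zeros", 0.70, 0.00, [0, 0, 0, 0]),   # degenerate, skipped
    ("fair",      0.62, 0.01, [0, 1, 0, 1]),   # score 0.61
    ("accurate",  0.68, 0.20, [1, 1, 0, 1]),   # score 0.48
])
```

The degenerate all-zeros model would win on raw score (0.70) if it were not filtered out, illustrating why the exclusion step matters.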
6.5.4 Other Methods

I evaluate my work against four popular fairness preserving algorithms that are part of the AIF360 [9] project: Meta-Algorithm for Fair Classification (MC) [20], Adversarial Debiasing (AD) [210], Reject Option Classification (ROC) [77], and Exponentiated Gradient Reduction (EGR) [1]. I additionally report as a baseline (Base) the version of my algorithm where I only minimize L_bce, without optimizing fairness or minimizing distributional distance. This corresponds to the performance of a naive source trained classifier.
6.5.5 Results

I present results for three challenging data splits for each of the considered datasets. I report balanced accuracy (Acc.), demographic parity (∆DP), equalized odds (∆EO), and averaged odds (∆AO). Desirable accuracy values are close to 1, while desirable fairness metric values should be close to 0. Results are averaged over 7 runs. Unless otherwise specified, I use sex as the sensitive attribute A, as it is common to all datasets.
Adult dataset. I report results on the Adult dataset in Table 6.2. For each of the custom splits, I report accuracy and fairness metrics. On the first split, MC obtains the highest accuracy of 0.68, and AD obtains an accuracy of 0.63. Both are higher than my method; however, these methods do not maintain fairness, as
Alg. Race, Education Race, Age Age, Education
Acc. ∆ DP ∆ EO ∆ AO Acc. ∆ DP ∆ EO ∆ AO Acc. ∆ DP ∆ EO ∆ AO
Base 0.63 0.34 0.53 0.42 0.60 0.24 0.25 0.24 0.59 0.90 0.92 0.91
MC 0.68 0.28 0.32 0.28 0.63 0.05 0.26 0.15 0.63 1.00 1.00 1.00
AD 0.63 0.21 0.33 0.25 0.60 0.25 0.25 0.25 0.51 0.16 0.15 0.16
ROC 0.59 0.34 0.25 0.31 0.62 0.02 0.20 0.11 0.50 0.00 0.00 0.00
EGR 0.62 0.06 0.16 0.10 0.59 0.02 0.19 0.11 0.56 0.43 0.40 0.42
Ours 0.62 0.01 0.05 0.01 0.62 0.00 0.19 0.10 0.52 0.01 0.06 0.03
Table 6.2: Performance results for the three splits of the Adult dataset
can be seen from the large ∆DP, ∆EO, and ∆AO values. On the remaining tasks my method is able to maintain fairness after adaptation, while being competitive in terms of accuracy. This shows that previous fairness preserving classifiers struggle with domain shift between the source and target domains, while my method is positioned to overcome these issues.
Alg. Age, Priors Race, Age, Priors Race, Age, Prrs., Chrg.
Acc. ∆ DP ∆ EO ∆ AO Acc. ∆ DP ∆ EO ∆ AO Acc. ∆ DP ∆ EO ∆ AO
Base 0.54 0.29 0.27 0.28 0.49 0.33 0.56 0.43 0.57 1.00 1.00 1.00
MC 0.58 0.33 0.36 0.33 0.50 0.00 0.00 0.00 0.50 0.19 0.19 0.19
AD 0.52 0.62 0.73 0.66 0.47 0.70 0.72 0.70 0.44 0.87 0.88 0.87
ROC 0.53 0.28 0.09 0.21 0.50 0.00 0.00 0.00 0.55 0.47 0.38 0.44
EGR 0.49 0.05 0.10 0.06 0.53 0.27 0.34 0.26 0.50 0.00 0.00 0.00
Ours 0.53 0.00 0.05 0.02 0.65 0.15 0.17 0.19 0.50 1.00 1.00 1.00
Table 6.3: Performance results for the three splits of the COMPAS dataset
COMPAS dataset. The COMPAS dataset results are reported in Table 6.3. On the first data split the MC method achieves the best accuracy; however, none of the methods I compare against besides EGR are able to preserve fairness. I am able to obtain higher accuracy than EGR while also obtaining improved fairness scores. On the second data split my method achieves the highest accuracy and also the lowest fairness scores amongst the considered methods. EGR, AD, and Base are not able to maintain fairness, while MC and ROC provide degenerate results. On the last data split the best results in terms of fairness correspond to MC, and the best accuracy out of the fairness preserving algorithms is provided by ROC. Other approaches, ours included, provide degenerate solutions. This split has the smallest size amongst all considered splits. My method relies on minimizing a probabilistic distributional metric for adaptation, a task which becomes harder in extremely low data regimes without additional assumptions or regularizers.
Alg. Employment Credit hist., Empl. Credit hist., Empl.
Acc. ∆ DP ∆ EO ∆ AO Acc. ∆ DP ∆ EO ∆ AO Acc. ∆ DP ∆ EO ∆ AO
Base 0.67 0.09 0.05 0.07 0.58 0.07 0.10 0.06 0.56 0.35 0.35 0.32
MC 0.67 0.06 0.12 0.03 0.56 0.15 0.34 0.22 0.55 0.30 0.34 0.30
AD 0.52 0.53 0.58 0.55 0.53 0.40 0.56 0.46 0.52 0.44 0.52 0.46
ROC 0.50 0.00 0.00 0.00 0.50 0.00 0.00 0.00 0.62 0.13 0.09 0.11
EGR 0.57 0.22 0.36 0.27 0.50 0.43 0.44 0.43 0.50 0.01 0.00 0.00
Ours 0.62 0.01 0.05 0.02 0.58 0.02 0.01 0.01 0.55 0.01 0.02 0.01
Table 6.4: Performance results for the three splits of the German dataset
German dataset. I report results on the German dataset in Table 6.4. On the first data split the best accuracy is obtained by MC. My method obtains the second best accuracy, with improved fairness performance. On the second data split my method obtains the highest accuracy and the best fairness performance across all fairness metrics. On the last data split I obtain the best fairness performance out of the considered methods. EGR does not provide informative predictions, MC and AD do not maintain fairness, and ROC improves accuracy at a fairness trade-off.
6.5.6 Ablative Experiments
I visualize source-target domain shift by generating 2D embeddings of the source and target feature spaces corresponding to different splits of the data. For this task, I employ the UMAP [115] visualization tool. In Figure 6.2 I compare the source and target features resulting from a random split of the Adult dataset to my first custom split, i.e., by race and education. In the case of randomly splitting the dataset, I observe that source and target samples visually share a similar feature space. However, in the case of the custom split, there is significant discrepancy between the two distributions, which affects model generalization. This observation is in line with my numerical results, suggesting that when domain shift is present, model generalization and fairness transfer become more difficult.
Figure 6.2: UMAP embeddings of the source and target feature spaces for random and custom splits of the
Adult dataset
Alg. Adult German COMPAS
Acc. ∆ DP ∆ EO ∆ AO Acc. ∆ DP ∆ EO ∆ AO Acc. ∆ DP ∆ EO ∆ AO
Base 0.74 0.35 0.30 0.28 0.64 0.03 0.16 0.05 0.68 0.22 0.27 0.18
MC 0.71 0.13 0.09 0.08 0.63 0.22 0.18 0.20 0.65 0.22 0.22 0.20
AD 0.67 0.11 0.13 0.08 0.53 0.35 0.46 0.38 0.62 0.28 0.25 0.29
ROC 0.71 0.05 0.01 0.01 0.55 0.11 0.04 0.09 0.52 0.02 0.03 0.02
EGR 0.65 0.06 0.02 0.01 0.51 0.01 0.04 0.02 0.63 0.02 0.02 0.02
Ours 0.70 0.00 0.07 0.08 0.64 0.00 0.05 0.01 0.65 0.00 0.02 0.03
Table 6.5: Results for random data splits.
I consider results obtained for random splits on the Adult, German, and COMPAS datasets, detailed in Table 6.5. These splits correspond to standard experiments intended to verify whether a fair model will translate fairness benefits to the test data. Under this scenario, the source and target features are sampled from the same distribution, and there is no domain shift. I observe the baseline approach is able to obtain the highest accuracy across datasets; however, it does not provide any fairness guarantees. Out of the fairness preserving methods, my method is able to match or nearly match the best accuracy, while providing perfect demographic parity. Even though I do not perform model selection based on the other two fairness metrics due to not having access to the target labels, I am able to match the best averaged odds on the German dataset, and the best equalized odds on the COMPAS dataset. This set of results shows that even in scenarios where domain shift is not of concern, my algorithm successfully learns a competitive fair model.
I additionally investigate the impact of the different components of my approach on the numerical results. In Table 6.6 I compare performance on the COMPAS dataset for four variants of my algorithm: (1) Base, similar to the main experiments, where no fairness or distributional minimization metric is used, just minimization of L_bce on the source domain samples; (2) SWD, only minimizing L_swd; (3) Fair, only training with respect to L_fair on the source and target domains; (4) my complete approach using all fairness and adaptation objectives. Note L_bce is still used in cases (2) and (3). On the third data split, results for all three variants are inconclusive either in terms of accuracy (Base) or fairness. On the first and second splits, utilizing all losses leads to the best performance. On the first split, the Fair only model is able to achieve competitive fairness results at the cost of accuracy. The SWD only approach achieves better accuracy but at the cost of fairness. Combining the two losses leads to improved accuracy over the Fair only model, and also improved fairness. Due to L_swd being minimized at the encoder output space, both the classifier and the fairness head benefit from a shared source-target feature space. On the second split I observe the SWD only model has the poorest performance, and the Fair only and combined models have similar fairness performance, with the combined model obtaining higher accuracy. This signifies the adversarial fair training process can act as a proxy task during training, improving model generalization.
Alg. Age, Priors Race, Age, Priors Race, Age, Prrs., Chrg.
Acc. ∆ DP ∆ EO ∆ AO Acc. ∆ DP ∆ EO ∆ AO Acc. ∆ DP ∆ EO ∆ AO
Base 0.68 0.22 0.27 0.18 0.54 0.29 0.27 0.28 0.49 0.33 0.56 0.43
SWD 0.56 0.29 0.38 0.32 0.45 0.44 0.33 0.40 0.55 1.00 1.00 1.00
Fair 0.50 0.01 0.08 0.04 0.64 0.15 0.17 0.19 0.41 1.00 1.00 1.00
Ours 0.53 0.00 0.05 0.02 0.65 0.15 0.17 0.19 0.50 1.00 1.00 1.00
Table 6.6: Results when selectively using a subset of losses on the COMPAS dataset
Alg. Empl. Credit hist., Empl. Credit hist., Empl.
Acc. ∆ DP ∆ EO ∆ AO Acc. ∆ DP ∆ EO ∆ AO Acc. ∆ DP ∆ EO ∆ AO
Base 0.63 0.46 0.33 0.40 0.61 0.19 0.15 0.16 0.58 0.23 0.25 0.19
MC 0.59 0.32 0.29 0.30 0.67 0.35 0.18 0.27 0.61 0.12 0.18 0.11
AD 0.51 0.54 0.61 0.55 0.50 0.53 0.52 0.54 0.52 0.50 0.60 0.53
ROC 0.54 0.09 0.02 0.07 0.51 0.05 0.03 0.04 0.59 0.24 0.18 0.21
EGR 0.50 0.03 0.03 0.03 0.56 0.12 0.23 0.16 0.50 0.02 0.01 0.01
Ours 0.62 0.02 0.09 0.02 0.59 0.02 0.16 0.06 0.62 0.02 0.18 0.04
Table 6.7: Results on the German dataset when optimizing fairness metrics with respect to the age sensitive attribute

Figure 6.3: Learning behavior when performing end-to-end training using both L_fair and L_swd (top) and when only using L_fair (bottom)

I further investigate the different components present in my algorithm. In Figure 6.3 I analyze the training and adaptation process with respect to target accuracy, validation accuracy, demographic parity on the source domain, and demographic parity on the target domain. Performance plots are reported for the Adult dataset. I compare two scenarios: running the algorithm when L_swd is not enforced (bottom), and running the algorithm using both fairness and domain alignment (top). For the first 30,000 iterations I only perform source training, where the first half of the iterations is spent optimizing L_bce, and the second half is spent jointly optimizing L_bce and the source fairness objective. I note that once optimization with respect to L_fair starts, demographic parity decreases until adaptation starts, i.e., between iterations 15,000-30,000. The validation accuracy in this interval also slightly decreases, as improving fairness may affect accuracy performance. During adaptation, i.e., after iteration 30,000, I observe that in the scenario where I use L_swd, the target accuracy increases, while demographic parity on both source and target domains remains relatively unchanged. In the scenario where no optimization of L_swd is performed, there is still improvement with respect to target accuracy; however, target demographic parity becomes on average larger. This implies that the distributional alignment loss applied at the output of the encoder has beneficial effects both for the classification and the fairness objectives.
6.6 Remarks

Fairness preserving methods have historically ignored the problem of domain shift when deploying a source trained model to a target domain. My first contribution is providing different data splits for popular datasets employed in fairness tasks which present significant domain shift between the source and the target domains. I also show that given these domain shifts, popular fairness preserving algorithms are not able to match the performance they observe on random data splits. This observation justifies that my proposed research plan is necessary to maintain model fairness under domain shift. Second, I present a novel algorithm that addresses domain shift when a fair outcome is of concern by combining the following techniques: (1) fair model training via adversarially generating a sensitive attribute-independent latent feature space; (2) producing a shared latent feature space for the source and target domains by minimizing an appropriate probability distance metric between the source and target embedding distributions. I have shown that combining these two ideas ensures a fair model is capable of generalizing to a target domain while maintaining its fairness preserving properties.
Chapter 7
Conclusion and Future Work
In this chapter, I provide a review of the approaches presented so far and discuss possible future extensions.
Street Semantic Segmentation. The semantic segmentation of street images involves assigning each
pixel of a multi-channel image an appropriate categorical label. Algorithms designed for street image
semantic segmentation employ large deep neural networks, which are not suitable in scenarios where
the input image distribution shifts. UDA attempts to address this problem, however, most approaches
consider joint access to both source and target data. This limitation makes such algorithms unsuitable for
scenarios where these two datasets need to be maintained on different data stores. I develop a source-
free UDA algorithm for street image segmentation. To address the absence of the source distribution at
adaptation time, I learn a GMM that aims to approximate the embedding distribution of the source data.
This is learned at training time and used during adaptation as a surrogate for source data access. I achieve
domain alignment by distributional distance metric minimization between the learned GMM distribution
and the target embedding distribution. My metric of choice is the Sliced Wasserstein Distance, benefiting
from the properties of optimal transport while allowing for gradient-based optimization. I demonstrate
the performance of my algorithm on two adaptation tasks, where the source domain comprises computer-
generated images and the target domain has real street images. Compared to recent UDA approaches, my
algorithm shows competitive performance while also keeping the source and target domains separate.
Medical Image Segmentation. Next, I consider the problem of domain adaptation in the medical
field. Compared to street images, a large portion of medical images is constituted by the background class.
Additionally, the pixel values are represented by intensities generated from CT or MRI machines, adding
an additional layer of complexity to processing medical images. Compared to other domains, most datasets
in the medical field are difficult to access due to patient data privacy, often tied to the health organization
in which they were generated. In the context of UDA, this makes a large portion of approaches incompat-
ible with the practical privacy requirements. I develop a source-free UDA algorithm for the segmentation
of medical images that builds on my approach for street image segmentation. During source training, I
propose an approximation of the source feature embedding distribution by learning a GMM. I show that
improving the representation power of the GMM model by allocating more Gaussians for each semantic
class leads to improved overall performance. I again use the SWD to reduce the distributional discrep-
ancy between source and target domains. Following literature, I consider adaptation scenarios where the
source domain is represented by MRI images and the target domain is represented by CT images. On an
organ segmentation dataset, my approach achieves competitive performance with UDA methods perform-
ing joint adaptation. On a cardiac dataset, my approach outperforms all other considered methods, both
performing joint or source-free adaptation.
Multi-Source Adaptation. I also address the setting of multi-source unsupervised adaptation. Compared to settings where a single source domain is present, MUDA considers scenarios where an arbitrary
number of source data domains are made available, with each of these domains having a potentially differ-
ent distribution of data. MUDA approaches have the goal of leveraging the source domains’ information
in order to learn a model capable of classifying data on an unannotated target domain. I consider a strict
privacy setting for MUDA based on federated learning, where different domains are not able to commu-
nicate data between them. Most MUDA approaches consider joint access to data across domains, making
them unsuitable for my considered setting. Recent source-free MUDA approaches have been proposed,
however, they rely on training a model on each domain and then pooling the models together during the
adaptation phase. While such approaches address some of the privacy concerns, they require
rerunning the adaptation phase in online scenarios where existing source domains become unavailable
or new source domains become available. I propose a source-free MUDA approach that confers
data privacy while also allowing for fast distributed optimization. My method relies on performing
source-free single-source UDA (SUDA) between each source domain and the target. This allows for distributed
computation, and introducing or removing source domains incurs a reduced computational cost. To leverage all
source models for target inference, I develop a confidence-based mixing metric and show competitive
performance with methods that employ the source models jointly during the adaptation phase.
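As an illustration of such a mixing scheme (not the exact metric developed in this work), the sketch below weights each source model's predictive distribution on a target batch by its mean maximum softmax probability; all names are hypothetical:

```python
import numpy as np

def mix_source_predictions(per_source_probs):
    """Combine class-probability predictions from several source-specific models.
    Illustrative scheme: weight each model by its average peak softmax
    probability on the target batch, then average the distributions."""
    per_source_probs = [np.asarray(p) for p in per_source_probs]
    # Confidence of each source model on this target batch.
    confidences = np.array([p.max(axis=1).mean() for p in per_source_probs])
    weights = confidences / confidences.sum()
    # Weighted average of the per-source predictive distributions.
    mixed = sum(w * p for w, p in zip(weights, per_source_probs))
    return mixed, weights
```

Because the weights depend only on each model's own target-batch outputs, adding or dropping a source domain requires no joint re-adaptation, matching the distributed setting described above.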
Fair Classification. The wide use of AI approaches in classification problems has made decision algorithms
responsible for many real-world outcomes with a direct impact on everyday life. Examples
include credit card approvals, resume selection for job applications, and even the deployment of public
enforcement agents. However, most models translate the bias present in their training data into their predictions.
To address this issue, higher quality training data can be sourced; however, unbiased samples may be
difficult or expensive to procure. Another avenue for bias mitigation involves developing metrics conducive
to fairness and including them in the optimization process. While such approaches have proven valuable
in mitigating test-time bias, most fairness-preserving algorithms assume training and deployment data are
sampled from a shared distribution. I show that, even on commonly used fairness datasets, biased data
selection can degrade model performance both in terms of accuracy and fairness of results.
Additionally, I propose a domain-shift-aware fairness-preserving algorithm that combines ideas from the
fairness and domain adaptation literature. I demonstrate that, compared to recent algorithms, my approach
is able to mitigate domain shift while maintaining desirable fairness statistics.
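One concrete fairness statistic of the kind referenced above is the demographic parity gap. A minimal sketch (NumPy, illustrative names) of a statistic that could be monitored during training, or relaxed into a differentiable penalty and added to the loss:

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between two protected
    groups (encoded 0/1). A gap of zero means both groups receive positive
    predictions at the same rate."""
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    rate_a = y_pred[group == 0].mean()  # positive rate for group 0
    rate_b = y_pred[group == 1].mean()  # positive rate for group 1
    return abs(rate_a - rate_b)
```

Note that, like any statistic estimated on training data, this gap is only meaningful at deployment time if training and deployment data share a distribution, which is precisely the assumption the domain-shift-aware approach relaxes.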
7.1 Future Directions
The current manuscript presents work in UDA applicable to image segmentation, multi-source classi-
fication, and fair classification. The foundation provided in this work makes the following extensions
attractive:
Multi Source Segmentation. I have considered source-free semantic segmentation when only one
source domain is available. However, in practical settings, several data streams may be available as source
domains. For example, in the case of street image segmentation, these could be images pooled from
different cities or different countries.
Private Open Set Adaptation. The adaptation problems considered so far involve generalizing a
model to the target domain, assuming the existence of a distributional discrepancy between the source(s)
and the target. However, this setting assumes that the classes present in the source domain are identical
to the classes present in the target domain. In open set adaptation [126], the classes present in the source
domain do not necessarily match the classes present in the target domain. Open set adaptation has been
explored in joint training settings, and the algorithms presented in the current manuscript may prove effective
in solving this problem in a privacy-centric setting.
Private Fair Adaptation. Fair classification is an active area of research designed to mitigate model
biases against under-represented groups in the data. I have presented an approach to fair classification
that additionally addresses domain shift in the input data. However, my approach currently assumes joint
access to both the source and target data. While the data format and distribution differ from the segmentation
and classification problems seen so far, as tabular data is sparser, private fair UDA may
be achievable by using a distributional approximation technique similar to the ones used in the current
manuscript.
Bibliography
[1] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudik, John Langford, and Hanna Wallach. “A
Reductions Approach to Fair Classification”. In: Proceedings of the 35th International Conference on
Machine Learning. Ed. by Jennifer Dy and Andreas Krause. Vol. 80. Proceedings of Machine
Learning Research. PMLR, 2018, pp. 60–69. url:
https://proceedings.mlr.press/v80/agarwal18a.html.
[2] Sk Miraj Ahmed, Dripta S. Raychaudhuri, Sujoy Paul, Samet Oymak, and
Amit K. Roy-Chowdhury. Unsupervised Multi-source Domain Adaptation Without Access to Source
Data. 2021. arXiv: 2104.01845 [cs.LG].
[3] Kartik Ahuja, Karthikeyan Shanmugam, Kush Varshney, and Amit Dhurandhar. “Invariant risk
minimization games”. In: International Conference on Machine Learning. PMLR. 2020, pp. 145–155.
[4] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant Risk
Minimization. 2019. doi: 10.48550/ARXIV.1907.02893.
[5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. “Wasserstein GAN”. In: arXiv preprint
arXiv:1701.07875 (2017).
[6] Yogesh Balaji, Rama Chellappa, and Soheil Feizi. “Robust Optimal Transport with Applications in
Generative Modeling and Domain Adaptation”. In: Advances in Neural Information Processing
Systems. Ed. by H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin. Vol. 33. Curran
Associates, Inc., 2020, pp. 12934–12944. url:
https://proceedings.neurips.cc/paper/2020/file/9719a00ed0c5709d80dfef33795dcef3-Paper.pdf.
[7] Mathilde Bateson, Hoel Kervadec, Jose Dolz, Hervé Lombaert, and Ismail Ben Ayed. “Source-free
domain adaptation for image segmentation”. In: Medical Image Analysis (2022), p. 102617.
[8] Mathilde Bateson, Hoel Kervadec, Jose Dolz, Hervé Lombaert, and Ismail Ben Ayed.
“Source-Relaxed Domain Adaptation for Image Segmentation”. In: Medical Image Computing and
Computer Assisted Intervention – MICCAI 2020. Cham: Springer International Publishing, 2020,
pp. 490–499.
[9] Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde,
Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic,
Seema Nagar, Karthikeyan Natesan Ramamurthy, John Richards, Diptikalyan Saha,
Prasanna Sattigeri, Moninder Singh, Kush R. Varshney, and Yunfeng Zhang. AI Fairness 360: An
Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias. 2018.
url: https://arxiv.org/abs/1810.01943.
[10] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and
Jennifer Wortman Vaughan. “A theory of learning from different domains”. In: Machine learning
79.1 (2010), pp. 151–175.
[11] Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H Chi. “Data decisions and theoretical implications
when adversarially learning fair representations”. In: arXiv preprint arXiv:1707.00075 (2017).
[12] Bharath Bhushan Damodaran, Benjamin Kellenberger, Rémi Flamary, Devis Tuia, and
Nicolas Courty. “Deepjdot: Deep joint distribution optimal transport for unsupervised domain
adaptation”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018,
pp. 447–463.
[13] François Bolley, Arnaud Guillin, and Cédric Villani. “Quantitative concentration inequalities for
empirical measures on non-compact spaces”. In: Probability Theory and Related Fields 137.3-4
(2007), pp. 541–593.
[14] Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. “Sliced and Radon
Wasserstein barycenters of measures”. In: Journal of Mathematical Imaging and Vision 51.1 (2015),
pp. 22–45.
[15] Karsten M. Borgwardt, Arthur Gretton, Malte J. Rasch, Hans-Peter Kriegel, Bernhard Schölkopf,
and Alex J. Smola. “Integrating structured biological data by Kernel Maximum Mean
Discrepancy”. In: Bioinformatics 22.14 (July 2006), e49–e57. issn: 1367-4803. doi:
10.1093/bioinformatics/btl242. eprint:
https://academic.oup.com/bioinformatics/article-pdf/22/14/e49/616383/btl242.pdf.
[16] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan.
“Unsupervised pixel-level domain adaptation with generative adversarial networks”. In:
Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 3722–3731.
[17] Joy Buolamwini and Timnit Gebru. “Gender shades: Intersectional accuracy disparities in
commercial gender classification”. In: Conference on fairness, accountability and transparency.
PMLR. 2018, pp. 77–91.
[18] Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. “Semantics derived automatically from
language corpora contain human-like biases”. In: Science 356.6334 (2017), pp. 183–186.
[19] L Elisa Celis, Lingxiao Huang, Vijay Keswani, and Nisheeth K Vishnoi. “Classification with
fairness constraints: A meta-algorithm with provable guarantees”. In: Proceedings of the
conference on fairness, accountability, and transparency. 2019, pp. 319–328.
[20] L. Elisa Celis, Lingxiao Huang, Vijay Keswani, and Nisheeth K. Vishnoi. “Classification with
Fairness Constraints: A Meta-Algorithm with Provable Guarantees”. In: Proceedings of the
Conference on Fairness, Accountability, and Transparency. FAT* ’19. Atlanta, GA, USA: Association
for Computing Machinery, 2019, pp. 319–328. isbn: 9781450361255. doi: 10.1145/3287560.3287586.
[21] C. Chen, Q. Dou, H. Chen, J. Qin, and P. A. Heng. “Unsupervised Bidirectional Cross-Modality
Adaptation via Deeply Synergistic Image and Feature Alignment for Medical Image
Segmentation”. In: IEEE Transactions on Medical Imaging 39.7 (2020), pp. 2494–2505. doi:
10.1109/TMI.2020.2972701.
[22] Chaoqi Chen, Weiping Xie, Wenbing Huang, Yu Rong, Xinghao Ding, Yue Huang, Tingyang Xu,
and Junzhou Huang. “Progressive feature alignment for unsupervised domain adaptation”. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 627–636.
[23] Cheng Chen, Qi Dou, Hao Chen, and Pheng-Ann Heng. “Semantic-aware generative adversarial
nets for unsupervised domain adaptation in chest x-ray segmentation”. In: International workshop
on machine learning in medical imaging. Springer. 2018, pp. 143–151.
[24] Cheng Chen, Qi Dou, Hao Chen, Jing Qin, and Pheng-Ann Heng. “Synergistic Image and Feature
Adaptation: Towards Cross-Modality Domain Adaptation for Medical Image Segmentation”. In:
Proceedings of The Thirty-Third Conference on Artificial Intelligence (AAAI). 2019, pp. 865–872.
[25] Yi-Hsin Chen, Wei-Yu Chen, Yu-Ting Chen, Bo-Cheng Tsai, Yu-Chiang Frank Wang, and
Min Sun. “No More Discrimination: Cross City Adaptation of Road Scene Segmenters”. In:
arXiv:1704.08509 [cs.CV]. 2017.
[26] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille.
“Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and
fully connected crfs”. In: IEEE transactions on pattern analysis and machine intelligence 40.4 (2017),
pp. 834–848.
[27] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. “Rethinking Atrous
Convolution for Semantic Image Segmentation”. In: arXiv:1706.05587 [cs.CV]. 2017. arXiv:
1706.05587 [cs.CV].
[28] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam.
“Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation”. In:
ECCV. 2018.
[29] Xu Chen, Chunfeng Lian, Li Wang, Hannah Deng, Tianshu Kuang, Steve Fung, Jaime Gateno,
Pew-Thian Yap, James J Xia, and Dinggang Shen. “Anatomy-Regularized Representation Learning
for Cross-Modality Medical Image Segmentation”. In: IEEE Transactions on Medical Imaging 40.1
(2020), pp. 274–285.
[30] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool. “Learning semantic segmentation from
synthetic data: A geometrically guided input-output adaptation approach”. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 1841–1850.
[31] Jaehoon Choi, Taekyung Kim, and Changick Kim. “Self-ensembling with gan-based data
augmentation for domain adaptation in semantic segmentation”. In: Proceedings of the IEEE
international conference on computer vision. 2019, pp. 6830–6840.
[32] Jaehoon Choi, Taekyung Kim, and Changick Kim. “Self-ensembling with gan-based data
augmentation for domain adaptation in semantic segmentation”. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2019, pp. 6830–6840.
[33] Alexandra Chouldechova and Aaron Roth. “The Frontiers of Fairness in Machine Learning”. In:
arXiv preprint arXiv:1810.08810 (2018).
[34] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler,
Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. “The cityscapes dataset for
semantic urban scene understanding”. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. 2016, pp. 3213–3223.
[35] Amanda Coston, Karthikeyan Natesan Ramamurthy, Dennis Wei, Kush R Varshney,
Skyler Speakman, Zairah Mustahsan, and Supriyo Chakraborty. “Fair transfer learning with
missing protected attributes”. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and
Society. 2019, pp. 91–98.
[36] Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. “Optimal transport for
domain adaptation”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.9 (2016),
pp. 1853–1865.
[37] Shuhao Cui, Shuhui Wang, Junbao Zhuo, Chi Su, Qingming Huang, and Qi Tian. “Gradually
vanishing bridge for adversarial domain adaptation”. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. 2020, pp. 12455–12464.
[38] Hal Daumé III. “Frustratingly Easy Domain Adaptation”. In: Proceedings of the 45th Annual
Meeting of the Association of Computational Linguistics. 2007, pp. 256–263.
[39] Sofien Dhouib, Ievgen Redko, and Carole Lartizien. “Margin-aware Adversarial Domain
Adaptation with Optimal Transport”. In: Thirty-seventh International Conference on Machine
Learning. 2020.
[40] Ning Ding, Yixing Xu, Yehui Tang, Chao Xu, Yunhe Wang, and Dacheng Tao. “Source-Free
Domain Adaptation via Distribution Estimation”. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. 2022, pp. 7212–7222.
[41] Jiahua Dong, Zhen Fang, Anjin Liu, Gan Sun, and Tongliang Liu. “Confident anchor-induced
multi-source free domain adaptation”. In: Advances in Neural Information Processing Systems 34
(2021), pp. 2848–2860.
[42] Qi Dou, Cheng Ouyang, Cheng Chen, Hao Chen, Ben Glocker, Xiahai Zhuang, and
Pheng-Ann Heng. “Pnp-adanet: Plug-and-play adversarial domain adaptation network at
unpaired cross-modality cardiac segmentation”. In: IEEE Access 7 (2019), pp. 99065–99076.
[43] Mark Dredze and Koby Crammer. “Online methods for multi-domain learning and adaptation”.
In: Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Association for Computational Linguistics. 2008, pp. 689–697.
[44] K. Drossos, P. Magron, and T. Virtanen. “Unsupervised Adversarial Domain Adaptation Based on
The Wasserstein Distance For Acoustic Scene Classification”. In: 2019 IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics (WASPAA). 2019, pp. 259–263. doi:
10.1109/WASPAA.2019.8937231.
[45] Mengnan Du, Subhabrata Mukherjee, Guanchu Wang, Ruixiang Tang, Ahmed Awadallah, and
Xia Hu. “Fairness via Representation Neutralization”. In: Advances in Neural Information
Processing Systems 34 (2021).
[46] Dheeru Dua and Casey Graff. UCI Machine Learning Repository. 2017. url:
http://archive.ics.uci.edu/ml.
[47] Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and
Suresh Venkatasubramanian. “Certifying and removing disparate impact”. In: proceedings of the
21th ACM SIGKDD international conference on knowledge discovery and data mining. 2015,
pp. 259–268.
[48] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio.
“Learning with a Wasserstein loss”. In: Advances in Neural Information Processing Systems. 2015,
pp. 2053–2061.
[49] Alexander J Gabourie, Mohammad Rostami, Philip E Pope, Soheil Kolouri, and Kyungnam Kim.
“Learning a Domain-Invariant Embedding for Unsupervised Domain Adaptation Using
Class-Conditioned Distribution Alignment”. In: 2019 57th Annual Allerton Conference on
Communication, Control, and Computing (Allerton). IEEE. 2019, pp. 352–359.
[50] Yaroslav Ganin and Victor Lempitsky. “Unsupervised domain adaptation by backpropagation”. In:
International conference on machine learning. PMLR. 2015, pp. 1180–1189.
[51] Yaroslav Ganin and Victor Lempitsky. “Unsupervised domain adaptation by backpropagation”. In:
Proceedings of International Conference on Machine Learning. 2015, pp. 1180–1189.
[52] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. “Image Style Transfer Using
Convolutional Neural Networks”. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR). 2016.
[53] Behnam Gholami, Pritish Sahu, Ognjen Rudovic, Konstantinos Bousmalis, and Vladimir Pavlovic.
“Unsupervised Multi-Target Domain Adaptation: An Information Theoretic Approach”. In: IEEE
Transactions on Image Processing 29 (2020), pp. 3993–4002. doi: 10.1109/TIP.2019.2963389.
[54] B. Gong, Y. Shi, F. Sha, and K. Grauman. “Geodesic flow kernel for unsupervised domain
adaptation”. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE.
2012, pp. 2066–2073.
[55] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,
Aaron Courville, and Yoshua Bengio. “Generative adversarial nets”. In: Advances in Neural
Information Processing Systems. 2014, pp. 2672–2680.
[56] Y. Grandvalet and Y. Bengio. “Semi-supervised Learning by Entropy Minimization.” In: Advances
in Neural Information Processing Systems. Vol. 17. 2004, pp. 529–536.
[57] Hao Guan and Mingxia Liu. “Domain adaptation for medical image analysis: a survey”. In: IEEE
Transactions on Biomedical Engineering 69.3 (2021), pp. 1173–1185.
[58] Han Guo, Ramakanth Pasunuru, and Mohit Bansal. “Multi-source domain adaptation for text
classification via distancenet-bandits”. In: Proceedings of the AAAI Conference on Artificial
Intelligence. Vol. 34. 2020, pp. 7830–7838.
[59] Jiang Guo, Darsh Shah, and Regina Barzilay. “Multi-Source Domain Adaptation with Mixture of
Experts”. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing. 2018, pp. 4694–4703.
[60] Xiaoting Han, Lei Qi, Qian Yu, Ziqi Zhou, Yefeng Zheng, Yinghuan Shi, and Yang Gao. “Deep
symmetric adaptation network for cross-modality medical image segmentation”. In: IEEE
transactions on medical imaging 41.1 (2021), pp. 121–132.
[61] Moritz Hardt, Eric Price, and Nati Srebro. “Equality of opportunity in supervised learning”. In:
Advances in neural information processing systems 29 (2016).
[62] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image
recognition”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
2016, pp. 770–778.
[63] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros,
and Trevor Darrell. “CyCADA: Cycle-Consistent Adversarial Domain Adaptation”. In:
International Conference on Machine Learning. 2018, pp. 1989–1998.
[64] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros,
and Trevor Darrell. “Cycada: Cycle-consistent adversarial domain adaptation”. In: International
conference on machine learning. PMLR. 2018, pp. 1989–1998.
[65] Judy Hoffman, Dequan Wang, Fisher Yu, and Trevor Darrell. “FCNs in the Wild: Pixel-level
Adversarial and Constraint-based Adaptation”. In: arXiv:1612.02649 [cs.CV]. 2016.
[66] Hui Hu, Yijun Liu, Zhen Wang, and Chao Lan. “A distributed fair machine learning framework
with private demographic data protection”. In: 2019 IEEE International Conference on Data Mining
(ICDM). IEEE. 2019, pp. 1102–1107.
[67] Yi Huang, Xiaoshan Yang, Ji Zhang, and Changsheng Xu. “Relative Alignment Network for
Source-Free Multimodal Video Domain Adaptation”. In: Proceedings of the 30th ACM International
Conference on Multimedia. MM ’22. Lisboa, Portugal: Association for Computing Machinery, 2022,
pp. 1652–1660. isbn: 9781450392037. doi: 10.1145/3503161.3548009.
[68] Wei-Chih Hung, Yi-Hsuan Tsai, Yan-Ting Liou, Yen-Yu Lin, and Ming-Hsuan Yang. “Adversarial
learning for semi-supervised semantic segmentation”. In: arXiv preprint arXiv:1802.07934 (2018).
[69] Yuankai Huo, Zhoubing Xu, Shunxing Bao, Albert Assad, Richard G Abramson, and
Bennett A Landman. “Adversarial synthesis learning enables segmentation without target
modality ground truth”. In: 2018 IEEE 15th international symposium on biomedical imaging (ISBI
2018). IEEE. 2018, pp. 1217–1220.
[70] Yuankai Huo, Zhoubing Xu, Hyeonsoo Moon, Shunxing Bao, Albert Assad, Tamara K. Moyo,
Michael R. Savona, Richard G. Abramson, and Bennett A. Landman. “SynSeg-Net: Synthetic
Segmentation Without Target Modality Ground Truth”. In: IEEE Transactions on Medical Imaging
38.4 (2019), pp. 1016–1025. issn: 1558-254X. doi: 10.1109/tmi.2018.2876633.
[71] D. Isele, M. Rostami, and E. Eaton. “Using task features for zero-shot knowledge transfer in
lifelong learning”. In: Proc. of International Joint Conference on Artificial Intelligence. 2016,
pp. 1620–1626.
[72] V Jain and E Learned-Miller. “Online domain adaptation of a pre-trained cascade of classifiers”.
In: Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition. 2011,
pp. 577–584.
[73] Ray Jiang, Aldo Pacchiano, Tom Stepleton, Heinrich Jiang, and Silvia Chiappa. “Wasserstein fair
classification”. In: Uncertainty in Artificial Intelligence . PMLR. 2020, pp. 862–872.
[74] Jinyang Jiao, Hao Li, Tian Zhang, and Jing Lin. “Source-Free Adaptation Diagnosis for Rotating
Machinery”. In: IEEE Transactions on Industrial Informatics (2022), pp. 1–10. doi:
10.1109/TII.2022.3231414.
[75] Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Yizhou Yu, and Mingli Song. “Neural
style transfer: A review”. In: IEEE transactions on visualization and computer graphics 26.11 (2019),
pp. 3365–3385.
[76] Pritish Kamath, Akilesh Tangella, Danica Sutherland, and Nathan Srebro. “Does Invariant Risk
Minimization Capture Invariance?” In: Proceedings of The 24th International Conference on
Artificial Intelligence and Statistics. Ed. by Arindam Banerjee and Kenji Fukumizu. Vol. 130.
Proceedings of Machine Learning Research. PMLR, 2021, pp. 4069–4077. url:
https://proceedings.mlr.press/v130/kamath21a.html.
[77] Faisal Kamiran, Asim Karim, and Xiangliang Zhang. “Decision Theory for Discrimination-Aware
Classification”. In: 2012 IEEE 12th International Conference on Data Mining. 2012, pp. 924–929. doi:
10.1109/ICDM.2012.45.
[78] Konstantinos Kamnitsas, Christian Baumgartner, Christian Ledig, Virginia Newcombe,
Joanna Simpson, Andrew Kane, David Menon, Aditya Nori, Antonio Criminisi, Daniel Rueckert,
et al. “Unsupervised domain adaptation in brain lesion segmentation with adversarial networks”.
In: International conference on information processing in medical imaging. Springer. 2017,
pp. 597–609.
[79] Konstantinos Kamnitsas, Christian Baumgartner, Christian Ledig, Virginia Newcombe,
Joanna Simpson, Andrew Kane, David Menon, Aditya Nori, Antonio Criminisi, Daniel Rueckert,
and Ben Glocker. “Unsupervised Domain Adaptation in Brain Lesion Segmentation with
Adversarial Networks”. In: Information Processing in Medical Imaging. Ed. by Marc Niethammer,
Martin Styner, Stephen Aylward, Hongtu Zhu, Ipek Oguz, Pew-Thian Yap, and Dinggang Shen.
Cham: Springer International Publishing, 2017, pp. 597–609.
[80] Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G Hauptmann. “Contrastive adaptation network
for unsupervised domain adaptation”. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition. 2019, pp. 4893–4902.
[81] Ali Emre Kavur, M Alper Selver, Oguz Dicle, Mustafa Barıs, and N Sinem Gezer.
CHAOS-Combined (CT-MR) Healthy Abdominal Organ Segmentation Challenge Data.
Version v1.03. 2019. doi: 10.5281/zenodo.3362844.
[82] Youngeun Kim, Donghyeon Cho, Kyeongtak Han, Priyadarshini Panda, and Sungeun Hong.
“Domain adaptation without source data”. In: IEEE Transactions on Artificial Intelligence 2.6
(2021), pp. 508–518.
[83] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. “Inherent trade-offs in the fair
determination of risk scores”. In: arXiv preprint arXiv:1609.05807 (2016).
[84] Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. “Generalized
sliced wasserstein distances”. In: Advances in Neural Information Processing Systems. Vol. 32. 2019,
pp. 261–272.
[85] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep
convolutional neural networks”. In: Communications of the ACM 60.6 (2017), pp. 84–90.
[86] David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas,
Dinghuai Zhang, Remi Le Priol, and Aaron Courville. Out-of-Distribution Generalization via Risk
Extrapolation (REx). 2020. doi: 10.48550/ARXIV.2003.00688.
[87] Jogendra Nath Kundu, Akshay Kulkarni, Amit Singh, Varun Jampani, and R. Venkatesh Babu.
“Generalize Then Adapt: Source-Free Domain Adaptive Semantic Segmentation”. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision (ICCV). 2021, pp. 7046–7056.
[88] Jogendra Nath Kundu, Naveen Venkat, Rahul M V, and R. Venkatesh Babu. “Universal
Source-Free Domain Adaptation”. In: 2020.
[89] Jogendra Nath Kundu, Naveen Venkat, Rahul M V, and R. Venkatesh Babu. “Universal
Source-Free Domain Adaptation”. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR). 2020.
[90] Vinod K Kurmi, Venkatesh K Subramanian, and Vinay P Namboodiri. “Domain impression: A
source data free domain adaptation method”. In: Proceedings of the IEEE/CVF Winter Conference on
Applications of Computer Vision. 2021, pp. 615–625.
[91] Bennett Landman, Z Xu, JE Igelsias, M Styner, TR Langerak, and A Klein. Multi-atlas labeling
beyond the cranial vault-workshop and challenge. 2015. doi: 10.7303/syn3193805.
[92] Tien-Nam Le, Amaury Habrard, and Marc Sebban. “Deep multi-Wasserstein unsupervised
domain adaptation”. In: Pattern Recognition Letters 125 (2019), pp. 249–255. issn: 0167-8655. doi:
https://doi.org/10.1016/j.patrec.2019.04.025.
[93] Yann LeCun, Yoshua Bengio, et al. “Convolutional networks for images, speech, and time series”.
In: The handbook of brain theory and neural networks 3361.10 (1995), p. 1995.
[94] Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht. “Sliced Wasserstein
Discrepancy for Unsupervised Domain Adaptation”. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR). 2019.
[95] Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht. “Sliced wasserstein
discrepancy for unsupervised domain adaptation”. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 2019, pp. 10285–10295.
[96] Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, and Si Wu. “Model adaptation: Unsupervised
domain adaptation without source data”. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2020, pp. 9641–9650.
[97] Yitong Li, Michael Murias, Geraldine Dawson, and David E Carlson. “Extracting Relationships by
Multi-Domain Matching”. In: Advances in Neural Information Processing Systems. Ed. by S. Bengio,
H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. Vol. 31. Curran
Associates, Inc., 2018. url:
https://proceedings.neurips.cc/paper/2018/file/2fd0fd3efa7c4cfb034317b21f3c2d93-Paper.pdf.
[98] Jian Liang, Dapeng Hu, and Jiashi Feng. “Do We Really Need to Access the Source Data? Source
Hypothesis Transfer for Unsupervised Domain Adaptation”. In: Proceedings of the 37th
International Conference on Machine Learning. Ed. by Hal Daumé III and Aarti Singh. Vol. 119.
Proceedings of Machine Learning Research. PMLR, 2020, pp. 6028–6039. url:
https://proceedings.mlr.press/v119/liang20a.html.
[99] Jian Liang, Dapeng Hu, and Jiashi Feng. “Do we really need to access the source data? source
hypothesis transfer for unsupervised domain adaptation”. In: International Conference on Machine
Learning. PMLR. 2020, pp. 6028–6039.
[100] Jian Liang, Dapeng Hu, Yunbo Wang, Ran He, and Jiashi Feng. “Source data-absent unsupervised
domain adaptation through hypothesis transfer and labeling transfer”. In: IEEE Transactions on
Pattern Analysis and Machine Intelligence (2021).
[101] Chuang Lin, Sicheng Zhao, Lei Meng, and Tat-Seng Chua. “Multi-source domain adaptation for
visual sentiment classification”. In: Proceedings of the AAAI Conference on Artificial Intelligence.
Vol. 34. 2020, pp. 2661–2668.
[102] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie.
Feature Pyramid Networks for Object Detection. 2017. arXiv: 1612.03144 [cs.CV].
[103] D. Liu, D. Zhang, Y. Song, F. Zhang, L. O’Donnell, H. Huang, M. Chen, and W. Cai. “PDAM: A
Panoptic-level Feature Alignment Framework for Unsupervised Domain Adaptive Instance
Segmentation in Microscopy Images”. In: IEEE Transactions on Medical Imaging (2020), pp. 1–1.
doi: 10.1109/TMI.2020.3023466.
[104] Yuang Liu, Wei Zhang, and Jun Wang. Source-Free Domain Adaptation for Semantic Segmentation.
2021. arXiv: 2103.16372 [cs.CV].
[105] Yuang Liu, Wei Zhang, and Jun Wang. “Source-free domain adaptation for semantic
segmentation”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 2021, pp. 1215–1224.
[106] Yuchen Liu, Yabo Chen, Wenrui Dai, Mengran Gou, Chun-Ting Huang, and Hongkai Xiong.
“Source-Free Domain Adaptation with Contrastive Domain Alignment and Self-supervised
Exploration for Face Anti-Spoofing: Supplementary Material”. In: ().
[107] Jonathan Long, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic
segmentation”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2015, pp. 3431–3440.
[108] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. “Learning Transferable Features
with Deep Adaptation Networks”. In: Proceedings of International Conference on Machine
Learning. 2015, pp. 97–105.
[109] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. “Deep transfer learning with
joint adaptation networks”. In: Proceedings of the 34th International Conference on Machine
Learning-Volume 70. JMLR. org. 2017, pp. 2208–2217.
[110] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. “Unsupervised domain
adaptation with residual transfer networks”. In: Advances in Neural Information Processing
Systems. Vol. 29. 2016, pp. 136–144.
[111] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. “Deep Transfer Learning with
Joint Adaptation Networks”. In: Proceedings of the 34th International Conference on Machine
Learning. Ed. by Doina Precup and Yee Whye Teh. Vol. 70. Proceedings of Machine Learning
Research. PMLR, 2017, pp. 2208–2217. url: http://proceedings.mlr.press/v70/long17a.html.
[112] Pauline Luc, Camille Couprie, Soumith Chintala, and Jakob Verbeek. “Semantic Segmentation
using Adversarial Networks”. In: NIPS Workshop on Adversarial Training. 2016.
[113] Yawei Luo, Liang Zheng, Tao Guan, Junqing Yu, and Yi Yang. “Taking a closer look at domain
shift: Category-level adversaries for semantics consistent domain adaptation”. In: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 2507–2516.
[114] David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. “Learning adversarially fair
and transferable representations”. In: International Conference on Machine Learning. PMLR. 2018,
pp. 3384–3393.
[115] Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and
Projection for Dimension Reduction. 2020. arXiv: 1802.03426 [stat.ML].
[116] Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. “UMAP: Uniform Manifold
Approximation and Projection”. In: Journal of Open Source Software 3.29 (2018), p. 861.
[117] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas.
“Communication-Efficient Learning of Deep Networks from Decentralized Data”. In: (2016). doi:
10.48550/ARXIV.1602.05629.
[118] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. “A
survey on bias and fairness in machine learning”. In: ACM Computing Surveys (CSUR) 54.6 (2021),
pp. 1–35.
[119] Pietro Morerio, Jacopo Cavazza, and Vittorio Murino. “Minimal-Entropy Correlation Alignment
for Unsupervised Deep Domain Adaptation”. In: ICLR. 2018.
[120] Yaniv Morgenstern, Mohammad Rostami, and Dale Purves. “Properties of artificial networks
evolved to contend with natural spectra”. In: Proceedings of the National Academy of Sciences
111.supplement_3 (2014), pp. 10868–10872.
[121] Saeid Motiian, Quinn Jones, Seyed Iranmanesh, and Gianfranco Doretto. “Few-shot adversarial
domain adaptation”. In: Advances in Neural Information Processing Systems. 2017, pp. 6670–6680.
[122] Zak Murez, Soheil Kolouri, David Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. “Image to
image translation for domain adaptation”. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. 2018, pp. 4500–4509.
[123] Zak Murez, Soheil Kolouri, David J. Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. “Image to
Image Translation for Domain Adaptation”. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. 2017.
[124] S. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. “Domain adaptation via transfer component
analysis”. In: IEEE Transactions on Neural Networks 22.2 (2011), pp. 199–210.
[125] Yingwei Pan, Ting Yao, Yehao Li, Yu Wang, Chong-Wah Ngo, and Tao Mei. “Transferrable
prototypical networks for unsupervised domain adaptation”. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2019, pp. 2239–2247.
[126] Pau Panareda Busto and Juergen Gall. “Open set domain adaptation”. In: Proceedings of the IEEE
international conference on computer vision. 2017, pp. 754–763.
[127] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf,
Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy,
Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. “PyTorch: An Imperative Style,
High-Performance Deep Learning Library”. In: Advances in Neural Information Processing Systems
32. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett.
Curran Associates, Inc., 2019, pp. 8024–8035. url: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[128] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. “Moment
matching for multi-source domain adaptation”. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision. 2019, pp. 1406–1415.
[129] Xingchao Peng, Zijun Huang, Yizhe Zhu, and Kate Saenko. Federated Adversarial Domain
Adaptation. 2019. arXiv: 1911.02054 [cs.CV].
[130] Eduardo HP Pooch, Pedro L Ballester, and Rodrigo C Barros. “Can we trust deep learning models
diagnosis? The impact of domain shift in chest radiograph classification”. In: arXiv preprint
arXiv:1909.01940 (2019).
[131] J. Rabin, G. Peyré, J. Delon, and M. Bernot. “Wasserstein barycenter and its application to texture
mixing”. In: International Conference on Scale Space and Variational Methods in Computer Vision.
Springer. 2011, pp. 435–446.
[132] I. Redko, A. Habrard, and M. Sebban. “Theoretical analysis of domain adaptation with optimal
transport”. In: Joint European Conference on Machine Learning and Knowledge Discovery in
Databases. Springer. 2017, pp. 737–753.
[133] Ievgen Redko, Nicolas Courty, Rémi Flamary, and Devis Tuia. “Optimal transport for
multi-source domain adaptation under target shift”. In: The 22nd International Conference on
Artificial Intelligence and Statistics. PMLR. 2019, pp. 849–858.
[134] Ievgen Redko, Amaury Habrard, and Marc Sebban. “Theoretical Analysis of Domain Adaptation
with Optimal Transport”. In: Machine Learning and Knowledge Discovery in Databases. Ed. by
Michelangelo Ceci, Jaakko Hollmén, Ljupčo Todorovski, Celine Vens, and Sašo Džeroski. Cham:
Springer International Publishing, 2017, pp. 737–753. isbn: 978-3-319-71246-8.
[135] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. “Playing for data: Ground
truth from computer games”. In: European conference on computer vision. Springer. 2016,
pp. 102–118.
[136] Eduardo Romera, Luis M Bergasa, Kailun Yang, Jose M Alvarez, and Rafael Barea. “Bridging the
day and night domain gap for semantic segmentation”. In: 2019 IEEE Intelligent Vehicles
Symposium (IV). IEEE. 2019, pp. 1312–1318.
[137] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for
Biomedical Image Segmentation. 2015. arXiv: 1505.04597[cs.CV].
[138] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-Net: Convolutional Networks for
Biomedical Image Segmentation”. In: Medical Image Computing and Computer-Assisted
Intervention – MICCAI 2015. Ed. by Nassir Navab, Joachim Hornegger, William M. Wells, and
Alejandro F. Frangi. Cham: Springer International Publishing, 2015, pp. 234–241.
[139] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. “The
synthia dataset: A large collection of synthetic images for semantic segmentation of urban
scenes”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016,
pp. 3234–3243. doi: 10.1109/CVPR.2016.352.
[140] Mohammad Rostami. “Increasing Model Generalizability for Unsupervised Domain Adaptation”.
In: Proceedings of the Conference on Lifelong Learning Agents. 2022.
[141] Mohammad Rostami. “Learning Transferable Knowledge Through Embedding Spaces”.
PhD thesis. University of Pennsylvania, 2019.
[142] Mohammad Rostami. “Lifelong domain adaptation via consolidated internal distribution”. In:
Advances in Neural Information Processing Systems 34 (2021), pp. 11172–11183.
[143] Mohammad Rostami. Transfer Learning Through Embedding Spaces. CRC Press, 2021.
[144] Mohammad Rostami and Aram Galstyan. “Overcoming Concept Shift in Domain-Aware Settings
through Consolidated Internal Distributions”. In: AAAI Conference on Artificial Intelligence. 2023.
[145] Mohammad Rostami, David Huber, and Tsai-Ching Lu. “A crowdsourcing triage algorithm for
geopolitical event forecasting”. In: Proceedings of the 12th ACM Conference on Recommender
Systems. 2018, pp. 377–381.
[146] Mohammad Rostami, David Isele, and Eric Eaton. “Using task descriptions in lifelong machine
learning for improved performance and zero-shot transfer”. In: Journal of Artificial Intelligence
Research 67 (2020), pp. 673–704.
[147] Mohammad Rostami, Soheil Kolouri, Eric Eaton, and Kyungnam Kim. “Deep transfer learning for
few-shot sar image classification”. In: Remote Sensing 11.11 (2019), p. 1374.
[148] Mohammad Rostami, Soheil Kolouri, Eric Eaton, and Kyungnam Kim. “Sar image classification
using few-shot cross-domain transfer learning”. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshops. 2019.
[149] Mohammad Rostami, Soheil Kolouri, Kyungnam Kim, and Eric Eaton. “Multi-Agent Distributed
Lifelong Learning for Collective Knowledge Acquisition”. In: Proceedings of the 17th International
Conference on Autonomous Agents and MultiAgent Systems. 2018, pp. 712–720.
[150] Mohammad Rostami, Soheil Kolouri, and Praveen K Pilly. “Complementary learning for
overcoming catastrophic forgetting using experience replay”. In: Proceedings of the 28th
International Joint Conference on Artificial Intelligence. AAAI Press. 2019, pp. 3339–3345.
[151] Mohammad Rostami, Soheil Kolouri, Praveen K Pilly, and James McClelland. “Generative
Continual Concept Learning.” In: AAAI. 2020, pp. 5545–5552.
[152] Mohammad Rostami, Leonidas Spinoulas, Mohamed Hussein, Joe Mathai, and
Wael Abd-Almageed. “Detection and continual learning of novel face presentation attacks”. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 14851–14860.
[153] Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. “Stabilizing training of
generative adversarial networks through regularization”. In: Advances in Neural Information
Processing Systems. 2017, pp. 2018–2028.
[154] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. “Adapting visual category models to new domains”.
In: European Conference on Computer Vision. Springer. 2010, pp. 213–226.
[155] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. “Maximum classifier
discrepancy for unsupervised domain adaptation”. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. 2018, pp. 3723–3732.
[156] Kuniaki Saito, Shohei Yamamoto, Yoshitaka Ushiku, and Tatsuya Harada. “Open Set Domain
Adaptation by Backpropagation”. In: Proceedings of the European Conference on Computer Vision
(ECCV). 2018.
[157] Christos Sakaridis, Dengxin Dai, Simon Hecker, and Luc Van Gool. “Model adaptation with
synthetic and real data for semantic dense foggy scene understanding”. In: Proceedings of the
European Conference on Computer Vision (ECCV). 2018, pp. 687–704.
[158] Cristiano Saltori, Stéphane Lathuiliére, Nicu Sebe, Elisa Ricci, and Fabio Galasso. “Sf-uda 3d:
Source-free unsupervised domain adaptation for lidar-based 3d object detection”. In: 2020
International Conference on 3D Vision (3DV). IEEE. 2020, pp. 771–780.
[159] S. Sankaranarayanan, Y. Balaji, C. D Castillo, and R. Chellappa. “Generate to adapt: Aligning
domains using generative adversarial networks”. In: CVPR. 2018.
[160] Swami Sankaranarayanan, Yogesh Balaji, Carlos D Castillo, and Rama Chellappa. “Generate to
adapt: Aligning domains using generative adversarial networks”. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2018, pp. 8503–8512.
[161] Swami Sankaranarayanan, Yogesh Balaji, Arpit Jain, Ser Nam Lim, and Rama Chellappa.
“Learning from synthetic data: Addressing domain shift for semantic segmentation”. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018,
pp. 3752–3761.
[162] Antoine Saporta, Tuan-Hung Vu, Matthieu Cord, and Patrick Pérez. “Multi-Target Adversarial
Frameworks for Domain Adaptation in Semantic Segmentation”. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision (ICCV). 2021, pp. 9072–9081.
[163] Candice Schumann, Xuezhi Wang, Alex Beutel, Jilin Chen, Hai Qian, and Ed H Chi. “Transfer of
Machine Learning Fairness across Domains”. In: (2019).
[164] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. “Continual learning with deep
generative replay”. In: Advances in Neural Information Processing Systems. 2017, pp. 2990–2999.
[165] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai,
Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap,
Karen Simonyan, and Demis Hassabis. Mastering Chess and Shogi by Self-Play with a General
Reinforcement Learning Algorithm. 2017.doi: 10.48550/ARXIV.1712.01815.
[166] K. Simonyan and A. Zisserman. “Very deep convolutional networks for large-scale image
recognition”. In: arXiv preprint arXiv:1409.1556 (2014).
[167] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image
Recognition. 2015. arXiv: 1409.1556 [cs.CV].
[168] Harvineet Singh, Rina Singh, Vishwali Mhasawade, and Rumi Chunara. “Fairness violations and
mitigation under covariate shift”. In: Proceedings of the 2021 ACM Conference on Fairness,
Accountability, and Transparency. 2021, pp. 3–13.
[169] Serban Stan and Mohammad Rostami. “Domain Adaptation for the Segmentation of Confidential
Medical Images”. In: Proceedings of the British Machine Vision Conference. Vol. 33. 2022.
[170] Serban Stan and Mohammad Rostami. “Privacy Preserving Domain Adaptation for Semantic
Segmentation of Medical Images”. In: BMVC. 2022.
[171] Serban Stan and Mohammad Rostami. “Secure Domain Adaptation with Multiple Sources”. In:
Transactions on Machine Learning Research (2022).
[172] Serban Stan and Mohammad Rostami. “Unsupervised model adaptation for continual semantic
segmentation”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. 2021,
pp. 2593–2601.
[173] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for
Semantic Segmentation. 2021. arXiv: 2105.05633 [cs.CV].
[174] Baochen Sun and Kate Saenko. “Deep coral: Correlation alignment for deep domain adaptation”.
In: European conference on computer vision. Springer. 2016, pp. 443–450.
[175] Andrew Tao, Karan Sapra, and Bryan Catanzaro. Hierarchical Multi-Scale Attention for Semantic
Segmentation. 2020. arXiv: 2005.10821 [cs.CV].
[176] Onur Tasar, Yuliya Tarabalka, Alain Giros, Pierre Alliez, and Sébastien Clerc. “StandardGAN:
Multi-source domain adaptation for semantic segmentation of very high resolution satellite
images by data standardization”. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition Workshops. 2020, pp. 192–193.
[177] Jiayi Tian, Jing Zhang, Wen Li, and Dong Xu. “VDM-DA: Virtual Domain Modeling for Source
Data-Free Domain Adaptation”. In: IEEE Transactions on Circuits and Systems for Video Technology
32.6 (2022), pp. 3749–3760. doi: 10.1109/TCSVT.2021.3111034.
[178] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. “Wasserstein
Auto-Encoders”. In: arXiv preprint arXiv:1711.01558 (2017).
[179] Devavrat Tomar, Manana Lortkipanidze, Guillaume Vray, Behzad Bozorgtabar, and
Jean-Philippe Thiran. “Self-Attentive Spatial Adaptive Normalization for Cross-Modality Domain
Adaptation”. In: IEEE Transactions on Medical Imaging (2021).
[180] László Tóth and Gábor Gosztolya. “Adaptation of DNN acoustic models using KL-divergence
regularization and multi-task training”. In: International Conference on Speech and Computer.
Springer. 2016, pp. 108–115.
[181] Yi-Hsuan Tsai, Wei-Chih Hung, Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, and
Manmohan Chandraker. “Learning to adapt structured output space for semantic segmentation”.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018,
pp. 7472–7481.
[182] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. “Learning to Adapt
Structured Output Space for Semantic Segmentation”. In: IEEE Conference on Computer Vision and
Pattern Recognition (CVPR). 2018.
[183] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. “Adversarial discriminative domain
adaptation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
2017, pp. 7167–7176.
[184] Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool.
“Unsupervised semantic segmentation by contrasting object mask proposals”. In: Proceedings of
the IEEE/CVF International Conference on Computer Vision. 2021, pp. 10052–10062.
[185] Naveen Venkat, Jogendra Nath Kundu, Durgesh Kumar Singh, Ambareesh Revanur, and
R Venkatesh Babu. “Your Classifier can Secretly Suffice Multi-Source Domain Adaptation”. In:
NeurIPS. 2020.
[186] Naveen Venkat, Jogendra Nath Kundu, Durgesh Kumar Singh, Ambareesh Revanur, and
Venkatesh Babu R. “Your Classifier can Secretly Suffice Multi-Source Domain Adaptation”. In:
NeurIPS. 2020. url: https://proceedings.neurips.cc/paper/2020/hash/3181d59d19e76e902666df5c7821259a-Abstract.html.
[187] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan.
“Deep hashing network for unsupervised domain adaptation”. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2017, pp. 5018–5027.
[188] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Pérez. “Advent:
Adversarial entropy minimization for domain adaptation in semantic segmentation”. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019,
pp. 2517–2526.
[189] Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen.
“Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation”. In: European
Conference on Computer Vision (ECCV). 2020.
[190] Jindong Wang, Wenjie Feng, Yiqiang Chen, Han Yu, Meiyu Huang, and Philip S Yu. “Visual
domain adaptation with manifold embedded distribution alignment”. In: Proceedings of the 26th
ACM international conference on Multimedia. 2018, pp. 402–410.
[191] Qi Wang, Junyu Gao, and Xuelong Li. “Weakly Supervised Adversarial Domain Adaptation for
Semantic Segmentation in Urban Scenes”. In: IEEE Transactions on Image Processing 28.9 (2019),
pp. 4376–4386. doi: 10.1109/TIP.2019.2910667.
[192] Tao Wang, Xiaopeng Zhang, Li Yuan, and Jiashi Feng. “Few-Shot Adaptive Faster R-CNN”. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.
[193] Yunchao Wei, Huaxin Xiao, Honghui Shi, Zequn Jie, Jiashi Feng, and Thomas S. Huang.
“Revisiting Dilated Convolution: A Simple Approach for Weakly- and Semi-Supervised Semantic
Segmentation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). 2018.
[194] Junfeng Wen, Russell Greiner, and Dale Schuurmans. “Domain Aggregation Networks for
Multi-Source Domain Adaptation”. In: Proceedings of the 37th International Conference on Machine
Learning. Ed. by Hal Daumé III and Aarti Singh. Vol. 119. Proceedings of Machine Learning
Research. PMLR, 2020, pp. 10214–10224. url: http://proceedings.mlr.press/v119/wen20b.html.
[195] Michael Wick, Jean-Baptiste Tristan, et al. “Unlocking fairness: a trade-off revisited”. In: Advances
in neural information processing systems 32 (2019).
[196] Dongrui Wu. “Online and offline domain adaptation for reducing BCI calibration effort”. In: IEEE
Transactions on Human-Machine Systems 47.4 (2016), pp. 550–563.
[197] Zuxuan Wu, Xintong Han, Yen-Liang Lin, Mustafa Gokhan Uzunbas, Tom Goldstein,
Ser Nam Lim, and Larry S Davis. “Dcan: Dual channel-wise alignment networks for unsupervised
scene adaptation”. In: Proceedings of the European Conference on Computer Vision (ECCV). 2018,
pp. 518–534.
[198] Zuxuan Wu, Xintong Han, Yen-Liang Lin, Mustafa Gökhan Uzunbas, Tom Goldstein,
Ser Nam Lim, and Larry S. Davis. “DCAN: Dual Channel-Wise Alignment Networks for
Unsupervised Scene Adaptation”. In: Computer Vision – ECCV 2018. Ed. by Vittorio Ferrari,
Martial Hebert, Cristian Sminchisescu, and Yair Weiss. Cham: Springer International Publishing,
2018, pp. 535–552.
[199] Ruijia Xu, Ziliang Chen, Wangmeng Zuo, Junjie Yan, and Liang Lin. “Deep cocktail network:
Multi-source unsupervised domain adaptation with category shift”. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2018, pp. 3964–3973.
[200] Ruijia Xu, Guanbin Li, Jihan Yang, and Liang Lin. “Larger norm more transferable: An adaptive
feature norm approach for unsupervised domain adaptation”. In: Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2019, pp. 1426–1435.
[201] Hongliang Yan, Yukang Ding, Peihua Li, Qilong Wang, Yong Xu, and Wangmeng Zuo. “Mind the
class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation”.
In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017,
pp. 2272–2281.
[202] Zengqiang Yan, Jeffry Wicaksana, Zhiwei Wang, Xin Yang, and Kwang-Ting Cheng.
“Variation-Aware Federated Learning With Multi-Source Decentralized Medical Image Data”. In:
IEEE Journal of Biomedical and Health Informatics 25.7 (2021), pp. 2615–2628. doi: 10.1109/JBHI.2020.3040015.
[203] Baoyao Yang, Hao-Wei Yeh, Tatsuya Harada, and Pong C. Yuen. “Model-Induced Generalization
Error Bound for Information-Theoretic Representation Learning in Source-Data-Free
Unsupervised Domain Adaptation”. In: IEEE Transactions on Image Processing 31 (2022),
pp. 419–432. doi: 10.1109/TIP.2021.3130530.
[204] Shiqi Yang, Yaxing Wang, Joost van de Weijer, Luis Herranz, and Shangling Jui. “Generalized
Source-Free Domain Adaptation”. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision (ICCV). 2021, pp. 8978–8987.
[205] Timothy Yang, Galen Andrew, Hubert Eichner, Haicheng Sun, Wei Li, Nicholas Kong,
Daniel Ramage, and Françoise Beaufays. “Applied federated learning: Improving google keyboard
query suggestions”. In: arXiv preprint arXiv:1812.02903 (2018).
[206] Yanchao Yang, Dong Lao, Ganesh Sundaramoorthi, and Stefano Soatto. “Phase consistent
ecological domain adaptation”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. 2020, pp. 9011–9020.
[207] Yanchao Yang and Stefano Soatto. “Fda: Fourier domain adaptation for semantic segmentation”.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020,
pp. 4085–4095.
[208] Hao-Wei Yeh, Baoyao Yang, Pong C Yuen, and Tatsuya Harada. “Sofa: Source-data-free feature
alignment for unsupervised domain adaptation”. In: Proceedings of the IEEE/CVF Winter
Conference on Applications of Computer Vision. 2021, pp. 474–483.
[209] Fuming You, Jingjing Li, Lei Zhu, Zhi Chen, and Zi Huang. “Domain Adaptive Semantic
Segmentation without Source Data”. In: Proceedings of the 29th ACM International Conference on
Multimedia. MM ’21. Virtual Event, China: Association for Computing Machinery, 2021,
pp. 3293–3302. isbn: 9781450386517. doi: 10.1145/3474085.3475482.
[210] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. “Mitigating Unwanted Biases with
Adversarial Learning”. In: AIES ’18. New Orleans, LA, USA: Association for Computing
Machinery, 2018, pp. 335–340. isbn: 9781450360128. doi: 10.1145/3278721.3278779.
[211] Min Zhang and Juntao Li. “A commentary of GPT-3 in MIT Technology Review 2021”. In:
Fundamental Research 1.6 (2021), pp. 831–833. issn: 2667-3258. doi: 10.1016/j.fmre.2021.11.011.
[212] Qiming Zhang, Jing Zhang, Wei Liu, and Dacheng Tao. “Category anchor-guided unsupervised
domain adaptation for semantic segmentation”. In: Advances in Neural Information Processing
Systems. 2019, pp. 435–445.
[213] Weichen Zhang, Wanli Ouyang, Wen Li, and Dong Xu. “Collaborative and Adversarial Network
for Unsupervised Domain Adaptation”. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR). 2018.
[214] Yang Zhang, Philip David, and Boqing Gong. “Curriculum Domain Adaptation for Semantic
Segmentation of Urban Scenes”. In: 2017 IEEE International Conference on Computer Vision (ICCV)
(2017). doi: 10.1109/iccv.2017.223.
[215] Yang Zhang, Philip David, and Boqing Gong. “Curriculum domain adaptation for semantic
segmentation of urban scenes”. In: Proceedings of the IEEE International Conference on Computer
Vision. 2017, pp. 2020–2030.
[216] Yiliang Zhang and Qi Long. “Assessing Fairness in the Presence of Missing Data”. In: Advances in
Neural Information Processing Systems 34 (2021).
[217] Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael Jordan. “Bridging Theory and
Algorithm for Domain Adaptation”. In: International Conference on Machine Learning. 2019,
pp. 7404–7413.
[218] Yue Zhang, Shun Miao, Tommaso Mansi, and Rui Liao. “Task driven generative modeling for
unsupervised domain adaptation: Application to x-ray image segmentation”. In: International
Conference on Medical Image Computing and Computer-Assisted Intervention. Springer. 2018,
pp. 599–607.
[219] Han Zhao, Shanghang Zhang, Guanhang Wu, José MF Moura, Joao P Costeira, and
Geoffrey J Gordon. “Adversarial multiple source domain adaptation”. In: Advances in neural
information processing systems 31 (2018), pp. 8559–8570.
[220] Sicheng Zhao, Bo Li, Xiangyu Yue, Yang Gu, Pengfei Xu, Runbo Hu, Hua Chai, and Kurt Keutzer.
“Multi-source Domain Adaptation for Semantic Segmentation”. In: Advances in Neural
Information Processing Systems. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc,
E. Fox, and R. Garnett. Vol. 32. Curran Associates, Inc., 2019. url: https://proceedings.neurips.cc/paper/2019/file/db9ad56c71619aeed9723314d1456037-Paper.pdf.
[221] Sicheng Zhao, Guangzhi Wang, Shanghang Zhang, Yang Gu, Yaxian Li, Zhichao Song,
Pengfei Xu, Runbo Hu, Hua Chai, and Kurt Keutzer. “Multi-source distilling domain adaptation”.
In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 2020, pp. 12975–12983.
[222] Yuyin Zhou, Zhe Li, Song Bai, Chong Wang, Xinlei Chen, Mei Han, Elliot Fishman, and
Alan Yuille. Prior-aware Neural Network for Partially-Supervised Multi-Organ Segmentation. 2019.
arXiv: 1904.06346 [cs.CV].
[223] J. Zhu, T. Park, P. Isola, and A. A. Efros. “Unpaired image-to-image translation using
cycle-consistent adversarial networks”. In: ICCV. 2017, pp. 2223–2232.
[224] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired Image-to-Image
Translation using Cycle-Consistent Adversarial Networks. 2020. arXiv: 1703.10593[cs.CV].
[225] Yongchun Zhu, Fuzhen Zhuang, and Deqing Wang. “Aligning domain-specific distribution and
classifier for cross-domain classification from multiple sources”. In: Proceedings of the AAAI
Conference on Artificial Intelligence. Vol. 33. 2019, pp. 5989–5996.
[226] Xiahai Zhuang and Juan Shen. “Multi-scale patch and multi-modality atlases for whole heart
segmentation of MRI”. In: Medical Image Analysis 31 (2016), pp. 77–87. issn: 1361-8415. doi: 10.1016/j.media.2016.02.006.
[227] Danbing Zou, Qikui Zhu, and Pingkun Yan. “Unsupervised domain adaptation with dual scheme
fusion network for medical image segmentation”. In: Proceedings of the Twenty-Ninth International
Joint Conference on Artificial Intelligence, IJCAI-20. International Joint Conferences on Artificial
Intelligence Organization, 2020, pp. 3291–3298.
[228] Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. “Unsupervised domain adaptation for
semantic segmentation via class-balanced self-training”. In: Proceedings of the European Conference
on Computer Vision (ECCV). 2018, pp. 289–305.
Abstract
The recent success of deep learning is conditioned on the availability of large annotated datasets for supervised learning. Data annotation, however, is a laborious and time-consuming task. When a model fully trained on an annotated source domain is applied to a target domain with a different data distribution, greatly diminished generalization performance can be observed due to domain shift. Unsupervised Domain Adaptation (UDA) aims to mitigate the impact of domain shift when the target domain is unannotated. The majority of UDA algorithms assume joint access to source and target data, which may violate data privacy restrictions in many real-world applications. In this thesis I propose source-free UDA approaches that are well suited for scenarios in which source and target data are only accessible sequentially. I show that, across several application domains, it is sufficient for successful adaptation to maintain a low-memory approximation of the source embedding distribution instead of the full source dataset. Domain shift is then mitigated by minimizing an appropriate distributional distance metric. First, I validate this idea on adaptation tasks in street image segmentation. I then show that improving the approximation of the source embeddings leads to superior performance when adapting medical image segmentation models. Next, I extend this idea to multi-source adaptation, where several source domains are present and data transfer between pairs of domains is prohibited. Finally, I show that relaxing the constraint of data privacy allows for mitigating domain shift in fair classification.
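The source-free recipe described above can be sketched in a few lines: summarize the source embedding distribution with a compact surrogate, then adapt by minimizing a distributional distance between target embeddings and samples from that surrogate. The following NumPy sketch is illustrative only, assuming a single-Gaussian surrogate and the sliced Wasserstein distance as the metric; all function names are hypothetical and the thesis chapters use richer approximations.

```python
import numpy as np

def fit_gaussian(embeddings):
    # Low-memory surrogate of the source embedding distribution:
    # keep only a mean vector and covariance matrix, not the source data.
    return embeddings.mean(axis=0), np.cov(embeddings, rowvar=False)

def sliced_wasserstein(x, y, n_projections=50, seed=1):
    # Monte-Carlo sliced Wasserstein-2 distance between two equally sized
    # point clouds: project onto random directions, sort, compare quantiles.
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_projections, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    return float(np.sqrt(((px - py) ** 2).mean()))

# Toy demonstration: a shifted target is farther from the source surrogate
# than an in-distribution target, so minimizing this distance mitigates shift.
rng = np.random.default_rng(0)
source = rng.normal(loc=0.0, size=(500, 8))       # source embeddings (discarded after fitting)
mu, cov = fit_gaussian(source)
surrogate = rng.multivariate_normal(mu, cov, size=500)
target_shifted = rng.normal(loc=3.0, size=(500, 8))
target_aligned = rng.normal(loc=0.0, size=(500, 8))
d_far = sliced_wasserstein(surrogate, target_shifted)
d_near = sliced_wasserstein(surrogate, target_aligned)
assert d_far > d_near
```

In the actual adaptation setting, `d_far` would be the loss back-propagated through the target encoder; only `mu` and `cov` need to be shared, never the private source data.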
Asset Metadata
Creator: Stan, Serban Andrei (author)
Core Title: Unsupervised domain adaptation with private data
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Degree Conferral Date: 2023-05
Publication Date: 01/17/2024
Defense Date: 12/05/2022
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: data privacy, domain adaptation, domain shift, fair classification, medical image segmentation, medical imaging, multi-source classification, OAI-PMH Harvest, semantic segmentation, source-free adaptation
Format: theses (aat)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Rostami, Mohammad (committee chair), Kuo, Jay (committee member), Nakano, Aiichiro (committee member)
Creator Email: serbanandrei.stan@gmail.com, sstan@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC112716012
Unique Identifier: UC112716012
Identifier: etd-StanSerban-11418.pdf (filename)
Legacy Identifier: etd-StanSerban-11418
Document Type: Dissertation
Rights: Stan, Serban Andrei
Internet Media Type: application/pdf
Type: texts
Source: 20230118-usctheses-batch-1001 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu