Fair Machine Learning for Human Behavior Understanding
by
Shen Yan
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2022
Copyright 2022 Shen Yan
Acknowledgements
The past several years at USC have been a wonderful and memorable journey. I am deeply grateful to everyone who came into my life to make it as amazing as it is.
First and foremost, I am deeply thankful to my advisor, Professor Emilio Ferrara, for giving me the opportunity to participate in many great research projects and providing me constant support and guidance throughout my PhD. I would also like to thank my thesis and qualifying exam committee members: Prof. Cyrus Shahabi, Prof. Shrikanth Narayanan, Prof. Kristina Lerman, Prof. Jonathan Gratch, and Prof. Fred Morstatter, for their valuable time and insightful comments on my work.
I would like to give my thanks to the research team of the TILES project. TILES is the first and biggest project that I worked on during my PhD. Working on TILES shaped my research directions and made me a better researcher.
I am also grateful to all my excellent collaborators. In particular, I want to give my thanks to Hsien-Te Kao and Homa Hosseinmardi. Working with them was a delightful experience. Many ideas in this thesis were inspired by discussions with them. I would also like to thank Prof. Mohammad Soleymani, Prof. Salman Avestimehr, Ashok Deb, Akira Matsui, Di Huang, Nathan Bartley, Yahya Ezzeldin, Chaoyang He, and Yiyun Zhu, who helped me and collaborated with me in my research. I am also fortunate to have had the opportunity to work with many talented research scientists, Kristen Altenburger, Justin Cheng, Shawndra Hill, Yi-Chia Wang, and Poppy Zhang, at Meta Research during my internship.
Last but not least, the experience would not have been so wonderful without the love and support from my family and friends. Thanks to my parents for always being a source of support during my life. Thanks to my friends and colleagues Emily Chen, Julie Jiang, and Yilei Zeng, for their support and help during the past few years. I would also like to give my thanks to my friend Chang Liu, who constantly encourages me and shares my joy and sorrow.
Thanks to everyone who makes the experience so memorable. Thanks to every moment that
makes me who I am.
Table of Contents

Acknowledgements

List of Tables

List of Figures

Abstract

Chapter 1: Introduction
    1.1 Human Behavior Understanding
    1.2 Fairness in Machine Learning
    1.3 Challenges
    1.4 Thesis Contribution

Chapter 2: Survey of Techniques
    2.1 Machine Learning for Human Behavior Understanding
    2.2 Definitions of ML Fairness
        2.2.1 Equal Opportunity
        2.2.2 Equalized Odds
        2.2.3 Statistical Parity
        2.2.4 Fairness Definitions in Regression Tasks
    2.3 Fairness-aware ML Techniques
        2.3.1 Pre-processing Methods
            2.3.1.1 Reweighting
            2.3.1.2 Learning Fair Representation
            2.3.1.3 Disparate Impact Remover
        2.3.2 In-processing Methods
            2.3.2.1 Prejudice Remover
            2.3.2.2 Adversarial Learning
        2.3.3 Post-processing Methods
            2.3.3.1 Reject Option Classification
            2.3.3.2 Calibrated Equalized Odds Post-processing
        2.3.4 Fair ML with Access Constraints of Sensitive Attributes
            2.3.4.1 Methods with Encrypted Sensitive Attributes
            2.3.4.2 Methods with Differentially Private Sensitive Attributes
            2.3.4.3 Methods without Sensitive Attributes
    2.4 Open Problems

Chapter 3: Taxonomy of Research Questions
    3.1 Introduction
    3.2 Benchmark Datasets
        3.2.1 Traditional Datasets for Fairness Research
            3.2.1.1 Adult
            3.2.1.2 COMPAS
            3.2.1.3 Violent Crime
            3.2.1.4 ACSIncome
        3.2.2 Human Behavior Datasets
            3.2.2.1 TILES
            3.2.2.2 YouTube Personality
            3.2.2.3 First Impression
            3.2.2.4 Older Adults
    3.3 Adopted Fairness Metrics
    3.4 Research Questions on Fair Behavioral Models
        3.4.1 RQ1: Effects of Behavior Heterogeneity on Fairness
        3.4.2 RQ2: Enhancing Model Fairness via Data Balancing
        3.4.3 RQ3: Enhancing Model Fairness via Mitigating Data Heterogeneity
        3.4.4 RQ4: Enabling Group Fairness in Federated Learning

Chapter 4: Effects of Heterogeneous Human Behavior in ML Fairness
    4.1 Introduction
    4.2 Heterogeneity of Behavioral Signals and Labels
        4.2.1 Differences in Behavioral Signals
            4.2.1.1 Distribution Differences
        4.2.2 Differences in Behavioral Labels
            4.2.2.1 Behavioral Label Types
            4.2.2.2 Identify Labeling Bias
        4.2.3 Mitigating Signal and Label Bias
            4.2.3.1 Learning Unbiased Labels
            4.2.3.2 Adversarial Learning with Label Matching
            4.2.3.3 Experimental Evaluation
        4.2.4 Summary
    4.3 Heterogeneity of Multimodal Fusion
        4.3.1 State-of-the-Art Personality Assessment Pipeline
            4.3.1.1 Features
            4.3.1.2 Data Fusion Strategies
        4.3.2 Different Sources of Bias
            4.3.2.1 User-Dependent vs. User-Independent Evaluation
            4.3.2.2 Biases from Different Modalities
            4.3.2.3 Impact of Fusion on Estimation Biases
        4.3.3 Debiasing Approaches
            4.3.3.1 Data Balancing
            4.3.3.2 Adversarial Learning
            4.3.3.3 Performance Comparison
    4.4 Conclusions

Chapter 5: Enhancing Model Fairness via Data Balancing
    5.1 Introduction
    5.2 Related Work
    5.3 Hardness and Distribution Biases
        5.3.1 Hardness Bias
        5.3.2 Distribution Bias
    5.4 Effects of Class Balancing on Fairness
        5.4.1 Effects on Biases
        5.4.2 Effects on Group Fairness
    5.5 Fair Class Balancing
        5.5.1 Proposed Method
        5.5.2 Biases after Fair Class Balancing
    5.6 Experiments
        5.6.1 Fairness Assessment Framework
        5.6.2 Experimental Setting
        5.6.3 Performance
        5.6.4 Fair Class Balancing & Fairness-Aware Learning
        5.6.5 Discussion
    5.7 Experiments on Behavior Data
        5.7.1 Dataset and Configurations
        5.7.2 Performance
    5.8 Conclusions

Chapter 6: Enhancing Model Fairness via Mitigating Data Heterogeneity
    6.1 Introduction
    6.2 Related Work
    6.3 Preliminaries
        6.3.1 Fair Regression
        6.3.2 Fisher's Linear Discriminant
        6.3.3 Factor Analysis
    6.4 Impact of Heterogeneity on Fairness
    6.5 Methods
        6.5.1 Identifying Heterogeneous Patterns
        6.5.2 Multi-Layer Factor Analysis (MLFA) Framework
        6.5.3 Feature Rescaling
        6.5.4 Modeling Pipeline
    6.6 Experiments
        6.6.1 Datasets
            6.6.1.1 Synthetic Datasets
            6.6.1.2 Behavior Datasets
        6.6.2 Performance of Regression Tasks
        6.6.3 Performance of Classification Tasks
    6.7 Conclusions

Chapter 7: Enabling Group Fairness in Federated Learning
    7.1 Introduction
    7.2 Related Work
    7.3 Preliminaries
        7.3.1 Federated Averaging (FedAvg)
        7.3.2 Fairness Metrics
            7.3.2.1 Group Fairness
            7.3.2.2 Uniform Accuracy
        7.3.3 Global vs. Local Group Fairness in Federated Learning
        7.3.4 Datasets
    7.4 Challenges to Local Debiasing in Federated Learning
        7.4.1 Performance Under Different Heterogeneity Levels
        7.4.2 Fair Class Balancing in Federated Learning Settings
    7.5 FairFed: Fairness-Aware Federated Learning
        7.5.1 Our Proposed Approach (FairFed)
        7.5.2 Computing the Aggregation Weights for FairFed at the Server
        7.5.3 How to Compute the Global Metric at the Server
    7.6 Evaluation
        7.6.1 Experimental Setup
            7.6.1.1 Implementation
            7.6.1.2 Baselines
            7.6.1.3 Hyperparameters
        7.6.2 Experimental Results on Artificial Partitioned Data
            7.6.2.1 Performance under the Heterogeneous Sensitive Attribute Distribution
            7.6.2.2 Performance with Different Parameter (β)
            7.6.2.3 Performance with Different Fairness Budget η
            7.6.2.4 Performance under Different Number of Clients
            7.6.2.5 Performance with Single Sensitive Group Clients
            7.6.2.6 Performance under Different Local Debiasing Strategies
        7.6.3 Experimental Results on Real Partitioned Data
            7.6.3.1 Dataset
            7.6.3.2 Performance
    7.7 Experiments on Human Behavior Data
        7.7.1 Dataset
        7.7.2 Performance
    7.8 Conclusions

Chapter 8: Conclusions and Ongoing Work

References
List of Tables

2.1  A list of the sensitive attributes as defined by the Fair Housing Act (FHA) and Equal Credit Opportunity Act (ECOA).

3.1  Summary of benchmark datasets.

3.2  Summary of tasks for each research question.

4.1  Label distribution differences across different sensitive groups. We conduct t-tests between the labels of the unprivileged group and the original/matched labels in the privileged group.

4.2  Performance comparison among original, adversarial learning, and our proposed ALM modeling methods. The choice of λ is based on the best trade-off between utility and fairness.

4.3  Ground-truth distribution across different sensitive groups. For each group, the mean and standard deviation of the construct are reported. A t-test and one-way ANOVA are used to test the differences between the means across groups.

4.4  Comparison between user-dependent and user-independent evaluations. The model performances are measured by both accuracy (Acc.) and fairness metrics (Statistical Parity (SP_MI) and Equal Accuracy (EA)).

4.5  Bias measurement for different modalities. Comparison of the biases from audio vs. visual models.

4.6  Impact of fusion on SP: we compare the SP metrics before and after the late fusion. ↑ and ↓ indicate increased and decreased bias as a result of fusion, respectively. Green and red dashes respectively indicate that the fusion results are equal to the best and the worst results with a single modality.

4.7  Impact of fusion on EA: we compare the EA metrics before and after the late fusion. ↑ and ↓ indicate increased and decreased bias as a result of fusion, respectively. Green and red dashes respectively indicate that the fusion results are equal to the best and the worst results with a single modality.

4.8  Performance comparison of different debiasing approaches.

5.1  Effects of class balancing techniques on bias metrics.

5.2  Effects of class balancing techniques on group fairness. For F1 and the accuracy of the minority class (Acc.), higher is better. For Equal Opportunity (EOD), Equalized Odds (EOddsD), and Statistical Parity (SPD), values close to zero indicate fairer outcomes.

5.3  Bias measurement. Difference in bias of the original data, after KMeans SMOTE class balancing, and after our proposed balancing method, as captured by the two metrics we propose, i.e., distribution bias ∆ and hardness bias Γ.

5.4  Classification performance comparison among the original dataset (Baseline) and fair class balancing (Ours) with 5-NN parameters and different clustering algorithms: KMeans clustering (KMeans), Agglomerative clustering (Agg.), and Spectral clustering (Spec.). Results with the smallest bias against the unprivileged group (i.e., most positive results) are bolded.

5.5  Combining class balancing and fairness-aware learning. Performance comparison among fair class balancing (Ours), proxy-based post-processing (CEO), and combining CEO with traditional class balancing and fair class balancing. Results with the smallest bias against the unprivileged group are bolded.

5.6  Classification performance comparison on the TILES dataset with 5-NN parameters and different clustering algorithms.

6.1  Impact of different heterogeneity patterns. N indicates the number of features that exhibit a heterogeneity pattern. We report the results of linear regression models with 10-fold cross-validation. The mean absolute error (MAE), equal accuracy (EA), and statistical parity (SP_r) are the used metrics.

6.2  Experimental performance (λ = 0). Comparison of the utility and fairness performance on the data with and without our proposed method. The results show that our method yields decent improvements in fairness and can even improve model accuracy.

6.3  Comparison of the debiasing performance of Random Forest models with and without sensitive attributes (SA).

6.4  Utility and fairness performance on the TILES dataset of the proposed MLFA method (λ = 0).

7.1  Hyperparameters used in experiments on the COMPAS and Adult datasets.

7.2  An example of the heterogeneous data distribution (non-IID) on the sensitive attribute A (sex) used in experiments on the Adult and COMPAS datasets. The shown distributions are for K = 5 clients and heterogeneity parameters α = 0.1 and α = 10.

7.3  Performance comparison of data partitions with different heterogeneity levels α. A smaller α indicates a more heterogeneous distribution across clients. We report the average performance over 20 random seeds. For the EOD and SPD metrics, larger values indicate better fairness. Positive fairness metrics indicate that the unprivileged group outperforms the privileged group.

7.4  Performance comparison of the uniform accuracy constraint η on data partitions with different heterogeneity levels α. We report the average performance over 20 random seeds.

7.5  An example of the heterogeneous data distribution (non-IID) on the target variable (Income > 50k) used in the experiment on Adult, where each client is assigned only points with a single sensitive attribute value. The shown distributions are for K = 5 clients and heterogeneity parameter α = 0.5 on the target variable.

7.6  Performance on the ACSIncome dataset.

7.7  Data distribution of the TILES dataset.

7.8  Performance on the TILES dataset.
List of Figures

4.1  Examples of distribution differences of the behavior signals across gender groups.

4.2  Proposed method to identify labeling bias. The filtering step removes the sensitive information from the feature dimension. The matching step matches the samples based on the filtered features.

4.3  Distribution of the labels in the unprivileged group and their matched labels. The matched labels (blue lines) exhibit different distributions from the original labels (red dashed lines), indicating potential labeling bias in the data.

4.4  Distribution comparison of original labels and matched labels. (a) demonstrates the trend of differences between original and matched labels (i.e., label bias). (b) compares the cumulative distribution function (CDF) before (blue line) and after (orange dashed line) matching.

4.5  Framework of the proposed adversarial learning with label matching (ALM) method.

4.6  Effects of the fairness budget parameter λ. The plot illustrates the model performance change according to both utility and fairness metrics.

4.7  Pipeline of the BU-NKU system, reproduced from [83].

4.8  Variable importance of the late fusion Random Forest model.

4.9  Debiasing approaches: (a) data balancing, (b) adversarial learning.

5.1  Examples of the bias changes before and after class balancing. (a), (c), and (e) illustrate the change of distribution bias. (b), (d), and (f) compare the hardness bias before and after class balancing for the Adult, COMPAS, and Violent Crime data.

5.2  Illustration of how fair class balancing enhances fairness. Circle and star nodes represent samples with positive and negative labels, respectively. The hollow nodes are samples generated based on the nearest neighbors of the minority samples in each cluster.

5.3  Examples of the hardness bias changes in the test set before and after class balancing. The kDN for each test sample here is the percentage of the k nearest training samples that do not share the same target variable as the test sample.

5.4  Prediction and assessment framework. In our setting, sensitive attributes are considered unobservable to both the service provider and the client sides. Hence, fairness is judged by a third-party assessment agency.

5.5  Effects of the fairness budget parameter kNN. The plots illustrate model performance and fairness, according to four metrics (EOD, EOddsD, SPD, and Accuracy), for the Adult, COMPAS, and Violent Crime data.

5.6  Effects of the fairness budget parameter kNN on the TILES dataset.

6.1  Example of bias from heterogeneity. The plots illustrate the bias derived from heterogeneity. (a) shows a heterogeneous pattern that exists in a real-life dataset. If the model ignores the heterogeneity, the learned trend (i.e., the black dashed line in (a)) will discriminate against the female samples (i.e., red), as shown in (b).

6.2  Multi-Layer Factor Analysis (MLFA) framework. The first-layer factor analysis discovers the feature clusters V_i and sample clusters C_i. The original dataset X is balanced based on C_i and then separated into subsets X'_1, ..., X'_k, where each subset X'_i only contains the features in V_i. We then conduct the second-layer factor analysis on X'_1, ..., X'_k.

6.3  Example of MLFA outcomes. The plots compare the extracted factors using (a) traditional factor analysis and (b) the MLFA framework. The factor from MLFA shows a more separable structure.

6.4  The proposed fair modeling pipeline. The training dataset is pre-processed by MLFA and feature rescaling to mitigate the bias as well as to learn the pre-processing models. The pre-processing models are used to rescale the test dataset.

6.5  Effects of the parameter λ. The plots illustrate the performance change on fairness and accuracy, according to three metrics (Equal Accuracy, Statistical Parity, Accuracy), for the TILES (left) and Older Adults (right) data.

6.6  Effects of the parameter λ on the TILES dataset.

7.1  Performance comparison of data partitions with different heterogeneity levels α. A smaller α indicates a more heterogeneous distribution across clients. We report the average performance over 20 random seeds. (For EOD metrics, larger values indicate better fairness. Positive fairness metrics indicate that the unprivileged group outperforms the privileged group.)

7.2  Performance comparison of local debiasing strategies: FedAvg (i.e., no debiasing), reweighting, and fair class balancing (with 5-NN). We report the average performance over 20 random seeds on partitions with different heterogeneity levels α.

7.3  FairFed: group fairness-aware federated learning framework.

7.4  Effects of fairness budget β for K = 5 clients and heterogeneity parameter α = 0.2.

7.5  Effects of fairness budget η for K = 5 clients and heterogeneity parameter α = 0.5.

7.6  Effects of the number of clients.

7.7  Performance with clients that only contain data from one sensitive group.

7.8  Effects of different local debiasing strategies. We analyze the performance change when only a subset of the clients adopts the reweighting debiasing method.

7.9  Effects of different local debiasing strategies. We analyze the performance change when only a subset of the clients adopts the reweighting debiasing method.

7.10  Data distribution of the ACSIncome dataset. (a) compares the number of data points from different states. (b) compares the proportions of the white population across different states.

7.11  Effects of η on the ACSIncome dataset.

7.12  Effects of η on the TILES dataset.
Abstract
As human behavior data, along with machine learning techniques, are increasingly applied in decision-making scenarios ranging from healthcare to recruitment, guaranteeing the fairness of such systems is a critical criterion for their wide application. Despite previous efforts on fair machine learning, the heterogeneity and complexity of behavioral data impose further challenges to both model validity and fairness. The limited access to sensitive attributes (e.g., race, gender) in real-world settings makes it even more difficult to mitigate the unfairness of model outcomes. In this thesis, I investigate the effects of heterogeneity on model fairness and propose different modeling strategies to address the above challenges in three directions. First, I propose a strategy to mitigate bias via data balancing. Specifically, I design a class balancing algorithm for classification tasks, named fair class balancing, that can improve both model fairness and utility without accessing the sensitive attributes. Second, I study how different heterogeneity patterns of behavioral signals affect fairness performance. I then propose a modeling framework, Multi-Layer Factor Analysis (MLFA), to identify heterogeneous behavioral patterns without sensitive attributes. Experimental results show that mitigating the heterogeneity can enhance model fairness while achieving better utility. Third, I propose a method, FairFed, to improve group fairness in federated learning systems, which further enables learning large-scale machine learning models without directly accessing individuals' data. FairFed can effectively improve fairness under heterogeneous data distributions across clients in federated learning systems. The proposed methods exploit data properties to improve group fairness while maintaining good utility, which makes them promising for applications in behavior understanding systems.
Chapter 1
Introduction
Artificial intelligence (AI) and machine learning (ML) models have recently been applied extensively to understand and predict human behavior, often in applications with major societal implications, such as making recruitment decisions, estimating individual well-being, or assessing clinical treatments. However, the heterogeneous nature of human behaviors poses challenges in terms of model generalizability and fairness. Therefore, this thesis focuses on fair machine learning techniques for human behavior understanding.
1.1 Human Behavior Understanding
Understanding and predicting human behaviors using artificial intelligence have been applied in many life-affecting scenarios, including healthcare [1, 2] and recruitment [3]. However, despite the growing number of studies on human behavior understanding, most studies are limited to laboratory-setting experiments, and little is known about how these methods handle heterogeneous human behavior data collected in the wild. Heterogeneity is prevalent in human behavioral signals. For example, males generally have low-pitched voices while females have high-pitched voices; the average adult male heart rate is between 70 and 72 beats per minute, while the average for adult women is between 78 and 82 beats per minute. Such heterogeneous behaviors pose challenges to machine learning models.
1.2 Fairness in Machine Learning
The problem of fairness in machine learning has been drawing increasing research interest over the course of recent years. Much work has been done to improve fairness through different
aspects of the modeling process, including feature selection [4], data pre-processing [5, 6], model
adjustment [7, 8], and post-processing [9, 10]. Most of the existing work focuses on classifi-
cation, especially binary classification problems. However, in behavioral understanding applica-
tions, many tasks are solved by regression methods. Few papers considered fairness in regression
problems. Convex [11] and non-convex [12] frameworks have been explored to add fairness con-
straints to regression models. Quantitative definitions and theoretical performance guarantees in
fair regression have also been discussed in recent work [13].
1.3 Challenges
The major challenges in fair behavior understanding models are as follows:
• Heterogeneity of Human Behavior
Despite the increasing body of research on modeling human behavior and fair machine learn-
ing, most studies focus on homogeneous and objective measurements, and little has been dis-
cussed on how to mitigate the impact of heterogeneity on utility and fairness simultaneously.
The heterogeneous property of human behaviors also brings additional biases. Simpson’s
paradox [14], for instance, can bias the analysis of heterogeneous data that is composed of
subgroups or individuals with different behaviors. According to the paradox, an association
observed in data that has been aggregated over an entire population may be quite different
from, and even opposite to, associations found in the underlying subgroups. In longitudinal
studies, bias may also arise from differences in populations and behaviors over time. A recent work [15] investigated current state-of-the-art approaches for assessing mental health through social media and smart devices, and found that the evaluation approaches of choice reveal different types of outcome biases. When tested under a real-life setting, those state-of-the-art approaches can barely outperform the most naive baselines.
• Privacy Concerns of Sensitive Attributes: Most existing fair machine learning models require access to sensitive attributes to debias the models. However, in many real-world applications, sensitive attributes, like gender, race, etc., are not observable due to privacy concerns or legal restrictions. Recent regulations such as the European General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), or the Health Insurance Portability and Accountability Act (HIPAA) regulate the usage of personal data. For example, credit institutions cannot ask applicants who apply for credit about their race, nor access such information [Equal Credit Opportunity Act: 15, 12 CFR §1002.5(b)]. Similarly, insurance companies can no longer request race information from the individuals they insure [16]. From January 2020, potential employers are no longer allowed to request previous-salary information from prospective employees.
1.4 Thesis Contribution
The major contributions of this thesis are as follows:
• Effects of behavioral heterogeneity on fairness: I conduct thorough analyses of the heterogeneity patterns of human behaviors and how the heterogeneity affects model fairness. In particular, I consider the following research questions: (i) heterogeneity of behavioral signals; (ii) heterogeneity of behavioral annotations; (iii) heterogeneity of different modalities.
• Fairness mechanisms without accessing sensitive attributes: In order to address the privacy concerns of accessing the sensitive attributes in current fairness-aware models, I propose the following mechanisms to enhance model fairness without accessing sensitive attributes.
– Fair class balancing: fair class balancing is a class balancing algorithm for classifi-
cation tasks. It can improve both the model fairness and the classification performance
of the minority class.
– Heterogeneity identification: I propose a method to identify the heterogeneity pat-
terns of behavioral signals (e.g., physiological signals, physical activity signals). By
identifying such patterns, the bias across heterogeneous groups can be mitigated to
further improve the model fairness.
– Fair federated learning: federated learning (FL) is a modeling technique that is able to train large-scale models in a decentralized manner without requiring direct access to users' data. Due to the decentralized nature of FL, deploying traditional debiasing methods on each client locally cannot guarantee globally fair outcomes across different sensitive groups. I propose an aggregation method that can enable group fairness in FL settings.
Chapter 2
Survey of Techniques
2.1 Machine Learning for Human Behavior Understanding
Understanding and predicting human behaviors using artificial intelligence have been applied in
many life-affecting scenarios, including estimating job performance, making recruitment deci-
sions, and assisting clinical treatments. Recent advances in portable consumer technologies have
led to a surge in the development of electronic devices for monitoring and tracking human activity,
wellness, and behavior. Aided by the ubiquity of personal smartphones, Bluetooth, and Wi-Fi,
many devices currently on the market can discreetly collect physiologic and behavioral signals
and upload the information to remote servers. Because of the growing support for distributed and
personalized sensing, diverse research communities are taking a keen interest in this field, em-
powering the coordination of research studies of populations outside the laboratory and in natural
home or work environments (also known as studies in the wild). For research into everyday human
behavior, such as daily routines, studies conducted in natural settings can yield more relevant and
insightful data than those performed in the laboratory.
The strengths of social media data and wearable health sensors have been shown in numerous
successful longitudinal studies:
• StudentLife [17] was a ten-week study on 48 Dartmouth undergraduate and graduate students using passive and mobile sensor data to infer well-being, academic performance, and behavioral trends. The Dartmouth research team was able to predict GPA using activity, conversational interaction, mobility, and self-reported emotion and stress data over the semester. In addition, partying and studying trends were discovered to be associated with midterms, finals, and the number of deadlines.
• Reality Mining [18] was a nine-month study on 100 MIT students using mobile sensor data to track social interactions and networks. Mobile phone usage and information were used to
identify social patterns, significant locations, and organizational rhythms.
• Friends and Family [19] study collected data on 185 adults to study fitness intervention and
social incentives. The study showed mobile social sensing can be used for measuring and
predicting the health status of individuals by using mobility and communication patterns.
• SNAPSHOT [20] was a 30-day study on 66 MIT undergraduates using mobile sensors and
surveys to understand sleep, social interactions, affect, performance, stress and health. The
study predicted academic performance, sleep quality, stress level, and mental health using
personality traits, wearable sensors, and mobile phones. In addition, happiness was inferred
from physiology, phone, mobility, and behavioral data. The MIT research team discovered
the influence of sleep regularity on self-reported mental health and well-being.
• The NetHealth study [21] was a large-scale Fitbit study on approximately 700 Notre Dame students. The study investigated compliance rates using Fitbit data, smartphone data, and surveys. The Notre Dame research team discovered the relationship between compliance rates and both temporal dynamics and personality groups. These were novel longitudinal human behavioral studies that employed wearable sensors to understand the well-being, physical health, and academic performance of the student body.
• Tracking Individual Performance with Sensors (TILES) [22, 23] is a 10-week longitudinal study with hospital worker volunteers in a large Los Angeles hospital. The participants have heterogeneous work schedules (e.g., day and night shifts) and work units (e.g., nursing, tech and lab, and service). Over the course of ten weeks, 212 participants were recruited in three rolling recruitment periods between March and May of 2018, which respectively enrolled 52, 116, and 44 volunteer participants. Bio-behavioral signal data were collected from continuous sensing with a garment-based OMsignal sensor and a wristband-based Fitbit Charge 2 sensor. Participants were asked to respond to various ecological momentary assessments (EMAs) using their smartphones during data collection. This daily in-situ probing of participants provides additional state information such as context, stress, anxiety, job performance, and mood.
2.2 Definitions of ML Fairness
The definition of fairness has been studied by philosophers and psychologists for decades. However, there is still no universal definition of fairness. Different preferences and outlooks across cultures favor different ways of looking at fairness, which makes it hard to come up with a single definition that is acceptable to everyone in every situation. In recent years, researchers [16] tried to tie previous fairness definitions in political philosophy to machine learning applications. Authors in [60] studied the 50-year history of fairness definitions in the areas of education and machine learning. In [122], the authors listed and explained some of the definitions used for fairness in algorithmic classification problems. In [112], the authors studied the general public's perception of some of these fairness definitions from the computer science literature.
The definitions of fairness in computer science literature are formalized from their correspond-
ing notions from the social sciences and political philosophy literature. In general, fairness defini-
tions fall under two types: group fairness and individual fairness. Group fairness attempts to define
fairness in terms of statistical parity criteria that are imposed on the decisions or predictions pro-
duced by an algorithm. These criteria typically require that some form of statistical parity holds between the treatment of different social groups by the algorithmic decision-maker. For instance, one such criterion requires that an algorithm produces the same false-positive rate for people from different racial groups. Individual fairness means that similar outcomes go to similar individuals.
Sensitive attribute                        FHA    ECOA
Age                                               X
Color                                      X      X
Disability                                 X
Exercised rights under CCPA                       X
Familial status (household composition)    X
Gender identity                                   X
Marital status (single or married)                X
National origin                            X      X
Race                                       X      X
Recipient of public assistance                    X
Religion                                   X      X
Sex                                        X      X

Table 2.1: A list of the sensitive attributes as defined by the Fair Housing Act (FHA) and Equal Credit Opportunity Act (ECOA).
In this work, we focus on the machine learning techniques of group fairness. Here we first
clarify several terms and definitions that are used in this work.
• Sensitive Attribute: Sensitive attributes refer to traits, identified by law, against which it is illegal to discriminate, such as race and gender. In fairness literature, these attributes are considered to be "protected" or "sensitive" attributes. Table 2.1 summarizes such
sensitive attributes defined by the Fair Housing Act (FHA) [24] and Equal Credit Opportu-
nity Act (ECOA) [25].
• Sensitive Group: Sensitive groups are sub-populations of the given dataset, for example, the female and male groups. Membership in a sensitive group is defined based on the sensitive attribute.
• Privileged/Unprivileged Groups: Privileged groups are at a systematic advantage, and unprivileged groups are at a systematic disadvantage, with respect to the outcomes across different sensitive groups.
Let X, A, and Z represent a set of individuals (i.e., a population), the protected attributes, and the remaining attributes, respectively. Each individual can be assigned an outcome from a finite set Y. Some of the prediction outcomes are more beneficial or desirable than others. For an individual x_i ∈ X, let y_i be the true outcome (or label) to be predicted. A predictor can be represented by a mapping H : X → Y from the population X to the set of outcomes Y, such that H(x_i) is the predicted outcome for individual x_i. A group-conditional predictor consists of a set of mappings, one for each group of the population, H = {H_S} for all S ⊂ X. For the sake of simplicity, assume that the groups induce a partition of the population.
2.2.1 Equal Opportunity
Equal Opportunity is a state of fairness where the machine learning model predicts the preferred
label (i.e., one that confers an advantage or benefit to a person) equally well for all sensitive groups.
In other words, equal opportunity ensures that the people who should qualify for an opportunity
are equally likely to do so regardless of the value of the sensitive attributes. For example, qualified
students should be equally likely to be admitted to a college irrespective of their gender identifica-
tion.
Definition 1 (Equal Opportunity [26]) Equal opportunity measures a binary predictor Ŷ with respect to A and Y:

Pr{Ŷ = 1 | A = 1, Y = 1} = Pr{Ŷ = 1 | A = 0, Y = 1}.
This means that the probability of a person in a positive class being assigned to a positive outcome
should be equal for both protected and unprotected (female and male) group members. In other
words, the equal opportunity definition states that the protected and unprotected groups should
have equal true positive rates.
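As an illustration, the following Python sketch (a hypothetical helper, not code from this thesis) computes the equal opportunity difference directly from Definition 1, assuming binary labels, binary predictions, and a binary sensitive attribute stored as numpy arrays.

import numpy as np

def equal_opportunity_difference(y_true, y_pred, a):
    # Pr{Y_hat = 1 | A = 0, Y = 1} - Pr{Y_hat = 1 | A = 1, Y = 1},
    # i.e., the true positive rate gap between the unprivileged (a = 0)
    # and privileged (a = 1) groups; a value of 0 indicates equal opportunity.
    tpr_unpriv = y_pred[(a == 0) & (y_true == 1)].mean()
    tpr_priv = y_pred[(a == 1) & (y_true == 1)].mean()
    return tpr_unpriv - tpr_priv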
2.2.2 Equalized Odds
Equalized Odds is a fairness metric that checks whether, for any particular label and sensitive attribute, a machine learning model predicts that label equally well for all values of that sensitive attribute. Different from Equal Opportunity, Equalized Odds checks the model performance for both qualified and unqualified samples. If samples are qualified, they are equally likely to get the favored outcome, and if they are not qualified, they are equally likely to get rejected.
Definition 2 (Equalized Odds [26]) A binary predictor Ŷ satisfies equalized odds with respect to sensitive attribute A and outcome Y if Ŷ and A are independent conditional on Y:

Pr{Ŷ = 1 | A = 1, Y = y} = Pr{Ŷ = 1 | A = 0, Y = y},

where y ∈ {0, 1}. For the outcome y = 1, the constraint measures the difference of true positive rates across different sensitive groups. For y = 0, the constraint measures false positive rates.
This means that the probability of a person in the positive class being correctly assigned a positive
outcome and the probability of a person in a negative class being incorrectly assigned a positive
outcome should both be the same for the protected and unprotected (male and female) group mem-
bers. In other words, the equalized odds definition states that the protected and unprotected groups
should have equal true positive and equal false positive rates.
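A corresponding sketch for Definition 2, again assuming binary numpy arrays and illustrative names, checks both conditional rates (y = 1 gives true positive rates, y = 0 gives false positive rates) and reports the larger gap.

import numpy as np

def equalized_odds_difference(y_true, y_pred, a):
    # Compare Pr{Y_hat = 1 | A, Y = y} across groups for y in {0, 1}:
    # y = 1 gives the TPR gap, y = 0 gives the FPR gap.
    def rate(group, y):
        return y_pred[(a == group) & (y_true == y)].mean()
    tpr_gap = abs(rate(0, 1) - rate(1, 1))
    fpr_gap = abs(rate(0, 0) - rate(1, 0))
    return max(tpr_gap, fpr_gap)  # 0 indicates equalized odds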
2.2.3 Statistical Parity
Statistical Parity is a fairness metric that is satisfied if the model’s outcomes are not dependent on
a given sensitive attribute.
Definition 3 (Statistical Parity [27]) Statistical parity rewards the classifier for classifying each group as positive at the same rate. The statistical parity of a binary predictor Ŷ is

Pr{Ŷ = 1 | A = 1} = Pr{Ŷ = 1 | A = 0}.
The demographic parity definition states that people in both protected and unprotected (female
and male) groups should have equal probability of being assigned to a positive outcome.
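Definition 3 compares only the positive prediction rates, so it needs no ground-truth labels; a minimal sketch with illustrative names follows.

import numpy as np

def statistical_parity_difference(y_pred, a):
    # Pr{Y_hat = 1 | A = 0} - Pr{Y_hat = 1 | A = 1}
    return y_pred[a == 0].mean() - y_pred[a == 1].mean()

# Example: both groups receive the positive outcome at rate 2/3, so the gap is 0.
y_pred = np.array([1, 0, 1, 1, 0, 1])
a = np.array([0, 0, 0, 1, 1, 1])
print(statistical_parity_difference(y_pred, a))  # 0.0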
2.2.4 Fairness Definitions in Regression Tasks
The metrics above are defined for classification, especially binary classification tasks. However, in behavioral understanding applications, many tasks are solved by regression methods. In this thesis, we adopt fairness metrics based on quantitative definitions generally used in the literature, where the goal is to predict a true outcome Y ∈ [a, b] from a feature vector X based on labeled training data. The fairness of the prediction Ŷ of model M is evaluated with respect to sensitive groups of individuals defined by sensitive attributes A, such as gender or race. Sensitive attributes A are assumed to be binary, i.e., A ∈ {0, 1}, where A = 1 represents the privileged group (e.g., male), while A = 0 represents the underprivileged group (e.g., female). Such simplifications are generally used in the literature, although criticisms of reductionism are widely acknowledged [28, 29].
Definition 4 (Statistical Parity [13]) A model M satisfies statistical parity under a distribution over (X, A, Y) if M(X) is independent of the protected attribute A. Since M(X) ∈ [a, b], the statistical parity of M is defined as

Pr[M(X) ≥ z | A = 0] = Pr[M(X) ≥ z | A = 1],

where z ∈ [a, b].
Statistical Parity aims to equalize the distributions of the outcomes across different sensitive groups. Following the definition above, we define two metrics to quantify the distribution difference.
Definition 5 (Equal Accuracy [13]) Equal Accuracy rewards the model M for predicting each group equally accurately. The equal accuracy of M is defined as

E[ε(Y, M(X)) | A = 0] = E[ε(Y, M(X)) | A = 1],

where ε(Y, M(X)) represents the estimation error.
The definition of Equal Accuracy is similar to the Equal Opportunity and Equalized Odds definitions for classification tasks. It aims to equalize the model performance across different sensitive groups, preventing situations where the model has a high error on some of the groups.
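One way to operationalize Definitions 4 and 5 for a regression model is sketched below: the equal accuracy gap compares group-wise mean absolute errors, and the statistical parity gap takes the largest difference in Pr[M(X) ≥ z | A] over a grid of thresholds z (a Kolmogorov-Smirnov-style summary). These are illustrative quantifications under the binary-attribute assumption above, not necessarily the exact metrics adopted later in this thesis.

import numpy as np

def equal_accuracy_gap(y_true, y_pred, a):
    # Gap in mean absolute error between the unprivileged (a = 0)
    # and privileged (a = 1) groups; 0 indicates equal accuracy.
    mae_unpriv = np.abs(y_true[a == 0] - y_pred[a == 0]).mean()
    mae_priv = np.abs(y_true[a == 1] - y_pred[a == 1]).mean()
    return mae_unpriv - mae_priv

def statistical_parity_gap(y_pred, a, n_thresholds=100):
    # max over z of | Pr[M(X) >= z | A = 0] - Pr[M(X) >= z | A = 1] |
    zs = np.linspace(y_pred.min(), y_pred.max(), n_thresholds)
    p_unpriv = np.array([(y_pred[a == 0] >= z).mean() for z in zs])
    p_priv = np.array([(y_pred[a == 1] >= z).mean() for z in zs])
    return np.abs(p_unpriv - p_priv).max()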
2.3 Fairness-aware ML Techniques
Fairness issues in machine learning have been drawing increasing research interest over the course of recent years. Much work has been done to improve fairness through different aspects of the modeling process, including feature selection [4], data pre-processing [5, 6], model adjustment [7, 8], and post-processing [9, 10].
Most of the existing work focuses on classification, especially binary classification problems.
However, in human behavior understanding applications, many tasks are solved by regression
methods. Few papers considered fairness in regression problems. Convex [11] and non-convex
[12] frameworks have been explored to add fairness constraints to regression models. Quantitative
definitions and theoretical performance guarantees in fair regression have also been discussed in
recent work [13].
In this section, we review some of the most widely used fairness-aware machine learning mechanisms.
2.3.1 Pre-processing Methods
Pre-processing techniques attack the problem by removing the underlying discrimination from the
data before any modeling [39]. If the algorithm is allowed to modify the training data, then pre-
processing can be used [10].
2.3.1.1 Reweighting
Reweighing [30] is a preprocessing technique that weights the examples in each (group, label)
combination differently to ensure fairness before classification. The algorithm focuses on the dis-
tribution of the sensitive attributes and the target variable. It calculates the probability of assigning the favorable label (e.g., y = 1) assuming the sensitive attribute A and y are independent. Then, the algorithm divides the calculated theoretical probability by the true, empirical probability of this event; the resulting ratios are used as the weight vector over the samples in the dataset.
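In other words, each (group, label) cell receives the weight P(A = a)P(Y = y) / P(A = a, Y = y). A minimal sketch of this computation (a hypothetical helper, not the implementation of [30]):

import numpy as np

def reweighing_weights(y, a):
    # Weight for each sample: expected probability of its (group, label) cell
    # under independence of A and Y, divided by the observed probability.
    w = np.ones(len(y), dtype=float)
    for a_val in np.unique(a):
        for y_val in np.unique(y):
            cell = (a == a_val) & (y == y_val)
            p_expected = (a == a_val).mean() * (y == y_val).mean()
            p_observed = cell.mean()
            if p_observed > 0:
                w[cell] = p_expected / p_observed
    return w  # pass as sample_weight to a downstream classifier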
2.3.1.2 Learning Fair Representation
Learning fair representations [31] is a pre-processing technique that finds a latent representation
which encodes the data well but obfuscates information about sensitive attributes. The proposed
algorithm collapses the data down to a set of "prototypes" of the input and parameters that define the mapping from the original input space to the representation space. The optimization's goal is
then to learn the appropriate locations for the prototypes and the appropriate mappings from the
inputs to the prototypes. The mapping is optimized based on two fairness-aware performance met-
rics: (1) minimizing discrimination and (2) maximizing the difference between accuracy (desired
to be higher) and discrimination (desired to be lower).
2.3.1.3 Disparate Impact Remover
Disparate impact remover [6] is a pre-processing technique that edits feature values to increase group fairness while preserving rank ordering within groups. This method evaluates fairness by comparing the proportion of favorable outcomes between the privileged group and the unprivileged group. The algorithm then moves the values of all the non-sensitive features closer to each other such that there is less disparity between groups.
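A simplified sketch of such a rank-preserving repair for a single non-sensitive feature is shown below. For brevity it repairs toward the pooled distribution rather than the median distribution over groups used in [6], and the repair_level parameter (illustrative) interpolates between the original and fully repaired values.

import numpy as np

def repair_feature(x, a, repair_level=1.0):
    # Map each value toward the pooled quantile at its within-group rank,
    # which pulls the group-conditional distributions together while
    # preserving the rank ordering inside each group.
    x = np.asarray(x, dtype=float)
    repaired = x.copy()
    grid = np.linspace(0.0, 1.0, 101)
    pooled_quantiles = np.quantile(x, grid)
    for g in np.unique(a):
        mask = (a == g)
        ranks = np.searchsorted(np.sort(x[mask]), x[mask], side="right") / mask.sum()
        target = np.interp(ranks, grid, pooled_quantiles)
        repaired[mask] = (1 - repair_level) * x[mask] + repair_level * target
    return repaired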
2.3.2 In-processing Methods
In-processing techniques can be considered modifications of traditional learning algorithms to address discrimination during the model training phase [39]. If it is allowed to change the learning procedure for a machine learning model, then in-processing can be used during the training of a model, either as a constraint or incorporated into the objective function [10, 13].
2.3.2.1 Prejudice Remover
Prejudice remover [7] is an in-processing technique that adds a discrimination-aware regularization
term to the learning objective. The prejudice remover regularizer enforces that the decisions are independent of the sensitive attributes, which reduces the bias of the classification model. This modeling strategy can be applied to any algorithm with a probabilistic discriminative model.
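A minimal sketch of a prejudice-index-style regularizer is given below (PyTorch, illustrative names, not the exact regularizer of [7]): the penalty approximates the mutual information between the prediction and a binary sensitive attribute from a batch of predicted probabilities, and is added to the usual classification loss with a weight eta.

import torch

def prejudice_index(probs, a, eps=1e-8):
    # probs: predicted P(Y = 1 | x) for a batch; a: binary sensitive attribute (0/1).
    # Approximates sum_i sum_y P(y | x_i) * log( P(y | a_i) / P(y) ).
    p = torch.stack([1 - probs, probs], dim=1)      # (n, 2): P(y=0|x), P(y=1|x)
    p_y = p.mean(dim=0)                             # marginal P(y), estimated on the batch
    pi = probs.new_zeros(())
    for g in (0, 1):
        mask = (a == g)
        if mask.any():
            p_y_given_a = p[mask].mean(dim=0)       # P(y | A = g)
            pi = pi + (p[mask] * torch.log((p_y_given_a + eps) / (p_y + eps))).sum()
    return pi

# Combined objective for a probabilistic classifier, with trade-off parameter eta:
# loss = binary_cross_entropy(probs, y) + eta * prejudice_index(probs, a) / len(y)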
2.3.2.2 Adversarial Learning
Adversarial debiasing [8] is an in-processing technique that learns a machine learning model to
maximize prediction accuracy and simultaneously reduce an adversary’s ability to determine the
protected attribute from the predictions. This approach leads to a fair classifier as the predictions
cannot carry any group discrimination information that the adversary can exploit.
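The sketch below illustrates this idea in PyTorch under simplifying assumptions (binary task, the adversary only sees the predictor's output logit, alternating updates); it is a toy version of the scheme in [8], with illustrative names and architectures rather than the original implementation. The parameter alpha trades prediction accuracy against the adversary's failure.

import torch
import torch.nn as nn

def adversarial_debiasing(X, y, a, epochs=200, alpha=1.0, lr=1e-2):
    # X: (n, d) float tensor; y, a: (n, 1) float tensors with values in {0, 1}.
    predictor = nn.Sequential(nn.Linear(X.shape[1], 16), nn.ReLU(), nn.Linear(16, 1))
    adversary = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 1))
    opt_pred = torch.optim.Adam(predictor.parameters(), lr=lr)
    opt_adv = torch.optim.Adam(adversary.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()

    for _ in range(epochs):
        # Step 1: train the adversary to recover A from the predictor's output.
        logits = predictor(X).detach()
        adv_loss = bce(adversary(logits), a)
        opt_adv.zero_grad()
        adv_loss.backward()
        opt_adv.step()

        # Step 2: train the predictor to fit Y while making the adversary fail.
        # (Stale gradients on the adversary are cleared by zero_grad in Step 1.)
        logits = predictor(X)
        pred_loss = bce(logits, y) - alpha * bce(adversary(logits), a)
        opt_pred.zero_grad()
        pred_loss.backward()
        opt_pred.step()

    return predictor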
2.3.3 Post-processing Methods
Post-processing is the final class of methods and can be performed post-training. It relies on access to a holdout set that was not involved in the model's training phase [39]. If the algorithm can only treat the learned model as a black box, without any ability to modify the training data or learning algorithm, then only post-processing can be used, where data is labeled by some black-box model and then relabeled as a function only of the original labels [10, 13].
2.3.3.1 Reject Option Classification
Reject option classification [32] is a post-processing technique that gives favorable outcomes to unprivileged groups and unfavorable outcomes to privileged groups in a confidence band around the decision boundary with the highest uncertainty. In this approach, the assumption is that most discrimination occurs when a model is least certain of the prediction, i.e., around the decision boundary (classification threshold). Thus, by exploiting the low-confidence region of a classifier for discrimination reduction and rejecting its predictions there, we can reduce the bias in model predictions.
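A minimal sketch of this rule, assuming calibrated positive-class scores and a band half-width theta (both names are illustrative), is:

import numpy as np

def reject_option_predict(scores, a, theta=0.1):
    # scores: P(Y = 1 | x); a: 1 = privileged, 0 = unprivileged.
    y_pred = (scores >= 0.5).astype(int)
    critical = np.abs(scores - 0.5) <= theta   # low-confidence band around the boundary
    y_pred[critical & (a == 0)] = 1            # favorable outcome for the unprivileged group
    y_pred[critical & (a == 1)] = 0            # unfavorable outcome for the privileged group
    return y_pred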
2.3.3.2 Calibrated Equalized Odds Post-processing
Calibrated equalized odds post-processing [33] is a technique that optimizes over calibrated classifier score outputs to find the probabilities with which to change output labels, under an equalized odds objective.
2.3.4 Fair ML with Access Constraints of Sensitive Attributes
In addition to model fairness, data privacy is another crucial factor of trustworthy machine learn-
ing systems, especially the privacy of sensitive attributes. However, in order to implement the fair
machine learning mechanisms, the modeler must have access to the sensitive attributes for indi-
viduals in the training data. This may be undesirable for several reasons [34]. To enhance the privacy guarantees of fair ML, some recent work explores different strategies to address this challenge.
2.3.4.1 Methods with Encrypted Sensitive Attributes
One way to address the privacy concerns is to use encrypted sensitive attribute data in model training. A recent line of work [35, 36] proposed approaches based on the cryptographic tool of Secure Multiparty Computation (MPC). Such methods assume the existence of a set of regulatory agencies, which either are trusted parties holding users' sensitive attribute data [35] or hold among them a secret sharing of the sensitive attributes, provided by the users themselves [36]. In this approach, modelers and regulators apply standard fair machine learning techniques in a distributed fashion. The MPC outputs a fair model without directly accessing the sensitive attribute data. However, MPC does not guarantee that the predictor cannot leak individual information [37].
2.3.4.2 Methods with Differentially Private Sensitive Attributes
Another potential workaround to this problem is to allow the individuals to release their data in a
locally differentially private manner [38] and then try to learn a fair model from the private data.
This allows us to guarantee that our decisions are fair while maintaining a degree of individual
privacy to each user.
Mozannar et al. [39] studied learning fair models when the protected attributes are privatized or
noisy, and proposed ways to adapt existing fair ML models to work with privatized sensitive
attributes. Awasthi et al. [40] considered a general noise model for the sensitive attributes in the
training data, but assumed access to the actual protected attributes at test time.
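For intuition about the local DP setting these works build on, below is a minimal sketch of randomized response for a binary sensitive attribute, a standard local DP mechanism (not code from [38, 39, 40]); a fair-learning step downstream must then correct for the known flip probability.

```python
import numpy as np

# a: array of 0/1 sensitive-attribute values; epsilon: local privacy budget.
# Smaller epsilon means noisier (more private) reports.
def randomized_response(a, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + 1)  # keep true value w.p. p_truth
    keep = rng.random(len(a)) < p_truth
    return np.where(keep, a, 1 - a)                    # otherwise flip the bit
```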
2.3.4.3 Methods without Sensitive Attributes
All of the strategies above still require access to privatized or encrypted sensitive attributes
in order to mitigate bias. However, collecting sensitive attributes might be difficult, or even
forbidden by law, in real-world applications. Recently, a few studies have explored different
strategies to address the issue.
One typical solution is to use non-sensitive information as a proxy for sensitive attributes. Previous
work [41] has shown that non-sensitive information can be highly correlated with sensitive
attributes. Proxy fairness [42] leverages the correlations between proxy features and true sensitive
attributes: proxy features are used as the alternative to the sensitive attribute(s) when applying a
standard fairness-improving strategy. Weighted estimators [43] have also been discussed as a way to
assess proxy models. Although the existence of proxy features offers hope for improving
fairness with unobserved sensitive attributes, identifying perfect proxy groups remains a challenging
task.
Another line of research [44, 45, 46] focuses on optimizing Rawlsian Max-Min fairness
without accessing the sensitive attributes, which aims to train a model that maximizes the minimum
expected utility across sensitive groups. However, these methods optimize a different
fairness definition, which might not easily extend to the more widely applied
fairness definitions mentioned in §2.2.
2.4 Open Problems
Despite the increasing interest in human behavior understanding and ML fairness, few studies
have discussed the fairness issues of behavior understanding systems, especially for
applications with access constraints on sensitive attributes. Such application scenarios
leave the following open problems for future research.
Heterogeneity of human behaviors.
Heterogeneity is prevalent in behavioral signals [47] and annotations [48, 49, 3]. This, in turn,
can affect machine learning models’ accuracy [50, 51, 52] (e.g., due to the inability to generalize
to unseen subjects) and fairness [53] (e.g., behavioral signals may expose sensitive information,
increasing the risk of unfair predictions).
There are a number of recent studies focusing on bias and fairness in human behavior understanding
systems. Most of the relevant work focuses on unimodal models, for example, facial
expression [54] and speech valence recognition [55]. Despite the growing number of studies on
multimodal modeling for human behavior, analyses of the fairness and biases of such systems are still lacking.
Studies have also discussed the inherent biases of human annotations that are prevalent in behavioral
data [49, 3]. Unfortunately, only limited attention has been devoted to labeling bias [56, 57], and
that work focuses on labeling bias in classification tasks. Behavioral annotations, however, are more
complex and heterogeneous, and the associated biases are harder to address.
Utility of fair behavioral models.
Most fairness-aware modeling approaches improve model fairness at some cost in prediction accuracy.
Studies [58, 59] have investigated the fairness-accuracy trade-off in different models. However, for
human behavior understanding models, especially in sensitive applications such as healthcare and
criminal justice, this trade-off is often undesirable, as any increase in prediction error could have
devastating consequences. [60] studied the sources of bias in classification models: it decomposed
the modeling bias into bias, variance, and noise, and pointed out that the discrimination level is
sensitive to the quality of the training data, so well-designed data collection can reduce bias without
sacrificing accuracy. [56] further discussed the effects of fairness interventions on accuracy when the
training data is unrepresentative or biased, revealing that mitigating the bias of training data also has
fairness benefits. These results suggest that debiasing mechanisms based on training data properties
are worth pursuing to enhance both model utility and fairness.
Access to sensitive attributes.
With increasing concerns about data privacy, fair machine learning systems that do not rely on sensitive
attributes are more favored in applications. As we have discussed in §2.3.4, there are two typical
ways to address this issue. One approach is to use either encrypted [35, 36] or differentially private
sensitive attributes [39, 40] in the modeling process; however, those methods still require some
access to the sensitive attributes. The methods without sensitive attributes either use
proxy features [42] or optimize non-traditional group fairness definitions [44, 45, 46]. Thus,
fair machine learning without sensitive attributes remains an open research question that needs further
exploration.
Another modeling strategy that does not require direct access to users' data is federated
learning (FL). Federated learning is able to train large-scale models in a decentralized manner, thus
avoiding the exchange of any explicit information about each client's local data. FL also introduces
new fairness challenges. Due to the non-IID nature of the data distribution across clients, the
full data distribution may not be represented by any single local distribution at any of the clients.
Thus, local bias mitigation cannot guarantee global group fairness. Despite increasing interest
in fair FL, most existing studies [61, 62] focus on equalizing the performance and participation
across different participating devices/silos. Only a few works [63, 64] have attempted to target
group fairness for groups based on sensitive attributes in FL, and they require each client to share the
statistics of the sensitive attributes of its local dataset with the server, which is discouraged in FL
systems.
Chapter 3
Taxonomy of Research Questions
3.1 Introduction
With the increasing research interest in fair machine learning, many datasets and fairness metrics
have been proposed for fairness research. In this chapter, we summarize the datasets
and metrics that are used for each proposed work.
3.2 Benchmark Datasets
3.2.1 Traditional Datasets for Fairness Research
In order to provide a more standardized evaluation of fairness research, several datasets are
specifically used to study bias and fairness issues in machine learning. Those datasets usually have
several properties favorable for fairness research: they provide sensitive attribute information, contain
human-generated data, and target crucial decision-making tasks (e.g., criminal justice, credit scoring).
Below we list some of the widely known datasets that are used in this work.
3.2.1.1 Adult
The Adult dataset [65] contains 32,561 records of yearly income (represented as a binary label:
over or under $50,000) and twelve categorical or continuous features including education, age,
and job type. The gender (binary, male or female) of each subject is considered the
sensitive attribute.
3.2.1.2 COMPAS
The ProPublica COMPAS dataset [66] relates to recidivism: the task is to assess whether a criminal
defendant will commit an offense within a certain future time window. The dataset was gathered by
ProPublica, with information on 6,167 criminal defendants who were subject to screening by COMPAS,
a commercial recidivism risk assessment tool, in Broward County, Florida, in 2013-2014. Features in
this dataset include the number of prior criminal offenses, the age of the defendant, etc. The race
(binary, white/non-white) of the defendant is the sensitive attribute of interest.
3.2.1.3 Violent Crime
The violent recidivism version of the ProPublica data [66] describes the same scenario as the recidi-
vism data described above, but where the predicted outcome is a rearrest for a violent crime within
two years. 4,010 individuals are included. The race (binary, white/non-white) of the defendant is
the sensitive attribute of interest.
3.2.1.4 ACSIncome
The ACSIncome dataset [67] is constructed from the American Community Survey (ACS) Public Use
Microdata Sample (PUMS) over all 50 states and Puerto Rico in 2018, with a total of 1,664,500
data points. The modeling task is predicting whether an individual's income is above $50,000 based
on features including employment type, education background, marital status, etc.
3.2.2 Human Behavior Datasets
Despite the wide usage of traditional fairness benchmark datasets, they cannot fully
represent the characteristics of human behavior data. For example, human behavior signals commonly
consist of continuous variables, whereas most features in traditional fairness datasets are ordinal
or discrete. Moreover, many behavior understanding tasks aim to predict continuous
values (e.g., personality scores, affect states), and the traditional fairness benchmarks lack
datasets with continuous target variables.
Therefore, we use the following human behavior datasets to evaluate our work in real-life
applications. These datasets all provide the sensitive attributes of each record, which enables
accurate measurement of fairness performance.
3.2.2.1 TILES
Tracking Individual Performance with Sensors [23] is a 10-week longitudinal study of 212
hospital worker volunteers in a large Los Angeles hospital, of whom 30% are male
and 70% female. Bio-behavioral signals, including heart rate, breathing rate, accelerometer readings,
etc., were collected from continuous sensing with garment-based (OMsignal) and wristband (Fitbit)
sensors. All participants completed the personality assessment at the beginning of the study, during
a 2-hour in-person enrollment session. The Big Five Inventory is used for the personality traits
in the TILES dataset.
3.2.2.2 YouTube Personality
The YouTube Personality dataset [68] consists of a collection of behavioral features and personality
impression scores for a set of 404 YouTube vloggers, 194 male (48%) and 210 female
(52%), who explicitly show themselves in front of a webcam talking about a variety of topics,
including personal issues, politics, movies, books, etc. There is no content-related restriction,
and the language used in the videos is natural and diverse. Audio cues and visual activity features
are extracted from the video clips. Big Five Inventory personality scores were collected using Amazon
Mechanical Turk (MTurk): annotators watched one-minute slices of each vlog and rated
impressions using a personality questionnaire.
3.2.2.3 First Impression
The First Impressions dataset [69] was used in the 2017 ChaLearn challenge at the CVPR conference.
This dataset comprises 10,000 clips (with an average duration of 15s) extracted from more than
3,000 different high-definition YouTube videos of people facing a camera and speaking in English.
People in the videos differ in gender, age, nationality, and ethnicity. Videos are labeled with
personality trait variables; Amazon Mechanical Turk (MTurk) was used to annotate the personality
traits for each video clip. The Big Five personality traits are used in this dataset: Openness (Ope.),
Conscientiousness (Con.), Extraversion (Ext.), Agreeableness (Agr.), and Neuroticism (Neu.).
Additionally, each video was labeled with a variable named "Interview" (Int.) indicating whether
the subject should be invited to a job interview or not.
The 10,000 clips are split into a 60% train set, 20% validation set, and 20% test set. All subsets
contain around 86% Caucasian, 11% African American, and 3% Asian subjects; 55% of the samples
are female and 45% are male.
3.2.2.4 Older Adults
The older adults dataset [70] was collected to study the relationship between physical fitness and
cognitive performance. 70 older adults (28 male and 42 female) with a mean age of 71 ± 4.7 years were
recruited. Physical activity tests included the 6-minute walk test, Bicep Curls, Static and Dynamic
Balance, Timed Up and Go, Sit to Stand, Grip Strength, and Functional Reach. The Stroop Task
[71] is used to measure cognitive performance.
Category      Dataset Name          Task                          Size
Traditional   Adult                 Classification                32,561
Traditional   COMPAS                Classification                6,167
Traditional   Violent Crime         Classification                4,010
Behavior      TILES                 Classification & Regression   212
Behavior      YouTube Personality   Regression                    404
Behavior      First Impression      Regression                    10,000
Behavior      Older Adults          Regression                    70

Table 3.1: Summary of benchmark datasets.
3.3 Adopted Fairness Metrics
In §2.2, we introduced several group fairness goals from the literature. In general, the definitions
belong to two categories.
One category of fairness goals aims to balance the outcome distribution across sensitive groups,
i.e., the predictions should be independent of the sensitive attributes. In classification, it corresponds
to the practice of affirmative action [72] and is also invoked to address disparate impact under the
US Equal Employment Opportunity Commission's "four-fifths rule," which requires that the "selection
rate for any race, sex, or ethnic group [must be at least] four-fifths (4/5) (or eighty percent) of the
rate for the group with the highest rate." In regression, it requires that the distributions (e.g., the
cumulative distribution function (CDF)) of model outcomes are similar across sensitive groups.
Statistical parity is an example of such fairness definitions.
A second category of goals aims to balance model performance across sensitive groups; for example,
the model should be as accurate for the female population as for the male population. Different
definitions in this category arise from the various criteria used to measure model performance. For
classification tasks, equal opportunity aims to equalize true positive rates, while equalized odds is
based on both true positive and false positive rates. For regression tasks, similar fairness goals can
be achieved by equalizing the estimation errors across sensitive groups, namely Equal Accuracy. In
this section, we list the formalized fairness metrics we use to measure the models' fairness
performance in this thesis.
Below, we first list fairness metrics for classification tasks. For the sake of simplicity, all
metrics are defined with a binary sensitive attribute A ∈ {0, 1} and a binary target variable Y ∈ {0, 1},
where A = 1 represents the privileged group, A = 0 represents the unprivileged group, Y = 1
represents the favorable outcome, and Y = 0 represents the unfavorable outcome.
Metric 1 (Equal Opportunity Difference (EOD)) Equal opportunity measures a binary predictor
Ŷ with respect to A and Y:

EOD = Pr{Ŷ = 1 | A = 0, Y = 1} − Pr{Ŷ = 1 | A = 1, Y = 1}.
Metric 2 (Average Equalized Odds Difference (EOddsD)) The average equalized odds difference
computes the average difference of the false positive rate (false positives / negatives) and the true
positive rate (true positives / positives) between the unprivileged and privileged groups. A binary
predictor Ŷ satisfies equalized odds with respect to sensitive attribute A and outcome Y if Ŷ and A
are independent conditional on Y; the corresponding difference is

EOddsD = 1/2 [Pr{Ŷ = 1 | A = 0, Y = 0} − Pr{Ŷ = 1 | A = 1, Y = 0}]
       + 1/2 [Pr{Ŷ = 1 | A = 0, Y = 1} − Pr{Ŷ = 1 | A = 1, Y = 1}].
Metric 3 (Statistical Parity Difference (SPD)) Statistical parity rewards the classifier for
classifying each group as positive at the same rate. The statistical parity difference of a binary
predictor Ŷ is

SPD = Pr{Ŷ = 1 | A = 0} − Pr{Ŷ = 1 | A = 1}.
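As a concrete reference, the sketch below computes EOD, EOddsD, and SPD from arrays of ground-truth labels, predictions, and a binary sensitive attribute; the function names are ours and follow the conventions above (A = 1 privileged, Y = 1 favorable).

```python
import numpy as np

def rate(y_hat, mask):
    """P(y_hat = 1) within the subset selected by mask."""
    return y_hat[mask].mean() if mask.any() else np.nan

def classification_fairness_metrics(y, y_hat, a):
    # True positive and false positive rates per sensitive group.
    tpr0, tpr1 = rate(y_hat, (a == 0) & (y == 1)), rate(y_hat, (a == 1) & (y == 1))
    fpr0, fpr1 = rate(y_hat, (a == 0) & (y == 0)), rate(y_hat, (a == 1) & (y == 0))
    eod = tpr0 - tpr1
    eoddsd = 0.5 * ((fpr0 - fpr1) + (tpr0 - tpr1))
    spd = rate(y_hat, a == 0) - rate(y_hat, a == 1)
    return {"EOD": eod, "EOddsD": eoddsd, "SPD": spd}
```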
For regression tasks, we revise the above metrics to fit continuous target variables. Metric EA
measures the accuracy difference across sensitive groups. Metric SP-regression (SP_r) measures
the distribution difference by comparing the average outcome of each sensitive group. Considering
possible demographic disparity in the ground truth, we further propose Metric SP_MI, which uses
mutual information (MI) to quantify the "amount of information" obtained about the sensitive
attributes through observing the ground truth and the estimations. The MI difference between ground
truth and estimations thus measures the leakage of sensitive information in model outcomes.
Metric 4 (EA) For a model M under a distribution over (X, A, Y), we choose Mean Absolute
Error (MAE) as the accuracy metric for regression, and evaluate Equal Accuracy as the MAE
difference between groups:

EA = MAE(Y, Ŷ | A = 0) − MAE(Y, Ŷ | A = 1).
Metric 5 (SP_r) For a model M under a distribution over (X, A, Y), the SP_r of M(X) is defined
as the distance between the average outcomes Ŷ of each sensitive group:

SP_r = AVG(Ŷ | A = 0) − AVG(Ŷ | A = 1).
Metric 6 (SP_MI) For a model M under a distribution over (X, A, Y), the statistical parity of the
outcomes Ŷ of M(X) is defined as

SP_MI = MI(Ŷ, A) − MI(Y, A).

SP_MI equals zero if M(X) is independent of A.
For all the fairness metrics above, values closer to zero indicate smaller differences between
sensitive groups and thus fairer predictions. Negative values indicate that the model is biased against
the unprivileged group; positive values indicate that the model favors the unprivileged group.
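The sketch below computes EA, SP_r, and SP_MI from continuous ground truth and predictions. Since mutual information between continuous variables must be estimated, we discretize the scores into bins before applying sklearn's mutual_info_score; this estimator choice is our assumption and is not prescribed by the definitions above.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def regression_fairness_metrics(y, y_hat, a, n_bins=10):
    mae = lambda t, p: np.abs(t - p).mean()
    ea = mae(y[a == 0], y_hat[a == 0]) - mae(y[a == 1], y_hat[a == 1])
    sp_r = y_hat[a == 0].mean() - y_hat[a == 1].mean()
    # Discretize continuous scores to estimate mutual information with A.
    bins = np.histogram_bin_edges(np.concatenate([y, y_hat]), bins=n_bins)
    mi_pred = mutual_info_score(np.digitize(y_hat, bins), a)
    mi_true = mutual_info_score(np.digitize(y, bins), a)
    return {"EA": ea, "SP_r": sp_r, "SP_MI": mi_pred - mi_true}
```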
3.4 Research Questions on Fair Behavioral Models
In this work, we focus on the following research questions in fair human behavior understand-
ing models. Below we provide more details, including modeling tasks, benchmark datasets, and
evaluation metrics, for each research question.
3.4.1 RQ1: Effects of Behavior Heterogeneity on Fairness
This research question is twofold: the heterogeneity of human behaviors and the effects of
heterogeneity on fairness. The heterogeneity of behavioral data can exist in both behavioral signals
and behavioral annotations. Thus, for this research question, we choose three human behavior
datasets, TILES, YouTube Personality, and First Impression, as our case studies.
The TILES dataset consists of continuous physiological and physical activity signals and is suitable
for both classification and regression modeling tasks. Its behavioral annotations are collected
from the participants via self-reported surveys. The YouTube Personality and First Impression datasets
include people's visual and audio signals; they also consist of continuous features and continuous
target variables, with behavioral annotations collected using Amazon Mechanical Turk (MTurk). By
analyzing the datasets above, we are able to investigate the heterogeneity of different behavioral
signals (i.e., physiological, visual, and audio signals) and the heterogeneity of both self-reported
and apparent behavioral annotations.
Two fairness metrics for regression tasks, EA and SP_r, are used to evaluate the performance.
3.4.2 RQ2: Enhancing Model Fairness via Data Balancing
In this research question, we investigate the fairness performance of widely used class balancing
methods and propose a new class balancing method that can enhance fairness without sensitive
attributes.
To evaluate the performance of the proposed methods, we first run experiments on three traditional
fairness benchmark datasets: Adult, COMPAS, and Violent Crime. We then apply the method
to the behavioral dataset TILES.
All three fairness metrics for classification tasks, EOD, EOddsD, and SPD, are used as the
evaluation criteria.
3.4.3 RQ3: Enhancing Model Fairness via Mitigating Data Heterogeneity
In this research question, we propose a method to identify the heterogeneous patterns of behavioral
signals. Our method is limited to continuous signals, thus we conduct experiments on two
behavior datasets with continuous features: TILES and Older Adults. These two datasets consist of
behavioral signals from heterogeneous participants. With the TILES dataset, we can also evaluate the
performance of the proposed method on both classification and regression tasks.
3.4.4 RQ4: Enabling Group Fairness in Federated Learning
Different from the above three research questions, which focus on centralized machine learning
methods, in this research question we study the group fairness issues in federated learning systems.
In federated learning settings, data are stored on different clients, while most of the benchmark
datasets do not have a natural partition. To evaluate the performance on naturally partitioned data, we
conduct additional experiments on the ACSIncome dataset, which has a natural partition into the
data collected from 51 U.S. states.
We first manually partition the traditional fairness datasets Adult and COMPAS to investigate
the performance of debiasing methods in federated learning. We then test our proposed method
on both manually and naturally partitioned data (ACSIncome), and finally report performance on
the behavioral dataset TILES.
RQ    Feature                 Target Variable   Task             Datasets                                        Metrics
RQ1   Continuous              Continuous        Regression       TILES, YouTube Personality, First Impression    EA, SP_r, SP_MI
RQ2   Continuous & Discrete   Discrete          Classification   Adult, COMPAS, Violent Crime                    EOD, EOddsD, SPD
RQ3   Continuous              Continuous        Regression       TILES, Older Adults                             EA, SP_r
RQ3   Continuous              Discrete          Classification   TILES                                           EOD, SPD
RQ4   Continuous & Discrete   Discrete          Classification   Adult, COMPAS, ACSIncome, TILES                 EOD, SPD

Table 3.2: Summary of tasks for each research question.
We summarize the tasks and evaluation approaches (e.g., datasets and metrics) in Table 3.2.
Chapter 4
Effects of Heterogeneous Human Behavior in ML Fairness
4.1 Introduction
Heterogeneity is prevalent in human behaviors. The differences in human behavior patterns intro-
duce various sources of bias in the data, most prominently heterogeneity [48, 47]. This, in turn,
can affect machine learning models’ accuracy [50, 51] (e.g., due to the inability to generalize to
unseen subjects) and fairness [53] (e.g., physiological signals may expose sensitive information, in-
creasing the risk of unfair outcomes). Despite the growing number of studies on machine learning
fairness, analyses of the effects of behavior heterogeneity are still lacking.
Contributions of this chapter
Motivated by the above challenges in heterogeneous human behaviors, in this chapter, we investi-
gate different sources of behavioral heterogeneity and their effects on model fairness. We analyze
the effects of different types of heterogeneity and propose mitigation mechanisms to address the
bias in behavior understanding systems. In summary,
• We investigate the effects of heterogeneity in behavioral signals and annotations. We propose
a method to identify the labeling bias in behavioral data.
• We propose a framework named adversarial learning with label matching (ALM). ALM can
mitigate the biases from both behavioral signals and annotations.
• We investigate the heterogeneity of different modalities in multimodal data. We analyze
the effects of multimodal fusion on fairness and propose two modeling strategies – data
balancing and adversarial learning to mitigate the bias in multimodal systems.
4.2 Heterogeneity of Behavioral Signals and Labels
Behavioral signals, such as facial expressions, voice, and physiological signals, vary across dif-
ferent demographic groups. Many studies [73, 74, 75] have examined the effect of a single de-
mographic factors on facial expressions. A recent study [76] further investigates the interactions
effects of three demographic factors (i.e., gender, race and age) on the the facial action units (FAUs)
of happiness expressions. Physiological signals (e.g., heart rate dynamics [77], sleep patterns [78])
also present significant age- and gender- related differences.
In this section, we conduct empirical analysis on such bias on real-life human behavior datasets.
We use Tracking Individual Performance with Sensors (TILES) [23] and YouTube Personality [68]
datasets as case studies to investigate the heterogeneity of human behaviors.
4.2.1 Differences in Behavioral Signals
4.2.1.1 Distribution Differences
Both physiological signals and video cues can have different distributions across gender groups
due to physiological or psychological differences.
Figure 4.1 presents some examples of the distribution differences across gender groups. TILES
dataset includes the users’ physiological signals. Among these features, male users have in general
higher average breathing depth, higher breathing depth variation, and lower Y-axis accelerometer
values. Audio and visual cues in YouTube Personality dataset also present gender differences.
Females have higher average pitch with lower variance.
Figure 4.1: Examples of distribution differences of the behavior signals across gender groups.
4.2.2 Differences in Behavioral Labels
4.2.2.1 Behavioral Label Types
There are two common methods to collect the annotations of behavioral traits:
• Self-reported: self-reported annotations are provided by the subjects themselves by answer-
ing questionnaires.
Response bias [79] is a widely discussed phenomenon in behavioural and healthcare research
where self-reported data are used; it occurs when individuals offer self-assessed measures
of some phenomenon. There are many reasons individuals might offer biased estimates of
self-assessed behaviour, ranging from a misunderstanding of what a proper measurement is,
to social-desirability bias.
• Apparent: apparent personality traits are annotated by an observer regarding their behav-
iors.
Figure 4.2: Proposed method to identify labeling bias. The filtering step removes the sensitive
information from the feature dimension; the matching step matches the samples based on the filtered
features.
Apparent personality annotations are usually collected by means of crowdsourcing. Cognitive/historical
bias [80] is prevalent in crowdsourced datasets: annotators might give biased labels due to their
personal experience or systematic discrimination existing in society, for example, the assumption
that women are less suited to jobs requiring high intellectual ability.
4.2.2.2 Identify Labeling Bias
We now introduce a method to identify labeling bias in behavioral annotations. Our method is
based on the assumption that if the feature dimensions are independent of the sensitive attributes,
then samples from different sensitive groups that have similar features should have similar target
variables.
Following this assumption, we analyze the labeling bias in two steps, as Figure 4.2 shows:
• Filtering step: We first remove the inherent sensitive information from the original features.
We adopt the Disparate Impact Remover [6] as our filtering method.
• Matching step: After obtaining the filtered features, for each sample in the unprivileged
group, we set its nearest neighbor in the privileged group as the matched sample (a minimal
matching sketch follows this list).
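For concreteness, below is a minimal sketch of the matching step, assuming the filtered features are available as a matrix; the function and variable names are ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# X_filt: filtered features (sensitive information already removed),
# y: labels, a: binary sensitive attribute (1 = privileged, 0 = unprivileged).
def match_to_privileged(X_filt, y, a):
    X_priv, y_priv = X_filt[a == 1], y[a == 1]
    X_unpriv, y_unpriv = X_filt[a == 0], y[a == 0]
    nn = NearestNeighbors(n_neighbors=1).fit(X_priv)
    _, idx = nn.kneighbors(X_unpriv)           # nearest privileged neighbor
    matched_labels = y_priv[idx.ravel()]       # labels of the matched samples
    return y_unpriv, matched_labels            # compare these two distributions
```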
Figure 4.3: Distribution of the labels in the unprivileged group and their matched labels. The
matched labels (blue lines) exhibit different distributions from the original labels (red dashed-
lines), indicating potential labeling bias in the data.
Figure 4.4: Distribution comparison of original labels and matched labels. (a) demonstrates the
trend of differences between original and matched labels (i.e., label bias). (b) compares the cumu-
lative distribution function (CDF) before (blue line) and after (orange dashed-line) matching.
(1) TILES
Constructs           Original (t, p-value)    Matched (t, p-value)
Openness             -0.55, 0.57              -0.11, 0.90
Conscientiousness     2.73, 0.006             -2.11, 0.03
Extraversion          0.15, 0.87              -0.07, 0.93
Agreeableness         1.75, 0.08              -3.79, 0.0002
Neuroticism           1.59, 0.11               0.20, 0.84

(2) YouTube Personality
Constructs           Original (t, p-value)    Matched (t, p-value)
Openness             -0.87, 0.38              -2.62, 0.009
Conscientiousness    -0.71, 0.47               1.90, 0.05
Extraversion          0.12, 0.89              -1.46, 0.14
Agreeableness         4.78, 0                  4.74, 0
Neuroticism           0.67, 0.49              -0.33, 0.73

Table 4.1: Label distribution differences across different sensitive groups. We conduct a t-test
between the labels of the unprivileged group and the original/matched labels in the privileged group.
The filtering and matching steps aim to reduce the potential selection bias in data collection.
Since the matched samples have similar features except for the sensitive attributes, the label
differences between matched samples can reveal how labeling bias arises in the labeling process.
Table 4.1 summarizes the distribution differences. A t-test is used to measure the statistical
differences across groups. For both datasets, the distribution differences are exaggerated after matching
the neighboring features across different sensitive groups, indicating potential labeling bias in
the data collection process. According to the results in Table 4.1, the conscientiousness and
agreeableness scores exhibit the most significant label bias in both datasets. Figure 4.3 further
illustrates the labeling differences across sensitive groups. The labels in the unprivileged group,
i.e., the red dashed lines in Figure 4.3, have a different distribution from their matched labels
(i.e., the blue lines), indicating that the annotators tend to give different labels to participants
in different groups even though they have similar behaviors.
Figure 4.4 gives more details of the labeling bias. Figure 4.4(a) illustrates the level of bias at
different original scores; the label bias of a record is calculated as the label of the corresponding
matched nearest neighbor minus the label of the record. According to the analysis, annotators tend
to give scores around 4-5 to the unprivileged group, while similar behavior in the privileged group
results in scores with a wider range, i.e., higher for labels less than 4 and lower for labels larger
than 4. Figure 4.4(b) compares the cumulative distribution functions (CDFs) before and after matching,
indicating that, with similar features, the labels of the unprivileged group are systematically lower
than those of the privileged group.
4.2.3 Mitigating Signal and Label Bias
4.2.3.1 Learning Unbiased Labels
In this section, we introduce our framework to correct the labeling biases in behavioral data.
Consider a data domain X and an associated data distribution P. An element x ∈ X is a feature
vector associated with a specific example. Y represents the annotations of the behavioral trait of
interest. Y in our dataset is subject to the biases of annotators, i.e., the labels are generated based
on a biased label function y_bias : X → R. We assume the existence of an unbiased, ground-truth
label function y_true : X → R. Although y_true is the assumed ground truth, in general we do not
have access to it. In our framework, we begin with an assumption on the relationship between the
observed y_bias and the underlying y_true.
We assume that there exists a transformation function g(·) between y_true and y_bias, where
y_bias = g(y_true). The transformation satisfies the following constraints:

|Pr[y_true ≥ z | x] − Pr[y_true ≥ z | x′]| ≤ ε
s.t. g(y_true^(i)) ≤ g(y_true^(j)) when y_true^(i) < y_true^(j),

where x and x′ are matched feature vectors in the privileged and unprivileged groups, respectively.
We propose the following method to transform the labels of the unprivileged population. According
to the assumption above,

F_bias(y_bias) = F_true(g^{-1}(y_bias)),

where F_y(·) is the cumulative distribution function (CDF) of the corresponding annotations.
To balance the label distribution, we only modify the labels of the unprivileged group, i.e., we
assume the labels in the privileged group are the baseline of the relative bias. Therefore, we can
estimate y_true by

y_true^(i) = y_bias^(i)                            if x^(i) ∈ privileged group,
y_true^(i) = F_true^{-1}(F_bias(y_bias^(i)))       if x^(i) ∈ unprivileged group,      (4.1)

where F^{-1} is the inverse function of F.
Our method is based on the assumption that, after removing the sensitive information from the
original features, two similar filtered feature vectors that belong to different sensitive groups should
also have similar target variables. Figure 4.2 illustrates how the above framework works. Since the
true relationship between y_true and y_bias is unknown, our method aims to debias the label
distribution rather than to recover the true labels.
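A minimal sketch of the label correction in Equation 4.1, using empirical CDFs estimated from the observed labels; we assume the privileged group's labels approximate F_true and the unprivileged group's labels approximate F_bias, and the function name is ours.

```python
import numpy as np

def correct_unprivileged_labels(y, a):
    """Map unprivileged labels through F_bias, then through F_true^{-1} (Eq. 4.1)."""
    y_priv, y_unpriv = np.sort(y[a == 1]), np.sort(y[a == 0])
    y_corrected = y.astype(float).copy()
    for i in np.where(a == 0)[0]:
        # Empirical CDF value of this label within the unprivileged group.
        q = np.searchsorted(y_unpriv, y[i], side="right") / len(y_unpriv)
        # Inverse empirical CDF of the privileged group's label distribution.
        y_corrected[i] = np.quantile(y_priv, min(q, 1.0))
    return y_corrected
```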
4.2.3.2 Adversarial Learning with Label Matching
We propose a fair human behavior modeling framework named adversarial learning with label
matching (ALM), aiming to mitigate the biases from both the learning process and the annotation process.
Our proposed framework has the following components, as shown in Figure 4.5.
• Filter: filters are a set of two-layer multilayer perceptron (MLP) models that remove in-
formation about particular sensitive attributes, e.g., gender or race, from the original feature
embedding.
Figure 4.5: Framework of the proposed adversarial learning with label matching (ALM) method.
• Discriminator: each discriminator is a fully-connected layer to predict the sensitive attribute
information, from the filtered embedding. The estimation accuracy of the discriminator
indicates how much sensitive information we can learn from the filtered embedding. With
the adversarial training, we aim to maximize the estimation loss of discriminators so that the
filtered embedding is not able to reveal the sensitive attributes.
• Predictor: the predictor is a regression model on the filtered embedding vectors to predict
a certain personality construct. After enforcing the fairness constraints using filters and
discriminators, we aim to maintain the personality assessment accuracy via minimizing the
predictor loss.
• Corrector: the corrector follows the structure in Figure 4.2. The filtered features are passed
through the corrector along with the sensitive attribute. Each filtered feature vector of the
unprivileged group is matched to its nearest neighbor in the privileged group. The labels belonging
to the unprivileged group are corrected based on Equation 4.1. The corrector generates
the corrected labels used to calculate the prediction loss.
We process the original features with the filter component to obtain the filtered features. The
filter aims to prevent the adversarial discriminators from identifying the sensitive information using
the filtered features. After the filtering layers, the filtered features are passed through a discriminator
and a predictor, respectively, to calculate the losses for predicting the sensitive attributes and the
target variables.
In order to train the filter-discriminator-predictor system, we define an adversarial loss as the
loss function:

L = L_predictor − λ · L_discriminator = L(f(x_i), y′_i) − λ · L(p(x_i), A_i),

where y′ is the corrected label. The loss of the predictor, L_predictor, is based on mean absolute
error (MAE), and the loss function of the discriminator, L_discriminator, uses categorical cross-entropy.
The function f(·) is the regression model of the predictor and p(·) is the model of the discriminator.
We also introduce a parameter λ to control the trade-off between fairness and utility. With this
setup, the filter layers are encouraged to maximize L_discriminator, which will produce filtered
features that are independent of the sensitive attribute. Differently from a traditional training process,
we calculate L_predictor using the loss L(f(x_i), y′_i) instead of L(f(x_i), y_i), aiming to minimize the
prediction loss against the corrected labels.
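A minimal PyTorch sketch of one training step under this objective. The module and variable names (filter_net, predictor, discriminator) are ours, the corrected labels y_corr are assumed to come from the corrector described above, and the alternating update of the discriminator is our assumption: the thesis specifies the combined loss but not the optimization schedule. opt_main holds the filter and predictor parameters, opt_disc the discriminator parameters, and a is a LongTensor of group indices.

```python
import torch
import torch.nn.functional as F

def alm_training_step(filter_net, predictor, discriminator,
                      opt_main, opt_disc, x, y_corr, a, lam=0.5):
    """One alternating step of the filter-discriminator-predictor objective."""
    # (1) Update the discriminator so it keeps trying to recover A from z.
    opt_disc.zero_grad()
    z = filter_net(x).detach()
    F.cross_entropy(discriminator(z), a).backward()
    opt_disc.step()

    # (2) Update filter + predictor on L = L_predictor - lam * L_discriminator.
    opt_main.zero_grad()
    z = filter_net(x)
    pred_loss = F.l1_loss(predictor(z).squeeze(-1), y_corr)  # MAE vs corrected labels
    disc_loss = F.cross_entropy(discriminator(z), a)
    (pred_loss - lam * disc_loss).backward()                 # pushes z to fool the adversary
    opt_main.step()
    return pred_loss.item(), disc_loss.item()
```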
4.2.3.3 Experimental Evaluation
To demonstrate the effectiveness of our proposed method, we implemented it using the PyTorch
library and conducted experiments on two human behavior datasets: TILES and YouTube Personality.
For both datasets, we consider gender as the sensitive attribute, and the population with the higher
average scores is considered the privileged group.
To ensure that the reported results are not skewed by variations in different testing sets, we use
a stratified global testing set, sampling 20% of the data from the original dataset. All reported
results are based on 20 repeated experiments with random restarts. We report the mean absolute
errors (MAE) of the predictions as the accuracy measurement and SP_r (see §3) as the
fairness metric.
We compare the performance of the following modeling frameworks:
• Original: optimize the model on the prediction loss without any fairness constraints.
(1) TILES
Construct (λ)        Orig. MAE / SP_r    Adv. MAE / SP_r    ALM MAE / SP_r
Open (λ = 0.7)       0.54 / -0.34        0.69 / -0.10       0.68 / -0.10
Cons (λ = 0.7)       0.63 / -0.28        0.81 / -0.12       1.05 / -0.05
Extr (λ = 0.7)       0.58 / -0.34        0.70 / -0.07       0.72 / -0.06
Agr  (λ = 0.7)       0.54 / -0.37        0.67 / -0.11       0.75 / -0.06
Neu  (λ = 0.3)       0.62 / -0.16        0.71 / -0.02       0.75 / -0.01

(2) YouTube Personality
Construct (λ)        Orig. MAE / SP_r    Adv. MAE / SP_r    ALM MAE / SP_r
Open (λ = 0.6)       0.62 / -0.07        0.90 / -0.07       0.86 / -0.07
Cons (λ = 0.6)       0.69 / -0.23        0.70 / -0.18       0.71 / -0.18
Extr (λ = 0.4)       0.76 / -0.09        1.09 / -0.04       1.20 / -0.02
Agr  (λ = 0.6)       0.92 / -0.07        1.19 / -0.03       1.23 / -0.02
Neu  (λ = 0.3)       0.91 / -0.05        1.09 / -0.02       1.13 / -0.02

Table 4.2: Performance comparison among the original (Orig.), adversarial learning (Adv.), and proposed
ALM modeling methods, reported as MAE / SP_r. The choice of λ is based on the best trade-off between
utility and fairness.
• Adversarial learning: the adversarial learning method follows the filter-discriminator-
predictor framework [81, 82] in Figure 4.5. The predictor component is optimized on the
prediction loss against the biased labels.
• Our method – Adversarial learning with label matching (ALM): our method combines
the filter-discriminator-predictor framework with a corrector component to mitigate labeling
bias. The predictor is optimized on the prediction loss against the corrected labels.
Experimental Results
Table 4.2 compares the utility and fairness metrics of the three modeling strategies.
Due to the absence of the true labels, we calculate the prediction accuracy (i.e., MAE) against the
biased labels in the datasets.
For all the personality constructs in both datasets, ALM outperforms the other methods in improving
statistical parity. The experimental results also demonstrate that simply adopting the label
matching strategy can improve model fairness.
Effects of λ
In the proposed framework, we introduce a parameter λ as the fairness budget to control the trade-off
between model utility and fairness. Figure 4.6 illustrates how the choice of λ affects the model
Figure 4.6: Effects of the fairness budget parameter λ. The plot illustrates how model performance
changes according to both utility and fairness metrics.
performance. As λ increases, the model is increasingly encouraged to maximize L_discriminator,
which provides fairer filtered features. For all constructs, the SP_r metrics show clearly
increasing trends as λ increases, indicating fairer outcomes. As a trade-off, the prediction error
exhibits an increasing trend as λ increases.
The effects of λ also vary by construct. For example, in the TILES dataset, neuroticism exhibits
a steeper trend compared to agreeableness. The SP_r of neuroticism reaches 0, i.e., no bias in
the outcomes, when λ = 0.4, while other constructs need a larger λ to reach that goal.
4.2.4 Summary
This subsection focuses on the biases stemming from the heterogeneity of human behaviors and
annotations. We use the personality assessment task as our case study to understand the labeling
bias in behavioral data. We propose a method to identify the label bias via record matching. Our
analysis reveals the prevalence of inherent labeling bias in both self-reported and crowdsourced
annotation processes.
We further propose a modeling framework, called adversarial learning with label matching
(ALM), for fair behavior modeling. Our method aims to address the bias in both feature and label
dimensions. The adversarial learning component removes the latent information of the sensitive
attributes in the feature dimensions. The label matching component corrects the bias in behavioral
labels.
Experiments on real-world datasets show that our ALM method can outperform a traditional
fair adversarial learning strategy. Our analysis on synthetic biased labels also highlights the poten-
tial accuracy improvement of our method.
For future work, we will explore the performance of our method in more scenarios. Experi-
ments on different datasets will be conducted to assess our method’s performance with different
data properties. Here, we focused on labeling bias across different demographic groups; labeling
bias caused by individual perception and preference differences, and how such biases impact individual
fairness, will be studied in the future.
4.3 Heterogeneity of Multimodal Fusion
Multimodal machine learning aims to build models that can process and relate information from
multiple modalities, which enables the model to capture more relevant information and improve
model utility. The fusion of different modalities is generally performed at two levels: the feature
level (early fusion) and the decision level (late fusion). Despite the growing number of studies on
multimodal modeling for human behavior, analyses of the fairness and biases of such systems are still
lacking. In this subsection, we conduct empirical analyses on a state-of-the-art multimodal personality
assessment system, investigating the bias derived from different modalities as well as from the model
fusion stage. We then propose potential ways to mitigate the bias. We use the First Impression dataset
[69] as the case study.
          Gender                                  Race
          Male         Female       p       Cau.         AA           Asian        p
Ope.      0.59±0.15    0.53±0.13    0       0.57±0.14    0.52±0.13    0.58±0.13    0
Con.      0.53±0.15    0.50±0.15    0       0.52±0.15    0.49±0.14    0.54±0.14    0
Ext.      0.50±0.15    0.43±0.13    0       0.47±0.15    0.44±0.14    0.51±0.12    0
Agr.      0.54±0.14    0.55±0.13    0.15    0.55±0.13    0.52±0.12    0.55±0.12    0
Neu.      0.52±0.15    0.50±0.14    0       0.52±0.15    0.49±0.14    0.52±0.13    0.001
Int.      0.51±0.15    0.49±0.14    0       0.50±0.15    0.47±0.14    0.52±0.12    0

Table 4.3: Ground-truth distribution across different sensitive groups. We report the mean and
standard deviation of each construct per group. A t-test and one-way ANOVA are used to test the
differences between group means.
Figure 4.7: Pipeline of the BU-NKU system, reproduced from [83].
Table 4.3 summarizes the distribution of the ground-truth labels. A t-test and one-way ANOVA
are used to measure the distribution differences across groups. Although the distributions look
broadly similar across groups, almost all constructs have significant mean differences
(p-value < 0.001) across different demographic groups.
4.3.1 State-of-the-Art Personality Assessment Pipeline
In this section, we introduce one of the state-of-the-art personality assessment methods, named
BU-NKU, which we adopt as the baseline of our work. The BU-NKU system achieved the best performance
in the 2017 ChaLearn challenge described in §3.2.2.3.
The BU-NKU system is based on audio, video, and scene features. The workflow of the BU-
NKU system is illustrated in Figure 4.7.
4.3.1.1 Features
BU-NKU system uses three feature types, namely, face features, scene features, and acoustic fea-
tures. Face features are extracted from the whole video while scene features are extracted from the
first frame of the video.
Face Features. The Supervised Descent Method (SDM) [84] is utilized to align faces and locate
landmarks for cropping facial images. After that, the BU-NKU framework uses two methods
to extract features from the aligned facial images. First, BU-NKU fine-tunes the VGG-Face
network [85], which is pre-trained for face recognition, on more than 30K images
from the FER-2013 dataset for facial expression recognition. The 4,096-dimensional embedding
from the 33rd layer is used as the facial features. Second, BU-NKU applies Local Gabor Binary Patterns
from Three Orthogonal Planes (LGBP-TOP) [86] to the facial images to extract features with a
dimensionality of 50,112. LGBP-TOP is a spatio-temporal descriptor with 18 Gabor filters, useful
for expression recognition.
Scene Features. In order to use the ambient information in the images, a set of features is
extracted using the VGG-VD-19 network [87], which is trained for an object recognition task on
the ILSVRC 2012 dataset. Similar to the face features, a 4,096-dimensional representation from the
39th layer of the 43-layer architecture is used. This gives a description of the overall image,
which contains both face and scene. The effectiveness of scene features for predicting Big Five traits
is shown in [88, 89].
Acoustic Features. The BU-NKU approach uses the OpenSMILE toolbox [90] with a standard
feature configuration that served as the challenge baseline set in the INTERSPEECH 2013
Computational Paralinguistics Challenge [91].
4.3.1.2 Data Fusion Strategies
As Figure 4.7 shows, the BU-NKU system combines both early-fusion and late-fusion approaches.
Figure 4.8: Variable importance of the late fusion Random Forest model.
In the early-fusion stage, the visual modality combines the face features, i.e., LGBP-TOP and
VGG-Face, and the audio modality combines the IS13 and VGG-VD-19 features. Each modality is
modeled using a linear-kernel ELM to predict the six constructs of interest.
The late-fusion stage is designed to further refine the estimations with a random forest regressor.
The estimations from the previous step are stacked together as the 12-dimensional input of the late
fusion model.
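A minimal sketch of this late-fusion step, assuming the per-modality predictions for the six constructs are already available as arrays; the variable names are ours and the random forest hyperparameters are not specified by the original system.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def late_fusion(pred_visual, pred_audio, y_train, pred_visual_te, pred_audio_te):
    """Stack per-modality estimations (n x 6 each) into a 12-d input and refine them."""
    X_train = np.hstack([pred_visual, pred_audio])        # n_train x 12
    X_test = np.hstack([pred_visual_te, pred_audio_te])   # n_test x 12
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(X_train, y_train)                              # y_train: n_train x 6
    return rf.predict(X_test)
```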
Figure 4.8 illustrates the variable importance of the Random Forest regressor. As expected, for
all the constructs, the estimations of the same construct from both modalities are the most infor-
mative variables. For most constructs, the estimations from both modalities have similar weights;
the exception is Extraversion, for which the estimations from the visual modality dominate the final
estimations.
4.3.2 Different Sources of Bias
In this section, we investigate the gender/race differences in the estimation results generated by the
BU-NKU system. We measure the performance based on model accuracy and fairness. Accuracy
(Acc.) is defined as 1 − Mean Absolute Error, which is consistent with the metric used in [83]. SP
and EA are used as the fairness metrics.
4.3.2.1 User-Dependent vs. User-Independent Evaluation
In the original dataset split, different clips generated from the same video can be found in different
sets, i.e., clips from the same video appear in both the train and test sets. Since different people
have different behavior patterns, the estimation performance for unseen users, i.e., clips generated
from videos that do not exist in the train set, might differ from that for known users, i.e.,
clips generated from videos that exist in the train set. Therefore, we first analyze the biases under
user-dependent and user-independent evaluation.
• User-dependent evaluation: estimations for the clips generated from videos that also
exist in the train set. In the original test set, 1,680 out of 2,000 clips come from
the same videos as in the train set.
• User-independent evaluation: estimations for the clips generated from the videos that do
not exist in the train set. After filtering out the test samples that come from the same videos
in the train set, the user-independent evaluation test set has 320 clips.
Since the personality traits are mostly stable within short videos, the user-dependent evaluation
approach yields higher accuracy. In Table 4.4, we also compare the biases of user-dependent and
user-independent evaluations. The results show that user-independent evaluation exhibits larger
biases across different constructs and different groups. In particular, for agreeableness, the EA
metrics increase for all three group pairs. In summary, gender biases exist for Agreeableness,
Conscientiousness, and Interview, and they are significantly different according to a t-test (p < 0.05).
(1) User-Dependent Evaluation
        Acc.      SP_MI (Gender)  SP_MI (Race)  EA (Male-Female)  EA (Cau.-AA)  EA (Cau.-Asian)
Ope.    0.9172    0.031           0.002         0.0005            -0.004        0.019
Con.    0.9206    0.009           -0.003        0.004             -0.007        0.003
Ext.    0.9217    0.022           0.005         0.0004            -0.008        0.003
Agr.    0.9143    -0.005          0.007         0                 -0.004        0.003
Neu.    0.9158    0.014           0.008         0.002             0             0.015
Int.    0.9226    0.002           0.019         0.0008            0.002         0.012

(2) User-Independent Evaluation
        Acc.      SP_MI (Gender)  SP_MI (Race)  EA (Male-Female)  EA (Cau.-AA)  EA (Cau.-Asian)
Ope.    0.9160    0.059           0.005         0.001             0.03*         -0.011
Con.    0.9157    -0.005          0.016         0.002*            -0.005        0.001
Ext.    0.9189    0.048           -0.011        0.012             0.001         -0.004
Agr.    0.9165    0.005           0.033         0.011*            0.021         -0.039
Neu.    0.9093    -0.015          0.024         0                 0.004         0.017
Int.    0.9124    0.004           0.001         0.003*            0.012         -0.001
*: p-value < 0.05

Table 4.4: Comparison between user-dependent and user-independent evaluations. Model performance
is measured by both accuracy (Acc.) and fairness metrics (Statistical Parity (SP_MI) and Equal
Accuracy (EA)).
Openness is significantly different between Caucasians and African-Americans, which is the only
significant racial bias identified.
In general, even though there are no significant gender/race biases in the user-dependent estimations,
all constructs show some degree of bias under user-independent evaluation. Since
user-independent evaluation is more realistic, the rest of the discussion in this work focuses on
the performance under user-independent evaluation.
4.3.2.2 Biases from Different Modalities
The BU-NKU system includes two modalities according to the pipeline shown in Figure 4.7: we consider
the scene features and OpenSMILE features as the audio modality, and the VGG-Face and
LGBP-TOP features as the visual modality. Due to the different feature extraction processes and
pre-trained models adopted in each modality, different modalities may introduce different types
(1) Audio Modality
        EA (gender)   EA (Cau.-AA)   EA (Cau.-Asian)   SP (gender)   SP (race)
Ope.    -0.007        0.007*         0.007             0.057         0.014
Con.    -0.001*       -0.014         0.025             -0.004        -0.002
Ext.    0.001         -0.003*        0.022             0.046         0.011
Agr.    0.008*        0.021          -0.021            -0.014        0.047
Neu.    -0.009*       -0.002         0.001             -0.015        0.015
Int.    0*            -0.002*        0.008             0.027         -0.010
*: p-value < 0.05

(2) Visual Modality
        EA (gender)   EA (Cau.-AA)   EA (Cau.-Asian)   SP (gender)   SP (race)
Ope.    0.001         0.035          -0.005            0.018         -0.009
Con.    0.009*        -0.001         0.001             0.015         -0.002
Ext.    0.012         0.009          -0.005            0.016         -0.011
Agr.    0.015*        0.024          -0.029            -0.014        0.018
Neu.    0.002*        0.006          0.024             -0.020        0.012
Int.    0.013*        0.021          0                 -0.023        -0.015
*: p-value < 0.05

Table 4.5: Bias measurement for different modalities. Comparison of the biases from the audio vs.
visual models.
of biases to the model outcomes. In this section, we investigate the performance of each separate
modality.
According to the results in Table 4.5, estimations based on the visual modality show larger MAE
differences between groups, i.e., larger EA metrics for most of the constructs. For example, the EAs
between Caucasian and African American subjects increase from 0.007 (resp. -0.002) to 0.035 (resp.
0.021) for Openness (resp. Interview), and the EAs between gender groups also increase from 0.001
(resp. 0) to 0.012 (resp. 0.013) for Extraversion (resp. Interview). The audio modality has a larger
impact on the Asian subjects in our dataset: the EAs between Caucasian and Asian subjects increase
from 0.001 (resp. -0.005) to 0.025 (resp. 0.022) for Conscientiousness (resp. Extraversion).
Regarding the SP metric, the audio modality yields estimations with larger SP values for all
constructs, which indicates that the audio modality leaks more information related to the sensitive
attributes than the visual modality.
(1) Gender
Constructs           SP_gender: Best   Worst    Fusion   Impact
Openness             0.018             0.057    0.059    ↓
Conscientiousness    -0.004            0.015    -0.005   ↑
Extraversion         0.016             0.046    0.048    ↓
Agreeableness        -0.014            -0.014   0.005    ↓
Neuroticism          -0.020            -0.015   -0.015   -
Interview            -0.023            0        0.004    ↓

(2) Race
Constructs           SP_race: Best     Worst    Fusion   Impact
Openness             -0.009            0.014    0.005    -
Conscientiousness    -0.002            -0.002   0.016    ↓
Extraversion         -0.011            0.011    -0.011   -
Agreeableness        0.018             0.047    0.033    -
Neuroticism          0.012             0.015    0.024    ↓
Interview            -0.015            -0.010   0.001    ↓

Table 4.6: Impact of fusion on SP. We compare the SP metrics before and after the late fusion. ↑
and ↓ indicate increased and decreased bias as a result of fusion, respectively. Green and red dashes
respectively indicate that the fusion results are equal to the best and the worst results of a single
modality.
4.3.2.3 Impact of Fusion on Estimation Biases
The BU-NKU system also uses a late fusion strategy: the estimations from different modalities are
stacked and passed to a Random Forest regressor to further refine the estimations.
The late fusion aims to further improve the model accuracy. However, the impact of fusion
on biases has not been studied in the literature. Therefore, in this section, we compare the biases
before and after the fusion stage. We define two types of impact that might be caused by late
fusion.
• Decrease: the bias is less than the minimum bias across the different modalities.
• Increase: the bias is larger than the maximum bias across the different modalities.
Table 4.7 and Table 4.6 summarize the impact for different sensitive groups. In most cases,
the EA metrics after fusion fall between the biases of the different modalities. There are also some
(1) Gender: Male - Female
Constructs           EA_gender: Best   Worst    Fusion   Impact
Openness             0.0004            -0.0006  0.0005   -
Conscientiousness    -0.001            0.009    0.002    -
Extraversion         0.001             0.012    0.012    -
Agreeableness        0.008             0.015    0.011    -
Neuroticism          0.002             -0.009   0        ↑
Interview            0                 0.013    0.003    -

(2) Race: Caucasian - African American
Constructs           EA_Cau.-AA: Best  Worst    Fusion   Impact
Openness             0.007             0.035    0.03     -
Conscientiousness    -0.001            -0.014   -0.005   -
Extraversion         -0.003            0.009    0.001    ↑
Agreeableness        0.021             0.024    0.021    -
Neuroticism          -0.002            0.006    0.004    -
Interview            -0.002            0.021    0.012    -

(3) Race: Caucasian - Asian
Constructs           EA_Cau.-Asian: Best   Worst    Fusion   Impact
Openness             -0.005                0.007    -0.011   ↓
Conscientiousness    0.001                 0.025    0.001    -
Extraversion         -0.005                0.022    -0.004   ↑
Agreeableness        -0.021                -0.029   -0.039   ↓
Neuroticism          0.001                 0.024    0.017    -
Interview            0                     0.008    -0.001   -

Table 4.7: Impact of fusion on EA. We compare the EA metrics before and after the late fusion. ↑
and ↓ indicate increased and decreased bias as a result of fusion, respectively. Green and red dashes
respectively indicate that the fusion results are equal to the best and the worst results of a single
modality.
interesting findings here. The racial biases are improved for Extraversion, whose estimations are
mainly based on the visual modality according to the variable importance in Figure 4.8. The late
fusion has the highest impact on Asian subjects: two personality traits have larger biases after fusion.
However, late fusion has a more salient impact on the SP metrics, which are worsened for
most of the constructs.
4.3.3 Debiasing Approaches
Based on our analysis, the current multimodal personality assessment method exhibits gender and
race biases in its estimations. Those biases come from different sources, including feature sets,
modalities, and fusion methods. In order to provide fairer assessments, we propose two approaches
to mitigate the biases in existing systems.
• Data balancing: this method involves balancing the population across different sensitive groups.
This approach can be adopted at the late fusion step without re-training the predictive models;
thus, it can mitigate the biases introduced by fusion, and it is easier to implement
and less time consuming.
• Adversarial learning: add a discriminator for the sensitive attributes and let it jointly learn
with the predictor for the constructs. This approach can remove the biases from the representations
in the learning process.
Figure 4.9 illustrates the frameworks of the two approaches. In the following sections, we
provide a more detailed explanation of the approaches and evaluate their performance.
4.3.3.1 Data Balancing
As we have shown in §3.2.2.3, the dataset is imbalanced and includes a large number of
Caucasian and female subjects. Previous work [60] has shown that data imbalance is one of the
main sources of bias in machine learning.
Figure 4.9 (a) illustrates the framework for the data balancing approach. The resampling strategy
is based on the distribution of the ground-truth labels. Our method aims to make sure that
the data from different sensitive groups not only have the same population size but also have
similar label distributions. Our data balancing strategy has two steps (a minimal sketch follows the list):
1. calculating the histogram of the ground-truth for each construct; and,
2. for the data points within each bin of the histogram, resampling the minority group by ran-
domly oversampling to make sure different sensitive groups have equal numbers.
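The following is a minimal sketch of this resampling strategy, assuming a pandas DataFrame df with a ground-truth column "label" and a sensitive-attribute column "group"; the column names and the number of histogram bins are illustrative and are not taken from the original implementation.

import numpy as np
import pandas as pd

def balance_by_label_histogram(df, label_col="label", group_col="group",
                               n_bins=10, seed=0):
    """Oversample minority sensitive groups within each histogram bin of the label."""
    rng = np.random.default_rng(seed)
    # Step 1: bin the ground-truth labels to form the histogram.
    edges = np.histogram_bin_edges(df[label_col], bins=n_bins)
    bin_ids = np.digitize(df[label_col], edges[:-1])

    balanced_parts = []
    for b in np.unique(bin_ids):
        bin_df = df[bin_ids == b]
        target = bin_df[group_col].value_counts().max()   # size of the largest group in this bin
        # Step 2: randomly oversample every smaller group up to the target size.
        for g, grp_df in bin_df.groupby(group_col):
            extra = target - len(grp_df)
            if extra > 0:
                idx = rng.choice(grp_df.index, size=extra, replace=True)
                grp_df = pd.concat([grp_df, df.loc[idx]])
            balanced_parts.append(grp_df)
    return pd.concat(balanced_parts).reset_index(drop=True)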
Figure 4.9: Debiasing approaches: (a) data balancing, (b) FairPredict adversarial learning.
After data balancing, the balanced datasets are used as the input for the late fusion model to
generate the final estimations.
4.3.3.2 Adversarial Learning
Another method we use to mitigate bias is incorporating adversarial learning in models. We follow
a similar approach to the method proposed in [92]. The framework of our approach is shown in
Figure 4.9 (b).
First, we normalize the features from each modality and then conduct early fusion by con-
catenating feature vectors into a feature embedding. Second, we put the feature embedding into a
filter-discriminator-predictor system, with the following components.
• Filter: filters are a set of two-layer multilayer perceptron (MLP) models that remove in-
formation about particular sensitive attributes, e.g., gender or race, from the original feature
embedding. The filters for different sensitive attributes are trained separately.
• Discriminator: each discriminator is a two-layer MLP followed by a softmax layer to pre-
dict the sensitive attribute information, e.g., gender or race, from the filtered embedding.
The estimation accuracy of the discriminator indicates how much sensitive information we
can learn from the filtered embedding. With the adversarial training, we aim to maximize
the estimation loss of discriminators so that the filtered embedding is not able to reveal the
sensitive attributes.
• Predictor: the predictor is a regression model on the filtered embedding vectors to predict
a certain construct. After enforcing the fairness constraints using filters and discriminators,
we aim to maintain the personality assessment accuracy via minimizing the predictor loss.
We process the feature embedding with the filters to obtain the filtered feature embedding. The filters aim to prevent the adversarial discriminators from recovering sensitive information from the filtered embedding. After the filtering layer, the filtered feature embedding is passed through a discriminator and a predictor, respectively, to compute the losses for predicting the sensitive attributes and the target variables.
In order to train the filter-discriminator-predictor system, we define an adversarial loss as the loss function for the model:

L = L_predictor − λ · L_discriminator,

where the predictor loss L_predictor is based on mean squared error (MSE), and the discriminator loss L_discriminator uses categorical cross-entropy. We also introduce a parameter λ to control the trade-off between fairness and personality assessment accuracy. The pre-defined λ parameter is the weight of the fairness constraint from the discriminator; larger λ values impose a stricter constraint on model fairness. We empirically set λ to 0.0015 in our experiments. With this λ value, the model is able to achieve fair personality assessment without a large sacrifice in estimation accuracy.
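Below is a minimal sketch of the adversarial training loop described above, assuming a PyTorch implementation; the layer sizes, optimizer settings, and variable names are illustrative and are not taken from the original system.

import torch
import torch.nn as nn

def mlp(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_out))

d_feat, n_groups, lam = 256, 2, 0.0015            # lam: fairness weight λ
filt = mlp(d_feat, d_feat)                        # filter: removes sensitive information
disc = mlp(d_feat, n_groups)                      # discriminator: predicts the sensitive attribute
pred = mlp(d_feat, 1)                             # predictor: regresses the personality construct

opt_fp = torch.optim.Adam(list(filt.parameters()) + list(pred.parameters()), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
mse, ce = nn.MSELoss(), nn.CrossEntropyLoss()     # cross-entropy applies softmax internally

def train_step(x, y, a):
    """x: fused features, y: construct score, a: sensitive attribute (class index)."""
    # 1) Update the discriminator to predict the sensitive attribute from the filtered embedding.
    z = filt(x).detach()
    loss_d = ce(disc(z), a)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Update filter + predictor with the adversarial loss L = L_predictor - λ·L_discriminator.
    z = filt(x)
    loss = mse(pred(z).squeeze(-1), y) - lam * ce(disc(z), a)
    opt_fp.zero_grad(); loss.backward(); opt_fp.step()
    return loss.item()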
4.3.3.3 Performance Comparison
Table 4.8 compares the biases in model outcomes before and after the different debiasing approaches. Both approaches show improvements on both fairness metrics. Adversarial learning outperforms the data balancing method on the SP_MI metric. For all five personality traits and the interview variable, the SP_MI values with adversarial learning are below or equal to zero, which indicates that
(1) Accuracy
Constructs          Original    Data Balancing          Adversarial Learning
                                Gender      Race        Gender      Race
Openness            0.9160      0.9155      0.9148      0.8884      0.8907
Conscientiousness   0.9157      0.9157      0.9158      0.8699      0.8731
Extraversion        0.9189      0.9194      0.9173      0.8739      0.8837
Agreeableness       0.9165      0.9164      0.9166      0.8980      0.8969
Neuroticism         0.9093      0.9094      0.9096      0.8918      0.8811
Interview           0.9124      0.9129      0.9120      0.8752      0.8755

(2) Statistical Parity (SP_MI)
Constructs          Gender                                      Race
                    Original   Data Balance   Adversarial       Original   Data Balance   Adversarial
Openness            0.059      0.009          -0.0147           0.005      0.0003         -0.0113
Conscientiousness   -0.005     -0.006         -0.0084           0.016      -0.0002        -0.0074
Extraversion        0.048      0.005          -0.0572           -0.011     -0.005         -0.0094
Agreeableness       0.005      -0.001         -0.0148           0.033      0.029          0
Neuroticism         -0.015     -0.013         -0.0416           0.024      0.018          -0.0044
Interview           0.004      0.001          -0.0141           0.0007     -0.006         -0.0290

(3) Equal Accuracy (EA)
        Male-Female                   Caucasian-African American    Caucasian-Asian
        Original  Balance  Adv.       Original  Balance  Adv.       Original  Balance  Adv.
Ope.    0.001     0.002    -0.0041    0.030     0.029    0.002      -0.011    0.015    -0.0239
Con.    0.002     0.002    -0.0084    -0.005    -0.007   0.0013     0.001     0        -0.0641
Ext.    0.012     0.010    -0.0302    0.001     0.003    0.0074     -0.004    0.006    -0.0443
Agr.    0.011     0.009    0.009      0.021     0.019    -0.0013    -0.039    0.038    -0.0054
Neu.    0         0        -0.0034    0.004     0.003    0.0041     0.017     -0.011   -0.0592
Int.    0.003     0.002    -0.0066    0.012     0.012    -0.0022    -0.001    0.010    -0.0526

Table 4.8: Performance comparison of different debiasing approaches.
the outcomes reveal less information about the sensitive attributes. Table 4.8 (3) shows the performance regarding the EA metrics. Compared to the SP_MI metric, both approaches achieve only marginal improvements for some constructs.
Due to the fairness-accuracy trade-off, the accuracy of the estimations may deteriorate when enhancing model fairness. According to the results shown in Table 4.8 (1), the adversarial learning approach slightly decreases the model accuracy, whereas the data balancing approach sacrifices less accuracy. For most of the constructs, both accuracy and fairness metrics improve after data balancing.
In general, both methods work better for the Statistical Parity metric and for the biases against the African-American population. The two approaches suit different application scenarios. Data
balancing is more efficient and can maintain the model utility as much as possible. Adversarial
learning is optimized for improving the model fairness, which can be used in the applications
where removing bias is more critical.
4.4 Conclusions
With the increasing applications of behavior modeling, it is crucial to ensure both model utility
and fairness of such systems. In this section, we investigate the heterogeneity of human behaviors
and behavior annotations in real-life datasets.
Heterogeneity is prevalent in human behavior signals, behavior labels, and across-modality patterns. Such heterogeneity poses challenges for both model accuracy and fairness. Our analyses identify the following sources of bias: (i) distribution differences of behavior signals and their heterogeneous relationships with behavior labels introduce bias across sensitive groups; (ii) individual behavior differences increase the biases under user-independent evaluation; (iii) features and pre-trained models from different modalities introduce different types of biases; (iv) late fusion introduces additional bias even though it further optimizes the model accuracy.
We also propose modeling strategies to mitigate the bias stemming from heterogeneity. Experiments on real-life data show that adversarial learning can be adapted to identify and mitigate the labeling bias in behavior annotations. Data balancing also performs well in mitigating the bias from different modalities while maintaining high utility.
Chapter 5
Enhancing Model Fairness via Data Balancing
5.1 Introduction
Data imbalance is common in most real-world data collections. The nature of the task is one of the reasons for imbalance. For example, there are fewer data instances about subjects with a certain disease than about healthy subjects, due to the natural imbalance dictated by the prevalence of that disease in the target population. However, data imbalance may also stem from biased data collection or polluted training data: for example, it has been reported that decision-making systems
used to predict recidivism, employed for parole or sentencing decisions, are trained on data where
some sensitive demographic attribute (e.g., race) is over- or under-represented with respect to the
prevalence of a model outcome (i.e., a criminal defendant’s likelihood of committing a crime) [93].
Imbalanced classifications pose challenges for predictive modeling as most of the machine learning
algorithms used for classification were designed around the assumption of an equal number of
examples for each class. This results in models that have poor predictive performance, specifically
for the minority class. This is a problem because typically, the minority class is more important
and therefore the problem is more sensitive to classification errors for the minority class than the
majority class.
Class imbalance also poses challenges for model fairness. For example, if the unprivileged group is under-represented in the minority class, the performance gap across sensitive attributes will be further exaggerated. Various class balancing techniques (e.g., resampling, synthetic sample generation) [94, 95] have been proposed to improve model accuracy, but their effects on model fairness are largely unknown.
Additionally, despite the various solutions for fair machine learning systems, ranging from data pre-processing [5, 6] to post-processing [10], most studies require access to sensitive attributes (e.g., the gender or race of the subjects). However, in many real-world applications, sensitive attributes are not observable due to privacy concerns or legal restrictions. Thus, solutions that do not require access to the sensitive attributes are needed.
Contributions of this chapter
In this chapter, we first conduct experiments on real-world datasets to investigate the effects of
existing class balancing algorithms on fairness. Our analyses show that common class balancing
techniques can exacerbate unfairness, also in part due to inherent properties of the data. Inspired
by these observations, we propose a new class balancing method named fair class balancing.
Our fair class balancing method is a revised class balancing technique that is inspired by
the K-Means SMOTE [96] algorithm. Our proposed method can enhance model fairness as well
as prediction accuracy. It can be viewed as a pre-processing strategy to enhance model fairness
without observing sensitive attributes. In summary:
• We propose two new bias measures to quantify two types of biases in data collection, which
are also two sources of discrimination in model outcomes.
• We investigate the effects of common class balancing techniques on fairness and show how
different data properties affect the interplay between balancing and fairness.
• We propose the fair class balancing method, which provides a way to improve the fairness
of model outcomes when sensitive attributes are unobserved. Experimental results show
that the proposed method yields state-of-the-art performance on both accuracy and fairness
metrics.
5.2 Related Work
The problem of fairness in machine learning algorithms has been drawing increasing research interest in recent years. Much work has been done to improve fairness at different stages of the modeling process, including feature selection [4], data pre-processing [5, 6], model adjustment [7, 8], and post-processing [10]. Different approaches have also been proposed for different machine learning applications, including representation learning [31], clustering [97], and natural language processing [98].
All of the strategies above require access to sensitive attributes in order to mitigate bias. However, collecting that type of information might be difficult, or even forbidden by law, in real-world applications. Recently, a few studies have explored different strategies to address this issue. One typical solution is using non-sensitive information as a proxy for sensitive attributes. Previous work [41] has shown that non-sensitive information can be highly correlated with sensitive attributes. Kilbertus et al. proposed a method for selecting proxy groups by inferring causal relationships in the underlying data [36]; that framework is based on the assumption of causality between proxy features and the sensitive attribute(s). Instead, proxy fairness [42] leverages the correlations between proxy features and the true sensitive attributes: proxy features are used as an alternative to the sensitive attribute(s) when applying a standard fairness-improving strategy. The use of weighted estimators [43] has also been discussed as an assessment tool for proxy models. Although the existence of proxy features offers hope for improving fairness with unobserved sensitive attributes, identifying perfect proxy groups is still a challenging task.
5.3 Hardness and Distribution Biases
In this chapter, we study the problem of model fairness in supervised classification models, where the goal is to predict a true outcome Y from a feature vector X based on labeled training data. The fairness of prediction Ŷ is evaluated with respect to sensitive groups of individuals defined by sensitive attributes A, such as gender or race. Both outcome Y and sensitive attribute A are assumed to be binary, i.e., Y ∈ {0,1} and A ∈ {0,1}, where A = 1 represents the privileged group (e.g., male), while A = 0 represents the unprivileged group (e.g., female).
We use three group fairness metrics introduced in Chapter §3: Equal Opportunity Difference (EOD), Average Equalized Odds Difference (EOddsD), and Statistical Parity Difference (SPD) [27].
The fairness metrics that have been so far proposed in the literature are designed to measure
the bias in model outcomes. Previous work [60] has shown that the properties of collected data
have great impact on model fairness. Thus, it is important to quantify and measure biases in data
collection. As for the criteria to assess fairness, in addition to the traditional group-based fairness
metrics, in this section, we propose two new measures to capture and quantify the amount of
different types of bias in data collection, namely hardness bias and distribution bias.
5.3.1 Hardness Bias
Different hardness measures have been proposed to indicate how hard it is for an instance to be correctly classified. In this work, we use k-Disagreeing Neighbors (kDN) [99] as the hardness metric. kDN measures the local overlap of an instance in the original task space in relation to its nearest neighbors. The kDN of an instance is the percentage of its k nearest neighbors (using Euclidean distance) that do not share its target class value:

kDN(x, y) = |{(x′, y′) : x′ ∈ kNN(x), y′ ≠ y}| / k,

where kNN(x) is the set of k nearest neighbors of the instance x, and y is the target class value for x.
A larger kDN (i.e., closer to 1) indicates that the instance is harder to predict correctly. If the instances within one group have higher kDN values compared to other groups, the prediction accuracy for that group will also be lower. Thus, the hardness bias is defined as the difference between the kDN distributions of different groups (e.g., instances in different classes or sensitive groups). In this work, we measure kDN with k = 5, i.e., 5-Disagreeing Neighbors.
Definition 6 (Hardness Bias) The hardness bias Γ_A(y) of a dataset D is defined as the Kullback–Leibler (KL) divergence between the kDN distributions of instances with different sensitive attributes A ∈ {0,1}:

Γ_A(y) = KL( f({kDN(x, y) | A = 1}) ∥ f({kDN(x, y) | A = 0}) ),

where f({kDN(x, y) | A = a}) is the density of the kDN distribution of all instances with A = a and Y = y.
The kDN distribution of each sensitive group affects the classification accuracy. Thus, the hardness bias is linked to the accuracy differences between groups, which is conceptually similar to the Equal Opportunity and Equalized Odds measures.
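The following is a minimal sketch of how kDN and the hardness bias could be computed with scikit-learn and SciPy; the histogram-based density estimate and the variable names are illustrative assumptions rather than the exact implementation used in this work.

import numpy as np
from scipy.stats import entropy
from sklearn.neighbors import NearestNeighbors

def kdn(X, y, k=5):
    """k-Disagreeing Neighbors: fraction of the k nearest neighbors with a different label."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
    _, idx = nn.kneighbors(X)
    neighbor_labels = y[idx[:, 1:]]                   # drop the point itself
    return (neighbor_labels != y[:, None]).mean(axis=1)

def hardness_bias(X, y, a, y_value=1, k=5, bins=10):
    """KL divergence between the kDN distributions of the two sensitive groups, within class y_value."""
    scores = kdn(X, y, k)
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(scores[(a == 1) & (y == y_value)], bins=edges, density=True)
    q, _ = np.histogram(scores[(a == 0) & (y == y_value)], bins=edges, density=True)
    eps = 1e-8                                        # avoid zero densities in the KL computation
    return entropy(p + eps, q + eps)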
5.3.2 Distribution Bias
In addition to the hardness bias between classes, the distribution of sensitive groups within each
class also affects the model performance.
Definition 7 (Distribution Bias) The distribution bias ∆_A(y) of a dataset D is defined as the difference of the probabilities of Y = y, conditioned upon the values of the sensitive attribute A ∈ {0,1}:

∆_A(y) = Pr(Y = y | A = 1) − Pr(Y = y | A = 0),

where y ∈ {0,1}.
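The distribution bias is straightforward to compute; the short sketch below assumes NumPy arrays y and a holding the binary labels and the binary sensitive attribute.

import numpy as np

def distribution_bias(y, a, y_value=1):
    """Pr(Y = y_value | A = 1) - Pr(Y = y_value | A = 0)."""
    return (y[a == 1] == y_value).mean() - (y[a == 0] == y_value).mean()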
Figure 5.1: Examples of the bias changes before and after class balancing. Panels (a), (c) and (e) illustrate the change of distribution bias; panels (b), (d) and (f) compare the hardness bias before and after class balancing for the Adult, COMPAS and ViolentCrime data.
5.4 Effects of Class Balancing on Fairness
Class balancing techniques such as over-sampling and under-sampling are commonly adopted to modulate the class distribution of a dataset that exhibits class imbalance. Although class balancing typically yields better prediction performance, the effect of this type of strategy on model fairness is largely unknown.
Next, we focus on five popular class balancing techniques, i.e., random over-sampling (ROS), random under-sampling (RUS), the synthetic minority over-sampling technique (SMOTE) [94], cluster centroids (CC) [95], and K-Means SMOTE [96], and investigate the effects of their adoption on model fairness. All the algorithms described above are implemented in the reference Python library imbalanced-learn (https://imbalanced-learn.readthedocs.io).
For all our experiments, we use scikit-learn's Logistic Regression classifier as the classification model. Each dataset is randomly split into an 80% development set and a 20% test set with 50 randomized restarts. Models are trained on the development set with 10-fold cross-validation parameter tuning. All sensitive attributes are removed before learning to avoid information leakage.
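A minimal sketch of this experimental pipeline is shown below, assuming the imbalanced-learn and scikit-learn APIs; the random data, random seed, and hyper-parameters are placeholders rather than the actual experimental configuration.

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X, y: feature matrix and binary labels with sensitive attributes already removed (placeholders).
X, y = np.random.rand(1000, 10), np.random.randint(0, 2, 1000)

X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Balance only the development set, then fit the classifier.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_dev, y_dev)
clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
y_pred = clf.predict(X_test)   # fairness metrics are then computed on (y_test, y_pred) per group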
5.4.1 Effects on Biases
Class balancing techniques are designed to address the imbalance of the target variables. However, the biases in the data may increase after balancing, because the process is oblivious to the inherent properties of the datasets.
Figure 5.1 illustrates how the biases change before and after class balancing with Random Over-Sampling (ROS). As Figure 5.1 shows, the three datasets have different properties. In the original Adult dataset, positive (i.e., high income) samples compose only 25% of the data. More than 20,000 positive samples are added after class balancing, of which only 15% are female. Therefore, more male samples are added after class balancing, which further increases the distribution bias. Similarly, ViolentCrime also has a significantly lower ratio of positive samples, and only 4% of the positive samples belong to the non-white population. The distribution bias increases six-fold after class balancing with ROS (cf., Table 5.1). Differently from the other two datasets, the COMPAS dataset is more balanced, with 54% of the instances being positive samples and 46% being negative; the ratios of the different race groups within each class are also similar (cf., Figure 5.1 (c)).
Dataset         Bias      Original     RUS          CC           ROS          SMOTE        KMeans-SMOTE
Adult           ∆(y=1)    0.20±0.01    0.31±0.03    0.29±0.02    0.29±0.02    0.32±0.03    0.28±0.03
                Γ(y=1)    0.26±0.01    0.02±0.03    0.05±0.02    0.08±0.01    0.03±0.02    0.11±0.03
COMPAS          ∆(y=1)    0.09±0.01    0.10±0.02    0.11±0.02    0.11±0.03    0.10±0.03    0.09±0.02
                Γ(y=1)    0.05±0.01    0.05±0.01    0.06±0.01    0.05±0.02    0.05±0.02    0.01±0.02
ViolentCrime    ∆(y=1)    0.06±0.02    0.10±0.02    0.05±0.02    0.13±0.03    0.24±0.03    0.02±0.03
                Γ(y=1)    0.13±0.03    0.09±0.03    0.05±0.02    0.02±0.03    0.19±0.02    0.02±0.02

Table 5.1: Effects of class balancing techniques on the bias metrics.
Regarding the hardness bias, Figure 5.1 (b), (d) and (f) illustrate the distribution of kDN (i.e., 5DN). The three datasets exhibit different kDN distributions. Most of the samples of the Adult dataset have a small kDN, while the kDN distribution of the COMPAS dataset is centered around 0.5. For the ViolentCrime dataset, positive samples have a higher kDN than negative samples. After class balancing, the kDN distributions of all datasets show smaller differences between classes. However, hardness bias still exists after class balancing. For example, Γ_race(1) of the COMPAS dataset only decreases from 0.056 to 0.051 (cf., Table 5.1).
In summary, all class balancing techniques can decrease the hardness bias but increase the distribution bias compared to the original datasets.
5.4.2 Effects on Group Fairness
Table 5.2 shows how the class balancing techniques impact the model performance on both utility and group fairness metrics. For all datasets, class balancing techniques increase the classification accuracy of the minority class (i.e., Acc. in Table 5.2). Generally, over-sampling strategies perform better than under-sampling. In terms of fairness metrics, class balancing methods increase the discrimination between the different sensitive groups for all datasets. Class balancing has less of an impact on fairness for the COMPAS dataset: almost all metrics stay in the same range as prior to balancing.
Comparing the different class balancing methods, KMeans-SMOTE has the least negative impact on the fairness metrics.
Dataset        Methods          F1           Acc.         EOD          EOddsD       SPD
Adult          Original         0.85±0.006   0.57±0.04    -0.12±0.04   -0.09±0.02   -0.17±0.01
               RUS              0.79±0.01    0.87±0.01    -0.13±0.05   -0.17±0.06   -0.32±0.06
               CC               0.75±0.01    0.89±0.01    -0.14±0.03   -0.21±0.03   -0.37±0.03
               ROS              0.81±0.02    0.82±0.07    -0.14±0.03   -0.20±0.03   -0.36±0.04
               SMOTE            0.81±0.02    0.82±0.07    -0.15±0.05   -0.19±0.06   -0.33±0.03
               KMeans-SMOTE     0.82±0.006   0.56±0.02    -0.14±0.03   -0.12±0.02   -0.19±0.01
COMPAS         Original         0.67±0.01    0.52±0.04    -0.18±0.03   -0.13±0.02   -0.15±0.01
               RUS              0.66±0.01    0.64±0.02    -0.18±0.04   -0.14±0.02   -0.16±0.02
               CC               0.64±0.01    0.66±0.02    -0.21±0.03   -0.17±0.02   -0.19±0.02
               ROS              0.52±0.04    0.65±0.02    -0.18±0.03   -0.14±0.02   -0.17±0.02
               SMOTE            0.66±0.01    0.65±0.02    -0.18±0.03   -0.14±0.02   -0.17±0.02
               KMeans-SMOTE     0.67±0.01    0.60±0.02    -0.20±0.04   -0.15±0.02   -0.17±0.02
ViolentCrime   Original         0.84±0.01    0.16±0.02    -0.10±0.05   -0.05±0.02   -0.03±0.01
               RUS              0.71±0.01    0.66±0.04    -0.21±0.08   -0.16±0.04   -0.15±0.02
               CC               0.41±0.02    0.81±0.04    -0.11±0.07   -0.11±0.03   -0.12±0.03
               ROS              0.72±0.01    0.64±0.02    -0.20±0.08   -0.15±0.04   -0.14±0.03
               SMOTE            0.74±0.01    0.47±0.04    -0.11±0.09   -0.08±0.04   -0.08±0.02
               KMeans-SMOTE     0.83±0.01    0.37±0.04    -0.16±0.07   -0.10±0.04   -0.09±0.02

Table 5.2: Effects of class balancing techniques on group fairness. For F1 and the accuracy of the minority class (Acc.), the higher the better. For Equal Opportunity (EOD), Equalized Odds (EOddsD), and Statistical Parity (SPD), values close to zero indicate fairer outcomes.
5.5 Fair Class Balancing
To address the challenges posed by class balancing, and its effect on model fairness illustrated
above, we here propose a new strategy, named the fair class balancing method. It is worth noting
that our method not only enhances the fairness of the balancing approach while preserving pre-
diction accuracy, but also paves the way for a pre-processing strategy to improve fairness when
sensitive attributes are unobserved.
5.5.1 Proposed Method
We propose a cluster-based balancing method, named fair class balancing, that is guided by the group structure of the data, i.e., the natural occurrence of homogeneous subgroups with shared feature similarities, which can be identified via clustering in the feature space. Algorithm 1 provides a formal description of the proposed method.
Algorithm 1: Fair Class Balancing
  Data: Original dataset D = {d_1, d_2, ..., d_n}; number of nearest neighbors k
  Result: Balanced dataset D̂
  Clusters C = {c_1, c_2, ..., c_m} ← clustering method M
  Silhouette scores S = {s_1, s_2, ..., s_n}
  Threshold θ = Q_1(S)
  for c_i ∈ C do
      c̃_i = {d_j ∈ c_i | s_j > θ}
      Sampling Count ← majorityCount(c̃_i) − minorityCount(c̃_i)
      y_min = minority class in c̃_i
      if Sampling Count > 0 then
          nn = {}
          for d_j ∈ {x ∈ c̃_i | y = y_min} do
              knn_j ← k-nearest neighbors of d_j   (the k-nearest neighbors are not limited to the minority class)
              nn ← nn ∪ knn_j
          generatedSamples ← Generator(Sampling Count, nn)
          ĉ_i ← c̃_i ∪ generatedSamples
  D̂ = {ĉ_1, ĉ_2, ..., ĉ_m}
Similarly to K-Means SMOTE, fair class balancing includes three steps: clustering, filtering,
and oversampling. We re-design the filtering and oversampling steps to incorporate fairness con-
straints. An intuition of the proposed method is as follows (a code sketch follows the list):
1. Split the data into clusters according to a clustering algorithm M of choice, yielding samples in each cluster c_i that share similarity in the feature space. Then calculate the silhouette score s_j of each sample. We run the clustering algorithm varying the number m of clusters, and for each parameter instantiation we use the average silhouette score to determine the goodness of the clustering; we then choose the best number of clusters m accordingly.
2. Remove the samples close to the cluster boundaries based on the silhouette score. The samples with the lowest 25% (lower quartile) of silhouette scores are filtered out of the dataset.
3. For each cluster c_i, the minority class is the class that accounts for less than half of the samples within the cluster. New samples of the minority class are generated based on the k-nearest neighbors of the minority samples, following the SMOTE [94] generation algorithm.
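Below is a minimal Python sketch of fair class balancing, assuming scikit-learn's KMeans and silhouette scores and a simple SMOTE-style interpolation; the fixed cluster count and the helper structure are illustrative simplifications of Algorithm 1, not the exact implementation.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def fair_class_balancing(X, y, n_clusters=5, k=5, seed=0):
    """Cluster, filter low-silhouette samples, then oversample toward neighbors of BOTH classes."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
    sil = silhouette_samples(X, labels)
    theta = np.quantile(sil, 0.25)                     # drop the lowest-quartile silhouette scores

    X_new, y_new = [X], [y]
    for c in range(n_clusters):
        keep = (labels == c) & (sil > theta)
        Xc, yc = X[keep], y[keep]
        if len(Xc) < k + 1:
            continue
        counts = np.bincount(yc, minlength=2)
        n_gen, y_min = counts.max() - counts.min(), counts.argmin()
        if n_gen == 0 or counts.min() == 0:
            continue
        minority = Xc[yc == y_min]
        for _ in range(n_gen):
            seed_pt = minority[rng.integers(len(minority))]
            d = np.linalg.norm(Xc - seed_pt, axis=1)
            nb = Xc[rng.choice(np.argsort(d)[1:k + 1])]   # neighbor from either class within the cluster
            X_new.append((seed_pt + rng.random() * (nb - seed_pt))[None, :])
            y_new.append(np.array([y_min]))
    return np.vstack(X_new), np.concatenate(y_new)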
Figure 5.2: Illustration of how fair class balancing enhances fairness. Circle and star nodes repre-
sent samples with positive and negative labels, respectively. The hollow nodes are samples gener-
ated based on the nearest neighbors of the minority samples in each cluster.
The proposed method has a pipeline similar to K-Means SMOTE [96]. K-Means SMOTE generates the new samples based on the nearest neighbors within the minority class, while the nearest neighbors in our oversampling process are not limited to the minority samples. Figure 5.2 illustrates how fair class balancing works. In addition to clustering, we also filter out the samples that are close to the cluster boundaries. Samples close to a cluster boundary are difficult to separate in the feature space, which means that they have a higher probability of being falsely assigned to one of the clusters. Thus, generating new instances based on those samples may further increase the distribution bias.
The strategy of fair class balancing is similar to Preferential Sampling [30], aiming to avoid the discrimination that arises from borderline samples, i.e., samples close to the decision boundary are more likely to be discriminated against (or favored). In Section §5.5.2, we show how this approach enhances model fairness via balancing.
5.5.2 Biases after Fair Class Balancing
Table 5.3 compares the biases of the original dataset, the dataset after fair class balancing, and the dataset after KMeans SMOTE balancing (which has the best fairness performance according to Table 5.2). Although traditional class balancing methods can also reduce the hardness bias, our proposed method yields a more balanced hardness distribution across sensitive groups and prevents the increase of distribution bias. Additionally, in real applications, traditional class balancing techniques aim to balance the population differences between the target classes in the training set while not considering the potential biases in the test set.

Figure 5.3: Examples of the hardness bias changes in the test set before and after class balancing. The kDN for each test sample here is the percentage of the k nearest training samples that do not share the same target variable as the test sample.
(1) Adult Dataset
Income        ∆_gender(y): Orig.  KMeans SMOTE  Ours      Γ_gender(y): Orig.  KMeans SMOTE  Ours
High (Y=1)                 0.20    0.28          0.09                   0.035   0.022        0.030
Low (Y=0)                 -0.20   -0.28         -0.09                   0.089   0.045        0.043

(2) COMPAS Dataset
Recidivism    ∆_race(y): Orig.    KMeans SMOTE  Ours      Γ_race(y): Orig.    KMeans SMOTE  Ours
True (Y=1)               -0.09    -0.09         -0.08                  0.056    0.018        0.006
False (Y=0)               0.09     0.09          0.08                  0.034    0.009        0.001

(3) ViolentCrime Dataset
Recidivism    ∆_race(y): Orig.    KMeans SMOTE  Ours      Γ_race(y): Orig.    KMeans SMOTE  Ours
True (Y=1)               -0.06    -0.02         -0.05                  0.131    0.088        0.018
False (Y=0)               0.06     0.02          0.05                  0.035    0.016        0.016

Table 5.3: Bias measurement. Difference in bias of original data, after KMeans SMOTE class balancing, and after our proposed balancing method, as captured by the two metrics we propose, i.e., distribution bias ∆, and hardness bias Γ.
Assume that both the training set and the test set are randomly sampled from the same data source and follow the same distribution. As we previously mentioned, samples close to the decision boundary are more likely to be discriminated against (or favored). Those borderline samples exist in both the training set and the test set. In the traditional class balancing process, some of the borderline samples in the training set are oversampled based on the target variables' distribution, which makes the borderline samples in the test set more likely to be classified as one particular class. Thus, the biases in the model outcomes are amplified.
To address this issue, fair class balancing generates new samples based on all neighboring samples, aiming to make sure that borderline samples have a similar number of neighbors in both classes, which in turn mitigates the biases in the predictions.
Figure 5.3 gives examples of the kDN distribution of the test sets. Each dataset is randomly split into an 80% training set and a 20% test set. The kDN for each test sample is calculated by locating the k nearest training samples for that test sample. A higher test kDN means that the sample is more likely to be falsely assigned to the other class.
As the figure shows, samples from different classes have different kDN distributions. Among the three original datasets, the Adult dataset has the lowest test kDN, with a mean of 0.18. The test kDN of the COMPAS dataset is closer to a normal distribution, with a mean of 0.41, indicating that most samples have similar probabilities of being assigned to either class. Different
from the first two datasets, the ViolentCrime dataset shows a distinct pattern: positive samples have significantly higher kDN than negative samples, which means they are harder to predict accurately. In general, positive samples (i.e., the minority class) have higher test kDN than negative samples in all datasets, especially for the ViolentCrime dataset. If one sensitive group has more negative samples, this results in biases in the model outcomes.
The kDN distribution differences are reduced after fair class balancing. For the ViolentCrime dataset, the mean test kDN values of positive and negative samples are 0.72 and 0.13, respectively; after balancing, the mean values are 0.455 and 0.452. Regarding the sensitive groups, after balancing, the mean test kDN changes from 0.25 to 0.48 for the non-white population and from 0.17 to 0.40 for the white population, which also mitigates the biases in the original data.
5.6 Experiments
5.6.1 Fairness Assessment Framework
This work focuses on scenarios where the sensitive attributes are not observable in either the model training or the prediction process. To assess fairness in this scenario, we propose the third-party fairness assessment framework shown in Figure 5.4. The proposed framework has a similar structure to black-box auditing [100], which is designed to audit the biases of machine learning models without accessing the models' internal structure.
Figure 5.4: Prediction and assessment framework. In our setting, sensitive attributes are consid-
ered as unobservable to both the service provider and the client sides. Hence, fairness is judged by
a third-party assessment agency.
Dataset    Method                   F1           Acc.         EOD           EOddsD        SPD
Adult      Baseline (Original)      0.84±0.003   0.59±0.01    -0.08±0.03    -0.07±0.01    -0.17±0.006
           Ours w/ KMeans (5-nn)    0.75±0.02    0.68±0.02    0.009±0.03    -0.005±0.01   -0.10±0.01
           Ours w/ Agg. (5-nn)      0.75±0.01    0.71±0.04    0.20±0.04     0.12±0.03     -0.03±0.03
           Ours w/ Spec. (5-nn)     0.72±0.02    0.73±0.02    0.02±0.04     0.02±0.02     -0.06±0.02
COMPAS     Baseline (Original)      0.67±0.01    0.56±0.01    -0.20±0.04    -0.14±0.02    -0.16±0.02
           Ours w/ KMeans (5-nn)    0.62±0.02    0.63±0.08    -0.14±0.04    -0.11±0.03    -0.13±0.03
           Ours w/ Agg. (5-nn)      0.64±0.01    0.56±0.03    -0.13±0.04    -0.11±0.03    -0.13±0.02
           Ours w/ Spec. (5-nn)     0.64±0.01    0.61±0.09    -0.14±0.04    -0.11±0.03    -0.13±0.03
Violent    Baseline (Original)      0.84±0.01    0.16±0.02    -0.10±0.05    -0.06±0.02    -0.04±0.01
           Ours w/ KMeans (5-nn)    0.65±0.02    0.44±0.06    -0.04±0.05    -0.02±0.05    -0.02±0.03
           Ours w/ Agg. (5-nn)      0.61±0.01    0.52±0.04    -0.01±0.08    -0.01±0.05    -0.02±0.03
           Ours w/ Spec. (5-nn)     0.63±0.02    0.52±0.05    -0.03±0.08    -0.04±0.05    -0.06±0.04

Table 5.4: Classification performance comparison among original dataset (Baseline) and fair class balancing (Ours) with 5-NN parameters and different clustering algorithms: KMeans clustering (KMeans), Agglomerative clustering (Agg.), and Spectral clustering (Spec.). Results with the smallest bias against the unprivileged group (i.e., most positive results) are bolded.
In our framework, sensitive attributes are unobservable for a service provider (i.e., the model
training process) as well as on the client side (i.e., the testing process). Model fairness is evaluated
by an assessment agency (e.g., government), which obtains both the feature matrix and sensitive
attributes of the test set. This framework is practical in real-world applications and also provides a
comparable setting with respect to the extant fairness research.
5.6.2 Experimental Setting
To illustrate the real fairness improvements of our proposed method and to allow a fair comparison with other strategies, we follow the third-party fairness assessment framework introduced in Section §5.6.1. All experiments are based on an 80-20 dataset split with 50 randomized restarts. All sensitive attributes are removed from the training and test sets.
We focus on the following utility and fairness metrics in this work (a sketch of how the fairness metrics can be computed follows the list):
• Utility: F1-score (F1) and the classification accuracy of the minority class (Acc.).
• Fairness: Equal Opportunity Difference (EOD), Average Equalized Odds Difference (EOddsD), and Statistical Parity Difference (SPD).
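The following is a minimal sketch of how a third-party assessor could compute the three group fairness metrics from the predictions and the test-set sensitive attributes; the sign convention used here (unprivileged minus privileged, so that negative values indicate bias against the unprivileged group) follows the common definitions and may differ in detail from the exact implementation used in this dissertation.

import numpy as np

def rates(y_true, y_pred):
    tpr = (y_pred[y_true == 1] == 1).mean()   # true positive rate
    fpr = (y_pred[y_true == 0] == 1).mean()   # false positive rate
    sel = (y_pred == 1).mean()                # selection rate
    return tpr, fpr, sel

def group_fairness(y_true, y_pred, a):
    """a = 1: privileged group, a = 0: unprivileged group."""
    tpr_u, fpr_u, sel_u = rates(y_true[a == 0], y_pred[a == 0])
    tpr_p, fpr_p, sel_p = rates(y_true[a == 1], y_pred[a == 1])
    eod = tpr_u - tpr_p                                   # Equal Opportunity Difference
    eoddsd = 0.5 * ((tpr_u - tpr_p) + (fpr_u - fpr_p))    # Average Equalized Odds Difference
    spd = sel_u - sel_p                                   # Statistical Parity Difference
    return eod, eoddsd, spd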
We test our fair class balancing method with three widely-used clustering algorithms: KMeans
clustering, Agglomerative clustering (Agg.), and Spectral clustering (Spec.). All metrics show
similar performance for different classification algorithms (e.g., Logistic Regression, Random For-
est, etc.). We only report the results of Logistic Regression due to the page limit.
5.6.3 Performance
Compared to the performance of the traditional class balancing techniques (shown in Table 5.2), our fair class balancing method improves all the fairness metrics. In general, all clustering methods improve the performance. Especially for the Adult dataset, fair class balancing yields promising scores, e.g., EOD is 0.009 (± 0.01) and EOddsD equals -0.005 (± 0.01) with KMeans clustering. With Agglomerative clustering, EOD further increases to 0.20 (± 0.04), which indicates that the biases against the unprivileged group are removed and the model starts to favor those samples. With KMeans clustering and fairness budget 5-nn, the EOddsD metrics on the Adult, COMPAS, and ViolentCrime datasets improve by 92%, 21%, and 66%, respectively. Additionally, compared to the baseline method, our fair class balancing method also yields higher accuracy for the minority class, which is a crucial metric in real-life applications.
Dataset    Method          F1            Accuracy      EOD            EOddsD          SPD
Adult      Baseline        0.84±0.003    0.59±0.01     -0.08±0.03     -0.07±0.01      -0.17±0.006
           CEO [33]        0.81±0.0004   0.46±0.003    -0.09±0.004    -0.07±0.01      -0.13±0.006
           ROS + CEO       0.83±0.003    0.67±0.03     -0.13±0.007    -0.15±0.009     -0.27±0.01
           Ours            0.75±0.02     0.68±0.02     0.009±0.03     -0.005±0.01     -0.10±0.01
           Ours + CEO      0.79±0.002    0.54±0.008    0.10±0.02      0.04±0.01       -0.08±0.01
COMPAS     Baseline        0.67±0.01     0.56±0.01     -0.20±0.04     -0.14±0.02      -0.16±0.02
           CEO [33]        0.65±0.01     0.48±0.01     -0.17±0.03     -0.12±0.02      -0.14±0.02
           ROS + CEO       0.65±0.01     0.55±0.01     -0.17±0.03     -0.13±0.02      -0.16±0.02
           Ours            0.63±0.01     0.60±0.03     -0.13±0.04     -0.11±0.03      -0.13±0.02
           Ours + CEO      0.63±0.02     0.50±0.05     -0.11±0.04     -0.09±0.02      -0.11±0.02
Violent    Baseline        0.84±0.01     0.16±0.02     -0.10±0.05     -0.06±0.02      -0.04±0.01
           CEO [33]        0.43±0.0004   0.06±0.01     0.003±0.004    -0.01±0.003     -0.02±0.03
           ROS + CEO       0.52±0.001    0.22±0.005    -0.01±0.001    -0.02±0.001     -0.03±0.001
           Ours            0.61±0.01     0.52±0.04     -0.01±0.08     -0.01±0.05      -0.02±0.03
           Ours + CEO      0.50±0.01     0.17±0.03     0.002±0.008    -0.006±0.005    -0.01±0.006

Table 5.5: Combining class balancing and fairness-aware learning. Performance comparison among fair class balancing (Ours), proxy-based post-processing (CEO), and combining CEO with traditional class balancing and fair class balancing. Results with the smallest bias against the unprivileged group are bolded.
In terms of the fairness metrics, different clustering methods show promising performance and
Agglomerative clustering provides relatively better performance for all datasets.
Effects of the fairness budget k-NN
In our method, we introduce the fairness budget parameter k-NN, which indicates the number of nearest neighbors used for generating new samples. As k-NN increases, more noisy samples are generated, which further decreases the biases across sensitive groups. However, more noisy samples also decrease the model accuracy.
Figure 5.5 shows how the different metrics change as a function of k-NN. The performances reported in Figure 5.5 are based on fair class balancing with the K-Means clustering algorithm. All three datasets show an increasing trend for all three fairness metrics as k-NN increases, indicating that more biases against the unprivileged group are removed. As a trade-off, the model accuracy shows a decreasing trend as k-NN increases.
Figure 5.5: Effects of the fairness budget parameter kNN. The plots illustrate model performance and fairness, according to four metrics (EOD, EOddsD, SPD, and Accuracy), for the Adult, COMPAS, and ViolentCrime data.
5.6.4 Fair Class Balancing & Fairness-Aware Learning
Our proposed method modifies the training data and is thus amenable as a pre-processing step for fairness-aware model learning, a set of strategies that aim at enhancing fairness in a post-hoc fashion. Next, we combine our proposed fair class balancing technique with a post-processing fairness-improving strategy to analyze how our method enables achieving even higher model fairness compared to the results presented earlier.
For our purposes, we use Calibrated Equalized Odds (CEO) [33] as the post-processing method of choice. The algorithm is implemented in the AI Fairness 360 Open Source Toolkit [101].
Since this work focuses on the scenario where sensitive attributes are not observable, we follow the idea of using proxy features as an alternative to the true sensitive attributes [42]. We use "age" information to create proxy groups: "age > 40" and "age <= 40" represent the privileged and unprivileged groups, respectively.
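A minimal sketch of this combination is shown below. It assumes the AIF360 BinaryLabelDataset and CalibratedEqOddsPostprocessing interfaces; the toy data, column names, and constructor arguments are assumptions based on the toolkit's documented usage, not code from this dissertation.

import numpy as np
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.algorithms.postprocessing import CalibratedEqOddsPostprocessing

rng = np.random.default_rng(0)
n = 500
# Toy stand-in for the test set: one feature, an age-based proxy attribute, true labels.
df = pd.DataFrame({"x1": rng.random(n),
                   "age_gt_40": rng.integers(0, 2, n),
                   "label": rng.integers(0, 2, n).astype(float)})
scores = rng.random(n)     # placeholder for the classifier's predicted probabilities

ds_true = BinaryLabelDataset(df=df, label_names=["label"],
                             protected_attribute_names=["age_gt_40"])
ds_pred = ds_true.copy(deepcopy=True)
ds_pred.scores = scores.reshape(-1, 1)
ds_pred.labels = (scores > 0.5).astype(float).reshape(-1, 1)

ceo = CalibratedEqOddsPostprocessing(
    privileged_groups=[{"age_gt_40": 1}],
    unprivileged_groups=[{"age_gt_40": 0}],
    cost_constraint="weighted", seed=0)
ceo = ceo.fit(ds_true, ds_pred)
ds_fair = ceo.predict(ds_pred)     # fairness-adjusted labels/scores using the proxy groups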
Table 5.5 compares model performance when using different fairness-improving strategies. In general, our proposed fair class balancing always yields the most accurate outcomes. As a post-processing method, CEO optimizes the EOddsD metric and sacrifices accuracy. Our experimental results show that combining fair class balancing and CEO can boost the accuracy of the results as well as achieve better fairness performance. Moreover, for the Adult and COMPAS datasets, just applying the fair class balancing technique yields better performance in terms of EOD and EOddsD compared to CEO. We also include the comparison of combining class balancing techniques with CEO, specifically Random Over-Sampling (ROS) and our fair class balancing. Experimental results show that the combination with our method can further improve the fairness performance.
Overall, fair class balancing + CEO exhibits the minimum biases against the unprivileged groups. Since CEO has a negative impact on model accuracy, and considering the importance of model accuracy in real-life applications, our proposed fair class balancing achieves the best balance between utility and fairness among all reported methods.
5.6.5 Discussion
According to the experimental results presented above, our proposed fair class balancing method can provide fairness improvements with unobserved sensitive attributes. In this section, we further discuss how different data properties impact the performance of our strategy.
As reported in Table 5.4, our proposed method may decrease the F1-score of the predictions, but it improves the accuracy of the minority class and the fairness metrics. Across the three datasets, the COMPAS dataset shows the smallest improvements. These performance differences can be explained by the data biases present in the original data. According to Figure 5.1, the COMPAS dataset has the most balanced sensitive groups, and the kDN distributions of all groups are similar. Our method is designed to mitigate the hardness bias and the distribution bias, and thus the improvements are less significant than on the other two datasets.
5.7 Experiments on Behavior Data
The previous section presented the performance of fair class balancing on several widely used fairness datasets. In this section, we further investigate its performance on real human behavior data.
5.7.1 Dataset and Configurations
We conduct the experiments on the behavior data collected via the TILES study introduced in
Section §2.1. We use the physiological and physical activity signals collected through wearable
sensors to estimate users’ daily stress levels. We select 43 bio-behavioral signals as predictive
features in our models. Bio-behavioral signal data were collected from continuous sensing of
garment-based OMsignal sensor and wristband-based sensor Fitbit Charge 2. Participants were
asked to wear their Fitbit on a 24/7 basis (limiting off-line time as much as possible and for reasons
such as battery recharging) for the whole study. In addition to Fitbit, subjects were instructed to
wear the OMsignal garments exclusively during work hours. OMsignal provides physiological
information including heart rate, heart rate variability (HRV), breathing, and accelerometer. Fitbit
Method                                      F1           Acc.         EOD           EOddsD       SPD
Baseline (Original data)                    0.53±0.02    0.49±0.03    -0.06±0.06    -0.03±0.05   -0.02±0.05
fair class balancing w/ KMeans (5-nn)       0.57±0.01    0.59±0.04    -0.01±0.04    0.03±0.03    0.03±0.03
fair class balancing w/ Agg. (5-nn)         0.56±0.03    0.58±0.02    -0.01±0.07    0.03±0.08    0.04±0.07
fair class balancing w/ Spec. (5-nn)        0.57±0.02    0.59±0.02    -0.02±0.07    0.02±0.06    0.03±0.07

Table 5.6: Classification performance comparison on TILES dataset with 5-NN parameters and different clustering algorithms.
Figure 5.6: Effects of the fairness budget parameter kNN on TILES dataset.
Charge 2 is designed to measure heart rate, step count, heart rate zones, and sleep statistics. The target variable is a binary label that indicates whether the person's stress level is above their individual average (i.e., 1) or not (i.e., 0); these labels are collected from daily surveys sent to the participants' smartphones.
We follow the same settings as in §5.6.2 in these experiments. The dataset is split into an 80% training set and a 20% test set with 50 randomized restarts. We report the test-set performance of Logistic Regression models.
5.7.2 Performance
Table 5.6 compares the utility and fairness performance of fair class balancing with fairness budget 5-NN. Consistent with the previous findings, fair class balancing improves all three fairness metrics with all choices of clustering algorithm. Moreover, fair class balancing can even
improve the model utility by generating synthetic samples of under-represented behaviors. Figure 5.6 illustrates the effects of the k-NN parameter: as k-NN increases, the fairness metrics generally increase. However, compared to the performance shown in Figure 5.5, both utility and fairness metrics are more stable for the TILES data due to the heterogeneity of users' behaviors and labels; thus, adding more noisy samples does not cause dramatic performance changes.
5.8 Conclusions
Guaranteeing model fairness in data-imbalanced settings is an open challenge in real-world machine learning applications. In this work, we investigated how class balancing techniques impact the fairness of model outcomes. Inspired by our findings showing how class balancing exacerbates unfairness, we proposed the fair class balancing method to enhance fairness, which also has the desirable property of not requiring access to the sensitive attributes.
The proposed method aims to mitigate the biases that come from borderline samples, which are one of the main sources of model unfairness according to the literature. Fair class balancing has a framework similar to cluster-based oversampling strategies, except that the synthetic samples are generated based on samples from both the minority and majority classes. The cluster-based strategy identifies the real class imbalance among samples that are similar in the feature space, and also effectively avoids generating overly noisy samples. Furthermore, by generating synthetic samples based on both classes, our method ensures that unseen borderline samples in the test sets have similar probabilities of being assigned to either class, which reduces the biases in model outcomes.
Experimental results on real-world datasets show that our fair class balancing method can improve all three fairness metrics of interest as well as the accuracy of the minority class. The improvements are not limited to one clustering algorithm of choice: fair class balancing with different clustering algorithms all yields fairer predictions. As the fairness budget parameter kNN increases, the biases against the unprivileged groups decrease, with a decrease in accuracy as the trade-off. Our method can also be used as a pre-processing step for other fairness-aware mechanisms, further improving both fairness and accuracy.
With increasing concerns about data privacy, more and more real-life applications no longer collect sensitive attributes due to legal restrictions, which also limits the applicability of traditional fairness-aware algorithms. As a class balancing technique that does not use any information about sensitive attributes, fair class balancing can still be widely applied to data-driven decision making systems.
Chapter 6
Enhancing Model Fairness via Mitigating Data Heterogeneity
6.1 Introduction
Various studies have used behavioral data and machine learning techniques to implement systems that aim to understand and track human affect. Examples of such systems exist in various application domains, from health assessment [102] to job performance evaluation [103]. However, individuals' physiological and psychological differences may introduce various sources of biases in the data, most prominently heterogeneity [48, 47]. This, in turn, can affect machine learning models' accuracy [50, 51] and fairness [53, 104].
An example of heterogeneity in human behavioral data is Simpson's paradox [105], a phenomenon wherein an association or a trend observed at the level of the entire population disappears or even reverses when the data is disaggregated by its underlying subgroups. This phenomenon is common in human behavioral data [106]. Failure to take the heterogeneous patterns into account during the modeling process might impair both the utility and the fairness of the system [107]. Researchers have explored different techniques to test for and discover such patterns in the past. Previous methods either rely on the group labels [108] or are only able to capture simplified patterns [109, 110].
However, the heterogeneity patterns in behavioral data are usually complicated and overlap across different groups. In addition, in many real-world applications, sensitive attributes, like gender, race, etc., are not observable due to privacy concerns or legal restrictions [16]. Thus,
in order to build trustworthy human behavior understanding systems, we need modeling strategies
that can identify and mitigate the impact of heterogeneity without accessing the sensitive attributes.
Contributions of this chapter
Motivated by the above challenges in heterogeneous human behaviors, in this chapter, we focus
on mitigating the unfair impact of heterogeneous behavioral features without accessing sensitive
attributes. In summary, our contributions in this chapter are as follows:
• We analyze the impact of different behavioral patterns on model utility and model fairness.
• We propose a method to identify heterogeneity patterns without sensitive attributes, named
multi-layer factor analysis.
• We propose a framework combining multi-layer factor analysis and feature rescaling to mit-
igate the bias in affective computing without accessing sensitive attributes. Experimental
results show that the proposed framework improves model fairness.
6.2 Related Work
The problem of fairness in machine learning has been drawing increasing research interest over the course of recent years. Most of the previous work focuses on classification tasks [8, 10]. A few papers have considered fairness in regression problems: convex [11] and non-convex [12] frameworks have been explored to add fairness constraints to regression models, and quantitative definitions and theoretical performance guarantees in fair regression have also been discussed in recent work [13]. All of the strategies above require access to sensitive attributes in order to mitigate the source of bias. However, collecting that type of information might be difficult, or even forbidden by law, in real-world applications.
Recently, a few studies have explored different strategies to address this issue. One typical solution is using non-sensitive information as a proxy for sensitive attributes. Previous work [41] has shown that non-sensitive information can be highly correlated with sensitive attributes. Proxy fairness [42] leverages the correlations between proxy features and the true sensitive attributes: proxy features are used as an alternative to the sensitive attribute(s) when applying a standard fairness-improving strategy. Although the existence of proxy features offers hope for improving fairness with unobserved sensitive attributes, identifying perfect proxy groups is still challenging.
Some researchers have explored methods to uncover latent heterogeneous patterns in the data
[108]. Several recent studies also investigated the use of disaggregation methods without sensitive
attributes [109, 110]. However, previous work mainly focuses on finding the optimal partition of
each feature, which fails to capture more complex scenarios such as when different groups share
overlapping feature ranges. Our proposed work will also address this issue.
6.3 Preliminaries
6.3.1 Fair Regression
In this chapter, we study the problem of model fairness in a regression setting, where the goal is to predict a true outcome Y ∈ [a, b] from a feature vector X based on labeled training data. The fairness of the prediction Ŷ of a model M is evaluated with respect to two quantitative definitions: statistical parity (SP_r) (see §3, Metric 5) and equal accuracy (EA) (see §3, Metric 4).
6.3.2 Fisher’s Linear Discriminant
Suppose two groups of p-dimensional samples x_0, x_1 have means µ_0 = [µ_01, ..., µ_0p], µ_1 = [µ_11, ..., µ_1p] and covariances Σ_0, Σ_1. Then the linear combination of features w·x_i has means w·µ_i and variances w^T Σ_i w for i = 0, 1. Fisher defined the separation between these two distributions to be the ratio of the variance between the groups to the variance within the groups:

S = σ²_between / σ²_within = (w·µ_1 − w·µ_0)² / (w^T Σ_1 w + w^T Σ_0 w) = (w·(µ_1 − µ_0))² / (w^T (Σ_0 + Σ_1) w).

It can be shown that the maximum separation occurs when

w ∝ (Σ_0 + Σ_1)^{-1} (µ_1 − µ_0).    (6.1)
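As a small illustration, the discriminant direction in Eqn. 6.1 can be computed directly with NumPy; the two-group toy data below is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
X0 = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.3], [0.3, 1]], size=200)
X1 = rng.multivariate_normal(mean=[2, 1], cov=[[1, 0.2], [0.2, 1]], size=200)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S0, S1 = np.cov(X0, rowvar=False), np.cov(X1, rowvar=False)

# Direction maximizing Fisher's separation: w ∝ (Σ_0 + Σ_1)^{-1} (µ_1 − µ_0)
w = np.linalg.solve(S0 + S1, mu1 - mu0)
w /= np.linalg.norm(w)
print("Fisher direction:", w)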
6.3.3 Factor Analysis
Suppose we have a set of p observable random variables x_1, ..., x_p with means µ_1, ..., µ_p. For some unknown constants l_ij and k unobserved random variables F_j (i.e., common factors), where i ∈ {1, ..., p} and j ∈ {1, ..., k}, with k < p, we have

x_i − µ_i = l_i1 F_1 + ··· + l_ik F_k + ε_i.

Here, the ε_i are unobserved stochastic error terms with zero mean and finite variance, which may not be the same for all i. In matrix terms, we have x − µ = LF + ε. F is defined as the factors, and L as the loading matrix.
Suppose the covariance matrix of (x − µ) is Σ; then we have

Σ = L L^T + Ψ.    (6.2)
6.4 Impact of Heterogeneity on Fairness
Heterogeneity is often present in social and behavioral data, and its presence affects the analysis
of trends as well as the accuracy of prediction tasks [106]. In this section, we analyze different
heterogeneous patterns and their impact on the fairness of model outcomes. Figure 6.1 illustrates
the bias derived from heterogeneity. If the heterogeneous patterns are ignored, the trends learned from
the data can be biased against certain groups.
In this section, we conduct our analysis on synthetic datasets. Using synthetic data is impor-
tant because we can arbitrarily control the characteristics of the paradox, which will allow us to
understand what effects its presence has on model fairness and accuracy.
Figure 6.1: Example of bias from heterogeneity. The plots illustrate the bias derived from heterogeneity: (a) shows a heterogeneous pattern that exists in a real-life dataset; if the model ignores the heterogeneity, the learned trend (i.e., the black dashed line in (a)) will discriminate against the female samples (i.e., red), as shown in (b).
Our synthetic datasets have 1000 samples and 10 informative features from two sensitive groups with the same number of samples. For each dataset, N out of the 10 features exhibit a pattern compatible with heterogeneity. We focus on six common heterogeneous patterns, listed in Table 6.1. Patterns #1-3 appear in the datasets when the target variable of the two sensitive groups shares the same range; this set of examples is hence named Shared-Range data. Patterns #4-6 consider the situation when group 1 has an overall lower range of target variables than group 2, named Different-Range data.
A linear regression model is trained on each dataset and evaluated with 10-fold cross validation. We adopt the mean absolute error (MAE) to measure the overall accuracy of the predictions. Statistical parity (SP_r) is measured as the distance between the average outcomes of the sensitive groups; equal accuracy (EA) is measured as the distance between the MAEs of the different groups. A sketch of this evaluation follows.
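Below is a minimal sketch of this synthetic evaluation, assuming a simple heterogeneous pattern in which the feature-target slope is reversed for one group; the data-generating parameters are illustrative and are not the exact patterns of Table 6.1.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n, p = 1000, 10
g = np.repeat([0, 1], n // 2)                       # two equally sized sensitive groups
X = rng.normal(size=(n, p))
slopes = np.where(g == 1, 1.0, -1.0)                # heterogeneous (reversed) trend on feature 0
y = slopes * X[:, 0] + X[:, 1:].sum(axis=1) + rng.normal(scale=0.5, size=n)

y_hat = cross_val_predict(LinearRegression(), X, y, cv=10)

abs_err = np.abs(y - y_hat)
overall_mae = abs_err.mean()
ea = abs_err[g == 1].mean() - abs_err[g == 0].mean()    # equal accuracy: difference in group MAE
sp_r = y_hat[g == 1].mean() - y_hat[g == 0].mean()      # statistical parity: difference in average outcome
print(overall_mae, ea, sp_r)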
Based on the results in Table 6.1, different patterns show different impacts on the model outcomes. In Shared-Range data, Patterns #1-3 have a negative impact on the model accuracy with respect to the overall MAE. As the number N of features with heterogeneity increases, the overall MAE increases accordingly. Pattern #2 shows the most significant impact on the statistical parity
(1) Shared-Range
Pattern             N    MAE     EA      SP_r
No Heterogeneity    -    1.13    -0.04   -0.06
#1                  1    1.79    -0.13   -1.51
                    2    2.35    0.05    -1.67
                    3    2.75    -0.04   -1.66
                    4    2.89    -0.02   -1.57
                    5    3.05    -0.05   -1.53
#2                  1    1.67    -0.02   -2.04
                    2    2.17    0.05    -2.55
                    3    2.51    0.05    -2.64
                    4    2.74    -0.01   -2.55
                    5    2.93    0.12    -2.54
#3                  1    1.92    0.02    -0.07
                    2    2.40    -0.02   0.07
                    3    2.77    -0.01   -0.04
                    4    2.97    -0.03   -0.01
                    5    3.17    0.05    -0.01

(2) Different-Range
Pattern             N    MAE     EA      SP_r
No Heterogeneity    -    5.07    -0.01   -0.02
#4                  1    2.30    -0.05   -8.59
                    2    2.30    -0.17   -8.69
                    3    2.36    -0.18   -8.62
                    4    2.38    -0.17   -8.55
                    5    2.30    -0.24   -8.74
#5                  1    2.82    -0.19   -6.71
                    2    2.83    -0.21   -6.70
                    3    2.85    -0.22   -6.61
                    4    2.76    -0.08   -6.80
                    5    2.73    -0.06   -6.93
#6                  1    4.09    -0.04   -4.03
                    2    4.07    -0.03   -4.17
                    3    4.11    -0.05   -3.94
                    4    4.11    -0.03   -4.07
                    5    4.10    -0.05   -4.04

Table 6.1: Impact of different heterogeneity patterns. N indicates the number of features that exhibit a heterogeneity pattern. We report the results of linear regression models with 10-fold cross validation. The mean absolute error (MAE), equal accuracy (EA) and statistical parity (SP_r) are the used metrics.
of the outcomes. Heterogeneity causes a more pronounced negative impact on both fairness metrics
in Different-Range data.
6.5 Methods
In this section, we propose a method to mitigate the unfairness caused by heterogeneity. Our
method includes two parts: identifying the heterogeneous patterns based on factor analysis and
feature rescaling.
6.5.1 Identifying Heterogeneous Patterns
Unveiling the heterogeneous patterns in complex behavioral data is challenging, especially when
the group labels are not available. In this work, we propose a method to identify heterogeneous patterns in multi-variate behavioral data leveraging factor analysis, named multi-layer factor analysis (MLFA). MLFA is based on the effectiveness of factor analysis at separating subgroups with heterogeneity on balanced, correlated features, as shown in Theorem 1.
Theorem 1 Let X be a dataset of n variables x_1, ..., x_n exhibiting heterogeneous patterns between two groups X_0 and X_1, and let F represent the factor matrix of X after factor analysis. Assume the two groups of observations follow N(μ⃗_0, Σ_0) and N(μ⃗_1, Σ_1), respectively. F shows the maximum separation of Fisher's linear discriminant between the groups when X_0 and X_1 have the same size and x_1, ..., x_n are highly correlated with each other.

Proof 1 Since X consists of X_0 and X_1 following N(μ⃗_0, Σ_0) and N(μ⃗_1, Σ_1), respectively, X can be viewed as a Gaussian mixture distribution N(μ⃗, Σ) = ∑_{i=0}^{1} α_i N(μ⃗_i, Σ_i). Thus,
\[
\vec{\mu} = \sum_{i=0}^{1} \alpha_i \vec{\mu}_i, \qquad
\Sigma = \sum_{i=0}^{1} \alpha_i \Sigma_i + \sum_{i=0}^{1} \alpha_i (\vec{\mu}_i - \vec{\mu})(\vec{\mu}_i - \vec{\mu})^T .
\]
If X_0 and X_1 have the same size, α_0 = α_1 = 1/2, so
\[
\vec{\mu} = \frac{\vec{\mu}_0 + \vec{\mu}_1}{2},
\]
\[
\Sigma = \frac{1}{2}\left( \Sigma_0 + \Sigma_1 + \Big(\frac{\vec{\mu}_0 - \vec{\mu}_1}{2}\Big)\Big(\frac{\vec{\mu}_0 - \vec{\mu}_1}{2}\Big)^T + \Big(\frac{\vec{\mu}_1 - \vec{\mu}_0}{2}\Big)\Big(\frac{\vec{\mu}_1 - \vec{\mu}_0}{2}\Big)^T \right)
= \frac{1}{2}\left( \Sigma_0 + \Sigma_1 + \frac{(\vec{\mu}_1 - \vec{\mu}_0)(\vec{\mu}_1 - \vec{\mu}_0)^T}{2} \right).
\]
Let F and L respectively represent the factor matrix and loading matrix of X after factor analysis. The transformation matrix from X to F is Σ^{-1} · L, thus
\[
\vec{w}_j = \Sigma^{-1} \cdot l_j,
\]
where w⃗_j represents the projection vector from X to factor F_j.

When x_1, ..., x_n are highly correlated with each other, Σ ≈ β_1 J_{n,n} and μ⃗_1 − μ⃗_0 ≈ β_2 J_{n,1}, where β_1 and β_2 are constant values and J is the all-ones matrix. According to Eqn. 6.2, l_j = β'_1 J_{n,1}, and
\[
\Sigma = \frac{1}{2}\left( \Sigma_0 + \Sigma_1 + \frac{(\vec{\mu}_1 - \vec{\mu}_0)(\vec{\mu}_1 - \vec{\mu}_0)^T}{2} \right)
= \frac{1}{2}\left( \Sigma_0 + \Sigma_1 + \frac{\beta_2^2 J_{n,n}}{2} \right) \approx \beta_1 J_{n,n}.
\]
Thus, (Σ_0 + Σ_1) = (2β_1 − β_2^2/2) J_{n,n}.

According to Fisher's linear discriminant theory, the ratio of w⃗_j to Eqn. 6.1 is
\[
\frac{\vec{w}_j}{(\Sigma_0 + \Sigma_1)^{-1}(\vec{\mu}_1 - \vec{\mu}_0)}
= \frac{\Sigma^{-1} \cdot l_j}{(\Sigma_0 + \Sigma_1)^{-1}(\vec{\mu}_1 - \vec{\mu}_0)}
= \frac{\big(2\beta_1 - \tfrac{\beta_2^2}{2}\big) J_{n,n}\, l_j}{\beta_1 J_{n,n} (\vec{\mu}_1 - \vec{\mu}_0)}
= \frac{\big(2\beta_1 - \tfrac{\beta_2^2}{2}\big) \beta'_1}{\beta_1 \beta_2}.
\]
Therefore, w⃗_j ∝ (Σ_0 + Σ_1)^{-1}(μ⃗_1 − μ⃗_0), and the maximum separation occurs.
6.5.2 Multi-Layer Factor Analysis (MLFA) Framework
Inspired by Theorem 1, we propose the framework of multi-layer factor analysis (MLFA) as shown
in Figure 6.2.
This framework has three steps:
• First-layer factor analysis: conduct factor analysis on the original dataset X to obtain the factor matrix F and the loading matrix L.
• Clustering and balancing: cluster the samples based on F into m sample clusters C_1, ..., C_m and cluster the features based on L into k feature clusters V_1, ..., V_k; balance X based on the size of the sample clusters so that the balanced dataset X' has the same number of samples from each sample cluster C_i; then divide X' based on the feature clusters to obtain X'_1, ..., X'_k. This step aims to balance the dataset so that the input subsets X'_1, ..., X'_k of the second-layer factor analysis satisfy the assumptions of equal-sized groups and correlated features.
• Second-layer factor analysis: conduct factor analysis on X'_1, ..., X'_k to obtain the factor matrices F_1, ..., F_k.

Figure 6.2: Multi-Layer Factor Analysis (MLFA) framework. The first-layer factor analysis discovers the feature clusters V_i and sample clusters C_i. The original dataset X is balanced based on C_i and then separated into subsets X'_1, ..., X'_k, where each subset X'_i only contains the features in V_i. We then conduct the second-layer factor analysis on X'_1, ..., X'_k.
If the features within V_j exhibit heterogeneity, the factor matrix F_j will show a clustered structure, j ∈ [1, k]. Figure 6.3 gives an empirical illustration of the performance of MLFA: panels (a) and (b) show the most informative factor of the heterogeneous features in a real-life behavioral dataset, TILES (see details in §6.6.1). Compared to traditional factor analysis, MLFA extracts a more separable factor.
We use a Gaussian Mixture Model (GMM) as the clustering algorithm, with the Bayesian information criterion (BIC) and the average sum of squared distances (SSD) within clusters as evaluation metrics to test the clustering assumption on F_j. If F_j has a clustered structure, BIC and SSD will decrease after clustering.

Figure 6.3: Example of MLFA outcomes. The plots compare the extracted factors using (a) traditional factor analysis and (b) the MLFA framework. The factor from MLFA shows a more separable structure.
We further introduce a parameter λ as the criterion for identifying good cluster structures. Clusters whose SSD decreases by more than λ% are considered meaningful clusters, and the corresponding features are identified as heterogeneous.
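The following is a compact sketch of the MLFA steps using scikit-learn's FactorAnalysis. The clustering choices (KMeans in the first layer, a two-component GMM with an SSD-reduction test against λ in the second layer) and the balancing-by-downsampling step are illustrative assumptions about details not fully specified above.

import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def mlfa(X, n_factors=5, m_sample_clusters=2, k_feature_clusters=4, seed=0):
    """Multi-layer factor analysis: returns second-layer factors per feature cluster."""
    # First-layer factor analysis on the original dataset.
    fa1 = FactorAnalysis(n_components=n_factors, random_state=seed)
    F = fa1.fit_transform(X)                 # factor matrix (samples x factors)
    L = fa1.components_.T                    # loading matrix (features x factors)

    # Cluster samples on F and features on L.
    c = KMeans(m_sample_clusters, random_state=seed, n_init=10).fit_predict(F)
    v = KMeans(k_feature_clusters, random_state=seed, n_init=10).fit_predict(L)

    # Balance: keep the same number of samples from every sample cluster
    # (simple downsampling to the smallest cluster; an illustrative choice).
    rng = np.random.default_rng(seed)
    n_min = min(np.sum(c == i) for i in range(m_sample_clusters))
    keep = np.concatenate([rng.choice(np.where(c == i)[0], n_min, replace=False)
                           for i in range(m_sample_clusters)])

    # Second-layer factor analysis on each balanced feature subset.
    factors = {}
    for j in range(k_feature_clusters):
        cols = np.where(v == j)[0]
        if len(cols) < 2:
            continue
        fa2 = FactorAnalysis(n_components=1, random_state=seed)
        factors[j] = (cols, fa2.fit_transform(X[np.ix_(keep, cols)]))
    return factors

def heterogeneous_feature_clusters(factors, lam=0.0):
    """Flag feature clusters whose second-layer factor shows a 2-cluster structure."""
    flagged = []
    for cols, fj in factors.values():
        labels = GaussianMixture(n_components=2, random_state=0).fit(fj).predict(fj)
        ssd_all = ((fj - fj.mean(axis=0)) ** 2).sum()
        ssd_within = sum(((fj[labels == g] - fj[labels == g].mean(axis=0)) ** 2).sum()
                         for g in (0, 1) if (labels == g).any())
        if ssd_all > 0 and (ssd_all - ssd_within) / ssd_all * 100 > lam:
            flagged.append(cols)             # feature indices identified as heterogeneous
    return flagged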
6.5.3 Feature Rescaling
After identifying the heterogeneous features, we rescale those features to mitigate their impact on the model outcomes. We adopt the disparate impact remover [6] as the rescaling method. The disparate impact remover is a preprocessing technique that edits feature values to improve group fairness while preserving rank-ordering within groups. After rescaling, the feature distributions across groups are difficult to distinguish, while each individual's ranking within its group is preserved.
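As a simplified stand-in for the disparate impact remover, the sketch below repairs one feature by mapping each group's values onto the pooled quantile function. This removes group-level distribution differences while preserving each individual's rank within its group; it is one common way such repairs are implemented, not necessarily identical to the implementation of [6].

import numpy as np

def repair_feature(x, groups, repair_level=1.0):
    """Rank-preserving repair of one feature: push each group's distribution
    toward the pooled quantile function while keeping within-group ordering."""
    x = np.asarray(x, dtype=float)
    repaired = x.copy()
    pooled_sorted = np.sort(x)                       # target quantile function
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        # Empirical quantile of each point within its own group (rank / (n - 1)).
        ranks = np.argsort(np.argsort(x[idx]))
        q = ranks / max(len(idx) - 1, 1)
        # Value the pooled distribution assigns to that quantile.
        target = np.quantile(pooled_sorted, q)
        repaired[idx] = (1 - repair_level) * x[idx] + repair_level * target
    return repaired

In our pipeline, the group labels passed to such a repair come from the clusters discovered by MLFA rather than from the true sensitive attribute.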
6.5.4 Modeling Pipeline
Figure 6.4 shows our proposed fair machine learning pipeline for human behavior understanding systems.
Figure 6.4: The proposed fair modeling pipeline. The training dataset is pre-processed by MLFA and feature rescaling to mitigate the bias and to learn the pre-processing models. The pre-processing models are then used to rescale the test dataset.
The training dataset is pre-processed by our proposed MLFA and feature rescaling methods to mitigate the biases embedded in the feature dimension. After the pre-processing step, the processed data are used for model training. The pre-processing step also generates the clustering models used for processing the test dataset.
In the testing stage, samples in the test set are clustered by the GMM models learned from the training set and processed based on the clustered groups. Note that although the proposed MLFA method also has a clustering step after the first-layer factor analysis, the clustering method used in that step is not restricted.
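Putting the pieces together, a minimal sketch of this train/test flow might look as follows. The helper names (mlfa, heterogeneous_feature_clusters, repair_feature) refer to the illustrative sketches above and are not part of any released implementation, and fitting the pseudo-group GMM directly on the flagged feature columns is a simplifying assumption.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.mixture import GaussianMixture

def fit_fair_pipeline(X_train, y_train, lam=0.0, seed=0):
    """Fit MLFA-based pre-processing on the training split, then a regressor."""
    factors = mlfa(X_train, seed=seed)                      # sketch from Section 6.5.2
    hetero = heterogeneous_feature_clusters(factors, lam)   # flagged feature-index sets
    gmms, Xp = {}, X_train.copy()
    for cols in hetero:
        # Pseudo-group labels come from a GMM fit on the flagged columns of the
        # training data (no sensitive attribute used); the same GMM labels test samples.
        gmm = GaussianMixture(n_components=2, random_state=seed).fit(X_train[:, cols])
        labels = gmm.predict(X_train[:, cols])
        for c in cols:
            Xp[:, c] = repair_feature(X_train[:, c], labels)  # sketch from Section 6.5.3
        gmms[tuple(int(c) for c in cols)] = gmm
    model = RandomForestRegressor(random_state=seed).fit(Xp, y_train)
    return model, gmms

def transform_test(X_test, gmms):
    """Rescale the test split using the clustering models learned on the training split."""
    Xp = X_test.copy()
    for cols, gmm in gmms.items():
        cols = list(cols)
        labels = gmm.predict(X_test[:, cols])
        for c in cols:
            Xp[:, c] = repair_feature(X_test[:, c], labels)
    return Xp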
6.6 Experiments
6.6.1 Datasets
6.6.1.1 Synthetic Datasets
We generate two synthetic datasets with 40 features and 1000 samples, where each sensitive group has 500 samples and 20 of the 40 features exhibit heterogeneous patterns. Synthetic 1 is Shared-Range data: the target variable Y has the same range [0, 30] across groups. Synthetic 2 is Different-Range data: Y ∈ [0, 20] in group 1 and Y ∈ [10, 30] in group 2.
                             Linear Regression        Decision Tree            Random Forest
Dataset        Method        MAE    EA     SP_r       MAE    EA     SP_r       MAE    EA     SP_r
Synthetic 1    Original      1.76   -0.44  -0.64      4.08   -0.29  -0.27      3.27   -0.23  -0.16
Synthetic 1    MLFA (Ours)   1.78   -0.50  -0.28      4.01   -0.26  -0.14      3.14   -0.23  -0.05
Synthetic 2    Original      1.76   -0.29  -0.45      3.96   -0.27  -0.69      3.15   -0.19  -0.19
Synthetic 2    MLFA (Ours)   1.78   -0.48  -0.39      3.92   -0.43  -0.18      3.10   -0.25  -0.15
TILES          Original      3.20   -0.74  -1.17      3.34   -0.65  -0.39      3.19   -0.67  -0.53
TILES          MLFA (Ours)   3.42   -0.74  -0.66      3.17   -0.59  -0.41      3.15   -0.61  -0.38
Older Adults   Original      6.43   -2.30  -4.95      6.43   -2.39  -5.24      6.11   -1.70  -4.43
Older Adults   MLFA (Ours)   7.06   -2.46  -4.12      6.73   -3.82  -2.57      6.78   -2.61  -2.71

Table 6.2: Experimental performance (λ = 0). We compare the utility and fairness performance on the data with and without our proposed method. The results show that our method yields decent improvements on fairness and can even improve model accuracy.
6.6.1.2 Behavior Datasets
We conduct experiments on two behavior datasets: Tracking Individual Performance with Sensors (TILES) and Older Adults.
For the TILES data, we focus on predicting the cognitive ability of the participants. The target variables are collected from the pre-study survey. We use the gender information of each participant as the sensitive attribute. The distribution of cognitive ability shows no statistical difference between genders (p-value = 0.34 under a t-test), so this dataset is an example of Shared-Range data.
For the Older Adults data, we use the number of total mistakes in the Stroop Task as the target variable. Female participants make 12.3 mistakes on average, while male participants make only 7.32 mistakes on average, giving this dataset properties similar to Different-Range data.
6.6.2 Performance of Regression Tasks
For all our experiments, each dataset is randomly split into 90% development set and 10% test set
with 50 repeats.
We use KMeans as the clustering method in our multi-layer factor analysis framework. We evaluate our proposed pre-processing method with three types of standard regression models: linear regression, decision tree, and random forest. All models are implemented using the scikit-learn library.

Type              Dataset        Metric   w/ SA             w/o SA
Shared-Range      TILES          MAE       3.17 (-0.6%)      3.15 (-1.2%)
                                 EA       -0.87 (+29.8%)    -0.61 (-8.9%)
                                 SP_r     -0.12 (-77.3%)    -0.38 (-28.3%)
Different-Range   Older Adults   MAE       6.90 (+12.9%)     6.78 (+10.9%)
                                 EA       -3.07 (+80.5%)    -2.61 (+53.5%)
                                 SP_r     -0.39 (-91.1%)    -2.71 (-38.8%)

Table 6.3: Comparison of the debiasing performance of Random Forest models with and without sensitive attributes (SA).

Figure 6.5: Effects of the parameter λ. The plots illustrate the performance change on fairness and accuracy, according to three metrics (Equal Accuracy, Statistical Parity, Accuracy), for (a) TILES (left) and (b) Older Adults (right) data.
We validate the performance of our method on both synthetic and real-life datasets. Among
the 4 datasets, Synthetic 1 and TILES are Shared-Range data, Synthetic 2 and Older Adults are
Different-Range data.
Our evaluation metrics include accuracy, statistical parity, and equal accuracy:
• Accuracy: We adopt mean absolute errors (MAE) to measure the overall accuracy of the
predictions.
• Statistical parity (SP_r) is measured as the difference between the average predicted outcomes of the sensitive groups.
• Equal accuracy (EA) is measured as the difference in MAE across the groups.
To validate the performance of our proposed method, we also compare the model performance
with traditional pre-processing fair machine learning strategies. Table 6.2 and Table 6.3 compare
the performance of the three following model strategies:
• Original: The model is trained based on the original datasets without any pre-processing.
• Debiasing with true sensitive groups: The model is trained on the datasets processed by the disparate impact remover [6], using the real sensitive attributes in the pre-processing step. This model is therefore significantly advantaged compared to models that do not access the sensitive attributes.
• Debiasing without true sensitive groups (Ours): The MLFA model is trained on the datasets processed by our proposed fair modeling pipeline shown in Fig. 6.4. No sensitive attributes are needed in the process, which puts our model at a significant disadvantage compared to, e.g., the disparate impact remover.
Our proposed method yields improved fairness metrics in all experimental settings. MLFA mainly aims to balance the outcome distribution across groups (i.e., statistical parity), as illustrated in Figure 6.1. For Shared-Range datasets such as Synthetic 1 and TILES, our method improves all three metrics, including model accuracy. Compared to the debiasing method with true sensitive attributes, as shown in Table 6.3, our method can achieve better fairness with less accuracy loss (i.e., MAE increase).
Effects of λ.
In our method, we introduce the parameter λ as the criterion to identify heterogeneity. As λ increases, only the factors with more separable clusters are considered indicators of heterogeneity.
However, the best λ is highly dependent on the properties of the dataset. Too small a λ might hurt model accuracy due to the mis-identification of heterogeneity: rescaling the wrong features might remove informative signal for the regression model. Too large a λ might hurt model fairness due to the failure to identify heterogeneity.
Figure 6.5 shows how performance changes as a function of λ. For all three metrics of interest, i.e., MAE, SP_r, and EA, the closer they are to zero, the better the model performs. Therefore, the performance improvements are defined as the values reduced on each metric after applying our method.
Both the TILES and Older Adults datasets show decreasing trends in MAE change as λ increases, indicating better model accuracy. In terms of fairness improvements, the TILES dataset (i.e., Shared-Range data) shows similar trends for both the EA and SP_r metrics, with the largest improvement at λ = 40. For the Older Adults dataset (i.e., Different-Range data), the improvements in SP_r come at the cost of EA, since our method aims to optimize SP_r. The SP_r improvement shows a decreasing trend as λ increases.
6.6.3 Performance of Classification Tasks
The proposed MLFA is a pre-processing method and is thus applicable to both classification and regression tasks. In this section, we present MLFA's performance on classification tasks on the TILES dataset.
We use the same dataset introduced in §6.6.1.2. The target variable, cognitive ability, is binarized: 1 indicates cognitive ability > 30, and 0 indicates cognitive ability <= 30. We consider three fairness metrics for classification tasks: Equal Opportunity Difference (EOD), Average Equalized Odds Difference (EOddsD), and Statistical Parity Difference (SPD) (defined in §2.2).
Table 6.4 reports the performance of MLFA on classification tasks. Compared to the models trained on the original data, MLFA improves all three fairness metrics, by 93.9%, 98.8%, and 99.6%, respectively. Figure 6.6 shows the effects of the parameter λ: as λ increases, the model achieves better accuracy but smaller fairness improvements. These findings mirror the trends for the regression tasks presented in §6.6.2.
Metrics      Original   MLFA (Ours)
Accuracy      0.59       0.52
EOD          -0.33      -0.02
EOddsD       -0.27      -0.003
SPD          -0.29      -0.001

Table 6.4: Utility and fairness performance on the TILES dataset of the proposed MLFA method (λ = 0).

Figure 6.6: Effects of the parameter λ on the TILES dataset.
6.7 Conclusions
Affective computing has found broad applicability in many decision making domains, including
health and job performance evaluation, financial and employment scrutiny, etc.
Hence, guaranteeing model fairness in affective computing is an open challenge for real-world
applications. In this work, we investigated how heterogeneous behavioral patterns can impact the
fairness of model outcomes. We proposed a method to identify heterogeneous features based on
multi-layer factor analysis (MLFA), which can also be combined with feature rescaling techniques
to mitigate the unfair impact of heterogeneity when sensitive attributes are unobserved.
Experimental results on synthetic and real-world datasets show that our proposed method can
improve both the accuracy and fairness of models compared to using the original datasets. Our
method can in fact be used as a pre-processing step for different regression models.
There are a few ways to improve our method in the future. Currently, the method is applied to continuous variables, but it could be extended to categorical variables, which are common in behavioral data. In addition, the impact of heterogeneity on fairness in classification models should also be investigated.
Chapter 7
Enabling Group Fairness in Federated Learning
7.1 Introduction
Federated learning (FL) has received significant attention for its ability to train large-scale models in a decentralized manner without requiring direct access to users' data, hence protecting their privacy [111, 112]. It has also been increasingly applied to facilitate decision-making in various crucial areas, such as healthcare, recruitment, and loan grading.
With the use of machine learning in such life-impacting scenarios, there are concerns regarding
the fairness of models trained for such ML-assisted decision-making systems [104, 113, 66]. One
important notion of fairness, group fairness [27], concerns the biased performance of a trained
model against a certain group, where groups are defined based on sensitive attributes within the
population (e.g., gender, race).
Though the research community has considered fairness in FL, most existing studies [61, 62] focus on equalizing performance and participation across the participating devices/silos. However, only a few works have attempted to target group fairness for groups defined by sensitive attributes (e.g., gender, race) in FL, which is a crucial requirement for responsible AI systems to ensure that models treat different demographic groups equally.
Several approaches to achieving group fairness have been studied in recent years for centralized training of classifiers. However, these approaches assume that the entire dataset is centrally available during training, so it is not straightforward to translate them to FL, where the decentralization of data is a cornerstone. This consideration gives rise to the key question that we attempt to answer in this chapter: How can we train a classifier using FL so as to achieve group fairness, while maintaining data decentralization?
Potential approaches to address group fairness in FL. One potential solution for training fair models in FL is for each client to apply local debiasing to its locally trained models (without sharing any additional information or data), while the FL server simply aggregates the model parameters in each round using an algorithm such as FedAvg [114] (or its subsequent derivatives, e.g., FedOPT [115], FedNova [116]). Although this allows training the global model without explicitly sharing any information about the local datasets, the drawback is that the debiasing algorithm at each client is tuned to its local data distribution, which is expected to result in poor performance when data distributions are highly heterogeneous across clients. For example, clients may have very different distributions of the sensitive attributes, such as one client having only data representing males while another has only data representing females; in this scenario, local debiasing cannot guarantee fair performance on the overall population. We conduct empirical experiments to demonstrate this issue in Section 7.4.
Another potential solution for training fair models within FL would be to adapt a debiasing technique from the rich literature on centralized fair training for use in FL. However, in the process of applying the debiasing algorithm on a global scale, the clients would be required to exchange additional detailed information about the model's performance (on different local groups) with the server, which might leak explicit information about different subgroups in each client's dataset. For example, the server may require an evaluation of how the model performs for each group in a client's dataset.
Contributions of this chapter
Motivated by the drawbacks of the two aforementioned directions, in this chapter we propose a strategy to train fair models via a fairness-aware aggregation method named FairFed. In our approach, each client performs local debiasing on its own local dataset, thus maintaining data decentralization and avoiding the exchange of any explicit information about its local data composition. In addition, in each communication round, each client evaluates the fairness of the global model on its local dataset and collaborates with the server to adjust its model aggregation weight. The weights are a function of the mismatch between the global fairness measurement (on the full dataset) and the local fairness measurement at each client, favoring clients whose local measures match the global fairness measure. We present the details of FairFed in Section 7.5.
The server-side aggregation and local debiasing nature of FairFed gives it the following benefits over existing fair FL strategies:
• Enhanced group fairness under data heterogeneity: One of the biggest challenges to group fairness in FL is the heterogeneous distribution across clients, which limits the impact of local debiasing efforts on the global data distribution. FairFed shows significant improvement in fairness performance under highly heterogeneous distribution settings, outperforming state-of-the-art methods for fairness in FL and indicating promising implications for real-life applications.
• More freedom to customize debiasing strategies on different clients: Since FairFed works as an aggregation method that only requires an evaluation metric for model fairness from the clients, it can be flexible with respect to each client's modeling strategy (we expand on this notion in Sections 7.5 and 7.6). For example, different clients can adopt different local debiasing methods based on the properties of their devices and data partitions.
• Potential to improve both group fairness and uniform performance across clients: We also propose a revised version of FairFed with uniform accuracy constraints that can improve group fairness while making model performance more uniform across clients, providing a better solution for trustworthy FL systems.
7.2 Related Work
Group fairness in centralized learning. In classical centralized machine learning, substantial advancement has been made in group fairness in three categories: pre-processing [4, 6], in-processing [7, 8, 117], and post-processing [9, 10] techniques. However, a majority of these techniques require access to the sensitive information (e.g., race, gender) of each data point, making them unsuitable for FL systems.
Client-based fairness in federated learning. Federated learning can introduce new sources of bias through the collaborative learning process. Various definitions of fairness have been proposed to quantify such challenges in FL settings, such as collaborative fairness and agent-based fairness. Collaborative fairness [61] is defined as rewarding a highly contributing participant with a better-performing local model than is given to a low-contribution participant. Client-based fairness aims to equalize the model performance across different clients. Existing studies have focused on making the accuracy distribution uniform over all clients [62] and maximizing the performance of the worst client [45]. Due to the potential cross-device or cross-silo heterogeneity, these methods cannot prevent discrimination against certain sensitive groups. For example, if the local models of all agents have similar accuracy despite being biased against the under-privileged group, then the system will satisfy client-based fairness but will still display discrimination against certain groups.
Group fairness in federated learning. Several recent works have made progress on group-fair FL. One research direction is to design an optimization objective with fairness constraints [63, 64], which requires each client to share the statistics of the sensitive attributes of its local dataset with the server. Abay et al. [118] also investigated the effectiveness of adopting centralized debiasing mechanisms on each client. In [119], an adaptation of the FairBatch debiasing algorithm [117] is proposed for FL where clients use FairBatch locally and the weights are updated through the server in each round. In [120], the modified method of differential multipliers is used to minimize the empirical risk of the model in the presence of fairness constraints in order to achieve group fairness in an FL setting. In [121], an algorithm is proposed to achieve minimax fairness in federated learning.
In all the aforementioned works, the server requires each client to explicitly share the performance of the model on each subgroup separately (for example, the rate of positive classifications for males, for females, etc.). Compared to these works, our proposed method does not restrict the local debiasing strategy of the participating clients, thus increasing the flexibility of the system. Furthermore, FairFed does not share explicit information on the model performance on any specific group within a client's dataset. Finally, our empirical evaluations consider extreme data heterogeneity cases and demonstrate that our method can yield significant fairness improvements in such situations.
7.3 Preliminaries
In this section, we begin by reviewing one of the most commonly used aggregation methods in FL:
FedAvg [114]. We then introduce the definitions and metrics of group fairness and extend them to
federated learning scenarios by defining the notions of global and local fairness.
7.3.1 Federated Averaging (FedAvg)
In FL, multiple clients collaborate to find a parameter θ that minimizes a weighted average of the loss across all clients. In particular:
\[
\min_{\theta} f(\theta) = \sum_{k=1}^{K} \omega_k L_k(\theta), \tag{7.1}
\]
where K is the total number of clients, L_k(θ) denotes the local objective at client k, ω_k ≥ 0, and ∑_k ω_k = 1. The local objectives L_k can be defined as empirical risks over the local dataset D_k of size n_k at client k, i.e., L_k(θ) = (1/n_k) ∑_{(x,y) ∈ D_k} ℓ(θ, x, y).
To minimize the objective in (7.1), the federated averaging algorithm FedAvg, proposed in [114], samples a subset of the K clients per round to perform local training of the global model on their local datasets. The model updates are then averaged at the server, weighted by the size of their respective datasets. Practical implementations of FedAvg employ a secure aggregation algorithm [122], which ensures that the server does not learn any information about the values of the individual transmitted updates from the clients beyond the aggregated value that it sets out to compute. Formally, let ω_k θ_k be the weighted model update sent by client k and let Y_k be the message transmitted by client k to the server. Secure aggregation ensures that the server can recover the summation ∑_{k=1}^K ω_k θ_k without error and that we have the following mutual information guarantee due to the secure aggregation protocol:
\[
I\left(\{\omega_k \theta_k\}_{k=1}^{K} ; \{Y_k\}_{k=1}^{K} \,\middle|\, \sum_{k=1}^{K} \omega_k \theta_k \right) = 0. \tag{7.2}
\]
The FedAvg algorithm and its subsequent improvements (e.g., FedOPT [115], FedNova [116]) allow clients to collaboratively train a high-performance global model without sharing their local datasets with each other. However, this collaborative training can result in a global model that discriminates against an underlying demographic group of data points (similar to the biases incurred in centralized training of machine learning models [27]). We highlight different notions of group fairness of a model in the following subsection.
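For reference, the FedAvg aggregation in (7.1) amounts to the weighted average below. This toy sketch operates on plain parameter vectors and omits the secure aggregation protocol [122] that a real deployment would use.

import numpy as np

def fedavg_aggregate(client_params, client_sizes):
    """Weighted average of client model parameters, with omega_k = n_k / sum_i n_i."""
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()
    stacked = np.stack([np.asarray(p, dtype=float) for p in client_params])
    return weights @ stacked

# Toy usage: three clients with flattened parameter vectors of length 4.
theta_new = fedavg_aggregate(
    client_params=[np.ones(4), 2 * np.ones(4), 3 * np.ones(4)],
    client_sizes=[100, 300, 600],
)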
7.3.2 Fairness Metrics
We consider two types of fairness goals in federated learning: group fairness and uniform accuracy.
7.3.2.1 Group Fairness
In sensitive machine learning applications, a data sample often contains sensitive attributes that can lead to discrimination. Some widely used fairness definitions are summarized in §2.2. Using these definitions, we now define two group fairness metrics that are applied in the group fairness literature for centralized training: Equal Opportunity Difference (EOD) and Statistical Parity Difference (SPD).
In particular, we assume that each data point is associated with a sensitive binary attribute A, such as gender or race. For a model with a binary output Ŷ(θ, x), the fairness is evaluated with respect to its performance on the underlying groups defined by the sensitive attribute A. We use A = 1 to represent the privileged group (e.g., male), while A = 0 represents the under-privileged group (e.g., female). For the binary model output Ŷ (and similarly the label Y), Ŷ = 1 is assumed to be the positive outcome. The two group fairness metrics are defined as follows:
\[
\text{EOD} = \Pr(\hat{Y}=1 \mid A=0, Y=1) - \Pr(\hat{Y}=1 \mid A=1, Y=1). \tag{7.3}
\]
\[
\text{SPD} = \Pr(\hat{Y}=1 \mid A=0) - \Pr(\hat{Y}=1 \mid A=1). \tag{7.4}
\]
For the EOD and SPD metrics, larger values indicate better fairness. Positive fairness metrics indicate that the unprivileged group outperforms the privileged group.
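Both metrics reduce to differences of group-conditional positive rates, as in the sketch below; when a (group, label) combination is absent, the corresponding metric is undefined (NaN here), a case that becomes relevant later when local metrics are computed per client.

import numpy as np

def eod(y_true, y_pred, a):
    """Equal Opportunity Difference, Eq. (7.3): TPR for A=0 minus TPR for A=1."""
    y_true, y_pred, a = map(np.asarray, (y_true, y_pred, a))
    def tpr(g):
        mask = (a == g) & (y_true == 1)
        return y_pred[mask].mean() if mask.any() else np.nan  # undefined if group absent
    return tpr(0) - tpr(1)

def spd(y_pred, a):
    """Statistical Parity Difference, Eq. (7.4): positive rate for A=0 minus A=1."""
    y_pred, a = np.asarray(y_pred), np.asarray(a)
    return y_pred[a == 0].mean() - y_pred[a == 1].mean()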
7.3.2.2 Uniform Accuracy
Beyond optimizing group fairness metrics, the notion of client-based fairness is critically important in trustworthy FL systems, since clients would be reluctant to collaborate if they ended up with models that perform badly on their local data distributions. Uniform accuracy aims to minimize the performance differences across clients. More formally, we adopt the uniform accuracy definition from [62].
Definition 8 (Uniform Accuracy) For trained models w and w̃, we informally say that model w provides a fairer solution than model w̃ if the performance of model w on the m devices, {Acc_0, Acc_1, ..., Acc_m}, is more uniform than the performance of model w̃ on the m devices.
To quantify this criterion, we calculate the standard deviation of the accuracy (Std-Acc.) across clients as the metric of uniform accuracy, where smaller values indicate better uniformity.
7.3.3 Global vs. Local Group Fairness in Federated Learning
The fairness definitions above can be readily applied in centralized model training to evaluate the performance of the trained model. However, in FL, clients typically have non-IID data distributions, which gives rise to two levels of fairness consideration in FL: global fairness and local fairness.
In particular, the global fairness performance of a given model takes into account the full dataset D̄ = ∪_k D_k across the K clients participating in FL. In contrast, when only the local dataset D_k at client k is considered, we define the local fairness performance by applying Equations 7.3 and 7.4 to the data distribution at client k.
We further explain the two definitions below using the example of the Equal Opportunity Difference metric. For a trained classifier Ŷ, the global fairness EOD metric F_global is given by
\[
F_{\text{global}} = \Pr(\hat{Y}=1 \mid A=0, Y=1) - \Pr(\hat{Y}=1 \mid A=1, Y=1), \tag{7.5}
\]
where the probability above is based on the full dataset distribution (a mixture of the distributions across the clients). We can similarly define the local fairness metric F_k at client k as
\[
F_k = \Pr(\hat{Y}=1 \mid A=0, Y=1, C=k) - \Pr(\hat{Y}=1 \mid A=1, Y=1, C=k), \tag{7.6}
\]
where the condition C = k denotes that the k-th client, and hence its local distribution (and dataset D_k), is considered in the fairness evaluation.
7.3.4 Datasets
In this chapter, we demonstrate the performance of different debiasing methods using two binary
decision datasets that are widely investigated in fairness literature: the Adult [65] dataset and
ProPublica COMPAS dataset [66].
The Adult dataset [65] contains 32,561 records of yearly income (represented as a binary label:
over or under $50,000) and twelve categorical or continuous features including education, age,
and job types. The sex (defined as male or female) of each subject is considered as the sensitive
attribute.
The ProPublica COMPAS dataset [66] relates to recidivism, i.e., assessing whether a criminal defendant will commit an offense within a certain future time window. The dataset was gathered by ProPublica and contains information on 6,167 criminal defendants who were subject to screening by COMPAS, a commercial recidivism risk assessment tool, in Broward County, Florida, from 2013-2014. Features in this dataset include the number of prior criminal offenses, the age of the defendant, etc. The race (classified as white or non-white) of the defendant is the sensitive attribute of interest.
7.4 Challenges to Local Debiasing in Federated Learning
Given the notions of global and local group fairness defined in Section 7.3, we face the following challenges when applying local debiasing in FL:
• Local fairness does not imply global fairness: Due to the non-IID nature of the data distribution across clients, the full data distribution may not be represented by any single local distribution at any of the clients. Thus, for a classifier Ŷ, the local fairness metrics {F_k}_{k=1}^K and the global metric F_global may be quite different as an artifact of the difference between local and global distributions.
• Local debias mitigation cannot improve global group fairness: Applying debias mitigation techniques, which are typically used in centralized training [27], locally with FedAvg does not significantly improve the global group fairness of the model trained using FL. In fact, in some cases, local debiasing can be counterproductive, as the global minority group (e.g., African-American) might represent a local majority in a client's dataset (e.g., credit union data in a black-majority city such as Detroit).
Figure 7.1: Performance comparison (EOD, SPD, Accuracy, Std-Accuracy) of FedAvg, Local Reweighting, and Global Reweighting on data partitions with different heterogeneity levels α. A smaller α indicates a more heterogeneous distribution across clients. We report the average performance over 20 random seeds. (For the EOD metric, larger values indicate better fairness. Positive fairness metrics indicate that the unprivileged group outperforms the privileged group.)
7.4.1 Performance Under Different Heterogeneity Levels
To demonstrate this, we conducted experiments on the Adult dataset introduced in Section 7.3.4 to
investigate the effects of local debiasing mechanisms in federated learning; we provide a detailed
description of the experimental setting in Section 7.6. We compare the performance of FedAvg
(i.e., without any debiasing method) and the following debiasing methods:
• Local Reweighting: Reweighing [30] is a preprocessing technique that weights the examples
in each (group, label) combination differently to ensure fairness before classification.
• Global Reweighting: In global reweighting [118], the server will compute global reweighing
weights based on the collected statistics from different parties and share them with parties.
Parties assign the global reweighing weights to their data samples during FL training.
Figure 7.1 compares the performance of the different debiasing methods at different data heterogeneity levels across clients. Both local and global debiasing improve over FedAvg (without any debiasing); however, local debiasing is outperformed by its global counterpart. In general, as the heterogeneity increases (i.e., smaller α), the effectiveness of local debiasing relative to global debiasing decreases. As discussed earlier, this can be attributed to local debiasing being tuned to fix bias only for the local data distribution.
7.4.2 Fair Class Balancing in Federated Learning Settings
The empirical results above highlight the issues of adopting centralized debiasing strategies in FL settings. As mentioned in §7.4, one reason is that centralized debiasing strategies rely on the sensitive attribute distribution of each client's local data, which might differ from the global sensitive attribute distribution. In Chapter 5, we propose a pre-processing method, fair class balancing, to enhance model fairness without sensitive attributes. In this section, we further investigate the performance of fair class balancing in federated learning settings.
In Figure 7.2, we report the performance of three local debiasing strategies: no debiasing (i.e., FedAvg), reweighting, and fair class balancing with fairness budget 5-NN. Comparing the three strategies, fair class balancing yields fairer predictions than reweighting, with more accuracy loss as the trade-off. Fair class balancing is also less affected by the heterogeneity level α. Taking the EOD metric as an example, for the local reweighting strategy the EOD improvement decreases by 55% when α decreases from 5,000 to 0.1, while the EOD improvement decreases by only 18% with the fair class balancing strategy. Moreover, the system with fair class balancing has less accuracy variance across clients, which is also a crucial fairness goal in federated learning, and this stability is likewise less affected by the heterogeneity level across clients.
Figure 7.2: Performance comparison (EOD, SPD, Accuracy, Std-Accuracy) of local debiasing strategies: FedAvg (i.e., no debiasing), reweighting, and fair class balancing (with 5-NN). We report the average performance over 20 random seeds on partitions with different heterogeneity levels α.
7.5 FairFed: Fairness-Aware Federated Learning
To overcome the illustrated shortcomings of local debiasing, we next introduce our proposed ap-
proach FairFed which utilizes local debiasing due to its advantages in maintaining data decentral-
ization, while addressing the aforementioned challenges by making adjustments to how the server
aggregates the local model updates from the clients.
7.5.1 Our Proposed Approach (FairFed)
Recall that in the t-th iteration of FedAvg [114], the local model updates {θ^t_k}_{k=1}^K are weighted-averaged to obtain the new global model parameter θ^{t+1} as θ^{t+1} = ∑_{k=1}^K ω^t_k θ^t_k, where the weights ω^t_k = n_k / ∑_k n_k depend only on the number of data points at each client.
Figure 7.3: FairFed: Group fairness-aware federated learning framework.
Note that a fairness-oblivious aggregation favors clients with more data points. If the training
performed in these clients results in locally biased models, then there is a potential for the global
model to be biased since the weighted averaging exaggerates the contribution of the model update
from these clients.
Based on this observation, in FairFed we propose to optimize the global group fairness F_global by adaptively adjusting the aggregation weights of the different clients based on their local fairness metrics F_k. In particular, given the current global fairness metric F^t_global (we discuss later in this section how the server can compute this quantity), in the next round the server gives a slightly higher weight to clients whose local fairness F^t_k is similar to the global fairness metric, thus relying on their local debiasing to steer the next model update towards a fair model.
Next, we detail how the aggregation weights for FairFed are computed at the server in each round. The steps performed while tracking the EOD metric in FairFed are summarized in Algorithm 2.
7.5.2 Computing the Aggregation Weights for FairFed at the Server
At the start of training, we use the default FedAvg weights ω^0_k = n_k / ∑_{k=1}^K n_k. Then, in each round t, we update the weight assigned to the k-th client based on the current gap between its local fairness metric F^t_k and the global fairness metric F^t_global. In particular, the weight update follows this formula:
\[
\Delta^t_k =
\begin{cases}
Acc^t_k - Acc^t & \text{if } F^t_k \text{ is undefined for client } k,\\
\left|F^t_{\text{global}} - F^t_k\right| & \text{otherwise},
\end{cases}
\qquad \forall k \in [K],
\]
\[
\bar{\omega}^t_k = \bar{\omega}^{t-1}_k - \beta \left( \Delta_k - \frac{1}{K}\sum_{i=1}^{K} \Delta_i \right), \qquad \forall k \in [K], \tag{7.7}
\]
\[
\omega^t_k = \bar{\omega}^t_k \Big/ \sum_{i=1}^{K} \bar{\omega}^t_i, \qquad \forall k \in [K],
\]
where (i) Acc^t_k represents the local accuracy at client k, and Acc^t = ∑_{k=1}^K Acc_k · n_k / ∑_{k=1}^K n_k is the global accuracy across the full data distribution; and (ii) β is a parameter that controls the fairness budget for each update, thus controlling the trade-off between model accuracy and fairness. Higher values of β give the fairness metrics a higher impact on the model optimization, while a lower β results in a reduced perturbation of the FedAvg weights due to fairness training; note that for β = 0, FairFed is equivalent to FedAvg, as the initial weights ω^0_k are unchanged. From the definition of ω_k in (7.7), clients whose local fairness metric is similar to the global fairness metric will be assigned higher weights in the next iteration, while clients whose local metrics significantly deviate from the global metric (i.e., with gaps Δ_k higher than the mean gap in this round) will have their weights lowered. Note that whenever the client distribution makes the local metric F_k undefined (for instance, for the EOD metric, whenever Pr(A=1, Y=1) = 0 or Pr(A=0, Y=1) = 0), FairFed relies directly on the discrepancy between the local and the global accuracy as a proxy to compute the fairness metric gap Δ_k.
In order to integrate the goal of uniform accuracy, the revised procedure (i.e., the revised version of (7.7)) computes the aggregation weights as follows:
\[
\Delta^t_k =
\begin{cases}
Acc^t_k - Acc^t & \text{if } F^t_k \text{ is undefined for client } k,\\
\eta \left|F^t_{\text{global}} - F^t_k\right| + (1-\eta)\left|Acc^t_k - Acc^t\right| & \text{otherwise},
\end{cases}
\qquad \forall k \in [K],
\]
\[
\bar{\omega}^t_k = \bar{\omega}^{t-1}_k - \beta \left( \Delta_k - \frac{1}{K}\sum_{i=1}^{K} \Delta_i \right), \qquad \forall k \in [K], \tag{7.8}
\]
\[
\omega^t_k = \bar{\omega}^t_k \Big/ \sum_{i=1}^{K} \bar{\omega}^t_i, \qquad \forall k \in [K].
\]
We introduce a parameter η ∈ [0, 1] that controls the trade-off between the global fairness constraint and the uniform accuracy constraint in the weight update. When η is large, the global fairness constraint has a higher impact on the aggregation weight update. In particular, when η = 1, we recover the FairFed update presented in (7.7).
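A sketch of this weight update, operating only on the scalar metrics that clients already report, is shown below. Passing None for an undefined local fairness metric and treating the previous (unnormalized) weights as the ω̄ of (7.7)-(7.8) are illustrative simplifications.

import numpy as np

def fairfed_weights(prev_weights, local_F, local_acc, global_F, global_acc,
                    beta=1.0, eta=1.0):
    """One round of the FairFed weight update, Eqs. (7.7)-(7.8).

    local_F[k] may be None when the local fairness metric is undefined for
    client k (e.g., a group is absent); the accuracy gap is then used instead,
    as stated in (7.7). eta = 1 recovers the update in (7.7).
    """
    prev_weights = np.asarray(prev_weights, dtype=float)
    deltas = np.array([
        local_acc[k] - global_acc if local_F[k] is None
        else eta * abs(global_F - local_F[k]) + (1 - eta) * abs(local_acc[k] - global_acc)
        for k in range(len(prev_weights))
    ])
    w_bar = prev_weights - beta * (deltas - deltas.mean())
    return w_bar / w_bar.sum()

# Toy usage with 3 clients; client 1 has an undefined local fairness metric.
w = fairfed_weights(prev_weights=[0.2, 0.3, 0.5],
                    local_F=[-0.05, None, -0.20], local_acc=[0.83, 0.80, 0.85],
                    global_F=-0.10, global_acc=0.82, beta=1.0, eta=1.0)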
The training process of FairFed at each iteration thus follows these conceptual steps:
1) Each client computes its updated local model parameters;
2) The server computes the global fairness metric value F^t_global and the global accuracy Acc^t using secure aggregation and broadcasts them to the clients;
3) Each client computes its metric gap Δ^t_k and from it calculates its aggregation weight ω^t_k with the help of the server, as defined in (7.7) or (7.8);
4) Each client sends its weighted local update ω^t_k θ^t_k to the server;
5) The server aggregates the weighted local updates ω^t_k θ^t_k using secure aggregation to compute the new global model and broadcasts it to the clients.
A detailed description of these steps using a secure aggregation protocol (SecAgg) is given in Algorithm 2.
7.5.3 How to Compute the Global Metric at the Server
One central assumption of FairFed presented above is the ability of the server to compute the global metric F_global in each iteration without requiring clients to share their local datasets with the server or any explicit information about their sensitive group distributions. We demonstrate how the server can compute F_global from information sent by the clients, taking EOD as the metric of interest. Note that the EOD metric in (7.5) can be rewritten as:
\[
F_{\text{global}} = \text{EOD} = \Pr(\hat{Y}=1 \mid A=0, Y=1) - \Pr(\hat{Y}=1 \mid A=1, Y=1)
= \sum_{k=1}^{K} \underbrace{\frac{n_k}{\sum_{i=1}^{K} n_i} \sum_{a=0}^{1} (-1)^a \, \frac{\Pr(\hat{Y}=1 \mid A=a, Y=1, C=k) \cdot \Pr(A=a, Y=1 \mid C=k)}{\Pr(Y=1, A=a)}}_{m_{\text{global},k}}
= \sum_{k=1}^{K} m_{\text{global},k}. \tag{7.9}
\]
Thus, the global EOD metric F_global can be computed by aggregating the values of m_{global,k} from the K clients. Note that the conditional distributions in the definition of m_{global,k} above are local performance metrics that can easily be computed locally by client k using its local dataset D_k.
The only non-local terms in m_{global,k} are the full-dataset statistics S = {Pr(Y=1, A=0), Pr(Y=1, A=1)}. These statistics S can be aggregated at the server using a single round of a secure aggregation scheme (e.g., [122]) at the start of training, and then shared with the K clients to enable them to compute their global fairness components m_{global,k}.
Flexibility of using FairFed with heterogeneous debiasing. Note that the FairFed weights ω^t_k in (7.7) rely only on the values of the global and local fairness metrics and are not tuned to a specific local debiasing method. Thus, we believe that FairFed is flexible enough to allow a different debiasing method at each client, and the server incorporates the effects of these different methods by reweighting the respective clients based on their reported local fairness via the weight computation in (7.7).
Algorithm 2: FairFed Algorithm (tracking EOD)

Server executes:
  Initialize: global model parameter θ^0 and weights {ω^0_k} as ω^0_k = n_k / ∑_{i=1}^K n_i, ∀k ∈ [K];
  Dataset statistics step:
    Aggregate the statistics S = {Pr(A=1, Y=1), Pr(A=0, Y=1)} from the clients through secure aggregation (SecAgg) and send them to the clients;
  for each round t = 1, 2, ... do
    // Secure aggregation to get Acc^t and the global fairness metric F^t_global as in (7.9);
    F^t_global, Acc^t ← SecAgg( {ClientLocalMetrics(k, θ^{t-1})}_{k=1}^K );
    // Using SecAgg, compute the mean of the clients' metric gaps {Δ_k}_{k=1}^K after sending F^t_global and Acc^t;
    Δ_mean ← SecAgg( {ClientMetricGap(k, θ^{t-1}, F^t_global, Acc^t)}_{k=1}^K );
    // Compute aggregation weights locally at the clients and aggregate the local model updates using the computed weights;
    ∑_{k=1}^K ω̄^t_k θ^t_k, ∑_{k=1}^K ω̄^t_k ← SecAgg( {ClientWeightedModelUpdate(k, θ^{t-1}, Δ_mean)}_{k∈[K]} );
    θ^{t+1} ← ( ∑_{k=1}^K ω̄^t_k θ^t_k ) / ( ∑_{k=1}^K ω̄^t_k );
  end

Client subroutines:
  ClientLocalMetrics(k, θ):
    Acc_k ← LocalAccuracy(θ, D_k)                    // evaluate global model accuracy at client k
    m_k ← GlobalFairComponent(θ, D_k, S)             // global fairness component defined in (7.9)
    Return (n_k / ∑_{i=1}^K n_i) · Acc_k and m_k to the server

  ClientMetricGap(k, θ, F_global, Acc):
    F_k ← LocalFairnessMetric(θ, D_k)                // local fairness metric of θ as in (7.6)
    Acc_k ← LocalAccuracy(θ, D_k)                    // evaluate global model accuracy at client k
    Δ_k ← LocalMetricGap(F_k, Acc_k, F_global, Acc)  // compute the metric gap for client k as in (7.7)
    Return Δ_k / K to the server

  ClientWeightedModelUpdate(k, θ, Δ_mean):
    ω̄^t_k ← ω^{t-1}_k − β (Δ_k − Δ_mean)             // increase the weight if Δ_k is less than the mean
    θ^t_k ← LocalFairTraining(θ, D_k)                // training at client k with local debiasing
    Return ω̄^t_k θ^t_k and ω̄^t_k to the server
7.6 Evaluation
7.6.1 Experimental Setup
7.6.1.1 Implementation.
We developed FairFed using FedML [123], a research-friendly FL library used to explore new algorithms. To accelerate training, we used its parallel training paradigm, where each FL client is handled by an independent process using MPI (message passing interface). We conducted the experiments on a server with an AMD EPYC 7502 32-core CPU.
7.6.1.2 Baselines
We adopt the following state-of-the-art solutions as our baselines:
• FedAvg [114]: the original federated learning algorithm for distributed training of private
data. It does not consider the fairness of different demographic groups.
• FedAvg + Local reweighting: Each client adopts the reweighting strategy [30] to debias its
local training data, then trains local models based on the processed data. FedAvg is used to
aggregate the local model updates at the server.
• FedAvg + Global reweighting [118]: A differential-privacy approach in which the server collects statistics, such as the noisy number of samples with privileged attribute values and favorable labels, from the parties. The server computes global reweighing weights based on the collected statistics and shares them with the parties, which assign the global reweighing weights to their data samples during FL training. (We apply the global reweighting approach in [118] without the added differential-privacy noise in order to compare with the optimal debiasing performance of global reweighting.)
7.6.1.3 Hyperparameters
In our FairFed approach and the above baselines, we train a logistic regression model for binary
classification tasks on the aforementioned datasets. All results are selected from the best accuracy
obtained from grid search on important hyperparameters such as the learning rate and decay rate.
For each hyperparameter configuration, we report the average performance of 20 random seeds.
We summarize all hyperparameters in Table 7.1.
Hyperparameter        COMPAS                        Adult
Optimizer (Adam)      lr={0.01, 0.001}, wd=0.0001   lr={0.01, 0.001}, wd=0.0001
Local epochs          1                             1
Comm. rounds          10                            20
# of clients          5                             {5, 10}
FairFed (β param.)    {1, 20, 50}                   {1, 20, 50}

Table 7.1: Hyperparameters used in the experiments on the COMPAS and Adult datasets.
7.6.2 Experimental Results on Artificial Partitioned Data
We investigate the performance of FairFed with different system settings. In particular, we consider
how the performance changes with different heterogeneity levels among the data distributions at
the clients.
To fully understand our method and the baselines under different sensitive attribute distributions across clients, a configurable data synthesis method is needed. In our context, we use a generic non-IID synthesis method based on the Dirichlet distribution proposed in [124], but apply it in a novel way for a configurable sensitive attribute distribution: for each sensitive attribute value a, we sample p_a ~ Dir(α) and allocate a portion p_{a,k} of the data points with A = a to client k. The parameter α controls the heterogeneity of the distributions at each client, where α → ∞ results in IID distributions. Table 7.2 shows an example of the heterogeneous data distribution for the Adult and COMPAS datasets for α = 0.1 and 10.

Adult Dataset
Client ID    α = 0.1 (A=0, A=1)    α = 10 (A=0, A=1)
0            269,   615            1505,  3585
1            128,   29839          876,   5695
2            418,   74             978,   7261
3            43,    392            601,   5848
4            4196,  203            1094,  8734

COMPAS Dataset
Client ID    α = 0.1 (A=0, A=1)    α = 10 (A=0, A=1)
0            32,    423            612,   217
1            151,   411            876,   109
2            3,     62             836,   185
3            522,   42             880,   251
4            3286,  1              790,   177

Table 7.2: An example of the heterogeneous (non-IID) data distribution of the sensitive attribute A (sex) used in the experiments on the Adult and COMPAS datasets. The shown distributions are for K = 5 clients and heterogeneity parameters α = 0.1 and α = 10.
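A sketch of this Dirichlet-based partitioning is shown below; the helper name and the adult_df example in the final comment are illustrative, and rounding the Dirichlet proportions to integer cut points is one simple allocation choice.

import numpy as np

def partition_by_sensitive_attribute(A, num_clients, alpha, seed=0):
    """Dirichlet partition over the sensitive attribute, following [124].

    For each sensitive value a, sample p_a ~ Dir(alpha) over clients and send
    a fraction p_{a,k} of the points with A = a to client k. Small alpha gives
    highly heterogeneous clients; large alpha approaches an IID split.
    """
    rng = np.random.default_rng(seed)
    A = np.asarray(A)
    client_indices = [[] for _ in range(num_clients)]
    for a in np.unique(A):
        idx = rng.permutation(np.where(A == a)[0])
        p = rng.dirichlet(alpha * np.ones(num_clients))
        # Cut points allocating roughly a fraction p_{a,k} of this group's points to client k.
        cuts = (np.cumsum(p)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_indices[k].extend(part.tolist())
    return [np.array(ci) for ci in client_indices]

# Example (hypothetical data frame): 5 clients, alpha = 0.1 (highly non-IID sensitive attribute).
# parts = partition_by_sensitive_attribute(adult_df["sex"].values, 5, alpha=0.1)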
7.6.2.1 Performance under the Heterogeneous Sensitive Attribute Distribution.
Under partitions with different heterogeneity levels, we compare the performance of FedAvg, local reweighting, global reweighting, and FairFed. The results are summarized in Table 7.3. For highly homogeneous data distributions (i.e., a large α), FairFed does not provide significant gains in fairness performance. This is because, under homogeneous sampling, the local dataset distributions are statistically similar (and, with enough samples, reflect the original distribution), resulting in similar weights being computed for all clients. For a larger level of heterogeneity in the sensitive attribute (i.e., at α = 0.1), FairFed improves the EOD on the Adult and COMPAS data by 98.9% and 5%, respectively, while decreasing accuracy by only 0.3% on both datasets. In contrast, at the same heterogeneity level, the local reweighting strategy improves the EOD by only 62% and 1%, and global reweighting by only 73% and 2%, for the Adult and COMPAS datasets, respectively.
Adult (β = 1), heterogeneity level α
Metric   Method      0.1      0.2      0.5      10       5000
Acc.     FedAvg      0.832    0.832    0.831    0.830    0.829
         Local       0.831    0.831    0.830    0.827    0.826
         Global      0.829    0.830    0.828    0.827    0.825
         FairFed     0.829    0.828    0.828    0.826    0.827
EOD      FedAvg     -0.184   -0.174   -0.177   -0.174   -0.172
         Local      -0.070   -0.031   -0.018    0.015    0.020
         Global     -0.049   -0.011   -0.008    0.015    0.018
         FairFed    -0.002    0.017    0.009    0.024    0.014
SPD      FedAvg     -0.171   -0.169   -0.169   -0.163   -0.161
         Local      -0.135   -0.125   -0.12    -0.100   -0.097
         Global     -0.123   -0.114   -0.113   -0.099   -0.097
         FairFed    -0.110   -0.105   -0.092   -0.094   -0.099

COMPAS (β = 1), heterogeneity level α
Metric   Method      0.1      0.2      0.5      10       5000
Acc.     FedAvg      0.671    0.684    0.679    0.672    0.672
         Local       0.671    0.684    0.679    0.671    0.671
         Global      0.671    0.683    0.679    0.672    0.671
         FairFed     0.669    0.681    0.678    0.669    0.669
EOD      FedAvg     -0.088   -0.083   -0.087   -0.089   -0.084
         Local      -0.087   -0.083   -0.086   -0.086   -0.084
         Global     -0.086   -0.081   -0.088   -0.086   -0.085
         FairFed    -0.084   -0.084   -0.085   -0.085   -0.085
SPD      FedAvg     -0.179   -0.176   -0.170   -0.170   -0.167
         Local      -0.179   -0.176   -0.169   -0.168   -0.166
         Global     -0.178   -0.175   -0.169   -0.168   -0.167
         FairFed    -0.171   -0.170   -0.169   -0.166   -0.166

Table 7.3: Performance comparison on data partitions with different heterogeneity levels α. A smaller α indicates a more heterogeneous distribution across clients. We report the average performance over 20 random seeds. For the EOD and SPD metrics, larger values indicate better fairness; positive fairness metrics indicate that the unprivileged group outperforms the privileged group.
Figure 7.4: Effects of the fairness budget β for K = 5 clients and heterogeneity parameter α = 0.2 (panels: Adult and COMPAS; curves: EOD, SPD, Accuracy).
7.6.2.2 Performance with Different Parameter (β).
In FairFed, we introduced the parameter β, which controls how much the fairness adaptation is allowed to change the model aggregation weights in each round (see (7.7) for the role of β). Figure 7.4 visualizes the effects of β using heterogeneity level α = 0.2 as an example. As β increases, the fairness constraint has a bigger impact on the aggregation weights, yielding better fairness performance at the cost of model accuracy.
Adult, heterogeneity level α
Metric     Method              0.1      0.2      0.5      10       5000
Acc.       FedAvg              0.832    0.832    0.831    0.830    0.829
           Local               0.831    0.831    0.830    0.827    0.826
           FairFed (η=0.2)     0.829    0.829    0.828    0.826    0.825
           FairFed (η=1.0)     0.829    0.828    0.828    0.826    0.827
EOD        FedAvg             -0.184   -0.174   -0.177   -0.174   -0.172
           Local              -0.070   -0.031   -0.018    0.015    0.020
           FairFed (η=0.2)    -0.058   -0.005    0.000    0.011    0.020
           FairFed (η=1.0)    -0.002    0.017    0.009    0.024    0.014
Std-Acc.   FedAvg              0.082    0.070    0.067    0.021    0.01
           Local               0.085    0.074    0.070    0.022    0.010
           FairFed (η=0.2)     0.062    0.060    0.052    0.020    0.009
           FairFed (η=1.0)     0.064    0.062    0.055    0.020    0.009

COMPAS, heterogeneity level α
Metric     Method              0.1      0.2      0.5      10       5000
Acc.       FedAvg              0.671    0.684    0.679    0.672    0.672
           Local               0.671    0.684    0.679    0.671    0.671
           FairFed (η=0.2)     0.669    0.681    0.677    0.671    0.669
           FairFed (η=1.0)     0.669    0.681    0.678    0.669    0.669
EOD        FedAvg             -0.088   -0.083   -0.087   -0.089   -0.084
           Local              -0.087   -0.083   -0.086   -0.086   -0.084
           FairFed (η=0.2)    -0.085   -0.085   -0.085   -0.088   -0.082
           FairFed (η=1.0)    -0.084   -0.084   -0.085   -0.085   -0.085
Std-Acc.   FedAvg              0.053    0.060    0.046    0.034    0.0
           Local               0.054    0.060    0.045    0.034    0.0
           FairFed (η=0.2)     0.049    0.048    0.031    0.032    0.0
           FairFed (η=1.0)     0.054    0.055    0.043    0.032    0.0

Table 7.4: Performance comparison of the uniform accuracy constraint η on data partitions with different heterogeneity levels α. We report the average performance over 20 random seeds.
7.6.2.3 Performance with Different Fairness Budget η.
We evaluate the performance of the revised FairFed using experiments with the same configurations as in Section 7.6.1. Table 7.4 compares the performance under different data heterogeneity levels and debiasing methods. For both the Adult and COMPAS datasets, FairFed can improve the EOD and Std-Accuracy metrics simultaneously, while local debiasing can exaggerate the accuracy differences across clients. In particular, when clients have highly heterogeneous distributions (i.e., smaller α), the performance variance across clients is large; it can be seen that in these settings FairFed shows larger improvements in the uniform accuracy metric.
Figure 7.5 visualizes the effects of η using heterogeneity level α = 0.5 as an example. We use Std-Acc. as a measure of accuracy uniformity. As η increases, the fairness constraint has a bigger impact on the aggregation weights, yielding better group fairness performance at the cost of accuracy uniformity.
7.6.2.4 Performance under Different Number of Clients
Another factor that can impact FL fairness is the number of participating clients. In Figure 7.6, we compare the performance of FairFed on the Adult dataset with 5 and 10 clients. In general, FairFed performs better with smaller client counts. We believe this can be attributed to the fact that with 10 clients, the dataset size at each client is reduced, resulting in a larger variance in any statistics computed from the dataset for local debiasing (law of large numbers). As a result, clients can potentially report different fairness metrics, leading to different weight assignments by the server (while the server is expected to assign equal weights in the homogeneous case).
Figure 7.6: Effects of the number of clients (5 vs. 10) on EOD and accuracy for the Adult dataset across heterogeneity levels α.
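The heterogeneity levels α used throughout these experiments come from a Dirichlet-based label split in the spirit of [124]. The sketch below is a minimal illustration of such a partitioner, assuming class labels are the only source of skew; the function name and seed handling are illustrative rather than the exact partitioning code used in our experiments.

    import numpy as np

    def dirichlet_partition(labels, num_clients, alpha, seed=0):
        # For each class, a Dirichlet(alpha) draw decides what fraction of that
        # class each client receives: small alpha -> highly non-IID clients,
        # large alpha -> near-uniform splits.
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels)
        client_indices = [[] for _ in range(num_clients)]
        for c in np.unique(labels):
            idx = rng.permutation(np.where(labels == c)[0])
            proportions = rng.dirichlet(alpha * np.ones(num_clients))
            cut_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
            for client, chunk in enumerate(np.split(idx, cut_points)):
                client_indices[client].extend(chunk.tolist())
        return [np.array(ix) for ix in client_indices]

    # Example: five clients with heterogeneity level alpha = 0.5
    # parts = dirichlet_partition(y_train, num_clients=5, alpha=0.5)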
7.6.2.5 Performance with Single Sensitive Group Clients.
In order to verify how FairFed works in scenarios where a client's local fairness metric (i.e., F_k in (7.7)) cannot be computed, we construct an experiment on the Adult dataset where each client is comprised entirely of a single group. In particular, we consider 5 clients, where the first two contain only female data points and the remaining three contain only male data points. We use the aforementioned Dirichlet distribution (with α = 0.5) to partition each group into its corresponding subset of clients (the first two for females, the remaining three for males) based on the target prediction variable (Income > 50k). Table 7.5 shows an example of such a heterogeneous data distribution.
Adult Dataset (Single Group per Client Experiment)
Client ID   Gender            Income > 50k
                              y = 0     y = 1
0           Female (100%)     3722      662
1           Female (100%)     3917      299
2           Male (100%)       1641      2888
3           Male (100%)       5815      2409
4           Male (100%)       4660      36
Table 7.5: An example of the heterogeneous (non-IID) data distribution on the target variable (Income > 50k) used in the Adult experiment, where each client is assigned only points with a single sensitive attribute value. The shown distributions are for K = 5 clients and heterogeneity parameter α = 0.5 on the target variable.
Figure 7.7: Performance (EOD and accuracy) of FedAvg, local reweighting, global reweighting, and FairFed with clients that only contain data from one sensitive group.
Figure 7.7 shows the performance of FairFed compared to the baselines. Since each client contains only a single group, local reweighting is ineffective and performs similarly to the FedAvg baseline. Both FairFed and global reweighting improve the EOD metric, as they take into account the data points across the different clients. In particular, FairFed improves EOD over FedAvg by 27%, while global reweighting improves EOD by 15%.
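As a reference point for why local reweighting degenerates here, the sketch below estimates instance weights in the spirit of the reweighing method of Kamiran and Calders [30] from whatever data a client holds; it is an illustrative implementation, not the exact code used in our pipeline. When a client contains a single sensitive group, P(A = a) = 1 and every weight collapses to 1, so the locally trained model is unchanged.

    import numpy as np

    def reweighing_weights(y, a):
        # w(a, y) = P(A=a) * P(Y=y) / P(A=a, Y=y), estimated from the data the
        # client holds. With a single sensitive group, P(A=a) = 1 and every
        # weight reduces to 1, so local reweighting has no effect.
        y, a = np.asarray(y), np.asarray(a)
        w = np.ones(len(y), dtype=float)
        for a_val in np.unique(a):
            for y_val in np.unique(y):
                cell = (a == a_val) & (y == y_val)
                if cell.any():
                    w[cell] = np.mean(a == a_val) * np.mean(y == y_val) / cell.mean()
        return w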
Figure 7.8: Effects of different local debiasing strategies: EOD and accuracy when only a fraction (0.2 to 1.0) of clients adopt the reweighting debiasing method, across heterogeneity levels α.
7.6.2.6 Performance under Different Local Debiasing Strategies.
One of the notable advantages of FairFed is that each client has the freedom to adopt a different local debiasing strategy. For instance, a fraction of clients might not adopt any local debiasing strategy because their local dataset is never fully available at any instant (e.g., an online stream of data points). In this section, we highlight the performance of FairFed under such realistic application scenarios.
In Figure 7.8, we empirically study this for the Adult dataset, where we simulate a scenario in which only a fraction of clients adopt the local reweighting strategy under FairFed. The remaining clients participate in the FL system without local debiasing, but they still have to communicate their local fairness metrics and global fairness components m_global,k as described in Algorithm 2.
As seen in the figure, group fairness improves as more clients adopt the local debiasing strategy. The figure also highlights the importance of both components of FairFed, as the fairness-aware aggregation alone is not sufficient to achieve group fairness. In particular, FairFed needs 40% of the clients to use local debiasing in order to outperform local debiasing without fairness-aware aggregation. When more than 60% of clients adopt local debiasing, FairFed outperforms both the local reweighting and global reweighting methods.
In terms of accuracy, we see from Figure 7.8 that as the number of clients performing local debiasing increases, the accuracy degradation also increases. However, note that across the different local debiasing ratios, the average accuracy at most degrades to 0.825 (when the ratio is 1) from a peak of 0.83 (when the ratio is 0.2).
Figure 7.9: Effects of mixing local debiasing strategies: EOD and accuracy when clients are split between reweighting and fair class balancing at ratios of 0.6/0.4, 0.8/0.2, and 1/0, across heterogeneity levels α.
The experiments above show the performance of FairFed when only a subset of clients uses the reweighting debiasing technique. In order to verify the claim that FairFed can operate with a mixture of debiasing techniques, we simulate a scenario where the Adult dataset is distributed across 5 clients. A fraction of 0.6/0.8/1 of the clients use the reweighting debiasing method locally, while the remaining clients use the fair class balancing debiasing method [125]. From Figure 7.9, we see that even when different debiasing methods are employed, FairFed still achieves an improvement in the fairness metrics compared with the best baseline (details for each baseline are shown in Table 7.3), while keeping the accuracy reduction to at most 2.5%.
This also complements our discussion of Figure 7.8, which shows a large degradation in the fairness metric as the fraction of FairFed clients applying local reweighting decreases. Our results in Figure 7.9 emphasize that this degradation can be greatly reduced by applying other debiasing algorithms at those clients, rather than necessarily using reweighting as the local debiasing algorithm of choice.
7.6.3 Experimental Results on Real Partitioned Data
In the previous section, we artificially partitioned the data into different clients to mimic federated learning settings. To further understand the performance in real-life applications, in this section we use US Census data as a case study to present the performance of our FairFed approach in a real distributed learning application setting.
7.6.3.1 Dataset
Our experiments are based on the ACSIncome dataset provided in [67]. The modeling task is predicting whether an individual's income is above $50,000 based on features including employment type, education background, marital status, etc.
The ACSIncome dataset is constructed from the American Community Survey (ACS) Public Use Microdata Sample (PUMS) over all 50 states and Puerto Rico in 2018, with a total of 1,664,500 data points. In our experiments, we treat each state as one participant in the FL system (i.e., 51 participants). Due to the demographic composition of different states, clients differ in data size and sensitive attribute distribution. For example, Wyoming has the smallest dataset with 3,064 users, while California has the largest with 195,665 users. We choose the race information (white/non-white) of the users as the sensitive attribute of interest in our experiments. Hawaii has the lowest ratio of white population (26%), while Vermont has the highest (96%). Figure 7.10 visualizes the data size and race distribution across the 51 clients.
Figure 7.10: Data distribution of the ACSIncome dataset. (a) Number of data points from different states; (b) proportion of the white population across states.
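A state-per-client setup of this kind can be assembled with the folktables package released alongside [67]. The sketch below is a minimal illustration: it assumes the folktables API (ACSDataSource, ACSIncome.df_to_numpy) available at the time of writing, lists only a few states for brevity, and the white/non-white mapping of the race code is my own simplification.

    from folktables import ACSDataSource, ACSIncome

    STATES = ["CA", "HI", "VT", "WY"]  # the full experiment uses all 50 states + PR

    data_source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
    clients = {}
    for state in STATES:
        acs_data = data_source.get_data(states=[state], download=True)
        # Features, binary income label (> $50k), and group attribute (race code).
        X, y, group = ACSIncome.df_to_numpy(acs_data)
        sensitive = (group == 1).astype(int)  # code 1 ("White alone") vs. all others
        clients[state] = (X, y.astype(int), sensitive)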
7.6.3.2 Performance
Table 7.6 compares the performance of FairFed on the ACSIncome dataset. Adopting local reweighting yields worse group fairness than simply applying FedAvg (without any debiasing) due to the heterogeneity across states. Our proposed FairFed approach overcomes this issue and improves the EOD by 20% (from -0.062 to -0.050). As Figure 7.11 shows, the parameter η of FairFed controls the trade-off between group fairness and accuracy uniformity across clients: higher η yields better group fairness, while lower η improves the performance uniformity across clients.

Method               Acc.     EOD      SPD
FedAvg               0.800   -0.062   -0.102
Local                0.800   -0.066   -0.106
FairFed (η = 0.2)    0.793   -0.074   -0.110
FairFed (η = 1.0)    0.799   -0.050   -0.089
Table 7.6: Performance on the ACSIncome dataset.

Figure 7.11: Effects of η on the ACSIncome dataset (EOD, SPD, and Std-Acc for η from 0.2 to 1.0).
7.7 Experiments on Human Behavior Data
In this case study, we present the performance of FairFed on a human behavior dataset TILES [23].
7.7.1 Dataset
In this section, we use the physiological and physical activity signals collected through wearable sensors (e.g., Fitbit) to estimate users' daily stress levels. The target variable is a binary label indicating whether a person's stress level is above their individual average (i.e., 1) or not (i.e., 0); these labels are collected from daily surveys sent to the participants' smartphones. We focus on the nurse population in the dataset, with 1,560 records in total. Each client contains the data from one occupation group: day-shift registered nurses (RN-day shift), night-shift registered nurses (RN-night shift), and certified nursing assistants (CNA). The three clients vary in data size, gender distribution, and target variable distribution. In general, the day-shift registered nurse client has the most data points, a larger proportion of female participants, and higher stress levels. The detailed data distribution of the TILES dataset is shown in Table 7.7.

TILES Dataset
Client            Size    Female        Male          Stress y = 0   Stress y = 1
RN-day shift      707     580 (82%)     127 (18%)     320 (45%)      387 (55%)
RN-night shift    609     471 (77%)     138 (23%)     347 (57%)      262 (43%)
CNA               244     149 (61%)     95 (39%)      158 (65%)      86 (35%)
Table 7.7: Data distribution of the TILES dataset.
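The person-normalized stress label described above can be derived from the raw daily survey scores along the following lines; the column names (participant_id, stress) are placeholders rather than the actual TILES schema.

    import pandas as pd

    def binarize_stress(daily: pd.DataFrame) -> pd.DataFrame:
        # Label is 1 when the day's self-reported stress exceeds that
        # participant's own average, 0 otherwise.
        out = daily.copy()
        person_mean = out.groupby("participant_id")["stress"].transform("mean")
        out["label"] = (out["stress"] > person_mean).astype(int)
        return out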
7.7.2 Performance
Table 7.8 reports the performance on the TILES dataset. Both FairFed and local reweighting improve the EOD metric compared to FedAvg. FairFed (with η = 1.0) improves EOD from -0.199 to 0.004 with only a 2.6% accuracy decrease (from 0.567 to 0.556). The uniformity of accuracy across clients can also be controlled by the uniform accuracy constraint η; Figure 7.12 visualizes the trend that as η decreases, accuracy uniformity across clients improves at the cost of the group fairness metrics.

Method               Acc.     EOD      SPD
FedAvg               0.567   -0.199   -0.166
Local                0.567   -0.064   -0.041
FairFed (η = 0.2)    0.559   -0.013   -0.021
FairFed (η = 1.0)    0.556    0.004    0.004
Table 7.8: Performance on the TILES dataset.

Figure 7.12: Effects of η on the TILES dataset (EOD, SPD, and Std-Acc for η from 0.2 to 1.0).
7.8 Conclusions
In this chapter, motivated by the importance and challenges of group fairness in federated learn-
ing, we propose the FairFed algorithm to enhance group fairness via a fairness-aware aggregation
method, aiming to provide fair model performance across different sensitive groups (e.g., racial,
gender groups) while maintaining high utility. Though our proposed method outperforms the state-
of-the-art fair federated learning frameworks under high data heterogeneity, limitations still exist.
As such, we plan to further improve FairFed from these perspectives: 1) We report the empirical
results on binary classification tasks in this work. We will extend the work to various application
scenarios (e.g., regression tasks, natural language processing); 2) We will extend our study to sce-
narios of heterogeneous application of different local debiasing methods and understand how the
framework can be tuned to incorporate updates from these different debiasing schemes; 3) We cur-
rently explore group fairness and uniform accuracy in fair FL. We plan to integrate FairFed with
other fairness notions in FL, such as collaborative fairness.
Chapter 8
Conclusions and Ongoing Work
Human behavior understanding systems have found broad applicability in many decision-making domains, including health and job performance evaluation, financial and employment scrutiny, etc. Hence, guaranteeing model fairness in such systems is a crucial requirement for real-world applications. In this thesis, I analyze biases derived from heterogeneous behaviors and propose modeling strategies to enhance model fairness. Due to increasing privacy concerns around sensitive attributes in real-life applications, modeling strategies that do not require access to the sensitive attributes are favored. I propose the following methods in this direction:
Fair class balancing
Fair class balancing is a class balancing algorithm. The proposed method aims to mitigate the biases coming from borderline samples, which are one of the main sources of model unfairness according to the literature. Fair class balancing follows a framework similar to cluster-based oversampling strategies, except that synthetic samples are generated based on samples from both the minority and majority classes. The cluster-based strategy identifies the real class imbalance among samples with similar feature values, and also effectively avoids generating overly noisy samples. Furthermore, by generating synthetic samples based on both classes, our method ensures that unseen borderline samples in the test set have similar probabilities of being assigned to either class, which reduces the bias in model outcomes.
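To make the idea concrete, the sketch below illustrates the general recipe (clustering the full dataset and oversampling the locally minority class using neighbors drawn from both classes); it is a simplified illustration under those assumptions, not the exact fair class balancing algorithm evaluated in this thesis [125].

    import numpy as np
    from sklearn.cluster import KMeans

    def fair_class_balance_sketch(X, y, n_clusters=10, seed=0):
        # Simplified illustration, not the exact fair class balancing algorithm:
        # cluster the full dataset, then oversample the locally minority class in
        # each cluster by interpolating between one of its samples and a randomly
        # chosen neighbor from the same cluster, regardless of that neighbor's class.
        rng = np.random.default_rng(seed)
        cluster_ids = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
        X_parts, y_parts = [X], [y]
        for c in np.unique(cluster_ids):
            Xc, yc = X[cluster_ids == c], y[cluster_ids == c]
            counts = {cls: int(np.sum(yc == cls)) for cls in np.unique(y)}
            minority = min(counts, key=counts.get)
            deficit = max(counts.values()) - counts[minority]
            if deficit == 0 or counts[minority] == 0:
                continue
            seeds = Xc[yc == minority][rng.integers(0, counts[minority], deficit)]
            neighbors = Xc[rng.integers(0, len(Xc), deficit)]    # drawn from both classes
            gamma = rng.random((deficit, 1))
            X_parts.append(seeds + gamma * (neighbors - seeds))  # synthetic samples
            y_parts.append(np.full(deficit, minority))
        return np.vstack(X_parts), np.concatenate(y_parts)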
Experimental results on real-world datasets show that our fair class balancing method can improve all three fairness metrics of interest as well as the accuracy of the minority class. The improvements are not limited to a particular clustering algorithm: fair class balancing with different clustering algorithms consistently yields fairer predictions. Our method can also be used as a pre-processing step for other fairness-aware mechanisms, further improving both fairness and accuracy. For future work, it would be valuable to extend the balancing algorithm to datasets with continuous target variables, mitigating bias in regression tasks.
Multi-Layer Factor Analysis (MLFA)
Multi-Layer Factor Analysis (MLFA) is an algorithm to identify heterogeneous features. MLFA builds on the result that factor analysis can separate samples with heterogeneity when the groups have similar sizes and the features are correlated with each other. The identified heterogeneity groups can then be used as group labels in feature rescaling techniques to mitigate the unfair impact of heterogeneity when sensitive attributes are unobserved.
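The following sketch illustrates this recipe in a simplified form (factor scores, then group assignment, then per-group rescaling); it is not the exact MLFA procedure, and the numbers of factors and groups are assumed inputs rather than values used in the experiments.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import FactorAnalysis

    def heterogeneity_rescale_sketch(X, n_factors=2, n_groups=2, seed=0):
        # Simplified illustration of the recipe described above (not the exact
        # MLFA procedure): use factor scores to infer latent heterogeneity groups,
        # then z-score each feature within its inferred group.
        scores = FactorAnalysis(n_components=n_factors, random_state=seed).fit_transform(X)
        groups = KMeans(n_clusters=n_groups, random_state=seed, n_init=10).fit_predict(scores)
        X_rescaled = np.empty_like(X, dtype=float)
        for g in np.unique(groups):
            Xg = X[groups == g]
            X_rescaled[groups == g] = (Xg - Xg.mean(axis=0)) / (Xg.std(axis=0) + 1e-8)
        return X_rescaled, groups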
Experimental results on synthetic and real-world datasets show that our proposed method can
improve both the accuracy and fairness of models compared to using the original datasets. Our
method can in fact be used as a pre-processing step for different regression models.
One limitation of MLFA is that it can only identify heterogeneity patterns in continuous behavioral signals. The bias stemming from heterogeneous categorical or discrete features, although it can be mitigated to some extent via data balancing, is worth investigating further, both in terms of its effects on model fairness and of approaches to identify such patterns.
FairFed
FairFed is a fairness-aware aggregation method for federated learning systems. In FairFed, each client performs local debiasing on its own local dataset, thus maintaining data decentralization and avoiding the exchange of any explicit information about its local data. The aggregation weights are then calculated based on the local and global fairness performance to ensure group fairness. Each client can customize its own debiasing strategy, providing more freedom and resilience to the fair FL system.
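As a rough illustration of the aggregation step, the sketch below adjusts client weights according to the gap between each client's local fairness metric and the global one; it is one plausible instantiation consistent with the description above, not a verbatim transcription of the FairFed algorithm, and the parameter and function names are illustrative.

    import numpy as np

    def fairness_aware_weights(prev_weights, local_fairness, global_fairness,
                               data_sizes, beta=1.0):
        # Illustrative server-side step: penalize clients whose local fairness
        # metric deviates from the global one, with beta controlling the strength.
        gaps = np.abs(np.asarray(local_fairness, dtype=float) - global_fairness)
        adjusted = np.asarray(prev_weights, dtype=float) - beta * (gaps - gaps.mean())
        adjusted = np.clip(adjusted, 0.0, None)      # keep weights non-negative
        if adjusted.sum() == 0:                      # degenerate case: fall back to data sizes
            adjusted = np.asarray(data_sizes, dtype=float)
        return adjusted / adjusted.sum()             # normalized aggregation weights

    # e.g. w = fairness_aware_weights(w, local_eod, global_eod, sizes, beta=1.0)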
Experiments on real-life datasets highlight the performance of FairFed under highly heterogeneous data distributions across clients. Our proposed method can significantly improve group fairness while keeping good utility. The FairFed algorithm can be further extended to support different application scenarios (e.g., natural language processing, computer vision) and different fairness notions in FL (e.g., collaborative fairness).
This work also leaves the following research questions for further investigation.
• Applying debiasing mechanisms without sensitive attributes in federated learning settings. In Chapter 7, we investigate the fairness performance of a federated learning system when each client applies a debiasing method locally. This experimental setting is based on the assumption that each client has, and is willing to use, the sensitive attribute information in the local model training process. It is also worth studying the performance of the proposed methods without sensitive attributes in federated learning settings, so that no sensitive attributes need to be collected.
• Extending the proposed approaches to different fairness definitions. This thesis mainly focuses on group fairness, which aims to mitigate bias across the groups defined by sensitive attributes. In real-life applications, other fairness definitions should also be taken into consideration: for example, individual fairness, which gives similar model outcomes to similar individuals, and subgroup fairness, which considers sample groups defined by multiple sensitive attributes.
• Generalizing the proposed approaches. The proposed methods have several limitations. For example, fair class balancing is designed for classification models, and Multi-Layer Factor Analysis is only suitable for continuous features. It would be valuable to further revise the proposed approaches for broader application scenarios, including fair data balancing methods for regression tasks and heterogeneity identification for categorical variables.
References
[1] Antonio Lanata, Gaetano Valenza, Mimma Nardelli, Claudio Gentili, and Enzo Pasquale
Scilingo. “Complexity index from a personalized wearable monitoring system for assess-
ing remission in mental health”. In: IEEE Journal of Biomedical and health Informatics
19.1 (2014), pp. 132–139.
[2] Serkan Kiranyaz, Turker Ince, and Moncef Gabbouj. “Personalized monitoring and ad-
vance warning system for cardiac arrhythmias”. In: Scientific reports 7.1 (2017), pp. 1–
8.
[3] Hugo Jair Escalante, Heysem Kaya, Albert Ali Salah, Sergio Escalera, Yağmur Güçlütürk, Umut Güçlü, et al. “Modeling, Recognizing, and Explaining Apparent Personality from Videos”.
In: IEEE Transactions on Affective Computing (2020).
[4] Nina Grgić-Hlača, Muhammad Bilal Zafar, Krishna P Gummadi, and Adrian Weller. “Be-
yond distributive fairness in algorithmic decision making: Feature selection for procedu-
rally fair learning”. In: Thirty-Second AAAI Conference on Artificial Intelligence . 2018.
[5] Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy,
and Kush R Varshney. “Optimized pre-processing for discrimination prevention”. In: Ad-
vances in Neural Information Processing Systems. 2017, pp. 3992–4001.
[6] Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkata-
subramanian. “Certifying and removing disparate impact”. In: Proceedings of the 21th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.
2015, pp. 259–268.
[7] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. “Fairness-aware
classifier with prejudice remover regularizer”. In: Joint European Conference on Machine
Learning and Knowledge Discovery in Databases. Springer. 2012, pp. 35–50.
[8] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. “Mitigating unwanted biases with
adversarial learning”. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics,
and Society. ACM. 2018, pp. 335–340.
[9] Pranay K Lohia, Karthikeyan Natesan Ramamurthy, Manish Bhide, Diptikalyan Saha,
Kush R Varshney, and Ruchir Puri. “Bias mitigation post-processing for individual and
group fairness”. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE. 2019, pp. 2847–2851.
[10] Michael P Kim, Amirata Ghorbani, and James Zou. “Multiaccuracy: Black-box post-processing
for fairness in classification”. In: Proceedings of the 2019 AAAI/ACM Conference on AI,
Ethics, and Society. ACM. 2019, pp. 247–254.
[11] Richard Berk, Hoda Heidari, Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Mor-
genstern, et al. “A convex framework for fair regression”. In: arXiv preprint arXiv:1706.02409
(2017).
[12] Junpei Komiyama, Akiko Takeda, Junya Honda, and Hajime Shimao. “Nonconvex opti-
mization for regression with fairness constraints”. In: International Conference on Ma-
chine Learning. 2018, pp. 2742–2751.
[13] Alekh Agarwal, Miroslav Dudik, and Zhiwei Steven Wu. “Fair Regression: Quantitative
Definitions and Reduction-Based Algorithms”. In: International Conference on Machine
Learning. 2019, pp. 120–129.
[14] Edward H Simpson. “The interpretation of interaction in contingency tables”. In: Journal
of the Royal Statistical Society: Series B (Methodological) 13.2 (1951), pp. 238–241.
[15] Adam Tsakalidis, Maria Liakata, Theo Damoulas, and Alexandra I Cristea. “Can we as-
sess mental health through social media and smart devices? Addressing bias in methodol-
ogy and evaluation”. In: Joint European Conference on Machine Learning and Knowledge
Discovery in Databases. Springer. 2018, pp. 407–423.
[16] Marc N Elliott, Allen Fremont, Peter A Morrison, Philip Pantoja, and Nicole Lurie. “A
new method for estimating race/ethnicity and associated disparities where administra-
tive records lack self-reported race/ethnicity”. In: Health services research 43.5p1 (2008),
pp. 1722–1736.
[17] Rui Wang, Fanglin Chen, Zhenyu Chen, Tianxing Li, Gabriella Harari, Stefanie Tignor,
et al. “StudentLife: assessing mental health, academic performance and behavioral trends
of college students using smartphones”. In: Proceedings of the 2014 ACM international
joint conference on pervasive and ubiquitous computing. ACM. 2014, pp. 3–14.
[18] Nathan Eagle and Alex Pentland. “Reality mining: Sensing complex social systems”. In:
Personal and ubiquitous computing 10.4 (2006), pp. 255–268.
[19] Nadav Aharony, Wei Pan, Cory Ip, Inas Khayal, and Alex Pentland. “Social fMRI: In-
vestigating and shaping social mechanisms in the real world”. In: Pervasive and Mobile
Computing 7.6 (2011), pp. 643–659.
[20] Akane Sano, Z Yu Amy, Andrew W McHill, Andrew JK Phillips, Sara Taylor, Natasha
Jaques, et al. “Prediction of happy-sad mood from daily behaviors and previous sleep his-
tory”. In: 2015 37th Annual International Conference of the IEEE Engineering in Medicine
and Biology Society (EMBC). IEEE. 2015, pp. 6796–6799.
[21] Louis Faust, Rachael Purta, David Hachen, Aaron Striegel, Christian Poellabauer, Omar
Lizardo, et al. “Exploring compliance: observations from a large scale fitbit study”. In:
Proceedings of the 2nd International Workshop on Social Sensing. ACM. 2017, pp. 55–60.
[22] Brandon M Booth, Karel Mundnich, Tiantian Feng, Amrutha Nadarajan, Tiago H Falk,
Jennifer L Villatte, et al. “Multimodal Human and Environmental Sensing for Longitudinal
Behavioral Studies in Naturalistic Settings: Framework for Sensor Selection, Deployment,
and Management”. In: Journal of medical Internet research 21.8 (2019), e12832.
[23] Karel Mundnich, Brandon M Booth, Michelle l’Hommedieu, Tiantian Feng, Benjamin Gi-
rault, Justin L’hommedieu, et al. “TILES-2018, a longitudinal physiologic and behavioral
data set of hospital workers”. In: Scientific Data 7.1 (2020), pp. 1–26.
[24] US Congress. 42 U.S.C. Fair Housing Act. 1968. URL: https://www.justice.gov/
crt/fair-housing-act-2.
[25] US Congress. 15 U.S.C. Equal Credit Opportunity Act. 1974. URL: https://www.ecfr.gov/current/title-12/chapter-II/subchapter-A/part-202?toc=1.
[26] Moritz Hardt, Eric Price, Nati Srebro, et al. “Equality of opportunity in supervised learn-
ing”. In: Advances in neural information processing systems. 2016, pp. 3315–3323.
[27] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. “Fair-
ness through awareness”. In: Proceedings of the 3rd innovations in theoretical computer
science conference. 2012, pp. 214–226.
[28] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. “Algorithmic
decision making and the cost of fairness”. In: KDD. 2017.
[29] Foad Hamidi, Morgan Klaus Scheuerman, and Stacy M Branham. “Gender recognition or
gender reductionism? The social implications of embedded gender recognition systems”.
In: CHI ’18. 2018, pp. 1–13.
[30] Faisal Kamiran and Toon Calders. “Data preprocessing techniques for classification with-
out discrimination”. In: Knowledge and Information Systems 33.1 (2012), pp. 1–33.
[31] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. “Learning fair
representations”. In: International Conference on Machine Learning. 2013, pp. 325–333.
[32] Faisal Kamiran, Asim Karim, and Xiangliang Zhang. “Decision theory for discrimination-
aware classification”. In: 2012 IEEE 12th International Conference on Data Mining. IEEE.
2012, pp. 924–929.
[33] Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q Weinberger. “On
fairness and calibration”. In: Advances in Neural Information Processing Systems. 2017,
pp. 5680–5689.
[34] Indrė Žliobaitė and Bart Custers. “Using sensitive personal data may be necessary for
avoiding discrimination in data-driven decision models”. In: Artificial Intelligence and Law
24.2 (2016), pp. 183–201.
[35] Michael Veale and Reuben Binns. “Fairer machine learning in the real world: Mitigat-
ing discrimination without collecting sensitive data”. In: Big Data & Society 4.2 (2017),
p. 2053951717743530.
[36] Niki Kilbertus, Mateo Rojas Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik
Janzing, and Bernhard Schölkopf. “Avoiding discrimination through causal reasoning”. In:
Advances in Neural Information Processing Systems. 2017, pp. 656–666.
[37] Matthew Jagielski, Michael Kearns, Jieming Mao, Alina Oprea, Aaron Roth, Saeed Sharifi-
Malvajerdi, et al. “Differentially private fair learning”. In: International Conference on
Machine Learning. PMLR. 2019, pp. 3000–3008.
[38] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. “Calibrating noise to
sensitivity in private data analysis”. In: Theory of cryptography conference. Springer. 2006,
pp. 265–284.
[39] Hussein Mozannar, Mesrob Ohannessian, and Nathan Srebro. “Fair learning with private
demographic data”. In: International Conference on Machine Learning. PMLR. 2020,
pp. 7066–7075.
[40] Pranjal Awasthi, Matthäus Kleindessner, and Jamie Morgenstern. “Equalized odds post-
processing under imperfect group information”. In: International Conference on Artificial
Intelligence and Statistics. PMLR. 2020, pp. 1770–1780.
[41] Sara Hajian and Josep Domingo-Ferrer. “A methodology for direct and indirect discrimina-
tion prevention in data mining”. In: IEEE transactions on knowledge and data engineering
25.7 (2012), pp. 1445–1459.
[42] Maya Gupta, Andrew Cotter, Mahdi Milani Fard, and Serena Wang. “Proxy fairness”. In:
arXiv preprint arXiv:1806.11212 (2018).
[43] Jiahao Chen, Nathan Kallus, Xiaojie Mao, Geoffry Svacha, and Madeleine Udell. “Fair-
ness under unawareness: Assessing disparity when protected class is unobserved”. In: Pro-
ceedings of the Conference on Fairness, Accountability, and Transparency. ACM. 2019,
pp. 339–348.
[44] Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. “Fair-
ness without demographics in repeated loss minimization”. In: International Conference
on Machine Learning. PMLR. 2018, pp. 1929–1938.
[45] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. “Agnostic federated learning”.
In: International Conference on Machine Learning. PMLR. 2019, pp. 4615–4625.
[46] Preethi Lahoti, Alex Beutel, Jilin Chen, Kang Lee, Flavien Prost, Nithum Thain, et al.
“Fairness without demographics through adversarially reweighted learning”. In: arXiv preprint
arXiv:2006.13114 (2020).
[47] Adrienne Wood, Magdalena Rychlowska, and Paula M Niedenthal. “Heterogeneity of
long-history migration predicts emotion recognition accuracy.” In: Emotion 16.4 (2016),
p. 413.
[48] Cristina Hernández-Quevedo, Andrew M Jones, Nigel Rice, et al. “Reporting bias and
heterogeneity in self-assessed health. Evidence from the British Household Panel Survey”.
In: Health, Econometrics and Data Group (HEDG) Working paper 05 4 (2004).
[49] Ricardo Darío Pérez Principi, Cristina Palmero, Julio C Junior, and Sergio Escalera. “On
the Effect of Observed Subject Biases in Apparent Personality Analysis from Audio-visual
Signals”. In: IEEE Transactions on Affective Computing (2019).
[50] Riccardo Miotto, Fei Wang, Shuang Wang, Xiaoqian Jiang, and Joel T Dudley. “Deep
learning for healthcare: review, opportunities and challenges”. In: Briefings in bioinfor-
matics 19.6 (2017), pp. 1236–1246.
[51] Richard D Riley, Joie Ensor, Kym IE Snell, Thomas PA Debray, Doug G Altman, Karel
GM Moons, et al. “External validation of clinical prediction models using big datasets from
e-health records or IPD meta-analysis: opportunities and challenges”. In: bmj 353 (2016),
p. i3140.
[52] Ryan Steed and Aylin Caliskan. “Machines Learn Appearance Bias in Face Recognition”.
In: arXiv preprint arXiv:2002.05636 (2020).
[53] James H Dulebohn, Robert B Davison, Seungcheol Austin Lee, Donald E Conlon, Gerry
McNamara, and Issidoros C Sarinopoulos. “Gender differences in justice evaluations: Ev-
idence from fMRI.” In: J Appl Psychol 101 (2016).
[54] Shan Li and Weihong Deng. “A deeper look at facial expression dataset bias”. In: IEEE
Transactions on Affective Computing (2020).
[55] Hesam Sagha, Jun Deng, and Björn Schuller. “The effect of personality trait, age, and
gender on the performance of automatic speech valence recognition”. In: 2017 Seventh
International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE.
2017, pp. 86–91.
[56] Avrim Blum and Kevin Stangl. “Recovering from biased data: Can fairness constraints
improve accuracy?” In: arXiv preprint arXiv:1912.01094 (2019).
[57] Heinrich Jiang and Ofir Nachum. “Identifying and correcting label bias in machine learn-
ing”. In: AISTATS ’20. PMLR. 2020, pp. 702–712.
[58] Aditya Krishna Menon and Robert C Williamson. “The cost of fairness in binary clas-
sification”. In: Conference on Fairness, Accountability and Transparency. PMLR. 2018,
pp. 107–118.
[59] Guy N Rothblum and Gal Yona. “Consider the Alternatives: Navigating Fairness-Accuracy
Tradeoffs via Disqualification”. In: arXiv preprint arXiv:2110.00813 (2021).
[60] Irene Chen, Fredrik D Johansson, and David Sontag. “Why is my classifier discrimina-
tory?” In: Advances in Neural Information Processing Systems. 2018, pp. 3539–3550.
[61] Lingjuan Lyu, Xinyi Xu, Qian Wang, and Han Yu. “Collaborative fairness in federated
learning”. In: Federated Learning. Springer, 2020, pp. 189–204.
[62] Tian Li, Maziar Sanjabi, Ahmad Beirami, and Virginia Smith. “Fair resource allocation in
federated learning”. In: arXiv preprint arXiv:1905.10497 (2019).
[63] Daniel Yue Zhang, Ziyi Kou, and Dong Wang. “Fairfl: A fair federated learning approach
to reducing demographic bias in privacy-sensitive classification models”. In: 2020 IEEE
International Conference on Big Data (Big Data). IEEE. 2020, pp. 1051–1060.
[64] Wei Du, Depeng Xu, Xintao Wu, and Hanghang Tong. “Fairness-aware Agnostic Federated
Learning”. In: Proceedings of the 2021 SIAM International Conference on Data Mining
(SDM). SIAM. 2021, pp. 181–189.
[65] Dheeru Dua and Casey Graff. UCI Machine Learning Repository. 2017.
[66] Jeff Larson, Surya Mattu, Lauren Kirchner, and Julia Angwin. “How we analyzed the
COMPAS recidivism algorithm”. In: ProPublica (5 2016) 9 (2016).
[67] Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. “Retiring Adult: New
Datasets for Fair Machine Learning”. In: Advances in Neural Information Processing Sys-
tems 34 (2021).
[68] Joan-Isaac Biel and Daniel Gatica-Perez. “The youtube lens: Crowdsourced personality
impressions and audiovisual analysis of vlogs”. In: Multimedia, IEEE Transactions on
15.1 (2013), pp. 41–55.
[69] Hugo Jair Escalante, Heysem Kaya, Albert Ali Salah, Sergio Escalera, Yagmur Gucluturk,
Umut Guclu, et al. “Explaining first impressions: modeling, recognizing, and explaining
apparent personality from videos”. In: arXiv preprint arXiv:1802.00745 (2018).
[70] Udhir Ramnath, L Rauch, EV Lambert, and TL Kolbe-Alexander. “The relationship be-
tween functional status, physical fitness and cognitive performance in physically active
older adults: A pilot study”. In: PloS one 13.4 (2018), e0194918.
[71] J Ridley Stroop. “Studies of interference in serial verbal reactions.” In: Journal of experi-
mental psychology 18.6 (1935), p. 643.
[72] Harry Holzer and David Neumark. “Assessing affirmative action”. In: Journal of Economic
literature 38.3 (2000), pp. 483–568.
[73] Eva Krumhuber, Antony SR Manstead, and Arvid Kappas. “Temporal aspects of facial
displays in person and expression perception: The effects of smile dynamics, head-tilt, and
gender”. In: Journal of Nonverbal Behavior 31.1 (2007), pp. 39–56.
[74] Mariska E Kret and Beatrice De Gelder. “A review on sex differences in processing emo-
tional signals”. In: Neuropsychologia 50.7 (2012), pp. 1211–1221.
[75] Rachael E Jack, Roberto Caldara, and Philippe G Schyns. “Internal representations reveal
cultural diversity in expectations of facial expressions of emotion.” In: Journal of Experi-
mental Psychology: General 141.1 (2012), p. 19.
[76] Yingruo Fan, Jacqueline CK Lam, and Victor OK Li. “Demographic effects on facial emo-
tion expression: an interdisciplinary investigation of the facial action units of happiness”.
In: Scientific reports 11.1 (2021), pp. 1–11.
[77] Sheila M Ryan, Ary L Goldberger, Steven M Pincus, Joseph Mietus, and Lewis A Lipsitz.
“Gender-and age-related differences in heart rate dynamics: are women more complex than
men?” In: Journal of the American College of Cardiology 24.7 (1994), pp. 1700–1707.
[78] Monica P Mallampalli and Christine L Carter. “Exploring sex and gender differences in
sleep health: a Society for Women’s Health Research Report”. In: Journal of women’s
health 23.7 (2014), pp. 553–562.
[79] Adrian Furnham. “Response bias, social desirability and dissimulation”. In: Personality
and individual differences 7.3 (1986), pp. 385–400.
[80] Carsten Eickhoff. “Cognitive biases in crowdsourcing”. In: WSDM ’18. 2018.
[81] Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H Chi. “Data decisions and theoretical implica-
tions when adversarially learning fair representations”. In: arXiv preprint arXiv:1707.00075
(2017).
[82] Christina Wadsworth, Francesca Vera, and Chris Piech. “Achieving fairness through adver-
sarial learning: an application to recidivism prediction”. In: arXiv preprint arXiv:1807.00199
(2018).
[83] Heysem Kaya, Furkan Gurpinar, and Albert Ali Salah. “Multi-modal score fusion and
decision trees for explainable automatic job candidate screening from video cvs”. In: Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
2017, pp. 1–9.
[84] Xuehan Xiong and Fernando De la Torre. “Supervised descent method and its applications
to face alignment”. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. 2013, pp. 532–539.
[85] Omkar M Parkhi, Andrea Vedaldi, and Andrew Zisserman. “Deep face recognition”. In:
(2015).
[86] Timur R Almaev and Michel F Valstar. “Local gabor binary patterns from three orthog-
onal planes for automatic facial expression recognition”. In: 2013 Humaine association
conference on affective computing and intelligent interaction. IEEE. 2013, pp. 356–361.
[87] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-
scale image recognition”. In: arXiv preprint arXiv:1409.1556 (2014).
[88] Furkan Gürpınar, Heysem Kaya, and Albert Ali Salah. “Combining deep facial and ambi-
ent features for first impression estimation”. In: European Conference on Computer Vision.
Springer. 2016, pp. 372–385.
[89] Furkan Gürpinar, Heysem Kaya, and Albert Ali Salah. “Multimodal fusion of audio, scene,
and face features for first impression estimation”. In: 2016 23rd International Conference
on Pattern Recognition (ICPR). IEEE. 2016, pp. 43–48.
[90] Florian Eyben and Björn Schuller. “openSMILE:) The Munich open-source large-scale
multimedia feature extractor”. In: ACM SIGMultimedia Records 6.4 (2015), pp. 4–13.
[91] Björn Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus Scherer, Fa-
bien Ringeval, et al. “The INTERSPEECH 2013 computational paralinguistics challenge:
Social signals, conflict, emotion, autism”. In: Proceedings INTERSPEECH 2013, 14th An-
nual Conference of the International Speech Communication Association, Lyon, France.
2013.
[92] Avishek Joey Bose and William Hamilton. “Compositional fairness constraints for graph
embeddings”. In: arXiv preprint arXiv:1905.10674 (2019).
[93] Julia Dressel and Hany Farid. “The accuracy, fairness, and limits of predicting recidivism”.
In: Science advances 4.1 (2018), eaao5580.
[94] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. “SMOTE:
synthetic minority over-sampling technique”. In: Journal of artificial intelligence research
16 (2002), pp. 321–357.
[95] Show-Jane Yen and Yue-Shi Lee. “Cluster-based under-sampling approaches for imbal-
anced data distributions”. In: Expert Systems with Applications 36.3 (2009), pp. 5718–
5727.
[96] D Georgios, B Fernando, and L Felix. “Oversampling for imbalanced learning based on
K-means and SMOTE”. In: Inf. Sci. 465 (2018), pp. 1–20.
[97] Matthäus Kleindessner, Pranjal Awasthi, and Jamie Morgenstern. “Fair k-center clustering
for data summarization”. In: arXiv preprint arXiv:1901.08628 (2019).
[98] Shikha Bordia and Samuel Bowman. “Identifying and Reducing Gender Bias in Word-
Level Language Models”. In: Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Student Research Workshop.
2019, pp. 7–15.
[99] Michael R Smith, Tony Martinez, and Christophe Giraud-Carrier. “An instance level anal-
ysis of data complexity”. In: Machine learning 95.2 (2014), pp. 225–256.
[100] Philip Adler, Casey Falk, Sorelle A Friedler, Tionney Nix, Gabriel Rybeck, Carlos Schei-
degger, et al. “Auditing black-box models for indirect influence”. In: Knowledge and In-
formation Systems 54.1 (2018), pp. 95–122.
[101] Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde,
Kalapriya Kannan, et al. AI Fairness 360: An Extensible Toolkit for Detecting, Understand-
ing, and Mitigating Unwanted Algorithmic Bias. 2018.
[102] Alexandros Pantelopoulos and Nikolaos G Bourbakis. “A survey on wearable sensor-based
systems for health monitoring and prognosis”. In: IEEE Transactions on Systems, Man, and
Cybernetics, Part C (Applications and Reviews) 40.1 (2009), pp. 1–12.
[103] Rui Wang, Fanglin Chen, Zhenyu Chen, Tianxing Li, Gabriella Harari, Stefanie Tignor,
et al. “StudentLife: assessing mental health, academic performance and behavioral trends
of college students using smartphones”. In: Proceedings of the 2014 international joint
conference on pervasive and ubiquitous computing. ACM. 2014, pp. 3–14.
[104] Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. “Dissecting
racial bias in an algorithm used to manage the health of populations”. In: Science 366.6464
(2019), pp. 447–453.
[105] Clifford H Wagner. “Simpson’s paradox in real life”. In: The American Statistician 36.1
(1982), pp. 46–48.
[106] Kristina Lerman. “Computational social scientist beware: Simpson’s paradox in behavioral
data”. In: Journal of Computational Social Science 1.1 (2018), pp. 49–58.
[107] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan.
“A survey on bias and fairness in machine learning”. In: arXiv preprint arXiv:1908.09635
(2019).
[108] Carem C Fabris and Alex A Freitas. “Discovering surprising patterns by detecting occur-
rences of Simpson’s paradox”. In: Research and Development in Intelligent Systems XVI.
Springer, 2000, pp. 148–160.
[109] Nazanin Alipourfard, Peter G Fennell, and Kristina Lerman. “Using Simpson’s Paradox to
Discover Interesting Patterns in Behavioral Data”. In: Twelfth International AAAI Confer-
ence on Web and Social Media. 2018.
[110] Peter G Fennell, Zhiya Zuo, and Kristina Lerman. “Predicting and explaining behavioral
data with structured feature space decomposition”. In: EPJ Data Science 8.1 (2019), p. 23.
[111] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis,
Arjun Nitin Bhagoji, et al. “Advances and Open Problems in Federated Learning”. In:
Found. Trends Mach. Learn. 14 (2021), pp. 1–210.
[112] Jianyu Wang, Zachary Charles, Zheng Xu, Gauri Joshi, H Brendan McMahan, Maruan Al-
Shedivat, et al. “A field guide to federated optimization”. In: arXiv preprint arXiv:2107.06917
(2021).
[113] Joy Buolamwini and Timnit Gebru. “Gender shades: Intersectional accuracy disparities in
commercial gender classification”. In: Conference on fairness, accountability and trans-
parency. PMLR. 2018, pp. 77–91.
[114] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y
Arcas. “Communication-efficient learning of deep networks from decentralized data”. In:
Artificial intelligence and statistics . PMLR. 2017, pp. 1273–1282.
[115] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný,
et al. “Adaptive federated optimization”. In: arXiv preprint arXiv:2003.00295 (2020).
[116] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. “Tackling the ob-
jective inconsistency problem in heterogeneous federated optimization”. In: arXiv preprint
arXiv:2007.07481 (2020).
[117] Yuji Roh, Kangwook Lee, Steven Euijong Whang, and Changho Suh. “FairBatch: Batch
Selection for Model Fairness”. In: International Conference on Learning Representations.
2020.
[118] Annie Abay, Yi Zhou, Nathalie Baracaldo, Shashank Rajamoni, Ebube Chuba, and Heiko
Ludwig. “Mitigating bias in federated learning”. In: arXiv preprint arXiv:2012.02447 (2020).
[119] Anonymous. “Improving Fairness via Federated Learning”. In: Submitted to The Tenth
International Conference on Learning Representations. under review. 2022.
[120] Anonymous. “Enforcing fairness in private federated learning via the modified method of
differential multipliers”. In: Submitted to The Tenth International Conference on Learning
Representations. under review. 2022.
[121] Afroditi Papadaki, Natalia Martinez, Martin Bertran, Guillermo Sapiro, and Miguel Ro-
drigues. “Federating for Learning Group Fair Models”. In: arXiv preprint arXiv:2110.01999
(2021).
[122] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMa-
han, Sarvar Patel, et al. “Practical secure aggregation for privacy-preserving machine learn-
ing”. In: proceedings of the 2017 ACM SIGSAC Conference on Computer and Communi-
cations Security. 2017, pp. 1175–1191.
[123] Chaoyang He, Songze Li, Jinhyun So, Xiao Zeng, Mi Zhang, Hongyi Wang, et al. “Fedml:
A research library and benchmark for federated machine learning”. In: arXiv preprint
arXiv:2007.13518 (2020).
[124] Harry Hsu, Hang Qi, and Matthew Brown. “Measuring the Effects of Non-Identical Data
Distribution for Federated Visual Classification”. In: arXiv preprint arXiv:1909.06335
(2019).
[125] Shen Yan, Hsien-te Kao, and Emilio Ferrara. “Fair class balancing: enhancing model fair-
ness without observing sensitive attributes”. In: Proceedings of the 29th ACM International
Conference on Information & Knowledge Management. 2020, pp. 1715–1724.
Abstract
As human behavior data, along with machine learning techniques, are increasingly applied in decision-making scenarios ranging from healthcare to recruitment, guaranteeing the fairness of such systems is a critical criterion for their wide application. Despite previous efforts on fair machine learning, the heterogeneity and complexity of behavioral data impose further challenges to both model validity and fairness. The limited access to sensitive attributes (e.g., race, gender) in real-world settings makes it more difficult to mitigate unfairness in model outcomes. In this thesis, I investigate the effects of heterogeneity on model fairness and propose different modeling strategies to address the above challenges in three directions. First, I propose a strategy to mitigate bias via data balancing. Specifically, I design a class balancing algorithm for classification tasks, named fair class balancing, that can improve both model fairness and utility without accessing the sensitive attributes. Second, I study how different heterogeneity patterns of behavioral signals affect fairness performance. I then propose a modeling framework, Multi-Layer Factor Analysis (MLFA), to identify heterogeneous behavioral patterns without sensitive attributes. Experimental results show that mitigating heterogeneity can enhance model fairness while achieving better utility. Third, I propose a method, FairFed, to improve group fairness in federated learning systems, which further enables learning large-scale machine learning models without directly accessing individuals' data. FairFed can effectively improve fairness under heterogeneous data distributions across clients in federated learning systems. The proposed methods exploit data properties so that they can improve group fairness while maintaining good utility, which gives them promising applications in behavior understanding systems.