Emergence and Mitigation of Bias in Heterogeneous Data
by
Nazanin Alipourfard
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)
August 2021
Copyright 2021 Nazanin Alipourfard
Dedication
This thesis is dedicated to
Sina, for being there for me through all of the ups and downs &
My parents & my brother, for believing in me and supporting me and my dreams
Acknowledgements
This dissertation was only made possible with the help and support of my family, friends, and colleagues.
First, I would like to thank my advisor, Kristina Lerman. This thesis could not have happened without her dedication, advice, and support. I was fortunate to have such an understanding, supportive, and wonderful advisor for my Ph.D. She helped me grow not only academically but also in life. I cannot thank her enough. I would also like to thank the other faculty members from whom I have learned a great deal in their classes and research collaborations: Greg Ver Steeg, Fred Morstatter, Aram Galstyan, Jay Pujara, and Mohammad Reza Rajati. I would also like to thank my qualification and dissertation committees for making the time and providing helpful suggestions: Ellis Horowitz, Greg Ver Steeg, Jose Luis Ambite, and Phebe Vayanos.
Secondly, I would like to thank my lab-mates for all the things I have learned from them, the brainstorming in group meetings, and their support: Nazgol Tavabi, Negar Mokhberian, Zihao He, Peter Fennell, Andres Abeliuk, Keith Burghardt, Nathan Bartley, and Yuzi He. I would also like to thank my collaborators from other labs and companies who helped me widen my research scope: Buddhika Nettasinghe, Sami Abu-El-Haija, Anahita Hosseini, Bryan Perozzi, and Hrayr Harutyunyan.
Thirdly, I would like to thank my friends for making the complications of living abroad and the hardships of a Ph.D. much easier: Reihane, Mehrnoosh, Pegah, and Mozhdeh in Los Angeles, and Mahsa, Azin, and Golshan despite the long distance. I could not have done this without their emotional support. I would like to thank them all for so many great memories.
I cannot express my gratitude towards my family in words. I want to thank my mother, Mahnaz, and
my father, Alireza, for their dedication to helping me grow and follow my dreams. All of my achievements
happened because of these two amazing parents. I appreciate their unconditional love and support from
my childhood. I also want to thank my brother, Saeedreza, for his emotional support and encouragement.
Last but not least, I want to thank my dear husband, Sina, for always being there through the ups and downs. I am so grateful to have him as my best friend, who helps me grow every day. Because of him, my Ph.D. years could not have been any better.
Table of Contents
Dedication ii
Acknowledgements iii
List of Tables ix
List of Figures xi
Abstract xvii
Chapter 1: Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Proposal & Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Challenges & Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
I Heterogeneous Tabular Data 7
Chapter 2: Identifying Confounders using Simpson's Paradox 8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.1 Step 1: Disaggregating Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1.1 Quantifying the Partition . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.1.2 Finding the Best Partition . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.2 Step 2: Modeling Disaggregated Data . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.3 Step 3: Significance of Disaggregations . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.3.1 Significance of Aggregate Data Model . . . . . . . . . . . . . . . . . . . . 21
2.4.3.2 Significance of Disaggregated Data Model . . . . . . . . . . . . . . . . . 22
2.4.3.3 Comparing Disaggregations . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4.4 Mathematical Analysis of the Paradox . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Results: Fixed-size bins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.1 Simpson’s Paradoxes on Stack Exchange . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.1.1 Answer Position & Session Length . . . . . . . . . . . . . . . . . . . . . 27
2.5.1.2 Number of Answers & Reputation . . . . . . . . . . . . . . . . . . . . . . 28
2.5.2 The Origins of Simpson’s paradox . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5.3 Discussion and Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5.4 Difference from Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6 Results: R²-based bins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6.1 Stack Exchange . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6.2 Khan Academy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.6.3 Duolingo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Chapter 3: Identifying Latent Confounders 44
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3 DoGR Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.1 Model Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.2 Model Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.2.1 E-step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.2.2 M-step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.3 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4.1 Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4.1.1 Metropolitan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4.1.2 Wine Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4.1.3 NYC Sale Price . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4.1.4 Stack Overflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4.2 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4.2.1 Prediction Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.4.2.2 Run Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
II Networked Data 68
Chapter 4: Perception Bias in Directed Networks 69
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2.1 Basic Concepts and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2.2 Friendship Paradox in Directed Networks . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.3 Global Perception Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2.4 Local Perception Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2.5 Relationship between B_local and B_global . . . . . . . . . . . . . . . . . . . . . . 79
4.2.6 Empirical Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2.7 Estimating Global Prevalence via Polling . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3.2 Friendship Paradox-based Polling: Performance Analysis . . . . . . . . . . . . . . . 87
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Chapter 5: Emergence of Structural Inequalities in Scientific Citation Networks 96
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2.1 Power-Inequality in Author-Citation Networks . . . . . . . . . . . . . . . . . . . . 100
5.2.2 Model of Network Growth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.2.3 Analysis of the Model and Insights . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.2.3.1 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.2.3.2 Calculation and Analysis of Power-inequality . . . . . . . . . . . . . . . 105
5.2.3.3 Connecting to the Real-World Networks . . . . . . . . . . . . . . . . . . 108
5.2.4 Smaller Elites have More Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.2.5 Mitigating Power-inequality in Author-Citation Networks . . . . . . . . . . . . . . 110
5.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.3.2.1 Estimating model parameters from data . . . . . . . . . . . . . . . . . . . 114
5.4 Data Limitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.5 Parameter Estimation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.5.1 Citation Edge Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.5.2 Estimating Class Balance Parameter r . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.5.3 Estimating Edge Formation Rates p, q . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.5.4 Estimating the Preferential Attachment Parameter . . . . . . . . . . . . . . . . . . 119
5.5.5 Estimating the Homophily Parameters of the Red and Blue Groups . . . . . . . . . 120
5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Chapter 6: Directed Mixed Preferential Attachment & Perception Bias 124
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.2 Global Perception Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.3 Local Perception Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.5 Global Perception Bias & Power Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Chapter 7: Conclusions 135
References 137
Appendix A: Datasets 147
A.1 Simpson’s Disaggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
A.1.1 Stack Exchange . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
A.1.2 Khan Academy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
A.1.3 Duolingo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
A.2 DoGR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
A.2.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
A.2.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
A.2.3 Finding Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Appendix B: Friendship Paradox & Perception Bias 155
B.1 Individual-level Perception Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
B.2 Local and Global Perception Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
B.3 Global Bias Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
B.4 Heuristic Follower Perception Polling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Appendix C: Directed Mixed Preferential Attachment 163
C.1 Notations of DMPA model theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
List of Tables
2.1 Examples of Simpson's paradox in Stack Exchange data. For these variables, the trend in the outcome variable (answer acceptance) as a function of X_p in the aggregate data reverses when the data is disaggregated on X_c. . . . 26
2.2 Number of data points in each group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3 Variables defining important disaggregations of Stack Exchange data, along with their pseudo-R² scores. . . . 36
2.4 Variables defining important disaggregations of the Khan Academy data, along with their pseudo-R² scores. . . . 39
2.5 Variables defining important disaggregations of Duolingo data, along with their pseudo-R² scores. . . . 41
3.1 DoGR components, and the boroughs that make up each component (rows might not add
up to 100% due to rounding). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 Results of prediction on five data-sets. Asterisk indicates results that are significantly different from our method (p-value < 0.05). The bolded results have the smallest standard deviation among the best methods with the same mean error. . . . 62
3.3 Run time (in minutes) of one round of cross-validation (non-nested). MLR took less than a minute for all datasets. 'failed' indicates that the method could not run on the data due to exceptions. We timed out WCLR after a sufficient amount of time to establish the order of performance. Bold numbers indicate the fastest algorithm. . . . 63
4.1 Properties of the Twitter subgraph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.1 Information about the data. Data for all fields of study, except Physics, came from Microsoft Academic Graph, and Physics data was provided by the American Physical Society. The number of authors with known gender is larger than the number of authors with known affiliation. The affiliation network has higher density, potentially confounded by the fact that authors with known affiliation are more active in publishing and citing other authors. . . . 112
5.2 Estimated parameters of the DMPA model. Parameters for different fields of study are estimated from Microsoft Academic Graph data, except for Physics, which is estimated from data provided by the American Physical Society (APS). . . . 121
A.1 Data sets and their characteristics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
List of Figures
2.1 Dynamics of long-term (2.1a) and short-term (2.1b) performance of users of Stack Exchange. Our findings highlight the complex interplay between short-term deterioration in performance, potentially due to mental fatigue or attention depletion, and long-term performance improvement due to learning and skill acquisition, and its impact on the quality of user-generated content. . . . 13
2.2 Simpson’s paradox in Stack Exchange data. Both plots show the probability an answer is
accepted as the best answer to a question as a function of its position within user’s activity
session. (a) Acceptance probability calculated over aggregated data has an upward trend,
suggesting that answers written later in a session are more likely to be accepted as best
answers. However, when data is disaggregated by session length (b), the trend reverses.
Among answers produced during sessions of the same length (different colors represent different-length sessions), later answers are less likely to be accepted as best answers. . . 25
2.3 Novel Simpson’s paradox discovered in Stack Exchange data. Plots show the probability an
answer is accepted as best answer as a function of the number of lifetime answers written
by user over his or her tenure. (a) Acceptance probability calculated over aggregated data
has an upward trend, with answers written by more experienced users (who have already
posted more answers) more likely to be accepted as best answers. However, when data is
disaggregated by reputation (b), the trend reverses. Among answers written by users with
the same reputation (different colors represent reputation bins), those posted by users
who had already written more answers are less likely to be accepted as best answers. . . . 28
2.4 Analysis of the Simpson’s paradox Reputation – Number of Answers variable pair. (a)
Average acceptance probability as a function of two variables. (b) The distribution of the
number of data points contributing to the value of the outcome variable for each pair of
variable values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Relationship between acceptance probability and Reputation Rate, a new measure of user
performance defined as reputation per number of answers users wrote over their entire
tenure. Each line represents a subgroup with a dierent reputation score. The much
smaller variance compared to Fig. 2.3b suggests that the new feature is a good proxy of
answerer performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6 A pair that multivariate logistic regression cannot find in the data. (a) Average
acceptance probability as a function of Answer Position and Time Since Previous Answer.
(b) The distribution of the number of data points contributing to the value of the outcome
variable for each pair of variable values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.7 Disaggregation of Stack Exchange data. (a) The heat map shows the probability the answer is accepted as a function of its answer position within a session, with the horizontal bands corresponding to the different subgroups, conditioned on the total number of answers the user has written. (b) Number of data samples within each bin of the heat map. Note that the outcome becomes noisy when there are few samples. The trends in performance as a function of answer position in (c) disaggregated data and (d) aggregate data. Error bars in (c) and (d) show the 95% confidence interval. . . . 36
2.8 Disaggregation of Stack Exchange data similar to Fig. 2.7, but instead disaggregated on user reputation. (a) The heat map shows acceptance probability as a function of answer position within a session. (b) Number of data samples within each bin of the heat map. Note that the outcome becomes noisy when there are few samples. The trends in (c) disaggregated data and (d) aggregate data. Error bars in (c) and (d) show the 95% confidence interval. . . . 38
2.9 Disaggregation of Khan Academy data showing performance as a function of month, conditioned on the first five attempts. (a) The heat map shows average performance as a function of the month. (b) Number of data samples within each subgroup. The trends in (c) the disaggregated data and in (d) the aggregated data. Error bars in (c) and (d) show the 95% confidence interval. . . . 40
2.10 Disaggregation of Duolingo data. (a) The heat map shows performance, i.e., the probability of answering all the words correctly, as a function of how many lessons the user completed, conditioned on how many of the first five lessons were answered correctly. (b) Number of data samples within each bin of the heat map. Trends in (c) the disaggregated data and in (d) aggregate data. Error bars show the 95% confidence interval. . . . 42
3.1 Heterogeneous data with three latent classes. The figure illustrates Simpson's paradox, where a positive relationship between the outcome and the independent variable exists for the population as a whole (red line) but reverses when the data is disaggregated by classes
(dashed green lines). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 The difference between hard and soft clustering for data analysis. Assuming we have two clusters (yellow and red): in hard clustering, the uncertainty of cluster assignment is not tractable in the data analysis phase (e.g., studying the coefficients of the independent variables), since all the data points have one of the main cluster colors (b). However, with soft clustering, we can get the approximated coefficient for each individual data point (the whole range from yellow to red in (a)), separately. . . . 55
3.3 Disaggregation of the Metropolitan data into five subgroups. The radar plot shows the relative importance of a feature within the subgroup. The table reports regression coefficients for two independent variables, PercentBelowPoverty and PercentGraduate, for Multivariate Regression (MLR) of the aggregate data and separately for each subgroup found by our method. . . . 64
3.4 The value of the center of each component for the four components of the Wine Quality data. The Blue and Orange components are almost entirely white wines, and the Green component is composed mostly (85%) of red wines. The lowest-quality and smallest (Red) component is a mixture of red (43%) and white (57%) wines. . . . 65
3.5 The value of the center for the four components of the NYC data. The numbers in the legend are the average price of the component's real estate, in millions of dollars. . . . 66
3.6 Disaggregation of Stack Overflow data into subgroups. The outcome variable is the length of the answer (number of words). The radar plot shows the importance of each feature used in the disaggregation. The top table shows average values of validation features, while the bottom table shows regression coefficients for the groups. . . . 67
4.1 Illustration of the effects of the four versions of the friendship paradox using the Twitter dataset described in Methods. The sub-figures display the fraction of nodes (empirical probability of the paradox) of a particular degree whose (a) friends have more followers, (b) followers have more friends, (c) friends have more friends, and (d) followers have more followers, on average. . . . 75
4.2 Histogram of the distribution of (a) global prevalence E{f(X)} and (b) local perception bias B_local of popular hashtags in the Twitter data. Local perception bias B_local (overestimating the prevalence) exists for most hashtags. . . . 82
4.3 The ranking of popular Twitter hashtags based on Local Bias. The top 20 and bottom 10 are included in the ranking. The bars compare E{f(X)} (global prevalence) and E{q_f(X)} (local perception) and include 95% confidence intervals. Hashtags can appear to be much more popular than they actually are (e.g., #ferguson) or less popular (e.g., #oscars) due to local perception bias. Definitions of some hashtags: #mike(/michael)brown and #ferguson (an 18-year-old African American male killed by police), #tbt (Throwback Thursday, for posting an old picture on Thursdays), #ff (Follow Friday, for introducing accounts worth following), #tcot (Top Conservatives On Twitter), #rt (Retweet). . . . 94
4.4 Comparison of estimates of the prevalence of Twitter hashtags produced by the polling algorithms. Variation of (a) squared bias (Bias{T}²), (b) variance (Var{T}), and (c) mean squared error (Bias{T}² + Var{T}) of the polling estimate T as a function of a hashtag's global prevalence E{f(X)}. Each point represents a different hashtag and a fixed sampling budget b = 25. The polling algorithms used are intent polling (IP), node perception polling (NPP), and the proposed follower perception polling (FPP). (d) Fraction of hashtags for which the FPP algorithm outperforms the other two in terms of MSE. The fraction for NPP approaches 0.5, and for IP approaches 0.8, as the sampling budget b increases. These figures illustrate that the proposed FPP algorithm achieves a bias-variance trade-off by coupling perception polling with the friendship paradox to reduce the mean squared error. . . . 95
5.1 Power-inequality in author-citation networks. Induced subgraph (i.e., the subgraph constructed by picking a subset of nodes and the edges between them) for the Management field, where the nodes were partitioned by (a) gender and (b) prestige of their affiliation. In the gender-partitioned network, red nodes represent female authors and blue represent male authors. In the affiliation-partitioned network, red nodes represent authors from top-ranked institutions and blue from other institutions. The induced subgraph is constructed from the ego-networks of a linked pair of red and blue nodes, which include all the nodes that cite or are cited by the linked pair of nodes, as well as the links between them. Power-inequality defined in Eq. (5.1) over the time period 1995–2019 for different fields of study in (c) gender- and (d) affiliation-partitioned networks. The plotted values show the average of power-inequalities over a sliding window of four years, and confidence intervals indicate the standard error. The gray lines show the cumulative power inequality over the years. In (c), all power-inequality values are below 1.0, suggesting that female authors have less power than male authors. Psychology is the field closest to gender parity. Economics, Management, and Political Science have steadily increasing power after 2010, while Computer Science has a slightly decreasing trend over time. In (d), the minority (red) group represents authors affiliated with the top-100 institutions. In this case, all power-inequality values are above 1.0, suggesting that authors from top-ranked institutions have more power than other authors even though they are the minority. Psychology again is the field with values closest to 1.0 and is therefore the most institutionally power-equal field of study. Management has the most inequality, suggesting the affiliation of authors is highly correlated with their power in this field. . . . 99
5.2 Variation of the asymptotic power-inequality I with the homophily parameter of the red group under the DMPA model (with the preferential attachment parameter fixed at 10). The three rows correspond to different values of the parameters p, q that capture the growth dynamics of the DMPA model. The three columns correspond to three different values of the homophily parameter of the blue group: 0.1 (heterophilic), 0.5 (unbiased), and 0.9 (homophilic). In each subplot, lines in different colors correspond to various values of the parameter r that determines the asymptotic size of each group. . . . 105
5.3 Empirically estimated parameters of the DMPA model using the gender-partitioned networks (filled markers) and affiliation-partitioned networks (open markers). The exact values are listed in Table 5.2. The leftmost plot suggests that the probability of a new node citing an existing node is larger than the probability of an existing node citing a new node (i.e., q > p) for most fields of study. The middle plot shows that the minority female authors (red group) are heterophilic (homophily parameter below 0.5) while the majority male authors are homophilic (homophily parameter above 0.5). The opposite is observed in affiliation networks; the authors affiliated with top-ranked universities (i.e., the minority) are homophilic while the others are heterophilic. The rightmost plot shows that the empirically estimated values of the power-inequality are in close agreement with the values obtained using the DMPA model. This shows that the DMPA model can represent how power-inequality emerges in real-world networks. Moreover, combining the empirically estimated parameters of the DMPA model with its theoretical analysis provides insight into strategies that can be used to mitigate the power-inequality. . . . 107
5.4 The figures illustrate how (a) power inequality and (b) the homophily parameters of the two groups vary with the minority group size in affiliation networks, defining "top-ranked universities" as the top 10, 20, 50, and 100 universities in the Shanghai University Rankings (SUR). We use the estimated parameters from Table 5.2 for each field of study. The figure shows that the smaller the minority group is, the more powerful it is, as a consequence of the minority and majority groups being increasingly homophilic and heterophilic, respectively. This observation agrees with the predictions of the DMPA model shown in Fig. 5.2 (top-left subfigure). . . . 109
5.5 Citation edge creation events considered by the DMPA model of the growth of bi-populated directed networks. The first two events correspond to the appearance of a new node. A new directed edge is created when an existing node cites a new node (Event 1) or the new node cites an existing node (Event 2). Event 3 shows densification of the network via a new edge appearing between existing nodes. . . . 113
5.6 The coverage of the data over the years, as measured by the fraction of authors with a known attribute. The gender attribute has better coverage than affiliation. We discard authors with unknown gender, and we place authors with unknown affiliation in the majority group (not affiliated with top-ranked universities). . . . 117
6.1 Global and local perception bias in terms of DMPA model parameters . . . . . . . . . . . . 132
6.2 Global perception bias and power inequality for 750 different model parameter settings. Each point corresponds to one set of parameters. The color of the points in figure (a) represents the effect of the homophily parameters on the two metrics, and in figure (b) the effect of the probability of Event 1 (p) and the probability of Event 2 (q). A higher homophily parameter for the red group than for the blue group increases the power of the red group and produces a positive perception bias, while more Event 2 than Event 1 (q > p) amplifies both power inequality and perception bias. So, when new nodes cite existing nodes more often than existing nodes cite new nodes, the power inequality is amplified. . . . 133
6.3 Schematic representation of Equation 6.14. In both figures the red group is the minority
(r < 0.5) and powerful (N_rb > N_br). In figure (a) B_global is positive, meaning that
the minority group is overestimated, while in (b) the minority group is underestimated
(B_global < 0). Equation (6.14) holds for (a) and does not hold for (b). The reason the red
group is underestimated in (b) is that there are many intra-edges in the blue group, which
makes them (as the majority) underestimate the size of the red group. 134
A.1 Synthetic data with two components centered on x = 500, but with different variances.
Gray points are data, and red points are predicted outcomes made by our method. 151
B.1 Individual-level perception bias q_{f_h}(v) − E{f(X)} for (a) all hashtags h and all nodes
v ∈ V, and (b) for two hashtags with similar global prevalence, but with positive (#nyc)
and negative (#rt) B_local. This illustrates that most hashtags are positively biased for
individuals, with bias levels that do not depend on global prevalence. 158
B.2 Value of Cov{f(U), A(V) | (U,V) ∼ Uniform(E)} and Cov{f(X), d_o(X)} for all
hashtags. Both variables are normalized by dividing by the maximum value of the variable.
The color represents the three cases. The table shows the number of hashtags that fall
into each case. 159
B.3 The ranking of popular Twitter hashtags based on Global bias. Top-20 and bottom-10 are
included in the ranking. The bars compare Global Bias (B_global) and Local Bias (B_local). 160
B.4 Comparison of polling algorithms for estimating the global prevalence of Twitter hashtags.
Variation of (a) squared bias (Bias{T}^2), (b) variance (Var{T}) and (c) mean squared
error (Bias{T}^2 + Var{T}) of the polling estimate (IP, NPP, and Approximated-FPP as
T, the polling algorithm) as a function of a hashtag’s global prevalence E{f(X)}. Each
point represents a different hashtag and a fixed sampling budget b = 25. (d) Fraction of
hashtags where the proposed FPP algorithm with the sampling heuristic (Approximate-
FPP) outperforms the other polling methods in terms of MSE. The fraction for NPP
approaches 0.5, and for IP approaches 0.8, as the sampling budget b increases. The main
difference between Approximate-FPP and FPP is in figure (d) at low sampling budgets
b. In this case, the fraction of hashtags where Approximate-FPP performs better than
NPP is around 0.8, compared with 0.9 for the FPP algorithm. 162
Abstract
The presence of bias often complicates the quantitative analysis of large-scale heterogeneous or network
data. Discovering and mitigating these biases enables more robust and generalizable analysis of data.
This thesis focuses on the 1) discovery, 2) measurement, and 3) mitigation of biases in heterogeneous and
network data.
The first part of the thesis focuses on removing biases created by the existence of diverse classes of
individuals in the population. I describe a data-driven discovery method that leverages Simpson’s paradox
to identify subgroups within a population whose behavior deviates significantly from the rest of the
population. Next, to address the challenges of multi-dimensional heterogeneous data analysis, I propose
a method that discovers latent confounders by simultaneously partitioning the data into fuzzy clusters
(disaggregation) and modeling the behavior within them (regression).
The second part of this thesis is about biases in bi-populated networked data. First, I study the perception
bias of individuals about the prevalence of a topic among their friends in the Twitter social network.
Second, I show the existence of power-inequality in author citation networks in six different fields of study,
due to which authors from one group (e.g., women) receive systematically less recognition for their work
than another group (e.g., men). As the last step, I connect these two concepts (perception bias and power-
inequality) in bi-populated networks and show that while these two measures are highly correlated, there
are some scenarios where there is a disparity between them.
Chapter 1
Introduction
1.1 Motivation
While we use data to deploy AI systems, there is a risk of propagating existing biases in the data into the
system. These biases could have harmful effects, such as discriminating against minorities [74, 73], distorting
perceptions of reality [18, 59], unfair distribution of resources [130], and ungeneralizable models
[104]. Biases could misrepresent the underlying truth. Considering the potential biases while analyzing
data could prevent this distortion and misjudgment about data. In heterogeneous data—data with high
variability—various types of biases could arise.
Heterogeneous data often comes from a population composed of diverse classes of individuals with different
characteristics. As a result of heterogeneity, a model learned on a population level might not offer
analytical insights into the underlying subgroups. In an extreme case, a population-level trend may disappear
or even reverse when the data is disaggregated into its constituent subgroups. The trend reversal—known
as Simpson’s paradox—has been widely observed in many domains, including biology [32], psychology [63],
astronomy [94], and computational social science [19]. A method that takes the heterogeneity of the data
into account could mitigate the biases that emerged from heterogeneity and, as a result, provide a more
robust way of analyzing and modeling data.
We use social networks to observe our peers to learn social norms, assess risk, or copy behaviors.
According to the friendship paradox [45], people are less popular than their friends on average. Consequences
of the friendship paradox can skew how we compare ourselves to friends: people tend to be less happy than
their friends are [21], and researchers tend to have less impact than their co-authors do [15], on average.
In fact, any trait correlated with popularity is likely to be misperceived [79, 41]. Measuring Perception Bias
and finding the preconditions for the occurrence of the bias could alleviate such misjudgments about the state
of the individuals in the network.
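The friendship paradox arises because sampling a random friend weights nodes by their degree. The following minimal sketch (an illustration, not code from this thesis) checks the effect on a small random graph; the graph size and edge probability are arbitrary choices.

```python
import random
from itertools import combinations

random.seed(0)

# Build an Erdos-Renyi random graph G(n, p).
n, p = 200, 0.05
edges = [(u, v) for u, v in combinations(range(n), 2) if random.random() < p]

# Tally node degrees.
deg = {v: 0 for v in range(n)}
for u, v in edges:
    deg[u] += 1
    deg[v] += 1

# Mean degree of a uniformly chosen node.
mean_deg = sum(deg.values()) / n

# Mean degree of a uniformly chosen *friend*: picking a random
# (person, friend) pair samples edge endpoints, so high-degree nodes
# are counted in proportion to their degree.
mean_friend_deg = sum(deg[u] + deg[v] for u, v in edges) / (2 * len(edges))

# Friendship paradox: on average, your friends have more friends than you.
assert mean_friend_deg > mean_deg
```

The inequality is not an accident of this graph: the mean friend degree equals E[d^2]/E[d], which is at least E[d] for any degree distribution, with equality only for regular graphs.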
Another example of heterogeneous data is bi-populated networks, a network composed of two classes
of nodes (e.g., female/male). For robust analysis of gender biases in bi-populated networked
data, behavioural differences between the two groups should be taken into account. Considering what
each group experiences and prefers in the network separately could help us to first understand and then
mitigate such biases. For example, scientific citation networks are an example of bi-populated networked
data. It has been shown that although women publish at the same rate as men, their papers tend to receive fewer
citations [40, 48]. One potential reason behind this could be the difference between the citation preferences
of female vs. male authors. To understand these differences—and as a next step for mitigating this type of
gender bias—we need to have a framework to model these differences between female and male authors.
1.2 Thesis Proposal & Research Questions
Thesis Proposal: Various types of biases can be observed and measured in heterogeneous data,
and those biases can be leveraged to obtain more reliable analytical results as well as more robust
methods for predicting the outcome variable in comparison with competitive baselines.
The biases give rise to misjudgments about data, leading to nongeneralizable conclusions or distorted
perceptions of reality. In this proposal, I focus on the discovery, measurement, and mitigation of biases in
heterogeneous data to prevent these misjudgments. The proposal consists of two parts: 1) Heterogeneous
Data and 2) Networked Data. In each chapter, I consider a different type of bias. In general, the research
questions in this thesis fall into the following three categories:
- (RQ1) How does bias emerge?
- (RQ2) How to measure bias?
- (RQ3) How to mitigate bias?
1.3 Challenges & Contributions
Biases in data have been widely studied (see, e.g., [92] for a review). However, given the challenges of
discovering, measuring, and mitigating biases in data, there is still much to be done. Firstly, the existence
of bias is not always obvious during the model construction step, because the downstream impacts of the
bias might not be clear until much later. As a result, the discovery of biases is more of a trial-and-error process.
For example, the implicit bias against women in Amazon’s recruiting tool was discovered after removing the
explicit bias from the model [38]. Secondly, finding and mitigating biases in one context does not solve the
bias in all other contexts. As shown in [116], a lack of social context in designing an ML decision-making
system could render technical interventions for mitigating the bias ineffective, inaccurate, and sometimes
dangerously misguided. Thirdly, even after discovering or measuring biases, finding a way to mitigate
them is not straightforward. For example, in the context of fairness, there is a trade-off between alternate
measures of fairness. In other words, no method can satisfy these conditions simultaneously [67]. Given
the mentioned challenges, this thesis presents several cases of existing biases in real-world data, proposes
ways to measure biases in different contexts, and proposes a mitigation strategy in some of the contexts.
The contributions of this thesis are:
• A systematic way∗ for uncovering the existence of Simpson’s Paradox in data and presenting discovered
examples of the paradox in real-world datasets like Stack Overflow, Duolingo, and Khan Academy.
• A statistical model† consisting of a mixture of Gaussian components for robust analysis of data and
presenting qualitative and quantitative performance results on four real-world data sets.
• Measuring the perception bias‡ of hashtags in the Twitter social network and showing the wide
range of the perception bias value in the same network, from overestimation (e.g., #ferguson, #icebucketchallenge)
to underestimation (e.g., #oscars, #retweet).
• Measuring the power-inequality§ in author citation networks in six different fields of study and
comparing two different groups: female vs. male, affiliated with a high-prestige university vs.
affiliated with other universities.
• Measuring homophily, preferential attachment, and class imbalance parameters for the 12 aforementioned
author citation networks and bringing new understanding of the potential reasons behind
power-inequality.
• Connecting the perception bias about the prevalence of a group in a bi-populated network to power-
inequality.
1.4 Overview
The research questions have been explored for two types of data: heterogeneous tabular data (Chapters
2 and 3) and networked data (Chapters 4, 5 and 6). Chapter 2 is about the emergence (RQ1) of biases in data
due to Simpson’s Paradox and is the first attempt to answer (RQ3) for this type of bias, while Chapter 3
is a more sophisticated answer to the same question. Chapter 4 is about the emergence and measurement of
perception bias (RQ1 and RQ2) in networked data. Chapter 5 is about the emergence, measurement, and
mitigation of power-inequality in author citation networks. Chapter 6 is the connection between perception
bias (Chapter 4) and power-inequality (Chapter 5) using the model introduced in the latter chapter.
∗ https://github.com/ninoch/Trend-Simpsons-Paradox
† https://github.com/ninoch/DoGR
‡ https://github.com/ninoch/perception_bias
§ https://github.com/ninoch/DMPA
Chapter 2 is based on three papers. The second and third papers ([3], [4]) were motivated by the first
paper [46]. The existence of Simpson’s Paradox was taken into consideration to answer the questions
of the first paper. However, we were aware of the existence of Simpson’s Paradox beforehand. This raises
a new question: how could instances of Simpson’s Paradox be found in new data? The second and third
papers answer this question. The third paper is the sophisticated version of the second paper, with a
systematic way to disaggregate data and a set of statistical tests for the significance of coefficients.
The promising results of applying the Simpson’s Disaggregation methods (Chapter 2) on three datasets
motivated me to propose a method for finding hidden confounding variables, improving on the previous
chapter. Firstly, in Chapter 2 we only disaggregate the data based on one independent variable, while
considering multi-dimensional disaggregation makes the method more flexible in identifying homogeneous
groups and, as a result, in tackling biases in data analysis. Secondly, in Simpson’s Disaggregation we
consider hard assignments of the data points into the subgroups, while we could instead consider fuzzy
assignments, where each data point can be assigned to more than one group at the same time. This is
why we proposed DoGR, presented in Chapter 3. DoGR uses a mixture of Gaussian distributions to disaggregate
data points into more homogeneous subgroups in a fuzzy way. Using DoGR, we are able to extract
meaningful hidden confounding variables for each subgroup.
Bias. I used Twitter social network data to see the prevalence of which topics (i.e., hashtags) in the Twitter
social network are misjudged (overestimated or underestimated) by users. Then we connectPerceptionBias
to friendship paradox and use friendship-paradox-based polling to estimate the prevalence of an attribute
in the network more accurately.
Chapter 5 is about power-inequality in scientic author citation networks. We compared the ratio of
the average out-degree to the average in-degree (as proxy of the power of a group in bi-populated network)
of female authors versus male authors, as well as authors aliated with high-prestigious universities versus
other universities. In this chapter, I measured the power-inequality in six elds of study: Management,
Economics, Political Science, Psychology, Physics, and Computer Science. Using a model of the growth of
citation networks, I showed what could be a potential reason behind power-inequality in these networks.
For some of these real-world networks, homophily (preference of author to cite other authors with the same
gender/aliation prestige) is the most important factor, while class imbalance and preferential attachment
(author with higher recognition have a higher chance of getting citation) could be other reasons.
Chapter 6 connects perception bias (Chapter 4) and power-inequality (Chapter 5). In this chapter,
I showed that Local Perception Bias is not dierent from Global Perception Bias for DMPA-generated
networks. I showed there is a strong correlation between global perception bias and power-inequality (the
more powerful group is overestimated most of the time), while under some rare scenarios, there could be
some disparities between these two measures (a group could be powerful but is underestimated).
Part I
Heterogeneous Tabular Data
Chapter 2
Identifying Confounders using Simpson’s Paradox
2.1 Introduction
Digital traces of activity have exposed human behavior to quantitative analysis [75, 89]. Data mining
algorithms have explored behavioral data to test social psychology and decision theories [66, 23] and obtain
new insights into factors affecting online behavior. Yet, behavioral data analysis is still largely a trial-and-
error process, driven by ad-hoc methods rather than principled solutions. Compounding the difficulty
are the multi-faceted challenges presented by behavioral data: it is massive, multi-dimensional, noisy,
sparse (few observations per individual), heterogeneous (composed of differently behaving individuals),
and highly unbalanced (very few observations of the outcome of interest exist). As a result, given a new
behavioral data set, it is often unclear where to start or how to even go about identifying interesting
phenomena in data. To explore a new data set, a researcher may do exploratory analysis, for example, plot
the distributions of features of interest or perform principal component analysis, but beyond this, the lack of
clear guidelines for analytic practice makes quantitative exploration of large-scale behavioral data more of
an art than a science.
The current chapter takes a step towards solving this problem by automating discovery from behavioral
data. We propose a method that systematically uncovers surprising patterns in data by identifying
subgroups within the population whose behaviors are substantially different from the rest of the population.
Our method leverages Simpson’s paradox [20, 102], a phenomenon wherein an association or a
trend observed in the data at the level of the entire population disappears or even reverses when data is
disaggregated by its underlying subgroups. The goal of our method is to identify a covariate, such that
conditioning data on the covariate significantly changes the association between the outcome and another
covariate (acting as an independent variable). To address this challenge, we introduce Simpson’s Disaggregation,
a method that decomposes the population into subgroups and compares behavioral trends within
subgroups to find surprising patterns. First, our method identifies potential subgroups by disaggregating
the data into bins that minimize the variation of the outcome of interest. It then uses a linear model to
represent behavioral trends within subgroups, as well as in aggregate data, and looks for trend reversal.
Finally, it uses statistical methods to assess the significance of trends in both aggregated and disaggregated
data, and compares disaggregations based on their explanatory power.
We apply the fully automatic method to several real-world behavioral data sets that include the Q&A site
Stack Exchange and the online learning platforms Khan Academy and Duolingo. These data sets are all highly
heterogeneous. After disaggregating the data, we find that the trends describing the response of the outcome
to various covariates within the subgroups can be very different from the population-level response.
We show that disaggregations lead to models that better explain the data. We uncover common patterns
across data sets about the effects of skill and experience on user performance and suggest further lines of
inquiry into behaviors on these platforms.
By dissecting the data into more homogeneous subgroups, our method can uncover surprising subgroups
that behave differently from the rest of the population. Such patterns are a sign that strong individual
differences exist within the population, differences that must be accounted for in analysis. Thus,
the method gives a researcher a powerful tool for automatically identifying behavioral patterns meriting
deeper study.
The rest of the chapter is organized as follows. First, we present background and related work in
Section 2.2. Then we motivate the problem by giving the example from the first paper in Section 2.3.
Then we describe our methodology for detecting Simpson’s paradox by identifying covariates in data,
and analyze the paradox mathematically to gain more insight into its origins (Section 2.4). Finally, we
apply our method to real-world data to demonstrate its ability to automatically identify novel cases of
Simpson’s paradox (Sections 2.5 and 2.6). We conclude with a discussion of implications. In the method
section, Subsections 2.4.2 and 2.4.4 are for both papers ([3], [4]); however, Subsections 2.4.1 and 2.4.3 are
only for the third paper. In Section 2.5, we present experimental results of the second paper [3], and in
Section 2.6 we present experimental results of the third paper [4].
2.2 Background and Related Work
The goal of data analysis is to identify important associations between features, or variables, in data.
However, hidden correlations between variables can lead analysis to wrong conclusions. One important
manifestation of this effect is Simpson’s paradox, according to which an association that appears in different
subgroups of data may disappear, and even reverse itself, when data is aggregated across subgroups.
Instances of the paradox have been documented across a variety of disciplines, including demographics,
economics, political science, and clinical research, and it has been argued that the presence of Simpson’s
paradox implies that interesting patterns exist in data [44]. A notorious example of Simpson’s paradox
arose during a gender bias lawsuit against UC Berkeley. Analysis of graduate school admissions data revealed
a statistically significant bias against women. However, the pattern of discrimination observed in
this aggregate data disappeared when admissions data was disaggregated by department. Bickel et al. [17]
attributed this effect to Simpson’s paradox. They argued that the subtle correlations between the popularity
of departments among the genders and their selectivity resulted in women applying to departments
that were hardest to get into, which skewed the analysis.
Simpson’s paradox must also be considered in the analysis of trends. Vaupel and Yashin [126] give a
compelling illustration of how survivor bias can shift the composition of data, distorting the conclusions
drawn from it. Analysis of recidivism among convicts released from prison shows that the rate at which
they return to prison declines over time. From this, policy makers may conclude that age has a pacifying
effect on crime: older convicts are less likely to commit crimes. In reality, this conclusion is false. Instead,
we can think of the population of ex-convicts as composed of two subgroups with constant, but very
different, recidivism rates. The first subgroup, let’s call them “reformed,” will never commit a crime once
released from prison. The other subgroup, the “incorrigibles,” will always commit a crime. Over time, as
“incorrigibles” commit offenses and return to prison, there are fewer of them left in the population. The
survivor bias changes the composition of the population under study, creating an illusion of an overall
decline in recidivism rates. As Vaupel and Yashin warn, “unsuspecting researchers who are not wary
of heterogeneity’s ruses may fallaciously assume that observed patterns for the population as a whole
also hold on the sub-population or individual level.” Their paper gives numerous other examples of such
ecological fallacies.
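Vaupel and Yashin's illustration is easy to reproduce numerically. In the sketch below (the subgroup sizes and the 0.3 re-offense rate are illustrative choices, not figures from their paper), each subgroup's individual recidivism rate is constant, yet the rate measured over the surviving at-large population declines every year:

```python
# Two subgroups with constant individual recidivism rates:
# "reformed" never re-offend; "incorrigibles" re-offend at RATE per year.
reformed, incorrigible = 1000.0, 1000.0
RATE = 0.3

observed_rates = []
for year in range(10):
    at_large = reformed + incorrigible
    reoffended = incorrigible * RATE   # these return to prison
    observed_rates.append(reoffended / at_large)
    incorrigible -= reoffended         # survivor bias: fewer remain at large

# The aggregate rate starts at 0.15 and declines every year, even though
# no individual's propensity to re-offend ever changed.
assert abs(observed_rates[0] - 0.15) < 1e-12
assert all(a > b for a, b in zip(observed_rates, observed_rates[1:]))
```

The decline is driven purely by the shrinking share of "incorrigibles" among those still at large, not by any change at the individual level.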
Similar illusions crop up in many studies of social behavior. For example, when examining how social
media users respond to information from their friends (other users that they follow), it may appear that
if more of a user’s friends use a hashtag, then the user will be less likely to use it himself or herself [112].
Similarly, the more friends share some information, the less likely the user is to share it with his or her
followers [50]. From this, one may conclude that additional exposures to information in a sense “inoculate”
the user and act to suppress the sharing of information. In fact, this is not the case, and instead,
additional exposures monotonically increase the user’s likelihood to share information with followers [78].
However, those users who follow many others, and are likely to be exposed to information or a hashtag
multiple times, are less responsive overall, because they are overloaded with the information they receive from
all the friends they follow [55]. Calculating response as a function of the number of exposures in the aggregate
data falls prey to survivor bias: the more responsive users (with fewer friends) quickly drop out of
the average (since they are generally exposed fewer times), leaving the highly connected, but less responsive,
users behind. The reduced susceptibility of these highly connected users biases aggregate response,
leading to wrong conclusions about individual behavior. Once data is disaggregated based on the volume
of information individuals receive, a clearer pattern of response emerges, one that is more predictive of
behavior [56]. Multiple examples of Simpson’s paradox have been identified in empirical studies of online
behavior. A study [11] of Reddit found that while it may appear that average comment length decreases
over any fixed period of time, when data is disaggregated into groups based on the year the user joined Reddit,
comment length within each group increases during the same time period. It is only because users who
joined early tend to write longer comments that Simpson’s paradox appears.
Data heterogeneity also impacts statistical analysis of data [42] and causal inference [135]. However,
no general framework to recognize and mitigate Simpson’s paradox exists. Current methods require that
the structure of data be explicitly specified [12] or at best be guided by subject matter knowledge [52].
Despite accumulating evidence that Simpson’s paradox affects inference from data [135, 42], scientists
do not routinely test for the presence of this paradox in heterogeneous data. The reason behind this
could be that the existing methods [20, 102] are not practical, due to the limiting assumption of categorical
independent variables, while real-world datasets mostly consist of non-categorical variables. Our work
addresses this knowledge gap by proposing a statistical method to systematically uncover instances of
Simpson’s paradox in real-world data.
2.3 Motivation
The goal of our first paper was to answer two questions: What are the long-term changes in the quality of
content users produce on Stack Exchange? What are the short-term changes? [46]
To answer the first question, we put answers written by users with the same tenure in one bin and average
their Acceptance Probability. Figure 2.1a suggests long-term performance improvement.
To answer the second question, however, closer attention needs to be paid. We define a Session of Activity
as a sequence of answers written by the same user without an extended break. Considering Answer
Position within a session as the independent variable and Acceptance Probability as the dependent variable,
Figure 2.2a presents the results and suggests that the performance of users improves in the short term, which does not
support the literature [120]. Surprisingly, by studying short-term performance separately in each group
with similar Session Length in Figure 2.1b (or Figure 2.2b), the deterioration of performance in the short term
can be concluded. This is an instance of Simpson’s Paradox in Stack Exchange data.
Figure 2.1: Dynamics of long-term (2.1a) and short-term (2.1b) performance of users of Stack Exchange.
Our findings highlight the complex interplay between short-term deterioration in performance, potentially
due to mental fatigue or attention depletion, and long-term performance improvement due to learning and
skill acquisition, and its impact on the quality of user-generated content.
The fact that Session Length is a confounding factor for Answer Position helped us propose a well-
founded answer to the second question. However, if the conditioning on Session Length were not taken into
account, we would conclude that users improve in both the long term and the short term. So, this is an example of
how conclusions about data can be invalidated after considering confounding variables (RQ1). In the
extreme scenario, the confounding variable could create Simpson’s Paradox. Being aware of the existence of
Simpson’s Paradox in the data could prevent drawing wrong conclusions from data. The importance of this
fact makes us ask the second research question: what other variables need to be considered as conditioning
variables to make the analysis well-founded? The formal definition of the problem (and Simpson’s pairs) is
described in the next section.
2.4 Method
We propose a method to systematically uncover Simpson’s paradox for trends in data, which is an approach
to answering RQ2. We denote as Y the dependent variable in the data set, i.e., an outcome being measured,
and as X = {X_1, X_2, ..., X_m} the set of m independent variables or features. The goal of the method is to
identify pairs of variables (X_p, X_c)—Simpson’s pairs—such that a trend in Y as a function of X_p disappears
or reverses when the data is disaggregated by conditioning on X_c. More specifically, our method searches
for pairs of variables (X_p, X_c) such that

    d/dx_p E[Y | X_p = x_p] > 0   ∀ x_p,                       (2.1)
    d/dx_p E[Y | X_p = x_p, X_c = x_c] ≤ 0   ∀ x_p, x_c,       (2.2)

and vice versa (i.e., dE[Y | X_p = x_p]/dx_p < 0 and dE[Y | X_p = x_p, X_c = x_c]/dx_p ≥ 0). Equations (2.1) and
(2.2) hold if the expected value of Y is a monotonically increasing (or decreasing) function of X_p alone,
but conditioned on X_c is a monotonically decreasing (resp. increasing) function of X_p, or is constant.
The approach has three steps. First, it disaggregates data into more homogeneous subgroups based on
some covariate X_c. Next, it uses a linear model to capture trends with respect to some other covariate
X_p, both within the subgroups and within the aggregate data. Finally, it quantifies how well the models
describe the disaggregated data compared to aggregate data to identify the important disaggregations. We
describe these steps in detail below.∗
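The conditions above suggest a direct computational test. The sketch below is a simplified illustration of the idea, not the thesis code (which additionally applies significance tests): fit the trend of Y on X_p in the aggregate data and within each subgroup defined by X_c, and flag the pair when the signs disagree. The synthetic data is constructed to contain the paradox.

```python
def ols_slope(pairs):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var = sum((x - mx) ** 2 for x, _ in pairs)
    return cov / var

# Synthetic (x_p, x_c, y): within every subgroup (fixed x_c) the trend of
# y on x_p is negative, but the subgroup means line up so that the
# aggregate trend is positive.
data = []
for c in range(3):                       # three subgroups, indexed by x_c
    for i in range(6):
        x_p = 10 * c + i
        y = 5 * c - 0.5 * (x_p - 10 * c)
        data.append((x_p, c, y))

aggregate = ols_slope([(x, y) for x, _, y in data])
subgroup = [ols_slope([(x, y) for x, c, y in data if c == b]) for b in range(3)]

# (X_p, X_c) is flagged as a Simpson's pair: aggregate trend up,
# every subgroup trend down.
is_simpsons_pair = aggregate > 0 and all(s <= 0 for s in subgroup)
assert is_simpsons_pair
```

Here every subgroup slope is exactly −0.5 while the aggregate slope is positive, because the between-subgroup variation in X_p dominates the pooled fit.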
2.4.1 Step 1: Disaggregating Data
A critical step in our method is disaggregating data by conditioning on variable X_c. The idea behind
disaggregation is to segment data into more homogeneous subgroups of similar elements. For multinomial
variables X_c, the disaggregation step simply involves grouping data by unique values of X_c. However, for
continuous X_c or discrete variables with a large range, this step is more complex. We can bin the elements
according to their values of X_c, but a decision has to be made about how large each bin is, whether bin sizes
scale linearly or logarithmically, etc. If a bin is too small, it may not contain enough samples for a
statistically significant measurement, but if it is too large, the samples may be too heterogeneous for a
robust trend. We used bins of fixed size for the WSDM paper [3]; the results of that paper are available
in Section 2.5. We realized that more sophisticated binning techniques can allow us to isolate more pairs or
reduce the number of false positives, which is the goal of the disaggregation algorithm in our ICWSM paper
[4]. The results of that paper can be found in Section 2.6. The rest of this subsection (2.4.1) describes the
disaggregation algorithm.
∗ The code implementing the method is available on https://github.com/ninoch/Trend-Simpson-s-Paradox/.
Disaggregation Algorithm: We disaggregate the data by partitioning it on the conditioning variable
X_c into non-overlapping bins, such that data points within each bin are more similar to each other than
to data in other bins. These bins correspond to the more homogeneous subgroups within the population
generating the data. Simply partitioning the data into fixed-size bins, or percentiles, can be problematic
when X_c has a heavy-tailed distribution, since the bins covering the tail will have few data points in
them. In such cases, logarithmic binning is a better choice. However, the decision then has to be made
about the size and scale of each bin. This decision must balance two considerations: first, each bin has
to be homogeneous, i.e., it must contain data points that are more similar to each other in relation to
the outcome variable than to variables in other bins, and secondly, it needs to have a sufficient number
of data points. Basically, too small a bin may not contain enough samples for a statistically significant
measurement, while the samples in too large a bin may be too heterogeneous for a robust trend.
The binning method described below partitions the values of X_c such that Y exhibits little variation
within each bin but significant variation between bins.
2.4.1.1 Quantifying the Partition
The total sum of squares (SST) is the key concept used to describe the variation in observations $\{y_i\}_{i=1}^N$ of a random variable $Y$. It is defined as $SST = \sum_{i=1}^{N}(y_i - \bar{y})^2$, where $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$ is the mean of all observations. The sample variance $\sigma^2$ is equal to $SST/(N-1)$; thus the SST is related to the variation in $Y$.
For any arbitrary partition $P_{X_c}$ of the variable $X_c$, we can quantify how much variation of the outcome variable $Y$ can be explained by $P_{X_c}$ by decomposing the total sum of squares as:
$$\sum_{i=1}^{N}(y_i - \bar{y})^2 = \sum_{b \in P_{X_c}} N_b (\bar{y}_b - \bar{y})^2 + \sum_{b \in P_{X_c}} \sum_{i=1}^{N_b} (y_{b,i} - \bar{y}_b)^2, \qquad (2.3)$$
where $N_b$ is the number of data points in bin $b$, $y_{b,i}$ is the $i$-th data point in bin $b$, and $\bar{y}_b$ is the average of the values in that bin. The first term on the right hand side of Eq. (2.3) is the sum of squares between groups, a weighted average of squared differences between the global ($\bar{y}$) and local ($\bar{y}_b$) averages. This sum measures how much $Y$ varies between different bins of the partition. The second term is the sum of squares within groups, which measures how much variation in $Y$ remains within each bin $b$. Then, the proportion of the explained sum of squares to the total sum of squares, or coefficient of determination, is:
explained sum of squares to the total sum of squares, or coecient of determination, is:
R
2
=
P
b2P
Xc
N
b
( y
b
y)
2
SST
(2.4)
16
TheR
2
measure takes values between zero and one, with large values ofR
2
indicating a larger proportion
of the variation ofY explained byX
c
, for this specic binningP
Xc
.
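Eq. (2.4) can be computed directly from a list of outcomes and bin labels. The following is a minimal sketch; the function name `partition_r2` and the bin-label representation are ours, not taken from the released code.

```python
from statistics import mean

def partition_r2(y, bin_ids):
    """Proportion of variation in y explained by a partition (Eq. 2.4).

    y       -- list of outcome values
    bin_ids -- list of bin labels, one per data point
    """
    y_bar = mean(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    # group outcomes by bin label
    bins = {}
    for yi, b in zip(y, bin_ids):
        bins.setdefault(b, []).append(yi)
    # between-group sum of squares (first term of Eq. 2.3)
    ss_between = sum(len(v) * (mean(v) - y_bar) ** 2 for v in bins.values())
    return ss_between / sst
```

A partition whose bins perfectly separate the outcome values yields $R^2 = 1$, while a partition whose bins all share the same mean yields $R^2 = 0$.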
2.4.1.2 Finding the Best Partition
We now describe a systematic way of learning the partition $P_{X_c}$ for the feature $X_c$ that explains the largest possible variation of the outcome $Y$. Given the data, the domain of the feature $X_c$ can be split at some value $s$ into two bins: $X_c \le s$ and $X_c > s$. From Eq. (2.4), the proportion of variation in $Y$ explained by such a split is:
$$R^2(s; X_c) = \frac{N_{b_1}(\bar{y}_{b_1} - \bar{y})^2 + N_{b_2}(\bar{y}_{b_2} - \bar{y})^2}{SST}, \qquad (2.5)$$
where $N_{b_1}$ and $\bar{y}_{b_1}$ are the number of data points and the average value of $Y$ in the bin $X_c \le s$, and $N_{b_2}$ and $\bar{y}_{b_2}$ are the number of data points and the average value of $Y$ in the bin $X_c > s$. The split point $s$ can take any value in the domain of $X_c$, and $R^2(s; X_c)$ can then be computed for that $s$. Thus, among all possible values $s \in [\min(X_c), \max(X_c)]$, we choose as the optimal split $s_1$ for $X_c$ the value that maximizes $R^2(s; X_c)$.
For the next iteration, we choose the next best split $s_2$ to optimize the improvement in $R^2$. In general, assume we have bins $\{b_u\}_{u=1}^{k}$ after $k-1$ iterations, and for the next iteration we have found the best split $s_{k+1}$, which divides the bin $b_i$ into two bins, $b_{i_1}$ and $b_{i_2}$, where $b_{i_1}$ is associated with the data points in bin $b_i$ where $X_c \le s_{k+1}$, and $b_{i_2}$ is associated with the data points in bin $b_i$ where $X_c > s_{k+1}$. Thus, after splitting we will have bins $b_{i_1}$ and $b_{i_2}$ instead of bin $b_i$. In this case, the $R^2$ improvement is:
$$\Delta R^2(s \mid P_{X_c}, X_c) = \frac{1}{SST}\left[ N_{b_{i_1}}(\bar{y}_{b_{i_1}} - \bar{y})^2 + N_{b_{i_2}}(\bar{y}_{b_{i_2}} - \bar{y})^2 - N_{b_i}(\bar{y}_{b_i} - \bar{y})^2 \right] \qquad (2.6)$$
In this manner, the method recursively splits the domain of $X_c$ to create a partition of the feature. However, this procedure would continue indefinitely until $X_c$ has been partitioned into bins consisting of single points, overfitting the data. To prevent this, we constrain the algorithm so that the maximum number of bins is 20, while the minimum number of data points per bin is 100.
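The greedy procedure, with both stopping constraints, can be sketched as follows. This is an illustrative reimplementation of our own, assuming an exhaustive search over split points within each bin; the code released with the thesis is the authoritative version. Note that the gain computed inside `split_gain` uses the bin's own mean, which is algebraically equal to the three-term form of Eq. (2.6) because the global mean cancels out.

```python
from statistics import mean

def split_gain(pairs, min_points):
    """Best single split of one bin: returns (gain, threshold)."""
    n = len(pairs)
    ys = [y for _, y in pairs]
    yb = mean(ys)
    total = sum(ys)
    best_gain, best_s, left = 0.0, None, 0.0
    for i in range(1, n):
        left += pairs[i - 1][1]
        # cannot split between equal x values; enforce minimum bin size
        if pairs[i][0] == pairs[i - 1][0] or i < min_points or n - i < min_points:
            continue
        m1, m2 = left / i, (total - left) / (n - i)
        gain = i * (m1 - yb) ** 2 + (n - i) * (m2 - yb) ** 2
        if gain > best_gain:
            best_gain, best_s = gain, pairs[i - 1][0]
    return best_gain, best_s

def disaggregate(xs, ys, max_bins=20, min_points=100):
    """Greedy recursive partition of X_c maximizing explained variation."""
    bins = [sorted(zip(xs, ys))]
    while len(bins) < max_bins:
        gains = [split_gain(b, min_points) for b in bins]
        i = max(range(len(bins)), key=lambda j: gains[j][0])
        gain, s = gains[i]
        if s is None or gain <= 0:
            break  # no admissible split improves R^2
        b = bins.pop(i)
        bins.insert(i, [p for p in b if p[0] > s])
        bins.insert(i, [p for p in b if p[0] <= s])
    return [(b[0][0], b[-1][0]) for b in bins]  # (min, max) of each bin
```

The default constraints mirror the ones stated above (at most 20 bins, at least 100 points per bin); smaller values are used here only for toy data.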
2.4.2 Step 2: Modeling Disaggregated Data
Next, the method measures the association between the outcome variable $Y$ and the independent variable $X_p$ in the aggregate data and compares it to the associations in the disaggregated data. First, at the aggregate level, we model the relationship between $Y$ and $X_p$ as a linear model of the form
$$E[Y \mid X_p = x_p] = f_p(\alpha + \beta x_p), \qquad (2.7)$$
where $f_p(\alpha + \beta x_p)$ is a monotonically increasing function of its argument $\alpha + \beta x_p$. The parameter $\alpha$ in Eq. (2.7) is the intercept of the regression function, while the trend parameter $\beta$ quantifies the effect of $X_p$ on $Y$. Second, for the disaggregation, we fit linear models of the form of Eq. (2.7) but with different values of the parameters $\alpha$ and $\beta$ depending on the value of $X_c$:
$$E[Y \mid X_p = x_p, X_c = x_c] = f_{p,c}(\alpha(x_c) + \beta(x_c)\, x_p). \qquad (2.8)$$
When fitting linear models $f(\alpha + \beta X)$ we obtain not only a fitted trend parameter $\beta$ but also a $p$-value, which gives the probability of finding a trend parameter at least as extreme as the fitted value under the null hypothesis $H_0: \beta = 0$. From this, we have three possibilities:
• $\beta$ is not statistically different from zero ($\mathrm{sgn}(\beta) = 0$),
• $\beta$ is statistically different from zero and positive ($\mathrm{sgn}(\beta) = 1$),
• $\beta$ is statistically different from zero and negative ($\mathrm{sgn}(\beta) = -1$).
This mechanism allows us to test for Simpson's paradox by comparing the sign of $\beta$ from the aggregated fit (Eq. (2.7)) with the signs of the $\beta$'s from the disaggregated fits (Eq. (2.8)). Although Eqs. (2.7) and (2.8) state that the signs from the disaggregated curves should all be different from the aggregated curve, in practice this is too strict, especially as human behavioral data is noisy. Thus, if more than half of the subgroups have a different sign from the aggregated trend, then Simpson's paradox exists. The algorithm is summarized below:
Trend Simpson's Paradox Algorithm

def trend_simpsons_pair(X, Y):
    # Helpers assumed defined elsewhere: vars is the list of covariates,
    # trend_analysis fits the model of Eq. (2.7)/(2.8) and returns
    # (beta, pvalue), and bins_of yields the bins of the conditioning
    # variable found in Step 1.
    paradox_pairs = []
    for paradox_var in vars:
        beta, pvalue = trend_analysis(X[paradox_var], Y)
        agg = sgn(beta, pvalue)
        for condition_var in vars:
            if paradox_var != condition_var:
                dagg = []
                for con_gr in bins_of(condition_var):
                    beta, pvalue = trend_analysis(X[paradox_var | con_gr], Y)
                    dagg.append(sgn(beta, pvalue))
                # paradox if the majority of subgroup signs disagree
                # with the aggregate sign
                if sum(d != agg for d in dagg) > len(dagg) / 2:
                    paradox_pairs.append([paradox_var, condition_var])
    return paradox_pairs

def sgn(beta, pvalue=0.0):
    return 0 if (beta == 0 or pvalue > 0.05) else (1 if beta > 0 else -1)
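The helper `trend_analysis` above is, in the thesis, a logistic fit with a likelihood-ratio test. As a dependency-free illustration of the interface, one can stand in an ordinary-least-squares slope with a normal-approximation $p$-value. This is a simplification of our own, not the method's actual test, and it assumes more than two data points with non-constant $x$.

```python
import math
from statistics import mean

def trend_analysis(x, y):
    """OLS slope and a two-sided normal-approximation p-value for H0: beta = 0."""
    n = len(x)  # assumes n > 2 and x not constant
    xb, yb = mean(x), mean(y)
    sxx = sum((xi - xb) ** 2 for xi in x)
    beta = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    alpha = yb - beta * xb
    resid = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
    se = math.sqrt(sum(r * r for r in resid) / (n - 2) / sxx)
    z = beta / se if se > 0 else float("inf")
    p = math.erfc(abs(z) / math.sqrt(2))  # normal approximation, not a t-test
    return beta, p
```

Any fitting routine with the same `(beta, pvalue)` return shape can be dropped into the algorithm above.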
2.4.3 Step 3: Significance of Disaggregations
We conjecture that surprising subgroups are those whose behavior deviates substantially from that of the population as a whole. The existence of such subgroups suggests that important behavioral differences exist that require deeper analysis. To identify such subgroups we must first quantify how well a model, in the simplest case a linear model, describes the data.
In this paper, we examine the case where the outcome variable $Y$ is binary. In this case, $E[Y \mid X_j = x_j]$ is the probability of $y_i = 1$ given $X_j = x_j$. Therefore, we use logistic regression as our linear model, and Equation (2.7) becomes:
$$E[Y \mid X_j = x_j] = f(\alpha + \beta X_j) = \frac{1}{1 + e^{-(\alpha + \beta X_j)}} \qquad (2.9)$$
(2.9)
Logistic regression uses Maximum Likelihood Estimation to find the best fit to the data. The likelihood of the model $M$ given the data is $L(M \mid x) = P(X = x \mid M)$. For a binary outcome variable, it becomes:
$$L(M \mid x) = \prod_{i=1}^{N} \left[ y_i\, P_M(x_i) + (1 - y_i)\,(1 - P_M(x_i)) \right] \qquad (2.10)$$
and the log-likelihood is then given by:
$$\log L(M \mid x) = \sum_{i=1}^{N} y_i \log(P_M(x_i)) + (1 - y_i)\log(1 - P_M(x_i)) \qquad (2.11)$$
To assess the goodness of fit, we can use the deviance [57], which compares the log-likelihoods of two models. In the case of the logistic regression of Eq. 2.9, this corresponds to comparing the full model with a null model consisting only of an intercept. Letting $M_0$ be the reduced model and $M_1$ the full model, the deviance of these two models is:
$$D(M_1, M_0) = 2\left[ \log L(M_1 \mid x) - \log L(M_0 \mid x) \right] \qquad (2.12)$$
In the case where $M_0$ is a nested model of $M_1$ (nested meaning that the full model can be reduced to the null model by imposing constraints on the parameters), under the null hypothesis that $M_0$ and $M_1$ provide statistical explanations of the outcome of similar quality, and for sufficiently large sample size, the deviance follows a $\chi^2(p)$ distribution [57], where $p$ is the number of degrees of freedom, equal to the number of extra parameters of $M_1$ in comparison to $M_0$. If the null hypothesis is rejected, it means that $M_1$ provides a significantly better description of the outcome variable than $M_0$.
2.4.3.1 Significance of Aggregate Data Model
To assess the significance of a model of aggregate data, we use the deviance to compare the aggregate data model, given by Eq. (2.7), to a model where all $\hat{y}_i$ are equal to the global mean $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$. In this case, Eq. (2.12) becomes:
$$D(M_1, M_0) = 2\sum_{i=1}^{N} y_i \log\!\left(\frac{\hat{y}_i}{\bar{y}}\right) + (1 - y_i)\log\!\left(\frac{1 - \hat{y}_i}{1 - \bar{y}}\right), \qquad (2.13)$$
where $y_i$ is the $i$-th outcome and $\hat{y}_i = f(\alpha + \beta x_i)$. Clearly, these two models are nested; therefore, $D(M_1, M_0)$ has a $\chi^2(1)$ distribution [57]. We can apply a statistical hypothesis test to see whether the trend found in the aggregated data is significant.
21
2.4.3.2 Significance of Disaggregated Data Model
To assess the significance of a disaggregation of the data, we can use the deviance to compare the model of Eq. (2.8) with a model where $\hat{y}_i$ is equal to the average outcome for data points in the bin of $x_i$. In this case, Eq. (2.12) becomes:
$$2\sum_{b \in P_{X_c}} \sum_{i=1}^{N_b} y_{b,i}\log\!\left(\frac{\hat{y}_{b,i}}{\bar{y}_b}\right) + (1 - y_{b,i})\log\!\left(\frac{1 - \hat{y}_{b,i}}{1 - \bar{y}_b}\right), \qquad (2.14)$$
where $y_{b,i}$ is the $i$-th data point in bin $b$, $\bar{y}_b = \frac{1}{N_b}\sum_{i=1}^{N_b} y_{b,i}$ is the mean outcome within bin $b$, and $\hat{y}_{b,i} = f(\alpha(x_{c_{b,i}}) + \beta(x_{c_{b,i}})\, x_{j_{b,i}})$. By imposing $\beta(x_c) = 0,\ \forall x_c \in X_c$, we conclude that these two models are nested. Thus, we can again use a $\chi^2(|P_{X_c}|)$ statistical test to see whether the disaggregated trends are significant.
2.4.3.3 Comparing Disaggregations
Comparing disaggregations of data based on how well the linear models describe trends within subgroups can help us identify interesting behavioral patterns in data. A disaggregation on variables $(X_{j_1}, X_{c_1})$ is more interesting than $(X_{j_2}, X_{c_2})$ if it has more explanatory power than the second pair. McFadden [87] introduced a measure, called McFadden $R^2$ or pseudo-$R^2$, to capture the ratio of likelihood improvement:
$$R^2_{\mathrm{McFadden}} = 1 - \frac{\log L_{\mathrm{full}}}{\log L_{\mathrm{null}}} \qquad (2.15)$$
If we assume that the full model is at least as good as the null model (i.e., $\log L_{\mathrm{full}} \ge \log L_{\mathrm{null}}$), then the value of $R^2_{\mathrm{McFadden}}$ is between zero and one, with larger values indicating more improvement in log-likelihood; values of 0.2 to 0.4 are considered to represent excellent fits [88]. Thus, we can fix the null model and compute the value of $R^2_{\mathrm{McFadden}}$ for all disaggregations. For the null model, we choose the simple global average for all $Y$. Thus, the right hand side of Eq. (2.15) becomes:
$$1 - \frac{\sum_{b \in P_{X_c}} \sum_{i=1}^{N_b} y_{b,i}\log(\hat{y}_{b,i}) + (1 - y_{b,i})\log(1 - \hat{y}_{b,i})}{\sum_{i=1}^{N} y_i \log(\bar{y}) + (1 - y_i)\log(1 - \bar{y})} \qquad (2.16)$$
We use the pseudo-$R^2$ to rank disaggregations by their explanatory power. In addition, we can also use it to identify the conditioning variable $X_c$ for disaggregating the data that best explains the trends with respect to a covariate $X_j$.
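For a binary outcome with the global-mean null model, Eq. (2.16) is straightforward to compute directly. A sketch follows; the function name is ours, and the predicted probabilities are assumed to lie strictly between 0 and 1 so the logarithms are defined.

```python
import math
from statistics import mean

def mcfadden_r2(y, y_hat):
    """Pseudo-R^2 of Eq. (2.16): the null model predicts the global mean.

    y     -- binary outcomes (0 or 1)
    y_hat -- model-predicted probabilities, each strictly in (0, 1)
    """
    y_bar = mean(y)
    ll_full = sum(yi * math.log(p) + (1 - yi) * math.log(1 - p)
                  for yi, p in zip(y, y_hat))
    ll_null = sum(yi * math.log(y_bar) + (1 - yi) * math.log(1 - y_bar)
                  for yi in y)
    return 1.0 - ll_full / ll_null
```

Predictions no better than the global mean give a value near zero, and sharper correct predictions push the value toward one.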
2.4.4 Mathematical Analysis of the Paradox
We have presented a mathematical formulation of Simpson's paradox in terms of the derivatives of conditional expectations as given by Eqs. (2.1) and (2.2), and we now examine these equations to gain better insight into the origins and causes of this paradox.
The expectation in Eq. (2.1) can be related to that of Eq. (2.2) as
$$E[Y \mid X_p = x_p] = \int_{X_c} E[Y \mid X_p = x_p, X_c = x_c]\, \Pr(X_c = x_c \mid X_p = x_p)\, dx_c, \qquad (2.17)$$
and differentiating this expectation w.r.t. $x_p$ allows us to compare the trends of Eqs. (2.1) and (2.2). The derivative of the right hand side of Eq. (2.17) with respect to $x_p$ is
$$\int_{X_c} \frac{d}{dx_p} E[Y \mid X_p = x_p, X_c = x_c]\, \Pr(X_c = x_c \mid X_p = x_p)\, dx_c + \int_{X_c} E[Y \mid X_p = x_p, X_c = x_c]\, \frac{d}{dx_p} \Pr(X_c = x_c \mid X_p = x_p)\, dx_c. \qquad (2.18)$$
If $E[Y \mid X_p = x_p, X_c = x_c]$ is a non-increasing function of $x_p$, as in Eq. (2.2), then the first integral in Eq. (2.18) will be non-positive. Thus for $E[Y \mid X_p = x_p]$ to be an increasing function of $x_p$, i.e., for Eq. (2.18) to be positive, the second integral must be positive.
This inequality condition leads to two necessary conditions for the occurrence of Simpson's paradox. The first condition is that
$$\frac{d}{dx_p} \Pr(X_c = x_c \mid X_p = x_p) \neq 0, \qquad (2.19)$$
i.e., the distribution of the conditioning variable $X_c$ is not independent of $X_p$, and so the two variables are correlated. As $X_p$ changes, the distribution of the values of $X_c$ must also change. If the distribution of $X_c$ is independent of $X_p$, then $d\Pr(X_c = x_c \mid X_p = x_p)/dx_p = 0$, and so the second integral of Eq. (2.18) will be zero, resulting in no Simpson's paradox.
The second necessary condition for the occurrence of Simpson's paradox is that the expectation of $Y$, conditioned on $X_p$, must not be independent of $X_c$, i.e.,
$$E[Y \mid X_p = x_p, X_c = x_c] \neq E[Y \mid X_p = x_p]. \qquad (2.20)$$
For any given value of $X_p$, the expectation of $Y$ must vary as a function of $X_c$. If the condition of Eq. (2.20) is not met, then the second integral in Eq. (2.18) becomes
$$\int_{X_c} E[Y \mid X_p = x_p]\, \frac{d}{dx_p} \Pr(X_c = x_c \mid X_p = x_p)\, dx_c = E[Y \mid X_p = x_p]\, \frac{d}{dx_p} \int_{X_c} \Pr(X_c = x_c \mid X_p = x_p)\, dx_c = 0, \qquad (2.21)$$
and so Simpson's paradox will not occur.
Thus, this mathematical analysis has given us insight into the causes of Simpson's paradox in data: correlations between the independent variables, and the fact that the distribution of the conditioning variable $X_c$ changes at a faster rate with respect to the independent paradox variable $X_p$ than does the expectation of $Y$. This point will be covered in greater detail in the next section.
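The two conditions can be checked numerically on synthetic data. The generative model below is entirely made up for illustration: within every subgroup the trend of $Y$ in $x_p$ is negative, but because $X_c$ is correlated with $X_p$ (the first condition) and shifts $E[Y]$ (the second condition), the aggregate trend is positive.

```python
import random
from statistics import mean

random.seed(0)
rows = []
for _ in range(20_000):
    x_c = random.randint(0, 4)        # conditioning variable
    x_p = x_c + random.random()       # condition 1: X_p correlated with X_c
    y = 3.0 * x_c - x_p               # condition 2: E[Y] depends on X_c
    rows.append((x_p, x_c, y))

def slope(xy):
    """OLS slope of y on x."""
    xb = mean(x for x, _ in xy)
    yb = mean(y for _, y in xy)
    num = sum((x - xb) * (y - yb) for x, y in xy)
    return num / sum((x - xb) ** 2 for x, _ in xy)

aggregate = slope([(xp, y) for xp, _, y in rows])
subgroups = [slope([(xp, y) for xp, xc, y in rows if xc == g])
             for g in range(5)]
# aggregate slope is positive, yet every subgroup slope is negative
```

Dropping either condition (decorrelating $x_p$ from $x_c$, or removing the $x_c$ term from $y$) makes the aggregate and subgroup slopes agree, in line with Eqs. (2.19) and (2.20).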
2.5 Results: Fixed-size bins
We explore our approach using data from the question-answering platform Stack Exchange. This platform, launched in 2008 to provide a forum for people to ask computer programming questions, grew over the years into a forum for questions on a variety of technical and non-technical topics. The premise behind Stack Exchange is simple: any user can ask a question, which others may answer. Users can also vote for answers they find helpful, and the asker can accept one of the answers as the best answer to the question. In this way, the Stack Exchange community collectively curates knowledge. More details about the Stack Exchange data are available in Appendix A.
2.5.1 Simpson's Paradoxes on Stack Exchange
[Figure 2.2: two panels, (a) Aggregated Data and (b) Disaggregated Data, plotting Acceptance Probability against Answer Position (1 to 8); panel (b) shows separate curves for session lengths 1 through 8.]
Figure 2.2: Simpson's paradox in Stack Exchange data. Both plots show the probability an answer is accepted as the best answer to a question as a function of its position within a user's activity session. (a) Acceptance probability calculated over aggregated data has an upward trend, suggesting that answers written later in a session are more likely to be accepted as best answers. However, when the data is disaggregated by session length (b), the trend reverses. Among answers produced during sessions of the same length (different colors represent different-length sessions), later answers are less likely to be accepted as best answers.
$X_p$: Independent Variable   | $X_c$: Conditioning Variable
Tenure                        | Number of answers
Session length                | Reputation
Answer position               | Reputation
Answer position               | Session length
Number of answers             | Reputation
Time since previous answer    | Answer position
Percentile                    | Number of answers

Table 2.1: Examples of Simpson's paradox in Stack Exchange data. For these variables, the trend in the outcome variable (answer acceptance) as a function of $X_p$ in the aggregate data reverses when the data is disaggregated on $X_c$.
We apply the method described above to the Stack Exchange data. Here, our dependent variable $Y$ is binary, denoting whether or not a specific answer to a question was accepted as the best answer. In this case of binary outcomes we use the logistic regression linear model of the form
$$f(\alpha + \beta x) = \frac{1}{1 + e^{-(\alpha + \beta x)}}. \qquad (2.22)$$
The parameters $\alpha$ and $\beta$ are fitted using maximum likelihood, while the test of the null hypothesis $H_0: \beta = 0$ is performed using the Likelihood Ratio Test [29].
The eleven variables in the Stack Exchange data result in 110 possible Simpson's pairs. Among these, our method identifies seven as instances of the paradox. These are listed in Table 2.1.
Our approach reveals that our previously reported finding that acceptance probability decreases with answer position [46] is an instance of Simpson's paradox and would not have been observed had the data not been disaggregated by session length. More interestingly, our approach also identifies previously unknown instances of Simpson's paradox. We explore these in greater detail below, illustrating how they can lead to deeper insights into online behavior.
2.5.1.1 Answer Position & Session Length
We measure session length by the number of answers a user posts before taking an extended break. Session length has been shown to be an important confounding variable in online activity. Analysis of the quality of comments posted on the social news platform Reddit showed that, once disaggregated by the length of the session, the quality of comments declines over the course of a session, with each successive comment written by a user becoming shorter, less textually complex, receiving fewer responses and a lower score from others [121]. Similarly, each successive answer posted during a session by a user on Stack Exchange is shorter, less well documented with external links and code, and less likely to be accepted by the asker as the best answer [46].
Our approach automatically identifies this example as Simpson's paradox, as illustrated in Fig. 2.2. The figure shows the average acceptance probability for an answer as a function of its position (or index) within a session. According to Fig. 2.2a, which reports the aggregate acceptance probability, answers written later in a session are more likely to be accepted than earlier answers. However, once the same data is disaggregated by session length, the trend reverses (Fig. 2.2b): each successive answer within the same session is less likely to be accepted than the previous answer. For example, for sessions during which five answers were written, the first answer is more likely to be accepted than the second answer, which is more likely to be accepted than the third answer, and so on down to the fifth answer. The lines in Fig. 2.2 represent fits to the data using logistic regression.
This example highlights the necessity of properly disaggregating data to identify the subgroups for analysis. Unless the data is disaggregated, wrong conclusions may be drawn; in this case, for example, that user performance improves during a session.
[Figure 2.3: two panels, (a) Aggregated Data and (b) Disaggregated Data, plotting Acceptance Probability against Number of answers; panel (b) shows separate curves for reputation bins.]
Figure 2.3: Novel Simpson's paradox discovered in Stack Exchange data. Plots show the probability an answer is accepted as best answer as a function of the number of lifetime answers written by a user over his or her tenure. (a) Acceptance probability calculated over aggregated data has an upward trend, with answers written by more experienced users (who have already posted more answers) more likely to be accepted as best answers. However, when the data is disaggregated by reputation (b), the trend reverses. Among answers written by users with the same reputation (different colors represent reputation bins), those posted by users who had already written more answers are less likely to be accepted as best answers.
2.5.1.2 Number of Answers & Reputation
Experience plays an important role in the quality of answers written by users. Stack Exchange veterans, i.e., users who have been active on Stack Exchange for more than six months, post longer, better documented answers that are also more likely to be accepted as best answers by askers [46]. There are several ways to measure experience on Stack Exchange. Reputation, according to Stack Exchange, gauges how much the community trusts a user to post good questions and provide useful answers. While reputation can be gained or lost with different actions, a more straightforward measure of experience is user Tenure, which measures the time since the user became active on Stack Exchange, or Percentile, the normalized rank of a user's tenure. Alternately, experience can be measured by the Number of Answers a user posted during his or her tenure before writing the current answer.
Our method uncovers a novel Simpson's paradox for the user experience variables Reputation and Number of Answers. In the aggregate data, acceptance probability increases as a function of the Number of Answers (Fig. 2.3a). This is consistent with our expectation that more experienced users, who have written more answers over their tenure on Stack Exchange, produce higher quality answers. However, when the data is conditioned on Reputation, the trend reverses (Fig. 2.3b). In other words, focusing on groups of users with the same reputation, those who have written more answers over their tenure are less likely to have a new answer accepted than the less active answerers.
2.5.2 The Origins of Simpson's Paradox
[Figure 2.4: two heatmaps over Number of Answers (x-axis, log scale) and Reputation (y-axis, log scale): (a) disaggregated data, colored by acceptance probability; (b) joint distribution of $X_c$ and $X_p$, colored by frequency.]
Figure 2.4: Analysis of the Simpson's paradox Reputation – Number of Answers variable pair. (a) Average acceptance probability as a function of the two variables. (b) The distribution of the number of data points contributing to the value of the outcome variable for each pair of variable values.
To understand why Simpson's paradox occurs in Stack Exchange data, we illustrate the mathematical explanation of Section 2.4.4 with examples from our study. Consider the paradox for the Answer Position – Session Length Simpson's pair, illustrated in Fig. 2.2. In the disaggregated data, the trend lines of acceptance probability for sessions of different length are stacked (Fig. 2.2b): answers produced during longer sessions are more likely to be accepted than answers produced during shorter sessions. In addition, there are many more shorter sessions than longer ones. Table 2.2 reports the number of sessions of different length. By far, the most common session has length one: users write only one answer during these sessions. Each longer session is about half as common as a session that is one answer shorter.

Session Length | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8
Data points    | 7.2M | 2.6M | 1.3M | 0.7M | 0.4M | 0.3M | 0.2M | 0.1M

Table 2.2: Number of data points in each group.
What happens to the trend in the aggregated data? When calculating acceptance probability as a function of answer position, all sessions contribute to the acceptance probability for the first answer of a session. Sessions of length one dominate the average. When calculating acceptance probability for answers in the second position, sessions of length one do not contribute, and the acceptance probability is dominated by data from sessions of length two. Similarly, the acceptance probability of answers in the third position is dominated by sessions of length three. Survivor bias excludes data from shorter sessions, which also have lower acceptance probability, creating an upward trend in acceptance probability.
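This survivor-bias mechanism is easy to reproduce with the session counts of Table 2.2 and a made-up within-session model. The `accept` function below is hypothetical, chosen only so that acceptance falls with position inside every session while longer sessions have higher base rates.

```python
# session counts from Table 2.2 (millions of answers per session length)
counts = {1: 7.2e6, 2: 2.6e6, 3: 1.3e6, 4: 0.7e6,
          5: 0.4e6, 6: 0.3e6, 7: 0.2e6, 8: 0.1e6}

def accept(length, position):
    """Hypothetical acceptance probability: decreasing in position."""
    return 0.20 + 0.04 * length - 0.02 * position

# aggregate acceptance probability at each answer position: only
# sessions long enough to reach that position contribute
agg = []
for k in range(1, 9):
    num = sum(counts[L] * accept(L, k) for L in range(k, 9))
    den = sum(counts[L] for L in range(k, 9))
    agg.append(num / den)
# within every session acceptance declines, yet agg rises with position
```

Because each step in position discards the shortest (and lowest-acceptance) surviving sessions, the aggregate curve rises even though every within-session curve falls.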
We back up this intuitive explanation with the mathematical analysis of Section 2.4.4. Although acceptance probability is decreasing as a function of Answer Position for each value of Session Length (Fig. 2.2b), the probability mass of Session Length is constantly moving towards larger values as Answer Position increases. Notice that as Answer Position increments from $a$ to $a+1$, sessions of length $a$ are no longer included (as the minimum session length is now $a+1$). Thus, while Session Length has probability mass $\Pr(X_c = a \mid X_p = a)$ when $X_p = a$, it has probability $\Pr(X_c = a \mid X_p = a+1) = 0$ at $X_p = a+1$:
$$\left.\frac{d}{dx_p} \Pr(X_c = a \mid X_p = x_p)\right|_{x_p = a} = -\Pr(X_c = a \mid X_p = a). \qquad (2.23)$$
Meanwhile, for all other values of $X_c$ greater than $a$, the probability mass at $X_p = a+1$ is the same as that at $X_p = a$ (as the number of data points is constant along sessions of the same length), but normalized to account for the excluded sessions of length $a$, i.e.,
$$\Pr(X_c = x_c \mid X_p = a+1) = \frac{\Pr(X_c = x_c \mid X_p = a)}{1 - \Pr(X_c = a \mid X_p = a)}. \qquad (2.24)$$
The rate of change of these probability masses with respect to $X_p$ is
$$\left.\frac{d}{dx_p} \Pr(X_c = x_c \mid X_p = x_p)\right|_{x_p = a} = \left[\frac{1}{1 - \Pr(X_c = a \mid X_p = a)} - 1\right] \Pr(X_c = x_c \mid X_p = a). \qquad (2.25)$$
The probability mass function $\Pr(X_c = x_c \mid X_p = a)$ decreases for $X_c = a$, corresponding to the smallest value of acceptance probability, while increasing for all other values $X_c > a$. Moreover, the rate of increase of this probability mass is greater than the rate at which the acceptance probability decreases, resulting in an upward trend when the data is aggregated.
A similar effect plays out in the Number of Answers – Reputation Simpson's pair. Figure 2.4a shows the heatmap of acceptance probability for different values of the Number of Answers written over a user's tenure and user Reputation, while Fig. 2.4b shows the correlated joint distribution of the two variables. The figures illustrate the first condition of Simpson's paradox (Eq. (2.19)): as $X_p$ changes, the distribution of the values of $X_c$ must also change. This dependency can be clearly seen in Fig. 2.4b: as $X_p$ = Number of Answers increases, the distribution of $X_c$ = Reputation shifts toward larger values, which produces the paradox.
In the real world this means that users who have written more answers are not more likely to have a new answer they write accepted. In fact, among users with the same Reputation, those who earned this reputation with fewer answers are more likely to have a new answer they write accepted as the best answer. This suggests that such users are simply better at answering questions, and that this can be detected early in their tenure on Stack Exchange (while they still have low reputation). Note, however, that an exception to the trend reversal occurs for users with very high reputation. In Stack Exchange, users can gain reputation when an "Answer is marked accepted", an "Answer is voted up", a "Question is voted up", etc. It seems that high reputation users and low reputation users are different: for high reputation users, experience (number of written answers) is important, while for low reputation users the quality of answers, which may lead to votes, is more important. Analysis of this behavior is beyond the scope of this paper.
2.5.3 Discussion and Implications
The presence of a Simpson's paradox in data can indicate interesting or surprising patterns [44], and for trends in social data, important behavioral differences within a population. Since social data is often generated by a mixture of subgroups, the existence of Simpson's paradox suggests that these subgroups differ systematically and significantly in their behavior. By isolating important subgroups in social data, our method can yield insights into their behaviors.
For example, our method identifies Session Length as a conditioning variable for disaggregating data when studying trends in acceptance probability as a function of an answer's position within a session. In fact, prior work has identified session length as an important parameter in studies of online performance [70, 2, 121, 46]. Unless activity data is disaggregated into individual sessions (sequences of activity without an extended break), important patterns are obscured. A pervasive pattern in online platforms is user performance deterioration, whereby the quality of a user's contribution decreases over the course of a single session. This deterioration was observed for the quality of answers written on Stack Exchange [46], comments posted on Reddit [121], and the time spent reading posts on Facebook [70]. Our method automatically identifies the position of an action within a session and session length as an important pair of variables describing Stack Exchange.
We examine in detail one novel paradox discovered by our method for the Reputation – Number of Answers variables. The trends in Fig. 2.3b suggest that both variables jointly affect acceptance probability. Inspired by this observation, we construct a new variable, Reputation / Number of Answers, i.e., Reputation Rate. Figure 2.5 shows how acceptance probability changes with respect to Reputation Rate for different groups of users. There is a strong upward trend, suggesting that answers provided by users with higher
[Figure 2.5: Acceptance Probability vs. Reputation per number of answers (log x-axis), with separate curves for reputation subgroups from Rep < 10 up to Rep ≥ 10^5, plus the average.]
Figure 2.5: Relationship between acceptance probability and Reputation Rate, a new measure of user performance defined as reputation per number of answers users wrote over their entire tenure. Each line represents a subgroup with a different reputation score. The much smaller variance compared to Fig. 2.3b suggests that the new feature is a good proxy of answerer performance.
[Figure 2.6: two heatmaps over Time Since Previous Answer (x-axis, log scale) and Answer Position (y-axis, log scale): (a) disaggregated data, colored by acceptance probability; (b) joint distribution of $X_c$ and $X_p$, colored by frequency.]
Figure 2.6: A pair which multivariate logistic regression cannot find in the data. (a) Average acceptance probability as a function of Answer Position and Time Since Previous Answer. (b) The distribution of the number of data points contributing to the value of the outcome variable for each pair of variable values.
Reputation Rate are more likely to be accepted. Moreover, while the lines span reputations of an extremely broad range, from one to 100,000, they collapse onto a single curve. This suggests that Reputation Rate is a good proxy of user performance. The remaining paradoxes uncovered by our method could yield similarly interesting insights into user behavior on Stack Exchange.
2.5.4 DierencefromLinearRegression
We also illustrate the dierence between our method and linear models that model the outcome variable
as a function of bothX
p
andX
c
. For such multivariate linear models [102], we can t a modelf
p;c
( +
X
p
+
c
X
c
) to the disaggregated data, and compare the sign of the coecient to the sign of the linear
coecient of the “aggregated” modelf
p
(+X
p
). In our method, we bin the values ofX
c
and t separate
linear models of the form of Eq. (2.8) in each bin ofX
c
, aggregating by averaging the linear coecient signs
of each model. We claim that our approach has benets over multivariate linear models which allow it
to nd Simpson’s pairs where multivariate linear models can not. First, in multivariate linear models, all
subgroups have the same coecient , and intercepts +
c
X
c
, which vary linearly with X
c
. In our
method, however, each group can have dierent intercept and coecient, which makes nding paradox
pairs in heterogeneous data more exible. Indeed this exibility is necessary — from our results (Figs. 2.2b
and 2.3b) it is clear that the trend parameters(x
c
) of the tted lines vary signicantly depending onx
c
.
Secondly, our method of aggregating by simple averaging of the linear coefficient signs of the sub-
groups means that trends within each subgroup are weighted equally, regardless of how many data points
are in that subgroup. This is contrary to multivariate linear models, which fit the model parameters based
on each data point (and so weigh heavily toward values of X_c with many data points). To illustrate, we
show that our algorithm finds Time Since Previous Answer - Answer Position as a Simpson's pair, which a
multivariate logistic regression does not. The variable Answer Position is the index of the answer a user has
completed without an extended (>100 minute) break, so Answer Position = 1 if Time Since Previous
Answer ≥ 100 minutes and Answer Position > 1 if Time Since Previous Answer < 100 minutes. Fig. 2.6(a)
shows that, for Answer Position = 1, the acceptance probability decreases as a function of Time Since
Previous Answer, possibly because better users take shorter breaks. On the other hand, for other Answer
Positions the trend is reversed, and acceptance probability increases with Time Since Previous Answer, sug-
gesting that, in the short term, users who take more time to answer questions or take short breaks between
questions write answers of higher quality.
Clearly, Time Since Previous Answer - Answer Position is an important Simpson's pair, illustrating that
time has a beneficial effect on answer quality at short time scales, even though it is detrimental at the
aggregate level. Multivariate logistic regression does not capture this behavior, as 65% of the probability
mass of Time Since Previous Answer lies at values larger than 100 minutes, so when fitting
f_{p,c}(α + β_p X_p + β_c X_c) to the data, it fits a hyperplane that describes the majority of the data as well as possible,
in this case the decreasing trend corresponding to Answer Position = 1.
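As a rough illustration of this difference (with made-up numbers, not the Stack Exchange data), the following sketch contrasts a single aggregate fit with the per-bin sign-averaging described above; the subgroup trends are all positive while the aggregate trend is negative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data with three subgroups (bins of the conditioning variable X_c).
# Within each bin the outcome rises with X_p, but bins with larger X_p have
# lower baselines, so the aggregate trend reverses (Simpson's paradox).
bins = []
for x_offset, y_offset in [(0, 10), (3, 5), (6, 0)]:
    x_p = x_offset + rng.uniform(0, 2, 200)
    y = y_offset + 1.0 * (x_p - x_offset) + rng.normal(0, 0.3, 200)
    bins.append((x_p, y))

# Aggregate fit: one line over all data points.
all_x = np.concatenate([b[0] for b in bins])
all_y = np.concatenate([b[1] for b in bins])
agg_slope = np.polyfit(all_x, all_y, 1)[0]

# Disaggregated fit: one line per bin; aggregate by averaging slope signs,
# weighting each subgroup equally regardless of its size.
bin_slopes = [np.polyfit(x, y, 1)[0] for x, y in bins]
mean_sign = np.mean(np.sign(bin_slopes))

print(agg_slope < 0)                   # aggregate trend is negative
print(all(s > 0 for s in bin_slopes))  # every subgroup trend is positive
print(mean_sign)                       # 1.0: subgroup slope signs are unanimous
```

Because each bin contributes one vote regardless of its size, a small subgroup with a reversed trend is not drowned out, which is exactly what the multivariate fit fails to do.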
2.6 Results: R²-based bins

We illustrate the proposed method by applying it to study human performance data from several online domains.
2.6.1 Stack Exchange

Of the 110 potential disaggregations of SE data arising from all possible pairs of covariates, our method
identified 8 as significant. Table 2.3 ranks these disaggregations along with their pseudo-R² scores. Note
that user experience, either in terms of the reputation or the number of answers written by the user over his
or her tenure, comes up as an important conditioning variable in several disaggregations. Features related
to user activity, such as answer position within a session, session length, and time since previous answer,
appear as important dimensions of performance. This suggests that answerer behavior over the course of
a session changes significantly, and these changes are different across different sub-populations.

Figure 2.7 visualizes the data, disaggregated on the number of answers. Each horizontal band in the
heatmap in Fig. 2.7(a) is a different bin of the conditioning variable number of answers, and it corresponds to
[Figure 2.7 (image): (a) Disaggregated data, (b) Number of samples, (c) Subgroup trends, (d) Aggregate trend. Axes: answer position vs. number of answers; acceptance probability and sample count.]
Figure 2.7: Disaggregation of Stack Exchange data. (a) The heat map shows the probability the answer
is accepted as a function of its answer position within a session, with the horizontal bands corresponding
to the different subgroups, conditioned on the total number of answers the user has written. (b) Number of
data samples within each bin of the heat map. Note that the outcome becomes noisy when there are few
samples. The trends in performance as a function of answer position in (c) disaggregated data and (d)
aggregate data. Error bars in (c) and (d) show the 95% confidence interval.
Table 2.3: Variables defining important disaggregations of Stack Exchange data, along with their pseudo-R² scores.

R²_McF    Covariate X_j          Conditioning on X_c
0.03      Answer position        Number of answers
0.03      Session length         Number of answers
0.02      Number of answers      Reputation
0.02      Answer position        Reputation
0.02      Session length         Reputation
0.01      Readability            Lines of code
< 10^-2   Answer position        Session length
< 10^-2   Time since prev ans    Answer position
a distinct subgroup within the data. The first bin ranges in value from one to eleven answers, the second bin
from 12 to over 50 answers, etc. Within each bin, the color shows the relationship between the outcome—
the probability the answer is accepted—and the answer's position within a session. Dark blue corresponds to
the lowest acceptance probability, and dark red to the highest. Within each bin, the color changes from
lighter blue to darker blue (for the bottom-most bins), indicating a lower acceptance probability for answers
written later in the session. For the top-most bins, the acceptance probability is overall higher, but also
decreases, e.g., from pink to white to blue. Note that the data is noisy, as manifested by color flipping, where
there are few data points (Fig. 2.7(b)).
The trends corresponding to these empirical observations are captured in Fig. 2.7(c). Note that the
decreasing trends are in contrast to the trend in aggregate data (Fig. 2.7(d)), which shows performance
increasing with answer position within the session. This suggests that user experience, as captured by the
number of answers, is an important factor differentiating the behavior of users.

Figure 2.8 shows an alternate disaggregation of SE data for the covariate answer position, here con-
ditioned on user reputation. This disaggregation is slightly worse, resulting in a somewhat lower
pseudo-R² value. While performance declines in the lower reputation subgroups as a function of answer po-
sition, the highest reputation users appear to write better answers in longer sessions. The acceptance
probability for high reputation users is more than 0.50, potentially indicating that askers pay attention to
very high reputation users and are more likely to accept their answers.
2.6.2 Khan Academy

Our method identified 32 significant disaggregations of KA data, out of 342 potential disaggregations. Some
of these are presented in Table 2.4. The table lists conditioning variables for selected covariates, sorted by
their pseudo-R² scores. For example, when examining how performance—probability to solve a problem
correctly—changes over the course of a day (X_j is hour24), the relevant disaggregation conditions the data
on all first attempts, i.e., the number of all problems the user solved correctly on their first attempt. On the
[Figure 2.8 (image): (a) Disaggregated data, (b) Number of samples, (c) Subgroup trends, (d) Aggregate trend. Axes: answer position vs. reputation; acceptance probability and sample count.]
Figure 2.8: Disaggregation of Stack Exchange data similar to Fig. 2.7, but instead disaggregated on user
reputation. (a) The heat map shows acceptance probability as a function of the answer position within a
session. (b) Number of data samples within each bin of the heat map. Note that the outcome becomes
noisy when there are few samples. The trends in (c) disaggregated data and (d) aggregate data. Error bars
in (c) and (d) show the 95% confidence interval.
other hand, several disaggregations can explain the trends in performance as a function of month. Con-
ditioning on first five attempts has the most explanatory power, followed by disaggregations conditioned
on session index, the total time it took the user to solve all problems, the timestamp, and the weekday of the
attempt. Many of the conditioning variables used in the disaggregations represent different aspects of user
experience on the site: the number of problems they tried to solve or correctly solved, their tenure on the
site, and how much time they spent solving problems.

Figure 2.9 shows the disaggregation corresponding to the covariate month, conditioned on five first at-
tempts. When data is aggregated over the entire population, there appears to be a slight seasonal variation,
with performance higher on average during the summer months (Fig. 2.9(d)). Once data is disaggregated
Table 2.4: Variables defining important disaggregations of the Khan Academy data, along with their pseudo-R² scores.

R²_McF    Covariate X_j       Conditioning on X_c
0.06      All attempts        All first attempts
0.03      All attempts        All problems
0.01      All attempts        Tenure
0.01      All attempts        Total solve time
0.04      Hour24              All first attempts
0.04      Session number      All first attempts
0.02      Session number      All problems
0.01      Session number      Tenure
0.01      Session number      All attempts
0.01      Session number      All sessions
0.01      Session number      Total solve time
0.0       Session number      Join month
0.03      Month               Five first attempts
0.01      Month               Session index
0.01      Month               Total solve time
0.01      Month               Timestamp
< 10^-2   Month               Week day
0.01      Problem position    Session length
by five first attempts, the seasonal trends are no longer so obvious in several subgroups (Fig. 2.9(c)). Inter-
estingly, it appears to be the high achieving users (who correctly answer more of the five first problems)
who perform better during the summer months. This suggests that the population of KA changes over the
course of the year, with motivated, high achieving students using the platform during their summer break.
2.6.3 Duolingo

Of the 462 potential disaggregations of DL data, 51 were found to be significant using the χ² test. Table 2.5
reports disaggregations associated with select covariates, including lesson's position within a session, lesson
index in the user's history, the number of lessons the user completed, etc. The trends with respect to some of
the covariates could be explained by several different disaggregations, with some of them having relatively
high values of pseudo-R². Again, user experience (all perfect lessons) and initial skill (five first lessons)
appear as significant conditioning variables.
[Figure 2.9 (image): (a) Disaggregated data, (b) Number of samples, (c) Subgroup trends, (d) Aggregate trend. Axes: month vs. five first attempts; performance and sample count.]
Figure 2.9: Disaggregation of Khan Academy data showing performance as a function of month, condi-
tioned on five first attempts. (a) The heat map shows average performance as a function of the month.
(b) Number of data samples within each subgroup. The trends in (c) the disaggregated data and in (d)
aggregated data. Error bars in (c) and (d) show the 95% confidence interval.
Figure 2.10 examines the impact of experience on performance. In the aggregate data, Fig. 2.10(d),
performance appears to increase as a function of experience (lesson index): users who have more practice
perform better. However, once the data is disaggregated by initial performance (five first lessons), or skill, in
Fig. 2.10(c), a subtler picture emerges. Users who initially performed the worst (bottom bins in Fig. 2.10(a))
improve their performance as they have more lessons, while the initially best performers (top bins) decline.
This may be due to "regression to the mean", as pure luck could have helped the initially best performers
and hurt the initially worst performers.
Table 2.5: Variables defining important disaggregations of Duolingo data, along with their pseudo-R² scores.

R²_McF    Covariate X_j                 Conditioning on X_c
0.08      Lesson position               All perfect lessons
0.11      Lesson index                  All perfect lessons
0.09      Lesson index                  First five lessons
0.16      Number of lessons             All perfect lessons
0.09      Number of lessons             First five lessons
0.11      Number of sessions            All perfect lessons
0.09      Number of sessions            First five lessons
0.05      Number of sessions            Session seen
0.1       Session number                All perfect lessons
0.09      Session number                First five lessons
0.05      Session number                Session seen
0.05      Session number                Session correct
0.05      Session number                Distinct words
0.02      Session number                Time since previous lesson
0.09      Session length                All perfect lessons
0.06      Session correct               Distinct words
0.09      Session duration              First five lessons
0.09      Session duration              All perfect lessons
0.0       Session duration              Session length
0.08      Time since previous lesson    All perfect lessons
2.7 Discussion

We described a method that identifies interesting behaviors within heterogeneous behavioral data by lever-
aging Simpson's paradox. The method automatically disaggregates data by partitioning it on some condi-
tioning variable, and looks for those conditioning variables that result in trend reversal in many subgroups.
The method ranks these disaggregations based on how well linear models describe the disaggregated data
compared to how well they describe the population as a whole. These disaggregated subgroups are interest-
ing because their behavior is significantly different from that of the remainder of the population, which
implies that important behavioral differences exist within the population.

We illustrated the use of the method as a data exploration tool by applying it to study human perfor-
mance on three online platforms, including the question-answering site Stack Exchange, and the online learning
sites Khan Academy and Duolingo. Our method identified skill (judged from the user's initial performance on
[Figure 2.10 (image): (a) Disaggregated data, (b) Number of samples, (c) Subgroup trends, (d) Aggregate trend. Axes: lesson index vs. five first lessons; performance and sample count.]
Figure 2.10: Disaggregation of Duolingo data. (a) The heat map shows performance, i.e., the probability to
answer all the words correctly, as a function of how many lessons the user completed, conditioned on how
many of the five first lessons were answered correctly. (b) Number of data samples within each bin of the
heat map. Trends in (c) the disaggregated data and in (d) aggregate data. Error bars show the 95% confidence
interval.
the site) and experience (how long the user has been active on the site) as important features differentiating
user performance.

Conditioning on a variable to make subgroups more homogeneous is the important first step
in our method; however, subgroups may still be heterogeneous. As a future direction, we can condition the
data on multiple features to make the subgroups even more homogeneous, and then
look for trend reversals of an independent variable in these subgroups. In addition, our method applies to
explicitly declared variables, and not to latent variables that may affect the data. As a future direction, latent
variables could be found and considered as conditioning variables for data disaggregation.
We have used our method for binary outcome variables; however, there are also continuous outcome
variables in behavioral data. Our method can be extended to more general forms, like GLMs, to support all
types of outcome variables. However, the new trend analysis algorithm needs different statistical methods as
a goodness-of-fit measure. In addition, preliminary explorations suggest that pairs of variables with a high
R²_McFadden value could be used in combination to make a new variable which is highly correlated with the
outcome variable. For example, in the KA data, the pair (all first attempts, all attempts) has the highest value of
pseudo-R² among all pairs. We can define a new variable as the ratio of the number of questions correctly solved on
the first attempt to the number of all attempts, to characterize user performance during his or her tenure. This new variable is highly
correlated with the outcome variable, performance. The same happens for the pairs (session seen, session
correct) in Duolingo, and (number of answers, reputation) in Stack Exchange.
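As a sketch of this feature construction (with a hypothetical table and made-up column names, not the actual KA schema), the high pseudo-R² pair can be combined into a single rate feature:

```python
import pandas as pd

# Hypothetical Khan Academy-style records; column names are illustrative only.
df = pd.DataFrame({
    "all_first_attempts": [40, 10, 75, 5],  # problems solved correctly on first try
    "all_attempts": [50, 40, 80, 30],       # all problems attempted
})

# Combine the pair into one experience/skill feature: the fraction of
# attempted problems that were solved correctly on the first attempt.
df["first_attempt_rate"] = df["all_first_attempts"] / df["all_attempts"]
print(df)
```

The resulting rate lies in [0, 1] and summarizes both members of the pair in a single covariate.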
Chapter 3

Identifying Latent Confounders
3.1 Introduction

Social data is often highly heterogeneous, coming from a population composed of diverse classes of indi-
viduals, each with their own characteristics and behaviors. As a result of heterogeneity, a model learned
on population data may not make accurate predictions on held-out test data, nor offer analytic insights
into the underlying behaviors that motivate interventions. To illustrate, consider Figure 3.1, which shows
data collected for a hypothetical nutrition study measuring how the outcome, body mass index (BMI),
changes as a function of daily pasta calorie intake. Multivariate linear regression (MLR) analysis finds a
negative relationship in the population (red line) between these variables. The negative trend suggests
that—paradoxically—increased pasta consumption is associated with lower BMI. However, unbeknownst
to researchers, the hypothetical population is heterogeneous, composed of classes that varied in their fit-
ness level. These classes (clusters in Fig. 3.1) represent, respectively, people who do not exercise, people
with normal activity level, and athletes. When data is disaggregated by fitness level, the trends within each
subgroup are positive (dashed green lines), leading to the conclusion that increased pasta consumption is
in fact associated with a higher BMI. Recommendations for pasta consumption arising from the naive anal-
ysis are opposite to those arising from a more careful analysis that accounts for the confounding effect of
different classes of people. The trend reversal is an example of Simpson's paradox, which has been widely
[Figure 3.1 (image): scatter of BMI vs. daily pasta intake with three fitted lines labeled Linear Regression, Cluster Regression, and Clusterwise Linear Regression.]
Figure 3.1: Heterogeneous data with three latent classes. The figure illustrates Simpson's paradox, where
a negative relationship between the outcome and the independent variable exists for the population as a whole
(red line) but reverses when the data is disaggregated by classes (dashed green lines).
observed in many domains, including biology [32], psychology [63], astronomy [94], and computational
social science [19].
Social scientists analyze such data with mixed effects models [132], which use random intercepts to
model differences between classes and random slopes to model differences in regression coefficients within
classes. Mixed effects models are used to describe non-independent observations of data from the same
class, and they can even handle the trend reversal associated with Simpson's paradox. Mixed effects models
assume that classes are specified by a categorical variable, or alternatively, that data can be disaggregated
by binning an existing variable [3]. In practice, however, these classes may not be known a priori, or may
be related to multiple variables. Instead, they must be discovered from data, along with the trends they
represent.
Many methods already exist for finding latent classes within data; however, they have multiple short-
comings. Unsupervised clustering methods disaggregate data regardless of the outcome variable being
explained. In reality, distinct outcomes may be best described by different clusterings of the same data.
Recent methods have tackled Simpson's paradox by performing supervised disaggregation of data [4, 44].
However, these methods disaggregate data into subgroups using existing features, and thus are not able
to capture effects of latent classes (i.e., unobserved confounders). Additionally, they perform "hard" clus-
tering, assigning each data point to a unique group. Instead, a "soft" clustering is more realistic, as it
captures the degree of uncertainty about which group the data belongs to.
To address these challenges, we describe Disaggregation via Gaussian Regression (DoGR)*, a method
that jointly partitions data into overlapping clusters and estimates linear trends within them. This allows
for learning accurate and generalizable models while keeping the interpretability advantage of linear mod-
els. The proposed method assumes that the data can be described as a superposition of clusters, or components,
with every data point having some probability of belonging to any of the components. Each component repre-
sents a latent class, and the soft membership represents uncertainty about which class a data point belongs
to. We use an expectation maximization (EM) algorithm to estimate component parameters and compo-
nent membership parameters for the individual data points. DoGR jointly updates component parameters
and its regression coefficients, weighing the contribution of each data point to the component regression
coefficients by its membership parameter. The joint learning of clusters and their regressions allows for
discovering the latent classes that explain the structure of data.

Our framework learns robustly predictive and interpretable models of data. In validations on synthetic
and real-world data, DoGR is able to discover hidden subgroups in data even in analytically challenging
scenarios, where subgroups overlap along some of the dimensions, and identify interesting trends that may
be obscured by Simpson's paradox. We show that our method achieves performance on prediction tasks
comparable to or better than state-of-the-art algorithms, but in a fraction of the time, making it suitable
for big data analysis. DoGR aids data analysis.

* http://github.com/anonymized.
3.2 Related Work

Fixed-effect linear regression is one of the most widely used methods in social science and other scientific
fields due to its interpretability [47]. To account for individual differences, mixed-effect or multi-level mod-
els were developed [90], but they model variation between individuals or groups via hard disaggregation
of existing features. In contrast, the proposed method models differences between groups or individuals
as latent confounders (i.e., latent clusters) that are learned from data. Our work improves the utility of
linear models in the analysis of social data by controlling for these clusters.

A number of previous works have attempted to tackle the issue of latent confounders. Clusterwise
linear regression (CLR) [124] starts with initial clusters and updates them by reassigning one data point
to another partition that minimizes a loss function. The method is slow, since it moves one data point at
each iteration. Two other methods, WCLR [118] and FWCLR [119], improve on CLR by using k-means
as their clustering method. These methods were shown to outperform CLR and other methods, such as
K-plane [84]. In Conditional Linear Regression [26], the goal is to find a linear rule capable of achieving
more accurate predictions for just a segment of the population, by ignoring outliers. One of the larger
differences between their method and ours is that Conditional Linear Regression focuses on a small subset
of data, while we model data as a whole, like CLR, WCLR, FWCLR, and K-plane. The shared parameter
across all clusters in WCLR and FWCLR makes these methods perform poorly if clusters have different
variance, as we will show in our results section.

Other methods have combined Gaussian Mixture Models and regression to create algorithms similar to
our own [125, 49]. We call these methods Gaussian Mixture Regression (GMR). In contrast to these methods,
however, we can capture relationships between independent and outcome variables through regression
coefficients, which previous methods were unable to do. We also use Weighted Least Squares to fit our
model, which makes it less sensitive to outliers [34].
In addition to latent confounders, sample selection bias is often an issue in data. Traditional sampling
bias correction is based on correcting for the non-random existence of data. Heckman [51] was one of the first
to find a quantitative method to correct for this issue in linear models. Methods have since been extended
to non-linear models, such as in causal models [13]. This is similar to our method, which corrects for data
components that receive a disproportionate amount of data. Unlike these methods, however, our method
can address the case when different components exhibit distinct trends.

Causal inference has been used to address the problem of confounders, including latent confounders
[82], and to infer the causal relationship between features and outcomes [13, 110, 127]. One difficulty with
causal inference, however, is that the focus is traditionally on one intervention [110]. Taking into account
synergistic effects of multiple features is not well understood, but has been attempted recently [110, 127].
With adequate datasets, these can help infer causal relationships between multiple causes, but certain
causal assumptions are needed, which might not correspond to reality. In contrast, regression offers us the
opportunity to understand relationships between each feature and an outcome, regardless of the dataset,
even if we cannot make causal claims.
3.3 DoGR Method

Let D = {d_1, d_2, d_3, …, d_N | d_i = (x_i, y_i) ∈ R^p × R} be a set of N observations, or records, each con-
taining a real-valued outcome y_i of an observation i and a vector of p independent variables, or features,
[x_{i,1}, x_{i,2}, …, x_{i,p}]. Regression analysis is often used to capture the relationship between each independent
variable and the outcome. Specifically, MLR estimates regression coefficients β_0, β_1, …, β_p by minimizing
the residuals of y = β_0 + β_1 x_1 + β_2 x_2 + … + β_p x_p over all observations. However, parameters learned by
the MLR model may not generalize to out-of-sample populations, as they can be confounded by Simpson's
paradox and sampling bias. We can call this the "robustness" problem of regression.
As discussed in the introduction, Figure 3.1(a) illustrates this problem with synthetic data. The MLR
model trained on the aggregate data gives BMI = 241 − 0.61x. This suggests a negative relation-
ship (solid red line) between the independent variable x = Daily pasta intake and the outcome BMI.
However, there are three components, each with a positive trend. Indeed, applied separately to each
component, MLR learns the proper positive relationship (dashed green line) between Daily pasta intake
and BMI: BMI = β_0 + β_1 x, where β_0 is the cluster's intercept, taking values {108, 13, 102}, and β_1 is its coefficient,
{1.03, 1.08, 0.97}. Associations learned by the MLR model trained on the population-level data are not
robust or generalizable. This could help explain the reproducibility problem seen in many fields [104, 122],
where trends seen in some experiments cannot be reproduced in other experiments with different popula-
tions. This can be clearly observed in Figure 3.1(b), which represents a different sampling of the population.
The underlying clusters are the same, but the middle cluster is over-sampled during data collection. As
a result, MLR trained at the population level finds only a weak association between the independent variable
and the outcome (solid red line). In contrast, regressions learned separately for each cluster (green dashed
lines) remain the same even in the new data: thus, the cluster regressions represent robust and generalizable
relationships in data.
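This robustness argument can be reproduced qualitatively with synthetic data (all numbers below are made up, not the actual parameters behind Figure 3.1): the aggregate slope is negative and shifts when one cluster is over-sampled, while the per-cluster slopes stay positive and stable:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_cluster(x_center, y_intercept, n):
    # Within-cluster trend is positive (slope ~ +1); values are illustrative.
    x = rng.normal(x_center, 30, n)
    y = y_intercept + 1.0 * (x - x_center) + rng.normal(0, 5, n)
    return x, y

clusters = [make_cluster(400, 45, 300),   # sedentary: high BMI
            make_cluster(550, 28, 300),   # normal activity
            make_cluster(700, 12, 300)]   # athletes: low BMI

def slope(x, y):
    return np.polyfit(x, y, 1)[0]

# Aggregate slope is negative; within-cluster slopes are positive.
agg = slope(np.concatenate([c[0] for c in clusters]),
            np.concatenate([c[1] for c in clusters]))
per_cluster = [slope(x, y) for x, y in clusters]

# Oversampling the middle cluster (a different "data collection") shifts the
# aggregate slope, but the per-cluster slopes stay essentially unchanged.
biased = [clusters[0], make_cluster(550, 28, 3000), clusters[2]]
agg_biased = slope(np.concatenate([c[0] for c in biased]),
                   np.concatenate([c[1] for c in biased]))

print(round(agg, 3), [round(s, 2) for s in per_cluster], round(agg_biased, 3))
```

The aggregate fit answers "what does the sampled population look like", which changes with the sampling; the per-cluster fits answer "how does the outcome respond within a class", which does not.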
3.3.1 Model Specification

The goal of this work is to learn robust and reproducible trends through a regression model that accounts
for the latent structure of data, for example, the presence of three clusters in the data shown in Fig. 3.1. Our
model jointly disaggregates the data into K overlapping subgroups, or components, and performs weighted
linear regression within each component. We allow components to overlap in order to represent the un-
certainty about which component or subgroup an observation belongs to.
In what follows, we used capital letters to denote random variables and lowercase letters their values.
We model the independent variable X of each component k as a multivariate normal distribution with
mean μ_k ∈ R^p and covariance matrix Σ_k ∈ R^{p×p}:

    f_X^{(k)} ~ N(μ_k, Σ_k)    (3.1)

In addition, each component is characterized by a set of regression coefficients β_k ∈ R^{p+1}. The regression
values of the component k are

    Ŷ^{(k)} = β_{k,0} + β_{k,1} X_1 + β_{k,2} X_2 + … + β_{k,p} X_p,    (3.2)

with y − ŷ^{(k)} giving the residuals for component k. Under the assumption of normality of residuals [39],
Y has a normal distribution with mean Ŷ^{(k)} and standard deviation σ_k:

    f_{Y|X}^{(k)} ~ N(Ŷ^{(k)}, σ_k),    (3.3)

where Ŷ^{(k)} is defined by Eq. 3.2. Under the assumption of homoscedasticity, in which the error is the same
for all X, the joint density is the product of the conditional (Eq. 3.3) and marginal (Eq. 3.1) densities:

    f_{X,Y}^{(k)}(x, y) = f_{Y|X}(y|x) f_X(x) = φ(y; ŷ^{(k)}, σ_k) φ(x; μ_k, Σ_k)    (3.4)

The product f_{Y|X}(y|x) f_X(x) can be converted to a normal distribution: f_{X,Y}^{(k)}(x, y) comes from
N(μ'_k, Σ'_k), where μ'_k = [μ_k^{(1)}, μ_k^{(2)}, …, μ_k^{(p)}, ŷ^{(k)}] and

    Σ'_k = [ Σ_k   0  ]
           [ 0    σ_k ]

is a block matrix.
3.3.2 Model Learning

The final goal of the model is to predict the outcome, given independent variables. This prediction com-
bines the predicted values of the outcome from all components by taking the average of the predicted values,
weighted by the size of the component. We define ω_k as the weight of the component k, where Σ_k ω_k = 1.
We can define the joint distribution over all components as f_{X,Y}(x, y) = Σ_{k=1}^{K} ω_k f_{X,Y}^{(k)}(x, y).

Then the log-likelihood of the model over all data points is:

    L = Σ_{i=1}^{N} log Σ_{k=1}^{K} ω_k f_{X,Y}^{(k)}(x_i, y_i)    (3.5)

The formula here is the same as for the Gaussian Mixture Model, except that the target y_i is a function of x_i. To
find the best values for the parameters Θ = {ω_k, μ_k, Σ_k, β_k, σ_k | 1 ≤ k ≤ K} we can leverage the Expectation
Maximization (EM) algorithm. The algorithm iteratively refines the parameters based on the expectation (E)
and maximization (M) steps.
3.3.2.1 E-step

Let's define γ_{i,k} (the membership parameter) as the probability that data point i belongs to component k. Given
the parameters Θ_t of the last iteration, the membership parameter is:

    γ_{i,k} = ω_k f_{X,Y}^{(k)}(x_i, y_i) / Σ_{k'} ω_{k'} f_{X,Y}^{(k')}(x_i, y_i)    (3.6)

Thus, the E-step disaggregates the data into clusters, but it does so in a "soft" way, with each data point
having some probability of belonging to each cluster.
3.3.2.2 M-step

Given the updated membership parameters, the M-step updates the parameters of the model for the next
iteration as Θ_{t+1}:

    ω_k = (Σ_i γ_{i,k}) / N
    μ_k = (Σ_i γ_{i,k} x_i) / (Σ_i γ_{i,k})
    Σ_k = (Σ_i γ_{i,k} (x_i − μ_k)(x_i − μ_k)^T) / (Σ_i γ_{i,k})

In addition, this step updates the regression parameters based on the parameters estimated in Θ_t. Our method
uses Weighted Least Squares for updating the regression coefficients β_k for each component, with γ as
weights. In other words, we find β_k that minimizes the Weighted Sum of Squares (WSS) of the residuals:

    WSS(β_k) = Σ_i γ_{i,k} (y_i − (β_{k,0} + β_{k,1} x_{i,1} + … + β_{k,p} x_{i,p}))².

Using the value of β_k, the updated σ_k is given by:

    σ_k² = (Σ_i γ_{i,k} (y_i − ŷ_i^{(k)})²) / (Σ_i γ_{i,k})

Intuitively, μ_k shows us where in R^p the center of each subgroup k resides. The further a data point is
from the center, the lower its probability of belonging to the subgroup. The covariance matrix Σ_k captures
the spread of the subgroup in the space R^p relative to the center μ_k. The regression coefficient β_k gives low
weight to outliers (i.e., it is a weighted regression) and captures the relationship between X and Y near the
center of the subgroup. Parameter σ_k tells us about the variance of the residuals around the fitted regression line,
and ω_k tells us about the importance of each subgroup.
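A minimal sketch of one EM pass following the update rules above (this is an illustrative re-implementation on toy data, not the released DoGR code):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def em_step(X, y, omega, mu, Sigma, beta, sigma):
    """One EM pass for a DoGR-style mixture of regressions (sketch)."""
    N, _ = X.shape
    K = len(omega)
    X1 = np.column_stack([np.ones(N), X])  # design matrix with intercept column

    # E-step: soft memberships gamma[i, k], Eq. (3.6).
    dens = np.zeros((N, K))
    for k in range(K):
        y_hat = X1 @ beta[k]
        dens[:, k] = (omega[k]
                      * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                      * norm.pdf(y, loc=y_hat, scale=sigma[k]))
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: update weights, means, covariances, then weighted least squares.
    for k in range(K):
        g = gamma[:, k]
        omega[k] = g.sum() / N
        mu[k] = (g[:, None] * X).sum(axis=0) / g.sum()
        d = X - mu[k]
        Sigma[k] = np.einsum('i,ij,il->jl', g, d, d) / g.sum()
        W = np.diag(g)                                  # WLS with gamma weights
        beta[k] = np.linalg.solve(X1.T @ W @ X1, X1.T @ W @ y)
        resid = y - X1 @ beta[k]
        sigma[k] = np.sqrt((g * resid ** 2).sum() / g.sum())
    return omega, mu, Sigma, beta, sigma, gamma

# Toy data: two latent components with opposite trends.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (100, 1)), rng.normal(5, 1, (100, 1))])
y = np.concatenate([2 * X[:100, 0], -2 * X[100:, 0] + 20]) + rng.normal(0, 0.1, 200)

omega = np.array([0.5, 0.5])
mu = np.array([[0.0], [5.0]])
Sigma = np.array([np.eye(1), np.eye(1)])
beta = np.zeros((2, 2))
sigma = np.array([5.0, 5.0])
for _ in range(10):
    omega, mu, Sigma, beta, sigma, gamma = em_step(X, y, omega, mu, Sigma, beta, sigma)

print(np.round(beta, 2))  # per-component [intercept, slope], close to +2 and -2
```

Note that because the components start with identical β and σ, the first E-step separates points purely on the X densities; once the per-component WLS fits are in place, the y densities sharpen the memberships further.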
3.3.3 Prediction

For test data x, the predicted outcome is the weighted average of the predicted outcomes for all com-
ponents. The weights capture the uncertainty about which component the test data belongs to. Using
Equation 3.3, the best prediction of the outcome for component k is Ŷ^{(k)}, which is the mean of the condi-
tional outcome value of the component k:

    ŷ = Σ_{k=1}^{K} ω_k (β_{k,0} + β_{k,1} x_1 + … + β_{k,p} x_p)    (3.7)

The solid green line in Figure 3.1 represents the predicted outcome ŷ as a function of x: it is the weighted
average over the dashed green lines.
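Eq. 3.7 amounts to a weighted average of the components' linear predictions; a tiny sketch with made-up parameter values:

```python
import numpy as np

# Component weights and [intercept, slope] pairs are illustrative values only.
omega = np.array([0.2, 0.5, 0.3])   # component weights, sum to 1
beta = np.array([[1.0, 2.0],
                 [0.0, 1.0],
                 [3.0, -1.0]])

def predict(x):
    # x: feature vector of length p (here p = 1); prepend the intercept term.
    x1 = np.concatenate([[1.0], x])
    return float(omega @ (beta @ x1))

print(predict(np.array([2.0])))  # 0.2*(1+4) + 0.5*(0+2) + 0.3*(3-2) = 2.3
```

In the full model the global weights ω_k can also be replaced by the membership probabilities of the test point, but the population-level prediction above is the form in Eq. 3.7.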
3.4 Results
Due to the unique interpretability of our method, we rst use it to describe meaningful relationships
between variables in the real-world data (“Qualitative Results”). We then show how it compares favorably
to competing methods (“Quantitative Results”).
3.4.1 QualitativeResults
We use a radar chart to visualize the components discovered in the data. Each colored polygon represents the mean of the component, $\mu$, in the feature space. Each vertex of the polygon represents a feature, or dimension, with the length giving the coordinate of $\mu$ along that dimension. For the purpose of visualization, each coordinate value of $\mu$ was divided by the largest coordinate value of $\mu$ across all components, so the maximum possible value of each coordinate after normalization is one. The mean value of the outcome variable within each component is shown in the legend, along with 95% confidence intervals.
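The normalization step can be sketched as follows; the center values are hypothetical:

```python
import numpy as np

# component centers mu: K = 2 components x p = 3 features (hypothetical)
mu = np.array([[2.0, 10.0, 1.0],
               [4.0,  5.0, 0.5]])

# divide each coordinate by the largest value of that coordinate across
# all components, so the maximum after normalization is exactly one
mu_norm = mu / mu.max(axis=0)
```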
Our method also estimates the regression coefficients, which we can compare to MLR. We computed the p-value under the null hypothesis that coefficients in DoGR are equal to those of MLR. If $\beta_0$ and $\beta_1$ are the two coefficients to compare, and $\sigma_0$ and $\sigma_1$ are their standard errors, then the z-score is $z = (\beta_0 - \beta_1)/\sqrt{\sigma_0^2 + \sigma_1^2}$, from which we can easily infer the p-value assuming a normal distribution of errors [105].
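This z-test needs only the two coefficients and their standard errors. A small sketch using the standard-normal tail via `math.erfc`; the numeric inputs are hypothetical:

```python
import math

def coef_z_test(b0, se0, b1, se1):
    """Two-sided p-value for the difference of two regression
    coefficients, assuming independent normal errors."""
    z = (b0 - b1) / math.sqrt(se0 ** 2 + se1 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))   # = 2 * (1 - Phi(|z|))
    return z, p

z, p = coef_z_test(0.50, 0.10, 0.20, 0.10)   # z ~ 2.12, p ~ 0.034
```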
We demonstrate the power of DoGR to give novel insights into data. While other methods exist for disaggregating heterogeneous data, most partition the data into disjoint groups, using hard assignment of data points to clusters, and then fit regression lines to each group separately. The main difference between DoGR and these other methods is that our method is more flexible in providing information for analytical purposes. Figure 3.2 shows a schematic representation of the coefficients in heterogeneous data. Each latent confounder has a unique coefficient, and we represent them using one of two colors: dark and light. Each data point can be a member of several clusters at the same time (with different membership values), and as a result, each data point can have a unique coefficient. Hard clustering methods are not able to offer this difference in coefficients. In Figure 3.2b, each data point has a fixed coefficient (dark or light), while in Figure 3.2a, each individual data point has a different coefficient based on its relative position from the centers of the clusters.
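The per-point coefficients implied by soft clustering are simply membership-weighted averages of the component coefficients. A small illustrative sketch (all values hypothetical):

```python
import numpy as np

# per-component slope coefficients (the "dark" and "light" lines)
beta = np.array([2.0, -1.0])

# soft memberships gamma for three data points
gamma = np.array([[1.0, 0.0],    # firmly in component 1
                  [0.5, 0.5],    # halfway between the two centers
                  [0.1, 0.9]])   # mostly in component 2

# each point gets its own effective coefficient, a gamma-weighted
# average of the component coefficients; hard clustering would force
# each point to one of the two values 2.0 or -1.0
beta_point = gamma @ beta
```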
3.4.1.1 Metropolitan
Figure 3.3 visualizes the components of the Metropolitan data and reports regression coefficients for two variables. Interestingly, the data is disaggregated along ethnic lines, perhaps reflecting segregation within the metropolitan area. The Orange component consists of census tracts with many highly educated white residents. It also has the highest valence (5.86), meaning that people post the happiest tweets from those tracts. The Purple component is ethnically diverse, with large numbers of white, Asian and Hispanic residents. Its valence (5.80) is only slightly lower than that of the Orange component. The Red component has largely Asian and some Hispanic neighborhoods that are less well-off, with slightly lower valence (5.76). The Blue
(a) Soft Clustering (b) Hard Clustering
Figure 3.2: The difference between hard and soft clustering for data analysis. Assuming we have two clusters (yellow and red): in hard clustering, the uncertainty of cluster assignment is not tractable in the data analysis phase (e.g., studying the coefficients of the independent variables), since every data point takes one of the two main cluster colors (b). With soft clustering, however, we can obtain an approximate coefficient for each individual data point (the whole range from yellow to red in (a)) separately.
and Green components represent tracts with the least educated and poorest residents. They are also the places with the lowest valence (5.74 and 5.72, respectively). Looking at the regression coefficients, education is positively related to happiness across the entire population (All), and individually in all components, with the coefficients significantly different from MLR in four of the five components, suggesting this trend is a Simpson's paradox. However, the effect is weakest in the most and least educated components. Poverty has no significant effect across the entire population, but has a negative effect on happiness in the poorest neighborhoods (Blue). Counter-intuitively, regression coefficients are positive for two components. It appears that within these demographic classes, the poorer the neighborhood, the happier the tweets that originate from them.
3.4.1.2 Wine Quality
Figure 3.4 visualizes the disaggregation of the Wine Quality data into four components. Although we did not use the type of wine (red or white) as a feature, wines were naturally disaggregated by type. The Blue component is almost entirely (98%) composed of high-quality (6.02) white wines. All data in the Orange component (5.91) are white wines, while the Green component is composed mostly (85%) of red wines with average quality 5.60. The lowest-quality (Red) component, with quality equal to 5.36, contains a mixture of red (43%) and white (57%) wines. In other words, we discover that for higher-quality wines, wine color can be determined with high accuracy simply from the chemical components, which, to the best of our knowledge, was not known before. Low-quality wines appear less distinguishable based on their chemicals.
We find that Chlorides have a negative impact on the quality of wine in all components (not shown); Sugar has a positive impact on high-quality white wines and red wines, and a negative impact on low-quality wines. Surprisingly, Free Sulfur Dioxide has a positive impact on high-quality white wines, but a negative impact in the other components. These are findings that may be important to wine growers, and they capture subtleties in the data that the commonly used MLR does not.
3.4.1.3 NYC Sale Price
The mean sale price of all properties in the NYC property sales data is $1.31 million. The mean price, however, hides a large heterogeneity in the properties on the market. Our method tamed some of this heterogeneity by identifying four components within the sales data. Figure 3.5 shows that these components represent large commercial properties (Purple), large residential properties (Green), mixed commercial/residential sales (Red), and single-unit residential properties (Blue).
Table 3.1 shows what percentage of each component is made up of New York City's five boroughs. Large commercial properties (Purple component), such as office buildings, are located in Brooklyn and Manhattan, for example. These are the most expensive among all properties, with an average price of more than
Component  Manhattan  Bronx  Brooklyn  Queens  Staten Island
Purple     17%        20%    41%       17%     5%
Green      41%        22%    26%       10%     1%
Red        8%         9%     65%       15%     3%
Blue       0%         13%    37%       33%     16%
All        3%         13%    40%       30%     14%
Table 3.1: DoGR components, and the boroughs that make up each component (rows might not add up to 100% due to rounding).
$12 million. The next most expensive type of property is large residential buildings (Green component), that is, multi-unit apartment buildings. These are also most likely to be located in Manhattan and Brooklyn. Small residential properties (Blue component), most likely single-family homes, are the least expensive, on average half a million dollars, and most likely to be located in Brooklyn and Queens, with some in Staten Island.
Regressions in this data set show several instances of Simpson's paradoxes. Property price in population-level data increases as a function of the number of residential units. In disaggregated data, however, this is true only for the Green component, representing apartment buildings. Having more residential units in smaller residential properties (Red and Blue components) lowers their price. This could be explained by smaller multi-unit buildings, such as duplexes and row houses, being located in poorer residential areas, which lowers their price compared to single-family homes. Another notable trend reversal occurs when regressing on lot size (land square feet). As expected, there is a positive relationship in the Blue component with respect to lot size, as single-family homes built on bigger lots are expected to fetch higher prices in the New York City area. However, the trends in the other components are negative. This could be explained in the following way. As land becomes more expensive, builders are incentivized to build up, creating multi-story apartment and office buildings. The more expensive the land, the taller the building they build. This is confirmed by the positive relationship with gross square feet, which is strongest for the Purple and Green components. In plain words, these components represent the tall buildings with small footprints that one often sees in New York City.
3.4.1.4 Stack Overflow
As our last illustration, we apply DoGR to Stack Overflow data to answer the question of how well we can predict the length of the answer a user writes, given the features of the user and the answer.
DoGR splits the data into four clusters, as shown in Figure 3.6. The Green and Red components contain most of the data, with 47% and 39% of the records, respectively. The radar plot shows the relative strength of the features in each component, while the table above the plot shows features characterizing each discovered group. Except for Percentile tenure, these features were not used by DoGR and are shown to validate the discovered groups.
The Orange component (5% of data) contains very active (longer Session Length) users, who meticulously document their answers with many lines of code and URLs, so we can label them “power users”. These are among the longest (high Words) and most complex (low Readability) answers in the data, and they also tend to be of high quality (high Acceptance Probability). Surprisingly, this group has newer users (lower Percentile Tenure), but they have high reputation (Answerer Reputation) and wrote more answers previously (higher Number of Answers). These users, while a minority, give life to Stack Overflow and make it useful for others. Orange component users have more code lines within shorter answers. This is in contrast to other groups, which tend to include more lines of code within longer answers. The brevity of Orange users when documenting code is an example of a trend reversal.
Another interesting subgroup is the Blue group (8% of data), which is composed of “veterans” (high Percentile Tenure), who write easy-to-read answers (high Readability) that are documented with many URLs. These users have a relatively high reputation, but they are selective in the questions they answer (lower Number of Answers than the Orange users). Interestingly, tenure (Percentile) does not have an effect on the length of the answer for these Blue users, while it has a positive effect in the other groups (i.e., more veteran users write longer answers). The negative effect of URLs on the length of the answer suggests that these users use URLs to refer to existing answers.
The Red (39%) and Green (47%) components contain the vast majority of the data. They are similar in terms of user reputation, tenure (Percentile) and experience (Number of Answers), as well as the quality of the answers they write (Acceptance Probability), with the Red component users scoring slightly higher on all the measures. The main difference is in the answers they write: Red users do not include URLs in their answers, while Green users do not include code. Another difference between these groups is that Green users write longer answers than Red users, but as their answers become longer, they also become more difficult to read (lower Readability). In contrast, longer answers by Red users are easier to read.
Overall, we find intuitive and surprising results from the data, largely due to DoGR's interpretability.
3.4.2 Quantitative Results
We compare the performance of DoGR to existing state-of-the-art methods for disaggregated regression: the three variants of CLR, WCLR [118], FWCLR [119], and GMR [125]. We use MLR, which does not disaggregate data, as a baseline.

3.4.2.1 Prediction Performance

For the prediction task, we use 5×5-fold nested cross-validation to train the model on four folds and make predictions on the out-of-sample data in the fifth fold. As hyperparameters, we used k = 1–6 as the potential number of components for all methods. For WCLR and FWCLR, we searched {0.001, 0.01, 0.1, 10, 100, 1000} for their weighting parameter; for FWCLR we set m = {1.1, 1.5, 2.0, 3.0}; and for CART we set the depth of the tree to {1, 2, ..., 9}. We use grid search to find the best hyperparameters. Table 3.2 presents results on our datasets and synthetic data. To evaluate prediction quality, we use Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). When comparing the quality of cross-validated predictions across methods, we use the Kruskal-Wallis test, a non-parametric method for testing whether multiple distributions are different [71]. When distributions
were significantly different, pairwise comparisons using the Tukey-Kramer test were also done as a post-hoc test, to determine whether the mean cross-validated quality metrics were statistically different [86]. Prediction results are shown in Table 3.2. We use * to indicate statistically significant (p < 0.05) differences between our method and other methods. Bolded values differ significantly from the others or have a lower standard deviation. Following [119], we use lower deviation to denote more consistent predictions.
CART, WCLR and FWCLR do not perform well on the Synthetic data, because the variances of the two components are different, which these methods do not handle. This is a problem that exists in an arbitrary number of dimensions; we observe that in higher dimensions the gap between the performance of FWCLR and our method increases.
For the Metropolitan data, there is no statistically significant difference between methods: all cluster-based regression methods outperform MLR and CART on the prediction task. This shows that any method that accounts for the latent structure of data helps improve predictions. For the Wine Quality data, while the null hypothesis of equal performance of GMR and DoGR cannot be rejected, our method has a smaller standard deviation in both datasets. We were not able to successfully run GMR and WCLR on the NYC data, due to exceptions (returning null as the predicted outcome) and extremely long run time, respectively. It took three days for FWCLR, and four hours for our method, to finish running on the NYC data. Our method has significantly lower MAE compared to FWCLR, while CART has a better MAE and worse RMSE. This shows that CART performs worse on outliers. Figure 3.2 illustrates the reason behind the poor performance of hard clustering methods like CART on outliers.
We were also not able to successfully run GMR and WCLR on the Stack Overflow data, for the same reasons. It took six days for FWCLR to run one round of cross-validation, after which we stopped it. Therefore, the mean reported in the table for FWCLR is the average of five runs, while for MLR and DoGR it is the average of 25 runs. The best performing method is CART. The main reason is that Stack Overflow
has discrete variables, and CART handles these types of variables better than FWCLR and DoGR do.
3.4.2.2 Run Time
To compare the run time of all algorithms, we performed one round of cross-validation (not nested) for each method. The same machine (4 GB RAM, 3.0 GHz Intel CPU, Windows OS) was used for the time measurements. The available code for the WCLR and FWCLR methods is in R, while the other methods are written in Python. Table 3.3 presents the run time in minutes. The slowest method is WCLR, while the fastest one is MLR. WCLR and FWCLR are sensitive to the size of the data, perhaps due to the many hyperparameters they need to tune. To find the best hyperparameters for GMR and DoGR, we ran the methods 6 times, while WCLR requires 36 runs and FWCLR 144 runs. Besides the NYC and Stack Overflow datasets, for which exceptions occur in GMR, the run time of DoGR is twice that of GMR. The reason GMR throws exceptions could be a singular covariance matrix.
While our method is similar in spirit to GMR, our method is more stable, as shown on the NYC and Stack Overflow data. In addition, our method is interpretable, as it directly computes regression coefficients, while GMR represents relationships between variables via the covariance matrix. Covariance values are not guaranteed to have the same sign, let alone magnitude, as regression coefficients. Regression is therefore necessary to understand the true relationships between variables. Mathematically, GMR and unweighted regression can be converted to one another using linear algebra. It is not clear, however, whether the equivalence also holds for weighted regression.
3.5 Discussion
In this paper, we introduce DoGR, which softly disaggregates data by latent confounders. Our method retains the advantages of linear models, namely their interpretability, by reporting regression coefficients
Method  RMSE (± std)        MAE (± std)
Synthetic
MLR     294.88 (± 1.236)*   288.35 (± 0.903)*
CART    264.70 (± 7.635)*   224.57 (± 6.138)*
WCLR    261.14 (± 3.370)*   232.76 (± 2.682)*
FWCLR   261.27 (± 4.729)*   233.05 (± 3.772)*
GMR     257.36 (± 4.334)    219.15 (± 3.567)
DoGR    257.32 (± 3.871)    219.11 (± 3.106)
Metropolitan
MLR     0.083 (± 0.0061)    0.062 (± 0.0033)
CART    0.086 (± 0.0056)    0.064 (± 0.0036)*
WCLR    0.083 (± 0.0029)    0.062 (± 0.0024)
FWCLR   0.082 (± 0.0044)    0.061 (± 0.0021)
GMR     0.083 (± 0.0043)    0.061 (± 0.0023)
DoGR    0.083 (± 0.0052)    0.061 (± 0.0031)
Wine Quality
MLR     0.83 (± 0.018)*     0.64 (± 0.015)*
CART    0.79 (± 0.015)      0.62 (± 0.013)
WCLR    0.83 (± 0.013)*     0.64 (± 0.011)*
FWCLR   0.80 (± 0.013)*     0.63 (± 0.009)*
GMR     0.79 (± 0.017)      0.62 (± 0.014)
DoGR    0.79 (± 0.014)      0.62 (± 0.011)
NYC
MLR     13.36 (± 7.850)     2.20 (± 0.064)*
CART    15.33 (± 9.128)     1.34 (± 0.190)
FWCLR   13.14 (± 7.643)     1.76 (± 0.321)*
DoGR    11.88 (± 9.109)     1.40 (± 0.222)
Stack Overflow
MLR     60.69 (± 1.118)     37.74 (± 0.152)
CART    58.19 (± 0.781)*    34.05 (± 0.208)*
FWCLR   60.47 (± 0.960)     37.25 (± 0.794)
DoGR    60.68 (± 1.298)     37.62 (± 0.314)
Table 3.2: Results of prediction on five datasets. An asterisk indicates results that are significantly different from our method (p-value < 0.05). Bolded results have the smallest standard deviation among the best methods with the same mean error.
Dataset WCLR FWCLR GMR DoGR
Synthetic 6.46 0.68 0.76 1.4
Metropolitan 37 9 2 4
Wine Quality 180+ 36 9 16
NYC 600+ 232 failed 10
Stack Overow - 7200+ failed 170
Table 3.3: Run time (in minutes) of one round of cross-validation (non-nested). MLR took less than a minute for all datasets. 'failed' indicates that the method could not run on the data due to exceptions. We timed out WCLR after enough time had passed to establish the order of performance. Bold numbers indicate the fastest algorithm.
that give meaning to trends. Our method also discovers the multidimensional latent structure of data by partitioning it into subgroups with similar characteristics and behaviors. While alternative methods exist for disaggregating data, our approach is unique in that it produces interpretable regressions that are computationally efficient.
We demonstrated the utility of our approach by applying it to real-world data, from data on wines to data on answers from a question-answering website. We showed that our method identifies meaningful subgroups and trends, while also yielding new insights into the data. For example, in the wine data set, it correctly separated high-quality red wines from white, and also discovered two distinct classes of high-quality white wines. In the Stack Overflow data, it helped us identify important users like “veterans” and “power users.”
There are a few ways forward to improve our method. Currently, it applies to continuous variables, but it needs to be extended to categorical variables, often seen in social science data. Moreover, real data is often non-linear; therefore, our method needs to be extended to non-linear models beyond linear regression. In addition, to improve prediction accuracy, our method could include regularization, such as ridge regression or LASSO. However, already in its current form, DoGR can yield new insights into data.
[Figure 3.3 radar chart; axes: White, Hispanic, Asian, Black, Percent Below Poverty, Percent Graduate. Legend (mean valence, standard error): Orange 5.86 (0.010), Purple 5.80 (0.008), Red 5.76 (0.009), Green 5.74 (0.007), Blue 5.72 (0.017). The accompanying table reports mean valence and regression coefficients for % Below Poverty and % Graduate for All and for each component; * p-value < 0.05, *** p-value < 0.001.]
Figure 3.3: Disaggregation of the Metropolitan data into five subgroups. The radar plot shows the relative importance of a feature within the subgroup. The table reports regression coefficients for two independent variables, Percent Below Poverty and Percent Graduate, for Multivariate Regression (MLR) of the aggregate data and separately for each subgroup found by our method.
[Figure 3.4 radar chart; axes: Free Sulfur Dioxide, Residual Sugar, Volatile Acidity, Citric Acid, Chlorides. Legend (mean quality, standard error): 6.02 (0.045), 5.91 (0.031), 5.60 (0.040), 5.36 (0.074). The accompanying table reports mean quality and regression coefficients for Citric Acid, SO2, and Sugar for All, Blue, Orange, Green, and Red; * p-value < 0.05, *** p-value < 0.001.]
Figure 3.4: The value of the center of each component (μ) for the four components of the Wine Quality data. The Blue and Orange components are almost entirely white wines, and the Green component is composed mostly (85%) of red wines. The lowest-quality and smallest component (Red) is a mixture of red (43%) and white (57%) wines.
[Figure 3.5 radar chart; axes: Residential Units, Commercial Units, Gross Sq. Feet, Land Sq. Feet. Legend (mean SALE PRICE, standard error): 12.43 (4.206), 11.72 (1.799), 1.31 (0.050), 0.53 (0.005). The accompanying table reports mean sale price and regression coefficients for Residential Units, Commercial Units, Gross Sq. Feet, and Land Sq. Feet for All, Purple, Green, Red, and Blue; all significant at p-value < 0.001.]
Figure 3.5: The value of the center (μ) for the four components of the NYC data. The numbers in the legend are the average price of the component's real estate, in millions of dollars.
Component  Size  Acceptance Prob.  Answerer Reputation  Num. of Answers  Percentile Tenure
Red        39%   0.26              338.17               16.02            0.43
Green      47%   0.24              314.35               14.56            0.42
Blue       8%    0.33              492.19               20.65            0.47
Orange     5%    0.32              784.62               40.21            0.40
[Figure 3.6 radar chart; axes: Lines of Code, Readability Score, Length of Session, URLs, Tenure Percentile. Legend (mean words, standard error): 110.36 (2.160), 88.42 (1.013), 55.79 (0.218), 39.21 (0.182). The bottom table reports regression coefficients for Words, Codes, Readability, Percentile, and URLs for All, Red, Green, Blue, and Orange; italicized effects are not significant.]
Figure 3.6: Disaggregation of the Stack Overflow data into subgroups. The outcome variable is the length of the answer (number of words). The radar plot shows the importance of each feature used in the disaggregation. The top table shows average values of the validation features, while the bottom table shows regression coefficients for the groups.
Part II
Networked Data
Chapter 4
Perception Bias in Directed Networks
4.1 Introduction
We observe our peers to learn social norms, assess risk, or copy behaviors. However, these observations can be systematically biased [93, 9, 109, 65, 16, 76, 59], distorting how we see the world. One of the better-known sources of bias is the friendship paradox in social networks [45], which states that people are less popular than their friends are, on average. Consequences of the friendship paradox can skew how we compare ourselves to friends: people tend to be less happy than their friends are [21], and researchers tend to have less impact than their co-authors do [15], on average. In fact, any trait correlated with popularity is likely to be misperceived [79, 41]. This may explain why adolescents systematically overestimate how much their peers drink or engage in risky behaviors [9, 16], and why social media use is often associated with negative social comparisons [1].
In contrast to friendships, many online social networks are directed. On Twitter, for example, we subscribe to, or follow, others to see their posts, but the information does not flow in the opposite direction unless those people also follow us back. For convenience, we refer to people whose posts we see in our social feeds as our friends, and those who see our posts as followers. Clearly, this nomenclature does not imply a bidirectional friendship relationship. An individual's in-degree is the number of his or her friends, and the out-degree is the number of followers. The asymmetric nature of links in directed networks
leads to four variants of the friendship paradox [54]: your friends (or followers) have more friends (or followers) than you do, on average. Empirically, this effect can be quite large, with upwards of 90% of social media users observing that they have a lower in-degree and out-degree than both their friends and followers [69]. However, the conditions under which these four variants of the paradox exist have not been comprehensively analyzed. We carry out this analysis to show that while two variants of the friendship paradox occur in any directed network [53], the remaining two exist only if an individual's in-degree and out-degree are correlated.
The friendship paradox can systematically skew individuals' observations of the network's state. We consider directed networks where nodes have a trait, such as gender, political affiliation, or whether they used a certain hashtag in their posts. The trait's global prevalence is simply the fraction of all nodes with that trait. On the other hand, its observed prevalence is the fraction of friends that have the trait. In networks where the more influential (higher out-degree) nodes are likely to have the trait, its observed prevalence will be substantially higher than its actual prevalence. Our analysis shows that, similar to the generalized friendship paradox in undirected networks [60, 41], correlation between nodes' traits and their out-degrees amplifies this perception bias.
In reality, an individual's perception of a trait is shaped by its local prevalence among his or her friends. In this paper, we identify a new paradox in directed networks, as a result of which a trait will appear to be significantly more popular locally, among an individual's friends, than it is globally, among all people. We show that this effect is stronger in networks where higher out-degree nodes with the trait are connected to nodes with lower in-degree.
Surprisingly, although individual observations are biased, we can still robustly estimate the global prevalence of the trait. We present a polling algorithm that obtains a statistically efficient estimate of a trait's global prevalence, with a smaller error than alternative polling methods. The proposed method leverages
the friendship paradox to reduce the error of the polling estimate by trading off the bias of the estimate and its variance. We analytically characterize this trade-off and provide an upper bound for the variance.
We also show that perception bias can be large in a real-world network. To this end, we extracted a subgraph of the directed Twitter social network and collected messages posted by users within this subgraph. Treating the occurrence of particular hashtags within messages as traits, or topics, enables us to measure the perception bias. We identify hashtags that appear much more frequently within users' social feeds than they do among all messages posted by everyone, leading users to overestimate their prevalence. We also validate the performance of the proposed polling algorithm through synthetic polling experiments on the Twitter subgraph.
This paper elucidates some of the non-intuitive ways in which directed social networks can bias individual perceptions. Since collective phenomena in networks, such as social contagion and the adoption of social norms, are driven by individual perceptions, the structure of networks and the paradoxes endemic to them can impact social dynamics in unexpected ways. Our work shows how we can begin to quantify and mitigate these biases.
4.2 Results
4.2.1 Basic Concepts and Definitions
Consider a directed network G = (V, E), with nodes V and links E. A link (i, j) pointing from i to j indicates that i is a friend of j or, equivalently, that j follows i. Here, the direction of the link indicates the flow of information. The out-degree of a node v, d_o(v), measures the number of followers it has, and its in-degree, d_i(v), the number of friends.
We define three random variables, X, Y and Z, that correspond to different node sampling methods. A node v with out-degree d_o(v) has that many followers; equivalently, v is a friend of d_o(v) nodes. Therefore, a node Y that is obtained from V by sampling proportional to the out-degree of nodes is called a random friend. Similarly, a node v that has d_i(v) links pointing to it is a follower of d_i(v) other nodes. Therefore, a node Z that is obtained from V by sampling proportional to the in-degree of nodes is called a random follower. Below, we formalize these terms.
A random node X is a uniformly sampled node from V:

$$P(X = v) = \frac{1}{N}, \quad \forall v \in V. \qquad (4.1)$$

A random friend Y is a node sampled from V proportional to its out-degree:

$$P(Y = v) = \frac{d_o(v)}{\sum_{v' \in V} d_o(v')}, \quad \forall v \in V. \qquad (4.2)$$

A random follower Z is a node sampled from V proportional to its in-degree:

$$P(Z = v) = \frac{d_i(v)}{\sum_{v' \in V} d_i(v')}, \quad \forall v \in V. \qquad (4.3)$$
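These three sampling schemes are straightforward to simulate. A sketch on a toy five-link network (the network is our own illustrative choice):

```python
import numpy as np

# toy directed network as a list of links (i, j): i is a friend of j
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]
N = 4
d_out = np.zeros(N)
d_in = np.zeros(N)
for i, j in edges:
    d_out[i] += 1   # i gains a follower
    d_in[j] += 1    # j gains a friend

rng = np.random.default_rng(0)
X = rng.choice(N, size=10_000, p=np.ones(N) / N)        # random node, eq 4.1
Y = rng.choice(N, size=10_000, p=d_out / d_out.sum())   # random friend, eq 4.2
Z = rng.choice(N, size=10_000, p=d_in / d_in.sum())     # random follower, eq 4.3
```

Node 3 has no followers, so it can never be drawn as a random friend; node 0 has no friends, so it can never be drawn as a random follower.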
For any directed network, the average in-degree $E\{d_i(X)\} = \frac{1}{N}\sum_{v \in V} d_i(v)$ and the average out-degree $E\{d_o(X)\} = \frac{1}{N}\sum_{v \in V} d_o(v)$ are the same. Here $E$ denotes the expectation operator. Therefore, we use $\bar{d}$ to denote both the average in-degree and the average out-degree of a random node X: $\bar{d} = E\{d_o(X)\} = E\{d_i(X)\}$.
4.2.2 Friendship Paradox in Directed Networks
Four different variants of the friendship paradox exist in directed networks [54]. The first two state that (1) random friends have more followers than random nodes do, and (2) random followers have more friends than random nodes do (on average). The magnitudes of these are set by the variances of the in- and out-degree distributions of the underlying network. Mathematically, these two friendship paradoxes can be stated as:
• A random friend Y has more followers than a random node X, on average:

$$E\{d_o(Y)\} - \bar{d} = \frac{\operatorname{Var}\{d_o(X)\}}{\bar{d}} \geq 0. \qquad (4.4)$$

• A random follower Z has more friends than a random node X, on average:

$$E\{d_i(Z)\} - \bar{d} = \frac{\operatorname{Var}\{d_i(X)\}}{\bar{d}} \geq 0. \qquad (4.5)$$

For the derivation, please see the Supplementary Notes of our paper [5].
The remaining two variants of the friendship paradox state that (3) random friends have more friends than random nodes do, and (4) random followers have more followers than random nodes do (on average). In contrast to the first two variants stated above, the remaining two variants require a positive correlation between the in-degrees and the out-degrees of nodes in the network:
• A random friend Y has more friends than a random node X, on average:

$$E\{d_i(Y)\} - \bar{d} = \frac{\operatorname{Cov}\{d_i(X), d_o(X)\}}{\bar{d}} \geq 0. \qquad (4.6)$$

• A random follower Z has more followers than a random node X, on average:

$$E\{d_o(Z)\} - \bar{d} = \frac{\operatorname{Cov}\{d_i(X), d_o(X)\}}{\bar{d}} \geq 0. \qquad (4.7)$$

For the derivation, please see the Supplementary Notes of our paper [5].
Equations (4.6) and (4.7) state that in networks where the in- and out-degrees of a random node are positively correlated, (1) the expected number of friends of a random friend is greater than the expected number of friends of a random node, and (2) the expected number of followers of a random follower is greater than that of a random node. The mathematical formulations of the friendship paradox in directed networks were independently proved recently in [53] using vector norms.
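Equations (4.4)-(4.7) can be checked numerically, since the expectations under friend and follower sampling reduce to degree-weighted sums. A sketch on a random directed graph (the graph model is our own illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
A = (rng.random((N, N)) < 0.05).astype(float)   # A[i, j] = 1: link from i to j
np.fill_diagonal(A, 0)
d_out = A.sum(axis=1)       # followers
d_in = A.sum(axis=0)        # friends
d_bar = d_out.mean()        # equals d_in.mean()

# exact expectations under friend / follower sampling
E_do_Y = (d_out ** 2).sum() / d_out.sum()      # followers of a random friend
E_di_Z = (d_in ** 2).sum() / d_in.sum()        # friends of a random follower
E_di_Y = (d_out * d_in).sum() / d_out.sum()    # friends of a random friend

var_do = ((d_out - d_bar) ** 2).mean()                 # Var{d_o(X)}
var_di = ((d_in - d_bar) ** 2).mean()                  # Var{d_i(X)}
cov = ((d_in - d_bar) * (d_out - d_bar)).mean()        # Cov{d_i(X), d_o(X)}
```

The identities hold exactly (up to floating point): the gap between the friend-sampled expectation and the plain average equals the variance (or covariance) divided by the mean degree.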
To give additional intuition, Figure 4.1 illustrates the above four variants of the friendship paradox in the subgraph of the Twitter social network (see Methods), showing the fraction of individuals with a specific in-degree (or out-degree) who experience the paradox. Note that this fraction is high: at least half of the users with 100 or fewer friends (or followers) observe that they are less popular and well-connected than their friends and followers are, on average. The noise in Figure 4.1b likely stems from Twitter's follow limits. When individuals reach the limit, they must curate their social links more deliberately and recruit more followers before they can add more friends.
4.2.3 Global Perception Bias
When nodes have distinguishing traits or attributes, the friendship paradox can bias perceptions of those
attributes. For simplicity, we assume that each node has a binary-valued attribute (f : V → {0, 1}). Such
binary functions are useful for representing, among others, voting preferences (Democratic or Republican),
demographic characteristics (female or male), contagions (infected vs. susceptible), or the spread of
information in networks (using a particular hashtag or not).
The global prevalence of the attribute in a directed network is given by E{f(X)}, the expected value
of the attribute of a random node X. In other words, when only 5% of nodes have the attribute f(v) = 1,
its expected value is E{f(X)} = 0.05.
Nodes' perceptions of the prevalence of the attribute, however, are determined by its value among
their friends, i.e., E{f(Y)}, the expected attribute value of a randomly chosen friend Y. On Twitter, this
[Figure 4.1: four panels plotting the probability of paradox (0 to 1) against node degree on a log scale, for (a) friends have more followers, (b) followers have more friends, (c) friends have more friends, and (d) followers have more followers.]
Figure 4.1: Illustration of the effects of the four versions of the friendship paradox using the Twitter dataset
described in Methods. The sub-figures display the fraction of nodes (empirical probability of the paradox)
of a particular degree whose (a) friends have more followers, (b) followers have more friends, (c) friends
have more friends, and (d) followers have more followers, on average.
translates into how many people see the topic in their social feed, since the feed aggregates posts made by
friends. Under some conditions, the perceived prevalence of the attribute E{f(Y)} will be very different
from its actual prevalence E{f(X)}. We define this as global perception bias:
B_global = E{f(Y)} − E{f(X)} = Cov{f(X), d_o(X)} / d̄ = ρ_{d_o,f} σ_{d_o} σ_f / d̄,    (4.8)

where ρ_{d_o,f} is the Pearson correlation coefficient between the out-degree and the attribute value of a random node, σ_{d_o} is the standard deviation of the out-degree distribution, and σ_f is the standard deviation of the binary attribute (see Supplementary Notes of our paper [5] for derivations).
When the attribute is correlated with the out-degree (ρ_{d_o,f} > 0), a random friend's attribute is larger
than the attribute value of a random node, on average. In undirected networks this effect is known as the
generalized friendship paradox [41], and it has the same intuition: when popular people (with many followers)
are more likely to possess some trait (ρ_{d_o,f} > 0), that trait will be overrepresented among the friends of
any individual. As a result, people will tend to overestimate the trait's prevalence. This may explain the
observation that adolescents overestimate the number of smokers or heavy drinkers among their peers [9].
All that is required for the bias to hold is for peers engaging in risky behaviors to tend to be more popular.
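Equation (4.8) can be verified numerically on a toy network. In the sketch below (hypothetical data; a link (u, v) points from friend u to follower v), only the best-followed node carries the attribute, so perceived prevalence exceeds true prevalence by exactly Cov{f(X), d_o(X)} / d̄.

```python
import numpy as np

# Hypothetical toy network; link (u, v) points from friend u to follower v.
edges = np.array([(0, 1), (0, 2), (0, 3), (1, 2), (2, 0), (3, 2), (2, 1), (1, 0)])
N = 4
f = np.array([1, 0, 0, 0])              # only the well-followed node 0 has the attribute
d_o = np.bincount(edges[:, 0], minlength=N)
d_bar = len(edges) / N

E_f_X = f.mean()                        # true prevalence E{f(X)}
E_f_Y = f[edges[:, 0]].mean()           # perceived prevalence E{f(Y)}
cov_f_do = np.cov(f, d_o, bias=True)[0, 1]

B_global = E_f_Y - E_f_X
assert np.isclose(B_global, cov_f_do / d_bar)   # Equation (4.8)
```

Here B_global is positive because the attribute correlates with popularity (out-degree).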
Note that the magnitude of the friendship paradox,

S_FP = E{d_o(Y)} − d̄ = σ²_{d_o} / d̄,

increases with the standard deviation of the out-degree distribution (σ_{d_o}) and decreases with the average
degree (d̄). Global perception bias B_global also increases with σ_{d_o} and decreases with d̄ when the correlation
coefficient ρ_{d_o,f} remains fixed. Hence, the friendship paradox amplifies global perception bias, increasing the
deviation between the actual and observed prevalence of the attribute in the network.
Additional perception biases can arise in directed networks. Recall that a random friend Y is an individual
sampled with a probability proportional to the out-degree, and a random follower Z is an individual
sampled with a probability proportional to the in-degree. The random friend Y can be thought of as a person
being observed, whereas a random follower Z is a person who is observing. In this context, the perception
bias B_global = E{f(Y)} − E{f(X)} compares the opinion of a random person being observed with the
global (true) prevalence. By the same token, the quantity E{f(Z)} − E{f(X)} compares the opinion of
a random observer with the global prevalence. The difference,
E{f(Y)} − E{f(Z)} = (1/d̄) E{f(X) (d_o(X) − d_i(X))}
can then be thought of as the expected difference of the opinions between the observed and the observer
pair chosen randomly from the network. This interpretation opens up a causal perspective of the perception
bias in directed networks for future work.
4.2.4 Local Perception Bias
One problem with using B_global (Equation (4.8)) to measure perception bias is that E{f(Y)} captures the
expected value of the attribute among the friends of all individuals, rather than the friends of a randomly
chosen individual X. In order to reflect more accurately how individuals' perceptions are skewed by their
friends, we propose a new measure of perception bias. To quantify this bias we begin by defining the
perception q_f(v) of an individual v ∈ V about the prevalence of an attribute f among his or her friends:

q_f(v) = ( Σ_{u ∈ F(v)} f(u) ) / d_i(v),    (4.9)
where F(v) denotes the set of friends of v. We define local perception bias as the deviation of the expected
perception of a trait of a random individual from its global prevalence:

B_local = E{q_f(X)} − E{f(X)}.    (4.10)

To help understand B_local, we define the attention that a node v ∈ V allocates to each of her friends:

A(v) = 1 / d_i(v).
The expression for attention is motivated by the observation that users with more friends tend to receive
more messages [111], making them less likely to see any specific friend's post [55]. This allows us to
succinctly express the expected perception of a random node X as (see Supplementary Notes of our paper
[5] for the derivation)

E{q_f(X)} = d̄ E{f(U) A(V) | (U,V) ∼ Uniform(E)}.    (4.11)

Here, d̄ is the expected number of friends of a random node, and U and V denote the endpoints of a link
sampled uniformly from E. Intuitively,

E{f(U) A(V) | (U,V) ∼ Uniform(E)}

represents the expected influence of an interaction along a link drawn at random from the network: i.e.,
the attribute f(U) of the friend U times the attention that the follower V pays to that friend. Note that for
simplicity we assumed that nodes divide their attention uniformly over all friends, though the analysis can
be extended to weighted networks, where weights model non-uniform attention, with individuals paying
more attention to their more important or influential friends.
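Equations (4.9)–(4.11) are straightforward to compute from an edge list. The sketch below (hypothetical data; a link (u, v) points from friend u to follower v) computes each node's perception q_f(v), the local perception bias B_local, and checks the link-sampling identity (4.11).

```python
import numpy as np

# Hypothetical toy network; link (u, v) points from friend u to follower v.
edges = np.array([(0, 1), (0, 2), (0, 3), (1, 2), (2, 0), (3, 2), (2, 1), (1, 0)])
N = 4
f = np.array([1, 0, 0, 0])
d_i = np.bincount(edges[:, 1], minlength=N)   # number of friends of each node
d_bar = len(edges) / N

# q_f(v): fraction of v's friends with the attribute (Equation (4.9))
q_f = np.zeros(N)
for u, v in edges:
    q_f[v] += f[u] / d_i[v]

B_local = q_f.mean() - f.mean()               # Equation (4.10)

# Equation (4.11): E{q_f(X)} = d̄ * E{f(U) A(V)} over uniformly sampled links,
# with attention A(v) = 1 / d_i(v)
attention = 1.0 / d_i
rhs = d_bar * (f[edges[:, 0]] * attention[edges[:, 1]]).mean()
assert np.isclose(q_f.mean(), rhs)
```

On this example B_local is positive: the node with the attribute is widely followed, so most nodes overestimate the attribute's prevalence.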
4.2.5 Relationship between B_local and B_global
Local perception bias B_local is a refinement of global perception bias B_global, which accounts for how
individuals divide their attention in the network. Indeed, if the attention of followers is independent of the
attribute of their friends, both measures are the same. Formally, B_global and B_local are equal if and only if
the attribute f(U) of U and attention A(V) along a random link (U,V) are uncorrelated, i.e.,

Cov{f(U), A(V) | (U,V) ∼ Uniform(E)} = 0,    (4.12)
as we show in Supplementary Notes of our paper [5].
On the other hand, positive local perception bias exists, i.e., B_local ≥ 0, when the following conditions
are met (see SI):

Cov{f(X), d_o(X)} ≥ 0 and,    (4.13)

Cov{f(U), A(V) | (U,V) ∼ Uniform(E)} ≥ 0.    (4.14)
The first condition (Equation (4.13)) specifies positive correlation between the out-degree and the attribute
of a random node, which occurs when popular nodes are more likely to have the attribute. This is a
necessary and sufficient condition for B_global ≥ 0 (see Equation (4.8)). The second condition (Equation (4.14))
specifies positive correlation between the attention of a follower and the attribute of a friend, suggesting
that nodes with an attribute are followed by nodes that divide their attention over few others. This is
a necessary and sufficient condition for B_local ≥ B_global (see Supplementary Notes of our paper [5]).
Hence, these two conditions collectively are sufficient for positive local perception bias, leading individuals
to overestimate the attribute's prevalence, i.e., B_local ≥ B_global ≥ 0. Analogously, changing the signs of
Equations (4.13) and (4.14) leads to negative local perception bias (see Supplementary Notes of our paper
[5]), which implies that nodes underestimate the prevalence of an attribute.
Under other conditions B_global and B_local can differ significantly and even disagree, with one measure
indicating that individuals are underestimating and the other indicating that they are overestimating the
prevalence of an attribute. We analytically characterize the two cases where this occurs (see Supplementary
section B.2):
Let B_global < 0; then B_local > 0 if and only if

Cov{f(U), A(V) | (U,V) ∼ Uniform(E)} > |B_global| / d̄.

Let B_global > 0; then B_local < 0 if and only if

Cov{f(U), A(V) | (U,V) ∼ Uniform(E)} < −|B_global| / d̄.
The first condition states that when B_global is negative, B_local can still be positive if sufficiently many nodes
with an attribute have followers with high attention (because they divide it over few friends). Similarly,
when B_global is positive, B_local can be negative when few of the nodes with an attribute have high-attention
followers, leading to a negative correlation between the attribute of a friend and the attention of his or her
follower. The discrepancy exists because B_global makes a mean-field approximation by assuming that the
expected attribute value among friends of a random node X is equal to the expected attribute value of a
random friend Y sampled from the entire network. In contrast, B_local is a higher-resolution measure that
takes the underlying network structure into account via the correlation between the attribute of a friend
and the attention of a follower.
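Combining Equations (4.8), (4.10), and (4.11) yields the identity B_local − B_global = d̄ · Cov{f(U), A(V) | (U,V) ∼ Uniform(E)}, which makes the role of the link-level covariance explicit. A quick numerical check of this identity (hypothetical data; link (u, v) points from friend u to follower v):

```python
import numpy as np

# Hypothetical toy network and binary attribute.
edges = np.array([(0, 1), (0, 2), (0, 3), (1, 2), (2, 0), (3, 2), (2, 1), (1, 0)])
N = 4
f = np.array([1, 0, 0, 0])
d_i = np.bincount(edges[:, 1], minlength=N)
d_bar = len(edges) / N

B_global = f[edges[:, 0]].mean() - f.mean()   # E{f(Y)} - E{f(X)}

q_f = np.zeros(N)                             # perceptions, Equation (4.9)
for u, v in edges:
    q_f[v] += f[u] / d_i[v]
B_local = q_f.mean() - f.mean()               # Equation (4.10)

# Covariance of friend attribute and follower attention over uniform links
fu = f[edges[:, 0]].astype(float)
av = 1.0 / d_i[edges[:, 1]]
cov_link = (fu * av).mean() - fu.mean() * av.mean()

# B_local - B_global = d̄ * Cov{f(U), A(V)}
assert np.isclose(B_local - B_global, d_bar * cov_link)
```

When cov_link has the opposite sign of B_global and is large enough in magnitude, the two bias measures disagree, which is exactly the condition stated above.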
Relation to Inversity in Undirected Networks: Local perception bias is related to the concept of
inversity [72], which is defined as the correlation coefficient of the two random variables d(U) and 1/d(V),
where d denotes the degree and (U,V) is a uniformly sampled link in an undirected network. Although
the mathematical form of inversity is reminiscent of degree assortativity [100], it does not convey the
same information. Kumar et al. [72] show that the relation between global and local versions of the
friendship paradox in undirected networks is characterized by inversity and not assortativity. Specifically,
when inversity is positive, the local version of the friendship paradox is larger in magnitude than the
global version of the friendship paradox in undirected networks. This result can be obtained by extending
our analysis to undirected networks and setting f = d in the expressions for B_global and B_local. In fact,
Equation (4.14) (which is a necessary and sufficient condition for B_local ≥ B_global) in our paper generalizes
their findings to directed networks and arbitrary exogenous attributes f.
4.2.6 Empirical Validation
We used data from Twitter (see Methods) to compare the actual and perceived popularity of hashtags (i.e.,
topics) mentioned in text posts. We treat each hashtag h as a binary attribute, with f_h(v) = 1 if a user v
used the hashtag h in his or her posts.
Figure 4.2a displays the histogram of the prevalence (E{f(X)}) of the 1,153 most popular hashtags,
each used by more than 1,000 people in our data set. The bulk of these hashtags were used by fewer
than 2% of the people, with the most popular hashtags being used by just 8% of the people in our sample.
Figure 4.2b shows the histogram of local perception bias B_local for all hashtags. Although its peak is at
zero, the distribution is skewed, with 865 hashtags having a positive bias, meaning that they appear more
popular than they really are. Measurements of individuals' perceptions show that most users in our sample
overestimate how popular hashtags are (see Appendix section B.1).
[Figure 4.2: histograms of (a) global prevalence E{f(X)} and (b) local perception bias B_local over hashtags, with hashtag counts on a log scale.]
Figure 4.2: Histogram of the distribution of (a) global prevalence E{f(X)} and (b) local perception bias
B_local of popular hashtags in the Twitter data. Local perception bias B_local (overestimating the prevalence)
exists for most hashtags.
What hashtags are most biased? Figure 4.3 shows the top-20 and bottom-10 hashtags ranked by B_local
(see Appendix section B.3 for the ranking of hashtags based on the global bias). Among the most positively
biased hashtags are those associated with social movements (#ferguson, #mikebrown, #michaelbrown),
memes and current events (#icebucketchallenge, #alsicebucketchallenge, #ebola, #netneutrality), and sports
and entertainment (#emmys, #robinwilliams, #sxsw, #applelive, #worldcup). For example, #ferguson, with
E{q_f(X)} = 12.1%, is perceived as the most popular hashtag. While it is also one of the more widely-used
hashtags, with E{f(X)} = 3.1%, perception bias makes it appear about four times more popular to
Twitter users than it actually is.
There are also negatively biased hashtags, which appear less popular than they actually are. Among these
hashtags are Twitter conventions aimed at getting more followers (#tfb, #followback, #follow, #teamfollowback)
or more retweets (#shoutout, #pjnet, #retweet, #rt). Many of these hashtags are actually among
the top-20 most popular Twitter hashtags (#oscars, #tcot, #quote and #rt), but due to the structure of the
network, they appear less popular to users. This occurs either because people who use these hashtags do
not have many followers (Cov{f(X), d_o(X)} < 0), or the attention of their followers is diluted because
they follow many others (Cov{f(U), A(V)} < 0). For example, for #oscars, both of the covariances are
negative. Some hashtags also have B_local and B_global with opposite signs, meaning that one measure
overestimates the prevalence of the hashtag, while the other underestimates it. Many political hashtags in our
sample fall in this category, including #sotu, #occupy, and #marriageequality. Additional examples of these
hashtags, as well as negatively biased hashtags, are listed in Appendix section B.2.
4.2.7 Estimating Global Prevalence via Polling
The aim of polling is to estimate the global prevalence E{f(X)} of an attribute by sampling individuals
and averaging their answers to a specific question. The accuracy of a poll depends on two key factors: (i)
the method of sampling individuals and (ii) the question asked of them. We propose a practical polling
algorithm that differs from currently used polling algorithms in both aspects. First, our algorithm
samples random followers (step 1 of Algorithm 1) instead of random individuals, by selecting b individuals
from the distribution

p_v = d_i(v) / Σ_{v' ∈ V} d_i(v'),    ∀v ∈ V.
Second, the sampled individuals are asked about their perceptions instead of their own attribute (step 2 of
Algorithm 1): “What do you think is the fraction of individuals with attribute 1?” Their perceptions are
then aggregated in a polling estimate:

f̂_FPP = (1/b) Σ_v q_f(v).    (4.15)
We call the proposed algorithm the Follower Perception Polling (FPP) algorithm.
The key idea behind our Follower Perception Polling (FPP) algorithm is to sample individuals who have
more friends, as this allows them to aggregate more information. According to the friendship paradox
(Equation (4.5)), random followers have, on average, more friends than random individuals do. As a result,
the variance of their perceptions will be smaller than that of random individuals, and hence it will result
Algorithm 1: Follower Perception Polling (FPP) Algorithm
Input: Graph G = (V,E), perceptions q_f : V → R₊, sampling budget b.
Output: Estimate f̂_FPP of E{f(X)} = Σ_{v∈V} f(v) / N.
1. Sample a set S ⊆ V of b followers independently from the distribution

   p_v = d_i(v) / Σ_{v'∈V} d_i(v'),    ∀v ∈ V.

2. Compute the estimate

   f̂_FPP = (1/b) Σ_{v∈S} q_f(v).    (4.16)
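A minimal implementation of Algorithm 1 might look as follows (numpy; the function name and the toy edge list are ours, and the graph is assumed to be given as an edge list with a link (u, v) pointing from friend u to follower v):

```python
import numpy as np

def fpp_estimate(edges, q_f, b, rng):
    """Follower Perception Polling (Algorithm 1, sketch).

    edges: array of links (u, v) pointing from friend u to follower v.
    q_f:   q_f[v] = fraction of v's friends with the attribute (Eq. (4.9)).
    b:     sampling budget.
    """
    N = len(q_f)
    d_in = np.bincount(edges[:, 1], minlength=N)
    p = d_in / d_in.sum()                         # p_v proportional to in-degree
    S = rng.choice(N, size=b, replace=True, p=p)  # step 1: sample b followers
    return q_f[S].mean()                          # step 2: average their perceptions

# Hypothetical toy data for illustration
edges = np.array([(0, 1), (0, 2), (0, 3), (1, 2), (2, 0), (3, 2), (2, 1), (1, 0)])
f = np.array([1, 0, 0, 0])
d_i = np.bincount(edges[:, 1], minlength=4)
q_f = np.zeros(4)
for u, v in edges:
    q_f[v] += f[u] / d_i[v]

rng = np.random.default_rng(0)
est = fpp_estimate(edges, q_f, b=1000, rng=rng)
```

On this toy graph the estimate concentrates around E{f(Y)} = 0.375 rather than the true prevalence 0.25, which is exactly the bias B_global analyzed in Methods.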
in a more accurate estimate of the global prevalence of the attribute. We analytically show (see Methods)
that (i) the bias of the estimate f̂_FPP produced by the FPP algorithm is equal to the global perception
bias B_global, and (ii) the variance of the estimate f̂_FPP is bounded from above by a function of the correlation
between out-degree and the attribute, as well as spectral properties of the network given by the second
largest eigenvalue of the bibliographic coupling matrix.
The FPP algorithm assumes that every node has a non-zero in-degree and out-degree. To evaluate the
performance of this polling algorithm, we extract a subgraph of 5,409 Twitter users from our dataset with
the same properties. We use the FPP algorithm to estimate the popularity of the 500 most frequent hashtags
mentioned by users in this subgraph. We compare three polling algorithms on this induced subgraph:
1. Intent Polling (IP): asks random users whether they used a hashtag (orange in Figure 4.4).
2. Node Perception Polling (NPP): asks random users what fraction of their friends used the hashtag
(red in Figure 4.4).
3. Follower Perception Polling (FPP): asks random followers what fraction of their friends used the
hashtag (green in Figure 4.4).
Node perception polling differs from IP in terms of the question asked: random nodes are asked about
their perception in NPP, whereas they are asked about their own attribute in IP. Follower perception polling
differs from NPP in terms of the sampling method: random followers are sampled in FPP, while random
node sampling is used in NPP. Hence, comparing the performance of IP with NPP illustrates the benefit
of polling perceptions instead of attributes, and comparing the performance of FPP with NPP illustrates
the benefits of friendship paradox-based sampling.
Figure 4.4a shows the bias of estimates produced by the polling algorithms for a fixed sampling budget b =
25, which corresponds to querying 0.5% of the nodes. As shown in the analysis of the polling algorithm
(see Methods), FPP produces biased estimates for each hashtag (Figure 4.4a), given by the B_global value for
that hashtag, although it produces smaller-variance estimates (Figure 4.4b). Hence, in terms of the Mean
Squared Error, defined as

MSE{T} = Bias{T}² + Var{T}

for an estimate T, FPP estimates are more accurate compared to both IP and NPP for most hashtags (Figure 4.4c).
Increasing the sampling budget decreases the performance gap between FPP and the other two
algorithms (Figure 4.4). However, even with b = 250 (5% of the nodes polled), FPP outperforms IP in more
than 80% of the cases, and it outperforms NPP in more than 55% of the cases.
The variance of the polling estimate (Equation (4.21)) is bounded by an expression that includes λ₂,
the second largest eigenvalue of the degree-discounted bibliographic coupling matrix. For the Twitter data,
λ₂ = 0.5984. Equation (4.21) with this value serves as the upper bound of Var(f̂_FPP) for all 503 hashtags.
The bound is quite loose and could be tightened in future work.
Follower sampling heuristic: The FPP algorithm assumes that followers are obtained by sampling
nodes with probabilities proportional to their in-degree, or equivalently, sampling links at random from
the network and then selecting the endpoint of the link. This is feasible when the entire network is
known, or when the links have integer IDs, which can be uniformly sampled from a range of IDs. In
many cases, neither strategy is feasible, either because the network is too large, or it does not allow access
to individually-indexed links. In that case, we can use the following heuristic to sample followers: select a
node at random and ask her to nominate a random follower. Appendix section B.4 shows that this heuristic
can estimate hashtag prevalence almost as accurately as the exact implementation of the FPP algorithm
that samples nodes proportional to their in-degree. Our intuition for this method is based on the fact that
the undirected version of the friendship paradox holds for a random neighbor of a random node as well as
a random end of a random link [27].
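The heuristic itself requires only local information and can be sketched as follows (pure Python; the adjacency data and function name are hypothetical):

```python
import random

# followers[u] lists the followers of node u (hypothetical data).
followers = {0: [1, 2, 3], 1: [2, 0], 2: [0, 1], 3: [2]}

def sample_follower_heuristic(followers, rng):
    """Pick a uniformly random node, then one of its followers at random.

    Nodes without followers are re-drawn. This approximates in-degree-
    proportional sampling without access to the full link list, in the
    spirit of the 'random neighbor of a random node' construction.
    """
    nodes = list(followers)
    while True:
        u = rng.choice(nodes)
        if followers[u]:
            return rng.choice(followers[u])

rng = random.Random(0)
sample = sample_follower_heuristic(followers, rng)
```

Note that this two-step procedure is not exactly in-degree-proportional sampling; as the text states, it is a heuristic whose accuracy is evaluated empirically in Appendix section B.4.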
4.3 Methods
4.3.1 Data
The dataset used in this study was collected from Twitter in 2014. We started with a set of 100 users
who were active discussing ballot initiatives during the 2012 California election and expanded this set by
retrieving the accounts of the individuals they followed and reached a total of 5,599 users. We refer to these
individuals as seed users. Next, we identied all friends of the seed users, collecting all directed links that
start with one of the seed users. We then collected all posts made by the seed users and their friends—over
600K users in total—over the period June–November 2014. The posts include their activity, i.e., tweets and
retweets. These tweets mention more than 18M hashtags. With this data-collection approach, seed users
are fully observed (their activity and what they see in their social feeds), and their friends are only partially
observed (only their activity).
Table 4.1 reports properties of the Twitter dataset, considering only the seed users. Note that the
average degree d̄ (where d̄ = E{d_o(X)} = E{d_i(X)}) is relatively large at 123.55. However, since the
distribution of the in- and out-degrees is highly heterogeneous, the variance of the in- and out-degrees is
Table 4.1: Properties of the Twitter subgraph

Properties of nodes:
  avg. degree d̄ = E{d_i(X)}: 123.55
  variance of out-degree Var{d_o(X)}: 30096.16
  variance of in-degree Var{d_i(X)}: 24338.66
  covariance Cov{d_i(X), d_o(X)}: 14226.32

Properties of friends and followers:
  friend's avg. out-degree E{d_o(Y)}: 367.14
  friend's avg. in-degree E{d_i(Y)}: 238.68
  follower's avg. in-degree E{d_i(Z)}: 320.54
  follower's avg. out-degree E{d_o(Z)}: 238.68
relatively large (two orders of magnitude compared to d̄). The covariance between the in- and out-degrees
of nodes is also relatively large, with a correlation coefficient

ρ{d_i(X), d_o(X)} = Cov{d_i(X), d_o(X)} / √(Var{d_o(X)} Var{d_i(X)}) = 0.52.
Due to the relatively large variance of the in- and out-degree distributions, the expected out-degree of
a random friend (E{d_o(Y)}) and the expected in-degree of a random follower (E{d_i(Z)}) are larger than
the average degree d̄ (see Equation (4.5)). Note also that, due to the positive covariance between the in- and
out-degrees of nodes, the expected in-degree of a random friend (E{d_i(Y)}) and the expected out-degree
of a random follower (E{d_o(Z)}) are also larger than d̄, as stated in Equation (4.7).
4.3.2 Friendship Paradox-based Polling: Performance Analysis
The accuracy of a poll depends on the method of sampling respondents and the question asked of them.
For example, in the case of estimating an election outcome, asking people “Who do you think will win?”
(expectation polling) is better than “Who will you vote for?” (intent polling) [114]. This is because in
expectation polling, an individual names the candidate more popular among her friends, thus summarizing
a number of individuals in the social network, rather than providing her own voting intention. Our follower
perception polling (FPP) algorithm is motivated by [37, 114, 98], which show that polling methods asking
individuals to summarize information in their neighborhood outperform polling methods that ask only
about the attribute of each individual. [37] studied the polling problem analytically in the context of an
undirected network and proposed a method to obtain an unbiased estimate of the global prevalence with
bounds on its variance. The analysis of the FPP algorithm for directed graphs is motivated by these results
in [37] for undirected social networks. [98] proposed to ask the simple question “What fraction of your
neighbors have the attribute 1?” (neighborhood expectation polling) of randomly sampled neighbors
(instead of random nodes) on undirected social networks. In this case, sampled individuals provide
the average opinion among their neighbors. Further, since random friends have more friends than random
individuals, this approach yields an estimate with a smaller variance than asking random nodes. Motivated
by these works, the FPP algorithm exploits the friendship paradox on directed networks
to obtain a statistically efficient estimate of the global prevalence of an attribute using biased perceptions
of random followers.
Recall that in order to reduce the variance, the FPP algorithm polls perceptions q_f(Z) of random followers
Z instead of attributes f(X) of random individuals X. However, it is not guaranteed that the
estimate f̂_FPP will be unbiased. The following result shows that the bias of the FPP algorithm is the same
as the global perception bias B_global.

The bias of the estimate f̂_FPP computed by the FPP algorithm (see Supplementary Notes of our paper
[5]) is equal to the global perception bias:

Bias(f̂_FPP) = E{f̂_FPP} − E{f(X)} = B_global.    (4.17)
Hence, the same factors (specified in Equation (4.8)) that increase (decrease) the global perception bias
will increase (decrease) the bias of the estimate f̂_FPP produced by the FPP algorithm. The aim of the
FPP algorithm is to compensate for the bias B_global of the algorithm with a reduced variance and thereby
achieve a smaller mean squared error. Also, we highlight that the FPP algorithm can be modified to generate
an unbiased estimate by replacing Equation (4.16) with

f̂_FPP^Unbiased = (1/b) Σ_{v∈S} (1 / (N p_v)) Σ_{u∈Fr(v)} f(u) / d_o(u).    (4.18)
The unbiased estimate f̂_FPP^Unbiased is based on the concept of social sampling proposed in [37] for undirected
social networks, where queried individuals provide a weighted value of their friends' attributes in a manner
that results in an unbiased estimate. This estimate is useful in contexts where unbiasedness is preferred
over mean-squared error to assess the performance of the estimate. However, it does not result in an
intuitive and easily implementable algorithm similar to the FPP algorithm, since the modified estimate
f̂_FPP^Unbiased involves each sampled individual calculating a weighted average of the attributes of her neighbors.
Before analyzing the variance of the polling estimate f̂ produced by the FPP algorithm, we digress
briefly to review the bibliographic coupling matrix. Bibliographic coupling originated from the analysis of
citation networks [62], and is used to symmetrize a directed graph by transforming it into an undirected graph
for purposes of clustering, etc. The bibliographic coupling matrix B of a directed graph with adjacency
matrix A is defined as B = AAᵀ. Hence, the weight of the link between nodes i and j in the new undirected
graph is B(i,j) = Σ_{v∈V} A(i,v)A(j,v), the number of individuals who follow both of these nodes.*
This conveys the similarity of i and j in terms of the number of mutual followers. However,
when determining the similarity of two nodes i, j using B, a mutual follower with a large number of
friends (a likely scenario) is weighted the same as a mutual follower with a small number of friends (a
rarer scenario). Hence, the latter type of mutual follower should be given more weight compared to the
former type when evaluating the similarity of two nodes. Similarly, the number of followers of i and j
should also be taken into consideration when assessing their similarity. Based on these observations, [115]
proposed the degree-discounted bibliographic coupling matrix

B_d = D_o^{−1/2} A D_i^{−1} Aᵀ D_o^{−1/2},    (4.19)

where D_o and D_i are the N×N dimensional diagonal matrices with D_o(i,i) = d_o(i) and D_i(i,i) = d_i(i),
respectively. The (i,j) element of B_d is

B_d(i,j) = (1 / √(d_o(i) d_o(j))) Σ_{k∈V} A(i,k) A(j,k) / d_i(k),    (4.20)

which discounts the contributions of the nodes i, j by their out-degrees and each mutual follower k by her
in-degree. Please see [115, 83] for more details on the degree-discounted bibliographic coupling.

* In a citation network where the nodes correspond to papers, the entry (i,j) of the bibliographic coupling matrix B gives
the number of papers that are cited by both i and j, from which the name bibliographic coupling matrix is derived. The
bibliographic coupling matrix is also related to the HITS algorithm [68] used for link analysis [99, 81].
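Constructing B_d from an adjacency matrix and extracting its second largest eigenvalue takes only a few lines (numpy; the adjacency matrix is hypothetical, and we assume every node has non-zero in- and out-degree, as the FPP algorithm requires):

```python
import numpy as np

# Degree-discounted bibliographic coupling matrix, Equation (4.19).
# A[i, j] = 1 if there is a link i -> j (j follows i); hypothetical data.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)
d_o = A.sum(axis=1)                    # out-degree (number of followers)
d_i = A.sum(axis=0)                    # in-degree (number of friends)

Do_inv_sqrt = np.diag(1.0 / np.sqrt(d_o))
Di_inv = np.diag(1.0 / d_i)
B_d = Do_inv_sqrt @ A @ Di_inv @ A.T @ Do_inv_sqrt   # symmetric, PSD

eigvals = np.sort(np.linalg.eigvalsh(B_d))[::-1]     # descending
lambda_2 = eigvals[1]                  # second largest eigenvalue
```

The largest eigenvalue of B_d is 1 (with eigenvector D_o^{1/2}·1), so λ₂ < 1 for a connected, non-bipartite coupling structure; it is this λ₂ that enters the variance bound below.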
Returning to the analysis of the estimate f̂_FPP, we can calculate an upper bound on the variance of this
estimate under certain conditions on the structure of the network. Specifically, if the degree-discounted
bibliographic coupling matrix B_d is connected and non-bipartite, then

Var(f̂_FPP) = (1/(bM)) fᵀ D_o^{1/2} [ D_o^{−1/2} A D_i^{−1} Aᵀ D_o^{−1/2} − (1/M) D_o^{1/2} 1 1ᵀ D_o^{1/2} ] D_o^{1/2} f    (4.21)
           ≤ (1/(bM)) λ₂ ||D_o^{1/2} f||²,    (4.22)

where M = Σ_{v∈V} d_i(v), λ₂ is the second largest eigenvalue of B_d, and f is the N×1 dimensional vector of
binary attributes (see Supplementary Notes of our paper [5]).
This result shows that the variance of the follower perception polling algorithm depends on the correlation
between the out-degrees and attributes, ||D_o^{1/2} f||², and on the structure of the graph via the second largest
eigenvalue λ₂ of the matrix B_d. Specifically, a smaller λ₂ implies that the bibliographic coupling network
has good expansion (i.e., absence of bottlenecks) [43]. Hence, if the nodes in the network G = (V,E)
cannot be clustered into distinct groups based on their mutual followers (i.e., bibliographic similarity), then
the variance of the algorithm will be smaller (due to smaller λ₂).
Code Availability: Code to generate the results of the paper is available at https://github.com/ninoch/perception_bias.

Data Availability: The Twitter network and the hashtags used by users are available at https://osf.io/pjkr9/. Due to Twitter restrictions on sharing raw data, we are unable to share the raw tweet content.
4.4 Discussion
Social networks can exhibit surprising, even counter-intuitive behaviors. For example, previous work has
shown that the “majority illusion” may lead people to observe that the majority of their friends have some
attribute, even when it is globally rare [79], and to dramatically underestimate the size of a minority
group [76]. These effects arise due to the friendship paradox, which can also bias the observations individuals
make in directed networks. Our analysis identifies the conditions under which the friendship paradox can
distort how popular some attribute or behavior (e.g., drinking, smoking, etc.) is perceived to be, making it
appear several times more prevalent than it actually is. The following two conditions amplify perception
bias: (1) positive correlation between the attributes of individuals and their popularity (number of followers
in a directed network) and (2) positive correlation between the attributes of individuals and the attention
of their followers. The first condition suggests that bias exists when popular people have the attribute, for
example, engage in risky behavior, have a specific political affiliation, or simply use a particular hashtag.
Their influence is amplified when they are followed by good listeners, i.e., people who follow fewer others
and thus are able to pay more attention to the influentials. These conditions can be generated by biases in
preferences during network formation, driven, for example, by homophily [76, 36].
We validated these findings empirically using data from the Twitter social network. We measured
perceptions of the popularity of hashtags, i.e., words or phrases preceded by a '#' sign that are frequently
used to identify topics on Twitter. Such hashtags serve many important functions, from organizing content,
to expressing opinions, to linking topics and people. We measured a hashtag's global prevalence
as the fraction of all people using it, and its perceived popularity as the fraction of friends using it. Our
analysis identified hashtags that appeared several times more popular than they actually were, due to local
perception bias. Such hashtags were associated with social movements, memes, and current events. Interestingly,
as our data was collected in 2014, some of the most biased hashtags were #icebucketchallenge and
#alsicebucketchallenge, the explosively popular Ice Bucket Challenge. Perception bias could have potentially
amplified their spread, as well as the spread of other costly behaviors that require social proof [31].
For example, the #MeToo movement has grown into an international campaign to end sexual harassment
and assault in the workplace by highlighting just how endemic the problem is. It spread through online
social networks as women posted their own stories of harassment using the hashtag #metoo. Perception
bias may have amplified the spread of such hashtags by making them appear more common and thus easier
to use.
We also presented an algorithm that leverages friendship paradox in directed networks to estimate the
true prevalence of an attribute with smaller mean-squared error than other methods. In essence, the idea
behind the algorithm is that perceptions of random followers should have a smaller variance compared to
the perceptions of random individuals, because random followers are more informed than random people
are, since according to friendship paradox they tend to have more friends. Empirical results conrm that
the proposed algorithm outperforms other widely used polling algorithms.
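A rough sketch of the three polling schemes can clarify the distinction. The function names below are my own, and the sketch makes two simplifying assumptions not stated in the text: nodes with no friends are skipped for perception polls, and a "random follower" is drawn by sampling nodes proportionally to the number of accounts they follow (the follower endpoint of a uniformly random edge).

```python
import random

def intent_polling(attr, budget, rng):
    """IP: ask `budget` random individuals whether they have the attribute."""
    sample = rng.choices(list(attr), k=budget)
    return sum(attr[v] for v in sample) / budget

def perception(v, friends_of, attr):
    """Fraction of v's friends (accounts v follows) with the attribute."""
    fr = friends_of[v]
    return sum(attr[f] for f in fr) / len(fr)

def node_perception_polling(friends_of, attr, budget, rng):
    """NPP: ask uniformly random individuals for their perception."""
    nodes = [v for v, fr in friends_of.items() if fr]
    sample = rng.choices(nodes, k=budget)
    return sum(perception(v, friends_of, attr) for v in sample) / budget

def follower_perception_polling(friends_of, attr, budget, rng):
    """FPP sketch: sample *followers*, i.e., draw nodes proportionally to
    how many accounts they follow, then ask for their perception."""
    nodes = [v for v, fr in friends_of.items() if fr]
    weights = [len(friends_of[v]) for v in nodes]
    sample = rng.choices(nodes, weights=weights, k=budget)
    return sum(perception(v, friends_of, attr) for v in sample) / budget

# Hypothetical star network: one popular attribute-holder, nine followers.
attr = {i: 1 if i == 0 else 0 for i in range(10)}
friends_of = {0: [], **{i: [0] for i in range(1, 10)}}
rng = random.Random(42)
print(intent_polling(attr, 25, rng),
      node_perception_polling(friends_of, attr, 25, rng),
      follower_perception_polling(friends_of, attr, 25, rng))
```

The intuition from the text is that FPP's weighted sampling favors respondents with many friends, whose perceptions average over more observations and therefore fluctuate less across polls.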
Our work suggests that one way to mitigate perception bias is to alter the local network topology to
allow more information to reach the low-attention users. This opens up new research avenues on how link
recommendation can alleviate perception bias. However, our empirical study has limitations, namely, the
nature of the subsample of the network we considered. Social networks are huge, necessitating analysis of
subgraphs sampled from the entire network. However, by leaving out some nodes, the data collection process
itself may distort the properties of the sample. Specifically, since we observed only the outgoing links from
the seed nodes, we do not have information about the followers of these nodes. Addressing the limitations
of analysis imposed by sampling is an important research direction. Despite this limitation, our work
shows that the friendship paradox can lead to surprising biases, especially in directed networks, and suggests
potential strategies for mitigating them.
[Figure 4.3: bar chart ranking popular Twitter hashtags by Local Bias. Top of the ranking: #ferguson, #tbt, #icebucketchallenge, #mikebrown, #emmys, #nyc, #robinwilliams, #tech, #ebola, #alsicebucketchallenge, #applelive, #sxsw, #netneutrality, #worldcup, #socialmedia, #earthquake, #ff, #michaelbrown, #apple, #sf; bottom of the ranking: #gazaunderattack, #teaparty, #mtvhottest, #follow, #teamfollowback, #oscars, #tcot, #retweet, #quote, #rt.]
Figure 4.3: The ranking of popular Twitter hashtags based on Local Bias. The top-20 and bottom-10 are included
in the ranking. The bars compare E{f(X)} (global prevalence) and E{q_f(X)} (local perception)
and include 95% confidence intervals. The hashtags can appear to be much more popular than they actually
are (e.g., #ferguson) or they can appear to be less popular (e.g., #oscars) due to local perception bias.
Definition of some hashtags: #mike(/michael)brown and #ferguson (an 18-year-old African American male
killed by police), #tbt (Throwback Thursday, for posting an old picture on Thursdays), #ff (Follow Friday,
for introducing accounts worth following), #tcot (Top Conservatives On Twitter), #rt (Retweet).
[Figure 4.4: four log-scale panels plotting (a) squared bias, (b) variance, and (c) MSE of the polling estimate against E{f(X)} for the FPP, NPP, and IP algorithms, and (d) the fraction of hashtags for which FPP wins versus sampling budget b.]
Figure 4.4: Comparison of estimates of the prevalence of Twitter hashtags produced by the polling algorithms.
Variation of (a) squared bias (Bias{T}^2), (b) variance (Var{T}) and (c) mean squared error
(Bias{T}^2 + Var{T}) of the polling estimate T as a function of a hashtag’s global prevalence E{f(X)}.
Each point represents a different hashtag and a fixed sampling budget b = 25. The polling algorithms
used are intent polling (IP), node perception polling (NPP) and the proposed follower perception polling
(FPP). (d) Fraction of hashtags for which the FPP algorithm outperforms the other two in terms of MSE.
The fraction for NPP approaches 0.5, and for IP approaches 0.8, as the sampling budget b increases. These figures
illustrate that the proposed FPP algorithm achieves a bias-variance trade-off by coupling perception
polling with friendship paradox to reduce the mean squared error.
Chapter 5
Emergence of Structural Inequalities in Scientific Citation Networks
5.1 Introduction
Growing concerns about social justice have brought renewed scrutiny to the problem of structural inequalities.
Such inequalities concentrate power in the hands of certain groups based on their gender, race, or
socio-economic status, by prioritizing their access to resources and social capital, while limiting opportunities
for others. Arguably the best known of these is economic inequality, in which a small minority
controls a disproportionate share of income and wealth. Economic inequality is detrimental to social
welfare and has been associated with adverse societal outcomes, including reduced well-being [106] and
increased mortality [28], crime, and other social problems [131].
Structural inequalities are also common in science. As a result of the gender gap in faculty hiring,
women remain a small minority in many fields: they represent 15% of tenure track faculty in Computer
Science [129], 23% (resp. 10%) of assistant (resp. full) professors in Physics [108], and 31% (resp. 15%) of
assistant (resp. full) professors in Economics [30]. Although women publish at the same rate as men, their
papers appear in less prestigious journals [113] and tend to receive fewer citations [40, 48]. Academic prestige
presents another source of disparities in faculty hiring, where the ranking of a researcher’s doctoral
degree institution determines their academic placement and career opportunities [25, 14]. According to
one study of faculty hiring, the top-ranked 15% of computer science departments produced 68% of their
own faculty, hiring less than 10% from outside of the top-50 departments [33]. The disparity is not driven
by competition in the job market for the best candidates. Instead, the benefits of early career placement, rather
than inherent merit, serve to lock in the advantages of doctoral training prestige, thereby facilitating future
success [130].
In this paper, we demonstrate that structural inequalities extend beyond hiring to affect the recognition
scientists receive from their peers, as measured by the number of citations to their papers. Citations provide
the basis for evaluating the research impact of scientists, a metric widely used in academia to decide whom
to promote, invite as a keynote speaker, or award tenure or a prize. Citations also help the community and
funding agencies identify important research topics. We demonstrate the existence of a gender gap and
a prestige gap in citations that make one group receive substantially less recognition for their work than
the other. Specifically, women, who are a minority in science, receive less recognition than men. However,
researchers from top-ranked institutions, who also form a minority, receive far more recognition than
others; moreover, the smaller and more prestigious the group of researchers from top-ranked institutions,
the larger the citation inequality. These two examples show that disparities in representation alone do
not explain structural inequalities in citations.
To elucidate the origins of these structural inequalities, we present a dynamic model of the growth of
directed citation networks, which shows that inequalities arise as a consequence of biases in the individual
preferences of authors to cite similar (the homophily property of social interactions) and well-connected or
active (preferential attachment) authors, as well as the size of the minority group and how quickly new
authors are integrated into the citation network. The model is simple enough to be analytically tractable,
yet rich enough to capture key elements of real-world dynamics. Our theoretical analysis and empirical
validation reveal a rich array of phenomena as well as strategies for reducing citation disparity.
Researchers have recently begun to study disparities from the perspective of network science. Studies
have shown that the structure of social interactions can systematically disadvantage members of the minority
group by limiting their visibility [77], relative ranking [61], economic opportunities [jackson2021inequality],
and the number of social connections they accrue [7, 8]. These studies have focused on undirected networks
representing professional relationships among scientists, such as who works with whom (collaboration
networks) or who knows whom (affiliation networks). In contrast to undirected networks where
social links are symmetric, we focus on directed citation networks with asymmetric links. We use large
bibliographic databases containing information about scientific publications and the publications they cite to
construct citation networks of authors. A directed edge in the author-citation network means that one
author cites the papers of the other author in their own works, but not necessarily the other way around.
We enrich these networks with additional information about authors and their affiliations. The scale and
scope of the data allow for a longitudinal study of disparities and comparison across academic disciplines.
Reducing structural inequalities is critical to broadening participation in science, which is itself necessary
not only to promote equity, but also to spur innovation on which our economic prosperity depends.
Our analysis shows that merely increasing the size of the minority group, e.g., through hiring policies
or affirmative action, does little to change inequality. Instead, our work suggests the need to change the
culture of scientific recognition by changing individual citation preferences and the inclusion of new authors.
Our work provides a basis for metrics that scientific publishers and academic search engines can use to
audit the content they produce for bias, an important step toward reducing disparities in science.
[Figure 5.1: four panels. (a) Gender Network and (b) Affiliation Network show induced subgraphs for Management; (c) Gender Power-inequality and (d) Affiliation Power-inequality plot power-inequality over 1995–2019 for Management, Physics (APS), Psychology, Political Science, Economics, and Computer Science, relative to the Equality line at 1.0.]
Figure 5.1: Power-inequality in author-citation networks. Induced subgraph (i.e., the subgraph constructed
by picking a subset of nodes and the edges between them) for the Management field where the nodes were
partitioned by (a) gender and (b) prestige of their affiliation. In the gender-partitioned network, red nodes
represent female authors and blue represent male authors. In the affiliation-partitioned network, red nodes
represent authors from top-ranked institutions and blue from other institutions. The induced subgraph is
constructed from the ego-networks of a linked pair of red and blue nodes, which include all the nodes that
cite or are cited by the linked pair of nodes, as well as the links between them. Power-inequality defined
in Eq. (5.1) over the time period 1995–2019 for different fields of study in (c) gender- and (d) affiliation-partitioned
networks. The plotted values show the average of power-inequalities over a sliding window
of four years, and confidence intervals indicate the standard error. The gray lines show the cumulative
power inequality over the years. In (c), all power-inequality values are below 1.0, suggesting that female
authors have less power than male authors. Psychology is the closest field to gender parity. Economics,
Management and Political Science have steadily increasing power after 2010, while Computer Science has
a slightly decreasing trend over time. In (d), the minority (red) group represents authors affiliated with
the top-100 institutions. In this case, all power-inequality values are above 1.0, suggesting that authors
from top-ranked institutions have more power than other authors even though they are the minority.
Psychology again is the field with values closest to 1.0 and therefore is the most institutionally power-equal
field of study. Management has the most inequality, suggesting the affiliation of authors is highly
correlated with their power in this field.
5.2 Results
5.2.1 Power-Inequality in Author-Citation Networks
We constructed author-citation networks from bibliometric data for six fields of study (see Methods). An
author-citation network at a discrete time t = 1, 2, ... is a directed graph G_t = {V_t, E_t} where V_t is
the set of authors and E_t is the set of directed edges between them. An edge from author u to v
exists if the latter cites any of the papers written by the former, i.e., the direction indicates the flow of
information. Hence, the out-degree d_o(v) of an author v ∈ V_t is the number of authors citing v, whereas
the in-degree d_i(v) is the number of people v has cited up to and including time t. We use B_t, R_t ⊆ V_t
to represent a partition of the set of authors into two groups based on some attribute such as gender
(female authors R_t vs. male authors B_t) or prestige of their institutional affiliation (authors affiliated with
top-ranked institutions R_t vs. the rest B_t). Further, the total in-degree (resp. out-degree) of red nodes is
d_i(R_t) = Σ_{v ∈ R_t} d_i(v) (resp. d_o(R_t) = Σ_{v ∈ R_t} d_o(v)), and the total in-degree of all nodes at time t is
Δ_t = Σ_{v ∈ V_t} d_i(v).
Following [7], we use the average degree of a group as a proxy of its power in a graph. However, in
directed graphs, we need to account for the in-degree d_i(v) and the out-degree d_o(v) of each author v.
We define the power of the red group R_t at time t as the ratio of the average out-degree to the average
in-degree of red nodes at time t, and similarly for the blue group. In author-citation networks, the out-degree
represents the amount of recognition an author receives from others, and the in-degree represents
the attention the author pays to other authors. Thus, the power of each group measures the average
recognition the group receives relative to the average recognition the group gives to others. The power-inequality
in the citation network at time t quantifies the disparity in the normalized recognition the two
groups receive as the ratio of the power of the red group to the power of the blue group,

    I_t = [d_o(R_t) / d_i(R_t)] × [d_i(B_t) / d_o(B_t)].    (5.1)

When I_t < 1.0, the red group has relatively less power compared to the blue group at time t. On the
other hand, when I_t > 1.0, the red group has relatively more power than the blue group. Note that in
undirected networks I_t = 1 (at any time t). Thus, our notion of power-inequality specifically captures the
consequences of the directed edges in citation networks.
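Eq. (5.1) is straightforward to compute from group-wise degree totals. A minimal sketch, with hypothetical toy degrees (the node names and numbers below are invented for illustration and are not from the data):

```python
def power_inequality(d_in, d_out, red):
    """I_t of Eq. (5.1): the red group's power (average out-degree over
    average in-degree) relative to the blue group's power."""
    di_R = sum(d_in[v] for v in red)
    do_R = sum(d_out[v] for v in red)
    di_B = sum(d_in.values()) - di_R    # blue totals are the complement
    do_B = sum(d_out.values()) - do_R
    return (do_R / di_R) * (di_B / do_B)

# Hypothetical degrees: each red author is cited 3 times but cites only once.
d_in  = {"r1": 1, "r2": 1, "b1": 4, "b2": 2}   # citations each author gives
d_out = {"r1": 3, "r2": 3, "b1": 1, "b2": 1}   # citations each author receives
print(power_inequality(d_in, d_out, {"r1", "r2"}))  # 9.0
```

Here the red group's power is 6/2 = 3 and the blue group's is 2/6, so I_t = 9: the red group receives nine times more recognition, per unit of recognition given, than the blue group.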
Figure 5.1 illustrates power-inequality. Fig. 5.1(a)-(b) show the induced subgraphs of author-citation
networks in Management, partitioned by gender and affiliation respectively. Each subgraph represents
the ego-networks of a linked pair of authors, one belonging to the red group and the other to the blue.
A node’s ego-network includes all nodes connected to it and all edges between them. Outgoing edges
inherit the color of the author being cited. Although the size of the red group is the same in the two
subgraphs, the different number of red and blue edges suggests disparities in citations: in Fig. 5.1(a) the
majority (blue) group gets a disproportionately large number of citations (compared to its size) and has
more power, whereas in Fig. 5.1(b), the minority (red) group does. We confirm that this observation is also
valid at the full network scale by calculating power-inequality across several fields of study: Management,
Psychology, Physics, Political Science, Economics, and Computer Science. Fig. 5.1(c)-(d) shows power-inequality
over time in gender and affiliation networks. In gender-partitioned networks, women are the
minority (R_t in Eq. (5.1)) and have less power than men (B_t). Political Science and Economics, despite some
progress towards equality in recent years, have the greatest citation disparity. Psychology comes closest
to gender parity in citations, and it is also the most gender balanced, while in Political Science and Economics,
women are a small minority with 34.20% and 28.01% of all authors, respectively. In contrast, in affiliation-partitioned
networks, authors from top-ranked institutions, who form the minority class, have much more power
than the majority class (B_t). This is despite the fact that the size of the minority class in these networks is
smaller than the minority class in gender networks (Table 5.1). This suggests that class imbalance is not
the main cause of power-inequality.
5.2.2 Model of Network Growth
To explore the origins of power-inequality, we present a model of a growing citation network, the Directed
Mixed Preferential Attachment (DMPA) model, which shows how disparities in citations can arise as a
consequence of the interplay between the relative prevalence of the minority group, biases in individual
preferences for citing similar or dissimilar authors (homophily/heterophily), and the rate at which new authors join
the citation network. The proposed model is powerful enough to produce a range of observed phenomena
in citation networks, while also simple enough to be analytically tractable.
The model (see Methods for the detailed description) describes the growth of a bi-populated directed
random graph. At every time step t, one of three types of edges is added. First, with probability p, a new
node (author) appears and an existing node cites it according to the following rules. The new node is assigned
to the red group with probability r (otherwise, it is assigned to the blue group). A potential citing node
is chosen from among the existing nodes by sampling proportional to their in-degrees plus a constant δ_i.
The idea is that an author who is already citing many others is likely to cite the new author. A new citation
edge is created depending on the colors of the nodes, governed by homophily: if both nodes are red (resp. blue),
a link is created with probability γ^(1)_R (resp. γ^(1)_B); otherwise, if the existing node is red (resp. blue) and the
new node is blue (resp. red), a link is added with probability 1 − γ^(1)_R (resp. 1 − γ^(1)_B). Alternately, with probability
q, a new node appears and cites an existing node, chosen from among existing nodes by sampling proportional
to their out-degrees plus a constant δ_o. A new edge is created from the existing author to the new
one based on homophily parameters γ^(2)_B, γ^(2)_R, as above. Finally, with probability 1 − p − q, a new citation
edge appears between existing authors, resulting in densification of the network. Both the citing and cited
authors are chosen independently based on their in- and out-degrees, and they are connected based on homophily
parameters γ^(3)_R, γ^(3)_B. For simplicity we assume that homophily for the different edge creation events
is the same, denoted by γ_B, γ_R. If γ_B > 0.5, the blue group is homophilic, i.e., blue nodes are more likely
to cite other blue nodes. On the other hand, if γ_B < 0.5, the blue group is heterophilic, i.e., blue nodes are
more likely to cite red nodes. A similar interpretation holds for the red group (with the parameter γ_R).
The probability r that a new node is assigned to the red group determines class balance asymptotically:
when r < 0.5, red is the minority; otherwise, it is the majority. We can independently set the minority
and majority groups to be homophilic or heterophilic by adjusting the parameters γ_R, γ_B. In addition, we
can control the growth dynamics of the graph by appropriately choosing p, q, which determine the relative
frequency with which new authors arrive and link to existing authors. For example, setting 1 − p − q > q > p
describes an author-citation network where most of the new citations form between existing authors,
and less frequently new authors cite (or are cited by) existing authors. Finally, the parameter δ controls
the degree of preferential attachment: larger δ corresponds to lower preference for linking to high-degree
nodes.
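The three edge-creation events above can be sketched as a simple simulation. This is an illustrative sketch, not the authors' implementation; one detail the text leaves unspecified is what happens when a homophily draw rejects the edge, and here I assume the existing endpoint is simply resampled until an edge forms.

```python
import random

def simulate_dmpa(T, p, q, r, delta, gamma, rng):
    """Run T steps of a sketched DMPA process; returns node colors and
    in-/out-degree lists. `gamma` maps color -> homophily parameter.
    Assumption: a rejected homophily draw is resampled until an edge forms."""
    color = ["R", "B"]            # two seed nodes, one of each color
    d_in, d_out = [1, 1], [1, 1]  # seed degrees so sampling is well defined

    def sample(deg):              # preferential attachment, offset by delta
        return rng.choices(range(len(deg)),
                           weights=[d + delta for d in deg])[0]

    def accepts(base, other_color):   # homophily of the existing/citing node
        g = gamma[color[base]]
        return rng.random() < (g if color[base] == other_color else 1 - g)

    for _ in range(T):
        u = rng.random()
        if u < p + q:             # a new node arrives
            c = "R" if rng.random() < r else "B"
            # existing endpoint: cites the new node (sampled by in-degree)
            # or is cited by it (sampled by out-degree)
            deg = d_in if u < p else d_out
            while True:
                v = sample(deg)
                if accepts(v, c):
                    break
            color.append(c); d_in.append(0); d_out.append(0)
            new = len(color) - 1
            if u < p:             # existing v cites the new author
                d_in[v] += 1; d_out[new] += 1
            else:                 # new author cites existing v
                d_in[new] += 1; d_out[v] += 1
        else:                     # densification among existing nodes
            while True:
                a, b = sample(d_in), sample(d_out)  # citing, cited
                if a != b and accepts(a, color[b]):
                    break
            d_in[a] += 1; d_out[b] += 1
    return color, d_in, d_out

# Empirical power-inequality of one simulated network (toy parameters).
color, d_in, d_out = simulate_dmpa(5000, p=0.1, q=0.2, r=0.3, delta=10.0,
                                   gamma={"R": 0.8, "B": 0.2},
                                   rng=random.Random(0))
do_R = sum(o for o, c in zip(d_out, color) if c == "R")
di_R = sum(i for i, c in zip(d_in, color) if c == "R")
print((do_R / di_R) * ((sum(d_in) - di_R) / (sum(d_out) - do_R)))
```

With a small, homophilic red group and a heterophilic blue majority, the printed empirical I_t typically comes out well above 1, in line with the "exclusive club" regime discussed below.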
These parameters enable the DMPA model to asymptotically replicate structural properties of many
real-world social networks, including scale-free degree distributions [10], assortative mixing [91, 101], and,
as we show here, power-inequalities. The model generalizes previous models, specifically, a preferential
attachment model for directed networks populated with nodes of a single type [22] and a dynamic model
for bi-populated undirected networks [6, 7]. In addition to being limited to undirected networks, the latter
model also fails to capture network densification via new edges between existing nodes.
5.2.3 Analysis of the Model and Insights
5.2.3.1 Theoretical Analysis
We analyze the DMPA model to study the asymptotic values of power-inequality. First, we define some
additional notation. Let

    α^(i)_t = d_i(R_t) / Δ_t,    α^(o)_t = d_o(R_t) / Δ_t    (5.2)

represent the fractions of the total in-degree and out-degree held by the red group at time t.

Theorem 1 (Almost sure convergence of DMPA model). Consider the DMPA model with the parameters
r, p, q, δ_i = δ_o = δ and γ^(i)_B = γ_B, γ^(i)_R = γ_R for i = 1, 2, 3. Then, there exists δ̄ > 0 such that, for all
δ > δ̄, the state of the system α_t = [α^(i)_t, α^(o)_t]^T converges to a unique value α* = [α^(i)*, α^(o)*]^T ∈ [0, 1]^2 with
probability 1 as time t → ∞.
Theorem 1 states that the normalized sum of in-degrees α^(i)_t and the normalized sum of out-degrees α^(o)_t
of red nodes converge to unique values (for sufficiently large δ) with probability 1. This allows us to conveniently
use the unique asymptotic values to analyze the power-inequality. For simplicity, we assume
that the homophily parameters do not differ across the three edge formation events and that δ_i = δ_o, although
these assumptions can be easily relaxed. The proof of Theorem 1 is in the supplementary material of the paper
[97]. The key idea behind the proof is to first express the evolution of α_t = [α^(i)_t, α^(o)_t]^T as a non-linear
stochastic dynamical system. Then, by using stochastic averaging theory we show that the non-linear
stochastic dynamical system can be approximated by a deterministic ordinary differential equation (ODE)
of the form α̇ = F(α) − α, where F is a contraction map (for sufficiently large δ). Since F is a contraction
map, it has a unique fixed point and the sequence α_t, t = 1, 2, ... converges to that fixed point
(according to the Banach fixed point theorem). Since we know the contraction map F in closed form, the
unique fixed point α* = [α^(i)*, α^(o)*]^T can be easily found via the recursion α_{k+1} = F(α_k) with arbitrary
initial condition α_0 ∈ [0, 1]^2. Thus, Theorem 1 implies that the power-inequality I_t in Eq. 5.1 converges
[Figure 5.2: a 3×3 grid of log-scale plots of power-inequality versus the homophily of the red group, with rows (p, q) = (0.1, 0.2), (0.1, 0.7), (0.7, 0.1), columns γ_B = 0.1 (blue group heterophilic), γ_B = 0.5 (unbiased), γ_B = 0.9 (homophilic), and curves for r = 0.1, 0.3, 0.5, 0.7, 0.9.]
Figure 5.2: Variation of the asymptotic power-inequality I with the homophily of the red group γ_R under
the DMPA model (with δ = 10). The three rows correspond to different values of the parameters p, q
that capture the growth dynamics of the DMPA model. The three columns correspond to three different
values of the homophily parameter of the blue group: γ_B = 0.1 (heterophilic), γ_B = 0.5 (unbiased),
γ_B = 0.9 (homophilic). In each subplot, lines in different colors correspond to various values of the
parameter r that determines the asymptotic size of each group.
to I = α^(o)*(1 − α^(i)*) / [α^(i)*(1 − α^(o)*)]. In addition, the asymptotic value of the power-inequality I for a
specific parameter configuration can be found iteratively. In contrast to [7, 6, 22], which established convergence
of α_t in expectation, we use tools from stochastic averaging theory to establish almost sure
convergence (i.e., convergence with probability 1).
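The iterative procedure, i.e., the recursion α_{k+1} = F(α_k) followed by evaluating I at the fixed point, can be sketched as follows. Since the closed form of the model's F is not reproduced in this section, the lambda below is a stand-in contraction chosen only to demonstrate the iteration; it is not the DMPA map.

```python
def fixed_point(F, x0, tol=1e-12, max_iter=100_000):
    """Banach fixed-point iteration x_{k+1} = F(x_k) for a contraction F."""
    x = x0
    for _ in range(max_iter):
        x_next = F(x)
        if max(abs(a - b) for a, b in zip(x_next, x)) < tol:
            return x_next
        x = x_next
    return x

def power_inequality_from_alphas(a_i, a_o):
    """Asymptotic I = alpha_o* (1 - alpha_i*) / (alpha_i* (1 - alpha_o*))."""
    return a_o * (1 - a_i) / (a_i * (1 - a_o))

# Stand-in contraction (Lipschitz constant 0.2 < 1); the model's actual F
# depends on p, q, r, delta and the homophily parameters.
F = lambda x: (0.3 + 0.2 * x[1], 0.4 + 0.1 * x[0])

a_i, a_o = fixed_point(F, (0.5, 0.5))
print(a_i, a_o, power_inequality_from_alphas(a_i, a_o))
```

Because F is a contraction, the iteration converges geometrically from any starting point in [0, 1]^2, which is why the asymptotic I can be computed cheaply for each parameter configuration in Fig. 5.2.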
5.2.3.2 Calculation and Analysis of Power-inequality
Figure 5.2 shows the power-inequality calculated for different parameters using the iterative method discussed
above. The results demonstrate that the DMPA model can produce a vast array of behaviors. We
briefly present some of the insights from Fig. 5.2.
Real-world networks typically densify as they grow [80], adding new edges between existing nodes.
When a new node does appear, it connects to an existing node (i.e., citing an existing author), and it is rare
for an existing author to cite a new one. The top row in Fig. 5.2 illustrates this scenario, with p = 0.1, q =
0.2. When the blue group is heterophilic (top row, left panel), i.e., it prefers to link to the red group, the
red group is more powerful whenever it is considerably more homophilic than the blue group. Under these
conditions, power-inequality I increases exponentially with the red group’s homophily γ_R. Surprisingly, I
decreases with r, meaning that the smaller the red group, the more powerful it is. This unusual behavior
is especially apparent when the red group is highly homophilic (γ_R = 0.9). Therefore, when one group is
highly homophilic and the other highly heterophilic, the heterophilic group faces a disparity in power, and
increasing its size does not ameliorate the disparity. For an intuition, consider membership in an exclusive
club: those who are already in this selective group prefer to associate with others in it (they are homophilic),
and those who are not also prefer to associate with the exclusive group (they are heterophilic); therefore,
making the exclusive group smaller serves to make it even more powerful. As a result, the most effective
approach for ameliorating power-inequality is to alter individual preferences by changing the homophily
parameters γ_B, γ_R rather than the sizes of the groups. On the other hand, when the blue group is unbiased (top
row, middle panel), the red group is more powerful when γ_R > γ_B, and its power increases with group
size. There is less variation in power-inequality (compared to the top-left panel), suggesting that power
disparity is driven by the difference of the homophily parameters of the two groups. Finally, when the blue
group is highly homophilic (top row, right panel), it is more powerful than the red group (i.e., I ≤ 1), and
power-inequality decreases with γ_R.
The middle row in Fig. 5.2 corresponds to p < 1 − p − q < q, which models networks that grow primarily
via new authors citing existing authors. The insights from the top row
still hold; however, the effect of group size, determined by r, is more pronounced, since more new nodes
are added to the network.
Finally, the bottom row corresponds to the case p > 1 − p − q > q, which models the case where
the citation network grows primarily via existing authors citing new authors rather than other existing
authors.
[Caption of Figure 5.3, displaced in extraction (opening lost): "... for most fields of study. The middle plot shows that minority female authors (red group) are heterophilic (γ_R < 0.5) while the majority male
authors are homophilic (γ_B > 0.5). The opposite is observed in affiliation networks; the authors affiliated
with top-ranked universities (i.e., the minority) are homophilic while the others are heterophilic. The right-most
plot shows that empirically estimated values of the power-inequality are in close agreement with the
values obtained using the DMPA model. This shows that the DMPA model can represent how power-inequality
emerges in real-world networks. Moreover, combining the empirically estimated parameters of
the DMPA model with its theoretical analysis provides insight on strategies that can be used to mitigate the
power-inequality."]
The worst-case values of power-inequality (corresponding to γ_B = 0.1, γ_R = 0.9 and γ_B = 0.9, γ_R =
0.1) become less severe when moving from the top row to the bottom row; this suggests that frequently bringing
in new authors is one potential way to battle the disparities of power-inequality in the presence of extreme
homophilic biases.
5.2.3.3 Connecting to the Real-World Networks
To relate the key insights from the theoretical analysis of the DMPA model to real-world author-citation
networks, we estimated the parameters of the model (see Methods) from data. The empirically estimated
parameters and the values of power-inequality calculated with those parameters are shown in Fig. 5.3 (the full set
of parameters is listed in Table 5.2). In gender-partitioned citation networks (filled symbols in Fig. 5.3),
the theoretically calculated values of power-inequality are in close agreement with their empirical values
measured from the citation networks using Eq. 5.1. Both the empirical and theoretically calculated
values of power-inequality confirm that the minority group (female authors) experiences a disparity in
recognition. Note that the majority group is homophilic (γ_B > 0.5 in Fig. 5.3). Hence, gender-partitioned citation networks generally fall
into the scenario captured by the top-right panel of Fig. 5.2.
In affiliation-partitioned citation networks, the red group (authors from top-ranked institutions) is in
the minority (r < 0.5) in each field. As shown in Fig. 5.3, the minority has more power according to
both the theoretical and empirical values of power-inequality. Although the theoretically calculated values
overestimate power-inequality, they have the same ranking as the empirical values in all fields except
Computer Science and Management. The minority group is strongly homophilic (γ_R > 0.7 in all cases)
and the majority is heterophilic (γ_B < 0.5 in all cases). Further, note that p ≈ q ≪ 1 − p − q […] also reduces the power-inequality. Finally, in cases where
neither homophily, growth dynamics (p, q), nor group size (r) can be controlled, power-inequality could be
reduced by decreasing the importance of preferential attachment (i.e., increasing δ). In practice, this could
be achieved by providing support to researchers and authors in a manner that is not solely dependent on
their past performance.
In affiliation networks, which correspond to the top-left panel of Fig. 5.2, power-inequality is a clear
consequence of homophily: the majority is heterophilic (γ_B < 0.5) and the minority is strongly homophilic
(γ_R > 0.5). Interestingly, Fig. 5.2 (top-left panel) suggests that the smaller the minority, the larger the
power-inequality. This observation is confirmed by Fig. 5.4, which shows power-inequality for different
minority group sizes (corresponding to different values of r). This suggests that the best way to alleviate
citation disparities with respect to institutional prestige is to make citations free of the influence of
author affiliation, for example, by simply highlighting the names of authors and not their affiliations on the
title page of scientific publications. In addition, as with gender networks, encouraging new authors to join
a field and providing them an audience among established authors (i.e., p > q > 1 − p − q), as well as
reducing preferential attachment, also mitigates power-inequality.
5.3 Methods
5.3.1 Data
We used the Microsoft Academic Graph (MAG) API to collect metadata from papers published in Computer
Science, Management, Psychology, Political Science, and Economics since 1990. We also collected metadata
Field of study     |        Gender Network         |      Affiliation Network
                   | # Nodes  Density    % Female  | # Nodes  Density    % Elite
-------------------+-------------------------------+-----------------------------
Management         | 111K     9.9×10⁻⁵   35.19     | 53K      3.0×10⁻⁴   12.19
Physics            | 164K     6.4×10⁻⁴   16.39     | 156K     6.3×10⁻⁴   21.59
Psychology         | 873K     1.7×10⁻⁵   49.65     | 667K     2.9×10⁻⁵   23.18
Political Science  | 1.03M    7.0×10⁻⁶   34.20     | 204K     6.4×10⁻⁵   17.44
Economics          | 1.3M     5.6×10⁻⁵   28.01     | 723K     1.8×10⁻⁴   13.00
Computer Science   | 7.62M    7.5×10⁻⁶   25.79     | 3.27M    3.9×10⁻⁵    9.31

Table 5.1: Information about the data. Data for all fields of study, except Physics, came from Microsoft Academic Graph; the Physics data was provided by the American Physical Society. The number of authors with known gender is larger than the number of authors with known affiliation. The affiliation network has higher density, potentially confounded by the fact that authors with known affiliation are more active in publishing and citing other authors.
of papers published in the journals of the American Physical Society (APS). We consider the authors of these papers to represent the field of Physics. To study gender disparities, we used the GenderAPI (https://gender-api.com/) to get an author's gender from their name. We eliminated authors whose first names are not recognized by the GenderAPI. For each year, we constructed a directed author-citation network where an edge from u to v with weight w indicates author u cited author v w times in the papers u published that year. The majority class are male authors, and the remaining authors form the minority class (female authors).

We also extracted each author's most recent affiliation, removing authors whose affiliations were unavailable. We used the Shanghai University Rankings (SUR, https://www.kaggle.com/mylesoneill/world-university-rankings) to identify authors affiliated with prestigious institutions that were among the top-100 institutions. We constructed author citation networks the same way as above, but now the authors affiliated with prestigious (top-100) institutions form the minority class, and authors affiliated with the remaining institutions are the majority class. Table 5.1 presents statistics of these citation networks.
Figure 5.5: Citation edge creation events considered by the DMPA model of the growth of bi-populated directed networks. The first two events correspond to the appearance of a new node. A new directed edge is created when an existing node cites a new node (Event 1) or the new node cites an existing node (Event 2). Event 3 shows densification of the network via a new edge appearing between existing nodes.
5.3.2 Model

We propose a model of growing bi-populated directed networks that captures the key elements of real-world dynamics, while being simple enough for theoretical analysis. The proposed model has parameters defining class balance (r), growth dynamics of the network (p, q), homophily (E_1, E_2, E_3) and preferential attachment (δ_i, δ_o).

The model parameters E_i, i ∈ {1, 2, 3} are matrices of the form

E_i = \begin{bmatrix} \rho_B^{(i)} & 1-\rho_R^{(i)} \\ 1-\rho_B^{(i)} & \rho_R^{(i)} \end{bmatrix}    (5.3)
The network at time t is denoted by G_t = (V_t, E_t), where the set of nodes V_t can be partitioned into a set of blue nodes B_t and red nodes R_t. The initial graph G_0 corresponds to a 2 × 2 adjacency matrix containing all 1's, though G_0 could be any arbitrary graph, and the asymptotic state of the model does not depend on it.
At each time step t, one of three events happens, as shown in Fig. 5.5:

1. Event 1 (with probability p): An existing node follows the new node
   i. Minority-majority partition: A new node v_t appears and is assigned color red with probability r and blue with probability 1 − r.
   ii. Preferential attachment: An existing node u_t ∈ V_t is chosen by sampling with probability ∝ d_i(u_t) + δ_i.
   iii. Homophily: If both u_t, v_t are red (resp. blue), an edge (v_t, u_t) is added with probability ρ_R^(1) (resp. ρ_B^(1)). Otherwise, if u_t is red (resp. blue) and v_t is blue (resp. red), a link (v_t, u_t) is added with probability 1 − ρ_R^(1) (resp. 1 − ρ_B^(1)).
   iv. Steps ii and iii above are repeated until an outgoing edge is added to v_t.

2. Event 2 (with probability q): The new node follows an existing node
   i. Minority-majority partition: as above.
   ii. Preferential attachment: An existing node u_t ∈ V_t is chosen by sampling with probability ∝ d_o(u_t) + δ_o.
   iii. Homophily: as above.
   iv. Steps ii and iii above are repeated until an incoming edge is added to v_t.

3. Event 3 (with probability 1 − p − q): Densification through new edges between existing nodes
   i. Preferential attachment: An existing node u_t ∈ V_t is chosen by sampling with probability ∝ d_o(u_t) + δ_o, and another node v_t ∈ V_t is chosen by sampling with probability ∝ d_i(v_t) + δ_i.
   ii. Homophily: as above.
   iii. Steps i and ii above are repeated until a new edge is added to the graph.
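The growth process above can be sketched in a few dozen lines. This is a minimal illustration, not the implementation used in the thesis: edge directions follow steps (iv) of each event (the new node gains an outgoing edge in Event 1 and an incoming edge in Event 2), the two-node seed graph is an assumption, and node colors are 1 (red) and 0 (blue).

```python
import random
from collections import defaultdict

def simulate_dmpa(T, r, p, q, rho_R, rho_B, delta_i=1.0, delta_o=1.0, seed=0):
    """Grow a bi-populated directed graph for T steps (DMPA sketch).
    Color 1 = red, 0 = blue; an edge (a, b) points from a to b."""
    rng = random.Random(seed)
    colors = {0: 1, 1: 0}                     # assumed two-node seed: one red, one blue
    edges = [(0, 0), (0, 1), (1, 0), (1, 1)]  # 2x2 all-ones adjacency
    d_in, d_out = defaultdict(int), defaultdict(int)
    for a, b in edges:
        d_out[a] += 1
        d_in[b] += 1

    def pick(deg, delta, exclude=None):
        # preferential attachment: P(node) proportional to deg(node) + delta
        nodes = [n for n in colors if n != exclude]
        return rng.choices(nodes, [deg[n] + delta for n in nodes])[0]

    def accept(src, dst):
        # homophily matrix E = [[rho_B, 1 - rho_R], [1 - rho_B, rho_R]]
        if src == dst:
            return rng.random() < (rho_R if src == 1 else rho_B)
        return rng.random() < ((1 - rho_R) if dst == 1 else (1 - rho_B))

    def add(a, b):
        edges.append((a, b))
        d_out[a] += 1
        d_in[b] += 1

    for _ in range(T):
        x = rng.random()
        if x < p + q:                          # Events 1 & 2: a new node arrives
            v = len(colors)
            colors[v] = 1 if rng.random() < r else 0
            while True:
                if x < p:                      # Event 1: new node gains an outgoing edge
                    u = pick(d_in, delta_i, exclude=v)
                    if accept(colors[v], colors[u]):
                        add(v, u)
                        break
                else:                          # Event 2: new node gains an incoming edge
                    u = pick(d_out, delta_o, exclude=v)
                    if accept(colors[u], colors[v]):
                        add(u, v)
                        break
        else:                                  # Event 3: densification between existing nodes
            while True:
                a, b = pick(d_out, delta_o), pick(d_in, delta_i)
                if accept(colors[a], colors[b]):
                    add(a, b)
                    break
    return edges, colors
```

Each time step adds exactly one edge, so after T steps the graph has the four seed edges plus T new ones.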
5.3.2.1 Estimating model parameters from data

The bibliometric data usually includes only the year of publication, and as a result, we do not know the order in which citation edges appear at a finer temporal resolution. To better match the conditions of the DMPA model, we create an ordered list of edges E_ord as follows. We initialize a set of nodes S with one of the nodes from the largest connected component of the citation graph for 1990. Then, in each iteration of the algorithm (for each year), we shuffle all the edges that have at least one end in set S and append them to the end of list E_ord. We then update set S with the nodes covered by this newly added batch of edges. We continue this process until no edge remains to be added from that year; any remaining edges of the year are dropped, and the procedure continues with the next year.

We use the nodes and edges in the ordered edge list to estimate all parameters of the DMPA model directly from the data, except for the preferential attachment parameter δ, which we estimate through hyper-parameter tuning. Specifically, we use several different values of δ to estimate the remaining parameters of the DMPA model, and use them to calculate the theoretical value of power-inequality. We then choose the value of δ that leads to the power-inequality closest to its empirical value. The parameter estimation procedure is not sensitive to the choice of the initial node used to populate the seed set, nor to the order in which edges are added. Running the procedure multiple times yields parameter estimates with standard deviation less than 10⁻⁴.
5.4 Data Limitation

We use bibliographic data from the Microsoft Academic Graph (MAG) and the American Physical Society (APS). The APS data was used to study Physics, and MAG was used to study the remaining fields. In this section, we discuss potential biases introduced by our data collection and processing methods, as well as their potential impact on our analysis.

To extract an author's gender, we use the Gender-API (https://www.gender-api.com/). This state-of-the-art API infers gender based on the author's name. However, it fails to recognize the gender of authors who use initials instead of their full first names (e.g., "J. Doe" instead of "John Doe") and some Chinese and Korean names. Computer Science has the lowest coverage for gender (i.e., the fraction of authors whose gender was inferred with Gender-API), although the coverage improves over time and reaches 80%. For the other fields of study, the gender of at least 85% of authors was known. We removed all authors whose gender was unknown.
MAG extracts author affiliations from publications and links them to a unique ID. Due to the challenges of automatically extracting affiliations from publications, the coverage of this field in the MAG data is low. In contrast, affiliations in the APS data are specified by authors and stored as strings. We use string processing and normalization to map each affiliation to a unique name, which is then mapped to the ranking. Specifically, taking the full address of the author's institutional affiliation, we remove stop words (e.g., "the") and convert all characters to lower-case. After separating the country, city, institution, and department names of the affiliation (using Python packages such as geonamescache), we string-matched the words "university", "institute", "laboratory" or "college" in different languages (e.g., "università", "universidad", "institut"). We then matched the extracted and normalized institution names to those listed in the Shanghai University Ranking. The coverage of affiliation for the APS data (Physics) is more than 97%. We consider authors without affiliation as not being affiliated with top-ranked institutions. While this could potentially bias the data (for example, when top-ranked authors' affiliations are not known), Fig. 5.4 provides evidence that this is not the case: the monotonicity of the trend for power-inequality suggests that the affiliation data is not systematically biased.
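The normalization pipeline described above can be sketched as follows. This is a simplified, hypothetical stand-in, not the thesis code: the stop-word list, keyword stems, and the helper names are illustrative, and the real pipeline additionally separates country, city, and department names with packages such as geonamescache.

```python
import re

# Hypothetical mini-version of the affiliation normalization described above;
# the stop-word list and keyword stems are illustrative assumptions.
STOP_WORDS = {"the", "of", "at", "de", "la"}
KEYWORDS = ("universi", "institut", "college", "laborator")  # matches e.g. "università"

def normalize_affiliation(raw):
    """Lower-case the raw affiliation string and drop stop words."""
    tokens = [t for t in re.split(r"[^\w]+", raw.lower()) if t and t not in STOP_WORDS]
    return " ".join(tokens)

def is_elite(raw, top100):
    """top100: set of normalized names of top-ranked institutions.
    Unmatched or non-institutional strings fall in the majority group."""
    name = normalize_affiliation(raw)
    if not any(k in name for k in KEYWORDS):
        return False
    return name in top100
```

In this sketch, as in the text, any affiliation that cannot be matched to a top-ranked institution is assigned to the majority group.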
5.5 Parameter Estimation Details

5.5.1 Citation Edge Ordering

In the DMPA model, citation edges form asynchronously and at a finer temporal scale than supported by the empirical data. To better align the data with the model, we assign an ordering to the edges so as to keep track of the new edges formed in the three events (citations to or from a new node, or between existing nodes). We create an ordered list of edges E_ord = [(u_1, v_1), (u_2, v_2), ..., (u_n, v_n)] as follows. Starting with a seed set containing one author from the largest connected component of authors publishing in 1990, we traverse citation edges originating or terminating at these nodes and add them to the edge list E_ord in random order. We then update the seed set by adding the nodes of edges that originate or terminate at the seed nodes, and repeat the procedure until we cover all the years.

Figure 5.6: The coverage of data over the years, as measured by the fraction of authors with a known attribute. The gender attribute has better coverage compared to the affiliation of authors. We discard authors with unknown gender, and we place authors with unknown affiliation in the majority group (not affiliated with top-ranked universities).

The algorithm is specified in more detail below:
Algorithm 2: Ordering of citation edges
  input : graph G
  output: list of edge-ordering E_ord
  construct graph H using the edges of G in year 1990;
  pick one of the authors from the largest connected component of H and name it v;
  initialize seed S = {v} and E_ord = [];
  for y from 1990 to 2020 do
      while there is any potential edge to add from year y do
          define A as the list of edges of year y with at least one endpoint inside the seed set S;
          shuffle list A;
          append the elements of A to the end of E_ord;
          update seed S with any new nodes covered by E_ord;
      end
  end
Note that Algorithm 2 preserves the semi-dynamic ordering, meaning that no edge a appears before edge b in E_ord if a was created after b. Some of the edges in the citation graph G will not appear in the edge list E_ord. However, since these edges are not part of the largest connected component of G, they likely connect to authors outside the field of study. Using this algorithm, we cover at least 97% of the edges forming during the period 1990–2020 in each field of study. We ran the parameter estimation five times, and the standard deviation of the estimated parameters was less than 1 × 10⁻⁴ for all parameters. This suggests that the randomization of the edges does not change the estimated parameters.
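For concreteness, Algorithm 2 can be sketched in Python as below. This is an illustrative re-implementation under an assumed input format (a dict mapping each year to its list of (u, v) edges), not the original thesis code.

```python
import random

def order_edges(edges_by_year, seed_node, seed=0):
    """Algorithm 2 (sketch): produce a semi-dynamic ordering of citation edges.
    edges_by_year maps year -> list of (u, v) edges created in that year."""
    rng = random.Random(seed)
    S = {seed_node}            # seed set of nodes already covered
    E_ord = []
    for year in sorted(edges_by_year):
        remaining = list(edges_by_year[year])
        while True:
            # split off the edges touching the current seed set
            batch, rest = [], []
            for e in remaining:
                (batch if e[0] in S or e[1] in S else rest).append(e)
            if not batch:
                break          # remaining edges of this year are dropped
            rng.shuffle(batch)
            E_ord.extend(batch)
            for u, v in batch:
                S.update((u, v))
            remaining = rest
    return E_ord
```

Edges that never connect to the growing seed component (here, edges outside the component of the chosen seed author) are dropped, just as in the text.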
Next, we estimate model parameters using the new edge ordering. The number of edges generated by the arrival of new nodes (the first two types of citation events) is small, and the estimation of E_1 and E_2 is not computationally stable. As a result, we focus on densification events, where edges form between existing nodes, to estimate E_3, and we assume that E_1 = E_2 = E_3 = E. First, we estimated the parameters r, p, q, φ^(i) and φ^(o) from the data. We used a hyper-parameter tuning technique to estimate the preferential attachment parameter δ. Then we used all the estimated parameters to measure ρ_R and ρ_B, i.e., E.
5.5.2 Estimating Class Balance Parameter r

Parameter r represents the fraction of red nodes. We label authors by their gender or the prestige of their institutional affiliation, as described in the Methods section. For gender-partitioned networks, r is the fraction of authors with female or unisex names. Note that this may overestimate the fraction of female authors. For affiliation-partitioned networks, r is the fraction of authors from top-ranked institutions.
5.5.3 Estimating Edge Formation Rates p, q

To estimate these two parameters, we need to keep track of newly joined nodes over time. Starting from an empty set S, we iterate over all the edges in E_ord, and for each edge (u, v) we add both u and v to set S if they are not in the set already. Given S, parameter p (resp. q) can be estimated by counting the number of outgoing edges from (resp. incoming edges to) the newly added nodes in S. In other words, to estimate p and q respectively, we count the number of times u, the followee, was not yet in set S and the number of times v, the follower, was not yet in the set.
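This counting procedure is a short loop. The sketch below assumes each edge (u, v) means follower v cites followee u (the convention used above); since each model time step adds exactly one edge, dividing the counts by the total number of edges is an assumed but natural way to turn the counts into event rates.

```python
def estimate_p_q(E_ord):
    """Estimate the edge-formation rates p and q from an ordered edge list.
    Each edge (u, v) means follower v cites followee u.
    p ~ fraction of edges whose followee is new; q ~ whose follower is new."""
    S = set()
    new_followee = new_follower = 0
    for u, v in E_ord:
        if u not in S:
            new_followee += 1   # followee u joins the network: event-1-like edge
        if v not in S:
            new_follower += 1   # follower v joins the network: event-2-like edge
        S.update((u, v))
    n = len(E_ord)
    return new_followee / n, new_follower / n
```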
5.5.4 Estimating Preferential Attachment Parameter δ

We find the best value of δ through hyper-parameter tuning. We estimate all parameters using different values of δ ∈ {1, 2, 3, 4, 5, 10, 20, 50, 100, 1000} and use them to calculate the theoretical value of power-inequality I_theoretical, if the convergence condition is satisfied. We select the parameter set that leads to the power-inequality measure closest to its empirical value.
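The selection loop can be sketched as below; `theoretical_I` is a hypothetical caller-supplied routine standing in for the full estimation pipeline, returning the theoretical power-inequality for a given δ, or None when the convergence condition fails.

```python
def tune_delta(deltas, I_empirical, theoretical_I):
    """Grid-search the preferential attachment parameter delta: keep the
    value whose theoretical power-inequality is closest to the empirical
    one. theoretical_I(delta) returns the theoretical power-inequality,
    or None when the convergence condition is not satisfied."""
    best, best_gap = None, float("inf")
    for delta in deltas:
        I_theo = theoretical_I(delta)
        if I_theo is None:          # convergence condition not satisfied
            continue
        gap = abs(I_theo - I_empirical)
        if gap < best_gap:
            best, best_gap = delta, gap
    return best
```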
5.5.5 Estimating Homophily Parameters ρ_R, ρ_B

We estimate ρ_R and ρ_B using the numbers of generated edges. Considering only the edges forming between existing nodes, we define N_rr as the number of edges where both the citing and cited nodes are red, and N_xr as the number of edges where the citing node is red and the cited node can be red or blue. Then we can calculate h_R = N_rr / N_xr from the data. p^(3)_{R_old→R_old} (given in Eq. C.1) and p^(3)_{B_old→R_old} (given in Eq. C.2) are the probabilities of generating a red-to-red and a blue-to-red edge in event 3, respectively. Using those expressions, we can write h_R as:

\frac{N_{rr}}{N_{xr}} = h_R = \frac{p^{(3)}_{R_{old}\to R_{old}}}{p^{(3)}_{R_{old}\to R_{old}} + p^{(3)}_{B_{old}\to R_{old}}} = \frac{a\,\rho_R}{a\,\rho_R + b\,(1-\rho_R)} \;\Rightarrow\; \rho_R = \frac{b\,h_R}{a - a\,h_R + b\,h_R}    (5.4)

where a and b are:

a = \big(\varphi^{(o)}_t + \delta(p+q)r\big)\big(\varphi^{(i)}_t + \delta(p+q)r\big)    (5.5)

b = \big(1 - \varphi^{(o)}_t + \delta(p+q)(1-r)\big)\big(\varphi^{(i)}_t + \delta(p+q)r\big)    (5.6)

Similarly, based on Eq. C.4 and Eq. C.3, we have:

\frac{N_{bb}}{N_{xb}} = h_B = \frac{p^{(3)}_{B_{old}\to B_{old}}}{p^{(3)}_{B_{old}\to B_{old}} + p^{(3)}_{R_{old}\to B_{old}}} = \frac{c\,\rho_B}{c\,\rho_B + d\,(1-\rho_B)} \;\Rightarrow\; \rho_B = \frac{d\,h_B}{c - c\,h_B + d\,h_B},    (5.7)

where c and d are:

c = \big(1 - \varphi^{(o)}_t + \delta(p+q)(1-r)\big)\big(1 - \varphi^{(i)}_t + \delta(p+q)(1-r)\big)    (5.8)

d = \big(\varphi^{(o)}_t + \delta(p+q)r\big)\big(1 - \varphi^{(i)}_t + \delta(p+q)(1-r)\big)    (5.9)

Parameter φ^(i) is the fraction of the total in-degree held by the red group and φ^(o) is the fraction of the total out-degree held by the red group; both can be estimated from data. Parameters r, p, and q can be estimated as described above. So, we can estimate ρ_R and ρ_B using the other parameters and equations 5.4 and 5.7. Note that we are assuming citation networks have fixed parameters over time (i.e., φ^(i), φ^(o), r, ...). However, this is not the case in real citation networks; the ratio of female/elite authors, for example, changes over time. This could explain the disparity between the theoretical and empirical power-inequality in Table 5.2.
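Putting Eqs. (5.4)–(5.9) together, the closed-form inversion is a few lines of arithmetic. The sketch below assumes a single preferential-attachment parameter δ for both in- and out-attachment, matching the hyper-parameter tuning above; the round-trip in the usage note follows directly from the forward relation h_R = aρ_R / (aρ_R + b(1 − ρ_R)).

```python
def estimate_homophily(h_R, h_B, phi_i, phi_o, r, p, q, delta):
    """Invert Eqs. (5.4) and (5.7): recover the homophily parameters
    rho_R, rho_B from the observed edge fractions h_R = N_rr / N_xr and
    h_B = N_bb / N_xb, given the other estimated DMPA parameters."""
    red = delta * (p + q) * r            # attachment mass of the red group
    blue = delta * (p + q) * (1 - r)     # attachment mass of the blue group
    a = (phi_o + red) * (phi_i + red)              # Eq. (5.5)
    b = (1 - phi_o + blue) * (phi_i + red)         # Eq. (5.6)
    c = (1 - phi_o + blue) * (1 - phi_i + blue)    # Eq. (5.8)
    d = (phi_o + red) * (1 - phi_i + blue)         # Eq. (5.9)
    rho_R = b * h_R / (a - a * h_R + b * h_R)      # Eq. (5.4)
    rho_B = d * h_B / (c - c * h_B + d * h_B)      # Eq. (5.7)
    return rho_R, rho_B
```

Feeding in h_R, h_B generated from known ρ_R, ρ_B recovers those values exactly, which is a quick sanity check on the algebra.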
Gender Network Parameters
                Management  Physics (APS)  Psychology  Political Science  Economics  Computer Science
T               1.19M       17.23M         12.79M      7.24M              94.40M     435.66M
δ               1000        1000           4           1000               100        20
r               0.35        0.16           0.50        0.34               0.28       0.26
p               0.025       0.001          0.030       0.067              0.004      0.005
q               0.058       0.008          0.032       0.064              0.009      0.012
ρ_R             0.46        0.48           0.54        0.47               0.48       0.55
ρ_B             0.61        0.62           0.57        0.67               0.62       0.57
I_empirical     0.70        0.76           0.94        0.61               0.66       0.80
I_theoretical   0.74        0.72           0.82        0.69               0.65       0.87

Affiliation Network Parameters
                Management  Physics (APS)  Psychology  Political Science  Economics  Computer Science
T               842.85K     15.25M         12.61M      2.61M              91.51M     419.24M
δ               1000        1000           1000        10                 1000       1000
r               0.12        0.22           0.24        0.18               0.13       0.09
p               0.012       0.001          0.024       0.020              0.001      0.001
q               0.048       0.009          0.026       0.053              0.006      0.007
ρ_R             0.83        0.72           0.74        0.72               0.81       0.85
ρ_B             0.29        0.44           0.47        0.44               0.30       0.31
I_empirical     3.11        1.52           1.11        1.86               2.31       1.91
I_theoretical   3.53        1.94           1.79        3.32               3.89       4.54

Table 5.2: Estimated parameters of the DMPA model. Parameters for the different fields of study are estimated from Microsoft Academic Graph data, except for Physics, which is estimated from data provided by the American Physical Society (APS).
5.6 Discussion

We studied inequalities in scientific citations that lead one group of authors of scientific publications (women, or researchers from less prestigious institutions) to receive less recognition for their work than the advantaged group (men, or researchers from top-ranked institutions). To explain these disparities, we proposed a model of the growth of citation networks which captures biases in authors' individual preferences to cite others who are similar to them, well-recognized, or highly active. The model also specifies the relative frequency with which new authors join. Its predictions align closely with empirical observations, indicating that the model's mechanism is useful for understanding gender- and prestige-based disparities in citations.

Are these disparities in recognition detrimental to science? Some have argued that inequalities are ingrained in the structure of scientific rewards, which are skewed to channel the biggest rewards to a few top performers through mechanisms such as cumulative advantage [134]. These skewed reward mechanisms benefit science by incentivizing researchers to produce outstanding work, which earns them placement at the more prestigious institutions. However, recent literature has called the link between prestige and merit into question. Research has shown that inequalities in individual productivity arise due to the cumulative benefits of early career placement rather than individual merit [130]. Moreover, prestige, rather than the quality of scientific ideas, affects how quickly and widely they spread [95]. Structural inequalities in recognition also likely contribute to the gender gap in science. Although women researchers publish at a similar rate as men, they tend to leave academia sooner [58], potentially because citation disparities lower their scientific impact. Since hiring and promotion decisions depend on metrics of impact, the power-inequality we demonstrate could fundamentally limit women's opportunities for professional advancement regardless of their inherent merit, motivation, or ability. Structural inequalities, therefore, reduce the pool of available talent and decrease the diversity of the scientific workforce, which limits innovation by reducing the creativity and productivity of research [103, 123, 133]. To mitigate the effects of power-inequality in citations, our analysis suggests that reducing biases in citation patterns, e.g., by requiring authors to cite researchers from underrepresented groups, is more effective than increasing the size of the underrepresented group, e.g., by hiring. In addition, bringing in new authors and giving them an accessible platform among established authors, as well as reducing preferential attachment by providing some degree of support for new authors, are other helpful strategies for reducing inequality.

Finally, we also note that citation disparities may be amplified algorithmically by academic search engines, such as Google Scholar, which highlight the authors and papers with the most citations. To reduce bias, search engines should diversify search results or highlight papers by non-privileged groups. Academic publishers can also play an important role in reducing power-inequality, for example, by offering incentives to diversify citations and highlight new authors. Our work provides insights into measures and metrics publishers can use to reduce structural inequalities.
Chapter 6

Directed Mixed Preferential Attachment & Perception Bias
6.1 Introduction

Assume the given directed graph G = (V, E), with vertex set V and edge set E, is generated using the DMPA model (Chapter 5). There are two groups of nodes in the network: red nodes and blue nodes. We represent node v's group as f(v): f(v) = 1 means v is a red node and f(v) = 0 means v is a blue node. The DMPA model has parameters defining class balance (r), growth dynamics of the network (p, q), homophily (E) and preferential attachment (δ_i, δ_o). We assume the homophily matrices of the three events are equal. Then the homophily matrix E is:

E = \begin{bmatrix} \rho_B & 1-\rho_R \\ 1-\rho_B & \rho_R \end{bmatrix}    (6.1)

where ρ_R is the homophily of the red group and ρ_B is the homophily of the blue group. We define:

\varphi^{(i)} = \frac{d_i(R)}{d}, \qquad \varphi^{(o)} = \frac{d_o(R)}{d},    (6.2)

where d_i(R) = \sum_{v \in R} d_i(v), d_o(R) = \sum_{v \in R} d_o(v) and d = \sum_{v \in V} d_i(v). As shown in [97], we have a closed-form solution for calculating φ^(i) and φ^(o) using the model parameters.

In Chapter 4 we defined three random variables X, Y and Z as random node, random friend and random follower, respectively. We used those variables to define B_global and B_local as the global and local perception bias. The goal of this chapter is to connect Chapters 4 and 5. More specifically, I am looking for closed-form solutions for B_global and B_local in terms of the DMPA parameters. Throughout this chapter, I assume the graph G is generated using the DMPA model.
6.2 Global Perception Bias

As defined in equation 4.8, the global perception bias is:

B_{global} = E\{f(Y)\} - E\{f(X)\}

For DMPA-generated networks, the expected value of f(v) for a random node is \sum_{v \in V} \frac{1}{N} f(v) = r, and for a random friend it is \sum_{v \in V} \frac{d_o(v)}{d} f(v) = \varphi^{(o)}. So, for DMPA-generated networks, the global perception bias is:

B_{global} = \varphi^{(o)} - r    (6.3)
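Eq. (6.3) is easy to check numerically on any colored edge list. The sketch below is an illustration, assuming edges (u, v) point friend → follower, so that a node's out-degree counts its followers (the weighting used for the random friend Y above).

```python
from collections import Counter

def global_perception_bias(edges, colors):
    """Compute B_global = phi_o - r. Edges (u, v) point friend -> follower,
    so a node's out-degree counts its followers; phi_o is the share of
    total out-degree held by red nodes (color 1), r the fraction of red."""
    d_out = Counter(u for u, v in edges)
    d = len(edges)
    phi_o = sum(deg for node, deg in d_out.items() if colors[node] == 1) / d
    r = sum(colors.values()) / len(colors)
    return phi_o - r
```

For instance, one red node holding 2 of 3 edges' out-degree in a 3-node graph gives φ^(o) = 2/3, r = 1/3, B_global = 1/3.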
6.3 Local Perception Bias

As defined in equation 4.10, the local perception bias is:

B_{local} = E\{q(X)\} - E\{f(X)\}

where q(v) = \frac{1}{d_i(v)} \sum_{u \in N(v)} f(u) is the fraction of red nodes among the friends of node v. As shown in 4.11, we have:

E\{q(X)\} = \bar{d}\, E\{f(U) A(V) \mid (U,V) \sim \mathrm{Uniform}(E)\}    (6.4)

where A(v) = \frac{1}{d_i(v)} is the attention node v pays to its friends and \bar{d} = d/N is the average degree. We know E[AB] = E[A]E[B] + \mathrm{Cov}(A, B). So:

E\{f(U)A(V) \mid (U,V) \sim \mathrm{Uniform}(E)\} = E[f(U) \mid (U,V) \sim \mathrm{Uniform}(E)]\, E[A(V) \mid (U,V) \sim \mathrm{Uniform}(E)] + \mathrm{Cov}(f(U), A(V)) = E[f(Y)]\, E[A(Z)] + \mathrm{Cov}(f(U), A(V))

As shown previously, E[f(Y)] = \varphi^{(o)} and E[A(Z)] = \sum_{v \in V} \frac{d_i(v)}{d} \cdot \frac{1}{d_i(v)} = \frac{N}{d} = \frac{1}{\bar{d}}, so:

E\{f(U)A(V) \mid (U,V) \sim \mathrm{Uniform}(E)\} = \varphi^{(o)} \frac{1}{\bar{d}} + \mathrm{Cov}(f(U), A(V))    (6.5)

We know \mathrm{Cov}(A,B) = E[(A - E[A])(B - E[B])], so for \mathrm{Cov}(f(U), A(V)) we have:

\mathrm{Cov}(f(U), A(V)) = E\big[(f(U) - E[f(Y)])(A(V) - E[A(Z)])\big] = E\big[(f(U) - \varphi^{(o)})(A(V) - \tfrac{1}{\bar{d}})\big]

By the law of total expectation we have:

\mathrm{Cov}(f(U), A(V)) = E\big[(f(U) - \varphi^{(o)})(A(V) - \tfrac{1}{\bar{d}}) \mid f(U)=1\big] P(f(U)=1) + E\big[(f(U) - \varphi^{(o)})(A(V) - \tfrac{1}{\bar{d}}) \mid f(U)=0\big] P(f(U)=0)
 = E\big[(1 - \varphi^{(o)})(A(V) - \tfrac{1}{\bar{d}}) \mid f(U)=1\big] \varphi^{(o)} + E\big[(0 - \varphi^{(o)})(A(V) - \tfrac{1}{\bar{d}}) \mid f(U)=0\big] (1 - \varphi^{(o)})
 = \varphi^{(o)}(1 - \varphi^{(o)})\, E\big[A(V) - \tfrac{1}{\bar{d}} \mid f(U)=1\big] - \varphi^{(o)}(1 - \varphi^{(o)})\, E\big[A(V) - \tfrac{1}{\bar{d}} \mid f(U)=0\big]
 = \varphi^{(o)}(1 - \varphi^{(o)}) \big(E[A(V) \mid f(U)=1] - E[A(V) \mid f(U)=0]\big)    (6.6)

By substituting equation 6.6 into equation 6.5, we have:

E\{f(U)A(V) \mid (U,V) \sim \mathrm{Uniform}(E)\} = \varphi^{(o)} \frac{1}{\bar{d}} + \varphi^{(o)}(1 - \varphi^{(o)}) \big(E[A(V) \mid f(U)=1] - E[A(V) \mid f(U)=0]\big)

And by substituting this into Eq. 6.4, we have:

E\{q(X)\} = \bar{d}\, E\{f(U)A(V) \mid (U,V) \sim \mathrm{Uniform}(E)\} = \varphi^{(o)} + \bar{d}\, \varphi^{(o)}(1 - \varphi^{(o)}) \big(E[A(V) \mid f(U)=1] - E[A(V) \mid f(U)=0]\big)

Since E\{f(X)\} = r and B_{global} = \varphi^{(o)} - r, the local perception bias is:

B_{local} = E\{q(X)\} - E\{f(X)\}
 = \varphi^{(o)} + \bar{d}\, \varphi^{(o)}(1 - \varphi^{(o)}) \big(E[A(V) \mid f(U)=1] - E[A(V) \mid f(U)=0]\big) - r
 = B_{global} + \bar{d}\, \varphi^{(o)}(1 - \varphi^{(o)}) \big(E[A(V) \mid f(U)=1] - E[A(V) \mid f(U)=0]\big)    (6.7)
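The definition of B_local can likewise be checked directly on a graph. The sketch below is illustrative: it assumes edges (u, v) point friend → follower (so a node's friends are its in-neighbors, matching q(v) and A(v) = 1/d_i(v) above), and assigning q(v) = 0 to nodes with no friends is an assumed convention.

```python
from collections import defaultdict

def local_perception_bias(edges, colors):
    """Compute B_local = E[q(X)] - E[f(X)]. Edges (u, v) point
    friend -> follower, so node v's friends are its in-neighbors; nodes
    with no friends contribute q(v) = 0 (assumed convention)."""
    friends = defaultdict(list)
    for u, v in edges:
        friends[v].append(u)
    N = len(colors)
    # q(v): fraction of red nodes among v's friends
    q_sum = sum(
        sum(colors[u] for u in friends[v]) / len(friends[v])
        for v in colors if friends[v]
    )
    r = sum(colors.values()) / N
    return q_sum / N - r
```

For a 3-node example with one red node that is a friend of both blue nodes, E[q(X)] = 0.5 and r = 1/3, so B_local = 1/6.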
To obtain a final closed-form solution, we need to write E[A(V) | f(U) = 1] and E[A(V) | f(U) = 0] in terms of model parameters. These two expectations represent the average attention from the follower of a red or blue friend, respectively. The expected value E[A(V) | f(U) = 1] is the weighted average of 1/k over all nodes v with d_i(v) = k. The weight of each 1/k is the probability of v having in-degree k given that v's friend (i.e., u) is red. So:

E[A(V) \mid f(U) = 1] = \sum_{k=1}^{n-1} \frac{1}{k}\, P_{(u,v)\in E}\big(d_i(v) = k \mid f(u) = 1\big)    (6.8)
In our paper we showed that the variables defined in Appendix C can be estimated using the model parameters [96]. Assuming that we have the values of the four variables p^{(1)}_{R_{new}\to(B_{old},k)}, p^{(1)}_{R_{new}\to(R_{old},k)}, p^{(3)}_{R_{new}\to(B_{old},k)} and p^{(3)}_{R_{new}\to(R_{old},k)} using the method described in our paper [96], for event one we have:

p^{(1)}_{R_{new}\to(B_{old},k)} = P\big(f(v) = 0,\, d_i(v) = k \mid f(u) = 1,\, \text{event} = 1\big)
p^{(1)}_{R_{new}\to(R_{old},k)} = P\big(f(v) = 1,\, d_i(v) = k \mid f(u) = 1,\, \text{event} = 1\big)

By the law of total probability we know P(A \mid B) = P(A, C \mid B) + P(A, C^c \mid B), so:

P\big(d_i(v) = k \mid f(u) = 1,\, \text{event} = 1\big) = p^{(1)}_{R_{new}\to(B_{old},k)} + p^{(1)}_{R_{new}\to(R_{old},k)}    (6.9)
Similarly, for event three we have:

p^{(3)}_{R_{new}\to(B_{old},k)} = P\big(f(u) = 1,\, d_i(v) = k,\, f(v) = 0 \mid \text{event} = 3\big)
p^{(3)}_{R_{new}\to(R_{old},k)} = P\big(f(u) = 1,\, d_i(v) = k,\, f(v) = 1 \mid \text{event} = 3\big)

So we have P(f(u) = 1, d_i(v) = k \mid \text{event} = 3) = p^{(3)}_{R_{new}\to(B_{old},k)} + p^{(3)}_{R_{new}\to(R_{old},k)}. We know P(A \mid B, C) = \frac{P(A, B \mid C)}{P(B \mid C)}. Also, we know P(f(u) = 1 \mid \text{event} = 3) = P(f(u) = 1) = \varphi^{(o)} (because of the independence of event 3 and the color of node u). So:

P\big(d_i(v) = k \mid f(u) = 1,\, \text{event} = 3\big) = \frac{p^{(3)}_{R_{new}\to(R_{old},k)} + p^{(3)}_{R_{new}\to(B_{old},k)}}{\varphi^{(o)}}    (6.10)

We know P(A \mid B) = P(A \mid B, C) P(C \mid B) + P(A \mid B, C^c) P(C^c \mid B). Then, combining 6.9 and 6.10, and since P(f(u) = 1) and the event are independent, we have:

P\big(d_i(v) = k \mid f(u) = 1\big) = P\big(d_i(v) = k \mid f(u) = 1,\, \text{event} = 1\big) P(\text{event} = 1) + P\big(d_i(v) = k \mid f(u) = 1,\, \text{event} = 3\big) P(\text{event} = 3)
 = p\,\big(p^{(1)}_{R_{new}\to(B_{old},k)} + p^{(1)}_{R_{new}\to(R_{old},k)}\big) + (1-p-q)\,\frac{p^{(3)}_{R_{new}\to(R_{old},k)} + p^{(3)}_{R_{new}\to(B_{old},k)}}{\varphi^{(o)}}    (6.11)
Similarly, for f(u) = 0 we have:

P\big(d_i(v) = k \mid f(u) = 0\big) = p\,\big(p^{(1)}_{B_{new}\to(R_{old},k)} + p^{(1)}_{B_{new}\to(B_{old},k)}\big) + (1-p-q)\,\frac{p^{(3)}_{B_{new}\to(B_{old},k)} + p^{(3)}_{B_{new}\to(R_{old},k)}}{1-\varphi^{(o)}}    (6.12)
We can write B_local (from equation 6.7) as:

B_{local} = B_{global} + \bar{d}\,\varphi^{(o)}(1-\varphi^{(o)}) \big(E[A(V) \mid f(U)=1] - E[A(V) \mid f(U)=0]\big)
 = B_{global} + \bar{d}\,\varphi^{(o)}(1-\varphi^{(o)}) \sum_{k=1}^{n-1} \frac{1}{k} \Big(P\big(d_i(v)=k \mid f(u)=1\big) - P\big(d_i(v)=k \mid f(u)=0\big)\Big)

Finally, by substituting 6.11 and 6.12 into this equation, we can write B_local in terms of the model parameters and estimated variables as:

B_{local} = B_{global}
 + p\,\bar{d}\,\varphi^{(o)}(1-\varphi^{(o)}) \sum_{k=1}^{n-1} \frac{1}{k} \Big(p^{(1)}_{R_{new}\to(B_{old},k)} + p^{(1)}_{R_{new}\to(R_{old},k)} - p^{(1)}_{B_{new}\to(R_{old},k)} - p^{(1)}_{B_{new}\to(B_{old},k)}\Big)
 + (1-p-q)\,\bar{d}\,(1-\varphi^{(o)}) \sum_{k=1}^{n-1} \frac{1}{k} \Big(p^{(3)}_{R_{new}\to(R_{old},k)} + p^{(3)}_{R_{new}\to(B_{old},k)}\Big)
 - (1-p-q)\,\bar{d}\,\varphi^{(o)} \sum_{k=1}^{n-1} \frac{1}{k} \Big(p^{(3)}_{B_{new}\to(B_{old},k)} + p^{(3)}_{B_{new}\to(R_{old},k)}\Big)    (6.13)
6.4 Results

Figure 6.1 shows the global and local perception bias in terms of the DMPA model parameters. For both figures, δ = 100. For the global perception bias, we use equation (6.3) to calculate B_global directly from the model parameters. For the local perception bias, we generate five different networks with each parameter set and calculate B_local, as defined in equation (4.10), on the generated networks. The standard deviations are included in figure 6.1b.

There is a small difference between the values of B_global and B_local in the plots. As shown in equation (6.7), this means the covariance between the color of the friend and the attention of the follower is small in the DMPA model. This makes sense because the DMPA model does not account for this correlation: in generating edges, we do not consider the friends-of-the-follower v when deciding whether v should follow u or not.

Even though the difference between B_global and B_local is small, interestingly, there is a positive and significant correlation between the homophily of the two groups (ρ_R and ρ_B) and the value of |B_global| − |B_local|: 0.40 and 0.43 Pearson correlation, respectively, with p-values smaller than 10⁻⁹. This means that lower values of the homophily parameters exaggerate the local perception bias in comparison with the global perception bias. Here the word "exaggerate" means that if the global bias is negative (positive), the local bias is going to be more negative (positive) when the graph is heterophilic (for both the red and blue groups).

Now focusing on the effect of the DMPA model parameters on perception bias: when there is no preference for attachment between two nodes based on group membership (ρ_R = ρ_B = 0.5), there is no bias in estimating the prevalence of the red group. When ρ_B = 0.9, the prevalence of the red group is underestimated most of the time (except when the red group is in the majority, r = 0.9, and has high homophily, ρ_R = 0.9). When the blue nodes prefer to follow red nodes (ρ_B = 0.1), the prevalence of red nodes is overestimated.

Another interesting observation is the role of class balance (r) in the over- or underestimation of the prevalence of the red group. In the right column of figure 6.1a, the change of B_global in terms of ρ_R is smooth when the red group is in the minority (r < 0.5). However, when the red group is in the majority (r > 0.5), the underestimation of the red group diminishes quickly as the homophily of the red group increases. This suggests that when the blue group is heterophilic and in the majority, changing only the homophily of the red group is not going to change the perception bias. On the other hand, when the red group is in the majority, it has the power to change the perception bias by increasing the homophily of its own group (i.e., increasing ρ_R).
[Figure 6.1 shows B_global (panel a) and B_local (panel b) as functions of ρ_R ∈ [0.1, 0.9], for panels with (p, q) ∈ {(0.1, 0.2), (0.1, 0.7), (0.7, 0.1)} and ρ_B ∈ {0.1, 0.5, 0.9}, with separate curves for r ∈ {0.1, 0.3, 0.5, 0.7, 0.9}.]

Figure 6.1: Global and local perception bias in terms of DMPA model parameters. (a) Global perception bias. (b) Local perception bias.
6.5 Global Perception Bias & Power-Inequality

Figure 6.2 shows the correlation between global perception bias and power-inequality. B_global and power-inequality were computed for 750 different model parameter settings. For all parameter sets, δ = 100, while r, ρ_R, ρ_B ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, p ∈ {0.1, 0.7} and q ∈ {0.1, 0.2, 0.7}. The Pearson correlation coefficient between the two metrics is 0.71 with a p-value less than 10⁻¹¹⁴. The color in figure 6.2a indicates whether the red group is more homophilic than the blue group; the color in figure 6.2b indicates whether the probability of event 1 is higher than the probability of event 2. The homophily parameters affect whether power-inequality and global perception bias are positive or negative, while the probabilities of events 1 and 2 affect their magnitude. There are rare parameter settings where the prevalence of the red group is overestimated (underestimated) while the blue (red) group is more powerful.
(a) Homophily parameters (b) Events 1 & 2 probabilities
Figure 6.2: Global perception bias and power inequality for 750 different model parameter settings (x-axis: power inequality on a log scale; y-axis: B_global). Each point corresponds to one parameter set. The color of a point in figure (a) shows the effect of the homophily parameters on the two metrics, and in figure (b) the effect of the probability of event 1 (p) and the probability of event 2 (q). Higher ρ_R than ρ_B increases the power of the red group and makes the perception bias positive, while more of event 2 than event 1 (q > p) amplifies both power inequality and perception bias. So, when new nodes cite an existing node more often than existing nodes cite a new node, the power inequality is amplified.
There is no clear pattern explaining when this disparity between power inequality and global perception bias happens. However, considering only the cases in which the red group is in the minority (r < 0.5), the data points in the bottom-right region of the plot have highly homophilic red and blue groups (ρ_R, ρ_B ∈ {0.7, 0.9}). Data points in the upper-left region of the plot mostly have heterophilic red and blue groups (ρ_B ∈ {0.1, 0.3}, and ρ_R < 0.5 most of the time).
The following proposition is true for all graphs (not only DMPA-generated graphs):
Proposition: If the red group is powerful (N_rb > N_br), then the global perception bias is positive if and only if there is a sufficiently large gap between the average attention a red node receives from blue nodes and the average attention a blue node receives from red nodes:

    N_rb/|R| - N_br/|B| > N_bb/|B| - N_rr/|R|    (6.14)

where N_xy is the number of times nodes with color y cite/follow nodes with color x (the flow of information), and |R| (|B|) is the number of red (blue) nodes. Figure 6.3 is a schematic visualization of this equation. While this proposition gives the general condition for the existence of global perception bias given the existence of power inequality, it remains unclear why equation 6.14 fails to hold for the highly homophilic graphs in the bottom-right part of the DMPA-generated networks (figure 6.2).
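The inequality in the proposition can be checked numerically from the four edge counts and the two group sizes. A minimal sketch (the function name and the example counts are illustrative, not taken from the thesis):

```python
def gap_condition(N_rr, N_rb, N_br, N_bb, n_red, n_blue):
    """Inequality (6.14): the gap in per-capita cross-group attention
    (red-from-blue minus blue-from-red) must exceed the gap in
    per-capita within-group attention (blue-from-blue minus red-from-red).

    N_xy counts citations/follows of color-x nodes by color-y nodes;
    n_red and n_blue are the group sizes |R| and |B|.
    """
    cross_gap = N_rb / n_red - N_br / n_blue
    within_gap = N_bb / n_blue - N_rr / n_red
    return cross_gap > within_gap

# A powerful red minority (N_rb > N_br) that satisfies the condition:
assert gap_condition(N_rr=30, N_rb=50, N_br=20, N_bb=80, n_red=10, n_blue=40)
# Densifying the blue group's intra-edges breaks it, as in the
# underestimated case of figure 6.3(b):
assert not gap_condition(N_rr=30, N_rb=50, N_br=20, N_bb=400, n_red=10, n_blue=40)
```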
(a) Overestimated powerful minority (b) Underestimated powerful minority
Figure 6.3: Schematic representation of equation 6.14. In both figures the red group is a minority (r < 0.5) and powerful (N_rb > N_br). In figure (a), B_global is positive, meaning that the minority group is overestimated, while in (b) the minority group is underestimated (B_global < 0). Equation (6.14) holds for (a) but does not hold for (b). The reason the red group is underestimated in (b) is that there are many intra-group edges in the blue group, which leads the blue majority to underestimate the size of the red group.
Chapter 7
Conclusions
Various types of biases in heterogeneous and networked data are defined and measured in this thesis. These biases range from Simpson’s paradox and latent components in heterogeneous data to perception bias and power inequality in networked data. Many examples of these biases are drawn from real-world data, such as Simpson’s paradoxes in Stack Overflow, Duolingo, and Khan Academy (chapter 2), latent components in the Wine Quality, NYC Housing, and LA County valence data (chapter 3), perception bias in Twitter hashtag networks (chapter 4), and power inequality in the Microsoft Academic and APS author citation networks (chapter 5).
The real-world examples of these biases make it clear that biases need to be taken into account to draw robust and generalizable observations from data and, as a result, to build better models. For some of these biases, a mitigation strategy is proposed. Chapter 2 presents a data-driven strategy for preventing wrong conclusions about data. Chapter 3 proposes a way to take the diversity of subgroups into account to obtain robust and generalizable models and data analyses. In Chapter 4, we use the perception bias to estimate the prevalence of an attribute more accurately (with a smaller MSE). In Chapter 5, our analysis suggests that the effect of power inequality in citations can be mitigated by reducing biases in citation patterns (e.g., by requiring authors to cite researchers from underrepresented groups), and that this is more effective than increasing the size of the underrepresented group (e.g., by hiring).
All in all, this thesis covers a small portion of the biases that exist in real-world data. There is still much to be done. The illusion of causality still exists in data analysis [85], and interpretable methods that take the biases of data into account can prevent such fallacies. On the other hand, a lack of social context is the reason why repurposing algorithmic solutions designed for one social context may be misleading and inaccurate when applied to a different context [116]; another way to describe this is a failure to consider the underlying causality of the new social context. Despite recent advances in discovering the underlying causal model of data [107], this remains an open question. As long as finding the causal structure of data requires expert knowledge, the discovery, measurement, and mitigation of biases in data from different domains need careful design and consideration. That is why finding bias in different datasets (like the range of datasets considered in this thesis) and in different settings (like the tabular and networked data considered in this thesis) remains an open question.
References
[1] Jessica P Abel, Cheryl L Buff, and Sarah A Burr. “Social media and the fear of missing out: Scale development and assessment”. In: Journal of Business & Economics Research (Online) 14.1 (2016), p. 33.
[2] Tushar Agarwal, Keith Burghardt, and Kristina Lerman. “On Quitting: Performance and Practice
in Online Game Play”. In: Proceedings of 11th AAAI International Conference on Web and Social
Media. AAAI, 2017.url: https://arxiv.org/abs/1703.04696.
[3] Nazanin Alipourfard, Peter G Fennell, and Kristina Lerman. “Can you Trust the Trend?:
Discovering Simpson’s Paradoxes in Social Data”. In: Proceedings of the Eleventh ACM
International Conference on Web Search and Data Mining. ACM. 2018, pp. 19–27.
[4] Nazanin Alipourfard, Peter G Fennell, and Kristina Lerman. “Using Simpson’s Paradox to
Discover Interesting Patterns in Behavioral Data”. In: Twelfth International AAAI Conference on
Web and Social Media. 2018.
[5] Nazanin Alipourfard, Buddhika Nettasinghe, Andrés Abeliuk, Vikram Krishnamurthy, and
Kristina Lerman. “Friendship paradox biases perceptions in directed networks”. In: Nature
communications 11.1 (2020), pp. 1–9.
[6] Chen Avin, Hadassa Daltrophe, Barbara Keller, Zvi Lotker, Claire Mathieu, David Peleg, and
Yvonne-Anne Pignolet. “Mixed preferential attachment model: Homophily and minorities in
social networks”. In: Physica A: Statistical Mechanics and its Applications (2020), p. 124723.
[7] Chen Avin, Barbara Keller, Zvi Lotker, Claire Mathieu, David Peleg, and Yvonne-Anne Pignolet. “Homophily and the glass ceiling effect in social networks”. In: Proceedings of the 2015 conference on innovations in theoretical computer science. 2015, pp. 41–50.
[8] Chen Avin, Zvi Lotker, Yinon Nahum, and David Peleg. “Modeling and Analysis of Glass Ceiling
and Power Inequality in Bi-populated Societies”. In: International Conference and School on
Network Science. Springer. 2017, pp. 61–73.
[9] J. S. Baer, A. Stacy, and M. Larimer. “Biases in the perception of drinking norms among college
students.” In: Journal of studies on alcohol 52.6 (Nov. 1991), pp. 580–586.
[10] Albert-László Barabási and Réka Albert. “Emergence of scaling in random networks”. In: science
286.5439 (1999), pp. 509–512.
[11] Samuel Barbosa, Dan Cosley, Amit Sharma, and Roberto M. Cesar-Jr. “Averaging Gone Wrong:
Using Time-Aware Analyses to Better Understand Behavior”. In: (Apr. 2016), pp. 829–841.
[12] Elias Bareinboim and Judea Pearl. “Causal inference and the data-fusion problem”. In: Proceedings
of the National Academy of Sciences 113.27 (2016), pp. 7345–7352.url:
http://dx.doi.org/10.1073/pnas.1510507113.
[13] Elias Bareinboim and Judea Pearl. “Causal inference and the data-fusion problem”. In: Proceedings
of the National Academy of Sciences 113 (July 2016), pp. 7345–7352.doi: 10.1073/pnas.1510507113.
[14] Arthur G Bedeian, David E Cavazos, James G Hunt, and Lawrence R Jauch. “Doctoral degree
prestige and the academic marketplace: A study of career mobility within the management
discipline”. In: Academy of Management Learning & Education 9.1 (2010), pp. 11–25.
[15] Fabrício Benevenuto, Alberto H. F. Laender, and Bruno L. Alves. “The H-index paradox: your
coauthors have a higher H-index than you do”. In: Scientometrics 106.1 (Jan. 2016), pp. 469–474.
issn: 1588-2861.doi: 10.1007/s11192-015-1776-2.
[16] Alan D Berkowitz. “An overview of the social norms approach”. In:Changingthecultureofcollege
drinking: A socially situated health communication campaign (2005), pp. 193–214.
[17] P. J. Bickel, E. A. Hammel, and J. W. O’Connell. “Sex Bias in Graduate Admissions: Data from
Berkeley”. In: Science 187.4175 (Feb. 1975), pp. 398–404.
[18] Peter J Bickel, Eugene A Hammel, and J William O’Connell. “Sex bias in graduate admissions:
Data from Berkeley”. In: Science 187.4175 (1975), pp. 398–404.
[19] Colin R. Blyth. “On Simpson’s Paradox and the Sure-Thing Principle”. In: Journal of the American
Statistical Association 67.338 (1972), pp. 364–366.
[20] Colin R. Blyth. “On Simpson’s paradox and the sure-thing principle”. In: Journal of the American
Statistical Association 67.338 (1972), pp. 364–366.
[21] Johan Bollen, Bruno Gonçalves, Guangchen Ruan, and Huina Mao. “Happiness Is Assortative in
Online Social Networks”. In: Articial Life 17.3 (May 2011), pp. 237–251.
[22] Béla Bollobás, Christian Borgs, Jennifer Chayes, and Oliver Riordan. “Directed scale-free graphs”.
In: Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms. Society for
Industrial and Applied Mathematics. 2003, pp. 132–139.
[23] Robert M Bond, Christopher J Fariss, Jason J Jones, Adam DI Kramer, Cameron Marlow,
Jaime E Settle, and James H Fowler. “A 61-million-person experiment in social inuence and
political mobilization”. In: Nature 489.7415 (2012), pp. 295–298.
[24] Keith Burghardt, Emanuel F. Alsina, Michelle Girvan, William Rand, and Kristina Lerman. “The
myopia of crowds: Cognitive load and collective evaluation of answers on Stack Exchange”. In:
PLOS ONE 12.3 (Mar. 2017), e0173610+.doi: 10.1371/journal.pone.0173610.
[25] Val Burris. “The academic caste system: Prestige hierarchies in PhD exchange networks”. In:
American sociological review 69.2 (2004), pp. 239–264.
[26] Diego Calderon, Brendan Juba, Zongyi Li, and Lisa Ruan. “Conditional Linear Regression”. In: Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
[27] Yang Cao and Sheldon M Ross. “THE FRIENDSHIP PARADOX.” In: Mathematical Scientist 41.1
(2016).
[28] Anne Case and Angus Deaton. Deaths of Despair and the Future of Capitalism. Princeton
University Press, 2020.
[29] George Casella and Roger L Berger. Statistical inference. Vol. 2. Duxbury Pacific Grove, CA, 2002.
[30] Jihui Chen, Myongjin Kim, and Qihong Liu. “Do Female Professors Survive the 19th-century Tenure System?: Evidence from the Economics Ph.D. Class of 2008”. In: SSRN Electronic Journal. https://doi.org/10.2139/ssrn.2885951 (2016).
[31] Justin Cheng, Jon Kleinberg, Jure Leskovec, David Liben-Nowell, Bogdan State, Karthik Subbian, and Lada Adamic. “Do Diffusion Protocols Govern Cascade Growth?” In: Proceedings of the International Conference on the Web and Social Media. 2018.
[32] John S. Chuang, Olivier Rivoire, and Stanislas Leibler. “Simpson’s Paradox in a Synthetic
Microbial System”. In: Science 323.5911 (2009), pp. 272–275.issn: 0036-8075.doi:
10.1126/science.1166739. eprint: https://science.sciencemag.org/content/323/5911/272.full.pdf.
[33] Aaron Clauset, Samuel Arbesman, and Daniel B Larremore. “Systematic inequality and hierarchy
in faculty hiring networks”. In: Science advances 1.1 (2015), e1400005.
[34] William S. Cleveland and Susan J. Devlin. “Locally Weighted Regression: An Approach to
Regression Analysis by Local Fitting”. In: Journal of the American Statistical Association 83.403
(1988), pp. 596–610.doi: 10.1080/01621459.1988.10478639.
[35] Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. “Modeling wine
preferences by data mining from physicochemical properties”. In: Decision Support Systems 47.4
(2009), pp. 547–553.
[36] Sergio Currarini, Matthew O Jackson, and Paolo Pin. “An economic model of friendship:
Homophily, minorities, and segregation”. In: Econometrica 77.4 (2009), pp. 1003–1045.
[37] Anirban Dasgupta, Ravi Kumar, and D Sivakumar. “Social sampling”. In: Proceedings of the 18th
ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2012,
pp. 235–243.
[38] Jeffrey Dastin. Amazon scraps secret AI recruiting tool that showed bias against women. url: https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G.
[39] Wayne S DeSarbo and William L Cron. “A maximum likelihood methodology for clusterwise
linear regression”. In: Journal of classication 5.2 (1988), pp. 249–282.
[40] Michelle L Dion, Jane Lawrence Sumner, and Sara McLaughlin Mitchell. “Gendered citation patterns across political science and social science methodology fields”. In: Political Analysis 26.3 (2018), pp. 312–327.
[41] Young-Ho Eom and Hang-Hyun Jo. “Generalized friendship paradox in complex networks: The case of scientific collaboration”. In: Scientific Reports 4 (Apr. 2014). issn: 2045-2322. doi: 10.1038/srep04603.
[42] William K. Estes. “The problem of inference from curves based on group data.” In: Psychological
bulletin 53.2 (1956), p. 134.
[43] Ernesto Estrada. “Network robustness to targeted attacks. The interplay of expansibility and
degree distribution”. In: The European Physical Journal B-Condensed Matter and Complex Systems
52.4 (2006), pp. 563–574.
[44] Carem C Fabris and Alex A Freitas. “Discovering Surprising Patterns by Detecting Occurrences of Simpson’s Paradox”. In: Research and Development in Intelligent Systems XVI. Ed. by Max Bramer, Ann Macintosh, and Frans Coenen. Springer London, 2000, pp. 148–160. doi: 10.1007/978-1-4471-0745-3_10.
[45] Scott L. Feld. “Why Your Friends Have More Friends Than You Do”. In: American Journal of
Sociology 96.6 (May 1991), pp. 1464–1477.
[46] Emilio Ferrara, Nazanin Alipourfard, Keith Burghardt, Chiranth Gopal, and Kristina Lerman.
“Dynamics of Content Quality in Collaborative Knowledge Production”. In: Proceedings of 11th
AAAI International Conference on Web and Social Media. AAAI, 2017.
[47] John Fox. Applied regression analysis, linear models, and related methods. Thousand Oaks, CA:
Sage Publications, Inc, 1997.
[48] Jacqueline M Fulvio, Ileri Akinnola, and Bradley R Postle. “Gender (im) balance in citation
practices in cognitive neuroscience”. In: Journal of Cognitive Neuroscience 33.1 (2020), pp. 3–7.
[49] Zoubin Ghahramani and Michael I Jordan. “Supervised learning from incomplete data via an EM
approach”. In: Advances in neural information processing systems. 1994, pp. 120–127.
[50] Greg Ver Steeg, Rumi Ghosh, and Kristina Lerman. “What stops social epidemics?” In:Proceedings
of 5th International Conference on Weblogs and Social Media. 2011.
[51] James J. Heckman. “Sample Selection Bias as a Specification Error”. In: Econometrica 47.1 (1979), pp. 153–161.
[52] Miguel A. Hernan, David Clayton, and Niels Keiding. “Simpson’s paradox unraveled”. In: Int J
Epidemiology 40.3 (2011), pp. 780–785.
[53] Desmond J Higham. “Centrality-friendship paradoxes: when our friends are more important than
us”. In: Journal of Complex Networks 7.4 (Nov. 2018), pp. 515–528.issn: 2051-1329.
[54] Nathan Hodas, Farshad Kooti, and Kristina Lerman. “Friendship Paradox Redux: Your Friends Are
More Interesting Than You”. In: Proc. 7th Int. AAAI Conf. on Weblogs And Social Media. 2013.
[55] Nathan O. Hodas and Kristina Lerman. “How Limited Visibility and Divided Attention Constrain
Social Contagion”. In: ASE/IEEE International Conference on Social Computing. 2012.
[56] Nathan O. Hodas and Kristina Lerman. “The Simple Rules of Social Contagion”. In: Scientific Reports 4 (2014). doi: 10.1038/srep04343.
[57] David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. “Applied logistic regression”.
In: vol. 398. John Wiley & Sons, 2013.
[58] Junming Huang, Alexander J Gates, Roberta Sinatra, and Albert-László Barabási. “Historical
comparison of gender inequality in scientic careers across countries and disciplines”. In:
Proceedings of the National Academy of Sciences 117.9 (2020), pp. 4609–4616.
[59] Matthew O Jackson. “The friendship paradox and systematic biases in perceptions and social
norms”. In: Journal of Political Economy 127.2 (2019), pp. 777–818.
[60] Hang-Hyun Jo and Young-Ho Eom. “Generalized friendship paradox in networks with tunable
degree-attribute correlation”. In: Physical Review E 90.2 (2014), p. 022809.
[61] Fariba Karimi, Mathieu Génois, Claudia Wagner, Philipp Singer, and Markus Strohmaier. “Homophily influences ranking of minorities in social networks”. In: Scientific reports 8.1 (2018), pp. 1–12.
[62] Maxwell Mirton Kessler. “Bibliographic coupling between scientic papers”. In: American
documentation 14.1 (1963), pp. 10–25.
[63] Rogier Kievit, Willem Eduard Frankenhuis, Lourens Waldorp, and Denny Borsboom. “Simpson’s
paradox in psychological science: a practical guide”. In: Frontiers in psychology 4 (2013), p. 513.
[64] J. P. Kincaid, R. P. Fishburne Jr., R. L. Rogers, and B. S. Chissom. Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for navy enlisted personnel. Tech. rep. U.S. Navy, 1975.
[65] James A. Kitts. “Egocentric Bias or Information Management? Selective Disclosure and the Social
Roots of Norm Misperception”. In: Social Psychology Quarterly 66.3 (2003), pp. 222–237.
[66] Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan.
Human decisions and machine predictions. Tech. rep. National Bureau of Economic Research, 2017.
[67] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. “Inherent trade-offs in the fair determination of risk scores”. In: arXiv preprint arXiv:1609.05807 (2016).
[68] Jon M Kleinberg. “Authoritative sources in a hyperlinked environment”. In: Journal of the ACM
(JACM) 46.5 (1999), pp. 604–632.
[69] Farshad Kooti, Nathan O. Hodas, and Kristina Lerman. “Network Weirdness: Exploring the
Origins of Network Paradoxes”. In: International Conference on Weblogs and Social Media
(ICWSM). Mar. 2014.
[70] Farshad Kooti, Karthik Subbian, Winter Mason, Lada Adamic, and Kristina Lerman.
“Understanding Short-term Changes in Online Activity Sessions”. In: Proceedings of the 26th
International World Wide Web Conference (Companion WWW2017). 2017.
[71] William H. Kruskal and W. Allen Wallis. “Use of Ranks in One-Criterion Variance Analysis”. In:
Journal of the American Statistical Association 47.260 (1952), pp. 583–621.doi:
10.1080/01621459.1952.10483441.
[72] Vineet Kumar, David Krackhardt, and Scott Feld. Network interventions based on inversity:
Leveraging the friendship paradox in unknown network structures. Tech. rep. Working Paper, Yale
University, 2018.
[73] Preethi Lahoti, Krishna P Gummadi, and Gerhard Weikum. “iFair: Learning individually fair data representations for algorithmic decision making”. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE. 2019, pp. 1334–1345.
[74] J Larson, S Mattu, L Kirchner, and J Angwin. “Compas analysis”. In: GitHub, available at: https://github.com/propublica/compas-analysis (2016).
[75] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-László Barabási, Devon Brewer,
Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, and Tony Jebara.
“Computational Social Science”. In: Science 323 (2009), pp. 721–723.
[76] Eun Lee, Fariba Karimi, Claudia Wagner, Hang-Hyun Jo, Markus Strohmaier, and Mirta Galesic.
“Homophily and minority-group size explain perception biases in social networks”. In: Nature
human behaviour (2019), pp. 1–10.
[77] Eun Lee, Fariba Karimi, Claudia Wagner, Hang-Hyun Jo, Markus Strohmaier, and Mirta Galesic.
“Homophily and minority-group size explain perception biases in social networks”. In: Nature
human behaviour 3.10 (2019), pp. 1078–1087.
[78] Kristina Lerman. “Information Is Not a Virus, and Other Consequences of Human Cognitive
Limits”. In: Future Internet 8.2 (May 2016), pp. 21+.doi: 10.3390/fi8020021.
[79] Kristina Lerman, Xiaoran Yan, and Xin-Zeng Wu. “The “majority illusion” in social networks”. In: PloS one 11.2 (2016), e0147617.
[80] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. “Graph evolution: Densification and shrinking diameters”. In: ACM Transactions on Knowledge Discovery from Data (TKDD) 1.1 (2007), p. 2.
[81] Bing Liu. Web data mining: exploring hyperlinks, contents, and usage data. Springer Science &
Business Media, 2007.
[82] Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. “Causal Effect Inference with Deep Latent-Variable Models”. In: Advances in Neural Information Processing Systems 30. Ed. by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Curran Associates, Inc., 2017, pp. 6446–6456. url: http://papers.nips.cc/paper/7223-causal-effect-inference-with-deep-latent-variable-models.pdf.
[83] Fragkiskos D Malliaros and Michalis Vazirgiannis. “Clustering and community detection in
directed networks: A survey”. In: Physics Reports 533.4 (2013), pp. 95–142.
[84] Naresh Manwani and PS Sastry. “K-plane regression”. In: Information Sciences 292 (2015),
pp. 39–56.
[85] Helena Matute, Fernando Blanco, Ion Yarritu, Marcos Diaz-Lago, Miguel A Vadillo, and
Itxaso Barberia. “Illusions of causality: how they bias our everyday thinking and how they could
be reduced”. In: Frontiers in Psychology 6 (2015), p. 888.
[86] J.H. McDonald. Handbook of Biological Statistics. 3rd. Baltimore: Sparky House Publishing, 2014,
pp. 145–156.
[87] Daniel McFadden et al. In: Conditional logit analysis of qualitative choice behavior. Institute of
Urban and Regional Development, University of California, 1973, p. 121.
[88] Daniel McFadden et al. In: Quantitative methods for analyzing travel behavior of individuals: some
recent developments. Institute of Transportation Studies, University of California, 1977, p. 307.
[89] Daniel A McFarland, Kevin Lewis, and Amir Goldberg. “Sociology in the era of big data: The
ascent of forensic social science”. In: The American Sociologist 47.1 (2016), pp. 12–35.
[90] Robert A. McLean, William L. Sanders, and Walter W. Stroup. “A Unied Approach to Mixed
Linear Models”. In: The American Statistician 45.1 (1991), pp. 54–64.doi:
10.1080/00031305.1991.10475767.
[91] Miller McPherson, Lynn Smith-Lovin, and James M Cook. “Birds of a feather: Homophily in
social networks”. In: Annual review of sociology 27.1 (2001), pp. 415–444.
[92] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. “A
survey on bias and fairness in machine learning”. In: arXiv preprint arXiv:1908.09635 (2019).
[93] Dale T. Miller and Deborah A. Prentice. “Collective Errors and Errors about the Collective”. In:
Personality and Social Psychology Bulletin 20.5 (Oct. 1994), pp. 541–550.
[94] I Minchev, G Matijevic, DW Hogg, G Guiglion, M Steinmetz, F Anders, C Chiappini, M Martig,
A Queiroz, and C Scannapieco. “Yule-Simpson’s paradox in Galactic Archaeology”. In: arXiv
preprint arXiv:1902.01421 (2019).
[95] Allison C Morgan, Dimitrios J Economou, Samuel F Way, and Aaron Clauset. “Prestige drives epistemic inequality in the diffusion of scientific ideas”. In: EPJ Data Science 7.1 (2018), p. 40.
[96] Buddhika Nettasinghe, Nazanin Alipourfard, Vikram Krishnamurthy, and Kristina Lerman. “A Directed, Bi-Populated Preferential Attachment Model with Applications to Analyzing the Glass Ceiling Effect”. In: arXiv preprint arXiv:2103.12149 (2021).
[97] Buddhika Nettasinghe, Nazanin Alipourfard, Vikram Krishnamurthy, and Kristina Lerman. “Emergence of Structural Inequalities in Scientific Citation Networks”. In: arXiv preprint arXiv:2103.10944 (2021).
[98] Buddhika Nettasinghe and Vikram Krishnamurthy. ““What Do Your Friends Think?”: Efficient Polling Methods for Networks Using Friendship Paradox”. In: IEEE Transactions on Knowledge and Data Engineering (2019).
[99] M. Newman. Networks: An Introduction. OUP Oxford, 2010.isbn: 9780191500701.url:
https://books.google.com/books?id=LrFaU4XCsUoC.
[100] M. E. J. Newman. “Assortative Mixing in Networks”. In: Phys. Rev. Lett. 89 (20 Oct. 2002),
p. 208701.doi: 10.1103/PhysRevLett.89.208701.
[101] Mark EJ Newman. “Assortative mixing in networks”. In: Physical review letters 89.20 (2002),
p. 208701.
[102] H. James Norton and George Divine. “Simpson’s paradox . . . and how to avoid it”. In: Significance 12.4 (Aug. 2015), pp. 40–43. doi: 10.1111/j.1740-9713.2015.00844.x.
[103] Scott E Page. The Difference: How the power of diversity creates better groups, firms, schools and societies. Princeton, 2007.
[104] H. Pashler and E. Wagenmakers. “Editors’ Introduction to the Special Section on Replicability in Psychological Science: A Crisis of Confidence?” In: Perspectives on Psychological Science 7.6 (2012), pp. 528–530. doi: 10.1177/1745691612465253.
[105] Raymond Paternoster, Robert Brame, Paul Mazerolle, and Alex Piquero. “Using the correct statistical test for the equality of regression coefficients”. In: Criminology 36.4 (1998), pp. 859–866.
[106] Keith Payne. The broken ladder: How inequality affects the way we think, live, and die. Penguin, 2017.
[107] Judea Pearl et al. “Causal inference in statistics: An overview”. In: Statistics surveys 3 (2009),
pp. 96–146.
[108] Anne Marie Porter and Rachel Ivie. “Women in Physics and Astronomy, 2019. Report.” In: AIP
Statistical Research Center (2019).
[109] Deborah A. Prentice and Dale T. Miller. “Pluralistic ignorance and alcohol use on campus: Some
consequences of misperceiving the social norm.” In: Journal of Personality and Social Psychology
64.2 (1993), pp. 243–256.
[110] Rajesh Ranganath and Adler J. Perotte. “Multiple Causal Inference with Latent Confounding”. In:
ArXiv preprint: 1805.08273 (2018).
[111] Manuel Gomez Rodriguez, Krishna Gummadi, and Bernhard Schoelkopf. “Quantifying
information overload in social media and its impact on social contagions”. In: Eighth International
AAAI Conference on Weblogs and Social Media. 2014.
[112] Daniel M. Romero, Brendan Meeder, and Jon Kleinberg. “Differences in the Mechanics of Information Diffusion Across Topics: Idioms, Political Hashtags, and Complex Contagion on Twitter”. In: Proceedings of the 20th International Conference on World Wide Web. ACM, 2011, pp. 695–704.
[113] Clara O Ross, Aditya Gupta, Ninareh Mehrabi, Goran Muric, and Kristina Lerman. “The Leaky
Pipeline in Physics Publishing”. In: arXiv preprint arXiv:2010.08912 (2020).
[114] David M Rothschild and Justin Wolfers. “Forecasting elections: Voter intentions versus
expectations”. In: (2011).
[115] Venu Satuluri and Srinivasan Parthasarathy. “Symmetrizations for clustering directed graphs”. In:
Proceedings of the 14th International Conference on Extending Database Technology. ACM. 2011,
pp. 343–354.
[116] Andrew D Selbst, Danah Boyd, Sorelle A Friedler, Suresh Venkatasubramanian, and Janet Vertesi.
“Fairness and abstraction in sociotechnical systems”. In: Proceedings of the conference on fairness,
accountability, and transparency. 2019, pp. 59–68.
[117] B. Settles and B. Meeder. “A Trainable Spaced Repetition Model for Language Learning”. In:
Proceedings of the Association for Computational Linguistics (ACL). ACL, 2016, pp. 1848–1858.doi:
10.18653/v1/P16-1174.
[118] Ricardo AM da Silva and Francisco de AT de Carvalho. “On Combining Clusterwise Linear Regression and K-Means with Automatic Weighting of the Explanatory Variables”. In: International Conference on Artificial Neural Networks. Springer. 2017, pp. 402–410.
[119] Ricardo AM da Silva and Francisco de AT de Carvalho. “On Combining Fuzzy C-Regression
Models and Fuzzy C-Means with Automated Weighting of the Explanatory Variables”. In: 2018
IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). IEEE. 2018, pp. 1–8.
[120] Philipp Singer, Emilio Ferrara, Farshad Kooti, Markus Strohmaier, and Kristina Lerman.
“Evidence of Online Performance Deterioration in User Sessions on Reddit”. In: PLOS ONE 11.8
(Aug. 2016), pp. 1–16.doi: 10.1371/journal.pone.0161636.
[121] Philipp Singer, Emilio Ferrara, Farshad Kooti, Markus Strohmaier, and Kristina Lerman.
“Evidence of Online Performance Deterioration in User Sessions on Reddit”. In: PLoS ONE 11.8
(2016), e0161636+.doi: 10.1371/journal.pone.0161636.
[122] Paul E. Smaldino and Richard McElreath. “The natural selection of bad science”. In: Royal Society
Open Science 3.9 (2016), p. 160384.doi: 10.1098/rsos.160384. eprint:
https://royalsocietypublishing.org/doi/pdf/10.1098/rsos.160384.
[123] Laurel Smith-Doerr, Sharla N Alegria, and Timothy Sacco. “How diversity matters in the US science and engineering workforce: A critical review considering integration in teams, fields, and organizational contexts”. In: Engaging Science, Technology, and Society 3 (2017), pp. 139–153.
[124] Helmuth Späth. “Algorithm 39 clusterwise linear regression”. In: Computing 22.4 (1979),
pp. 367–373.
[125] Hsi Guang Sung. “Gaussian mixture regression and classication”. PhD thesis. Rice University,
2004.
[126] J. W. Vaupel and A. I. Yashin. “Heterogeneity’s ruses: some surprising effects of selection on population dynamics.” In: The American Statistician 39.3 (1985), pp. 176–185. doi: 10.2307/2683925.
[127] Yixin Wang and David M. Blei. “The Blessings of Multiple Causes”. In: ArXiv preprint: 1805.06826
(2018).
[128] A. B. Warriner, V. Kuperman, and M. Brysbaert. “Norms of valence, arousal, and dominance for
13,915 english lemmas”. In: Behavior research methods 45.4 (2013), pp. 1191–1207.
[129] S. F. Way, D. B. Larremore, and A. Clauset. “Gender, productivity, and prestige in computer science faculty hiring networks”. In: Proceedings of the 25th International Conference on World Wide Web (Apr. 2016), pp. 1169–1179.
[130] Samuel F Way, Allison C Morgan, Daniel B Larremore, and Aaron Clauset. “Productivity, prominence, and the effects of academic environment”. In: Proceedings of the National Academy of Sciences 116.22 (2019), pp. 10729–10733.
[131] Richard G Wilkinson and Kate E Pickett. “Income inequality and social dysfunction”. In: Annual
review of sociology 35 (2009), pp. 493–511.
[132] Bodo Winter. “A very basic tutorial for performing linear mixed eects analyses”. In: arXiv
preprint arXiv:1308.5499 (2013).
[133] Anita Williams Woolley, Christopher F. Chabris, Alex Pentland, Nada Hashmi, and
Thomas W. Malone. “Evidence for a collective intelligence factor in the performance of human
groups”. In: Science 330.6004 (Oct. 2010), pp. 686–688.issn: 00368075.doi:
10.1126/science.1193147.
[134] Yu Xie. ““Undemocracy”: inequalities in science”. In: Science 344.6186 (2014), pp. 809–810.
[135] Yu Xie. “Population heterogeneity and causal inference”. In: Proceedings of the National Academy
of Sciences 110.16 (2013), pp. 6262–6268.doi: 10.1073/pnas.1303102110.
Appendix A
Datasets
A.1 Simpson’s Disaggregation
A.1.1 Stack Exchange
We study answerer performance on Stack Exchange (SE). Launched in 2008 as a forum for asking computer programming questions, Stack Exchange has grown to encompass a variety of technical and non-technical topics. Any user can ask a question, which others may answer. Users can vote for answers they find helpful, but only the asker can accept one of the answers as the best answer to the question. We used anonymized data representing all answers to questions posted on Stack Exchange from August 2008 until September 2014.∗ Approximately half of the 9.6M questions had an accepted answer, and we included in the study questions that received two or more answers [24].
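The inclusion rule can be expressed as a one-line filter. This is a sketch with an assumed in-memory representation (a dict mapping each question id to its list of answer records); the real data come from the archived Stack Exchange dump.

```python
def eligible_questions(answers_by_question):
    """Keep only questions with two or more answers, per the study's filter.

    answers_by_question: dict mapping question_id -> list of answer records.
    """
    return {qid: answers
            for qid, answers in answers_by_question.items()
            if len(answers) >= 2}

sample = {"q1": ["a1"], "q2": ["a1", "a2", "a3"]}
assert list(eligible_questions(sample)) == ["q2"]
```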
To understand factors aecting user performance on SE, we study the relationship between the various
features extracted from data and the outcome, here a binary attribute denoting whether the answer written
by a user is accepted by the asker as best answer to his or her question. Features include the numbers of
words, hyperlinks, and lines of code the answer contains, and its Flesch readability score [64]. Features
describing answerers are their reputation, tenure on SE (in seconds and in terms of percentile rank) and
the total number of answers written during their tenure. These features relate to user experience. We
∗
https://archive.org/details/stackexchange
147
also use activity-related features, including time since previous answer written by the user, session length,
giving the number of answers user writes during the session, and answer position within that session. We
dene a session as a period of activity without a break of 100 minutes of longer. Previous studies of Stack
Exchange [46] and other online platforms [2, 121], identied user sessions as an important variable in
understanding performance.
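The session-based features above can be derived with a single pass over each user's chronologically sorted answer timestamps. The sketch below is a minimal illustration (timestamps in seconds; function and variable names are illustrative, not taken from the study's code): it assigns each answer a session id and its position within the session, from which session length follows.

```python
def sessionize(timestamps, max_gap=100 * 60):
    """Split a user's sorted activity timestamps (in seconds) into sessions.

    A new session starts whenever the break since the previous answer
    is 100 minutes or longer. Returns, for each answer, a tuple
    (session id, position within session); session length is then the
    number of answers sharing a session id.
    """
    sessions = []
    session_id, position = 0, 0
    for i, t in enumerate(timestamps):
        if i > 0 and t - timestamps[i - 1] >= max_gap:
            # break of 100 minutes or longer: start a new session
            session_id += 1
            position = 0
        sessions.append((session_id, position))
        position += 1
    return sessions
```

For example, `sessionize([0, 60, 7000, 7100])` places the first two answers in one session and the last two in another, since the 6940-second gap exceeds the 100-minute threshold.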
To this end, for each answer in the data set, we create a list of features describing the answer, as well as features describing the user writing the answer:
Reputation: Answerer's reputation at the time he or she posted the answer. This score summarizes the user's cumulative contributions to Stack Exchange.
Number of answers: Cumulative number of answers written by the user at the time the current answer was posted.
Tenure: Age of the user's account (in seconds) at the time the user posted the answer.
Percentile: User's percentile rank based on tenure.
Time since previous answer: Time interval (in seconds) since the user's previous answer.
Session length: The length of the session (in number of answers posted) during which the answer was posted.
Answer position: Index of the answer within a session.
Words: Number of words in the answer.
Lines of code: Number of lines of code in the answer.
URLs: Number of hyperlinks in the answer.
Readability: Answer's Flesch Reading Ease [64] score.
A.1.2 Khan Academy

Khan Academy (KA, https://www.khanacademy.org) is an educational platform offering online tools to help students learn a variety of subjects. Students progress by watching short videos and completing exercises that involve solving problems. We studied an anonymized dataset, collected over two years, which contains information about attempts by KA adult students to solve problems. We partitioned student activity into sessions, again defined as a sequence of problems without a break of more than one hour between them. The vast majority of students completed only a single session. As an outcome variable in this data, we take student performance on a problem, a binary variable equal to one when the student solved the problem correctly on the first try, and zero otherwise (either did not solve it correctly, or used hints). To study factors affecting performance, we extracted features of problems and users. These included the overall solving time during user activity, the total solve time and the number of attempts made to solve the problem, the time since the previous problem (tspp), the number of sessions prior to the current one, all sessions the user contributed to, the session length in terms of the number of problems solved, the problem position within the session (session index of the problem), the timestamp of the attempt, including the month, day of week, type of weekday (whether it is a weekend or not) and hour the student attempted to solve the problem, the month the student joined KA, his or her tenure, the number of all attempts on all problems solved since joining, and how many of the problems were solved correctly on the first attempt. As a proxy for skill or background knowledge the student brings, we use how many problems were correctly solved during the student's first five attempts to solve problems. For example, the least prepared students answered few of the five problems they attempted to solve, while the best students would have solved all five correctly.
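The skill proxy described above can be computed per student from their chronologically ordered outcomes. This is a sketch; the list-of-0/1 input format is an assumption for illustration.

```python
def skill_proxy(first_try_correct, k=5):
    """Background-knowledge proxy: the number of problems a student solved
    correctly on the first try among the first k problems they attempted
    (k = 5 in the text).

    first_try_correct: chronologically ordered 0/1 outcomes, one per problem.
    """
    return sum(first_try_correct[:k])
```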
A.1.3 Duolingo

Duolingo (DL) is an online language learning platform that allows users to learn dozens of different languages. DL offers a gamified learning environment, where users progress through levels by practicing vocabulary and dictation skills. The DL half-life regression [117] dataset (https://github.com/duolingo/halflife-regression) follows a subset of learners over a period of two weeks. Users are shown vocabulary words and asked to recall them correctly. Users may be shown between 7 and 20 words per lesson, and may have multiple lessons in a session. Sessions are defined in a similar way as before: a period of activity without a break longer than one hour. Users in general perform quite well, correctly recalling a large number of words in a lesson, which makes it difficult to discern changes in performance. Therefore, we define performance in a more stringent way, as a binary variable that is equal to one if the user had perfect performance (i.e., correctly recalled all words in a lesson), and zero otherwise. We used more than two dozen features to describe performance. These include the number of words seen and correctly answered during a lesson (lesson seen and lesson correct), the number of distinct words shown during a lesson, the lesson index among all lessons for this user, the time to the next lesson, the time since the previous lesson, the lesson position within its session, the session length in terms of the number of lessons and its duration, etc. User-related features include the number of the user's first five lessons answered correctly, the number of lessons with perfect performance, the total number of lessons, the total number of words seen and correctly answered, and the time the user was active.
A.2 DoGR

A.2.1 Data

We apply our method to synthetic and real-world data sets, including the large social data sets described below.
Figure A.1: Synthetic data with two components centered on x = 500, but with different variances. Gray points are data, and red points are predicted outcomes made by our method.
The Synthetic data consists of two subgroups with the same mean on x but different variances, as shown in Figure A.1. The variance of x for the top component is 600 and for the bottom component is 100. The number of data points in the bottom component is 3,000 and in the top component 2,000. The y value for the bottom component is y = 200 + x, and for the top component it is y = 800 + x, with a residual variance of 20.
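Given the stated parameters, data with this structure could be generated as follows. This is a sketch: the Gaussian form of x and of the residuals is an assumption, since the text only specifies means and variances.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bottom component: 3,000 points, Var(x) = 100, y = 200 + x + noise
x_bot = rng.normal(500, np.sqrt(100), 3000)
y_bot = 200 + x_bot + rng.normal(0, np.sqrt(20), 3000)

# Top component: 2,000 points, Var(x) = 600, y = 800 + x + noise
x_top = rng.normal(500, np.sqrt(600), 2000)
y_top = 800 + x_top + rng.normal(0, np.sqrt(20), 2000)

x = np.concatenate([x_bot, x_top])
y = np.concatenate([y_bot, y_top])
```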
The Metropolitan data comes from a study of emotions expressed through Twitter messages posted from locations around a large metropolitan area. The data contains more than 6 million geo-referenced tweets from 33,000 people linked to US census tracts through their locations. The demographic and socioeconomic characteristics of tracts came from the 2012 American Fact Finder (http://factfinder.census.gov/). A tract is a small region that contains about 4,000 residents on average. The tracts are designed to be relatively homogeneous with respect to socioeconomic characteristics of the population, such as income, residents with bachelor's degree or above, and other demographics. Of more than 2000 tracts within this metropolitan area, 1688 were linked to tweets.
Emotional valence was estimated from the text of the tweets using the Warriner, Kuperman, and Brysbaert
(WKB) lexicon [128]. The lexicon gives valence scores—between 1 and 9—which quantify the level of
pleasure or happiness expressed by a word, for 14,000 English words. Since valence is a proxy for the
expressed happiness, the data allows us to investigate the social and economic correlates of happiness. We
use valence as the outcome in data analysis.
The Wine Quality data combines two benchmark data sets from the UCI repository (https://archive.ics.uci.edu/ml/datasets/wine+quality) related to red and white wines [35]. The data contains records of wine quality and concentrations of various chemicals for 4898 white and 1599 red wine samples. Quality is a score between 0 and 10, and is typically around 5. Our goal is to model wine quality based on physicochemical tests. Note that the type of wine (red or white) is only used to evaluate the learned components.
The New York City Property Sale data (https://www.kaggle.com/new-york-city/nyc-property-sales) contains records of every building or building unit (such as an apartment) sold in the New York City property market over a 12-month period, from September 2016 to September 2017. We removed all categorical variables like neighborhood and tax class. We use each property's borough to study the relevance of the components to different neighborhoods in NYC; however, we do not use it in the analysis. The outcome variable is sale price, which is in millions of dollars and ranges from 0 to 2,210.
The Stack Overflow dataset contains data from a question-answering forum on the topic of computer programming. Any user can ask a question, which others may answer. We used anonymized data representing all answers to questions with two or more answers posted on Stack Overflow from August 2008 until September 2014 (https://archive.org/details/stackexchange). We created a user-focused data set by selecting at random one answer written by each user and discarding the rest. To understand factors affecting the length of the answer (the outcome, which we measure as the number of words in the answer), we use features of answers and features of users. Answer features include the number of hyperlinks and lines of code the answer contains, and its Flesch readability score [64]. Features describing answerers are their reputation, tenure on Stack Overflow (in terms of percentile rank), and experience, i.e., the total number of answers written during their tenure. We also use activity-related features, including the time since the previous answer written by the user, the session length, giving the number of answers the user writes during the session, and the answer position within that session. We define a session as a period of activity without a break of 100 minutes or longer. The features acceptance probability, reputation, and number of answers are only used to evaluate the learned components.
A.2.2 Preprocessing
When variables are linearly correlated, regression coecients are not unique, which presents a challenge
for analysis. To address this challenge, we used Variance Ination Factor (VIF) to remove multicollinear
variables. We iteratively remove variables when their VIF is larger than ve. For example, in the Metropoli-
tan data, this approach reduced the number of features from more than 40 to six, representing the number
of residents within a tract who areWhite,Black,Hispanic andAsian, as well as percent of adult res-
idents withGraduate degree or with incomesBelow Poverty line. Table A.1, represents information
about all data sets.
Dataset          records    features    after VIF    outcome
Synthetic        5,000      1           1            Y
Metropolitan     1,677      42          6            Valence
Wine Quality     6,497      11          5            Quality
NYC              36,805     7           4            Sale Price
Stack Overflow   372,321    13          5            Answer Length

Table A.1: Data sets and their characteristics.
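The iterative VIF procedure can be sketched as follows. This is a minimal from-scratch implementation for illustration (the VIF of column j is 1/(1 − R²) from regressing it on the other columns); the preprocessing code actually used for DoGR may differ.

```python
import numpy as np

def vif(X, j):
    """Variance Inflation Factor of column j of design matrix X:
    1 / (1 - R^2) from regressing X[:, j] on the remaining columns
    (plus an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

def remove_multicollinear(X, names, threshold=5.0):
    """Iteratively drop the column with the highest VIF until all
    remaining columns have VIF <= threshold."""
    X = X.copy()
    names = list(names)
    while X.shape[1] > 1:
        vifs = [vif(X, j) for j in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names
```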
A.2.3 Finding Components

As a first step in applying DoGR to data for qualitative analysis, we need to decide on the number of components. In general, finding the appropriate number of components in clustering algorithms is difficult and generally requires ad hoc parameters. For Gaussian Mixture Models, the Bayesian information criterion (BIC) is typically used. We also use BIC with k(p² + 2p + 3) parameters, where k is the number of components and p is the number of independent variables. Based on the BIC scores, we choose five components for the Metropolitan data. In our explorations, using three or six components gave qualitatively similar results, but with some of the components merging or splitting. With the same procedure, the optimal number of components for the Wine Quality, NYC, and Stack Overflow data is four.
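With the parameter count above, BIC-based model selection can be sketched as below. The helper names and the dict-of-log-likelihoods input are illustrative assumptions, not DoGR's actual interface.

```python
import numpy as np

def bic(log_likelihood, n_samples, k, p):
    """BIC for a DoGR-style mixture with k components over p covariates.
    Each component contributes p^2 + 2p + 3 free parameters, per the
    count used in the text; lower BIC is better."""
    n_params = k * (p**2 + 2 * p + 3)
    return n_params * np.log(n_samples) - 2 * log_likelihood

def best_k(log_likelihoods, n_samples, p):
    """Choose the number of components with the lowest BIC.
    log_likelihoods: dict mapping k -> fitted log-likelihood."""
    return min(log_likelihoods, key=lambda k: bic(log_likelihoods[k], n_samples, k, p))
```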
Appendix B

Friendship Paradox & Perception Bias

B.1 Individual-level Perception Bias
Using Equation (9) we can compute the individual-level perception bias for hashtag h as the difference between an individual's perception of hashtag h and its global prevalence. Here f_h(v) indicates whether user v used hashtag h or not, and the perception of node v about hashtag h is written q_{f_h}(v). Then the individual-level perception bias for hashtag h is

B_h(v) = q_{f_h}(v) − E{f_h(X)},

where E{f_h(X)} is the global prevalence of hashtag h. Supplementary Figure B.1a shows the empirical distribution of B_h(v) for all users and hashtags. Most of the mass of the histogram lies at B_h(v) > 0, suggesting that most of the people in our data overestimate the popularity of these hashtags.
Supplementary Figure B.1b compares the individual-level perception bias for two hashtags that have similar global prevalence: #nyc (E{f(X)} = 0.021) and #rt (E{f(X)} = 0.019). Of the two hashtags, #nyc is perceived as more popular (with B^local_{#nyc} = 0.022), while #rt appears less popular (with B^local_{#rt} = −0.011) than it is globally.
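On a toy directed network, B_h(v) can be computed as below. This is a sketch with illustrative names; it assumes a node's perception q_{f_h}(v) is the fraction of its friends who used the hashtag, consistent with the definition above.

```python
import numpy as np

def perception_bias(follows, used, n):
    """Individual-level perception bias B_h(v) on a toy directed graph.

    follows: list of (follower, friend) edges, meaning follower -> friend
    used: length-n 0/1 array; used[u] = 1 if user u used hashtag h
    Perception q_{f_h}(v) is the fraction of v's friends who used h;
    nodes with no friends have no defined perception and are skipped.
    """
    used = np.asarray(used, dtype=float)
    prevalence = used.mean()                 # global prevalence E{f_h(X)}
    friends = {v: [] for v in range(n)}
    for follower, friend in follows:
        friends[follower].append(friend)
    bias = {}
    for v in range(n):
        if friends[v]:
            q = used[friends[v]].mean()      # perception q_{f_h}(v)
            bias[v] = q - prevalence         # B_h(v)
    return bias
```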
B.2 Local and Global Perception Bias

Global and local perception bias may either overestimate or underestimate the global prevalence of an attribute E{f(X)}, depending on the covariance between a node's attribute value and its out-degree, Cov{f(X), d_o(X)}, and the covariance between the attribute value and attention along a random friend–follower link, Cov{f(U), A(V) | (U,V) ~ Uniform(E)}. We enumerate the cases below:

Case 1: Cov{f(U), A(V) | (U,V) ~ Uniform(E)} ≥ 0 and Cov{f(X), d_o(X)} ≥ 0.
In this case, B^global and B^local both overestimate E{f(X)}, and the local bias is larger than the global bias, i.e., B^local ≥ B^global ≥ 0.

Case 2: Cov{f(U), A(V) | (U,V) ~ Uniform(E)} ≤ 0 and Cov{f(X), d_o(X)} ≤ 0.
In this case, B^global and B^local both underestimate E{f(X)}, and the local bias is smaller than the global bias, i.e., B^local ≤ B^global ≤ 0.

Case 3: Cov{f(U), A(V) | (U,V) ~ Uniform(E)} and Cov{f(X), d_o(X)} have opposite signs.
In this case the signs of B^global and B^local can differ, with one overestimating and the other underestimating the global prevalence of the attribute. These extreme cases are caused by a large covariance between the attribute and attention along a random friend–follower link. We make this case more precise with the following results:

1. If B^global < 0, then B^local > 0 if and only if Cov{f(U), A(V) | (U,V) ~ Uniform(E)} > |B^global| / d̄.

2. If B^global > 0, then B^local < 0 if and only if Cov{f(U), A(V) | (U,V) ~ Uniform(E)} < −|B^global| / d̄.
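Separating hashtags into the three cases reduces to checking the signs of the two covariances. A minimal sketch (boundary values of exactly zero are grouped with cases 1 and 2, an illustrative convention):

```python
from collections import Counter

def classify_case(cov_attention, cov_outdegree):
    """Assign a hashtag to one of the three cases above, based on the signs
    of Cov{f(U), A(V)} (cov_attention) and Cov{f(X), d_o(X)} (cov_outdegree)."""
    if cov_attention >= 0 and cov_outdegree >= 0:
        return 1
    if cov_attention <= 0 and cov_outdegree <= 0:
        return 2
    return 3  # opposite signs

def case_counts(pairs):
    """Count hashtags per case, given (cov_attention, cov_outdegree) pairs."""
    return Counter(classify_case(a, d) for a, d in pairs)
```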
We separate the hashtags in our Twitter data sample into the above three cases (Supplementary Fig-
ure B.2). The majority of hashtags fall into cases 1 and 2, suggesting that local perception bias is larger in
magnitude than the global perception bias.
Case 1: These hashtags (shown in green in Supplementary Figure B.2) are used by popular users who are followed by high-attention followers. These hashtags include popular memes and political events, among others. Some examples are #ferguson, #tbt, #icebucketchallenge, #mikebrown, #emmys, #tech, #nyc, #ebola, #robinwilliams, #sxsw, #alsicebucketchallenge, #applelive, #netneutrality, #worldcup, #startups, #michaelbrown, #earthquake, #apple, #sf, #iraq. Interestingly, all hashtags listed as top 20 in Figure 3 belong to this case except #social_media and #.

Case 2: Some of the hashtags falling into this case (shown in red in Supplementary Figure B.2) include #rt, #tcot, #follow, #retweet, #oscars, #teamfollowback, #leadfromwithin, #mtvhottest, #teaparty, #shoutout, #pjnet, #cdnpoli, #gazaunderattack, #uniteblue, #asmsg, #tlot, #freepalestine, #ccot, #tfb, #np. These hashtags are used by unpopular users; examples are the last-10 hashtags of Figure 3 except #quote.

Case 3: The hashtags falling into the left quadrant of Supplementary Figure B.2 (in yellow) include #sotu, #occupy, #marriageequality, #sandy, #haiyan, #esp, #openingday, #doma, #mex, #lightsout, #onthisday, #ufc, #ww1, #wimbledon, #oscar, #joinin, #9, #ukedchat, #uru. The hashtags falling into the right quadrant include #quote, #quotes, #win, #news, #kindle, #author, #management, #p2, #romance, #mktg, #iartg, #leaders, #ww, #b, #so, #mystery, #children, #aine, #autism, #lp.
Figure B.1: Individual-level perception bias q_{f_h}(v) − E{f(X)} for (a) all hashtags h and all nodes v ∈ V, and (b) for two hashtags with similar global prevalence, but with positive (#nyc) and negative (#rt) B^local. This illustrates that most hashtags are positively biased for individuals, with bias levels that do not depend on global prevalence.
Figure B.2: Values of Cov{f(U), A(V) | (U,V) ~ Uniform(E)} and Cov{f(X), d_o(X)} for all hashtags. Both variables are normalized by dividing by their maximum value. The color represents the three cases. The table below shows the number of hashtags that fall into each case.

Case 1 (Cov{f(U), A(V) | (U,V) ~ Uniform(E)} ≥ 0 and Cov{f(X), d_o(X)} ≥ 0; 0 ≤ B^global ≤ B^local): 474 hashtags
Case 2 (Cov{f(U), A(V) | (U,V) ~ Uniform(E)} ≤ 0 and Cov{f(X), d_o(X)} ≤ 0; B^local ≤ B^global ≤ 0): 187 hashtags
Case 3 (covariances of opposite signs): a) B^global ≤ 0 ≤ B^local: 19 hashtags; b) B^local ≤ 0 ≤ B^global: 75 hashtags; other: 398 hashtags
Figure B.3: The ranking of popular Twitter hashtags based on global bias (B^global); the top-20 (from #ferguson down to #twitter) and bottom-10 (from #mufc down to #rt) are included in the ranking. The bars compare global bias (B^global) and local bias (B^local).
B.3 Global Bias Ranking

Supplementary Figure B.3 shows the ranking of popular Twitter hashtags based on global bias (B^global). The top-20 and bottom-10 are included in the ranking. There are 94 hashtags among 1153 with opposite signs of local bias and global bias, although both bias values for these hashtags are close to zero. Among the remaining hashtags, 661 (62%) have larger local bias than global bias, and 398 (38%) have larger global bias than local bias.
B.4 Heuristic Follower Perception Polling

It may not always be feasible to sample followers at random, specifically, by sampling nodes proportional to their in-degree. Our Follower Perception Polling (FPP) algorithm samples nodes based on their in-degree. Computing in-degrees requires access to the whole network; however, in many cases the whole network may not be available. We can approximate the sampling used by the FPP algorithm with the following heuristic. Instead of sampling b nodes weighted by their in-degree (which requires information about the whole network), the proposed Approximate-FPP algorithm samples b nodes at random, and as a second step, it samples b nodes from the followers of these nodes. This procedure does not need the whole network structure, and it can be shown to approximate the FPP algorithm. Figure B.4d shows the performance of Approximate-FPP in comparison to the other polling algorithms.
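The two-step heuristic above can be sketched as follows. The names and the polled quantity are illustrative assumptions: here each polled node reports its own attribute value, a simplification of polling perceptions.

```python
import random

def approximate_fpp(followers_of, f, b, seed=0):
    """Two-step sampling heuristic approximating Follower Perception Polling.

    Step 1: sample b nodes uniformly at random (no global degree
    information needed). Step 2: sample b nodes from the combined
    follower lists of the step-1 nodes and poll them, returning the
    average polled value as an estimate of the global prevalence E{f(X)}.

    followers_of: dict node -> list of that node's followers
    f: dict node -> 0/1 attribute value
    """
    rng = random.Random(seed)
    nodes = list(followers_of)
    seeds = rng.choices(nodes, k=b)                  # step 1: uniform sample
    pool = [w for v in seeds for w in followers_of[v]]
    if not pool:
        return None                                  # no followers to poll
    polled = rng.choices(pool, k=b)                  # step 2: sample followers
    return sum(f[w] for w in polled) / b
```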
Figure B.4: Comparison of polling algorithms for estimating the global prevalence of Twitter hashtags. Variation of (a) squared bias (Bias{T}²), (b) variance (Var{T}), and (c) mean squared error (Bias{T}² + Var{T}) of the polling estimate (IP, NPP, and Approximate-FPP as T, the polling algorithm) as a function of a hashtag's global prevalence E{f(X)}. Each point represents a different hashtag and a fixed sampling budget b = 25. (d) Fraction of hashtags where the proposed FPP algorithm with the sampling heuristic (Approximate-FPP) outperforms the other polling methods in terms of MSE. The fraction for NPP approaches 0.5, and for IP approaches 0.8, as the sampling budget b increases. The main difference between Approximate-FPP and FPP appears in panel (d) at low budgets b: there, the fraction of hashtags where Approximate-FPP outperforms NPP is around 0.8, compared with 0.9 for the FPP algorithm.
Appendix C

Directed Mixed Preferential Attachment

C.1 Notations of DMPA model theorems

I use the following notations that we used in [97] and [96]:

p^{(3)}_{B_old → B_old}: given event 3, the probability that an existing blue node is followed by an existing blue node.

p^{(3)}_{B_old → R_old}: given event 3, the probability that an existing blue node is followed by an existing red node.

p^{(3)}_{R_old → R_old}: given event 3, the probability that an existing red node is followed by an existing red node.

p^{(3)}_{R_old → B_old}: given event 3, the probability that an existing red node is followed by an existing blue node.
In [97] we showed:

p^{(3)}_{R_old → R_old} = (α^{(o)}_t + (p+q)r)(α^{(i)}_t + (p+q)r) ρ_R / D_t + O(1/t^{1/4}),   (C.1)

p^{(3)}_{B_old → R_old} = (1 − α^{(o)}_t + (p+q)(1−r))(α^{(i)}_t + (p+q)r)(1 − ρ_R) / D_t + O(1/t^{1/4}),   (C.2)

p^{(3)}_{R_old → B_old} = (α^{(o)}_t + (p+q)r)(1 − α^{(i)}_t + (p+q)(1−r))(1 − ρ_B) / D_t + O(1/t^{1/4}),   (C.3)

p^{(3)}_{B_old → B_old} = (1 − α^{(o)}_t + (p+q)(1−r))(1 − α^{(i)}_t + (p+q)(1−r)) ρ_B / D_t + O(1/t^{1/4}),   (C.4)

where all four expressions share the common denominator

D_t = (1 + (p+q))²
    − (1 − α^{(o)}_t + (p+q)(1−r))(1 − α^{(i)}_t + (p+q)(1−r))(1 − ρ_B)
    − (1 − α^{(o)}_t + (p+q)(1−r))(α^{(i)}_t + (p+q)r) ρ_R
    − (α^{(o)}_t + (p+q)r)(1 − α^{(i)}_t + (p+q)(1−r)) ρ_B
    − (α^{(o)}_t + (p+q)r)(α^{(i)}_t + (p+q)r)(1 − ρ_R),

so that the four numerators sum to D_t and the probabilities (C.1)–(C.4) sum to one up to the O(1/t^{1/4}) term.
Abstract

The presence of bias often complicates the quantitative analysis of large-scale heterogeneous or network data. Discovering and mitigating these biases enables a more robust and generalizable analysis of data. This thesis focuses on the 1) discovery, 2) measurement, and 3) mitigation of biases in heterogeneous and network data.

The first part of the thesis focuses on removing biases created by the existence of diverse classes of individuals in the population. I describe a data-driven discovery method that leverages Simpson's paradox to identify subgroups within a population whose behavior deviates significantly from the rest of the population. Next, to address the challenges of multi-dimensional heterogeneous data analysis, I propose a method that discovers latent confounders by simultaneously partitioning the data into fuzzy clusters (disaggregation) and modeling the behavior within them (regression).

The second part of this thesis is about biases in bi-populated networked data. First, I study the perception bias of individuals about the prevalence of a topic among their friends in the Twitter social network. Second, I show the existence of power-inequality in author citation networks in six different fields of study, due to which authors from one group (e.g., women) receive systematically less recognition for their work than another group (e.g., men). As the last step, I connect these two concepts (perception bias and power-inequality) in bi-populated networks and show that while these two measures are highly correlated, there are some scenarios where there is a disparity between them.