MEASURING AND MITIGATING EXPOSURE BIAS IN ONLINE SOCIAL NETWORKS
by
Nathan Bartley
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2024
Copyright 2024 Nathan Bartley
Dedication
For my wife.
Acknowledgements
First and foremost, I want to thank my advisor, Kristina Lerman, for taking me on and helping me to see
this degree through. I appreciate how much her style and support adjusted to my needs during my time at
USC: giving crucial guidance early on in shaping my research focus and approach and giving space when I
had direction. She helped me course-correct when I felt stuck, gave compassionate feedback when I needed
it, and because of her help I ultimately came out stronger and more confident.
From her, I learned how to tackle my research from different angles and contextualize it thoroughly in
a multidisciplinary way. Kristina leads her research group with enthusiasm in pursuing questions we are
passionate about, patience in acquiring the requisite expertise to answer them, experience in cultivating
fruitful collaborations, and finesse in polishing and timing our contributions to the field. She has also taught
me that everyone probably feels as awkward as I do at conferences, and how that feeling can be a good
icebreaker for a great conversation!
Thank you to my committee, Mike Ananny, Emilio Ferrara, Fred Morstatter, and Barath Raghavan. The
discussions I have had with them have been very illuminating and fruitful, and have pushed my growth and
development more than I could have ever hoped for.
Thank you to Mike Ananny (again), Dmitri Williams and Yolanda Gil for providing space to ask questions and discuss work with students, postdocs, and faculty from all over the university: I firmly believe
their investment in time and energy in running these research groups will pay tremendous dividends in
growing new research areas and cultivating research interest among students who may not have gotten involved otherwise.
Thank you to Keith Burghardt and Andrés Abeliuk, both of whom I had the pleasure to work with
and learn from about the day-to-day steps for carrying out research. Thank you as well to everyone at the
Knowledge Lab and Computation Institute at the University of Chicago, without whom I would not have
had the opportunity to pursue this degree: James Evans and his always positive, forward looking demeanor,
Eamon Duede with his pragmatic research approaches, as well as Kyle Chard & Anoop Mayampurath for
their encouragement of light-hearted fun in the research process.
Thank you to my lab friends and collaborators who helped me get to this point: Nazanin, Nazgol, Misha,
Negar, Ashok, Dan, Ashwin, Zihao, Kai, Yuzi, Basileal, Dan, Casandra, David, Patrick, Becca, DJ, and
Emily. Thank you to everyone from MASTS, especially Rizvi, Lee, Dan, Chris, Rachel & TJ. I will miss
working with you all, but I eagerly look forward to seeing what you do next!
Thank you to Mom and Dad for instilling in me a love for learning, as well as your never-ending support
and constant love. And thank you to my new parents Jia and Naveed for your advice and care throughout this
journey. Grandma, Naomi, Mike, Isla and Cyrus, thank you for your support and for making me laugh every
time I see you. Ramsha, you are such a good writer, thank you for your edits. You all helped me through the
most difficult part.
Thank you to all my friends and family especially: Uncle Bob, Will, Loryn, Brice, Elizabeth & Michael,
Karina, Andy, Jonathan, Wyatt, Ryan, Barak and the D&D crew, Holland, Patrick, Blake, Arthur, Ethan,
Jesse, Cullen, Tony, Jon, Drew, Jake, and the group chat. Thank you Alex for the shared work sessions. And
Kyle, thank you for the snacks.
Finally, thank you to my wife, Hareem. I could not have gotten it over the finish line without you. I am
so grateful for every day I have been given with you.
I could not have done it without everyone’s support; it really takes a village to raise a thesis.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Brief History of Algorithmic Systems in Social Media . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Recommender Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Exposure Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Challenges and Contentions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Thesis Statement & Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.6 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.7 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2: Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Cognitive Biases & Information Diffusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Public Opinion Research & Opinion Dynamics . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Algorithmic Audits of OSN Recommender Systems . . . . . . . . . . . . . . . . . . . . . . 13
2.4 User Perception of Ranked Feeds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Social Network Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Recommender System Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.7 Social Media Usage & Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3: Measuring Exposure Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.1 Data & Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 Twitter Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 Reconstructing Feeds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.3 Simulating User Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.4 Measures of Exposure Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 RQ1 - Difference between Feeds . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.2 RQ2 - Difference between Session Lengths . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.1 RQ1 - Difference between Feeds . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.2 RQ2 - Difference between Session Lengths . . . . . . . . . . . . . . . . . . . . . . 30
3.3.3 Considerations for Practitioners . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.4 Other Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Limitations & Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5.1 Ethical Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4: Auditing Exposure Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.1 Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4.1 Ethical Considerations & Reproducibility . . . . . . . . . . . . . . . . . . . . . . . 64
5: Mitigating Exposure Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.2 User Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.3 Model Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1.4 Bias & Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 Limitations & Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6: Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
7: Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
.1.1 Additional Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
List of Tables
3.1 Empirical Pairwise Significance Tests. We treat the seed users in different feed conditions
as paired samples and compute a paired t-test over the mean values of the various metrics.
Conditions that are bolded are least biased according to the measure. * - p < 0.05, ** - p < 0.01 27
3.2 Empirical Session-Length Pairwise Significance Tests. We treat the seed users in different
feed conditions as paired samples and compute a paired t-test over the mean values of the
various metrics computed for each session length. * - p < 0.05, ** - p < 0.01 . . . . . . . . . 28
4.1 Pairwise Significance Tests. Data are compared with an independent t-test given the normality
of the measured means. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1 Notation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
2 Empirical Pairwise Significance Tests. We treat the seed users in different feed conditions
as paired samples and compute a paired t-test over the mean values of the various metrics.
* - p < 0.05, ** - p < 0.01 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3 Empirical Session-Length Pairwise Significance Tests. We treat the seed users in different
feed conditions as paired samples and compute a paired t-test over the mean values of the
various metrics computed for each session length. * - p < 0.05, ** - p < 0.01 . . . . . . . . . 85
List of Figures
1.1 Majority Illusion. Plotted is a graph with 18 nodes and 27 edges, with 4 nodes (0, 4, 14, 6)
having an “active” binary trait. Nodes 15, 7, and 5 would experience the majority illusion
in that the majority of their connections have the active trait, whereas the minority of the
global population has the trait. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.1 Local Bias (Blocal) versus Correlation ρkx - Simple feeds. Reported are mean values
across all seed user days, and bars are standard error of the mean. . . . . . . . . . . . . . . 37
3.2 Local Bias (Blocal) versus Correlation ρkx - Model-based feeds. Reported are mean
values across all seed user days, and bars are standard error of the mean. . . . . . . . . . . . 38
3.3 Gini Coefficient (G) versus Correlation ρkx - Simple feeds. Reported are mean values
across all seed user days, and bars are standard error of the mean. . . . . . . . . . . . . . . 39
3.4 Gini Coefficient (G) versus Correlation ρkx - Model-based feeds. Reported are mean
values across all seed user days, and bars are standard error of the mean. . . . . . . . . . . . 40
3.5 Fraction Positive Tweets (Fti) versus Correlation ρkx - Simple feeds. Reported are mean
values across all seed user days, and bars are standard error of the mean. . . . . . . . . . . . 41
3.6 Fraction Positive Tweets (Fti) versus Correlation ρkx - Model-based feeds. Reported
are mean values across all seed user days, and bars are standard error of the mean. . . . . . 42
3.7 Fraction Positive Users (Fui) versus Correlation ρkx - Simple feeds. Reported are mean
values across all seed user days, and bars are standard error of the mean. . . . . . . . . . . . 43
3.8 Fraction Positive Users (Fui) versus Correlation ρkx - Model-based feeds. Reported are
mean values across all seed user days, and bars are standard error of the mean. . . . . . . . 44
3.9 Majority Illusion (Mi) versus Correlation ρkx - Simple feeds. Reported are mean values
across all seed user days, and bars are standard error of the mean. . . . . . . . . . . . . . . 45
3.10 Majority Illusion (Mi) versus Correlation ρkx - Model-based feeds. Reported are mean
values across all seed user days, and bars are standard error of the mean. . . . . . . . . . . . 46
4.1 Local bias measurements for audit groups. Bootstrapped confidence intervals computed
over 1000 samples of sessions, presented are the mean of the daily means with 95%
intervals. Negative implies under-representation of pro-science content, positive implies
over-representation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Dates covered by the sock puppet accounts . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3 Gini coefficient measurements for audit groups. Bootstrapped confidence intervals
computed over 1000 samples of sessions, presented are the mean of the daily means with
95% intervals. Zero implies equality in the mean number of tweets observed for each friend;
one implies that one friend dominates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 Local bias computed over time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5 Median imputed Local bias computed over time. . . . . . . . . . . . . . . . . . . . . . . . 58
4.6 Correlation between time-series matched on days. Correlation is computed between the
daily activity of pro- or anti-science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.7 Gini coefficient computed weekly over time. . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.8 Number of tweets per day of anti-science users. The steep drop of tweets around
2021-10-24 can be explained both by accounts that were temporarily suspended and a brief
rate limiting issue with the API. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.9 Number of tweets per day of pro-science users. . . . . . . . . . . . . . . . . . . . . . . . 61
4.10 Number of usable personalized tweets seen per day. Other tweets were either promoted
or otherwise had a problem in parsing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.11 Number of usable reverse chronological tweets seen per day. Other tweets were either
promoted or otherwise had a problem in parsing. . . . . . . . . . . . . . . . . . . . . . . . 65
4.12 Mean imputed Local bias computed over time. . . . . . . . . . . . . . . . . . . . . . . . . 65
4.13 Most-frequent imputed Local bias computed over time. . . . . . . . . . . . . . . . . . . . 66
4.14 Iterative imputed Local bias computed over time. . . . . . . . . . . . . . . . . . . . . . . 66
4.15 Number of unique friends seen over time. . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.16 Number of new users exposed to per day. Blue is personalized, orange is reverse
chronological. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1 Agent-based Model Structure Illustration demonstrating how three users are connected
to each other on the network, but will only get exposed to other users through the tweets
served to them from the “backend” model. . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 Local Bias (Blocal). Graph depicts the difference between the expected local fraction of
friends who have x = 1 and the true global prevalence of the trait P(X = 1). Positive
implies over-representation, negative implies under-representation. . . . . . . . . . . . . . . 76
5.3 Gini Coefficient. Graph depicts the distribution of times each friend (or friend-of-friend)
was observed by a core user. 1 implies inequality, 0 implies equality. . . . . . . . . . . . . . 77
5.4 Mean number of likes each friend receives. Graph depicts the mean number of likes each
friend receives over time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.5 Log number of friends seen. Graph depicts the log total number of unique friends (and
friends-of-friends) seen through tick t. Connections seen are reset at t = 24. . . . . . . . . . 79
5.6 Precision@30. Graph depicts the total fraction of liked tweets in the first 30 positions in
the feed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Abstract
Algorithmic systems often mediate our interactions with each other in online social networks. These systems construct personalized news feeds, suggestions for who to follow, and topics to search for, alongside a medley of other recommender systems. While such systems enhance user experience, they also introduce
biases that can significantly affect how users see their social environments, impacting how information is
disseminated and consumed. This thesis investigates the impact of algorithmic personalization on social media platforms, focusing on exposure bias in personalized news feeds and how users’ perceptions of networks
are distorted through rank-ordered feeds. Through simulations and empirical data from popular social media
platforms, this research examines the interactions between user behavior, network structure, and algorithmic decision-making. It offers novel insights into how these systems can shape user perceptions and potentially impact network dynamics. The findings highlight the need for more transparent algorithmic design and for a better understanding of the balance between the benefits of personalization and the risks of reinforcing echo chambers and spreading misinformation.
Chapter 1
Introduction
1.1 Brief History of Algorithmic Systems in Social Media
As the Internet has matured, it is useful to briefly consider the history of large online platforms. After the dot-com bubble in the late 1990s, companies and websites operating on the Internet that were seeking growth moved towards business strategies that would facilitate user-base growth and retention. Part of this strategy, especially for information-rich platforms like search engines and the newly emerging social media platforms, involved ranking information for users according to what the platforms consider the users’ utility [75]. The most prominent example of this utility-focused algorithmic system is Google’s PageRank, which was designed to rank websites on a search engine result page (commonly described as a SERP) based on their computed importance relative to other websites. Another successful early example of utility-focused algorithmic systems is Amazon’s "Customers also bought this..." product recommender system; however, we elide further discussion because it is not a socially oriented system.
Informed by Google’s approach, Facebook introduced its own ranking approach for creating algorithmically ranked News Feeds circa 2007 with EdgeRank [1]. This broke with the typical reverse-chronologically sorted timelines that defined its original news feed (as well as other social media platforms like Twitter). This notably co-occurred with the release of Facebook Ads, suggesting the business model was beginning to orient towards user attention [33]. However, this EdgeRank system only remained
in production for a few years, as by 2013 Facebook was reported as using a machine-learning algorithm to
decide which posts to show users and then separately to rank the posts [67].
Other social media platforms followed suit, such as YouTube in 2011 and Instagram and Twitter in 2016, leading to an environment that users of these platforms are familiar with today. With other platforms like Reddit sorting comments based on various metrics, it is clear that these systems, by shaping the information users see in their network, also shape interactions, as it is easier to engage with users who are immediately visible than with those who require a few clicks or scrolls.
These personalization systems can largely be described as recommender systems, i.e., systems that are
personalized to each user and “recommend” users or content that may be relevant or otherwise useful for
the user.
1.1.1 Recommender Systems
Recommender systems solve optimization problems that can often be reduced to a ranking problem: a system might offer the top ten items from a set of items rank-ordered according to their predicted utility for the user, their predicted probability of engagement, et cetera. These systems are relevant for online social networks as they can serve multiple roles. For instance, advertisers might be supplied with a rank-ordered list of recommended relevant users for their ad inventory. For users, posts may be ranked and served in the feed, along with recommended accounts to follow. Geo-located information is also a feature platforms like X and Facebook use, where recommendations for relevant “Trending News” topics are made for each user. These systems can also be applied to less obvious areas of a user’s experience: rank-ordering the replies or comments made on a post, or deciding which information gets sent out in a digest or push notification to the user.
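As a toy illustration of this reduction to ranking (a hypothetical sketch, not any platform's actual system; the scoring function and item fields are invented for the example), a feed can be produced by scoring each candidate item and keeping the top k:

from typing import Callable, Dict, List

def build_feed(items: List[Dict], score: Callable[[Dict], float], k: int = 10) -> List[Dict]:
    # Rank candidate items by a predicted-utility score and keep the top k.
    return sorted(items, key=score, reverse=True)[:k]

# Hypothetical engagement scorer standing in for a learned model.
def predicted_engagement(item: Dict) -> float:
    return 0.7 * item["author_affinity"] + 0.3 * item["recent_popularity"]

candidates = [
    {"id": 1, "author_affinity": 0.9, "recent_popularity": 0.2},
    {"id": 2, "author_affinity": 0.4, "recent_popularity": 0.8},
]
feed = build_feed(candidates, score=predicted_engagement, k=10)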
Figure 1.1: Majority Illusion. Plotted is a graph with 18 nodes and 27 edges, with 4 nodes (0, 4, 14,
6) having an “active” binary trait. Nodes 15, 7, and 5 would experience the majority illusion in that the
majority of their connections have the active trait, whereas the minority of the global population has the
trait.
1.2 Exposure Bias
When personalized timelines are discussed, the attention often centers on the algorithm choosing and sorting information for the user. This is an important area to explore; however, these discussions tend to ignore or make tacit assumptions about the “inventory” of items that these systems can pull from. This
“inventory” is generated from various sources and is the result of multiple factors: complex user behavior,
organizations with business incentives (resulting in ad inventory and social media posts), network structures,
and other exogenous sources (e.g., moderation of posts and external regulatory pressures). The outcome of these forces creates a universe of information that already biases what users on those platforms perceive: the
perceived prevalence of users and/or content is distorted for any particular user before any personalization
takes place.
Take for instance the following example illustrated in Fig. 1.1. We observe the true prevalence of a trait,
say the opinion that cockatiels make the best pet, to be 4/18 distributed uniformly across the network. If
one were to poll users at random and examine their individual ego network for the prevalence of the opinion
in that network, the proportion of friends in the network with that trait would vary wildly based on network
structure. This is an example of the majority illusion, in which an attribute or opinion seems more common than it actually is [56].
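To make the illustration concrete, the following minimal sketch (not code from the thesis; the graph and the set of “active” nodes are hypothetical stand-ins for Fig. 1.1) counts the nodes for which a globally rare trait looks like a local majority:

import networkx as nx

def majority_illusion_nodes(G, active):
    # Nodes whose neighborhoods make a globally minority trait appear to be a majority.
    global_prevalence = len(active) / G.number_of_nodes()
    illusion = []
    for node in G.nodes():
        neighbors = list(G.neighbors(node))
        if not neighbors:
            continue
        local_fraction = sum(n in active for n in neighbors) / len(neighbors)
        if local_fraction > 0.5 and global_prevalence < 0.5:
            illusion.append(node)
    return illusion

# Hypothetical 18-node, 27-edge graph with four "active" nodes, echoing Fig. 1.1.
G = nx.gnm_random_graph(18, 27, seed=1)
active = {0, 4, 6, 14}
print(majority_illusion_nodes(G, active))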
In this context we consider exposure bias to be a systematic distortion in content visibility and perceived
prevalence within a social network, which arises from discrepancies between the “potential network” (active social connections, e.g., who follows whom), the “activated network” (the users who could be observed because they engage in content creation or activity), and the “feed-exposed network” (the subset of connections through which users actually observe and interact with content). This bias leads to certain content being
either over- or under-represented, skewing its perceived prevalence relative to its actual prevalence.
This discrepancy is something that can be modulated by algorithmic personalization, which we discuss
further in Chapter 3.
We consider this an important topic to explore as the perceived popularity of information, modulated or
not, can potentially inform user behavior. Recent work suggests that simply considering the social media
context (and whether or not someone would share a piece of news with a post) is enough to reduce a user’s
ability to discern misinformation [30]. In an incredibly politicized and polarized media environment, this
could have radicalization impacts worth considering.
Moreover, when we layer in the complex cognitive and emotional mechanisms involved in perceiving social environments, we are obligated to consider the cognitive biases that are present. Of
particular interest is the salience bias, described as the minority salience bias in Kardosh et al., 2022 where
members of both the majority and minority group are susceptible to overestimating the prevalence of the
minority group in an offline recall task [47]. In the online context, Brady et al., 2023 find that people who
observe tweets written with varying degrees of outrage systematically overestimate the authors’ outrage, and
when they observe a feed with systematically overestimated outrage in the tweets the observers overestimate
the network’s outrage as a whole [17]. This suggests that in an information environment with heterogeneous
users and user experiences, small changes in the information presented to some users may have outsized
impact on those users’ perceptions.
1.3 Challenges and Contentions
Studying recommender systems in the context of social media and information diffusion raises a number of
challenges and areas where this thesis might have an impact:
Echo Chambers. As a mediator in information exposure, recommender systems are thought to be implicated in the formation of echo chambers: information environments where a person is primarily exposed
to opinions and information that align with their own. It is important to study recommender systems in this regard, as studies conflict on the extent to which these systems, as opposed to individual user choices, are responsible for these polarized information environments.
Misinformation Spread. In information-rich environments with the potential for cognitive overload, it is important to understand what information gets surfaced and exposed to users, as this information might be false or misleading. Misinformation and other forms of information disorder [65] are important to understand and study as they can have material consequences (e.g., misinformation around Covid-19 assisting in the formation of beliefs that directly impact people’s health). Studying the order in which information gets surfaced to users, and in what social contexts that information is provided, can be essential for understanding a recommender system’s role in this environment.
Data Privacy and User Consent. User data forms much of the economic value of major online social
media platforms, and the recommender systems that personalize feeds and information for users rely on
much of the same data. It is important to consider how data is collected, what data is being collected, and
whether or not users 1) understand what data collection they consented to and 2) actively consent to data
collection.
Algorithmic Transparency and Accountability. Being able to regulate and assign accountability in the case of bias or harm is an important aspect of studying algorithmic systems, especially those that are black-box or otherwise opaque due to business interests. How much of a role does a production recommender
system play in the creation of echo chambers? How much liability should the model creator have compared
to the platform the model is used on or the individuals who use the system?
Diversity and Representation. As this thesis is primarily concerned with perception and exposure, the politics around exposure to diverse groups, in both background and perspective, are an important consideration. This is because the systems may in fact work to marginalize certain groups or perspectives, which could go against the stated values of the organization or open the platform up to regulatory scrutiny.
1.4 Thesis Statement & Research Questions
This thesis shows the impact that recommender systems on social media platforms have on exposure bias.
More concretely, we show how to measure exposure bias both in vitro and in situ, demonstrating that personalized feeds can change the level of bias over time and under different conditions. We use data from sock-puppet
audits and simulations to understand the interactions between user behavior and algorithm-driven content
exposure. Additionally, we propose strategies for mitigating exposure bias, aiming to contribute towards
more diverse information diffusion in digital spaces.
This thesis is operationalized through the following research questions:
RQ1 Do different ranking algorithms demonstrate different levels of exposure bias?
Ranking algorithms, and ways to construct personalized feeds more broadly, allow platforms to optimize
an experience for each individual user. Following the Probability Ranking Principle in information retrieval [75], each user’s feed is intended to present the most engaging and relevant material towards the top of the feed and maintain the engagement as the user moves through the content. However, in trying to optimize proxy measures for this utility, such as engagement and time-on-platform, personalized feeds may inadvertently
impact exposure bias. To this end we compare different ranking algorithms and personalized feeds to assess
levels of exposure bias.
RQ2 Do different session lengths affect the level of exposure bias?
Users vary in the amount of time they spend on a platform each time they log in. Global users report an average of 30 minutes spent per day on any of the major platforms [2], which suggests the importance of assessing the impact of time spent online on who users get exposed to. To this end, we test a range of session lengths (which act as a proxy for time spent on the platform, as the main mediator is assumed to be the timeline), and assume each user observes a feed of the same length.
RQ3 Can we observe exposure bias in situ using empirical data from online social networks (OSNs)
like X/Twitter?
Empirical analysis of what users perceive in their time on online social media platforms is a non-trivial
task. We describe and execute an extended audit of Twitter before it became X and observe significant differences in exposure bias pertaining to pro- and anti-vaccine-science content when we compare personalized timelines to reverse chronological timelines.
RQ4 Can algorithmic ranking change the level of exposure bias over time?
To understand the differences between different kinds of algorithmic ranking, it is worthwhile to investigate the dynamics of exposure bias in an online social network. Alipourfard et al., 2020 identified different
hashtags experiencing higher levels of exposure bias than others in archival Twitter data from 2014; however, this analysis is aggregated over the course of the study, flattening the time dimension [10]. We consider a simulated
agent-based online social network platform to allow us to compare the effect of different ranking approaches
on exposure bias over time. We identify a greedy ranking algorithm based on social network structure as a
means for mitigating exposure bias.
1.5 Methodology
We utilize two forms of simulations and one form of algorithmic auditing in this research:
• Agent-based models. We simulate users, their behavior, and the recommender system that serves their
feeds. This allows assessing for unintended effects of system design choices that arise from interactions between agents, which would otherwise be difficult or computationally expensive to specify
analytically. We can also model complex and heterogeneous user behavior in the system in a more
straightforward manner. These models, however, are difficult to scale, and appropriate user behavior is hard to specify.
• Session-based simulations. We use archival data from Twitter covering a set of users, the people that
they follow, and all of the tweets and retweets made in that network to simulate how users would experience their feed under different recommender systems. This allows us to reliably claim differences
between recommender systems as a user under two different systems can be considered a fixed pair.
However, this approach has no information about user intent, and conclusions drawn should account for the sampling bias in the data.
• Sock-puppet audits. We use sock-puppet audits described by Sandvig et al., 2014 to understand how
the Twitter algorithm responds to user behavior in exposing users to pro- and anti-science users that
they follow [77]. This approach allows for fine-grained control of user behavior; however, it can raise concerns about scalability and generalizability.
1.6 Contribution
This thesis connects structural network phenomena like the friendship paradox and majority illusion directly
to algorithmic personalization of timelines in online social networks. Previous work models the connection
of recommender systems to socio-cognitive biases and information diffusion; however, we present work in
comparing different recommender systems and strategies for ranking feeds directly, as an emulation of what
systems might be like in production [27, 26].
1.7 Overview
The rest of this thesis is organized around the research questions described in Section 1.4. We detail relevant
work and background information in Chapter 2. We center our analysis on different ranking algorithms in simulations in Chapter 3. We describe sock puppet audits and the time-varying nature of exposure bias in Chapter 4. Finally, we describe agent-based simulations in Chapter 5, where we propose a simple greedy algorithm to reduce the prevalence of exposure bias.
Chapter 2
Related work
We situate the study of exposure bias at the interface between the cognitive sciences, public opinion research
& opinion dynamics, network science, and the study/auditing of recommender systems.
2.1 Cognitive Biases & Information Diffusion
In the cognitive sciences there is a wealth of research supporting the idea that humans are prone to many
different cognitive biases. Of particular interest is the salience bias: that humans are prone to paying more
attention to stimuli that are remarkable or prominent, i.e., those stimuli that are irregular or unexpected [47].
In social psychology this salience bias can manifest in identifying minority groups as standing out, often
resulting in an overestimation of their size. This is similar to the majority illusion, a structural phenomenon
in networks in which network structure distorts a user’s local view of the network [47]. For illustrative
purposes, we present an example network in Fig. 1.1 where some nodes experience the illusion due to how
the minority trait is distributed across the network.
There is a similar assessment of cognitive biases from the social sensing literature: while people have
the neurological capacity to observe and infer properties of their social network and mental states of others,
their judgments may be differentially accurate depending on the populations being asked about. Galesic,
Olsson and Rieskamp, 2012 detail a social sampling model where individuals sample from their immediate
social network to estimate characteristics like average household wealth [34]. Individuals surveyed tended
to be accurate within their immediate network, but were less accurate for some measurements when judging
larger populations.
These individuals can be subject to a number of biased interpretations of their network, especially when
it comes to interpreting messages for their emotional and moral content: Brady et al., 2023 shows individual
observers systematically over-perceive the amount of outrage authors felt while writing their posts on social
media [17].
The individual’s own beliefs can factor into their perception of their environment as well through the
false uniqueness (i.e., underestimating the prevalence of one’s views) and false consensus effects (i.e., overestimating the prevalence of one’s views). Empirical evidence suggests that Americans with more conservative views on climate change, as well as those with more conservative local norms and exposure to
conservative news, underestimate support for climate change policies by as much as half [81]. These biases
have also been shown to be important to overcoming collective action problems in theoretical games [78].
This speaks to the importance of understanding how OSNs mediate our exposure to our networks largely
through recommender systems: distorted exposures to one’s network can plausibly affect inferences about
broader populations.
The structure of the network is important for these biases as well: Lee et al., 2019 shows that whether
components of a social network are homo- or heterophilic and how many people in a network have a minority
trait can drastically impact the average perceived frequency of that trait [55]. This work is closely related to the work presented in this thesis; however, this thesis is concerned with how the recommender systems mediating
exposure to these networks can modulate that perceived frequency.
In the study of social networks these perception and cognitive biases are well understood to be integral
to how information diffuses through a network. Kooti, Hodas, and Lerman, 2014, investigated network paradoxes and found that they have behavioral origins, suggesting that individuals have distorted
perceptions of their networks [50]. Rodriguez, Gummadi, and Schoelkopf, 2014, studied the effects of cognitive overload on information diffusion and found that exposure to information is less likely to infect a user
if they are processing information from an ordered “queue” at higher rates. This thesis focuses on comparing
different feed “queue” strategies and is not explicitly concerned with information diffusion/overload [76].
A critical study for this thesis is that of Alipourfard et al., 2020, where they both introduce a primary
measure we use and study the local perception bias of various hashtags in Twitter data from 2014 [10]. They
identify various hashtags that were overrepresented to users relative to the hashtags’ global prevalence. A key
difference in this work is that we study how constructing feeds in different ways can affect the perception of
the prevalence of both content and user traits. We also are more concerned with understanding how sensitive
different feeds are to the distribution of the trait itself.
2.2 Public Opinion Research & Opinion Dynamics
In communications and public opinion research, the spiral of silence theory has been prominently used
to explain a number of mass media and public opinion events since its initial proposal by Dr. Elisabeth
Noelle-Neumann in 1974 [66]. This theory relies on individuals’ perception of the popularity of opinions
in their social environment: individuals will feel more confident sharing their opinion if they perceive that
their opinion is gaining popularity, whereas individuals who perceive that their opposing opinion is losing popularity will become more reluctant to share it, helping the new opinion seem more popular than it is. This
research connects neatly to other areas of research like social contagion and information diffusion.
This spiral of silence work also connects to models of opinion dynamics: Sohn, 2022 finds through agent-based models of social networks that large-scale spirals of silence are locally observable in online
social networks but unlikely to be observed at larger global scales unless certain conditions are met [80].
Das et al., 2014 present a broader opinion dynamics model that models how consensus and polarization can
happen in social networks, explaining empirical datasets better than the DeGroot model [24]. This work
however does not consider how people get exposed to their network, i.e., through a process mediated largely
by recommender systems. Recent work by Donkers et al., 2023 models the evolution of polarization in a
social network mediated by a simple recommender system [26].
2.3 Algorithmic Audits of OSN Recommender Systems
Beginning with Sandvig et al., 2014, there has been a steady and growing interest in auditing algorithms
in online systems for discriminatory behavior [77]. Algorithmic audits have focused on a wide variety of
sectors, ranging from e-commerce platforms [45] to search queries [79, 85] and OSNs [14, 15].
In general, recommender system audits have focused on content-based recommender systems, as
they are the most prevalent and straightforward to analyze. In particular, most recent relevant work has focused on YouTube and what role the video recommendation engine might have in spreading misinformation
and radicalization. These works, like Tomlein et al., 2021 [85], Hussein et al., 2020 [42] and Hosseinmardi
et al., 2024 [40] used agent-based sock puppets to simulate user behavior directly on the platform to measure
the response in terms of personalization, prevalence of misinformation, and ideological exposure.
Both Spinelli and Crovella, 2020 [82] and Ribeiro et al., 2020 [72] investigated YouTube’s video recommender system and how it might contribute to gradual user radicalization. Similar results were found by
Ledwich & Zaitsev, 2019 [54]. Ribeiro et al. in particular analyzed user interactions and comments, focusing on migrations of users between communities. It is important to trace users in their experience on the
platform to really understand how recommender systems and users interact. However, focusing on YouTube
overlooks social signals: the same information may be perceived positively by a user if it was presented
as being "liked" by a friend, but negatively by that user if it was presented by a stranger or someone they
dislike. This makes such content-based studies about recommendations less relevant for platforms like X
and Facebook.
Analyzing the “Who to Follow” link recommendation engine on Twitter, Beattie, Taber, and Cramer,
2022 presents a method for breaking echo chambers via user embeddings [16] and Ramaciotti Morales and
Cointet, 2021 focus on incorporating ideological positions into similar user embeddings [69]. Akpinar et al., 2022 identify that short-term fairness interventions fail to mitigate bias concerns in the long term in simulations of friend recommendations [8]. Stoica et al., 2018 find an algorithmic glass ceiling in recommendations between male and female users on real-world Instagram connection suggestions [84]. Our work is not concerned with friend recommendations; however, they are a very valuable tool for intervention.
From within X (then Twitter), Huszár et al., 2022 looked at the algorithmic amplification of political
parties across different countries on X with proprietary data on users [43]. They identify right-wing ideological bias under the algorithmic condition, suggesting that users in aggregate may be unduly influenced in
how they perceive politics (at least on the platform). In Lazovich et al., 2022 the authors focus on describing distributional inequality metrics and their utility in studying the outcomes of content recommendation systems [53]. Both of these are relevant to our work; however, they are more concerned with the
cumulative effect of algorithmic recommendations on the distribution of outcomes (i.e., who gets the top
1% of impressions, retweets, likes).
Guess et al., 2023 study Facebook and Instagram feed data to assess the impact of personalization on
user attitudes and political behaviors around the 2020 US election. They found no significant impact on
behaviors but a significant difference in on-platform exposure to untrustworthy and uncivil content on the
platforms [36]. González-Bailón et al., 2023 study the impact of algorithmic and social amplification in
spreading ideological URLs on Facebook, identifying ideological segregation taking place at both the algorithmic/exposure stage and the social amplification stage [35]. While these studies do not identify
changes in beliefs or behaviors pertaining to the US 2020 presidential election, it is important to note that
the effects may not be generalizable across different platforms and are focused on politics-related outcomes
from 2020 that may not apply in other domains given the relationship to Covid-19 and changes the platforms made regarding misinformation at the time.
A recent study by Vombatkere et al., 2024 presents a framework for measuring the amount of personalization a user is experiencing through a personalization score computed over a user trace [87]. Even though
this study is not explicitly about perception biases, this framework could be useful for assessing the extent
to which personalization is responsible for exposure biases. Similar work for personalization analysis on
TikTok was done by Kaplan & Sapiezynski, 2024 in auditing the “For You Page” TikTok recommender
system [46].
While not explicitly an OSN, Google Search has also been subject to audit studies. Robertson et al.,
2018 [74] recruited participants to complete a survey and install a browser extension that enabled the authors
to collect their query result pages. The authors found little supportive evidence for ideological filter bubbles
in search engines. Other lines of research in search engines focus on directly controlling for bias contained
in the data that serves as input to the search engine’s algorithmic system [51].
Finally, in an audit run primarily through the Twitter (now X) API, Chen et al., 2020 simulate partisan
users on the platform and expose various biases in timelines [20]. With simple behavior functionality, they
constructed bots to interact with tweets, generate tweets, and choose to follow/unfollow people. They find
conservative accounts observing more low-credibility content, and liberal accounts observing more ideologically moderate content.
Sociotechnical audits have been proposed as an extension to purely technical algorithmic audits [52].
Lam et al., 2023 demonstrate such an audit through a case-study of the effectiveness of personalized ad
targeting through ablation interventions on a set of users and survey-based analysis of the users’ experiences
with ads. This methodology, which accounts for the experience and interactions of the user with the platform
being studied, would be an important complement to this work, allowing for more concrete understanding of
algorithmic impacts on users’ perceptions of their networks as mediated by recommender systems. However,
in this work we do not take this approach, instead opting to consider sock puppet audits and user agents
with fixed behaviors to allow for a more concrete assessment of the impact of the recommender system in an environment that is as controlled as possible.
2.4 User Perception of Ranked Feeds
Most studies investigating the effects of personalized news feeds, defined as the ordered information a
user receives when logging onto a platform like X/Twitter, are focused around user satisfaction, impact on
information diffusion, and echo chambers. Researchers in human-computer interaction have been studying
user perception on OSNs for the past decade.
Notably this has been focused primarily on Facebook, where prominent work done by Eslami et al., 2016
identified several “folk theories” for how Facebook’s personalized News Feed curated the information users
in the study saw [31, 32]. FeedVis, the tool they developed, empowers users by presenting which friends
they will never see, rarely see, or often see, alongside other information about their feeds [31]. Our study differs
from this vein of research as we focus on comparing different recommender systems directly.
Various studies have also measured differences in ranked feeds on Twitter [14,
15]. Bandy & Diakopoulos 2021 utilize eight sock puppet accounts to identify differences in source diversity
and topic diversity between algorithmic and reverse chronological timelines [14]. Using a similar methodology, Bartley et al., 2021 utilize eight sock puppet accounts to identify differences in the popularity of the tweets
that the algorithmic timelines were exposed to relative to the reverse chronological ones [15].
2.5 Social Network Simulations
Social network multi-agent simulations for recommender system analysis have been largely focused on
information diffusion and predicting user behavior in different environmental circumstances. In Muric et al.,
2022 [64] the authors utilize Twitter, Reddit and GitHub data to understand information spread, especially
across different online platforms. This multi-platform study did not explicitly focus on comparing different
personalized feeds but rather focused on predicting information cascades in online environments.
Murdock et al., 2023 [63] used an agent-based model to simulate user and moderation behavior on
Reddit in order to assess the impact of community-based networks and the unique moderation structure the
platform has for belief diffusion.
Ribeiro, Veselovsky, and West, 2023 [73] utilized an agent-based model to address the paradox that
content-based recommender systems face: these systems do not seem to be the primary driver of what users
consume even though they favor extreme content. When incorporating a measure of user utility in choosing
which content to engage with, results suggest users will not engage with suggested low-utility content. Our
work differs in that we compare different recommender systems and how they would behave considering
the same user behavior patterns. Our work also considers platforms that use more social information in the
recommendations, as a piece of content might appear in your feed if your friend interacts with it.
Donkers et al., 2021 [27] and Donkers et al., 2023 [26] studied both epistemic and ideological echo
chambers in social media and the effects of diversifying recommendations on discussions. They used knowledge graph embeddings to make diverse recommendations in a retweet network to depolarize discussions
between users with varying propensities for accepting new information (i.e., retweeting something a peer
posted). While depolarizing discussions is an important goal in this line of work, in our current study we
compare different recommender systems and their effect on perception instead.
Chaney et al., 2018 [19] and Lucherini et al., 2021 [57] simulate recommender systems and users within
them, allowing for comparing different algorithmic recommendations and how different algorithms might
contribute to homogenization in the content-recommender context. Our work here also compares different
recommender systems, however in the social network domain and with a focus on how different systems
can mediate user perception of their network.
Akpinar & Fazelpour, 2024 work to simulate the X recommender system by implementing the RealGraph component of the stated system and comparing it against baselines while running agent-based simulations to understand how minority and majority groups might interact in professional social networks
[9]. More specifically they find that recommender systems can lead to a decline in the professional content
produced by minority groups in the network over time.
2.6 Recommender System Fairness
Enforcing constraints to address allocative harms, such as unfair exposure, has been explored extensively in recommender systems [88, 25, 18, 7]. However, these works tacitly assume the inventory of items is universally
accessible to each user, and also do not consider systems that rely on user-generated content (i.e., where
consumers are also producers). Given the network structure underpinning social media platforms, who one
user is connected to will have substantial influence on the content they are exposed to. To this end, it is
important for measures of fairness in recommender systems to be designed with users’ perceptions, possible
cognitive biases and their interactions with other users (not to mention the system itself) in mind.
2.7 Social Media Usage & Effects
We close this section by discussing the research around the potential impact social media usage has on individuals. There is a wealth of research associating the usage of social media with the prevalence of eating
disorders, especially among young women across cultures [23]. In particular, young women seem to compare themselves with close and distant peers, but they compare themselves even more with influencers [68].
Social media literacy and other internal factors like body appreciation seem to generally be protective against
developing disordered eating and body image issues, suggesting that social comparison mechanisms can be
mitigated by how users process the information they get from their personalized social media [23]. This
thesis is focused on assessing the network information that users get exposed to, before they would generate
any perceptions of their networks.
Young men are also susceptible to online social media influence, with usage being associated with risk
of body dissatisfaction [83] and possibly being associated with the risk of adoption of misogynistic beliefs
that can lead to enacting violence [49].
As it pertains to vaccine uptake and hesitancy, studies often present conflicting results depending on the
platform being studied and the type of social media usage. Information-seeking active social media use is
positively associated with vaccine intent, whereas passive exposure is more broadly observed to be negatively associated with intent/uptake [44, 60, 59]. In chapter 4 we discuss exposure to pro- and anti- vaccine
science users on Twitter as a potential means for understanding the negative association with intent/uptake.
Chapter 3
Measuring Exposure Bias
In this chapter we discuss how to measure exposure bias. Consider a small network centered around one
user and who they follow. Some accounts that user follows are likely more active than others, which can
make their opinions over-represented in a user’s feed. In addition, the central user will vary in how much
time they spend on the platform: some sessions will be short where they read just a few of all possible
messages posted in their feed, while other sessions will be longer. These factors, combined with how the
recommendation algorithm orders the messages within the user’s feed, will invariably affect the information
the user sees. Exposure bias here refers to such potential distortions: how frequently the user gets exposed to particular people and content in their feed can deviate from their true prevalence. In a sense, minimizing this bias can be seen as
enforcing a statistical parity in exposure [61].
We discuss measuring exposure bias in the context of the following research questions:
RQ1. How do different personalization algorithms affect exposure bias?
RQ2. Do different length sessions experience different levels of exposure bias?
We discuss our data and study set-up for answering these two research questions before we discuss how
we measure exposure bias in Section 3.1. We then discuss the results of our studies with various recommender systems and personalized timelines in Section 3.2. We discuss the implications of the results for
multiple platforms in Section 3.3.
3.1 Data & Methods
3.1.1 Twitter Data
Starting in March 2014, Alipourfard et al., 2020 [10] gathered from X (then Twitter) all the followed accounts of 5,599 seed users; we refer to these followed accounts as friends. The authors queried for the followed accounts daily
through September 2014 to identify any new friends of seed users. This subset of the Twitter follower graph
has over 4M users and more than 17M edges.
In addition to follow relations, the authors also collected messages posted by seed users and their friends
over this time period, roughly 81.2M tweets. In this work we consider tweets from May 2014 to September
2014.
At the time of data collection, Twitter created a feed for each user by aggregating all messages posted by
the user’s friends and ranking them in reverse chronological order. Using these data, we are therefore able
to reliably reconstruct the feed for each seed user and quantify empirical exposure bias.
To summarize, we use the tweets and retweets from the 5,599 seed users and all of the people they followed at the time of collection in 2014.
3.1.2 Reconstructing Feeds
To address our research questions we make use of the described empirical data. This data is important for
this situation because it was collected before Twitter implemented an algorithmic recommender system for
constructing feeds in 2016. Because we know the users all experienced the same chronological feed, we can assume we avoid algorithmic confounds [19] and can re-rank and construct new feeds to explore the exposure effects of the different feed algorithms.
With the Twitter data we construct artificial “sessions”: for each user u we assume that they browse
their feed one time on any calendar day d they have any activity (either tweets or retweets). On each day d
we then select all tweets and retweets that friends of u made that day and sort the tweets according to the
specific feed construction algorithm. For analysis we only consider sessions with at least ten tweets and at
least one unique friend seen.
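As an illustration of this session-construction step, the following sketch groups each seed user's candidate tweets by calendar day. The function and column names are hypothetical placeholders chosen for illustration, not the code used in this study.

```python
import pandas as pd

def build_daily_sessions(tweets: pd.DataFrame, friends_of: dict) -> dict:
    """Group each seed user's candidate tweets by calendar day.

    tweets: DataFrame with columns ['author_id', 'timestamp', ...] (timestamp is datetime).
    friends_of: mapping seed_user_id -> set of followed account ids.
    Returns {(seed_user_id, date): DataFrame of that day's candidate tweets}.
    """
    tweets = tweets.assign(day=tweets["timestamp"].dt.date)
    sessions = {}
    for seed, friends in friends_of.items():
        candidates = tweets[tweets["author_id"].isin(friends)]
        for day, day_tweets in candidates.groupby("day"):
            # Keep only sessions with at least ten tweets and at least one unique friend.
            if len(day_tweets) >= 10 and day_tweets["author_id"].nunique() >= 1:
                sessions[(seed, day)] = day_tweets
    return sessions
```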
We construct feeds according to the following strategies:
1. Popularity. We rank each of the tweets by their historical total number of retweets.
2. Reverse Chronological. We rank each of the tweets by the time they were posted (we assume the
user logs in at the end of the day and observes tweets closer to the end of the day first).
3. Random. We rank each of the tweets randomly.
4. Non-negative Matrix Factorization (NNMF). To test a collaborative filtering model we use non-negative matrix factorization on a matrix of which accounts each seed user retweets. We chose a user-user matrix to reduce the sparsity and complexity of a more traditional user-item matrix (which would be of size 5,599 × 30M). Once we factorize the matrix into two submatrices W of size 5,599 × d and H of size d × |friends|, we can take their dot product to generate a vector for each user and use it to rank their friends when encountering tweets on any particular day (a minimal sketch of this ranking step appears after this list).
5. Logistic Regression. Given that the X/Twitter timeline personalization system utilizes a logistic regression model at the candidate-ranking step [3], we utilize a simplified logistic regression model
with user-based features to rank tweets by the probability that the user will retweet each tweet. For each
user, we gather all possible tweets the user could have seen and all their retweets and then we train an
individual model on that data. For users without retweets we use a similar user’s model for prediction.
6. Neural Collaborative Filtering (NCF). We implement a dense neural network meant for personalized recommendations of tweets [39]. The model has two parallel embedding layers for user and
tweet (item) inputs, which are then concatenated and passed into a dense layer with ReLU activation.
The output layer predicts the likelihood of a user retweeting a tweet. The model uses a binary focal
cross-entropy loss function.
7. Wide&Deep. Similar to the NCF model, we implement the Wide&Deep model as described in Cheng
et al., 2016 [21]. We chose this model as researchers at Twitter described using a modified Wide&Deep
model for ad recommendation in 2020 [38].
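The sketch below illustrates the NMF-based strategy referenced in item 4. The matrix shape, hyperparameters, and helper names are illustrative assumptions rather than values taken from the study code; it uses scikit-learn's NMF implementation.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical retweet-count matrix: rows are seed users, columns are friends.
rng = np.random.default_rng(0)
retweet_counts = rng.poisson(0.1, size=(200, 500)).astype(float)

# Factorize into W (users x d) and H (d x friends), then reconstruct affinity scores.
model = NMF(n_components=16, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(retweet_counts)   # shape (n_users, d)
H = model.components_                     # shape (d, n_friends)
scores = W @ H                            # affinity of each user for each friend

def rank_feed(user_idx, candidate_friend_idxs):
    """Order a day's candidate tweets by the posting friend's affinity score."""
    affinities = scores[user_idx, candidate_friend_idxs]
    return [candidate_friend_idxs[i] for i in np.argsort(-affinities)]

print(rank_feed(0, [3, 42, 7]))
```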
3.1.3 Simulating User Attributes
To measure the exposure bias we assign each user in the network a binary random variable X ∈ {0, 1} with
a fixed uniform probability P(X = 1) = 0.10. This variable can represent the user’s ideology (e.g., liberal
vs conservative), status (e.g., verified vs not), or it can represent a one-hot encoding of a belief or a topic the
user shares. We choose a prevalence of 0.10 because we wanted to represent a distinctly minority trait of
the population and observe how activity and exposure could distort it. We also run the same analyses under
P(X = 1) = 0.05 and P(X = 1) = 0.50 to verify that the results are consistent.
After all accounts in the follower graph have been assigned values of the variable, we can then measure
its prevalence in each user’s feed. This allows the user to estimate the fraction of “positive” (i.e., with value
xi = 1 ) friends within their network. To assess the relationship between the network structure and this
random variable X we follow the attribute-swapping procedure as described in Lerman et al., 2016 [56] to
vary the degree-attribute correlation ρkx:
\rho_{kx} = \frac{1}{\sigma_x \sigma_k} \sum_{x,k} x\,k \left[ P(x,k) - P(x)P(k) \right]
          = \frac{1}{\sigma_x \sigma_k} \sum_{k} k \left[ P(x=1,k) - P(x=1)P(k) \right]
          = \frac{P(x=1)}{\sigma_x \sigma_k} \left[ \langle k \rangle_{x=1} - \langle k \rangle \right]
where x is the binary attribute, k is the degree of the node (here in-degree), σx, σk the standard deviations of
the binary attribute and in-degree respectively, and ⟨k⟩ the average in-degree over all nodes.
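The closed form above can be computed directly from per-node in-degrees and attributes. The sketch below is an illustration under that assumption (NumPy arrays of in-degree and binary attribute), not the exact implementation used in this study.

```python
import numpy as np

def degree_attribute_correlation(in_degree: np.ndarray, x: np.ndarray) -> float:
    """rho_kx = P(x=1) / (sigma_x * sigma_k) * (<k>_{x=1} - <k>)."""
    p1 = x.mean()
    sigma_x, sigma_k = x.std(), in_degree.std()
    mean_k = in_degree.mean()
    mean_k_given_x1 = in_degree[x == 1].mean()
    return p1 / (sigma_x * sigma_k) * (mean_k_given_x1 - mean_k)

# Example: 1,000 nodes with a heavy-tailed in-degree distribution and a 10% trait.
rng = np.random.default_rng(1)
k = rng.zipf(2.0, size=1000).astype(float)
x = rng.binomial(1, 0.10, size=1000)
print(degree_attribute_correlation(k, x))
```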
In previous works on the majority illusion [56] the strength of the effect has been dependent on the
following:
1. The effect is stronger the more disassortative the network becomes
2. The effect is stronger the higher the ρkx
3. The effect is stronger when the network’s degree distribution has a heavier tail
While some network traits (e.g., assortativity) will be impractical for a platform to modulate globally,
others can be effectively altered through the user interface (which a recommender system can mediate via
the feed): new friends can be recommended to the user to follow; the balance between in-network users
exposed to and out-network users exposed to can be modulated; the coverage of a network can be increased
through compression (e.g., showing snippets of conversations rather than the whole conversation suggesting
the conversation is relevant to the user).
In this study we only vary ρkx from 0.0 to 0.55 as the full empirical network we use has approximately
0.53 as the limit we could reach with the swapping procedure when P(X = 1) = 0.05.
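One way such a swapping procedure can be implemented is as a simple hill-climb that repeatedly moves the positive trait from a lower-degree node to a higher-degree node until the target correlation is reached. The sketch below is an illustrative reading of this kind of procedure, not a verbatim reimplementation of Lerman et al., 2016 [56]; it reuses the degree_attribute_correlation helper sketched above.

```python
import numpy as np

def swap_to_target(in_degree, x, target_rho, max_swaps=100_000):
    """Increase rho_kx by giving the positive trait to higher-degree nodes."""
    x = x.copy()
    for _ in range(max_swaps):
        if degree_attribute_correlation(in_degree, x) >= target_rho:
            break
        pos = np.flatnonzero(x == 1)
        neg = np.flatnonzero(x == 0)
        # Move the attribute from a random positive node to a higher-degree negative node.
        i = np.random.choice(pos)
        higher = neg[in_degree[neg] > in_degree[i]]
        if higher.size == 0:
            break
        j = np.random.choice(higher)
        x[i], x[j] = 0, 1
    return x
```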
3.1.4 Measures of Exposure Bias
We use the following metrics to measure exposure biases:
1. Fraction of positive friends (with xi = 1) seen per day:
F_{u,i} = \frac{|N(u)_{x=1,i}|}{|N(u)_{x=1,i}| + |N(u)_{x=0,i}|}
2. Fraction of tweets seen per day that are positive:
F_{t,i} = \frac{|\mathrm{tweets}_{x=1,i}|}{|\mathrm{tweets}_{x=1,i}| + |\mathrm{tweets}_{x=0,i}|}
3. Local Perception Bias:
B_{\mathrm{local}} = E[q_f(X)] - E[f(X)]
4. Number of users experiencing the majority illusion per day:
M_i = |\{u : F_{u,i} > 0.50\}|
5. Gini Coefficient (over exposure counts sorted in ascending order):
G = \frac{\sum_{i=1}^{n} (2i - n - 1)\, x_i}{n \sum_{i=1}^{n} x_i}
For the Gini coefficient we assume that the tweets are distributed across all of each user’s friends, i.e.,
that if a seed user has 50 friends and only one has tweets that are observed then the Gini is computed over all
50 friends instead of just the one observed friend that day. The Gini coefficient is useful for measuring exposure bias as it gives us a sense of the attention/exposure distribution over an average user's friends.
While by itself it cannot tell us the local perceived prevalence of a trait, it can identify a skew in exposure
which will complement other measures.
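Under the convention above (every friend enters the computation, with a zero count for friends not seen that day), the Gini coefficient can be sketched as follows; the sorted-index formula matches the expression in the metric list, and the example values are hypothetical.

```python
import numpy as np

def gini(exposure_counts: np.ndarray) -> float:
    """Gini coefficient of a user's per-friend exposure counts for one day.

    Friends with no observed tweets that day should be included as zeros.
    """
    x = np.sort(exposure_counts.astype(float))
    n = x.size
    if n == 0 or x.sum() == 0:
        return 0.0
    i = np.arange(1, n + 1)
    return float(((2 * i - n - 1) * x).sum() / (n * x.sum()))

# A seed user with 50 friends, only one of whom was seen (10 tweets) that day.
counts = np.zeros(50)
counts[0] = 10
print(gini(counts))  # close to 1: exposure is concentrated on a single friend
```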
We define Blocal in terms of E[q_f(X)], the expected prevalence of the attribute among a node's friends, where
E[q_f(X)] = \bar{d} \cdot E[f(U)\,A(V) \mid (U, V) \sim \mathrm{Uniform}(E)]
and where E[f(X)] is the global prevalence of the node attribute f; \bar{d} represents the expected in-degree of a node; f(U) the attribute value of node U; and A(V) the attention node V pays to any particular friend.
Intuitively, this Blocal measure is the difference between the average fraction of activated friends that random users are exposed to and the true prevalence of the trait (represented by |users_{x=1}| / |users|). Alipourfard et al., 2020 [10]
provides an intuitive empirical example from Twitter: the prevalence of different hashtags in tweets across a
set of tweets is seen to be different than the fraction of tweets containing those hashtags different users see
in their feeds. A positive number indicates the average user should expect to see a higher fraction of friends
in their feeds with x = 1 than the actual global prevalence.
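A sketch of how Blocal could be computed from the reconstructed feeds follows. It assumes uniform attention over the friends actually observed in a session (i.e., A(v) = 1/d_i(v)), under which averaging q_f(v) over nodes is equivalent to the edge-based expectation above; it is illustrative rather than the exact study code, and the toy values are invented.

```python
import numpy as np

def local_bias(observed_friends_per_user, attribute):
    """B_local = E[q_f(X)] - E[f(X)] with uniform attention over observed friends.

    observed_friends_per_user: one array of observed friend ids per seed user.
    attribute: maps every account id in the network to its binary trait value.
    """
    # q_f(v): fraction of a user's observed friends that carry the positive trait.
    q = [np.mean([attribute[f] for f in friends])
         for friends in observed_friends_per_user if len(friends) > 0]
    global_prevalence = np.mean(list(attribute.values()))
    return float(np.mean(q) - global_prevalence)

# Toy example: two seed users, a network of five accounts, trait prevalence 0.2.
attrs = {1: 1, 2: 0, 3: 0, 4: 0, 5: 0}
feeds = [np.array([1, 2]), np.array([1, 3, 4])]
print(local_bias(feeds, attrs))  # positive: the trait is over-represented in feeds
```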
3.2 Results
3.2.1 RQ1 - Difference between Feeds
We report the results of the empirical data analysis in Figures 3.1 - 3.10. Of note is the different behavior
we observe between the conditions: Blocal for the reverse chronological feed is noticeably higher than the
random and popularity feeds (on average), but seems to be lower than the NNMF, NCF, and logistic regression feeds. We note that we use the same attention function when computing Blocal for each of the conditions, leaving any remaining differences to the feed strategy. We report the Gini graphs in Figs. 3.3 and 3.4 and
note that the coefficient is not sensitive to ρkx. This is due to the fact that we compute Gini over all users
in the network regardless of their trait x. We do report the statistical analysis in Table 3.1. In the logistic
regression plot in Fig. 3.2(b) we observe that for shorter feeds the bias goes above 1.0, which may be an
artifact of how we compute the bias for shorter length sessions.
Interestingly, the other metrics do not visibly discriminate between some of the conditions: Fig. 3.5
shows that Fti (and similarly Fui) exhibits a significant difference in the smaller length sessions consistently
across ρkx. For the model-based feeds in Fig. 3.6 we can observe a slight difference in how the metric
behaves in terms of session lengths.
Feed 1  Feed 2  Blocal  G  Mi  (P(X = 1) = 0.10)
Pop. Wide&Deep -137.85** 20.99** -0.81
Pop. Rand. -41.36** 8.17 ** 1.07
Pop. RevChron. -162.60** -26.44 ** 5.32 **
Pop. NNMF -169.58** -17.60** -0.14
Pop. NCF -44.13** 8.21** 44.47 **
Pop. LogReg -205.73** -44.86 ** 12.48 **
Wide&Deep Pop. 137.85** -20.99** 0.81
Wide&Deep Rand. -7.16** 7.47** 1.58
Wide&Deep RevChron. -204.21** -34.99** 6.08**
Wide&Deep NNMF -175.69** -29.14** 0.42
Wide&Deep NCF -26.01** 2.49* 44.82**
Wide&Deep LogReg -213.47** -46.03** 12.87**
Rand. Pop. 41.36** -8.17** -1.07
Rand. Wide&Deep 7.16** -7.47** -1.58
Rand. RevChron. -10.70** -8.95** 2.61**
Rand. NNMF -9.50** -8.70** -1.19
Rand. NCF -25.40** -3.58** 43.27**
Rand. LogReg -117.02** -10.49** 10.49**
RevChron. Pop. 162.60** 26.44** -5.32**
RevChron. Wide&Deep 204.21** 34.99** -6.08**
RevChron. Rand. 10.70** 8.95** -2.61**
RevChron. NNMF 13.09** 7.23** -3.97**
RevChron. NCF -14.37** 12.66** 43.70**
RevChron. LogReg -213.92** -39.19** 8.75**
NNMF Pop. 169.58** 17.60 ** 0.14
NNMF Wide&Deep 175.69 ** 29.14 ** -0.42
NNMF Rand. 9.50 ** 8.70 ** 1.19
NNMF RevChron. -13.09** -7.23** 3.97**
NNMF NCF -15.31** 11.12** 44.37 **
NNMF LogReg -205.03** -41.06** 14.03 **
NCF Pop. 44.13** -8.21** -44.47**
NCF Wide&Deep 26.01** -2.49* -44.82**
NCF Rand. 25.40** 3.58** -43.27**
NCF RevChron. 14.37** -12.66** -43.70**
NCF NNMF 15.31** -11.12** -44.37**
NCF LogReg -104.33** -20.46** -42.86**
LogReg Pop. 205.73** 44.86** -12.48 **
LogReg Wide&Deep 213.47** 46.03 ** -12.87**
LogReg Rand. 117.02** 10.49** -10.49**
LogReg RevChron. 213.92** 39.19** -8.75**
LogReg NNMF 205.03 ** 41.06 ** -14.03 **
LogReg NCF 104.33** 20.46** 42.86**
Table 3.1: Empirical Pairwise Significance Tests. We treat the seed users in different feed conditions as
paired samples and compute a paired t-test over the mean values of the various metrics. Conditions that are
bolded are least biased according to the measure. * - p <0.05, ** - p <0.01
Length 1  Length 2  Blocal  G  Mi  (P(X = 1) = 0.10)
10 100 224.70** 75.71** -29.48**
10 Full length 215.15** 74.52** -29.50**
100 10 -224.70** -75.71** 29.48**
100 Full length 205.63** 43.38** -3.37**
Full length 10 -215.15** -74.52** 29.50**
Full length 100 -205.63** -43.38** 3.37**
Table 3.2: Empirical Session-Length Pairwise Significance Tests. We treat the seed users in different
feed conditions as paired samples and compute a paired t-test over the mean values of the various metrics
computed for each session length. * - p < 0.05, ** - p < 0.01
Interestingly, as reported in the paired t-tests in Table 3.1, we observe the popularity condition being less biased than the random and reverse-chronological conditions and the recommender-based feeds. Mi reports a nearly opposite ordering of the conditions; however, three differences were not significant under P(X = 1) = 0.10. The Gini coefficient, in turn, reports the popularity condition being more biased than the random and Wide&Deep conditions, though all three are less biased than the other model-based systems and the reverse chronological feed.
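The pairwise comparisons reported in Tables 3.1 and 3.2 follow the usual paired t-test recipe over per-user means. A minimal sketch with scipy is shown below; the per-user arrays are hypothetical stand-ins for the study's measurements.

```python
from itertools import combinations
import numpy as np
from scipy import stats

# Hypothetical per-seed-user mean B_local values under each feed condition;
# index i always refers to the same seed user, so samples are paired.
rng = np.random.default_rng(2)
per_user_blocal = {
    "popularity": rng.normal(0.05, 0.02, size=5599),
    "revchron": rng.normal(0.08, 0.02, size=5599),
    "random": rng.normal(0.06, 0.02, size=5599),
}

for a, b in combinations(per_user_blocal, 2):
    t, p = stats.ttest_rel(per_user_blocal[a], per_user_blocal[b])
    print(f"{a} vs {b}: t={t:.2f}, p={p:.3g}")
```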
3.2.2 RQ2 - Difference between Session Lengths
Can users compensate for exposure bias by consuming a larger share of their feeds? We model the attention
users pay to their feed through the session length parameter. The longer the length of the session, the more
tweets users consume. In the empirical data, we observe that the feed conditions behave similarly in terms
of session length as seen in Figs. 3.1 and 3.2. For Blocal the longer sessions show less bias than the shorter
sessions as seen in Table 3.2. For the Gini coefficient we observe that the session lengths are ordered largely
the same, with longer sessions being less biased than shorter ones at higher values of ρkx. Interestingly, we observe the reverse ordering for Mi.
3.3 Discussion
3.3.1 RQ1 - Difference between Feeds
The empirical 2014 data shows a tight relationship between the bias Blocal and the correlation ρkx; given the
fixed nature of the seed users we are able to treat this as a paired sample across the different feeds. With this
we take a pairwise paired t-test and assess significant difference for each pair. Table 3.1 shows that for Blocal
the popularity feed is less biased than the other feeds, and that for Mi the NCF model is the least biased.
We would expect the random feed to be the least biased in all metrics, as we would expect the relationship
to ρkx to disappear in the random feed. The fact we observe a relationship suggests we are not necessarily
evaluating an appropriate random baseline. In simulated networks, we shuffled the data generated for each
seed user in order to eliminate the relationship to ρkx. However, doing this in the empirical data did not seem to
have any significant impact.
Our findings suggest that seed users face greater exposure bias in model-based feeds compared to simple
retweet-based and reverse chronological feeds, with logistic regression-based feeds showing the highest bias
among all, at least according to the local bias measure. We also interpret the reverse chronological feeds as exhibiting less diversity in users (according to the Gini coefficient). This could be due to a time-based reliance on user activity: activity patterns mean that similar sets of people post towards the end of the day, which limits the chance to observe more unique users while increasing the chance to encounter more X = 1 users. The different metrics
complement each other here and offer a richer interpretation of the different models. The different metric
behaviors also suggest that mitigating exposure bias requires considering which metrics to minimize, Pareto
optimization, or finding alternate ways to reduce all measures simultaneously. We consider this further in
our agent-based simulations in chapter 5.
We believe that the difficulty in visually discriminating models according to some of the metrics is due
to the structure of the graph and the user activity itself: Lee et al., 2019 identifies homophily and minority-group size as factors that explain perception biases in networks, and we observe here that the length of the
sessions and recommender systems can moderate this effect [55].
3.3.2 RQ2 - Difference between Session Lengths
When we consider the empirical session length differences, we see a significant difference between each
length within each feed condition. We present the aggregated scores in Table 3.2.
Interestingly Blocal reports higher bias for shorter sessions than longer ones; this can be explained by the
number of unique friends observed in longer sessions. Longer sessions create a larger universe of edges to
compute the expected value over, yielding a value closer to the population estimate. This may also create an
artifact in our computation as we observe Blocal > 1.0 in Fig. 3.2 for logistic regression, which is interesting
because we observe the same effect in P(X = 1) = 0.05 and P(X = 1) = 0.50.
It is also important to note the clear regime change across all conditions in Figs. 3.9 and 3.10, where the
shorter length sessions remain the most biased in the lower values of ρkx, but then switch and the longer
sessions become more biased in the higher values. The fact that it is consistent across the conditions suggests
it is something more in the network structure that determines the pivot point: perhaps the number of users
who experience Fui ∼ 0.50 is significant enough that giving a few high-degree central nodes the active trait
puts those users over the 0.50 threshold.
Given that the local bias and Gini coefficient metrics roughly agree on the ordering of the biases, we
interpret this as reporting that the same seed users experience significantly more exposure bias in shorter
feeds than longer feeds on average. Similar caveats to this interpretation exist as in section 3.3.1, where how
we compute each metric (and especially Gini coefficient) could be resulting in very small standard error
when running the t-tests.
Interestingly, we observe a similar asymptotic relationship between each of the models and the degree-attribute correlation as in the previous section. We interpret the differences between models as a sort of saturation effect, where, above a high enough correlation threshold, the models yield similar exposure bias. This suggests that the homophily and network structure-based effect discussed by Lee et al., 2019
dominates at that point [55].
We believe this methodological approach is appropriate for detailing the differences between feeds, as
a common caveat in previous auditing work is the lack of uniform application of treatments, i.e., that even if
some users are subjected to reverse chronological feeds, their friends may instead be using the personalized
feeds which could leak any personalization effects into their feeds. Uniform treatments without algorithmic
confounds, albeit in an archival context where data is generated assuming one context, can give us more
confidence in the difference between conditions.
3.3.3 Considerations for Practitioners
We can modulate the effective correlation ρkx in a practical manner by choosing network edges to observe
to change ⟨k⟩x=1 − ⟨k⟩, which we discuss further in Chapter 5. However, this has ethical implications in
varying how often different people get observed in expectation. This should be considered in tandem with
measures of individual and group fairness to make feeds robust to biases.
The number of unique friends that a user is exposed to may be a useful quantity for platforms to maintain as a measure of overall platform fitness; it will have an impact on the measures we present in this study. More specifically, observing more unique friends will likely lower the Gini coefficient and shrink the absolute value of Blocal.
3.3.4 Other Platforms
An interesting dynamic that has changed since 2014 is that platforms like X will possibly serve users content
from friends-of-friends, which is an out-of-network set of users for the central feed recommender system
to sample from to modulate the users’ passive perceptions of their network. For instance, if a user follows
friends who solely are of trait xi = 0 then the out-of-network suggestions can be generated to surprise the
user with posts from users with trait xi = 1.
This approach is widely applicable to other recommender systems that expose people to their social
networks, primarily through the idea of a partially observed network: if we consider repeated exposure to
specific users and kinds of content as a weak tie in a social network then we can consider exposure to a user’s
video as someone observing an edge to that user in a partially observed network. That observed network may
have a different prevalence of certain traits than what you would expect from either the follower network or
total universe of possible users one could observe.
1. TikTok. TikTok's For You Page (FYP) is described as explicitly considering user interactions, such as videos shared and accounts followed, among other factors, in personalizing a user's FYP. TikTok also states that it explicitly diversifies feeds to prevent repetitive exposure to particular users and/or types of content [41]. Exposure to different communities on TikTok (so
called -tok communities, e.g., BookTok) and certain users can be analyzed under this exposure bias
framework in that the exposed prevalence of certain traits within said communities can be compared
to larger scale prevalence (e.g., how prevalent the traits are in the users’ geographic area).
2. Facebook. Facebook's Feed works in a similar way as on X (formerly Twitter): the system first gathers the inventory of posts from friends, pages, and groups, then computes a predicted relevance score and rank-orders each user's feed, with relevant out-of-network posts interspersed [70]. Given the parallels to how X works, this framework readily applies to Facebook.
3. Instagram. Three recommender system components are worth investigating on Instagram: the Explore page, the Instagram Feed, and Reels. The Feed works much like Facebook and X as described
previously, where a user’s activity, the information about each post, and the user’s previous interactions with each user in their network are used to order the feed. The Explore page is explicitly designed
to show content from accounts that the user does not follow, drawing on information from followed
accounts, information from posts that were interacted with and general connections on Instagram. The
Reels page shows reels (short videos like on TikTok) from both accounts that you follow and public
reels from accounts that are recommended to the user [4].
4. YouTube. The YouTube recommendation system has been a primary focus of auditing efforts for
the kinds of information and news the system recommends to users [72]. Signals that are used to drive recommendations include clicks, watch time, total views, user surveys for “valued watchtime”
and other interactions including shares, likes and dislikes [5]. If we consider how content creators are
related to each other online (in the forms of professional relationships, content networks, etc), and
how certain kinds of content can be grouped together (e.g., political content being grouped together
ideologically), we can construct a user-creator network to assess exposure. When combined with the
personalized front page, our framework should be applicable.
5. Pinterest. Pinterest recommends content (“Pins”) based on the creators that each user follows, the
boards they have created and general browsing history [6]. The content focus of the platform’s recommendation system makes it comparable to YouTube or TikTok in that the variety and diversity of
the content people see might constitute a certain exposure bias, albeit one that makes sense as users
are likely to focus on pins that they find useful for their boards.
3.4 Limitations & Future Work
A primary limitation for the work in this chapter is the reliance on Twitter data from 2014. The benefit of
the data providing a vertical slice of data for the subgraph is important, but it would be important to verify
that the differences in recommender strategies hold in other networks as well, especially networks that are
larger than the one we analyzed here. We also assume that users would have made the same retweets under each recommender condition, even though those retweets were originally made under the default reverse chronological feed of the time.
Another limitation for this chapter (and work overall) is the simplifying assumption of binary user traits.
To describe this further, assessing the perceived prevalence of a particular political ideology may prove
problematic in that each user may be perceived differently by different friends. For example, person A may
be perceived as being on the political left by a conservative observer B, however that same person A may
be perceived as being conservative by a third liberal observer C. Allowing for more complex traits, as well
as traits that are not fixed would lead to valuable insights. This would especially extend to traits and social
environments where users may be selective in who and how they disclose their traits [48].
Future work could entail extending different feed strategies to be more complex. For instance, we could
study the friends-of-friends feed, where we would prioritize friends and allow friends-of-friends to appear in the feed as a supplement. We pursue this in the agent-based simulations in Chapter 5. This would bring the analysis closer to the empirical Twitter self-studies that describe in-network and out-of-network tweet impressions [53, 43]. Similarly, we could actively train and run more complicated recommender systems to assess their propensity for exposure bias. The problem remains, however, that we would
need an online recommender system that would work without active feedback, which is difficult to find in
the body of content recommender systems. Similarly, in this archival data we do not have access to all the
user and tweet-level features that are described as being used in more recent iterations of the system.
In a similar vein, we could incorporate more adjustable metrics like the Atkinson Index, which extends
the Gini coefficient to weight the low-end of the distribution higher (which would be relevant in a power-law
environment like a social network) [11].
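For reference, the Atkinson index takes an inequality-aversion parameter ε that controls how strongly the low end of the distribution is weighted. The sketch below uses the standard formulation of the index (not anything specific to this thesis), with invented exposure counts as the example input.

```python
import numpy as np

def atkinson(x: np.ndarray, epsilon: float = 0.5) -> float:
    """Atkinson index; larger epsilon weights the low end of the distribution more."""
    x = x.astype(float)
    mu = x.mean()
    if mu == 0:
        return 0.0
    if epsilon == 1.0:
        # Uses the geometric mean; requires strictly positive values.
        return 1.0 - np.exp(np.log(x).mean()) / mu
    return 1.0 - (np.mean((x / mu) ** (1.0 - epsilon))) ** (1.0 / (1.0 - epsilon))

exposures = np.array([10.0, 1.0, 1.0, 0.5, 0.5])
print(atkinson(exposures, epsilon=0.5), atkinson(exposures, epsilon=1.5))
```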
In this work we assume the social graph is fixed; in future work it would be worthwhile to relax that assumption and allow the social graph to evolve over time as users choose new users to (un)follow. Finally, it would be interesting and important to apply these exposure bias measures to how different populations of users experience their networks: do cliques in a network consume information that is substantially different from what highly-connected hubs consume? This might be answered with additional audits using sock puppets
and generating “artificial” sessions or getting sessions donated from users.
3.5 Conclusion
In this chapter we have introduced measures of exposure bias, and described a study to measure the differences in exposure bias in archival Twitter data. Our results illustrate the interconnectedness of social
network structure, activity, and feed recommendation. We have shown that a mixture of bias metrics can
adequately discriminate between different feed conditions. We described these metrics as being able to assess the propensity for individuals in a network to experience a distorted view of their immediate network,
where a minority trait may be over-represented and unduly more salient to the user than would be expected
otherwise. Importantly we show that these feeds behave differently as the prevalence of the trait changes.
With the semi-transparent release of the X/Twitter recommendation system there is significant potential for studying proprietary multi-modal recommender systems as they are deployed: strategies like those proposed in this chapter and other forms of algorithmic auditing in OSNs can be tested and verified with higher confidence than ever before [90]. Provided that analyses that rely on this release do not overfit to the Twitter environment, we can learn an incredible amount about the efficacy and reliability of auditing methods. That being said, the release lacks information about model weights and training data, which sets an upper bound on external understanding of the codebase.
Huszár et al., 2022 show that within an organization it is feasible to observe and record the personalized
feeds for large sets of individuals [43]. Because of this we believe that both internal and external auditors
should be able to use measures of exposure bias. With such measures audits can be more closely tied to how
individual users experience the system on a session-level. It would also allow for interpretable analysis to
examine if different communities have disparate experiences with their personalized feeds.
3.5.1 Ethical Implications
Measuring exposure bias in individuals’ feeds poses potential ethical implications in two capacities: 1) maintaining an active accounting of everyone an individual is exposed to (and likely the content of their tweets),
regardless of the privacy of an individual’s friends and 2) intervening and changing the recommender system
can be considered as human subjects research and as such should proceed in a considered manner. Internal to
an organization like Twitter these concerns may be considered business operations as mentioned in Huszár
et al., 2022 [43]; however, this may be more complicated for external auditors.
(a) Popularity Blocal vs ρkx (b) Random Blocal vs ρkx
(c) Reverse Chron. Blocal vs ρkx
Figure 3.1: Local Bias (Blocal) versus Correlation ρkx - Simple feeds. Reported are mean values across all
seed user days, and bars are standard error of the mean.
(a) NNMF Blocal vs ρkx (b) Log. Regression Blocal vs ρkx
(c) Neural CF (NCF) Blocal vs ρkx (d) Wide&Deep Blocal vs ρkx
Figure 3.2: Local Bias (Blocal) versus Correlation ρkx - Model-based feeds. Reported are mean values
across all seed user days, and bars are standard error of the mean.
(a) Popularity G vs ρkx (b) Random G vs ρkx
(c) Reverse Chron. G vs ρkx
Figure 3.3: Gini Coefficient (G) versus Correlation ρkx - Simple feeds. Reported are mean values across
all seed user days, and bars are standard error of the mean.
(a) NNMF G vs ρkx (b) Log. Regression G vs ρkx
(c) Neural CF (NCF) G vs ρkx (d) Wide&Deep G vs ρkx
Figure 3.4: Gini Coefficient (G) versus Correlation ρkx - Model-based feeds. Reported are mean values
across all seed user days, and bars are standard error of the mean.
(a) Popularity Fti vs ρkx (b) Random Fti vs ρkx
(c) Reverse Chron. Fti vs ρkx
Figure 3.5: Fraction Positive Tweets (Fti) versus Correlation ρkx - Simple feeds. Reported are mean
values across all seed user days, and bars are standard error of the mean.
(a) NNMF Fti vs ρkx (b) Log. Regression Fti vs ρkx
(c) Neural Collaborative Filtering (NCF) Fti vs ρkx (d) Wide&Deep Fti vs ρkx
Figure 3.6: Fraction Positive Tweets (Fti) versus Correlation ρkx - Model-based feeds. Reported are
mean values across all seed user days, and bars are standard error of the mean.
(a) Popularity Fui vs ρkx (b) Random Fui vs ρkx
(c) Reverse Chron. Fui vs ρkx
Figure 3.7: Fraction Positive Users (Fui) versus Correlation ρkx - Simple feeds. Reported are mean values
across all seed user days, and bars are standard error of the mean.
(a) NNMF Fui vs ρkx (b) Log. Regression Fui vs ρkx
(c) Neural Collaborative Filtering (NCF) Fui vs ρkx (d) Wide&Deep Fui vs ρkx
Figure 3.8: Fraction Positive Users (Fui) versus Correlation ρkx - Model-based feeds. Reported are mean
values across all seed user days, and bars are standard error of the mean.
(a) Popularity Mi vs ρkx (b) Random Mi vs ρkx
(c) Reverse Chron. Mi vs ρkx
Figure 3.9: Majority Illusion (Mi) versus Correlation ρkx - Simple feeds. Reported are mean values across
all seed user days, and bars are standard error of the mean.
(a) NNMF Mi vs ρkx (b) Log. Regression Mi vs ρkx
(c) Neural Collaborative Filtering (NCF) Mi vs ρkx (d) Wide&Deep Mi vs ρkx
Figure 3.10: Majority Illusion (Mi) versus Correlation ρkx - Model-based feeds. Reported are mean values
across all seed user days, and bars are standard error of the mean.
Chapter 4
Auditing Exposure Bias
We have so far described how we can measure exposure bias and how with empirical networks and historical
activity data we can assess the impact different recommender systems have on exposure networks. In this
chapter we detail an algorithmic audit on the X (formerly Twitter) personalized timeline as it was deployed
at the time of the audit, which is an important part of understanding the impact these systems may have on
who users get exposed to, and thus the user perception of their social environments. To ground this audit in
real-world implications, recent work in studying the uptake of vaccines suggests that passive usage of social
media has been negatively associated with vaccine uptake in several studies [60], which makes studying this
essential for public health.
In order to assess if we can observe exposure bias in situ on X (then Twitter) we focus this chapter on
answering the following research questions:
• RQ1. Do biased sock puppet accounts experience significantly different exposure bias than unbiased
random sock puppets?
• RQ2. Do personalized sock puppet accounts experience different levels of exposure bias than reverse
chronological sock puppets?
• RQ2a. Does exposure bias stabilize over time?
4.1 Methods
In this section we describe our experimental methodology and how we set up our bots to audit the two
different timeline conditions.
We adopt technical auditing methods described in Bandy & Diakopoulos 2021 and Bartley et al. 2021,
where we implement Selenium bots that log in to the Twitter/X website at scheduled times during the day.
We implement 24 bots using Selenium split across 6 conditions [13, 15]:
1. Random personalized. The account will browse normally under a “Top Tweets” timeline, and like
any tweet with a uniform probability.
2. Random reverse chronological. The account will browse under a “Latest Tweets” timeline, and like
any tweet with a uniform probability.
3. Pro-science personalized. The account will browse under a “Top Tweets” timeline, and like pro-science friends and endorsed content with a higher probability than anti-science.
4. Pro-science reverse chronological. The account will browse under a “Latest Tweets” timeline, and
like any pro-science friends and endorsed content with a higher probability than anti-science.
5. Anti-science reverse chronological. The account will browse under a “Latest Tweets” timeline, and
like any anti-science friends and endorsed content with a higher probability than pro-science.
6. Anti-science personalized. The account will browse under a “Top Tweets” timeline, and like any
anti-science friends and endorsed content with a higher probability than pro-science.
Each account was set up to follow 100 total users, 50 of whom are listed as pro-science and 50 of whom
are labeled anti-science per Rao et al., 2021 [71]. These labels are assigned according to the polarization
score each user has, which is computed primarily based on the type of URLs shared [71].
The accounts were active over a period from 04/12/2021 until 02/02/2022. The sequence of actions a
sock puppet would take can be described as follows (a simplified sketch of one session follows this list):
1. A sock puppet would run Selenium and connect to Twitter
2. If necessary, the sock puppet would log in and handle any pop-ups that occur in the user interface
3. Until at least 30 tweets were observed, a sock puppet would scroll down the screen and like a tweet
with fixed probabilities according to their condition. If a user is pro-science polarized, they will like friends' tweets that are themselves pro-science with probability 0.35 (all others with probability 0.03) to get approximately one like per session. All tweets were recorded, including promoted tweets.
Data were collected five times daily; however, accounts were sporadically inactive until remedied. Missing
data is largely due to three errors: 1) unexpected runtime error that prevented running; 2) unexpected changes
in HTML/server-side code that prevented gathering data; 3) accounts were suspended. When sock puppet
accounts were suspended, new accounts were made and set to run again as soon as possible. The accounts were run on four machines that would connect to the Twitter website via a proxy server to control
for location information each account would give. We also filter all promoted tweets and only analyze tweets
that were cleanly parsed and attributable to an initial friend (i.e., a tweet from user Z that is liked or retweeted
by friend Y is given the same label as friend Y).
In parallel we collected tweets and retweets for each of the friends the users were following and as
many friends-of-friends as possible from the Twitter API using the Home Timeline endpoint. We regularly
scanned the endpoint for each encountered user and gathered any new activity since the last scan.
We use a technical audit as opposed to a sociotechnical audit described by Lam et al., 2023 to allow for
fine-grained user behavior and to control for aspects like log-in time and scrolling behavior [52].
4.1.1 Measurements
In this section we describe the measurements we use to evaluate exposure bias. We primarily utilize the
Gini coefficient and the measure of local bias from Alipourfard et al., 2020 described previously [10]. This
measure is useful as it reflects a difference between the expected fraction of friends that have a specific trait
and the global prevalence of that trait in the network.
To compute the Gini coefficient we use the following strategy: for each sock puppet, create a matrix X of dimensions n_users × n_days, where n_users is the total number of users observed over the course of the study and n_days the number of days in the study. Each entry x_ij contains the number of tweets from user i viewed by the sock puppet on day j. The value is then calculated according to the following formula:
G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|}{2 n^2 \bar{x}}
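A small sketch of this bookkeeping is given below. It applies the mean-absolute-difference form of the Gini coefficient to one day's column of the exposure matrix; treating each day separately is an illustrative choice, and the example matrix is invented.

```python
import numpy as np

def gini_mad(counts: np.ndarray) -> float:
    """Mean-absolute-difference form: G = sum_i sum_j |x_i - x_j| / (2 n^2 xbar)."""
    x = counts.astype(float)
    n = x.size
    if n == 0 or x.mean() == 0:
        return 0.0
    diffs = np.abs(x[:, None] - x[None, :]).sum()
    return float(diffs / (2 * n ** 2 * x.mean()))

# Exposure matrix for one sock puppet: rows are observed users, columns are days.
exposure = np.array([[5, 0, 2],
                     [0, 0, 1],
                     [1, 0, 0]])
daily_gini = [gini_mad(exposure[:, d]) for d in range(exposure.shape[1])]
print(daily_gini)
```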
To compute the local bias measure we use the following strategy: For each sock puppet we keep track
of the number of tweets observed from each friend every day of the study in the same way we track it for
the Gini coefficient. We then binarize the matrix to reduce it to a matrix of the edges observed connecting
the sock puppet to its network. The local bias measure is calculated as:
B_{\mathrm{local}} = E[q_f(X)] - E[f(X)]
where
E[q_f(X)] = \bar{d} \cdot E[f(U)\,A(V) \mid (U, V) \sim \mathrm{Uniform}(E)]
q_f(v) = \frac{\sum_{u \in F(v)} f(u)}{d_i(v)}
A(v) = \frac{1}{d_i(v)}
where in this work f(u) ∈ {0, 1}, such that f(u) = 1 if the user is pro-science and f(u) = 0 if the user
is anti-science. It is assumed that the true prevalence E[f(X)] = 0.50 as we are looking at the same network
for each account and are assessing an over- or under-representation of pro-science users.
We structure our analysis around two parts: bootstrapped sessions aggregated over time to help understand the overall variability (and potential skew due to missing data), and time-series data to observe possible
trends over time.
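The bootstrapped intervals reported below follow the standard percentile-resampling recipe over daily session-level measurements. The sketch below uses hypothetical daily local-bias values purely for illustration.

```python
import numpy as np

def bootstrap_ci(daily_values: np.ndarray, n_boot: int = 1000, alpha: float = 0.05):
    """Percentile bootstrap CI for the mean of daily session-level measurements."""
    rng = np.random.default_rng(3)
    means = [rng.choice(daily_values, size=daily_values.size, replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return daily_values.mean(), (lo, hi)

daily_blocal = np.random.default_rng(4).normal(-0.05, 0.03, size=200)
print(bootstrap_ci(daily_blocal))
```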
4.2 Results
The date coverage of the bots is described in Fig. 4.2. We observe that the coverage is weakest for the anti-science biased reverse chronological condition, followed by the pro-science biased reverse chronological
condition. This is likely due to the semi-regular HTML issues that were more pernicious for the reverse
chronological condition accounts.
The local bias results of the study over this date range are presented in Fig. 4.1. We observe significant
differences between the pairs of conditions (pro and anti, pro and random, anti and random), and all ranges
are distinct from the bias computed on the total “actual” activity (i.e., what you would expect if you observed
all the friends that tweeted that day).
The Gini coefficient results are presented in Fig. 4.3. We observe a significantly higher Gini coefficient in the random personalized condition than in the reverse chronological condition, and higher than the Gini coefficient derived from the actual activity of all friends.
The two measures of exposure bias were also subjected to independent t-tests to assess significance of
the differences in the means computed over the course of the study.
As all bootstrapped figures use means computed over the course of the study, we also present the relevant
time-series values in Fig. 4.4 and 4.7. We similarly plot the linear regression line to identify trends in each
time-series, and in Fig. 4.4 we observe a general downward trend for the aggregate personalized timelines,
and a general upward trend for the aggregate reverse chronological timeline. Both are significantly different
than the trend observed in the measurements on the actual activity. In Fig. 4.7 we observe a positive trend
in the aggregate personalized timeline over time, and a steady Gini coefficient trend for aggregate reverse
chronological. As these are computed weekly we observe higher coefficients than in the aggregate plot in
Fig. 4.3.
To account for possible correlates in the activity of friends, we independently take the correlation of both
the local bias and Gini coefficients versus the daily actual activity of pro-science friends and anti-science
friends. We observe in Fig. 4.6 that the aggregate personalized timeline condition has a significant positive
correlation to the anti-science friend activity time-series, similar to the correlation the pro-science friend
activity has with the anti-science friend activity timeline. We also observe the actual activity bias measures
to be negatively correlated with anti-science friend activity, consistent with anti-science friends having the
neutral label 0 and pro-science friends having the positive label 1. Similarly reverse chronological timelines
generally have no or weak correlation with either activity time-series.
RQ1. To address the first research question, we find that under the independent t-tests described in Table
4.1 both biased conditions were significantly different from their randomly unbiased counterparts for each
metric (except for anti-reverse chronological and random-reverse chronological).
RQ2. To address the second research question, under the independent t-tests described in Table 4.1, personalized timelines in all conditions except the pro-science biased accounts were significantly different from their reverse chronological counterparts under both metrics. Interestingly, the pro-science reverse chronological timeline feeds were not significantly different from the pro-science personalized timeline in their Gini coefficient per Table 4.1.
RQ2a. While we observe relatively stable trends for both actual activity based time-series and aggregate
reverse chronological time-series in Fig. 4.4 and 4.7, we observe nonzero trends for the aggregate personalized timeline, suggesting that under personalization no stabilization occurred over the course of the study.
Missing data sensitivity. To analyze the impact of missing data, we use several imputation methods and
analyze the difference in the results we observe for RQ2a in Fig. 4.5. We present the results for the median
imputed results here, and the other imputed results in the appendix. We observe nonzero trends in roughly
the same direction as we observe in the non-imputed data in Fig. 4.4 suggesting a minimal impact of the
missing data on the time-varying nature of our results.
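The imputation step of this sensitivity analysis can be sketched with scikit-learn's imputers; the strategies below mirror the variants reported in Figs. 4.5 and 4.12-4.14, though the data-frame layout and values are hypothetical placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical daily local-bias time series per condition, with gaps as NaN.
series = pd.DataFrame({
    "personalized": [0.02, np.nan, 0.05, np.nan, 0.04],
    "revchron": [0.01, 0.02, np.nan, 0.02, 0.03],
})

imputers = {
    "median": SimpleImputer(strategy="median"),
    "mean": SimpleImputer(strategy="mean"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    "iterative": IterativeImputer(random_state=0),
}
imputed = {name: imp.fit_transform(series) for name, imp in imputers.items()}
print(imputed["median"])
```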
Figure 4.1: Local bias measurements for audit groups. Bootstrapped confidence intervals computed over
1000 samples of sessions, presented are the mean of the daily means with 95% intervals. Negative implies
under-representation of pro-science content, positive implies over-representation.
Feed 1 Feed 2 Condition 1 Condition 2 Blocal t-test Score p-value Gini t-test Score p-value
Personalized RevChron Anti Anti -309.72 < 10−10 97.46 < 10−10
Personalized RevChron Anti Random -369.84 < 10−10 136.71 < 10−10
Personalized RevChron Anti Pro -315.89 < 10−10 53.15 < 10−10
Personalized RevChron Anti Aggregate -395.39 < 10−10 159.60 < 10−10
Personalized RevChron Pro Anti 162.14 < 10−10 36.36 < 10−10
Personalized RevChron Pro Random 180.40 < 10−10 55.97 < 10−10
Personalized RevChron Pro Pro 175.22 < 10−10 1.22 0.22
Personalized RevChron Pro Aggregate 178.11 < 10−10 75.45 < 10−10
Personalized RevChron Random Anti -113.35 < 10−10 76.47 < 10−10
Personalized RevChron Random Random -152.63 < 10−10 118.46 < 10−10
Personalized RevChron Random Pro -110.56 < 10−10 26.69 < 10−10
Personalized RevChron Random Aggregate -177.86 < 10−10 148.91 < 10−10
Personalized RevChron Aggregate Aggregate -135.51 < 10−10 -176.58 < 10−10
Actual RevChron Actual Anti 219.85 < 10−10 5.17 < 10−10
Actual RevChron Actual Random 335.25 < 10−10 36.42 < 10−10
Actual RevChron Actual Pro 257.08 < 10−10 -54.08 < 10−10
Actual RevChron Actual Aggregate 352.75 < 10−10 68.09 < 10−10
Actual Personalized Actual Anti 592.88 < 10−10 -116.03 < 10−10
Actual Personalized Actual Pro -37.32 < 10−10 -36.92 < 10−10
Actual Personalized Actual Random 466.58 < 10−10 -94.45 < 10−10
Actual Personalized Actual Aggregate 413.57 < 10−10 257.12 < 10−10
RevChron RevChron Pro Anti -9.84 < 10−10 47.64 < 10−10
RevChron RevChron Pro Random -14.06 < 10−10 79.58 < 10−10
RevChron RevChron Pro Aggregate -28.48 < 10−10 105.54 < 10−10
RevChron RevChron Anti Random -1.57 0.12 22.82 < 10−10
RevChron RevChron Anti Aggregate -14.20 < 10−10 46.45 < 10−10
Personalized Personalized Pro Anti 405.17 < 10−10 -37.14 < 10−10
Personalized Personalized Pro Random 260.23 < 10−10 -17.06 < 10−10
Personalized Personalized Pro Aggregate 237.83 < 10−10 181.45 < 10−10
Personalized Personalized Anti Random -245.66 < 10−10 31.62 < 10−10
Personalized Personalized Anti Aggregate -269.78 < 10−10 314.28 < 10−10
Table 4.1: Pairwise Significance Tests. Data are treated under an independent t-test given the normality of the means measured.
Figure 4.2: Dates covered by the sock puppet accounts
4.3 Discussion
The sock puppets, when we control for various factors like behavior, sign-on location and time, afford us
the ability to claim both the existence of exposure bias in the deployed platform and differences between the
different types of personalized news feeds. The strongest signal is the correlation between the exposure bias measurements and friend activity. We expect the local bias of the actual activity to correlate negatively with anti-science friend activity, and positively with pro-science activity. This is because positive local bias implies more pro-science friends are being observed by sock puppets, and negative implies fewer pro-science friends seen. Similarly, we expect the reverse chronological conditions to be largely uncorrelated with the amount of friend activity, as we would expect them to be more correlated with the time each friend is active relative to when the sock puppets log in and observe them. Interestingly, we observe in Fig. 4.6 a positive correlation with the anti-science friend activity when looking at the aggregate personalized (agg-pers), pro-science
biased personalized (pro-pers) and pro-science biased reverse chronological (pro-rev), albeit to different
degrees.
Figure 4.3: Gini coefficient measurements for audit groups. Bootstrapped confidence intervals computed
over 1000 samples of sessions, presented are the mean of the daily means with 95% intervals. Zero implies
equality in mean number of tweets observed for each friend, One implies one friend dominates.
Given the positive correlation between pro-science friend activity and anti-science friend activity, this may be a spurious correlation. However, as the reverse chronological conditions have either the same or weaker
correlation with the friend activity, it seems less likely that the same spurious correlation would affect multiple timeline conditions. A possible explanation is that recorded anti-science activity dropped in the second
half of the study period such that any increase in activity would generally lower the fraction of pro-science
friends in the inventory of tweets to serve.
We present the activity of the anti-science and pro-science users as recorded by our API sweeps in
Fig. 4.8 and 4.9. While there is a sharp drop in activity in anti-science activity around 2021-10-24, this
can be attributed to both some accounts being temporarily suspended and a brief API rate limit issue we
encountered. However, given the time-series measures of local bias and Gini coefficient in Fig. 4.4 and 4.7
we do not observe a significant impact on the measures at that time, suggesting an insignificant impact.
Figure 4.4: Local bias computed over time.
RQ1. Given the consistent significant differences between biased and random feeds except for the anti-science reverse chronological feeds, and given that the pro- and anti-science feeds experienced significant differences in the direction that we would predict for the local bias measure (i.e., that pro-science would be
much higher than random personalized, and anti-science would be much lower than random personalized),
we claim that we can measure the change in users’ structured perceptions of who they follow in the network.
It makes sense that biased accounts who interact with similar users would observe more users along that
skew, as that seeming measure of utility would be reflected in the personalized algorithm. Similar results
have been described in analysis of how user activity and personalization interact in the YouTube video
recommender system [73].
RQ2. Given the potential coverage issues across the conditions, we primarily discuss the aggregate personalized and reverse chronological results. We observe significantly lower local bias for the aggregate personalized condition relative to the reverse chronological condition, which is consistent for all pairs except for the pro-science biased condition (which was flipped) and the anti-science biased condition (which had no significant difference). Similarly, we observe a higher Gini coefficient for the aggregate personalized condition when compared to the aggregate reverse chronological condition. In Fig. 4.7 we observe an increase in aggregate personalized Gini over time, which correlates inversely with the anti-science friend tweet activity per Fig. 4.6. Given that the aggregate personalized and reverse chronological conditions both negatively correlate with the anti-science friend activity (and both of these are flipped in sign compared to the actual activity Gini computation), we consider this insignificant.
Figure 4.5: Median imputed Local bias computed over time.
We claim that given the significant t-test results, as well as the differences in correlation between aggregate personalized and aggregate reverse chronological when compared to each other that there is a measurable difference in exposure bias between the conditions.
RQ2a. We observe relative stability in the actual activity (defined by having little to no trend in their linear
regression models) and aggregate reverse chronological condition plotted in Fig. 4.7. We do not however
observe stability for either aggregate timeline over the course of the study. This could be the case because we
had accounts (both friends and sock puppets) that were suspended and needed to be replaced. Alternatively, this could be explained by changes in friend posting behavior or timelines: because users that are out of your network can appear in your timelines, the personalized feed accounts perhaps saw consistent changes in who they were exposed to.
Figure 4.6: Correlation between time-series matched on days. Correlation is computed between the daily activity of pro- or anti-science friends and the daily bias measurements.
We observe oscillation in the total number of unique friends both conditions
saw in Fig. 4.15; however, we observe a drop in the number of new friends seen after 2021-07 in Fig. 4.16,
suggesting it is not new users in the feed.
We observe a possible drop in the total number of usable personalized tweets in Fig. 4.10 that we do
not observe in Fig. 4.11: we seemed to have had more difficulties with HTML changes and suspensions in
the personalized sock puppets, which could explain the lack of stability over time. Similarly, we observe a
drop in the total number of tweets per day for anti-science users around 2021-10-24 in Fig. 4.8 which might
explain the lack of stability for the aggregate personalized condition. We do not report it here, but computing
the correlation matrices for each of the measurements until 2021-10-24 (rather than through until 2022-02-
02) has some of the correlations going the opposite direction, suggesting the drop in activity impacts the
trend overtime, even if it does not seem to affect the trend in the actual activity metrics.
Figure 4.7: Gini coefficient computed weekly over time.
Implications. If we assume significant differences between the personalized and reverse chronological
conditions, and more specifically differences between the biased conditions within the two feeds, interacting
with a fixed-length feed and observing and passively interacting with a biased environment could have
downstream consequences on a user’s mental and physical health. While differences in algorithms did not
seem to result in different behavior in the 2020 election on FaceBook and Instagram [37], it remains plausible
that they could change users’ perceptions of their networks in other areas like gender norms and through
other mechanisms like social comparison. Personalized timelines seem to be more sensitive to user behavior
(which makes sense by design of the algorithm), but as such may inadvertently facilitate the spread of
harmful narratives by making them seem more popular than they may actually be.
4.3.1 Future Work
Future work could entail bots following each other and having much longer sessions. Working with BlueSky and decentralized Mastodon instances to test the impact of different feeds on these measures of exposure bias would be interesting as well. Similarly, getting live session data donated by real users would be interesting to examine.
Figure 4.8: Number of tweets per day of anti-science users. The steep drop of tweets around 2021-10-24 can be explained both by accounts that were temporarily suspended and a brief rate limiting issue with the API.
Figure 4.9: Number of tweets per day of pro-science users.
Figure 4.10: Number of usable personalized tweets seen per day. Other tweets were either promoted or otherwise had a problem in parsing.
4.3.2 Limitations
Labels. The focus of the sock puppets on the activity of friends labeled around their attitudes on science
may be limited as their attitudes could have changed over the course of the study. Similarly attributing the
friends’ label to any tweets/users they interact with may be inappropriate as the source tweets/user may
be taking a stance opposite to the friend being followed. The pro/anti label may also be capturing other
information, like general levels of activity, network size, and community norms with different kinds of
engagement (which would be weighted differently in the ranking algorithm). Having to replace the sock
puppets also draws time-varying claims into suspicion as each new account is assumed to start from no
personalization.
Missing data. Missing data is a clear limitation in the analysis presented in this study. However, we present
a sensitivity analysis which suggests that at least for the trending local bias data the missing data does not
seem to seriously impact the direction of the trends, suggesting the robustness of our results. Similarly, our
use of the bootstrapping approach for both metrics allows us to assess the variability of our measurements
due to the missing data.
Relevance to current platforms. Because we base our analysis on data from before Twitter became X, we cannot claim that the conclusions around bias are current; however, we can claim that the point-in-time analysis of bias is still accurate. We can also claim the methodology remains relevant for the current set of popular social media platforms (not limited to X), as the underlying infrastructures do not appear to have changed significantly as of this writing. Regular analysis of the state of the recommender systems constructing feeds should be performed, as both minor and major changes to the systems can impact outcomes.
4.4 Conclusion
In this chapter we have shown that we can measure exposure bias in situ on online social media platforms. We show that untrained sock puppet accounts on Twitter (before X), following the same accounts over a period of eight months, can experience vastly different perceptions of their networks. Accounts that interact with tweets in an ideologically biased way are presented with different feeds. When controlling for algorithmic impacts and as many other factors as possible, biased accounts observing a reverse chronological feed show less exposure bias than similarly biased accounts observing a personalized feed. We believe this implies additional scrutiny is needed of the signals the feed personalization engine uses, as it may be skewing some users' perceptions more than we would expect given their activity. As we study the perception of pro- and anti-science users in one's follower network, we hypothesize that our results would extend to perceptions of other political/ideological and social signals.
As mentioned in the previous chapter, with the open-source release of the Twitter/X ML pipeline at https://github.com/twitter/the-algorithm-ml, there is an interesting opportunity to assess how external audits might be assisted by code transparency from the platforms. For the purposes of this chapter, the pipeline has been helpful for identifying places to investigate (namely, the weights of various interactions), but it is difficult to simulate or gather data for all the components, especially as API access has been significantly paywalled (not to mention the lack of API access to the order in which people observe personalized timelines). Complex recommender systems mediate our exposure to our online social networks, and external research infrastructure for understanding them remains imperative to ensure society has a reasoned and clear knowledge of how these systems may affect us.
4.4.1 Ethical Considerations & Reproducibility
The study was approved by an IRB. Accounts were made to interact with tweets only via low-frequency likes to minimize potential effects on the final popularity of those tweets. Accounts were explicitly designated as research accounts, with an email address pointing to the institution running the study. We release the anonymized code used to generate this analysis, along with data showing which anonymized users were seen at which time, at https://github.com/bartleyn/curly-octo-fortnight.
Figure 4.11: Number of usable reverse chronological tweets seen per day. Other tweets were either
promoted or otherwise had a problem in parsing.
Figure 4.12: Mean imputed Local bias computed over time.
Figure 4.13: Most-frequent imputed Local bias computed over time.
Figure 4.14: Iterative imputed Local bias computed over time.
Figure 4.15: Number of unique friends seen over time.
Figure 4.16: Number of new users exposed to per day. Blue is personalized, orange is reverse chronological.
Chapter 5
Mitigating Exposure Bias
Two key issues in studying these systems at scale (either as deployed or in simulation) are that one cannot assume each user is subject to the same timeline condition and that building appropriate evaluation infrastructure is costly. This is especially true if we are to assess how such systems evolve over time. To address this, we discuss a simple agent-based model where users have fixed preferences. This kind of model affords us the ability to compare different recommender systems (and thus different personalized timelines) in how they skew users' perception of their network. Importantly, we show that a simple greedy algorithm that constructs a feed based on network properties reduces such perception biases to a level comparable to a random feed. This underscores the influence network structure has in determining the effectiveness of recommender systems in the social network context, and it offers a tool for mitigating perception biases through algorithmic feed construction.
The previous chapters have established the ability to measure exposure bias and its apparent existence in deployed online social network personalized timelines; however, we have not treated the user dynamics thoroughly. In order to tease out the impact of the recommender systems, it is important not to overlook the role users play in their interactions with them. Recent work has explored user preferences in agent-based models of YouTube's primary video recommender system; however, an important limitation of this line of work is the lack of comparison across different systems and different kinds of feeds, like those that appear in online social networks like X and Facebook [73, 27]. There is a vein of work on X and Facebook that does consider user interactions as a factor in the differences between the black-box algorithmically personalized and reverse chronological feeds, but these works focus more on what content is being shown to users rather than who the users are being exposed to, and they concentrate on politically salient outcomes like political behaviors and misinformation diffusion [35, 36, 14]. Social cues (e.g., the counts of likes and retweets a post has on Twitter/X) and the social context in which people share information (e.g., who they consider their audiences to be) have been shown to impact the sharing behavior of posts on social media, and thus also how information spreads [29, 58]. To build our understanding of how different recommender systems might shape users' perceptions of their network, we simulate a Twitter/X-like environment where we explicitly specify the user model behavior and measure the users' perceived prevalence of a binary trait over time in the network.
In this chapter we answer the research question:
RQ4 Can algorithmic ranking change the level of exposure bias over time?
To answer this question we discuss the following:
• We present scalable agent-based model simulations with 173,000 nodes.
• We compare exposure biases generated by baselines, deep learning approaches, and a greedy algorithm for personalizing news feeds.
• We demonstrate that this greedy algorithm creates less biased feeds that are comparable in utility to the other tested models.
5.1 Model
5.1.1 Framework
We use the Repast framework to run models over a cluster of compute nodes [22]. This framework has previously been used in other areas, such as simulating bike-sharing systems in cities and large-scale models of complex, heterogeneous multicellular biological systems [62, 12]. A key factor that aided our use of Repast is that in our simplified network, users are only exposed to the tweets that their friends (and friends-of-friends) generate, which allows us to partition the network and run the partitions in different processes.
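As an illustration of this partitioning idea (and not of the actual Repast/MPI scheduling code), the sketch below gathers, for each core user, the two-hop set of accounts whose tweets can ever appear in that user's feed, using a hypothetical `following` adjacency dictionary; blocks of core users and their subgraphs can then be handed to separate processes.

```python
def two_hop_neighborhood(user, following):
    """Accounts whose tweets can appear in `user`'s feed:
    friends (followees) plus friends-of-friends."""
    friends = set(following.get(user, ()))
    friends_of_friends = {w for v in friends for w in following.get(v, ())}
    return friends | friends_of_friends

def partition_core_users(core_users, following, n_workers):
    """Split core users into n_workers blocks, each bundled with the
    two-hop subgraph it needs so the blocks can run in separate processes."""
    blocks = [core_users[i::n_workers] for i in range(n_workers)]
    return [
        {u: two_hop_neighborhood(u, following) for u in block}
        for block in blocks
    ]
```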
5.1.2 User Behavior
In this work we trace the behavior of approximately 173,000 users (each of whom is labeled x = 1 with a fixed probability P(X = 1), and x = 0 otherwise) sharing 1.5 million edges, and we examine the experience of 5,599 central users as they follow the sequence of events below for each time tick t (a minimal code sketch of this per-tick step follows below):
1. Activate user i with a uniform probability (0.083)
2. If activated, user i then produces a certain number of tweets; we sample a lognormal distribution
(µ = 0.0, σ = 1.0) to choose how many tweets the user produces in that time period
3. Add created tweets to the content pool
4. “Backend” model serves tweets to user i if appropriate
5. User i interacts by “liking” a tweet from another user with the same label with a fixed probability (0.20), and with probability (0.05) otherwise
Our model is a discrete event-based model, where for each time tick we "step" the individual nodes and then
update model information before proceeding to the next time tick.
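A minimal, purely illustrative version of this per-tick step is sketched below in plain Python; the `serve_feed` callable stands in for whichever “backend” feed model is being tested, the dictionary-based tweet representation is an assumption of the sketch, and the probabilities match the behavior described above. It is not the Repast implementation itself.

```python
import numpy as np

P_ACTIVATE = 0.083    # ~2 expected sessions per 24 hourly ticks
P_LIKE_SAME = 0.20    # like probability for tweets from same-label users
P_LIKE_OTHER = 0.05   # like probability otherwise

def step_user(user, content_pool, serve_feed, rng):
    """One simulation tick for a single user."""
    if rng.random() >= P_ACTIVATE:
        return                                    # 1. user stays inactive this tick
    n_tweets = int(round(rng.lognormal(mean=0.0, sigma=1.0)))
    for _ in range(n_tweets):                     # 2.-3. produce tweets, add to pool
        content_pool.append({"author": user["id"], "label": user["label"], "likes": 0})
    for tweet in serve_feed(user, content_pool):  # 4. backend serves the feed
        same_label = tweet["label"] == user["label"]
        p_like = P_LIKE_SAME if same_label else P_LIKE_OTHER
        if rng.random() < p_like:                 # 5. biased liking behavior
            tweet["likes"] += 1

# e.g., step_user(user, pool, chronological_feed, np.random.default_rng(0)),
# where chronological_feed is a stand-in for one of the backends below.
```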
There are two key components of the model: the “backend” model that serves users tweets, and the network of users. While each user is connected via the network, they only interact with other users through the backend model, by sending tweets to it and receiving their friends' tweets from it. This way we can vary how tweets are served to users. We illustrate how this model works visually in Fig. 5.1.
Figure 5.1: Agent-based Model Structure Illustration demonstrating how three users are connected to each
other on the network, but will only get exposed to other users through the tweets served to them from the
“backend” model.
At tick t = 24 we reset the edges seen by users in the network to reflect a full 24 hours passing, in order to assess what happens when the network "forgets" most information from the previous day. This also serves as a validity check on the consistency of the system's dynamics (i.e., we want to make sure that the system converges back to a point similar to where it was before the "reset").
We structure the network based on data from Alipourfard et al. [10], who gathered a complete follower network for 5,599 users, as well as tweets and retweets for those users and everyone they followed, to generate a dataset with 4M users from May to September 2014. We use this data to guide our model; we downsample the nodes for the sake of simulation runtime.
5.1.3 Model Parameters
We treat each simulation time step as a single hour, for a total of 36 timesteps. Per an official blog post from X [89], users spend an average of 32 minutes per day on the platform, which we implement as an activation probability: each user has a 0.083 probability of "logging in" per hour, giving an expected value of approximately two sessions per 24 hours.
To assess perception of networks, we randomly assign each user in the network a binary trait X ∈ {0, 1}
such that the total prevalence of the trait in the network is static, as in Chapter 3. We run each simulation
under P(X = 1) ∈ {0.05, 0.15, 0.50} to assess the impact of the prevalence of the trait on the behavior
of the system. Each user, based on the value of the trait that they are assigned, also behaves in a biased
manner towards the tweets that they observe: users with x = 1 will like tweets from users with x = 1 with
probability 0.20 and will like any other tweets with probability 0.05. Likewise for users with x = 0. Both
numbers were chosen to elicit, in expectation, at least one like from each active user per tick.
Assigning traits to the nodes allows us to measure the degree-attribute correlation ρkx, which is defined
as:
ρkx = [P(x = 1) / (σx σk)] · [⟨k⟩x=1 − ⟨k⟩]    (5.1)
Here ⟨k⟩x=1 is the average in-degree of the "active" x = 1 nodes, ⟨k⟩ is the average in-degree of all nodes considered, and σx and σk are the standard deviations of the attribute and the in-degree, respectively.
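As a concrete reading of Eq. 5.1, the sketch below computes ρkx from arrays of node in-degrees and binary traits; the variable names are ours and degenerate cases (e.g., zero variance) are not handled.

```python
import numpy as np

def degree_attribute_correlation(in_degree, x):
    """rho_kx = P(x=1) / (sigma_x * sigma_k) * (<k>_{x=1} - <k>)."""
    in_degree = np.asarray(in_degree, dtype=float)
    x = np.asarray(x)
    p1 = (x == 1).mean()                      # P(x = 1)
    mean_k = in_degree.mean()                 # <k>
    mean_k_active = in_degree[x == 1].mean()  # <k> over x = 1 nodes
    sigma_x, sigma_k = x.std(), in_degree.std()
    return p1 / (sigma_x * sigma_k) * (mean_k_active - mean_k)
```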
Each simulation has every user subjected to the same personalized news feed:
1. Random. All candidate tweets are randomly sorted and the first n tweets are served to the user.
2. Reverse Chronological. All candidate tweets are sorted in reverse chronological order, and the first
n tweets are served to the user.
3. Neural Collaborative Filtering. We implement a simple version of the Neural Collaborative Filtering (NCF) model to demonstrate how a deep learning model more broadly can be used for recommender systems in this context. We train the model on the 5,599 core users, where each liked tweet is the "item" being trained on. We keep the model straightforward and only use superficial user-level information as features [39].
4. Wide & Deep. We implement a simple version of the Wide & Deep model described initially by
researchers at Google to demonstrate how a recommender system used in production in other contexts
might behave in this scenario [21]. We use the same features as the NCF model for training.
5. Minimize ρkx. We implement a greedy strategy for choosing which edges to observe for each user. Using Eqn. 5.1, at every tick t we sort the candidate tweets by how much the corresponding edge would contribute to the difference between the mean “active” in-degree ⟨k⟩x=1 and the mean in-degree ⟨k⟩, opting for the edges that minimize that difference (a sketch of this greedy step appears after the next paragraph).
We chose these personalized news feed algorithms to analyze different implementations of news feeds: there is a public release of the X recommendation "algorithm", but as it relies on multiple models and active services we cannot use the code as-is in a simulation (especially as production model parameters have since changed) [90]. Instead, we compare simple baseline models, two deep learning models (NCF and Wide & Deep), and our greedy strategy for minimizing exposure bias (MinimizeRho).
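To make the greedy step concrete, the following simplified sketch scores each candidate tweet by the gap between the running mean "active" in-degree and the overall mean in-degree that serving its author's edge would produce, and serves the least-distorting edges first. This is a schematic reconstruction of the idea under our own naming assumptions, not the exact code used in the simulations.

```python
import numpy as np

def minimize_rho_feed(candidates, seen_active_degs, seen_degs, n):
    """Greedily rank candidate tweets so that exposing their authors keeps the
    mean in-degree of observed x=1 users close to the overall observed mean.

    candidates: list of dicts with 'author_degree' and 'author_label' keys
    seen_active_degs / seen_degs: in-degrees of edges already exposed
    """
    feed, pool = [], list(candidates)
    active, everyone = list(seen_active_degs), list(seen_degs)
    for _ in range(min(n, len(pool))):
        def gap_if_added(t):
            new_all = everyone + [t["author_degree"]]
            new_act = active + [t["author_degree"]] if t["author_label"] == 1 else active
            if not new_act:
                return 0.0                    # no x=1 exposure yet; gap undefined
            return abs(np.mean(new_act) - np.mean(new_all))
        best = min(pool, key=gap_if_added)    # serve the least-distorting edge next
        pool.remove(best)
        feed.append(best)
        everyone.append(best["author_degree"])
        if best["author_label"] == 1:
            active.append(best["author_degree"])
    return feed
```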
Each simulation similarly fixes the feed length for every user, i.e., users only observe (and potentially engage with) a fixed number of tweets at each timestep in the simulation. We tested lengths of 30, 50, and 100.
5.1.4 Bias & Performance Metrics
We use two metrics to study the bias of this perception:
1. Local Bias (Blocal): Blocal = E[qf(X)] − E[f(X)]
2. Gini Coefficient (G): G = Σ_{i=1..n} (2i − n − 1) xi / (n Σ_{i=1..n} xi)
We define Blocal via the average frequency of the attribute among a node's immediate network: E[qf(X)] = d̄ · E[f(U)A(V) | (U, V) ∼ Uniform(E)], where E[f(X)] is the global frequency of the node attribute f (here, 0.05, 0.15, or 0.50), f(U) is the attribute value of node U, d̄ is the expected in-degree of the network, and A(V) is the "attention" node V pays to any node in their network. Blocal varies over [-1, 1], and Gini varies over [0, 1], where 0 is perfectly equal and 1 is maximally unequal. Blocal plays the role of an overall view of the skew users will experience in their personalized feeds.
The Gini coefficient in this context represents the skew in how often users are exposed to each unique friend (or friend-of-friend) in their feeds: xi is the number of times that friend (or friend-of-friend) was observed. A short computational sketch of both metrics appears below.
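The following is a minimal sketch of how these two quantities can be computed from a user's exposure log, assuming a hypothetical `exposure_counts` mapping from each observed friend to the number of times they appeared in the feed, and approximating the attention term A(V) by relative exposure counts; the names are illustrative rather than taken from our released scripts.

```python
import numpy as np

def gini(exposure_counts):
    """Gini coefficient over how often each friend was observed."""
    x = np.sort(np.asarray(list(exposure_counts.values()), dtype=float))
    n = x.size
    i = np.arange(1, n + 1)                   # ranks of the sorted counts
    return ((2 * i - n - 1) * x).sum() / (n * x.sum())

def local_bias(exposure_counts, labels, global_prevalence):
    """B_local: exposure-weighted frequency of x = 1 among observed friends
    minus the true global prevalence P(X = 1)."""
    total = sum(exposure_counts.values())
    weighted_freq = sum(
        count * labels[friend] for friend, count in exposure_counts.items()
    ) / total
    return weighted_freq - global_prevalence
```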
We use several other measurements to study the simulation and verify results:
1. Precision@30. The mean over served feeds of |tweets liked in the first 30 positions| / 30 (a one-function sketch follows this list).
2. Number of edges seen. Total unique edges seen up until that time tick, including friends-of-friends.
3. Mean number of likes friends’ tweets receive. We take the sum total likes each friend receives from
core users and take the mean over all friends.
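For completeness, a one-function sketch of the precision measure, assuming a hypothetical log of served feeds in which each tweet records whether it was liked:

```python
def precision_at_k(feed_logs, k=30):
    """Mean fraction of liked tweets among the first k feed positions,
    averaged over all served feeds."""
    fractions = [
        sum(1 for tweet in feed[:k] if tweet["liked"]) / k
        for feed in feed_logs
    ]
    return sum(fractions) / len(fractions)
```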
5.2 Results
Across models in Fig. 5.2 we observe a burn-in period and relatively stable bias for the network until the
reset at t = 24. Interestingly, we observe that the Random and MinimizeRho conditions have consistently
low measures (in absolute value) of Blocal (local bias). Similarly, the NCF and WideDeep conditions are
correlated to one another, showing negative values (i.e., they under-expose the users to users with trait
X = 1).
In Fig. 5.3 we observe converging, stable behavior of the Gini coefficient, where the Random condition remains the lowest, suggesting a more even distribution of attention across friends. Interestingly, the MinimizeRho condition starts low and climbs to become similar to the Chronological, NCF, and WideDeep conditions. The two deep-learning-based models tend to have the highest Gini coefficient, suggesting a narrower focus on the sets of friends observed by users. This remains consistent even after the t = 24 reset, where the MinimizeRho condition again starts low and climbs in Gini.
For validity checks of the simulations we present the results in Figs. 5.4 and 5.5. Fig. 5.4 shows that the 5,599 core users all behave similarly under the different conditions in terms of the mean number of likes each friend receives, with longer feeds accumulating higher mean likes over time. Although not reported here, we also measured the number of tweets that each user liked cumulatively over the course of the simulation; the longer feed length simulations tend to have higher mean likes than the shorter ones, matching the reported results. Finally, we observe the unique edges seen by the core users in Fig. 5.5, where the Random condition maximizes the total observed edges, followed by the MinimizeRho condition.
The precision results in Fig. 5.6 demonstrate that all model conditions improve over time, with varying minor drops in precision after the t = 24 reset.
5.3 Discussion
The primary results here describe levels of exposure bias that change over time, where each algorithmically
ranked feed provides a different converged level of exposure bias.
To discuss the results more thoroughly, the information presented in Fig. 5.5 suggests that the Random and MinimizeRho conditions demonstrate the most growth in the number of unique edges observed over time relative to the other models, increasing the diversity of users who are observed. As this might be due to spurious changes in user behavior, we measure the mean number of likes each friend receives in Fig. 5.4. Here we observe that the longer feeds provide more likes for each friend, which follows given the increased number of "chances" each user gets to like a friend's tweet. Given that each model has users that behave similarly, we then use the exposure bias measures in Figs. 5.2 and 5.3 to determine differences between models. Two patterns show up: the Random and MinimizeRho conditions correlate closely with each other and remain lower in absolute value of Blocal, whereas the NCF and Wide & Deep models track each other tightly but with lower (more negative) values of Blocal. This suggests that the deep learning models are more tightly focused on certain sets of users (as corroborated by the number of edges seen in Fig. 5.5). Interestingly, the MinimizeRho condition behaves similarly to the Random condition in Fig. 5.3 in the initial timesteps; however, it diverges and becomes more similar to the Chronological condition in most conditions (it becomes more like the deep learning models at feed length 100). This is interesting, as it appears to be more sensitive to the feed length and prevalence than the other models tested in terms of Gini. This suggests that the MinimizeRho model converges to a similarly focused set of users as the deep learning models, but because it presents a comparatively undistorted view of the network, it may be ignoring edges that offer utility to users but skew the view of the network. If the Random model can be considered less biased than the others by minimizing the absolute value of Blocal, then the MinimizeRho model is most closely related to it in behavior over time (albeit in a more focused, deterministic manner than the Random condition).
Figure 5.2: Local Bias (Blocal). Graph depicts the difference between the expected local fraction of friends who have x = 1 and the true global prevalence of the trait P(X = 1). Positive implies over-representation, negative implies under-representation.
Figure 5.3: Gini Coefficient. Graph depicts the distribution of times each friend (or friend-of-friend) was observed by a core user. 1 implies inequality, 0 implies equality.
Curiously, each model trends upwards in Precision@30 performance in Fig. 5.6. We would expect the Random model to have minimal slope, as it does not depend on user feedback; however, since the number of likes is accumulated over time, we would still anticipate a growing value. The differences show up across the different prevalences: at P(X = 1) = 0.05 the NCF model appears higher than the other models in both measures of precision (Precision@10 not reported here), but the difference between models disappears as the prevalence of the trait tends towards P(X = 1) = 0.50. The drop in performance of the MinimizeRho model at feed length 50, P(X = 1) = 0.15 may be explained by the fixed random seed determining the assignments of the trait X for those simulations: other versions of these simulations where we modulate the correlation ρkx remove this visible discrepancy.
Figure 5.4: Mean number of likes each friend receives. Graph depicts the mean number of likes each friend receives over time.
Overall, while the behavior of the simulated system appears to converge, it is unlikely that the real ecosystems being modeled would converge so readily. If we do assume some stability, it seems that personalization would lead some users to perceive that a particular trait is more (or less) prevalent in their larger social networks than it actually is. In other words, personalization can either mitigate or amplify network-based structural phenomena like the majority illusion [56]. Furthermore, different kinds of algorithmic feeds appear to change the amount of exposure bias the system experiences over time. If we allow for more complex user dynamics and user intents, we hypothesize that the different types of feeds would diverge further in their behaviors relative to one another.
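As an illustration of the network-level phenomenon being referenced, the hypothetical helper below counts how many users experience a "majority illusion" under a given exposure log: seeing x = 1 among more than half of their observed friends even though the trait is globally in the minority. It is only meant to make the mechanism concrete, not to reproduce the measure used in our tables.

```python
def majority_illusion_count(exposed_friends, labels, global_prevalence):
    """Number of users who see x = 1 in a majority of their observed friends
    while the trait is globally rare (P(X = 1) < 0.5)."""
    if global_prevalence >= 0.5:
        return 0
    count = 0
    for friends in exposed_friends.values():
        if not friends:
            continue
        frac = sum(labels[f] for f in friends) / len(friends)
        if frac > 0.5:
            count += 1
    return count
```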
Figure 5.5: Log number of friends seen. Graph depicts the log total number of unique friends (and friends-of-friends) seen through tick t. Connections seen are reset at t = 24.
5.4 Limitations & Future Work
One clear limitation of this work is the duration for which we ran the simulations; some of the more complex recommender feeds (and longer feeds) took longer than expected to run. Another limitation is that we do not try more recommender systems or longer personalized feeds. There are advanced recommender systems that would be useful to analyze, like MV-DNN [28] or the stated X/Twitter system [90]; however, adapting such models is difficult, as many require access to richer information about the users and content than we have built here (or, for a production system, access to other microservices or models the system depends on). Other work simulates the RealGraph component of the stated X pipeline, which is worth replicating in this environment [9]. This complication aside, these more complex systems would afford us the ability to compare the results of these ABM studies more directly to real user studies, allowing us to tune the ABM parameters to be more accurate to real user behavior. However, it behooves us to be wary of running ABMs that are too contrived to be useful, as they can be difficult to reproduce and apply elsewhere.
Figure 5.6: Precision@30. Graph depicts the total fraction of liked tweets in the first 30 positions in the feed.
Using Large Language Models as agents interacting in the framework would be interesting future work, considering the preliminary work of Törnberg et al. [86] in this space. This could facilitate the use of more complicated recommender systems, as other ways of generating richer information would again potentially make the ABM too contrived to be useful, as described above.
We would also like to integrate more metrics into the analysis, as there may be confounding factors present in our simulations (e.g., Blocal and G may converge but some other measure might be periodic). Similarly, we would like to extend this analysis beyond binary user labels, as labels can change over time and are often more complicated than simple binaries. We can further analyze the effects of different recommender systems by investigating the experience of the heterogeneous users in the network (e.g., the experience of more central users in terms of degree, or the differential experience of users with the active trait and those without it). In an effort toward reproducibility, we release the simulation scripts.*
5.5 Conclusion
In this chapter we find that different algorithmically ranked feeds determine how much exposure bias a simulated system exhibits over time. We describe an agent-based model and framework for studying the effects of different personalized news feed algorithms in online social networks by measuring how they expose users to their networks. The model and framework are extensible and, given the Repast library's underlying use of MPI, very scalable, contingent upon having access to an MPI-enabled computing resource. We find that while deep learning methods are useful and tend to minimize perception bias in terms of our binary label, they focus on a narrower set of users. We find that a simple greedy algorithm based on network properties can provide relative diversity in attention and can minimize the absolute value of our measure of local perception bias, Blocal.
These findings are important for designing robust recommender systems in online social networks: these systems mediate the information and connections between people, and we should be able to understand what happens as people interact with these dynamic and ubiquitous systems.
*https://github.com/bartleyn/cuddly-octo-couscous
Chapter 6
Conclusions
In this thesis we described the measurable systematic distortion in content visibility and perceived prevalence within a social network. This bias arises from discrepancies between the “potential network” based on social connections, the “activated network” based on user activity, and the “feed-exposed network” based on which edges get observed through the personalized feeds and the recommender systems underpinning them. We showed that different recommender systems impact this “exposure bias” in in vitro simulations of networks and in situ audits of real networks. We showed that the choice of personalized news feed in aggregate can have a differential impact on the representativeness of an opinion or trait in a user's timeline. Finally, we described a greedy algorithm based around network characteristics that can be useful for managing how people get exposed to their network and for mitigating such exposure bias. An important next step in studying users' feeds and the impact recommender systems can have on them is understanding user dynamics in these systems and the different contexts in which users interact with each other. In particular, we would like to incorporate topics and content-based information directly into the system simulations. It would also be interesting to incorporate signed networks and user behavior models that allow for behaviors like selective disclosure.
Chapter 7
Appendix
.1 Notation
Notation Description
Fui Fraction of friendsX=1(u) on day i
Blocal Local Bias
A(v) Attention function
Fti Frac. of tweets by friendsX=1(u) on day i
G Gini coefficient
ρkx Degree-attribute correlation
Mi No. of users with majority illusion on day i
Table 1: Notation.
.1.1 Additional Simulations
Feed 1 Feed 2 P(X = 1) = 0.05 P(X = 1) = 0.50
Blocal G Mi Blocal G Mi
Pop. Wide&Deep -109.88** 20.99** -2.59** -108.12** 20.99** 0.10
Pop. Rand. 22.76** -1.33 3.16** -19.49** 8.17** -9.63**
Pop. RevChron -174.49** -26.44** 1.83 -253.33** -26.44** 7.65**
Pop. NNMF -105.06** -17.60** -61.43** -273.97** -17.60** 3.84**
Pop. NCF -11.14** 5.92** 46.02** -94.87** 8.21** 46.12**
Pop. LogReg -203.31** -44.86** 4.26** -362.17** -44.86** 18.31**
Wide&Deep Pop. 109.88** -20.99** 2.59** 108.12** -20.99** -0.10
Wide&Deep Rand. N/A N/A N/A -3.83** 7.47** -9.55**
Wide&Deep RevChron -181.14** -34.99** 4.22** -524.66** -34.99** 7.65**
Wide&Deep NNMF -97.07** -29.14** -61.11** -451.12 ** -29.14** 3.81**
Wide&Deep NCF -3.66** 0.33 46.65** -88.40 ** 2.49* 46.07**
Wide&Deep LogReg -206.74** -46.03** 6.22** -389.54** -46.03** 18.29**
Rand. Pop. -22.76** 1.33 -3.16** 19.49** -8.17** 9.63**
Rand. Wide&Deep N/A N/A N/A 3.83** -7.47** 9.55**
Rand. RevChron -231.91** 6.64** 0.50 -22.61** -8.95** 12.58**
Rand. NNMF -6.18** -8.70** -60.92** -46.03** -8.70** 12.71**
Rand. NCF N/A N/A N/A -80.98** -3.58** 46.30**
Rand. LogReg 2.00* -8.65** 2.90** -228.61** -10.49** 19.74**
RevChron Pop. 174.49** 26.44** -1.83 253.33** 26.44** -7.65**
RevChron Wide&Deep 181.14** 34.99** -4.22** 524.66** 34.99** -7.65**
RevChron Rand. 231.91** -6.64** -0.50 22.61** 8.95** -12.58**
RevChron NNMF -47.52** 7.23** -61.69** -300.61** 7.23** 0.04
RevChron NCF 10.26** 10.25** 46.20* -73.73** 12.66** 45.96**
RevChron LogReg -207.83** -39.19** 2.95** -378.35** -39.19** 14.88**
NNMF Pop. 105.06** 17.60** 61.43** 273.97** 17.60** -3.84**
NNMF Wide&Deep 97.07** 29.14** 61.11** 451.12** 29.14** -3.81**
NNMF Rand. 6.18** 8.70** 60.92** 46.03** 8.70** -12.71**
NNMF RevChron 47.52** -7.23** 61.69** 300.61** -7.23** -0.04
NNMF NCF 19.74** 8.76** 63.36** -59.33** 11.12** 45.95**
NNMF LogReg -238.94** -41.06** 62.26** -382.25** -41.06 ** 11.09**
NCF Pop. 11.14** -5.92** -46.02** 94.87** -8.21** -46.12**
NCF Wide&Deep 3.66** -0.33 -46.65** 88.40** -2.49* -46.07**
NCF Rand. N/A N/A N/A 80.98** 3.58** -46.30**
NCF RevChron -10.26** -10.25** -46.20** 73.73** -12.66** -45.96**
NCF NNMF -19.74** -8.76** -63.36** 59.33** -11.12** -45.95**
NCF LogReg -120.80** -17.90** -46.27 ** -156.02** -20.46** -45.49**
LogReg Pop. 203.31** 44.86** -4.26** 362.17** 44.86** -18.31**
LogReg Wide&Deep 206.74** 46.03** -6.22** 389.54** 46.03** -18.29**
LogReg Rand. 120.74** 10.49** 53.24** 228.61** 10.49** -19.74**
LogReg RevChron 207.83** 39.19** -2.95** 378.35 ** 39.19** -14.88**
LogReg NNMF 238.94** 41.06** -62.26** 382.25** 41.06** -11.09**
LogReg NCF 120.80** 17.90** 46.27** 156.02** 20.46** 45.49**
Table 2: Empirical Pairwise Significance Tests. We treat the seed users in different feed conditions as paired samples and compute a paired t-test over the mean values of the various metrics. * - p < 0.05, ** - p < 0.01.
Length 1 Length 2 P(X = 1) = 0.05 P(X = 1) = 0.50
Blocal G Mi Blocal G Mi
10 100 224.38** 75.72** -22.98** 379.59** 75.71** 9.16**
10 Full length 216.73** 74.58** -23.65** 340.21** 74.52** 10.29**
100 10 -224.38** -75.72** 22.98** -379.59** -75.71** -9.16**
100 Full length 209.25** 43.41** -2.18* 306.06** 43.38** 1.12
Full length 10 -216.73** -74.58** 23.65** -340.21** -74.52** -10.29**
Full length 100 -209.25** -43.41** 2.18* -306.06** -43.38** -1.12
Table 3: Empirical Session-Length Pairwise Significance Tests. We treat the seed users in different feed
conditions as paired samples and compute a paired t-test over the mean values of the various metrics computed for each session length. * - p < 0.05, ** - p < 0.01
Bibliography
[1] https://martech.org/edgerank-is-dead-facebooks-news-feed-algorithm-now-has-close-to-100kweight-factors/. Accessed: 2024-01-01.
[2] https://www.mixbloom.com/resources/average-time-spent-on-social-media-2022. Accessed:
2024-01-02.
[3] https://github.com/twitter/the-algorithm/blob/main/src/python/twitter/deepbird/projects/
timelines/scripts/models/earlybird/README.md. Accessed: 2024-01-01.
[4] https://help.instagram.com/270447560766967/?helpref=hc_fnav. Accessed: 2024-01-01.
[5] https://blog.youtube/inside-youtube/on-youtubes-recommendation-system/. Accessed: 2024-01-01.
[6] https://help.pinterest.com/en/article/tune-your-home-feed. Accessed: 2024-01-01.
[7] Himan Abdollahpouri and Masoud Mansoury. “Multi-sided exposure bias in recommendation”. In:
arXiv preprint arXiv:2006.15772 (2020).
[8] Nil-Jana Akpinar, Cyrus DiCiccio, Preetam Nandy, and Kinjal Basu. “Long-term dynamics of
fairness intervention in connection recommender systems”. In: Proceedings of the 2022 AAAI/ACM
Conference on AI, Ethics, and Society. 2022, pp. 22–35.
[9] Nil-Jana Akpinar and Sina Fazelpour. “Authenticity and exclusion: social media recommendation
algorithms and the dynamics of belonging in professional networks”. In: arXiv preprint
arXiv:2407.08552 (2024).
[10] Nazanin Alipourfard, Buddhika Nettasinghe, Andrés Abeliuk, Vikram Krishnamurthy, and
Kristina Lerman. “Friendship paradox biases perceptions in directed networks”. In: Nature
communications 11.1 (2020), p. 707.
[11] Anthony B Atkinson et al. “On the measurement of inequality”. In: Journal of economic theory 2.3
(1970), pp. 244–263.
[12] Jang Won Bae, Chun-Hee Lee, Jeong-Woo Lee, and Seon Han Choi. “A data-driven agent-based
simulation of the public bicycle-sharing system in Sejong city”. In: Simulation Modelling Practice
and Theory 130 (2024), p. 102861.
[13] Jack Bandy and Nicholas Diakopoulos. “Curating quality? How Twitter’s timeline algorithm treats
different types of news”. In: Social Media+ Society 7.3 (2021), p. 20563051211041648.
[14] Jack Bandy and Nicholas Diakopoulos. “More accounts, fewer links: How algorithmic curation
impacts media exposure in Twitter timelines”. In: Proceedings of the ACM on Human-Computer
Interaction 5.CSCW1 (2021), pp. 1–28.
[15] Nathan Bartley, Andres Abeliuk, Emilio Ferrara, and Kristina Lerman. “Auditing algorithmic bias
on twitter”. In: Proceedings of the 13th ACM Web Science Conference 2021. 2021, pp. 65–73.
[16] Lex Beattie, Dan Taber, and Henriette Cramer. “Challenges in Translating Research to Practice for
Evaluating Fairness and Bias in Recommendation Systems”. In: Proceedings of the 16th ACM
Conference on Recommender Systems. 2022, pp. 528–530.
[17] William J Brady, Killian L McLoughlin, Mark P Torres, Kara F Luo, Maria Gendron, and
MJ Crockett. “Overperception of moral outrage in online social networks inflates beliefs about
intergroup hostility”. In: Nature human behaviour 7.6 (2023), pp. 917–927.
[18] Robin Burke, Nasim Sonboli, and Aldo Ordonez-Gauger. “Balanced neighborhoods for multi-sided
fairness in recommendation”. In: Conference on fairness, accountability and transparency. PMLR.
2018, pp. 202–214.
[19] Allison JB Chaney, Brandon M Stewart, and Barbara E Engelhardt. “How algorithmic confounding
in recommendation systems increases homogeneity and decreases utility”. In: Proceedings of the
12th ACM conference on recommender systems. 2018, pp. 224–232.
[20] Wen Chen, Diogo Pacheco, Kai-Cheng Yang, and Filippo Menczer. “Neutral bots reveal political
bias on social media”. In: arXiv preprint arXiv:2005.08141 (2020).
[21] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye,
Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. “Wide & deep learning for
recommender systems”. In: Proceedings of the 1st workshop on deep learning for recommender
systems. 2016, pp. 7–10.
[22] Nicholson Collier and Michael North. “Repast HPC: A Platform for Large-Scale Agent-Based
Modeling”. In: Large-Scale Computing (2011), pp. 81–109.
[23] Alexandra Dane and Komal Bhatia. “The social media diet: A scoping review to investigate the
association between social media, body image and eating disorders amongst young people”. In:
PLOS Global Public Health 3.3 (2023), e0001091.
[24] Abhimanyu Das, Sreenivas Gollapudi, and Kamesh Munagala. “Modeling opinion dynamics in
social networks”. In: Proceedings of the 7th ACM international conference on Web search and data
mining. 2014, pp. 403–412.
[25] Fernando Diaz, Bhaskar Mitra, Michael D Ekstrand, Asia J Biega, and Ben Carterette. “Evaluating
stochastic rankings with expected exposure”. In: Proceedings of the 29th ACM international
conference on information & knowledge management. 2020, pp. 275–284.
[26] Tim Donkers and Jürgen Ziegler. “De-sounding echo chambers: Simulation-based analysis of
polarization dynamics in social networks”. In: Online Social Networks and Media 37 (2023),
p. 100275.
[27] Tim Donkers and Jürgen Ziegler. “The dual echo chamber: Modeling social media polarization for
interventional recommending”. In: Proceedings of the 15th ACM Conference on Recommender
Systems. 2021, pp. 12–22.
[28] Ali Mamdouh Elkahky, Yang Song, and Xiaodong He. “A multi-view deep learning approach for
cross domain user modeling in recommendation systems”. In: Proceedings of the 24th international
conference on world wide web. 2015, pp. 278–288.
[29] Ziv Epstein, Hause Lin, Gordon Pennycook, and David Rand. “How many others have shared this?
Experimentally investigating the effects of social cues on engagement, misinformation, and
unpredictability on social media”. In: arXiv preprint arXiv:2207.07562 (2022).
[30] Ziv Epstein, Nathaniel Sirlin, Antonio Arechar, Gordon Pennycook, and David Rand. “The social
media context interferes with truth discernment”. In: Science Advances 9.9 (2023), eabo6169.
[31] Motahhare Eslami, Amirhossein Aleyasen, Karrie Karahalios, Kevin Hamilton, and
Christian Sandvig. “Feedvis: A path for exploring news feed curation algorithms”. In: Proceedings
of the 18th acm conference companion on computer supported cooperative work & social
computing. 2015, pp. 65–68.
[32] Motahhare Eslami, Karrie Karahalios, Christian Sandvig, Kristen Vaccaro, Aimee Rickman,
Kevin Hamilton, and Alex Kirlik. “First I "like" it, then I hide it: Folk Theories of Social Feeds”. In:
Proceedings of the 2016 cHI conference on human factors in computing systems. 2016,
pp. 2371–2382.
[33] https://about.fb.com/news/2007/11/facebook-unveils-facebook-ads/. Accessed: 2024-01-02.
[34] Mirta Galesic, Henrik Olsson, and Jörg Rieskamp. “Social sampling explains apparent biases in
judgments of social environments”. In: Psychological Science 23.12 (2012), pp. 1515–1523.
[35] Sandra González-Bailón, David Lazer, Pablo Barberá, Meiqing Zhang, Hunt Allcott, Taylor Brown,
Adriana Crespo-Tenorio, Deen Freelon, Matthew Gentzkow, Andrew M Guess, et al. “Asymmetric
ideological segregation in exposure to political news on Facebook”. In: Science 381.6656 (2023),
pp. 392–398.
[36] Andrew M Guess, Neil Malhotra, Jennifer Pan, Pablo Barberá, Hunt Allcott, Taylor Brown,
Adriana Crespo-Tenorio, Drew Dimmery, Deen Freelon, Matthew Gentzkow, et al. “How do social
media feed algorithms affect attitudes and behavior in an election campaign?” In: Science 381.6656
(2023), pp. 398–404.
[37] Andrew M Guess, Neil Malhotra, Jennifer Pan, Pablo Barberá, Hunt Allcott, Taylor Brown,
Adriana Crespo-Tenorio, Drew Dimmery, Deen Freelon, Matthew Gentzkow, et al. “Reshares on
social media amplify political news but do not detectably affect beliefs or opinions”. In: Science
381.6656 (2023), pp. 404–408.
[38] Dalin Guo, Sofia Ira Ktena, Pranay Kumar Myana, Ferenc Huszar, Wenzhe Shi, Alykhan Tejani,
Michael Kneier, and Sourav Das. “Deep bayesian bandits: Exploring in online personalized
recommendations”. In: Proceedings of the 14th ACM Conference on Recommender Systems. 2020,
pp. 456–461.
[39] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. “Neural
collaborative filtering”. In: Proceedings of the 26th international conference on world wide web.
2017, pp. 173–182.
[40] Homa Hosseinmardi, Amir Ghasemian, Miguel Rivera-Lanas, Manoel Horta Ribeiro, Robert West,
and Duncan J Watts. “Causally estimating the effect of YouTube’s recommender system using
counterfactual bots”. In: Proceedings of the National Academy of Sciences 121.8 (2024),
e2313377121.
[41] How TikTok recomends videos for you.
https://newsroom.tiktok.com/en-us/how-tiktok-recommends-videos-for-you. Accessed: 2024-01-01.
[42] Eslam Hussein, Prerna Juneja, and Tanushree Mitra. “Measuring misinformation in video search
platforms: An audit study on YouTube”. In: Proceedings of the ACM on Human-Computer
Interaction 4.CSCW1 (2020), pp. 1–27.
[43] Ferenc Huszár, Sofia Ira Ktena, Conor O’Brien, Luca Belli, Andrew Schlaikjer, and Moritz Hardt.
“Algorithmic amplification of politics on Twitter”. In: Proceedings of the National Academy of
Sciences 119.1 (2022), e2025334119.
[44] Diana Jabbour, Jad El Masri, Rashad Nawfal, Diana Malaeb, and Pascale Salameh. “Social media
medical misinformation: impact on mental health and vaccination decision among university
students”. In: Irish Journal of Medical Science (1971-) 192.1 (2023), pp. 291–301.
[45] Prerna Juneja and Tanushree Mitra. “Auditing e-commerce platforms for algorithmically curated
vaccine misinformation”. In: Proceedings of the 2021 chi conference on human factors in computing
systems. 2021, pp. 1–27.
[46] Levi Kaplan and Piotr Sapiezynski. “Comprehensively Auditing the TikTok Mobile App”. In:
Companion Proceedings of the ACM on Web Conference 2024. 2024, pp. 1198–1201.
[47] Rasha Kardosh, Asael Y Sklar, Alon Goldstein, Yoni Pertzov, and Ran R Hassin. “Minority salience
and the overestimation of individuals from minority groups in perception and memory”. In:
Proceedings of the National Academy of Sciences 119.12 (2022), e2116884119.
[48] James A Kitts. “Egocentric bias or information management? Selective disclosure and the social
roots of norm misperception”. In: Social Psychology Quarterly (2003), pp. 222–237.
[49] Diana Koester and Rachel Marcus. “How does social media influence gender norms among
adolescent boys?” In: (2024).
[50] Farshad Kooti, Nathan Hodas, and Kristina Lerman. “Network weirdness: Exploring the origins of
network paradoxes”. In: Proceedings of the International AAAI Conference on Web and Social
Media. Vol. 8. 1. 2014, pp. 266–274.
[51] Juhi Kulshrestha, Motahhare Eslami, Johnnatan Messias, Muhammad Bilal Zafar, Saptarshi Ghosh,
Krishna P Gummadi, and Karrie Karahalios. “Quantifying search bias: Investigating sources of bias
for political searches in social media”. In: Proceedings of the 2017 ACM conference on computer
supported cooperative work and social computing. 2017, pp. 417–432.
[52] Michelle S Lam, Ayush Pandit, Colin H Kalicki, Rachit Gupta, Poonam Sahoo, and Danaë Metaxa.
“Sociotechnical Audits: Broadening the Algorithm Auditing Lens to Investigate Targeted
Advertising”. In: Proceedings of the ACM on Human-Computer Interaction 7.CSCW2 (2023),
pp. 1–37.
[53] Tomo Lazovich, Luca Belli, Aaron Gonzales, Amanda Bower, Uthaipon Tantipongpipat,
Kristian Lum, Ferenc Huszar, and Rumman Chowdhury. “Measuring disparate outcomes of content
recommendation algorithms with distributional inequality metrics”. In: Patterns 3.8 (2022).
[54] Mark Ledwich and Anna Zaitsev. “Algorithmic extremism: Examining YouTube’s rabbit hole of
radicalization”. In: arXiv preprint arXiv:1912.11211 (2019).
[55] Eun Lee, Fariba Karimi, Claudia Wagner, Hang-Hyun Jo, Markus Strohmaier, and Mirta Galesic.
“Homophily and minority-group size explain perception biases in social networks”. In: Nature
human behaviour 3.10 (2019), pp. 1078–1087.
[56] Kristina Lerman, Xiaoran Yan, and Xin-Zeng Wu. “The "majority illusion" in social networks”. In:
PloS one 11.2 (2016), e0147617.
[57] Eli Lucherini, Matthew Sun, Amy Winecoff, and Arvind Narayanan. “T-RECS: A simulation tool to
study the societal impact of recommender systems”. In: arXiv preprint arXiv:2107.08959 (2021).
[58] Alice E Marwick and Danah Boyd. “I tweet honestly, I tweet passionately: Twitter users, context
collapse, and the imagined audience”. In: New media & society 13.1 (2011), pp. 114–133.
[59] Massimiliano Mascherini and Sanna Nivakoski. “Social media use and vaccine hesitancy in the
European Union”. In: Vaccine 40.14 (2022), pp. 2215–2225.
[60] Christopher J McKinley and Yam Limbu. “Promoter or barrier? Assessing how social media predicts
Covid-19 vaccine acceptance and hesitancy: A systematic review of primary series and booster
vaccine investigations”. In: Social Science & Medicine (2023), p. 116378.
[61] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. “A
survey on bias and fairness in machine learning”. In: ACM computing surveys (CSUR) 54.6 (2021),
pp. 1–35.
[62] Arnau Montagud, Miguel Ponce-de-Leon, and Alfonso Valencia. “Systems biology at the giga-scale:
Large multiscale models of complex, heterogeneous multicellular systems”. In: Current Opinion in
Systems Biology 28 (2021), p. 100385.
[63] Isabel Murdock, Kathleen M Carley, and Osman Yagan. “An Agent-Based Model of Reddit
Interactions and Moderation”. In: (2023).
[64] Goran Murić, Alexey Tregubov, Jim Blythe, Andrés Abeliuk, Divya Choudhary, Kristina Lerman,
and Emilio Ferrara. “Large-scale agent-based simulations of online social networks”. In:
Autonomous Agents and Multi-Agent Systems 36.2 (2022), p. 38.
[65] First Draft News. Understanding Information Disorder.
https://firstdraftnews.org/long-form-article/understanding-information-disorder/. Accessed:
2024-08-06. 2023.
[66] Elisabeth Noelle-Neumann. “The spiral of silence a theory of public opinion”. In: Journal of
communication 24.2 (1974), pp. 43–51.
[67] Marie Page. Winning at Facebook Marketing with Zero Budget. Digiterati Academy, 2016.
[68] Federica Pedalino and Anne-Linda Camerini. “Instagram use and body dissatisfaction: the
mediating role of upward social comparison with peers and influencers among young females”. In:
International journal of environmental research and public health 19.3 (2022), p. 1543.
[69] Pedro Ramaciotti Morales and Jean-Philippe Cointet. “Auditing the effect of social network
recommendations on polarization in geometrical ideological spaces”. In: Proceedings of the 15th
ACM Conference on Recommender Systems. 2021, pp. 627–632.
[70] Ranking and Content. https://transparency.fb.com/features/ranking-and-content/. Accessed:
2024-01-01.
[71] Ashwin Rao, Fred Morstatter, Minda Hu, Emily Chen, Keith Burghardt, Emilio Ferrara, and
Kristina Lerman. “Political partisanship and antiscience attitudes in online discussions about
COVID-19: Twitter content analysis”. In: Journal of medical Internet research 23.6 (2021), e26692.
[72] Manoel Horta Ribeiro, Raphael Ottoni, Robert West, Virgílio AF Almeida, and Wagner Meira Jr.
“Auditing radicalization pathways on YouTube”. In: Proceedings of the 2020 conference on
fairness, accountability, and transparency. 2020, pp. 131–141.
[73] Manoel Horta Ribeiro, Veniamin Veselovsky, and Robert West. “The Amplification Paradox in
Recommender Systems”. In: Proceedings of the International AAAI Conference on Web and Social
Media. Vol. 17. 2023, pp. 1138–1142.
[74] Ronald E Robertson, Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson.
“Auditing partisan audience bias within google search”. In: Proceedings of the ACM on
Human-Computer Interaction 2.CSCW (2018), pp. 1–22.
[75] Stephen E Robertson. “The probability ranking principle in IR”. In: Journal of documentation 33.4
(1977), pp. 294–304.
[76] Manuel Gomez Rodriguez, Krishna Gummadi, and Bernhard Schoelkopf. “Quantifying information
overload in social media and its impact on social contagions”. In: Proceedings of the international
AAAI conference on web and social media. Vol. 8. 1. 2014, pp. 170–179.
[77] Christian Sandvig, Kevin Hamilton, Karrie Karahalios, and Cedric Langbort. “Auditing algorithms:
Research methods for detecting discrimination on internet platforms”. In: Data and discrimination:
converting critical concerns into productive inquiry 22.2014 (2014), pp. 4349–4357.
[78] Fernando P Santos, Simon A Levin, and Vítor V Vasconcelos. “Biased perceptions explain collective
action deadlocks and suggest new mechanisms to prompt cooperation”. In: Iscience 24.4 (2021).
[79] Piotr Sapiezynski, Wesley Zeng, Ronald E Robertson, Alan Mislove, and Christo Wilson.
“Quantifying the impact of user attention on fair group representation in ranked lists”. In:
Companion proceedings of the 2019 world wide web conference. 2019, pp. 553–562.
[80] Dongyoung Sohn. “Spiral of silence in the social media era: A simulation approach to the interplay
between social networks and mass media”. In: Communication Research 49.1 (2022), pp. 139–166.
[81] Gregg Sparkman, Nathan Geiger, and Elke U Weber. “Americans experience a false social reality by
underestimating popular climate policy support by nearly half”. In: Nature communications 13.1
(2022), p. 4779.
[82] Larissa Spinelli and Mark Crovella. “How YouTube leads privacy-seeking users away from reliable
information”. In: Adjunct publication of the 28th ACM conference on user modeling, adaptation and
personalization. 2020, pp. 244–251.
[83] Jan-Philipp Stein, Elena Krause, and Peter Ohler. “Every (Insta) Gram counts? Applying cultivation
theory to explore the effects of Instagram on young users’ body image.” In: Psychology of popular
media 10.1 (2021), p. 87.
[84] Ana-Andreea Stoica, Christopher Riederer, and Augustin Chaintreau. “Algorithmic glass ceiling in
social networks: The effects of social recommendations on network diversity”. In: Proceedings of
the 2018 World Wide Web Conference. 2018, pp. 923–932.
[85] Matus Tomlein, Branislav Pecher, Jakub Simko, Ivan Srba, Robert Moro, Elena Stefancova,
Michal Kompan, Andrea Hrckova, Juraj Podrouzek, and Maria Bielikova. “An audit of
misinformation filter bubbles on YouTube: Bubble bursting and recent behavior changes”. In:
Proceedings of the 15th ACM Conference on Recommender Systems. 2021, pp. 1–11.
[86] Petter Törnberg, Diliara Valeeva, Justus Uitermark, and Christopher Bail. “Simulating social media
using large language models to evaluate alternative news feed algorithms”. In: arXiv preprint
arXiv:2310.05984 (2023).
[87] Karan Vombatkere, Sepehr Mousavi, Savvas Zannettou, Franziska Roesner, and
Krishna P Gummadi. “TikTok and the Art of Personalization: Investigating Exploration and
Exploitation on Social Media Feeds”. In: (2024).
[88] Haolun Wu, Bhaskar Mitra, Chen Ma, Fernando Diaz, and Xue Liu. “Joint multisided exposure
fairness for recommendation”. In: Proceedings of the 45th International ACM SIGIR Conference on
research and development in information retrieval. 2022, pp. 703–714.
[89] X. One Year In. https://blog.twitter.com/en_us/topics/company/2023/one-year-in. Accessed:
2024-04-06. 2023.
[90] X. Twitter Github. https://github.com/twitter/the-algorithm-ml. Accessed: 2024-04-06. 2023.
Abstract
Algorithmic systems often mediate our interactions with each other in online social networks. These systems construct personalized news feeds, suggestions for who to follow, what topics to search for, as well as a medley of other recommender systems. While such systems enhance user experience, they also introduce biases that can significantly affect how users see their social environments, impacting how information is disseminated and consumed. This thesis investigates the impact of algorithmic personalization on social media platforms, focusing on exposure bias in personalized news feeds and how users' perceptions of their networks are distorted through rank-ordered feeds. Through simulations and empirical data from popular social media platforms, this research examines the interactions between user behavior, network structure, and algorithmic decision-making. It offers novel insights into how these systems can shape user perceptions and potentially impact network dynamics. The findings highlight the need for more transparent algorithmic design, aiming to understand the balance between personalization benefits and the risks of reinforcing echo chambers and misinformation spread.